Inside Acropolis

A guide to the Research & Education Space for contributors and developers

October 2016 Edition

Edited by Mo McRoberts, BBC Archive Development.

Preface

The Research & Education Space is a platform that has been built and delivered by the BBC. Its aim is to bring as much as possible of the UK’s publicly-held archives, and more besides, to learners and teachers across the UK.

Powering the Research & Education Space is Acropolis, a technical stack which collects, indexes and organises rich structured data about those archive collections published as Linked Open Data (LOD) on the Web. The collected data is organised around the people, places, events, concepts and things related to the items in the archive collections—and, if the archive assets themselves are available in digital form, that data includes the information on how to access them, all in a consistent machine-readable form.

Building on the Research & Education Space, applications can make use of this index, along with the source data itself, to make those collections accessible and meaningful.

This book describes how a collection-holder can publish their data in a form which can be collected and indexed by the platform and used by applications, and how an application developer can make use of the index and interpret the source data in order to present it to end-users in a useful fashion.

This book is also available in PDF format.

An introduction to the Research & Education Space platform

Overview

The Research & Education Space is powered by an open source software stack named Acropolis, which is made of up three main components: a specialised web crawler, Anansi, an aggregator, Spindle, and a public API layer, Quilt. You can read more about the architecture of the stack in the section Under the hood: the architecture of Acropolis.

Anansi’s role is to crawl the web, retrieving permissively-licensed Linked Open Data, and passing it to the aggregator for processing.

Spindle, which is actually implemented as plug-in modules for our data workflow engine—Twine—examines the data, looking for places where the same digital, physical or conceptual entity is described by Linked Open Data in more than one place, particularly where the data explicitly states the equivalence, and aggregates and stores that information in an index.

This subject-oriented index is the very heart of the Resarch & Education Space: by re-arranging published data so that it's organised around the entities described by it, instead of by publisher or data-set, applications are able to rapidly locate all of the information known about a particular entity because it’s collected together in one place.

Quilt is responsible for making the index available to applications, also by publishing it as Linked Open Data. Because the Research & Education Space maintains an index, rather than a complete copy of all data that it finds, applications must consume data both from the index and from the original data sources—and so the outputs from Quilt also conform to the publishing recommendations in this book.

An overview of the Research & Education Space

We will not be directly developing end-user applications as part of the Research & Education Space itself, although sample code and demonstrations have been published (inside-acropolis-how-to-unlock-the-archives, interacting-with-res-in-php and what-does-res-do-with-your-data) to assist software developers in doing so. There is also a “powered by RES” logo.

The Research & Education Space only indexes and publishes data which has been released under terms which permit re-use in both commercial and non-commercial settings, so that all kinds of applications can be developed using the platform.Go to our Blog ( open-data-the-digital-equivalent-of-letting and the-pros-and-cons-of-open-licensing) to find out more on issue of licensing.

For the Research & Education Space to be most useful, holders of publicly-funded archive collections across the UK should publish Linked Open Data describing their collections (including related digital assets, where they exist). Although many collections are already doing so or plan to, the Research & Education Space technical team will be providing tools and advice to collection-holders in order to assist them. Go to our blog (why-publish-modes-records-as-linked-data) to find out how one collection holder has made their data compatible with RES.

Linked Open Data: What is it, and how does it work?

Linked Open Data is a mechanism for publishing structured data on the Web about virtually anything, in a form which can be consistently retrieved and processed by software. The result is a world wide web of data which works in parallel to the web of documents our browsers usually access, transparently using the same protocols and infrastructure.

Where the ordinary web of documents is a means of publishing a page about something intended for a human being to understand, this web of data is a means of publishing data about those things.

Web addresses, URLs and URIs

Uniform Resource Locators (URLs), often known as Web addresses, are a way of unambiguously identifying something which is published electronically. Although there are a variety of kinds of URL, most that you day-to-day see begin with http or https: this is known as the scheme, and defines how the rest of the URL is structured—although most kinds of URL follow a common structure.

The scheme also indicates the communications protocol which should be used to access the resource identified by the URL: if it's http, then the resource is accessible using HTTP—the protocol used by web servers and browsers; if it's https, then it’s accessible using secure HTTP (i.e., HTTP with added encryption).

Following the scheme in a URL is the authority—the domain name of the web site: it’s called the authority because it identifies the entity responsible for defining the meaning and structure of the remainder of the URL. If the URL begins with http://www.bbc.co.uk/, you know that it's defined and managed by the BBC; if it begins with http://www.bfi.org.uk/, you know that it's managed by the BFI, and so on.

After the authority is an optional path (i.e., the location of the document within the context of the particular domain name or authority), and optional query parameters (beginning with a question-mark), and fragment (beginning with a hash-mark).

URLs serve a dual purpose: not only do they provide a name for something, but they also provide anything which understands them with the information they need to retrieve it. Provided your application is able to speak the HTTP protocol, it should in principle be able to retrieve anything using a http URL.

Universal Resource Indicators (URIs) are a superset of URLs, and are in effect a kind of universal identifier: their purpose is to name something, without necessarily indicating how to retrieve it. In fact, it may be that the thing named using a URI cannot possibly be retrieved using a piece of software and an Internet connection because it refers to an abstract concept or a physical object.

URIs follow the same structure as URLs, in that there is a scheme defining how the remainder is structured, and usually some kind of authority, but there are many different schemes, and many of them do not have any particular mechanism defined for how you might retrieve something named using that scheme.

For example, the tag: URI scheme provides a means for anybody to define a name for something in the form of a URI, using a domain name that they control as an authority, but without indicating any particular semantics about the thing being named.

Meanwhile, URIs which begin with urn: are actually part of one of a number of sub-schemes, many of which exist as a means of writing down some existing identifier about something in the form of a URI. For example, an ISBN can be written as a URI by prefixing it with urn:isbn: (for example, urn:isbn:9781899066100).

You might be forgiven for wondering why somebody might want to write an ISBN in the form of a URI, but in fact there are a few reasons. In most systems, ISBNs are effectively opaque alphanumeric strings: although there is usually some validation of the check digit upon data entry, once stored in a database, they are rarely interrogated for any particular meaning. Given this, ISBNs work perfectly well for identifying books for which ISBNs have been issued—but what if you want to store data about other kinds of things, too? Recognising that this was a particular need for retailers, a few years ago ISBNs were made into a subset of Global Trade Information Numbers (GTINs), the system used for barcoding products sold in shops.

By unifying ISBNs and GTINs, retailers were able to use the same field in their database systems for any type of product being sold, whether it was a book with an ISBN, or some other kind of product with a GTIN. All the while, the identifier remained essentially opaque: provided the string of digits and letters scanned by the bar-code reader could be matched to a row in a database, it doesn't matter precisely what those letters and numbers actually are.

Representing identifiers in the form of URIs can be thought of as another level of generalisation: it allows the development of systems where the underlying database doesn’t need to know nor care about the kind of identifier being stored, and so can store information about absolutely anything which can be identified by a URI. In many cases, this doesn’t represent a huge technological shift—those database systems already pay little attention to the structure of the identifier itself.

Hand-in-hand with this generalisation effect is the ability to disambiguate and harmonise without needing to coordinate a variety of different standards bodies across the world. Whereas the integration of ISBNs and GTINs took a particular concerted effort in order to achieve, the integration of ISBNs and URNs was only a matter of defining the URN scheme, because URIs are already designed to be open-ended and extensible.

Linked Open Data URIs are a subset of URIs which, again, begin with http: or https:, but do not necessarily name something which can be retrieved from a web server. Instead, they are URIs where performing resolution results in machine-readable data about the entity being identified.

In summary:

TermUsed for…
URLsIdentifying digital resources and specifying where they can be retrieved from
URIsIdentifying anything, regardless of whether it can be retrieved electronically or not
Linked Open Data URIsIdentifying anything, but in a way which means that descriptive metadata can be retrieved when the URI is resolved

Describing things with triples

Linked Open Data uses the Resource Description Framework (RDF) to convey information about things. RDF is an open-ended system for modelling information about things, which it does by breaking it down into statements (or assertions), each of which consists of a subject, a predicate and an object.

The subject is the thing being described; the predicate is the aspect or attribute of the subject being described; and the object is the description of that particular attribute.

For example, you might want to state that the book with the ISBN 978-1899066100 has the title Acronyms and Synonyms in Medical Imaging. You can break this assertion down into its subject, predicate, and object:

SubjectPredicateObject
ISBN 978-1899066100Has the titleAcronyms and Synonyms in Medical Imaging

Together, this statement made up of a subject, predicate and object is called a triple (because there are three components to it), while a collection of statements is called a graph.

In RDF, the subject and the predicate are expressed as URIs this helps to remove ambiguity and the risk of clashes so that the data can be published and consumed in the same way regardless of where it comes from or who’s processing it. Objects can be expressed as URIs where you want to assert some kind of reference to something else, but can also be literals (such as text, numeric values, dates, and so on).

Predicates and vocabularies

RDF doesn’t specify the meaning of most predicates itself: in other words, RDF doesn’t tell you what URI you should use to indicate “has the title”. Instead, because anybody can create a URI, it’s entirely up to you whether you invent your own vocabulary when you publish your data, or adopt somebody else’s. Generally, of course, if you want other people to be able to understand your data, it’s probably a good idea to adopt existing vocabularies where they exist.

In essence, RDF provides the grammar, while community consensus provides the dictionary.

One of the most commonly-used general-purpose vocabularies is the DCMI Metadata Terms, managed by the Dublin Core Metadata Initiative (DCMI), and which includes a suitable title predicate:

SubjectPredicateObject
ISBN 978-1899066100http://purl.org/dc/terms/titleAcronyms and Synonyms in Medical Imaging

With this triple, a consuming application that understands the DCMI Metadata Terms vocabulary can process that data and understand the predicate to indicate that the item has the title Acronyms and Synonyms in Medical Imaging.

Because http://purl.org/dc/terms/title is quite long-winded, it’s common to write predicate URIs in a compressed form, consisting of a namespace prefix and local name—similar to the xmlns mechanism used in XML documents.

Because people will often use the same prefix to refer to the same namespace URI, it is not unusual to see this short form of URIs used in books and web pages. Some common prefixes and namespace URIs are shown below:

For example, defining the namespace prefix dct with a namespace URI of http://purl.org/dc/terms/, we can write our predicate as dct:title instead of http://purl.org/dc/terms/title. RDF systems re-compose the complete URI by concatenating the prefix URI and the local name.

Subject URIs

In RDF, subjects are also URIs. While in RDF itself there are no particular restrictions upon the kind of URIs you can use (and there are a great many different kinds — those beginning http: and https: that you see on the Web are just two of hundreds), Linked Open Data places some restrictions on subject URIs in order to function. These are:

  1. Subject URIs must begin with http: or https:.
  2. They must be unique: although you can have multiple URIs for the same thing, one URI can’t refer to multiple distinct things at once.
  3. If a Linked Open Data consumer makes an HTTP request for the subject URI, the server should send back RDF data describing that subject.
  4. As with URLs, subject URIs need to be persistent: that is, they should change as little as possible, and where they do change, you need to be able to make arrangements for requests for the old URI to be forwarded to the new one.

In practice, this means that when you decide upon a subject URI, it needs to be within a domain name that you control and can operate a web server for; you need to have a scheme for your subject URIs which distinguishes between things which are represented digitally (and so have ordinary URLs) and things which cannot; you also need to arrange for your web server to actually serve RDF when it’s requested; and finally you need to decide a form for your subject URIs which minimises changes.

This may sound daunting, but it can be quite straightforward—and shares much in common with deciding upon a URL structure for a website that is intended only for ordinary browsers.

For example, if you are the Intergalactic Alliance Library & Museum, whose domain name is ialm.int, you might decide that all of your books’ URIs will begin with http://ialm.int/books/, and use the full 13-digit ISBN, without dashes, as the key. You could pick something other than the ISBN, such as an identifier meaningful only to your own internal systems, but it makes developers’ lives easier if you incorporate well-known identifiers where it’s not problematic to do so.

Because this web of data co-exists with the web of documents, begin by defining the URL to the document about this book:

http://ialm.int/books/9781899066100

Anybody visiting that URL in their browser will be provided with information about the book in your collection. Because the URL incorporates a well-known identifier, the ISBN, if any other pieces of information about the book change or are corrected, that URL remains stable. As a bonus, incorporating the ISBN means that the URL to the document is predictable.

Having defined the URL for book pages, it’s now time to define the rest of the structure. The Intergalactic Alliance Library & Museum web server will be configured to serve web pages to web browsers, and RDF data to RDF consumers: that is, there are multiple representations of the same data. It’s useful, from time to time, to be able to refer to each of these representations with a distinct URL. Let’s say, then, that we’ll use the general form:

http://ialm.int/books/9781899066100.EXT

In this case, EXT refers to the well-known filename extension for the particular type of representation we’re referring to.

Therefore, the HTML web page for our book will have the representation-specific URL of:

http://ialm.int/books/9781899066100.html

If you also published CSV data for your book, it could be given the representation-specific URL of:

http://ialm.int/books/9781899066100.csv

RDF can be expressed in a number of different forms, or serialisations. The most commonly-used serialisation is called Turtle, and typically has the filename extension of ttl. Therefore our Turtle serialisation would have the representation-specific URL of:

http://ialm.int/books/9781899066100.ttl

Now that we have defined the structure of our URLs, we can define the pattern used for the subject URIs themselves. Remember that the URI needs to be dereferenceable—that is, when a consuming application makes a request for it, the server can respond with the appropriate representation.

In order to do this, there are two options: we can use a special kind of redirect, or we can use fragments. The fragment approach works best where you have a document for each individual item, as we do here, and takes advantage of the fact that in the HTTP protocol, any part of a URL following the “#” symbol is never sent to the server.

Thus, let’s say that we’ll distinguish our URLs from our subject URIs by suffixing the subject URIs with #id. The URI for our book therefore becomes:

http://ialm.int/books/9781899066100#id

When an application requests the information about this book, by the time it arrives at our web server, it’s been turned into a request for the very first URL we defined—the generic “document about this book” URL:

http://ialm.int/books/9781899066100

When an application understands RDF and tells the server as much as part of the request, the server can send back the Turtle representation instead of an HTML web page—a part of the HTTP protocol known as content negotiation. Content negotiation allows a server to pick the most appropriate representation for something (where it has multiple representations), based upon the client’s preferences.

With our subject URI pattern defined, we can revisit our original assertion:

SubjectPredicateObject
http://ialm.int/books/9781899066100#iddct:titleAcronyms and Synonyms in Medical Imaging

Defining what something is: classes

One of the few parts of the common vocabulary which is defined by RDF itself is the predicate rdf:type, which specifies the class (or classes) of a subject. Like predicates, classes are defined by vocabularies, and are also expressed as URIs. The classes of a subject are intended to convey what that subject is.

For example, the Bibliographic Ontology, whose namespace URI is http://purl.org/ontology/bibo/ (commonly prefixed as bibo:) defines a class named bibo:Book (whose full URI we can deduce as being http://purl.org/ontology/bibo/Book).

If we write a triple which asserts that our book is a bibo:Book, any consumers which understand the Bibliographic Ontology can interpret our data as referring to a book:

SubjectPredicateObject
http://ialm.int/books/9781899066100#idrdf:typebibo:Book
http://ialm.int/books/9781899066100#iddct:titleAcronyms and Synonyms in Medical Imaging

Describing things defined by other people

There is no technical reason why your subject URIs must only be URIs that you control directly. In Linked Open Data, the matter of trust is a matter for the data consumer: one application might have a white-list of trusted sources, another might have a black-list of sources known to be problematic, another might have more complex heuristics, while another might use your social network such that assertions from your friends are considered more likely to be trustworthy than those from other people.

Describing subjects defined by other people has a practical purpose. Predicates work in a particular direction, and although sometimes vocabularies will define pairs of predicates so that you can make a statement either way around, interpreting this begins to get complicated, and so most vocabularies define predicates only in one direction.

As an example, you might wish to state that a book held in a library is about a subject that you’re describing. On a web page, you’d simply write this down and link to it—perhaps as part of a “Useful resources” section. In Linked Open Data, you can make the assertion that one of the subjects of the other library’s book is the one you’re describing. This works exactly the same way as if you were describing something that you’d defined yourself—you simply write the statement, but somebody else’s URI as the subject.

This can also be used to make life easier for developers and reduce network overhead of applications. In your “Useful resources” section, you probably wouldn’t only list the URL to the page about the book: instead, you’d list the title and perhaps the author as well as linking to the page about the book. You can do that in Linked Open Data, too. Let’s say that we’re expressing the data about a subject—Roman Gaul—which we’ve assigned a URI of http://ialm.int/things/2068003#id:

SubjectPredicateObject
http://ialm.int/things/2068003#iddct:titleRoman Gaul
http://bnb.data.bl.uk/id/resource/006889069rdf:typebibo:Book
http://bnb.data.bl.uk/id/resource/006889069dct:titleAsterix the Gaul
http://bnb.data.bl.uk/id/resource/006889069dct:subjecthttp://ialm.int/things/2068003#id

In this example we’ve defined a subject, called Roman Gaul, of which we’ve provided very little detail, except to say that it’s a subject of the book Asterix the Gaul, whose identifier is defined by the British Library.

Note that we haven‘t described the book Asterix the Gaul in full: RDF operates on an open world principle, which means that sets of assertions are generally interpreted as being incomplete—or rather, only as complete as they need to be. The fact that we haven’t specified an author or publisher of the book doesn’t mean there isn’t one, just that the data isn’t present here; where in RDF you need to state explicitly that something doesn’t exist, there is usually a particular way to do that.

Turtle: the terse triple language

Turtle is one of the most common languages for writing RDF in use today—although there are many others. Turtle is intended to be interpreted and generated by machines first and foremost, but also be readable and writeable by human beings (albeit usually software developers).

In its simplest form, we can just write out our statements, one by one, each separated by a full stop. URIs are written between angle-brackets (< and >), while string literals (such as the names of things) are written between double-quotation marks (").

<http://ialm.int/books/9781899066100#id> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/Book> .
	
<http://ialm.int/books/9781899066100#id> <http://purl.org/dc/terms/title> "Acronyms and Synonyms in Medical Imaging" .

This is quite long-winded, but fortunately Turtle allows us to define and use prefixes just as we have in this book. When we write the short form of a URI, it’s not written between angle-brackets:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

<http://ialm.int/books/9781899066100#id> rdf:type bibo:Book .

<http://ialm.int/books/9781899066100#id> dct:title "Acronyms and Synonyms in Medical Imaging" .

Because Turtle is designed for RDF, and rdf:type is defined by RDF itself, Turtle provides a nice shorthand for the predicate: a. We can simply say that our book is a bibo:Book:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

<http://ialm.int/books/9781899066100#id> a bibo:Book .

<http://ialm.int/books/9781899066100#id> dct:title "Acronyms and Synonyms in Medical Imaging" .

Writing the triples out this way quickly gets repetitive: you don’t want to be writing the subject URI every time, especially not if writing Turtle by hand. If you end a statement with a semi-colon instead of a full-stop, it indicates that what follows is another predicate and object about the same subject:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .

<http://ialm.int/books/9781899066100#id>
	a bibo:Book ;
	dct:title "Acronyms and Synonyms in Medical Imaging" .

Turtle includes a number of capabilities which we haven’t yet discussed here, but are important for fully understanding real-world RDF in general and Turtle documents in particular. These include:

Typed literals

Typed literals: literals which aren’t simply strings of text, but can be of any one of the XML Schema data types.

Literal types are indicated by writing the literal value, followed by two carets, and then the datatype URI: for example, "2013-01-26"^^xsd:date.

Blank nodes

Blank nodes are entities for which some information is provided, but where the subject URI is not known. There are two different ways of using blank nodes in Turtle: a blank node value is one where in place of a URI or a literal value, an entity is partially described.

Another way of using blank nodes is to assign it a private, transient identifier (a blank node identifier), and then use that identifier where you’d normally use a URI as a subject or object. The transient identifier has no meaning outside of the context of the document: it’s simply a way of referring to the same (essentially anonymous) entity in multiple places within the document.

A blank node value is expressed by writing an opening square bracket, followed by the sets of predicates and values for the blank node, followed by a closing square bracket. For example, we can state that an author of the book is a nondescript entity who we know is a person named Nicola Strickland, but for whom we don’t have an identifier:

<http://ialm.int/books/9781899066100#id> dct:creator [
	a foaf:Person ;
	foaf:name "Nicola Strickland" 
] .

Blank node identifiers are written similarly to the compressed form of URIs, except that an underscore is used as the prefix. For example, _:johnsmith. You don’t have to do anything special to create a blank node identifier (simply use it), and the actual name you assign has no meaning outside of the context of the document—if you replace all instances of _:johnsmith with _:zebra, the actual meaning of the document is unchanged—although it may be slightly more confusing to read and write as a human.

Multi-lingual string literals

String literals in the examples given so far are written in no particular language (which may be appropriate in some cases, particularly when expressing people’s names).

The language used for a string literal is indicated by writing the literal value, followed by an at-sign, and then the ISO 639-1 language code, or an ISO 639-1 language code, followed by a hyphen, and a ISO 3166-1 alpha-2 country code.

For example: "Intergalatic Alliance Library & Museum Homepage"@en, or "grey"@en-gb.

Base URIs

By default, the base URI for the terms in a Turtle document is the URI it’s being served from. Occasionally, it can be useful to specify an alternative base URI. To do this, an @base statement can be included (in a similar fashion to @prefix).

For example, if a document specifies @base <http://www.example.com/things/> ., then the URI <12447652#id> within that document can be expanded to <http://www.example.com/things/12447652#id>, while the URI </artefacts/47fb01> would be expanded to <http://www.example.com/artefacts/47fb01>.

An example of a Turtle document making use of some of these capabilities is shown below:

@base <http://ialm.int/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

</books/9781899066100#id>
	a bibo:Book ;
	dct:title "Acronyms and Synonyms in Medical Imaging"@en ;
	dct:issued "1997"^^xsd:gYear ;
	dct:creator _:allison, _:strickland ;
	dct:publisher [
		a foaf:Organization ;
		foaf:name "CRC Press"
	] .

_:strickland
	a foaf:Person ;
	foaf:name "Nicola Strickland" .

_:allison
	a foaf:Person ;
	foaf:name "David J. Allison" .

In this example, we are still describing our book, but we specify that the title is in English (though don’t indicate any particular national variant of English); we state that it was issued (published) in the year 1997, and that it’s publisher—for whom we don’t have an identifier—is an organisation whose name is CRC Press.

From three to four: relaying provenance with quads

While triples are a perfectly servicable mechanism for describing something, they don’t have the ability to tell you where data is from (unless you impose a restriction that you only deal with data where the domain of the subject URI matches that of the server you’re retrieving from). In some systems, including Acropolis, this limitation is overcome by introducing another element: a graph URI, identifying the source of a triple. Thus, instead of triples, the Research & Education Space actually stores quads.

When we assign an explicit URI to a graph in this way, it becomes known as a named graph—that is, a graph with an explicit identifier (name) assigned to it.

Turtle itself doesn’t have a concept of named graphs, but there is an extension to Turtle, named TriG, which includes the capability to specify the URI of a named graph containing a particular set of triples.

Why does the Research & Education Space use RDF?

RDF isn’t necessarily the simplest way of expressing some data about something, and that means it’s often not the first choice for publishers and consumers. Often, an application consuming some data is designed specifically for one particular dataset, and so its interactions are essentially bespoke and comparatively easy to define.

The Research & Education Space, by nature, brings together a large number of different structured datasets, describing lots of different kinds of things, with a need for a wide range of applications to be able to work with those sets in a consistent fashion.

At the time of writing (ten years after its introduction), RDF’s use of URIs as identifiers, common vocabularies and data types, inherent flexibility and well-defined structure means that is the only option for achieving this.

Whether you’re describing an audio clip or the year 1987, a printed book or the concept of a documentary film, RDF provides the ability to express the data you hold in intricate detail, without being beholden to a single central authority to validate the modelling work undertaken by experts in your field.

For application developers, the separation of grammar and vocabularies means that applications can interpret data in as much or as little detail as is useful for the end-users. For instance, you might develop an application which understands a small set of general-purpose metadata terms but which can be used with virtually everything surfaced through the Research & Education Space.

Alternatively, you might develop a specialist application which interprets rich descriptions in a particular domain in order to target specific use-cases. In either case, you don’t need to know who the data comes from, only sufficient understanding of the vocabularies in use to satisfy your needs.

However, because we recognise that publishing and consuming Linked Open Data as an individual publisher or application developer may be unfamiliar territory, and so throughout the lifetime of the project we are committed to publishing documentation, developing tools and operating workshops in order to help developers and publishers work more easily with RDF in general and the Research & Education Space in particular.

Developing applications for the Research & Education Space

The Research & Education Space is designed to be a platform for developers to build applications upon.

While you can browse the data indexed by the platform in a web browser, those web pages aren’t intended for use by ordinary end-users—they’re just an interface to aid debugging and testing.

What you see there is an HTML representation of the Research & Education Space API—a Linked Open Data service which provides information and query facilities about the entities indexed by the platform.

This part of the book is split into several chapters:

An introduction to the Research & Education Space API
Provides an introduction to how applications can interact with the API and how developers can test different kinds of request.
Common API operations
Describes the different kinds of API operations which are available and how to make use of them.
Retrieving and processing Linked Open Data
Describes the process of de-referencing Linked Open Data URIs, both in theory and in practice, including a detailed description of our approach for implementing a robust Linked Open Data client.
Best practices
Our recommendations for what to do—and what not to do—in order to get the best out of the API and avoid some common pitfalls.
Editorial guidelines for product developers
Discusses some of the editorial issues involved in building applications using the Research & Education Space, such as targeting different age-groups.

An introduction to the Research & Education Space API

The basics

As a Linked Open Data service, the API is accessed by making an HTTP GET request to the public endpoint, http://acropolis.org.uk/, including a Accept request header describing the RDF serialisations that your application supports.

This operation is known as de-referencing a URI, which translates as “given a Linked Data URI for something, retrieve data that describes that thing”.

For example, de-referencing the URI for the root of the API retrieves data about the API itself—in particular information about the collections and supported queries provided by it (see Determine browse and query capabilities, below, for further details).

An application preferring to receive Turtle in response, and N-Triples instead if that's available, an application would send a request similar to the below:

GET / HTTP/1.1
Host: acropolis.org.uk
Accept: text/turtle;q=1.0, application/n-triples;q=0.95, application/rdf+xml;q=0.5

For data to be included in the Research & Education Space, it must have been published as Linked Open Data either as Turtle or RDF/XML, and so applications must support parsing at least both of these serialisations—although applications are free to support processing other types of response.

Meanwhile, to retrieve data about the entity which has the URI http://acropolis.org.uk/824ac9e32f8b48b48a54be3c069f95bc#id, the application would make the following request:

GET /824ac9e32f8b48b48a54be3c069f95bc HTTP/1.1
Host: acropolis.org.uk
Accept: text/turtle;q=1.0, application/n-triples;q=0.95, application/rdf+xml;q=0.5

Notice that the fragment portion of the URI—#id—is not included in the HTTP request: this is specified by the HTTP protocol specification itself. When your application processes the response, it should begin with the triples whose subject is the exact URI (including the fragment) of the entity it was attempting to retrieve the data describing. See Retrieving and processing Linked Open Data for a detailed description of the protocol, and Best practices for tips and guidance in avoiding common issues that can arise.

Testing requests on the command-line

The curl command-line utility can be used to test requests to the API (the output below is abbreviated, but you can run the same command yourself to examine the full response). Simply supply an appropriate Accept request header and the URI to dereference. The -v option in the below enables verbose output from curl itself:

$  curl -v -H 'Accept: text/turtle;q=1.0, application/n-triples;q=0.95, application/rdf+xml;q=0.5' http://acropolis.org.uk/
*   Trying 132.185.142.106...
* Connected to acropolis.org.uk (132.185.142.106) port 80 (#0)
> GET / HTTP/1.1
> Host: acropolis.org.uk
> User-Agent: curl/7.49.1
> Accept: text/turtle;q=1.0, application/n-triples;q=0.95, application/rdf+xml;q=0.5
> 
< HTTP/1.1 200 OK
< Server: Apache/2.2.22 (Debian)
< Content-Location: /index.ttl
< Vary: Accept
< Content-Type: text/turtle; charset=utf-8
< Content-Length: 3800
< Accept-Ranges: bytes
< Date: Wed, 05 Oct 2016 05:05:18 GMT
< X-Varnish: 809766273
< Age: 0
< Via: 1.1 varnish
< Connection: keep-alive
< X-Cache: MISS from vm-gw-0.ch.internal
< 
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcmitype: <http://purl.org/dc/dcmitype/> .
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .
@prefix osd: <http://a9.com/-/spec/opensearch/1.1/> .

<http://acropolis.org.uk/>
    osd:Language "cy-gb", "en-gb", "ga-gb", "gd-gb" ;
    osd:template "http://acropolis.org.uk/?class={rdfs:Class?}&for={odrl:Party?}&lang={language?}&limit={count?}&media={dct:DCMIType?}&offset={startIndex?}&q={searchTerms?}&type={dct:IMT?}" ;
    dct:hasFormat <http://acropolis.org.uk/index.ttl> ;
    void:classPartition <http://acropolis.org.uk/agents>, <http://acropolis.org.uk/assets>, <http://acropolis.org.uk/collections>, <http://acropolis.org.uk/concepts>, <http://acropolis.org.uk/events>, <http://acropolis.org.uk/groups>, <http://acropolis.org.uk/people>, <http://acropolis.org.uk/places>, <http://acropolis.org.uk/things>, <http://acropolis.org.uk/works> ;
    void:openSearchDescription <http://acropolis.org.uk/index.osd> ;
    void:rootResource <http://acropolis.org.uk/everything> ;
    void:uriLookupEndpoint <http://acropolis.org.uk/?uri=> ;
    rdfs:seeAlso <http://acropolis.org.uk/audiences> .

Common API operations

Determine the kind of entity that retrieved data describes

Once you have retrieved data from the index about an entity (which might be the index itself), determining the type of the entity allows you to determine how that data should be presented and tailor further user journeys appropriately.

The rdf:type properties in the data describing an entity represent the class or kind of entity being described. The platform will automatically assign very broad classes, but will also convey all of the classes which can be found in the source data.

The example below shows the data held in the index about http://acropolis.org.uk/824ac9e32f8b48b48a54be3c069f95bc#id, which is the URI assigned by the platform for Jabal Karsh, a place in Saudi Arabia. The class geo:SpatialThing is assigned by the platform (because the aggregated data was determined when it was indexed to refer to a place). The class gn:Feature is also present, because one or more descriptions of the place in source data used this class.

<http://acropolis.org.uk/824ac9e32f8b48b48a54be3c069f95bc#id>
	a gn:Feature, geo:SpatialThing ;
	dct:isPartOf <http://acropolis.org.uk/7a811505b8304933b47f9ab6a90b9da8#id> ;
	olo:slot <http://acropolis.org.uk/824ac9e32f8b48b48a54be3c069f95bc#7a811505b8304933b47f9ab6a90b9da8> ;
	gn:locatedIn <http://acropolis.org.uk/e71507eb3c764e4a8a914a141672c512#id> ;
	rdfs:label "Jabal Karsh" ;
	rdfs:seeAlso <http://acropolis.org.uk/7a811505b8304933b47f9ab6a90b9da8#id> ;
	geo:lat 26.1 ;
	geo:long 39.65 .

Once your application has determined which of the classes applies to the entity you’ve retrieved data about, it can determine how to proceed with processing the data further and presenting it. For example, when something is a place—as in the example shown above—it can look for geographical co-ordinates and show them on a map. Meanwhile, a dataset might be presented as a listing with search and refinement options.

Determine browse and query capabilities

If the data you have retrieved describes a dataset (either the root of the API itself, a partition, some related dataset, or a set of results), then it will include properties which describe the available browse and query capabilities.

Applications can use the presence (or absence) of these different properties in order to determine the facilities which should be offered to users. For example, where a dataset description includes OpenSearch metadata (see below), the application can use that metadata provide to present search fields to the user.

Class partitions

These are subsets of the index organised by the RDF class of the entities within them, linked to via void:classPartition.

For example, at the root there is a /places partition which is an index of all of the entities which have the geo:SpatialThing class:

</> void:classPartition </places> .

</places>
	a void:Dataset ;
	rdfs:label "Places"@en ;
	void:class geo:SpatialThing .

OpenSearch metadata

OpenSearch is a mechanism for conveying machine-readable data about query capabilities.

At the root of the index, the Research & Education Space API uses the void:openSearchDescription predicate to link to an OpenSearch XML document which describes the search options available across the whole index.

Perhaps more usefully, some of the key OpenSearch terms (specifically, osd:template and osd:language) are used as predicates against each individual dataset, ensuring that applications have the data needed to offer scoped searches to users.

The example below shows the OpenSearch properties provided at the root:

</> void:openSearchDescription </index.osd> ;
	osd:language "en-gb", "ga-gb", "cy-gb", "gd-gb" ;
	osd:template "http://acropolis.org.uk/?class={rdfs:Class?}&for={odrl:Party?}&lang={language?}&limit={count?}&media={dct:DCMIType?}&offset={startIndex?}&q={searchTerms?}&type={dct:IMT?}" .

Note that the language, count, startIndex and searchTerms placeholders are defined by OpenSearch itself, while other placeholders (such as dct:DCMIType) are RDF classes, where the value if provided is intended to be the URI of an entity which has that class. For example, the value of the media parameter should be the URI of a well-known instance of the dct:DCMIType class.

Ordered results

When your application retrieves RDF data, it can be any order—a particular triple appearing near the start of some serialised data doesn't indicate anything about the prominence of that triple.

For query results in particular, ordering can be important—the difference in relevance between the first and tenth results can be quite significant.

In order to convey an ordered result-set, the API generates additional triples using the Ordered List Ontology to define slots:

<http://acropolis.org.uk/places?q=paris>
    osd:Language "cy-gb", "en-gb", "ga-gb", "gd-gb" ;
    osd:template "http://acropolis.org.uk/places?for={odrl:Party?}&lang={language?}&limit={count?}&media={dct:DCMIType?}&offset={startIndex?}&q={searchTerms?}&type={dct:IMT?}" ;
    dct:hasFormat <http://acropolis.org.uk/places.ttl?q=paris> ;
    dct:isPartOf <http://acropolis.org.uk/places> ;
    olo:slot <http://acropolis.org.uk/places.ttl?q=paris#427cf01d1215477489dee6091de16e65>, <http://acropolis.org.uk/places.ttl?q=paris#735793ca9cd845e19ae32627e79333db> ;
    a void:Dataset ;
    rdfs:label "Places containing \"paris\""@en-gb ;
    rdfs:seeAlso <http://acropolis.org.uk/427cf01d1215477489dee6091de16e65#id>, <http://acropolis.org.uk/735793ca9cd845e19ae32627e79333db#id> .

<http://acropolis.org.uk/places.ttl?q=paris#427cf01d1215477489dee6091de16e65>
    olo:index 1 ;
    olo:item <http://acropolis.org.uk/427cf01d1215477489dee6091de16e65#id> ;
    a olo:Slot ;
    rdfs:label "Result #1"@en-gb .

<http://acropolis.org.uk/427cf01d1215477489dee6091de16e65#id>
    a gn:Feature, geo:SpatialThing, skos:Concept ;
    rdfs:label "75873"@post, "Lutetia Parisorum"@la, "PAR"@iata, "Paarys"@gv, "Pari"@ln, "Paries"@li, "Pariggi"@scn, "Parigi"@it, "Pariis"@et, "Pariisi"@fi, "Parijs"@nl, "Paris", "Paris"@als, "Paris"@cy, "Paris"@da, "Paris"@de, "Paris"@eu, "Paris"@fr, "Paris"@gd, "Paris"@ia, "Paris"@id, "Paris"@io, "Paris"@ksh, "Paris"@kw, "Paris"@lad, "Paris"@lmo, "Paris"@mg, "Paris"@ms, "Paris"@na, "Paris"@nds, "Paris"@nn, "Paris"@no, "Paris"@pms, "Paris"@pt, "Paris"@ro, "Paris"@sco, "Paris"@sv, "Paris"@sw, "Paris"@tl, "Paris"@tr, "Paris"@ty, "Paris"@vi, "Paris"@vls, "Paris (France)"@en, "Parisi"@sq, "Pariz"@br, "Pariz"@hr, "Pariz"@sl, "Parizo"@eo, "Parys"@af, "Parys"@fy, "Paryż"@pl, "Paryžius"@lt, "Paräis"@lb, "París"@an, "París"@ast, "París"@ca, "París"@es, "París"@is, "París"@oc, "París"@tet, "París - Paris"@gl, "Paríž"@sk, "Parîs"@ku, "Parīze"@lv, "Paříž"@cs, "Páras"@ga, "Párizs"@hu, "http://ru.wikipedia.org/wiki/%D0%9F%D0%B0%D1%80%D0%B8%D0%B6"@link, "Παρίσι"@el, "Париж"@bg, "Париж"@os, "Париж"@ru, "Париж"@tg, "Париж"@uk, "Париз"@mk, "Париз"@sr, "Парыж"@be, "Փարիզ"@hy, "פריז"@he, "باريس"@ar, "پارىژ"@ug, "پاریس"@fa, "پیرس"@ur, "ܦܪܝܣ"@arc, "पॅरिस"@mr, "பாரிஸ்"@ta, "ಪ್ಯಾರಿಸ್"@kn, "ปารีส"@th, "პარიზი"@ka, "ፓሪስ"@am, "パリ"@ja, "巴黎"@zh, "파리 시"@ko ;
    geo:lat 48.85341 ;
    geo:long 2.3488 .

<http://acropolis.org.uk/places.ttl?q=paris#735793ca9cd845e19ae32627e79333db>
    olo:index 2 ;
    olo:item <http://acropolis.org.uk/735793ca9cd845e19ae32627e79333db#id> ;
    a olo:Slot ;
    rdfs:label "Result #2"@en-gb .

<http://acropolis.org.uk/735793ca9cd845e19ae32627e79333db#id>
    a gn:Feature, geo:SpatialThing, skos:Concept ;
    rdfs:label "75460"@post, "Paris", "Paris (Texas)"@en, "http://en.wikipedia.org/wiki/Paris%2C_Texas"@link, "Париж"@ru ;
    geo:lat 33.660939 ;
    geo:long -95.555513 .

Note that the statements in the above example have been re-ordered for clarity, but an RDF processor should not care.

In this example, the API has returned two results for a query of places matching the free-text term “paris”. To process this data, an application would first locate the triples whose subject the URI being dereferenced (in this instance, http://acropolis.org.uk/places?q=paris), then look for olo:slot predicates, and assemble a list which can be sorted from the objects of those slots. For each slot, the olo:index property specifies the numeric sort order (where lower numbers are more prominent), and olo:item connects the slot to the actual result entry.

Therefore, we can see that the first result is “Paris (France)” (http://acropolis.org.uk/427cf01d1215477489dee6091de16e65#id), while the second is “Paris (Texas)” (http://acropolis.org.uk/735793ca9cd845e19ae32627e79333db#id).

Pagination

It will often be the case that datasets and result-sets will be paginated when they are returned from the API.

In order to traverse the paginated results, applications can use the xhtml:prev and xhtml:next properties where present, which link to the previous and next page of results respectively.

<http://acropolis.org.uk/?q=hamlet>
	xhtml:next <http://acropolis.org.uk/?offset=25&q=hamlet> .

Locate the source data for an entity

Because the platform only provides a high-level index with minimal cached source data, applications will often need to fetch the original data that was indexed by the platform for further detail on the entities that have been indexed.

Source data is always available in one of the two RDF serialisations that all RES applications must support: Turtle, and RDF/XML. Publishers may also make data available in other RDF serialisations (and other formats altogether), and applications can use HTTP Content Negotiation to obtain data in a preferred format supported by both the client and server.

The index entry for an entity contains owl:sameAs triples which link the URIs of the entity as used in source data to the URI assigned by the Research & Education Space platform. Therefore, an application can examine all of the triples which match this pattern to obtain a list of source URIs.

The example below shows some of the data held in the index about Shakespeare’s Julius Ceaser, which has the URI http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id assigned to it by the platform:

<http://dbpedia.org/resource/Julius_Caesar_%28play%29>
owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id> .

<http://dbpedia.org/resource/Julius_Caesar_(play)>
owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id>, <http://www.dbpedialite.org/things/57328#id> .

<http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000006df71>
owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id> .

<http://rdf.freebase.com/ns/m.0frwk>
owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id> .

<http://www.dbpedialite.org/things/57328#id>
owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id>, <http://dbpedia.org/resource/Julius_Caesar_%28play%29>, <http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000006df71>, <http://rdf.freebase.com/ns/m.0frwk>, <http://www.wikidata.org/entity/Q215750> .

<http://www.wikidata.org/entity/Q215750>
owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id>, <http://dbpedia.org/resource/Julius_Caesar_(play)> .

In this example, there are a number of owl:sameAs triples, some of which were expressed in the source data indexed by the platform. To obtain a list of the URIs for the entity as used in the source data, we can filter the triples for those matching the following pattern:—

?uri owl:sameAs <http://acropolis.org.uk/5878ea99754a44cbb682c3e7b9d9c266#id> .

Applying this filter results in the following list:—

URI
http://dbpedia.org/resource/Julius_Caesar_%28play%29
http://dbpedia.org/resource/Julius_Caesar_(play)
http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000006df71
http://rdf.freebase.com/ns/m.0frwk
http://www.dbpedialite.org/things/57328#id
http://www.wikidata.org/entity/Q215750

Find the index entry for an entity given its URI

The URI look-up capability is available when a dataset (including the root) has the void:uriLookupEndpoint property. The value of this property is the URL to which an encoded URI to look up should be appended. For example:

<http://acropolis.org.uk/>
	void:uriLookupEndpoint <http://acropolis.org.uk/?uri=> .

Given the above triple found at the root, an application can locate the entry corresponding to a well-known URI by URL-encoding it and appending it to the look-up endpoint. The API will then return either a redirect response, or a 404 (“Not found”):

$  curl -v -H 'Accept: text/turtle;q=1.0, application/n-triples;q=0.95, application/rdf+xml;q=0.5' http://acropolis.org.uk/?uri=http://dbpedia.org/resource/France
*   Trying 132.185.142.106...
* Connected to acropolis.org.uk (132.185.142.106) port 80 (#0)
> GET /?uri=http://dbpedia.org/resource/France HTTP/1.1
> Host: acropolis.org.uk
> User-Agent: curl/7.49.1
> Accept: text/turtle;q=1.0, application/n-triples;q=0.95, application/rdf+xml;q=0.5
> 
< HTTP/1.1 303 See other
< Server: Apache/2.2.22 (Debian)
< Location: /323248ebe45e47f1a7041c482df3950f#id
< Content-Length: 0
< Accept-Ranges: bytes
< Date: Wed, 05 Oct 2016 07:29:46 GMT
< X-Varnish: 809858881
< Age: 0
< Via: 1.1 varnish
< Connection: keep-alive
< X-Cache: MISS from vm-gw-0.ch.internal
< 
* Connection #0 to host acropolis.org.uk left intact

In the above example, the redirect is to http://acropolis.org.uk/427cf01d1215477489dee6091de16e65#id, which the application can then dereference.

Iterate a partition or result-set

Iteration of a set of results is accomplished by examining the olo:slot ordered result slots if present (or following rdfs:seeAlso properties if not), and using the xhtml:next property to traverse paginated results.

Discover known audience group URIs

The for query parameter is used to specify the URI of an audience (user) group who is able to access a particular source of media which is not available to the general public but is indexed by the platform and available to some users. By default, queries performed using the API will only take into account media which is accessible to all—providing the for parameter enables additional media to be included.

It's important to note that this is not an access-control mechanism: users must still authenticate themselves (using whatever means they ordinarily would) to the media source in order to view or play it, but the query filtering prevents media sources from being shown to users who can’t already access them.

The Research & Education Space was designed so that applications can be generally agnostic to different sources of media, even including presenting media which is available to some users but not the developers of the application itself!

In order to include the correct audience URIs in queries, applications must provide a mechanism by which either end-users or administrators can select the audience groups that a user is a member of so that the correct query parameters can be sent to the API. To enable this, the platform provides an “Audiences” dataset which publishes information about the known user groups. An application can then use this dataset to populate a selection interface.

Discovering the audiences dataset is accomplished by retrieving the data describing the API root and examining the targets of rdfs:seeAlso properties. The audiences dataset, if present, will be described in the returned data as a void:Dataset whose void:Class is odrl:Party.

Further details about the data which is used to generate the audience group index can be found in the Conditionally-accessible resources section.

Retrieving and processing Linked Open Data

In an ideal world, consuming Linked Open Data is as straightforward as:—

  • Make a request for the URI you want to get data about, sending an Accept HTTP request header containing the MIME types of the formats you support in your application.
  • Parse the data in the response using an RDF parser.
  • Examine the parsed data to find triples whose subject is the URI that you started with.

While this process is simple in principle, and could be implemented using virtually any HTTP client in common use today, it brings about a few questions. How do you deal with redirects? What happens if the server doesn't return the data in the format that you asked for? Where do you start?

This chapter aims to answer these questions so that your application can be both useful and robust in face of these kinds of real-world challenge.

Consuming Linked Open Data in practice

As part of the Research & Education Space programme, we are developing a Linked Open Data client library. Although at present this library is currently only available to low-level languages such as C and C++, the process it follows can be implemented in any language. It is intended to be a liberal consumer which can deal with real situations, such as different kinds of redirects and content negotiation failing or being disabled by the publisher.

The algorithm is as follows (implemented in the LOD library in fetch.c):—

  1. Optionally, check if data about the request-URI is present in our RDF model: if so, return a reference to it.
  2. Append request-URI to subject-list.
  3. If request-URI has a fragment, remove it and store it as fragment.
  4. Set followed-link to false, and count to 0.
  5. If count is more than our configured max-redirects value, return an error status indicating that the redirect limit has been exceeded.
  6. Create an HTTP request for request-URI, setting the Accept header based upon the data formats supported by the application. Note that the Research & Education Space requires that both publishers and applications to support at least RDF/XML (application/rdf+xml) and Turtle (text/turtle), but both clients and servers may support other formats which can be negotiated.
  7. Perform the HTTP request. Note that this should be a single request-response pair, and not automatically follow redirects.
  8. If a low-level error in performing the request occurred (such as the hostname in the URI not being resolveable), return an error status indicating that the request could not be performed.
  9. Store the canonicalised form of request-URI as the base.
  10. Obtain the Content-Type of the response, if any, and store it in content-type.
  11. If the HTTP status code is between 200 and 299 and there is a document body:—

    1. If content-type is not set, return an error status indicating that no suitable data could be found.

      If the Content-Type is not one of text/html, application/xhtml+xml, application/vnd.wap.xhtml+xml, application/vnd.ctv.xhtml+xml or application/vnd.hbbtv.xhtml+xml, then skip to step 14.

    2. If followed-link is true, return an error status indicating that a <link rel="alternate"> has already been followed.
    3. Parse the returned document as HTML, and extract any <link> elements within <head> which have a type and href attributes and a rel attribute with a value of alternate.
    4. If no suitable <link> elements were found, return an error status indicating that no suitable data could be found.
    5. Rank the returned links based upon the application's weighting values (allowing an application to consume a particular serialisation if available in preference to others).
    6. Append the highest ranked link’s URI (that is, the value of the href attribute) to subject-list, set request-URI to it, set followed-link to true, increment count, and skip back to step 5.
  12. If the HTTP status code is between 300 and 399:—

    1. Set target-URI to the redirect target (the Location header of the HTTP response). If no target is available, return with an error status indicating that an unsuitable HTTP status was returned.
    2. If the HTTP status code is 303, set request-URI to target-URI, increment count and skip back to step 5.
    3. If fragment is set, append it to target-URI, replacing any fragment which might be present already.
    4. Push target-URI onto subject-list, increment count, and skip back to step 5.
  13. If the HTTP code is not between 200 and 399, return an error status indicating that an HTTP error was returned by the server.
  14. Optionally, if content-type is text/plain, application/octet-stream or application/x-unknown, attempt to determine a new content type via content sniffing. If successful, store the new type in content-type.
  15. Parse the document body as content-type into our RDF model. If the type is not supported, or parsing fails for any other reason, return an error status.
  16. Starting with the first item in subject-list:—

    1. Set subject-URI to the current entry in the list.
    2. Perform a query against the RDF model to determine whether any triples whose subject are subject-URI exist.
    3. If triples were found, return a reference to them.
    4. Otherwise, move to the next item in subject-list.
  17. Finally, return an error status indicating that no triples were found in the retrieved data.

A starting point: the Research & Education Space index

Just as an ordinary web browser needs a homepage or an address bar, so too do Linked Open Data applications. Whether your application has a fixed configured starting point or is intended to be an open-ended “data browser”, the Research & Education Space index is intended to be a useful Linked Open Data “homepage” for many applications.

Described in more detail in The Research & Education Space API: the index and how it’s structured, the index is itself Linked Open Data which can be retrieved and processed using the algorithms described above. The URI for the index is currently http://acropolis.org.uk, and this URI can be used as default “homepage” for RES applications.

In the same way that a homepage or search engine only provides the starting point for a user of a web browser, the same is true of the Research & Education Space index: applications can allow users to explore and search the index, but to also follow the onward links to source data and media assets.

For some applications, use the Research & Education Space index as a starting point won’t be appropriate: it may be necessary or useful to implement an intermediary service that provides additional capabilities or a specific curated subset of resources. There is no requirement that RES applications must directly use the base of the RES index as their home.

Best practices

Be careful what you accept

Take care not to include * in your HTTP Accept header unless your application really can deal intelligently with responses of any type. Similarly, don't explicitly include types which your application can’t process properly.

Implement your own middleware

However you implement your application (e.g., Javascript in a web browser, a mobile app, etc.), avoid having end-users connecting directly to the Research & Education Space API: instead, provide a middleware layer that you maintain and control instead. This has several benefits:

  • It allows you to implement caching which insulates your application (and your users) from any performance or availability issues which might arise either in the platform, or any of the external data sources.
  • It provides you with an opportunity to augment or filter results before returning them to your application.
  • It allows you to transform data in a format which is best suited to your application (for example, you might arrange things such that only your middleware needs to understand Turtle and RDF/XML, while the data is delivered to your application as JSON regardless of its source format).
  • It allows you to fetch source data where needed, without relying on widespread implementation of CORS across all sources of data.

Tailor behaviour to the data

Applications should hard-code as little behaviour as possible to particular URIs or patterns: instead, behaviours should generally be triggered by the nature of the data retrieved.

Follow your nose

API responses will often include data that your application doesn’t need, and should ignore. Rather than attempting to iterate all of the data in a response, treat it as a tree that you traverse, starting with the data whose subject is the URI that you wanted to dereference in the first place.

Use the source

While the Research & Education Space aims for correctness, it does not aim for completeness, and can guarantee neither. Applications should be prepared to retrieve the source data referenced by entries in the index and process it directly to obtain further details.

Editorial guidelines for product developers

What do we mean by “editorial”?

In this context we mean what is in the data and the associated media, such as text, video or images.

  • What does it say and what is it about?
  • Is it suitable for all ages to see and hear?
  • Are there any limits you would want to set around who could see this material?

When making data and media available in to education, it is important to understand the expectations of the audience in terms of what they will see and hear.

These guidelines are intended to help product developers think about these issues as early in the design and development process as possible.

The Research & Education Space platform is funded with public money and needs to show that it is serving the public interest and behaving responsibly.

  • The Research & Education Space programme envisages that in schools and FE colleges it will be teachers who are the primary users of the products built on top of the Resarch & Education Space platform, both the index and the assets.
  • Teachers will then judge the suitability of the content for particular age ranges and make it available to pupils.
  • The pupils and students will therefore be the secondary users of any products, accessing a moderated version of the whole platform.
  • Teachers will need to share material with pupils and other teachers and this functionality will be vital.
  • Where possible the metadata will include any guidance as to the suitability of the content for particular age groups, for example the BBC might include guidance warning metadata.
  • How this will be displayed to teachers is an important consideration in the design of products and services.
  • However where no such information is available, it needs to be clear that this does not mean that the material is necessarily suitable for all ages (so perhaps a “no age range given” note is appropriate?)
  • The Research & Education Space programme will provide teachers with guidelines about the range of material available in the platform and hints on how to navigate and mediate such a large volume of metadata and media.
  • Teachers will also form their own view of what material is suitable for whom, and their ability to add that information to the metadata and share it is important.
  • Every product or service built on the platform must have a means of feeding back any concerns about aspects of the assets or the metadata to the provider of the catalogue and assets.

Publishing data for the Research & Education Space

Publishers wishing to make their data visible in the Acropolis index and useable by RES applications must conform to a small set of basic requirements. These are:

Although the Research & Education Space requires that you publish Linked Open Data, that doesn’t mean you can’t also publish your data in other ways. While human-facing HTML pages are the obvious example, there’s nothing about publishing Linked Open Data which means you can’t also publish JSON with a bespoke schema, CSV, OpenOffice.org spreadsheets, or operate complex query APIs requiring registration to use.

In fact, best practice generally is that you publish in as many formats as you’re able to, and do so in a consistent fashion. And, while your “data views” (that is, the structured machine-readable representations of your data about things) are going to be very dull and uninteresting to most human beings, that doesn’t mean that you can’t serve nicely-designed web pages about them as the serialisation for ordinary web browsers.

Checklist for data publication

Support the most common RDF serialisations

RDF can be serialised in a number of different ways, but there are two serialisations which RES publishers must provide one of because these are the two serialisations guaranteed to be supported by RES applications:

NameMedia typeFurther information
Turtletext/turtlehttp://www.w3.org/TR/2014/REC-turtle-20140225/
RDF/XMLapplication/rdf+xmlhttp://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/

Turtle is increasingly the most common RDF serialisation in circulation and is very widely-supported by processing tools and libraries.

RDF/XML is an older serialisation which is slightly more well-supported than Turtle. RDF/XML is often more verbose than the equivalent Turtle expression of a graph, but as an XML-based format can be generated automatically from other kinds of XML using XSLT.

If you are considering publishing your data as JSON, you may consider publishing it as JSON-LD, a serialisation of RDF which is intended to be useful to consumers which don’t understand RDF specifically. JSON-LD isn’t currently supported by RES, but may be in the future.

Use terms from existing common vocabularies where applicable

The goal of Linked Open Data in particular, and as a consequence of the Research & Education Space, is to create a web of data which applications can navigate seamlessly. Simply using common underlying protocols and data serialisations on their own is not enough to achieve this: both publishers and applications need to support the same set of terms for classes and properties (i.e., use the same vocabularies) where it’s feasible to do so.

This doesn’t mean that where there isn’t a widely-used term to support something from your data that you shouldn’t publish it, rather that you should only invent your own vocabularies of terms where existing well-known terms don’t fit.

The second part of this book, Working with the platform, describes the different vocabularies in widespread use for different classes of entity (physical, digital, real, or imaginary) that you might need to describe. Where some of the terms have specific meaning to the platform, these are noted in the relevant sections and in the class and predicate indices.

Describe the document and serialisations as well as the item

A minimal RDF serialisation intended for use by RES must include data about three distinct subjects:

SubjectExample
Document URLhttp://ialm.int/books/9781899066100
Representation URLhttp://ialm.int/books/9781899066100.ttl
Item URIhttp://ialm.int/books/9781899066100#id

It is recommended that publishers describe any other serialisations which they are making available as well, but it is not currently a requirement to do so.

A description of the metadata which should be served about the document and representations is included in the Metadata about documents section.

Include licensing information in the data

The data about the representation must include a rights information triple referring to the well-known URI of a supported license. The rights and licensing section describes a simple approach for doing this.

Perform content negotiation when requests are received for item URIs

If you use fragment-based URIs, this means that your web server must be configured to perform content negotiation on requests received for the portion of the URI before the hash (#) sign.

For example, if your subject URIs are in the form:

http://ialm.int/books/9781899066100#id

Then when your server receives requests for the document:

/books/9781899066100

It should perform content negotiation and return an appropriate media type, including the supported RDF serialisations if requested.

When sending a response, the server must send an appropriate Vary header, and should send a Content-Location header referring to the representation being served. For example:

HTTP/1.0 OK
Server: Apache/2.2 (Unix)
Vary: Accept
Content-Type: text/turtle; charset=utf-8
Content-Location: /books/9781899066100.ttl
Content-Length: 272
…

Publishing digital media

The Research & Education Space will not directly consume or publish digital media (audio, video, images, documents) itself. However, it will aggregate data about digital media which has been published in a form which can be used consistently by applications built on the platform.

This chapter describes how those media assets can be published in ways which will be most useful to applications, while balancing the range of access mechanisms and rights restrictions applicable to users in educational settings.

While this chapter provides guidance on publishing media assets themselves, those assets only become useful to within the Research & Education Space when they are properly described in accompanying metadata. For more information on publishing data which describes digital media assets, please refer to the chapter Describing digital assets.

Approaches to publication

There are three strategies for publishing media for the Research & Education Space: publishing “raw” media assets, providing embeddable players, and publishing pages which include playback capabilities.

Publishing media directly

Publishing media directly is most suited to situations where the media assets are openly-licensed and can be both downloaded and streamed by applications. It is not suitable for media which is rights-restricted to the extent that downloads are not permitted.

Direct publishing allows an application to make use of native playback, viewing, editing, and tagging capabilities, and consequentially offers the greatest level of flexibility to applications and users alike. While it provides no technical barrier to end-users sharing downloaded media (in whole or part on its own, or combined into a larger composition), it does not automatically imply that sharing is permitted.

While affording the greatest level of flexibility to the consuming application, publishing media in this way is also the simplest from a technical perspective: the encoded media files are simply uploaded to a web server and then described in the accompanying metadata.

Use direct publication where:—

  • Licensing allows both streaming and download of the media asset.
  • If you want to allow snipping or other kinds of editing of the media.
  • You want to provide the widest possible range of device support.

For example:—

Embeddable players

Embeddable players are best suited to situations where media files should not be downloaded by applications and end-users, but the playback capability may be provided in-line with other content by an application.

With an embeddable player, although media assets themselves are published in some fashion, the resource described in accompanying metadata is a web page capable of playing them, typically via an <iframe> or equivalent, with the metadata including the preferred dimensions of the frame.

This approach limits the capabilities which can be offered by the application to its users: as far as the application is concerned, the contents of the framed web page are completely opaque; it can only assume that the page will provide a suitable player for the media asset, and will have no control over playback.

Use an embeddable player where:—

  • Licensing only permits streaming of the asset, but does allow its presentation as part of a larger body of content (for example, within in a MOOC).
  • Media is only available through a technology which may not be widely supported except through a custom player.
  • Your media is published through a third party solution which does not provide ready access to direct media asset URLs.
  • As a fall-back option alongside a direct media link (for example, to enable an application to generate the embeddable player code snippet for pasting into a MOOC or social network).

For example:—

PropertyValue
Media asset URL//player.vimeo.com/video/110040373
MIME typetext/html
Embeddable?Yes
Poster image URL//i.vimeocdn.com/video/494149068_960.jpg
Preferred width500px
Preferred height281px
TitleMount Piños Astrophotography Time Lapse
Duration45s
LicenseCreative Commons 3.0 Unported (CC BY 3.0)

Stand-alone playback pages

Stand-alone playback pages provide the least flexibility to applications, and—depending upon presentation—may result in reduced visibility of your media.

With this strategy, an application is not able to embed your media at all, but instead must navigate to the page that you provide in a browser window. The application might provide a thumbnail or text link to your playback page, or it might choose to omit the media altogether if including it would result in a poor user experience.

Use a stand-alone playback page where:—

  • Licensing restrictions mean that you’re not able to authorise any kind of embedding.
  • As a fall-back option alongside an embeddable player or direct media links (particularly if you already publish a playback page for each media asset).

For example:—

PropertyValue
Media asset URLhttp://www.bbc.co.uk/iplayer/episode/p0285z2y/horizon-19811982-the-race-to-ruin
TitleHorizon: 1981-1982: The Race to Ruin
Embeddable?No
Duration48m52s
Geographical restrictionUK-only

Access control and media availability

A key aim of the Research & Education Space is to increase the visibility of and access to digital media resources which are available to staff and students of educational establishments within the United Kingdom. While this naturally includes the wealth of resources which are openly-licensed and available to everybody, it also includes digital media which can only be accessed at scale by UK educational users.

In order to provide access to this material, publishers typically implement some kind of access control. While the platform itself is generally agnostic to media assets and their access-control mechanisms, applications require the ability to make user-interface decisions based upon the access restrictions imposed upon the media.

For this reason, the Research & Education Space defines three specific kinds of access-control mechanism, as well as a policy which conformant media must be published according to. Specifically, this policy is that media assets must:—

  1. Media must be available either freely or under the terms of a blanket or statutorily-backed licensing scheme available to educational establishments (or licenses may be obtained on their behalf by local authorities or central government).
  2. It must be possible to obtain the media without further subscription or other charges, however “value-added” services may be provided which offer additional capabilities (such as archiving, enhanced search), provided those services can be readily subscribed to at an establishment level.
  3. The media must be generally available on a long term basis. Media available only for short periods has limited value in education because it prevents the same resources being used again in the future.
  4. The technical access-control mechansims must be one or more of those described below.
  5. The nature of the access-control mechanism must be described in the metadata accompanying the media.

For example, all of the following conform to the policy:—

  • Media published via Wikimedia Commons is available to everybody on a permanent basis without any additional payment or subscritpion.
  • Programmes which are part of BBC Four Collections are made available to everybody in the UK on a long-term basis (but may not be embedded). Access control is implemented through geo-blocking.
  • Recordings of broadcasts made according to the terms of Section 35 of the Copyright, Designs and Patents Act 1988 (as amended) is may be used by the institution who recorded it (or it was recorded on behalf of), provided their ERA Licence is maintained.
  • Services which are authorised by ERA to maintain an archive of Section 35 recordings and make them available to ERA Licence-holders who pay a subscription fee, provided access is through a mechanism described below.
  • A consortium of rights-holders who together define a scheme for access to one or more sets of media on an affordable establishment-level subscription basis, provided access is through a mechanism described below.

For more information about describing rights restrictions and access-control mechansims, see Metadata describing rights and licensing and Describing conditionally-accessible resources.

Geographical restrictions (geo-blocking)

Geo-blocking is the automatic determination of ability-to-access a resource by looking up the end-user’s public IP address against a database correlating IP address ranges with countries. For example, the address 132.185.240.10 is part of a range which is within the UK, whereas 192.0.32.8 is part of a range which is within the US.

Geo-location databases and live services are available both for free and on commercial terms, with varying levels of quality and service assurance.

Geo-blocking should generally be applied only where other access-control mechanisms are not applicable: for example, because a media asset is available to everybody within a particular country.

Federated access control using Shibboleth and the UK Access Management Federation

Shibboleth is a federated authentication single sign-on mechanism which is widely used by providers of materials to provide access only to staff and students of educational establishments.

The UK Access Management Federation, operated by Janet, provides the Shibboleth federation for UK institutions.

Shibboleth-protected resources present a sign-in page to users who are not already authenticated, which makes it suitable for use with both the embeddable player and the stand-alone playback page publication approaches described above.

Shibboleth-based access control is the preferred mechanism for use where media should be made available only to educational users.

IP-based access control

IP-based access control is often the simplest mechanism to implement, as it requires only for the publisher to check the end-user’s public IP address against a white-list and allow or permit access as required.

However, creating and maintaining that white-list can involve significant administrative burden, particularly on a nation-wide basis, and it does not allow ready access to media to remote-working staff and students without their institution providing additional infrastructure such as remote-desktop services and VPNs.

IP-based access control should generally be employed alongside Shibboleth-based authentication, and only for specific institutions which are not able to participate in the UK Acesss Management Federation.

Editorial Guidelines for Content Providers

What do we mean by “editorial”?

In this context we mean what is in the metadata and the associated media, such as text, video or images.

  • What does it say and what is it about?
  • Is it suitable for all ages to see and hear?
  • Are there any limits you would want to set around who could see this material?

When making metadata and media available in to education, it is important to understand the expectations of the users in terms of what they will see and hear.

These guidelines are intended to help content providers think about these issues as early in the process as possible.

The RES platform is funded with public money and needs to show that it is serving the public interest and behaving responsibly.

  • Some items in physical collections are only available to certain users.
  • How is this information transferred to the online catalogue?
  • Are there items in your collections which you believe are not suitable for under-18s?
  • How will you help end users know this?
  • The RES proposal intends that in schools, the primary users of the products built on the RES aggregator will be teachers.
  • But teachers are over-worked and are more likely to use your material if it is easy and quick to identify as relevant to their students.
  • If you hold any data or guidance on age suitability you should include this in the data you publish.
  • Users will be able to feedback to you about concerns with the metadata or assets, including possible breach of copyright – how will you as an institution manage this?
  • Although you will probably already have a mechanism for dealing with feedback and/or requests of either a legal (copyright, data protection etc) or editorial nature, it is worth being aware that RES may expose your material to a wider audience and these requests may therefore increase. Can your existing workflows manage this?
  • In sharing data and assets are you comfortable that you are complying with the Data Protection Act.

Working with the platform

Introduction

The Research & Education Space platform indexes linked open data, and presents its API in the same fashion. This part of the book is intended for both collection-holders seeking guidance on how to most effectively publish their data so that it can be indexed by the platform, and for software developers wanting to know how to interact with it.

Although some chapters are more applicable to some audiences than others (for example, it isn’t necessary for collection-holders to have a deep understanding of the available API operations), we do recommend that both collection-holders and application developers read this part of the book in full to best understand how to get the most out of the platform.

At the core of the platform is the Resarch & Education Space index. This index is available as web pages (to make it easier for application developers to see what’s there and how it works), but is primarily published as Linked Open Data in various machine-readable RDF serialisations. Accessing the index via HTTP GET requests and using Content Negotiation to ask for machine-readable data is how you use the Research & Education Space API.

Common metadata

Vocabularies used in this chapter:

VocabularyNamespace URIPrefix
RDF syntaxhttp://www.w3.org/1999/02/22-rdf-syntax-ns#rdf:
RDF schemahttp://www.w3.org/2000/01/rdf-schema#rdfs:
DCMI termshttp://purl.org/dc/terms/dct:
OWLhttp://www.w3.org/2002/07/owl#owl:
FOAFhttp://xmlns.com/foaf/0.1/foaf:
XHTMLhttp://www.w3.org/1999/xhtml/vocab#xhtml:
Creative Commons RELhttp://creativecommons.org/ns#cc:
ODRL 2.xhttp://www.w3.org/ns/odrl/2/odrl:
ODRS Vocabularyhttp://schema.theodi.org/odrs#odrs:

Classes

As mentioned previously, a predicate defined by RDF itself, rdf:type is used to state what kind of thing an entity is.

The Research & Education Space uses these classes, mapping them to one of a small number of generic classes (see the Class index for further details), which group the entities into partitions for ease of discovery, and also affects how certain class-specific predicates are interpreted (see the subsequent chapters in this part of the book for further details).

Example

The following example states that the entity with the URI https://www.w3.org/People/Berners-Lee/card#i is a person. Note that in Turtle, the language used for examples, rdf:type is abbreviated simply to a.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<https://www.w3.org/People/Berners-Lee/card#i> a foaf:Person .

Expressed as N-Triples instead, this would be written as:

<https://www.w3.org/People/Berners-Lee/card#i> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .

Generic descriptive terms

Example

Expressing equivalence with other identifiers

Linked Open Data in general, and the Research & Education Space in particular, is at its most useful when the data describing things links to other data describing the same thing.

In RDF, this is achieved using the owl:sameAs predicate. This predicate implies a direct equivalence relationship—in effect, it creates a synonym.

You can use owl:sameAs whether or not the alternative identifiers use http: or https:, although the practical usefulness of URIs which can’t resolved as Linked Open Data can be quite limited.

For example, one might wish to specify that our book has an ISBN using the urn:isbn: URN scheme [RFC3187]:

</books/9781899066100#id> owl:sameAs <urn:isbn:9781899066100> .

We can also indicate that the book described by our data refers to the same book at the British Library:

</books/9781899066100#id> owl:sameAs <http://bnb.data.bl.uk/id/resource/011012558> .

The outcome of this is that if an application queries the index for data about the entity identified as urn:isbn:9781899066100, any data describing the entity using your identifier, or the British Library’s, will also be aggregated by the index and returned.

Tips for collection-holders

Asserting the equivalence between your own authoratitive identifiers and well-known identifiers for the same entity is an easy way to ensure that the Research & Education Space properly connects your information with other people’s.

While you could look for equivalent identifiers managed by other collections and make owl:sameAs assertions about any you can find, this is likely to be wasted effort. Instead, if everyone expresses the equivalence with the common well-known identifier, the index will aggregate all of the data from all of the collections which state equivalence with that single identifier.

Note that this doesn’t only apply to identifier schemes which are formal standards: it works just as well with Wikidata URIs as it does with ISBNs or GTINs.

Tips for application developers

When an application retrieves data from the index about an entity, the Research & Education Space includes information about all of the identifiers it has encountered which refer to that entity. This information is expressed as a series of owl:sameAs statements where the subject is the URI encountered in the source data, and the object is the http://acropolis.org.uk/… URI that was automatically assigned by the Research & Education Space platform.

Rights and licensing

The Research & Education Space aggregates data which can be re-used by a wide range of applications, and much of that data describes digital media (such as videos and photographs) which are accessible only to certain groups of people.

To minimise the risk of data being incorporated into the index which applications cannot use, and to ensure an optimum user experience when querying for related media, it's crucial that the data that collections publish includes accurate rights and licensing information.

Linked Open Data must be published under the terms of a supported license and include explicit licensing data in order for it to be indexed by the Research & Education Space and be useable by applications. Our approach is aligned with the Open Data Institute’s guide to publishing machine-readable rights data.

The Research & Education Space crawler understands several common predicates for expressing the well-known URI of the license of a document:

Simple licensing statements

In many cases, the simplest way of expressing rights information is to include it in the data that you're publishing, and this can often be accomplished by adding a single triple to each published documents using the dct:rights predicate or an equivalent (see above) predicates and a supported license URI.

For data to be indexed by the Research & Education Space, the crawler must be able to find (or derive) a licensing triple where the URL of the RDF document itself is the subject—essentially a triple which states “this document is licensed according to the terms identified by the following URI…”. In Turtle, this can be written as a triple similar to the below:

<> dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .

This example assumes that you do not set @base, or if you do, that you don't set it to be anything other than the document’s own URL. If you do, you will need to be more specific in the subject of your licensing triple.

Publishers should also include similar triples for any digital assets that are referenced in the data which are licensed under a common set of terms. For example, if the data describes a set of images licensed under the terms of the Creative Commons Attribution License 4.0, then you should include triples indicating this.

Example

The following example specifies that the Turtle representation of the data about our book is licensed according to the terms of the Creative Commons Attribution 4.0 International licence, using the host-relative URI identifying the specific representation.

</books/9781899066100.ttl> dct:license <http://creativecommons.org/licenses/by/4.0/> .

Tips for collection-holders

It is important that the subject of this triple is the URI of the concrete document. If you have different URIs for each RDF representation, and either send a Content-Location header or redirect to them, you need to ensure that the subject of the licensing triple is representation-specific URI (e.g., /books/9781899066100.ttl).

This is because the Research & Education Space crawler is stateless, just like the underlying HTTP protocol itself. In practice, this means that when a document is being processed by the crawler, the only information which can be used to evaluate it is the request-URI, the Content-Location (if provided), any Link headers that were sent, and the serialised RDF itself.

For publishers for whom publishing a licensing triple in each document is difficult (for example because it involves adding a large number of triples to a store due to the nature of the publication workflow), there are alternative—equivalent—mechanisms described in the Publishing data for the Research & Education Space part of this book.

If you need to, you can provide more information than the license triple alone. For example, you might include a request which is not a formal requirement of the licensing terms, but you would like consumers to adhere to if possible.

One way to do this is to include a dct:license triple referring to the well-known license URI, alongside a dct:rights triple pointing to a locally-defined odrs:RightsStatement entity (described using the Open Data Rights Statement Vocabulary).

Tips for application developers

The Research & Education Space was designed such that applications do not have to trust the platform regarding data licensing. As the crawler discriminates based upon the presence of explicit data licensing statements, an application is always able to verify those statements at source itself using the same mechanisms employed by the crawler.

Conditionally-accessible resources

Many kinds of resource are not available to the general public but may be accessed by the key audience of the Research & Education Space: students and teachers affiliated with a recognised educational institution in the UK. This may be because specific exceptions in law allow access when it would not otherwise be possible, or because the rights-holder has elected to make the assets available only to those in formal education.

In order to support this, and ensure that users of RES applications are able to use to the greatest range of material that they legitimately have access to, the metadata describing those assets which aren’t available to the public but are to educational users must describe means by which they are accessed.

Conditional access is described by publishing the following data:

  • Assigning a URI to the group of people who may access the resource, known by the Research & Education Space as an audience URI.
  • A rights statement, which uses the ODRL vocabulary to grant conditional permission to access the resources the statement is associated with.
  • Using one of the rights predicates described above to associate an actual resource with the rights statement.

When this data is processed, the Research & Education Space indexes the fact that the resource is conditionally-accessible and the URI assigned to the group of people who have permission to access it.

Applications can then use this URI when performing queries in order to include restricted resources in the results.

Example

The following example excerpt describes two episodes of television programmes, and their associated media players (see Audio, video, and images) which are both accessible only to authorised users of the BBC Shakespeare Archive Resource.

<http://shakespeare.acropolis.org.uk/programmes/PCDD329F#id> a po:Episode ;
	rdfs:label "Arall Fyd"@cy-gb ;
	mrss:player <http://shakespeare.acropolis.org.uk/programmes/PCDD329F/player> .

<http://shakespeare.acropolis.org.uk/programmes/LRPI646P#id> a po:Episode ;
	rdfs:label "Five to Eleven - Cyril Luckham"@en-gb ;
	mrss:player <http://shakespeare.acropolis.org.uk/programmes/LRPI646P/player> .

<http://shakespeare.acropolis.org.uk/programmes/PCDD329F/player>
	dct:format <http://purl.org/NET/mediatypes/text/html> ;
	dct:license <http://shakespeare.acropolis.org.uk/terms#id> ;
	a dcmitype:MovingImage .

<http://shakespeare.acropolis.org.uk/programmes/LRPI646P/player>
	dct:format <http://purl.org/NET/mediatypes/text/html> ;
	dct:license <http://shakespeare.acropolis.org.uk/terms#id> ;
	a dcmitype:MovingImage .

<http://shakespeare.acropolis.org.uk/terms#id>
	a odrl:Policy ;
	rdfs:label "Accessible only by authorised users in formal education in the UK"@en-gb ;
	odrl:permission <http://shakespeare.acropolis.org.uk/terms#play> .

<http://shakespeare.acropolis.org.uk/terms#play>
	a odrl:Permission ;
	odrl:action odrl:play ;
	odrl:assignee <http://shakespeare.acropolis.org.uk/#members> .

<http://shakespeare.acropolis.org.uk/#members>
	a odrl:Group ;
	rdfs:label "Authorised users of the BBC Shakespeare Archive Resource"@en-gb .

In the example above, the two po:Episode entities state that they have an embeddable media player using the mrss:player predicate. The data describing the two player URIs indicates that specific licensing terms apply to them, identified by the URI http://shakespeare.acropolis.org.uk/terms#id, which is an ODRL Policy.

The policy statement includes a permission (identified as http://shakespeare.acropolis.org.uk/terms#play, although the Research & Education Space does not require an explicit URI to be assigned to each individual permission), which states that the action odrl:play may be performed by members of the group http://shakespeare.acropolis.org.uk/#members.

When this data is processed by the Research & Education Space, the platform indexes the fact that these players are associated with their respective creative works, and that they are accessible only be members of the audience group which has the URI http://shakespeare.acropolis.org.uk/#members.

Tips for collection-holders

It is important to remember that the Research & Education Space platform, and the applications which use its API, are not responsible for enforcing access control applied to resources that you publish. The mechanisms described here are used to describe the access restrictions that you already have in place so that user interfaces can properly account for them.

In the absence of data to the contrary, the platform will assume that media resources are generally accessible to the public. Once data has been processed which describes access restrictions, the index is updated accordingly.

Access restrictions are only recognised when they assign one of the ODRL actions odrl:play, odrl:present, odrl:display or odrl:use to an audience URI using the odrl:assignee predicate as part of an ODRL policy statement. Other forms of policy are not processed, and permissions related to other types of action are ignored.

Tips for application developers

By default, the Research & Education Space API will not include conditionally-accessible resources in results of queries. This means that when querying for concepts which have associated media, the default behaviour is to return those concepts whose associated media appears to be available to all.

In order to include conditionally-accessible resources in the results, an application must explicitly include the for parameter in its queries, which specifies the URI of an audience group to include.

Applications should provide a mechanism for the user or administrator to specify which audience groups users are part of. Note that this doesn't grant or deny access to the resources themselves—that's the responsibility of the publisher of those resources—but is instead a mechanism for applications to hide from view those resources which are not accessible to the end-user.

Information about the audience groups known to the platform can be retrieved from the /audiences dataset. By periodically retrieving this dataset, applications can populate a list of audience groups for users or administrators to choose from.

Digital assets

When we talk about “digital assets”, we mean digital resources that users can stream or download to their devices: video, audio and image files, web pages, and other kinds of document, including machine-readable data and collections.

In many cases, these digital assets are a manifestation of a creative work (which should be described separately and related to the digital assets).

In contrast to the case where some non-digital entity (such as a person or a concept) is being described, the subject URI used in the data is the actual URL of the digital asset.

Although we recommend you read this whole section, we have split it into three broad categories of assets that you may be publishing or consuming data about: Documents, Audio, video and images, and Collections and data-sets.

Documents

The term documents is intended to include any sort of document-based media: web pages, machine-readable data in various forms (including RDF serialisations, CSV, XML, and proprietary formats), PDFs, Office documents, and so on. The guidance in this section also applies if you are describing audio, video or image assets.

Vocabularies used in this section:

VocabularyNamespace URIPrefix
RDF syntaxhttp://www.w3.org/1999/02/22-rdf-syntax-ns#rdf:
DCMI termshttp://purl.org/dc/terms/dct:
DCMI typeshttp://purl.org/dc/dcmitype/dcmit:
FOAFhttp://xmlns.com/foaf/0.1/foaf:
Media typeshttp://purl.org/NET/mediatypes/mime:
W3C formats registryhttp://www.w3.org/ns/formats/formats:

Describing a document

Give the document a class of foaf:Document:

</books/9781899066100> a foaf:Document .

Give the document a title:

</books/9781899066100> dct:title "Information about 'Acronyms and Synonyms in Medical Imaging' at the Intergalatic Alliance Library & Museum"@en .

If the document is not a data-set, specify the primary topic (that is, the URI of the thing described by the document):

</books/9781899066100> foaf:primaryTopic </books/12345#id> .

Link to each of the serialisations:

</data/9781899066100> dct:hasFormat </data/9781899066100.ttl> .
</data/9781899066100> dct:hasFormat </data/9781899066100.html> .

Describing the serialisations of a document

Use a member of the DCMI type vocabulary as a class, where the subject is the URL of the serialisation:

</books/9781899066100.ttl> a dcmit:Text .

Where available, use a member of the W3C formats vocabulary as a class:

</books/9781899066100.ttl> a formats:Turtle .

Use the dct:format predicate, along with the MIME type beneath the http://purl.org/NET/mediatypes/ tree:

</books/9781899066100.ttl> dct:format <http://purl.org/NET/mediatypes/text/turtle> .

Give the serialisation a specific title:

</books/9781899066100.ttl> dct:title "Description of 'Acronyms and Synonyms in Medical Imaging' as Turtle (RDF)"@en .

Specify the licensing terms for the serialisation, if applicable:

</books/9781899066100.ttl> dct:rights <http://creativecommons.org/licenses/by/4.0/> .

Also see the Metadata describing rights and licensing section for details on the licensing statements required by RES, as well as information about supported licences.

Example

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix dcmit: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix formats: <http://www.w3.org/ns/formats/> .

</data/9781899066100>
	a foaf:Document ;
	dct:title "Information about 'Acronyms and Synonyms in Medical Imaging' at the Intergalatic Alliance Library & Museum"@en .
	foaf:primaryTopic </books/12345#id> ;
	dct:hasFormat
		</data/9781899066100.ttl> ,
		</data/9781899066100.html> .

</data/9781899066100.ttl>
	a dcmit:Text, formats:Turtle ;
	dct:format <http://purl.org/NET/mediatypes/text/turtle> ;
	dct:title "Description of 'Acronyms and Synonyms in Medical Imaging' as Turtle (RDF)"@en ;
	dct:rights <http://creativecommons.org/licenses/by/4.0/> .

</data/9781899066100.html>
	a dcmit:Text ;
	dct:format <http://purl.org/NET/mediatypes/text/html> ;
	dct:title "Description of 'Acronyms and Synonyms in Medical Imaging' as a web page"@en .

Audio, video, and images

Vocabularies used in this section:

VocabularyNamespace URIPrefix
DCMI termshttp://purl.org/dc/terms/dct:
DCMI typeshttp://purl.org/dc/dcmitype/dcmit:
FOAFhttp://xmlns.com/foaf/0.1/foaf:
Media RSShttp://search.yahoo.com/mrss/mrss:
Media typeshttp://purl.org/NET/mediatypes/mime:

Describing your media assets (or, in the case of embeddable players, the pages hosting them) allows applications to properly surface the correct media for a user. You can do this even if your media assets are hosted entirely separately from your data and other web pages.

  • Because you are describing directly-retrievable resources, the subject URIs are the actual URLs of your assets.
  • Use members of the DCMI Type Vocabulary as the classes of your assets: dcmit:MovingImage, dcmit:Sound, and so on.
  • For each asset, include a dct:format triple referring to the entry in the media types vocabulary matching the MIME type of the resource. For example, an MP4 audio file has a MIME type of audio/mp4, and so the dct:format would be http://purl.org/NET/mediatypes/audio/mp4.
  • Add triples to the RDF describing the creative work that this asset is a representation of: mrss:player (for embedded player pages), foaf:page (for stand-alone playback pages) or mrss:content (for directly-accessible media).
  • Add foaf:primaryTopic triples to the data describing your assets, referring to the most specific creative work entity that the asset represents.
  • Use a rights statement predicate to link to an RDF policy document describing who is able to access the media asset if it is not generally-available, or use the well-known URI of a well-known license if the asset is available according to those terms.

Example

For example, you might describe an episode of a television programme, using the programmes ontology, with the following:—

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix dcmit: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix formats: <http://www.w3.org/ns/formats/> .
@prefix po: <http://purl.org/ontology/po/> .

</programmes/b04bndbd#programme>
	a po:Episode ;
	dct:title "Trampolining"@en-gb ;
	po:version </programmes/b04bndb9#programme> ;
	mrss:player <http://player.example.com/b513ef5a60fe47d5a45826e2c592fb3f> .
	
</programmes/b04bndb9#programme>
	a po:Version ;
	rdfs:label "An episode of "Nina and the Neurons: Get Sporty - Trampolining"@en-gb ;
	po:aspect_ratio "16:9" ;
	po:sound_format "Stereo" ;
	po:duration 900 .

<http://player.example.com/b513ef5a60fe47d5a45826e2c592fb3f>
	a dcmit:MovingImage ;
	dct:format <http://purl.org/NET/mediatypes/text/html> ;
	dct:title "Nina and the Neurons: Get Sporty - CBeebies - 2015-05-27"@en-gb ;
	foaf:primaryTopic </programmes/b04bndb9#programme> ;
	dct:license <http://player.example.com/terms#id> .

This data describes an episode of Nina the Neurons, a specific version of that episode, and an embeddable web player for that version which is subject to access controls.

Note that in practice this data might be split over several RDF resources. However, if the specified media asset URL does not perform content negotiation and result in an RDF description of itself, you should include the data about the asset in the same resource that you describe the creative work that it is a representative of.

Collections and data-sets

In this context, we use the term collection to refer to any arbitrary group of entities indexed by the Research & Education Space. Publishers can explicitly group subjects that they describe together into collections, and applications can scope their queries to only those items which appear within a specific collection.

Vocabularies used in this section:

VocabularyNamespace URIPrefix
DCMI termshttp://purl.org/dc/terms/dct:
FOAFhttp://xmlns.com/foaf/0.1/foaf:
VoIDhttp://rdfs.org/ns/void#void:

Data-set auto-discovery

Physical things

People, projects and organisations

Places (and other geographical features)

Events and time-spans

Concepts and taxonomies

Vocabularies used in this section:

VocabularyNamespace URIPrefix
SKOShttp://www.w3.org/2008/05/skos#skos:

Creative works

Vocabularies used in this section:

VocabularyNamespace URIPrefix
RDF syntaxhttp://www.w3.org/1999/02/22-rdf-syntax-ns#rdf:
Bibliographic Ontologyhttp://purl.org/ontology/bibo/bibo:
DCMI termshttp://purl.org/dc/terms/dct:
FRBR Corehttp://purl.org/vocab/frbr/core#frbr:
Programmes Ontologyhttp://purl.org/ontology/po/po:

“Creative works” is the broad term used to describe books, magazines, photographs, paintings, music, TV shows, and so on. In the Research & Education Space, a creative work is a distinct entity with its own description separate to any digital assets which are manifestations of creative works. It’s not uncommon for there to be multiple digital manifestations hosted by different organisations of the same creative work.

  • Use the Programmes Ontology to describe television, radio, and online-only programmes, clips and series.
  • Use the Bibliographic Ontology to describe books and periodicals.
  • Use foaf:topic and foaf:primaryTopic as appropriate to relate creative works to the subjects of those works. Where possible, use URIs for terms that are also used by other people, such as WikiData and DBpedia.

    If you have your own subject hierarchy which is also published as Linked Open Data, you can establish topic references into that hierarchy, and then express the equivalence between your terms and those defined by others.

  • Try to express as much detail about your creative works as you feasibly can. In particular, if your works are organised into series, or might have different editions or versions, it's helpful to express this information as it enables more comprehensive user journeys.
  • Follow the patterns described in Describing digital assets to relate specific manifestations of works to the information about the works themselves. This allows interfaces to offer links or embedded playback facilities for media.

Under the hood: the architecture of Acropolis

The Acroplolis stack consists of several distinct, relatively simple services. Within the Research & Education Space, they are used together, but each can be deployed independently to suit different applications.

High-level architecture of the Acropolis stack

Quilt

Quilt is a modular Linked Open Data server. At its simplest, Quilt can serve a directory tree of Turtle files in the various RDF serialisations supported by librdf, but it can be extended with new engines (which can retrieve data from alternative sources), serialisers (which can output the data in different formats), and SAPIs (server interfaces, which receive requests from different sources).

The core of Quilt is libquilt, which is linked into the SAPI that is used to receive requests. libquilt encapsulates request and response data, and implements a common approach to configuration, loading modules, and request processing workflow irrespective of SAPI is in use. Each request follows the following process:

  1. Encapsulate request data obtained from the SAPI (such as the request-URI, and any request headers) along with an empty librdf model which will contain the response data. This encapsulated request-response object is then passed to the engine and then on to the serialiser which generates the response payload.
  2. Perform content negotiation to determine the best response format supported by both the client and the server.
  3. Pass the request to the configured engine for processing: the engine is responsible for populating the RDF model (or returning an error response if it's unable to).
  4. The serialiser for the negotiated response format completes the request by serialising the RDF model in that format.

Quilt includes two SAPIs: a FastCGI interface, which receives requests from any web server supporting the FastCGI interface, and a command-line SAPI which is useful for testing and debugging. New SAPIs can be developed by implementing the Quilt server inteface and linking against libquilt.

Quilt itself comes with engines for obtaining data from files and from a quad-store. In both cases, the engines perform simple translation of the request-URI into a file path or a graph URI and populate the RDF model with the contents of that file or graph.

Engines could be developed which obtain data from any source. For example, the BBC billings data service, operated as part of the Research & Education Space, is implemented as an engine which populates RDF models based upon queries performed against a SQL database.

The Acropolis stack includes another engine, Spindle, which implements the query capabilities provided to applications by the Research & Education Space API.

libquilt itself incorporates a serialiser which will generate output from an RDF model in any format supported by librdf. An additional serialiser is included which can generate HTML from templates written using a subset of the Liquid templating language.

Twine

Twine is a simple, modular, queue-driven workflow engine designed for RDF processing. It receives AMQP messages whose payload is a document which can be transformed to RDF and pushed, using SPARQL 1.1 Update into a quad-store. Future versions of Twine may support other queue mechanisms, such as Amazon SQS. More information about using Twine can be found in the manual pages.

Twine is typically operated as a continuously-running daemon, twine-writerd. Each received message must include a content type in its headers, which is used to termine which processing module the message should be routed to.

An internal API allows this basic workflow to be augmented by support for new message types, pre-processors (which can perform early transformation of RDF graphs before normal message processors are invoked), and post-processors (which can perform additional work based upon the final rendition of a graph).

Twine includes processors for TriG and N-Quads (which simply store each named graph within the source data), the GeoNames RDF dump format, and a configurable XSLT processor which applies user-supplied XSL transforms in order to generate RDF/XML from source data in an artbitrary XML format.

A special class of processors, called handlers, allows for a degree of indirection in message processing. Handlers use the contents of a message to retrieve data from another source which can then be passed back to Twine for processing as if it had been received as a message directly.

For example, an S3 handler receives messages whose payload is simply one or more S3 URLs (i.e., URLs in the form s3://bucketname/path/to/resource). Each is fetched in turn, and passed back to Twine for normal processing. The S3 handler works with both Amazon S3 and the Ceph RADOS object gateway.

The Anansi handler is very similar to the S3 handler, but it is designed to process messages containing S3 URLs to objects and extended metadata cached in a bucket by the Anansi web crawler.

Bridges are tools which push messages into the Twine processing queue. A simple example bridge, twine-inject reads from standard input and pushes the contents directly into the queue. An additional bridge is included which queries an Anansi database for newly-cached resources and pushes messages containing their URLs into the processing queue.

For the Research & Education Space, the Spindle module for Twine is responsible for processing RDF crawled by Anansi in order to generate the index.

Anansi

Anansi is a web crawler, which is used in the Research & Education Space to find and cache Linked Open Data for processing by Twine.

Anansi is implemented as a generic web crawling library, libcrawl, and crawling daemon, crawld. Loadable modules are used to provide support for different cache stores and for processing engines which are able to inspect retrieved resources (and potentially reject them if they do not meet desired criteria), and extract URLs which should be added to the crawler queue.

The daemon is intended to operate in a parallel fashion. Although an instance can be configured to run in a fixed-size cluster, it can also use etcd for dynamic peer discovery. In this dynamic configuration, the cluster can be expanded or contracted at will, with Anansi automatically re-balancing each node when the cluster size changes.

Anansi includes a generic RDF processor, which indiscriminately follows any URIs found in documents which can be parsed by librdf. This is extended by the Linked Open Data module, which requires that explicit open licensing is present and rejects resources which don’t include licensing information, or whose licence is not in the configurable white-list. This module is used by the Research & Education Space to process RDF and reject resources which do not meet the licensing criteria.

Spindle

Spindle module for Twine

Within the platform, Linked Open Data which has been successfully retrieved by Anansi is cached in a RADOS bucket. The s3:// URLs are passed to to the Twine message queue for processing by Spindle’s module for Twine along with Twine's provided RDF quads processor.

The module includes both a pre-processor and post-processor, and is responsible for implementing the co-reference aggregation, indexing and caching logic of the Research & Education Space.

When first loaded, the module parses and evaluates its rule-base, which specifies how co-referencing predicates should be interpreted, which predicates in source data should be cached, and the relationship between source classes and predicates those incorporated into the aggregate generated entities. See the class and predicate indices for more information on these relationships.

The pre-processor is applied to any data before it is written to platform’s quad-store, and uses the information in the rule-base to remove triples from the graph which should not be cached.

Once the data has been “stripped” by the pre-processor, Twine’s RDF quads processor writes the updated graphs to a quad-store via a SPARQL PUT request. Thus, the quad-store contains a copy of all of the source data which the rule-base specifies should be cached by the platform.

Twine invokes any registered post-processors for each graph which is updated once the update has been completed, and the Spindle module installs a post-processing handler so that it can perform indexing and aggregation when this happens. The post-processing steps are described in the following-sections:—

Co-reference discovery

For each updated graph, Spindle generates a list of co-references using the matching rules specified in the rule-base. To do this, both the source data and existing cached data referring to the subjects are evaluated (that is, the order that the data is processed by Spindle doesn’t matter, which is important because Anansi might encounter it in any order). Where no co-references were found for a particular subject, it's added to the co-reference list as a “dangling reference”.

Next, each entry in the list of co-references is assigned a UUID which is used to form the URI of the entity within the Research & Education Space index.

Where a particular entity is encountered for the first time (either because all of the known co-references are within the graph being processed, or because no co-references were found), a new UUID is generated and assigned to the entity.

Where the newly-discovered co-references refer only to the same existing entity (and possibly to other entities about which there is no existing data), the existing entity’s UUID is simply assigned to the entity.

Finally, where the co-references span two or more existing entities, they are all assigned the same UUID (that is, the existing entries will be updated as well).

The result is a set of pairs of “local” subject URIs (comprised of the configured base URI, followed by the UUID assigned as described above, and a fixed fragment of #id) and “remote” subject URIs from the source data. These pairs are written into either the quad-store as owl:sameAs triples (in the graph with the same name as the configured base URI), or are written into a SQL database table.

Each updated local UUID-derived URI is added to a list for this processing pass which is passed to the subsequent phases described below.

For example, if we begin with no previously-cached data and process a graph which states that A and B are equivalent, then the pair of (A, B) will be added to the co-reference list described at the beginning of this section. Because this co-reference doesn’t refer to any previously-known entities, it’s assigned a newly-generated UUID which we can refer to as U1, and two local-remote co-reference pairs are generated and stored in the quad-store or SQL database:—

  1. (http://baseuri/U1#id, A)
  2. (http://baseuri/U1#id, B)

Next, we process another graph which states that C and D are equivalent, and this results in a new UUID being generated, U2, and two new pairs being generated and stored:—

  1. (http://baseuri/U2#id, C)
  2. (http://baseuri/U2#id, D)

If we then process a graph which states that A and D or equivalent (possibly as well as other subjects), then one of the references from U1 or U2 will be updated so that all of A, B, C and D are all stored as co-references from a single local entity. For example purposes, we shall say that U1 will be the chosen UUID, although either could occur (the choice is not currently deterministic). Thus, we update the U2 references such that:—

  1. (http://baseuri/U1#id, C)
  2. (http://baseuri/U1#id, D)

This means that if we query our quad-store or SQL database for co-references for U1, the following pairs will be returned:—

  1. (http://baseuri/U1#id, A)
  2. (http://baseuri/U1#id, B)
  3. (http://baseuri/U1#id, C)
  4. (http://baseuri/U1#id, D)

Proxy generation

For each updated local URI, a proxy is generated: that is an RDF entity which is distilled using the the rule-base from all of the source data it is co-referenced with.

The rule-base consists of two sets of rules for proxy generation, which are processed in turn. First, the class of the proxy is determined, by finding all of the classes of all of the co-referenced entities and ordering them according to the score value in rule-base.

Next, a similar process is applied to properties across the source data. The property rules include a similar scoring approach to that used by the classes (so that, for example, skos:prefLabel takes precedence over rdfs:label), as well as discrimination by data type (longitude and latitude should not be an xsd:dateTime, for example) and class applicability (that is, some properties are ignored unless the class determined in the previous step is a particular value: e.g., gn:parentFeature would be ignored for a foaf:Person).

All of this “conveyed” data has a UUID-derived local subject URI, as described above, and is placed in a named graph whose URI is in the form http://baseURI/UUID.

Indexing

If Spindle is configured to use a SQL database as an index, then certain elements of the generated entity are stored in database tables for later query and retrieval. These include:—

  • the combined list of RDF classes of the entity;
  • the label and description in any languages they are available;
  • the UUIDs of any entities that have this entity as a topic;
  • if this item is a creative work, then any digital assets which are manifestations of it;
  • if the entity is a place, the longitude and latitude, if known; and
  • if this item is a digital asset, then information about the kind (e.g., moving-image, sound, interactive resource, etc.), type (in the form of a MIME type, such as text/html), and where the asset is known to be conditionally accessible, the URIs of audiences which are able to access it.

Storing pre-composed quads

The RDF graph generated above (see Proxy generation) is written to an S3 or RADOS bucket as N-Quads if the module has been configured to do so. If not, then the graph is written into the quad-store instead.

Where N-Quads are stored in a bucket, the document will also include all of the data about the co-referenced entities from their source graphs as well. This means that the Spindle module for Quilt can rapidly retrieve the majority of the data about an entity with a single authenticated GET request to the bucket.

Spindle module for Quilt

The Spindle module for Quilt is a companion to the corresponding Twine module described above, and includes several capabilites not present in the simple resource-graph module that is included with Quilt itself:—

  • the ability to perform URI look-up queries (i.e., locate the entry within the index which is coreferenced to the specified URI and redirect to it);
  • the ability to retrieve data about the co-referenced entities from their original graphs;
  • when configured with an S3 or RADOS bucket, the module can avoid SPARQL for simple fetches and instead perform a simple fetch from the bucket; and
  • when configured to use a SQL database, the ability to efficiently perform complex queries, such as “locate all places like ‘france’ which have related video that is available to users of a particular service”.

When configured with both a SQL database and an S3 or RADOS bucket, the module does not perform any SPARQL queries at all, although this may change in the future as graph databases evolve.

The module processes four kinds of request, which are described in more detail in the section The Research & Education Space API: the index and how it’s structured:—

  • a “root resource” request, which generates data about the different class partitions and the available query capabilities;
  • an item request, which retrieves data about the item using its “local” URI, as well as data about related entities and media;
  • a look-up request, which accepts a subject URI and responds either with a 303 See other redirect response, or a 404 Not found (if the URI is not present in the index); and
  • an index query request, which can be a query across any combination of RDF class, free-form text, related media kind, related media MIME type or audience.

Because the API provided by the module is read-only, the cluster of Quilt instances can be scaled up and down to meet demand as required with minimal co-ordination (subject to underlying database scalability).

Appendix I: Tools and resources

Guides

Useful blog posts

Tools for consuming Linked Open Data

Tools for processing RDF and publishing Linked Open Data

Technical standards

Appendix II: Codec & container format reference

Video codecs

KindUsagePropertiesExamples
PreservationLong-term archive storageLossless compression, typically 2:1DNG sequence, Motion JPEG 2000 lossless, VC2 (Dirac) lossless
Intermediate (mezzanine)Fine-cut editingVisually lossless, typically 4:1–6:1VC2 (Dirac), VC3 (DNx), Apple ProRes
DeliveryDistribution through a broadcast chain or publishing on physical mediaOutput format, constrained by bandwidth, typically 10:1–40:1H.262 (MPEG-2 Part 2), H.264 (MPEG-4 Part 10, AVC)
BrowseLightweight, streamable, viewing proxyOutput format, constrained by bandwidth, typically in excess of 50:1H.262 (MPEG-2 Part 2), H.264 (MPEG-4 Part 10, AVC), WebM (VP8+), Theora (VP3+), VP6
CodecKindAuthorityLossy/losslessDepthChromaNotes
SMPTE VC-2 (Dirac)VideoSMPTE/BBCBoth8, 10, 124:2:0, 4:2:2, 4:4:4Currently limited support
SMPTE VC-3 (DNx)VideoSMTPE/AvidLossy8, 103:1:1, 4:2:2, 4:4:4Max 1080i59.94
H.262 (MPEG-2 Part 2)VideoISO/MPEGLossy84:2:0, 4:2:2, 4:4:4Considered legacy
H.264 (MPEG-4 Part 10, AVC)VideoISO/MPEGLossy8, 104:2:0, 4:2:2, 4:4:4Widely supported
Apple ProResVideoAppleLossy10, 124:2:2, 4:4:4Proprietary intermediate codec
Apple Intermediate CodecVideoAppleLossy8, 104:2:0Considered legacy
Ogg Theora/VP3VideoXiphLossy84:2:0, 4:2:2, 4:4:4
VP6VideoGoogle/AdobeLossy84:2:0Classic Flash video codec
WebM/VP8+VideoGoogle84:2:0Limited support
Motion JPEG 2000VideoISO/JPEGBoth8, 10VariousParticularly suited to preservation

Audio codecs

KindUsagePropertiesExamples
PreservationLong-term archive storageLossless compression, typically 2:1Raw PCM, FLAC, ALAC, Dolby TrueHD
Intermediate (mezzanine)Fine-cut editingAudibly lossless, typically 4:1–6:1Raw PCM, FLAC, ALAC, AAC (MPEG-2 Part 7, MPEG-4 Part 3), Dolby TrueHD
DeliveryDistribution through a broadcast chain or publishing on physical mediaOutput format, constrained by bandwidth, typically 7:1AAC (MPEG-2 Part 7, MPEG-4 Part 3), MP3 (MPEG-1 Part 3, MPEG-2 Part 3), Dolby AC-3, Dolby TrueHD
BrowseLightweight, streamable, proxyOutput format, constrained by bandwidth, typically in excess of 11:1AAC (MPEG-2 Part 7, MPEG-4 Part 3), MP3 (MPEG-1 Part 3, MPEG-2 Part 3), Dolby AC-3
CodecKindAuthorityLossy/losslessNotes
Raw PCMAudioVariousUncompressedTypically wrapped in AIFF or RIFF (WAV)
FLACAudioXiphLosslessLimited hardware support
Apple Lossless (ALAC)AudioAppleLosslessLimited support
Dolby TrueHDAudioDolbyLossless
Dolby AC-3AudioDolbyLossyWidely supported in professional applications
AAC (MPEG-2 Part 7, MPEG-4 Part 3)AudioISO/MPEGLossyWidely supported
MP3 (MPEG-1 Part 3, MPEG-2 Part 3)AudioISO/MPEGLossyVery widely supported
Ogg VorbisAudioXiphLossyAdopted as audio codec for WebM
OpusAudioIETFLossyCurrently being trialled, particularly by radio broadcasters

Image codecs

KindUsagePropertiesExamples
PreservationLong-term archive storage, editing & compositionLossless compression, typically 2:1Adobe DNG (RAW), JPEG 2000 (ISO/IEC 15444) lossless, TIFF, PNG
DeliveryDistribution through a broadcast chain or publishing on physical mediaOutput format, constrained by bandwidth, typically 10:1-40:1JPEG 2000 (ISO/IEC 15444) lossless, TIFF, PNG
BrowseLightweight viewing proxy/thumbnailOutput format, constrained by bandwidth, typically in excess of 30:1JPEG (ISO/IEC 10918), JPEG 2000 (ISO/IEC 15444) lossless, PNG
CodecKindAuthorityLossy/losslessDepth (BPC)ChromaNotes
Adobe DNGRAW imageAdobeLosslessArbitraryDerived from TIFF
DPXProcessed imageSMPTELossless8-64 log
TIFFISO/AdobeBothArbitrary4:4:4, 4:2:0Supports HDR, alpha
OpenEXRProcessed imageDisney-PixarBoth16Supports HDR
JPEG 2000 (ISO/IEC 15444)Processed imageISO/JPEGBoth8, 10VariousSupports sequences with Motion JPEG 2000
JPEG (ISO/IEC 10918)Processed imageISO/JPEGLossy84:2:0
PNG (ISO/IEC 15948)Processed imageW3CLossless8bpp, 8bpcSupports alpha
WebPProcessed imageGoogleBoth84:2:0Derived from WebM/VP8+

Container formats

ContainerAuthoritySeekable?Multiple tracks?Multiple programs?MIME typeNotes
Transport Stream (MPEG-2 Part 1)ISO/MPEGNoYesYesvideo/MP2TUsed by DVB, ATSC, ARIB, Apple HLS, modified for use by Blu-Ray and AVCHD
Program Stream (MPEG-2 Part 1)ISO/MPEGYesYesNovideo/MP2PUsed by DVD-Video (VOB), HD-DVD (EVO)
QuickTimeAppleYesYesNovideo/quicktimeNow harmonised with and extends Base Media
Base Media (MPEG-4 Part 12)ISO/MPEGYesYesNoVariousDerived from QuickTime .mov
MP4 (MPEG-4 Part 14)ISO/MPEGYesYesNovideo/mp4, audio/mp4Derived from Base Media
FLVAdobeYesYesNovideo/x-flvDerived from Base Media
3GP & 3G23GPPYesYesNovideo/3gppDerived from Base Media
AVCHD/Blu-Ray MTS/TODVariousYesYesNovideo/MP2TTransport Stream packets prefixed with a 32-bit timecode
Elementary Stream (ES)ISO/MPEGNoNoNoVariousRaw codec data
Packetized Elementary Stream (PES)ISO/MPEGYesNoNoNone (application/octet-stream)Elementary Stream split into packets with an added header
MXFSMPTEYesYesNoapplication/mxfForms the basis of the Digital Production Partnership (DPP) UK broadcasting delivery specification
AIFFAppleYesNoNoaudio/x-aiff, audio/aiffTypically used as a lightweight single-essence container
AAFAMWAYesYesNoNone (application/octet-stream)Derived from Microsoft (OLE) Structured Storage as used by legacy Microsoft Office
MatroskaMatroskaYesYesNoaudio/x-matroska, video/x-matroskaNot well-supported
JP2 (ISO 15444-12)ISO/JPEGNoNoimage/jp2, image/jpx, image/jpm, video/mj2Derived from Base Media; profiled for JPEG 2000 (and Motion JPEG 2000) essence
WebMGoogleYesYesNoaudio/webm, video/webmDerived from Matroska; only used to carry WebM audio & video essence
RIFFMicrosoftYesYesNoVarious (particularly audio/vnd.wave, audio/wav, audio/wave, audio/x-wav, video/x-msvideo)WAV and AVI are both RIFF formats
ASFMicrosoftYesYesNoaudio/x-ms-wma, video/x-ms-wmvConsidered legacy; WMA and WMV are both ASF formats
OggXiphYesYesNoaudio/ogg, video/oggDe facto container for Vorbis audio and Theora video

Metadata formats

ContainerAuthorityExtensibilityStandalone?Embedded inNotes
ExifUnmaintainedControlledNoJPEG, TIFF, JPEG 2000, PNGLargely superseded by XMP; contains IPTC IIM
Adobe XMPAdobeArbitrary (URIs)YesTIFF, JPEG 2000, PDFXMP is a subset of RDF/XML; widely-used
ID3v2VariousConsensusNoMP3, AIFF, MP4Considered legacy, but widely-used
OggXiphControlledNoOgg
MP4ISO/MPEGFourCC registryNoBase Media and derivatives
MPEG-7ISO/MPEGControlledYesBase MediaXML-based; describes relationships between components
MPEG-21ISO/MPEGControlledYesBase MediaIncludes rights expression
TV-AnytimeUnmaintainedControlledYesBase MediaConsidered legacy but used in broadcast applications
Turtle (RDF)W3CArbitrary (URIs)YesNot currently widely-used as a media metadata container; can be generated from RDF/XML
RDF/XMLW3CArbitrary (URIs)YesGenerally considered legacy, superseded by Turtle; basis of Adobe XMP

Packaging formats

PackageAuthorityMetadata formatsContainer formatsMultiple programs?Notes
AVCHDSony/PanasonicMTS/TODYesDerived from Blu-Ray
DVD-VideoDVD ForumProgram Stream (MPEG-2 Part 1)Yes
Blu-RayBDAMTS/TODYes
CinemaDNGAdobeXMPMXF, DNGNoIntended to package losslessly-encoded media
Digital Production Partnership (DPP)DPPDPP XMLMXFNoIntended for delivery of complete programmes to broadcasters

Streaming formats

FormatAuthorityManifest formatContainer formatsNotes
IIS Smooth StreamingMicrosoftXMLMTS/TODHTTP-based adaptive streaming for Silverlight clients
RTSP & RTPIETFSDP
RTMPAdobeProtocol exchangeAdaptive streaming for Adobe Flash; considered legacy but remains widely-used, often alongside HLS
Apple HLSApple/IETFExtended playlist (m3u8)Transport Stream (MPEG-2 Part 1)Particularly well-supported on mobile devices
Adobe HDSAdobeXMLFLVConsidered legacy; Adobe is transitioning to HLS for streaming media

Glossary

Acropolis

The software stack which powers the Research & Education Space.

Anansi

The Acropolis web crawler, which is used in the Research & Education Space to locate Linked Open Data.

audience (conditional access)

Within the Research & Education Space, in the context of conditional access to resources, the term audience refers to a specific group of people who are assigned an identifier in the form of a URI. This allows any data about resources accessible to that group to use the same identifier to refer to them, and for the Research & Education Space index to allow queries for digital assets which can be accessed by them.

Content negotiation

The mechanism by which an HTTP user-agent specifies the list of formats it is able to interpret, and a server selects the format it will return a document in based upon that information (in principle, it selects the highest-scored format from the intersection of client-supported and server-supported formats).

co-reference

A piece of data which states that two identifiers (in the form of URIs within the context of the Research & Education Space) refer to the same entity.

creative work

The abstract form of the output of a creative process. Creative works can have manifestations in several forms, some of which may be digital assets. For example, a book can be a creative work, with both printed and EPUB editions being (physical and digital) manifestations of it.

digital asset

Any sort of document or media, including machine-readable data, which can be represented digitally. This includes RDF/XML documents, PDFs, MP3 audio, PowerPoint presentations, and so on.

entity

Something which is described by some data. Often termed a resource in the RDF specifications, but this can be confusing because “resource” is often used to refer to a document which can be transferred electronically (particularly via HTTP).

Linked Data

Data which is published on the web using resolveable URIs as identifiers, so that de-referencing the URI of something retrieves the data about it.

Linked Open Data

Linked Data which is also openly licensed.

manifestation (creative work)

A particular physical or digital version of a creative work. For example, a PDF can be a digital asset which is a manifestation of a book, which is a creative work.

Not all digital assets are manifestations of creative works: some digital assets are representations of data produced by some automated process, for example.

Quilt

A Linked Data server which is part of the Acropolis stack. Used within the Research & Education Space to serve its API.

serialisation

A concrete representation of a document (typically RDF) in some format.

Spindle

The aggregator which forms part of the Acropolis stack, comprising plug-in modules for Twine and Quilt. Responsible for generating and serving the Research & Education Space index.

Twine

An RDF processing engine which is part of the Acropolis stack, and which is used in the Research & Education Space to process Linked Open Data that has been fetched by Anansi.

License index

The Research & Education Space crawler discards RDF data which is not explicitly licensed using one of the well-known licenses listed below. Note that the URI listed here is the URI which must be used as the object in the licensing statement, as described in Incorporating rights information into RDF data.

LicenceURI
Creative Commons Public Domain (CC0)http://creativecommons.org/publicdomain/zero/1.0/
Library of Congress Public Domainhttp://id.loc.gov/about/
Creative Commons Attribution 4.0 International (CC BY 4.0)http://creativecommons.org/licenses/by/4.0/
Open Government Licence http://reference.data.gov.uk/id/open-government-licence, https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/, https://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/, https://www.nationalarchives.gov.uk/doc/open-government-licence/version/1/, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/1/, http://www.ordnancesurvey.co.uk/business-and-government/licensing/using-creating-data-with-os-products/os-opendata.html, http://www.ordnancesurvey.co.uk/oswebsite/opendata/licence/docs/licence.pdf.
Digital Public Space Licence, version 1.0http://bbcarchdev.github.io/licences/dps/1.0#id
Creative Commons 1.0 Generic (CC BY 1.0)http://creativecommons.org/licenses/by/1.0/
Creative Commons 2.5 Generic (CC BY 2.5)http://creativecommons.org/licenses/by/2.5/
Creative Commons 3.0 Unported (CC BY 3.0)http://creativecommons.org/licenses/by/3.0/
Creative Commons 3.0 US (CC BY 3.0 US)http://creativecommons.org/licenses/by/3.0/us/

Vocabulary index

VocabularyNamespace URIPrefixSection
Access Control ontologyhttp://www.w3.org/ns/auth/aclacl:Describing conditionally-accessible resources
Bibliographic Ontologyhttp://purl.org/ontology/bibo/bibo:Describing creative works
Basic geo vocabularyhttp://www.w3.org/2003/01/geo/wgs84_pos#geo:Describing places
Creative Commons Rights Expression Languagehttp://creativecommons.org/ns#cc:Metadata describing rights and licensing
CIDOC CRMhttp://www.cidoc-crm.org/cidoc-crm/crm:Describing physical things
DCMI Metadata Termshttp://purl.org/dc/terms/dct:Common metadata, Metadata describing rights and licensing, Collections and data-sets
DCMI Typeshttp://purl.org/dc/dcmitype/dcmit:Metadata describing documents, Collections and data-sets
Event ontologyhttp://purl.org/NET/c4dm/event.owl#event:Describing events
FOAFhttp://xmlns.com/foaf/0.1/foaf:Common metadata, Describing digital assets
FRBR Corehttp://purl.org/vocab/frbr/core#frbr:Describing creative works
GeoNames Ontologyhttp://www.geonames.org/ontology#gn:Describing places
Media RSShttp://search.yahoo.com/mrss/mrss:Publishing digital media, Describing digital assets
Media typeshttp://purl.org/NET/mediatypes/mime:Describing digital assets
ODRL 2.0http://www.w3.org/ns/odrl/2/odrl:Metadata describing rights and licensing
OpenSearchhttp://a9.com/-/spec/opensearch/1.1/osd:The RES API: the index and how it’s structured
OWLhttp://www.w3.org/2002/07/owl#owl:The RES API: the index and how it’s structured, Referencing alternative identifiers: expressing equivalence
Programmes Ontologyhttp://purl.org/ontology/po/po:Describing creative works
RDF schemahttp://www.w3.org/2000/01/rdf-schema#rdfs:The RES API: the index and how it’s structured, Common metadata
RDF syntaxhttp://www.w3.org/1999/02/22-rdf-syntax-ns#rdf:The RES API: the index and how it’s structured, Common metadata
SKOShttp://www.w3.org/2008/05/skos#skos:Describing concepts and taxonomies
VoIDhttp://rdfs.org/ns/void#void:The RES API: the index and how it’s structured, Collections and data-sets
W3C formats registryhttp://www.w3.org/ns/formats/formats:The RES API: the index and how it’s structured, Metadata describing documents
XHTML Vocabularyhttp://www.w3.org/1999/xhtml/vocab#xhtml:The RES API: the index and how it’s structured

Class index

The following RDF classes are applied to entries in the RES index by the aggregator, based upon the class they are evaluated as belonging to:—

ClassDescriptionSection
foaf:AgentAgents (i.e., things operating on behalf of people or groups).Describing people, projects and organisations
dcmitype:CollectionCollectionsCollections and data-sets
skos:ConceptConceptsDescribing concepts and taxonomies
frbr:WorkCreative worksDescribing creative works
void:DatasetDatasetsCollections and data-sets
foaf:DocumentDigital assetsDescribing digital assets
event:EventEvents (time-spans)Describing events
foaf:OrganizationOrganizationsDescribing people, projects and organisations
foaf:PersonPeopleDescribing people, projects and organisations
crm:E18_Physical_ThingPhysical thingsDescribing physical things
geo:SpatialThingPlaces (locations)Describing places

Predicate index

This section lists the predicates which are specifically recognised by the RES aggregation engine, whether they are cached (against the original subject URI from the data in which they appear), and whether they can relayed in the composite entity generated by the aggregator.

PredicateEntity kindCached?Relayed?
rdf:typeAnyYesYes, but also mapped to pre-defined classes
rdfs:labelAnyYesYes
foaf:givenName and foaf:familyNamePeopleYesYes, as rdfs:label
foaf:nameAgentsYesYes, as rdfs:label
gn:namePlacesYesYes, as rdfs:label
gn:alternateNamePlacesYesYes, as rdfs:label
dct:title, dc:title, foaf:name, skos:prefLabelAnyYesYes, as rdfs:label
foaf:depictionAnyYesYes
crm:P138i_has_representationAnyYesYes, as foaf:depiction
dct:subjectCreative works, collections, digital assetsYesYes
geo:latPlacesYesYes
geo:longPlacesYesYes
dct:rights, dct:license, cc:licenseAnyYesNo
skos:inSchemeConceptsYesYes
skos:broaderConceptsYesYes
skos:narrowerConceptsYesYes