From WorldCat Developers' Network
Dewey Summaries as Linked Data
For some time, the Dewey team at OCLC has wanted to do something with Linked Data: that is, to apply Linked Data principles to parts of the Dewey Decimal Classification and present the data as a small “terminology service.” The service should respond to regular HTTP requests with either a machine- or a human-readable presentation of Dewey classes. There should be a URI (and, even better, a Web page that delivers a useful description) for every Dewey concept, not just single classes. The data should be presented in RDF using a vocabulary that is capable of handling rich semantic information while still being simple enough to allow user agents to explore the data on their own. The service should also accommodate more complex use cases by offering API-like access, at first primarily for queries. Finally, the data should be reusable by anyone for non-commercial purposes.
Implementing dewey.info as a Linked Data platform
Publishing a large controlled vocabulary as Linked Data can act as a force that drives development on several different fronts. A URI pattern has to be developed that supports the identification of several different kinds of entities that are part of the domain (background information can be found here and here). These URIs have to act as dereferenceable identifiers that deliver representations of the referenced resources in a RESTful manner. Compliance with httpRange-14 is only a start; the problems of using HTTP’s semantics the right way, serving up predictable representations while not overloading the protocol with meaning, are intricate and manifold. The related TAG issues (57, 62, 63) show how actively this area is being discussed in the Semantic Web and Web architecture communities.
SKOS is increasingly becoming the RDF vocabulary of choice for representing controlled vocabularies on the Web. SKOS is very much geared towards expressing thesaurus-like knowledge structures, so some difficult choices have to be made in order to model a classification system with SKOS. An early anatomy of modeling a Dewey class with SKOS was presented at last year’s ISKO-NA preconference.
And finally, because Linked Open Data are not really open without the appropriate license model, the data on dewey.info are available under a Creative Commons BY-NC-ND license. Licensing information is embedded in RDF and RDFa following the Creative Commons Rights Expression Language (ccREL) specification.
To test whether some of these goals can be achieved, the Dewey Summaries seemed to be a suitable data set to publish according to Linked Data principles. The latest version of the Summaries, i.e., the top three levels of DDC 22, has been available as a Web document for some time. To broaden the possible applications of what now essentially is just tag soup (in only one language), every class had to be identified with a URI and the data had to be presented in a reusable way.
Dewey URIs
The suggested URI pattern is based on the distinction between non-information resources (abstract or concrete real-world objects), generic resources, and their representations. Generic resources are Web documents describing the abstract object they are associated with (e.g., a Dewey class). The pattern for generic documents defined in the current pool of Dewey URIs is as follows:
http://dewey.info/{object-collection}/{object}/{snapshot-collection}/{snapshot}/about
Specific documents have a variable resource name component and allow specification of content language and type (format):
http://dewey.info/{object-collection}/{object}/{snapshot-collection}/{snapshot}/{resource-name}.{language}.{content-type}
An object is a member of the DDC domain and part of an object collection. The object collection specifies the type of the object. The object collection is a mandatory component and can have one of the values “scheme,” “table,” “class,” “manual,” “index,” “summary,” and “id.” A specific object from that collection follows if required. Some examples:
http://dewey.info/class/576.83/
http://dewey.info/scheme/
http://dewey.info/table/2/
A snapshot is used to refer to versions of objects at specific points in time. Snapshots can be part of a snapshot collection, e.g., “e22,” referring to every concept version that is part of Edition 22 of the DDC.
http://dewey.info/class/2--74-2--79/e22/
http://dewey.info/class/2--74-2--79/2009/01/
http://dewey.info/class/2--74-2--79/e22/2009/01/
Both snapshot collection and snapshot are optional components.
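The pattern described above can be sketched as a small helper that assembles a URI from its components. This is only an illustration of the pattern as stated here, not code used by dewey.info; the function name `build_dewey_uri` and its parameter names are hypothetical, and the rule that a language or content type requires a resource name is an assumption read off the second pattern.

```python
BASE = "http://dewey.info"

# Values for the object collection as listed in the text above.
OBJECT_COLLECTIONS = {"scheme", "table", "class", "manual", "index", "summary", "id"}


def build_dewey_uri(object_collection, obj=None, snapshot_collection=None,
                    snapshot=None, resource_name=None, language=None,
                    content_type=None):
    """Assemble a Dewey URI; only the object collection is mandatory."""
    if object_collection not in OBJECT_COLLECTIONS:
        raise ValueError(f"unknown object collection: {object_collection!r}")
    if (language or content_type) and not resource_name:
        # Assumption: suffixes only make sense on a named resource.
        raise ValueError("language and content type require a resource name")

    # Keep only the path segments that are present, in the order of the pattern.
    segments = [object_collection, obj, snapshot_collection, snapshot]
    uri = BASE + "/" + "".join(f"{s}/" for s in segments if s)

    if resource_name:
        # Specific documents append language and content type as dot-suffixes.
        uri += ".".join(p for p in (resource_name, language, content_type) if p)
    return uri


if __name__ == "__main__":
    print(build_dewey_uri("class", "576.83"))
    print(build_dewey_uri("class", "2--74-2--79", "e22", "2009/01"))
```

Passing the snapshot as a single string such as "2009/01" keeps the sketch short; a fuller version might accept year and month separately.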
Resource names specify the content of the information resource that is returned to a user agent. As shown above, “about” is used for a generic description of the object. Other examples include
http://dewey.info/class/576.83/ancestors
http://dewey.info/class/576.83/children
for the complete upward hierarchy or the immediate children, respectively, of a specific class. The resource
http://dewey.info/class/572/history.rdf
provides a view (in RDF) of changes that happened to this concept over time. A resource like
http://dewey.info/scheme/2008/07/updates.de.atom
might provide an Atom feed of updates to the German version of the DDC that were published in July 2008.
Many URIs may reference the same resource in different ways, so a system to manage URI aliases will be necessary for services that implement this scheme.
Dewey URIs in dewey.info
The explanation above represents to some degree a conceptual analysis of the DDC domain. Currently, dewey.info only implements a very small part of that pattern. The idea is to start simple and to implement an increasingly fuller view of the DDC as we receive feedback from technical and non-technical Dewey users, translators, and other members of the community. The suggested URI pattern might change along with the reserved values for specific elements. It is an experiment, after all.
The current service tries to implement Recipe 6 of the Best Practice Recipes for Publishing RDF Vocabularies. It uses PHP in combination with the superb ARC RDF framework by Benjamin Nowack. Content is served up in HTML (XHTML+RDFa) and RDF in three different serializations (RDF/XML, Turtle, and JSON). The service also provides a SPARQL endpoint (very easy to set up with ARC). The endpoint supports the SPARQL protocol and generates query results in different serializations.
At the moment, dewey.info only supports the “class” object collection. Some examples: http://dewey.info/class/641/ stands for class “641” in the DDC and redirects a user agent accepting HTML or XHTML (issuing a 303 See Other) automatically to HTML representations of all available versions of this class in all available languages (http://dewey.info/class/641/about). The “/about” part indicates that this URL stands for a general description of the abstract concept (i.e., Dewey class 641), not the concept itself, in compliance with Semantic Web standards. The concept itself, as an abstract thing or idea, does not have a representation that can be sent over the Web, so the Web server points the user agent to a place on the Web where a description of that thing can be found.
The specific format of this description is negotiated in the background by the user agent and the server. While a regular Web browser like Opera or Firefox should be served an HTML version of the page, a Linked Data browser like Zitgist would be presented with an RDF (Resource Description Framework) version of the data that it then uses to construct its own view.
The Content-Location header field of the server response contains the URI where this specific information resource can be found directly (bypassing content negotiation), e.g. http://dewey.info/class/641/about.html or http://dewey.info/class/641/about.rdf.
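The redirect and negotiation steps described above can be simulated in a few lines. The following is a rough sketch under stated assumptions, not the PHP code behind dewey.info: the function names, the media-type table, and the 406 fallback are all hypothetical, wildcards such as `*/*` are not handled, and only the two cases discussed above (a concept URI ending in "/" and a generic document ending in "/about") are covered.

```python
# Suffixes come from the examples in the text; the media-type mapping is an assumption.
SUFFIX_BY_MEDIA_TYPE = {
    "text/html": "html",
    "application/xhtml+xml": "html",
    "application/rdf+xml": "rdf",
}


def parse_accept(header):
    """Return the media types of an Accept header, highest q-value first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if media_type and q > 0:
            # Negative q sorts high preferences first; position breaks ties.
            ranked.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def respond(path, accept="text/html"):
    """Simulate the server's answer as a (status, headers) tuple."""
    if path.endswith("/"):
        # The concept itself has no representation: point to its description.
        return 303, {"Location": path + "about"}
    if path.endswith("/about"):
        for media_type in parse_accept(accept):
            suffix = SUFFIX_BY_MEDIA_TYPE.get(media_type)
            if suffix:
                return 200, {"Content-Location": f"{path}.{suffix}"}
        return 406, {}
    raise ValueError("only concept URIs and /about documents are covered here")


if __name__ == "__main__":
    print(respond("/class/641/"))
    print(respond("/class/641/about", "application/rdf+xml"))
```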
Languages are handled in a similar way, since the language is treated as part of the representation, just like the content type. Appending a language tag to the URI of the generic resource (ending in “/about”) narrows the result down to versions in a specific language: http://dewey.info/class/641/about.fr. (The HTML view of a single class for which other languages are available also includes links to these versions.) Content negotiation can of course still be bypassed by specifying the desired format directly: http://dewey.info/class/641/about.fr.rdf.
Finally, the service offers the possibility of specifying the date of the version that should be identified or retrieved. Since only two different “timeslices” are present at the moment, the utility of this feature will likely become more apparent as updates are added to the service. In addition, while the original plan for Dewey URIs calls for more precision in specifying the “timeslice” of a version, down to minutes and seconds, only year and month are supported at this time. When a year and/or month is specified in the URI (http://dewey.info/class/641/2009/08/), the service only shows concepts from that period of time, in this case, August 2009. A combination of all those elements results in a fairly complete description of a Dewey class: http://dewey.info/class/641/2009/08/about.ar.html.
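To make the structure of such a combined URI concrete, here is a hedged sketch that splits it back into its parts. The regular expression and the name `parse_dewey_uri` are hypothetical and cover only what is described as implemented: the “class” collection, a year with an optional month, and a resource name with optional suffixes. Telling a language from a format by length (two letters versus three or more) is an assumption that merely fits the examples in this section.

```python
import re

# Hypothetical pattern for the currently implemented subset of Dewey URIs.
DEWEY_URI = re.compile(
    r"^http://dewey\.info/class/(?P<notation>[^/]+)/"
    r"(?:(?P<year>\d{4})/(?:(?P<month>\d{2})/)?)?"   # optional timeslice
    r"(?:(?P<resource>[a-z]+)"                       # e.g. "about"
    r"(?:\.(?P<language>[a-z]{2}))?"                 # e.g. "fr", "ar"
    r"(?:\.(?P<format>[a-z]{3,}))?)?$"               # e.g. "html", "rdf"
)


def parse_dewey_uri(uri):
    """Return the components present in a class URI as a dict of strings."""
    match = DEWEY_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised class URI: {uri!r}")
    # Drop the optional groups that did not take part in the match.
    return {k: v for k, v in match.groupdict().items() if v is not None}


if __name__ == "__main__":
    print(parse_dewey_uri("http://dewey.info/class/641/2009/08/about.ar.html"))
```

A month without a year, longer language tags, and the other object collections are deliberately left out of this sketch.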
It has already been mentioned that the HTML version of a class is annotated with ccREL licensing information using RDFa. But in fact, the entire displayed content is annotated with RDFa attributes, so that an HTML user agent might still be able to interact with the underlying semantics of the data (e.g., end-user facing tools like a semantic clipboard). Note that the “fat RDF” triples are not necessarily in sync with the distilled RDFa content (compare the RDF representation of 640 with its distilled XHTML+RDFa representation generated using pyRdfa).
The SPARQL endpoint opens up additional possibilities for developers and user agents alike to explore the data model and also download chunks of the data. The following query lists all properties (more precisely, all predicates) that were used in the store, grouping them by frequency (using the widely implemented COUNT aggregate function that presumably will be in the next version of the SPARQL specification).
SELECT ?p COUNT(?p) AS ?pc
WHERE {
  ?s ?p ?o .
}
GROUP BY ?p
ORDER BY DESC(?pc)
Currently, there is no direct way to retrieve the ten main classes (or 100 divisions) of the DDC Summaries with a single URI. But with SPARQL, such a query (in this example for the main classes in French) is very easy to write:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?notation ?prefLabel
WHERE {
  _:s skos:topConceptOf _:o ;
      skos:prefLabel ?prefLabel ;
      skos:notation ?notation .
  FILTER langMatches( lang(?prefLabel), "fr" )
} ORDER BY ASC(?notation)
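Because the endpoint supports the SPARQL protocol, a query like the one above can be sent as the `query` parameter of an HTTP GET request. The sketch below only builds such a request URL and does not contact any server. Note that the endpoint address is not given in this article, so `ENDPOINT` is a placeholder assumption, and `main_classes_request` is a hypothetical helper name.

```python
import re
from urllib.parse import urlencode

# ASSUMPTION: placeholder address; the real endpoint URL is not stated in the text.
ENDPOINT = "http://dewey.info/sparql"

# The main-classes query from above, with the language tag left as a parameter.
MAIN_CLASSES_QUERY = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?notation ?prefLabel
WHERE {{
  _:s skos:topConceptOf _:o ;
      skos:prefLabel ?prefLabel ;
      skos:notation ?notation .
  FILTER langMatches( lang(?prefLabel), "{language}" )
}} ORDER BY ASC(?notation)
"""

# Only plain language tags are accepted, so nothing else can leak into the query.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def main_classes_request(language="fr"):
    """Return a GET URL carrying the query in the protocol's `query` parameter."""
    if not LANGUAGE_TAG.fullmatch(language):
        raise ValueError(f"not a language tag: {language!r}")
    query = MAIN_CLASSES_QUERY.format(language=language)
    return ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(main_classes_request("fr"))
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, once the real endpoint address is substituted.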
Limitations and future plans
Dewey.info is first of all a place for experimentation, a place to start a dialogue and an exchange of ideas with the Linked Data community. Second, dewey.info should be a platform for Dewey data on the (Semantic) Web. The DDC Summaries may not be the most challenging or complex data set to be published in this manner, but more is to come in terms of languages, deeper data, and links to other data sets. We appreciate any comments or suggestions.
Some next steps will likely include the integration of several more languages and links to mapped vocabularies like the LCSH (using http://id.loc.gov/authorities/).
