Saturday, April 11, 2009

LSIDs, HTTP URI, Linked Data, and bioGUID

The LSID discussion has flared up (again) on the TDWG mailing lists. This discussion keeps coming around (I've touched on it here and here); this time it was sparked by the LSID SourceForge site being broken (the part where you get the code is OK). Some of the issues being raised include:
  • Nobody uses LSIDs except the biodiversity informatics crowd; have we missed something?
  • LSIDs don't play nice with the Linked Data/Semantic Web world, which is much bigger than us
  • If we adopt HTTP URIs, will this send the wrong message to data providers (LSIDs imply a commitment to persistence, URLs don't)?
  • The community has invested a lot in LSIDs; it's too late to change course now
There are other issues as well, in many ways much harder, namely how to ensure adoption and long-term persistence of whatever identifier technology the techies agree on.

I've been twittering (@rdmpage) about some of this, and Pierre Lindenbaum blogged about my earlier paper on testing LSIDs (doi:10.1186/1751-0473-3-2), so I decided to return to one of the original goals of my bioGUID project, namely providing a tool to resolve existing identifiers in a consistent way (see the now moribund bioGUID blog; I now blog about bioGUID here on iPhylo). One of the goals of bioGUID was to take an identifier and return RDF. I also had an underlying triple store that was populated with this RDF. After a hardware crash I took the opportunity to rebuild bioGUID from scratch, focussing on OpenURL access to literature. Now, I'm looking at LSIDs again.

The standard response to the concern that the rest of the world has gone down the HTTP URI route is to say that we can stick an HTTP proxy on the front of the LSID (e.g., http://lsid.tdwg.org/urn:lsid:indexfungorum.org:names:21364) and play ball with the Linked Data crowd, who are rapidly linking diverse data sets together.

However, sticking an HTTP proxy on an LSID isn't enough. As outlined in the document Cool URIs for the Semantic Web, we need a way of distinguishing between an HTTP URI that identifies real-world objects or concepts (such as a person or a car), and documents describing those things (put another way, if I put an HTTP URI for Angelina Jolie into a web browser, I expect to get a document describing her, not Ms Jolie herself). One solution (and the one that is gaining traction) is to use 303 redirects to make this explicit:

A client resolving a URI for a thing will get a 303 (See Other) status code, telling it that the URI identifies an object rather than a document, and redirecting it to a document that describes the object. Which representation the client ends up with is decided by content negotiation (a web browser wants HTML, a linked data browser wants RDF).
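
As a rough illustration of that flow, here is a minimal Python sketch of the decision a server makes when it is asked for the URI of a thing. The function names and the `.rdf`/`.html` document suffixes are assumptions made up for the example, not how bioGUID or any other proxy actually lays out its URLs, and the Accept parsing deliberately ignores q-values.

```python
from typing import Dict, Tuple

# Hypothetical mapping from media type to the suffix of the describing document.
DOCUMENT_SUFFIX = {
    "application/rdf+xml": ".rdf",
    "text/html": ".html",
}


def pick_media_type(accept_header: str) -> str:
    """Return the first supported media type listed in the Accept header.

    Simplification: q-values are ignored and anything unrecognised
    (including */*) falls back to HTML.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in DOCUMENT_SUFFIX:
            return media_type
    return "text/html"


def respond_to_thing_uri(thing_path: str, accept_header: str) -> Tuple[int, Dict[str, str]]:
    """Answer a request for a thing's URI with a 303 pointing at a document."""
    media_type = pick_media_type(accept_header)
    location = thing_path + DOCUMENT_SUFFIX[media_type]
    # 303 See Other: the thing itself can't be sent, but a description can.
    return 303, {"Location": location}
```

A client sending `Accept: application/rdf+xml` is redirected to the `.rdf` document, while a browser-style header that leads with `text/html` is sent to the `.html` page. In both cases the status code is 303, never 200, because the original URI names the thing and not a document.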

So, in order to get LSIDs to play ball with Linked Data we need an HTTP proxy that supports 303 redirects (as Roger Hyam pointed out). I've implemented a simple one as part of bioGUID. If you append an LSID to http://bioguid.info/ you get an HTTP URI that passes the Vapour Linked Data validator tests. For example, http://bioguid.info/urn:lsid:indexfungorum.org:names:21364 resolves to a web page in a browser, but clients that ask for RDF will get RDF instead. You can see the steps involved in resolving this Cool URI here. Vapour provides a nice graphical overview of the process.



The TDWG LSID proxy doesn't pass the Vapour tests, so this is something that should be addressed.
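
From the client's side, using either proxy comes down to prefixing the LSID with the proxy address and asking for RDF. The sketch below shows one way to do that in Python; the helper names are my own invention for the example, and the only check it makes is that the LSID has the usual `urn:lsid:authority:namespace:object[:revision]` shape. It builds the request without sending it.

```python
import urllib.request

# The two proxy addresses mentioned in this post.
BIOGUID_PROXY = "http://bioguid.info/"
TDWG_PROXY = "http://lsid.tdwg.org/"


def lsid_to_http_uri(lsid: str, proxy: str = BIOGUID_PROXY) -> str:
    """Prefix an LSID with a proxy address to get an HTTP URI."""
    parts = lsid.split(":")
    # Expect urn:lsid:authority:namespace:object, optionally followed by :revision.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    if not all(parts[2:5]):
        raise ValueError(f"LSID has an empty component: {lsid!r}")
    return proxy + lsid


def rdf_request(lsid: str, proxy: str = BIOGUID_PROXY) -> urllib.request.Request:
    """Build (but do not send) a request that asks the proxy for RDF."""
    return urllib.request.Request(
        lsid_to_http_uri(lsid, proxy),
        headers={"Accept": "application/rdf+xml"},
    )
```

Passing the result to `urllib.request.urlopen` would send it, following any 303 redirect along the way. Whether either proxy still answers is not something this sketch checks.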

In addition to resolving LSIDs, my service stores the resulting RDF in a triple store using ARC, and you can query this triple store using a SPARQL interface that makes use of Danny Ayers' JavaScript SPARQL editor. I've a serious case of déjà vu, as I've implemented this feature several times before using 3store3 (usually after much fun getting it to work). I got bored with triple stores as the bigger problem seemed to be the errors in the metadata I was harvesting, which seriously limited my ability to link different objects together (but that's another story).
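
To give a flavour of what querying such a store looks like, here is a small Python sketch that builds a SPARQL query for everything said about one resource and encodes it as a GET request in the way the SPARQL protocol expects (a `query` parameter). The endpoint address is a placeholder, not the real bioGUID one, and the sketch assumes the LSID itself is the subject of the stored triples, which may not match how the store is actually populated.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only; substitute the real SPARQL address.
EXAMPLE_ENDPOINT = "http://example.org/sparql"


def triples_about(subject_uri: str, limit: int = 25) -> str:
    """Build a SPARQL query listing predicates and objects for one subject."""
    # Refuse characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in subject_uri for ch in '<>"{}|^`\\') or any(ch.isspace() for ch in subject_uri):
        raise ValueError(f"not usable as an IRI: {subject_uri!r}")
    return f"SELECT ?p ?o WHERE {{ <{subject_uri}> ?p ?o }} LIMIT {int(limit)}"


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL protocol GET URL."""
    return endpoint + "?" + urlencode({"query": query})
```

Calling `sparql_get_url(EXAMPLE_ENDPOINT, triples_about("urn:lsid:indexfungorum.org:names:21364"))` gives a URL that could be pasted into a browser or fetched by a script, assuming the endpoint accepts GET requests.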
