Wednesday, April 15, 2009

LSIDs, to proxy or not to proxy?

The LSID discussion rumbles on (see my earlier post). One issue that has re-emerged is the use of HTTP proxies in RDF documents. In a recent email Greg Whitbread wrote:

The existing TDWG recommendation that "5. All references to LSIDs within RDF documents should use the proxified form", basically states that LSID will never appear in any way other than bundled into an http URI - if we are also to publish data as RDF.

That sounds as if it means that those wanting to use LSID resolution will first have to extract the LSID part from the http URI which will now appear everywhere we would expect to find our unique identifier.

Donald [Hobern] has presented a strong case for unique identifiers conforming to the LSID specification but we have now an equally strong case that in its http form our identifier must behave as a dereferenceable URN per W3C linked data recommendations.
My own view is that the RDF should always contain a canonical, un-proxied version of an identifier (whether LSID or DOI), because:
  1. having only the proxied version assumes that there is only one suitable proxy (there may be multiple ones)
  2. it assumes that the specified proxy will always exist (our track record in durable HTTP services is poor)
  3. it assumes the specified proxy will always conform to current standards
  4. it imposes an overhead on clients that want the canonical identifier (i.e., they have to strip away the proxy)
I predict that for any meaningful, successful (read "actually used") identifier there will be multiple services that will be capable of consuming that identifier, not just HTTP proxies. DOIs can be proxied (by several servers, including http://dx.doi.org/ and http://hdl.handle.net ), resolved using OpenURL resolvers, etc.
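
To make point 4 and the "multiple proxies" argument concrete, here is a minimal sketch of what a client has to do when it is handed only a proxied identifier, and how easy the reverse direction is when it has the canonical one. The proxy URLs and function names are hypothetical, chosen purely for illustration, and the regular expression is a simplification rather than a full implementation of the LSID syntax.

```python
import re
from typing import List, Optional

# Simplified pattern: an LSID runs until whitespace or a URL delimiter.
# This is an assumption for illustration, not the full LSID grammar.
LSID_RE = re.compile(r"urn:lsid:[^\s/?#]+", re.IGNORECASE)

# Hypothetical proxies; any number of services could play this role.
PROXIES = [
    "http://proxy-a.example.org/",
    "http://proxy-b.example.net/resolve/",
]


def extract_lsid(uri: str) -> Optional[str]:
    """Strip away whatever proxy prefix is present and return the bare LSID.

    Returns None if the string contains no LSID.
    """
    match = LSID_RE.search(uri)
    return match.group(0) if match else None


def proxify(lsid: str, proxy: str) -> str:
    """Prefix a canonical LSID with the given HTTP proxy."""
    return proxy.rstrip("/") + "/" + lsid


def all_proxied_forms(lsid: str, proxies: List[str] = PROXIES) -> List[str]:
    """Build one dereferenceable URI per known proxy from a single LSID."""
    return [proxify(lsid, proxy) for proxy in proxies]


if __name__ == "__main__":
    lsid = "urn:lsid:zoobank.org:pub:2C6BD020-B54A-4119-9693-3231C9FCEFA6"
    # Going from canonical to proxied is trivial, and works for any proxy.
    for uri in all_proxied_forms(lsid):
        print(uri)
    # Going from proxied back to canonical needs parsing on the client side.
    print(extract_lsid(proxify(lsid, PROXIES[0])))
```

If the canonical LSID is in the RDF, a client can pick whichever proxy it likes with `proxify`. If only the proxied form is there, every client needs something like `extract_lsid` before it can do anything else.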

In order to play ball with Linked Data, there are several ways forward:
  1. always refer to LSIDs in their proxied form (see above for reasons why this might not be a good idea)
  2. ensure that at least one proxy exists which can resolve LSIDs in a linked data friendly way (see bioGUID as an example)
  3. use or develop linked data clients that understand LSIDs (e.g., http://linkeddata.uriburner.com/, see this view of urn:lsid:zoobank.org:pub:2C6BD020-B54A-4119-9693-3231C9FCEFA6)
Options 2 and 3 already exist, so I'm not so keen on option 1.

For me this is one of the biggest hurdles to using HTTP URIs as identifiers: I have to choose one. As an analogy, I can identify a book using an ISBN (say, 0226644677). How do I represent this in RDF? Well, I could use an HTTP URI, say http://www.amazon.com/Tangled-Trees-Phylogeny-Cospeciation-Coevolution/dp/0226644677/ , or maybe http://www.worldcat.org/isbn/0226644677. There are many, many I could choose from. However, so long as I know that the ISBN is 0226644677, I'm free to use whatever URI best suits my needs. So, what I really want is the ISBN by itself.

Imagine, for example, a publisher such as PLoS or Magnolia Press (publisher of Zootaxa), both of which have recently published taxonomic papers containing LSIDs (e.g., doi:10.1371/journal.pone.0001787). They might want to display LSIDs linked to their own LSID resolver that embellishes the metadata with information they have (e.g., they might wish to highlight links to other content that they host). In a sense this is much the same idea as that supported by OpenURL COinS, where OpenURL-format metadata is embedded in an HTML document and the user chooses which resolver to use to resolve the links (including tools such as Zotero).

Having LSIDs prefixed with an HTTP proxy makes these tasks a little harder.
