Monday, February 27, 2006

Interval queries


VLDB 2005 has some interesting papers. One which caught my eye is Efficiently Processing Queries on Interval-and-Value Tuples in Relational Databases [PDF] by Enderle et al. Why? Because with all the effort systematists are placing on estimating divergence times, it would be nice if a phylogeny database could recover phylogenies by time slices (as well as by taxa, geography, data type, author, etc.). Imagine being able to find phylogenies with divergence dates that spanned the K/T boundary, for example.
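To make the time-slice idea concrete, here is a minimal sketch of what such a query might look like. It is not the indexing technique from the Enderle et al. paper, just a plain range predicate in SQLite. The table layout, the column names, the tree identifiers, and the age values are all invented for illustration, and the boundary age is only approximate.

```python
import sqlite3

# Approximate age of the K/T boundary in millions of years (an assumption
# for this sketch, not a value taken from any database).
KT_BOUNDARY_MYA = 66.0


def build_demo_db():
    """Create an in-memory table of divergence-time intervals (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE divergence ("
        " tree_id TEXT, node_label TEXT,"
        " min_age_mya REAL, max_age_mya REAL)"
    )
    rows = [
        ("tree_A", "node_1", 55.0, 80.0),  # interval straddles the boundary
        ("tree_B", "node_1", 10.0, 25.0),  # entirely younger
        ("tree_C", "node_7", 70.0, 95.0),  # entirely older
        ("tree_D", "node_2", 66.0, 66.0),  # point estimate on the boundary
    ]
    conn.executemany("INSERT INTO divergence VALUES (?, ?, ?, ?)", rows)
    return conn


def trees_spanning(conn, age_mya):
    """Trees with at least one divergence interval containing age_mya."""
    cur = conn.execute(
        "SELECT DISTINCT tree_id FROM divergence"
        " WHERE min_age_mya <= ? AND max_age_mya >= ?"
        " ORDER BY tree_id",
        (age_mya, age_mya),
    )
    return [row[0] for row in cur.fetchall()]


def trees_overlapping(conn, younger_mya, older_mya):
    """Trees with a divergence interval overlapping [younger_mya, older_mya]."""
    cur = conn.execute(
        "SELECT DISTINCT tree_id FROM divergence"
        " WHERE min_age_mya <= ? AND max_age_mya >= ?"
        " ORDER BY tree_id",
        (older_mya, younger_mya),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    db = build_demo_db()
    print(trees_spanning(db, KT_BOUNDARY_MYA))
    print(trees_overlapping(db, 60.0, 72.0))
```

Two intervals overlap when each starts before the other ends, which is all the second query tests. On a large table this predicate is hard to index well with ordinary B-trees, which is presumably why papers like this one exist.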

Document icons


Amber Frid-Jimenez has a really nice page showing document icons that reflect the words contained in the document. Neat idea, and has a lot of potential. One obvious extension to phylogenetics would be to represent the taxonomic coverage in a similar style, so people could very quickly find studies on related sets of organisms simply by browsing.



(Via information aesthetics.)

Wednesday, February 15, 2006

LSID Firefox extension update

I've updated my extension to resolve LSIDs in Firefox so that it works with version 1.5.0.1 (the most recent version of Firefox). The extension is available from Mozdev. It may take a little while for the mirrors to update with the new version, so if you get a "404" when trying to download, you may need to come back later.

IBM's LSID project have their own Firefox extension, LSID Launchpad for Firefox, which is a lot slicker than mine. It's in beta, but well worth a look.



In case you're wondering, these extensions enable you to browse LSIDs as if they were URLs. For example, with the extension installed, an LSID such as urn:lsid:ubio.org.lsid.zoology.gla.ac.uk:namebankID:10386 becomes clickable, and you can see the metadata associated with the LSID. For the technically minded, they add support for the lsidres protocol to Firefox.
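For anyone curious about what is inside one of these identifiers, here is a small sketch that splits an LSID into its parts. It assumes the usual layout of urn:lsid:authority:namespace:object with an optional trailing revision. The function and class names are my own, and the sketch does no resolution at all; it only parses the string.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The parts of a parsed LSID (field names chosen for this sketch)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts.

    Raises ValueError if the string does not have that shape.
    """
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    # The 'urn' and 'lsid' prefixes are compared case-insensitively here.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % text)
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError("empty component in %r" % text)
    # Treat a missing or empty sixth part as 'no revision'.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:ubio.org.lsid.zoology.gla.ac.uk:namebankID:10386"))
```

For the example above, the authority is ubio.org.lsid.zoology.gla.ac.uk, the namespace is namebankID, and the object is 10386.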

Sunday, February 12, 2006

Rob McCool on Rethinking the Semantic Web

Having read Rob McCool's articles on Rethinking the Semantic Web (brought to my attention by Bob McMorris' comment on my earlier post on globally unique identifiers), I think he makes very interesting points, but they are not all relevant to whether biodiversity informatics adopts RDF.

In terms of whether the dream of the Semantic Web will happen, I suspect he is right - technologies such as tags and microformats will be a lot easier to adopt, and will make more effective use of existing tools. I'm not writing the Semantic Web off, but McCool's point about keeping things very simple is, I think, on the money.

Much of the work on RDF and the Semantic Web has been done in academia, and most examples concern things such as relationships between people and projects (typically computer science projects in, you guessed it, the Semantic Web). Within a small academic community there is often a small problem scope, consistent vocabulary (or at least, it is tractable to develop either a vocabulary or a mapping between vocabularies), obvious identifiers, experience with ontologies, and a limited set of problems. My sense is that biodiversity informatics fits this model. If the goal is to integrate databases of taxonomic names, specimens, images, character data, DNA sequences, and publications, and make inferences based on this aggregation of information, then I feel the use of Semantic Web techniques will be quite tractable, indeed productive.

In the same way, much of the scepticism about whether ontologies are actually useful in the real world (see Clay Shirky's brilliant Ontology is Overrated -- Categories, Links, and Tags, or listen to an MP3) is probably well founded. Again, I think the issue is one of scope. Biologists are used to ontologies. After all, what is taxonomic classification but a large ontology with well developed rules for its construction and maintenance?

That said, there are areas in our field where insistence on RDF, controlled vocabularies, and ontologies will probably be counterproductive. Ontologies for morphological characters will, I suspect, prove hard to sell. Even though we have a history of shared terminology (think of papers establishing consistent numbering schemes for setae on insect heads), these shared vocabularies tend to have limited applicability unless they are very general (matching setae on the head of a fly and a louse is tricky), and if they are general (e.g., "legs") they are very low level. There is also the thorny issue that many aspects of morphology are not homologous in evolutionary terms (in what sense are the wings of a fly and a bird both "wings"?). Leaving aside the conceptual issues, this is one area where I think people will balk if it becomes a pain to use ontologies. It's hard enough getting people to use scientific names (never mind remembering that species names such as Homo sapiens should be written in italics). I suspect this is one area (along with scientific literature) where tagging will be a compelling alternative. For an example of the power of tagging literature see Connotea.


McCool's articles are available here:

Thursday, February 09, 2006

Globally Unique Identifiers

I attended the TDWG-GUID workshop on Globally Unique Identifiers (GUIDs) held at NESCent, which has issued a report. Essentially, the aim of this work is to deploy globally unique identifiers for digital objects in biodiversity informatics, such as taxon names, specimen records, images, etc. The workshop settled on LSIDs (Life Science Identifiers), which is a sensible choice.

LSIDs have been around, and there is considerable software support from IBM (see their project on SourceForge). I've used them in my Taxonomic Search Engine. Not everybody is thrilled by LSIDs (see Anyone using LSID? on NodalPoint).

DOIs and Handles were also considered. I have flirted with handles (see my comments on the iSpecies blog). DOIs have some useful properties, especially stable infrastructure, management tools, and immediate utility by the publishing industry, although they are not cheap. George Garrity uses them in his NamesforLife© project (doi:10.1601/tx.0). Long term, the biodiversity community might benefit from thinking seriously about this. The German Science Foundation has invested in providing free DOIs to the German scientific community (see Publication and Citation of Scientific Primary Data). There's also a certain irony in a blog posting talking about GUIDs and rejecting DOIs, when every reference to an external publication is made using, you guessed it, a DOI.

Regarding the workshop itself, at times I wanted to gnaw off parts of my body to retain sanity. As a result I was pretty obnoxious. My frustration stemmed partly from a feeling that the TDWG community seems determined to make life hard for themselves by placing obstacles in their path whenever possible. They've also a lot of investment in XML schema, which I regard as misguided (that's being polite). Anybody who thinks XML schema are the answer to our problems should read "From XML to RDF: how semantic web technologies will change the design of 'omic' standards" (doi:10.1038/nbt1139). I nearly lost it when there was discussion of adopting LSIDs but serving the metadata in XML schema. This defeats the whole point of LSIDs. By serving RDF, we can do inference; in particular, we can easily aggregate RDF into triple stores. Populating a database becomes as easy as resolving the LSID and sucking down the metadata. Consequently, data integration suddenly looks a lot more tractable. Indeed, from the perspective of RDF, LSIDs are just another Uniform Resource Identifier (URI), albeit one which consistently resolves to RDF.
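To give a flavour of why aggregation is so easy with triples, here is a toy sketch of a triple store in plain Python. It is not a real RDF library and it does not resolve anything. The three "sources", the LSIDs, and the predicate names (ex:identifiedAs, ex:depicts) are all invented for illustration. The point is only that metadata from different providers can be poured into one pile and then queried across providers, because everything is a (subject, predicate, object) statement keyed on shared identifiers.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()  # a set, so duplicate statements merge

    def add_all(self, triples):
        """Aggregate triples from any source into the same store."""
        self._triples.update(triples)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


# Hypothetical metadata from three independent providers (all made up).
NAME = "urn:lsid:example.org:names:1"
SPECIMEN = "urn:lsid:example.net:specimens:42"
IMAGE = "urn:lsid:example.com:images:7"

name_source = [(NAME, "dc:title", "Aus bus")]
specimen_source = [(SPECIMEN, "ex:identifiedAs", NAME)]
image_source = [(IMAGE, "ex:depicts", SPECIMEN)]


def images_for_name(store, name_lsid):
    """Follow name <- specimen <- image links across the merged sources."""
    specimens = [s for s, _, _ in store.match(p="ex:identifiedAs", o=name_lsid)]
    images = [
        s
        for specimen in specimens
        for s, _, _ in store.match(p="ex:depicts", o=specimen)
    ]
    return sorted(images)


if __name__ == "__main__":
    store = TripleStore()
    for source in (name_source, specimen_source, image_source):
        store.add_all(source)
    print(images_for_name(store, NAME))
```

None of the three sources knows about images of a name, yet the merged store can answer the question. With XML schema, each provider's document format would need its own parsing and mapping code before any such join was possible.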

As the workshop drew to a close, I began to feel that one reason people just didn't "get" LSIDs and RDF was that there were no really cool examples of what can be done with the technology. If you just look at RDF serialised as XML, then it's not obvious what the big deal is. So we serve a different form of XML, what's the big deal? This is a little like my first impression of XML -- it just seemed like a more fussy version of HTML, so what was all the hype about? Once you see the power of the tools associated with XML (such as the parsers, XSLT and XPath), then you see the point. It can make exchanging and processing data a lot easier, and XSLT style sheets are just way kewl. The difference between XML and RDF is of this order. So, what we need are some cool applications combining LSIDs, metadata, and triple stores to show people just why this is so much more powerful than the XML schema that have obsessed the TDWG community for so long.

Wednesday, February 08, 2006

Search result comparison



Yet more cool stuff from information aesthetics, this time a comparison of search results from Google and Yahoo. Given my own talk on Google versus Yahoo and the death of taxonomy?, this page certainly caught my eye.

Treemaps


I came across this version of treemaps a little while ago, but this post on information aesthetics reminds me to add this to the list of cool things that are worth thinking about when considering how to visualise the Tree of Life.

Isn't it gorgeous?

Tuesday, February 07, 2006

TreeBASE talk at CIPRES


On Saturday I gave a short invited talk at the CIPRES all-hands meeting in Austin. Not sure if I'll be invited back after this, but I think it was worth saying a few things that somebody, somewhere should be saying. You can grab the PowerPoint slides here. To give you a sense of the talk (and the style in which it was intended), here's the abstract:


The current TreeBASE is a black hole -- data disappears in and is difficult to extract again. Furthermore, no use is made of the wealth of information that could be linked to data in TreeBASE. The only external links TreeBASE contains are author email addresses. Yet, given a GenBank sequence or a paper title one can go to the Internet and readily extract information on genes, specimen localities, PubMed records, citation links, images, taxonomic authorities, etc. The search interface is limited, and locks users into primitive and often fruitless searches. TreeBASE is a walled garden in a time when the world is discovering data integration, federated searches, and "mashups." Designing new, improved (read bigger) relational database schema does nothing to address these issues. If the community wants a useful tool that tells us what we know (and what we don't know) about the tree of life, and enables the kind of integrated research that we systematists so often say is only possible with a phylogenetic underpinning, then I suggest we need something rather different. This talk will sketch some problems with TreeBASE, discuss some ideas relating to globally unique identifiers, metadata, inference, and the Semantic Web, and will end with the author running from the room hotly pursued by Bill Piel.


Apart from biological gripes, I was also a little surprised that some of the stuff from the early days of "phyloinformatics" wasn't being picked up on, especially the idea of a phylogenetic query language (e.g., Jamil et al., BIBE 2001 doi:10.1109/BIBE.2001.974405) (and, no, I don't think a CORBA wrapper constitutes a phylogenetic query language).