Saturday, December 23, 2006

Unique Author Identification | Public Library of Science

Very interesting post by Richard Cave on a PLoS blog concerning unique author identification. Here's the opening paragraph.

I [Richard Cave] recently gave a brief presentation at the yearly CrossRef member meeting on unique author identification in scientific publishing. I had gathered information for the presentation from speaking with PLoS staff and online articles, but didn’t put pen to paper until the night before the meeting. Given my procrastination and rambling presentation, I think that it’s a good idea to write down my notes so that they are more understandable.

There is also some interesting commentary.

Saturday, December 02, 2006

Folksonomies - why philosophy is a bad thing

The November 2006 issue of D-Lib magazine contains an article by Elaine Peterson entitled "Beneath the Metadata: Some Philosophical Problems with Folksonomy" (doi:10.1045/november2006-peterson). She writes:

The choice to use folksonomy for organizing information on the Internet is not a simple, straightforward decision, but one with important underlying philosophical issues. Although folksonomy advocates are beginning to correct some linguistic and cultural variations when applying tags, inconsistencies within the folksonomic classification scheme will always persist...Most information seekers want the most relevant hits when keying in a search query. Folksonomy is a scheme based on philosophical relativism, and therefore it will always include the failings of relativism. A traditional classification scheme will consistently provide better results to information seekers.

This article is one of the most irritating things I've read in a while, and as much as I like philosophy, it reinforces my prejudice that invoking philosophy is almost always a bad idea. Casting the discussion about folksonomy versus classification as a clash between "Aristotelian categories" and "philosophical relativism" just substitutes name calling for analysis, and the paper makes unsubstantiated claims such as "A traditional classification scheme based on Aristotelian categories yields search results that are more exact", and "A traditional classification scheme will consistently provide better results to information seekers." Er, how do we know this? Do we have data to support this? And, um, what classification scheme does Google use, exactly?

Now, I'm a fan of classifications, and would argue that biological taxonomy has one of the largest, most elaborate classifications that is actively used, complete with detailed rules governing its maintenance. Indeed, much of this iPhylo blog is about a project to add classification to a database (TreeBASE) that eschews classification (to its detriment). However, classification is problematic: there are competing classifications, and within biological taxonomy there is much discussion about how names relate to classifications (see earlier posts More on names (and frogs) and Synonomy and kinds of name). Despite being armed with one of the best developed classifications around, biologists also use informal names to refer to groups, partly because our knowledge of the real world changes, and hence our classifications change (but often lagging behind the latest research).

Classifications can also constrain the kinds of questions we can ask. For example, NCBI's classification of animals lacks the Ecdysozoa, a group whose existence is controversial, but I guess most zoologists would accept. Despite this broad acceptance, NCBI prevents users asking questions such as "how many sequences have been obtained from members of the Ecdysozoa?" To see this, try typing "Ecdysozoa" as a search term in the NCBI's Taxonomy Browser. If you want to ask this question, you need to construct a complex query that specifies all the groups belonging to the Ecdysozoa. This problem motivated a paper Gabriel Valiente and I wrote (doi:10.1186/1471-2105-6-208) that suggested using edit scripts to modify trees so that users can generate their preferred classification using the NCBI tree as a starting point. The other motivation was that the NCBI tree is continually growing as the NCBI database grows.
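To make the "complex query" concrete, here is a minimal Python sketch that ORs together one clause per member group. The list of phyla is my assumption about what belongs in the Ecdysozoa (membership is part of the controversy), and the function name is made up for illustration; the [Organism] field tag is the usual Entrez way of restricting a search to a taxon.

```python
# Phyla usually placed in Ecdysozoa (an assumption of this sketch, since the
# membership of the group is itself debated).
ECDYSOZOA = [
    "Arthropoda", "Nematoda", "Tardigrada", "Onychophora",
    "Nematomorpha", "Priapulida", "Kinorhyncha", "Loricifera",
]


def organism_query(taxa):
    """OR together one Entrez [Organism] clause per taxon name."""
    return " OR ".join('"%s"[Organism]' % name for name in taxa)


# The resulting string could be pasted into an Entrez search box.
print(organism_query(ECDYSOZOA))
```

The point of the sketch is that the user has to carry the classification around in their own code, which is exactly the burden a classification is supposed to remove.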

Given these issues, the flexibility of folksonomies may offer some advantages. Indeed, I think the notion of "tagging" may prove a useful way to think about taxonomic names. Guy and Tonkin's article "Folksonomies: Tidying up tags?" (doi:10.1045/january2006-guy) offers a rather more sensible perspective:

We agree with the premise that tags are no replacement for formal systems, but we see this as being the core quality that makes folksonomy tagging so useful.

Seems like a case of "the genius of AND".

Wednesday, November 29, 2006

Homonyms and uBio's data model (yet more on names)

As part of the TreeBASE name mapping exercise, I've come across some interesting names, such as "Diplura". This is a homonym, meaning that more than one taxon has this name. This can complicate life somewhat.

In TreeBASE, the taxon Diplura is a spider genus (TreeBASE taxon T4182), part of the study by Fredrick Coyle (hdl:2246/1665).

NCBI has "Diplura" (Taxonomy ID 29997), but this is the insect class (or order, depending on what classification you use). NCBI mistakenly links "Diplura" in NCBI to "Diplura" in TreeBASE, but links correctly to the insect record in ITIS (Taxonomic Serial No:99228).

To make matters worse, there is also an algal genus Diplura, which ITIS also has (Taxonomic Serial No:10873).

The problem comes when we look up this name in uBio. The name Diplura is listed as appearing in several classifications, including NCBI, ITIS, etc., as well as its occurrence as a butterfly name (Diplura Ranbur, 1866). However, in the metadata for this name there is the tag <ubio:taxonomicGroup>Phaeophyceae</ubio:taxonomicGroup> (the Phaeophyceae are algae). Clearly, a name that is used by a spider, an insect, and an alga (never mind a butterfly) can't be assigned to a single taxonomic group. Perhaps one solution would be to have multiple instances of the <ubio:taxonomicGroup> tag, one for each major taxonomic group the name came from.

My motivation in all this is to start thinking about taxonomic names as simple "tags", with a view to using some of the vocabularies for taxonomies and "folksonomies" being developed elsewhere, such as SKOS Core. Under this approach, I'd need GUIDs for name strings, independent of their usage. uBio pretty much does this, but for the <ubio:taxonomicGroup> tag.

Friday, November 24, 2006

Demise of

As noted on the Society of Systematic Biology (SSB) web site, the journal Phyloinformatics has disappeared. It only published eight papers, but this still represents lost effort, and some of the papers are highly relevant to issues I'm interested in. Luckily, with the help of the Internet Archive's "wayback machine", and a PDF sent by Paul Sereno, I've put all the PDFs on the SSB web site. You can get them here.

Friday, November 17, 2006

More on names (and frogs)

Molecular Phylogenetics and Evolution carries two articles debating the application of names to trees, which reflects tensions between two codes of nomenclature (ICZN and Phylocode). Alain Dubois (doi:10.1016/j.ympev.2006.06.007) and David Hillis (doi:10.1016/j.ympev.2006.08.001) present rather different views. The paper that brought things to a head is Hillis and Wilcox (2005) "Phylogeny of the New World true frogs (Rana)" (doi:10.1016/j.ympev.2004.10.007, TreeBASE S1186). I've not had time to digest this (it's Friday evening, after all), but I think it's interesting to see to what extent the systems can coexist (which is what Hillis seems to argue, if only as a transitional stage), or whether they are simply incompatible.

Tuesday, November 14, 2006

Synonomy and kinds of name

Just wanted to write this example down before I lose it. Browsing Bill Piel's trees in Google Earth, and was looking at Lee et al.'s paper (doi:10.1111/j.1365-294X.2005.02707.x) on Physalaemus pustulosus. Searching in led to records in GenBank, whereupon I stumble on the fact that in GenBank it is Engystomops pustulosus (Taxonomy ID: 76066). Following up the reference on the NCBI taxonomy page, I find a PDF of the paper freely available (although only a URL for an identifier). Browsing the GenBank records (e.g., DQ337249), I find Ron et al. (10.1016/j.ympev.2005.11.022). Among other things, this paper refines the genus Engystomops:

Engystomops, Jiménez de la Espada 1872 (converted clade name). Definition: clade stemming from the most recent common ancestor of E. petersi Jiménez de la Espada 1872, and E. pustulatus (Shreve, 1941).

Then paddling off to HerpNET I query for "Engystomops pustulosus" and get one record, whereas for "Physalaemus pustulosus" I get lots of records (although the geographic range doesn't include all the localities in doi:10.1111/j.1365-294X.2005.02707.x).

My point? Well, don't really have one, except that again we are clicking around different web sites to get a complete picture of what is going on, important data are attached to different names for the same animal, and the nature of those names themselves may vary (for example, Nascimento et al. define Engystomops as a set with a type species (E. petersi), whereas Ron et al. (10.1016/j.ympev.2005.11.022) define Engystomops as a least common ancestor of two taxa on a tree). All of this makes integration a challenge, to say the least.

Monday, November 06, 2006

The politics and practice of accessibility in systematics

Stumbled across New Infrastructures for Knowledge Production: Understanding E-science while writing about TAXACOM on the iSpecies blog.
The book is edited by Christine Hine, who has an article entitled The politics and practice of accessibility in systematics, which I think will be part of Past, Present & Future of Research in the Information Society. The final paragraph of this article is intriguing:
There are some messages here for an open access movement that places belief in the ability of digital solutions to realise access to information. The experience of systematics suggests that too great a focus on the movement, and too much emphasis on the ability of particular technologies to realise a desired effect can be counter productive. A belief in the inevitability of digital solutions can sideline consideration of potential users and transform it into a simple belief that they will come. From this perspective open access looks like a low cost technical fix to issues of inequality, and of course nothing is that simple. However, we can expect that within the “open access movement” a wide diversity of initiatives may proliferate, and these will make sense to those most directly involved in a variety of ways which cross and blur the distinction between providers and users of information. There will be a need to remain open to non-digital solutions, and to respect the capacity of practitioners to craft their own appropriate technologies, even whilst we celebrate the ability of grand visions of open access to inspire, stimulate and offer a way of making sense of diverse experience.

Sunday, October 15, 2006

Mac Spoof: Upgrading

Really should be doing some work, but nearly wet myself when I saw this...

Friday, September 22, 2006

PygmyBrowse demo

I've now put a simple demo of the PygmyBrowse tree browser up here. It's a simple toy just to demo the idea.

Wednesday, September 20, 2006


I've previously bemoaned the lack of a decent way to display and navigate through phylogenies. Ryen White, a graduate of Glasgow's Computer Science department and now at Microsoft Research, is coauthor of a cool paper on viewing large trees in small spaces. PygmyBrowse: a small screen tree browser (doi:10.1145/1125451.1125562) describes an elegant approach to browsing a hierarchy that strikes me as being potentially very useful for navigating taxonomic classifications. It should be a cinch to implement this using AJAX (I'll let you know if I manage to do this before getting distracted by other things).

Tuesday, August 29, 2006

Collaborative data matrices using EditGrid

EditGrid is an online collaborative spreadsheet tool that I stumbled across via Ogle Earth. It strikes me that this could be a way to create phylogenetic data sets collaboratively.

As a quick test I grabbed the Vertebrates example file that comes with MacClade, exported the NEXUS file as a table, opened it in Excel, then uploaded the Excel file to EditGrid. You can see the results here.

The spreadsheet is a natural metaphor for phylogenetic data, although in this application it is likely to be better suited to morphological data where a team of people are assembling a matrix from various sources.

The developers of EditGrid have a blog which conveys their own sense of excitement about this project.

Tuesday, August 08, 2006

Connotea and TreeBASE

One of my (forever) ongoing projects is to map taxon names in TreeBASE to names in external databases (such as uBio) as a way of checking that the names are correct, adding the ability to handle synonyms, and hierarchical queries (see my earlier post for more details).

Now, many names in TreeBASE aren't in any of the major name databases (fossils seem particularly poorly supported), which means hunting on Google for the name. In some cases I come across the name and the original reference for the name, which means I can document that the name is correct. For example, TreeBASE taxon T8737 is Eocaiman cavernensis, which doesn't occur in any of the name sources I use (uBio, ITIS, NCBI, IPNI, etc.). It's a fossil crocodilian, described by George Gaylord Simpson in 1933.

The original description in American Museum Novitates is online (hdl:2246/2050), courtesy of the AMNH's DSpace server. So, how do I link the name and the publication -- without me creating a new database to do this? Well, Connotea to the rescue. I add Simpson's paper to Connotea, tagged with the TreeBASE TaxonID T8737, and voilà, the information is stored.

Now, to make use of this we need to do a little bit more, such as have a triple store that contains both the TreeBASE names and the Connotea record, but given that Connotea serves RSS 1.0 (i.e., RDF), this is easy.
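As a rough illustration of why RSS 1.0 makes this easy, the Python sketch below turns each item in a feed into (subject, predicate, object) triples. The sample feed is hand-written for the example, with a placeholder title; I'm assuming the tag comes through as dc:subject, which may not match exactly what Connotea emits. A real triple store would also want the resource-valued properties, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RSS = "{http://purl.org/rss/1.0/}"

# Hand-written sample in RSS 1.0 form; not an actual Connotea response.
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://hdl.handle.net/2246/2050">
    <title>Simpson 1933 (placeholder title)</title>
    <dc:subject>T8737</dc:subject>
  </item>
</rdf:RDF>"""


def item_triples(feed_text):
    """Return (subject, predicate, literal) triples for every feed item."""
    root = ET.fromstring(feed_text)
    triples = []
    for item in root.iter(RSS + "item"):
        subject = item.get(RDF + "about")
        for child in item:
            if child.text and child.text.strip():
                # ElementTree tags look like "{namespace}name".
                namespace, _, name = child.tag[1:].partition("}")
                triples.append((subject, namespace + name, child.text.strip()))
    return triples


for triple in item_triples(SAMPLE_FEED):
    print(triple)
```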

What I like about this is:

  1. I don't have to do much work

  2. The publication information is stored where others can see it and make use of it (i.e., if my experiments with these ideas fall by the wayside, the data still remain).

Now, back to the tedious task of mapping...

Wednesday, July 12, 2006

Small Pieces Loosely Joined

Just finished reading Small Pieces Loosely Joined. It's a fabulous essay on the nature of the Web. The more I read it the more it confirms my fear that most people talking about biological taxonomy and biodiversity on the Web simply don't "get" the Web. Adopting the Web successfully will require a willingness to accept error, ambiguity, and downplaying "expertise" and "authority". It will be interesting to see what happens.

Wednesday, June 28, 2006

TreeBASE rocks

I gave a talk today ("Dude, where's my tree?") at the Evolution 2006 meeting at Stony Brook. It was intended as a somewhat tongue-in-cheek overview of some issues concerning TreeBASE, and broader areas of biodiversity informatics, making use of ants as an example (see my SemAnt project).
Michael Donoghue took me aside after the talk and made some interesting points. He was a little tired — understandably — of hearing that "TreeBASE sucks" (e.g., my CIPRES talk), and felt that my constantly saying this was counter productive. It could also lead to people not putting their data in TreeBASE because they'd heard that it "sucks".
There is an element of social responsibility here, I guess. I resolutely avoid politics. I don't mean this in a pejorative sense, it's just that I don't have the temperament or skill for it, unlike Michael himself (Lee Belbin is another person in this area who strikes me as a very skilled manager).
Now, my talk was intended to be fun, and I was taking the piss out of myself as much as anything. I also think the things we criticise are the things we value the most. But that said, let me make it clear that TreeBASE is very important. As editor of Systematic Biology I've made authors submit data to it. I have a lot of respect for the work Michael, Bill Piel, and Mike Sanderson put into TreeBASE. If you have phylogenetic data, submit it to TreeBASE. It's the best we have. It's just that, well, as a community, we could do better.

Taxonomic names, metadata, and the Semantic Web

My paper "Taxonomic names, metadata, and the Semantic Web" has appeared in Biodiversity Informatics.

Life Science Identifiers (LSIDs) offer an attractive solution to the problem of globally unique identifiers for digital objects in biology. However, I suggest that in the context of taxonomic names, the most compelling benefit of adopting these identifiers comes from the metadata associated with each LSID. By using existing vocabularies wherever possible, and using a simple vocabulary for taxonomy-specific concepts we can quickly capture the essential information about a taxonomic name in the Resource Description Framework (RDF) format. This opens up the prospect of using technologies developed for the Semantic Web to add "taxonomic intelligence" to biodiversity databases. This essay explores some of these ideas in the context of providing a taxonomic framework for the phylogenetic database TreeBASE.

Saturday, May 27, 2006

More on trees and Google Earth

Well, turns out Bill's not the only one putting trees on Google Earth. Declan Butler pointed me to Ogle Earth, where there is a teaser of some work on guiology.

Currently playing in iTunes: Crazy In Love by Beyoncé

Avian flu, phylogeny, and Google Earth

The penny just dropped (duh!).
Having mentioned Bill Piel's very cool visualisation of phylogenies on Google Earth

what about the other cool use of Google Earth in biology, namely Declan Butler's displays of the march of avian flu?

Instead of standard diagrams like this one from Ruben Donis' paper in Emerging Infectious Diseases:

why not take phylogenies for avian flu virus and add them to the data Declan is displaying? This could be a potentially compelling graphic, and a test of whether phylogenies add useful information to our understanding of what is going on.

Friday, May 26, 2006

TreeBASE meets Google Earth

Bill Piel has created a cool tool for creating KMZ files of phylogenies for Google Earth. From the web site:

One of the components of the CIPRES project is the development of TreeBASE II — a robust, scalable, and versatile re-design and re-engineering of TreeBASE. As part of this project, we are exploring other ways of browsing and visualizing trees. Google Earth is a fantastic 3-D browser for exploring geographic resources and has the potential to be a useful and fun tool for delivering biological information with a geographic component.

Google Earth (available for Windows and Mac OS X) is opening up all sorts of possibilities for biodiversity informatics (ants being one of the first examples). What is cool about Bill's work is that it departs from simple locality records.

As always, after pausing to say "wow", there are all sorts of things that one could think of adding. For example, some trees are clearer than others, due to how well the geography and trees match. I wonder if this could be used as a measure of how well geography "explains" the tree. For example, simple vicariance or serial dispersal would have few cross-overs, a history of dispersal (or an old pattern with extinction, or if geography has changed) might be messier. Perhaps there is a metric that could be developed for this. It strikes me as similar in spirit to trees for tandem duplications -- there's a nice spatial (albeit linear) order in a tree if the sequences are tandem duplications.

If the trees had dated nodes (i.e., were "chronograms"), presumably this could be used to compute node heights, so you'd be able to have chronograms. Sort of a reverse onion, the layers getting older as you go out. People could then see whether biogeographic patterns were of a similar age. This adds a spatial dimension to chronograms (see an earlier post on the analogy between genome browsers and chronograms).

As an aside, and because I was once a panbiogeography enthusiast, why haven't panbiogeographers leapt on Google Earth as a tool to display "tracks"? If ever there was an opportunity to drag that movement out of the doldrums, this is it.

Wednesday, May 24, 2006

Open Access taxonomy

Pyramica boltoni
Originally uploaded by Roderic Page.
Fussing around with ants, I stumbled across this paper (doi:10.1653/0015-4040(2006)89[1:PBANSO]2.0.CO;2) (if the DOI doesn't work, try this link), which describes a new species, Pyramica boltoni. This paper is Open Access, so the question arises, how do I get it into a triple store? I could add the metadata about the paper (it would be nice to do this automatically via Connotea and the DOI, but some BioOne DOIs aren't "live" yet), but what about things like the pictures?
For fun, I grabbed Fig. 1, uploaded it into iPhoto, then exported it to Flickr using the FlickrExport plugin.
Flickr has an API, hence the image (and the associated tags) could be retrieved automatically. Hence, anybody with Connotea and Flickr accounts could contribute to a triple store.

Sunday, May 21, 2006

Towards the ToL database - some visions

So, when I started this blog I promised to write something about phyloinformatics, and the goal of a phylogenetic database. I've been playing around with various ideas, some of which have made it online, but most remain buried on various hard drives until they get written up to the state they are useable.

There are also numerous distractions, and detours along the way, such as MyPHPBib, Taxonomic Search Engine, and LSIDs, oh and iSpecies, which got me into trouble with Google, then there is a certain journal, and a certain person (but let's not go there...).

My point (and I do have one), is that maybe it's time to rethink some cherished ideas. Basically, my original goal of creating a phylogenetic database involved massive annotation, disambiguation of taxonomic names, and linking to global identifiers for taxonomic names, sequences, images, and publications. This is the project outlined at the start of this blog.

I still believe this would be worthwhile, and I've a lot of the work done for TreeBASE (e.g., mapping TreeBASE names to external databases, BLASTing sequences in TreeBASE to get accession numbers, etc.). This is a lot of work, and I wonder about scalability and involvement. In other words, can it cope with the amount of data and trees we generate, and how do we get people to contribute? So, here are a few different (not necessarily exclusive) approaches.

Use TreeBASE as a seed and continue to grow that database, adding extensive annotations and cross links. Time consuming, but potentially very powerful, especially if data are dumped into a triple store and cool ways to query it are developed.

Googolise everything
Use Google to crawl for NEXUS files (e.g., "#nexus" "begin data" format dna), extract them and put them into a database. Use string matching and BLAST to figure out what the files are about.
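Before any string matching or BLASTing, the crawler would need a cheap test of whether a fetched file is a NEXUS data file at all. Here is a minimal Python sketch of that first filter; the function name and the three-way answer are my own invention, and the test is deliberately crude (it only looks for the "#nexus" header, a data or characters block, and a DNA datatype).

```python
import re

BLOCK_PATTERN = re.compile(r"\bbegin\s+(data|characters)\s*;", re.IGNORECASE)
DNA_PATTERN = re.compile(r"\bdatatype\s*=\s*dna\b", re.IGNORECASE)

# Tiny hand-made file used to exercise the check.
SAMPLE = """#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=2 NCHAR=4;
FORMAT DATATYPE=DNA;
MATRIX
A ACGT
B ACGA
;
END;
"""


def classify_nexus(text):
    """Return 'dna', 'other', or None when the text is not a NEXUS data file."""
    if not text.lstrip().lower().startswith("#nexus"):
        return None
    if not BLOCK_PATTERN.search(text):
        return None
    return "dna" if DNA_PATTERN.search(text) else "other"


print(classify_nexus(SAMPLE))
```

Files that pass could then go on to the more expensive steps (name matching, BLAST), while everything else is dropped early.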

Phylogeny news
Monitor NCBI and journal RSS feeds; when new sequences or papers appear, extract popsets, use or build alignments, compute trees quickly and whack them into a database. The interface is something like Postgenomic (maybe using the same clustering algorithms to link related stories), or even cooler, newsmap.

Connotea style

Inspired by projects like Connotea, perhaps the trick is to mobilise the community by lowering the barrier to entry. Instead of aiming for a carefully curated database, what if people could upload the basics (some sort of identifier for the paper, such as a DOI or a PubMed id, and one or more trees in Newick format). I think this is what Wayne Maddison was suggesting when we chatted at the CIPRES (see my earlier post about that meeting) -- if Wayne didn't suggest this, then my apologies. The idea would be that people could upload the bare minimum, but be able to annotate, comment, link, etc. Behind the scenes we have scripts to look up whatever identifiers we have and extract relevant metadata.
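To show how low the barrier could be, here is a Python sketch of the "bare minimum" upload: an identifier plus one or more Newick strings, with only a cheap sanity check on the trees. The record layout and function names are hypothetical, and the check is not a real Newick parser (it ignores quoted labels and comments, for instance).

```python
def looks_like_newick(tree):
    """Cheap sanity check: balanced parentheses and a closing semicolon."""
    text = tree.strip()
    if not text.endswith(";"):
        return False
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def make_submission(identifier, trees):
    """Build a minimal record; annotations start empty and are added later."""
    bad = [tree for tree in trees if not looks_like_newick(tree)]
    if bad:
        raise ValueError("not Newick: %r" % bad[0])
    return {"identifier": identifier, "trees": list(trees), "annotations": []}


record = make_submission("doi:10.1016/j.ympev.2004.10.007", ["((A,B),C);"])
print(record)
```

Everything else (authors, journal, taxon names) would be filled in behind the scenes from the identifier, so the person uploading never has to type it.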

Saturday, May 20, 2006

Taxonomic Search Engine back online

My Taxonomic Search Engine is back online (mostly). This tool to search multiple databases for scientific names was another casualty of hacking. Having been developed under PHP 4, it needed some work to play nice with PHP 5. The changes were minor, and mainly concerned changes in code involving XPath and XSLT. I've committed these changes to SourceForge. I've not got the Species 2000 service back up (this needs local data to be restored), and the LSIDs are broken due to problems with IBM's LSID perl stack on my Fedora Core 4 machine (sigh).

Wednesday, May 17, 2006

AntBase and Web 2.0 business value

Dion Hinchcliffe has a piece entitled Creating real business value with Web 2.0 which lists (I think he actually means AntWeb) as an example of a non-commercial Web 2.0 service that demonstrates "scalable marshalling of underutilized data resources," and shows: a scientific community turned massive taxonomy resources otherwise mouldering away in basements as lost specimens into a thriving online database of information that can be shared by all. Understanding the success and importance of both of these points to intriguing and largely unexploited possibilities that I predict will become more common and widespread in the near future.

The article comes with this graphic:

See also Dion's Thinking in Web 2.0: Sixteen ways (via Danny Ayers).

Tuesday, May 09, 2006

CrossRef's OpenURL resolver

CrossRef's OpenURL resolver can be used to find DOIs for papers or, given a DOI, to extract metadata. For example, consider Brian Fisher's article on silk production in Melissotarsus emeryi. This is in FORMIS, and the record is displayed here in RIS format:

AU - Fisher, B.L.
AU - Robertson, H.G.
PY - 1999
TI - Silk production by adult workers of the ant Melissotarsus emeryi (Hymenoptera, Formicidae) in South African fynbos
SP - 78-83
JF - Insect. Soc.
JO - Insectes Sociaux
VL - 46
N1 - Part of FORMIS 00; from PSW,
KW - ant; Formicidae; Melissotarsus emeryi; Africa; South Africa; scientific; nest; tending Homoptera; silk; production; adult; worker; gland; cuticular depressions; hypostoma; silk brushes; nest construction; defense; diaspidid symbiont;
ID - 6573
ER -

We can find the DOI for this article using the following query:, which yields the following XML:

<?xml version = "1.0" encoding = "UTF-8"?>
<crossref_result version="2.0" xmlns="" xmlns:xsi="" xsi:schemaLocation="">
<query key="555-555" status="resolved">
<doi type="journal_article">10.1007/s000400050116</doi>
<issn type="print">00201812</issn>
<issn type="electronic">14209098</issn>
<journal_title match="exact">Insectes Sociaux</journal_title>
<author match="fuzzy">Fisher</author>
<volume match="exact">46</volume>
<first_page match="exact">78</first_page>
<year match="exact">1999</year>
<article_title>Silk production by adult workers of the ant Melissotarsus emeryi (Hymenoptera, Formicidae) in South African fynbos</article_title>
</query>
</crossref_result>

The DOI for this article is 10.1007/s000400050116. This service could be a simple way to get DOIs for recent papers in FORMIS, enabling us to get GUIDs for the articles, as well as providing a link to the article itself.
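Pulling the DOI out of a response like the one above takes only a few lines. This is a minimal Python sketch: the response is a trimmed copy of the listing (the namespace attributes are dropped because they are empty there), and the function names are mine. It ignores namespaces when matching tags so that it should also cope with a response that declares one.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above.
RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<crossref_result version="2.0">
<query key="555-555" status="resolved">
<doi type="journal_article">10.1007/s000400050116</doi>
<journal_title match="exact">Insectes Sociaux</journal_title>
<volume match="exact">46</volume>
<first_page match="exact">78</first_page>
<year match="exact">1999</year>
</query>
</crossref_result>"""


def local_name(tag):
    """Strip any "{namespace}" prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def extract_doi(xml_text):
    """Return the DOI of the first resolved query, or None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for element in root.iter():
        if local_name(element.tag) == "query" and element.get("status") == "resolved":
            for child in element:
                if local_name(child.tag) == "doi" and child.text:
                    return child.text.strip()
    return None


print(extract_doi(RESPONSE))  # prints 10.1007/s000400050116
```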

Saturday, May 06, 2006

Updating ants

A triple store for ants is all very well, but it contains just the information available when the triple store was created. What about updating it? What about doing this automatically? Here are some ideas:

Connotea provides semantically rich RSS feeds. We could subscribe to a feed using a tag (such as Formicidae), and extract recent posts. Could use HTTP conditional GET, or parse the Connotea feed and use XPath to extract references more recent than a given date. Connotea makes extensive use of RDF in their RSS feeds, so it's easy to dump this into the triple store.
uBio's taxonomically intelligent RSS feed reader could be used to monitor publications on ants (e.g., Formicidae). uBio uses RSS 2.0, which doesn't include RDF (see Wikipedia entry for RSS). One option would be to parse the RSS and see what we can extract from the links (e.g., if they contain DOIs, are Ingenta feeds, etc.). If there are DOIs we could use CrossRef's OpenURL lookup. Or we could use the Connotea Web API. We'd upload the URLs, and get Connotea to see what it can do with them, then we make use of their RSS feed. This also makes the information available to everybody for tagging.

We could also track new sequences in GenBank (to do).
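For the RSS 2.0 case, "see what we can extract from the links" might start with something like this Python sketch, which looks for a DOI inside each item's link. The feed is a hand-made stand-in (I'm not reproducing a real uBio feed), and the DOI pattern is a loose heuristic of my own rather than any official grammar.

```python
import re
import xml.etree.ElementTree as ET

# Loose pattern: "10.", a numeric registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>?&#]+')

# Hand-made RSS 2.0 sample; the second link is a made-up URL with no DOI.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Paper with a DOI link</title>
      <link>http://dx.doi.org/10.1016/j.ympev.2006.06.007</link>
    </item>
    <item>
      <title>Item without a DOI</title>
      <link>http://example.org/news/1</link>
    </item>
  </channel>
</rss>"""


def dois_from_rss(rss_text):
    """Return the DOIs found in the <link> of each RSS 2.0 item."""
    root = ET.fromstring(rss_text)
    found = []
    for item in root.iter("item"):
        link = item.findtext("link") or ""
        match = DOI_PATTERN.search(link)
        if match:
            found.append(match.group(0))
    return found


print(dois_from_rss(SAMPLE_RSS))  # prints ['10.1016/j.ympev.2006.06.007']
```

Any DOIs found this way could then be handed to the CrossRef lookup, and the links without DOIs passed to Connotea.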

stamen design | big ideas worth pursuing

stamen (which brought us Mappr) has a nice discussion of data visualisation.

Currently playing in iTunes: Summertime by George Benson

Ants, RDF, and triple stores

In order to explore the promise of RDF and triple stores we need some large, interesting data sets. Ants are cool, there is a lot of data available online (e.g., AntWeb at the California Academy of Sciences, Antbase at the American Museum of Natural History, New York, and the Hymenoptera Name Server at Ohio State University, Chris Schmidt's, and Ant News), and they get good press (for example, the "Google ant").


Firstly, we start with a Google Earth file for ants, obtained from AntWeb on Monday April 24th, 2006. AntWeb gives the link as, which is a compressed KML file. However, this file merely gives the location for the actual data file, Grab that file, expand it and you get 27 Mb of XML listing 50,550 ant specimens and 1,511 associated images.

We use the Google Earth file because it gives us a dump of AntWeb, and does it in a reasonably easy to handle format (XML). I wrote a C++ program to parse the KML file and dump the information to RDF. One limitation is that my program dies on the recent KML files because they have very long lines. Doing a search on <Placemark> and replacing it with \r<Placemark> in TextWrangler fixed the problem.

In order to keep things as simple and as generic as possible, I use Dublin Core metadata terms wherever possible, and the basic geo (WGS84 lat/long) vocabulary for geographical coordinates. The URI for the specimen is the URL (no LSIDs just yet).
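To give a flavour of the output, here is a Python sketch (not the C++ program) that writes one specimen as RDF/XML using Dublin Core and the WGS84 geo vocabulary. The specimen URL, title and coordinates are made up, and the choice of dc:title, geo:lat and geo:long is an assumption of the sketch rather than a record of exactly which terms I emit.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def specimen_rdf(uri, title, lat, lon):
    """Return an RDF/XML document describing one georeferenced specimen."""
    lines = [
        '<rdf:RDF xmlns:rdf="%s" xmlns:dc="%s" xmlns:geo="%s">'
        % (RDF_NS, DC_NS, GEO_NS),
        "  <rdf:Description rdf:about=%s>" % quoteattr(uri),
        "    <dc:title>%s</dc:title>" % escape(title),
        "    <geo:lat>%s</geo:lat>" % lat,
        "    <geo:long>%s</geo:long>" % lon,
        "  </rdf:Description>",
        "</rdf:RDF>",
    ]
    return "\n".join(lines)


# Hypothetical specimen: the URL and coordinates are placeholders.
doc = specimen_rdf("http://example.org/specimen/casent0000001",
                   "Example ant specimen", -18.5, 47.0)
ET.fromstring(doc)  # raises ParseError if the output is not well-formed
print(doc)
```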

In addition to the RDF, I generate two text dumps for further processing.

As noted at iSpecies, we can automate the extraction of metadata from images using Exif tags. There is a vocabulary for describing Exif data in RDF, which I've adopted. However, I don't use all the tags, nor do I use IFD, which frankly I don't understand.

So, the basic idea is to have a Perl script that:

  1. Takes a list of AntWeb images (more precisely, the URLs for the images)

  2. Fetches each image in turn using LWP and writes them to a temporary folder

  3. Uses Image::EXIF to extract Exif tags

  4. Generates RDF

Some AntWeb specific things include linking the image to the specimen, and linking to a Creative Commons license.
Here is an example:

<rdf:Description rdf:about="" >
<dc:subject rdf:resource="" />
<dc:publisher rdf:resource=""/>
</rdf:Description>

This RDF is generated from the image to the right. What is interesting about the Exif metadata is that it isn't generated from the AntWeb database itself, but from the images. Hence, unlike the Google Earth file, we are adding value rather than simply reformatting an existing resource.

Of course, there are some gotchas. Some images look like this ("image not available"), and the Exif tag Copyright has a stray null character (\0x00) appended at the end, which breaks the RDF. I fixed this with Zap Gremlins in TextWrangler.
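The null character could also be stripped in the script itself rather than by hand afterwards. A minimal Python sketch of the idea (the function name and the sample copyright string are invented): XML 1.0 allows tab, line feed and carriage return, but none of the other low control characters, so those are simply removed before the value is written out.

```python
import re

# C0 control characters that XML 1.0 does not allow (tab, LF and CR are fine).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_exif_value(value):
    """Remove control characters that would make the RDF/XML ill-formed."""
    return ILLEGAL_XML_CHARS.sub("", value)


# Made-up Copyright value with the trailing null described above.
raw = "Example copyright holder\x00"
print(repr(clean_exif_value(raw)))  # prints 'Example copyright holder'
```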

There is no single authoritative list of scientific names. I'm going to use the biggest (and best), uBio, specifically their findIT SOAP service. It might make more sense to use the Hymenoptera Name Server, but uBio serves RDF, and gets most of the ant names anyway as the Hymenoptera Name Server feeds names into ITIS, which in turn end up in uBio. The result of this mapping is a <dc:subject> tag for each specimen that links using rdf:resource to a uBio LSID. When we make the mapping, we write the uBio namebank ids to a separate file, which we then process to get the RDF for each name.
The script reads a list of specimens and taxon names, calls uBio's findIT SOAP service, and if it gets a direct match, writes some RDF linking the specimen URI to the uBio LSID. It also stores the uBio id in memory, and dumps these into a file for processing in the next step.
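In outline, the mapping step looks something like the sketch below. It is Python rather than Perl, and the SOAP call is replaced by a lookup function passed in as a parameter, so nothing here reflects the real findIT interface. The function name map_names, the example specimen URIs, the namebank id, and the exact LSID pattern are all assumptions for illustration.

```python
from xml.sax.saxutils import quoteattr


def map_names(specimens, lookup):
    """Link specimens to name LSIDs.

    specimens: iterable of (specimen_uri, taxon_name) pairs.
    lookup: callable returning a namebank id (string) for a direct match,
            or None otherwise. It stands in for the remote name service.
    Returns (rdf_fragments, namebank_ids).
    """
    rdf, ids = [], []
    for uri, name in specimens:
        namebank_id = lookup(name)
        if namebank_id is None:
            continue  # no direct match, so write nothing for this specimen
        # Assumed LSID pattern for uBio namebank records.
        lsid = "urn:lsid:ubio.org:namebank:%s" % namebank_id
        rdf.append(
            "<rdf:Description rdf:about=%s>\n  <dc:subject rdf:resource=%s/>\n</rdf:Description>"
            % (quoteattr(uri), quoteattr(lsid))
        )
        if namebank_id not in ids:
            ids.append(namebank_id)  # remembered for the metadata-fetching step
    return rdf, ids


if __name__ == "__main__":
    # Hypothetical lookup table and id, standing in for the service.
    table = {"Melissotarsus insularis": "12345"}
    fragments, ids = map_names(
        [("http://example.org/specimen/a", "Melissotarsus insularis"),
         ("http://example.org/specimen/b", "Unknown sp. 1")],
        table.get,
    )
    print("\n".join(fragments))
    print(ids)
```

Keeping the lookup behind a plain function makes it easy to swap in a different name source without touching the RDF-writing code.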

uBio metadata

Having mapped ant names to uBio, we can then go to uBio and use their LSID authority to retrieve metadata for each name in, you guessed it, RDF. We could resolve LSIDs, but for speed I'll "cheat" and append the uBio namebank ID to
So, armed with a Perl script we read the list of uBio ids, fetch the metadata for each one and dump it into a directory. I then run another Perl script that scans a directory for ".rdf" files and puts them in the triple store.


I retrieved all ant sequences from GenBank by searching the taxonomy browser for Formicidae, downloading all the sequence gis, then running a Perl script that retrieved the XML record for each sequence and populated a MySQL database. I then found all sequences that include a specimen voucher field with CASENT%:

specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "GI")

Next, we fetch these records from NCBI. This seems redundant as we have the information already in a local MySQL database, but I want to use a simple script that takes a GI and outputs RDF so that anybody can do this.

In much the same way, I grabbed the TaxIds for ants with sequences, and grabbed RDF for each name.

For PubMed records I wrote a simple Perl script that, given a list of PubMed identifiers, retrieves the XML record from NCBI and converts it to RDF using an XSLT style sheet. The script also gets the identifiers for any sequence linked to that PubMed record using elinks, and uses the <dcterms:references> tag to model the relationship. For the ant project I only use PubMed ids for papers that include sequences that have CASENT specimens:

specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "PUBMED")

Turns out there are only three such papers:




We could add bibliographic data from FORMIS, which can be searched online here, and downloaded as EndNote files. This would be "fun" to convert to RDF.

PubMed Central
This search finds all papers on Formicidae in PubMed Central, which we could use as an easy source of XML data, in some cases with full text and citation links.

Triple store
The beauty of a triple store is that we can import all these RDF documents into a single store and query them. It doesn't matter that we have information about images in one file, information about specimens in another, and information about names in yet another file. If we use URIs consistently, it all comes together. This is data integration made easy.


This RDQL query finds all images for Melissotarsus insularis:

SELECT ?image WHERE
(?taxon, <dc:subject>, "Melissotarsus insularis")
(?specimen, <dc:subject>, ?taxon)
(?image, <dc:subject>, ?specimen)
(?image, <dc:type>,"image")

Which returns two images.
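To see why consistent URIs make this work, here is a toy version of the same query in Python. It is a sketch of triple-pattern matching in general, not of how any particular triple store evaluates RDQL, and every URI in the data is invented for the example.

```python
def match(triples, patterns, binding=None):
    """Yield variable bindings that satisfy every pattern.

    Terms beginning with '?' are variables; everything else must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(triples, rest, candidate)


# Hypothetical triples, as if loaded from separate name, specimen and image files.
triples = [
    ("urn:example:taxon1", "dc:subject", "Melissotarsus insularis"),
    ("urn:example:taxon2", "dc:subject", "Some other name"),
    ("urn:example:specimen1", "dc:subject", "urn:example:taxon1"),
    ("urn:example:specimen2", "dc:subject", "urn:example:taxon2"),
    ("urn:example:image1", "dc:subject", "urn:example:specimen1"),
    ("urn:example:image1", "dc:type", "image"),
    ("urn:example:image2", "dc:subject", "urn:example:specimen1"),
    ("urn:example:image2", "dc:type", "image"),
    ("urn:example:image3", "dc:subject", "urn:example:specimen2"),
    ("urn:example:image3", "dc:type", "image"),
]

query = [
    ("?taxon", "dc:subject", "Melissotarsus insularis"),
    ("?specimen", "dc:subject", "?taxon"),
    ("?image", "dc:subject", "?specimen"),
    ("?image", "dc:type", "image"),
]

images = sorted({b["?image"] for b in match(triples, query)})
print(images)
```

The join between name, specimen and image happens purely because the same URI appears as the subject of one triple and the object of another, regardless of which file each triple came from.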

OK, now for something a little more fun. The Smith et al. barcoding paper that surveyed ants in Madagascar has PubMed id 16214741 (this paper also has the identifier doi:10.1098/rstb.2005.1714). Given this id (recast as an LSID), we can find the geographic localities the authors sampled from using this query:

SELECT ?lat, ?long WHERE
(?nuc, <dcterms:isReferencedBy>, <>)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
dcterms FOR <>
geo FOR <>

which gives four localities:

?lat ?long
"-13.263333" "49.603333"
"-13.464444" "48.551666"
"-13.211666" "49.556667"
"-14.4366665" "49.775"

We can also search our triple store using other identifiers, such as DOIs:

SELECT ?lat, ?long WHERE
(?pubmed, <dc:identifier>, <doi:10.1098/rstb.2005.1714>)
(?nuc, <dcterms:isReferencedBy>, ?pubmed)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
dcterms FOR <>
geo FOR <>

This is the same query as above, but using the DOI for the barcoding paper.

New inferences

One thing I noticed early on is that there are specimens that have been barcoded and which are labelled in GenBank as unidentified (i.e., they have names like "Melissotarsus sp. BLF m1"), but the same specimen has a proper name in AntWeb (e.g., casent0107665-d01 is Melissotarsus insularis). Assuming the identification is correct (a big if), we can then use this information to add value to GenBank. For example, a search of GenBank for sequences for Melissotarsus insularis finds nothing, but GenBank does have sequences for this taxon, albeit under the name "Melissotarsus sp. BLF m1".

This query searches the triple store for specimens that are named differently in AntWeb and GenBank. Often both names are not proper names, but represent different ways of saying "we don't know what this is". But in some cases, the specimen does have a proper name attached to it:

SELECT ?specimen, ?ident, ?name WHERE
(?specimen, <dc:type>, "specimen")
(?specimen, <dc:subject>, ?ident)
(?nuc, <dc:source>, ?specimen)
(?nuc, <dc:subject>, ?taxid)
(?taxid, <dc:type>, "Scientific Name")
(?taxid, <dc:title>, ?name)
AND ?ident ne ?name
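The comparison the query makes, plus the extra filtering for proper names that still has to be done afterwards, can be sketched in Python like this. The helper names and the "sp. or digit" heuristic for spotting placeholder names are assumptions for illustration, and the accession strings and all but the first specimen code are made up.

```python
import re


def is_provisional(name):
    """Crude guess: names containing 'sp.' or a digit are placeholders."""
    return bool(re.search(r"\bsp\.|\d", name))


def name_conflicts(antweb_names, genbank_records, proper_only=False):
    """Find specimens whose AntWeb and GenBank names differ.

    antweb_names: dict mapping specimen code to AntWeb identification.
    genbank_records: iterable of (accession, specimen_code, genbank_name).
    proper_only: keep only conflicts where the AntWeb name looks like a
                 proper name according to is_provisional().
    """
    conflicts = []
    for accession, specimen, genbank_name in genbank_records:
        ident = antweb_names.get(specimen)
        if ident is None or ident == genbank_name:
            continue
        if proper_only and is_provisional(ident):
            continue
        conflicts.append((specimen, ident, genbank_name))
    return conflicts


if __name__ == "__main__":
    antweb = {
        "casent0107665-d01": "Melissotarsus insularis",
        "specimen-b": "Pheidole sp. MG01",   # hypothetical
        "specimen-c": "Same name",           # hypothetical
    }
    genbank = [
        ("ACC1", "casent0107665-d01", "Melissotarsus sp. BLF m1"),
        ("ACC2", "specimen-b", "Pheidole sp. 2"),
        ("ACC3", "specimen-c", "Same name"),
    ]
    for row in name_conflicts(antweb, genbank, proper_only=True):
        print(row)
```

With proper_only set, only the cases where AntWeb can actually improve on the GenBank label are reported.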

Currently playing in iTunes: One by Mary J Blige & Bono. Currently playing in iTunes: Crazy (Single Version) by Gnarls Barkley

Thursday, May 04, 2006

Nascent: Open Text Mining Interface

From Nature's blog on web technology and science comes this post on Open Text Mining Interface (OTMI):

Every now and then a scientist contacts Nature asking for a machine-readable copy of our content (i.e., the XML) to use in text-mining research. We're usually happy to oblige, but there has to be a better way for everyone concerned, not least the poor researcher, who might have to contact any number of publishers and deal with many different content formats to conduct their work. Much better, surely, to have a common format in which all publishers can issue their content for text-mining and indexing purposes.

and further

The example of RSS shows how powerful a relatively simple common standard can be when it comes to aggregating content from multiple sources (even when it's messed up as badly as RSS ;). So maybe an approach like OTMI (or a better one dreamt up by someone else) can help those who want to index and text-mine scientific and other content. Like RSS, I think publishers might also come to see this as a kind of advert for their content because it should help interested readers to discover it. And on the basis that a something is always better than nothing, it also doesn't force publishers to give away the human-readable form of their content — they can limit themselves to snippets or even just word vectors if they want to.

Currently playing in iTunes: By the Time I Get to Phoenix by Glen Campbell

Tuesday, May 02, 2006

Monday, April 24, 2006

Ambient Findability

Ambient Findability by Peter Morville is a wonderful read, full of snippets of inspiration. In many ways, like ambient music alluded to at the end of the book, it is less about specifics and more about a way of thinking, and about the possibilities once things become findable.

Sunday, April 23, 2006

Darwin hacked

One of my lab's web servers was hacked last week. This machine hosts a lot of projects, such as the Glasgow Name Server, the Taxonomic Search Engine, iSpecies, LouseBase, and TreeView X. Sadly, it was not completely backed up, although most of the key stuff is replicated elsewhere (including source code in CVS on another machine, or in SourceForge, copies of databases on other machines, etc.). Even if it was completely backed up, there is the hassle of rebuilding a machine. Still, since it wasn't backed up, here are some of the things I had to go through.

The kernel (Red Hat 8) had been tampered with, so the machine would no longer boot. I was now faced with the task of getting stuff off the machine in case reinstalling the operating system lost data. Luckily the machine (a Dell Precision 620) booted from a Knoppix CD, which gave me a GUI. So now I can browse my crippled machine, but...

... it couldn't talk to the Net because the Knoppix live CD uses DHCP to get an IP address, and my university doesn't support DHCP (argh!). However, I have an Apple Airport base station with a spare Ethernet port, and connecting the Dell to that port provided a DHCP address (yay).

Booting from a live CD has one major limitation -- I can't alter anything on the disks in the Dell. Hence, doing things like changing file permissions, or making tarballs to be able to FTP directories is out of the question. I don't have a USB key or an external USB hard drive big enough to take the gigabytes of stuff on the Dell.

What worked, after a lot of fussing, was Samba. Using the smb:// protocol in Konqueror (a trick I learned from Mac OS X), I managed to connect to a Fedora Core 4 box in my lab. I could then drag and drop key files onto the FC4 machine (such as httpd.conf, hosts.allow, various CGI scripts, etc.) that were specific to the hacked machine. I also made backups of the home folders, just in case.

This left MySQL databases. Moving these proved to be a major pain, because they are not accessible by the Knoppix user. The solution turned out to be to mount the FC4 box using Samba:

  1. su

  2. mkdir /mnt/linnaeus

  3. mount -t smbfs -o username=xxxx // /mnt/linnaeus

Now we can copy all the MySQL databases onto the FC4 machine.

Ah, but how to get the actual data...? Well, on my Mac OS 10.3 iBook, I have MySQL 4.0.21, which works with the MySQL files from Red Hat 8 (3.23 I think). I use CocoaSQL to create the database, then move all the .MYI and .MYD files into the appropriate folder in /Library/MySQL/data/, then set permissions to ensure that mysql can read the files (make user mysql the owner with chown mysql *, and set permissions to 660).

Yes, the obvious lesson is to have everything backed up, but on a development machine with gigabytes of images and other data, much of it moved around frequently, and a central backup system whose client software wouldn't build on my machine, I'd sort of let this slip (doh!).

Tuesday, April 18, 2006

Render DOT files on the fly on Mac OS X

Webdot isn't available for Mac OS X, and as I use an iBook running Panther for all my development work (before moving to a Linux box to host the results) I wanted to have the same functionality on my iBook. This can be achieved by hacking a simplified version of webdot. This Perl script creates a virtual web browser to serve the image. I've simplified things somewhat, but it works.

The two things you need to set in the script dot.cgi are the path to your copy of the Graphviz program dot, and a directory where dot can write temporary files (I use /tmp).

You can get a copy of the script here.

To render an image of a graph on the fly you insert an img tag with the src attribute comprising:

  1. the path to the CGI script, e.g. /cgi-bin/dot.cgi

  2. a '/' delimiter

  3. the URL of the graph file, e.g. http://localhost/~rpage/dot/

  4. the extension of the image format you want (e.g., png, svg, etc.) preceded by a dot "."
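Put together, the four parts above give a src value that can be built mechanically. A small Python sketch, where the function name dot_img_tag and the file name example.dot are hypothetical:

```python
from html import escape


def dot_img_tag(graph_url, fmt="png", cgi_path="/cgi-bin/dot.cgi"):
    """Build an img tag whose src asks dot.cgi to render graph_url as fmt.

    The src is: CGI path, '/', URL of the graph file, '.', image extension.
    """
    src = "%s/%s.%s" % (cgi_path, graph_url, fmt)
    return '<img src="%s" />' % escape(src, quote=True)


if __name__ == "__main__":
    # example.dot is a made-up file name for illustration.
    print(dot_img_tag("http://localhost/~rpage/dot/example.dot"))
    print(dot_img_tag("http://localhost/~rpage/dot/example.dot", fmt="svg"))
```

Because the image format is just the trailing extension, switching from PNG to SVG only changes the last few characters of the src.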

As an example, here is the dot file http://localhost/~rpage/dot/ as a PNG image, using the HTML:

<img src="/cgi-bin/dot.cgi/http://localhost/~rpage/dot/" />

The source file for this graph looks like this:

graph G {
node [width=.2,height=.2,fontsize=10];
edge [fontsize=10,len=2];
0 [label="0"];
1 [label="3"];
2 [label="4"];
3 [label="5"];
4 [label="6"];
5 [label="7"];
0 -- 1 [label="13"];
0 -- 2 [label="12"];
0 -- 5 [label="8"];
0 -- 4 [label="71"];
1 -- 5 [label="84"];
1 -- 4 [label="8"];
2 -- 5 [label="18"];
2 -- 4 [label="11"];
2 -- 3 [label="51"];
3 -- 4 [label="87"];
}

Saturday, April 01, 2006

Visualizing literature derived networks

This paper in Genome Biology is a nice example of visualising relationships derived from PubMed:

We have developed PubNet, a web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing for graphical visualization, textual navigation, and topological analysis. PubNet supports the creation of complex networks derived from the contents of individual citations, such as genes, proteins, Protein Data Bank (PDB) IDs, Medical Subject Headings (MeSH) terms, and authors. This feature allows one to, for example, examine a literature derived network of genes based on functional similarity.

I've added it to my Connotea library under the tag visualisation (note to self: American English and British English spelling is just one of the problems with "tagging"). I'd seen this paper before, but "forgot" it until browsing Connotea and stumbling across nicmila's library. Nice illustration of the power of shared tags.

(Via nicmila.)

Currently playing in iTunes: Wisemen (Album Version) by James Blunt

Monday, March 27, 2006

Quantum treemaps

One of the things that keeps bothering me is the lack of compelling ways to visualise information in phylogenetic databases. Trees themselves are, I feel, pretty awful objects to work with. They are large, and displaying them takes up a lot of screen real estate. Yet, in many ways, the more one sees of the tree the less one gains from the experience. For example, CAIDA's Walrus tool (right), used by Tim Hughes to display large trees looks fabulous, but is it useful? By which I mean, can we use it to find out about stuff, or do we just spin it around and go "ohh, isn't it pretty?"

Treemaps are another tool that I've looked at, but I have never been terribly impressed by them. However, quantum treemaps, described in Ordered and Quantum Treemaps: Making Effective Use of 2D Space to Display Hierarchies, look potentially useful. To quote from the paper describing them:

The goal of the Quantum Treemap algorithm is similar to other treemap algorithms, but instead of generating rectangles of arbitrary aspect ratios, it generates rectangles with widths and heights that are integer multiples of a given elemental size. In this manner, it always generates rectangles in which a grid of elements of the same size can be layed out. Furthermore, all the grids of elements will align perfectly with rows and columns of elements running across the entire series of rectangles. It is this basic element size that cannot be made any smaller that led to the name of Quantum Treemaps

Quantum treemaps have made their way into Photomesa.

So, here's my thought. What if we used a quantum treemap to browse TreeBASE? Suppose we have a mapping between TreeBASE taxa and the NCBI taxonomy (or any other taxonomy, it doesn't really matter). If we then have some notion of what taxa each TreeBASE study is mainly about, then we could display a quantum treemap of studies rooted at any node in the NCBI taxonomy. For example, studies on mammals, grouped by order. The point here is not to see the tree, but to navigate through the studies using the tree.

Whereas treemaps usually display a nested hierarchy, my sense is that quantum treemaps are used to display the children of a node, rather than the whole tree. I think this is because the final size of a quantum treemap is unpredictable.
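The unpredictability comes from the quantization itself. As a much simplified illustration (this is only the "snap a box to whole elements" step, not the layout algorithm from the paper, and quantum_grid and its aspect parameter are names invented for the sketch), here is how a box for n equal-sized items might be chosen in Python:

```python
import math


def quantum_grid(n_items, aspect=1.0):
    """Return (cols, rows) with cols * rows >= n_items.

    cols / rows is kept roughly near the requested aspect ratio, but both
    dimensions are whole numbers of elements, so some cells may stay empty.
    """
    if n_items < 1:
        raise ValueError("need at least one item")
    cols = max(1, round(math.sqrt(n_items * aspect)))
    cols = min(cols, n_items)
    rows = math.ceil(n_items / cols)
    return cols, rows


if __name__ == "__main__":
    for n in (1, 7, 10, 12):
        cols, rows = quantum_grid(n)
        print(n, "items ->", cols, "x", rows, "grid,", cols * rows - n, "empty cells")
```

Each box may end up with a few empty cells, so the total area needed depends on how the item counts happen to round, which is one way to see why the overall size cannot be fixed in advance.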

The mapping of TreeBASE names to NCBI tax_ids is not trivial, but I've got most of one done. Mapping studies to taxa needs a little thought. One approach is to take a tree from a study, relabel it with NCBI tax_ids, then find the least common ancestor in the NCBI taxonomy of the centroid of the tree. The idea is that this is in the core of the tree, and hence should capture what the tree is about. Finding the LCA of the root would be an obvious thing to do, but if one has a tree comprising mostly vertebrates, but rooted with a bacterium, then the root LCA is the root of life, which isn't a terribly accurate summary of the tree.
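The problem with using the root can be shown with a toy example. The Python sketch below computes the least common ancestor of a set of taxa from a child-to-parent table. The taxonomy is a drastically abbreviated stand-in for the NCBI one, keyed by name rather than tax_id, and the function names are assumptions. It does not implement the centroid idea; it only shows how a single distant outgroup drags the LCA up to the top of the classification.

```python
def lineage(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(nodes, parent):
    """Return the deepest node that is an ancestor of (or equal to) every node."""
    nodes = list(nodes)
    common = lineage(nodes[0], parent)
    for node in nodes[1:]:
        others = set(lineage(node, parent))
        # Keep the order (deepest first) while dropping non-shared ancestors.
        common = [x for x in common if x in others]
    return common[0]


# Toy child -> parent table; most intermediate ranks are omitted.
parent = {
    "cellular organisms": "root",
    "Bacteria": "cellular organisms",
    "Eukaryota": "cellular organisms",
    "Vertebrata": "Eukaryota",
    "Mammalia": "Vertebrata",
    "Aves": "Vertebrata",
    "Escherichia coli": "Bacteria",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Gallus gallus": "Aves",
}

ingroup = ["Homo sapiens", "Mus musculus", "Gallus gallus"]
print(lca(ingroup, parent))
print(lca(ingroup + ["Escherichia coli"], parent))
```

The first call summarises the tree as a vertebrate study; adding the bacterial outgroup moves the answer to the node shared by all cellular life, which says almost nothing about what the study is about.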

I've been playing with generating quantum treemaps, based on a C++ port of some Java code written by Ben Bederson. The next step is to try and bolt this together into a demo of how this might be used to navigate TreeBASE.


Tuesday, March 21, 2006


Not a huge fan of IE, but this post on David Patten's blog nicely illustrates the ease of use of A9's OpenSearch with IE 7.

I'd previously played with OpenSearch as a quick way to integrate biodiversity sources, and put together a couple that have been registered with A9 (search for "taxonomy" and you'll find them). It's essentially adding a few tags to RSS or Atom feeds, coupled with a simple way to describe the search engine.

Perhaps it's time to play with this a little more. It would be a very simple way to open up some data.

(Via A9 Developer Blog.)

Currently playing in iTunes: Wonderwall by Oasis