Wednesday, February 15, 2017

New feature for BioStor: extracting literature cited from OCR text

At present BioStor provides a simple display of an article extracted from BHL. You get the page images, and sometimes a map and an altmetric "donut". But we can do better than this. For example, I'm starting to experiment with displaying a list of literature cited by the article. Below is a screenshot of the article A remarkable new species of Homalomena (Araceae) from Peninsular Malaysia showing the two references this article cites:

[Screenshot: BioStor article page showing the list of literature cited]

These references have been extracted using some simple regular expressions written in JavaScript and wrapped up in a CouchDB view. They are extracted as simple text strings; I've not made any further attempt to parse each string into authors, title, journal, etc.
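To give a flavour of the approach, here is a minimal sketch of this kind of regular-expression extraction. It is not the actual BioStor code: the function names, the `doc.ocr` field, the heading patterns, and the "surname, comma, then a year somewhere on the line" rule for spotting the start of a reference are all assumptions made for illustration.

```javascript
// Hypothetical helper (not BioStor's real code): pull reference strings
// out of the OCR text of an article.
function extractReferences(ocrText) {
  var lines = ocrText.split(/\r?\n/);
  // Assumed headings that introduce the reference list
  var heading = /^\s*(references|literature cited|bibliography)\s*$/i;
  // Assumed shape of the first line of a reference:
  // capitalised surname, a comma, and a four-digit year later on the line
  var startOfReference = /^[A-Z][^,]+,\s.*\b(1[6-9]|20)\d{2}[a-z]?\b/;
  var references = [];
  var inList = false;

  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].trim();
    if (!inList) {
      // Skip everything until we reach the reference list heading
      inList = heading.test(line);
      continue;
    }
    if (line === '') {
      continue;
    }
    if (startOfReference.test(line)) {
      references.push(line);
    } else if (references.length > 0) {
      // Continuation line: append it to the reference we are building
      references[references.length - 1] += ' ' + line;
    }
  }
  return references;
}

// Sketch of a CouchDB map function wrapping the helper.
// `doc.ocr` is an assumed field name; `emit` is supplied by CouchDB.
function mapCitedReferences(doc) {
  if (typeof doc.ocr === 'string') {
    extractReferences(doc.ocr).forEach(function (reference) {
      emit(doc._id, reference);
    });
  }
}
```

Each emitted value is just the raw citation string, which matches the "simple text strings" described above. Real OCR text is much messier than this sketch allows for (hyphenated line breaks, misread characters, running headers), so the patterns would need tuning against actual articles.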

Of course, what we really want is to be able to convert these strings into clickable links to the actual reference. In the spirit of "We don't need no stinkin' parser" (see also Resolving free-form citations) I've added a little search icon that, when clicked, attempts to find the reference in BioStor. In the screenshot above, both references have been found in BioStor.

Obvious next steps are to add other resolvers (such as CrossRef for DOIs), do the resolution before the references are displayed (rather than wait for the user to click on the search icon), and even more usefully, display a list of articles that cite each article in BioStor (in the example above, both cited articles should "know" that they have been cited).
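The "cited by" list is essentially the citation links turned around. As a rough sketch, assuming the cited strings have already been resolved to BioStor article identifiers, the inversion could look like the following. The function name, the input shape, and the identifiers are made up for illustration.

```javascript
// Hypothetical helper: invert "article -> articles it cites" into
// "article -> articles that cite it".
// `cites` maps a citing article id to an array of resolved cited article ids.
function buildCitedByIndex(cites) {
  var citedBy = Object.create(null); // no inherited keys to collide with ids

  Object.keys(cites).forEach(function (citingId) {
    cites[citingId].forEach(function (citedId) {
      if (!citedBy[citedId]) {
        citedBy[citedId] = [];
      }
      // Record each citing article only once per cited article
      if (citedBy[citedId].indexOf(citingId) === -1) {
        citedBy[citedId].push(citingId);
      }
    });
  });
  return citedBy;
}

// Usage with invented identifiers
var index = buildCitedByIndex({
  'article-A': ['article-B', 'article-C'],
  'article-D': ['article-B']
});
console.log(index['article-B']); // articles citing article-B
```

In CouchDB the same idea could presumably be expressed as a view that emits the cited article's identifier as the key and the citing article's identifier as the value, so that querying by key returns everything citing a given article.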

Whether an article in BioStor has a list of citations depends on the success of the regular expressions in extracting them, and whether the database has the OCR text. The current version of BioStor didn't originally store the OCR text, so I'm slowly adding that to the references. Other examples of articles with citations include Northeast African racers of the Platyceps rhodorachis complex (Reptilia: Squamata: Colubrinae) and Synopsis of the Neotropical mantid genus Pseudacanthops Saussure, 1870, with the description of three new species (Mantodea: Acanthopidae).

Long term, adding linked citations to BioStor means we get a step closer to being able to offer readers an experience like PubMed Central (PMC), where articles in PMC are linked to other articles in PMC that either cite, or are cited by, that article. I think there's a case for a PubMed Central-like service for biodiversity literature (see Possible project: A PubMed Central for taxonomy) that rescues that literature from the ghetto much of it currently resides in, and instead makes it a first-class citizen of the wider digital biodiversity landscape.