Wednesday, April 27, 2016

Possible project: Biodiversity dashboard

Despite the well-deserved scepticism about dashboards voiced by Shannon Mattern @shannonmattern (see "Mission Control: A History of the Urban Dashboard", which I discovered by reading "Ignore the Bat Caves and Marketplaces: let's talk about Zoning" by Leigh Dodds @ldodds), I'm intrigued by the idea of a dashboard for biodiversity. We could display several different kinds of information in a single place.

Immediate information

There are sites such as Global Forest Watch Fires that track events that affect biodiversity and which are happening right now. Some of this data can be harvested (e.g., from the NASA Fire Information for Resource Management System, FIRMS) to show real-time forest fires. Below is an image for the last 24 hours.
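As an illustration of how little is involved in harvesting such a feed, here is a minimal Python sketch. The download URL and column names are my assumptions about the CSV files FIRMS publishes (they change as new sensors and collections appear), so treat them as placeholders rather than a working recipe.

```python
# Sketch: harvest the last 24 hours of MODIS active-fire detections from
# NASA FIRMS and summarise them. The URL and column names are assumptions;
# check the FIRMS site for the current file location and schema.
import csv
import io

import requests

FIRMS_CSV = ("https://firms.modaps.eosdis.nasa.gov/data/active_fire/"
             "c6/csv/MODIS_C6_Global_24h.csv")  # assumed URL

def fetch_fires(url=FIRMS_CSV):
    """Return a list of dicts, one per fire detection."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return list(csv.DictReader(io.StringIO(response.text)))

if __name__ == "__main__":
    fires = fetch_fires()
    print(f"{len(fires)} detections in the last 24 hours")
    for fire in fires[:5]:
        # 'latitude', 'longitude' and 'acq_date' are the usual FIRMS columns
        print(fire["latitude"], fire["longitude"], fire["acq_date"])
```

A dashboard would simply poll a feed like this on a schedule and drop the points onto a map.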

We could also have Twitter feeds of these sorts of events.

Historical trends

We could have longer-term trends, such as changes in forest cover, or changes in abundance of species over time.

Trends in information

We could have feeds that show us how our knowledge is changing. For example, we could have a map of data from the newest datasets uploaded to GBIF, the latest DNA barcodes, etc.
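By way of illustration, here is a rough Python sketch of a "what's new" feed built on the GBIF registry API. I'm assuming the /v1/dataset endpoint and its created and title fields behave as I remember; in practice you'd want a registry query ordered by registration date rather than sorting a few pages client-side.

```python
# Minimal sketch of a "newest datasets" feed from the GBIF registry.
# Endpoint and field names are assumptions; check api.gbif.org for specifics.
import requests

def recent_datasets(n=10, pages=5, page_size=100):
    """Fetch a few pages of registered datasets and return the n newest."""
    datasets = []
    for offset in range(0, pages * page_size, page_size):
        r = requests.get("https://api.gbif.org/v1/dataset",
                         params={"limit": page_size, "offset": offset},
                         timeout=30)
        r.raise_for_status()
        datasets.extend(r.json().get("results", []))
    datasets.sort(key=lambda d: d.get("created", ""), reverse=True)
    return datasets[:n]

if __name__ == "__main__":
    for d in recent_datasets():
        print(d.get("created", "?"), d.get("title", "(untitled)"))
```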

As an example, @wikiredlist tweets every time an article about a species from the IUCN Red List is edited on the English-language Wikipedia.

Imagine several such streams, both as lists and as maps. As another example, a while ago I created a visualisation of new species discoveries:

Summary

I'm aware of the irony of drawing inspiration from a critique of dashboards, but I still think there is value in having an overview of global biodiversity. But we shouldn't lose sight of the fact that such views will be biased and constrained, and in many cases it will be much easier to visualise what is going on (or, at least, what our chosen sources reveal) than to effect change on those trends that we find most alarming.

Thursday, April 21, 2016

Searching GBIF by drawing on a map

One of my frustrations with the GBIF portal is that it is hard to drill down and search in a specific area. You have to zoom in and then click for a list of occurrences in the current bounding box of the map. You can't, for example, draw a polygon such as the boundary of a protected area and search within that area.

As a quick and dirty hack I've thrown together a demo of how it would be possible to search GBIF by drawing the search area on a map. Once a shape is drawn, we call GBIF's API to retrieve the first 300 occurrences from that area. The code is here, and below is a live demo (see also http://bl.ocks.org/rdmpage/43073981694598fecab725a16e890d3b).

This demo uses Leaflet.draw to draw shapes, and Wicket to convert the GeoJSON shape to the WKT format required by GBIF's API. I was inspired by the Leaflet.draw plugin with options set demo by d3noob, and used it as a starting point.
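The demo itself is client-side JavaScript, but the underlying API call is simple enough to sketch in Python. GBIF's occurrence search accepts a geometry parameter containing a WKT polygon and returns at most 300 records per request; the polygon below is just a made-up example.

```python
# Rough equivalent of the demo's server call: send a WKT polygon to GBIF's
# occurrence search API and pull back up to 300 records that fall inside it.
import requests

def occurrences_in_polygon(wkt, limit=300):
    """Query GBIF for occurrences whose coordinates fall inside the polygon."""
    r = requests.get("https://api.gbif.org/v1/occurrence/search",
                     params={"geometry": wkt, "limit": limit,
                             "hasCoordinate": "true"},
                     timeout=30)
    r.raise_for_status()
    return r.json().get("results", [])

if __name__ == "__main__":
    # A small, hypothetical polygon around part of Scotland
    # (GBIF expects counter-clockwise winding for polygons)
    wkt = "POLYGON((-5.0 55.5, -3.5 55.5, -3.5 56.5, -5.0 56.5, -5.0 55.5))"
    for occ in occurrences_in_polygon(wkt)[:10]:
        print(occ.get("scientificName"), occ.get("decimalLatitude"),
              occ.get("decimalLongitude"))
```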

Friday, April 15, 2016

GBIF and impact: CrossRef, FundRef, and Altmetric

For anyone doing research or involved in scientific infrastructure, demonstrating the "impact" of those activities is becoming increasingly important. This has fostered a growth industry in "altmetrics", tools to track how research gathers attention outside academia (of course, we can argue whether attention is the same as impact).

For an organisation such as GBIF there's a clear need to show that it has impact on the field of biodiversity (and beyond), especially to its funders (which are ultimately national governments). To do this GBIF needs to track how its data is used by the research communities, both to do science and to inform policy. This is hard to do, especially if there's a limited culture of data citation. It occurs to me that another way to tackle this problem is to invert it by looking not at the impact of GBIF, but at GBIF as a source of impact.

For a moment let's replace GBIF with Wikipedia. We can ask "what is the impact of Wikipedia on the research community?" For example, Wikipedia is the 8th largest referrer of DOIs, which means that Wikipedia is a major source of traffic to academic publishing sites. All those Wikipedia pages which cite the primary literature are driving traffic to those articles.

Conversely, if we regard Wikipedia as important we can use citations of articles in Wikipedia pages as a measure of a researcher's impact. For example, according to Impactstory I am "Wikitastic" because 11 Wikipedia pages cite articles that I am an author of (authorship is discovered using my ORCID 0000-0002-7101-9767).

Likewise, Altmetric tracks citations on Wikipedia, so a paper like the one below may have minimal social media impact but still has the gray ring in its donut signifying that it's been cited on Wikipedia.

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

Hence, we can look at Wikipedia in two different ways. The first is to ask "what is the impact of Wikipedia?", the second is to assume that Wikipedia has impact, and then use that as one measure of the impact of researchers (how "Wikitastic" you are).

So, let's go back to GBIF. Let's leave aside the question of whether GBIF has impact and instead imagine that we can use GBIF as a measure of impact ("GBIFtastic", sorry, that was unforgivable).

Example 1: From DOI to FundRef to GBIF

In a previous post I discussed the lack of mosquito data in GBIF and how I plugged this gap by using open data cited by a paper in eLife. This paper has the DOI 10.7554/elife.08347, and if I plug that into CrossRef's search engine I can get back some information on the funders of that paper:

Research funded by Sir Richard Southwood Graduate Scholarship | Rhodes Scholarships | National Institutes of Health (RAPIDD program, R01-AI069341, R01-AI091980, R01-GM08322, N01-A1-25489) | Wellcome Trust (#095066, Vecnet, #099872) | National Aeronautics and Space Administration (#NNX15AF36G) | Biotechnology and Biological Sciences Research Council | Bill and Melinda Gates Foundation (#OPP1053338, #OPP52250) | Studienstiftung des Deutschen Volkes | Directorate-General for Research and Innovation (#21803) | European Centre for Disease Prevention and Control (ECDC/09/018)

Now, this gives me a connection between funding agencies, a paper they funded, and the data in GBIF. For example, the Bill and Melinda Gates Foundation (doi:10.13039/100000865) funded doi:10.7554/elife.08347 which generated data in GBIF doi:10.15468/7apj8n.

I suspect that the Bill and Melinda Gates Foundation don't know that they've funded data gathering that has ended up in GBIF, but I imagine they'd be interested, especially if that could be quantified (even better if we can demonstrate reuse). The process of linking funders to data can be largely automated, especially as more and more papers are now automatically linked to funder information. The link between publications and data in GBIF can be harder to establish, but at least one publisher (Pensoft) has established a direct feed from publication to GBIF.
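Here is a sketch of what that automated funder lookup might look like, using CrossRef's REST API (the works endpoint returns FundRef entries under message.funder). The fields read below are the ones I'd expect from that API; anything beyond that is an assumption.

```python
# Sketch: ask CrossRef for a paper's metadata and list its FundRef entries.
import requests

def funders_for_doi(doi):
    """Return the funder records CrossRef holds for a DOI."""
    r = requests.get("https://api.crossref.org/works/" + doi, timeout=30)
    r.raise_for_status()
    return r.json()["message"].get("funder", [])

if __name__ == "__main__":
    for f in funders_for_doi("10.7554/elife.08347"):
        print(f.get("name"), f.get("DOI", "no FundRef DOI"),
              ", ".join(f.get("award", [])))
```

Run over every publication linked to a GBIF dataset, a loop like this would give GBIF a first cut of the funder-to-data graph.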

So, what if GBIF could computationally discover the funders of the data it holds, and could then communicate that to the funders? I think there's scope here for funders to take an interest in GBIF and its role in expanding the reuse (and hence impact) of data that funders have paid for. Demonstrating to governments that national funding agencies are supporting research that generates data that ends up in GBIF may help make the case that GBIF is worth supporting.

Example 2: GBIF as altmetric source

The little Altmetric donuts that we see on papers require sources of data, such as Twitter, Wikipedia, blogs, etc. For example, the Plant List dataset I recently put into GBIF has a DOI (doi:10.15468/btkum2) and this has received some attention, so it has an Altmetric donut (wouldn't it be nice if GBIF showed these on dataset pages?).
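For what it's worth, pulling the numbers behind such a donut is straightforward. The sketch below uses Altmetric's public DOI endpoint; the response field names are my best recollection, so treat them as assumptions and check the Altmetric API documentation.

```python
# Sketch: fetch the attention data behind an Altmetric donut for a dataset DOI.
import requests

def altmetric_summary(doi):
    """Return Altmetric's attention data for a DOI, or None if it has none."""
    r = requests.get("https://api.altmetric.com/v1/doi/" + doi, timeout=30)
    if r.status_code == 404:          # no attention recorded for this DOI
        return None
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    data = altmetric_summary("10.15468/btkum2")   # The Plant List dataset DOI
    if data:
        print("score:", data.get("score"))
        print("tweets:", data.get("cited_by_tweeters_count"))
        print("Wikipedia pages:", data.get("cited_by_wikipedia_count"))
    else:
        print("no attention recorded")
```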

What if GBIF itself became a source that Altmetric scanned when measuring impact? What if having your papers mentioned in GBIF (for example, as a source of distributional data or a taxonomic name) contributed to the visible impact of that work? Wouldn't that encourage people to mobilise their data? Wouldn't that help people discover the wider conversation about the data and associated publications? Wouldn't that help generate more impact for papers that might otherwise gather less attention?

Summary

I realise that I've somewhat avoided the question of the impact of GBIF itself, which is something that also needs to be tackled (and this is one reason why GBIF assigns DOIs to datasets and downloads to support data citation), but I think that may be only a part of the bigger picture. If we assume GBIF is impactful to start with, then I think we can start to think how GBIF can help persuade researchers and funders that contributing to GBIF is a good thing.

The Zika virus, GBIF, and the missing mosquitoes

One of GBIF's goals is to provide up-to-date, comprehensive data on the distribution of species. Although GBIF's taxonomic and geographic scope is global, not all species are equal, in the sense that the need for information on some species is potentially much more pressing. An example is mosquitoes of the genus Aedes, such as the species A. aegypti and A. albopictus that spread the Zika virus.

Over the last few days I discovered how poor GBIF's coverage of these two vectors is, and a way to fill that gap quickly. Like many things I work on, I stumbled across the problem by accident. GBIF has released a report on whether GBIF data are fit for modelling species distributions. The publicity material included a psychedelic image showing a map for Aedes aegypti from a recent eLife paper by Kraemer et al. (The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus, http://doi.org/10.7554/elife.08347).

[Figure: detail of the global Aedes aegypti distribution map from Kraemer et al. 2015]

Curious, I read the paper and the phrase "GBIF" occurs only once in the text:

we selected 10,000 occurrence records of Aedes species from the Global Biodiversity Information Facility (http://www.gbif.org), omitting all records of Ae. aegypti and Ae. albopictus. This dataset is intended to reflect biases in mosquito reporting in areas which are suitable for Aedes mosquitoes.

So, GBIF data on these two mosquitoes wasn't used. A quick look at what GBIF had for Aedes albopictus makes it clear why GBIF data played such a small role:

[Map of GBIF occurrence records for Aedes albopictus (GBIF taxon 1651430)]

Compare this with the data shown in the Scientific Data paper (http://doi.org/10.1038/sdata.2015.35) on the data that underpins the eLife paper.

[Figure 3 from the Scientific Data paper, Kraemer et al. 2015]

Note the striking lack of any GBIF records from Brazil. Fortunately the data collected by Kraemer et al. are freely available in Dryad http://doi.org/10.5061/dryad.47v3c, so I grabbed the files, fussed about with them a bit (https://github.com/rdmpage/global-distribution-arbovirus-vectors) to get them into the format required by GBIF, and uploaded them. Below is the data for Aedes albopictus in GBIF:

[Updated map of GBIF occurrence records for Aedes albopictus (GBIF taxon 1651430)]

This is looking more like it! If you are more interested in Aedes aegypti then that data is also available.
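The "fussing about" step was essentially reshaping a table of point records into a Darwin Core occurrence file that GBIF can ingest. Below is a rough Python sketch of that kind of conversion; the input column names (VECTOR, X, Y, YEAR) are hypothetical stand-ins for whatever the source files actually use, while the output columns are standard Darwin Core terms.

```python
import csv

# Darwin Core terms GBIF expects in a simple occurrence file
DWC_FIELDS = ["occurrenceID", "scientificName", "decimalLatitude",
              "decimalLongitude", "year", "basisOfRecord"]

def to_darwin_core(src_path, dst_path, scientific_name):
    """Rewrite a source CSV of point records as a Darwin Core occurrence CSV."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=DWC_FIELDS)
        writer.writeheader()
        for i, row in enumerate(csv.DictReader(src)):
            writer.writerow({
                "occurrenceID": f"kraemer2015:{i}",   # made-up but stable identifier
                "scientificName": scientific_name,
                "decimalLatitude": row["Y"],          # hypothetical source columns
                "decimalLongitude": row["X"],
                "year": row["YEAR"],
                "basisOfRecord": "HumanObservation",
            })

if __name__ == "__main__":
    to_darwin_core("albopictus_points.csv", "albopictus_dwc.csv", "Aedes albopictus")
```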

Questions

This example raises a number of questions:

  1. How come GBIF had such poor data to start with? If GBIF is going to be relevant to people who need biodiversity data, in some cases urgently, then there's an argument to be made that GBIF should be targeting species such as disease vectors that are likely to be in demand in the future.
  2. Why wasn't the latest data in GBIF? One reason GBIF's data was poor is that the relevant data was widely scattered in the literature (Kraemer et al. list over 1000 papers that they looked at, not including the unpublished sources). This clearly requires a lot of effort to assemble. But once assembled, why wasn't it deposited in GBIF? Is it a case of researchers not thinking this would be a useful thing to do, or not knowing how to do it?
  3. What about all the other data out there? This particular example was prompted by me wondering what that hideous image on the GBIF post was, reading the eLife article, wondering where the data was, and having sufficient access to GBIF to simply upload the data. This is clearly not a scalable approach. How can we improve this process? Can we automate harvesting relevant data from repositories such as Dryad so that this data gets fed into GBIF automatically? Should GBIF become a data repository itself so authors can store their data there? And how do we retrospectively harvest all the rest of the data languishing in the scientific literature?

Side note

One aspect of the Kraemer et al. data I've not focussed on is that it is derived from the literature, most of it unpublished, but some is in the primary literature (the list of papers is missing from the Dryad repository, but I obtained a copy from Moritz Kraemer (@MOUGK) and it's now on GitHub). This means we can link individual occurrence records back to the evidence for that occurrence (i.e., the paper that made the assertion that this species of mosquito is found at this locality). This means we can (a) provide provenance for the data, and (b) provide credit to the authors of that observation. I hope to explore this topic in a subsequent blog post.

References

Kraemer, M. U. G., Sinka, M. E., Duda, K. A., Mylne, A., Shearer, F. M., Brady, O. J., … Hay, S. I. (2015, July 7). The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data. Nature Publishing Group. http://doi.org/10.1038/sdata.2015.35

Kraemer, Moritz U. G., Sinka, Marianne E., Duda, Kirsten A., Mylne, Adrian, Shearer, Freya M., Brady, Oliver J., … Hay, Simon I. (2015). Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence. Dryad Digital Repository. http://doi.org/10.5061/dryad.47v3c

Kraemer, M. U., Sinka, M. E., Duda, K. A., Mylne, A. Q., Shearer, F. M., Barker, C. M., … Hay, S. I. (2015, June 30). The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus. eLife. eLife Sciences Organisation, Ltd. http://doi.org/10.7554/elife.08347

The Biodiversity Heritage Library at 10: Let's talk impact interview by @UDCMRK

As part of BHL's "Celebrating 10 years of inspiring discovery through free access to biodiversity knowledge" at the NHM and Kew Gardens in London, I was interviewed by Martin Kalfatovic (@UDCMRK). We chatted about BHL, the work I've been doing on BioStor, and the future of BHL. I haven't had the courage to watch it myself, but if you want to watch an academic giving Roger Hyam a run for his money in the "flappy hands" stakes, and not knowing whether to look at the camera, Martin, or towards the distant horizon, then here is the video.

Friday, April 08, 2016

Guest post: 10 explanations for messy data, by Bob Mesibov

The following is a guest post by Bob Mesibov, who has contributed to iPhylo before.

Like many iPhylo readers, I deal with large, pre-existing compilations of biodiversity data. The compilations come from museums, herbaria, aggregation projects and government agencies. For simplicity in what follows and to avoid naming names, I'll lump all these sources into a single fictional entity, the PAI (for Projects, Agencies and Institutions).

The datasets I get from the PAI typically contain duplicate records, inconsistencies in content and format, unexplained data gaps, data in wrong fields, fields improperly used, no flagging of doubtful data, etc. Data cleaning consumes most of the time I spend on a data project. Cleaning can take weeks, analysing the cleaned data takes minutes, reporting the results of the analysis takes hours or days. (Example: doi:10.3897/BDJ.2.e1160)

I can understand how datasets get messy. Data entry errors account for a lot of the problems I see, and I make data entry errors myself. But the causes of messiness are one thing and its cure is another. The custodians of those data compilations don't seem to have done much (or any) data checking. Why not?

When I'm brave enough to ask that question, I usually get a polite response from the PAI. Here are 10 explanations I've heard for inadequate data checking and cleaning:

(1) The data are fit for use, as-is. No cleaning is needed, because the data are fit for some use, and the PAI is satisfied with that. One data manager wrote to me in an email: '...even records with lower certainty, in this case an uncertain identification, can be useful at a coarser resolution. Although we have no idea as to the reliability of the identification to the species or even genus they are likely correctly identify[ing] something as at least an animal, arthropod and possibly to class so the record is suitable for analysis at that level.'

(2) The PAI is exposing its data online. The crowd will spot any problems and tell the PAI about them.

I've previously pointed out (doi:10.3897/zookeys.293.5111) how lame this explanation is. As a strategy for data cleaning it's slow, piecemeal and wildly optimistic. At best, it accumulates data-cleaning 'tickets' with no guarantee that any will ever be closed. What I hear from the PAI is 'We're aware of problems of that kind and are hoping to find a general solution, rather than deal with a multitude of individual cases'. Years pass and the individual cases don't get fixed, so interested members of the crowd lose faith in the process and stop reporting problems.

(3) No one outside the PAI is allowed to look at the whole dataset, and no one inside the PAI has the time (or skills) to do data checking and cleaning.

This is a particularly nice Catch-22. I once offered to check a portion of the PAI's data holdings for free, and was told that PAI policy was that the dataset was not to be shared with anyone outside the PAI. The same data were freely available on the PAI's website in bits and pieces through a database search page.

(4) The PAI is migrating to new database software next year. Data cleaning will be part of the migration.

No, it won't. Note that this response isn't always simple procrastination, because it's sometimes the case that the PAI's database has only limited capabilities for data checking and editing. PAI staff are hopeful that checking and editing will be easier with the new software. They'll be disappointed.

(5) The person who manages data is on leave / was seconded to another project / resigned and hasn't been replaced yet / etc.

This is another way of saying that no one inside the PAI has the time to do data checking and cleaning. When the data manager returns to work or gets replaced, data checking and cleaning will have the same low priority it had before. That's why it didn't get done.

(6) Top management says any data cleaning would have to be done by outside specialists, but there's not enough money in the current budget to hire such people.

Not only a Catch-22, but a solid, long-term excuse, applicable in any financial year. It would cost less to train PAI staff to do the job in-house.

(7) The PAI would prefer to use a specialist data tool to clean data, like OpenRefine, but hasn't yet got up to speed on its use.

The PAI believes in magic. OpenRefine will magically clean the data without any thought required on the part of PAI staff. The magic will have to be applied repeatedly, because the sources of the duplications, gaps and errors haven't been found and squashed.

(8) The PAI staff best qualified to check and clean the data aren't allowed to do so.

IT policy strictly forbids anyone but IT staff from tinkering with the PAI database, whose integrity is sacrosanct. A very specific request from biodiversity staff may be ticketed by IT staff for action, but global checking and editing is out of the question. IT staff are not expected to understand biodiversity studies, and biodiversity staff are not expected to understand databases.

This explanation is interesting because it implies a workaround. If a biodiversity staffer can get a dump from the database as a simple text file, she can do global checking and editing of the data using the command line or a spreadsheet (a sketch of what that might look like is at the end of this post). The cleaned data can then be passed to IT staff for incorporation into the database as replacement data items. The day that happens, pigs will be seen flying outside the PAI windows.

(9) The PAI datasets have grown so big that global data checking and editing is no longer possible.

Harder, yes; impossible, no. And the datasets didn't suddenly appear, they grew by accretion. Why wasn't data checking and editing done as data was added?

(10) All datasets are messy and data users should do their own data cleaning.

The PAI shrugs its shoulders and says 'That's just the way it is, live with it. Our data are no messier than anyone else's'.

I've left this explanation for last because it begs the question. Yes, users can do their own data cleaning — because it's not that hard and there are many ways to do it. So why isn't it done by highly qualified, well-paid PAI data managers?
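To illustrate the workaround mentioned under explanation (8): the kind of global check it makes possible really is simple. The Python sketch below flags exact duplicate records and tallies the spellings used in a single field of a database dump; the file name and field name are hypothetical.

```python
# Sketch: audit a plain-text dump for exact duplicates and inconsistent values.
import csv
from collections import Counter

def audit(path, field="country"):
    """Flag exact duplicate records and tally the values used in one field."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    dupe_counts = Counter(tuple(sorted(r.items())) for r in rows)
    n_dupes = sum(c - 1 for c in dupe_counts.values() if c > 1)
    print(f"{len(rows)} records, {n_dupes} exact duplicates")
    # Inconsistent spellings ('Australia', 'AUSTRALIA', 'Austalia') show up here
    for value, n in Counter(r.get(field, "") for r in rows).most_common():
        print(f"{n:6d}  {value!r}")

if __name__ == "__main__":
    audit("pai_dump.csv", field="country")   # hypothetical dump and field name
```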

Towards a biodiversity knowledge graph now in RIO

After experimenting with a dynamic, online version of my notes "Towards a biodiversity knowledge graph", I've published a static version in RIO: doi:10.3897/rio.2.e8767.