Following on from earlier posts exploring how to map DNA barcodes and putting barcodes into GBIF it's time to think about taking advantage of what makes barcodes different from typical occurrence data. At present GBIF displays data as dots on a map (as do I in http://iphylo.org/~rpage/bold-map/). But barcodes come with a lot more information than that. I'm interested in exploring how we might measure and visualise biodiversity using just sequences.
Based on a talk by Zachary Tong (Going Organic - Genomic sequencing in Elasticsearch) I've started to play with n-gram searches on DNA barcodes using Elasticsearch, an open source search engine. The idea is that we break the DNA sequence into every possible "word" of length n (also called a k-mer or k-tuple, where k = n).
For example, for n = 5, the sequence GTATCGGTAACGAACTT would look like this:
GTATCGGTAACGAACTT GTATC TATCG ATCGG TCGGT CGGTA GGTAA GTAAC TAACG AACGAA ACGAAC CGAACT GAACTT
The sequence GTATCGGTAACGAACTT comes from Hajibabaei and Singer (2009) who discussed "Googling" DNA sequences using search engines (see also Kuksa and Pavlovic, 2009). If we index sequences in this way then we can do BLAST-like searches very quickly using Elasticsearch. This means it's feasible to take a DNA barcode and ask "what sequences look like this?" and return an answer qucikly enoigh for a user not to get bored waiting.
Another nice feature of Elasticsearch is that it supports geospatial queries, so we can ask for, say, all the barcodes in a particular region. Having got such a list, what we really want is not a list of sequences but a phylogenetic tree. Traditionally this can be a time consuming operation, we have to take the sequences, align them, then input that alignment into a tree building algorithm. Or do we?
There's growing interest in "alignment-free" phylogenetics, a phrase I'd heard but not really followed up. Yang and Zhang (2008) described an approach where every sequences is encoded as a vector of all possible k-tuples. For DNA sequences k = 5 there are 45 = 1024 possible combinations of the bases A, C, G, and T, so a sequence is represented as a vector with 1024 elements, each one is the frequency of the corresponding 5-tuple. The "distance" between two sequences is the mathematical distance between these vectors for the two sequences. Hence we no longer need to align the sequences being comapred, we simply chunk them into all "words" of 5 bases in length, and compare the frequencies of the 1024 different possible "words".
In their study Yang and Zhang (2008) found that:
We compared tuples of diﬀerent sizes and found that tuple size 5 combines both performance speed and accuracy; tuples of shorter lengths contain less information and include more randomness; tuples of longer lengths contain more information and less random- ness, but the vector size expands exponentially and gets too large and computationally ineﬃcient.
Putting this all together, I've built a rough-and-ready demo that takes some DNA barcodes, puts them on a map, then enables you to draw a box on a map and the demo will retrieve the DNA barcodes in that area, compute a distance matrix using 5-tuples, then build a NJ tree, all on the fly in your web browser.
This is all very crude, and I need to explore scalability (at the moment I limit the results to the first 200 DNA sequences found), but it's encouraging. I like the idea that, in principle, we could go to any part of the globe, ask "what's there?" and get back a phylogenetic tree for the DNA barcodes in that area.
This also means that we could start exploring phylogenetic diversity using DNA barcodes, as Faith & Baker (2006) wanted a decade ago:
...PD has been advocated as a way to make the best-possible use of the wealth of new data expected from large-scale DNA “barcoding” programs. This prospect raises interesting bio-informatics issues (discussed below), including how to link multiple sources of evidence for phylogenetic inference, and how to create a web-based linking of PD assessments to the barcode–of-life database (BoLD).
The phylogenetic diversity of an area is essentially the length of the tree of DNA barcodes, so if we build a tree we have a measure of diversity. Note that this contrasts with other approaches, such as Miraldo et al.'s "An Anthropocene map of genetic diversity" which measured genetic diversity within species but not between (!).
There are also data management issues. I'm exploring downloading DNA barcodes, creating a Darwin Core Archive file using the Global Genome Biodiversity Network (GGBN) data standard, then converting the Darwin Core Archive into JSON and sending that to Elasticsearch. The reason for the intermediate step of creating the archive is so that we can edit the data, add missing geospatial informations, etc. I envisage having a set of archives, hosted say on GitHub. These archives could also be directly imported into GBIF, ready for the time that GBIF can handle genomic data.
- Faith, D. P., & Baker, A. M. (2006). Phylogenetic diversity (PD) and biodiversity conservation: some bioinformatics challenges. Evol Bioinform Online. 2006; 2: 121–128. PMC2674678
- Hajibabaei, M., & Singer, G. A. (2009). Googling DNA sequences on the World Wide Web. BMC Bioinformatics. Springer Nature. https://doi.org/10.1186/1471-2105-10-s14-s4
- Kuksa, P., & Pavlovic, V. (2009). Efficient alignment-free DNA barcode analytics. BMC Bioinformatics. Springer Nature. https://doi.org/10.1186/1471-2105-10-s14-s9
- Miraldo, A., Li, S., Borregaard, M. K., Florez-Rodriguez, A., Gopalakrishnan, S., Rizvanovic, M., … Nogues-Bravo, D. (2016, September 29). An Anthropocene map of genetic diversity. Science. American Association for the Advancement of Science (AAAS). https://doi.org/10.1126/science.aaf4381
- Yang, K., & Zhang, L. (2008, January 10). Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Research. Oxford University Press (OUP). https://doi.org/10.1093/nar/gkn075