What's in a database?

Databases are key to our ability to do meaningful biological work. This ongoing blog series explores some common databases and visualizes their contents. In this installment, we’ll start by looking at the taxonomic diversity of samples within one database and visualizing it.

Introduction

It is likely common sense to many that biological databases (and repositories of knowledge) are incomplete and non-randomly populated. Incompleteness and non-random sampling arise from many issues: a species may be more heavily studied because it is deemed more important (e.g. data derived from Homo sapiens are common for obvious reasons, as are data from Salmonella and E. coli), or for historical reasons (so-called “model organisms” like A. thaliana, M. musculus, and D. rerio).

While this may seem to be common knowledge, our group has found visualizing these disparities to be incredibly useful when presenting our research to outsiders.

The Short Read Archive

The Short Read Archive (SRA; now officially named the Sequence Read Archive) is a major repository of short (and now long) read sequencing data. Like many NCBI databases, it is closely linked with international partners in Europe (EBI) and Japan (DDBJ). It houses the sequencing data for 1.99 million sequencing runs generated from 1.5 million unique samples. The total size of the SRA is at or above 1.3 petabytes. The SRA serves as a public repository, allowing individuals (and groups) to re-analyze data using new methodologies and to compare data sets from many different labs in new and meaningful ways.

Distribution of Samples by Taxonomy

Below is a graphical representation of the number of samples currently within the SRA. The largest circle is the root node and contains all observations available in the taxonomy. The next set of circles represents the top-level taxa as defined by the NCBI (Eukaryotes, Bacteria, Viruses, Archaea); within those circles is the next level, continuing down until you reach the species level. Each circle’s size indicates the number of samples it contains. This graphic is interactive, so feel free to click on it to explore different nodes. (Apologies in advance for the slow responsiveness; this is a very early attempt at this graph style. I’ll continue hacking away at it to try to clean it up and make it a little smoother soon.)
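The nesting described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the graphic: it assumes each species arrives as a semicolon-separated lineage string (like the “Campylobacter;Campylobacter sp.” entry further down), the function name `build_hierarchy` is hypothetical, and the lineages in the example are shortened to three levels for readability.

```python
def build_hierarchy(lineage_counts, sep=";"):
    """Nest lineage strings into a tree of {"name", "value", "children"} nodes.

    Each node's "value" is the number of samples at or below that node,
    which is the quantity a circle's size is meant to encode.
    """
    root = {"name": "root", "value": 0, "children": {}}
    for lineage, count in lineage_counts.items():
        node = root
        node["value"] += count
        for taxon in lineage.split(sep):
            # Create the child circle the first time this taxon is seen.
            node = node["children"].setdefault(
                taxon, {"name": taxon, "value": 0, "children": {}}
            )
            node["value"] += count
    return root


# Illustrative input: counts come from the top-ten list in this post,
# but the three-level lineages are simplified assumptions.
example_counts = {
    "Eukaryotes;Danio;Danio rerio": 21353,
    "Eukaryotes;Bos;Bos taurus": 3111,
    "Bacteria;Campylobacter;Campylobacter sp.": 6775,
}
tree = build_hierarchy(example_counts)
```

Storing the cumulative count on every node keeps the "size of the circle equals samples inside it" rule explicit, whatever plotting library ends up drawing the circles.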

These data were generated by counting unique NCBI taxonomy IDs using SRAdb, available via the Meltzer lab. 1
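For readers who want a feel for what that counting step looks like, here is a rough Python sketch. It is a stand-in rather than the R code used for this post: it assumes the SRAdb metadata is available as a SQLite database with a `sample` table carrying `taxon_id` and `scientific_name` columns (check these names against the actual schema before relying on them), and it runs against a tiny in-memory table with made-up accessions so it is self-contained.

```python
import sqlite3


def count_samples_by_taxon(conn):
    """Return (taxon_id, scientific_name, n_samples) rows, largest first."""
    query = """
        SELECT taxon_id, MIN(scientific_name) AS scientific_name,
               COUNT(*) AS n_samples
        FROM sample
        WHERE taxon_id IS NOT NULL
        GROUP BY taxon_id
        ORDER BY n_samples DESC, taxon_id
    """
    return conn.execute(query).fetchall()


def make_demo_db():
    """Build a toy database; table layout and accessions are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample ("
        "sample_accession TEXT, taxon_id INTEGER, scientific_name TEXT)"
    )
    rows = [
        ("DEMO0001", 7955, "Danio rerio"),
        ("DEMO0002", 7955, "Danio rerio"),
        ("DEMO0003", 7955, "Danio rerio"),
        ("DEMO0004", 9913, "Bos taurus"),
        ("DEMO0005", 9913, "Bos taurus"),
        ("DEMO0006", 3702, "Arabidopsis thaliana"),
        ("DEMO0007", None, None),  # samples without a taxon are skipped
    ]
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", rows)
    return conn


demo_conn = make_demo_db()
taxon_counts = count_samples_by_taxon(demo_conn)
```

Pointing `sqlite3.connect` at a real metadata file instead of `":memory:"` would be the obvious next step, provided the table and column names match.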

General observations

The top ten sampled species in the database are:

  1. Danio rerio (21353)
  2. Drosophila melanogaster (14418)
  3. Arabidopsis thaliana (10569)
  4. Caenorhabditis elegans (7342)
  5. Campylobacter;Campylobacter sp. (6775)
  6. Bos taurus (3111)
  7. Chlorocebus sabaeus (2557)
  8. Glycine max (2115)
  9. Macaca mulatta (2052)
  10. Mycobacterium tuberculosis complex bacterium (1509)

There are currently 15,145 species represented within the SRA; 10,467 of these species are represented by a single sample.

Assuming that we would want an even sampling of species (the total number of samples divided by the number of species), we’d currently expect 11.2 samples per species. Currently 1,026 species are at that level or higher, and 926 are at more than double that even sampling rate.
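The arithmetic behind those numbers can be sketched as follows. This assumes “even sampling” means the mean number of samples per species, which is my reading rather than a documented formula; the function name `even_sampling_summary` and the toy counts are purely illustrative.

```python
def even_sampling_summary(counts):
    """Summarize how far per-species sample counts are from even sampling.

    `counts` maps a species name to its number of samples.
    """
    n_species = len(counts)
    total_samples = sum(counts.values())
    expected = total_samples / n_species  # even-sampling rate (the mean)
    values = list(counts.values())
    return {
        "expected_per_species": expected,
        "at_or_above": sum(1 for c in values if c >= expected),
        "more_than_double": sum(1 for c in values if c > 2 * expected),
        "singletons": sum(1 for c in values if c == 1),
    }


# Toy data: three singletons and one heavily sampled species.
toy_counts = {"species_a": 1, "species_b": 1, "species_c": 1, "species_d": 9}
summary = even_sampling_summary(toy_counts)
```

Even in the toy data the pattern from the SRA shows up: a single heavily sampled species pulls the mean well above what most species actually have.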

If you explore within the Eukaryotes, you’ll see that Z. mays, a single species, has more samples associated with it than many entire genera within the Plantae, which makes sense from a historical and societal perspective.

While you’re exploring the bacterial side of things, you may notice a large circle within the genus Campylobacter simply titled Campylobacter sp.; in fact, this label is the largest entry within that genus. This points to the difficulty of defining (and assigning) species-level membership within this clade.

Of course, I’ve noticed some issues as well. Salmonella as a genus, and the species within it, are largely absent from the tables. I’m currently chalking this up to a parsing issue with the NCBI taxonomy, but I need to investigate further to verify that. Since I know this absence isn’t correct, I’ll dig deeper into the data to make sure we’re parsing taxonomies correctly.

References:

1: Yuelin Zhu, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. “SRAdb: query and use public next-generation sequencing data from within R”, http://dx.doi.org/10.1186/1471-2105-14-19

Code for reproduction.

As always, code for this post is available here.