Wednesday, May 1, 2013

Releasing phylogenetic data


One approach that I have taken in this blog to popularizing the use of networks in phylogenetic analysis has been to investigate published data using network techniques. However, this is often difficult because the data have not been made publicly available (e.g. Phylogenetic position of turtles: a network view).

I am not the only person to find fault with the failure to release phylogenetic data, although there are recognized reasons why data sometimes cannot be released. Razib Khan at the Gene Expression blog recently had this to say (Why not release data for phylogenetic papers?):
Last month I noted that a paper on speculative inferences as to the phylogenetic origins of Australian Aborigines was hampered in its force of conclusions by the fact that the authors didn't release the data to the public (more accurately, peers). There are likely political reasons for this in regards to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. In regards to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs). 
Yesterday I noticed PLoS Genetics published a paper on the genomics of Middle Eastern populations ... The results were moderately interesting, but bravo to the authors for putting their new data set online. The reason is simple: reading the paper I wanted to see an explicit phylogenetic tree/graph to go along with their figures (e.g., with TreeMix). Now that I have their data I can do that.
In this particular case the data were made available on the homepage of one of the authors, which is better than nothing but is clearly less than ideal. There are a number of formal repositories for phylogenetic data, all of which should have greater longevity than any personal homepage, including:
TreeBASE
Dryad
The first of these databases has a long history of storing phylogenetic trees and their associated datasets. It has not yet lived up to its full potential, but people like Rod Page are pushing for it to do so eventually.

Dryad is a more general data repository (i.e. not just for phylogenetic data), and its use is now encouraged by many of the leading journals. Systematic Biology, for example, makes its use mandatory, at least for data during the submission process, and also for "data files and/or other supplementary information related to the paper" for the published version.

Phylogeny databases are not without their skeptics, however. For example, Rod Page (Data matters but do data sets?) has noted:
How much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses. 
Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much"). 
But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would needs to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.
However, all of this raises the question that seems to me to be central to science. Science is unique in being based primarily on evidence rather than expert opinion, and therefore the core of science must be direct access to the original evidence, rather than some statistical summary of it or someone's opinion about it. How can I evaluate evidence if I don't have access to it? How can I verify it, explore it, or re-analyze it? Being given the raw data (e.g. the sequences) is one thing, but being given the data you actually analyzed and based your conclusions on (e.g. the aligned sequences) is another thing entirely.

In short, if you won't openly give me your dataset then I don't see how you can call yourself a serious scientist.
