Wednesday, November 12, 2014

Archiving of phylogenetics data

The draft Minimum Information about a Phylogenetic Analysis standard (Leebens-Mack et al. 2006) suggests that all relevant information about every published phylogenetic analysis should be archived, so that later researchers can scrutinize it, either for validation or for re-use. The issues here are both preservation of the information (data and analysis protocols) and open access to it.

In this blog we have already pointed out that there has been criticism of the bioinformatics part of this archiving, where there have been repeated claims that many computer programs are poorly maintained (Poor bioinformatics?) as well as poorly archived (Archiving of bioinformatics software).

Anyone who has ever tried to get data out of a biologist will know that the data-related part of the standard is no better. My own success rate at requesting data, from all areas of biology and not just phylogenetics, is less than 20% over the past 25 years. The responses have been, in order: (i) no response (>50%), (ii) "a student / postdoc / colleague has the data not me", and (iii) "I have moved recently and don't know where the data are". My most recent effort, to get the data from Collard et al. (2006), was ultimately unsuccessful even after several requests.

For phylogenetics, this situation has recently been quantified and analyzed by Magee et al. (2014). They tried to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 published studies. Of these, 54 (25%) had at least some part of the data (alignment or tree) archived in an online repository, and 91 (42%) were obtained by direct solicitation, but in 72 (33%) of cases nothing could be obtained even after three requests. Overall, complete datasets (both tree and alignment) were available for only 40% of the studies.
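As a quick check on those figures, here is a minimal sketch that recomputes the percentages from the counts quoted above. The variable and function names are my own, chosen only for illustration; the counts are the ones reported in the paragraph.

```python
# Counts reported by Magee et al. (2014), as quoted in the paragraph above.
total_studies = 217
outcomes = {
    "archived online (at least partly)": 54,
    "obtained by direct solicitation": 91,
    "nothing obtained after three requests": 72,
}

def percentages(counts, total):
    """Return each count as a whole-number percentage of the total."""
    return {label: round(100 * n / total) for label, n in counts.items()}

shares = percentages(outcomes, total_studies)
for label, pct in shares.items():
    print(f"{label}: {outcomes[label]} of {total_studies} = {pct}%")

# The three outcomes are mutually exclusive and cover every study.
print("studies accounted for:", sum(outcomes.values()))
```

The three categories add up to all 217 studies, and the rounded shares match the 25%, 42% and 33% given in the text.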

The authors note that the data were more likely to be deposited in online archives and/or shared upon request when the publishing journal had a strong data-sharing policy. Furthermore, there has been a positive impact of recent policy initiatives and infrastructural changes involving data repositories. The TreeBASE phylogenetic-data repository has existed for more than 20 years, but its use has been sporadic. However, the recent establishment of the Joint Data Archiving Policy by a consortium of journals, which requires the submission of data to online archives as a condition of publication, and the concomitant establishment of the Dryad repository for evolutionary and ecological data, have been followed by a surge in the archiving of data.

So, all in all, things have been no better on the bio side than the informatics side of bioinformatics.

Stoltzfus et al. (2012) have identified a number of possible barriers to successful data archiving, including lack of awareness of options and policies, perception that benefits do not justify burden, and an active desire to restrict data access. Importantly, there are also a number of practical issues even for those people who do wish to archive their data:
  • inconvenience of gathering complete data and metadata
  • inconvenience of format conversions needed for archiving
  • frustration when some data don't fit the archive's data model
  • poor and undocumented archive submission interfaces.
For the readers of this blog, the third issue is possibly the most important one: all current repositories are based on a tree model for phylogenetics, and therefore network phylogenies are frustrating to deal with.
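To make the mismatch concrete, here is a minimal sketch of why a network does not fit a tree data model. It assumes a phylogeny stored as a list of directed (parent, child) edges; the function name and the toy taxa are hypothetical and are not taken from any repository's actual schema. In a rooted tree every node has at most one parent, whereas a network contains at least one reticulate node with two or more.

```python
from collections import defaultdict

def reticulate_nodes(edges):
    """Return the nodes that have more than one parent in a list of (parent, child) edges."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sorted(node for node, ps in parents.items() if len(ps) > 1)

# A rooted tree: every node has at most one parent.
tree_edges = [("root", "x"), ("root", "y"),
              ("x", "A"), ("x", "B"), ("y", "C"), ("y", "D")]

# A rooted network: the hypothetical hybrid taxon H descends from both x and y.
network_edges = [("root", "x"), ("root", "y"),
                 ("x", "A"), ("x", "H"), ("y", "H"), ("y", "C")]

print(reticulate_nodes(tree_edges))     # []
print(reticulate_nodes(network_edges))  # ['H']
```

Any storage format that records a single parent per node has nowhere to put the second edge into H, which is the sort of thing that makes archiving a network frustrating.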

In order to improve the overall situation, there are explicit suggestions from Cranston et al. (2014) for best practices when archiving. They have ten simple guidelines that, if followed, will result in you providing open access to your data and analyses, even if the publishing journal does not force you to do it.

Footnote: I have been reminded that archiving data in PDF format is inappropriate. Trying to extract text (such as a dataset) from a PDF file can be difficult, because there is no standard format for storing the text. Consequently, different PDF readers will extract the text in different ways, and it is possible that in all cases the output will need extensive manual re-formatting, in order to recover the original text formatting that went into the PDF file. In my experience, Google Chrome may do the least-worst job.


Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.

Cranston K, Harmon LJ, O'Leary MA, Lisle C (2014) Best practices for data sharing in phylogenetic research. PLoS Currents Jun 19;6.

Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, Cunningham CW, dePamphilis C, deSalle R, Doyle JJ, Eisen JA, Gu X, Harshman J, Jansen RK, Kellogg EA, Koonin EV, Mishler BD, Philippe H, Pires JC, Qiu YL, Rhee SY, Sjölander K, Soltis DE, Soltis PS, Stevenson DW, Wall K, Warnow T, Zmasek C (2006) Taking the first steps towards a standard for reporting on phylogenies: Minimum Information About a Phylogenetic Analysis (MIAPA). OMICS 10: 231-237.

Magee AF, May MR, Moore BR (2014) The dawn of open access to phylogenetic data. PLoS One 9: e110268.

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie EL, Kumar S, Rosauer DF, Vos RA (2012) Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis. BMC Research Notes 5: 574.
