Monday, February 19, 2018

We want to publish our phylogenetic data – including networks, but where?

(This is a joint post by Guido Grimm and David Morrison)

About five years ago, David wrote two posts regarding issues with the public availability and release of phylogenetic data. Since then, the situation has improved somewhat for science, but we have still not progressed as far as we should have. In this post, we will share some anecdotes, and give some tips on where you can store your networks.

David asked an interesting question: Why are phylogeneticists so reluctant to present their actual data in the first place? In this schematic, this asks why the arrow connecting "Data Product" to "Reality" is so often missing.

The archiving of primary data (the data matrix) and its derivatives (e.g. phylogenies) should be obligatory, so that the basic data are publicly available, the results can be verified by others, and any errors can be identified and eliminated.

There is no good reason to hold it back. While we may have put a lot of effort into our data sets, if we don't share them then this effort will only benefit ourselves, and it will become null and void after we have published our paper. We also may leave science (via retirement or something else), or otherwise stop maintaining our professional homepage, and at this point our data legacy will likely drift off in a puff of smoke.

On the other hand, when we make the data publicly available, others can take it from there. Indeed, we may even meet new collaborators, if they are interested in the same line of research. Just as importantly, we are no longer responsible for keeping it at hand for eventual requests. This is one of the chief advantages of sites like ResearchGate, which automate this sort of administrative effort.

If the re-users of our data are honest scientists, then they will (of course) cite us for our data matrix. But if they have to sit down to harvest the gene banks, and re-create the matrix from scratch, then why should they cite the people that produced the data? More importantly, making data sets accessible enables teachers / lecturers to make use of them in their courses, having at hand one publication (or more, when the data have been re-used) for discussion.

It also gives developers some test datasets for new algorithms and programs. For instance, Guido's best-cited (first-author) paper on Google Scholar (Grimm et al. Evolutionary Bioinformatics 2006) has been cited 66 times (as of February 13th), mainly because the maple dataset has become a tricky test set for a large number of bioinformatic papers, passed from one bioinformatician to the other. It is for this reason that our compilation of verified empirical network datasets was first created.

Finally, for most of us our research is made possible by public money, so we do not actually own our data, personally. It really belongs to the public, who funded it, so there should be public access to it — we cannot monopolize expertise that is created by public funding.

As an aside, it avoids responses such as these (all of which are real, and quite common):
I cannot send you the data because I don't have a backup on my new computer
I don't have the data, only the late Ph.D. student has it, who has left the lab
I can't find the data, because I have changed universities
I'm not sure if I can share the data, as it was a collaborative project
I expect to be a co-author, even if I do no further work.

Tides have turned, somewhat

There are quite a few journals that now expect that each phylogenetic data matrix, and the inferred tree, is stored within a public repository. For instance, BioMed Central journals such as BMC Evolutionary Biology (now owned by Springer-Nature) expect you to store your (phylogenetic) data in a public repository such as TreeBase or Dryad. However, few journals enforce the documentation of primary data; most treat it only as a recommendation (e.g. Nature, the same publisher's flagship journal, does not enforce it). The easiest way to enforce the archiving is to refuse to review any manuscript where the data have not already been deposited.

TreeBase, which is free of charge, is still only an option when you deal with simple data: a matrix and a tree, or a few trees inferred from the matrix; network-formatted genealogies cannot be stored, only trees. When you have networks, a compilation of analysis files, or trees with labels that do not refer to species (in a taxonomic sense), it is not an option. For example, the TreeBase submission of the above-mentioned maple data is defunct, because the maximum likelihood trees were based on individual clones or consensus sequences. The main result, "bipartition networks" based on the ML bootstrap pseudoreplicate samples, cannot be handled; and naked matrices are not published anymore (you need a tree to go with the matrix).

Dryad has no file type or content limitations, but it charges a fee (although quite a modest one). A few of the journals enforcing data storage, such as Systematic Biology, cover the cost, but Springer-Nature's BMC Evolutionary Biology does not; considering what they charge for a publication (> $2,500), they should. Springer-Nature has now launched an open research initiative of its own, with open data components (e.g. LOD), but so far little has changed (see e.g. the fresh paper on Citrus in Nature); and it would be surprising if making data openly accessible came with no extra costs for the authors.

Ideally, there would be an online supplement

Providing the data as an open-access online supplement directly linked to the paper seems to be a natural choice. Everyone that finds the paper can then directly access the related data and main analysis files.

Journals such as PeerJ, or the Public Library of Science (PLoS) series, make it possible to upload a wide range of file formats as online supplements. While most journals now have online supplements, relatively few allow uploading of, for example, a packed (zipped) archive file. This is the only possible option when you want to not only provide the raw NEXUS file and a NEWICK-formatted text file with the tree, but also e.g. the bootstrap samples or the Bayesian sampled topology file and the support consensus networks based on them. This requires an annotated (graphically enhanced) Split-NEXUS file generated with SplitsTree, or a fully annotated matrix, or the outcome of a median network analysis from the NETWORK program. There is usually some limitation on the maximum size (storage space generates real costs for the publisher).
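As a rough illustration of what such a packed archive could look like, here is a minimal Python sketch that bundles a matrix, a tree and a bootstrap sample into one ZIP file, together with a checksum manifest so that readers can verify what they downloaded. The file names, the toy contents and the manifest layout are our own assumptions for this example, not a requirement of any journal or repository.

```python
import hashlib
import io
import zipfile


def build_supplement_archive(files):
    """Pack named file contents into an in-memory ZIP with a checksum manifest.

    `files` maps archive member names to their content as bytes.
    Returns a tuple (zip_bytes, manifest), where manifest maps each
    member name to its SHA-256 hex digest.
    """
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in sorted(files.items()):
            archive.writestr(name, data)
        # One "digest  name" line per file, in the style of common checksum tools.
        lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
        archive.writestr("MANIFEST.sha256", "\n".join(lines) + "\n")
    return buffer.getvalue(), manifest


# Hypothetical file names and toy contents, for illustration only.
example_files = {
    "matrix.nex": b"#NEXUS\nBEGIN DATA;\nEND;\n",
    "best_tree.nwk": b"((A,B),(C,D));\n",
    "bootstrap_trees.nwk": b"((A,B),(C,D));\n((A,C),(B,D));\n",
}

zip_bytes, manifest = build_supplement_archive(example_files)

# Write the archive to disk only when run as a script.
if __name__ == "__main__":
    with open("supplement.zip", "wb") as handle:
        handle.write(zip_bytes)
```

Real analysis files would of course be read from disk rather than typed in; the point is only that a single archive with a manifest keeps the matrix, the trees and the support samples together.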

A nice touch of PeerJ is that each supplement file gets its own DOI, similar to Dryad's annotation procedure, making the uploaded data archives/files individually referenceable.

More alternatives

Most, if not all, journals with good online supplement storage are open access journals, where you have to pay to publish: currently a bit over $1,000 for PeerJ, and ~$1,500 for e.g. PLoS ONE (PeerJ also has the option of individual life-long publishing plans). Perhaps a basic problem with open access is that it moves the financial cost from the reader to the writer, which is not good if you have little funding to do your work.

So what do you do when you publish in a traditional journal with few online storage options?

One alternative is Figshare, where you have up to 20 GB storage for free, and can upload a variety of file types, including images, spreadsheets, and data archives. Uploading images and data to repositories like Dryad or Figshare may also be a good option in cases where publication agreements still contain restrictive copyright clauses, as they occasionally do. Before submitting the final version, you simply publish the data and figures there under a CC-BY licence, and reference them accordingly in your copyrighted book chapter or paper.

An increasing number of institutions now also provide the possibility to store (permanently) research data produced at the institution. So, it's always worth asking the IT department or the university library about the availability of such an option. And some countries, such as Austria, have launched their own open data platforms.

Uploading data files to ResearchGate is probably not an option for network-related research, as it allows only PDF files (they then need to be text-extractable). As phylogeneticists, we want to distribute our (usually NEXUS-, FASTA- or PHYLIP-formatted) matrices and primary inference-result files, so that they become part of the scientific world.
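For readers less familiar with these formats, the following Python sketch turns an aligned FASTA text into a bare-bones NEXUS DATA block. It is only a minimal illustration under some assumptions of ours: the sequences are already aligned, the taxon labels contain no whitespace or NEXUS punctuation, and the function names parse_fasta and to_nexus are made up for this example. Established tools handle many more cases.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def to_nexus(records, datatype="DNA"):
    """Format aligned (label, sequence) pairs as a minimal NEXUS DATA block."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned (non-empty input, equal lengths)")
    nchar = lengths.pop()
    width = max(len(label) for label, _ in records)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={nchar};",
        f"  FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for label, seq in records:
        lines.append(f"  {label.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Toy alignment with invented labels, for illustration only.
fasta_text = ">Acer_a\nACGT\nAC-T\n>Acer_b\nACGTACGT\n"
nexus_text = to_nexus(parse_fasta(fasta_text))
```

Plain-text formats like these are exactly why a repository that accepts arbitrary files matters: the matrix can be opened and re-analysed directly, which a PDF does not allow.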

There is also the possibility of generic cloud storage, which is often free, or at least available to users of certain operating systems or programs. Unfortunately, this is entirely a short-term option, no different from a personal home page; and it may be a target for hackers, anyway.

Final comment

One frequently raised issue not mentioned so far is the concept of a gray area of social or personal responsibility. That is, there might be unforeseen or undesirable consequences to a general obligation to provide full documentation of primary data. This is always an issue in the medical and social sciences, for example, where the exposure of personal data might lead to societal problems. Even in palaeontology, there may be legitimate concerns about, for example, making the GPS coordinates of special fossil sites publicly available.

However, there is nothing to stop an author highlighting such issues at the time of their manuscript submission, and the editor asking for comments from the reviewers, who are supposed to be experts in the particular field.

Some further relevant links (please feel free to point out more)

Join the discussion by using our comments below; or provide your answer to the open question at the PeerJ Questions portal: Should we be forced to publish primary data integral to our results?

Twitter has the hashtag #OpenData, used by people / organisations promoting or providing open data, as well as those who are (so far) only allegedly dedicated to it (such as Springer-Nature and RELX-Elsevier).

The open source software environment RStudio for R allows knitting and publishing HTML files (and other file formats) on its RPubs server, which can be a convenient way to permanently store your R-obtained results and scripts (e.g. Potts & Grimm, 2017).

Preprint servers such as arXiv, bioRxiv, and PeerJ Preprints also provide the option to attach supplementary data files (there are usually size limits), using a wide range of file formats including zipped archives. arXiv had to end its data storage programme in 2013, but still accepts "ancillary files" for raw data, code, etc. "up to a few MB" (which should be enough for a phylogenetic data matrix).

For Austrian/German-speaking users, as noted above, there is Austria's new Open Data Portal (ODP). So far, German is the only language selectable from the drop-down menu, but there seem to be no registration restrictions.
