Wednesday, September 9, 2015

Sharing supplementary data: a linguist's perspective


The Problem of Data Sharing

In 2013, Nature launched a discussion on how to increase the reproducibility of research in the biomedical sciences. David addressed the problem of data sharing more concretely in two blog posts from 2013, one on the practice of releasing phylogenetic data, and one on its public availability. In my opinion, this topic concerns not only the sciences but also, and perhaps especially, the humanities. At a time when more and more data for anthropological research is being produced, and analyses that were formerly conducted manually are being automated, we need to increase the awareness of scholars and publishers that publishing only the results is not enough to meet rigorous scientific standards.

When I discuss these issues with colleagues, they bring up various reasons why scholars would not release their data along with a publication. Apart from practical considerations (which mostly concern publishers who do not provide the infrastructure to host supplementary material transparently), scholars often also bring up personal and legal concerns: they are afraid that their painstaking efforts in collecting a dataset will have been in vain once they release the data to the public, since other researchers might take over and run analyses they would like to run themselves in the future. Furthermore, there are situations in which data cannot simply be published in full, because the compilers of the dataset do not hold the copyright on the data itself.

In my opinion, all of these problems can be solved directly, and there is no reason to publish a study in which at least a part of the data is not provided in supplementary form.

Practical Solutions: GitHub and Zenodo

Regarding practical issues, one can use GitHub to host and curate data and computer source code. The advantage of using GitHub is that it allows for distributed revision control: all changes and modifications to the data can be tracked, and all of those who contributed to the compilation of a given dataset can receive the credit they deserve. Even for anonymous data submission, there is a simple solution available through GitHub Gist: by uploading data to a Gist (a flat repository which does not allow for a folder structure) without being logged in with a GitHub account, one can anonymously host the data for review purposes.

If one doesn't completely trust the longevity of GitHub in hosting the data forever (it might well happen that GitHub changes its payment policy at some point in the future, or limits the number of open repositories), there is Zenodo, which offers full GitHub integration and allows storage of up to 2 GB per dataset. For more information regarding the possibilities that the GitHub integration offers, see this blog post by Robert Forkel. Zenodo was developed by CERN and, although they write on their website that their sustainability plan is still in development, it is quite unlikely that they will run out of funding within the next twenty years.

A recommended way of hosting data is to start with an anonymous Gist when submitting a paper. Once the paper has been accepted, the Gist is converted to a full GitHub repository. By setting up an official release of this repository, the data is automatically transferred to Zenodo, where it is permanently stored and provided with a DOI.

Sharing Data Prevents Data Theft

Regarding the personal concern that one's data might be "stolen" by other scholars, I think it is important to make clear that at the core of all research we build on the work of our colleagues. Nobody should own a dataset, just as nobody should own a theory. It is clear that while datasets (and theories) are still being developed, we may decide to be careful in sharing them with certain colleagues. But once they are finished and ready to use, we should allow our colleagues to run their own analyses on them.

What is important, and still missing, is both an established practice and the infrastructure to give credit for the work of others. In linguistics, we lack journals, such as BMC Bioinformatics, that publish articles on source code or databases. There are, however, recent attempts to address these problems in linguistic research (see, for example, this blog post by Martin Haspelmath).

But even while this infrastructure is lacking, it should be made clear that scholars gain more than they risk when submitting their data along with their publication. If the data turns out to be useful for additional research, then they will receive credit in the form of citations, and they will even prevent others from actually stealing their data: as with ideas, data can only be stolen by falsely associating it with another name. Once the data is out along with the publication, this is not likely to happen.

Giving Something is More than Giving Nothing

Even in those cases where there are real copyright restrictions, one can make a compromise and publish an illustrative snapshot of the data and the detailed results. Computational analyses in particular produce a large amount of data as part of their results, and this data may well turn out to be interesting for other scholars. Instead of publishing just a tree or a network, we may want to see the individual character evolution that was inferred by the algorithm. And when illustrating a new algorithm for homolog detection in historical linguistics, it may be interesting for one scholar or another (but maybe also for the reviewer) to have a look at the detailed results apart from the aggregated evaluation scores.
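To make the idea of an illustrative snapshot concrete, here is a minimal sketch in Python. It is only one possible approach, not an established procedure: the function name `snapshot`, the fixed seed, and the row layout are all assumptions made for illustration. The point is that a fixed seed makes the published sample reproducible, so that readers can see exactly which rows of the restricted dataset were released.

```python
import random
from typing import Dict, List


def snapshot(rows: List[Dict[str, str]], size: int, seed: int = 42) -> List[Dict[str, str]]:
    """Return a reproducible sample of `size` rows, kept in their original order.

    `rows`, `size`, and `seed` are hypothetical names chosen for this sketch.
    """
    if size >= len(rows):
        # Nothing to sample: the dataset is already small enough.
        return list(rows)
    # A dedicated generator with a fixed seed always selects the same rows.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(rows)), size))
    return [rows[i] for i in chosen]
```

Publishing the seed together with the snapshot lets the original compilers regenerate the same subset later, without having to release the full dataset.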

Summary and Outlook

Current research practice in historical linguistics faces serious reproducibility problems. Fortunately, solutions exist for most of the practical problems of the past. What we need now is to increase the awareness among scholars that all research based on data and source code is nothing without the data and the source code. Publishing both source code and data along with a paper is easy nowadays, especially thanks to GitHub and Zenodo. Guaranteeing that one gets the credit for one's efforts in the humanities is a bit more difficult, but not impossible, and colleagues are working on solutions.

What we need in addition to the publication of the raw data itself are explicit formats for data exchange. In historical linguistics, using only NEXUS-format files is not sufficient, since the nature of our data requires its own representation. Here again, scholars are already working on a solution by trying to define and establish specific formats for data sharing in historical linguistics and typology (see this discussion on GitHub).

In an ideal future scenario that was introduced to me by Michael Cysouw, all publications involving automatic analyses should provide not only the supplementary data, but also some kind of makefile containing the code for the workflow, so that scholars can carry out the computational analyses immediately on their own computer.
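What such a workflow file could look like is sketched below in Python rather than in make's own syntax. This is a minimal illustration under stated assumptions, not a recommended tool: the names `Step`, `is_stale`, `run_workflow`, and `example_steps`, the file names, and the two toy analysis steps are all hypothetical. The sketch only reproduces the core idea of make: each step names the file it produces and the files it depends on, and a step is rerun only when its output is missing or older than its inputs.

```python
import os
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    target: str                 # file produced by this step
    sources: List[str]          # files this step depends on
    action: Callable[[], None]  # function that writes `target`


def is_stale(step: Step) -> bool:
    """A target is stale if it is missing or older than any of its sources."""
    if not os.path.exists(step.target):
        return True
    built = os.path.getmtime(step.target)
    return any(os.path.getmtime(src) > built for src in step.sources)


def run_workflow(steps: List[Step]) -> List[str]:
    """Run stale steps in the given order and return the rebuilt targets."""
    rebuilt = []
    for step in steps:
        if is_stale(step):
            step.action()
            rebuilt.append(step.target)
    return rebuilt


def example_steps(workdir: str) -> List[Step]:
    """Two hypothetical steps: normalise a wordlist, then count forms per language."""
    raw = os.path.join(workdir, "wordlist.tsv")
    clean = os.path.join(workdir, "wordlist.clean.tsv")
    counts = os.path.join(workdir, "counts.tsv")

    def normalise() -> None:
        # Drop empty lines and strip stray whitespace from every cell.
        with open(raw, encoding="utf-8") as src, \
                open(clean, "w", encoding="utf-8") as out:
            for line in src:
                if line.strip():
                    cells = [cell.strip() for cell in line.split("\t")]
                    out.write("\t".join(cells) + "\n")

    def count() -> None:
        # Count rows per language, assuming the language is the first column.
        totals: Dict[str, int] = {}
        with open(clean, encoding="utf-8") as src:
            for line in src:
                language = line.rstrip("\n").split("\t")[0]
                totals[language] = totals.get(language, 0) + 1
        with open(counts, "w", encoding="utf-8") as out:
            for language in sorted(totals):
                out.write(f"{language}\t{totals[language]}\n")

    return [Step(clean, [raw], normalise), Step(counts, [clean], count)]
```

Calling `run_workflow(example_steps("."))` in a folder that contains a tab-separated `wordlist.tsv` would run both steps; calling it a second time would do nothing, because every target is then up to date. A real makefile adds dependency ordering and much more, but even this much lets a reader rerun an analysis with a single command.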
