
Monday, September 24, 2018

Structural data in historical linguistics


The majority of historical linguists compare words to reconstruct the history of different languages, and most phylogenetic studies accordingly focus on cognate sets reflecting shared homologs across the languages under investigation. There is, however, another data type that people have tried to explore in the past. This data type is difficult for non-linguists to understand, given its very abstract nature. In the past, it has led to a considerable amount of confusion, both among linguists and among non-linguists who have tried to use it for quick (and often also dirty) phylogenetic approaches. For this reason, I figured it would be useful to introduce this type of data in more detail.

This data type can be called "structural". To enable interested readers to experiment with the data themselves, this blogpost comes along with two example datasets that we converted into a computer-readable format (with much help from David), since the original papers offered the data only as PDF files. In future blogposts, we will try to illustrate how the data can, and should, be explored with network methods. In this first blogpost, I will try to explain the basic structure of the data.

Structural data in historical linguistics and language typology

In order to illustrate the type of data we are dealing with here, let's have a look at a typical dataset, compiled by the famous linguist Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.

No. | Feature | Beijing | Suzhou | Meixian | Guangzhou
1 | The third person pronoun is tā, or cognate to it | + | - | - | -
4 | Velars palatalize before high-front vowels | + | + | - | -
7 | The qu-tone lacks a register distinction | + | - | + | -
12 | The word for "stand" is zhàn, or cognate to it | + | - | - | -

In this example, the data is based on a questionnaire that asks specific questions; for each of the languages in the sample, the dataset answers each question with either + or -. Many of these datasets are binary in nature, but this is not a necessary condition: questionnaires can also query categorical variables. For example, the major type of word order might have three categories (subject-object-verb, subject-verb-object, or other).
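To make this concrete, here is a minimal sketch (in Python) of how the table excerpt above can be encoded as binary vectors and compared. The Hamming distance used here is just an illustration of how such data can be queried, not part of Norman's method:

```python
# The Norman-style feature table from above, encoded as binary vectors
# ('+' -> 1, '-' -> 0); the feature order is: 1, 4, 7, 12.
features = {
    'Beijing':   [1, 1, 1, 1],
    'Suzhou':    [0, 1, 0, 0],
    'Meixian':   [0, 0, 1, 0],
    'Guangzhou': [0, 0, 0, 0],
}

def hamming(a, b):
    """Count the features on which two varieties disagree."""
    return sum(x != y for x, y in zip(a, b))

for x in sorted(features):
    for y in sorted(features):
        if x < y:
            print(f'{x} vs {y}: {hamming(features[x], features[y])} differing features')
```

Categorical (non-binary) answers would simply replace the 0/1 values with small sets of labelled states.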

We can also see that the questions can be very diverse. While we often use more or less standardized concept lists for lexical research (such as fixed lists of basic concepts; List et al. 2016), this kind of dataset is much less standardized, due to the nature of the questionnaire: asking for the translation of a concept is more or less straightforward, and the number of possible concepts that are useful for historical research is quite constrained. Asking a question about the structure of a language, however, be it phonological, lexical, based on attested sound changes, or syntactic, allows an incredibly large number of different possibilities. As a result, it seems close to impossible to standardize these questions across different datasets.

Although scholars often call the data based on these questionnaires "grammatical" (since many questions are directed towards grammatical features, such as word order, or the presence or absence of articles), most datasets show a structure in which questions of phonology, lexicon, and grammar are mixed. For this reason, it is misleading to talk of "grammatical datasets"; the term "structural data" seems more adequate, since this is what the datasets were originally designed for: to investigate differences in the structure of different languages, as reflected in the most famous example, the World Atlas of Language Structures (Dryer and Haspelmath 2013, https://wals.info).

Too much freedom is a restriction

In addition to mixed features that can be observed without knowing the history of the languages under investigation, many datasets (including the one by Norman shown above) also use explicitly "historical" (diachronic, in linguistic terminology) questions in their questionnaires. In the paper describing the dataset, Norman defends this practice, arguing that the goal of his study is to establish a historical classification of the Chinese dialects. With this goal in mind, it seems defensible to make use of historical knowledge, and to include observed phenomena of language change in general, and sound change in particular, when compiling a structural dataset for a group of related language varieties.

The extremely diverse nature of questionnaire items in structural datasets, however, makes their interpretation extremely difficult. This becomes especially evident when the data are used in combination with computational methods for phylogenetic reconstruction. This is problematic for two major reasons.
  1. Since the questions are by nature less restricted regarding their content, scholars can easily pick and choose features in such a way that they confirm the theory they want to confirm, rather than testing it objectively. Since scholars can select suitable features from a virtually unlimited array of possibilities, it is extremely difficult to guarantee the objectivity of a given feature collection. 
  2. If features are mixed, phylogenetic methods that work with explicit statistical models (such as the gain and loss of character states; see the sketch below) may often be inadequate to model the evolution of the characters, especially if the characters are historical. While a feature like "the language has an article" may be interpreted as a gain-loss process (at some point the language has no article, then it gains an article, then it loses it again, etc.), a feature stating the result of a process, like "the words that originally started in [k] followed by a front vowel are now pronounced as []", cannot be modeled in this way, since the feature itself describes a process.
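To see what is at stake in point 2, here is a minimal sketch of the standard two-state gain-loss model that many phylogenetic methods assume: a feature toggles between absence (0) and presence (1) at constant rates. The rates and time below are arbitrary illustrations; the point is that this machinery presupposes a feature that can be gained and lost, which is exactly what a feature describing a completed sound change is not:

```python
import math

def two_state_probs(gain, loss, t):
    """Transition probabilities after time t for a two-state
    (0 = absent, 1 = present) continuous-time Markov model."""
    total = gain + loss
    pi1 = gain / total                  # stationary frequency of 'present'
    decay = math.exp(-total * t)
    p01 = pi1 * (1.0 - decay)           # absent  -> present
    p11 = pi1 + (1.0 - pi1) * decay     # present -> present
    return {'0->0': 1.0 - p01, '0->1': p01,
            '1->0': 1.0 - p11, '1->1': p11}

# Example: a feature that is gained more slowly than it is lost.
print(two_state_probs(gain=0.2, loss=0.8, t=1.5))
```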
For these reasons, all phylogenetic studies that make use of structural data, in contrast to purely lexical datasets, should be treated with great care, not only because they tend to yield unreliable results, but more importantly because they are extremely difficult to compare across different language families, given the great freedom scholars have when compiling them. Feature collections provided in structural datasets are an interesting resource for diversity linguistics, but they should not be used to make primary claims about external language history or subgrouping.

Two structural datasets for Chinese dialects

Before I start to bore the already small circle of readers interested in these topics, it seems better to stop discussing the usefulness of structural data at this point, and to introduce the two datasets that were promised at the beginning of the post.

Both datasets target Chinese dialect classification, the former being the one proposed by Norman (2003), and the latter reflecting a new data collection that was recently used by Szeto et al. (2018) to propose a north-south split of the dialects of Mandarin Chinese, with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets. The main idea of this collection is to gather the various structural datasets that have been published in the literature in the past, and to give people interested in the data, be it for replication studies or to test alternative approaches, easy access to it in various formats.

The basic format follows the specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples of best practice for both structural and lexical datasets in historical linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and the datasets are converted to CLDF (with all languages linked to the Glottolog database, glottolog.org; Hammarström et al. 2018) and also to Nexus format. The datasets are versioned and may be updated in the future, and interested readers can study the code used to generate the specific data formats from the raw files, as well as the Nexus files, to learn how to submit their own datasets to our initiative.
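To give a flavour of what the Nexus conversion involves, here is a hypothetical minimal sketch of serializing a binary structural matrix to a NEXUS DATA block; the actual conversion code in the repository is more elaborate, and handles missing data, multistate features, and metadata:

```python
def to_nexus(features):
    """Serialize a {variety: 0/1 vector} matrix to a minimal NEXUS
    DATA block, suitable for distance- and network-based tools."""
    ntax = len(features)
    nchar = len(next(iter(features.values())))
    rows = '\n'.join(f'    {name:<12} {"".join(map(str, vec))}'
                     for name, vec in features.items())
    return ('#NEXUS\n\nBEGIN DATA;\n'
            f'  DIMENSIONS NTAX={ntax} NCHAR={nchar};\n'
            '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;\n'
            '  MATRIX\n'
            f'{rows}\n  ;\nEND;\n')

print(to_nexus({'Beijing': [1, 1, 1, 1], 'Suzhou': [0, 1, 0, 0]}))
```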

Final remarks on publishing structural datasets online

Given that we provide only two initial datasets for an enterprise whose general usefulness is highly questionable, readers might ask themselves why we are going through the pain of making data created by other people accessible through the web.

The truth is that the situation in historical linguistics and language typology has long been very unsatisfactory. Most data-based research did not supply the data along with the paper, and authors often outright refuse to share the data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the data is hampered by it being provided only in PDF format, in tables inside the paper (or even worse: in long tables in the supplement), which forces scholars wishing to check a given analysis to reverse-engineer the data from the PDF. That data is provided in a form that is difficult to access is not even necessarily the fault of the authors, since some journals restrict the form of supplementary data to PDF only, giving authors who wish to share their data in an appropriate form a difficult time.

Many colleagues think that it is time to change this, and we can only change it by offering standard ways to share our data. The CLDF and Nexus formats, in which the two Chinese datasets are now published in this open repository collection, may hopefully serve as a starting point for larger collaboration among typologists and historical linguists. Ideally, all people who publish papers that make use of structural datasets would, similar to the practice in biology where scholars submit data to GenBank (Benson et al. 2013), submit their data in CLDF and Nexus formats, so that their colleagues can easily build on their results, and test them for potential errors.

References

Benson D., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers (2013) GenBank. Nucleic Acids Research 41 (Database issue): D36-D42.

Bryant D. and V. Moulton (2004) Neighbor-Net. An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21.2: 255-265.
Campbell L. and W. Poser (2008) Language Classification: History and Method. Cambridge University Press: Cambridge.

Cathcart C., G. Carling, F. Larsson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dryer M. and M. Haspelmath (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.

Forkel R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (forthcoming) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

Hammarström H., R. Forkel, and M. Haspelmath (2018) Glottolog. Version 3.3. Max Planck Institute for Evolutionary Anthropology: Leipzig. http://glottolog.org.

List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.

Norman J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.) The Sino-Tibetan languages. Routledge: London and New York, pp 72-83.

Pritchard J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Szeto P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Zhang M., W. Pan, S. Yan, and L. Jin (2018) Phonemic evidence reveals interwoven evolution of Chinese dialects. bioRxiv.

Monday, February 19, 2018

We want to publish our phylogenetic data – including networks, but where?


(This is a joint post by Guido Grimm and David Morrison)

About five years ago, David wrote two posts regarding issues with the public availability and release of phylogenetic data. Since then, the situation has improved somewhat for science, but we still have not progressed as far as we should have. In this post, we will share some anecdotes, and give some tips on where you can store your networks.

David asked an interesting question: why are phylogeneticists so reluctant to present their actual data in the first place? In schematic terms, this asks why the arrow connecting "Data Product" to "Reality" is so often missing.


The archiving of primary data (the data matrix) and its derivatives (eg. phylogenies) should be obligatory, so that the basic data are publicly available, the results can be verified by others, and any errors can be identified and eliminated.

There is no good reason to hold the data back. We may have put a lot of effort into our data sets, but if we don't share them then this effort will benefit only ourselves, and it will become null and void after we have published our paper. We may also leave science (via retirement or something else), or otherwise stop maintaining our professional homepage, at which point our data legacy will likely drift off in a puff of smoke.

On the other hand, when we make the data publicly available, others can take it from there. Indeed, we may even meet new collaborators, if they are interested in the same line of research. Just as importantly, we are no longer responsible for keeping it at hand for any future requests. This is one of the chief advantages of sites like ResearchGate, which automate this sort of administrative effort.

If the re-users of our data are honest scientists, then they will (of course) cite us for our data matrix. But if they have to sit down to harvest the genebanks, and re-create the matrix from scratch, then why should they cite the people who produced the data? More importantly, making data sets accessible enables teachers / lecturers to make use of them in their courses, having at hand one (or more, when the data have been re-used) publications for discussion.

It also gives developers test datasets for new algorithms and programs. For instance, Guido's best-cited (first-author) paper on Google Scholar (Grimm et al. Evolutionary Bioinformatics 2006) has been cited 66 times (as of February 13th), mainly because the maple dataset has become a tricky test set for a large number of bioinformatics papers, passed on from one bioinformatician to the other. It is for this reason that our compilation of verified empirical network datasets was first created.

Finally, for most of us our research is made possible by public money, so we do not actually own our data, personally. It really belongs to the public, who funded it, so there should be public access to it — we cannot monopolize expertise that is created by public funding.

As an aside, it avoids responses such as these (all of which are real, and quite common):
  • "I cannot send you the data because I don't have a backup on my new computer"
  • "I don't have the data; only the late Ph.D. student has it, who has left the lab"
  • "I can't find the data, because I have changed universities"
  • "I'm not sure if I can share the data, as it was a collaborative project"
  • "I expect to be a co-author, even if I do no further work"

The tide has turned, somewhat

There are quite a few journals that now expect each phylogenetic data matrix, and the inferred tree, to be stored in a public repository. For instance, BioMed Central journals such as BMC Evolutionary Biology (now owned by Springer-Nature) expect you to store your (phylogenetic) data in a public repository such as TreeBASE or Dryad. However, few journals enforce the documentation of primary data (e.g. Nature, the same publisher's flagship journal, does not), but treat it only as a recommendation. The easiest way to enforce archiving would be to refuse to review any manuscript where the data have not already been deposited.

TreeBASE, which is free of charge, is only an option when you deal with simple data: a matrix and a tree, or a few trees inferred from the matrix. Network-formatted genealogies cannot be stored, only trees; so when you have networks, a compilation of analysis files, or trees with labels that do not refer to species (in a taxonomic sense), it is not an option. For example, the TreeBASE submission of the above-mentioned maple data is defunct, because the maximum likelihood trees were based on individual clones or consensus sequences. The main result, "bipartition networks" based on the ML bootstrap pseudoreplicate samples, cannot be handled; and naked matrices are not published anymore (you need a tree to go with the matrix).

Dryad has no file type or content limitations, but it charges a fee (although a quite modest one). A few of the journals enforcing data storage, such as Systematic Biology, cover the cost, but Springer-Nature's BMC Evolutionary Biology does not; given what they charge for a publication (> $2,500), they should. Springer-Nature has now launched an open research initiative with open data components (eg. LOD) of its own, but so far little has changed (see eg. the recent paper on Citrus in Nature); and it would be surprising if making data openly accessible came at no extra cost to the authors.

Ideally, there would be an online supplement

Providing the data as an open-access online supplement directly linked to the paper seems to be a natural choice. Everyone who finds the paper can then directly access the related data and the main analysis files.

Journals such as PeerJ, or the Public Library of Science (PLoS) series, make it possible to upload a wide range of file formats as online supplements. While most journals now have online supplements, relatively few allow the uploading of, for example, a packed (zipped) archive file. This is the only practical option when you want to provide not only the raw NEXUS file and a NEWICK-formatted text file with the tree, but also e.g. the bootstrap samples, or the Bayesian sampled topology file and the support consensus networks based on them, an annotated (graphically enhanced) split-NEXUS file generated with SplitsTree, a fully annotated matrix, or the outcome of a median network analysis from the NETWORK program. There is usually some limitation on the maximum size (storage space generates real costs for the publisher).

A nice touch of PeerJ is that each supplement file gets its own DOI, similar to Dryad's annotation procedure, making the uploaded data archives/files individually citable.


More alternatives

Most, if not all, journals with good online supplement storage are open access journals, where you have to pay to publish: currently a bit over $1,000 for PeerJ, and around $1,500 for e.g. PLoS ONE (PeerJ also has the option of individual life-long publishing plans). Perhaps a basic problem with open access is that it moves the financial cost from the reader to the writer, which is not good if you have little funding to do your work.

So what do you do when you publish in a traditional journal with few online storage options?

One alternative is figshare, where you have up to 20 GB of storage for free, and can upload a variety of file types, including images, spreadsheets, and data archives. Uploading images and data to repositories like Dryad or figshare may also be a good option where restrictive copyright clauses are still occasionally found in publication agreements. Before submitting the final version, you simply publish the data and figures there under a CC-BY licence, and reference them accordingly in your copyrighted book chapter or paper.

An increasing number of institutions now also provide the possibility of (permanently) storing research data produced at the institution. So, it is always worth asking the IT department or the university library about the availability of such an option. And some countries, such as Austria, have launched their own open data platforms.

Uploading data files to ResearchGate is probably not an option for network-related research, as it allows only PDF files (which then need to be text-extractable). As phylogeneticists, we want to distribute our (usually NEXUS-, FASTA- or PHYLIP-formatted) matrices and primary inference-results files, so that they become part of the scientific world.

There is also the possibility of generic cloud storage, which is often free, or at least available to users of certain operating systems or programs. Unfortunately, this is entirely a short-term option, no different from a personal home page; and it may be a target for hackers, anyway.


Final comment

One frequently raised issue not mentioned so far is the concept of a gray area of social or personal responsibility. That is, there might be unforeseen or undesirable consequences to a general obligation to provide full documentation of primary data. This is always an issue in the medical and social sciences, for example, where the exposure of personal data might lead to societal problems. Even in palaeontology, there may be legitimate concerns about, for example, making the GPS coordinates of special fossil sites publicly available.

However, there is nothing to stop an author highlighting such issues at the time of their manuscript submission, and the editor asking for comments from the reviewers, who are supposed to be experts in the particular field.

Some further relevant links (please feel free to point out more)

Join the discussion by using our comments below; or provide your answer to the open question at the PeerJ Questions portal: Should we be forced to publish primary data integral to our results?

Twitter has the hashtag #OpenData, used by people / organisations promoting or providing open data, as well as those who are (so far) only allegedly dedicated to it (such as Springer-Nature and RELX-Elsevier).

The open source software environment RStudio for R allows knitting and publishing HTML files (and other file formats) on their RPubs server, which can be a convenient way to permanently store your R-derived results and scripts (e.g. Potts & Grimm, 2017).

Preprint servers such as arXiv, bioRxiv, and PeerJ Preprints also provide the option to attach supplementary data files (there are usually size limits), using a wide range of file formats including zipped archives. arXiv had to end its data storage programme in 2013, but still accepts "ancillary files" for raw data, code, etc. "up to a few MB" (which should be enough for a phylogenetic data matrix).

For Austrian/German-speaking users, as noted above, there is Austria's new Open Data Portal (ODP). So far, German is the only language selectable from the drop-down menu, but there seem to be no registration restrictions.

Wednesday, September 16, 2015

Some new additions to the dataset database


Recently, I have added three new datasets to the database of "gold standards" that might be used to evaluate network algorithms. All three are different to what has previously been included, and so I will briefly discuss them here.

Pedigree data

I have included a known pedigree from a small group of thoroughbred stallions (Eclipse dataset) for which there are mitochondrial D-loop (control region) sequences. Pedigrees are networks, not trees, whenever there is inter-breeding among close relatives, and so their inclusion in the database is needed.

There are practical problems with including more pedigrees. Most of the known pedigrees do not have readily available sequence data associated with them, as the data collected have mainly been for features associated with disease syndromes. Conversely, most of the available sequence data are not associated with known pedigrees, although for humans they are often taken from known social / linguistic / geographical groups (usually based on the place of birth of all four grandparents).

Language data

The database currently contains only a few examples from the social sciences, notably some experimental manipulations from stemmatology. However, there is so far nothing from linguistics, mainly because the phylogenetic history of languages is often poorly known. Nevertheless, languages form networks whenever there is borrowing of words (ie. loan words) between languages (usually as a result of geographical contact), and so their inclusion is desirable.

I have now included one dataset (the List dataset) taken from what appears to be the best-curated source of linguistic data, the Indo-European Lexical Cognacy Database. Known loan words are explicitly tagged in this source; and the phylogenetic relationships of many Indo-European languages are also tolerably well known (eg. see Ethnologue: Languages of the World).

Simulated data

I have not previously included simulated data, for two reasons. First, such data can easily be generated anew each time a set is required; and even if this is impractical then there are readily available datasets online (eg. see the compilation at utcs Phylogenetics). Second, and more importantly, simulations are based on a model (eg. using Brownian motion, Ornstein–Uhlenbeck, or Markov chains), and therefore they model only a subset of reality. Simulations are useful for situations involving a few well-defined variables, but they are much less useful for multivariate data such as occur in phylogenetics.

Nevertheless, I have included one well-known dataset, the Caminalcules (Camin dataset). These data were simulated manually back in the 1960s, and they include morphological features for both extant and fossil organisms. Over the years, the data have been used for many pedagogic purposes in the teaching of systematics, particularly in the U.S.A. (see Pasta have no phylogeny, so don't try to give them one). The data are strictly tree-like, and they do match real datasets in a number of ways (see Sokal 1983). However, there are also known ways in which they differ detectably from real data (see Holman 1986; Wirth 1993).

References

Holman EW (1986) A taxonomic difference between the Caminalcules and real organisms. Systematic Zoology 35: 259-261.

Sokal RR (1983) A phylogenetic analysis of the Caminalcules. I. The data base. Systematic Zoology 32: 159-184.

Wirth U (1993) Caminalcules and Didaktozoa: imaginary organisms as test-examples for systematics. In: Opitz O, Lausen B, Klar R (eds) Information and Classification: Concepts, Methods and Applications, pp. 421-433. Springer, Berlin.

Wednesday, September 9, 2015

Sharing supplementary data: a linguist's perspective


The Problem of Data Sharing

In 2013, Nature launched a discussion on how to increase the reproducibility of research in the biomedical sciences. David addressed the problem of data sharing more concretely in two blog posts from 2013: one on the practice of releasing phylogenetic data, and one on its public availability. In my opinion, this topic concerns not only the sciences, but also, and even specifically, the humanities. At a time when more and more data for anthropological research are being produced, and formerly manual analyses are being automated, we need to increase the awareness among scholars and publishers that publishing only the results is not enough to meet rigorous scientific standards.

When discussing these issues with colleagues, various reasons have been brought up as to why scholars do not release their data along with a publication. Apart from practical considerations (which mostly concern the publishers, who do not provide the infrastructure to host supplementary material transparently), scholars often also bring up personal and legal concerns: they are afraid that their painstaking efforts in collecting a dataset will have been in vain once they release the data to the public, since other researchers might take over and run analyses they would like to run themselves in the future. Furthermore, there are situations when the data simply cannot be published completely, because the compilers of the dataset do not hold the copyright on the data itself.

In my opinion, all of these problems can be solved directly, and there is no reason to publish a study in which at least a part of the data is not provided in supplementary form.

Practical Solutions: GitHub and Zenodo

Regarding practical issues, one can use GitHub to host and curate data and computer source code. The advantage of using GitHub is that it allows for distributed revision control: all changes and modifications to the data can be tracked, and all of those who contributed to the compilation of a given dataset can receive the credit they deserve. Even for anonymous data submission, there is a simple solution available with GitHub Gist: by uploading data to a Gist (a flat repository that does not allow a folder structure) without being logged in to a GitHub account, one can anonymously host the data for review purposes.

If one does not completely trust the longevity of GitHub in hosting the data forever (it might well happen that GitHub changes its payment policy at some point in the future, or limits the number of open repositories), there is Zenodo, which offers full GitHub integration and allows storage of up to 2 GB per dataset. For more information regarding the possibilities that the GitHub integration offers, see this blog post by Robert Forkel. Zenodo was developed by CERN and, although they write on their website that their sustainability plan is still in development, it is quite unlikely that they will run out of funding within the next twenty years.

As a recommended way of hosting data, one would start with an anonymous Gist when submitting a paper. This would then be converted to a full GitHub repository once the paper has been accepted. By setting up an official release of this repository, the data would be automatically transferred to Zenodo, where it is permanently stored and provided with a DOI.
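For those who prefer to push data to Zenodo directly, rather than via the GitHub integration, something along the following lines should work. This is only a sketch against Zenodo's documented REST deposit API; the token and file name are placeholders, and the deposition still needs metadata and a publish step afterwards:

```python
import requests

API = 'https://zenodo.org/api'
TOKEN = 'YOUR-ZENODO-TOKEN'  # placeholder: create one in your Zenodo account settings

# Create a new, empty deposition draft.
r = requests.post(f'{API}/deposit/depositions',
                  params={'access_token': TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# Upload a data file into the deposition's file bucket.
with open('dataset.nex', 'rb') as handle:  # placeholder file name
    requests.put(f"{deposition['links']['bucket']}/dataset.nex",
                 params={'access_token': TOKEN}, data=handle).raise_for_status()

print('Draft created:', deposition['links']['html'])
```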

Sharing Data Prevents Data Theft

Regarding the personal concerns that one's data might be "stolen" by other scholars, I think it is important to make clear that at the core of all research we build on the work of our colleagues. Nobody should own a dataset, just as nobody should own a theory. It is clear that at the stage of developing datasets (as well as theories), we may decide to be careful in sharing them with certain colleagues. But once they are finished and ready to use, we should allow our colleagues to run their own analyses on them.

What is important, and missing here, is an established practice, but also the infrastructure support to give credit to the work of others. In linguistics, we lack journals, such as BMC Bioinformatics, that publish articles on source code or databases. There are, however, recent attempts to address these problems in linguistic research (see, for example, this blog post by Martin Haspelmath).

But even while this infrastructure is lacking, it should be made clear that scholars win more than they risk when submitting their data along with their publication. If the data turns out to be useful for additional research, then they will receive credit in the form of citations, and they will even prevent others from actually stealing their data — as with ideas, data can only be stolen by falsely associating it with another name. Once the data is out along with the publication, this is not likely to happen.

Giving Something is More than Giving Nothing

Even in those cases where there are real copyright restrictions, one can make a compromise and publish an illustrative snapshot of the data along with the detailed results. Computational analyses, especially, produce a large amount of data as part of their results, and these data may well turn out to be interesting for other scholars. Instead of publishing just a tree or a network, we may want to see the individual character evolution that was inferred by the algorithm. And when illustrating a new algorithm for homolog detection in historical linguistics, it may be interesting for one scholar or another (and maybe also for the reviewer) to have a look at the detailed results, apart from the aggregated evaluation scores.

Summary and Outlook

Current research practice in historical linguistics faces serious reproducibility problems. Fortunately, solutions exist for most of the practical problems of the past. What we need now is to increase the awareness among scholars that all research based on data and source code is nothing without the data and the source code. Publishing both source code and data along with a paper is easy nowadays, especially thanks to GitHub and Zenodo. Guaranteeing that one gets the credit for one's efforts in the humanities is a bit more difficult, but not impossible, and colleagues are working on solutions.

What we need in addition to the publication of the raw data itself are explicit formats of data exchange. In historical linguistics, using only NEXUS-format files is not sufficient, since the nature of our data requires its own representation. Here again, scholars are already working on a solution by trying to define and establish specific formats for data sharing in historical linguistics and typology (see this discussion on GitHub).

In an ideal future scenario, which was introduced to me by Michael Cysouw, all publications involving automatic analyses would provide not only the supplementary data, but also some kind of Makefile containing the code for the workflow, enabling scholars to carry out the computational analyses immediately on their own computers.

Wednesday, September 2, 2015

Is this a "gold standard" dataset?


I have just added another dataset to our database. This one is of considerable interest, because it is a complex one. As the authors note, it is likely to contain ancient hybrid speciation, recent introgression and deep coalescence. Thus, identifying recent hybrids will be problematic.
Michael L. Moody and Loren H. Rieseberg (2012) Sorting through the chaff, nDNA gene trees for phylogenetic inference and hybrid identification of annual sunflowers (Helianthus sect. Helianthus). Molecular Phylogenetics and Evolution 64: 145–155.
There are 29 accessions from 13 species, with data for 11 loci in 5 linkage groups (a total of 8,077 aligned nucleotides). The accessions have sequences for either 1 or 2 of the alleles, and sometimes 3 (the latter are likely to be the result of PCR artifacts). The authors have also tried to identify recombinant sequences. Three of the species are previously identified hybrid taxa.

Unfortunately, adding this dataset to the database has also been problematic, because there are internal inconsistencies. For complete consistency, Figure 1 of the paper should agree with its own Table 1, and the GenBank data should agree with both of them. Unfortunately, this three-way consistency exists for only 2 of the 11 loci. For the rest, in 7 instances the dataset is the odd one out, in 4 cases it is the table, and in 4 instances it is the figure. For the data discrepancies, in 2 cases a sequence is missing, in 1 case there is an extra sequence, and for the remaining 2 pairs it is likely that the sequences have been mis-labelled.
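Checks of this kind are easy to mechanize. Here is a hypothetical sketch of the three-way comparison described above, with each source reduced to a set of sequence labels per locus:

```python
def odd_one_out(figure, table, genbank):
    """Classify one locus: name the source that disagrees with the
    other two, or report full agreement / full disagreement."""
    if figure == table == genbank:
        return 'consistent'
    if table == genbank:
        return 'figure'
    if figure == genbank:
        return 'table'
    if figure == table:
        return 'genbank'
    return 'all differ'

# Toy example: GenBank lacks a sequence recorded in both figure and table.
print(odd_one_out(figure={'acc1', 'acc2'},
                  table={'acc1', 'acc2'},
                  genbank={'acc1'}))  # -> 'genbank'
```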

It is therefore not immediately obvious to what extent this counts as a "gold standard" dataset. I have included it because of its intrinsic interest, but obviously with a caveat emptor warning. Sadly, this sort of situation has been all too common in my search for suitable datasets.

Wednesday, August 26, 2015

Request for datasets


During one of the discussion sessions at the recent Phylogenetic Network Workshop, in Singapore, the need was re-iterated for "gold standard" empirical datasets, in order to aid the development and validation of algorithms for phylogenetic networks.

The current collection of such datasets is located on this blog, at:
http://phylonetworks.blogspot.se/p/datasets.html
However, it is still quite a small database, as so far it has been based solely on my own ability to locate suitable datasets that are freely available (see the comments in Public availability of phylogenetic data).

I would therefore like to remind everyone that if you have, or know of, suitable empirical datasets then please contact me.

The database is currently hierarchically arranged as follows:

Datasets where the history is a tree
  Datasets where the history is known from experimentation
  Datasets where the history is known from retrospective observation
Datasets where the history is reticulated
  Datasets where the history is known from experimentation
    Hybridization
    Contamination
  Datasets where the reticulation is inferred
    Hybridization
    Recombination
    Lateral Gene Transfer

The basic requirement for a "gold standard" dataset that contains one or more reticulations (ie. there is gene flow) is that the evidence for the reticulation(s) is independent of the particular dataset. That is, there should be either experimental data, or at least another independent dataset, confirming the gene flow. This is quite a tough criterion, particularly for lateral gene transfer, but it is a necessary quality criterion.

Finally, the database requires the processed data (eg. a multiple sequence alignment), rather than the original raw data (see the comments in Releasing phylogenetic data).

Wednesday, January 29, 2014

More datasets for validating network algorithms


Four more datasets have been added to the Datasets blog page. These are:
  • 1 stemmatology study where the manuscript history is known from experimentation
  • 2 stemmatology studies where the manuscript history is known from experimentation, and where there is reticulation caused by contamination
  • 1 plant study where recombination is known.

These are the first three studies to be added from the social sciences, all of them from experimental manipulation of text copying.

Unfortunately, it is unlikely that suitable datasets will be found from other parts of the social sciences, such as linguistics; but please tell us if you know of any relevant studies, where the phylogenetic history is known or inferred independently of the dataset itself.

Wednesday, September 11, 2013

Public availability of phylogenetic data


I have previously noted the frequent failure of phylogeneticists to make their data publicly available (Releasing phylogenetic data). Recently, a paper appeared in PLoS Biology providing some quantitative data regarding this issue:
Drew B.T., Gazis R., Cabezas P., Swithers K.S., Deng J., Rodriguez R., Katz L.A., Crandall K.A., Hibbett D.S., Soltis D.E. (2013) Lost branches on the Tree of Life. PLoS Biology 11(9): e1001636.
While constructing a super-tree of life, Drew et al. noted that, of the 7,500 papers they examined (appearing in 2000–2012), the published data (eg. alignment and tree) had been deposited in a public repository in only one-sixth of the cases, and were available on request from the original authors for a further one-sixth, leaving two-thirds of the data unavailable.

Not unexpectedly, they suggest that the journals publishing these papers might play a role in addressing this issue:
Our findings indicate that while some journals (e.g., Evolution, Nature, PLOS Biology, Systematic Biology) currently require nucleotide sequence alignments, associated tree files, and other relevant data to be deposited in public repositories, most journals do not have these requirements.
Notable among the absent journals are high-profile phylogenetic ones such as Molecular Biology and Evolution and Molecular Phylogenetics and Evolution.

Sadly, the role of journals has been presented in a rather poor light by some bloggers. For example, Roli Roberts notes:
And it's clear that journals are indeed spectacularly well-placed to police and incentivise the deposition, tracking, accessibility, and permanence of data associated with the papers that they publish. At the point of acceptance we have the authors over a barrel, and are in a great position to mandate deposition of all data for every paper.
This attitude has been criticized by other bloggers. For example, Rod Page notes:
In my opinion, as soon as you start demanding people do something you've lost the argument, and you're relying on power ("you don't get to publish with us unless you do x"). This is also lazy. I have argued that this is the wrong approach: when building shared resources carrots are better than sticks ... So, my challenge to the phylogenetics community is to stop resorting to bullying people, and ask instead how you could make it a no brainer for people to share their trees.
However, I ask a different question:
Why are phylogeneticists so reluctant to present their actual data in the first place?
After all, without data science is merely opinion, and you don't need to be a scientist to have an opinion. (Even theoretical science ultimately concerns itself with data, so data really is the essence of science.) One does not have to be sceptical about a dataset in order to think that it should be publicly and freely available.

So, why is telling phylogeneticists to act like scientists "resorting to bullying people"? Why do we have to "inspire [people] to contribute" by offering them carrots? It seems to me that we have lost the argument that phylogenetics is science if the phylogeneticists won't behave like scientists.

Note that the alignment is the key thing in phylogenetics, not the derived tree. In one sense, a tree just makes a figure out of a table. So, given the published description of the tree-building method, it should be straightforward to reproduce the tree from the alignment. Indeed, if the tree cannot be reproduced from the alignment then there is serious cause for concern.
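As an illustration of how mechanical that step can be, here is a minimal sketch using Biopython's distance-based tools: a toy alignment, identity distances, and neighbor-joining. Any real study would, of course, use its own stated method and parameters:

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio import Phylo

# A toy alignment standing in for published supplementary data.
alignment = MultipleSeqAlignment([
    SeqRecord(Seq('ACGTACGTAC'), id='taxon_A'),
    SeqRecord(Seq('ACGTACGTTC'), id='taxon_B'),
    SeqRecord(Seq('ACGAACCTTC'), id='taxon_C'),
    SeqRecord(Seq('TCGAACCATC'), id='taxon_D'),
])

# Identity distances plus neighbor-joining: one mechanical route
# from alignment to tree.
distances = DistanceCalculator('identity').get_distance(alignment)
tree = DistanceTreeConstructor().nj(distances)
Phylo.draw_ascii(tree)
```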

In this sense, databases like TreeBASE might be missing the point somewhat. Where does one put the alignment if one is not interested in also storing the tree? Where does one put a network, if that is what you have instead of a tree? One could use Dryad, but they are now insisting on payment for storing scientific data — for those of us without financial support this is no longer a realistic option.

Problems with data availability are not unique to phylogenetics, of course. Dani Zamir has recently noted:
In crop genetics and breeding research, phenotypic data are collected for each plant genotype, often in multiple locations and field conditions, in search of the genomic regions that confer improved traits. But what is happening to all of these phenotypic data? Currently, virtually none of the data generated from the hundreds of phenotypic studies conducted each year are being made publically available as raw data; thus there is little we can learn from past experience when making decisions about how to breed better crops for the future.
Nevertheless, in biology, there are databases for many things, such as gene sequences (GenBank), protein structures (PDB), and gene ontology (GO), and these are all used to one extent or another. Perhaps the most direct parallel to the problems with phylogenetic datasets is that of ecological datasets, as recently discussed in a PLoS ONE article:
Morris B.D., White E.P. (2013) The EcoData Retriever: improving access to existing ecological data. PLoS ONE 8(6): e65848.
It is interesting to ponder why this is such a problem in the biological sciences when it is apparently not so in the physical sciences. There are databases in astronomy, and databases of chemical properties in chemistry, for example, but otherwise it is generally the ability to get the same data by repeating the experiment that is the important thing in the physical sciences. In most cases a database would be not only redundant but also self-defeating (storing the data would imply that the data are not repeatable!).

So, this appears to be yet another by-product of dealing with biodiversity — data are incredibly variable in many areas of biology, and so it is necessary to store them for posterity because they are unique.

Wednesday, May 1, 2013

Releasing phylogenetic data


One approach that I have taken in this blog to popularizing the use of networks in phylogenetic analysis has been to investigate published data using network techniques. However, this is often difficult, because the data have not been made publicly available (eg. Phylogenetic position of turtles: a network view).

I am not the only person to find fault with the failure to release phylogenetic data, although there are recognized reasons why data sometimes cannot be released. Razib Khan at the Gene Expression blog recently had this to say (Why not release data for phylogenetic papers?):
Last month I noted that a paper on speculative inferences as to the phylogenetic origins of Australian Aborigines was hampered in its force of conclusions by the fact that the authors didn't release the data to the public (more accurately, peers). There are likely political reasons for this in regards to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. In regards to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs). 
Yesterday I noticed PLoS Genetics published a paper on the genomics of Middle Eastern populations ... The results were moderately interesting, but bravo to the authors for putting their new data set online. The reason is simple: reading the paper I wanted to see an explicit phylogenetic tree/graph to go along with their figures (e.g., with TreeMix). Now that I have their data I can do that.
In this particular case the data were made available on the homepage of one of the authors, which is better than nothing but is clearly less than ideal. There are a number of formal repositories for phylogenetic data, all of which should have greater longevity than any personal homepage, including:
TreeBASE
Dryad
The first of these databases has a long history of storing phylogenetic trees and their associated datasets. It has not yet lived up to its full potential, but people like Rod Page are pushing for it to do so eventually.

Dryad is a more general data repository (ie. not just for phylogenetic data), and its use is now encouraged by many of the leading journals — Systematic Biology, for example, makes its use mandatory, at least for data during the submission process, and also for "data files and/or other supplementary information related to the paper" for the published version.

Phylogeny databases are not without their skeptics, however. For example, Rod Page (Data matters but do data sets?) has noted:
How much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses. 
Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much"). 
But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would needs to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.
However, all of this begs the question that seems to me to be central to science. Science is unique in being based primarily on evidence rather than expert opinion, and therefore the core of science must be direct access to the original evidence, rather than some statistical summary of it or someone's opinion about it. How can I evaluate evidence if I don't have access to it? How can I verify it, explore it, or re-analyze it? Being given the raw data (eg. the sequences) is one thing, but being given the data you actually analyzed and based your conclusions on (eg. the aligned sequences) is another thing entirely.

In short, if you won't openly give me your dataset then I don't see how you can call yourself a serious scientist.

Note: see also this later post: Public availability of phylogenetic data

Wednesday, January 30, 2013

More datasets for validating network algorithms


Ten more datasets have been added to the Datasets blog page. These are:
  • 2 plant studies where hybrids are known from experimentation
  • 3 more plant studies where natural hybrids are known
  • 5 studies (fungi, plants, protozoa, viruses, animals) where recombination is known.

A comment

It is worth noting something that has become obvious to me while compiling these datasets — the mathematical model often applied to hybridization networks cannot easily be applied to many of the datasets collected by biologists. The usual mathematical model involves incompatibility between two or more trees for the same set of taxa, for example from different genes or genomes. The incompatibilities are resolved by postulating one or more reticulations in the network.

However, the data produced by biologists often involve only a single nuclear gene, most frequently the Internal Transcribed Spacer (ITS) region, so that the biologists do not have multiple trees. Instead, hybrids are detected by additive polymorphisms at alignment positions within the study gene. These polymorphisms arise either (i) from the polyploid nature of the hybrids (there are multiple copies of each chromosome, each of which may have a gene copy from either parental species), or (ii) from multiple paralogous copies of the genes (the rRNA region, which contains the ITS, usually has many tandemly repeated copies of the genes, which are homogenized by concerted evolution, but in a hybrid any of them may have a gene copy from either parental species).
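To make the idea concrete, here is a minimal sketch of the simplest form of additive-polymorphism detection: finding alignment positions where a putative hybrid shows the IUPAC ambiguity code that combines two different parental bases. Real studies must, of course, also deal with alignment quality, paralogy, and sequencing artifacts:

```python
# IUPAC ambiguity codes for the six two-base polymorphisms.
IUPAC = {frozenset('AG'): 'R', frozenset('CT'): 'Y', frozenset('CG'): 'S',
         frozenset('AT'): 'W', frozenset('GT'): 'K', frozenset('AC'): 'M'}

def additive_positions(parent1, parent2, hybrid):
    """Return positions where the hybrid shows the ambiguity code
    combining two different parental bases."""
    return [i for i, (a, b, h) in enumerate(zip(parent1, parent2, hybrid))
            if a != b and IUPAC.get(frozenset((a, b))) == h]

# Toy example: position 3 is additive (T from one parent, C from the
# other, and the 'hybrid' shows Y = C/T).
print(additive_positions('ACGTA', 'ACGCA', 'ACGYA'))  # -> [3]
```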

This means that it is difficult to use any current evolutionary network for the phylogenetic analysis of many of the datasets used for detecting hybridization. In turn, this suggests that we may need a different model, one based on additive polymorphisms rather than incongruent trees.

The usual mathematical model for lateral-transfer networks is actually the same as for hybridization networks, since the only real difference between HGT and hybridization is that HGT does not occur via sexual reproduction while hybridization does. (Also, hybridization often involves whole genomes while HGT usually involves partial genomes.) Importantly, the mathematical model does seem to apply to the sort of datasets collected by biologists when they are studying HGT. That is, HGT is detected by incompatibility between two or more trees for the same set of taxa. Indeed, this model is usually the only evidence for HGT, unlike hybridization and recombination where there is often evidence that is independent of the network model.

Wednesday, January 16, 2013

Datasets for validating algorithms for evolutionary networks


Steven Kelk has previously raised the issue of Validating methods for constructing evolutionary phylogenetic networks: there are currently not many options for validating the biological relevance of methods for constructing evolutionary phylogenetic networks. These are phylogenetic networks intended to represent evolutionary history, such as HGT networks, hybridization networks, and recombination networks.

Thus, we need a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This could then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks.

This issue was discussed at some length at the Workshop: The Future of Phylogenetic Networks. It was suggested by Leo van Iersel that a practical starting point would be to use this blog as a link to suitable datasets. As people become aware of such datasets, a blog post would be published with the details, and the dataset would be linked from one of the blog Pages.

This page now exists (Datasets), and can be accessed at the top right of each blog page. Everyone is encouraged to contribute to this "database", which you can do by sending details about potential datasets to me by email.

In another post, What should a database of datasets look like?, I have noted that there have been four suggested approaches to acquiring datasets for evaluating algorithms (in order of increasing reality):
  1. simulate datasets under one or more data-generation models
  2. create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
  3. use datasets where the postulated reticulation events have been independently confirmed
  4. experimentally create taxa with a known evolutionary history.
It seems unnecessary to store datasets of type (1), since they can be created to order by computer programs. Datasets of type (2) are rare, but would be suitable for the database.
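As an illustration of how cheaply type (1) data can be created to order, here is a toy sketch that simulates sequences down a fixed four-taxon tree under a crude substitution scheme. Real simulation studies would use explicit models of sequence evolution, but the principle is the same:

```python
import random

def evolve(seq, n_changes, alphabet='ACGT'):
    """Mutate a copy of seq at n_changes random positions."""
    seq = list(seq)
    for pos in random.sample(range(len(seq)), n_changes):
        seq[pos] = random.choice([c for c in alphabet if c != seq[pos]])
    return ''.join(seq)

random.seed(1)
root = ''.join(random.choice('ACGT') for _ in range(40))

# Evolve down the fixed tree ((A,B),(C,D)).
left, right = evolve(root, 4), evolve(root, 4)
tips = {'A': evolve(left, 2), 'B': evolve(left, 2),
        'C': evolve(right, 2), 'D': evolve(right, 2)}
for name, seq in sorted(tips.items()):
    print(name, seq)
```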

Datasets of type (4) currently exist for tree-like evolutionary histories but not yet, as far as I know, for reticulated histories. I have added the known (and available) ones to the database.

Datasets of type (3) are likely to form the bulk of the database, and I have started this part of the database with some example datasets involving hybridization.

For the latter datasets, it is important to note the potential problem of the degree to which the postulated reticulation events have been independently confirmed. I suspect that only weak evidence has been accepted for far too many datasets. This is particularly true for those involving horizontal gene transfer (HGT), where mere incongruence between genes is presented as the sole "evidence". More than this is required (see Than C, Ruths D, Innan H, Nakhleh L. 2007. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535).