Monday, September 24, 2018

Structural data in historical linguistics

The majority of historical linguists compare words to reconstruct the history of different languages. However, in phylogenetic studies focusing on cognate sets reflecting shared homologs across the languages under investigation, there exists another data type that people have been trying to explore in the past. The nature of this data type is difficult to understand for non-linguists, given that it has a very abstract nature. In the past, it has led to a considerable amount of confusion both among linguists and among non-linguists who tried to use this data for quick (and often also dirty) phylogenetic approaches. For this reason, I figured it would be useful to introduce this type of data in more detail.

This data type can be called "structural". To enable interested readers to experiment with the data themselves, this blogpost comes along with two example datasets that we converted into a computer-readable format (with much help from David), since the original papers only offered the data as PDF files. In future blogposts, we will try to illustrate how the data can, and should, be explored with network methods. In this first blogpost, I will try to explain the basic structure of the data.

Structural data in historical linguistics and language typology

In order to illustrate the type of data we are dealing with here, let's have a look at a typical dataset, compiled by the famous linguist Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.

No. | Feature                                          | Beijing | Suzhou | Meixian | Guangzhou
1   | The third person pronoun is tā, or cognate to it | +       | -      | -       | -
4   | Velars palatalize before high-front vowels       | +       | +      | -       | -
7   | The qu-tone lacks a register distinction         | +       | -      | +       | -
12  | The word for "stand" is zhàn or cognate to it    | +       | -      | -       | -

In this example, the data is based on a questionnaire that poses specific questions; for each of the languages in the sample, the dataset answers each question with either + or -. Many of these datasets are binary in nature, but this is not a necessary condition: questionnaires can also query categorical variables. For example, a question about the major type of word order might have three categories (subject-object-verb, subject-verb-object, or other).
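To make the shape of such a dataset concrete, here is a minimal sketch that codes the four rows of the table excerpt as a binary matrix (+ as 1, - as 0) and counts, for each pair of varieties, the features on which they differ. The names MATRIX, hamming, and distance_table are illustrative assumptions, not part of Norman's paper or of any published tool, and four features are far too few to support any classification.

```python
from itertools import combinations

# Rows copied from the table excerpt above; "+" is coded as 1 and "-" as 0.
FEATURES = [1, 4, 7, 12]  # feature numbers as given in the excerpt
MATRIX = {
    "Beijing":   [1, 1, 1, 1],
    "Suzhou":    [0, 1, 0, 0],
    "Meixian":   [0, 0, 1, 0],
    "Guangzhou": [0, 0, 0, 0],
}


def hamming(a, b):
    """Count the features on which two varieties give different answers."""
    return sum(x != y for x, y in zip(a, b))


def distance_table(matrix):
    """Return {(variety_a, variety_b): number of differing features}."""
    return {
        (a, b): hamming(matrix[a], matrix[b])
        for a, b in combinations(matrix, 2)
    }


if __name__ == "__main__":
    for (a, b), d in distance_table(MATRIX).items():
        print(f"{a:10s} {b:10s} {d}/{len(FEATURES)}")
```

A pairwise difference table of this kind is the sort of input that distance-based network methods work from; a categorical (non-binary) feature can be handled by the same function, since it only tests whether two answers are equal.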

We can also see that the questions can be very diverse. While we often use more or less standardized concept lists for lexical research (such as fixed lists of basic concepts, List et al. 2016), this kind of dataset is much less standardized, due to the nature of the questionnaire: asking for the translation of a concept is more or less straightforward, and the number of possible concepts that are useful for historical research is quite constrained. Asking a question about the structure of a language, however, be it phonological, lexical, based on attested sound changes, or on syntax, provides an incredible number of different possibilities. As a result, it seems close to impossible to standardize these questions across different datasets.

Although scholars often call the data based on these questionnaires "grammatical" (since many questions are directed towards grammatical features, such as word order, presence or absence of articles, etc.), most datasets show a structure in which questions of phonology, lexicon, and grammar are mixed. For this reason, it is misleading to talk of "grammatical datasets"; the term "structural data" seems more adequate, since this is what the datasets were originally designed for: to investigate differences in the structure of different languages, as reflected most famously in the World Atlas of Language Structures (Dryer and Haspelmath 2013).

Too much freedom is a restriction

In addition to mixed features that can be observed without knowing the history of the languages under investigation, many datasets (including the one by Norman we saw above) also use explicit "historical" (diachronic in linguistic terminology) questions in their questionnaires. In his paper describing the dataset, Norman defends this practice, arguing that the goal of his study is to establish a historical classification of the Chinese dialects. With this goal in mind, it seems defensible to make use of historical knowledge and to include observed phenomena of language change in general, and sound change in particular, when compiling a structural dataset for a group of related language varieties.

The problem of the extremely diverse nature of questionnaire items in structural datasets, however, makes their interpretation extremely difficult. This becomes especially evident when using the data in combination with computational methods for phylogenetic reconstruction. This is problematic for two major reasons.
  1. Since questions are by nature less restricted regarding their content, scholars can easily pick and choose the features in such a way that they confirm the theory they want them to confirm rather than testing it objectively. Since scholars can select suitable features from a virtually unlimited array of possibilities, it is extremely difficult to guarantee the objectivity of a given feature collection. 
  2. If features are mixed, phylogenetic methods that work on explicit statistical models (like gain and loss of character states, etc.) may often be inadequate to model the evolution of the characters, especially if the characters are historical. While a feature like "the language has an article" may be interpreted as a gain-loss process (at some point, the language has no article, then it gains the article, then it loses it, etc.), features showing the results of processes, like "the words that originally started in [k] followed by a front vowel are now pronounced as []", cannot be interpreted as a process, since the feature itself describes a process.
For these reasons, all phylogenetic studies that make use of structural data, in contrast to purely lexical datasets, should be treated with great care, not only because they tend to yield unreliable results, but more importantly because they are extremely difficult to compare across different language families, given that scholars have far too much freedom when compiling them. Feature collections provided in structural datasets are an interesting resource for diversity linguistics, but they should not be used to make primary claims about external language history or subgrouping.
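The gain-loss interpretation mentioned in the second point can be sketched as a two-state continuous-time Markov chain, which is one standard way of modelling a binary character such as "the language has an article". The rates and the time unit in the sketch are arbitrary assumptions chosen for illustration; they are not estimated from any dataset discussed here.

```python
import math


def transition_probabilities(gain, loss, t):
    """Transition matrix of a two-state chain (0 = absent, 1 = present).

    gain: rate of change 0 -> 1, loss: rate of change 1 -> 0, t: elapsed time.
    Returns [[P(0->0), P(0->1)], [P(1->0), P(1->1)]].
    """
    total = gain + loss
    decay = math.exp(-total * t)
    p01 = gain / total * (1.0 - decay)
    p10 = loss / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


if __name__ == "__main__":
    # Arbitrary illustrative rates: articles are gained twice as fast as lost.
    for t in (0.0, 1.0, 10.0, 100.0):
        row = transition_probabilities(gain=0.2, loss=0.1, t=t)[0]
        print(f"t={t:6.1f}  P(absent->absent)={row[0]:.3f}  P(absent->present)={row[1]:.3f}")
```

Such a model assumes that a character can move back and forth between its two states. A feature that records the outcome of a specific sound change has no meaningful "loss" transition, which is why fitting the same model to it is questionable.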

Two structural datasets for Chinese dialects

Before I start to bore the already small circle of readers interested in these topics, it seems better to stop discussing the usefulness of structural data at this point, and to introduce the two datasets that were promised at the beginning of the post.

Both datasets target Chinese dialect classification, the former being proposed by Norman (2003), and the latter reflecting a new data collection that was recently used by Szeto et al. (2018) to propose a North-South split of dialects of Mandarin Chinese with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets. The main idea of this collection is to collect various structural datasets that have been published in the literature in the past, and to give those people interested in the data, be it for replication studies or to test alternative approaches, easy access to the data in various formats.

The basic format follows the format specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples of best practice for both structural and lexical datasets in historical linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and datasets are converted to CLDF (with all languages being linked to the Glottolog database, Hammarström et al. 2018) and also to Nexus format. Each dataset is versioned, so it may be updated in the future, and interested readers can study the code used to generate the specific data format from the raw files, as well as the Nexus files, to learn how to submit their own datasets to our initiative.
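For readers who have not seen a Nexus file, the following minimal sketch writes a small binary matrix as a Nexus DATA block. It is not the conversion code used in the cldf-datasets collection; the function name to_nexus and the EXAMPLE matrix (the four features from the table excerpt above) are assumptions made for illustration, and taxon names containing spaces would need quoting that this sketch does not handle.

```python
# Four features from the table excerpt above, coded "+" = 1 and "-" = 0.
EXAMPLE = {
    "Beijing":   [1, 1, 1, 1],
    "Suzhou":    [0, 1, 0, 0],
    "Meixian":   [0, 0, 1, 0],
    "Guangzhou": [0, 0, 0, 0],
}


def to_nexus(matrix, missing="?"):
    """Serialize {taxon: [states]} as a Nexus DATA block; None means missing."""
    taxa = list(matrix)
    nchar = len(matrix[taxa[0]])
    if any(len(row) != nchar for row in matrix.values()):
        raise ValueError("all taxa need the same number of characters")
    width = max(len(taxon) for taxon in taxa)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(taxa)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD MISSING={missing} SYMBOLS="01";',
        "  MATRIX",
    ]
    for taxon in taxa:
        states = "".join(missing if s is None else str(s) for s in matrix[taxon])
        lines.append(f"    {taxon.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_nexus(EXAMPLE), end="")
```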

Final remarks on publishing structural datasets online

Since we provide only two initial datasets for an enterprise whose general usefulness is highly questionable, readers might ask themselves why we are going through the pain of making data created by other people accessible through the web.

The truth is that the situation in historical linguistics and language typology has for a very long time been very unsatisfactory. Most data-based research did not supply the data with the paper, and authors often flatly refuse to share the data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the data is hampered by providing data only in PDF format in tables inside the paper (or even worse: long tables in the supplement of a paper), which forces scholars wishing to check a given analysis themselves to reverse-engineer the data from the PDF. That data is provided in a form difficult to access is not even necessarily the fault of the authors, since some journals restrict the form of supplementary data to PDF only, giving authors wishing to share their data in an appropriate form a difficult time.

Many colleagues think that it is time to change this, and we can only change it by offering standard ways to share our data. The CLDF format along with the Nexus file, in which the two Chinese datasets are now published in this open repository collection, may hopefully serve as a starting point for larger collaboration among typologists and historical linguists. Ideally, all people who publish papers that make use of structural datasets would submit their data in CLDF format and Nexus, similar to the practice in biology where scholars submit data to GenBank (Benson et al. 2013), so that their colleagues can easily build on their results and test them for potential errors.


Benson D., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers (2013) GenBank. Nucleic Acids Res. 41.Database issue: 36-42.

Bryant D. and V. Moulton (2004) Neighbor-Net. An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21.2: 255-265.

Campbell L. and W. Poser (2008) Language classification: History and method. Cambridge University Press: Cambridge.

Cathcart C., G. Carling, F. Larsson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dryer M. and M. Haspelmath (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.

Forkel R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (forthcoming) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

Hammarström H., R. Forkel, and M. Haspelmath (2018) Glottolog. Version 3.3. Max Planck Institute for Evolutionary Anthropology: Leipzig.

List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.

Norman J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.) The Sino-Tibetan languages. Routledge: London and New York, pp 72-83.

Pritchard J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Szeto P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Zhang M., W. Pan, S. Yan, and L. Jin (2018) Phonemic evidence reveals interwoven evolution of Chinese dialects. bioRxiv.

Monday, September 17, 2018

Getting the wrong tree when reticulations are ignored

One issue that has long intrigued me is what happens when someone constructs a phylogenetic tree under circumstances where there are reticulate evolutionary events in the actual (ie. true) phylogeny itself. That is, a network is required to accurately represent the phylogeny, but a tree is used as the model, instead. How accurate is the tree?

By this, I mean that, if the phylogeny can be thought of as a "tree with reticulations", do we simply get that tree but miss the reticulations, or do we get a different (ie. wrong) tree?

Sometimes, people refer to this situation as having a "backbone tree" — the phylogeny is basically tree-like, but there are a few extra branches, perhaps representing occasional hybridizations or horizontal gene transfers. The phylogenetic tree can then be treated as a close approximation to the true phylogeny, representing the diversification events but not the (rarer) reticulation events.

I have argued against this approach (2014. Systematic Biology 63: 628-638.). Instead of seeing a network as a generalization of a tree, we should see a tree as a simplification of a network. If we do this, then we would construct a network every time; and sometimes that network would be a tree, because there are no reticulation events in the phylogeny. It cannot work the other way around, because we can never get a network if all we ask for is a tree!

Presumably, if there are no reticulations then we should get the same answer (phylogenetic tree) irrespective of whether we simply construct a tree or instead construct a network that turns out to be a tree. But what about the "backbone tree" situation? Here, it has always seemed to me to be possible that we do not get the same tree. If this is so, then constructing a tree and then adding a few reticulations to it (as is often done in the literature) would not work — we would be adding reticulations to the wrong backbone tree.

There are two possible ways in which we can get the wrong backbone tree: the topology might be incorrect, or the branch-lengths might be incorrect (or both). For example, if there are true reticulations and yet we do not include them in our model, I have argued that the branches will be too short (2014. Systematic Biology 63: 847-849.) — two taxa will be genetically similar because of the reticulation events, but the tree-building algorithm can only make them similar on the tree by shortening the branches (not by adding a reticulation).

Fortunately, for at least one tree-building model, Luay Nakhleh and his group have now done some simulations to answer my questions. You may not yet have noticed their results, because they are not necessarily in the most obvious place; so I will highlight them here. The analyses involve the Multispecies Coalescent (MSC) model, which accounts for incomplete lineage sorting during the tree-like part of evolution, as compared to the Multispecies Network Coalescent (MSNC), which adds reticulations (eg. hybridization) to the model.

Dingqiao Wen, Yun Yu, Matthew W. Hahn, Luay Nakhleh (2016) Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. Molecular Ecology 25: 2361-2372.

This paper compares a tree-based analysis (construct a tree first then add reticulations) with a network-based analysis (construct a network) for an empirical genomic dataset. The two results differ.

Dingqiao Wen, Luay Nakhleh (2018) Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Systematic Biology 67: 439-457.

Tucked away in the Supplementary Information are the results of a set of simulations comparing the MSC (using *Beast) and the MSNC (using PhyloNet), with (section 3) and without (section 2) reticulations. The basic conclusion is that, in the presence of reticulation, tree-based methods either get the tree completely wrong, or they get the tree topology right but the branch lengths are "forced" to be very short. A summary of the latter result is shown in the figure above. In the absence of reticulation, both methods produce the same tree.

R.A. Leo Elworth, Huw A. Ogilvie, Jiafan Zhu, and Luay Nakhleh (ms.) Advances in computational methods for phylogenetic networks in the presence of hybridization. [chapter for a forthcoming book]

A summary of the group's work to date. Section 6.3 summarizes the results from the second paper listed above (Wen and Nakhleh 2018).

Monday, September 10, 2018

Limitations of the new book about HGT networks

This is a joint post by David Morrison and Ajith Harish.

There has been a flurry of reviewing activity recently about the new book:

The Tangled Tree: a Radical New History of Life
David Quammen. 2018. Simon & Schuster.

This book has received glowing reviews.

The book is intended for the general public, rather than for specialists, explaining the "new view" of evolutionary history that includes extensive horizontal gene transfer (HGT), especially in the microbial world. Quammen describes himself as a science, nature and travel writer, so his book is more than just a record of science, and is as much about the people involved as about the scientific theory. In particular, it contains a biography of Carl Woese.

Quammen’s recent New York Times feature article The scientist who scrambled Darwin’s Tree of Life is a very good primer to his book. For us, it indicates that the book has many overlaps with Jan Sapp's earlier book The New Foundations of Evolution: on the Tree of Life (2009. Oxford University Press). The publisher’s advertised selling point of that book is: "This is the first book on (and first history of) microbial evolutionary biology, and that it puts forth a new theory of evolution", with HGT being the new theory. In this sense, the "radical new view" is simply that genetic material can be transferred without sexual reproduction, an idea that goes back rather a long way in history (see The history of HGT), and which is often seen as anti-Darwinian.

Bill Hanage, in his review of Sapp’s book (2010. The trouble with trees. Science 327: 645-646), argues that the book does not put forward a new theory, that the debate is not actually about horizontal gene transfer, and that the Tree of Life is thus far from settled. There are many other interesting points discussed in that review. Furthermore, even after almost 10 years, Hanage’s review of Sapp’s 2009 book can be substituted verbatim as a review of Quammen’s 2018 book! This PDF shows how the book review would read if the author and book names in Hanage’s review were to be substituted [reproduced with the permission of the original author].

The debate allegedly involving HGT is, at heart, about explaining the pattern of extensively mixed genetic material found in the akaryotes. However, simply looking at a pattern does not tell you about the process that created the pattern. In order to study processes, we need a model, in this case a model about how evolution occurs. The "HGT model" is that the Last Universal Common Ancestor (LUCA) of life was a relatively simple organism genetically, and that subsequent evolutionary history has involved complexification of that ancestor, both by diversification and by HGT.

What the two books do not explore is the other major model for the current distribution of genetic material among akaryotes. This alternative scenario is that the LUCA was genetically complex, and that the subsequent evolutionary history involved independent losses of parts of the genetic material — the sporadically shared material is basically coincidental. All that this model requires is that there be evolutionary history prior to the LUCA, during which it became a complex organism from its simple beginnings — the LUCA is merely as far back as we can see into the past, with the prior history being unrecoverable by us (ie. we cannot see past the LUCA bottleneck).

Over the past couple of decades, a number of papers have explored the evidence for the latter idea, from both the RNA and protein perspectives, including:
  • Anthony Poole, Daniel Jeffares, David Penny (1999) Early evolution: prokaryotes, the new kids on the block. BioEssays 21: 880-889.
  • Christos A. Ouzounis, Victor Kunin, Nikos Darzentas, Leon Goldovsky (2006) A minimal estimate for the gene content of the last universal common ancestor — exobiology from a terrestrial perspective. Research in Microbiology 157: 57-68.
  • Miklós Csűrös, István Miklós (2009) Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Molecular Biology and Evolution 26: 2087-2095.
  • Kyung Mo Kim, Gustavo Caetano-Anollés (2011) The proteomic complexity and rise of the primordial ancestor of diversified life. BMC Evolutionary Biology 11: 140.
  • Ajith Harish, Charles G. Kurland (2017) Akaryotes and Eukaryotes are independent descendants of a universal common ancestor. Biochimie 138: 168-183.
Finally, even from the perspective of phylogenetic networks, Quammen's book is very one-sided. In particular, the other processes that lead to reticulate evolution (eg. introgression and hybridization) are pretty much ignored. That is, the focus is on akaryotes not eukaryotes. The latter are also of phylogenetic interest.

Monday, September 3, 2018

More on networks for placing fossils, such as Eocene lantern fruits

A colleague pointed me to a paper published last year in Science about a spectacular fossil find: an Eocene Physalis-fruit with a preserved lampion. In a recent post, I advocated Neighbor-nets as nice and quick tools to place fossils phylogenetically. In this post, I will exemplify this once more, and argue why this would have been even more informative than what the authors showed as graphs.

The study and the data

In their 2017 paper, Wilf et al. (Science 355: 71–75) describe a new fossil find, which, by itself, rejects the often-too-young molecular dating estimates for Solanaceae, the potato-tomato family, the "Nightshades". The Nightshades include many well-known plants in addition to potato and tomato (the latter is phylogenetically a subclade of the potatoes): for example, the tobacco genus (Nicotiana), and also the genus Physalis, which includes several species commercialized as fruits (e.g. P. peruviana, also known as Cape gooseberry or goldenberry) and ornamental plants (e.g. P. alkekengi, the Chinese Lantern).

Just by looking at the pictures showing the fossil (Wilf et al.'s text-Fig. 1), anyone who has ever eaten a physalis would agree that it was produced by a member of the genus. However, science is not usually about common sense, but about formal reconstructions. Thus, the authors placed their fossil using a total evidence tree approach: they scored 13 morphological traits as binary or ternary characters, concatenated these data with a molecular data set, and inferred trees under maximum parsimony (their text-Fig. 2, below) and maximum likelihood (the tree can be found in the supporting information).

Wilf et al.'s total evidence tree (their Fig. 2). Quoted from the legend: "Phylogenetic relationships of Physalis infinemundi sp. nov. and selected Solanaceae species. Strict consensus of 2835 most parsimonious trees of 3510 steps (CI = 0.438, RI = 0.726)."

Based on the graph, one can confirm that the fossil (arrow; pictured, too) is part of the core Physalis, but its position within this core clade is unresolved. The Decay index shown indicates that moving the entire branch would require just one more step. This is not overly reassuring given the total length of the tree (3510 steps) and the underlying data (the matrix used has 7070 characters!).

The molecular data were selected from an earlier study (Särkinen et al., BMC Evol. Biol., 2013), but the total evidence matrix is not provided (see this post on why we want to publish our phylogenetic data). But at least the "...morphological matrix developed in this paper is tabulated in the supplementary materials."

This file includes two sheets: the first shows the "raw scores", including four continuous characters, and the second shows the "character scoring" used for the analysis, where the continuous characters were scored (binned) as ternary and binary characters. The information provided is partly wrong, likely as the result of copy & paste errors (this is another reason why it should be obligatory for phylogenetic studies to provide the data as an aligned-FASTA or NEXUS file). A corrected version of the "character scores" sheet based on the "raw scores" sheet is included in the figshare submission for this post.
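Binning a continuous character into a binary or ternary one amounts to choosing cut-points and recording which interval each measurement falls into. The sketch below shows the general idea only; the cut-points and measurements are made-up values, not those used by Wilf et al., and the function name bin_value is an assumption.

```python
from bisect import bisect_right


def bin_value(value, cutpoints):
    """Return the state (bin index) for a measurement, or None if it is missing.

    cutpoints must be sorted in ascending order. One cut-point yields a binary
    character (states 0/1), two cut-points a ternary character (states 0/1/2).
    A value equal to a cut-point is assigned to the higher state.
    """
    if value is None:
        return None
    return bisect_right(cutpoints, value)


if __name__ == "__main__":
    # Hypothetical measurements (e.g. a length in mm) and hypothetical cut-points.
    measurements = {"taxon_A": 8.5, "taxon_B": 14.0, "taxon_C": 31.2, "taxon_D": None}
    ternary_cuts = [10.0, 20.0]
    for taxon, value in measurements.items():
        print(taxon, bin_value(value, ternary_cuts))
```

Keeping the raw measurements and the cut-points together in one script makes the scoring reproducible, so that copy & paste errors between a "raw" sheet and a "scored" sheet cannot arise.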

By just filtering this matrix for same-as-in-the-fossil characters, we can identify two extant species that are identical to the fossil in all scored characters: Physalis acutifolia and P. lanceolata. Both are part of the Physalis core clade in Wilf et al.'s total evidence tree, but their position is as unresolved as that of the fossil.
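The filtering step can be sketched in a few lines: compare every taxon with the fossil over the characters that are scored in both, and list the taxa by the number of differences. The scores in this sketch are hypothetical toy values with placeholder taxon names, not the scores from Wilf et al.'s supplement, and the function names are assumptions.

```python
# Hypothetical toy scores, NOT the values from Wilf et al.'s supplement.
# None marks a character that could not be scored.
SCORES = {
    "fossil":  [1, 0, 2, None, 1],
    "taxon_A": [1, 0, 2, 0, 1],
    "taxon_B": [1, 1, 2, 1, 0],
    "taxon_C": [1, 0, 2, 1, 1],
}


def character_difference(a, b):
    """Return (n_different, n_compared), skipping characters missing in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x != y for x, y in pairs), len(pairs)


def rank_by_difference(scores, reference):
    """List (taxon, n_different, n_compared) for all other taxa, closest first."""
    ref = scores[reference]
    rows = [
        (taxon, *character_difference(ref, row))
        for taxon, row in scores.items()
        if taxon != reference
    ]
    return sorted(rows, key=lambda r: r[1])


if __name__ == "__main__":
    for taxon, n_diff, n_compared in rank_by_difference(SCORES, "fossil"):
        print(f"{taxon:8s} differs in {n_diff} of {n_compared} compared characters")
```

Reporting the number of compared characters alongside the number of differences matters for fossils, because a taxon can appear "identical" simply because few characters could be scored.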

Enlarged part of the above figure, showing the absolute character difference (0 to 5 out of 13 covered characters) between the fossil and other members of the Physalis core clade.

The reason for this becomes clear in the total-evidence maximum-likelihood tree. Here, the fossil is resolved as the sister of P. lanceolata (maximum likelihood bootstrap support: ML-BS < 70, the actual value would have been nice), to which it is identical, both being deeply nested in the Physalis core clade. However, the other identical species (morphologically), P. acutifolia, is placed in the first diverging subclade of the core clade (ML-BS < 70, along with most of the backbone of this clade). The "low" support may have two possible reasons:
  • the fossil, with 99.8% missing data, acts as a 'rogue' taxon; or
  • the genetic data provide little discriminating signal, or ambiguous signals.
Solanaceae genera can be tricky, and the gene sample lacks highly divergent sequence regions. Since the molecular data are not documented, I can't assess how significant this separation is, but it appears to be supported by at least some mutations: the tree-wise distance is about 0.04 expected substitutions; and the two species that are morphologically indistinguishable (regarding the scored characters) are genetically distinct (to some degree).

Extract from Wilf et al.'s Fig. S1, showing the Physalinae subtree with the core Physalis clade and the deeply nested fossil P. infinemundi (in bold font). Support is only shown for branches with a ML-BS support ≥70.

Trees may fail to show the obvious, but networks won't

Just by using the Neighbour-net to visualize the signal in the morphological partition, we can directly argue that the fossil is likely to be part of the core Physalis. Thus the fossil, being Eocene in age, rejects the much-too-young age estimates in e.g. the dated tree by Särkinen et al. (the reference for the molecular data used by Wilf et al.).

Neighbour-net splits graph based on the morphological data partition included in Wilf et al.'s "supermatrix".

In contrast to the little information that comes along with the tree shown above (soft-ish polytomy, weak Decay index, potentially decreased ML-BS support), the splits graph highlights the ambiguity (incompatibility) of the morphological signal. The graph shows little tree-likeness, and members of the same (sub)tribe show little coherence (C = Capsiceae, H = Hyoscyameae, J = Juanulloeae, S = Solaneae; W = Withaninae; all represented by de facto molecular clades with ML-BS ≥ 77 in Wilf et al.'s supplement Fig. S1). There is one notable exception: members of the core Physalis (red dots) are sufficiently distinct from anything else, forming a highly supported clade (ML-BS = 98 in Wilf et al.'s Fig. S1).
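What "incompatibility" means here can be illustrated for the simplest case of binary characters: two such characters can be explained by the same tree without homoplasy only if at most three of the four possible state combinations (00, 01, 10, 11) occur across the taxa. The sketch below counts incompatible character pairs on made-up columns; it is not the Neighbour-net algorithm, it does not cover ternary characters, and the function names are assumptions.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Two binary characters are compatible unless all four state pairs occur.

    Taxa with a missing score (None) in either character are ignored.
    """
    observed = {
        (x, y) for x, y in zip(char_a, char_b) if x is not None and y is not None
    }
    return len(observed) < 4


def incompatible_pairs(columns):
    """Return the index pairs of characters that cannot fit on one tree."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(columns), 2)
        if not compatible(a, b)
    ]


if __name__ == "__main__":
    # Hypothetical characters scored for four taxa (one list per character).
    columns = [
        [0, 0, 1, 1],
        [0, 1, 0, 1],
        [0, 0, 1, 1],
    ]
    print(incompatible_pairs(columns))
```

A tree has to discard one character of every incompatible pair, whereas a splits graph can display both as a box, which is why heavily conflicting data produce the web-like graphs discussed in this post.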

The network also shows that the fossil is identical to both P. acutifolia and P. lanceolata.

Neighbour-net after reducing the taxon set to the phylogenetic neighbourhood of the fossil specimen. Filled fields indicate sister/sibling species supported by a ML-BS >= 80 in Wilf et al.'s "total evidence" ML tree.

By focusing on the phylogenetic neighborhood of the fossil, we end up with a spider-web-like graph. This means that the morphological partition has little consistent signal for recognizing potential relatives: the same features are likely to have evolved in parallel (all members of this neighborhood are likely to share a common origin). After all, 50 million years (and more) is a long time for a lineage to end up with a similar fruit (see also the maximum-parsimony character reconstructions on the parsimony strict-consensus tree provided in the supplement to Wilf et al.'s study).

Data and graphs

The Splits-NEXUS files for the Neighbor-nets and NEXUS-versions of Wilf et al.'s Data S1, as well as additional graphics (network with labeled bubbles) can be found on figshare.