Monday, February 24, 2020

How should one study language evolution?

This is a joint post by Justin Power, Guido Grimm, and Johann-Mattis List.

Like in biology, we have two basic possibilities for studying how languages evolve:
  • We set up a list of universal comparanda. These should occur in all languages and show a high enough degree of variation that we can use them as indicators of how languages have evolved;
  • We create individual lists of comparanda. These are specific for certain language groups that we want to study.
Universal comparanda

While most studies would probably aim to employ a set of universal comparanda, the practice often requires a compromise solution in which some non-universal characteristics are added. This holds, for example, for the idea of a core genome in biology, which ends up being so small in overlap across all living species that it makes little sense to compute phylogenies based on it, except for for closely related species (Dagan and Martin 2006). Another example is the all-inclusive matrices that are used to establish evolutionary relationships of extinct animals characterized by high levels of missing data (eg. Tschopp et al. 2015; Hartman et al. 2019). The same holds for historical linguistics, with the idea of a basic lexicon or basic vocabulary, represented by a list of basic concepts that are supposed to be expressed by simple words in every human language (Swadesh 1955), given that the number of concepts represented by simple words shared across all human languages is extremely small (Hoijer 1956).

Figure 1: All humans have hands and arms but some words for ‘hands’ and ‘arms’ address different things (see our previous post "How languages loose body parts").

Apart from the problem that basic vocabulary concepts occurring in all languages may be extremely limited, test items need to fulfill additional characteristics that may not be easy to find,in order to be useful for phylogenetic studies. They should, for example, be rather resistant to processes of lateral transfer or borrowing in linguistics. They should preferably be subject to neutral evolution, since selective pressure may lead to parallel but phylogenetically independent processes (in biology known as convergent evolution) that are difficult to distinguish and can increase the amount of noise in the data (homoplasy).

Selective pressure, as we might find, for example, in a specific association between certain concepts and certain sounds across a large phygenetically independent sample of human languages, is rarely considered to be a big problem in historical linguistics studies dealing with the evolution of spoken languages (see Blasi et al. 2016 for an exception). In sign language evolution, however, the problem may be more acute because of a similar iconic motivation of many lexical signs in phylogenetically independent sign languages (Guerra Currie et al. 2002), as well as the representation of concepts such as body parts and pronouns using indexical signs with similar forms. This latter characteristic of all known sign languages has led to the design of a basic vocabulary list that differs from those traditionally used in the historical linguistics of spoken languages (Woodward 1993); and we know of only one proposal attempting to address the problem of iconicity in sign languages for phylogenetic research (Parkhurst and Parkhurst 2003).

Figure 2: Basic processes in the evolution of languages, spoken or signed  (see our previous post How languages loose body parts).

All in all, it seems that there may be no complete solution for a list of lexical comparanda for all human languages, including sign languages, given the complexities of lexical semantics, the high variability in expression among the languages of the world (see Hymes 1960 for a detailed discussion on this problem), and the problems related to selective pressures highlighted above. Scholars have proposed alternative features for comparing languages, such as grammatical properties (Longobardi et al. 2015) or other "structural" features (Szeto et al. 2018), but these are either even more problematic for historical language comparison—given that it is never clear if these alternative features have evolved independently or due to common inheritance—or they are again based on a targeted selection for a certain group of languages in a certain region.

Targeted comparanda

If there is no universal list of features that can be used to study how languages have evolved, we have to resort to the second possibility mentioned above, by creating targeted lists of comparanda for the specific language groups whose evolution we want to study. When doing so, it is best to aim at a high degree of universality in the list of comparanda, even if one knows that complete universality cannot be achieved. This practice helps to compare a given study with alternative studies; it may also help colleagues to recycle the data, at least in part, or to merge datasets for combined analyses, if similar comparanda have been published for other languages.

But there are cases where this is not possible, especially when conducting studies where no previous data have been published, and rigorous methods for historical language comparison have yet to be established. Sign languages can, again, be seen as a good example for this case. So far, few phylogenetic studies have addressed sign language evolution, and none have supplied the data used in putting forward an evolutionary hypothesis. Furthermore, because the field lacks unified techniques for the transcription of signs, it is extremely difficult to collect lexical data for a large number of sign languages from comparable glossaries, wordlists, and dictionaries, the three primary sources, apart from fieldwork, that spoken language linguists would use in order to start a new data collection. We are aware of one comparative database with basic vocabulary for sign languages that is currently being built (Yu et al. 2018), and that may represent lexical items in a way that can be compared efficiently, but these data have not yet been made available to other researchers.

Sign languages

When Justin Power approached Mattis about three years ago, asking if he wanted to collaborate on a study relating to sign language evolution, we quickly realized that it would be infeasible to gather enough lexical data for a first study. Tiago Tresoldi, a post-doc in our group, suggested the idea of starting with sign language manual alphabets instead. From the start, it was clear that these manual alphabets might have certain disadvantages — because they are used to represent written letters of a different language, they may constitute a set of features evolving independently from the sign language itself.

Figure 3: Processes shaping manual alphabets. The evolution of signed concepts may be affected by the same, leading to congruent patterns, or different processes, leading to incongruent differentiation patterns (see our previous post: Stacking networks based on sign language manual alphabets).

But on the other hand, the data had many advantages. First, a sufficient number of examples for various European sign languages were available in online databases that could be transcribed in a uniform way. Second, the comparison itself was facilitated, since in most cases there was no ambiguity about which “concepts” to compare, in contrast to what one would encounter in a comparison of lexical entries. For example, an “a” is an “a” in all languages. Third, it turned out that for quite a few languages, historical manual alphabets could be added to the sample. This point was very important for our study. Given that scholars still have limited knowledge regarding the details of sign change in sign language evolution, it is of great importance to compare sources of the same variety, or those assumed to be the same, across time—just as spoken language linguists compared Latin with Spanish and Italian in order to study how sounds change over time. And finally, manual alphabets in fact constitute an integrated part of many sign languages that may, for example, contribute to the forms of lexical signs, making the idea more plausible that an understanding of the evolution of manual alphabets could be informative about the evolution of sign languages as a whole.

Figure 4: Early evolution of handshapes used to sign ‘g’ (see our previous post: Character cliques and networks – mapping haplotypes of manual alphabets).

Guido later joined our team, providing the expertise to analyze the data with network methods that do not assume tree-like evolution a priori. We therefore thought that we had done a rather good job when our pilot study on the evolution of sign language manual alphabets, titled Evolutionary Dynamics in the Dispersal of Sign Languages, finally appeared last month (Power et al. 2020). We identified six basic lineages from which the manual alphabets of the 40 contemporary sign languages developed. The term "lineage" was deliberately chosen in this context, since it was unclear whether the evolution of the manual alphabets should be seen as representative of the evolution of the sign languages as a whole. We also avoided the term "family", because we were wary of making potentially unwarranted assumptions about sign language evolution based on theories in historical linguistics.

Figure 5: The all-inclusive Neighbor-net (taken from Power et al. 2020).

While the study was positively received by the popular media, and even made it onto the title page of the Süddeutsche Zeitung (one of the largest daily newspapers in Germany), there were also misrepresentations of our results in some media channels. The Daily Mail (in the UK), in particular, invented the claim that all human sign languages have evolved from five European lineages. Of course, our study never said this, nor could it have, since only European sign languages were included in our sample. (We included three manual alphabets representing Arabic-based scripts from Afghan, Jordanian, and Pakistan Sign Languages, where there was some indication that these may have been informed by European sources.)

Study of phylogenetics

While we share our colleagues’ distaste for the Daily Mail’s likely purposeful misrepresentation (in the end, unfortunately, it may have achieved its purpose as click bait), some colleagues went a bit further. One critique that came up in reaction to the Daily Mail piece was that our title opens the door to misinterpretation, because we had only investigated manual alphabets and, hence, cannot say anything about the "evolutionary dynamics of sign languages".

While the title does not mention manual alphabets, it should be clear that any study on evolution is based on a certain amount of reduction. Where and how this reduction takes place is usually explained in the studies. Many debates in historical linguistics of spoken languages have centered around the question of what data are representative enough to study what scholars perceive as the "overall evolution" of languages; and scholars are far from having reached a communis opinio in this regard. At this point, we simply cannot answer the question of whether manual alphabets provide clues about sign language evolution that contrast with the languages’ "general" evolution, as expressed, for example, in selecting and comparing 100 or 200 words of basic vocabulary. We suspect that this may, indeed, be the case for some sign languages, but we simply lack the comparative data to make any claims in this respect.

Figure 6: Evolution doesn’t mean every feature has to follow the same path: a synopsis of molecular phylogenies inferred for oaks, Quercus, and their relatives, Fagaceae (upcoming post on Res.I.P.) While nuclear differentiation matches phenotypic evolution and the fossil record (likely monophyla in bold font), the evolution of the plastome is partly decoupled (gray shaded: paraphyletic clades). Likewise, we can expect that different parts of languages, such as manual alphabets vs. core “lingome” of sign languages, may indicate different relationships.

The philosophical question, however, goes much deeper, to the "nature" of language: What constitutes a language? What do all languages have in common? How do languages change? What are the best ways to study how languages evolve?

One approach to answering these questions is to compare collectible features of languages ("traits" in biology)­, and to study how they evolve. As the field develops, we may find that the evolution of a manual alphabet does not completely coincide with the evolution of the lexicon or grammar of a sign language. But would it follow from such a result that we have learned nothing about the evolution of sign languages?

There is a helpful analogy in biology: we know that different parts of the genetic code can follow different evolutionary trajectories; we also know that phenotype-based phylogenetic trees sometimes conflict with those based on genotypes. But this understanding does not stop biologists from putting forward evolutionary hypotheses for extinct organisms, where only one set of data is available (phenotypes; Tree of Life). Furthermore, such conflicting results may lead to a more comprehensive understanding of how a species has evolved.

Figure 7: A likely case of convergence: the sign for “г” in Russian and Greek Sign Language, visually depicting the letter (see our previous post Untangling vertical and horizontal processes in the evolution of handshapes). Complementing studies of signed concepts may reveal less obvious cases of convergence (or borrowing).

Because we felt the need to further clarify the intentions of our study, and to answer some of the criticism raised about the study on Twitter, we decided to prepare a short series of blog posts devoted to the general question of "How should one study language evolution" (or more generally: "How should one study evolution?"). We hope to take some of the heat out of the discussion that evolved on Twitter, by inviting those who raised critiques about our study to answer our posts in the form of comments here, or in their own blog posts.

The current blog post can thus be understood as an opening for more thoughts and, hopefully, more fruitful discussions around the question of how language evolution should be studied.

In that context, feel free to post any questions and critiques you may have about our study below, and we will aim to pick those up in future posts.


Damián E. Blasi and Wichmann, Søren and Hammarström, Harald and Stadler, Peter and Christiansen, Morten H. (2016) Sound–meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Science of the United States of America 113.39: 10818-10823.

Dagan, Tal and Martin, William (2006) The tree of one percent. Genome Biology 7.118: 1-7.

Guerra Currie, Anne-Marie P. and Meier, Richard P. and Walters, Keith (2002) A cross-linguistic examination of the lexicons of four signed languages. In R. P. Meier, K. Cormier, & D. Quinto-Pozos (Eds.), Modality and Structure in Signed and Spoken Languages (pp.224-236). Cambridge University Press.

Hoijer, Harry (1956) Lexicostatistics: a critique. Language 32.1: 49-60.

Hymes, D. H. (1960) Lexicostatistics so far. Current Anthropology 1.1: 3-44.

Longobardi, Giuseppe and Ghirotto, Silva and Guardiano, Cristina and Tassi, Francesca and Benazzo, Andrea and Ceolin, Andrea and Barbujan, Guido (2015) Across language families: Genome diversity mirrors linguistic variation within Europe. American Journal of Physical Anthropology 157.4: 630-640.

Parkhurst, Stephen and Parkhurst, Dianne (2003) Lexical comparisons of signed languages and the effects of iconicity. Working Papers of the Summer Institute of Linguistics, University of North Dakota Session, vol. 47.

Power, Justin M. and Grimm, Guido and List, Johann-Mattis (2020) Evolutionary dynamics in the dispersal of sign languages. Royal Society Open Science 7.1: 1-30. DOI: 10.1098/rsos.191100

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

Szeto, Pui Yiu and Ansaldo, Umberto and Matthews, Steven (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Woodward, James (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Monday, February 17, 2020

Large morphomatrices – trivial signal

In my last post about fossils, Farris and Felsenstein Zones, I gave an example of a trivial (signal-wise perfect) binary phylogenetic matrix, which will give us the true tree no matter which optimality criterion we use. In this post, we will look at a real world example, a huge bird therapods matrix.
S. Hartman, M. Mortimer, W. R. Wahl, D. R. Lomax, J. Lippincott, D. M. Lovelace
A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.
What intrigued me about this particular paper (I have no idea about dinosaurs, but the documentation, pictures and data, and presentation seems impeccable) was the following sentence:
The analysis resulted in >99999 most parsimonious trees with a length of 12,123 steps. The recovered trees had a consistency index of 0.073, and a retention index of 0.589.
What can you possibly do with strict consensus trees (Losing information in phylogenetic consensus) based on an unknown number of MPTs that have a CI converging to 0 (but and RI of 0.6; The curious case[s] of tree-like matrices with no synapomorphies)? And isn't this a case for some networks-based exploratory data analysis?

The complete matrix has 501 taxa and 700 characters (the largest plant morphological matrices have hardly more than 100 characters) but also a gappyness of 72%. In this case, 255,969 of the 353,500 cells in the matrix are ambiguous or undefined (missing). The matrix is a (rich) Swiss cheese with very big holes. The high number of MPTs is hence not surprising, and neither is the low CI.

Why run elaborate tree-inferences on such a swiss cheese matrix? One answer is that (some) vertebrate palaeophylogeneticists are convinced that few taxa – many character matrices can lead to wrong clades (clades that are not monophyletic); and each added taxon, no matter how many characters can be scored, will lead to a better tree, by eliminating (parsimony) branching artifacts (see Q&A to the paper). At least 56 of the 501 taxa have 5% or fewer defined characters; still, with 700 characters, 5% equals up to 35 defined traits, which is more than we can recruit for most plant fossils. The median missing data proportion is 74% — more than half of the taxa are scored for less than 26% (< 182 out of 700) of the characters. Can such taxa really save the all-inclusive tree from branching artefacts, or is the high number of MPTs an indication for signal conflicts and data gaps issues?

For this post, we will just look at the tip of the iceberg. What is the signal from the 700 characters to start with?

The basic signal

Here's the heat map for the 19 taxa that have a gappyness of less than 15% (ie. at least 595 of 700 possible characters are defined). The taxon order is mostly the one from the original matrix, sorted by phylogenetic groups — for more orientation, I added next-inclusive superclass "Clades" from Wikipedia (so apologize any errors).

In my last post, I showed that evolutionary lineages (and monophyly) can be directly deduced from such a heat map following the simple logic: two taxa sharing a (direct) common origin are usually more similar to each other than to a third, fourth etc. taxon not part of the same lineage. Exceptions include fossils close to the last common ancestors lacking advanced traits.

The outgroup as used (in this taxon sample: Allosaurus to Tyrannosaurus) is most similar to each other but not monophyletic. One (Allosaurus) respresents the sister lineage of, the other an early split within the lineage that lead to the birds (Coelurosauria:Tyrannoraptora). The extinct (monophyletic) families (Tyrannosauridae, Ornithomimidae, Dromaesauridae) are, however, well visible, being defined by low intra-family and higher inter-family pairwise distances. The same is true for the direct relatives (Clade Ornithurae) of modern birds (class Aves).

Very typical for such datasets is the increasing distance between the (primitive?) outgroups and the most derived, modern-day taxa (living birds: Struthio – ostrich, Anas – duck, Meleagris – turkey). Closest relatives in the taxon set, phylogenetically and time-wise, are (much) more similar than distant ones. Allosaurus may be most similar to the tyrannosaurs, not because of common ancestry but because both are scored as being primitive with respect to the group of interest.

The only tree

This situation becomes very obvious from the only possible (single-optimal) tree that can be inferred from this matrix, when visualized as a phylogram (Stop using cladograms!)

The ML, MP and LS/NJ tree overlapped and scaled to equal root (first split within Tyrannoraptor) to tip (split between Anas and Meleagris) distance (phylogenetic distance, via the tree). Pink, the LS clade conflicting with ML and MP trees, and Wikipedia's tree(s).

No matter which optimisation criterion is used (here Least-Squares via Neighbor-joining, Maximum Parsimony, Maximum Likelihood), the result is the same. The only exception is that the NJ/LS tree places Archaeopteryx as sister to Dromaeosauridae; and the relative branch lengths of roots vs. tips also differ.

Because our matrix has favorable properties (few taxa, many defined characters), it's straightforward to establish branch support. This is a bit frowned upon in palaeontological circles, but having dealt with morphological evolution in cases where we have molecular data, I want to know how robust my clades are, and what may be the alternatives, before I conclude that they reflect monophyly. Bootstrapping coupled with consensus networks is a quick and simple way to test robustness and investigate ambiguous support (Connecting tree and network edges) .

The BS support consensus networks for NJ/LS and ML have only a single reticulation each.

Rooted support consensus networks based on the NJ/LS (10,000 pseudoreplicates, PAUP*) and ML bootstrap (100, number of necessary replicates determined by bootstop criterion implemented in RAxML) samples. Only splits are shown that ocurred in at least 15% of the BS pseudoreplicates.

The MP BS support consensus network is, however, has many more reticulations.

Rooted MP-BS support consensus network (10,000 BS pseudoreplicates, PAUP*). Green — edge bundles corresponding to clades in the all-optimal tree(s); orange — less supported conflicting alternatives; red – higher supported conflicting alternatives; pink – wrong clade in NJ/LS tree.

We can make two generally relevant observations here:
  1. The wrong Archaeopterix-Dromaeosauridae clade (pink edge/branch) masks a split BSNJ support: 68 for the wrong clade, 31 for the right one. While resampling under ML appears to be inert to this conflict, MP is not.
  2. While the NJ- and ML support networks are very tree-like, all clades in the inferred tree have high to unambiguous support, and are near-congruent, the MP network is much more boxy. In some cases the split in agreement with the all-optimal tree has a lower BS support than an alternative (here usually in conflict with the gold tree).
Similar observations can be made with other data sets: although NJ/LS and ML optimisation are fundamentally different (distance- vs. character-based, equal change vs. varying probability of change), they show more agreement with each other when it comes to supporting a topology (or topological alternatives) than MP (character-based like ML, but all changes are treated as equal like NJ/LS). MP is a very conservative approach, highly dependent on possibly a few discerning characters. If they are missing from the BS pseudoreplicate, the backbone tree collapses or changes, and BS values may decrease rapidly. This is so even for a very data-dense matrix like the one used here (few taxa, many characters, low gappyness).

On the positive side, we can expect that MP will produce fewer false positives. On the negative side, it is also more dependent on character coverage, and will produce much more false negatives. Any fossil lacking the crucial characters (or showing too few of them) may be still resolved (placed and supported) under NJ/LS and ML but not using MP. When inferring trees, these fossils will quickly increase the number of MPTs and decrease branch support for the part of the tree they interact with. Personally, given how hard it can be to place a fossil per se with the data at hand, I always preferred a method that can give some result, and point towards possible alternatives (even risking including erroneous), rather than no result at all.

The simplest of networks

Naturally, we can use the distance matrix directly to infer a Neighbor-net, and explore the basic differentiation signal beyond trees but also with regard to the all-optimal tree.

Neighbor-net based on the pairwise distance matrix. Coloration highlights edges found (or not) in the optimised trees.

The Neighbor-net recovers the clades from the all-optimal tree (green, purple the NJ/LS-unique branch), but shows additional edges (orange). The principal signal in the data has, for instance, problems with placing Archaeopteryx, because it is (signal-wise) intermediate between the Avebrevicaudata, the lineage including modern birds, and the Dromaeosauridae, their sister lineage (note that the vertebrate fossil record is considered to be free of ancestors and precursors; all fossils represent extinct sister lineages – evolutionary dead-ends). Skeleton IGM 100042 (an Oviraptoridae), placed as sister to both in the all-optimal tree, also lacks obvious affinities: this is a taxon where the tree inference makes a decision that is not based on a trivial signal encoded in the matrix.

The central boxy part of the Neighbor-net correlates with the 2/3-dimensional part of the parsimony BS consensus network: to resolve these relationships, we need a large set of characters (under MP). On the other hand, recognizing the Ornithurae, members of an extinct family, or a relative of IGM 100042, should be straightforward even with a limited amount of defined characters. Based on the Neighbor-net, which is inferred in a blink no matter how large the matrix, we can also make a decision, as to which taxa interfere and which ones facilitate tree-inferences. The more tree-like the Neighbor-net graph becomes, the easier it is for a tree inference to be made.

Placing fossils, quickly and easily

Using this backbone graph, it is easy to assess in which phylogenetic neighborhood a newly coded fossil falls, eg. the fossil newly described in Hartman et al. and scored for 267 unambiguously defined traits, Hesperornithoides.

Neighbor-net including Hesperornithoides.

Hesperornithoides is obviously a member of the Eumaniraptora (= Paraves), morphologically somewhat intermediate between the Avialae, the "flying dinosaurs", and Dromaeosauridae, but doesn't seem to be part of either of these sister lineages. The graph lacks a prominent neighborhood, the Archaeopteryx-Bambiraptor neighborhood may reflect local long-edge attraction (note the long terminal edges) or convergent evolution in both taxa and, possibly, also the Hesperornithoides lineage. Just based on this simple and quick-to-infer network, Hartman et al.'s title "A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight" appears to be correct (in future posts, we may come back to this morphological supermatrix to see what else networks could have quickly shown).

One should be willing to leave the phylogenetic beaten track – ie. relying on strict consensus parsimony trees as the sole basis for phylogenetic hypothesis. The Neighbor-net is a valuable tool for quick pre- and post-analysis because it can:
  • visualize how coherent the clades in our trees are, 
  • how easy it will be for the tree inference (especially MP) to find and support clades, 
  • help to differentiate ambiguous from important taxa, and finally, 
  • assess whether a new fossil really requires an in-depth re-analysis of the full matrix (and dealing with >99,999 MPTs) instead of using a more focussed taxon (and character) set.

Monday, February 10, 2020

Fossils and networks 1 – Farris and Felsenstein

Over 60 years ago, Robert Sokal and Peter Sneath changed the way we quantitatively study evolution, by providing the first numerical approach to infer a phylogenetic tree. About the same time, but in German, Willi Hennig established the importance of distinguishing primitive and advanced character states, rather than treating all states as equal. This established a distinction between phenetics and phylogenetics; and the latter is the basis of all modern studies, whether it is explicitly acknowledged or not.

More than two decades later, Steve Farris and the Willi Hennig Society (WHS) established parsimony as the standard approach for evaluating character-state changes for tree inference. In this approach, morphological traits are scored and arranged data matrices, and then the most parsimonious solution is found to explain the data. This tree, usually a collection of most-parsimonious trees (MPT), was considered to be the best approximation of the true tree. Clades in the trees were synonymized with monophyly sensu Hennig (1950, short English version published 1965), and grades with paraphyly: Cladistics was born (see also: Let's distinguish between Hennig and Cladistics).

Why parsimony? Joe Felsenstein, who was not a member of the WHS but brought us, among many other things, the nonparametic bootstrap (Felsenstein 1985), put it like this (Felsenstein 2001):
History: William of Ockham told Popper to tell Hennig to use parsimony
Soon, parsimony and cladistics came under threat by advances in computer technology and Kary Mullis' development of the polymerase-chain-reaction (PCR; Mullis & Faloona 1987) in the early 80s (note: Mullis soon went on with more fun stuff, outside science). While the data analysis took ages (literally) in the early days, more and more speedy heuristics were invented for probabilistic inferences. PCR marked the beginning of the Molecular Revolution, and genetic data became easy to access. Soon, many researchers realized that parsimony trees perform badly for this new kind of data, a notion bitterly rejected by the parsimonists, organized mainly in the WHS: the "Phylogenetic Wars" raged.

The parsimony faction lost. Today, when we analyze (up to) terabytes of molecular data, we use probabilistic methods such as maximum likelihood (ML) and Bayesian inference (BI). However, one parsimony bastion has largely remained unfazed: palaeontology.

In a series of new posts, we will try to change that; and outline what easy-to-compute networks have to offer when analyzing non-molecular data.

It's just similarity, stupid!

One collateral damage of the Phylogenetic Wars was distance-based methods, which, still today, are sometimes classified as "phenetic" in opposite to the "phylogenetic" character-based methods (parsimony, ML, BI). The first numerical phylogenetic trees were not based on character matrices but distance matrices (eg. Michener & Sokal 1957 using a cluster algorithm; Cavalli-Sforza & Edwards 1965 using parsimony; see also Felsenstein 2004, pp.123ff).

But no matter which method, optimality criterion or data-type we use, in principal we do the analysis under the same basic assumptions:
  1. the more closely related two taxa are, then the more similar they should be.
  2. the more similar two taxa are, then the more recent is their split.

The ingroup (blue clade) and outgroup (red clade) are most distant from each other. Placing the fossils is trivial: Z is closer to O than to C, the member of the ingroup with the fewest advanced character states. Ingroup sister taxa A + C and B + D are most similar to each other. The monophyly of the ingroup and its two subclades is perfectly reflected by the inter-taxon distances.

Assuming that the character distance reflects the phylogenetic distance (ie. the distance along the branches of the true tree), any numerical phylogenetic approach will succeed in finding the true tree. The Neighbor-Joining method (using either the Least Squares or Minimum Evolution criteria) will be the quickest calculation. The signal from such matrices is trivial, we are in the so-called "Farris Zone" (defined below).

We wouldn't even have to infer a tree to get it right (ie. nested monophyly of A + C, B + D, A–D), we could just look at the heat map sorted by inter-taxon distance.

Just from the distance distributions, visualized in the form of a "heat-map", it is obvious that A–D are monophyletic, and fossil Z is part of the outgroup lineage. As expected for the same phylogenetic lineage (because changes accumulate over time), its fossils C and D are still relatively close, having few advanced character states, while the modern-day members A and B are have diverged from each other (based on derived character states). Taxon B is most similar to D, while C is most similar A. So, we can hypothesize that C is either a sister or precursor of A, and D is the same of B. If C and D are stem group taxa (ie. they are paraphyletic), then we would expect that both would show similar distances to A vs. B, and be closer to the outgroup. If representing an extinct sister lineage (ie. CD is monophyletic), they should be more similar to each other than to A or B. In both cases (CD either paraphyletic or monophyletic), A and B would be monophyletic, and so they should be relatively more similar to each other than to the fossils as well.

Having a black hole named after you

The Farris Zone is that part of the tree-space where the signals from the data are trivial, we have no branching artifacts, and any inference (tree or network), gives us the true tree.

It's opposite has been, unsurprisingly perhaps, labeled the "Felsenstein Zone". This is the part of the tree-space where branching artifacts are important — the inferred tree deviates from the true tree. Clades and grades (structural aspects of the tree) are no longer synonymous with monophyly and paraphyly (their evolutionary interpretation).

We can easily shift our example from the Farris into the Felsenstein Zone, by halving the distances between the fossils and the first (FCA) and last (LCA) common ancestors of ingroup and outgroup and adding some (random = convergence; lineage-restricted = homoiology) homoplasy to the long branches leading to the modern-day genera.

The difference between distance, parsimony and probabilistic methods is how we evaluate alternative tree topologies when similarity patterns become ambiguous — ie. when we approach or enter the Felsenstein Zone. Have all inferred mutations the same probability; how clock-like is evolution; are their convergence/ saturation effects; how do we deal with missing data?

For our example, any tree inference method will infer a wrong AB clade, because the fossils lack enough traits shared only with their sisters but not with the other part of the ingroup. Only the roots are supported by exclusively shared (unique) derived traits (Hennigian "synapomorphies"). The long-branch attraction (LBA) between A and B is effectively caused by:
  • 'short-branch culling' between C and D: the fossils are too similar to each other; and their modern relatives too modified;
  • the character similarity between A and B underestimates the phylogenetic distance between A and B, due to derived traits that evolved in parallel (homoiologies).
While ML and NJ make the same decision, the three maximum-parsimony trees permute options for placing C and D, except for the correct options, and including an impossible one (a hard trichotomy). Standard phylogenetic trees are (by definition) dichotomous being based on the concept of cladogenesis — one lineage splits into two lineages.

MPTs inferred using PAUP*'s branch-and-bound algorithm (no heuristics, this algorithm will find the actual most parsimonious solution); NJ/LS using PAUP* BioNJ implementation and simple (mean) pairwise distances; ML using RAxML, corrected (asc.) and not (unc.*) for ascertainment bias (the character matrix has no invariable sites). All trees are not rooted using a defined outgroup but mid-point rooted. If rooted with the known outgroup O, the fossil Z would be misinterpreted as early member of the ingroup.

We have no ingroup-outgroup LBA, because the three convergent traits shared by O and A or B, respectively, compete with a total of eight lineage-unique and conserved traits (synapomorphies) — six characters are compatible with a O-A or O-B sister-relationship (clade in a rooted tree) but eight are incompatible. We correctly infer an A–D | O + Z split (ie. A–D clade when rooted with O) simply because A and B are still more similar to C and D than to Z and O; not some method- or model-inflicted magic.

The magic of non-parametric bootstrapping

When phylogeneticists perform bootstrapping, they usually do it to try to evaluate branch support values — a clade alone is hardly sufficient to infer an inclusive common origin (Hennig's monophyly), so we add branch support to quantify its quality (Some things you probably don't know about the bootstrap). In palaeontology, however, this is not a general standard (Ockhams Razor applied but not used), for one simple reason: bootstrapping values for critical branches in the trees are usually much lower than the molecular-based (generally accepted) threshold of 70+ for "good support" (All solved a decade ago).

When we bootstrap the Felsenstein Zone matrix that gives us the "wrong" (paraphyletic) AB clade, no matter which tree-inference method we use, we can see why this standard approach undervalues the potential of bootstrapping to explore the signal in our matrices.

Consensus networks based on each 10,000 BS pseudoreplicates, only splits are shown that are at least found in 15% of the replicate trees (trivial splits collapsed). Reddish – false splits (paraphyletic clades), green – true splits (monophyletic clades).

While parsimony and NJ bootstrap pseudoreplicates either fall prey to LBA or don't provide any viable alternative (the bootstrap replicate matrix lacks critical characters), in the example a significant amount of ML pseudoreplicates did escape the A/B long-branch attraction.

Uncorrected, the correct splits A + C vs. rest and B + D vs. rest can be found in 19% of the 10,000 computed pseudoreplicate trees. When correcting for ascertainment bias, their number increases to 41%, while the support for the wrong A + B "clade" collapses to BSML = 49. Our BS supports are quite close to what Felsenstein writes in his 2004 book: For the four taxon case, ML has a 50:50 chance to escape LBA (BSML = true: 41 vs. false: 49), while MP and distance-methods will get it always wrong (BSMP = 88, BSNJ = 86).

The inferred tree may get it wrong but the (ML) bootstrap samples tell us the matrix' signal is far from clear.

Side-note: Bayesian Inference cannot escape such signal-inherent artifacts because its purpose is to find the tree that best matches all signals in the data, which, in our case, is the wrong alternative with the AB clade — supported by five characters, rejected by four each including three that are largely incompatible with the true tree. Posterior Probabilities will quickly converge to 1.0 for all branches, good and bad ones (see also Zander 2004); unless there is very little discriminating signal in the matrix — a CD clade, C sister to ABD, D sister to ABC, ie. the topological alternatives not conflicting with the wrong AB clade, will have PP << 1.0 because these topological alternatives give similar likelihoods.

Long-edge attraction

When it comes to LBA artifacts and the Felsenstein Zone, our preferred basic network-inference method, the Neighbor-Net (NNet), has its limitations, too. The NNet algorithm is in principle a 2-dimensional extension of the NJ algorithm. The latter is prone to LBA, hence, and the NNet can be affected by LEA: long edge attraction. The relative dissimilarity of A and B to C and D, and (relative) similarity of A/B and C/D, respectively, will be expressed in the form of a network neighborhood.

Note, however, the absence of a C/D neighbourhood. If C is a sister of D (as seen in the NJ tree), then there should be at least a small neighbourhood. It's missing because C has a different second-best neighbour than B within the A–D neighbourhood. While the tree forces us into a sequence of dichotomies, the network visualizes the two competing differentiation patterns: general advancement on the one hand (ABO | CDZ split), and on the other potential LBA/LEA, vs. similarity due to a shared common ancestry (ABCD | OZ split; BD neighborhood).

Just from the network, we would conclude that C and D are primitive relatives of A and B, potentially precursors. The same could be inferred from the trees; but if we map the character changes onto the net (Why we may want to map trait evolution on networks), we can notice there may be more to it.

Character splits ('cliques') mapped on the NNet. Green, derived traits shared by all descendants of a common ancestor ('synapomorphies'); blue, lineage-restricted derived traits, ie. shared by some but not all descendants (homoiologies); pink, convergences between in- and outgroup; black, unique traits ('autapomorphies')

Future posts

In each of the upcoming posts in this (irregular) series, we will look at a specific problem with non-molecular data, and test to what end exploratory data analysis can save us from misleading clades; eg. clades in morphology-informed (parts of, in case of total evidence) trees that are not monophyletic.

* The uncorrected ML tree shows branch lengths that are unrealistic (note the scale), and highly distorted. The reason for this is that the taxon set includes (very) primitive fossils and (highly) derived modern-day genera, but the matrix has no invariable sites telling the ML optimization that changing from 0 ↔ 1 is not that big of a deal. This is where the ascertainment bias correction(s) step(s) in (RAxML-NG, the replacement for classic RAxML 8 used here, has more than one implementation to correct for ascertainment bias. A tip for programmers and coders: effects of corrections have so far not been evaluated for non-molecular data).

Cited literature
Cavalli-Sforza LL, Edwards AWF. 1965. Analysis of human evolution. In: Geerts SJ, ed. Proceedings of the XI International Congress of Genetics, The Hague, The Netherlands, 1963. Genetics Today, vol. 3. Oxford: Pergamon Press, p. 923–933.
Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
Felsenstein J. 2001. The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.
Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc., 664 pp. (in chapter 10, Felsenstein provides an entertaining "degression on history and philosophy" of phylogenetics and systematics).
Hennig W. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag, 370 pp.
Hennig W. 1965. Phylogenetic systematics. Annual Review of Entomology 10:97–116.
Michener CD, Sokal RR. 1957. A quantitative approach to a problem in classification. Evolution 11:130–162.
Mullis KB, Faloona F. 1987. Specific synthesis of DNA in vitro via a polymerase catalyzed chain reaction. Methods in Enzymology 155:335–350.

Monday, February 3, 2020

A network of life expectancy and body mass index

At my advanced age, the concept of Life Expectancy (the average age at which people of my generation die) becomes of some practical importance. Perhaps more importantly, the concept of Healthy Life Expectancy rears its head, this being the average age at which one's health starts to notably deteriorate.

Both of these human attributes are related to many things, but in the modern world Obesity is one of the most important contributors to lack of health. This is frequently measured as the Body Mass Index (BMI), defined as the body mass (kilograms) divided by the square of the body height (meters). A BMI > 30 is classified as Obese, and this is definitely considered to represent lomg-term poor health.

So, let's look at some data, to see how the USA currently fares with regard to these characteristics. The US Burden of Disease Collaborators recently released some up-to-date data (The state of US health, 1990-2016. Burden of diseases, injuries, and risk factors among US states. Journal of the American Medical Association 319: 1444-1472). You can consult their Table 1 if you want to consider the major recent causes of death in the USA.

However, we will focus on the positive side, instead — how long do people live? The first graph here shows the relationship between the two Life Expectancy variables for the year 2016, with each point representing one state of the USA, plus DC. The line shown on the graph represents the national average.

Life expectabcy versus healthy life expectancy

As expected, there is a high correlation between the two variables, although there is a 6-year difference in Expectancy among the various states. The top states include Hawaii, California, Connecticut, Minnesota, New York, Massachusetts, Colorado, New Jersey and Washington; while the bottom states are Mississippi, West Virginia, Alabama, Louisiana, Oklahoma, Arkansas, Kentucky, Tennessee and South Carolina. The social and economic differences between those two groups should be clear to everyone, and this is well-known to relate to life-length.

The national average for Life Expectancy is 78.9 years, while the Healthy Life Expectancy is 67.7 years (ie. 11.2 years less). This probably doesn't surprise you — the last 11 years of your life is likely to be spent dealing with ill health. The points on the graph are scattered around the national-average line except at the lowest Expectancies — this implies a shorter period of unhealth at the end of life for those with a poor Life Expectancy. Notably, Mississippi has the lowest Life Expectancy but only the 5th lowest Healthy LE.

We can now turn to look at Body Mass Index (BMI) and how it relates to Healthy Life Expectancy. This is shown in the next graph, where the BMI data refer to the percentage of people who are obese (BMI > 30). Once again, each point refers to a single state. Clearly, as Obesity increases then Healthy LE decreases. The medical people have been telling us this for decades.

Body mass index versus healthy life expectancy

Note, however, the big difference in obesity levels between the states (15.5 percentage points) — there are nearly two-thirds more obese people in some states than in others. The states with the highest Obesity levels include West Virginia, Mississippi, Oklahoma, Iowa, Alabama, Louisiana and Arkansas, while the other extreme includes Colorado, DC, Hawaii, California, Montana, Utah, New York and Massachusetts.

Also, note that the relationship between the Obesity and Life Expectancy variables is not linear. Below 26% population obesity there is little change in average Life Expectancy, whereas above 30% obesity levels Life Expectancy declines rapidly. For every 1% increase in average Obesity the average LE is reduced by 0.3 years.

Two of the territories are labeled in the graph, as showing unusual patterns. The people of the District of Columbia are clearly not "fat cats", as often depicted, but their lives are apparently not all that healthy. On the other hand, the people of Iowa somehow manage to remain healthy for longer than average, even though they have one of the highest Obesity levels.

Finally, we can put all of this together in a single network, depicting the data patterns. As usual in this blog, one of the simplest ways to get a pictorial overview of the data is to use a phylogenetic network, as a form of exploratory data analysis. For this analysis, I first calculated the similarity of the states using the manhattan distance, based on the three variables listed above. A Neighbor-net analysis was then used to display the between-territory similarities.

The resulting network is shown in the final graph. Territories that are closely connected in the network are similar to each other based on their two Life Expectancies and BMI levels, and those that are further apart are progressively more different from each other.

Network of life expectancy and body mass index

In this case, the network displays states with decreasing Life Expectancies from top to bottom, and decreasing Obesity from left to right. It makes visually clear that those states with the shortest Life Expectancies are almost always associated with high Obesity levels (ie. they are at the bottom-left of the network).

For longer Life Expectancies, some states have high Obesity levels (top-left of the network) while some have lower levels (top-right). Iowa is shown as quite distinct from the other states (it has a long edge of its own), since it has longer LE than would be expected for its population Obesity level.