Monday, February 24, 2020

How should one study language evolution?

This is a joint post by Justin Power, Guido Grimm, and Johann-Mattis List.

Like in biology, we have two basic possibilities for studying how languages evolve:
  • We set up a list of universal comparanda. These should occur in all languages and show a high enough degree of variation that we can use them as indicators of how languages have evolved;
  • We create individual lists of comparanda. These are specific for certain language groups that we want to study.
Universal comparanda

While most studies would probably aim to employ a set of universal comparanda, the practice often requires a compromise solution in which some non-universal characteristics are added. This holds, for example, for the idea of a core genome in biology, which ends up being so small in overlap across all living species that it makes little sense to compute phylogenies based on it, except for closely related species (Dagan and Martin 2006). Another example is the all-inclusive matrices that are used to establish evolutionary relationships of extinct animals characterized by high levels of missing data (eg. Tschopp et al. 2015; Hartman et al. 2019). The same holds for historical linguistics, with the idea of a basic lexicon or basic vocabulary, represented by a list of basic concepts that are supposed to be expressed by simple words in every human language (Swadesh 1955), given that the number of concepts represented by simple words shared across all human languages is extremely small (Hoijer 1956).

Figure 1: All humans have hands and arms but some words for ‘hands’ and ‘arms’ address different things (see our previous post "How languages lose body parts").

Apart from the problem that basic vocabulary concepts occurring in all languages may be extremely limited, test items need to fulfill additional criteria, which may not be easy to meet, in order to be useful for phylogenetic studies. They should, for example, be rather resistant to processes of lateral transfer (borrowing, in linguistics). They should preferably be subject to neutral evolution, since selective pressure may lead to parallel but phylogenetically independent processes (in biology known as convergent evolution) that are difficult to distinguish and can increase the amount of noise in the data (homoplasy).

Selective pressure, as we might find, for example, in a specific association between certain concepts and certain sounds across a large phylogenetically independent sample of human languages, is rarely considered to be a big problem in historical linguistics studies dealing with the evolution of spoken languages (see Blasi et al. 2016 for an exception). In sign language evolution, however, the problem may be more acute because of a similar iconic motivation of many lexical signs in phylogenetically independent sign languages (Guerra Currie et al. 2002), as well as the representation of concepts such as body parts and pronouns using indexical signs with similar forms. This latter characteristic of all known sign languages has led to the design of a basic vocabulary list that differs from those traditionally used in the historical linguistics of spoken languages (Woodward 1993); and we know of only one proposal attempting to address the problem of iconicity in sign languages for phylogenetic research (Parkhurst and Parkhurst 2003).

Figure 2: Basic processes in the evolution of languages, spoken or signed (see our previous post "How languages lose body parts").

All in all, it seems that there may be no complete solution for a list of lexical comparanda for all human languages, including sign languages, given the complexities of lexical semantics, the high variability in expression among the languages of the world (see Hymes 1960 for a detailed discussion on this problem), and the problems related to selective pressures highlighted above. Scholars have proposed alternative features for comparing languages, such as grammatical properties (Longobardi et al. 2015) or other "structural" features (Szeto et al. 2018), but these are either even more problematic for historical language comparison—given that it is never clear if these alternative features have evolved independently or due to common inheritance—or they are again based on a targeted selection for a certain group of languages in a certain region.

Targeted comparanda

If there is no universal list of features that can be used to study how languages have evolved, we have to resort to the second possibility mentioned above, by creating targeted lists of comparanda for the specific language groups whose evolution we want to study. When doing so, it is best to aim at a high degree of universality in the list of comparanda, even if one knows that complete universality cannot be achieved. This practice helps to compare a given study with alternative studies; it may also help colleagues to recycle the data, at least in part, or to merge datasets for combined analyses, if similar comparanda have been published for other languages.

But there are cases where this is not possible, especially when conducting studies where no previous data have been published, and rigorous methods for historical language comparison have yet to be established. Sign languages can, again, be seen as a good example for this case. So far, few phylogenetic studies have addressed sign language evolution, and none have supplied the data used in putting forward an evolutionary hypothesis. Furthermore, because the field lacks unified techniques for the transcription of signs, it is extremely difficult to collect lexical data for a large number of sign languages from comparable glossaries, wordlists, and dictionaries, the three primary sources, apart from fieldwork, that spoken language linguists would use in order to start a new data collection. We are aware of one comparative database with basic vocabulary for sign languages that is currently being built (Yu et al. 2018), and that may represent lexical items in a way that can be compared efficiently, but these data have not yet been made available to other researchers.

Sign languages

When Justin Power approached Mattis about three years ago, asking if he wanted to collaborate on a study relating to sign language evolution, we quickly realized that it would be infeasible to gather enough lexical data for a first study. Tiago Tresoldi, a post-doc in our group, suggested the idea of starting with sign language manual alphabets instead. From the start, it was clear that these manual alphabets might have certain disadvantages — because they are used to represent written letters of a different language, they may constitute a set of features evolving independently from the sign language itself.

Figure 3: Processes shaping manual alphabets. The evolution of signed concepts may be affected by the same, leading to congruent patterns, or different processes, leading to incongruent differentiation patterns (see our previous post: Stacking networks based on sign language manual alphabets).

But on the other hand, the data had many advantages. First, a sufficient number of examples for various European sign languages were available in online databases that could be transcribed in a uniform way. Second, the comparison itself was facilitated, since in most cases there was no ambiguity about which “concepts” to compare, in contrast to what one would encounter in a comparison of lexical entries. For example, an “a” is an “a” in all languages. Third, it turned out that for quite a few languages, historical manual alphabets could be added to the sample. This point was very important for our study. Given that scholars still have limited knowledge regarding the details of sign change in sign language evolution, it is of great importance to compare sources of the same variety, or those assumed to be the same, across time—just as spoken language linguists compared Latin with Spanish and Italian in order to study how sounds change over time. And finally, manual alphabets in fact constitute an integrated part of many sign languages that may, for example, contribute to the forms of lexical signs, making the idea more plausible that an understanding of the evolution of manual alphabets could be informative about the evolution of sign languages as a whole.

Figure 4: Early evolution of handshapes used to sign ‘g’ (see our previous post: Character cliques and networks – mapping haplotypes of manual alphabets).

Guido later joined our team, providing the expertise to analyze the data with network methods that do not assume tree-like evolution a priori. We therefore thought that we had done a rather good job when our pilot study on the evolution of sign language manual alphabets, titled Evolutionary Dynamics in the Dispersal of Sign Languages, finally appeared last month (Power et al. 2020). We identified six basic lineages from which the manual alphabets of the 40 contemporary sign languages developed. The term "lineage" was deliberately chosen in this context, since it was unclear whether the evolution of the manual alphabets should be seen as representative of the evolution of the sign languages as a whole. We also avoided the term "family", because we were wary of making potentially unwarranted assumptions about sign language evolution based on theories in historical linguistics.

Figure 5: The all-inclusive Neighbor-net (taken from Power et al. 2020).

While the study was positively received by the popular media, and even made it onto the title page of the Süddeutsche Zeitung (one of the largest daily newspapers in Germany), there were also misrepresentations of our results in some media channels. The Daily Mail (in the UK), in particular, invented the claim that all human sign languages have evolved from five European lineages. Of course, our study never said this, nor could it have, since only European sign languages were included in our sample. (We included three manual alphabets representing Arabic-based scripts from Afghan, Jordanian, and Pakistan Sign Languages, where there was some indication that these may have been informed by European sources.)

Study of phylogenetics

While we share our colleagues’ distaste for the Daily Mail’s likely purposeful misrepresentation (in the end, unfortunately, it may have achieved its purpose as click bait), some colleagues went a bit further. One critique that came up in reaction to the Daily Mail piece was that our title opens the door to misinterpretation, because we had only investigated manual alphabets and, hence, cannot say anything about the "evolutionary dynamics of sign languages".

While the title does not mention manual alphabets, it should be clear that any study on evolution is based on a certain amount of reduction. Where and how this reduction takes place is usually explained in the studies. Many debates in historical linguistics of spoken languages have centered around the question of what data are representative enough to study what scholars perceive as the "overall evolution" of languages; and scholars are far from having reached a communis opinio in this regard. At this point, we simply cannot answer the question of whether manual alphabets provide clues about sign language evolution that contrast with the languages’ "general" evolution, as expressed, for example, in selecting and comparing 100 or 200 words of basic vocabulary. We suspect that this may, indeed, be the case for some sign languages, but we simply lack the comparative data to make any claims in this respect.

Figure 6: Evolution doesn’t mean every feature has to follow the same path: a synopsis of molecular phylogenies inferred for oaks, Quercus, and their relatives, Fagaceae (upcoming post on Res.I.P.). While nuclear differentiation matches phenotypic evolution and the fossil record (likely monophyla in bold font), the evolution of the plastome is partly decoupled (gray shaded: paraphyletic clades). Likewise, we can expect that different parts of languages, such as manual alphabets vs. core “lingome” of sign languages, may indicate different relationships.

The philosophical question, however, goes much deeper, to the "nature" of language: What constitutes a language? What do all languages have in common? How do languages change? What are the best ways to study how languages evolve?

One approach to answering these questions is to compare collectible features of languages ("traits" in biology), and to study how they evolve. As the field develops, we may find that the evolution of a manual alphabet does not completely coincide with the evolution of the lexicon or grammar of a sign language. But would it follow from such a result that we have learned nothing about the evolution of sign languages?

There is a helpful analogy in biology: we know that different parts of the genetic code can follow different evolutionary trajectories; we also know that phenotype-based phylogenetic trees sometimes conflict with those based on genotypes. But this understanding does not stop biologists from putting forward evolutionary hypotheses for extinct organisms, where only one set of data is available (phenotypes; Tree of Life). Furthermore, such conflicting results may lead to a more comprehensive understanding of how a species has evolved.

Figure 7: A likely case of convergence: the sign for “г” in Russian and Greek Sign Language, visually depicting the letter (see our previous post Untangling vertical and horizontal processes in the evolution of handshapes). Complementing studies of signed concepts may reveal less obvious cases of convergence (or borrowing).

Because we felt the need to further clarify the intentions of our study, and to answer some of the criticism raised about the study on Twitter, we decided to prepare a short series of blog posts devoted to the general question of "How should one study language evolution" (or more generally: "How should one study evolution?"). We hope to take some of the heat out of the discussion that evolved on Twitter, by inviting those who raised critiques about our study to answer our posts in the form of comments here, or in their own blog posts.

The current blog post can thus be understood as an opening for more thoughts and, hopefully, more fruitful discussions around the question of how language evolution should be studied.

In that context, feel free to post any questions and critiques you may have about our study below, and we will aim to pick those up in future posts.


Blasi, Damián E. and Wichmann, Søren and Hammarström, Harald and Stadler, Peter and Christiansen, Morten H. (2016) Sound–meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences of the United States of America 113.39: 10818-10823.

Dagan, Tal and Martin, William (2006) The tree of one percent. Genome Biology 7.118: 1-7.

Guerra Currie, Anne-Marie P. and Meier, Richard P. and Walters, Keith (2002) A cross-linguistic examination of the lexicons of four signed languages. In R. P. Meier, K. Cormier, & D. Quinto-Pozos (Eds.), Modality and Structure in Signed and Spoken Languages (pp.224-236). Cambridge University Press.

Hoijer, Harry (1956) Lexicostatistics: a critique. Language 32.1: 49-60.

Hymes, D. H. (1960) Lexicostatistics so far. Current Anthropology 1.1: 3-44.

Longobardi, Giuseppe and Ghirotto, Silva and Guardiano, Cristina and Tassi, Francesca and Benazzo, Andrea and Ceolin, Andrea and Barbujani, Guido (2015) Across language families: Genome diversity mirrors linguistic variation within Europe. American Journal of Physical Anthropology 157.4: 630-640.

Parkhurst, Stephen and Parkhurst, Dianne (2003) Lexical comparisons of signed languages and the effects of iconicity. Working Papers of the Summer Institute of Linguistics, University of North Dakota Session, vol. 47.

Power, Justin M. and Grimm, Guido and List, Johann-Mattis (2020) Evolutionary dynamics in the dispersal of sign languages. Royal Society Open Science 7.1: 1-30. DOI: 10.1098/rsos.191100

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

Szeto, Pui Yiu and Ansaldo, Umberto and Matthews, Steven (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Woodward, James (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Monday, February 17, 2020

Large morphomatrices – trivial signal

In my last post about fossils, Farris and Felsenstein Zones, I gave an example of a trivial (signal-wise perfect) binary phylogenetic matrix, which will give us the true tree no matter which optimality criterion we use. In this post, we will look at a real-world example, a huge bird and theropod matrix.
S. Hartman, M. Mortimer, W. R. Wahl, D. R. Lomax, J. Lippincott, D. M. Lovelace
A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.
What intrigued me about this particular paper (I have no idea about dinosaurs, but the documentation, pictures and data, and presentation seem impeccable) was the following sentence:
The analysis resulted in >99999 most parsimonious trees with a length of 12,123 steps. The recovered trees had a consistency index of 0.073, and a retention index of 0.589.
What can you possibly do with strict consensus trees (Losing information in phylogenetic consensus) based on an unknown number of MPTs that have a CI converging to 0 (but an RI of 0.6; The curious case[s] of tree-like matrices with no synapomorphies)? And isn't this a case for some network-based exploratory data analysis?
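
For readers less familiar with the two indices, here is a minimal sketch of how the ensemble consistency index (CI) and retention index (RI) are usually defined, and why a CI near 0 can coexist with a moderate RI. The function names and the toy step counts are hypothetical illustrations; they are not taken from Hartman et al.'s analysis.

```python
def consistency_index(min_steps: float, observed_steps: float) -> float:
    """Ensemble CI = M / S.

    M: sum over characters of the minimum number of steps each could need
       on any tree; S: observed length of the tree.
    """
    if observed_steps <= 0:
        raise ValueError("observed_steps must be positive")
    return min_steps / observed_steps


def retention_index(min_steps: float, observed_steps: float,
                    max_steps: float) -> float:
    """Ensemble RI = (G - S) / (G - M).

    G: sum over characters of the maximum number of steps each could need
       on any tree (the completely unresolved case).
    """
    if max_steps == min_steps:
        raise ValueError("RI is undefined when G equals M")
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical toy values (NOT from the paper): the tree needs four times
# the minimum number of steps, but is still far shorter than the worst case.
M, S, G = 10, 40, 70
ci = consistency_index(M, S)
ri = retention_index(M, S, G)
print(f"CI = {ci:.2f}, RI = {ri:.2f}")
```

The CI only compares the tree length with the theoretical minimum, so it drops quickly as taxa and homoplasy are added; the RI additionally scales by the worst possible length, which is why the two can point in different directions.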

The complete matrix has 501 taxa and 700 characters (the largest plant morphological matrices have hardly more than 100 characters) but also a gappyness of 72%. In this case, 255,969 of the 353,500 cells in the matrix are ambiguous or undefined (missing). The matrix is a (rich) Swiss cheese with very big holes. The high number of MPTs is hence not surprising, and neither is the low CI.

Why run elaborate tree inferences on such a Swiss cheese matrix? One answer is that (some) vertebrate palaeophylogeneticists are convinced that few-taxa, many-character matrices can lead to wrong clades (clades that are not monophyletic); and that each added taxon, no matter how many characters can be scored, will lead to a better tree, by eliminating (parsimony) branching artifacts (see Q&A to the paper). At least 56 of the 501 taxa have 5% or fewer defined characters; still, with 700 characters, 5% equals up to 35 defined traits, which is more than we can recruit for most plant fossils. The median missing data proportion is 74%; that is, about half of the taxa are scored for 26% or less (at most 182 out of 700) of the characters. Can such taxa really save the all-inclusive tree from branching artefacts, or is the high number of MPTs an indication of signal conflict and data-gap issues?
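
The gappyness figures quoted above are simple proportions, and a minimal sketch of how they could be computed is given below. It assumes, purely for illustration, that each taxon is stored as a string with one symbol per character and that "?" and "-" mark missing or inapplicable cells; polymorphic codings and the real matrix format are ignored. The toy matrix and all names are hypothetical.

```python
from statistics import median

# Assumption: these symbols mark missing or inapplicable cells.
MISSING = {"?", "-"}


def taxon_gappyness(row: str) -> float:
    """Proportion of undefined cells for a single taxon."""
    return sum(1 for state in row if state in MISSING) / len(row)


def matrix_gappyness(matrix: dict[str, str]) -> float:
    """Proportion of undefined cells across the whole matrix."""
    total = sum(len(row) for row in matrix.values())
    missing = sum(1 for row in matrix.values()
                  for state in row if state in MISSING)
    return missing / total


# Hypothetical toy matrix: 4 taxa, 10 characters.
toy = {
    "taxon_A": "0101100110",
    "taxon_B": "01?11001?0",
    "taxon_C": "??????1?0?",
    "taxon_D": "1?0???????",
}

per_taxon = {name: taxon_gappyness(row) for name, row in toy.items()}
overall = matrix_gappyness(toy)
median_gappyness = median(per_taxon.values())

# Keep only well-scored taxa, analogous to the < 15% cut-off used below.
well_scored = [name for name, gap in per_taxon.items() if gap < 0.15]

print(per_taxon, overall, median_gappyness, well_scored)
```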

For this post, we will just look at the tip of the iceberg. What is the signal from the 700 characters to start with?

The basic signal

Here's the heat map for the 19 taxa that have a gappyness of less than 15% (ie. at least 595 of 700 possible characters are defined). The taxon order is mostly the one from the original matrix, sorted by phylogenetic groups; for more orientation, I added the next-inclusive superclass "Clades" from Wikipedia (so apologies for any errors).

In my last post, I showed that evolutionary lineages (and monophyly) can be directly deduced from such a heat map following the simple logic: two taxa sharing a (direct) common origin are usually more similar to each other than to a third, fourth etc. taxon not part of the same lineage. Exceptions include fossils close to the last common ancestors lacking advanced traits.
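
The logic above can be sketched with a simple pairwise distance: the proportion of differing states among the characters that are defined in both taxa. This is only one of several possible distance measures and not necessarily the one used for the heat map; the missing-data symbols, the toy matrix and the function names are assumptions made for illustration.

```python
# Assumption: these symbols mark missing or inapplicable cells.
MISSING = {"?", "-"}


def mean_character_difference(a: str, b: str):
    """Proportion of differing states among characters defined in both taxa.

    Returns None when the two taxa share no defined character.
    """
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x in MISSING or y in MISSING:
            continue  # skip cells that cannot be compared
        compared += 1
        differing += x != y
    return differing / compared if compared else None


def nearest_neighbor(name: str, matrix: dict[str, str]):
    """Name of the taxon with the smallest distance to `name`."""
    candidates = []
    for other, row in matrix.items():
        if other == name:
            continue
        d = mean_character_difference(matrix[name], row)
        if d is not None:
            candidates.append((d, other))
    return min(candidates)[1] if candidates else None


# Hypothetical toy matrix: two sister taxa, one more distant ingroup
# taxon, and an all-primitive outgroup.
toy = {
    "outgroup": "000000",
    "sister_1": "111100",
    "sister_2": "1111?1",
    "other":    "110011",
}

print(nearest_neighbor("sister_1", toy))
print(nearest_neighbor("sister_2", toy))
```

In this toy case the two sister taxa are each other's nearest neighbors, which is the pattern one would look for as dark blocks along the diagonal of a sorted heat map.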

The outgroup taxa as used (in this taxon sample: Allosaurus to Tyrannosaurus) are most similar to each other but not monophyletic. One (Allosaurus) represents the sister lineage of, and the other an early split within, the lineage that led to the birds (Coelurosauria: Tyrannoraptora). The extinct (monophyletic) families (Tyrannosauridae, Ornithomimidae, Dromaeosauridae) are, however, well visible, being defined by low intra-family and higher inter-family pairwise distances. The same is true for the direct relatives (Clade Ornithurae) of modern birds (class Aves).

Very typical for such datasets is the increasing distance between the (primitive?) outgroups and the most derived, modern-day taxa (living birds: Struthio – ostrich, Anas – duck, Meleagris – turkey). Closest relatives in the taxon set, phylogenetically and time-wise, are (much) more similar than distant ones. Allosaurus may be most similar to the tyrannosaurs, not because of common ancestry but because both are scored as being primitive with respect to the group of interest.

The only tree

This situation becomes very obvious from the only possible (single-optimal) tree that can be inferred from this matrix, when visualized as a phylogram (Stop using cladograms!).

The ML, MP and LS/NJ trees overlaid and scaled to equal root (first split within Tyrannoraptora) to tip (split between Anas and Meleagris) distance (phylogenetic distance, via the tree). Pink, the LS clade conflicting with ML and MP trees, and Wikipedia's tree(s).

No matter which optimisation criterion is used (here Least-Squares via Neighbor-joining, Maximum Parsimony, Maximum Likelihood), the result is the same. The only exception is that the NJ/LS tree places Archaeopteryx as sister to Dromaeosauridae; and the relative branch lengths of roots vs. tips also differ.

Because our matrix has favorable properties (few taxa, many defined characters), it's straightforward to establish branch support. This is a bit frowned upon in palaeontological circles, but having dealt with morphological evolution in cases where we have molecular data, I want to know how robust my clades are, and what may be the alternatives, before I conclude that they reflect monophyly. Bootstrapping coupled with consensus networks is a quick and simple way to test robustness and investigate ambiguous support (Connecting tree and network edges).
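
The resampling idea behind the nonparametric bootstrap can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the PAUP* or RAxML procedure used here: instead of inferring a tree per pseudoreplicate and counting splits, it only counts how often two taxa remain each other's strictly closest neighbors after the characters are resampled with replacement. The missing-data symbols, the toy matrix and all function names are hypothetical.

```python
import random

# Assumption: these symbols mark missing or inapplicable cells.
MISSING = {"?", "-"}


def distance(a, b):
    """Proportion of differing states among characters defined in both."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in MISSING and y not in MISSING]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)


def is_closest_pair(matrix, a, b):
    """True if a and b are strictly closer to each other than to any other taxon."""
    d_ab = distance(matrix[a], matrix[b])
    if d_ab is None:
        return False
    for focal in (a, b):
        for other in matrix:
            if other in (a, b):
                continue
            d = distance(matrix[focal], matrix[other])
            if d is not None and d <= d_ab:
                return False
    return True


def bootstrap_pair_support(matrix, a, b, replicates=1000, seed=1):
    """Fraction of pseudoreplicates in which (a, b) is the closest pair."""
    rng = random.Random(seed)
    n_chars = len(next(iter(matrix.values())))
    hits = 0
    for _ in range(replicates):
        # Draw character columns with replacement, same matrix width.
        cols = [rng.randrange(n_chars) for _ in range(n_chars)]
        resampled = {name: [row[c] for c in cols]
                     for name, row in matrix.items()}
        hits += is_closest_pair(resampled, a, b)
    return hits / replicates


# Hypothetical toy matrix.
toy = {
    "A": "0000011111",
    "B": "0000011101",
    "C": "1111100000",
    "D": "1111?00010",
}
support = bootstrap_pair_support(toy, "A", "B", replicates=200)
print(f"closest-pair support for A+B: {support:.2f}")
```

A pairing that rests on only a few characters will drop out of many pseudoreplicates, which is the same effect that lowers support values when crucial characters are missing from a resample.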

The BS support consensus networks for NJ/LS and ML have only a single reticulation each.

Rooted support consensus networks based on the NJ/LS (10,000 pseudoreplicates, PAUP*) and ML bootstrap (100, number of necessary replicates determined by bootstop criterion implemented in RAxML) samples. Only splits are shown that occurred in at least 15% of the BS pseudoreplicates.

The MP BS support consensus network, however, has many more reticulations.

Rooted MP-BS support consensus network (10,000 BS pseudoreplicates, PAUP*). Green — edge bundles corresponding to clades in the all-optimal tree(s); orange — less supported conflicting alternatives; red – higher supported conflicting alternatives; pink – wrong clade in NJ/LS tree.

We can make two generally relevant observations here:
  1. The wrong Archaeopteryx-Dromaeosauridae clade (pink edge/branch) masks split NJ bootstrap support (BS_NJ): 68 for the wrong clade, 31 for the right one. While resampling under ML appears to be unaffected by this conflict, MP is not.
  2. While the NJ- and ML support networks are very tree-like, all clades in the inferred tree have high to unambiguous support, and are near-congruent, the MP network is much more boxy. In some cases the split in agreement with the all-optimal tree has a lower BS support than an alternative (here usually in conflict with the gold tree).
Similar observations can be made with other data sets: although NJ/LS and ML optimisation are fundamentally different (distance- vs. character-based, equal change vs. varying probability of change), they show more agreement with each other when it comes to supporting a topology (or topological alternatives) than MP (character-based like ML, but all changes are treated as equal like NJ/LS). MP is a very conservative approach, highly dependent on possibly a few discerning characters. If they are missing from the BS pseudoreplicate, the backbone tree collapses or changes, and BS values may decrease rapidly. This is so even for a very data-dense matrix like the one used here (few taxa, many characters, low gappyness).

On the positive side, we can expect that MP will produce fewer false positives. On the negative side, it is also more dependent on character coverage, and will produce many more false negatives. Any fossil lacking the crucial characters (or showing too few of them) may still be resolved (placed and supported) under NJ/LS and ML but not using MP. When inferring trees, these fossils will quickly increase the number of MPTs and decrease branch support for the part of the tree they interact with. Personally, given how hard it can be to place a fossil per se with the data at hand, I have always preferred a method that can give some result, and point towards possible alternatives (even at the risk of including erroneous ones), rather than no result at all.

The simplest of networks

Naturally, we can use the distance matrix directly to infer a Neighbor-net, and explore the basic differentiation signal beyond trees but also with regard to the all-optimal tree.

Neighbor-net based on the pairwise distance matrix. Coloration highlights edges found (or not) in the optimised trees.

The Neighbor-net recovers the clades from the all-optimal tree (green, purple the NJ/LS-unique branch), but shows additional edges (orange). The principal signal in the data has, for instance, problems with placing Archaeopteryx, because it is (signal-wise) intermediate between the Avebrevicaudata, the lineage including modern birds, and the Dromaeosauridae, their sister lineage (note that the vertebrate fossil record is considered to be free of ancestors and precursors; all fossils represent extinct sister lineages – evolutionary dead-ends). Skeleton IGM 100042 (an Oviraptoridae), placed as sister to both in the all-optimal tree, also lacks obvious affinities: this is a taxon where the tree inference makes a decision that is not based on a trivial signal encoded in the matrix.

The central boxy part of the Neighbor-net correlates with the 2/3-dimensional part of the parsimony BS consensus network: to resolve these relationships, we need a large set of characters (under MP). On the other hand, recognizing the Ornithurae, members of an extinct family, or a relative of IGM 100042, should be straightforward even with a limited amount of defined characters. Based on the Neighbor-net, which is inferred in a blink no matter how large the matrix, we can also make a decision, as to which taxa interfere and which ones facilitate tree-inferences. The more tree-like the Neighbor-net graph becomes, the easier it is for a tree inference to be made.

Placing fossils, quickly and easily

Using this backbone graph, it is easy to assess in which phylogenetic neighborhood a newly coded fossil falls, eg. the fossil newly described in Hartman et al. and scored for 267 unambiguously defined traits, Hesperornithoides.

Neighbor-net including Hesperornithoides.

Hesperornithoides is obviously a member of the Eumaniraptora (= Paraves), morphologically somewhat intermediate between the Avialae, the "flying dinosaurs", and Dromaeosauridae, but doesn't seem to be part of either of these sister lineages. The graph lacks a prominent neighborhood; the Archaeopteryx-Bambiraptor neighborhood may reflect local long-edge attraction (note the long terminal edges) or convergent evolution in both taxa and, possibly, also the Hesperornithoides lineage. Just based on this simple and quick-to-infer network, Hartman et al.'s title "A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight" appears to be correct (in future posts, we may come back to this morphological supermatrix to see what else networks could have quickly shown).
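
A crude numerical counterpart to reading the neighborhood off the graph is to rank the backbone taxa by their distance to the new fossil, counting only the characters scored in both, and to report how many characters each comparison rests on. The sketch below assumes one symbol per character with "?" and "-" for undefined cells; the backbone, the fossil coding and the function name are hypothetical and do not reproduce the actual matrix.

```python
# Assumption: these symbols mark missing or inapplicable cells.
MISSING = {"?", "-"}


def rank_neighbors(fossil: str, backbone: dict[str, str]):
    """Return (taxon, distance, characters compared), closest first."""
    ranking = []
    for name, row in backbone.items():
        pairs = [(x, y) for x, y in zip(fossil, row)
                 if x not in MISSING and y not in MISSING]
        if not pairs:
            continue  # no shared defined characters, cannot be compared
        dist = sum(x != y for x, y in pairs) / len(pairs)
        ranking.append((name, dist, len(pairs)))
    return sorted(ranking, key=lambda item: item[1])


# Hypothetical backbone taxa and a fragmentary new fossil.
backbone = {
    "lineage_1a": "00001111",
    "lineage_1b": "00001101",
    "lineage_2a": "11110000",
    "outgroup":   "00000000",
}
new_fossil = "1?1?00??"

ranking = rank_neighbors(new_fossil, backbone)
for name, dist, n_compared in ranking:
    print(f"{name}: distance {dist:.2f} over {n_compared} characters")
```

Reporting the number of compared characters next to each distance matters for fragmentary fossils, since a small distance based on a handful of characters is much weaker evidence than the same distance based on hundreds.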

One should be willing to leave the phylogenetic beaten track, ie. relying on strict consensus parsimony trees as the sole basis for phylogenetic hypotheses. The Neighbor-net is a valuable tool for quick pre- and post-analysis because it can:
  • visualize how coherent the clades in our trees are, 
  • indicate how easy it will be for the tree inference (especially MP) to find and support clades, 
  • help to differentiate ambiguous from important taxa, and finally, 
  • help to assess whether a new fossil really requires an in-depth re-analysis of the full matrix (and dealing with >99,999 MPTs) instead of using a more focussed taxon (and character) set.

Monday, February 10, 2020

Fossils and networks 1 – Farris and Felsenstein

Over 60 years ago, Robert Sokal and Peter Sneath changed the way we quantitatively study evolution, by providing the first numerical approach to infer a phylogenetic tree. About the same time, but in German, Willi Hennig established the importance of distinguishing primitive and advanced character states, rather than treating all states as equal. This established a distinction between phenetics and phylogenetics; and the latter is the basis of all modern studies, whether it is explicitly acknowledged or not.

More than two decades later, Steve Farris and the Willi Hennig Society (WHS) established parsimony as the standard approach for evaluating character-state changes for tree inference. In this approach, morphological traits are scored and arranged in data matrices, and then the most parsimonious solution is found to explain the data. This tree, usually a collection of most-parsimonious trees (MPT), was considered to be the best approximation of the true tree. Clades in the trees were synonymized with monophyly sensu Hennig (1950, short English version published 1965), and grades with paraphyly: Cladistics was born (see also: Let's distinguish between Hennig and Cladistics).

Why parsimony? Joe Felsenstein, who was not a member of the WHS but brought us, among many other things, the nonparametric bootstrap (Felsenstein 1985), put it like this (Felsenstein 2001):
History: William of Ockham told Popper to tell Hennig to use parsimony
Soon, parsimony and cladistics came under threat by advances in computer technology and Kary Mullis' development of the polymerase-chain-reaction (PCR; Mullis & Faloona 1987) in the early 80s (note: Mullis soon went on with more fun stuff, outside science). While the data analysis took ages (literally) in the early days, more and more speedy heuristics were invented for probabilistic inferences. PCR marked the beginning of the Molecular Revolution, and genetic data became easy to access. Soon, many researchers realized that parsimony trees perform badly for this new kind of data, a notion bitterly rejected by the parsimonists, organized mainly in the WHS: the "Phylogenetic Wars" raged.

The parsimony faction lost. Today, when we analyze (up to) terabytes of molecular data, we use probabilistic methods such as maximum likelihood (ML) and Bayesian inference (BI). However, one parsimony bastion has largely remained unfazed: palaeontology.

In a series of new posts, we will try to change that; and outline what easy-to-compute networks have to offer when analyzing non-molecular data.

It's just similarity, stupid!

One collateral damage of the Phylogenetic Wars was distance-based methods, which, still today, are sometimes classified as "phenetic" as opposed to the "phylogenetic" character-based methods (parsimony, ML, BI). The first numerical phylogenetic trees were not based on character matrices but distance matrices (eg. Michener & Sokal 1957 using a cluster algorithm; Cavalli-Sforza & Edwards 1965 using parsimony; see also Felsenstein 2004, pp.123ff).

But no matter which method, optimality criterion or data-type we use, in principle we do the analysis under the same basic assumptions:
  1. the more closely related two taxa are, the more similar they should be.
  2. the more similar two taxa are, the more recent is their split.

The ingroup (blue clade) and outgroup (red clade) are most distant from each other. Placing the fossils is trivial: Z is closer to O than to C, the member of the ingroup with the fewest advanced character states. Ingroup sister taxa A + C and B + D are most similar to each other. The monophyly of the ingroup and its two subclades is perfectly reflected by the inter-taxon distances.

Assuming that the character distance reflects the phylogenetic distance (ie. the distance along the branches of the true tree), any numerical phylogenetic approach will succeed in finding the true tree. The Neighbor-Joining method (using either the Least Squares or Minimum Evolution criteria) will be the quickest calculation. The signal from such matrices is trivial: we are in the so-called "Farris Zone" (defined below).
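To make the link between character similarity and relatedness concrete, here is a minimal sketch that computes simple (mean) pairwise distances from a binary character matrix and reports each taxon's closest neighbour. The matrix values are invented for illustration and are not the matrix behind the figures in this post; the function names are likewise our own.

```python
from itertools import combinations

# Hypothetical binary character matrix (0 = ancestral, 1 = derived).
# The values are invented for illustration; they are NOT the post's matrix.
matrix = {
    "A": "1111000011",
    "C": "1100000011",
    "B": "0000111111",
    "D": "0000110011",
    "O": "0000000000",
}


def mean_distance(x, y):
    """Fraction of characters in which two taxa differ."""
    if len(x) != len(y):
        raise ValueError("taxa must be scored for the same characters")
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Pairwise distances, keyed by the unordered pair of taxon names.
distances = {
    frozenset(pair): mean_distance(matrix[pair[0]], matrix[pair[1]])
    for pair in combinations(matrix, 2)
}


def closest(taxon):
    """Return the taxon with the smallest distance to `taxon`
    (ties are resolved by the order of the matrix)."""
    others = [t for t in matrix if t != taxon]
    return min(others, key=lambda t: distances[frozenset((taxon, t))])


for taxon in matrix:
    print(taxon, "is closest to", closest(taxon))
```

In this toy matrix the closest neighbour of A is C and that of B is D, which is the kind of pattern a sorted heat map makes visible without inferring any tree.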

We wouldn't even have to infer a tree to get it right (ie. nested monophyly of A + C, B + D, A–D); we could just look at the heat map sorted by inter-taxon distance.

Just from the distance distributions, visualized in the form of a "heat-map", it is obvious that A–D are monophyletic, and fossil Z is part of the outgroup lineage. As expected for the same phylogenetic lineage (because changes accumulate over time), its fossils C and D are still relatively close, having few advanced character states, while the modern-day members A and B have diverged from each other (based on derived character states). Taxon B is most similar to D, while C is most similar to A. So, we can hypothesize that C is either a sister or precursor of A, and D is the same for B. If C and D are stem group taxa (ie. they are paraphyletic), then we would expect that both would show similar distances to A vs. B, and be closer to the outgroup. If representing an extinct sister lineage (ie. CD is monophyletic), they should be more similar to each other than to A or B. In both cases (CD either paraphyletic or monophyletic), A and B would be monophyletic, and so they should be relatively more similar to each other than to the fossils as well.

Having a black hole named after you

The Farris Zone is that part of the tree-space where the signals from the data are trivial, we have no branching artifacts, and any inference (tree or network) gives us the true tree.

Its opposite has been, unsurprisingly perhaps, labeled the "Felsenstein Zone". This is the part of the tree-space where branching artifacts are important: the inferred tree deviates from the true tree. Clades and grades (structural aspects of the tree) are no longer synonymous with monophyly and paraphyly (their evolutionary interpretation).

We can easily shift our example from the Farris into the Felsenstein Zone, by halving the distances between the fossils and the first (FCA) and last (LCA) common ancestors of ingroup and outgroup and adding some (random = convergence; lineage-restricted = homoiology) homoplasy to the long branches leading to the modern-day genera.

The difference between distance, parsimony and probabilistic methods is how we evaluate alternative tree topologies when similarity patterns become ambiguous, ie. when we approach or enter the Felsenstein Zone. Do all inferred mutations have the same probability; how clock-like is evolution; are there convergence/saturation effects; how do we deal with missing data?

For our example, any tree inference method will infer a wrong AB clade, because the fossils lack enough traits shared only with their sisters but not with the other part of the ingroup. Only the roots are supported by exclusively shared (unique) derived traits (Hennigian "synapomorphies"). The long-branch attraction (LBA) between A and B is effectively caused by:
  • 'short-branch culling' between C and D: the fossils are too similar to each other; and their modern relatives too modified;
  • the character similarity between A and B underestimates the phylogenetic distance between A and B, due to derived traits that evolved in parallel (homoiologies).
While ML and NJ make the same decision, the three maximum-parsimony trees permute options for placing C and D, except for the correct options, and including an impossible one (a hard trichotomy). Standard phylogenetic trees are (by definition) dichotomous being based on the concept of cladogenesis — one lineage splits into two lineages.

MPTs inferred using PAUP*'s branch-and-bound algorithm (no heuristics, this algorithm will find the actual most parsimonious solution); NJ/LS using PAUP* BioNJ implementation and simple (mean) pairwise distances; ML using RAxML, corrected (asc.) and not (unc.*) for ascertainment bias (the character matrix has no invariable sites). All trees are not rooted using a defined outgroup but mid-point rooted. If rooted with the known outgroup O, the fossil Z would be misinterpreted as early member of the ingroup.

We have no ingroup-outgroup LBA, because the three convergent traits shared by O and A or B, respectively, compete with a total of eight lineage-unique and conserved traits (synapomorphies) — six characters are compatible with a O-A or O-B sister-relationship (clade in a rooted tree) but eight are incompatible. We correctly infer an A–D | O + Z split (ie. A–D clade when rooted with O) simply because A and B are still more similar to C and D than to Z and O; not some method- or model-inflicted magic.

The magic of non-parametric bootstrapping

When phylogeneticists perform bootstrapping, they usually do it to try to evaluate branch support values — a clade alone is hardly sufficient to infer an inclusive common origin (Hennig's monophyly), so we add branch support to quantify its quality (Some things you probably don't know about the bootstrap). In palaeontology, however, this is not a general standard (Ockhams Razor applied but not used), for one simple reason: bootstrapping values for critical branches in the trees are usually much lower than the molecular-based (generally accepted) threshold of 70+ for "good support" (All solved a decade ago).

When we bootstrap the Felsenstein Zone matrix that gives us the "wrong" (paraphyletic) AB clade, no matter which tree-inference method we use, we can see why this standard approach undervalues the potential of bootstrapping to explore the signal in our matrices.

Consensus networks based on each 10,000 BS pseudoreplicates, only splits are shown that are at least found in 15% of the replicate trees (trivial splits collapsed). Reddish – false splits (paraphyletic clades), green – true splits (monophyletic clades).

While parsimony and NJ bootstrap pseudoreplicates either fall prey to LBA or don't provide any viable alternative (the bootstrap replicate matrix lacks critical characters), in the example a significant amount of ML pseudoreplicates did escape the A/B long-branch attraction.

Uncorrected, the correct splits A + C vs. rest and B + D vs. rest can be found in 19% of the 10,000 computed pseudoreplicate trees. When correcting for ascertainment bias, their number increases to 41%, while the support for the wrong A + B "clade" collapses to BSML = 49. Our BS supports are quite close to what Felsenstein writes in his 2004 book: For the four taxon case, ML has a 50:50 chance to escape LBA (BSML = true: 41 vs. false: 49), while MP and distance-methods will get it always wrong (BSMP = 88, BSNJ = 86).

The inferred tree may get it wrong but the (ML) bootstrap samples tell us the matrix' signal is far from clear.

Side-note: Bayesian Inference cannot escape such signal-inherent artifacts because its purpose is to find the tree that best matches all signals in the data, which, in our case, is the wrong alternative with the AB clade — supported by five characters, rejected by four each including three that are largely incompatible with the true tree. Posterior Probabilities will quickly converge to 1.0 for all branches, good and bad ones (see also Zander 2004); unless there is very little discriminating signal in the matrix — a CD clade, C sister to ABD, D sister to ABC, ie. the topological alternatives not conflicting with the wrong AB clade, will have PP << 1.0 because these topological alternatives give similar likelihoods.

Long-edge attraction

When it comes to LBA artifacts and the Felsenstein Zone, our preferred basic network-inference method, the Neighbor-Net (NNet), has its limitations, too. The NNet algorithm is in principle a 2-dimensional extension of the NJ algorithm. The latter is prone to LBA, and hence the NNet can be affected by LEA: long edge attraction. The relative dissimilarity of A and B to C and D, and (relative) similarity of A/B and C/D, respectively, will be expressed in the form of a network neighborhood.

Note, however, the absence of a C/D neighbourhood. If C is a sister of D (as seen in the NJ tree), then there should be at least a small neighbourhood. It's missing because C has a different second-best neighbour than B within the A–D neighbourhood. While the tree forces us into a sequence of dichotomies, the network visualizes the two competing differentiation patterns: general advancement on the one hand (ABO | CDZ split), and on the other potential LBA/LEA, vs. similarity due to a shared common ancestry (ABCD | OZ split; BD neighborhood).

Just from the network, we would conclude that C and D are primitive relatives of A and B, potentially precursors. The same could be inferred from the trees; but if we map the character changes onto the net (Why we may want to map trait evolution on networks), we can notice there may be more to it.

Character splits ('cliques') mapped on the NNet. Green, derived traits shared by all descendants of a common ancestor ('synapomorphies'); blue, lineage-restricted derived traits, ie. shared by some but not all descendants (homoiologies); pink, convergences between in- and outgroup; black, unique traits ('autapomorphies')

Future posts

In each of the upcoming posts in this (irregular) series, we will look at a specific problem with non-molecular data, and test to what extent exploratory data analysis can save us from misleading clades; eg. clades in morphology-informed (parts of, in case of total evidence) trees that are not monophyletic.

* The uncorrected ML tree shows branch lengths that are unrealistic (note the scale), and highly distorted. The reason for this is that the taxon set includes (very) primitive fossils and (highly) derived modern-day genera, but the matrix has no invariable sites telling the ML optimization that changing from 0 ↔ 1 is not that big of a deal. This is where the ascertainment bias correction(s) step(s) in (RAxML-NG, the replacement for classic RAxML 8 used here, has more than one implementation to correct for ascertainment bias. A tip for programmers and coders: effects of corrections have so far not been evaluated for non-molecular data).

Cited literature
Cavalli-Sforza LL, Edwards AWF. 1965. Analysis of human evolution. In: Geerts SJ, ed. Proceedings of the XI International Congress of Genetics, The Hague, The Netherlands, 1963. Genetics Today, vol. 3. Oxford: Pergamon Press, p. 923–933.
Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
Felsenstein J. 2001. The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.
Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc., 664 pp. (in chapter 10, Felsenstein provides an entertaining "digression on history and philosophy" of phylogenetics and systematics).
Hennig W. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag, 370 pp.
Hennig W. 1965. Phylogenetic systematics. Annual Review of Entomology 10:97–116.
Michener CD, Sokal RR. 1957. A quantitative approach to a problem in classification. Evolution 11:130–162.
Mullis KB, Faloona F. 1987. Specific synthesis of DNA in vitro via a polymerase catalyzed chain reaction. Methods in Enzymology 155:335–350.

Monday, February 3, 2020

A network of life expectancy and body mass index

At my advanced age, the concept of Life Expectancy (the average age at which people of my generation die) becomes of some practical importance. Perhaps more importantly, the concept of Healthy Life Expectancy rears its head, this being the average age at which one's health starts to notably deteriorate.

Both of these human attributes are related to many things, but in the modern world Obesity is one of the most important contributors to lack of health. This is frequently measured as the Body Mass Index (BMI), defined as the body mass (kilograms) divided by the square of the body height (meters). A BMI > 30 is classified as Obese, and this is definitely considered to represent long-term poor health.
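For those who like to check their own numbers, the definition above translates directly into a few lines of code. This is only a sketch of the formula as stated here; the function names are my own.

```python
def bmi(mass_kg, height_m):
    """Body Mass Index: body mass (kg) divided by the square of height (m)."""
    if mass_kg <= 0 or height_m <= 0:
        raise ValueError("mass and height must be positive")
    return mass_kg / height_m ** 2


def is_obese(mass_kg, height_m):
    """Apply the BMI > 30 threshold used in this post."""
    return bmi(mass_kg, height_m) > 30


# Example: a person of 81 kg and 1.80 m.
print(round(bmi(81, 1.80), 1), is_obese(81, 1.80))
```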

So, let's look at some data, to see how the USA currently fares with regard to these characteristics. The US Burden of Disease Collaborators recently released some up-to-date data (The state of US health, 1990-2016. Burden of diseases, injuries, and risk factors among US states. Journal of the American Medical Association 319: 1444-1472). You can consult their Table 1 if you want to consider the major recent causes of death in the USA.

However, we will focus on the positive side, instead — how long do people live? The first graph here shows the relationship between the two Life Expectancy variables for the year 2016, with each point representing one state of the USA, plus DC. The line shown on the graph represents the national average.

Life expectancy versus healthy life expectancy

As expected, there is a high correlation between the two variables, although there is a 6-year difference in Expectancy among the various states. The top states include Hawaii, California, Connecticut, Minnesota, New York, Massachusetts, Colorado, New Jersey and Washington; while the bottom states are Mississippi, West Virginia, Alabama, Louisiana, Oklahoma, Arkansas, Kentucky, Tennessee and South Carolina. The social and economic differences between those two groups should be clear to everyone, and this is well-known to relate to life-length.

The national average for Life Expectancy is 78.9 years, while the Healthy Life Expectancy is 67.7 years (ie. 11.2 years less). This probably doesn't surprise you: the last 11 years of your life are likely to be spent dealing with ill health. The points on the graph are scattered around the national-average line except at the lowest Expectancies, which implies a shorter period of unhealth at the end of life for those with a poor Life Expectancy. Notably, Mississippi has the lowest Life Expectancy but only the 5th lowest Healthy LE.

We can now turn to look at Body Mass Index (BMI) and how it relates to Healthy Life Expectancy. This is shown in the next graph, where the BMI data refer to the percentage of people who are obese (BMI > 30). Once again, each point refers to a single state. Clearly, as Obesity increases then Healthy LE decreases. The medical people have been telling us this for decades.

Body mass index versus healthy life expectancy

Note, however, the big difference in obesity levels between the states (15.5 percentage points) — there are nearly two-thirds more obese people in some states than in others. The states with the highest Obesity levels include West Virginia, Mississippi, Oklahoma, Iowa, Alabama, Louisiana and Arkansas, while the other extreme includes Colorado, DC, Hawaii, California, Montana, Utah, New York and Massachusetts.

Also, note that the relationship between the Obesity and Life Expectancy variables is not linear. Below 26% population obesity there is little change in average Life Expectancy, whereas above 30% obesity levels Life Expectancy declines rapidly. For every 1% increase in average Obesity the average LE is reduced by 0.3 years.

Two of the territories are labeled in the graph, as showing unusual patterns. The people of the District of Columbia are clearly not "fat cats", as often depicted, but their lives are apparently not all that healthy. On the other hand, the people of Iowa somehow manage to remain healthy for longer than average, even though they have one of the highest Obesity levels.

Finally, we can put all of this together in a single network, depicting the data patterns. As usual in this blog, one of the simplest ways to get a pictorial overview of the data is to use a phylogenetic network, as a form of exploratory data analysis. For this analysis, I first calculated the similarity of the states using the Manhattan distance, based on the three variables listed above. A Neighbor-net analysis was then used to display the between-territory similarities.
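The distance step can be sketched as follows. The three territories and their values are placeholders invented for illustration (they are not the published figures), and the sketch uses the raw values, since nothing is said here about whether the variables were rescaled first.

```python
from itertools import combinations

# Placeholder territories with invented values, in the order:
# (Life Expectancy in years, Healthy Life Expectancy in years, % obese).
states = {
    "state_1": (81.0, 70.0, 24.0),
    "state_2": (80.5, 69.5, 25.0),
    "state_3": (75.0, 65.0, 37.0),
}


def manhattan(u, v):
    """Manhattan distance: sum of absolute differences over all variables."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Pairwise distance table, which is the input a Neighbor-net analysis needs.
distance = {
    (p, q): manhattan(states[p], states[q])
    for p, q in combinations(states, 2)
}

for pair, d in distance.items():
    print(pair, d)
```

The resulting table of pairwise distances would then be passed to a program that computes the Neighbor-net.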

The resulting network is shown in the final graph. Territories that are closely connected in the network are similar to each other based on their two Life Expectancies and BMI levels, and those that are further apart are progressively more different from each other.

Network of life expectancy and body mass index

In this case, the network displays states with decreasing Life Expectancies from top to bottom, and decreasing Obesity from left to right. It makes visually clear that those states with the shortest Life Expectancies are almost always associated with high Obesity levels (ie. they are at the bottom-left of the network).

For longer Life Expectancies, some states have high Obesity levels (top-left of the network) while some have lower levels (top-right). Iowa is shown as quite distinct from the other states (it has a long edge of its own), since it has longer LE than would be expected for its population Obesity level.

Monday, January 27, 2020

From words to deeds?

If you want to annoy a linguist, then there are three easy ways to do so: ask them how many languages they speak; ask them for their opinion regarding the German spelling reform; or ask them whether it is true that the Eskimo language has 50 words for snow. What those three questions have in common is that they all touch upon some big issues in linguistics, which are so big that they give us a headache when being reminded of them.

For the first question, asking about a linguist's linguistic talent touches upon the conviction of quite a few linguists that in order to practice linguistics, one does not need to study many languages. One language is usually enough; and even if that language is only English, this may also be sufficient (at least according to some fanatics who practice syntax). To put it in different words: knowing only one language does not prevent a linguist from making claims about the evolution of whole language families. Knowing how to describe a language, or how to compare several languages, does not necessarily require anyone to be able to speak them. After all, mathematicians also pride themselves on not being able to calculate.

The second question, regarding the German spelling reform, marks the last time when German linguists failed royally in proving the importance of their studies to the broader public. The problem was that the German spelling reform, the first after some 100 years of linguistic peace, was mostly done without any linguistic input. Those who commented on it were, instead, novelists, poets and journalists, usually a bit older in age, who felt that the reform was proposed mainly in order to annoy them personally. At the same time, and this was maybe no coincidence, more and more institutes for comparative linguistics disappeared from German universities. The reason was again that the field had not succeeded in explaining its importance to the public. However, historical language comparison can, indeed, be important when discussing the reform of a writing system that is being used by millions of people, specifically also because the investigation of historically evolving linguistic systems is one of the specialties of historical-comparative linguistics. This was completely ignored by then.

The last question concerns the almost ancient debate about the hypothesis commonly attributed to Edward Sapir (1884-1939) and Benjamin Lee Whorf (1897-1941). This says, in its strong form (Whorf 1950), that speaking influences thinking to such an extent that we might, for example, develop a different kind of Relativity Theory in physics if we started to practice our science in languages different from English, French, and German. Given that Eskimo languages are said to have some 50 different words for snow (as people keep repeating), it should be clear enough that those speaking an Eskimo language must think completely differently from those who start to forget what snow is after all.

The latter concept leads to an interesting use of networks, which I will discuss here.

Words versus deeds

The hypothesis by Sapir and Whorf annoys many linguists (including myself), because it has long since been disproved, at least in its strong, naive form. It was disproved by linguistic data, not by arguments; and the data were the data used by Whorf in order to prove his point in the first place. However, although there is little evidence for the hypothesis in its strong form, people keep repeating it, especially in non-linguistic circles, where it is often instrumentalized.

Whether we can find evidence for a weak form of the hypothesis — which would say that we can find some influence of speaking on thinking — is another question; which is, however, difficult to answer. It may well be possible that our thoughts are channeled to some degree by the material we use in order to express them. When distinguishing color shades, for example, such as light blue and dark blue, by distinct words, such as goluboj and sin'ij in Russian or celeste and azul in Spanish, it may be that we develop different thoughts when somebody talks about blue cheese, which is called dark blue cheese in Spanish (queso azul).

But this does not mean that somebody who speaks English would never know that there is some difference between light and dark blue, just because the language does not primarily make the distinction between the two color tones. It is possible that the stricter distinction in Russian and Spanish triggers an increased attention among speakers, but we do not know how large the underlying effect is in the end, and how many people would be affected by it.

Particular languages are thus neither a template nor a mirror of human thinking — they do not necessarily channel our thoughts, and may only provide small hints as to how we perceive things around us. For example, if a language expresses different concepts, such as "arm" and "hand" with the same word, this may be a hint that "arm" and "hand" are not that different from each other, or that they belong together functionally in some sense, which is why we may perceive them as a unit. This is the case in Russian, where we find only one expression ruka for both concepts. In daily conversations, this works pretty well, and there are rarely any situations where Russian speakers would not understand each other due to ambiguities, since most of the time the context in which people speak disambiguates all they want to express well enough.

Colexification network with the central concept "MIND" and the geographical distribution of languages colexifying "MIND" and "BRAIN"

These colexifications, as we now call the phenomenon (François 2008), occur frequently in the languages of the world. This is due to the polysemy of many of the words we use, since no single word denotes one concept alone; words often denote several similar concepts at the same time. On the other hand, we encounter identical word forms in the same language which express completely different things, resulting from coincidental processes by which originally different pronunciations came to sound alike (called convergence, in biology). Those colexifications that are not coincidental but result from polysemy are the most interesting ones for linguists, not least because the words are related by network graphs not trees (as shown above). When assembled in large enough numbers, across a sufficiently large sample of languages, they may allow us some interesting insights into human cognition.

The procedure to mine these insights from cross-linguistic data has already been discussed in a previous blog, from 2018. The main idea is to collect colexifications for as many concepts and languages as possible, in order to construct a colexification network, in which each concept is represented by a node, and weighted links between the nodes represent how often each colexification between the linked concepts occurs; that is, they represent how often we find a language that expresses the two linked concepts with the same word.
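The counting behind such a network can be sketched in a few lines. The word lists below are invented placeholders, and the sketch is not the CLICS workflow; it only illustrates the idea that the weight of a link is the number of languages in which one word form expresses both concepts. It also makes no attempt to separate polysemy from accidental homophony.

```python
from collections import Counter
from itertools import combinations

# Invented placeholder data: language -> {word form: concepts it expresses}.
lexicons = {
    "lang_1": {"form_a": {"ARM", "HAND"}, "form_b": {"MIND", "BRAIN"}},
    "lang_2": {"form_c": {"ARM", "HAND"}, "form_d": {"MIND"}, "form_e": {"BRAIN"}},
    "lang_3": {"form_f": {"ARM"}, "form_g": {"HAND"}, "form_h": {"MIND", "BRAIN"}},
}


def colexification_network(lexicons):
    """Return weighted links between concepts: the weight of a link is the
    number of languages that express both concepts with the same word form."""
    weights = Counter()
    for words in lexicons.values():
        pairs_in_language = set()
        for concepts in words.values():
            # Every pair of concepts sharing one word form is a colexification.
            pairs_in_language.update(combinations(sorted(concepts), 2))
        # Each language contributes at most once to any given link.
        weights.update(pairs_in_language)
    return weights


network = colexification_network(lexicons)
for (concept_1, concept_2), weight in sorted(network.items()):
    print(concept_1, concept_2, weight)
```

The concepts are the nodes, and each entry of the resulting table is a weighted link between two of them.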

Having proposed a first update of our Database of Cross-Linguistic Colexifications (CLICS) back in 2018, we have now been able to further increase the data. With this third installment of the database, we could double the number of language varieties, from 1,200 to 2,400. In addition, we could enhance the workflows that we use to aggregate data from different sources, in a rigorously reproducible way (Rzymski et al. 2020).

Current work

Even more interesting than these data, however, is a study initiated by colleagues from psychology from the University of North Carolina, which was recently published, after more than two years of intensive collaboration (Jackson et al. 2019). In this study, the colexifications for emotion concepts, such as "love", "pity", "surprise", and "fear", were assembled and the resulting networks were statistically compared across different language families. The surprising result was that the structures of the networks differed quite considerably from each other (an effect that we could not find for color concepts derived from the same data). Some language families, for example, tend to colexify "surprise" and "fear (fright)" (see our subgraph for "surprised"), while others colexified "love" and "pity" (see the subgraph for "pity").

Not all aspects of the network structures were different. An additional analysis involving informants showed that especially the criterion of valency (that is, if something is perceived as negative or positive) played an important role for the structure of the networks; and similar effects could be found for the degree of arousal.

These results show that the way in which we express emotion concepts in our languages is, on the one hand, strongly influenced by cultural factors, while on the other hand there are some cognitive aspects that seem to be reflected similarly across all languages.

What we cannot conclude from the results, however, is that those who speak languages in which "pity" and "love" are represented by the same word will not know the difference between the two emotions. Here again, it is important to emphasize what I mentioned above with respect to color terms: if a particular distinction is not present in a given language, this does not mean that the speakers do not know the difference.

It may be tempting to dig out the old hypothesis of Sapir and Whorf in the context of the study on emotions; but the results do not, by any means, provide evidence that our thinking is directly shaped and restricted by the languages we speak. Many factors influence how we think. Language is one aspect among many others. Instead of focusing too much on the question as to which languages we speak, we may want to focus on how we speak the language in which we want to express our thoughts.


François, Alexandre (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, Martine (ed.): From polysemy to semantic change. Amsterdam: Benjamins, pp. 163-215.

Joshua Conrad Jackson, Joseph Watts, Teague R. Henry, Johann-Mattis List, Peter J. Mucha, Robert Forkel, Simon J. Greenhill and Kristen Lindquist (2019) Emotion semantics show both cultural variation and universal structure. Science 366.6472: 1517-1522. DOI: 10.1126/science.aaw8160

Rzymski, Christoph, Tiago Tresoldi, Simon Greenhill, Mei-Shin Wu, Nathanael E. Schweikhard, Maria Koptjevskaja-Tamm, Volker Gast, Timotheus A. Bodt, Abbie Hantgan, Gereon A. Kaiping, Sophie Chang, Yunfan Lai, Natalia Morozova, Heini Arjava, Nataliia Hübler, Ezequiel Koile, Steve Pepper, Mariann Proos, Briana Van Epps, Ingrid Blanco, Carolin Hundt, Sergei Monakhov, Kristina Pianykh, Sallona Ramesh, Russell D. Gray, Robert Forkel and Johann-Mattis List (2020): The Database of Cross-Linguistic Colexifications, reproducible analysis of cross- linguistic polysemies. Scientific Data 7.13: 1-12. DOI: 10.1038/s41597-019-0341-x

Benjamin Lee Whorf (1950) An American Indian Model of the Universe. International Journal of American Linguistics 16.2: 67-72.

Monday, January 20, 2020

Worldwide gender differences in amount of paid versus unpaid work

A few weeks ago, I wrote about National differences in the amount of paid and unpaid work. This involved a look at the time that people spend per day on each of various different activities, averaged across each year. The data came from the time-use surveys conducted by the Organisation for Economic Co-operation and Development (OECD) for its 30 member countries. I concluded that there are many similarities among countries that share strong cultural ties, although some countries stand out as unusual within this context.

Four main categories of time use are reported in the surveys: Paid Work or Study, Unpaid Work, Personal Care, and Leisure Time; these are described in more detail in my previous post. The aggregated results for each country are available online, including data for three non-OECD countries, for comparison (China, India, South Africa).

Of particular interest is that these data are actually aggregated separately for males and females (see Balancing paid work, unpaid work and leisure). This allows us to look at the various national time-management behaviors in the light of potential differences in gender roles within those countries.

Obviously, we expect some consistent gender differences, not least because in most cultures it is the females who have traditionally been the primary care-givers in a family, and this is one of the main unpaid work activities. We can use the OECD data to look at this in a bit more detail.

Overall gender differences

First, we can look at the overall time-management differences between the two genders.

In order to get an overview of the current differences between the 33 countries (30 OECD, 3 non-OECD), I have performed this blog's usual exploratory data analysis. The available data are multivariate, since there are five measured variables for each country: total paid work, total unpaid work, total personal care time, leisure time (each measured in average number of minutes per day), plus Other (to make a total of 1,440 minutes per day). One of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a form of exploratory data analysis. For this network analysis, I first calculated the gender differences as Male time minus Female time (for each variable separately), and then calculated the similarity of the countries using the Manhattan distance. A Neighbor-net analysis was then used to display the between-country similarities.
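The two calculation steps can be sketched as follows, using invented placeholder values rather than the OECD figures. One consequence worth noting: because the five variables add up to 1,440 minutes for each gender, the five Male-minus-Female differences for a country must sum to zero.

```python
VARIABLES = ("paid", "unpaid", "personal_care", "leisure", "other")

# Invented placeholder values in minutes per day, in the order of VARIABLES;
# every row sums to 1,440. These are NOT the OECD data.
time_use = {
    "country_1": {"male": (330, 150, 640, 300, 20), "female": (270, 250, 650, 250, 20)},
    "country_2": {"male": (400, 60, 630, 320, 30), "female": (200, 300, 640, 270, 30)},
    "country_3": {"male": (320, 160, 640, 300, 20), "female": (280, 240, 650, 250, 20)},
}


def gender_difference(country):
    """Male time minus Female time, for each variable separately."""
    male, female = time_use[country]["male"], time_use[country]["female"]
    return tuple(m - f for m, f in zip(male, female))


def manhattan(u, v):
    """Manhattan distance: sum of absolute differences over all variables."""
    return sum(abs(a - b) for a, b in zip(u, v))


differences = {country: gender_difference(country) for country in time_use}

for country, diff in differences.items():
    print(country, dict(zip(VARIABLES, diff)))

print(manhattan(differences["country_1"], differences["country_3"]))
```

Countries whose difference profiles are close in this sense end up close together in the network.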

The resulting network is shown in the first graph. Countries that are closely connected in the network are similar to each other based on their average gender difference in time management, and those countries that are further apart are progressively more different from each other.

At the bottom of the network we see those countries with the biggest gender differences, progressing up to the top with those countries with the least difference.

So, the non-European countries show the most traditional separation of gender roles, with Portugal standing out as being the only one from Europe. China is not situated with the other two Asian countries (Japan, Korea), although why it should be similar to South Africa is not clear.

Indeed, the English-speaking part of the southern hemisphere does not do well, with all three countries (South Africa, Australia, New Zealand) showing stronger gender differences than any of the other English-speaking countries (Canada, USA, UK), except for the Irish (who thus have some explaining to do).

The Scandinavian countries are at the top (Sweden, Norway, Denmark), with the smallest gender differences, which will not surprise anyone who knows these people. On the other hand, the location of France may surprise those people who have a clichéd image of the behavior of Frenchmen. France is clearly separated from the more traditional societies of the other Mediterranean countries (Spain, Greece, Italy), appearing in the network with other northern countries (Belgium, Netherlands, Germany).

Finland and Estonia have strong historical ties, and they are distinct from the other Baltic countries (Latvia and Lithuania).

Work time differences

Having thus noted that there are some strong gender differences in time-management between countries, we can now proceed to look specifically at Paid versus Unpaid work.

First, we can simply take the total amount of reported Paid + Unpaid work, and compare gender differences across the various countries. This table lists the reported differences expressed as Male time minus Female time, in average minutes per day:
New Zealand
South Africa

These time differences between males and females become very large towards the bottom of the table, where in India it amounts to 1.5 hours per day, and is >1 hour for all of the bottom 8 countries. Note that only in the first four countries (out of the 33) does the total work time for males exceed that for females. It is unclear why the reported gender difference is so large for Norwegians; but maybe some of my readers might think that this could be a useful role model for the other countries!

We can now look at the balance between paid and unpaid work for the two genders. The following graph shows the difference as Male time minus Female time (in average minutes per day) for Paid work (horizontally) and Unpaid work (vertically). The pink line indicates the balance between the two types of work (ie. a decrease in paid work is balanced by a corresponding increase in unpaid work, and vice versa).

Gender differences in amount of paid versus unpaid work

The horizontal axis makes it clear that males always do more paid work than do females, on average, in every country, and up to 4 hours more in Mexico and Turkey. The vertical axis makes it clear that females always do more unpaid work than do males, on average, in every country, and up to 5 hours more in India.

These two variables must be correlated, since most people do either the one type of work or the other. However, in most countries the gender balance is not equal, as shown in the table above (females usually do more total work than do males). Some countries come close to a balance (indicated by the pink line), including the USA.
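The balance line can be stated as simple arithmetic: a country sits on the line when its Paid difference and its Unpaid difference (both Male minus Female) cancel out. The sketch below uses made-up country labels and values, and hypothetical function names, purely to show the calculation.

```python
# Hypothetical male-minus-female differences in minutes per day: (paid, unpaid).
# The country labels and values are made up for illustration.
differences = {
    "X": (60, -60),
    "Y": (240, -200),
    "Z": (150, -280),
}


def net_work_difference(paid_diff, unpaid_diff):
    """Male-minus-female difference in total (paid + unpaid) work."""
    return paid_diff + unpaid_diff


def describe(net):
    """Say where a country falls relative to the balance line."""
    if net == 0:
        return "on the balance line"
    if net > 0:
        return "males do more total work"
    return "females do more total work"


summary = {}
for country, (paid, unpaid) in differences.items():
    net = net_work_difference(paid, unpaid)
    summary[country] = (net, describe(net))
    print(f"{country}: net {net:+d} min/day, {summary[country][1]}")
```

A negative net value corresponds to the usual case noted above, where females do more total work than males.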

Note that the country with the closest gender equality is the one with the best reputation in this regard: Sweden. For example, Swedish couples frequently share their workplace parental leave for new-born children, so that there is very little gender bias in who is the primary care-giver in a family. However, the gender bias still amounts to 5–7 minutes of work per day, even in Sweden.

At the other end of the scale, there are a number of countries that still abide by the traditional model of gender roles, of which five are labeled at the bottom of the graph. These cover quite a diversity of cultures, so that no generalizations can be made. However, the gender bias in India exceeds that in Mexico — the Indians report less total work time than do the Mexicans, but that time is organized in a more gender-biased manner. Once again, Portugal stands out among the European countries — the Portuguese work longer hours than do other Europeans, and that time is organized in a more gender-biased manner.

Other differences

Gender differences occur among the other survey variables, as well. As one simple example, we can consider the time reported as being spent Eating & Drinking. This graph shows the time (in minutes per day) spent by the males (horizontally) and the females (vertically) for each of the 33 countries.

Gender differences in amount of time spent eating and drinking

As you can see, there is not a big difference between the two genders, in any country. However, in most countries males do report spending more time feeding themselves than do the females (ie. the points are to the right of the pink line, which represents equal time).

The Mediterranean countries spend the most time eating and drinking, with Greece showing the biggest gender difference. The fast food preferred by Canadians and Americans clearly does not take much time to consume, in any given day, and females can apparently eat it just as fast as males.


The conclusion surprises no-one: all countries have clear gender differences in who does most of the unpaid work. Two Scandinavian countries stand out: Norway, because males do more total work than do females; and Sweden, where the gender difference in the balance between paid and unpaid work is smallest. Some countries still show strong gender bias, including India, Mexico, Turkey and Portugal.

Monday, January 13, 2020

Why we may want to map trait evolution on networks, pt. 2 – Topological ambiguity

In last week's Part 1, I gave an introduction to the problem of categorizing the polarity of morphological traits. How can we reconstruct which characters are primitive, or plesiomorphic according to Hennig, and which are derived, or apomorphic? This is something we need to do to reconstruct evolution, because most of the past is only preserved in the form of fossils, usually lacking any DNA. In this second part of the discussion, I'm going to take apart my own tree and show why we inevitably need networks, not trees.

There may be more than one tree

Even with more and more data at hand, some molecular phylogenies refuse to be unambiguous. Even worse, different, well-sampled molecular data sets may tell different stories, ie. there is more than one molecular tree to explain the diversity patterns. The ML tree used for the ML character mapping in Part 1 was pretty well supported, but it does not tell the entire truth.

For a start, there is no reason to assume that oaks are not monophyletic even though the data fail to resolve them as a clade (evolving something unique like the oaks twice would be a striking trick, even for gambling Mother Nature) — molecular trees may have misleading, sometimes just wrong, branches, even when they are highly supported.

In this case, one complication is that the oligogene dataset combines plastid and nuclear gene regions that not only differ in their information content but also infer different phylogenetic scenarios (and mask a lot of intra-generic and sub-generic incongruencies). This is illustrated in the following tanglegram.

Fig. 3 – A tanglegram, on the left the ML tree inferred from only the plastid gene regions (1406 DAP, alignment 15254 bp long), and on the right the corresponding nuclear data based tree (1691 DAP per only 4983 bp).

Even though the support along the backbone of the plastid tree is low to non-existent (see footnote a), it reflects well the general diversification patterns in Fagaceae plastomes (see also the tree in Manos et al. 2008, Madroño 55:181–190; and Yan et al. 2019, BMC Evol. Biol. 19: 202, for a global picture of the oaks). Plastid signatures show a strong geographic sorting (eg. New World vs. Old World), while the nuclear data provide most of the lineage-differentiating signal expressed in the combined tree (Part 1, Fig. 2).

Mapping along networks

How do we decide what is a real synapomorphy, a homoiology, or a good symplesiomorphy? Mapping the traits along all possible rooted trees is one option. Another option is to just map them along a consensus network of all trees, as shown next.

Fig. 4 – Map of the seven characters on the consensus network of the nuclear and plastid trees shown in Fig. 3. Blue – genus autapomorphies, dark green – synapomorphies/terminal homoiologies, light green – symplesiomorphies, orange – deep homoiologies, red – randomly distributed trait, pink – genus-restricted reversals.
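To make the idea of a consensus network of two conflicting trees more concrete, here is a minimal Python sketch. It assumes the usual split-based view: every internal branch of a tree defines a bipartition (split) of the taxa, the network collects the splits of all input trees, and pairs of incompatible splits are what appear as boxes. The taxa A to E and both topologies are hypothetical placeholders, not the Fagaceae trees of Fig. 3, and real software additionally applies thresholds and draws the graph.

```python
from itertools import combinations


def leaves(tree):
    """Set of leaf names below a node; trees are nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the taxon set implied by the tree."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        rest = all_taxa - clade
        if len(clade) > 1 and len(rest) > 1:
            found.add(frozenset([clade, rest]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def compatible(s1, s2):
    """Two splits fit on one tree if some pair of their sides is disjoint."""
    a, b = tuple(s1)
    c, d = tuple(s2)
    return any(not (x & y) for x in (a, b) for y in (c, d))


# Two hypothetical trees that disagree about the closest relative of A.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))

network_splits = splits(tree_1) | splits(tree_2)
conflicts = [
    (s1, s2)
    for s1, s2 in combinations(network_splits, 2)
    if not compatible(s1, s2)
]

print(len(network_splits), "splits,", len(conflicts), "conflicting pair(s)")
```

A trait can then be mapped onto each split separately, so that a character supporting AB and another supporting AC are both displayed rather than one of them being discarded.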

According to the mapping, the newly described South American Castanopsis rothwellii, assigned to the modern (Southeast Asian) genus Castanopsis, is a stem Castaneoideae / Fagaceae, while the "extinct" North American genus Castanopsoidea (then the "earliest megafossil evidence of Fagaceae": Crepet & Nixon 1989, Am. J. Bot. 76: 842–855) could be a stem / crown member of the Castanea-Castanopsis lineage. The difference from the ML trait mapping (Fig. 3 in Part 1) on the combined tree is that we get a better picture of what is a lineage-specific trait set in Castanea-Castanopsis, because the interference of the monophyletic(!) oak grade is minimized.

Another possibility is to map the characters directly along a distance-based network, and then compare the latter with the molecular-based topological alternatives. This is quite puzzling in this case, because the morphology (Fig. 1 in Part 1) matches neither the nuclear tree nor the plastid tree (Figs. 2–4) — the traits scored for the fossils cover largely morphological Play-Doh of the Fagaceae.

Fig. 5 – Neighbor-nets based on mean morphological distances. Top graph – polymorphisms treated as ambiguities (standard approach), bottom graph – polymorphisms treated as additional states (experimental approach). Text coloring as in Fig. 4, light blue – potential autapomorphy of the fossil American castaneoid lineage. Edge colors: green – edge representing a molecular clade/likely monophyletic group; orange – edge representing a paraphyletic group; red – edge rejected by molecular data; blue – edges supporting a distinct fossil American castaneoid lineage.
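The two treatments of polymorphism can be illustrated with a small Python sketch of a mean pairwise character distance. This is one plausible reading of the two approaches, not necessarily the exact calculation behind Fig. 5. The taxa and scores are hypothetical, each character is scored as a set of observed states, and `None` marks missing data.

```python
def mean_distance(taxon_1, taxon_2, polymorphism_as_state=False):
    """Proportion of comparable characters in which two taxa differ.

    polymorphism_as_state=False: polymorphisms are ambiguities, so two
    scores differ only if they share no state at all.
    polymorphism_as_state=True: a polymorphism is a state of its own, so
    two scores differ unless their state sets are identical.
    """
    compared = 0
    differing = 0
    for s1, s2 in zip(taxon_1, taxon_2):
        if s1 is None or s2 is None:  # missing data: skip this character
            continue
        compared += 1
        if polymorphism_as_state:
            differing += s1 != s2
        else:
            differing += not (s1 & s2)
    return differing / compared if compared else float("nan")


# Hypothetical scores for four characters.
fossil = [{0}, {0}, {1}, None]
genus_p = [{0, 1}, {0}, {0, 1}, {1}]  # polymorphic in characters 1 and 3
genus_q = [{1}, {0}, {0}, {1}]

print(mean_distance(fossil, genus_p))                              # ambiguity
print(mean_distance(fossil, genus_p, polymorphism_as_state=True))  # extra state
print(mean_distance(fossil, genus_q))
```

Under the ambiguity treatment, a variable genus can sit at zero distance from a fossil that shows only one of its states, whereas the additional-state treatment counts that variability as a difference.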

The likely primitive characters, irrespective of the evolutionary scenario we prefer, are those also found in the Eocene fossils (see footnote b). There are no derived traits/character suites pinning the fossils to Castanopsis. The fossils are a bit derived on their own terms (note their position in Fig. 5), and hence we can deduce that the fossils either (a) represent a relatively primitive extinct American sister lineage or (b) are surviving, somewhat evolved members of the precursors of modern-day core Fagaceae. Note that the derived oaks evolved nearly 60 myrs ago, ie. 8 myrs before the oldest (Patagonian) Castaneoideae fossil was deposited. The earliest (known) Fagaceae and castaneoid pollen are from 80+ Ma old Upper Cretaceous sediments in western North America (Grímsson et al. 2016, Acta Palaeobot. 56: 247–305; open access) and Japan (Takahashi et al. 2008, Intl. J. Plant Sci. 169:899–907), giving them plenty of time to migrate into North and then South America during the Paleocene-Eocene greenhouse episode.

Fig. 6 – Earliest fossil record of Fagaceae and Castaneoideae mapped on Scotese's Paleoglobes (© Scotese 2013, GoogleEarth layover files are available from here). Note that although there was no continuous land bridge, North and South America were already connected by a chain of large and high islands, providing a corridor for intercontinental dispersal of near- and extra-tropical plant lineages. A potential crown-group Castanopsis (C. kaulii, cupule with associated seeds and pollen) has recently been recovered from the Baltic Amber (Sadowski et al. 2018, Am. J. Bot. 105: 2025–2036).

Both of the mapping procedures described above are crude, in the sense that they ignore the molecular branch lengths, and use Ockham's Razor. But they strike me as being not a bad start. They are better than just mapping along a single preferred molecular tree (as is done in many neontological papers; see Part 1) or along a morphology-based strict consensus cladogram (as is done in far too many paleontological papers; many palaeobotanical papers do neither the one nor the other: eg. Wilf et al., 2019, Science 364: eaaw5139). It's important to realize that if one taxon or subtree of our modern taxon set is characterized solely by the lack of shared derived traits or unstable expression of derived traits (like Castanopsis here, see position in both graphs in Fig. 5), ie. represents living fossils or little-evolved lineages, any ancient and primitive fossil, stem group, sister group or precursor, will be attracted by them in a total evidence or any other tree-based approach, especially when we rely on change-probability-naive parsimony as the inference criterion. As we have pointed out repeatedly: forming a clade in a tree is neither a necessary nor a sufficient criterion for monophyly.

All gone, what to do when we have no molecular data?

Morphology alone, like genes on their own, will inevitably get some things wrong (compare Fig. 4 with Fig. 5). Without molecular data, one may have little reason to reject the monophyly of the Castaneoideae (when using more than the seven characters scored by Wilf et al. 2019; see eg. the cladogram in Crepet & Nixon 1989, fig. 1 based on an undocumented 25-character matrix). In the process, we would misinterpret overall similarity, due to shared primitive character suites and the lack of shared derived traits, as evidence for an inclusive common origin (see footnote c).

What can we do if we have no or very few extant taxa, when we only have one set of data prone to circular reasoning? Then using networks is inevitable as well (see Fig. 5; and some examples provided in the reading list below). We need to explore in depth the signal in our data matrix. Only extremely biased morphological matrices provide clear tree-like signals; comprehensive ones will have internal conflict and allow for inferring many, partly very different but more or less equally optimal trees.

Exploratory data analysis will not eliminate all possible errors — based only on the graph in Fig. 5, we would get the inter-generic phylogenetic relationships in Fagaceae partly wrong. However, this may lead to an informed decision as to which of the many equally probable evolutionary scenarios make more sense than others. It will help to reduce the alternatives, without eliminating those that are equally valid (which every tree does). If the time-coverage is good, exploring morphological differentiation over time can be an asset, too (see eg. Stacking neighbor-nets – a real-world example).


The matrices used, networks etc. can be accessed via figshare.

Selection of related posts on The Genealogical World of Phylogenetic Networks

Clades, Cladograms, Cladistics, and why networks are inevitable – illustrates why paleontologists should also be less tree-naive (see example in footnote c).
Has homoiology been neglected in phylogenetics? – why we should try to assess the phylogenetic quality of our traits.
Let's distinguish between Hennig and Cladistics – as said in the title, the post provides reasons why we should distinguish between Hennig's concepts and clades in phylogenetic trees.
Ockham's Razor applied, but not used: can we do DNA-scaffolding with seven characters? – the original post dealing with Wilf et al.'s (2019) "phylogenetic analysis", which obviously was not scrutinized during review.
Please stop using cladograms! – no matter whether you think evolution is tree-like or not, cladograms should be a matter of the past.
Should we try to infer trees on tree-unlikely matrices? – using well-known (among paleobotanists) examples, I show why networks reveal much more than any tree when we deal with fossils.
More non-treelike data forced into trees: a glimpse into the dinosaurs – the same but for a thunder lizard matrix.
Trivial data, but not so trivial graphs – an inference experiment using very simple artificial binary matrices.

a The main reason for the lack of branch support is that individuals of different genera growing in the same area can share plastid haplotypes, while individuals of the same genus / infra-generic lineage, even species, can be quite different. [Note that the standard 4x4 ML nucleotide model treats polymorphisms as such, not as missing data.] Plus, the different lineages show different levels of plastid diversity (highest in Quercus subgenus Cerris, but low in subgenus Quercus, the North American castaneoids and Lithocarpus outside Borneo; Castanea-Castanopsis appear to be in-between the extremes), and there is a tendency to preferentially mutate sequence patterns within a lineage that otherwise differentiate between lineages (for instance, inversions that distinguish two genera can be found as intra-lineage variation in the third genus or one of the oak sections).

b The striking similarity between the newly found South American and long-known slightly older North American fossils is likely the reason for not discussing the latter in the original paper or including them in the "DNA-scaffold" analysis. As is obvious from the graphs, the slightly younger North American fossil could easily be a slightly more derived member of the same lineage than the South American fossil (Planchard et al. 2016 Paleont. Electr. 19.3.51A give a revised age of ≥ 49 Ma for the plant-bearing strata), and thus would have been at odds with the narrative of the authors (see also comment by Denk et al. 2019, Science 10.1126/science.aaz2189).

c As done by Wilf et al. (see also the argumentation in Wilf et al.'s response, Science 10.1126/science.aaz2297, to Denk et al.'s 2019 comment). The combination of circular reasoning, systematic bias, and (parsimony) tree-naivity is well expressed in Wilf et al.'s own words:
Fourth, Denk et al. erroneously contend that Castanopsis rothwellii, a fossil with so many diagnostic characters preserved that it could only be assigned to Castanopsis if “found alive” today (1), has plesiomorphic features and cannot be placed confidently in the extant genus [see Figs. 1–5 in this two-part post]. ... Denk et al.’s phylogenetic conclusions from their emended tree and matrix are misleading, in that any morphological matrix includes characters that are relevant only for the taxa included in the analysis. ... Because the fossils are castaneoid in all features, we did not include all Fagaceae in our original analysis (1) and likewise did not include all characters relevant to non-castaneoid fagaceous taxa. ... By adding just three relevant characters to the Denk et al. scaffold to accommodate the genera they added (Table 1), the fossil Castanopsis rothwellii is placed only with Castanopsis in the single [ie. the strict consensus of two equally parsimonious trees] most parsimonious tree (Fig. 1).
One of the three added traits ("expanded stigma") is exclusively shared by all five Castaneoideae genera, the second ("nut generally rounded in cross section") shared by all but one Castaneoideae and Quercus, and thus are symplesiomorphies of core Fagaceae: shared primitive traits that can be expected in a precursor of several or all modern genera or their less evolved extinct sister lineages. Or positively selected homoiologies, ie. evolved multiple times within the core Fagaceae. The third ("asymmetrical cupule") is an unstable convergence / deep parallelism and a trait of little phylogenetic value, since expressed as intra-generic (intraspecific?) variation in two distantly related genera: the monotypic Formanodendron, a trigonobalanoid, and Castanopsis. These are two genera that share only a very distant (and exclusive fide Hennig) common origin (see Part 1) but inhabit overlapping climate envelopes and ecological niches in modern-day East Asia.

Despite adding three hand-picked characters (from a set of at least 25 at hand, Crepet & Nixon 1989) and accepting a phylogeny closer to the reality, the Castanopsis "clade" in the new "scaffold tree" including the Patagonian fossil remains unsupported by any exclusive or even shared and stable derived trait/set of traits (as in the original study, Wilf et al. refrain from establishing any sort of node or branch support, or test of alternative placements).

Moreover, it is safe to assume that when one adds the extinct genus Castanopsoidea to the scaffold (Wilf et al. deliberately chose not to do so), it would compete with Castanopsis rothwellii for the placement next to the modern-day Castanopsis. According to Crepet & Nixon 1989, fig. 1, one possible placement of Castanopsoidea is a sister to "Castanopsis (1)". This is not necessarily because they share a direct common origin but because these fossils also lack uniquely derived characters or a clearly derived character suite defining all Fagaceae genera except for Castanopsis (which in Crepet & Nixon's morpho-tree, is paraphyletic to Lithocarpus, which, back then, included the potential oak sister genus Notholithocarpus — literally: the 'false Lithocarpus'). Personally, for the same reasons as outlined and applied in Bomfleur et al. 2017, PeerJ 5: e3433 (and like Denk et al. 2019), I would have no problem calling all these fossils Castanopsis by defining the genus as explicitly paraphyletic, which could include the modern-day species of Castanopsis (which are probably monophyletic) and Castanopsis-like fossils that may be more or less related to them and/or other core Fagaceae: the precursors and extinct but similar, underived sister lineages.