Monday, June 24, 2019

Simulation of lexical change (Open problems in computational diversity linguistics 5)

The fifth problem in my list of open problems in computational diversity linguistics is devoted to the problem of simulating lexical change. In a broad sense, lexical change refers to the way in which the lexicon of a human language evolves over time. In a narrower sense, we would reduce it to the major processes that constitute the changes that affect the words of human languages.

Following Gévaudan (2007: 15-17), we can distinguish three different dimensions along which words can change, namely:
  • the semantic dimension: a given word can change its meaning,
  • the morphological dimension: new words are formed from old words by combining existing words or deriving new words with the help of affixes, and
  • the stratic dimension: languages may acquire words from their neighbors and thus contain strata of contact.
If we take these three dimensions as the basis of any linguistically meaningful system that simulates lexical change (and I would strongly argue that we should), the task of simulating lexical change can thus be worded as follows:
Create a model of lexical change that simulates how the lexicon of a given language changes over time. This model may be simplifying, but it should account for change along the major dimensions of lexical change, including morphological change, semantic change, and lexical borrowing.
Note that the focus on three dimensions along which a word can change deliberately excludes sound change (which I will treat as a separate problem in an upcoming blogpost). Excluding sound change is justified by the fact that, in the majority of cases, the process proceeds independently from semantic change, morphological change, and borrowing, while the latter three processes often interact.

There are, of course, cases where sound change may trigger the other three processes, for example where sound change leads to homophonous words that express contrary meanings, a conflict which is usually resolved by using another word form for one of the concepts. An example of this process can be found in Chinese, where shǒu (in modern pronunciation) came to mean both "head" and "hand" (spelled as 首 and 手). Nowadays, shǒu in the sense of "head" remains only in expressions like shǒudū 首都 "capital", while tóu 头 is the regular word for "head".

Since the number of cases where we have sufficient evidence to infer that sound change triggered other changes is rather small, we will do better to ignore sound change when trying to design initial models of lexical change. Later models could, of course, combine sound change with lexical change in an overarching framework, but given how complex the modeling of lexical change already is with the three dimensions alone, it seems useful to put sound change aside for the moment and treat it as a separate problem.

Why simulating lexical change is hard

For historical linguists, it is obvious why it is hard to simulate lexical change in a computational model. The reason is that all three major processes of lexical change, semantic change, morphological change, and lexical borrowing, are already hard to model and understand themselves.

Morphological change is not only difficult to understand as a process, it is even difficult to infer; and it is for this reason that we find morphological segmentation as the first example in my list of open problems. The same holds for lexical borrowing, which I discussed as the second example in my list of open problems. The problem of common pathways of semantic change will be discussed in a later post, devoted to the general typology of semantic change processes.

If each of the individual processes that constitute lexical change is itself either hard to model or to infer, it is no wonder that the simulation is also hard.

Traditional insights into the process of lexical change

Important work on lexical change goes back at least to the 1950s, when Morris Swadesh (1909-1967) proposed his theory of lexicostatistics and glottochronology (Swadesh 1952, 1955, Lees 1953). What was important in this context was not the idea that one could compute the divergence time of languages, but the data model which Swadesh introduced. This data model is represented by a word-list in which a particular list of concepts is translated into a particular range of languages. While former work on semantic change had been mostly semasiological (ie. form-based, taking the word as the basic unit and asking how it would change its meaning over time), the new model used concepts as a comparandum, investigating how word forms replaced each other in expressing specific concepts over time. This onomasiological or concept-based perspective has the great advantage of drastically facilitating the sampling of language data from different languages.

When comparing only specific word forms for cognacy, it is difficult to learn something about the dynamics of lexical change through time, since it is never clear how to sample those words that one wants to investigate more closely in a given study. With Swadesh's data model, the sampling process is reduced to the selection of concepts, regardless of whether one knows how many concepts one can find in a given sample of languages. Swadesh was by no means the first to propose this perspective, but he was the one who promulgated it.

Swadesh's data model does not directly measure lexical change, but instead measures the results of lexical change, given that its results surface in the distribution of cognate sets across lexicostatistical word-lists. While historical linguists mostly focused on sound change processes before, often ignoring morphological and semantic change, the lexicostatistical data model moved semantic change, lexical borrowing, and (to a lesser degree also) morphological change into the spotlight of linguistic endeavors. As an example, consider the following quote from Lees (1953), discussing the investigation of change in vocabulary under the label of morpheme decay:
The reasons for morpheme decay, ie. for change in vocabulary, have been classified by many authors; they include such processes as word tabu, phonemic confusion of etymologically distinct items close in meaning, change in material culture with loss of obsolete terms, rise of witty terms or slang, adoption of prestige forms from a superstratum language, and various gradual semantic shifts such as specialization, generalization, and pejoration. [Lees 1953: 114]
In addition to lexicostatistics and the discussions that arose especially from it (including those that criticized the method harshly), I consider the aforementioned model of three dimensions of lexical change by Gévaudan (2007) to be very useful in this context, since it constitutes one of the few attempts to approach the question of lexical change in a formal (or formalizable) way.

Computational approaches

Among the most frequently used models in the historical linguistics literature are those in which lexical change is modeled as a process of cognate gain and cognate loss. Modeling lexical change as a process of word gain and word loss, or root gain and root loss, is in fact straightforward. We well know that languages may cease to use certain words during their evolution, either because the things the words denote no longer exist (think of the word walkman and then try to project the future of the word ipad), or because a specific word form is no longer being used to denote a concept and therefore drops out of the language at some point (think of thorp which meant something like "village", as a comparison with German Dorf "village" shows, but now exists only as a suffix in place names).

Since the gain-loss (or birth-death) model finds a direct counterpart in evolutionary biology, where genome evolution is often modeled as a process involving gain and loss of gene families (Cohen et al. 2008), it is also very easy to apply it to linguistics. The major work on the stochastic description of different gain-loss models has already been done, and we can find very stable software that helps us employ gain-loss models to reconstruct phylogenetic trees (Ronquist and Huelsenbeck 2003).

It is therefore not surprising that gain-loss models are very popular in computational approaches to historical linguistics. Starting from pioneering work by Gray and Jordan (2000) and Gray and Atkinson (2003), they have now been used on many language families, including Austronesian (Gray et al. 2007), Australian languages (Bowern and Atkinson 2012), and most recently also Sino-Tibetan (Sagart et al. 2019). Although scholars (including myself) have expressed skepticism about their usefulness (List 2016), the gain-loss model can be seen as reflecting the quasi-standard of phylogenetic reconstruction in contemporary quantitative historical linguistics.

Despite their popularity for phylogenetic reconstructions, gain-loss models have been used only sporadically in simulation studies. The only attempts that I know of so far are one study by Greenhill et al. (2009), where the authors used the TraitLab software (Nicholls et al. 2013) to simulate language change along with horizontal transfer events, and a study by Murawaki (2015), in which (if I understand the study correctly) a gain-loss model is used to model language contact.
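To make the gain-loss idea concrete, here is a minimal simulation sketch in Python. It is not the procedure used in any of the studies cited above: the function name `evolve_gain_loss`, the per-step probabilities, and the restriction to at most one gain per time step are all assumptions made purely for illustration. A language is treated as a bag of cognate-set IDs, and two daughter languages evolve independently from the same ancestor.

```python
import random

def evolve_gain_loss(cognates, steps, gain_prob, loss_prob, rng, next_id):
    """Evolve a bag of cognate-set IDs for a number of time steps.

    Hypothetical toy model: at each step every cognate set that is present
    is lost with probability loss_prob, and one brand-new cognate set is
    gained with probability gain_prob. Returns the evolved set and the
    next unused ID.
    """
    current = set(cognates)
    for _ in range(steps):
        # Loss: each cognate set drops out independently of the others.
        current = {c for c in current if rng.random() >= loss_prob}
        # Gain: at most one new cognate set per step (a simplification).
        if rng.random() < gain_prob:
            current.add(next_id)
            next_id += 1
    return current, next_id

rng = random.Random(42)
ancestor = set(range(100))  # 100 cognate sets in the ancestral language

# Two daughter languages evolve independently from the same ancestor.
lang_a, next_id = evolve_gain_loss(ancestor, 50, 0.3, 0.01, rng, 100)
lang_b, next_id = evolve_gain_loss(ancestor, 50, 0.3, 0.01, rng, next_id)

shared = lang_a & lang_b  # cognate sets retained in both daughters
print(len(lang_a), len(lang_b), len(shared))
```

Because newly gained cognate sets receive fresh IDs, anything shared between the two daughters must have been inherited from the ancestor. Note that meaning plays no role anywhere in this sketch, which is exactly the limitation of the gain-loss model discussed further below.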

Another approach is reflected in the more "classical" work on lexicostatistics, where lexical change is modeled as a process of lexical replacement within previously selected concept slots. I will call this model the concept-slot model. In this model (and potential variants of it), a language is not a bag of words whose contents change over time, but is more like a chest of drawers, in which each drawer represents a specific concept and the content of a drawer represents the words that can be used to express that given concept. In such a model, lexical change proceeds as a replacement process: a word within a given concept drawer is replaced by another word.

This model represents the classical way in which Morris Swadesh used to view the evolution of a given language. It is still present in the work of scholars working in the original framework of lexicostatistics (Starostin 2000), but it is used almost exclusively within distance-based frameworks, since a character-based account of the model would require a potentially large number of character states, which usually exceeds the number of character states allowed in the classical software packages for phylogenetic reconstruction.

Similar to the gain-loss model, there have not been many attempts to test the characteristics of this model in simulation studies. The only one known to me is a posthumously published letter from Sergei Starostin (1953-2005) to Murray Gell-Mann (Starostin 2007), in which he describes an attempt to account for his theory that a word's replacement rate increases with the word's age (Starostin 2000) in a computer simulation.
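The chest-of-drawers picture can be sketched in a few lines of Python. This is an illustrative toy and not a reconstruction of Starostin's simulation: the function `replace_words`, the constant replacement rate, the restriction to a single word per drawer, and the five concepts are all assumptions.

```python
import random

def replace_words(wordlist, steps, rate, rng, counter):
    """Concept-slot toy model with exactly one word per concept drawer.

    At each step the word in every drawer is replaced by a fresh,
    previously unused word with probability `rate` (assumed constant).
    Returns the evolved word-list and the updated word counter.
    """
    current = dict(wordlist)
    for _ in range(steps):
        for concept in current:
            if rng.random() < rate:
                counter += 1
                current[concept] = f"w{counter}"  # new word fills the drawer
    return current, counter

concepts = ["hand", "head", "village", "water", "stone"]
proto = {c: f"w{i}" for i, c in enumerate(concepts)}

rng = random.Random(1)
lang_a, n = replace_words(proto, 20, 0.02, rng, len(concepts))
lang_b, n = replace_words(proto, 20, 0.02, rng, n)

# Lexicostatistical comparison: in how many drawers do both languages
# still have the same word?
retained = sum(lang_a[c] == lang_b[c] for c in concepts)
print(retained, "of", len(concepts), "concepts share a word")
```

The set of drawers never changes, only their contents. Letting `rate` depend on how long a word has occupied its drawer would be one way to experiment with an age-dependent replacement rate, but that extension is not shown here.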

Problems with current models of lexical change

Neither the gain-loss model nor the concept-slot model seems to be misleading when it comes to describing the process of lexical change. However, they both obviously ignore specific and crucial aspects of lexical change that (according to the task stated above) any ambitious simulation of lexical change should try to account for. The gain-loss model, for example, deliberately ignores semantic change and morphological change. It can account for borrowings, which can be easily included in a simulation by allowing contemporary languages to exchange words with each other, but it cannot tell us (since it ignores the meaning of word forms) how the meaning of words changes over time, or how word forms change their shape due to morphological change.

The concept-slot model can, in theory, account for semantic change, but only as far as the concept-slots allow: the number of concepts in this model is fixed and one usually does not assume that it would change. Furthermore, while borrowing can be included in this model, the model does not handle morphological change processes.

In phylogenetic approaches, both models also have clear disadvantages. The main problem of the gain-loss model is the sampling procedure. Since one cannot sample all words of a language, scholars usually derive the cognate sets they use to reconstruct phylogenies from cognate-coded lexicostatistical word-lists. As I have tried to show earlier, in List (2016), this sampling procedure can lead to problems when homology is defined in a loose way. The problem of the concept-slot model is that it cannot be easily applied in phylogenetic inference based on likelihood models (like Maximum likelihood or Bayesian inference), since the only straightforward way to handle them would be multi-state models, which are generally difficult to handle.

Initial ideas for improvement

For the moment, I have no direct idea of how to model morphological change, and more research will be needed before we will be able to handle this in models of lexical change. The failure of the gain-loss and the concept-slot models to account for semantic change, however, can be overcome by turning to bipartite graph models of lexical change (see Newman 2010: 32f for details on bipartite graphs). In such a model, the lexicon of a human language is represented by a bipartite graph consisting of concepts as one type of node and word forms (or forms) as another type of node. The association strength of a given word node and a given concept node (or its "reference potential", see List 2014: 21f), ie. the likelihood of a word being used by a speaker to denote a given concept, can be modeled with the help of weighted edges. This model naturally accounts for synonymy (if a meaning can be expressed by multiple words) and polysemy (if a word can express multiple meanings). Lexical change in such a model would consist of the re-arrangement of the weights in the network. Word gain and word loss would occur if a new word node is introduced into the network or an existing node gets dissociated from all of the concepts.

Sankoff's (1969) bipartite model of the lexicon of human languages

We can find this idea of bipartite modeling of a language's lexicon in the early linguistic work of Sankoff (1969: 28-53), as reflected in the figure above, taken from his dissertation (Figure 5, p. 36). Similarly, Smith (2004) used bipartite form-concept networks (which he describes as a matrix) in order to test the mechanisms by which these vocabularies are transmitted from the perspective of different theories on cultural evolution.
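A weighted bipartite lexicon of this kind is easy to prototype. The following Python sketch is only one possible encoding and is not taken from Sankoff (1969) or Smith (2004): the class name `Lexicon`, its methods, the abstract labels (`form_a`, `HEAD`, and so on), and all weights are invented for illustration.

```python
class Lexicon:
    """Toy weighted bipartite graph between word forms and concepts."""

    def __init__(self):
        self.edges = {}  # word form -> {concept: weight}

    def set_weight(self, word, concept, weight):
        """Set the association strength of a word-concept pair.

        A weight of zero removes the edge; a word form left without any
        edge drops out of the lexicon (word loss).
        """
        links = self.edges.setdefault(word, {})
        if weight > 0:
            links[concept] = weight
        else:
            links.pop(concept, None)
        if not links:
            del self.edges[word]

    def words_for(self, concept):
        """All word forms linked to a concept (synonymy if more than one)."""
        return {w: links[concept] for w, links in self.edges.items()
                if concept in links}

    def concepts_for(self, word):
        """All concepts linked to a word form (polysemy if more than one)."""
        return dict(self.edges.get(word, {}))

lex = Lexicon()
lex.set_weight("form_a", "HAND", 1.0)
lex.set_weight("form_a", "HEAD", 0.5)     # form_a is polysemous
lex.set_weight("form_b", "HEAD", 0.5)     # HEAD has two synonyms
lex.set_weight("form_c", "VILLAGE", 1.0)
head_before = lex.words_for("HEAD")

# Lexical change as a re-arrangement of weights:
lex.set_weight("form_a", "HEAD", 0.0)     # semantic change: form_a loses HEAD
lex.set_weight("form_b", "HEAD", 1.0)     # form_b becomes the regular word
lex.set_weight("form_c", "VILLAGE", 0.0)  # word loss: form_c has no edges left
lex.set_weight("form_d", "VILLAGE", 1.0)  # word gain: a new node appears

print(head_before, lex.words_for("HEAD"), sorted(lex.edges))
```

In this encoding, semantic change, word gain, and word loss are all expressed through the same operation, namely changing edge weights, which is what makes the bipartite view attractive as a starting point for simulation.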

As I have never actively tried to review the large amount of literature devoted to simulation studies in historical linguistics, biology, and cultural evolution, it is quite possible that this blogpost lacks reference to important studies devoted to the problem. Despite this possibility, we can clearly say that we are lacking simulation studies in historical linguistics. I am furthermore convinced that the problem of handling lexical change in simulation studies is a difficult one, and that we may well have to wait to acquire more knowledge of the key processes involving lexical change in order to address it sufficiently in the future.

While I understand the popularity of gain-loss models in recent work on phylogenetic reconstruction in historical linguistics, I hope that it might be possible to develop more realistic models in the future. It is quite possible that such studies will confirm the superiority of gain-loss models over alternative approaches. But instead of assuming this in an axiomatic way, as we seem to be doing for the time being, I would rather see some proof for this in simulation studies, or in studies where the data fed to the gain-loss algorithms is sampled differently.


References

Bowern, Claire and Atkinson, Quentin D. (2012) Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88: 817-845.

Cohen, Ofir and Rubinstein, Nimrod D. and Stern, Adi and Gophna, Uri and Pupko, Tal (2008) A likelihood framework to analyse phyletic patterns. Philosophical Transactions of the Royal Society B 363: 3903-3911.

Gévaudan, Paul (2007) Typologie des lexikalischen Wandels. Bedeutungswandel, Wortbildung und Entlehnung am Beispiel der romanischen Sprachen. Tübingen: Stauffenburg.

Gray, Russell D. and Jordan, Fiona M. (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405: 1052-1055.

Gray, Russell D. and Atkinson, Quentin D. (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray, Russell D. and Greenhill, Simon J. and Ross, Malcolm D. (2007) The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2: 360-375.

Greenhill, S. J. and Currie, T. E. and Gray, R. D. (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London, Series B 276: 2299-2306.

Lees, Robert B. (1953) The basis of glottochronology. Language 29: 113-127.

List, Johann-Mattis (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.

Murawaki, Yugo (2015) Spatial structure of evolutionary models of dialects in contact. PLoS One 10: e0134335.

Newman, M. E. J. (2010) Networks: An Introduction. Oxford: Oxford University Press.

Nicholls, Geoff K and Ryder, Robin J and Welch, David (2013) TraitLab: A MatLab package for fitting and simulating binary tree-like data.

Ronquist, Frederik and Huelsenbeck, J. P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572–1574.

Sagart, Laurent, Jacques, Guillaume, Lai, Yunfan, Ryder, Robin, Thouzeau, Valentin, Greenhill, Simon J., List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America 116: 10317–10322. DOI: 10.1073/pnas.1817972116

Sankoff, David (1969) Historical Linguistics as Stochastic Process. McGill University: Montreal.

Smith, Kenny (2004) The evolution of vocabulary. Journal of Theoretical Biology 228: 127-142.

Starostin, Sergej Anatolévič (2000) Comparative-historical linguistics and lexicostatistics. In: Renfrew, Colin, McMahon, April, Trask, Larry (eds.): Time Depth in Historical Linguistics: 1. Cambridge: McDonald Institute for Archaeological Research, pp. 223-265.

Starostin, Sergej A. (2007) Computer-based simulation of the glottochronological process (Letter to M. Gell-Mann). In: S. A. Starostin: Trudy po yazykoznaniyu [S. A. Starostin: Works in Linguistics]. LRC Publishing House, pp. 854-861.

Swadesh, Morris (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96: 452-463.

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

Monday, June 17, 2019

Ockham's Razor applied, but not used: can we do DNA-scaffolding with seven characters?

One of the most interesting research areas in organismal science is the cross-road between palaeontology and neontology, which puts together a picture marrying the fossil record with molecular-based phylogenies. Unfortunately, when it comes to plant (palaeo-)phylogenetics, some people adhere to outdated analysis frameworks (sometimes with little data).

How to place a fossil?

The fossil record is crucial for neontology as it can provide age constraints (minimum ages when doing node dating) and inform us about the past distribution of a lineage. This, especially in the case of plants that can't run away from unfortunate habitat changes, can be much different than today.

The main question in this context is whether a fossil represents the stem, ie. a precursor or extinct ancient sister lineage, or the crown group, ie. a modern-day taxon (primarily modern-day genus). For instance, the oldest crown fossil gives the best-possible minimum age for the stem (root) age of a modern lineage, whereas a stem fossil can give (at best) only a rough estimate for the crown age of the next-larger taxon/clade when doing the common node dating of molecular trees (note that fossilized birth-death dating can make use of both).

There are two commonly accepted criteria to identify a crown-group fossil:
  1. Apomorphy-based argues that if a fossil shows a uniquely derived character (ie. an aut- or synapomorphy sensu Hennig) or character suite diagnostic for a modern-day genus, it represents a crown-group fossil.
  2. Phylogeny-based aims to place the fossil in a phylogenetic framework; the position of the fossil in the genus- or species-level tree (most commonly done) or network (rarely done but producing much less biased or flawed results) then informs what it is.
(We will focus on members of modern-day genera, since it becomes trickier for higher-level taxa, see eg. my posts thinking about What is an angiosperm? [part1][part2][why I pondered about it].)

There are three basic options to place a fossil using a phylogenetic tree.
  1. Putting up a morphological matrix, then inferring the tree. A classic but due to the nature of most morphological data sets leading to a partly wrong tree as we demonstrated in some posts here on the Genealogical World of Phylogenetic Networks (hence, such analysis should always be done in a network-based exploratory data analysis framework).
  2. Putting up a mixed molecular-morphological matrix, then inferring a "total evidence" tree. This includes sophisticated approaches that use the molecular data to implement weights on the morphological traits and/or consider the age of the fossils (so-called total evidence dating approaches). This works reasonably well with animal data, provided the matrix includes a lot of morphological traits reflecting aspects of the (molecular-based) phylogeny. It doesn't work too well for plants because we usually have far fewer scorable traits, most of which evolved convergently or in parallel. Non-trivial plant fossils love to act as rogues during phylogenetic inference.
  3. Optimise the position of a fossil in a molecular-based tree, eg. using so-called "DNA scaffold approach" (usually using parsimony as optimality criterion) or the evolutionary placement algorithm implemented in RAxML (using maximum likelihood). A special form of this approach is to first map the traits on a (dated) molecular tree, and then find the position where a fossil would fit best.

Why (standard) phylogenetic tree-based approaches are tricky

Below is a simple example, including three fossils of different age (and often, place) with different character suites.

Even though none of the derived traits (blue and red "1") is a synapomorphy (fide Hennig), we can assign the youngest fossil X to the lineage of genus 1A just based on its unique derived ('apomorphic') character suite. It's likely a crown-group fossil of Clade 1, and may inform a minimum age for the most-recent common ancestor (MRCA) of the two modern-day genera of Clade 1.
Apomorphy-wise, fossils Y and Z cannot be unambiguously placed. The red trait appears to be independently obtained in both clades, and the blue trait may have been
To discern between the options, we'd be well-advised to do character mapping in a probabilistic framework, which requires a tree with independently defined branch-lengths.

Just by using parsimony-based DNA-scaffolding, fossil X would be confirmed as crown-group fossil and member of genus 1A (being identical and different from all others) and fossil Z would end up as a stem-group fossil. Fossil Y, however, would be placed as sister to genus 2C (again, identical to each other and different from all others). Using Y in node dating would then lead to a much too old divergence age for the crown-group age of Clade 2. In reality, what researchers do with such a seemingly too old fossil is not to use it by the book, as MRCA of genera 2B and 2C, but to inform the MRCA of eg. genera 2A, 2B, and 2C, assuming that the fossil's age and trait set indicate that the 2C morphology is primitive within the clade, or that Y is an extinct sister lineage and the shared derived trait a convergence (parallelism).
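For readers who want to see the mechanics, here is a minimal Python sketch of parsimony-based scaffolding: the backbone tree is kept fixed, the fossil is attached to every branch in turn, and each placement is scored with Fitch parsimony. The taxon names A to E, the backbone topology, and the 0/1 scores are invented for illustration; they are not the matrix behind the example above, and the helper names are hypothetical.

```python
def fitch(tree, states):
    """Fitch parsimony for one character on a rooted binary tree.

    `tree` is a nested tuple of taxon names; returns the state set at
    the root of `tree` and the minimum number of changes below it.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def placements(tree, fossil):
    """Yield every tree obtained by attaching `fossil` to one branch."""
    yield (tree, fossil)  # fossil as sister to this whole subtree
    if not isinstance(tree, str):
        left, right = tree
        for new_left in placements(left, fossil):
            yield (new_left, right)
        for new_right in placements(right, fossil):
            yield (left, new_right)

def total_changes(tree, matrix):
    """Sum the Fitch scores over all characters in the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch(tree, {taxon: row[i] for taxon, row in matrix.items()})[1]
        for i in range(n_chars)
    )

# Invented backbone (standing in for a molecular tree) and invented scores.
backbone = (("A", "B"), ("C", ("D", "E")))
matrix = {
    "A": "1100", "B": "0000", "C": "0000",
    "D": "0000", "E": "0110", "fossil": "1100",
}

scored = [(total_changes(t, matrix), t) for t in placements(backbone, "fossil")]
best_score, best_tree = min(scored, key=lambda pair: pair[0])
print(best_score, best_tree)
```

With these invented scores the fossil ends up as sister to A, the only taxon with an identical character suite, which mirrors how fossils X and Y are placed in the example. The second character is homoplastic (shared by A, E, and the fossil), and with so few characters a single such trait can be all that decides a placement.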

Four characters, three homoplastic and one invariant, are surely not enough for DNA-scaffolding, but adding more and more characters has a catch. Easy to do for the modern-day taxa, for which we also have molecular data, the preservation of fossils limits adding many more traits; any trait not preserved in the fossil is effectively useless when placing it (including not-preserved traits in total evidence approach may, nonetheless, help the analysis). Which brings us to the real-world example just published in Science:

Wilf P, Nixon KC, Gandolfo MA, Cúneo RA (2019). Eocene Fagaceae from Patagonia and Gondwanan legacy in Asian rainforests. Science 364, 972. Full-text article at Science website.

Why one should not place a fossil using DNA-scaffolding with seven characters

Wilf et al. show (another) spectacularly preserved fossil from the Eocene of Patagonia. Personally, I think that just publishing and shortly describing such a beautiful fossil should be enough to get into the leading biological journals.

But Wilf et al. wanted (needed?) more and came up with the following "phylogenetic analysis" to argue that their fossil is a crown-group Castanoideae, a representative of the modern-day firmly Southeast Asian tropical-subtropical genus Castanopsis, and evidence for a "southern route to Asia hypothesis" (via Antarctica and Australia, both well-studied but devoid so far of any Fagaceae presence; despite the fact that the modern-day climate allows cultivating them as eg. source for commercially used wood).

Wilf et al's Fig. 3 and Table 1 suggest to me that the paper was not critically reviewed by anyone familiar with the molecular genetics of Fagaceae or phylogenetic methods in general — perhaps this is not needed, since the first author is well-merited and the second author a world-leading expert of botanical palaeo-cladistics. However, parsimony-based DNA-scaffolding can be tricky, even with a larger set of characters (see eg. the post on Juglandaceae using a well-done matrix), and using seven is therefore quite bold. Notably, of the seven characters, one is parsimony-uninformative and four are variable within at least one of the included OTUs.

Side note: The tree used as a backbone is outdated and not comprehensive. Plastid and nuclear-molecular data indicate that the castanoids Lithocarpus (mostly tropical SE Asia) and Chrysolepis (temperate N. America) may be sisters. However, the morphologically quite similar Notholithocarpus is not related to either of these, but is instead a close relative of the ubiquitous oaks, genus Quercus (not included in Wilf et al.'s backbone tree), especially subgenus Quercus. Furthermore, the (today Eurasian) castanoid sisterpair Castanea (temperate)-Castanopsis (tropical-subtropical) have stronger affinities to the (today and in the past) Eurasian oaks of subgenus Cerris. The Fagaceae also include three distinct monotypic relict genera, the "trigonobalanoids" Formanodendron and Trigonobalanus, SE Asia, and Colombobalanus from Colombia, South America. Using a more up-to-date instead of a 2-decade-old molecular hypothesis would have been a fair request during review, as would compiling a new molecular matrix to infer a tree used as backbone (currently gene banks include > 238,000 nucleotide DNA accessions including complete plastomes). This would have also enabled the authors to map their traits using a probabilistic framework, which can protect to some degree against homoplastic bias but requires a backbone tree with defined branch-lengths.

There are many more problems with the paper and its conclusions, but this critique would be content- not network-related. Let's just look at the data and see why Wilf et al. would have been better off not showing any phylogenetic analysis at all (and the impact-driven editors and well-meaning reviewers should have advised against it). Or a network.

Clades with little character support

The scaffolding placed the Eocene fossil in a clade with both representatives of Castanopsis, from which it differs by 0–2 and 1–4 traits, respectively. Phylogeny-based, the fossil is a stem- or crown-Castanopsis.

However, the fossil has a character suite that differs in just a single trait (#6: valve dehiscence) from the (genetically very distant) sister taxon of all other Fagaceae, Fagus (the beech), used here as the outgroup to root the Castanoideae subtree. As far as apomorphies are concerned, the data are inconclusive as to whether the fossil represents a stem-Castanoideae (or extinct Fagaceae lineage) or a Castanopsis: this critical, potentially diagnostic derived trait, partial valve dehiscence, is only shared by the fossil and some but not all modern-day Castanopsis. This particular trait is not mentioned elsewhere in the text, although it is the reason the fossil is placed next to Castanopsis and not the outgroup Fagus in the "phylogenetic analysis".
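Trait differences such as "0–2" or "1–4" are ranges rather than single numbers because some OTUs are scored as variable (polymorphic) for some characters. The Python sketch below shows how such a range can be computed; the function `difference_range` is a hypothetical helper, and the scorings are invented, chosen only so that the output resembles the ranges quoted above. They are not the values from Wilf et al.'s Table 1.

```python
def difference_range(otu_a, otu_b):
    """Minimum and maximum number of differing characters between two OTUs.

    Each OTU is a list with one set of observed states per character;
    a set with more than one state marks a polymorphic character.
    """
    lowest = highest = 0
    for states_a, states_b in zip(otu_a, otu_b):
        if not (states_a & states_b):
            # No state in common: the two OTUs always differ here.
            lowest += 1
            highest += 1
        elif len(states_a | states_b) > 1:
            # Overlapping but not fixed for one state: they may differ.
            highest += 1
    return lowest, highest

# Invented scorings for seven binary characters (not the published data).
fossil = [{0}, {0}, {1}, {0}, {0}, {1}, {0}]
otu_1 = [{0}, {0, 1}, {1}, {0}, {0, 1}, {1}, {0}]
otu_2 = [{1}, {0, 1}, {1}, {0, 1}, {0, 1}, {1}, {0}]

print(difference_range(fossil, otu_1))
print(difference_range(fossil, otu_2))
```

The more characters are scored as variable within an OTU, the wider these ranges become, and the less a distance-based or parsimony-based placement can discriminate between alternative positions of the fossil.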

In the following figure, I have mapped (with parsimony) the putative character mutations on the tree used by Wilf et al.

Black font: shared by Fagus (outgroup) and "Castanoideae". Green font: potential uniquely derived traits. Blue font: traits reconstructed as having evolved in parallel/convergently. Red branches, clades in the used backbone tree that are at odds with currently available molecular data (the N. American relict Notholithocarpus should be sister to the Eurasian Castanea-Castanopsis).

This hardly presents a strong case for crown-group assignation. Except for partial dehiscence, even the modern-day Castanopsis have few discriminating derived traits; they are living fossils with a primitive ('plesiomorphic') character suite. Intriguingly, they are also genetically less derived than other Castanoideae and the oaks (see eg. the ITS tree in Denk & Grimm 2010).

The actual differentiation pattern

The best way to depict what the character set provides as information for placing the fossil is, of course, the Neighbor-net, as shown next.

Neighbor-net based on Wilf et al.'s seven scored morphological traits used to place the fossil. Green: the current molecular-based phylogenetic synopsis — based mostly on Oh & Manos 2008; Manos et al. 2008; Denk & Grimm 2010. I had the opportunity to get familiar with all of the then-available genetic data when harvesting all Fagaceae data from gene banks in 2012 for a talk in Bordeaux. One complication in getting an all-Fagaceae-tree is that plastids, geographically constrained, and nuclear regions tell partly different stories.

Castanopsis, including the fossil, is morphologically paraphyletic (see also our other posts dealing with paraphyla represented as clades in trees). Note also the long edge-bundle separating the temperate Chrysolepis and chestnuts (Castanea) from their respective cold-intolerant sister genera (Lithocarpus viz Castanopsis): derived traits have been accumulated in parallel within the "Castanoideae". The scored aspects of Fagaceae morphology are very flexible and ~50 million years is a long time, possibly leading to partial valve indehiscence (or losing it) without being part of the same generic lineage. The puzzling differentiation, and the profoundly primitive appearance of the fossil (shared with modern-day Castanopsis), may in fact be the reason the authors (i) didn't optimize / discuss very similar, co-eval fossils from the Northern Hemisphere interpreted (and cited) as extinct genera (eg. Crepet & Nixon 1989), (ii) left out the two Fagaceae genera today occurring in South America, (iii) opted for classic parsimony and a partly outdated molecular hypothesis, and (iv) just showed a naked cladogram without branch support values as the result of their "phylogenetic analysis" (Please stop using cladograms!).

Based on the scored characters, the position of the fossil in the graph, and on the background of a more up-to-date molecular-based phylogenetic synopsis (the green tree in the figure above), the most parsimonious interpretation (and probably, the most likely) is that the fossil may indeed be a stem-Castanoideae, a representative of the lineage from which the Laurasian oaks evolved at least 55 million years ago (the oldest Quercus fossil was found in SE Asia), or even represent a morphologically primitive, extinct (South) American lineage of the Fagaceae. Regarding the "southern route", Ockham's Razor would favor that they are just a South American extension of the widespread Eocene Laurasian Fagaceae / Castanoideae, since very similar fossils and castaneoid pollen are found in equally old and older sites in North America, Greenland (papers cited by Wilf et al.) and Eurasia but not in Australia, New Zealand or Antarctica.

A final note: when you have so few characters to compare, you should use OTUs that are not completely ambiguous in every potentially discriminating character, as scored for the "C. fissa group" — the "Castanopsis group" has a single unambiguously defined, potentially derived trait. Using artificial bulk taxa is generally a bad idea when mapping trait evolution onto a molecular backbone tree. Instead, you should compile a representative placeholder taxa set, with as many taxa as you need (or are feasible) to represent all character combinations seen in the modern species/genera.
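The last recommendation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxon names and character scores below are invented placeholders (not taxa or data from Wilf et al.), and the function name `pick_placeholders` is hypothetical. The idea is simply to group taxa by their scored character combination and keep one representative per combination, rather than merging them into a bulk taxon.

```python
from collections import defaultdict


def pick_placeholders(scores):
    """Return one representative taxon per unique character combination.

    `scores` maps a taxon name to a sequence of scored character states.
    The representative is the alphabetically first taxon of each group,
    which is an arbitrary choice made only to keep the result reproducible.
    """
    by_combination = defaultdict(list)
    for taxon, states in scores.items():
        by_combination[tuple(states)].append(taxon)
    return {combo: sorted(taxa)[0] for combo, taxa in by_combination.items()}


# Hypothetical scores for seven characters (invented for illustration only).
example_scores = {
    "species_A": "0010110",
    "species_B": "0010110",  # same combination as species_A
    "species_C": "1010010",
    "species_D": "0000110",
}

placeholders = pick_placeholders(example_scores)
for combo, taxon in sorted(placeholders.items()):
    print("".join(combo), taxon)
```

In practice one would choose the representative of each group on biological grounds (eg. the best-sampled species) rather than alphabetically, and would add more taxa per combination where feasible.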

Postscriptum (14/1/2020)
Relevant matrices (NEXUS-formatted) and explicit character trait maps (Why we want to map trait evolution on networks, pt.1 – Introduction, pt.2 – Topological Ambiguity) have been uploaded to figshare.

Other cited references, with comments
Crepet WL, Nixon KC (1989) Earliest megafossil evidence of Fagaceae: phylogenetic and biogeographic implications. American Journal of Botany 76: 842–855. – introducing a Castanopsis-like infructescence interpreted to represent an extinct genus but very similar to the new Patagonian fossil in its preserved features; and co-occurring with castaneoid pollen (not reported so far for Patagonia) and foliage.
Denk T, Grimm GW (2010) The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366. – includes an all-"Quercaceae" ITS-tree (fig. 3) and -network (fig. 4) using data of ~1000 ITS accessions; the nuclear-encoded ITS is so far the only comprehensively sampled gene region that gets the genera and main intra-generic lineages apart (recently confirmed and refined by NGS phylogenomic data), something wide-sampled plastid barcodes struggle with. Analysed with up-to-date methods and avoiding long-branch interference by excluding the only partially alignable Fagus, Castanopsis dissolves into a grade in the all-accessions tree and Quercus is deeply nested within the Castanoideae (as already seen in the 2001 tree used by Wilf et al. as backbone). The species-level PBC neighbor-net prefers a circular arrangement in which Notholithocarpus remains a putative sister of substantially divergent and diversified Quercus, followed by Castanea-Castanopsis, and Lithocarpus, while Chrysolepis is recognized as unique.

Oh S-H, Manos PS (2008) Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57: 434–451. – Probably still the best Fagaceae tree, and surely not a bad basis for probabilistic mapping of morphological traits in the family.

Manos PS, Cannon CH, Oh S-H (2008) Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55: 181–190. – the tree failed to resolve the monophyly of the largest genus, the oaks, but depicts well the data reality when combining ITS with plastid data and, hence, provides a good trade-off guide tree.

Monday, June 10, 2019

Why don't people draw evolutionary networks sensibly?

In phylogenetics there are two types of network:
  • those where the network edges have a time direction, whether explicit or implied; and
  • those where the edges are undirected.
The latter networks are among the most valuable tools ever devised for the exploration of multivariate data patterns; and this blog is replete with examples drawn from all fields that produce quantitative data (see the Analyses blog page). The first type of network, however, is the only one that can display hypothesized evolutionary histories — that is, they can truly be called evolutionary networks.

Evolutionary networks have a set of characteristics that are essential in order to successfully display biological histories, such as:
  • no directed cycles, because otherwise one of the descendants would be its own ancestor;
  • time consistency, meaning that reticulations in the network only occur between contemporaries.
The latter requirement is not needed for the history of human artifacts, because the ideas on which those artifacts are based can be recorded, and then not used until much later — ideas can "leap forward" in time. There are a number of examples of this in this blog, as discussed in last week's post (A phylogenetic network outside science).

However, time consistency is pretty much universal in biology (see the post on Time inconsistency in evolutionary networks). Natural hybridization and introgression require two living organisms in order to occur, as does horizontal gene transfer. This is basic biology, at least outside the laboratory.
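The first requirement listed above, the absence of directed cycles, is easy to check mechanically. The following is a minimal sketch, assuming the network is given as a list of (ancestor, descendant) edges; the function name `has_directed_cycle` and the node labels are hypothetical. It uses Kahn's topological-sorting procedure: if some nodes can never be removed because they always retain an incoming edge, the network contains a directed cycle.

```python
from collections import defaultdict, deque


def has_directed_cycle(edges):
    """Return True if the directed network contains a cycle (Kahn's algorithm)."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start from all nodes that have no ancestor in the network (the roots).
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Nodes that were never removed lie on, or downstream of, a directed cycle.
    return removed != len(nodes)


# A small hybridization network: "hybrid" has two parents, but no cycle exists.
network = [("root", "a"), ("root", "b"), ("a", "hybrid"), ("b", "hybrid")]
# An impossible history: "x" would be its own ancestor.
impossible = [("x", "y"), ("y", "z"), ("z", "x")]

print(has_directed_cycle(network))
print(has_directed_cycle(impossible))
```

Note that a reticulation node with two parents is perfectly acceptable here; only a path that leads from a node back to itself is rejected.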

So, the question posed in this post's title refers to the fact that so many people draw their evolutionary networks in a manner that appears to violate time consistency.

Consider this example (from: Interspecies hybrids play a vital role in evolution. Quanta Magazine):

Note that the reticulation edges (the dashed lines) represent gene transfers by introgression or hybridization, and yet none of them are drawn vertically, as they would need to be in order to be time consistent (since time travels from left to right).

It might be argued that most of these are not all that important in practice, but the one to the left quite definitely matters very much. It shows gene transfer between: (i) an organism that speciated 3.65 million years ago and (ii) an organism that is the descendant of one that speciated 3.47 million years ago. The 180,000 years between those two events are not irrelevant; and they make the claimed gene transfer impossible.
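The arithmetic behind this objection can be written out as a small check. This is only a sketch under stated assumptions: each lineage is modelled as a time interval in millions of years ago (Mya), the first lineage is taken to end at its speciation event 3.65 Mya, and the second is taken to begin no earlier than 3.47 Mya. The class and function names are hypothetical, and the unknown origin of the first lineage is represented by infinity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    name: str
    start_mya: float  # origin of the branch (larger number = older)
    end_mya: float    # end of the branch by speciation or extinction (0.0 = extant)


def gap_in_years(a, b):
    """Return 0 if the two lineages coexisted, else the gap between them in years."""
    latest_start = min(a.start_mya, b.start_mya)  # the younger of the two origins
    earliest_end = max(a.end_mya, b.end_mya)      # the older of the two ends
    gap_my = earliest_end - latest_start          # positive only if there is no overlap
    return max(0, round(gap_my * 1_000_000))


# Dates from the text; the origin of the donor is not given (assumed unknown).
donor = Lineage("lineage that speciated 3.65 Mya", float("inf"), 3.65)
recipient = Lineage("descendant of the 3.47 Mya speciation", 3.47, 0.0)

gap = gap_in_years(donor, recipient)
print(gap)  # 180000
print("time consistent" if gap == 0 else "not time consistent")
```

A reticulation is time consistent only when the gap is zero, that is, when the two intervals overlap and the edge can be drawn perpendicular to the time axis.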

One might think that this is simply the general media misunderstanding the network requirements, but this is not so. The diagram is actually a quite accurate representation of the one from the original scientific publication (from: Genome-wide signatures of complex introgression and adaptive evolution in the big cats. Science Advances 3: e1700299; 2017.):

The network shows the same series of hybridizations / introgressions. However, this time three sets of gene transfers are shown to be time consistent, represented by the horizontal arrows (since time flows from top to bottom). Two of the three diagonal arrows (light blue and orange) could be made time consistent (ie. drawn horizontally), although the authors have chosen not to do so, apparently for artistic reasons. However, the first reticulation cannot be made time consistent, for the reason outlined above.

So, people, please think about what you are drawing, and don't show things that are biologically impossible.

Monday, June 3, 2019

A phylogenetic network outside science

I have written before about the presentation of historical information using the pictorial representation of a phylogeny (eg. Phylogenetic networks outside science; Another phylogenetic network outside science). These diagrams are often representations of the evolutionary history of human artifacts, and so a phylogeny is quite appropriate. They are of interest because:
  • they are usually hybridization networks, rather than divergent trees, because the artifact ideas involve horizontal transfer (ideas added) and recombination (ideas replaced);
  • they are often not time consistent, because ideas can leap forward in time, so that the reticulations do not connect contemporary artifacts (see Time inconsistency in evolutionary networks); and
  • they are sometimes drawn badly, in the sense that the diagram does not reflect the history in a consistent way.
The latter point often involves poor indication of the time direction (see Direction is important when showing history), or involves subdividing the network into a set of linearized trees.

One particularly noteworthy example that I have previously discussed is of the GNU/Linux Distribution Timeline, which illustrates the complex history of the computer operating system. The problems with this diagram as a phylogeny are discussed in the blog post section History of Linux distributions.

In this new post I will simply point out that there is a more acceptable diagram, showing the key Unix and Unix-like operating systems. I have reproduced a copy of it below.


This version of the information correctly shows the history as a network, not a series of linearized trees (each with a central axis). It also draws the reticulations in an informative manner, rather than having them be merely artistic fancies.

It is good to know that phylogenetic diagrams can be drawn well, even outside biology and linguistics.