Monday, April 6, 2020

Consensus networks: cluster union or edge union?

(Another joint post by David and Guido)

In the book Introduction to Phylogenetic Networks (Morrison 2011), it was convenient to organize the various network types into two groups:
  • those that are intended to provide a summary of various possible phylogenetic histories
  • those that simply summarize the multivariate data into a convenient visualization.
The former are directed networks (ie. they have an explicit root) that are interpretable as phylogenies (ie. phylogenetic hypotheses), while the latter are undirected networks (ie. no root), and therefore do not display historical pathways of evolution.

The consensus network of Holland et al. (2004. Molecular Biology and Evolution 21: 1459-1461) is among the most popular of the networks in the second group. This is formally a Cluster Union Network (CUN), in which the clusters represented by a set of input trees are combined into a single diagram. The clusters are defined by the edges in the original (unrooted) trees - each edge splits the tree into two parts. The trees are thus reduced to the set of splits that appear in at least one of the trees. Each split will then appear in the CUN. If there is no disagreement among the trees, then a split will be represented by a single edge in the CUN; but if there is conflict among the trees then a split will be represented by a set of parallel edges.

A cluster consensus network, with two reticulation areas,
each defined by two sets of parallel edges.

The end result is that the edges of the CUN no longer represent phylogenetic pathways, even if they did do so in the input trees. Some of the edges of the CUN are there solely as part of a set of parallels. To put it another way, some of the edges do not appear in any of the original trees, but are the result of combining the clusters. So, a CUN will vary from tree-like, if there is little conflict among the input trees (ie. compatible splits), to a complex spider-web, if there is a lot of conflict (many incompatible splits).

It is this property of representing splits by a set of edges that prevents the network being a representation of phylogenetic history – formally, the edges define clusters not clades.
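The split bookkeeping behind a CUN can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular program: the two five-taxon trees are made up for the purpose, and the names `splits_of` and `compatible` are our own. Each split is stored as the side that does not contain a fixed reference taxon, so two splits are compatible exactly when these sides are disjoint or nested.

```python
from itertools import combinations

TAXA = frozenset("ABCDE")

# Two illustrative trees as nested tuples (an assumption, treated as unrooted).
TREE_1 = (("A", ("B", "C")), ("D", "E"))
TREE_2 = ((("A", "D"), ("B", "C")), "E")

def splits_of(tree, taxa=TAXA):
    """Collect the non-trivial splits of a tree.

    Each split is represented by the side that excludes the reference taxon.
    """
    ref = min(taxa)
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        side = taxa - below if ref in below else below
        # Skip trivial splits (one taxon versus the rest) and the empty split.
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return below

    leaves_below(tree)
    return found

def compatible(s1, s2):
    """Two splits fit on one tree if their sides are disjoint or nested."""
    return not (s1 & s2) or s1 <= s2 or s2 <= s1

# The cluster union: every split that appears in at least one tree.
cluster_union = splits_of(TREE_1) | splits_of(TREE_2)

# Incompatible pairs are the ones that need parallel edges in the network.
conflicts = [
    (a, b)
    for a, b in combinations(sorted(cluster_union, key=sorted), 2)
    if not compatible(a, b)
]
```

Here the union holds three splits (BC, DE and BCE versus the rest), and the only incompatible pair is DE against BCE, which is where a box of parallel edges would appear.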

Miyagi and Wheeler (2019. Cladistics 35: 688-694) have addressed this issue by defining what they call an Edge Union Network (EUN). In essence, it is a subset of the CUN - formally, the EUN is contained within the CUN. It can be thought of as a CUN that contains only those edges that appear in at least one of the input trees. M&W see the edges as "redundant" if they appear in the CUN but not the input trees.

M&W's objective for the EUN is thus "to display the total history of all the input trees, rather than the simplest graph which contains all clusters present in the data" (which the CUN does). M&W see "phylogenetic networks as hypotheses for evolutionary history", so that the EUN can be rooted, just like the input trees. The criterion for the EUN is parsimony, so that "it is important to minimize the number of distinct paths between nodes".

It is important to note in the following discussion that M&W are interested in rooted networks, and so their version of a CUN is not quite the same as the original unrooted Consensus Network.


M&W provide a graphical example, the CUN and EUN of two incongruent rooted trees. Here's a colored version: all nodes (internal and terminal) are re-labelled to express the last common ancestor (LCA) that they represent, and internal (conflicting) tree edges are colored, so we can trace them in the networks.

M&W's example of two incongruent trees (their Fig. 1) and the CUN (their Fig. 2; bottom right).
The stars are nodes of the full CUN (bottom left) not represented in M&W's CUN (bottom right);
the dotted lines indicate dropped edges.

At the bottom left is the strict consensus network, the full CUN, of both trees. Most internal nodes (alternative LCAs) in the trees (ABC, ABCD, AD, DE) are not represented by a single node in the full CUN but by a set of parallel edge bundles (dotted lines). Nonetheless, each edge set represents a branch (clade) in one or both trees – a full CUN depicts all topological alternatives in the two trees. We can extract sets of congruent splits, and reconstruct the two trees in the process.

But since the nodes in the full CUN are not (alternative) LCAs but just connections of (parallel) edges, we cannot interpret this (bottom left) graph as a phylogenetic network. However, the CUN depicted in M&W does do this: we start in the root and walk from node to node along the branches (arrows) until we end up with an explicit phylogenetic network (bottom right). This includes an edge that is not found in any of the trees, a 'false' edge (ABCD-ABC: violet, fat line), while also missing an edge found in one tree (ABCD-BC).

EUN in comparison to a full CUN for M&W's example. The 'false' ABCD-ABC edge
is replaced by a ABCD-BC edge resulting in a phylogenetic network that has
only edges seen in the two phylogenetic trees.

The false ABCD-ABC edge is replaced by an ABCD-BC edge, and ABC is reconnected directly to the root.
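The difference between the two constructions can be checked with a small Python sketch. Several things here are assumptions: the two rooted trees are our reconstruction of the example from the node labels used above, the rooted cluster network is approximated by the inclusion (Hasse) diagram of all clusters, which may not match M&W's construction in every detail, and the function names are our own. Nodes are labeled by their leaf sets, which only works because the taxon names are single letters.

```python
# Two rooted trees as nested tuples; our reconstruction of the example (an assumption).
TREE_1 = (("A", ("B", "C")), ("D", "E"))
TREE_2 = ((("A", "D"), ("B", "C")), "E")

def tree_edges(tree):
    """Return (label, edges): a node's label is its sorted leaf set,
    and edges are (parent, child) pairs of such labels."""
    if isinstance(tree, str):
        return tree, set()
    edges, child_labels = set(), []
    for subtree in tree:
        label, sub_edges = tree_edges(subtree)
        child_labels.append(label)
        edges |= sub_edges
    parent = "".join(sorted("".join(child_labels)))
    edges |= {(parent, child) for child in child_labels}
    return parent, edges

def cover_edges(labels):
    """Edges of the inclusion diagram: parent contains child, with no
    other cluster strictly in between."""
    sets = {label: frozenset(label) for label in labels}
    edges = set()
    for parent in labels:
        for child in labels:
            if sets[child] < sets[parent] and not any(
                sets[child] < sets[mid] < sets[parent] for mid in labels
            ):
                edges.add((parent, child))
    return edges

# Edge union: every (parent, child) edge seen in at least one input tree.
edge_union = tree_edges(TREE_1)[1] | tree_edges(TREE_2)[1]

# Cluster-based diagram built from the same set of node labels.
labels = {node for edge in edge_union for node in edge}
cluster_diagram = cover_edges(labels)

false_edges = cluster_diagram - edge_union    # in the diagram, in no tree
dropped_edges = edge_union - cluster_diagram  # in a tree, not in the diagram
```

With these inputs, the only edge in the cluster diagram that occurs in neither tree is ABCD-ABC, and ABCD-BC is among the tree edges that the cluster diagram lacks, in line with the description above.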

An implicit assumption and reason to reduce CUNs into EUNs is that the topological ambiguity in the two trees represents reticulate evolution (eg. hybridization): the trees indicate that the LCA of taxa B and C evolved from the LCA of A to C and the LCA of A to D, but the LCA of A to D is not ancestral to the LCA of A to C. This, however, appears quite strange from an evolutionary point of view. A simple explanation for the conflict between the two trees would be that D is a hybrid of the lineage leading to A, which is the sister of B + C, and E.

A simple evolutionary scenario explaining the difference between the two conflicting trees:
A is the paternal, and E the maternal donor of the hybrid D.

As shown in this figure, the LCA of A to D equals the LCA of A to C (depicted as two different nodes in the EUN), and the LCA of A+D and D+E are just (the precursors of) A and E. Taxon D is related to the ABC clade because its paternal donor was A, and to the E lineage via its mother.

This leads us to a principal question: do we want to reduce CUNs, which are splits graphs depicting all splits in a set of trees, ie. competing topological alternatives, to directed phylogenetic networks at all? The EUN has fewer edges (and nodes) than the CUN, but it still is an overly complex graph even for potentially very simple evolutionary scenarios.

On the Mesquite discussion group, a question was asked whether EUNs should be implemented as a means to quickly investigate conflict between trees. The answer to that question is: no. Consensus networks (CUNs) will be more than sufficient, since they are splits-based not node-based.

One application of EUNs may be ancestral state reconstruction. Character progression could be modeled the same way as it currently is along trees. Instead of viewing the nodes as actual LCAs in a reticulation scenario, one could consider them as competing alternative LCAs, and use the results of the ancestral state reconstruction along the EUN, to make a choice among alternatives, or simply to compare different evolutionary scenarios in the same graph.

Monday, March 30, 2020

Trees and viruses: the SARS group

[This is a joint post by David and Guido]

There seems to be a lot of current confusion about the Covid-19 disease, and the SARS-CoV-2 virus that causes it. To help clear up some of the misunderstandings, this post might help:

   There seems to be a lot of public misunderstanding about the coronavirus

From the perspective of phylogenetics, in his last post (Problems with the phylogeny of coronaviruses), David pointed out that phylogenetic trees may not be a good choice for visualizing the phylogeny of coronaviruses. For this post, we collected from GenBank all complete genomes of one group of Betacoronavirus, the SARS-inflicting viruses, which includes the new SARS-CoV-2, the Covid-19 disease-causing virus, to look at this in more detail.

Reticulation analysis

Based on a preliminary analysis (included in a figshare submission), we ended up combining the individual genome accessions into group-based consensus sequences, in which intra-lineage variation is expressed as polymorphism: IUPAC ambiguity codes. The graphs in Figs. 1 and 2 are based on a data set including 291 accessions that are not literal duplicates of others, out of a total harvest of 395 genomes. For example, Groups 1 and 7 (as labeled in the figures below) include many accessions that are nearly identical or differ only in stochastic mutation patterns.

When processing the harvested data, one can notice mutational patterns in the best-sampled groups (eg. the original SARS-viruses, the new CoV-2 virus), which may be the result of mediocre sequencing and editing artifacts.

Consensus sequences have several advantages over choosing placeholders:
  • We can reduce the number of OTUs without losing too much information about intra-group diversity.
    • This facilitates visual inspection of the underlying alignments.
    • Under maximum likelihood (ML) inference, ambiguity codes can make a difference. They are not treated as missing data, as polymorphisms can be informative to some degree even in extreme cases (eg. Potts et al., Systematic Biology, 2014). Note that the RAxML-NG program now includes special models for (phased) DNA and RNA data that include a lot of ambiguities.
  • Stochastic mutations found in only one of the many accessions within a group can be completely eliminated by using modal consensus sequences instead of strict consensus sequences. For the data used here, it makes little difference (results provided in the linked figshare submission).
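The difference between a strict and a modal consensus can be illustrated with a short Python sketch. This is not the procedure we used to build the group sequences, just a minimal version under simplifying assumptions: the toy sequences are invented, all sequences have equal length, and only the four plain bases occur (no gaps or pre-existing ambiguity codes). The function names are hypothetical.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def code_for(bases):
    """Translate a collection of bases into a single IUPAC code."""
    return IUPAC["".join(sorted(bases))]

def strict_consensus(seqs):
    """Every base seen in a column contributes to the ambiguity code."""
    return "".join(code_for(set(column)) for column in zip(*seqs))

def modal_consensus(seqs):
    """Only the most frequent base(s) in a column are kept; ties become
    an ambiguity code of the tied bases."""
    out = []
    for column in zip(*seqs):
        counts = Counter(column)
        top = max(counts.values())
        out.append(code_for(base for base, n in counts.items() if n == top))
    return "".join(out)

# Invented toy group: two accessions each carry one singleton mutation.
group = ["ACGTA", "ACGTA", "ATGTA", "ACGTG"]
strict = strict_consensus(group)
modal = modal_consensus(group)
```

For this toy group the strict consensus keeps both singletons as Y and R, while the modal consensus drops them.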
Fig. 1 shows the mid-point rooted ML tree based on the group-consensus matrix. It largely agrees with the non-consensed tree (included in the figshare submission) but takes much less time to infer and run bootstrap analysis.

Fig. 1. Mid-point rooted ML tree using group-consensus sequences (labeled by numbers), tips with GenBank accession numbers represent non-consensed data.

Note that we find poor split support for some topological features, which can have two reasons:
  • Lack of discriminatory signal.
  • Conflicting signal, ie. here: potential recombination.
Except for a few aspects, the tree seems to be clear. And may be fooling us.

We can see this by looking at a network instead of a tree. The Neighbor-net based on the group-consensus data (Fig. 2) doesn't appear to be overly tree-like, especially when compared to the ML tree and its branch support (Fig. 1).

Fig. 2. Neighbor-net based on the group-consensus data, colored arches, arrows and field refer to mutational patterns and recombination cross-checked for by visual inspection of the alignment.

In fact, the highly supported Group 4 (in the tree) includes some genomes showing evidence for reticulation outside this apparent clade. The accessions labeled as 3c and 3d are recombinants as well, and hence are placed between the two main clusters (1 + 8 vs. 4–7) in the tree. The 1 + 2 + 3a/b group collects what appears to be a gradual evolutionary trend: all "non-1" sequences simply differ more and more from Group 1, but not necessarily in the direction of Groups 4–9. The relatively low support for the 8+9 group is also the result of recombination and conflicting signals in the underlying data (Fig. 2).

The reasons for these reticulate signals probably include homoplasy (especially within main groups), but also obvious recombination, as shown in Fig. 3.

Fig. 3. Potentially alien DNA within difficult to align regions. Note that CoV-1 Type "A" and "B" do not sort along the ML tree, and only to some degree agree with neighborhoods in the Neighbor-net. Note that the portion shown was not included in our analysis. Congruent patterns can nonetheless be found in the sequentially more conserved regions we used.

Groups 2a and 3b have very similar sequences elsewhere (2a has been represented by a single consensus sequence in our analysis) but can show either type in the regions shown in Fig. 3. The partly incongruent distribution of Types A and B in Fig. 3 can only be explained by secondary recombination between CoV-1 lineages.

Going back to the alignment, we can see that the uniqueness of accession 3d lies in patterns otherwise seen only in the distantly related Group 5, combined with regions where it mirrors Group 9 (see Fig. 3), while the rest of the sequence shows the basic Group 1–3 type, which is difficult to distinguish from Group 8 (leading to its position in the ML tree, Fig. 1, and NNet, Fig. 2). Groups 1 and 8 have mutations not seen in any other accession, separating them clearly from each other (lack of a neighborhood, Fig. 2), and also separating Group 8 more from Groups 2 and 3 than Group 1 (clade in Fig. 1 but fan in Fig. 2). Group 8 may be a recombinant of Group 1(–3) and Group 4, as suggested in Fig. 2; and bootstrap (BS) < 100 in the ML tree in Fig. 1. One accession of Group 4 includes sequence patterns in one region diagnostic for the CoV-2 lineage and its sister lineage (Group 6).

The most striking recombination feature is, however, not captured even by the Neighbor-net. The orange field in Fig. 2 refers to the last sixth of the sequences (~ 5,000 bp) which are near-identical in accessions 3c, 3d, one 9a and all members of Group 4 except 4a, showing a sequence visibly different from all others in the alignment.

Moreover, our analysis and graphs did not consider sequence patterns in difficult or impossible to align regions, which can show very complex patterns, as illustrated in Fig. 4.

Fig. 4. Bird's eye view of the transition zone between alignable and sequentially (extreme) diverse genome portions that were excluded from analyses.

The non-alignable region in Fig. 4 shows about a dozen substantially different sequence types, most of which may actually be obtained by recombination with coronaviruses outside the SARS group. In some cases, strikingly similar sequence types are shared by members of different groups shown in the tree; while members of the same groups, even highly supported sister taxa, can have sequence types that are near 100% different.

Knowing the enemy: What is new about SARS-CoV-2?

We can see that some groups (labeled as 2a to 3c, 8, 9) have sequences close to Group 1, while others (labeled 4 to 7) are increasingly different (Fig. 2). The new strain, discovered in Wuhan and now spreading across the world (substantially affecting the way we live), has one potential direct sister strain: Group 6 (accessions MG772933/34) labeled in GenBank as "Bat SARS-like coronavirus", and isolated from bats (Rhinolophus sinicus). The reference for MG772933 is Hu et al. (2018) Emerging Microbes and Infection 7: e1006698, and was submitted in January 2018 by a researcher at the Institute of Military Medicine in Nanjing. We compare these sequences in Fig. 5.

Fig. 5. Close-up on a sequentially conserved region. All mutations that differ between CoV-1 and CoV-2 are also found in Group 6 (either obtained by multiple mutation or recombination with viruses not included in our data).

The Y (= C/T) in the Group 6 consensus sequence illustrates a general feature of the two Group 6 accessions: one is less modified with respect to the other (older, longer-known) SARS viruses, reflected by a BS = 67 for one being sister to Group 7, when using the accession data (not consensus sequences). But both share sequence patterns not found in Group 7 (BS = 32, note the long terminal branch in the Neighbor-net), which demonstrates that these viruses represent a genuine sister lineage. Elsewhere in their genome, both Group 6 and 7 are identical (Fig. 5). In other sequence regions, only Group 7 (CoV-2) differs (Fig. 6).

Fig. 6. A sequentially equally conserved region, purple arrows highlight unique mutations in CoV-2 strain; yellow background shared mutations (convergent and lineage-conserved).

Unfortunately, the various gene banks have been flooded with new, near-identical CoV-2 genomes, which are useless from a phylogenetic point of view. So, there is no point in looking for unique sequence features shared by Groups 6 and 7, or unique to Group 7 (CoV-2) — all of the best hits will be novel SARS-CoV-2 genomes. So, we cannot assess to what degree the new CoV-2 lineage, which likely includes the Group 6 accessions, differs from the remaining CoV-1 (original SARS) because of recombination with other coronaviruses.

However, recombination is most likely, given that CoV-1 genomes show plenty of signals that do not fit into tree-like evolution, but instead evidence inter-group reticulation (Fig. 2; example: Fig. 4). We also have sequence portions that are nearly 100% different (and hence not included in the phylogenetic analyses; Figs. 3/4) along with mutational patterns in conserved regions, which appear to be just mutations from the original SARS (Figs. 5, 6).

The harvested, mafft-aligned (uncurated) data and curated alignment, as well as files needed for analysis have been uploaded to figshare (CC-BY).

Grimm G, Morrison D (2020). Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset.

Monday, March 23, 2020

Evolution unchained: The development of person names and the limits of sequences

What do person names like Jack and Hans have in common, and what unites Joe and Pepe? Both name pairs go back to a common ancestor. For Jack and Hans, this would be John (ultimately going back to Iōánnēs in Greek), and for Joe and Pepe, this would be Josef (originally from Hebrew). Given the striking dissimilarity of the names in their current form, the pathways of change by which they have evolved into their current shape are quite complicated.

While the German name Hans can be easily shown to be a short form of the German variant Johannes, the evolution of Jack is more complicated. First (at least this is what people on Wikipedia suppose), Iōánnēs becomes John in English, similar to the process that transformed German Johannes into Hans. Then, in an ancient form of English, a diminutive was built for John, which yielded the form Jenkin, with the diminutive suffix -kin that has a homologous counterpart in German -chen (which can be attached to Hans as well, yielding Hänschen). Etymologically, Jack is little Johnny.

While Joe in English is a shortening of Josef, the development of Pepe is again a bit more complex. First, we find the form Giuseppe as an Italian counterpart of Josef. How this form then yielded Pepe as a diminutive is not completely clear to me; but since we find the pe in the Italian form, we can think of a process by which Giuseppe becomes Giuseppepe, leaving Pepe after the deletion of the initial two syllables.

The complexity of person-name evolution

Even from these two examples alone, we can already see that the evolution of person names can easily become quite complex. If all words in all spoken languages in the world evolved in the same way in which our person names evolve, we would have a big problem in historical linguistics, since the amount of speculation in our etymologies would drastically increase.

When comparing etymologically related words from different languages, we generally assume that they show regular correspondences among their sound segments. This presupposes that there is still enough sound material that reflects these correspondences, allowing us to detect and assess them. But since the evolution of person names rarely consists of the regular modification of sounds, but rather results in the deletion, reduplication, and rearrangement of whole word parts, there is rarely enough left in the end that could be used as the basis for a classical sequence comparison.

With the name Tina in German being the short form of Bettina, Christina, and at times even Katharina, and with Bettina itself going back to Elisabeth, and with Tina becoming Tinchen, Tinka, or Tine, we face an almost insurmountable challenge when trying to model the complexity of the various patterns by which names can change.

Modeling word derivation with directed networks

That words do not evolve solely by the alternation of sounds, but also by different forms of derivation, is nothing new for historical linguistics. We face the problem, for example, when looking for etymologically related words in the basic lexicon of phylogenetically related languages. However, these phenomena can be easily investigated by enhanced means of annotation. The evolution of person names, on the other hand, presents us with larger challenges.

While working as a research fellow in France in 2015-2016, I had the time to develop a small tool that allows us to represent derivational relations between related words with the help of a directed network, and thus allows us to model these relations in a rough way. Such a graph is directed, and our words are the nodes in the network, with the edges drawn between the assumed ancestor word forms and their descendants. This tool, which I then called DeriViz, is still available online, and makes it possible to visualize network relations between words.

I have now conducted a small experiment with this tool, by taking name variants of Elisabeth, as they are listed in Wikipedia, and trying to model them in a directed network, along with intermediate stages. You can do this easily yourself, by copying the network that I have constructed in text form below, and pasting it into the field for data entry at the DeriViz-Homepage. The network will be visualized when you press on the OK button; and you can play with it by dragging it around.
Elisabeth → BETT
BETT → Betty
BETT → Bettina
BETT → Bettine
BETT → Betsi
Elisabeth → ELISABETH
Elisabeth → ILSA
ILSA → Ilsa
ILSA → Ilse
Elisabeth → Isabella
Elisabeth → LISA
LISA → Lieschen
LISA → Liese
LISA → Liesel
LISA → Lis
LISA → Lisa
LISA → Lisbeth
LISA → Lisette
LISA → Lise
LISA → Liesl
Elisabeth → LILA
LILA → Lila
LILA → Liliane
LILA → Lilian
LILA → Lilli
Elisabeth → Sisi
I intentionally reduced the amount of data here, in order to make sure that the graphic can still be inspected. But it is clear that even this simple model, which assumes unique ancestor-descendant relations among all of the derived person names, is stretched to its limits when applied to names as productive as Elisabeth, at least as far as the visualization is concerned.
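For readers who prefer to inspect such a network in code rather than in the browser, here is a minimal Python sketch that parses the same "ancestor → descendant" text format. It is not how DeriViz works internally (I make no claim about that); the function names are hypothetical, and only a handful of the edges listed above are used to keep the example short.

```python
from collections import defaultdict

# A small excerpt of the edge list, in the same text format.
EDGE_LIST = """
Elisabeth → BETT
BETT → Betty
BETT → Bettina
Elisabeth → LISA
LISA → Lisa
LISA → Liesel
Elisabeth → Sisi
"""

def parse_edges(text):
    """Turn lines of the form 'source → target' into (source, target) pairs."""
    edges = []
    for line in text.strip().splitlines():
        source, target = (part.strip() for part in line.split("→"))
        edges.append((source, target))
    return edges

def build_network(edges):
    """Index the directed edges by source (children) and by target (parents)."""
    children, parents = defaultdict(list), defaultdict(list)
    for source, target in edges:
        children[source].append(target)
        parents[target].append(source)
    return children, parents

def descendants(children, node):
    """All forms derived from a node, directly or via intermediate stages."""
    found = []
    for child in children.get(node, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found

children, parents = build_network(parse_edges(EDGE_LIST))

# The simple model assumes that every derived form has exactly one ancestor.
unique_ancestors = all(len(sources) == 1 for sources in parents.values())
derived_from_elisabeth = descendants(children, "Elisabeth")
```

The check on `unique_ancestors` makes the model's central assumption explicit: as soon as one name has two possible sources, the graph is no longer a tree.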

Derivation network of names derived from Elisabeth

If you now imagine that there are various processes that turn an ancestral name into a descendant name, and that one would ideally want to model the differences between these processes as well, one can see easily that it is indeed not a trivial problem to model the evolution of person names (and we are not even speaking of inferring any of these relations).

How names evolve

Names evolve in various ways along different dimensions. With respect to their primary function, or their use, we tend to use, among others, nick names. Formally, nick names are often a short form of an original name, but depending on the community of speakers, it is also possible that there is a formal procedure by which a nick name can be derived from a base name. Thus, every speaker of Russian should know that Jekaterina can be turned into Katerina, which can be turned into Katja, which can be turned into Katjuscha, or, in the case of a Vocative, into Katj. Once the primary function of a name changes, its form usually also changes, as we can now see in many examples.

But the form can also change when a name crosses language borders. If you go with your name into another country, and the speakers have problems pronouncing certain sounds that occur in your name, it is very likely that they will adjust your name's pronunciation to the phonetic needs of their own language, and modify it. Names cross language borders very quickly, since we tend not to leave them at home when visiting or migrating to foreign countries. As a result, a great deal of the diversity of person names observed today is due to the migration of names across the world's larger linguistic communities.

How we change names when building short forms or nick names, or when trying to adapt a name to a given target language, depends on the structure of the language. The most important part is the phonology of the language in which the change happens. For example, when transferring a name from one language to another, and the new language lacks some of the sounds in the original name, speakers will replace them with those sounds which they perceive to be closest to the lacking ones.

But the modification is not restricted to the replacement of sounds. My own given name, Mattis, for example, usually has the stress on the first syllable, but in France, most people tend to call me Matisse, with the accent on the second syllable, reflecting the general tendency to stress the last syllable of a word in French. In Russian, on the other hand, Mattis could be perfectly pronounced, but since people do not know the name, they often confuse it with its variant Matthias, which then sounds like Matjes when pronounced in Russian (which is the name for soused herring in Germany). There are more extreme cases; and both English and German speakers are also good at drastically adjusting foreign names to the needs of their mother tongues.

It would be nice if it was possible to investigate the huge diversity in the evolution of person names more systematically. In principle, this should be possible. I think that starting from directed networks is definitely a good idea; but it would probably have to be extended by distinguishing different types of graph edges. Even if a given selection may not handle all of the processes known to us, it might help to collect some primary data in the first place.

With a large enough set of well-annotated data, on the other hand, one might start to look into the development of algorithms that could infer derivation relationships between person names; or one could analyze the data and search for the most frequent processes of person name evolution. Many more analyses might be possible. One could see to which degree the processes differ across languages, or how names migrate from one language to another across times, usage types, and maybe even across fashions.
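What such an annotation could look like is easy to sketch in Python. The name pairs below are taken from the examples in this post, but the process labels ("shortening", "diminutive") are my own rough assumption about how one might type the edges, and the helper name is hypothetical; a real collection would need a much more careful typology.

```python
from collections import Counter

# (source, target, process) triples; the process labels are an assumption.
TYPED_EDGES = [
    ("Johannes", "Hans", "shortening"),
    ("Hans", "Hänschen", "diminutive"),
    ("Josef", "Joe", "shortening"),
    ("Bettina", "Tina", "shortening"),
    ("Christina", "Tina", "shortening"),
    ("Tina", "Tinchen", "diminutive"),
]

def sources_of(name, edges=TYPED_EDGES):
    """All names from which a given form can be derived."""
    return [source for source, target, _ in edges if target == name]

# How often each process occurs in the annotated data.
process_counts = Counter(process for _, _, process in TYPED_EDGES)

# A name with more than one source turns the tree into a network.
tina_sources = sources_of("Tina")
```

Even in this tiny sample, Tina has two possible sources, so the unique-ancestor assumption of the simple model no longer holds once the data grow.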


I assume that the result of such a collection would be interesting not only for couples who are about to replicate themselves, but would also be interesting for historical research and research in the field of cultural evolution. Whether such a collection will ever exist, however, seems less likely. The problem is that there are not enough scholars in the world who would be interested in this topic, as one can see from the very small number of studies that have been devoted to the problem up to now (as one of the few exceptions known to me, compare the nice overview of person name classification by Handschuh 2019). I myself would not be able to help in this endeavour, given that I lack the scholarly competence of investigating name evolution. But I would sure like to investigate and inspect the results, if they ever become available.


Handschuh, Corinna (2019) The classification of names. A crosslinguistic study of sex-specific forms, classifiers, and gender marking on personal names. STUF — Language Typology and Universals 72.4: 539-572.

Monday, March 16, 2020

Problems with the phylogeny of coronaviruses

Coronaviruses are much in the news at the moment. Indeed, one particular variant seems to be the major news topic as I write this post. This is the one known as 2019-nCoV or SARS-CoV-2, which is responsible for the human pneumonia called COVID-19.

Obviously, the main issue for the public is infection biology, particularly the apparent ease with which the virus can spread in human populations. Part of the issue here seems to be that human coronaviruses are covered with a lipid membrane, which means that they "can remain infectious on inanimate surfaces [like metal, glass or plastic] at room temperature for up to 9 days" (Kampf et al. 2020), which dramatically increases the probability of each of us encountering one.

There is now a decline in reported cases in China, but there may be a resurgence. The problem is that an infected person may show no symptoms, or only very mild ones, and thus never report themselves. So, there may be millions more infected people running around the country, ready to infect new people when the travel restrictions are lifted, and the unexposed people come in contact with them. Biologically, the only safety is immunization, which occurs when you are exposed to the virus, and that is risky, of course.

From Forni et al. (2017). Click to enlarge.

There will obviously be a lot of political fall-out in coming weeks, with various governments being accused of not doing enough and others of doing too much. The widespread infections in South Korea seem to be the result of a secretive religious organization (responsible for more than 60% of the national infections), to which the government has responded better than most others. On the other hand, in Iran it seems to be the government that has been the major problem, hiding the initial infections because of their potential effect on impending elections.

In Italy, the country seems to have been overwhelmed, and the death rate is very high, while in Germany the infection rate is relatively high but the death rate is currently still low. Indeed, Italy's long-delayed "lock-down" on internal travel contrasts strongly with China's much more rapid response, and this seems to be reflected in vastly different infection rates (Italy currently has 6x the number of infections per million people). More than a half of the cases to date where I live, in Sweden, came initially from northern Italy, with most of the rest from Austria, which are popular downhill-skiing destinations at this time of the year.


However, for our purposes here it is the phylogenetics of coronaviruses that is of professional interest, not infection biology. This has been a research topic for the past couple of decades, with the origin of several novel coronavirus strains in humans during that time (see the timeline above). These include SARS-CoV (causing Severe Acute Respiratory Syndrome) and MERS-CoV (causing Middle East Respiratory Syndrome) — both of these have much higher fatality rates than the current epidemic (10% and 34%, respectively), but lower rates of spread. A selected set of relevant papers is listed below; and I have included a couple of phylogenies as examples.

The issue that I wish to mention here is that there appears to be a disconnection between the so-called phylogenies presented in these papers and the concept of a phylogenetic history. The papers present either an unrooted or a rooted tree. In the first case, this simply represents a set of clusters based on genomic similarity. In the second case, this represents a hierarchical grouping based on genomic similarity.
Obviously, an unrooted tree cannot represent a phylogenetic history, since evolution has a time direction, and this can only be illustrated using a directed (ie. rooted) tree or network.

However, the bigger issue is that these trees cannot represent an actual virus phylogeny. The argument for presenting them seems to be that the clusters / groups are based on genomic similarity, which in turn is caused by the phylogenetic history of the viruses. This is true, but we cannot thereby invert the logic. Phylogeny creates similarity, but mere similarity does not necessarily represent phylogenetic history.

In the case of coronaviruses, the evolutionary history is reported to involve extensive genomic recombination in the formation of novel strains (reviewed by Cui et al. 2019). That is, during an epidemic the phylogeny might be tree-like, but at the origin of the epidemic it is not. This especially occurs because coronaviruses can infect a range of hosts (not just humans), and it is the recombination that occurs while within one host that allows novel strains to appear that can create epidemics in a different host.

This is also prevalent in, for example, influenza viruses (which also have a lipid membrane). This occurred for the world's worst epidemic (c. 500 million affected), the so-called Spanish Flu of 1918-1920, which actually started in the USA. The current most-likely explanation is that both a bird-host and a human-host influenza strain got into a pig, recombined in the cells of that host, and then the new virus strain got back into the human population.

Therefore the full phylogenetic history cannot be tree-like. Indeed, the actual history must be in the form of a recombination network, as discussed elsewhere in this blog. So, the trees, as shown in the papers below, represent the similarity of the coronaviruses but not all of their phylogeny. For the latter, we need a haplotype network representation, as illustrated in this example:

Some small haplotype networks; from Yu et al. (2020)

It would be interesting to construct a recombination network based on the data from one or more of the coronavirus papers, as an example. However, as far as I can see, none of the authors has referred to an online version of their genomic alignment; and so I cannot present such a thing here.


Cui J, Li F, Shi Z-L (2019) Origin and evolution of pathogenic coronaviruses. Nature Reviews Microbiology 17: 181-192.

Chen Y, Liu Q, Guo D (2020) Emerging coronaviruses: genome structure, replication, and pathogenesis. Journal of Medical Virology 92: 418-423.

Eickmann M et al. (2003) Phylogeny of the SARS coronavirus. Science 302: 1504-1505.

Forni D, Cagliani R, Clerici M, Sironi M (2017) Molecular evolution of human coronavirus genomes. Trends in Microbiology 25: 35-48.

Gorbalenya AE, Snijder EJ, Spaan WJ (2004) Severe acute respiratory syndrome coronavirus phylogeny: toward consensus. Journal of Virology 78: 7863-7866.

Kampf G, Todt D, Pfaender S, Steinmann E (2020) Persistence of coronaviruses on inanimate surfaces and their inactivation with biocidal agents. Journal of Hospital Infection 104: 246-251.

Luk HKH, Li X, Fung J, Lau SKP, Woo PCY (2019) Molecular epidemiology, evolution and phylogeny of SARS coronavirus. Infection Genetics and Evolution 71: 21-30.

Woo PC, Lau SK, Huang Y, Yuen KY (2009) Coronavirus diversity, phylogeny and interspecies jumping. Experimental Biology and Medicine 234: 1117-1127.

Yu WB, Tang G-D, Zhang L, Corlett RT (2020) Decoding evolution and transmissions of novel pneumonia coronavirus (SARS-CoV-2) using the whole genomic data. (ResearchGate)

Zhang L, Shen F-M, Chen F, Lin Z (2020) Origin and evolution of the 2019 novel coronavirus. Clinical Infectious Diseases (Epub ahead of print).

An unrooted tree; from Cui et al. (2019).

A rooted tree; from Chen et al. (2020)

Monday, March 9, 2020

A sneak peek into the upcoming SplitsTree 5

For some time now, the official SplitsTree page has been offline. The reason is that a major update is on the way: SplitsTree5. A beta version is already available, so let's take a quick look at it.

During installation you will be asked how much RAM you want to dedicate. Give as much as possible, in case you want to handle large trees with myriads of splits. I chose 16 GB (ie. half of the RAM installed on my PC).

Here's how it looks when you start the program:

The menus known from SplitsTree4 are still there, and the important functions appear to be already implemented. Some are new, and some have been moved:
  • Menu File: there is a new option to "Export workflow", which produces a graphical representation (ie. a flow-chart) of what you did with the imported data, which is shown in the main display panel ("Workflow")
  • New menu Select: collects together the Select options formerly included under Edit.
  • The Trees option is now called Tree
  • Menu Network has all of the classics (distance-based phylogenetic networks, tree-based networks, character-based networks); but missing (so far) are the Pruned Quasi Median network and Spectral Splits options, possibly due to very little demand. An important new function is (or will be) that one can change between "Splits Network view" (ie. the view we are used to from SplitsTree4) and "Haplotype Network view" (as known from the TCS, NETWORK, etc. programs)
  • New menu PCoA, to do principal coordinates analysis (at some point).
  • The menu Analysis appears to be still in development. Currently there are six options: Show Bootstrap tree..., Show Bootstrap network..., Estimate invariable sites..., Compute Phylogenetic Diversity, Compute Delta Score, and (new) Show workflow.
  • The menu Window will be split into Window and Help. Menu Help now also includes a direct link to a SplitsTree Community page (online since September 2017), which is new to me and, judging by the low number of discussion threads, apparently to most of the world.
The new GUI reminds me a bit of RStudio: instead of pop-up windows vanishing once you perform a function, subsequent sheets are kept in the panels. This makes it easier for new users.

When opening a data matrix not directly interpretable, you may activate the "Import" menu, asking you to specify the data type and the file format:

As in SplitsTree4, the importer is currently sensitive to additional code and commentary brackets, and cannot handle, eg., polymorphisms for categorical data (such as "(01)", "{01}"). Accordingly, importer warnings will pop up. Probably, a lot of testing and tweaking is required to make this work as planned. The selection list for file formats is comprehensive, but also ambitious. It may be a good idea to focus on a simple import format (eg. Phylip without its name-length restrictions, or clean NEXUS), and leave the import / export issues to other software packages (such as Mesquite, or R-conversion tools).
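Until the importer copes with such codes, one workaround is to clean the matrix before import. The following is only a minimal sketch, not anything SplitsTree itself provides: the function name is hypothetical, and it assumes that treating a polymorphic cell as missing data ("?") is acceptable for the analysis at hand. It should be applied to the character strings only, not to the taxon names.

```python
import re

# Matches one bracketed state set, such as "(01)" or "{01}".
POLYMORPHISM = re.compile(r"\([^()]*\)|\{[^{}]*\}")


def collapse_polymorphisms(states: str, missing: str = "?") -> str:
    """Replace every bracketed state set by a single missing-data symbol.

    Hypothetical helper: recoding polymorphisms as missing loses information,
    which is an assumption of this sketch, not a recommendation of the program.
    """
    return POLYMORPHISM.sub(missing, states)


row = "01(01)1{01}0"
print(collapse_polymorphisms(row))  # 01?1?0
```

Each bracketed group becomes one cell, so the row keeps the same number of characters as the other rows in the matrix.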

But we can read in Splits-NEXUS files generated by SplitsTree4 without any problems. To sneak a bit more:

A very nice function is that the flags in the analysis pipeline are fully interactive, allowing for quick manipulation / overview of what was used. For example, by clicking on "NeighborNet", we get a new panel for tweaking the NNet options or changing the method used:

When moving above a menu item, a short explanation may pop up. The menus in the modification panel include drop-down boxes and input fields (here, for NNet):

Close-up of the NeighborNet panel.

Another important upgrade is the "Workflow" sheet, which gives you access to data filtering, methods and visualization etc., by just double-clicking on the respective item in the flow-chart (items can be dragged and moved, too):

Graphically, SplitsTree5 is functional as well. View > Format... (Ctrl-Shift-J) will open the remodeled coloring and type window in the method / lower left panel, where you can choose: font, label and (selected) edge(s) colors, node colors and shapes. In addition to circles and squares, we now have the choice between up- and down-triangles, diamonds, and hexagons. The graphical export option is gone (Ctrl-Shift-M; for now) and replaced by a modifiable, objects-containing PDF (similar to the ones produced by Dendroscope), generated simply by printing out to PDF.

The current beta version may not be able to fully replace SplitsTree4 yet (especially since the current manual only contains an 'Acknowledgments' section) but has already enough functionality (some new) to play around and explore the wonderful world of phylogenetic networks.

So, try it out for yourself.

Current issues

Glitches (on my Windows-PC running the latest Java version) that I have encountered include:
  • flickering scroll bars – but, when resizing the window a bit and keeping the left mouse button pressed, the flickering stops
  • I couldn't exit the program after opening more than one window / data set
  • a few menu items may not work yet (e.g. Select > All Labeled Nodes, Ctrl-Shift-L).
Moving edges can also be a bit tricky. You need to first select the edges, whereupon the selected edge bundle will be highlighted by a broad yellow aura, and then move the pointer to one of the nodes, until the node is surrounded by an even broader aura. Then click and keep the mouse button down.

To get rid of node shapes, I had to click several times on "none" (first it changes to circles, which then become smaller until being nearly invisible).

Important note: While I had no problem in opening any of my SplitsTree4-generated and saved files, when saving a file in SplitsTree5, SplitsTree4 gives an import failure error message.

Monday, March 2, 2020

The phylogenetics of the Last Universal Common Ancestor is hard

If we define phylogenetics as the study of sister-group historical relationships, then it stands to reason that the hardest thing to do in biology would be to study the Last Universal Common Ancestor (LUCA), which is the common ancestor of all known organisms. This is because, by definition, it has no knowable sister group.

Study of the LUCA has therefore mostly been seen as a study of ancestor-descendant relationships, being an attempt to trace the ancestry of living things all the way back until there is nothing more to detect.

This latter approach seems to lead to a lot of arguments. There are arguments about what type of character data to use (it seems doubtful that nucleotide sequences are informative that far back in evolution). There are arguments about how many monophyletic groups there might be of akaryotes, and whether we should consider eukaryotes to be monophyletic, given that they have organelles. For a brief introduction to the use of protein domains for phylogenies, as well as the dispute about the three-domains versus two-domains issue, see this Twitter presentation.

On the other hand, trying to study the LUCA phylogenetically raises some interesting questions, because we are trying to produce a phylogeny with a root but without an outgroup. I recently gave a talk on this subject; and I have included a PDF copy of the slides from that talk here.

The talk starts with some personal history, which just happens to lead into a discussion of what I see as the essential points of phylogenetic analysis. I discuss the essential points of characters versus taxa, emphasizing the role of both character and taxon models. The essential point for the LUCA is the need to determine character polarity, as this gives us the time direction, and allows us to find the earliest time.

Conclusion 1: The characters used to study the LUCA probably need to be molecular, but the form of the character analysis needs to be fundamentally different from what molecular biologists commonly employ — we need to analyze character polarity.

Conclusion 2: We need to think about which characters will have relevant phylogenetic information, for the age depth we are looking at.

Conclusion 3: We need to think about the taxon-change model, as well as the character-change model — the history may be very complex at the root.

Conclusion 4: We study contemporary taxa, and it is inappropriate to try putting ancestors into any modern group, unless you have good evidence that the ancestor is the MRCA of that group (ie. the group is monophyletic).

For the study of the phylogeny of the LUCA:
  • The root cannot be added to an unrooted line graph, but instead the root must be a direct product of the data analysis
  • Sequence data are unlikely to be informative, because the required character-change models matter too much at that time depth
  • The evolutionary history may be much more complex than can be represented by a tree, and may be impractical even for any current form of network analysis
  • The LUCA is not part of any extant phylogenetic group.

Monday, February 24, 2020

How should one study language evolution?

This is a joint post by Justin Power, Guido Grimm, and Johann-Mattis List.

Like in biology, we have two basic possibilities for studying how languages evolve:
  • We set up a list of universal comparanda. These should occur in all languages and show a high enough degree of variation that we can use them as indicators of how languages have evolved;
  • We create individual lists of comparanda. These are specific for certain language groups that we want to study.
Universal comparanda

While most studies would probably aim to employ a set of universal comparanda, the practice often requires a compromise solution in which some non-universal characteristics are added. This holds, for example, for the idea of a core genome in biology, which ends up being so small in overlap across all living species that it makes little sense to compute phylogenies based on it, except for closely related species (Dagan and Martin 2006). Another example is the all-inclusive matrices that are used to establish evolutionary relationships of extinct animals characterized by high levels of missing data (eg. Tschopp et al. 2015; Hartman et al. 2019). The same holds for historical linguistics, with the idea of a basic lexicon or basic vocabulary, represented by a list of basic concepts that are supposed to be expressed by simple words in every human language (Swadesh 1955), given that the number of concepts represented by simple words shared across all human languages is extremely small (Hoijer 1956).

Figure 1: All humans have hands and arms but some words for ‘hands’ and ‘arms’ address different things (see our previous post "How languages loose body parts").

Apart from the problem that basic vocabulary concepts occurring in all languages may be extremely limited, test items need to fulfill additional characteristics that may not be easy to find, in order to be useful for phylogenetic studies. They should, for example, be rather resistant to processes of lateral transfer or borrowing in linguistics. They should preferably be subject to neutral evolution, since selective pressure may lead to parallel but phylogenetically independent processes (in biology known as convergent evolution) that are difficult to distinguish and can increase the amount of noise in the data (homoplasy).

Selective pressure, as we might find, for example, in a specific association between certain concepts and certain sounds across a large phylogenetically independent sample of human languages, is rarely considered to be a big problem in historical linguistics studies dealing with the evolution of spoken languages (see Blasi et al. 2016 for an exception). In sign language evolution, however, the problem may be more acute because of a similar iconic motivation of many lexical signs in phylogenetically independent sign languages (Guerra Currie et al. 2002), as well as the representation of concepts such as body parts and pronouns using indexical signs with similar forms. This latter characteristic of all known sign languages has led to the design of a basic vocabulary list that differs from those traditionally used in the historical linguistics of spoken languages (Woodward 1993); and we know of only one proposal attempting to address the problem of iconicity in sign languages for phylogenetic research (Parkhurst and Parkhurst 2003).

Figure 2: Basic processes in the evolution of languages, spoken or signed  (see our previous post How languages loose body parts).

All in all, it seems that there may be no complete solution for a list of lexical comparanda for all human languages, including sign languages, given the complexities of lexical semantics, the high variability in expression among the languages of the world (see Hymes 1960 for a detailed discussion on this problem), and the problems related to selective pressures highlighted above. Scholars have proposed alternative features for comparing languages, such as grammatical properties (Longobardi et al. 2015) or other "structural" features (Szeto et al. 2018), but these are either even more problematic for historical language comparison—given that it is never clear if these alternative features have evolved independently or due to common inheritance—or they are again based on a targeted selection for a certain group of languages in a certain region.

Targeted comparanda

If there is no universal list of features that can be used to study how languages have evolved, we have to resort to the second possibility mentioned above, by creating targeted lists of comparanda for the specific language groups whose evolution we want to study. When doing so, it is best to aim at a high degree of universality in the list of comparanda, even if one knows that complete universality cannot be achieved. This practice helps to compare a given study with alternative studies; it may also help colleagues to recycle the data, at least in part, or to merge datasets for combined analyses, if similar comparanda have been published for other languages.

But there are cases where this is not possible, especially when conducting studies where no previous data have been published, and rigorous methods for historical language comparison have yet to be established. Sign languages can, again, be seen as a good example for this case. So far, few phylogenetic studies have addressed sign language evolution, and none have supplied the data used in putting forward an evolutionary hypothesis. Furthermore, because the field lacks unified techniques for the transcription of signs, it is extremely difficult to collect lexical data for a large number of sign languages from comparable glossaries, wordlists, and dictionaries, the three primary sources, apart from fieldwork, that spoken language linguists would use in order to start a new data collection. We are aware of one comparative database with basic vocabulary for sign languages that is currently being built (Yu et al. 2018), and that may represent lexical items in a way that can be compared efficiently, but these data have not yet been made available to other researchers.

Sign languages

When Justin Power approached Mattis about three years ago, asking if he wanted to collaborate on a study relating to sign language evolution, we quickly realized that it would be infeasible to gather enough lexical data for a first study. Tiago Tresoldi, a post-doc in our group, suggested the idea of starting with sign language manual alphabets instead. From the start, it was clear that these manual alphabets might have certain disadvantages — because they are used to represent written letters of a different language, they may constitute a set of features evolving independently from the sign language itself.

Figure 3: Processes shaping manual alphabets. The evolution of signed concepts may be affected by the same, leading to congruent patterns, or different processes, leading to incongruent differentiation patterns (see our previous post: Stacking networks based on sign language manual alphabets).

But on the other hand, the data had many advantages. First, a sufficient number of examples for various European sign languages were available in online databases that could be transcribed in a uniform way. Second, the comparison itself was facilitated, since in most cases there was no ambiguity about which “concepts” to compare, in contrast to what one would encounter in a comparison of lexical entries. For example, an “a” is an “a” in all languages. Third, it turned out that for quite a few languages, historical manual alphabets could be added to the sample. This point was very important for our study. Given that scholars still have limited knowledge regarding the details of sign change in sign language evolution, it is of great importance to compare sources of the same variety, or those assumed to be the same, across time—just as spoken language linguists compared Latin with Spanish and Italian in order to study how sounds change over time. And finally, manual alphabets in fact constitute an integrated part of many sign languages that may, for example, contribute to the forms of lexical signs, making the idea more plausible that an understanding of the evolution of manual alphabets could be informative about the evolution of sign languages as a whole.

Figure 4: Early evolution of handshapes used to sign ‘g’ (see our previous post: Character cliques and networks – mapping haplotypes of manual alphabets).

Guido later joined our team, providing the expertise to analyze the data with network methods that do not assume tree-like evolution a priori. We therefore thought that we had done a rather good job when our pilot study on the evolution of sign language manual alphabets, titled Evolutionary Dynamics in the Dispersal of Sign Languages, finally appeared last month (Power et al. 2020). We identified six basic lineages from which the manual alphabets of the 40 contemporary sign languages developed. The term "lineage" was deliberately chosen in this context, since it was unclear whether the evolution of the manual alphabets should be seen as representative of the evolution of the sign languages as a whole. We also avoided the term "family", because we were wary of making potentially unwarranted assumptions about sign language evolution based on theories in historical linguistics.

Figure 5: The all-inclusive Neighbor-net (taken from Power et al. 2020).

While the study was positively received by the popular media, and even made it onto the title page of the Süddeutsche Zeitung (one of the largest daily newspapers in Germany), there were also misrepresentations of our results in some media channels. The Daily Mail (in the UK), in particular, invented the claim that all human sign languages have evolved from five European lineages. Of course, our study never said this, nor could it have, since only European sign languages were included in our sample. (We included three manual alphabets representing Arabic-based scripts from Afghan, Jordanian, and Pakistan Sign Languages, where there was some indication that these may have been informed by European sources.)

Study of phylogenetics

While we share our colleagues’ distaste for the Daily Mail’s likely purposeful misrepresentation (in the end, unfortunately, it may have achieved its purpose as click bait), some colleagues went a bit further. One critique that came up in reaction to the Daily Mail piece was that our title opens the door to misinterpretation, because we had only investigated manual alphabets and, hence, cannot say anything about the "evolutionary dynamics of sign languages".

While the title does not mention manual alphabets, it should be clear that any study on evolution is based on a certain amount of reduction. Where and how this reduction takes place is usually explained in the studies. Many debates in historical linguistics of spoken languages have centered around the question of what data are representative enough to study what scholars perceive as the "overall evolution" of languages; and scholars are far from having reached a communis opinio in this regard. At this point, we simply cannot answer the question of whether manual alphabets provide clues about sign language evolution that contrast with the languages’ "general" evolution, as expressed, for example, in selecting and comparing 100 or 200 words of basic vocabulary. We suspect that this may, indeed, be the case for some sign languages, but we simply lack the comparative data to make any claims in this respect.

Figure 6: Evolution doesn’t mean every feature has to follow the same path: a synopsis of molecular phylogenies inferred for oaks, Quercus, and their relatives, Fagaceae (upcoming post on Res.I.P.). While nuclear differentiation matches phenotypic evolution and the fossil record (likely monophyla in bold font), the evolution of the plastome is partly decoupled (gray shaded: paraphyletic clades). Likewise, we can expect that different parts of languages, such as manual alphabets vs. core “lingome” of sign languages, may indicate different relationships.

The philosophical question, however, goes much deeper, to the "nature" of language: What constitutes a language? What do all languages have in common? How do languages change? What are the best ways to study how languages evolve?

One approach to answering these questions is to compare collectible features of languages ("traits" in biology), and to study how they evolve. As the field develops, we may find that the evolution of a manual alphabet does not completely coincide with the evolution of the lexicon or grammar of a sign language. But would it follow from such a result that we have learned nothing about the evolution of sign languages?

There is a helpful analogy in biology: we know that different parts of the genetic code can follow different evolutionary trajectories; we also know that phenotype-based phylogenetic trees sometimes conflict with those based on genotypes. But this understanding does not stop biologists from putting forward evolutionary hypotheses for extinct organisms, where only one set of data is available (phenotypes; Tree of Life). Furthermore, such conflicting results may lead to a more comprehensive understanding of how a species has evolved.

Figure 7: A likely case of convergence: the sign for “г” in Russian and Greek Sign Language, visually depicting the letter (see our previous post Untangling vertical and horizontal processes in the evolution of handshapes). Complementing studies of signed concepts may reveal less obvious cases of convergence (or borrowing).

Because we felt the need to further clarify the intentions of our study, and to answer some of the criticism raised about the study on Twitter, we decided to prepare a short series of blog posts devoted to the general question of "How should one study language evolution" (or more generally: "How should one study evolution?"). We hope to take some of the heat out of the discussion that evolved on Twitter, by inviting those who raised critiques about our study to answer our posts in the form of comments here, or in their own blog posts.

The current blog post can thus be understood as an opening for more thoughts and, hopefully, more fruitful discussions around the question of how language evolution should be studied.

In that context, feel free to post any questions and critiques you may have about our study below, and we will aim to pick those up in future posts.


Blasi, Damián E. and Wichmann, Søren and Hammarström, Harald and Stadler, Peter and Christiansen, Morten H. (2016) Sound–meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences of the United States of America 113.39: 10818-10823.

Dagan, Tal and Martin, William (2006) The tree of one percent. Genome Biology 7.118: 1-7.

Guerra Currie, Anne-Marie P. and Meier, Richard P. and Walters, Keith (2002) A cross-linguistic examination of the lexicons of four signed languages. In R. P. Meier, K. Cormier, & D. Quinto-Pozos (Eds.), Modality and Structure in Signed and Spoken Languages (pp.224-236). Cambridge University Press.

Hoijer, Harry (1956) Lexicostatistics: a critique. Language 32.1: 49-60.

Hymes, D. H. (1960) Lexicostatistics so far. Current Anthropology 1.1: 3-44.

Longobardi, Giuseppe and Ghirotto, Silva and Guardiano, Cristina and Tassi, Francesca and Benazzo, Andrea and Ceolin, Andrea and Barbujani, Guido (2015) Across language families: Genome diversity mirrors linguistic variation within Europe. American Journal of Physical Anthropology 157.4: 630-640.

Parkhurst, Stephen and Parkhurst, Dianne (2003) Lexical comparisons of signed languages and the effects of iconicity. Working Papers of the Summer Institute of Linguistics, University of North Dakota Session, vol. 47.

Power, Justin M. and Grimm, Guido and List, Johann-Mattis (2020) Evolutionary dynamics in the dispersal of sign languages. Royal Society Open Science 7.1: 1-30. DOI: 10.1098/rsos.191100

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

Szeto, Pui Yiu and Ansaldo, Umberto and Matthews, Steven (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Woodward, James (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Monday, February 17, 2020

Large morphomatrices – trivial signal

In my last post about fossils, Farris and Felsenstein Zones, I gave an example of a trivial (signal-wise perfect) binary phylogenetic matrix, which will give us the true tree no matter which optimality criterion we use. In this post, we will look at a real-world example, a huge matrix of birds and other theropods.
S. Hartman, M. Mortimer, W. R. Wahl, D. R. Lomax, J. Lippincott, D. M. Lovelace
A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.
What intrigued me about this particular paper (I have no idea about dinosaurs, but the documentation, pictures and data, and presentation seem impeccable) was the following sentence:
The analysis resulted in >99999 most parsimonious trees with a length of 12,123 steps. The recovered trees had a consistency index of 0.073, and a retention index of 0.589.
What can you possibly do with strict consensus trees (Losing information in phylogenetic consensus) based on an unknown number of MPTs that have a CI converging to 0 (but an RI of 0.6; The curious case[s] of tree-like matrices with no synapomorphies)? And isn't this a case for some networks-based exploratory data analysis?
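To get a feeling for what these two indices say, here is a minimal sketch using the standard ensemble definitions: CI = M / S and RI = (G − S) / (G − M), where S is the tree length, M the minimum and G the maximum conceivable number of steps for the matrix. The back-calculated M and G are only approximations, because the reported CI and RI are rounded; they are not values taken from the paper.

```python
def consistency_index(min_steps: float, tree_length: float) -> float:
    """Ensemble CI: minimum conceivable steps divided by observed tree length."""
    return min_steps / tree_length


def retention_index(min_steps: float, max_steps: float, tree_length: float) -> float:
    """Ensemble RI: (G - S) / (G - M)."""
    return (max_steps - tree_length) / (max_steps - min_steps)


# Values quoted from the paper.
tree_length = 12_123
reported_ci = 0.073
reported_ri = 0.589

# Back-calculated, hence approximate (the published indices are rounded).
implied_min_steps = reported_ci * tree_length
implied_max_steps = (tree_length - reported_ri * implied_min_steps) / (1 - reported_ri)

print(round(implied_min_steps))   # 885
print(round(1 / reported_ci, 1))  # 13.7
```

In other words, the most parsimonious trees need roughly 14 times as many changes as a perfectly tree-like matrix with the same characters would require.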

The complete matrix has 501 taxa and 700 characters (the largest plant morphological matrices have hardly more than 100 characters) but also a gappyness of about 73%. In this case, 255,969 of the 350,700 cells in the matrix are ambiguous or undefined (missing). The matrix is a (rich) Swiss cheese with very big holes. The high number of MPTs is hence not surprising, and neither is the low CI.

Why run elaborate tree inferences on such a Swiss-cheese matrix? One answer is that (some) vertebrate palaeophylogeneticists are convinced that matrices with few taxa and many characters can lead to wrong clades (clades that are not monophyletic), and that each added taxon, no matter how many characters can be scored, will lead to a better tree, by eliminating (parsimony) branching artifacts (see Q&A to the paper). At least 56 of the 501 taxa have 5% or fewer defined characters; still, with 700 characters, 5% equals up to 35 defined traits, which is more than we can recruit for most plant fossils. The median missing data proportion is 74%; more than half of the taxa are scored for less than 26% (< 182 out of 700) of the characters. Can such taxa really save the all-inclusive tree from branching artefacts, or is the high number of MPTs an indication of signal conflicts and data-gap issues?
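The numbers quoted above are easy to check. This is just a quick arithmetic sketch, taking the matrix dimensions (501 taxa, 700 characters) and the count of missing cells (255,969) at face value; the variable names are mine.

```python
n_taxa, n_chars = 501, 700
missing_cells = 255_969

total_cells = n_taxa * n_chars           # every taxon scored for every character
gappyness = missing_cells / total_cells  # proportion of ambiguous / undefined cells

# Character counts behind the percentages in the text (integer arithmetic).
chars_at_5_percent = n_chars * 5 // 100
chars_at_26_percent = n_chars * 26 // 100

print(total_cells)          # 350700
print(chars_at_5_percent)   # 35
print(chars_at_26_percent)  # 182
print(f"{gappyness:.1%}")
```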

For this post, we will just look at the tip of the iceberg. What is the signal from the 700 characters to start with?

The basic signal

Here's the heat map for the 19 taxa that have a gappyness of less than 15% (ie. at least 595 of 700 possible characters are defined). The taxon order is mostly the one from the original matrix, sorted by phylogenetic groups; for more orientation, I added next-inclusive superclass "Clades" from Wikipedia (so I apologize for any errors).

In my last post, I showed that evolutionary lineages (and monophyly) can be directly deduced from such a heat map following the simple logic: two taxa sharing a (direct) common origin are usually more similar to each other than to a third, fourth etc. taxon not part of the same lineage. Exceptions include fossils close to the last common ancestors lacking advanced traits.
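The logic can be tried out on a toy example. The sketch below is not the distance measure used for the heat map (the post does not specify it); it assumes a simple mismatch proportion over the characters that are defined in both taxa, and the four-taxon matrix is entirely made up.

```python
from itertools import combinations


def pairwise_distance(a: str, b: str, missing: str = "?-") -> float:
    """Proportion of differing states among characters defined in both taxa."""
    comparable = [(x, y) for x, y in zip(a, b)
                  if x not in missing and y not in missing]
    if not comparable:
        return float("nan")  # nothing to compare
    return sum(x != y for x, y in comparable) / len(comparable)


# Hypothetical matrix: A+B and C+D are meant to be two lineages.
matrix = {
    "A": "0000011?",
    "B": "0000111?",
    "C": "11110000",
    "D": "1111?001",
}

distances = {
    (s, t): pairwise_distance(matrix[s], matrix[t])
    for s, t in combinations(matrix, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 2))
```

Members of the same made-up lineage (A and B, C and D) end up with small distances, whereas pairs from different lineages are far apart; missing cells simply shrink the number of characters that can be compared.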

The outgroup taxa as used (in this taxon sample: Allosaurus to Tyrannosaurus) are most similar to each other but not monophyletic. One (Allosaurus) represents the sister lineage of, and the other an early split within, the lineage that led to the birds (Coelurosauria: Tyrannoraptora). The extinct (monophyletic) families (Tyrannosauridae, Ornithomimidae, Dromaeosauridae) are, however, well visible, being defined by low intra-family and higher inter-family pairwise distances. The same is true for the direct relatives (Clade Ornithurae) of modern birds (class Aves).

Very typical for such datasets is the increasing distance between the (primitive?) outgroups and the most derived, modern-day taxa (living birds: Struthio – ostrich, Anas – duck, Meleagris – turkey). Closest relatives in the taxon set, phylogenetically and time-wise, are (much) more similar than distant ones. Allosaurus may be most similar to the tyrannosaurs, not because of common ancestry but because both are scored as being primitive with respect to the group of interest.

The only tree

This situation becomes very obvious from the only possible (single-optimal) tree that can be inferred from this matrix, when visualized as a phylogram (Stop using cladograms!).

The ML, MP and LS/NJ trees, overlaid and scaled to equal root (first split within Tyrannoraptora) to tip (split between Anas and Meleagris) distance (phylogenetic distance, via the tree). Pink: the LS clade conflicting with the ML and MP trees, and with Wikipedia's tree(s).

No matter which optimisation criterion is used (here Least-Squares via Neighbor-joining, Maximum Parsimony, Maximum Likelihood), the result is the same. The only exception is that the NJ/LS tree places Archaeopteryx as sister to Dromaeosauridae; and the relative branch lengths of roots vs. tips also differ.

Because our matrix has favorable properties (few taxa, many defined characters), it's straightforward to establish branch support. This is a bit frowned upon in palaeontological circles, but having dealt with morphological evolution in cases where we have molecular data, I want to know how robust my clades are, and what the alternatives may be, before I conclude that they reflect monophyly. Bootstrapping coupled with consensus networks is a quick and simple way to test robustness and investigate ambiguous support (Connecting tree and network edges).

The BS support consensus networks for NJ/LS and ML have only a single reticulation each.

Rooted support consensus networks based on the NJ/LS (10,000 pseudoreplicates, PAUP*) and ML bootstrap (100 pseudoreplicates, the number of necessary replicates determined by the bootstop criterion implemented in RAxML) samples. Only splits that occurred in at least 15% of the BS pseudoreplicates are shown.
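The two bookkeeping steps behind such a support network can be sketched as follows: drawing a bootstrap pseudoreplicate by resampling characters with replacement, and tallying how often each split occurs across the replicate trees so that only splits above a cut-off (here 15%) are kept. This is only an illustration with hypothetical helper names; the tree inference for each pseudoreplicate (done with PAUP* and RAxML in the post) is not shown, and the split counts are invented.

```python
import random
from collections import Counter


def bootstrap_columns(matrix: dict[str, str], rng: random.Random) -> dict[str, str]:
    """One pseudoreplicate: sample character columns with replacement."""
    n_chars = len(next(iter(matrix.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {name: "".join(row[i] for i in cols) for name, row in matrix.items()}


def split_support(replicate_splits, threshold=0.15):
    """Frequency of each split across replicate trees, keeping those >= threshold."""
    counts = Counter(s for splits in replicate_splits for s in set(splits))
    n = len(replicate_splits)
    return {s: c / n for s, c in counts.items() if c / n >= threshold}


# Resampling a tiny invented matrix (4 characters).
toy = {"T1": "0123", "T2": "abcd"}
replicate = bootstrap_columns(toy, random.Random(0))
print(replicate)

# Invented tallies: each replicate tree is reduced to the splits it contains,
# with a split written as the set of taxa on one side of the edge.
X = frozenset({"A", "B"})
Y = frozenset({"B", "C"})
Z = frozenset({"A", "C"})
replicates = [[X]] * 70 + [[Y]] * 25 + [[Z]] * 5
support = split_support(replicates, threshold=0.15)
print(support)
```

In this toy tally, X and Y are kept and would form a box in the consensus network if they conflict, while Z falls below the cut-off and is not drawn.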

The MP BS support consensus network, however, has many more reticulations.

Rooted MP-BS support consensus network (10,000 BS pseudoreplicates, PAUP*). Green — edge bundles corresponding to clades in the all-optimal tree(s); orange — less supported conflicting alternatives; red – higher supported conflicting alternatives; pink – wrong clade in NJ/LS tree.

We can make two generally relevant observations here:
  1. The wrong Archaeopteryx-Dromaeosauridae clade (pink edge/branch) masks split NJ-BS support: 68 for the wrong clade, 31 for the right one. While resampling under ML appears to be unaffected by this conflict, MP is not.
  2. The NJ and ML support networks are very tree-like: all clades in the inferred tree have high to unambiguous support, and the two networks are near-congruent. The MP network, by contrast, is much more boxy. In some cases the split in agreement with the all-optimal tree has lower BS support than an alternative (here usually in conflict with the gold tree).
Similar observations can be made with other data sets: although NJ/LS and ML optimisation are fundamentally different (distance- vs. character-based, equal change vs. varying probability of change), they show more agreement with each other when it comes to supporting a topology (or topological alternatives) than MP (character-based like ML, but all changes are treated as equal like NJ/LS). MP is a very conservative approach, highly dependent on possibly a few discerning characters. If they are missing from the BS pseudoreplicate, the backbone tree collapses or changes, and BS values may decrease rapidly. This is so even for a very data-dense matrix like the one used here (few taxa, many characters, low gappyness).
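The "boxes" in a support consensus network arise whenever two retained splits are incompatible, ie. cannot occur in the same tree. Two splits A|B and C|D of the same taxon set are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. A minimal sketch of that check, using a strongly simplified taxon set (the five labels stand in for whole groups and are for illustration only):

```python
def compatible(split1, split2, taxa) -> bool:
    """True if the two splits can both be edges of one tree.

    Each split is given as the set of taxa on one side; the other side
    is the complement within `taxa`.
    """
    a, c = frozenset(split1), frozenset(split2)
    b, d = taxa - a, taxa - c
    return any(not (x & y) for x in (a, b) for y in (c, d))


taxa = frozenset({"Allosaurus", "Tyrannosaurus", "Dromaeosauridae",
                  "Archaeopteryx", "Anas"})

# The two competing placements of Archaeopteryx discussed above.
with_dromaeosaurs = {"Archaeopteryx", "Dromaeosauridae"}
with_birds = {"Archaeopteryx", "Anas"}
# A more inclusive split that contains both placements.
paraves_like = {"Archaeopteryx", "Dromaeosauridae", "Anas"}

print(compatible(with_dromaeosaurs, with_birds, taxa))    # conflicting pair
print(compatible(with_dromaeosaurs, paraves_like, taxa))  # nested pair
```

The first pair is incompatible, so showing both splits requires a set of parallel edges (a box); the second pair is nested and can be drawn as two ordinary tree edges.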

On the positive side, we can expect that MP will produce fewer false positives. On the negative side, it is also more dependent on character coverage, and will produce many more false negatives. Any fossil lacking the crucial characters (or showing too few of them) may still be resolved (placed and supported) under NJ/LS and ML, but not using MP. When inferring trees, these fossils will quickly increase the number of MPTs and decrease branch support for the part of the tree they interact with. Personally, given how hard it can be to place a fossil per se with the data at hand, I have always preferred a method that can give some result and point towards possible alternatives (even at the risk of including erroneous ones), rather than no result at all.

The simplest of networks

Naturally, we can use the distance matrix directly to infer a Neighbor-net, and explore the basic differentiation signal beyond trees but also with regard to the all-optimal tree.

Neighbor-net based on the pairwise distance matrix. Coloration highlights edges found (or not) in the optimised trees.

The Neighbor-net recovers the clades from the all-optimal tree (green, purple the NJ/LS-unique branch), but shows additional edges (orange). The principal signal in the data has, for instance, problems with placing Archaeopteryx, because it is (signal-wise) intermediate between the Avebrevicaudata, the lineage including modern birds, and the Dromaeosauridae, their sister lineage (note that the vertebrate fossil record is considered to be free of ancestors and precursors; all fossils represent extinct sister lineages – evolutionary dead-ends). Skeleton IGM 100042 (an Oviraptoridae), placed as sister to both in the all-optimal tree, also lacks obvious affinities: this is a taxon where the tree inference makes a decision that is not based on a trivial signal encoded in the matrix.

The central boxy part of the Neighbor-net correlates with the 2/3-dimensional part of the parsimony BS consensus network: to resolve these relationships, we need a large set of characters (under MP). On the other hand, recognizing the Ornithurae, members of an extinct family, or a relative of IGM 100042, should be straightforward even with a limited number of defined characters. Based on the Neighbor-net, which is inferred in a blink no matter how large the matrix, we can also decide which taxa interfere with tree inference and which ones facilitate it. The more tree-like the Neighbor-net graph becomes, the easier it is for a tree inference to be made.
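Tree-likeness can also be put into a number directly from the distance matrix. The sketch below uses the quartet delta score, a technique swapped in here for illustration and not something used in the post: for every four taxa, the three possible pairings of distances are summed; on a perfect tree the two largest sums are equal (score 0), while a maximally boxy quartet scores 1. The distances and the helper names are invented.

```python
from itertools import combinations


def to_nested(pairs):
    """Turn {(p, q): distance} into a symmetric nested lookup d[p][q]."""
    d = {}
    for (p, q), value in pairs.items():
        d.setdefault(p, {})[q] = value
        d.setdefault(q, {})[p] = value
    return d


def quartet_delta(d, w, x, y, z) -> float:
    """Delta score of one quartet: 0 = perfectly tree-like, 1 = maximally boxy."""
    smallest, middle, largest = sorted(
        [d[w][x] + d[y][z], d[w][y] + d[x][z], d[w][z] + d[x][y]]
    )
    if largest == smallest:
        return 0.0  # star-like quartet, no conflict
    return (largest - middle) / (largest - smallest)


def mean_delta(d) -> float:
    """Average delta score over all quartets of taxa."""
    scores = [quartet_delta(d, *q) for q in combinations(sorted(d), 4)]
    return sum(scores) / len(scores)


# Distances that fit the tree ((A,B),(C,D)) exactly.
tree_like = to_nested({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
                       ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3})
# Distances in which A groups equally well with B and with C.
boxy = to_nested({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 2,
                  ("B", "D"): 2, ("A", "D"): 3, ("B", "C"): 3})

print(mean_delta(tree_like), mean_delta(boxy))
```

Comparing such a score before and after removing a suspect taxon is one possible way to judge whether that taxon interferes with or facilitates tree inference.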

Placing fossils, quickly and easily

Using this backbone graph, it is easy to assess in which phylogenetic neighborhood a newly coded fossil falls, eg. the fossil newly described in Hartman et al. and scored for 267 unambiguously defined traits, Hesperornithoides.

Neighbor-net including Hesperornithoides.

Hesperornithoides is obviously a member of the Eumaniraptora (= Paraves), morphologically somewhat intermediate between the Avialae, the "flying dinosaurs", and Dromaeosauridae, but doesn't seem to be part of either of these sister lineages. The graph lacks a prominent neighborhood for it; the Archaeopteryx-Bambiraptor neighborhood may reflect local long-edge attraction (note the long terminal edges) or convergent evolution in both taxa and, possibly, also in the Hesperornithoides lineage. Just based on this simple and quick-to-infer network, Hartman et al.'s title "A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight" appears to be correct (in future posts, we may come back to this morphological supermatrix to see what else networks could have quickly shown).

One should be willing to leave the phylogenetic beaten track – ie. relying on strict consensus parsimony trees as the sole basis for phylogenetic hypotheses. The Neighbor-net is a valuable tool for quick pre- and post-analysis because it can:
  • visualize how coherent the clades in our trees are, 
  • indicate how easy it will be for the tree inference (especially MP) to find and support clades, 
  • help to differentiate ambiguous from important taxa, and finally, 
  • help to assess whether a new fossil really requires an in-depth re-analysis of the full matrix (and dealing with >99,999 MPTs) instead of using a more focussed taxon (and character) set.