Monday, March 30, 2020

Trees and viruses: the SARS group

[This is a joint post by David and Guido]

There seems to be a lot of current confusion about the Covid-19 disease, and the SARS-CoV-2 virus that causes it. To help clear up some of the misunderstandings, this post might help:

   There seems to be a lot of public misunderstanding about the coronavirus

From the perspective of phylogenetics, in his last post (Problems with the phylogeny of coronaviruses), David pointed out that phylogenetic trees may not be a good choice for visualizing the phylogeny of coronaviruses. For this post, we collected from GenBank all complete genomes of one group of Betacoronavirus, the SARS-inflicting viruses, which includes the new SARS-CoV-2, the Covid-19 disease-causing virus, to look at this in more detail.

Reticulation analysis

Based on a preliminary analysis (included in a figshare submission), we ended up combining the individual genome accessions into group-based consensus sequences, in which intra-lineage variation is expressed as polymorphism: IUPAC ambiguity codes. The graphs in Figs. 1 and 2 are based on a data set including 291 accessions that are not literal duplicates of others, out of a total harvest of 395 genomes. For example, Groups 1 and 7 (as labeled in the figures below) include many accessions that are nearly identical or differ only in stochastic mutation patterns.

When processing the harvested data, one can notice mutational patterns in the best-sampled groups (eg. the original SARS-viruses, the new CoV-2 virus), which may be the result of mediocre sequencing and editing artifacts.

Consensus sequences have several advantages over choosing placeholders:
  • We can reduce the number of OTUs without losing too much information about intra-group diversity.
    • This facilitates visual inspection of the underlying alignments.
    • Under maximum likelihood (ML) inference, ambiguity codes can make a difference. They are not treated as missing data, as polymorphisms can be informative to some degree even in extreme cases (eg. Potts et al., Systematic Biology, 2014). Note that the RAxML-NG program now includes special models for (phased) DNA and RNA data that include a lot of ambiguities.
  • Stochastic mutations found in only one of many accessions within a group can be completely eliminated using modal consensus sequences, instead of strict consensus sequences. For the data used here, it makes little difference (results provided in the linked figshare submission).
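
To make the difference between the two kinds of consensus concrete, here is a minimal Python sketch. It is not the script we used for the figshare data; the function names are made up for illustration, and it assumes upper-case, already aligned sequences containing only A, C, G, T and gaps (gaps are simply ignored, which is a simplification).

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def strict_consensus(seqs):
    """Encode every base seen in an alignment column as one IUPAC code (hypothetical helper)."""
    out = []
    for column in zip(*seqs):
        bases = frozenset(b for b in column if b in "ACGT")
        out.append(IUPAC.get(bases, "N"))  # columns without any base fall back to N
    return "".join(out)


def modal_consensus(seqs):
    """Keep only the most frequent base per column; ties become ambiguity codes (hypothetical helper)."""
    out = []
    for column in zip(*seqs):
        counts = Counter(b for b in column if b in "ACGT")
        if not counts:
            out.append("N")
            continue
        top = max(counts.values())
        bases = frozenset(b for b, n in counts.items() if n == top)
        out.append(IUPAC[bases])
    return "".join(out)


# Toy group: the fourth accession carries a single, possibly stochastic, C>T mutation.
group = ["ACGT", "ACGT", "ACGT", "ATGT"]
print(strict_consensus(group))  # AYGT
print(modal_consensus(group))   # ACGT
```

In the toy group, the strict consensus keeps the singleton mutation as Y (= C/T), whereas the modal consensus drops it.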
Fig. 1 shows the mid-point rooted ML tree based on the group-consensus matrix. It largely agrees with the non-consensed tree (included in the figshare submission) but takes much less time to infer and run bootstrap analysis.

Fig. 1. Mid-point rooted ML tree using group-consensus sequences (labeled by numbers), tips with GenBank accession numbers represent non-consensed data.

Note that we find poor split support for some topological features, which can have two reasons:
  • Lack of discriminatory signal.
  • Conflicting signal, ie. here: potential recombination.
Except for a few aspects, the tree seems to be clear. And may be fooling us.

We can see this by looking at a network instead of a tree. The Neighbor-net based on the group-consensus data (Fig. 2) doesn't appear to be overly tree-like, especially when compared to the ML tree and its branch support (Fig. 1).

Fig. 2. Neighbor-net based on the group-consensus data, colored arches, arrows and field refer to mutational patterns and recombination cross-checked for by visual inspection of the alignment.

In fact, the highly supported Group 4 (in the tree) includes some genomes showing evidence for reticulation outside this apparent clade. The accessions labeled as 3c and 3d are recombinants as well, and hence are placed between the two main clusters (1 + 8 vs. 4–7) in the tree. The 1 + 2 + 3a/b group collects what appears to be a gradual evolutionary trend: all "non-1" sequences simply differ more and more from Group 1, but not necessarily in the direction of Groups 4–9. The relatively low support for the 8+9 group is also the result of recombination and conflicting signals in the underlying data (Fig. 2).

The reasons for these reticulate signals probably include homoplasy (especially within main groups), but also obvious recombination, as shown in Fig. 3.

Fig. 3. Potentially alien DNA within difficult to align regions. Note that CoV-1 Type "A" and "B" do not sort along the ML tree, and only to some degree agree with neighborhoods in the Neighbor-net. Note that this shown portion was not included in our analysis. Congruent patterns can nonetheless be found in the sequentially more conserved regions we used.

Groups 2a and 3b have very similar sequences elsewhere (2a has been represented by a single consensus sequence in our analysis) but can show either type in the regions shown in Fig. 3. The partly incongruent distribution of Types A and B in Fig. 3 can only be explained by secondary recombination between CoV-1 lineages.

Going back to the alignment, we can see that the uniqueness of accession 3d lies in patterns otherwise seen only in the distantly related Group 5, combined with regions where it mirrors Group 9 (see Fig. 3), while the rest of the sequence shows the basic Group 1–3 type, which is difficult to distinguish from Group 8 (leading to its position in the ML tree, Fig. 1, and NNet, Fig. 2). Groups 1 and 8 have mutations not seen in any other accession, separating them clearly from each other (lack of a neighborhood, Fig. 2), and also separating Group 8 more from Groups 2 and 3 than Group 1 (clade in Fig. 1 but fan in Fig. 2). Group 8 may be a recombinant of Group 1(–3) and Group 4, as suggested in Fig. 2; and bootstrap (BS) < 100 in the ML tree in Fig. 1. One accession of Group 4 includes sequence patterns in one region diagnostic for the CoV-2 lineage and its sister lineage (Group 6).

The most striking recombination feature is, however, not captured even by the Neighbor-net. The orange field in Fig. 2 refers to the last sixth of the sequences (~ 5,000 bp) which are near-identical in accessions 3c, 3d, one 9a and all members of Group 4 except 4a, showing a sequence visibly different from all others in the alignment.

Moreover, our analysis and graphs did not consider sequence patterns in difficult or impossible to align regions, which can show very complex patterns, as illustrated in Fig. 4.

Fig. 4. Bird's eye view of the transition zone between alignable and sequentially (extreme) diverse genome portions that were excluded from analyses.

The non-alignable region in Fig. 4 shows about a dozen substantially different sequence types, most of which may actually be obtained by recombination with coronaviruses outside the SARS group. In some cases, strikingly similar sequence types are shared by members of different groups shown in the tree; while members of the same groups, even highly supported sister taxa, can have sequence types that are near 100% different.

Knowing the enemy: What is new about SARS-CoV-2?

We can see that some groups (labeled as 2a to 3c, 8, 9) have sequences close to Group 1, while others (labeled 4 to 7) are increasingly different (Fig. 2). The new strain, discovered in Wuhan and now spreading across the world (substantially affecting the way we live), has one potential direct sister strain: Group 6 (accessions MG772933/34) labeled in GenBank as "Bat SARS-like coronavirus", and isolated from bats (Rhinolophus sinicus). The reference for MG772933 is Hu et al. (2018) Emerging Microbes and Infection 7: e1006698, and was submitted in January 2018 by a researcher at the Institute of Military Medicine in Nanjing. We compare these sequences in Fig. 5.

Fig. 5. Close-up on a sequentially conserved region. All mutations that differ between CoV-1 and CoV-2 are also found in Group 6 (either obtained by multiple mutation or recombination with viruses not included in our data).

The Y (= C/T) in the Group 6 consensus sequence illustrates a general feature of the two Group 6 accessions: one is less modified with respect to the other (older, longer-known) SARS viruses, reflected by a BS = 67 for one being sister to Group 7, when using the accession data (not consensus sequences). But both share sequence patterns not found in Group 7 (BS = 32, note the long terminal branch in the Neighbor-net), which demonstrates that these viruses represent a genuine sister lineage. Elsewhere in their genome, both Group 6 and 7 are identical (Fig. 5). In other sequence regions, only Group 7 (CoV-2) differs (Fig. 6).

Fig. 6. A sequentially equally conserved region, purple arrows highlight unique mutations in CoV-2 strain; yellow background shared mutations (convergent and lineage-conserved).

Unfortunately, the various gene banks have been flooded with new, near-identical CoV-2 genomes, which are useless from a phylogenetic point of view. So, there is no point in looking for unique sequence features shared by Groups 6 and 7, or unique to Group 7 (CoV-2) — all of the best hits will be novel SARS-CoV-2 genomes. So, we cannot assess to what degree the new CoV-2 lineage, which likely includes the Group 6 accessions, differs from the remaining CoV-1 (original SARS) because of recombination with other coronaviruses.

However, recombination is most likely, given that CoV-1 genomes show plenty of signals that do not fit into tree-like evolution, but instead evidence inter-group reticulation (Fig. 2; example: Fig. 4). We also have sequence portions that are nearly 100% different (and hence not included in the phylogenetic analyses; Figs. 3/4) along with mutational patterns in conserved regions, which appear to be just mutations from the original SARS (Figs. 5, 6).

The harvested, mafft-aligned (uncurated) data and curated alignment, as well as files needed for analysis have been uploaded to figshare (CC-BY).

Grimm G, Morrison D (2020). Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset.

Monday, March 23, 2020

Evolution unchained: The development of person names and the limits of sequences

What do person names like Jack and Hans have in common, and what unites Joe and Pepe? Both name pairs go back to a common ancestor. For Jack and Hans, this would be John (ultimately going back to Iōánnēs in Greek), and for Joe and Pepe, this would be Josef (originally from Hebrew). Given the striking dissimilarity of the names in their current form, the pathways of change by which they have evolved into their current shape are quite complicated.

While the German name Hans can be easily shown to be a short form of the German variant Johannes, the evolution of Jack is more complicated. First (at least this is what people on Wikipedia suppose), Iōánnēs becomes John in English, similar to the process that transformed German Johannes into Hans. Then, in an ancient form of English, a diminutive was built for John, which yielded the form Jenkin, with the diminutive suffix -kin that has a homologous counterpart in German -chen (which can be attached to Hans as well, yielding Hänschen). Etymologically, Jack is little Johnny.

While Joe in English is a shortening of Josef, the development of Pepe is again a bit more complex. First, we find the form Giuseppe as an Italian counterpart of Josef. How this form then yielded Pepe as a diminutive is not completely clear to me; but since we find the pe in the Italian form, we can think of a process by which Giuseppe becomes Giuseppepe, leaving Pepe after the deletion of the initial two syllables.

The complexity of person-name evolution

Even from these two examples alone, we can already see that the evolution of person names can easily become quite complex. If all words in all spoken languages in the world evolved in the same way in which our person names evolve, we would have a big problem in historical linguistics, since the amount of speculation in our etymologies would drastically increase.

When comparing etymologically related words from different languages, we generally assume that they show regular correspondences among their sound segments. This presupposes that there is still enough sound material that reflects these correspondences, allowing us to detect and assess them. But since the evolution of person names rarely consists of the regular modification of sounds, but rather results in the deletion, reduplication, and rearrangement of whole word parts, there is rarely enough left in the end that could be used as the basis for a classical sequence comparison.

With the name Tina in German being the short form of Bettina, Christina, and at times even Katharina, and with Bettina itself going back to Elisabeth, and with Tina becoming Tinchen, Tinka, or Tine, we face an almost insurmountable challenge when trying to model the complexity of the various patterns by which names can change.

Modeling word derivation with directed networks

That words do not evolve solely by the alternation of sounds, but also by different forms of derivation, is nothing new for historical linguistics. We face the problem, for example, when looking for etymologically related words in the basic lexicon of phylogenetically related languages. However, these phenomena can be easily investigated by enhanced means of annotation. The evolution of person names, on the other hand, presents us with larger challenges.

While working as a research fellow in France in 2015-2016, I had the time to develop a small tool that allows us to represent derivational relations between related words with the help of a directed network, and thus allows us to model these relations in a rough way. Such a graph is directed, and our words are the nodes in the network, with the edges drawn between the assumed ancestor word forms and their descendants. This tool, which I then called DeriViz, is still available online, and makes it possible to visualize network relations between words.

I have now conducted a small experiment with this tool, by taking name variants of Elisabeth, as they are listed in Wikipedia, and trying to model them in a directed network, along with intermediate stages. You can do this easily yourself, by copying the network that I have constructed in text form below, and pasting it into the field for data entry at the DeriViz-Homepage. The network will be visualized when you press on the OK button; and you can play with it by dragging it around.
Elisabeth → BETT
BETT → Betty
BETT → Bettina
BETT → Bettine
BETT → Betsi
Elisabeth → ELISABETH
Elisabeth → ILSA
ILSA → Ilsa
ILSA → Ilse
Elisabeth → Isabella
Elisabeth → LISA
LISA → Lieschen
LISA → Liese
LISA → Liesel
LISA → Lis
LISA → Lisa
LISA → Lisbeth
LISA → Lisette
LISA → Lise
LISA → Liesl
Elisabeth → LILA
LILA → Lila
LILA → Liliane
LILA → Lilian
LILA → Lilli
Elisabeth → Sisi
I intentionally reduced the amount of data here, in order to make sure that the graphic can still be inspected. But it is clear that even this simple model, which assumes unique ancestor-descendant relations among all of the derived person names, is stretched to its limits when applied to names as productive as Elisabeth, at least as far as the visualization is concerned.
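
For readers who would rather process such a list programmatically than paste it into DeriViz, the following Python sketch reads the same "ancestor → descendant" text format into a simple adjacency mapping. It has nothing to do with how DeriViz itself is implemented; the function names are invented for this illustration, and only a small excerpt of the list above is used.

```python
from collections import defaultdict

# A short excerpt of the edge list above, in the same text format.
EDGE_LIST = """\
Elisabeth → BETT
BETT → Betty
BETT → Bettina
Elisabeth → LISA
LISA → Lisa
LISA → Liesel
Elisabeth → Sisi
"""


def parse_edges(text, arrow="→"):
    """Turn 'ancestor → descendant' lines into a mapping parent -> list of children."""
    children = defaultdict(list)
    for line in text.splitlines():
        if arrow not in line:
            continue  # skip blank or malformed lines
        parent, child = (part.strip() for part in line.split(arrow, 1))
        children[parent].append(child)
    return dict(children)


def descendants(children, root):
    """Collect every name reachable from root by following the directed edges."""
    found, stack, seen = [], [root], {root}
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


def leaves(children, root):
    """Names below root that have no further derivations, in alphabetical order."""
    return sorted(n for n in descendants(children, root) if n not in children)


graph = parse_edges(EDGE_LIST)
print(leaves(graph, "Elisabeth"))  # ['Bettina', 'Betty', 'Liesel', 'Lisa', 'Sisi']
```

The upper-case intermediate stages (BETT, LISA) are ordinary nodes here; the leaves are the attested name variants.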

Derivation network of names derived from Elisabeth

If you now imagine that there are various processes that turn an ancestral name into a descendant name, and that one would ideally want to model the differences between these processes as well, one can see easily that it is indeed not a trivial problem to model the evolution of person names (and we are not even speaking of inferring any of these relations).

How names evolve

Names evolve in various ways along different dimensions. With respect to their primary function, or their use, we tend to use, among others, nick names. Formally, nick names are often a short form of an original name, but depending on the community of speakers, it is also possible that there is a formal procedure by which a nick name can be derived from a base name. Thus, every speaker of Russian should know that Jekaterina can be turned into Katerina, which can be turned into Katja, which can be turned into Katjuscha, or, in the case of a Vocative, into Katj. Once the primary function of a name changes, its form usually also changes, as we can now see in many examples.

But the form can also change when a name crosses language borders. If you go with your name into another country, and the speakers have problems pronouncing certain sounds that occur in your name, it is very likely that they will adjust your name's pronunciation to the phonetic needs of their own language, and modify it. Names cross language borders very quickly, since we tend not to leave them at home when visiting or migrating to foreign countries. As a result, a great deal of the diversity of person names observed today is due to the migration of names across the world's larger linguistic communities.

How we change names when building short forms or nick names, or when trying to adapt a name to a given target language, depends on the structure of the language. The most important part is the phonology of the language in which the change happens. For example, when transferring a name from one language to another, and the new language lacks some of the sounds in the original name, speakers will replace them with those sounds which they perceive to be closest to the lacking ones.

But the modification is not restricted to the replacement of sounds. My own given name, Mattis, for example, usually has the stress on the first syllable, but in France, most people tend to call me Matisse, with the accent on the second syllable, reflecting the general tendency to stress the last syllable of a word in French. In Russian, on the other hand, Mattis could be perfectly pronounced, but since people do not know the name, they often confuse it with its variant Matthias, which then sounds like Matjes when pronounced in Russian (which is the name for soused herring in Germany). There are more extreme cases; and both English and German speakers are also good at drastically adjusting foreign names to the needs of their mother tongues.

It would be nice if it were possible to investigate the huge diversity in the evolution of person names more systematically. In principle, this should be possible. I think that starting from directed networks is definitely a good idea; but it would probably have to be extended by distinguishing different types of graph edges. Even if a given selection may not handle all of the processes known to us, it might help to collect some primary data in the first place.
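
As a rough idea of what such typed edges could look like, here is a small Python sketch that uses the Russian example from above. The process labels ("shortening", "diminutive", "vocative") are purely hypothetical placeholders and not an established typology, and attaching Katj to Katja is just one reading of the example. Like the simple model above, the sketch assumes that every name has a single direct ancestor.

```python
from typing import NamedTuple


class Derivation(NamedTuple):
    source: str
    target: str
    process: str  # hypothetical label for the type of edge


# One possible annotation of the Russian example; the labels are assumptions.
EDGES = [
    Derivation("Jekaterina", "Katerina", "shortening"),
    Derivation("Katerina", "Katja", "shortening"),
    Derivation("Katja", "Katjuscha", "diminutive"),
    Derivation("Katja", "Katj", "vocative"),
]


def path_to(edges, name):
    """Walk back from a name to its root and return the chain of typed steps."""
    parent = {e.target: e for e in edges}  # assumes one ancestor per name
    steps = []
    while name in parent:
        edge = parent[name]
        steps.append(edge)
        name = edge.source
    return list(reversed(steps))


def count_processes(edges):
    """Count how often each process label occurs in the collection."""
    counts = {}
    for e in edges:
        counts[e.process] = counts.get(e.process, 0) + 1
    return counts


for step in path_to(EDGES, "Katjuscha"):
    print(f"{step.source} -[{step.process}]-> {step.target}")
print(count_processes(EDGES))
```

Once the edges carry a type, counting the labels is already a first, very crude way to look for the most frequent processes in a collection.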

With a large enough set of well-annotated data, on the other hand, one might start to look into the development of algorithms that could infer derivation relationships between person names; or one could analyze the data and search for the most frequent processes of person name evolution. Many more analyses might be possible. One could see to which degree the processes differ across languages, or how names migrate from one language to another across times, usage types, and maybe even across fashions.


I assume that the result of such a collection would be interesting not only for couples who are about to replicate themselves, but would also be interesting for historical research and research in the field of cultural evolution. Whether such a collection will ever exist, however, seems doubtful. The problem is that there are not enough scholars in the world who would be interested in this topic, as one can see from the very small number of studies that have been devoted to the problem up to now (as one of the few exceptions known to me, compare the nice overview of person name classification by Handschuh 2019). I myself would not be able to help in this endeavour, given that I lack the scholarly competence to investigate name evolution. But I would sure like to investigate and inspect the results, if they ever become available.


Handschuh, Corinna (2019) The classification of names. A crosslinguistic study of sex-specific forms, classifiers, and gender marking on personal names. STUF — Language Typology and Universals 72.4: 539-572.

Monday, March 16, 2020

Problems with the phylogeny of coronaviruses

Coronaviruses are much in the news at the moment. Indeed, one particular variant seems to be the major news topic as I write this post. This is the one known as 2019-nCoV or SARS-CoV-2, which is responsible for the human pneumonia called COVID-19.

Obviously, the main issue for the public is infection biology, particularly the apparent ease with which the virus can spread in human populations. Part of the issue here seems to be that human coronaviruses are covered with a lipid membrane, which means that they "can remain infectious on inanimate surfaces [like metal, glass or plastic] at room temperature for up to 9 days" (Kampf et al. 2020), which dramatically increases the probability of each of us encountering one.

There is now a decline in reported cases in China, but there may be a resurgence. The problem is that an infected person may show no symptoms, or only very mild ones, and thus never report themselves. So, there may be millions more infected people running around the country, ready to infect new people when the travel restrictions are lifted, and the unexposed people come in contact with them. Biologically, the only safety is immunization, which occurs when you are exposed to the virus — which is risky, of course.

From Forni et al. (2017). Click to enlarge.

There will obviously be a lot of political fall-out in coming weeks, with various governments being accused of not doing enough and others of doing too much. The widespread infections in South Korea seem to be the result of a secretive religious organization (responsible for more than 60% of the national infections), to which the government has responded better than most others. On the other hand, in Iran it seems to be the government that has been the major problem, hiding the initial infections because of their potential effect on impending elections.

In Italy, the country seems to have been overwhelmed, and the death rate is very high, while in Germany the infection rate is relatively high but the death rate is currently still low. Indeed, Italy's long-delayed "lock-down" on internal travel contrasts strongly with China's much more rapid response, and this seems to be reflected in vastly different infection rates (Italy currently has 6x the number of infections per million people). More than half of the cases to date where I live, in Sweden, came initially from northern Italy, with most of the rest from Austria, which are popular downhill-skiing destinations at this time of the year.


However, for our purposes here it is the phylogenetics of coronaviruses that is of professional interest, not infection biology. This has been a research topic for the past couple of decades, with the origin of several novel coronavirus strains in humans during that time (see the timeline above). These include SARS-CoV (causing Severe Acute Respiratory Syndrome) and MERS-CoV (causing Middle East Respiratory Syndrome) — both of these have much higher fatality rates than the current epidemic (10% and 34%, respectively), but lower rates of spread. A selected set of relevant papers is listed below; and I have included a couple of phylogenies as examples.

The issue that I wish to mention here is that there appears to be a disconnection between the so-called phylogenies presented in these papers and the concept of a phylogenetic history. The papers present either an unrooted or a rooted tree. In the first case, this simply represents a set of clusters based on genomic similarity. In the second case, this represents a hierarchical grouping based on genomic similarity. Obviously, an unrooted tree cannot represent a phylogenetic history, since evolution has a time direction, and this can only be illustrated using a directed (ie. rooted) tree or network.

However, the bigger issue is that these trees cannot represent an actual virus phylogeny. The argument for presenting them seems to be that the clusters / groups are based on genomic similarity, which in turn is caused by the phylogenetic history of the viruses. This is true, but we cannot thereby invert the logic. Phylogeny creates similarity, but mere similarity does not necessarily represent phylogenetic history.

In the case of coronaviruses, the evolutionary history is reported to involve extensive genomic recombination in the formation of novel strains (reviewed by Cui et al. 2019). That is, during an epidemic the phylogeny might be tree-like, but at the origin of the epidemic it is not. This especially occurs because coronaviruses can infect a range of hosts (not just humans), and it is the recombination that occurs while within one host that allows novel strains to appear that can create epidemics in a different host.

This is also prevalent in, for example, influenza viruses (which also have a lipid membrane). This occurred for the world's worst epidemic (c. 500 million affected), the so-called Spanish Flu of 1918-1920, which actually started in the USA. The current most-likely explanation is that both a bird-host and a human-host influenza strain got into a pig, recombined in the cells of that host, and then the new virus strain got back into the human population.

Therefore the full phylogenetic history cannot be tree-like. Indeed, the actual history must be in the form of a recombination network, as discussed elsewhere in this blog. So, the trees, as shown in the papers below, represent the similarity of the coronaviruses but not all of their phylogeny. For the latter, we need a haplotype network representation, as illustrated in this example:

Some small haplotype networks; from Yu et al. (2020)

It would be interesting to construct a recombination network based on the data from one or more of the coronavirus papers, as an example. However, as far as I can see, none of the authors has referred to an online version of their genomic alignment; and so I cannot present such a thing here.


Cui J, Li F, Shi Z-L (2019) Origin and evolution of pathogenic coronaviruses. Nature Reviews Microbiology 17: 181-192.

Chen Y, Liu Q, Guo D (2020) Emerging coronaviruses: genome structure, replication, and pathogenesis. Journal of Medical Virology 92: 418-423.

Eickmann M et al. (2003) Phylogeny of the SARS coronavirus. Science 302: 1504-1505.

Forni D, Cagliani R, Clerici M, Sironi M (2017) Molecular evolution of human coronavirus genomes. Trends in Microbiology 25: 35-48.

Gorbalenya AE, Snijder EJ, Spaan WJ (2004) Severe acute respiratory syndrome coronavirus phylogeny: toward consensus. Journal of Virology 78: 7863-7866.

Kampf G, Todt D, Pfaender S, Steinmann E (2020) Persistence of coronaviruses on inanimate surfaces and their inactivation with biocidal agents. Journal of Hospital Infection 104: 246-251.

Luk HKH, Li X, Fung J, Lau SKP, Woo PCY (2019) Molecular epidemiology, evolution and phylogeny of SARS coronavirus. Infection Genetics and Evolution 71: 21-30.

Woo PC, Lau SK, Huang Y, Yuen KY (2009) Coronavirus diversity, phylogeny and interspecies jumping. Experimental Biology and Medicine 234: 1117-1127.

Yu WB, Tang G-D, Zhang L, Corlett RT (2020) Decoding evolution and transmissions of novel pneumonia coronavirus (SARS-CoV-2) using the whole genomic data. (ResearchGate)

Zhang L, Shen F-M, Chen F, Lin Z (2020) Origin and evolution of the 2019 novel coronavirus. Clinical Infectious Diseases (Epub ahead of print).

An unrooted tree; from Cui et al. (2019).

A rooted tree; from Chen et al. (2020)

Monday, March 9, 2020

A sneak peek into the upcoming SplitsTree 5

For some time now, the official SplitsTree page has been offline. The reason is that a major update is on the way: SplitsTree5. A beta version is already available, so let's take a quick look at it.

During installation you will be asked how much RAM you want to dedicate. Give as much as possible, in case you want to handle large trees with myriads of splits. I chose 16 GB (ie. half of the RAM installed on my PC).

Here's how it looks when you start the program:

The menus known from SplitsTree4 are still there, and the important functions appear to be already implemented. Some are new, and some have been moved:
  • Menu File: there is a new option to "Export workflow", which produces a graphical representation (ie. a flow-chart) of what you did with the imported data, which is shown in the main display panel ("Workflow")
  • New menu Select: collects together the Select options formerly included under Edit.
  • The Trees option is now called Tree
  • Menu Network has all of the classics (distance-based phylogenetic networks, tree-based networks, character-based networks); but missing (so far) are the Pruned Quasi Median network and Spectral Splits options, possibly due to very little demand. An important new function is (or will be) that one can change between "Splits Network view" (ie. the view we are used to from SplitsTree4) and "Haplotype Network view" (as known from the TCS, NETWORK, etc. programs)
  • New menu PCoA, to do principal coordinates analysis (at some point).
  • The menu Analysis appears to be still in development. Currently the options are: Show Bootstrap tree..., Show Bootstrap network..., Estimate invariable sites..., Compute Phylogenetic Diversity, Compute Delta Score, and (new) Show workflow.
  • The menu Window will be split into Window and Help. Menu Help now also includes a direct link to the SplitsTree Community page (online since September 2017), which is new to me and, judging by the low number of discussion threads, apparently to most of the world.
The new GUI reminds me a bit of RStudio: instead of pop-up windows vanishing once you perform a function, you will keep subsequent sheets in the panels. This makes it easier for new users.

When opening a data matrix not directly interpretable, you may activate the "Import" menu, asking you to specify the data type and the file format:

As in SplitsTree4, the importer is currently sensitive to additional code and commentary brackets, and cannot, eg., handle polymorphisms for categorical data (such as "(01)", "{01}"). Accordingly, importer warnings will pop up. Probably, a lot of testing and tweaking is required to make this work as planned. The selection list for file formats is comprehensive, but also ambitious. It may be a good idea to focus on a simple import format (eg. Phylip without its name-length restrictions, or clean NEXUS), and leave the import / export issues to other software packages (such as Mesquite, or R-conversion tools).

But we can read in Splits-NEXUS files generated by SplitsTree4 without any problems. To sneak a bit more:

A very nice function is that the flags in the analysis pipeline are fully interactive, allowing for quick manipulation / overview of what was used. For example, by clicking on "NeighborNet", we get a new panel for tweaking the NNet options or changing the method used:

When hovering over a menu item, a short explanation may pop up. The menus in the modification panel include drop-down boxes and input fields (here, for NNet):

Close-up of the NeighborNet panel.

Another important upgrade is the "Workflow" sheet, which gives you access to data filtering, methods and visualization etc., by just double-clicking on the respective item in the flow-chart (items can be dragged and moved, too):

Graphically, SplitsTree5 is functional as well. View > Format... (Ctrl-Shift-J) will open the remodeled coloring and type window in the method / lower left panel, where you can choose: font, label and (selected) edge(s) colors, node colors and shapes. In addition to circles and squares, we now have the choice between up- and down-triangles, diamonds, and hexagons. The graphical export option is gone (Ctrl-Shift-M; for now) and replaced by a modifiable, objects-containing PDF (similar to the ones produced by Dendroscope), generated simply by printing out to PDF.

The current beta version may not be able to fully replace SplitsTree4 yet (especially since the current manual only contains an 'Acknowledgments' section), but it already has enough functionality (some of it new) to play around with and explore the wonderful world of phylogenetic networks.

So, try it out for yourself.

Current issues

Glitches (on my Windows-PC running the latest Java version) that I have encountered include:
  • flickering scroll bars – but, when resizing the window a bit and keeping the left mouse button pressed, the flickering stops
  • I couldn't exit the program after opening more than one window / data set
  • a few menu items may not work yet (e.g. Select > All Labeled Nodes, Ctrl-Shift-L).
Moving edges can also be a bit tricky. You first need to select the edges, at which point the selected edge bundle will be highlighted by a broad yellow aura, and then move the pointer to one of the nodes, until the node is surrounded by an even broader aura. Then click and keep the mouse button down.

To get rid of node shapes, I had to click several times on "none" (first it changes to circles, which then become smaller until being nearly invisible).

Important note: While I had no problem opening any of my SplitsTree4-generated and saved files in SplitsTree5, a file saved in SplitsTree5 produces an import failure error message when opened in SplitsTree4.

Monday, March 2, 2020

The phylogenetics of the Last Universal Common Ancestor is hard

If we define phylogenetics as the study of sister-group historical relationships, then it stands to reason that the hardest thing to do in biology would be to study the Last Universal Common Ancestor (LUCA), which is the common ancestor of all known organisms. This is because, by definition, it has no knowable sister group.

Study of the LUCA has therefore mostly been seen as a study of ancestor-descendant relationships, being an attempt to trace the ancestry of living things all the way back until there is nothing more to detect.

This latter approach seems to lead to a lot of arguments. There are arguments about what type of character data to use (it seems doubtful that nucleotide sequences are informative that far back in evolution). There are arguments about how many monophyletic groups there might be of akaryotes, and whether we should consider eukaryotes to be monophyletic, given that they have organelles. For a brief introduction to the use of protein domains for phylogenies, as well as the dispute about the three-domains versus two-domains issue, see this Twitter presentation.

On the other hand, trying to study the LUCA phylogenetically raises some interesting questions, because we are trying to produce a phylogeny with a root but without an outgroup. I recently gave a talk on this subject; and I have included a PDF copy of the slides from that talk here.

The talk starts with some personal history, which just happens to lead into a discussion of what I see as the essential points of phylogenetic analysis. I discuss the essential points of characters versus taxa, emphasizing the role of both character and taxon models. The essential point for the LUCA is the need to determine character polarity, as this gives us the time direction, and allows us to find the earliest time.

Conclusion 1: The characters used to study the LUCA probably need to be molecular, but the form of the character analysis needs to be fundamentally different from what molecular biologists commonly employ — we need to analyze character polarity.

Conclusion 2: We need to think about which characters will have relevant phylogenetic information, for the age depth we are looking at.

Conclusion 3: We need to think about the taxon-change model, as well as the character-change model — the history may be very complex at the root.

Conclusion 4: We study contemporary taxa, and it is inappropriate to try putting ancestors into any modern group, unless you have good evidence that the ancestor is the MRCA of that group (ie. the group is monophyletic).

For the study of the phylogeny of the LUCA:
  • The root cannot be added to an unrooted line graph, but instead the root must be a direct product of the data analysis
  • Sequence data are unlikely to be informative, because the required character-change models matter too much at that time depth
  • The evolutionary history may be much more complex than can be represented by a tree, and may be impractical even for any current form of network analysis
  • The LUCA is not part of any extant phylogenetic group.