Monday, June 29, 2020

Annotating rhymes in texts (From rhymes to networks 3)

Having discussed some general aspects of rhyming in a couple of different languages, in last month's blog post, the third post in this series is devoted to the question of how rhyme can be annotated. Annotation plays a crucial role in almost all fields of linguistics. The main idea is to add value to a given resource (Milà-Garcia 2018). What value we add to resources can differ widely, but as far as textual resources are concerned, we can say that the information that we add can usually not be extracted automatically from the resource.

In our case, the information we want to explicitly add to rhyme texts or rhyme corpora is the rhyme relations between words. Retrieving this information may be trivial, as in the case of Shakespeare's Sonnets, where we know the rhyme schema in advance, but it is considerably complicated when working with other, less strict types of rhyming.

One usually distinguishes two basic types of annotation: inline and stand-off (Eckart 2012). For inline annotation, we add our information directly into our textual resource, while stand-off annotation creates an index over the resource, and then adds the information in a separate resource that refers to the index of the original text.

Both methods have their pros and cons. Stand-off annotation often seems to provide a cleaner solution (as one never knows how much a manual annotation added into a text might modify the text involuntarily). However, inline annotation has, in my experience, the advantage of allowing for a much faster annotation process, at least as long as the annotation has to be done in text files directly, without interfaces that could help to assist in the annotation process.

Overview of existing annotation practice

If we look at different practices that have been used to annotate rhymes in collections of poetry, we will find quite a variety of techniques that have been used so far.

Wáng (1980), for example, uses an inline annotation style in his corpus of the rhymes in the Book of Odes, as illustrated in the following example taken from List et al. (2019). In this annotation, rhyme words are indirectly annotated by providing reconstructed readings for the Chinese characters which are supposed to narrow the original pronunciation. Whenever two rhyme words share the same main vowel, the author would have judged them to have rhymed in the original text.

Annotation in Wáng (1980)

Baxter (1992) uses a stand-off annotation, which is shown (again taken from List et al. 2019) in the following table. An advantage of Baxter's annotation is that it allows him to provide multiple layers of information for each rhyme word. A disadvantage is that a clear index to the words in the poem is lacking. While this is not entirely problematic, since it is usually easy to identify which words are in rhyme position, it is not entirely "safe", from an annotation point-of-view, as it may still create ambiguities.

Annotation in Baxter (1992)

In a study of automated rhyme word detection, Haider and Kuhn (2018) use annotated rhyme datasets from a variety of German styles (Hip Hop, contemporary lyrics, and more ancient lyrics). To annotate the data, they use the standard format of the Text Encoding Initiative, which is based essentially on XML. Unfortunately, however, they do not provide tags for each word that rhymes, but instead only add an attribute to each stanza, indicating the rhyme schema, as can be seen in the example below:
<lg rhyme="aabccb" type="stanza">
  <l>Vor seinem Löwengarten,</l>
  <l>Das Kampfspiel zu erwarten,</l>
  <l>Saß König Franz,</l>
  <l>Und um ihn die Großen der Krone,</l>
  <l>Und rings auf hohem Balkone</l>
  <l>Die Damen in schönem Kranz.</l>
The drawback of this annotation style is that it places the annotation where it does not belong, assuming that a poem only rhymes the words that appear in the end of a line, and that there are no exceptions.

For French, I found an interesting website called métrique en ligne, offering a large number of phonetically analyzed texts in French. They offer a rhyme analysis in an interactive fashion: one can have a look at a poem in raw form and then see which parts of the words appear in rhyme relation. A screenshot of the website (with the poem "Les Phares" from Charles Baudelaire) illustrates this annotation:

It is very nice that the project offers the rhyme annotation in such a clear form, annotating explicitly those parts of the words (albeit in orthography) that are supposed to be responsible for the rhyming. However, the annotation has a clear drawback, in that it provides rhyme annotation only on the level of the stanza, although we know well that quite a few poems have recurring rhymes that are reused across many stanzas, and we would like to acknowledge that in our annotation.

The most complete annotation of poetry I have found so far is ``MCFlow: A Digital Corpus of Rap Transcriptions'' (Condit-Schultz 2017). The goal of the annotation was not to annotate rhyme in the primary instance, but to provide a corpus that also takes the musical and rhythmic aspects of rap into account. As a result it offers annotations along seven major aspects: rhythm, stress, tone, break, rhyme, pronunciation, and the lyrics themselves. The rhyme annotation itself is provided for each syllable (the texts themselves are all syllabified), with capital letters indicating stressed, and lower case letters indicating unstressed syllables. Rhyme units (usually, but not necessarily words) are marked by brackets. The following figure from Condit-Schultz (2017) illustrates this schema.

Annotation of rhymes by Condit-Schultz (2017)

What I do not entirely understand is the motivation of using the same lowercase letters for unstressed syllables as for the stressed ones in a rhyme sequence. Given that the information about stress is generally available from the annotation, it seems redundant to add it; and it is not clear to me for what it serves, specifically also because unstressed syllables do not necessarily rhyme in rhyme sequences. But apart from this, I find the information that this annotation schema provides quite convincing, although I find the format difficult to parse computationally; and I also imagine that it is quite difficult to annotate it manually.

Initial reflections on rhyme annotation

When dealing with annotation schemas and trying to develop a framework for annotation, it is always useful to recall the Zen of Python, especially the first seven lines:
  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
What I think we can extract from these seven lines are the following basic rules for an initial annotation schema for rhyme data.
  • First, ideally, we want an annotation schema that gives us the same look and feel that we know when reading a poem. This does not mean we need to store the full annotation in this schema, but for a quick editing of rhyme relations, such an annotation schema has many advantages.
  • Second, in order to maintain explicitness, all rhymes should be treated as rhyming globally inside a poem — we should never restrict annotation of rhymes to a single stanza, and we should also avoid brackets to mark rhyming sequences, as there are other ways to assign words to units.
  • Third, we should be explicit enough to show which parts of a word rhyme but, for now, I think it is not necessary to annotate all syllables at the same time. Since this would cost a lot of time, and specifically since syllabification differs from language to language, it seems better to add this information later on a language-specific basis, semi-automatically. Since many words repeat across poems, one can design a lookup-table to syllabify a word much more easily from a corpus that has been assembled, than adding the information when preparing each poem.

Towards a: Standardized Annotation of Rhyme Data

Last year, we proposed an annotation schema for rhyme annotation (List et al. 2019). Our basic idea was inspired by tabular formats. These are used in linguistic software packages dealing with problems in computational historical linguistics, such as LingPy. They are also used as the backbone of the Cross-Linguistic Data Formats Initiative (Forkel et al. 2018), which uses tabular formats in combination with metadata in order to render linguistic datasets (wordlists, information on structural features) cross-linguistically comparable. Essentially, the format can be seen as a stand-off annotation, where the original data are not modified directly. While our basic format was rather powerful with respect to what can be annotated, it is also very difficult to code data in this format, at least in the absence of a proper annotation tool.

At the same time, to ease the initial preparation of annotated rhyme data conforming to these standards, we proposed an intermediate format, in which a poem was provided just in text form, with minimal markup for metadata, and in which rhymes could be annotated inline. As an example, consider the first two stanzas of the poem "Morning has broken" by Eleanor Farjeon (1881-1965):
@CREATED: 2020-06-26 06:09:04
@TITLE: Morning has broken
@AUTHOR: Eleanor Farjeon
@BIODATE: 1881-1965
@YEAR: before 1965
@MODIFIED: 2020-06-26 06:09:46
@LANGUAGE: English

Morning has [a]broken like the first morning
Blackbird has [a]spoken like the first [b]bird
Praise for the [c]singing, praise for the morning
Praise for them [c]springing fresh from the [b]Word

Sweet the rain's [e]new_[f]fall, sunlit from heaven
Like the first [e]dew_[f]fall on the first [g]grass
Praise for the [d]sweet[h]ness of the wet garden
Sprung in com[d]plete[h]ness where His feet [g]pass
As you can see from this example, we start with some metadata (which is more or less a free form, consisting of the formula @key: value, and then render the stanzas, line by line, separating stanzas by one blank line. Rhymes are annotated by enclosing rhyme labels in angular brackets before the part of the word responsible for the rhyme. If wanted, one can annotate rhymes for each syllable, as done in the rhyme words [d]sweet[h]ness and com[d]plete[h]ness, but one can also only annotate the rhyme as a whole, as done in the rhyme words [a]broken and [a]spoken.

In order to assign words to rhyme units, an understroke can be used that indicates that two orthographic words are perceived as one unit in the rhyme, which is the case for [e]new_[f]fall rhyming with [e]dew_fall. Furthermore, if a stanza reappears throughout a poem or song in the form of a refrain, this can be indicated by adding two spaces before all lines of the stanza.

Comments can be added by beginning a line with the hash symbol #, as shown in this small excerpt of Bob Dylan's "Sad-Eyed Lady of the Lowlands".
# [Verse 1]
With your mercury mouth in the missionary [c]times
And your eyes like smoke and your prayers like [c]rhymes
And your silver cross, and your voice like [c]chimes
Oh, who do they think could [i]bury_[j]you?
With your pockets well protected at [e]last
And your streetcar visions which ya' place on the [e]grass
And your flesh like silk, and your face like [e]glass
Who could they get to [i]carry_[j]you?

# [Chorus]
  Sad-eyed lady of the lowlands
  Where the sad-eyed prophet say that no man [a]comes
  My warehouse eyes, my Arabian [a]drums
  Should I put them by your [b]gate
  Or, sad-eyed lady, should I [b]wait?
When testing this framework on many different kinds of poems from different languages and styles, I realized that the greedy rhyme annotation that I used (you place the rhyme tag before a word, and all letters that follow will be considered to belong to that very rhyme tag) has a disadvantage in those situations where syllables in multi-syllabic rhyme units essentially do not rhyme. As an example consider the following lines from Eminem's "Not Afraid":
I'ma be what I set out to be, 
without a doubt, undoubtedly
And all those who look down on me, 
I'm tearin' down your balcony
Here, the author plays with rhymes centering around the words out to be, undoubtedly, down on me, and balcony. Condit-Schultz has annotated the rhymes as follows (I use the rhyme schema inline for simplicity):
I'ma D|be what I set (C|out c|to D|be), 
wi(C|thout c|a) (C|doubt, c|un)(C|doub.c|ted.D|ly)
And all those who look (C|down c|on D|me), 
I'm tearin' C|down your (C|bal.c|co.D|ny)
In my opinion, however, the parts annotated with c by Condit-Schultz do not really rhyme in these lines, they are mere fillers for the rhythm, while the most important rhyme parts, which are also perceived as such, are the stressed syllables with the main vowel ou. To mark that a syllable is not really rhyming, but also in order to mark the border of a rhyme (and thus allow indication that only the first syllable of a word rhymes with another word), I therefore decided to introduce a specific "empty" rhyme symbol, which is now represented by a plus. My annotation of the lines thus looks as follows:
I'ma be what I set [h]out_[+]to_[e]be, 
wi[h]thout a [h]doubt, un[h]doub[+]tab[e]ly
And all those who look [h]down_[d]on_[e]me
I'm tearin' down your bal[d]co[e]ny

An Interactive Tool for Rhyme Annotation

While I consider the inline-annotation format as now rather complete (with all limitations resulting from inline-annotation), I realized, when trying to annotate poems by using the format, that it is not fun to edit text files in this way. I am not talking about small edits, like one stanza, or typing in some metadata — annotating a whole rap song can become very tedious and even problematic, as one may easily forget which rhyme tags one has already used, or oversee which words have been annotated as rhyming, or forget brackets and the like.

As a result, I decided to write an interactive rhyme annotation tool that supports the inline-annotation format and can be edited both in the text and interactively at the same time. This is a bit similar to the text processing programs in blogging software, which allow writing both in the HTML source and in a more convenient version that shows you what you will get.

The following screenshot in the database, for example, shows how the rhymes in Shakespeare's Sonnet Number 98 are visually rendered.

Visual display of Shakespeare's Sonnet 98

This tool is now already available online. I call it RhyAnT, which is short for Rhyme Annotation Tool. I have been using it in combination with a small server, to populate a first database with rhymes in different languages, which already contains more than 350 annotated poems. This database can be accessed and inspected by everybody interested, at AntRhyme; but copyrighted texts from modern songs can — unfortunately — not be rendered yet (as I am not sure how many I would be allowed to share).

I do not want to claim that I am gifted as a designer (I am surely not), and it is possible that there are better ways to implement the whole interface. However, I find it important to note that the format itself, with the coloring of rhyme words, has dramatically increased my efficiency at annotating rhyme data, and also my accuracy in spotting similarities.

Annotating the same poem with RhyAnT, the interactive rhyme annotator

The above screenshot shows how I can edit the poem from my edit access to the database. Alternatively, one can just paste in the text and edit it on the publicly accessible interface of the RhyAnT tool, edit the data, and then copy-paste it to store it. In this form, the interface can already be used by anybody who wants to annotate rhymes in their work.


The current annotation framework that I have illustrated here is not almighty, specifically because it does not allow for multi-layered annotation (Banski 2019: 230f), which would allow us to add pronunciation, rhythm, and many other aspects than rhyming alone. However, I hope that many of these aspects can be later added quickly, by creating lookup tables and processing the annotated corpus automatically. Following the Zen of Python, this seems to be much simpler than investing a lot of time in the creation of a highly annotated dataset that would discourage working with the data from the beginning.


Bański, Piotr and Witt, Andreas (2019) Modeling and annotating complex data structures. In: Julia Flanders and Fotis Jannidis (eds) The Shape of Data in the Digital Humanities: Modeling Texts and Text-based Resources. Oxford and New York: Routledge, pp. 217-235.

Baxter, William H. (1992) A Handbook of Old Chinese Phonology. Berlin: de Gruyter.

Nathaniel Condit-Schultz (2017) MCFlow: A Digital Corpus of Rap Transcriptions. Empirical Musicology Review 11.2: 124-147.

Eckart, Kerstin (2012):Resource annotations. In: Clarin-D, AP 5 (ed.) Berlin: DWDS, pp. 30-42.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Haider, Thomas and Kuhn, Jonas (2018) Supervised rhyme detection with Siamese recurrent networks. In: Proceedings of Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 81-86.

List, Johann-Mattis and Nathan W. Hill and Christopher J. Foster (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Milà‐Garcia, Alba (2018) Pragmatic annotation for a multi-layered analysis of speech acts: a methodological proposal. Corpus Pragmatics 2.1: 265-287.

Wáng, Lì 王力 (2006) Hànyǔ shǐgǎo 漢語史稿 [History of the Chinese language]. Běijīng 北京:Zhōnghuá Shūjú 中华书局.

Monday, June 22, 2020

Can we dig too deep? Signal conflict in mitochondrial genes of land plants

In an earlier post, Supernetworks and gene tree incongruence, I illustrated what Supernetworks can tell us about incongruent mitochondrial gene trees, using the dataset of Sousa et al. (PeerJ 8: e8995, 2020). Here, I will take a closer look at these data, in order to illustrate another point.

Fig. 1 A Supernetwork based on 34 individual mitochondrial gene trees (atp1 and atp8 missing due an alignment glitch). Groups (clades) referring to splits not found in any of the gene trees, including the "Septaphyta", are shown in gray font, in blue, groups referring to clades seen in Sousa et al.'s preferred tree.

Sousa et al.'s set of analyses aimed to filter signal in order to get a better all-inclusive tree, and succeeded to produce support for a "Septaphyta" clade, comprising liverworts and mosses, which is a split not found in any of the inferred (Bayesian) gene trees.

Fig. 2 Comprehensive but branch-length and frequency ignorant Supernetwork of Sousa et al.'s Bayesian MRC gene trees (trees are provided as supplementary online data on zenodo), inferred from nucleotide sequences. The trees show several alternatives (colored and labeled) regarding the sister lineages of mosses and liverworts. Any split found in at least one gene tree, is represented in this Supernetwork.

This split did occur, however, when the amino acid sequences were used, instead of the nucleotide sequences.

Fig. 3 Comprehensive branch-length and frequency ignorant Supernetwork of Sousa et al.'s amino acid Bayesian MRC gene trees. The sister taxon of liverworts are either the mosses (= Septaphyta clade) or the outgroup green algae Coleochaete (further inclusive splits include subsequently more outgroups, placing the liverworts as sister to all other land plants). Realized alternatives for mosses include further being a sister of hornworts (in which case liverworts would be sister to higher land plants), or sister to all land plants (brown split: green algae + mosses | rest).

The alternative to the Septaphyta clade, which does appear in the gene trees, recognizes the liverworts as the closest relative of the vascular plants, while the mosses are resolved as the first branch. As Sousa et al. point out:
The tree inferred from the concatenated nucleotide data set of 36 mitochondrial genes shows mosses as the sister-group to the remaining land plants, as previous analyses of mitochondrial nucleotide data have shown (Liu et al., 2014). ...
The concatenated tree hence only reflects a minor aspect of the Supernetwork (Fig. 1) of the individual gene trees:
... However, the mosses are replaced by the liverworts in the same position when analysing codon-degenerate recoded data.
This seems to be the preferred placement when summarizing the gene trees using the Supernetwork.

In this post, we will take a closer look. Is there a deep, easily obscured signal for the Septaphyta clade in the mitochondria of plants? A signal that only surfaces in some amino acid gene trees (Fig. 2) and the filtered concatenated tree (Sousa et al.'s fig. 2), or is it just a branching artifact?

An example for a (low-supported) artificial clade (cf. Schliep et al., Methods Ecol. Evol. 8: 1212–1220, 2017). The trees in (a), (b) and (c) give the paternal (Y-chromosome), maternal (mt genes) and biparental (nuclear autosomal introns) genealogies. While the paternal and biparental genealogies are compatible (congruent), the maternal is in strong conflict. When combining these three data sets, this substantial conflict decreases branch support and results in the artificial red clade. The support for the artificial clade is low but worrisome: the tips differ substantially from each other and the hypothetical, alternative common ancestors and there are literally no patterns in the data supporting a sister relationship of Sloth and Sun bears. The artificial red clade is a secondary product of the inference: the sister relationship of Polar and Brown Bear is trivial (data and inference wise), the American Black Bear the more likely sister (note the length of competing branches in a/c vs. b). The signal for a clade of Asian Black and Sloth bears is less pronounced, here the mt genealogy clade is strongly incongruent and forces the combined tree to resolve the conflict by introducing the artificial red clade.

Starting simply

The Supernetwork in Fig. 1 shows that, no matter which gene we look at, liverworts and mosses were originally most similar to each other, and, absolutely speaking, still close to the (hypothetical) mitochondrion of the ancestor of all land plants. We can illustrate the general situation about the signals using a Neighbor-net inferred from the concatenated data of all 36 genes.

Fig. 4 Neighbor-net based on uncorrected p-distances inferred from the concatenated gene data.

Note that we used a substitution model via a naive distance matrix for a set of coding genes that include saturated third codon positions. Some phylogenetic relationships are obviously based on trivial signals: the Neighbor-net in Fig. 4 includes ± prominent edge bundles defining neighborhoods in line with generally accepted clades (in bold). To capture these evolutionary lineages (some going back nearly half a billion of years), we just need the raw data but no sophisticated phylogenetic analysis.

In the case of the probably monophyletic gymnosperms, the gymnosperm neighborhood competes with a neighborhood excluding the Gnetidae Welwitschia, which is the most distinct of the seed plants in this taxon set (this applies to Gnetidae in general, no matter which data are used). In addition, we see a neighborhood defined by the pink edge bundle: a split of green algae + Welwitschia versus all other land plants. This is a case of obvious long-edge attraction, enforced (here) by missing data (Welwitschia lacks data for 12 out of the 34 genes).

The center of the graph with respect to all tips would be a candidate for the ancestral mitochondrion of the common ancestor of all land plants. Closest to this point are the Septaphyta (mosses + liverworts) and the lycophyte Huperzia (the better represented taxon only missing out on five genes, while Isoetum miss 15).

One can depict a phylogenetic hypothesis by just dropping the less pronounced neighborhoods in Fig. 4:
  • Most prominent edge bundles define three main cluster (= lineages): green algae, seed plants, and other land plants.
  • Within green algae:
    • Closterium is sister to Gonatozygon, next is Roya → Zygnematales
    • there is no prominent edge bundle connecting Chaetospharidium with the Zygnematales; the closest relative is however the last green algae (→ Coleochatales; only group without a neighborhood).
  • Within seed plants:
    • Brassica may be the sister of Liriodendron (more prominent edge bundle), Oryza complements the clade as first diverged member → angiosperms
    • Cycas is the sister of Gingko, the two are sister to the angiosperms
    • this leaves Welwitschia as the first diverged branch.
  • Taking the green algae as outgroups:
    • the ferns are the sister group of the seed plants (edges longer than the alternative of a primitive land plant clade)
    • mosses are sister to liverworts (→ Septaphyta); Huperzia shares the same edge bundle but is apparently sister of Isoetes (→ lycophytes), and the lycophytes appear to be ± primitive sisters of ferns and seed plants
    • this leaves the hornworts, a highly coherent group sharing no prominent edge bundles with any other member of the land plant cluster, and hence are a candidate for the first diverging land plant lineage.
This is a tree hypothesis that is strikingly similar with Sousa et al.'s preferred tree.

Sousa et al.'s fig. 2

The only differences lie in terminal subtrees (Oryza as sister to Liriodendron; Marchantia-Treubia grade, the position of the latter two within liverworts being unclear based on the Neighbor-net).

Something that is easily overlooked in Sousa et al.'s rooted tree, but that is apparent from the Neighbor-net, is that we should be aware of ingroup-outgroup long-branch attraction (LBA). The green algae are not only highly divergent but also very distant from all ingroup taxa, the land plants.; and the first ingroup branch in Sousa et al.'s tree has the longest root.

Additive and subtractive support

In principal, when comparing single gene tree samples to combined trees, we face four sorts of signals in our data:
  • Very strong signals imprinted in one or a few genes; they will outcompete, and possibly even be re-enforced by any conflicting signal. Walker et al. (PeerJ 7: e7747, 2019), studied this phenomenon for the case of angiosperm plastomes (see also our miniseries The emperor has no clothes on).
  • Phylogenetically sorted, weak but consistent signals; they will add up, as branch support will increase with each gene added. In this category fall signals reflecting deep splits obscured by terminal noise, when analyzing a single gene or few genes – like the one found by Sousa et al. supporting a Septaphyta clade.
  • Disparate gene histories; eg. because of intergenomic recombination. The support will be diminished with every added gene not sharing the same history.
  • General conflict; eg. when combining data from different genomes reflecting different genealogies, such as combining chloroplast (product of biogeographic history) and nuclear data (product of speciation processes) of tree genera. This will be expressed by split bootstrap (BS) support, and may result in artificial clades in the combined/concatenated tree (eg. bear example shown above).
Adding to this is the absence of signal: short-branch culling, a special case of long-branch attraction, which could also explain the inference of a (paraphyletic) Septaphyta clade. If there are few tips in the data that are close (absolute, not only regarding their phylogenetic distance) to the all-ancestor without clear affinities, they may be collected in a subtree, being leftovers from optimizing all other tips with certain affinities and higher distance to the all-ancestor.

Fig. 5 Short-branch culling. Let's assume liverworts are the sister clade of higher land plants (an alternative with near-unambiguous support from cox1). The signal for this in mitochondrial data is weak (short root). On the other hand, there is a high risk for ingroup-outgroup long-branch attraction (LBA) leading eventually to an artificial Septaphyta clade. Because of (inevitable) LBA, even though the false branch is very short, its support can be high (unambiguous when using Bayesian inference).

By compiling the support for all alternatives, we can assess where the support is additive or subtractive. We do this using my re-analysis not Sousa et al.'s Bayesian analysis because:
  1. BS support is more sensitive to internal signal conflict than Bayesian PP,
  2. to extract this information, we need the tree samples used to establish the branch support.
When doing this, we find that the split defining the Septaphyta clade is not only missing from the nucleotide genes trees but also rarely found in the BS pseudoreplicate samples. Only for seven gene regions (atp4, atp8, nad2, rpl16, rps2, rps3, rps13) do we find BS ≥ 25; the highest support comes from rps3 (BS = 65; however, the split is not found in the corresponding Bayesian MRC of Sousa et al.).

On the other hand, the main alternatives find much higher and more consistent support, as shown here.

Fig. 6 Competing support for (purple) and against a Septaphyta clade (greens and yellows). Placing the hornworts as sister to all other land plants (pink) is compatible with the hypothesis of a Septaphyta clade as well as the competing alternative of placing the liverworts as sister to higher land plants; note the high support from cox1 gene for an according tree. *, for these genes no hornwort data were included/have been available.

Short-branch culling, a special form of ingroup-outgroup LBA

Now, my BS analyses were deliberately naive, because they did not apply any data partitioning. However, both liverworts and mosses have short-branches while the outgroup, the green algae, are extremely long-branched. If substitution saturation is an issue for misplacing either liverworts or mosses as sisters to all other land plants, then there should also be ingroup-outgroup LBA. A false split of liverwort + outgroup versus the rest, or moss + outgroup versus the rest, has a lower chance to be supported than would a false hornwort + outgroup versus the rest split. The latter directly opens the door for a Septaphyta clade (see Fig. 5).

Let's have a look at the trees of the four genes supporting the Septaphyta split, as the best alternative. ("AA tree/PP" is the amino acid tree provided by Sousa et al.; BS support refers to my unpartitioned ML analyses)
  • atp8 — The AA tree is a star tree (comb), strongly distorted by LBA: a Coleochaetales + seed plants | Zygnematales + all other land plants splits has a PP = 1; the short- and long-branched lycophytes are not resolved as sisters.
  • rpl16 — Also here, the AA tree is star-like regarding deep relationships: (i) green algae (unresolved), PP = 1; (ii) liverworts, very long root, little internal resolution PP = 1; (iii) mosses (unresolved), root half as long as for liverworts, PP = 1; (iv) higher land plants, short root, PP = 0.88.
  • rps3 — No ingroup-outgroup LBA, shortest-branched ingroup, liverworts, resolved as sister to mosses + rest (PP = 0.77); thus, AA tree, not affected by saturation issues, rejects the Septaphyta (PP < 0.23).
  • rps13 — Again, the AA tree is star-like, with five tips: (i) green algae (PP = 1); (ii) long-rooted hornworts (PP =1); (iii) liverworts, relatively short root (PP =1); mosses, longer root (PP = 1); (v) higher land plants, shortest root (PP = 0.89).
The Septaphyta root is either extremely short or non-existent, as we would expect for a false clade, because there are no character splits in the matrix that support the taxon split.

Fig. 7 Sousa et al.'s amino acid Bayesian MRC trees (top row) compared to codon-naive nucleotide ML trees (bottom row) producing highest BS support for a Septaphyta clade. Note that in two cases, rpl16 and rps13, the 'best-known' ML tree shows a competing split with much lower support.

Typically, since we are looking at a deep split, we would expect that support increases when shifting from (codon-naive) nucleotide to amino acid analysis, because we eliminate terminal noise. However, we observe the opposite (Bayesian PP more easily converges to unambiguous support than BS values). The difference between our codon-naive nucleotide ML and Sousa et al.'s amino acid MRC trees tells us that it is mostly information from the 3rd codon position that triggers a Septaphyta versus the rest split for these four genes — ie. potentially synonymous substitutions that Sousa et al. filtered against.

Where does the high support comes from for the Septaphyta clade in their combined tree? That tree is based on a matrix, that should have a signal in-between our codon-naive nucleotide and their amino acid analysis.

A five-taxon problem with a glitch

Sousa et al.'s study is exemplary, in that it provides a careful, and well documented, analysis of the combined data. If you want to infer a potentially good tree, this is one way to do it.

However, their Septaphyta clade is most likely a branching artifact. It still combines data that, genuinely, provides not only diffuse but conflicting information about how the main lineages of land plants diverged from each other (Fig. 6). No analysis, no matter how sophisticated and well-crafted, can compensate for the deficits of the underlying data. By filtering out "noise", one also filters out actual conflicting signal. In this case, this is about how liverworts, mosses, and hornworts stand in relation to the extremely long-branched and divergent outgroup, the green algae, and their increasingly evolved siblings, the higher land plants (lycophytes, ferns, and seed plants). It is another example of what I pointed out in last week's post: Big Data invites big (ie. well supported) errors.

It is important to realize that, although we use many more OTUs, we are still looking at a five-taxon problem. When our data supports one split (or prefers it, being biased or not), there are only three more alternatives to select from.

Ingroup-outgroup LBA draws the hornworts, as the genetically most distinct (longest-rooted) lineage of the "bryophytes", away from liverworts and the lycophyte Huperzia, which connects the much more diverged higher land plants to the bryophytes. This leaves three alternatives:
  1. Liverworts are the sister of higher land plants. Their mitochondria show some affinity, but only to the lycophytes, mostly the low-divergent and better sampled Huperzia; and often together with the hornworts, ie. a split incompatible with the hornwort-green algae versus the rest split.
  2. Mosses are the sister of higher land plants, but their mitochondria show very little affinity to any of them (including Huperzia). In fact, they seem to have the most primitive of all land plant mitochondria.
  3. Septaphyta are monophyletic, as the trade-off with the least conflict. Being (much) less diverged than the higher seed plants, they are genetically closer, and ± equally close, to the hornworts and the least-evolved higher land plant, the lycophyte Huperzia.
Sousa et al.'s codon-degenerate approach enforced ingroup-outgroup LBA between the hornworts (the worst-sampled ingroup) and the green algae, while decreasing the absolute distance between liverworts and mosses, and increasing their distance to the higher land plants. That is, Alternative 3 outcompetes Alternative 1. Alternative 2 has no support in the data.

Are the mosses sister to all land plants?

Probably not. Just because the Septaphyta clade is an artifact, it doesn't mean the Septaphyta cannot be monophyletic — it just means the mitochondrial genes don't provide any clear signal to support or reject such a hypothesis, or any other alternative. The same applies to the mosses as the first diverging lineage; their position in earlier trees is likely also to be an artifact — not a branching, but a data artifact. If their mitochondrial genomes are still very similar to that of the common ancestor of all land plants, then they should be placed like an ancestor in the tree — as a short-branched sister to all of their "offspring", the remaining land plant mitochondria.

Eight of the nine genes that support a moss + outgroup versus the rest split, fail to resolve a moss clade. This is a clear indication that the moss mitochondria are simply primitive (at all gene positions that matter). What divides them from most (or all) other land plants are symplesiomorphies — shared but ancestral sequence patterns. The only gene that prefers both splits at once, mosses as sister to all other lands plants as well as a moss clade, is nad4 (BS = 67 and 62, respectively); but only when using nucleotides.

Fig. 8 A small but important difference: the codon-naive ML nucleotide (nt) tree (left) shows a moss clade as sister to all other land plants. The Bayesian amino acid (aa) MRC tree for the same gene shows a wrong split (purple internode) between green algae + ferns + angiosperms (long-branch, prominent roots) and bryophytes + lycophytes (mostly short-branched, short roots). By translating nucleotides into amino acids one may eliminate genuine discriminative signal encoded in synonymous substitutions, while in other, faster evolving parts of the tree, the same site/gene is oversaturated/biased. The poorly supported sister relationship of Roya and Gonatozygon within the green algae in the nt ML tree is an artifact, correctly resolved by the aa tree based on the same gene.

The shift from nucleotide data (ML / BS) to amino acid data (Bayesian MRC) triggers ingroup-outgroup LBA between green algae and ferns + seed plants (PP = 0.53; 'short-branch culling' of bryophytes and lycophyte Huperzia), and results in a branching artifact — the monophyly of higher land plants is well established, and hence they should form a clade.

By contrast, the genes providing strong support for a moss clade (such as atp1, atp8, ccmB, cob, cox1, cox3, nad2, nad5, rpl6, rpl16, rps3, rps13, and rps14) fail to resolve any deep relationships at all, or prefer different alternatives (including the Septaphyta hypothesis: atp8, rps3, rps13). The combined tree's solution is therefore a least-conflicting one, again — a moss clade (based on a consistent signal in the majority of genes: 13 with BS ≥ 90; in total 24 with BS ≥ 58) as sister to the rest of the land plants (based on a signal found in other genes not reflecting the monophyly of mosses). This solution adds to the phenomenon that moss mitochondria are generally primitive (ie. show a variant basically ancestral to all other land plants), and doesn't conflict with a wide range of otherwise conflicting splits strongly supported by individual genes (in contrast to the Septaphyta clade, see Fig. 6).


Having spent some time with the data and gene trees, I have little hope that mitochondrial data can be used to resolve the deep relationships between land plants. Each tweaking may result in something different, and the support-after-tweaking will be inflated.

Nevertheless, it will be worthwhile to close the data gaps, especially for the hornworts. This may not solve the 5-taxon problem,* but may give unique insights in how the mitochondrial genome evolved and sorted during the initial radiation of land plants.

Notably, the mitochondriomes of land plants can differ in the arrangement of their genes; which means that they recombined with or within the nucleome (or even plastome). While in some plants the mitochondriome is passed on via both parents (like in Ginkgo or Cycas), in others it is only the mother (most, maybe all, angiosperms). Plants may have colonized land more than once, and expanded quickly, so that lineage crossing and also lineage sorting may be an issue — marine species can be cosmopolitan and genetically heterogeneous (cryptic speciation). Thus, some mitochondrial genes may tell different stories from others. Instead of trying to solve which of the alternatives is correct (which is what most phylogenetic literature revolves around), we should find out which gene or part of the genome agrees with which alternative, as they may be all true.

The question to address with mitochondrial data cannot be whether mosses, liverworts or hornworts are the first diverging branch of extant land plants, but should be why moss, liverwort and hornwort mitochondriomes show different stages of evolution, as exemplified by the nad4 trees in Fig. 8.

Data availability

An archive including the support consensus networks (in Splits-NEXUS format) and inferred gene ML trees (plain NEWICK), as well as the comprehensive split support table (XLSX format), has been uploaded to figshare.

* It may help to have an in-depth analysis of a more focused taxon set with no data gaps that minimizes the risk of LBA. This starts with a better selection of taxa representing the higher land plants:
  • Oryza (the rice) is a domesticated, much cultivated, and thus extremely evolved and polyploid monocot. If there is any deep signal embedded in the mitochondria of seed plants, the mitochondrion of rice is probably the last place to look for it.
  • When trying to resolve the deepest land relationships, including a Gnetidae like Welwitschia (a genus that is an evolutionary oddball to start with), makes equally little sense — like any of the three surviving genera of this unique gymnosperm lineage, it is genetically the outer-most tip of an iceberg. Each mutation in its genome is the product of an unknown number of divergences in the past.
  • If any seed plant should be included at all, would be more than sufficient to have: Liriodendron, a magnoliid, and thus a member of the least-diverged angiosperm lineage, plus Cycas, as a representative of an ancient, slow-evolving gymnosperm lineage. These are much more recent additions to the plant Tree of Life.
  • Being a tip of an iceberg applies even more to Isoetum. It is strikingly similar only to the other lycophyte, but it has more data gaps and is much more diverged, and thus can invite branching artifacts. When one wants to dig deep, the much more primitive Huperzia is obviously the better representative.
  • Last, the green algae are the only possible outgroups for inference, but they are poor for this – apparently, their mitochondria have evolved much farther from the common ancestor than those of the land plants. Rather than inferring trees including them, one should infer trees without them, and then optimize their position within trees that will then potentially be unbiased by outgroup-LBA — eg. using the evolutionary placement algorithm, to test the land plant root. An interesting experiment could also be to infer the sequence of the common ancestor(s) of modern-day green algae (lacking a time machine to sample it), and use them instead. The new RAxML-NG, for example, allows for ancestral state reconstruction of nucleotides.
In addition, standard 4-base substitution models are not the best choice when analyzing matrices with a high proportion of ambiguous base calls, like Sousa et al.'s codon-degenerate matrix (note that Sousa et al. already applied models that compensate for substitutional bias). This is especially so, given the importance of synonymous mutations to resolve relationships in the slow-evolving lineages, and slow evolving genes. One could try to use ambiguity-aware substitution models instead. The newest releases of RAxML-NG (Kozlov et al. 2019, Bioinformatics 35: 4453–4455) include models for "phased" and heterozygous data — ie. models that can make use of ambiguity codes as additional information during tree inference (see also Potts et al. 2014, Syst. Biol. 63:1–16).

Monday, June 15, 2020

How I would (realistically) analyze SARS-CoV-2 (or similar) phylogenetic data

While writing this post, the Gisaid database reported over 40,000 SARS-CoV-2 genomes (a week before it was only 32,000), which is rather a lot for a practical data analysis. There have been a few posts on the RAxML Google group about how to analyze such large datasets, and speed up the analysis:
How to run ML search and BS 100 replicates most rapidly for a 30000 taxa * 30000 bp DNA dataset
In response, Alexandros Stamatakis, the developer of RAxML, expressed the basic problem this way:
Nonetheless, the dataset has insufficient phylogenetic signal, and thus it can and should not be analyzed using some standard command line that we provide you here; but requires a more involved analysis, carefully exploring if there is sufficient signal to even represent the result as a binary/bifurcating tree, which I personally seriously doubt.
As demonstrated in our current collection of recent blog posts, we also doubt this. One user, having read some of our posts, wondered whether we can't just use the NETWORK program to infer a haplotype network, instead. Typically, the answer to such a question is "Yes, but..."

So, here's a post about how I would design an experiment to get the most information out of thousands of virus genomes (see also: Inferring a tree with 12000 [or more] virus genomes).

Why trees struggle with resolving virus phylogenies and reconstructing their evolution. X, the genotype of Patient Zero (first host, not first-diagnosed host) spread into five main lineages. All splits (internodes, taxon bipartitions) in this graph are trivial, ie. one tip is seperated from all others. Thus, they, and the underlying data, cannot provide any information to infer a tree, which is a sequence of non-trivial taxon bipartitions. For instance, an outgroup (O)-defined root would require to sample the 'Source' (S), the all-ancestor, hence, defining a split O+S | X+A+B+C+D+E. All permutations of X+descendant | rest should have the same probability, leading to a 5-way split support (BS = 20, PP = 0.2). In reality, however, tree-analyses, Bayesian inference more than ML bootstrapping, may prefer one split over any other, eg. because of long-branch attraction between C and D and 'short-branch culling' of X and E. See also: Problems with the phylogeny of coronaviruses and A new SARS-CoV-2 variant.

Start small

Having a large set of data doesn't mean that you have to analyze it all at once. Big Data does not mean that we must start with a big analysis! The reason we have over 40,000 CoV-2 genomes is simply the recent advances in DNA sequencing, and that we have effectively spread the virus globally, to provide a lot of potential samples.

The first step would thus be:
  1. Take one geographical region at a time, and infer its haplotype network.
This will allow us to define the main virus types infecting each region. It will also eliminate all satellite types (local or global) that are irrelevant for reconstructing the evolution of the virus, as they evolved from a designated ancestor, which is also included in our data.

We can also search the regional data for recombinants — virus may recombine, but to do so they need to come into contact, ie. be sympatric.

C/G→U mutations seen in several of the early sampled CoV-2 genomes: note their mixing-up within haplotypes collected from the cruiseship 'Diamond Princess' (from Using Median-networks to study SARS-CoV-2)

Go big

Once the main virus variants in each region are identified, we can filter them and then use them to infer both:
  1. a global haplotype network, and
  2. a global, bootstrapped maximum-likelihood (ML) tree.
The inference of the latter will now be much faster, because we have eliminated a lot of the non-tree-like signal ("noise") in our data set. The ML tree, and its bootstrap Support consensus network, will give us an idea about phylogenetic relationships under the assumption that not all mutations are equally probable (which they clearly aren't) — this provides a phylogenetic hypothesis that is not too biased by convergence or mutational preferences, eg. replacing A, C, and G by U (Finding the CoV-2 root).

On the other hand, the haplotype network (Median-joining or Statistical parsimony) may be biased, but it can inform us about ancestor-descendant relationships. Using the ML tree as guide, we may even be able to eliminate saturated sites or weigh them for the network inference, provided that the filtered, pruned-down dataset provides enough signal.

With the ML tree, bootstrapping analysis and haplotype networks at hand, it is easy to do things like compare the frequency of the main lineages, and assess their global distribution. This also facilitates the depiction of potential recombination, we can sub-divide the complete genome and infer trees/networks for the different bits, and then compare them.

Only based on nearly 80 CoV-2 genomes stored in gene banks by March 2020. The same can be done for any number of accessions, provided tools are used taking into account the reality of the data. The "x" indicate recombination, arrows ancestor-descendant relationships (from: Using Median networks to study SARS-CoV-2)

Change over time

The most challenging problem for tree inference and haplotype-network inference, is the fact that virus genomes evolve steadily through time. That is, the CoV-2 data will include both the earliest variants of the virus as well as its many, diverse offspring — both ancestors and descendants are included among the (now) 40,000+ genomes. We have shown a number of examples where trees cannot handle ancestor-descendant relationships very well. Haplotype networks, on the other hand, are vulnerable to homoplasy (random convergences). So:
  1. Take one time-slice and establish the amount of virus divergence at that time.
Depending on the virus diversity, one can use haplotype networks or distance-based Neighbor-nets (RAxML can export model-based distances). Even traditional trees are an option — by focusing on one time slice, we fulfill the basic requirement of standard tree-inference: that all tips are of the same age.
  1. Then stack the time-slice graphs together, for a general overview.
It will be straightforward to establish which subsequent virus variant is most similar to which one in the slice before.

Based on such networks, we can also easily filter the main variants for each time slice, to compile a reduced set for further explicit dating analysis, for example via the commonly used dating software BEAST (it was actually designed originally for use with virus phylogenies).

A stack of time-filtered Neighbor-nets (from: Stacking neighbour-nets: a real world example; see Stacking neighbour-nets: ancestors and descendants for an introduction)

Networks and trees go hand-in-hand

With the analyses above, it should be straightforward to model not only the spread of the virus (as GISAID tries to do using Nextstrain) but also its evolution – global and general, local and in-depth, and linear and reticulate.

The set of reconstructions will allow for exploratory data analysis. Conflicts between trees and networks are often a first hint towards reticulate history — in the case of viruses this will be recombination. Keep in mind that deep recombinants will not necessarily create conflict in either trees (eg. decreased bootstrap support) or networks (eg. boxes), but may instead result in long terminal branches.

There may be haplotypes in the regional networks that are oddly different, or create parallel edge-bundles. Using the ML guide-tree, we can assess their relationship within the global data set — whether they show patterns diagnostic for more than one lineage or are the result of homoplasy.

Likewise, there may be branches in the ML tree with ambiguous support, which can be understood when using haplotype networks (see eg., Tree informing networks explaining trees).

Era of Big Data, and Big Error

SARS-CoV-2 data form a very special dataset, but there are parallels to other Big Data phylogenomic studies. Many of these studies produce fully resolved trees: and it is often assumed that the more data are used then the more correct is the result. Further examination is thus unnecessary (and it may be impossible, because of the amount of compiled data).

As somebody who worked at the coal-face of evolution, I have realized that the more data we have then the more complex will be the patterns we can extract from them. The risk of methodological bias will not vanish, but may even increase; and the more I then need to check which part of my data resolves which aspect of a taxonomic group's evolution.

This can mean that, rather than a single tree of 10,000 samples, it is better to infer 100 graphs that each reflects variation among 100 samples and one overall graph that includes only the main sample types. Make use of supernetworks (eg. Supernetworks and gene incongruence) and consensus networks to explore all aspects of a group's evolution. In particular, when you leave CoV-2 behind and task larger groups of coronaviruses (Hack and fish...for recombination in coronaviruses).

Monday, June 8, 2020

Hack and fish ... for recombination in the SARS group

Following the current flow, we have had a few recent coronavirus posts here on the Genealogical World of Phylogenetic Networks. In this post, I'll show the results of a little experiment coming back to David's original post on the topic. Can we use trees to "fish" for evidence of recombination?

As David pointed out, even when we use a phylogenetic-tree inference method to analyze virus genomes, we don't really end up with a phylogenetic tree. Instead, we have a tree reflecting genetic similarity, which will reflect the phylogeny to some unknown extent. The main problem with virus genomes, however, is that they easily recombine — and thus different parts of a virus genome may have different evolutionary histories. A single tree cannot reflect this.

This does not mean that trees cannot tell is something about virus evolution. However, these trees become part of a fishing exercise, looking for different possible historical pathways, which may reflect recombination events.

The tree

Our SARS harvest matrix includes about a dozen sequence groups, which we have labeled Type 1 (the original SARS-CoV) to 9b. Type 7 is the new SARS-CoV-2. For my experiment here, I picked one place-holder sequence per main type (to speed up calculation time). I added two more types: the newly found direct sister of SARS-CoV-2; and some "unclassified" SARS-like viruses from pangolins, which earlier were proposed as sisters, as shown in this tree from the GISAID web page.

The phylogenetic neighborhood of SARS-CoV-2 (GISAID, screenshot captured 3/6/2020). Note the flatness of the CoV(-1; yellow) and CoV-2 (red) subtrees.

GISAID doesn't give the GenBank accession numbers, so we cannot easily say whether our sample matches theirs. However, the tree we can infer from the complete genomes (high-divergent, non-alignable regions excluded) looks very similar, as shown next, and some of the labels match up.

Fig. 1 Maximum likelihood (ML) tree inferred for our sample using (old, v.8.0.20) RAxML. Roman numbers refer to earlier defined Types 1–9 (Tree and viruses – the SARS group), Arabic numbers give nonparametric bootstrap (BS) support based on 100 BS pseudoreplicates (number of neccessary BS replicates determined by the extended majority rule criterion). Branches without Arabic number are unambiguous (BS = 100).

Most importantly, all but three branches have unambiguous support: the phylogeny of this sample is resolved. Unfortunately, as our recurring readers already know, this nearly resolved tree simplifies a much more complex situation.

The Neighbor-net with recombinations and mutational trends (arrows, connectives; cf. Tree and viruses – the SARS group).

Hack and slash

A simple method to fish for different evolutionary histories in a genome is to cut the virus genomes into sub-sequences, infer a tree for each sub-sequence, and then compare the trees. Most researchers compare trees by showing them and discussing which one makes most sense. Here is an example from Corman et al. (2014), who searched for the root of MERS (Middle East Respiratory Syndrome) virus, an illness closely related to SARS.

Reprint of Corman et al. 2014, fig. 3 with colors added to EriCoV (green) and HKU/BtCoV (olive) groups

Each tree in their Fig. 4A and B (Bayesian majority rule consensus trees) was inferred from a different part of the genome. Corman et al.'s focus was to root the MERS viruses by identifying a better outgroup. However, note that the new sister-group (red, green stars – sister to MERS; orange stars – sister to someone else) moves, and so does the green EriCoV clade and the olive HKU/BtCoV group (clade in some trees and grade in others). Do some of these trees get it wrong? Or is, eg. NeoCoV the product of reticulate evolution (here: ancient recombination)? Some parts of its genome might be derived from a common ancestor with MERS (blues), and others from a common ancestor with KW2E (black) and EriCoV (green).

Our complete matrix has 27,333 characters, providing nearly 6,000 distinct alignment patterns (abbreviated DAP, below), which is a lot — the GISAID link above also provides a graphical representation of site divergence. However, probabilistic tree inference methods (ML, Bayes) can handle moderate to high levels of divergence in the data. On the other hand, they also need a certain amount of data to perform well (see also: Inferring a tree with 12000 [or more] virus genomes). So, for my experiment, I hacked the matrix into nine bits of equal size, ie. each submatrix has a bit more than 3,000 nucleotides, providing between 615 (bit #5) and 1029 (bit #1) DAPs.

Fig. 2 Nine ML trees with BS support annotated along branches, each based on a ~3000 nucleotide long bit of the genomes (ordered left-right, top-bottom). Purple highlights branches conflicting with the complete genome tree.

Our nine trees (shown above) are not badly resolved, as most branches get substantial support. But they are not congruent. If we are dealing with recombination, then we might assume that all of these trees do show an actual aspect of the evolutionary history of the genomes. That is, they are all right and wrong, at the same time.

Moreover, we have high supported clades conflicting with the complete genome tree's (Fig. 1) topology. The signal issues, due to recombination (see Trees and viruses...), did not decrease branch support. That is, 6,000+ DAP is a lot, and recombination only affects a part of the complete genome, possibly quite a small part.

Non-trivial evolution needs more than trivial graphs

To depict the reticulate phylogeny of the virus sample, we need to consider the differences seen in the hacked-and-slashed matrix trees. This can easily be illustrated using a network, instead of a set of trees, as shown here.

Fig. 3 A (strict) consensus network of all nine trees, in which the edge lengths give the sum of the branch lenghts in the tree sample. The gray brackets give the topology of the near-fully resolved complete genome tree.

The graph above is a phylogenetic network: the competing edge bundles represent the different inferred histories of bits of the genomes. The SARS-CoV-2 lineage seems to be the product of (ancient) recombination, and recombination also played a role in forming the members of the original SARS-CoV group.

Fig. 4 Pruned consensus network showing only the CoV(-1) lineage exhibiting various levels of recombination within and between clades as defined by the complete genome tree (tree sample sames as in Fig. 3).

Consensus networks can also be used to summarize the support for alternative splits, as shown next.

Fig. 5 Sum-support consensus network based on the bit-wise BS analyses (111/112 pseudoreplicates generated per bit). Only splits are shown occurring in at least 20% of all BS replicates, i.e. splits supported by at least two bits, trivial splits are collapsed. Colored splits represent according groups/clades in the full-genome tree (Fig. 1). Inlet: 'splits rose' showing competing splits patterns within Types II and III (cf. according subtrees/-trunks in Fig. 2 and Fig. 4).

In contrast to the networks before (Figs. 3, 4), generated using the same algorithm*, the BS consensus network in Fig. 5 is not a phylogenetic network. The boxes don't reflect disparate histories of parts of the genomes but the varying support for competing topological alternatives. By summing up the bit-wise BS analyses instead of bootstrapping the entire data (the BS consensus network for the full data, Fig. 1, shows only two boxes), we get a better idea which aspects of the all-genome tree find robust support across the genome.**


Sub-dividing an alignment is a really quick way to fish for evidence of recombination, especially when one then uses a consensus network to summarise the resulting partial trees.

For interpretation, a tree is a very simple, trivial, and hence appealing graph: A is sister to B and so on. Even a child can interpret a tree. Networks are already visually more challenging, but whenever an organism's evolution doesn't follow a tree (as for viruses), we shouldn't use a tree to depict its phylogeny (or reconstruct its evolution).

Data availability

The dataset used for our experiment is a taxon subset of the original data set, available via figshare (with a permanent, hence, citable DOI):
Grimm GW, Morrison D. 2020. Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare Dataset. 


Corman VM, Ithete NL, Richards LR, Schoeman MC, Preiser W, Drosten C, Drexlera JF (2014) Rooting the phylogenetic tree of Middle East respiratory syndrome coronavirus by characterization of a conspecific virus from an African Bat. Journal of Virology 88: 11297–11303.

* SplitsTree includes five options to determine "edge weights" (= edge-lengths) in case of Consensus networks: "median" and "mean" average the branch-lengths in the tree sample; "count", the setting used to generate Support consensus networks, counts how often a certain taxon bipartition (split) is found in the tree sample – an edge length is proportional to the frequency of a split; "sum", used here to generate the first network, summarizes the branch-lengths; and "none" discards both branch-lengths and split frequency.

** A split supported only by one of the nine bits, even if unambiguous, ie. present in all 111 (112) per bit BS replicates, will not be represented in the sum-Support consenus network using a cut-off of 20%.

† The complete set of ML analyses took 20 min on a stand-alone computer; consensus networks are generated in a blink, and take hardly a minute even when using trees with many leaves.

Monday, June 1, 2020

To what degree are Median-joining networks phylogenetic?

In a comment to the recent paper by Forster et al. (2020), Sánchez-Pacheco et al. (2020) argue that Forster et al.'s analysis is "neither phylogenetic nor evolutionary" because it's based on the use of a Median-joining network. They don't re-analyse the data, but instead mostly refer to a paper they published four years ago in Cladistics (Kong et al. 2016), the journal of the Willi-Hennig Society.

In that earlier paper, Kong et al. conclude:
Other than fast computation and very attractive graphics, MJNs [Median-joining networks] harbour no virtue for phylogenetic inference. MJNs are distance-based, unrooted branching diagrams with cycles that say nothing about the evolutionary history due to the absence of direction. MJ was introduced in 1999 and, in contrast to most scientific ideas, its application has spread rapidly through copying the methods of others, and, unfortunately, with little further scrutiny. We hope that the theoretical arguments presented here can reverse this trend.
It seems unlikely that it will, as I will argue here.

What makes a graph a phylogenetic tree or network – direction

Kong et al. argue that a line graph needs to be directed (ie. the edges indicate a time direction) in order to represent a phylogeny, which is a good point. After all, a phylogenetic tree is a directed (rooted) branching diagram that represents the hypothesized relationships among the organisms under study.

A phylogenetic tree (see also: Fritz Müller and the first phylogenetic tree)

A phylogenetic network is the generalization of a phylogenetic tree, as it combines lineage splits (divergences) with lineage anastomoses.

A phylogenetic network including a reticulation leading to a circle in the graph — B is the product of crossing of lineages that produced its sisters A and C.

Since a MJ network is, per se, an undirected graph, it thus cannot be an explicit phylogenetic network.

However, following this argument, few inferred trees are directly a phylogenetic tree, either — including the Nextstrain-generated tree on the GISAID page that is promoted by Mavian et al. 2020 (which is another comment to Forster et al., focusing on data issues). Irrespective of which criterion we use to optimize the tree, almost all trees we infer (with no matter what tree inference software) are unrooted graphs — in general, we root them only after the analysis, by defining one leaf or a subtree as an outgroup. (Note, this includes those based on parsimony, the method of choice of the Willi-Hennig Society and Cladistics to this day.)

The difference between inference and interpretation: Using the tip sequences, we can infer a single most parsimonious (6-step long; using PAUP*'s branch-and-bound or NETWORK's MJ algoritm), but also most likely and shortest (distance-based), unrooted tree. By defining a root – here: one taxon designated as outgroup and assuming that all single-taxon-unique sequence patterns are autapomorphies – we can interpret the inferred tree as four different phylogenetic trees.

The same can be said of MJ networks — outgroup rooting can be applied (Finding the CoV-2 root).

Difficulty in depicting ancestor-descendant relationships

A phylogenetic relationship focuses on ancestors, which, for the purpose of inferring a phylogenetic tree, are considered to be purely hypothetical, although they are not hypothetical in a MJ network (or related graphs). We can easily create character sets where the inferred tree will not "represent the hypothesized relationship". Most parsimony studies show a strict consensus cladogram of most-parsimonious trees (MPT). This is unproblematic, as long as all leaves have the same age, and all of the cladogenic events resulted in unique, lineage-conserved character patterns. We then:
  1. would only infer but a single MPT;
  2. have no zero-length branches.
So, following Hong et al.'s logic, any dataset that results in more than one MPT and has subtrees including zero-length branches (like our example above) cannot qualify as phylogenetic trees.

Median-joining networks are, like MP trees (both use parsimony as the optimality criterion), vulnerable to homoplasy (Using Median networks to study SARS-CoV-2; see also Mavian et al. 2020), but while a MP tree (or any other tree we infer) cannot resolve ancestor-descendant relationships, MJ networks can (see eg. Why do we still use tree for Neanderthal genealogy).

Median (or MJ) network, left, and MPT, right, inferred from a perfect matrix. "x" = all-ancestor, ie. represents the root. "a" is the ancestor of "B" and "C", "d" of "f" to "H", "f" of "G" and "H". While the median networks depicts all ancestor-descendant relationships, the MPT only depicts them indirectly by trichotomies including the ancestor as zero-length branch.
Imperfect matrices (data including homoplasy) lead to wrong edges and branches. Being able to recognize ancestors, the MJ network comes closer to the phylgoenetic tree (same as above; from Clades, cladograms, cladistics, and why networks are inevitable).

Hence, Bandelt et al.'s (1999) statement, as cited by Kong et al., that “reconstructing phylogenies from intraspecific data ... is often a challenging task because of large sample sizes and small genetic distances between individuals”. Such data results in largely uninformative, comb-like MPT strict consensus trees. This is because identical sequences, equally probable alternative pathways, non-dichotomous differentiation patterns, and ancestral sequence variants present in the data increase the number of MPTs (sometimes to near-infinity). This leads to the collapse of branches in the strict consensus tree used to summarize the MPT sample. Probabilistic methods struggle, too, because the likelihood surface of the tree space is too flat to make a call.

[Kong et al. point to the mathematical definition of 'network', as "nothing more than an unrooted branching diagram with reticulation" but not of 'tree', which they consider is always a directed acyclic graph, ie. synonym to 'phylogenetic tree'. However, it is, inference-wise, clearly nothing more than an unrooted branching diagram without reticulation.]

Confusing heuristics with principle

To discredit the MJ network, Kong et al. then "... focus on its phenetic nature."

There is a tendency among cladists to dismiss a method as "distance-based", as this is treated as synonymous with phenetics. In reply, Joe Felsenstein commented on this alleged fundamental difference between distance-based and parsimony methods of tree inference (Felsenstein 2004, Chapter 10, p. 145f, The irrelevance of classification):
The terminology is also affected by the lingering emphasis on classification. Many systematists believe that it is important to label certain methods (primarily parsimony methods) as "cladistic" and others (distance matrix methods, for example) as "phenetic". These are terms that have rather straightforward meaning when applied to methods of classification. But are they appropriate for methods of inferring phylogenies? I don't think they are. Making this distinction implies that something fundamental is missing from the "phenetic" methods, that they are ignoring information that the "cladistic" methods do not. In fact, both methods can be considered to be statistical methods, making their estimates in slightly different ways.
The following chapter in Felsenstein's book (Chapter 11, pp. 147–175) deals exclusively with the "phenetic" distance matrix methods because they were the first to be used to infer phylogenetic trees (their limitations are outlined on pp. 174f).

Because the inference of MJ network starts from the generation of a Minimum-spanning network, which is generated from a distance matrix, Kong et al. argue the MJ network is merely a distance-based graph, ie. "phenetic", and "not phylogenetic". Any NP hard problem requires heuristics but, just because we use a distance-based graph to start with, doesn't determine whether the end-product is or is not a distance-based graph.

For instance, the Neighbor-joining (NJ) algorithm (Felsenstein 2004, p.166ff) is a cluster algorithm, which finds a phylogenetic tree fulfilling either the Minimum evolution (ME, p. 159f) or Least-squares (LS) criteria (p. 148ff). Thus, the tree inferred is, indeed, based on a distance-matrix via NJ, but it is not a cluster dendrogram — instead, it is a ME or LS optimised phylogenetic tree. Similarly, FastTree, IQTree, and RAxML are extremely fast programmes to infer Maximum likelihood (ML) phylogenetic trees; but, while FastTree and IQTree start with "phenetic" Neighbor-joining trees, RAxML (like GARLI before) infers first a quick-and-dirty parsimony tree. The final product in both cases is a topology optimized under ML, and the results are hence ML trees and not distance-based or MP trees (even though they started that way).

The final MJ network shows the most parsimonious evolutionary pathways that change one sequence type into another. When you infer it with the NETWORK program, all inferred mutations are mapped onto the final graph, and, using the Steiner post-analysis step, you can look through all of the MP trees that have been included in this graph. However, according to Kong et al. these are not MP trees:
[Following Farris (1970] Invoking principles of parsimony does not validate a phenetic technique as being a phylogenetic method. Indeed, the best Steiner trees are not necessarily the most parsimonious trees.
Kong et al. did not provide any real-world data examples; possibly because they would be very difficult to find. Just take my simple example above — clearly the MJ tree is actually a most-parsimonious solution to the data. Alternatively, you could take any data set for which you can infer plausible MPTs with (ie. data where the rate of change is low), eg. using the TNT program, and compare the result – the Consensus network of all MPTs, not the collapsed strict consensus tree (Stop using cladograms!) – with the Steiner trees inferred using NETWORK and the MJ algorithm.

Are medians ancestors? Do cycles represent reticulation?

Kong et al.'s final point is:
BEA99 [ie. Bandelt et al. 1999] stressed that median vectors can be interpreted biologically as existing unsampled or extinct ancestral sequences (i.e. they can represent missing intermediates; Fig. 3). However, a median vector in an MJ analysis is a sequence generated by majority, and is a mathematically drawn point in the final MJN that connects a triplet of sequences. The resulting “evolutionary paths in the form of cycles” (BEA99, p. 37) merely illustrates the failure of the algorithm to choose between alternative, equally optimal connections due to the modification of Kruskal’s algorithm. Consequently, a cycle represents an analytical artefact rather than an evolutionary scenario (Salzburger et al., 2011).
It is obvious to anyone who has ever used MJ networks, and is familiar with their own data, that Medians are likely to be ancestors, and that medians separated by parallel edge bundles are usually alternative ancestors. But, like all inferences, MJ networks may not capture the complex truth.

A phylogeny involving a recombination ...
... and the MJ network that can be inferred on the same data including two wrong edges (red). The West-1/East-ancestor recombinant is resolved as the product of hybridisation of the West- and East-ancestors, while West-2, a descendant of West-1, is resolved as hybrid of West-1 and the recombinant. Any tree included in the network would have 7 steps (ie. is most-parsimonious).
These reconstructed medians thus do bring Kong et al. to their only valid point, which, however, doesn't apply to the method as proposed, but is instead a common misinterpretation of MJ networks — their cycles do not necessarily reflect reticulation.

Bandelt et al. (1999) clearly state that the MJ network is only an approximation, to deal with complex situations. The cycles usually represent equally optimal alternative pathways, and are usually the result of homoplasy but not reticulation. The final goal is hence to get a graph with as few reticulation as possible but as many as are necessary (see NETWORK manual on selection of the epsilon parameter and weighting).

The Sanchéz-Pacheco et al. critique of Forster et al.

As I showed in an earlier post using actual CoV-2 data, only this part of anchéz-Pacheco et al.'s critique of Forster et al.'s paper is valid — we do need to be very careful before we interpret parallel edge bundles in virus-based (or other) MJ networks as being evidence for reticulation. MJ networks can be phylogenetic networks, but they are still consensus networks of competing, equally parsimonious alternatives. If we take a strict position, then most MJ networks are probably not phylogenetic networks; but neither are all trees phylogenetic trees.

Everything else in their comment is simply cladistic lore. Most importantly, their critique ignores the fact that the obvious alternative to MJ networks when analysing low-divergent virus data, which is parsimony-based trees, has exactly the same data-inherent shortcomings — ie. vulnerability to homoplasy, impossibility to detect and reconstruct recombination. They also have an extra one: they treat all samples to be of the same age and generation, and thus have to resolve actual ancestors as being sisters. Which increases the number of possible, equally parsimonious solutions.

The 19, 7-step long MPTs that can be inferred for the recombination example using PAUP*'s branch-and-bound algorithm – rooted with the Source, the common ancestor of all ("AllAnc") – and their strict consensus tree (gray background, 11-steps long). "Best" shows a phylogenetic tree that comes closest to the true tree: ancestors are resolved as zero-length tips in clades including their descendants. "Close" denotes trees that only misplace the recombinant (purple), which – being a recombinant of the East ancestor (red) and West-1 (blue) – should be placed in a tree as sister to either parent.

The consequence of this is that what Sánchez-Pacheco et al. and Kong et al. criticize about the MJ networks applies even more to the predominately used phylogenetic trees. As David pointed out earlier (Problems with the phylogeny of coronaviruses): virus trees may be inferred using phylogenetic methods but they effectively depict only similarity patterns.


Bandelt H-J, Forster P, Röhl A. 1999. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37-48.

Forster P, Forster L, Renfrew C, Forster M. 2020. Phylogenetic network analysis of SARS-CoV-2 genomes. PNAS 117:9241–9243.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Mavian C, Kosakovsky Pond S, Marini S, Rife Magalis B, Vandamme A-M, Dellicour S, Scarpino SV, Houldcroft C, Villabona-Arenas J, Paisie TK, Trovão NS, Boucher C, Zhang Y, Scheuermann RH, Gascuel O, Tsan-Yuk Lam T, Suchard MA, Abecasis A, Wilkinson E, de Oliveira T, Bento AI, Schmidt HA, Martin D, Hadfield J, Faria N, Grubaugh ND, Neher RA, Baele G, Lemey P, Stadler T, Albert J, Crandall KA, Leitner T, Stamatakis A, Prosperi M, Salemi M. 2020. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable. PNAS doi:10.1073/pnas.2007295117

Sánchez-Pacheco S, Kong S, Pulido-Santacruz P, Murphy RW, Kubatko L. 2020. Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary. PNAS, doi:10.1073/pnas.2007062117.