Monday, June 1, 2020

To what degree are Median-joining networks phylogenetic?

In a comment to the recent paper by Forster et al. (2020), Sánchez-Pacheco et al. (2020) argue that Forster et al.'s analysis is "neither phylogenetic nor evolutionary" because it's based on the use of a Median-joining network. They don't re-analyse the data, but instead mostly refer to a paper they published four years ago in Cladistics (Kong et al. 2016), the journal of the Willi-Hennig Society.

In that earlier paper, Kong et al. conclude:
Other than fast computation and very attractive graphics, MJNs [Median-joining networks] harbour no virtue for phylogenetic inference. MJNs are distance-based, unrooted branching diagrams with cycles that say nothing about the evolutionary history due to the absence of direction. MJ was introduced in 1999 and, in contrast to most scientific ideas, its application has spread rapidly through copying the methods of others, and, unfortunately, with little further scrutiny. We hope that the theoretical arguments presented here can reverse this trend.
It seems unlikely that it will, as I will argue here.

What makes a graph a phylogenetic tree or network – direction

Kong et al. argue that a line graph needs to be directed (ie. the edges indicate a time direction) in order to represent a phylogeny, which is a good point. After all, a phylogenetic tree is a directed (rooted) branching diagram that represents the hypothesized relationships among the organisms under study.

A phylogenetic tree (see also: Fritz Müller and the first phylogenetic tree)

A phylogenetic network is the generalization of a phylogenetic tree, as it combines lineage splits (divergences) with lineage anastomoses.

A phylogenetic network including a reticulation leading to a circle in the graph — B is the product of crossing of lineages that produced its sisters A and C.

Since a MJ network is, per se, an undirected graph, it thus cannot be an explicit phylogenetic network.

However, following this argument, few inferred trees are directly a phylogenetic tree, either — including the Nextstrain-generated tree on the GISAID page that is promoted by Mavian et al. 2020 (which is another comment to Forster et al., focusing on data issues). Irrespective of which criterion we use to optimize the tree, almost all trees we infer (with no matter what tree inference software) are unrooted graphs — in general, we root them only after the analysis, by defining one leaf or a subtree as an outgroup. (Note, this includes those based on parsimony, the method of choice of the Willi-Hennig Society and Cladistics to this day.)

The difference between inference and interpretation: Using the tip sequences, we can infer a single most parsimonious (6-step long; using PAUP*'s branch-and-bound or NETWORK's MJ algoritm), but also most likely and shortest (distance-based), unrooted tree. By defining a root – here: one taxon designated as outgroup and assuming that all single-taxon-unique sequence patterns are autapomorphies – we can interpret the inferred tree as four different phylogenetic trees.

The same can be said of MJ networks — outgroup rooting can be applied (Finding the CoV-2 root).

Difficulty in depicting ancestor-descendant relationships

A phylogenetic relationship focuses on ancestors, which, for the purpose of inferring a phylogenetic tree, are considered to be purely hypothetical, although they are not hypothetical in a MJ network (or related graphs). We can easily create character sets where the inferred tree will not "represent the hypothesized relationship". Most parsimony studies show a strict consensus cladogram of most-parsimonious trees (MPT). This is unproblematic, as long as all leaves have the same age, and all of the cladogenic events resulted in unique, lineage-conserved character patterns. We then:
  1. would only infer but a single MPT;
  2. have no zero-length branches.
So, following Hong et al.'s logic, any dataset that results in more than one MPT and has subtrees including zero-length branches (like our example above) cannot qualify as phylogenetic trees.

Median-joining networks are, like MP trees (both use parsimony as the optimality criterion), vulnerable to homoplasy (Using Median networks to study SARS-CoV-2; see also Mavian et al. 2020), but while a MP tree (or any other tree we infer) cannot resolve ancestor-descendant relationships, MJ networks can (see eg. Why do we still use tree for Neanderthal genealogy).

Median (or MJ) network, left, and MPT, right, inferred from a perfect matrix. "x" = all-ancestor, ie. represents the root. "a" is the ancestor of "B" and "C", "d" of "f" to "H", "f" of "G" and "H". While the median networks depicts all ancestor-descendant relationships, the MPT only depicts them indirectly by trichotomies including the ancestor as zero-length branch.
Imperfect matrices (data including homoplasy) lead to wrong edges and branches. Being able to recognize ancestors, the MJ network comes closer to the phylgoenetic tree (same as above; from Clades, cladograms, cladistics, and why networks are inevitable).

Hence, Bandelt et al.'s (1999) statement, as cited by Kong et al., that “reconstructing phylogenies from intraspecific data ... is often a challenging task because of large sample sizes and small genetic distances between individuals”. Such data results in largely uninformative, comb-like MPT strict consensus trees. This is because identical sequences, equally probable alternative pathways, non-dichotomous differentiation patterns, and ancestral sequence variants present in the data increase the number of MPTs (sometimes to near-infinity). This leads to the collapse of branches in the strict consensus tree used to summarize the MPT sample. Probabilistic methods struggle, too, because the likelihood surface of the tree space is too flat to make a call.

[Kong et al. point to the mathematical definition of 'network', as "nothing more than an unrooted branching diagram with reticulation" but not of 'tree', which they consider is always a directed acyclic graph, ie. synonym to 'phylogenetic tree'. However, it is, inference-wise, clearly nothing more than an unrooted branching diagram without reticulation.]

Confusing heuristics with principle

To discredit the MJ network, Kong et al. then "... focus on its phenetic nature."

There is a tendency among cladists to dismiss a method as "distance-based", as this is treated as synonymous with phenetics. In reply, Joe Felsenstein commented on this alleged fundamental difference between distance-based and parsimony methods of tree inference (Felsenstein 2004, Chapter 10, p. 145f, The irrelevance of classification):
The terminology is also affected by the lingering emphasis on classification. Many systematists believe that it is important to label certain methods (primarily parsimony methods) as "cladistic" and others (distance matrix methods, for example) as "phenetic". These are terms that have rather straightforward meaning when applied to methods of classification. But are they appropriate for methods of inferring phylogenies? I don't think they are. Making this distinction implies that something fundamental is missing from the "phenetic" methods, that they are ignoring information that the "cladistic" methods do not. In fact, both methods can be considered to be statistical methods, making their estimates in slightly different ways.
The following chapter in Felsenstein's book (Chapter 11, pp. 147–175) deals exclusively with the "phenetic" distance matrix methods because they were the first to be used to infer phylogenetic trees (their limitations are outlined on pp. 174f).

Because the inference of MJ network starts from the generation of a Minimum-spanning network, which is generated from a distance matrix, Kong et al. argue the MJ network is merely a distance-based graph, ie. "phenetic", and "not phylogenetic". Any NP hard problem requires heuristics but, just because we use a distance-based graph to start with, doesn't determine whether the end-product is or is not a distance-based graph.

For instance, the Neighbor-joining (NJ) algorithm (Felsenstein 2004, p.166ff) is a cluster algorithm, which finds a phylogenetic tree fulfilling either the Minimum evolution (ME, p. 159f) or Least-squares (LS) criteria (p. 148ff). Thus, the tree inferred is, indeed, based on a distance-matrix via NJ, but it is not a cluster dendrogram — instead, it is a ME or LS optimised phylogenetic tree. Similarly, FastTree, IQTree, and RAxML are extremely fast programmes to infer Maximum likelihood (ML) phylogenetic trees; but, while FastTree and IQTree start with "phenetic" Neighbor-joining trees, RAxML (like GARLI before) infers first a quick-and-dirty parsimony tree. The final product in both cases is a topology optimized under ML, and the results are hence ML trees and not distance-based or MP trees (even though they started that way).

The final MJ network shows the most parsimonious evolutionary pathways that change one sequence type into another. When you infer it with the NETWORK program, all inferred mutations are mapped onto the final graph, and, using the Steiner post-analysis step, you can look through all of the MP trees that have been included in this graph. However, according to Kong et al. these are not MP trees:
[Following Farris (1970] Invoking principles of parsimony does not validate a phenetic technique as being a phylogenetic method. Indeed, the best Steiner trees are not necessarily the most parsimonious trees.
Kong et al. did not provide any real-world data examples; possibly because they would be very difficult to find. Just take my simple example above — clearly the MJ tree is actually a most-parsimonious solution to the data. Alternatively, you could take any data set for which you can infer plausible MPTs with (ie. data where the rate of change is low), eg. using the TNT program, and compare the result – the Consensus network of all MPTs, not the collapsed strict consensus tree (Stop using cladograms!) – with the Steiner trees inferred using NETWORK and the MJ algorithm.

Are medians ancestors? Do cycles represent reticulation?

Kong et al.'s final point is:
BEA99 [ie. Bandelt et al. 1999] stressed that median vectors can be interpreted biologically as existing unsampled or extinct ancestral sequences (i.e. they can represent missing intermediates; Fig. 3). However, a median vector in an MJ analysis is a sequence generated by majority, and is a mathematically drawn point in the final MJN that connects a triplet of sequences. The resulting “evolutionary paths in the form of cycles” (BEA99, p. 37) merely illustrates the failure of the algorithm to choose between alternative, equally optimal connections due to the modification of Kruskal’s algorithm. Consequently, a cycle represents an analytical artefact rather than an evolutionary scenario (Salzburger et al., 2011).
It is obvious to anyone who has ever used MJ networks, and is familiar with their own data, that Medians are likely to be ancestors, and that medians separated by parallel edge bundles are usually alternative ancestors. But, like all inferences, MJ networks may not capture the complex truth.

A phylogeny involving a recombination ...
... and the MJ network that can be inferred on the same data including two wrong edges (red). The West-1/East-ancestor recombinant is resolved as the product of hybridisation of the West- and East-ancestors, while West-2, a descendant of West-1, is resolved as hybrid of West-1 and the recombinant. Any tree included in the network would have 7 steps (ie. is most-parsimonious).
These reconstructed medians thus do bring Kong et al. to their only valid point, which, however, doesn't apply to the method as proposed, but is instead a common misinterpretation of MJ networks — their cycles do not necessarily reflect reticulation.

Bandelt et al. (1999) clearly state that the MJ network is only an approximation, to deal with complex situations. The cycles usually represent equally optimal alternative pathways, and are usually the result of homoplasy but not reticulation. The final goal is hence to get a graph with as few reticulation as possible but as many as are necessary (see NETWORK manual on selection of the epsilon parameter and weighting).

The Sanchéz-Pacheco et al. critique of Forster et al.

As I showed in an earlier post using actual CoV-2 data, only this part of anchéz-Pacheco et al.'s critique of Forster et al.'s paper is valid — we do need to be very careful before we interpret parallel edge bundles in virus-based (or other) MJ networks as being evidence for reticulation. MJ networks can be phylogenetic networks, but they are still consensus networks of competing, equally parsimonious alternatives. If we take a strict position, then most MJ networks are probably not phylogenetic networks; but neither are all trees phylogenetic trees.

Everything else in their comment is simply cladistic lore. Most importantly, their critique ignores the fact that the obvious alternative to MJ networks when analysing low-divergent virus data, which is parsimony-based trees, has exactly the same data-inherent shortcomings — ie. vulnerability to homoplasy, impossibility to detect and reconstruct recombination. They also have an extra one: they treat all samples to be of the same age and generation, and thus have to resolve actual ancestors as being sisters. Which increases the number of possible, equally parsimonious solutions.

The 19, 7-step long MPTs that can be inferred for the recombination example using PAUP*'s branch-and-bound algorithm – rooted with the Source, the common ancestor of all ("AllAnc") – and their strict consensus tree (gray background, 11-steps long). "Best" shows a phylogenetic tree that comes closest to the true tree: ancestors are resolved as zero-length tips in clades including their descendants. "Close" denotes trees that only misplace the recombinant (purple), which – being a recombinant of the East ancestor (red) and West-1 (blue) – should be placed in a tree as sister to either parent.

The consequence of this is that what Sánchez-Pacheco et al. and Kong et al. criticize about the MJ networks applies even more to the predominately used phylogenetic trees. As David pointed out earlier (Problems with the phylogeny of coronaviruses): virus trees may be inferred using phylogenetic methods but they effectively depict only similarity patterns.


Bandelt H-J, Forster P, Röhl A. 1999. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37-48.

Forster P, Forster L, Renfrew C, Forster M. 2020. Phylogenetic network analysis of SARS-CoV-2 genomes. PNAS 117:9241–9243.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Mavian C, Kosakovsky Pond S, Marini S, Rife Magalis B, Vandamme A-M, Dellicour S, Scarpino SV, Houldcroft C, Villabona-Arenas J, Paisie TK, Trovão NS, Boucher C, Zhang Y, Scheuermann RH, Gascuel O, Tsan-Yuk Lam T, Suchard MA, Abecasis A, Wilkinson E, de Oliveira T, Bento AI, Schmidt HA, Martin D, Hadfield J, Faria N, Grubaugh ND, Neher RA, Baele G, Lemey P, Stadler T, Albert J, Crandall KA, Leitner T, Stamatakis A, Prosperi M, Salemi M. 2020. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable. PNAS doi:10.1073/pnas.2007295117

Sánchez-Pacheco S, Kong S, Pulido-Santacruz P, Murphy RW, Kubatko L. 2020. Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary. PNAS, doi:10.1073/pnas.2007062117.

Monday, May 25, 2020

General remarks on rhyming (From rhymes to networks 2)

In this month's post, I want to provide some general remarks on rhyming and rhyme practice. I hope that they will help lay the foundations for tackling the problem of rhyme annotation, in the next post. Ideally, I should provide a maximally unbiased overview that takes all languages and cultures into account. However, since this would be an impossible task at this time (at least for myself), I hope that I can, instead, look at the phenomenon from a viewpoint that is a bit broader than the naive prescriptive accounts of rhyming used by teachers torture young school kids mentally.

What is a rhyme?

It is not easy to give an exact and exhaustive definition of rhyme. As a starting point, one can have a look at Wikipedia, where we find the following definition:
A rhyme is a repetition of similar sounds (usually, exactly the same sound) in the final stressed syllables and any following syllables of two or more words. Most often, this kind of perfect rhyming is consciously used for effect in the final positions of lines of poems and songs. Wikipedia: s. v. "Rhyme", accessed on 21.05.2020
This definition is a good starting point, but it does not apply to rhyming in general, but rather to rhyming in English as a specific language. While stress, for example, seems to play an important role in English rhyming, we don't find stress being used in a similar way in Chinese, so if we tie a definition of rhyming to stress, we exclude all of those languages in which stress plays a minor role or no role at all.

Furthermore, the notion of similar and identical sounds is also problematic from a cross-linguistic perspective on rhyming. It is true that rhyming requires some degree of similarity of sounds, but where the boundaries are being placed, and how the similarity is defined in the end, can differ from language to language and from tradition to tradition. Thus, while in German poetry it is fine to rhyme words like Mai [mai] and neu [noi], it is questionable whether English speakers would ever think that words like joy could form a rhyme with rye. Irish seems to be an extreme case of very complex rules underlying what counts as a rhyme, where consonants are clustered into certain classes (b, d, g, or ph, f, th, ch) that are defined to rhyme with each other (provided the vowels also rhyme), and as a result, words like oba and foda are judged to be good rhymes (Cuív 1966).

When looking at philological descriptions of rhyme traditions of individual languages, we often find a distinction between perfect rhymes on the one hand and imperfect rhymes on the other. But what counts as perfect or imperfect often differs from language to language. Thus, while French largely accepts the rhyming of words that sound identical, this is considered less satisfactory in English and German, and studies seem to have confirmed that speakers of French and English indeed differ in their intuitions about rhyme in this regard (Wagner and McCurdy 2010.

Peust (2014) discusses rhyme practices across several languages and epochs, suggesting that similarity in rhyming was based on some sort of rhyme phonology, that would account for the differences in rhyme judgments across languages. While the ordinary phonology of a language is a classical device in linguistics to determine those sounds that are perceived as being distinctive in a given language, rhyme phonology can achieve the same for rhyming in individual languages.

While this idea has some appeal at first sight, given that the differences in rhyme practice across languages often follow very specific rules, I am afraid it may be too restrictive. Instead, I rather prefer to see rhyming as a continuum, in which a well-defined core of perfect rhymes is surrounded by various instances of less perfect rhymes, with language-specific patterns of variation that one would still have to compare in detail.

Beyond perfection

If we accept that all languages have some notion of a perfect rhyme that they distinguish from less perfect rhymes, which will, nevertheless, still be accepted as rhymes, it is useful to have a quick look at differences in deviation from the perfect. German, for example, is often used as an example where vowel differences in rhymes are treated rather loosely; and, indeed, we find that diphthongs like the above-mentioned [ai] and [oi] are perceived as rhyming well by most German speakers. In popular songs, however, we find additional deviations from the perceived norm, which are usually not discussed in philological descriptions of German rhyming. Thus, in the famous German Schlager Griechischer Wein by Udo Jürgens (1934-2014), we find the following introductory line:
Es war schon dunkel, als ich durch Vorstadtstrassen heimwärts ging.
Da war ein Wirtshaus, aus dem das Licht noch auf den Gehsteig schien.
[Translation: It was already dark, when I went through the streets outside of the city. There was a pub which still emitted light that was shining on the street.]
There is no doubt that the artist intended these two lines to rhyme, given that the overall schema of the song shows a strict schema of AABCCB. So, in this particular case, the artist judged that rhyming ging [gɪŋ] with schien [ʃiːn] would be better than not attempting a rhyme at all, and it shows that it is difficult to assume one strict notion of rhyme phonology to guide all of the decisions that humans make when they create poems.

More extreme cases of permissive rhyming can be found in some traditions of English poetry, including Hip Hop (of course), but also the work of Bob Dylan, who does not have a problem rhyming time with fine, used with refused, or own with home, as in Like a Rolling Stone. In Spanish, where we also find a distinction between perfect (rima consonante) and imperfect rhyming (rima asonante), basically all that needs to coincide are the vowels, which allows Silvio Rodriguez to rhyme amór with canción in Te doy una canción.

While most languages coincide on the notion of perfect rhymes (notwithstanding certain differences due to general differences in their phonology), the interesting aspects for rhyming are those where they allow for imperfection. Given that rhyming seems to be something that reflects, at least to some extent, a general linguistic competence of the native speakers, a comparison of the practices across languages and cultures may help to shed light on general questions in linguistics.

Rhyming is linear

When discussing with colleagues the idea of making annotated rhyme corpora, I was repeatedly pointed to the worst cases, which I would never be able to capture. This is typical for linguists, who tend to see the complexities before they see what's simple, and who often prefer to not even try to tackle a problem before they feel they have understood all the sub-problems that could arise from the potential solution they might want to try.

One of the worst cases, when we developed our first annotation format as presented last year (List et al. 2019), was the problem of intransitive rhyming. The idea behind this is that imperfect rhyming may lead to a situation where one word rhymes with a word that follows, and this again rhymes with a word that follows that, but the first and the third would never really rhyme themselves. We find this clearly stated in Zwicky (1976: 677):
Imperfect rhymes can also be linked in a chain: X is rhymed (imperfectly) with Y, and Y with Z, so that X and Z may count as rhymes thanks to the mediation of Y, even when X and Z satisfy neither the feature nor the subsequence principle.
Intransitive rhyming is, indeed, a problem for annotation, since it would require that we think of very complex annotation schemas in which we assign words to individual rhyme chains instead of just assigning them to the same group of rhymes in a poem or a song. However, one thing that I realized afterwards, which one should never forget is: rhyming is linear. Rhyming does proceed in a chain. We first hear one line, then we hear another line, etc, so that each line is based on a succession of words that we all hear through time.

It is just as the famous Ferdinand de Saussure (1857-1913) said about the linguistic sign and its material representation, which can be measured in a single dimension ("c'est une linge", Saussure 1916: 103). Since we perceive poetry and songs in a linear fashion, we should not be surprised that the major attention we give to a rhyme when perceiving it is on those words that are not too far away from each other in their temporal arrangement.

The same holds accordingly for the concrete comparison of words that rhyme: since words are sequences of sounds, the similarity of rhyme words is a similarity of sequences. This means we can make use of the typical methods for automated and computer-assisted sequence comparison in historical linguistics, which have been developed during the past twenty years (see the overview in List 2014), when trying to analyze rhyming across different languages and traditions.


When writing this post, I realized that I still feel like I am swimming in an ocean of ignorance when it comes to rhyming and rhyming practices, and how to compare them in a way that takes linguistic aspects into account. I hope that I can make up for this in the follow-up post, where I will introduce my first solutions for a consistent annotation of poetry. By then, I also hope it will become clearer why I give so much importance to the notion of imperfect rhymes, and the emphasis on the linearity of rhyming.


Brian Ó Cuív (1966) The phonetic basis of classical modern irish rhyme. Ériu 20: 94-103.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Nathan W. Hill and Christopher J. Foster (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Peust, Carsten (2014) Parametric variation of end rhyme across languages. In: Grossmann et al. Egyptian-Coptic Linguistics in Typological Perspective. Berlin: Mouton de Gruyter, pp. 341-385.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne:Payot.

Wagner, M. and McCurdy, K. (2010) Poetic rhyme reflects cross-linguistic differences in information structure. Cognition 117.2: 166-175.

Zwicky, Arnold (1976) Well, This rock and roll has got to stop. Junior’s head is hard as a rock. In: Papers from the Twelfth Regional Meeting of the Chicago Linguistic Society 676-697.

Monday, May 18, 2020

Supernetworks and gene tree incongruence

One of the most interesting side effects of using Big Data in phylogenetics has been the realization that individual gene trees may be wrong as an estimation of the genome phylogeny. In the past, the easy "solution" was just to add more and more gene data until any topological ambiguity disappears, leaving us with a fully resolved, all-inclusive, tree.

However, even with increasing amounts of data, some relationships remain ambiguous. Adding data sometimes reveals primary conflict in the data, eg. because of a reticulate phylogenetic history. With modern-day software, we can quickly infer trees and compare them. Less straightforward is the interpretation of the observed conflict and what to do with it.

In our The Emperor has no clothes blog mini-series [Pt 1 – the mighty matK, Pt 2 – a thicket of trees, Pt 3 – conflict or not], I discussed gene incongruence in the complete angiosperm plastome data, which some people have interpreted as evidence for inter-plastome reticulation (recombination), as well as discussing the role that a single gene may play when inferring a trees based on a concatenated, multi-gene data set.

A new analysis

In this new post, I will show what a supernetwork can tell us about gene conflict using the data from a recent (open access and open data) paper investigating deep relationships in (land) plants.
Sousa F, Civáň P, Brazão J, Foster PG, Cox CJ. 2020. The mitochondrial phylogeny of land plants shows support for Setaphyta under composition-heterogeneous substitution models. PeerJ 8:e8995.
This is what the authors state:
Majority-rule consensus trees resulting from the best-fitting Bayesian MCMC analyses of individual genes had low resolution in general, but liverworts were supported (>95% posterior probability (PP)) as the earliest-branching lineage in two genes (nad 3 and nad 5), whereas the mosses were supported as the earliest-branching lineage in one gene (ccm C). All other resolutions of the bryophyte lineages relative to the tracheophyte clade were not statistically supported, and the Setaphyta clade was not resolved in any gene tree.
The authors objective was to filter the data, leaving those genes and information that can provide a better tree — ie. a less ambiguous one, resolving the deep relationships in land plants. Filtering the additive weak signals in each gene, and countering the problem of codon bias, leads to the fully resolved, unambiguously supported concatenated tree shown below.

Sousa et al.'s fig. 2, a fully resolved (all branches have PP = 1) Bayesian tree inferred from the concatenated 36 gene amino acid data. The novelty is the unambiguous support and recognition of a Septaphyta clade composed of liverworts (turquois) and mosses (brown).

The zenodo link provides full access to the data used. It includes:
  • the single-gene nucleotide and translated amino acid matrices,
  • the single-gene Bayesian majority rule consensus trees (MRC)
  • a concatenated 36-gene nucleotide and amino acid matrix, plus one cleaned for substitution bias ("codon degenerate"), and the resulting ML trees.
For this post, my objective is to visualize how the genes' information differs using a somewhat under-used (and under-appreciated) method: the supernetwork.

Why use a Supernetwork

Typically, we might just display the Consensus network of the single-gene trees (eg. summarizing branch-lengths or using averaged branch-lengths), and then compare it to the "preferred" concatenated tree (above). But, the data at hand includes missing gene partitions: some gene trees have fewer leaves. Consensus networks need tree samples that all have the same set of tips, while Supernetworks, on the other hand, can handle trees with different tip sets.

Here is the Supernetwork for our data, using the z-closure algorithm and tree-size averages for edge lengths (Huson et al. 2004, IEEE/ACM Trans. Comput. Biol. Bioinform. 1: 151–158).

Supernetwork of the 36 mitochondrial gene trees. Edge bundles corresponding to branches in Sousa et al.'s fully resolved tree in blue, conflicting edge bundles in red. Gray, systematic groups not represented by according branches (taxon bipartitions) in any gene tree.

This Supernetwork is not based on the MRC provided, but gene-based ML trees. MRC trees are not strictly phylogenetic trees (Losing information in phylogenetic consensus; Why should we present a Bayesian phylogenetic analyses using networks): they can include, as in this case, polytomies originating from collapsing branches below a certain PP threshold. To distill and visualize the primary (raw) signal, I also did not partition the codons; treating the often saturated (in the case of mt DNA) third codon position as distinct data partition, would be obligatory in case we wanted to have the best-possible phylogenies.

Although the network is quite boxy (not tree-like), it does contain nearly all of the edges (branches; in blue) that make up Sousa et al.'s fully resolved, concatenated amino acid tree, with the exception of the Septaphyta edge (as noted by Sousa et al.) Although we used the suboptimal nucleotide matrices and none of our gene trees was fully congruent with the concatenated amino acid tree, they all included aspects of it (despite not filtering or correcting for codon bias).

When talking about gene incongruence, use a Supernetwork

Supernetworks (see also Whitfield et al. 2008, Syst. Biol. 57: 939–947) are much under-used graphs, ideal to visualize statements such as the one by Sousa et al. quoted above. They not only show the full extent of the gene tree incongruence but also to what degree it relates to short or long branches in the gene trees — this is something that, eg., cloudograms, fail to do. While a cloudogram is a summary of cladograms (sister relationships), a Supernetwork is a summary of phylograms, phylogenetic trees informing us about the amount of change. Keeping an eye on branch lengths helps a lot when interpreting branch support and discussing tree topologies, especially trees based on concatenated gene data (filtered or not).

Monday, May 11, 2020

A new SARS-CoV-2 variant?

In previous blog posts, Guido has examined the phylogenetic patterns in the current SARS-CoV-2 outbreak, responsible for the socially disruptive Covid-19 pandemic:
These patterns are traceable because, being a virus, there is a high mutation rate in the genome, and many genomes have been sequenced. Even on the Diamond Princess boat, it is clear that a number of genetic variants arose during its few weeks of quarantine.

Guido analyzed in detail some of these known variants, and their associated genome mutations. He carefully tried to distinguish possible sequencing artifacts from genuine mutations, and which of the latter seem to be the result of genomic recombination among different strains. Naturally, he did this in the context of using phylogenetic networks as the preferred tool of analysis.

Needless to say, Guido is not the only person to have tried this sort of analysis, although people do not really seem to have grasped that recombination as a molecular process requires the concept of a phylogenetic network. There is an intellectual fixation with phylogenetic trees rather than networks. The tree approach is to detect incompatibilities among the trees, and to deduce recombination as the cause. However, why demonstrate that your preferred analysis method fails, and reach a conclusion from this, when you could simply analyze the data appropriately in the first place?

One recent pre-print that has attracted a lot of attention, based on looking for genetic mutations in a single gene, and then using a tree-based analysis, is:
 Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2

The attention-getting part of the paper is that a particular mutation variant of the virus seems to be getting more common among hosts, and in some places has become the dominant strain. The authors conclude that the mutation has been positively selected due to greater infectivity. This is potentially important because the gene being studied is the Spike (or S) protein, which creates the distinctive crown-like appearance of the virus itself. This crown mediates infection of host cells, and is thus the target of most vaccine strategies and antibody-based therapies. Clearly, then, this variant might be of great practical interest.

However, while the press coverage has been enthusiastic, most of the professional commentary so far has been unimpressed with the authors' conclusions. Basically, the reaction to the authors has been "not so fast, guys". The evidence is suggestive at best, and not yet verified (see We don’t know yet whether a mutation has made SARS-CoV-2 more infectious).


My points in this blog post are about the analyses. There are two parts to the analyses: the identification of mutations and selection, and the study of recombination.

First, only one mutation has been identified, which appears to increase in prevalence through time. So, the conclusion that the new variant is more virulent seems to be based on the idea that it becomes the dominant strain in any population. If this is so, then we still have only one main variant to deal with, in terms of medical response. Indeed, if this variant has been around since February, as the report claims, then most infected people must have it. The only people who wouldn't have this one would be the very earliest cases.

Moreover, if a mutation is positively selected, then it must be difficult to distinguish reticulation from convergence. If variants that gain a mutation via reticulation become dominant, then with every generation we increase the probability that the same mutation will be independently obtained by another virus lineage. Being positively selected, these independent mutations will quickly be dispersed. Given that the virus has been around now for nearly 5 months, with a steadily increasing and diversifying available-host population, there would be plenty of time for convergent evolution of the same beneficial mutation.

Second, phylogenetic trees are often used to try to study the origin of genetic variation, especially if there has been recurrent emergence of particular variants, each of which has subsequently diverged independently. This was Charles Darwin's idea when he talked about the tree as a model for evolution. However, Darwin's book also has a long chapter on hybridization, which cannot easily be studied using the tree model. This apparent contradiction did not concern Darwin, because his book is mostly about the continuity of evolutionary history, which was his main motivation for using the tree model. Hybridization is evidence for continuity, even though the tree model is too simple for studying it. The same argument applies to the study of introgression.

It is the same for processes like recombination, which is conceptually no different, although it occurs at the molecular level, instead. As far as the new paper is concerned, its Figure 1, which is a couple of phylogenetic trees, does not fit well with Figure 6, which is a set of alignments illustrating recombination. Why authors cannot see contradictions between different parts of their own work remains a mystery.

As a final note, the authors raise the specter of re-infection by the new SARS-CoV-2 variant. However, it is our developed immunity (ie. production of antibodies) that protects us, epidemiologically. To allow re-infection, the virus would need to avoid these antibodies. Being more infectious does not automatically make a virus able to avoid antibodies. Nevertheless, I would not be surprised if we learn that some people become ill more than once. (NB. This is different from saying that people have multiple strains. Multiple infections do not necessarily result in multiple illnesses, because of the antibodies.) A bigger concern for new illnesses is likely to be the observed large variation in the amount of antibodies that people produce (more is better, of course).

Monday, May 4, 2020

Finding the CoV-2 root

In my last post, I looked at the prospects and pitfalls of using Median networks to trace virus evolution in the case of the SARS-CoV-2 virus. In this post, I will explore how we can try to root the CoV-2 MJ network, and why using an outgroup, as done by Forster et al., PNAS (2020), is not the best choice.

We'll stick to our 88 sequence dataset because I have already investigated its characteristics in my last post (XLSX-file included in the figshare file set). Here's the unweighted MJ network that can be inferred from these data, including all 146 mutation patterns (145 characters because one indel overlaps with a SNP – single-nucleotide polymorphism).

Median-joining network for the 88 samples in our early March harvest, color-coded for provenance and with sample dates. Four mutations (purples) are resolved as homoplasies. Red edges – potential recombination with unsampled types, line thickness gives here the number of deviating SNPs. Forster et al's Types given for orientation.

As in Forster et al.'s graph, we have one box in the central part of the graph, probably between Forster et al.'s type B (the big pie in the center and its satellites) and their type C (here: the long-edge global group including the Australian and European samples).

There's a useful rule-of-thumb in population genetics: a widespread, frequent haplotpype with many satellite types is often the ancestral type of the investigated sample. This, in our case, includes the reference CoV-2 genome ("Wuhan 1"; NC_0455512, sampled 26/12/2019). Having investigated in detail the data behind the graph (see the last post; adding sample date, provenance, graph above), we can put forward hypotheses as to what degree the parallel edge bundles represent alternative evolutionary scenarios, or are alternatively the result of potential recombinants between CoV-2 sub-lineages.

This allows us to depict an evolutionary scenario for our early samples, to picture how (i) the putative original variant (Wuhan 1/Type B) was distributed during the intitial phase (largely unmodified — light gray arrows in the next figure), (ii) where mutations happened to give rise to sequentially new (sub)types, and (iii) where recombination may have happened (crosses in the figure). Some links (the dotted lines) require further data in order to decide whether the shared mutation is lineage-diagnostic (as indicated by the MJ network) or a convergence.

Early evolution of CoV-2 in time (earliest dates) and space (coloring). Different grays distinguish the main two/three lineages: 20% gray, original Wuhan type (Forster et al.'s type B), dispersed unmodified to rest of China (sampled), Nepal (not sampled), the cruiseship (sampled) and North America (not sampled); 40% gray, potential type C differing by one transversion (basic type not sampled); 60% gray, Forster et al.'s type A differing by two transitions, basic variant found in a sample from Taiwan (Jan 31st). The circle sizes give the number of additional mutations within a lineage and geographic cluster; the x indicate potential recombination (within or between main types/lineages).

The early samples demonstrate that the later USA samples were infected by various (sub)types by mid-/end of January (by up to six lineages), while most of the variation arising in locked-down Wuhan did not escape (at this early stage) — the earliest two samples from 23/12 (MT019529) and 26/12 (reference genome) differ by three mutations.

The quarantined cruise-ship in Japan was infected with the unmodified Wuhan 1 type, which then evolved within the vessel's population. So, this quarantine worked, because the vessel's mutated viruses are not found elsewhere. While the 11121-transition has probably been propagated in the vessel's population via recombination, its occurrence outside (in the Jetsetter/USA lineage, type C?, and USA-Type A) could be due to homoplasy: both the Jetsetter/USA and the A-type USA genomes are (strongly) derived. The 24072 and 28892-transitions point to reticulation between (less evolved) American B- and (highly evolved) A-type lineages; the MJ network can't resolve the resulting box because the American A-type showing the 24072 mutation is strongly derived.

Note: It's also interesting to compare our graph with the tree-based virus "phylogeny" on the GISAID page, which doesn't seem to include the cruise-ship samples. Note that most of the deep branches of the GISAID tree are unsupported ("no mutation"), and samples identical to the reference can be found among the early samples of most main "clades" depicted in the GISAID "phylogeny".

Substitution probabilities

It is also straightforward to identify likely (→ U) and less likely substitutions (all others), as shown in the table.

There is a clear substitutional bias, as transitions are more likely than transversions, the approximate substitution model is abaaba for substitutions replacing the reference / CoV-2-consensus nucleotide. But the model is asymmetrical: Us are more likely to replace C than vice versa, while A/G transitions are balanced. Stochastically distributed singleton/rare mutations have a high probability to show a U, in general. So, a shared C is more likely to be a conserved, shared ancestral pattern (what Hennig called a "symplesiomorphy"). A shared U may be a uniquely shared, derived pattern (a "synapomorphorphy"), or a convergently (in parallel) obtained, derived pattern, a homoplasy. Low-frequency Cs, but also A and Gs at predominately U positions, are most probably synapormorphies as well (based on the data situation and observed substitution probabilities).

Currently, there is no maximum likelihood analog to Median networks, but one could weight mutation patterns differently (see, e.g., guidelines provided in NETWORK under the Help > About menu item in the Median Joining analysis window).

With each successive virus generation, the probability for a homoplasious U increases. Thus, when using MJ networks for virus evolution, we should consider analyzing the data at different time-points, rather than including all of the data in one large analysis (see also our posts on stacking Neighbor-nets: introduction, fossil king ferns, and manual alphabets).

Homoplasy + distant outgroups = wrong roots

By relying on a distantly related sister-lineage to infer an outgroup root of the MJ network, Forster et al. likely got the basic relationships wrong.

Central part of the original outgroup-rooted "phylogenetic network". Coloring after Forster et al. (2020).

Their Type A is probably not ancestral to Type B/Wuhan 1, but derived from it or representing an early split.

Same graph, mutation arrows taking into account observed mutation probabilities (our 88 genomes data) and assuming that there was no recombination among earliest types of each lineage.

The 3 Us shared by the bat outgroup and (part of) Type A (8782, 18060, 29095) likely represent homoplasy in distantly related sister lineages (cf. our last SARS virus post). Being homoplasies, they produce a network box reflecting alternative mutational pathways but not recombination. Homoplasious (convergently evolved) mutational patterns accumulate with increasing phylogenetic distance. Neutral mutations have a generally higher chance to replace a C by an U, back-mutations are less likely, and some sites are more likely to be mutated than others. Hence, there is a good chance that the bat sister-CoV-virus shows more shared mutational patterns with a derived CoV-2 lineage (ie. derived Type A variants) than with the ancestral one (Type B). Distant outgroups should not be used to root Median networks (see also: How do we interpret a rooted median network).

The only possibly genuine mutation would be the shared C (Forster et al.'s pos. 28144, pos. 28219 in our alignment) opposing a U in all Type B and Type C, differentiated only by two incompatible mutations, G → U transitions. The U at pos 28144 may have evolved in parallel in the B and C types; and the actual all-ancestor of CoV-2 (as indicated above) is neither included in Forster et al.'s sample, nor in the current GISAID sample (or our harvest).


It will be interesting to infer MJ networks on time-stamped and geo-referenced subsamples collected in the GISAID database, once the virus has had half a year (or more) to evolve, to see (i) how common homoplasy is, (ii) which sites are likely to accumulate → U substitutions independently of ancestry and (iii) whether there are further and more obvious examples of recombination. The further that genotypes evolve from the original stock, then the more diagnostic their sequences may become, and the easier it will be to decide whether shared but incompatible sequence features are the result of homoplasy or recombination.

Monday, April 27, 2020

From rhymes to networks (A new blog series in six steps)

Whenever one feels stuck in solving a particular problem, it is useful to split this problem into parts, in order to identify exactly where the problems are. The problem that is vexing me at the moment is how to construct a network of rhymes from a set of annotated poems, either by one and the same author, or by many authors who wrote during the same epoch in a certain country using a certain language.

For me, a rhyme network is a network in which words (or parts of words) occur as nodes, and weighted links between the nodes indicate how often the linked words have been found to rhyme in a given corpus

An example

As an example, the following figure illustrates this idea for the case of two Chinese poems, where the rhyme words represented by Chinese characters are linked to form a network (taken from List 2016).

Figure 1: Constructing a network of rhymes in Chinese poetry (List 2016)

One may think that it is silly to make a network from rhymes. However, experiments on Chinese rhyme networks (of which I have reported in the past) have proven to be quite interesting, specifically because they almost always show one large connected component. I find this fascinating, since I would have expected that we would see multiple connected components, representing very distinct rhymes.

It is obvious that some writers don't have a good feeling for rhymes and fail royally when they try to do it — this happens across all languages and cultures in which rhyming plays a role. However, it was much less obvious to me that rhyming can be seen to form at least some kind of a continuum, as you can see from the rhyme networks that we have constructed from Chinese poetry (again) in the past (taken from List et al. 2017).

Figure 2: A complete rhyme network of poems in the Book of Odes (ca. 1000 BC, List et al. 2017)

The current problem

My problem now is that I do not know how to do the same for rhyme collections in other languages. During recent months, I have thought a lot about the problem of constructing rhyme networks for languages such as English or German. However, I always came to a point where I feel stuck, where I realized that I actually did not know at all how to deal with this.

I thought, first, that I could write one blog post listing the problems; but the more I thought about it, I realized that there were so many problems that I could barely do it in one blogpost. So, I decided then that I could just do another series of blog posts (after the nice experience from the series on open problems in computational historical linguistics I posted last year), but this time devoted solely to the question of how one can get from rhymes into networks.

So for the next six months, I will discuss the four major issues that keep me from presenting German or English rhyme networks here and now. I hope that at the end of this discussion I may even have solved the problem, so that I will then be able to present a first rhyme network of Goethe, Shakespeare, or Bob Dylan. (I would not do Eminem, as the rhymes are quite complex, and tedious to annotate).

Summary of the series

Before we can start to think about the modeling of rhyme patterns in rhymed verse, we need to think about the problem in general, and discuss how rhyming shows up in different languages. So, I will start the series with the problem of rhyming in general, by discussing how languages rhyme, where these practices differ, and what we can learn from these differences. Having looked into this, we can think about ways of annotating rhymes in texts in order to acquire a first corpus of examples. So, the following post will deal with the problems that we encounter when trying to annotate the rhyme words that we identify in poetry collections.

If one knows how to annotate something, one will sooner or later get impatient, and long for faster ways to do these boring tasks. Since this also holds for the manual annotation of rhyme collections (which we need for our rhyme networks), it is obvious to think about automated ways of finding rhymes in corpora — that is, to think about the inference of rhyme patterns, which can also be done semi-automatically, of course. So the major problems related to automated rhyme detection will be discussed in a separate post.

Once this is worked out, and one has a reasonably large corpus of rhyme patterns, one wants to analyze it — and the way I want to analyze annotated rhyme corpora is with the help of network models. But, as I mentioned before, I realized that I was stuck when I started to think about rhyme networks of German and English (which are relatively easy languages, one should think). So, it will be important to discuss clearly what seems to be the best way to construct rhyme networks as a first step of analysis. This will therefore be dealt with in a separate blogpost. In a final post, I then plan to tackle the second analysis step, by discussing very briefly what one can do with rhyme networks.

All in all, this makes for six posts (including this one); so we will be busy for the next six months, thinking about rhymes and poetry, which is probably not the worst thing one can do. I hope, but I cannot promise at this point, that this gives me enough time to stick to my ambitious annotation goals, and then present you with a real rhyme network of some poetry collection, other than the Chinese ones I already published in the past.


List, Johann-Mattis, Pathmanathan, Jananan Sylvestre, Hill, Nathan W., Bapteste, Eric, Lopez, Philippe (2017) Vowel purity and rhyme evidence in Old Chinese reconstruction. Lingua Sinica 3.1: 1-17.

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Monday, April 20, 2020

Using Median Networks to study SARS-CoV-2

One software package essential for my research has been the free-/shareware NETWORK by Fluxus Engineering. NETWORK can (now) read in PHYLIP- (and NEXUS-)formatted sequence files to infer Reduced Median (RM) and Median-joining (MJ) networks. The people behind NETWORK have just landed a sort of scientific scoop by publishing a Phylogenetic network analysis of SARS-CoV-2 genomes in PNAS — this is the first such network to be published (appearing the same day as our previous blog post).

Why use Median networks

A full Median Network depicts all possible direct mutational links between the sampled sequences in a data set, hence, is rarely seen in published papers. Here's an example from my own (unpublished) research on oaks.

A full Median network for the 5S nrDNA intergenic spacer (5S-IGS) data of Mediterranean oaks
(Quercus sect. Ilex), The numbers on the edges give mutated alignment positions; the
abbreviations show the the provenance of the sequences (reflecting inter-population
and intra-genomic variation); and the coloration shows the general 5S-IGS variant
(genotype, also called "ribotype" in the literature)

Such graphs can easily get very complex, meaning that the full Median network is often impractical. So, NETWORK gives you two practical options to analyze the data while decreasing the complexity of the resulting graph. One can:
  1. infer the so-called Reduced Median networks (Bandelt et al. 1995; mostly used for binary or RY-transformed data) or
  2. apply the Median-joining (MJ) network algorithm (Bandelt et al. 1999).
[PS: When choosing an inference in NETWORK, you can view a how-to-do step-by-step explanation via Help → About.]
Basically, the MJ network is a summary of the possible parsimony trees for the data, not unlike a strict consensus network of most-parsimonious trees. NETWORK's in-built viewer allows browsing through the parsimony trees that make up the network. The subtle but very important difference is that the sampled sequences are not regarded exclusively as network tips but can be resolved as internal nodes of the graph, the so-called medians. A median represents the "ancestral type" from which the more terminal types were evolved. So, in contrast to a phylogenetic tree (or consensus network), the MJ network can depict ancestor-descendant relationships (see also: Reconstructing ancestors in splits graphs; Clades, cladograms, cladistics, and why networks are inevitable).

This makes Median (in particular MJ) networks more proficient to investigate virus phylogenies than phylogenetic trees. Because we have to expect that our sample includes ancestral and derived variants of the virus' RNA: some of the OTUs are expected to be placed on internal nodes of the phylogenetic tree/network.

So, Forster et al., in their paper, harvested a data repository dedicated to epidemological data (GISAID), and provided the following MJ network based on complete CoV-2 genomes (click to enlarge it).

Forster et al. highlight some (tree-like) features of their MJ network that fit with individual patient travel histories and assumed virus propagation patterns (their data and NETWORK-files can be found here).

The central part of Forster et al.'s MJ network is characterized by several boxes.

Close-up of the central part, the differentiation of the original Type A (as defined by the bat sistergroup) into B and C types. Note that most of the (likely synonymous) mutations during the intitial differentiation phase are transitions from U to C, assuming the sistergroup can inform the ingroup root. The reference sequence (Wuhan 1; NC_045512, sampled Dec 2019) has an ancestral B type, derived from a globablly distributed A-type intermediate between B and the not-sampled last common ancestor ("original genome").

There is a reason why you don't find a MJ network in our last post on coronaovirus genomes (aside from taking non-annotated data from gene banks and hence we lacked quick-to-access background information). This is that inferring a MJ network for the CoV-2-group seems premature at this point. Its interpretation as a phylogenetic network (arrows above) is problematic because we have parallel edges in the graph, and thus do not have unique evolutionary pathways to be inferred.

Let's look at what I mean.

Homoplasy is bad, but recombination is worse

In the "Significance" section of their paper, Forster et al. state
These genomes are closely related and under evolutionary selection in their human hosts, sometimes with parallel evolution events, that is, the same virus mutation emerges in two different human hosts. This makes character-based phylogenetic networks the method of choice for reconstructing their evolutionary paths and their ancestral genome in the human host.
"Parallel evolution events", ie. homoplasy, are the major shortcoming of Median networks, when we interpret them as phylogenetic networks. In a phylogenetic network, a reticulation (forming a "box" in the graph) represents a reticulation event; and the most common in viruses are recombinations.

Let's take the following simple example with four sites (SNPs – single nucleotide polymorphisms) mutated with every generation of the virus, plus one homoplasy (transition from A to G at the forth SNP) and a final recombination event.

Not including the recombinant, the MJ network (below) depicts the true phylogenetic network, which, in the absence of a reticulate event, is a tree. However, one benefit of the MJ network for the use of non-trivial phylogenies, is that the graph is not restricted to dichotomous speciation events: one virus sequence may be source of more than two offspring. The commonly seen phylogenetic trees struggle with such a data situation: they assume that all ancestors are gone (not represented in the data) and have been replaced by exactly two offspring.

Note: The inferred MJ network is an undirected, unrooted graph.
By knowing the source (the all-ancestor), we can interpret it as
a directed phylogenetic network.

When we include the recombinant in this analysis, the MJ network depicts what could be a phylogenetic network. However, it is a wrong one.

The West-1/East-ancestor recombinant is resolved as hybrid/cross of
West- and East-ancestors, and West-2 as cross of West-1 and the
Recombinant. False edges are in red.

It is wrong because Median networks, like parsimony or probabilistic trees, assume that every difference in the sequence is due to a mutation. The East-ancestor mutated only the last of the SNPs in the example. The West-lineage mutated the first SNP, then the third one, and finally (parallel to the East-lineage), the last SNP. Only the last 'West' mutation is found in the recombinant, because it recombined the first half of the West-1 genome with the second half of the East-ancestor.

However, homoplasy on its own can also produce reticulations in the network, as shown next.

The descendant of the East-ancestor shows a West-lineage mutation, leading to a
sequence identical to that of the West-1 x East-ancestor recombinant.

MJ networks can be, but are not always, phylogenetic networks. That is, a box in a MJ network may reflect either of two different things:
  • homoplasy, ie. alternative evolutionary pathways
  • reticulation events.
A Median-Joining network is not enough to study viruses

In their "Significance" section, Forster et al. continue:
The network method has been used in around 10,000 phylogenetic studies of diverse organisms, and is mostly known for reconstructing the prehistoric population movements of humans and for ecological studies, but is less commonly employed in the field of virology.
However, using these networks is tricky, because they (like any parsimony method) struggle with homoplasy, and (like all tree inferences) they cannot handle recombinants. A virus MJ network provides a display of mutation sites in an evolutionary context that, in the presence of ancestor-descendant relationships, does better than a Consensus network of most-parsimonious trees; but it is not a phylogenetic network per se.

Forster et al. provide free access to their data, but only as an RDF file, which is NETWORK's matrix format; and there is no data export option in the freeware version of the program. So, we cannot do any quick downstream investigation of the "published" dataset (and have to rely on our own harvest, as for the previous post, available via figshare).

The reason, we can apply Median networks to complete CoV-2 genomes at all is their low divergence. From our previous post (sampled between December 2019 and March 1st 2020 with a focus on China and the USA), our Group 7 sequences (= SARS-CoV-2) show 146 mutation patterns, 141 site variations and five 3 to 15 nt-long deletions in a stretch covering ~29,700 of the up to 30,000 basepairs of 88 CoV-2-genomes (ends trimmed for missing data). There are also polymorphic base calls in the data, but no prior way to judge whether these represent genuine host polymorphism or simply mediocre sequencing.

Are we detecting homoplasy, or is it recombination?

Since the overall divergence is low, and we have nearly 30,000 basepairs (i.e. 10,000+ for synonymous substitutions underlying &plusm; neutral evolution), we can fairly rule out random homoplasy creating the network patterns. The chance that two independent virus lineages mutate the same position of a total of 30,000 by accident is low. Indeed, most SNPs and three of the deletions occur only in a single sequence, stochastically distributed across the genomes. So, we have:
  • 111 singletons: 94 SNPs, including one set of linked SNPs (6 SNPs, stretching across 50 nt), 13 possible intra-host polymorphisms (PIHP), and 4 deletions.
  • 35 parsimony-informative patterns: 34 SNPs, of which eight involve PIHP, and 1 deletion.
We may still have homoplasy, even in the parsimony-informative sites, because some positions may be more susceptible to mutations than are others, and some mutations may be generally beneficial for the virus' spread. If the sample is large enough, then these should be easy to spot, because they should be frequent, and show character splits incompatible with the rest of the sequences.

In our data, there are two candidates for homoplasy among the parsimony-informative patterns, both of them mutations from G or C in the reference and majority of genomes to U.

Example 1

At alignment position 11121, the majority G is replaced by U in nine genomes, and C in one. If we exclude recombination as a cause, then it represents a safe homoplasy because U-carrying genomes show rare additional mutations deviating from the consensus (which is identical to the reference genome, "Wuhan 1") also seen in G-carrying genomes. Those mutations can be located at the start, center or end of the genomes. In addition, we find one transversion at the G/U site. This could be indicative for the G → U/C site being a site that is subject to increased probability of mutation , and hence homoplasy.

Genomes sharing rare mutations in addition to G/U variation at alignment position 11121. The first occurrence of the U-mutation, not accompanied by any other mutation, was discovered by Japanese researchers on the docked cruise ship. The thickness of the lines shows the number of genomes with identical mutation patterns in the parsimony-informative sites (1 pt = 1 genome), the size of the majority base, always found in the reference genome, its frequency (0.5 pt = 1 genome). The "jet setter" host is a Brazilian coming home from Switzerland via Italy.
However, six of the nine accession are from the "Cruise A" sample, the early quarantined Diamond Princess. Given the setting (a closed, densely populated space) and usually diverse host populations on cruise ships, the otherwise unchanged CoV-2 U-strain (top) and already modified G-strains present in the ship's population may just have recombined: the sequences up- and down-stream of the G/UC-site can be identical in various CoV-2 lineages for hundreds of basepairs.

Example 2

An analogous situation is found for the other candidate position, alignment position 24072 (black arrow), where a C is replaced by U in four genomes. One genome (MN988713; from Illinois, USA, sampled Jan 21st) shows the polymorphism: Y (= C/U). In MN988713, 7 more of the 35 parsimony-informative SNPs are polymorphic: the sequence is a near-perfect (gray arrow) consensus of the original "Wuhan 1" type and a strongly derived type (probably Forster et al.'s A cluster) from a second Illinois host sampled a week later, Jan 28th (MT044257)

Black and gray arrows highligh sites indicative for homoplasy or within-USA recombination. The polymorphic Illinois genome represents a strict consensus of the second Illinois strain (sampled one week later) — directly derived from the California strain, derived within the Type A cluster — and a (not sampled) sequence differing from the Wuhan 1 type (Type B) by one point mutation shared with two North American samples from end of January.

If we assume that the lab didn't just mix up or cross-contaminate the IL1 and IL2 samples, then the MN988713 host was infected twice by the CoV-2 virus: once by the original strain (Forster et al.'s Type B), and a second time by an evolved strain, being the tip of a new CoV-2 lineage that can be traced back (by congruent mutation patterns) to Jan 10th, Shenzhen (Guangdong, China) characterized by two C → U transitions at alignment pos. 8820 and 28182 (Forster et al.'s Type A).

Distinguishing homoplasy and recombination

With a growing set of samples, and given that the virus is free to mutate further in a large amount of hosts, it might become easier and more straightforward to distinguish homoplasy from recombination. It is possible that incongruent character splits have not one but two reasons: they have evolved in parallel but also have been propagated by recombination. The U replacing a G or C (or A) at the same site in one accession reflects a different history from another accession. Homoplasy and recombination result in the same graph inferences.

I agree with Forster et al. that the MJ network is under-used in virology (and other biological disciplines: eg. Why do we still use trees for the Neandertal genealogy; Using median networks to understand the evolution of genera) because it is a perfect tool — especially when used as a data-display network (eg. Networks can outperform PCA ordinations in phylogenetic analysis; Can we depict the evolution of highly conserved gene regions such as the ribosomal RNA genes). It facilitates grouping genotypes, to define ancestors and descendants, and to put them in a preliminary evolutionary framework.

But it cannot replace investigating the sequence mutation patterns, especially when we want to look out for intra-host variation — that is, a patient carrying more than one virus strain (parsimony treats polymorphism as missing data) — and recombinants. Visual inspection and tabulation can do this, although it takes a lot more time (and space).

Inferring a MJ network is Step 1. The obligatory Step 2 is to assess how conserved and/or phylogenetically informative are the reconstructed mutation patterns. This also can help to identify wrong roots inferred via outgroups. Forster et al.'s Type A is likely not the ancestral type, and the shared U-sites with the bat-virus outgroup are due to homoplasy, instead, as I will show in the next post (in two weeks's time).


The complete tabulation of mutation patterns (EXCEL spread sheets) and the CoV-2-only alignment in ready-to-use NEXUS and (extended) PHYLIP format have been added to our figshare coronavirus data and file collection.

Grimm G, Morrison D (2020) Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset.


Bandelt H-J, Forster P, Sykes BC, Richards MB (1995) Mitochondrial portraits of human populations using median networks. Genetics 141: 743–753.

Bandelt H-J, Forster P, Röhl A (1999) Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16: 37–48.

Monday, April 13, 2020

Do people admire your wine brand? A network analysis

Each year, the April edition of Drinks International magazine contains a supplement with a survey called The World’s Most Admired Wine Brands. A group of people are asked to vote for the wine brands they "most admire" based on the criteria that each brand should:
  • be of consistent and / or improving quality
  • reflect its region or country
  • be well marketed and packaged
  • respond to the needs and tastes of the target audience
  • have broad appeal among wine consumers.
The tenth list, for 2020, has just been released, although the award ceremony has been delayed because of the current pandemic. It is therefore worth looking at the past decade, to see what these lists look like.

The people polled each year are drawn from "a broad spectrum of the global wine community", which apparently includes: masters of wine, sommeliers, commercial wine buyers, wine importers and retailers, wine journalists, wine consultants and analysts, wine educators, and other wine professionals. There were only 60 people involved back in 2012, but there are now more than 200.

The people could originally vote for up to six wine brands, but apparently they are now asked for only three choices. Furthermore, they are provided with a list of previous winners, including "a list of more than 80 well-known brands and producers, but as usual we also encourage the option of free choices".

I have compiled the poll results for the years 2011-2020 inclusive. Each of the published lists contains only the results for the top 50 ranked wine brands in that year — all we know about the other brands is that were ranked lower than 50th place in that year. We also do not know how many people actually voted for each of the brands that did make it into the top 50.

Across the 10 years, 116 different brands have appeared at least once in the lists. However, only 9 of these brands appeared in all 10 lists, with a further 15 brands appearing in 9 of the 10 lists. There have been 36 brands (31%) that appeared only once each. There is thus a great deal of variability in "admiration" from year to year.

As usual in this blog, we can get a picture of this variability by using a phylogenetic network, as a form of exploratory data analysis. For the first analysis, I calculated the similarity of the 10 years using the Bray-Curtis distance, based on all 116 wine brands. A Neighbor-net analysis was then used to display the between-year similarities, as shown in the graph above. Years that are closely connected in the network are similar to each other based on the ranking of the wine brands, and those that are further apart are progressively more different from each other.

This graph shows a basic gradient from 2011, at the top-left, anti-clockwise around to 2020, at the top-right. So, the rankings changed progressively through time, which is not unexpected. However, the first three years, clustered at the left, are quite different from the seven later years, at the right. Indeed, one brand (Black Tower) appeared only in the first three years, while five others appeared twice there only.

Also, this year, 2020, is notably different from previous years (as indicated by the long terminal network edge). Indeed, quite a few long-standing wine brands disappeared from the list this year, including six that had appeared in every previous list. These were replaced by 15 new brands, which had never appeared before, including the brand ranked first (Catena, from Argentina).

We can look at the brands (instead of the years) by doing the same form of network analysis. To simplify things, I included only those 55 wine brands that appeared in at least 4 of the 10 lists, as shown in the next graph. Each brand is represented by a dot in the network. Brands that are closely connected in the network are similar to each other based on their rankings across the 10 polls, and those that are further apart are progressively more different from each other.

Network of the Most Admired Wine Brands from 2011-2020

Basically, the network progresses from the most highly admired brands at the top down to the less-admired wine brands at the bottom. High admiration can be achieved either by being ranked in the lists in most years, or by achieving a high ranking in at least a few years.

Clearly, the most highly admired brand is Torres (from Spain), which is marked in red in the network. It was ranked in the top 3 in every year; and, indeed, it was first or second in each of the first nine years, dropping to third this year. Penfolds (from Australia) was ranked in the top 5 every year, while Concha y Toro (from Chile; known for their Casillero del Diablo wines) was always in the top 6. Nothing else comes even close to these three brands (eg. Vega Sicilia, also from Spain, varied from 2nd to 14th).

Those brands that appeared in all 10 lists are shown in blue in the network, while those in green appeared in 9 of the 10 years. Note that some of the latter are at the bottom of the network, indicating that they rarely ranked highly, when they did appear in the lists.

Those countries that produce the most wine dominate the lists, of course, although the two biggest producers, Italy followed by Spain, do not do the best in terms of admiration. This is shown in the table of how many of the 116 wine brands come from each country.

New Zealand
South Africa
For Portugal, 5 of the 7 brands are based in the Port-producing region, rather than making table wine. For France, 10 of the 21 brands are from Bordeaux. Interestingly, 4 of these were among those 6 dropped from the lists for the first time this year. Apparently, admiration for the wine chateaux of Bordeaux is waning, along with declining purchases of their produce.