Monday, April 27, 2020

From rhymes to networks (A new blog series in six steps)


Whenever one feels stuck in solving a particular problem, it is useful to split this problem into parts, in order to identify exactly where the problems are. The problem that is vexing me at the moment is how to construct a network of rhymes from a set of annotated poems, either by one and the same author, or by many authors who wrote during the same epoch in a certain country using a certain language.

For me, a rhyme network is a network in which words (or parts of words) occur as nodes, and weighted links between the nodes indicate how often the linked words have been found to rhyme in a given corpus.
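To make this definition concrete, here is a minimal sketch in Python. The toy corpus, the rhyme labels, and the function name `rhyme_network` are all made up for illustration. Linking every pair of words that share a rhyme label within a stanza is only one naive assumption about how the links could be drawn, not a settled method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical annotated corpus: each poem is a list of stanzas, and each
# stanza is a list of (rhyme_word, rhyme_label) pairs, one pair per line.
poems = [
    [[("night", "a"), ("day", "b"), ("light", "a"), ("way", "b")]],
    [[("bright", "a"), ("night", "a"), ("light", "a"), ("day", "b"), ("say", "b")]],
]

def rhyme_network(corpus):
    """Return edge weights: how often two words share a rhyme label in a stanza."""
    weights = Counter()
    for poem in corpus:
        for stanza in poem:
            # Group the rhyme words of this stanza by their rhyme label.
            groups = {}
            for word, label in stanza:
                groups.setdefault(label, set()).add(word)
            # Assumption: all words in one rhyme group are linked pairwise.
            for words in groups.values():
                for pair in combinations(sorted(words), 2):
                    weights[pair] += 1
    return weights

network = rhyme_network(poems)
for (word_a, word_b), weight in sorted(network.items()):
    print(word_a, word_b, weight)
```

The nodes are the words and the counter values are the link weights; in this toy corpus only "light" and "night" rhyme twice.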

An example

As an example, the following figure illustrates this idea for the case of two Chinese poems, where the rhyme words represented by Chinese characters are linked to form a network (taken from List 2016).


Figure 1: Constructing a network of rhymes in Chinese poetry (List 2016)

One may think that it is silly to make a network from rhymes. However, experiments on Chinese rhyme networks (on which I have reported in the past) have proven to be quite interesting, specifically because they almost always show one large connected component. I find this fascinating, since I would have expected that we would see multiple connected components, representing very distinct rhymes.

It is obvious that some writers don't have a good feeling for rhymes and fail royally when they try to do it — this happens across all languages and cultures in which rhyming plays a role. However, it was much less obvious to me that rhyming can be seen to form at least some kind of a continuum, as you can see from the rhyme networks that we have constructed from Chinese poetry (again) in the past (taken from List et al. 2017).


Figure 2: A complete rhyme network of poems in the Book of Odes (ca. 1000 BC, List et al. 2017)

The current problem

My problem now is that I do not know how to do the same for rhyme collections in other languages. During recent months, I have thought a lot about the problem of constructing rhyme networks for languages such as English or German. However, I always came to a point where I felt stuck, where I realized that I actually did not know at all how to deal with this.

I thought, first, that I could write one blog post listing the problems; but the more I thought about it, I realized that there were so many problems that I could barely do it in one blogpost. So, I decided then that I could just do another series of blog posts (after the nice experience from the series on open problems in computational historical linguistics I posted last year), but this time devoted solely to the question of how one can get from rhymes into networks.

So for the next six months, I will discuss the four major issues that keep me from presenting German or English rhyme networks here and now. I hope that at the end of this discussion I may even have solved the problem, so that I will then be able to present a first rhyme network of Goethe, Shakespeare, or Bob Dylan. (I would not do Eminem, as the rhymes are quite complex, and tedious to annotate).

Summary of the series

Before we can start to think about the modeling of rhyme patterns in rhymed verse, we need to think about the problem in general, and discuss how rhyming shows up in different languages. So, I will start the series with the problem of rhyming in general, by discussing how languages rhyme, where these practices differ, and what we can learn from these differences. Having looked into this, we can think about ways of annotating rhymes in texts in order to acquire a first corpus of examples. So, the following post will deal with the problems that we encounter when trying to annotate the rhyme words that we identify in poetry collections.

If one knows how to annotate something, one will sooner or later get impatient, and long for faster ways to do these boring tasks. Since this also holds for the manual annotation of rhyme collections (which we need for our rhyme networks), it is natural to think about automated ways of finding rhymes in corpora, that is, to think about the inference of rhyme patterns, which can also be done semi-automatically, of course. So the major problems related to automated rhyme detection will be discussed in a separate post.

Once this is worked out, and one has a reasonably large corpus of rhyme patterns, one wants to analyze it — and the way I want to analyze annotated rhyme corpora is with the help of network models. But, as I mentioned before, I realized that I was stuck when I started to think about rhyme networks of German and English (which are relatively easy languages, one should think). So, it will be important to discuss clearly what seems to be the best way to construct rhyme networks as a first step of analysis. This will therefore be dealt with in a separate blogpost. In a final post, I then plan to tackle the second analysis step, by discussing very briefly what one can do with rhyme networks.

All in all, this makes for six posts (including this one); so we will be busy for the next six months, thinking about rhymes and poetry, which is probably not the worst thing one can do. I hope, but I cannot promise at this point, that this gives me enough time to stick to my ambitious annotation goals, and then present you with a real rhyme network of some poetry collection, other than the Chinese ones I already published in the past.

References

List, Johann-Mattis, Pathmanathan, Jananan Sylvestre, Hill, Nathan W., Bapteste, Eric, Lopez, Philippe (2017) Vowel purity and rhyme evidence in Old Chinese reconstruction. Lingua Sinica 3.1: 1-17.

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Monday, April 20, 2020

Using Median Networks to study SARS-CoV-2


One software package essential for my research has been the free-/shareware NETWORK by Fluxus Engineering. NETWORK can (now) read in PHYLIP- (and NEXUS-)formatted sequence files to infer Reduced Median (RM) and Median-joining (MJ) networks. The people behind NETWORK have just landed a sort of scientific scoop by publishing a Phylogenetic network analysis of SARS-CoV-2 genomes in PNAS — this is the first such network to be published (appearing the same day as our previous blog post).

Why use Median networks

A full Median Network depicts all possible direct mutational links between the sampled sequences in a data set, and hence is rarely seen in published papers. Here's an example from my own (unpublished) research on oaks.

A full Median network for the 5S nrDNA intergenic spacer (5S-IGS) data of Mediterranean oaks
(Quercus sect. Ilex). The numbers on the edges give mutated alignment positions; the
abbreviations show the provenance of the sequences (reflecting inter-population
and intra-genomic variation); and the coloration shows the general 5S-IGS variant
(genotype, also called "ribotype" in the literature).

Such graphs can easily get very complex, meaning that the full Median network is often impractical. So, NETWORK gives you two practical options to analyze the data while decreasing the complexity of the resulting graph. One can:
  1. infer the so-called Reduced Median networks (Bandelt et al. 1995; mostly used for binary or RY-transformed data) or
  2. apply the Median-joining (MJ) network algorithm (Bandelt et al. 1999).
[PS: When choosing an inference in NETWORK, you can view a how-to-do step-by-step explanation via Help → About.]
Basically, the MJ network is a summary of the possible parsimony trees for the data, not unlike a strict consensus network of most-parsimonious trees. NETWORK's in-built viewer allows browsing through the parsimony trees that make up the network. The subtle but very important difference is that the sampled sequences are not regarded exclusively as network tips but can be resolved as internal nodes of the graph, the so-called medians. A median represents the "ancestral type" from which the more terminal types were evolved. So, in contrast to a phylogenetic tree (or consensus network), the MJ network can depict ancestor-descendant relationships (see also: Reconstructing ancestors in splits graphs; Clades, cladograms, cladistics, and why networks are inevitable).
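The idea of a median can be illustrated with a small sketch. In the median-network literature, the median of three binary sequences is usually taken as the site-wise majority; the function name `median_vector` and the toy sequences below are assumptions for illustration only, and this is not a re-implementation of what NETWORK does internally.

```python
from collections import Counter

def median_vector(u, v, w):
    """Site-wise majority of three aligned binary (0/1) sequences."""
    if not (len(u) == len(v) == len(w)):
        raise ValueError("sequences must be aligned to the same length")
    # With two states and three sequences, a majority exists at every site.
    return "".join(Counter(site).most_common(1)[0][0] for site in zip(u, v, w))

# Case 1: the median is an unsampled (inferred) type.
inferred = median_vector("0000", "1010", "0011")

# Case 2: the median coincides with one of the sampled sequences, which is
# then placed on an internal node of the graph.
sampled = median_vector("0000", "0010", "0011")

print(inferred, sampled)
```

In the second case the sampled type "0010" sits between the other two, which is the situation described above: a sampled sequence resolved as an internal node rather than as a tip.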

This makes Median (in particular MJ) networks better suited than phylogenetic trees for investigating virus phylogenies, because we have to expect that our sample includes both ancestral and derived variants of the virus' RNA: some of the OTUs are expected to be placed on internal nodes of the phylogenetic tree/network.

So, Forster et al., in their paper, harvested a data repository dedicated to epidemiological data (GISAID), and provided the following MJ network based on complete CoV-2 genomes (click to enlarge it).


Forster et al. highlight some (tree-like) features of their MJ network that fit with individual patient travel histories and assumed virus propagation patterns (their data and NETWORK-files can be found here).

The central part of Forster et al.'s MJ network is characterized by several boxes.

Close-up of the central part, the differentiation of the original Type A (as defined by the bat sistergroup) into B and C types. Note that most of the (likely synonymous) mutations during the initial differentiation phase are transitions from U to C, assuming the sistergroup can inform the ingroup root. The reference sequence (Wuhan 1; NC_045512, sampled Dec 2019) has an ancestral B type, derived from a globally distributed A-type intermediate between B and the not-sampled last common ancestor ("original genome").

There is a reason why you don't find a MJ network in our last post on coronavirus genomes (aside from the fact that we took non-annotated data from gene banks, and hence lacked quick-to-access background information): inferring a MJ network for the CoV-2-group seems premature at this point. Its interpretation as a phylogenetic network (arrows above) is problematic because we have parallel edges in the graph, and thus cannot infer unique evolutionary pathways.

Let's look at what I mean.

Homoplasy is bad, but recombination is worse

In the "Significance" section of their paper, Forster et al. state
These genomes are closely related and under evolutionary selection in their human hosts, sometimes with parallel evolution events, that is, the same virus mutation emerges in two different human hosts. This makes character-based phylogenetic networks the method of choice for reconstructing their evolutionary paths and their ancestral genome in the human host.
"Parallel evolution events", ie. homoplasy, are the major shortcoming of Median networks, when we interpret them as phylogenetic networks. In a phylogenetic network, a reticulation (forming a "box" in the graph) represents a reticulation event; and the most common in viruses are recombinations.

Let's take the following simple example with four sites (SNPs – single nucleotide polymorphisms) mutated with every generation of the virus, plus one homoplasy (transition from A to G at the fourth SNP) and a final recombination event.


Not including the recombinant, the MJ network (below) depicts the true phylogenetic network, which, in the absence of a reticulate event, is a tree. However, one benefit of the MJ network for non-trivial phylogenies is that the graph is not restricted to dichotomous speciation events: one virus sequence may be the source of more than two offspring. The commonly seen phylogenetic trees struggle with such a data situation: they assume that all ancestors are gone (not represented in the data) and have been replaced by exactly two offspring.

Note: The inferred MJ network is an undirected, unrooted graph.
By knowing the source (the all-ancestor), we can interpret it as
a directed phylogenetic network.

When we include the recombinant in this analysis, the MJ network depicts what could be a phylogenetic network. However, it is a wrong one.

The West-1/East-ancestor recombinant is resolved as hybrid/cross of
West- and East-ancestors, and West-2 as cross of West-1 and the
Recombinant. False edges are in red.

It is wrong because Median networks, like parsimony or probabilistic trees, assume that every difference in the sequence is due to a mutation. The East-ancestor mutated only the last of the SNPs in the example. The West-lineage mutated the first SNP, then the third one, and finally (parallel to the East-lineage), the last SNP. Only the last 'West' mutation is found in the recombinant, because it recombined the first half of the West-1 genome with the second half of the East-ancestor.

However, homoplasy on its own can also produce reticulations in the network, as shown next.

The descendant of the East-ancestor shows a West-lineage mutation, leading to a
sequence identical to that of the West-1 x East-ancestor recombinant.
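The point that recombination and homoplasy can leave the same trace in the data can be sketched in a few lines. The four-site sequences below are hypothetical stand-ins for the figure (only the A to G change at the fourth SNP is taken from the text; the other bases and the helper names are assumptions).

```python
ancestor = "AAAA"        # hypothetical all-ancestor
west_1 = "GAAA"          # West lineage: first SNP mutated (base is an assumption)
east_ancestor = "AAAG"   # East lineage: A -> G at the fourth SNP

def recombine(left, right, breakpoint):
    """Join the first part of one sequence to the remainder of another."""
    return left[:breakpoint] + right[breakpoint:]

def mutate(seq, position, base):
    """Return a copy of seq with a point mutation at a 0-based position."""
    return seq[:position] + base + seq[position + 1:]

def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Scenario 1: first half of West-1 recombined with second half of East-ancestor.
recombinant = recombine(west_1, east_ancestor, 2)

# Scenario 2: an East descendant acquires the West mutation in parallel.
parallel = mutate(east_ancestor, 0, "G")

print(recombinant, parallel, recombinant == parallel)
print(hamming(recombinant, west_1), hamming(recombinant, east_ancestor))
```

Both scenarios produce the same sequence, one step away from West-1 and one step away from the East-ancestor, so a method that only sees the sequences closes the same box in either case.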

MJ networks can be, but are not always, phylogenetic networks. That is, a box in a MJ network may reflect either of two different things:
  • homoplasy, ie. alternative evolutionary pathways
  • reticulation events.
A Median-Joining network is not enough to study viruses

In their "Significance" section, Forster et al. continue:
The network method has been used in around 10,000 phylogenetic studies of diverse organisms, and is mostly known for reconstructing the prehistoric population movements of humans and for ecological studies, but is less commonly employed in the field of virology.
However, using these networks is tricky, because they (like any parsimony method) struggle with homoplasy, and (like all tree inferences) they cannot handle recombinants. A virus MJ network provides a display of mutation sites in an evolutionary context that, in the presence of ancestor-descendant relationships, does better than a Consensus network of most-parsimonious trees; but it is not a phylogenetic network per se.

Forster et al. provide free access to their data, but only as an RDF file, which is NETWORK's matrix format; and there is no data export option in the freeware version of the program. So, we cannot do any quick downstream investigation of the "published" dataset (and have to rely on our own harvest, as for the previous post, available via figshare).

The reason we can apply Median networks to complete CoV-2 genomes at all is their low divergence. In the data from our previous post (sampled between December 2019 and March 1st 2020 with a focus on China and the USA), our Group 7 sequences (= SARS-CoV-2) show 146 mutation patterns, 141 site variations and five 3 to 15 nt-long deletions in a stretch covering ~29,700 of the up to 30,000 basepairs of 88 CoV-2-genomes (ends trimmed for missing data). There are also polymorphic base calls in the data, but no prior way to judge whether these represent genuine host polymorphism or simply mediocre sequencing.

Are we detecting homoplasy, or is it recombination?

Since the overall divergence is low, and we have nearly 30,000 basepairs (i.e. 10,000+ for synonymous substitutions underlying ± neutral evolution), we can fairly rule out random homoplasy creating the network patterns. The chance that two independent virus lineages mutate the same position of a total of 30,000 by accident is low. Indeed, most SNPs and three of the deletions occur only in a single sequence, stochastically distributed across the genomes. So, we have:
  • 111 singletons: 94 SNPs, including one set of linked SNPs (6 SNPs, stretching across 50 nt), 13 possible intra-host polymorphisms (PIHP), and 4 deletions.
  • 35 parsimony-informative patterns: 34 SNPs, of which eight involve PIHP, and 1 deletion.
We may still have homoplasy, even in the parsimony-informative sites, because some positions may be more susceptible to mutations than are others, and some mutations may be generally beneficial for the virus' spread. If the sample is large enough, then these should be easy to spot, because they should be frequent, and show character splits incompatible with the rest of the sequences.

In our data, there are two candidates for homoplasy among the parsimony-informative patterns, both of them mutations from G or C in the reference and majority of genomes to U.

Example 1

At alignment position 11121, the majority G is replaced by U in nine genomes, and C in one. If we exclude recombination as a cause, then it represents a safe homoplasy because U-carrying genomes show rare additional mutations deviating from the consensus (which is identical to the reference genome, "Wuhan 1") also seen in G-carrying genomes. Those mutations can be located at the start, center or end of the genomes. In addition, we find one transversion at the G/U site. This could indicate that the G → U/C site is subject to an increased probability of mutation, and hence homoplasy.

Genomes sharing rare mutations in addition to G/U variation at alignment position 11121. The first occurrence of the U-mutation, not accompanied by any other mutation, was discovered by Japanese researchers on the docked cruise ship. The thickness of the lines shows the number of genomes with identical mutation patterns in the parsimony-informative sites (1 pt = 1 genome), the size of the majority base, always found in the reference genome, its frequency (0.5 pt = 1 genome). The "jet setter" host is a Brazilian coming home from Switzerland via Italy.
However, six of the nine accessions are from the "Cruise A" sample, the early quarantined Diamond Princess. Given the setting (a closed, densely populated space) and usually diverse host populations on cruise ships, the otherwise unchanged CoV-2 U-strain (top) and already modified G-strains present in the ship's population may just have recombined: the sequences up- and down-stream of the G/U/C-site can be identical in various CoV-2 lineages for hundreds of basepairs.

Example 2

An analogous situation is found for the other candidate position, alignment position 24072 (black arrow), where a C is replaced by U in four genomes. One genome (MN988713; from Illinois, USA, sampled Jan 21st) shows the polymorphism: Y (= C/U). In MN988713, 7 more of the 35 parsimony-informative SNPs are polymorphic: the sequence is a near-perfect (gray arrow) consensus of the original "Wuhan 1" type and a strongly derived type (probably Forster et al.'s A cluster) from a second Illinois host sampled a week later, Jan 28th (MT044257).

Black and gray arrows highlight sites indicative of homoplasy or within-USA recombination. The polymorphic Illinois genome represents a strict consensus of the second Illinois strain (sampled one week later; directly derived from the California strain, derived within the Type A cluster) and a (not sampled) sequence differing from the Wuhan 1 type (Type B) by one point mutation shared with two North American samples from end of January.
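Checking whether a polymorphic genome looks like a consensus of two other genomes can be done mechanically. The following is a minimal sketch using the standard IUPAC two-base ambiguity codes (written with U, as in the text); the five-site sequences and the function names are hypothetical and do not reproduce the actual Illinois data.

```python
# Standard IUPAC codes for single bases and two-base ambiguities (RNA alphabet).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("U"): "U",
    frozenset("AG"): "R", frozenset("CU"): "Y", frozenset("CG"): "S",
    frozenset("AU"): "W", frozenset("GU"): "K", frozenset("AC"): "M",
}

def strict_consensus(seq_a, seq_b):
    """Site-wise consensus of two aligned sequences, coding differences as ambiguities."""
    return "".join(IUPAC[frozenset((a, b))] for a, b in zip(seq_a, seq_b))

def mismatches(observed, expected):
    """0-based positions at which the observed sequence deviates from the expectation."""
    return [i for i, (o, e) in enumerate(zip(observed, expected)) if o != e]

# Hypothetical informative sites of two strains and of a polymorphic genome.
strain_1 = "GCACU"
strain_2 = "GUACC"
polymorphic = "GYACC"

expected = strict_consensus(strain_1, strain_2)
print(expected, mismatches(polymorphic, expected))
```

A genome that matches the expected consensus at all but a few sites would be a "near-perfect" consensus in the sense used above; whether that reflects double infection, recombination or sample mix-up cannot be decided from the code alone.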

If we assume that the lab didn't just mix up or cross-contaminate the IL1 and IL2 samples, then the MN988713 host was infected twice by the CoV-2 virus: once by the original strain (Forster et al.'s Type B), and a second time by an evolved strain, being the tip of a new CoV-2 lineage that can be traced back (by congruent mutation patterns) to Jan 10th, Shenzhen (Guangdong, China) characterized by two C → U transitions at alignment pos. 8820 and 28182 (Forster et al.'s Type A).

Distinguishing homoplasy and recombination

With a growing set of samples, and given that the virus is free to mutate further in a large number of hosts, it might become easier and more straightforward to distinguish homoplasy from recombination. It is possible that incongruent character splits have not one but two reasons: they have evolved in parallel but also have been propagated by recombination. The U replacing a G or C (or A) at the same site in one accession reflects a different history from another accession. Homoplasy and recombination result in the same graph inferences.

I agree with Forster et al. that the MJ network is under-used in virology (and other biological disciplines: eg. Why do we still use trees for the Neandertal genealogy; Using median networks to understand the evolution of genera) because it is a perfect tool — especially when used as a data-display network (eg. Networks can outperform PCA ordinations in phylogenetic analysis; Can we depict the evolution of highly conserved gene regions such as the ribosomal RNA genes). It facilitates grouping genotypes, to define ancestors and descendants, and to put them in a preliminary evolutionary framework.

But it cannot replace investigating the sequence mutation patterns, especially when we want to look out for intra-host variation — that is, a patient carrying more than one virus strain (parsimony treats polymorphism as missing data) — and recombinants. Visual inspection and tabulation can do this, although it takes a lot more time (and space).

Inferring a MJ network is Step 1. The obligatory Step 2 is to assess how conserved and/or phylogenetically informative are the reconstructed mutation patterns. This also can help to identify wrong roots inferred via outgroups. Forster et al.'s Type A is likely not the ancestral type, and the shared U-sites with the bat-virus outgroup are due to homoplasy, instead, as I will show in the next post (in two weeks' time).

Data

The complete tabulation of mutation patterns (Excel spreadsheets) and the CoV-2-only alignment in ready-to-use NEXUS and (extended) PHYLIP format have been added to our figshare coronavirus data and file collection.

Grimm G, Morrison D (2020) Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581.v2

References

Bandelt H-J, Forster P, Sykes BC, Richards MB (1995) Mitochondrial portraits of human populations using median networks. Genetics 141: 743–753.

Bandelt H-J, Forster P, Röhl A (1999) Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16: 37–48.

Monday, April 13, 2020

Do people admire your wine brand? A network analysis


Each year, the April edition of Drinks International magazine contains a supplement with a survey called The World’s Most Admired Wine Brands. A group of people are asked to vote for the wine brands they "most admire" based on the criteria that each brand should:
  • be of consistent and / or improving quality
  • reflect its region or country
  • be well marketed and packaged
  • respond to the needs and tastes of the target audience
  • have broad appeal among wine consumers.
The tenth list, for 2020, has just been released, although the award ceremony has been delayed because of the current pandemic. It is therefore worth looking at the past decade, to see what these lists look like.


The people polled each year are drawn from "a broad spectrum of the global wine community", which apparently includes: masters of wine, sommeliers, commercial wine buyers, wine importers and retailers, wine journalists, wine consultants and analysts, wine educators, and other wine professionals. There were only 60 people involved back in 2012, but there are now more than 200.

The people could originally vote for up to six wine brands, but apparently they are now asked for only three choices. Furthermore, they are provided with a list of previous winners, including "a list of more than 80 well-known brands and producers, but as usual we also encourage the option of free choices".

I have compiled the poll results for the years 2011-2020 inclusive. Each of the published lists contains only the results for the top 50 ranked wine brands in that year — all we know about the other brands is that they were ranked lower than 50th place in that year. We also do not know how many people actually voted for each of the brands that did make it into the top 50.

Across the 10 years, 116 different brands have appeared at least once in the lists. However, only 9 of these brands appeared in all 10 lists, with a further 15 brands appearing in 9 of the 10 lists. There have been 36 brands (31%) that appeared only once each. There is thus a great deal of variability in "admiration" from year to year.


As usual in this blog, we can get a picture of this variability by using a phylogenetic network, as a form of exploratory data analysis. For the first analysis, I calculated the similarity of the 10 years using the Bray-Curtis distance, based on all 116 wine brands. A Neighbor-net analysis was then used to display the between-year similarities, as shown in the graph above. Years that are closely connected in the network are similar to each other based on the ranking of the wine brands, and those that are further apart are progressively more different from each other.
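For readers who want to see what the distance step involves, here is a minimal sketch of the Bray-Curtis dissimilarity between two years. How the ranks were turned into numbers is not stated above, so the scoring rule (list length + 1 − rank, and 0 for brands not listed) is purely an assumption, as are the placeholder brands, ranks and function names. The Neighbor-net step itself is not shown.

```python
def scores(ranking, list_length=50):
    """Assumed scoring: rank 1 gets list_length points, the last rank gets 1."""
    return {brand: list_length + 1 - rank for brand, rank in ranking.items()}

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two dicts of non-negative scores."""
    brands = set(x) | set(y)
    # Brands missing from a list contribute a score of 0 for that year.
    numerator = sum(abs(x.get(b, 0) - y.get(b, 0)) for b in brands)
    denominator = sum(x.get(b, 0) + y.get(b, 0) for b in brands)
    return numerator / denominator if denominator else 0.0

# Made-up top-3 lists for two years (placeholder brands, not the real polls).
year_1 = scores({"Brand A": 1, "Brand B": 2, "Brand C": 3}, list_length=3)
year_2 = scores({"Brand A": 2, "Brand B": 1, "Brand D": 3}, list_length=3)

distance = bray_curtis(year_1, year_2)
print(round(distance, 3))
```

The value is 0 for identical lists and 1 for lists that share no brands; a matrix of such pairwise values is the kind of input a Neighbor-net analysis takes.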

This graph shows a basic gradient from 2011, at the top-left, anti-clockwise around to 2020, at the top-right. So, the rankings changed progressively through time, which is not unexpected. However, the first three years, clustered at the left, are quite different from the seven later years, at the right. Indeed, one brand (Black Tower) appeared only in the first three years, while five others appeared only there, in two of the three years.

Also, this year, 2020, is notably different from previous years (as indicated by the long terminal network edge). Indeed, quite a few long-standing wine brands disappeared from the list this year, including six that had appeared in every previous list. These were replaced by 15 new brands, which had never appeared before, including the brand ranked first (Catena, from Argentina).

We can look at the brands (instead of the years) by doing the same form of network analysis. To simplify things, I included only those 55 wine brands that appeared in at least 4 of the 10 lists, as shown in the next graph. Each brand is represented by a dot in the network. Brands that are closely connected in the network are similar to each other based on their rankings across the 10 polls, and those that are further apart are progressively more different from each other.

Network of the Most Admired Wine Brands from 2011-2020

Basically, the network progresses from the most highly admired brands at the top down to the less-admired wine brands at the bottom. High admiration can be achieved either by being ranked in the lists in most years, or by achieving a high ranking in at least a few years.

Clearly, the most highly admired brand is Torres (from Spain), which is marked in red in the network. It was ranked in the top 3 in every year; and, indeed, it was first or second in each of the first nine years, dropping to third this year. Penfolds (from Australia) was ranked in the top 5 every year, while Concha y Toro (from Chile; known for their Casillero del Diablo wines) was always in the top 6. Nothing else comes even close to these three brands (eg. Vega Sicilia, also from Spain, varied from 2nd to 14th).

Those brands that appeared in all 10 lists are shown in blue in the network, while those in green appeared in 9 of the 10 years. Note that some of the latter are at the bottom of the network, indicating that they rarely ranked highly, when they did appear in the lists.

Those countries that produce the most wine dominate the lists, of course, although the two biggest producers, Italy followed by Spain, do not do the best in terms of admiration. This is shown in the table of how many of the 116 wine brands come from each country.

Country         Brands
France          21
Australia       16
Spain           15
USA             13
Italy            9
New Zealand      9
Chile            8
South Africa     8
Portugal         7
Germany          3
Argentina        3
Canada           1
China            1
Hungary          1
Lebanon          1

For Portugal, 5 of the 7 brands are based in the Port-producing region, rather than making table wine. For France, 10 of the 21 brands are from Bordeaux. Interestingly, 4 of these were among those 6 dropped from the lists for the first time this year. Apparently, admiration for the wine chateaux of Bordeaux is waning, along with declining purchases of their produce.

Monday, April 6, 2020

Consensus networks: cluster union or edge union?

(Another joint post by David and Guido)

In the book Introduction to Phylogenetic Networks (Morrison 2011), it was convenient to organize the various network types into two groups:
  • those that are intended to provide a summary of various possible phylogenetic histories
  • those that simply summarize the multivariate data into a convenient visualization.
The former are directed networks (ie. they have an explicit root) that are interpretable as phylogenies (ie. phylogenetic hypotheses), while the latter are undirected networks (ie. no root), and therefore do not display historical pathways of evolution.

The consensus network of Holland et al. (2004. Molecular Biology and Evolution 21: 1459-1461) is among the most popular of the networks in the second group. This is formally a Cluster Union Network (CUN), in which the clusters represented by a set of input trees are combined into a single diagram. The clusters are defined by the edges in the original (unrooted) trees - each edge splits the tree into two parts. The trees are thus reduced to the set of splits that appear in at least one of the trees. Each split will then appear in the CUN. If there is no disagreement among the trees, then a split will be represented by a single edge in the CUN; but if there is conflict among the trees then a split will be represented by a set of parallel edges.

A cluster consensus network, with two reticulation areas,
each defined by two sets of parallel edges.

The end result is that the edges of the CUN no longer represent phylogenetic pathways, even if they did do so in the input trees. Some of the edges of the CUN are there solely as part of a set of parallels. To put it another way, some of the edges do not appear in any of the original trees, but are the result of combining the clusters. So, a CUN will vary from tree-like, if there is little conflict among the input trees (ie. compatible splits), to a complex spider-web, if there is a lot of conflict (many incompatible splits).

It is this property of representing splits by a set of edges that prevents the network being a representation of phylogenetic history – formally, the edges define clusters not clades.
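The reduction of trees to splits, and the check for conflict among them, can be sketched as follows. The two five-taxon trees are hypothetical (written as nested tuples and treated as unrooted), and the function names are made up; the code only collects and compares splits and does not draw the network.

```python
from itertools import combinations

def leaves(tree):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial splits (bipartitions of the taxa) defined by the internal edges."""
    taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        cluster = leaves(node)
        rest = taxa - cluster
        # Ignore the root and edges that cut off a single taxon.
        if len(cluster) > 1 and len(rest) > 1:
            found.add(frozenset([cluster, rest]))
        for child in node:
            visit(child)

    visit(tree)
    return found

def compatible(split_1, split_2):
    """Two splits are compatible if at least one pair of their sides is disjoint."""
    return any(not (a & b) for a in split_1 for b in split_2)

tree_1 = (("A", ("B", "C")), ("D", "E"))
tree_2 = ((("A", "D"), ("B", "C")), "E")

all_splits = splits(tree_1) | splits(tree_2)
conflicts = [pair for pair in combinations(all_splits, 2) if not compatible(*pair)]
print(len(all_splits), len(conflicts))
```

Every split in `all_splits` would appear in the consensus network; the incompatible pairs are the ones that have to be drawn as sets of parallel edges.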

Miyagi and Wheeler (2019. Cladistics 35: 688-694) have addressed this issue by defining what they call an Edge Union Network. In essence, it is a subset of the CUN - formally, the EUN is contained within the CUN. It can be thought of as a CUN that contains only those edges that appear in at least one of the input trees. M&W see the edges as "redundant" if they appear in the CUN but not the input trees.

M&W's objective for the EUN is thus "to display the total history of all the input trees, rather than the simplest graph which contains all clusters present in the data" (which the CUN does). M&W see "phylogenetic networks as hypotheses for evolutionary history", so that the EUN can be rooted, just like the input trees. The criterion for the EUN is parsimony, so that "it is important to minimize the number of distinct paths between nodes".

It is important to note in the following discussion that M&W are interested in rooted networks, and so their version of a CUN is not quite the same as the original unrooted Consensus Network.
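The basic idea of an edge union can be sketched for rooted trees as follows. This is a simplified illustration, not M&W's algorithm: the two five-taxon trees are hypothetical, nodes are identified by the set of leaves below them (an assumption of this sketch), and the function names are made up.

```python
def leaf_set(node):
    """Set of leaf names below a node of a nested-tuple rooted tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def edges(tree):
    """Directed (parent, child) edges, with each node labelled by its leaf set."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        parent = leaf_set(node)
        for child in node:
            found.add((parent, leaf_set(child)))
            visit(child)

    visit(tree)
    return found

def name(cluster):
    """Readable node label, e.g. frozenset({'B', 'C'}) becomes 'BC'."""
    return "".join(sorted(cluster))

tree_1 = (("A", ("B", "C")), ("D", "E"))
tree_2 = ((("A", "D"), ("B", "C")), "E")

# Keep exactly those edges that occur in at least one input tree.
edge_union = edges(tree_1) | edges(tree_2)
labels = {(name(parent), name(child)) for parent, child in edge_union}

for parent, child in sorted(labels):
    print(parent, "->", child)
```

By construction, no edge in the union is absent from both trees, which is the property that distinguishes this kind of graph from a cluster-based network.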

Discussion

M&W provide a graphical example, the CUN and EUN of two incongruent rooted trees. Here's a colored version: all nodes (internal and terminal) are re-labelled to express the last common ancestor (LCA) that they represent, and internal (conflicting) tree edges are colored, so we can trace them in the networks.

M&W's example of two incongruent trees (their Fig. 1) and the CUN (their Fig. 2; bottom right).
The stars are nodes of the full CUN (bottom left) not represented in M&W's CUN (bottom right);
the dotted lines indicate dropped edges.

At the bottom left is the strict consensus network, the full CUN, of both trees. Most internal nodes (alternative LCAs) in the trees (ABC, ABCD, AD, DE) are not represented by a single node in the full CUN but by a set of parallel edge bundles (dotted lines). Nonetheless, each edge set represents a branch (clade) in one or both trees – a full CUN depicts all topological alternatives in the two trees. We can extract sets of congruent splits, and reconstruct the two trees in the process.

But since the nodes in the full CUN are not (alternative) LCAs but just connections of (parallel) edges, we cannot interpret this (bottom left) graph as a phylogenetic network. However, the CUN depicted in M&W does do this: we start in the root and walk from node to node along the branches (arrows) until we end up with an explicit phylogenetic network (bottom right). This includes an edge that is not found in any of the trees, a 'false' edge (ABCD-ABC: violet, fat line), while also missing an edge found in one tree (ABCD-BC).

EUN in comparison to a full CUN for M&W's example. The 'false' ABCD-ABC edge
is replaced by a ABCD-BC edge resulting in a phylogenetic network that has
only edges seen in the two phylogenetic trees.

The false ABCD-ABC edge is replaced by an ABCD-BC edge, and ABC is reconnected directly to the root.

An implicit assumption and reason to reduce CUNs into EUNs is that the topological ambiguity in the two trees represents reticulate evolution (eg. hybridization): the trees indicate that the LCA of taxa B and C evolved from the LCA of A to C and the LCA of A to D, but the LCA of A to D is not ancestral to the LCA of A to C. This, however, appears quite strange from an evolutionary point of view. A simple explanation for the conflict between the two trees would be that D is a hybrid of the lineage leading to A, which is the sister of B + C, and E.

A simple evolutionary scenario explaining the difference between the two conflicting trees:
A is the paternal, and E the maternal donor of the hybrid D.

As shown in this figure, the LCA of A to D equals the LCA of A to C (depicted as two different nodes in the EUN), and the LCA of A+D and D+E are just (the precursors of) A and E. Taxon D is related to the ABC clade because the paternal donor has been A, plus to the E lineage via its mother.

This leads us to a principal question: do we want to reduce CUNs, which are splits graphs depicting all splits in a set of trees, ie. competing topological alternatives, to directed phylogenetic networks at all? The EUN has fewer edges (and nodes) than the CUN, but it still is an overly complex graph even for potentially very simple evolutionary scenarios.

On the Mesquite discussion group, a question was asked whether EUNs should be implemented as a means to quickly investigate conflict between trees. The answer to that question is: no. Consensus networks (CUNs) will be more than sufficient, since they are splits-based not node-based.

One application of EUNs may be ancestral state reconstruction. Character progression could be modeled the same way as it currently is along trees. Instead of viewing the nodes as actual LCAs in a reticulation scenario, one could consider them as competing alternative LCAs, and use the results of the ancestral state reconstruction along the EUN, to make a choice among alternatives, or simply to compare different evolutionary scenarios in the same graph.