The Genealogical World of Phylogenetic Networks: Using Median Networks to study SARS-CoV-2

One software package essential for my research has been the free-/shareware NETWORK by Fluxus Engineering. NETWORK can (now) read in PHYLIP- (and NEXUS-)formatted sequence files to infer Reduced Median (RM) and Median-joining (MJ) networks. The people behind NETWORK have just landed a sort of scientific scoop by publishing a Phylogenetic network analysis of SARS-CoV-2 genomes in PNAS — this is the first such network to be published (appearing the same day as our previous blog post).

Why use Median networks

A full Median Network depicts all possible direct mutational links between the sampled sequences in a data set, hence, is rarely seen in published papers. Here's an example from my own (unpublished) research on oaks.

A full Median network for the 5S nrDNA intergenic spacer (5S-IGS) data of Mediterranean oaks
(Quercus sect. Ilex), The numbers on the edges give mutated alignment positions; the
abbreviations show the the provenance of the sequences (reflecting inter-population
and intra-genomic variation); and the coloration shows the general 5S-IGS variant
(genotype, also called "ribotype" in the literature)

Such graphs can easily get very complex, meaning that the full Median network is often impractical. So, NETWORK gives you two practical options to analyze the data while decreasing the complexity of the resulting graph. One can:

infer the so-called Reduced Median networks (Bandelt et al. 1995; mostly used for binary or RY-transformed data) or
apply the Median-joining (MJ) network algorithm (Bandelt et al. 1999).

[PS: When choosing an inference in NETWORK, you can view a how-to-do step-by-step explanation via Help → About.]

Basically, the MJ network is a summary of the possible parsimony trees for the data, not unlike a strict consensus network of most-parsimonious trees. NETWORK's in-built viewer allows browsing through the parsimony trees that make up the network. The subtle but very important difference is that the sampled sequences are not regarded exclusively as network tips but can be resolved as internal nodes of the graph, the so-called medians. A median represents the "ancestral type" from which the more terminal types were evolved. So, in contrast to a phylogenetic tree (or consensus network), the MJ network can depict ancestor-descendant relationships (see also: Reconstructing ancestors in splits graphs; Clades, cladograms, cladistics, and why networks are inevitable).

This makes Median (in particular MJ) networks more proficient to investigate virus phylogenies than phylogenetic trees. Because we have to expect that our sample includes ancestral and derived variants of the virus' RNA: some of the OTUs are expected to be placed on internal nodes of the phylogenetic tree/network.

So, Forster et al., in their paper, harvested a data repository dedicated to epidemological data (GISAID), and provided the following MJ network based on complete CoV-2 genomes (click to enlarge it).

Forster et al. highlight some (tree-like) features of their MJ network that fit with individual patient travel histories and assumed virus propagation patterns (their data and NETWORK-files can be found here).

The central part of Forster et al.'s MJ network is characterized by several boxes.

Close-up of the central part, the differentiation of the original Type A (as defined by the bat sistergroup) into B and C types. Note that most of the (likely synonymous) mutations during the intitial differentiation phase are transitions from U to C, assuming the sistergroup can inform the ingroup root. The reference sequence (Wuhan 1; NC_045512, sampled Dec 2019) has an ancestral B type, derived from a globablly distributed A-type intermediate between B and the not-sampled last common ancestor ("original genome").

There is a reason why you don't find a MJ network in our last post on coronaovirus genomes (aside from taking non-annotated data from gene banks and hence we lacked quick-to-access background information). This is that inferring a MJ network for the CoV-2-group seems premature at this point. Its interpretation as a phylogenetic network (arrows above) is problematic because we have parallel edges in the graph, and thus do not have unique evolutionary pathways to be inferred.

Let's look at what I mean.

Homoplasy is bad, but recombination is worse

In the "Significance" section of their paper, Forster et al. state

These genomes are closely related and under evolutionary selection in their human hosts, sometimes with parallel evolution events, that is, the same virus mutation emerges in two different human hosts. This makes character-based phylogenetic networks the method of choice for reconstructing their evolutionary paths and their ancestral genome in the human host.

"Parallel evolution events", ie. homoplasy, are the major shortcoming of Median networks, when we interpret them as phylogenetic networks. In a phylogenetic network, a reticulation (forming a "box" in the graph) represents a reticulation event; and the most common in viruses are recombinations.

Let's take the following simple example with four sites (SNPs – single nucleotide polymorphisms) mutated with every generation of the virus, plus one homoplasy (transition from A to G at the forth SNP) and a final recombination event.

Not including the recombinant, the MJ network (below) depicts the true phylogenetic network, which, in the absence of a reticulate event, is a tree. However, one benefit of the MJ network for the use of non-trivial phylogenies, is that the graph is not restricted to dichotomous speciation events: one virus sequence may be source of more than two offspring. The commonly seen phylogenetic trees struggle with such a data situation: they assume that all ancestors are gone (not represented in the data) and have been replaced by exactly two offspring.

Note: The inferred MJ network is an undirected, unrooted graph.
By knowing the source (the all-ancestor), we can interpret it as
a directed phylogenetic network.

When we include the recombinant in this analysis, the MJ network depicts what could be a phylogenetic network. However, it is a wrong one.

The West-1/East-ancestor recombinant is resolved as hybrid/cross of
West- and East-ancestors, and West-2 as cross of West-1 and the
Recombinant. False edges are in red.

It is wrong because Median networks, like parsimony or probabilistic trees, assume that every difference in the sequence is due to a mutation. The East-ancestor mutated only the last of the SNPs in the example. The West-lineage mutated the first SNP, then the third one, and finally (parallel to the East-lineage), the last SNP. Only the last 'West' mutation is found in the recombinant, because it recombined the first half of the West-1 genome with the second half of the East-ancestor.

However, homoplasy on its own can also produce reticulations in the network, as shown next.

The descendant of the East-ancestor shows a West-lineage mutation, leading to a
sequence identical to that of the West-1 x East-ancestor recombinant.

MJ networks can be, but are not always, phylogenetic networks. That is, a box in a MJ network may reflect either of two different things:

homoplasy, ie. alternative evolutionary pathways
reticulation events.

A Median-Joining network is not enough to study viruses

In their "Significance" section, Forster et al. continue:

The network method has been used in around 10,000 phylogenetic studies of diverse organisms, and is mostly known for reconstructing the prehistoric population movements of humans and for ecological studies, but is less commonly employed in the field of virology.

However, using these networks is tricky, because they (like any parsimony method) struggle with homoplasy, and (like all tree inferences) they cannot handle recombinants. A virus MJ network provides a display of mutation sites in an evolutionary context that, in the presence of ancestor-descendant relationships, does better than a Consensus network of most-parsimonious trees; but it is not a phylogenetic network per se.

Forster et al. provide free access to their data, but only as an RDF file, which is NETWORK's matrix format; and there is no data export option in the freeware version of the program. So, we cannot do any quick downstream investigation of the "published" dataset (and have to rely on our own harvest, as for the previous post, available via figshare).

The reason, we can apply Median networks to complete CoV-2 genomes at all is their low divergence. From our previous post (sampled between December 2019 and March 1st 2020 with a focus on China and the USA), our Group 7 sequences (= SARS-CoV-2) show 146 mutation patterns, 141 site variations and five 3 to 15 nt-long deletions in a stretch covering ~29,700 of the up to 30,000 basepairs of 88 CoV-2-genomes (ends trimmed for missing data). There are also polymorphic base calls in the data, but no prior way to judge whether these represent genuine host polymorphism or simply mediocre sequencing.

Are we detecting homoplasy, or is it recombination?

Since the overall divergence is low, and we have nearly 30,000 basepairs (i.e. 10,000+ for synonymous substitutions underlying &plusm; neutral evolution), we can fairly rule out random homoplasy creating the network patterns. The chance that two independent virus lineages mutate the same position of a total of 30,000 by accident is low. Indeed, most SNPs and three of the deletions occur only in a single sequence, stochastically distributed across the genomes. So, we have:

111 singletons: 94 SNPs, including one set of linked SNPs (6 SNPs, stretching across 50 nt), 13 possible intra-host polymorphisms (PIHP), and 4 deletions.
35 parsimony-informative patterns: 34 SNPs, of which eight involve PIHP, and 1 deletion.

We may still have homoplasy, even in the parsimony-informative sites, because some positions may be more susceptible to mutations than are others, and some mutations may be generally beneficial for the virus' spread. If the sample is large enough, then these should be easy to spot, because they should be frequent, and show character splits incompatible with the rest of the sequences.

In our data, there are two candidates for homoplasy among the parsimony-informative patterns, both of them mutations from G or C in the reference and majority of genomes to U.

Example 1

At alignment position 11121, the majority G is replaced by U in nine genomes, and C in one. If we exclude recombination as a cause, then it represents a safe homoplasy because U-carrying genomes show rare additional mutations deviating from the consensus (which is identical to the reference genome, "Wuhan 1") also seen in G-carrying genomes. Those mutations can be located at the start, center or end of the genomes. In addition, we find one transversion at the G/U site. This could be indicative for the G → U/C site being a site that is subject to increased probability of mutation , and hence homoplasy.

Genomes sharing rare mutations in addition to G/U variation at alignment position 11121. The first occurrence of the U-mutation, not accompanied by any other mutation, was discovered by Japanese researchers on the docked cruise ship. The thickness of the lines shows the number of genomes with identical mutation patterns in the parsimony-informative sites (1 pt = 1 genome), the size of the majority base, always found in the reference genome, its frequency (0.5 pt = 1 genome). The "jet setter" host is a Brazilian coming home from Switzerland via Italy.

However, six of the nine accession are from the "Cruise A" sample, the early quarantined Diamond Princess. Given the setting (a closed, densely populated space) and usually diverse host populations on cruise ships, the otherwise unchanged CoV-2 U-strain (top) and already modified G-strains present in the ship's population may just have recombined: the sequences up- and down-stream of the G/UC-site can be identical in various CoV-2 lineages for hundreds of basepairs.

Example 2

An analogous situation is found for the other candidate position, alignment position 24072 (black arrow), where a C is replaced by U in four genomes. One genome (MN988713; from Illinois, USA, sampled Jan 21st) shows the polymorphism: Y (= C/U). In MN988713, 7 more of the 35 parsimony-informative SNPs are polymorphic: the sequence is a near-perfect (gray arrow) consensus of the original "Wuhan 1" type and a strongly derived type (probably Forster et al.'s A cluster) from a second Illinois host sampled a week later, Jan 28th (MT044257)

Black and gray arrows highligh sites indicative for homoplasy or within-USA recombination. The polymorphic Illinois genome represents a strict consensus of the second Illinois strain (sampled one week later) — directly derived from the California strain, derived within the Type A cluster — and a (not sampled) sequence differing from the Wuhan 1 type (Type B) by one point mutation shared with two North American samples from end of January.

If we assume that the lab didn't just mix up or cross-contaminate the IL1 and IL2 samples, then the MN988713 host was infected twice by the CoV-2 virus: once by the original strain (Forster et al.'s Type B), and a second time by an evolved strain, being the tip of a new CoV-2 lineage that can be traced back (by congruent mutation patterns) to Jan 10th, Shenzhen (Guangdong, China) characterized by two C → U transitions at alignment pos. 8820 and 28182 (Forster et al.'s Type A).

Distinguishing homoplasy and recombination

With a growing set of samples, and given that the virus is free to mutate further in a large amount of hosts, it might become easier and more straightforward to distinguish homoplasy from recombination. It is possible that incongruent character splits have not one but two reasons: they have evolved in parallel but also have been propagated by recombination. The U replacing a G or C (or A) at the same site in one accession reflects a different history from another accession. Homoplasy and recombination result in the same graph inferences.

I agree with Forster et al. that the MJ network is under-used in virology (and other biological disciplines: eg. Why do we still use trees for the Neandertal genealogy; Using median networks to understand the evolution of genera) because it is a perfect tool — especially when used as a data-display network (eg. Networks can outperform PCA ordinations in phylogenetic analysis; Can we depict the evolution of highly conserved gene regions such as the ribosomal RNA genes). It facilitates grouping genotypes, to define ancestors and descendants, and to put them in a preliminary evolutionary framework.

But it cannot replace investigating the sequence mutation patterns, especially when we want to look out for intra-host variation — that is, a patient carrying more than one virus strain (parsimony treats polymorphism as missing data) — and recombinants. Visual inspection and tabulation can do this, although it takes a lot more time (and space).

Inferring a MJ network is Step 1. The obligatory Step 2 is to assess how conserved and/or phylogenetically informative are the reconstructed mutation patterns. This also can help to identify wrong roots inferred via outgroups. Forster et al.'s Type A is likely not the ancestral type, and the shared U-sites with the bat-virus outgroup are due to homoplasy, instead, as I will show in the next post (in two weeks's time).

Data

The complete tabulation of mutation patterns (EXCEL spread sheets) and the CoV-2-only alignment in ready-to-use NEXUS and (extended) PHYLIP format have been added to our figshare coronavirus data and file collection.

Grimm G, Morrison D (2020) Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581.v2

References

Bandelt H-J, Forster P, Sykes BC, Richards MB (1995) Mitochondrial portraits of human populations using median networks. Genetics 141: 743–753.

Bandelt H-J, Forster P, Röhl A (1999) Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16: 37–48.

Monday, April 20, 2020

Using Median Networks to study SARS-CoV-2

No comments:

Post a Comment