In my last post, I looked at the prospects and pitfalls of using Median networks to trace virus evolution in the case of SARS-CoV-2. In this post, I will explore how we can try to root the CoV-2 Median-joining (MJ) network, and why using an outgroup, as done by Forster et al., PNAS (2020), is not the best choice.
We'll stick to our 88-sequence dataset because I have already investigated its characteristics in my last post (XLSX-file included in the figshare file set). Here's the unweighted MJ network that can be inferred from these data, including all 146 mutation patterns (145 characters, because one indel overlaps with a SNP, a single-nucleotide polymorphism).
As in Forster et al.'s graph, we have one box in the central part of the graph, probably between Forster et al.'s type B (the big pie in the center and its satellites) and their type C (here: the long-edge global group including the Australian and European samples).
There's a useful rule-of-thumb in population genetics: a widespread, frequent haplotype with many satellite types is often the ancestral type of the investigated sample. This, in our case, includes the reference CoV-2 genome ("Wuhan 1"; NC_045512, sampled 26/12/2019). Having investigated in detail the data behind the graph (see the last post; adding sample date, provenance, graph above), we can put forward hypotheses about the degree to which the parallel edge bundles represent alternative evolutionary scenarios, or instead result from potential recombination between CoV-2 sub-lineages.
This allows us to depict an evolutionary scenario for our early samples, to picture (i) how the putative original variant (Wuhan 1/Type B) was distributed during the initial phase (largely unmodified; light gray arrows in the next figure), (ii) where mutations happened to give rise to sequentially new (sub)types, and (iii) where recombination may have happened (crosses in the figure). Some links (the dotted lines) require further data in order to decide whether the shared mutation is lineage-diagnostic (as indicated by the MJ network) or a convergence.
The early samples show that the later USA samples trace back to infections by various (sub)types (up to six lineages) by mid- to late January, while most of the variation arising in locked-down Wuhan did not escape at this early stage; the earliest two samples from 23/12 (MT019529) and 26/12 (reference genome) differ by three mutations.
The quarantined cruise-ship in Japan was infected with the unmodified Wuhan 1 type, which then evolved within the vessel's population. So, this quarantine worked, because the vessel's mutated viruses are not found elsewhere. While the 11121-transition has probably been propagated in the vessel's population via recombination, its occurrence outside (in the Jetsetter/USA lineage, type C?, and USA-Type A) could be due to homoplasy: both the Jetsetter/USA and the A-type USA genomes are (strongly) derived. The 24072 and 28892-transitions point to reticulation between (less evolved) American B- and (highly evolved) A-type lineages; the MJ network can't resolve the resulting box because the American A-type showing the 24072 mutation is strongly derived.
Note: It's also interesting to compare our graph with the tree-based virus "phylogeny" on the GISAID page, which doesn't seem to include the cruise-ship samples. Most of the deep branches of the GISAID tree are unsupported ("no mutation"), and samples identical to the reference can be found among the early samples of most main "clades" depicted in the GISAID "phylogeny".
It is also straightforward to identify likely (→ U) and less likely substitutions (all others), as shown in the table.
There is a clear substitutional bias: transitions are more likely than transversions, and the approximate substitution model is abaaba for substitutions replacing the reference / CoV-2-consensus nucleotide. But the model is asymmetrical: Us are more likely to replace Cs than vice versa, while A/G transitions are balanced. In general, stochastically distributed singleton or rare mutations have a high probability of showing a U. So, a shared C is more likely to be a conserved, shared ancestral pattern (what Hennig called a "symplesiomorphy"). A shared U may be a uniquely shared, derived pattern (a "synapomorphy"), or a derived pattern obtained convergently (in parallel), a homoplasy. Low-frequency Cs, but also As and Gs at predominantly U positions, are most probably synapomorphies as well (based on the data situation and observed substitution probabilities).
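To make the bias described above concrete, here is a minimal sketch of how one could tally directed substitutions (reference base to observed base) and sort them into transitions and transversions. The function names and the toy input are hypothetical and only illustrate the bookkeeping; they are not the counts from our 88-genome data.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def classify(ref, alt):
    """Return 'transition' or 'transversion' for a substitution ref -> alt."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    # A transition stays within purines or within pyrimidines.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def tally(substitutions):
    """Count directed substitutions and transition/transversion totals."""
    directed = Counter(f"{ref}>{alt}" for ref, alt in substitutions)
    kinds = Counter(classify(ref, alt) for ref, alt in substitutions)
    return directed, kinds


# Hypothetical toy input: (reference base, observed base) per mutation.
toy = [("C", "U"), ("C", "U"), ("C", "U"), ("U", "C"),
       ("G", "A"), ("A", "G"), ("G", "U")]
directed, kinds = tally(toy)
print(directed)  # directed counts, e.g. C>U versus U>C
print(kinds)     # transitions versus transversions
```

Keeping the counts directed (C>U separate from U>C) is what exposes the asymmetry; a symmetric tally would hide it.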
Currently, there is no maximum likelihood analog to Median networks, but one could weight mutation patterns differently (see, e.g., guidelines provided in NETWORK under the Help > About menu item in the Median Joining analysis window).
With each successive virus generation, the probability for a homoplasious U increases. Thus, when using MJ networks for virus evolution, we should consider analyzing the data at different time-points, rather than including all of the data in one large analysis (see also our posts on stacking Neighbor-nets: introduction, fossil king ferns, and manual alphabets).
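Analyzing the data at different time-points amounts to building cumulative subsamples by collection date and inferring one network per subsample. A minimal sketch, assuming each genome comes with a collection date; the sample IDs, dates and cutoffs below are hypothetical placeholders.

```python
from datetime import date


def time_slices(samples, cutoffs):
    """Map each cutoff date to the sorted IDs of samples collected on or before it."""
    return {
        cutoff: sorted(sid for sid, collected in samples.items() if collected <= cutoff)
        for cutoff in cutoffs
    }


# Hypothetical sample IDs and collection dates (not real accessions).
samples = {
    "s1": date(2019, 12, 26),
    "s2": date(2020, 1, 20),
    "s3": date(2020, 2, 10),
}
cutoffs = [date(2019, 12, 31), date(2020, 1, 31), date(2020, 2, 29)]

slices = time_slices(samples, cutoffs)
for cutoff, ids in slices.items():
    print(cutoff.isoformat(), ids)
```

Each slice would then be exported as its own alignment and analyzed separately, so that late, potentially homoplasious mutations cannot distort the network of the early samples.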
Homoplasy + distant outgroups = wrong roots
By relying on a distantly related sister-lineage to infer an outgroup root of the MJ network, Forster et al. likely got the basic relationships wrong.
Central part of the original outgroup-rooted "phylogenetic network". Coloring after Forster et al. (2020).
Their Type A is probably not ancestral to Type B/Wuhan 1, but derived from it or representing an early split.
Same graph, mutation arrows taking into account observed mutation probabilities (our 88 genomes data) and assuming that there was no recombination among the earliest types of each lineage.
The 3 Us shared by the bat outgroup and (part of) Type A (8782, 18060, 29095) likely represent homoplasy in distantly related sister lineages (cf. our last SARS virus post). Being homoplasies, they produce a network box reflecting alternative mutational pathways but not recombination. Homoplasious (convergently evolved) mutational patterns accumulate with increasing phylogenetic distance. Neutral mutations have a generally higher chance to replace a C by a U, back-mutations are less likely, and some sites are more likely to be mutated than others. Hence, there is a good chance that the bat sister-CoV-virus shows more shared mutational patterns with a derived CoV-2 lineage (i.e., derived Type A variants) than with the ancestral one (Type B). Distant outgroups should not be used to root Median networks (see also: How do we interpret a rooted median network).
The only possibly genuine mutation would be the shared C (Forster et al.'s pos. 28144, pos. 28219 in our alignment) opposing a U in all Type B and Type C, differentiated only by two incompatible mutations, G → U substitutions. The U at pos. 28144 may have evolved in parallel in the B and C types; and the actual all-ancestor of CoV-2 (as indicated above) is neither included in Forster et al.'s sample, nor in the current GISAID sample (or our harvest).
It will be interesting to infer MJ networks on time-stamped and geo-referenced subsamples collected in the GISAID database, once the virus has had half a year (or more) to evolve, to see (i) how common homoplasy is, (ii) which sites are likely to accumulate → U substitutions independently of ancestry and (iii) whether there are further and more obvious examples of recombination. The further genotypes evolve from the original stock, the more diagnostic their sequences may become, and the easier it will be to decide whether shared but incompatible sequence features are the result of homoplasy or recombination.
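Shared but incompatible features of the kind discussed above can be screened for with the classic four-gamete test: two biallelic sites cannot both fit a single tree without homoplasy or recombination if all four allele combinations occur among the genomes. The sketch below assumes biallelic sites and an ungapped alignment; the function names and the toy alignment are hypothetical. Note that the test only flags the conflict, it cannot tell homoplasy from recombination.

```python
from itertools import combinations


def four_gamete_incompatible(col_i, col_j):
    """True if two biallelic site columns show all four allele combinations."""
    biallelic = len(set(col_i)) == 2 and len(set(col_j)) == 2
    return biallelic and len(set(zip(col_i, col_j))) == 4


def incompatible_pairs(alignment, sites):
    """Return pairs of sites (0-based column indices) that fail the test.

    alignment: dict mapping genome name -> aligned sequence string.
    """
    cols = {s: [seq[s] for seq in alignment.values()] for s in sites}
    return [
        (a, b)
        for a, b in combinations(sites, 2)
        if four_gamete_incompatible(cols[a], cols[b])
    ]


# Hypothetical two-site toy alignment: all four combinations are present.
toy_alignment = {"g1": "CG", "g2": "UG", "g3": "CU", "g4": "UU"}
conflicts = incompatible_pairs(toy_alignment, [0, 1])
print(conflicts)  # [(0, 1)]
```

Site pairs flagged this way are the ones that produce boxes in a Median network, and so the ones worth checking against sample dates and provenance.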