Monday, March 30, 2020

Trees and viruses: the SARS group

[This is a joint post by David and Guido]

There seems to be a lot of current confusion about the Covid-19 disease, and the SARS-CoV-2 virus that causes it. To help clear up some of the misunderstandings, this post might help:

   There seems to be a lot of public misunderstanding about the coronavirus

From the perspective of phylogenetics, in his last post (Problems with the phylogeny of coronaviruses), David pointed out that phylogenetic trees may not be a good choice for visualizing the phylogeny of coronaviruses. For this post, we collected from GenBank all complete genomes of one group of Betacoronavirus, the SARS-inflicting viruses, which includes the new SARS-CoV-2, the Covid-19 disease-causing virus, to look at this in more detail.

Reticulation analysis

Based on a preliminary analysis (included in a figshare submission), we ended up combining the individual genome accessions into group-based consensus sequences, in which intra-lineage variation is expressed as polymorphism: IUPAC ambiguity codes. The graphs in Figs. 1 and 2 are based on a data set including 291 accessions that are not literal duplicates of others, out of the total harvest included 395 genomes.For example, Groups 1 and 7 (as labeled in the figures below) include many accessions that are nearly identical or differ only in stochastic mutation patterns.

When processing the harvested data, one can notice mutational patterns in the best-sampled groups (eg. the original SARS-viruses, the new CoV-2 virus), which may be the result of mediocre sequencing and editing artifacts.

Consensus sequences have several advantages against choosing placeholders:
  • We can reduce the number of OTUs without losing too much information about intra-group diversity.
    • This facilitates visual inspection of the underlying alignments.
    • Under maximum likelihood (ML) inference, ambiguity codes can make a difference. They are not treated as missing data, as polymorphisms can be informative to some degree even in extreme cases (eg. Potts et al., Systematic Biology, 2014). Note that the RAxML-NG program now includes special models for (phased) DNA and RNA data that includes a lot of ambiguities.
  • Stochastic mutations found in a single of many accession within a group can be completely eliminated using modal consensus sequences, instead of strict consensus sequences. For the data used here, it makes little difference (results provided in the linked figshare submission).
Fig. 1 shows the mid-point rooted ML tree based on the group-consensus matrix. It largely agrees with the non-consensed tree (included in the figshare submission) but takes much less time to infer and run bootstrap analysis.

Fig. 1. Mid-point rooted ML tree using group-consensus sequences (labeled by numbers), tips with GenBank accession numbers represent non-consensed data.

Note that we find poor split support for some topological features, which can have two reasons:
  • Lack of discriminatory signal.
  • Conflicting signal, ie. here: potential recombination.
Except for a few aspects, the tree seems to be clear. And may be fooling us.

We can see this by looking at a network instead of a tree. The Neighbor-net based on the group-censuses data (Fig. 2) doesn't appear to be overly tree-like, especially when compared to the ML tree and its branch support (Fig. 1).

Fig. 2. Neighbor-net based on the group-consensus data, colored arches, arrows and field refer to mutational patterns and recombination cross-checked for by visual inspection of the alignment.

In fact, the highly supported Group 4 (in the tree) includes some genomes showing evidence for reticulation outside this apparent clade. The accessions labeled as 3c and 3d are recombinants as well, and hence are placed between the two main clusters (1 + 8 vs. 4–7) in the tree. The 1 + 2 + 3a/b group collects what appears to be a gradual evolutionarey trend — all " non-1" sequences simply differ more and more from Group 1, but not necessarily in the direction of Groups 4–9. The relatively low support for the 8+9 group is also the result of recombination and conflicting signals in the underlying data (Fig. 2).

The reasons for these reticulate signals probably include homoplasy (especially within main groups), but also obvious recombination, as shown in Fig. 3.

Fig. 3. Potentially alien DNA within difficult to align regions. Note that CoV-1 Type "A" and "B" do not sort along the ML tree, and only to some degree agree with neighborhoods in the Neighbor-net. Note that this shown portion was not included in our analysis. Congruent pattern can nonetheless be found in the sequentially more conserved regions we used

Groups 2a and 3b have very similar sequences elsewhere (2a has been represented by a single consensus sequence in our analysis) but can show either type in the regions shown in Fig. 3. The partly incongruent distribution of Types A and B in Fig. 3 can only be explained by secondary recombination between CoV-1 lineages.

Going back to the alignment, we can see that the uniqueness of accession 3d lies in patterns otherwise seen only in the distantly related Group 5, combined with regions where is mirrors Group 9 (see Fig. 3), while the rest of the sequence shows the basic Group 1–3 type, which is difficult to distinguish from Group 8 (leading to its position in the ML tree, Fig. 1, and NNet, Fig. 2). Groups 1 and 8 have mutations not seen in any other accession, separating them clearly from each other (lack of a neighborhood, Fig. 2), and also seaprating Group 8 more from Groups 2 and 3 than Group 1 (clade in Fig. 1 but fan in Fig. 2). Group 8 may be a recombinant of Group 1(–3) and Group 4, as suggested in Fig. 2; and bootstrap (BS) < 100 in the ML tree in Fig. 1. One accession of Group 4 includes sequence patterns in one regions diagnostic for the CoV-2 lineage and its sister lineage (Group 6).

The most striking recombination feature is, however, not captured even by the Neighbor-net. The orange field in Fig. 2 refers to the last sixth of the sequences (~ 5,000 bp) which are near-identical in accessions 3c, 3d, one 9a and all members of Group 4 except 4a, showing a sequence visibly different from all others in the alignment.

Moreover, our analysis and graphs did not consider sequence patterns in difficult or impossible to align regions, which can show very complex patterns, as illustrated in Fig. 4.

Fig. 4. Bird's eye view of the transition zone between alignable and sequentially (extreme) diverse genome portions that were excluded from analyses.

The non-alignable region in Fig. 4 shows about a dozen substantially different sequence types, most of which may actually be obtained by recombination with coronaviruses outside the SARS group. In some cases, strikingly similar sequence types are shared by members of different groups shown in the tree; while members of the same groups, even highly supported sister taxa, can have sequence types that are near 100% different.

Knowing the enemy: What is new about SARS-CoV-2?

We can see that some groups (labeled as 2a to 3c, 8, 9) have sequentces close to Group 1, while others (labeled 4 to 7) are increasingly diifferent (Fig. 2). The new strain, discovered in Wuhan and now spreading across the world (substantially affecting the way we live), has one potential direct sister strain: Group 6 (accessions MG772933/34) labeled in GenBank as "Bat SARS-like coronavirus", and isolated from bats (Rhinolophus sinicus). The reference for MG772933 is Hu et al. (2018) Emerging Microbes and Infection 7: e1006698, and was submitted in January 2018 by a researcher at the Institute of Military Medicine in Nanjing. We compare these sequences in Fig. 5.

Fig. 5. Close-up on a sequentially conserved region. All mutations that differ between CoV-1 and CoV-2 are also found in Group 6 (either obtained by multiple mutation or recombination with viruses not included in our data).

The Y (= C/T) in the Group 6 consensus sequence illustrates a general feature of the two Group 6 accessions: one is less modified with respect to the other (older, longer-known) SARS viruses, reflected by a BS = 67 for one being sister to Group 7, when using the accession data (not consensus sequences). But both share sequence patterns not found in Group 7 (BS = 32, note the long terminal branch in the Neighbor-net), which demonstrates that these viruses represent a genuine sister lineage. Elsewhere in their genome, both Group 6 and 7 are identical (Fig. 5). In other sequence regions, only Group 7 (CoV-2) differs (Fig. 6).

Fig. 6. A sequentially equally conserved region, purple arrows highlight unique mutations in CoV-2 strain; yellow background shared mutations (convergent and lineage-conserved).

Unfortunately, the various gene banks have been flooded with new, near-identical CoV-2 genomes, which are useless from a phylogenetic point of view. So, there is no point in looking for unique sequence features shared by Groups 6 and 7, or unique to Group 7 (CoV-2) — all of the best hits will be novel SARS-CoV-2 genomes. So, we cannot assess to what degree the new CoV-2 lineage, which likely includes the Group 6 accessions, differs from the remaining CoV-1 (original SARS) because of recombination with other coronaviruses.

However, recombination is most likely, given that CoV-1 genomes show plenty of signals that do not fit into tree-like evolution, but instead evidence inter-group reticulation (Fig. 2; example: Fig. 4). We also have sequence portions that are nearly 100% different (and hence not included in the phylogenetic analyses; Figs. 3/4) along with mutational patterns in conserved regions, which appear to be just mutations from the original SARS (Figs. 5, 6).

The harvested, mafft-aligned (uncurated) data and curated alignment, as well as files needed for analysis have been uploaded to figshare (CC-BY).

Grimm G, Morrison, D (2020). Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset.