In a previous post, Guido constructed trees for coronaviruses in the SARS group to search for evidence of recombination. He also constructed unrooted data-display networks using SplitsTree. Here, we discuss our attempts to construct rooted genealogical phylogenetic networks for the same dataset [6] but with some modifications.
In particular, we deleted some sequences, giving a smaller data set with only 12 taxa. These taxa include SARS-CoV-2 (the virus causing COVID-19) and SARS-CoV (responsible for the SARS epidemic in 2002/2003), the viruses MP789 and PCoV_GX-P1E, sampled from Malayan pangolins in two different Chinese provinces, and several viruses found in different species of horseshoe bats (genus Rhinolophus), all from China.
This research was done by Rosanne Wallin, an MSc student at VU Amsterdam and UvA. Her full thesis, as well as all data and results, can be found on GitHub.
The first algorithm we applied to this data set was the TreeChild Algorithm [1]. It is one of the methods that take a number of discordant (rooted, binary) trees as input and find a rooted network that contains each input tree while minimizing the number of reticulate events. To filter out some noise, we contracted some poorly supported branches and then resolved the resulting multifurcations consistently across the trees (using a tool within the TreeChild Algorithm). This gave the network below. Note that the method is restricted to so-called tree-child networks, meaning that certain complex scenarios, in which a network node has only reticulate children, are excluded. Also note that this is not necessarily the only optimal tree-child network, and that not all topological differences can be distinguished based on the trees [5].
Figure 1: Phylogenetic network constructed by the Tree-Child algorithm (blocks_A_len0.01_supp70).
The network shows no reticulation in the SARS-CoV-2 clade (the bottom four taxa) and puts SARS-CoV-2 right next to RaTG13. Furthermore, it shows a reticulation between an ancestor of HKU3-1 and a common ancestor of SARS-CoV-2 and RaTG13, leading to bat-SL-CoVZC45. However, the method cannot identify exactly which common ancestor of SARS-CoV-2 and RaTG13 is the parent, which is why multiple branches (in red) lead into this reticulation. All these observations are consistent with previous research [2].
Importantly, we cannot directly conclude that each reticulation corresponds to a recombination event. See Table 2.1 of David’s book [10] for a nice overview of possible causes of reticulation. Nevertheless, based on [2], it does look like at least the reticulation leading to bat-SL-CoVZC45 corresponds to a recombination event.
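The tree-child restriction mentioned for the first algorithm can be made concrete with a small check. The following is a minimal sketch, not part of the TreeChild software: it assumes a network is given as a list of directed (parent, child) edges, treats any node with two or more incoming edges as a reticulation, and uses two made-up toy networks. The function names `is_tree_child` and `reticulation_number` are hypothetical.

```python
from collections import defaultdict


def is_tree_child(edges):
    """Return True if every non-leaf node has at least one non-reticulate child."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # A reticulation is a node with more than one parent.
    reticulations = {node for node, deg in indegree.items() if deg >= 2}
    return all(
        any(child not in reticulations for child in kids)
        for kids in children.values()
    )


def reticulation_number(edges):
    """Count reticulate events: each extra parent of a node adds one."""
    indegree = defaultdict(int)
    for _, child in edges:
        indegree[child] += 1
    return sum(deg - 1 for deg in indegree.values() if deg >= 2)


# Toy network (hypothetical): one reticulation h with parents a and b.
toy_network = [
    ("root", "a"), ("root", "b"),
    ("a", "x"), ("a", "h"),
    ("b", "h"), ("b", "y"),
    ("h", "z"),
]

# Toy network (hypothetical): a and b have only reticulate children.
not_tree_child = [
    ("root", "a"), ("root", "b"),
    ("a", "h1"), ("a", "h2"),
    ("b", "h1"), ("b", "h2"),
    ("h1", "x"), ("h2", "y"),
]

print(is_tree_child(toy_network), reticulation_number(toy_network))
print(is_tree_child(not_tree_child), reticulation_number(not_tree_child))
```

The second toy network illustrates the kind of scenario that a tree-child method excludes: nodes a and b pass all of their descendants through reticulations.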
The second algorithm we applied was TriLoNet [3], which constructs a rooted network directly from sequence data. It is restricted to so-called level-1 networks, meaning that it cannot construct overlapping cycles. This method produced the network below.
Figure 2: Phylogenetic network constructed by TriLoNet.
At first sight, the network may look a bit different from the previous one (Figure 1). However, note that the three observations above also hold for this second network. Moreover, the SARS-CoV-2 clade is identical in both networks. This network contains only one reticulation, which is most likely due to the level-1 restriction.
Nevertheless, we can still use this method to find more putative recombination events. To do so, we simply exclude the recombinant bat-SL-CoVZC45 from the analysis and rerun the algorithm. This gives the following network.
Figure 3: Phylogenetic network constructed by TriLoNet, after omitting bat-SL-CoVZC45.
We have now found a second putative recombination event with Rf1 as recombinant. Note that this is also consistent with the network in Figure 1. On the other hand, also note that the branching order in the SARS-CoV clade (the bottom 7 taxa in Figure 3) has changed a bit. This could mean that more recombination events are present in the SARS-CoV clade, as we also see in Figure 1.
One interesting follow-up question is whether the two (or more) networks produced by TriLoNet can be combined into a single higher-level network, in order to show multiple reticulations simultaneously (see [4] for an algorithm that could be useful).
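To make the notion of "level" concrete: the level of a network is commonly defined as the maximum number of reticulations in any biconnected component of the underlying undirected graph, so a level-1 network has at most one reticulation per cycle and no overlapping cycles. The sketch below is not part of TriLoNet; it assumes a network given as directed (parent, child) edges without parallel edges, and the function name `network_level` and the toy networks are hypothetical.

```python
from collections import defaultdict


def network_level(edges):
    """Max number of reticulations in one biconnected component (assumes no parallel edges)."""
    adj = defaultdict(set)
    indegree = defaultdict(int)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
        indegree[v] += 1
    reticulations = {n for n, d in indegree.items() if d >= 2}

    disc, low = {}, {}
    edge_stack = []
    level = 0
    counter = 0

    def dfs(u, parent):
        nonlocal counter, level
        disc[u] = low[u] = counter
        counter += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates this component: pop its edges off the stack.
                    nodes = set()
                    while True:
                        a, b = edge_stack.pop()
                        nodes.update((a, b))
                        if (a, b) == (u, v):
                            break
                    # Single edges (bridges) are not cycles, so skip them.
                    if len(nodes) >= 3:
                        level = max(level, len(nodes & reticulations))
            elif disc[v] < disc[u]:
                # Back edge to an ancestor closes a cycle.
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return level


# Hypothetical toy networks.
one_cycle = [("root", "a"), ("root", "b"), ("a", "x"), ("a", "h"),
             ("b", "h"), ("b", "y"), ("h", "z")]
overlapping = [("root", "a"), ("root", "b"), ("a", "h1"), ("a", "h2"),
               ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]

print(network_level(one_cycle))    # a single cycle
print(network_level(overlapping))  # two cycles sharing nodes
```

A network with two reticulations can still be level-1 as long as its cycles do not overlap; only networks like the second toy example fall outside the level-1 class.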
Another interesting observation from these networks is that there is no sign of recombination involving the pangolin coronaviruses MP789 and PCoV_GX-P1E. It rather looks like these viruses evolved from common ancestors of SARS-CoV-2 and RaTG13, but it is important to note that we cannot exclude a recombination event on the basis of these networks. The relationship between SARS-CoV-2 and pangolin coronaviruses is still being debated in the literature [2,7,8,9].
We noticed some limitations of the algorithms during this study. Firstly, the depicted networks are purely topological, i.e., the branch lengths do not represent anything. Adapting these algorithms to take branch-length information into account could improve their accuracy for this data set: the extant taxa have precise time stamps, and for recent divergence events the times can be estimated quite accurately (see [2]).
Another limitation is that we had to remove several taxa from the original data set [6] before the TreeChild algorithm could find a solution. By removing taxa, we reduced the number of reticulations needed to display the trees, making the TreeChild algorithm run in reasonable time. We made sure to include a diverse set of taxa (based on their pairwise distances [6]) to represent as much of the subgenus as possible.
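The post does not spell out how the diverse subset was chosen. One common heuristic for picking a spread-out subset from a pairwise distance matrix is greedy farthest-point selection, sketched below purely as an illustration. The function name `pick_diverse`, the taxon labels and the distances are all made up, and this is not claimed to be the procedure used in the thesis.

```python
def pick_diverse(dist, k, start):
    """Greedily pick k taxa, each time adding the taxon farthest from those chosen."""
    if k > len(dist):
        raise ValueError("k exceeds the number of taxa")
    chosen = [start]
    while len(chosen) < k:
        remaining = [t for t in dist if t not in chosen]
        # Score each candidate by its distance to the closest chosen taxon.
        best = max(remaining, key=lambda t: min(dist[t][c] for c in chosen))
        chosen.append(best)
    return chosen


# Hypothetical symmetric pairwise distances between four made-up taxa.
toy_dist = {
    "A": {"B": 0.10, "C": 0.50, "D": 0.90},
    "B": {"A": 0.10, "C": 0.45, "D": 0.85},
    "C": {"A": 0.50, "B": 0.45, "D": 0.60},
    "D": {"A": 0.90, "B": 0.85, "C": 0.60},
}

print(pick_diverse(toy_dist, 3, "A"))
```

In this toy example the near-duplicate taxon B is left out, which mirrors the goal of covering as much of the diversity as possible with few taxa.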
Rosanne also tried several other algorithms and taxon selections, as well as trees based on genes rather than on the fixed-length blocks used above (following Guido’s post); see her thesis on GitHub.
Conclusion
Although rooted phylogenetic network methods are often limited in the number of taxa that can be analysed and/or the complexity of the networks that can be constructed, we have seen that these methods can be useful for constructing hypothetical evolutionary histories. Moreover, although the constructed networks are not identical, we have seen that they share certain key properties, which are also consistent with previous research.
Rosanne Wallin, Leo van Iersel, Mark Jones, Steven Kelk and Leen Stougie
[1] Leo van Iersel, Remie Janssen, Mark Jones, Yukihiro Murakami and Norbert Zeh. A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees.
arXiv:1907.08474 [cs.DM] (2019).
[2] Maciej F. Boni, Philippe Lemey, Xiaowei Jiang, Tommy Tsan-Yuk Lam, Blair W. Perry, Todd A. Castoe, Andrew Rambaut and David L. Robertson. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol 5, 1408–1417 (2020).
https://doi.org/10.1038/s41564-020-0771-4
[3] James Oldman, Taoyang Wu, Leo van Iersel and Vincent Moulton. TriLoNet: Piecing together small networks to reconstruct reticulate evolutionary histories. Molecular Biology and Evolution, 33 (8): 2151-2162 (2016).
http://dx.doi.org/10.1093/molbev/msw068
[5] Fabio Pardi and Celine Scornavacca. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput Biol, 11(4), e1004135 (2015).
[7] Lam, Tommy Tsan-Yuk, Marcus Ho-Hin Shum, Hua-Chen Zhu, Yi-Gang Tong, Xue-Bing Ni, Yun-Shi Liao, Wei Wei, et al. Identifying SARS-CoV-2 Related Coronaviruses in Malayan Pangolins. Nature, 583, 282–285 (2020).
https://doi.org/10.1038/s41586-020-2169-0
[8] Wang, Hongru, Lenore Pipes, and Rasmus Nielsen. Synonymous Mutations and the Molecular Evolution of SARS-Cov-2 Origins. [Preprint] Evolutionary Biology, April 21, 2020.
https://doi.org/10.1101/2020.04.20.052019
[9] Li, Xiaojun, Elena E. Giorgi, Manukumar Honnayakanahalli Marichannegowda, Brian Foley, Chuan Xiao, Xiang-Peng Kong, Yue Chen, S. Gnanakaran, Bette Korber, and Feng Gao. Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection. Science Advances, Vol. 6, no. 27 (2020).
https://doi.org/10.1126/sciadv.abb9153