Monday, March 16, 2020

Problems with the phylogeny of coronaviruses

Coronaviruses are much in the news at the moment. Indeed, one particular variant seems to be the major news topic as I write this post. This is the one known as 2019-nCoV or SARS-CoV-2, which is responsible for the human pneumonia called COVID-19.

Obviously, the main issue for the public is infection biology, particularly the apparent ease with which the virus can spread in human populations. Part of the issue here seems to be that human coronaviruses are covered with a lipid membrane, which means that they "can remain infectious on inanimate surfaces [like metal, glass or plastic] at room temperature for up to 9 days" (Kampf et al. 2020), which dramatically increases the probability of each of us encountering one.

There is now a decline in reported cases in China, but there may be a resurgence. The problem is that an infected person may show no symptoms, or only very mild ones, and thus never report themselves. So, there may be millions more infected people running around the country, ready to infect new people when the travel restrictions are lifted, and the unexposed people come in contact with them. Biologically, the only safety is immunization, which occurs when you are exposed to the virus; that is risky, of course.

From Forni et al. (2017).

There will obviously be a lot of political fall-out in coming weeks, with various governments being accused of not doing enough and others of doing too much. The widespread infections in South Korea seem to be the result of a secretive religious organization (responsible for more than 60% of the national infections), to which the government has responded better than most others. On the other hand, in Iran it seems to be the government that has been the major problem, hiding the initial infections because of their potential effect on impending elections.

In Italy, the country seems to have been overwhelmed, and the death rate is very high, while in Germany the infection rate is relatively high but the death rate is currently still low. Indeed, Italy's long-delayed "lock-down" on internal travel contrasts strongly with China's much more rapid response, and this seems to be reflected in vastly different infection rates (Italy currently has 6x the number of infections per million people). More than half of the cases to date where I live, in Sweden, came initially from northern Italy, with most of the rest from Austria; both are popular downhill-skiing destinations at this time of the year.


However, for our purposes here it is the phylogenetics of coronaviruses that is of professional interest, not infection biology. This has been a research topic for the past couple of decades, with the origin of several novel coronavirus strains in humans during that time (see the timeline above). These include SARS-CoV (causing Severe Acute Respiratory Syndrome) and MERS-CoV (causing Middle East Respiratory Syndrome) — both of these have much higher fatality rates than the current epidemic (10% and 34%, respectively), but lower rates of spread. A selected set of relevant papers is listed below; and I have included a couple of phylogenies as examples.

The issue that I wish to mention here is that there appears to be a disconnection between the so-called phylogenies presented in these papers and the concept of a phylogenetic history. The papers present either an unrooted or a rooted tree. In the first case, this simply represents a set of clusters based on genomic similarity. In the second case, this represents a hierarchical grouping based on genomic similarity. Obviously, an unrooted tree cannot represent a phylogenetic history, since evolution has a time direction, and this can only be illustrated using a directed (i.e. rooted) tree or network.
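To make the point about direction concrete, here is a minimal Python sketch, assuming a hypothetical four-taxon unrooted tree (the labels A to D and the internal nodes u and v are invented for illustration, not taken from any of the papers). It places a root on each edge in turn and writes out the resulting rooted tree, showing that one unrooted tree is compatible with several different histories.

```python
from collections import defaultdict

# Hypothetical unrooted four-taxon tree: A and B join at node u, C and D at node v.
EDGES = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]

def build_adjacency(edges):
    """Undirected adjacency map for the tree."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def subtree(adj, node, parent):
    """Newick-style string for the part of the tree on the `node` side of edge (parent, node)."""
    children = sorted(subtree(adj, n, node) for n in adj[node] if n != parent)
    if not children:
        return node  # a leaf
    return "(" + ",".join(children) + ")"

def rootings(edges):
    """One rooted tree per edge of the unrooted tree, with the root placed on that edge."""
    adj = build_adjacency(edges)
    trees = []
    for a, b in edges:
        halves = sorted([subtree(adj, a, b), subtree(adj, b, a)])
        trees.append("(" + ",".join(halves) + ");")
    return trees

for tree in rootings(EDGES):
    print(tree)
```

The five edges give five distinct rooted trees, only one of which places the earliest split between the (A,B) and (C,D) groups. Choosing among them needs extra information, such as an outgroup, which the unrooted tree itself does not contain.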

However, the bigger issue is that these trees cannot represent an actual virus phylogeny. The argument for presenting them seems to be that the clusters / groups are based on genomic similarity, which in turn is caused by the phylogenetic history of the viruses. This is true, but we cannot thereby invert the logic. Phylogeny creates similarity, but mere similarity does not necessarily represent phylogenetic history.
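A small worked example may help here. The following Python sketch assumes a hypothetical rooted tree with made-up branch lengths, in which two lineages (B and D) evolve five times faster than their sisters. It is only an illustration of one way that similarity and history can disagree (unequal rates), not an analysis of real virus data.

```python
# Hypothetical rooted tree: each node maps to (parent, branch length).
# A and B are sisters (parent x); C and D are sisters (parent y).
PARENT = {
    "A": ("x", 1.0), "B": ("x", 5.0),
    "C": ("y", 1.0), "D": ("y", 5.0),
    "x": ("root", 1.0), "y": ("root", 1.0),
}
LEAVES = ["A", "B", "C", "D"]

def path_to_root(node):
    """Map each ancestor of `node` (and node itself) to its distance from `node`."""
    dist, out = 0.0, {node: 0.0}
    while node in PARENT:
        parent, length = PARENT[node]
        dist += length
        node = parent
        out[node] = dist
    return out

def tree_distance(a, b):
    """Path length between two leaves, via their most recent common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    return min(pa[n] + pb[n] for n in pa if n in pb)

def nearest(leaf):
    """The other leaf with the smallest path distance, i.e. the most similar one."""
    return min((l for l in LEAVES if l != leaf), key=lambda l: tree_distance(leaf, l))

def sister(leaf):
    """The other leaf that shares the same parent node."""
    return next(l for l in LEAVES if l != leaf and PARENT[l][0] == PARENT[leaf][0])

print("most similar to A:", nearest("A"))
print("sister of A:", sister("A"))
```

In this toy tree A is separated from its true sister B by a distance of 6, but from C by only 4, so a grouping based purely on similarity would pair A with C. Recombination, discussed next, is a second and more serious way for the same disagreement to arise.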

In the case of coronaviruses, the evolutionary history is reported to involve extensive genomic recombination in the formation of novel strains (reviewed by Cui et al. 2019). That is, during an epidemic the phylogeny might be tree-like, but at the origin of the epidemic it is not. This especially occurs because coronaviruses can infect a range of hosts (not just humans), and it is the recombination that occurs while within one host that allows novel strains to appear that can create epidemics in a different host.

This is also prevalent in, for example, influenza viruses (which also have a lipid membrane). This occurred for the world's worst epidemic (c. 500 million affected), the so-called Spanish Flu of 1918-1920, which actually started in the USA. The current most-likely explanation is that both a bird-host and a human-host influenza strain got into a pig, recombined in the cells of that host, and then the new virus strain got back into the human population.

Therefore the full phylogenetic history cannot be tree-like. Indeed, the actual history must be in the form of a recombination network, as discussed elsewhere in this blog. So, the trees, as shown in the papers below, represent the similarity of the coronaviruses but not all of their phylogeny. For the latter, we need a haplotype network representation, as illustrated in this example:

Some small haplotype networks; from Yu et al. (2020)

It would be interesting to construct a recombination network based on the data from one or more of the coronavirus papers, as an example. However, as far as I can see, none of the authors has referred to an online version of their genomic alignment; and so I cannot present such a thing here.
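In the absence of a real alignment, the first step of such an analysis can at least be sketched on toy data. The Python example below applies the classic four-gamete test to a made-up alignment (the sequence names and bases are invented, and this is not the method of any of the papers cited). Under the assumption that each site has mutated only once, a pair of two-state sites showing all four combinations cannot be fitted to a single tree, which is the kind of conflict that a recombination network is meant to display.

```python
from itertools import combinations

# Made-up toy alignment (not real coronavirus data): one string per sequence.
ALIGNMENT = {
    "seq1": "ACG",
    "seq2": "ACT",
    "seq3": "GTG",
    "seq4": "GTT",
}

def incompatible_site_pairs(alignment):
    """Return pairs of biallelic sites that show all four two-site combinations."""
    seqs = list(alignment.values())
    n_sites = len(seqs[0])
    columns = [[s[i] for s in seqs] for i in range(n_sites)]
    # The four-gamete test is only defined for sites with exactly two states.
    biallelic = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    hits = []
    for i, j in combinations(biallelic, 2):
        gametes = set(zip(columns[i], columns[j]))
        if len(gametes) == 4:
            hits.append((i, j))
    return hits

print(incompatible_site_pairs(ALIGNMENT))
```

Here sites 0 and 1 agree with each other, but each conflicts with site 2. Note that recurrent mutation can produce the same pattern, so a positive test is a hint of recombination rather than proof of it.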


Cui J, Li F, Shi Z-L (2019) Origin and evolution of pathogenic coronaviruses. Nature Reviews Microbiology 17: 181-192.

Chen Y, Liu Q, Guo D (2020) Emerging coronaviruses: genome structure, replication, and pathogenesis. Journal of Medical Virology 92: 418-423.

Eickmann M et al. (2003) Phylogeny of the SARS coronavirus. Science 302: 1504-1505.

Forni D, Cagliani R, Clerici M, Sironi M (2017) Molecular evolution of human coronavirus genomes. Trends in Microbiology 25: 35-48.

Gorbalenya AE, Snijder EJ, Spaan WJ (2004) Severe acute respiratory syndrome coronavirus phylogeny: toward consensus. Journal of Virology 78: 7863-7866.

Kampf G, Todt D, Pfaender S, Steinmann E (2020) Persistence of coronaviruses on inanimate surfaces and their inactivation with biocidal agents. Journal of Hospital Infection 104: 246-251.

Luk HKH, Li X, Fung J, Lau SKP, Woo PCY (2019) Molecular epidemiology, evolution and phylogeny of SARS coronavirus. Infection Genetics and Evolution 71: 21-30.

Woo PC, Lau SK, Huang Y, Yuen KY (2009) Coronavirus diversity, phylogeny and interspecies jumping. Experimental Biology and Medicine 234: 1117-1127.

Yu WB, Tang G-D, Zhang L, Corlett RT (2020) Decoding evolution and transmissions of novel pneumonia coronavirus (SARS-CoV-2) using the whole genomic data. (ResearchGate)

Zhang L, Shen F-M, Chen F, Lin Z (2020) Origin and evolution of the 2019 novel coronavirus. Clinical Infectious Diseases (Epub ahead of print).

An unrooted tree; from Cui et al. (2019).

A rooted tree; from Chen et al. (2020)


  1. Recombination is a lot easier in flu viruses, which are orthomyxoviruses, not coronaviruses: they have several separate "chromosomes". Coronaviruses just have a single RNA strand.

  2. I presume that this is why new influenza strains appear more often than do new coronavirus strains.