The Genealogical World of Phylogenetic Networks: 2020

Sunday, November 15, 2020

Rooted phylogenetic networks for coronaviruses

In a previous post, Guido constructed trees for coronaviruses in the SARS group to search for evidence of recombination. He also constructed unrooted data-display networks using SplitsTree. Here, we discuss our attempts to construct rooted genealogical phylogenetic networks for the same dataset [6] but with some modifications.

In particular, we deleted some sequences, giving a smaller data set with only 12 taxa. These taxa include, next to SARS-CoV-2 (the virus causing COVID-19) and SARS-CoV (responsible for the SARS epidemic in 2002/2003), the viruses MP789 and PCoV_GX-P1E sampled from Malayan pangolins from two different Chinese provinces and several viruses found in different bat species in the horseshoe bat genus (Rhinolophus), all from China.

This research was done by Rosanne Wallin, an MSc student at VU Amsterdam and UvA. Her full thesis as well as all data and results can be found on GitHub.

The first algorithm we applied to this data set was the TreeChild Algorithm [1], which is one of the methods that take a number of discordant (rooted, binary) trees as input and find a rooted network containing each input tree, minimizing the number of reticulate events in the network. To filter out some noise, we contracted some poorly-supported branches and then resolved multifurcations consistently across the trees (using a tool within the TreeChild Algorithm). This gave the network below. Note that the method is restricted to so-called tree-child networks, meaning that certain complex scenarios are excluded (those where a network node has only reticulate children). Also note that this is not necessarily the only optimal tree-child network, and that not all topological differences can be distinguished based on the trees [5].

Figure 1: Phylogenetic network constructed by the Tree-Child algorithm (blocks_A_len0.01_supp70).
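The tree-child restriction mentioned above can be illustrated with a few lines of Python. This is only a minimal sketch of the definition, not part of the TreeChild Algorithm itself; the function name, the edge-list representation and the toy networks are assumptions made for illustration.

```python
from collections import defaultdict

def is_tree_child(edges):
    """Check the tree-child property for a rooted network.

    edges: iterable of (parent, child) pairs (hypothetical input format).
    A reticulation is a node with more than one incoming edge. The network
    is tree-child if every non-leaf node has at least one child that is
    not a reticulation.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    for node, kids in children.items():
        # A node whose children are all reticulations violates the property.
        if all(indegree[k] > 1 for k in kids):
            return False
    return True

# Toy network with one reticulation h (parents u and v): tree-child.
ok_network = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
              ("v", "h"), ("v", "b"), ("h", "c")]

# Toy network where u and v have only reticulate children: not tree-child.
bad_network = [("r", "u"), ("r", "v"), ("u", "h1"), ("u", "h2"),
               ("v", "h1"), ("v", "h2"), ("h1", "a"), ("h2", "b")]

print(is_tree_child(ok_network), is_tree_child(bad_network))
```

The second toy network is the kind of "complex scenario" that a tree-child method cannot return.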

The network shows no reticulation in the SARS-CoV-2 clade (the bottom four taxa) and puts SARS-CoV-2 right next to RaTG13. Furthermore, it shows a reticulation between an ancestor of HKU3-1 and a common ancestor of SARS-CoV-2 and RaTG13 leading to bat-SL-CoVZC45. However, it cannot exactly identify which common ancestor of SARS-CoV-2 and RaTG13 is the parent, leading to multiple branches (in red) leading into this reticulation. All these observations are consistent with previous research [2].

Importantly, we cannot directly conclude that each reticulation corresponds to a recombination event. See Table 2.1 of David’s book [10] for a nice overview of possible causes of reticulation. Nevertheless, based on [2], it does look like at least the reticulation leading to bat-SL-CoVZC45 corresponds to a recombination event.

The second algorithm we applied was TriLoNet [3], which constructs a rooted network directly from sequence data. It is restricted to so-called level-1 networks, meaning that it cannot construct overlapping cycles. This method produced the network below.

Figure 2: Phylogenetic network constructed by TriLoNet.

At first sight, the network may look a bit different from the previous one (Figure 1). However, note that the three observations above also hold for this second network. Moreover, the SARS-CoV-2 clade is identical in both networks. This network contains only one reticulation, which is most likely due to the level-1 restriction.

Nevertheless, we can still use this method to find more putative recombination events. To do so, we simply exclude the recombinant bat-SL-CoVZC45 from the analysis and rerun the algorithm. This gives the following network.

Figure 3: Phylogenetic network constructed by TriLoNet, after omitting bat-SL-CoVZC45.

We have now found a second putative recombination event with Rf1 as recombinant. Note that this is also consistent with the network in Figure 1. On the other hand, also note that the branching order in the SARS-CoV clade (the bottom 7 taxa in Figure 3) has changed a bit. This could mean that more recombination events are present in the SARS-CoV clade, as we also see in Figure 1.

One interesting follow-up question is whether the two (or more) networks produced by TriLoNet can be combined into a single higher-level network, in order to show multiple reticulations simultaneously (see [4] for an algorithm that could be useful).

Another interesting observation from these networks is that there is no sign of recombination involving the pangolin coronaviruses MP789 and PCoV_GX-P1E. It rather looks like these viruses evolved from common ancestors of SARS-CoV-2 and RaTG13, but it is important to note that we cannot exclude a recombination event on the basis of these networks. The relationship between SARS-CoV-2 and pangolin coronaviruses is still being debated in the literature [2,7,8,9].

Some limitations of the algorithms were noticed during this study. Firstly, the depicted networks are purely topological, i.e., the branch lengths do not represent anything. Adapting these algorithms to take branch length information into account could possibly improve their accuracy for this data set since the extant taxa have precise time stamps and for recent divergence events these times can be estimated quite accurately, see [2].

Another limitation is that we had to remove several taxa from the original data set [6] before the TreeChild algorithm could find a solution. By removing taxa, we reduced the number of reticulations needed to display the trees, making the TreeChild algorithm run in reasonable time. We made sure to include a diverse set of taxa (based on their pairwise distances [6]) to represent as much of the subgenus as possible.

Rosanne also tried several other algorithms and taxon selections, and used trees based on genes rather than on the fixed-length blocks used above (following Guido’s post); see her thesis on GitHub.

Conclusion

Although rooted phylogenetic network methods are often limited in the number of taxa that can be analysed and/or the complexity of the networks that can be constructed, we have seen that these methods can be useful for constructing hypothetical evolutionary histories. Moreover, although the constructed networks are not identical, we have seen that they share certain key properties, which are also consistent with previous research.

Rosanne Wallin, Leo van Iersel, Mark Jones, Steven Kelk and Leen Stougie

[1] Leo van Iersel, Remie Janssen, Mark Jones, Yukihiro Murakami and Norbert Zeh. A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees. arXiv:1907.08474 [cs.DM] (2019).

[2] Maciej F. Boni, Philippe Lemey, Xiaowei Jiang, Tommy Tsan-Yuk Lam, Blair W. Perry, Todd A. Castoe, Andrew Rambaut and David L. Robertson. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol 5, 1408–1417 (2020). https://doi.org/10.1038/s41564-020-0771-4

[3] James Oldman, Taoyang Wu, Leo van Iersel and Vincent Moulton. TriLoNet: Piecing together small networks to reconstruct reticulate evolutionary histories. Molecular Biology and Evolution, 33 (8): 2151-2162 (2016). http://dx.doi.org/10.1093/molbev/msw068 (postprint)

[4] Yukihiro Murakami, Leo van Iersel, Remie Janssen, Mark Jones and Vincent Moulton. Reconstructing Tree-Child Networks from Reticulate-Edge-Deleted Subnetworks. Bulletin of Mathematical Biology, 81(10):3823–3863 (2019).

[5] Fabio Pardi and Celine Scornavacca. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput Biol, 11(4), e1004135 (2015).

[6] Grimm, Guido; Morrison, David (2020): Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581.v3

[7] Lam, Tommy Tsan-Yuk, Marcus Ho-Hin Shum, Hua-Chen Zhu, Yi-Gang Tong, Xue-Bing Ni, Yun-Shi Liao, Wei Wei, et al. Identifying SARS-CoV-2 Related Coronaviruses in Malayan Pangolins. Nature, 583, 282–285 (2020). https://doi.org/10.1038/s41586-020-2169-0

[8] Wang, Hongru, Lenore Pipes, and Rasmus Nielsen. Synonymous Mutations and the Molecular Evolution of SARS-Cov-2 Origins. [Preprint] Evolutionary Biology, April 21, 2020. https://doi.org/10.1101/2020.04.20.052019

[9] Li, Xiaojun, Elena E. Giorgi, Manukumar Honnayakanahalli Marichannegowda, Brian Foley, Chuan Xiao, Xiang-Peng Kong, Yue Chen, S. Gnanakaran, Bette Korber, and Feng Gao. Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection. Science Advances, Vol. 6, no. 27 (2020). https://doi.org/10.1126/sciadv.abb9153

[10] David Morrison, Introduction to Phylogenetic Networks. RJR Productions, Uppsala, Sweden (2011). http://www.rjr-productions.org/Networks/index.html

Monday, November 2, 2020

The end of this blog?

Last week's post is planned to be the final one for The Genealogical World of Phylogenetic Networks, at least for the time-being. This post is simply to say goodbye, and to say thanks to all of our readers.

As some of you may know, Blogger has decided that it is no longer interested in having contributors who work on desktop computers. They have changed their author interface to one designed for swiping and pressing on a small touch-screen, not typing and mousing while looking at a full-size screen. This new interface is almost unusable on my computing equipment — they have taken a limited but quite usable system and made it unworkable, in practice. *

Notably, the new system for automatic formatting of the posts does not match the format of our blog, and there is thus now an added onus to do everything manually, undoing the mess created by the automatic formatter, or typing in the HTML code for yourself (which is what it seems to assume you actually want to do). It won't even paste images into the right place in the text!

Even worse, Blogger (owned by Google) will no longer even allow me to log in using my version of Chrome (owned by Google); Safari (owned by Apple) can no longer display the administration pages at all; and Firefox (owned by Mozilla) will allow me to work on posts but not allow me to comment on them.

Moreover, recently Blogger unexpectedly deleted one of my completed posts, without warning. It has not done this for a long time, and it will be the last time it does it to me — I have had enough of the new system.

So, after more than 8.5 years (the first post was on February 25, 2012), it is time for me to say goodbye. I have had six co-authors, at various times, and I thank them very much for being here. We have now written 663 posts about lots of topics, most of them related to phylogenetics in one way or another — and most of it seemed like fun at the time, although obviously a lot of work for everyone. Here is a graph with a brief timeline, as Guido's summary of the blog's history (click to enlarge).

Timeline of the Genealogical World of Phylogenetic Networks

Finally, thanks to all of you, for being readers — it would be awfully lonely here without you. By my calculations, we are currently getting c. 2,000 non-bot hits per week, which is very respectable. (NB: viewbot hits have comprised at least 20% of all traffic, first detected after c. 100 posts.) The blog is treated as a "real" scientific output, and the posts are thus indexed by sites like Altmetric (there are a couple of their example pages shown below).

The Comments facility will soon be switched off. The blog itself will remain online, for any new readers; but it will effectively be an archive, until such times as Blogger decides to delete it (whether intentionally or not), or someone has the inspiration to create a new post.

cheerio

David
david.morrison@ebc.uu.se

* My local liquor chain (the only one in the country!) has just done the same thing. Instead of having one web page with all of the relevant information for each item, there are now half-a-dozen pages, each with a small amount of information, and with VERY BIG text. Parts of the new site do not work at all on my computer; and they have taken away facilities that I actually relied upon. I find this new site just as unusable as the new Blogger.

Here are a couple of example Altmetric pages referring to our blog posts (also provided by Guido).

Monday, October 26, 2020

Just try it for your data – a last first-of-its-kind Neighbor-net using FTIR data

This is likely to be my last post for this blog.

Some thoughts

When I joined the Genealogical World of Phylogenetic Networks three years ago, I didn't know how much fun it is to blog about science. Blogging, or writing essays, has several advantages over the traditional way to get a researcher's ideas out into the world: writing a scientific paper. The most important one is that one can just try out something without having to worry about how it would get past the peer-reviewers and editors (or as I like to call them: the Mighty Beasts lurking in the Forest of Reviews). When I was still a (sort-of) career scientist (i.e. paid by tax-payers to do science), I had my share of discouraging experiences whenever we tried to leave the beaten (and worn out) paths to try something new; to look into the dark places and not right under the street-lights.

Before we submitted papers, we hence put a considerable effort into them, pondering what our peers might criticize, or what might alienate them (being likely unfamiliar with our methodological and philosophical approaches), in order to minimize the chance that all our work would be for nothing. In a couple of cases, where we expected fierce resistance, we opted for low-impact journals with no manuscript length restrictions and more welcoming editors and peers, to be able to put in everything that we had. Some of my best bits are buried in journals where you'd never expect them!

But it was increasingly annoying, nevertheless. It was no fun anymore to formally publish research, and so I let my career run out in the 2010s as smoothly as it had started in the Zeroes.

David's encouragement to write blog posts, just after I retired early, thus revitalized my interest in science, to "boldly go where no-one has gone before". The amount of effort is typically much lower, although some of my posts do involve the same work that I put into the papers that I co-authored. More importantly, there are no beasts in the World Wide Web that can bite you from the shadows; they have to do it in the open. It's an ideal way to get an idea out, without having to think about the consequences. None of the work I put into a post has been in vain. What a difference: before, for every graph / analysis result published, two ended in the bin, many devoured by the Mighty Beasts.

And maybe somebody will find the work interesting enough to try it out, and eventually my idea will find a place in the sanctioned, peer-reviewed scientific world anyway. Since I'm out of business, I can afford not to cash in the credit (no-one formally cites a blog post).

My last Neighbor-net for the Genealogical World

For Neighbor-nets (NNets) and me, it was love at first sight (this was, in my case, ~2005, when my boss Vera Hemleben, a geneticist, sent me over to the new professor in our bioinformatics department, named Daniel Huson, who had just released a new software package, SplitsTree). These networks are...

... most versatile: any kind of data can be transformed into a distance matrix;
... quick-and-easy to infer.

And even if they are not phylogenetic networks in the strict sense – NNets are unrooted and their edge-bundles do not necessarily reflect evolutionary pathways – they more often than not point towards common origins and down-scale ± complex phylogenetic relationships more comprehensively than any phylogenetic tree (coalescent or not) that we could infer. The Genealogical World is full of examples, and the writers of this blog such as David [homepage], Mattis [GoogleScholar/ homepage], myself [GoogleScholar/ homepage], and like-minded researchers have published quite a few of them (in high- and low-impact journals). For a comprehensive, permanently updated list see Philippe Gambette's Who's Who in Phylogenetic Networks page.

For my final post, I decided on a fascinating new data source in paleobiology: Fourier transformed infrared spectra (FTIR) of fossil cuticles.

The cuticle is a plant's skin, and its composition and structure show a lot of variation, down to species level. Thus, the morphological-anatomical features of cuticles have long been used as taxonomic markers to identify fossil material. Using infrared spectroscopy, one can look at the chemical composition of cuticles. Like any other spectrum, an FTIR spectrum can be broken down into sets of qualitative (discrete, binned) or quantitative (continuous) characters; and one can then create a dissimilarity matrix for the investigated material. This is what Vajda, Pucetaite et al. (Nature Ecol. Evol. 1: 1093–1099, 2017) did for long-dead (Mesozoic) but enigmatic seed plants and their equally enigmatic modern counterparts.
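To make the step from spectra to a dissimilarity matrix concrete, here is a minimal Python sketch. It is not the procedure used by Vajda et al.; the binning scheme, the choice of Euclidean distance, the function names and the toy absorbance values are all assumptions made for illustration.

```python
import math

def bin_spectrum(absorbances, n_bins):
    """Average a spectrum (list of absorbance values) into n_bins bins.

    Assumes n_bins is not larger than the number of measured values.
    """
    size = len(absorbances) / n_bins
    binned = []
    for i in range(n_bins):
        chunk = absorbances[int(i * size):int((i + 1) * size)]
        binned.append(sum(chunk) / len(chunk))
    return binned

def distance_matrix(profiles):
    """Pairwise Euclidean distances between equal-length profiles.

    profiles: dict mapping a sample name to its binned spectrum.
    Returns the sorted sample names and a square matrix in the same order.
    """
    names = sorted(profiles)
    matrix = [[math.dist(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix

# Made-up spectra for three hypothetical samples.
spectra = {
    "sample_A": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "sample_B": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "sample_C": [3.0, 3.0, 1.0, 1.0, 2.0, 2.0],
}
binned = {name: bin_spectrum(values, 3) for name, values in spectra.items()}
names, matrix = distance_matrix(binned)
for name, row in zip(names, matrix):
    print(name, row)
```

A matrix of this kind is all that a distance-based method such as Neighbor-net needs as input.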

A UPGMA dendrogram based on FTIR data of fossil taxa (Vajda et al. 2017, fig. 4). Brackets to the right give the topology of the UPGMA dendrogram including extant material and data (Vajda et al. 2017, fig. 3).

PCA plots of the first and second (a), and first and third (b) coordinates, with the main seed plant lineages indicated (modified after Vajda et al. 2017, suppl.-fig. 4)

PCA and UPGMA are not phylogenetic inference methods, but there is obviously some phylogenetic signal encoded in these FTIR spectra, as shown above.

When I first saw the paper, I contacted the authors (including former colleagues of mine at Naturhistoriska riksmuseet in Stockholm), and the first author gave the second author, Milda Pucetaite (a Ph.D. student), a green light to share and convert her FTIR data into a simple distance matrix for me to run a NNet, as shown below.

Neighbor-net based on the combined distance matrix provided by Milda (pers. comm. July 2017).

Note that this NNet is a partly impossible graph, phylogenetically. The chemical composition naturally changes after the foliage (in this case) gets buried in sediment, and its cuticle is then conserved for millions of years by various taphonomic and diagenetic processes. As pointed out by the experienced biochemist among the authors during our correspondence: it is hence pointless to combine the data from extant and extinct taxa.

Well, since this is a post and not a paper, I combined them anyway. I find the result quite compelling, supporting the paper's conclusions including more speculative follow-up ones. The NNet reflects every aspect that this kind of data can provide for phylogenetic and systematic purposes.

The prominent central edge bundle reflects the taphonomic-diagenetic change separating the living from fossil samples. The basic sequence within the subgraphs is the same: ginkgoes are closest to cycads, and cycads bridge to Araucariaceae, which is a relict lineage of the "needle" trees, the conifers (many of which don't have needles but leaves). Bennettitales and Nilssoniales are extinct groups of seed plants, which are here resolved as a distinct lineage. The Bennettitales, especially, have long puzzled scientists. They may represent a third major lineage of seed plants that are neither angiosperms (flowering plants) nor gymnosperms (ginkgoes, cycads, conifers, gnetids), or perhaps an early side lineage of either one (or lineages, as their two main groups are quite different).

As for pretty much any kind of data, just try it out for yourself. This is exploratory data analysis (EDA), particularly useful to get a first, fast impression of the primary signal in your data. This is true even if you keep it to yourself, having to watch out for the Mighty Beasts of the Forest of Reviews (especially the ones that call themselves "cladists"), who are quick to tell you what you can't do, but not so forthcoming when it comes to pointing you to other options for analyzing your data.

My dive-in list for some more (im-)possible NNets

"Man gave name to all those animals": cats and dogs — a joint post with Mattis, where we just mapped the names on a NNet of world languages. Cormac Anderson joined for the second part dealing with goats and sheep.
Should we try to infer trees on tree-unlikely matrices — my first post for the Genealogical World of Phylogenetic Networks about a NNet that, at the time I made it, was impossible to publish.
Stacking networks based on sign language manual alphabets — NNets based on hand-shapes historically used for sign letters.
To boldy go where no one has gone before – networks of moons — a joint post with Timothy Holt, who scored celestial bodies of our stellar system for classification.
Visualizing U.S. gun laws — inferring NNets based on serious issues can be fun. Related post: The 2nd Amendment does more than keep King George away.

With David retiring, the Genealogical World of Phylogenetic Networks will fall dormant; the next and final post will be a farewell from David. Like Mattis (Von Wörtern und Bäumen), I will keep on science-blogging (in spite of the new buggy Blogger-editing interface forcing me to draft directly in HTML) for a little while (and irregularly) on my Res.I.P. blog, which also includes a tag for "phylo-networks" for any future NNets and the like.

Monday, October 19, 2020

Xenoplasy

A major obstacle in studying morphological evolution is homoplasy. This occurs when the same (or similar) traits evolve independently in different lineages (convergences), and are positively selected for or incompletely sorted within a lineage (parallelisms, homoiologies). Traits that do not sort following the true tree create incompatible signal patterns, and, eventually, topological ambiguity. No matter which inference method we use, we end up with several alternative trees that combine aspects of the true tree with artificial branching patterns.

Homoplasy is the rule, while trait sorting is the exception. Consequently, we have to expect that any morphology-based tree will have more wrong branches than correct ones.

For extant groups of organisms, a simple solution to the problem is to analyse morphological traits in the framework of a molecular phylogeny. The genetic data provide us with an independent, best-possible tree. By mapping the morphological traits on this tree, we can evaluate their potency as phylogenetic markers.
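Mapping a trait on a given tree can be sketched in a few lines of Python. The example below uses Fitch's parsimony algorithm to count the minimum number of state changes a trait needs on a fixed tree; this is one common way to do such a mapping, not a method named in this post, and the tree, taxon labels and trait scores are made up for illustration.

```python
def fitch_steps(tree, states):
    """Fitch parsimony on a rooted binary tree.

    tree: nested 2-tuples with leaf names as strings (hypothetical format).
    states: dict mapping each leaf name to its observed trait state.
    Returns (possible states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    common = left_set & right_set
    if common:
        # Both subtrees agree on at least one state: no extra change needed.
        return common, left_steps + right_steps
    # Disagreement: one additional change is required at this node.
    return left_set | right_set, left_steps + right_steps + 1

# A made-up "molecular" tree for five hypothetical taxa.
tree = ((("A", "B"), ("C", "D")), "E")

# Trait 1 sorts along the tree; trait 2 does not.
trait_1 = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0}
trait_2 = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}

print(fitch_steps(tree, trait_1)[1])  # 1 change: consistent with the tree
print(fitch_steps(tree, trait_2)[1])  # 2 changes: homoplasious on this tree
```

A binary trait needs at least one change, so any trait requiring more than one change on the reference tree is homoplasious with respect to that tree.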

But what if our group of organisms is not the product of a simple, repeated dichotomous splitting pattern? What if there were anastomoses as well? That is, what if the morphological traits are not the product of mere (incomplete) lineage sorting or incomplete gene sorting (the latter is called "hemiplasy") but of the fusion of traits from different lineages, so that a tree is not enough to explain the genetic data? What does this imply for the morphological differentiation we observe?

Take the London Plane (Platanus x acerifolia or P. x hispanica), for example, which is a tree that many of us are familiar with. In case you don't know the name: they are the large trees with a patterned bark and deeply lobed leaves and fluffy fruiting bodies found in abundance in parks and alleys throughout the world. It's a cultivation-hybrid (17th–18th century) of the North American plane tree, Platanus occidentalis, and its distant eastern Mediterranean relative, P. orientalis. These are genetically and morphologically distinct species. Their history is summarized in the following doodle (Grimm & Denk 2010).

Each line represents a semi-sorted nuclear gene region. The split between proto-PNA-E (SW. U.S., NW. Mexico, E. Mediterranean) and proto-ANA (Atlantic-facing Central America, E. U.S.) must have been > 12 myrs ago (last Platanus of Iceland). The minimum air distance between the sister species P. orientalis and P. racemosa is ~11,500 km (via the Arctic). Interestingly, fossils from that time and later (including western Eurasia) have more ANA-clade morphologies: P. orientalis- and P. racemosa-types pop up ~5 Ma. Both the ANA and PNA-E clades have distinct morphologies. With respect to the individual gene trees, those exclusively shared by P. palmeri and P. rzedowski with P. occidentalis s.str. and P. mexicana of the ANA clade could be addressed as "hemiplasies".

If you look at the leaves and fruits of London Planes, you can find everything in between the two endpoints; and the same holds for their genetics. The London Plane is much hardier than Europe's own P. orientalis and more drought-resistant than its hardier North American parent. With climate change going on, the hybrid will eventually meld with the European species entirely. And, thanks to what we call "hybrid vigor", given a few million years, it might consume its other parent, too. London Planes have been re-introduced into the Americas; and P. orientalis has become an invasive species in California, where it has started to hybridize with its local sister species P. racemosa. Now imagine a future researcher of Platanus evolution having to deal with a highly complex accumulation of Platanus fossils in the Northern Hemisphere, while being able to study only the left-over complex genetics of a single species that replaced two.

This is where a recently coined new concept comes in: xenoplasy.

Yaxuan Wang, Zhen Cao, Huw A. Ogilvie, Luay Nakhleh (2020). Phylogenomic assessment of the role of hybridization and introgression in trait evolution. bioRxiv doi: 10.1101/2020.09.16.300343

Xenoplasies are traits that originate from hybridization and subsequent introgression. In standard phylogenetics, they would act like any homoplasious character, but their distinction is that they are not independently evolved. They are captured via lineage crossing, and reflect a common ancestry.

Example of a trait incongruent with the species tree, representing a xenoplasy obtained by introgression from the I1-A lineage, which evolved the trait, into the I3-B lineage, part of the I2 clade. Depending on how far they are affected by incomplete lineage sorting (ILS) and introgression, individual genes may result in any of the three possible genealogies. Modified after Wang et al. (2020), fig. 1.

As such, their phylogenetic weight (information content) equals that of the anyhow rare classic autapomorphies or synapomorphies (fide Hennig), and this weight is higher than that of the more common homoiologies, shared apomorphies or symplesiomorphies. Note, in the palaeozoological cladistic literature, sorted versions of the latter three are often called synapomorphies – any lineage-specific, derived trait ("synapomorphy") may be lost / modified in some sublineage(s), or rarely pop-up outside the lineage.

Wang et al. provide an analytical framework for identifying a trait as xenoplasy, and assessing the probability for it ("xenoplasy risk factor"). If you're interested in the mechanics, check out the pre-print. The mathematical part of my brain has been dormant for most of the last two decades (when I exchanged chemistry for geology-biology), so I'm more into possible applications to explore this new concept.

Where to look next

The Wang et al. real-world example (Jaltomata) is, however, not very appealing. The problem is that, to look for xenoplasy, we need data that require us to infer an explicit phylogenetic network (in the strict sense) to start with. In addition, we could use a morphological partition (scored morphological traits), which is usually absent. Last, identifying xenoplasies would make most sense for traits that can be traced in the fossil record, not only to identify potential products of past reticulation but also to get a better grip on placing critical fossils. Often overlooked by neontologists, fossils are the only physical proof that a lineage was at a certain place at a certain point in time. So, here are two examples: beeches and bears.

Beeches are a small genus of extra-tropical angiosperm trees with a pretty well understood fossil record. Morphologically, their differentiation is very hard to put into a tree, as shown here.

A morphology-based Neighbor-net of fossil (open circles) and extant beech (closed circles) taxa. Coloration gives the (paleo-)geographic distribution (abbreviated as three letters). For more background and information see my Res.I.P. post: The challenging and puzzling ordinary beech – a (hi)story

Mapping species-discriminating traits on a tree would be of little help here, because the modern species are the product of recurrent phases of mixing and incomplete sorting. I have summarized this in the following doodle, depicting the diversification and propagation of 5S-IGS variants (a non-transcribed, poly-copy, multi-array intergenic nuclear spacer) in a still very small sample.

A doodle summarizing differentiation patterns in a sample of 686 "representative" 5S-IGS variants obtained using high-throughput sequencing of six beech populations of western Eurasia and Japan (Simone Cardoni et al., to be submitted in the near future; see Piredda et al. 2020 for a similar analytical set-up).

The people involved in researching this project (drawn by passion rather than resources) don't have the resources to generate the NGS data needed to construct a species network for all of the species of beech, like Wang et al.'s Jaltomata data. But given that there are only 9–10 species, it would be easy prey for a well-funded research group. If you are interested, but don't know how to get the material and are unfamiliar with beeches, feel free to contact the senior author of Piredda et al. 2020, Marco Simeone — new beech-enthusiasts are always welcomed by this group.

Bears are one of the best-studied extant mammal predators, and they also have a decent fossil record. This is probably the reason that Heath et al. (2014) used bears as the case study when introducing their new molecular dating approach: the fossilized-birth-death dating.

A fossilized birth-death dated tree of bears (modified from Heath et al. 2014, fig. 4). The numbers in brackets give the number of fossil taxa (extinct genera, Ursus spp.) listed on Wikipedia.

As nice as it looks (and as well done as it is), their analysis is pretty flawed from an evolutionary point of view. Their dated tree only reflects a single aspect of bear evolution and may involve branch-length artifacts. Heath et al. relied on complete mitochondrial genomes, which they combined with a single nuclear protein-coding gene. Mitochondrial genes reflect only the maternal lineage; they did not date a species tree but a mitochondrial genealogy. Paternal and biparentally inherited gene markers (which include nuclear genes) tell very different stories about species relationships (this is why we also used the bears as example data for Schliep et al. 2017).

Strict, branch-length ignorant Consensus network of three trees inferred using species-consensus sequences generated from three sets of data: biparentally inherited nuclear-encoded autosomal introns (ncAI), paternally inherited Y-chromosomes (YCh) and maternally inherited mitochondrial genes (complete set; mtG). This is clearly not the product of a strictly dichotomous evolution. Thick lines: edges found in Heath et al.'s chronogram (= mitochondrial genealogy).
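The idea behind combining several trees into a network of splits can be sketched in Python: collect the non-trivial splits (bipartitions of the taxa) of each tree and test which pairs are incompatible, since incompatible splits are what force box-like structures in a network. This is only a toy illustration of the principle, not the procedure used for the figure above; the nested-tuple tree format, the function names and the four-taxon trees are assumptions, and the trees are not the bear data.

```python
def clades(tree):
    """Return (leaf set, list of clades) for a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found

def nontrivial_splits(tree):
    """Unrooted splits of a tree, ignoring branch lengths.

    Each split is stored as the side that does not contain a fixed
    reference taxon, so that identical splits compare equal.
    """
    taxa, found = clades(tree)
    ref = min(taxa)
    splits = set()
    for clade in found:
        side = taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
    return splits

def compatible(s1, s2, taxa):
    """Two splits fit on one tree if one of the four intersections is empty."""
    return any(not (a & b) for a in (s1, taxa - s1) for b in (s2, taxa - s2))

# Two made-up, conflicting gene trees on four hypothetical taxa.
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
taxa = frozenset("ABCD")

split_1 = next(iter(nontrivial_splits(tree_1)))
split_2 = next(iter(nontrivial_splits(tree_2)))
print(sorted(split_1), sorted(split_2), compatible(split_1, split_2, taxa))
```

When all collected splits are pairwise compatible, they can be drawn as a single tree; each incompatible pair needs a network.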

And while it may be that morphology reflects more the maternal than the paternal side, this has never been tested. Neither has anyone tested how morphology fits with the coalescent species tree, which would be a network, as shown below.

Gene flow in bears within the last 5 myrs (estimate; from Kumar et al. 2017).

How Heath et al. linked the fossils to clades might have been just as wrong as it was right (note that FBD dating is much less biased by misplaced or suboptimally placed fossils than traditional node dating). Hemi- and xenoplasy must be considered here. In addition to the highly incongruent paternal and maternal genealogies, we know that even the morphologically most distinct sister species (grizzlies, a special form of Brown Bear, and polar bears) can produce viable offspring ("Grolar") with morphological traits from either side of the family (usually, the Grizzly-side dominates).

Wildlife services usually kill these hybrids as they are considered to speed up the decline of polar bears (they are food competitors). However, with the (possibly inevitable) melting of the polar caps, these hybrids could be instrumental in the survival of a bit of Polar Bear legacy, in the form of genetic diversity not found in brown bears, and xenoplasies. If two highly distinct bear species hybridize today in the wild due to (in this case: human-induced) environmental pressure, their ancestors probably have done so in the past in reaction to shifting habitats and migration patterns.

Given how long bears have intrigued researchers, there are plenty of classic morphological studies involving fossils; and, in the light of the vast amount of molecular data (including ancient DNA!) that have been collected for bears, it should be pretty easy to apply Wang et al.'s new approach to bears. For example, is the Cave Bear a dead-end side lineage, an introgressed lineage, or a hybrid dead-end? Mitochondrial-wise, Cave bears are placed as sister to Brown and Polar bears, but that's just because of their provenance. Like chloroplast genealogies in plants, mitochondrial genealogies in animals typically show a strong geographic correlation. Especially in bears, the mothers and daughters don't migrate as much as the fathers and sons.

Mitochondrial genealogy of bears including Cave bears (Kumar et al. 2017, fig. 3), the famous European bears of the Ice Ages. ABC bears are insular brown bears living on the subarctic Admiralty, Baranof and Chichagof islands of the Alexander archipelago, known as a natural example for gene flow between Brown and Polar bears (Kumar et al. 2017, fig. 1, provides a map of the current distribution of bears).

Postscriptum

Birds are another animal group that likes to diversify into many species, some of which love to transgress recently established species barriers, forming hybrid swarms. These are actually dinosaurs, a group exclusively studied using cladistic analyses of morphological traits providing non-tree-like signals: mostly homoplasies, a lot of not-really-synapomorphies (a good deal are probably homoiologies), and, it wouldn't surprise me, one or another xenoplasy. Or can we assume they were much too advanced to hybridize and introgress?

Cited literature

Grimm GW, Denk T. 2010. The reticulate origin of modern plane trees (Platanus, Platanaceae) - a nuclear marker puzzle. Taxon 59:134–147.
Heath TA, Huelsenbeck JP, Stadler T. 2014. The fossilized birth–death process for coherent calibration of divergence-time estimates. PNAS 111:E2957–E2966.
Kumar V, Lammers F, Bidon T, Pfenninger M, Kolter L, Nilsson MA, Janke A. 2017. The evolutionary history of bears is characterized by gene flow across species. Scientific Reports 7:46487 [e-pub].
Piredda R, Grimm GW, Schulze E-D, Denk T, Simeone MC. 2020. High-throughput sequencing of 5S-IGS in oaks: Exploring intragenomic variation and algorithms to recognize target species in pure and mixed samples. Molecular Ecology Resources doi:10.1111/1755-0998.13264.
Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution 8:1212–1220.

Monday, October 12, 2020

Tattoo Monday XXI

There are a number of tattoo designs that incorporate the concept of a Tree of Life with the concept of DNA. A selection of these was included in the previous post, Tattoo Monday XX. Here are a few more.

Monday, October 5, 2020

Rogue dinosaurs, an example from the Aetosauria

In several earlier posts (a non-comprehensive link list can be found at the end of the post), I outlined how networks, tree-sample (Consensus networks, SuperNetworks) or distance-based (Neighbor-nets) may be of practical help, especially when we study phylogenetic relationships of extinct organisms.

In this post, I will further explore this by looking at a matrix for Aetosauria (Parker 2016, PeerJ) that provides an overall (relatively) strong and unambiguous signal. [NB: The reason I prefer to use PeerJ papers as examples is that it is one of the very few journals that is open access and has a strict open data policy; to publish there, authors have to give access to the data used.]

In the abstract of the original paper, we read the following:

Nonetheless, aetosaur phylogenetic relationships are still poorly understood, owing to an overreliance on osteoderm characters, which are often poorly constructed and suspected to be highly homoplastic. A new phylogenetic analysis of the Aetosauria, comprising 27 taxa and 83 characters, includes more than 40 new characters that focus on better sampling the cranial and endoskeletal regions, and represents the most comprehensive phylogeny of the clade to date. Parsimony analysis recovered three most parsimonious trees; the strict consensus of these trees finds an Aetosauria that is divided into two main clades: Desmatosuchia, which includes the Desmatosuchinae and the Stagonolepidinae, and Aetosaurinae, which includes the Typothoracinae.

Parker's (2016) fig. 6 shows the results of the "initial analysis" (click to enlarge, colored annotations added by me).

Systematic groups based on clades are abbreviated (see next graph for full names).

A is a "Strict component consensus" of the 30 inferred MPTs (most parsimonious trees), B the Adams consensus, C the Majority rule consensus (branch labels give percentages for branches not found in all MPTs), and D a "Maximum agreement subtree" after a priori pruning of one taxon (black star) within the upper clade.

Parker's (2016) fig. 7 then shows the preferred result: a "reduced strict consensus of 3 MPTs" with the red star taxon removed, and (rarely seen in dinosaur phylogeny papers) branch-support — including Bootstrap support values below 70, which are very rarely reported in the literature (from my own experience it seems that editors of systematic biology journals don't like them).

Removal of one rogue taxon (called a "wildcard" in paleozoology), Aetobarbakinoides brasiliensis, substantially reduced the number of MPTs. Nonetheless, many branches have low support, and hence also the clades (used here as synonym for monophyla) derived from them – Parker uses branch-based ("stem"-based, brackets on his tree), and node-based taxa (dots).

Low branch support may or may not matter

There are two possible reasons for low branch-support:

non-discriminatory signal: any alternative branching pattern receives diminishing support
internal signal conflict: two (or more) alternatives receive similar support.

Mapping the support on the preferred (inferred) optimal tree cannot tell us whether it's the one or the other — only Support consensus networks can visualize this. Since we are interested in the rogue, I re-ran the parsimony BS analysis (10,000 quick-and-dirty replicates, following Müller 2005, BMC Evol. Biol. 5:58) including Aetobarbakinoides brasiliensis.
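The distinction between the two cases becomes explicit once the bootstrap frequencies of the competing splits are known, which is exactly the information a Support consensus network displays. A minimal sketch in Python; the function name and both cut-offs are hypothetical choices for illustration and are not taken from Parker (2016) or from the re-analysis described here:

```python
def classify_low_support(preferred_bs, alternative_bs, low=25, ratio=0.5):
    """Classify why a branch has low bootstrap (BS) support.

    preferred_bs   -- BS percentage of the branch in the preferred tree
    alternative_bs -- BS percentages of competing, incompatible splits
    low, ratio     -- hypothetical cut-offs, chosen for illustration only
    """
    best_alt = max(alternative_bs, default=0)
    strongest = max(preferred_bs, best_alt)
    weakest = min(preferred_bs, best_alt)
    if strongest < low:
        # No split receives notable support: the data cannot discriminate.
        return "non-discriminatory signal"
    if weakest >= ratio * strongest:
        # Two alternatives with comparable support compete with each other.
        return "internal signal conflict"
    return "one alternative dominates"
```

The point of the sketch is only that a single support value on a tree is not enough input: the function needs the values of the alternatives as well.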

Support consensus network based on 10,000 parsimony BS pseudoreplicates. Trivial splits collapsed; only splits that occurred in at least 20% of the BS replicates are shown.
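The filtering described in the caption can be sketched as follows. This is a simplified illustration and not the procedure used by any particular program: it assumes that each bootstrap tree is already available as a list of clades (sets of taxon names), and the function name and input format are hypothetical.

```python
from collections import Counter


def split_frequencies(replicate_trees, taxa, min_share=0.2):
    """Tally non-trivial splits over bootstrap replicate trees.

    replicate_trees -- list of trees, each given as an iterable of clades
                       (sets of taxon names on one side of a branch)
    taxa            -- all taxon names
    min_share       -- keep splits found in at least this share of replicates
    """
    taxa = frozenset(taxa)
    counts = Counter()
    for tree in replicate_trees:
        seen = set()
        for clade in tree:
            side = frozenset(clade)
            other = taxa - side
            # Trivial splits separate fewer than two taxa from the rest.
            if len(side) < 2 or len(other) < 2:
                continue
            # A split is unordered, so store its two sides as a set.
            seen.add(frozenset((side, other)))
        # Each split counts once per replicate tree.
        counts.update(seen)
    n = len(replicate_trees)
    return {s: c / n for s, c in counts.items() if c / n >= min_share}
```

The splits that survive the cut-off, together with their frequencies, are what a consensus network program then draws as edge bundles.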

The decreased/low BS support within the most terminal (root-distant) subtrees, the Des'ini and Par'ini, relates to conflicting alternatives involving one or two OTUs. In the case of Des'ini, it is the affinity of Lucasuchus and NCSM 21723, while in the case of Par'ini an alternative (recognizing Tecovasuchus as sister to the remainder) is found in one out of three BS pseudoreplicate trees. The diminishing support for basal relationships (root-proximal branches/edges) is due to the general lack of discriminatory signal (BS of any alternative < 25). However, there are very few situations in which the best-supported alternative differs much from that in the preferred tree. For instance, any alternative to a Stag'inae sister relationship has a BS of less than 24 (BS = 27 in Parker's "reduced" tree).

Our rogue, however, is not really a 'wildcard'. The scored characters simply put it much closer to the outgroup than is any other ingroup taxon. A simple explanation could be that it is a most primitive (least derived) member of the Aetosauria. Another possibility is that it lacks any critical trait needed to place it within the ingroup. Since the deep splits within the Aetosauria rely on very few character changes, we can put it in different positions down here and the tree will still have the same number of inferred changes.

Trivial and non-trivial taxa

The cladograms typically shown provide limited information about the signal in the underlying matrix, its strength and weaknesses, even when not "naked" but annotated using branch-support values. Given that there are no severe overlap gaps in the data, a very quick alternative is the Neighbor-net (a necessary addition, in my opinion).

Bold edges correspond to branches (hence: clades) in Parker's preferred tree.

Using this, we can directly depict which groups, potential clades, draw substantial (partly trivial) character support.

For instance, according to Parker's tree and following cladistic classification, Stagonolepis is an invalid taxon: one species (St. robertsoni) is part of the Stag'inae clade, the other (St. olenki) is part of the Des'inae clade. Character support is, however, nearly non-existent (Bremer value = 1 and BS = 7 in the original analysis; BS ≤ 20 for any competing alternative in our re-analysis). The distance network shows us why: indeed, both species are closest to each other; but, while St. robertsoni shares a critical Stag'inae character suite and, consequently, shows the highest similarity to Polesinesuchus, St. olenki does not share this (note the lack of a corresponding neighborhood). Furthermore, any alternative placement fits even less. Parker's tree only resolved it as sister to all other Des'inae because it didn't fit into any of the well-supported, terminal clades (prominent edge-bundles).

We can also see where we may have to deal with internal signal conflict, and how this may affect the tree inference and lead to ambiguous branch support. Take, for instance, the NCSM 21723 individual (= Gorgetosuchus pekinensis). It's clearly a Des'inae. The reason we have ambiguous branch support for this staircase-like subtree is that NCSM 21723 is substantially more similar to the distant, equally evolved sister lineage, the Par'ini (purple edge bundle). Hence, it must be placed as sister to all other Des'inae, although it appears to represent a more derived form than Longosuchus, representing the next step towards the most-derived crown-taxon Desmatosuchus. Tecovasuchus is the source of topological conflict within the Par'ini because it is the least-derived taxon. Its primitiveness will be expressed by placing it as sister to all other Par'ini, while few shared, non-exclusive apomorphies are behind its position in the preferred tree (Bremer value = 1, BS = 48 in Parker's fig. 7).

While it is obvious that the matrix has no clear tree-like signal for resolving any OTU that is not part of the terminal Des'ini and Typ'inae lineages, our 'wildcard' (Aetobarbakinoides) is particularly close to the outgroup while showing no affinity to anything else. If it is part of the ingroup, it represents the ancestral form, i.e. shows a character suite that is primitive (derived traits may be missing because they are simply not preserved: see description of the taxon in Parker 2016). This is the reason why it acted rogue-ish in tree inferences even though its favored phylogenetic position is clear.

Data

Parker's original matrix can be found in the supplement to the paper. An annotated ready-to-use NEXUS-formatted version (including my standard codelines for parsimony and distance bootstrapping) and the inference results used here can be found in this figshare submission, which I generated for a technical Q&A.

Here is the promised list of previous posts dealing with fossils and networks.

Should we try to infer trees on tree-unlikely matrices? July 2017; the signal in phylogenetic matrices of major groups of extinct and extant seed plants.
More non-treelike data forced into trees: a glimpse into the dinosaurs, Aug. 2017; why also paleozoologists should start with network-based EDA — exploratory data analysis.
Networks, not trees, identify "weak spots" in phylogenetic trees, Oct. 2017; how Consensus networks can be used to visualize topological conflict among MPTs.
Summarizing non-trivial Bayes tree samples for dating? Just use support consensus networks, Jan. 2018; Bayesian Consensus networks based on mixed data matrices.
The curious case(s) of tree-like matrices with no synapomorphies, joint post with David, Apr. 2019; looking at CI, RI values and treelikeness.
Networks for matrices used in Cladistics studies, part 1 (historical matrices), part 2 (recent matrices), Nov. 2018; a collection of networks inferred from matrices used to infer parsimony trees.
Phylogenetic ambiguity: data gaps, indifference and internal conflict, Jan. 2019; an example (squids) for why consensus networks should be obligatory when facing ambiguous branch support.
Why the emperor has no clothes on – a thicket of trees, Nov. 2019; gene tree incongruence in plant plastomes and why it probably has little to do with decoupled gene histories.
Large morphomatrices – trivial signal, Feb. 2020; about the principal signal in a bird-dinosaur supermatrix.
Supernetworks and gene tree incongruence, May 2020; about mtDNA and splits in early land plants
Fossils and Networks 2: Deleting (and adding) a tip, Aug. 2020; studying the effect of removing a single taxon from the tree inference using the best-sampled taxa of the bird-dinosaur supermatrix.

Monday, September 28, 2020

Analyzing rhyme networks (From rhymes to networks 6)

For this final post of my little series on rhyme networks, I set myself the ambitious goal of providing concrete examples of how rhyme networks for languages other than Chinese can be analyzed. Unfortunately, I have to admit that this goal turned out to be a bit too ambitious. Although I managed to create a first corpus of annotated German rhymes, I am still not entirely sure how to construct rhyme networks from this corpus. Even if this problem is solved pragmatically, I realized that the question of how to analyze the rhyme network data is far less straightforward than I originally thought.

I will nevertheless try to end this series by providing a detailed description of how a preliminary rhyme network of the German poetry collection can be analyzed. Since these initial ideas for analysis still have a rather preliminary nature, I hope that they can be sufficiently enhanced in the nearer future.

Constructing directed rhyme networks

I mentioned in last month's post that it is not ideal to count, as rhyming with each other, all words that are assigned to the same rhyme cluster in a given stanza of a given poem, since this means that one has to normalize the weights of the edges when constructing the rhyme network afterwards (List 2016). I also mentioned the personal communication with Aison Bu, who shared the idea of counting only those rhymes that are somehow close to each other in a stanza.

During this month, I finally found time to think about how to account for this idea in practice, and I came up with a procedure that essentially yields a directed network. In this procedure, we first extract all of the rhyme words in a given stanza in the order of their appearance. We then proceed from the first rhyme word and iterate over the rest of the rhyme words until we find a match. Having found a match, we interrupt the loop and add a directed edge to our rhyme network, which goes from the first rhyme word to its first match. We then delete the first rhyme word from the list and proceed again.
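The procedure can be sketched in a few lines of Python. This is a minimal illustration and not the code from the Gist: it assumes that a stanza is given as a list of (rhyme word, rhyme label) pairs in order of appearance, and the function name is hypothetical.

```python
from collections import Counter


def directed_rhyme_edges(stanza):
    """Link every rhyme word to the first later word with the same label.

    stanza -- list of (word, rhyme_label) pairs in order of appearance
              (a hypothetical input format, assumed for illustration)
    Returns a Counter mapping (source, target) to the edge weight.
    """
    edges = Counter()
    remaining = list(stanza)
    while remaining:
        # Take the first rhyme word and remove it from the list.
        word, label = remaining.pop(0)
        for other, other_label in remaining:
            if other_label == label:
                # First match found: add the directed edge and stop.
                edges[(word, other)] += 1
                break
    return edges
```

Summing the counters of all stanzas in all poems then gives the weighted, directed network. A stanza with the rhyme words haus, leben, aus, geben, maus (labels a, b, a, b, a) yields the three edges haus → aus, leben → geben and aus → maus; haus is not linked to maus, because only the first match counts.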

This procedure yields a directed, weighted rhyme network. At first sight, one may not see any specific advantages in the directionality of the network, but in my opinion it does not necessarily hurt; and it is straightforward to convert the network into an undirected one by simply ignoring the directions of the edges and collapsing those which go in two directions in a given pair of rhyme words.
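As a sketch of this conversion, assuming the directed network is stored as a mapping from (source, target) pairs to weights, and assuming that the weights of the two directions are simply added when they are collapsed (other choices are possible):

```python
from collections import Counter


def to_undirected(directed_edges):
    """Collapse directed, weighted edges into undirected ones.

    directed_edges -- mapping {(source, target): weight}
    Edges running in both directions between a pair of words are merged
    by summing their weights (an assumption made for this sketch).
    """
    undirected = Counter()
    for (source, target), weight in directed_edges.items():
        # A frozenset ignores the direction of the edge.
        undirected[frozenset((source, target))] += weight
    return undirected
```

For example, two directed edges aus → haus (weight 10) and haus → aus (weight 9) end up as one undirected edge between aus and haus with weight 19.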

Handling complex rhymes

In last month's blog post, I also mentioned the problem of handling rhymes that stretch across more than one word. While these are properly annotated (in my opinion), I had problems handling them in the rhyme network I presented last week. We find similar problems when working with certain rhymes involving words with more than one syllable. As an example, consider the following words which are all taken from the song Cruisen, and which I further represent in syllabified form in phonetic transcription.

Rhyme Words	Stressed Syllable	Unstressed Syllable
Tube	tuː	bə
Bude	buː	də
Gurke	guɐ	kə
hupe	huː	pə
Kurve	kuɐ	və
Schurke	ʃuɐ	kə
Punkte	puŋ	tə

These words do not rhyme according to traditional poetry rules (where unstressed syllables following stressed syllables need to be identical), but they do reflect a common rhyme tendency in German Hip Hop, where rhyme practice has been evolving lately. In order to properly account for this, I assigned both the first and the second syllable of the words to their own rhyme group (one stressed syllable rhyme and one unstressed syllable rhyme).

When constructing the rhyme network, however, the separation into two rhyme groups turned out to not make much sense any longer, since the rhymes occur on a sub-morphemic level, where the parts do not themselves express a meaning anymore. To cope with this, I modified the network code slightly by treating only those words as rhyming with each other which show identical rhyme groups in all of their syllables.
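The modified matching rule amounts to comparing whole sequences of rhyme groups instead of single groups. A minimal sketch, in which the function name, the labels "u" and "e" and the one-syllable entry are hypothetical and only serve as illustration:

```python
def rhyme_with_each_other(groups_a, groups_b):
    """Return True only if the rhyme groups agree in every syllable.

    groups_a, groups_b -- sequences with one rhyme-group label per
                          rhyming syllable (labels are hypothetical)
    """
    return tuple(groups_a) == tuple(groups_b)


# Hypothetical annotation: "u" marks the stressed-syllable rhyme group,
# "e" the unstressed-syllable rhyme group.
annotation = {
    "Gurke": ("u", "e"),
    "Schurke": ("u", "e"),
    "monosyllable": ("u",),  # placeholder for a one-syllable rhyme word
}
```

Under this rule Gurke and Schurke are linked, while a word that only shares the stressed-syllable group is not linked to either of them.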

Infomap communities and connected components

Having constructed the rhyme network in this new way, we can start with some preliminary analyses. As a first step, it is useful to check the general characteristics of the network. When using the new approach for network construction and the correction for complex rhymes, as reported above, the network consists of 3,104 nodes which together occur as many as 7,707 times. The network itself is only sparsely connected, being separated into 840 connected components.
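Counting connected components does not require a network library. A minimal sketch in plain Python, assuming that the network is given as a list of word pairs with edge directions ignored; the function name is hypothetical:

```python
def connected_components(edges):
    """Return the connected components of an undirected graph.

    edges -- iterable of (word_a, word_b) pairs; direction is ignored
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first search collecting everything reachable from start.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components
```

Note that words without any rhyme partner never appear in an edge list, so they would have to be added separately if they should count as components of their own.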

As a first and very straightforward analysis, I used the Infomap algorithm (Rosvall and Bergstrom 2008) to see whether the connected components could be split any further. This analysis resulted in 932 communities, indicating that quite a few of the larger connected components in the rhyme network seem to show an additional community structure.

Unfortunately, I have not had time for a complete revision of all of the communities, but when checking a few of the larger connected components that were later separated into several communities, it seemed that most of these cases are due to very infrequent rhymes that are only licensed in very specific situations. As an example, consider the figure below, in which a larger connected component is shown along with the three communities identified by the Infomap algorithm.

The three communities, marked by the color of the nodes in the network, reflect three basic German rhyme patterns, which we can label -ung, -um, and -und. Transitions between the communities are sparse, although they are surely licensed by the phonetic similarity of the rhyme patterns, since they share the same main vowel and only differ by their finals, which all show a nasal component. The Infomap analysis assigns the nodes rum and krumm wrongly to the -und pattern but, given how sparse the graph is (with weights of one occurrence only for all of the edges), it is not surprising that this can happen. Both instances where edges connect the communities are rhymes occurring in the same Hip Hop lyrics from the song Geschichten aus der Nachbarschaft, as can be seen from the following annotated line of the song.

Judging from quickly eye-balling the data, most of the communities that further split the connected components of the network reflect groups of very closely rhyming words (usually corresponding to what one might call perfect rhymes). Links between communities reflect either possible similarities between the rhyme words represented by the communities, or direct errors introduced by my encoding.

Unfortunately, I could not find time to further elaborate on this analysis. What would be interesting to do, for example, would be a phonetic alignment analysis of the communities, with the goal of identifying the most general sound sequence that might represent a given community. It would also help to measure to what degree transitions between communities conform to these patterns, or to what degree individual words might reflect the communities' consensus rhyming more or less closely.

But even the brief analysis here has shown me that, first, there are still many errors in my annotation, and, second, the Infomap algorithm for community detection seems to work just as well with German rhyme data as it works on Chinese rhyme data.

Frequent rhyme pairs and promiscuous rhyme words

As a last example of how rhyme networks can be analyzed, I want to have a look at frequently recurring patterns in the current poetry collection. A very simple first test we can do in this regard is to look at the edges with the highest weights in our networks. Poets typically try to be very original in their work, since nothing is considered as boring as repetition in the literature. Nevertheless, since the pool of words from which poets can choose when creating their poems is, by nature, limited, there are always patterns that are more frequently used.

The following table shows those directed rhymes that occur most frequently in the German poetry database.

Rhyme Part A	Rhyme Part B	No. of Poems
sein	lein	10
aus	haus	10
haus	aus	9
triebe	liebe	9
leben	geben	9
geben	leben	9
zeit	keit	9
nein	sein	8
wieder	lieder	7
nur	tur	7

This collection may not tell you too much, if you are not a native speaker of German. But if you are, then you will easily see that most of these rhymes are very common, involving either very common words (sein "to be"), or suffixes that frequently recur in different words of the German lexicon (-lein either as diminutive suffix or as part of allein "alone"). We also find the very sad match of liebe (Liebe "love") and triebe (Triebe "urges"), which is mostly thanks to the poems by Rainer Maria Rilke (1875-1926), who wrote a lot about "love", and had the same problem as most German poets: there are not many words rhyming nicely with Liebe (the only other candidates I know of would be bliebe "would stay" and Hiebe "stroke or blow").
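A ranking of this kind can be derived with a simple counter. The sketch below assumes that each poem is available as a list of directed rhyme pairs and that a pair is counted once per poem, which is how I read the column "No. of Poems"; the function name and input format are hypothetical.

```python
from collections import Counter


def most_frequent_rhymes(poems, top=10):
    """Count in how many poems each directed rhyme pair occurs.

    poems -- list of poems, each a list of (rhyme_a, rhyme_b) pairs
    """
    counts = Counter()
    for pairs in poems:
        # A pair counts once per poem, however often it recurs there.
        counts.update(set(pairs))
    return counts.most_common(top)
```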

As a last example, we can consider promiscuous rhyme words, that is, rhyme words that tend to be reused in many poems with many other words as partners. The following table shows the top ten in terms of rhyme promiscuity in the German poetry dataset.

Rhyme Part	Rhyme Partners	Occurrences
sein	14	87
ein	9	34
bei	9	36
sagen	8	19
leben	8	39
schein	8	26
mehr	8	25
nicht		8
zeit	8	36
welt	7	32

Here, I find it rather interesting that we find so many words rhyming with -ein in this short list. However, when checking the community of -ein, we can see that there is, indeed, a rather large number of words from which one can choose (including basic words like Bein "leg", Schein "shine", Stein "stone"). Additionally, there are a larger number of verbs of the form -eien that are traditionally shortened in colloquial speech (compare the node schreien "to scream").
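Rhyme promiscuity, understood as the number of distinct partners of a rhyme word, can be read directly from the edge list. In the following sketch the function name is hypothetical, and the second figure is simply the summed weight of all edges touching a word, which is an assumption and need not be identical to the "Occurrences" column above:

```python
from collections import defaultdict


def rhyme_promiscuity(edges):
    """Rank rhyme words by their number of distinct rhyme partners.

    edges -- mapping {(word_a, word_b): weight}
    Returns (word, n_partners, summed_weight) rows, most promiscuous first.
    """
    partners = defaultdict(set)
    weight = defaultdict(int)
    for (a, b), w in edges.items():
        partners[a].add(b)
        partners[b].add(a)
        weight[a] += w
        weight[b] += w
    rows = [(word, len(partners[word]), weight[word]) for word in partners]
    # Sort by partners, then by weight, then alphabetically for stable ties.
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))
```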

Concluding remarks

When I started this series on rhyme networks, I was hoping to achieve more in the six months that I had ahead. In the light of my initial hopes, the analyses I have shown here are somewhat disappointing. However, even if I could not keep the promises I made to myself, I have learned a lot during these months, and I remain optimistic that many of the still untackled problems can be solved in the near future. What today's analysis has specifically shown to me, however, is that more data will be needed, since the network produced from the small collection of 300 German poems is clearly too small to serve for a fully fledged analysis of rhymes in German poetry.

References

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Rosvall, M. and Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105.4: 1118-1123.

Data and Code

Data and code are available in the form of a GitHub Gist.

Monday, September 21, 2020

Herd immunity and the end of Covid-19

Following on from my previous posts about the SARS-CoV-2 virus, and Covid-19, the human disease that it causes, there are a number of miscellaneous topics that could also be discussed. Unfortunately, this is only a part of the post that I originally intended. I had written about some aspects of the pandemic that seem to be less well known. However, Blogger deleted the draft without warning, and this is the only part that I could recover.

Here, I talk about how the pandemic ends, as far as biology (rather than society) is concerned.
There is a lot of wishful thinking at the moment, that production of a vaccine will see the end of the pandemic, but the World Health Organization has warned that this may not be so. For example, they are apparently trying to develop a 5-year strategy for Europe, not a 5-month one. One of their officials, Hans Henri Kluge, has noted: "The end of the pandemic is the moment when we as a society learn how we can live with the pandemic."

Biologically, safety from pathogens involves what is called herd immunity. This refers to the proportion of the population who are not infectious, and thus are not spreading the pathogen (whether it is a virus, a bacterium, an apicomplexan, or a fungus). Lack of infectiousness can be achieved by:

1. being resistant to the pathogen in the first place, perhaps due to past immunological events (eg. Coronavirus: How the common cold might protect you from COVID)
2. becoming infected and then recovering, by producing antibodies or T-cells (eg. This trawler’s haul: Evidence that antibodies block the coronavirus)
3. being vaccinated, which produces the same immune response as 2., by producing protective antibodies.

Note that 2. is not necessarily dangerous for most people, as reports show that anything up to half of the people who have antibodies to SARS-CoV-2 did not report clinical symptoms, or only mild symptoms. [Note also: lack of symptoms does not mean that you are not infectious.] However, the variation in human response has clearly been huge (see From ‘brain fog’ to heart damage, COVID-19’s lingering problems alarm scientists), in many cases resulting in cytokine storms, and death.

The main risk factors are also clear — age and gender (The coronavirus is most deadly if you are older and male — new data reveal the risks), and any pre-existing medical conditions, notably obesity (Individuals with obesity and COVID‐19: a global perspective on the epidemiology and biological relationships). Furthermore, we do not yet know how long any immune protection lasts — for example, we now have people who have been infected more than once (Researchers document first case of virus reinfection), although most have kept their antibodies for at least 4 months (Fyra av fem behåller antikroppar mot nya coronaviruset).

Nor do we yet know about the success or danger of 3., because it normally takes a couple of years of clinical trials before a vaccine is approved for use, and even then we can get it badly wrong (cf. the originally undetected side-effects of thalidomide). As far as health care is concerned, responsibility for treatment of any unfortunate outcomes from immunization is not at all clear. Furthermore, those nations that spend the most on healthcare per person may not be ranked highest for health outcomes and quality of care (see: What country spends the most on healthcare?). Therefore, it is hardly surprising that many people are concerned about taking any new vaccine (A Covid-19 vaccine problem: people who are afraid to get one), and that the World Health Organization is being much more cautious than many government leaders (Most people likely won't get a coronavirus vaccine until the middle of 2021).

Nevertheless, once herd immunity is achieved in my local population, I am relatively safe, irrespective of whether I have been vaccinated or not — there will be few infectious people around me, and so I am not very likely to catch the pathogen. Personally, I could wait a while to see how the myriad new vaccines affect people, as they have been rush-produced in a way that would not normally be accepted as safe for public use (what is called the Phase 3 trial takes time). After all, there seems to be an awful lot of politics involved, especially in the USA (The 943-dimensional chess of a trustworthy Covid-19 vaccine).

Some calculations

The point here is that the development of any epidemic is an interaction between infectivity, herd immunity and infection control. Let's consider some explicit numbers to make this clear (based on: Flockimmunitet på lägre nivå kan hejda smittan).

Infectivity refers to how the pathogen spreads among the at-risk population, usually described as the basic reproduction number (R0). If each infected individual infects 2-3 others, then the R0 value is c. 2.5 (each person infects 2.5 other people, on average). This means that the epidemic must spread. If R = 1, then the number of infections stays constant rather than growing; and if R < 1, then the infection slowly dies out (it stops instantly if R = 0).

Clearly, infectivity can be reduced by any infection control measure that reduces R. Some of these were listed in the previous section. These measures can easily reduce the initial R0 by one half, meaning that the epidemic spreads much more slowly, if R = 1.25.

Herd immunity comes into this by also reducing R. For example, if herd immunity reaches 60%, then only the remaining 40% of the people are susceptible to the infection. If we combine this 40% with the initial R0 = 2.5, then R = 2.5 × 0.4 = 1, and the epidemic no longer increases. That is, we now have it under control. Moreover, if we have managed to get to R = 1.25, then a herd immunity of 20% is already enough to reach R = 1.25 × 0.8 = 1, and anything above that will cause the epidemic to decrease.
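The arithmetic behind these numbers is a single multiplication: the effective R is the initial value times the share of people who are still susceptible. A small sketch, assuming random mixing as in the text; the function names and the control_factor parameter are my own labels for illustration:

```python
def effective_r(r0, immune_share, control_factor=1.0):
    """Effective reproduction number under random mixing.

    r0             -- reproduction number without immunity or control
    immune_share   -- share of the population that is immune (0 to 1)
    control_factor -- multiplier for infection control (0.5 halves R0)
    """
    return r0 * control_factor * (1.0 - immune_share)


def herd_immunity_threshold(r):
    """Immune share at which the effective R drops to exactly 1."""
    return max(0.0, 1.0 - 1.0 / r)
```

With R0 = 2.5 the threshold is 1 − 1/2.5 = 60%; with infection control halving this to R = 1.25, the threshold is 1 − 1/1.25 = 20%. At the threshold the epidemic stops growing, and it decreases only once the immune share exceeds it.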

Bhoj Raj Singh has a good slide presentation elaborating on this topic.

These calculations interact with the concept of relative risk, of course. The calculations so far assume that infection exposure is random in society, which is obviously too simple an idea. Some people are more socially active than others, are thus likely to be more exposed, and they will then quickly achieve significant herd immunity. Others find it difficult to self-isolate because of their work or social conditions, which also increases the development of herd immunity. All of this also helps more isolated people, of course, because they are not at risk of infection from those active groups with herd immunity.

We would thus expect herd immunity to develop first in cities (eg. Experts say Stockholm is close to achieving herd immunity ; A third of people tested in Bronx have coronavirus antibodies) and in poor communities (Herd immunity may be developing in Mumbai’s poorest areas), both of which seem to be the case for SARS-CoV-2.

Equally importantly, herd immunity cannot develop if we all hide from the virus. This has happened in New Zealand, for example, which has so far successfully quarantined itself from the rest of the world — they have not successfully fought the virus, they have instead successfully hidden from it. The issue is that the populace can never come out of hiding, and can thus never let anyone come into the country, not even returning New Zealanders. As an example, Hawaii had the same isolation advantage, and then lost it, just as expected (Hawaii is no longer safe from Covid-19), as also did Australia (Coronavirus (COVID-19) current situation and case numbers).

It is a classic question: which is better, fight or flight? In a pandemic, flight cannot lead to herd immunity, which is what we need in order to "learn how we can live with the pandemic".

So, where are we now? Well, a recent poll in the USA suggests that it is an even split about whether people will actually take a vaccine if offered soon (U.S. public now divided over whether to get Covid-19 vaccine). Will 50% be enough to ensure herd immunity in that country?