The Genealogical World of Phylogenetic Networks

Rooted phylogenetic networks for coronaviruses

2020-11-15T22:45:00.000+01:00

In a previous post, Guido constructed trees for coronaviruses in the SARS group to search for evidence of recombination. He also constructed unrooted data-display networks using SplitsTree. Here, we discuss our attempts to construct rooted genealogical phylogenetic networks for the same dataset [6] but with some modifications.

In particular, we deleted some sequences, giving a smaller data set with only 12 taxa. These taxa include, next to SARS-CoV-2 (the virus causing COVID-19) and SARS-CoV (responsible for the SARS epidemic in 2002/2003), the viruses MP789 and PCoV_GX-P1E sampled from Malayan pangolins from two different Chinese provinces and several viruses found in different bat species in the horseshoe bat genus (Rhinolophus), all from China.

This research was done by Rosanne Wallin, an MSc student at VU Amsterdam and UvA. Her full thesis as well as all data and results can be found on github.

The first algorithm we applied to this data set was the TreeChild Algorithm [1], one of the methods that take a number of discordant (rooted, binary) trees as input and find a rooted network containing each input tree while minimizing the number of reticulate events in the network. To filter out some noise, we contracted some poorly-supported branches and then resolved multifurcations consistently across the trees (using a tool within the TreeChild Algorithm). This gave the network below. Note that the method is restricted to so-called tree-child networks, meaning that certain complex scenarios are excluded (where a network node only has reticulate children). Also note that this is not necessarily the only optimal tree-child network and not all topological differences can be distinguished based on the trees [5].
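To make the tree-child restriction concrete, here is a minimal sketch of a check for it. It is not part of the TreeChild Algorithm itself; the edge-list representation and the function name `is_tree_child` are assumptions made for this illustration, and the two toy networks are made up.

```python
from collections import defaultdict


def is_tree_child(edges):
    """Return True if the rooted network, given as (parent, child) edges, is tree-child.

    A reticulation is a node with more than one parent. A network is tree-child
    if every internal node has at least one child that is not a reticulation.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)

    for node, kids in list(children.items()):
        # The excluded scenario: a node whose children are all reticulations.
        if all(len(parents[kid]) > 1 for kid in kids):
            return False
    return True


# Toy network with one reticulation (h has parents a and b): tree-child.
one_reticulation = [
    ("root", "a"), ("root", "b"),
    ("a", "x"), ("a", "h"),
    ("b", "h"), ("b", "y"),
    ("h", "z"),
]

# Toy network in which node u has only reticulate children (h1 and h2): not tree-child.
only_reticulate_children = [
    ("root", "a"), ("root", "b"),
    ("a", "u"), ("a", "h1"),
    ("u", "h1"), ("u", "h2"),
    ("b", "h2"), ("b", "y"),
    ("h1", "x"), ("h2", "z"),
]

print(is_tree_child(one_reticulation))
print(is_tree_child(only_reticulate_children))
```

The check only looks at parent counts, so it says nothing about whether a network is optimal or whether it displays a given set of trees.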

Figure 1: Phylogenetic network constructed by the Tree-Child algorithm (blocks_A_len0.01_supp70).

The network shows no reticulation in the SARS-CoV-2 clade (the bottom four taxa) and puts SARS-CoV-2 right next to RaTG13. Furthermore, it shows a reticulation between an ancestor of HKU3-1 and a common ancestor of SARS-CoV-2 and RaTG13 leading to bat-SL-CoVZC45. However, it cannot exactly identify which common ancestor of SARS-CoV-2 and RaTG13 is the parent, leading to multiple branches (in red) leading into this reticulation. All these observations are consistent with previous research [2].

Importantly, we cannot directly conclude that each reticulation corresponds to a recombination event. See Table 2.1 of David’s book [10] for a nice overview of possible causes of reticulation. Nevertheless, based on [2], it does look like at least the reticulation leading to bat-SL-CoVZC45 corresponds to a recombination event.

The second algorithm we applied was TriLoNet [3], which constructs a rooted network directly from sequence data. It is restricted to so-called level-1 networks, meaning that it cannot construct overlapping cycles. This method produced the network below.

Figure 2: Phylogenetic network constructed by TriLoNet.

At first sight, the network may look a bit different from the previous one (Figure 1). However, note that the three observations above also hold for this second network. Moreover, the SARS-CoV-2 clade is identical in both networks. This network contains only one reticulation, which is most likely due to the level-1 restriction.

Nevertheless, we can still use this method to find more putative recombination events. To do so, we simply exclude the recombinant bat-SL-CoVZC45 from the analysis and rerun the algorithm. This gives the following network.

Figure 3: Phylogenetic network constructed by TriLoNet, after omitting bat-SL-CoVZC45.

We have now found a second putative recombination event with Rf1 as recombinant. Note that this is also consistent with the network in Figure 1. On the other hand, also note that the branching order in the SARS-CoV clade (the bottom 7 taxa in Figure 3) has changed a bit. This could mean that more recombination events are present in the SARS-CoV clade, as we also see in Figure 1.

One interesting follow-up question is whether the two (or more) networks produced by TriLoNet can be combined into a single higher-level network, in order to show multiple reticulations simultaneously (see [4] for an algorithm that could be useful).

Another interesting observation from these networks is that there is no sign of recombination involving the pangolin coronaviruses MP789 and PCoV_GX-P1E. It rather looks like these viruses evolved from common ancestors of SARS-CoV-2 and RaTG13, but it is important to note that we cannot exclude a recombination event on the basis of these networks. The relationship between SARS-CoV-2 and pangolin coronaviruses is still being debated in the literature [2,7,8,9].

Some limitations of the algorithms were noticed during this study. Firstly, the depicted networks are purely topological, i.e., the branch lengths do not represent anything. Adapting these algorithms to take branch length information into account could possibly improve their accuracy for this data set since the extant taxa have precise time stamps and for recent divergence events these times can be estimated quite accurately, see [2].

Another limitation is that we had to remove several taxa from the original data set [6] before the TreeChild algorithm could find a solution. By removing taxa, we reduced the number of reticulations needed to display the trees, making the TreeChild algorithm run in reasonable time. We made sure to include a diverse set of taxa (based on their pairwise distances [6]) to represent as much of the subgenus as possible.
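How exactly the diverse subset was chosen is not described here. One simple way to pick a spread-out set of taxa from a pairwise distance matrix is greedy farthest-point selection, sketched below as an assumption rather than as the procedure actually used. The function name `select_diverse`, the taxon labels and the distances are all hypothetical.

```python
def select_diverse(names, dist, k):
    """Greedily pick k taxa that are far apart from each other.

    names: list of taxon labels; dist: symmetric matrix (list of lists) in the
    same order as names. This is a heuristic, not a guaranteed optimum.
    """
    n = len(names)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of taxa")

    # Seed with the taxon that is, in total, farthest from all others.
    chosen = [max(range(n), key=lambda i: sum(dist[i]))]

    while len(chosen) < k:
        remaining = [i for i in range(n) if i not in chosen]
        # Add the taxon whose nearest already-chosen neighbour is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][j] for j in chosen))
        chosen.append(best)

    return [names[i] for i in chosen]


# Made-up distances for four placeholder taxa (A and B are close, C and D are close).
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]

print(select_diverse(toy_names, toy_dist, 3))
```

With these toy values the selection keeps one taxon from each close pair before adding a second member of either pair, which is the behaviour one would want when trimming a data set for a slow algorithm.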

Rosanne also tried several other algorithms and taxon selections, and used trees based on genes rather than on fixed-length blocks (the approach used above, following Guido’s post); see her thesis on github.

Conclusion

Although rooted phylogenetic network methods are often limited in the number of taxa that can be analysed and/or the complexity of the networks that can be constructed, we have seen that these methods can be useful for constructing hypothetical evolutionary histories. Moreover, although the constructed networks are not identical, we have seen that they share certain key properties, which are also consistent with previous research.

Rosanne Wallin, Leo van Iersel, Mark Jones, Steven Kelk and Leen Stougie

[1] Leo van Iersel, Remie Janssen, Mark Jones, Yukihiro Murakami and Norbert Zeh. A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees. arXiv:1907.08474 [cs.DM] (2019).

[2] Maciej F. Boni, Philippe Lemey, Xiaowei Jiang, Tommy Tsan-Yuk Lam, Blair W. Perry, Todd A. Castoe, Andrew Rambaut and David L. Robertson. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol 5, 1408–1417 (2020). https://doi.org/10.1038/s41564-020-0771-4

[3] James Oldman, Taoyang Wu, Leo van Iersel and Vincent Moulton. TriLoNet: Piecing together small networks to reconstruct reticulate evolutionary histories. Molecular Biology and Evolution, 33 (8): 2151-2162 (2016). http://dx.doi.org/10.1093/molbev/msw068 (postprint)

[4] Yukihiro Murakami, Leo van Iersel, Remie Janssen, Mark Jones and Vincent Moulton. Reconstructing Tree-Child Networks from Reticulate-Edge-Deleted Subnetworks. Bulletin of Mathematical Biology, 81(10):3823–3863 (2019).

[5] Fabio Pardi and Celine Scornavacca. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput Biol, 11(4), e1004135 (2015).

[6] Grimm, Guido; Morrison, David (2020): Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581.v3

[7] Lam, Tommy Tsan-Yuk, Marcus Ho-Hin Shum, Hua-Chen Zhu, Yi-Gang Tong, Xue-Bing Ni, Yun-Shi Liao, Wei Wei, et al. Identifying SARS-CoV-2 Related Coronaviruses in Malayan Pangolins. Nature, 583, 282–285 (2020). https://doi.org/10.1038/s41586-020-2169-0

[8] Wang, Hongru, Lenore Pipes, and Rasmus Nielsen. Synonymous Mutations and the Molecular Evolution of SARS-Cov-2 Origins. [Preprint] Evolutionary Biology, April 21, 2020. https://doi.org/10.1101/2020.04.20.052019

[9] Li, Xiaojun, Elena E. Giorgi, Manukumar Honnayakanahalli Marichannegowda, Brian Foley, Chuan Xiao, Xiang-Peng Kong, Yue Chen, S. Gnanakaran, Bette Korber, and Feng Gao. Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection. Science Advances, Vol. 6, no. 27 (2020). https://doi.org/10.1126/sciadv.abb9153

[10] David Morrison, Introduction to Phylogenetic Networks. RJR Productions, Uppsala, Sweden (2011). http://www.rjr-productions.org/Networks/index.html

The end of this blog?

2020-11-02T00:30:00.067+01:00

Last week's post is planned to be the final one for The Genealogical World of Phylogenetic Networks, at least for the time-being. This post is simply to say goodbye, and to say thanks to all of our readers.

As some of you may know, Blogger has decided that it is no longer interested in having contributors who work on desktop computers. They have changed their author interface to one designed for swiping and pressing on a small touch-screen, not typing and mousing while looking at a full-size screen. This new interface is almost unusable on my computing equipment — they have taken a limited but quite usable system and made it unworkable, in practice. *

Notably, the new system for automatic formatting of the posts does not match the format of our blog, and there is thus now an added onus to do everything manually, undoing the mess created by the automatic formatter, or typing in the HTML code for yourself (which is what it seems to assume you actually want to do). It won't even paste images into the right place in the text!

Even worse, Blogger (owned by Google) will no longer even allow me to log-in using my version of Chrome (owned by Google); Safari (owned by Apple) can no longer display the administration pages at all; and Firefox (owned by Mozilla) will allow me to work on posts but not allow me to comment on them.

Moreover, recently Blogger unexpectedly deleted one of my completed posts, without warning. It has not done this for a long time, and it will be the last time it does it to me — I have had enough of the new system.

So, after more than 8.5 years (the first post was on February 25, 2012), it is time for me to say goodbye. I have had six co-authors, at various times, and I thank them very much for being here. We have now written 663 posts about lots of topics, most of them related to phylogenetics in one way or another — and most of it seemed like fun at the time, although obviously a lot of work for everyone. Here is a graph with a brief timeline, as Guido's summary of the blog's history (click to enlarge).

Finally, thanks to all of you, for being readers — it would be awfully lonely here without you. By my calculations, we are currently getting c. 2,000 non-bot hits per week, which is very respectable. (NB: viewbot hits have comprised at least 20% of all traffic, first detected after c. 100 posts.) The blog is treated as a "real" scientific output, and the posts are thus indexed by sites like Altmetric (there are a couple of their example pages shown below).

The Comments facility will soon be switched off. The blog itself will remain online, for any new readers; but it will effectively be an archive, until such times as Blogger decides to delete it (whether intentionally or not), or someone has the inspiration to create a new post.

cheerio

David
david.morrison@ebc.uu.se

* My local liquor chain (the only one in the country!) has just done the same thing. Instead of having one web page with all of the relevant information for each item, there are now half-a-dozen pages, each with a small amount of information, and with VERY BIG text. Parts of the new site do not work at all on my computer; and they have taken away facilities that I actually relied upon. I find this new site just as unusable as the new Blogger.

Here are a couple of example Altmetric pages referring to our blog posts (also provided by Guido).

Just try it for your data – a last first-of-its-kind Neighbor-net using FTIR data

2020-10-26T00:30:00.324+01:00

This is likely to be my last post for this blog.

Some thoughts

When I joined the Genealogical World of Phylogenetic Networks three years ago, I didn't know how much fun it is to blog about science. Blogging, or writing essays, has several advantages over the traditional way to get a researcher's ideas out into the world: writing a scientific paper. The most important one is that one can just try out something without having to worry how this would get past the peer-reviewers and editors (or as I like to call them: the Mighty Beasts lurking in the Forest of Reviews). When I was still a (sort-of) career scientist (i.e. paid by tax-payers to do science), I had my share of discouraging experiences whenever we tried to leave the beaten (and worn out) paths to try something new; to look into the dark places and not right under the street-lights.

Before we submitted papers, we hence put a considerable effort into them, pondering what our peers may criticize, or what might alienate them (being likely unfamiliar with our methodological and philosophical approaches), and thus to minimize the chance all our work would be for nothing. In a couple of cases, where we expected fierce resistance, we opted for low-impact journals with no manuscript length restrictions and more welcoming editors and peers, to be able to put in everything that we had. Some of my best bits are buried in journals where you'd never expect them!

But it was increasingly annoying, nevertheless. It was no fun anymore to formally publish research, and so I let my career run out as smoothly in the 2010s as it had started in the Zeroes.

David's encouragement to write blog posts, just after I retired early, thus revitalized my interest in science, to "boldly go where no-one has gone before". The amount of effort is typically much lower, although some of my posts do involve the same work that I put into the papers that I co-authored. More importantly, there are no beasts in the World Wide Web that can bite you from the shadows; they have to do it in the open. It's an ideal way to get an idea out, without having to think about the consequences. None of the work I put into a post has been in vain. What a difference: before, for every graph / analysis result published, two ended in the bin, many devoured by the Mighty Beasts.

And, maybe somebody will find the work interesting enough to try it out; and eventually my idea finds a place in the sanctioned, peer-reviewed scientific world, anyway. Since I'm out-of-business, I can afford to not cash in the credit (no-one formally cites a blog post).

My last Neighbor-net for the Genealogical World

For Neighbor-nets (NNets) and me, it was love at first sight (this was, in my case, ~2005, when my boss Vera Hemleben, a geneticist, sent me over to the new professor in our bioinformatics department, named Daniel Huson, who had just released a new software package, SplitsTree). These networks are...

... most versatile: any kind of data can be transformed into a distance matrix;
... quick-and-easy to infer.

And even if they are not phylogenetic networks in the strict sense – NNets are unrooted and their edge-bundles do not necessarily reflect evolutionary pathways – they more often than not point towards common origins and down-scale ± complex phylogenetic relationships more comprehensively than any phylogenetic tree (coalescent or not) that we could infer. The Genealogical World is full of examples, and the writers of this blog such as David, Mattis, myself, and like-minded researchers have published quite a few of them (in high- and low-impact journals). For a comprehensive, permanently updated list see Philippe Gambette's Who's Who in Phylogenetic Networks page.

For my final post, I decided on a fascinating new data source in paleobiology: Fourier transformed infrared spectra (FTIR) of fossil cuticles.

The cuticle is a plant's skin, and its composition and structure show a lot of variation, down to species level. Thus, the morphological-anatomical features of cuticles have long been used as taxonomic markers to identify fossil material. Using infrared spectroscopy, one can look at the chemical composition of cuticles. Like any other spectrum, an FTIR-spectrum can be broken down into sets of qualitative (discrete, binned) or quantitative (continuous) characters; and one can then create a dissimilarity matrix for the investigated material. This is what Vajda, Pucetaite et al. (Nature Ecol. Evol. 1: 1093–1099, 2017) did for long-dead (Mesozoic) but enigmatic seed plants and their equally enigmatic modern counterparts.
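As a rough illustration of the step from spectra to a dissimilarity matrix, here is a minimal sketch. It is not the procedure of Vajda et al.; the equal-width binning, the Euclidean distance and the function names `bin_spectrum` and `distance_matrix` are assumptions chosen for simplicity, and the sample values are made up.

```python
import math


def bin_spectrum(absorbances, n_bins):
    """Average consecutive absorbance values into n_bins equal-width bins.

    Leftover values at the end that do not fill a whole bin are dropped.
    """
    size = len(absorbances) // n_bins
    if size == 0:
        raise ValueError("more bins requested than data points available")
    return [
        sum(absorbances[b * size:(b + 1) * size]) / size
        for b in range(n_bins)
    ]


def distance_matrix(spectra):
    """Pairwise Euclidean distances between binned spectra.

    spectra: dict mapping a sample label to a list of binned values
    (all lists must have the same length). Returns (labels, matrix).
    """
    labels = sorted(spectra)
    matrix = [
        [math.dist(spectra[a], spectra[b]) for b in labels]
        for a in labels
    ]
    return labels, matrix


# Made-up absorbance readings for two hypothetical samples.
toy_spectra = {
    "sample_1": bin_spectrum([0.0, 0.0, 0.0, 0.0], 2),
    "sample_2": bin_spectrum([2.0, 4.0, 3.0, 5.0], 2),
}
labels, matrix = distance_matrix(toy_spectra)
print(labels)
print(matrix)
```

A matrix of this kind, exported in a format the software accepts, is all a Neighbor-net needs as input, which is why almost any sort of data can be tried.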

A UPGMA dendrogram based on FTIR data of fossil taxa (Vajda et al. 2017, fig. 4). Brackets to the right give the topology of the UPGMA dendrogram including extant material and data (Vajda et al. 2017, fig. 3).

PCA plots of the first and second (a), and first and third (b) coordinates, with the main seed plant lineages indicated (modified after Vajda et al. 2017, suppl.-fig. 4)

PCA and UPGMA are not phylogenetic inference methods, but there is obviously some phylogenetic signal encoded in these FTIR spectra, as shown above.

When I first saw the paper, I contacted the authors (including former colleagues of mine at Naturhistoriska riksmuseet in Stockholm), and the first author gave the second author, Milda Pucetaite (a Ph.D. student), a green light to share and convert her FTIR data into a simple distance matrix for me to run a NNet, as shown below.

Neighbor-net based on the combined distance matrix provided by Milda (pers. comm. July 2017).

Note that this NNet is a partly impossible graph, phylogenetically. The chemical composition naturally changes after the foliage (in this case) gets buried in sediment, and its cuticle is then conserved for millions of years by various taphonomic and diagenetic processes. As pointed out by the experienced biochemist among the authors during our correspondence: it is hence pointless to combine the data from extant and extinct taxa.

Well, since this is a post and not a paper, I combined them anyway. I find the result quite compelling, supporting the paper's conclusions including more speculative follow-up ones. The NNet reflects every aspect that this kind of data can provide for phylogenetic and systematic purposes.

The prominent central edge bundle reflects the taphonomic-diagenetic change separating the living from fossil samples. The basic sequence within the subgraphs is the same: ginkgoes are closest to cycads, and cycads bridge to Araucariaceae, which is a relict lineage of the "needle" trees, the conifers (many of which don't have needles but leaves). Bennettitales and Nilssoniales are extinct groups of seed plants, which are here resolved as a distinct lineage. The Bennettitales, especially, have long puzzled scientists. They may represent a third major lineage of seed plants that are neither angiosperms (flowering plants) nor gymnosperms (ginkgoes, cycads, conifers, gnetids), or perhaps an early side lineage of either one (or lineages, as their two main groups are quite different).

As for pretty much any kind of data, just try it out for yourself. This is exploratory data analysis (EDA), particularly useful to get a first, fast impression of the primary signal in your data. This is true even if you keep it to yourself, having to watch out for the Mighty Beasts of the Forest of Reviews (especially the ones that call themselves "cladists"), who are quick to tell you what you can't do, but not so forthcoming when it comes to pointing you to other options for analyzing your data.

My dive-in list for some more (im-)possible NNets

"Man gave name to all those animals": cats and dogs — a joint post with Mattis, where we just mapped the names on a NNet of world languages. Cormac Anderson joined for the second part dealing with goats and sheep.
Should we try to infer trees on tree-unlikely matrices — my first post for the Genealogical World of Phylogenetic Networks about a NNet that, at the time I made it, was impossible to publish.
Stacking networks based on sign language manual alphabets — NNets based on hand-shapes historically used for sign letters.
To boldy go where no one has gone before – networks of moons — a joint post with Timothy Holt, who scored celestial bodies of our stellar system for classification.
Visualizing U.S. gun laws — inferring NNets based on serious issues can be fun. Related post: The 2nd Amendment does more than keep King George away.

With David retiring, the Genealogical World of Phylogenetic Networks will fall dormant; the next and final post will be a farewell from David. Like Mattis (Von Wörtern und Bäumen), I will keep on science-blogging (in spite of the new buggy Blogger-editing interface forcing me to draft directly in HTML) for a little while (and irregularly) on my Res.I.P. blog, which also includes a tag for "phylo-networks" for any future NNets and the like.

Xenoplasy

2020-10-19T00:30:00.553+02:00

A major obstacle in studying morphological evolution is homoplasy. This occurs when the same (or similar) traits evolve independently in different lineages (convergences), and are positively selected for or incompletely sorted within a lineage (parallelisms, homoiologies). Traits that do not sort following the true tree create incompatible signal patterns, and, eventually, topological ambiguity. No matter which inference method we use, we end up with several alternative trees that combine aspects of the true tree with artificial branching patterns.

Homoplasy is the rule, while trait sorting is the exception. Consequently, we have to expect that any morphology-based tree will have more wrong branches than correct ones.

For extant groups of organisms, a simple solution to the problem is to analyse morphological traits in the framework of a molecular phylogeny. The genetic data provides us with an independent, best-possible tree. By mapping the morphological traits on this tree, we can evaluate their potency as phylogenetic markers.

But what if our group of organisms is not the product of a simple repeated dichotomous splitting pattern? What if there were anastomoses as well? That is, the morphological traits are not the product of mere (incomplete) lineage or incomplete gene sorting (the latter is called "hemiplasy") but of the fusion of traits from different lineages. What if, thus, a tree is not enough to explain the genetic data? What does this imply for the morphological differentiation we observe?

Take the London Plane (Platanus x acerifolia or P. x hispanica), for example, which is a tree that many of us are familiar with. In case you don't know the name: they are the large trees with a patterned bark and deeply lobed leaves and fluffy fruiting bodies found in abundance in parks and alleys throughout the world. It's a cultivation-hybrid (17th—18th century) of the North American plane tree, Platanus occidentalis, and its distant eastern Mediterranean relative, P. orientalis. These are genetically and morphologically distinct species. Their history is summarized in the following doodle (Grimm & Denk 2010).

Each line represents a semi-sorted nuclear gene region. The split between proto-PNA-E (SW. U.S., NW. Mexico, E. Mediterranean) and proto-ANA (Atlantic-facing Central America, E. U.S.) must have been > 12 myrs ago (last Platanus of Iceland). The minimum air distance between the sister species P. orientalis and P. racemosa is ~11,500 km (via the Arctic). Interestingly, fossils from that time and later (including western Eurasia) have more ANA-clade morphologies: P. orientalis- and P. racemosa-types pop up ~5 Ma. Both the ANA and PNA-E clades have distinct morphologies. With respect to the individual gene trees, those exclusively shared by P. palmeri and P. rzedowskii with P. occidentalis s.str. and P. mexicana of the ANA clade could be addressed as "hemiplasies".

If you look at the leaves and fruits of London Planes, you can find everything in between the two endpoints; and the same holds for their genetics. The London Plane is much hardier than Europe's own P. orientalis and more drought-resistant than its hardier North American parent. With climate change going on, the hybrid will eventually meld with the European species entirely. And, thanks to what we call "hybrid vigor", given a few million years, it might consume its other parent, too. London Planes have been re-introduced into the Americas; and P. orientalis has become an invasive species in California, where it has started to hybridize with its local sister species P. racemosa. Now imagine a future researcher of Platanus evolution having to deal with a highly complex accumulation of Platanus fossils in the Northern Hemisphere, while being able to study only the left-over complex genetics of a single species that replaced two.

This is where a recently coined new concept comes in: xenoplasy.

Yaxuan Wang, Zhen Cao, Huw A. Ogilvie, Luay Nakhleh (2020). Phylogenomic assessment of the role of hybridization and introgression in trait evolution. bioRxiv doi: 10.1101/2020.09.16.300343

Xenoplasies are traits that originate from hybridization and subsequent introgression. In standard phylogenetics, they would act like any homoplasious character, but their distinction is that they are not independently evolved. They are captured via lineage crossing, and reflect a common ancestry.

Example of a trait incongruent with the species tree, representing a xenoplasy obtained by introgression from the I1-A lineage, which evolved the trait, into the I3-B lineage, part of the I2 clade. Depending on how far they are affected by incomplete lineage sorting (ILS) and introgression, individual genes may result in any of the three possible genealogies. Modified after Wang et al. (2020), fig. 1.

As such, their phylogenetic weight (information content) equals that of the anyhow rare classic autapomorphies or synapomorphies (fide Hennig), and this weight is higher than that of the more common homoiologies, shared apomorphies or symplesiomorphies. Note, in the palaeozoological cladistic literature, sorted versions of the latter three are often called synapomorphies – any lineage-specific, derived trait ("synapomorphy") may be lost / modified in some sublineage(s), or rarely pop-up outside the lineage.

Wang et al. provide an analytical framework for identifying a trait as xenoplasy, and assessing the probability for it ("xenoplasy risk factor"). If you're interested in the mechanics, check out the pre-print. The mathematical part of my brain has been dormant for most of the last two decades (when I exchanged chemistry for geology-biology), so I'm more into possible applications to explore this new concept.

Where to look next

The Wang et al. real-world example (Jaltomata) is, however, not very appealing. The problem is that, to look for xenoplasy, we need data that requires us to infer an explicit phylogenetic network (in the strict sense) to start with. In addition, we could use a morphological partition (scored morphological traits), which is usually absent. Last, identifying xenoplasies would make most sense for traits that can be traced in the fossil record, not only to identify potential products of past reticulation but also to get a better grip on placing critical fossils. Often overlooked by neontologists, fossils are the only physical proof that a lineage was at a certain place at a certain point in time. So, here are two examples: beeches and bears.

Beeches are a small genus of extra-tropical angiosperm trees with a pretty well understood fossil record. Morphologically, their differentiation is very hard to put into a tree, as shown here.

A morphology-based Neighbor-net of fossil (open circles) and extant beech (closed circles) taxa. Coloration gives the (paleo-)geographic distribution (abbreviated as three letters). For more background and information see my Res.I.P. post: The challenging and puzzling ordinary beech – a (hi)story

Mapping species-discriminating traits on a tree would be of little help here, because the modern species are the product of recurrent phases of mixing and incomplete sorting. I have summarized this in the following doodle, depicting the diversification and propagation of 5S-IGS variants (a non-transcribed, poly-copy, multi-array intergenic nuclear spacer) in a still very small sample.

A doodle summarizing differentiation patterns in a sample of 686 "representative" 5S-IGS variants obtained using high-throughput sequencing of six beech populations of western Eurasia and Japan (Simone Cardoni et al., to be submitted in the near future; see Piredda et al. 2020 for a similar analytical set-up).

The people involved in researching this project (drawn by passion rather than resources) don't have the resources to generate the NGS data needed to construct a species network for all of the species of beech, like Wang et al.'s Jaltomata data. But given that there are only 9–10 species, it would be easy prey for a well-funded research group. If you are interested, but don't know how to get the material and are unfamiliar with beeches, feel free to contact the senior author of Piredda et al. 2020, Marco Simeone — new beech-enthusiasts are always welcomed by this group.

Bears are one of the best-studied extant mammal predators, and they also have a decent fossil record. This is probably the reason that Heath et al. (2014) used bears as the case study when introducing their new molecular dating approach: the fossilized-birth-death dating.

A fossilized birth-death dated tree of bears (modified from Heath et al. 2014, fig. 4). The numbers in brackets give the number of fossil taxa (extinct genera, Ursus spp.) listed on Wikipedia.

As nice as it looks (and as well as it was done), their analysis is pretty flawed from an evolutionary point of view. Their dated tree only reflects a single aspect of bear evolution and may involve branch-length artifacts. Heath et al. relied on complete mitochondrial genomes, which they combined with a single nuclear protein-coding gene. Mitochondrial genes reflect only the maternal lineage; they did not date a species tree but a mitochondrial genealogy. Paternally and biparentally inherited gene markers (which include nuclear genes) tell very different stories about species relationships (this is why we also used the bears as example data for Schliep et al. 2017).

Strict, branch-length ignorant Consensus network of three trees inferred using species-consensus sequences generated from three sets of data: biparentally inherited nuclear-encoded autosomal introns (ncAI), paternally inherited Y-chromosomes (YCh) and maternally inherited mitochondrial genes (complete set; mtG). This is clearly not the product of a strictly dichotomous evolution. Thick lines: edges found in Heath et al.'s chronogram (= mitochondrial genealogy).
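Consensus networks are, in general, built from the splits (bipartitions of the taxa) found in the input trees; two splits that cannot sit on the same tree show up as a box. The following minimal sketch illustrates only that idea. It is not the analysis behind the figure; the nested-tuple tree format, the function names and the four placeholder taxa are assumptions for illustration, and no frequency threshold is modelled.

```python
def leaves(tree):
    """Set of leaf names below a node; a tree is a leaf name or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial splits of a tree, each stored as an unordered pair of taxon sets."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        # Skip trivial splits that separate fewer than two taxa from the rest.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def compatible(split_1, split_2):
    """Two splits fit on one tree if at least one pair of their sides is disjoint."""
    return any(not (a & b) for a in split_1 for b in split_2)


def consensus_splits(trees):
    """Union of all non-trivial splits found in the input trees."""
    return set().union(*(splits(tree) for tree in trees))


# Two conflicting toy trees on placeholder taxa a, b, c, d.
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))

all_splits = list(consensus_splits([tree_1, tree_2]))
print(len(all_splits))
print(compatible(all_splits[0], all_splits[1]))
```

Here the two toy trees contribute one split each, and the two splits are incompatible, which is the situation a network displays as a box and a single tree cannot display at all.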

And while it may be that morphology reflects the maternal more than the paternal side, this has never been tested. Neither has how morphology fits with the coalescent species tree, which would be a network, as shown below.

Gene flow in bears within the last 5 myrs (estimate; from Kumar et al. 2017).

How Heath et al. linked the fossils to clades might have been just as wrong as it was right (note that FBD dating is much less biased by mis- or suboptimally placed fossils than traditional node dating). Hemi- and xenoplasy must be considered here. In addition to the highly incongruent paternal and maternal genealogies, we know that even the morphologically most distinct sister species (grizzlies, a special form of Brown Bear, and polar bears) can produce viable offspring ("Grolar") with morphological traits from either side of the family (usually, the Grizzly-side dominates).

Wildlife services usually kill these hybrids as they are considered to speed up the decline of polar bears (they are food competitors). However, with the (possibly inevitable) melting of the polar caps, these hybrids could be instrumental in the survival of a bit of Polar Bear legacy, in the form of genetic diversity not found in brown bears, and xenoplasies. If two highly distinct bear species hybridize today in the wild due to (in this case: human-induced) environmental pressure, their ancestors probably have done so in the past in reaction to shifting habitats and migration patterns.

Given how long bears have intrigued researchers, there are plenty of classic morphological studies involving fossils; and, in the light of the vast amount of molecular data (including ancient DNA!) that have been collected for bears, it should be pretty easy to apply Wang et al.'s new approach to bears. For example, is the Cave Bear a dead-end side lineage, an introgressed lineage, or a hybrid dead end? In terms of mitochondria, Cave bears are placed as sister to Brown and Polar bears, but that's just because of their provenance. Like chloroplast genealogies in plants, mitochondrial genealogies in animals typically show a strong geographic correlation. Especially in bears, the mothers and daughters don't migrate as much as the fathers and sons.

Mitochondrial genealogy of bears including Cave bears (Kumar et al. 2017, fig. 3), the famous European bears of the Ice Ages. ABC bears are insular brown bears living on the subarctic Admiralty, Baranof and Chichagof islands of the Alexander archipelago, known as a natural example of gene flow between Brown and Polar bears (Kumar et al. 2017, fig. 1, provides a map of the current distribution of bears).

Postscriptum

Birds are another animal group that likes to diversify into many species, some of which love to transgress recently established species barriers, forming hybrid swarms. These are actually dinosaurs, a group exclusively studied using cladistic analyses of morphological traits providing non-tree-like signals: mostly homoplasies, a lot of not-really-synapomorphies (a good deal are probably homoiologies), and, it wouldn't surprise me, one or another xenoplasy. Or can we assume they were much too advanced to hybridize and introgress?

Cited literature

Grimm GW, Denk T. 2010. The reticulate origin of modern plane trees (Platanus, Platanaceae) - a nuclear marker puzzle. Taxon 59:134–147.
Heath TA, Huelsenbeck JP, Stadler T. 2014. The fossilized birth–death process for coherent calibration of divergence-time estimates. PNAS 111:E2957–E2966.
Kumar V, Lammers F, Bidon T, Pfenninger M, Kolter L, Nilsson MA, Janke A. 2017. The evolutionary history of bears is characterized by gene flow across species. Scientific Reports 7:46487 [e-pub].
Piredda R, Grimm GW, Schulze E-D, Denk T, Simeone MC. 2020. High-throughput sequencing of 5S-IGS in oaks: Exploring intragenomic variation and algorithms to recognize target species in pure and mixed samples. Molecular Ecology Resources doi:10.1111/1755-0998.13264.
Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution 8:1212–1220.

Tattoo Monday XXI

2020-10-12T00:30:00.002+02:00

There are a number of tattoo designs that incorporate the concept of a Tree of Life with the concept of DNA. A selection of these was included in the previous post, Tattoo Monday XX. Here are a few more.

Rogue dinosaurs, an example from the Aetosauria

2020-10-05T00:30:00.435+02:00

In several earlier posts (a non-comprehensive link list can be found at the end of the post), I outlined how networks, whether based on tree samples (Consensus networks, SuperNetworks) or on distances (Neighbor-nets), may be of practical help, especially when we study phylogenetic relationships of extinct organisms.

In this post, I will further explore this by looking at a matrix for Aetosauria (Parker 2016, PeerJ) that provides an overall (relatively) strong and unambiguous signal. [NB: The reason I prefer to use PeerJ papers as examples is that it is one of the very few journals that is open access and has a strict open data policy: to publish there, authors have to give access to the data used.]

In the abstract of the original paper, we read the following:

Nonetheless, aetosaur phylogenetic relationships are still poorly understood, owing to an overreliance on osteoderm characters, which are often poorly constructed and suspected to be highly homoplastic. A new phylogenetic analysis of the Aetosauria, comprising 27 taxa and 83 characters, includes more than 40 new characters that focus on better sampling the cranial and endoskeletal regions, and represents the most comprehensive phylogeny of the clade to date. Parsimony analysis recovered three most parsimonious trees; the strict consensus of these trees finds an Aetosauria that is divided into two main clades: Desmatosuchia, which includes the Desmatosuchinae and the Stagonolepidinae, and Aetosaurinae, which includes the Typothoracinae.

Parker's (2016) fig. 6 shows the results of the "initial analysis" (colored annotations added by me).

Systematic groups based on clades are abbreviated (see next graph for full names).

A is a "Strict component consensus" of the 30 inferred MPTs (most parsimonious trees), B the Adams consensus, and C the Majority rule consensus; branch labels give percentages for branches not found in all MPTs. D is a "Maximum agreement subtree" after a priori pruning of one taxon (black star) within the upper clade.

Parker's (2016) fig. 7 then shows the preferred result: a "reduced strict consensus of 3 MPTs" with the red star taxon removed, and (rarely seen in dinosaur phylogeny papers) branch-support — including Bootstrap support values below 70, which are very rarely reported in the literature (from my own experience it seems that editors of systematic biology journals don't like them).

Removal of one rogue taxon (called a "wildcard" in paleozoology), Aetobarbakinoides brasiliensis, substantially reduced the number of MPTs. Nonetheless, many branches have low support, and hence also the clades (used here as synonym for monophyla) derived from them – Parker uses branch-based ("stem"-based, brackets on his tree), and node-based taxa (dots).

Low branch support may or may not matter

There are two possible reasons for low branch-support:

non-discriminatory signal: any alternative branching pattern receives diminishing support
internal signal conflict: two (or more) alternatives receive similar support.

Mapping the support on the preferred (inferred) optimal tree cannot tell us whether it's the one or the other — only Support consensus networks can visualize this. Since we are interested in the rogue, I re-ran the parsimony BS analysis (10,000 quick-and-dirty replicates, following Müller 2005, BMC Evol. Biol. 5:58) including Aetobarbakinoides brasiliensis.

Support consensus network based on 10,000 parsimony BS pseudoreplicates. Trivial splits are collapsed; only splits that occurred in at least 20% of the BS replicates are shown.
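The filtering described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the network software: each pseudoreplicate tree is assumed to be available as a list of clades (sets of taxon names), and the function names `canonical_split` and `split_support` as well as the toy taxa are hypothetical.

```python
from collections import Counter


def canonical_split(clade, taxa):
    """Represent a bipartition by the side that lacks the reference taxon."""
    clade = frozenset(clade)
    reference = min(taxa)  # arbitrary but fixed choice
    return frozenset(taxa) - clade if reference in clade else clade


def split_support(trees, taxa, min_freq=0.2):
    """Return {split: frequency} for non-trivial splits found in >= min_freq of the trees."""
    counts = Counter()
    for clades in trees:
        splits = {canonical_split(clade, taxa) for clade in clades}
        # Trivial splits (a single taxon versus the rest) are collapsed.
        counts.update(s for s in splits if 2 <= len(s) <= len(taxa) - 2)
    return {s: n / len(trees) for s, n in counts.items() if n / len(trees) >= min_freq}


taxa = {"A", "B", "C", "D", "E"}
# Ten toy "pseudoreplicate" trees: two competing placements plus a rare third one.
trees = (
    [[{"A", "B"}, {"D", "E"}]] * 5
    + [[{"A", "C"}, {"D", "E"}]] * 4
    + [[{"B", "C"}, {"D", "E"}]] * 1
)

for split, freq in sorted(split_support(trees, taxa).items(), key=lambda kv: -kv[1]):
    print(sorted(split), freq)
```

In this toy sample, the two competing splits survive the 20% cut-off with similar frequencies (internal conflict), while the rare third alternative is dropped. With non-discriminatory signal, one would instead expect many alternatives, each with low frequency.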

The decreased/low BS support within the most terminal (root-distant) subtrees, the Des'ini and Par'ini, relates to conflicting alternatives involving one or two OTUs. In the case of Des'ini, it is the affinity of Lucasuchus and NCSM 21723, while in the case of Par'ini an alternative (recognizing Tecovasuchus as sister to the remainder) is found in 1 out of three BS pseudoreplicate trees. The diminishing support for basal relationships (root-proximal branches/edges) is due to the general lack of discriminatory signal (BS any alternative < 25). However, there are very few situations in which the best-supported alternative differs much from that in the preferred tree. For instance, any alternative to a Stag'inae sister relationship has even less than BS = 24 (BS = 27 in Parker's "reduced" tree).

Our rogue, however, is not really a 'wildcard'. The scored characters simply put it much closer to the outgroup than is any other ingroup taxon. A simple explanation could be that it is a most primitive (least derived) member of the Aetosauria. Another possibility is that it lacks any critical trait needed to place it within the ingroup. Since the deep splits within the Aetosauria rely on very few character changes, we can put it in different positions down here and the tree will still have the same number of inferred changes.

Trivial and non-trivial taxa

The cladograms typically shown provide limited information about the signal in the underlying matrix, its strength and weaknesses, even when not "naked" but annotated using branch-support values. Given that there are no severe overlap gaps in the data, a very quick alternative is the Neighbor-net (a necessary addition, in my opinion).
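To make the remark about overlap gaps concrete, here is a minimal Python sketch of a pairwise distance computation on a morphological matrix. It assumes a simple mean character difference (the share of differing states among characters scored in both taxa), which is not necessarily the distance used for the network in this post; the toy matrix and the function names are hypothetical.

```python
from itertools import combinations

MISSING = {"?", "-"}  # assumed codes for missing or inapplicable states


def mean_character_difference(a, b):
    """Share of differing states among characters scored in both taxa (None if no overlap)."""
    shared = [(x, y) for x, y in zip(a, b) if x not in MISSING and y not in MISSING]
    if not shared:
        return None  # overlap gap: the distance is undefined
    return sum(x != y for x, y in shared) / len(shared)


def distance_matrix(matrix):
    """Pairwise distances for all taxon pairs, keyed by alphabetically sorted name pairs."""
    return {
        (s, t): mean_character_difference(matrix[s], matrix[t])
        for s, t in combinations(sorted(matrix), 2)
    }


# Hypothetical toy matrix: one row of character states per taxon.
matrix = {
    "outgroup": "0000??",
    "taxonA": "0011?1",
    "taxonB": "1011?1",
    "fragment": "????10",
}

for pair, dist in distance_matrix(matrix).items():
    print(pair, dist)
```

A pair such as the poorly preserved "fragment" and the "outgroup" shares no scored character, so no distance can be computed for it. Too many such pairs would make a distance-based network unreliable.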

Bold edges correspond to branches (hence: clades) in Parker's preferred tree.

Using this, we can directly depict which groups, potential clades, draw substantial (partly trivial) character support.

For instance, according to Parker's tree and following cladistic classification, Stagonolepis is an invalid taxon: one species (St. robertsoni) is part of the Stag'inae clade, the other (St. olenkae) is part of the Des'inae clade. Character support is, however, nearly non-existent (Bremer value = 1 and BS = 7 in the original analysis; BS ≤ 20 for any competing alternative in our re-analysis). The distance network shows us why. Indeed, both species are closest to each other; but, while St. robertsoni shares a critical Stag'inae character suite and, consequently, shows the highest similarity to Polesinesuchus, St. olenkae does not share this (note the lack of a corresponding neighborhood). Furthermore, any alternative placement fits even less. Parker's tree only resolved it as sister to all other Des'inae because it didn't fit into any of the well-supported, terminal clades (prominent edge-bundles).

We can also see where we may have to deal with internal signal conflict, and how this may affect the tree inference and lead to ambiguous branch support. Take, for instance, the NCSM 21723 individual (= Gorgetosuchus pekinensis). It's clearly a Des'inae. The reason we have ambiguous branch support for this staircase-like subtree is that NCSM 21723 is substantially more similar to the distant, equally evolved sister lineage, the Par'ini (purple edge bundle). Hence, it must be placed as sister to all other Des'inae, although it appears to represent a more derived form than Longosuchus, representing the next step towards the most-derived crown-taxon Desmatosuchus. Tecovasuchus is the source of topological conflict within the Par'ini because it is the least-derived taxon. Its primitiveness will be expressed by placing it as sister to all other Par'ini, while few shared, non-exclusive apomorphies are behind its position in the preferred tree (Bremer value = 1, BS = 48 in Parker's fig. 7).

While it is obvious that the matrix has no clear tree-like signal for resolving any OTU that is not part of the terminal Des'ini and Typ'inae lineages, our 'wildcard' (Aetobarbakinoides) is particularly close to the outgroup while showing no affinity to anything else. If it is part of the ingroup, it represents the ancestral form, ie. shows a character suite that is primitive (derived traits may be missing because they are simply not preserved: see description of the taxon in Parker 2016). This is the reason why it acted rogue-ish in tree inferences even though its favored phylogenetic position is clear.

Data

Parker's original matrix can be found in the supplement to the paper. An annotated ready-to-use NEXUS-formatted version (including my standard codelines for parsimony and distance bootstrapping) and the inference results used here can be found in this figshare submission, which I generated for a technical Q&A.

Here is the promised list of previous posts dealing with fossils and networks.

Should we try to infer trees on tree-unlikely matrices? July 2017; the signal phylogenetic matrices of major groups of extinct and extant seed plants.
More non-treelike data forced into trees: a glimpse into the dinosaurs, Aug. 2017; why also paleozoologists should start with network-based EDA — exploratory data analysis.
Networks, not trees, identify "weak spots" in phylogenetic trees, Oct. 2017; how Consensus networks can be used to visualize topological conflict among MPTs.
Summarizing non-trivial Bayes tree samples for dating? Just use support consensus networks, Jan. 2018; Bayesian Consensus networks based on mixed data matrices.
The curious case(s) of tree-like matrices with no synapomorphies, joint post with David, Apr. 2019; looking at CI, RI values and treelikeness.
Networks for matrices used in Cladistics studies, part 1 (historical matrices), part 2 (recent matrices), Nov. 2018; a collection of networks inferred from matrices used to infer parsimony trees.
Phylogenetic ambiguity: data gaps, indifference and internal conflict, Jan. 2019; an example (squids) for why consensus networks should be obligatory when facing ambiguous branch support.
Why the emperor has no clothes on – a thicket of trees, Nov. 2019; gene tree incongruence in plant plastomes and why it probably has little to do with decoupled gene histories.
Large morphomatrices – trivial signal, Feb. 2020; about the principal signal in a bird-dinosaur supermatrix.
Supernetworks and gene tree incongruence, May 2020; about mtDNA and splits in early land plants
Fossils and Networks 2: Deleting (and adding) a tip, Aug. 2020; studying the effect of removing a single taxon from the tree inference using the best-sampled taxa of the bird-dinosaur supermatrix.

Analyzing rhyme networks (From rhymes to networks 6)

2020-09-28T00:30:00.003+02:00

For this final post of my little series on rhyme networks, I set myself the ambitious goal of providing concrete examples of how rhyme networks for languages other than Chinese can be analyzed. Unfortunately, I have to admit that this goal turned out to be a bit too ambitious. Although I managed to create a first corpus of annotated German rhymes, I am still not entirely sure how to construct rhyme networks from this corpus. Even if this problem is solved pragmatically, I realized that the question of how to analyze the rhyme network data is far less straightforward than I originally thought.

I will nevertheless try to end this series by providing a detailed description of how a preliminary rhyme network of the German poetry collection can be analyzed. Since these initial ideas for analysis still have a rather preliminary nature, I hope that they can be sufficiently enhanced in the nearer future.

Constructing directed rhyme networks

I mentioned in last month's post that it is not ideal to count, as rhyming with each other, all words that are assigned to the same rhyme cluster in a given stanza of a given poem, since this means that one has to normalize the weights of the edges when constructing the rhyme network afterwards (List 2016). I also mentioned the personal communication with Aison Bu, who shared the idea of counting only those rhymes that are somehow close to each other in a stanza.

During this month, I finally found time to think about how to account for this idea in practice, and I came up with a procedure that essentially yields a directed network. In this procedure, we first extract all of the rhyme words in a given stanza in the order of their appearance. We then proceed from the first rhyme word and iterate over the rest of the rhyme words until we find a match. Having found a match, we interrupt the loop and add a directed edge to our rhyme network, which goes from the first rhyme word to its first match. We then delete the first rhyme word from the list and proceed again.

This procedure yields a directed, weighted rhyme network. At first sight, one may not see any specific advantages in the directionality of the network, but in my opinion it does not necessarily hurt; and it is straightforward to convert the network into an undirected one by simply ignoring the directions of the edges and collapsing those which go in two directions in a given pair of rhyme words.
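The procedure can be sketched in Python as follows. This is a minimal illustration rather than the code from the Gist: it assumes that each stanza is given as a list of (word, rhyme label) pairs in order of appearance, that edge weights simply count how often a directed pair is observed, and the function names and toy stanzas are hypothetical.

```python
from collections import Counter


def stanza_edges(rhyme_words):
    """Yield (source, target) for each rhyme word and its first later match in the stanza."""
    for i, (word, label) in enumerate(rhyme_words):
        for later_word, later_label in rhyme_words[i + 1:]:
            if later_label == label:
                yield (word, later_word)
                break  # only the first match counts; then move on to the next rhyme word


def build_network(stanzas):
    """Directed, weighted network: edge weight = number of times the pair was observed."""
    edges = Counter()
    for stanza in stanzas:
        edges.update(stanza_edges(stanza))
    return edges


def to_undirected(edges):
    """Ignore edge directions and collapse edges that run both ways between two words."""
    undirected = Counter()
    for (source, target), weight in edges.items():
        undirected[tuple(sorted((source, target)))] += weight
    return undirected


# Two toy stanzas with hypothetical rhyme labels "a" and "b".
stanzas = [
    [("haus", "a"), ("leben", "b"), ("aus", "a"), ("geben", "b")],
    [("aus", "a"), ("haus", "a"), ("maus", "a")],
]
directed = build_network(stanzas)
undirected = to_undirected(directed)
print(directed)
print(undirected)
```

Because each rhyme word is linked only to its first later match, a stanza with three words in the same rhyme group contributes two edges rather than three, which is what avoids the need for normalizing edge weights afterwards.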

Handling complex rhymes

In last month's blog post, I also mentioned the problem of handling rhymes that stretch across more than one word. While these are properly annotated (in my opinion), I had problems handling them in the rhyme network I presented last week. We find similar problems when working with certain rhymes involving words with more than one syllable. As an example, consider the following words which are all taken from the song Cruisen, and which I further represent in syllabified form in phonetic transcription.

Rhyme Words	Stressed Syllable	Unstressed Syllable
Tube	tuː	bə
Bude	buː	də
Gurke	guɐ	kə
hupe	huː	pə
Kurve	kuɐ	və
Schurke	ʃuɐ	kə
Punkte	puŋ	tə

These words do not rhyme according to traditional poetry rules (where unstressed syllables following stressed syllables need to be identical), but they do reflect a common rhyme tendency in German Hip Hop, where rhyme practice has been evolving lately. In order to properly account for this, I assigned both the first and the second syllable of the words to their own rhyme group (one stressed syllable rhyme and one unstressed syllable rhyme).

When constructing the rhyme network, however, the separation into two rhyme groups turned out to not make much sense any longer, since the rhymes occur on a sub-morphemic level, where the parts do not themselves express a meaning anymore. To cope with this, I modified the network code slightly by treating only those words as rhyming with each other which show identical rhyme groups in all of their syllables.
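A minimal Python sketch of this matching rule is given below. It assumes that every rhyme word carries one rhyme-group label per annotated syllable; the labels, the extra word "du" and the function names are hypothetical and only serve as illustration.

```python
def rhymes(groups_a, groups_b):
    """Two words rhyme only if they carry the same rhyme group on every syllable."""
    return tuple(groups_a) == tuple(groups_b)


def first_match(index, words):
    """Return the first later word that rhymes with words[index], or None."""
    _, groups = words[index]
    for later_word, later_groups in words[index + 1:]:
        if rhymes(groups, later_groups):
            return later_word
    return None


# Hypothetical annotation: "a" = stressed-syllable group, "b" = unstressed-syllable group.
words = [
    ("Tube", ("a", "b")),
    ("du", ("a",)),  # shares only the stressed rhyme group, so it is not a match
    ("Bude", ("a", "b")),
    ("Schurke", ("a", "b")),
]

print(first_match(0, words))
print(first_match(1, words))
```

Treating the full tuple of labels as a single unit means that a monosyllabic word sharing only the stressed rhyme group is no longer linked to the disyllabic words.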

Infomap communities and connected components

Having constructed the rhyme network in this new way, we can start with some preliminary analyses. As a first step, it is useful to check the general characteristics of the network. When using the new approach for network construction and the correction for complex rhymes, as reported above, the network consists of 3,104 nodes which together occur as many as 7,707 times. The network itself is only sparsely connected, being separated into 840 connected components.

As a first and very straightforward analysis, I used the Infomap algorithm (Rosvall and Bergstrom 2008) to see whether the connected components could be split any further. This analysis resulted in 932 communities, indicating that quite a few of the larger connected components in the rhyme network seem to show an additional community structure.
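Counting connected components needs nothing beyond a graph traversal, as the following Python sketch shows. It treats the network as undirected and uses a made-up edge list; the Infomap step itself is not reproduced here, since it relies on a dedicated implementation of the algorithm.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components (as sets of nodes) of an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical rhyme edges (directions already ignored).
edges = [
    ("jung", "schwung"), ("schwung", "rum"), ("rum", "krumm"),
    ("krumm", "hund"), ("hund", "mund"), ("haus", "aus"),
]
components = connected_components(edges)
print(len(components), sorted(len(c) for c in components))
```

A community detection method can then be run on each component to check whether sparsely linked groups inside a component should be separated.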

Unfortunately, I have not had time for a complete revision of all of the communities, but when checking a few of the larger connected components that were later separated into several communities, it seemed that most of these cases are due to very infrequent rhymes that are only licensed in very specific situations. As an example, consider the figure below, in which a larger connected component is shown along with the three communities identified by the Infomap algorithm.

The three communities, marked by the color of the nodes in the network, reflect three basic German rhyme patterns, which we can label -ung, -um, and -und. Transitions between the communities are sparse, although they are surely licensed by the phonetic similarity of the rhyme patterns, since they share the same main vowel and only differ by their finals, which all show a nasal component. The Infomap analysis assigns the nodes rum and krumm wrongly to the -und pattern but, given how sparse the graph is (with weights of one occurrence only for all of the edges), it is not surprising that this can happen. Both instances where edges connect the communities are rhymes occurring in the same Hip Hop lyrics from the song Geschichten aus der Nachbarschaft, as can be seen from the following annotated line of the song.

Judging from quickly eye-balling the data, most of the communities that further split the connected components of the network reflect groups of very closely rhyming words (usually corresponding to what one might call perfect rhymes). Links between communities reflect either possible similarities between the rhyme words represented by the communities, or direct errors introduced by my encoding.

Unfortunately, I could not find time to further elaborate on this analysis. What would be interesting to do, for example, would be a phonetic alignment analysis of the communities, with the goal of identifying the most general sound sequence that might represent a given community. It would also help to measure to what degree transitions between communities conform to these patterns, or to what degree individual words might reflect the communities' consensus rhyming more or less closely.

But even the brief analysis here has shown me that, first, there are still many errors in my annotation, and, second, the Infomap algorithm for community detection seems to work just as well with German rhyme data as it works on Chinese rhyme data.

Frequent rhyme pairs and promiscuous rhyme words

As a last example of how rhyme networks can be analyzed, I want to have a look at frequently recurring patterns in the current poetry collection. A very simple first test we can do in this regard is to look at the edges with the highest weights in our networks. Poets typically try to be very original in their work, since nothing is considered as boring as repetition in the literature. Nevertheless, since the pool of words from which poets can choose when creating their poems is, by nature, limited, there are always patterns that are more frequently used.

The following table shows those directed rhymes that occur most frequently in the German poetry database.

Rhyme Part A	Rhyme Part B	No. of Poems
sein	lein	10
aus	haus	10
haus	aus	9
triebe	liebe	9
leben	geben	9
geben	leben	9
zeit	keit	9
nein	sein	8
wieder	lieder	7
nur	tur	7

This collection may not tell you too much, if you are not a native speaker of German. But if you are, then you will easily see that most of these rhymes are very common, involving either very common words (sein "to be"), or suffixes that frequently recur in different words of the German lexicon (-lein either as diminutive suffix or as part of allein "alone"). We also find the very sad match of liebe (Liebe "love") and triebe (Triebe "urges"), which is mostly thanks to the poems by Rainer Maria Rilke (1875-1926), who wrote a lot about "love", and had the same problem as most German poets: there are not many words rhyming nicely with Liebe (the only other candidates I know of would be bliebe "would stay" and Hiebe "stroke or blow").

As a last example, we can consider promiscuous rhyme words, that is, rhyme words that tend to be reused in many poems with many other words as partners. The following table shows the top ten in terms of rhyme promiscuity in the German poetry dataset.

Rhyme Part	Rhyme Partners	Occurrences
sein	14	87
ein	9	34
bei	9	36
sagen	8	19
leben	8	39
schein	8	26
mehr	8	25
nicht		8
zeit	8	36
welt	7	32

Here, I find it rather interesting that we find so many words rhyming with -ein in this short list. However, when checking the community of -ein, we can see that there is, indeed, a rather large number of words from which one can choose (including basic words like Bein "leg", Schein "shine", Stein "stone"). Additionally, there are a larger number of verbs of the form -eien that are traditionally shortened in colloquial speech (compare the node schreien "to scream").
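Both rankings can be derived directly from a weighted edge list, as in the following Python sketch. The edge weights are a small subset taken from the table of frequent rhyme pairs above, so the resulting counts do not reproduce the promiscuity table; "occurrences" is approximated here as the summed weight of all edges touching a word, which is an assumption, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Subset of the directed rhyme pairs listed above (weight = number of poems).
edges = Counter({
    ("sein", "lein"): 10,
    ("aus", "haus"): 10,
    ("haus", "aus"): 9,
    ("triebe", "liebe"): 9,
    ("nein", "sein"): 8,
})


def promiscuity(edges):
    """Map each word to (number of distinct rhyme partners, summed edge weight)."""
    partners = defaultdict(set)
    weight = Counter()
    for (source, target), w in edges.items():
        partners[source].add(target)
        partners[target].add(source)
        weight[source] += w
        weight[target] += w
    return {word: (len(partners[word]), weight[word]) for word in partners}


# Most frequent directed rhyme pairs.
print(edges.most_common(3))

# Words ranked by their number of distinct rhyme partners, then by summed weight.
ranking = sorted(promiscuity(edges).items(), key=lambda kv: kv[1], reverse=True)
print(ranking[:3])
```

Counting partners irrespective of edge direction means that aus and haus count as a single partnership, although they occur as two directed edges.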

Concluding remarks

When I started this series on rhyme networks, I was hoping to achieve more in the six months that I had ahead. In the light of my initial hopes, the analyses I have shown here are somewhat disappointing. However, even if I could not keep the promises I made to myself, I have learned a lot during these months, and I remain optimistic that many of the still untackled problems can be solved in the near future. What today's analysis has specifically shown to me, however, is that more data will be needed, since the network produced from the small collection of 300 German poems is clearly too small to serve for a fully fledged analysis of rhymes in German poetry.

References

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Rosvall, M. and Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105.4: 1118-1123.

Data and Code

Data and code are available in the form of a GitHub Gist.

Herd immunity and the end of Covid-19

2020-09-21T00:30:00.031+02:00

Following on from my previous posts about the SARS-CoV-2 virus, and Covid-19, the human disease that it causes, there are a number of miscellaneous topics that could also be discussed. Unfortunately, this is only a part of the post that I originally intended. I had written about some aspects of the pandemic that seem to be less well known. However, Blogger deleted the draft without warning, and this is the only part that I could recover.

Here, I talk about how the pandemic ends, as far as biology (rather than society) is concerned.
There is a lot of wishful thinking at the moment, that production of a vaccine will see the end of the pandemic, but the World Health Organization has warned that this may not be so. For example, they are apparently trying to develop a 5-year strategy for Europe, not a 5-month one. One of their officials, Hans Henri Kluge, has noted: "The end of the pandemic is the moment when we as a society learn how we can live with the pandemic."

Biologically, safety from pathogens involves what is called herd immunity. This refers to the proportion of the population who are not infectious, and thus are not spreading the pathogen (whether it is a virus, a bacterium, an apicomplexan, or a fungus). Lack of infectiousness can be achieved by:

1. being resistant to the pathogen in the first place, perhaps due to past immunological events (eg. Coronavirus: How the common cold might protect you from COVID)
2. becoming infected and then recovering, by producing antibodies or T-cells (eg. This trawler’s haul: Evidence that antibodies block the coronavirus)
3. being vaccinated, which produces the same immune response as 2., by producing protective antibodies.

Note that 2. is not necessarily dangerous for most people, as reports show that anything up to half of the people who have antibodies to SARS-CoV-2 did not report clinical symptoms, or only mild symptoms. [Note also: lack of symptoms does not mean that you are not infectious.] However, the variation in human response has clearly been huge (see From ‘brain fog’ to heart damage, COVID-19’s lingering problems alarm scientists), in many cases resulting in cytokine storms, and death.

The main risk factors are also clear — age and gender (The coronavirus is most deadly if you are older and male — new data reveal the risks), and any pre-existing medical conditions, notably obesity (Individuals with obesity and COVID‐19: a global perspective on the epidemiology and biological relationships). Furthermore, we do not yet know how long any immune protection lasts — for example, we now have people who have been infected more than once (Researchers document first case of virus reinfection), although most have kept their antibodies for at least 4 months (Fyra av fem behåller antikroppar mot nya coronaviruset).

Nor do we yet know about the success or danger of 3., because it normally takes a couple of years of clinical trials before a vaccine is approved for use, and even then we can get it badly wrong (cf. the originally undetected side-effects of thalidomide). As far as health care is concerned, responsibility for treatment of any unfortunate outcomes from immunization is not at all clear. Furthermore, those nations that spend the most on healthcare per person may not be ranked highest for health outcomes and quality of care (see: What country spends the most on healthcare?). Therefore, it is hardly surprising that many people are concerned about taking any new vaccine (A Covid-19 vaccine problem: people who are afraid to get one), and that the World Health Organization is being much more cautious than many government leaders (Most people likely won't get a coronavirus vaccine until the middle of 2021).

Nevertheless, once herd immunity is achieved in my local population, I am relatively safe, irrespective of whether I have been vaccinated or not — there will be few infectious people around me, and so I am not very likely to catch the pathogen. Personally, I could wait a while to see how the myriad new vaccines affect people, as they have been rush-produced in a way that would not normally be accepted as safe for public use (what is called the Phase 3 trial takes time). After all, there seems to be an awful lot of politics involved, especially in the USA (The 943-dimensional chess of a trustworthy Covid-19 vaccine).

Some calculations

The point here is that the development of any epidemic is an interaction between infectivity, herd immunity and infection control. Let's consider some explicit numbers to make this clear (based on: Flockimmunitet på lägre nivå kan hejda smittan).

Infectivity refers to how the pathogen spreads among the at-risk population, usually described as the basic reproduction number (R0). If each infected individual infects 2-3 others, then the R0 value is c. 2.5 (each person infects 2.5 other people, on average). This means that the epidemic must spread. If R = 1 then the epidemic neither grows nor shrinks; and if R < 1 then the infection slowly dies out (it stops instantly if R = 0).

Clearly, infectivity can be reduced by any infection control measure that reduces R. Some of these were listed in the previous section. These measures can easily reduce the initial R0 by one half, meaning that the epidemic spreads much more slowly, if R = 1.25.

Herd immunity comes into this by also reducing R. For example, if herd immunity reaches 60%, then only the remaining 40% of the people are susceptible to the infection. If we combine this 40% with the initial R0 = 2.5, then R = 2.5 × 0.4 = 1, and the epidemic no longer increases. That is, we now have it under control. Moreover, if we have managed to get to R = 1.25, then a herd immunity of just 20% is enough to bring R down to 1, and anything above that will cause the epidemic to decrease.
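The arithmetic can be written out as a short Python sketch. It assumes the simplest possible model, in which contacts are random and the effective reproduction number is the starting value multiplied by the susceptible share of the population; the function names are hypothetical.

```python
def effective_r(r, immune_fraction):
    """Effective reproduction number when a share of the population is immune."""
    return r * (1.0 - immune_fraction)


def herd_immunity_threshold(r):
    """Immune share at which the effective reproduction number drops to 1."""
    return max(0.0, 1.0 - 1.0 / r)


# Without infection control: R0 = 2.5.
print(effective_r(2.5, 0.60))         # 60% immune brings R to about 1
print(herd_immunity_threshold(2.5))   # about 0.6

# With infection control halving R0: R = 1.25.
print(effective_r(1.25, 0.20))        # 20% immune brings R to about 1
print(herd_immunity_threshold(1.25))  # about 0.2
```

The threshold is the share at which the epidemic stops growing; it only starts to shrink once the immune share exceeds that value.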

Bhoj Raj Singh has a good slide presentation elaborating on this topic.

These calculations interact with the concept of relative risk, of course. The calculations so far assume that infection exposure is random in society, which is obviously too simple an idea. Some people are more socially active than others, are thus likely to be more exposed, and they will then quickly achieve significant herd immunity. Others find it difficult to self-isolate because of their work or social conditions, which also increases the development of herd immunity. All of this also helps more isolated people, of course, because they are not at risk of infection from those active groups with herd immunity.

We would thus expect herd immunity to develop first in cities (eg. Experts say Stockholm is close to achieving herd immunity ; A third of people tested in Bronx have coronavirus antibodies) and in poor communities (Herd immunity may be developing in Mumbai’s poorest areas), both of which seem to be the case for SARS-CoV-2.

Equally importantly, herd immunity cannot develop if we all hide from the virus. This has happened in New Zealand, for example, which has so far successfully quarantined itself from the rest of the world — they have not successfully fought the virus, they have instead successfully hidden from it. The issue is that the populace can never come out of hiding, and can thus never let anyone come into the country, not even returning New Zealanders. As an example, Hawaii had the same isolation advantage, and then lost it, just as expected (Hawaii is no longer safe from Covid-19), as also did Australia (Coronavirus (COVID-19) current situation and case numbers).

It is a classic question: which is better, fight or flight? In a pandemic, flight cannot lead to herd immunity, which is what we need in order to "learn how we can live with the pandemic".

So, where are we now? Well, a recent poll in the USA suggests that it is an even split about whether people will actually take a vaccine if offered soon (U.S. public now divided over whether to get Covid-19 vaccine). Will 50% be enough to ensure herd immunity in that country?

Exploring the oak phylogeny

2020-09-14T00:30:00.001+02:00

Neighbor-nets are most versatile tools for exploratory data analysis, including phylogenetics. They are not only fast to infer, but possibly also the most straightforward way to depict the signal in one's data matrix, which is the purpose of Exploratory Data Analysis (EDA). This makes them useful additions to any phylogenetic paper, because they give the reader (and peers and editors during review) a good idea what the data can possibly show, and where there may be problems.

A nice example of this use is the Neighbor-net in a recent paper on Chinese oaks:

Yang J, Guo Y-F, Chen X-D, Zhang X, Ju M-M, Bai G-Q, Liu Z-L, Zhao G-F. Framework Phylogeny, Evolution and Complex Diversification of Chinese Oaks. Plants 2020: 1024.

[Note: The paper is, from a purely methodological point-of-view, pretty well done, but has probably not experienced any real peer-review.**]

Oaks (Quercus L.) are ideal models to assess patterns of plant diversity. We integrated the sequence data of five chloroplast and two nuclear loci from 50 Chinese oaks to explore the phylogenetic framework, evolution and diversification patterns of the Chinese oak’s lineage. The framework phylogeny strongly supports two subgenera Quercus and Cerris comprising four infrageneric sections Quercus, Cerris, Ilex and Cyclobalanopsis for the Chinese oaks.

None of this is new. My colleagues and I published an updated classification for oaks a few years ago (Denk et al. 2017) that took into account molecular phylogenies, and introduced the systematic concept referred to by Yang et al., and recently followed by a many-species global oak phylogenomic study (Hipp et al. 2020). All of this is based on nuclear data only, because any researcher who ever studies oak genetics soon realizes that the plastomes are largely decoupled from speciation processes, but are geographically highly constrained (eg. Simeone et al. 2016, Yan et al. 2019). This is the reason why oaks are indeed "ideal models to assess patterns of plant diversity" – they provide a worst-case scenario not the (trivial) best-case one.

As can be seen in the Yang et al. tree, members of section Ilex, a monophyletic lineage forming highly supported clades in trees based on nuclear data, are scattered all across the subgenus Cerris subtree. I have annotated a copy of this tree here.

Yang et al.'s fig. 1a, with some clades newly labeled for orientation

Because of the plastid incongruence, the subgenus Cerris subtree has a wrong root (section Cyclobalanopsis diverged before sister sections Cerris and Ilex split). Also, the reciprocally monophyletic, genetically coherent sections Cerris (green) and Cyclobalanopsis (blue) are embedded in the much more diverse Ilex 3 and Ilex 4 clades. The remaining Ilex species are placed in two early diverged clades, which I have labeled Ilex 1 and Ilex 2 in the above tree (note: the taxon set only includes Chinese oak species). The only indication the tree gives that we have a data conflict issue is the low support (gray circles represent branches with Maximum likelihood bootstrap support > 60).

The network

When interpreting the phylogenetic implications of a Neighbor-net, we have to keep in mind that it is not a phylogenetic network in the strict sense (ie. displaying an evolutionary history), but is instead a meta-phylogenetic graph: a summary of incompatible splits patterns. Incompatibility can have different origins: reticulation, recombination, diffuse or poorly sorted signals, etc. Consequently, when looking at Neighbor-nets and their neighborhoods (Splits and neighborhoods in splits graphs), we need to keep in mind what kind of data we used to calculate the underlying distance matrix in the first place.
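To make "incompatible splits" concrete, here is a minimal sketch of the standard pairwise test: two splits of the same taxon set are compatible (they can sit on one tree) if at least one of the four intersections of their sides is empty. The taxon labels and example splits are hypothetical and are not taken from the oak data.

```python
from itertools import combinations


def is_compatible(side_a, side_b, taxa):
    """True if the splits side_a|rest and side_b|rest can occur in the same tree."""
    taxa = frozenset(taxa)
    a1, b1 = frozenset(side_a), frozenset(side_b)
    a2, b2 = taxa - a1, taxa - b1
    # Compatible when at least one of the four pairwise intersections is empty.
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def incompatible_pairs(splits, taxa):
    """All pairs of splits that cannot be displayed by a single tree."""
    return [(s, t) for s, t in combinations(splits, 2)
            if not is_compatible(s, t, taxa)]


# Hypothetical five-taxon example (labels are placeholders).
TAXA = {"A", "B", "C", "D", "E"}
SPLITS = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}]

if __name__ == "__main__":
    for s, t in incompatible_pairs(SPLITS, TAXA):
        print(sorted(s), "conflicts with", sorted(t))
```

A splits graph draws such conflicting pairs as boxes; the test itself says nothing about why the conflict exists, which is why the kind of data behind the distance matrix matters.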

If the data follows two incongruent trees ("phylogenies"), as in this case for the oaks, the Neighbor-net has a good chance of capturing the incompatible splits of both genealogies. Here is the graph from the paper.

Yang et al.'s fig. 1b.

The central inflated portion of the graph reflects the incongruence between the combined data sets: we have overlapping nuclear-informed and plastid-informed neighborhoods.

The authors' brackets (shown in black) refer to neighborhoods triggered by the two nuclear markers in the data set: these are neighborhoods reflecting the common origin and speciation within the oak lineages. We can even see that this signal, which is incompatible with all deep splits in the combined tree, is unambiguous in part of the data (the nuclear partitions): section Ilex spans out as a wide fan, but there is a relatively prominent edge bundle defining the corresponding neighborhood (the blue split).

The net shows additional, even more prominent edge bundles defining partly overlapping or distinct neighborhoods (the red splits). These neighborhoods are represented as clades in Yang et al.'s phylogenetic tree (fig.1a). They write (p. 11 of 20):

However, the conflict between the two datasets seems to be recovered by the neighbor-net method in this study, as the neighbor-net network based on combined plastid–nuclear data strongly shows the presence of two subgenera and four infrageneric species groups for the Chinese oak’s lineage (Figure 1b).

Interestingly, the authors nonetheless used the substantially incongruent combined data for downstream dating and trait mapping analysis (p. 7/20):

Bayesian evolutionary analyses provided a concordant infrageneric phylogeny for the Chinese oak’s lineage at the species level (Figure 2).

This uses a taxon-filtered, obviously constrained (fixed) topology, fitted to the current synopsis outlined in Denk et al. (2017). [Note: the supplement includes the extremely incongruent nuclear and plastid trees, each of which has further incongruence issues because they combine fast- and very slow-evolving sequence regions.]

Postscript

More posts on oaks, plastid data and networks can be found here in the Genealogical World and in my Res.I.P. blog.

Using median networks to understand evolution of genera (→ Example 3), Geneal. World Phyl. Networks, January 2018
Reticulation at its best – an example from the oaks, Geneal. World Phyl. Networks, July 2018
Comparing neighbour-nets and PCA graphs – the example of Mediterranean oaks, Geneal. World Phyl. Networks, March 2019
Next-generation neighbor-nets, Geneal. World Phyl. Networks, April 2019
Why you never should do a single-species plastid analysis of oaks, Res.I.P., September 2019
When dating is futile – plastome-based chronograms for oaks, Res.I.P., October 2019

Cited papers

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. (2017) An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Cham: Springer, pp. 13–38. Open access Pre-Print [major change: Ponticae and Virentes accepted as additional sections in final version].

Hipp AL, Manos PS, Hahn M, Avishai M, + 20 more authors. (2020) Genomic landscape of the global oak phylogeny. New Phytologist 229: 1198–1212. Open access.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. (2016) Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4:e1897. Open access.

Yan M, Liu R, Li Y, Hipp AL, Deng M, Xiong Y. (2019) Ancient events and climate adaptive capacity shaped distinct chloroplast genetic structure in the oak lineages. BMC Evolutionary Biology 19:202. Open access.

** The publisher, MDPI, thrives in the gray zone between predatory and accredited publishing. Originally included in the recently reactivated Beall's List (new homepage), it has been tentatively dropped (see the linked Wikipedia article; but see also this post by Mats Widgren). Personally, I have encountered articles published in MDPI journals only where the review process must have been, at least, strongly compromised. But it's always quick: Yang et al.'s paper was submitted July 24th, accepted August 12th, and published a day later. Three weeks is about the length of time that the editors of my first oak paper needed to find a peer reviewer at all.

Fossils and Networks 3 – (deleting and) adding one tip

2020-09-07T00:30:00.000+02:00

In the last Fossils and Networks post, we explored the use of SuperNetworks to identify both safe and problematic branching patterns by removing one OTU and re-evaluating the analysis. Here, we'll take the opposite approach, and see what we can learn from adding one OTU to our analysis.

Breaking and supporting wrong branches

We start again with the artificial Felsenstein Zone matrix that results in a wrong AB clade. Here's the original true tree used to generate the matrix.

Because of convergent/parallel evolution in the modern taxa (genera O, A and B) and primitive characters of their fossil sisters, any phylogenetic inference method will find the wrong tree, with an A + B | rest split.

In the Felsenstein Zone, parsimony will always get the wrong tree due to long-branch attraction (LBA), while Maximum likelihood has a 50:50 chance to escape LBA. To break down the LBA between A and B, we need a fossil that is, from an evolutionary point of view, intermediate between D and B.

If we add a fossil E that features 1 out of 3 derived traits found in the BD lineage (including the only synapomorphy of BD), we end up with two alternative parsimony trees: one with a wrong topology and the other the correct topology, as shown here.

By adding a fossil F featuring 2 out of 3 derived traits, we increase the number of most-parsimonious trees (MPTs) to three alternatives, all of which fall prey to A-B+F LBA, as shown next.

Convergent evolution is a problem for tree inference but selection bias and homoiologies are worse, involving accumulation of the same advanced trait within some but not all members of a lineage (Has homoiology been neglected in phylogenetics?). This is worse because the characters will enforce attraction between long-branching, highly evolved (more modern) taxa. A and B are siblings, but by enforcing an ABF clade, we will inevitably misinterpret the most primitive members of the ingroup, C and D. Hence, we may draw wrong conclusions about evolution in the A–F lineage.

Because E is virtually half-way evolved between D and F, and F is the next step towards B, the all-inclusive tree gets it right. We infer a single optimal tree, shown here.

PS: Also, in this case we could use any other optimality criterion (Maximum Likelihood, Least-squares, Minimum Evolution) and we would end up with the same tree.

Missing the important bits

That last observation is encouraging: the more fossils we include in our matrix and the better they reflect the evolutionary trends within a group (here from a D-like ancestor via E to F and B), the greater the chance of ending up with the true tree. There's only one drawback: in real-world data sets, we may miss exactly those traits in the fossil sample that are needed in order to infer (or stabilize) the true tree.

(Paleo-)Parsimonists have frequently argued that missing data are unproblematic, which is true in one sense, as shown in the above example. The commonly used strict consensus tree has no wrong branches, because it only has one, which is the trivial ingroup-outgroup split. The much less commonly used Adams consensus tree has one more branch, which is wrong: the ABF clade.

As always in such cases, the strict Consensus network visualizes the MPT sample best (again exemplifying why we should stop using cladograms).
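As a minimal sketch of why the strict consensus can collapse so far: if each tree is stored as a set of clades, the strict consensus is simply the clades shared by all trees, whereas counting how often each clade occurs keeps the information that a Consensus network displays. The three trees below are hypothetical placeholders, not the MPTs from this example.

```python
from collections import Counter


def clade(*taxa):
    """A clade is represented by the (frozen) set of its tip labels."""
    return frozenset(taxa)


def strict_consensus(trees):
    """Clades that occur in every tree of the sample."""
    common = set(trees[0])
    for tree in trees[1:]:
        common &= tree
    return common


def clade_frequencies(trees):
    """Proportion of trees in which each clade occurs."""
    counts = Counter(c for tree in trees for c in tree)
    return {c: n / len(trees) for c, n in counts.items()}


# Three hypothetical, equally optimal trees for an ingroup A-D.
INGROUP = clade("A", "B", "C", "D")
TREES = [
    {INGROUP, clade("A", "B"), clade("A", "B", "C")},
    {INGROUP, clade("A", "B"), clade("C", "D")},
    {INGROUP, clade("B", "C"), clade("B", "C", "D")},
]

if __name__ == "__main__":
    print("strict consensus:", [sorted(c) for c in strict_consensus(TREES)])
    for c, f in sorted(clade_frequencies(TREES).items(), key=lambda kv: -kv[1]):
        print(sorted(c), f"{f:.2f}")
```

In this toy sample only the ingroup clade survives the strict consensus, although an AB clade is found in two of the three trees.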

The price for not having false positives is that we cannot infer a most-parsimonious tree or a few alternative trees any more, but could easily end up with scores of them. Here, we have 41 MPTs for an 8-taxon dataset that include fairly wrong trees*, although some of them are closer to the true tree (green and olive edges in the strict Consensus network above). For large matrices, or matrices lacking tree-like signals, the number of MPTs can easily reach tens or hundreds of thousands. Lacking critical traits in E (14 out of 46 characters missing) and F (7 missing), we may escape LBA at the cost of decisiveness. If we do have those traits only in F but not E, we will enforce LBA between A and B.

Plus-1-trees (and SuperNetworks)

Before adding a taxon as an additional leaf to our tree, we may be interested in what that taxon does to our tree: can it trigger a topological change or does it fall in line? We will again take the dinosaur-to-bird-matrix of Hartman et al. (2019, PeerJ 7: e7247) as a real-world example. This includes everything from well-covered highly derived and most primitive taxa, to those that lack discriminatory signal in general (ie. are unresolved), plus the one or two rogue taxa, with ambiguous phylogenetic affinities creating topological conflict. (Note: the commonly reported strict consensus trees cannot distinguish between those two alternatives.)

The best-covered 15 taxa provide us with a single optimal tree that is in agreement with current opinion (shown below). However, this struggles to resolve the clade of modern birds because the extinct Lithornis is being attracted by Anas, the duck. When we remove Dromiceiomimus (as shown in Fossils and Networks 2), we end up with a putatively wrong Dromaeosauridae grade, because of LBA between the most distinct Dromaeosauridae, Velociraptor and Bambiraptor, and the distantly related (to flying dinosaurs) Allosaurus, Tyrannosaurus and the IGM 10042 skeleton.

Two of the Minus-1 trees generated for the last post of this series.

For our experiment, we will take this (partly) wrong tree, and add every other taxon included in the Hartman et al. (2019) matrix as 15th tip. We can then perform a branch-and-bound search to infer these 14-Plus-1 tree(s). When we browse through the inferred MPTs, we can see that many taxa fall in line with the wrong topology, including a few that, in addition, increase uncertainty for branches correctly resolved in the minus-Dromiceiomimus tree.

Out of the 485 candidate trees, only 10** have a set of characters that can compensate for the missing Dromiceiomimus, leading to Plus-1 trees that show a Dromaeosauridae clade, as shown here.

Two of the ten Plus-1 trees, where the added tip saves the inference from LBA. Numbers give the amount of defined characters (scored traits). Both Halszkaraptor and Zhenyuanlong are classified as Dromaeosauridae, however only the better covered taxon is placed as sister to the Dromaeosauridae included in the original 14-taxon tree.

The presence of the deep-branching Compsognathus (Tyrannoraptora: ... :Neocoelurosauria: †Compsognathidae) triggers an Archaeopteryx-Dromaeosauridae clade.

In the case of the relatively deep-branching Garudimimus (... :Neocoelurosauria: Maniraptoriformes: †Ornithomimosauria: †Deinocheiridae) and Epidexipteryx (... : Maniraptoriformes: ... : : ... : Paraves: †Scansoriopterygidae), one or two of the two or three MPTs show the wrong grade; only the last one shows the clade.

Note: the relatively low number of scored traits for Epidexipteryx can avoid LBA leading to a Dromaeosauridae grade but misplace the taxon within the Plus-1 MPTs: its family, the Scansoriopterygidae, are considered to represent the sister lineage (Wikipedia, referring to Godefroit et al. 2013 Nature 498: 359–362) of the Eumaniraptora which include the Dromaeosauridae as first-diverging branch.

We can also summarize the outcome, a collection of 640 Plus-1 MPTs, in form of a z-closure SuperNetwork, as we did for the Minus-1 trees in the previous Fossils and Networks post (shown next).

This SuperNetwork is quite boxy, and may be only semi-comprehensive (I used only 20 runs, which took half a day). Matching 485 tips into a 14-taxon backbone tree is not the kind of tree sample that the SuperNetwork has originally been designed for!

Only four edges, fat and blue, are without alternatives. In all other cases, the added tip triggered the creation of several alternatives: the highest dimension for the boxes is five, but most have four or fewer dimensions. Regarding our problem of saving the Dromaeosauridae clade, we can see that the topological change depends on very few characters, with Microraptor being very close to the divergence but a bit more bird-like (in a very broad sense), while the other two are much more derived.

Close-up on the Dromaeosauridae part of the network, with all tips labeled. Pie charts give the percentage of scored traits/missing data. * – Tips that saved the inference from LBA (see above).

Note the length of some of the colored edges, especially the light green which represent edges reflecting a Dromaeosauridae clade. Other Dromaeosauridae taxa increase not only the diversity but also may create substantial topological ambiguity (bluish and greenish edge bundles; same color = same split) and branching bias.

Take-home message

Creating morphological supermatrixes makes a lot of sense, because it ensures normalization and facilitates universal comparability, which is crucial also for paleobiology. However, even more than molecular phylogenies, paleophylogenies are affected by character and taxon sampling. This is nothing new, and much debate has dealt with which parsimony strict consensus cladogram is the better one.

I suggest taking a new route. Instead of using morphological supermatrixes to infer trees – for this matrix, Hartman et al. found millions of equally optimal parsimony trees further filtered by post-analysis, initial tree topology informed character weighting (as implemented in TNT) – we should use it to generate subsets and engage in exploratory data analysis. This will pinpoint strengths and weaknesses of the data and its individual taxa. Rather than producing evolutionary meaningless soft polytomies, one should study the reasons for any topological ambiguity. After all, one simple reason for unstable branching patterns may be that all so-far inferred trees are biased, only differently.

The SuperNetwork can assist us in putting together taxon sets that could allow not only a simple tree inference but also topology testing.

If we want to test the stability of, e.g., the Dromaeosauridae clade against taxon sampling, it will be of little use to include the most primitive (anything outside Maniraptora) and much more advanced taxa (Avialae including modern birds) of the 501-taxon matrix. On the one hand, the most primitive taxa will only increase the computational load, because our inferred tree not only optimizes branches we are interested in, but also irrelevant ones, using taxa that largely lack discriminative signal for the branches of interest or at all. On the other hand, the most derived taxa may bias the tree inference by providing strong terminal signals outcompeting potentially conflicting weak basal signals.
If we want to test the stability of the backbone phylogeny against adding taxa and entire lineages, we may prefer short-branched over long-branched taxa, in order to avoid (local) LBA (especially when we want to stick to parsimony). The terminal edges in the SuperNetwork indicate the minimum number of unique changes for each tip added to the 14-taxon tree. As seen also in our hypothetical example: E and F only break down the wrong AB clade because both are either identical (or very similar) to the last common ancestor of E+F+B and F+B, respectively.

In a future post, I'll come back to the issue of identifying taxa that are game changers, using a simple and quick tree-based approach: the so-called "evolutionary placement algorithm", first implemented in RAxML.

PS.
For any of you who really don't like networks, but still find no comfort in comb-like strict consensus cladograms either: just tick the SuperTree option when inferring the SuperNetwork. But only if your input trees converge to a shared topology. Otherwise the result may look like this:

A SuperTree based on the 640 Plus-1 MPTs.

* Somebody familiar with Consensus networks and morphological data partitions providing complex signal, can extract a phylogenetic hypothesis from this boxy network for the included taxa. In general, the distance along the network edges represents a phylogenetic distance, and thus gives a direct measure of how derived a taxon is.

For example, C and D are closer to the outgroup and placed close to the centre of the graph, which is exactly where a primitive ingroup taxon, with an ancestral morphology, would be placed. F is most likely a sister of B. The olive EF | rest split supports a potential common origin of E, F, and B (long green edge bundle). Hence, A can only represent a distant, strongly evolved sister lineage (both the alternative AB and ABF clade have less character support). Also, since the graph depicts E as least derived of the four (irrespective of the topological alternatives), its affinity to F and B has more value than the affinity between A and B, both being long-branched, and hence susceptible to LBA. D fits into the picture: the olive DE edge either (1) represents a common origin, which would make D an early member of the red lineage; or (2) has similarity due to shared primitive traits within the ingroup, which would make D an early member of an ABEF lineage. C, in contrast to D, has no clear affinities with any other ingroup member, and so can only be interpreted as an early, very primitive form with uncertain phylogenetic relationships. The (true tree) mutual monophyly of the red and blue ingroup lineages has very little character support in the matrix, and hence cannot possibly be resolved.

** Systematically they cover a range of maniraptoran ('hand hunters') families 'below' the Avialae ('flying' dinosaurs) including, in addition to two Dromaeosauridae (Halszkaraptor, Zhenyuanlong, trees shown above), members of †Alvarezsauroidea (Haplocheirus), †Caudipteridae (Caudipteryx), †Sinovenatorinae (Sinovenator), †Therizinosauroidea or related (Beipiaosaurus, Jianchangosaurus) and †Troodontidae (Gobivenator, Sinornithoides). Caihong is a member of the †Anchiornithidae, which Wikipedia flags as "Avialae ?". These OTUs show data coverage far above the median (74% missing), with 278 (Caihong) to 558 (Caudipteryx) defined characters (out of a total of 700).

Coronavirus patterns of spread

2020-08-31T00:30:00.000+02:00

Following on from my previous posts about the SARS-CoV-2 virus, and Covid-19, the human disease that it causes, there are a number of miscellaneous topics that could also be discussed. So, here are a few topics about the spread of the pandemic, which may be of interest.

Networks of cases

I have so far not presented a phylogenetic network related to the current pandemic. I may one day do so, although collating the data I would like to use will not be easy. In the meantime, the folks over at Fluxus Engineering did publish a network of genomes back in April: Phylogenetic network analysis of SARS-CoV-2 genomes.

The authors identified:

... three central variants distinguished by amino acid changes, which we have named A, B, and C, with A being the ancestral type according to the bat outgroup coronavirus. The A and C types are found in significant proportions outside East Asia, that is, in Europeans and Americans. In contrast, the B type is the most common type in East Asia, and its ancestral genome appears not to have spread outside East Asia without first mutating into derived B types, pointing to founder effects or immunological or environmental resistance against this type outside Asia.

Needless to say, their paper generated some controversy, with three published responses criticizing the methodology (these are shown at the link above). However, the Global Initiative on Sharing All Influenza Data (GISAID) uses an expanded version of their cladistic classification.

Networks can also be used much more locally, to illustrate spread, although in an epidemic this will almost always be tree-like rather than reticulating. Here is a recent example from China: Large SARS-CoV-2 outbreak caused by asymptomatic traveler. The authors comment about the wide spread from one individual:

An asymptomatic person infected with severe acute respiratory syndrome coronavirus 2 returned to Heilongjiang Province, China, after international travel. The traveler’s neighbor became infected and generated a cluster of >71 cases, including cases in 2 hospitals. Genome sequences of the virus were distinct from viral genomes previously circulating in China.

Different patterns of infection among communities

Pandemics are actually a series of local epidemics, and are therefore rarely simple things, in terms of when people become infected. For example, there are often a series of alternating "waves" of new cases, in response to the behavior of either the pathogen or the people themselves.

In the case of the Covid-19 disease, the virus has so far apparently produced a series of at least seven variant strains (Geographic and genomic distribution of SARS-CoV-2 mutations), but the waves are mainly the result of people's implementation of infection control measures. Depending on the pathogen, these measures can include: social distancing, fewer / smaller crowds (especially indoors), working from home, closing social venues such as restaurants and bars, as well as mass testing and infection tracking. Reducing the spread of breath aerosols also works well for SARS-CoV-2, including careful cleaning of surfaces, and wearing gloves and masks or visors.

So, early on in most epidemics, people get infected because they are not ready to deal with things; and the number of cases increases, as shown in the above graph of Covid-19 cases in the USA this year — this is the First Wave. The number of cases then usually decreases for a while, in response to the effectiveness of the control measures. However, if the measures do not remain effective, or the people get sick of implementing them, then the number of cases increases again, creating the Second Wave. The graph above makes it clear that for the USA the Second Wave has been much more serious than the First, in terms of the number of cases.

However, this picture is often much too simple, because the USA is a pretty big place. In this example, there are 50 main jurisdictions in the country, and there is no reason to expect any epidemic to proceed in the same way in every state and territory. Here are equivalent graphs for four different US states, each showing a different pattern of waves.

So, New York (and several other north-eastern states) got the SARS-CoV-2 virus early on, and most of the at-risk people got infected at that time, so that there has not yet been a Second Wave. Rhode Island, on the other hand, has actually had a small Second Wave. From here on in the north-east, infections are likely to be mostly local outbreaks (eg. New York city mayor says rise in Covid-19 cases in Brooklyn not a cluster), such as is now also being observed in Europe.

By contrast, Louisiana, the state with the highest percent of cases (per population) so far, had a relatively small First Wave, and it is the Second Wave that has been much more problematic for epidemic control. Even more extreme, Florida (and other states like California) had the virus spread much later, so that there was not really a First Wave at the same time as the other states, and it is the Second Wave that is producing the high percentage of infected people.

So, the country's pattern of pandemic spread is made up of a series of different sub-patterns of epidemics, with different jurisdictions having very different degrees of success in controlling virus spread. This matters very much for any national response to the pandemic, because it is not the same epidemic everywhere.

In a similar manner, deaths have been concentrated in those places that got the SARS-CoV-2 virus early on. We expect for most pandemics that the number of deaths will rise as the number of infection cases rises. This next graph shows the case rates (proportion of people infected) and death rates (proportion of people who have died) in each US state (each point represents one state, plus DC).

The proportion of cases varies from a low in Vermont to a high in Louisiana, and the proportion of deaths rises along with this — 44% of the variation in deaths between states is correlated with the difference in case rate. However, there are four states in the north-east of the country (as labeled on the graph) where the death rate has been much higher than expected (about double). These states all got their virus infections early in the pandemic, so that one or more of these has been happening:

the deaths predominantly occurred before effective treatment strategies were developed;
the at-risk groups are now being protected more effectively; or
the currently predominant strains of the virus are less deadly than those circulating originally.

As I noted in my previous post: It is about time we started behaving rationally in response to Covid-19?. A rational response needs to take into account geographical variation in the current state of the pandemic. A one-size-fits-all response cannot be particularly effective in the face of large variation.

Comparing lock-downs to voluntary isolation

Many governments have responded to the spread of SARS-CoV-2 by instituting economic lock-downs as a form of quarantine, to keep their populace apart from each other. This is expected to be effective biologically, because the virus is spread by aerosol droplets, and keeping people apart reduces the risk of infection (eg. 1 m when breathing, 2 m when sneezing, 4 m when coughing).

However, lock-downs have not been universal. In particular, Sweden has become well-known for leaving social distancing as a voluntary exercise, although along with strict recommendations — see my post: Media misunderstandings about the coronavirus in Sweden for an explanation of the actual situation. The essential difference is between a government mandated and enforced response and a response based on social co-operation.

The economic consequences of lock-downs have been very serious, and we have constant media reports about how dire the situation has been for various industries. So, it is interesting to compare the spread of the virus in Sweden with the spread elsewhere, as a simple means of estimating how effective the lock-downs have been.

One possible comparison is with the United Kingdom. The pandemic started in both countries at the same time (first reports on 26-27 February), and the current total death rates (attributed to Covid-19) are similar (Sweden: 576 people per million, UK: 611 people per million). The case rates are quite different, however (Sweden: 8,305 people per million, UK: 4,897 people per million), and this might be attributed to the two different strategies. [Note: the USA also has a similar death rate (564 per million) but a much higher case rate (18,495 per million).]

For a meaningful comparison, we need to look at the rates, not the raw data, because the two populations are very different in size (Sweden: 10 million, UK: 68 million). These two graphs show the case rate and death rate through time for the two countries. The comparison is quite revealing. [Note: the saw-tooth patterns in the graphs come from the fact that medical reports in most countries are notably fewer on weekends.]
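The point about rates versus raw data can be illustrated with a small sketch, using only the rounded death rates and population sizes quoted above. The implied totals are therefore approximate back-calculations, not reported counts, and the function names are mine.

```python
def per_million(count: float, population: float) -> float:
    """Convert a raw count into a rate per million people."""
    return count / population * 1_000_000


def implied_count(rate_per_million: float, population: float) -> float:
    """Back-calculate the approximate raw count behind a per-million rate."""
    return rate_per_million * population / 1_000_000


# Rounded figures quoted in the text: (deaths per million, population).
COUNTRIES = {
    "Sweden": (576, 10_000_000),
    "UK": (611, 68_000_000),
}

if __name__ == "__main__":
    for name, (death_rate, population) in COUNTRIES.items():
        deaths = implied_count(death_rate, population)
        print(f"{name}: about {deaths:,.0f} deaths in total, "
              f"{per_million(deaths, population):.0f} per million")
```

The raw totals differ roughly seven-fold simply because the populations do, while the per-million rates are within a few percent of each other.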

As expected, the cases initially increased faster in Sweden. However, the case rates were very similar in the two countries by the last week of March; and they remained so until Sweden started serious virus-testing in late May. Just at the moment, the case-rates are similar again, although the UK has actually done twice as much virus testing as Sweden (240,000 tests per million people versus 110,000). Anyway, the two different government responses did not produce much difference in the number of cases for the first 3 months of the pandemic.

The death rates show quite a different pattern. The rates started off very similar, but by the end of March the UK actually had a higher death rate than Sweden. This situation was maintained until the end of May, after which Sweden had the higher rate until the end of July. Once again, the two countries are now very similar. Overall, the time-course of deaths is highly correlated between the two countries (79% shared variation), while the case rates are not (7%).
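For readers who want to try this themselves, here is a minimal sketch of one common way to compute such a "shared variation" figure, assuming that it means the squared Pearson correlation (r²) between the two time series. The weekly values below are made up purely for illustration; they are not the data behind the graphs.

```python
from math import sqrt

def shared_variation(x, y):
    """Squared Pearson correlation (r^2) between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / sqrt(var_x * var_y)
    return r * r

# Made-up weekly death rates per million, for illustration only.
sweden = [1.0, 4.0, 9.0, 7.0, 5.0, 3.0]
uk = [0.5, 5.0, 10.0, 8.0, 4.0, 2.0]
print(f"shared variation: {shared_variation(sweden, uk):.0%}")
```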

Of particular note here is that the differences in case rates have not resulted in differences in death rates. Apparently, Sweden's voluntary response has allowed a greater proportion of the population to become infected but this has not resulted in more deaths. I am fairly sure that the authorities will attribute this to the development of herd immunity (which I will talk about in my next post on the coronavirus) (WHO expert praises Swedish strategy - urges other countries to follow suit). [Note: a direct comparison with the USA would be pointless, given the geographical variation discussed above.]

The consequences are far-reaching. As but one example of the unfortunate consequences of the UK lock-down, you could read up on the fiasco concerning the final-year school exams (A coronavirus lesson about the modern state) — without a lock-down, Sweden avoided such problems for its young people.

Conclusion

There is a wealth of data in this pandemic, enough to keep data analysts busy for a very long time. I am sure that we will be inundated with reports for many years to come. In the meantime, like all pandemics, the geography of the local epidemics is a vital point in implementing effective control strategies.

Constructing rhyme networks (From rhymes to networks 5)

2020-08-24T00:30:00.000+02:00

As is now happening for the summer, this little series on rhyme networks is also coming to its end. We have only two more blog posts to go, with this one discussing the construction of rhyme networks, and then one more post in September, discussing how rhyme networks can be analyzed.

A preliminary annotated collection of rhymed poetry in German

While my original plan was to have all of Goethe's Faust annotated by the end of this series, so that I could illustrate how to make rhyme analyses with a large dataset of rhyme patterns in a language other than Chinese, I now have to admit that this plan was way too ambitious.

Nevertheless, I have managed to assemble a larger collection of German rhymes from various pieces of literature, ranging from boring love poems to recent examples of German Hip-Hop; and all of the rhymes have been manually annotated by myself during recent months.

This little corpus currently consists of 336 German "œuvres" (the data collection itself has more poems and songs from different languages), which make up a total of 1,544 stanzas (deliberately excluding the refrains in songs). There are 3,950 words that rhyme in this collection; and together they occur 5,438 times in a total of 49,797 words written by 72 different authors. The following table summarizes major features of the German part of the database.

Aspect	Score
components	994
authors	72
poems	336
stanzas	1544
lines	8340
rhyme words	3950
words rhyming	5438
words total	49797

The whole collection, which is currently available under the working title "AntRhyme: Annotated Rhyme Database", can be inspected online at https://digling.org/rhyant/, but due to copyright restrictions for texts from recent pop songs, not all of the poems can be displayed. In order to share the annotated rhymes along with the initial Python code that I wrote for this post, I have therefore created a version in which only the annotated rhyme words are provided, along with dummy words in which each character was replaced by a miscellaneous symbol. As a result, the song "Griechischer Wein" ("Greek wine") by Udo Jürgens from 1974 now looks as shown in the following figure.

Modeling rhymes with networks

As far as Chinese rhyme networks were concerned, I have always given the impression (and also truly thought this myself) that the reconstruction of a rhyme network is something rather trivial. Given a stanza in a given poem, all one has to do is to model the rhyme words in the stanza as nodes in the network, and then add connections for all of the words that rhyme with each other according to the annotation.
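This simple model can be sketched in a few lines of Python. The sketch below is only an illustration and not the code from the Gist linked at the end of this post; the input format (a list of word and rhyme-label pairs per stanza) and the function name rhyme_network are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def rhyme_network(stanzas):
    """Build node and edge counts from annotated stanzas.

    Each stanza is a list of (rhyme_word, rhyme_label) pairs in verse order.
    """
    nodes, edges = Counter(), Counter()
    for stanza in stanzas:
        groups = {}
        for word, label in stanza:
            nodes[word] += 1
            groups.setdefault(label, []).append(word)
        for words in groups.values():
            # Connect every pair of distinct words sharing a rhyme label,
            # counting each pair only once per stanza.
            for a, b in combinations(sorted(set(words)), 2):
                edges[(a, b)] += 1
    return nodes, edges

# One rhyme group (label "f") from a stanza of "Griechischer Wein".
stanza = [("Wind", "f"), ("sind", "f"), ("Kind", "f")]
nodes, edges = rhyme_network([stanza])
print(len(nodes), "nodes,", len(edges), "edges")
```

Node and edge counts accumulate over all stanzas, so they can later serve as raw weights.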

While I still think that this simple rhyme network model is a very good starting point, there are certain non-trivial aspects that one needs to carefully consider when working with this kind of rhyme network. First, there is the question of weighting. In the first study that I devoted to Old Chinese poetry (List 2016), I weighted the nodes by counting their appearance, and I also weighted the edges by first counting how often they occurred. I then normalized this score in order to receive a more balanced weighting. The normalization would first count each rhyme pair only once, even if the same word occurred more than one time in the same stanza, and then apply a formula for normalization based on the number of words rhyming with each other within the same stanza (see ibid. 228 for details).

However, in the meantime, a young scholar, Aison Bu, has suggested an even better way of counting rhymes, in an email conversation with me. [The pandemic prevented us meeting in person at a conference in early April, so we could never follow this up.] Since rhyming is essentially linear, my original counting of all rhymes that are assigned to the same rhyme partition in a given stanza may be misleading. Instead, Aison suggested counting only adjacent rhymes.

To provide a concrete example, consider the third stanza in the song "Griechischer Wein" by Udo Jürgens (shown above). Here, we have the rhyme group labeled as f, which occurs three times in the data, with the rhyme words Wind (wind), sind (they are), and Kind (child). The normalization procedure that I proposed in the study from 2016 would now construct a network in which all three words rhyme with each other. To normalize the edge weights, each individual edge weight would be modified by the factor 1 / (G-1), where G is the number of rhymes in the rhyme group in the stanza (3 in this case, as we have three words rhyming with each other). Aison's rhyme network construction, however, would only add two edges, one for Wind and sind, and one for sind and Kind, as they immediately follow each other in the verse. A specific normalization of the edge weights would not be needed in this case.
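The two ways of drawing edges can be compared in a short Python sketch. The function names are hypothetical, the words in a group are assumed to be distinct and listed in verse order, and the weight of 1 for adjacent edges is my own reading of "no normalization needed".

```python
from itertools import combinations

def all_pairs_edges(words):
    """Link every pair in a rhyme group, each weighted by 1 / (G - 1)."""
    g = len(words)
    if g < 2:
        return {}
    return {pair: 1 / (g - 1) for pair in combinations(words, 2)}

def adjacent_edges(words):
    """Link only rhyme words that directly follow each other in the group."""
    return {pair: 1 for pair in zip(words, words[1:])}

group_f = ["Wind", "sind", "Kind"]  # rhyme group f, in verse order
print(all_pairs_edges(group_f))
# {('Wind', 'sind'): 0.5, ('Wind', 'Kind'): 0.5, ('sind', 'Kind'): 0.5}
print(adjacent_edges(group_f))
# {('Wind', 'sind'): 1, ('sind', 'Kind'): 1}
```

With G = 3, the first approach yields three edges of weight 0.5 each, while the adjacent approach yields two unweighted edges and never links Wind directly to Kind.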

A first rhyme network

Unfortunately, I have not had time so far to test Aison's idea, to draw only edges for adjacent rhymes when constructing rhyme networks. However, with the data for more than 300 German poems and songs assembled, I have had enough time to construct a first and very simple network of German rhyme data.

For this network, I disregarded all normalization issues, and just added an edge for each pair of words that would have been assigned to the same rhyme group in my rhyme annotation. This network resulted in a rather sparse collection of 994 connected components. This is in strong contrast to the Chinese poems I have analyzed in the past (List 2016, List 2020), which were all very close to small-world networks, with one huge connected component, and very few additional components. However, it would be too early to conclude that German rhyme networks are fundamentally different from Chinese ones, given that the data may just be too sparse for this kind of experiment.
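Counting connected components needs nothing more than the edge list. Below is a minimal union-find sketch in plain Python. It is an illustration and not the code used for the figures, and the edges in the example are chosen for demonstration rather than taken from the database.

```python
def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())

# Illustrative edges only: two rhyme groups that never rhyme with each other.
edges = [("Wind", "sind"), ("sind", "Kind"), ("Mut", "resolut")]
print(len(connected_components(edges)), "components")
```

A sparse network like the German one produces many small components with this procedure, whereas a small-world network would produce one dominant component.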

At this stage of the analysis, it is therefore important to carefully inspect the networks, in order to explore to what degree the network modeling or the data annotation could be further improved. When looking at the largest connected component, shown in the following figure, for example, it is clear that typical rhyme groups that we would expect to find separated in rhyme dictionaries do cluster together. We find -aut on the left, -aus and -auf on the right, with the word auch (also) as a very central rhyme word, as well as Frau (woman).

While these words can be defended as rhymes, given that they share the diphthong au, we also find some strange matches. Among these is the cluster with -ut on the bottom left, which links via Mut (courage) to Bauch (belly) and resolut (straightforward). Another example is the link between Frau and trauern (mourn). The former link is due to an annotation error in the poem "Freundesbrief an einen Melancholischen" ("Friendly letter to a melancholic") by Otto Julius Bierbaum (1921), where I wrongly annotated Bauch and auch to rhyme with resolut and Mut.

However, the second example is due to a modeling problem with rhymes that encompass more than one word. This pattern is very frequent in Hip-Hop texts, and I have not yet found a good way of handling it. In the case of Frau rhyming with trauern, the original text rhymes trauern with Frau an, the latter being a part of the sentence "schaut euch diese Frau an" ("look at this woman"). Since my conversion of the text to rhyme networks only considers the first part of multi-word rhymes as the word under question, it mistakenly displays Frau as rhyming with trauern; the rhyme is shown in its original form in the figure below.

Conclusion

The initial construction of German rhyme networks which I have presented in this post has shown some potential problems in the conversion of rhyme judgments to rhyme networks. First, we have to reckon with certain errors in the annotation (which seem to be inevitable when doing things manually). Second, certain aspects of the annotation, especially rhymes stretching over more than one word, need to be handled more carefully. Third, assuming that poetry is spoken, and spoken texts are realized in linear form, it may be useful to reconsider the current rhyme network construction, by which edges for rhyme examples are added for all possible combinations of rhyme words occurring in the same rhyme group. For the final post in this series next month, I hope that I will find time to address all of these problems in a satisfying way.

References

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

List, Johann-Mattis (2020) Improving data handling and analysis in the study of rhyme patterns. Cahiers de Linguistique Asie Orientale 49.1: 43-57.

For those of you interested in data and code that I used in this study, you can find them in this GitHub Gist.

Isn't it about time we started behaving rationally in response to Covid-19?

2020-08-17T00:30:00.000+02:00

I have written a few blog posts recently about the current Covid-19 pandemic, caused by the arrival of the SARS-CoV-2 virus in our lives. This interests me as a biologist with some background in the study of pathogens (disease-causing organisms).

There have been two extreme responses to the current pandemic. There are all sorts of variants in between, of course, but I will start by characterizing the extremes, and then move on to some practical examples. The point here is that we need a reasoned response to this pandemic, based on the effect of the virus on people, and the make-up of the populations being affected. The current one-size-fits-all approach used by most governments is not going to work, long-term.

The future of having to live with the virus is becoming clearer. Actions can be individual, but they need to be co-ordinated, with each of the risk groups being treated appropriately. Even if you personally feel secure, those around you might experience risks very differently. An all-purpose set of mandated behaviors might work short-term, but we cannot continue to live that way. Behavior needs to make all risk groups feel safe at all times, by being targeted appropriately.

Behaviors

At one extreme, people are trying to hide from the virus. By this, I mean that they are trying to keep away from it. Obviously, many people are doing this individually, but whole countries have also been trying to do it, notably Australia and New Zealand, which are geographically isolated by virtue of being islands. At the other extreme, people are trying to "crush" the virus, like they are playing poker against some weak opponent.

The problem with the first extreme is that you can never come out of hiding, because the virus does not go away, it just sits there (like viruses do) until you finally come past, and then it will get you, after all. This is what the so-called Second Wave of infections is currently showing us. The First Wave of infections occurs because people do not know about the pathogen, and therefore catch it inadvertently. In response to the rapid increase in case rates, people go into self-quarantine, trying to prevent themselves from encountering the virus. This works, but they eventually get tired of doing it, and they come back out again — and that is the Second Wave of infections. It is nothing new as far as the virus is concerned, it simply reflects changing human behavior (out, in, out again).

A prime example of the other extreme is expressed by this recent New York Times article: Here's how to crush the virus until vaccines arrive, or even the Wall Street Journal: The treatment that could crush Covid. You can't crush a pandemic, as we know from the seemingly endless series of previous pandemics in recorded history, and presumably many more of them before we learned to write. Naturally, Wikipedia has a List of epidemics, for you to peruse.

However, at some stage, people are going to have to start treating the current pandemic like the influenza virus — a natural part of their environment, where they take standard precautions to minimize their risk. In response to the perennial threat of flu, old people take vaccines in winter, middle-aged people stay away from public transport during flu season, and young people simply get on with their lives (because a bit of flu will not kill them). These are rational responses, taken by people after evaluating the perceived risk of infection to themselves.

To do this for Covid-19 we need to consider what we have learned so far this year.

We need to learn

During the First Wave of any pandemic we need to over-react, while we find out how the new pathogen behaves and what effects it can have. So, we try everything from social distancing to lock-downs, to see what seems to work in practice. The objective is to reduce the rate of spread of the virus — in biological terms, we are trying to work out what things will flatten the curve (see: Coronavirus: What is 'flattening the curve,' and will it work?).

For example, one current debate is: do face-masks provide protection, in the community setting? They work in hospitals, for sure (Face masks really do matter: the scientific evidence is growing), but that is a specialist environment, where they are used by professionals in conjunction with other methods (hand scrubbing, special clothing, etc). We need to find out whether people can routinely wear face-masks properly, so that the masks do what they are designed to do. We may actually be better off with perspex visors, for example, which are also effective at preventing the spread of breath aerosols (which is the main problem), and they can be worn effectively even by a novice — and they do not make us all look like we are involved in a bank hold-up.

We also need different groups of people to try different approaches, to see how effective they are. If everyone does exactly the same thing, strictly following World Health Organization recommendations for example, then we do not learn much, as a global community. That is, a pandemic is simply a widespread (global) series of epidemics, one in each local area. Since countries are all different, culturally, this cultural diversity creates the ideal environment to maximize learning-by-doing, by treating the pandemic as a set of epidemics, to which we might respond differently.

For example, the Buddhist-dominated communities of South-East Asia have done things in a very community-cooperative manner (these people do not work alone, by choice); and they collectively have the lowest infection rates on the planet. The Muslim-dominated countries of the Middle East do not worry much about life threats (whether they die or not is the Will of Allah), and they collectively have the worst rates. The individual creed of Americans does not encourage them to act co-operatively (resulting in draconian government-mandated lock-downs), and so they also have a very high rate. Sweden is one of the few remaining socialist cultures, where governments give advice rather than issuing instructions (resulting in this case in co-operative self-quarantines), and they have a middling-to-high infection rate.

We learn many things about alternative effective actions from this cultural diversity. In particular, media criticism of the different national reactions to the pandemic is now dying down, as the critics slowly come to realize that uniformity always results in an all-or-none outcome.

What have we learned?

Okay, so after the First Wave we know that this new virus can do everything from: apparently nothing (there are plenty of people with antibodies who have never felt any symptoms of having had the virus), to creating flu-like symptoms (key symptoms: fever, cough, skin rash, loss of taste & smell), on to hospitalization (with usually c. 7 days to get rid of the symptoms but 5 weeks to get rid of the actual virus), or even intensive care (as a result of what is medically called a cytokine storm). For the elderly, and others with pre-existing medical conditions, the virus seems to be one thing too many for their body, the proverbial straw that breaks the camel's back — which can lead to death sooner rather than later.

So, not only does SARS-CoV-2 infection not mean death for the vast majority of people (globally, < 3.6% of reported infections have resulted in death), it does not even necessarily mean sickness at all (eg. a Swedish study showed that 46% of the study participants with antibodies had never reported clinical symptoms). This should mean something for our future responses.

Notably, in those countries where a significant Second Wave is now occurring, the new infections are often not resulting in deaths (except notably in Australia). This is a very important difference between the First and Second Waves, in most places. There is speculation that the SARS-CoV-2 variants currently widespread are less deadly than were those common at the beginning of the pandemic; but it is equally likely that those people who were most susceptible to the virus have already succumbed during the First Wave.

So, we now know about the risk groups, roughly, which is as good as we ever know such things; and we have a good idea about the outcomes of the various risks. This means we can start to do some reasoned things, as a pandemic response. The Second Wave is a perfect time to start treating the Covid-19 situation rationally.

The time for some new action?

This means that it is time to start targeting actions to the degree of risk for each person, rather than having over-arching actions that affect everyone equally. Our individual responses to the virus are not equal, so why are most government actions still predicated on the idea that we are all equal?

The point is, we have to respond to what we have learned about relative risks. For example, I have argued before that the biggest mistake Sweden has made was letting Covid-19 get into the aged-care facilities, which is where most of the country's deaths have now occurred. Has anyone learned from this mistake? Apparently not in the USA: Untested for Covid-19, nursing-home inspectors move through facilities. Come on people — get your act together.

The response to the First Wave always needs to assume equality, because anything else would be irresponsible, in the face of our initial ignorance. During the Second Wave, however, we are no longer quite so ignorant, and we can tailor our actions to suit the conditions. When are we going to start doing this?

In order to think about this question, it is worthwhile to consider a few topics that seem to be on the agenda, and look at some practical examples of three relevant situations.

Trying to hide

Any country that successfully hides from the virus has to keep hiding, forever. New Zealand has recently been crowing about having gone 100 days without a new coronavirus case. That record was destroyed this week (New Zealand on alert after 4 cases of COVID-19 emerge from unknown source); and it will get even worse on the day they allow the first visitor into their country. Their current Alert Level 3 response cannot change this — you cannot hide from a virus.

New Zealand's near neighbor, Australia, has demonstrated this point even more strongly. In one sense, the Australians understand quarantine, because it is a big part of keeping plant and animal diseases out of their country. For example, international visitors are regularly surprised to have biological products (notably wood) confiscated at the arrival airport — better safe than sorry.

So, dealing with Covid-19 should be straightforward for them — you just apply the same idea to the people, themselves. Sadly, it took them some time to realize that you have to take people straight from the airport to a quarantine hotel, if the quarantine strategy is to work. One of my nephews returned to Sydney (Australia) from Copenhagen (Denmark) at the beginning of the First Wave, and he had to make his own long way by public transport from the airport to the quarantine house that his father had arranged!

So, it should not be a surprise that quarantine has not been effective everywhere in Australia — one mistake is all it takes. This mistake was made in the quarantine hotels in Melbourne (Victoria), where the quarantine security turned out to be a joke (see: New coronavirus lockdown Melbourne amid sex, lies, quarantine hotel scandal). Perhaps the security guards should have read the earlier article on: Sex in the time of coronavirus.

The issue here is that Australians are no better than Americans at following government instructions — individual rights take precedence (see: Individual choice is a bad fit for Covid safety). Even my local newspaper here in Uppsala (Sweden) reported (Regel brott ger böter) the news that military personnel were sent to visit 3,000 Australians who were supposed to be in self-quarantine at home (due to having tested positive for the virus), and 800 of them (one-quarter!) were not at home. I lived in Australia for 40 years, and this situation surprises me not at all.

So, hiding does not work, long-term, because you have to keep it up for too long to be practical for most people. The Second Wave in Victoria is actually worse than the First Wave, in terms of number of Covid-19 cases. The ensuing lock-down is now even worse than it has been in most other places (see: 'Very dead': army and police patrol the deserted streets of coronavirus-stricken Melbourne); and Victoria itself has been quarantined from the rest of the country.

Schools

We have all been told that the effect of Covid-19 is age-related; and the global data shows that this is true everywhere: the older you are, the more likely you are to be seriously affected. One outcome of this knowledge is that actions can be tailored to age groups. Notably, we can consider the idea that massively disrupting the lives of very young people may be doing them more harm than good, due to stress if nothing else (Lockdowns and school shutdowns may make youngsters sicker).

Most countries mandated the closure of schools, and instituted some form of working from home for the pupils. This move was predicated on the idea that children will catch the virus in the crowded schools, and bring the disease home to their elders. This scenario seemed to be the case, for example, in the early spread of the SARS-CoV-2 in northern Italy.

Recent evidence, however, suggests that, while the youngsters do catch the virus, they are much less infectious than older people (see: COVID-19 study confirms low transmission in educational settings). We are talking about pre-teenagers here, not older children. This does not mean that they can't spread the virus (see: Latest research points to children carrying, transmitting coronavirus), but merely that this is a much lower risk.

It has therefore been suggested that a rational response would involve a trade-off between disrupting the lives of very young people versus the risk of viral spread (see: Why it’s (mostly) safe to reopen the schools). Notably, this issue was explicitly considered in Sweden, and during the First Wave it was decided to keep the junior schools open, but to close the senior schools (ie. high school). So, the younger children have all been trundling off to school every week-day, just as usual, the whole time. As far as I know, there has not been even one reported outbreak involving any of the open schools.

This is why I emphasize the importance of culturally diverse responses to a pandemic. In this case, the Swedes seem to have got it right; and everyone else could learn from this.

Young people

It is a different matter for somewhat older (but still young) people. The so-called Millennial generation has had a pretty tough time, especially financially. This is the second financial down-turn that they have experienced in a dozen years, just when they are trying to get themselves onto their own two feet (see: Millennials slammed by second financial crisis fall even further behind).

So, none of us should be surprised that these people are thoroughly sick of restrictive pandemic responses by now. Indeed, it is becoming widespread news that case rates are increasing among 20-29 year olds (or 15-25, depending on how people are grouped) (see: WHO urges young people to help control the spread of coronavirus). This has become particularly obvious in Europe (see: Coronavirus cases rise in Europe as youth hit beaches and bars), but also in North America (see: B.C. hospitalizations, deaths steady as latest wave hits mostly young people) and Australia (see: Coronavirus Australia: Why young people are spreading COVID-19).

This is not necessarily as bad as it might sound, because the effect of the virus is age-related, and these people will probably mostly be safe (but not all). The same thing is true for somewhat younger people — youth is a social time, and mandated restrictions about distancing may not be very effective (see: Why the teenage brain pushes young people to ignore virus restrictions).

Places like Japan and Spain are now cracking down on bars, and the like (eg. Spain cracks down on outdoor drinking, smoking in renewed push against COVID-19). If you want some survey data on what activities U.S. people currently feel comfortable doing, then check out: Weekly updates on consumers’ comfort level with various pastimes.

In this situation, Sweden has not been exempted; and recent coronavirus cases have become prevalent in the 20-29 year old group, just like everywhere else. Once again, this emphasizes that our knowledge cannot all come from one place. No-one gets it all right, but they may get some things right; and we should learn from both success and failure. This is the rational approach, not the one-size-fits-all approach.

Adding to this scenario, as I write this blog post, Europe is having a warm spell (up to 40 °C in the south), and my local newspaper has the headline: Chaos on Europe's beaches in the heatwave. All governments are warning about the need to continue keeping people apart, for those who wish to avoid infection. Fortunately, the summer holidays are nearing their end in the northern hemisphere.

Concluding comments

From the biological perspective, for the future to be bearable, we need to reach herd immunity, which refers to public safety in the presence of a pathogen. This is determined by the proportion of the (local) population that needs to become immunized (either by becoming infected or by being vaccinated) in order for the infection to stop spreading (see: A new understanding of herd immunity).
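For orientation, the textbook approximation for this proportion can be written down in one line. It assumes a homogeneously mixing population and a basic reproduction number R_0 (the average number of new infections caused by one case in a fully susceptible population); neither assumption nor any R_0 value comes from this post.

```latex
p_c = 1 - \frac{1}{R_0}
```

As a purely hypothetical example, R_0 = 3 would give p_c = 1 - 1/3, or about 67% of the population. Real populations are not homogeneous, so this figure is only a rough guide.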

We can achieve herd immunity by responding rationally based on the make-up of the population, in terms of the relative risks. At-risk groups need to be protected, while the rest of the people get on with their lives. For example, Stockholm, in Sweden, may now be getting close to herd immunity (or flock immunity, as the locals would call it), the Swedes having foregone the lock-downs imposed elsewhere, and thus allowing immunity to arise naturally.

Herd immunity can be achieved without rationality, of course — we simply wait for the weakest people to die, and the rest are likely to be safe. You might not like the moral implications of doing this, but it is biologically effective, nonetheless. For example, India may potentially end up with the world's worst case-rate for infections, given its population size and large degree of poverty in many areas (where social distancing is not feasible). However, its saving grace, in terms of deaths, may well be the consequent fact that poor people are usually young, because poor people do not live long in the first place. Herd immunity to SARS-CoV-2 is easy to achieve under these circumstances (see: Herd immunity seems to be developing in Mumbai’s poorest areas).

I vote for the rational approach, myself, among the many biological alternatives.

Fossils and networks 2 – deleting (and adding) one tip

2020-08-10T00:30:00.000+02:00

A general assumption in phylogenetics is: the more the better. The more data my matrix includes, the better will be my tree. The more taxa I include, the better will be my phylogenetic analysis. But is this true when we include (or rely on) fossils? After all, there is an old saying: less is more; and in this post I will show you that it is often true here, too.

Perfect data – how to recognize unproblematic topologies

In the first post of this series (Farris and Felsenstein), I introduced two matrices, a Farris Zone matrix and a Felsenstein Zone matrix, with the same set of tip taxa: three extant genera and three early fossils, one for each generic lineage.

The Farris Zone matrix provides a perfect signal. No matter which inference criterion one uses, one always gets the true tree. In such a case, the taxon sampling should be irrelevant; and it is. Any 5-taxon sub-tree correctly shows only splits found in the 6-taxon true tree — shown below are the actual most parsimonious trees (MPT) of each inference using the branch-and-bound algorithm.

Six most-parsimonious trees showing the topology of the true tree; trees are midpoint-rooted and have the same scale.
Note: NJ/LS and ML would give the same result for this experiment.

Consequently, for the perfect case, the SuperNetwork of the six 5-taxon trees is the 6-taxon true tree.

Z-closure SuperNetwork (Huson et al. 2004) of the 5-taxon MPTs generated with SplitsTree (walkthrough at the end of the post) depicting the true tree.

Therefore, the simplest test to check for potential topological issues in any set of data is to sub-sample the taxa by sequentially pruning a single taxon, infer the resulting group of trees (which I will call minus-one trees), and then summarize this tree sample in the form of a SuperNetwork. If the data have no signal issues – and the inferred all-inclusive tree is unbiased – all minus-one trees will be congruent with the all-inclusive inferred tree. The resulting SuperNetwork will then be a tree matching the inferred all-inclusive tree.
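The bookkeeping behind this test can be sketched in Python. The tree inference itself is left to external software; the sketch only enumerates the minus-one taxon sets and checks whether the splits of a minus-one tree agree with the all-inclusive tree. The function names are hypothetical, and the reference topology (each genus paired with its own fossil: A with C, B with D, O with Z) is assumed from the description of the matrices.

```python
def minus_one_taxon_sets(taxa):
    """Yield (pruned_taxon, remaining_taxa) for every single-taxon deletion."""
    for pruned in taxa:
        yield pruned, [t for t in taxa if t != pruned]

def restrict_splits(splits, keep):
    """Restrict splits (each given by one side) to `keep`; drop trivial ones."""
    keep = frozenset(keep)
    restricted = set()
    for side in splits:
        part = frozenset(side) & keep
        rest = keep - part
        if len(part) >= 2 and len(rest) >= 2:
            # Store each split as an unordered pair of its two sides.
            restricted.add(frozenset([part, rest]))
    return restricted

def is_congruent(minus_one_splits, full_splits, keep):
    """True if every split of the minus-one tree also occurs in the full tree."""
    return restrict_splits(minus_one_splits, keep) <= restrict_splits(full_splits, keep)

TAXA = ["A", "B", "C", "D", "O", "Z"]
# Non-trivial splits of the assumed true tree, each given by one side.
TRUE_SPLITS = [{"A", "C"}, {"B", "D"}, {"O", "Z"}]

for pruned, remaining in minus_one_taxon_sets(TAXA):
    expected = restrict_splits(TRUE_SPLITS, remaining)
    print(f"without {pruned}: {len(expected)} non-trivial splits expected")
```

If every minus-one tree passes is_congruent against the all-inclusive tree, the SuperNetwork of the minus-one trees collapses to that tree.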

On the other hand, if removing a single taxon has a significant effect on the inferred tree, then this either means you need this taxon to get the right tree or that this taxon is causing bias. We cannot assume that trees with many taxa are better than trees with fewer taxa. Only if a topology is independent of taxon sampling can we be sure that we are looking at a true tree (or one inevitable with the data at hand).

Taxon-sampling matters? Then the all-inclusive tree may be biased

Real data matrices are far from perfect. Paleophylogenetic matrices, for instance, not only include a lot of missing data, which limits the decision capacity of any phylogenetic inference, but, being restricted to morphological traits, usually also show high levels of homoplasy, that is, similarity in conflict or only partial agreement with the phylogeny (here are some related posts: Has homoiology been neglected in phylogeny? Should we bother about character dependency? Please stop using cladograms! The curious case[s] of tree-like matrices with no synapomorphies and More non-treelike data forced into trees: a glimpse into the dinosaurs). While some OTUs are primitive in their character suites, others are highly derived. Often, without realizing it, we are inferring within or close to the Felsenstein Zone.

If we repeat the same minus-one experiment, but now use the Felsenstein Zone matrix, instead, we end up with something quite different. We get three most-parsimonious tree (MPT) solutions when eliminating the outgroup genus O or its fossil Z; and eliminating the genera A and B and their fossils C and D, respectively, each leads to a single MPT. This yields a total of 10 MPTs.

First row rooted with Z, all other trees mid-point rooted. All trees have the same scale.

By pruning the long-branching genera A or B, even parsimony analysis gets the correct tree because we have eliminated the source of the long-branch attraction. Adding fossils to break down long branches can be effective (classic paper: Wiens 2005), but dropping long-branching tip taxa works just as well. Changing between a close outgroup (fossil Z) and a distant outgroup (genus O) has little benefit here.

In this case, the resulting SuperNetwork of our 10 MPTs is not a tree but a network including alternative clades, wrong ones (orange), ie. not monophyletic, and correct ones (green) — ie. branches (internodes, bipartitions) reflecting the monophyletic lineages of the true tree.

Comprehensive Z-closure SuperNetwork of the 10 minus-one MPT inferred based on the Felsenstein Zone matrix. The network includes all split patterns found in the MPT sample.

A real world example

To give an example of how sequentially dropping one taxon works with real-world data, we'll use the exhaustive 700 character matrix for bird-related dinosaurs provided by Hartman et al. (2019).

With its total of 501 taxa (OTUs), the apparent rationale behind the matrix is that, by including as many taxa as possible, one gets the best-possible (parsimony) trees, irrespective of the signal quality provided by individual OTUs. However, the full matrix cannot be forced into a single-optimal parsimony tree, due to missing data (72% of the matrix' cells are undefined or ambiguous, ie. 255969 cells) and a scarcity of synapomorphies (in a Hennigian sense) — this is discussed in Hartman et al.; see also the related Q&A.

Here, in light of the computational effort and to avoid heuristics when searching the MPTs, we'll use a pruned sub-matrix. For our first experiment, we take 15 out of the 19 best-covered OTUs. Thus, OTU pairs / triplets that are much more similar to each other than to any other OTU, are reduced to the best-covered representative.

The 19-taxon matrix that I used in a previous post (Large morphomatrices – trivial signal) had only one most-parsimonious tree solution, showing only clades in agreement with current opinion, which assumes a largely staircase-like evolution from dinosaurs to modern birds (Tree of Life). In contrast to the full matrix, the 19-taxon matrix provided high support for most clades (method-independent), reflecting the number of scored traits. The extant taxa, representatives of modern birds (duck, turkey and ostrich, all edible), have many derived characters, with the extinct bird genus Lithornis being placed in-between ostrich and duck + turkey.

The optimal topologies for the 19 best-covered taxon matrix. Green, the single most-parsimonious tree. Clade names copied from Wikipedia/Tree of Life.

The ML and NJ/LS (except for one branch) trees were topologically identical; each branch is supported by about 100 inferred changes. The signal from the matrix should be straightforward.

The tree-size weighted mean (default in SplitsTree) SuperNetwork, summarizing the result of an exhaustive branch-and-bound search using the 15-dropped-1-taxon matrices (each one resulting in a single optimal MPT) has a tree-like structure.

Allosaurus-rooted SuperNetwork of the 15 minus-one MPTs. Green – clades also found in the all-inclusive tree representing monophyla; orange – conflicting clades, blue – the all-inclusive tree doesn't resolve the assumed monophyly of modern birds, but places Lithornis as sister to Neognathae.

Conflicting clades are found in only two of the 15 inferred MPTs, being represented by short branches (their length in the other 14 trees is counted as zero).

Nonetheless, these conflicts received considerable character support. The frequency of a split in the minus-1 tree sample is irrelevant (see the A-B LBA problem discussed above — any tree including A and B showed the wrong clade). When summarizing our tree sample (especially when using MPTs), we should hence opt for a SuperNetwork, in which the edge lengths give the minimum branch lengths found in the MPT collection, ie. the edge length reflects the minimum length of the branch in all trees showing that branch.

Same SuperNetwork as above, but using the "Min" option instead of the default setting for computing edge lengths.
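To see why the choice of edge weight matters, here is a small sketch contrasting a "minimum over trees showing the split" summary with a mean in which absent splits count as zero. It is not SplitsTree's implementation: it assumes that splits have already been matched across trees and are identified by simple labels, it ignores tree-size weighting, and the branch lengths are hypothetical.

```python
def summarize_split_lengths(trees):
    """Summarize branch lengths per split.

    `trees` is a list of dicts, each mapping a split label to the length of
    the corresponding branch in that tree.
    """
    all_splits = set().union(*trees)
    summary = {}
    for split in sorted(all_splits):
        observed = [tree[split] for tree in trees if split in tree]
        summary[split] = {
            # shortest branch among the trees that actually show the split
            "min": min(observed),
            # mean over all trees, counting an absent split as length zero
            "mean_all": sum(observed) / len(trees),
        }
    return summary


# Hypothetical sample: 15 trees, a conflicting split present in only two.
trees = [{"backbone": 10.0} for _ in range(13)]
trees += [{"backbone": 9.0, "conflict": 6.0}, {"backbone": 11.0, "conflict": 8.0}]
summary = summarize_split_lengths(trees)
print(summary["conflict"])
```

In this toy sample the rare split shrinks to less than one unit under the mean, but keeps a length of six under the minimum, which is why it only becomes prominent in the second graph.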

Without Dromiceiomimus – representing an earlier diverged lineage and step in bird evolution – the Dromaeosauridae clade, which is probably monophyletic (Wikipedia), flips and dissolves into a grade. By removing the intermediate step, we seem to create some ingroup-outgroup (long-branch) attraction.

Anas, the duck, forms the morphological link to Lithornis – with a mean morphological pairwise Hamming distance (MD) of 0.23, Anas is the most-similar OTU; and, hence, the MPT places Lithornis as sister to Anas + Meleagris (turkey; MD = 0.17). By eliminating Anas, the remaining contemporary birds form a clade — the modern birds (Neornithes) are assumed to be monophyletic but do not form a clade in the all-inclusive MPT (Struthio, the ostrich, is morphologically more distant from duck, turkey and Lithornis).
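For readers who want to compute such distances themselves, the following is a minimal sketch of a mean pairwise Hamming distance between two rows of a morphological matrix. It assumes that "?" and "-" mark missing or inapplicable cells and that these are skipped; how polymorphic or ambiguous states are treated is not covered here, so the values will not necessarily match those quoted above.

```python
MISSING = {"?", "-"}  # assumed symbols for undefined cells


def mean_hamming(row_a, row_b):
    """Proportion of differing states among characters scored in both OTUs."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of characters")
    scored = [
        (x, y)
        for x, y in zip(row_a, row_b)
        if x not in MISSING and y not in MISSING
    ]
    if not scored:
        return float("nan")  # no shared scored characters, distance undefined
    return sum(x != y for x, y in scored) / len(scored)


# Two made-up rows: five characters are scored in both, two of them differ.
print(mean_hamming("0110?1", "0100-0"))
```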

Conclusion

Even the most comprehensive, least gappy of paleophylogenetic matrices have substantial signal issues. If a tree inference is dependent on which OTUs are sampled, we cannot assume that we will automatically get better trees simply by including everything we have. Some OTUs (in our experiment: Dromiceiomimus) will stabilize correct aspects of a tree, while others will manifest bias or error (here: Anas). It's unlikely that a wrong, ie. not monophyletic, clade created by the attraction of two well-sampled taxa can be broken down by adding numerous taxa showing only a fraction of defined characters. SuperNetworks of minus-one trees can point you to the critical OTUs and unstable branching patterns of your (backbone) phylogeny.

PS. Personally, I would analyze a matrix with these properties, and a taxon sample spanning more than 150 myrs of evolution (from Allosaurus to modern birds), using ML not MP. I used MP in this post only because paleontologists are still very fond of it (not a few still discard anything else as unfit for their data). ML is less prone to long-branch attraction, results in a single tree (easier to compare when using larger taxon samples), and is speedy these days, allowing for more in-depth experiments towards the end of the exploratory data analysis. Both IQ-Tree (homepage; includes links to online servers) and RAxML-NG (open access paper providing essential links / github; implemented on various online servers) can quickly infer ML trees and establish branch support (including but not restricted to nonparametric bootstrapping) using binary and multistate data.

Walk-through for computing Z-closure SuperNetworks (Huson et al. 2004) in SplitsTree (v. 4, since v. 5 is still not fully functional):

Make sure the tree sample for reading is in Newick format, including branch-length information. The trees can be in a single file or multiple files.
Start SplitsTree.
To read in the tree sample:
- File > Open, if your trees are in one file;
- File > Tools > Load multiple trees, if your trees (eg. minus-1 MPTs) are in different files.
Go to Networks > SuperNetwork. Choose "Min" for "Edge Weight" in the pop-up analysis window for the first graph. You can also try out "Mean"/"Sum" (short, rare alternatives will be less prominent), "AverageRelative" (trade-off) or "None" (branch-lengths in the minus-one tree sample are ignored). When using simple tree samples (little topological variation, matrix with fairly stringent signals), a single run (default) suffices. Increasing the number (eg. to 100) ensures no branching pattern in the minus-one tree sample gets lost. For instance, for the Felsenstein Zone matrix, a single run will give you a SuperNetwork capturing the major conflicting aspects, while 100 runs will lead to a higher dimensional graph that includes the correct BD and AC clades as alternatives. If you like to view the overall best-fitting tree instead of a network, tick "SuperTree".

Cited papers

Hartman S, Mortimer M, Wahl WR, Lomax DR, Lippincott J, Lovelace DM (2019) A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.

Huson DH, Dezulian T, Kloepper T, Steel MA (2004) Phylogenetic super-networks from partial trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 151–158.

Wiens JJ (2005) Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Systematic Biology 54: 731–742.

Coronavirus statistics are (almost) all misleading

2020-08-03T00:30:00.000+02:00

There are plenty of places on the internet where we can access statistics about the current Covid-19 pandemic, caused by the rapid global spread of the SARS-CoV-2 virus — notably Johns Hopkins University (formally described here), and Worldometer. These are compilations of official government statistics, comparing different countries, or states within a country. These are potentially interesting, because we can see how things are progressing in our own location, and compare it to other places. If nothing else, this might inform our own actions for protecting ourselves.

The basic problem is that these data are often not comparable between jurisdictions, in the sense that they will have been collected in different ways and with different degrees of success. For example, consider these two recent articles about the country that is very likely to end up being the worst hit:

The second one contains this quote that sums up the issue: "India is the third-worst hit country in the world, but there are concerns a lack of testing could mean the true figure is far higher." Government organizations usually do their best to collate their local data, but their relative success in a situation like this will vary from "okay" to "abysmal". We cannot really know where any given dataset fits into that continuum, and this profoundly affects how we interpret the data.

Data must be comparable if we are to compare them. This is an obvious truism, especially in science; but achieving comparability is often very difficult in practice, and scientists spend much of their time trying to achieve it in their own work. I would hate to be the person delegated the job of summarizing this pandemic globally, because they will really be up against the wall. But someone will have a go at it, believe me, and I wish them every success.

In this post, I summarize the main data-collecting issues, as they are currently understood. The two main statistics reported are the number of infection cases and the number of resulting deaths, which have separate issues.

Case numbers

Deciding whether a particular person is a Covid-19 case is not straightforward. Three main criteria have been used to date:

disease symptoms (which are similar to influenza)
detection of a viral genome in the body (meaning the person currently has the virus)
detection of virus antibodies in the body (meaning the person has previously had the virus).

These three criteria will yield different estimates of the number of cases.

Since the virus seems to have originated in China, the Chinese were the first to officially count cases. They started by including only those people who had been tested for the virus itself (after they showed symptoms), but soon realized that this caused a delay before these people received medical treatment. So, the official data show a massive spike in case numbers, when the authorities switched to using symptoms alone to count cases. You can see in this graph (from Worldometer) which day that was.

Using symptoms alone presumably over-estimates the number of cases, because of the similarity of coronavirus symptoms to those resulting from influenza viruses. Clearly, symptoms need to be confirmed by a direct test for each particular type of virus.

However, without a concerted testing effort for SARS-CoV-2, the number of cases will be under-estimated, probably by a large margin. We now know that many people show few or no symptoms of this coronavirus, and will therefore not be detected if we test only those people with explicit symptoms, and who visit a testing center. Some countries have made massive testing efforts, relative to their population size, while many other countries have been much less active. This table shows the top data from Worldometer, counted as the number of tests per million people.

Clearly, the more of your population you test, the more likely you are to correctly detect all of your cases. The effect of this can be seen in this next Worldometer graph, for Sweden. The apparent burst in cases after June 5 was due to the government finally implementing large-scale virus testing, which naturally increases the detection rate for this type of situation. That is, the data were greatly under-estimated before June 5, and the official data were corrected during June, by catching up with many of the as-yet-undetected cases. This increased testing has continued, which means that the drop in cases during July is cause for optimism, as in any situation where you search for something bad and don't find it. Nevertheless, these tests cover only 8% of the population, to date, and so even now the data may still (theoretically) be under-estimates.

So, between-country comparisons are misleading, unless the same amount of virus testing has been conducted. This is the point I made about India, above, where testing is a real challenge given the size of the population. Those of you in the USA might like to contemplate just how many cases you really have — your officials have conducted more tests than anyone else except China, but you still have covered only 17% of your population (the table above is cut off at 30% coverage).

Alternatively, antibody testing is a good way to detect people who have had the virus without knowing it, since this studies their body's reaction to the virus rather than looking for the virus itself. As this sort of testing proceeds around the world, the number of official cases will continue to increase. However, the number of false positives and false negatives of the antibody tests means that even they are not entirely reliable (see False positive and false negative coronavirus test results explained). Indeed, a review article assessing the range of currently available antibody tests shows remarkable variation in their success rates (Diagnostic accuracy of serological tests for Covid-19: systematic review and meta-analysis).
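A short worked example may help to show how much false positives can matter. The sketch below applies Bayes' rule to obtain the chance that a positive antibody test reflects a real past infection; the sensitivity, specificity and prevalence values are hypothetical and are not taken from the review cited above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a person with a positive test really had the virus."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test: 90% sensitive, 95% specific, 5% of people truly exposed.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.95, prevalence=0.05)
print(f"{ppv:.2f}")
```

Under these assumed values, about half of the positive results would be false, even though the test itself looks quite good on paper.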

A final point, which has been very obvious here in Sweden, is just how long a person is considered to be a Covid-19 case. As far as Sweden is concerned, there were apparently a lot of "active cases" early in the pandemic. However, what was happening was that most other jurisdictions were declaring cases as "recovered" after the person's symptoms receded, which takes about 7 days, and those cases were then removed from the official list. On the other hand, Sweden did not officially declare a case recovered until the person was completely free of the virus, which takes about 5 weeks. So, Sweden's reported number of active cases remained much higher than for most other places, for a much longer time. The number of Swedish cases was actively criticized by the foreign media, but the cause was never mentioned: the data were not comparable to elsewhere.

Similarly, the reporting of cases is obviously not equal throughout any given week, so that daily reports are unreliable — there are obvious weekly cycles in almost all of the national datasets, with fewer reported cases or deaths on Saturdays and Sundays. The same thing applies to regional (geographic) patterns, of course. For example, both Spain and the United Kingdom have noted that their current outbreaks are all regional, with the majority of their countries being much less affected.
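One common way to deal with the weekly reporting cycle is to look at a 7-day moving average rather than the raw daily counts. Here is a minimal sketch using made-up numbers; the function name and the data are for illustration only.

```python
def rolling_mean(values, window=7):
    """Trailing moving average, one value per complete window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Two invented weeks with the same weekday level and a weekend reporting dip.
daily = [100, 100, 100, 100, 100, 40, 40] * 2
smoothed = rolling_mean(daily)
print([round(x, 1) for x in smoothed])
```

Because every 7-day window contains exactly one weekend, the smoothed series is flat here, which is what we would want for a situation in which nothing is really changing from week to week.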

Number of deaths

This brings us to a consideration of counting deaths due to Covid-19. We all know what death is, but it is not so easy to assign a particular cause to any particular death. A death certificate signed by a professional medical practitioner will assign an official "cause of death", and possibly list other "contributing factors". So, when does a death count as a coronavirus death?

The simplest solution is to say that any dead person who has a virus genome in their body counts; and it is clear that some of the statistics around the world have counted Covid-19 deaths this way. Unfortunately, as has been pointed out ironically, this counts people who are carrying the virus when they get run over by a car; and this may not be what most people mean when referring to "a coronavirus death".

Just as importantly, some jurisdictions have clearly tested, and thus counted, only those people who died in hospital. Similarly, there are clear differences in counting due to social circumstances, especially in countries with large poor communities. These factors will under-estimate the actual death rate.

The main issue, however, is that most of the people severely affected by this new virus are elderly persons with pre-existing medical conditions. For example, 7.3% of the reported Covid-19 cases in Sweden have resulted in death, to date, but 89.1% of those deaths have been in the 70+ age group. This is a bit more extreme than elsewhere, as early on in the pandemic the virus got into several aged-care facilities in Sweden. In most of these cases, the SARS-CoV-2 virus was simply one thing too many, for people whose health was already declining — this is called co-morbidity (the presence of one or more additional conditions co-occurring with a primary medical condition).

So, where is the border between a main cause and a subsidiary factor? The answer to this question clearly differs around the world; and this makes the officially reported death data non-comparable. Some data will be over-estimates and some will be under-estimates, compared to some global standard definition. So, what does the following graph, from Worldometer, really tell us?

The generally accepted solution to this conundrum is to consider what is called excess mortality, which assumes that there has been a temporary change in the number of deaths during some specified period of time. That is, we do not assign deaths to particular causes, but simply compare the total number of deaths now to the total number of deaths in previous years. The difference can be attributed directly or indirectly to the current circumstances. This is not perfect, but it is the best we have got.

So, we should compare the number of deaths during the current pandemic period with some estimate of a baseline number of deaths under more normal circumstances. The baseline is commonly taken as the equivalent data from the immediately preceding 3–5 years, or so — how many more people have died during the pandemic, compared to the average deaths during the same months of prior years?
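The calculation itself is simple. The sketch below takes the mean of the same period in previous years as the baseline, which is only one of several possible baseline definitions; the death counts are invented for illustration.

```python
def excess_mortality(current, previous_years):
    """Return (baseline, excess deaths, excess as a percentage of baseline)."""
    baseline = sum(previous_years) / len(previous_years)
    excess = current - baseline
    return baseline, excess, 100 * excess / baseline


# Invented counts for the same months in five prior years and in the current year.
baseline, excess, percent = excess_mortality(1200, [1000, 1050, 950, 1020, 980])
print(baseline, excess, percent)
```

Note that nothing here requires a cause of death to be assigned, which is exactly the attraction of the approach.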

The U.S. Centers for Disease Control and Prevention has a compilation of these data for the states of the USA, updated daily: Excess deaths associated with COVID-19. The data are still provisional, but it would be nice to think that they are directly comparable. Whether the data are actually meaningful for the current pandemic is a point I discuss at the end of this post.

Similarly, the EuroMOMO collaborative network is supported by the European Centre for Disease Prevention and Control, and provides weekly data for public health threats in 24 European countries. If you look at their graphs, you can see the age-related effects of seasonal flu in every winter since 2016, as well as the magnitude of the current pandemic. Here is a graph of their current data, pooled across all age groups and countries. Roughly speaking, deaths are 80% greater than in previous years.

Elsewhere in the world, data are a bit more scarce. The principal problem is lack of suitable prior data — not everywhere on the planet has accurate estimates of the local death rate, for some combination of social, economic or political reasons. Nevertheless, we have data for all of the expected places; and some of the groups who are collating the excess mortality data for the current pandemic are listed by the Our World in Data site: Excess mortality from the coronavirus pandemic (COVID-19).

These groups include three newspapers, each of which is covering the current pandemic across c. 10 countries:

All three of these make their compiled data publicly available on GitHub.

Conclusion and final point

The world is a complex place, and biology is one of the most complex parts of it. Do not over-interpret simplistic data, no matter how prettily it is presented. In particular, for data to be meaningful, all parts of it need to be directly comparable; otherwise the conclusions are likely to be wonky.

Sadly, as a final point to emphasize the issues, I will note that the USA itself apparently has rather big practical problems, as discussed in: Covid-19 data in the US is an ‘information catastrophe’. According to this media report, there are serious problems with the hospitalization data:

Covid-19 data in the US — in fact, almost all public health data — is chaotic: not one pipe, but a tangle ... Every health system, every public health department, every jurisdiction really has their own ways of going about things ... It's very difficult to get an accurate and timely and geographically resolved picture of what's happening in the US, because there's such a jumble of data.

The issue seems to be the National Healthcare Safety Network, as used by the Centers for Disease Control and Prevention, which is responsible for collating the data nationally. The Department of Health and Human Services has now taken over direct responsibility for data concerning Covid-19 infections in hospitalized patients, much to the dismay of many people.

Automated detection of rhymes in texts (From rhymes to networks 4)

2020-07-27T00:30:00.000+02:00

Having discussed how to annotate rhymes in last month's blog post, we can now discuss the automated detection of rhymes. I am fascinated by this topic, although I have not managed to find a proper approach yet. What fascinates me more, however, is how easily the problem is misunderstood. I have witnessed this a couple of times in discussions with colleagues. When mentioning my wish to create a magic algorithm that does the rhyme annotation for me, so that I no longer need to do it manually, nobody seems to agree with me that the problem is not trivial.

On the contrary, the problem seems to be so easy that it should have been solved already a couple of years ago. One typical answer is that I should just turn to artificial intelligence and neural networks, whatever this means in concrete terms, and that they would certainly outperform any algorithm that was proposed in the past. Another typical answer, which is slightly more subtle, assumes that some kind of phonetic comparison should easily reveal what we are dealing with.

Unfortunately, none of these approaches work. So, instead of presenting a magic algorithm that works, I will use this post to try and explain why I think that the problem of rhyme detection is far less trivial than people seem to think.

Defining the problem of automated rhyme detection

Before we can discuss potential solutions to rhyme detection, we need to define the problem. If we think of a rhyme annotation model that allows us to annotate rhymes at the level of specific word parts (not restricted to entire words), the most general rhyme detection problem can be presented as follows:

Given a rhyme corpus that is divided into poems, with poems divided into stanzas, and stanzas being divided into lines, find all of the word parts that clearly rhyme with each other within each stanza within each poem within the corpus.

With respect to machine learning strategies, we can further distinguish supervised versus unsupervised learning. While supervised learning for the rhyme detection problem would build on a large annotated rhyme corpus, in order to infer the best strategies to identify words that rhyme and words that do not rhyme, unsupervised approaches would not require any training data at all.

With respect to the application target, we should further specify whether we want our approach to work for a multilingual sample or just a single language. If we want the method to work on a truly multilingual (that is: cross-linguistic) basis, we would probably need to require a unified transcription for speech sounds as input. It is already obvious that, although the annotation schema I presented last month is quite general, it would not work for those languages with writing systems that are not spelled from left to right, for example, not to speak of writing systems that are not alphabetic.

Why rhyme detection is difficult

It is obvious that the most general problem for rhyme detection would be the cross-linguistic unsupervised detection of rhymes within a corpus of poetry. Developing systems for monolingual rhyme detection seems to be a bit trivial, given that one could just assemble a big list of words that rhyme in a given language, and then find where they occur in a given corpus. However, given that the goal of poetry is also to avoid "boring" rhymes, and come up with creative surprises, it may turn out to be less trivial than it seems at first sight.

As an example, consider the following refrain from a recent hip-hop song by German comedian Carolin Kebekus, in which the text rhymes Gemeinden (community) with vereinen (unite), as well as Mädchen (girl) with Päpstin (female pope) (the video has English subtitles for those who are interested in the text but do not speak German).

Figure 1: Rhyme example from a recent German hip-hop song.

While one could argue whether those words qualify as proper rhymes and were intended as such, I am quite convinced that the words were chosen for their near-rhyme similarity, and I am also convinced that most native speakers of German listening to the song will understand the intended rhyme here. Both rhymes are not perfect, but they are close enough, and they are beyond doubt creative and unexpected — it is extremely unlikely that one could find them in any German rhyme book. This example shows that humans' creative treatment of language keeps constantly searching for similarities that have not been used before by others. This leads to a situation where we cannot simply use a static look-up table of licensed rhyme words, to solve the problem of rhyme detection for a particular language.

What we instead need is some way to estimate the phonetic similarity of word parts, in order to check whether they could rhyme or not. However, since languages may have different rhyme rules, these similarities would have to be adjusted for each language. While phonetic similarity can be measured fairly well with the help of alignment algorithms applied to phonetic transcriptions, what counts as being similar may differ from language to language, and rhyme usually reflects local similarity of words.
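To make the idea of local similarity a bit more concrete, here is a deliberately crude sketch that compares only the final segments of two words using edit distance. It is not a proposed solution: it uses spelling as a stand-in for a phonetic transcription, it treats all symbol mismatches as equally bad, and the tail length of four symbols is an arbitrary assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def tail_similarity(a, b, tail=4):
    """Similarity (0 to 1) of the last `tail` symbols of two words."""
    tail_a, tail_b = a[-tail:], b[-tail:]
    longest = max(len(tail_a), len(tail_b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(tail_a, tail_b) / longest


for pair in [("gemeinden", "vereinen"), ("mädchen", "päpstin")]:
    print(pair, tail_similarity(*pair))
```

A usable method would need sound-class-based scores, language-specific adjustments and information on stress, which is exactly where things become difficult.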

Since rhyme is closely accompanied by rhythm and word or phrase stress, we would also need this information to be supplied from the original transcriptions. All in all, working on a general method for rhyme detection seems like a hell of an enterprise, specifically whilever we lack any datasets that we could use for testing and training.

Less interesting sub-problems and proposed solutions

While, to the best of my knowledge, nobody has ever tried to propose a solution for the general problem of rhyme detection as I outlined it above, there are some studies in which a sub-problem of rhyme detection has been tackled. This sub-problem can be presented as follows:

Given a rhyme corpus of poems that are divided into stanzas, which are themselves divided into lines, try to find the rhyme schemas underlying each stanza.

This problem, which has often been called rhyme scheme discovery, has been addressed using at least three approaches that I have been able to find. Reddy and Knight (2011) employ basic assumptions about the repetition of rhyme pairs in order to create an unsupervised method based on expectation maximization. Addanki and Wu (2013) test the usefulness of Hidden Markov Models for unsupervised rhyme scheme detection. Haider and Kuhn (2018) use Siamese Recurrent Networks for a supervised approach to the same problem. Additionally, Plechač (2018) proposes a modification of the algorithm by Reddy and Knight, and tests it on three languages (English, Czech, and French).

One could go into the details, and discuss the advantages and disadvantages of these approaches. However, in my opinion it is much more important to emphasize the fundamental difference between the task of rhyme scheme detection and the problem of general rhyme detection, as I have outlined it above. Rhyme scheme detection does not seek to explain rhyme in terms of partial word similarity, but rather assumes that a general overarching structure (in terms of rhyme schemas) underlies all kinds of rhymed poetry.

There are immediate consequences to assuming that rhymed poetry needs to be organized by rhyme schemes. First, the underlying model does not accept rhymes that occur in any other place than the end of a given line, which is problematic, specifically when dealing with more recent genres like hip-hop. Second, if one assumes that rhyme scheme structure dominates rhymed poetry, the model does not accept any immediate, more spontaneous forms of rhyming, which, however, frequently occur in human language (compare the famous examples in political speech, discussed by Jakobson 1958).

Concentrating on rhyme schemes, instead of rhyme word detection, has immediate consequences for the algorithms. First, the methods need to be applied to "normal" poetry, given that any form of poetry that evades the strict dominance of rhyme schemes cannot be characterized properly by the underlying rhyme model. Second, all that the methods need as input are the words occurring at the end of a line, since these are the only ones that can rhyme (and the test datasets are all constructed in this way alone). Third, the methods are all trained in such a way that they need to identify rhymes in a text, so that they cannot be used to test whether a given text collection rhymes or not.
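The following toy example illustrates these consequences. It is not one of the published methods, only a naive sketch of my own that looks at nothing but the spelling of the last word in each line; the function name and the two-letter suffix are arbitrary assumptions.

```python
import string


def naive_rhyme_scheme(lines, suffix_len=2):
    """Label each non-empty line by the spelling of the end of its last word."""
    labels = []
    seen = {}
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip empty lines
        last = words[-1].strip(string.punctuation).lower()
        key = last[-suffix_len:]
        if key not in seen:
            seen[key] = string.ascii_uppercase[len(seen) % 26]
        labels.append(seen[key])
    return "".join(labels)


stanza = [
    "The cat sat on the mat,",
    "watching a bird in the tree.",
    "It wore a small hat,",
    "as happy as can be.",
]
print(naive_rhyme_scheme(stanza))
```

The function sees only the line-final words, it misses the rhyme between tree and be because their spelling differs, and it will happily return a "scheme" for any prose passage you feed it.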

Outlook

In this post, I have tried to present what I consider to be the "ultimate" problem of rhyme detection, a problem that I consider to be the "general" rhyme detection problem in computational approaches to literature. In contrast, I think that the problem of detecting only rhyme schemes is much less interesting than the general rhyme detection problem. The focus on rhyme schemes, instead of focusing on the actual words that rhyme, reflects a certain lack of knowledge regarding the huge variation by which people rhyme words across different languages, cultures, styles, and epochs.

If all poetry followed the same rhyme schemes, then we would not need any rhyme detection methods at all. Think of Shakespeare's 154 sonnets, all coded in the same rhyme schema: no algorithm would be needed to detect the rhyme schema, as we already know it beforehand — for a perfect supervised method, it would be enough to pass the algorithm the line numbers and the resulting schema.
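To make the point concrete, here is a minimal sketch of how little a "method" needs when the schema is fixed in advance. It assumes the standard Shakespearean schema ABAB CDCD EFEF GG; the names `SONNET_SCHEMA`, `rhyme_label` and `rhyme_partners` are invented for this illustration.

```python
# Assumption: every sonnet follows the standard schema ABAB CDCD EFEF GG.
SONNET_SCHEMA = "ABABCDCDEFEFGG"


def rhyme_label(line_number):
    """Return the rhyme label for a 1-based line number of a 14-line sonnet."""
    if not 1 <= line_number <= len(SONNET_SCHEMA):
        raise ValueError("a sonnet in this schema has exactly 14 lines")
    return SONNET_SCHEMA[line_number - 1]


def rhyme_partners(line_number):
    """Return the other line numbers that share a rhyme with the given line."""
    label = rhyme_label(line_number)
    return [
        index + 1
        for index, other in enumerate(SONNET_SCHEMA)
        if other == label and index + 1 != line_number
    ]


# Line 1 rhymes with line 3; the closing couplet pairs lines 13 and 14.
print(rhyme_partners(1), rhyme_partners(13))
```

Nothing here looks at the words themselves, which is exactly why a fixed schema makes the detection task uninteresting.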

The picture changes, however, when working with different styles, especially those representing an emerging rather than an established tradition of poetry. Rhyme schemes in the most ancient Chinese inscriptions, for example, are far less fixed (Behr 2008). In modern hip-hop lyrics, which also represent a tradition that has only recently emerged, it does not make real sense to talk about rhyme schemes either, as can be easily seen from the following excerpt of Akhenaton's Mes soleils et mes lunes, which I have tried to annotate to the best of my knowledge.

Figure 2: First stanza from Akhenaton's Mes soleils et mes lunes

Surprisingly, both Haider and Kuhn (2018) and Addanki and Wu (2013) explicitly test their methods on hip-hop corpora. They interpret them as normal poems, extract the rhyme words, and classify them line by line. I would be curious what these methods would yield if they were fed non-rhyming text passages. For me, the ability of an algorithm to distinguish rhyming from non-rhyming texts is one of the crucial tests for its suitability. We do not need approaches that confirm what we already know.

Ultimately, we hope to find methods for rhyme detection that could actively help us to learn something about the difference between conscious rhyming and word similarities that arise by chance. But, given the huge differences in rhyming practice across languages and cultures, it is not clear if we will ever arrive at this point.

References

Addanki, Karteek and Wu, Dekai (2013) Unsupervised rhyme scheme identification in Hip Hop lyrics using Hidden Markov Models. In: Statistical Language and Speech Processing, pp. 39-50.

Behr, Wolfgang (2008) Reimende Bronzeinschriften und die Entstehung der Chinesischen Endreimdichtung. Bochum: Projekt Verlag.

Haider, Thomas and Kuhn, Jonas (2018) Supervised rhyme detection with Siamese recurrent networks. In: Proceedings of Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 81-86.

Jakobson, Roman (1958) Typological studies and their contribution to historical comparative linguistics. In: Proceedings of the Eighth International Congress of Linguistics, pp. 17-35.

Plecháč, Petr (2018) A collocation-driven method of discovering rhymes (in Czech, English, and French poetry). In: Masako Fidler and Václav Cvrček (eds.) Taming the Corpus: From Inflection and Lexis to Interpretation. Cham: Springer, pp. 79-95.

Media misunderstandings about the coronavirus in Sweden

2020-07-20T00:30:00.000+02:00

The worldwide spread of the SARS-CoV-2 virus, and the consequent Covid-19 disease pandemic, is still a topic of conversation, although it does seem that many people are sick of hearing about it. They just want to "get back to normal", without understanding that this is going to take many months, if it happens at all. There is every possibility that there will be a "new normal" from now on, and in many places the virus will be endemic.

We started off knowing little about this virus and the disease that it causes, as I have written about before (There seems to be a lot of public misunderstanding about the coronavirus); and we have slowly accumulated more and more understanding of what we should be doing in response. In particular, the future of having to live with the virus is becoming clearer, until (or if) we reach herd immunity (A new understanding of herd immunity).

Among all of this, there has been some commentary about the official response within Sweden, with some media (and the World Health Organization) claiming that the Swedes have reacted in a different and controversial manner. This is far from the truth, as I happen to know, because I now live in Sweden, although I grew up in Australia. As a resident biological scientist, I thought that I might write about the situation, in this post. There have been massive quarantine efforts here, although for cultural reasons they might look quite different to how such things are organized in the English-speaking parts of the world. [Note: Japan has also used a different strategy to most other places, but without any serious criticism, although it is now experiencing a serious "second wave".]

Many of the misleading media reports have originated in the USA, which currently has the world's biggest Covid problem. The latter may soon change, because there is every reason to expect India to surpass the US infection count, as its rate is still rapidly increasing and India has a much larger population. I hope to be wrong on this matter, but it will be very hard to contain spread among the masses of poor people in that country. Maybe their saving grace will be the fact that the majority of their population is younger than 40 years old, so that the death rate will be contained.

Anyway, we have had US media reports about Sweden such as these:

The latter article contains this quote:

At one end of the spectrum, Sweden chose to forgo severe restrictions on public life and its economy and opt to let the virus spread through its population while shielding the most vulnerable groups.

Both pieces of information here are wrong. Sweden has not allowed the virus to spread, but has instead instituted quarantine measures; and it has failed miserably in its efforts to protect the prime vulnerable group: the elderly.

Virus spread in Sweden

Let's start by looking at the actual data. Here is a table of the current officially reported number of SARS-CoV-2 cases as of July 18 (as collated on the Worldometer web site). Note that the information we are interested in is the case rate (percent of population affected), not the number of cases. The number of cases is determined mainly by the population size: of course the USA has more cases than Sweden, for example, because there are 330 million Americans and only 10 million Swedes.

As you can see, the case rate in the USA is 10,500 per million people, whereas in Sweden it is only about three-quarters of this, at 7,500 per million. So, who is doing better at containing the spread of the virus? Mind you, within Europe, only Armenia and Luxembourg have higher reported rates, along with tiny places like San Marino, Andorra and the Vatican City (where even a few cases can create apparently large rates, due to the small sample size).
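The arithmetic behind this comparison can be sketched in a few lines. The rates and populations are the rounded figures quoted in the text; the implied case counts are back-of-envelope values derived from those rounded figures, not reported counts, and the function name `cases_per_million` is my own.

```python
def cases_per_million(cases, population):
    """Reported cases per million inhabitants."""
    return cases / population * 1_000_000


# Rounded rates quoted in the text (cases per million people).
usa_rate = 10_500
sweden_rate = 7_500

# Sweden's rate relative to the USA: roughly 0.71.
ratio = sweden_rate / usa_rate

# Case counts implied by the rounded rates and populations (in millions).
# These are illustrative approximations only, not official figures.
usa_cases = usa_rate * 330
sweden_cases = sweden_rate * 10

print(round(ratio, 2), usa_cases, sweden_cases)
```

The raw counts differ by a factor of about 46, while the rates differ by a factor of only about 1.4, which is why the rate is the number to compare.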

Moreover, the number of new cases per day in Sweden is now as low as at any time since mid March, as shown in this next graph (also from Worldometer). The apparent burst in cases after June 5 was due to the government finally implementing large-scale virus testing, which always increases the detection rate for this type of situation. The subsequent decrease in cases suggests that Sweden may well be moving towards herd immunity, which is required for long-term epidemic control. This week's report from Folkhälsomyndigheten (the Public Health Agency) shows a continued decrease in the proportion of positive tests, despite a continued high level of sampling.

The Swedish situation contrasts with the current situation in the USA, where the number of new cases is higher than at any previous time, being double what it was during the April-June period. This is, at least in part, due to a massive sampling effort now under way, which, as I noted above, will increase the case detections.

The same trend can be seen in the number of new daily deaths in Sweden — it is now as low as at any time since mid March. The number of US deaths, on the other hand, has surged this month (although it is still less than a half of what it was back in April). Sweden may be a cautionary tale, perhaps; but the criticism sounds more like sour grapes, to me, from the media of a country that has clearly handled this pandemic worse than anyone else.

It is important to mention a point of difference, as it has become increasingly obvious that different jurisdictions have compiled coronavirus cases differently, even within the European Union. As far as Sweden is concerned, there were apparently a lot of "active cases" early in the pandemic. However, what was happening was that most places were declaring cases as "recovered" after the person's symptoms receded, which takes about 7 days. On the other hand, Sweden did not officially declare a case recovered until the person was completely virus free, which takes about 5 weeks. So, Sweden's reported number of active cases remained much higher than most other places, for a much longer time, which may have generated a lot of the negative media publicity. This situation no longer applies, because the number of cases is much lower now.
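The effect of the two recovery definitions can be illustrated with a toy calculation. This is only a sketch: the epidemic curve is hypothetical, deaths are ignored, and every case is assumed to stay "active" for exactly the stated number of days (7 days versus 35 days, that is, 5 weeks).

```python
def active_cases(daily_new, recovery_days):
    """Active cases per day, if each case counts as active for recovery_days."""
    active = []
    for day in range(len(daily_new)):
        start = max(0, day - recovery_days + 1)
        active.append(sum(daily_new[start:day + 1]))
    return active


# Hypothetical curve: 100 new cases a day for 10 days, then none.
daily_new = [100] * 10 + [0] * 40

short_rule = active_cases(daily_new, recovery_days=7)   # symptoms receded
long_rule = active_cases(daily_new, recovery_days=35)   # virus free

# On day 20 (index 19) no case is active under the 7-day rule,
# but all 1000 cases are still active under the 35-day rule.
print(short_rule[19], long_rule[19])
```

The underlying epidemic is identical in both series; only the bookkeeping differs.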

I would hate to be the person who has to officially compile the worldwide data on this pandemic. Even the decision about what constitutes a "Covid death" differs between countries, with some jurisdictions including all people who test positive for the virus, irrespective of what they die of, and others counting only those cases where the virus is the main cause of death (eg. a cytokine storm). Trying to make the worldwide data comparable will not be easy.

Quarantine in Sweden

So, what has been different about Sweden? It is simply that the national government expects Swedes to take official advice when they are given it, without being forced to do so. In most cases, this actually works, although there will always be exceptions. In the case of this pandemic, the government simply gave everyone the same advice as everyone else in the world was forced to take. It really is as simple as that.

Where I spent the first two-thirds of my life, in Australia, such an approach would be laughable, because Australians do not respect their governments, state or national. So, without a police-enforced mandatory shut-down, the virus would have spread unchecked. You may have seen the media pictures of Sydney people jammed onto beaches when they were told not to go to work (Famous Sydney beaches closed after crowds flout coronavirus restrictions); and you may have read about the complete failure in the Melbourne hotel used for quarantining international arrivals (Breaches of hotel quarantine 'let Victorians down', Minister says as inquiry launched). There is nothing unexpected about this, even if I say this as an Australian citizen.

In contrast, Sweden's island summer-holiday destinations have had among the lowest infection rates in the whole country — Öland 0.3%, Gotland 0.3%, compared to a national total of 0.8%. I am not claiming that Swedes are more sensible than anyone else (or less!), merely that they take official advice without being forced. This may seem odd to you, perhaps, but it is true, as I can attest from living here for the past one-third of my life. Swedes are quite proud of being different in this way. Indeed, to a Swede, a government-enforced lock-down would probably have worked a great deal worse than the official (advisory) approach chosen.

So, businesses were told to have their employees work from home, and those that can do so have been implementing this. The recommendation remains in force until the end of the year, notably to reduce problems with public transport. Of direct effect on me, universities all immediately instituted online classes (instead of face-to-face), and this remains in force — Uppsala University is a pretty quiet place, these days. In a similar manner, senior high schools have had their students working from home (they are on summer holidays now, of course) — secondary schools are at risk of being important sources of infection (see Contact tracing during coronavirus disease outbreak, South Korea, 2020).

On the other hand, of greatest surprise to me, it was decided to keep the junior (primary) schools operating normally. This has turned out very well, because there have been no reports of any students bringing Covid-19 home to their families. It is now accepted that young children are not usually infectious, contrary to the common belief at the beginning of the pandemic (Children are not COVID-19 super spreaders: time to go back to school). This is one thing that Sweden apparently got right, contrary to actions in most other places in the world — disrupting the lives of young people is not a good thing.

In other quarantine actions, many places will now deliver your shopping order to your car, so you don't have to enter shops; and all open locations have signs about social distancing, and 1.5-meter (5 foot) marks on the floor. All public-access places have perspex screens between the service-provider and customer, and between customers; hand-washes are freely available; and cleaning services are now more strict and frequent. Most eating places serve customers outdoors only. We have been advised not to meet in groups, except outdoors, and even then there should be fewer than 50 people. All professional sporting activities have been postponed, along with other group activities, such as garden viewings (eg. Öppen Trädgård 2020 inställt).

My local supermarket now opens one hour earlier on week-days, specifically for people in high-risk groups (such as myself) during that extra time. The accompanying sign is typically Swedish, in that it points out the purpose of opening early, and asks for co-operation from other customers, but also says that this will not be formally policed. As expected from Swedes, when my wife and I go there, almost all of the people are elderly, indicating that the others are, indeed, co-operating (or perhaps do not want to get up early).

Be realistic: would this type of voluntary approach actually work in your country? The only report of a major breach of quarantine was a party held to celebrate graduating from high school. The government recommended that these parties be avoided this year, much to the disappointment of the students, as this is always a big event. One group of c. 200 people ignored this advice, and thereby spread the virus among more than 40 people (Coronasmitta spreds på stor studentfest). All countries have idiots.

There are practical problems to all of this, of course, just like in those places with full lock-downs. A personal one for me was the loss of my non-pension income. I used to help a Swedish academic with his English, but we have not met since the arrival of the virus in Sweden. I doubt that these meetings will ever resume, post-Covid.

Also, all travel has been restricted, which resulted in the cancellation of our long-planned trip to northern Sweden and Norway. All countries in Europe officially closed their borders for a few months. Within Sweden, typically, given what I have said above, we were not actually prevented from traveling, but were instead told that if we get sick we will have to be medically treated within our home county, which dissuaded everyone from going very far.

This has all changed in the past week. The ferries to Germany are now open; and it is summer holidays. This seems to have encouraged Swedes to come out of quarantine, and get on the move. This past weekend, it has become clear that relatives are visiting each other again (they are out cycling in family groups on my country roads, for example); and I have seen more caravans and campervans on the highways than I have at any time since last summer. Apparently, the summer destinations have started filling up with tourists, so this will be the test of how far Sweden has come (Tusentals turister trängs på Gotlands gator).

As a final discussion point, I will mention that I actually live just outside of town, in a small community in the countryside. So, social distancing is not a practical problem for me, unless I go into town. In my local area, there have been 24 confirmed cases out of 3,007 people, which is an infection rate of 0.8%, which is the same rate as for Sweden as a whole.

However, this introduces the issue of the non-randomness of cases, which are quite definitely clustered (A fraction of European regions account for a majority of covid deaths). Within Sweden, for example, Stockholm, as by far the largest city, has the highest death rate, as I will discuss below. So, the risks associated with infection depend very much on where you live. Sweden may have a small population, but its area is quite large, and spatial diversity is a real factor, just as it is in larger countries.

It is therefore a pity that all decisions within the European Union regarding the pandemic are made at the national level. A pandemic requires communal action, because any individual action can threaten the safety of the group as a whole. It has apparently been one of the biggest "riddles" that the Buddhist countries of South-East Asia (Cambodia, Laos, Myanmar, Thailand, Vietnam) have been almost completely untouched by the pandemic that has spread to every other part of the globe (Why has the pandemic spared the Buddhist parts of South-East Asia?); but anyone who has ever watched the co-operative way in which these communities function will not be surprised in the slightest.

It has therefore been the biggest disappointment that the European Union has been surprisingly non-united in its responses. At the moment, some countries are now open to visitors from some other countries, while residents of yet other countries are currently banned. None of this seems to be based on the actual case-rate data, but is much more to do with politics and how much money might be made during the summer holiday season. Greece, for example, is open to the British but not to Swedes, while Croatia is open to both. Needless to say, Croatia (and neighboring Montenegro) have had massive surges in cases in the past few weeks, since they are open to most holiday-makers, having had relatively few cases before — it is now no safer to be there than in much of Sweden.

[Aside: My wife and I came back from a holiday in Croatia on the same day that the main influx of the virus arrived in Sweden from northern Italy, where it was acquired by Swedes who had taken the school break week to go downhill skiing. The other large source in Scandinavia was via those people who had gone to Austria for the same purpose.]

Protecting the elderly

This brings us to the biggest point of criticism within Sweden itself. This pandemic has highlighted very strongly just how badly elderly people are treated in this country. Put simply, I would never live in an aged-care home here, even if they were paying me, rather than the other way around.

First, let's look at the current data on age-related Covid cases in Sweden (compiled by Han Yin Lap). As you can see, 7.3% of the Covid-19 cases in Sweden have resulted in death, but 89.1% of those deaths have been in the 70+ age group. This is pretty much the same as elsewhere, sadly enough.

The problem in Sweden has been that the virus got into many of the aged-care homes long before anything was officially done about it. The government did not institute mandatory virus-testing of the staff (or even recommend it); and, as we now all know, it is the asymptomatic people who are the most dangerous in terms of spread. Furthermore, all reports (anecdotal as well as official) indicate that staff operational procedures were not modified before the middle of May, to protect either the patients or the staff (being a health-care worker is always risky: How many healthcare workers have gotten coronavirus?).

You can imagine the outcome for yourself. The worst case was in Jönköping County. This is not a densely populated place by any means, but the case rate has been 1.2% of the people, compared to the national rate of 0.8%. The virus got into a large aged-care facility, of course. The highest death rates have been in Stockholm County (0.10%) and Södermanland County (0.08%), compared to the national 0.05%, for exactly the same reason.

Closer to home, my local newspaper recently reported the data shown in the following table (Stora skillnad i hur hårt äldreboenden drabbats. Upsala Nya Tidning, Lördag 4 juli 2020, p.6). Across 979 people in 20 aged-care facilities in Uppsala County, the death rate has been 5.8%, but varied from 0% to 18%. Only two facilities have so far reported no coronavirus-related deaths.

You can see why this has been a big discussion point, as this situation is by no means unusual in the other counties, except for Västerbotten (Inte enbart en slump att Västerbotten har få döda i covid-19). Indeed, it is a national disgrace.

The issue here has been the lack of government-instituted testing. Sweden has a nationalized health-care system, and it does not work any better than such systems ever do. I once lay in a hospital ward for a day and a half, fully scrubbed and prepared for surgery, to have my appendix removed. When they finally got around to me, the knot on my surgery gown was so tight that they had to cut the cord to get the thing off (with a laugh, of course). I have other anecdotes of similar nature.

So, as far as the pandemic has been concerned, the national government dithered for months before deciding that they would, indeed, bear much of the financial cost of testing. Until then, only people with symptoms were tested for the virus. What is the point of that?!! We needed to know who had the virus and did not themselves know it, not those whom we were already sure had it.

Anyway, without national funding, the counties, who do the actual sampling, typically do nothing. This is how a national health scheme works (or does not). Fortunately, the government finally started testing more widely for the virus, which created a spike in reported cases in June, as noted in the first graph above.

Recently, the government agreed to fund testing for antibodies, for anyone who wants it. Only two counties, Uppsala and Stockholm, immediately implemented this idea, at the beginning of this month. Sadly, my wife and I have now been waiting for 3 weeks for the results of our tests. We were told: "it may take a week", which in the Swedish health-care system translates as: "don't hold your breath". We have, of course, been sent our bills, for our (smallish) part of the cost.

Conclusion

So, there you have it. Sweden has done no worse than a lot of other places, in spite of doing things somewhat differently. There was no government-enforced lock-down, but instead a government-advised voluntary quarantine. This has worked okay, and certainly much better than the government lock-down in the USA; but plenty of countries in Europe have had lower case rates. The death rate is a bit embarrassing, because old people are not treated well in Sweden. In that sense, what am I doing living in Sweden in my sixties? As Pete Townshend once noted (My Generation): "I hope I die before I get old."

Note: For a slightly later but similar commentary by another local, see: Sweden did not take herd immunity approach against coronavirus pandemic.

Tattoo Monday XX

2020-07-13T00:30:00.000+02:00

There are a number of tattoo designs that take the concept of a Tree of Life and incorporate the concept of DNA. Here is a selection of some of them. For an earlier example, see Tattoo Monday IV.

The power of wine and spirits brands in the marketplace

2020-07-06T00:30:00.000+02:00

Commercial alcoholic beverages have all sorts of market characteristics, one of which is their ability to dominate their markets. This feature was investigated in a survey of the world’s leading drinks brands, published annually from 2006-2015 by the international company strategists Intangible Business. This was called The Power 100, in which each brand was given a power score, allowing them to be ranked.

Intangible Business apparently researched c. 10,000 spirit and wine brands across the globe, to assess both the financial contribution of each brand and its strength in the eyes of the consumer. To do this, they combined scores from a panel of drinks industry experts with global sales data (see Methodology, and Panelists). [Note: the resulting reports used to be housed at www.drinkspowerbrands.com, but this site disappeared in 2017, with 2015 as the final report.]

The Brand Score (out of 100) was produced by the panelists, who scored each brand for these eight characteristics (scale: 0–10):

Share of market: a volume-based measure of market share
Future Growth: projected growth based on 10 years of historical data plus future trends
Premium Price Positioning: a measure of the brand’s ability to command a premium
Market Scope: number of markets in which the brand has a significant presence
Brand Awareness: a combination of prompted and spontaneous awareness
Brand Relevancy: capacity to relate to the brand and a propensity to purchase
Brand Heritage: the brand’s longevity and a measure of how it is embedded in local culture
Brand Perception: loyalty, and how close a strong brand image is to a desire for ownership.

This Score was then turned into a Total Score (out of 100) by multiplying this by the brand's weighted sales volume. It was this Total Score that was used for the final Power list, with the top 100 being listed each year. However, I am not interested in this here — the Total Score is dominated by the sales volume, not by the Brand Score. The latter seems more interesting, so I will look at it here.

Across the 10 years, 141 brands appeared at least once, although only 68 (48%) of them appeared in all 10 surveys, with another 8 appearing in 9/10 years. That is, only half of the brands had any sustained Power. In the other cases, the brands either appeared in the early surveys only, or in the later surveys only — very few came and went from year to year (implying that they were just on the border of the top 100).

As usual in this blog, we can get a picture of the variation among brands by using a phylogenetic network, as a form of exploratory data analysis. For the first analysis, I calculated the similarity across the 8 Brand Score criteria using the Manhattan distance, based on those 100 brands that appeared in the final (2015) report. A Neighbor-net analysis was then used to display the between-brand similarities, as shown in the graph above. Brands that are closely connected in the network are similar to each other based on their Brand Score, and those that are further apart are progressively more different from each other.
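For readers unfamiliar with the Manhattan distance, here is a minimal sketch of the calculation that feeds the network. The brands and their scores are hypothetical, chosen so that two brands have the same total across the eight criteria but reach it in different ways; the Neighbor-net step itself is not reproduced here.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two equal-length score vectors."""
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical scores (0-10) on the eight criteria, in the order listed above.
scores = {
    "Brand A": [9, 7, 8, 9, 9, 8, 9, 8],
    "Brand B": [9, 7, 8, 9, 8, 8, 9, 8],
    "Brand C": [6, 9, 9, 6, 8, 10, 9, 10],
}

# Pairwise distances: the kind of matrix a network method takes as input.
distances = {
    (p, q): manhattan(scores[p], scores[q])
    for p, q in combinations(sorted(scores), 2)
}

for pair, distance in distances.items():
    print(pair, distance)
```

Brand A and Brand C have identical totals (67) yet are 14 units apart, whereas Brand A and Brand B differ by a single unit. A one-dimensional ranking would place A and C together, while the distance matrix keeps them apart.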

There is a general trend of high scores at the top of the network downwards to the bottom left. However, the network does not show a simple trend, such as is implied by the 1-dimensional ranking produced in the original Intangible Business report. That is, there is a complexity among the scores — it is possible for two brands to get the same Brand Score but to get it by scoring highly on quite different criteria. This illustrates the importance of using multi-dimensional summaries for exploratory data analysis — the patterns to be found may not be simple.

In this particular case, note that some brands, like Crown Royal and Dom Perignon, diverge greatly from the overall trend, indicating that they have unusual combinations of scores. Also, the two neighborhoods at the left and right of the network have different combinations from each other, although they end up with similar overall Brand Scores.

For the second analysis, I compared the different years. I calculated the Brand Score similarity across the 10 years using the Manhattan distance, based only on those 104 brands that appeared in at least 5 of the years. A Neighbor-net analysis was then used to display the between-year similarities, as shown in the second graph.

As you can see, in this case the network is as linear as you could expect, indicating that there is little more than 1 dimension of information to summarize. In this case, it basically shows a single rank-ordering of the Brand Scores averaged across the years (with the highest average score at the top of the network and the lowest at the bottom). So, in this case it is much simpler just to list the average Brand Scores in a table, rather than use the network (keep it simple!) — the network is being used to check whether there are more complex patterns, but not to display the pattern found.
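The simpler tabular summary can be sketched as follows. The data are hypothetical, and I am assuming here that each brand's average is taken over the years in which it appeared, which the original reports do not specify; the function name `average_scores` is also my own.

```python
def average_scores(history, min_years=5):
    """Mean Brand Score per brand, keeping brands seen in at least min_years."""
    averages = {
        brand: sum(by_year.values()) / len(by_year)
        for brand, by_year in history.items()
        if len(by_year) >= min_years
    }
    # Highest average first, mirroring a simple ranked table.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: brand -> {survey year: Brand Score}.
history = {
    "Brand A": {2011: 80.0, 2012: 81.0, 2013: 82.0, 2014: 80.0, 2015: 82.0},
    "Brand B": {2011: 60.0, 2012: 62.0, 2013: 61.0, 2014: 59.0, 2015: 63.0},
    "Brand C": {2014: 70.0, 2015: 72.0},  # fewer than 5 years, so dropped
}

ranking = average_scores(history)
print(ranking)
```

When the network turns out to be essentially linear, a sorted list like this carries the same information with far less effort.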

This table is shown next, because it has never been listed before (none of the original reports compare all of the years). You can find your favorite brand, and check how "powerful" it has been in the marketplace, across time. Spirits do better than wines, but there is no consistency about which types of spirits do best.

Brand | Category | Brand Score
Johnnie Walker | Blended Scotch | 81.0
Bacardi | Rum / Cane | 76.9
Hennessy | Cognac | 76.9
Jack Daniel's | US Whiskey | 76.8
Moët et Chandon | Champagne | 74.2
Smirnoff Vodka | Vodka | 73.6
Absolut | Vodka | 70.8
Dom Pérignon | Champagne | 69.7
Baileys | Liqueurs | 69.3
Veuve Clicquot | Champagne | 69.3
Chivas Regal | Blended Scotch | 69.1
Captain Morgan | Rum / Cane | 67.4
Cuervo | Tequila | 67.1
Martini Vermouth | Light Aperitif | 66.3
Jameson | Blended Irish Whiskey | 65.7
The Macallan | Malt Scotch | 63.4
Ballantine's | Blended Scotch | 63.4
Havana Club | Rum / Cane | 63.3
Rémy Martin | Cognac | 63.2
Jägermeister | Bitters / Spirit Aperitifs | 62.8
Maker's Mark | US Whiskey | 62.0
Glenfiddich | Malt Scotch | 62.0
Martell | Cognac | 61.9
Jim Beam | US Whiskey | 61.6
Grey Goose | Vodka | 61.6
Bombay Sapphire | Gin / Genever | 60.9
The Glenlivet | Malt Scotch | 60.8
Concha y Toro | Still Light Wine | 60.7
Robert Mondavi | Still Light Wine | 60.4
Stolichnaya | Vodka | 60.2
Beefeater | Gin / Genever | 59.7
Gordon's Gin | Gin / Genever | 58.7
Courvoisier | Cognac | 58.7
Malibu | Liqueurs | 57.8
Tanqueray | Gin / Genever | 57.7
Sauza | Tequila | 57.7
Crown Royal | Canadian Whisky | 57.1
Taittinger | Champagne | 57.0
Mumm | Champagne | 57.0
J & B | Blended Scotch | 56.9
Patrón | Tequila | 56.9
Penfolds | Still Light Wine | 56.4
Hardys | Still Light Wine | 56.1
Cointreau | Liqueurs | 55.9
Freixenet | Other Sparkling | 55.7
Gallo | Still Light Wine | 55.6
Wolf Blass | Still Light Wine | 55.4
Southern Comfort | Liqueurs | 55.3
Jacobs Creek | Still Light Wine | 55.0
Campari Bitters | Bitters / Spirit Aperitifs | 54.7
Famous Grouse | Blended Scotch | 54.7
Torres | Still Light Wine | 54.5
Grand Marnier | Liqueurs | 54.5
Canadian Club | Canadian Whisky | 54.0
Finlandia | Vodka | 53.9
Piper Heidsieck | Champagne | 53.2
Laurent Perrier | Champagne | 52.9
Beringer | Still Light Wine | 52.6
Dewars | Blended Scotch | 52.5
Kahlua | Liqueurs | 52.2
Martini Sparkling Wine | Other Sparkling | 52.2
Yellowtail | Still Light Wine | 52.1
Lindeman's | Still Light Wine | 52.0
Svedka | Vodka | 51.9
Skyy | Vodka | 51.8
Wild Turkey | US Whiskey | 51.8
Grant's Scotch | Blended Scotch | 51.5
Teacher's | Blended Scotch | 51.1
Ketel One | Vodka | 51.0
De Kuyper | Liqueurs | 50.4
Kendall Jackson | Still Light Wine | 50.0
Nicolas Feuillatte | Champagne | 49.2
Cutty Sark | Blended Scotch | 49.1
Aperol | Light Aperitif | 49.0
Disaronno | Liqueurs | 49.0
Ricard | Aniseed | 49.0
Cinzano Vermouth | Light Aperitif | 48.8
Russian Standard | Vodka | 48.7
Fernet-Branca | Bitters / Spirit Aperitifs | 48.4
Bell's | Blended Scotch | 48.0
Blossom Hill | Still Light Wine | 47.6
Sutter Home | Still Light Wine | 47.1
William Lawson's | Blended Scotch | 46.4
Wyborowa | Vodka | 45.8
El Jimador | Tequila | 45.1
Bols Liqueurs | Liqueurs | 45.0
Eristoff | Georgian Vodka | 44.1
Clan Campbell | Blended Scotch | 43.7
Seagram's 7 Crown | US Whiskey | 43.1
100 Pipers | Blended Scotch | 42.4
Seagram Gin | Gin / Genever | 42.3
Ramazzotti Amaro | Bitters / Spirit Aperitifs | 42.3
Inglenook | Still Light Wine | 42.2
Black Velvet | Canadian Whisky | 42.0
Three Olives | Vodka | 41.9
Seagram V.O. | Canadian Whisky | 41.0
Cacique | Rum / Cane | 40.6
Metaxa | Other Brandy | 39.6
E & J Brandy | Other Brandy | 39.5
Canadian Mist | Canadian Whisky | 39.3
Dreher | Other Brandy | 39.3
Masson Grande Amber Brandy | Other Brandy | 37.7
Pastis 51 | Aniseed | 37.6
Moskowskaya | Vodka | 37.0

Annotating rhymes in texts (From rhymes to networks 3)

2020-06-29T00:30:00.000+02:00

Having discussed some general aspects of rhyming in a couple of different languages in last month's blog post, I devote this third post in the series to the question of how rhyme can be annotated. Annotation plays a crucial role in almost all fields of linguistics. The main idea is to add value to a given resource (Milà-Garcia 2018). What value we add to resources can differ widely, but as far as textual resources are concerned, we can say that the information that we add can usually not be extracted automatically from the resource.

In our case, the information we want to explicitly add to rhyme texts or rhyme corpora is the rhyme relations between words. Retrieving this information may be trivial, as in the case of Shakespeare's Sonnets, where we know the rhyme schema in advance, but it is considerably complicated when working with other, less strict types of rhyming.

One usually distinguishes two basic types of annotation: inline and stand-off (Eckart 2012). For inline annotation, we add our information directly into our textual resource, while stand-off annotation creates an index over the resource, and then adds the information in a separate resource that refers to the index of the original text.

Both methods have their pros and cons. Stand-off annotation often seems to provide a cleaner solution (as one never knows how much a manual annotation added into a text might modify the text involuntarily). However, inline annotation has, in my experience, the advantage of allowing for a much faster annotation process, at least as long as the annotation has to be done in text files directly, without interfaces that could help to assist in the annotation process.

Overview of existing annotation practice

If we look at different practices that have been used to annotate rhymes in collections of poetry, we will find quite a variety of techniques that have been used so far.

Wáng (1980), for example, uses an inline annotation style in his corpus of the rhymes in the Book of Odes, as illustrated in the following example taken from List et al. (2019). In this annotation, rhyme words are indirectly annotated by providing reconstructed readings for the Chinese characters which are supposed to narrow the original pronunciation. Whenever two rhyme words share the same main vowel, the author would have judged them to have rhymed in the original text.

Annotation in Wáng (1980)

Baxter (1992) uses a stand-off annotation, which is shown (again taken from List et al. 2019) in the following table. An advantage of Baxter's annotation is that it allows him to provide multiple layers of information for each rhyme word. A disadvantage is that a clear index to the words in the poem is lacking. While this is not entirely problematic, since it is usually easy to identify which words are in rhyme position, it is not entirely "safe", from an annotation point-of-view, as it may still create ambiguities.

Annotation in Baxter (1992)

In a study of automated rhyme word detection, Haider and Kuhn (2018) use annotated rhyme datasets from a variety of German styles (Hip Hop, contemporary lyrics, and more ancient lyrics). To annotate the data, they use the standard format of the Text Encoding Initiative, which is based essentially on XML. Unfortunately, however, they do not provide tags for each word that rhymes, but instead only add an attribute to each stanza, indicating the rhyme schema, as can be seen in the example below:

<lg rhyme="aabccb" type="stanza">
  <l>Vor seinem Löwengarten,</l>
  <l>Das Kampfspiel zu erwarten,</l>
  <l>Saß König Franz,</l>
  <l>Und um ihn die Großen der Krone,</l>
  <l>Und rings auf hohem Balkone</l>
  <l>Die Damen in schönem Kranz.</l>
</lg>

The drawback of this annotation style is that it places the annotation where it does not belong, assuming that a poem only rhymes the words that appear in the end of a line, and that there are no exceptions.
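To make the limitation concrete, here is a minimal Python sketch of what a stanza-level schema lets us recover. It assumes the stanza is well-formed XML as in the example above; the helper name line_final_rhymes is hypothetical and not part of Haider and Kuhn's tooling. All that can be done with such an attribute is to pair each schema letter with the last word of the corresponding line.

```python
import xml.etree.ElementTree as ET

# The stanza from the example above, with the stanza-level rhyme attribute.
TEI_STANZA = """<lg rhyme="aabccb" type="stanza">
  <l>Vor seinem Löwengarten,</l>
  <l>Das Kampfspiel zu erwarten,</l>
  <l>Saß König Franz,</l>
  <l>Und um ihn die Großen der Krone,</l>
  <l>Und rings auf hohem Balkone</l>
  <l>Die Damen in schönem Kranz.</l>
</lg>"""


def line_final_rhymes(stanza_xml):
    """Pair each letter of the stanza schema with the last word of its line.

    Hypothetical helper: the schema says nothing about which word rhymes,
    so the only option is to assume it is the line-final word.
    """
    stanza = ET.fromstring(stanza_xml)
    schema = stanza.get("rhyme", "")
    lines = [line.text.strip() for line in stanza.findall("l")]
    if len(schema) != len(lines):
        raise ValueError("schema length does not match the number of lines")
    pairs = []
    for label, line in zip(schema, lines):
        last_word = line.split()[-1].strip(".,;:!?")
        pairs.append((label, last_word))
    return pairs


for label, word in line_final_rhymes(TEI_STANZA):
    print(label, word)
```

Any rhyme that is not line-final, or that links words across stanzas, cannot be expressed in this form without additional markup.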

For French, I found an interesting website called métrique en ligne, offering a large number of phonetically analyzed texts in French. They offer a rhyme analysis in an interactive fashion: one can have a look at a poem in raw form and then see which parts of the words appear in rhyme relation. A screenshot of the website (with the poem "Les Phares" from Charles Baudelaire) illustrates this annotation:

It is very nice that the project offers the rhyme annotation in such a clear form, annotating explicitly those parts of the words (albeit in orthography) that are supposed to be responsible for the rhyming. However, the annotation has a clear drawback, in that it provides rhyme annotation only on the level of the stanza, although we know well that quite a few poems have recurring rhymes that are reused across many stanzas, and we would like to acknowledge that in our annotation.

The most complete annotation of poetry I have found so far is "MCFlow: A Digital Corpus of Rap Transcriptions" (Condit-Schultz 2017). The goal of the annotation was not to annotate rhyme in the primary instance, but to provide a corpus that also takes the musical and rhythmic aspects of rap into account. As a result it offers annotations along seven major aspects: rhythm, stress, tone, break, rhyme, pronunciation, and the lyrics themselves. The rhyme annotation itself is provided for each syllable (the texts themselves are all syllabified), with capital letters indicating stressed, and lower case letters indicating unstressed syllables. Rhyme units (usually, but not necessarily words) are marked by brackets. The following figure from Condit-Schultz (2017) illustrates this schema.

Annotation of rhymes by Condit-Schultz (2017)

What I do not entirely understand is the motivation of using the same lowercase letters for unstressed syllables as for the stressed ones in a rhyme sequence. Given that the information about stress is generally available from the annotation, it seems redundant to add it; and it is not clear to me for what it serves, specifically also because unstressed syllables do not necessarily rhyme in rhyme sequences. But apart from this, I find the information that this annotation schema provides quite convincing, although I find the format difficult to parse computationally; and I also imagine that it is quite difficult to annotate it manually.

Initial reflections on rhyme annotation

When dealing with annotation schemas and trying to develop a framework for annotation, it is always useful to recall the Zen of Python, especially the first seven lines:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.

What I think we can extract from these seven lines are the following basic rules for an initial annotation schema for rhyme data.

First, ideally, we want an annotation schema that gives us the same look and feel that we know when reading a poem. This does not mean we need to store the full annotation in this schema, but for a quick editing of rhyme relations, such an annotation schema has many advantages.
Second, in order to maintain explicitness, all rhymes should be treated as rhyming globally inside a poem — we should never restrict annotation of rhymes to a single stanza, and we should also avoid brackets to mark rhyming sequences, as there are other ways to assign words to units.
Third, we should be explicit enough to show which parts of a word rhyme but, for now, I think it is not necessary to annotate all syllables at the same time. Since this would cost a lot of time, and specifically since syllabification differs from language to language, it seems better to add this information later on a language-specific basis, semi-automatically. Since many words repeat across poems, one can design a lookup-table to syllabify a word much more easily from a corpus that has been assembled, than adding the information when preparing each poem.

Towards a Standardized Annotation of Rhyme Data

Last year, we proposed an annotation schema for rhyme annotation (List et al. 2019). Our basic idea was inspired by tabular formats. These are used in linguistic software packages dealing with problems in computational historical linguistics, such as LingPy. They are also used as the backbone of the Cross-Linguistic Data Formats Initiative (Forkel et al. 2018), which uses tabular formats in combination with metadata in order to render linguistic datasets (wordlists, information on structural features) cross-linguistically comparable. Essentially, the format can be seen as a stand-off annotation, where the original data are not modified directly. While our basic format was rather powerful with respect to what can be annotated, it is also very difficult to code data in this format, at least in the absence of a proper annotation tool.

At the same time, to ease the initial preparation of annotated rhyme data conforming to these standards, we proposed an intermediate format, in which a poem was provided just in text form, with minimal markup for metadata, and in which rhymes could be annotated inline. As an example, consider the first two stanzas of the poem "Morning has broken" by Eleanor Farjeon (1881-1965):

@ANNOTATOR: Mattis
@CREATED: 2020-06-26 06:09:04
@TITLE: Morning has broken
@AUTHOR: Eleanor Farjeon
@BIODATE: 1881-1965
@YEAR: before 1965
@MODIFIED: 2020-06-26 06:09:46
@LANGUAGE: English

Morning has [a]broken like the first morning
Blackbird has [a]spoken like the first [b]bird
Praise for the [c]singing, praise for the morning
Praise for them [c]springing fresh from the [b]Word

Sweet the rain's [e]new_[f]fall, sunlit from heaven
Like the first [e]dew_[f]fall on the first [g]grass
Praise for the [d]sweet[h]ness of the wet garden
Sprung in com[d]plete[h]ness where His feet [g]pass

As you can see from this example, we start with some metadata (which is more or less free-form, following the pattern @key: value), and then render the stanzas, line by line, separating stanzas by one blank line. Rhymes are annotated by enclosing rhyme labels in square brackets before the part of the word responsible for the rhyme. If wanted, one can annotate rhymes for each syllable, as done in the rhyme words [d]sweet[h]ness and com[d]plete[h]ness, but one can also only annotate the rhyme as a whole, as done in the rhyme words [a]broken and [a]spoken.

In order to assign words to rhyme units, an underscore can be used to indicate that two orthographic words are perceived as one unit in the rhyme, which is the case for [e]new_[f]fall rhyming with [e]dew_[f]fall. Furthermore, if a stanza reappears throughout a poem or song in the form of a refrain, this can be indicated by adding two spaces before all lines of the stanza.

Comments can be added by beginning a line with the hash symbol #, as shown in this small excerpt of Bob Dylan's "Sad-Eyed Lady of the Lowlands".

# [Verse 1]
With your mercury mouth in the missionary [c]times
And your eyes like smoke and your prayers like [c]rhymes
And your silver cross, and your voice like [c]chimes
Oh, who do they think could [i]bury_[j]you?
With your pockets well protected at [e]last
And your streetcar visions which ya' place on the [e]grass
And your flesh like silk, and your face like [e]glass
Who could they get to [i]carry_[j]you?

# [Chorus]
  Sad-eyed lady of the lowlands
  Where the sad-eyed prophet say that no man [a]comes
  My warehouse eyes, my Arabian [a]drums
  Should I put them by your [b]gate
  Or, sad-eyed lady, should I [b]wait?
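
The conventions described so far (metadata lines, comments, blank-line stanza breaks, two-space refrains, and greedy rhyme tags) are simple enough to parse with a few lines of Python. The following is only a sketch under my reading of the format as described in this post, not the code behind any existing tool; the function names parse_poem and rhyme_parts are hypothetical, and the @TITLE and @AUTHOR lines in the sample are added for illustration.

```python
import re

# A rhyme tag is a label in square brackets, e.g. [a] or [+].
TAG = re.compile(r"\[([^\]]+)\]")

SAMPLE = """@TITLE: Sad-Eyed Lady of the Lowlands
@AUTHOR: Bob Dylan

# [Verse 1]
With your pockets well protected at [e]last
Who could they get to [i]carry_[j]you?

# [Chorus]
  Should I put them by your [b]gate
  Or, sad-eyed lady, should I [b]wait?
"""


def parse_poem(text):
    """Split an annotated poem into a metadata dict and a list of stanzas."""
    meta, stanzas, current = {}, [], []
    for raw in text.splitlines():
        if raw.startswith("#"):            # comment line: ignore
            continue
        if raw.startswith("@"):            # metadata of the form @key: value
            key, _, value = raw[1:].partition(":")
            meta[key.strip()] = value.strip()
            continue
        if not raw.strip():                # blank line closes the stanza
            if current:
                stanzas.append(current)
                current = []
            continue
        # two leading spaces mark a line as part of a refrain
        current.append({"refrain": raw.startswith("  "), "text": raw.strip()})
    if current:
        stanzas.append(current)
    return meta, stanzas


def rhyme_parts(line):
    """Return (label, letters) pairs for one line.

    Greedy reading: a tag covers all letters up to the next tag or the
    end of the rhyme unit; underscores join words into one unit.
    """
    parts = []
    for unit in line.split():
        pieces = TAG.split(unit)   # [prefix, label, letters, label, letters, ...]
        for label, letters in zip(pieces[1::2], pieces[2::2]):
            parts.append((label, letters.strip("_.,;:!?")))
    return parts


meta, stanzas = parse_poem(SAMPLE)
for stanza in stanzas:
    for line in stanza:
        print(line["refrain"], rhyme_parts(line["text"]))
```

Keeping the parser this small is deliberate: everything beyond the rhyme labels (syllabification, pronunciation) can be added later from lookup tables.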

When testing this framework on many different kinds of poems from different languages and styles, I realized that the greedy rhyme annotation that I used (you place the rhyme tag before a word, and all letters that follow will be considered to belong to that very rhyme tag) has a disadvantage in those situations where syllables in multi-syllabic rhyme units essentially do not rhyme. As an example consider the following lines from Eminem's "Not Afraid":

I'ma be what I set out to be, 
without a doubt, undoubtedly
And all those who look down on me, 
I'm tearin' down your balcony

Here, the author plays with rhymes centering around the words out to be, undoubtedly, down on me, and balcony. Condit-Schultz has annotated the rhymes as follows (I use the rhyme schema inline for simplicity):

I'ma D|be what I set (C|out c|to D|be), 
wi(C|thout c|a) (C|doubt, c|un)(C|doub.c|ted.D|ly)
And all those who look (C|down c|on D|me), 
I'm tearin' C|down your (C|bal.c|co.D|ny)

In my opinion, however, the parts annotated with c by Condit-Schultz do not really rhyme in these lines, they are mere fillers for the rhythm, while the most important rhyme parts, which are also perceived as such, are the stressed syllables with the main vowel ou. To mark that a syllable is not really rhyming, but also in order to mark the border of a rhyme (and thus allow indication that only the first syllable of a word rhymes with another word), I therefore decided to introduce a specific "empty" rhyme symbol, which is now represented by a plus. My annotation of the lines thus looks as follows:

I'ma be what I set [h]out_[+]to_[e]be, 
wi[h]thout a [h]doubt, un[h]doub[+]ted[e]ly
And all those who look [h]down_[d]on_[e]me
I'm tearin' down your bal[d]co[e]ny
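
As a rough illustration of how the "empty" symbol can be handled when reading the annotation, the following Python sketch collects rhyme parts by label across all lines and simply skips anything tagged with a plus. It assumes the greedy reading described above; the function name rhyme_groups is hypothetical, and the second input line is only a fragment of the annotated lyrics.

```python
import re
from collections import defaultdict

TAG = re.compile(r"\[([^\]]+)\]")
EMPTY = "+"   # the "empty" rhyme symbol: marks a non-rhyming part


def rhyme_groups(lines):
    """Collect rhyme parts by label across all lines, skipping empty tags."""
    groups = defaultdict(list)
    for line in lines:
        for unit in line.split():
            pieces = TAG.split(unit)   # [prefix, label, letters, ...]
            for label, letters in zip(pieces[1::2], pieces[2::2]):
                if label == EMPTY:     # filler syllable, not a rhyme
                    continue
                groups[label].append(letters.strip("_.,;:!?"))
    return dict(groups)


lines = [
    "I'ma be what I set [h]out_[+]to_[e]be,",
    "wi[h]thout a [h]doubt,",
]
print(rhyme_groups(lines))
```

Because labels are collected over the whole input rather than per stanza, the same function also covers rhymes that recur across stanzas.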

An Interactive Tool for Rhyme Annotation

While I consider the inline-annotation format as now rather complete (with all limitations resulting from inline-annotation), I realized, when trying to annotate poems by using the format, that it is not fun to edit text files in this way. I am not talking about small edits, like one stanza, or typing in some metadata; annotating a whole rap song can become very tedious and even problematic, as one may easily forget which rhyme tags one has already used, or overlook which words have been annotated as rhyming, or forget brackets and the like.

As a result, I decided to write an interactive rhyme annotation tool that supports the inline-annotation format and can be edited both in the text and interactively at the same time. This is a bit similar to the text processing programs in blogging software, which allow writing both in the HTML source and in a more convenient version that shows you what you will get.

The following screenshot from the database, for example, shows how the rhymes in Shakespeare's Sonnet Number 98 are visually rendered.

Visual display of Shakespeare's Sonnet 98

This tool is now already available online. I call it RhyAnT, which is short for Rhyme Annotation Tool. I have been using it in combination with a small server, to populate a first database with rhymes in different languages, which already contains more than 350 annotated poems. This database can be accessed and inspected by everybody interested, at AntRhyme; but copyrighted texts from modern songs can — unfortunately — not be rendered yet (as I am not sure how many I would be allowed to share).

I do not want to claim that I am gifted as a designer (I am surely not), and it is possible that there are better ways to implement the whole interface. However, I find it important to note that the format itself, with the coloring of rhyme words, has dramatically increased my efficiency at annotating rhyme data, and also my accuracy in spotting similarities.

Annotating the same poem with RhyAnT, the interactive rhyme annotator

The above screenshot shows how I can edit the poem from my edit access to the database. Alternatively, one can just paste in the text and edit it on the publicly accessible interface of the RhyAnT tool, edit the data, and then copy-paste it to store it. In this form, the interface can already be used by anybody who wants to annotate rhymes in their work.

Outlook

The current annotation framework that I have illustrated here is not all-powerful, specifically because it does not allow for multi-layered annotation (Bański and Witt 2019: 230f), which would allow us to add pronunciation, rhythm, and many other aspects than rhyming alone. However, I hope that many of these aspects can be later added quickly, by creating lookup tables and processing the annotated corpus automatically. Following the Zen of Python, this seems to be much simpler than investing a lot of time in the creation of a highly annotated dataset that would discourage working with the data from the beginning.

References

Bański, Piotr and Witt, Andreas (2019) Modeling and annotating complex data structures. In: Julia Flanders and Fotis Jannidis (eds) The Shape of Data in the Digital Humanities: Modeling Texts and Text-based Resources. Oxford and New York: Routledge, pp. 217-235.

Baxter, William H. (1992) A Handbook of Old Chinese Phonology. Berlin: de Gruyter.

Condit-Schultz, Nathaniel (2017) MCFlow: A Digital Corpus of Rap Transcriptions. Empirical Musicology Review 11.2: 124-147.

Eckart, Kerstin (2012) Resource annotations. In: Clarin-D, AP 5 (ed.) Berlin: DWDS, pp. 30-42.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Haider, Thomas and Kuhn, Jonas (2018) Supervised rhyme detection with Siamese recurrent networks. In: Proceedings of Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 81-86.

List, Johann-Mattis and Nathan W. Hill and Christopher J. Foster (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Milà‐Garcia, Alba (2018) Pragmatic annotation for a multi-layered analysis of speech acts: a methodological proposal. Corpus Pragmatics 2.1: 265-287.

Wáng, Lì 王力 (2006) Hànyǔ shǐgǎo 漢語史稿 [History of the Chinese language]. Běijīng 北京:Zhōnghuá Shūjú 中华书局.

Can we dig too deep? Signal conflict in mitochondrial genes of land plants

2020-06-22T00:30:00.000+02:00

In an earlier post, Supernetworks and gene tree incongruence, I illustrated what Supernetworks can tell us about incongruent mitochondrial gene trees, using the dataset of Sousa et al. (PeerJ 8: e8995, 2020). Here, I will take a closer look at these data, in order to illustrate another point.

Fig. 1 A Supernetwork based on 34 individual mitochondrial gene trees (atp1 and atp8 missing due to an alignment glitch). Groups (clades) referring to splits not found in any of the gene trees, including the "Septaphyta", are shown in gray font; groups referring to clades seen in Sousa et al.'s preferred tree are shown in blue.

Sousa et al.'s set of analyses aimed to filter signal in order to get a better all-inclusive tree, and succeeded in producing support for a "Septaphyta" clade, comprising liverworts and mosses, which is a split not found in any of the inferred (Bayesian) gene trees.

Fig. 2 Comprehensive but branch-length and frequency ignorant Supernetwork of Sousa et al.'s Bayesian MRC gene trees (trees are provided as supplementary online data on zenodo), inferred from nucleotide sequences. The trees show several alternatives (colored and labeled) regarding the sister lineages of mosses and liverworts. Any split found in at least one gene tree, is represented in this Supernetwork.

This split did occur, however, when the amino acid sequences were used, instead of the nucleotide sequences.

Fig. 3 Comprehensive branch-length and frequency ignorant Supernetwork of Sousa et al.'s amino acid Bayesian MRC gene trees. The sister taxon of liverworts are either the mosses (= Septaphyta clade) or the outgroup green algae Coleochaete (further inclusive splits include subsequently more outgroups, placing the liverworts as sister to all other land plants). Realized alternatives for mosses include further being a sister of hornworts (in which case liverworts would be sister to higher land plants), or sister to all land plants (brown split: green algae + mosses | rest).

The alternative to the Septaphyta clade, which does appear in the gene trees, recognizes the liverworts as the closest relative of the vascular plants, while the mosses are resolved as the first branch. As Sousa et al. point out:

The tree inferred from the concatenated nucleotide data set of 36 mitochondrial genes shows mosses as the sister-group to the remaining land plants, as previous analyses of mitochondrial nucleotide data have shown (Liu et al., 2014). ...

The concatenated tree hence only reflects a minor aspect of the Supernetwork (Fig. 1) of the individual gene trees:

... However, the mosses are replaced by the liverworts in the same position when analysing codon-degenerate recoded data.

This seems to be the preferred placement when summarizing the gene trees using the Supernetwork.

In this post, we will take a closer look. Is there a deep, easily obscured signal for the Septaphyta clade in the mitochondria of plants, a signal that only surfaces in some amino acid gene trees (Fig. 3) and the filtered concatenated tree (Sousa et al.'s fig. 2)? Or is it just a branching artifact?

An example for a (low-supported) artificial clade (cf. Schliep et al., Methods Ecol. Evol. 8: 1212–1220, 2017). The trees in (a), (b) and (c) give the paternal (Y-chromosome), maternal (mt genes) and biparental (nuclear autosomal introns) genealogies. While the paternal and biparental genealogies are compatible (congruent), the maternal is in strong conflict. When combining these three data sets, this substantial conflict decreases branch support and results in the artificial red clade. The support for the artificial clade is low but worrisome: the tips differ substantially from each other and the hypothetical, alternative common ancestors and there are literally no patterns in the data supporting a sister relationship of Sloth and Sun bears. The artificial red clade is a secondary product of the inference: the sister relationship of Polar and Brown Bear is trivial (data and inference wise), the American Black Bear the more likely sister (note the length of competing branches in a/c vs. b). The signal for a clade of Asian Black and Sloth bears is less pronounced, here the mt genealogy clade is strongly incongruent and forces the combined tree to resolve the conflict by introducing the artificial red clade.

Starting simply

The Supernetwork in Fig. 1 shows that, no matter which gene we look at, liverworts and mosses were originally most similar to each other, and, absolutely speaking, still close to the (hypothetical) mitochondrion of the ancestor of all land plants. We can illustrate the general situation about the signals using a Neighbor-net inferred from the concatenated data of all 36 genes.

Fig. 4 Neighbor-net based on uncorrected p-distances inferred from the concatenated gene data.

Note that we did not use a substitution model, only a naive distance matrix (uncorrected p-distances), for a set of coding genes that include saturated third codon positions. Some phylogenetic relationships are obviously based on trivial signals: the Neighbor-net in Fig. 4 includes ± prominent edge bundles defining neighborhoods in line with generally accepted clades (in bold). To capture these evolutionary lineages (some going back nearly half a billion years), we just need the raw data but no sophisticated phylogenetic analysis.
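
For readers unfamiliar with the term, an uncorrected p-distance is just the proportion of aligned sites at which two sequences differ. The Python sketch below shows the idea under one assumption that is mine, not necessarily what the software used for Fig. 4 does: sites with a gap or ambiguity code in either sequence are skipped (pairwise deletion). The taxon names and sequences are toy data.

```python
def p_distance(seq_a, seq_b, skip="-?N"):
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites where either sequence has a gap or ambiguity
    character are ignored (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping taxon name to aligned sequence."""
    names = list(alignment)
    return {a: {b: p_distance(alignment[a], alignment[b]) for b in names}
            for a in names}


# Toy alignment, purely illustrative.
toy = {
    "taxon_A": "ACGTACGT",
    "taxon_B": "ACGTACGA",
    "taxon_C": "AC-TTCGA",
}
matrix = distance_matrix(toy)
print(matrix["taxon_A"]["taxon_B"])
```

No correction for multiple substitutions is applied, which is exactly why saturated third codon positions matter for such a matrix.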

In the case of the probably monophyletic gymnosperms, the gymnosperm neighborhood competes with a neighborhood excluding the Gnetidae Welwitschia, which is the most distinct of the seed plants in this taxon set (this applies to Gnetidae in general, no matter which data are used). In addition, we see a neighborhood defined by the pink edge bundle: a split of green algae + Welwitschia versus all other land plants. This is a case of obvious long-edge attraction, enforced (here) by missing data (Welwitschia lacks data for 12 out of the 34 genes).

The center of the graph with respect to all tips would be a candidate for the ancestral mitochondrion of the common ancestor of all land plants. Closest to this point are the Septaphyta (mosses + liverworts) and the lycophyte Huperzia (the better-represented taxon, missing only five genes, while Isoetes misses 15).

One can depict a phylogenetic hypothesis by just dropping the less pronounced neighborhoods in Fig. 4:

Most prominent edge bundles define three main clusters (= lineages): green algae, seed plants, and other land plants.
Within green algae:

Closterium is sister to Gonatozygon, next is Roya → Zygnematales
there is no prominent edge bundle connecting Chaetosphaeridium with the Zygnematales; the closest relative is however the last green alga (→ Coleochaetales; only group without a neighborhood).

Within seed plants:

Brassica may be the sister of Liriodendron (more prominent edge bundle), Oryza complements the clade as first diverged member → angiosperms
Cycas is the sister of Ginkgo, the two are sister to the angiosperms
this leaves Welwitschia as the first diverged branch.

Taking the green algae as outgroups:

the ferns are the sister group of the seed plants (edges longer than the alternative of a primitive land plant clade)
mosses are sister to liverworts (→ Septaphyta); Huperzia shares the same edge bundle but is apparently sister of Isoetes (→ lycophytes), and the lycophytes appear to be ± primitive sisters of ferns and seed plants
this leaves the hornworts, a highly coherent group sharing no prominent edge bundles with any other member of the land plant cluster, and hence are a candidate for the first diverging land plant lineage.

This is a tree hypothesis that is strikingly similar to Sousa et al.'s preferred tree.

Sousa et al.'s fig. 2

The only differences lie in terminal subtrees (Oryza as sister to Liriodendron; Marchantia-Treubia grade, the position of the latter two within liverworts being unclear based on the Neighbor-net).

Something that is easily overlooked in Sousa et al.'s rooted tree, but that is apparent from the Neighbor-net, is that we should be aware of ingroup-outgroup long-branch attraction (LBA). The green algae are not only highly divergent but also very distant from all ingroup taxa, the land plants; and the first ingroup branch in Sousa et al.'s tree has the longest root.

Additive and subtractive support

In principle, when comparing single gene tree samples to combined trees, we face four sorts of signals in our data:

Very strong signals imprinted in one or a few genes; they will outcompete, and possibly even be reinforced by, any conflicting signal. Walker et al. (PeerJ 7: e7747, 2019) studied this phenomenon for the case of angiosperm plastomes (see also our miniseries The emperor has no clothes on).
Phylogenetically sorted, weak but consistent signals; they will add up, as branch support will increase with each gene added. In this category fall signals reflecting deep splits obscured by terminal noise, when analyzing a single gene or few genes – like the one found by Sousa et al. supporting a Septaphyta clade.
Disparate gene histories; eg. because of intergenomic recombination. The support will be diminished with every added gene not sharing the same history.
General conflict; eg. when combining data from different genomes reflecting different genealogies, such as combining chloroplast (product of biogeographic history) and nuclear data (product of speciation processes) of tree genera. This will be expressed by split bootstrap (BS) support, and may result in artificial clades in the combined/concatenated tree (eg. bear example shown above).

Adding to this is the absence of signal: short-branch culling, a special case of long-branch attraction, which could also explain the inference of a (paraphyletic) Septaphyta clade. If there are few tips in the data that are close (absolute, not only regarding their phylogenetic distance) to the all-ancestor without clear affinities, they may be collected in a subtree, being leftovers from optimizing all other tips with certain affinities and higher distance to the all-ancestor.

Fig. 5 Short-branch culling. Let's assume liverworts are the sister clade of higher land plants (an alternative with near-unambiguous support from cox1). The signal for this in mitochondrial data is weak (short root). On the other hand, there is a high risk for ingroup-outgroup long-branch attraction (LBA) leading eventually to an artificial Septaphyta clade. Because of (inevitable) LBA, even though the false branch is very short, its support can be high (unambiguous when using Bayesian inference).

By compiling the support for all alternatives, we can assess where the support is additive or subtractive. We do this using my re-analysis, not Sousa et al.'s Bayesian analysis, because:

BS support is more sensitive to internal signal conflict than Bayesian PP,
to extract this information, we need the tree samples used to establish the branch support.

When doing this, we find that the split defining the Septaphyta clade is not only missing from the nucleotide gene trees but also rarely found in the BS pseudoreplicate samples. Only for seven gene regions (atp4, atp8, nad2, rpl16, rps2, rps3, rps13) do we find BS ≥ 25; the highest support comes from rps3 (BS = 65; however, the split is not found in the corresponding Bayesian MRC of Sousa et al.).
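
The idea of extracting support for a particular split from a tree sample can be sketched in a few lines of Python. This is not the pipeline used for the analyses here, only an illustration under simplifying assumptions: trees are written as nested tuples of tip names, a split is stored as the side that does not contain a chosen reference taxon, and the four "pseudoreplicate" trees are invented toy data.

```python
def tips(tree):
    """All tip names below a node (a tip is a string, a node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def splits(tree, all_taxa, reference):
    """Non-trivial bipartitions of a tree, each stored as the side
    that does not contain the reference taxon."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = tips(node)
        if reference in side:
            side = all_taxa - side
        if 1 < len(side) < len(all_taxa) - 1:   # skip trivial splits
            found.add(side)
        for child in node:
            walk(child)

    walk(tree)
    return found


def split_support(replicates, group, reference):
    """Percentage of replicate trees that contain the split group | rest."""
    all_taxa = tips(replicates[0])
    target = frozenset(group)
    if reference in target:
        target = all_taxa - target
    hits = sum(target in splits(tree, all_taxa, reference) for tree in replicates)
    return 100.0 * hits / len(replicates)


# Toy pseudoreplicate trees (invented, five lineages only).
t1 = ("algae", ("hornworts", (("liverworts", "mosses"), "vascular")))
t2 = ("algae", ("mosses", ("hornworts", ("liverworts", "vascular"))))
t3 = ("algae", (("liverworts", "mosses"), ("hornworts", "vascular")))
replicates = [t1, t2, t3, t1]

print(split_support(replicates, {"liverworts", "mosses"}, "algae"))
```

Running the same count for each competing split over the same tree sample is what makes it possible to see whether support for one alternative comes at the expense of another.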

On the other hand, the main alternatives find much higher and more consistent support, as shown here.

Fig. 6 Competing support for (purple) and against a Septaphyta clade (greens and yellows). Placing the hornworts as sister to all other land plants (pink) is compatible with the hypothesis of a Septaphyta clade as well as the competing alternative of placing the liverworts as sister to higher land plants; note the high support from the cox1 gene for a corresponding tree. *, for these genes no hornwort data were included/have been available.

Short-branch culling, a special form of ingroup-outgroup LBA

Now, my BS analyses were deliberately naive, because they did not apply any data partitioning. However, both liverworts and mosses have short branches while the outgroup, the green algae, are extremely long-branched. If substitution saturation is an issue for misplacing either liverworts or mosses as sisters to all other land plants, then there should also be ingroup-outgroup LBA. A false split of liverwort + outgroup versus the rest, or moss + outgroup versus the rest, has a lower chance to be supported than would a false hornwort + outgroup versus the rest split. The latter directly opens the door for a Septaphyta clade (see Fig. 5).

Let's have a look at the trees of the four genes supporting the Septaphyta split, as the best alternative. ("AA tree/PP" is the amino acid tree provided by Sousa et al.; BS support refers to my unpartitioned ML analyses)

atp8 — The AA tree is a star tree (comb), strongly distorted by LBA: a Coleochaetales + seed plants | Zygnematales + all other land plants split has PP = 1; the short- and long-branched lycophytes are not resolved as sisters.
rpl16 — Also here, the AA tree is star-like regarding deep relationships: (i) green algae (unresolved), PP = 1; (ii) liverworts, very long root, little internal resolution PP = 1; (iii) mosses (unresolved), root half as long as for liverworts, PP = 1; (iv) higher land plants, short root, PP = 0.88.
rps3 — No ingroup-outgroup LBA, shortest-branched ingroup, liverworts, resolved as sister to mosses + rest (PP = 0.77); thus, AA tree, not affected by saturation issues, rejects the Septaphyta (PP < 0.23).
rps13 — Again, the AA tree is star-like, with five tips: (i) green algae (PP = 1); (ii) long-rooted hornworts (PP = 1); (iii) liverworts, relatively short root (PP = 1); (iv) mosses, longer root (PP = 1); (v) higher land plants, shortest root (PP = 0.89).

The Septaphyta root is either extremely short or non-existent, as we would expect for a false clade, because there are no character splits in the matrix that support the taxon split.

Fig. 7 Sousa et al.'s amino acid Bayesian MRC trees (top row) compared to codon-naive nucleotide ML trees (bottom row) producing highest BS support for a Septaphyta clade. Note that in two cases, rpl16 and rps13, the 'best-known' ML tree shows a competing split with much lower support.

Typically, since we are looking at a deep split, we would expect that support increases when shifting from (codon-naive) nucleotide to amino acid analysis, because we eliminate terminal noise. However, we observe the opposite (Bayesian PP more easily converges to unambiguous support than BS values). The difference between our codon-naive nucleotide ML and Sousa et al.'s amino acid MRC trees tells us that it is mostly information from the 3rd codon position that triggers a Septaphyta versus the rest split for these four genes — ie. potentially synonymous substitutions that Sousa et al. filtered against.

Where does the high support for the Septaphyta clade in their combined tree come from? That tree is based on a matrix that should have a signal in-between our codon-naive nucleotide and their amino acid analysis.

A five-taxon problem with a glitch

Sousa et al.'s study is exemplary, in that it provides a careful, and well documented, analysis of the combined data. If you want to infer a potentially good tree, this is one way to do it.

However, their Septaphyta clade is most likely a branching artifact. It still combines data that, genuinely, provides not only diffuse but conflicting information about how the main lineages of land plants diverged from each other (Fig. 6). No analysis, no matter how sophisticated and well-crafted, can compensate for the deficits of the underlying data. By filtering out "noise", one also filters out actual conflicting signal. In this case, this is about how liverworts, mosses, and hornworts stand in relation to the extremely long-branched and divergent outgroup, the green algae, and their increasingly evolved siblings, the higher land plants (lycophytes, ferns, and seed plants). It is another example of what I pointed out in last week's post: Big Data invites big (ie. well supported) errors.

It is important to realize that, although we use many more OTUs, we are still looking at a five-taxon problem. When our data supports one split (or prefers it, being biased or not), there are only three more alternatives to select from.

Ingroup-outgroup LBA draws the hornworts, as the genetically most distinct (longest-rooted) lineage of the "bryophytes", away from liverworts and the lycophyte Huperzia, which connects the much more diverged higher land plants to the bryophytes. This leaves three alternatives:

Liverworts are the sister of higher land plants. Their mitochondria show some affinity, but only to the lycophytes, mostly the low-divergent and better sampled Huperzia; and often together with the hornworts, ie. a split incompatible with the hornwort-green algae versus the rest split.
Mosses are the sister of higher land plants, but their mitochondria show very little affinity to any of them (including Huperzia). In fact, they seem to have the most primitive of all land plant mitochondria.
Septaphyta are monophyletic, as the trade-off with the least conflict. Being (much) less diverged than the higher seed plants, they are genetically closer, and ± equally close, to the hornworts and the least-evolved higher land plant, the lycophyte Huperzia.

Sousa et al.'s codon-degenerate approach enforced ingroup-outgroup LBA between the hornworts (the worst-sampled ingroup) and the green algae, while decreasing the absolute distance between liverworts and mosses, and increasing their distance to the higher land plants. That is, Alternative 3 outcompetes Alternative 1. Alternative 2 has no support in the data.

Are the mosses sister to all land plants?

Probably not. Just because the Septaphyta clade is an artifact, it doesn't mean the Septaphyta cannot be monophyletic — it just means the mitochondrial genes don't provide any clear signal to support or reject such a hypothesis, or any other alternative. The same applies to the mosses as the first diverging lineage; their position in earlier trees is likely also to be an artifact — not a branching, but a data artifact. If their mitochondrial genomes are still very similar to that of the common ancestor of all land plants, then they should be placed like an ancestor in the tree — as a short-branched sister to all of their "offspring", the remaining land plant mitochondria.

Eight of the nine genes that support a moss + outgroup versus the rest split fail to resolve a moss clade. This is a clear indication that the moss mitochondria are simply primitive (at all gene positions that matter). What divides them from most (or all) other land plants are symplesiomorphies, ie. shared but ancestral sequence patterns. The only gene that prefers both splits at once, mosses as sister to all other land plants as well as a moss clade, is nad4 (BS = 67 and 62, respectively); but only when using nucleotides.

Fig. 8 A small but important difference: the codon-naive ML nucleotide (nt) tree (left) shows a moss clade as sister to all other land plants. The Bayesian amino acid (aa) MRC tree for the same gene shows a wrong split (purple internode) between green algae + ferns + angiosperms (long-branch, prominent roots) and bryophytes + lycophytes (mostly short-branched, short roots). By translating nucleotides into amino acids one may eliminate genuine discriminative signal encoded in synonymous substitutions, while in other, faster evolving parts of the tree, the same site/gene is oversaturated/biased. The poorly supported sister relationship of Roya and Gonatozygon within the green algae in the nt ML tree is an artifact, correctly resolved by the aa tree based on the same gene.

The shift from nucleotide data (ML / BS) to amino acid data (Bayesian MRC) triggers ingroup-outgroup LBA between green algae and ferns + seed plants (PP = 0.53; 'short-branch culling' of bryophytes and lycophyte Huperzia), and results in a branching artifact — the monophyly of higher land plants is well established, and hence they should form a clade.

By contrast, the genes providing strong support for a moss clade (such as atp1, atp8, ccmB, cob, cox1, cox3, nad2, nad5, rpl6, rpl16, rps3, rps13, and rps14) fail to resolve any deep relationships at all, or prefer different alternatives (including the Septaphyta hypothesis: atp8, rps3, rps13). The combined tree's solution is therefore a least-conflicting one, again — a moss clade (based on a consistent signal in the majority of genes: 13 with BS ≥ 90; in total 24 with BS ≥ 58) as sister to the rest of the land plants (based on a signal found in other genes not reflecting the monophyly of mosses). This solution adds to the phenomenon that moss mitochondria are generally primitive (ie. show a variant basically ancestral to all other land plants), and doesn't conflict with a wide range of otherwise conflicting splits strongly supported by individual genes (in contrast to the Septaphyta clade, see Fig. 6).

Conclusion

Having spent some time with the data and gene trees, I have little hope that mitochondrial data can be used to resolve the deep relationships between land plants. Each tweaking may result in something different, and the support-after-tweaking will be inflated.

Nevertheless, it will be worthwhile to close the data gaps, especially for the hornworts. This may not solve the 5-taxon problem,* but may give unique insights into how the mitochondrial genome evolved and sorted during the initial radiation of land plants.

Notably, the mitochondriomes of land plants can differ in the arrangement of their genes; which means that they recombined with or within the nucleome (or even plastome). While in some plants the mitochondriome is passed on via both parents (like in Ginkgo or Cycas), in others it is only the mother (most, maybe all, angiosperms). Plants may have colonized land more than once, and expanded quickly, so that lineage crossing and also lineage sorting may be an issue — marine species can be cosmopolitan and genetically heterogeneous (cryptic speciation). Thus, some mitochondrial genes may tell different stories from others. Instead of trying to solve which of the alternatives is correct (which is what most phylogenetic literature revolves around), we should find out which gene or part of the genome agrees with which alternative, as they may be all true.

The question to address with mitochondrial data cannot be whether mosses, liverworts or hornworts are the first diverging branch of extant land plants, but should be why moss, liverwort and hornwort mitochondriomes show different stages of evolution, as exemplified by the nad4 trees in Fig. 8.

Data availability

An archive including the support consensus networks (in Splits-NEXUS format) and inferred gene ML trees (plain NEWICK), as well as the comprehensive split support table (XLSX format), has been uploaded to figshare.

* It may help to have an in-depth analysis of a more focused taxon set with no data gaps that minimizes the risk of LBA. This starts with a better selection of taxa representing the higher land plants:

Oryza (the rice) is a domesticated, much cultivated, and thus extremely evolved and polyploid monocot. If there is any deep signal embedded in the mitochondria of seed plants, the mitochondrion of rice is probably the last place to look for it.
When trying to resolve the deepest land relationships, including a Gnetidae like Welwitschia (a genus that is an evolutionary oddball to start with), makes equally little sense — like any of the three surviving genera of this unique gymnosperm lineage, it is genetically the outer-most tip of an iceberg. Each mutation in its genome is the product of an unknown number of divergences in the past.
If any seed plant should be included at all, it would be more than sufficient to have Liriodendron, a magnoliid, and thus a member of the least-diverged angiosperm lineage, plus Cycas, as a representative of an ancient, slow-evolving gymnosperm lineage. These are much more recent additions to the plant Tree of Life.
Being a tip of an iceberg applies even more to Isoetum. It is strikingly similar only to the other lycophyte, but it has more data gaps and is much more diverged, and thus can invite branching artifacts. When one wants to dig deep, the much more primitive Huperzia is obviously the better representative.
Last, the green algae are the only possible outgroups for inference, but they are poor for this – apparently, their mitochondria have evolved much farther from the common ancestor than those of the land plants. Rather than inferring trees including them, one should infer trees without them, and then optimize their position within trees that will then potentially be unbiased by outgroup-LBA — eg. using the evolutionary placement algorithm, to test the land plant root. An interesting experiment could also be to infer the sequence of the common ancestor(s) of modern-day green algae (lacking a time machine to sample it), and use them instead. The new RAxML-NG, for example, allows for ancestral state reconstruction of nucleotides.

In addition, standard 4-base substitution models are not the best choice when analyzing matrices with a high proportion of ambiguous base calls, like Sousa et al.'s codon-degenerate matrix (note that Sousa et al. already applied models that compensate for substitutional bias). This is especially so, given the importance of synonymous mutations to resolve relationships in the slow-evolving lineages, and slow evolving genes. One could try to use ambiguity-aware substitution models instead. The newest releases of RAxML-NG (Kozlov et al. 2019, Bioinformatics 35: 4453–4455) include models for "phased" and heterozygous data — ie. models that can make use of ambiguity codes as additional information during tree inference (see also Potts et al. 2014, Syst. Biol. 63:1–16).

How I would (realistically) analyze SARS-CoV-2 (or similar) phylogenetic data

2020-06-15T00:30:00.000+02:00

While writing this post, the GISAID database reported over 40,000 SARS-CoV-2 genomes (a week before it was only 32,000), which is rather a lot for a practical data analysis. There have been a few posts on the RAxML Google group about how to analyze such large datasets, and speed up the analysis:

How to run ML search and BS 100 replicates most rapidly for a 30000 taxa * 30000 bp DNA dataset

In response, Alexandros Stamatakis, the developer of RAxML, expressed the basic problem this way:

Nonetheless, the dataset has insufficient phylogenetic signal, and thus it can and should not be analyzed using some standard command line that we provide you here; but requires a more involved analysis, carefully exploring if there is sufficient signal to even represent the result as a binary/bifurcating tree, which I personally seriously doubt.

As demonstrated in our current collection of recent blog posts, we also doubt this. One user, having read some of our posts, wondered whether we can't just use the NETWORK program to infer a haplotype network, instead. Typically, the answer to such a question is "Yes, but..."

So, here's a post about how I would design an experiment to get the most information out of thousands of virus genomes (see also: Inferring a tree with 12000 [or more] virus genomes).

Why trees struggle with resolving virus phylogenies and reconstructing their evolution. X, the genotype of Patient Zero (first host, not first-diagnosed host) spread into five main lineages. All splits (internodes, taxon bipartitions) in this graph are trivial, ie. one tip is separated from all others. Thus, they, and the underlying data, cannot provide any information to infer a tree, which is a sequence of non-trivial taxon bipartitions. For instance, an outgroup (O)-defined root would require sampling the 'Source' (S), the all-ancestor, hence, defining a split O+S | X+A+B+C+D+E. All permutations of X+descendant | rest should have the same probability, leading to a 5-way split support (BS = 20, PP = 0.2). In reality, however, tree-analyses, Bayesian inference more than ML bootstrapping, may prefer one split over any other, eg. because of long-branch attraction between C and D and 'short-branch culling' of X and E. See also: Problems with the phylogeny of coronaviruses and A new SARS-CoV-2 variant.

Start small

Having a large set of data doesn't mean that you have to analyze it all at once. Big Data does not mean that we must start with a big analysis! The reason we have over 40,000 CoV-2 genomes is simply the recent advances in DNA sequencing, and that we have effectively spread the virus globally, to provide a lot of potential samples.

The first step would thus be:

Take one geographical region at a time, and infer its haplotype network.

This will allow us to define the main virus types infecting each region. It will also eliminate all satellite types (local or global) that are irrelevant for reconstructing the evolution of the virus, as they evolved from a designated ancestor, which is also included in our data.
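As a rough illustration of this first step, the sketch below collapses aligned sequences into haplotypes per region and keeps the frequent ones as candidate main types. The record layout, the region labels and the `min_count` threshold are assumptions made for the example; in practice, satellite types would be identified from the haplotype network itself (as descendants of an ancestor present in the data), not from frequency alone.

```python
from collections import Counter, defaultdict

def haplotypes_by_region(records):
    """Collapse identical aligned sequences into haplotype counts per region.

    records: iterable of (sample_id, region, aligned_sequence) tuples.
    """
    by_region = defaultdict(Counter)
    for _sample_id, region, seq in records:
        by_region[region][seq.upper()] += 1
    return by_region

def main_types(counts, min_count=2):
    """Haplotypes seen at least min_count times (hypothetical stand-in for
    'main virus types'; the threshold is an assumption, not from the post)."""
    return [seq for seq, n in counts.most_common() if n >= min_count]

# Toy data, made up for illustration only.
records = [
    ("s1", "Europe", "ACGTAC"),
    ("s2", "Europe", "ACGTAC"),
    ("s3", "Europe", "ACGTAT"),
    ("s4", "Asia", "ACGTAC"),
    ("s5", "Asia", "ATGTAC"),
    ("s6", "Asia", "ATGTAC"),
]

regions = haplotypes_by_region(records)
for region, counts in sorted(regions.items()):
    print(region, main_types(counts))
```

The reduced per-region lists are what would then be passed on to the network inference, instead of every single accession.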

We can also search the regional data for recombinants: viruses may recombine, but to do so they need to come into contact, ie. be sympatric.

C/G→U mutations seen in several of the early sampled CoV-2 genomes: note their mixing-up within haplotypes collected from the cruiseship 'Diamond Princess' (from Using Median-networks to study SARS-CoV-2)

Go big

Once the main virus variants in each region are identified, we can filter them and then use them to infer both:

a global haplotype network, and
a global, bootstrapped maximum-likelihood (ML) tree.

The inference of the latter will now be much faster, because we have eliminated a lot of the non-tree-like signal ("noise") in our data set. The ML tree, and its bootstrap Support consensus network, will give us an idea about phylogenetic relationships under the assumption that not all mutations are equally probable (which they clearly aren't) — this provides a phylogenetic hypothesis that is not too biased by convergence or mutational preferences, eg. replacing A, C, and G by U (Finding the CoV-2 root).

On the other hand, the haplotype network (Median-joining or Statistical parsimony) may be biased, but it can inform us about ancestor-descendant relationships. Using the ML tree as guide, we may even be able to eliminate saturated sites or weigh them for the network inference, provided that the filtered, pruned-down dataset provides enough signal.

With the ML tree, bootstrapping analysis and haplotype networks at hand, it is easy to do things like compare the frequency of the main lineages, and assess their global distribution. This also facilitates the depiction of potential recombination: we can sub-divide the complete genome and infer trees/networks for the different bits, and then compare them.

Only based on nearly 80 CoV-2 genomes stored in gene banks by March 2020. The same can be done for any number of accessions, provided tools are used taking into account the reality of the data. The "x" indicate recombination, arrows ancestor-descendant relationships (from: Using Median networks to study SARS-CoV-2)

Change over time

The most challenging problem for tree inference and haplotype-network inference is the fact that virus genomes evolve steadily through time. That is, the CoV-2 data will include both the earliest variants of the virus as well as its many, diverse offspring: both ancestors and descendants are included among the (now) 40,000+ genomes. We have shown a number of examples where trees cannot handle ancestor-descendant relationships very well. Haplotype networks, on the other hand, are vulnerable to homoplasy (random convergences). So:

Take one time-slice and establish the amount of virus divergence at that time.

Depending on the virus diversity, one can use haplotype networks or distance-based Neighbor-nets (RAxML can export model-based distances). Even traditional trees are an option — by focusing on one time slice, we fulfill the basic requirement of standard tree-inference: that all tips are of the same age.

Then stack the time-slice graphs together, for a general overview.

It will be straightforward to establish which subsequent virus variant is most similar to which one in the slice before.
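The linking of consecutive time slices can be sketched as follows. This is a minimal example under stated assumptions: samples are binned into fixed-width slices by collection date, and similarity is measured as the plain Hamming distance between aligned sequences (a model-based distance could be substituted). The function names, the 30-day slice width and the toy data are all hypothetical.

```python
from datetime import date

def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def link_slices(samples, start, days_per_slice=30):
    """For each sample, find the most similar sample in the preceding slice.

    samples: list of (name, collection_date, aligned_sequence).
    Returns {name: (closest_name_in_previous_slice, distance)}.
    """
    slices = {}
    for name, day, seq in samples:
        idx = (day - start).days // days_per_slice
        slices.setdefault(idx, []).append((name, seq))

    links = {}
    for idx in sorted(slices):
        previous = slices.get(idx - 1)
        if not previous:
            continue  # first slice, or a gap in sampling
        for name, seq in slices[idx]:
            best_name, best_seq = min(previous, key=lambda p: hamming(seq, p[1]))
            links[name] = (best_name, hamming(seq, best_seq))
    return links

# Toy data, made up for illustration only.
samples = [
    ("A", date(2020, 1, 5), "ACGTACGT"),
    ("B", date(2020, 1, 20), "ACGTACGA"),
    ("C", date(2020, 2, 10), "ACGTACTT"),
    ("D", date(2020, 2, 25), "TCGTACGA"),
]
links = link_slices(samples, start=date(2020, 1, 1))
print(links)
```

Ties between equally similar predecessors are resolved arbitrarily here (first one wins); in a real analysis they would be worth flagging, since they may point to unsampled ancestors or homoplasy.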

Based on such networks, we can also easily filter the main variants for each time slice, to compile a reduced set for further explicit dating analysis, for example via the commonly used dating software BEAST (it was actually designed originally for use with virus phylogenies).

A stack of time-filtered Neighbor-nets (from: Stacking neighbour-nets: a real world example; see Stacking neighbour-nets: ancestors and descendants for an introduction)

Networks and trees go hand-in-hand

With the analyses above, it should be straightforward to model not only the spread of the virus (as GISAID tries to do using Nextstrain) but also its evolution – global and general, local and in-depth, and linear and reticulate.

The set of reconstructions will allow for exploratory data analysis. Conflicts between trees and networks are often a first hint towards reticulate history — in the case of viruses this will be recombination. Keep in mind that deep recombinants will not necessarily create conflict in either trees (eg. decreased bootstrap support) or networks (eg. boxes), but may instead result in long terminal branches.

There may be haplotypes in the regional networks that are oddly different, or create parallel edge-bundles. Using the ML guide-tree, we can assess their relationship within the global data set — whether they show patterns diagnostic for more than one lineage or are the result of homoplasy.

Likewise, there may be branches in the ML tree with ambiguous support, which can be understood when using haplotype networks (see eg., Tree informing networks explaining trees).

Era of Big Data, and Big Error

SARS-CoV-2 data form a very special dataset, but there are parallels to other Big Data phylogenomic studies. Many of these studies produce fully resolved trees, and it is often assumed that the more data are used, the more correct is the result. Further examination is thus deemed unnecessary (and it may be impossible, because of the amount of compiled data).

As somebody who worked at the coal-face of evolution, I have realized that the more data we have, the more complex the patterns we can extract from them. The risk of methodological bias will not vanish, and may even increase; and the more I then need to check which part of my data resolves which aspect of a taxonomic group's evolution.

This can mean that, rather than a single tree of 10,000 samples, it is better to infer 100 graphs that each reflect variation among 100 samples, and one overall graph that includes only the main sample types. Make use of supernetworks (eg. Supernetworks and gene incongruence) and consensus networks to explore all aspects of a group's evolution, in particular when you leave CoV-2 behind and tackle larger groups of coronaviruses (Hack and fish...for recombination in coronaviruses).

Hack and fish ... for recombination in the SARS group

2020-06-08T00:30:00.000+02:00

Following the current flow, we have had a few recent coronavirus posts here on the Genealogical World of Phylogenetic Networks. In this post, I'll show the results of a little experiment coming back to David's original post on the topic. Can we use trees to "fish" for evidence of recombination?

As David pointed out, even when we use a phylogenetic-tree inference method to analyze virus genomes, we don't really end up with a phylogenetic tree. Instead, we have a tree reflecting genetic similarity, which will reflect the phylogeny to some unknown extent. The main problem with virus genomes, however, is that they easily recombine — and thus different parts of a virus genome may have different evolutionary histories. A single tree cannot reflect this.

This does not mean that trees cannot tell us something about virus evolution. However, these trees become part of a fishing exercise, looking for different possible historical pathways, which may reflect recombination events.

The tree

Our SARS harvest matrix includes about a dozen sequence groups, which we have labeled Type 1 (the original SARS-CoV) to 9b. Type 7 is the new SARS-CoV-2. For my experiment here, I picked one place-holder sequence per main type (to speed up calculation time). I added two more types: the newly found direct sister of SARS-CoV-2; and some "unclassified" SARS-like viruses from pangolins, which earlier were proposed as sisters, as shown in this tree from the GISAID web page.

The phylogenetic neighborhood of SARS-CoV-2 (GISAID, screenshot captured 3/6/2020). Note the flatness of the CoV(-1; yellow) and CoV-2 (red) subtrees.

GISAID doesn't give the GenBank accession numbers, so we cannot easily say whether our sample matches theirs. However, the tree we can infer from the complete genomes (high-divergent, non-alignable regions excluded) looks very similar, as shown next, and some of the labels match up.

Fig. 1 Maximum likelihood (ML) tree inferred for our sample using (old, v.8.0.20) RAxML. Roman numerals refer to the earlier defined Types 1–9 (Tree and viruses – the SARS group); Arabic numerals give nonparametric bootstrap (BS) support based on 100 BS pseudoreplicates (number of necessary BS replicates determined by the extended majority rule criterion). Branches without an Arabic numeral are unambiguous (BS = 100).

Most importantly, all but three branches have unambiguous support: the phylogeny of this sample is resolved. Unfortunately, as our recurring readers already know, this nearly resolved tree simplifies a much more complex situation.

The Neighbor-net with recombinations and mutational trends (arrows, connectives; cf. Tree and viruses – the SARS group).

Hack and slash

A simple method to fish for different evolutionary histories in a genome is to cut the virus genomes into sub-sequences, infer a tree for each sub-sequence, and then compare the trees. Most researchers compare trees by showing them and discussing which one makes most sense. Here is an example from Corman et al. (2014), who searched for the root of MERS (Middle East Respiratory Syndrome) virus, an illness closely related to SARS.

Reprint of Corman et al. 2014, fig. 3 with colors added to EriCoV (green) and HKU/BtCoV (olive) groups

Each tree in their Fig. 4A and B (Bayesian majority rule consensus trees) was inferred from a different part of the genome. Corman et al.'s focus was to root the MERS viruses by identifying a better outgroup. However, note that the new sister-group (red, green stars – sister to MERS; orange stars – sister to someone else) moves, and so does the green EriCoV clade and the olive HKU/BtCoV group (clade in some trees and grade in others). Do some of these trees get it wrong? Or is, eg. NeoCoV the product of reticulate evolution (here: ancient recombination)? Some parts of its genome might be derived from a common ancestor with MERS (blues), and others from a common ancestor with KW2E (black) and EriCoV (green).

Our complete matrix has 27,333 characters, providing nearly 6,000 distinct alignment patterns (abbreviated DAP, below), which is a lot — the GISAID link above also provides a graphical representation of site divergence. However, probabilistic tree inference methods (ML, Bayes) can handle moderate to high levels of divergence in the data. On the other hand, they also need a certain amount of data to perform well (see also: Inferring a tree with 12000 [or more] virus genomes). So, for my experiment, I hacked the matrix into nine bits of equal size, ie. each submatrix has a bit more than 3,000 nucleotides, providing between 615 (bit #5) and 1029 (bit #1) DAPs.
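The hacking step and the DAP count can be sketched in a few lines. This is only an illustration: a distinct alignment pattern is taken here to be a unique alignment column, and RAxML's own count may treat gaps or ambiguity codes differently. The function names and the toy alignment are made up for the example.

```python
def distinct_patterns(alignment):
    """Number of distinct alignment columns (site patterns)."""
    return len(set(zip(*alignment)))

def split_into_blocks(alignment, n_blocks):
    """Cut an alignment (list of equal-length strings) into n_blocks
    consecutive blocks of (near-)equal length."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all sequences must have the same aligned length")
    bounds = [round(i * length / n_blocks) for i in range(n_blocks + 1)]
    return [[row[bounds[i]:bounds[i + 1]] for row in alignment]
            for i in range(n_blocks)]

# Toy alignment: three sequences, nine sites, cut into three blocks.
alignment = [
    "AAACCCGGT",
    "AAACCTGGT",
    "AAACCCGAT",
]
blocks = split_into_blocks(alignment, 3)
print(distinct_patterns(alignment))            # patterns in the full matrix
print([distinct_patterns(b) for b in blocks])  # patterns per block
```

With the real matrix, the same call with `n_blocks=9` would give blocks of about 3,037 sites each (27,333 / 9), and the per-block pattern counts indicate how much signal each bit can contribute.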

Fig. 2 Nine ML trees with BS support annotated along branches, each based on a ~3000 nucleotide long bit of the genomes (ordered left-right, top-bottom). Purple highlights branches conflicting with the complete genome tree.

Our nine trees (shown above) are not badly resolved, as most branches get substantial support. But they are not congruent. If we are dealing with recombination, then we might assume that all of these trees do show an actual aspect of the evolutionary history of the genomes. That is, they are all right and wrong, at the same time.
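Whether a branch in one of the bit-wise trees conflicts with the complete genome tree can be checked formally: two splits of the same taxon set are compatible (can occur in the same tree) only if at least one of the four pairwise intersections of their sides is empty. The sketch below applies this standard criterion to made-up splits; the taxon labels are placeholders, not our virus types.

```python
def compatible(split_a, split_b, taxa):
    """True if two splits (each given as one side of a bipartition of taxa)
    can be displayed by the same tree."""
    a, b = frozenset(split_a), frozenset(split_b)
    a_rest, b_rest = taxa - a, taxa - b
    # Incompatible only if all four intersections are non-empty.
    return not (a & b and a & b_rest and a_rest & b and a_rest & b_rest)

def conflicting_splits(tree_splits, reference_splits, taxa):
    """Splits of one tree that conflict with at least one reference split."""
    return [s for s in tree_splits
            if any(not compatible(s, r, taxa) for r in reference_splits)]

# Hypothetical five-taxon example.
taxa = frozenset("ABCDE")
reference = [frozenset("AB"), frozenset("DE")]   # eg. complete genome tree
bit_tree = [frozenset("BC"), frozenset("DE")]    # eg. tree of one bit

print(conflicting_splits(bit_tree, reference, taxa))
```

Here only the B+C split is reported, because it cuts across the A+B split of the reference tree; the D+E split is shared and therefore unproblematic.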

Moreover, we have highly supported clades conflicting with the topology of the complete genome tree (Fig. 1). The signal issues, due to recombination (see Trees and viruses...), did not decrease branch support. That is, 6,000+ DAP is a lot, and recombination only affects a part of the complete genome, possibly quite a small part.

Non-trivial evolution needs more than trivial graphs

To depict the reticulate phylogeny of the virus sample, we need to consider the differences seen in the hacked-and-slashed matrix trees. This can easily be illustrated using a network, instead of a set of trees, as shown here.

Fig. 3 A (strict) consensus network of all nine trees, in which the edge lengths give the sum of the branch lengths in the tree sample. The gray brackets give the topology of the near-fully resolved complete genome tree.

The graph above is a phylogenetic network: the competing edge bundles represent the different inferred histories of bits of the genomes. The SARS-CoV-2 lineage seems to be the product of (ancient) recombination, and recombination also played a role in forming the members of the original SARS-CoV group.

Fig. 4 Pruned consensus network showing only the CoV(-1) lineage exhibiting various levels of recombination within and between clades as defined by the complete genome tree (tree sample same as in Fig. 3).

Consensus networks can also be used to summarize the support for alternative splits, as shown next.

Fig. 5 Sum-support consensus network based on the bit-wise BS analyses (111/112 pseudoreplicates generated per bit). Only splits occurring in at least 20% of all BS replicates are shown, ie. splits supported by at least two bits; trivial splits are collapsed. Colored splits represent the corresponding groups/clades in the full-genome tree (Fig. 1). Inset: 'splits rose' showing competing split patterns within Types II and III (cf. the corresponding subtrees/-trunks in Fig. 2 and Fig. 4).

In contrast to the networks before (Figs. 3, 4), generated using the same algorithm*, the BS consensus network in Fig. 5 is not a phylogenetic network. The boxes don't reflect disparate histories of parts of the genomes but the varying support for competing topological alternatives. By summing up the bit-wise BS analyses instead of bootstrapping the entire data (the BS consensus network for the full data, Fig. 1, shows only two boxes), we get a better idea which aspects of the all-genome tree find robust support across the genome.**
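The arithmetic behind the 20% cut-off can be made explicit with a small sketch. It pools the split counts of the bit-wise BS analyses and expresses each split as a share of all replicates. The assumption that eight bits had 111 and one bit had 112 pseudoreplicates (1,000 in total) is ours for the example, and the split labels X, Y and Z are hypothetical.

```python
from collections import Counter

def pooled_support(bit_counts, replicates_per_bit):
    """Pool per-bit split counts; return each split's share of all replicates."""
    total = sum(replicates_per_bit)
    pooled = Counter()
    for counts in bit_counts:
        pooled.update(counts)
    return {split: n / total for split, n in pooled.items()}

# Assumption: eight bits with 111 and one bit with 112 pseudoreplicates.
replicates_per_bit = [111] * 8 + [112]
bit_counts = [Counter() for _ in replicates_per_bit]

bit_counts[0]["X"] = 111      # unambiguous, but in one bit only
bit_counts[1]["Y"] = 111      # unambiguous in two bits
bit_counts[2]["Y"] = 111
for counts in bit_counts:     # weakly supported, but in every bit
    counts["Z"] += 30

support = pooled_support(bit_counts, replicates_per_bit)
shown = sorted(s for s, share in support.items() if share >= 0.20)
print(shown)
```

Split X reaches only about 11% of the pooled replicates and is dropped, whereas Y (about 22%) and Z (27%) pass the cut-off. This is why the sum-support network shows only splits backed by at least two bits, or by low but consistent support across many bits.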

Conclusion

Sub-dividing an alignment is a really quick† way to fish for evidence of recombination, especially when one then uses a consensus network to summarise the resulting partial trees.

For interpretation, a tree is a very simple, trivial, and hence appealing graph: A is sister to B and so on. Even a child can interpret a tree. Networks are already visually more challenging, but whenever an organism's evolution doesn't follow a tree (as for viruses), we shouldn't use a tree to depict its phylogeny (or reconstruct its evolution).

Data availability

The dataset used for our experiment is a taxon subset of the original data set, available via figshare (with a permanent, hence, citable DOI):
Grimm GW, Morrison D. 2020. Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare Dataset. https://doi.org/10.6084/m9.figshare.12046581

References

Corman VM, Ithete NL, Richards LR, Schoeman MC, Preiser W, Drosten C, Drexler JF (2014) Rooting the phylogenetic tree of Middle East respiratory syndrome coronavirus by characterization of a conspecific virus from an African Bat. Journal of Virology 88: 11297–11303.

* SplitsTree includes five options to determine "edge weights" (= edge-lengths) in case of Consensus networks: "median" and "mean" average the branch-lengths in the tree sample; "count", the setting used to generate Support consensus networks, counts how often a certain taxon bipartition (split) is found in the tree sample – an edge length is proportional to the frequency of a split; "sum", used here to generate the first network, summarizes the branch-lengths; and "none" discards both branch-lengths and split frequency.
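To make the five options concrete, here is a minimal sketch of how an edge weight could be derived for a single split under each setting. It follows the verbal description in this footnote only and is not SplitsTree's implementation; in particular, averaging over only those trees that contain the split, returning the raw count for "count", and returning a constant for "none" are assumptions.

```python
from statistics import mean, median

def edge_weight(branch_lengths, mode):
    """Edge weight for one split.

    branch_lengths: lengths of the corresponding branch in those trees
    of the sample that contain the split (assumption: other trees are ignored).
    """
    if mode == "median":
        return median(branch_lengths)
    if mode == "mean":
        return mean(branch_lengths)
    if mode == "count":
        return len(branch_lengths)   # proportional to the split frequency
    if mode == "sum":
        return sum(branch_lengths)
    if mode == "none":
        return 1.0                   # assumption: all edges get equal length
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical split found in three of nine trees.
lengths = [0.02, 0.04, 0.03]
for mode in ("median", "mean", "count", "sum", "none"):
    print(mode, edge_weight(lengths, mode))
```

Note that "sum" rewards splits that are both frequent and long-branched, whereas "count" ignores branch lengths altogether, which is why the same tree sample can produce rather different-looking networks.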

** A split supported by only one of the nine bits, even if unambiguous, ie. present in all 111 (112) per-bit BS replicates, will not be represented in the sum-Support consensus network using a cut-off of 20%.

† The complete set of ML analyses took 20 min on a stand-alone computer; consensus networks are generated in a blink, and take hardly a minute even when using trees with many leaves.

To what degree are Median-joining networks phylogenetic?

2020-06-01T00:30:00.000+02:00

In a comment to the recent paper by Forster et al. (2020), Sánchez-Pacheco et al. (2020) argue that Forster et al.'s analysis is "neither phylogenetic nor evolutionary" because it's based on the use of a Median-joining network. They don't re-analyse the data, but instead mostly refer to a paper they published four years ago in Cladistics (Kong et al. 2016), the journal of the Willi-Hennig Society.

In that earlier paper, Kong et al. conclude:

Other than fast computation and very attractive graphics, MJNs [Median-joining networks] harbour no virtue for phylogenetic inference. MJNs are distance-based, unrooted branching diagrams with cycles that say nothing about the evolutionary history due to the absence of direction. MJ was introduced in 1999 and, in contrast to most scientific ideas, its application has spread rapidly through copying the methods of others, and, unfortunately, with little further scrutiny. We hope that the theoretical arguments presented here can reverse this trend.

It seems unlikely that it will, as I will argue here.

What makes a graph a phylogenetic tree or network – direction

Kong et al. argue that a line graph needs to be directed (ie. the edges indicate a time direction) in order to represent a phylogeny, which is a good point. After all, a phylogenetic tree is a directed (rooted) branching diagram that represents the hypothesized relationships among the organisms under study.

A phylogenetic tree (see also: Fritz Müller and the first phylogenetic tree)

A phylogenetic network is the generalization of a phylogenetic tree, as it combines lineage splits (divergences) with lineage anastomoses.

A phylogenetic network including a reticulation leading to a circle in the graph — B is the product of crossing of lineages that produced its sisters A and C.

Since a MJ network is, per se, an undirected graph, it thus cannot be an explicit phylogenetic network.

However, following this argument, few inferred trees are directly a phylogenetic tree, either — including the Nextstrain-generated tree on the GISAID page that is promoted by Mavian et al. 2020 (which is another comment to Forster et al., focusing on data issues). Irrespective of which criterion we use to optimize the tree, almost all trees we infer (with no matter what tree inference software) are unrooted graphs — in general, we root them only after the analysis, by defining one leaf or a subtree as an outgroup. (Note, this includes those based on parsimony, the method of choice of the Willi-Hennig Society and Cladistics to this day.)

The difference between inference and interpretation: Using the tip sequences, we can infer a single most parsimonious (6-step long; using PAUP*'s branch-and-bound or NETWORK's MJ algorithm), but also most likely and shortest (distance-based), unrooted tree. By defining a root – here: one taxon designated as outgroup and assuming that all single-taxon-unique sequence patterns are autapomorphies – we can interpret the inferred tree as four different phylogenetic trees.

The same can be said of MJ networks — outgroup rooting can be applied (Finding the CoV-2 root).

Difficulty in depicting ancestor-descendant relationships

A phylogenetic relationship focuses on ancestors, which, for the purpose of inferring a phylogenetic tree, are considered to be purely hypothetical, although they are not hypothetical in a MJ network (or related graphs). We can easily create character sets where the inferred tree will not "represent the hypothesized relationship". Most parsimony studies show a strict consensus cladogram of most-parsimonious trees (MPT). This is unproblematic, as long as all leaves have the same age, and all of the cladogenic events resulted in unique, lineage-conserved character patterns. We then:

would infer only a single MPT;
have no zero-length branches.

So, following Kong et al.'s logic, the trees inferred from any dataset that results in more than one MPT, with subtrees including zero-length branches (like our example above), cannot qualify as phylogenetic trees.

Median-joining networks are, like MP trees (both use parsimony as the optimality criterion), vulnerable to homoplasy (Using Median networks to study SARS-CoV-2; see also Mavian et al. 2020), but while a MP tree (or any other tree we infer) cannot resolve ancestor-descendant relationships, MJ networks can (see eg. Why do we still use tree for Neanderthal genealogy).

Median (or MJ) network, left, and MPT, right, inferred from a perfect matrix. "x" = all-ancestor, ie. represents the root. "a" is the ancestor of "B" and "C", "d" of "f" to "H", "f" of "G" and "H". While the median network depicts all ancestor-descendant relationships, the MPT only depicts them indirectly by trichotomies including the ancestor as zero-length branch.

Imperfect matrices (data including homoplasy) lead to wrong edges and branches. Being able to recognize ancestors, the MJ network comes closer to the phylogenetic tree (same as above; from Clades, cladograms, cladistics, and why networks are inevitable).

Hence, Bandelt et al.'s (1999) statement, as cited by Kong et al., that “reconstructing phylogenies from intraspecific data ... is often a challenging task because of large sample sizes and small genetic distances between individuals”. Such data results in largely uninformative, comb-like MPT strict consensus trees. This is because identical sequences, equally probable alternative pathways, non-dichotomous differentiation patterns, and ancestral sequence variants present in the data increase the number of MPTs (sometimes to near-infinity). This leads to the collapse of branches in the strict consensus tree used to summarize the MPT sample. Probabilistic methods struggle, too, because the likelihood surface of the tree space is too flat to make a call.

[Kong et al. point to the mathematical definition of 'network', as "nothing more than an unrooted branching diagram with reticulation" but not of 'tree', which they consider is always a directed acyclic graph, ie. synonym to 'phylogenetic tree'. However, it is, inference-wise, clearly nothing more than an unrooted branching diagram without reticulation.]

Confusing heuristics with principle

To discredit the MJ network, Kong et al. then "... focus on its phenetic nature."

There is a tendency among cladists to dismiss a method as "distance-based", as this is treated as synonymous with phenetics. In reply, Joe Felsenstein commented on this alleged fundamental difference between distance-based and parsimony methods of tree inference (Felsenstein 2004, Chapter 10, p. 145f, The irrelevance of classification):

The terminology is also affected by the lingering emphasis on classification. Many systematists believe that it is important to label certain methods (primarily parsimony methods) as "cladistic" and others (distance matrix methods, for example) as "phenetic". These are terms that have rather straightforward meaning when applied to methods of classification. But are they appropriate for methods of inferring phylogenies? I don't think they are. Making this distinction implies that something fundamental is missing from the "phenetic" methods, that they are ignoring information that the "cladistic" methods do not. In fact, both methods can be considered to be statistical methods, making their estimates in slightly different ways.

The following chapter in Felsenstein's book (Chapter 11, pp. 147–175) deals exclusively with the "phenetic" distance matrix methods because they were the first to be used to infer phylogenetic trees (their limitations are outlined on pp. 174f).

Because the inference of a MJ network starts from the generation of a Minimum-spanning network, which is generated from a distance matrix, Kong et al. argue the MJ network is merely a distance-based graph, ie. "phenetic", and "not phylogenetic". Any NP-hard problem requires heuristics, but the fact that we use a distance-based graph to start with doesn't determine whether the end-product is or is not a distance-based graph.
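To make the "distance-based starting point" concrete, here is a minimal sketch of that first step only: pairwise Hamming distances between aligned sequences, followed by a plain minimum spanning tree built with Kruskal's algorithm. It is not the NETWORK implementation, and it returns a single tree rather than a Minimum-spanning network (which would keep equally short alternative edges). The toy haplotypes and all function names are assumptions made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs: dict[str, str]) -> list[tuple[str, str, int]]:
    """Kruskal's algorithm on the complete graph of pairwise Hamming distances."""
    # Sort all pairs by distance. Ties are broken by name order here, whereas
    # a minimum spanning *network* would keep equally short alternatives.
    edges = sorted(
        (hamming(seqs[u], seqs[v]), u, v)
        for u, v in combinations(sorted(seqs), 2)
    )
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent pointers to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for dist, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v, dist))
    return tree


# Hypothetical toy haplotypes (not taken from any dataset discussed here).
haplotypes = {"A": "0000", "B": "1000", "C": "1100", "D": "1110"}
mst = minimum_spanning_tree(haplotypes)
total_length = sum(d for _, _, d in mst)
print(mst, total_length)
```

The output of this step is only a scaffold; which criterion the final graph is optimized under is a separate question.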

For instance, the Neighbor-joining (NJ) algorithm (Felsenstein 2004, p.166ff) is a cluster algorithm, which finds a phylogenetic tree fulfilling either the Minimum evolution (ME, p. 159f) or Least-squares (LS) criteria (p. 148ff). Thus, the tree inferred is, indeed, based on a distance-matrix via NJ, but it is not a cluster dendrogram — instead, it is a ME or LS optimised phylogenetic tree. Similarly, FastTree, IQTree, and RAxML are extremely fast programmes to infer Maximum likelihood (ML) phylogenetic trees; but, while FastTree and IQTree start with "phenetic" Neighbor-joining trees, RAxML (like GARLI before) infers first a quick-and-dirty parsimony tree. The final product in both cases is a topology optimized under ML, and the results are hence ML trees and not distance-based or MP trees (even though they started that way).
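As an illustration of how a distance matrix can yield a tree with branch lengths rather than a cluster dendrogram, the following is a minimal sketch of the textbook Neighbor-joining agglomeration (pair selection by the Q criterion, then branch-length and distance updates). It is not the code of any of the programs named above; the function name, the `n0`, `n1`, ... labels for internal nodes, and the toy additive distances are assumptions for illustration.

```python
def neighbor_joining(names, dist):
    """Return an unrooted NJ tree as a list of (node, node, branch_length).

    `dist` maps frozenset({a, b}) to the distance between a and b.
    Internal nodes are labelled n0, n1, ... (hypothetical labels).
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = nodes[x], nodes[y]
                q = (n - 2) * d[frozenset((i, j))] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        dij = d[frozenset((i, j))]
        new = f"n{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        length_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        length_j = dij - length_i
        edges += [(i, new, length_i), (j, new, length_j)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Hypothetical additive distances generated from the unrooted tree
# ((A:1, B:2):1, (C:3, D:4)); not real data.
pairwise = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
            ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
dist = {frozenset(pair): value for pair, value in pairwise.items()}
tree = neighbor_joining(["A", "B", "C", "D"], dist)
for edge in tree:
    print(edge)
```

On additive input like this, the sketch recovers the generating branch lengths, which is what distinguishes the result from a mere similarity clustering.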

The final MJ network shows the most parsimonious evolutionary pathways that change one sequence type into another. When you infer it with the NETWORK program, all inferred mutations are mapped onto the final graph, and, using the Steiner post-analysis step, you can look through all of the MP trees that have been included in this graph. However, according to Kong et al. these are not MP trees:

[Following Farris (1970)] Invoking principles of parsimony does not validate a phenetic technique as being a phylogenetic method. Indeed, the best Steiner trees are not necessarily the most parsimonious trees.

Kong et al. did not provide any real-world data examples; possibly because they would be very difficult to find. Just take my simple example above: clearly the MJ tree is actually a most-parsimonious solution to the data. Alternatively, you could take any data set for which you can infer plausible MPTs (ie. data where the rate of change is low), eg. using the TNT program, and compare the result – the Consensus network of all MPTs, not the collapsed strict consensus tree (Stop using cladograms!) – with the Steiner trees inferred using NETWORK and the MJ algorithm.

Are medians ancestors? Do cycles represent reticulation?

Kong et al.'s final point is:

BEA99 [ie. Bandelt et al. 1999] stressed that median vectors can be interpreted biologically as existing unsampled or extinct ancestral sequences (i.e. they can represent missing intermediates; Fig. 3). However, a median vector in an MJ analysis is a sequence generated by majority, and is a mathematically drawn point in the final MJN that connects a triplet of sequences. The resulting “evolutionary paths in the form of cycles” (BEA99, p. 37) merely illustrates the failure of the algorithm to choose between alternative, equally optimal connections due to the modification of Kruskal’s algorithm. Consequently, a cycle represents an analytical artefact rather than an evolutionary scenario (Salzburger et al., 2011).

It is obvious to anyone who has ever used MJ networks, and is familiar with their own data, that Medians are likely to be ancestors, and that medians separated by parallel edge bundles are usually alternative ancestors. But, like all inferences, MJ networks may not capture the complex truth.
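The "sequence generated by majority" in the quote above can be sketched in a few lines: for a triplet of aligned sequences, take the majority state at each site. This is only the site-wise majority rule, not the full MJ procedure; the function name, the `?` placeholder for sites where all three states differ, and the toy sequences are assumptions for illustration.

```python
def median_vector(a: str, b: str, c: str) -> str:
    """Site-wise majority consensus of three aligned sequences."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for x, y, z in zip(a, b, c):
        if x == y or x == z:
            out.append(x)
        elif y == z:
            out.append(y)
        else:
            # Three different states: no majority exists. The "?" is a
            # placeholder chosen for this sketch only.
            out.append("?")
    return "".join(out)


def hamming(a: str, b: str) -> int:
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy triplet (not real data).
s1, s2, s3 = "ACGTA", "ATGCA", "ACGCT"
median = median_vector(s1, s2, s3)
steps = [hamming(median, s) for s in (s1, s2, s3)]
print(median, steps)
```

In this toy case the median is a sequence that was not sampled, lying one step from each member of the triplet, which is the situation in which it is natural to read it as a candidate unsampled ancestor.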

A phylogeny involving a recombination ...

... and the MJ network that can be inferred on the same data including two wrong edges (red). The West-1/East-ancestor recombinant is resolved as the product of hybridisation of the West- and East-ancestors, while West-2, a descendant of West-1, is resolved as hybrid of West-1 and the recombinant. Any tree included in the network would have 7 steps (ie. is most-parsimonious).

These reconstructed medians thus do bring Kong et al. to their only valid point, which, however, doesn't apply to the method as proposed, but is instead a common misinterpretation of MJ networks — their cycles do not necessarily reflect reticulation.

Bandelt et al. (1999) clearly state that the MJ network is only an approximation, to deal with complex situations. The cycles usually represent equally optimal alternative pathways, and are usually the result of homoplasy but not reticulation. The final goal is hence to get a graph with as few reticulations as possible but as many as are necessary (see NETWORK manual on selection of the epsilon parameter and weighting).

The Sánchez-Pacheco et al. critique of Forster et al.

As I showed in an earlier post using actual CoV-2 data, only this part of Sánchez-Pacheco et al.'s critique of Forster et al.'s paper is valid: we do need to be very careful before we interpret parallel edge bundles in virus-based (or other) MJ networks as being evidence for reticulation. MJ networks can be phylogenetic networks, but they are still consensus networks of competing, equally parsimonious alternatives. If we take a strict position, then most MJ networks are probably not phylogenetic networks; but neither are all trees phylogenetic trees.

Everything else in their comment is simply cladistic lore. Most importantly, their critique ignores the fact that the obvious alternative to MJ networks when analysing low-divergent virus data, which is parsimony-based trees, has exactly the same data-inherent shortcomings, ie. vulnerability to homoplasy and the impossibility of detecting and reconstructing recombination. Such trees also have an extra one: they treat all samples as being of the same age and generation, and thus have to resolve actual ancestors as being sisters, which increases the number of possible, equally parsimonious solutions.

The 19, 7-step long MPTs that can be inferred for the recombination example using PAUP*'s branch-and-bound algorithm – rooted with the Source, the common ancestor of all ("AllAnc") – and their strict consensus tree (gray background, 11-steps long). "Best" shows a phylogenetic tree that comes closest to the true tree: ancestors are resolved as zero-length tips in clades including their descendants. "Close" denotes trees that only misplace the recombinant (purple), which – being a recombinant of the East ancestor (red) and West-1 (blue) – should be placed in a tree as sister to either parent.

The consequence of this is that what Sánchez-Pacheco et al. and Kong et al. criticize about the MJ networks applies even more to the predominately used phylogenetic trees. As David pointed out earlier (Problems with the phylogeny of coronaviruses): virus trees may be inferred using phylogenetic methods but they effectively depict only similarity patterns.

References

Bandelt H-J, Forster P, Röhl A. 1999. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37-48.

Forster P, Forster L, Renfrew C, Forster M. 2020. Phylogenetic network analysis of SARS-CoV-2 genomes. PNAS 117:9241–9243.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Mavian C, Kosakovsky Pond S, Marini S, Rife Magalis B, Vandamme A-M, Dellicour S, Scarpino SV, Houldcroft C, Villabona-Arenas J, Paisie TK, Trovão NS, Boucher C, Zhang Y, Scheuermann RH, Gascuel O, Tsan-Yuk Lam T, Suchard MA, Abecasis A, Wilkinson E, de Oliveira T, Bento AI, Schmidt HA, Martin D, Hadfield J, Faria N, Grubaugh ND, Neher RA, Baele G, Lemey P, Stadler T, Albert J, Crandall KA, Leitner T, Stamatakis A, Prosperi M, Salemi M. 2020. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable. PNAS doi:10.1073/pnas.2007295117.

Sánchez-Pacheco S, Kong S, Pulido-Santacruz P, Murphy RW, Kubatko L. 2020. Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary. PNAS, doi:10.1073/pnas.2007062117.

General remarks on rhyming (From rhymes to networks 2)

2020-05-25T00:30:00.000+02:00

In this month's post, I want to provide some general remarks on rhyming and rhyme practice. I hope that they will help lay the foundations for tackling the problem of rhyme annotation, in the next post. Ideally, I should provide a maximally unbiased overview that takes all languages and cultures into account. However, since this would be an impossible task at this time (at least for myself), I hope that I can, instead, look at the phenomenon from a viewpoint that is a bit broader than the naive prescriptive accounts of rhyming used by teachers to torture young school kids mentally.

What is a rhyme?

It is not easy to give an exact and exhaustive definition of rhyme. As a starting point, one can have a look at Wikipedia, where we find the following definition:

A rhyme is a repetition of similar sounds (usually, exactly the same sound) in the final stressed syllables and any following syllables of two or more words. Most often, this kind of perfect rhyming is consciously used for effect in the final positions of lines of poems and songs. Wikipedia: s. v. "Rhyme", accessed on 21.05.2020

This definition is a good starting point, but it does not apply to rhyming in general, but rather to rhyming in English as a specific language. While stress, for example, seems to play an important role in English rhyming, we don't find stress being used in a similar way in Chinese, so if we tie a definition of rhyming to stress, we exclude all of those languages in which stress plays a minor role or no role at all.

Furthermore, the notion of similar and identical sounds is also problematic from a cross-linguistic perspective on rhyming. It is true that rhyming requires some degree of similarity of sounds, but where the boundaries are being placed, and how the similarity is defined in the end, can differ from language to language and from tradition to tradition. Thus, while in German poetry it is fine to rhyme words like Mai [mai] and neu [noi], it is questionable whether English speakers would ever think that words like joy could form a rhyme with rye. Irish seems to be an extreme case of very complex rules underlying what counts as a rhyme, where consonants are clustered into certain classes (b, d, g, or ph, f, th, ch) that are defined to rhyme with each other (provided the vowels also rhyme), and as a result, words like oba and foda are judged to be good rhymes (Cuív 1966).

When looking at philological descriptions of rhyme traditions of individual languages, we often find a distinction between perfect rhymes on the one hand and imperfect rhymes on the other. But what counts as perfect or imperfect often differs from language to language. Thus, while French largely accepts the rhyming of words that sound identical, this is considered less satisfactory in English and German, and studies seem to have confirmed that speakers of French and English indeed differ in their intuitions about rhyme in this regard (Wagner and McCurdy 2010).

Peust (2014) discusses rhyme practices across several languages and epochs, suggesting that similarity in rhyming was based on some sort of rhyme phonology, that would account for the differences in rhyme judgments across languages. While the ordinary phonology of a language is a classical device in linguistics to determine those sounds that are perceived as being distinctive in a given language, rhyme phonology can achieve the same for rhyming in individual languages.

While this idea has some appeal at first sight, given that the differences in rhyme practice across languages often follow very specific rules, I am afraid it may be too restrictive. Instead, I rather prefer to see rhyming as a continuum, in which a well-defined core of perfect rhymes is surrounded by various instances of less perfect rhymes, with language-specific patterns of variation that one would still have to compare in detail.

Beyond perfection

If we accept that all languages have some notion of a perfect rhyme that they distinguish from less perfect rhymes, which will, nevertheless, still be accepted as rhymes, it is useful to have a quick look at differences in deviation from the perfect. German, for example, is often used as an example where vowel differences in rhymes are treated rather loosely; and, indeed, we find that diphthongs like the above-mentioned [ai] and [oi] are perceived as rhyming well by most German speakers. In popular songs, however, we find additional deviations from the perceived norm, which are usually not discussed in philological descriptions of German rhyming. Thus, in the famous German Schlager Griechischer Wein by Udo Jürgens (1934-2014), we find the following introductory lines:

Es war schon dunkel, als ich durch Vorstadtstrassen heimwärts ging.
Da war ein Wirtshaus, aus dem das Licht noch auf den Gehsteig schien.
[Translation: It was already dark, when I went through the streets outside of the city. There was a pub which still emitted light that was shining on the street.]

There is no doubt that the artist intended these two lines to rhyme, given that the song as a whole follows a strict rhyme schema of AABCCB. So, in this particular case, the artist judged that rhyming ging [gɪŋ] with schien [ʃiːn] would be better than not attempting a rhyme at all, and it shows that it is difficult to assume one strict notion of rhyme phonology to guide all of the decisions that humans make when they create poems.

More extreme cases of permissive rhyming can be found in some traditions of English poetry, including Hip Hop (of course), but also the work of Bob Dylan, who does not have a problem rhyming time with fine, used with refused, or own with home, as in Like a Rolling Stone. In Spanish, where we also find a distinction between perfect (rima consonante) and imperfect rhyming (rima asonante), basically all that needs to coincide are the vowels, which allows Silvio Rodríguez to rhyme amor with canción in Te doy una canción.

While most languages coincide on the notion of perfect rhymes (notwithstanding certain differences due to general differences in their phonology), the interesting aspects for rhyming are those where they allow for imperfection. Given that rhyming seems to be something that reflects, at least to some extent, a general linguistic competence of the native speakers, a comparison of the practices across languages and cultures may help to shed light on general questions in linguistics.

Rhyming is linear

When discussing with colleagues the idea of making annotated rhyme corpora, I was repeatedly pointed to the worst cases, which I would never be able to capture. This is typical for linguists, who tend to see the complexities before they see what's simple, and who often prefer to not even try to tackle a problem before they feel they have understood all the sub-problems that could arise from the potential solution they might want to try.

One of the worst cases, when we developed our first annotation format as presented last year (List et al. 2019), was the problem of intransitive rhyming. The idea behind this is that imperfect rhyming may lead to a situation where one word rhymes with a word that follows, and this again rhymes with a word that follows that, but the first and the third would never really rhyme themselves. We find this clearly stated in Zwicky (1976: 677):

Imperfect rhymes can also be linked in a chain: X is rhymed (imperfectly) with Y, and Y with Z, so that X and Z may count as rhymes thanks to the mediation of Y, even when X and Z satisfy neither the feature nor the subsequence principle.
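To make the chain Zwicky describes concrete, here is a minimal sketch in which rhyme judgments are stored as pairwise links and then collapsed into groups (connected components). The placeholders X, Y, Z follow the quote; W is an added unrelated word, and the function name is an assumption for illustration, not part of any annotation format.

```python
def rhyme_groups(words, links):
    """Collapse pairwise rhyme links into groups (connected components)."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent pointers to the representative of w's group.
        while parent[w] != w:
            w = parent[w]
        return w

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for w in words:  # iterate in order of appearance in the text
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())


words = ["X", "Y", "Z", "W"]
links = [("X", "Y"), ("Y", "Z")]  # X rhymes with Y, Y with Z; no X-Z link

groups = rhyme_groups(words, links)
x_and_z_grouped = any("X" in g and "Z" in g for g in groups)
x_and_z_linked = ("X", "Z") in links or ("Z", "X") in links
print(groups, x_and_z_grouped, x_and_z_linked)
```

The grouping puts X and Z into the same rhyme set although no direct link between them was ever recorded, so the information about the chain is only preserved if the links themselves are kept.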

Intransitive rhyming is, indeed, a problem for annotation, since it would require that we think of very complex annotation schemas in which we assign words to individual rhyme chains instead of just assigning them to the same group of rhymes in a poem or a song. However, one thing that I realized afterwards, and which one should never forget, is that rhyming is linear. Rhyming does proceed in a chain. We first hear one line, then we hear another line, etc, so that each line is based on a succession of words that we all hear through time.

It is just as the famous Ferdinand de Saussure (1857-1913) said about the linguistic sign and its material representation, which can be measured in a single dimension ("c'est une ligne", Saussure 1916: 103). Since we perceive poetry and songs in a linear fashion, we should not be surprised that the major attention we give to a rhyme when perceiving it is on those words that are not too far away from each other in their temporal arrangement.

The same holds accordingly for the concrete comparison of words that rhyme: since words are sequences of sounds, the similarity of rhyme words is a similarity of sequences. This means we can make use of the typical methods for automated and computer-assisted sequence comparison in historical linguistics, which have been developed during the past twenty years (see the overview in List 2014), when trying to analyze rhyming across different languages and traditions.
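As a rough illustration of what treating rhyme similarity as sequence similarity could look like, the sketch below computes a plain edit distance over sound segments, once on the raw segments and once after mapping them to coarse classes. The class table is a toy assumption made up for this example (it is not taken from List 2014 or from any existing tool), and the segmentation of the rhyme parts of ging [gɪŋ] and schien [ʃiːn] is likewise done by hand.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # match or substitution
            ))
        prev = cur
    return prev[-1]


# Hypothetical toy sound classes: high front vowels -> "I", nasals -> "N".
SOUND_CLASS = {"ɪ": "I", "iː": "I", "ŋ": "N", "n": "N"}


def to_classes(segments):
    """Replace each segment by its coarse class (unknown segments stay as-is)."""
    return [SOUND_CLASS.get(s, s) for s in segments]


ging_rhyme = ["ɪ", "ŋ"]     # rhyme part of ging [gɪŋ]
schien_rhyme = ["iː", "n"]  # rhyme part of schien [ʃiːn]

raw = edit_distance(ging_rhyme, schien_rhyme)
coarse = edit_distance(to_classes(ging_rhyme), to_classes(schien_rhyme))
print(raw, coarse)
```

On raw segments the two rhyme parts share nothing, while under the coarse classes they become identical, which is one way to express that an imperfect rhyme can still be close.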

Conclusion

When writing this post, I realized that I still feel like I am swimming in an ocean of ignorance when it comes to rhyming and rhyming practices, and how to compare them in a way that takes linguistic aspects into account. I hope that I can make up for this in the follow-up post, where I will introduce my first solutions for a consistent annotation of poetry. By then, I also hope it will become clearer why I give so much importance to the notion of imperfect rhymes, and the emphasis on the linearity of rhyming.

References

Ó Cuív, Brian (1966) The phonetic basis of Classical Modern Irish rhyme. Ériu 20: 94-103.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Nathan W. Hill and Christopher J. Foster (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Peust, Carsten (2014) Parametric variation of end rhyme across languages. In: Grossmann et al. Egyptian-Coptic Linguistics in Typological Perspective. Berlin: Mouton de Gruyter, pp. 341-385.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne: Payot.

Wagner, M. and McCurdy, K. (2010) Poetic rhyme reflects cross-linguistic differences in information structure. Cognition 117.2: 166-175.

Zwicky, Arnold (1976) Well, This rock and roll has got to stop. Junior’s head is hard as a rock. In: Papers from the Twelfth Regional Meeting of the Chicago Linguistic Society 676-697.