Monday, September 28, 2020

Analyzing rhyme networks (From rhymes to networks 6)

For this final post of my little series on rhyme networks, I set myself the ambitious goal of providing concrete examples of how rhyme networks for languages other than Chinese can be analyzed. Unfortunately, I have to admit that this goal turned out to be a bit too ambitious. Although I managed to create a first corpus of annotated German rhymes, I am still not entirely sure how to construct rhyme networks from this corpus. Even if this problem is solved pragmatically, I realized that the question of how to analyze the rhyme network data is far less straightforward than I originally thought.

I will nevertheless try to end this series by providing a detailed description of how a preliminary rhyme network of the German poetry collection can be analyzed. Since these initial ideas for analysis are still rather preliminary, I hope that they can be sufficiently enhanced in the near future.

Constructing directed rhyme networks

I mentioned in last month's post that it is not ideal to count, as rhyming with each other, all words that are assigned to the same rhyme cluster in a given stanza of a given poem, since this means that one has to normalize the weights of the edges when constructing the rhyme network afterwards (List 2016). I also mentioned the personal communication with Aison Bu, who shared the idea of counting only those rhymes that are somehow close to each other in a stanza.

During this month, I finally found time to think about how to account for this idea in practice, and I came up with a procedure that essentially yields a directed network. In this procedure, we first extract all of the rhyme words in a given stanza in the order of their appearance. We then proceed from the first rhyme word and iterate over the rest of the rhyme words until we find a match. Having found a match, we interrupt the loop and add a directed edge to our rhyme network, which goes from the first rhyme word to its first match. We then delete the first rhyme word from the list and proceed again.
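The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the Gist: the function name directed_rhyme_edges, the representation of a stanza as a list of (rhyme word, rhyme group) pairs, and the example stanza itself are all assumptions made for the sketch.

```python
from collections import Counter


def directed_rhyme_edges(stanza):
    """Return weighted directed edges for one stanza.

    `stanza` is a list of (rhyme_word, rhyme_group) pairs in the order
    in which the rhyme words appear (an assumed input format).
    """
    edges = Counter()
    for i, (word, group) in enumerate(stanza):
        # Look only at the rhyme words that follow the current one.
        for later_word, later_group in stanza[i + 1:]:
            if later_group == group:
                # First match found: add the edge and stop searching.
                edges[(word, later_word)] += 1
                break
    return edges


# Made-up stanza with two interleaved rhyme groups "a" and "b".
stanza = [("haus", "a"), ("leben", "b"), ("aus", "a"),
          ("geben", "b"), ("maus", "a")]
edges = directed_rhyme_edges(stanza)
```

In this toy stanza, haus is linked only to aus (its first match), and aus in turn to maus, so no edge is drawn directly from haus to maus.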

This procedure yields a directed, weighted rhyme network. At first sight, one may not see any specific advantages in the directionality of the network, but in my opinion it does not necessarily hurt; and it is straightforward to convert the network into an undirected one by simply ignoring the directions of the edges and collapsing those which go in two directions in a given pair of rhyme words.
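Collapsing the directions could look roughly as follows. This is a sketch under assumptions: the function name to_undirected is hypothetical, and summing the weights of the two directions is only one possible choice (if weights count poems, the same poem might contribute to both directions, so a union of poem sets could be more appropriate).

```python
from collections import Counter


def to_undirected(directed_edges):
    """Collapse directed weighted edges into undirected ones.

    Edges going in both directions between the same two words are merged;
    their weights are summed (an assumption, see the note above).
    """
    undirected = Counter()
    for (source, target), weight in directed_edges.items():
        # Sorting the pair makes (a, b) and (b, a) share one key.
        undirected[tuple(sorted((source, target)))] += weight
    return undirected


# Small illustrative input with one pair attested in both directions.
directed = {("aus", "haus"): 10, ("haus", "aus"): 9, ("triebe", "liebe"): 9}
undirected = to_undirected(directed)
```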

Handling complex rhymes

In last month's blog post, I also mentioned the problem of handling rhymes that stretch across more than one word. While these are properly annotated (in my opinion), I had problems handling them in the rhyme network I presented last week. We find similar problems when working with certain rhymes involving words with more than one syllable. As an example, consider the following words which are all taken from the song Cruisen, and which I further represent in syllabified form in phonetic transcription.

Rhyme Words Stressed Syllable Unstressed Syllable
Tube tuː
Bude buː
Gurke guɐ
hupe huː
Kurve kuɐ
Schurke ʃuɐ
Punkte puŋ

These words do not rhyme according to traditional poetry rules (where unstressed syllables following stressed syllables need to be identical), but they do reflect a common rhyme tendency in German Hip Hop, where rhyme practice has been evolving lately. In order to properly account for this, I assigned both the first and the second syllable of the words to their own rhyme group (one stressed syllable rhyme and one unstressed syllable rhyme).

When constructing the rhyme network, however, the separation into two rhyme groups turned out not to make much sense any longer, since the rhymes occur on a sub-morphemic level, where the parts do not themselves express a meaning anymore. To cope with this, I modified the network code slightly by treating only those words as rhyming with each other which show identical rhyme groups in all of their syllables.
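A minimal sketch of this matching rule is shown here. The annotation format (one rhyme-group label per syllable), the labels "u1" and "e1", and the extra word du are hypothetical and only serve to illustrate the idea.

```python
def rhyme_signature(syllable_groups):
    """Turn the per-syllable rhyme groups of a word into a comparable key."""
    return tuple(syllable_groups)


def words_rhyme(annotation, word_a, word_b):
    """Two words rhyme only if ALL of their syllable rhyme groups agree."""
    return rhyme_signature(annotation[word_a]) == rhyme_signature(annotation[word_b])


# Hypothetical annotation: one rhyme-group label per annotated syllable.
annotation = {
    "Tube": ["u1", "e1"],
    "Bude": ["u1", "e1"],
    "du": ["u1"],  # shares only the stressed-syllable group
}
```

With this rule, Tube and Bude are linked, while a word that shares only the stressed-syllable group is not.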

Infomap communities and connected components

Having constructed the rhyme network in this new way, we can start with some preliminary analyses. As a first step, it is useful to check the general characteristics of the network. When using the new approach for network construction and the correction for complex rhymes, as reported above, the network consists of 3,104 nodes which together occur as many as 7,707 times. The network itself is only sparsely connected, being separated into 840 connected components.
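Counting connected components does not require a dedicated network library. The following sketch uses a plain breadth-first search over an undirected edge list; the function name and the toy edges are assumptions, and the real network would of course be read from the annotated corpus.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph as sets."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from every node not visited so far.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


# Toy edge list, directions already ignored.
toy_edges = [("sein", "lein"), ("nein", "sein"),
             ("leben", "geben"), ("zeit", "keit")]
components = connected_components(toy_edges)
```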

As a first and very straightforward analysis, I used the Infomap algorithm (Rosvall and Bergstrom 2008) to see whether the connected components could be split any further. This analysis resulted in 932 communities, indicating that quite a few of the larger connected components in the rhyme network seem to show an additional community structure.

Unfortunately, I have not had time for a complete revision of all of the communities, but when checking a few of the larger connected components that were later separated into several communities, it seemed that most of these cases are due to very infrequent rhymes that are only licensed in very specific situations. As an example, consider the figure below, in which a larger connected component is shown along with the three communities identified by the Infomap algorithm.

The three communities, marked by the color of the nodes in the network, reflect three basic German rhyme patterns, which we can label -ung, -um, and -und. Transitions between the communities are sparse, although they are surely licensed by the phonetic similarity of the rhyme patterns, since they share the same main vowel and only differ by their finals, which all show a nasal component. The Infomap analysis assigns the nodes rum and krumm wrongly to the -und pattern but, given how sparse the graph is (with weights of one occurrence only for all of the edges), it is not surprising that this can happen. Both instances where edges connect the communities are rhymes occurring in the same Hip Hop lyrics from the song Geschichten aus der Nachbarschaft, as can be seen from the following annotated line of the song.

Judging from quickly eyeballing the data, most of the communities that further split the connected components of the network reflect groups of very closely rhyming words (usually corresponding to what one might call perfect rhymes). Links between communities reflect either possible similarities between the rhyme words represented by the communities, or direct errors introduced by my encoding.

Unfortunately, I could not find time to further elaborate on this analysis. What would be interesting to do, for example, would be a phonetic alignment analysis of the communities, with the goal of identifying the most general sound sequence that might represent a given community. It would also help to measure to what degree transitions between communities conform to these patterns, or to what degree individual words might reflect the communities' consensus rhyming more or less closely.

But even the brief analysis here has shown me that, first, there are still many errors in my annotation, and, second, the Infomap algorithm for community detection seems to work just as well with German rhyme data as it works on Chinese rhyme data.

Frequent rhyme pairs and promiscuous rhyme words

As a last example of how rhyme networks can be analyzed, I want to have a look at frequently recurring patterns in the current poetry collection. A very simple first test we can do in this regard is to look at the edges with the highest weights in our network. Poets typically try to be very original in their work, since nothing is considered as boring as repetition in literature. Nevertheless, since the pool of words from which poets can choose when creating their poems is, by nature, limited, there are always patterns that are more frequently used.

The following table shows those directed rhymes that occur most frequently in the German poetry database.

Rhyme Part A    Rhyme Part B    No. of Poems
sein            lein            10
aus             haus            10
haus            aus             9
triebe          liebe           9
leben           geben           9
geben           leben           9
zeit            keit            9
nein            sein            8
wieder          lieder          7
nur             tur             7
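A ranking of this kind can be produced by counting, for each directed rhyme pair, the number of distinct poems in which it occurs. The sketch below assumes a hypothetical record format (poem identifier, first rhyme part, second rhyme part) and invented poem identifiers; it is not the code used for the table.

```python
from collections import defaultdict


def pair_poem_counts(records):
    """Rank directed rhyme pairs by the number of distinct poems they occur in.

    `records` is an iterable of (poem_id, part_a, part_b) triples
    (an assumed format).
    """
    poems = defaultdict(set)
    for poem_id, part_a, part_b in records:
        poems[(part_a, part_b)].add(poem_id)
    counts = [(pair, len(ids)) for pair, ids in poems.items()]
    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts


# Invented records; the pair sein -> lein occurs twice in poem p1.
records = [
    ("p1", "sein", "lein"), ("p1", "sein", "lein"),
    ("p2", "sein", "lein"), ("p2", "haus", "aus"),
    ("p3", "aus", "haus"),
]
ranking = pair_poem_counts(records)
```

Because poems are stored in a set, a pair that is repeated within one poem is counted only once for that poem.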

This collection may not tell you too much if you are not a native speaker of German. But if you are, then you will easily see that most of these rhymes are very common, involving either very common words (sein "to be"), or suffixes that frequently recur in different words of the German lexicon (-lein either as diminutive suffix or as part of allein "alone"). We also find the very sad match of liebe (Liebe "love") and triebe (Triebe "urges"), which is mostly thanks to the poems by Rainer Maria Rilke (1875-1926), who wrote a lot about "love", and had the same problem as most German poets: there are not many words rhyming nicely with Liebe (the only other candidates I know of would be bliebe "would stay" and Hiebe "stroke or blow").

As a last example, we can consider promiscuous rhyme words, that is, rhyme words that tend to be reused in many poems with many other words as partners. The following table shows the top ten in terms of rhyme promiscuity in the German poetry dataset.

Rhyme Part    Rhyme Partners    Occurrences
sein          14                87
ein           9                 34
bei           9                 36
sagen         8                 19
leben         8                 39
schein        8                 26
mehr          8                 25
zeit          8                 36
welt          7                 32
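One way to compute such a promiscuity ranking is sketched below. How exactly partners and occurrences were counted for the table is not spelled out here, so the definitions in the code (distinct partners regardless of edge direction, and occurrences as the number of rhyme pairs a word takes part in) are assumptions, as are the function name and the toy data.

```python
from collections import Counter, defaultdict


def promiscuity(pairs):
    """Rank rhyme words by number of distinct partners, then by occurrences."""
    partners = defaultdict(set)
    occurrences = Counter()
    for part_a, part_b in pairs:
        # Direction is ignored: both words count as each other's partner.
        partners[part_a].add(part_b)
        partners[part_b].add(part_a)
        occurrences[part_a] += 1
        occurrences[part_b] += 1
    rows = [(word, len(partners[word]), occurrences[word]) for word in partners]
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


# Toy list of rhyme pairs, one entry per attested rhyme.
pairs = [("sein", "lein"), ("nein", "sein"),
         ("sein", "lein"), ("leben", "geben")]
rows = promiscuity(pairs)
```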

Here, I find it rather interesting that we find so many words rhyming with -ein in this short list. However, when checking the community of -ein, we can see that there is, indeed, a rather large number of words from which one can choose (including basic words like Bein "leg", Schein "shine", Stein "stone"). Additionally, there is a larger number of verbs of the form -eien that are traditionally shortened in colloquial speech (compare the node schreien "to scream").

Concluding remarks

When I started this series on rhyme networks, I was hoping to achieve more in the six months that I had ahead. In the light of my initial hopes, the analyses I have shown here are somewhat disappointing. However, even if I could not keep the promises I made to myself, I have learned a lot during these months, and I remain optimistic that many of the still untackled problems can be solved in the near future. What today's analysis has specifically shown to me, however, is that more data will be needed, since the network produced from the small collection of 300 German poems is clearly too small to serve for a fully fledged analysis of rhymes in German poetry.  


List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Rosvall, M. and Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105.4: 1118-1123.

Data and Code 

Data and code are available in the form of a GitHub Gist.

Monday, September 21, 2020

Herd immunity and the end of Covid-19

Following on from my previous posts about the SARS-CoV-2 virus, and Covid-19, the human disease that it causes, there are a number of miscellaneous topics that could also be discussed. Unfortunately, this is only a part of the post that I originally intended. I had written about some aspects of the pandemic that seem to be less well known. However, Blogger deleted the draft without warning, and this is the only part that I could recover.

Here, I talk about how the pandemic ends, as far as biology (rather than society) is concerned.
There is a lot of wishful thinking at the moment, that production of a vaccine will see the end of the pandemic, but the World Health Organization has warned that this may not be so. For example, they are apparently trying to develop a 5-year strategy for Europe, not a 5-month one. One of their officials, Hans Henri Kluge, has noted: "The end of the pandemic is the moment when we as a society learn how we can live with the pandemic."

Biologically, safety from pathogens involves what is called herd immunity. This refers to the proportion of the population who are not infectious, and thus are not spreading the pathogen (whether it is a virus, a bacterium, an apicomplexan, or a fungus). Lack of infectiousness can be achieved by:
  1. being resistant to the pathogen in the first place, perhaps due to past immunological events (eg. Coronavirus: How the common cold might protect you from COVID)
  2. becoming infected and then recovering, by producing antibodies or T-cells (eg. This trawler’s haul: Evidence that antibodies block the coronavirus)
  3. being vaccinated, which produces the same immune response as 2., by producing protective antibodies.

Note that 2. is not necessarily dangerous for most people, as reports show that anything up to half of the people who have antibodies to SARS-CoV-2 did not report clinical symptoms, or only mild symptoms. [Note also: lack of symptoms does not mean that you are not infectious.] However, the variation in human response has clearly been huge (see From ‘brain fog’ to heart damage, COVID-19’s lingering problems alarm scientists), in many cases resulting in cytokine storms, and death.

The main risk factors are also clear — age and gender (The coronavirus is most deadly if you are older and male — new data reveal the risks), and any pre-existing medical conditions, notably obesity (Individuals with obesity and COVID‐19: a global perspective on the epidemiology and biological relationships). Furthermore, we do not yet know how long any immune protection lasts — for example, we now have people who have been infected more than once (Researchers document first case of virus reinfection), although most have kept their antibodies for at least 4 months (Fyra av fem behåller antikroppar mot nya coronaviruset).

Nor do we yet know about the success or danger of 3., because it normally takes a couple of years of clinical trials before a vaccine is approved for use, and even then we can get it badly wrong (cf. the originally undetected side-effects of thalidomide). As far as health care is concerned, responsibility for treatment of any unfortunate outcomes from immunization is not at all clear. Furthermore, those nations that spend the most on healthcare per person may not be ranked highest for health outcomes and quality of care (see: What country spends the most on healthcare?). Therefore, it is hardly surprising that many people are concerned about taking any new vaccine (A Covid-19 vaccine problem: people who are afraid to get one), and that the World Health Organization is being much more cautious than many government leaders (Most people likely won't get a coronavirus vaccine until the middle of 2021).

Nevertheless, once herd immunity is achieved in my local population, I am relatively safe, irrespective of whether I have been vaccinated or not — there will be few infectious people around me, and so I am not very likely to catch the pathogen. Personally, I could wait a while to see how the myriad new vaccines affect people, as they have been rush-produced in a way that would not normally be accepted as safe for public use (what is called the Phase 3 trial takes time). After all, there seems to be an awful lot of politics involved, especially in the USA (The 943-dimensional chess of a trustworthy Covid-19 vaccine).

Some calculations

The point here is that the development of any epidemic is an interaction between infectivity, herd immunity and infection control. Let's consider some explicit numbers to make this clear (based on: Flockimmunitet på lägre nivå kan hejda smittan).

Infectivity refers to how the pathogen spreads among the at-risk population, usually described as the basic reproductive rate (R0). If each infected individual infects 2-3 others, then the R0 value is c. 2.5 (each person infects 2.5 other people, on average). This means that the epidemic must spread; if R = 1 then the number of cases stays constant rather than growing; and if R < 1 then the infection slowly dies out (it stops instantly if R = 0).

Clearly, infectivity can be reduced by any infection control measure that reduces R. Some of these were listed in the previous section. These measures can easily reduce the initial R0 by one half, meaning that the epidemic spreads much more slowly, with R = 1.25.

Herd immunity comes into this by also reducing R. For example, if herd immunity reaches 60%, then only the remaining 40% of the people are susceptible to the infection. If we combine this 40% with the initial R0 = 2.5, then R = 2.5 × 0.4 = 1, and the epidemic no longer increases. That is, we now have it under control. Moreover, if we have managed to get to R = 1.25, then a herd immunity of only 20% is already enough to bring R down to 1 (1.25 × 0.8 = 1), and anything above that will cause the epidemic to decrease.
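The arithmetic in these paragraphs can be written down explicitly. The sketch below assumes the simplest possible model, in which control measures and immunity both scale R multiplicatively and exposure is random; the function and parameter names are made up for the illustration.

```python
def effective_r(r0, control_reduction=0.0, immune_fraction=0.0):
    """Effective reproductive rate under a simple multiplicative model.

    control_reduction: fraction by which control measures lower R0.
    immune_fraction:   proportion of the population that is immune.
    """
    return r0 * (1.0 - control_reduction) * (1.0 - immune_fraction)


def herd_immunity_threshold(r):
    """Immune fraction at which R drops to exactly 1 (0 if R is already <= 1)."""
    if r <= 1.0:
        return 0.0
    return 1.0 - 1.0 / r


# Values taken from the text: R0 = 2.5, control measures halve it to 1.25.
r_with_immunity = effective_r(2.5, immune_fraction=0.6)    # close to 1.0
r_with_control = effective_r(2.5, control_reduction=0.5)   # 1.25
threshold_without_control = herd_immunity_threshold(2.5)   # close to 0.6
threshold_with_control = herd_immunity_threshold(1.25)     # close to 0.2
```

The threshold 1 - 1/R is the immune fraction at which R reaches exactly 1; the epidemic only declines once immunity exceeds it.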

Bhoj Raj Singh has a good slide presentation elaborating on this topic.

These calculations interact with the concept of relative risk, of course. The calculations so far assume that infection exposure is random in society, which is obviously too simple an idea. Some people are more socially active than others, are thus likely to be more exposed, and they will then quickly achieve significant herd immunity. Others find it difficult to self-isolate because of their work or social conditions, which also increases the development of herd immunity. All of this also helps more isolated people, of course, because they are not at risk of infection from those active groups with herd immunity.

We would thus expect herd immunity to develop first in cities (eg. Experts say Stockholm is close to achieving herd immunity; A third of people tested in Bronx have coronavirus antibodies) and in poor communities (Herd immunity may be developing in Mumbai’s poorest areas), both of which seem to be the case for SARS-CoV-2.

Equally importantly, herd immunity cannot develop if we all hide from the virus. This has happened in New Zealand, for example, which has so far successfully quarantined itself from the rest of the world — they have not successfully fought the virus, they have instead successfully hidden from it. The issue is that the populace can never come out of hiding, and can thus never let anyone come into the country, not even returning New Zealanders. As an example, Hawaii had the same isolation advantage, and then lost it, just as expected (Hawaii is no longer safe from Covid-19), as also did Australia (Coronavirus (COVID-19) current situation and case numbers).

It is a classic question: which is better, fight or flight? In a pandemic, flight cannot lead to herd immunity, which is what we need in order to "learn how we can live with the pandemic".

So, where are we now? Well, a recent poll in the USA suggests that it is an even split about whether people will actually take a vaccine if offered soon (U.S. public now divided over whether to get Covid-19 vaccine). Will 50% be enough to ensure herd immunity in that country?

Monday, September 14, 2020

Exploring the oak phylogeny

Neighbor-nets are most versatile tools for Exploratory Data Analysis (EDA), including in phylogenetics. They are not only fast to infer, but possibly the most straightforward way of depicting the signal in one's data matrix. This makes them useful additions to any phylogenetic paper, because they give the reader (and peers and editors during review) a good idea of what the data can possibly show, and where there may be problems.

A nice example of this use is the Neighbor-net in a recent paper on Chinese oaks:
Yang J, Guo Y-F, Chen X-D, Zhang X, Ju M-M, Bai G-Q, Liu Z-L, Zhao G-F. Framework Phylogeny, Evolution and Complex Diversification of Chinese Oaks. Plants 2020: 1024.
[Note: The paper is, from a purely methodological point-of-view, pretty well done, but has probably not experienced any real peer-review.**]
Oaks (Quercus L.) are ideal models to assess patterns of plant diversity. We integrated the sequence data of five chloroplast and two nuclear loci from 50 Chinese oaks to explore the phylogenetic framework, evolution and diversification patterns of the Chinese oak’s lineage. The framework phylogeny strongly supports two subgenera Quercus and Cerris comprising four infrageneric sections Quercus, Cerris, Ilex and Cyclobalanopsis for the Chinese oaks.
None of this is new. My colleagues and I published an updated classification for oaks a few years ago (Denk et al. 2017) that took into account molecular phylogenies, and introduced the systematic concept referred to by Yang et al., and recently followed by a many-species global oak phylogenomic study (Hipp et al. 2020). All of this is based on nuclear data only, because any researcher who ever studies oak genetics soon realizes that the plastomes are largely decoupled from speciation processes, but are geographically highly constrained (eg. Simeone et al. 2016, Yan et al. 2019). This is the reason why oaks are indeed "ideal models to assess patterns of plant diversity" – they provide a worst-case scenario not the (trivial) best-case one.

As can be seen in the Yang et al. tree, members of section Ilex, a monophyletic lineage forming highly supported clades in trees based on nuclear data, are scattered all across the subgenus Cerris subtree. I have annotated a copy of this tree here.

Yang et al.'s fig. 1a, with some clades newly labeled for orientation

Because of the plastid incongruence, the subgenus Cerris subtree has a wrong root (section Cyclobalanopsis diverged before sister sections Cerris and Ilex split). Also, the reciprocally monophyletic, genetically coherent sections Cerris (green) and Cyclobalanopsis (blue) are embedded in the much more diverse Ilex 3 and Ilex 4 clades. The remaining Ilex species are placed in two early diverged clades, which I have labeled Ilex 1 and Ilex 2 in the above tree (note: the taxon set only includes Chinese oak species). The only indication the tree gives that we have a data conflict issue is the low support (gray circles represent branches with Maximum likelihood bootstrap support > 60).

The network

When interpreting the phylogenetic implications of a Neighbor-net, we have to keep in mind that it is not a phylogenetic network in the strict sense (ie. displaying an evolutionary history), but is instead a meta-phylogenetic graph: a summary of incompatible split patterns. Incompatibility can have different origins: reticulation, recombination, diffuse or poorly sorted signals, etc. Consequently, when looking at Neighbor-nets and their neighborhoods (Splits and neighborhoods in splits graphs), we need to keep in mind what kind of data we used to calculate the underlying distance matrix in the first place.
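To make "incompatible splits" concrete: two splits A | A' and B | B' of the same taxon set are compatible, and can thus sit together in one tree, only if at least one of the four intersections of their sides is empty. The short sketch below checks this; the function name and the taxon labels are placeholders, not taken from the oak data.

```python
def compatible(split_a, split_b, taxa):
    """Check whether two splits of `taxa` can co-occur in a single tree.

    Each split is given by one of its two sides; the other side is
    the complement within `taxa`.
    """
    taxa = frozenset(taxa)
    a, b = frozenset(split_a), frozenset(split_b)
    a_rest, b_rest = taxa - a, taxa - b
    # Compatible if at least one of the four intersections is empty.
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


# Placeholder taxa: two members of one lineage, two of another, one outlier.
taxa = {"Ilex_1", "Ilex_2", "Cerris_1", "Cerris_2", "Quercus_1"}
nuclear_like = {"Ilex_1", "Ilex_2"}     # groups the two Ilex placeholders
plastid_like = {"Ilex_2", "Cerris_1"}   # cuts across that group
nested = {"Ilex_1", "Ilex_2", "Cerris_1", "Cerris_2"}
```

Here the first two splits conflict, which a splits graph would draw as a box, whereas the nested split is compatible with the first.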

If the data follows two incongruent trees ("phylogenies"), as in this case for the oaks, the Neighbor-net has a good chance of capturing the incompatible splits of both genealogies. Here is the graph from the paper.

Yang et al.'s fig. 1b.

The central inflated portion of the graph reflects the incongruence between the combined data sets: we have overlapping nuclear-informed and plastid-informed neighborhoods.

The authors' brackets (shown in black) refer to neighborhoods triggered by the two nuclear markers in the data set: these are neighborhoods reflecting the common origin and speciation within the oak lineages. We can even see that this signal, which is incompatible with all deep splits in the combined tree, is unambiguous in part of the data (the nuclear partitions): section Ilex spans out as a wide fan, but there is a relatively prominent edge bundle defining the corresponding neighborhood (the blue split).

The net shows additional, even more prominent edge bundles defining partly overlapping or distinct neighborhoods (the red splits). These neighborhoods are represented as clades in Yang et al.'s phylogenetic tree (fig.1a). They write (p. 11 of 20):
However, the conflict between the two datasets seems to be recovered by the neighbor-net method in this study, as the neighbor-net network based on combined plastid–nuclear data strongly shows the presence of two subgenera and four infrageneric species groups for the Chinese oak’s lineage (Figure 1b).
Interestingly, the authors nonetheless used the substantially incongruent combined data for downstream dating and trait mapping analysis (p. 7/20):
Bayesian evolutionary analyses provided a concordant infrageneric phylogeny for the Chinese oak’s lineage at the species level (Figure 2).
This uses a taxon-filtered, obviously constrained (fixed) topology, fitted to the current synopsis outlined in Denk et al. (2017). [Note: the supplement includes the extremely incongruent nuclear and plastid trees, each of which has further incongruence issues because they combine fast- and very slow-evolving sequence regions.]


More posts on oaks, plastid data and networks can be found here in the Genealogical World and in my Res.I.P. blog.

Cited papers

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. (2017) An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Cham: Springer, pp. 13–38. Open access Pre-Print [major change: Ponticae and Virentes accepted as additional sections in final version].

Hipp AL, Manos PS, Hahn M, Avishai M, + 20 more authors. (2020) Genomic landscape of the global oak phylogeny. New Phytologist 229: 1198–1212. Open access.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. (2016) Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4:e1897. Open access.

Yan M, Liu R, Li Y, Hipp AL, Deng M, Xiong Y. (2019) Ancient events and climate adaptive capacity shaped distinct chloroplast genetic structure in the oak lineages. BMC Evolutionary Biology 19:202. Open access.

** The publisher, MDPI, thrives in the gray zone between predatory and accredited publishing. Originally included in the recently reactivated Beall's List (new homepage), it has been tentatively dropped (see the linked Wikipedia article; but see also this post by Mats Widgren). Personally, I have encountered articles published in MDPI journals only where the review process must have been, at least, strongly compromised. But it's always quick: Yang et al.'s paper was submitted July 24th, accepted August 12th, and published a day later. Three weeks is about the length of time that the editors of my first oak paper needed to find a peer reviewer at all.

Monday, September 7, 2020

Fossils and Networks 3 – (deleting and) adding one tip

In the last Fossils and Networks post, we explored the use of SuperNetworks to identify both safe and problematic branching patterns by removing one OTU and re-evaluating the analysis. Here, we'll take the opposite approach, and see what we can learn from adding one OTU to our analysis.

Breaking and supporting wrong branches

We start again with the artificial Felsenstein Zone matrix that results in a wrong AB clade. Here's the original true tree used to generate the matrix.

Because of convergent/parallel evolution in the modern taxa (genera O, A and B) and primitive characters of their fossil sisters, any phylogenetic inference method will find the wrong tree, with an A + B | rest split.

In the Felsenstein Zone, parsimony will always get the wrong tree due to long-branch attraction (LBA), while Maximum likelihood has a 50:50 chance to escape LBA. To break down the LBA between A and B, we need a fossil that is, from an evolutionary point of view, intermediate between D and B.

If we add a fossil E that features 1 out of 3 derived traits found in the BD lineage (including the only synapomorphy of BD), we end up with two alternative parsimony trees: one with a wrong topology and the other the correct topology, as shown here.

By adding a fossil F featuring 2 out of 3 derived traits, we increase the number of most-parsimonious trees (MPTs) to three alternatives, all of which fall prey to A-B+F LBA, as shown next.

Convergent evolution is a problem for tree inference but selection bias and homoiologies are worse, involving accumulation of the same advanced trait within some but not all members of a lineage (Has homoiology been neglected in phylogenetics?). This is worse because the characters will enforce attraction between long-branching, highly evolved (more modern) taxa. A and B are siblings, but by enforcing an ABF clade, we will inevitably misinterpret the most primitive members of the ingroup, C and D. Hence, we may draw wrong conclusions about evolution in the A–F lineage.

Because E is virtually half-way evolved between D and F, and F is the next step towards B, the all-inclusive tree gets it right. We infer a single optimal tree, shown here.

PS: Also, in this case we could use any other optimality criterion (Maximum Likelihood, Least-squares, Minimum Evolution) and we would end up with the same tree.

Missing the important bits

That last observation is encouraging: the more fossils we include in our matrix and the better they reflect the evolutionary trends within a group (here from a D-like ancestor via E to F and B), the greater the chance of ending up with the true tree. There's only one drawback: in real-world data sets, we may miss exactly those traits in the fossil sample that are needed in order to infer (or stabilize) the true tree.

(Paleo-)Parsimonists have frequently argued that missing data are unproblematic, which is true in one sense, as shown in the above example. The commonly used strict consensus tree has no wrong branches, because it only has one, which is the trivial ingroup-outgroup split. The much less commonly used Adams consensus tree has one more branch, which is wrong: the ABF clade.
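The logic of the strict consensus is easy to state in code: it keeps only those clades that occur in every tree of the sample. The sketch below represents each rooted tree simply as a set of clades (frozensets of tip labels); the three toy trees are invented and are not the MPTs discussed here.

```python
from collections import Counter


def strict_consensus(trees):
    """Clades shared by all trees; each tree is a set of frozenset clades."""
    trees = list(trees)
    if not trees:
        return set()
    shared = set(trees[0])
    for tree in trees[1:]:
        shared &= set(tree)
    return shared


def clade_frequencies(trees):
    """How many trees of the sample contain each clade."""
    counts = Counter()
    for tree in trees:
        counts.update(set(tree))
    return counts


# Invented sample of three rooted trees on ingroup tips A-D (outgroup O omitted).
ingroup = frozenset("ABCD")
toy_trees = [
    {ingroup, frozenset("AB"), frozenset("ABC")},
    {ingroup, frozenset("AB"), frozenset("CD")},
    {ingroup, frozenset("BD"), frozenset("BCD")},
]
consensus = strict_consensus(toy_trees)
frequencies = clade_frequencies(toy_trees)
```

In this toy sample only the ingroup clade survives, although the AB clade is found in two of the three trees; that per-clade information is what a consensus network keeps and a strict consensus tree discards.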

As always in such cases, the strict Consensus network visualizes the MPT sample best (again exemplifying why we should stop using cladograms).

The price for not having false positives is that we cannot infer a most-parsimonious tree or a few alternative trees any more, but could easily end up with scores of them. Here, we have 41 MPTs for an 8-taxon dataset that include fairly wrong trees*, although some of them are closer to the true tree (green and olive edges in the strict Consensus network above). For large matrices, or matrices lacking tree-like signals, the number of MPTs can easily reach tens or hundreds of thousands. Lacking critical traits in E (14 out of 46 characters missing) and F (7 missing), we may escape LBA at the cost of decisiveness. If we do have those traits only in F but not E, we will enforce LBA between A and B.

Plus-1-trees (and SuperNetworks)

Before adding a taxon as an additional leaf to our tree, we may be interested in what that taxon does to our tree: can it trigger a topological change or does it fall in line? We will again take the dinosaur-to-bird-matrix of Hartman et al. (2019, PeerJ 7: e7247) as a real-world example. This includes everything from well-covered highly derived and most primitive taxa, to those that lack discriminatory signal in general (ie. are unresolved), plus the one or two rogue taxa, with ambiguous phylogenetic affinities creating topological conflict. (Note: the commonly reported strict consensus trees cannot distinguish between those two alternatives.)

The best-covered 15 taxa provide us with a single optimal tree that is in agreement with current opinion (shown below). However, this struggles to resolve the clade of modern birds because the extinct Lithornis is being attracted by Anas, the duck. When we remove Dromiceiomimus (as shown in Fossils and Networks 2), we end up with a putatively wrong Dromaeosauridae grade, because of LBA between the most distinct Dromaeosauridae, Velociraptor and Bambiraptor, and the distantly related (to flying dinosaurs) Allosaurus, Tyrannosaurus and the IGM 10042 skeleton.

Two of the Minus-1 trees generated for the last post of this series.

For our experiment, we will take this (partly) wrong tree, and add every other taxon included in the Hartman et al. (2019) matrix as 15th tip. We can then perform a branch-and-bound search to infer these 14-Plus-1 tree(s). When we browse through the inferred MPTs, we can see that many taxa fall in line with the wrong topology, including a few that, in addition, increase uncertainty for branches correctly resolved in the minus-Dromiceiomimus tree.

Out of the 485 candidate trees, only 10** have a set of characters that can compensate for the missing Dromiceiomimus, leading to Plus-1 trees that show a Dromaeosauridae clade, as shown here.

Two of the ten Plus-1 trees, where the added tip saves the inference from LBA. Numbers give the number of defined characters (scored traits). Both Halszkaraptor and Zhenyuanlong are classified as Dromaeosauridae; however, only the better-covered taxon is placed as sister to the Dromaeosauridae included in the original 14-taxon tree.

The presence of the deep-branching Compsognathus (Tyrannoraptora: ... :Neocoelurosauria: †Compsognathidae) triggers an Archaeopteryx-Dromaeosauridae clade.

In the case of the relatively deep-branching Garudimimus (... :Neocoelurosauria: Maniraptoriformes: †Ornithomimosauria: †Deinocheiridae) and Epidexipteryx (... : Maniraptoriformes: ... : : ... : Paraves: †Scansoriopterygidae), one or two of the two or three MPTs show the wrong grade, and the remaining one shows the clade.

Note: the relatively low number of scored traits for Epidexipteryx can avoid the LBA that leads to a Dromaeosauridae grade, but it misplaces the taxon within the Plus-1 MPTs: its family, the Scansoriopterygidae, is considered to represent the sister lineage (Wikipedia, referring to Godefroit et al. 2013 Nature 498: 359–362) of the Eumaniraptora, which include the Dromaeosauridae as first-diverging branch.

We can also summarize the outcome, a collection of 640 Plus-1 MPTs, in the form of a z-closure SuperNetwork, as we did for the Minus-1 trees in the previous Fossils and Networks post (shown next).

This SuperNetwork is quite boxy, and may be only semi-comprehensive (I used only 20 runs, which took half a day). Matching 485 tips into a 14-taxon backbone tree is not the kind of tree sample that the SuperNetwork has originally been designed for!

Only four edges, fat and blue, are without alternatives. In all other cases, the added tip triggered the creation of several alternatives: the highest dimension for the boxes is five, but most have four or fewer dimensions. Regarding our problem of saving the Dromaeosauridae clade, we can see that the topological change depends on very few characters, with Microraptor being very close to the divergence but a bit more bird-like (in a very broad sense), while the other two are much more derived.
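What "without alternatives" means can be made concrete with the standard compatibility criterion for splits: two bipartitions A|B and C|D of the same taxon set can sit on one tree only if at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The Python sketch below applies that criterion to a handful of made-up splits. It is not how the z-closure SuperNetwork is computed; the helper names and the five-taxon example are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

# A split is stored as its two sides (assumed representation).
Split = Tuple[FrozenSet[str], FrozenSet[str]]


def make_split(side: Iterable[str], taxa: Iterable[str]) -> Split:
    """Build the bipartition side | (taxa minus side)."""
    a = frozenset(side)
    return a, frozenset(taxa) - a


def compatible(s1: Split, s2: Split) -> bool:
    """Two splits fit on one tree if one of the four intersections is empty."""
    a, b = s1
    c, d = s2
    return not (a & c) or not (a & d) or not (b & c) or not (b & d)


def unconflicted(splits: List[Split]) -> List[Split]:
    """Return the splits that are compatible with every other split."""
    return [
        s for i, s in enumerate(splits)
        if all(compatible(s, t) for j, t in enumerate(splits) if i != j)
    ]


# Hypothetical example on five taxa, each named by a single letter.
taxa = "ABCDE"
splits = [
    make_split("AB", taxa),   # AB | CDE
    make_split("ABC", taxa),  # ABC | DE
    make_split("AC", taxa),   # AC | BDE, conflicts with AB | CDE
]
clean = unconflicted(splits)
print([sorted(side) for side in clean[0]])
```

In this toy case only ABC | DE survives, because AB | CDE and AC | BDE contradict each other; in a network, such a pair would span a box, while the unconflicted split would be drawn as a single edge.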

Close-up on the Dromaeosauridae part of the network, with all tips labeled. Pie charts give the percentage of scored traits/missing data. * – Tips that saved the inference from LBA (see above).

Note the length of some of the colored edges, especially the light green ones, which reflect a Dromaeosauridae clade. Other Dromaeosauridae taxa not only increase the diversity but may also create substantial topological ambiguity (bluish and greenish edge bundles; same color = same split) and branching bias.

Take-home message

Creating morphological supermatrices makes a lot of sense, because it ensures normalization and facilitates universal comparability, which is crucial also for paleobiology. However, even more than molecular phylogenies, paleophylogenies are affected by character and taxon sampling. This is nothing new, and much debate has dealt with which parsimony strict consensus cladogram is the better one.

I suggest taking a new route. Instead of using morphological supermatrices to infer trees – for this matrix, Hartman et al. found millions of equally optimal parsimony trees further filtered by post-analysis, initial tree topology informed character weighting (as implemented in TNT) – we should use them to generate subsets and engage in exploratory data analysis. This will pinpoint strengths and weaknesses of the data and its individual taxa. Rather than producing evolutionarily meaningless soft polytomies, one should study the reasons for any topological ambiguity. After all, one simple reason for unstable branching patterns may be that all so-far inferred trees are biased, only differently.

The SuperNetwork can assist us in putting together taxon sets that could allow not only a simple tree inference but also topology testing.
  • If we want to test the stability of, e.g., the Dromaeosauridae clade against taxon sampling, it will be of little use to include the most primitive (anything outside Maniraptora) and much more advanced taxa (Avialae including modern birds) of the 501-taxon matrix. On the one hand, the most primitive taxa will only increase the computational load, because our inferred tree not only optimizes branches we are interested in, but also irrelevant ones, using taxa that largely lack discriminative signal for the branches of interest or at all. On the other hand, the most derived taxa may bias the tree inference by providing strong terminal signals outcompeting potentially conflicting weak basal signals.
  • If we want to test the stability of the backbone phylogeny against adding taxa and entire lineages, we may prefer short-branched over long-branched taxa, in order to avoid (local) LBA (especially when we want to stick to parsimony). The terminal edges in the SuperNetwork indicate the minimum number of unique changes for each tip added to the 14-taxon tree. As seen also in our hypothetical example: E and F only break down the wrong AB clade because both are either identical (or very similar) to the last common ancestor of E+F+B and F+B, respectively.
In a future post, I'll come back to the issue of identifying taxa that are game changers, using a simple and quick tree-based approach: the so-called "evolutionary placement algorithm", first implemented in RAxML.

For any of you who really don't like networks, but still find no comfort in comb-like strict consensus cladograms either: just tick the SuperTree option when inferring the SuperNetwork. But only if your input trees converge to a shared topology. Otherwise the result may look like this:

A SuperTree based on the 640 Plus-1 MPTs.

* Somebody familiar with Consensus networks and morphological data partitions providing complex signal can extract a phylogenetic hypothesis from this boxy network for the included taxa. In general, the distance along the network edges represents a phylogenetic distance, and thus gives a direct measure of how derived a taxon is.

For example, C, D are closer to the outgroup and placed close to the centre of the graph, which is exactly where a primitive ingroup taxon, with an ancestral morphology, would be placed. F is most likely a sister of B. The olive EF | rest split supports a potential common origin of E, F, and B (long green edge bundle). Hence, A can only represent a distant, strongly evolved sister lineage (both the alternative AB and ABF clade have less character support). Also, since the graph depicts E as least derived of the four (irrespective of the topological alternatives), its affinity to F and B has more value than the affinity between A and B, both being long-branched, and hence susceptible to LBA. D fits into the picture: the olive DE edge either (1) represents a common origin, which would make D an early member of the red lineage; or (2) has similarity due to shared primitive traits within the ingroup, which would make D an early member of an ABEF lineage. C, in contrast to D, has no clear affinities with any other ingroup member, and so can only be interpreted as an early, very primitive form with uncertain phylogenetic relationships. The (true tree) mutual monophyly of the red and blue ingroup lineages has very little character support in the matrix, and hence cannot possibly be resolved.

** Systematically they cover a range of maniraptoran ('hand hunters') families 'below' the Avialae ('flying' dinosaurs) including, in addition to two Dromaeosauridae (Halszkaraptor, Zhenyuanlong, trees shown above), members of †Alvarezsauroidea (Haplocheirus), †Caudipteridae (Caudipteryx), †Sinovenatorinae (Sinovenator), †Therizinosauroidea or related (Beipiaosaurus, Jianchangosaurus) and †Troodontidae (Gobivenator, Sinornithoides). Caihong is a member of the †Anchiornithidae, which Wikipedia flags as "Avialae ?". These OTUs show data coverage far above the median (74% missing), with 278 (Caihong) to 558 (Caudipteryx) defined characters (out of a total of 700).
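The coverage figures quoted in this footnote can be translated into proportions of missing data with a little arithmetic. The Python sketch below is merely a convenience for doing so; the function names are made up for the example, and treating only "?" as missing is an assumption (some matrices also code gaps or inapplicable characters with "-", which would have to be passed in explicitly).

```python
def missing_from_counts(defined: int, total: int) -> float:
    """Proportion of missing data, given the number of defined characters."""
    if total <= 0 or not 0 <= defined <= total:
        raise ValueError("need 0 <= defined <= total and total > 0")
    return (total - defined) / total


def missing_fraction(row: str, missing_symbols: str = "?") -> float:
    """Proportion of missing cells in one row of a character matrix.

    Assumption: one symbol per character; only "?" counts as missing
    unless other symbols are passed in.
    """
    if not row:
        raise ValueError("empty character row")
    return sum(ch in missing_symbols for ch in row) / len(row)


# Counts quoted above: 278 and 558 defined characters out of 700.
for name, defined in [("Caihong", 278), ("Caudipteryx", 558)]:
    print(name, round(missing_from_counts(defined, 700), 3))

# Hypothetical six-character row with one "?" and one "-".
print(missing_fraction("01?1-0"), missing_fraction("01?1-0", "?-"))
```

With these numbers, Caihong has about 60% and Caudipteryx about 20% missing data, both clearly below the 74% median mentioned above.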