The Genealogical World of Phylogenetic Networks: December 2019

Monday, December 30, 2019

National differences in amount of paid versus unpaid work

Countries differ in many cultural ways. An important one of those ways concerns how time is managed. There are 24 hours in every day, and 7 days in every week, and the time that people spend on each of the different activities can be averaged across each year. When combined across the whole population, these averages usually differ between countries, and this is what we mean when we recognize national behaviors. There are, however, many similarities among countries that share strong cultural ties.

The Organisation for Economic Co-operation and Development (OECD) has collected data on this matter among its member countries, as it also has for many other cultural and economic characteristics. In each of the 30 member countries, the OECD conducts regular "time-use surveys, based on nationally representative samples of between 4,000 and 20,000 people." The aggregated results are available online, including data for three other countries, for comparison (China, India, South Africa).

Four main categories of time use are reported by the surveys:

Paid Work or Study, which includes paid work time, time in school or classes, travel to and from work / study, research / homework, and job search.
Unpaid Work, which includes child care, care for other household members, care for non household members, routine housework, shopping, volunteering, and travel related to household activities.
Personal Care, which includes sleeping, eating & drinking, medical services, and travel related to personal care.
Leisure Time, which includes sports, participating / attending events, visiting or entertaining friends, and TV or radio at home.

Of particular interest is that the data are aggregated separately for males and females. I will look at the gender data in a future blog post, while here I will look only at the pooled data for each country.

National differences

In order to look at the current differences between the 33 countries (30 OECD, 3 non-OECD), I have performed this blog's usual exploratory data analysis. The available data are multivariate, since there are five measured variables for each country: total paid work, total unpaid work, total personal care time, and leisure time (each measured in average number of minutes per day), plus Other (to make a total of 1,440 minutes per day). One of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a form of exploratory data analysis. For this network analysis, I calculated the dissimilarity of the countries using the Manhattan distance; a Neighbor-net analysis was then used to display the between-country similarities.
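For readers who want to see what the distance step involves, here is a minimal sketch in Python. The three "countries" and their minutes are made up for illustration (they are not the OECD values), and the function names are my own; the Neighbor-net itself would still be computed by a dedicated network program from the resulting matrix.

```python
from itertools import combinations

# Hypothetical minutes per day: paid work, unpaid work, personal care,
# leisure, other. Made-up numbers for illustration, NOT the OECD data.
time_use = {
    "A": [300, 200, 650, 270, 20],
    "B": [320, 180, 640, 280, 20],
    "C": [360, 250, 620, 190, 20],
}


def manhattan(x, y):
    """Sum of absolute differences between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def distance_matrix(profiles):
    """Return sorted names and a symmetric dict of pairwise distances."""
    names = sorted(profiles)
    dist = {(n, n): 0 for n in names}
    for p, q in combinations(names, 2):
        d = manhattan(profiles[p], profiles[q])
        dist[(p, q)] = dist[(q, p)] = d
    return names, dist


names, dist = distance_matrix(time_use)
for p, q in combinations(names, 2):
    print(p, q, dist[(p, q)])
```

In this toy example A and B differ by 60 minutes in total, while C is 220 minutes away from both, so C would sit on its own in a network.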

The resulting network is shown in the first graph. Countries that are closely connected in the network are similar to each other based on their average time management, and those countries that are further apart are progressively more different from each other.

National differences in amount of paid versus unpaid work

The expected cultural similarities of the countries are, in most cases, reflected in the network. For example, the country that we might expect to be the most different to the other 32 is Mexico (it is the only country from Latin America), and it is also the most isolated one in the network. It is characterized by having the greatest amount of average work time per day, particularly unpaid work, and the least leisure time.

Furthermore, the three Asian countries are clustered together: Japan, Korea, and China. They have similar high amounts of paid work, but much less unpaid work than the Mexicans.

On the other hand, it is not clear why India is shown as very similar to some of the European countries, given its very different culture. However, differences do appear in the gender patterns, which will be discussed in a future post.

In other cases, there are occasional countries that are not where we might anticipate them to be in the network, given other known historical and cultural similarities, particularly language. For example, Sweden is not near the other Nordic countries (Denmark, Norway, Finland), as the Swedes report many more minutes of paid work per day, and correspondingly less time on each of the other activities. Portugal is not near Spain, Italy, and France, as the Portuguese also report more minutes of paid work, and specifically less leisure time. Similarly, Australia is not near Canada, the USA, New Zealand, and the UK, because the Australians report fewer minutes of paid work but correspondingly more unpaid work time. The people of Latvia and Lithuania also report many more minutes of paid work per day than do those of Poland and Estonia.

Other differences

Lest you get the impression that historical and cultural ties dominate the time-management data, we can look at one part of the data in detail.

As noted above, the Personal Care data includes separate information for sleeping versus eating & drinking. In the next graph I have plotted these two variables against each other (in average minutes per day), for all 33 countries.

Time spent sleeping versus eating & drinking for 33 countries

As you can see, there is no correlation whatsoever between these two variables. That is, extra eating and drinking time does not come out of the time allocated for sleeping, or vice versa.
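The usual way to put a number on such a statement is the Pearson correlation coefficient. The sketch below is only an illustration: the minutes are invented toy values (not the OECD figures), chosen so that the coefficient comes out at zero, and the function name is my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical average minutes per day for four imaginary countries.
sleeping = [500, 510, 520, 530]
eating_drinking = [90, 130, 130, 90]

print(pearson(sleeping, eating_drinking))
```

A value near 0 means that knowing how long a population sleeps tells us nothing (linearly) about how long it spends eating and drinking; values near +1 or -1 would indicate that the two go together or trade off.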

Moreover, you will note that the denizens of the three Asian countries do not behave anything like each other, particularly as the Chinese sleep longer than everyone except the South Africans. Nor do the Swedes behave much like the Danes, in terms of eating and drinking.

Finally, the Mexicans report that they do not spend much time eating or sleeping, which follows from the work data discussed above. Instead, it is the Mediterranean peoples who like to spend their time eating and drinking. On the other hand, the Americans (and Canadians) certainly behave like they live on fast food, spending less time on eating and drinking than anyone else. They do, however, like their 8.5 hours of sleep per day, whereas most other populations think they can do without that extra half hour.

Monday, December 23, 2019

Evolutionary processes hidden in Christmas clip arts

As a blogger, one is free to indulge in complexity in all data, although this is not necessarily a good idea in professional publications. We can thus think about the art of networks — that is, inferring networks with an artistic touch (eg. this post on artistic network depictions; and trees can be art, too). For this Christmas post, we will just take the shape of famous Christmas-themed clip art to muse about the evolutionary process that could produce a similar structure.

The snowflake

Despite climate warming, snowflakes are still the basic component of most Christmas-themed art. Phylogenetically, they are just unrooted trees, no matter how complex they look, with the difference that the radiation events are trichotomous, not dichotomous, and hence represent rapid radiations (leading to soft polytomies in the inferred tree).

The Christmas tree

In Germany, a proper Christmas tree has to be a Nordmanntanne: Nord = north, Mann = man, Tanne = fir; Abies nordmanniana (since we planted all our Christmas trees in the garden, we rarely had one). This, in contrast with what you may be thinking, does not come from the north but from the south (and is named after a noble Swedo-Finnish zoologist: Alexander von Nordmann), specifically, the mountains along the southern Black Sea coast. Firs are perfect trees for Christmas because they have broad bases and taper towards the tips (try to decorate a Tränenkiefer, Pinus wallichiana, which we once had as an alternative Christmas tree).

Phylogenetically, they are quite specific metaphors, as explained in the diagram.

With only one survivor, we won't have any molecular data reconstructing a fir-like phylogeny. However, we can observe appropriate phenomena in the fossil record: when lineages emerge they usually are not represented by a single form but by several, then there are phases of stasis, then their diversity explodes in a very short time, with subsequent loss until the next radiation event, although the lineage itself is in decline.

We can also turn the Christmas tree upside down, and fill it with two sister lineages, as shown here.

We then end up with a scenario not unlikely in nature. Note that due to the last extinction event the survivors cuddle in the original niche, and only genetics may give us a clue that the similar morphology is not reflecting a recent common origin or the actual phylogenetic relationships.

A last image

Despite being a geologist-palaeontologist, I came into contact with (and was intrigued by) population genetics early on, due to various circumstances (mainly because I'd make a very poor taxonomist). Hence, I love cactus-like evolutionary metaphors (see here for a real-world example). In the next diagram there is one showing several common evolutionary processes, with the colors representing haplotypic or genotypic variation within the species/species complex represented by the bubbles.

This includes everything I fancy about working at the coalface of evolution: a lot of stochasticity, a bit of reticulation, spiced with potentially misleading signals in the molecular data. A truly angelic evolutionary scenario.

Frohe Weihnachten, Glad jul, Joyeux Noël, and a Merry Christmas to everyone (or whatever you celebrate and enjoy at the end of the year).

And in case you are looking for a New Year's resolution: next time you write a phylogeny paper, sneak in a phylogenetic (or data-display) network in addition to the always simplifying, and per se trivial, tree.

Monday, December 16, 2019

Open problems in computational diversity linguistics: Conclusion and Outlook

One year has now passed since I discussed the idea with David to devote a whole year of 12 blog posts to the topic of "Open problems in computational diversity linguistics". It is time to look back at this year, and the topics that have been discussed.

Quantitative view

The following table lists the pageviews (or clicks) for each blogpost (with all caveats as to what this actually entails), from January to November.

Problem	Month	Title	Clicks	Comments
0	January	Introduction	535	4
1	February	Automatic morpheme detection	718	0
2	March	Automatic borrowing detection	422	1
3	April	Automatic sound law induction	522	2
4	May	Automatic phonological reconstruction	517	0
5	June	Simulation of lexical change	269	0
6	July	Simulation of sound change	423	0
7	August	Statistical proof of language relatedness	383	1
8	September	Typology of semantic change	372	2
9	October	Typology of sound change	250	3
10	November	Typology of semantic promiscuity	217	2

The first thing to note is that people might have gotten tired of the problems, since the last two posts were not very well received in terms of readers (or not yet, anyway). One should not forget, however, that the number of clicks recorded by the system is cumulative, so if a post is older, it may have received more readers just because it has been online for a longer time.
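One crude way to take the time online into account is to divide each click count by the number of months the post has been available. The sketch below does this for the table above, under the assumption (mine, for illustration only) that each post went online in its listed month and that the counts were read off in December; the function name is likewise my own.

```python
# (title, month of publication, clicks) taken from the table above.
posts = [
    ("Introduction", 1, 535),
    ("Automatic morpheme detection", 2, 718),
    ("Automatic borrowing detection", 3, 422),
    ("Automatic sound law induction", 4, 522),
    ("Automatic phonological reconstruction", 5, 517),
    ("Simulation of lexical change", 6, 269),
    ("Simulation of sound change", 7, 423),
    ("Statistical proof of language relatedness", 8, 383),
    ("Typology of semantic change", 9, 372),
    ("Typology of sound change", 10, 250),
    ("Typology of semantic promiscuity", 11, 217),
]


def clicks_per_month(clicks, month, now=12):
    """Clicks divided by months online (assumption: counted in month `now`)."""
    months_online = now - month
    if months_online <= 0:
        raise ValueError("post must be older than the month of counting")
    return clicks / months_online


rates = {title: clicks_per_month(clicks, month) for title, month, clicks in posts}
ranking = sorted(rates, key=rates.get, reverse=True)

for title in ranking:
    print(f"{rates[title]:7.1f}  {title}")
```

Under this simple normalization the most recent posts come out on top, which mostly reflects that a post collects many of its clicks shortly after publication. It does show, however, that the raw counts alone cannot tell us whether readers lost interest.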

What seems, however, to be interesting is the rather high number of readers for the February post; and it seems that this is related to the topic, rather than the content. Morpheme detection is considered to be a very interesting problem by many practitioners of Natural Language Processing (NLP), and the field of NLP has generally many more followers than the field of historical linguistics.

Reader comments and discussions

For a few of the posts, I received interesting comments, and I replied to all of them, where I found that a reply was in order. A few of them are worth emphasizing here.

As a first comment in March, Guillaume Jacques replied in the form of a blog post of his own, where he proposed a very explicit method for the detection of borrowings, which assumes that data are compared where an ancestral language is available in written sources (see here for the post). Since it will still take some time to prepare the data in the manner proposed by Guillaume, I have not had time to test this method for myself, but it is a very nice example of a new method for borrowing detection, which addresses one specific data type and has so far not been tested.

Thomas Pellard provided a very useful comment on my April post, emphasizing that automatic reconstruction based on regular expressions (as I had proposed it, more or less, as a riddle that should be solved), requires a "very precise chronology (order) of the sound changes", as well as "a perfect knowledge of all the sound changes having occurred". He concluded that "regular expression-based approach may thus be rather suited for the final stage of a reconstruction rather than for exploratory purposes". What is remarkable about this comment is that it partly contradicts (at least in my opinion) the classical doctrine of historical language comparison, since we often assume that linguists apply their "sound laws" perfectly well, being able to explain the history of a given set of languages in full detail. The sparsity of the available literature, and the problems that even small experiments encounter, show that the idea of completely regular sound change that can be laid out in the form of transducers has always remained an idea, but was never really practiced. It seems that it is time to leave the realm of theory and do more practical research on sound change, as suggested by Thomas.

In response to my post on problem number 7 (August), the proof of language relatedness, Guillaume Jacques wrote that: "although most historical linguists see inflectional morphology as the most convincing evidence for language relatedness, it is very difficult to conceive a statistical test that could be applied to morphological paradigms in any systematic way cross-linguistically". I think he is completely right with this point.

J. Pystynen made a very good point with respect to my post on the typology of semantic change (September), mentioning that semantic change may, similar to sound change, also be subject to dynamics resulting from the fact that the lexicon of a given language at a given time is a system whose parts are determined by their relation to each other.

David Marjanović criticized my use (in October) of the Indo-European laryngeals as an example to make clear that the abstractionalist-realist problem in the debate about sound change has an impact on what scholars actively reconstruct, and that they are often content to not further specify concrete sound values as long as they can be sure that there are distinctive values for a given phenomenon. His main point was that — in his opinion — the reconstruction of sound values for the Indo-European laryngeal is much clearer than I presented it in my post. I think that Marjanović was misunderstanding the point I wanted to make; and I also think that he is not right regarding the surety with which we can determine sound values for the laryngeal sounds.

As a last and very long comment from November, Alex(andre) François (I assume that it was him, but he only left his first name) provided excellent feedback on the last problem, which I had labelled the problem of establishing a typology of "semantic promiscuity". Alex argues that I overemphasized the role of semantics in the discussion, and that the phenomenon I described might better be labelled "lexical yield of roots". I think that he's right with this criticism, but I am not sure whether the term "lexical yield" is better than the notion of promiscuity. Given that we are searching for a counterpart of the mostly form-based term "productivity", which furthermore focuses on grammatical affixes, the term "promiscuity" focuses on the success of certain form-concept pairs at being recycled during the process of word formation. Alex is right that we are in fact talking about the root here, as a linguistic concept that is — unfortunately — not very strictly defined in linguistics. For the time being, I would propose either the term "root promiscuity" or "lexical promiscuity", but avoid the term "yield", since it sounds too static to me.

Advances on particular problems

Although the problems that I posted are personal, and I am keen to try tackling them in at least some way in the future, I have not yet managed to advance on any of them in particular.

I have experimented with new approaches to borrowing detection, which are not yet in a state where they could be published, but they helped me to rethink the whole matter in detail. Parts of the ideas that I shared in the blog post also appeared, in a deeper discussion, in an article that was published this year (List 2019a).

I played with the problem of morpheme detection, but none of the different approaches was really convincing enough so far. However, I am still convinced that we can do better than "meaning-less" NLP approaches (which try to infer morphology from dictionaries alone, ignoring any semantic information).

A peripheral thought on automated phonological reconstruction, focusing on the question of how to compare a set of automated reconstructions with a set of human-annotated gold standard data, has now been published (List 2019b) as a comment on a target study by Jäger (2019). While my proposal can solve cases where two reconstruction systems differ only by their segment-wise phonological information, I had to conclude my comment by admitting that there are cases where two sets of words in different languages are equivalent in their structure, but not identical. Formally, that means that structurally identical sets of segmented strings in linguistics can be converted from one set to the other with the help of simple replacement rules, while structurally equivalent (I am still unsure whether the two terms are well chosen) sets of segmented strings may require additional context rules.
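To make the distinction a bit more tangible, here is a minimal Python sketch of a check for structural identity. It reflects my reading of the paragraph above and is not the procedure from List (2019b): I assume that the two sets are given as lists of segmented words in the same order, and that "simple replacement rules" means a one-to-one, context-free mapping between segments. The example words are invented.

```python
def replacement_rules(set_a, set_b):
    """Return a segment-to-segment mapping that converts set_a into set_b.

    Returns None if no one-to-one, context-free mapping exists, that is,
    if the two sets are not structurally identical in the sense assumed here.
    """
    if len(set_a) != len(set_b):
        return None
    forward, backward = {}, {}
    for word_a, word_b in zip(set_a, set_b):
        if len(word_a) != len(word_b):
            return None
        for seg_a, seg_b in zip(word_a, word_b):
            # Each segment must always map to the same partner, both ways.
            if forward.setdefault(seg_a, seg_b) != seg_b:
                return None
            if backward.setdefault(seg_b, seg_a) != seg_a:
                return None
    return forward


# Hypothetical reconstruction systems for the same two words.
system_1 = [["p", "a", "t"], ["t", "a", "p"]]
system_2 = [["b", "a", "d"], ["d", "a", "b"]]  # p -> b, t -> d everywhere
system_3 = [["b", "a", "d"], ["d", "a", "p"]]  # p -> b only word-initially

print(replacement_rules(system_1, system_2))
print(replacement_rules(system_1, system_3))
```

The first comparison yields a plain replacement table, while the second fails, because "p" corresponds to "b" in one position and to "p" in another. Converting between such systems would need a rule that refers to the context, which is the harder case described above.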

Although I tried to advance on most of the problems mentioned throughout the year, and I carried out quite a few experiments, most of the things that I tested were not conclusive. Before I discuss them in detail, I should make sure they actually work, or provide a larger study that emphasizes and explains why they do not work. At this stage, however, any sharing of information on the different experiments I ran would be premature, leading to confusion rather than to clarification.

Strategies for problem solving

Those of you who have followed my treatment of all the problems over the year will see that I tend to be very careful in delegating problem solutions to classical machine learning approaches. I do this because I am convinced that most of the problems that I mentioned and discussed can, in fact, be handled in a very concrete manner. When dealing with problems that one thinks can ultimately be solved by an algorithm, one should not start by developing a machine learning algorithm, but rather search for the algorithm that really solves the problem.

Nobody would develop a machine learning approach to replace an abacus, although this may in fact be possible. In the same way, I believe that the practice of historical linguistics has sufficiently shown that most of the problems can be solved with the help of concrete methods, with the exception, perhaps, of phylogenetic reconstruction (see, for example, my graph-based solution for the sound correspondence pattern detection problem, presented in List 2019c). For this reason, I prefer to work on concrete solutions, avoiding probabilistic approaches or black-box methods, such as neural networks.

A language problem

Retrospect and outlook

In retrospect, I enjoyed the series a lot. It had the advantage of being easier to plan, as I knew in advance what I had to write about. It was, however, also tedious at times, since I knew I could not just talk about a seemingly simpler topic in my monthly post, but had to develop the problem and share all of my thoughts on it. In some situations, I had the impression that I failed, since I realized that there was not enough time to really think everything through. Here, the comments of colleagues were quite helpful.

Content-wise, the idea of looking at our field through the lens of unsolved problems turned out to be very useful. For quite a few of the problems, I have initial ideas (as I tried to indicate each time); and maybe there will be time in the next years to test them concretely, and potentially even to cross one or another problem off the big list.

Writing a series instead of a collection of unrelated posts turned out to have definite advantages. With my monthly goal of writing at least one contribution for the Genealogical World of Phylogenetic Networks, I never had to think too hard about finding something that might be interesting for a broader readership, as had happened in the past. However, a blog series has the disadvantage of not allowing for flexibility when something interesting comes up, especially if one sticks to one post per month and reserves this post for the series.

For the next year, I am still considering writing another series, but maybe this time I will handle it less strictly, allowing some room for surprise, since this is also one of the major advantages of writing scientific blogs: one is never really bound to follow beaten tracks.

But for now, I am happy that the year is over, since 2019 has been very busy for me in terms of work. Since this is the final post for the year, I would like to take the chance to thank all who read the posts, and specifically also all those who commented on them. But my greatest thanks go to David for being there, as always, reading my texts, correcting my errors in writing, and giving active feedback in the form of interesting and inspiring comments.

References

Jäger, Gerhard (2019) Computational historical linguistics. Theoretical Linguistics 45.3-4: 151-182.

List, Johann-Mattis (2019a) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

List, Johann-Mattis (2019b) Beyond Edit Distances: Comparing linguistic reconstruction systems. Theoretical Linguistics 45.3-4: 1-10.

List, Johann-Mattis (2019c) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Monday, December 9, 2019

The Science of Spice by S. Farrimond — in networks

It's feasting time; and any good feast tickles the tongues with flavours unknown and exotic. But not all spices go well with each other. One suggested solution is to "understand flavour connections" in order to "revolutionize your cooking", which is the subtitle of a book by Stuart Farrimond: The Science of Spice (Dorling Kindersley 2018, ISBN: 978-0-2413-0214-9).

In his book, Farrimond categorizes spices into flavour groups characterized by their major and secondary chemical compounds, such as "sweet warming phenols", "fragrant terpenes" and "pungent compounds". He presents a "periodic table of spices" covering 54 spices, and gives a four-step protocol for how to combine spices:

Step 1: Choose the main flavour group(s);
Step 2: Check the blending science (which is quite elaborate — you have to buy the book);
Step 3: Pick your primary spices; and
Step 4: Add complexity (something we strongly encourage in general here at the Genealogical World of Phylogenetic Networks).

Farrimond provides five sets of principal data in the various chapters of his book entitled: "Spice science" (an introduction), "World of Spice" (which spices are used in which countries, including a recipe for a local spice blend), and "Spice Profiles" (bit of history, food to spice, blending science). For the 54 spices of the periodic table, they are:

chemical composition;
geography (uses as "signature", "supporting" and "supplementary spice" in various countries);
general characterization, such as "sweet", "pungent", "earthy", "complex";
food partners;
flavor category.

All of this information can be visualized using our beloved Neighbor-nets. Here, we will show only two: the flavour compounds network (based on information tabulated on pp. 214–217), and a network grouping countries by similarity in spice use. For those interested in the primary data used here, the tabulated data, character matrices and raw networks can be found on figshare.

Spice compounds

Humans are, and have always been, very diverse, and so is their food; and the spices are no exception. They contain numerous flavor-active substances, and Farrimond has picked for his periodic table of spices those that cover a huge range of flavor compounds. Accordingly, the Neighbor-net is star-like, as shown here.

Neighbor-net based on absence/presence of 117 chemical compounds that put spice in spices.

For estimating (Hamming) chemical 'inter-spice' distances, I used ternary ordered characters: "0" – absence; "1" – presence; "2" – flagged as major compound. Most flavor groups are chemically diverse; Mother Nature has many means to tickle our taste buds in a certain fashion. One exception is the "citrous terpenes" flavor group, whose spices are characterized by citral as the main flavor compound (otherwise only found as an accessory compound in wattle, ginger and turmeric), accompanied by linalool (a compound found in many other spices and the main compound of coriander).
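How the coding translates into a distance can be sketched in a few lines of Python. The two compound profiles below are invented, and since the text does not spell out exactly how the ordered states were weighted, the sketch shows two plausible readings side by side: a plain Hamming distance (any difference counts as one step) and an ordered distance (the difference between states 0 and 2 counts as two steps).

```python
def hamming(x, y):
    """Number of characters in which two profiles differ."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(x, y))


def ordered(x, y):
    """Sum of state differences, treating 0 < 1 < 2 as ordered steps."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical profiles over five compounds:
# 0 = absent, 1 = present, 2 = flagged as major compound.
spice_a = [2, 1, 0, 0, 1]
spice_b = [0, 1, 2, 0, 0]

print(hamming(spice_a, spice_b))
print(ordered(spice_a, spice_b))
```

With ordered characters, two spices that swap their major compound end up further apart than two spices that merely differ in an accessory compound.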

Geographic patterns

To visualize the geographic differentiation of the spices, I treated the absence/presence of each spice in the local cuisines as an ordered character:

"3" – a signature spice, ie. a main spice in the local cuisines;
"2" – supporting spice accompanying many dishes;
"1" – supplementary spice, ie. a spice to round up or add more particularity;
"0" – absent, ie. not mentioned by Farrimond.

In total, the matrix covers 93 spices for 44 countries/regions. Some spices are relatively ubiquitous, and hence are not informative about geographical variation, such as chili (37 out of 44 cuisines, with 26 using it as a signature spice), garlic (25 uses as a signature spice) or ginger (19), while others are rare or geographically quite restricted. For instance, turmeric is a signature spice of Indian cuisines and also of South Africa. During the British Empire, many Indians migrated to South Africa, and Indian traditions blended in with African and European ones, which makes South Africa an interesting place to visit and feast (as I can affirm first hand).
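Counts like these can be read directly off the coded matrix. The following Python sketch uses a tiny invented matrix (four imaginary cuisines, three spices; not the data from the book) and a helper name of my own choosing, just to show the bookkeeping.

```python
SIGNATURE = 3  # state used above for a signature spice

# Hypothetical matrix: cuisine -> spice -> ordered state (0 to 3).
matrix = {
    "cuisine_1": {"chili": 3, "garlic": 3, "turmeric": 0},
    "cuisine_2": {"chili": 3, "garlic": 1, "turmeric": 3},
    "cuisine_3": {"chili": 2, "garlic": 3, "turmeric": 0},
    "cuisine_4": {"chili": 0, "garlic": 2, "turmeric": 0},
}


def usage_summary(matrix):
    """Per spice: (cuisines using it at all, cuisines using it as signature)."""
    summary = {}
    for states in matrix.values():
        for spice, state in states.items():
            used, signature = summary.get(spice, (0, 0))
            summary[spice] = (
                used + (state > 0),
                signature + (state == SIGNATURE),
            )
    return summary


for spice, (used, signature) in usage_summary(matrix).items():
    print(f"{spice}: used in {used} cuisines, signature in {signature}")
```

A spice that is used nearly everywhere, like garlic in this toy example, hardly separates the cuisines, whereas one with a restricted distribution, like turmeric here, carries most of the geographic signal.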

A global network based on the used spices. Colorization refers to the continental regions used by Farrimond (chapter World of Spice, p. 20ff)

Not unexpectedly, the network shown here reflects geographic vicinity as well as rather ancient historical connections. For example, most aspects of European civilization have their origin in the Middle East, and spices reached medieval Europe via Arab sea-traders and the Silk Route; but there was also influence from elsewhere during the various colonial epochs.

The Latin American cuisines are spice-wise most similar to those of Spain and Portugal within their regional groups, while Canada and the U.S.A. mix this tradition with that of other European countries such as Italy and France. Great Britain is distinct because His/Her majesties ruled many lands with a great variety of food and spices. In contrast to many other aspects of colonialism, the influence hence goes both ways.

The most unique spice cuisines are those of Indonesia, the home of many spices (and the reason why both the Portuguese and the Spanish set sail), and (tropical) western and central Africa. That the Horn of Africa groups within the South Asian group is not surprising, as it was for a very long time the sea-trade spice hub between Asia and Europe.

There is also a higher diversity seen in the Southeast Asian cuisines compared to those of the East Asian and South Asian countries and regions.

A bit of an oddball is the placement of the Caribbean cuisine, and especially the Creole kitchen, which is known for its spice mixing — in Farrimond's three-concepts characterization: "Adventurous | Bold | Spicy".

Conclusion

So, in case you want to spice up the coming holiday and festive season, Farrimond's book is an invaluable source for applied science, which has a simple primary use: filling the mouth with taste while filling the belly with ballast.

Monday, December 2, 2019

Trees informing networks explaining trees

Working at the coalface of evolution, one phenomenon always intrigued me: How does the signal in the data build up a tree? Especially since we have to assume some sort of reticulation happened at some point — evolution is rarely a strictly dichotomous process, which we would model by a tree. In earlier posts, we have covered the difference between clades and grades in a tree, and Hennig's concepts of monophyly and paraphyly. Clearly, in the light of actual evolutionary processes, the cladistic approach synonymizing clades with monophyly is a simplification at best, and naive at worst.

In this post, I will discuss a real-world example using molecular data put together for a (probably) recently evolved plant genus, Drosanthemum, as discussed in this paper:

Liede-Schumann S, Grimm GW, Nürk NM, Potts AJ, Meve U, Hartmann HEK. 2019. Phylogenetic relationships in the southern African genus Drosanthemum (Ruschioideae, Aizoaceae). bioRxiv preprint.

These days, next-generation sequencing (NGS) and phylogenomic data may provide what you need to resolve everything from the beginning of life to the very tips of the Tree of Life (ToL; which, at the root and tips is probably more of a Network of Life). However, this has two shortcomings: You need a lot of money, and a lot of DNA. Given the number of modern-day plant species, including not a few that are in flux, it's pretty safe to assume that I won't live long enough to have all of the ToL leaves resolved by NGS data.

On the other hand, there are countless scientists with taxonomic expertise struggling for funding; and classic Sanger sequencing has become very cheap. Thus, oligogene (fossil) data sets will remain in use for quite some time. We do, however, have to deal with their shortcomings, such as not giving us a fully resolved phylogenetic tree, but instead providing partially diffuse signals. Nevertheless, we can get a lot of insights by combining traditional tree and network inferences.

The tree

To test systematic concepts and construct a species phylogeny for Drosanthemum, we tapped into four non-coding plastid gene regions, which, following earlier research, were the most divergent within the larger group:

the close intergenic plastid spacers trnK-rps16 and rps16-trnQ,
the trnS-trnG intergenic spacer, and
the rpl16 intron.

Following popular demand, we also sequenced the nuclear-encoded ITS region used in earlier phylogenetic studies (despite being quite useless for tree inference in the larger lineage, being much too conserved).

We did a full analysis: single-gene tree inference and bootstrapping vs. combined analysis, with or without data partitioning. We concluded that the combined plastid tree (not including ITS data) does provide a good phylogenetic-systematic framework for the genus.

Our Drosanthemum tree, rooted using the most probable rooting scenario (following an outgroup-EPA analysis; see Liede-Schumann et al., fig. 4 and Supplement file S4). Major clades and subclades are annotated, on the left the morphological subgenera associated with each major clade.

Why consider it to be good? Well, it fits amazingly well with the morphology-based systematics. This is not trivial, because evolution doesn't follow a straight path: (i) reticulation will happen at least during the formation of species; (ii) there will always be some incomplete lineage sorting of geno- and phenotypes; and (iii) morphologies have substantially different evolutionary constraints from noncoding plastid gene regions. If the two nevertheless end up in a good match, it won't be by coincidence but more likely because our inferred phylogeny captures the true tree (coalescent) well.

Regarding monophyly, the tree is hence well suited to construct a framework: most of our major clades are linked to a specific, clade-unique morphology. (Don't hope to find too many autapomorphies; in plants, common origin typically manifests in diagnostic character suites rather than individual aut-/synapomorphies.) The exception is subgenus Drosanthemum, which is apparently diphyletic; this term is meant literally, not just because its members form two molecular clades. Furthermore, although not visible on the tree, Clade IV / Vespertina may well have evolved from Clade III / Drosanthemum. The III-IV grade represents a monophyletic group, and Clade III may be paraphyletic.

At this point, you may be thinking: Guido has lost it; but bear with me.

Ancestral and derived haplotypes

The point is, I know our data. When we look at the sequence patterns in the gene regions, we can readily see that Clade III (subgenus Drosanthemum) and IV (subgenus Vespertina) likely had a (genetic) common ancestor different from that of the more evolved clades I (subgenus Drosanthemum) and II, and hence the high support and increased branch length of the corresponding branches: I + II and III + IV could well be reciprocally monophyletic. Realizing that clades III and IV are part of the same evolutionary lineage, we can take a closer look at them using, for example, Median networks. Parsimony is generally inferior to probabilistic methods when dealing with (mostly) neutral but stochastic mutation patterns. However, since we are very close to the coalface of evolution, we are dealing with rather minute changes, too minute for ML to make a well-informed call (also, there is no ML counterpart to haplotype networks).

To not miss something in our data (or overweight indels and linked mutation patterns), we do not just use the nucleotide sequences but we tabulate and code each mutational pattern: simple ones like single-nucleotide polymorphisms (SNPs) and duplications, but also complex ones, such as sequence patterns in length-polymorphic, sequentially diverse parts. The next figure shows an example:

"Export" refers to the unaltered export of (parsimony-informative) variable sites of the aligned nucleotide matrix; "recoded" the correction for excess mutations (when treating gaps as 5th base) in order to ensure the coding matches the number of steps in the theory of Median networks (see Liede-Schumann et al., supplement file S2).

Now, we are operating above the species level, which is outside the comfort zone of Median networks, which were originally designed for investigating within-species population structure. We are dealing with signals from phylogenetic sorting (eg. evolution of complex sequence patterns, see example above) mixed with (partly) convergent / homoiologous patterns (eg. duplications, which are very rarely lineage-specific in plant plastomes). The resulting Median networks are quite complex, as shown next.

The output from NETWORK for Clade III + IV and the trnG-trnS intergenic spacer. In total, the matrix codes for 14 mutational patterns (10 SNPs, 3 indels, 1 length-polymorphic region involving a SNP; see sheet Clade3&4.trnGS in f_Haplotyping.xlsx in folder 1_main_data_and_results of the online supporting archive @ DataDryad); the red edge numbers indicate which pattern changed, the bubble colors refer to the group: Cyan, Subgrade IIIa; blue, Subclade IIIb; yellow, Clade IV (note: Clade III and IV differ from other major clades by uniquely shared patterns)
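The "export" step mentioned earlier (pulling the parsimony-informative variable sites out of an aligned matrix, with gaps treated as a 5th base) can be sketched in a few lines of Python. This is only an illustration on made-up toy sequences, not the procedure or data used in the paper; the function names `parsimony_informative_sites` and `export_matrix` are hypothetical, and the subsequent "recoding" of excess gap mutations is not shown.

```python
from collections import Counter


def parsimony_informative_sites(alignment):
    """Return the indices of parsimony-informative alignment columns.

    A column is counted as informative when at least two different
    states each occur in at least two sequences. Gaps ('-') are
    treated as a fifth state (an assumption made for this sketch).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("sequences must be aligned to equal length")
    informative = []
    for i in range(length):
        counts = Counter(alignment[name][i] for name in names)
        shared_states = [s for s, c in counts.items() if c >= 2]
        if len(shared_states) >= 2:
            informative.append(i)
    return informative


def export_matrix(alignment, columns):
    """Reduce each sequence to the selected columns only."""
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Toy alignment (invented accessions, not real Drosanthemum data).
toy = {
    "acc1": "ACGT-AGT",
    "acc2": "ACGT-AGT",
    "acc3": "ATGTAAGC",
    "acc4": "ATGTAAGC",
    "acc5": "ACGTAAGA",
}

columns = parsimony_informative_sites(toy)
reduced = export_matrix(toy, columns)
```

Note that an indel spanning several alignment columns would be exported here as several independent characters, which is exactly the kind of excess that the recoding is meant to correct.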

One option would be to weight the characters. However, it is pretty difficult to come up with a weighting scheme given that we deal with very different mutation patterns, which include everything from simple transitions to reorganization of length-polymorphic regions. When looking at the SNPs, AG transitions appear to be more probable than AC transversions, but some AG transitions are highly diagnostic for clades, while some AC transversions seem random. Instead of getting lost in weighting (and self-enforcing bias), we compare them across the four gene regions by collapsing haplotype groups and their (diffuse) subnets, as shown next.

'Condensed' Median networks for the Clade III + IV lineage; parts of the graphs collecting sequentially similar members of one group are replaced by bubbles (cf. Liede-Schumann et al., fig. 6).

Note that, in contrast to traditional haplotype networks, the bubbles in the figure don't represent the number of accessions of the haplotype, but instead represent the sequential diversity of the collapsed haplotype group. From the graphs superimposed on the background of the combined tree, it is straightforward to see which haplotype may be ancestral within a lineage, which haplotype is derived, and how they relate to each other.
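For readers less familiar with the transition / transversion distinction that came up in the weighting discussion in this section, here is a minimal Python sketch. It only classifies a single base change (purine to purine or pyrimidine to pyrimidine is a transition, anything else a transversion); it says nothing about how diagnostic a given change is, which is the actual problem with weighting. The function name `classify_substitution` is made up for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a change between two DNA bases.

    Returns 'transition' when both bases are purines or both are
    pyrimidines, 'transversion' when they belong to different
    classes, and 'identical' when nothing changed.
    """
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not an unambiguous DNA base: {base!r}")
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

So an AG change is a transition and an AC change is a transversion, but as noted above, a simple class-based weight would treat a clade-diagnostic AG transition and a seemingly random one exactly the same.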

Paraphyletic "clades"

We now know how the haplotypes of each covered gene region are related to each other, which species have substantially derived sequences, and which species have putatively ancestral sequences. Using the networks and by comparison with the sequence patterns in the sister group(s), we could even reconstruct a hypothetical haplotype of the common ancestor. But just by comparing the Median networks for each gene region with the corresponding subtrees in the combined tree we can (try to) interpret our clades and grades as monophyletic or paraphyletic.
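The sister-group comparison can be illustrated with a small Python sketch. It assumes the mutational patterns have already been coded as one symbol per character, and it uses a deliberately naive rule: if the ingroup is invariant, that state is taken as ancestral; if the ingroup varies and exactly one of its states also occurs in the sister group, that shared state is taken as ancestral; everything else stays unresolved ('?'). The haplotypes are invented, and the function names `hypothetical_ancestor` and `matches` are hypothetical; this is not the reconstruction done in the paper.

```python
def hypothetical_ancestor(ingroup, sister):
    """Infer a putative ancestral haplotype by sister-group comparison."""
    length = len(next(iter(ingroup.values())))
    rows = list(ingroup.values()) + list(sister.values())
    if any(len(row) != length for row in rows):
        raise ValueError("all haplotypes must have the same length")
    ancestor = []
    for i in range(length):
        in_states = {hap[i] for hap in ingroup.values()}
        out_states = {hap[i] for hap in sister.values()}
        shared = in_states & out_states
        if len(in_states) == 1:
            # Ingroup invariant: nothing to polarize.
            ancestor.append(next(iter(in_states)))
        elif len(shared) == 1:
            # Exactly one ingroup state is also found in the sister group.
            ancestor.append(next(iter(shared)))
        else:
            # No or several shared states: leave unresolved.
            ancestor.append("?")
    return "".join(ancestor)


def matches(haplotype, ancestor):
    """True if a haplotype agrees with every resolved ancestral state."""
    return all(a == "?" or a == h for h, a in zip(haplotype, ancestor))


# Toy coded haplotypes (invented for illustration).
ingroup = {"hapA": "0010", "hapB": "0000", "hapC": "0101"}
sister = {"sis1": "0002", "sis2": "1002"}

ancestor = hypothetical_ancestor(ingroup, sister)
candidates = [name for name, hap in ingroup.items() if matches(hap, ancestor)]
```

In this toy case only hapB agrees with all resolved ancestral states, so it would be the candidate for a little-evolved, putatively ancestral haplotype.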

Fig. 5 from Liede-Schumann et al. (2019) showing the 'condensed' Median networks for Clade I/ Drosanthemum (s.str.)

Members of Subclade Ib, the subtree with the worst support within Clade I / Drosanthemum (s.str.), may represent the survivors of the initial radiation, and hence are a paraphyletic group. They are resolved as a clade in the tree because of the signal from the trnS-trnG region producing a clear split between the three groups. However, this is also the most-conserved gene region, and when compared with the mutational patterns in the other clades (especially the sister clade, Clade II), it would not be far-fetched to conclude that the trnS-trnG haplotype B is the original haplotype of the entire lineage.

The distinctive feature of Subclade Ib in the trnS-trnG region is a complex duplication pattern not found in the otherwise genetically more coherent subclades Ia and Ic, as shown next.

This looks like a simple evolutionary sequence, with Clade Ia and Ic having retained the original pattern, with the complex pattern being a derived, clade-unique feature of Clade Ib (an autapomorphy for the corresponding monophyletic group).

But when we add the patterns of Clade II, the reciprocally monophyletic sister clade, it's not that simple anymore, as shown next.

Why one should be careful with gap-coding: even complex plastid duplication patterns evolve in parallel (or convergently). No matter whether X-Y or X-Y'-X-Y is the ancestral pattern, we have one/two convergent mutations in parts of Clades I and II; either duplication of X and insertion of Y' or (subsequent) deletion of X-Y'.

Realizing that a few clades in our tree may be paraphyletic gives us a new edge on our data and phylogenetic framework that can be further elaborated, because such clades directly point towards a first, quick radiation that predates the formation of the monophyletic molecular clades (this is only a tautology in cladistics, not in phylogenetics). The members of paraphyletic molecular clades are genetically distinct (long terminal branches, typically low and/or ambiguous support for the clade root) or little evolved survivors (short root and terminal branches, but relatively high root branch support).

Furthermore, we can now see why some species act a bit roguish, are difficult to resolve, or inflict internal data conflict.

'Condensed' Median networks for the sister clades V and VI (modified after Liede-Schumann et al., fig. 7)

Drosanthemum gracillimum is the only species in our tree that doesn't resolve as a member of one of the two main (definitely monophyletic) subclades within Clade V: Subclade Va / Speciosa and Vb / Ossicula, genetically close but morphologically distinct sisters. We had no material of this species for our analysis, and instead used available GenBank data (out of curiosity). Its trnS-trnG and rps16-trnQ haplotypes are unique but rather ancestral within Clade V, and hence the tree cannot resolve where to place it.

Another example of how the ancestry of sequences contributes to topological conflict or ambiguity in intrageneric phylogenies, which also illustrates the limitation of our approach, is one individual of D. striatum. It's the only member of Subclade Vb / Ossicula with a Subclade Va / Speciosa-type rps16-trnQ. With respect to my last blog post, the simplest explanation is that it just retained a less derived rps16-trnQ haplotype. However, this spacer includes a highly divergent, genotaxonomically valuable region that we had to exclude from all analyses (but included in our spreadsheet haplotyping.xlsx). In this region, it shows a unique, complex, apparently derived pattern shared with a few other members of the sister Subclade Va. Maybe there was some reticulation and plastome recombination at work here (contamination can be ruled out, as the material was processed twice).

Just try it with your own data

We cannot all afford perfect, often seemingly trivial, NGS / phylogenomic data. Combined trees can inform us about groups sharing a likely (mostly inclusive) common origin, such as molecular clades with fair support and distinctly long root branches and/or shared unique morphologies (ie. "monophyla" in a strict Hennigian sense). Clade-restricted haplotype networks can help us to understand the molecular evolution in these groups, free from the assumption of dichotomy and time equality.

By definition, all tip sequences represent the same time (today) in a tree, so they can only be sisters not ancestors and descendants. In reality, when we approach the coalface, we have some sequence patterns or actual sequences that are ancestral to others, because the species carrying them didn't evolve as much and as fast as their sibling(s). At some time, different parts evolved at different speeds within one lineage (see the examples above).
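A toy Python sketch may make this concrete. Given coded haplotypes of equal length, it checks whether one observed haplotype lies on a shortest mutational path between two others, ie. whether d(x, z) = d(x, y) + d(y, z) using simple Hamming distances. A tree would show all three as tips, while a haplotype network can place y on the path between x and z. The data are invented, the function names `hamming` and `intermediates` are made up for this illustration, and the check assumes the haplotypes are all distinct.

```python
from itertools import combinations


def hamming(a, b):
    """Number of characters in which two coded haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def intermediates(haplotypes):
    """Names of haplotypes lying on a shortest path between two others.

    Haplotype y is reported when, for some pair x and z,
    d(x, z) == d(x, y) + d(y, z).
    """
    found = set()
    names = list(haplotypes)
    for y in names:
        others = [n for n in names if n != y]
        for x, z in combinations(others, 2):
            hx, hy, hz = haplotypes[x], haplotypes[y], haplotypes[z]
            if hamming(hx, hz) == hamming(hx, hy) + hamming(hy, hz):
                found.add(y)
    return sorted(found)


# Toy example: 'slow' did not change, its two siblings each
# accumulated two private mutations.
toy_haplotypes = {
    "slow": "0000",
    "derived1": "1100",
    "derived2": "0011",
}

on_path = intermediates(toy_haplotypes)
```

Here only the 'slow' haplotype sits between the other two, which is the situation described above: a sequence that didn't evolve as much as its siblings and is, in effect, ancestral to them.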

The networks hence fill a gap that the tree can't possibly resolve. They allow us to understand why the tree may make more sense in certain parts than in others; and where it is probably 100% reliable and where we may want to have a closer look. Furthermore, only the networks can tell us if there is some real conflict in the data: different gene regions reflecting different histories.

Epilogue

As a careful reader, you may have noticed that we skipped the ITS sequence entirely. The reason is shown in the following two graphs.

The first one shows a statistical parsimony network of all of our ITS data compiled for the species included in the plastid combined tree.

A statistical parsimony network based on the ITS data. Colors give the main cp clades (see Liede-Schumann et al., Supplement file S3)

The network approaches a spider-web, as shown above. The reason for this is that there are only a limited number of ITS positions where Drosanthemum fixes mutations (notably nearly exclusively SNPs, with no length-polymorphism). So, the genus is likely a young one, much like its sister clade the Ruschieae, which also mostly shows randomly distributed ITS mutation patterns.

Inferring an ITS tree is possible but useless, in that the data don't provide a clear signal. Furthermore, when we map the observed mutational patterns onto the plastid tree, we see a lot of messing up towards the leaves; but, in principle, it's all just sorting along the shared coalescent. We can identify those ITS mutations that are (plastid-)clade-specific and lineage-diagnostic, including ITS-"synapomorphies" for plastid-inferred clades that are likely monophyletic (being correlated to a distinct morphology and supported by derived, uniquely shared sequence patterns).

ITS genotypes mapped on the plastid tree, pointing to a largely congruent history with incomplete (ITS) lineage sorting. CU = clade-unique sequence pattern; Sh = shared, not unique, sequence pattern.

This opens the door to quickly screen for individuals / species that don't fall in line with the coalescent but are the product of (deep) reticulation (either using bulk sequencing and NGS genotyping or traditional cheap methods such as PCR-RFLP).