Monday, December 16, 2019

Open problems in computational diversity linguistics: Conclusion and Outlook


One year has now passed since I discussed the idea with David to devote a whole year of 12 blogposts to the topic of "Open problems in computational diversity linguistics". It is time to look back at this year, and the topics that have been discussed.

Quantitative view

The following table lists the pageviews (or clicks) for each blogpost (with all caveats as to what this actually entails), from January to November.

Problem  Month      Title                                       Clicks  Comments
0        January    Introduction                                535     4
1        February   Automatic morpheme detection                718     0
2        March      Automatic borrowing detection               422     1
3        April      Automatic sound law induction               522     2
4        May        Automatic phonological reconstruction       517     0
5        June       Simulation of lexical change                269     0
6        July       Simulation of sound change                  423     0
7        August     Statistical proof of language relatedness   383     1
8        September  Typology of semantic change                 372     2
9        October    Typology of sound change                    250     3
10       November   Typology of semantic promiscuity            217     2

The first thing to note is that people might have gotten tired of the problems, since the last two blogposts were not very well-received in terms of readers (or not yet, anyway). One should, however, not forget that the number of clicks recorded by the system is cumulative, so if a blogpost is older, it may have received more readers just because it has been online for a longer time.
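To make the caveat about cumulative clicks concrete, here is a minimal sketch in Python that divides each post's clicks by a rough estimate of how long it has been online. The click counts are copied from the table above; the age estimate (a post from month i has been online for 12 - i months by December) and the function name are assumptions made only for this illustration, and the result says nothing about how clicks were actually distributed over time.

```python
# Clicks per post, copied from the table above (January to November 2019).
CLICKS = {
    "January": 535, "February": 718, "March": 422, "April": 522,
    "May": 517, "June": 269, "July": 423, "August": 383,
    "September": 372, "October": 250, "November": 217,
}


def clicks_per_month_online(clicks):
    """Divide each post's clicks by a rough estimate of its age in months.

    Assumption (not from the blog statistics): the clicks were counted in
    December, so the post from calendar month i has been online for
    (12 - i) months, with i = 1 for January.
    """
    rates = {}
    # Dictionaries keep insertion order, which is calendar order here.
    for index, month in enumerate(clicks, start=1):
        months_online = 12 - index
        rates[month] = clicks[month] / months_online
    return rates


if __name__ == "__main__":
    for month, rate in clicks_per_month_online(CLICKS).items():
        print(f"{month:<10} {rate:6.1f} clicks per month online")
```

Under this crude normalization, the raw ranking of the posts changes considerably, which is why the raw counts should be read with caution.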

What seems, however, to be interesting is the rather high number of readers for the February post; and it seems that this is related to the topic, rather than the content. Morpheme detection is considered to be a very interesting problem by many practitioners of Natural Language Processing (NLP), and the field of NLP has generally many more followers than the field of historical linguistics.

Reader comments and discussions

For a few of the posts, I received interesting comments, and I replied to all of them where I found that a reply was in order. A few of them are worth emphasizing here.

As a first comment in March, Guillaume Jacques replied in the form of a blog post of his own, where he proposed a very explicit method for the detection of borrowings, which assumes that data are compared where an ancestral language is available in written sources (see here for the post). Since it will still take some time to prepare the data in the manner proposed by Guillaume, I have not had time to test this method for myself, but it is a very nice example of a new method for borrowing detection, which addresses one specific data type and has so far not been tested.

Thomas Pellard provided a very useful comment on my April post, emphasizing that automatic reconstruction based on regular expressions (as I had proposed it, more or less, as a riddle that should be solved), requires a "very precise chronology (order) of the sound changes", as well as "a perfect knowledge of all the sound changes having occurred". He concluded that "regular expression-based approach may thus be rather suited for the final stage of a reconstruction rather than for exploratory purposes". What is remarkable about this comment is that it partly contradicts (at least in my opinion) the classical doctrine of historical language comparison, since we often assume that linguists apply their "sound laws" perfectly well, being able to explain the history of a given set of languages in full detail. The sparsity of the available literature, and the problems that even small experiments encounter, show that the idea of completely regular sound change that can be laid out in the form of transducers has always remained an idea, but was never really practiced. It seems that it is time to leave the realm of theory and do more practical research on sound change, as suggested by Thomas.

In response to my post on problem number 7 (August), the proof of language relatedness, Guillaume Jacques wrote that: "although most historical linguists see inflectional morphology as the most convincing evidence for language relatedness, it is very difficult to conceive a statistical test that could be applied to morphological paradigms in any systematic way cross-linguistically". I think he is completely right with this point.

J. Pystynen made a very good point with respect to my post on the typology of semantic change (September), mentioning that semantic change may, similar to sound change, also be subject to dynamics resulting from the fact that the lexicon of a given language at a given time is a system whose parts are determined by their relation to each other.

David Marjanović criticized my use (in October) of the Indo-European laryngeals as an example to make clear that the abstractionalist-realist problem in the debate about sound change has an impact on what scholars actively reconstruct, and that they are often content to not further specify concrete sound values as long as they can be sure that there are distinctive values for a given phenomenon. His main point was that — in his opinion — the reconstruction of sound values for the Indo-European laryngeal is much clearer than I presented it in my post. I think that Marjanović was misunderstanding the point I wanted to make; and I also think that he is not right regarding the surety with which we can determine sound values for the laryngeal sounds.

As a last and very long comment from November, Alex(andre) François (I assume that it was him, but he only left his first name) provided excellent feedback on the last problem, which I had labelled the problem of establishing a typology of "semantic promiscuity". Alex argues that I overemphasized the role of semantics in the discussion, and that the phenomenon I described might better be labelled "lexical yield of roots". I think that he's right with this criticism, but I am not sure whether the term "lexical yield" is better than the notion of promiscuity. Given that we are searching for a counterpart of the mostly form-based term "productivity", which furthermore focuses on grammatical affixes, the term "promiscuity" focuses on the success of certain form-concept pairs at being recycled during the process of word formation. Alex is right that we are in fact talking about the root here, as a linguistic concept that is — unfortunately — not very strictly defined in linguistics. For the time being, I would propose either the term "root promiscuity" or "lexical promiscuity", but avoid the term "yield", since it sounds too static to me.

Advances on particular problems

Although the problems that I posted are personal, and I am keen to try tackling them in at least some way in the future, I have not yet managed to advance on any of them in particular.

I have experimented with new approaches to borrowing detection, which are not yet in a state where they could be published, but they helped me to rethink the whole matter in detail. Parts of my ideas shared in this blog post also appeared, in a deeper discussion, in an article that was published this year (List 2019a).

I played with the problem of morpheme detection, but none of the different approaches was really convincing enough so far. However, I am still convinced that we can do better than "meaning-less" NLP approaches (which try to infer morphology from dictionaries alone, ignoring any semantic information).

A peripheral thought on automated phonological reconstruction, focusing on the question of how to evaluate a set of automated reconstructions against a set of human-annotated gold standard data, has now been published (List 2019b) as a comment to a target study by Jäger (2019). While my proposal can solve cases where two reconstruction systems differ only by their segment-wise phonological information, I had to conclude my comment by admitting that there are cases where two sets of words in different languages are equivalent in their structure, but not identical. Formally, that means that structurally identical sets of segmented strings in linguistics can be converted from one set to the other with the help of simple replacement rules, while structurally equivalent (I am still unsure whether the two terms are well chosen) sets of segmented strings may require additional context rules.
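The distinction can be illustrated with a small Python sketch. It is only one possible reading of "simple replacement rules", namely a one-to-one mapping between segments that holds regardless of context; the function name and the toy words are invented for this illustration and are not taken from List (2019b).

```python
def replacement_rules(source, target):
    """Return a one-to-one segment mapping that converts source into target.

    Both arguments are lists of words, and each word is a list of segments.
    The function returns None as soon as a segment would need two different
    counterparts, that is, when context-free replacement is not sufficient.
    """
    if len(source) != len(target):
        return None
    forward, backward = {}, {}
    for word_a, word_b in zip(source, target):
        if len(word_a) != len(word_b):
            return None
        for seg_a, seg_b in zip(word_a, word_b):
            # setdefault stores the first counterpart seen and returns it;
            # any later mismatch means the mapping is not one-to-one.
            if forward.setdefault(seg_a, seg_b) != seg_b:
                return None
            if backward.setdefault(seg_b, seg_a) != seg_a:
                return None
    return forward


if __name__ == "__main__":
    # Hypothetical toy data: every "p" corresponds to "b", every "t" to "d".
    system_a = [["p", "a", "t"], ["t", "a"]]
    system_b = [["b", "a", "d"], ["d", "a"]]
    print(replacement_rules(system_a, system_b))

    # Here "t" corresponds to "t" initially but to "d" finally, so a
    # context rule would be needed and no simple mapping exists.
    print(replacement_rules([["t", "a", "t"]], [["t", "a", "d"]]))
```

In the second call, the two systems could still be called structurally equivalent in the sense above, but deciding this would require inferring context rules, which this sketch does not attempt.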

Although I tried to advance on most of the problems mentioned throughout the year, and I carried out quite a few experiments, most of the things that I tested were not conclusive. Before I discuss them in detail, I should make sure they actually work, or provide a larger study that emphasizes and explains why they do not work. At this stage, however, any sharing of information on the different experiments I ran would be premature, leading to confusion rather than to clarification.

Strategies for problem solving

Those of you who have followed my treatment of all the problems over the year will see that I tend to be very careful in delegating problem solutions to classical machine learning approaches. I do this because I am convinced that most of the problems that I mentioned and discussed can, in fact, be handled in a very concrete manner. When dealing with problems that one thinks can ultimately be solved by an algorithm, one should not start by developing a machine learning algorithm, but rather search for the algorithm that really solves the problem.

Nobody would develop a machine learning approach to replace an abacus, although this may in fact be possible. In the same way, I believe that the practice of historical linguistics has sufficiently shown that most of the problems can be solved with the help of concrete methods (see, for example, my graph-based solution for the sound correspondence pattern detection problem, presented in List 2019c), with the exception, perhaps, of phylogenetic reconstruction. For this reason, I prefer to work on concrete solutions, avoiding probabilistic approaches or black-box methods, such as neural networks.

A language problem

Retrospect and outlook

In retrospect, I enjoyed the series a lot. It had the advantage of being easier to plan, as I knew in advance what I had to write about. It was, however, also tedious at times, since I knew I could not just talk about a seemingly simpler topic in my monthly post, but had to develop the problem and share all of my thoughts on it. In some situations, I had the impression that I failed, since I realized that there was not enough time to really think everything through. Here, the comments of colleagues were quite helpful.

Content-wise, the idea of looking at our field through the lens of unsolved problems turned out to be very useful. For quite a few of the problems, I have initial ideas (as I tried to indicate each time); and maybe there will be time in the next years to test them concretely, and potentially even to cross one or another problem off the big list.

Writing a series instead of a collection of unrelated posts turned out to have definite advantages. With my monthly goal of writing at least one contribution for the Genealogical World of Phylogenetic Networks, I never had the problem, as had happened in the past, of thinking too hard about something that might be interesting for a broader readership. However, blog series have the disadvantage of not allowing for flexibility when something interesting comes up, especially if one sticks to one post per month and reserves this post for the series.

In the next year, I am still considering writing another series, but maybe this time, I will handle it less strictly, allowing some room for surprise, since this is also one of the major advantages of writing scientific blogs: one is never really bound to follow beaten tracks.

But for now, I am happy that the year is over, since 2019 has been very busy for me in terms of work. Since this is the final post for the year, I would like to take the chance to thank all who read the posts, and specifically also all those who commented on them. But my greatest thanks go to David for being there, as always, reading my texts, correcting my errors in writing, and giving active feedback in the form of interesting and inspiring comments.

References

Jäger, Gerhard (2019) Computational historical linguistics. Theoretical Linguistics 45.3-4: 151-182.

List, Johann-Mattis (2019a) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

List, Johann-Mattis (2019b) Beyond Edit Distances: Comparing linguistic reconstruction systems. Theoretical Linguistics 45.3-4: 1-10.

List, Johann-Mattis (2019c) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Monday, December 9, 2019

The Science of Spice by S. Farrimond — in networks


It's feasting time; and any good feast tickles the tongues with flavours unknown and exotic. But not all spices go well with each other. One suggested solution is to "understand flavour connections" in order to "revolutionize your cooking", which is the subtitle of a book by Stuart Farrimond: The Science of Spice (Dorling Kindersley 2018, ISBN: 978-0-2413-0214-9).

In his book, Farrimond categorizes spices into flavour groups characterized by their major and secondary chemical compounds, such as "sweet warming phenols", "fragrant terpenes" and "pungent compounds". He presents a "periodic table of spices" covering 54 spices, and gives a four-step protocol for how to combine spices:
  • Step 1: Choose the main flavour group(s);
  • Step 2: Check the blending science (which is quite elaborate — you have to buy the book);
  • Step 3: Pick your primary spices; and
  • Step 4: Add complexity (something we strongly encourage in general here at the Genealogical World of Phylogenetic Networks).
Farrimond provides five sets of principal data in the various chapters of his book entitled: "Spice science" (an introduction), "World of Spice" (which spices are used in which countries, including a recipe for a local spice blend), and "Spice Profiles" (bit of history, food to spice, blending science). For the 54 spices of the periodic table, they are:
  • chemical composition;
  • geography (uses as "signature", "supporting" and "supplementary spice" in various countries);
  • general characterization, such as "sweet", "pungent", "earthy", "complex";
  • food partners;
  • flavor category.
All of this information can be visualized using our beloved Neighbor-nets. Here, we will show only two: the flavour compounds network (based on information tabulated on pp. 214–217), and a network grouping countries by similarity in spice use. For those interested in the primary data used here, tabulated data, character matrices and raw networks can be found @ figshare.

Spice compounds

Humans are, and have always been, very diverse, and so is their food; and the spices are no exception. They contain numerous flavor-active substances, and Farrimond has picked for his periodic table of spices those that cover a huge range of flavor compounds. Accordingly, the Neighbor-net is star-like, as shown here.

Neighbor-net based on absence/presence of 117 chemical compounds that put spice in spices.

For estimating (Hamming) chemical 'inter-spice' distances, I used ternary ordered characters: "0" – absence; "1" – presence; "2" – flagged as major compound. Most flavor groups are chemically diverse; Mother Nature has many means to tickle our taste buds in a certain fashion. One exception is the "citrous terpenes" flavor group, whose spices are characterized by citral as the main flavor compound (otherwise only found as accessory compound in wattle, ginger and turmeric) accompanied by linalool (a compound found in many other spices and main compound of coriander).
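How such a coding turns into pairwise distances can be sketched in a few lines of Python. The two profiles below are invented toy values, not data from the book, and the two functions show two common options (a plain Hamming proportion and a variant that respects the order of the states); which exact distance the network software applied is not specified here, so both are assumptions for illustration.

```python
def hamming_distance(profile_a, profile_b):
    """Proportion of characters in which two profiles differ."""
    differing = sum(a != b for a, b in zip(profile_a, profile_b))
    return differing / len(profile_a)


def ordered_distance(profile_a, profile_b, max_state=2):
    """Mean absolute difference between ordered states, scaled to 0..1.

    With states 0 (absent), 1 (present) and 2 (major compound), a change
    from 0 to 2 counts twice as much as a change from 0 to 1.
    """
    total = sum(abs(a - b) for a, b in zip(profile_a, profile_b))
    return total / (max_state * len(profile_a))


if __name__ == "__main__":
    # Hypothetical profiles over four compounds (toy values only).
    spice_x = [2, 1, 0, 0]
    spice_y = [1, 1, 0, 2]
    print(hamming_distance(spice_x, spice_y))
    print(ordered_distance(spice_x, spice_y))
```

A full distance matrix is obtained by applying one of the functions to every pair of spices; the same idea carries over to characters with more than three ordered states by adjusting max_state.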

Geographic patterns

To visualize the geographic differentiation of the spices, I treated the absence/presence of each spice in the local cuisines as an ordered character:
  • "3" – a signature spice, ie. a main spice in the local cuisines;
  • "2" – supporting spice accompanying many dishes;
  • "1" – supplementary spice, ie. a spice to round up or add more particularity;
  • "0" – absent, ie. not mentioned by Farrimond.
In total, the matrix covers 93 spices for 44 countries/regions. Some spices are relatively ubiquitous, and hence are not informative about geographical variation, such as chili (37 out of 44 cuisines, with 26 using it as a signature spice), garlic (25 uses as a signature spice) or ginger (19), while others are rare or geographically quite restricted. For instance, turmeric is a signature spice of Indian cuisines and also of South Africa. During the British Empire, many Indians migrated to South Africa, and Indian traditions blended in with African and European ones, which makes South Africa an interesting place to visit and feast (as I can affirm first hand).

A global network based on the spices used. Colorization refers to the continental regions used by Farrimond (chapter World of Spice, p. 20ff).

Not unexpectedly, the network shown here reflects geographic vicinity as well as rather ancient historical connections. For example, most aspects of European civilization have their origin in the Middle East, and spices reached medieval Europe via Arab sea-traders and the Silk Route; but there was also influence from elsewhere during the various colonial epochs.

The Latin American cuisines are spice-wise most similar to those of Spain and Portugal within their regional groups, while Canada and the U.S.A. mix this tradition with that of other European countries such as Italy and France. Great Britain is distinct because His/Her majesties ruled many lands with a great variety of food and spices. In contrast to many other aspects of colonialism, the influence hence goes both ways.

The most unique spice cuisines are those of Indonesia, the home of many spices (and the reason why both the Portuguese and the Spanish set sail), and (tropical) western and central Africa. That the Horn of Africa groups within the South Asian group is not surprising, as it was for a very long time the sea-trade spice hub between Asia and Europe.

There is also a higher diversity seen in the Southeast Asian countries and regions compared to the East Asian and South Asian ones.

A bit of an oddball is the placement of the Caribbean cuisine, and especially the Creole kitchen, which is known for its spice mixing — in Farrimond's three-concepts characterization: "Adventurous | Bold | Spicy".

Conclusion

So, in case you want to spice up the coming holiday and festive season, Farrimond's book is an invaluable source for applied science, which has a simple primary use: filling the mouth with taste while filling the belly with ballast.

Monday, December 2, 2019

Trees informing networks explaining trees


Working at the coalface of evolution, one phenomenon always intrigued me: How does the signal in the data build up a tree? Especially since we have to assume some sort of reticulation happened at some point — evolution is rarely a strictly dichotomous process, which we would model by a tree. In earlier posts, we have covered the difference between clades and grades in a tree, and Hennig's concepts of monophyly and paraphyly. Clearly, in the light of actual evolutionary processes, the cladistic approach synonymizing clades with monophyly is a simplification at best, and naive at worst.

In this post, I will discuss a real-world example using molecular data put together for a (probably) recently evolved plant genus, Drosanthemum, as discussed in this paper:

Liede-Schumann S, Grimm GW, Nürk NM, Potts AJ, Meve U, Hartmann HEK. 2019. Phylogenetic relationships in the southern African genus Drosanthemum (Ruschioideae, Aizoaceae). bioRxiv preprint.
These days, next-generation sequencing (NGS) and phylogenomic data may provide what you need to resolve everything from the beginning of life to the very tips of the Tree of Life (ToL; which, at the root and tips is probably more of a Network of Life). However, this has two shortcomings: You need a lot of money, and a lot of DNA. Given the number of modern-day plant species, including not a few that are in flux, it's pretty safe to assume that I won't live long enough to have all of the ToL leaves resolved by NGS data.

On the other hand, there are countless scientists with taxonomic expertise struggling for funding; and classic Sanger sequencing has become very cheap. Thus, oligogene (fossil) data sets will remain in use for quite some time. We do, however, have to deal with their shortcomings, such as not giving us a fully resolved phylogenetic tree, but instead providing partially diffuse signals. Nevertheless, we can get a lot of insights by combining traditional tree and network inferences.

The tree

To test systematic concepts and construct a species phylogeny for Drosanthemum, we tapped into four non-coding plastid gene regions, which, following earlier research, were the most divergent within the larger group:
  • the close intergenic plastid spacers trnK-rps16 and rps16-trnQ,
  • the trnS-trnG intergenic spacer, and
  • the rpl16 intron.
Following popular demand, we also sequenced the nuclear-encoded ITS region used in earlier phylogenetic studies (despite being quite useless for tree inference in the larger lineage, being much too conserved).

We did a full analysis, comparing single-gene tree inference and bootstrapping with combined analysis, with or without data partitioning. We concluded that the combined plastid tree (not including ITS data) does provide a good phylogenetic-systematic framework for the genus.

Our Drosanthemum tree, rooted using the most probable rooting scenario (following an outgroup-EPA analysis; see Liede-Schumann et al., fig. 4 and Supplement file S4). Major clades and subclades are annotated, on the left the morphological subgenera associated with each major clade.

Why consider it to be good? Well, it fits amazingly well with the morphology-based systematics. Evolution doesn't follow a straight path: (i) reticulation will happen at least during the formation of species; (ii) there will always be some incomplete lineage sorting of geno- and phenotypes; and (iii) morphologies have substantially different evolutionary constraints from noncoding plastid gene regions. If it ends up in a good match, it won't be by coincidence but more likely because our inferred phylogeny captures the true tree (coalescent) well.

Regarding monophyly, the tree is hence well suited to construct a framework: most of our major clades are linked to a specific, clade-unique morphology. (Don't hope to find too many autapomorphies; in plants, common origin typically manifests in diagnostic character suites rather than individual aut-/synapomorphies.) The exception is subgenus Drosanthemum, which is apparently diphyletic (this term is meant literally, not just because its members form two molecular clades). Furthermore, although not visible on the tree, Clade IV / Vespertina may well have evolved from Clade III / Drosanthemum. The III-IV grade represents a monophyletic group, and Clade III may be paraphyletic.

At this point, you may be thinking: Guido has lost it; but bear with me.

Ancestral and derived haplotypes

The point is, I know our data. When we look at the sequence patterns in the gene regions, we can readily see that Clade III (subgenus Drosanthemum) and IV (subgenus Vespertina) likely had a (genetic) common ancestor different from that of the more evolved clades I (subgenus Drosanthemum) and II, hence the high support and increased branch length of the corresponding branches; I + II and III + IV could well be reciprocally monophyletic. Realizing that clades III and IV are part of the same evolutionary lineage, we can take a closer look at them using, for example, Median networks. Parsimony is generally inferior to probabilistic methods when dealing with (mostly) neutral but stochastic mutation patterns. However, since we are very close to the coalface of evolution, we are dealing with rather minute changes, too minute for ML to make a well-informed call (also, there is no ML counterpart to haplotype networks).

To avoid missing something in our data (or overweighting indels and linked mutation patterns), we do not just use the nucleotide sequences but tabulate and code each mutational pattern: simple ones like single-nucleotide polymorphisms (SNPs) and duplications, but also complex ones, such as sequence patterns in length-polymorphic, sequentially diverse parts. The next figure shows an example:

"Export" refers to the unaltered export of (parsimony-informative) variable sites of the aligned nucleotide matrix; "recoded" the correction for excess mutations (when treating gaps as 5th base) in order to ensure the coding matches the number of steps in the theory of Median networks (see Liede-Schumann et al., supplement file S2).

Now, we are operating above the species level, which is outside the comfort zone of Median networks, which were originally designed for investigating within-species population structure. We are dealing with signals from phylogenetic sorting (eg. evolution of complex sequence patterns, see example above) mixed with (partly) convergent/homoiologous patterns (eg. duplications, which are very rarely lineage-specific in plant plastomes). The resulting Median networks are quite complex, as shown next.

The output from NETWORK for Clade III + IV and the trnG-trnS intergenic spacer. In total, the matrix codes for 14 mutational patterns (10 SNPs, 3 indels, 1 length-polymorphic region involving a SNP; see sheet Clade3&4.trnGS in f_Haplotyping.xlsx in folder 1_main_data_and_results of the online supporting archive @ DataDryad); the red edge numbers indicate which pattern changed, the bubble colors refer to the group: cyan, Subgrade IIIa; blue, Subclade IIIb; yellow, Clade IV (note: Clade III and IV differ from other major clades by uniquely shared patterns).

One option would be to weight the characters. However, it is pretty difficult to come up with a weighting scheme given that we deal with very different mutation patterns, which include everything from simple transitions to reorganization of length-polymorphic regions. When looking at the SNPs, AG transitions appear to be more probable than AC transversions, but some AG transitions are highly diagnostic for clades, while some AC transversions seem random. Instead of getting lost in weighting (and self-enforcing bias), we compare them across the four gene regions by collapsing haplotype groups and their (diffuse) subnets, as shown next.
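For readers less familiar with the terminology, the distinction between transitions and transversions can be written down as a short Python sketch. It only classifies a single base substitution; it is not a weighting scheme, and the function name is invented for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(base_a, base_b):
    """Classify a substitution between two DNA bases.

    Transitions stay within the purines (A, G) or within the
    pyrimidines (C, T); transversions switch between the two classes.
    """
    pair = {base_a.upper(), base_b.upper()}
    if not pair <= PURINES | PYRIMIDINES:
        raise ValueError("expected two bases out of A, C, G, T")
    if len(pair) == 1:
        raise ValueError("identical bases are not a substitution")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


if __name__ == "__main__":
    print(classify_snp("A", "G"))
    print(classify_snp("A", "C"))
```

As noted above, such a classification alone does not tell us how diagnostic a given pattern is, which is one reason why we did not derive character weights from it.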

'Condensed' Median networks for the Clade III + IV lineage; parts of the graphs collecting sequentially similar members of one group are replaced by bubbles (cf. Liede-Schumann et al., fig. 6).

Note that, in contrast to traditional haplotype networks, the bubbles in the figure don't represent the number of accessions of the haplotype, but instead the sequential diversity of the collapsed haplotype group. From the graphs superimposed on the background of the combined tree, it is straightforward to see which haplotype may be ancestral within a lineage, which haplotype is derived, and how they relate to each other.

Paraphyletic "clades"

We now know how the haplotypes of each covered gene region are related to each other, and which species have substantially derived sequences, and which species have putatively ancestral sequences. Using the networks and by comparison with the sequence patterns in the sister group(s), we could even reconstruct a hypothetical haplotype of the common ancestor. But just by comparing the median networks for each gene region with the corresponding subtrees in the combined tree we can (try to) interpret our clades and grades as monophyletic or paraphyletic.

Fig. 5 from Liede-Schumann et al. (2019) showing the 'condensed' Median networks for Clade I/ Drosanthemum (s.str.)

Members of Subclade Ib, the subtree with the worst support within Clade I / Drosanthemum (s.str.), may represent the survivors of the initial radiation, and hence are a paraphyletic group. They are resolved as a clade in the tree because of the signal from the trnS-trnG region producing a clear split between the three groups. However, this is also the most-conserved gene region, and when compared with the mutational patterns in the other clades (especially the sister clade, Clade II), it would not be far fetched to conclude that the trnS-trnG haplotype B is the original haplotype of the entire lineage.

The distinctive feature of Subclade Ib in the trnS-trnG region is a complex duplication pattern not found in the otherwise genetically more coherent subclades Ia and Ic, as shown next.


This looks like a simple evolutionary sequence, with Clade Ia and Ic having retained the original pattern, with the complex pattern being a derived, clade-unique feature of Clade Ib (an autapomorphy for the corresponding monophyletic group).

But when we add the patterns of Clade II, the reciprocally monophyletic sister clade, it's not that simple anymore, as shown next.

Why one should be careful with gap-coding: even complex plastid duplication patterns evolve in parallel (or convergently). No matter whether X-Y or X-Y'-X-Y is the ancestral pattern, we have one/two convergent mutations in parts of Clades I and II; either duplication of X and insertion of Y' or (subsequent) deletion of X-Y'.

Realizing that a few clades in our tree may be paraphyletic gives us a new edge on our data and phylogenetic framework that can be further elaborated, because such clades directly point towards a first, quick radiation that predates the formation of the monophyletic molecular clades (this is only a tautology in cladistics, not in phylogenetics). The members of paraphyletic molecular clades are either genetically distinct (long terminal branches, typically low and/or ambiguous support for the clade root) or little-evolved survivors (short root and terminal branches, but relatively high root branch support).

Furthermore, we can now see why some species act a bit roguish, are difficult to resolve, or inflict internal data conflict.

'Condensed' Median networks for the sister clades V and VI (modified after Liede-Schumann et al., fig. 7)

Drosanthemum gracillimum is the only species in our tree that doesn't resolve as a member of one of the two main (definitely monophyletic) subclades within Clade V: Subclade Va / Speciosa and Vb / Ossicula, genetically close but morphologically distinct sisters. We had no material of this species for our analysis, and instead used available GenBank data (out of curiosity). Its trnS-trnG and rps16-trnQ haplotypes are unique but rather ancestral within Clade V, and hence the tree cannot resolve where to place it.

Another example of how the ancestry of sequences contributes to topological conflict or ambiguity in intrageneric phylogenies, which also illustrates the limitation of our approach, is one individual of D. striatum. It's the only member of Subclade Vb / Ossicula with a Subclade Va / Speciosa-type rps16-trnQ. With respect to my last blog post, the simplest explanation is that it just retained a less derived rps16-trnQ haplotype. However, this spacer includes a highly divergent, genotaxonomically valuable region that we had to exclude from all analyses (but included in our spreadsheet haplotyping.xlsx). In this region, it shows a very unique, complex, apparently derived pattern shared with a few other members of the sister Clade Va. Maybe there was some reticulation and plastome-recombination at work here (contamination can be ruled out, as the material was processed twice).

Just try it with your own data

We cannot all afford perfect, often seemingly trivial, NGS / phylogenomic data. Combined trees can inform us about groups sharing a likely (mostly inclusive) common origin, such as molecular clades with fair support and distinctly long root branches and/or shared unique morphologies (ie. "monophyla" in a strict Hennigian sense). Clade-restricted haplotype networks can help us to understand the molecular evolution in these groups, free from the assumption of dichotomy and time equality.

By definition, all tip sequences represent the same time (today) in a tree, so they can only be sisters not ancestors and descendants. In reality, when we approach the coalface, we have some sequence patterns or actual sequences that are ancestral to others, because the species carrying them didn't evolve as much and as fast as their sibling(s). At some time, different parts evolved at different speeds within one lineage (see the examples above).

The networks hence fill a gap that the tree can't possibly resolve. They allow us to understand why the tree may make more sense in certain parts than in others; and where it is probably 100% reliable and where we may want to have a closer look. Furthermore, only the networks can tell us if there is some real conflict in the data: different gene regions reflecting different histories.



Epilogue

As a careful reader, you may have noticed that we skipped the ITS sequence entirely. The reason is shown in the following two graphs.

The first one shows a statistical parsimony network of all of our ITS data compiled for the species included in the plastid combined tree.

A statistical parsimony network based on the ITS data. Colors give the main cp clades (see Liede-Schumann et al., Supplement file S3)

The network approaches a spider-web, as shown above. The reason for this is that there is only a limited number of ITS positions where Drosanthemum fixes mutations (notably nearly exclusively SNPs, with no length-polymorphism). So, the genus is likely a young one, much like its sister clade the Ruschideae, which also mostly shows randomly distributed ITS mutation patterns.

Inferring an ITS tree is possible but useless, in that the data don't provide a clear signal. Furthermore, when we map the observed mutational patterns onto the plastid tree, we see a lot of messing up towards the leaves; but, in principle, it's all just sorting along the shared coalescent. We can identify those ITS mutations that are (plastid-)clade specific and lineage-diagnostic, including ITS-"synapomorphies" for plastid-inferred clades that are likely monophyletic (being correlated to a distinct morphology and supported by derived, uniquely shared sequence patterns).

ITS genotypes mapped on the plastid tree, pointing to a largely congruent history with incomplete (ITS) lineage sorting. CU = clade-unique sequence pattern; Sh = shared, not unique, sequence pattern.
This opens the door to quickly screening for individuals / species that don't fall in line with the coalescent but are the product of (deep) reticulation (either using bulk sequencing and NGS genotyping or traditional cheap methods such as PCR-RFLP).

Monday, November 25, 2019

Typology of semantic promiscuity (Open problems in computational diversity linguistics 10)


The final problem in my list of ten open problems in computational diversity linguistics touches upon a phenomenon that most linguists, let alone ordinary people, might not even have heard about. As a result, the phenomenon does not have a real name in linguistics, and this makes it even more difficult to talk about it.

Semantic promiscuity, in brief, refers to the empirical observations that: (1) the words in the lexicon of human languages are often built from already existing words or word parts, and that (2) the words that are frequently "recycled", ie. the words that are promiscuous (similar to the sense of promiscuous domains in biology, see Basu et al. 2008) denote very common concepts.

If this turns out to be true, that the meaning of words decides, at least to some degree, their success in giving rise to new words, then it should be possible to derive a typology of promiscuous concepts, or some kind of cross-linguistic ranking of those concepts that turn out to be the most successful in the long run.

Our problem can (at least for the moment, since we still have problems of completely grasping the phenomenon, as can be seen from the next section) thus be stated as follows:
Assuming a certain pre-selection of concepts that we assume are expressed by as many languages as possible, can we find out which of the concepts in the sample give rise to the largest amount of new words?
I am not completely happy with this problem definition, since a concept does not actually give rise to a new word, but instead a concept is expressed by a word that is then used to form a new word; but I have decided to leave the problem in this form for reasons of simplicity.

Background on semantic promiscuity

The basic idea of semantic promiscuity goes back to my time as a PhD student in Düsseldorf. My supervisor then was Hans Geisler, a Romance linguist, with a special interest in sound change and sensory-motor concepts. Sensory-motor concepts are concepts that are thought to be grounded in sensory-motor processes. Concretely, scholars assume that many abstract concepts expressed by many, if not all, languages of the world originate in concepts that denote concrete bodily experience (Ströbel 2016).

Thus, we can "grasp an idea", we can "face consequences", or we can "hold a thought". In such cases we express something that is abstract in nature, but expressed by means of verbs that are originally concrete in their meaning and relate to our bodily experience ("to grasp", "to face", "to hold").

When I later met Hans Geisler in 2016 in Düsseldorf, he presented me with an article that he had recently submitted for an anthology that appeared two years later (Geisler 2018). This article, titled "Sind unsere Wörter von Sinnen?" (an approximate translation of this pun would be: "Are our words out of the sense?"), investigates concepts such as "to stand" and "to fall" and their importance for the lexicon of the German language. Geisler claims that it is due to the importance of the sensory-motor concepts of "standing" and "falling" that words built from stehen ("to stand") and fallen ("to fall") are among the most productive (or promiscuous) ones in the German lexicon.

Words built from fallen and stehen in German.

I found (and still find) this idea fascinating, since it may explain (if it turns out to hold true for a larger sample of the world's languages) the structure of a language's lexicon as a consequence of universal experiences shared among all humans.

Geisler did not have a term for the phenomenon at hand. However, I was working at the same time in a lab with biologists (led by Eric Bapteste and Philippe Lopez), who introduced me to the idea of domain promiscuity in biology, during a longer discussion about similar processes between linguistics and biology. In our paper reporting our discussion of these similarities, we proposed that the comparison of word formation processes in linguistics and protein assembly processes in biology could provide fruitful analogies for future investigations (List et al. 2016: 8ff). But we did not (yet) use the term promiscuity in the linguistic domain.

Geisler's idea, that the success of words to be used to form other words in the lexicon of a language may depend on the semantics of the original terms, changed my view on the topic completely, and I began to search for a good term to denote the phenomenon. I did not want to use the term "promiscuity", because of its original meaning.

Linguistics has the term "productive", which is used for particular morphemes that can be easily attached to existing words to form new ones (eg. by turning a verb into a noun, or by turning a noun into an adjective, etc.). However, "productivity" starts from the form and ignores the concepts, while concepts play a crucial role for Geisler's phenomenon.

At some point, I gave up and began to use the term "promiscuity" for lack of a better term, first in a blogpost discussing Geisler's paper (List 2018, available here). Later in 2018, Nathanael E. Schweikhard, a doctoral student in our research group, developed the idea further, using the term semantic promiscuity (Schweikhard 2018, available here), which constitutes my tenth and last open problem in computational diversity linguistics (at least for 2019).

In the discussions with Schweikhard, which were very fruitful, we also learned that the idea of expansion and attraction of concepts comes close to the idea of semantic promiscuity. This references Blank's (2003) idea that some concepts tend to frequently attract new words to express them (think of concepts underlying taboo, for simplicity), while other concepts tend to give rise to many new words ("head" is a good example, if you think of all the meanings it can have in different concepts). However, since Blank is interested in the form, while we are interested in the concept, I agree with Schweikhard in sticking with "promiscuity" instead of adopting Blank's term.

Why it is hard to establish a typology of semantic promiscuity

Assuming that certain cross-linguistic tendencies can be found that would confirm the hypothesis of semantic promiscuity, why is it hard to do so? I see three major obstacles here: one related to the data, one related to annotation, and one related to the comparison.

The data problem is a problem of sparseness. For most of the languages for which we have lexical data, the available data are so sparse that we often even have problems to find a list of 200 or more words. I know this well, since we were struggling hard in a phylogenetic study of Sino-Tibetan languages, where we ended up discarding many interesting languages because the sources did not provide enough lexical data to fill in our wordlists (Sagart et al. 2019).

In order to investigate semantic promiscuity, we need substantially more data than we need for phylogenetic studies, since we ultimately want to investigate the structure of word families inside a given language and compare these structures cross-linguistically. It is not clear where to start here, although it is clear that we cannot be exhaustive in linguistics, as biologists can be when sequencing a whole gene or genome. I think that one would need, at least, 1,000 words per language in order to be able to start looking into semantic promiscuity.

The second problem points to the annotation and the analysis that would be needed in order to investigate the phenomenon sufficiently. What Hans Geisler used in his study were larger dictionaries of German that are digitally available and readily annotated. However, for a cross-linguistic study of semantic promiscuity, all of the annotation work of word families would still have to be done from scratch.

Unfortunately, we have also seen that the algorithms for automated morpheme detection that have been proposed to date usually fail greatly when it comes to detecting morpheme boundaries. In addition, word families often have a complex structure, and parts of the words shared across other words are not necessarily identical, due to numerous processes involved in word formation. So, a simple algorithm that splits the words into potential morphemes would not be enough. Another algorithm that identifies language-internal cognate morphemes would be needed; and here, again, we are still waiting for convincing approaches to be developed by computational linguists.

The third problem, the comparison itself, reflects the difficulty of comparing word-family data across different languages. Since every language has its own structure of words and a very individual set of word families, it is not trivial to decide how one should compare annotated word-family data across multiple languages. While one could try to compare words with the same meaning in different languages, it is quite possible that one would miss many potentially interesting patterns, especially since we do not yet know how (and if at all) the idea of promiscuity features across languages.

Traditional approaches

Apart from the work by Geisler (2018), mentioned above, we find some interesting studies on word formation and compounding in which scholars have addressed some similar questions. Thus, Steve Pepper has submitted (as far as I know) his PhD thesis on The Typology and Semantics of Binominal Lexemes (Pepper 2019, draft here), where he looks into the structure of words that are frequently constructed from two nominal parts, such as "windmill", "railway", etc. In her master's thesis titled Body Part Metaphors as a Window to Cognition, Annika Tjuka investigates how terms for objects and landscapes are created with the help of terms originally denoting body parts (such as the "foot" of the table, etc., see Tjuka 2019).

Both of these studies touch on the idea of semantic promiscuity, since they try to look at the lexicon from a concept-based perspective, as opposed to a pure form-based one, and they also try to look at patterns that might emerge when looking at more than one language alone. However, given their respective focus (Pepper looking at a specific type of compounds, Tjuka looking at body-part metaphors), they do not address the typology of semantic promiscuity in general, although they provide very interesting evidence showing that lexical semantics plays an important role in word formation.

Computational approaches

The only study that I know of that comes close to studying the idea of semantic promiscuity computationally is by Keller and Schultz (2014). In this study, the authors analyze the distribution of morpheme family sizes in English and German across a time span of 200 years. Using Birth-Death-Innovation Models (explained in more detail in the paper), they try to measure the dynamics underlying the process of word formation. Their general finding (at least for the English and German data analyzed) is that new words tend to be built from those word forms that appear less frequently across other words in a given language. If this holds true, it would mean that speakers tend to avoid words that are already too promiscuous as a basis to coin new words for a given language. What the study definitely shows is that any study of semantic promiscuity has to look at competing explanations.

Initial ideas for improvement

If we accept that the corpus perspective cannot help us to dive deep into the semantics, since semantics cannot be automatically inferred from corpora (at least not yet to a degree that would allow us to compare them afterwards across a sufficient sample of languages), then we need to address the question in smaller steps.

For the time being, the idea that a larger amount of the words in the lexicon of human languages are recycled from words that originally express specific meanings remains a hypothesis (whatever those meanings may be, since the idea of sensory motor concepts is just one suggestion for a potential candidate for a semantic field). There are enough alternative explanations that could drive the formation of new words, be it the frequency of recycled morphemes in a lexicon, as proposed by Keller and Schultz, or other factors that we still do not know, or that I do not know, because I have not yet read the relevant literature.

As long as the idea remains a hypothesis, we should first try to find ways to test it. A starting point could consist of the collection of larger wordlists for the languages of the world (eg. more than 300 words per language) which are already morphologically segmented. With such a corpus, one could easily create word families, by checking which morphemes are re-used across words. By comparing the concepts that share a given morpheme, one could try and check to which degree, for example, sensory-motor concepts form clusters with other concepts.
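
The grouping step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy wordlist in which each entry pairs a concept with a morpheme-segmented form; the data, the "+" boundary marker, and the function names are all hypothetical and do not refer to any existing dataset or library.

```python
from collections import defaultdict

# Hypothetical toy wordlist: (concept, morpheme-segmented form).
# The German forms are for illustration only; "+" marks a morpheme boundary.
WORDLIST = [
    ("STAND", "steh+en"),
    ("UNDERSTAND", "ver+steh+en"),
    ("ARISE", "ent+steh+en"),
    ("FALL", "fall+en"),
    ("PLEASE", "ge+fall+en"),
    ("TREE", "baum"),
]


def word_families(wordlist, sep="+"):
    """Map each morpheme to the set of concepts whose words contain it."""
    families = defaultdict(set)
    for concept, form in wordlist:
        for morpheme in form.split(sep):
            families[morpheme].add(concept)
    return families


def rank_by_family_size(families, min_size=2):
    """Return (morpheme, sorted concepts) pairs, largest family first."""
    ranked = [
        (morpheme, sorted(concepts))
        for morpheme, concepts in families.items()
        if len(concepts) >= min_size
    ]
    # Sort by descending family size, then alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    for morpheme, concepts in rank_by_family_size(word_families(WORDLIST)):
        print(morpheme, len(concepts), concepts)
```

In this toy example the infinitive suffix "en" ends up with the largest family, ahead of "steh" and "fall". This shows that re-use counts alone are not enough: grammatical morphemes would have to be separated from lexical ones, and exact string matching would have to be replaced by some identification of language-internal cognate morphemes, before the concepts in a family could be compared in a meaningful way.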

All in all, my idea is far from being concrete; but what seems clear is that we will need to work on larger datasets that offer word lists for a sufficiently large sample of languages in morpheme-segmented form.

Outlook

Whenever I try to think about the problem of semantic promiscuity, asking myself whether it is a real phenomenon or just a myth, and whether a typology in the form of a world-wide ranking is possible after all, I feel that my brain is starting to itch. It feels like there is something that I cannot really grasp (yet, hopefully), and something I haven't really understood.

If the readers of this post feel the same way afterwards, then there are two possibilities as to why you might feel as I do: you could suffer from the same problem that I have whenever I try to get my head around semantics, or you could just have fallen victim to a largely incomprehensible blog post. I hope, of course, that none of you will suffer from anything; and I will be glad for any additional ideas that might help us to understand this matter more properly.

References

Basu, Malay Kumar and Carmel, Liran and Rogozin, Igor B. and Koonin, Eugene V. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Research 18: 449-461.

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen:Niemeyer.

Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter: Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen:Stauffenburg. 131-142.

Keller, Daniela Barbara and Schultz, Jörg (2014) Word formation is aware of morpheme family size. PLoS ONE 9.4: e93978.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis (2018) Von Wortfamilien und promiskuitiven Wörtern [Of word families and promiscuous words]. Von Wörtern und Bäumen 2.10. URL: https://wub.hypotheses.org/464.

Pepper, Steve (2019) The Typology and Semantics of Binominal Lexemes: Noun-noun Compounds and their Functional Equivalents. University of Oslo: Oslo.

Sagart, Laurent and Jacques, Guillaume and Lai, Yunfan and Ryder, Robin and Thouzeau, Valentin and Greenhill, Simon J. and List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America 116: 10317-10322. DOI: https://doi.org/10.1073/pnas.1817972116

Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11. URL: https://calc.hypotheses.org/1169.

Ströbel, Liane (2016) Introduction: Sensory-motor concepts: at the crossroad between language & cognition. In: Ströbel, Liane (ed.) Sensory-motor Concepts: at the Crossroad Between Language & Cognition. Düsseldorf University Press, pp. 11-16.

Tjuka, Annika (2019) Body Part Metaphors as a Window to Cognition: a Cross-linguistic Study of Object and Landscape Terms. Humboldt Universität zu Berlin: Berlin. DOI: https://doi.org/10.17613/j95n-c998.

Monday, November 18, 2019

Why the emperor has no clothes on – conflict or not?


In the final part of this series dissecting angiosperm gene trees (see: Why the emperor has no clothes on – part 1 and part 2), we will enter muddy ground. Using our example data set, we will try to make a call on whether or not there has been any (detectable) major reticulation in the deep branches of the angiosperm tree.

What triggers conflicting gene histories

Before we look at the data, it may be a good idea to set the scene using simple theoretical examples of what we may look at.


Our two genes, represented by circle and pentagon (could be multigene regions or entire genomes), both follow the same evolutionary history (the gray background tree). In the left lineage, we have a bit of incomplete lineage sorting, because the ancestor was polymorphic for the circles. In the right lineage, we have different fixation rates: the circles evolve faster than the pentagons. With molecular data we usually don't have the ancestors, which would make any inference straightforward; we only have the tips.


Because of incomplete lineage sorting and different fixation rates in the left and right lineages, the circle gene tree gets the phylogeny pretty wrong. The pentagon gene tree comes closer to the reality – we only infer two sister clades where there is a grade. (With real-world data, the branch support values could give one a clue that three of the inferred blue clades have a higher quality than the fourth supporting a pseudo-monophylum.) The circle and pentagon trees are largely incongruent despite sharing the same history; and we may infer a pseudo-hybrid (the first diverging lineage within the right clade).

Combining these data may allow us to infer a tree that fits the real tree much better. In the left clade the trivial pentagon signal can out-compete the misleading circle signal, and avoid the misplacement of the first diverging lineage of the right clade. In the right clade, the circle signal can help to correct for the pseudo-clade.

Now we can add a late reticulation, and re-infer the gene trees.


Because of the reticulation (the circles are biparentally inherited, the pentagons maternally), the gene trees are more congruent than in the example above (circle and pentagon get it a bit wrong in the left clade), except for the hybrid and its pseudo-hybrid parent. The gene conflict in placing the lineage cross (part of the left clade in the circle-based tree, part of the right clade in the pentagon tree) well reflects its hybrid origin.

Different histories of nuclear genes vs. plastid / mitochondrial genes?

The easiest way to catch reticulation is to compare trees based on plastid / mitochondrial data (maternally inherited) vs. nuclear data (biparentally inherited). If reticulation happened in the past, we can expect that the maternal and biparental genealogy diverge from each other (see part 2).

Strict Consensus network of the plastid (data from 3 protein-coding genes +1 partly coding gene region), mitochondrial (3 protein-coding genes) and nuclear trees (2 nrDNAs). The bold lines represent generally accepted phylogenetic splits (APG IV tree, see also Steven's comprehensive Angiosperm Phylogeny Website).

This network is much more box-like compared to what one would have expected based on the combined tree that can be inferred from the data (Part 1). But are we looking at largely decoupled histories?
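
The boxes in such a network come from pairs of splits that cannot sit on the same tree. As a minimal sketch of that idea (not the procedure used to build the network above), the following Python snippet checks pairwise split compatibility: two splits of the same taxon set are compatible if at least one of the four intersections of their sides is empty. The taxon labels and the example splits are hypothetical.

```python
from itertools import combinations


def is_compatible(split_a, split_b, taxa):
    """Check whether two splits (each given by one side) fit on one tree.

    Two splits A1|A2 and B1|B2 of the same taxon set are compatible if at
    least one of the four pairwise intersections of their sides is empty.
    """
    a1, b1 = frozenset(split_a), frozenset(split_b)
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def conflicting_pairs(splits, taxa):
    """Return all pairs of splits that are mutually incompatible."""
    taxa = frozenset(taxa)
    return [
        (a, b)
        for a, b in combinations(splits, 2)
        if not is_compatible(a, b, taxa)
    ]


if __name__ == "__main__":
    # Hypothetical example: splits pooled from two gene trees on five taxa.
    taxa = {"A", "B", "C", "D", "E"}
    splits = [frozenset("AB"), frozenset("ABC"), frozenset("AC")]
    for a, b in conflicting_pairs(splits, taxa):
        print(sorted(a), "conflicts with", sorted(b))
```

Here only AB|CDE and AC|BDE conflict, and this pair would be drawn as a box, while ABC|DE is compatible with both. Such a check only locates the conflict; it says nothing about whether the cause is reticulation, incomplete lineage sorting, or a branching artifact.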

This mess is hardly surprising. The combined tree is constrained by the plastid tree, specifically by the signal from the matK gene (Part 1), while the remaining plastid genes (from a different part of the plastome) fall into line. The mitochondrial tree combines genes that on their own inform poorly resolved trees riddled with branching artifacts (Part 2). The nuclear tree, on the other hand, combines the most and least divergent nuclear genes widely known. Because of this, they show topological conflict between each other.

18S-25S rDNA tanglegram. The branch numbers show each gene's bootstrap support (BS) deviating from the combined BS support for the respective branch (indicated by line thickness): green, increased BS support when combining both genes, red, decreased BS support.

However, they are part of the same multi-copy coding unit (the 35S nuclear rDNA) that has very particular evolutionary constraints, such as structural constraints, affected by completeness of concerted evolution and intra-genomic recombination. Polyploid grasses, for example, can have up to four different collections of 35S rDNA, reflecting four different evolutionary origins, being part of the A, B, C or D genomes. You end up with what is called a multi-labelled tree: the A, B, C and D-genome variants of the same taxon pop up (consistently) in different parts of the tree, and you can have recombinants. If we look into the 18S vs. 25S data, however, we find no consistent sequence patterns supporting the topological conflicts between the two trees, or examples for recombination.

As in our theoretical example, each of the trees has certain strengths, and its own set of weaknesses, some of which can be overcome when combining the data (eg. branches with increased combined support in the 18S-25S tanglegram).

Bootstrap (BS) Consensus networks for the combined cp (upper left), mt (upper right), nc (lower left) and full data (lower right). Branches without numbers: BS = 100. Splits conflicting with those present based on the full data highlighted by red font (all with BS < 100).

In contrast to the boxy network appearance and the substantial conflict between the single gene trees (Part 2), most of the relationships (eg. the major clade roots but also many intra-clade relationships) receive high or unambiguous support in all three trees*. Despite the disparate signals, the data seem to converge on a coalescent. If the genomes had different histories, they wouldn't converge so easily. Also, we would expect to see more consistent conflict between the "genome" trees than between the single-gene trees of the same genome, since the nuclear rDNA is biparentally inherited while the plastid and mitochondrial DNAs are passed on via the mothers only. Many of the angiosperms in our data reproduce sexually.

So far, no conclusive evidence for reticulation

Mere gene-tree incongruence is a poor basis for conclusions about decoupled gene histories. We need to dig for sequence-based evidence for reticulation and recombination. For instance, we might find a clearly derived sequence pattern exclusive to the right clade in a member of the left clade.

The importance of rare genomic changes when interpreting conflicting gene trees. The left and right clades obtained a unique and conserved gene or sequence feature before they diversified. The hybrid is the only taxon showing both.
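
A screen for this kind of evidence could look roughly like the following Python sketch. It assumes that the clade-diagnostic features have already been identified and that their presence has been scored per taxon; all taxon names, feature names and the function name are hypothetical placeholders.

```python
# Hypothetical scoring: taxon -> set of rare genomic features observed in it.
OBSERVED = {
    "left_1": {"indel_L"},
    "left_2": {"indel_L"},
    "right_1": {"inversion_R"},
    "right_2": {"inversion_R"},
    "cross": {"indel_L", "inversion_R"},
}

# Hypothetical features assumed to be diagnostic for each clade.
DIAGNOSTIC = {
    "left": {"indel_L"},
    "right": {"inversion_R"},
}


def screen_mixed(observed, diagnostic):
    """Return taxa that show diagnostic features of more than one clade."""
    flagged = {}
    for taxon, features in observed.items():
        # Clades for which this taxon carries at least one diagnostic feature.
        hits = sorted(
            clade for clade, marks in diagnostic.items() if features & marks
        )
        if len(hits) > 1:
            flagged[taxon] = hits
    return flagged


if __name__ == "__main__":
    print(screen_mixed(OBSERVED, DIAGNOSTIC))
```

A flagged taxon is only a candidate for a hybrid origin. Whether the shared feature is truly derived and conserved, rather than a retained ancestral state or a laboratory artifact, still has to be checked in the sequences themselves.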

This is where the Walker et al. (2019) and Sullivan et al. (2017) studies seem to fall short — they don't give any example, gene, gene region, or recognizable lineage-diagnostic sequence pattern that could be used as direct evidence for decoupled gene histories and/or reticulation.

For my data set, I cannot pinpoint such evidence either. All high(er)-supported conflict seems to be related to lineage sorting and data/signal issues, the inability of certain gene regions to resolve relationships in parts of the angiosperm tree, or falling prey to (more local than global) long-branch attraction. When looking at the sequences, there's no reason to question, for example, the assumed monophyly of the main lineages and orders, in spite of the topological conflict we face when analyzing these data. If there was reticulation between the ancestors of angiosperm lineages, or later on between the already formed lineages, it left no obvious imprint in the data.

Thus, after having investigated aspects of the seeming conflict by going back to the data (checking highly divergent and conserved sequence patterns, tabulating the partly competing BS support of the single genes, and minus-one gene analyses), I did not hesitate to combine these data and use a Bayesian total-evidence dating procedure. (We never published the results because mid-Cretaceous angiosperm fossils have much too derived morphologies for total evidence dating; when left unconstrained, MrBayes optimized towards an angiosperm root age of 4.5 Ba, which was the in-built maximum).

A total-evidence Bayes tree based on the full data set. Stars indicate the position of fossil taxa (mid-Cretaceous). Note their relatively long terminal branches, a situation total-evidence dating cannot handle. The matrix can be found at figshare: A basic total evidence matrix for basal angiosperms — combining Soltis et al (2011) with Doyle & Endress (2010).

An example for actual reticulation resulting in gene tree conflict

Working at the coal-face of evolution, I have encountered examples of apparently real reticulation (when analysing biparentally inherited nuclear data). The most compelling case was probably the ancient relictual genotypes and pseudogenes that point towards ancient reticulation in the widely known plane trees, Platanus. Platanus subgenus Platanus (which includes all but one species, P. kerrii, a relict of a distant lineage growing in tropical-hot subtropical lowland forests of North Vietnam) falls into two main lineages characterized by unique sets of genotypes, the ANA clade (Atlantic-facing North and Mesoamerica) and the PNA-E clade (NW. Mexico, California and Mediterranean).

Haplo/-genotypic composition of Platanus (Grimm & Denk, Taxon, 2010, ES2 [PDF]). Platanus kerrii represents the sole surviving relative within the Platanaceae (genetically very distinct), an old lineage of angiosperm trees (going back deep into the Cretaceous). Their next kin today are, according to angiosperm molecular trees, the enigmatic Proteaceae, a Gondwanan relict (represented in our angiosperm data by Petrophile). For an even more comprehensive genotypic study that also covers plastid markers check out De Castro et al., Ann. Bot., 2013 [open access].

Individuals in the contact zone between species of the two main lineages (including hybrids) can be heterozygotic / polymorphic for at least one of the sequenced nuclear regions, so that identification of recent hybrids is straightforward. Beyond this, genetically inconspicuous members of the ANA clade may show ITS pseudogenes from the PNA-E clade (stippled line in the figures above and below). Furthermore, two of the ANA clade species predominantly show a PNA-E LEAFY genotype: P. palmeri (pa) and P. rzedowskii (rz), which grow closest to the populations of the PNA-E clade. However, this is not the genotype found in the close-by American PNA-E species (ra, ge); instead, its sequence is phylogenetically closer to that of the Mediterranean species, P. orientalis (or), on the other side of the globe.

Overlay of the LEAFY, 5S-IGS and ITS histories in Platanus. This doodle is based on tree- and network-inferences coupled with PCR-RFLP-based genotyping and in-depth analysis of mutation patterns in length-polymorphic sequence regions (Grimm & Denk 2010, ES1). P. x hispanica is the well-known ornamental alley/park tree, the 'London plane', a cultivated historical hybrid (mid 18th century) of the most hardy North American plane, P. occidentalis, and the frost-vulnerable Mediterranean plane, P. orientalis. In the Mediterranean, due to frequent backcrossing, one can find morphologically mixed individuals showing only the P. orientalis genotypes or homogenous (American or European) type individuals showing occidentalis and orientalis genotypes (see eg. Pilotti et al., Euphytica, 2009).

Further reading

An animal example of seemingly incongruent single-gene trees that may well be the product of a largely shared evolutionary history is the autosomal intron data compiled for bears by Kutschera et al. (2014. Bears in a forest of gene trees: Phylogenetic inference is complicated by incomplete lineage sorting and gene flow. Mol. Biol. Evol. 31:2004–2017). Rather than a "forest of trees", each gene tree is poorly resolved but, when combined, allows inferring a phylogeny that matches quite well the paternal genealogy based on Y-chromosome data, both in strong conflict with the maternal genealogy inferred from mitochondriomes (see Part 2).

In Supplement File S6 [PDF] of Grímsson et al. (2018, Grana 57:16–116), I outline how ambiguous signal from combined gene regions relate to the poor support of critical branches in the Loranthaceae tree; see also the related posts: Using consensus networks to understand poor roots and Trivial but illogical – reconstructing the biogeographic history of the Loranthaceae (again). Some gene-tree conflicts are possibly linked to different histories (nuclear vs. chloroplast data), while others are a mix of insufficient signal and missing data (between chloroplast genes).

In a previous post (All solved a decade ago: the asterisk branch in the Fagales phylogeny), I give another example using an old Fagales matrix, which resulted in a tree that, even today, is the gold standard of Fagales phylogeny. The matrix combines a highly conserved nuclear gene (18S) conflicting with the plastid genes and complemented by an entirely uninformative mitochondrial gene (matR) to provide a "tree based on all three genomes". Also in this case the three-genome tree is essentially the matK tree.



* That doesn't mean that all highly supported, unconflicted relationships must be true. Note that just by combining a few genes, we obtain near-unambiguous support for the split between Mesangiosperms and the ANA-grade + gymnosperms, one of the splits defining the root and "basal" part of the angiosperm tree. The outgroup-inferred root is well fixed, even when using nuclear data, despite the fact that the 18S signal (the one showing the least ingroup-outgroup genetic distance) doesn't support such a root. The 25S does (see part 2), but it is more divergent and prone to ingroup-outgroup long branch attraction (LBA). That we have LBA issues with the data is obvious from a tiny detail: Ginkgo is supported with BS > 70 as sister of Podocarpus, which is wrong based on all we know about gymnosperms (see also Earle's gymnosperm database and literature cited therein). The likely correct split, Ginkgo as sister to Cycas, is present in the nc tree, but represents a much less supported alternative (BS <= 25). It is also obvious when one looks at the alignment(s): Cycas and Ginkgo share some potential genetic 'synapomorphies' in the low-divergent, generally conserved regions (e.g. 18S, stem-regions of 25S), but there are essentially none for Ginkgo + Podocarpus.

Monday, November 11, 2019

A new playground for networks and exploratory data analysis


[This is a post by Guido with some help from David]

There tend to be two types of studies of inheritance and evolution. First, there is evolution of organisms, either of the phenotype (morphology, anatomy, cell ultrastructure, etc.) or genotype (chromosomes, nucleotides). The latter involves direct inheritance, but it is often treated as including all molecules, although it is the nucleotides (and chromosomes) that get inherited, not amino acids, for example.

Second, there are studies of the evolution of behaviour, which have focused mainly on humans, of course, but can include all species. For humans, this includes socio-cultural phenomena, particularly language (written as well as spoken), but also cultural advancements such as social organization, tool use, agriculture, etc., which are inherited indirectly, by learning.

However, we rarely see studies that are multi-disciplinary in the sense of combining both physical and behavioural evolution. It is therefore very interesting to note the just-published preprint by:
Fernando Racimo, Martin Sikora, Hannes Schroeder, Carles Lalueza-Fox. 2019. Beyond broad strokes: sociocultural insights from the study of ancient genomes. arXiv.
These authors provide a review about the extent to which the analysis of ancient human genomes has provided new insights into socio-cultural evolution. This provides a platform for interesting future cross-disciplinary research.

The authors comment:
In this review, we summarize recent studies showcasing these types of insights, focusing on the methods used to infer sociocultural aspects of human behaviour. This work often involves working across disciplines that have, until recently, evolved in separation. We argue that multidisciplinary dialogue is crucial for a more integrated and richer reconstruction of human history, as it can yield extraordinary insights about past societies, reproductive behaviours and even lifestyle habits that would not have been possible to obtain otherwise.
Multi-disciplinary dialogue is a focal point here at the Genealogical World of Phylogenetic Networks, and our blog embraces non-biological data, so we have done a little brainstorming to put forward some ideas based on Racimo et al.'s comments. The four figures contain some extra discussion, with some visual representations of the ideas.

Why it's important to correlate genetic, linguistic and socio-cultural data. The doodle shows a simple free expansion model of a founder population with three genotypes (yellow, green, blue), a shared language (L) and two major cultural innovations (white stars). Because of drift and stochastic intra-population processes (size represents the size of the actively reproducing populace), the first expansion (light gray arrows) led to 'tribes' that already show some variation. The smaller ones close to the founder population still spoke the same language; the ones further away used variants (dialects) of L (L', still close to L; L'', more distinct). Because of bottlenecks, geographic distance and differing levels of inbreeding (the smaller a population and the farther away from the source, the more likely are changes in genotype frequency), each population has a different genotype composition. The second expansion (mid-gray arrows), mixing two sources, leads to a grandchild that evolved a new language M and lost the blue genotype. Because the cultural innovations are beneficial, we find them in the entire group. In extreme cases of genetic sorting and linguistic evolution, such shared cultural innovations may be the only evidence clearly linking all these populations.

Social-cultural character matrices

Correlating different sets of data and (cross-)exploring the signal in these data can be facilitated by creating suitable character matrices. In phylogenetics, we primarily use characters that are (ideally) subject to neutral evolution, such as nucleotide sequences and their translations, amino-acid sequences. When using matrices scoring morphological traits, we relax the requirement of neutral evolution, but we are still scoring traits that are the product of biological evolution. However, we don't need to stop there: phylo-linguistics is an active field, even though languages involve different evolutionary constraints and processes than we meet in biology. Data-wise there are nonetheless many analogies, and phylogenetic methods seem to work fine.

So, why not also score socio-cultural traits in a character matrix? For instance, we can characterize cultures and populations by basic features including: the presence of agriculture, which crops were cultivated, which animals were domesticated, which technological advances were available, whether it was a stone-age, bronze-age, iron-age culture, etc. Linguistically, we could also develop matrices of local populations, with regional accents or dialects, etc.

Creating such a matrix should, of course, be informed by available objective information. As in the case of morphological matrices or non-biological matrices in general, we should not be concerned about character independence. We don't need to infer a phylogenetic tree from these matrices, as their purpose is just to sum up all available characteristics of a socio-cultural group.

Second phase: stabilization of differentiation pattern. While the close-by tribes are still in contact with the mother population, the most distant one has lost contact. As a consequence, the gene pools of the L/L'-speaking communities will become more similar, and new innovations acquired by the founder population (black star) are readily propagated within its cultural sphere. Re-migration from the larger M-speaking tribe to the struggling L''-speakers (small population with high inbreeding levels) leads to the extinction of the blue genotype in the latter and increased 'borrowing' of M-words and concepts.

Distance calculations

Pairwise distance matrices are most versatile for comparing data across different data sets.

First, any character matrix can be quickly transformed into a distance matrix, and the right distance transformation can handle any sort of data: qualitative, categorical data as well as quantitative, continuous data.
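As a minimal sketch of such a transformation, the following Python code computes a Gower-style distance. This is one common choice for mixed data and our assumption here, not the method of any paper cited in this post. Categorical characters contribute 0 or 1 (match or mismatch), continuous characters contribute their difference scaled by the observed range, and missing scores (None) are skipped. The groups, traits and scores are hypothetical.

```python
from typing import List, Optional, Sequence, Union

Score = Optional[Union[str, float]]  # None marks a missing score


def character_ranges(rows: Sequence[Sequence[Score]]) -> List[Optional[float]]:
    """Observed range of each continuous character; None for categorical ones."""
    ranges: List[Optional[float]] = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        if observed and all(isinstance(v, (int, float)) for v in observed):
            ranges.append(float(max(observed) - min(observed)))
        else:
            ranges.append(None)
    return ranges


def gower_distance(a: Sequence[Score], b: Sequence[Score],
                   ranges: Sequence[Optional[float]]) -> float:
    """Mean per-character difference over characters scored in both rows."""
    total, compared = 0.0, 0
    for x, y, rng in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip characters missing in either row
        if rng is None:
            total += 0.0 if x == y else 1.0  # categorical: a mismatch counts 1
        elif rng > 0:
            total += abs(x - y) / rng  # continuous: range-scaled difference
        compared += 1
    if compared == 0:
        raise ValueError("rows share no scored characters")
    return total / compared


def distance_matrix(rows: Sequence[Sequence[Score]]) -> List[List[float]]:
    """Pairwise distances between all rows of a character matrix."""
    ranges = character_ranges(rows)
    return [[gower_distance(a, b, ranges) for b in rows] for a in rows]


# Hypothetical groups scored for: agriculture, metal technology, settlement size
groups = ["A", "B", "C"]
matrix = [
    ("yes", "bronze", 200.0),
    ("yes", "iron", 500.0),
    ("no", "stone", None),
]

for name, row in zip(groups, distance_matrix(matrix)):
    print(name, [round(d, 3) for d in row])
```

Scaling by the observed range keeps every character on the same 0 to 1 scale, so that a single continuous trait does not dominate the categorical ones.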

Second, the signal in any distance matrix can be quickly visualized using Neighbor-nets. This blog has a long list of posts showing Neighbor-nets based on all sorts of sociological data that don't follow any strict pattern of evolution, and are heavily biased by socio-cultural constraints (e.g. bikability, breast sizes, German politics, gun legislation, happiness, professional poker, spare-time activities). We have even included celestial bodies.

Third, distance matrices can be tested for correlation as-is, without any prior inference, using simple statistics, such as the Pearson correlation coefficient. To give just one example from our own research: in Göker and Grimm (BMC Evol. Biol. 2008), this coefficient was used for testing the performance of character and distance transformations for cloned ITS data covering substantial intra-genomic diversity, by correlating the resulting individual-based distances with species-level morphological data matrices. (The internal transcribed spacers are multi-copy, nuclear-encoded, non-coding gene regions; in the simplest case each individual has two sets of copies, or arrays, one inherited from the father, the other from the mother, which may differ between but also within individuals.)

In the context of Racimo et al.'s paper, one could construct a genetic, a socio-cultural, a linguistic and a geographical matrix, determine the pairwise distances between what in phylogenetics are called OTUs (the operational taxonomic units), and test how well these data (or parts of them) correlate. The OTUs would be local human groups sharing the same culture (and, if known, language).
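A rough sketch of such a test, for two distance matrices over the same OTUs, is given below. It computes the Pearson correlation between the off-diagonal distances and adds a permutation step (a Mantel-type test), because pairwise distances are not independent observations. The permutation step is our addition for illustration; the OTU labels are borrowed from the figures and all distance values are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def upper_triangle(matrix: Matrix) -> List[float]:
    """Each pairwise distance once, without the diagonal."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


def mantel(a: Matrix, b: Matrix, permutations: int = 999,
           seed: int = 0) -> Tuple[float, float]:
    """Return the correlation and a one-sided permutation p-value."""
    if len(a) != len(b):
        raise ValueError("both matrices must cover the same OTUs")
    a_values = upper_triangle(a)
    observed = pearson(a_values, upper_triangle(b))
    rng = random.Random(seed)
    order = list(range(len(b)))
    at_least_as_high = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the OTUs of b, rows and columns together
        shuffled = [[b[i][j] for j in order] for i in order]
        if pearson(a_values, upper_triangle(shuffled)) >= observed:
            at_least_as_high += 1
    return observed, (at_least_as_high + 1) / (permutations + 1)


# Hypothetical distances between five OTUs: L, L', L'', M, Λ
genetic = [
    [0.00, 0.10, 0.25, 0.40, 0.80],
    [0.10, 0.00, 0.20, 0.45, 0.85],
    [0.25, 0.20, 0.00, 0.30, 0.90],
    [0.40, 0.45, 0.30, 0.00, 0.75],
    [0.80, 0.85, 0.90, 0.75, 0.00],
]
linguistic = [
    [0.0, 0.1, 0.3, 0.6, 1.0],
    [0.1, 0.0, 0.3, 0.6, 1.0],
    [0.3, 0.3, 0.0, 0.5, 1.0],
    [0.6, 0.6, 0.5, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
]

r, p = mantel(genetic, linguistic)
print(f"r = {r:.3f}, permutation p = {p:.3f}")
```

With only five OTUs there are just 120 possible relabelings, so the p-value is coarse; the sketch is meant to show the mechanics, not to support any conclusion.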

Alternatively, one can just map the scored socio-cultural traits onto trees based on genetic data or linguistics.
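One simple way to do such a mapping is Fitch parsimony, which counts the minimum number of changes a discrete trait needs on a fixed, rooted, binary tree. This is our choice for illustration, and other mapping methods exist. The tree, the group labels and the trait scores in this sketch are all hypothetical.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree as nested 2-tuples; leaves are group labels
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (possible states at this node, minimum number of changes below it)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Both subtrees can agree on a state: no extra change needed here
        return shared, left_changes + right_changes
    # Otherwise one change is required on one of the two descending branches
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical tree from genetic or linguistic data
tree: Tree = (((("L", "L1"), "L2"), "M"), "Lambda")

# Hypothetical socio-cultural traits scored per group
agriculture = {"L": "yes", "L1": "yes", "L2": "yes", "M": "no", "Lambda": "no"}
pottery = {"L": "yes", "L1": "no", "L2": "yes", "M": "no", "Lambda": "yes"}

for name, trait in [("agriculture", agriculture), ("pottery", pottery)]:
    root_states, changes = fitch(tree, trait)
    print(name, "changes:", changes, "root states:", sorted(root_states))
```

A trait that needs few changes follows the tree closely; a trait that needs many may have been borrowed between groups or invented repeatedly.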

A new culture with its own language (Λ), genotype (red) and innovations (ruby-red pentagon) migrates close to the settling area of the L-people. Because of raids, genotypes and innovations from the L-people get incorporated into the Λ-culture.

How to get the same set of OTUs

The Göker & Grimm paper mentioned above tested several options for character and distance transformations, because we faced a similar problem to what researchers will face when trying to correlate socio-cultural data with genetic profiles of our ancestors: a different set of leaves (the OTUs). We were interested in phylogenetic relationships between individuals using data representing the genetic heterogeneity within these individuals.

Genetic studies of human (ancient or modern) DNA use data from individuals, but socio-cultural and linguistic data can only be compiled at a (much) higher level: societies, or other groups of many individuals. In addition, these groups may also span a larger time frame. Since humans love to migrate, we are even more of a genetic mess than were the ITS data that we studied.

One potential alternative is to use the host-associate analysis framework of Göker & Grimm. Instead of using the individual genetic profiles (the associate data), one sums them across a socio-cultural unit (serving as host). The simplest method is to create a consensus of the data (in Göker & Grimm, we tested strict and modal consensuses). This produces sequences with a lot of ambiguity codes: genetic diversity within the population will be represented by intra-unit sequence polymorphism (IUSP). Standard distance and parsimony implementations do not deal with ambiguities, but maximum likelihood, as implemented in RAxML, does to some degree. A stopgap is the recoding of ambiguities as discrete states for phylogenetic analysis (tree and network inference), as done by Potts et al. (Syst. Biol. 2014 [PDF]) for 2ISPs ('twisps'), intra-individual site polymorphisms. It can't hurt to try out whether this works for IUSPs, too.
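To make the consensus step concrete, here is a small sketch that collapses the aligned sequences of several individuals into one sequence per unit, using the standard IUPAC nucleotide ambiguity codes. It assumes that a strict consensus keeps every base observed at a site and that a modal consensus keeps only the most frequent base(s); this is our simplified reading for illustration, not the g2cef implementation, and the sequences are hypothetical.

```python
from collections import Counter
from typing import Sequence

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(sequences: Sequence[str], mode: str = "strict") -> str:
    """Collapse aligned sequences of one unit into a single consensus sequence."""
    if not sequences:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned (equal length)")
    if mode not in ("strict", "modal"):
        raise ValueError("mode must be 'strict' or 'modal'")
    result = []
    for column in zip(*(s.upper() for s in sequences)):
        counts = Counter(base for base in column if base in "ACGT")
        if not counts:
            # No base observed: keep a gap if every individual has one
            result.append("-" if set(column) == {"-"} else "N")
            continue
        if mode == "strict":
            states = set(counts)  # every observed base enters the code
        else:
            top = max(counts.values())
            states = {b for b, n in counts.items() if n == top}  # ties are kept
        result.append(IUPAC[frozenset(states)])
    return "".join(result)


# Hypothetical individuals sampled from one socio-cultural unit
unit = ["ACGT", "ACGA", "ATGA"]
print(consensus(unit, mode="strict"))
print(consensus(unit, mode="modal"))
```

Gaps and already ambiguous characters in the input are ignored whenever at least one unambiguous base is observed at a site, which is a simplification.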

Since humans (tribes, local groups) often differ in the frequency of certain genotypes, it would be straightforward to use these frequencies directly when setting up a host matrix. Instead of, for example, nucleotides or their ambiguity codes, the matrix would have the frequency of the different haplotypes. We can't infer trees from such a matrix (we need categorical data), but we can still calculate the distance matrix and infer a Neighbor-net.
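As an illustration, the sketch below turns haplotype counts per group into frequencies and computes the ordinary Bray-Curtis dissimilarity between groups. This is the standard ecological index, chosen here as an assumption; it is not the 'phylogenetic Bray-Curtis' transformation. The groups, haplotype names and counts are hypothetical, loosely following the colours in the figures.

```python
from typing import Dict, List, Sequence


def frequencies(counts: Dict[str, int], haplotypes: Sequence[str]) -> List[float]:
    """Relative frequency of each haplotype in one group."""
    total = sum(counts.get(h, 0) for h in haplotypes)
    if total == 0:
        raise ValueError("group has no sampled individuals")
    return [counts.get(h, 0) / total for h in haplotypes]


def bray_curtis(p: Sequence[float], q: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones."""
    denominator = sum(a + b for a, b in zip(p, q))
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return sum(abs(a - b) for a, b in zip(p, q)) / denominator


haplotypes = ["yellow", "green", "blue", "red"]

# Hypothetical numbers of sampled individuals per haplotype and group
samples = {
    "L": {"yellow": 5, "green": 3, "blue": 2},
    "M": {"yellow": 6, "green": 4},
    "Lambda": {"yellow": 1, "red": 9},
}

host_matrix = {group: frequencies(c, haplotypes) for group, c in samples.items()}
groups = list(host_matrix)
distances = [
    [bray_curtis(host_matrix[g], host_matrix[h]) for h in groups] for g in groups
]

for group, row in zip(groups, distances):
    print(group, [round(d, 2) for d in row])
```

The resulting matrix can be exported to a program that implements Neighbor-net, such as SplitsTree.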

The 'phylogenetic Bray-Curtis' (distance) transformation introduced in Göker & Grimm (2008) also keeps the information about within-host diversity when determining inter-host distances (see Reticulation at its best ...).


Transformations for genetic data from smaller to larger, more-inclusive units are implemented in the software package POFAD by Joly et al. (Methods in Ecology & Evolution, 2015). Their paper also provides a comparison of different methods, including the ones tested in Göker & Grimm (2008; also implemented in the tiny executables g2cef and pbc, compiled for any platform).

The process of assimilation. The Λ-people subdued the L-culture, with the consequence that all innovations are shared in their influence sphere. Having a much smaller total population size, the language of the invaders is largely lost, but the new common language L* still includes some Λ-elements (in a phylogenetic tree analysis, L* would be part of the L/M clade; using networks, L* would share edges with Λ, in contrast to L and M). The L''/M-speaking remote population is re-integrated. The invaders' genotype (red) becomes part of the L-people's gene pool. Re-migration (forced or not) introduces L-genotypes into the original Λ-population. Only by comparing all available data, ideally covering more than one time period, can we deduce that the M-speakers represent an early isolated subpopulation of the L-people that was not affected by the Λ-invasion. With only the genetic data at hand, one may identify the M-speakers as one source and the Λ-tribe as another source for the L*-people, and infer that all L/M and Λ-tribes share a common origin (since the yellow genotype is found in both the M- and the original Λ-population).

Conclusion

It therefore seems to us that there is enormous potential for multi-disciplinary work that truly combines organismal and socio-cultural evolution. We have provided a few practical suggestions here about how this might be done. We encourage you all to try some of these ideas, to see where they lead us all.