Monday, January 28, 2019

Future challenges for computational diversity linguistics

At the end of each year, many people start to think of the things they want to do during the next year. While not being very extreme in this perspective, I tend to do the same thing at times; and last year, it happened that I started — inspired by a discussion with students I had in Buenos Aires — thinking of the biggest challenges that I see for the field of computational diversity linguistics (i.e. historical and typological language comparison carried out in a formal or quantitative way). I thus sat down before my holidays started, and made a short list of tasks that are challenging, but of which I think can still be tackled in the nearer or further future.

The idea to make such a list of questions is not new to mathematicians, who have their well-known Hilbert Problems, proposed by David Hilbert in 1900. In linguistics, I first heard about them from Russell Gray, who himself was introduced to this by a talk of the linguist Martin Hilpert, who gave a talk on challenging questions for linguistics in 2014 (online available here), called "Challenges for 21st century linguistics". Russell Gray since then has emphasized the importance to propose "Hilbert" questions for the fields of linguistic and cultural evolution, and has also presented his own big challenges in the past.

As somebody who considers himself to be a methodologist, I'm not going to frame questions as "big" or challenging as Russell Gray or Martin Hilpert did. Instead, the problems I would like to see tackled are pure computational challenges, that I think can be solved by algorithms or workflows. This does not mean, of course, that these problems are not challenging in the big sense, and it also does not automatically mean that they can be solved in the near future. But given that my own work, and that of colleagues in the field of computational and computer-assisted language comparison, progresses steadily, at times even at an impressive paste, I have some trust that these problems will indeed be solvable within the next 5-10 years.

The problems I came up with are listed below:
  1. automatic morpheme segmentation
  2. automatic sound law induction
  3. automatic borrowing detection
  4. automatic phonological reconstruction
  5. simulating lexical change
  6. simulating sound change
  7. statistical proof of language relatedness
  8. typology of semantic change
  9. typology of semantic promiscuity
  10. typology of sound change.
You can see that the way I worded the problems divides them into four major categories. The first four problems point to questions of inference, such as the inference of morpheme boundaries in a mono-lingual wordlist (# 1), the inference of laws by which sounds are changed from a parent to a daughter language (# 2), the inference of borrowings in multilingual datasets (# 3), and the inference of so far unattested proto-forms (# 4). The fifth and the sixth problems deal with simulation, and I distinguish the simulation of lexical change (# 5) and the simulation of sound change (# 6) as two separate tasks, although they could of course be combined later. The seventh problem is a bit different from the others, as it deals with the question of genealogical relationship among languages, and how we can test  it statistically (see Baxter and Manaster Ramer 2000 for an overview).

The last three problems deal with general patterns that can, or could be, observed for change in semantics and phonology. Semantic change (# 8) shows highly interesting cross-linguistic tendencies that are not yet fully understood (see Wilkins 1996 for an early discussion). Furthermore (# 9), words are often re-used across the lexicon of a given language, and it is an open question whether striking preferences for building many new words from just a few basic words denoting "promiscuitive" concepts (like "fall", "stand", see Geisler forthcoming and a recent blogpost by Schweikhard 2018 for an overview). Sound change (# 10) also follows cross-linguistic regularities, but the nature of these similarities are still not very well understood (see Kümmel 2008 for a pilot study on the topic).

Discussing each task would be way too long for a single post, given that I have reflected about these problems a lot during the last years, and may at times even have some ideas on how the problems could be tackled in concrete.

So, based on my idea of making plans for 2019, I decided that I would try to discuss each of these ten problems in greater detail in separate blog posts throughout 2019. This post thus serves merely to introduce the problems. Over the next ten months, I will try to devote some time to discuss each problem in a blog post devoted to each of the topics; and then I will discuss all of problems again at the end of this year.

I do not yet know how far this will go, and whether I will have the discipline to write up a post on each topic within the coming months, especially since it may also be possible that I end up discarding problems from my list. However, I feel that this could turn into a nice road map for my research in 2019. If I have to devote at least half a day each month over the next year to think about problems in computational historical and typological language comparison, it might not only help myself but also some colleagues to come up with a solution to some of the problems.


Baxter, William H. and Manaster Ramer, Alexis (2000) Beyond lumping and splitting: probabilistic issues in historical linguistics. In: Renfrew, Colin and McMahon, April and Trask, Larry (eds.) Time Depth in Historical Linguistics. Cambridge: McDonald Institute for Archaeological Research, 167-188.

Geisler, Hans (2018): Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter. Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen:Stauffenburg, 131-142.

Nathanael E. Schweikhard (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11.19.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The comparative method reviewed. Regularity and irregularity in language change. New York:Oxford University Press, 264-304.

Monday, January 21, 2019

A question about coalescent-based species phylogenies

This may be a naive question; but as I am now semi-retired, so I can now ask it without professional embarrassment.

It is common when constructing species phylogenies (both trees or networks) to use a model that takes into consideration multiple replacements of characters through evolutionary time. If the states of any given character have been modified multiple times, then the currently observed differences in that character between taxa will not accurately reflect their evolutionary history.

For example, we "correct for multiple substitutions" when using DNA/RNA sequence data. We do this because, with only four character states, the probability that undetectable multiple substitutions have occurred increases considerably through evolutionary time. So, we have developed any number of sophisticated models for addressing this issue, such as JC and GTR; and it is unusual to see a published paper with a species phylogeny that does not use one of them.

This leads to a question about population phylogenies. In this case, the use of the coalescent model is prevalent. It allows the calculation of various population parameters, based on viewing phylogenies backwards through time. For the purpose of phylogenetics, the key calculation is the coalescence time of each pair of lineages, although population size is also of some interest.

The coalescent model is based on a set of assumptions, of course. Indeed, it is based on the Fisher-Wright model of population genetics. This is an infinite-sites model, meaning that it assumes that multiple replacements of characters do not occur during the evolution of the populations. That is, if the genetic sequences are infinitely long then the probability of multiple substitutions is 1 / infinity = zero.

This, then, is my question: Can we really assume that multiple substitutions never occur, in one part of the analysis, and assume that they are so common that we need to adjust for them, in another part of the same analysis?

I have not found this issue addressed either in the published literature or on the internet. Indeed, most people I have spoken to did not even realize that the coalescent is ultimately based on an infinite-sites model. So, for me at least, this is an interesting question.

Monday, January 14, 2019

Phylogenetic ambiguity: data gaps, indifference and internal conflict

A tweet by my favourite journal (not only because they insist that authors make their data available) pointed me to their most viewed paper of 2018, with a nice title (for a network-fan):
Genus-level phylogeny of cephalopods using molecular markers: current status and problematic areas, by Sanchez et al. (PeerJ, 6:e4331).
"Problematic areas" are exactly my cup of tea. However, the graphical representation of these falls a bit short. The authors show three maximum-likelihood phylograms, one for the Cephalopoda with support annotated at some branches (their Fig. 1), and one each for two of the constituent lineages, the Decabrachia (their Fig. 2) and the Octobrachia (Fig. 3, reproduced below, because we will take a look at the data behind it).

Original: "Figure 3: Maximum-likelihood tree of the Octobrachia under the
GTR + Gamma model with the morphological character set mapped onto the tree.
Taxa highlighted in red represents discrepancy to previously published studies."

Unfortunately, we don't know the actual support for each of the branches — there is a legend in the lower right, but no signatures etc. associated with it. You will find some information throughout the text, of course. For example:
The use of concatenated sequences of all markers (Fig. 2) resulted in a resolved topology for monophyly of the Octobrachia (BS = 58%), and strong support for monophyly of the Decabrachia (BS = 98%), with both clades strongly supported by the Bayesian approach with PP = 0.78 and 0.75 respectively
The latter is quite strange, as PP are expected (methodologically) to be ≥ BS.
Although monophyly was demonstrated for several families contained within both superorders, the relationships of the families contained within Octobrachia were better supported than those in Decabrachia (Fig. 2). Of the 37 nodes in the Octobrachia portion of the general tree containing all taxa, the majority were resolved above the 50% level (31 nodes with BS > 50%); but only 28 out of 80 nodes in the Decabrachia were resolved at BS >50%, most of which were located at family level.
BS = 51 could be lack of signal (all other alternatives BS ~ 0) or conflict (one alternative has a BS = 49).

What we can infer directly from the alignment

Let's have a look at the first three gene regions in the matrix provided, using Mesquite's bird-view option.

We can see from the alignment that the first gene (left; mitochondrial 12S rDNA) splits the taxon set (the taxon order seems to be arbitrary) into two (three if we include those with no data) main groups with substantially divergent 12S rDNAs. However, in the second, much more homogeneous gene, no such differentiation is obvious, with the exception of two accessions that remain very different from the rest. This is quite puzzling, because the second gene is the (close-by) mitochondrial 16S rDNA.

Without going into details, the 12S rDNA unambiguously supports (and enforces) an Octopodia core clade defined by a 12S rDNA entirely different from that of other taxa, and comprising five of this order's families, in which Amphioctopus and Octopus make up a subclade with strongly derivating 16S rDNA.

With respect to the tree, we also have to assume that the 12S rDNA of the Octopodia core clade is derived, strongly evolved, whereas it remained largely unmodified (ie. is primitive) in the other, earlier diverged (according to the tree) lineages. However, some of these lineages have equally long terminal branches: there has been more evolution going on in other genes.

The third gene, the nuclear-encoded gene for the 18S rRNA (18S nrDNA), shows another pattern (and quite typical). Large stretches with very little variation, hence, devoid of differentiating signal that would allow the tree algorithm to make a decision (and letting Bayes get lost in the treespace resulting in PP < 1.0).

For half of the taxa, no information is available, but this hardly matters because even genera with strongly different mt 12S rDNA have nearly the same 18S nrDNA. There is a little hickup in the second part in one accession (a gap in Cirrothauma with a small, off-alignment strand in between), but this could just be a sequencing artefact. Limited to a single taxon, it has no topological effect (we at least need four to make a call), it will only increase the length of the terminal branch.

The remainder of the matrix mirrors the situation in the first three partitions, eg. in the well-sampled (only six taxa missing) mt coI gene, Callistoctopus is visibly distinct from all other genera, while most general variation is concentrated at the 3rd codon position. All other mt-genes, accounting for 58% of the matrix' characters, are covered for four of the taxa (the sister taxon used to root, Vampyrotheutis, and three of the core Octopodia, hence, can only support a single split within this group and be used to test for its alternatives.

What networks could have shown

The matrix provided for the shown tree (made available via figshare) has 40 taxa and 16104 characters, quick to run these days. Here's the tree with branch support annotated along branches.

ML phylogram inferred from Sanchez et al.'s matrix, taxa ordered as in the original fig. 3. Members of the same taxon (order, superfamily, family, as annotated in Sanchez et al.'s fig. 3) colored accordingly. Values at branches indicate ML-BS  support using a single partition for the entire data ("unpart.") or using the gene-wise partition scheme provided in the figshare submission ("part.")

Even though I run an unpartitioned analysis, my tree is very similar to the original tree, with a near identical topology except for Ameloctopus being moved one node up and placed as sister to Hapalochlaena (ML, unpartioned-BS = 52 vs. 46[!] for the alternative seen in Sanchez et al.'s fig. 3). I never understood the fuzz about model and partition testing, when we usually work with data where any model will inevitably be suboptimal (see alignments). As a geneticist, I also believe data partitions should be informed by function, not computer programmes (eg. one for 1st and 2nd codon position, another for the 3rd codon position, and one for the rDNAs).

We have unambiguously supported branches (BS ~ 100), and others, the "problematic areas" (BS << 100). Ambiguity in support values for branches of a tree can have two reasons:
  1. Lack of signal, the data is indifferent regarding the placements of certain taxa and/or subtrees (PP < 1.0 are indicative for lack of signal).
  2. Conflicting signal, parts of the data (data partitions) prefer one topological alternative, others a (partly) conflicting one (keep in mind that even in the presence of substantial signal conflict, PP ~ 1).
Short branches with low (BS) support point to the former, long branches with low (BS) support are a direct indication of the latter. Two apparent sources of conflict would be that the data include gene regions from the biparentally inherited nucleome and the (usually maternally inherited, not sure how this is in squids) mitochondriome and combine protein-coding genes (amino-acids coded by codons) with rRNA genes (directly encoding a certain secondary, tertiary structure).

In our tree here, we notice a general correlation between the branch lengths and the support; the shorter the branch, the lower the support. With a few exceptions, eg. the Octopodida core clade, triggered by the unique, strongly diverged sequences of the 12S rDNA, has a long root branch with compartively low support (ML-BS = 63; collapses when using the authors' partitioning scheme that treats each gene region as individual partition).

Full BS Consensus network based on 450 ML pseudoreplicates (result of the unpartitioned analysis). Edge lengths are proportional to the BS support (frequency of the splits in the BS tree sample), trivial splits not collapsed. Arrow points to the root (cf. Sanchez et al.'s fig. 1).

The BS Consensus network shows us that some of the "problematic areas", ie. branches with ambiguous support, are not really problematic (alternatives have no to very little support), but others are. Including the 12S rDNA-based Octopodida core clade, and connected to this, the division of the Megaleledonidae, as annotated in Sanchez et al.'s fig. 3, into two clades (not discussed in the paper). A clade including all Megaleledonidae has a BSunpart./part. = 34/55 and competes with the 12S rDNA split (BS = 63/37) and the placement of Cistopus as sister to the Octopodida core clade (BS = 52/34). It doesn't conflict with the alternative topology placing Cistopus as sister to all of them (BS = 38/50). The reason for this is, of course, that by using a different partion for the highly divergent mt-12S rDNA, we allow RAxML to estimate high probabilities for all mutations, effectively down-weighting each mutation in this gene compared to those in other, more conservatively structured gene regions, which seem to prefer alternative splits.

Vice versa, the poorly supported sister relationship (BS = 45/21) of Bathypolypus with the Enteroctopodidae (light green) + part of the Argonautoidea (pink) stands unopposed, alternative splits have BSunpart. < 10. In the partitioned analysis, however, there is an equally poor supported alternative sticking out a bit: Bathypolypus as sister to the (all-including) Megaleledonidae clade (BSpart. = 23).

While we see little effect on the tree topology, partitioning affects some of the support values. An nice example is the structure of the Megaleledonidae s.str. subtree. The root is unambiguously supported, as is the sister relationship of Graneldone and Bentheledone. The remaining branches have ambiguous support.

Here, the partitioning scheme is a game changer. Unpartioned, the favored alternative is a Adelieledone-Pareledone-Megaeledone (APM) grade "basal" to Graneldone and Bentheledone (BS = 68/49); using the authors' partitioning scheme, the data favors an APM clade sister to the latter two (quite a difference, since we often equal clades with monophyly and grades with paraphyly).

It doesn't matter whether a clade has a BS support of 30, 50 or 70. We need to know, if the remaining 70%, 50%, or 30% of bootstrap replicates show random or the same alternative(s). When a tree has ambiguously support branches, BS Consensus networks should be obligatory.

Instead of reading sentences like this:
Benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) together with the Megaleledonidae (possessing a single row of suckers) formed a well-supported monophyletic group (BS = 72%, PP = 0.61).
we should read this:
A clade including all benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) and the Megaleledonidae (possessing a single row of suckers) received ambiguous support (BS = 72%, PP = 0.61), but potential alternatives received no support at all. The combination of a relative high BS but low PP points towards a faint, but consistent signal in the available data.
And include the Consensus networks at least in the supplement.

When we aim to map morphological traits (which a nice touch of Sanchez et al.'s paper), why not consider the topological alternatives we see there?

Running single-gene trees is never wrong, too. But, in the case of these data, that would be the topic of another post, using a different type of network: a super-network.

Final note. This post is not intended to criticize Sanchez et al.'s paper (my squid-expertise ends with having seen them in aquaria). My impression is they put a lot of effort into getting the matrix together. Having been forced to harvest molecular data myself in the past, I know how important and tedious this work is. Instead, this post stresses and shows, using an easy-to-access example that raised a lot of interest (attracted many views), that we often have to work with suboptimal data not providing trivial results in the form of fully resolved trees. This is a situation in which easy to generate networks offer a lot. No peer reviewer should, in such a case, be content with seeing just a tree (although they, to my experience, always are).