Monday, January 28, 2019

Future challenges for computational diversity linguistics

At the end of each year, many people start to think about the things they want to do during the next year. Although I am not very extreme in this respect, I tend to do the same at times. Last year, inspired by a discussion I had with students in Buenos Aires, I started thinking about the biggest challenges that I see for the field of computational diversity linguistics (i.e., historical and typological language comparison carried out in a formal or quantitative way). Before my holidays started, I therefore sat down and made a short list of tasks that are challenging, but that I think can still be tackled in the nearer or further future.

The idea of making such a list of questions is not new to mathematicians, who have their well-known Hilbert Problems, proposed by David Hilbert in 1900. In linguistics, I first heard about the idea from Russell Gray, who was himself introduced to it by the linguist Martin Hilpert, who gave a talk in 2014 called "Challenges for 21st century linguistics" (available online here). Since then, Russell Gray has emphasized the importance of proposing "Hilbert" questions for the fields of linguistic and cultural evolution, and has also presented his own big challenges in the past.

As somebody who considers himself a methodologist, I'm not going to frame questions as "big" or as challenging as those of Russell Gray or Martin Hilpert. Instead, the problems I would like to see tackled are purely computational challenges that I think can be solved by algorithms or workflows. This does not mean, of course, that these problems are not challenging in the big sense, and it also does not automatically mean that they can be solved in the near future. But given that my own work, and that of colleagues in the field of computational and computer-assisted language comparison, progresses steadily, at times even at an impressive pace, I have some trust that these problems will indeed be solvable within the next 5-10 years.

The problems I came up with are listed below:
  1. automatic morpheme segmentation
  2. automatic sound law induction
  3. automatic borrowing detection
  4. automatic phonological reconstruction
  5. simulating lexical change
  6. simulating sound change
  7. statistical proof of language relatedness
  8. typology of semantic change
  9. typology of semantic promiscuity
  10. typology of sound change.
You can see that the way I worded the problems divides them into four major categories. The first four problems point to questions of inference, such as the inference of morpheme boundaries in a monolingual wordlist (# 1), the inference of laws by which sounds are changed from a parent to a daughter language (# 2), the inference of borrowings in multilingual datasets (# 3), and the inference of so far unattested proto-forms (# 4). The fifth and the sixth problems deal with simulation, and I distinguish the simulation of lexical change (# 5) and the simulation of sound change (# 6) as two separate tasks, although they could of course be combined later. The seventh problem is a bit different from the others, as it deals with the question of genealogical relationship among languages, and how we can test it statistically (see Baxter and Manaster Ramer 2000 for an overview).

The last three problems deal with general patterns that can, or could be, observed for change in semantics and phonology. Semantic change (# 8) shows highly interesting cross-linguistic tendencies that are not yet fully understood (see Wilkins 1996 for an early discussion). Furthermore (# 9), words are often re-used across the lexicon of a given language, and it is an open question whether there are striking preferences for building many new words from just a few basic words denoting "promiscuitive" concepts (like "fall" or "stand"; see Geisler 2018 and a recent blog post by Schweikhard 2018 for an overview). Sound change (# 10) also follows cross-linguistic regularities, but the nature of these regularities is still not very well understood (see Kümmel 2008 for a pilot study on the topic).

Discussing each task in a single post would take far too long, given that I have reflected on these problems a lot during the last years, and at times even have some concrete ideas on how they could be tackled.

So, in line with my idea of making plans for 2019, I decided to try to discuss each of these ten problems in greater detail in separate blog posts throughout 2019. This post thus serves merely to introduce the problems. Over the next ten months, I will try to devote one blog post to each problem; and then I will discuss all of the problems again at the end of this year.

I do not yet know how far this will go, or whether I will have the discipline to write up a post on each topic within the coming months, especially since I may end up discarding problems from my list. However, I feel that this could turn into a nice road map for my research in 2019. If I have to devote at least half a day each month over the next year to thinking about problems in computational historical and typological language comparison, it might help not only me but also some colleagues to come up with a solution to some of the problems.


Baxter, William H. and Manaster Ramer, Alexis (2000) Beyond lumping and splitting: probabilistic issues in historical linguistics. In: Renfrew, Colin and McMahon, April and Trask, Larry (eds.) Time Depth in Historical Linguistics. Cambridge: McDonald Institute for Archaeological Research, 167-188.

Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter. Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen: Stauffenburg, 131-142.

Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11.19.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The comparative method reviewed. Regularity and irregularity in language change. New York: Oxford University Press, 264-304.


  1. Preliminary note: most of these would turn into several very different problems depending on what is given and what is not. E.g. for the four automatic anything problems: unannotated corpora, bare wordlists, monolingual dictionaries with or without a semantic ontology, precompiled etymological dataset…?

    1. I have a very clear idea of the input formats here: usually, the format I require will be semi-annotated wordlists, ideally in phonetic transcription. Otherwise, it won't be fun, and it probably won't work either, as I know from experience with other tasks that automatic methods handle well by now (e.g., cognate detection).

  2. as to 3, for Dutch we recently developed a loanword-o-meter, not yet on the web (will soon be), but I can send you a PowerPoint. Interested? mail me: (Nicoline van der Sijs)

    1. Hi, thanks for offering. I'd rather wait for the web app to appear (and a paper describing your approach). However, the challenge, as you'll see once I manage to write about it, won't be a single language with its loanwords (that is definitely doable), but a method that can be applied to virtually any language, retrieving the loanwords it shares with other languages. Probably impossible to do consistently, but we'll see. I'm definitely curious to see how your approach works.