Monday, December 16, 2019

Open problems in computational diversity linguistics: Conclusion and Outlook

One year has now passed since I discussed the idea with David to devote a whole year of 12 blopgosts to the topic of "Open problems in computational diversity linguistics". It is time to look back at this year, and the topics that have been discussed.

Quantitative view

The following table lists the pageviews (or clicks) for each blogpost (with all caveats as to what this actually entails), from January to November.

Problem Month Title Clicks Comments
0 January Introduction 535 4
1 February Automatic morpheme detection 718 0
2 March Automatic borrowing detection 422 1
3 April Automatic sound law induction 522 2
4 May Automatic phonological reconstruction 517 0
5 June Simulation of lexical change 269 0
6 July Simulation of sound change 423 0
7 August Statistical proof of language relatedness  383 1
8 September  Typology of semantic change 372 2
9 October Typology of sound change 250 3
10 November Typology of semantic promiscuity 217 2

The first thing to note is that people might have gotten tired of the problems, since the last two blogs were not very well-received in terms of readers (or not yet, anyway). One should, however, not forget that the number of clicks received by the system are cumulative, so if a blog is older, it may have received more readers just because it has been online for a longer time.

What seems, however, to be interesting is the rather high number of readers for the February post; and it seems that this is related to the topic, rather than the content. Morpheme detection is considered to be a very interesting problem by many practitioners of Natural Language Processing (NLP), and the field of NLP has generally many more followers than the field of historical linguistics.

Reader comments and discussions

For a few of the posts, I received interesting comments, and I  replied to all of them, where I found that a reply was in order. A few of them are worth emphasizing here.

As a first comment in March, Guillaume Jacques replied in form of a blog post of his own, where he proposed a very explicit method for the detection of borrowings, which assumes that data are compared where an ancestral language is a available in written sources (see here for the post). Since it will still take some time to prepare the data in the manner proposed by Guillaume, I have not had time to test this method for myself, but it is a very nice example for a new method for borrowing detection, which addresses one specific data type and has so far not been tested.

Thomas Pellard provided a very useful comment on my April post, emphasizing that automatic reconstruction based on regular expressions (as I had proposed it, more or less, as a riddle that should be solved), requires a "very precise chronology (order) of the sound changes", as well as "a perfect knowledge of all the sound changes having occurred". He concluded that "regular expression-based approach may thus be rather suited for the final stage of a reconstruction rather than for exploratory purposes". What is remarkable about this comment is that it partly contradicts (at least in my opinion) the classical doctrine of historical language comparison, since we often assume that linguists apply their "sound laws" perfectly well, being able to explain the history of a given set of languages in full detail. The sparsity of the available literature, and the problems that even small experiments encounter, shows that the idea of completely regular sound change that can be laid out in form of transducers has always remained an idea, but was never really practiced. It seems that it is time to leave the realm of theory and do more practical research on sound change, as suggested by Thomas.

In response to my post on problem number 7 (August), the proof of language relatedness, Guillaume Jacques wrote that: "although most historical linguists see inflectional morphology as the most convincing evidence for language relatedness, it is very difficult to conceive a statistical test that could be applied to morphological paradigms in any systematic way cross-linguistically". I think he is completely right with this point.

J. Pystynen made a very good point with respect to my post on the typology of semantic change (September), mentioning that semantic change may, similar to sound change, also underlie dynamics resulting from the fact that the lexicon of a given language at a given time is a system whose parts are determined by their relation to each other.

David Marjanović criticized my use (in October) of the Indo-European laryngeals as an example to make clear that the abstractionalist-realist problem in the debate about sound change has an impact on what scholars actively reconstruct, and that they are often content to not further specify concrete sound values as long as they can be sure that there are distinctive values for a given phenomenon. His main point was that — in his opinion — the reconstruction of sound values for the Indo-European laryngeal is much clearer than I presented it in my post. I think that Marjanović was misunderstanding the point I wanted to make; and I also think that he is not right regarding the surety with which we can determine sound values for the laryngeal sounds.

As a last and very long comment from November, Alex(andre) François (I assume that it was him, but he only left his first name) provided excellent feedback on the last problem, which I had labelled the problem of establishing a typology of "semantic promiscuity". Alex argues that I overemphasized the role of semantics in the discussion, and that the phenomenon I described might better be labelled "lexical yield of roots". I think that he's right with this criticism, but I am not sure whether the term "lexical yield" is better than the notion of promiscuity. Given that we are searching for a counterpart of the mostly form-based term "productivity", which furthermore focuses on grammatical affixes, the term "promiscuity" focuses on the success of certain form-concept pairs at being recycled during the process of word formation. Alex is right that we are in fact talking about the root here, as a linguistic concept that is — unfortunately — not very strictly defined in linguistics. For the time being, I would propose either the term "root promiscuity" or "lexical promiscuity", but avoid the term "yield", since it sounds too static to me.

Advances on particular problems

Although the problems that I posted are personal, and I am keen to try tackling them in at least some way in the future, I have not yet managed to advance on any of them in particular.

I have experimented with new approaches to borrowing detection, which are not yet in a state where they could be published, but it helped myself to re-think the whole matter in detail. Parts of my ideas shared in this blog post also appeared, in a deeper discussion, in an article that was published this year (List 2019).

I played with the problem of morpheme detection, but none of the different approaches was really convincing enough so far. However, I am still convinced that we can do better than "meaning-less" NLP approaches (which try to infer morphology from dictionaries alone, ignoring any semantic information).

A peripheral thought on automated phonological reconstruction, focusing on the question of the evaluation of a set of automated reconstructions and a set of human-annotated gold standard data, has now been published (List 2019b) as a comment to a target study by Jäger (2019). While my proposal can solve cases where two reconstruction systems differ only by their segment-wise phonological information, I had to conclude my comment by admitting that there are cases where two sets of words in different languages are equivalent in their structure, but not identical. Formally, that means that structurally identical sets of segmented strings in linguistics can be converted from one set to the other with help of simple replacement rules, while structurally equivalent (I am still unsure, if the two terms are well chosen) sets of segmented strings may require additional context rules.

Although I tried to advance on most of the problems mentioned throughout the year, and I carried out quite a few experiments, most of the things that I tested were not conclusive. Before I discuss them in detail, I should make sure they actually work, or provide a larger study that emphasizes and explains why they do not work. At this stage, however, any sharing of information on the different experiments I ran would be premature, leading to confusion rather than to clarification.

Strategies for problem solving

Those of you who have followed my treatment of all the problems over the year will see that I tend to be very careful in delegating problem solutions to classical machine learning approaches. I do this because I am convinced that most of the problems that I mentioned and discussed can, in fact, be handled in a very concrete manner. When dealing with problems that one thinks can ultimately be solved by an algorithm, one should not start by developing a machine learning algorithm, but rather search for the algorithm that really solves the problem.

Nobody would develop a machine learning approach to replace an abacus, although this may in fact be possible. In the same way, I believe that the practice of historical linguistics has sufficiently shown that most of the problems can be solved with help of concrete methods, with the exception, perhaps, of phylogenetic reconstruction (see, for example, my graph-based solution for the sound correspondence pattern detection problem, presented in List 2019c). For this reason, I prefer to work on concrete solutions, avoiding probabilistic approaches or black-box methods, such as neural networks.

A language problem

Retrospect and outlook

In retrospect, I enjoyed the series a lot. It has the advantage of being easier to plan, as I knew in advance what I had to write about. It was, however, also tedious at times, since I knew I could not just talk about a seemingly simpler topic in my monthly post, but had to develop the problem and share all of my thoughts on it. In some situations, I had the impression that I failed, since I realized that there was not enough time to really think everything through. Here, the comments of colleagues were quite helpful.

Content-wise, the idea of looking at our field through the lens of unsolved problems turned out to be very useful. For quite a few of the problems, I have initial ideas (as I tried to indicate each time); and maybe there will be time in the next years to test them in concrete, and to potentially even cross off the one or the other problem from the big list.

Writing a series instead of a collection of unrelated posts turned out to have definite advantages. With my monthly goal of writing at least one contribution for the Genealogical World of Phylogenetic Networks, I never had the problem of thinking too hard of something that might be interesting for a broader readership. While this happened in the past, blog series have the disadvantage of not allowing for flexibility, when something interesting comes up, especially if one sticks to one post per month and reserves this post for the series.

In the next year, I am still considering to write another series, but maybe this time, I will handle it less strictly, allowing some room for surprise, since this is as well one of the major advantages of writing scientific blogs: one is never really be bound to follow beaten tracks.

But for now, I am happy that the year is over, since 2019 has been very busy for me in terms of work. Since this is the final post for the year, I would like to take the chance to thank all who read the posts, and specifically also all those who commented on them. But my greatest thanks go to David for being there, as always, reading my texts, correcting my errors in writing, and giving active feedback in the form of interesting and inspiring comments.


Jäger, Gerhard (2019) Computational historical linguistics. Theoretical Linguistics 45.3-4: 151-182.

List, Johann-Mattis (2019a) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

List, Johann-Mattis (2019b) Beyond Edit Distances: Comparing linguistic reconstruction systems. Theoretical Linguistics 45.3-4: 1-10.

List, Johann-Mattis (2019c) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 1.45: 137-161.

No comments:

Post a Comment