
Monday, November 25, 2019

Typology of semantic promiscuity (Open problems in computational diversity linguistics 10)


The final problem in my list of ten open problems in computational diversity linguistics touches upon a phenomenon that most linguists, let alone ordinary people, may not even have heard of. As a result, the phenomenon does not have a real name in linguistics, which makes it even more difficult to talk about.

Semantic promiscuity, in brief, refers to two empirical observations: (1) the words in the lexicon of human languages are often built from already existing words or word parts, and (2) the words that are frequently "recycled", i.e. the words that are promiscuous (similar to the sense of promiscuous domains in biology, see Basu et al. 2008), denote very common concepts.

If it turns out to be true that the meaning of words decides, at least to some degree, their success in giving rise to new words, then it should be possible to derive a typology of promiscuous concepts, or some kind of cross-linguistic ranking of those concepts that turn out to be the most successful in the long run.

Our problem can thus (at least for the moment, since we still have problems in fully grasping the phenomenon, as can be seen from the next section) be stated as follows:
Assuming a certain pre-selection of concepts that we assume are expressed by as many languages as possible, can we find out which of the concepts in the sample give rise to the largest amount of new words?
I am not completely happy with this problem definition, since a concept does not itself give rise to a new word; rather, a concept is expressed by a word, and that word is then used to form a new word. But I have decided to leave the problem in this form for the sake of simplicity.

Background on semantic promiscuity

The basic idea of semantic promiscuity goes back to my time as a PhD student in Düsseldorf. My supervisor then was Hans Geisler, a Romance linguist with a special interest in sound change and sensory-motor concepts. Sensory-motor concepts are concepts that are thought to be grounded in sensory-motor processes. Concretely, scholars assume that many abstract concepts expressed by many, if not all, languages of the world originate in concepts that denote concrete bodily experience (Ströbel 2016).

Thus, we can "grasp an idea", we can "face consequences", or we can "hold a thought". In such cases we express something that is abstract in nature, but expressed by means of verbs that are originally concrete in their meaning and relate to our bodily experience ("to grasp", "to face", "to hold").

When I later met Hans Geisler in 2016 in Düsseldorf, he presented me with an article that he had recently submitted for an anthology that appeared two years later (Geisler 2018). This article, titled "Sind unsere Wörter von Sinnen?" (an approximate translation of this pun would be: "Are our words out of their senses?"), investigates concepts such as "to stand" and "to fall" and their importance for the lexicon of the German language. Geisler claims that it is due to the importance of the sensory-motor concepts of "standing" and "falling" that words built from stehen ("to stand") and fallen ("to fall") are among the most productive (or promiscuous) ones in the German lexicon.

Words built from fallen and stehen in German.

I found (and still find) this idea fascinating, since it may explain (if it turns out to hold true for a larger sample of the world's languages) the structure of a language's lexicon as a consequence of universal experiences shared among all humans.

Geisler did not have a term for the phenomenon at hand. However, I was working at the same time in a lab with biologists (led by Eric Bapteste and Philippe Lopez), who introduced me to the idea of domain promiscuity in biology during an extended discussion of similar processes in linguistics and biology. In the paper reporting on this discussion, we proposed that the comparison of word-formation processes in linguistics and protein-assembly processes in biology could provide fruitful analogies for future investigations (List et al. 2016: 8ff). But we did not (yet) use the term promiscuity in the linguistic domain.

Geisler's idea, that the success of words in forming other words in the lexicon of a language may depend on the semantics of the original terms, changed my view on the topic completely, and I began to search for a good term to denote the phenomenon. I did not want to use the term "promiscuity", because of its original meaning.

Linguistics has the term "productive", which is used for particular morphemes that can easily be attached to existing words to form new ones (e.g. by turning a verb into a noun, or a noun into an adjective, etc.). However, "productivity" starts from the form and ignores the concepts, while concepts play a crucial role in Geisler's phenomenon.

At some point I gave up and began to use the term "promiscuity", for lack of a better term, first in a blog post discussing Geisler's paper (List 2018, available here). Later in 2018, Nathanael E. Schweikhard, a doctoral student in our research group, developed the idea further, using the term semantic promiscuity (Schweikhard 2018, available here), which constitutes my tenth and last open problem in computational diversity linguistics (at least for 2019).

In the very fruitful discussions with Schweikhard, we also learned that the idea of the expansion and attraction of concepts comes close to the idea of semantic promiscuity. This references Blank's (1997) idea that some concepts tend to frequently attract new words to express them (think of concepts subject to taboo, for simplicity), while other concepts tend to give rise to many new words ("head" is a good example, if you think of all the meanings it can have in different contexts). However, since Blank is interested in the form, while we are interested in the concept, I agree with Schweikhard in sticking with "promiscuity" instead of adopting Blank's term.

Why it is hard to establish a typology of semantic promiscuity

Assuming that certain cross-linguistic tendencies can be found that would confirm the hypothesis of semantic promiscuity, why is it hard to do so? I see three major obstacles here: one related to the data, one related to annotation, and one related to the comparison.

The data problem is a problem of sparseness. For most of the languages for which we have lexical data, the available data are so sparse that we often struggle even to find a list of 200 or more words. I know this well, since we struggled hard in a phylogenetic study of Sino-Tibetan languages, where we ended up discarding many interesting languages because the sources did not provide enough lexical data to fill our wordlists (Sagart et al. 2019).

In order to investigate semantic promiscuity, we need substantially more data than we need for phylogenetic studies, since we ultimately want to investigate the structure of word families inside a given language and compare these structures cross-linguistically. It is not clear where to start here, although it is clear that we cannot be exhaustive in linguistics, as biologists can be when sequencing a whole gene or genome. I think that one would need, at least, 1,000 words per language in order to be able to start looking into semantic promiscuity.

The second problem points to the annotation and the analysis that would be needed in order to investigate the phenomenon sufficiently. What Hans Geisler used in his study were larger dictionaries of German that are digitally available and readily annotated. However, for a cross-linguistic study of semantic promiscuity, all of the annotation work of word families would still have to be done from scratch.

Unfortunately, we have also seen that the algorithms for automated morpheme detection that have been proposed to date usually perform poorly when it comes to detecting morpheme boundaries. In addition, word families often have a complex structure, and the parts of words shared across other words are not necessarily identical, due to the numerous processes involved in word formation. So, a simple algorithm that splits words into potential morphemes would not be enough. Another algorithm that identifies language-internal cognate morphemes would be needed; and here, again, we are still waiting for convincing approaches to be developed by computational linguists.
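As a minimal illustration of why this matters, consider the following Python sketch (the German segmentations are invented for illustration). Building word families by exact morpheme matching links the stehen-words, but the ablaut variant stand- ends up in a separate family, which only language-internal cognate detection could merge:

```python
from collections import defaultdict

# Toy morpheme-segmented German words (segmentations are illustrative).
words = {
    "aufstehen": ["auf", "steh", "en"],   # "to get up"
    "verstehen": ["ver", "steh", "en"],   # "to understand"
    "Aufstand":  ["auf", "stand"],        # "uprising"
    "Verstand":  ["ver", "stand"],        # "intellect"
}

# Group words into families via identical shared morphemes.
families = defaultdict(set)
for word, morphemes in words.items():
    for morpheme in morphemes:
        families[morpheme].add(word)

# Exact matching keeps "steh" and its ablaut variant "stand" apart,
# although historically they are the same morpheme.
print(sorted(families["steh"]))   # ['aufstehen', 'verstehen']
print(sorted(families["stand"]))  # ['Aufstand', 'Verstand']
```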

The third problem, the comparison itself, reflects the difficulty of comparing word-family data across different languages. Since every language has its own structure of words and a very individual set of word families, it is not trivial to decide how one should compare annotated word-family data across multiple languages. While one could try to compare words with the same meaning in different languages, it is quite possible that one would miss many potentially interesting patterns, especially since we do not yet know whether, and how, promiscuity manifests itself across languages.

Traditional approaches

Apart from the work by Geisler (2018), mentioned above, we find some interesting studies on word formation and compounding in which scholars have addressed similar questions. Thus, Steve Pepper has (as far as I know) submitted his PhD thesis on The Typology and Semantics of Binominal Lexemes (Pepper 2019, draft here), in which he looks into the structure of words that are frequently constructed from two nominal parts, such as "windmill", "railway", etc. In her master's thesis, titled Body Part Metaphors as a Window to Cognition, Annika Tjuka investigates how terms for objects and landscapes are created with the help of terms originally denoting body parts (such as the "foot" of the table, etc.; see Tjuka 2019).

Both of these studies touch on the idea of semantic promiscuity, since they try to look at the lexicon from a concept-based perspective, as opposed to a pure form-based one, and they also try to look at patterns that might emerge when looking at more than one language alone. However, given their respective focus (Pepper looking at a specific type of compounds, Tjuka looking at body-part metaphors), they do not address the typology of semantic promiscuity in general, although they provide very interesting evidence showing that lexical semantics plays an important role in word formation.

Computational approaches

The only study that I know of that comes close to studying the idea of semantic promiscuity computationally is by Keller and Schultz (2014). In this study, the authors analyze the distribution of morpheme family sizes in English and German across a time span of 200 years. Using Birth-Death-Innovation Models (explained in more detail in the paper), they try to measure the dynamics underlying the process of word formation. Their general finding (at least for the English and German data analyzed) is that new words tend to be built from those word forms that appear less frequently across other words in a given language. If this holds true, it would mean that speakers of a given language tend to avoid words that are already too promiscuous as a basis for coining new words. What the study definitely shows is that any study of semantic promiscuity has to take competing explanations into account.
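The quantity underlying their analysis, morpheme family size, is simple to compute once words are morphologically segmented. Here is a hypothetical Python sketch (the English compounds are chosen for illustration; this is not Keller and Schultz's actual code or data):

```python
from collections import Counter

# Toy segmented lexicon: word -> constituent morphemes.
lexicon = {
    "railway":  ["rail", "way"],
    "railroad": ["rail", "road"],
    "driveway": ["drive", "way"],
    "waterway": ["water", "way"],
    "windmill": ["wind", "mill"],
}

# Morpheme family size = number of words in which a morpheme occurs.
family_size = Counter(m for morphemes in lexicon.values() for m in morphemes)

# Under Keller and Schultz's finding, a new coinage would tend to recycle
# a low-frequency morpheme such as "mill" rather than the already
# promiscuous "way".
print(family_size["way"], family_size["rail"], family_size["mill"])  # 3 2 1
```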

Initial ideas for improvement

If we accept that the corpus perspective cannot help us to dive deep into the semantics, since semantics cannot be automatically inferred from corpora (at least not yet to a degree that would allow us to compare them afterwards across a sufficient sample of languages), then we need to address the question in smaller steps.

For the time being, the idea that a large proportion of the words in the lexicon of human languages are recycled from words that originally express specific meanings remains a hypothesis (whatever those meanings may be, since the idea of sensory-motor concepts is just one suggestion for a potential candidate semantic field). There are enough alternative explanations that could drive the formation of new words, be it the frequency of recycled morphemes in a lexicon, as proposed by Keller and Schultz, or other factors that we still do not know, or that I do not know because I have not yet read the relevant literature.

As long as the idea remains a hypothesis, we should first try to find ways to test it. A starting point could consist of the collection of larger wordlists for the languages of the world (e.g. more than 300 words per language) which are already morphologically segmented. With such a corpus, one could easily create word families by checking which morphemes are re-used across words. By comparing the concepts that share a given morpheme, one could then check to what degree, for example, sensory-motor concepts form clusters with other concepts.
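The word-family step of such a test could be sketched as follows, assuming morpheme-segmented input (the German-style segmentations below are invented for illustration): invert the wordlist so that each morpheme collects the concepts it helps to express.

```python
from collections import defaultdict

# Toy wordlist: concept -> morpheme-segmented word form.
wordlist = {
    "UNDERSTAND": ["ver", "steh", "en"],
    "GET UP":     ["auf", "steh", "en"],
    "CONFESS":    ["ge", "steh", "en"],
    "GO":         ["geh", "en"],
}

# Each morpheme collects the concepts expressed by words containing it.
concepts_by_morpheme = defaultdict(set)
for concept, morphemes in wordlist.items():
    for morpheme in morphemes:
        concepts_by_morpheme[morpheme].add(concept)

# A large cluster around "steh" ("to stand") would count towards the
# promiscuity of a sensory-motor concept.
print(sorted(concepts_by_morpheme["steh"]))  # ['CONFESS', 'GET UP', 'UNDERSTAND']
```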

All in all, my idea is far from being concrete; but what seems clear is that we will need to work on larger datasets that offer word lists for a sufficiently large sample of languages in morpheme-segmented form.

Outlook

Whenever I try to think about the problem of semantic promiscuity, asking myself whether it is a real phenomenon or just a myth, and whether a typology in the form of a world-wide ranking is possible after all, I feel that my brain is starting to itch. It feels like there is something that I cannot really grasp (yet, hopefully), and something I haven't really understood.

If the readers of this post feel the same way afterwards, then there are two possible reasons why you might feel as I do: you could suffer from the same problem that I have whenever I try to get my head around semantics, or you could just have fallen victim to a largely incomprehensible blog post. I hope, of course, that none of you will suffer from anything; and I will be glad for any additional ideas that might help us to understand this matter more properly.

References

Basu, Malay Kumar and Carmel, Liran and Rogozin, Igor B. and Koonin, Eugene V. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Research 18: 449-461.

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen: Niemeyer.

Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter: Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen: Stauffenburg, pp. 131-142.

Keller, Daniela Barbara and Schultz, Jörg (2014) Word formation is aware of morpheme family size. PLoS ONE 9.4: e93978.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis (2018) Von Wortfamilien und promiskuitiven Wörtern [Of word families and promiscuous words]. Von Wörtern und Bäumen 2.10. URL: https://wub.hypotheses.org/464.

Pepper, Steve (2019) The Typology and Semantics of Binominal Lexemes: Noun-noun Compounds and their Functional Equivalents. University of Oslo: Oslo.

Sagart, Laurent and Jacques, Guillaume and Lai, Yunfan and Ryder, Robin and Thouzeau, Valentin and Greenhill, Simon J. and List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Science of the United States of America 116: 10317-10322. DOI: https://doi.org/10.1073/pnas.1817972116

Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11. URL: https://calc.hypotheses.org/1169.

Ströbel, Liane (2016) Introduction: Sensory-motor concepts: at the crossroad between language & cognition. In: Ströbel, Liane (ed.) Sensory-motor Concepts: at the Crossroad Between Language & Cognition. Düsseldorf University Press, pp. 11-16.

Tjuka, Annika (2019) Body Part Metaphors as a Window to Cognition: a Cross-linguistic Study of Object and Landscape Terms. Humboldt Universität zu Berlin: Berlin. DOI: https://doi.org/10.17613/j95n-c998.

Monday, September 30, 2019

Typology of semantic change (Open problems in computational diversity linguistics 8)


With this month's problem we are leaving the realm of modeling, which has been the basic aspect underlying the last three problems, discussed in June, July, and August, and entering the realm of typology, or general linguistics. The last three problems that I will discuss, in this and two follow-up posts, deal with the basic problem of making use of, or collecting, data that allow us to establish typologies, that is, to identify cross-linguistic tendencies for specific phenomena, such as semantic change (this post), sound change (October), or semantic promiscuity (November).

Cross-linguistic tendencies are understood here as tendencies that occur across all languages, independently of their specific phylogenetic affiliation, the place where they are spoken, or the time when they are spoken. Obviously, the uniformitarian requirement of independence of place and time is an idealization. As we know well, the capacity for language itself developed, potentially gradually, with the evolution of modern humans, and as a result it does not make sense to assume that the tendencies of semantic change or sound change have been the same through time. This has, in fact, been shown in recent research illustrating that there may be a certain relationship between our diet and the speech sounds that we use in our languages (Blasi et al. 2019).

Nevertheless, in the same way in which we simplify models in physics, as long as they yield good approximations of the phenomena we want to study, we can also assume a certain uniformity for language change. To guarantee this, we may have to restrict the time frame of language development that we want to discuss (e.g. the last 2,000 years), or the aspects of language we want to investigate (e.g. a certain selection of concepts that we know must have been expressed 5,000 years ago).

For the specific case of semantic change, the problem of establishing a typology of the phenomenon can thus be stated as follows:
Assuming a certain pre-selection of concepts that we assume were readily expressed in a given time frame, establish a general typology that informs about the universal tendencies by which a word expressing one concept changes its meaning, to later express another concept in the same language.
In theory, we could further relax the conditions of universality and add the restrictions on time and place later, after having aggregated the data. Maybe this would even be the best idea for a practical investigation; but given that the time frames in which we have attested data for semantic changes are rather limited, I do not believe that it would make much of a difference.

Why it is hard to establish a typology of semantic change

There are three reasons why it is hard to establish a typology of semantic change. First, there is the problem of acquiring the data needed to establish the typology. Second, there is the problem of handling the data efficiently. Third, there is the problem of interpreting the data in order to identify cross-linguistic, universal tendencies.

The problem of data acquisition results from the fact that we lack data on observed processes of semantic change. Since there are only a few languages with a continuous tradition of written records spanning 500 years or more, we will never be able to derive any universal tendencies from those languages alone, even if languages like Latin and its Romance descendants may provide a good starting point, as has been shown by Blank (1997).

Accepting the fact that processes attested only for Romance languages are never enough to fill the huge semantic space covered by the world's languages, the only alternative would be using inferred processes of semantic change — that is, processes that have been reconstructed and proposed in the literature. While it is straightforward to show that the meanings of cognate words in different languages can vary quite drastically, it is much more difficult to infer the direction underlying the change. Handling the direction, however, is important for any typology of semantic change, since the data from observed changes suggests that there are specific directional tendencies. Thus, when confronted with cognates such as selig "holy" in German and silly in English, it is much less obvious whether the change happened from "holy" to "silly" or from "silly" to "holy", or even from an unknown ancient concept to both "holy" and "silly".

As a result, we can conclude that any collection of data on semantic change needs to make crystal-clear which types of evidence the inference of semantic change processes is based upon. Citing only the literature on different language families is definitely not enough. This leads us to the second problem, the handling of data on semantic shifts. Here, we face the general problem of the elicitation of meanings. Elicitation refers to the process in fieldwork by which scholars use a questionnaire to ask their informants how certain meanings are expressed. The problem is that linguists have never tried to standardize which meanings they actually elicit. What they use, instead, are elicitation glosses, which they consider common enough for other linguists to understand which meaning they refer to. As a result, it is extremely difficult to search fieldwork notes, and even wordlists or dictionaries, for specific meanings, since every linguist uses their own style, often without further explanation.

Our Concepticon project (List et al. 2019, https://concepticon.clld.org) can be seen as a first attempt to handle elicitation glosses consistently. What we do is link those elicitation glosses that we find in questionnaires, dictionaries, and fieldwork notes to so-called concept sets, each of which reflects a given concept and receives a unique identifier and a short definition. It would go too far to dive deeper into the problem of concept handling here. Interested readers can have a look at a previous blog post I wrote on the topic (List 2018). In any case, any typology of semantic change will need to find a way to address the problem of handling the elicitation glosses in the literature, in one way or another.
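In its simplest form, this linking step is a curated lookup from variant elicitation glosses to one concept set identifier. The following Python sketch uses invented gloss-concept pairs; real Concepticon mappings are manually curated and far more sophisticated:

```python
from typing import Optional

# Invented mini-mapping from normalized elicitation glosses to concept sets.
GLOSS_TO_CONCEPT = {
    "head": "HEAD",
    "the head": "HEAD",
    "head (of body)": "HEAD",
    "hand": "HAND",
    "hand (palm)": "HAND",
}

def link_gloss(gloss: str) -> Optional[str]:
    """Return the concept set linked to an elicitation gloss, if known."""
    return GLOSS_TO_CONCEPT.get(gloss.strip().lower())

print(link_gloss("Head (of body)"))  # HEAD
print(link_gloss("noggin"))          # None (unknown glosses need curation)
```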

As a last problem, once we have assembled data showing semantic change processes across a sufficiently large sample of languages and concepts, there is the problem of analyzing the data themselves. While it seems obvious to identify cross-linguistic tendencies by looking for examples that occur in different language families and different parts of the world, it is not always easy to distinguish between the four major reasons for similarities among languages, namely: (1) coincidence, (2) universal tendencies, (3) inheritance, and (4) contact (List 2019). The only way to avoid being forced to use potentially unreliable statistics to squeeze the juice out of small datasets is to work with sufficiently broad coverage of data from as many language families and locations as possible. But given that there are no automated ways to infer directed semantic change processes across linguistic datasets, it is unlikely that a collection of data acquired from the literature alone will reach the critical mass needed for such an endeavor.

Traditional approaches

Apart from the above-mentioned work by Blank (1997), which is, unfortunately, rarely mentioned in the literature (potentially because it is written in German), there is an often-cited paper by Wilkins (1996), as well as preliminary work on directionality (Urban 2011). However, the attempt that addresses the problem most closely is the Database of Semantic Shifts (Zalizniak et al. 2012), which, according to the most recent information on the website, was established in 2002 and has been continuously updated since then.

The basic idea, as far as I understand the principle of the database, is to collect semantic shifts attested in the literature, and to note the type of evidence as well as the direction, where it is known. The resource is unique: nobody else has tried to establish a collection of semantic shifts attested in the literature, and it is therefore incredibly valuable. However, it also shows what problems we face when trying to establish a typology of semantic shifts.

Apart from the typical technical problems found in many projects shared on the web (no download access to all of the data underlying the website, no deposit of versions in public repositories, no versioning), the greatest problem of the project is that no apparent attempt was made to standardize the elicitation glosses. This became especially obvious when we tried to link an older version of the database, which is now no longer available, to our Concepticon project. In the end, I selected some 870 concepts from the database that were supported by multiple data points, but had to ignore more than 1,500 remaining elicitation glosses, since it was not possible to infer in reasonable time what the underlying concepts denote, not to speak of obvious cases where the same concept was denoted by slightly different elicitation glosses. As far as I can tell, this has not changed much with the most recent update of the database, which was published earlier this year.

Apart from the aforementioned problem of missing standardization of elicitation glosses, the database does not seem to annotate which type of evidence was used to establish a given semantic shift. An even more important problem, which is typical of almost all attempts to establish databases of change in the field of diversity linguistics, is that the database only shows what has changed, while nothing can be found on what has stayed the same. A true typology of change, however, must show what has not changed along with what has changed. As a result, any attempt to pick proposed changes from the literature alone will fail to offer a true typology, a collection of universal tendencies.

To be fair: the Database of Semantic Shifts by no means claims to do this. What it offers is a collection of semantic change phenomena discussed in the linguistic literature. This is itself an extremely valuable, and extremely tedious, enterprise. While I wish that the authors would open their data, version it, standardize the elicitation glosses, and host it on stable public archives, to avoid what happened in the past (people citing versions of the data that no longer exist) and to open the data to quantitative analyses, I deeply appreciate the attempt to approach the problem of semantic change from an empirical, data-driven perspective. To address the problem of establishing a typology of semantic shifts, however, I think that we need to start thinking beyond collecting what has been stated in the literature.

Computational approaches

As a first computational approach that comes somewhat close to a typology of semantic shifts, there is the Database of Cross-Linguistic Colexifications (List et al. 2018), which was originally launched in 2014 and received a major update in 2018 (see List et al. 2018b for details). This CLICS database, which I have mentioned several times in the past, does not show diachronic data, i.e. data on semantic change phenomena, but instead lists automatically detectable polysemies and homophonies (also called colexifications).

While the approach taken by the Database of Semantic Shifts is bottom-up, in the sense that the authors start from the literature and add those concepts that are discussed there, CLICS is top-down, as it starts from a list of concepts (reflected as standardized Concepticon concept sets) and then checks which languages express more than one concept by one and the same word form.

The advantages of the top-down approach are that much more data can be processed, and that one can easily derive a balanced sample in which the same concepts are compared across as many languages as possible. The disadvantage is that such a database will ignore certain concepts a priori if they do not occur in the data.
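The core detection step of such a top-down approach can be sketched in a few lines of Python (the two-language dataset below is a toy simplification of what CLICS aggregates from hundreds of wordlists):

```python
from collections import defaultdict

# Toy wordlists: language -> {concept: word form}.
data = {
    "Russian": {"TREE": "derevo", "WOOD": "derevo",
                "ARM": "ruka", "HAND": "ruka"},
    "English": {"TREE": "tree", "WOOD": "wood",
                "ARM": "arm", "HAND": "hand"},
}

# A colexification holds when one language expresses two (or more)
# concepts by one and the same word form.
colexifications = defaultdict(set)
for language, forms in data.items():
    concepts_by_form = defaultdict(set)
    for concept, form in forms.items():
        concepts_by_form[form].add(concept)
    for form, concepts in concepts_by_form.items():
        if len(concepts) > 1:
            colexifications[frozenset(concepts)].add(language)

# Russian colexifies TREE/WOOD and ARM/HAND; English keeps them apart.
for pair, languages in sorted(colexifications.items(), key=lambda x: sorted(x[0])):
    print(sorted(pair), "->", sorted(languages))
```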

Since CLICS lists synchronic patterns without further interpreting them, the database is potentially interesting for those who want to work on semantic change, but it does not help solve the problem of establishing a typology of semantic change itself. In order to achieve this, one would have to go through all attested polysemies in the database and investigate them, searching for potential hints on directions.

A potential way to infer directions for semantic shifts is presented by Dellert (2016), who applies causal inference techniques to polysemy networks to address this task. The problem, as far as I understand the techniques, is that the currently available polysemy databases barely offer the amount of information needed for these kinds of analyses. Furthermore, it would also be important to see how well the method actually performs in comparison to what we think we already know about the major patterns of semantic change.

Initial ideas for improvement

There does not seem to be a practical way to address our problem by means of computational solutions alone. What we need, instead, is a computer-assisted strategy that starts from a thorough investigation of the criteria that scholars use to infer directions of semantic change from linguistic data. Once these criteria are more or less settled, one would need to think of ways to operationalize them, in order to allow scholars to work with concrete etymological data, ideally comprising standardized wordlists for different language families, and to annotate them as closely as possible.

Ideally, scholars would propose larger etymological datasets in which they reconstruct whole language families, proposing semantic reconstructions for proto-forms. These would already contain the proposed directions of semantic change, and they would also automatically show where change does not happen. Since we currently lack automated workflows that fully account for this level of detail, one could start by applying methods for cognate detection across semantic slots (cross-semantic cognate detection), which would yield valuable data on semantic change processes without providing directions, and then add the directional information based on the principles that scholars use in their reconstruction methodology.

Outlook

Given the recent advances in detection of sound correspondence patterns, sequence comparison, and etymological annotation in the field of computational historical linguistics, it seems perfectly feasible to work on detailed etymological datasets of the languages of the world, in which all information required to derive a typology of semantic change is transparently available. The problem is, however, that it would still take a lot of time to actually analyze and annotate these data, and to find enough scholars who would agree to carry out linguistic reconstruction in a similar way, using transparent tools rather than convenient shortcuts.

References

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen: Niemeyer.

Blasi, Damián E. and Steven Moran and Scott R. Moisik and Paul Widmer and Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

List, Johann-Mattis and Simon Greenhill and Cormac Anderson and Thomas Mayer and Tiago Tresoldi and Robert Forkel (2018) CLICS: Database of Cross-Linguistic Colexifications. Version 2.0. Max Planck Institute for the Science of Human History. Jena: http://clics.clld.org/.

Johann Mattis List and Simon Greenhill and Christoph Rzymski and Nathanael Schweikhard and Robert Forkel (2019) Concepticon. A resource for the linking of concept lists (Version 2.1.0). Max Planck Institute for the Science of Human History. Jena: https://concepticon.clld.org/.

Dellert, Johannes and Buch, Armin (2016) Using computational criteria to extract large Swadesh Lists for lexicostatistics. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2018) Towards a history of concept list compilation in historical linguistics. History and Philosophy of the Language Sciences 5.10: 1-14.

List, Johann-Mattis (2019) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

Urban, Matthias (2011) Asymmetries in overt marking and directionality in semantic change. Journal of Historical Linguistics 1.1: 3-47.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, pp. 264-304.

Zalizniak, Anna A. and Bulakh, M. and Ganenkov, Dimitrij and Gruntov, Ilya and Maisak, Timur and Russo, Maxim (2012) The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics 50.3: 633-669.

Monday, June 24, 2019

Simulation of lexical change (Open problems in computational diversity linguistics 5)


The fifth problem in my list of open problems in computational diversity linguistics is devoted to the simulation of lexical change. In a broad sense, lexical change refers to the way in which the lexicon of a human language evolves over time. In a narrower sense, it can be reduced to the major processes that change the words of human languages.

Following Gévaudan (2007: 15-17), we can distinguish three different dimensions along which words can change, namely:
  • the semantic dimension — a given word can change its meaning,
  • the morphological dimension — new words are formed from old words by combining existing words or deriving new words with the help of affixes, and
  • the stratic dimension — languages may acquire words from their neighbors and thus contain strata of contact.
If we take these three dimensions as the basis of any linguistically meaningful system that simulates lexical change (and I would strongly argue that we should), the task of simulating lexical change can thus be worded as follows:
Create a model of lexical change that simulates how the lexicon of a given language changes over time. This model may be simplifying, but it should account for change along the major dimensions of lexical change, including morphological change, semantic change, and lexical borrowing.
Note that the focus on three dimensions along which a word can change deliberately excludes sound change (which I will treat as a separate problem in an upcoming blogpost). Excluding sound change is justified by the fact that, in the majority of cases, the process proceeds independently of semantic change, morphological change, and borrowing, while the latter three processes often interact.

There are, of course, cases where sound change may trigger the other three processes — for example, when sound change leads to homophonous words with contrary meanings in a language, which is usually resolved by using another word form for one of the concepts. An example of this process can be found in Chinese, where shǒu (in modern pronunciation) came to mean both "head" and "hand" (spelled as 首 and 手). Nowadays, shǒu remains in the meaning "head" only in expressions like shǒudū 首都 "capital", while tóu 头 is the regular word for "head".

Since the number of cases where we have sufficient evidence to infer that sound change triggered other changes is rather small, we will do better to ignore it when trying to design initial models of lexical change. Later models could, of course, combine sound change with lexical change in an overarching framework, but given how complex the modeling of lexical change already is with the three dimensions alone, it seems useful to put it aside for the moment and treat it as a separate problem.

Why simulating lexical change is hard

For historical linguists, it is obvious why it is hard to simulate lexical change in a computational model. The reason is that all three major processes of lexical change (semantic change, morphological change, and lexical borrowing) are themselves already hard to model and understand.

Morphological change is not only difficult to understand as a process, it is even difficult to infer; it is for this reason that we find morphological segmentation as the first example in my list of open problems. The same holds for lexical borrowing, which I discussed as the second example in my list of open problems. The problem of common pathways of semantic change will be discussed in a later post, devoted to the general typology of semantic change processes.

If each of the individual processes that constitute lexical change is itself either hard to model or to infer, it is no wonder that the simulation is also hard.

Traditional insights into the process of lexical change

Important work on lexical change goes back at least to the 1950s, when Morris Swadesh (1909-1967) proposed his theory of lexicostatistics and glottochronology (Swadesh 1952, 1955, Lees 1953). What was important in this context was not the idea that one could compute the divergence time of languages, but the data model which Swadesh introduced. This data model is represented by a word-list in which a particular list of concepts is translated into a particular range of languages. While former work on semantic change had been mostly semasiological — ie. form-based, taking the word as the basic unit and asking how it would change its meaning over time — the new model used concepts as the comparandum, investigating how word forms replaced each other in expressing specific concepts over time. This onomasiological or concept-based perspective has the great advantage of drastically facilitating the sampling of language data from different languages.

When comparing only specific word forms for cognacy, it is difficult to learn something about the dynamics of lexical change through time, since it is never clear how to sample those words that one wants to investigate more closely in a given study. With Swadesh's data model, the sampling process is reduced to the selection of concepts, regardless of whether one knows how many concepts one can find in a given sample of languages. Swadesh was by no means the first to propose this perspective, but he was the one who promulgated it.

Swadesh's data model does not directly measure lexical change, but instead measures its results, given that these surface in the distribution of cognate sets across lexicostatistical word-lists. While historical linguists had previously focused mostly on sound change processes, often ignoring morphological and semantic change, the lexicostatistical data model moved semantic change, lexical borrowing, and (to a lesser degree) also morphological change into the spotlight of linguistic endeavors. As an example, consider the following quote from Lees (1953), discussing the investigation of change in vocabulary under the label of morpheme decay:
The reasons for morpheme decay, ie. for change in vocabulary, have been classified by many authors; they include such processes as word tabu, phonemic confusion of etymologically distinct items close in meaning, change in material culture with loss of obsolete terms, rise of witty terms or slang, adoption of prestige forms from a superstratum language, and various gradual semantic shifts such as specialization, generalization, and pejoration. [Lees 1953: 114]
In addition to lexicostatistics and the discussions that arose from it (including those that criticized the method harshly), I consider the aforementioned model of three dimensions of lexical change by Gévaudan (2007) to be very useful in this context, since it constitutes one of the few attempts to approach the question of lexical change in a formal (or formalizable) way.

Computational approaches

Among the most frequently used models in the historical linguistics literature are those in which lexical change is modeled as a process of cognate gain and cognate loss. Modeling lexical change as a process of word gain and word loss, or root gain and root loss, is in fact straightforward. We know well that languages may cease to use certain words during their evolution, either because the things the words denote no longer exist (think of the word walkman, and then try to project the future of the word ipad), or because a specific word form is no longer used to denote a concept and therefore drops out of the language at some point (think of thorp, which meant something like "village", as a comparison with German Dorf "village" shows, but now survives only as a suffix in place names).

Since the gain-loss (or birth-death) model finds a direct counterpart in evolutionary biology, where genome evolution is often modeled as a process involving gain and loss of gene families (Cohen et al. 2008), it is also very easy to apply it to linguistics. The major work on the stochastic description of different gain-loss models has already been done, and we can find very stable software that helps us employ gain-loss models to reconstruct phylogenetic trees (Ronquist and Huelsenbeck 2003).
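To see how little machinery such a model needs, here is a minimal Python sketch of a gain-loss simulation along a single lineage; the function name and the rates are invented for the example and do not come from any of the software packages mentioned above:

```python
import random

def simulate_gain_loss(lexicon, steps, loss_rate=0.05, gain_rate=0.05, rng=None):
    """Simulate lexical change as a birth-death process on a bag of words.

    In each step, every root may be lost with probability `loss_rate`, and
    new roots enter with probability `gain_rate` per existing root (coinage,
    borrowing, ...). All rates are illustrative assumptions, not empirical
    estimates.
    """
    rng = rng or random.Random(42)
    lexicon = set(lexicon)
    counter = 0
    for _ in range(steps):
        # loss: each root independently drops out of use
        lexicon = {w for w in lexicon if rng.random() > loss_rate}
        # gain: new roots enter the language
        for _ in range(len(lexicon)):
            if rng.random() < gain_rate:
                lexicon.add(f"root-{counter}")
                counter += 1
    return lexicon

final = simulate_gain_loss({f"w{i}" for i in range(200)}, steps=50)
```

Running the simulation along the branches of a tree, with the chance of horizontal exchange between contemporary lineages, would already yield the kind of data that gain-loss phylogenetic methods take as input.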

It is therefore not surprising that gain-loss models are very popular in computational approaches to historical linguistics. Starting from pioneering work by Gray and Jordan (2000) and Gray and Atkinson (2003), they have now been used on many language families, including Austronesian (Gray et al. 2007), Australian languages (Bowern and Atkinson 2012), and most recently also Sino-Tibetan (Sagart et al. 2019). Although scholars (including myself) have expressed skepticism about their usefulness (List 2016), the gain-loss model can be seen as reflecting the quasi-standard of phylogenetic reconstruction in contemporary quantitative historical linguistics.

Despite their popularity for phylogenetic reconstructions, gain-loss models have been used only sporadically in simulation studies. The only attempts that I know of so far are one study by Greenhill et al. (2009), where the authors used the TraitLab software (Nicholls 2013) to simulate language change along with horizontal transfer events, and a study by Murawaki (2015), in which (if I understand the study correctly) a gain-loss model is used to model language contact.

Another approach is reflected in the more "classical" work on lexicostatistics, where lexical change is modeled as a process of lexical replacement within previously selected concept slots. I will call this model the concept-slot model. In this model (and potential variants of it), a language is not a bag of words whose contents change over time, but is more like a chest of drawers, in which each drawer represents a specific concept and the contents of a drawer represent the words that can be used to express that concept. In such a model, lexical change proceeds as a replacement process: a word within a given concept drawer is replaced by another word.

This model represents the classical way in which Morris Swadesh used to view the evolution of a given language. It is still present in the work of scholars working in the original framework of lexicostatistics (Starostin 2000), but it is used almost exclusively within distance-based frameworks, since a character-based account of the model would require a potentially large number of character states, which usually exceeds the number of character states allowed in the classical software packages for phylogenetic reconstruction.

Similar to the gain-loss model, there have not been many attempts to test the characteristics of this model in simulation studies. The only one known to me is a posthumously published letter from Sergei Starostin (1953-2005) to Murray Gell-Mann (Starostin 2007), in which he describes an attempt to account, in a computer simulation, for his theory that a word's replacement rate increases with the word's age (Starostin 2000).
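Starostin's hypothesis is also easy to sketch in code. The following toy simulation (the function name, rates, and the linear ageing term are my own illustrative assumptions, not Starostin's actual formula) tracks one word per concept slot and lets the replacement probability grow with the word's age:

```python
import random

def simulate_age_dependent_replacement(n_concepts, steps, base_rate=0.005, rng=None):
    """Concept-slot simulation in which a word's replacement probability
    increases with its age, in the spirit of Starostin's hypothesis.
    The linear ageing term is an illustrative assumption only."""
    rng = rng or random.Random(1)
    ages = [0] * n_concepts   # age of the current word in each concept slot
    replacements = 0
    for _ in range(steps):
        for i in range(n_concepts):
            # replacement probability grows linearly with the word's age
            if rng.random() < base_rate * (1 + ages[i] / 10):
                ages[i] = 0   # a new word fills the slot
                replacements += 1
            else:
                ages[i] += 1
    return replacements, ages

replaced, ages = simulate_age_dependent_replacement(100, steps=50)
```

Comparing the resulting replacement counts with those of a constant-rate variant would be one way to test how strongly the age-dependence assumption affects glottochronological date estimates.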

Problems with current models of lexical change

Neither the gain-loss model nor the concept-slot model seems to be misleading when it comes to describing the process of lexical change. However, both obviously ignore specific and crucial aspects of lexical change that (according to the task stated above) any ambitious simulation of lexical change should try to account for. The gain-loss model, for example, deliberately ignores semantic change and morphological change. It can account for borrowings, which can easily be included in a simulation by allowing contemporary languages to exchange words with each other, but it cannot tell us (since it ignores the meaning of word forms) how the meaning of words changes over time, or how word forms change their shape due to morphological change.

The concept-slot model can, in theory, account for semantic change, but only as far as the concept-slots allow: the number of concepts in this model is fixed and one usually does not assume that it would change. Furthermore, while borrowing can be included in this model, the model does not handle morphological change processes.

In phylogenetic approaches, both models also have clear disadvantages. The main problem of the gain-loss model is the sampling procedure. Since one cannot sample all words of a language, scholars usually derive the cognate sets they use to reconstruct phylogenies from cognate-coded lexicostatistical word-lists. As I have tried to show earlier, in List (2016), this sampling procedure can lead to problems when homology is defined in a loose way. The problem of the concept-slot model is that it cannot easily be applied in phylogenetic inference based on likelihood models (like maximum likelihood or Bayesian inference), since the only straightforward way to handle it would be with multi-state models, which are generally difficult to handle.

Initial ideas for improvement

For the moment, I have no direct idea of how to model morphological change, and more research will be needed before we are able to handle this in models of lexical change. The inability of the gain-loss and the concept-slot models to account for semantic change, however, can be overcome by turning to bipartite graph models of lexical change (see Newman 2010: 32f for details on bipartite graphs). In such a model, the lexicon of a human language is represented by a bipartite graph consisting of concepts as one type of node and word forms as another type of node. The association strength of a given word node and a given concept node (or its "reference potential", see List 2014: 21f), ie. the likelihood of a word being used by a speaker to denote a given concept, can be modeled with the help of weighted edges. This model naturally accounts for synonymy (if a meaning can be expressed by multiple words) and polysemy (if a word can express multiple meanings). Lexical change in such a model would consist of the re-arrangement of the weights in the network. Word loss and word gain would occur when a new word node is introduced into the network or an existing node becomes dissociated from all of the concepts.
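Such a bipartite model can be written down in a few lines. In the following sketch (all names and numbers are invented for illustration), edge weights are stored in a dictionary, one step of lexical change perturbs the weights, and edges whose weight drops to zero disappear:

```python
import random

def perturb(weights, amount=0.1, rng=None):
    """One step of lexical change in a bipartite form-concept graph.

    `weights` maps (word, concept) pairs to association strengths. A step
    slightly re-arranges the weights; edges that drop to zero disappear, and
    a word dissociated from all concepts is thereby lost from the language.
    """
    rng = rng or random.Random(3)
    new = {}
    for (word, concept), w in weights.items():
        w = max(0.0, w + rng.uniform(-amount, amount))
        if w > 0:
            new[(word, concept)] = w
    return new

# toy lexicon: polysemy (shou expresses two concepts) and
# synonymy (two words compete for the concept "head")
weights = {
    ("shou", "head"): 0.2, ("shou", "hand"): 0.9,
    ("tou", "head"): 0.8,
}
weights = perturb(weights)
```

A more realistic variant would make the perturbation depend on usage frequency or on the weights of competing edges, but even this random-walk version shows how synonymy, polysemy, word loss, and word gain all fall out of a single mechanism.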


Sankoff's (1969) bipartite model of the lexicon of human languages

We can find this idea of a bipartite modeling of a language's lexicon in the early linguistic work of Sankoff (1969: 28-53), as reflected in the figure above, taken from his dissertation (Figure 5, p. 36). Similarly, Smith (2004) used bipartite form-concept networks (which he describes as a matrix) to test the mechanisms by which vocabularies are transmitted, from the perspective of different theories of cultural evolution.

As I have never actively tried to review the large literature devoted to simulation studies in historical linguistics, biology, and cultural evolution, it is quite possible that this blogpost lacks references to important studies devoted to the problem. Despite this possibility, we can clearly say that we lack simulation studies in historical linguistics. I am furthermore convinced that the problem of handling lexical change in simulation studies is a difficult one, and that we may well have to acquire more knowledge of the key processes involved in lexical change before we can address it sufficiently in the future.

While I understand the popularity of gain-loss models in recent work on phylogenetic reconstruction in historical linguistics, I hope that it will be possible to develop more realistic models in the future. It is quite possible that such studies will confirm the superiority of gain-loss models over alternative approaches. But instead of assuming this axiomatically, as we seem to be doing for the time being, I would rather see some proof of this in simulation studies, or in studies where the data fed to the gain-loss algorithms are sampled differently.

References

Bowern, Claire and Atkinson, Quentin D. (2012) Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88: 817-845.

Cohen, Ofir and Rubinstein, Nimrod D. and Stern, Adi and Gophna, Uri and Pupko, Tal (2008) A likelihood framework to analyse phyletic patterns. Philosophical Transactions of the Royal Society B 363: 3903-3911.

Gévaudan, Paul (2007) Typologie des lexikalischen Wandels. Bedeutungswandel, Wortbildung und Entlehnung am Beispiel der romanischen Sprachen. Tübingen: Stauffenburg.

Gray, Russell D. and Jordan, Fiona M. (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405: 1052-1055.

Gray, Russell D. and Atkinson, Quentin D. (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray, Russell D. and Greenhill, Simon J. and Ross, Malcolm D. (2007) The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2: 360-375.

Greenhill, S. J. and Currie, T. E. and Gray, R. D. (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London, Series B 276: 2299-2306.

Lees, Robert B. (1953) The basis of glottochronology. Language 29: 113-127.

List, Johann-Mattis (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.

Murawaki, Yugo (2015) Spatial structure of evolutionary models of dialects in contact. PLoS One 10: e0134335.

Newman, M. E. J. (2010) Networks: An Introduction. Oxford: Oxford University Press.

Nicholls, Geoff K. and Ryder, Robin J. and Welch, David (2013) TraitLab: A MatLab package for fitting and simulating binary tree-like data.

Ronquist, Fredrik and Huelsenbeck, J. P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572-1574.

Sagart, Laurent and Jacques, Guillaume and Lai, Yunfan and Ryder, Robin and Thouzeau, Valentin and Greenhill, Simon J. and List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America 116: 10317-10322. DOI: 10.1073/pnas.1817972116

Sankoff, David (1969) Historical Linguistics as Stochastic Process. McGill University: Montreal.

Smith, Kenny (2004) The evolution of vocabulary. Journal of Theoretical Biology 228: 127-142.

Starostin, Sergej Anatolévič (2000) Comparative-historical linguistics and lexicostatistics. In: Renfrew, Colin, McMahon, April, Trask, Larry (eds.) Time Depth in Historical Linguistics: 1. Cambridge: McDonald Institute for Archaeological Research, pp. 223-265.

Starostin, Sergej A. (2007) Computer-based simulation of the glottochronological process (Letter to M. Gell-Mann). In: S. A. Starostin: Trudy po yazykoznaniyu [S. A. Starostin: Works in Linguistics]. LRC Publishing House, pp. 854-861.

Swadesh, Morris (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96: 452-463.

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.