Showing posts with label Semantic change. Show all posts

Monday, November 25, 2019

Typology of semantic promiscuity (Open problems in computational diversity linguistics 10)


The final problem in my list of ten open problems in computational diversity linguistics touches upon a phenomenon that most linguists, let alone ordinary people, may not even have heard of. As a result, the phenomenon does not have a proper name in linguistics, and this makes it even more difficult to talk about.

Semantic promiscuity, in brief, refers to two empirical observations: (1) the words in the lexicon of human languages are often built from already existing words or word parts, and (2) the words that are frequently "recycled", i.e. the words that are promiscuous (similar to the sense of promiscuous domains in biology, see Basu et al. 2008), denote very common concepts.

If it turns out to be true that the meaning of words decides, at least to some degree, their success in giving rise to new words, then it should be possible to derive a typology of promiscuous concepts, or some kind of cross-linguistic ranking of those concepts that turn out to be the most successful in the long run.

Our problem can thus (at least for the moment, since we still struggle to fully grasp the phenomenon, as can be seen from the next section) be stated as follows:
Assuming a certain pre-selection of concepts that we assume are expressed by as many languages as possible, can we find out which of the concepts in the sample give rise to the largest number of new words?
I am not completely happy with this problem definition, since a concept does not actually give rise to a new word; rather, a concept is expressed by a word that is then used to form a new word. But I have decided to leave the problem in this form for reasons of simplicity.

Background on semantic promiscuity

The basic idea of semantic promiscuity goes back to my time as a PhD student in Düsseldorf. My supervisor then was Hans Geisler, a Romance linguist with a special interest in sound change and sensory-motor concepts. Sensory-motor concepts are concepts that are thought to be grounded in sensory-motor processes. Concretely, scholars assume that many abstract concepts expressed by many, if not all, languages of the world originate in concepts that denote concrete bodily experience (Ströbel 2016).

Thus, we can "grasp an idea", we can "face consequences", or we can "hold a thought". In such cases we express something that is abstract in nature by means of verbs that are originally concrete in their meaning and relate to our bodily experience ("to grasp", "to face", "to hold").

When I later met Hans Geisler in 2016 in Düsseldorf, he presented me with an article that he had recently submitted for an anthology that appeared two years later (Geisler 2018). This article, titled "Sind unsere Wörter von Sinnen?" (an approximate translation of this pun would be: "Are our words out of their senses?"), investigates concepts such as "to stand" and "to fall" and their importance for the lexicon of the German language. Geisler claims that it is due to the importance of the sensory-motor concepts of "standing" and "falling" that words built from stehen ("to stand") and fallen ("to fall") are among the most productive (or promiscuous) ones in the German lexicon.

Words built from fallen and stehen in German.

I found (and still find) this idea fascinating, since it may explain (if it turns out to hold true for a larger sample of the world's languages) the structure of a language's lexicon as a consequence of universal experiences shared among all humans.

Geisler did not have a term for the phenomenon at hand. However, I was working at the same time in a lab with biologists (led by Eric Bapteste and Philippe Lopez), who introduced me to the idea of domain promiscuity in biology, during a longer discussion about similar processes between linguistics and biology. In our paper reporting our discussion of these similarities, we proposed that the comparison of word formation processes in linguistics and protein assembly processes in biology could provide fruitful analogies for future investigations (List et al. 2016: 8ff). But we did not (yet) use the term promiscuity in the linguistic domain.

Geisler's idea that the success of words in forming other words in the lexicon of a language may depend on the semantics of the original terms changed my view on the topic completely, and I began to search for a good term to denote the phenomenon. I did not want to use the term "promiscuity", because of its original meaning.

Linguistics has the term "productive", which is used for particular morphemes that can easily be attached to existing words to form new ones (e.g. by turning a verb into a noun, or a noun into an adjective). However, "productivity" starts from the form and ignores the concepts, while concepts play a crucial role in Geisler's phenomenon.

At some point, I gave up and began to use the term "promiscuity" for lack of a better term, first in a blog post discussing Geisler's paper (List 2018, available here). Later in 2018, Nathanael E. Schweikhard, a doctoral student in our research group, developed the idea further, using the term semantic promiscuity (Schweikhard 2018, available here); it is this phenomenon that constitutes my tenth and last open problem in computational diversity linguistics (at least for 2019).

In the very fruitful discussions with Schweikhard, we also learned that the idea of expansion and attraction of concepts comes close to the idea of semantic promiscuity. This refers to Blank's (2003) idea that some concepts tend to frequently attract new words to express them (think of concepts subject to taboo, for simplicity), while other concepts tend to give rise to many new words ("head" is a good example, if you think of all the meanings it can have in different contexts). However, since Blank is interested in the form, while we are interested in the concept, I agree with Schweikhard in sticking with "promiscuity" instead of adopting Blank's term.

Why it is hard to establish a typology of semantic promiscuity

Assuming that certain cross-linguistic tendencies can be found that would confirm the hypothesis of semantic promiscuity, why is it hard to do so? I see three major obstacles here: one related to the data, one related to annotation, and one related to the comparison.

The data problem is one of sparseness. For most of the languages for which we have lexical data, the available data are so sparse that we often have trouble even compiling a list of 200 or more words. I know this well, since we struggled hard in a phylogenetic study of Sino-Tibetan languages, where we ended up discarding many interesting languages because the sources did not provide enough lexical data to fill our wordlists (Sagart et al. 2019).

In order to investigate semantic promiscuity, we need substantially more data than we need for phylogenetic studies, since we ultimately want to investigate the structure of word families inside a given language and compare these structures cross-linguistically. It is not clear where to start here, although it is clear that we cannot be exhaustive in linguistics, as biologists can be when sequencing a whole gene or genome. I think that one would need at least 1,000 words per language in order to be able to start looking into semantic promiscuity.

The second problem points to the annotation and the analysis that would be needed in order to investigate the phenomenon sufficiently. What Hans Geisler used in his study were larger dictionaries of German that are digitally available and readily annotated. However, for a cross-linguistic study of semantic promiscuity, all of the annotation work of word families would still have to be done from scratch.

Unfortunately, we have also seen that the algorithms for automated morpheme detection proposed to date usually fail badly when it comes to detecting morpheme boundaries. In addition, word families often have a complex structure, and the parts of words shared across other words are not necessarily identical, due to the numerous processes involved in word formation. So, a simple algorithm that splits words into potential morphemes would not be enough. Another algorithm, one that identifies language-internal cognate morphemes, would be needed; and here, again, we are still waiting for convincing approaches to be developed by computational linguists.

The third problem is the comparison itself: how to compare word-family data across different languages. Since every language has its own structure of words and a very individual set of word families, it is not trivial to decide how one should compare annotated word-family data across multiple languages. While one could try to compare words with the same meaning in different languages, it is quite possible that one would miss many potentially interesting patterns, especially since we do not yet know how (and if at all) the idea of promiscuity surfaces across languages.

Traditional approaches

Apart from the work by Geisler (2018), mentioned above, we find some interesting studies on word formation and compounding in which scholars have addressed similar questions. Thus, Steve Pepper has (as far as I know) submitted his PhD thesis on The Typology and Semantics of Binominal Lexemes (Pepper 2019, draft here), in which he looks into the structure of words that are frequently constructed from two nominal parts, such as "windmill", "railway", etc. In her master's thesis, titled Body Part Metaphors as a Window to Cognition, Annika Tjuka investigates how terms for objects and landscapes are created with the help of terms originally denoting body parts (such as the "foot" of the table, etc., see Tjuka 2019).

Both of these studies touch on the idea of semantic promiscuity, since they try to look at the lexicon from a concept-based perspective, as opposed to a pure form-based one, and they also try to look at patterns that might emerge when looking at more than one language alone. However, given their respective focus (Pepper looking at a specific type of compounds, Tjuka looking at body-part metaphors), they do not address the typology of semantic promiscuity in general, although they provide very interesting evidence showing that lexical semantics plays an important role in word formation.

Computational approaches

The only study that I know of that comes close to studying the idea of semantic promiscuity computationally is by Keller and Schultz (2014). In this study, the authors analyze the distribution of morpheme family sizes in English and German across a time span of 200 years. Using Birth-Death-Innovation Models (explained in more detail in the paper), they try to measure the dynamics underlying the process of word formation. Their general finding (at least for the English and German data analyzed) is that new words tend to be built from those word forms that appear less frequently across other words in a given language. If this holds true, it would mean that speakers of a given language tend to avoid words that are already too promiscuous as a basis for coining new words. What the study definitely shows is that any study of semantic promiscuity has to consider competing explanations.

Initial ideas for improvement

If we accept that the corpus perspective cannot help us to dive deep into the semantics, since semantics cannot be automatically inferred from corpora (at least not yet to a degree that would allow us to compare them afterwards across a sufficient sample of languages), then we need to address the question in smaller steps.

For the time being, the idea that a large proportion of the words in the lexicon of human languages are recycled from words that originally express specific meanings remains a hypothesis (whatever those meanings may be, since the idea of sensory-motor concepts is just one suggestion for a potential candidate for a semantic field). There are enough alternative explanations that could drive the formation of new words, be it the frequency of recycled morphemes in a lexicon, as proposed by Keller and Schultz, or other factors that we still do not know, or that I do not know, because I have not yet read the relevant literature.

As long as the idea remains a hypothesis, we should first try to find ways to test it. A starting point could consist of the collection of larger wordlists for the languages of the world (e.g. more than 300 words per language) which are already morphologically segmented. With such a corpus, one could easily create word families, by checking which morphemes are re-used across words. By comparing the concepts that share a given morpheme, one could try to check to which degree, for example, sensory-motor concepts form clusters with other concepts.
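To make this a little more concrete, the word-family step could be sketched in a few lines of Python. The segmented wordlist below is a toy invention (loosely modeled on Geisler's stehen/fallen examples), not real data; the idea is simply to invert the wordlist so that each morpheme points to the concepts whose words contain it.

```python
from collections import defaultdict

# Toy, hand-segmented wordlist (concept -> morphemes of the word expressing
# it); invented for illustration, loosely modeled on German word families.
wordlist = {
    "to stand":      ["steh"],
    "to understand": ["ver", "steh"],
    "to confess":    ["ge", "steh"],
    "to fall":       ["fall"],
    "incident":      ["vor", "fall"],
    "waterfall":     ["wasser", "fall"],
    "water":         ["wasser"],
}

# Invert the wordlist: each morpheme points to the concepts whose words
# contain it; morphemes shared by several words define a word family.
families = defaultdict(set)
for concept, morphemes in wordlist.items():
    for morpheme in morphemes:
        families[morpheme].add(concept)

# Rank morphemes by how many words they recur in (their "promiscuity"),
# and list only those shared by more than one word.
for morpheme, concepts in sorted(families.items(), key=lambda kv: -len(kv[1])):
    if len(concepts) > 1:
        print(morpheme, sorted(concepts))
```

On real data, the hard part is of course everything this sketch assumes away: obtaining reliable morpheme segmentations, and recognizing that non-identical morphs (such as steh- and stand-) belong to the same family.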

All in all, my idea is far from being concrete; but what seems clear is that we will need to work on larger datasets that offer word lists for a sufficiently large sample of languages in morpheme-segmented form.

Outlook

Whenever I try to think about the problem of semantic promiscuity, asking myself whether it is a real phenomenon or just a myth, and whether a typology in the form of a world-wide ranking is possible after all, I feel that my brain is starting to itch. It feels like there is something that I cannot really grasp (yet, hopefully), and something I haven't really understood.

If the readers of this post feel the same way afterwards, then there are two possibilities as to why you might feel as I do: you could suffer from the same problem that I have whenever I try to get my head around semantics, or you could just have fallen victim to a largely incomprehensible blog post. I hope, of course, that none of you will suffer from anything; and I will be glad for any additional ideas that might help us to understand this matter more properly.

References

Basu, Malay Kumar and Carmel, Liran and Rogozin, Igor B. and Koonin, Eugene V. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Research 18: 449-461.

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen:Niemeyer.

Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter: Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen:Stauffenburg. 131-142.

Keller, Daniela Barbara and Schultz, Jörg (2014) Word formation is aware of morpheme family size. PLoS ONE 9.4: e93978.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis (2018) Von Wortfamilien und promiskuitiven Wörtern [Of word families and promiscuous words]. Von Wörtern und Bäumen 2.10. URL: https://wub.hypotheses.org/464.

Pepper, Steve (2019) The Typology and Semantics of Binominal Lexemes: Noun-noun Compounds and their Functional Equivalents. University of Oslo: Oslo.

Sagart, Laurent and Jacques, Guillaume and Lai, Yunfan and Ryder, Robin and Thouzeau, Valentin and Greenhill, Simon J. and List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Science of the United States of America 116: 10317-10322. DOI: https://doi.org/10.1073/pnas.1817972116

Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11. URL: https://calc.hypotheses.org/1169.

Ströbel, Liane (2016) Introduction: Sensory-motor concepts: at the crossroad between language & cognition. In: Ströbel, Liane (ed.) Sensory-motor Concepts: at the Crossroad Between Language & Cognition. Düsseldorf University Press, pp. 11-16.

Tjuka, Annika (2019) Body Part Metaphors as a Window to Cognition: a Cross-linguistic Study of Object and Landscape Terms. Humboldt Universität zu Berlin: Berlin. DOI: https://doi.org/10.17613/j95n-c998.

Monday, September 30, 2019

Typology of semantic change (Open problems in computational diversity linguistics 8)


With this month's problem we are leaving the realm of modeling, which has been the basic aspect underlying the last three problems, discussed in June, July, and August, and entering the realm of typology, or general linguistics. The last three problems that I will discuss, in this and two follow-up posts, deal with the basic problem of making use of, or collecting, data that allow us to establish typologies, that is, to identify cross-linguistic tendencies for specific phenomena, such as semantic change (this post), sound change (October), or semantic promiscuity (November).

Cross-linguistic tendencies are here understood as tendencies that occur across all languages, independently of their specific phylogenetic affiliation, the place where they are spoken, or the time when they are spoken. Obviously, the uniformitarian requirement of independence of place and time is an idealization. As we know well, the capacity for language itself developed, potentially gradually, with the evolution of modern humans, and as a result it does not make sense to assume that the tendencies of semantic change or sound change were the same through time. This has, in fact, been shown in recent research illustrating that there may be a certain relationship between our diet and the speech sounds employed in our languages (Blasi et al. 2019).

Nevertheless, in the same way in which we simplify models in physics, as long as they yield good approximations of the phenomena we want to study, we can also assume a certain uniformity for language change. To guarantee this, we may have to restrict the time frame of language development that we want to discuss (e.g. the last 2,000 years), or the aspects of language we want to investigate (e.g. a certain selection of concepts that we know must have been expressed 5,000 years ago).

For the specific case of semantic change, the problem of establishing a typology of the phenomenon can thus be stated as follows:
Assuming a certain pre-selection of concepts that we assume were readily expressed in a given time frame, establish a general typology that informs about the universal tendencies by which a word expressing one concept changes its meaning, to later express another concept in the same language.
In theory, we can further relax the conditions of universality and add the restrictions on time and place later, after having aggregated the data. Maybe this would even be the best approach for a practical investigation; but given that the time frames for which we have attested data on semantic changes are rather limited, I do not believe that it would make much of a difference.

Why it is hard to establish a typology of semantic change

There are three reasons why it is hard to establish a typology of semantic change. First, there is the problem of acquiring the data needed to establish the typology. Second, there is the problem of handling the data efficiently. Third, there is the problem of interpreting the data in order to identify cross-linguistic, universal tendencies.

The problem of data acquisition results from the fact that we lack data on observed processes of semantic change. Since there are only a few languages with a continuous tradition of written records spanning 500 years or more, we will never be able to derive any universal tendencies from those languages alone, even if languages like Latin and its Romance descendants may be a good starting point, as has been shown by Blank (1997).

Accepting the fact that processes attested only for Romance languages are never enough to fill the huge semantic space covered by the world's languages, the only alternative would be using inferred processes of semantic change — that is, processes that have been reconstructed and proposed in the literature. While it is straightforward to show that the meanings of cognate words in different languages can vary quite drastically, it is much more difficult to infer the direction underlying the change. Handling the direction, however, is important for any typology of semantic change, since the data from observed changes suggests that there are specific directional tendencies. Thus, when confronted with cognates such as selig "holy" in German and silly in English, it is much less obvious whether the change happened from "holy" to "silly" or from "silly" to "holy", or even from an unknown ancient concept to both "holy" and "silly".

As a result, we can conclude that any collection of data on semantic change needs to make crystal-clear upon which types of evidence the inference of semantic change processes is based. Citing only the literature on different language families is definitely not enough. This leads to the second problem, the handling of data on semantic shifts. Here, we face the general problem of the elicitation of meanings. Elicitation refers to the process in fieldwork by which scholars use a questionnaire to ask their informants how certain meanings are expressed. The problem here is that linguists have never tried to standardize which meanings they actually elicit. What they use, instead, are elicitation glosses, which they think are common enough to allow other linguists to understand which meaning they refer to. As a result, it is extremely difficult to search fieldwork notes, and even wordlists or dictionaries, for specific meanings, since every linguist uses their own style, often without further explanation.

Our Concepticon project (List et al. 2019, https://concepticon.clld.org) can be seen as a first attempt to handle elicitation glosses consistently. What we do is link those elicitation glosses that we find in questionnaires, dictionaries, and fieldwork notes to so-called concept sets, each of which reflects a given concept and receives a unique identifier and a short definition. It would go too far to dive deeper into the problem of concept handling here. Interested readers can have a look at a previous blog post I wrote on the topic (List 2018). In any case, any typology of semantic change will need to find a way to address the problem of handling the elicitation glosses in the literature, in one way or another.
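To illustrate the principle (this is not the actual Concepticon software or its API), here is a minimal sketch of what linking heterogeneous elicitation glosses to standardized concept sets amounts to; the glosses, identifiers, and definitions below are invented for the example.

```python
# Invented concept sets: a unique identifier plus a short definition.
concept_sets = {
    "HAND": "the terminal, prehensile part of the arm",
    "HEAD": "the uppermost part of the human body",
}

# Different sources use different elicitation glosses for the same concept;
# linking them to one concept set makes the sources comparable.
gloss_to_concept = {
    "hand": "HAND",
    "the hand": "HAND",
    "hand (body part)": "HAND",
    "head": "HEAD",
    "head of person": "HEAD",
}

def link(gloss):
    """Map a raw elicitation gloss to a concept set, if known."""
    return gloss_to_concept.get(gloss.strip().lower())

print(link("Hand (body part)"))  # -> HAND
print(link("skull"))             # -> None: unlinked gloss, needs manual review
```

The real difficulty, as described above, is of course not the lookup but building and curating the mapping itself, gloss by gloss, across thousands of sources.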

As a last problem, once we have assembled data showing semantic change processes across a sufficiently large sample of languages and concepts, there is the problem of analyzing the data themselves. While it seems obvious to identify cross-linguistic tendencies by looking for examples that occur in different language families and different parts of the world, it is not always easy to distinguish between the four major reasons for similarities among languages, namely: (1) coincidence, (2) universal tendencies, (3) inheritance, and (4) contact (List 2019). The only way to avoid being forced to rely on potentially unreliable statistics that squeeze the juice out of small datasets is to work with a sufficiently large coverage of data from as many language families and locations as possible. But given that there are no automated ways to infer directed semantic change processes across linguistic datasets, it is unlikely that a collection of data acquired from the literature alone will reach the critical mass needed for such an endeavor.

Traditional approaches

Apart from the above-mentioned work by Blank (1997), which is, unfortunately, rarely mentioned in the literature (potentially because it is written in German), there is an often-cited paper by Wilkins (1996), and preliminary work on directionality (Urban 2011). However, the attempt that addresses the problem most closely is the Database of Semantic Shifts (Zalizniak et al. 2012), which, according to the most recent information on the website, was established in 2002 and has been continuously updated since then.

The basic idea, as far as I understand the principle of the database, is to collect semantic shifts attested in the literature, and to note the type of evidence, as well as the direction, where it is known. The resource is unique, as nobody else has tried to establish a collection of semantic shifts attested in the literature, and it is therefore incredibly valuable. It also shows, however, what problems we face when trying to establish a typology of semantic shifts.

Apart from the typical technical problems found in many projects shared on the web (missing download access to all of the data underlying the website, missing deposits of versions in public repositories, missing versioning), the greatest problem of the project is that no apparent attempt was made to standardize the elicitation glosses. This became especially obvious when we tried to link an older version of the database, which is now no longer available, to our Concepticon project. In the end, I selected some 870 concepts from the database that were supported by more datapoints, but had to ignore the more than 1,500 remaining elicitation glosses, since it was not possible to infer in reasonable time what the underlying concepts denote, not to speak of obvious cases where the same concept was denoted by slightly different elicitation glosses. As far as I can tell, this has not changed much with the most recent update of the database, which was published earlier this year.

Apart from the aforementioned problem of missing standardization of elicitation glosses, the database does not seem to annotate which type of evidence was used to establish a given semantic shift. An even more important problem, which is typical of almost all attempts to establish databases of change in the field of diversity linguistics, is that the database only shows what has changed, while nothing can be found on what has stayed the same. A true typology of change, however, must show what has not changed along with what has changed. As a result, any attempt to pick proposed changes from the literature alone will fail to offer a true typology, a collection of universal tendencies.

To be fair: the Database of Semantic Shifts by no means claims to do this. What it offers is a collection of semantic change phenomena discussed in the linguistic literature. This is in itself an extremely valuable, and extremely tedious, enterprise. While I wish that the authors would open their data, version it, standardize the elicitation glosses, and also host it on stable public archives, to avoid what happened in the past (that people quote versions of the data which no longer exist), and to open the data for quantitative analyses, I deeply appreciate the attempt to approach the problem of semantic change from an empirical, data-driven perspective. To address the problem of establishing a typology of semantic shifts, however, I think that we need to start thinking beyond collecting what has been stated in the literature.

Computational approaches

As a first computational approach that comes in some way close to a typology of semantic shifts, there is the Database of Cross-Linguistic Colexifications (List et al. 2018), which was originally launched in 2014 and received a major update in 2018 (see List et al. 2018b for details). This CLICS database, which I have mentioned several times in the past, does not show diachronic data, i.e. data on semantic change phenomena, but instead lists automatically detectable polysemies and homophonies (also called colexifications).

While the approach taken by the Database of Semantic Shifts is bottom-up in some sense, as the authors start from the literature and add those concepts that are discussed there, CLICS is top-down, as it starts from a list of concepts (reflected as standardized Concepticon concept sets) and then checks which languages express more than one concept with one and the same word form.

The advantages of top-down approaches are that much more data can be processed, and that one can easily derive a balanced sample in which the same concepts are compared for as many languages as possible. The disadvantage is that such a database will ignore certain concepts a priori if they do not occur in the data.
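The top-down detection step can be sketched as follows; the languages and word forms below are invented toy data, and the real CLICS database of course works on much larger standardized wordlists.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlists (invented forms): language -> {concept: word form}.
languages = {
    "lang_A": {"hand": "mano", "arm": "mano", "five": "lima"},
    "lang_B": {"hand": "lima", "arm": "pata", "five": "lima"},
    "lang_C": {"hand": "ruka", "arm": "ruka", "five": "pet"},
}

# For each pair of concepts, record the languages that colexify them,
# i.e. express both concepts with one and the same word form.
colex = defaultdict(list)
for lang, words in languages.items():
    by_form = defaultdict(set)
    for concept, form in words.items():
        by_form[form].add(concept)
    for form, concepts in by_form.items():
        for pair in combinations(sorted(concepts), 2):
            colex[pair].append(lang)

# Pairs attested in several languages are candidate cross-linguistic
# patterns; singletons may just be language-specific homophony.
for pair, langs in colex.items():
    print(pair, langs)
```

Note that, as discussed above, such counts are purely synchronic: they tell us that "hand" and "arm" are often expressed by the same word, but not in which direction a meaning was extended.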

Since CLICS lists synchronic patterns without further interpreting them, the database is potentially interesting for those who want to work on semantic change, but it does not help solve the problem of establishing a typology of semantic change itself. In order to achieve this, one would have to go through all attested polysemies in the database and investigate them, searching for potential hints on directions.

A potential way to infer directions for semantic shifts is presented by Dellert (2016), who applies causal inference techniques to polysemy networks to address this task. The problem, as far as I understand the techniques, is that the currently available polysemy databases barely offer enough information for these kinds of analyses. Furthermore, it would also be important to see how well the method actually performs in comparison to what we think we already know about the major patterns of semantic change.

Initial ideas for improvement

There does not seem to be a practical way to address our problem by means of computational solutions alone. What we need, instead, is a computer-assisted strategy that starts from a thorough investigation of the criteria that scholars use to infer directions of semantic change from linguistic data. Once these criteria are more or less settled, one would need to think of ways to operationalize them, in order to allow scholars to work with concrete etymological data, ideally comprising standardized wordlists for different language families, and to annotate them as closely as possible.

Ideally, scholars would propose larger etymological datasets in which they reconstruct whole language families, proposing semantic reconstructions for proto-forms. These would already contain the proposed directions of semantic change, and they would also automatically show where change does not happen. Since we currently lack automated workflows that fully account for this level of detail, one could start by applying methods for cognate detection across semantic slots (cross-semantic cognate detection), which would yield valuable data on semantic change processes without providing directions, and then add the directional information based on the principles that scholars use in their reconstruction methodology.

Outlook

Given the recent advances in detection of sound correspondence patterns, sequence comparison, and etymological annotation in the field of computational historical linguistics, it seems perfectly feasible to work on detailed etymological datasets of the languages of the world, in which all information required to derive a typology of semantic change is transparently available. The problem is, however, that it would still take a lot of time to actually analyze and annotate these data, and to find enough scholars who would agree to carry out linguistic reconstruction in a similar way, using transparent tools rather than convenient shortcuts.

References

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen:Niemeyer.

Blasi, Damián E. and Steven Moran and Scott R. Moisik and Paul Widmer and Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

List, Johann-Mattis and Greenhill, Simon and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS: Database of Cross-Linguistic Colexifications. Version 2.0. Max Planck Institute for the Science of Human History. Jena: http://clics.clld.org/.

Johann Mattis List and Simon Greenhill and Christoph Rzymski and Nathanael Schweikhard and Robert Forkel (2019) Concepticon. A resource for the linking of concept lists (Version 2.1.0). Max Planck Institute for the Science of Human History. Jena: https://concepticon.clld.org/.

Dellert, Johannes and Buch, Armin (2016) Using computational criteria to extract large Swadesh Lists for lexicostatistics. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2018) Towards a history of concept list compilation in historical linguistics. History and Philosophy of the Language Sciences 5.10: 1-14.

List, Johann-Mattis (2019) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

Urban, Matthias (2011) Asymmetries in overt marking and directionality in semantic change. Journal of Historical Linguistics 1.1: 3-47.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, pp. 264-304.

Zalizniak, Anna A. and Bulakh, M. and Ganenkov, Dimitrij and Gruntov, Ilya and Maisak, Timur and Russo, Maxim (2012) The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics 50.3: 633-669.

Monday, July 30, 2018

Networks of polysemous and homophonous words


When I was very young, maybe even before I went to school, my parents, grandparents, and I often played a game in which one player had to select a homophonous word (that is, one word form that expresses two rather different meanings), and the other players had to guess which word had been selected. This game is slightly different from its Anglo-Saxon counterpart, the homophone game.

In Germany, this game is called Teekesselchen: "little teapot". Therefore, people now also use the word Teekesselchen to denote cases of homophony or very advanced polysemy. In this sense, the word Teekesselchen itself becomes polysemous, since it denotes both a little teapot and the phenomenon that word forms in a given language may often denote multiple meanings.

Homophony and polysemy

In linguistics, we learn very early that we should rigorously distinguish the phenomenon of homophony from the phenomenon of polysemy. The former refers to originally different word forms that have become similar (or even identical) due to the effects of sound change — compare French paix "peace" and pet "fart", which are now both pronounced as [pɛ]. The latter refers to cases where a word form has accumulated multiple meanings over time, which have shifted away from the original meaning — compare head as in head of department vs. head as in headache.

Given the difference between the processes leading to homophony on the one hand and polysemy on the other, it may seem justified to opt for a strict usage of the terms, at least when discussing linguistic problems. However, the distinction between homophony and polysemy is not always easy to make.

In German, for example, we have the same word Decke for "ceiling" and "blanket" (Geyken 2010). At first sight, this may look like homophony, given that the meanings are so different that it seems simpler to assume a coincidence. However, it is in fact a case of polysemy (cf. Pfeifer 1993, s. v. «Decke»). This can easily be seen from the verb (be)decken "to cover", from which Decke was derived: while the ceiling covers the room, the blanket covers the body.

Given that we usually do not know much about the history of the words in our languages, we often have difficulties deciding whether we are dealing with homophony or with polysemy when encountering ambiguous terms in the languages of the world. The problem with the two terms is that they are not descriptive, but explanatory (or ontological): they not only describe a phenomenon ("one word form is ambiguous, having multiple meanings"), but also posit the origin of this phenomenon (sound change or semantic change).

In this context, the recently coined term colexification (François 2008) has proven to be very helpful, as it is purely descriptive, referring to those cases where a given language uses the same word form to express two or more different meanings. The advantage of descriptive terminology is that it allows us to identify a phenomenon first and analyze it in a separate step — that is, we can already talk about the phenomenon before we have found its specific explanation.

A new contribution

Having worked hard during recent years writing computer code for data curation and analysis (cf. List et al. 2018a), my colleagues and I have finally managed to present the fascinating phenomenon of colexification (homophony and polysemy) in the languages of the world in an interactive web application, which shows which colexifications occur frequently in the languages of the world.

In order to display how often the languages in the world express different concepts using the same word, we make use of a network model, in which the concepts (or meanings) are represented by the nodes in the networks, and links between concepts are drawn whenever we find that any of the languages in the sample colexifies the concepts. The following figure illustrates this idea.

Colexification network for concepts centering around "FOOD" and "MEAL".
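The construction of such a network can be sketched in a few lines of Python. The toy wordlist below is invented for illustration; CLICS itself is built from much larger curated datasets.

```python
from collections import defaultdict

# Hypothetical toy data: (language, word form, concept) triples.
wordlist = [
    ("Russian", "ruka", "HAND"), ("Russian", "ruka", "ARM"),
    ("Dutch", "kop", "HEAD"), ("Dutch", "kop", "CUP"),
    ("German", "Decke", "CEILING"), ("German", "Decke", "BLANKET"),
    ("English", "head", "HEAD"),
]

def colexification_network(wordlist):
    """Return edges as {concept pair: set of languages colexifying it}."""
    # Group concepts per (language, form): identical forms colexify
    # all the concepts they express in that language.
    forms = defaultdict(set)
    for language, form, concept in wordlist:
        forms[language, form].add(concept)
    # Draw a link between every pair of concepts a form colexifies.
    edges = defaultdict(set)
    for (language, _), concepts in forms.items():
        for c1 in concepts:
            for c2 in concepts:
                if c1 < c2:  # each unordered pair only once
                    edges[c1, c2].add(language)
    return edges

edges = colexification_network(wordlist)
print(edges[("ARM", "HAND")])  # → {'Russian'}
```

In the real database, the edge weights additionally record how many (and which) language families attest a colexification, which helps to separate widespread patterns from areal or genealogical artifacts.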

This database and web application is called CLICS, which stands for the Database of Cross-Linguistic Colexifications (List et al. 2018b), and was published officially during the past week (http://clics.clld.org) — it can now be freely accessed by all who are interested. In addition, we describe the database in some more detail in a forthcoming article (List et al. 2018c), which is already available in form of a draft.

The data give us fascinating insights into the way in which the languages of the world describe the world. At times, it is surprising how similar the languages are, even if they do not share any recent ancestry. My favorite example is the network around the concept FUR, shown below. When inspecting this network, one can find direct links of FUR to HAIR, BODY HAIR, and WOOL on one hand, as well as LEATHER, SKIN, BARK, and PEEL on the other. In some sense, the many different languages of the world, whose data was used in this analysis, reflect a general principle of nature, namely that the bodies of living things are often covered by some protective substance.

Colexification network for concepts centering around "FUR".

Although we have been working with these networks for a long time, we are still far from understanding their true potential. Unfortunately, nobody in our team is a true specialist in complex networks. As a result, our approaches are always limited to what we may have read by chance about all of those fascinating ways in which complex networks can be analyzed.

For the future, we hope to convince more colleagues of the interesting character of the data. At the moment, our networks are simple tools for exploration, and it is hard to extract any evolutionary processes from them. With more refined methods, however, it may even be possible to use them to infer general tendencies of semantic change in language evolution.

References

Geyken A. (ed.) (2010) Digitales Wörterbuch der deutschen Sprache DWDS. Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Berlin-Brandenburgische Akademie der Wissenschaften: Berlin. http://dwds.de

François A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.) From Polysemy to Semantic Change, pp 163-215. Benjamins: Amsterdam.

List J.-M., M. Walworth, S. Greenhill, T. Tresoldi, R. Forkel (2018a) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2. http://dx.doi.org/10.1093/jole/lzy006

List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, R. Forkel (2018c, forthcoming) CLICS². An improved database of cross-linguistic colexifications: Assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2. https://doi.org/10.1515/lingty-2018-0010

List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (eds.) (2018b) CLICS: Database of Cross-Linguistic Colexifications. Max Planck Institute for the Science of Human History: Jena. http://clics.clld.org

Pfeifer W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.

Tuesday, February 16, 2016

Through a glass darkly


In an earlier blogpost I mentioned the now largely abandoned discipline of lexicostatistics, which was in vogue in the 1950s, originally initiated by Morris Swadesh (1909-1967; Swadesh 1950, 1952, 1955), but abandoned in the 1960s and since then often dismissed as a failed theory that had been proven wrong.

Swadesh's crucial idea was to investigate lexical change from the perspective of the meaning of words. This perspective contrasts with the one that takes similar (cognate) word forms in different languages as a starting point and compares to what degree they differ in their meanings. Swadesh's perspective, which starts from a set of meanings and investigates by which word forms they are expressed, is called an onomasiological perspective (which "names" are assigned to concepts?), while the other perspective is called a semasiological perspective (which "meanings" can words have?).

From a semasiological perspective, we would start from a set of related words and investigate their meanings. In this way, we could compare English head with German Hauptstadt "capital city" or English cup with German Kopf "head". Through such an analysis, we would learn that there was a semantic shift from the German word Haupt, which originally meant "head", to a more abstract meaning that is now probably best translated as "capital" or "main", and only occurs in compounds, such as Hauptstadt "capital city", Hauptursache "main reason", etc.

From an onomasiological perspective, we would start from a set of meanings and investigate which words are used to express them in different languages:

No.  Item    German         English  Dutch          Russian
1    hand    Hand           hand     hand           ruka
2    arm     Arm            arm      arm            ruka
3    mainly  hauptsächlich  mainly   hoofdzakelijk  glavny
4    head    Kopf, (Haupt)  head     hoofd, kop     golova
5    cup     Tasse          cup      kop            stakan
...  ...     ...            ...      ...            ...

When looking at specific meanings in this way, one can find interesting patterns within one and the same language whenever a language uses the same or similar words to express what are different concepts in other languages. Russian thus uses the same word for "hand" and "arm", Dutch shows the same word for "head" and "cup", and Russian, Dutch, and German have similar forms for "mainly" and "head". These patterns can be historically interpreted by reconstructing patterns of semantic shift. In the case of English cup, German Kopf, and Dutch kop, for example, the original meaning of the words was "vessel" or "cup". Later on, the word changed its meaning and came to denote "head" in German. The transition is still reflected in Dutch, where the word can denote both meanings.

We can model this situation by assuming that every word in a language has a certain reference potential (Schwarz 1996: 175; Allwood 2003; List 2014: 21f, 36). This means that every word has the potential to denote different things in the world, due to the concept it denotes primarily. In List (2014: 21), I have tried to depict this as follows:

Reference Potential of the Linguistic Sign

In this visualization, a word form refers to a meaning, and the meaning itself has the potential to denote various things in the world, but with different probabilities. A word that primarily means "head", for example, may likewise be used to denote the "first person", as in the "head of a group", and a word that primarily means "melon" may also be used to denote a "head", due to the similarity in form. We can investigate the reference potential of words by simply looking at different translations in dictionaries. As an example (from List 2014: 36), when looking at our three words English cup, Dutch kop, and German Kopf, we find the following rough arrangement with respect to the reference potential of the word (the thickness of the arrows indicating differences in denotation probability):

Reference Potential of Words Across Languages
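The idea of a reference potential can be illustrated with a small sketch in which each word maps to a probability distribution over denotations. All probabilities below are invented for illustration and are not the values from List (2014).

```python
# Hedged sketch: reference potentials as probability distributions over
# denotata, mirroring the arrow thicknesses in the figure above.
reference_potential = {
    "English cup": {"vessel": 0.95, "head": 0.05},
    "Dutch kop":   {"vessel": 0.50, "head": 0.50},
    "German Kopf": {"vessel": 0.05, "head": 0.95},
}

def most_likely_denotation(word):
    """Pick the denotation with the highest denotation probability."""
    potential = reference_potential[word]
    return max(potential, key=potential.get)

print(most_likely_denotation("German Kopf"))  # → head
print(most_likely_denotation("English cup"))  # → vessel
```

In this picture, semantic change amounts to a gradual redistribution of probability mass within a word's potential, with Dutch kop caught mid-shift between the "vessel" and "head" readings.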

Why do I mention all of this? First, I wanted to show that lexical change, no matter which perspective we take, is a very complex phenomenon. In a simplifying model, we could think of a lexicon as a bipartite network consisting of nodes that represent word forms in a language and nodes that represent meanings, with weighted links between word forms and meanings denoting the frequency with which a word is used to denote a given meaning. In such a network representation, lexical change could be modelled as the re-arrangement of the edges between word forms and meanings. If a word form loses all its edges, this word is lost from the language, but we could also think of new words entering the language, be it because they are borrowed or created from the language itself. Such a model would be very simplistic, ignoring aspects like word compounding, by which new words are created from existing ones. But it would be much more realistic than the idea that lexical change is just about the gain and loss of words, as assumed in the quasi-standard model of lexical change in phylogenetic reconstruction.
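The bipartite model just described can be sketched as follows; the word forms, meanings, and weights are all invented for illustration.

```python
# Bipartite lexicon: word-form nodes on one side, meaning nodes on the
# other, weighted edges giving denotation frequency (toy values).
lexicon = {
    # word form -> {meaning: weight}
    "kop":   {"cup": 0.7, "head": 0.3},
    "hoofd": {"head": 0.9},
}

def shift_weight(lexicon, form, meaning, delta):
    """Model lexical change as re-weighting one word-meaning edge."""
    meanings = lexicon.setdefault(form, {})
    meanings[meaning] = meanings.get(meaning, 0.0) + delta
    if meanings[meaning] <= 0:   # the edge is lost entirely
        del meanings[meaning]
    if not meanings:             # a word form with no edges left
        del lexicon[form]        # drops out of the language

# "kop" loses its "head" reading entirely:
shift_weight(lexicon, "kop", "head", -0.3)
print(lexicon["kop"])  # → {'cup': 0.7}
```

Borrowing or word creation would correspond to adding a fresh word-form node with positive-weight edges, which is exactly the kind of process the gain/loss model collapses into a single binary event.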

This brings us to my second point. When Swadesh introduced lexicostatistics, and his very specific onomasiological perspective on lexical change, he established a model of lexical change that would deliberately ignore all interesting processes underlying the phenomenon. Since then, we have been looking through a glass darkly. This is like a crime inspector having no other means but watching potential suspects through the windows of their apartments, noticing changes, like the differently coloured words in state A and state B in the Figure below, but never knowing what was really going on inside those flats (state C).

Through a Glass Darkly: The lexicostatistic perspective on lexical change (A, B), and what is really going on (C).

Yet, if we are honest with ourselves, the problem of looking through a glass darkly does not pertain to the lexicostatistic perspective alone, but effectively applies to all of our research on language change. It is just the size and the number of windows that we survey, and the cleanliness of the glass, that may make a little difference.

References
  • Allwood, J. (2003) Meaning potentials and context: Some consequences for the analysis of variation in meaning. In: Cuyckens, H., R. Dirven, and J. Taylor (eds.): Cognitive approaches to lexical semantics. Mouton de Gruyter: Berlin and New York. 29-65.
  • List, J.-M. (2014) Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Schwarz, M. (1996) Einführung in die kognitive Linguistik. Francke: Basel and Tübingen.
  • Swadesh, M. (1950) Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
  • Swadesh, M. (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philol. Soc. 96.4. 452-463.
  • Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.