Wednesday, December 9, 2015

Lexicostatistics: the predecessor of phylogenetic analyses in historical linguistics

Phylogenetic approaches in historical linguistics are extremely common nowadays. Probabilistic models that treat lexical change as a birth-death process of cognate sets evolving along a phylogenetic tree (Pagel 2009) are especially popular (Lee and Hasegawa 2011, Kitchen et al. 2009, Bowern and Atkinson 2012), but splits networks are also frequently used (Ben Hamed 2005, Heggarty et al. 2010).

However, the standard procedure to produce a family tree or network with phylogenetic software in linguistics goes back to the method of lexicostatistics, which was developed in the 1950s by Morris Swadesh (1909-1967) in a series of papers (Swadesh 1950, 1952, 1955). Lexicostatistics was discarded by the linguistic community not long after it was proposed (Hoijer 1956, Bergsland and Vogt 1962). Since then, lexicostatistics has been considered a methodus non gratus in classical circles of historical linguistics, and using it openly may drastically downgrade one's perceived credibility in certain parts of the community.

To avoid such conflicts, most linguists practicing modern phylogenetic approaches emphasize the fundamental differences between early lexicostatistics and modern phylogenetics. These differences, however, apply only to the way the data is analysed. The basic assumptions underlying the selection and preparation of data have not changed since the 1950s, and it is important to keep this in mind, especially when searching for appropriate phylogenetic models to analyse the data.

The Theory of Basic Vocabulary

Swadesh's basic idea was that in the lexicon of every human language there are words that are culturally neutral and functionally universal, and he used the term "basic vocabulary" to refer to these words. Cultural neutrality here means that the meanings expressed by the words are used independently across different cultures. Functional universality means that the meanings are expressed by all human languages, independent of the time and place where they are spoken. The idea is that these meanings are so important for the functioning of a language as a tool of communication that every language needs to express them.

Cultural neutrality and functional universality guarantee two important aspects of basic words: their stability and their resistance to borrowing. Stability means that words expressing a basic concept are less likely to change their meaning or to be replaced by another word. An argument for this claim is the functional importance of the words — if the words are important for the functioning of a language, it would not make much sense to change them too quickly. Humans are good at changing the meanings of words, as we can see from daily conversations in the media, where new words tend to pop up seemingly on a daily basis, and old words often drastically change their meanings. But changing words that express basic meanings like "head", "stone", "foot", or "mountain" too often might give rise to confusion in communication. As a result, one can assume that words change at a different pace, depending on the meaning they express, and this is one of the core claims of lexicostatistics.

Resistance to borrowing also follows from stability, since the replacement of words expressing basic meanings may again have an impact on our daily communication, and we may thus assume that speakers avoid borrowing these words too quickly. Cultural neutrality of concepts is another important factor in guaranteeing resistance to borrowing. Words expressing concepts which play an important cultural role may easily be transferred from one language to another along with the culture. Thus, although it seems likely that every language has a word for "god" or "spirit" and the like (so the concept is to a certain degree functionally universal), the lack of cultural independence makes words expressing religious terms very likely candidates for borrowing, and it is probably no coincidence that words expressing religion and belief rank first in the scale of borrowability (Tadmor 2009: 232).

Lexical Replacement, Data Preparation, and Divergence Time Estimation

Swadesh had further ideas regarding the importance of basic vocabulary. He assumed that the process of lexical replacement follows universal rates as far as the basic vocabulary is concerned, and that this would allow us to date the divergence of languages, provided we are able to identify the shared cognates. In lexical replacement, a word w₁ expressing a given meaning x in a language is replaced by a word w₂ which then expresses the meaning x, while w₁ either shifts to express another meaning, or completely disappears from the language. For example, the older English form thou was replaced by the plural form you, which now also expresses the singular. In order to search for cognates and determine the time when two languages diverged, Swadesh proposed a straightforward procedure, consisting of very concrete steps (compare Dyen et al. 1992):
  • Compile a list of basic concepts (concepts that you think are culturally neutral and functionally universal; see here for a comparative collection of different lists that have been proposed and used in the past)
  • translate these concepts into the different languages you want to analyse
  • search for cognates between the languages in each meaning slot; if words in two languages are not cognate for a given meaning, then this points to former processes of lexical replacement in at least one of the languages since their divergence
  • count the number of shared cognates, and use some mathematics to calculate the divergence time (which has been independently calibrated using some test cases of known divergence times).
As an example of such a wordlist with cognate judgments, compare the table in the first figure, where I have entered just a few basic concepts from Swadesh's standard concept list and translated them into four languages. Cognacy is assigned with the help of IDs in the column to the right of each language column, and is further highlighted with different colors.

Classical cognate coding in lexicostatistics
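
The counting and dating steps of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy wordlist and its cognate IDs are made up for the example (they are not the data from the figure), the function names are hypothetical, and the dating step uses the formula commonly attributed to Swadesh, t = log(c) / (2 log(r)), with a retention rate r of 0.86 per millennium taken as an assumed default.

```python
import math

# Toy wordlist for illustration only: concept -> {language: cognate-set ID}.
# Identical IDs within one concept mark words judged to be cognate.
WORDLIST = {
    "hand":  {"English": 1, "German": 1, "Italian": 2, "French": 2},
    "foot":  {"English": 1, "German": 1, "Italian": 1, "French": 1},
    "head":  {"English": 1, "German": 2, "Italian": 3, "French": 3},
    "stone": {"English": 1, "German": 1, "Italian": 2, "French": 2},
}


def shared_cognate_fraction(wordlist, lang_a, lang_b):
    """Fraction of meaning slots in which both languages share a cognate."""
    shared = 0
    compared = 0
    for entries in wordlist.values():
        # Only compare slots that are filled in both languages.
        if lang_a in entries and lang_b in entries:
            compared += 1
            if entries[lang_a] == entries[lang_b]:
                shared += 1
    if compared == 0:
        raise ValueError("no meaning slot is filled in both languages")
    return shared / compared


def divergence_time(c, r=0.86):
    """Estimated separation time in millennia, t = log(c) / (2 * log(r)).

    c is the fraction of shared cognates; r is the assumed retention
    rate per millennium (0.86 is an assumption of this sketch).
    """
    if not 0 < c <= 1:
        raise ValueError("c must lie in the interval (0, 1]")
    if not 0 < r < 1:
        raise ValueError("r must lie in the interval (0, 1)")
    return math.log(c) / (2 * math.log(r))


if __name__ == "__main__":
    c = shared_cognate_fraction(WORDLIST, "English", "German")
    print(c, divergence_time(c))
```

With only four meaning slots the resulting number is of course meaningless as a date; the point is merely that the whole analysis rests on one aggregated percentage per language pair.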

Phylogenetic Approaches in Historical Linguistics

Modern phylogenetic approaches in historical linguistics basically follow the same workflow that Swadesh propagated for lexicostatistics, the only difference being the last step of the working procedure. Instead of Swadesh's formula, which compared lexical replacement with radioactive decay and was based on aggregated distances at its core, character-based methods are used to infer phylogenetic trees. Characters are retrieved from the data by extracting each cognate set from a lexicostatistical wordlist and annotating the presence or absence of each cognate set in each language.

Thus, while Swadesh's lexicostatistical data model would state that the words for "hand" in German and English were cognate, and also in Italian and French, but not in Germanic and Romance, the binary presence-absence coding states that the cognate set formed by words like English hand and German Hand is not present in Romance languages, and that the cognate set formed by words like Italian mano and French main is absent in Germanic languages. This is illustrated in the table below, where the same IDs and colors are used to mark the cognate sets as in the table shown above.

Presence-absence cognate coding for modern phylogenetic analyses
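
The recoding from a lexicostatistical wordlist to presence-absence characters can be sketched as follows. Again, this is a minimal illustration and not the procedure of any particular software package: the toy wordlist, the cognate IDs and the function name `binary_characters` are assumptions made for the example, and coding missing entries as "?" is one possible convention among several.

```python
# Toy wordlist for illustration only: concept -> {language: cognate-set ID}.
WORDLIST = {
    "hand": {"English": 1, "German": 1, "Italian": 2, "French": 2},
    "head": {"English": 1, "German": 2, "Italian": 3, "French": 3},
}
LANGUAGES = ["English", "German", "Italian", "French"]


def binary_characters(wordlist, languages):
    """Turn every cognate set into one presence-absence character.

    Returns {(concept, cognate_id): {language: "1", "0" or "?"}},
    where "?" marks a language with no entry for the concept.
    """
    characters = {}
    for concept, entries in wordlist.items():
        for cognate_id in sorted(set(entries.values())):
            row = {}
            for language in languages:
                if language not in entries:
                    row[language] = "?"
                elif entries[language] == cognate_id:
                    row[language] = "1"
                else:
                    row[language] = "0"
            characters[(concept, cognate_id)] = row
    return characters


if __name__ == "__main__":
    for key, row in binary_characters(WORDLIST, LANGUAGES).items():
        print(key, "".join(row[language] for language in LANGUAGES))
```

Note that the single meaning slot "hand" yields two characters here and "head" yields three, so the number of characters depends on how many cognate sets fill each slot.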

The new way of cognate coding along with the use of phylogenetic software methods has brought, without doubt, many improvements compared to Swadesh's idea of dating divergence times by counting percentages of shared cognates. A couple of problems, however, remain, and one should not forget them when applying computational methods to originally lexicostatistic datasets.

First, we could ask whether the main assumptions of functional universality and cultural neutrality really hold. It seems to be true that words can be remarkably stable throughout the history of a language family. It is, however, also true that the most stable words are not necessarily the same across all language families. Ever since Swadesh established the idea of basic vocabulary, scholars have tried to improve the list of basic vocabulary items. Swadesh himself started from a list of 215 concepts (Swadesh 1950), which he reduced to 200 concepts (1952) and later to 100 concepts (1955). Other scholars went further, like Dolgopolsky (1964 [1986]), who reduced the list to 15 concepts. The Concepticon is a resource that links many of the concept lists that have been proposed in the past. When comparing these lists, which all represent what some scholars would label "basic vocabulary items", it becomes obvious that the number of items that all scholars agree upon shrinks drastically, while the number of concepts that have been claimed to be basic increases.

An even greater problem than the question of universality and neutrality of basic vocabulary, however, is the underlying model of cognacy in combination with the proposed process of change. Swadesh's model of cognacy controls for meaning. While this model of cognacy is consistent with Swadesh's idea of lexical replacement as a basic process of lexical change, it is by no means consistent with birth-death models of cognate gain and cognate loss if they are created from lexicostatistical data. In biology, birth-death models are usually used to model the evolution of homologous gene families distributed across whole genomes. If we use the traditional view according to which words can be cognate regardless of meaning, the analogy holds, and birth-death processes seem to be adequate in order to analyze datasets that are based on these root cognates (Starostin 1989) or etymological cognates (Starostin 2013). But if we control for meaning in the cognate judgments, we do not necessarily capture processes of gain and loss in our data. Instead, we capture processes in which links between word forms and concepts are shifted, and we investigate these shifts through the very narrow "windows" of pre-defined slots of basic concepts, as I have tried to depict in the following graphic.

Looking at lexical replacement through the small windows of basic vocabulary


As David has mentioned before, we do not necessarily need realistic models in phylogenetic research to infer meaningful processes. The same can probably be said about the discrepancy between our lexicostatistical datasets (Swadesh's heritage, which we keep using for practical reasons) and the birth-death models we now use to analyse the data. Nevertheless, I cannot avoid an uncomfortable feeling when thinking that an algorithm is modeling gain and loss of characters in a dataset that was not produced for this purpose. In order to model the traditional lexicostatistical data consistently, we would either (i) need explicit multistate models in which each concept is a character and the forms represent the states (Ringe et al. 2002, Ben Hamed and Wang 2006), or (ii) we should directly turn to "root-cognate" methods. These methods have been discussed for some time now (Starostin 1989, Holm 2000), but there is only one recent approach by Michael et al. (forthcoming) in which this is consistently tested.
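
To make option (i) a bit more concrete, here is a small sketch of multistate coding, in which every concept is one character and every cognate set within that concept is one state. The toy wordlist, the letter labels and the function name `multistate_matrix` are assumptions made for the illustration; this is not the coding scheme of any of the cited studies.

```python
import string

# Toy wordlist for illustration only: concept -> {language: cognate-set ID}.
WORDLIST = {
    "hand":  {"English": 1, "German": 1, "Italian": 2, "French": 2},
    "foot":  {"English": 1, "German": 1, "Italian": 1, "French": 1},
    "head":  {"English": 1, "German": 2, "Italian": 3, "French": 3},
    "stone": {"English": 1, "German": 1, "Italian": 2, "French": 2},
}
LANGUAGES = ["English", "German", "Italian", "French"]


def multistate_matrix(wordlist, languages):
    """Code each concept as one character whose states are cognate sets.

    States are relabelled A, B, C, ... per concept in order of first
    appearance; "?" marks a missing entry. Returns {language: string}.
    This simple labelling supports at most 26 states per concept.
    """
    rows = {language: [] for language in languages}
    for entries in wordlist.values():
        labels = {}  # cognate ID -> state letter, local to this concept
        for language in languages:
            cognate_id = entries.get(language)
            if cognate_id is None:
                rows[language].append("?")
                continue
            if cognate_id not in labels:
                labels[cognate_id] = string.ascii_uppercase[len(labels)]
            rows[language].append(labels[cognate_id])
    return {language: "".join(states) for language, states in rows.items()}


if __name__ == "__main__":
    for language, states in multistate_matrix(WORDLIST, LANGUAGES).items():
        print(f"{language:10s}{states}")
```

In this coding the number of characters equals the number of meaning slots, and a change of state corresponds directly to lexical replacement within a slot, which is closer to the process Swadesh had in mind than the gain and loss of independent binary characters.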

  • Bergsland, K. and H. Vogt (1962): On the validity of glottochronology. Curr. Anthropol. 3.2. 115-153.
  • Bowern, C. and Q. Atkinson (2012): Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88. 817-845.
  • Dolgopolsky, A. (1964): Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2. 53-63.
  • Dyen, I., J. Kruskal, and P. Black (1992): An Indoeuropean classification. A lexicostatistical experiment. T. Am. Philos. Soc. 82.5. iii-132.
  • Ben Hamed, M. and F. Wang (2006): Stuck in the forest: Trees, networks and Chinese dialects. Diachronica 23. 29-60.
  • Hoijer, H. (1956): Lexicostatistics. A critique. Language 32.1. 49-60.
  • Holm, H. (2000): Genealogy of the main Indo-European branches applying the separation base method. J. Quant. Linguist. 7.2. 73-95.
  • Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
  • Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
  • Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews. Genetics 10. 405-415.
  • Ringe, D., T. Warnow, and A. Taylor (2002): Indo-European and computational cladistics. T. Philol. Soc. 100.1. 59-129.
  • Starostin, S. (1989): Sravnitel'no-istoričeskoe jazykoznanie i leksikostatistika [Comparative-historical linguistics and lexicostatistics]. In: Kullanda, S., J. Longinov, A. Militarev, E. Nosenko, and V. Shnirel'man (eds.): Materialy k diskussijam na konferencii [Materials for the discussion on the conference]. 1. Institut Vostokovedenija: Moscow. 3-39.
  • Starostin, G. (2013): Lexicostatistics as a basis for language classification. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 125-146.
  • Swadesh, M. (1950): Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
  • Swadesh, M. (1952): Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philos. Soc. 96.4. 452-463.
  • Swadesh, M. (1955): Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.
  • Tadmor, U. (2009): Loanwords in the world’s languages. Findings and results. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world's languages. de Gruyter: Berlin and New York. 55-75.


  1. Just a minor addition: Swadesh himself, and all the people discussing his ideas in the 1960s, never drew any trees! Actually, they didn't know how to do this computationally from the data collected. The first linguists to use a basic hierarchical clustering algorithm were David Sankoff and Annette Dobson.

    And yes, that is *the* David Sankoff, who started his academic life as a linguist working on lexicostatistics...

    Dobson, A J. "Lexicostatistical Grouping." Anthropological Linguistics 11, no. 7 (1969): 216-221.
    Sankoff, D. "Historical Linguistics As Stochastic Process." PhD Dissertation. McGill University, 1969.

    1. Yep, this is true as far as I know: Swadesh was always careful with trees, although his theory of divergence dating implied them. But note that already in 1960, around about the same time as Sankoff started, I guess, the famous Isidore Dyen pointed out, in a comment on a paper by Hymes on lexicostatistics, the possibility of creating trees from the lexicostatistical data. I didn't thoroughly read or try to understand the algorithm Dyen proposes, but it reminds me of some simplified version of linkage clustering, thus independently proposing things which Sokal and Michener had come up with two years earlier.

      Dyen (1960):

      Sokal and Michener (1958):

      So, interestingly, linguistics and biology are again working in similar directions, but first independently.