Tuesday, January 31, 2017

Similarities and language relationship

There is a long-standing debate in linguistics regarding the best proof deep relationships between languages. Scholars often break it down to the question of words vs. rules, or lexicon vs. grammar. However, this is essentially misleading, since it suggests that only one type of evidence could ever be used, whereas most of the time it is the accumulation of multiple pieces of evidence that helps to convince scholars. Even if this debate is misleading, it is interesting, since it reflects a general problem of historical linguistics: the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with similarities. Languages can be strikingly similar in various ways. They can share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can be easily created by compounding existing ones, and the word for 'train' is expressed by combining huǒ 火 'fire' and chē 車 'wagon'. The same can be done in languages like German and English, where the words Feuerwagen and fire wagon will be slightly differently interpreted by the speakers, but the constructions are nevertheless valid candidates for words in both languages. In Russian, on the other hand, it is not possible to just put two nouns together to form a new word, but one needs to say something as огненная машина (ognyonnaya mašína), which literally could be translated as 'firy wagon'.

Neither German nor English are historically closely related to Chinese, but German, English, and Russian go back to the same relatively recent ancestral language. We can see that whether a language allows compounding of two words to form a new one or not, is not really indicative of its history, as is the question of whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different sources, and not all of them are due to historical development. Similarities can be:
  1. coincidental (simply due to chance),
  2. natural (being grounded in human cognition),
  3. genealogical (due to common inheritance), and
  4. contact-induced (due to lateral transfer).
As an example for the first type of similarity, consider the Modern Greek word θεός [θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound similar, but this is a sheer coincidence. This becomes clear when comparing the oldest ancestor forms of the words that are reflected in written sources, namely Old Latin deivos, and Mycenaean Greek thehós (Meier-Brügger 2002: 57f).

As an example of the second type of similarity, consider the Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels like [a] and the nasal consonant [m] (Jakobson 1960).

An example of genealogical similarity is the German Zahn and the English tooth, both going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is reflected in the English mountain and the French montagne, since the former was borrowed from the latter.

We can display these similarities in the following decision tree, along with examples from the lexicon of different languages (see List 2014: 56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to indicate that they are historical similarities. They reflect individual language development, and allow us to investigate the evolutionary history of languages. Natural and coincidental similarities, on the other hand, are not indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial to first rule out the non-historical similarities, and then the contact-induced similarities. The non-historical similarities will only add noise to the historical signal, and the contact-induced similarities need to be separated from the genealogical similarities, in order to find out which languages share a common origin and which languages have merely influenced each other some time during their history.

Unfortunately, it is not trivial to disentangle these similarities. Coincidence, for example, seems to be easy to handle, but it is notoriously difficult to calculate the likelihood of chance similarities. Scholars have tried to model the probability of chance similarities mathematically, but their models are far too simple to provide us with good estimations, as they usually only consider the first consonant of a word in no more than 200 words of each language (Ringe 1992, Baxter and Manaster Ramer 2000, Kessler 2001).

The problem here is that everything that goes beyond word-initial consonants would have to take the probability of word structures into account. However, since languages differ greatly regarding their so-called phonotactic structure (that is, the sound combinations they allow to occur inside a syllable or a word), an account on chance similarities would need to include a probabilistic model of possible and language-specific word structures. So far, I am not aware of anybody who has tried to tackle this problem.

Even more problematic is the second type of similarity. At first sight, it seems that one could capture natural similarities by searching for similarities that recur in very diverse locations of the world. If we compare, for example, which languages have tones, and we find that tones occur almost all over the world, we could argue that the existence of tone languages is not a good indicator of relatedness, since tonal systems can easily develop independently.

The problem with independent development, however, is again tricky, as we need to distinguish different aspects of independence. Independent development could be due to: human cognition (the fact that many languages all over the world denote the bark of a tree with a compound tree-skin is obviously grounded in our perception); or due to language acquisition (like the case of words for 'mother'); but potentially also due to environmental factors, such as the size of the population of speakers (Lupyan et al. 2010), or the location where the languages are spoken (see Everett et al. 2015, but also compare the critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to denote similar development due to contact) is a very frequent phenomenon in language evolution, and can happen in all domains of language. Often we simply do not know enough to make a qualified assessment as to whether certain features that are similar among languages are inherited/borrowed or have developed independently.

Interestingly, this was first emphasized by Karl Brugmann (1849-1919), who is often credited as the "father of cladistic thinking" in linguistics. Linguists usually quote his paper from 1884, in order to emphasize the crucial role that Brugmann attributed to shared innovations (synapomorphies in the cladistic terminology) for the purpose of subgrouping. When reading this paper thoroughly, however, it is obvious that Brugmann himself was much less obsessed with the obscure and circular notion of shared innovations (which also holds for cladistics in biology; see De Laet 2005), but with the fact that it is often impossible to actually find them, due to our incapacity to disentangle independent development, inheritance and borrowing.

So far, most linguistic research has concentrated on the problem of distinguishing borrowed from inherited traits, and it is here that the fight over lexicon or grammar as primary evidence for relatedness primarily developed. Since certain aspects of grammar, like case inflection, are rarely transferred from one language to another, while words are easily borrowed, some linguists claim that only grammatical similarities are sufficient evidence of language relationship. This argument is not necessarily productive, since many languages simply lack grammatical structures like inflection, and will therefore not be amenable to any investigation, if we only accept inflectional morphology (grammar) as rigorous proof (for a full discussion, see Dybo and Starostin 2008). Luckily, we do not need to go that far. Aikhenvald (2007: 5) proposes the following borrowability scale:
Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second, right behind inflectional morphology. Pragmatically, we can thus say: if we have nothing but the words, it is better to compare words than anything else. Even more important is that, even if we compare what people label "grammar", we compare concrete form-meaning pairs (e.g., concrete plural-endings), and we never compare abstract features (e.g., whether languages have an article). We do so in order to avoid the "homoplasy problem" that causes so many headaches in our research. No biologist would group insects, birds, and bats based on their wings; and no linguist would group Chinese and English due to their lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the problem of similarity is still creating a lot of confusion in the interdisciplinary dialogues involving linguistics and biology. David is right: similarity between linguistic traits is more like similarity in morphological traits in biology (phenotype), but too often, scholars draw the analogy with genes (genotype) (Morrison 2014).

Second, the problem of disentangling different kinds of similarities is not unique to linguistics, but is also present in biology (Gordon and Notar 2015), and comparing the problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a phylogeny?" When dealing with observed similarity patterns across different languages, and recalling that we do not have the luxury to assume monogenesis in language evolution, we might want to know what the null hypothesis for these data should be. I have to admit, however, that I really don't know the answer.

  • Aikhenvald, A. (2007): Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, A. and R. Dixon (eds.): Grammars in Contact. Oxford University Press: Oxford. 1-66.
  • Baxter, W. and A. Manaster Ramer (2000): Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, C., A. McMahon, and L. Trask (eds.): Time depth in historical linguistics. McDonald Institute for Archaeological Research: Cambridge. 167-188.
  • Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeischrift für allgemeine Sprachewissenschaft 1. 228-256.
  • De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
  • Everett, C., D. Blasi, and S. Roberts (2015): Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5. 1322-1327.
  • Gordon, M. and J. Notar (2015): Can systems biology help to separate evolutionary analogies (convergent homoplasies) from homologies?. Progress in Biophysics and Molecular Biology 117. 19-29.
  • Hammarström, H. (2016): There is no demonstrable effect of desiccation. Journal of Language Evolution 1.1. 65–69.
  • Jakobson, R. (1960): Why ‘Mama’ and ‘Papa’?. In: Perspectives in psychological theory: Essays in honor of Heinz Werner. 124-134.
  • Kessler, B. (2001): The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications: Stanford.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Lupyan, G. and R. Dale (2010): Language structure is partly determined by social structure. PLoS ONE 5.1. e8559.
  • Meier-Brügger, M. (2002): Indogermanische Sprachwissenschaft. de Gruyter: Berlin and New York.
  • Morrison, D. (2014): Is the Tree of Life the best metaphor, model, or heuristic for phylogenetics?. Systematic Biology 63.4. 628-638.
  • Ringe, D. (1992): On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82.1. 1-110.


  1. "In Russian, on the other hand, it is not possible to just put two nouns together to form a new word" - not at all! There is a lot of nouns in Russian composed of two nouns.

  2. Well, can you tell me how you would spontaneously build the example of "fire wagon" in Russian as a noun-noun compound? I doubt that speakers would do that, as Russian prefers adjective-noun phrases here. German has Kindergarten, Russian has детский сад. It's not a matter of impossibility, but a matter of preference and productivity, and linguistically, you can't deny that Russian and German work differently here.

    1. Sorry, this is supposed as answer to Alexander. I had a spelling error in my deleted version posted before.