The Genealogical World of Phylogenetic Networks: Similarity


Tuesday, July 25, 2017

More on similarities in linguistics

In an earlier blogpost I discussed various reasons for similarity of certain traits in languages. I emphasized four major reasons for similarities, for example, in the lexicon of languages: coincidence, natural reasons, inheritance, and contact (see also List 2014: 55f and Aikhenvald 2007: 5). Despite the problems of distinguishing inherited from borrowed traits, which I jointly called historical reasons for similarity, controlling for coincidence and history can often be done in a rather straightforward way. Coincidence can be controlled for by applying a frequency criterion: if certain similarities are extremely sporadic, they are usually due to chance. Historical similarities can be detected with the help of classical methods for language comparison. If, using these methods, we know, for example, that two or more languages are genetically related or have been developing in close contact with each other, then we will usually assume that shared traits among them are due to their shared history.

The third group of similarities, on the other hand, which I called natural, is a bit more difficult to interpret, since it is not entirely clear what "natural" means in this context. My earlier example was the word for "mother", which in many languages is expressed as "mama", similar to "father", which is often expressed as "papa", even in languages that we know are not related, or only extremely distantly related (if we assume that language was only invented once). These words are easy to pronounce, and will thus be acquired rather early by children.

In the case of "mama" and "papa", we can blame our articulatory apparatus, which makes sounds like [m], [p], and [a] very easy to pronounce for all humans, no matter where and when they are born. Calling this "nature" is probably justified, given that pronounceability is not per se characteristic of language as a general means of complex communication. In sign languages, for example, pronounceability does not play any role, as those languages are never pronounced, but expressed with the help of gestures. But even in sign languages, we find cross-linguistic similarities which seem to be independent of coincidence or history: body parts, for example, are often expressed iconically, e.g., by pointing to them (see Woodward 1993 for details).

However, not all of those similarities between languages that are not due to history or coincidence are necessarily due to our articulatory apparatus. We can think of many different reasons for cross-linguistic similarities, such as, for example, innate settings of the human brain, or global similarities of the environment in which humans live. In the past, colleagues have occasionally pointed out to me the heterogeneity of this class of "natural" similarities. When trying to further subdivide them, the former could be called "similarities due to cognition", while the latter could be called "similarities due to environment". But neither of these two groups seems to be quite satisfying, as we do not really know the relation between environment and cognition. We may also assume that there is a certain influence between the two, and depending on where we draw the border, we would either subscribe to a predominantly Aristotelian viewpoint, where we assign the predominant role to the environment, or a Platonic viewpoint, where we assign it to the innate "ideas" which are given to us along with our brain.

As an example of the difficulty of distinguishing different sources of "natural" similarity, let us have a look at how languages of the world express a fixed set of concepts. In a very simplistic view, given only two things we want to express, for instance the concept "hand" and the concept "arm", we can ask whether a given language will, as a rule, use the same word or different words. English, for example, uses two different words, namely hand and arm, and so does German (Hand and Arm), while Russian uses only one word, ruka, to refer to both concepts in most situations (in Russian, there is another word kist', which can be used to denote "hand", but it is rarely used). We can say that Russian ruka is polysemous, since the word form has at least two meanings. A better way of expressing this is to say that Russian colexifies "hand" and "arm" (François 2008), since the term polysemy has a specific usage in linguistics, referring to words expressing multiple meanings that should be "conceptually close" or "developed from semantic change", which is an extremely vague definition that further requires us to know the history of a given word form and the development of its meanings.

Cross-linguistically, the colexification of "arm" and "hand", i.e. the use of a single word to denote both concepts, occurs extremely often in the languages of the world; so often that we can rule out coincidence as an explanation (compare the colexifications of "arm" in the CLICS database by List et al. 2014). Given that the colexification also recurs in different language families spoken in different regions of the world, we can further rule out historical reasons. This leaves us with the heterogeneous class of "natural reasons for similarities". But what kind of natural similarities are we dealing with here? Are they cognitive? They surely are in some sense, as we can say that humans have good reasons to consider the hand and the arm as one continuous part of their body.

But this continuity is also given by the structure of our body, which itself is given independently of our perception. One could argue that our perception is grounded in our bodily experience, but if we look further into other frequent colexifications, e.g. between "dark" and "black" (which occurs in more than 20 language families), as well as "bright" and "white" (which occurs in three language families), our perception is less dependent on our body than on the environment in which we experience darkness and brightness, since most humans have eyesight and do not live entirely in caves.
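The reasoning in the last few paragraphs (find word forms that cover two concepts, then check how many independent families show the same pattern) can be sketched in a few lines of Python. This is only a toy illustration and not how the CLICS database is built: the English, German and Russian rows follow the examples above, while "LangX" and "FamilyX" are made-up placeholders added so that a second family appears in the count.

```python
from collections import defaultdict
from itertools import combinations

# Toy word list: (language, family, concept, word form).
# English, German and Russian follow the examples in the text;
# "LangX" / "FamilyX" / "tavi" are hypothetical placeholders, not real data.
WORDLIST = [
    ("English", "Indo-European", "hand", "hand"),
    ("English", "Indo-European", "arm", "arm"),
    ("German", "Indo-European", "hand", "Hand"),
    ("German", "Indo-European", "arm", "Arm"),
    ("Russian", "Indo-European", "hand", "ruka"),
    ("Russian", "Indo-European", "arm", "ruka"),
    ("LangX", "FamilyX", "hand", "tavi"),
    ("LangX", "FamilyX", "arm", "tavi"),
]


def colexifications(wordlist):
    """Map each concept pair to the languages and families that colexify it."""
    # Group concepts by (language, family, form): one form, possibly several concepts.
    by_form = defaultdict(set)
    for language, family, concept, form in wordlist:
        by_form[(language, family, form)].add(concept)

    result = defaultdict(lambda: {"languages": set(), "families": set()})
    for (language, family, _form), concepts in by_form.items():
        # A form attached to two or more concepts yields one entry per concept pair.
        for pair in combinations(sorted(concepts), 2):
            result[pair]["languages"].add(language)
            result[pair]["families"].add(family)
    return dict(result)


links = colexifications(WORDLIST)
entry = links[("arm", "hand")]
print(sorted(entry["languages"]), len(entry["families"]))
```

Counting families rather than languages is the design choice that matters here: related languages may share a colexification simply because they inherited it, so only recurrence across unrelated families speaks against a historical explanation.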

It is a kind of chicken-and-egg problem of which came first, and the more I think about it, the more I prefer to avoid giving any clear-cut preference to either the egg or the hen. We can obviously try to make a more fine-grained distinction between different kinds of non-historical and non-coincidental similarities between languages, but unless psychologists and cognitive scientists solve general problems of perception and environment, it seems that, at least for the moment, "natural similarities" is explicit enough as a term to describe universal patterns in the languages of the world.

References

François, A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.): From polysemy to semantic change. Benjamins: Amsterdam. 163-215.
List, J.-M., T. Mayer, A. Terhalle, and M. Urban (eds.) (2014) CLICS: Database of Cross-Linguistic Colexifications. Forschungszentrum Deutscher Sprachatlas: Marburg. http://www.webcitation.org/6ccEMrZYM.
List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, 2393-2400.
Woodward, J. (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Tuesday, January 31, 2017

Similarities and language relationship

There is a long-standing debate in linguistics regarding the best proof of deep relationships between languages. Scholars often break it down to the question of words vs. rules, or lexicon vs. grammar. However, this is essentially misleading, since it suggests that only one type of evidence could ever be used, whereas most of the time it is the accumulation of multiple pieces of evidence that helps to convince scholars. Even if this debate is misleading, it is interesting, since it reflects a general problem of historical linguistics: the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with similarities. Languages can be strikingly similar in various ways. They can share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can be easily created by compounding existing ones, and the word for 'train' is expressed by combining huǒ 火 'fire' and chē 車 'wagon'. The same can be done in languages like German and English, where the words Feuerwagen and fire wagon will be interpreted slightly differently by the speakers, but the constructions are nevertheless valid candidates for words in both languages. In Russian, on the other hand, it is not possible to just put two nouns together to form a new word, but one needs to say something like огненная машина (ognyonnaya mašína), which could literally be translated as 'fiery wagon'.

Neither German nor English is historically closely related to Chinese, but German, English, and Russian go back to the same relatively recent ancestral language. We can see that whether or not a language allows compounding of two words to form a new one is not really indicative of its history, and the same holds for the question of whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different sources, and not all of them are due to historical development. Similarities can be:

coincidental (simply due to chance),
natural (being grounded in human cognition),
genealogical (due to common inheritance), and
contact-induced (due to lateral transfer).

As an example of the first type of similarity, consider the Modern Greek word θεός [θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound similar, but this is sheer coincidence. This becomes clear when comparing the oldest ancestor forms of the words that are reflected in written sources, namely Old Latin deivos, and Mycenaean Greek thehós (Meier-Brügger 2002: 57f).

As an example of the second type of similarity, consider the Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels like [a] and the nasal consonant [m] (Jakobson 1960).

An example of genealogical similarity is the German Zahn and the English tooth, both going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is reflected in the English mountain and the French montagne, since the former was borrowed from the latter.

We can display these similarities in the following decision tree, along with examples from the lexicon of different languages (see List 2014: 56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to indicate that they are historical similarities. They reflect individual language development, and allow us to investigate the evolutionary history of languages. Natural and coincidental similarities, on the other hand, are not indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial to first rule out the non-historical similarities, and then the contact-induced similarities. The non-historical similarities will only add noise to the historical signal, and the contact-induced similarities need to be separated from the genealogical similarities, in order to find out which languages share a common origin and which languages have merely influenced each other some time during their history.
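The order of these steps can be written down as a small decision procedure. The sketch below is only an illustration of the four-way classification: the three yes/no inputs are assumptions that stand in for judgments a linguist has to make by other means, and the order of the questions (chance first, then natural, then inherited vs. borrowed) is my reading of the decision tree, using the word pairs discussed above as examples.

```python
from enum import Enum


class Similarity(Enum):
    COINCIDENTAL = "coincidental"
    NATURAL = "natural"
    GENEALOGICAL = "genealogical"
    CONTACT_INDUCED = "contact-induced"


def classify(by_chance: bool, natural: bool, inherited: bool) -> Similarity:
    """Walk the tree: chance first, then natural, then the two historical types."""
    if by_chance:
        return Similarity.COINCIDENTAL
    if natural:
        return Similarity.NATURAL
    # Whatever remains is historical: either inherited or borrowed.
    return Similarity.GENEALOGICAL if inherited else Similarity.CONTACT_INDUCED


def is_historical(kind: Similarity) -> bool:
    """Only the last two types tell us something about language history."""
    return kind in (Similarity.GENEALOGICAL, Similarity.CONTACT_INDUCED)


# Word pairs from the text, with the answers given there to the three questions.
EXAMPLES = {
    "Greek theos / Spanish dios": classify(True, False, False),
    "Chinese mama / German Mama": classify(False, True, False),
    "German Zahn / English tooth": classify(False, False, True),
    "English mountain / French montagne": classify(False, False, False),
}

for pair, kind in EXAMPLES.items():
    print(f"{pair}: {kind.value} (historical: {is_historical(kind)})")
```

The code is trivial on purpose: all of the difficulty lies in obtaining the three boolean answers, not in combining them.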

Unfortunately, it is not trivial to disentangle these similarities. Coincidence, for example, seems to be easy to handle, but it is notoriously difficult to calculate the likelihood of chance similarities. Scholars have tried to model the probability of chance similarities mathematically, but their models are far too simple to provide us with good estimations, as they usually only consider the first consonant of a word in no more than 200 words of each language (Ringe 1992, Baxter and Manaster Ramer 2000, Kessler 2001).

The problem here is that everything that goes beyond word-initial consonants would have to take the probability of word structures into account. However, since languages differ greatly regarding their so-called phonotactic structure (that is, the sound combinations they allow to occur inside a syllable or a word), an account of chance similarities would need to include a probabilistic model of possible and language-specific word structures. So far, I am not aware of anybody who has tried to tackle this problem.
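To make the word-initial approach concrete, here is a minimal sketch of a generic permutation test on initial sounds. It is not a reimplementation of any of the cited methods, the two word lists are invented (they do not come from real languages), and it assumes that every word starts with a consonant written as a single character.

```python
import random


def initial_matches(list_a, list_b):
    """Count meaning slots whose two words start with the same sound."""
    return sum(a[0] == b[0] for a, b in zip(list_a, list_b))


def chance_p_value(list_a, list_b, trials=10_000, seed=0):
    """Share of random re-pairings that match at least as often as observed."""
    rng = random.Random(seed)  # fixed seed, so the result is reproducible
    observed = initial_matches(list_a, list_b)
    shuffled = list(list_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the link between meaning and form
        if initial_matches(list_a, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


# Hypothetical word lists; position i in both lists stands for the same meaning.
WORDS_A = ["pata", "tiku", "kano", "masi", "nelo", "suro", "lemi", "roka"]
WORDS_B = ["pemu", "tona", "kiri", "mulo", "sate", "neki", "rapu", "lora"]

observed, p_value = chance_p_value(WORDS_A, WORDS_B)
print(f"{observed} matching initials, estimated chance probability {p_value:.3f}")
```

Shuffling one list keeps each language's own inventory of initial sounds intact and only destroys the pairing by meaning. That is exactly why the test says nothing about anything beyond the first sound: a version that looked further into the word would need the model of word structure described above.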

Even more problematic is the second type of similarity. At first sight, it seems that one could capture natural similarities by searching for similarities that recur in very diverse locations of the world. If we compare, for example, which languages have tones, and we find that tones occur almost all over the world, we could argue that the existence of tone languages is not a good indicator of relatedness, since tonal systems can easily develop independently.

The problem with independent development, however, is again tricky, as we need to distinguish different aspects of independence. Independent development could be due to human cognition (the fact that many languages all over the world denote the bark of a tree with a compound tree-skin is obviously grounded in our perception), due to language acquisition (like the case of words for 'mother'), but potentially also due to environmental factors, such as the size of the population of speakers (Lupyan and Dale 2010), or the location where the languages are spoken (see Everett et al. 2015, but also compare the critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to denote similar development due to contact) is a very frequent phenomenon in language evolution, and can happen in all domains of language. Often we simply do not know enough to make a qualified assessment as to whether certain features that are similar among languages are inherited/borrowed or have developed independently.

Interestingly, this was first emphasized by Karl Brugmann (1849-1919), who is often credited as the "father of cladistic thinking" in linguistics. Linguists usually quote his paper from 1884, in order to emphasize the crucial role that Brugmann attributed to shared innovations (synapomorphies in the cladistic terminology) for the purpose of subgrouping. When reading this paper thoroughly, however, it is obvious that Brugmann himself was much less obsessed with the obscure and circular notion of shared innovations (which also holds for cladistics in biology; see De Laet 2005) than with the fact that it is often impossible to actually find them, due to our incapacity to disentangle independent development, inheritance and borrowing.

So far, most linguistic research has concentrated on the problem of distinguishing borrowed from inherited traits, and it is here that the fight over lexicon or grammar as primary evidence for relatedness primarily developed. Since certain aspects of grammar, like case inflection, are rarely transferred from one language to another, while words are easily borrowed, some linguists claim that only grammatical similarities are sufficient evidence of language relationship. This argument is not necessarily productive, since many languages simply lack grammatical structures like inflection, and will therefore not be amenable to any investigation, if we only accept inflectional morphology (grammar) as rigorous proof (for a full discussion, see Dybo and Starostin 2008). Luckily, we do not need to go that far. Aikhenvald (2007: 5) proposes the following borrowability scale:

Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second, right behind inflectional morphology. Pragmatically, we can thus say: if we have nothing but the words, it is better to compare words than anything else. Even more important is that, even if we compare what people label "grammar", we compare concrete form-meaning pairs (e.g., concrete plural-endings), and we never compare abstract features (e.g., whether languages have an article). We do so in order to avoid the "homoplasy problem" that causes so many headaches in our research. No biologist would group insects, birds, and bats based on their wings; and no linguist would group Chinese and English due to their lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the problem of similarity is still creating a lot of confusion in the interdisciplinary dialogues involving linguistics and biology. David is right: similarity between linguistic traits is more like similarity in morphological traits in biology (phenotype), but too often, scholars draw the analogy with genes (genotype) (Morrison 2014).

Second, the problem of disentangling different kinds of similarities is not unique to linguistics, but is also present in biology (Gordon and Notar 2015), and comparing the problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a phylogeny?" When dealing with observed similarity patterns across different languages, and recalling that we do not have the luxury to assume monogenesis in language evolution, we might want to know what the null hypothesis for these data should be. I have to admit, however, that I really don't know the answer.

References

Aikhenvald, A. (2007): Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, A. and R. Dixon (eds.): Grammars in Contact. Oxford University Press: Oxford. 1-66.
Baxter, W. and A. Manaster Ramer (2000): Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, C., A. McMahon, and L. Trask (eds.): Time depth in historical linguistics. McDonald Institute for Archaeological Research: Cambridge. 167-188.
Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
Everett, C., D. Blasi, and S. Roberts (2015): Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5. 1322-1327.
Gordon, M. and J. Notar (2015): Can systems biology help to separate evolutionary analogies (convergent homoplasies) from homologies?. Progress in Biophysics and Molecular Biology 117. 19-29.
Hammarström, H. (2016): There is no demonstrable effect of desiccation. Journal of Language Evolution 1.1. 65–69.
Jakobson, R. (1960): Why ‘Mama’ and ‘Papa’?. In: Perspectives in psychological theory: Essays in honor of Heinz Werner. 124-134.
Kessler, B. (2001): The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications: Stanford.
List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
Lupyan, G. and R. Dale (2010): Language structure is partly determined by social structure. PLoS ONE 5.1. e8559.
Meier-Brügger, M. (2002): Indogermanische Sprachwissenschaft. de Gruyter: Berlin and New York.
Morrison, D. (2014): Is the Tree of Life the best metaphor, model, or heuristic for phylogenetics?. Systematic Biology 63.4. 628-638.
Ringe, D. (1992): On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82.1. 1-110.