Showing posts with label language evolution. Show all posts

Tuesday, July 25, 2017

More on similarities in linguistics


In an earlier blogpost I discussed various reasons for similarity of certain traits in languages. I emphasized four major reasons for similarities, for example, in the lexicon of languages: coincidence, natural reasons, inheritance, and contact (see also List 2014: 55f and Aikhenvald 2007: 5). Despite the problems of distinguishing inherited from borrowed traits (which together I called historical reasons for similarity), controlling for coincidence and history can often be done in a rather straightforward way. Coincidence can be controlled for by applying a frequency criterion: if certain similarities occur only sporadically, they are usually due to chance. Historical similarities can be detected with the help of classical methods for language comparison. If, using these methods, we know, for example, that two or more languages are genetically related or have been developing in close contact with each other, then we will usually assume that shared traits among them are due to their shared history.

The third group of similarities, on the other hand, which I called natural, is a bit more difficult to interpret, since it is not entirely clear what "natural" means in this context. My earlier example was the word for "mother", which in many languages is expressed as "mama", similar to "father", which is often expressed as "papa", even in languages that we know are not related, or only extremely distantly related (if we assume that language was only invented once). Such words consist of sounds that are easy to pronounce, and will thus be acquired rather early by children.

In the case of "mama" and "papa", we can blame our articulatory apparatus, which makes sounds like [m], [p], and [a] very easy to pronounce for all humans, no matter where and when they are born. Calling this "nature" is probably justified, given that pronounceability is not per se characteristic of language as a general means of complex communication. In sign languages, for example, pronounceability does not play any role, as those languages are never pronounced, but expressed with the help of gestures. But even in sign languages we find cross-linguistic similarities that seem to be independent of coincidence or history: body parts, for example, are often expressed iconically, e.g., by pointing to them (see Woodward 1993 for details).

However, not all of those similarities between languages that are not due to history or coincidence are necessarily due to our articulatory apparatus. We can think of many different reasons for cross-linguistic similarities, such as, for example, innate settings of the human brain, or global similarities of the environment in which humans live. In the past, colleagues have occasionally pointed out to me the heterogeneity of this class of "natural" similarities. When trying to further subdivide them, the former could be called "similarities due to cognition", while the latter could be called "similarities due to environment". But neither of these two groups seems to be quite satisfying, as we do not really know the relation between environment and cognition. We may also assume that the two influence each other, and depending on where we draw the border, we would either subscribe to a predominantly Aristotelian viewpoint, where we assign the predominant role to the environment, or a Platonic viewpoint, where we assign it to the innate "ideas" which are given to us along with our brain.

As an example of the difficulty of distinguishing different sources of "natural" similarity, let us have a look at how languages of the world express a fixed set of concepts. In a very simplistic view, given only two things we want to express, for instance the concept "hand" and the concept "arm", we can ask whether a given language will, as a rule, use the same word or different words. English, for example, uses two different words, namely hand and arm, and so does German (Hand and Arm), while Russian uses only one word, ruka, to refer to both concepts in most situations (in Russian, there is another word kist', which can be used to denote "hand", but it is rarely used). We can say that Russian ruka is polysemous, since the word form has at least two meanings. A better way of expressing this is to say that Russian colexifies "hand" and "arm" (François 2008), since the term polysemy has a specific usage in linguistics, referring to words expressing multiple meanings that should be "conceptually close" or "developed from semantic change", which is an extremely vague definition that further requires us to know the history of a given word form and the development of its meanings.
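The notion of colexification can be made concrete with a small sketch. The following Python snippet is only an illustration under simple assumptions: the word list contains nothing but the three languages and two concepts mentioned above, and the data layout and the function name `colexifications` are made up for this example (they are not taken from CLICS or any other resource).

```python
from collections import defaultdict
from itertools import combinations

# Toy word list: language -> concept -> word form.
# It contains only the examples given in the text; the layout is an assumption.
WORDLIST = {
    "English": {"hand": "hand", "arm": "arm"},
    "German": {"hand": "Hand", "arm": "Arm"},
    "Russian": {"hand": "ruka", "arm": "ruka"},
}


def colexifications(wordlist):
    """Map each concept pair to the languages that use one form for both."""
    result = defaultdict(list)
    for language, entries in wordlist.items():
        # Group the concepts of this language by the form that expresses them.
        by_form = defaultdict(list)
        for concept, form in entries.items():
            by_form[form].append(concept)
        # Every pair of concepts sharing a form is one colexification.
        for concepts in by_form.values():
            for pair in combinations(sorted(concepts), 2):
                result[pair].append(language)
    return dict(result)


print(colexifications(WORDLIST))
# {('arm', 'hand'): ['Russian']}
```

Note that the sketch only compares word forms and says nothing about why the forms are identical, which is exactly why the term colexification is more neutral than polysemy.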

Cross-linguistically, the colexification of "arm" and "hand", i.e. that many languages tend to use a single word to denote both concepts, occurs extremely often in the languages of the world; so often that we can rule out that the use of one word for two concepts is due to coincidence (compare the colexifications of "arm" in the CLICS database by List et al. 2014 through this link). Given that the colexification recurs also in different language families spoken in different regions of the world, we can further rule out historical reasons. This leaves us with the heterogeneous class of "natural reasons for similarities". But what kind of natural similarities are we dealing with here? Are they cognitive? They surely are in some sense, as we can say that humans have good reasons to consider the hand and the arm as one continuous part of their body.

But this continuity is also given by the structure of our body, which itself is given independently of our perception. One could argue that our perception is grounded in our bodily experience, but if we look further into other frequent colexifications, e.g. between "dark" and "black" (this occurs in more than 20 language families, see here), as well as "bright" and "white" (occurs in three language families, see here), our perception is less dependent on our body and more on the environment in which we experience darkness and brightness, since most humans have eyesight and do not live entirely in caves.

It is a kind of chicken-and-egg problem of which came first, and the more I think about it, the more I prefer to avoid giving any clear-cut preference to either the egg or the hen. We can obviously try to make a more fine-grained distinction between different kinds of non-historical and non-coincidental similarities between languages, but unless psychologists and cognitive scientists solve general problems of perception and environment, it seems that, at least for the moment, "natural similarities" is explicit enough as a term to describe universal patterns in the languages of the world.

References
  • François, A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.): From polysemy to semantic change. Benjamins: Amsterdam. 163-215.
  • List, J.-M., T. Mayer, A. Terhalle, and M. Urban (eds.) (2014) CLICS: Database of Cross-Linguistic Colexifications. Forschungszentrum Deutscher Sprachatlas: Marburg. http://www.webcitation.org/6ccEMrZYM.
  • List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, 2393-2400.
  • Woodward, J. (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Tuesday, January 31, 2017

Similarities and language relationship


There is a long-standing debate in linguistics regarding the best proof of deep relationships between languages. Scholars often break it down to the question of words vs. rules, or lexicon vs. grammar. However, this is essentially misleading, since it suggests that only one type of evidence could ever be used, whereas most of the time it is the accumulation of multiple pieces of evidence that helps to convince scholars. Even if this debate is misleading, it is interesting, since it reflects a general problem of historical linguistics: the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with similarities. Languages can be strikingly similar in various ways. They can share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can be easily created by compounding existing ones, and the word for 'train' is expressed by combining huǒ 火 'fire' and chē 車 'wagon'. The same can be done in languages like German and English, where the words Feuerwagen and fire wagon will be interpreted slightly differently by the speakers, but the constructions are nevertheless valid candidates for words in both languages. In Russian, on the other hand, it is not possible to just put two nouns together to form a new word, but one needs to say something like огненная машина (ognyonnaya mašína), which literally could be translated as 'fiery wagon'.

Neither German nor English is historically closely related to Chinese, but German, English, and Russian go back to the same relatively recent ancestral language. We can see that whether a language allows compounding of two words to form a new one is not really indicative of its history. The same holds for the question of whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different sources, and not all of them are due to historical development. Similarities can be:
  1. coincidental (simply due to chance),
  2. natural (being grounded in human cognition),
  3. genealogical (due to common inheritance), and
  4. contact-induced (due to lateral transfer).
As an example of the first type of similarity, consider the Modern Greek word θεός [θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound similar, but this is a sheer coincidence. This becomes clear when comparing the oldest ancestor forms of the words that are reflected in written sources, namely Old Latin deivos, and Mycenaean Greek thehós (Meier-Brügger 2002: 57f).

As an example of the second type of similarity, consider the Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels like [a] and the nasal consonant [m] (Jakobson 1960).

An example of genealogical similarity is the German Zahn and the English tooth, both going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is reflected in the English mountain and the French montagne, since the former was borrowed from the latter.

We can display these similarities in the following decision tree, along with examples from the lexicon of different languages (see List 2014: 56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to indicate that they are historical similarities. They reflect individual language development, and allow us to investigate the evolutionary history of languages. Natural and coincidental similarities, on the other hand, are not indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial to first rule out the non-historical similarities, and then the contact-induced similarities. The non-historical similarities will only add noise to the historical signal, and the contact-induced similarities need to be separated from the genealogical similarities, in order to find out which languages share a common origin and which languages have merely influenced each other some time during their history.

Unfortunately, it is not trivial to disentangle these similarities. Coincidence, for example, seems to be easy to handle, but it is notoriously difficult to calculate the likelihood of chance similarities. Scholars have tried to model the probability of chance similarities mathematically, but their models are far too simple to provide us with good estimations, as they usually only consider the first consonant of a word in no more than 200 words of each language (Ringe 1992, Baxter and Manaster Ramer 2000, Kessler 2001).

The problem here is that everything that goes beyond word-initial consonants would have to take the probability of word structures into account. However, since languages differ greatly regarding their so-called phonotactic structure (that is, the sound combinations they allow to occur inside a syllable or a word), an account of chance similarities would need to include a probabilistic model of possible and language-specific word structures. So far, I am not aware of anybody who has tried to tackle this problem.
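To illustrate how little the simple word-initial models described above take into account, here is a deliberately simplified Python sketch. It is not the procedure of any of the cited studies: it merely assumes that the initial consonants of two unrelated languages are drawn independently from fixed frequency distributions, and then asks how many matches we should expect by chance in a list of 200 meanings. The consonant frequencies and both function names are hypothetical.

```python
from math import comb


def match_probability(freq_a, freq_b):
    """Chance that two independent words start with the same consonant."""
    return sum(p * freq_b.get(cons, 0.0) for cons, p in freq_a.items())


def prob_at_least(k, n, p):
    """Binomial tail: probability of k or more chance matches in n meanings."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical relative frequencies of word-initial consonants (made-up values).
freq_a = {"p": 0.25, "t": 0.25, "k": 0.25, "m": 0.25}
freq_b = {"p": 0.4, "t": 0.3, "k": 0.2, "m": 0.1}

n_meanings = 200  # size of the word list, as in the studies mentioned above
p = match_probability(freq_a, freq_b)

print("chance of a match per meaning:", p)
print("expected chance matches:", n_meanings * p)
print("probability of 70 or more matches:", prob_at_least(70, n_meanings, p))
```

Everything beyond the first consonant, and in particular the language-specific phonotactics mentioned above, is ignored here, which is precisely the limitation of such models.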

Even more problematic is the second type of similarity. At first sight, it seems that one could capture natural similarities by searching for similarities that recur in very diverse locations of the world. If we compare, for example, which languages have tones, and we find that tones occur almost all over the world, we could argue that the existence of tone languages is not a good indicator of relatedness, since tonal systems can easily develop independently.

The problem with independent development, however, is again tricky, as we need to distinguish different aspects of independence. Independent development could be due to human cognition (the fact that many languages all over the world denote the bark of a tree with a compound tree-skin is obviously grounded in our perception), or due to language acquisition (like the case of words for 'mother'), but potentially also due to environmental factors, such as the size of the population of speakers (Lupyan and Dale 2010), or the location where the languages are spoken (see Everett et al. 2015, but also compare the critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to denote similar development due to contact) is a very frequent phenomenon in language evolution, and can happen in all domains of language. Often we simply do not know enough to make a qualified assessment as to whether certain features that are similar among languages are inherited/borrowed or have developed independently.

Interestingly, this was first emphasized by Karl Brugmann (1849-1919), who is often credited as the "father of cladistic thinking" in linguistics. Linguists usually quote his paper from 1884, in order to emphasize the crucial role that Brugmann attributed to shared innovations (synapomorphies in the cladistic terminology) for the purpose of subgrouping. When reading this paper thoroughly, however, it is obvious that Brugmann himself was much less concerned with the obscure and circular notion of shared innovations (which also holds for cladistics in biology; see De Laet 2005) than with the fact that it is often impossible to actually find them, due to our incapacity to disentangle independent development, inheritance and borrowing.

So far, most linguistic research has concentrated on the problem of distinguishing borrowed from inherited traits, and it is here that the fight over lexicon or grammar as primary evidence for relatedness primarily developed. Since certain aspects of grammar, like case inflection, are rarely transferred from one language to another, while words are easily borrowed, some linguists claim that only grammatical similarities are sufficient evidence of language relationship. This argument is not necessarily productive, since many languages simply lack grammatical structures like inflection, and will therefore not be amenable to any investigation, if we only accept inflectional morphology (grammar) as rigorous proof (for a full discussion, see Dybo and Starostin 2008). Luckily, we do not need to go that far. Aikhenvald (2007: 5) proposes the following borrowability scale:
Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second, right behind inflectional morphology. Pragmatically, we can thus say: if we have nothing but the words, it is better to compare words than anything else. Even more important is that, even if we compare what people label "grammar", we compare concrete form-meaning pairs (e.g., concrete plural-endings), and we never compare abstract features (e.g., whether languages have an article). We do so in order to avoid the "homoplasy problem" that causes so many headaches in our research. No biologist would group insects, birds, and bats based on their wings; and no linguist would group Chinese and English due to their lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the problem of similarity is still creating a lot of confusion in the interdisciplinary dialogues involving linguistics and biology. David is right: similarity between linguistic traits is more like similarity in morphological traits in biology (phenotype), but too often, scholars draw the analogy with genes (genotype) (Morrison 2014).

Second, the problem of disentangling different kinds of similarities is not unique to linguistics, but is also present in biology (Gordon and Notar 2015), and comparing the problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a phylogeny?" When dealing with observed similarity patterns across different languages, and recalling that we do not have the luxury to assume monogenesis in language evolution, we might want to know what the null hypothesis for these data should be. I have to admit, however, that I really don't know the answer.

References
  • Aikhenvald, A. (2007): Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, A. and R. Dixon (eds.): Grammars in Contact. Oxford University Press: Oxford. 1-66.
  • Baxter, W. and A. Manaster Ramer (2000): Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, C., A. McMahon, and L. Trask (eds.): Time depth in historical linguistics. McDonald Institute for Archaeological Research: Cambridge. 167-188.
  • Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
  • Everett, C., D. Blasi, and S. Roberts (2015): Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5. 1322-1327.
  • Gordon, M. and J. Notar (2015): Can systems biology help to separate evolutionary analogies (convergent homoplasies) from homologies?. Progress in Biophysics and Molecular Biology 117. 19-29.
  • Hammarström, H. (2016): There is no demonstrable effect of desiccation. Journal of Language Evolution 1.1. 65–69.
  • Jakobson, R. (1960): Why ‘Mama’ and ‘Papa’?. In: Perspectives in psychological theory: Essays in honor of Heinz Werner. 124-134.
  • Kessler, B. (2001): The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications: Stanford.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Lupyan, G. and R. Dale (2010): Language structure is partly determined by social structure. PLoS ONE 5.1. e8559.
  • Meier-Brügger, M. (2002): Indogermanische Sprachwissenschaft. de Gruyter: Berlin and New York.
  • Morrison, D. (2014): Is the Tree of Life the best metaphor, model, or heuristic for phylogenetics?. Systematic Biology 63.4. 628-638.
  • Ringe, D. (1992): On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82.1. 1-110.

Tuesday, December 20, 2016

Isogloss maps are hypergraphs are bipartite networks


Linguists are a very special people. They are very proud, especially when biologists tell them how to do phylogenetic analyses; but their pride is often also justified, as many phylogenetic concepts were initially or independently developed by linguists, be it the family tree model, proposed years before Darwin's (1859) tree by Čelakovský (1853), or even the cladistic principle of synapomorphies, which are called "exclusively shared innovations" in linguistics (see Brugmann 1884).

Linguists also invented one interesting kind of data-display which so far has never been used by biologists (at least as far as I know): maps of isogloss boundaries. The term "isogloss" is an unfortunate term, as it has multiple usages in linguistics, and its history seems to go back to a naive borrowing from chemistry (but I have not really followed the literature here). On most occasions, it just means "shared trait". That is, it denotes a feature shared between two or more languages; and given that languages may share many different features, isoglosses for a group of related languages may yield a very complex type of data. Isoglosses are somehow related to the wave theory, the arch-enemy of the family tree in linguistics, which I described as a mystical theory some time ago, since it never really made it to a clear-cut model that could be formalized (The Wave Theory: the predecessor of network thinking in historical linguistics).

Some linguists, nevertheless, insist that the waves that are the core of the wave theory are nothing other than isoglosses. More specifically, the waves represent innovations that contribute to the separation of languages (a change in pronunciation of a word here, a change in grammar there), but which are not transmitted vertically — they spread across the speakers of a language and may even cross linguistic borders. One early visualization of these waves can be found in Bloomfield (1933), as shown here:


What Bloomfield essentially does here is pick certain traits of Indo-European languages, calling them isoglosses, and arrange them on a quasi-geographic map of Indo-European languages in such a way that all languages sharing a trait are inside one of these isogloss boundaries.

Only recently did I realise what this actually means, when I found the "Bible of Network Theory" by Newman (2010) and started reading at a random page, which, as it turned out, treated hypergraphs. Hypergraphs, as I learned from Newman, are graphs in which one edge can connect more than two nodes, and Newman used exactly the same visualization for these hyperedges as Bloomfield had done in 1933. Bloomfield, of course, did not know that it was actually a rather complex network structure he was proposing.

Even more interesting than the complex graph structure is that hypergraphs can be likewise displayed as bipartite networks, in which we distinguish two fundamental kinds of nodes, and in which connections are only allowed between nodes of different kinds, without losing any information. In order to do so, one just converts all hyperedges into a node that connects to all nodes (languages in our case) to which the edges connect in the hypergraph. In the same way that Bloomfield labeled the hyperedges in his legend, we can label the isogloss nodes that connect to the languages. The following image shows the resulting bipartite network for Bloomfield's hypergraph:


If you now ask what this tells us after all, I will disappoint you — so far it does not tell us anything, it is just a display of data in a different fashion. Note, however, that hypergraph visualization is not a trivial problem, and if you have enclaves not sharing a trait, it may even be impossible to visualize hypergraphs in a two-dimensional space by just using one line that connects to all nodes. Bipartite networks are easier to handle in this regard. Even more importantly, however, bipartite graphs are also easy to handle algorithmically, and biologists are currently developing new methods to handle them (Corel et al. 2016).
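The conversion from a hypergraph to a bipartite network described above can be sketched in a few lines of Python. This is only a minimal illustration: the data contain just the two isoglosses that are spelled out in this post (not Bloomfield's full figure), and the function names `to_bipartite` and `to_hypergraph` are hypothetical helpers rather than part of any library.

```python
# Hypergraph: isogloss label (hyperedge) -> set of languages it encloses.
# Only the two isoglosses discussed in the text are included here.
HYPERGRAPH = {
    "sibilants for velars": {"Balto-Slavic", "Indo-Iranian", "Armenian", "Albanian"},
    "past e-": {"Ancient Greek", "Indo-Iranian"},
}


def to_bipartite(hypergraph):
    """Turn every hyperedge into a node linked to the languages it contains."""
    isogloss_nodes = set(hypergraph)
    language_nodes = set().union(*hypergraph.values())
    # Edges only ever connect an isogloss node with a language node.
    edges = {(iso, lang) for iso, langs in hypergraph.items() for lang in langs}
    return isogloss_nodes, language_nodes, edges


def to_hypergraph(edges):
    """Invert the conversion, showing that no information was lost."""
    hypergraph = {}
    for iso, lang in edges:
        hypergraph.setdefault(iso, set()).add(lang)
    return hypergraph


isoglosses, languages, edges = to_bipartite(HYPERGRAPH)
print(sorted(languages))
print(len(edges), "edges between", len(isoglosses), "isoglosses and", len(languages), "languages")
```

Because the bipartite network is just a set of nodes and a set of pairs, it can be passed on to standard network software, whereas the enclosing lines of the hypergraph display have to be drawn by hand.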

If we visualize the Bloomfield data in a bipartite network using network visualization software such as Cytoscape, we can conveniently explore the data, and arrange the nodes in order to search for patterns in the isoglosses. The following visualization, for example, shows that Bloomfield chose the data well in order to illustrate the amount of conflicting, apparently non-tree-like, signal in Indo-European languages (remember that linguists tend to dislike trees, but not necessarily in a productive way), as the data describes more of a circular structure than a strict hierarchy.


In order to really interpret this kind of data, however, we should not forget that this is still a data-display network. It is by no means a phylogenetic analysis, as we only show how a certain amount of data selected by a scholar is distributed over the given language groups. A true phylogenetic analysis will need to interpret these data, making bold claims about the history of those shared traits.

The existence of sibilants (s-like sounds, like [s, z, ʃ, ʒ]) for certain velar sounds (k-like sounds, like [k, g, x]), for example, is a trait shared by Balto-Slavic, Indo-Iranian, Armenian, and Albanian, but this does not mean that they all inherited it from a common ancestor, as the process of palatalization, by which velar sounds turn into affricates and fricatives (compare French cent, which goes back to Latin centum, pronounced with an initial [k]), is very frequent in the languages of the world, and may well reflect independent evolution.

Apart from independent development, which would actually force us to revise our network by deleting the respective edges (because they are not homologous in the strict sense), we may also have to deal with differential loss. This quite likely happened with the shared feature labeled as "past e-" in the network, referring to the past tense in Ancient Greek and Indo-Iranian, which was augmented by the prefix e-.

A further reason for those commonalities labelled as isoglosses by linguists may also be simple lateral transfer due to language contact.

Proponents of the wave theory have taken this kind of data as proof that the family tree model is essentially wrong. While I would agree that the family tree model shows only a certain aspect of language evolution, and may therefore be boring at times (and even wrong, if we do not manage to correctly interpret the nature of shared traits), I have a hard time understanding why linguists still insist that isogloss maps are an alternative model of language evolution. They are surely not, in the same way in which splits graphs are not phylogenetic networks, as David emphasized in a recent blogpost.

Unless we add the missing time dimension and analyse how the shared traits originated, isogloss maps and hypergraphs will remain nothing more than an interesting form of data visualization. Given the recent research on bipartite networks, however, we may have some hope that the mysterious waves in historical linguistics may not only find a formal model of representation, but even bring us to the point where we gain new insights into the history of our languages.

References
  • Bloomfield, L. (1973 [1933]) Language. Allen & Unwin: London.
  • Brugmann, K. (1884) Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • Čelakovský, F. (1853) Čtení o srovnavací mluvnici slovanské [Lectures on comparative grammar of Slavic]. V komisí u F. Řivnáče: Prague.
  • Corel, E., P. Lopez, R. Méheust, and E. Bapteste (2016) Network-thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol. 24.3: 224-237.
  • Darwin, C. (1859) On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. John Murray: London.
  • Newman, M. (2010) Networks. An Introduction. Oxford University Press: Oxford.

Tuesday, July 26, 2016

Can biologists learn from linguists?


Of course they can. Biologists who know nothing about linguistics can learn a lot about linguistics from linguists, including the most nerdy, the most boring, and the most interesting things.

However, it is obvious that the question in the title of this post implies a different object of learning, and a more precise title would have been "Can biologists learn about evolution from linguists?" As a linguist, I would of course also provide an affirmative answer, but I doubt that most biologists would agree. At the moment, we have a situation in which the majority of interdisciplinary papers state that linguists can learn from biologists. The opposite, that biologists can learn from linguistics, can rarely be found.

Biology to linguistics

An abundance of analogies between biology and linguistics has been noticed so far, and new analogies are regularly being proposed. When looking at the analogies that have been made so far, we find that most of them have never been really followed up. Languages, for example, have been compared with organisms (Schleicher 1848: 16f), species (Pagel 2009), microbes (Nelson-Sathi et al. 2011, List et al. 2014), mutualist symbionts (van Driem 2004), and populations (Mufwene 2001). Words have been compared with cells (Schleicher 1863: 23f), amino-acids (Zwick 1978), codons (Enguix et al. 2012, Jakobson 1973) and genes (Pagel 2009). Sounds (phonemes) have been compared with nucleic bases (Hruschka et al. 2015, Enguix et al. 2012) and atoms (Zwick 1978). Only a small number of these analogies have received broader attention, many have been rejected quickly after they were first proposed, and only recently has an explicit transfer of methods and models been initiated (Atkinson and Gray 2005).

The tenor of most recent studies, especially in the literature published during the past one to two decades, is often that we have finally realized that language evolution is, surprisingly, largely the same as biological evolution (for a recent account in this direction, see Pagel 2016). As a result, it is claimed that we can easily use biological methods to study language evolution. Indeed, so the story goes, we need to use them, since linguistics is in a poor state with no methods of its own, and linguists have never quantified what they know about the history of their languages. Then, finally, with these new methods developed in biology, we see light at the end of the tunnel, and we can draw nice trees of our languages and see how they evolved into their current shape.

I am completely in favour of increasing the objectivity in historical linguistics, making it a more data-driven and a more transparent discipline. I also advocate interdisciplinary transfer of methods and models, and there are quite a few things we can actually learn from biologists in linguistics. What I do not like is this tone, which suggests that biology is the discipline that saved linguistics, waking it up from its 200-year-long sleep in the ivory tower. At the same time, I also do not like the horror-scenarios in traditional linguistics, which state that quantitative approaches would deprive our discipline of all its wit (see the figure below as a not too serious attempt to visualize these two perspectives). In this context, it is quite interesting to look back in history and to recapitulate what actually happened.

The biological storm of bits and bytes: Will it destroy the ivory tower of historical linguistics
or ultimately help it to shine with a new gloss?

The discipline of historical linguistics is about 200 years old, starting with the legendary scholarly work of people like Rasmus Rask (Rask 1818), Jakob Grimm (Grimm 1822), and Franz Bopp (Bopp 1816). Using family trees to model language history goes back to the 17th century, pre-dating the first networks in biology by one century (see David's overview in Morrison 2016). The first explicit alignments showing homologous sounds across words occur at least as early as the beginning of the 20th century (Dixon and Kroeber 1919), cladistic frameworks date back to the second half of the 19th century (Brugmann 1886), and even algorithms for tree reconstruction based on distance data date back to the 1960s (Dyen's comment in Hymes 1960).

The discipline of historical linguistics can look back on a remarkable history of excellent scholarship. Thanks to this scholarship, we have gained invaluable insights, not only into the history of the world's languages, but also into the mechanisms that trigger linguistic diversity. It is undeniable that methods from evolutionary biology have given us some fresh insights during the past 20 years, but their actual influence is often exaggerated. On the one hand, our experience (since the quantitative turn in historical linguistics) shows that in most cases we cannot use biological methods to analyze our data directly. Instead, we need to carefully adapt them to our needs in order to get the best out of them (as I have tried to show in more detail in List 2014).

On the other hand, there is no example during the past 20 years that I know of where the modern biological methods have really revolutionized our insights into language history. They have undeniably shifted our attention towards data and quantification. They have exposed weak spots in our argumentation, and they have forced us to restate questions that we had forgotten to ask. But no new language family has been detected, no deeper genealogies between existing languages have been proposed, and no deeper insights into human prehistory have been achieved by the use of biological methods alone. Historical linguistics has profited from evolutionary biology, but not as a small oasis in the desert that was given water and seeds by the lords of bits and bytes, but as a discipline in which scholars learned to make active and critical use of interdisciplinary approaches.

Linguistics to biology

This brings us back to the question of the title. Can biology learn from linguistics? It has undoubtedly done so in the past. Tree-drawing in biology, for example, was popularized by Ernst Haeckel, who was himself influenced by the linguist August Schleicher (Sutrop 2012: 300). In the early days of genetics, a multitude of metaphors were borrowed from linguistics to describe biological phenomena with words like "alphabet", "word" (Gamov 1954), or "translation" (Crick 1959).

While not all biologists have been in favor of this tendency (see, for example, Shanon 1978), and the borrowing of terms does not necessarily imply methodological transfer, we also find examples for the explicit transfer of methods and theories from the linguistic to the biological domain. As an example, consider the theory of formal grammar (Chomsky 1959) which still plays a very important role in addressing certain problems in bioinformatics (Searls 1997), like RNA folding and protein structure analysis. Biological textbooks on sequence comparison still tend to include a chapter on formal grammars and their application in biology (Durbin et al. 1998).

Biology could also profit from linguistic insights in the future, and this becomes a bit clearer when we recall what Schleicher mentioned 150 years ago (and what has obviously been forgotten since then):
Observing how new forms descend from old ones can be done more straightforwardly and in a larger scale in linguistics than in biology. For once, the linguists have an advantage over the natural scientists. (Schleicher 1863: 18, my translation)
The advantage of linguistics, which Schleicher points out, is the availability of very concrete, very detailed, very valuable data in linguistics. This data allows us to see evolutionary forces at a level of detail of which biologists can only dream. Written sources allow us to trace the history of whole language families like Romance (and to some extent also Chinese dialects) from their ancestral speech varieties down to today. Language change is fast enough to allow us to investigate it in action. Recent topics in biology, like the importance of invoking a system perspective in evolution, have long since been debated and discussed in linguistics (Tynjanow and Jakobson 1928), since they are so much easier to detect.

In the past, when I worked intensively on the implementation of the Minimal Lateral Network method (Dagan and Martin 2007, Dagan et al. 2008) on linguistic data (List et al. 2014, List 2015), I stumbled upon numerous examples showing the limits of tree topology as a predictor for lateral transfer events. Given that the same necessarily also holds for lateral gene transfer, I asked myself whether these false positives and false negatives in the analyses would simply not matter due to the large amount of data in biology, or whether they were ignored due to the lack of good data for algorithmic evaluation. Later, when I read David's post on tardigrades and phylogenetic networks, where he pointed to two analyses on the same data that explained them once with lateral gene transfer (Boothby et al. 2015) and once with errors in the data (Koutsovoulos et al. 2015), I became aware of the strong advantage of my linguistic data, since I could test it against written records, tracing the history of words through centuries, thus being able to spot errors immediately when looking up a data point.

The detail of our data in linguistics is both a blessing and a curse. It enables us to write detailed word histories without ever having heard of tree reconciliation methods. On the other hand, it seduces us to get lost in details, forgetting about the bigger picture, and the bigger questions that we could ask, if this data was properly digitized and formalized. In this regard, historical linguistics still needs to learn from biology, as we have failed to turn historical linguistics into a modern, data-driven discipline. With more and more detailed data becoming available, however, the day will come when Schleicher is proven right, and when biologists can learn from linguists about evolution.

References
  • Atkinson, Q. and R. Gray (2005): Curious Parallels and Curious Connections: Phylogenetic Thinking in Biology and Historical Linguistics. Syst. Biol. 54.4. 513-526.
  • Boothby, T., J. Tenlen, F. Smith, J. Wang, K. Patanella, E. Osborne Nishimura, S. Tintori, Q. Li, C. Jones, M. Yandell, D. Messina, J. Glasscock, and B. Goldstein (2015): Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences 112.52. 15976-15981.
  • Bopp, F. (1816): Über das Conjugationssystem der Sanskritsprache in Vergleichung mit jenem der griechischen, lateinischen, persischen und germanischen Sprache. Nebst Episoden des Ramajan und Mahabharas in genauen metrischen Uebersetzungen aus dem Originaltexte und einigen Abschnitten aus den Veda’s. Andreäische Buchhandlung: Frankfurt am Main.
  • Brugmann, K. (1886): Einleitung und Lautlehre: Vergleichende Laut-, Stammbildungs- und Flexionslehre der Indogermanischen Sprachen [Introduction and Phonetics. Comparative Studies of Sound Systems, Stem Formations, and Inflexion Systems of Indo-European Languages]. Grundriß der vergleichenden Grammatik der indogermanischen Sprachen [Foundations of the comparative grammar of the Indo-European languages], vol. 1. Walter de Gruyter, Berlin, Leipzig.
  • Chomsky, N. (1959): On certain formal properties of grammars. Information and Control 2. 137-167.
  • Crick, F. (1959): The present position of the coding problem. The Brookhaven Symposia in Biology 12. 35-39.
  • Dagan, T. and W. Martin (2007): Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proceedings of the National Academy of Sciences 104.3. 870-875.
  • Dagan, T., Y. Artzy-Randrup, and W. Martin (2008): Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proceedings of the National Academy of Sciences 105.29. 10039-10044.
  • Dixon, R. and A. Kroeber (1919): Linguistic families of California. University of California Press: Berkeley.
  • van Driem, G. (2004): Language as organism: A brief introduction to the Leiden theory of language evolution. In: Lin, Y.-c., F.-m. Hsu, C.-c. Lee, J.-S. Sun, H.-f. Yang, and D.-a. Ho (eds.): Studies on Sino-Tibetan Languages. Academia Sinica: Taipei. 1-9.
  • Durbin, R., S. Eddy, A. Krogh, and G. Mitchinson (2002): Biological sequence analysis. Probabilistic models of proteins and nucleic acids. Cambridge University Press: Cambridge.
  • Enguix, G. and M. Jimenez-Lopez (2012): Natural language and the genetic code: From the semiotic analogy to biolinguistics. In: Proceedings of the 10th World Congress of the International Association for Semiotic Studies (IASS/AIS). 771-780.
  • Gamov, G. (1954): Possible relation between deoxyribonucleic acid and protein structures. Nature 173. 318.
  • Grimm, J. (1822): Deutsche Grammatik. Dieterichsche Buchhandlung: Göttingen.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1. 1-9.
  • Hymes, D. (1960): Lexicostatistics so far. Curr. Anthropol. 1.1. 3-44.
  • Jakobson, R. (1973): Six lectures on sound and meaning. MIT Press: Cambridge and London.
  • Koutsovoulos, G., S. Kumar, D. Laetsch, L. Stevens, J. Daub, C. Conlon, H. Maroon, F. Thomas, A. Aboobaker, and M. Blaxter (2015): The genome of the tardigrade Hypsibius dujardini. bioRxiv.
  • List, J.-M., S. Nelson-Sathi, H. Geisler, and W. Martin (2014): Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36.2. 141-150.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • List, J.-M. (2015): Network perspectives on Chinese dialect history. Bull. Chin. Linguist. 8. 42-67.
  • Morrison, D.A. (2016): Genealogies: Pedigrees and phylogenies are reticulating networks not just divergent trees. Evol. Biol. in press.
  • Mufwene, S. (2001): The ecology of language evolution. Cambridge University Press: Cambridge.
  • Nelson-Sathi, S., J.-M. List, H. Geisler, H. Fangerau, R. Gray, W. Martin, and T. Dagan (2011): Networks uncover hidden lexical borrowing in Indo-European language evolution. Proc. R. Soc. London, Ser. B 278.1713. 1794-1803.
  • Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews. Genetics 10. 405-415.
  • Pagel, M. (2016): Darwinian perspectives on the evolution of human languages. Psychonomic Bulletin & Review. 1-7.
  • Rask, R. (1818): Undersögelse om det gamle Nordiske eller Islandske sprogs oprindelse [Investigation of the origin of the Old Norse or Icelandic language]. Gyldendalske Boghandlings Forlag: Copenhagen.
  • Schleicher, A. (1848): Zur vergleichenden Sprachengeschichte. König: Bonn.
  • Schleicher, A. (1863): Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Hermann Böhlau: Weimar.
  • Searls, D. (1997): Linguistic approaches to biological sequences. Comput. Appl. Biosci. 13.4. 333-344.
  • Shanon, B. (1978): The genetic code and human language. Synthese 39.3. 401-415.
  • Sutrop, U. (2012): Estonian traces in the Tree of Life concept and in the language family tree theory. Journal of Estonian and Finno-Ugric Linguistics 3. 297-326.
  • Tynjanow, J. and R. Jakobson (1991): Probleme der Literatur- und Sprachforschung. In: Viehoff, R. (ed.): Alternative Traditionen.10. Vieweg: Braunschweig. 67-69.
  • Zwick, M. (1978): Some analogies of hierarchical order in biology and linguistics. In: Klir, G. (ed.): Applied General Systems Research: Recent Developments & Trends. Plenum Press: New York. 521-529.

Wednesday, April 13, 2016

Monogenesis, polygenesis, and militant agnosticism


When playing the cognate hunting game or the etymology identification game in historical linguistics, there are many different rules that one needs to keep in mind. Words that look similar are not necessarily related — they could be simple look-alikes (Trask 2000:202). If words are too similar, they could be borrowings. If we quote colleague X from the camp of linguists believing in theory t₁ we should make sure that we also quote colleague Y from the camp of linguists believing in the theory t₂, especially if we do not know the peer reviewers, etc.

A particularly important rule that is often surprising for biologists is the rule that says we can only compare languages that we know are related. We could, of course, compare all languages in the world (and people do compare all languages in the world), but the point is that we are not allowed to compare languages historically unless we know that they share a common origin. This rule is reflected in a long-standing debate regarding the question of how we can prove that two languages are related. Here, we have basically two opposing camps, one claiming that only grammar can prove language relationship, and one claiming that only the lexicon is suitable for that task (Dybo and Starostin 2008, Campbell and Poser 2008).

That we have to prove that two or more languages are related before we can start to compare them is in strong contrast to biology. The idea of multiple origins as an alternative to a single origin itself has also been discussed in evolutionary biology (David has shown this in an earlier blogpost dealing with networks with multiple roots). In linguistics, however, we are largely agnostic regarding the common origin of all languages, and the degree of agnosticism may go even so far that it acquires a missionary zeal. Attempts to explain how language evolved, that is, how language originated as a means for communication, always run the danger of being ridiculed by the linguistic community. Under very bad circumstances, they can even cast a very dark shadow on the linguistic reputation of those who proposed them.

Affirming our disinterest in the origin of language has a long tradition. In its Statuts from 1866 (published in 1871), the Société de Linguistique de Paris declared that it would not support any research on the origin of language. Even August Schleicher, the father of the language tree, affirmed this attitude in a letter to Ernst Haeckel (Schleicher 1863: 22), where he wrote:
It is impossible to presuppose a material descent of all languages from a single proto-language. (My translation, original text: "Eine so zu sagen materielle Abstammung aller Sprachen von einer einzigen Ursprache können wir also unmöglich voraussetzen.")
Although it is not explicitly spelled out nowadays, these statutes are still active in most linguistic institutes.
 
Being agnostic about the origin of language means that we cannot exclude the possibility that two languages, like, say, Chinese and English, are ultimately not related at all. And if they are ultimately not related, it would be futile to compare them in the hope of finding linguistic material that goes back to their common ancestor. Biologists, who usually take the Tree of Life for granted (albeit a bush in the end), might wonder about the reasoning behind this agnosticism in linguistics. The reasons are rather simple to state: If we make the very conservative assumption, based on archeological records, that human language originated about 100,000 years ago (Dediu and Levinson 2013), and contrast it with the first written records of languages (about 5,000 years ago), and the presumed time depths of our current comparative method (Meillet 1925, Weiss 2014), which optimistically allows us to reach 10,000 years back in time, we simply do not have the means to make any qualified linguistic hypothesis regarding the origin of all those 7,000 and more languages spoken today (count based on Hammarström et al. 2015).

The reasons why linguists prefer to maintain an agnostic attitude are completely comprehensible to me. Whether it is good to be agnostic is another question. And whether it is good to be as militant as some linguists are regarding the question of language origin is yet another one. For the context of evolutionary biology, for example, a little bit of agnosticism regarding the Tree of Life might bring up interesting dynamics. The same could be said about a little bit of "faith" in linguistics, be it that one believes that language originated independently in multiple places at the same or different times, or be it that one supports a monophyletic origin of a "Language of Eden". Neither of the theories has immediate impact on the way we pursue our historical comparison of languages. Even under a monogenesis assumption we would still need to prove a close affinity between languages before we could start comparing them with our traditional methods.

In the long run, however, it might help us to get some of the tension out of our long-standing debates. If we took monogenesis for granted, for example, people would be less afraid of comparing random pairs of languages, and in the long run we could gain new insights into distant relationships. If we rejected monogenesis, on the other hand, we could try to identify how many times language originated independently.

It is (and here you see my own agnostic attitude) not really important whether we stick to monogenesis or polygenesis in the end. What is important is that we are clear about the consequences that either of these two theories might have on our research in the future. Agnosticism is a useful attitude as long as it does not prevent us from asking questions. Following up on David's earlier blogpost, it seems clear to me that linguists especially might profit a lot from rooted network approaches that allow for multiple roots, since these would allow us to keep our agnosticism without suppressing our curiosity.

References
  • Campbell, L. and W. Poser (2008): Language Classification: History and Method. Cambridge University Press: Cambridge.
  • Dediu, D. and S. Levinson (2013): On the antiquity of language: the reinterpretation of Neandertal linguistic capacities and its consequences. Frontiers in Psychology 4.397. 1-17.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
  • Hammarström, H., R. Forkel, M. Haspelmath, and S. Bank (2015): Glottolog. Max Planck Institute for Evolutionary Anthropology: Leipzig. http://glottolog.org.
  • Meillet, A. (1954): La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.
  • Schleicher, A. (1863): Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Hermann Böhlau: Weimar.
  • Société Linguistique de Paris (1871): Statuts. Approuvés par décision ministérielle du 8 Mars 1866. Bulletin de la Société de Linguistique de Paris 1. III-IV.
  • Trask, R. (2000): The Dictionary of Historical and Comparative Linguistics. Edinburgh University Press: Edinburgh.
  • Weiss, M. (2014): The comparative method. In: Bowern, C. and N. Evans (eds.): The Routledge Handbook of Historical Linguistics. Routledge: New York. 127-145.

Wednesday, December 9, 2015

Lexicostatistics: the predecessor of phylogenetic analyses in historical linguistics


Phylogenetic approaches in historical linguistics are extremely common nowadays. Probabilistic models that model lexical change as a birth-death process of cognate sets evolving along a phylogenetic tree (Pagel 2009) are especially popular (Lee and Hasegawa 2011, Kitchen et al. 2009, Bowern and Atkinson 2012), but splits networks are also frequently used (Ben Hamed 2005, Heggarty et al. 2010).

However, the standard procedure to produce a family tree or network with phylogenetic software in linguistics goes back to the method of lexicostatistics, which was developed in the 1950s by Morris Swadesh (1909-1967) in a series of papers (Swadesh 1950, 1952, 1955). Lexicostatistics was discarded by the linguistic community not long after it was proposed (Hoijer 1956, Bergsland and Vogt 1962). Since then, lexicostatistics has been considered a methodus non gratus in classical circles of historical linguistics, and using it openly may drastically downgrade one's perceived credibility in certain parts of the community.

To avoid the conflicts, most linguists practicing modern phylogenetic approaches emphasize the fundamental differences between early lexicostatistics and modern phylogenetics. These differences, however, apply only to the way the data is analysed. The basic assumptions underlying the selection and preparation of data have not changed since the 1950s, and it is important to keep this in mind, especially when searching for appropriate phylogenetic models to analyse the data.

The Theory of Basic Vocabulary

Swadesh's basic idea was that in the lexicon of every human language there are words that are culturally neutral and functionally universal; and he used the term "basic vocabulary" to refer to these words. Culturally neutral hereby means that the meanings expressed by the words are independently used across different cultures. Functional universality means that the meanings are expressed by all human languages independent of the time and place where they are spoken. The idea is that these meanings are so important for the functioning of a language as a tool of communication, that every language needs to express them.

Cultural neutrality and functional universality guarantee two important aspects of basic words: their stability and their resistance to borrowing. Stability means that words expressing a basic concept are less likely to change their meaning or to be replaced by another word. An argument for this claim is the functional importance of the words — if the words are important for the functioning of a language, it would not make much sense to change them too quickly. Humans are good at changing the meanings of words, as we can see from daily conversations in the media, where new words tend to pop up seemingly on a daily basis, and old words often drastically change their meanings. But changing words that express basic meanings like "head", "stone", "foot", or "mountain" too often might give rise to confusion in communication. As a result, one can assume that words change at a different pace, depending on the meaning they express, and this is one of the core claims of lexicostatistics.

Resistance to borrowing also follows from stability, since the replacement of words expressing basic meanings may again have an impact on our daily communication, and we may thus assume that speakers avoid borrowing these words too quickly. Cultural neutrality of concepts is another important point to guarantee resistance to borrowing. Words expressing concepts which play an important cultural role may easily be transferred from one language to another along with the culture. Thus, although it seems likely that every language has a word for "god" or "spirit" and the like (so the concept is to a certain degree functionally universal), the lack of cultural independence makes words expressing religious terms very likely candidates for borrowing, and it is probably no coincidence that words expressing religion and belief rank first in the scale of borrowability (Tadmor 2009: 232).

Lexical Replacement, Data Preparation, and Divergence Time Estimation

Swadesh had further ideas regarding the importance of basic vocabulary. He assumed that the process of lexical replacement follows universal rates as far as the basic vocabulary is concerned, and that this would allow us to date the divergence of languages, provided we are able to identify the shared cognates. In lexical replacement, a word w₁ expressing a given meaning x in a language is replaced by a word w₂ which then expresses the meaning x, while w₁ either shifts to express another meaning, or completely disappears from the language. For example, the older English singular form thou was replaced by the plural form you, which now also expresses the singular. In order to search for cognates and determine the time when two languages diverged, Swadesh proposed a straightforward procedure, consisting of very concrete steps (compare Dyen et al. 1992):
  • Compile a list of basic concepts (concepts that you think are culturally neutral and functionally universal; see here for a comparative collection of different lists that have been proposed and used in the past).
  • Translate these concepts into the different languages you want to analyse.
  • Search for cognates between the languages in each meaning slot; if words in two languages are not cognate for a given meaning, then this points to former processes of lexical replacement in at least one of the languages since their divergence.
  • Count the number of shared cognates, and use some mathematics to calculate the divergence time (which has been independently calibrated using some test cases of known divergence times).
As an example for such a wordlist with cognate judgments, compare the table in the first figure, where I have entered just a few basic concepts from Swadesh's standard concept list and translated them into four languages. Cognacy is assigned with help of IDs in the column at the right of each language column, but also further highlighted with different colors.

Classical cognate coding in lexicostatistics
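
The counting and dating steps of this procedure can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the wordlist is a small illustrative sample (not the data from the figure), the function names are hypothetical, and the formula t = ln(c) / (2 ln(r)) is the one usually quoted for glottochronology, with the retention rate r = 0.86 per millennium (often cited for the 100-item list) treated here as an assumption.

```python
from math import log

# Toy wordlist (illustrative sample): concept -> {language: (word, cognate ID)}.
# Words that share an ID within a concept are judged to be cognate.
WORDLIST = {
    "hand":  {"German": ("Hand", 1), "English": ("hand", 1),
              "Italian": ("mano", 2), "French": ("main", 2)},
    "blood": {"German": ("Blut", 3), "English": ("blood", 3),
              "Italian": ("sangue", 4), "French": ("sang", 4)},
    "head":  {"German": ("Kopf", 5), "English": ("head", 6),
              "Italian": ("testa", 7), "French": ("tête", 7)},
    "tooth": {"German": ("Zahn", 8), "English": ("tooth", 8),
              "Italian": ("dente", 8), "French": ("dent", 8)},
}


def shared_cognate_ratio(wordlist, lang_a, lang_b):
    """Share of meaning slots in which both languages use cognate words."""
    compared = shared = 0
    for entries in wordlist.values():
        if lang_a in entries and lang_b in entries:
            compared += 1
            shared += entries[lang_a][1] == entries[lang_b][1]
    if compared == 0:
        raise ValueError("no meaning slot is filled in both languages")
    return shared / compared


def divergence_time(ratio, retention_rate=0.86):
    """Estimate t = ln(c) / (2 ln(r)) in millennia (r is an assumed constant)."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    if ratio == 1:
        return 0.0
    return log(ratio) / (2 * log(retention_rate))


for a, b in [("German", "English"), ("Italian", "French"), ("German", "Italian")]:
    c = shared_cognate_ratio(WORDLIST, a, b)
    print(a, b, c, round(divergence_time(c), 2))
```

With only four concepts the resulting dates are of course meaningless; the sketch is meant to show the shape of the computation, not to produce usable estimates.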

Phylogenetic Approaches in Historical Linguistics

Modern phylogenetic approaches in historical linguistics basically follow the same workflow that Swadesh propagated for lexicostatistics, the only difference being the last step of the working procedure. Instead of Swadesh's formula, which compared lexical replacement with radioactive decay and was based on aggregated distances in its core, character-based methods are used to infer phylogenetic trees. Characters are retrieved from the data by extracting each cognate from a lexicostatistical wordlist and annotating the presence or absence of each cognate set in each language.

Thus, while Swadesh's lexicostatistical data model would state that the words for "hand" in German and English were cognate, and also in Italian and French, but not in Germanic and Romance, the binary presence-absence coding states that the cognate set formed by words like English hand and German Hand is not present in Romance languages, and that the cognate set formed by words like Italian mano and French main is absent in Germanic languages. This is illustrated in the table below, where the same IDs and colors are used to mark the cognate sets as in the table shown above.

Presence-absence cognate coding for modern phylogenetic analyses
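
The conversion from a lexicostatistical wordlist to binary presence-absence characters can be sketched as follows. This is a minimal illustration under assumptions: the cognate IDs are arbitrary labels, the two concepts are a toy sample, and the function name is hypothetical. It is not the coding routine of any particular phylogenetic software package.

```python
# Toy wordlist (illustrative): concept -> {language: cognate ID}.
WORDLIST = {
    "hand": {"German": 1, "English": 1, "Italian": 2, "French": 2},
    "head": {"German": 3, "English": 4, "Italian": 5, "French": 5},
}


def presence_absence(wordlist):
    """Turn each cognate set into one binary character per language."""
    languages = sorted({lang for row in wordlist.values() for lang in row})
    cognate_ids = sorted({cid for row in wordlist.values() for cid in row.values()})
    # Collect the cognate sets attested in each language.
    attested = {lang: set() for lang in languages}
    for row in wordlist.values():
        for lang, cid in row.items():
            attested[lang].add(cid)
    # One column per cognate set: 1 if present in the language, else 0.
    matrix = {
        lang: [int(cid in attested[lang]) for cid in cognate_ids]
        for lang in languages
    }
    return cognate_ids, matrix


ids, matrix = presence_absence(WORDLIST)
print("cognate sets:", ids)
for lang, row in matrix.items():
    print(f"{lang:8}", row)
```

Note that the sketch treats a missing word exactly like an absent cognate set; in real datasets, gaps in the wordlist would need to be coded as missing data instead.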

The new way of cognate coding along with the use of phylogenetic software methods has brought, without doubt, many improvements compared to Swadesh's idea of dating divergence times by counting percentages of shared cognates. A couple of problems, however, remain, and one should not forget them when applying computational methods to originally lexicostatistic datasets.

First, we could ask whether the main assumptions of functional universality and cultural neutrality really hold. It seems to be true that words can be remarkably stable throughout the history of a language family. It is, however, also true that the most stable words are not necessarily the same across all language families. Ever since Swadesh established the idea of basic vocabulary, scholars have tried to improve the list of basic vocabulary items. Swadesh himself started from a list of 215 concepts (Swadesh 1950), which he then reduced to 200 concepts (1952) and then later to 100 concepts (1955). Other scholars went further, like Dolgopolsky (1964 [1986]) who reduced the list to 16 concepts. The Concepticon is a resource that links many of the concept lists that have been proposed in the past. When comparing these lists, which all represent what some scholars would label "basic vocabulary items", it becomes obvious that the number of items that all scholars agree upon sinks drastically, while the number of concepts that have been claimed to be basic increases.
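
The effect described here (a shrinking core of agreement and a growing pool of candidate concepts) is easy to reproduce with set operations. The three lists below are invented purely for illustration and do not correspond to any published concept list; the function name is likewise hypothetical.

```python
# Invented toy concept lists (NOT real published lists).
TOY_LISTS = {
    "list_A": {"hand", "head", "stone", "foot", "mountain"},
    "list_B": {"hand", "head", "stone", "water"},
    "list_C": {"hand", "stone", "fire", "name"},
}


def agreement_profile(lists):
    """Track the shared core and the total pool as lists are added one by one."""
    common, pool = None, set()
    profile = []
    for name, concepts in lists.items():
        # Concepts found in every list seen so far.
        common = set(concepts) if common is None else common & concepts
        # Concepts found in at least one list seen so far.
        pool |= concepts
        profile.append((name, len(common), len(pool)))
    return profile


for name, n_common, n_pool in agreement_profile(TOY_LISTS):
    print(f"{name}: {n_common} concepts in all lists so far, {n_pool} in total")
```

The same comparison could be run on real concept lists, for example those linked in the Concepticon, once they are mapped to a common set of concept labels.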

An even greater problem than the question of universality and neutrality of basic vocabulary, however, is the underlying model of cognacy in combination with the proposed process of change. Swadesh's model of cognacy controls for meaning. While this model of cognacy is consistent with Swadesh's idea of lexical replacement as a basic process of lexical change, it is by no means consistent with birth-death models of cognate gain and cognate loss if they are created from lexicostatistical data. In biology, birth-death models are usually used to model the evolution of homologous gene families distributed across whole genomes. If we use the traditional view according to which words can be cognate regardless of meaning, the analogy holds, and birth-death processes seem to be adequate in order to analyze datasets that are based on these root cognates (Starostin 1989) or etymological cognates (Starostin 2013). But if we control for meaning in the cognate judgments, we do not necessarily capture processes of gain and loss in our data. Instead, we capture processes in which links between word forms and concepts are shifted, and we investigate these shifts through the very narrow "windows" of pre-defined slots of basic concepts, as I have tried to depict in the following graphic.

Looking at lexical replacement through the small windows of basic vocabulary
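
The difference between the two notions of cognacy can be made concrete with a small sketch. It uses two well-known Germanic examples (English hound and German Hund, English deer and German Tier); the root labels, the data layout, and the function name are assumptions made for illustration only.

```python
# Each entry: (language, word form, meaning, root label).
# Root labels are arbitrary names for etymological cognate sets.
LEXICON = [
    ("German", "Hund", "dog", "HUND"),
    ("English", "dog", "dog", "DOG"),
    ("English", "hound", "hunting dog", "HUND"),
    ("German", "Tier", "animal", "TIER"),
    ("English", "animal", "animal", "ANIMAL"),
    ("English", "deer", "deer", "TIER"),
]

# The pre-defined "windows": only these meaning slots are inspected.
BASIC_CONCEPTS = {"dog", "animal"}


def roots_present(lexicon, language, concepts=None):
    """Roots attested in a language, optionally restricted to given meanings."""
    return {
        root
        for lang, _form, meaning, root in lexicon
        if lang == language and (concepts is None or meaning in concepts)
    }


# Meaning-controlled view: HUND and TIER look "lost" in English.
print(sorted(roots_present(LEXICON, "English", BASIC_CONCEPTS)))
# Root-cognate view: both roots survive, only with shifted meanings.
print(sorted(roots_present(LEXICON, "English")))
```

Under the meaning-controlled coding, a binary character for the HUND root would be scored as absent in English, although the root has not been lost at all; what changed is the link between form and concept.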

Conclusion

As David has mentioned before: We do not necessarily need realistic models in phylogenetic research to infer meaningful processes. The same can probably be said about the discrepancy between our lexicostatistical datasets (Swadesh's heritage, which we keep using for practical reasons) and the birth-death models we now use to analyse the data. Nevertheless, I cannot avoid an uncomfortable feeling when thinking that an algorithm is modeling gain and loss of characters in a dataset that was not produced for this purpose. In order to model the traditional lexicostatistical data consistently, we would either (i) need explicit multistate-models in which concepts are a character and the forms represent the states (Ringe et al. 2002, Ben Hamed and Wang 2006), or (ii) we should directly turn to "root-cognate" methods. These methods have been discussed for some time now (Starostin 1989, Holm 2000), but there is only one recent approach by Michael et al. (forthcoming) in which this is consistently tested.
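
Option (i) can be sketched as a multistate coding in which each concept is one character and each cognate set is one of its states. The toy wordlist, the state letters, and the function name are assumptions for illustration; this is not the coding scheme of any of the cited studies.

```python
from string import ascii_uppercase

# Toy wordlist (illustrative): concept -> {language: cognate ID}.
WORDLIST = {
    "hand": {"German": 1, "English": 1, "Italian": 2, "French": 2},
    "head": {"German": 3, "English": 4, "Italian": 5, "French": 5},
}


def multistate_coding(wordlist, missing="?"):
    """One character per concept; cognate sets become states A, B, C, ..."""
    languages = sorted({lang for row in wordlist.values() for lang in row})
    concepts = sorted(wordlist)
    rows = {lang: "" for lang in languages}
    for concept in concepts:
        row = wordlist[concept]
        # Map the cognate IDs of this concept to state letters.
        # (This simple sketch supports at most 26 states per concept.)
        states = {cid: ascii_uppercase[i] for i, cid in enumerate(sorted(set(row.values())))}
        for lang in languages:
            rows[lang] += states[row[lang]] if lang in row else missing
    return concepts, rows


concepts, rows = multistate_coding(WORDLIST)
print("characters:", concepts)
for lang, states in rows.items():
    print(f"{lang:8}", states)
```

In contrast to the binary coding, a replacement in a meaning slot now appears as a change of state within one character rather than as an independent loss of one character and gain of another.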

References
  • Bergsland, K. and H. Vogt (1962): On the validity of glottochronology. Curr. Anthropol. 3.2. 115-153.
  • Bowern, C. and Q. Atkinson (2012): Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88. 817-845.
  • Dolgopolsky, A. (1964): Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2. 53-63.
  • Dyen, I., J. Kruskal, and P. Black (1992): An Indoeuropean classification. A lexicostatistical experiment. T. Am. Philos. Soc. 82.5. iii-132.
  • Ben Hamed, M. and F. Wang (2006): Stuck in the forest: Trees, networks and Chinese dialects. Diachronica 23. 29-60.
  • Hoijer, H. (1956): Lexicostatistics. A critique. Language 32.1. 49-60.
  • Holm, H. (2000): Genealogy of the main Indo-European branches applying the separation base method. J. Quant. Linguist. 7.2. 73-95.
  • Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
  • Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
  • Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews. Genetics 10. 405-415.
  • Ringe, D., T. Warnow, and A. Taylor (2002): Indo-European and computational cladistics. T. Philol. Soc. 100.1. 59-129.
  • Starostin, S. (1989): Sravnitel'no-istoričeskoe jazykoznanie i leksikostatistika [Comparative-historical linguistics and lexicostatistics]. In: Kullanda, S., J. Longinov, A. Militarev, E. Nosenko, and V. Shnirel'man (eds.): Materialy k diskussijam na konferencii [Materials for the discussion on the conference]. 1. Institut Vostokovedenija: Moscow. 3-39.
  • Starostin, G. (2013): Lexicostatistics as a basis for language classification. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 125-146.
  • Swadesh, M. (1950): Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
  • Swadesh, M. (1952): Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philos. Soc. 96.4. 452-463.
  • Swadesh, M. (1955): Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.
  • Tadmor, U. (2009): Loanwords in the world’s languages. Findings and results. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world's languages. de Gruyter: Berlin and New York. 55-75.