The Genealogical World of Phylogenetic Networks: The complexity of lexical change

Most computational approaches to historical linguistics, whether they produce networks or trees, make use of lexical data. There are several reasons for this preference. Lexical data is much easier to handle than abstract grammatical data. Many linguists also think that lexical data is more representative of language evolution in general, and thus offers a much better starting point for inferences. Whether one likes the preference for lexical data or not, it seems worthwhile in this context to reflect a bit more on the nature of lexical data and the complexities of lexical change. This may help to get a clearer picture of the differences between language history and biological evolution.

What Makes a Word?

In a very simple language model, the lexicon of a language can be seen as a bag of words. A word, furthermore, is traditionally defined by two aspects: its form and its meaning. Thus, the French word arbre can be defined by its written form arbre or its phonetic form [ɑʁbʁə], and its meaning "tree". This is reflected in the famous sign model of Ferdinand de Saussure (Saussure 1916), which I have reproduced in [A] in the graphic below. In order to emphasize the importance of the two aspects, linguists often say that form and meaning of a word are like two sides of the same coin (see [B] in the graphic below). But we should not forget that a word is only a word if it belongs to a certain language! From the perspective of the German or the English language, for example, the sound chain [ɑʁbʁə] is just meaningless. So, instead of two major aspects of a word, we may better talk of three major aspects: form, meaning, and language. As a result, our bilateral sign model becomes a trilateral one, as I have tried to illustrate in [C] in the graphic below.

What is Lexical Change?

If there were no lexical change, the lexicon of a language would remain stable for all time. Words might change their forms by means of regular sound change, but there would always be an unbroken tradition of identical patterns of denotation. This is not the case, however: the lexicon of every language is constantly changing. Words are lost when speakers cease to use them, and new words enter the lexicon when new concepts arise, whether they are borrowed from other languages or created from native material via different morphological processes. Such processes of word loss and word gain are quite frequent and can sometimes even be observed directly by the speakers of a language when they compare their own speech with that of an older or a younger generation.

An even more important process of lexical change, especially in quantitative historical linguistics, is the process of lexical replacement. Lexical replacement refers to the process by which a given word A, which is commonly used to express a certain meaning x, ceases to express this meaning, while at the same time another word B, which was formerly used to express a meaning y, comes to express the meaning x. The notion of lexical replacement is thus nothing other than a shift in the perspective on semantic change (as one major dimension of lexical change, see below). While semantic change is usually described from a semasiological perspective, i.e. from the perspective of the form, lexical replacement describes semantic change from an onomasiological perspective, i.e. from the perspective of the meaning.
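The two perspectives can be made concrete with a small sketch. It assumes, purely for illustration, that a lexicon is a mapping from meanings to the single word that expresses each of them (so synonyms are ignored), and it uses the abstract labels A, B, x and y from the paragraph above. The function names are hypothetical, and what expresses y after the change is left open, since the definition does not say.

```python
def semasiological(lexicon):
    """Invert a meaning -> word mapping into word -> set of meanings."""
    view = {}
    for meaning, word in lexicon.items():
        view.setdefault(word, set()).add(meaning)
    return view


def onomasiological_diff(before, after):
    """Meanings that are expressed by a different word at the later stage."""
    return {
        meaning: (before[meaning], after[meaning])
        for meaning in before
        if meaning in after and before[meaning] != after[meaning]
    }


def semasiological_diff(before, after):
    """Words that express a different set of meanings at the later stage."""
    old, new = semasiological(before), semasiological(after)
    return {
        word: (old[word], new[word])
        for word in old
        if word in new and old[word] != new[word]
    }


# Earlier stage: A expresses x, B expresses y.
before = {"x": "A", "y": "B"}
# Later stage: B expresses x (assumption: the expression of y is not recorded).
after = {"x": "B"}

print(onomasiological_diff(before, after))  # {'x': ('A', 'B')}
print(semasiological_diff(before, after))   # {'B': ({'y'}, {'x'})}
```

The same event shows up as a replacement of A by B when we start from the meaning x, and as a semantic change from y to x when we start from the word B.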

Three Dimensions of Lexical Change

Gévaudan (2007) distinguishes three dimensions of lexical change: the morphological dimension, the semantic dimension, and the stratic dimension. The morphological dimension points to changes in the outer form of the words which are not due to regular sound change. As an example of this type of change, consider English birth and its ancestral form Proto-Germanic *ga-burdi "birth" — while the meaning of the word did not change (or at least only slightly), the English word apparently lost the prefix ga-. This prefix is still present in the German Geburt "birth", but it was lost without leaving a trace in English.

The loss of prefixes is not the only way in which words can change during language evolution. We also find that prefixes or suffixes are added, as, for example, in French soleil "sun", which goes back to Latin soliculus "small sun, sunny" which is itself a derivation of Latin sol "sun". The semantic dimension is illustrated by changes like the one from Proto-Germanic *sælig "happy" to English silly.

The stratic dimension refers to changes involving the exchange of words between languages, that is, processes of borrowing, in which a word is transferred from one stratum of a language to another. An example of this type of change is English mountain, which was borrowed from Old French montaigne "mountain".

Note that these three dimensions of lexical change correspond directly to the three major aspects constituting a linguistic sign (or a word) that I mentioned above: the morphological dimension changes the form of a word, the semantic dimension changes its meaning, and the stratic dimension changes its language. Thus, the three dimensions of lexical change, as proposed by Gévaudan (2007), find their direct reflection in the major dimensions according to which words can vary.
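As a minimal sketch of this correspondence, the trilateral sign can be written as a record with three fields, and each dimension of change can be tied to the field it affects. The class and function names are hypothetical. The dimensions are annotated by hand rather than computed from the strings, because regular sound change and ordinary inheritance also alter the form and the language label without counting as morphological or stratic change. The gloss "foolish" for English silly is my own addition.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """Each dimension of lexical change, valued by the aspect it affects."""
    MORPHOLOGICAL = "form"
    SEMANTIC = "meaning"
    STRATIC = "language"


@dataclass(frozen=True)
class Sign:
    """A trilateral sign: a form with a meaning in a language."""
    form: str
    meaning: str
    language: str


@dataclass(frozen=True)
class Change:
    """A development from one sign to another, annotated by a linguist."""
    source: Sign
    target: Sign
    dimensions: frozenset  # hand-annotated set of Dimension members


def affected_aspects(change):
    """Return the aspects of the sign touched by the annotated dimensions."""
    return sorted(d.value for d in change.dimensions)


birth = Change(
    Sign("*ga-burdi", "birth", "Proto-Germanic"),
    Sign("birth", "birth", "English"),
    frozenset({Dimension.MORPHOLOGICAL}),  # loss of the prefix ga-
)
silly = Change(
    Sign("*sælig", "happy", "Proto-Germanic"),
    Sign("silly", "foolish", "English"),
    frozenset({Dimension.SEMANTIC}),
)
mountain = Change(
    Sign("montaigne", "mountain", "Old French"),
    Sign("mountain", "mountain", "English"),
    frozenset({Dimension.STRATIC}),  # borrowing
)

for change in (birth, silly, mountain):
    print(change.target.form, affected_aspects(change))
```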

During language evolution, lexical change processes interact in all three dimensions, and yield complex patterns which may be very hard to uncover for historical linguists. As an example of this complexity, consider the development of Proto-Indo-European *bʰreu̯Hg̑- "to use", as depicted in the graphic below, which was originally designed by Hans Geisler (Heinrich-Heine University, Düsseldorf), who kindly allowed me to reproduce it here. In the graphic, changes in the stratic dimension are illustrated with the help of dotted arcs (the legend labels this as "borrowed from"), and changes in the morphological dimension are indicated by double arcs (labelled as "derived from"). The semantic dimension is not specifically labelled as such, but one can easily detect it by comparing the meanings of the words.

Modeling Lexical Change

If we look at different historical relations from the perspective of the three dimensions of lexical change, it becomes obvious that the terminology we use in linguistics is rather fuzzy. I mentioned this in an earlier post, where I pointed to the different shades of cognacy, which were never really settled in a satisfying way in historical linguistics. Seen from the perspective of the three dimensions, the origin of these different historical relations between words becomes much clearer.

If we investigate the different uses of the term "cognacy", for example, it becomes obvious that the differences result from controlling for one or more of the three dimensions of lexical change. The traditional Indo-Europeanist notion of cognacy, for example, controls the stratic dimension by requiring stratic continuity (no borrowing), but at the same time it is indifferent regarding the other two dimensions. Cognacy à la Swadesh (especially Swadesh 1955), as we know it from the popular computational approaches which model lexical change as a process of cognate loss and gain, is indifferent regarding morphological continuity, but controls the semantic and the stratic dimensions by only considering words that have the same meaning and have not been borrowed (at least in theory).
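The two notions of cognacy described above can be sketched as profiles over the three dimensions. In this sketch, "+" means that continuity is required, "-" that change is required, and "+/-" that the notion is indifferent. The dictionary layout and the function `satisfies` are hypothetical, and the continuity judgments for each word pair are supplied by hand, not inferred.

```python
# Requirement profiles for the two notions of cognacy described in the text.
NOTIONS = {
    "cognacy (Indo-Europeanist)": {
        "morphological": "+/-", "semantic": "+/-", "stratic": "+",
    },
    "cognacy (Swadesh)": {
        "morphological": "+/-", "semantic": "+", "stratic": "+",
    },
}


def satisfies(profile, continuity):
    """Check a word pair against a profile.

    `continuity` maps each dimension to True if the history connecting the
    two words is continuous in that dimension, and to False if it changed.
    """
    for dimension, requirement in profile.items():
        if requirement == "+" and not continuity[dimension]:
            return False
        if requirement == "-" and continuity[dimension]:
            return False
    return True


# English birth / German Geburt: prefix lost in English, same meaning,
# no borrowing.
birth_geburt = {"morphological": False, "semantic": True, "stratic": True}
# English mountain / Old French montaigne: connected by borrowing.
mountain_montaigne = {"morphological": True, "semantic": True, "stratic": False}
# Hypothetical inherited pair whose meanings have drifted apart.
shifted_pair = {"morphological": True, "semantic": False, "stratic": True}

for name, profile in NOTIONS.items():
    print(
        name,
        satisfies(profile, birth_geburt),
        satisfies(profile, mountain_montaigne),
        satisfies(profile, shifted_pair),
    )
```

Under these assumptions, the pair with shifted meanings counts as cognate in the Indo-Europeanist sense but not in the Swadesh sense, while the borrowed pair is excluded by both.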

In the table below, I have attempted to illustrate in which way the different terms, including the biological terms of homology, orthology, paralogy, and xenology, cover processes by controlling each for one or more of the three dimensions of lexical change (with "+" indicating that continuity is required, "-" indicating that change is required, and "+/-" indicating indifference). Contrasting the different dimensions of lexical change with the terminology used to refer to different relations between words shows not only the arbitrariness of the traditional linguistic terminology (why do we only cover two out of 3 * 3 * 3 = 27 different possible types? why do we only control by requiring continuity, not change? etc.), but also the fundamental difference between biological and linguistic terminology.
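The count of 27 possible types can be checked by enumerating every combination of the three values over the three dimensions. This is only a sketch of the arithmetic. The variable names are hypothetical, and no claim is made about which of the combinations correspond to established terms.

```python
from itertools import product

DIMENSIONS = ("morphological", "semantic", "stratic")
VALUES = ("+", "-", "+/-")  # continuity required, change required, indifferent

# Every possible way of controlling for the three dimensions.
profiles = [
    dict(zip(DIMENSIONS, combination))
    for combination in product(VALUES, repeat=len(DIMENSIONS))
]
print(len(profiles))  # 27

# Profiles that never require change, i.e. that use only "+" and "+/-".
continuity_only = [p for p in profiles if "-" not in p.values()]
print(len(continuity_only))  # 8
```

Even among the 8 profiles that only require continuity or are indifferent, the two linguistic notions of cognacy discussed above occupy just two slots.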

Concluding Remarks

So far, all computational methods that have been proposed for historical linguistics are based on the strict Swadesh type of wordlist encoding, which in the end controls for the semantic and stratic dimensions of lexical change and is indifferent regarding morphology. Such an encoding is per se inconsistent, since there is no reason to assume that morphological change would be less frequent or less indicative of language history than any of the other types.

Linguists tend to control for meaning when creating their datasets mostly because of sampling problems: it is much easier to draw a set of words from a couple of languages by starting from a given set of meanings. However, it may be useful to relax this criterion, since the restricted sets of only about 200 meanings on average necessarily hide vivid and interesting processes of lexical change.

The reasons why linguists control for borrowing are purely historical, and in many cases such control is not even feasible, since our evidence for borrowing may be limited, especially in cases where the majority of speakers are bilingual (which is more often the rule than the exception in the languages of the world). It seems much more fruitful to revive network thinking in linguistics and to invest in the development of high-quality datasets with a less arbitrary exclusion of certain dimensions of lexical change, and in transparent computational methods which do not stick exclusively to the tree model.

References

Gévaudan, P. (2007) Typologie des lexikalischen Wandels [Typology of lexical change]. Tübingen: Stauffenburg.
Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics. Vol. 21(2), pp. 121-137.
Saussure, F. de (1916) Cours de linguistique générale [Course on general linguistics]. Lausanne: Payot.