Monday, August 27, 2018

Regular cognates: A new term for homology relations in linguistics

The identification of homologous words between genealogically related languages is one of the crucial tasks in historical linguistics. In contrast to biology where, especially at the level of genetic sequences, we find a rather rich terminology contrasting different types of homology among genes and gene sequences, linguistic terminology is still not very precise. Most scholars seem to be content if they can claim that they have identified words that are cognate, which means that they are homologous but have not been borrowed throughout their history.

On various occasions in the past, I have tried to work on a more precise terminology for linguistic frameworks (see for example List 2014 and List 2016, or this earlier blogpost on homology in linguistics). In this context, I have often tried to emphasize that we need to be specifically more careful with the problem of partial cognacy in linguistics, since many words across related languages are not fully homologous, but show homology only in specific parts (List et al. 2016).

Thanks to an increase in accurately annotated linguistic data, resulting specifically from my very productive collaboration with Nathan W. Hill (SOAS, London) on the Burmish languages (see Hill and List 2017), my view has now again changed a bit, and I thought it would be useful to share it here.

Cognacy and homology

The starting point for my earlier proposals to refine the notion of cognacy in linguistics was the rather refined distinction between orthologs, paralogs, and xenologs in molecular biology (Fitch 2000). To account for the distinction between directly inherited (orthologs), duplicated (paralogs), and laterally transferred genes (xenologs), I proposed the terms direct cognates, indirect cognates (inspired by the term oblique cognates by Trask 2000), and indirectly etymologically related words or morphemes (word parts).

The first and last terms are more or less straightforward with respect to linguistic processes. The notion of indirect cognates, however, turned out to be insufficient, given that it is not clear which processes lead to indirect cognacy. Originally, I thought of morphological processes, that is, processes of word formation, by which a word is slightly modified to account for a slightly derived meaning (usually involving processes like suffixation or compounding). My idea was that words that have "experienced" these processes would behave similarly to genes that have been duplicated in biological evolution, and that it would be sufficient to just assign them to a common sub-class of cognates.

However, the research with Nathan W. Hill recently revealed that these terms are insufficient to capture the processes underlying lexical change in historical linguistics.

In order to understand this idea, it is useful to go back to the biological terms and have a closer look at how they distinguish the underlying processes. As far as I understand it, a directly inherited gene sequence may differ from its ancestral sequence due to processes of random mutation, by which the original gene sequence becomes modified throughout its history. In cases of paralogy, the original gene sequence is duplicated and both copies are subsequently inherited. The copies may, during this process, become more different from each other than would be expected when assuming direct inheritance and random mutation. Similarly, in cases of lateral transfer of genetic material, the changes may again be different from the ones introduced by "normal" random mutation.

If we adopt the view of "normal change", as it is employed in the biological processes, we find a counterpart in the process of sound change in linguistics. As I have mentioned earlier, sound change is a systemic process by which certain sounds in certain environments change regularly across all words in the lexicon of a given language. This process is definitely not comparable with random mutation in sequence evolution, since the process involves a class of "letters" in the sound system of a language that are systematically turned into another sound. However, regarding the crucial role that sound change plays in language evolution, it seems that it is in some sense comparable with random mutation resulting in orthologous genes. Sound change is something like the baseline of what happens when languages change, and we have the means to identify its traces by searching for regular sound correspondence patterns across related languages (see my earlier blogpost on this matter).

Sound change is the default process, and one that can be handled with some confidence. Other processes are much harder to handle: word formation, semantic change, or the notorious process of analogical leveling, by which complex paradigms are transformed to reduce complexity, but by which new complexities can also emerge (compare the irregular German plural Morgen-de "mornings", which is built on the template of Abend-e "evenings"). This is also the reason why Gévaudan (2007) does not include sound change among the major processes of lexical change. If we take sound change as the default process of language change and as our key evidence for homologous word relations, however, this means that we can no longer make the distinction between direct and indirect cognates following my earlier proposal, since indirect cognates do not necessarily reflect instances of irregular sound change.

This is in fact easy to illustrate. If we follow the former definition of indirect cognacy, the comparison of German Handschuh "glove" (lit. hand-shoe) with English hand would reflect indirect cognacy, since the German word is a compound of Hand "hand" and Schuh "shoe", and thus a derived word form. The morpheme Hand in this example, however, is phonetically identical with German Hand, and the sound correspondences between the English word and the first element of the German compound are still regular by all means. In fact, only a small number of word formation processes in language evolution also affect the pronunciation of the base forms.
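The Handschuh example can be made concrete with a small sketch. The Python snippet below is only an illustration: the `Morpheme` class, the numeric cognate IDs, and the two helper functions are hypothetical names introduced here, not part of any existing tool. It shows that a comparison at the word level and a comparison at the morpheme level give different answers for the same pair of words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str
    cognate_id: int  # hypothetical identifier: same ID = same etymological source


# Morpheme segmentation as described in the text: Handschuh = Hand + Schuh.
# The IDs 1 and 2 are arbitrary labels chosen for this sketch.
handschuh = [Morpheme("Hand", 1), Morpheme("schuh", 2)]
hand = [Morpheme("hand", 1)]


def fully_cognate(word_a, word_b):
    """True if both words consist of the same sequence of cognate morphemes."""
    return [m.cognate_id for m in word_a] == [m.cognate_id for m in word_b]


def partially_cognate(word_a, word_b):
    """True if the words share at least one cognate morpheme."""
    ids_a = {m.cognate_id for m in word_a}
    ids_b = {m.cognate_id for m in word_b}
    return bool(ids_a & ids_b)


print(fully_cognate(handschuh, hand))      # False
print(partially_cognate(handschuh, hand))  # True
```

Neither function says anything about whether the sound correspondences inside the shared morpheme are regular, which is exactly the gap in the direct/indirect distinction discussed here.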

This means, in turn, that any distinction of cognate word forms (and word parts, i.e., morphemes) into direct and indirect ones that is based on the absence or presence of morphological (= word formation) processes does not tell us much about the degree to which the sound change affecting these word forms was regular. We could state that direct cognates should always reflect regular sound change, since any irregularity would have to be accounted for by alternative explanations (e.g., shortening of a given word due to frequent use, assimilation of sounds serving the ease of pronunciation, etc.).

I wonder whether this would be useful for the initial idea behind the concept of direct cognacy. If we find direct cognates, that is, words that we assume were used by a couple of languages without further modification, apart from regular sound change and potentially sporadic sound changes, it seems still useful to assume that these reflect vertical language history better than cognate sets with residues that were exposed to various morphological processes. Thus, when coding direct cognacy in linguistic datasets, sporadic sound change (if it can be illustrated properly) should not serve as an argument against direct cognacy.

The only way around this problem seems to be to establish a further shade of cognacy, which describes the relations among words and morphemes that have been only affected by sound change, in contrast to words whose history reflects various morphological derivations that impact directly on pronunciation, or processes of irregular sound change due to analogical leveling or assimilation. While I first thought that the biological term ortholog would be useful to describe these specific word relations in linguistics, I realized later that, judging from the Ancient Greek meaning of ortholog (ortho "straight, direct" + logos "relation"), the fact that differences are due to regular sound change is not that neatly reflected.

For now, I think that it should be sufficient to use the term regular cognates for those words or word parts for which we can demonstrate that their change followed the regular "laws" of sound change. Regular cognates are thus defined as words or word parts that have been affected only by sound change during their history. This notion deliberately excludes differences in meaning, frequency of use, or whether the word forms are only reflected in compounds or derived word forms. In fact, for some cases, we could even propose that only parts of a word form that no longer bear any meaning of their own (e.g., the first two sounds of a word form) are regular cognates, as long as we can propose good arguments for the regularity of the correspondences.
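To make the definition a bit more tangible, here is a minimal sketch in Python of how one could check regularity on already aligned word pairs. Everything in it is an assumption made for illustration: the toy alignments are invented and do not come from any real language, the function names are hypothetical, and the simple frequency threshold stands in for the much more careful reasoning about correspondence patterns and phonetic environments that the comparative method requires.

```python
from collections import Counter


def correspondence_counts(alignments):
    """Count how often each aligned segment pair occurs across all alignments."""
    counts = Counter()
    for pairs in alignments.values():
        counts.update(pairs)
    return counts


def regular_columns(pairs, counts, threshold=2):
    """Flag each aligned column as regular if its pair recurs at least `threshold` times."""
    return [counts[pair] >= threshold for pair in pairs]


def is_regular_cognate(pairs, counts, threshold=2):
    """A word pair counts as regular here only if every aligned column is regular."""
    return all(regular_columns(pairs, counts, threshold))


# Invented toy alignments between two hypothetical languages (concept -> column pairs).
toy = {
    "A": [("p", "f"), ("a", "a"), ("t", "d")],
    "B": [("p", "f"), ("a", "a"), ("k", "g")],
    "C": [("t", "d"), ("a", "a"), ("k", "g")],
    "D": [("t", "d"), ("a", "a"), ("p", "b")],  # ("p", "b") occurs only once
}

counts = correspondence_counts(toy)
print(is_regular_cognate(toy["A"], counts))  # True
print(regular_columns(toy["D"], counts))     # [True, True, False]
```

The per-column output for "D" mirrors the point made above: even when a word pair as a whole fails the test, the columns that do follow recurring correspondences can still be treated as a regular part.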

Note that our tools for alignment analyses in historical linguistics already account for this property. The EDICTOR (List 2017), a web-based tool for editing, analyzing, and publishing etymological dictionaries, allows users to exclude those parts from an alignment that are assumed to be irregular, as can be seen in the following illustrative alignment of Proto-Germanic *bakanan "to bake". Scholars who want to be explicit about what parts of an alignment they consider to be regular can use this annotation framework to provide more refined analyses.

EDICTOR alignment of regular cognates for Proto-Germanic *bakanan "to bake"
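The idea of excluding parts of an alignment can be sketched in a few lines of Python. Note the assumptions: the bracket notation used here is a convention chosen for this sketch and may differ from what the EDICTOR actually stores, the function name `regular_part` is hypothetical, and the decision to bracket the ending of *bakanan is made up for illustration and does not reproduce the annotation in the figure.

```python
def regular_part(tokens, open_mark="(", close_mark=")"):
    """Return the tokens outside bracketed spans, i.e. the part treated as regular."""
    kept = []
    ignoring = False
    for token in tokens:
        if token == open_mark:
            ignoring = True       # start of a span annotated as irregular
        elif token == close_mark:
            ignoring = False      # end of the irregular span
        elif not ignoring:
            kept.append(token)
    return kept


# Space-segmented form with an (assumed) excluded ending.
row = "b a k ( a n a n )".split()
print(regular_part(row))  # ['b', 'a', 'k']
```

Keeping the excluded segments in the data, rather than deleting them, means the full word form stays available while only the bracket-free part enters the search for regular correspondences.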

A crucial consequence of using only regularity in the sound correspondences as the criterion to distinguish regular from irregular cognates is that regular cognacy may also be found to hold for borrowings, since borrowings can, as well, be shown to be regular, especially when the contact between languages was intensive. Identifying regular cognates is furthermore the first and most important step of the classical comparative method (Weiss 2015) for historical language comparison, since (unless we have written evidence for the true relations between languages) regular cognates (as proven by readily aligned cognate sets) are the foundation upon which we build all our hypotheses regarding the external history of languages.

Fitch, W. (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5: 227-231.
Hill, N. and J.-M. List (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.
List, J.-M. (2014) Sequence Comparison in Historical Linguistics. Düsseldorf University Press: Düsseldorf.
List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2: 119-136.
List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers). Association of Computational Linguistics, pp. 599-605.
List, J.-M. (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations, pp. 9-12.
Trask, R. (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh University Press: Edinburgh.
Weiss, M. (2015) The comparative method. In: Bowern, C. and N. Evans (eds.) The Routledge Handbook of Historical Linguistics. Routledge: New York, pp. 127-145.


  1. The Indo-Europeanist distinction between a root etymology and a word etymology would probably be relevant here as well. (The intermediate concept of stem etymology might be too idiosyncratic to be generalizable, however.)

    1. Yes, this is definitely one of the things we need to discuss much more properly if we want to advance our current approaches to phylogenetic reconstruction (but also for the sake of being more explicit regarding our reconstructions).