Showing posts with label Historical linguistics.
Monday, October 22, 2018
Controversies about structural data in historical linguistics
In the past, there have been many controversies about structural data, that is, the kind of data I introduced in last month's post. Given the misinterpretation of structural data as being "grammatical", along with the unproven and misleading claim by Nichols (2003) that certain grammatical features are more stable than lexical ones, one can often read about a controversy in linguistics: which aspects are more stable, and therefore more useful for studying deep linguistic relationships, the lexicon or the grammar?
In this context, it is often ignored that we are not chiefly talking about grammar when applying phylogenetic methods to structural datasets. It is also ignored that the original idea of the importance of "grammar", as most prominently discussed by Meillet (1925) and later popularized by Nichols (1996), pointed to homologies in complex and concrete morphological paradigms (i.e., individual word forms, that is: predominantly lexical traits). "Grammar" never referred to abstract similarities as they are captured in most structural datasets (see the excellent discussion by Dybo and Starostin 2008).
"Grammar" as evidence for deep language relations
Leading scholars in historical linguistics have provided convincing arguments that genetic relationships among languages can only be demonstrated by illustrating regular sound correspondences in concrete form-meaning pairs across the languages under investigation (see especially the very good analysis by Campbell and Poser 2008). In spite of this, the rumor that "grammar" (i.e., structural datasets) might provide a shortcut to detect deep, so far unnoticed, relationships among the languages of the world is very persistent, as reflected in many different studies.
Among the examples, Dunn et al. (2008) claimed that language relationships for Papuan languages of Island Melanesia could be uncovered by means of phonological and grammatical (abstract) structural features; and Longobardi et al. (2015) used syntactic features to compare the development of European languages with the development of European populations. Zhang et al. (2018) used phonological inventories of more than 100 different Chinese dialects, coding the data for simple presence and absence of each of the more than 200 different sounds in the database, and analyzing the data with the STRUCTURE software (Pritchard et al. 2000), whose results tend to be notoriously misinterpreted.
What is important about these studies is that none of them (with the possible exception of the study by Dunn et al. 2008, though I am in no position to judge the findings) could make a convincing case for why structural datasets would provide evidence of deeper relationships than the lexicon could. Even the study by Dunn et al., which tests the suitability of their small questionnaire of only 115 structural traits on Oceanic languages, has since not led to any new insights into so far undetected language relationships, contrary to the hope expressed by the authors "that structural phylogeny is an important new tool for exploring historical relationships between languages" (ibid. 734).
Structural data as a shortcut?
Some scholars who work on structural datasets may find my claims harsh and unjustified. In fact, there are studies that seem to provide evidence that phylogenetic methods based on structural datasets perform as well as those based on lexical data.
For example, Longobardi et al. (2016) carry out experiments on structural data of phoneme inventories, syntactic features, and "traditional" cognate sets for very small Indo-European datasets, concluding that all of the datasets yield similar results, and that syntactic or phonological features in structural datasets could be used instead of lexical phylogenies.
Contrary to this, Greenhill et al. (2017) also compared lexical datasets with structural data, for 81 Austronesian languages, but they found that, in general, lexical data are much more stable than structural data, although some structural features seem to be similar to lexical items regarding their stability.
A wish list for future tests
I see two major problems in the debate about the usefulness of structural data in historical linguistics.
First, the studies that confirm that structural data might work as well as lexical data are all based on small samples from one specific language family, analyzed with very diverse features that were specifically designed to study the languages in question. For me, a true test that some features carry deep historical signal would need to be demonstrated on a large set of related and unrelated languages, not just on selected datasets.
Furthermore, to allow for an honest comparison with the lexicon, the selection of features should not contain any lexical characters, or characters that could only be extracted with the help of lexical characters. Thus, asking whether the words for "fish", "I", and "five" are pronounced similarly in a language would not be allowed in such a feature collection, because this follows lexical criteria, and we know very well that this property is a very good proxy for identifying Sino-Tibetan languages (Handel 2008).
Second, and more problematic, is the fact that structural datasets do not provide information on the relatedness of the traits under comparison. While this is no problem for typologists who study shared structural features out of interest in universal tendencies in the languages of the world, it is a problem for the application of phylogenetic software, since the typical approaches in biology treat homoplasy as an exception, while in structural datasets it may be the norm rather than the exception.
Conclusion
In order to make structural data suitable for historical analyses, much more research needs to be carried out, including, specifically, a much more thorough study of parallel evolution and geographic convergence (due to language contact) in different language families of the world; a nice illustration for the Indo-European languages is provided by Cathcart et al. (2018).
I would be happy for our field if such research could reveal markers of deep genetic ancestry in the languages of the world, and help us to push the boundaries of linguistic reconstruction. For the time being, however, I remain highly skeptical, especially when scholars try to demonstrate the suitability of "grammatical" comparison with small datasets and idiosyncratically selected feature sets that are not comparable across datasets.
References
Campbell, L. and W. Poser (2008) Language Classification: History and Method. Cambridge University Press: Cambridge.
Cathcart, C., G. Carling, F. Larsson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.
Dunn, M., S. Levinson, E. Lindström, G. Reesink, and A. Terrill (2008) Structural phylogeny in historical linguistics: methodological explorations applied in Island Melanesia. Language 84.4: 710-759.
Dybo, A. and G. Starostin (2008) In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.) Aspekty komparativistiki 3. RGGU: Moscow, pp 119-258.
Greenhill, S., C. Wu, X. Hua, M. Dunn, S. Levinson, and R. Gray (2017) Evolutionary dynamics of language systems. Proceedings of the National Academy of Sciences 114.42: E8822-E8829.
Handel, Z. (2008) What is Sino-Tibetan? Snapshot of a field and a language family in flux. Language and Linguistics Compass 2.3: 422-441.
Longobardi, G., S. Ghirotto, C. Guardiano, F. Tassi, A. Benazzo, A. Ceolin, and G. Barbujani (2015) Across language families: genome diversity mirrors linguistic variation within Europe. American Journal of Physical Anthropology 157.4: 630-640.
Longobardi, G., A. Buch, A. Ceolin, A. Ecay, C. Guardiano, M. Irimia, D. Michelioudakis, N. Radkevich, and G. Jaeger (2016) Correlated Evolution Or Not? Phylogenetic Linguistics With Syntactic, Cognacy, And Phonetic Data. In: The Evolution of Language: Proceedings of the 11th International Conference (EVOLANGX11).
Meillet, A. (1954) La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.
Nichols, J. (1996) The comparative method as heuristic. In: Durie, M. (ed.) The Comparative Method Reviewed. Oxford University Press: New York, pp 39-71.
Nichols, J. (2003) Diversity and stability in language. In: Joseph, B. and R. Janda (eds.) The Handbook of Historical Linguistics. Blackwell: Malden, Mass, pp 283-310.
Pritchard, J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Monday, September 24, 2018
Structural data in historical linguistics
The majority of historical linguists compare words in order to reconstruct the history of different languages. However, apart from phylogenetic studies focusing on cognate sets, which reflect shared homologs across the languages under investigation, there is another data type that people have tried to explore in the past. This data type is difficult for non-linguists to understand, given its very abstract nature. In the past, it has led to a considerable amount of confusion, both among linguists and among non-linguists who tried to use the data for quick (and often also dirty) phylogenetic approaches. For this reason, I figured it would be useful to introduce this type of data in more detail.
This data type can be called "structural". To enable interested readers to experiment with the data themselves, this blogpost comes along with two example datasets that we converted into a computer-readable format (with much help from David), since the original papers offered the data only as PDF files. In future blogposts, we will try to illustrate how the data can, and should, be explored with network methods. In this first blogpost, I will try to explain the basic structure of the data.
Structural data in historical linguistics and language typology
In order to illustrate the type of data we are dealing with here, let's have a look at a typical dataset, compiled by the famous linguist Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.
| No. | Feature | Beijing | Suzhou | Meixian | Guangzhou |
|---|---|---|---|---|---|
| 1 | The third person pronoun is tā, or cognate to it | + | - | - | - |
| 4 | Velars palatalize before high-front vowels | + | + | - | - |
| 7 | The qu-tone lacks a register distinction | + | - | + | - |
| 12 | The word for "stand" is zhàn or cognate to it | + | - | - | - |
In this example, the data are based on a questionnaire that provides specific questions; and for each of the languages in the sample, the dataset answers each question with either + or -. Many of these datasets are binary in nature, but this is not a necessary condition, and questionnaires can also query categorical variables: the major type of word order, for example, might have three categories (subject-object-verb, subject-verb-object, or other).
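To give a concrete idea of how such a table becomes analyzable data, here is a minimal sketch (my own illustration, not part of Norman's study) that codes the four features shown above as binary values and computes a simple distance between two varieties, the kind of quantity a distance-based phylogenetic method would start from:

```python
# Features numbered as in the table above; "+" is coded as 1, "-" as 0.
features = {
    "Beijing":   {1: 1, 4: 1, 7: 1, 12: 1},
    "Suzhou":    {1: 0, 4: 1, 7: 0, 12: 0},
    "Meixian":   {1: 0, 4: 0, 7: 1, 12: 0},
    "Guangzhou": {1: 0, 4: 0, 7: 0, 12: 0},
}

def hamming(a, b):
    """Proportion of shared features on which two varieties disagree."""
    shared = set(a) & set(b)
    return sum(a[f] != b[f] for f in shared) / len(shared)

# Beijing and Suzhou disagree on 3 of the 4 features:
print(hamming(features["Beijing"], features["Suzhou"]))  # 0.75
```

With the full set of Norman's features, such pairwise distances would feed directly into clustering or network methods.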
We can also see that the questions can be very diverse. While we often use more or less standardized concept lists for lexical research (such as fixed lists of basic concepts; List et al. 2016), this kind of dataset is much less standardized, due to the nature of the questionnaire: asking for the translation of a concept is more or less straightforward, and the number of possible concepts that are useful for historical research is quite constrained. Asking a question about the structure of a language, however, be it phonological, lexical, based on attested sound changes, or syntactic, allows for an incredible number of different possibilities. As a result, it seems close to impossible to standardize these questions across different datasets.
Although scholars often call the data based on these questionnaires "grammatical" (since many questions are directed towards grammatical features, such as word order, presence or absence of articles, etc.), most datasets show a structure in which questions of phonology, lexicon, and grammar are mixed. For this reason, it is misleading to talk of "grammatical datasets", but instead the term "structural data" seems more adequate, since this is what the datasets were originally designed for: to investigate differences in the structure of different languages, as reflected in the most famous World Atlas of Language Structures (Dryer and Haspelmath 2013, https://wals.info).
Too much freedom is a restriction
In addition to mixed features that can be observed without knowing the history of the languages under investigation, many datasets (including the one by Norman we saw above) also use explicit "historical" (diachronic, in linguistic terminology) questions in their questionnaires. In the paper describing the dataset, Norman defends this practice, arguing that the goal of his study is to establish a historical classification of the Chinese dialects. With this goal in mind, it seems defensible to make use of historical knowledge, and to include observed phenomena of language change in general, and sound change in particular, when compiling a structural dataset for a group of related language varieties.
The problem of the extremely diverse nature of questionnaire items in structural datasets, however, makes their interpretation extremely difficult. This becomes especially evident when using the data in combination with computational methods for phylogenetic reconstruction. This is problematic for two major reasons.
- Since the questions are by nature less restricted in their content, scholars can easily pick and choose features in such a way that they confirm the theory they want to confirm, rather than testing it objectively. Since scholars can select suitable features from a virtually unlimited array of possibilities, it is extremely difficult to guarantee the objectivity of a given feature collection.
- If features are mixed, phylogenetic methods that work with explicit statistical models (like gain and loss of character states, etc.) may often be inadequate to model the evolution of the characters, especially if the characters are historical. While a feature like "the language has an article" may be interpreted as a gain-loss process (at some point, the language has no article, then it gains an article, then it loses it, etc.), features showing the results of processes, like "the words that originally started in [k] followed by a front vowel are now pronounced as [tɕ]", cannot be interpreted in this way, since the feature itself describes a process.
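To make the gain-loss point concrete: for a binary character, the model typically assumed by phylogenetic software is a two-state continuous-time Markov chain. The sketch below (standard textbook formulas, not tied to any specific phylogenetic package; the rate values are arbitrary) computes its transition probabilities, which is exactly the kind of process that a result-of-sound-change feature does not follow:

```python
import math

def p_gain(t, alpha, beta):
    """Probability that an absent feature is present after time t
    (alpha = gain rate, beta = loss rate)."""
    r = alpha + beta
    return alpha / r * (1 - math.exp(-r * t))

def p_loss(t, alpha, beta):
    """Probability that a present feature is absent after time t."""
    r = alpha + beta
    return beta / r * (1 - math.exp(-r * t))

# Over long time spans the process forgets its starting state and
# approaches the stationary frequency alpha / (alpha + beta):
print(round(p_gain(100.0, 0.3, 0.7), 4))  # 0.3
```

A feature such as "the language has an article" can plausibly be gained and lost this way; a feature describing the outcome of a completed sound change cannot meaningfully be "regained".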
Two structural datasets for Chinese dialects
Before I start to bore the already small circle of readers interested in these topics, it seems better to stop discussing the usefulness of structural data at this point, and to introduce the two datasets that were promised at the beginning of the post.
Both datasets target Chinese dialect classification: the first was proposed by Norman (2003), and the second reflects a new data collection that was recently used by Szeto et al. (2018) to propose a North-South split of the dialects of Mandarin Chinese, with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets. The main idea of this collection is to gather the various structural datasets that have been published in the literature in the past, and to allow people interested in the data, be it for replication studies or for testing alternative approaches, easy access to it in various formats.
The basic format is based on the format specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples of best practice for both structural and lexical datasets in historical linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and datasets are converted to CLDF (with all languages linked to the Glottolog database, glottolog.org; Hammarström et al. 2018) and also to Nexus format. Each dataset is versioned, so it may be updated in the future, and interested readers can study the code used to generate the specific data formats from the raw files, as well as the Nexus files, to learn how to submit their own datasets to our initiative.
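For readers unfamiliar with the Nexus side of this, the following sketch shows roughly what a binary character matrix looks like when serialized as a Nexus DATA block. The taxon names and values are toy data, not the actual contents of either dataset:

```python
# Toy binary matrix: one row per variety, one column per structural feature.
matrix = {
    "Beijing":   "1111",
    "Suzhou":    "0100",
    "Meixian":   "0010",
    "Guangzhou": "0000",
}

def to_nexus(matrix):
    """Serialize a {taxon: '0/1 string'} mapping as a Nexus DATA block."""
    nchar = len(next(iter(matrix.values())))
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "MATRIX",
    ]
    for taxon, states in matrix.items():
        lines.append(f"  {taxon:<12}{states}")
    lines += [";", "END;"]
    return "\n".join(lines)

print(to_nexus(matrix))
```

Files of this shape can be read directly by programs such as SplitsTree for Neighbor-Net analyses.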
Final remarks on publishing structural datasets online
By providing only two initial datasets for an enterprise whose general usefulness is highly questionable, readers might ask themselves why we are going through the pain of making data created by other people accessible through the web.
The truth is that the situation in historical linguistics and language typology has been very unsatisfactory for a very long time. Most data-based research did not supply the data along with the paper, and authors often refuse outright to share the data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the data is hampered because it is provided only in PDF format, in tables inside the paper (or, even worse, in long tables in the supplement), which forces scholars wishing to check a given analysis to reverse-engineer the data from the PDF. That data are provided in a form that is difficult to access is not even necessarily the fault of the authors, since some journals restrict supplementary data to PDF only, giving authors who wish to share their data in an appropriate form a difficult time.
Many colleagues think that it is time to change this, and we can only change it by offering standard ways to share our data. The CLDF and Nexus files in which the two Chinese datasets are now published in this open repository collection may hopefully serve as a starting point for larger collaboration among typologists and historical linguists. Ideally, all people who publish papers making use of structural datasets would, similar to the practice in biology, where scholars submit data to GenBank (Benson et al. 2013), submit their data in CLDF and Nexus format, so that their colleagues can easily build on their results, and test them for potential errors.
References
Benson D., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers (2013) GenBank. Nucleic Acids Res. 41.Database issue: 36-42.
Bryant D. and V. Moulton (2004) Neighbor-Net. An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21.2: 255-265.
Campbell, L. and W. Poser (2008): Language classification: History and method. Cambridge University Press: Cambridge.
Cathcart C., G. Carling, F. Larsson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.
Dryer M. and Haspelmath, M. (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.
Forkel R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (forthcoming) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.
Hammarström H., R. Forkel, and M. Haspelmath (2018) Glottolog. Version 3.3. Max Planck Institute for Evolutionary Anthropology: Leipzig. http://glottolog.org.
List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.
Norman J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.) The Sino-Tibetan languages. Routledge: London and New York, pp 72-83.
Pritchard J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Szeto P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.
Zhang M., W. Pan, S. Yan, and L. Jin (2018) Phonemic evidence reveals interwoven evolution of Chinese dialects. bioRxiv.
Monday, July 30, 2018
Networks of polysemous and homophonous words
When I was very young, maybe even before I went to school, we often played a game with my parents and grandparents in which we had to select a homophonous word (that is, one word form that expresses two rather different meanings), and the other players had to guess which word we had selected. This game is slightly different from its English counterpart, the homophone game.
In Germany, this game is called Teekesselchen: "little teapot". People now also use the word Teekesselchen to denote cases of homophony or very advanced polysemy. In this sense, the word Teekesselchen itself becomes polysemous, since it denotes both a little teapot and the phenomenon that word forms in a given language may often denote multiple meanings.
Homophony and polysemy
In linguistics, we learn very early that we should rigorously distinguish the phenomenon of homophony from the phenomenon of polysemy. The former refers to originally different word forms that have become similar (or even identical) due to the effects of sound change; compare French paix "peace" and pet "fart", which are now both pronounced as [pɛ]. The latter refers to cases where a word form has accumulated multiple meanings over time, which have shifted from the original meaning; compare head as in head of department vs. head as in headache.

Given the difference between the processes leading to homophony on the one hand and polysemy on the other, it may seem justified to opt for a strict usage of the terms, at least when discussing linguistic problems. However, the distinction between homophony and polysemy is not always easy to make.
In German, for example, we have the same word Decke for "ceiling" and "blanket" (Geyken 2010). At first sight, this may seem to reflect homophony, given that the meanings are so different that it seems simpler to assume a coincidence. However, it is in fact a polysemy (cf. Pfeifer 1993, s. v. «Decke»). This can easily be seen from the verb (be)decken "to cover", from which Decke was derived: while the ceiling covers the room, the blanket covers the body.
Given that we usually do not know much about the history of the words in our languages, we often have difficulties deciding whether we are dealing with homophony or with polysemy when encountering ambiguous terms in the languages of the world. The problem with the two terms is that they are not descriptive but explanatory (or ontological): they do not only describe a phenomenon ("one word form is ambiguous, having multiple meanings"), but also the origin of this phenomenon (sound change or semantic change).
In this context, the recently coined term colexification (François 2008) has proven to be very helpful, as it is purely descriptive, referring to those cases where a given language uses the same word form to express two or more different meanings. The advantage of descriptive terminology is that it allows us to identify a certain phenomenon and then analyze it in a separate step; that is, we can already talk about the phenomenon before we have found its specific explanation.
A new contribution
Having worked hard during recent years writing computer code for data curation and analysis (cf. List et al. 2018a), my colleagues and I have finally managed to present the fascinating phenomenon of colexification (homophony and polysemy) in the languages of the world in an interactive web application, which shows which colexifications occur frequently in which languages of the world.
In order to display how often the languages in the world express different concepts using the same word, we make use of a network model, in which the concepts (or meanings) are represented by the nodes in the networks, and links between concepts are drawn whenever we find that any of the languages in the sample colexifies the concepts. The following figure illustrates this idea.
[Figure: Colexification network for concepts centering around "FOOD" and "MEAL".]
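The construction of such a network can be sketched in a few lines of code. The data below are invented for illustration (in CLICS itself, the input is of course a large multilingual lexical database): nodes are concepts, and an edge is added whenever some language expresses both concepts with the same form.

```python
from itertools import combinations
from collections import Counter

# Toy lexicon: language -> word form -> concepts expressed by that form.
lexicon = {
    "German":  {"Decke":  ["CEILING", "BLANKET"]},
    "French":  {"bois":   ["WOOD", "FOREST"]},
    "Russian": {"derevo": ["TREE", "WOOD"]},
    "Yoruba":  {"igi":    ["TREE", "WOOD"]},
}

# Edge weight = number of colexifications attested for a concept pair.
edges = Counter()
for language, forms in lexicon.items():
    for form, concepts in forms.items():
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1

print(edges[("TREE", "WOOD")])  # 2
```

Edge weights then tell us which colexifications recur across many languages (and are thus likely to reflect general tendencies), as opposed to one-off accidents of sound change in a single language.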
This database and web application is called CLICS, which stands for the Database of Cross-Linguistic Colexifications (List et al. 2018b), and was published officially during the past week (http://clics.clld.org); it can now be freely accessed by anyone who is interested. In addition, we describe the database in some more detail in a forthcoming article (List et al. 2018c), which is already available in the form of a draft.
The data give us fascinating insights into the way in which the languages of the world describe the world. At times, it is surprising how similar the languages are, even if they do not share any recent ancestry. My favorite example is the network around the concept FUR, shown below. When inspecting this network, one can find direct links of FUR to HAIR, BODY HAIR, and WOOL on one hand, as well as LEATHER, SKIN, BARK, and PEEL on the other. In some sense, the many different languages of the world, whose data was used in this analysis, reflect a general principle of nature, namely that the bodies of living things are often covered by some protective substance.
[Figure: Colexification network for concepts centering around "FUR".]
Although we have been working with these networks for a long time, we are still far from understanding their true potential. Unfortunately, nobody in our team is a true specialist in complex networks. As a result, our approaches are always limited to what we may have read by chance about all of those fascinating ways in which complex networks can be analyzed.
For the future, we hope to convince more colleagues of the interesting character of the data. At the moment, our networks are simple tools for exploration, and it is hard to extract any evolutionary processes from them. With more refined methods, however, it may even be possible to use them to infer general tendencies of semantic change in language evolution.
References
Geyken A. (ed.) (2010) Digitales Wörterbuch der deutschen Sprache DWDS. Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Berlin-Brandenburgische Akademie der Wissenschaften: Berlin. http://dwds.de
François A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.) From Polysemy to Semantic Change, pp 163-215. Benjamins: Amsterdam.
List J.-M., M. Walworth, S. Greenhill, T. Tresoldi, and R. Forkel (2018a) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2. http://dx.doi.org/10.1093/jole/lzy006
List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (eds.) (2018b) CLICS: Database of Cross-Linguistic Colexifications. Max Planck Institute for the Science of Human History: Jena. http://clics.clld.org
List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (2018c) CLICS². An improved database of cross-linguistic colexifications: assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2. https://doi.org/10.1515/lingty-2018-0010
Pfeifer W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.
Monday, June 25, 2018
Horizontal and vertical language comparison
In the traditional handbooks on historical language comparison, one can often find the claim that there are two fundamentally different, but equally important, means of linguistic reconstruction. One is usually called "external reconstruction" (or, alternatively, the "comparative method"), and the other is called "internal reconstruction". If we think of sequence comparison in historical linguistics in the form of a table, in which concepts are arranged on the vertical axis and different languages on the horizontal axis, we can look at the two different modes of language comparison (external vs. internal) as the horizontal and vertical axes of the table. Horizontal language comparison refers to external reconstruction: scholars compare forms (not necessarily of the same meaning) across the horizontal axis, that is, across different languages. Internal language comparison is vertical: scholars search within one and the same language for structures that allow us to infer its older stages.
In past blog posts I have talked a lot about horizontal / external language comparison, for which the notion of sound correspondences is especially crucial. But in the same way in which we use evidence across languages to infer past states of a given language family, we can make use of language-internal evidence to learn more about the history, not only of a given language, but also of a group of languages.
Vertical Language Comparison
A classical example of vertical or internal language comparison is the investigation of paradigms, that is, the inflection systems of the verbs or nouns in a given language. This, of course, makes sense only if the respective languages have verbal or nominal morphology, i.e., if we find differences in the verb forms for the first, second, and third person singular and plural, or for the case system. The principle would not work for Chinese, although we have other means to compare languages without inflection vertically, as I'll illustrate below.
As a simplified example of internal reconstruction, consider the verbal paradigm of the verb esse "to be" in Latin:
| Person | Singular | Plural |
|---|---|---|
| first | sum | sumus |
| second | es | estis |
| third | est | sunt |
If you try to memorize this pattern, you will quickly realize that it is not regular, and you will have difficulty identifying patterns that assist in memorizing the forms. A much more regular pattern would be the following:
| Person | Singular | Plural |
|---|---|---|
| first | es-um | es-umus |
| second | es-Ø | es-tis |
| third | es-t | es-unt |
This pattern would still require us to memorize six different endings, but we could safely remember that the beginning of all forms is the same, and that there are six different endings, accounting for person and number at the same time (which is anyway typical for inflecting languages).
An alternative pattern that would be easier to remember is the following one:
| Person | Singular | Plural |
|---|---|---|
| first | es-um | s-umus |
| second | es-Ø | s-tis |
| third | es-t | s-unt |
While this pattern may seem slightly more complicated at first glance, it would still be more regular than the pattern we actually observe, and we would now have two different aspects expressing the meaning of the different forms: the alternation of the root es- vs. s- accounts for the singular-plural distinction, while the endings again express both number and person.
If we look at older stages of Latin, we can indeed find evidence for the first person singular, which was written esom in ancient documents (see Meier-Brügger 2002 for details on the reconstruction of this paradigm in Indo-European). If we look at other languages, like Sanskrit and Ancient Greek, we can further see that the alternation between es- and s- in the root (as in our last example) also comes much closer to the supposed ancient state, even if we cannot find complete evidence for this in Latin alone.
What we can see, however, is that the inspection of alternating forms of the same root can reveal ancient states of a language. The key assumption is that observed irregularities usually go back to formerly regular patterns.
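The reasoning above can be turned into a toy experiment. The snippet below is a rough sketch, not an actual reconstruction algorithm (the segment helper is my own invention): it tests how well the hypothesized alternating roots es- / s- segment the attested paradigm of Latin esse, recovering the root-plus-ending analysis shown in the last table:

```python
# The six attested forms of Latin "to be" in the present indicative.
paradigm = ["sum", "sumus", "es", "estis", "est", "sunt"]

def segment(form, roots):
    """Split a form into (root, ending), preferring the longest matching root."""
    for root in sorted(roots, key=len, reverse=True):
        if form.startswith(root):
            return root, form[len(root):]
    return None

# Hypothesis: the root alternates between es- (singular) and s- (plural).
for form in paradigm:
    print(form, "->", segment(form, ["es", "s"]))
```

The resulting endings (-um/-umus, -Ø/-tis, -t/-unt) line up neatly with the person-number slots, which is exactly the kind of hidden regularity internal reconstruction looks for.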
Horizontal language comparison
The classical example of horizontal or external language comparison is the typical wordlist, in which words with similar meanings across different languages are arranged in tabular form. I have mentioned before that it was in great part Morris Swadesh (1909-1967) who popularized the simple tabular perspective that puts a concept and its various translations at the center of historical language comparison. Before the development of this concept-based approach to historical linguistics, scholars would pick examples based on their similarity in form, allowing for great differences in the semantics of the words assigned to the same slot of cognate words; and this exclusively form-based approach to external language comparison is still the prevalent one in most branches of historical linguistics.
No matter what approach we employ in this context — be it the concept- or the form-based — as long as we compare forms across different languages, we carry out external language comparison, and our main concern is then the identification of regular sound correspondences across the languages in our sample, which enable us to propose ancestral sounds for the ancestral language.
Problems of vertical language comparison
As can be seen from my above example of the inflection of esse in Latin, it is not obvious how the task of internal language comparison could be formalized and automated. There are two main reasons for this. First, inflection paradigms vary greatly among the languages of the world, which makes it difficult to come up with a common way to investigate them.
Second, since we are usually looking for irregular cases that we try to explain as having evolved from former regularities, it is clear that our data will be extremely sparse. Often, it is only the paradigm of one word that we seek to explain, as we have seen for Latin esse, and patterns of irregularities across many verbs are rather rare (although examples of this can also be found). As a result, internal reconstruction deals with even less data than external reconstruction, where the data are not necessarily big either.
Formalizing the language-internal analysis of word families
Despite the obvious problems of exploiting the language-internal perspective in historical language comparison, certain types of linguistic analysis in this area are amenable to a more formal treatment. One example that we are currently testing is the inference and annotation of word families within a given language. It is well known that a large number of words in human languages are not unrelated atomic units, but have themselves been created from smaller parts. Linguists distinguish derivation and compounding as the major techniques by which new words are created from existing ones.
Derivation refers to those cases where a word is modified by a form unit that could not form a word of its own, usually a suffix or a prefix. As an example, consider the suffix -er in English, which can be attached to verbs in order to form a noun that usually describes the person who regularly carries out the action denoted by the original verb (e.g. examine → examiner, teach → teacher, etc.). While the original verb exists without the suffix in the English language, the form -er only occurs bound to a verb, never on its own. In contrast to derivation, compounding refers to the process by which two word forms that can be used in isolation are merged to form a new expression (compare foot and ball with football).
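With a known lexicon and suffix list, both processes can be sketched in a few lines of Python (a hypothetical illustration; the lexicon, the suffix list, and the crude e-deletion rule for examine → examiner are my own simplifying assumptions):

```python
# Hypothetical sketch: segmenting words into derivational and compound parts,
# given a small known lexicon of free-standing words.
LEXICON = {"teach", "examine", "foot", "ball"}
SUFFIXES = ["er"]

def segment(word):
    """Return (stem, suffix) for derivations, (part1, part2) for compounds,
    or (word,) if no analysis is found."""
    for suffix in SUFFIXES:                      # derivation: stem + bound suffix
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and (stem in LEXICON or stem + "e" in LEXICON):
            return (stem, suffix)
    for i in range(1, len(word)):                # compounding: free word + free word
        if word[:i] in LEXICON and word[i:] in LEXICON:
            return (word[:i], word[i:])
    return (word,)

segment("teacher")   # ("teach", "er")
segment("football")  # ("foot", "ball")
```

The hard part, as the next paragraph explains, is doing this without a pre-annotated lexicon.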
Searching for suffixes and compounds in unannotated language data is a very difficult task. Although scholars have been working on automatic methods that split a given monolingual dictionary into its smallest meaning-bearing form units (morphemes), these methods usually only work on very large datasets (Creutz and Lagus 2005). Trained linguists, on the other hand, can easily detect such patterns, even when working on smaller datasets of a few hundred words.
The reason why linguists are successful in analysing the morphology of languages, in contrast to machine-learning approaches, is that they make active use of their external knowledge about the potential semantics underlying the patterns, while current methods for automatic morpheme detection usually only consider the forms, and disregard the semantics. Semantics, however, are important to distinguish words that form a true family (in that they share cognate material) from words that are similar only due to chance.
It is clear that languages may have words that sound alike but convey different meanings. As an extreme example, consider French paix [pɛ] "peace" vs. pet [pɛ] "fart". Although both words are pronounced the same, we know that they are not cognate, going back to different ancestral forms, as is also reflected in the French writing system. But even if we lacked the evidence of French orthography, we could easily justify that the words do not form a family, since (a) their meaning is quite different, and (b) their gender is different as well (la paix vs. le pet). An automatic method that disregards semantics and external evidence (like the orthography or the gender of nouns in our case) cannot distinguish words that are similar due to chance from words that are similar due to their history.

As a further example illustrating the importance of semantics, consider the data for Achang, a Burmish language spoken in Myanmar (data from Huáng 1992), which is shown in the following graphic (derived from the EDICTOR tool and analyzed by Nathan W. Hill).
| Word families in Achang, a Burmish language. |
In this figure, we can see six words which all share tɕʰi⁵⁵ (high numbers represent tones) as their first part. As we can see from the detailed analysis of these compounds in Achang, which is given in the column "MORPHEMES" in the figure, our analysis claims that the form tɕʰi⁵⁵, which expresses the concepts "foot" or "leg" in isolation, recurs in the words for "hoof", "claw", "knee", and "thigh", but not in the word for "ant". While the semantic commonalities among the former are plausible, as they all denote body parts which are closely related to "feet" or "legs", we do not find any transparent motivation for why the speakers should have used a compound containing the word for "foot" to denote an ant. Although we cannot demonstrate this at this point, we are hesitant to add the Achang word for "ant" to the word family based on compounds containing the word for "foot".

Bipartite networks of word families
For the time being, we cannot automate this analysis, since we lack data for the testing and training of potential algorithms. We can, however, formalize it in a very straightforward way: with the help of a bipartite network (see Hill and List 2017). Bipartite networks are networks with two kinds of nodes, which are usually thought of as representing different types. While we can easily assign different types to all nodes in any network we are dealing with, bipartite networks only allow us to link nodes of different types. In our bipartite network of word families, the first type of node represents the forms of the words, while the second type represents the meanings attributed to the sub-parts of the words. In the figure above, the former can be found in the column "tokens", where the symbol "+" marks the boundaries, and the latter can be found in the column "MORPHEMES".

The following figure shows the bipartite network underlying the word family relations that follow from our analysis of words built with the morpheme "foot" in Achang.
| Bipartite network of word families: nodes in red text represent the (reconstructed) meaning of the morphemes, and blue nodes the words in which those occur as parts. |
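In plain Python, such a bipartite network can be sketched as a set of edges between morpheme-gloss nodes and word nodes (a hedged sketch: the gloss labels for the second morphemes of the compounds are hypothetical placeholders, not Huáng's (1992) forms):

```python
# Sketch of a bipartite word-family network for the Achang "foot" example;
# edges only ever connect the two node types (morpheme glosses and words).
WORDS = {  # word -> glosses of the morphemes it contains
    "foot":  ["foot"],
    "hoof":  ["foot", "hoof"],
    "claw":  ["foot", "claw"],
    "knee":  ["foot", "knee"],
    "thigh": ["foot", "thigh"],
}

def bipartite_edges(words):
    """Edges (morpheme gloss, word) of the bipartite network."""
    return {(m, w) for w, morphemes in words.items() for m in morphemes}

def word_family(words, morpheme):
    """All word nodes linked to a given morpheme node."""
    return {w for m, w in bipartite_edges(words) if m == morpheme}

word_family(WORDS, "foot")  # {"foot", "hoof", "claw", "knee", "thigh"}
```

Querying a morpheme node for its neighbors directly yields the word family built on that morpheme.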
Conclusion
The bipartite network above shows only a small part of the word family structure of one language, and the analysis and formalization of word families with the help of bipartite networks thus remains exemplary and anecdotal. I hope, however, that the example illustrates how important it is to keep in mind that language change is not only about sound shifts that can be analyzed with the help of language-external, horizontal comparison. Investigating the vertical (language-internal) perspective of language evolution is not only fascinating, offering many as yet unresolved methodological problems; it is at least as important as the horizontal perspective for a proper understanding of the dynamics underlying language change.
References
Creutz M. and Lagus K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology, 2005, 81.
Hill N. and List J.-M. (2017) Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1. 47–76.
Huáng Bùfán 黃布凡 (1992) Zàngmiǎn yǔzú yǔyán cíhuì [A Tibeto-Burman lexicon]. Zhōngyāng Mínzú Dàxué 中央民族大学 [Central Institute of Minorities]: Běijīng 北京.
Meier-Brügger M. (2002) Indogermanische Sprachwissenschaft. de Gruyter: Berlin.
Monday, March 26, 2018
It's the system, stupid! More thoughts on sound change in language history
In various blog posts in the past I have tried to emphasize that sound change in linguistics is fundamentally different from the kind of change in phenotype / genotype that we encounter in biology. The most crucial difference is that sound sequences, i.e., the words or parts of words we use when communicating, do not manifest themselves as a physical substance but — as linguists say — "ephemerally", i.e. in the air flow that comes out of the mouth of a speaker and is perceived as an acoustic signal by the listener. This is in strong contrast to DNA sequences, for example, which are undeniably somewhere "out there": they can be sliced and investigated, and they preserve information for centuries if not millennia, as the recent boom in archaeogenetics illustrates.
Here, I explore the consequences of this difference in a bit more detail.
Language as an activity
Language, as Wilhelm von Humboldt (1767-1835) — the boring linguist who investigated languages from his armchair while his brother Alexander was traveling the world — put it, is an activity (energeia). If we utter sentences, we pursue this activity and produce sample output of the system hidden in our heads. Since the sound signal is only determined by the capacity of our mouth to produce certain sounds, and the capacity of our brain to parse the signals we hear, we find a much stronger variation in the different sounds available in the languages of the world than we find when comparing the alphabets underlying DNA or protein sequences.
Despite the large variation in the sound systems of the world's languages, it is clear that there are striking common tendencies. A language without vowels does not make much sense, as we would have problems pronouncing the words or perceiving them at longer distances. A language without consonants would also be problematic; and even artificial communication systems developed for long-distance communication, like the different kinds of yodeling practiced in different parts of the world, make use of consonants to allow for a clearer distinction between vowels (see the page about Yodeling on Wikipedia). But, between both extremes we find great variation in the languages of the world, and this does not seem to follow any specific pattern that could point to any kind of selective pressure, although scholars have repeatedly tried to demonstrate it (see Everett et al. 2015 and the follow-up by Roberts 2018).
What is also important here is that, not only is the number of the sounds we find in the sound system of a given language highly variable, but there is also variation in the rules by which sounds can be concatenated to form words (called the phonotactics of a language), along with the frequency of the sounds in the words of different languages. Some languages tolerate clusters of multiple consonants (compare Russian vzroslye or German Herbst), others refuse them (compare the Chinese name for Frankfurt: fǎlánkèfú), yet others allow words to end in voiced stops (compare English job in standard pronunciation), and some turn voiced stops into voiceless ones (compare the standard pronunciation of Job in German as jop).
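As a toy illustration of such phonotactic differences, the longest consonant cluster a word tolerates can be measured in a few lines (my own sketch, using orthography as a rough stand-in for the actual sound sequence):

```python
# Toy sketch: length of the longest consonant cluster in a word, with letters
# standing in, crudely, for sounds.
VOWELS = set("aeiouy")

def max_consonant_cluster(word):
    """Length of the longest run of consonant letters."""
    longest = current = 0
    for char in word.lower():
        current = 0 if char in VOWELS else current + 1
        longest = max(longest, current)
    return longest

max_consonant_cluster("Herbst")    # 4 ("rbst")
max_consonant_cluster("vzroslye")  # 3 ("vzr")
```

A language refusing clusters, like the Chinese rendering fǎlánkèfú, would score 1 on such a measure.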
Language as a system
Language is a system which essentially concatenates a fixed number of sounds into sequences, restricted only by the encoding and decoding capacities of its users. This is the core reason why sound change is so different from change in biological characters. If we say that German d goes back to Proto-Germanic *θ (pronounced as th in path), this does not mean that there were a couple of mutations in a couple of words of the German language. Instead, it means that the system which produced the words of Proto-Germanic changed the way in which the sound *θ was produced in the original system.

In some sense, we can think metaphorically of a typewriter in which we replace one letter by another. As a result, whenever we want to type a given word in the way we know it, we will type it with the new letter instead. But this analogy would be too restricted, as we can also add new letters to the typewriter, or remove existing ones. We can also split one letter key into two, as happens in the case of palatalization, a very common type of sound change during which sounds like [k] or [g] turn into sounds like [tʃ] and [dʒ] when followed by front vowels (compare Italian cento "hundred", which was pronounced [kɛntum] in Latin and is now pronounced [tʃɛnto]).

Sound change is not the same as mutation in biology
Since it is the sound system that changes during the process we call sound change, and not the words (which are just a reflection of the output of the system), we cannot equate sound change with mutation in biological sequences: a mutation does not recur across all sequences in a genome, replacing one DNA segment with another that may not even have existed before. The change in the system, as opposed to the sequences that the system produces, is the reason for the apparent regularity of sound change.
This culminates in Leonard Bloomfield's (1887-1949) famous (at least among old-school linguists) expression that 'phonemes [i. e., the minimal distinctive units of language] change' (Bloomfield 1933: 351). From the perspective of formal approaches to sequence comparison, we could restate this as: 'alphabets change'. Hruschka et al. (2015) have compared sound change with concerted evolution in biology. We can state the analogy in simpler terms: sound change reflects systemics in language history, and concerted evolution results from systemic changes in biological evolution. It's the system, stupid!
Given that sound systems change in language history, the problem of character alignment (i.e. determining homology/cognacy) in linguistics cannot be directly solved with the same techniques that are used in biology, where the alphabets are assumed to be constant, and alignments are supposed to identify mutations alone. Since the sequences we compare in linguistics were basically drawn from different alphabets, we need to find out which sounds correspond to which sounds across different languages while at the same time trying to align them.
An artificial example for the systemic grounding of sound change
Let me provide a concrete artificial example to illustrate the peculiarities of sound change. Imagine two people who originally spoke the same language, but then suffered from diseases or accidents that prevented them from producing their speech the way they did before. Let the first person suffer from a cold, which blocks the nose and therefore turns all nasal sounds into the corresponding voiced stops, i.e., n becomes d, ng becomes g, and m becomes b. Let the other person suffer from the loss of the front teeth, which makes it difficult to pronounce the sounds s and z correctly, so that they sound like a th (in its voiceless and voiced forms, as in thing vs. that).
| Artificial sound change resulting from a cold or the loss of the front teeth. |
If we now let both persons pronounce the same words in their original language, they won't sound very similar anymore, as I have tried to depict in the following table (dh points to the th in words like father, as opposed to the voiceless th in words like thatch).
| No. | Speaker Cold | Speaker Tooth |
|---|---|---|
| 1 | bass | math |
| 2 | buzic | mudhic |
| 3 | dose | nothe |
| 4 | boizy | moidhy |
| 5 | sig | thing |
| 6 | rizig | ridhing |
By comparing the words systematically, however, bearing in mind that we need to find both the best alignment and the mapping between the alphabets, we can retrieve a set of what linguists call sound correspondences. We can see that the s of speaker Cold corresponds to the th of speaker Tooth, z corresponds to dh, b to m, d to n, and g to ng. Having probably figured out by now that my words were taken from the English language (with voiced s consistently spelled z), it is easy even to come up with a reconstruction of the original words (mass, music [=muzik], nose, noisy [=noizy], etc.).
| Reconstructing ancestral sounds in our artificial example with help of regular sound correspondences. |
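The logic of this toy example can be made fully explicit in Python (my own sketch, not a method for real language data): each speaker is a systematic mapping over the sound system, and the regular correspondences allow us to undo both mappings at once.

```python
# Words are tuples of sounds; each "speaker" is a mapping over the sound system.
COLD = {"n": "d", "ng": "g", "m": "b"}   # blocked nose: nasals -> voiced stops
TOOTH = {"s": "th", "z": "dh"}           # missing teeth: sibilants -> dentals

def pronounce(word, mapping):
    """Apply a speaker's sound mapping to every sound of a word."""
    return tuple(mapping.get(sound, sound) for sound in word)

def reconstruct(cold_form, tooth_form):
    """Infer the original sounds from two aligned descendant forms."""
    original = []
    for c, t in zip(cold_form, tooth_form):
        if c == t:                   # neither speaker changed the sound
            original.append(c)
        elif COLD.get(t) == c:       # Tooth preserved the original nasal
            original.append(t)
        elif TOOTH.get(c) == t:      # Cold preserved the original sibilant
            original.append(c)
        else:
            original.append("?")
    return tuple(original)

reconstruct(("b", "a", "s"), ("m", "a", "th"))  # ("m", "a", "s"), i.e. "mass"
```

Note that the regularity of the reconstruction comes entirely from the fact that the changes are defined on the sound system, not on individual words.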
Summary
Systemic changes are difficult to handle in phylogenetic analyses. They leave specific traces in the evolving objects we investigate that are often difficult to interpret. While linguists have long known that sound change is an inherently systemic phenomenon, it is still very difficult to communicate to non-linguists what this means, and why it is so difficult for us to compare languages by comparing their words. Although it may seem tempting to compare languages with simple sequence-alignment algorithms, treating the differences like differences among biological sequences resulting from mutations (see for example Wheeler and Whiteley 2015), such an approach is basically oversimplified.
Simple models undeniably have their merits, especially when dealing with big datasets that are difficult to inspect manually — there is nothing to say against their use. But we should always keep in mind that we can, and should, do much better than this. Handling systemic changes remains a major challenge for phylogenetic approaches, no matter whether they use trees, networks, bushes, or forests.
Given the peculiarity of sound change in linguistic evolution, and how well the phenomena are understood in our discipline, it seems worthwhile to invest time in exploring ways to formalize and model the process. During the past two decades, linguists have taken a lot of inspiration from biology. The time will come when we need to pay something back. Providing models and analyses to deal with systemic processes like sound change might be a good start.
References
Bloomfield, L. (1973) Language. Allen & Unwin: London.
Everett, C., D. Blasi, and S. Roberts (2015) Climate, vocal folds, and tonal languages: connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5: 1322-1327.
Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015) Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1: 1-9.
Roberts, S. (2018) Robust, causal, and incremental approaches to investigating linguistic adaptation. Frontiers in Psychology 9: 166.
Wheeler, W. and P. Whiteley (2015) Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2: 113-125.
Monday, February 26, 2018
Tossing coins: linguistic phylogenies and extensive synonymy
The procedures by which linguists sample data when carrying out phylogenetic analyses of languages are sometimes fundamentally different from the methods applied in biology. This is particularly obvious in the matter of the sampling of data for analysis, which I will discuss in this post.
Sampling data in historical linguistics
The reason for the difference is straightforward: while biologists can now sample whole genomes and search across those genomes for shared gene families, linguists cannot sample the whole lexicon of several languages. The problem is not that we could not apply cognate detection methods to whole dictionaries. In fact, there are recent attempts that try to do exactly this (Arnaud et al. 2017). The problem is that we simply do not know exactly how many words we can find in any given language.
For example, the Duden, a large lexicon of the German language, recently added 5000 more words, mostly due to recent technological innovations, which have led to new words that we frequently use in German, such as twittern "to tweet", Tablet "tablet computer", or Drohnenangriff "drone attack". In total, it now lists 145,000 words, the majority of which have been coined in complex processes involving language-internal derivation of new word forms, but also a large amount of borrowing, as one can see from the three examples.
One could argue that we should only sample those words which most of the speakers of a given language know, but even there we are far from being able to provide reliable statistics, not to mention that these numbers may vary greatly across different language families and cultural and sociolinguistic backgrounds. Brysbaert et al. (2016), for example, estimate that:

> an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families.

But in order to count as "near-native" in a certain language, including the ability to pursue studies at a university, the Common European Framework of Reference for Languages requires only between 4,000 and 5,000 words (Milton 2010; see also List et al. 2016). How many word families this includes is not clear, and may, again, depend directly on the target language.
Lexicostatistics
When Morris Swadesh (1909-1967) established the discipline of lexicostatistics, the first attempt to approach the problems of historical linguistics with the help of quantitative methods, he started from a sample of 215 concepts (Swadesh 1950), which he later reduced to only 100 (Swadesh 1955), because he was afraid that some concepts would often be denoted by words that are borrowed, or would simply not be expressed by single words in certain language families. Since then, linguists have been trying to refine this list further, either by modifying it (Starostin 1991 added 10 more concepts to Swadesh's list of 100 concepts), or by reducing it even further (Holman et al. 2008 reduced the list to 40 concepts).
While it is not essential how many concepts we use in the end, it is important to understand that in our current phylogenetic approaches we do not simply start by comparing words; instead, we sample parts of the lexicon of our languages with the help of a list of comparative concepts (Haspelmath 2010), which we then consecutively translate into the target languages. This sampling procedure was not necessarily invented by Morris Swadesh, but he was the first to establish its broader use, and we have directly inherited it when applying our phylogenetic methods (see this earlier post for details on lexicostatistics).
Synonymy in linguistic datasets
Having inherited the procedure, we have also inherited its problems, and, unfortunately, there are many problems involved in this sampling procedure. Not only do we have difficulties determining a universal diagnostic test list that could be applied to all languages, we also have considerable problems in standardizing the procedure of translating a comparative concept into the target languages, especially when the concepts are only loosely defined. The concept "to kill", for example, seems rather straightforward at first sight. In German, however, we have two words that could express this meaning equally well: töten (cognate with English dead) and umbringen (partially cognate with English to bring). In fact, as in all languages of the world, there are many more words for "to kill" in German, but these can easily be filtered out, as they are usually euphemisms, such as eliminieren "to eliminate", or neutralisieren "to neutralize". The words töten and umbringen, however, are extremely difficult to distinguish with respect to their meaning, and speakers often use them interchangeably, depending, perhaps, on register (töten being more formal). But even for me as a native speaker of German, it is incredibly difficult to tell when I use which word.
One way to decide which of the words is more basic could be corpus studies. By counting how often, and in which situations, each term is used in a large corpus of German speech, we might be able to determine which of the two words comes closer to the concept "to kill" (see Starostin 2013 for a very elegant treatment of the problem of words for "dog" in Chinese). But in most cases where we compile lists of languages, we do not have the necessary corpora.
Furthermore, since corpus studies on competing forms for a given concept are extremely rare in linguistics, we cannot exclude the possibility that the frequencies of two words expressing the same concept are in the end the same, and that the words just represent a state of equilibrium in which speakers use them interchangeably. Whether we like it or not, we have to accept that there is no general principle for avoiding these cases of synonymy when compiling our datasets for phylogenetic analyses.
Tossing coins
What should linguists do in such a situation, when they are about to compile the dataset that they want to analyze with modern phylogenetic methods, in order to reconstruct some eye-catching phylogenetic trees? In the early days of lexicostatistics, scholars recommended being very strict, demanding that only one word in a given language should represent each comparative concept. In cases like German töten and umbringen, they recommended tossing a coin (Gudschinsky 1956), in order to guarantee that the procedure was as objective as possible.
Later on, scholars relaxed the criteria, and just accepted that in a few — hopefully very few — cases there would be more than one word representing a comparative concept in a given language. This principle has not changed with the quantitative turn in historical linguistics. In fact, thanks to the procedure by which cognate sets across concept slots are dichotomized in a second step, scholars who only care about the phylogenetic analyses and not about the underlying data may easily overlook that the Nexus file from which they try to infer the ancestry of a given language family may list a large number of synonyms, reflecting cases where the classical scholars simply did not know how to translate one of their diagnostic concepts into the target language.
Testing the impact of synonymy on phylogenetic reconstruction
The obvious question to ask at this stage is: does this actually matter? Can't we just trust that our phylogenetic approaches are sophisticated enough to find the major signals in the data, and simply ignore the problem of synonymy in linguistic datasets? Almost 10 years ago, when I was still a greenhorn in computing, I made an initial study of the problem of extensive synonymy, but it never made it into a publication, since we had to shorten the more general study of which the synonymy test was only a small part. This study has been online since 2010 (Geisler and List 2010), but is still awaiting publication; instead of including my quantitative test of the impact of extensive synonymy on phylogenetic reconstruction, we just mentioned the problem briefly.
Given that the problem of extensive synonymy turned up frequently in recent discussions with colleagues working on phylogenetic reconstruction in linguistics, I decided that I should finally close this chapter of my life, and resume the analyses that had been sleeping in my computer for the last 10 years.
The approach is very straightforward. If we want to test whether the choice of translations leaves traces in phylogenetic analyses, we can just take the pioneers of lexicostatistics literally, and conduct a series of coin-tossing experiments. We start from a "normal" dataset that people use in phylogenetic studies. These datasets usually contain a certain amount of synonymy (not an extreme amount, but it is not surprising to find two, three, or even four translations for a concept in the datasets that have been analysed in recent years). If we now have the computer toss a coin in each situation where only one word should be chosen, we can easily create a large sample of datasets, each of which is synonym free. Analysing these datasets and comparing the resulting trees is again straightforward.
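The core of such a coin-tossing trial can be sketched as follows (a hypothetical illustration, not the actual LingPy-based code used for the study; the dataset shape is my own assumption):

```python
# Hypothetical sketch of one coin-tossing trial: wherever a concept has
# several translations in a language, pick exactly one at random.
import random

def toss_coins(dataset, seed):
    """dataset: {language: {concept: [word, ...]}} -> one word per slot."""
    rng = random.Random(seed)  # seeded, so every trial is reproducible
    return {
        language: {concept: rng.choice(words)
                   for concept, words in concepts.items()}
        for language, concepts in dataset.items()
    }

DATA = {"German": {"to kill": ["töten", "umbringen"], "hand": ["Hand"]}}
trials = [toss_coins(DATA, seed) for seed in range(1000)]
```

Each of the resulting synonym-free datasets can then be fed to the same tree-building pipeline, and the resulting trees compared.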
I wrote some Python code, based on our LingPy library for computational tasks in historical linguistics (List et al. 2017), and selected four datasets, which are publicly available, for my studies, namely: one Indo-European dataset (Dunn 2012), one Pama-Nyungan dataset (Australian languages, Bowern and Atkinson 2012), one Austronesian dataset (Greenhill et al. 2008), and one Austro-Asiatic dataset (Sidwell 2015). The following table lists some basic information about the number of concepts, languages, and the average synonymy, i.e., the average number of words that a concept expresses in the data.
| Dataset | Concepts | Languages | Synonymy |
|---|---|---|---|
| Austro-Asiatic | 200 | 58 | 1.08 |
| Austronesian | 210 | 45 | 1.12 |
| Indo-European | 208 | 58 | 1.16 |
| Pama-Nyungan | 183 | 67 | 1.10 |
For each dataset, I made 1000 coin-tossing trials, in which I randomly picked only one word wherever more than one word was given as the translation of a given concept in a given language. I then computed a phylogeny of each newly created dataset with the help of the Neighbor-joining algorithm (Saitou and Nei 1987), run on the distance matrix of shared cognates. In order to compare the trees, I employed the generalized Robinson-Foulds distance, as implemented in LingPy by Taraka Rama. Since I did not have time to compare all 1000 trees against each other (this takes a long time when computing the analyses for four datasets), I randomly sampled 1000 tree pairs. It is, however, easy to repeat the results and compute the distances for all tree pairs exhaustively. The code and the data that I used can be found online at GitHub (github.com/lingpy/toss-a-coin).
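The distance matrix underlying the Neighbor-joining step can be sketched as follows (my own illustration, not the LingPy code used in the study; the cognate-set identifiers are invented for the example):

```python
# Illustration: distance between two languages as the proportion of concepts
# for which they do not share a cognate set.
def cognate_distance(lang1, lang2, cognates):
    """cognates: {language: {concept: cognate_set_id}}."""
    concepts = set(cognates[lang1]) & set(cognates[lang2])
    same = sum(1 for c in concepts
               if cognates[lang1][c] == cognates[lang2][c])
    return 1 - same / len(concepts)

COGNATES = {
    "German":  {"hand": 1, "mountain": 2, "dog": 3},
    "English": {"hand": 1, "mountain": 4, "dog": 5},
    "Dutch":   {"hand": 1, "mountain": 2, "dog": 5},
}
cognate_distance("German", "Dutch", COGNATES)  # 1/3: they differ only in "dog"
```

Since the choice of synonyms changes which cognate set fills each concept slot, it directly perturbs this matrix, and hence the inferred tree.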
Some results
As shown in the following table, where I added the averaged generalized Robinson-Foulds distances for the pairwise tree comparisons, it becomes obvious that — at least for distance-based phylogenetic calculations — the problem of extensive synonymy and choice of translational equivalents has an immediate impact on phylogenetic reconstruction. In fact, the average differences reported here are higher than the ones we find when comparing phylogenetic reconstruction based on automatic pipelines with phylogenetic reconstruction based on manual annotation (Jäger 2013).
| Dataset | Concepts | Languages | Synonymy | Average GRF |
|---|---|---|---|---|
| Austro-Asiatic | 200 | 58 | 1.08 | 0.20 |
| Austronesian | 210 | 45 | 1.12 | 0.19 |
| Indo-European | 208 | 58 | 1.16 | 0.59 |
| Pama-Nyungan | 183 | 67 | 1.10 | 0.22 |
The most impressive example is the Indo-European dataset, with a remarkable average distance of 0.59. This result almost seems surreal, and at first I suspected that my lazy sampling procedure had introduced a bias. But a second trial confirmed the distance (0.62), and when comparing each of the 1000 trial trees with the tree obtained without excluding the synonyms, the distance is even slightly higher (0.64).
The consensus network of the 1000 trees (created with SplitsTree4, Huson and Bryant 2006), shown below, uses no threshold (to make sure that the full variation can be traced) and the mean for the calculation of branch lengths. It confirms that the variation introduced by the synonyms is indeed real.
[Figure: The consensus network of the 1000-tree sample for the Indo-European language sample]
Notably, the Germanic languages are highly incompatible, followed by Slavic and Romance. In addition, we find quite a lot of variation at the root. Furthermore, when looking at the table below, which shows the ten languages with the largest number of synonyms in the Indo-European data, we can see that most of them belong to the highly incompatible Germanic branch.
| Language | Subgroup | Synonymous Concepts |
|---|---|---|
| OLD_NORSE | Germanic | 83 |
| FAROESE | Germanic | 77 |
| SWEDISH | Germanic | 68 |
| OLD_SWEDISH | Germanic | 65 |
| ICELANDIC | Germanic | 64 |
| OLD_IRISH | Celtic | 61 |
| NORWEGIAN_RIKSMAL | Germanic | 54 |
| GUTNISH_LAU | Germanic | 50 |
| ORIYA | Indo-Aryan | 50 |
| ANCIENT_GREEK | Greek | 46 |
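Counts like those in the table above can be produced with a few lines of Python. The data structure is again hypothetical; in the actual datasets, synonymy simply shows up as multiple entries per concept and language:

```python
from collections import Counter

# Toy data: (language, concept) -> recorded translations (illustrative).
entries = {
    ("OLD_NORSE", "dog"):  ["hundr", "rakki"],
    ("OLD_NORSE", "hand"): ["hǫnd"],
    ("FAROESE",   "dog"):  ["hundur"],
}

def synonymous_concepts(entries):
    """Per language, count the concepts with more than one translation."""
    counts = Counter()
    for (language, _concept), words in entries.items():
        if len(words) > 1:
            counts[language] += 1
    return counts
```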
Conclusion
This study should be taken with due care: it is a preliminary experiment, tested on only four datasets, with a rather rough procedure for sampling the distances. It is perfectly possible that Bayesian methods (as they are now "traditionally" used for phylogenetic analyses in historical linguistics) deal with this problem much better than distance-based approaches. It is also clear that, by sampling the trees in a more rigorous manner (e.g., by setting a threshold to include only those splits that occur frequently enough), the network will look much more tree-like.
However, even if it turns out that the results are exaggerating the situation due to some theoretical or practical errors in my experiment, I think that we can no longer ignore the impact that our data decisions have on the phylogenies we produce. I hope that this preliminary study can eventually lead to some fruitful discussions in our field that may help us to improve our standards of data annotation.
I should also make it clear that this is in part already happening. Our colleagues at Moscow State University, led by George Starostin in the form of the Global Lexicostatistical Database project, try very hard to improve the procedure by which translational equivalents are selected for the languages they investigate. The same applies to colleagues from our department in Jena, who are working on an ambitious database of the Indo-European languages.
In addition to linguists trying to improve the way they sample their data, however, I hope that our computational experts will also begin to take the problem of data sampling in historical linguistics more seriously. A phylogenetic analysis does not start with a Nexus file. Especially in historical linguistics, where we often have very detailed accounts of individual word histories (derived from our qualitative methods), we need to work harder to integrate software solutions and qualitative studies.
References
Arnaud, A., D. Beck, and G. Kondrak (2017) Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2509-2518.
Bowern, C. and Q. Atkinson (2012) Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88. 817-845.
Brysbaert, M., M. Stevens, P. Mandera, and E. Keuleers (2016) How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Frontiers in Psychology 7. 1116.
Dunn, M. (ed.) (2012) Indo-European Lexical Cognacy Database (IELex). http://ielex.mpi.nl/.
Geisler, H. and J.-M. List (2010) Beautiful trees on unstable ground: notes on the data problem in lexicostatistics. In: Hettrich, H. (ed.) Die Ausbreitung des Indogermanischen. Thesen aus Sprachwissenschaft, Archäologie und Genetik. Reichert: Wiesbaden.
Greenhill, S., R. Blust, and R. Gray (2008) The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4. 271-283.
Gudschinsky, S. (1956) The ABC’s of lexicostatistics (glottochronology). Word 12.2. 175-210.
Haspelmath, M. (2010) Comparative concepts and descriptive categories. Language 86.3. 663-687.
Holman, E., S. Wichmann, C. Brown, V. Velupillai, A. Müller, and D. Bakker (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3. 116-121.
Huson, D. and D. Bryant (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23.2. 254-267.
Jäger, G. (2013) Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change 3.2. 245-291.
List, J.-M., J. Pathmanathan, P. Lopez, and E. Bapteste (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39. 1-17.
List, J.-M., S. Greenhill, and R. Forkel (2017) LingPy. A Python Library For Quantitative Tasks in Historical Linguistics. Software Package. Version 2.6. Max Planck Institute for the Science of Human History: Jena.
Milton, J. (2010) The development of vocabulary breadth across the CEFR levels: a common basis for the elaboration of language syllabuses, curriculum guidelines, examinations, and textbooks across Europe. In: Bartning, I., M. Martin, and I. Vedder (eds.) Communicative Proficiency and Linguistic Development: Intersections Between SLA and Language Testing Research. Eurosla: York. 211-232.
Saitou, N. and M. Nei (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4.4. 406-425.
Sidwell, P. (2015) Austroasiatic Dataset for Phylogenetic Analysis: 2015 version. Mon-Khmer Studies (Notes, Reviews, Data-Papers) 44. lxviii-ccclvii.
Starostin, S. (1991) Altajskaja problema i proischoždenije japonskogo jazyka [The Altaic problem and the origin of the Japanese language]. Nauka: Moscow.
Starostin, G. (2013) K probleme dvuch sobak v klassičeskom kitajskom jazyke: canis comestibilis vs. canis venaticus? [On the problem of two words for dog in Classical Chinese: edible vs. hunting dog?]. In: Grincer, N., M. Rusanov, L. Kogan, G. Starostin, and N. Čalisova (eds.) Institutionis conditori: Ilje Sergejeviču Smirnovu [In honor of Ilja Sergejevič Smirnov]. RGGU: Moscow. 269-283.
Swadesh, M. (1950) Salish internal relationships. International Journal of American Linguistics 16.4. 157-167.
Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2. 121-137.
Tuesday, December 19, 2017
The art of doing science: alignments in historical linguistics
In the past two years, during which I have been writing for this blog, I have often tried to emphasize the importance of alignments in historical linguistics — alignment involves explicit decisions about which characters / states are cognate (and can thus be aligned in a data table). I have also often mentioned that explicit alignments are still rarely used in the field.
To some degree, this situation is frustrating, since it seems so obvious that scholars align data in their heads, for example, whenever they write etymological dictionaries and label parts of a word as irregular, i.e. as not fulfilling the expectations of regular sound change (in the sense in which I have described it before). It is also obvious that linguists have tried to use alignments before (even before biologists, as I tried to show in this earlier post), but for some reason alignments never became systematized in the field.
As an example of the complexity of alignment analyses in historical linguistics, consider the following figure, which depicts both an early version of an alignment (following Dixon and Kroeber 1919) and a "modern" version of the same data. For the latter, I used EDICTOR (http://edictor.digling.org), a software tool that I have been developing in recent years, which helps linguists to edit alignments in a consistent way (List 2017). The old version on the left has been modified to make clearer what kind of information the authors tried to convey (for the original, see my older post), while the EDICTOR version contains some markup that is important for linguistics, which I will discuss in more detail below.
[Figure 1: Alignments from Dixon and Kroeber (1919) in two flavors]
If we carefully inspect the first alignment, it becomes evident that the scholars did not align the data sound by sound, but rather morpheme by morpheme. Morphemes are those parts of words that bear a clear-cut meaning, even when taken in isolation or when abstracted from multiple words. The plural ending -s in English, for example, is a morpheme whose function is to indicate the plural (compare horse vs. horses, etc.). In order to save space, the authors used abbreviations both for the language group names and for the names of the languages themselves.
The authors further tried to save space by listing identical words only once, putting two entries, separated by a comma, in the column that I have labelled "varieties". If you compare the entries for NW (= North-Western Maidu) and NE/S (= North-Eastern and Southern Maidu), you can see that the first entry has been swapped: the tsi’ in NW tsi’-bi is obviously better compared with the tsi in NE/S bi-tsi than is the bi in NW with the tsi in NE/S. This could, of course, be a typographical error, but I think it is more likely that the authors did not quite know how to handle swapped elements in their alignment.
In the EDICTOR representation of the alignment, I have tried to align the sounds in addition to aligning the morphemes. My approach here is rather crude. In order to show which sounds most likely share a common origin, I extracted all homologous morphemes, aligned them so that they occur in the same column, and then stripped off the remaining sounds by putting a check-mark in the IGNORE row at the bottom of the EDICTOR representation. When these sound correspondences are further analyzed with software such as the LingPy library (List et al. 2017), all sounds in an IGNORE column will be ignored. Correspondences are then only calculated for the core part of the alignment, namely the two columns that are left over in its center.
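The effect of the IGNORE column can be sketched as a simple filter over alignment columns. This is a toy illustration of the idea, not EDICTOR's or LingPy's actual implementation:

```python
from collections import Counter

def correspondences(alignment, ignore):
    """Count column-wise sound correspondences across an alignment
    (one row per language), skipping the columns flagged IGNORE;
    "-" marks a gap."""
    counts = Counter()
    for i, flagged in enumerate(ignore):
        if not flagged:
            counts[tuple(row[i] for row in alignment)] += 1
    return counts

# Two aligned (hypothetical) cognates; the last column is ignored.
alignment = [
    ["b", "i", "tsi"],
    ["b", "i", "-"],
]
ignore = [False, False, True]
```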
In many cases, this treatment of sound correspondences and homologous words in alignments is sufficient, and also justified. If we want to compare the homologous (cognate) parts across words in different languages, we cannot align the words in their entirety. Consider, for example, the German verb gehen [geːən] and its English counterpart go [gɔu]. German regularly adds the infinitive ending -en to each verb, but English long ago dropped all verb endings apart from the -s of the third person singular (compare go vs. goes). Aligning the verbs as wholes would force us to insert gaps for the verb ending in German, which would not be linguistically meaningful, as those endings were not "gapped" in English, but lost in a morphological process by which all endings of English verbs disappeared.

There are, however, also cases that are more complicated to model, especially when dealing with instances of partial cognacy (or partial homology). Compare, for example, the following alignment of the words for bark (of a tree) in several dialects of Bai, a Sino-Tibetan language spoken in China, whose affiliation with other Sino-Tibetan languages is still unclear (data taken from Wang 2006).
[Figure 2: Alignment of the words for "bark" in the Bai dialects]
In this example, the superscript numbers represent tones, placed at the end of each syllable. Each syllable in these languages usually also represents a morpheme in the sense mentioned above. That means that each of the words is a compound of two original meanings. Comparison with other words in the languages reveals that most dialects, apart from Mazhelong, express bark as tree-skin, a well-known expression that we find in many languages of the world. If we wanted to analyze these words in alignments, we could follow the same strategy as above, and just decide on one core part of the words (probably the skin part) and ignore the rest. However, our calculations of sound correspondences would then lose important information, as the tree part is also cognate in most instances, and therefore rather interesting. But ignoring only the unalignable first syllable of the Mazhelong word would not be satisfying either, since we would again have gaps in the tree part for Mazhelong that do not result from sound change.
The only consistent solution for such cases is to split the words into their morphemes, and then to align all sets of homologous morphemes separately. This can also be done in the EDICTOR tool (although it requires more effort from both the scholar and the algorithms). An example is shown below, where you can see how the tool breaks the linear order of the words as we find them in the languages, in order to cluster them into sets of homologous "word parts".
[Figure 3: Alignments of partial cognates in the Bai dialects]
If we look only at the tree part of these alignments, namely the third cognate set from the left (ID 8), we can see a further complication, as the gaps introduced in some of the words look a little unsatisfying. The reason is that the j in Enqi and Tuolo may just as well be treated as part of the initial of the syllable, and we could re-write it as dj in one segment instead of two. In this way we would capture the correspondence much more properly, as it is well known that the affricate initials of the other dialects ([ts, tʂ, dʐ, dʑ]) often correspond to [dj].
We could thus rewrite the alignment as shown in the next figure, and simply decide that in this situation (and similar ones in our data) we treat the d and the j as one main sound, namely the initial of the syllable.

[Figure 4: Revised alignment of "tree" in the sample]
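Such a re-segmentation can be expressed as a simple merging rule over segmented word forms. This is only a sketch: which pairs to merge is a matter of linguistic analysis, not something the data dictates:

```python
def merge_segments(segments, pairs=frozenset({("d", "j")})):
    """Merge adjacent segments (e.g. 'd' + 'j' -> 'dj'), so that a
    complex initial is treated as a single sound in the alignment."""
    merged = []
    for seg in segments:
        if merged and (merged[-1], seg) in pairs:
            merged[-1] = merged[-1] + seg  # fuse with the previous segment
        else:
            merged.append(seg)
    return merged

merge_segments(["d", "j", "u", "³¹"])  # -> ["dj", "u", "³¹"]
```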
Summary and conclusions
Before I start boring those readers of this blog who are not linguists, and not particularly interested in the details of sound change or language change, let me quickly summarize what I wanted to illustrate with these examples. I think the reason why linguists never really formalized alignments as a tool of analysis is that there are so many ways to come up with possible alignments of words, all of which may be reasonable for a given analysis. In light of this multitude of analytical possibilities, not to mention that historical linguistics is a discipline that often prides itself on hard manual labor that would be impossible for machines, I can partly understand why linguists have been reluctant to use alignments more often in their research.
Judging from my discussions with colleagues, there are still many misunderstandings regarding the purpose and the power of alignment analyses in historical linguistics. Scholars often think that alignments directly reflect sound change. But how could they, given that we do not have any ancestral words in our sample? Alignments are a tool of analysis, and they can help to identify sound change processes or to reconstruct proto-forms of unattested ancestral languages; but they are by no means a true reflection of what happened and how things changed. They are the starting point, not the end point, of the analysis. Furthermore, given that there are many different ways to analyze how languages changed over time, there are also many different ways to analyze language data with the help of alignments. Often, when comparing different alignment analyses of the same languages, there is no simple right and wrong, just a different emphasis in the initial analysis and its purpose.
As David wrote in an email to me:
"An alignment represents the historical events that have occurred. The alignment is thus a static representation of a dynamic set of processes. This is ultimately what causes all of the representational problems, because there is no necessary and sufficient way to achieve this."
This also nicely explains why alignments in biology, too, with respect to the goal of representing homology, "may be more art than science" (Morrison 2015). I admit that I find it a bit comforting that biology has similar problems when it comes to the question of how to interpret an alignment analysis. However, in contrast to linguists, who have never really given alignments a chance, biologists not only use alignments frequently, but also try to improve them.
If I am allowed to have an early New Year wish for the upcoming year, I hope that along with the tools that facilitate the labor of creating alignments for language data, we will also have a more vivid discussion about alignments, their shortcomings, and potential improvements in our field.
References
- Dixon, R. and A. Kroeber (1919) Linguistic families of California. University of California Press: Berkeley.
- List, J.-M. (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations, pp. 9-12.
- List, J.-M., S. Greenhill, and R. Forkel (2017) LingPy: A Python Library for Quantitative Tasks in Historical Linguistics. Software Package. Version 2.6. Max Planck Institute for the Science of Human History: Jena.
- Morrison, D. (2015) Molecular homology and multiple-sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.
- Wang, W.-Y. (2006) Yǔyán, yǔyīn yǔ jìshù 語言,語音與技術 [Language, phonology and technology]. Xiānggǎng Chéngshì Dàxué: Shànghǎi 上海.