The Genealogical World of Phylogenetic Networks: Horizontal and vertical language comparison

In the traditional handbooks on historical language comparison, one can often find the claim that there are two fundamentally different, but equally important, means of linguistic reconstruction. One is usually called "external reconstruction" (or alternatively the "comparative method"), and the other is called "internal reconstruction". If we think of sequence comparison in historical linguistics in the form of a table, in which concepts are arranged on the vertical axis and different languages on the horizontal axis, we can look at the two modes of language comparison (external vs. internal) as the horizontal and the vertical axes of the table. Horizontal language comparison refers to external reconstruction: scholars compare forms (not necessarily of the same meaning) across the horizontal axis, that is, across different languages. Internal language comparison is vertical: scholars search inside one and the same language for structures that allow them to infer its older stages.

In past blog posts I have been talking a lot about horizontal / external language comparison, for which the notion of sound correspondences is especially crucial. But in the same way in which we use the evidence across languages to infer the past states of a given language family, we can make use of language-internal evidence to learn more about the history not only of a given language, but also of a group of languages.

Vertical language comparison

A classical example of vertical or internal language comparison is the investigation of paradigms, that is, the inflection systems of the verbs or nouns in a given language. This, of course, makes sense only if the respective languages have verbal or nominal morphology, i.e. if we find differences in the verb forms for the first, second, or third person singular or plural, or for the case system. The principle would not work in Chinese, although we have different means to compare languages without inflection vertically, as I'll illustrate below.

As a simplified example of internal reconstruction, consider the verbal paradigm of the verb esse "to be" in Latin:

Person	Singular	Plural
first	sum	sumus
second	es	estis
third	est	sunt

If you try to memorize this pattern, you will quickly realize that it is not regular, and you will have difficulty identifying patterns that assist in memorizing the forms. A much more regular pattern would be the following:

Person	Singular	Plural
first	es-um	es-umus
second	es-Ø	es-tis
third	es-t	es-unt

This pattern would still require us to memorize six different endings, but we could safely remember that the beginning of all forms is the same, and that there are six different endings, accounting for person and number at the same time (which is anyway typical for inflecting languages).

An alternative pattern that would be easier to remember is the following one:

Person	Singular	Plural
first	es-um	s-umus
second	es-Ø	s-tis
third	es-t	s-unt

While it may seem that this pattern is slightly more complicated at first glance, it would still be more regular than the pattern we actually observe, and we would now have two different aspects expressing the meaning of the different forms: the alternation of the root es- vs. s- accounts for the singular-plural distinction, while the endings express again both number and person.
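The segmentation in the last table can be written down as a small sketch. The names ROOTS, ENDINGS, ATTESTED and regularized are my own, chosen for illustration; the code only recombines the roots and endings from the table and compares the result with the attested Latin forms, so it shows which cells of the paradigm call for a historical explanation rather than performing any reconstruction itself.

```python
# Segmentation taken from the third table above: the root alternates
# (es- in the singular, s- in the plural), the ending marks person and number.
ROOTS = {"singular": "es", "plural": "s"}
ENDINGS = {
    ("first", "singular"): "um",
    ("second", "singular"): "",  # zero ending (written as Ø in the table)
    ("third", "singular"): "t",
    ("first", "plural"): "umus",
    ("second", "plural"): "tis",
    ("third", "plural"): "unt",
}

# The attested paradigm of Latin esse from the first table.
ATTESTED = {
    ("first", "singular"): "sum",
    ("second", "singular"): "es",
    ("third", "singular"): "est",
    ("first", "plural"): "sumus",
    ("second", "plural"): "estis",
    ("third", "plural"): "sunt",
}


def regularized(person, number):
    """Build the hypothetical regular form from root + ending."""
    return ROOTS[number] + ENDINGS[(person, number)]


# Cells where the regularized pattern and the attested form disagree.
mismatches = [key for key, form in ATTESTED.items() if regularized(*key) != form]

for (person, number), form in ATTESTED.items():
    print(f"{person:8}{number:10}{regularized(person, number):8}{form}")
print("cells that differ:", mismatches)
```

Under this segmentation, four of the six cells coincide with the attested forms, and only the first person singular (sum) and the second person plural (estis) deviate from the regular pattern.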

If we look at older stages of Latin, we can, indeed, find evidence for the first person singular, which was written esom in ancient documents (see Meier-Brügger 2002 for details on the reconstruction of this paradigm in Indo-European). If we look at other languages, like Sanskrit and Ancient Greek, we can further see that the alternation between es- and s- in the root (as in our last example) also comes much closer to the supposed ancient state, even if we don't find complete evidence for this in Latin alone.

What we can see, however, is that the inspection of alternating forms of the same root can reveal ancient states of a language. The key assumption is that observed irregularities usually go back to formerly regular patterns.

Horizontal language comparison

The classical example of horizontal or external language comparison is the typical wordlist, in which words with similar meanings across different languages are arranged in tabular form. I have mentioned before that it was in great part Morris Swadesh (1909-1967) who popularized the simple tabular perspective that puts a concept and its various translations in the center of historical language comparison. Before the development of this concept-based approach to historical linguistics, scholars would pick examples based on their similarity in form, allowing for great differences in the semantics of the words being assigned to the same slot of cognate words; and this exclusively form-based approach to external language comparison is still the prevalent one in most branches of historical linguistics.

No matter what approach we employ in this context — be it the concept- or the form-based — as long as we compare forms across different languages, we carry out external language comparison, and our main concern is then the identification of regular sound correspondences across the languages in our sample, which enable us to propose ancestral sounds for the ancestral language.
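To make the idea of counting recurring sound correspondences a bit more concrete, here is a deliberately crude sketch. The wordlist is a toy sample of well-known English and German cognates in standard orthography (not in phonetic transcription), and the function initial_correspondences is a hypothetical helper that looks only at the first letter of each word. Real analyses work on full phonetic alignments, so this should be read as an illustration of the principle, not as a method.

```python
from collections import Counter

# Toy concept-based wordlist: one row per concept, one column per language.
# Forms are given in standard orthography, which is a strong simplification.
WORDLIST = {
    "ten": {"English": "ten", "German": "zehn"},
    "two": {"English": "two", "German": "zwei"},
    "tongue": {"English": "tongue", "German": "Zunge"},
    "tooth": {"English": "tooth", "German": "Zahn"},
    "hand": {"English": "hand", "German": "Hand"},
    "house": {"English": "house", "German": "Haus"},
}


def initial_correspondences(wordlist, lang_a, lang_b):
    """Count how often each pair of word-initial letters co-occurs."""
    counts = Counter()
    for forms in wordlist.values():
        pair = (forms[lang_a][0].lower(), forms[lang_b][0].lower())
        counts[pair] += 1
    return counts


counts = initial_correspondences(WORDLIST, "English", "German")
for (a, b), n in counts.most_common():
    print(f"English {a} : German {b}  ({n} times)")
```

A pairing such as English t with German z only becomes interesting because it recurs across several concepts; a single matching pair would tell us nothing about regularity.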

Problems of vertical language comparison

As can be seen from my above example of the inflection of esse in Latin, it is not obvious how the task of internal language comparison could be formalized and automated. There are two main reasons for this. First, inflection paradigms vary greatly among the languages of the world, which makes it difficult to come up with a common way to investigate them.

Second, since we are usually looking for irregular cases that we try to explain as having evolved from former regularities, it is clear that our data will be extremely sparse. Often, it is only the paradigm of one word that we seek to explain, as we have seen for Latin esse, and patterns of irregularities across many verbs are rather rare (although we can also find examples for this). As a result, internal reconstruction is dealing with even fewer data than external reconstruction, where data are also not necessarily big.

Formalizing the language-internal analysis of word families

Despite the obvious problems of exploiting the language-internal perspective in historical language comparison, there are certain types of linguistic analysis that are amenable to a more formal treatment in this area. One example that we are currently testing is the inference and annotation of word families within a given language. It is well known that a large number of words in human languages are not unrelated atomic units, but have themselves been created from smaller parts. Linguists distinguish derivation and compounding as the major techniques here, by which new words are created from existing ones.

Derivation refers to those cases where a word is modified by a form unit that could not form a word of its own, usually a suffix or a prefix. As an example, consider the suffix -er in English, which can be attached to verbs in order to form a noun that usually describes the person who regularly carries out the action denoted by the original verb (e.g. examine → examiner, teach → teacher, etc.). While the original verb form exists without the suffix in the English language, the form -er only occurs as part of other words. In contrast to derivation, compounding refers to the process by which two word forms that can be used in isolation are merged to form a new expression (compare foot and ball with football).
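The distinction can be stated as a simple rule of thumb over words that have already been segmented by hand. In the sketch below, the set FREE_FORMS, the "+" boundary marker, and the function classify are assumptions made for illustration only; the hard part, finding the segmentation in the first place, is not addressed.

```python
# Forms that can be used as words of their own (a tiny illustrative lexicon).
FREE_FORMS = {"teach", "sing", "foot", "ball"}


def classify(segmented):
    """Classify a manually segmented word, with '+' marking morpheme boundaries."""
    parts = segmented.split("+")
    if len(parts) < 2:
        return "simplex"
    if all(part in FREE_FORMS for part in parts):
        # Every part can stand alone: compounding.
        return "compound"
    if any(part in FREE_FORMS for part in parts):
        # A free form combined with a bound unit such as a suffix: derivation.
        return "derivation"
    return "unknown"


for word in ["teach+er", "sing+er", "foot+ball", "foot"]:
    print(word, "->", classify(word))
```

The rule is only as good as the list of free forms it is given, which is one reason why such decisions usually rest on a linguist's knowledge of the language.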

Searching for suffixes and compounds in unannotated language data is a very difficult task. Although scholars have been working on automatic methods that split a given monolingual dictionary into its smallest meaning-bearing form units (morphemes), these methods usually only work on very large datasets (Creutz and Lagus 2005). Trained linguists, on the other hand, can easily detect patterns, even when working on smaller datasets of a few hundred words.

The reason why linguists are successful in analysing the morphology of languages, in contrast to machine-learning approaches, is that they make active use of their external knowledge about the potential semantics underlying the patterns, while current methods for automatic morpheme detection usually only consider the forms, and disregard the semantics. Semantics, however, are important to distinguish words that form a true family (in that they share cognate material) from words that are similar only due to chance.

It is clear that languages may have words that sound alike but convey different meanings. As an extreme example, consider French paix [pɛ] "peace" vs. pet [pɛ] "fart". Although both words are pronounced the same, we know that they are not cognate, going back to different ancestral forms, as is also reflected in the French writing system. But even if we lacked the evidence of the French orthography, we could easily justify that the words do not form a family, since (a) their meaning is quite different, and (b) their grammatical gender is different as well (la paix vs. le pet). An automatic method that disregards semantics and external evidence (like the orthography or the gender of nouns in our case) cannot distinguish words that are similar due to chance from words that are similar due to their history.
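The point can be illustrated with a minimal sketch in which the two French words are grouped first by pronunciation alone and then by pronunciation together with gender. The record layout and the helper group_by are hypothetical; the data are just the two words discussed above.

```python
from collections import defaultdict

# The two French homophones discussed above, with the external evidence
# (gender and meaning) that a purely form-based method would ignore.
ENTRIES = [
    {"word": "paix", "ipa": "pɛ", "gender": "f", "gloss": "peace"},
    {"word": "pet", "ipa": "pɛ", "gender": "m", "gloss": "fart"},
]


def group_by(entries, *keys):
    """Group words by the values they have for the given attributes."""
    groups = defaultdict(list)
    for entry in entries:
        groups[tuple(entry[key] for key in keys)].append(entry["word"])
    return dict(groups)


# Form only: both words fall into one group and look like a family.
by_form = group_by(ENTRIES, "ipa")
# Form plus gender: the words are separated.
by_form_and_gender = group_by(ENTRIES, "ipa", "gender")

print(by_form)
print(by_form_and_gender)
```

Gender is of course only a proxy here: words of the same gender can be chance look-alikes too, which is why the semantic judgment remains the decisive step.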

As a further example illustrating the importance of semantics, consider the data for Achang, a Burmish language, spoken in Myanmar (data from Huáng 1992), which is shown in the following graphic (derived from the EDICTOR tool and analyzed by Nathan W. Hill).

Word families in Achang, a Burmish language.

In this figure, we can see six words which all share tɕʰi⁵⁵ (superscript numbers represent tones) as their first part. As we can see from the detailed analysis of these compounds in Achang, which is given in the column "MORPHEMES" in the figure, our analysis claims that the form tɕʰi⁵⁵, which expresses the concepts "foot" or "leg" in isolation, recurs in the words for "hoof", "claw", "knee", and "thigh", but not in the word for "ant". While the semantic commonalities among the former are plausible, as they all denote body parts which are closely related to "feet" or "legs", we do not find any transparent motivation for why the speakers should have used a compound containing the word for "foot" to denote an ant. Although we cannot demonstrate this at this point, we are hesitant to add the Achang word for "ant" to the word family based on compounds containing the word for "foot".

Bipartite networks of word families

For the time being, we cannot automate this analysis, since we lack data for the testing and training of potential algorithms. We can, however, formalize it in a very straightforward way: with the help of a bipartite network (see Hill and List 2017). Bipartite networks are networks with two kinds of nodes, which are usually thought of as representing different types. While we can easily assign different types to all nodes in any network we are dealing with, bipartite networks only allow us to link nodes of different types. In our bipartite network of word families, the first type of node represents the forms of the words, while the second type represents the meanings attributed to the sub-parts of the words. In the figure above, the former can be found in the column "tokens", where the symbol "+" marks the boundaries, and the latter can be found in the column "MORPHEMES".

The following figure shows the bipartite network underlying the word family relations following from our analysis of words built with the morpheme "foot" in Achang.

Bipartite network of word families: nodes in red text represent the (reconstructed) meaning of the morphemes, and blue nodes the words in which those occur as parts.
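A bipartite network of this kind needs nothing more than a list of edges between word nodes and morpheme nodes. The following sketch is a simplified stand-in for the network in the figure: word nodes are identified by their concept glosses only (the Achang forms themselves are not reproduced), the second elements of the compounds are left out, and the label "ant-1" is a placeholder of my own for the unrelated first part of the word for "ant".

```python
from collections import defaultdict

# Each edge links a word node to a morpheme node. Only the morpheme that the
# analysis above identifies is included; other parts of the words are omitted.
EDGES = [
    ("word:foot/leg", "morpheme:foot"),
    ("word:hoof", "morpheme:foot"),
    ("word:claw", "morpheme:foot"),
    ("word:knee", "morpheme:foot"),
    ("word:thigh", "morpheme:foot"),
    ("word:ant", "morpheme:ant-1"),  # placeholder label, not from the source
]


def is_bipartite(edges):
    """Check that every edge links a word node to a morpheme node."""
    return all(
        a.startswith("word:") and b.startswith("morpheme:") for a, b in edges
    )


def word_families(edges):
    """Collect, for each morpheme node, the words that contain it."""
    families = defaultdict(list)
    for word, morpheme in edges:
        families[morpheme].append(word)
    return {morpheme: sorted(words) for morpheme, words in families.items()}


families = word_families(EDGES)
print(is_bipartite(EDGES))
for morpheme, words in families.items():
    print(morpheme, "->", words)
```

Keeping "ant" attached to its own morpheme node is how the annotation records the decision discussed above: the word looks like the others in form, but it is not linked to the family of "foot".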

Conclusion

The bipartite network above shows only a small part of the word family structure of one language, and the analysis and formalization of word families with the help of bipartite networks thus remains exemplary and anecdotal. I hope, however, that the example illustrates how important it is to keep in mind that language change is not only about sound shifts that can be analyzed with the help of language-external, horizontal comparison. Investigating the vertical (language-internal) perspective of language evolution is not only fascinating, offering many methodological problems that are so far unresolved, but also at least as important as the horizontal perspective for a proper understanding of the dynamics underlying language change.

References

Creutz M. and Lagus K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology, 2005, 81.

Hill N. and List J.-M. (2017) Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1. 47–76.

Meier-Brügger M. (2002) Indogermanische Sprachwissenschaft. de Gruyter: Berlin.

Huáng Bùfán 黃布凡 (1992) Zàngmiǎn yǔzú yǔyán cíhuì [A Tibeto-Burman lexicon]. Zhōngyāng Mínzú Dàxué 中央民族大学 [Central Institute of Minorities]: Běijīng 北京.

Monday, June 25, 2018
