Monday, May 27, 2019

Automatic phonological reconstruction (Open problems in computational diversity linguistics 4)

The fourth problem in my list of open problems in computational diversity linguistics is devoted to the problem of linguistic reconstruction, or, more specifically, to the problem of phonological reconstruction, which can be characterized as follows:
Given a set of cognate morphemes across a set of related languages, try to infer the hypothetical pronunciation of each morpheme in the proto-language.
This task needs to be distinguished from the broader task of linguistic reconstruction, which usually also includes the reconstruction of full lexemes, i.e. lexical reconstruction — as opposed to single morphemes or "roots" in an unknown ancestral language. In some cases, linguistic reconstruction is even used as a cover term for all reconstruction methods in historical linguistics, including such diverse approaches as phylogenetic reconstruction (finding the phylogeny of a language family), semantic reconstruction (finding the meaning of a reconstructed morpheme or root), or the task of demonstrating that languages are genetically related (see, e.g., the chapters in Fox 1995).

Phonological and lexical reconstruction

In order to understand the specific difference between phonological and lexical reconstruction, and why making this distinction is so important, consider the list of words meaning "yesterday" in five Burmish languages (taken from Hill and List 2017: 51).

Figure 1: Cognate words in Burmish languages (taken from Hill and List 2017)

Four of these languages express the word "yesterday" with the help of more than one morpheme, indicated by using different colors in the table's phonetic transcriptions, which at the same time also indicate which words we consider to be homologous in this sample. Four of the languages have one morpheme which (as we confirmed from the detailed language data) means "day" independently. This morpheme is given the label 2 in the last column of the table. From this, we can see that the motivation by which the word for "yesterday" is composed in these languages is similar to the one we observe in English, where we also find the word day being a part of the word yester-day.

If we want to know how the word "yesterday" was expressed in the ancestor of the Burmish languages, we could make an abstract estimation based on the lexical material we have at hand. We might assume that it was also a compound word, given the importance of compounding in all living Burmish languages. We could further hypothesize that one part of the ancient compound would have been the original word for "day". We could even make a guess and say the word was similar in structure to Bola and Lashi (although it is difficult to find a justification for doing this). In all cases, we would propose a lexical reconstruction of the word for "yesterday" in Proto-Burmish. We would make an assumption with respect to what one could call the denotation structure, or the motivation structure, as we called it in Hill and List (2017: 67). This assumption would not need to provide an actual pronunciation of the word; it could be proposed entirely independently.

If we want to reconstruct the pronunciation of the ancient word for "yesterday" as well, we have to compare the corresponding sounds, and build a phonological reconstruction for each of the morphemes separately. As a matter of fact, scholars working on South-East Asian languages rarely propose a full lexical reconstruction as part of their reconstruction systems (for a rare exception, see Mann 1998). Instead, they pick the homologous morphemes from their word comparisons, assign some rough meaning to them (this step would be called semantic reconstruction), and then propose an ancient pronunciation based on the correspondence patterns they observe.

When listing phonological reconstruction as one of my ten problems, I am deliberately distinguishing this task from the tasks of lexical reconstruction or semantic reconstruction, since they can (and probably should) be carried out independently. Furthermore, by describing pronunciation of the morphemes as "hypothetical pronunciations" in the ancestral language, I want not only to emphasize that all reconstruction is hypothetical, but also to point to the fact that it is very possible that some of the morphemes for which one proposes a proto-form may not even have existed in the proto-language. They could have evolved only later as innovations on certain branches in the history of the languages. For the task of phonological reconstruction, however, this would not matter, since the question of whether a morpheme existed in the most recent common ancestor becomes relevant only if one tries to reconstruct the lexicon of a given proto-language. But phonological reconstruction seeks to reconstruct its phonology, i.e. the sound inventory of the proto-language, and the rules by which these sounds could be combined to form morphemes (phonotactics).

Why phonological reconstruction is hard

That phonological reconstruction is hard should not be surprising. The task entails finding the most probable pronunciation for a set of morphemes in a language for which no written records exist. Imagine that, as a biologist, you wanted to find the DNA of LUCA, not even in its folded form with all of the pieces in place, but just a couple of chunks, in order to get a better picture of what this LUCA might have looked like. But while we can employ some weak version of uniformitarianism when trying to reconstruct at least some genes of our LUCA (we would still assume that it used some kind of DNA, drawn from the typical alphabet of DNA letters), in linguistics we face the specific problem that we cannot even be sure about the letters.

Only recently, Blasi et al. (2019) argued that sounds like f and v may have evolved later than the other sounds we can find in the languages of the world, driven by post-Neolithic changes in the bite configuration, which seem to depend on what we eat. As a rule, and independent of these findings, linguists do not tend to reconstruct an f for the proto-language in those cases where they find it corresponding to a p, since we know that in almost all known cases a p can evolve into an f, but an f almost never becomes a p again. This can lead to the strange situation where some linguists reconstruct a p for a given proto-language even though all descendants show an f, which is, of course, an exaggeration of the principle (see Guillaume Jacques' discussion on this problem).

But the very idea that we may have good reasons to reconstruct something in our ancestral language that has been lost in all descendant languages is completely normal for linguists. In 1879, for example, Ferdinand de Saussure (Saussure 1879) used internal and comparative evidence to propose the existence of what he called coefficients sonantiques in Proto-Indo-European. His proposal included the prediction that — if ever a language was found that retained these elements — these new sounds would surface as segmental elements, as distinctive sounds, in certain cognate sets where all known Indo-European languages had already lost the contrast.

These sounds are nowadays known as laryngeals (*h1, *h2, *h3, see Meier-Brügger 2002), and when Hittite was identified as an Indo-European language (Hrozný 1915), one of the two sounds predicted by Saussure could indeed be identified. I have discussed before on this blog the problem of unattested character states in historical linguistics, so there is no need to go into further detail. What I want to emphasize is that this aspect of linguistic reconstruction in general, and phonological reconstruction specifically, is one of the many points that makes the task really hard, since any algorithm to reconstruct the phonological system of some proto-language would have to find a way to formalize the complicated arguments by which linguists infer that there are traces of something that is no longer there.

There are many more things that I could mention if I wanted to characterize the difficulty of phonological reconstruction in its entirety. What I find most difficult to deal with is that the methodology is insufficiently formalized. Linguists have their success stories, which helped them to predict certain aspects of a given proto-language that could later be confirmed, and it is due to these success stories that we are confident that it can, in principle, be done. But the methodological literature is sparse, and the rare cases where scholars have tried to formalize it are rarely discussed when it comes to evaluating concrete proposals (for an example of such an attempt at formalization, see Hoenigswald 1960). Before this post becomes too long, I will therefore conclude by noting that scholars usually have a pretty good idea of how they should perform their phonological reconstructions, but that this knowledge of how one should reconstruct a proto-language is usually not seen as something that could be formalized completely.

Traditional strategies for phonological reconstruction

Given the lack of methodological literature on phonological reconstruction, it is not easy to describe how it should be done in an ideal scenario. What seems to me the most promising approach is to start from correspondence patterns. A correspondence pattern is an abstraction from individual alignment sites distributed over cognate sets drawn from related languages. As I have tried to show in a paper published earlier this year (List 2019), a correspondence pattern summarizes individual alignment sites in an abstract form in which missing data are imputed. I will avoid going into the details here but, as a shortcut, we can say that each correspondence pattern should, in theory, correspond to only one proto-sound in the language, although the same proto-sound may correspond to more than one correspondence pattern. As an example, consider the following table, showing three (fictitious) patterns that would all be reconstructed as *p.

Proto-Form   L₁   L₂   L₃
*p           p    p    f
*p           p    p    p
*p           b    p    p
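The notion of a correspondence pattern can be sketched in a few lines of Python. The snippet below is a deliberately simplified illustration, not the inference method of List (2019): it merely collects identical alignment sites (columns) across pre-aligned cognate sets, ignoring gaps and the imputation of missing data, and all forms are invented toy data.

```python
from collections import defaultdict

def collect_patterns(alignments):
    """Collect identical alignment sites (columns) across cognate sets.

    `alignments` maps a cognate-set ID to a list of aligned words of
    equal length, one per language, in a fixed language order.
    """
    patterns = defaultdict(list)
    for cogid, words in alignments.items():
        for site in zip(*words):           # one column = one alignment site
            patterns[site].append(cogid)   # identical columns join one pattern
    return patterns

# Invented toy data for three languages L1, L2, L3.
alignments = {
    1: ["pat", "pat", "pat"],
    2: ["pia", "pia", "fia"],
}
patterns = collect_patterns(alignments)
# ("p", "p", "p") and ("p", "p", "f") come out as distinct patterns,
# even though a linguist might reconstruct *p for both of them.
```

Note that this naive grouping keeps the two p-patterns apart; deciding that both go back to *p is exactly the step that requires the conditioning contexts discussed below.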

To justify reconstructing the same proto-sound *p for all three patterns, linguists invoke conditioning context, looking at the actual words from which the patterns were derived. An example is shown in the next table.

Proto-Form   L₁        L₂        L₃
*p i a ŋ     p i a ŋ   p i u ŋ   f a n
*p a t       p a t     p a t     p a t
*a p a ŋ     a b a ŋ   a p a ŋ   a p a n

What you should be able to see from the table is that in all three patterns we can find a conditioning factor that allows us to assume that the deviation from the original *p is secondary. In language L₃, the factor is the palatal environment in the ancestral language (the *p is followed by the front vowel *i). We would assume that this environment triggered the change from *p to f in this language. In the case of the change from *p to b in L₁, the triggering environment is that the p occurs intervocalically.

To summarize: what linguists usually do in order to reconstruct proto-forms for ancestral languages that are not attested in written sources, is to investigate the correspondence patterns, and to try to find some neat explanation of how they could have evolved, given a set of proto-forms along with triggering contexts that explain individual changes in individual descendant languages.
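This kind of reasoning can be mimicked in code. The sketch below derives reflexes from a proto-form by applying context-sensitive replacement rules; the rules are invented to match the toy tables above (*p > f before *i in L₃, *p > b between vowels in L₁), and only the consonant changes are modeled, not the vowel developments.

```python
def apply_rules(proto, rules):
    """Derive a reflex by applying context-sensitive rules to a proto-form.

    Each rule is a triple (source, target, condition); the condition
    inspects the whole proto-form and the current segment position.
    """
    out = []
    for i, seg in enumerate(proto):
        for src, tgt, cond in rules:
            if seg == src and cond(proto, i):
                out.append(tgt)
                break
        else:                      # no rule applied: keep the segment
            out.append(seg)
    return "".join(out)

VOWELS = set("aeiou")

# Invented rule sets mirroring the toy patterns above.
rules_L1 = [("p", "b", lambda w, i: 0 < i < len(w) - 1
             and w[i - 1] in VOWELS and w[i + 1] in VOWELS)]
rules_L3 = [("p", "f", lambda w, i: i + 1 < len(w) and w[i + 1] == "i")]
```

With these rules, `apply_rules("apaŋ", rules_L1)` yields "abaŋ" while "pat" stays unchanged in both languages, matching the table.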

Computational strategies for phonological reconstruction

Not many attempts have been made so far to automate the task of reconstruction. The most prominent proposal in this direction has been made by Bouchard-Côté et al. (2013). Their strategy radically differs from the strategy outlined above, since they do not make use of correspondence patterns, but instead use a stochastic transducer and known cognate words in the descendant languages, along with a known phylogenetic tree that they traverse, inferring the most likely changes that could explain the observed distribution of cognate sets.

So far, this method has been tested only on Austronesian languages and their subgroups, where it performed particularly well (with error rates between 0.12 and 0.25, using edit distance as the evaluation measure). Since it is not available as a software package that could conveniently be used and tested on other language families, it is difficult to tell how well it would perform when presented with more challenging test cases.

In a forthcoming paper, Gerhard Jäger illustrates how classical methods for ancestral state reconstruction applied to aligned cognate sets could be used for the same task (Jäger forthcoming). While Jäger's method is more in line with "linguistic thinking", insofar as he uses alignments and applies ancestral state reconstruction to each column of an alignment, it does not make use of correspondence patterns, which is how linguists would generally proceed. This may also explain the performance, which shows an error rate of 0.48 (also using edit distance for evaluation) — although this is also due to the fact that the method was tested on Romance languages and compared with Latin, which is believed to be older than the ancestor of all Romance languages.

Problems with computational strategies for phonological reconstruction

Both the method of Bouchard-Côté et al. and the approach of Jäger suffer from the problem of not being able to detect unobserved sounds in the data. Jäger side-steps this problem by using a reduced alphabet of only 40 characters, proposed by the ASJP project, which has encoded more than half of the world's languages in this form. Bouchard-Côté's test data, Proto-Austronesian (and its subgroups), are fairly simple in this regard. It would therefore be interesting to see what would happen if the methods were tested on full phonetic (or phonological) representations of more challenging language families (for example, the Chinese dialects). While Jäger's approach assumes the independence of all alignment sites, Bouchard-Côté's stochastic transducers handle context at the level of bigrams (if I read their description properly). However, while bigrams can be seen as an improvement over ignoring conditioning context entirely, they are not the way in which context is typically handled by linguists. As I tried to explain briefly in last month's post, context in historical linguistics calls for a handling of abstract contexts, for example by treating sequences as layered entities, similar to musical scores.

Apart from the handling of context and unobserved characters, the evaluation measure used in both approaches also seems problematic. Both approaches use the edit distance (Levenshtein 1965), which is equivalent to the Hamming distance (Hamming 1950) applied to aligned sequences. Given the problem of unobserved characters and the abstract nature of linguistic reconstruction systems, however, any measure that evaluates only the surface similarity of sequences is essentially wrong.

To illustrate this point, consider the reconstruction of the Indo-European word for "sheep" by Kortlandt (2007), who gives *ʕʷ e u i s, as compared to Lühr (2008), who gives *h₂ ó w i s. The normalized edit distance between the two forms is the Hamming distance of their (trivial) alignment: they differ in three of five segments, which yields an unnormalized edit distance of three, and a normalized edit distance of 0.6. While this is rather high, the two systems are largely compatible, since Kortlandt reconstructs *ʕʷ in most cases where Lühr writes *h₂. The distance should therefore be much lower; in fact, it should be zero, since both authors agree on the structure of the form they reconstruct, relative to the structure of the other words they reconstruct for Proto-Indo-European.
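The arithmetic can be checked directly: because the two reconstructions align trivially, the normalized edit distance reduces to a normalized Hamming distance over the aligned segments.

```python
def normalized_hamming(a, b):
    """Hamming distance of two aligned sequences, divided by their length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

kortlandt = ["ʕʷ", "e", "u", "i", "s"]
luehr     = ["h₂", "ó", "w", "i", "s"]
distance = normalized_hamming(kortlandt, luehr)  # 3 of 5 segments differ: 0.6
```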

Since scholars do not necessarily select phonetic values in their reconstructions that derive directly from the descendant languages, and may moreover often differ regarding the details of the phonetic values they propose, a valid evaluation of different reconstruction systems (including automatically derived ones) needs to compare the structure of the systems, not their substance (see List 2014: 48-50 for a discussion of structural and substantial differences between sequences).

Currently, there is (to my knowledge) no accepted solution for the comparison of structural differences among aligned sequences. Finding an adequate evaluation measure for comparing reconstruction systems can therefore be seen as a sub-problem of the bigger problem of phonological reconstruction. To illustrate why it is so important to compare structural information and not pure substance, consider the three cases in which Jäger's reconstruction gives a v as opposed to a w in Latin (data here): while evaluating by the edit distance yields a score of 0.48, this score drops only to 0.47 when the v instances are replaced by a w. Jäger's system is doing something right here, but the edit distance cannot capture the fact that the system deviates from Latin systematically, not randomly.
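One simple (and assumption-laden) way to operationalize structural agreement is to ask whether one aligned system can be turned into the other by a consistent one-to-one renaming of its symbols; under such a check, a systematic v-for-w substitution would not count as an error at all. The sketch below is only an illustration of the idea, not an established evaluation measure.

```python
def structurally_equivalent(a, b):
    """True if two aligned symbol sequences differ only by a consistent
    one-to-one renaming of symbols (a bijection between alphabets)."""
    fwd, bwd = {}, {}
    for x, y in zip(a, b):
        if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
            return False
    return True

# A systematic substitution (v wherever the gold standard has w)
# is structurally harmless ...
assert structurally_equivalent(["v", "i", "v", "o"], ["w", "i", "w", "o"])
# ... while an inconsistent substitution is a real structural difference.
assert not structurally_equivalent(["p", "a", "p"], ["p", "a", "b"])
```

A full measure would of course need graded scores rather than a yes/no answer, which is where B-Cubed scores, discussed below, come in.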

Initial ideas for improvement

There are many things that we can easily improve when working on automatic methods for phonological reconstruction.

As a first point, we should work on enhanced measures of evaluation, going beyond the edit distance as our main evaluation measure. In fact, this can easily be done. With B-Cubed scores (Amigó et al. 2009), we already have a straightforward measure to test whether two reconstruction systems are structurally identical or similar. In order to apply these scores, the automatic reconstructions have to be aligned with the gold standard. If they are identical, even though the individual symbols may differ, the scores will indicate this. The problem of comparing reconstruction systems is, of course, more difficult, since we can face cases where systems are not structurally identical (structurally identical meaning that every symbol a in system A can be consistently replaced by a symbol a' in system B to produce B from A, and vice versa), but the scores would be a start.
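As a sketch of how B-Cubed scores apply here: treat each reconstruction system as a labeling of the aligned positions, where two positions belong together whenever they carry the same symbol; precision and recall then measure cluster overlap, so a consistent relabeling scores perfectly. This is a minimal implementation of the idea, not the evaluation pipeline used in our experiments.

```python
def b_cubed(system, gold):
    """B-Cubed precision, recall and F-score (Amigó et al. 2009) for two
    labelings of the same aligned positions."""
    def clusters(labels):
        out = {}
        for i, lab in enumerate(labels):
            out.setdefault(lab, set()).add(i)
        return out

    cs, cg = clusters(system), clusters(gold)
    n = len(system)
    precision = sum(len(cs[s] & cg[g]) / len(cs[s])
                    for s, g in zip(system, gold)) / n
    recall = sum(len(cs[s] & cg[g]) / len(cg[g])
                 for s, g in zip(system, gold)) / n
    return precision, recall, 2 * precision * recall / (precision + recall)

# A system that writes v wherever the gold standard writes w is
# structurally identical and receives a perfect score.
p, r, f = b_cubed(["v", "a", "v"], ["w", "a", "w"])  # p = r = f = 1.0
```

By contrast, a system that splits one gold cluster into two (e.g. f where the gold standard keeps p) keeps perfect precision but loses recall, so the scores localize where a reconstruction system deviates structurally.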

Furthermore, given that we lack test cases, we might want to work on semi-automatic instead of fully automatic methods, in the meantime. Given that we have a first method to infer sound correspondence patterns from aligned data (List 2019), we can infer all patterns and have linguists annotate each pattern by providing the proto-sound they think would fit best — we are testing this at the moment. Having created enough datasets in this form, we could then think of discussing concrete algorithms that would derive proto-forms from correspondence patterns, and use the semi-automatically created and manually corrected data as gold standard.

Last but not least, one straightforward way of formally deriving unknown sounds from known data is to represent sounds as vectors of phonological features rather than as bare symbols (e.g. representing p as a voiceless bilabial plosive and b as a voiced bilabial plosive). If we then compare alignment sites or correspondence patterns via their feature vectors, we can check to what degree standard algorithms for ancestral state reconstruction propose unattested sounds similar to the ones proposed by experts. In order to do this, we would need to encode our data in transparent transcription systems. This is not the case for most current datasets, but with the Cross-Linguistic Transcription Systems initiative we already have a first attempt to provide features for the majority of sounds found in the languages of the world (Anderson et al. forthcoming).
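A minimal sketch of the feature-vector idea, with hand-coded vectors for just three sounds (a real system would take its features from the CLTS data rather than hard-coding them): reconstructing each feature separately, e.g. by majority vote over the reflexes, yields a feature bundle, and nothing guarantees that this bundle matches a sound attested in any of the descendant languages.

```python
from collections import Counter

# Hypothetical, hand-coded feature vectors; CLTS would supply these.
FEATURES = {
    "p": {"voiced": "no",  "place": "bilabial",    "manner": "plosive"},
    "b": {"voiced": "yes", "place": "bilabial",    "manner": "plosive"},
    "f": {"voiced": "no",  "place": "labiodental", "manner": "fricative"},
}

def majority_features(sounds):
    """Reconstruct a proto-sound as the per-feature majority over reflexes."""
    keys = FEATURES[sounds[0]]
    return {k: Counter(FEATURES[s][k] for s in sounds).most_common(1)[0][0]
            for k in keys}

# For the reflexes p, b, f the per-feature majority happens to be the
# bundle of p, but with other reflex sets the reconstructed bundle can
# correspond to a sound that none of the descendant languages attests.
proto = majority_features(["p", "b", "f"])
```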


It is difficult to tell how hard the problem of phonological reconstruction is in the end. Semi-automatic solutions are already feasible now, and we are currently testing them on different (smaller) groups of phylogenetically related languages. One crucial step for the future is to code up enough data to allow for a rigorous testing of the few automatic solutions that have been proposed so far. We are working on that as well. But how to propose an evaluation system that rigorously tests not only to what degree a given reconstruction is identical with a given gold standard, but also to what degree it is structurally equivalent, remains one of the crucial open problems in this regard.

Amigó, Enrique and Gonzalo, Julio and Artiles, Javier and Verdejo, Felisa (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12.4: 461-486.

Anderson, Cormac, Tresoldi, Tiago, Chacon, Thiago Costa, Fehn, Anne-Maria, Walworth, Mary, Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp. 1-27.

Blasi, Damián E., Moran, Steven, Moisik, Scott R., Widmer, Paul, Dediu, Dan and Bickel, Balthasar (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

Bouchard-Côté, Alexandre and Hall, David and Griffiths, Thomas L. and Klein, Dan (2013) Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11: 4224–4229.

Fox, Anthony (1995) Linguistic Reconstruction: An Introduction to Theory and Method. Oxford: Oxford University Press.

Hamming, Richard W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29.2: 147–160.

Hill, Nathan W. and List, Johann-Mattis (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.

Hoenigswald, Henry M. (1960) Phonetic similarity in internal reconstruction. Language 36.2: 191-192.

Hrozný, Bedřich (1915) Die Lösung des hethitischen Problems [The solution of the Hittite problem]. Mitteilungen der Deutschen Orient-Gesellschaft 56: 17–50.

Jäger, Gerhard (forthcoming) Computational historical linguistics. Theoretical Linguistics.

Kortlandt, Frederik (2007) For Bernard Comrie.

Levenshtein, V. I. (1965) Dvoičnye kody s ispravleniem vypadenij, vstavok i zameščenij simvolov [Binary codes with correction of deletions, insertions and replacements]. Doklady Akademij Nauk SSSR 163.4: 845-848.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 1.45: 137-161.

Lühr, Rosemarie (2008) Von Berthold Delbrück bis Ferdinand Sommer: Die Herausbildung der Indogermanistik in Jena. Vortrag im Rahmen einer Ringvorlesung zur Geschichte der Altertumswissenschaften (09.01.2008, FSU-Jena).

Mann, Noel Walter (1998) A Phonological Reconstruction of Proto Northern Burmic. The University of Texas: Arlington.

Meier-Brügger, Michael (2002) Indogermanische Sprachwissenschaft. Berlin and New York: de Gruyter.

Saussure, Ferdinand de (1879) Mémoire sur le Système Primitif des Voyelles dans les Langues Indo-Européennes. Leipzig: Teubner.
