
Monday, October 28, 2019

Typology of sound change (Open problems in computational diversity linguistics 9)


We are getting closer to the end of my list of open problems in computational diversity linguistics. After this post, there is only one left, for November, followed by an outlook and a wrap-up in December.

In last month's post, devoted to the Typology of semantic change, I discussed the general aspects of a typology in linguistics, or, to be more precise, how I think linguists use the term. One of the necessary conditions for a typology to be meaningful is that the phenomenon in question shows enough similarities across the languages of the world that patterns or tendencies can be identified regardless of the historical relations between the languages.

Sound change in this context refers to a peculiar phenomenon observed in the change of spoken languages, by which certain sounds in the inventory of a given language change their pronunciation over time. Such a change often affects all of the words in which these sounds recur, or only those instances of the sounds that occur in specific phonetic contexts.

As I have discussed this phenomenon in quite a few past blog posts, I will not go into further detail here, but simply state the specific task that this problem entails:
Assuming (if needed) a given time frame, in which the change occurs, establish a general typology that informs about the universal tendencies by which sounds occurring in specific phonetic environments are subject to change.
Note that my view of "phonetic environment" in this context includes an environment that captures all possible contexts. When confronted with a sound change that seems to affect a sound in the same way in all phonetic contexts in which it occurs, linguists often speak of "unconditioned sound change", since they cannot find any apparent condition for the change to happen. For a formal treatment, however, this is unsatisfying, since the lack of a specific phonetic environment is itself a condition of sound change.

Why it is hard to establish a typology of sound change

As is also true for semantic change, discussed as Problem 8 last month, there are three major reasons why it is hard to establish a typology of sound change. First, there is, again, the problem of acquiring the data needed to establish the typology. Second, it is not clear how to handle the data appropriately, so that sound change can be studied across different language families and different time periods. Third, it is very difficult to interpret sound change data when trying to identify cross-linguistic tendencies.

Problem 1

The problem of acquiring data about sound change processes in sufficient quantity is very similar to the problem of semantic change: most of what we know about sound change has been inferred by comparing languages, and we do not know how confident we can be with respect to those inferences. While semantic change is considered to be notoriously difficult to handle (Fox 1995: 111), scholars generally have more confidence in sound change and the power of linguistic reconstruction. The question remains, however, as to how confident we can really be, a question that divides the field into so-called "realists" and so-called "abstractionalists" (see Lass 2017 for a recent discussion of the debate).

As a typical representative of abstractionalism in linguistic reconstruction, consider the famous linguist Ferdinand de Saussure, who emphasized that the sound values which scholars reconstruct for proposed ancient words in unattested languages, such as Indo-European, could just as well be replaced by numbers or other characters serving as identifiers (Saussure 1916: 303). The fundamental idea here is that a reconstructed word in a given proto-language does not need to inform us about the likely pronunciation of that word, but rather about its structure in contrast to other words.

This aspect of historical linguistics is often difficult to discuss with colleagues from other disciplines, since it seems very peculiar, but it is very important for understanding the basic methodology. The general idea of structure versus substance is that, once we accept that the words in a language are built by drawing letters from an alphabet, the letters themselves have no substantial value, but only a value in contrast to other letters. This means that a sequence such as "ABBA" can be seen as structurally identical to "CDDC" or "OTTO". The similarity should be obvious: we have the same letter at the beginning and the end of each word, and the same letter repeated in the middle of each word (see List 2014: 58f for a closer discussion of this type of similarity).
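This notion of structural identity is easy to make concrete. The following sketch (a minimal illustration of my own, not taken from List 2014) maps each letter to the index of its first occurrence, so that two words are structurally identical exactly when their patterns of repetition coincide:

```python
def structure(word):
    """Map each letter to the index of its first occurrence, so that
    only the pattern of repetitions is retained."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)

# "ABBA", "CDDC", and "OTTO" all reduce to the same pattern (0, 1, 1, 0)
assert structure("ABBA") == structure("CDDC") == structure("OTTO")
```

Under this view, a reconstructed form is a claim about such patterns of contrast, not about concrete sound values.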

Since sequence similarity is usually not discussed in purely structural terms, the abstract view of correspondences, as it is maintained by many historical linguists, is often difficult to discuss across disciplines. The reason why linguists tend to maintain it is that languages change not only their words by mutating individual sounds; whole sound systems change, and new sounds can be gained or lost during language evolution (see my blog post from March 2018 for a closer elaboration of the problem of sound change).

It is important to emphasize, however, that despite prominent abstractionalists such as Ferdinand de Saussure (1857-1913), and in part also Antoine Meillet (1866-1936), the majority of linguists think more realistically about their reconstructions. The reason is that the composition of words from sounds in the spoken languages of the world usually follows specific rules, so-called phonotactic rules. These may vary to quite some degree among languages, but they are also restricted by natural limits on pronounceability. Thus, although languages may show impressively long chains of one consonant following another, there is a certain limit to the number of consonants that can follow each other without a vowel. Sound change is thus believed to originate roughly either in production (speakers want to pronounce things in a simpler, more convenient way) or in perception (listeners misunderstand words and store erroneous variants; see Ohala 1989 for details). Therefore, a reconstruction of a given sound system based on the comparison of multiple languages gains power from a realistic interpretation of sound values.

The problem with the abstractionalist-realist debate, however, is that linguists usually practice some kind of mixture of the two extremes. That means that they may reconstruct very concrete sound values for certain words, where they have very good evidence, but at the same time they may come up with abstract values that serve as placeholders in the absence of better evidence. The most famous examples are the Indo-European "laryngeals", whose existence is beyond doubt for most historical linguists, but whose sound values cannot be reconstructed with high reliability. As a result, linguists tend to spell them with subscript numbers as *h₁, *h₂, and *h₃. Any attempt to assemble data about sound change processes in the languages of the world needs to find a way to cope with the different degrees of evidence we find in linguistic analyses.

Problem 2

This leads us directly to our second problem, that of handling sound change data appropriately in order to study sound change processes. Given that many linguists propose changes in the typical A > B / C notation (A becomes B in context C), a possible first step towards a database of sound changes would be to type off these changes from the literature and compile them into a catalog. Apart from the interpretation of the data in abstractionalist-realist terms, however, such a way of collecting the data would have a couple of serious shortcomings.

First, it would mean that the analysis of the linguist who proposed the sound change is taken as final, although we often find debates about the specific triggers of a sound change, and it is not clear whether alternative sound change rules could apply just as well (see Problem 3, on the task of automatic sound law induction, for details). Second, as linguists tend to report only what changes, while disregarding what does not change, we would face the same problem as in the traditional study of semantic change: the database would suffer from a sampling bias, as we could not learn anything about the stability of sounds. Third, since sound change depends not only on production and perception, but also on the system of the language in which the sounds are produced, listing sound changes without examples from real words would most likely make it impossible to take these systemic aspects of sound change into account.
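To make the A > B / C notation concrete, here is a minimal sketch of how a single context-sensitive rule could be applied to a word; the rule format, the boundary symbol "#", and the convention that an empty context matches anything are my own illustrative choices, not an established standard:

```python
def apply_rule(word, source, target, left="", right=""):
    """Apply a sound change A > B in context C to a word given as a
    list of sounds; '#' marks the word boundary, and an empty context
    string matches any neighbor (the 'unconditioned' case)."""
    padded = ["#"] + list(word) + ["#"]
    out = []
    for i in range(1, len(padded) - 1):
        sound = padded[i]
        if (sound == source
                and left in ("", padded[i - 1])
                and right in ("", padded[i + 1])):
            out.append(target)
        else:
            out.append(sound)
    return out

# Hypothetical rule: p > f, but only word-initially (left context '#')
assert apply_rule(["p", "a", "p", "a"], "p", "f", left="#") == ["f", "a", "p", "a"]
```

Even this toy version makes clear why real words matter: without them, the regularity and the exceptions of a rule cannot be checked.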

Problem 3

This last point now leads us to the third general difficulty, the question of how to interpret sound change data, assuming that one has had the chance to acquire enough of it from a reasonably large sample of spoken languages. If we look at the general patterns of sound change observed for the languages of the world, we can distinguish two basic conditions of sound change, phonetic conditions and systemic conditions. Phonetic conditions can be further subdivided into articulatory (= production) and acoustic (= perception) conditions. When trying to explain why certain sound changes can be observed more frequently across different languages of the world, many linguists tend to invoke phonetic factors. If the sound p, for example, turns into an f, this is not necessarily surprising given the strong similarity of the sounds.

But similarity can be measured in two ways: one can compare the similarity with respect to the production of a sound by a speaker, or with respect to the perception of the sound by a listener. While the production of sounds is traditionally seen as the more important factor contributing to sound change (Hock 1991: 11), there are clear examples of sound change due to misperception and re-interpretation by listeners (Ohala 1989: 182). Some authors go as far as to claim that production-driven changes reflect regular language-internal change, which happens gradually during language acquisition or, depending on the theory, also in later stages (Bybee 2002), while perception-based changes rather reflect change happening in second-language acquisition and language contact (Mowrey and Pagliuca 1995: 48).

While the interaction of production and perception has been discussed in some detail in the linguistic literature, the influence of systemic factors has so far only rarely been considered. What I mean by this factor is the idea that certain changes in language evolution may be explained exclusively as resulting from systemic constellations. As a straightforward example, consider the difference in design space for the production of consonants, vowels, and tones. In order to maintain pronounceability and comprehensibility, it is useful for the sound system of a given language to fill in those spots in the design space that are maximally different from each other. The larger the design space and the smaller the inventory, the easier it is to guarantee its functionality. Since the design spaces for vowels and tones are much smaller than for consonants, however, these sub-systems are more easily disturbed, which could be used to explain the presence of chain shifts of vowels, or flip-flop in tone systems (Wang 1967: 102). Systemic considerations play an increasingly important role in evolutionary theory, and can, as shown in List et al. (2016), also be used as explanations for phenomena as strange as Sapir's drift (Sapir 1921).
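The design-space argument can be illustrated with a toy computation. In the sketch below, the vowel coordinates are purely schematic (not measured formant values); the point is only that a smaller inventory in the same space keeps a larger minimal distance between its members, and is therefore harder to disturb:

```python
import itertools
import math

# Schematic positions of five vowels in a two-dimensional (F1/F2-like)
# design space; the coordinates are illustrative, not measured values.
space = {"i": (0, 0), "e": (1, 0.5), "a": (2, 2), "o": (1, 3.5), "u": (0, 4)}

def min_pairwise_distance(sounds):
    """The smaller this value, the more crowded -- and hence the more
    easily disturbed -- the sub-system."""
    return min(math.dist(space[a], space[b])
               for a, b in itertools.combinations(sounds, 2))

# A three-vowel system leaves more room than a five-vowel one
assert min_pairwise_distance("iau") > min_pairwise_distance("ieaou")
```

The same reasoning, applied to the much larger consonant space, suggests why consonant systems tolerate crowding better than vowel or tone systems.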

However, the crucial question, when trying to establish a typology of sound change, is how these different effects could be measured. I think it is obvious that collections of individual sound changes proposed in the literature are not enough. But what data would be sufficient or needed to address the problem is not entirely clear to me either.

Traditional approaches

As the first traditional approach to the typology of sound change, one should mention the intuition inside the heads of the numerous historical linguists who study particular language families. Scholars trained in historical linguistics usually develop some kind of intuition about likely and unlikely tendencies in sound change, and for the most part they also agree on it. The problem with this intuition, however, is that it is not explicit, and it even seems that it was never the intention of the majority of historical linguists to make their knowledge explicit. The reasons for this reluctance with respect to formalization and transparency are two-fold. First, given that every individual has invested quite some time in order to grow their intuition, it is possible that the idea of a resource that distributes this intuition in a rigorously data-driven and explicit manner yields the typical feeling of envy in quite a few people, who may then think: «I had to invest so much time in order to learn all this by heart. Why should young scholars now get all this knowledge for free?» Second, given the problems outlined in the previous section, many scholars also strongly believe that it is impossible to formalize the problem of sound change tendencies.

The by far largest traditional study of the typology of sound change is Kümmel's (2008) book Konsonantenwandel (Consonant Change), in which the author surveys sound change processes discussed in the literature on Indo-European and Semitic languages. As the title of the book suggests, it concentrates on the change of consonants, which are (probably due to their larger design space) also the class of sounds that shows stronger cross-linguistic tendencies. The book is based on a thorough inspection of the literature on consonant change in Indo-European and Semitic linguistics. The procedure by which this collection was carried out can be seen as the gold standard by which any future attempt to enlarge the collection should be carried out.

What is specifically important, and also very difficult to achieve, is the harmonization of the evidence, which is nicely reflected in Kümmel's introduction, where he mentions that one of the main problems was to determine what the scholars actually meant with respect to phonetics and phonology, when describing certain sound changes (Kümmel 2008: 35). The major drawback of the collection is that it is not (yet) available in digital form. Given the systematicity with which the data was collected, it should be generally possible to turn the collection into a database; and it is beyond doubt that this collection could offer interesting insights into certain tendencies of sound change.

Another collection of sound changes taken from the literature is the mysterious Index Diachronica, a collection of sound changes from various language families compiled by a person who wishes to remain anonymous. The collection even has a Searchable Index that allows scholars to click on a given sound and to see in which languages this sound is involved in some kind of sound change. What is a pity about the resource is that it is difficult to use, given that one does not really know where it actually comes from, and how the information was extracted from the sources. If the anonymous author decided to put it (albeit anonymously, or under a pseudonym) on a public preprint server, such as Humanities Commons, this would be excellent, as it would offer those who are interested in collecting sound changes from the literature an excellent starting point to check the sources, and to digitize the resource further.

Right now, this resource seems to be mostly used by conlangers, i.e., people who create artificial languages as a hobby (or profession). Conlangers are often refreshingly pragmatic, and may come up with very interesting and creative ideas about how to address certain data problems in linguistics that "normal" linguists would refuse to tackle. There is a certain tendency in our field to ignore certain questions, either because scholars think it would be too tedious to collect the data needed to address them, or because they consider it impossible to do so "correctly" from the start.

As a last and fascinating example, I have to mention the study by Yang and Xu (2019), in which the authors review studies of concrete examples of tone change in South-East Asian languages, trying to identify cross-linguistic tendencies. Before I read this study, I was not aware that tone change had been studied concretely at all, since most linguists consider the evidence for any kind of tendency far too shaky, and reconstruct tone exclusively as an abstract entity. The survey by Yang and Xu, however, clearly shows that there seem to be at least some tendencies, and that they can be identified by invoking a careful degree of abstraction when comparing tone change across different languages.

For the reasons outlined above, I do not think that a collection of sound change examples from the literature fully addresses the problem of establishing a typology of sound change. Specifically, such collections usually neither provide tangible examples or frequencies of a given sound change within the language where it occurred, nor capture the tendencies of sounds to resist change; this is a major drawback, and a major loss of evidence during data collection. Nevertheless, I consider these efforts valuable and important contributions to our field. Given that they allow us to learn a lot about some very general and well-confirmed tendencies of sound change, they are also an invaluable source of inspiration when it comes to working on alternative approaches.

Computational approaches

To my knowledge, there are no real computational approaches to the study of sound change so far. What one should mention, however, are initial attempts to measure certain aspects of sound change automatically. Thus, Brown et al. (2013) measure sound correspondences across the world's languages, based on a collection of 40-item wordlists for a very large sample of languages. The limitations of this study lie in the restricted alphabet being used: all languages are represented by a reduced transcription system of some 40 letters, called the ASJP code. While the code originally allowed for representing more than just 40 sounds, since the graphemes can be combined, the collection was carried out inconsistently for different languages, which has led to the situation that the majority of computational approaches treat each letter as a single sound, or consider only the first element of complex grapheme combinations.

While sound change is a directional process, sound correspondences reflect the correspondence of sounds in different languages as a result of sound change, and it is not trivial to extract directional information from sound correspondence data alone. Thus, while the study of Brown et al. is a very interesting contribution, also providing a very straightforward methodology, it does not address the actual problem of sound change.

The study also has other limitations. First, the approach only measures those cases where sounds differ between two languages, and thus we again cannot tell how likely it is that two identical sounds correspond. Second, the study ignores phonetic environment (or context), which is an important factor in sound change tendencies (some sound changes, for example, tend to occur only at word endings, etc.). Third, the study considers only sound correspondences across language pairs, while it is clear that one can often find stronger evidence for sound correspondences when looking at multiple languages (List 2019).
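The first two limitations are straightforward to avoid once one counts correspondences directly from aligned cognate sets. The following sketch (with hypothetical toy data, schematic transcriptions, and a deliberately crude notion of context) keeps identical matches, so that the stability of sounds remains measurable; real alignments would of course also need gap handling:

```python
from collections import Counter

def correspondences(alignments):
    """Count sound correspondences from pairwise alignments, keeping
    identical matches (to estimate stability) and recording the
    following sound as a crude phonetic context."""
    counts = Counter()
    for seq_a, seq_b in alignments:
        for i, (a, b) in enumerate(zip(seq_a, seq_b)):
            context = seq_a[i + 1] if i + 1 < len(seq_a) else "#"
            counts[(a, b, context)] += 1
    return counts

# Toy aligned cognates (schematic German : English transcriptions)
aligned = [(list("hant"), list("hand")),
           (list("hunt"), list("hund"))]
counts = correspondences(aligned)
assert counts[("t", "d", "#")] == 2   # t : d word-finally, twice
assert counts[("h", "h", "a")] == 1   # identical sounds are kept, too
```

Extending such counts from language pairs to full correspondence patterns across many languages is exactly what makes the multilateral approach (List 2019) more powerful.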

Initial ideas for improvement

What we need in order to address the problem of establishing a true typology of sound change processes, are, in my opinion:
  1. a standardized transcription system for the representation of sounds across linguistic resources,
  2. increased amounts of readily coded data that adhere to the standard transcription system and list cognate sets of ancestral and descendant languages,
  3. good, dated phylogenies that allow us to measure how often sound changes appear in a certain time frame,
  4. methods to infer the sound change rules (Problem 3), and
  5. improved methods for ancestral state reconstruction that would allow us to identify sound change processes not only for the root and the descendant nodes, but also for intermediate stages.
It is possible that even these five points are not yet enough, as I am still trying to think about how one should best address the problem. But what I can say for sure is that one needs to address the problem step by step, starting with the issue of standardization, and that the only way to account for the problems mentioned above is to collect the pure empirical evidence on sound change, not the summarized results discussed in the literature. Thus, instead of reading in some source that in German the t became a ts at some point, I want to see a dataset that provides this change in the form of concrete examples, in samples large enough to show the regularity of the findings, and that ideally also lists the exceptions.
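To illustrate what such a dataset would buy us, here is a toy example for the German shift of t to ts (spelled z); the forms are simplified and purely illustrative, but they show how regularity and exceptions become countable once concrete word pairs are available:

```python
# Toy dataset (illustrative forms): pre-shift forms next to modern
# German reflexes, to measure how regularly initial t became ts (z-).
pairs = [("tunga", "zunge"),   # 'tongue'
         ("tehan", "zehn"),    # 'ten'
         ("tid",   "zeit"),    # 'time'
         ("tanz",  "tanz")]    # a later loanword: an exception

applied = sum(old.startswith("t") and new.startswith("z")
              for old, new in pairs)
exceptions = sum(old.startswith("t") and not new.startswith("z")
                 for old, new in pairs)
print(applied, exceptions)   # 3 1
```

With data in this form, the frequency, the regularity, and the conditioning context of a change can all be measured rather than merely asserted.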

The advantage of this procedure is that the collection is independent of the typical errors that usually occur when data are collected from the literature (often by employing armies of students who do the "dirty" work for the scientists). It would also be independent of individual scholars' interpretations. Furthermore, it would be exhaustive: one could measure not only the frequency of a given change, but also its regularity, its conditioning context, and its systemic properties.

The disadvantage is, of course, the need to acquire standardized data in a large-enough size for a critical number of languages and language families. But, then again, if there were no challenges involved in this endeavor, I would not present it as an open problem of computational diversity linguistics.

Outlook

With the newly published database of Cross-Linguistic Transcription Systems (CLTS, Anderson et al. 2018), the first step towards a rigorous standardization of transcription systems has already been made. With our efforts towards a standardization of wordlists, which can also be applied in the form of a retro-standardization to existing data (Forkel et al. 2018), we have proposed a further step towards collecting lexical data efficiently for a large sample of the world's spoken languages (see also List et al. 2018). Work on automated cognate detection and workflows for computer-assisted language comparison has also drastically increased the efficiency of historical language comparison.

So, we are advancing towards a larger collection of high-quality, historically compared datasets; and it is quite possible that, a couple of years from now, we will arrive at a point where the typology of sound change is no longer a dream of mine and of many colleagues, but something that can actually be extracted from historically annotated cross-linguistic data. Until then, many issues remain unsolved; and in order to address these, it would be useful to work towards pilot studies, to see how well the ideas for improvement outlined above can actually be implemented.

References

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

Brown, Cecil H. and Holman, Eric W. and Wichmann, Søren (2013) Sound correspondences in the world's languages. Language 89.1: 4-29.

Bybee, Joan L. (2002) Word frequency and context of use in the lexical diffusion of phonetically conditioned sound change. Language Variation and Change 14: 261-290.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Fox, Anthony (1995) Linguistic Reconstruction. An Introduction to Theory and Method. Oxford: Oxford University Press.

Hock, Hans Henrich (1991) Principles of Historical Linguistics. Berlin: Mouton de Gruyter.

Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

Lass, Roger (2017) Reality in a soft science: the metaphonology of historical reconstruction. Papers in Historical Phonology 2.1: 152-163.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Mowrey, Richard and Pagliuca, William (1995) The reductive character of articulatory evolution. Rivista di Linguistica 7: 37–124.

Ohala, John J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

Sapir, Edward (1921[1953]) Language. An Introduction to the Study of Speech.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne: Payot.

Wang, William S-Y. (1967) Phonological features of tone. International Journal of American Linguistics 33.2: 93-105.

Yang, Cathryn and Xu, Yi (2019) A review of tone change studies in East and Southeast Asia. Diachronica 36.3: 417-459.

Monday, July 29, 2019

Simulation of sound change (Open problems in computational diversity linguistics 6)


The sixth problem in my list of open problems in computational diversity linguistics is devoted to the simulation of sound change. When formulating the problem, one realizes that it is not immediately clear what is actually meant, as there are two possibilities for a concrete simulation: (i) one could think of the sound system of a given language and then model how, through time, the sounds change into other sounds; or (ii) one could think of a bunch of words in the lexicon of a given language, and then simulate how these words change through time, based on different kinds of sound change rules. I have the latter scenario in mind.

Why simulating sound change is hard

The problem of simulating sound change is hard for a number of reasons. First of all, it is similar to the problem of sound law induction, since we have to find a simple and straightforward way to handle phonetic context (remember that sound change may often apply only to sounds that occur in a certain environment of other sounds). This is already difficult enough, but it could be handled with the help of what I have called multi-tiered sequence representations (List and Chacon 2015). However, there are four further problems that one would need to overcome (or at least be aware of) when trying to simulate sound change successfully.
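The idea behind multi-tiered sequence representations can be sketched in a few lines; real models (List and Chacon 2015) use many more tiers, so this is only a minimal illustration with tier names of my own choosing:

```python
def tiers(word):
    """Represent a word as aligned tiers: the sounds themselves plus
    the preceding and following sound on separate tiers, so that a
    context-sensitive rule only needs to inspect one position."""
    sounds = list(word)
    return {
        "sound":     sounds,
        "preceding": ["#"] + sounds[:-1],
        "following": sounds[1:] + ["#"],
    }

t = tiers("hant")
# The final t 'sees' its left context directly on the preceding tier
assert t["sound"][3] == "t" and t["preceding"][3] == "n"
```

Because the context is precomputed into tiers, a sound law becomes a simple lookup over one index rather than a scan over the neighborhood.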

The first of these extra problems is that of morphological change and analogy, which usually go along with "normal" sound change, following what Anttila (1976) calls Sturtevant's paradox: regular sound change produces irregularity in language systems, while irregular analogy produces regularity. In historical linguistics, analogy serves as a cover term for various processes in which words or word parts are rendered more similar to other words than they had been before. Classical examples are children's "regular" plurals of nouns like mouse (e.g., mouses instead of mice) or "regular" past-tense forms of verbs like catch (e.g., catched instead of caught). In all these cases, perceived irregularities in the grammatical system, which often go back to ancient sound change processes, are regularized on an ad-hoc basis.

One could (maybe one should), of course, start with a model that deliberately ignores processes of morphological change and analogical leveling, when drafting a first system for sound change simulation. However, one needs to be aware that it is difficult to separate morphological change from sound change, as our methods for inference require that we identify both of them properly.

The second extra problem is the question of the mechanism of sound change, where competing theories exist. Some scholars emphasize that sound change is entirely regular, spreading over the whole lexicon at once (changing, as it were, one key on the typewriter), while others claim that sound change may slowly spread from word to word and at times not reach all words in a given lexicon. If one wants to profit from simulation studies, one would ideally allow for testing both accounts; but it seems difficult to model the idea of lexical diffusion (Wang 1969), given that it should depend on external parameters, like the frequency of word use, which are also not very well understood.
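The two mechanisms are easy to contrast in a toy simulation; the Latin-like word forms and the diffusion probability p below are purely illustrative:

```python
import random

def regular_change(lexicon, source, target):
    """Neogrammarian change: replace the sound in every word at once."""
    return [w.replace(source, target) for w in lexicon]

def diffused_change(lexicon, source, target, p=0.5, rng=random):
    """Lexical diffusion: the change reaches each word only with
    probability p per round, so some words may lag behind."""
    return [w.replace(source, target) if rng.random() < p else w
            for w in lexicon]

words = ["pater", "piskis", "pellis"]
assert regular_change(words, "p", "f") == ["fater", "fiskis", "fellis"]
# diffused_change may leave some words unchanged in any given round
```

In a realistic model, p would itself have to depend on external parameters such as word frequency, which is exactly the part that is poorly understood.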

The third extra problem is that of the actual tendencies of sound change, which are also by no means well understood by linguists. Initial work on sound change has been carried out (Kümmel 2008). However, the major work of finding a way to compare the major tendencies of sound change processes across a large sample of the world's languages (i.e., the typology of sound change, which I plan to discuss separately in a later post) has not been carried out so far. The reason why we are missing this typology is that we lack clear-cut machine-readable accounts of annotated, aligned data, in which scholars would provide their proto-forms for the reconstructed languages, along with their proposed sound laws, in a system that could in fact be tested and run (which would also allow us to estimate the exceptions, or where those systems fail).

But an account of the tendencies of sound change opens up a fourth important problem, apart from the lack of data from which we could draw a first typology of sound change processes: sound change tendencies are initiated not only by the general properties of speech sounds, but also by the linguistic systems in which these speech sounds are employed. While scholars occasionally mention this, there have been no real attempts to separate the two aspects in a concrete reconstruction of a particular language. A typology of sound change tendencies could thus not simply stop at listing tendencies resulting from the properties of speech sounds, but would also have to find a way to model diverging tendencies that result from systemic constraints.

Traditional insights into the process of sound change

When discussing sound change, we need to distinguish mechanisms, types, and patterns. Mechanisms refer to how the process "proceeds", the types refer to the concrete manifestations of the process (like a certain, concrete change), and patterns reflect the systematic perspective of changes (i.e. their impact on the sound system of a given language, see List 2014).

Figure 1: Lexical diffusion

The question regarding the mechanism is important, since it refers to the dispute over whether sound change happens simultaneously across the whole lexicon of a given language — that is, whether it reflects a change in the inventory of sounds — or whether it jumps from word to word, as the defenders of lexical diffusion propose, whom I mentioned above (see also Chen 1972). While probably nobody would nowadays deny that sound change can proceed as a regular process (Labov 1981), it is less clear to what degree the idea of lexical diffusion can be confirmed. Technically, the theory is dangerous, since it allows a high degree of freedom in the analysis, which can have a deleterious impact on the inference of cognates (Hill 2016). But this does not mean, of course, that the process itself does not exist. In the two figures, I have tried to contrast the different perspectives on the phenomenon.

Figure 2: Regular sound change

To gain a deeper understanding of the mechanisms of sound change, it seems indispensable to work more on models that try to explain how it is actuated. While most linguists agree that synchronic variation in our daily speech is what enables sound change in the first place, it is not entirely clear how certain new variants become fixed in a speech community. Interesting theories in this context have been put forward by Ohala (1989), who proposes distinct scenarios in which sound change can be initiated either by the speaker or by the listener, which would in theory also yield predictable tendencies with respect to the typology of sound change.

The insights into the types and patterns of sound change are, as mentioned above, much more rudimentary, although one can say that most historical linguists have a rather good intuition with respect to what is possible and what is less likely to happen.

Computational approaches

We can find quite a few published papers devoted to the simulation of certain aspects of sound change, but so far we do not (at least to my current knowledge) find any comprehensive account that would try to feed some 1,000 words to a computer and see how this "language" develops — which sound laws can be observed to occur, and how they change the shape of the given language. What we find, instead, are a couple of very interesting accounts that try to deal with certain aspects of sound change.

Winter and Wedel (2016), for example, test agent-based exemplar models in order to see how systems maintain contrast despite variation in realization (Hamann 2014: 259f gives a short overview of other recent articles). Au (2008) presents simulation studies that aim to test to what degree lexical diffusion and "regular" sound change interact in language evolution. Dediu and Moisik (2019) investigate, with the help of different models, to what degree the vocal tract anatomy of speakers may have an impact on the actuation of sound change. Stevens et al. (2019) present an agent-based simulation to investigate the change of /s/ to /ʃ/ in English.

This summary of literature is very eclectic, especially because I have only just started to read more about the different proposals out there. What is important for the problem of sound change simulation is that, to my knowledge, there is no approach yet ready to run the full simulation of a given lexicon for a given language, as stated above. Instead, the studies reported so far have a much more fine-grained focus, specifically concentrating on the dynamics of speaker interaction.

Initial ideas for improvement

I do not have concrete ideas for improvement, since the problem's solution depends on quite a few other problems that would need to be solved first. But to address the idea of simulating sound change, albeit only in a very simplified form, I think it will be important to work harder on making our inferences transparent, that is, on making explicit what is so far only implicitly stored in the heads of the many historical linguists, in the form of what they call their intuition.

During the past 200 years, after linguists started to apply to other language families the mysterious comparative method that they had used so successfully to reconstruct Indo-European, the amount of data and the number of reconstructions for the world's languages have increased drastically. Many different language families have now been studied intensively, and the results have been presented in etymological dictionaries, in numerous books and articles on particular questions, and at times even in databases.

Unfortunately, however, we rarely find attempts by scholars to actually provide their findings in a form that would allow us to check the correctness of their predictions automatically. I am thinking in very simple terms here — a scholar who proposes a reconstruction for a given language family should deliver not only the proto-forms with their reflexes in the daughter languages, but also a detailed account of the sound laws by which the proto-forms change into the daughter languages, in a form that allows us to test whether they actually produce the reflexes.

While it is clear that this could not easily have been implemented in the past, it is in fact possible now, as we can see from a couple of studies in which scholars have tried to compute sound change (Hartmann 2003, Pyysalo 2017; see also Sims-Williams 2018 for an overview of more literature). Although these attempts are unsatisfying, given that they do not account for the cross-linguistic comparability of data (e.g. they use orthographies rather than unified transcriptions, as proposed by Anderson et al. 2018), they illustrate that it should in principle be possible to use transducers and similar technologies to formally check how well the data can be explained under a certain set of assumptions.
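As a toy illustration of what such a testable account might look like, one could pair proto-forms with their attested reflexes, state the proposed sound laws as an ordered list of rewrite rules, and let a program report the exceptions. Everything below — the forms, the laws, and their ordering — is invented purely for illustration.

```python
import re

# Hypothetical machine-readable account: invented proto-forms, their
# attested reflexes, and proposed sound laws as ordered regex rules.
proto_to_reflex = {"*pata": "pade", "*kata": "kade", "*paka": "page"}
sound_laws = [
    (r"(?<=[aeiou])t(?=[aeiou])", "d"),  # intervocalic voicing of t
    (r"(?<=[aeiou])k(?=[aeiou])", "g"),  # intervocalic voicing of k
    (r"a$", "e"),                        # word-final raising of a
]

def apply_laws(proto):
    """Apply the sound laws, in order, to a proto-form."""
    form = proto.lstrip("*")
    for pattern, replacement in sound_laws:
        form = re.sub(pattern, replacement, form)
    return form

# The "exceptions" are exactly the reflexes the laws fail to predict.
exceptions = {p: (apply_laws(p), r)
              for p, r in proto_to_reflex.items()
              if apply_laws(p) != r}
```

In this toy case the law set predicts all three reflexes, so `exceptions` comes out empty; with real reconstructions, a non-empty dictionary would point directly at the places where the proposed system fails.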

Without cross-linguistic accounts of the diversity of sound change processes (i.e. a first solution to the problem of establishing a first typology of sound change), attempts to simulate sound change will remain difficult. The only way to address this problem is to require a more rigorous coding of data (both human- and machine-readable), and an increased openness of scholars who work on the reconstruction of interesting language families, to help make their data cross-linguistically comparable.

Sign languages

When drafting this post, I promised Guido and Justin to grasp the opportunity, when talking about sound change, to say a few words about its peculiarities in contrast to other types of language change. The idea was that this would help us to contribute to the mini-series on sign languages that Guido and Justin initiated this month (see posts number one, two, and three).

I do not think that I have completely succeeded in doing so, as what I have discussed today with respect to sound change does not really point out what makes it peculiar (if it is). But to provide a brief attempt, before I finish this post, I think that it is important to emphasize that the whole debate about regularity of sound change is, in fact, not necessarily about regularity per se, but rather about the question of where the change occurs. As the words in spoken languages are composed of a fixed number of sounds, any change to this system will have an impact on the language as a whole. Synchronic variation of the pronunciation of these sounds offers the possibility of change (for example during language acquisition); and once the pronunciation shifts in this way, all words that are affected will shift along, similar to a typewriter in which you change a key.

As far as I understand, it is for the time being not clear whether a counterpart of this process exists in sign language evolution; but if one wanted to search for such a process, one should, in my opinion, do so by investigating to what degree signs can be considered as being composed of something similar to the phonemes of spoken languages. The existence of phonemes as minimal meaning-discriminating units in all human languages, spoken and signed alike, is in my view far from proven. But if it should turn out that signed languages also recruit their meaning-discriminating units from a limited pool of possibilities, there might be a chance of uncovering phenomena similar to regular sound change.

References
Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

Anttila, Raimo (1976) The acceptance of sound change by linguistic structure. In: Fisiak, Jacek (ed.) Recent Developments in Historical Phonology. The Hague, Paris, New York: de Gruyter, pp. 43-56.

Au, Ching-Pong (2008) Acquisition and Evolution of Phonological Systems. Academia Sinica: Taipei.

Chen, Matthew (1972) The time dimension. Contribution toward a theory of sound change. Foundations of Language 8.4. 457-498.

Dediu, Dan and Moisik, Scott (2019) Pushes and pulls from below: Anatomical variation, articulation and sound change. Glossa 4.1: 1-33.

Hamann, Silke (2014) Phonological changes. In: Bowern, Claire (ed.) Routledge Handbook of Historical Linguistics. Routledge, pp. 249-263.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp. 606-609.

Hill, Nathan (2016) A refutation of Song’s (2014) explanation of the ‘stop coda problem’ in Old Chinese. International Journal of Chinese Linguistics 2.2: 270-281.

Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

Labov, William (1981) Resolving the Neogrammarian Controversy. Language 57.2: 267-308.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper presented at the workshop "Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE]" (2015/09/04, Leiden, Societas Linguistica Europaea).

Ohala, J. J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

Pyysalo, Jouna (2017) Proto-Indo-European Lexicon: The generative etymological dictionary of Indo-European languages. In: Proceedings of the 21st Nordic Conference of Computational Linguistics, pp. 259-262.

Sims-Williams, Patrick (2018) Mechanising historical phonology. Transactions of the Philological Society 116.3: 555-573.

Stevens, Mary and Harrington, Jonathan and Schiel, Florian (2019) Associating the origin and spread of sound change using agent-based modelling applied to /s/-retraction in English. Glossa 4.1: 1-30.

Wang, William Shi-Yuan (1969) Competing changes as a cause of residue. Language 45.1: 9-25.

Winter, Bodo and Wedel, Andrew (2016) The co-evolution of speech and the lexicon: Interaction of functional pressures, redundancy, and category variation. Topics in Cognitive Science 8:  503-513.

Monday, April 29, 2019

Automatic sound law induction (Open problems in computational diversity linguistics 3)


The third problem in my list of ten open problems in computational diversity linguistics is a problem that has (to my knowledge) not yet been considered a true problem in computational historical linguistics; until now, it has been discussed by colleagues only indirectly. This problem, which I call the automatic induction of sound laws, can be described as follows:
Starting from a list of words in a proto-language and their reflexes in a descendant language, try to find the rules by which the ancestral language is converted into the descendant language.
Note that by "rules", in this context, I mean the classical notation that phonologists and historical linguists use in order to convert a source sound into a target sound in a specific environment (see Hall 2000: 73-75). If we consider the following ancestral and descendant words from a fictive language, we can easily find the laws by which the input should be converted into the output — namely, an a should be changed to an e, an e should be changed to an i, and a k changes to s if followed by an i, but not if followed by an a.

Input Output
papa pepe
mama meme
kaka keke
keke sisi
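A minimal sketch (in Python, with regular expressions standing in for sound laws) shows that three ordered rules indeed reproduce the table; the ordering matters, since old e must become i before old a becomes e.

```python
import re

# Ordered rules inferred from the table: e > i, a > e, k > s / _i.
# Applying e > i first ensures that kaka > keke does not feed k > s,
# while original keke (first turned into kiki) does.
rules = [(r"e", "i"), (r"a", "e"), (r"k(?=i)", "s")]

def derive(word):
    """Apply the ordered rules to one input word."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word
```

Run on the four input words, this yields pepe, meme, keke, and sisi, exactly as in the table; the induction problem is the reverse task of recovering the rules (and their order) from the word pairs alone.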

Short excursus on linguistic notation of sound laws

Based on the general idea of sound change (or sound laws in classical historical linguistics) as some kind of function that takes a source sound as input and turns it into a target sound as output, linguists use a specific notation system for sound laws. In its simplest form, the classical sound law notation describes this process as s > t, where s is the source sound and t is the target sound. Since sound change often relies on the specific conditions of the surrounding context — i.e. it makes a difference whether a sound occurs at the beginning or the end of a word — context is added as a condition separated by a /, with an underscore _ referring to the sound in its original phonetic environment. Thus, the phenomenon of voiced stops becoming unvoiced at the end of words in German (e.g. d becoming t) can be written as d > t / _$, where $ denotes the end of a word.
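The closeness to regular expressions, discussed next, can be made concrete even for this single rule: the context _$ of the devoicing law d > t / _$ maps directly onto the end-of-string anchor of a regex substitution (a sketch over toy phonemic strings, not real orthography).

```python
import re

# d > t / _$ — the sound-law context "_$" becomes the regex anchor "$".
def final_devoicing(word):
    return re.sub(r"d$", "t", word)
```

Applied to a phonemic string like "hund", this returns "hunt", while a d elsewhere in the word is left untouched.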

One can see how close this notation comes to regular expressions, and according to many scholars the rules by which languages change with respect to their sound systems do not exceed the complexity of regular grammars. Nevertheless, sound change notation differs in scope and in its annotation conventions. One notable difference is the possibility of expressing how full classes of sounds change in a specific environment. The German rule of devoicing, for example, affects all voiced stops at the end of a word. As a result, one could also annotate it as G > K / _$, where G denotes the sounds [b, d, g] and K their counterparts [p, t, k]. Although we could easily write a single rule for each of the three phenomena here, the rule by which the sounds are grouped into two classes of voiced sounds and their unvoiced counterparts is linguistically more interesting, since it reminds us that the change by which word-final consonants lose the feature of voice is a systemic change, and not a phenomenon applying to some random selection of sounds in a given language.
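The class-based rule G > K / _$ can likewise be expressed as one substitution, keeping the systemic grouping of [b, d, g] and [p, t, k] explicit instead of writing three unrelated segment-level rules (again a sketch over toy phonemic strings).

```python
import re

# G > K / _$ with G = [b, d, g] and K = [p, t, k]: one rule over a
# sound class, rather than three independent single-segment rules.
G_TO_K = {"b": "p", "d": "t", "g": "k"}

def devoice_final(word):
    # the class G is the character class [bdg]; the mapping to K is
    # looked up per matched segment
    return re.sub(r"[bdg]$", lambda m: G_TO_K[m.group(0)], word)
```

The dictionary here is exactly the "grouping of sounds into classes" mentioned below: two rule systems can share the same substitutions while differing in how they carve up the inventory.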

The problem of this systemic annotation, however, is that the grouping of sounds into classes that change in a similar form is often language-specific. As a result, scholars have to propose new groupings whenever they deal with another language. Since neither the notation of sound values nor the symbols used to group sounds into classes are standardized, it is extremely difficult to compare different proposals made in the literature. As a result, any attempt to solve the problem of automatic sound law induction in historical linguistics would at the same time have to make strict proposals for a standardization of sound law notations used in our field. Standardization can thus be seen as one of the first major obstacles of solving this problem, with the problem of accounting for systemic aspects of sound change as the second one.

Beyond regular expressions

Even if we put the problem of inconsistent annotation and systemic changes to one side, the analogy with regular expressions cannot properly handle all aspects of sound change. When looking at the change from Middle Chinese to Mandarin Chinese, for example, we find a complex pattern, by which originally voiced sounds, like [b, d, g, dz] (among others), were either devoiced, becoming [p, t, k, ts], or devoiced and aspirated, becoming [pʰ, tʰ, kʰ, tsʰ]. While it is not uncommon that one sound can change into two variants, depending on the context in which it occurs, the Mandarin sound change in this case is interesting because the context is not a neighboring sound, but is instead the Middle Chinese tone for the syllable in question — syllables with a flat tone (called píng tone in classical terminology) are nowadays voiceless and aspirated, and syllables with one of the three remaining Middle Chinese tones (called shǎng, , and ) are nowadays plain voiceless (see List 2019: 157 for examples).

Since tone is a feature that applies to whole syllables, and not to single sound segments, we are dealing here with so-called supra-segmental features. As the term supra-segmental indicates, the features in question cannot be represented as part of the sequence of sounds, but need to be thought of as an additional layer, similar to other supra-segmental features of language, including stress or juncture (which indicates word or morpheme boundaries).

In contrast to the sequences we meet in mathematics and informatics, linguistic sound sequences do not consist solely of letters drawn from an alphabet and lined up in some unique order. They are instead often composed of multiple layers, which are in part hierarchically ordered. Words, morphemes, and phrases in linguistics are thus multi-layered constructs, which cannot be represented by one sequence alone, but can more fruitfully be thought of as analogous to a partitura in music — the score of a piece of orchestral music, in which every voice of the orchestra is given its own sequence of sounds, and all the different sequences are aligned with each other to form a whole.

The multi-layered character of sound sequences can be seen as similar to a partitura in musical notation.

This multi-layered character of sound sequences in spoken languages constitutes a third complication for the task of automatic sound law induction. Finding the individual laws that turn one stage of a language into a later stage cannot (always) be trivially reduced to the task of finding the finite state transducer that translates a set of input strings into a corresponding set of output strings. Since our input word forms in the proto-language are not simple strings, but rather alignments of the different layers of a word form, a method to induce sound laws needs to be able to handle the multi-layered character of linguistic sequences.
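To make the supra-segmental condition concrete, the Mandarin development sketched above can only be modeled if the tone of the whole syllable is available as a second layer next to the segments. Below is a minimal sketch, with deliberately simplified transcriptions and tone labels (these names are mine, purely for illustration).

```python
# The fate of a Middle Chinese voiced initial depends on the tone of
# the whole syllable, not on a neighboring segment: píng-tone syllables
# aspirate, the other tones (shǎng, qù, rù) yield plain voiceless stops.
DEVOICE = {"b": "p", "d": "t", "g": "k", "dz": "ts"}
ASPIRATE = {"b": "pʰ", "d": "tʰ", "g": "kʰ", "dz": "tsʰ"}

def mandarin_initial(initial, tone):
    if initial not in DEVOICE:
        return initial  # the change only affects the voiced obstruents
    return ASPIRATE[initial] if tone == "ping" else DEVOICE[initial]
```

A segment-only transducer cannot express this rule, because the conditioning factor (the tone) never appears as a neighbor in the segment string; it lives on its own tier.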

Background for computational approaches to sound law induction

To my knowledge, the question of how to induce sound laws from data on proto- and descendant languages has barely been addressed. What comes closest to the problem are attempts to model sound change from known ancestral languages, such as Latin, to daughter languages, such as Spanish. This is reflected, for example, in the PHONO program (Hartmann 2003), where one can insert data for a proto-language along with a set of sound change rules (provided in a similar form to that mentioned above), which need to be given in a specific order, and are then checked to see whether they correctly predict the descendant forms.

For teaching purposes, I adapted a JavaScript version of a similar system, called the Sound Change Applier² (http://www.zompist.com/sca2.html) by Mark Rosenfelder from 2012, in which students could try to turn Old High German into modern German, by assigning simple rules as they are traditionally used to describe sound change processes in the linguistic literature. This adaptation (which can be found at http://dighl.github.io/sound_change/SoundChanger.html) compares the attested output with the output generated by a given set of rules, and provides some assessment of the general accuracy of the proposed set of rules. For example, when feeding the system the simple rule an > en /_#, which turns all final instances of -an into -en, 54 out of 517 Old High German words will yield the expected output in modern Standard German.
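The evaluation step of such a tool is easy to sketch: apply a candidate rule set to the source forms and count how many attested target forms come out right. The Old High German / Standard German word pairs below are merely illustrative, not taken from the actual data set of the adaptation.

```python
import re

# Score a candidate rule set against attested source/target pairs:
# apply the ordered rules to each source form and count exact matches.
def accuracy(rules, pairs):
    hits = 0
    for source, target in pairs:
        predicted = source
        for pattern, replacement in rules:
            predicted = re.sub(pattern, replacement, predicted)
        hits += predicted == target
    return hits / len(pairs)

# The rule an > en / _# from the text, written as a regex rule.
rules = [(r"an$", "en")]
pairs = [("geban", "geben"), ("singan", "singen"), ("wort", "wort")]
```

On these three toy pairs the single rule scores a full 1.0; on the real 517-word data set mentioned above, the same rule only accounts for 54 words, which is exactly why the comparison and weighting of competing rule sets becomes the hard part.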

The problem with these endeavors is, of course, the handling of exceptions, along with the comparison of different proposals. Since we can think of an infinite number of rule sets by which we could successfully turn a given set of Old High German strings into Standard German strings, we need to ask ourselves how we could evaluate different proposals. That some kind of parsimony should play a role here is obvious. However, it is by no means clear (at least to me) how to evaluate the complexity of two systems, since the complexity is reflected not only in the number of rules, but also in the initial grouping of sounds into classes, which is commonly used to account for systemic aspects of sound change. A system addressing the problem of sound law induction would try to automate the task of finding such a set of rules. The fact that it is difficult even to compare two or more proposals based on human assessment further illustrates why I think that the problem is not trivial.

Another class of approaches is that of word prediction experiments, such as the one by Ciobanu and Dinu (2018) (but see also Bodt and List 2019), in which training data consisting of the source and the target language are used to create a model, which is then successively applied to new data, in order to test how well this model predicts target words from the source words. Since the model itself is not reported in these experiments, but only used in the form of a black box to predict new words, the task cannot be considered to be the same as the task for sound law induction — which I propose as one of my ten challenges for computational historical linguistics — given that we are interested in a method that explicitly returns the model, in order to allow linguists to inspect it.

Problems with the current solutions to sound law induction

Given that no real solutions to the problem exist up to now, it seems somewhat useless to point to the problems of current solutions. What I want to mention in this context, however, are the problems of the solutions presented for word prediction experiments, be they fed by manually encoded sound changes (Hartmann 2003) or based on inference procedures (Ciobanu and Dinu 2018, Dekker 2018). Manual solutions like PHONO suffer from the fact that they are tedious to apply, given that linguists have to present all sound changes in their data in an ordered fashion, with the program converting them step by step, always turning the whole input sequence into an intermediate output sequence. The word prediction approaches, in turn, suffer from limitations in feature design.

The method by Ciobanu and Dinu (2018), for example, is based on orthographic data alone, using the Needleman-Wunsch algorithm for sequence alignment (Needleman and Wunsch 1970); and the approach by Dekker (2018) only allows for the use of the limited alphabet of 40 symbols proposed by the ASJP project (Holman et al. 2008). In addition to this limited representation of linguistic sound sequences, be it through abstract orthography or through reduced phonetic alphabets, none of the methods can handle the kinds of contexts that result from the multi-layered character of speech. Since we know well that these aspects are vital for certain phenomena of sound change, the methods exclude from the start an aspect that traditional historical linguists, who might be interested in an automatic solution to the sound law induction problem, would put at the top of their wish-list of what the algorithm should be able to handle.

Why is automatic sound law induction difficult?

The handling of supra-segmental contexts, mentioned above, is in my opinion also the reason why sound law induction is so difficult, not only for machines, but also for humans. I have so far mentioned three major reasons why I think sound law induction is difficult. First, we face problems in defining the task properly in historical linguistics, due to a significant lack of standardization. This makes it difficult to decide on the exact output of a method for sound law induction. Second, we have problems in handling the systemic aspects of sound change properly. This applies not only to automatic approaches, but also to the evaluation of different proposals for the same data made by humans. Third, the multi-layered character of speech requires an enhanced modeling of linguistic sequences, which cannot be modeled as mono-dimensional strings alone, but should rather be seen as alignments of different strings representing different layers (tonal layer, stress layer, sound layer, etc.).

How humans detect sound laws

There are only a few examples in the literature where scholars have tried to provide detailed lists of sound changes from a proto-language to a descendant language (Baxter 1992, Newman 1999), and most individual sound laws proposed in the literature are rarely tested exhaustively on the data. As a result, it is difficult to assess what humans actually do in order to detect sound laws. What is clear is that historical linguists who have worked a lot on linguistic reconstruction tend to acquire a very good intuition that helps them to quickly apply candidate sound laws to word forms in their heads and derive the output forms. This ability is developed in a learning-by-doing fashion, with no specific techniques ever being discussed in the classroom, which reflects the general tendency in historical linguistics to trust that students will, sooner or later, learn from examples how to become good linguists (Schwink 1994: 29). For this reason, it is difficult to take inspiration from current practice in historical linguistics in order to develop computer-assisted approaches to this task.

Potential solutions to the problem

What can we do in order to address the problem of sound law induction in automatic frameworks in the future?

As a first step, we would have to standardize the notation system that we use to represent sound changes. This would need to come along with a standardized phonetic transcription system. Scholars often think that phonetic transcription is standardized in linguistics, specifically due to the use of the International Phonetic Alphabet. As our investigations into the actual application of the IPA have shown, however, the IPA cannot be seen as a standard, but rather as a set of recommendations that are often only loosely followed by linguists. First attempts to standardize phonetic transcription systems for the purpose of cross-linguistic applications have, however, been made, and will hopefully gain more acceptance in the future (Anderson et al. forthcoming, https://clts.clld.org).

As a second step, we should invest more time in investigating the systemic aspects of language change cross-linguistically. What I consider important in this context is the notion of distinctive features by which linguists try to group sounds into classes. Since feature systems proposed by linguists differ greatly, with some debate as to whether features are innate and the same for all languages, or instead language-specific (see Mielke 2008 for an overview on the problem), a first step would again consist of making the data comparable, rather than trying to decide in favour of one of the numerous proposals in the literature.

As a third step, we need to work on ways to account for the multi-layered aspect of sound sequences. Here, a first proposal, labelled "multi-tiered sequence representation", has already been made by myself (List and Chacon 2015), based on an idea that I had already used for the phonetic alignment algorithm proposed in my dissertation (List 2014), which itself goes back to the handling of hydrophilic sequences in ClustalW (Thompson et al. 1994). The idea is to define a sound sequence as a sequence of vectors, with each vector dimension (called a tier) representing one distinct aspect of the original word. As this representation allows for an extremely flexible modeling of context — which would just consist of an arbitrary number of vector dimensions accounting for aspects such as tone, stress, or preceding and following sounds — it would allow us to treat words as sequences of sounds while at the same time accounting for their multi-layered structure. Although many aspects of how to exploit this specific model of phonetic sequences to induce sound laws from ancestor-descendant data remain unsolved, I consider it a first step in the direction of a solution to the problem.
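A minimal sketch of such a multi-tiered representation might look as follows; the tier names and values here are invented for illustration and not taken from the cited proposal. Each tier is a vector with one value per position, and the feature vector at any position is simply the column across all tiers.

```python
# A word as an alignment of tiers: one value per segmental position
# on every tier. Tier names and values are illustrative only.
word = {
    "segment":   ["d", "a", "u"],
    "tone":      ["ping", "ping", "ping"],  # syllable tone, copied per slot
    "preceding": ["#", "d", "a"],           # left context as its own tier
    "following": ["a", "u", "#"],           # right context as its own tier
}

def vector_at(word, i):
    """Return the full feature vector (the 'column') at position i."""
    return {tier: values[i] for tier, values in word.items()}
```

With context folded into the tiers themselves, a conditioned sound law reduces to a mapping over single position vectors: the Mandarin tone condition, for instance, is read off the "tone" dimension of the initial's vector rather than from a neighboring segment.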

Multi-tiered sequence representation for a fictive word in Middle Chinese.

Outlook

Although it is not necessarily recognized by the field as a real problem of historical linguistics, I consider the problem of automatic sound law induction as a very important problem for our field. If we could infer sound laws from a set of proposed proto-forms and a set of descendant forms, then we could use them to test the quality of the proto-forms themselves, by inspecting the sound laws proposed by a given system. We could also compare sound laws across different language families to see whether we find cross-linguistic tendencies.

Having inferred enough cross-linguistic data on sound laws represented in unified models for sound law notation, we could also use the rules to search for cognate words that have so far been ignored. There is a lot to do, however, until we reach this point. Starting to think about automatic, and also manual, induction of sound laws as a specific task in computational historical linguistics can be seen as a first step in this direction.

References
Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp 1-27.

Baxter, William H. (1992) A handbook of Old Chinese Phonology. Berlin: de Gruyter.

Bodt, Timotheus A. and List, Johann-Mattis (2019) Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages. 1-22. [Preprint, under review, not peer-reviewed]

Ciobanu, Alina Maria and Dinu, Liviu P. (2018) Simulating language evolution: A tool for historical linguistics. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp 68-72.

Dekker, Peter (2018) Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks. University of Amsterdam: Amsterdam.

Hall, T. Alan (2000) Phonologie: Eine Einführung. Berlin and New York: de Gruyter.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp 606-609.

Holman, Eric W. and Wichmann, Søren and Brown, Cecil H. and Velupillai, Viveka and Müller, André and Bakker, Dik (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3: 116-121.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper, presented at the workshop Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE] (2015/09/04, Leiden, Societas Linguistica Europaea).

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 1.45: 137-161.

Mielke, Jeff (2008) The Emergence of Distinctive Features. Oxford: Oxford University Press.

Needleman, Saul B. and Wunsch, Christian D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Newman, John and Raman, Anand V. (1999) Chinese Historical Phonology: Compendium of Beijing and Cantonese Pronunciations of Characters and their Derivations from Middle Chinese. München: LINCOM Europa.

Schwink, Frederick (1994) Linguistic Typology, Universality and the Realism of Reconstruction. Washington: Institute for the Study of Man.

Thompson, J. D. and Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.

Monday, March 26, 2018

It's the system, stupid! More thoughts on sound change in language history


In various blog posts in the past I have tried to emphasize that sound change in linguistics is fundamentally different from the kind of change in phenotype / genotype that we encounter in biology. The most crucial difference is that sound sequences, i.e., our words or parts of the words we use when communicating, do not manifest as a physical substance but — as linguists say — "ephemerally", i.e. by the air flow that comes out of the mouth of a speaker and is perceived as an acoustic signal by the listener. This is in strong contrast to DNA sequences, for example, which are undeniably somewhere "out there". They can be sliced and investigated, and they preserve information for centuries if not millennia, as the recent boom in archaeogenetics illustrates.

Here, I explore the consequences of this difference in a bit more detail.

Language as an activity

Language, as Wilhelm von Humboldt (1767-1835) — the boring linguist who investigated languages from his armchair while his brother Alexander was traveling the world — put it, is an activity (energeia). If we utter sentences, we pursue this activity and produce sample output of the system hidden in our heads. Since the sound signal is constrained only by the capacity of our mouth to produce certain sounds, and the capacity of our brain to parse the signals we hear, we find much greater variation in the sounds available across the languages of the world than we find when comparing the alphabets underlying DNA or protein sequences.

Despite the large variation in the sound systems of the world's languages, it is clear that there are striking common tendencies. A language without vowels does not make much sense, as we would have problems pronouncing the words or perceiving them over longer distances. A language without consonants would also be problematic; and even artificial communication systems developed for long-distance communication, like the different kinds of yodeling practiced in different parts of the world, make use of consonants to allow for a clearer distinction between vowels (see the page about yodeling on Wikipedia). But between these extremes we find great variation in the languages of the world, and it does not seem to follow any specific pattern that would point to some kind of selective pressure, although scholars have repeatedly tried to demonstrate one (see Everett et al. 2015 and the follow-up by Roberts 2018).

What is also important here is that not only is the number of sounds in the sound system of a given language highly variable, but there is also variation in the rules by which sounds can be concatenated to form words (the phonotactics of a language), along with the frequency of the sounds in the words of different languages. Some languages tolerate clusters of multiple consonants (compare Russian vzroslye or German Herbst), others disallow them (compare the Chinese name for Frankfurt: fǎlánkèfú), yet others allow words to end in voiced stops (compare English job in standard pronunciation), while some turn voiced stops into voiceless ones (compare the standard pronunciation of Job in German as jop).

Language as a system

Language is a system which essentially concatenates a fixed number of sounds into sequences, restricted only by the encoding and decoding capacities of its users. This is the core reason why sound change is so different from change in biological characters. If we say that German d goes back to Proto-Germanic *θ (pronounced like th in path), this does not mean that there were a couple of mutations in a couple of words of the German language. Instead, it means that the system which produced the words of Proto-Germanic changed the way in which the original sound *θ was produced.

In some sense, we can think metaphorically of a typewriter in which we replace one letter by another: whenever we want to type a given word in the way we know it, we will type it with the new letter instead. But this analogy would be too restricted, as we can also add new letters to the typewriter, or remove existing ones. We can even split one letter key into two, as happens in the case of palatalization, a very common type of sound change by which sounds like [k] or [g] turn into sounds like [tʃ] and [dʒ] when followed by front vowels (compare Italian cento "hundred", whose Latin ancestor centum was pronounced [kɛntum], while the Italian word is now pronounced [tʃɛnto]).
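To make the "split key" concrete, a context-conditioned replacement of this kind can be sketched in a few lines of Python. The rule and the transcriptions are deliberately simplified (plain strings with only the front vowels i and e triggering the change), so this is a toy illustration, not a model of Romance historical phonology:

```python
import re

def palatalize(word):
    """Apply a simplified palatalization rule: the velar stops k and g
    become the affricates tʃ and dʒ before the front vowels i and e.
    The lookahead keeps the vowel itself unchanged."""
    word = re.sub(r"k(?=[ie])", "tʃ", word)
    word = re.sub(r"g(?=[ie])", "dʒ", word)
    return word

# Latin centum [kentum] 'hundred' > Italian cento (vowels simplified here)
print(palatalize("kento"))   # tʃento
print(palatalize("kanto"))   # kanto -- no front vowel, so no change
```

Because the rule is stated over the system rather than over individual words, every word containing a velar before a front vowel is affected in the same way, which is exactly the regularity discussed above.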

Sound change is not the same as mutation in biology

Since it is the sound system that changes during the process we call sound change, and not the words (which are just a reflection of the output of the system), we cannot equate sound change with mutation in biological sequences: mutations do not recur across all sequences in a genome, replacing one DNA segment by another one that may not even have existed before. The change in the system, as opposed to the sequences that the system produces, is the reason for the apparent regularity of sound change.

This culminates in Leonard Bloomfield's (1887-1949) famous (at least among old-school linguists) statement that "phonemes [i.e., the minimal distinctive units of language] change" (Bloomfield 1933: 351). From the perspective of formal approaches to sequence comparison, we could restate this as: "alphabets change". Hruschka et al. (2015) have compared sound change with concerted evolution in biology. We can state the analogy in simpler terms: sound change reflects systemic change in language history, and concerted evolution results from systemic changes in biological evolution. It's the system, stupid!

Because sound systems change in language history, the problem of character alignment (i.e., determining homology or cognacy) in linguistics cannot be solved directly with the techniques used in biology, where the alphabets are assumed to be constant and alignments are supposed to identify mutations alone. Comparing sequences in linguistics means comparing sequences that were basically drawn from different alphabets: we need to find out which sounds correspond to which sounds across different languages while at the same time trying to align them.

An artificial example for the systemic grounding of sound change

Let me provide a concrete artificial example to illustrate the peculiarities of sound change. Imagine two people who originally spoke the same language, but then suffered from diseases or accidents that prevented them from producing their speech in the way they did before. Let the first person suffer from a cold, which blocks the nose and therefore turns all nasal sounds into their corresponding voiced stops, i.e., n becomes a d, ng becomes a g, and m becomes a b. Let the other person suffer from the loss of the front teeth, which makes it difficult to pronounce the sounds s and z correctly, so that they sound like a th (in its voiced and voiceless forms, as in that vs. thing).


Artificial sound change resulting from a cold or the loss of the front teeth.

If we now let both persons pronounce the same words in their original language, they won't sound very similar anymore, as I have tried to depict in the following table (dh points to the th in words like father, as opposed to the voiceless th in words like thatch).

No.   Speaker Cold   Speaker Tooth
1     bass           math
2     buzic          mudhic
3     dose           nothe
4     boizy          moidhy
5     sig            thing
6     rizig          ridhing

By comparing the words systematically, however, bearing in mind that we need to find the best alignment and the mapping between the alphabets, we can retrieve a set of what linguists call sound correspondences. We can see that the s of speaker Cold corresponds to the th of speaker Tooth, z corresponds to dh, b to m, d to n, and g to ng. Having probably figured out by now that my words were taken from the English language (spelling voiced s consistently as z), it is easy even to come up with a reconstruction of the original words (mass, music [= muzik], nose, noisy [= noizy], etc.).

Reconstructing ancestral sounds in our artificial example with help of regular sound correspondences.
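The whole toy scenario can also be replayed programmatically. The Python sketch below uses an invented broad segmental spelling (writing the voiced s of nose and noisy as z throughout), applies the two speakers' systematic replacements, and counts the resulting correspondences. Since every replacement here maps one segment to one segment, the two outputs already align position by position and no alignment algorithm is needed in this toy case:

```python
from collections import defaultdict

# The two speakers' systematic replacements (invented broad spelling).
cold = {"m": "b", "n": "d", "ng": "g"}    # blocked nose: nasals become stops
tooth = {"s": "th", "z": "dh"}            # missing front teeth: s/z become th/dh

def speak(segments, changes):
    """Apply one speaker's replacements to a segmented word."""
    return [changes.get(seg, seg) for seg in segments]

# mass, music, nose, noisy, sing, rising in the broad spelling
words = [
    ["m", "a", "s"],
    ["m", "u", "z", "i", "k"],
    ["n", "o", "z"],
    ["n", "o", "i", "z", "i"],
    ["s", "i", "ng"],
    ["r", "a", "i", "z", "i", "ng"],
]

# Count the segment pairs that differ between the two speakers.
correspondences = defaultdict(int)
for word in words:
    for a, b in zip(speak(word, cold), speak(word, tooth)):
        if a != b:
            correspondences[(a, b)] += 1

print(dict(correspondences))
# {('b', 'm'): 2, ('s', 'th'): 2, ('z', 'dh'): 4, ('d', 'n'): 2, ('g', 'ng'): 2}
```

The recurring pairs are exactly the correspondences listed above, and reversing each speaker's mapping recovers the original words.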

Summary

Systemic changes are difficult to handle in phylogenetic analyses. They leave specific traces in the evolving objects we investigate that are often difficult to interpret. While linguists have long known that sound change is an inherently systemic phenomenon, it is still very difficult to communicate to non-linguists what this means, and why it is so difficult for us to compare languages by comparing their words. Although it may seem tempting to compare languages with simple sequence-alignment algorithms, treating the differences like the results of mutations in biological sequences (see, for example, Wheeler and Whiteley 2015), such an approach oversimplifies the problem.

Simple models undeniably have their merits, especially when dealing with big datasets that are difficult to inspect manually — there is nothing to say against their use. But we should always keep in mind that we can, and should, do much better than this. Handling systemic changes remains a major challenge for phylogenetic approaches, no matter whether they use trees, networks, bushes, or forests.

Given the peculiarity of sound change in linguistic evolution, and how well the phenomena are understood in our discipline, it seems worthwhile to invest time in exploring ways to formalize and model the process. During the past two decades, linguists have taken a lot of inspiration from biology. The time will come when we need to pay something back. Providing models and analyses to deal with systemic processes like sound change might be a good start.

References

Bloomfield, L. (1933) Language. Holt: New York.

Everett, C., D. Blasi, and S. Roberts (2015) Climate, vocal folds, and tonal languages: connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5: 1322-1327.

Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015) Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1: 1-9.

Roberts, S. (2018) Robust, causal, and incremental approaches to investigating linguistic adaptation. Frontiers in Psychology 9: 166.

Wheeler, W. and P. Whiteley (2015) Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2: 113-125.

Tuesday, August 22, 2017

Unattested character states


In an earlier post from January 2016, I argued that it is important to account for directional processes when modeling language history through character-state evolution. In previous papers (List 2016; Chacon and List 2015), I tried to show that this can easily be done with asymmetric step matrices in a parsimony framework. Only later did I realize that this is nothing new for biologists who work on morphological characters, thus supporting David's claim that we should not compare linguistic characters with the genotype, but with the phenotype (Morrison 2014). Early this year, a colleague introduced me to Mk-models in phylogenetics, which were first introduced by Lewis (2001) and which allow the analysis of multi-state characters in a likelihood framework.

What surprised me is that Mk-models seem to outperform parsimony frameworks, despite being much simpler than the elaborate step matrices defined for morphological characters (Wright and Hillis 2014). Today, I read a recent paper by Wright et al. (2016) which even shows how asymmetric transition rates can be handled in likelihood frameworks.

Being by no means an expert in phylogenetic analyses, especially not in likelihood frameworks, I tend to have a hard time understanding what is actually being modeled. However, if I correctly understand the gist of the Wright et al. paper, it seems that we are slowly approaching a situation in which more complex scenarios of lexical character evolution in linguistics no longer need to rely on parsimony frameworks.

But, unfortunately, we are not there yet; and it is even questionable whether we ever will be. The reason is that all multi-state models proposed so far handle only transitions between attested characters: unattested characters can neither be included in the analyses nor inferred.

I have pointed to this problem in some previous blog posts, the last one published in June, where I mentioned Ferdinand de Saussure (1857-1913), who postulated two unattested consonantal sounds for Indo-European (Saussure 1879), one of which was later found to have survived in Hittite, a language that was deciphered and shown to be Indo-European only about 30 years later (Lehmann 1992: 33).

The fact that it is possible to use our traditional methods to infer unattested sounds from circumstantial evidence, but not to incorporate our knowledge about them into phylogenetic analyses, is a huge drawback. Potentially even more problematic are situations where even our traditional methods do not allow us to infer unattested data. Think, for example, of a word that was once present in some language but was later completely lost. Given the ephemeral nature of human language, we have no way of knowing this, but we know very well that it easily happens: just think of terms used for old technology, like walkman, or soon even iPod, which the younger generations have never heard of.

Colleagues with whom I have discussed my concerns in this regard are often more optimistic than I am, saying that even if the methods cannot handle unattested characters they can still find the major signal, and thus tell us at least the general tendency as to how a language family evolved. However, for classical linguists, who can infer quite a lot using the laborious methods that still need to be applied manually, it leaves a sour taste to be told that the analysis deliberately ignored crucial aspects of the processes and phenomena they understand very well. By analogy, if we found that some intelligence test is right in only about 80% of all cases, we would abstain from using it to decide whom we admit to university.

I also think that this is not a satisfying solution for the analysis of morphological data in biology. It is quite likely that some ancient species had certain traits, which later evolved into the traits we observe, that are simply no longer attested anywhere, either in fossils or in the genes. I also wonder how well phylogenetic frameworks generally account for the fact that the evidence we are left with may reflect much less than what was once there.

In Chacon and List (2015), we circumvented the problem by adding ancestral but unattested sounds to the step matrices of our parsimony analysis. This is of course not entirely satisfactory, as it adds a heavy bias to the analysis of sound change, which no longer tests all possible solutions but only the ones we fed into the algorithm. For sound change, it may be possible to substantially expand the character space by adding sounds attested across the world's languages, and then letting the algorithms select the most probable transitions. But given that we still know barely anything about general transition probabilities of sound change, and that databases like Phoible (Moran et al. 2014) list more than 2,000 different sounds for a bit more than 2,000 languages, it seems like a Sisyphean challenge to tackle this problem consistently.
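To illustrate what adding an unattested sound to an asymmetric step matrix amounts to, here is a minimal Python sketch with invented costs (the weights are purely illustrative, not taken from Chacon and List 2015). The unattested *θ (written th) is cheap as a source of change but can never be a target, encoding the directionality θ > d and θ > t; with these toy weights, the unattested sound comes out as the cheapest ancestor of the observed reflexes d and t on a star tree:

```python
# Toy asymmetric step matrix; all costs are invented for illustration.
INF = float("inf")
states = ["th", "d", "t"]
steps = {
    ("th", "d"): 1, ("th", "t"): 1,      # easy developments of unattested *θ
    ("d", "t"): 3, ("t", "d"): 3,        # possible, but costly
    ("d", "th"): INF, ("t", "th"): INF,  # *θ never (re)appears as a target
}

def cost(source, target):
    """Cost of changing one state into another (0 = no change)."""
    return 0 if source == target else steps[(source, target)]

# On a star tree with observed reflexes d and t, deriving both
# independently from *θ is cheaper than deriving one from the other.
reflexes = ["d", "t"]
best = min(states, key=lambda s: sum(cost(s, r) for r in reflexes))
print(best)  # th
```

The point of the sketch is exactly the bias mentioned above: the unattested state can only be recovered because we put it into the state space and the matrix by hand.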

What can we do in the meantime? Not very much, it seems. But we can still try to improve our methods in baby steps, trying to get a better understanding of the major and minor processes in linguistic and biological evolution; and not forgetting that, although I was only talking about phylogenetic tree reconstruction, in the end we also want to have all of this done in network approaches.

References
  • Chacon, T. and J.-M. List (2015) Improved computational models of sound change shed light on the history of the Tukanoan languages. Journal of Language Relationship 13: 177-204.
  • Lehmann, W. (1992) Historical linguistics. An Introduction. Routledge: London.
  • Lewis, P. (2001) A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology 50: 913-925.
  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.
  • Moran, S., D. McCloy, and R. Wright (eds) (2014) PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.
  • Morrison, D.A. (2014) Are phylogenetic patterns the same in anthropology and biology? bioRxiv.
  • Saussure, F. (1879) Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.
  • Wright, A. and D. Hillis (2014) Bayesian analysis using a simple likelihood model outperforms parsimony for estimation of phylogeny from discrete morphological data. PLoS ONE 9.10: e109210.
  • Wright, A., G. Lloyd, and D. Hillis (2016) Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors. Systematic Biology 65: 602-611.

Tuesday, June 27, 2017

Trees do not necessarily help in linguistic reconstruction


In historical linguistics, "linguistic reconstruction" is a rather important task. It can be divided into several subtasks, like "lexical reconstruction", "phonological reconstruction", and "syntactic reconstruction", and it comes conceptually close to what biologists would call "ancestral state reconstruction".

In phonological reconstruction, linguists seek to reconstruct the sound system of the ancestral language or proto-language, the Ursprache that is no longer attested in written sources. The term lexical reconstruction is used less frequently, but it obviously points to the reconstruction of whole lexemes in the proto-language. This requires sub-tasks like semantic reconstruction, where one seeks to identify the original meaning of the ancestral word form from which a given set of cognate words in the descendant languages developed, and morphological reconstruction, where one tries to reconstruct the morphology, such as case systems or frequently recurring suffixes.

In a narrow sense, linguistic reconstruction points only to phonological reconstruction, which is something like the holy grail of computational approaches, since, so far, no method has been proposed that would convincingly show that one can do without expert insights. Bouchard-Côté et al. (2013) use language phylogenies to climb a language tree from the leaves to the root, using sophisticated machine-learning techniques to infer the ancestral states of words in Oceanic languages. Hruschka et al. (2015) start from sites in multiple alignments of cognate sets of Turkic languages to infer both a language tree and the ancestral states, along with the sound changes that regularly occurred at the internal nodes of the tree. Both approaches show that phylogenetic methods could, in principle, be used to automatically infer which sounds were used in the proto-language; and both approaches report rather promising results.

Neither approach, however, is fully convincing, for both practical and methodological reasons. First, they are applied to language families that are considered rather "easy" to reconstruct. The tough cases are larger language families with more complex phonology, like Sino-Tibetan or any of its subbranches (including even shallow families like Sinitic, i.e. Chinese), or Indo-European, where the greatest achievements of the classical methods for language comparison have been made.

Second, they rely on a wrong assumption, namely that the sounds used in a set of attested languages are necessarily the pool of sounds that are also the best candidates for the Ursprache. For example, Saussure (1879) proposed that Proto-Indo-European had at least two sounds that did not survive in any of the descendant languages, the so-called laryngeals, which are nowadays commonly represented as h₁, h₂, and h₃, and which left complex traces in the vowel and consonant systems of some Indo-European languages. Ever since then, it has been a standard assumption that it is always possible that none of the ancestral sounds in a given proto-language is still attested in any of its descendants.

A third interesting point, which I consider a methodological problem of these methods, is that both are based on language trees, which are either given to the algorithm or inferred during the process. Given that most, if not all, approaches to ancestral state reconstruction in biology are based on some kind of phylogeny, even if it is a rooted evolutionary network, it may sound strange that I criticize this point. But in fact, when linguists use the classical methods to infer ancestral sounds and ancestral sound systems, phylogenies do not necessarily play an important role.

The reason for this lies in the highly directional nature of sound change, especially in the consonant systems of languages, which often makes it extremely easy to predict the ancestral sound without invoking any phylogeny more complex than a star tree. That is, in linguistics we often have a good idea about directed character-state changes. For example, if a linguist observes a [k] in one set of languages and a [ts] in another set of languages in the same alignment site of multiple cognate sets, then they will immediately reconstruct *k for the proto-language, since they know that [k] can easily become [ts], but not vice versa. The same holds for many sound correspondence patterns that can frequently be observed among the languages of the world, including cases like [p] and [f], [k] and [x], and many more. Why should we bother about any phylogeny in the background, if we already know that it is much more likely that these changes occurred independently? Directed character-state assessments make a phylogeny unnecessary.
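This intuition can be expressed without any tree at all: all we need is a directed graph of changes that are known to happen easily. The following Python sketch (with a tiny, invented set of transitions) reconstructs the ancestor of a correspondence pattern as the sound from which all observed reflexes can be derived:

```python
# Toy directed "network of preference transitions": an edge a -> b
# means that sound a is known to easily change into sound b.
# The set of sounds and edges here is invented for illustration.
changes = {
    "k": {"ts", "x"},   # palatalization, lenition
    "p": {"f"},         # spirantization
    "ts": set(), "x": set(), "f": set(),
}

def can_derive(source, target):
    """Is target reachable from source along attested change paths?"""
    if source == target:
        return True
    return any(can_derive(mid, target) for mid in changes.get(source, ()))

def reconstruct(reflexes):
    """Return the sounds from which all observed reflexes can be
    derived, using only the directed graph and an implicit star tree."""
    return [s for s in changes if all(can_derive(s, r) for r in reflexes)]

print(reconstruct(["k", "ts"]))  # ['k'] -- *k > ts, not vice versa
print(reconstruct(["p", "f"]))   # ['p']
```

For the pattern [k] vs. [ts] this yields *k, just as a trained linguist would decide, and no phylogeny beyond the implicit star tree is ever consulted.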

Sound change in this sense is simply not well treated by any paradigm that assumes some kind of parsimony, as it simply occurs independently too often. The question is less acute with vowels, where scholars have observed cycles of change in ancient languages that are attested in written sources. Even more problematic is the change of tones, where scholars have even less intuition regarding preferred directions of transition, and where ancient data do not describe the tones in the phonetic detail we would need in order to compare them with modern data. In contrast to consonant reconstruction, where we can do almost entirely without phylogenies, phylogenies may indeed provide some help in shedding light on open questions in vowel and tone change.

But one should not underestimate this task, given the systemic pressures that may crucially impact vowel and tone systems. Since there are considerably fewer empty spots in the vowel and tone space of human languages, it can easily happen that the most natural paths of vowel or tone development (if they exist at all) are counteracted by systemic pressures. Vowels can easily be confused in communication, and this holds even more for tones. Even if changes are "natural", they can create conflict in communication if they produce very similar vowels or tones that are hard for the speakers to distinguish. As a result, such changes can provoke mergers, with speakers no longer distinguishing the sounds at all; or, alternatively, changes that are less "natural" (physiologically or acoustically) may be preferred by a speech community in order to maintain the effectiveness of the linguistic system.

In principle, these phenomena are well known to trained linguists, although it is hard to find explicit statements about them in the literature. Linguistic reconstruction (in the sense of phonological reconstruction) is hard for machines precisely because it is easy for trained linguists: every historical linguist has a catalogue of existing sounds in their head, as well as a network of preferred transitions, but we lack a machine-readable version of those catalogues. This is mainly because transcription systems differ widely across subfields and language families, and because no efforts to standardize these transcriptions have so far been successful.

Without such catalogues, however, any effort to apply vanilla-style methods for ancestral state reconstruction from biology to linguistic reconstruction in historical linguistics will be futile. We do not need trees for linguistic reconstruction; we need the network of potential pathways of sound change.

References
  • Bouchard-Côté, A., D. Hall, T. Griffiths, and D. Klein (2013): Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11: 4224-4229.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25.1: 1-9.
  • Saussure, F. (1879): Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.