Showing posts with label Historical linguistics. Show all posts

Monday, July 1, 2019

Stacking networks based on sign language manual alphabets


This post is the first of a mini-series on sign language manual alphabets. While the evolution of spoken languages has been studied intensively using phylogenetic methods, sign languages have not, as yet.

In this post we will first introduce our readers to a set of stacked networks, and how it assists in establishing ancestor-descendant relationships in a pretty straightforward (but not trivial) case: the evolution of manual alphabets in sign languages. In the next post, I will demonstrate the use of networks for character mapping and putting forward hypotheses about ancestor-descendant relationships.

In 2004, Spencer et al. (Two papers you may want to read...) showed that Neighbor-nets outperform tree inferences when it comes to explicit ancestor-descendant relationships. The data set they used was quite particular: copies of written text. Here, scribes copy a text, and then other scribes, some of them ignorant of the language of the text they are copying, copy the copies. In the paper, the sequence of copies was recorded (the 'true tree'), and then the various texts were transferred into phylogenetic matrices, in order to infer trees and networks, and then this result was compared to the 'true tree'. The best fit of the data to the truth was the Neighbor-net.

This is a compelling conclusion, because, as a planar network and in contrast to median networks, Neighbor-nets don't explicitly place taxa in ancestor-descendant relationships. However, we have shown for many cases here at the Genealogical World of Phylogenetic Networks how ancestors are often placed with respect to their descendants: they are often closer to the center of the graph, or the root when known, and thus they bridge the center or sister lineages and their descendants. We can thus see why Neighbor-nets might be useful in practice.

In this context, the evolution of sign language manual alphabets, i.e. the hand-shapes used to represent letters of a written alphabet, should be relatively easy to reconstruct. Once an alphabet is established in a sign language school / community, the ancestor, it will be passed on to other "generations" within the community and other schools / communities, the descendants. However, this is not necessarily a dichotomous process, as depicted in the first figure.

A scheme depicting how manual alphabets may evolve and disperse.

There are a few complications here: for example, hand-shapes may change in the course of being used (the hand-shape evolves); contact may lead to exchange or appropriation of hand-shapes (called "borrowing" in linguistics); and, in some cases, entire alphabets will need to be adapted to a particular use. The latter case occurs when changing from one script (Latin, say) to another (Cyrillic or Arabic); the first formal school for the deaf was established in Paris, for example. As a teacher, I need to decide: Do I take a hand-shape from the morphologically similar letter, or the phonetically similar one? As a scientist, I need to assess the homologies among such hand-shapes without inflicting systematic bias.

Standardization will wipe out local customs and replace them with a multinational standard. For instance, Country 2 in the scheme above, drops its original B-type manual alphabet (red) for an A-type (blue); and in Country 7 both traditions are fused. Over time, originally distinct sign languages may converge due to geographic proximity, or even just feasibility.

The evolution of spoken languages has been studied intensively using phylogenetic methods, and in particular networks are much more commonly found in the linguistic literature than in the biological one. For sign languages we have made a first step in a recently published pre-print:
Justin M. Power, Guido W. Grimm, and Johann-Mattis List (2019) Evolutionary dynamics in the dispersal of sign languages. Humanities Commons. http://dx.doi.org/10.17613/0smt-j414
What excites me about our study is that it combines historical manual alphabets (going back to 1593), which are potential ancestors, with a set of modern-day alphabets, which are their likely descendants. The data set is thus an evolutionary paleontologist's dream (and, possibly, a cladist's nightmare, if we expect a simple tree-like set of relationships rather than a network). As a scientist, I simply love to boldly go where no-one has gone before.

The next figure shows the all-inclusive network from our paper, but focusing on the age of the manual alphabets.

For more linguistic details see the pre-print.
* Historical version(s) of these lineages are not included in our data set

Obviously, there have been quite a lot of evolutionary changes, as well as standardization, going on, although some parts, like the Swedish SL (sign language), have stuck to their unique original. Historical and contemporary Spanish / Catalan are still most similar to the oldest manual alphabets that Justin dug out for our study. On the other hand, the contemporary Norwegian SL is placed far apart from its historical counterparts, and lacks any obvious affinity. Austrian, Danish, and German look back on a long and diverse history, the green "Austrian-origin Group", but the contemporaries have been homogenized by standardization (note the closeness to the International Sign manual alphabet). If we use an analogy with common biological and biogeographical processes (such as range expansion, competition, extinction, etc), then the Austrian-origin Group only survived in a remote island population, where we still find a sort of living fossil, the Icelandic SL.

In contrast to biological data, the old, putatively ancestral, manual alphabets are not closer to the graph's center, or the oldest manual alphabets in our data set. The reason for this seems to lie in the data itself and how manual alphabets evolve, and this will be the topic of the next post(s).

Still, we can isolate some evolutionary pathways, especially when we make time-wise taxon-filtered networks and stack them (see this introduction to stacking and this application using Osmundaceae, a data set including an even larger ratio of fossil taxa to modern taxa).

Fig. 4 from Power et al. Coloring same as above: pink – Spanish; turquoise – French-origin; green – Austrian-origin; orange – Polish; red – Russian; light blue – Swedish Group. The English-origin and Afghan-Jordanian groups are not included, since they are not represented by historical manual alphabets in our data set.

Each of the three networks includes manual alphabets from a certain time period, starting with pre-1840 at the bottom, historical 19th-/20th-century manual alphabets in the middle, and post-1950 manual alphabets in the top network. The dotted links between the networks connect manual alphabets that are included in two of the networks.

Even from these graphs alone, we can say a lot about how ancestors (original manual alphabets in a country) relate to descendants (later and contemporary manual alphabets) and their evolutionary pathways. Here are some examples.

Shortly after the time when the first schools for the deaf were established in continental Europe (late 18th, early 19th centuries), manual alphabets showed quite a diversity, and were very different from their potential Spanish sources, such as Yebra 1593 and Bonet 1620, with the French and Austrian teachers and communities going different ways. The oldest Cyrillic alphabet, Russian 1835, is more closely related to (ancient) Austrian than it is to (ancient) French.

The Swedish manual alphabet of 1866 is a fresh invention. Some hand-shapes may have been borrowed from one or another alphabet in use on the continent, but, as we will see in the next post of the series, it includes genuinely new forms.

The French tradition was dispersed into the New World (American SL appears to be a direct derivation from the French, while the Brazilian SL is an adaptation) but remained a relatively homogeneous group. On the other hand, the Austrian-origin languages diversified, in particular within the Danish influence zone. Politically, the Danish king ceded Norway to Sweden in the Treaty of Kiel 1814 (note the distance between Norwegian and Danish languages in the late 19th century), while Iceland was a Danish dependency until 1918, when the Danish-Icelandic Act of Union was signed. Furthermore, the German manual alphabets subsequently diverged from the Austrian source.

The Polish manual alphabet, originally an adaptation of the Austrian-Danish manual alphabets (see the graph in the middle), became closer to the Russian group, with the Latvian sign language taking up an intermediate position. The Cyrillic alphabets evolved further away, too (top graph).

In the following post(s) of this miniseries, we will explain what we learned from simple character mapping on the time-taxon-filtered networks, and how to score manual alphabets in the first place.


Follow-up posts in this miniseries

Monday, May 27, 2019

Automatic phonological reconstruction (Open problems in computational diversity linguistics 4)


The fourth problem in my list of open problems in computational diversity linguistics is devoted to the problem of linguistic reconstruction, or, more specifically, to the problem of phonological reconstruction, which can be characterized as follows:
Given a set of cognate morphemes across a set of related languages, try to infer the hypothetical pronunciation of each morpheme in the proto-language.
This task needs to be distinguished from the broader task of linguistic reconstruction, which would usually also include lexical reconstruction, i.e. the reconstruction of full lexemes, as opposed to single morphemes or "roots", in an unknown ancestral language. In some cases, linguistic reconstruction is even used as a cover term for all reconstruction methods in historical linguistics, including such diverse approaches as phylogenetic reconstruction (finding the phylogeny of a language family), semantic reconstruction (finding the meaning of a reconstructed morpheme or root), or the task of demonstrating that languages are genetically related (see, e.g., the chapters in Fox 1995).

Phonological and lexical reconstruction

In order to understand the specific difference between phonological and lexical reconstruction, and why making this distinction is so important, consider the list of words meaning "yesterday" in five Burmish languages (taken from Hill and List 2017: 51).

Figure 1: Cognate words in Burmish languages (taken from Hill and List 2017)

Four of these languages express the word "yesterday" with the help of more than one morpheme, indicated by using different colors in the table's phonetic transcriptions, which at the same time also indicate which words we consider to be homologous in this sample. Four of the languages have one morpheme which (as we confirmed from the detailed language data) means "day" independently. This morpheme is given the label 2 in the last column of the table. From this, we can see that the motivation by which the word for "yesterday" is composed in these languages is similar to the one we observe in English, where we also find the word day being a part of the word yester-day.

If we want to know how the word "yesterday" was expressed in the ancestor of the Burmish languages, we could make an abstract estimation based on the lexical material we have at hand. We might assume that it was also a compound word, given the importance of compounding in all living Burmish languages. We could further hypothesize that one part of the ancient compound would have been the original word for "day". We could even make a guess and say the word was in structure similar to Bola and Lashi (although it is difficult to find a justification for doing this). In all cases, we would propose a lexical reconstruction for the word for "yesterday" in Proto-Burmish. We would make an assumption with respect to what one could call the denotation structure or the motivation structure, as we called it in Hill and List (2017: 67). This assumption would not need to provide an actual pronunciation of the word, it could be proposed entirely independently.

If we want to reconstruct the pronunciation of the ancient word for "yesterday" as well, we have to compare the corresponding sounds, and build a phonological reconstruction for each of the morphemes separately. As a matter of fact, scholars working on South-East Asian languages rarely propose a full lexical reconstruction as part of their reconstruction systems (for a rare exception, see Mann 1998). Instead, they pick the homologous morphemes from their word comparisons, assign some rough meaning to them (this step would be called semantic reconstruction), and then propose an ancient pronunciation based on the correspondence patterns they observe.

When listing phonological reconstruction as one of my ten problems, I am deliberately distinguishing this task from the tasks of lexical reconstruction or semantic reconstruction, since they can (and probably should) be carried out independently. Furthermore, by describing pronunciation of the morphemes as "hypothetical pronunciations" in the ancestral language, I want not only to emphasize that all reconstruction is hypothetical, but also to point to the fact that it is very possible that some of the morphemes for which one proposes a proto-form may not even have existed in the proto-language. They could have evolved only later as innovations on certain branches in the history of the languages. For the task of phonological reconstruction, however, this would not matter, since the question of whether a morpheme existed in the most recent common ancestor becomes relevant only if one tries to reconstruct the lexicon of a given proto-language. But phonological reconstruction seeks to reconstruct its phonology, i.e. the sound inventory of the proto-language, and the rules by which these sounds could be combined to form morphemes (phonotactics).

Why phonological reconstruction is hard

That phonological reconstruction is hard should not be surprising. What the task entails is to find the most probable pronunciation for a bunch of morphemes in a language for which no written records exist. Imagine you want to find the DNA of LUCA as a biologist, not even in its folded form, with all of the pieces in place, but just a couple of chunks, in order to get a better picture of what this LUCA might have looked like. But while we can employ some weak version of uniformitarianism when trying to reconstruct at least some genes of our LUCA (we would still assume that it was using some kind of DNA, drawn from the typical alphabet of DNA letters), we face the specific problem in linguistics that we cannot even be sure about the letters.

Only recently, Blasi et al. (2019) argued that sounds like f and v may have evolved later than the other sounds we can find in the languages of the world, driven by post-Neolithic changes in the bite configuration, which seem to depend on what we eat. As a rule, and independent of these findings, linguists do not tend to reconstruct an f for the proto-language in those cases where they find it corresponding to a p, since we know that in almost all known cases a p can evolve into an f, but an f almost never becomes a p again. This can lead to the strange situation where some linguists reconstruct a p for a given proto-language even though all descendants show an f, which is, of course, an exaggeration of the principle (see Guillaume Jacques' discussion on this problem).

But the very idea that we may have good reasons to reconstruct something in our ancestral language that has been lost in all descendant languages is something completely normal for linguists. In 1879, for example, Ferdinand de Saussure (Saussure 1879) used internal and comparative evidence to propose the existence of what he called coefficients sonantiques in Proto-Indo-European. His proposal included the prediction that, if ever a language was found that retained these elements, these new sounds would surface as segmental elements, as distinctive sounds, in certain cognate sets, where all known Indo-European languages had already lost the contrast.

These sounds are nowadays known as laryngeals (*h1, *h2, *h3, see Meier-Brügger 2002), and when Hittite was identified as an Indo-European language (Hrozný 1915), one of the two sounds predicted by Saussure could indeed be identified. I have discussed before on this blog the problem of unattested character states in historical linguistics, so there is no need to go into further detail. What I want to emphasize is that this aspect of linguistic reconstruction in general, and phonological reconstruction specifically, is one of the many points that makes the task really hard, since any algorithm to reconstruct the phonological system of some proto-language would have to find a way to formalize the complicated arguments by which linguists infer that there are traces of something that is no longer there.

There are many more things that I could mention, if I wanted to identify the difficulty of phonological reconstruction in its entirety. What I find most difficult to deal with is that the methodology is insufficiently formalized. Linguists have their success stories, which helped them to predict certain aspects of a given proto-language that could later be confirmed, and it is due to these success stories that we are confident that it can, in principle, be done. But the methodological literature is sparse, and the rare cases where scholars have tried to formalize it are rarely discussed when it comes to evaluating concrete proposals (for an example of an attempt at formalization, see Hoenigswald 1960). Before this post becomes too long, I will therefore conclude by noting that scholars usually have a pretty good idea of how they should perform their phonological reconstructions, but that this knowledge of how one should reconstruct a proto-language is usually not seen as something that could be formalized completely.

Traditional strategies for phonological reconstruction

Given the lack of methodological literature on phonological reconstruction, it is not easy to describe how it should be done in an ideal scenario. What seems to me to be the most promising approach is to start from correspondence patterns. A correspondence pattern is an abstraction from individual alignment sites distributed over cognate sets drawn from related languages. As I have tried to show in a paper published earlier this year (List 2019), a correspondence pattern summarizes individual alignment sites in an abstract form, where missing data are imputed. I will avoid going into the details here but, as a shortcut, we can say that each correspondence pattern should, in theory, only correspond to one proto-sound in the language, although the same proto-sound may correspond to more than one correspondence pattern. As an example, consider the following table, showing three (fictive) patterns that would all be reconstructed by a *p.

 Proto-Form  L₁  L₂  L₃
 *p  p  p  f
 *p  p  p  p
 *p  b  p  p

To justify that the same proto-sound is reconstructed by a *p in all three patterns, linguists invoke the rule of context, by looking at the real words from which the pattern was derived. An example is shown in the next table.


 Proto-Form  L₁  L₂  L₃
 *p i a ŋ  p i a ŋ  p i u ŋ  f a n
 *p a t  p a t  p a t  p a t
 *a p a ŋ  a b a ŋ  a p a ŋ  a p a n

What you should be able to see from the table is that we can find in all three patterns a conditioning factor that allows us to assume that the deviation from the original *p is secondary. In language L₃, the factor can be found in the palatal environment (followed by the front vowel *i) that we find in the ancestral language. We would assume that this environment triggered the change from *p to f in this language. In the case of the change from *p to b in L₁, the triggering environment is that the p occurs inter-vocalically.

To summarize: what linguists usually do in order to reconstruct proto-forms for ancestral languages that are not attested in written sources, is to investigate the correspondence patterns, and to try to find some neat explanation of how they could have evolved, given a set of proto-forms along with triggering contexts that explain individual changes in individual descendant languages.
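As a minimal sketch of the data structure involved, the following snippet collects, for each proto-sound, the set of reflex patterns it occurs with in the fictive table above. It only illustrates how patterns are read off aligned forms; it starts from known proto-forms, which is the reverse of the real task. The gap in the L₃ form (`f - a n`) is an assumed alignment that the table does not specify, and the function name `correspondence_patterns` is made up for this illustration.

```python
from collections import defaultdict

# Rows: proto-form, L1, L2, L3 (segments separated by spaces).
# ASSUMPTION: the gap "-" in the L3 form "f - a n" is our own alignment choice;
# the table itself lists the unaligned form "f a n".
ALIGNED = [
    ("p i a ŋ", "p i a ŋ", "p i u ŋ", "f - a n"),
    ("p a t", "p a t", "p a t", "p a t"),
    ("a p a ŋ", "a b a ŋ", "a p a ŋ", "a p a n"),
]


def correspondence_patterns(rows):
    """Map each proto-sound to the set of reflex patterns (L1, L2, L3) it shows."""
    patterns = defaultdict(set)
    for row in rows:
        segmented = [form.split() for form in row]
        proto, reflexes = segmented[0], segmented[1:]
        if any(len(reflex) != len(proto) for reflex in reflexes):
            raise ValueError(f"forms are not aligned: {row}")
        # Each alignment column yields one pattern for the proto-sound in that column.
        for i, sound in enumerate(proto):
            patterns[sound].add(tuple(reflex[i] for reflex in reflexes))
    return dict(patterns)


patterns = correspondence_patterns(ALIGNED)
for pattern in sorted(patterns["p"]):
    print("*p", *pattern)
```

The three patterns printed for *p are the ones listed in the first table; explaining why they differ (the conditioning context) is the part that still requires looking at the neighbouring columns.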

Computational strategies for phonological reconstruction

Not many attempts have been made so far to automate the task of reconstruction. The most prominent proposal in this direction has been made by Bouchard-Côté et al. (2013). Their strategy radically differs from the strategy outlined above, since they do not make use of correspondence patterns, but instead use a stochastic transducer and known cognate words in the descendant languages, along with a known phylogenetic tree that they traverse, inferring the most likely changes that could explain the observed distribution of cognate sets.

So far, this method has been tested only on Austronesian languages and their subgroups, where it performed particularly well (with error rates between 0.25 and 0.12, using edit distance as the evaluation measure). Since it is not available as a software package that can be conveniently used and tested on other language families, it is difficult to tell how well it would perform when being presented with more challenging test cases.

In a forthcoming paper, Gerhard Jäger illustrates how classical methods for ancestral state reconstruction applied to aligned cognate sets could be used for the same task (Jäger forthcoming). While Jäger's method is more in line with "linguistic thinking", in so far as he uses alignments, and applies ancestral state reconstructions to each column of the alignments, it does not make use of correspondence patterns, which would be the general way by which linguists would proceed. This may also explain the performance, which shows an error rate of 0.48 (also using edit distance for evaluation) — although this is also due to the fact that the method was tested on Romance languages and compared with Latin, which is believed to be older than the ancestor of all Romance languages.

Problems with computational strategies for phonological reconstruction

Both the method of Bouchard-Côté et al. and the approach of Jäger suffer from the problem of not being able to detect unobserved sounds in the data. Jäger side-steps this problem in theory, by using a shortened alphabet of only 40 characters, proposed by the ASJP project, which encoded more than half of the world's languages in this form. Bouchard-Côté's test data, Proto-Austronesian (and its subgroups), are fairly simple in this regard. It would therefore be interesting to see what would happen if the methods are tested with full phonetic (or phonological) representations of more challenging language families (for example, the Chinese dialects). While Jäger's approach assumes the independence of all alignment sites, Bouchard-Côté's stochastic transducers handle context on the level of bigrams (if I read their description properly). However, while bigrams can be seen as an improvement over ignoring conditioning context, they are not the way in which context is typically handled by linguists. As I tried to explain briefly in last month's post, context in historical linguistics calls for a handling of abstract contexts, for example, by treating sequences as layered entities, similar to music scores.

Apart from the handling of context and unobserved characters, the evaluation measure used in both approaches also seems problematic. Both approaches used the edit distance (Levenshtein 1965), which is equivalent to the Hamming distance (Hamming 1950) applied to optimally aligned sequences, with each gap counted as a mismatch. Given the problem of unobserved characters and the abstract nature of linguistic reconstruction systems, however, any measure that evaluates the surface similarity of sequences is essentially wrong.

To illustrate this point, consider the reconstruction of the Indo-European word for sheep by Kortlandt (2007), who gives *ʕʷ e u i s, as compared to Lühr (2008), who gives *h₂ ó w i s. The edit distance between both forms is the Hamming distance of their (trivial) alignment: in three of five cases they differ, which amounts to an unnormalized edit distance of three, and a normalized edit distance of 0.6. While this is pretty high, their systems are mostly compatible, since Kortlandt reconstructs *ʕʷ in most cases where Lühr writes *h₂. Therefore, the distance should be much lower; in fact, it should be zero, since both authors agree on the structure of the form they reconstruct in comparison with the structure of other words they reconstruct for Proto-Indo-European.
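The arithmetic in this example can be checked with a few lines of code. The sketch below computes the edit distance over segments (not characters) with the standard dynamic-programming recurrence. Normalizing by the length of the longer sequence is an assumption made here for illustration; it is one common choice, not necessarily the one used in the papers discussed.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """ASSUMPTION: normalize by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


kortlandt = ["ʕʷ", "e", "u", "i", "s"]
luehr = ["h₂", "ó", "w", "i", "s"]

print(edit_distance(kortlandt, luehr))             # 3
print(normalized_edit_distance(kortlandt, luehr))  # 0.6
```

Treating each sound as one list element matters here: comparing the raw strings character by character would count the diacritics and superscripts as separate symbols and give a different score.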

Since scholars do not necessarily select phonetic values in their reconstructions that derive directly from the descendant languages, and moreover they may differ often regarding the details of the phonetic values they propose, a valid evaluation of different reconstruction systems (including automatically derived ones) needs to compare the structure of the systems, not their substance (see List 2014: 48-50 for a discussion of structural and substantial differences between sequences).

Currently, there is (to my knowledge) no accepted solution for the comparison of structural differences among aligned sequences. Finding an adequate evaluation measure to compare reconstruction systems can therefore be seen as a sub-problem of the bigger problem of phonological reconstruction. To illustrate why it is so important to compare the structural information and not the pure substance, consider the three cases in which Jäger's reconstruction gives a v as opposed to a w in Latin (data here): while evaluating by the edit distance yields a score of 0.48, this score will drop to 0.47 when replacing the v instances with a w. Jäger's system is doing something right, but the edit distance cannot capture the fact that the system is deviating systematically from Latin, not randomly.

Initial ideas for improvement

There are many things that we can easily improve when working on automatic methods for phonological reconstruction.

As a first point, we should work on enhanced measures of evaluation, going beyond the edit distance as our main evaluation measure. In fact, this can be easily done. With B-Cubed scores (Amigó et al. 2009), we already have a straightforward measure to compare whether two reconstruction systems are structurally identical or similar. In order to apply these scores, the automatic reconstructions have to be aligned with the gold standard. If they are identical, although the symbols may differ, then the scores will indicate this. The problem of comparing reconstruction systems is, of course, more difficult, as we can face cases where systems are not structurally identical (structural identity meaning that you can directly replace any symbol a in system A by a symbol a' in system B to produce B from A, and vice versa), but B-Cubed scores would be a start.
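To make the idea concrete, here is a minimal sketch of B-Cubed precision and recall applied to two aligned reconstructions. It assumes that each aligned position is one item and that the symbol found at that position acts as its cluster label; this setup and the toy symbol sequences are illustrative assumptions, not data from any real reconstruction system.

```python
from collections import defaultdict


def bcubed(gold, test):
    """B-Cubed precision, recall and F-score for two equally long label sequences."""
    if len(gold) != len(test) or not gold:
        raise ValueError("sequences must be non-empty and aligned")

    def clusters(labels):
        # For every position, the set of positions sharing its label.
        groups = defaultdict(set)
        for index, label in enumerate(labels):
            groups[label].add(index)
        return [groups[label] for label in labels]

    gold_sets, test_sets = clusters(gold), clusters(test)
    n = len(gold)
    precision = sum(len(g & t) / len(t) for g, t in zip(gold_sets, test_sets)) / n
    recall = sum(len(g & t) / len(g) for g, t in zip(gold_sets, test_sets)) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical flattened alignments: one symbol per aligned position.
gold = list("abcab")
renamed = list("xyzxy")  # same structure, different symbols
merged = list("xxzxx")   # gold sounds a and b collapsed into one symbol

print(bcubed(gold, renamed))  # (1.0, 1.0, 1.0)
print(bcubed(gold, merged))
```

The renamed system gets a perfect score although it shares no symbol with the gold standard, whereas the merged system is penalized in precision only, because it conflates two sounds that the gold standard keeps apart.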

Furthermore, given that we lack test cases, we might want to work on semi-automatic instead of fully automatic methods, in the meantime. Given that we have a first method to infer sound correspondence patterns from aligned data (List 2019), we can infer all patterns and have linguists annotate each pattern by providing the proto-sound they think would fit best — we are testing this at the moment. Having created enough datasets in this form, we could then think of discussing concrete algorithms that would derive proto-forms from correspondence patterns, and use the semi-automatically created and manually corrected data as gold standard.

Last but not least, one straightforward way by which it is possible to formally create unknown sounds from known data, is to represent sounds as vectors of phonological features instead of bare symbols (e.g. representing p as voiceless bilabial plosive and b as voiced bilabial plosive). If we then compare alignment sites or correspondence patterns for the feature vectors, we could check to what degree standard algorithms for ancestral state reconstructions propose unattested sounds similar to the ones proposed by experts. In order to do this, we would need to encode our data in transparent transcription systems. This is not the case for most current datasets, but with the Cross-Linguistic Transcription Systems initiative we already have a first attempt to provide features for the majority of sounds that we find in the languages of the world (Anderson et al. forthcoming).
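A toy sketch shows how feature vectors can yield a sound that is not attested in any descendant. The three-feature table below is a hand-made assumption (it is not taken from the Cross-Linguistic Transcription Systems data), and the per-feature majority vote is only a crude stand-in for a proper ancestral state reconstruction algorithm.

```python
from collections import Counter

# ASSUMPTION: a tiny hand-made feature table (voicing, place, manner).
FEATURES = {
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
    "ɸ": ("voiceless", "bilabial", "fricative"),
    "t": ("voiceless", "alveolar", "plosive"),
}
SYMBOLS = {vector: symbol for symbol, vector in FEATURES.items()}


def majority_reconstruction(reflexes):
    """Pick the most frequent value of each feature across the reflexes.

    Ties are resolved in favour of the value seen first. Returns the feature
    vector and its symbol (None if the vector is not in the toy table).
    """
    vectors = [FEATURES[sound] for sound in reflexes]
    proto = tuple(
        Counter(values).most_common(1)[0][0] for values in zip(*vectors)
    )
    return proto, SYMBOLS.get(proto)


reflexes = ["b", "ɸ", "t"]
vector, symbol = majority_reconstruction(reflexes)
print(vector, symbol)
```

For the reflexes b, ɸ and t, the vote yields a voiceless bilabial plosive, i.e. p, although none of the three descendants has a p in this position. A vote over bare symbols could never propose it.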

Outlook

It is difficult to tell how hard the problem of phonological reconstruction is in the end. Semi-automatic solutions are already feasible now, and we are currently testing them on different (smaller) groups of phylogenetically related languages. One crucial step in the future is to code up enough data to allow for a rigorous testing of the few automatic solutions that have been proposed so far. We are working on that as well. But how to propose an evaluation system that rigorously tests not only to what degree a given reconstruction is identical with a given gold standard, but also structurally equivalent, remains one of the crucial open problems in this regard.

References
Amigó, Enrique and Gonzalo, Julio and Artiles, Javier and Verdejo, Felisa (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12.4: 461-486.

Anderson, Cormac, Tresoldi, Tiago, Chacon, Thiago Costa, Fehn, Anne-Maria, Walworth, Mary, Forkel, Robert and List, Johann-Mattis (forthcoming) A cross-linguistic Database of Phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting, pp. 1-27.

Blasi, Damián E. , Steven Moran, Scott R. Moisik, Paul Widmer, Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

Bouchard-Côté, Alexandre and Hall, David and Griffiths, Thomas L. and Klein, Dan (2013) Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11: 4224–4229.

Fox, Anthony (1995) Linguistic Reconstruction: An Introduction to Theory and Method. Oxford: Oxford University Press.

Hamming, Richard W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29.2: 147–160.

Hill, Nathan W. and List, Johann-Mattis (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.

Hoenigswald, Henry M. (1960) Phonetic similarity in internal reconstruction. Language 36.2: 191-192.

Hrozný, Bedřich (1915) Die Lösung des hethitischen Problems [The solution of the Hittite problem]. Mitteilungen der Deutschen Orient-Gesellschaft 56: 17–50.

Jäger, Gerhard (forthcoming) Computational historical linguistics. Theoretical Linguistics.

Kortlandt, Frederik (2007) For Bernard Comrie.

Levenshtein, V. I. (1965) Dvoičnye kody s ispravleniem vypadenij, vstavok i zameščenij simvolov [Binary codes with correction of deletions, insertions and replacements]. Doklady Akademij Nauk SSSR 163.4: 845-848.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Lühr, Rosemarie (2008) Von Berthold Delbrück bis Ferdinand Sommer: Die Herausbildung der Indogermanistik in Jena. Vortrag im Rahmen einer Ringvorlesung zur Geschichte der Altertumswissenschaften (09.01.2008, FSU-Jena).

Mann, Noel Walter (1998) A Phonological Reconstruction of Proto Northern Burmic. The University of Texas: Arlington.

Meier-Brügger, Michael (2002) Indogermanische Sprachwissenschaft. Berlin and New York: de Gruyter.

Saussure, Ferdinand de (1879) Mémoire sur le Système Primitif des Voyelles dans les Langues Indo- Européennes. Leipzig: Teubner.

Monday, April 29, 2019

Automatic sound law induction (Open problems in computational diversity linguistics 3)


The third problem in my list of ten open problems in computational diversity linguistics is a problem that has (to my knowledge) not even been considered as a true problem in computational historical linguistics, so far. Until now, it has been discussed by colleagues only indirectly. This problem, which I call the automatic induction of sound laws, can be described as follows:
Starting from a list of words in a proto-language and their reflexes in a descendant language, try to find the rules by which the ancestral language is converted into the descendant language.
Note that by "rules", in this context, I mean the classical notation that phonologists and historical linguists use in order to convert a source sound into a target sound in a specific environment (see Hall 2000: 73-75). If we consider the following ancestral and descendant words from a fictive language, we can easily find the laws by which the input should be converted into an output: an a should be changed to an e, an e should be changed to an i, and a k changes to s if followed by an i but not if followed by an a.

Input Output
papa pepe
mama meme
kaka keke
keke sisi
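
The four word pairs above can be reproduced with a small script. This is only a minimal sketch, assuming that each law can be written as a regular expression and that the laws apply one after the other; the particular order used here (e > i, then k > s / _i, then a > e) is my own assumption, chosen because it yields the attested outputs.

```python
import re

# Ordered rules as (pattern, replacement) pairs. The order is an assumption
# of this sketch, chosen so that the toy data come out right.
RULES = [
    (r"e", "i"),       # e > i
    (r"k(?=i)", "s"),  # k > s / _i  (k only changes before an i)
    (r"a", "e"),       # a > e
]


def apply_rules(word, rules=RULES):
    """Apply each rule to the whole word, one rule after the other."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


DATA = {"papa": "pepe", "mama": "meme", "kaka": "keke", "keke": "sisi"}

for source, target in DATA.items():
    print(source, "->", apply_rules(source), "(expected: " + target + ")")
```

Rule order matters here: if a > e were applied first, the subsequent e > i would turn papa into pipi instead of pepe.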

Short excursus on linguistic notation of sound laws

Based on the general idea of sound change (or sound laws in classical historical linguistics) as some kind of a function by which a source sound is taken as input and turned into a target sound as output, linguists use a specific notation system for sound laws. In the simplest form of the classical sound law notation, this process is described in the form s > t, where s is the source sound and t is the target sound. Since sound change often relies on the specific conditions of the surrounding context (i.e., it makes a difference whether some sound occurs at the beginning or the end of a word), context is added as a condition separated by a /, with an underscore _ referring to the sound in its original phonetic environment. Thus, the phenomenon of voiced stops becoming unvoiced at the end of words in German (e.g. d becoming t) can be written as d > t / _$, where $ denotes the end of a word.

One can see how close this notation comes to regular expressions, and according to many scholars, the rules by which languages change with respect to their sound systems do not exceed the complexity of regular grammars. Nevertheless, sound change notation does differ in the scope and the rules for annotation. One notable difference is the possibility to explain how full classes of sounds change in a specific environment. The German rule of devoicing, for example, generally affects all voiced stops at the end of a word. As a result, one could also annotate it as G > K / _$, where G would denote the sounds [b, d, g] and K their counterparts [p, t, k]. Although we could easily write a single rule for each of the three phenomena here, the rule by which the sounds are grouped into two classes of voiced sounds and their unvoiced counterparts is linguistically more interesting, since it reminds us that the change by which word-final consonants lose the feature of voice is a systemic change, and not a phenomenon applying to some random selection of sounds in a given language.
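
To make the link between this notation and regular expressions concrete, here is a minimal sketch that translates a rule such as G > K / _$ into a regular-expression substitution. The function name compile_rule, the CLASSES table, and the restriction to single-symbol sounds are assumptions of this illustration, not an established standard; the test words are rough, simplified transcriptions.

```python
import re

# Sound classes for the rule G > K / _$: voiced stops and their voiceless
# counterparts, listed in matching order.
CLASSES = {"G": "bdg", "K": "ptk"}


def compile_rule(rule, classes=CLASSES):
    """Turn a rule such as 'G > K / _$' into a function on words.

    Only a small subset of the notation is handled (an assumption of this
    sketch): one-character sounds, and an optional context in which '_'
    marks the position of the sound and '$' the end of the word.
    """
    change, _, context = rule.partition("/")
    source, target = (part.strip() for part in change.split(">"))
    sources = classes.get(source, source)
    targets = classes.get(target, target)
    if len(sources) != len(targets):
        raise ValueError("source and target classes must have the same size")
    mapping = dict(zip(sources, targets))

    def to_regex(part):
        # '$' keeps its meaning as end of word, everything else is literal
        return "".join("$" if ch == "$" else re.escape(ch) for ch in part)

    before, _, after = (context.strip() or "_").partition("_")
    pattern = ""
    if before:
        pattern += "(?<=" + to_regex(before) + ")"
    pattern += "[" + re.escape(sources) + "]"
    if after:
        pattern += "(?=" + to_regex(after) + ")"
    regex = re.compile(pattern)
    return lambda word: regex.sub(lambda m: mapping[m.group()], word)


devoice = compile_rule("G > K / _$")
for word in ["rad", "tag", "lob", "bad", "tage"]:
    print(word, "->", devoice(word))
```

The class-based rule covers all three stops at once, whereas compile_rule("d > t / _$") would only handle one of them, which mirrors the difference between a systemic statement and three unrelated ones.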

The problem of this systemic annotation, however, is that the grouping of sounds into classes that change in a similar form is often language-specific. As a result, scholars have to propose new groupings whenever they deal with another language. Since neither the notation of sound values nor the symbols used to group sounds into classes are standardized, it is extremely difficult to compare different proposals made in the literature. As a result, any attempt to solve the problem of automatic sound law induction in historical linguistics would at the same time have to make strict proposals for a standardization of sound law notations used in our field. Standardization can thus be seen as one of the first major obstacles of solving this problem, with the problem of accounting for systemic aspects of sound change as the second one.

Beyond regular expressions

Even if we put the problem of inconsistent annotation and systemic changes to one side, the analogy with regular expressions cannot properly handle all aspects of sound change. When looking at the change from Middle Chinese to Mandarin Chinese, for example, we find a complex pattern, by which originally voiced sounds, like [b, d, g, dz] (among others), were either devoiced, becoming [p, t, k, ts], or devoiced and aspirated, becoming [pʰ, tʰ, kʰ, tsʰ]. While it is not uncommon that one sound can change into two variants, depending on the context in which it occurs, the Mandarin sound change in this case is interesting because the context is not a neighboring sound, but is instead the Middle Chinese tone for the syllable in question: syllables with a flat tone (called píng tone in classical terminology) are nowadays voiceless and aspirated, and syllables with one of the three remaining Middle Chinese tones (called shǎng, qù, and rù) are nowadays plain voiceless (see List 2019: 157 for examples).

Since tone is a feature that applies to whole syllables, and not to single sound segments, we are dealing with so-called supra-segmental features here. As the term supra-segmental indicates, the features in question cannot be represented as a sequence of sounds, but need to be thought of as an additional layer, similar to other supra-segmental features in language, including stress, or juncture (indicating word or morpheme boundaries).

In contrast to sequences as we meet them in mathematics and informatics, linguistic sound sequences do not consist solely of letters drawn from an alphabet that is lined up in some unique order. They are instead often composed of multiple layers, which are in part hierarchically ordered. Words, morphemes, and phrases in linguistics are thus multi-layered constructs, which cannot be represented by one sequence alone, but could be more fruitfully thought of as similar to a partitura in music: the score of a piece of orchestral music, in which every voice of the orchestra is given its own sequence of sounds, and all different sequences are aligned with each other to form a whole.

The multi-layered character of sound sequences can be seen as similar to a partitura in musical notation.

This multi-layered character of sound sequences in spoken languages constitutes a third complication for the task of automatic sound law induction. Finding the individual laws that trigger the change of one stage of a language to a later stage cannot (always) be trivially reduced to the task of finding the finite state transducer that translates a set of input strings to a corresponding set of output strings. Since our input word forms in the proto-language are not simple strings, but rather an alignment of the different layers of a word form, a method to induce sound laws needs to be able to handle the multi-layered character of linguistic sequences.

Background for computational approaches to sound law induction

To my knowledge, the question of how to induce sound laws from data on proto- and descendant languages has barely been addressed. What comes closest to the problem are attempts to model sound change from known ancestral languages, such as Latin, to daughter languages, such as Spanish. This is reflected, for example, in the PHONO program (Hartmann 2003), where one can insert data for a proto-language along with a set of sound change rules (provided in a similar form to that mentioned above), which need to be given in a specific order, and are then checked to see whether they correctly predict the descendant forms.

For teaching purposes, I adapted a JavaScript version of a similar system, called the Sound Change Applier² (http://www.zompist.com/sca2.html) by Mark Rosenfelder from 2012, in which students could try to turn Old High German into modern German, by assigning simple rules as they are traditionally used to describe sound change processes in the linguistic literature. This adaptation (which can be found at http://dighl.github.io/sound_change/SoundChanger.html) compares the attested output with the output generated by a given set of rules, and provides some assessment of the general accuracy of the proposed set of rules. For example, when feeding the system the simple rule an > en /_#, which turns all final instances of -an into -en, 54 out of 517 Old High German words will yield the expected output in modern Standard German.
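
The kind of assessment described here can be sketched in a few lines. This is not the code of the Sound Change Applier or of my adaptation; it is a simplified illustration in which each rule is assumed to be a regular expression, and the four Old High German / Standard German pairs are typed in by hand for illustration rather than taken from the 517-word list.

```python
import re


def apply_rules(word, rules):
    """Apply an ordered list of (pattern, replacement) rules to a word."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


def evaluate(pairs, rules):
    """Return the number of correctly predicted forms and the total."""
    hits = sum(apply_rules(source, rules) == target for source, target in pairs)
    return hits, len(pairs)


# Illustrative (source, attested target) pairs, in plain orthography.
PAIRS = [
    ("geban", "geben"),
    ("singan", "singen"),
    ("neman", "nehmen"),
    ("faran", "fahren"),
]

# an > en / _#  written as a regular expression
RULES = [(r"an$", "en")]

hits, total = evaluate(PAIRS, RULES)
print(f"{hits} out of {total} forms are predicted correctly")
```

With this single rule, geban and singan yield the expected output, while neman and faran do not, since further rules would be needed to account for the remaining differences.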

The problem with these endeavors is, of course, the handling of exceptions, along with the comparison of different proposals. Since we can think of an infinite number of rules by which we could successfully turn a certain amount of Old High German strings into Standard German strings, we would need to ask ourselves how we could evaluate different proposals. That some kind of parsimony should play a role here is obvious. However, it is by no means clear (at least to me) how to evaluate the complexity of two systems, since the complexity would not only be reflected in the number of rules, but also in the initial grouping of sounds to classes, which is commonly used to account for systemic aspects of sound change. A system accounting for the problem of sound law induction would try to automate the task of finding the set of rules. The fact that it is difficult even to compare two or more proposals based on human assessment further illustrates why I think that the problem is not trivial.

Another class of approaches is that of word prediction experiments, such as the one by Ciobanu and Dinu (2018) (but see also Bodt and List 2019), in which training data consisting of the source and the target language are used to create a model, which is then successively applied to new data, in order to test how well this model predicts target words from the source words. Since the model itself is not reported in these experiments, but only used in the form of a black box to predict new words, the task cannot be considered to be the same as the task for sound law induction — which I propose as one of my ten challenges for computational historical linguistics — given that we are interested in a method that explicitly returns the model, in order to allow linguists to inspect it.

Problems with the current solutions to sound law induction

Given that no real solutions exist to the problem up to now, it seems somewhat useless to point to the problems of current solutions. What I want to mention in this context, however, are the problems of the solutions presented for word prediction experiments, be they fed by manual data on sound changes (Hartmann 2003), or based on inference procedures (Ciobanu and Dinu 2018, Dekker 2018). Manual solutions like PHONO suffer from the fact that they are tedious to apply, given that linguists have to present all sound changes in their data in an ordered fashion, with the program converting them step by step, always turning the whole input sequence into an intermediate output sequence. The word prediction approaches, in turn, suffer from limitations in feature design.

The method by Ciobanu and Dinu (2018), for example, is based on orthographic data alone, using the Needleman-Wunsch algorithm for sequence alignment (Needleman and Wunsch 1970); and the approach by Dekker (2018) only allows for the use of the limited alphabet of 40 symbols proposed by the ASJP project (Holman et al. 2008). In addition to the limited representation of linguistic sound sequences, be it by resorting to abstract orthography or to abstract reduced phonetic alphabets, none of the methods can handle those kinds of contexts which result from the multi-layered character of speech. Since we know well that these aspects are vital for certain phenomena of sound change, the methods exclude from the beginning an aspect that traditional historical linguists, who might be interested in an automatic solution to the sound law induction problem, would put at the top of their wish-list of what the algorithm should be able to handle.

Why is automatic sound law induction difficult?

The handling of supra-segmental contexts, mentioned above, is in my opinion also the reason why sound law induction is so difficult, not only for machines, but also for humans. I have so far mentioned three major problems as to why I think sound law induction is difficult. First, we face problems in defining the task properly in historical linguistics, due to a significant lack of standardization. This makes it difficult to decide on the exact output of a method for sound law induction. Second, we have problems in handling the systemic aspect of sound change properly. This does not apply only to automatic approaches, but also to the evaluation of different proposals for the same data proposed by humans. Third, the multi-layered character of speech requires an enhanced modeling of linguistic sequences, which cannot be modeled as mono-dimensional strings alone, but should rather be seen as alignments of different strings representing different layers (tonal layer, stress layer, sound layer, etc.).

How humans detect sound laws

There are only a few examples in the literature where scholars have tried to provide detailed lists of sound changes from proto- to descendant language (Baxter 1992, Newman and Raman 1999). Individual sound laws proposed in the literature are rarely even tested exhaustively on the data. As a result, it is difficult to assess what humans usually do in order to detect sound laws. What is clear is that historical linguists who have been working a lot on linguistic reconstruction tend to acquire a very good intuition that helps them to quickly check sound laws applied to word forms in their head, and to convert the output forms. This ability is developed in a learning-by-doing fashion, with no specific techniques ever being discussed in the classroom, which reflects the general tendency in historical linguistics to trust that students will learn how to become a good linguist from examples, sooner or later (Schwink 1994: 29). For this reason, it is difficult to take inspiration from current practice in historical linguistics, in order to develop computer-assisted approaches to solve this task.

Potential solutions to the problem

What can we do in order to address the problem of sound law induction in automatic frameworks in the future?

As a first step, we would have to standardize the notation system that we use to represent sound changes. This would need to come along with a standardized phonetic transcription system. Scholars often think that phonetic transcription is standardized in linguistics, specifically due to the use of the International Phonetic Alphabet. As our investigations into the actual application of the IPA have shown, however, the IPA cannot be seen as a standard, but rather as a set of recommendations that are often only loosely followed by linguists. First attempts to standardize phonetic transcription systems for the purpose of cross-linguistic applications have, however, been made, and will hopefully gain more acceptance in the future (Anderson et al. forthcoming, https://clts.clld.org).

As a second step, we should invest more time in investigating the systemic aspects of language change cross-linguistically. What I consider important in this context is the notion of distinctive features by which linguists try to group sounds into classes. Since feature systems proposed by linguists differ greatly, with some debate as to whether features are innate and the same for all languages, or instead language-specific (see Mielke 2008 for an overview on the problem), a first step would again consist of making the data comparable, rather than trying to decide in favour of one of the numerous proposals in the literature.

As a third step, we need to work on ways to account for the multi-layered aspect of sound sequences. Here, a first proposal, labelled "multi-tiered sequence representation", has already been made by myself (List and Chacon 2015), based on an idea that I had already used for the phonetic alignment algorithm proposed in my dissertation (List 2014), which itself goes back to the handling of hydrophilic sequences in ClustalW (Thompson et al. 1994). The idea is to define a sound sequence as a sequence of vectors, with each vector (called tier) representing one distinct aspect of the original word. As this representation allows for an extremely flexible modeling of context — which would just consist of an arbitrary number of vector dimensions that could account for aspects such as tone, stress, preceding or following sounds — this representation would allow us to treat words as sequences of sounds while at the same time accounting for their multi-layered structure. Although there remain many unsolved aspects on how to exploit this specific model for phonetic sequences to induce sound laws from ancestor-descendant data, I consider this to be a first step in the direction of a solution to the problem.

Multi-tiered sequence representation for a fictive word in Middle Chinese.
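
A rough sketch may help to show why such a representation is convenient. The tier names (segment, tone, previous, following), the boundary symbols, and the fictive syllables are assumptions made for this illustration only; they do not reproduce the exact format proposed in List and Chacon (2015). Each position in the word is stored together with its context, so that the tone-conditioned devoicing from Middle Chinese to Mandarin discussed above becomes a simple lookup in another tier.

```python
# Voiced sounds and their plain voiceless counterparts.
VOICED = {"b": "p", "d": "t", "g": "k", "dz": "ts"}


def to_tiers(segments, tone):
    """Align every segment with the syllable tone and its neighbours."""
    tiers = []
    for i, segment in enumerate(segments):
        tiers.append({
            "segment": segment,
            "tone": tone,  # supra-segmental: the same value for all positions
            "previous": segments[i - 1] if i > 0 else "^",
            "following": segments[i + 1] if i < len(segments) - 1 else "$",
        })
    return tiers


def devoice(tiers):
    """Voiced sounds become aspirated in ping-tone syllables, plain otherwise."""
    output = []
    for position in tiers:
        segment = position["segment"]
        if segment in VOICED:
            aspiration = "ʰ" if position["tone"] == "ping" else ""
            segment = VOICED[segment] + aspiration
        output.append(segment)
    return output


# Two fictive syllables that differ only in their tone.
print(devoice(to_tiers(["b", "a", "n"], "ping")))
print(devoice(to_tiers(["b", "a", "n"], "shang")))
```

Because every position carries all tiers, a rule can refer to the tone in exactly the same way as it refers to a neighbouring sound, which is what a plain string representation cannot offer.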

Outlook

Although it is not necessarily recognized by the field as a real problem of historical linguistics, I consider the problem of automatic sound law induction as a very important problem for our field. If we could infer sound laws from a set of proposed proto-forms and a set of descendant forms, then we could use them to test the quality of the proto-forms themselves, by inspecting the sound laws proposed by a given system. We could also compare sound laws across different language families to see whether we find cross-linguistic tendencies.

Having inferred enough cross-linguistic data on sound laws represented in unified models for sound law notation, we could also use the rules to search for cognate words that have so far been ignored. There is a lot to do, however, until we reach this point. Starting to think about automatic, and also manual, induction of sound laws as a specific task in computational historical linguistics can be seen as a first step in this direction.

References
Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp 1-27.

Baxter, William H. (1992) A handbook of Old Chinese Phonology. Berlin: de Gruyter.

Bodt, Timotheus A. and List, Johann-Mattis (2019) Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages. 1-22. [Preprint, under review, not peer-reviewed]

Ciobanu, Alina Maria and Dinu, Liviu P. (2018) Simulating language evolution: A tool for historical linguistics. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp 68-72.

Dekker, Peter (2018) Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks. University of Amsterdam: Amsterdam.

Hall, T. Alan (2000) Phonologie: Eine Einführung. Berlin and New York: de Gruyter.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp 606-609.

Holman, Eric W. and Wichmann, Søren and Brown, Cecil H. and Velupillai, Viveka and Müller, André and Bakker, Dik (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3: 116-121.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper, presented at the workshop Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE] (2015/09/04, Leiden, Societas Linguistica Europaea).

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Mielke, Jeff (2008) The Emergence of Distinctive Features. Oxford: Oxford University Press.

Needleman, Saul B. and Wunsch, Christian D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Newman, John and Raman, Anand V. (1999) Chinese Historical Phonology: Compendium of Beijing and Cantonese Pronunciations of Characters and their Derivations from Middle Chinese. München: LINCOM Europa.

Schwink, Frederick (1994) Linguistic Typology, Universality and the Realism of Reconstruction. Washington: Institute for the Study of Man.

Thompson, J. D. and Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.

Monday, February 25, 2019

Automatic morpheme segmentation (Open problems in computational diversity linguistics 1)


The first task on my list of 10 open problems in computational diversity linguistics deals with morphemes, that is, the minimal meaning-bearing parts in a language. A morpheme can be a word, but it does not have to be a word, since words may consist of more than one morpheme, and, depending on the language in question, may do so almost by default.

Examples of morphemes in English include clear-cut cases of compounding, where two words are joined to form a new word. Often, this is not even readily reflected in spelling, and, as a result, speakers may at times think that a word like "primary school" is not a single word, although it is easy to determine from its semantics that the word is indeed pointing to one uniform concept. Other examples include grammatical markers, such as the ending -s for most English plurals, or to mark the third person singular of verbs. When confronted with a word form like walks, linguists will analyze this word as consisting of two morphemes, illustrating it by adding a dash as a boundary marker: walk-s.

The problem

The task of automatic morpheme segmentation is thus a pretty straightforward one: given a list of words, potentially along with additional information, such as their meaning, or their frequency in the given language, try to identify all morpheme boundaries, and mark this by adding dash symbols where a boundary has been identified.

One may ask why automatic identification of morphemes should be a problem, and some people commenting on my presentation of the 10 open problems last month did ask this. The problem is not unrecognized in the field of Natural Language Processing, and solutions have been discussed from the 1950s onwards (Harris 1955, Benden 2005, Bordag 2008, Hammarström 2006, see also the overview by Goldsmith 2017).

Roughly speaking, all approaches build on statistics about n-grams, i.e., recurring symbol sequences of arbitrary length. Assuming that n-grams representing meaning-building units should be distributed more frequently across the lexicon of a language, they assemble these statistics from the data, trying to infer the ones which "matter". With Morfessor (Creutz and Lagus 2005), there is also a popular family of algorithms available in the form of a very stable and easy-to-use Python library (Virpioja et al. 2013). Applying and testing methods for automatic morpheme segmentation is thus very straightforward nowadays.
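
The basic ingredient of these approaches, counting how widely each n-gram is distributed across a word list, can be illustrated as follows. This is only a toy sketch of the counting step, not the procedure used by Morfessor or by any of the cited methods; the function name and the six English words are chosen for illustration.

```python
from collections import Counter


def ngram_counts(words, min_n=1, max_n=4):
    """Count in how many words each substring of length min_n..max_n occurs."""
    counts = Counter()
    for word in words:
        seen = set()  # count every n-gram only once per word
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                seen.add(word[i:i + n])
        counts.update(seen)
    return counts


WORDS = ["walks", "walked", "walking", "talks", "talked", "talking"]
counts = ngram_counts(WORDS, min_n=2, max_n=4)

for ngram in ["walk", "talk", "ing", "ed", "alk"]:
    print(ngram, counts[ngram])
```

In this tiny list, alk occurs in more words than any of the true morphemes, which already hints at why raw distribution alone is not enough to decide which n-grams "matter".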

The issue with all of these approaches and ideas is that they require a very large amount of data for training, while our actual datasets are small and sparse, by nature. As a result, all currently available algorithms fail graciously when it comes to determining the morphemes in datasets of fewer than 1,000 words.

Interestingly, even when having been trained on large datasets, the algorithms still commit surprising errors, as can be easily seen when testing the online demo of the Morfessor software for German (https://asr.aalto.fi/morfessordemo/). When testing words like auftürmen "pile up", for example, the algorithm yields the segmentation auf-türme-n, which is probably understandable from the fact that the word Türme "towers" is quite frequent in the German lexicon, thus confusing the algorithm; but for a German speaker, who knows that verbs end in -en in their infinitive, it is clear that auftürmen can only be segmented as auf-türm-en.

If I understand the information on the website correctly, the Morfessor algorithm offered online was trained with more than 1 million different word forms in German. Given that in our linguistic approaches we usually have 1,000 words, if not fewer, per language at our disposal, it is clear that the algorithms won't provide help in finding the morphemes in our data.

To illustrate this, I ran a small test on the Morfessor software, using two datasets for training, one big dataset with about 50,000 words from Baayen et al. (1995), and one smaller dataset of about 600 words which I used as a cognate detection benchmark when writing my dissertation (List 2014). I then used these two datasets to train the Morfessor software and then applied the trained models to segment a list of 10 German words (see the GitHub Gist here).

The results for the two models (small data and big data) as well as the segmentations proposed by the online application (online) are given in the table below (with my own judgments on morphemes given in the column word).

| Number | Word | Small data | Big data | Online |
|---|---|---|---|---|
| 1 | hand | hand | hand | hand |
| 2 | hand-schuh | hand-sch-uh | hand-schuh | hand-schuh |
| 3 | hantel | h-a-n-t-el | hant-el | han-tel |
| 4 | hunger | h-u-n-g-er | hunger | hunger |
| 5 | lauf-en | l-a-u-f-en | laufen | lauf-en |
| 6 | geh-en | gehen | gehen | gehen |
| 7 | lieg-en | l-i-e-g-en | liegen | liegen |
| 8 | schlaf-en | sch-lafen | schlafen | schlaf-en |
| 9 | kind-er-arzt | kind-er-a-r-z-t | kind-er-arzt | kinder-arzt |
| 10 | grund-schule | g-rund-sch-u-l-e | grund-schule | grundschule |

What can be seen clearly from the table, where all forms deviating from my analysis are marked in red font, is that none of the models does a convincing job in segmenting my ten test words. More importantly, however, we can clearly see that the algorithm's problems increase drastically when dealing with small training data: the segmentations proposed in the Small data column are clearly the worst, splitting words in a seemingly random fashion into letters.
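
The comparison in the table can also be expressed in numbers. The following sketch simply re-types the table and counts exact matches with my own segmentation; the additional boundary precision and recall scores are a common way of scoring segmentations that I add here for illustration, and are not part of the original test.

```python
GOLD = ["hand", "hand-schuh", "hantel", "hunger", "lauf-en", "geh-en",
        "lieg-en", "schlaf-en", "kind-er-arzt", "grund-schule"]

MODELS = {
    "Small data": ["hand", "hand-sch-uh", "h-a-n-t-el", "h-u-n-g-er",
                   "l-a-u-f-en", "gehen", "l-i-e-g-en", "sch-lafen",
                   "kind-er-a-r-z-t", "g-rund-sch-u-l-e"],
    "Big data": ["hand", "hand-schuh", "hant-el", "hunger", "laufen", "gehen",
                 "liegen", "schlafen", "kind-er-arzt", "grund-schule"],
    "Online": ["hand", "hand-schuh", "han-tel", "hunger", "lauf-en", "gehen",
               "liegen", "schlaf-en", "kinder-arzt", "grundschule"],
}


def exact_matches(gold, predicted):
    """Number of words whose segmentation is identical to the gold standard."""
    return sum(g == p for g, p in zip(gold, predicted))


def boundaries(segmented):
    """Positions (in the unsegmented word) at which a boundary is placed."""
    positions, offset = set(), 0
    for morpheme in segmented.split("-")[:-1]:
        offset += len(morpheme)
        positions.add(offset)
    return positions


def boundary_scores(gold, predicted):
    """Precision and recall of the predicted morpheme boundaries."""
    true_pos = n_pred = n_gold = 0
    for g, p in zip(gold, predicted):
        gold_b, pred_b = boundaries(g), boundaries(p)
        true_pos += len(gold_b & pred_b)
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


for name, predicted in MODELS.items():
    precision, recall = boundary_scores(GOLD, predicted)
    print(name, exact_matches(GOLD, predicted),
          round(precision, 2), round(recall, 2))
```

Counting exact matches gives 1 out of 10 for the small data model, and 5 out of 10 each for the big data model and the online application.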

What is interesting in this context is that trained linguists would rarely fail at this task, even when all they were given is the small data list for training. That they do not fail is shown by the numerous studies where linguistic fieldworkers have investigated so far under-investigated languages, and quickly figured out how the morphology works.

Why is it so difficult to find morpheme boundaries?

What makes the detection of morpheme boundaries so difficult, also for humans, is that they are inherently ambiguous. A final -s can mark the plural in German, especially on borrowings, as in Job-s, but it can likewise mark a short variant of es "it", where the vowel is deleted, as in ist's "it's", and in many other cases, it can just mark nothing, but instead be part of a larger morpheme, like Haus "house". Whether or not a certain substring of sounds in a language can function as a morpheme depends on the meaning of the word, not on the substring itself. We can, once more, see one of the great differences between sequences in biology and sequences in linguistics here: linguistic sequences derive their "function" (i.e., their meaning) from the context in which they are used, not from their structure alone.

If speakers are no longer able to clearly understand the morphological structure of a given word, they may even start to change it, in order to make it more "transparent" in its denotation. Examples of this are the numerous cases of folk etymology, where speakers re-interpret the morphemes in a word, with English ham-burger as a prominent example, since the word originally seems to derive from the city Hamburg, which has nothing to do with ham.

How do humans find morphemes?
 
The reason why human linguists can find morphemes in sparse data relatively easily, while machines cannot, is still not entirely clear to me (i.e., why humans are good at pattern recognition and machines are not). However, I do have some basic ideas about why humans largely outperform machines when it comes to morpheme segmentation; and I think that future approaches that try to take these ideas into account might drastically improve the performance of automatic morpheme segmentation methods.

As a first point, given the importance of meaning in order to determine morphemic structure, it seems almost absurd to me to try to identify morphemes in a given language corpus based on a pure analysis of the sequences, without taking their meaning into account.  If we are confronted with two words like Spanish hermano "brother" and hermana "sister", it is clear — if we know what they mean — that the -o vs. -a most likely denotes a distinction of gender. While the machines compare potential similarities inside the words independent of semantics, humans will always start from those pairs where they think that they could expect to find interesting alternations. As long as the meanings are supplied, a human linguist — even when not familiar with a given language — can easily propose a more or less convincing segmentation of a list of only 500 words.
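
The strategy of starting from semantically related pairs can be sketched in a few lines. The small lexicon, the list of related meanings, and the function name are hypothetical and only serve to illustrate the idea; how to decide which meanings are "related" in the first place is exactly the part that remains open.

```python
import os

# Hypothetical toy lexicon (meaning -> word form) and a hand-made list of
# meaning pairs in which a human would expect to find an alternation.
LEXICON = {"brother": "hermano", "sister": "hermana",
           "son": "hijo", "daughter": "hija"}
MEANING_PAIRS = [("brother", "sister"), ("son", "daughter")]


def split_on_shared_stem(word_a, word_b):
    """Split two words into their common beginning and the differing rest."""
    stem = os.path.commonprefix([word_a, word_b])
    return stem, word_a[len(stem):], word_b[len(stem):]


for meaning_a, meaning_b in MEANING_PAIRS:
    stem, rest_a, rest_b = split_on_shared_stem(LEXICON[meaning_a],
                                                LEXICON[meaning_b])
    print(f"{meaning_a}/{meaning_b}: {stem}-{rest_a} vs. {stem}-{rest_b}")
```

Both pairs point to the same alternation of -o and -a, which is the kind of recurring pattern that makes a proposed boundary convincing even in very small datasets.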

A second point that is disregarded in current automatic approaches is the fact that morphological structures vary drastically among languages. In Chinese and many South-East Asian languages, for example, it is almost a rule that every syllable represents one morpheme (with minimal exceptions being attested and discussed in the literature). Since syllables are again easy to find in these languages, because words can often only end in a specific number of sounds, an algorithm to detect morphemes in those languages would not need any n-gram statistics, but just a theory on syllable structures. Instead of global strategies, we may rather have to opt for local strategies of morpheme segmentation, in which we identify different types of languages for which a given algorithm seems suitable.
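
As a rough illustration of such a syllable-based strategy, the following sketch splits Mandarin words written in toneless Pinyin into syllables, using nothing but a simplified statement of syllable structure: optional initial consonants, one or more vowel letters, and an optional final n or ng. The regular expression is my own simplification and ignores known complications (syllables ending in r, or ambiguous cases that Pinyin resolves with an apostrophe).

```python
import re

VOWELS = "aeiouü"

# Simplified syllable: consonants, then vowels, then an optional nasal coda.
# A single n only counts as coda if no vowel follows it.
SYLLABLE = re.compile(
    rf"[^{VOWELS}]*[{VOWELS}]+(?:ng|n(?![{VOWELS}]))?"
)


def syllabify(word):
    """Split a toneless Pinyin word into syllables (one morpheme each)."""
    return SYLLABLE.findall(word)


for word in ["dianhua", "pengyou", "zhongguo", "hanyu"]:
    print(word, "->", "-".join(syllabify(word)))
```

No frequency statistics are involved here: the segmentation follows from the assumed syllable template alone, which is why the approach would work on very small word lists, but only for languages of this morphological type.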

This brings us to a third point. A peculiarity of linguistic sequences in spoken languages is that they are built by specific phonotactic rules that govern their overall structure. Whether or not a language tolerates more than three consonants at the beginning of a word depends on its phonotactics, its set of rules by which the inventory of sounds is combined to form morphemes and words. Phonotactics itself can also give hints on morpheme boundaries, since it may prohibit combinations of sounds within morphemes which can occur when morphemes are joined to form words. German Ur-instinkt "basic instinct", for example, is pronounced with a glottal stop after the Ur-, which can only occur at the beginning of German words and morphemes, thus marking the word clearly as a compound (otherwise the word could be parsed as Urin-stinkt "urine smells").

A fourth point that is also generally disregarded in current approaches to automatic morpheme segmentation is that of cross-linguistic evidence. In many cases, the speakers of a given language may themselves no longer be aware of the original morphological segmentation of some of their words, while the comparison with closely related languages can still reveal it. If we have a potentially multi-morphemic word in one language, for example, and only one of the two potential morphemes is reflected as a normal word in the other language, this is clear evidence that the potentially multi-morphemic word does, indeed, consist of multiple morphemes.

Suggestions

Linguists regularly use multiple types of evidence when trying to understand the morphological composition of the words in a given language. If we want to advance the field of automatic morpheme segmentation, it seems to me indispensable that we give up the idea of detecting the morphology of a language just by looking at the distribution of letters across word forms. Instead, we should make use of semantic, phonotactic, and comparative information. We should further give up the idea of designing universal morpheme segmentation algorithms, but rather study which approach works best on which morphological type. How these aspects can be combined in a unified framework, however, is still not entirely clear to me; and this is also the reason why I list automatic morpheme segmentation as the first of my ten open problems in computational diversity linguistics.

Even more important than the strategies for the solutions of the problem, however, is that we start to work on extensive datasets for testing and training of new algorithms that seek to identify morpheme boundaries on sparse data. As of now, no such datasets exist. Approaches like Morfessor were designed to identify morpheme boundaries in written languages, and they barely work with phonetic transcriptions. But if we had datasets for testing and training available, be it only some 20 or 40 languages from different language families, manually annotated by experts and segmented with respect to both the phonetics and the morphemes, this would allow us to investigate both existing and new approaches much more profoundly, and I expect it could give a real boost to our discipline and greatly help us to develop advanced solutions for the problem.

References

Baayen, R. H. and Piepenbrock, R. and Gulikers, L. (eds.) (1995) The CELEX Lexical Database. Version 2. Philadelphia.

Benden, Christoph (2005) Automated detection of morphemes using distributional measurements. In: Claus Weihs and Wolfgang Gaul (eds.): Classification -- the Ubiquitous Challenge. Berlin and Heidelberg:Springer. pp 490-497.

Bordag, Stefan (2008) Unsupervised and knowledge-free morpheme segmentation and analysis. In: Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras and Diana Santos (eds.): Advances in Multilingual and Multimodal Information Retrieval. Berlin and Heidelberg:Springer, pp 881-891.

Creutz, M. and Lagus, K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report. Helsinki University of Technology.

Goldsmith, John A. and Lee, Jackson L. and Xanthos, Aris (2017) Computational learning of morphology. Annual Review of Linguistics 3.1: 85-106.

Hammarström, Harald (2006) A Naive Theory of Affixation and an Algorithm for Extraction. In: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006 pp. 79-88.

Harris, Zellig S. (1955) From phoneme to morpheme. Language 31.2: 190-222.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf:Düsseldorf University Press.

Virpioja, Sami, Smit, Peter, Grönroos, Stig-Arne and Kurimo, Mikko (2013) Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Helsinki:Aalto University.

Monday, January 28, 2019

Future challenges for computational diversity linguistics


At the end of each year, many people start to think of the things they want to do during the next year. While not being very extreme in this respect, I tend to do the same thing at times; and last year, inspired by a discussion I had with students in Buenos Aires, I started thinking of the biggest challenges that I see for the field of computational diversity linguistics (i.e. historical and typological language comparison carried out in a formal or quantitative way). I thus sat down before my holidays started, and made a short list of tasks that are challenging, but which I think can still be tackled in the nearer or further future.


The idea of making such a list of questions is not new to mathematicians, who have their well-known Hilbert Problems, proposed by David Hilbert in 1900. In linguistics, I first heard about them from Russell Gray, who was himself introduced to the idea by the linguist Martin Hilpert, who gave a talk in 2014 called "Challenges for 21st century linguistics" (online available here). Russell Gray has since then emphasized the importance of proposing "Hilbert" questions for the fields of linguistic and cultural evolution, and has also presented his own big challenges in the past.

As somebody who considers himself to be a methodologist, I'm not going to frame questions as "big" or challenging as Russell Gray or Martin Hilpert did. Instead, the problems I would like to see tackled are purely computational challenges that I think can be solved by algorithms or workflows. This does not mean, of course, that these problems are not challenging in the big sense, and it also does not automatically mean that they can be solved in the near future. But given that my own work, and that of colleagues in the field of computational and computer-assisted language comparison, progresses steadily, at times even at an impressive pace, I have some trust that these problems will indeed be solvable within the next 5-10 years.

The problems I came up with are listed below:
  1. automatic morpheme segmentation
  2. automatic sound law induction
  3. automatic borrowing detection
  4. automatic phonological reconstruction
  5. simulating lexical change
  6. simulating sound change
  7. statistical proof of language relatedness
  8. typology of semantic change
  9. typology of semantic promiscuity
  10. typology of sound change.
You can see that the way I worded the problems divides them into four major categories. The first four problems point to questions of inference, such as the inference of morpheme boundaries in a mono-lingual wordlist (# 1), the inference of laws by which sounds are changed from a parent to a daughter language (# 2), the inference of borrowings in multilingual datasets (# 3), and the inference of so far unattested proto-forms (# 4). The fifth and the sixth problems deal with simulation, and I distinguish the simulation of lexical change (# 5) and the simulation of sound change (# 6) as two separate tasks, although they could of course be combined later. The seventh problem is a bit different from the others, as it deals with the question of genealogical relationship among languages, and how we can test it statistically (see Baxter and Manaster Ramer 2000 for an overview).

The last three problems deal with general patterns that can, or could be, observed for change in semantics and phonology. Semantic change (# 8) shows highly interesting cross-linguistic tendencies that are not yet fully understood (see Wilkins 1996 for an early discussion). Furthermore (# 9), words are often re-used across the lexicon of a given language, and it is an open question whether there are striking preferences for building many new words from just a few basic words denoting "promiscuitive" concepts (like "fall" or "stand"; see Geisler 2018 and a recent blogpost by Schweikhard 2018 for an overview). Sound change (# 10) also follows cross-linguistic regularities, but the nature of these regularities is still not very well understood (see Kümmel 2008 for a pilot study on the topic).

Discussing each task would be way too long for a single post, given that I have reflected on these problems a lot during the last years, and may at times even have some concrete ideas on how the problems could be tackled.

So, based on my idea of making plans for 2019, I decided that I would try to discuss each of these ten problems in greater detail in separate blog posts throughout 2019. This post thus serves merely to introduce the problems. Over the next ten months, I will try to devote one blog post to each problem; and then I will discuss all of the problems again at the end of this year.

I do not yet know how far this will go, and whether I will have the discipline to write up a post on each topic within the coming months, especially since it may also be possible that I end up discarding problems from my list. However, I feel that this could turn into a nice road map for my research in 2019. If I have to devote at least half a day each month over the next year to think about problems in computational historical and typological language comparison, it might not only help myself but also some colleagues to come up with a solution to some of the problems.

References

Baxter, William H. and Manaster Ramer, Alexis (2000) Beyond lumping and splitting: probabilistic issues in historical linguistics. In: Renfrew, Colin and McMahon, April and Trask, Larry (eds.) Time Depth in Historical Linguistics. Cambridge: McDonald Institute for Archaeological Research, 167-188.

Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter. Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen:Stauffenburg, 131-142.

Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11.19.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The comparative method reviewed. Regularity and irregularity in language change. New York:Oxford University Press, 264-304.

Monday, November 26, 2018

How languages lose body parts: once more about structural data in historical linguistics


This is a joint post by Guido Grimm and Johann-Mattis List.

Mattis’ last two blog posts dealt with problems of what linguists call "structural data". Here we discuss what this means for the inference of relationships between languages.

A closer look at structural data: the questionnaire issue

As pointed out before, what is called structural data in comparative linguistics is a very diverse mix of data solely unified by the idea of having some kind of questionnaire that a linguist may use when going into the field and trying to describe a certain language. These questionnaires are a bit different from the traditional concept lists usually used for the purpose of historical language comparison (see the collection of different lists in the Concepticon project by List et al. 2016). The main difference is that they are based on an imaginary question that a field worker asks an informant (which could as well be a written grammar of the language under question). Since questions can be asked in many different ways, while concepts in historical language comparison are usually restricted to the so-called "basic vocabulary", the diversity of structural datasets is much greater than the diversity we encounter when comparing questionnaires based on concept lists.

When analyzing these data, we deal with characters of very different nature, and likely different evolutionary pathways or histories. A biological analogy would probably be (true) total evidence data sets that combine genetic data from: genes/genomes with different inheritance pathways (paternally, maternally, biparentally; basic information level), morphological-anatomical data (visible form, phenotypic), palaeontological data (historical evidence), ontogenetic (life-history stages, developmental features), and biochemical data (expression level). The only difference is probably that the linguistic characters’ histories may be more complex. [Side-remark: ‘total evidence’ datasets found in the biological literature are typically just combinations of genetic and morphological data, allowing for the inclusion of extinct/fossil taxa.]

To give a specific example, let's have a look at the Chinese dataset by Szeto et al. (2018), mentioned in Mattis' blogpost from September. This dataset is now accessible as a GitHub repository (https://github.com/cldf-datasets/szetosinitic). Mattis added some information regarding the different features of the questionnaire. We list these features in slightly abbreviated form in the table below, adding rough categorizations by Mattis in the Comment column.

ID | Description | Comment
p-1 | 5 or more tone categories | phonological / diachronic
p-2 | Retroflex fricative initials | phonological / diachronic
p-3 | Bilabial nasal coda | phonological / diachronic
p-4 | Stop codas | phonological / diachronic
p-5 | Monosyllabic word for 'snake' | lexical
p-6 | Differentiation between 'hand' and 'arm' | lexical / semantic
p-7 | Differentiation between 'defecate' and 'urinate' | lexical / semantic
p-8 | Differentiation between 'eat' and 'drink' | lexical / semantic
p-9 | Semantically void suffix in 'table' | lexical
p-10 | Different classifiers for humans and pigs | lexical / semantic
p-11 | [CLF N] constructions in subject position with definite reference | syntactic
p-12 | Reduplicated monosyllabic nouns | morphological
p-13 | Post-verbal modal auxiliary developed from 'ge/acquire' | syntactic / diachronic
p-14 | Modified-modifier order in animal gender marking | morphological / syntactic
p-15 | Post-verbal adverb meaning 'first' | lexical / syntactic
p-16 | [V DO IO] order in double object dative constructions | syntactic
p-17 | 'Give' as a disposal marker | syntactic / diachronic
p-18 | 'Give' as a passive marker | syntactic / diachronic
p-19 | 'Go' as a post-VP associated motion marker | syntactic / diachronic
p-20 | Marker-Standard-Adjective order in comparatives | syntactic
p-21 | case system | morphological / syntactic

Mattis has tried to characterize the features, i.e. the characters of the matrix, by generalizing linguistic categories: "phonological", pointing roughly to questions about pronunciation (the biological equivalent would be phenotypic traits in morphology or anatomy); "lexical", pointing to the words in the lexicon (this would be the DNA of a language); "morphological", pointing to the ways in which words are constructed; and "syntactic", pointing to the ways in which words are combined to form sentences. In combination, "morphological" and "syntactic" are equivalent to 'meta-level' biological traits, such as development-related features, ontogenetic evidence, and biochemical composition, i.e. the ways in which the genetic code is expressed or used in a living organism in adaptation to the environment.

Mattis also flagged some characters as "diachronic", to mark whether the respective feature was selected by the authors due to their independent knowledge about the history of the Chinese dialects. This is something rarely possible in biology, but imagine that we could go back in time to literally observe the evolution of a lineage over a given time-period, and code this observed evolution as traits. Note that this is not entirely science-fiction — there are two examples where we can observe directly pathways of biological evolution: mutation patterns in viruses, and horizontal modification of marine morphs in high-resolution sediment cores.

While one can discuss to what degree a certain feature should belong to this category, it is rather obvious that all phonological features are diachronic, because they name distinctions that reflect well-known processes of sound change, which happened in a couple of Chinese dialects and have been proposed in the past by dialectologists in order to classify the Chinese dialects historically.

For example, consider feature p-3 of the questionnaire: Does a given dialect have a syllable that ends in [-m]? From the history of the Chinese dialects we know that [-m] was present in Middle Chinese, but later merged with [-n] and [-ŋ] in many varieties. Given that we know that this happened, and that people have used this to mark a split, especially between the "innovative" dialects in the North and those in the South, it is clear that this feature bears explicit historical information. The same holds for all phonological features that we find in the data: p-1, the number of different tones in the dialects, again roughly reflects the differences between languages in the North and in the South (the North having lost many tones); p-2 reflects the retention or specific development of retroflex sounds (similar to sh in English as opposed to s), mostly in the North; and p-4 reflects whether a variety has syllables that can end in [-p, -t, -k], again a feature characteristic of the more "conservative" varieties in the South of China.

Figure 1: Overlap of features in Szeto et al.'s (2018) structural feature collection of Chinese dialects

Four lexical features have further been flagged as "semantic"; we query here existing or missing distinctions of concepts. People who learned, for example, Russian or certain German dialects know that it is rather common to have a single word for what other languages call "arm" and "hand" (see the respective entry in the CLICS database) or "foot" and "leg".

This diverse feature collection is coded as binary characters, reflected by presence/absence, or a yes/no answer to the question in the questionnaire. The choice of features is very selective. A biological analogy would be a matrix collecting incompatible splits of paternal (molecular) genealogies, along with a few prominent phenotypical traits (reflecting major evolutionary steps), and some traits that we expect to be primarily triggered not by genetics (inheritance) but by expression or adaptation to the environment. Biologists would not phylogenetically analyze such diverse and complex, potentially selection-biased data (although it could be very interesting), but linguists do.

In this context, it is remarkable, but also typical for this kind of data, that the 21-character feature collection by Szeto et al. (2018) has no feature in common with the collection by Norman (2003), a 15-character matrix, which we also converted to our Cross-Linguistic Data Formats (see Forkel et al. 2018) in order to increase the data comparability.


Figure 2: A Neighbor-net splits graph of the structural data by Szeto et al. (2018).
The typification, coded as a binary matrix to infer the Neighbor-net splits graph in Figure 2, demonstrates some basic characteristics of such 2-dimensional graphs. Note that four of the 'characters' (typification categories) correlate with an edge(-bundle) in the network, separating the 'taxa' (the queried features). All "semantic" taxa are also "lexical", but "lexical" is more comprehensive; hence, "semantic" is placed as a 'descendant' of "lexical" (Neighbor-nets can visualize ancestor-descendant relationships to some degree). "Morphological" taxa are either just "morphological" or also "syntactic", hence the pronounced box.

For "diachronic" and "syntactic", we have no corresponding edge(-bundle), because one taxon is also "lexical", but the others are "diachronic" and "syntactic" — this is a conflict that cannot be resolved with two dimensions. To visualize all the resultant 'taxon' splits, called also taxon bipartitions, we would need a third dimension. Lacking a third dimension, the Neighbor-net prioritizes keeping most "syntactic" together, because the "diachronic-syntactic" are closer to "syntactic" (max. 1 'character' difference) than to "diachronic-phonological" (2 character difference). The "syntactic-lexical" has to be placed apart because it is equally close to "lexical" and "syntactic" 'taxa', but differs much from "morphological-syntactic" or "diachronic-syntactic", the closest two relatives of "syntactic"-only 'taxa'. It is resolved closer to the centre of the graph, because it is more closely related to the other "syntactic" taxa than to the rest of the "lexical" taxa. This is also the reason why the "syntactic"-only taxa have to be placed farther out: "Diachronic-phonological" and "syntactic-lexical" are closer to the other endpoints, and the distance of "syntactic"-only to "diachronic-phonological", "lexical" and "morphological" should be as large as possible.
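The character differences quoted above can be checked by hand. As a small sketch, the following codes each type of 'taxon' as the set of categories it belongs to (category names as in the Comment column of the table) and counts the categories in which two types differ. The dictionary `TYPES` and the helper `char_diff` are illustrative names, not part of any published workflow.

```python
# Each 'taxon' type is the set of typification categories scored as present.
TYPES = {
    "syntactic": {"syntactic"},
    "lexical": {"lexical"},
    "diachronic-syntactic": {"syntactic", "diachronic"},
    "diachronic-phonological": {"phonological", "diachronic"},
    "syntactic-lexical": {"syntactic", "lexical"},
    "morphological-syntactic": {"morphological", "syntactic"},
}


def char_diff(a, b):
    """Number of binary 'characters' (categories) in which two types differ."""
    return len(TYPES[a] ^ TYPES[b])  # symmetric difference of the two sets


for pair in [
    ("diachronic-syntactic", "syntactic"),
    ("diachronic-syntactic", "diachronic-phonological"),
    ("syntactic-lexical", "lexical"),
    ("syntactic-lexical", "syntactic"),
    ("syntactic-lexical", "morphological-syntactic"),
]:
    print(pair, char_diff(*pair))
```

The counts reproduce the reasoning in the text: "diachronic-syntactic" is one difference away from "syntactic" but two away from "diachronic-phonological", while "syntactic-lexical" is equally close to "lexical" and "syntactic".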

Losing body parts: How data coding masks underlying processes

Most typologists collecting structural data are not per se interested in phylogenies. Yet the fact that scholars deliberately collect historical (diachronic) features shows that, even if they would not necessarily admit it, they have a genuine interest in uncovering the history of the languages under question; or at least, in how closely related the languages (or here: dialects) are. But this requires understanding the characters we analyze, the collected "structural data".

In evolutionary biology, the key question people (should) ask when trying to select characters is how their change can be modeled on a tree or a network. What processes could be expected that shaped the data? What is behind the diversity? Is similarity or dissimilarity instigated by:
  • [A] inheritance, i.e. passed from an ancestor to all / some of its descendants,
  • [B] random mutation and/or sorting, i.e. the product of a stochastic, evolutionary neutral process,
  • [C] non-random mutation, i.e. processes that recur frequently, may be beneficial and positively (gain, or negatively: loss) selected for, or
  • [D] secondary contact, mixing of lineages by hybridization (symmetric mixing) and introgression (asymmetric mixing)?
[A]–[C] are vertical processes following a tree, even if the tree does not necessarily need to be the same; [D] is (mostly) horizontal and can only be modeled using a network. For each of the above, we can find an analogy in the evolution of languages.

In addition, process [C], and to a lesser extent [D], can lead to what biologists call 'homoplasy', meaning that the same feature is observed in two unrelated or distantly related taxa. In the context of phylogenetic inferences, homoplasies inflict tree-incompatible signals, seemingly reticulate patterns originating from a tree-like evolution. Structural (or other) linguistic data and phenotypical biological data have a lot in common: complex processes are boiled down to mere absence or presence of features (or traits, as they are called in biology).

Figure 3: Basic evolutionary processes we need to consider when looking at linguistic data, or at biological traits, when we replace simplification by adaptive evolution (positively selected traits).

If we check the features in our table above, and ask: to which degree can they be used to model these processes (see also David's last post on illogic in phylogenetics), e.g. simply distinguish between similarity by chance, relatedness, or secondary contact (mixing), we can easily see that they are by no means optimal for evolutionary investigations. This is not necessarily because of the processes they involve, but more importantly because of the data sampling, which makes modeling almost impossible, with each character needing its own model.

As an example, take the feature p-6 in our table. Whether or not a language makes a distinction between "arm" and "hand" does not seem to follow specific geographic or genealogical patterns. The following figure shows a plot from the CLICS database (List et al. 2018), visualizing the most frequently recurring polysemies (or colexifications) centering around the concept "arm". The full visualization in CLICS can be found here; the link between "arm" and "hand" is marked in green below.

Figure 4: Colexification network in the CLICS database.

From eye-balling the data, it is hard to find a consistent geographic / language-family pattern, which suggests that the feature p-6 is likely to show a high degree of homoplasy in the languages of the world. Obviously, different people decided not to distinguish between "hand" and "arm". But the example of the Sami languages in northern Scandinavia also demonstrates that some people using related, long-isolated languages consistently don't make the distinction. Here, the homoplasy is inherited (lineage-conserved). A biological analogy would be the rarely applied distinction between a 'convergence' (a trait is independently evolved in different lineages) and a 'parallelism' (a trait is expressed by different but not all members of the same lineage).

Figure 5: Geographic distribution of arm/hand colexifications in the CLICS database.

A specific analogy to the "hand-arm" colexification / differentiation pattern is leaf shedding in oaks and their relatives (Fagaceae, the beech family). Some oak lineages (section Cerris of oaks, beech trees, chestnuts) are essentially or strictly deciduous, others (sections Cyclobalanopsis, Ilex, the sister sections of Cerris; Castanopsis, the sister genus of chestnuts) are always evergreen, and the biggest group (number of species) of all Fagaceae, subgenus Quercus, includes evergreen (1 section), mixed (the two by far largest sections), and deciduous (1 nearly extinct section) sublineages. To some extent this is linked to the climate in which the species thrive (high latitudes and/or per-humid = deciduous, low latitude and/or seasonally dry = evergreen), but consistently evergreen and deciduous lineages do co-exist.

Looking at the Chinese dialects, we see that p-6 represents a trivial split in the network.

Figure 6: A Neighbor-net inferred from the Szeto et al. matrix. Dialects that distinguish "arm" and "hand" with filled dots ('1' for character 6 in the matrix), those that don't ('0') with empty dots. We can put a single line separating all don't- from do-taxa (dialects), i.e. a bipartition of the taxon set fitting the character partition seen in (p-)6.

But, given the general patterning of the feature on a global scale, does this really mean that it is inherited — that is, a good feature to reflect relatedness?

Whether a feature is likely to be homoplastic is just one part of the story. Linguists typically have more information about how things change than do biologists, putting a double-edged sword in their hands (that they hardly ever use). Asking whether "hand" and "arm" are expressed by distinct words does not consider the underlying processes. Here, we can assume at least three different character states, namely:
  1. "arm" and "hand" are expressed by the same word, which is the original word for "arm",
  2. "arm" and "hand" are expressed by the same word, which is the original word for "hand", and
  3. "arm" and "hand" are expressed by different words.
We could even have a fourth state, in which a single word was always used, throughout the whole long history of the ancestral languages, to express "arm or hand" (i.e., both body parts). In that case, no differentiation and no later generalization from either "arm" or "hand" took place.

Figure 7: Left, current scoring; right, scoring taking into account the actual mutation process.

From Ancient Chinese, we know that "1" (Yes, I do distinguish between "arm" and "hand") was most likely the original state. We can further assume that once the distinction is dropped, it is less likely to come back again (although this can, of course, also happen). That is, our model involves two possible mutations (vertical process): we lose the word for "arm" due to its replacement by "hand", or we lose the word for "hand" due to its replacement by "arm", each with its own probability.

Figure 8: Probability distribution for transitions involving "hand" and "arm".

The probability of whether a mutation occurs, and which one, relates to four principal driving factors:
  1. probability of random loss (mutation)
  2. probability of random gain (mutation)
  3. global linguistic tendencies
  4. regional socially-enforced preference
Establishing p-arm (loss "arm") and p-hand (loss "hand") is not trivial, because they may be affected by what the words for "arm" and "hand" are (for simplicity we will assume that p+arm and p+hand are close to 0). We could expect a higher tendency to keep the word that is easier to pronounce or less easy to confuse with other words and, hence, is easier to understand. If two dialects with different states come into contact, this may also influence the decision to take over a state or not. In everyday language, a distinction between "arm" and "hand" may be useless because of the clear context in which both words are used, so p1-word > p2-words. However, closeness to administration centers or areas with a higher percentage of educated people could decrease p1-word, because it may be considered a sign of poor social standing to not make the difference between "arm" and "hand".
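To illustrate what such a loss-only model implies, here is a toy sketch of the vertical process as a three-state chain. Everything numeric in it is an assumption made for illustration: the per-step loss probabilities are invented, the gain probabilities (p+arm, p+hand) are set to exactly 0, and contact between dialects is ignored. The names `step` and `evolve` are hypothetical.

```python
def step(dist, p_arm_loss, p_hand_loss):
    """Advance the state distribution by one time step (no regain of a word)."""
    both = dist["both"]
    return {
        # Both words survive this step.
        "both": both * (1 - p_arm_loss - p_hand_loss),
        # "arm" was lost: the word for "hand" now covers both body parts.
        "hand_only": dist["hand_only"] + both * p_arm_loss,
        # "hand" was lost: the word for "arm" now covers both body parts.
        "arm_only": dist["arm_only"] + both * p_hand_loss,
    }


def evolve(steps, p_arm_loss, p_hand_loss):
    """Start from the assumed original state (distinction present)."""
    dist = {"both": 1.0, "hand_only": 0.0, "arm_only": 0.0}
    for _ in range(steps):
        dist = step(dist, p_arm_loss, p_hand_loss)
    return dist


# Invented probabilities, chosen only to show the mechanics.
result = evolve(10, p_arm_loss=0.02, p_hand_loss=0.01)
for state, prob in result.items():
    print(state, round(prob, 4))
```

With gains excluded, the two single-word states are absorbing, so the share of lineages keeping the distinction can only shrink over time; which of the two words survives depends on the ratio of the two loss probabilities.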

Figure 9: Vertical and horizontal processes involving transitions of "hand" and "arm".

Estimating p can only be left to phylogenetic algorithms (unless more detailed information is available). But we can (and should) design the questionnaire to capture as many of the processes as possible. In this case, to not only ask whether there is a distinction between "arm" and "hand", but also to find out whether the word "arm" or "hand" is used, e.g. by using two questions/binary characters:
  • Do we use "hand"?
  • Do we use "arm"?
Note that these questions require a good deal of knowledge about the languages under investigation, since it may not be trivial to find out what was the "original" word for "arm" or "hand".
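As a sketch of how the three character states listed earlier could be recoded into these two binary characters, consider the following mapping. The state labels and function names are hypothetical; the point is only that the current single yes/no character (p-6) can be derived from the two new ones, but not the other way round.

```python
# state -> (uses original "hand" word, uses original "arm" word)
RECODE = {
    "arm_word_for_both": (0, 1),   # state 1: one word, originally "arm"
    "hand_word_for_both": (1, 0),  # state 2: one word, originally "hand"
    "different_words": (1, 1),     # state 3: distinction is made
}


def recode(state):
    """Return the two binary characters for a given character state."""
    return RECODE[state]


def original_p6(state):
    """Collapse the two characters back into the current yes/no coding."""
    uses_hand, uses_arm = recode(state)
    return int(uses_hand and uses_arm)


for state in RECODE:
    print(state, recode(state), original_p6(state))
```

States 1 and 2 both collapse to '0' in the current coding, which is exactly the information that the simple yes/no question discards.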

Therefore, a further step would be to replace the binary characters by a value measuring the similarity between the words used for "hand" and those used for "arm". One could again argue that adding this information would add historical information to the feature, but it is clear that the abstract nature of the question is hiding important phylogenetic (and also typological) information from us.
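One simple candidate for such a value is a normalized edit distance between the two word forms. This is only one of many possible measures and is used here as an assumption; the word forms in the example are invented placeholders, not attested data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical forms (colexification), 0.0 for maximally different ones."""
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))


# Invented forms for "hand" and "arm" in three hypothetical dialects.
print(similarity("tama", "tama"))   # same word for both concepts
print(similarity("tama", "tamak"))  # closely related forms
print(similarity("tama", "gilo"))   # unrelated forms
```

Applied to segmented phonetic transcriptions rather than raw strings, the same function would work on lists of sounds, since it only compares sequence elements for equality.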

It therefore seems that, instead of asking whether or not there is a distinction between "arm" and "hand", it would make much more sense to trace the cognacy (or homology) of the expressions for "arm" and "hand" across all taxa (languages, dialects), and think of ways in which this could be scored and modeled by phylogenetic analyses. The structural data framework with its features based on simple yes-no questions therefore inevitably leads to a misinterpretation of processes when analyzing the data with phylogenetic software.

The need for exploratory data analysis

In reality, structural (or other) data sets in linguistics face problems similar to the ones palaeontologists face when trying to establish phylogenetic relationships between fossils (extinct organisms) — the probability for a mutation (visible change) is largely unknown, and differs not only from character to character but also within the same characters. A state 0, 1, 2 etc. may have a higher probability to manifest (or get lost) in one lineage than in another.

In addition, the problems linguists face resemble those of biologists working close to and below the species level (see also Guido's post on population dynamics and individual-based fossil phylogenies): reticulation is rather the rule than the exception, as similarity is triggered by contact, so that horizontal processes, not inheritance, may dominate evolutionary dynamics. Thus, the diversity pattern cannot be modeled by a tree alone. Establishing explicit probabilistic frameworks to deal with this may not only be difficult but even impossible (given the available data). Meanwhile, however, one can embrace exploratory data analysis as a heuristic tool.

So, let's look at the example. As in the original paper, we used the binary matrix of the 21 characters to infer a planar, 2-dimensional (meta-)phylogenetic network, a Neighbor-net splits graph. The resulting graph is a longitudinally inflated spider-web, with its endpoints defined by the southern Chinese dialects (e.g. Guangzhou, Nanning, Taishan) and the north-central (e.g. Linxia and Xining) dialects. The latter are significantly closer (geographically and data-wise) to the Beijing version of Chinese.

Figure 10: The Neighbor-net based on simple mean (Hamming) pairwise binary character distances
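The distance measure named in the caption can be sketched in a few lines of Python. This is only a minimal illustration, assuming a complete binary matrix without missing data; the function names and the three toy rows are made up for this example and are not taken from the actual data set.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Mean character difference between two equally long state strings."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of characters")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(matrix):
    """Return {(taxon1, taxon2): distance} for all unordered taxon pairs."""
    return {
        (s, t): hamming_distance(matrix[s], matrix[t])
        for s, t in combinations(sorted(matrix), 2)
    }


# Toy rows (hypothetical), NOT values from the original matrix.
toy = {
    "A": "110010",
    "B": "110011",
    "C": "001100",
}
distances = distance_matrix(toy)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

A pairwise distance matrix of this kind is the input that a Neighbor-net implementation would then turn into a splits graph.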

The first thing to note is that the matrix includes dialects that are indistinct (green stars) for all 21 characters, and some that are geographically and data-wise very similar to each other, while being distinct from all others (green ovals). In biology, we call this (taxic, lineage-)coherence. In addition to Linxia and Xining, we have Nanchang and Lichuan characterized by elongated ('tree-like') terminal edge-bundles. These obviously represent closely related dialects sharing a long(er) common history.

Others have more than one possible closest relative. For instance, Liuzhou may share quite a few features with Guangzhou, but it is equally close to the Nanchang-Lichuan pair (yellow fields). Dongtai (orange star) is unique, but its 'neighborhood' (orange-ish brackets), as defined by shared edge-bundles, includes Changsha (which in turn is most closely related to Jiujang) and Taiyuan plus Baotou, the latter two being substantially closer to the Beijing (red star) group.

Similar to Dongtai, and also connected to the central part of the graph, are dialects with long terminal branches (edges). Hefeng (blue star) is substantially different from Dongtai, and has only one further dialect in its neighborhood (blue bracket), Wangrong, a close relative of the Beijing group. The Wuhan, Chengdu, and Guiyang (gray field) dialects appear, on the other hand, to be completely isolated.

As explained above, there are different processes, vertical and horizontal ones, that may trigger similarity, and we want to get an idea as to which character may be influenced by which process. From the graph, several aspects are obvious:
  • geographic closeness plays a major role,
  • the signal provided by the data is not tree-like,
  • the data is highly homoplastic, and includes internal conflict.
Not so obvious is whether this situation is due to random or evolutionarily directed similarity, or to reticulation. Since the graph is planar, and puts the Chinese dialects in a circular order, we can order the character matrix accordingly to see how the traits form groups (which could be called cliques in this context). In the next step, we can then map each character onto this network, to see how well it fits with the overall similarity pattern. We showed this above for p-6 (hand-arm-distinction, one split), and here we add a character with quite a poor fit, p-17 (syntactic-diachronic), "give" as a disposal marker.
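Ordering a character along the circular order of the graph can be sketched as follows. This is a rough heuristic written for illustration, not the procedure used for the figures: it counts how many separate blocks of "1" a character forms along a given circular ordering. A character forming a single block is at least compatible with the circular ordering, whereas several blocks hint at a poor fit. The ordering, the taxon names, and the toy matrix are hypothetical.

```python
def character_in_order(matrix, order, index):
    """Read one character (column) following the circular taxon order."""
    return [matrix[taxon][index] for taxon in order]


def circular_runs(states, target="1"):
    """Count maximal blocks of `target` in a circularly ordered sequence."""
    n = len(states)
    if n == 0 or target not in states:
        return 0
    if all(s == target for s in states):
        return 1
    # A block starts wherever `target` follows a different state;
    # states[i - 1] wraps around to the last element when i == 0.
    return sum(
        1 for i in range(n)
        if states[i] == target and states[i - 1] != target
    )


# Hypothetical circular ordering and toy matrix with two characters.
order = ["A", "B", "C", "D", "E", "F"]
toy = {"A": "10", "B": "11", "C": "00", "D": "01", "E": "00", "F": "10"}

for index in range(2):
    column = character_in_order(toy, order, index)
    print(index, "".join(column), circular_runs(column))
```

In the toy data, the first character forms one block (F, A, B are adjacent on the circle), while the second is split into two blocks.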

Figure 11: Character mapping for p-17 (filled dots, "give" used as disposal marker; empty, not used), with the p-6 split indicated as well. Red, splits (taxon bipartitions defined by character cliques) that have no corresponding edge-bundle (neighborhood); blue, splits with neighborhood; green, unique, isolated change (deviation from the rule) within the neighborhood.

The number of mutations inferred in the map follows Ockham’s Razor, on which parsimony (tree and network) inference relies as well. Using such a map, we can even provide a qualitative estimate of how likely a change is, under the assumption that neighborhoods in the graph represent either exchange (homogenization) between closely related dialects or are inherited, reflecting both horizontal and vertical relatedness. Mapping characters on a 2-dimensional network allows finding a scenario beyond a single tree hypothesis.
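To make the notion of a minimum number of changes concrete, here is a sketch of the classic tree-based version of the count (Fitch parsimony). Note the swap: the mapping in this post is done on a network, while the code below works on a single rooted binary tree, so it only illustrates the principle. The tree, taxon names, and states are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes on a rooted binary tree (Fitch).

    tree:   nested 2-tuples with leaf names (str) at the tips
    states: dict mapping each leaf name to its character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


# Hypothetical tree and two toy characters.
tree = ((("A", "B"), "C"), ("D", "E"))
one_change = {"A": "1", "B": "1", "C": "0", "D": "0", "E": "0"}
two_changes = {"A": "1", "B": "0", "C": "0", "D": "1", "E": "0"}

print(fitch_changes(tree, one_change))
print(fitch_changes(tree, two_changes))
```

The count says nothing about which of several equally short scenarios is correct; that choice needs additional knowledge about how likely a feature is to be gained or lost.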

For p-6, we need just one change (i.e. loss in all more south-bound dialects), but we don't find an edge bundle corresponding to this unique change. Given what we discussed above about p-6, we have more independent losses than the simple reconstructed one. Social preference or general contact for retaining the primitive state of having two words could explain why dialects closer to the Beijing dialect area have a "0", although not all are closely related in general.

For p-17, we need at least four (independent) changes from "0" → "1", two of which have a corresponding edge bundle (blue, Nanchang plus Lichuan, Changsha plus Dongtai), one isolated (green, Luoyang), and one without a corresponding edge bundle (Wuhan and Hefeng dialects). The (equally parsimonious) alternative for p-17 would be a series of gains and losses, with the same number of steps:

Figure 12: Alternative scenario for p-17.

This is where one needs to consider additional knowledge about the probability of getting or retaining a certain feature. The state shared by most dialects across the entire net is “0”, irrespective of overall similarity, which would make it a natural pick for the primitive state. Thus, assuming four (or more) changes from 0 → 1 (acquisition of the queried feature) appears more plausible than assuming two independent acquisitions (starting with the Beijing group; note that the position of the root will not change the number of needed changes), followed by a loss (1 → 0) in many southbound dialects and a re-gain (0 → 1) in the Nanchang + Lichuan dialects.

The same assessment can be made for all of the characters, and we end up with something like this:

Figure 13: Fully annotated split network of the data. Changes relating to edge-bundles are colored accordingly; arcs indicate changes without a corresponding edge-bundle. Note the prominent yellow split that defines a neighborhood of dialects most similar to the Beijing dialect, although there is no character supporting this edge. The rather poor fit of many character splits (cliques) with edge-bundles relates to the fact that we visualize a highly complex diversification (multi-dimensional processes) using a planar, 2-dimensional graph.

While this figure may be confusing at first sight, it comprehensively shows what the characters contribute to the overall graph. We can discriminate more-likely from less-likely mutations (how many changes are needed at least), but also the character assemblies shared by putatively closely related dialects.
  • p-3 and p-11 are typical features of Guangzhou and allied dialects within the southern Chinese complex. p-3 is also present in Lichuan, and p-11 in Jixi (thus in not so distant dialects).
  • Features p-6 to p-9, p-16, and p-19 form a diagnostic suite for the Guangzhou dialects and other dialects related to them in one way or another, and distinguish them from, e.g., the Beijing group.
  • The latter, the Beijing group, has fewer diagnostic character assemblies. One characteristic sequence could be p-1, p-2, p-12, p-14, but this includes three features with a minimum of 3+ changes. Similarity here is mostly the result of a lack of (potentially) derived features (hence, the character-unsupported yellow edge-bundle defining a Beijing-including neighborhood).

Outlook and summary

In this re-investigation, we have, once more, commented on the problems we see with the use of structural features for the purpose of historical language comparison and phylogenetic reconstruction. We see the major problems in the (often) unfortunate choice of questions, resulting in elicitations of features that cannot be easily modeled with current software for phylogenetic analyses. It is important to keep in mind, in linguistics and phylogenetics, that we can infer trees or networks from data of any quality and information content. But before we present the result, we should have taken a look at the primary data.
  • Does it fit with the resulting graph, or not?
  • Where does it fit, and where not?
In the context of our critique of linguistic questionnaires, the mapping strategy discussed above opens a potential avenue to identify:
  • stable / unstable features (geographically or evolution-wise) and
  • coherent / incoherent features.
Based on this, we can then inquire as to which degree language (or dialect) groups influenced, stabilized or modified each other by geographic proximity.

Inference-wise, the natural next step would be to use the information about the minimum number of necessary changes to counter-weight characters. This would eventually allow the use of median-network (and related) approaches on the data, which are currently the only way to explicitly identify ancestors using phylogenetic reconstructions. With the current matrices, the extreme homoplasy makes an unweighted application of median networks and related methods impossible.
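One conceivable way to counter-weight characters is sketched below. The inverse weighting (1 divided by the minimum number of changes) is an assumption made for illustration only, since no particular scheme has been fixed here; the function names and toy values are likewise hypothetical.

```python
def counter_weights(min_changes):
    """Down-weight homoplastic characters: weight = 1 / minimum changes."""
    # max(c, 1) keeps invariant characters (0 changes) at weight 1.
    return [1.0 / max(c, 1) for c in min_changes]


def weighted_distance(a, b, weights):
    """Weighted mean character difference between two state strings."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("rows and weights must have the same length")
    differing = sum(w for x, y, w in zip(a, b, weights) if x != y)
    return differing / sum(weights)


# Toy example: three characters needing at least 1, 4, and 2 changes.
weights = counter_weights([1, 4, 2])
print(weights)
print(weighted_distance("110", "100", weights))  # differs in the 4-change character
print(weighted_distance("110", "010", weights))  # differs in the 1-change character
```

With such weights, a difference in a character that needs only one change counts four times as much as a difference in a character that needs at least four.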

References

Forkel, R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 2393-2400.

List, J.-M., M. Walworth, S. Greenhill, T. Tresoldi, and R. Forkel (2018) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2: 130–144.

Norman, J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.): The Sino-Tibetan languages. Routledge: London and New York, pp. 72-83.

Szeto, P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Supplementary data

The data we used to create the analyses and figures provided in this post are available at https://github.com/cldf-datasets/szetosinitic/tree/master/examples