Monday, June 17, 2019

Ockham's Razor applied, but not used: can we do DNA-scaffolding with seven characters?

One of the most interesting research areas in organismal science is the cross-road between palaeontology and neontology, which puts together a picture marrying the fossil record with molecular-based phylogenies. Unfortunately, when it comes to plant (palaeo-)phylogenetics, some people adhere to outdated analysis frameworks (sometimes with little data).

How to place a fossil?

The fossil record is crucial for neontology as it can provide age constraints (minimum ages when doing node dating) and inform us about the past distribution of a lineage. This, especially in the case of plants that can't run away from unfortunate habitat changes, can be much different than today.

The main question in this context is whether a fossil represents the stem, ie. a precursor or extinct ancient sister lineage, or the crown group, ie. a modern-day taxon (primarily modern-day genus). For instance, the oldest crown fossil gives the best-possible minimum age for the stem (root) age of a modern lineage, whereas a stem fossil can give (at best) only a rough estimate for the stem age of the next-larger taxon when doing the common node dating of molecular trees (note that fossilized birth-death dating can make use of both).

There are two commonly accepted criteria to identify a crown-group fossil:
  1. Apomorphy-based argues that if a fossil shows a uniquely derived character (ie. an aut- or synapomorphy sensu Hennig) or character suite diagnostic for a modern-day genus, it represents a crown-group fossil.
  2. Phylogeny-based aims to place the fossil in a phylogenetic framework, the position of the fossil in the genus- or species-level tree (most commonly done) or network (rarely done but producing much less biased or flawed results) then informs what it is.
(We will focus on members of modern-day genera, since it becomes trickier for higher-level taxa, see eg. my posts thinking about What is an angiosperm? [part1][part2][why I pondered about it].)

There are three basic options to place a fossil using a phylogenetic tree.
  1. Putting up a morphological matrix, then inferring the tree. A classic approach but, owing to the nature of most morphological data sets, one that leads to a partly wrong tree, as we demonstrated in some posts here on the Genealogical World of Phylogenetic Networks (hence, such analyses should always be done in a network-based exploratory data analysis framework).
  2. Putting up a mixed molecular-morphological matrix, then inferring a "total evidence" tree. This includes sophisticated approaches that use the molecular data to implement weights on the morphological traits and/or consider the age of the fossils (so-called total evidence dating approaches). Works not that bad with animal-data, provided the matrix includes a lot of morphological traits reflecting aspects of the (molecular-based) phylogeny. Doesn't work too well for plants because we usually have much fewer scorable traits, most of which are evolved convergently or in parallel. Non-trivial plant fossils love to act as rogues during phylogenetic inference.
  3. Optimise the position of a fossil in a molecular-based tree, eg. using so-called "DNA scaffold approach" (usually using parsimony as optimality criterion) or the evolutionary placement algorithm implemented in RAxML (using maximum likelihood). A special form of this approach is to first map the traits on a (dated) molecular tree, and then find the position where a fossil would fit best.

Why (standard) phylogenetic tree-based approaches are tricky

Below a simple example, including three fossils of different age (and often, place) with different character suites.

Even though none of the derived traits (blue and red "1") is a synapomorphy (fide Hennig), we can assign the youngest fossil X to the lineage of genus 1A just based on its unique derived ('apomorphic') character suite. It's likely a crown-group fossil of clade 1, and may inform a minimum age for the most-recent common ancestor (MRCA) of the two modern-day genera of Clade 1.
Apomorphy-wise, fossils Y and Z cannot be unambiguously placed. The red trait appears to be independently obtained in both clades, and the blue trait may have been
To discern between the options, we'd be well-advised to do character mapping in a probabilistic framework, which requires a tree with independently defined branch-lengths.

Just by using parsimony-based DNA-scaffolding, fossil X would be confirmed as crown-group fossil and member of genus 1A (being identical and different from all others) and fossil Z would end up as a stem-group fossil. Fossil Y, however, would be placed as sister to genus 2C (again, identical to each other and different from all others). Using Y in node dating would then lead to a much too old divergence age for the crown-group age of Clade 2. In reality, what researchers do with such a seemingly too old fossil is not to use it by the book, as MRCA of Genus 2B and 2C, but to inform the MRCA of eg. genera 2A, 2B, and 2C, assuming that the fossil's age and trait set indicate the 2C morphology is primitive within the clade, or that Y is an extinct sister lineage and the shared derived trait a convergence (parallelism).
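To make the scaffolding step concrete, here is a minimal sketch of what a parsimony-based placement does: the backbone tree is kept fixed, the fossil is grafted onto every branch in turn, and the position(s) with the fewest character changes (counted with Fitch's algorithm) win. The tree shape, the taxon names and all 0/1 scores are invented for this illustration; they are not the matrix behind the figure above, and real analyses use dedicated software rather than a script like this.

```python
# Toy parsimony placement of a fossil on a fixed backbone tree.
# Assumption: all taxon names and 0/1 scores below are invented for illustration.

def fitch_length(tree, matrix):
    """Fitch parsimony length of `tree`, given as nested 2-tuples of leaf names."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        # States seen for this character; used for taxa scored as missing ("?").
        observed = {row[i] for row in matrix.values() if row[i] != "?"}

        def post(node):
            # Post-order pass: (possible ancestral states, steps counted so far).
            if isinstance(node, str):
                state = matrix[node][i]
                return (observed if state == "?" else {state}), 0
            (l_set, l_steps), (r_set, r_steps) = post(node[0]), post(node[1])
            shared = l_set & r_set
            if shared:
                return shared, l_steps + r_steps
            return l_set | r_set, l_steps + r_steps + 1

        total += post(tree)[1]
    return total


def graft(tree, fossil):
    """Yield (sister, new_tree) for every branch the fossil can be attached to."""
    yield tree, (tree, fossil)
    if not isinstance(tree, str):
        left, right = tree
        for sister, new_left in graft(left, fossil):
            yield sister, (new_left, right)
        for sister, new_right in graft(right, fossil):
            yield sister, (left, new_right)


def place_fossil(backbone, matrix, fossil):
    """Return the minimum tree length and all equally good sister groups."""
    scored = [(fitch_length(t, matrix), sister) for sister, t in graft(backbone, fossil)]
    best = min(length for length, _ in scored)
    return best, [sister for length, sister in scored if length == best]


backbone = (("1A", "1B"), ("2A", ("2B", "2C")))
scores = {
    "1A": "1100", "1B": "1000",
    "2A": "0000", "2B": "0010", "2C": "0110",
    "X": "1100",  # hypothetical fossil, identical to genus 1A
}
print(place_fossil(backbone, scores, "X"))
```

With these invented scores the fossil ends up as sister to 1A without adding a single step to the tree. Note that the three attachment points around the root describe the same unrooted tree and therefore always tie, which is one reason an outgroup is needed. With only a handful of homoplastic characters, changing a single score is enough to move the fossil elsewhere.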

Four characters, three homoplastic and one invariant, are surely not enough for DNA-scaffolding, but adding more and more characters has a catch. Easy to do for the modern-day taxa, for which we also have molecular data, the preservation of fossils limits adding many more traits; any trait not preserved in the fossil is effectively useless when placing it (including not-preserved traits in total evidence approach may, nonetheless, help the analysis). Which brings us to the real-world example just published in Science:

Wilf P, Nixon KC, Gandolfo MA, Cúneo RA (2019). Eocene Fagaceae from Patagonia and Gondwanan legacy in Asian rainforests. Science 364, 972. Full-text article at Science website.

Why one should not place a fossil using DNA-scaffolding with seven characters

Wilf et al. show (another) spectacularly preserved fossil from the Eocene of Patagonia. Personally, I think that just publishing and shortly describing such a beautiful fossil should be enough to get into the leading biological journals.

But Wilf et al. wanted (needed?) more and came up with the following "phylogenetic analysis" to argue that their fossil is a crown-group Castanoideae, a representative of the modern-day firmly Southeast Asian tropical-subtropical genus Castanopsis, and evidence for a "southern route to Asia hypothesis" (via Antarctica and Australia, both well-studied but devoid so far of any Fagaceae presence; despite the fact that the modern-day climate allows cultivating them as eg. source for commercially used wood).

Wilf et al's Fig. 3 and Table 1 suggest to me that the paper was not critically reviewed by anyone familiar with the molecular genetics of Fagaceae or phylogenetic methods in general — perhaps this is not needed, since the first author is well-merited and the second author a world-leading expert of botanical palaeo-cladistics. However, parsimony-based DNA-scaffolding can be tricky, even with a larger set of characters (see eg. the post on Juglandaceae using a well-done matrix), and using seven is therefore quite bold. Notably, of the seven characters, one is parsimony-uninformative and four are variable within at least one of the included OTUs.
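For readers unfamiliar with the term: a character is parsimony-informative only if at least two of its states each occur in at least two taxa; otherwise every tree requires the same number of changes for it. A minimal sketch of that check follows, assuming single-state scores written as one symbol per taxon, with "?" and "-" treated as missing (polymorphic scores, as in several of the OTUs discussed here, would need extra handling). The example columns are made up.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(state for state in column if state not in {"?", "-"})
    return sum(1 for n in counts.values() if n >= 2) >= 2

# Hypothetical columns, one symbol per taxon.
print(is_parsimony_informative("00110"))  # two states, each in >= 2 taxa
print(is_parsimony_informative("00001"))  # derived state in a single taxon
print(is_parsimony_informative("00000"))  # invariant
```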

Side note: The tree used as a backbone is outdated and not comprehensive. Plastid and nuclear-molecular data indicate that the castanoids Lithocarpus (mostly tropical SE Asia) and Chrysolepis (temperate N. America) may be sisters. However, the morphologically quite similar Notholithocarpus is not related to either of these, but is instead a close relative of the ubiquitous oaks, genus Quercus (not included in Wilf et al.'s backbone tree), especially subgenus Quercus. Furthermore, the (today Eurasian) castanoid sister pair Castanea (temperate)-Castanopsis (tropical-subtropical) have stronger affinities to the (today and in the past) Eurasian oaks of subgenus Cerris. The Fagaceae also include three distinct monotypic relict genera, the "trigonobalanoids" Formanodendron and Trigonobalanus, SE Asia, and Colombobalanus from Colombia, South America. Using a more up-to-date instead of a 2-decade-old molecular hypothesis would have been a fair request during review, as would compiling a new molecular matrix to infer a tree used as backbone (currently gene banks include > 238,000 nucleotide DNA accessions including complete plastomes). This would have also enabled the authors to map their traits using a probabilistic framework, which can protect to some degree against homoplastic bias but requires a backbone tree with defined branch-lengths.

There are many more problems with the paper and its conclusions, but this critique would be content- not network-related. Let's just look at the data and see why Wilf et al. would have been better off not showing any phylogenetic analysis at all (and the impact-driven editors and well-meaning reviewers should have advised against it). Or a network.

Clades with little character support

The scaffolding placed the Eocene fossil in a clade with both representatives of Castanopsis, from which it differs by 0–2 and 1–4 traits, respectively. Phylogeny-based, the fossil is a stem- or crown-Castanopsis.
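Ranges such as "0–2" arise because some OTUs are scored as polymorphic: the minimum counts only characters where the two taxa share no state, the maximum counts every character where they could differ. A small sketch of that arithmetic, with each score given as a set of possible states; the seven scores below are placeholders I made up, not Wilf et al.'s Table 1.

```python
def difference_range(taxon_a, taxon_b):
    """Return (min, max) number of differing characters for set-valued scores."""
    # Minimum: characters where no state is shared at all.
    lowest = sum(1 for a, b in zip(taxon_a, taxon_b) if not (a & b))
    # Maximum: characters where at least two different states are in play.
    highest = sum(1 for a, b in zip(taxon_a, taxon_b) if len(a | b) > 1)
    return lowest, highest

# Hypothetical scores for seven characters; {0, 1} marks a polymorphic OTU.
fossil = [{0}, {0}, {1}, {0}, {0}, {1}, {0}]
bulk_otu = [{0}, {0, 1}, {1}, {0}, {0, 1}, {1}, {0}]
print(difference_range(fossil, bulk_otu))
```

The more polymorphic the OTU, the wider the range, and the less a "difference" of zero actually tells us.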

However, the fossil has a character suite that differs in just a single trait (#6: valve dehiscence) from the (genetically very distant) sister taxon of all other Fagaceae, Fagus (the beech), used here as the outgroup to root the Castanoideae subtree. As far as apomorphies are concerned, the data are inconclusive as to whether the fossil represents a stem-Castanoideae (or extinct Fagaceae lineage) or a Castanopsis: this critical, potentially diagnostic derived trait, partial valve dehiscence, is only shared by the fossil and some but not all modern-day Castanopsis. This particular trait is not mentioned elsewhere in the text, although it is the reason the fossil is placed next to Castanopsis and not the outgroup Fagus in the "phylogenetic analysis".

In the following figure, I have mapped (with parsimony) the putative character mutations on the tree used by Wilf et al.

Black font: shared by Fagus (outgroup) and "Castanoideae". Green font: potential uniquely derived traits. Blue font: traits reconstructed as having evolved in parallel/convergently. Red branches, clades in the used backbone tree that are at odds with currently available molecular data (the N. American relict Notholithocarpus should be sister to the Eurasian Castanea-Castanopsis).

This hardly presents a strong case of crown-group assignation. Except for partial dehiscence, even the modern-day Castanopsis have few discriminating derived traits; they are living fossils with a primitive ('plesiomorphic') character suite. Intriguingly, they are also genetically less derived than other Castanoideae and the oaks (see eg. the ITS tree in Denk & Grimm 2010).

The actual differentiation pattern

The best way to depict what the character set provides as information for placing the fossil is, of course, the Neighbor-net, as shown next.

Neighbor-net based on Wilf et al.'s seven scored morphological traits used to place the fossil. Green: the current molecular-based phylogenetic synopsis — based mostly on Oh & Manos 2008; Manos et al. 2008; Denk & Grimm 2010. I had the opportunity to get familiar with all of the then-available genetic data when harvesting all Fagaceae data from gene banks in 2012 for a talk in Bordeaux. One complication in getting an all-Fagaceae-tree is that plastids, geographically constrained, and nuclear regions tell partly different stories.

Castanopsis, including the fossil, is morphologically paraphyletic (see also our other posts dealing with paraphyla represented as clades in trees). Note also the long edge-bundle separating the temperate Chrysolepis and chestnuts (Castanea) from their respective cold-intolerant sister genera (Lithocarpus viz Castanopsis): derived traits have been accumulated in parallel within the "Castanoideae". The scored aspects of Fagaceae morphology are very flexible and ~50 million years is a long time, possibly leading to partial valve indehiscence (or losing it) without being part of the same generic lineage. The puzzling differentiation, and the profoundly primitive appearance of the fossil (shared with modern-day Castanopsis), may in fact be the reason the authors (i) did not optimize / discuss very similar, coeval fossils from the Northern Hemisphere interpreted (and cited) as extinct genera (eg. Crepet & Nixon 1989), (ii) left out the two Fagaceae genera today occurring in South America, (iii) opted for classic parsimony and a partly outdated molecular hypothesis, and (iv) just showed a naked cladogram without branch support values as the result of their "phylogenetic analysis" (Please stop using cladograms!).

Based on the scored characters, the position of the fossil in the graph, and on the background of a more up-to-date molecular-based phylogenetic synopsis (the green tree in the figure above), the most parsimonious interpretation (and probably, the most likely) is that the fossil may indeed be a stem-Castanoideae, a representative of the lineage from which the Laurasian oaks evolved at least 55 million yrs ago (oldest Quercus fossil was found in SE Asia), or even represent a morphologically primitive, extinct (South) American lineage of the Fagaceae. Regarding the "southern route", Ockham's Razor would favor that they are just a South American extension of the widespread Eocene Laurasian Fagaceae / Castanoideae, since very similar fossils and castaneoid pollen are found in equally old and older sites in North America, Greenland (papers cited by Wilf et al.) and Eurasia but not Australia, New Zealand or Antarctica.

A final note: when you have so few characters to compare, you should use OTUs that are not completely ambiguous in every potentially discriminating character, as scored for the "C. fissa group" — the "Castanopsis group" has a single unambiguously defined, potentially derived trait. Using artificial bulk taxa is generally a bad idea when mapping trait evolution onto a molecular backbone tree. Instead, you should compile a representative placeholder taxa set, with as many taxa as you need (or are feasible) to represent all character combinations seen in the modern species/genera.

Other cited references, with comments
Crepet WL, Nixon KC (1989) Earliest megafossil evidence of Fagaceae: phylogenetic and biogeographic implications. American Journal of Botany 76: 842–855. – introducing a Castanopsis-like infructescence interpreted to represent an extinct genus but very similar to the new Patagonian fossil in its preserved features; and co-occurring with castaneoid pollen (not reported so far for Patagonia) and foliage.
Denk T, Grimm GW (2010) The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366. – includes an all-"Quercaceae" ITS-tree (fig. 3) and -network (fig. 4) using data of ~ 1000 ITS accessions; the nuclear-encoded ITS is so far the only comprehensively sampled gene region that gets the genera and main intra-generic lineages apart (recently confirmed and refined by NGS phylogenomic data), something wide-sampled plastid barcodes struggle with. Analysed with up-to-date methods and avoiding long-branch interference by excluding the only partially alignable Fagus, Castanopsis dissolves into a grade in the all-accessions tree and Quercus is deeply nested within the Castanoideae (as already seen in the 2001 tree used by Wilf et al. as backbone). The species-level PBC neighbor-net prefers a circular arrangement in which Notholithocarpus remains a putative sister of substantially divergent and diversified Quercus, followed by Castanea-Castanopsis, and Lithocarpus, while Chrysolepis is recognized as unique.

Oh S-H, Manos PS (2008) Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57: 434–451. – Probably still the best Fagaceae tree, and surely not a bad basis for probabilistic mapping of morphological traits in the family.

Manos PS, Cannon CH, Oh S-H (2008) Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55: 181–190. – the tree failed to resolve the monophyly of the largest genus, the oaks, but depicts well the data reality when combining ITS with plastid data and, hence, provides a good trade-off guide tree.

Monday, June 10, 2019

Why don't people draw evolutionary networks sensibly?

In phylogenetics there are two types of network:
  • those where the network edges have a time direction, whether explicit or implied; and
  • those where the edges are undirected.
The latter networks are among the most valuable tools ever devised for the exploration of multivariate data patterns; and this blog is replete with examples drawn from all fields that produce quantitative data (see the Analyses blog page). The first type of network, however, is the only one that can display hypothesized evolutionary histories — that is, they can truly be called evolutionary networks.

Evolutionary networks have a set of characteristics that are essential in order to successfully display biological histories, such as:
  • no directed cycles, because otherwise one of the descendants would be its own ancestor;
  • time consistency, meaning that reticulations in the network only occur between contemporaries.
The latter requirement is not needed for the history of human artifacts, because the ideas on which those artifacts are based can be recorded, and then not used until much later — ideas can "leap forward" in time. There are a number of examples of this in this blog, as discussed in last week's post (A phylogenetic network outside science).

However, time consistency is pretty much universal in biology (see the post on Time inconsistency in evolutionary networks). Natural hybridization and introgression require two living organisms in order to occur, as does horizontal gene transfer. This is basic biology, at least outside the laboratory.
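Both requirements are easy to check mechanically, which makes it all the more surprising how often they are violated. Below is a rough sketch under my own simplifying assumptions: the network is a list of (parent, child) edges, and each lineage is a (origin, end) pair in millions of years ago. The 3.65 and 3.47 figures are the two speciation dates discussed in the big-cat example further down; the origin of the older lineage (4.0) is a placeholder, since only its end date matters here.

```python
from collections import deque

def is_acyclic(edges):
    """Kahn's algorithm: True if the directed network has no directed cycle."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)

def coexisted(lineage_a, lineage_b):
    """Lineages are (origin, end) in Mya, origin >= end; True if they overlap in time."""
    latest_origin = min(lineage_a[0], lineage_b[0])
    earliest_end = max(lineage_a[1], lineage_b[1])
    return latest_origin > earliest_end

# A hybrid node "h" with two parents is fine; an ancestor loop is not.
print(is_acyclic([("r", "a"), ("r", "b"), ("a", "h"), ("b", "h")]))
print(is_acyclic([("a", "b"), ("b", "a")]))

donor = (4.0, 3.65)      # origin is a placeholder; the lineage ended by speciating 3.65 Mya
recipient = (3.47, 0.0)  # lineage that arose from a speciation 3.47 Mya
print(coexisted(donor, recipient))
print(round((donor[1] - recipient[0]) * 1_000_000), "years between the two lineages")
```

A reticulation between two lineages is time consistent only if `coexisted` is true for them; a diagram can be redrawn to make an arrow horizontal only in that case.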

So, the question posed in this post's title refers to the fact that so many people draw their evolutionary networks in a manner that appears to violate time consistency.

Consider this example (from: Interspecies hybrids play a vital role in evolution. Quanta Magazine):

Note that the reticulation edges (the dashed lines) represent gene transfers by introgression or hybridization, and yet none of them are drawn vertically, as they would need to be in order to be time consistent (since time travels from left to right).

It might be argued that most of these are not all that important in practice, but the one to the left quite definitely matters very much. It shows gene transfer between: (i) an organism that speciated 3.65 million years ago and (ii) an organism that is the descendant of one that speciated 3.47 million years ago. The 180,000 years between those two events are not irrelevant; and they make the claimed gene transfer impossible.

One might think that this is simply the general media misunderstanding the network requirements, but this is not so. The diagram is actually a quite accurate representation of the one from the original scientific publication (from: Genome-wide signatures of complex introgression and adaptive evolution in the big cats. Science Advances 3: e1700299; 2017.):

The network shows the same series of hybridizations / introgressions. However, this time three sets of gene transfers are shown to be time consistent, represented by the horizontal arrows (since time flows from top to bottom). Two of the three diagonal arrows (light blue and orange) could be made time consistent (ie. drawn horizontally), although the authors have chosen not to do so, apparently for artistic reasons. However, the first reticulation cannot be made time consistent, for the reason outlined above.

So, people, please think about what you are drawing, and don't show things that are biologically impossible.

Monday, June 3, 2019

A phylogenetic network outside science

I have written before about the presentation of historical information using the pictorial representation of a phylogeny (eg. Phylogenetic networks outside science; Another phylogenetic network outside science). These diagrams are often representations of the evolutionary history of human artifacts, and so a phylogeny is quite appropriate. They are of interest because:
  • they are usually hybridization networks, rather than divergent trees, because the artifact ideas involve horizontal transfer (ideas added) and recombination (ideas replaced);
  • they are often not time consistent, because ideas can leap forward in time, so that the reticulations do not connect contemporary artifacts (see Time inconsistency in evolutionary networks); and
  • they are sometimes drawn badly, in the sense that the diagram does not reflect the history in a consistent way.
The latter point often involves poor indication of the time direction (see Direction is important when showing history), or involves subdividing the network into a set of linearized trees.

One particularly noteworthy example that I have previously discussed is of the GNU/Linux Distribution Timeline, which illustrates the complex history of the computer operating system. The problems with this diagram as a phylogeny are discussed in the blog post section History of Linux distributions.

In this new post I will simply point out that there is a more acceptable diagram, showing the key Unix and Unix-like operating systems. I have reproduced a copy of it below.


This version of the information correctly shows the history as a network, not a series of linearized trees (each with a central axis). It also draws the reticulations in an informative manner, rather than having them be merely artistic fancies.

It is good to know that phylogenetic diagrams can be drawn well, even outside biology and linguistics.

Monday, May 27, 2019

Automatic phonological reconstruction (Open problems in computational diversity linguistics 4)

The fourth problem in my list of open problems in computational diversity linguistics is devoted to the problem of linguistic reconstruction, or, more specifically, to the problem of phonological reconstruction, which can be characterized as follows:
Given a set of cognate morphemes across a set of related languages, try to infer the hypothetical pronunciation of each morpheme in the proto-language.
This task needs to be distinguished from the broader task of linguistic reconstruction, which would usually include also the reconstruction of full lexemes, i.e. lexical reconstruction — as opposed to single morphemes or "roots" in an unknown ancestral language. In some cases, linguistic reconstruction is even used as a cover term for all reconstruction methods in historical linguistics, including such diverse approaches as phylogenetic reconstruction (finding the phylogeny of a language family), semantic reconstruction (finding the meaning of a reconstructed morpheme or root), or the task of demonstrating that languages are genetically related (see, e.g., the chapters in Fox 1995).

Phonological and lexical reconstruction

In order to understand the specific difference between phonological and lexical reconstruction, and why making this distinction is so important, consider the list of words meaning "yesterday" in five Burmish languages (taken from Hill and List 2017: 51).

Figure 1: Cognate words in Burmish languages (taken from Hill and List 2017)

Four of these languages express the word "yesterday" with the help of more than one morpheme, indicated by using different colors in the table's phonetic transcriptions, which at the same time also indicate which words we consider to be homologous in this sample. Four of the languages have one morpheme which (as we confirmed from the detailed language data) means "day" independently. This morpheme is given the label 2 in the last column of the table. From this, we can see that the motivation by which the word for "yesterday" is composed in these languages is similar to the one we observe in English, where we also find the word day being a part of the word yester-day.

If we want to know how the word "yesterday" was expressed in the ancestor of the Burmish languages, we could make an abstract estimation based on the lexical material we have at hand. We might assume that it was also a compound word, given the importance of compounding in all living Burmish languages. We could further hypothesize that one part of the ancient compound would have been the original word for "day". We could even make a guess and say the word was in structure similar to Bola and Lashi (although it is difficult to find a justification for doing this). In all cases, we would propose a lexical reconstruction for the word for "yesterday" in Proto-Burmish. We would make an assumption with respect to what one could call the denotation structure or the motivation structure, as we called it in Hill and List (2017: 67). This assumption would not need to provide an actual pronunciation of the word, it could be proposed entirely independently.

If we want to reconstruct the pronunciation of the ancient word for "yesterday" as well, we have to compare the corresponding sounds, and build a phonological reconstruction for each of the morphemes separately. As a matter of fact, scholars working on South-East Asian languages rarely propose a full lexical reconstruction as part of their reconstruction systems (for a rare exception, see Mann 1998). Instead, they pick the homologous morphemes from their word comparisons, assign some rough meaning to them (this step would be called semantic reconstruction), and then propose an ancient pronunciation based on the correspondence patterns they observe.

When listing phonological reconstruction as one of my ten problems, I am deliberately distinguishing this task from the tasks of lexical reconstruction or semantic reconstruction, since they can (and probably should) be carried out independently. Furthermore, by describing pronunciation of the morphemes as "hypothetical pronunciations" in the ancestral language, I want not only to emphasize that all reconstruction is hypothetical, but also to point to the fact that it is very possible that some of the morphemes for which one proposes a proto-form may not even have existed in the proto-language. They could have evolved only later as innovations on certain branches in the history of the languages. For the task of phonological reconstruction, however, this would not matter, since the question of whether a morpheme existed in the most recent common ancestor becomes relevant only if one tries to reconstruct the lexicon of a given proto-language. But phonological reconstruction seeks to reconstruct its phonology, i.e. the sound inventory of the proto-language, and the rules by which these sounds could be combined to form morphemes (phonotactics).

Why phonological reconstruction is hard

That phonological reconstruction is hard should not be surprising. What the task entails is to find the most probable pronunciation for a bunch of morphemes in a language for which no written records exist. Imagine you want to find the DNA of LUCA as a biologist, not even in its folded form, with all of the pieces in place, but just a couple of chunks, in order to get a better picture of what this LUCA might have looked like. But while we can employ some weak version of uniformitarianism when trying to reconstruct at least some genes of our LUCA (we would still assume that it was using some kind of DNA, drawn from the typical alphabet of DNA letters), we face the specific problem in linguistics that we cannot even be sure about the letters.

Only recently, Blasi et al. (2019) argued that sounds like f and v may have evolved later than the other sounds we can find in the languages of the world, driven by post-Neolithic changes in the bite configuration, which seem to depend on what we eat. As a rule, and independent of these findings, linguists do not tend to reconstruct an f for the proto-language in those cases where they find it corresponding to a p, since we know that in almost all known cases a p can evolve into an f, but an f almost never becomes a p again. This can lead to the strange situation where some linguists reconstruct a p for a given proto-language even though all descendants show an f, which is, of course, an exaggeration of the principle (see Guillaume Jacques' discussion on this problem).

But the very idea that we may have good reasons to reconstruct something in our ancestral language that has been lost in all descendant languages is something completely normal for linguists. In 1879, for example, Ferdinand de Saussure (Saussure 1879) used internal and comparative evidence to propose the existence of what he called coefficients sonantiques in Proto-Indo-European. His proposal included the prediction that, if ever a language was found that retained these elements, these new sounds would surface as segmental elements, as distinctive sounds, in certain cognate sets, where all known Indo-European languages had already lost the contrast.

These sounds are nowadays known as laryngeals (*h1, *h2, *h3, see Meier-Brügger 2002), and when Hittite was identified as an Indo-European language (Hrozný 1915), one of the two sounds predicted by Saussure could indeed be identified. I have discussed before on this blog the problem of unattested character states in historical linguistics, so there is no need to go into further detail. What I want to emphasize is that this aspect of linguistic reconstruction in general, and phonological reconstruction specifically, is one of the many points that makes the task really hard, since any algorithm to reconstruct the phonological system of some proto-language would have to find a way to formalize the complicated arguments by which linguists infer that there are traces of something that is no longer there.

There are many more things that I could mention, if I wanted to identify the difficulty of phonological reconstruction in its entirety. What I find most difficult to deal with is that the methodology is insufficiently formalized. Linguists have their success stories, which helped them to predict certain aspects of a given proto-language that could later be confirmed, and it is due to these success stories that we are confident that it can, in principle, be done. But the methodological literature is sparse, and the rare cases where scholars have tried to formalize it are rarely discussed when it comes to evaluating concrete proposals (as an example for an attempt of formalizing, see Hoenigswald 1960). Before this post becomes too long, I will therefore conclude by noting that scholars usually have a pretty good idea of how they should perform their phonological reconstructions, but that this knowledge of how one should reconstruct a proto-language is usually not seen as something that could be formalized completely.

Traditional strategies for phonological reconstruction

Given the lack of methodological literature on phonological reconstruction, it is not easy to describe how it should be done in an ideal scenario. What seems to me to be the most promising approach is to start from correspondence patterns. A correspondence pattern is an abstraction from individual alignment sites distributed over cognate sets drawn from related languages. As I have tried to show in a paper published earlier this year (List 2019), a correspondence pattern summarizes individual alignment sites in an abstract form, where missing data are imputed. I will avoid going into the details here but, as a shortcut, we can say that each correspondence pattern should, in theory, only correspond to one proto-sound in the language, although the same proto-sound may correspond to more than one correspondence pattern. As an example, consider the following table, showing three (fictive) patterns that would all be reconstructed by a *p.

 Proto-Form  L₁  L₂  L₃
 *p  p  p  f
 *p  p  p  p
 *p  b  p  p

To justify that the same proto-sound is reconstructed by a *p in all three patterns, linguists invoke the rule of context, by looking at the real words from which the pattern was derived. An example is shown in the next table.

 Proto-Form  L₁  L₂  L₃
*p i a ŋ  p i a ŋ  p i u ŋ  f a n
*p a t  p a t  p a t  p a t
*a p a ŋ  a b a ŋ  a p a ŋ  a p a n

What you should be able to see from the table is that we can find in all three patterns a conditioning factor that allows us to assume that the deviation from the original *p is secondary. In language L₃, the factor can be found in the palatal environment (followed by the front vowel *i) that we find in the ancestral language. We would assume that this environment triggered the change from *p to f in this language. In the case of the change from *p to b in L₁, the triggering environment is that the p occurs inter-vocalically.

To summarize: what linguists usually do in order to reconstruct proto-forms for ancestral languages that are not attested in written sources, is to investigate the correspondence patterns, and to try to find some neat explanation of how they could have evolved, given a set of proto-forms along with triggering contexts that explain individual changes in individual descendant languages.

Computational strategies for phonological reconstruction

Not many attempts have been made so far to automate the task of reconstruction. The most prominent proposal in this direction has been made by Bouchard-Côté et al. (2013). Their strategy radically differs from the strategy outlined above, since they do not make use of correspondence patterns, but instead use a stochastic transducer and known cognate words in the descendant languages, along with a known phylogenetic tree that they traverse, inferring the most likely changes that could explain the observed distribution of cognate sets.

So far, this method has been tested only on Austronesian languages and their subgroups, where it performed particularly well (with error rates between 0.25 and 0.12, using edit distance as the evaluation measure). Since it is not available as a software package that can be conveniently used and tested on other language families, it is difficult to tell how well it would perform when being presented with more challenging test cases.

In a forthcoming paper, Gerhard Jäger illustrates how classical methods for ancestral state reconstruction applied to aligned cognate sets could be used for the same task (Jäger forthcoming). While Jäger's method is more in line with "linguistic thinking", in so far as he uses alignments, and applies ancestral state reconstructions to each column of the alignments, it does not make use of correspondence patterns, which would be the general way by which linguists would proceed. This may also explain the performance, which shows an error rate of 0.48 (also using edit distance for evaluation) — although this is also due to the fact that the method was tested on Romance languages and compared with Latin, which is believed to be older than the ancestor of all Romance languages.

Problems with computational strategies for phonological reconstruction

Both the method of Bouchard-Côté et al. and the approach of Jäger suffer from the problem of not being able to detect unobserved sounds in the data. Jäger side-steps this problem in theory, by using a shortened alphabet of only 40 characters, proposed by the ASJP project, which encoded more than half of the world's languages in this form. Bouchard-Côté's test data, Proto-Austronesian (and its subgroups), are fairly simple in this regard. It would therefore be interesting to see what would happen if the methods are tested with full phonetic (or phonological) representations of more challenging language families (for example, the Chinese dialects). While Jäger's approach assumes the independence of all alignment sites, Bouchard-Côté's stochastic transducers handle context on the level of bigrams (if I read their description properly). However, while bigrams can be seen as an improvement over ignoring conditioning context, they are not the way in which context is typically handled by linguists. As I tried to explain briefly in last month's post, context in historical linguistics calls for a handling of abstract contexts, for example, by treating sequences as layered entities, similar to music scores.

Apart from the handling of context and unobserved characters, the evaluation measure used in both approaches seems also problematic. Both approaches used the edit distance (Levenshtein 1965), which is equivalent to the Hamming distance (Hamming 1950) applied to aligned sequences. Given the problem of unobserved characters and the abstract nature of linguistic reconstruction systems, however, any measure that evaluates the surface similarity of sequences is essentially wrong.

To illustrate this point, consider the reconstruction of the Indo-European word for sheep by Kortlandt (2007), who gives *ʕʷ e u i s, as compared to Lühr (2008), who gives *h₂ ó w i s. The edit distance between the two forms is the Hamming distance of their (trivial) alignment: they differ in three of five positions, which amounts to an unnormalized edit distance of three and a normalized edit distance of 0.6. While this is pretty high, their systems are mostly compatible, since Kortlandt reconstructs *ʕʷ in most cases where Lühr writes *h₂. Therefore, the distance should be much lower; in fact, it should be zero, since both authors agree on the structure of the form they reconstruct in comparison with the structure of other words they reconstruct for Proto-Indo-European.
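
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the two forms are entered by hand as lists of segments, and the function name `hamming` is my own choice for illustration.

```python
def hamming(a, b):
    """Count the positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# The two reconstructions of 'sheep', segmented as in the text.
kortlandt = ["ʕʷ", "e", "u", "i", "s"]
luehr = ["h₂", "ó", "w", "i", "s"]

distance = hamming(kortlandt, luehr)
normalized = distance / len(kortlandt)
print(distance, normalized)
```

The score only counts mismatching symbols, so it cannot see that the first segment of both forms plays the same role in the two systems.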

Since scholars do not necessarily select phonetic values in their reconstructions that derive directly from the descendant languages, and moreover they may differ often regarding the details of the phonetic values they propose, a valid evaluation of different reconstruction systems (including automatically derived ones) needs to compare the structure of the systems, not their substance (see List 2014: 48-50 for a discussion of structural and substantial differences between sequences).

Currently, there is (to my knowledge) no accepted solution for the comparison of structural differences among aligned sequences. Finding an adequate evaluation measure to compare reconstruction systems can therefore be seen as a sub-problem of the bigger problem of phonological reconstruction. To illustrate why it is so important to compare the structural information and not the pure substance, consider the three cases in which Jäger's reconstruction gives a v as opposed to a w in Latin (data here): while evaluating by the edit distance yields a score of 0.48, this score will drop to 0.47 when replacing the v instances with a w. Jäger's system is doing something right, but the edit distance cannot capture the fact that the system is deviating systematically from Latin, not randomly.

Initial ideas for improvement

There are many things that we can easily improve when working on automatic methods for phonological reconstruction.

As a first point, we should work on enhanced measures of evaluation, going beyond the edit distance as our main evaluation measure. In fact, this can be easily done. With B-Cubed scores (Amigó et al. 2009), we already have a straightforward measure to compare whether two reconstruction systems are structurally identical or similar. In order to apply these scores, the automatic reconstructions have to be aligned with the gold standard. If they are identical, although the symbols may differ, then the scores will indicate this. The problem of comparing reconstruction systems is, of course, more difficult, as we can face cases where systems are not structurally identical (where structural identity means that you can directly replace any symbol a in system A by a symbol a' in system B to produce B from A and vice versa), but B-Cubed scores would be a start.
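
To make the idea more concrete, here is a minimal sketch of B-Cubed precision and recall for two aligned reconstructions, where every symbol is treated as a cluster label for the positions in which it occurs. The function name `bcubed` and the example sequences are assumptions made for illustration (the first five segments echo the sheep example above, the last two are made up); this is not the evaluation code of any of the studies mentioned here.

```python
from collections import defaultdict


def bcubed(labels_a, labels_b):
    """Return B-Cubed precision, recall and F-score of labels_a against labels_b."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of the same length")
    clusters_a, clusters_b = defaultdict(set), defaultdict(set)
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        clusters_a[a].add(i)
        clusters_b[b].add(i)
    precision = recall = 0.0
    for a, b in zip(labels_a, labels_b):
        # Positions that share a cluster with this one in both systems.
        shared = len(clusters_a[a] & clusters_b[b])
        precision += shared / len(clusters_a[a])
        recall += shared / len(clusters_b[b])
    precision /= len(labels_a)
    recall /= len(labels_a)
    return precision, recall, 2 * precision * recall / (precision + recall)


system_a = ["ʕʷ", "e", "u", "i", "s", "ʕʷ", "e"]
system_b = ["h₂", "ó", "w", "i", "s", "h₂", "ó"]   # same structure, other symbols
system_c = ["h₂", "ó", "w", "i", "s", "h₃", "ó"]   # splits one class of system_a

same = bcubed(system_a, system_b)
split = bcubed(system_a, system_c)
print(same, split)
```

In this toy case, the first comparison yields perfect scores although not a single symbol is shared, while the second comparison loses precision because system_c distinguishes two sounds where system_a has only one.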

Furthermore, given that we lack test cases, we might want to work on semi-automatic instead of fully automatic methods, in the meantime. Given that we have a first method to infer sound correspondence patterns from aligned data (List 2019), we can infer all patterns and have linguists annotate each pattern by providing the proto-sound they think would fit best — we are testing this at the moment. Having created enough datasets in this form, we could then think of discussing concrete algorithms that would derive proto-forms from correspondence patterns, and use the semi-automatically created and manually corrected data as gold standard.

Last but not least, one straightforward way by which it is possible to formally create unknown sounds from known data, is to represent sounds as vectors of phonological features instead of bare symbols (e.g. representing p as voiceless bilabial plosive and b as voiced bilabial plosive). If we then compare alignment sites or correspondence patterns for the feature vectors, we could check to what degree standard algorithms for ancestral state reconstructions propose unattested sounds similar to the ones proposed by experts. In order to do this, we would need to encode our data in transparent transcription systems. This is not the case for most current datasets, but with the Cross-Linguistic Transcription Systems initiative we already have a first attempt to provide features for the majority of sounds that we find in the languages of the world (Anderson et al. forthcoming).
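
A toy sketch may show how feature vectors can yield a sound that is not attested in any descendant. Everything here is an assumption made for illustration: the three-feature system is heavily simplified, the reflexes are invented, and a per-feature majority vote merely stands in for a proper ancestral state reconstruction algorithm.

```python
from collections import Counter

# Simplified feature vectors: (voicing, place, manner).
FEATURES = {
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
    "d": ("voiced", "alveolar", "plosive"),
    "β": ("voiced", "bilabial", "fricative"),
}
SYMBOLS = {vector: symbol for symbol, vector in FEATURES.items()}


def majority_features(reflexes):
    """Pick the most frequent value of each feature across the reflexes."""
    vectors = [FEATURES[sound] for sound in reflexes]
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*vectors))


reflexes = ["p", "d", "β"]   # invented reflexes in three languages
proto = majority_features(reflexes)
print(proto, SYMBOLS.get(proto))
```

The vote selects voiced, bilabial and plosive, which corresponds to b, a sound that none of the three languages shows. With bare symbols, the same procedure could only ever return one of p, d or β.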


It is difficult to tell how hard the problem of phonological reconstruction is in the end. Semi-automatic solutions are already feasible now, and we are currently testing them on different (smaller) groups of phylogenetically related languages. One crucial step in the future is to code up enough data to allow for a rigorous testing of the few automatic solutions that have been proposed so far. We are working on that as well. But how to propose an evaluation system that rigorously tests not only to what degree a given reconstruction is identical with a given gold standard, but also structurally equivalent, remains one of the crucial open problems in this regard.

Amigó, Enrique, Gonzalo, Julio, Artiles, Javier and Verdejo, Felisa (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12.4: 461-486.

Anderson, Cormac, Tresoldi, Tiago, Chacon, Thiago Costa, Fehn, Anne-Maria, Walworth, Mary, Forkel, Robert and List, Johann-Mattis (forthcoming) A cross-linguistic Database of Phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting, pp. 1-27.

Blasi, Damián E., Steven Moran, Scott R. Moisik, Paul Widmer, Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

Bouchard-Côté, Alexandre, Hall, David, Griffiths, Thomas L. and Klein, Dan (2013) Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11: 4224–4229.

Fox, Anthony (1995) Linguistic Reconstruction: An Introduction to Theory and Method. Oxford: Oxford University Press.

Hamming, Richard W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29.2: 147–160.

Hill, Nathan W. and List, Johann-Mattis (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.

Hoenigswald, Henry M. (1960) Phonetic similarity in internal reconstruction. Language 36.2: 191-192.

Hrozný, Bedřich (1915) Die Lösung des hethitischen Problems [The solution of the Hittite problem]. Mitteilungen der Deutschen Orient-Gesellschaft 56: 17–50.

Jäger, Gerhard (forthcoming) Computational historical linguistics. Theoretical Linguistics.

Kortlandt, Frederik (2007) For Bernard Comrie.

Levenshtein, V. I. (1965) Dvoičnye kody s ispravleniem vypadenij, vstavok i zameščenij simvolov [Binary codes with correction of deletions, insertions and replacements]. Doklady Akademij Nauk SSSR 163.4: 845-848.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Lühr, Rosemarie (2008) Von Berthold Delbrück bis Ferdinand Sommer: Die Herausbildung der Indogermanistik in Jena. Vortrag im Rahmen einer Ringvorlesung zur Geschichte der Altertumswissenschaften (09.01.2008, FSU-Jena).

Mann, Noel Walter (1998) A Phonological Reconstruction of Proto Northern Burmic. The University of Texas: Arlington.

Meier-Brügger, Michael (2002) Indogermanische Sprachwissenschaft. Berlin and New York: de Gruyter.

Saussure, Ferdinand de (1879) Mémoire sur le Système Primitif des Voyelles dans les Langues Indo- Européennes. Leipzig: Teubner.

Monday, May 20, 2019

Tattoo Monday XVIII

We haven't had any Charles Darwin tree tattoos on this blog for quite a while, so here is a new collection of Darwin's best-known sketch from his Notebooks (the "I think" tree) — for other examples, see Tattoo Monday III, Tattoo Monday V, Tattoo Monday VI, Tattoo Monday IX, and Tattoo Monday XII.

Monday, May 13, 2019

Which airlines serve the best wine?

I have only flown Business Class once, when I got upgraded on a flight from Sydney to Auckland; and I have never flown First Class. So, I don't really care about the so-called Cellars in the Sky, because I get only the vin ordinaire in Economy Class.

However, some people do care about the quality of the beer, wine and spirits served to the high flyers. These include the people at Business Traveller magazine / web site. For more than 30 years, they have handed out annual Cellars in the Sky awards, after evaluating the quality of the wine served to business class and first class passengers on the world's airlines.

Airlines can choose to enter the Awards process provided that they serve wine in business or first class on mid- or long-haul routes. The airlines submit up to two red wines, two white wines, a sparkling wine, and a fortified or dessert wine, from both their business and first class cellars. These wines are assessed and scored (blind) by a panel of independent judges. The awards are based on the average marks for the wines concerned, with separate awards for First Class and Business Class, plus an Overall Award for consistency across both classes.

I have analyzed the data for the Best Overall Cellar for the years 2006 to 2018, inclusive. The number of airlines commended each year varied from 3 to 5 (average 4.0). I simply gave each airline a score scaled from 0–1 depending on its ranking in the awards list. There were 16 airlines mentioned over the 13 years, but I have included only those 10 that appeared in more than one year.

Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, I calculated the similarity of the airlines, based on the awards they received, using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-airline similarities.
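
For readers who want to see what the distance step involves, here is a small sketch. The airline names and scores are invented for illustration and are not the data analyzed in this post; the Neighbor-net step itself is not shown.

```python
from itertools import combinations

# Invented scores (0 to 1) for three airlines over four award years.
scores = {
    "Airline A": [1.0, 0.75, 1.0, 0.5],
    "Airline B": [0.75, 1.0, 0.5, 1.0],
    "Airline C": [0.0, 0.0, 0.25, 0.0],
}


def manhattan(u, v):
    """Sum of the absolute score differences across all years."""
    return sum(abs(x - y) for x, y in zip(u, v))


distances = {
    (a, b): manhattan(scores[a], scores[b]) for a, b in combinations(scores, 2)
}
for pair, value in distances.items():
    print(pair, value)
```

A pairwise distance table of this kind is the input from which a network method derives its picture.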

The resulting network is shown in the graph. Airlines that are closely connected in the network are similar to each other based on when they won their awards, and those airlines that are further apart are progressively more different from each other.

Only one airline received an award in every year: QANTAS, followed by Qatar Airways with 9 out of 13 years. These two airlines are grouped together at the top of the figure. The other airlines are arranged based on which years they won awards. For example, Cathay Pacific won 7 awards, and both Singapore Airlines and British Airways won 5, but they were mostly not in the same years. American Airlines, Air France, Korean Air and Lufthansa each won only 2 awards.

So, if you want to get your money's worth out of your business-class ticket, then it would be a good idea to try QANTAS or Qatar Airways; the hours will pass more quickly with a glass of good wine in your hand.

Monday, May 6, 2019

Corals — a new metaphor for phylogenetic diagrams

A year ago I mentioned a published discussion of the different branching diagrams that have been used for phylogenetic relationships (Tree metaphors and mathematical trees). If we consider the form of the relationship and whether time is involved, we get the following four possible diagram types:

Most current phylogenetic diagrams claim to show sister-group relationships (which means that ancestors are inferred only), with a time-order to the branching sequence. There is a broad range of diagram types in use, both mathematical and metaphorical. For example, the top four in this next diagram are mathematical and the bottom four are metaphorical variants of the above 2x2 table:

The connection between these different diagrams has both conceptual and practical problems, although these seem to be overlooked by most practitioners. This issue has been addressed by János Podani in a paper that is now online:
The Coral of Life. Evolutionary Biology (2019).
To quote from the Abstract:
The Tree of Life (ToL) has been of central importance in the biological sciences, usually understood as a model or a metaphor, and portrayed in various graphical forms to summarize the history of life as a single diagram. If it is seen as a mathematical construct — a rooted graph theoretical tree or, as more recently viewed, a directed network, the Network of Life (NoL) — then its proper visualization is not feasible, for both epistemological and technical reasons. As an overview included in this study demonstrates, published ToLs and NoLs are extremely diverse in appearance and content ... Metaphorical trees are even less useful for the purpose, because ramification is the only property of botanical trees that may be interpreted in an evolutionary or phylogenetic context. This paper argues that corals, as suggested by Darwin in his early notebooks, are superior to trees as metaphors, and may also be used as mathematical models. A coral diagram is useful for portraying past and present life because it is suitable: (1) to illustrate bifurcations and anastomoses, (2) to depict species richness of taxa proportionately, (3) to show chronology, extinct taxa and major evolutionary innovations, (4) to express taxonomic continuity, (5) to expand particulars due to its self-similarity, and (6) to accommodate a genealogy-based, rank-free classification.
It is worth checking out this paper, even if only for the new Coral of Life diagram that is presented in its Figure 3, which synthesizes much of our current knowledge.

Monday, April 29, 2019

Automatic sound law induction (Open problems in computational diversity linguistics 3)

The third problem in my list of ten open problems in computational diversity linguistics is a problem that has (to my knowledge) not even been considered as a true problem in computational historical linguistics, so far. Until now, it has been discussed by colleagues only indirectly. This problem, which I call the automatic induction of sound laws, can be described as follows:
Starting from a list of words in a proto-language and their reflexes in a descendant language, try to find the rules by which the ancestral language is converted into the descendant language.
Note that by "rules", in this context, I mean the classical notation that phonologists and historical linguists use in order to convert a source sound into a target sound in a specific environment (see Hall 2000: 73-75). If we consider the following ancestral and descendant words from a fictive language, we can easily find the laws by which the input should be converted into an output: an a should be changed to an e, an e should be changed to an i, and a k changes to s if followed by an i but not if followed by an a.

Input Output
papa pepe
mama meme
kaka keke
keke sisi
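
The three laws for this fictive language can be sketched as ordered regular-expression substitutions. The order chosen here (e > i first, a > e last) is an assumption of the sketch: if a > e were applied first, its output would be fed into e > i, and papa would wrongly end up as pipi.

```python
import re

# (pattern, replacement) pairs, applied in this order.
RULES = [
    (r"e", "i"),        # e > i
    (r"k(?=i)", "s"),   # k > s / _i
    (r"a", "e"),        # a > e
]


def apply_rules(word, rules=RULES):
    """Apply every rule to the whole word, one rule after the other."""
    for pattern, target in rules:
        word = re.sub(pattern, target, word)
    return word


data = [("papa", "pepe"), ("mama", "meme"), ("kaka", "keke"), ("keke", "sisi")]
for source, expected in data:
    print(source, apply_rules(source), expected)
```

Sound law induction is the reverse task: given only the word pairs, find the rules and their order.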

Short excursus on linguistic notation of sound laws

Based on the general idea of sound change (or sound laws in classical historical linguistics) as some kind of a function by which a source sound is taken as input and turned into a target sound as output, linguists use a specific notation system for sound laws. In the simplest form of the classical sound law notation, this process is described in the form s > t, where s is the source sound and t is the target sound. Since sound change often relies on the specific conditions of the surrounding context (it makes a difference, for example, whether a sound occurs at the beginning or the end of a word), context is added as a condition separated by a /, with an underscore _ referring to the sound in its original phonetic environment. Thus, the phenomenon of voiced stops becoming unvoiced at the end of words in German (e.g. d becoming t), can be written as d > t / _$, where $ denotes the end of a word.

One can see how close this notation comes to regular expressions and, according to many scholars, the rules by which languages change with respect to their sound systems do not exceed the complexity of regular grammars. Nevertheless, sound change notation does differ in the scope and the rules for annotation. One notable difference is the possibility to explain how full classes of sounds change in a specific environment. The German rule of devoicing, for example, generally affects all voiced stops at the end of a word. As a result, one could also annotate it as G > K / _$, where G would denote the sounds [b, d, g] and K their counterparts [p, t, k]. Although we could easily write a single rule for each of the three phenomena here, the rule by which the sounds are grouped into two classes of voiced sounds and their unvoiced counterparts is linguistically more interesting, since it reminds us that the change by which word-final consonants lose the feature of voice is a systemic change, and not a phenomenon applying to some random selection of sounds in a given language.
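
As a rough sketch of how close the notation is to regular expressions, the class rule G > K / _$ can be written as a single substitution in which a translation table pairs each voiced stop with its voiceless counterpart. The function name and the example forms are assumptions made for illustration; the forms are simplified spellings, not phonetic transcriptions.

```python
import re

# G = [b, d, g] and K = [p, t, k], paired position by position.
G_TO_K = str.maketrans("bdg", "ptk")


def final_devoicing(word):
    """G > K / _$ : devoice a voiced stop at the end of the word."""
    return re.sub(r"[bdg]$", lambda match: match.group().translate(G_TO_K), word)


for form in ["rad", "tag", "lob", "bade"]:
    print(form, final_devoicing(form))
```

The character class in the pattern and the translation table together do the work of the class symbols G and K, which is exactly the part that is not standardized across the linguistic literature.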

The problem of this systemic annotation, however, is that the grouping of sounds into classes that change in a similar form is often language-specific. As a result, scholars have to propose new groupings whenever they deal with another language. Since neither the notation of sound values nor the symbols used to group sounds into classes are standardized, it is extremely difficult to compare different proposals made in the literature. As a result, any attempt to solve the problem of automatic sound law induction in historical linguistics would at the same time have to make strict proposals for a standardization of sound law notations used in our field. Standardization can thus be seen as one of the first major obstacles to solving this problem, with the problem of accounting for systemic aspects of sound change as the second one.

Beyond regular expressions

Even if we put the problem of inconsistent annotation and systemic changes to one side, the analogy with regular expressions cannot properly handle all aspects of sound change. When looking at the change from Middle Chinese to Mandarin Chinese, for example, we find a complex pattern, by which originally voiced sounds, like [b, d, g, dz] (among others), were either devoiced, becoming [p, t, k, ts], or devoiced and aspirated, becoming [pʰ, tʰ, kʰ, tsʰ]. While it is not uncommon that one sound can change into two variants, depending on the context in which it occurs, the Mandarin sound change in this case is interesting because the context is not a neighboring sound, but is instead the Middle Chinese tone for the syllable in question: syllables with a flat tone (called píng tone in classical terminology) are nowadays voiceless and aspirated, and syllables with one of the three remaining Middle Chinese tones (called shǎng, qù, and rù) are nowadays plain voiceless (see List 2019: 157 for examples).

Since tone is a feature that applies to whole syllables, and not to single sound segments, we are dealing with so-called supra-segmental features here. As the meaning of the term supra-segmental indicates, the features in question cannot be represented as a sequence of sounds, but need to be thought of as an additional layer, similar to other supra-segmental features in language, including stress, or juncture (indicating word or morpheme boundaries).

In contrast to sequences as we meet them in mathematics and informatics, linguistic sound sequences do not consist solely of letters drawn from an alphabet that is lined up in some unique order. They are instead often composed of multiple layers, which are in part hierarchically ordered. Words, morphemes, and phrases in linguistics are thus multi-layered constructs, which cannot be represented by one sequence alone, but could be more fruitfully thought of as the same as a partitura in music: the score of a piece of orchestral music, in which every voice of the orchestra is given its own sequence of sounds, and all different sequences are aligned with each other to form a whole.

The multi-layered character of sound sequences can be seen as similar to a partitura in musical notation.

This multi-layered character of sound sequences in spoken languages constitutes a third complication for the task of automatic sound law induction. Finding the individual laws that trigger the change of one stage of a language to a later stage cannot (always) be trivially reduced to the task of finding the finite state transducer that translates a set of input strings to a corresponding set of output strings. Since our input word forms in the proto-language are not simple strings, but rather an alignment of the different layers of a word form, a method to induce sound laws needs to be able to handle the multi-layered character of linguistic sequences.
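
A minimal sketch may illustrate what a multi-layered representation buys us for the Mandarin example above. The syllables are invented, the dictionary layout with a segment layer and a tone layer is my own assumption for illustration, and only the initial consonant is considered.

```python
# Voiced initials and their plain voiceless counterparts.
DEVOICED = {"b": "p", "d": "t", "g": "k", "dz": "ts"}


def devoice_initial(syllable):
    """Devoice a voiced initial; add aspiration only in the píng tone."""
    segments = list(syllable["segments"])
    if segments and segments[0] in DEVOICED:
        plain = DEVOICED[segments[0]]
        # The condition is read from the tone layer, not from a neighbor.
        segments[0] = plain + "ʰ" if syllable["tone"] == "píng" else plain
    return {"segments": segments, "tone": syllable["tone"]}


# Invented syllables: a segment layer plus a tone layer each.
flat = {"segments": ["b", "a"], "tone": "píng"}
other = {"segments": ["d", "a", "n"], "tone": "shǎng"}

print(devoice_initial(flat)["segments"])
print(devoice_initial(other)["segments"])
```

A rule that sees only the string of segments has no access to the condition at all, which is why a plain string-to-string transducer is not sufficient here.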

Background for computational approaches to sound law induction

To my knowledge, the question of how to induce sound laws from data on proto- and descendant languages has barely been addressed. What comes closest to the problem are attempts to model sound change from known ancestral languages, such as Latin, to daughter languages, such as Spanish. This is reflected, for example, in the PHONO program (Hartmann 2003), where one can insert data for a proto-language along with a set of sound change rules (provided in a similar form to that mentioned above), which need to be given in a specific order, and are then checked to see whether they correctly predict the descendant forms.

For teaching purposes, I adapted a JavaScript version of a similar system, called the Sound Change Applier² by Mark Rosenfelder from 2012, in which students could try to turn Old High German into modern German, by assigning simple rules as they are traditionally used to describe sound change processes in the linguistic literature. This adaptation compares the attested output with the output generated by a given set of rules, and provides some assessment of the general accuracy of the proposed set of rules. For example, when feeding the system the simple rule an > en /_#, which turns all final instances of -an into -en, 54 out of 517 Old High German words will yield the expected output in modern Standard German.
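
The kind of assessment described here can be sketched in a few lines. This is not the code of the adaptation mentioned above; the function name is my own, and the four word pairs are a small illustrative sample in simplified spelling, not the list of 517 words.

```python
import re


def accuracy(pairs, rules):
    """Share of source words that the ordered rules turn into the attested form."""
    hits = 0
    for source, attested in pairs:
        predicted = source
        for pattern, target in rules:
            predicted = re.sub(pattern, target, predicted)
        hits += predicted == attested
    return hits / len(pairs)


rules = [(r"an$", "en")]   # an > en / _#
pairs = [
    ("geban", "geben"),
    ("helfan", "helfen"),
    ("singan", "singen"),
    ("hūs", "haus"),
]
print(accuracy(pairs, rules))
```

With more rules in the list, the same function could be used to compare competing proposals, although a raw hit rate says nothing about how complex a proposal is.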

The problem with these endeavors is, of course, the handling of exceptions, along with the comparison of different proposals. Since we can think of an infinite number of rules by which we could successfully turn a certain amount of Old High German strings into Standard German strings, we would need to ask ourselves how we could evaluate different proposals. That some kind of parsimony should play a role here is obvious. However, it is by no means clear (at least to me) how to evaluate the complexity of two systems, since the complexity would not only be reflected in the number of rules, but also in the initial grouping of sounds to classes, which is commonly used to account for systemic aspects of sound change. A system accounting for the problem of sound law induction would try to automate the task of finding the set of rules. The fact that it is difficult even to compare two or more proposals based on human assessment further illustrates why I think that the problem is not trivial.

Another class of approaches is that of word prediction experiments, such as the one by Ciobanu and Dinu (2018) (but see also Bodt and List 2019), in which training data consisting of the source and the target language are used to create a model, which is then successively applied to new data, in order to test how well this model predicts target words from the source words. Since the model itself is not reported in these experiments, but only used in the form of a black box to predict new words, the task cannot be considered to be the same as the task for sound law induction — which I propose as one of my ten challenges for computational historical linguistics — given that we are interested in a method that explicitly returns the model, in order to allow linguists to inspect it.

Problems with the current solutions to sound law induction

Given that no real solutions exist to the problem up to now, it seems somewhat useless to point to the problems of current solutions. What I want to mention in this context, however, are the problems of the solutions presented for word prediction experiments, be they fed by manual data on sound changes (Hartmann 2003), or based on inference procedures (Ciobanu and Dinu 2018, Dekker 2018). Manual solutions like PHONO suffer from the fact that they are tedious to apply, given that linguists have to present all sound changes in their data in an ordered fashion, with the program converting them step by step, always turning the whole input sequence into an intermediate output sequence. The word prediction approaches, in turn, suffer from limitations in feature design.

The method by Ciobanu and Dinu (2018), for example, is based on orthographic data alone, using the Needleman-Wunsch algorithm for sequence alignment (Needleman and Wunsch 1970); and the approach by Dekker (2018) only allows for the use of the limited alphabet of 40 symbols proposed by the ASJP project (Holman et al. 2008). In addition to the limited representation of linguistic sound sequences, be it by resorting to abstract orthography or to abstract reduced phonetic alphabets, none of the methods can handle those kinds of contexts which result from the multi-layered character of speech. Since we know well that these aspects are vital for certain phenomena of sound change, the methods exclude from the beginning an aspect that traditional historical linguists, who might be interested in an automatic solution to the sound law induction problem, would put at the top of their wish-list of what the algorithm should be able to handle.

Why is automatic sound law induction difficult?

The handling of supra-segmental contexts, mentioned above, is in my opinion also the reason why sound law induction is so difficult, not only for machines, but also for humans. I have so far mentioned three major problems as to why I think sound law induction is difficult. First, we face problems in defining the task properly in historical linguistics, due to a significant lack in standardization. This makes it difficult to decide on the exact output of a method for sound law induction. Second, we have problems in handling the systemic aspect of sound change properly. This does not apply only to automatic approaches, but also to the evaluation of different proposals for the same data proposed by humans. Third, the multi-layered character of speech requires an enhanced modeling of linguistic sequences, which cannot be modeled as mono-dimensional strings alone, but should rather be seen as alignments of different strings representing different layers (tonal layer, stress layer, sound layer, etc.).

How humans detect sound laws

There are only a few examples in the literature where scholars have tried to provide detailed lists of sound changes from proto- to descendant language (Baxter 1992, Newman and Raman 1999). Most individual sound laws proposed in the literature have not even been tested exhaustively on the data. As a result, it is difficult to assess what humans usually do in order to detect sound laws. What is clear is that historical linguists who have been working a lot on linguistic reconstruction tend to acquire a very good intuition that helps them to quickly apply sound laws to word forms in their head and derive the output forms. This ability is developed in a learning-by-doing fashion, with no specific techniques ever being discussed in the classroom, which reflects the general tendency in historical linguistics to trust that students will learn how to become good linguists from examples, sooner or later (Schwink 1994: 29). For this reason, it is difficult to take inspiration from current practice in historical linguistics in order to develop computer-assisted approaches to solve this task.

Potential solutions to the problem

What can we do in order to address the problem of sound law induction in automatic frameworks in the future?

As a first step, we would have to standardize the notation system that we use to represent sound changes. This would need to come along with a standardized phonetic transcription system. Scholars often think that phonetic transcription is standardized in linguistics, specifically due to the use of the International Phonetic Alphabet. As our investigations into the actual application of the IPA have shown, however, the IPA cannot be seen as a standard, but rather as a set of recommendations that are often only loosely followed by linguists. First attempts to standardize phonetic transcription systems for the purpose of cross-linguistic applications have, however, been made, and will hopefully gain more acceptance in the future (Anderson et al. forthcoming,

As a second step, we should invest more time in investigating the systemic aspects of language change cross-linguistically. What I consider important in this context is the notion of distinctive features by which linguists try to group sounds into classes. Since feature systems proposed by linguists differ greatly, with some debate as to whether features are innate and the same for all languages, or instead language-specific (see Mielke 2008 for an overview on the problem), a first step would again consist of making the data comparable, rather than trying to decide in favour of one of the numerous proposals in the literature.

As a third step, we need to work on ways to account for the multi-layered aspect of sound sequences. Here, a first proposal, labelled "multi-tiered sequence representation", has already been made by myself (List and Chacon 2015), based on an idea that I had already used for the phonetic alignment algorithm proposed in my dissertation (List 2014), which itself goes back to the handling of hydrophilic sequences in ClustalW (Thompson et al. 1994). The idea is to define a sound sequence as a sequence of vectors, with each vector (called tier) representing one distinct aspect of the original word. As this representation allows for an extremely flexible modeling of context — which would just consist of an arbitrary number of vector dimensions that could account for aspects such as tone, stress, preceding or following sounds — this representation would allow us to treat words as sequences of sounds while at the same time accounting for their multi-layered structure. Although there remain many unsolved aspects on how to exploit this specific model for phonetic sequences to induce sound laws from ancestor-descendant data, I consider this to be a first step in the direction of a solution to the problem.

Multi-tiered sequence representation for a fictive word in Middle Chinese.
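
To make the idea of tiers more concrete, here is a minimal Python sketch of such a representation. It is not the implementation from List and Chacon (2015); the function names, the tier names ("sound", "preceding", "following", "tone"), the boundary symbols "#" and "$", and the toy word and rule are all hypothetical and only serve to illustrate how context can be read off parallel vectors.

```python
def multi_tiered(segments, tone=None):
    """Represent a word as parallel tiers (vectors of equal length).

    Hypothetical tier names: 'sound', 'preceding', 'following', 'tone'.
    '#' and '$' are assumed markers for the word boundaries.
    """
    segs = list(segments)
    n = len(segs)
    tiers = {
        "sound": segs,
        "preceding": [segs[i - 1] if i > 0 else "#" for i in range(n)],
        "following": [segs[i + 1] if i < n - 1 else "$" for i in range(n)],
    }
    if tone is not None:
        # A supra-segmental property is simply repeated for every position.
        tiers["tone"] = [tone] * n
    return tiers


def as_vectors(tiers):
    """Read the tiers column by column: one context dict per position."""
    names = list(tiers)
    return [dict(zip(names, column)) for column in zip(*tiers.values())]


def apply_rule(tiers, conditions, replacement):
    """Replace the sound wherever all tier conditions hold at a position."""
    result = []
    for i, sound in enumerate(tiers["sound"]):
        if all(tiers[name][i] == value for name, value in conditions.items()):
            result.append(replacement)
        else:
            result.append(sound)
    return result


# Toy word (not real Middle Chinese data) with an invented tone label "A".
word = multi_tiered(["k", "i", "n"], tone="A")
print(as_vectors(word)[0])

# Invented rule: k becomes tɕ before i, but only in tone "A".
print(apply_rule(word, {"sound": "k", "following": "i", "tone": "A"}, "tɕ"))
```

Because every condition is just a lookup in one of the tiers, adding a further layer (for example stress) would not change the rule mechanism; it would only add one more vector.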


Although it is not necessarily recognized by the field as a real problem of historical linguistics, I consider the problem of automatic sound law induction as a very important problem for our field. If we could infer sound laws from a set of proposed proto-forms and a set of descendant forms, then we could use them to test the quality of the proto-forms themselves, by inspecting the sound laws proposed by a given system. We could also compare sound laws across different language families to see whether we find cross-linguistic tendencies.

Having inferred enough cross-linguistic data on sound laws represented in unified models for sound law notation, we could also use the rules to search for cognate words that have so far been ignored. There is a lot to do, however, until we reach this point. Starting to think about automatic, and also manual, induction of sound laws as a specific task in computational historical linguistics can be seen as a first step in this direction.

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp 1-27.

Baxter, William H. (1992) A handbook of Old Chinese Phonology. Berlin: de Gruyter.

Bodt, Timotheus A. and List, Johann-Mattis (2019) Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages. 1-22. [Preprint, under review, not peer-reviewed]

Ciobanu, Alina Maria and Dinu, Liviu P. (2018) Simulating language evolution: A tool for historical linguistics. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp 68-72.

Dekker, Peter (2018) Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks. University of Amsterdam: Amsterdam.

Hall, T. Alan (2000) Phonologie: Eine Einführung. Berlin and New York: de Gruyter.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp 606-609.

Holman, Eric W. and Wichmann, Søren and Brown, Cecil H. and Velupillai, Viveka and Müller, André and Bakker, Dik (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3: 116-121.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper, presented at the workshop Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE] (2015/09/04, Leiden, Societas Linguistica Europaea).

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Mielke, Jeff (2008) The Emergence of Distinctive Features. Oxford: Oxford University Press.

Needleman, Saul B. and Wunsch, Christian D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Newman, John and Raman, Anand V. (1999) Chinese Historical Phonology: Compendium of Beijing and Cantonese Pronunciations of Characters and their Derivations from Middle Chinese. München: LINCOM Europa.

Schwink, Frederick (1994) Linguistic Typology, Universality and the Realism of Reconstruction. Washington: Institute for the Study of Man.

Thompson, J. D. and Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.

Monday, April 22, 2019

The 2nd Amendment does more than keep King George away

A year ago, in the aftermath of the Florida shooting, I used a neighbor-net as a way to visualize U.S. gun legislation (see the first graph here). In this post, we will use this network to explore some other aspects of American society.

A network illustrating the diversity in U.S. gun legislation. Blue stars – states with a gun registry.

The network picture emphasizes those states where guns are regulated to some extent (in green), but this means that the states at the bottom-left have little or no regulation of gun ownership. Note, first, that the U.S. gun lobby argues that the absence of any gun control is covered by the 2nd Amendment to the U.S. Constitution, which covers the right of citizens to form a "well regulated militia", an amendment installed to protect the freedom of the new republic from the former British sovereign (ie. to "keep King George away").

This claim ignores the fact that "well regulated" implies regulation of some sort, while the network emphasizes its absence in many cases. Besides, the risk of being re-conquered by Her Majesty's Royal Army is quite low these days, with or without Brexit. More to the point, the world itself has changed quite a bit since the 1700s, while the Constitution has had only a few Amendments added and subtracted.

If we start our use of the neighbor-net to look at the data, then we can see that there is at least one obvious consequence of unregulated gun ownership. For example, the next plot shows the number of gun-related deaths (in 2016) super-imposed on the gun-regulation network.

The total number of firearm-related deaths in 2016 (includes accidents and suicides.
Data from; this and more plots can be found here:
Visualising U.S. gun legislation, and mapping politics, economics, and population)

There seems to be a good correlation between unregulated gun ownership and the probability of getting shot or shooting yourself — the number of shootings is greatest in the lower-left of the network, where gun ownership is essentially unregulated (see the Gun Violence Archive for current numbers).

Arming every citizen may have helped to fend off King George's Redcoats, but in the long run, a substantial number of Americans (c. 275,000 per year; when compared with Canada's rate) would still be alive if the Colonies had become one of HRM's dominions like Australia or Canada. Both Canadians and Australians own a lot of firearms per capita (see the Small Arms Survey for up-to-date estimates), but while Canada has long had Europe-style legislation (and low casualty frequencies), Australia implemented such legislation more recently, leading to a massive drop in firearm-related deaths (see above).

As a side note, arming every male citizen to secure freedom from a feudal lord was probably a Swiss invention (see the Swiss Federal Charter of 1291, the Bundesbrief). Switzerland has a compulsory general draft of young males; after this service they take their Sturmgewehr back home for the yearly training exercise, and to be prepared to fend off invaders (until 2007, including the ammunition). They have a roughly four-times lower rate of firearm-related deaths (2.8 in 2015 according to; nearly all of them males). The only EU country approaching the lowest U.S. values is Finland, and there the deaths are nearly exclusively accidents and suicides.

Other factors

It is important to keep in mind that the United States is a true federation of states, with each state having a substantial amount of autonomy, which is not found in any other country with a federal organization. Hence, many other aspects differ between states, not just the substantial differences in gun legislation.

For example, economics differ greatly between the states, and this also shows a reasonable correlation with gun regulation, as seen in this next version of the network. Note that Gross Domestic Product (GDP) is a monetary measure of the market value of all the goods and services produced annually — rich places have high GDP and poor places have lower GDP.

Real gross domestic product per capita mapped on the gun-legislation-based network.
Red, below global U.S. value; green above global U.S. value.
Data source: U.S. Bureau of Economic Analysis.

So, the economically poorer the state, the less likely there is to be gun regulation.

Modern developments include allowing women into the armed forces, and granting them the right to vote. For example, the 19th Amendment to the US Constitution, which was passed by Congress on June 4, 1919, and ratified on August 18, 1920, granted women the right to vote. This first map shows the situation for the European Union, some parts of which lagged behind the U.S.

Implementation of general right to vote within the countries of the EU (source: Süddeutsche Zeitung).
In the case of Germany and France, the reason was a lost war leading to the (re)establishment of new republics.

Women make up about 50% of the populace and (usually) more than 50% of the electorate (having a generally higher life expectancy), but they are still typically under-represented in parliaments (here are a few examples). The United States is, sadly, a good example of this imbalance. This next map shows that the women in 13 states currently have no same-sex representation in the U.S. Congress.

Female representation in the current U.S. Congress.
The green part of each pie chart indicates the proportion of women representatives.

This leads to the obvious question for this blog post: how does the absence of female representatives (and senators) relate to the absence of gun regulation? So, let's map the above collection of pie charts onto the gun legislation network.

Female representation in the U.S. Congress after 2018 mid-term elections
(includes Senate and House of Representatives).
The c. 700,000 inhabitants of DC, District of Columbia, have no representation in
Congress at all, but send a non-voting delegate to the House.

There is a general trend: those states with little or no gun regulation (bottom left) have less female representation than those with (some) gun regulation. Perhaps someone took the 2nd Amendment a bit too literally (the right of every man to carry a gun), and this keeps not only King George away from the country but also women away from Congress?

Exceptions to this generalization (from 75% down to 33%) are sparsely populated states with only a few members of Congress: New Hampshire (NH; 75%; 2 representatives in addition to the two U.S. senators representing each state), Maine (ME; 2 reps), West Virginia (WV; 3 reps), Alaska (AK; 1 rep), New Mexico (NM; 3 reps), and Nevada (NV; 4 reps). All of these states have one thing in common: a substantial proportion of the state is wilderness.

At the other end, some states with relatively high levels of gun regulation, like Maryland (MD; 8 reps), Rhode Island (RI; 2 reps), New Jersey (NJ; 12 reps) and Colorado (CO; 7 reps), lack women in Congress (0–15%, ie. one representative or none). This may relate to these states being very densely populated (MD, RI, NJ), and, irrespective of outside threats, no-one wants their close neighbors running around with guns. Colorado is particular in this sense, because with Denver it includes a major population center (the nucleus of the emerging Front Range megaregion), which enforced much stricter gun regulation than found elsewhere in the state.

A map showing Colorado's congressional districts, for the 113th Congress.
Data from the defunct digital version of the U.S. National Atlas.

Do more women in parliament save American lives?

According to a recent Gallup poll, Americans have the highest regard for nurses, a profession mostly occupied by women, and the lowest regard for Members of Congress, a profession mostly occupied by men. Hence, it would make sense to explore the data the other way around. We will explore this in a later post.