Monday, August 19, 2019

Phylogenetics of chain letters?

The general public and the general media often have no idea what biologists mean by the word "evolution". The word has two possible meanings, and they usually pick the wrong one. Niles Eldredge tried to clarify the situation by referring to them as:
  • transformational evolution — the change in a group of objects resulting from a change in each object (often attributed to Lamarck)
  • variational evolution - the change in a group of objects resulting from a change in the proportion of different types of objects (usually attributed to Darwin).
Charles Darwin changed biology by pointing out that changes in species occur via the latter mechanism, not the former, which had been the predominant previous idea. Sadly, 160 years later, the idea of transformational evolution still seems to prevail in the minds of the general public and the people writing for them.

So, it was with some trepidation that I looked at an article in Scientific American called Chain letters and evolutionary histories (by Charles H. Bennett, Ming Li and Bin Ma. June 2003, pp. 76-81). It was subtitled: "A study of chain letters shows how to infer the family tree of anything that evolves over time, from biological genomes to languages to plagiarized schoolwork."

The "taxa" in their study consist of 33 different chain letters, collected during the period 1980–1995 (8 other letters were excluded), covering the diversity of chain letters as they existed before internet spam became widespread. These letters can be viewed on the Chain Letters Home Page.

The main issue with this study is that there are no clearly defined characters from which the phylogeny could be constructed. The authors therefore resort to creating a pairwise distance matrix among the taxa, in a manner (compression) that I have criticized before (Non-model distances in phylogenetics). I have also discussed previous examples where this approach has been used, notably Phylogenetics of computer viruses? and Multimedia phylogeny?

The essential problem, as I see it, is that without a model of character change there is no reliable way to separate phylogenetic information from any other type of information. That is, phylogenetic similarity is a special type of similarity. It is based on the idea of shared derived character states, as these are the only things that are informative about a phylogeny.

Compression, on the other hand, is a general sort of similarity, based on the idea of information complexity. This presumably will contain some useful phylogenetic information, but it will also contain a lot of irrelevance — for example, shared ancestral character states, which are uninformative at best and positively misleading at worst.
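
To make the idea of a compression-based distance concrete, here is a minimal sketch. It uses the normalized compression distance with Python's zlib compressor, which is my assumption for illustration; it is not necessarily the compressor or formula that the authors used. The three "letters" are invented toy texts, not data from the study.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: small for similar texts, larger for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented toy "letters", purely for illustration (not from the study).
letters = {
    "A": "Kiss someone you love when you get this letter and make magic. " * 3,
    "B": "Kiss somebody you love when you receive this letter and make magic. " * 3,
    "C": "Quarterly rainfall totals for the northern district were below average. " * 3,
}

# Pairwise distance matrix among the "taxa".
names = sorted(letters)
matrix = {(a, b): ncd(letters[a], letters[b]) for a in names for b in names}

for a in names:
    print(a, [round(matrix[(a, b)], 2) for b in names])
```

Note that the distance treats every shared string alike, whether it is a shared derived state or a shared ancestral one, which is exactly the concern raised above.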

So, the authors can easily produce an unrooted tree from their similarity matrix, which they then proceed to root at one of the letters that they collected early on in their study. This tree is shown here.

However, whether this diagram represents a phylogeny is unknown.

Nevertheless, that does not stop us using an unrooted phylogenetic network as a form of exploratory data analysis, as we have done so often in this blog. This is not intended to produce a rooted evolutionary history, but instead merely to summarize the multivariate information in a comprehensible (and informative) manner. This might indicate whether we are likely to be able to reconstruct the phylogeny. In this case, I have used a NeighborNet to display the similarity matrix, as shown next.

Phylogenetic network of chain letters

It is easy to see that the relationships among the letters are not particularly tree-like. Moreover, the long terminal edges emphasize that much of the complexity information is not shared among the letters, while the shared information is distinctly net-like. So, a simple "phylogenetic tree" (as shown above) is not likely to be representative of the actual evolutionary history.

However, there are actually a few reasonably well-defined groups among the taxa: one at the top, one at the right, and several at the bottom of the network. There are also letters of uncertain affinity, such as L2, L23, L13 and L31. These may reflect phylogenetic history, even though that history is hard to untangle.

Finally, it is worth noting that the history of chain letters, dating back to the 1800s, is discussed in detail by Daniel W. VanArsdale at his Chain Letter Evolution web pages.

Monday, August 12, 2019

Public transit trips in the USA

Public transport, or mass transit, has long been a politically charged issue, throughout the world. However, the modern world now recognizes that it is an effective way to deal with mass movements of people in a manner that respects the use of non-renewable resources.

After all, the only way to continue with autonomous transportation is to get rid of fossil fuels. However, electric cars will not be of much use until we work out where we are going to get all of the needed extra electricity, in a manner that is environmentally friendly. There is not much point in simply moving the burning of fossil fuels from the vehicle (ie. gasoline) to a power station that also burns fossil fuels (eg. coal). There is also a limit to how many rivers there are left to dam for hydroelectric power; and nuclear reactors have gone out of fashion (fortunately). There is also, of course, the matter of how we are going to recycle the used (lithium-ion) batteries from the cars, which is apparently a tougher proposition than recycling the electric motors themselves.

So, until we sort this out, mass transit is a viable option for most conurbations. In this context, a conurbation (or a metropolitan area) is a contiguous area within which large numbers of people move regularly, especially traveling to and from their workplace each weekday. A conurbation often involves multiple cities and towns, as defined by political administrations or contiguous urban development — many people live in one urban area but work in another.

So, naturally, governments collect data on these matters. One such data collection is the U.S. Department of Transportation's National Transit Database. The data consist of "sums of annual ridership (in terms of unlinked passenger trips), as reported by transit agencies to the Federal Transit Administration." Data for three separate modes of transit are included: bus, rail, and paratransit. The data currently cover the years 2002–2018, inclusive.

To look at the data for the 42 U.S. conurbations included, for the year 2018, I have performed this blog's usual exploratory data analysis. I first calculated the transit rate per person, by dividing the annual number of trips for each of the three modes by the conurbation population size. Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network. For this network analysis, I calculated the similarity of the conurbations using the Manhattan distance. A Neighbor-net analysis was then used to display the between-area similarities.
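
The two calculations described above (per-person trip rates, then Manhattan distances between areas) can be sketched as follows. The area names and all of the figures are hypothetical, invented for illustration; they are not values from the National Transit Database.

```python
MODES = ("bus", "rail", "paratransit")

# Hypothetical figures, invented for illustration (not National Transit Database values).
areas = {
    "Area A": {"population": 1_000_000, "bus": 50_000_000, "rail": 30_000_000, "paratransit": 1_000_000},
    "Area B": {"population": 2_000_000, "bus": 20_000_000, "rail": 4_000_000, "paratransit": 2_000_000},
    "Area C": {"population": 500_000, "bus": 5_000_000, "rail": 0, "paratransit": 500_000},
}


def trip_rates(record):
    """Annual trips per person for each transit mode."""
    return [record[mode] / record["population"] for mode in MODES]


def manhattan(u, v):
    """Manhattan (city-block) distance: the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


rates = {name: trip_rates(record) for name, record in areas.items()}
distances = {(a, b): manhattan(rates[a], rates[b]) for a in rates for b in rates}

for (a, b), d in distances.items():
    if a < b:
        print(f"{a} vs {b}: {d:.1f}")
```

The resulting distance matrix is what would be handed to a Neighbor-net program to draw the network.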

The resulting network is shown in the graph. Conurbations that are closely connected in the network are similar to each other based on the trip rates, and those areas that are further apart are progressively more different from each other. In this case, there is a simple gradient from the busiest mass transit systems at the top of the network to the least busy at the bottom.

The network shows us that the New York – Newark transit-commuting area (which covers part of three states) is far and away the busiest in the USA. The subway system dominates this mass transit, of course, as it is justifiably world famous, although not always for the best of reasons as far as commuters are concerned.

The San Francisco – Oakland area is in clear second place. Here, bus transit slightly exceeds rail transit. Then follow Washington DC and Boston, both of which also cover parts of three states. In Boston trains out-do buses 2:1, while in Washington it is closer to 1.5:1.

Next comes a group of four conurbations: Chicago, Philadelphia, Portland and Seattle. Two of these cover part of Washington, but in quite different ways: in Seattle the buses dominate the system 5:1 but in Portland it is only 1.5:1. Chicago and Philadelphia share buses and trains pretty equally.

At the bottom of the network there are two large groups of conurbations, one of which does slightly better than the other at mass transit use. The least-used system is that of San Juan, in Puerto Rico, perhaps not unexpectedly. Of the contiguous U.S. states, Indianapolis (IN) has the least used system, followed by Memphis (TN–MS–AR).

Moving on, we could also look at changes in the total number of transit trips (irrespective of mode) during the period for which data are available: 2002–2018. A network is of little help here. So, it is simplest just to plot the data, as shown in the next graph.

For most of the metropolitan areas there is little in the way of consistent change through time. However, there are some areas that show high correlations between the number of trips and time. These are the areas that have shown the most consistent increase in the number of transit trips from 2002–2018:
  • Chicago (IL–IN)
  • Tampa – St Petersburg (FL)
  • Baltimore (MD)
  • Denver – Aurora (CO)
  • San Francisco – Oakland (CA)
  • Memphis (TN–MS–AR)
  • San Diego (CA)
  • Cleveland (OH)
  • Providence (RI–MA)
  • Orlando (FL)
  • Indianapolis (IN)
  • New York – Newark (NY–NJ–CT)
  • Portland (OR–WA)
  • Minneapolis – St Paul (MN–WI)
Sadly, there are also areas that have shown a consistent decrease in the number of transit trips through time (2002–2018):
  • Kansas City (MO–KS)
  • Columbus (OH)
  • Riverside – San Bernardino (CA)
Presumably these are the areas where the local politicians should be looking into how to address this long-term issue.
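
As a rough illustration of what "high correlation between the number of trips and time" means, here is a minimal sketch using the Pearson correlation coefficient. The choice of Pearson's coefficient and the cut-off of 0.8 are my assumptions for the example, and the three series are invented, not taken from the transit data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def classify(r, threshold=0.8):
    """Label a trend; the threshold is an arbitrary, illustrative cut-off."""
    if r >= threshold:
        return "increasing"
    if r <= -threshold:
        return "decreasing"
    return "no consistent trend"


years = list(range(2002, 2019))  # 2002-2018 inclusive

# Invented annual trip totals (in millions), for illustration only.
series = {
    "rising": [100 + 3 * i for i in range(17)],
    "falling": [100 - 2 * i for i in range(17)],
    "flat": [100 + (5 if i % 2 else -5) for i in range(17)],
}

for name, trips in series.items():
    r = pearson(years, trips)
    print(f"{name}: r = {r:.2f} ({classify(r)})")
```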

Declining transit numbers is a topic discussed around the web; for example: Transit ridership down in most American cities. This article has a graph neatly showing the change in transit numbers from 2017 to 2018. It shows marked decreases, particularly for bus trips, while the few increases almost all involved rail travel. Is this a short-term effect, or the start of a general long-term decline?

Monday, August 5, 2019

Tattoo Monday XIX

Here are two more (large) Charles Darwin tree tattoos, based on his best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday V, Tattoo Monday VI, Tattoo Monday IX, Tattoo Monday XII, and Tattoo Monday XVIII.

Monday, July 29, 2019

Simulation of sound change (Open problems in computational diversity linguistics 6)

The sixth problem in my list of open problems in computational diversity linguistics is devoted to the problem of simulating sound change. When formulating the problem, it is difficult to see what is actually meant, as there are two possibilities for a concrete simulation: (i) one could think of a sound system of a given language and then model how, through time, the sounds change into other sounds; or (ii) one could think of a bunch of words in the lexicon of a given language, and then simulate how these words are changed through time, based on different kinds of sound change rules. I have in mind the latter scenario.

Why simulating sound change is hard

The problem of simulating sound change is hard for several reasons. First of all, the problem is similar to the problem of sound law induction, since we have to find a simple and straightforward way to handle phonetic context (remember that sound change may often only apply to sounds that occur in a certain environment of other sounds). This is already difficult enough, but it could be handled with the help of what I called multi-tiered sequence representations (List and Chacon 2015). However, there are four further problems that one would need to overcome (or at least be aware of) when trying to successfully simulate sound change.

The first of these extra problems is that of morphological change and analogy, which usually goes along with "normal" sound change, following what Anttila (1976) calls Sturtevant's paradox, namely that regular sound change produces irregularity in language systems, while irregular analogy produces regularity in language systems. In historical linguistics, analogy serves as a cover-term for various processes in which words or word parts are rendered more similar to other words than they had been before. Classical examples are children's "regular" plurals of nouns like mouse (eg. mouses instead of mice) or "regular" past tense forms of verbs like catch (eg. catched instead of caught). In all these cases, perceived irregularities in the grammatical system, which often go back to ancient sound change processes, are regularized on an ad-hoc basis.

One could (maybe one should), of course, start with a model that deliberately ignores processes of morphological change and analogical leveling, when drafting a first system for sound change simulation. However, one needs to be aware that it is difficult to separate morphological change from sound change, as our methods for inference require that we identify both of them properly.

The second extra problem is the question of the mechanism of sound change, where competing theories exist. Some scholars emphasize that sound change is entirely regular, spreading over the whole lexicon (or changing one key in the typewriter), while others claim that sound change may slowly spread from word to word and at times not reach all words in a given lexicon. If one wants to profit from simulation studies, one would ideally allow for a testing of both systems; but it seems difficult to model the idea of lexical diffusion (Wang 1969), given that it should depend on external parameters, like frequency of word use, which are also not very well understood.
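
The contrast between the two mechanisms can be made concrete with a deliberately crude sketch. Everything in it is an assumption made for illustration: the toy lexicon is invented, each character stands for one sound, the change (p > f) ignores phonetic context, and the per-word adoption rate is a stand-in for the poorly understood external parameters (such as frequency of use) mentioned above.

```python
import random


def regular_change(lexicon, source, target):
    """Regular sound change: every word containing the sound is affected at once."""
    return [word.replace(source, target) for word in lexicon]


def diffusion_step(lexicon, source, target, rate, rng):
    """One generation of lexical diffusion: each eligible word adopts the change
    independently with probability `rate`; the other words stay as they are."""
    return [
        word.replace(source, target) if source in word and rng.random() < rate else word
        for word in lexicon
    ]


# Invented toy forms, one character per sound (illustration only).
lexicon = ["pater", "pisk", "ped", "kapra", "tres"]

print("regular:  ", regular_change(lexicon, "p", "f"))

rng = random.Random(42)  # fixed seed so that the run is repeatable
state = lexicon
for generation in range(1, 6):
    state = diffusion_step(state, "p", "f", rate=0.3, rng=rng)
    print(f"diffusion, generation {generation}:", state)
```

With a rate of 1.0 the diffusion model collapses into regular change, which is one way to test both systems within a single framework.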

The third extra problem is that of the actual tendencies of sound change, which are also by no means well understood by linguists. Initial work on sound change has been carried out (Kümmel 2008). However, the major work of finding a way to compare the major tendencies of sound change processes across a large sample of the world's languages (ie. the typology of sound change, which I plan to discuss separately in a later post), has not been carried out so far. The reason why we are missing this typology is that we lack clear-cut machine-readable accounts of annotated, aligned data. In such accounts, scholars would provide their proto-forms for the reconstructed languages along with their proposed sound laws in a system that can in fact be tested and run (which would also allow us to estimate the exceptions, or where those systems fail).

But having an account of the tendencies of sound change opens a fourth important problem, apart from the lack of data that we could use to draw a first typology of sound change processes: sound change tendencies are initiated not only by the general properties of speech sounds, but also by the linguistic systems in which these speech sounds are employed. While scholars occasionally mention this, there have been no real attempts to separate the two aspects in a concrete reconstruction of a particular language. The typology of sound change tendencies could thus not simply stop at listing tendencies resulting from the properties of speech sounds, but would also have to find a way to model diverging tendencies because of systemic constraints.

Traditional insights into the process of sound change

When discussing sound change, we need to distinguish mechanisms, types, and patterns. Mechanisms refer to how the process "proceeds", the types refer to the concrete manifestations of the process (like a certain, concrete change), and patterns reflect the systematic perspective of changes (i.e. their impact on the sound system of a given language, see List 2014).

Figure 1: Lexical diffusion

The question regarding the mechanism is important, since it refers to the dispute over whether sound change happens simultaneously for the whole lexicon of a given language (that is, whether it reflects a change in the inventory of sounds), or whether it jumps from word to word, as proposed by the defenders of lexical diffusion, whom I mentioned above (see also Chen 1972). While probably nobody nowadays would deny that sound change can proceed as a regular process (Labov 1981), it is less clear to what degree the idea of lexical diffusion can be confirmed. Technically, the theory is dangerous, since it allows a high degree of freedom in the analysis, which can have a deleterious impact on the inference of cognates (Hill 2016). But this does not mean, of course, that the process itself does not exist. In these two figures, I have tried to contrast the different perspectives on the phenomena.

Figure 2: Regular sound change

To gain a deeper understanding of the mechanisms of sound change, it seems indispensable to work more on models trying to explain how it is actuated after all. While most linguists agree that synchronic variation in our daily speech is what enables sound change in the first place, it is not entirely clear how certain new variants are fixed in a society. Interesting theories in this context have been proposed by Ohala (1989), who describes distinct scenarios in which sound change can be initiated either by the speaker or by the listener, which would in theory also yield predictable tendencies with respect to the typology of sound change.

The insights into the types and patterns of sound change are, as mentioned above, much more rudimentary, although one can say that most historical linguists have a rather good intuition with respect to what is possible and what is less likely to happen.

Computational approaches

We can find quite a few published papers devoted to the simulation of certain aspects of sound change, but so far, we do not (at least to my current knowledge) find any comprehensive account that would try to feed some 1,000 words to a computer and see how this "language" develops: which sound laws can be observed to occur, and how they change the shape of the given language. What we find, instead, are a couple of very interesting accounts that try to deal with certain aspects of sound change.

Winter and Wedel (2016), for example, test agent-based exemplar models in order to see how systems maintain contrast despite variation in the realization (Hamann 2014: 259f gives a short overview of other recent articles). Au (2008) presents simulation studies that aim to test to what degree lexical diffusion and "regular" sound change interact in language evolution. Dediu and Moisik (2019) investigate, with the help of different models, to what degree the vocal tract anatomy of speakers may have an impact on the actuation of sound change. Stevens et al. (2019) present an agent-based simulation to investigate the change of /s/ to /ʃ/ in English.

This summary of literature is very eclectic, especially because I have only just started to read more about the different proposals out there. What is important for the problem of sound change simulation is that, to my knowledge, there is no approach yet ready to run the full simulation of a given lexicon for a given language, as stated above. Instead, the studies reported so far have a much more fine-grained focus, specifically concentrating on the dynamics of speaker interaction.

Initial ideas for improvement

I do not have concrete ideas for improvement, since the problem's solution depends on quite a few other problems that would need to be solved first. But to address the idea of simulating sound change, albeit only in a very simplifying account, I think it will be important to work harder on our inferences, by making transparent what so far is only implicitly stored in the heads of the many historical linguists in the form of what they call their intuition.

During the past 200 years, after linguists started to apply the mysterious comparative method that they had used successfully to reconstruct Indo-European on other language families, the amount of data and number of reconstructions for the world's languages has been drastically increasing. Many different language families have now been intensively studied, and the results have been presented in etymological dictionaries, numerous books and articles on particular questions, and at times even in databases.

Unfortunately, however, we rarely find attempts by scholars to actually provide their findings in a form that would allow us to check the correctness of their predictions automatically. I am thinking in very simple terms here: a scholar who proposes a reconstruction for a given language family should deliver not only the proto-forms with the reflexes in the daughter languages, but also a detailed test of how the proposed sound laws, by which the proto-forms change into the daughter languages, produce the reflexes.

While it is clear that this could not be easily implemented in the past, it is in fact possible now, as we can see from a couple of studies where scholars have tried to compute sound change (Hartmann 2003, Pyysalo 2017, see also Sims-Williams 2018 for an overview of further literature). Although these attempts are unsatisfying, given that they do not account for cross-linguistic comparability of data (eg. they use orthographies rather than unified transcriptions, as proposed by Anderson et al. 2018), they illustrate that it should in principle be possible to use transducers and similar technologies to formally check how well the data can be explained under a certain set of assumptions.
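
To show in the simplest possible terms what such a formal check could look like, here is a minimal sketch. It is not a finite-state transducer and it does not reproduce any of the systems cited above; it merely applies an ordered list of regular-expression rewrite rules to proto-forms and reports the words where the derived form differs from the attested reflex. The sound laws and all word forms are hypothetical, invented for the example.

```python
import re

# Ordered, hypothetical sound laws as (pattern, replacement) pairs.
SOUND_LAWS = [
    (r"p", "f"),           # p > f in all positions
    (r"k(?=[ie])", "ts"),  # k > ts before front vowels
    (r"a$", ""),           # word-final a is lost
]


def derive(proto_form):
    """Apply the sound laws in order and return the predicted reflex."""
    form = proto_form
    for pattern, replacement in SOUND_LAWS:
        form = re.sub(pattern, replacement, form)
    return form


def find_exceptions(pairs):
    """Return (proto-form, attested reflex, predicted reflex) for every mismatch."""
    return [
        (proto, attested, derive(proto))
        for proto, attested in pairs
        if derive(proto) != attested
    ]


# Invented (proto-form, attested reflex) pairs, for illustration only.
DATA = [
    ("pata", "fat"),
    ("kina", "tsin"),
    ("kapi", "kafi"),
    ("tapa", "tab"),
]

for proto, attested, predicted in find_exceptions(DATA):
    print(f"*{proto}: expected {predicted}, attested {attested}")
```

The list of mismatches is the interesting output: it makes the exceptions to a proposed set of sound laws explicit instead of leaving them to intuition.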

Without cross-linguistic accounts of the diversity of sound change processes (ie. a first solution to the problem of establishing a first typology of sound change), attempts to simulate sound change will remain difficult. The only way to address this problem is to require a more rigorous coding of data (both human- and machine-readable), and an increased openness of scholars who work on the reconstruction of interesting language families, to help make their data cross-linguistically comparable.

Sign languages

When drafting this post, I promised Guido and Justin that, when talking about sound change, I would take the opportunity to say a few words about the peculiarities of sound change in contrast to other types of language change. The idea was that this would help us to contribute to the mini-series on sign languages, which Guido and Justin initiated this month (see post number one, two, and three).

I do not think that I have completely succeeded in doing so, as what I have discussed today with respect to sound change does not really point out what makes it peculiar (if it is). But to provide a brief attempt, before I finish this post, I think that it is important to emphasize that the whole debate about regularity of sound change is, in fact, not necessarily about regularity per se, but rather about the question of where the change occurs. As the words in spoken languages are composed of a fixed number of sounds, any change to this system will have an impact on the language as a whole. Synchronic variation of the pronunciation of these sounds offers the possibility of change (for example during language acquisition); and once the pronunciation shifts in this way, all words that are affected will shift along, similar to a typewriter in which you change a key.

As far as I understand, for the time being it is not clear whether a counterpart of this process exists in sign language evolution, but if one wanted to search for such a process, one should, in my opinion, do so by investigating to what degree the signs can be considered as being composed of something similar to phonemes in historical linguistics. In my opinion, the existence of phonemes as minimal meaning-discriminating units in all human languages, including spoken and signed ones, is far from being proven. But if it should turn out that signed languages also recruit meaning-discriminating units from a limited pool of possibilities, there might be the chance of uncovering phenomena similar to regular sound change.

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

Anttila, Raimo (1976) The acceptance of sound change by linguistic structure. In: Fisiak, Jacek (ed.) Recent Developments in Historical Phonology. The Hague, Paris, New York: de Gruyter, pp. 43-56.

Au, Ching-Pong (2008) Acquisition and Evolution of Phonological Systems. Academia Sinica: Taipei.

Chen, Matthew (1972) The time dimension. Contribution toward a theory of sound change. Foundations of Language 8.4: 457-498.

Dediu, Dan and Moisik, Scott (2019) Pushes and pulls from below: Anatomical variation, articulation and sound change. Glossa 4.1: 1-33.

Hamann, Silke (2014) Phonological changes. In: Bowern, Claire (ed.) Routledge Handbook of Historical Linguistics. Routledge, pp. 249-263.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp. 606-609.

Hill, Nathan (2016) A refutation of Song’s (2014) explanation of the ‘stop coda problem’ in Old Chinese. International Journal of Chinese Linguistics 2.2: 270-281.

Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

Labov, William (1981) Resolving the Neogrammarian Controversy. Language 57.2: 267-308.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper presented at the workshop "Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE]" (2015/09/04, Leiden, Societas Linguistica Europaea).

Ohala, J. J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

Pyysalo, Jouna (2017) Proto-Indo-European Lexicon: The generative etymological dictionary of Indo-European languages. In: Proceedings of the 21st Nordic Conference of Computational Linguistics, pp. 259-262.

Sims-Williams, Patrick (2018) Mechanising historical phonology. Transactions of the Philological Society 116.3: 555-573.

Stevens, Mary and Harrington, Jonathan and Schiel, Florian (2019) Associating the origin and spread of sound change using agent-based modelling applied to /s/-retraction in English. Glossa 4.1: 1-30.

Wang, William Shi-Yuan (1969) Competing changes as a cause of residue. Language 45.1: 9-25.

Winter, Bodo and Wedel, Andrew (2016) The co-evolution of speech and the lexicon: Interaction of functional pressures, redundancy, and category variation. Topics in Cognitive Science 8: 503-513.

Monday, July 22, 2019

Two problems concerning the use of Ancient DNA

Last week I wrote a piece for The Wine Gourd blog, called The role of Wine Influencers — more of the same. I discussed the modern concern in the wine industry with social media Influencers, who use Facebook, Instagram, Twitter, Youtube, etc to promote wine — when LeBron James drinks a wine it will sell a whole lot better (presumably on the principle that “You may not be able to play like LeBron, but you can drink like him”).

My conclusion was that the wine industry has always had what are now called Micro- or Nano-Influencers, involving endorsements from people and organizations who possess an expert level of knowledge as well as social influence. For example, professional wine critics have always fitted this bill, notably Robert M. Parker Jr.

So, the existence of social media Wine Influencers is nothing new — it is simply the modern equivalent of something old.

Well, my blog post here is about the same idea in Ancient-DNA phylogenetics — the idea that, in spite of the claim that modern techniques provide new advantages, we may in fact simply be repeating ourselves. Modern issues are simply modern versions of the same old issues.

First problem

The first issue that I would like to raise is that of molecular data. This is seen as the crucial element of modern studies of ancient remains. Even the recent re-creation of the vineyard of Leonardo da Vinci (La Vigna di Leonardo, in Milan) involved the finding of sufficient DNA from the vineyard land, which was bombed during World War II, to identify the grape cultivar that was grown by da Vinci (Inside Leonardo da Vinci's vineyard).

The issue is that DNA studies, based on direct studies of genotype, are subject to all of the same data-analysis issues as are studies of phenotype (such as morphology, anatomy and ultrastructure).

One classic example is the supposed discovery in the 1980s of the phenomenon of Long-branch Attraction (LBA) in molecular studies. Here, if many shared nucleotide changes occur on distantly related branches of a phylogenetic tree, these branches may actually be reconstructed as sister lineages during the phylogenetic analysis. However, this is simply an example of parallelism, a phenomenon that had previously been known for decades in phylogenetic analyses of phenotype.

Many currently recognized practical problems in genotype studies, such as LBA and compositional biases, are merely specific examples of how analogy appears in molecular biology. Analogy will create convergences and parallelisms, and these will confound the attempt to detect homology.

So, reconstructing evolutionary history using molecular biology is a priori neither better nor worse than using any other source of data, because the same limitations apply. It is simply another type of data.

Second problem

The second issue that I would like to raise is that genome data are a type of Big Data, together with the idea that Big Data will solve all ills with data analyses. The idea seems to be that, if you can collect enough data, then you must be led to "the truth".

This is nonsense: data are just numbers, and numbers can mislead, no matter how many there are. Data need to be interpreted by a human mind, if they are to tell that mind anything useful. The only thing that changes with the use of Big Data is the order in which the steps of the data analysis and interpretation occur.

In the Old Days (ie. when I was a student), what we did was:
  1. develop an experimental question
  2. think about potential problems
  3. collect targeted data
  4. analyze the data
  5. interpret the data, to answer the question.
These days, with Big Data, what people do is:
  1. collect a very large amount of data
  2. analyze the data, and try to interpret it
  3. think of a question that the data might answer
  4. discover the potential problems later.
All that is really different is the order, along with which steps are confounded with which other steps.

I don't see that this is necessarily any better; it is just different. So, don't pin your hopes on Ancient DNA genome-scale data to solve problems with your work.

Other issues

Anyone working with Ancient DNA knows that there are oodles of other problems. Some of them are discussed for the general public by Gideon Lewis-Kraus, writing on January 17 2019 for The New York Times Magazine: Is Ancient DNA Research revealing new truths — or falling into old traps? The answer is, of course, "both".

Monday, July 15, 2019

Untangling vertical and horizontal processes in the evolution of handshapes

Justin Power

[This is a guest-post by Justin Power, and the 3rd part of our miniseries on sign language manual alphabets]

In Guido’s most recent post in this miniseries on manual alphabet evolution in sign languages, he discussed the role of character mapping on networks in phylogenetic inference. He pointed out how we used this approach to infer evolutionary pathways of languages, and why this step in exploratory data analysis is important, given the complexity of the underlying signal in this data set.

In this new post, I take up the topic of hand-shape evolution in more detail, explaining some of the complexities involved in studying sign language evolution. I will specifically look at how we can identify both vertical and horizontal processes in the evolution of hand-shapes.


We know very little about how signs and hand-shapes actually evolve. There have been a few studies — most of them from decades ago — comparing American Sign Language in videos and dictionaries from the early 20th century with then contemporary forms (Frishberg 1975; Battison et al. 1975). One study in particular argued that, as a sign language emerges in a community of signers, crystallizing into a stable linguistic system, the signs evolve in a quasi-teleological way from earlier, more gesture- or pantomime-like forms to more language-like forms, cutting similar evolutionary pathways leading to more constraints on articulation and to general systematization.

But what happens (in this story) once sign languages become linguistic systems? Do they continue evolving, as happens in spoken languages? If yes, how? Investigating these kinds of questions was one of my motivations for tracking down historical examples of manual alphabets for over a dozen sign languages. The pay-off (besides the thrill of the treasure hunt) is that, by tracing hand-shapes through historical examples and comparing them with contemporary sign languages, we can infer the vertical and horizontal evolutionary processes affecting sign languages and hand-shape forms.

Vertical and horizontal aspects of hand-shape evolution

Consider part of the Neighbor-net from our paper (see Part 1) including the Austrian-origin and Russian groups in the figure below. Russian 1835 is the earliest manual alphabet in our sample published in Russia (St. Petersburg); and Danish 1808, in the Danish subgroup, was published in Copenhagen.

While the two manual alphabets are found in different neighborhoods in the graph, they share a number of hand-shapes, some of which were (and still are) shared widely throughout Europe, for reasons that we discuss in the main paper.

One such hand-shape represents the Latin / Cyrillic letter "A" in both Danish 1808 and Russian 1835, as illustrated in the timeline here.

Note the position of the thumbs at the bottom of the figure: in both early examples, the thumb is adjacent to the bent index finger. In an example from Danish SL in 1907 (and subsequently in 1926 and 1967), the position of the thumb has shifted across the index finger. For Russian SL, too, the position of the thumb in the contemporary hand-shape representing the Cyrillic letter A has crept across the index finger to the front of the fist (the hand-shape in the figure is my attempt to reproduce the source; see here for the real thing).

There are two points to note here in connection with evolutionary processes. First, these changes in thumb position appear to have a vertical aspect: as signers in a community used these hand-shapes and transmitted them to later generations, they also modified the forms in subtle ways, perhaps unconsciously in a process with analogies to sound change in spoken language.

Second, the changes also include a horizontal aspect: the forms evolved in similar ways, as the two signing communities converged on the same shape (apparently) independently, possibly due to similar articulatory or perceptual pressures. The horizontal aspect of this process contributes to signal incompatibility in the dataset underlying the network — the more convergence there is, then the less tree-like will be the Neighbor-net (in this case, the more spiderweb-like).


In addition to the preceding example, a typical case of convergence can be seen in the independent creation of similar hand-shapes to represent the Greek and Cyrillic letter "Г".

Beginning again with the main Neighbor-net in the figure immediately above, we see that Russian 1835 and contemporary Greek SL are found in different neighborhoods, with Greek in the French-origin group. The two languages, however, share the Г-representing hand-shape (the Russian form is from Fleri 1835, while the Greek form is, again, my own hand; see here for the real one). Because Greek SL is the only language in the French-origin group to share this hand-shape with the Russian group, there is a clear suggestion of a horizontal process that resulted in similar hand-shapes across unrelated languages. The most likely processes here are convergence due to the independent creation of iconic representations of the written letter; or lateral transfer — called borrowing in linguistics — via some historical instance of contact between signers of the two languages. [My intuition is for the former explanation.]


The final example deals with a clear case of borrowing. The figure below shows the time- / taxon-filtered Neighbor-net, including historical manual alphabets up to about 1840 (see Part 2), but only annotated with the relevant languages.

The two earliest manual alphabets in our dataset were published in Madrid in 1593 (de Yebra) and 1620 (Bonet). In neither case do we see any trace of a hand-shape representing the letter "W", which was not needed to represent these Latin alphabets. Later, too, manual alphabets published in Spain in 1815, 1845, and 1859 still did not include the letter "W". In contrast, in Austrian 1786 and French 1800 (as well as other languages), hand-shape forms representing the letter W are found in the earliest examples we have for those languages. Some 160–230 years later, however, we find similar forms for "W" in contemporary Austrian, French and Spanish SLs. We deduce that contemporary Spanish SL did not inherit the "W" hand-shape from the 19th century Spanish manual alphabets. Instead, the hand-shape may have been borrowed from some other language, possibly French SL given its influence on deaf education in Europe, or possibly later from the International Sign manual alphabet (also part of the French-origin Group).


As these examples show, there are different types of horizontal processes contributing to conflicting signal in the data set. Using the splits network graphs together with historical examples of manual alphabets, we can untangle the horizontal signal in many cases. The approach has also given us some insight into the evolutionary processes contributing to the diversity of contemporary sign languages, a topic that we plan to investigate more fully.

Cited literature, further reading and data
  • Battison, Robin, Harry Markowicz, & James Woodward (1975) A good rule of thumb: Variable phonology in American Sign Language. In Ralph W. Fasold & Roger W. Shuy (eds.), Analyzing Variation in Language: Papers from the Second Colloquium on New Ways of Analyzing Variation, Part 3, pp. 291–302. Washington D.C.: Georgetown University Press.
  • Bonet, Juan Pablo (1620). Reduction de las letras y arte para enseñar a ablar los mudos. Madrid: Francisco Abarca de Angulo.
  • Fleri, Viktor I. (1835) Глухонемые, рассматриваемые в отношении к их состоянию и к способам образования, самым свойственным их при. St. Petersburg: Типография А. Плюшара.
  • Frishberg, Nancy (1975) Arbitrariness and iconicity: Historical change in American Sign Language. Language 51(3): 696–719.
  • Yebra, Melchor de (1593) Libro llamado Refugium Infirmorum: Muy util y prouechoso para todo genero de gente : En el qual se contienen muchos auisos espirituales para socorro de los afligidos enfermos, y para ayudar à bien morir a los que estan en lo ultimo de su vida ; con un Alfabeto de S. Buenauentura para hablar por la mano. Madrid: Luys Sa[n]chez.
A comprehensive reference list can be found in our pre-print at Humanities Commons. The raw data and analysis files are available via GitHub.

Other posts in this miniseries

Monday, July 8, 2019

Character cliques and networks – mapping haplotypes of manual alphabets

[This post is the second part of our miniseries on the origin and evolution of sign language manual alphabets]

One aspect of exploratory data analysis (EDA) is for us to try to understand how our data relate to our inference(s). This is especially important when the signal from our data is increasingly complex. Sign language manual alphabets are such a case.

In our first post about sign language manual alphabets, I introduced the principal networks that we used to classify sign languages. Here, I'll describe our character mapping procedure and why we did it as part of our EDA framework, in order to establish scenarios for the origin and evolution of sign languages.

Characters and mapping

We encoded each hand-shape used to signify a certain concept, such as the letters included in the standard Latin alphabet "a", "b", "c", .... "x", "y", "z", as a binary sequence – the presence or absence of a certain COGID (we will explain and discuss this in a later post). These binary sequences can be seen as an analogy of the genetic code, as a sort of 'linguistic haplotype', and their evolution can be mapped onto a network based on the entire dataset.

For instance, our matrix has three binaries (haplotypes) for the concept [g] in the oldest set of sign languages (pre-1840), two of which can be found in the earliest alphabets in our dataset: those of Yebra 1593 and Bonet 1620. Russian 1835, the oldest Cyrillic alphabet, uses a somewhat different hand-shape for its counterpart of the Latin "g", the Cyrillic "г".

For the concept [g], we thus have three taxon cliques, each defined by a distinct binary/haplotype: the 'Yebra haplotype', the 'Bonet haplotype', and the 'Cyrillic haplotype'.
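The encoding and the resulting taxon cliques can be sketched in a few lines of Python. This is only an illustration: the COGID numbers and the two "Hypothetical" taxa are placeholders invented for the example, not values from our matrix, and the function names are made up here.

```python
from collections import defaultdict

# One COGID per taxon for the concept [g]. The numeric IDs and the two
# "Hypothetical" taxa are placeholders for illustration only.
cogids_g = {
    "Yebra 1593": 1,
    "Bonet 1620": 2,
    "Russian 1835": 3,
    "Hypothetical A": 1,
    "Hypothetical B": 2,
}

def binary_haplotypes(cogids):
    """Turn one COGID per taxon into a presence/absence tuple per taxon."""
    states = sorted(set(cogids.values()))
    return {taxon: tuple(int(cogid == state) for state in states)
            for taxon, cogid in cogids.items()}

def taxon_cliques(haplotypes):
    """Group the taxa that share an identical binary haplotype."""
    cliques = defaultdict(list)
    for taxon, haplotype in haplotypes.items():
        cliques[haplotype].append(taxon)
    return dict(cliques)

haplotypes = binary_haplotypes(cogids_g)
cliques = taxon_cliques(haplotypes)
for haplotype, taxa in cliques.items():
    print(haplotype, sorted(taxa))
```

Each distinct tuple corresponds to one haplotype, so the three keys of `cliques` play the role of the Yebra, Bonet and Cyrillic haplotypes described above.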

By mapping these haplotypes on the network, as shown in the next figure, we can see that there is a small edge bundle reflecting the basic split between the Yebra and Bonet haplotypes.

Hand-shape drawings are taken from the original manuscripts.

We can also see that the Russian haplotype either evolved from the Yebra haplotype kept in the older Austrian-origin Group, ie. is an adaptation of the Yebra haplotype, or that it is a genuinely new invention — note the similarity of the Russian hand-shape with the letter г.

We repeated this procedure for all 26 concepts of the standard Latin alphabet, to get an idea of how often the encoded linguistic haplotypes fit with the overall pattern visualized in the inferred Neighbor-nets (ie. the neighborhoods as defined by edge bundles). This is shown in the next figure.

The arrows indicate inferred evolutionary processes (replacement or invention).

Using this network mapping (which, in principle, uses the logic of parsimony/median networks), we can make direct inferences about the general mode of evolution.

For instance, even though Russian 1835 uses a different set of hand-shapes (ie. is defined by partly unique haplotypes), the hand-shapes for the concepts [p] and [z] are exclusively shared with the Austrian-origin Group. The biological equivalent would be: the 'Austrian haplotypes' are a uniquely shared derived feature reflecting a putative common origin of the Austrian and Russian lineages — ie. a potential linguistic synapomorphy. We can also see that the haplotypes that Russian shares with all ([a][c][f][r][u][y]) or part ([b][e][i][k][n][o][x]) of the French-origin Group, an alternative source that may have inspired this early Cyrillic alphabet, all lack this quality.

We can also make inferences about:
  1. which hand-shape is the original one (O);
  2. lineage-specific / diagnostic hand-shapes, eg. At. = Austrian, Da. = Danish (using two letter abbreviations);
  3. which hand-shapes are shared but apparently derived, eg. At.-Fr. are hand-shapes / haplotypes shared by members of the Austrian- and French-origin groups not found in the Yebra or Bonet alphabets — C stands for cosmopolitan, non-original handshapes common in various lineages, including British-origin Group, and D represents derived but rare hand-shapes without any clear lineage-affiliation; and
  4. alphabet-unique hand-shapes (ie. those that represent a linguistic autapomorphy).
In addition, we can explore certain details, including patterns (character-based taxon cliques) that are at odds with the overall reconstruction. The latter are to be expected, because the graph is planar (2-dimensional) but the processes that shaped sign alphabets are likely to be multi-dimensional. For instance, our networks failed to resolve the affinity of the contemporary Norwegian Sign Language, the reason for which can be seen in the following character map.

Note the position of Norwegian 1955, which is still part of the Austrian-origin Group (like older manual alphabets used in the late 19th century in Norway). However, it is already influenced by international standardization: eg. concepts [k], [p], and [z] use(d) French hand-shapes. Hence, Norwegian 1955 shares quite a high number of lineage-diagnostic hand-shapes with Danish 1967 and the Icelandic Sign Language. These, and others, were further replaced in its contemporary counterpart (Norwegian SL) by hand-shapes borrowed from various lineages (eg. [c], [f] from the nearly extinct Austrian-origin Group, [p] from the Russian Group, [k] same as in the Spanish Group), as well as by unique hand-shapes, including hand-shapes evolved from earlier forms or those that have been genuinely invented.

Why we map character evolution along networks

In many cases, we only have one set of data, in order to draw our conclusions based on the graph(s) we infer. We cannot test to which degree our data (the way we scored the differentiation patterns) and inferences are systematically biased. Thus, we want to explore which aspects of our inference are supported by character splits, and establish taxon cliques and evolutionary pathways for the characters (scored traits). Lacking an independent source of data, the latter would involve circular reasoning — ie. mapping the traits along a tree derived from those same traits.

By inferring a tree, we crystallize one pattern dimension out of the data, although more often than not this will be a compromise from multidimensional signals. A network, such as a Neighbor-net, has two dimensions, and hence our mapping can consider two alternatives at the same time — this enables us to make a choice, if we have to. Another practical advantage of a Neighbor-net is that it is quick to infer, so that we can easily reduce the data set and use a more focused graph for the map.

In cases where 2-dimensional graphs don't suffice, there are still Consensus networks, which would allow mapping character evolution based on a sample of many alternative trees.

We could even eliminate the circular reasoning while maintaining a relatively stable inference framework. Deleting a character or several characters (or recoding them: see eg. Should we try to infer trees on tree-unlikely matrices?) can easily lead to a new tree topology, although it has less effect on the structure of a Neighbor-net. When we need to worry about circular reasoning for mapping a certain concept, or two concepts that may have interacted, we just base our Neighbor-net on a distance matrix calculated from a reduced character matrix, and then map only those concepts not considered for the inference.
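As a minimal sketch of this reduction step, the following Python computes pairwise distances after dropping the binary columns of the concept(s) to be mapped. The toy matrix, the taxon labels, the column layout and the choice of a simple mismatch proportion as the distance are all assumptions made for the illustration; they are not the settings used in our analysis.

```python
def mismatch_distance(a, b):
    """Proportion of binary characters in which two taxa differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def reduced_distance_matrix(matrix, columns_by_concept, excluded):
    """Pairwise distances computed without the columns of excluded concepts."""
    keep = sorted(
        column
        for concept, columns in columns_by_concept.items()
        if concept not in excluded
        for column in columns
    )
    taxa = sorted(matrix)
    reduced = {taxon: [matrix[taxon][i] for i in keep] for taxon in taxa}
    return {(s, t): mismatch_distance(reduced[s], reduced[t])
            for s in taxa for t in taxa}

# Toy data: concept "a" occupies columns 0-1, concept "g" columns 2-4.
toy_matrix = {
    "A": [1, 0, 1, 0, 0],
    "B": [1, 0, 0, 1, 0],
    "C": [0, 1, 0, 0, 1],
}
toy_columns = {"a": [0, 1], "g": [2, 3, 4]}

# Distances for mapping concept "g": its own columns are left out.
distances = reduced_distance_matrix(toy_matrix, toy_columns, excluded={"g"})
print(distances[("A", "B")], distances[("A", "C")])
```

The resulting distances would then be passed to whatever Neighbor-net implementation is in use, and only the excluded concept(s) would be mapped onto that graph.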

Other posts in this miniseries

Monday, July 1, 2019

Stacking networks based on sign language manual alphabets

This post is the first of a mini-series on sign language manual alphabets. While the evolution of spoken languages has been studied intensively using phylogenetic methods, sign languages have not, as yet.

In this post we will first introduce our readers to a set of stacked networks, and how it assists in establishing ancestor-descendant relationships in a pretty straightforward (but not trivial) case: the evolution of manual alphabets in sign languages. In the next post, I will demonstrate the use of networks for character mapping and putting forward hypotheses about ancestor-descendant relationships.

In 2004, Spencer et al. (Two papers you may want to read...) showed that Neighbor-nets outperform tree inferences when it comes to explicit ancestor-descendant relationships. The data set they used was quite particular: copies of written text. Here, scribes copy a text, and then other scribes, some of them ignorant of the language of the text they are copying, copy the copies. In the paper, the sequence of copies was recorded (the 'true tree'), and then the various texts were transferred into phylogenetic matrices, in order to infer trees and networks, and then this result was compared to the 'true tree'. The best fit of the data to the truth was the Neighbor-net.

This is a compelling conclusion, because, as a planar network and in contrast to median networks, Neighbor-nets don't explicitly place taxa in ancestor-descendant relationships. However, we have shown for many cases here at the Genealogical World of Phylogenetic Networks how ancestors are often placed with respect to their descendants: they are often closer to the center of the graph, or the root when known, and thus they bridge the center or sister lineages and their descendants. We can thus see why Neighbor-nets might be useful in practice.

In this context, the evolution of sign language manual alphabets, ie. the hand-shapes used to represent letters of a written alphabet, should be relatively easy to reconstruct. Once an alphabet is established in a sign language school / community, the ancestor, it will be passed on to other "generations" within the community and other schools / communities, the descendants. However, this is not necessarily a dichotomous process, as depicted in the first figure.

A scheme depicting how manual alphabets may evolve and disperse.

There are a few complications here: for example, hand-shapes may change in the course of being used (the hand-shape evolves); contact may lead to exchange or appropriation of hand-shapes (called "borrowing" in linguistics); and, in some cases, entire alphabets will need to be adapted to a particular use. The latter case occurs when changing from one script (Latin, say) to another (Cyrillic or Arabic) — the first formal school for the deaf was established in Paris, for example. As a teacher, I need to decide: Do I take a hand-shape from the morphologically similar letter, or the phonetically similar one? As a scientist, I need to assess the homologies among such hand-shapes without inflicting systematic bias.

Standardization will wipe out local customs and replace them with a multinational standard. For instance, Country 2 in the scheme above, drops its original B-type manual alphabet (red) for an A-type (blue); and in Country 7 both traditions are fused. Over time, originally distinct sign languages may converge due to geographic proximity, or even just feasibility.

The evolution of spoken languages has been studied intensively using phylogenetic methods, and in particular networks are much more commonly found in the linguistic literature than in the biological one. For sign languages we have made a first step in a recently published pre-print:
Justin M. Power, Guido W. Grimm, and Johann-Mattis List (2019) Evolutionary dynamics in the dispersal of sign languages. Humanities Commons.
What excites me about our study is that it combines historical manual alphabets (going back to 1593), which are potential ancestors, with a set of modern-day alphabets, which are their likely descendants. The data set is thus an evolutionary paleontologist's dream (and, possibly, a cladist's nightmare, if we expect a simple tree-like set of relationships rather than a network). As a scientist, I simply love to boldly go where no-one has gone before.

The next figure shows the all-inclusive network from our paper, but focusing on the age of the manual alphabets.

For more linguistic details see the pre-print.
* Historical version(s) of these lineages are not included in our data set

Obviously, there have been quite a lot of evolutionary changes, as well as standardization, going on, although some parts, like the Swedish SL (sign language), have stuck to their unique original. Historical and contemporary Spanish / Catalan are still most similar to the oldest manual alphabets that Justin dug out for our study. On the other hand, the contemporary Norwegian SL is placed far apart from its historical counterparts, and lacks any obvious affinity. Austrian, Danish, and German look back on a long and diverse history, the green "Austrian-origin Group", but the contemporaries have been homogenized by standardization (note the closeness to the International Sign manual alphabet). If we use an analogy with common biological and biogeographical processes (such as range expansion, competition, extinction, etc), then the Austrian-origin Group only survived in a remote island population, where we still find a sort of living fossil, the Icelandic SL.

In contrast to biological data, the old, putatively ancestral, manual alphabets are not closer to the graph's center, or the oldest manual alphabets in our data set. The reason for this seems to lie in the data itself and how manual alphabets evolve, and this will be the topic of the next post(s).

Still, we can isolate some evolutionary pathways, especially when we make time-wise taxon-filtered networks and stack them (see this introduction to stacking and this application using Osmundaceae, a data set including an even larger ratio of fossil taxa to modern taxa).

Fig. 4 from Power et al. Coloring same as above: pink – Spanish; turquoise – French-origin; green – Austrian-origin; orange – Polish; red – Russian; light blue – Swedish Group. The English-origin and Afghan-Jordanian groups are not included, since they are not represented by historical manual alphabets in our data set.

Each of the three networks includes manual alphabets from a certain time period, starting with pre-1840 at the bottom, historical 19th-/20th-century manual alphabets in the middle, and post-1950 manual alphabets in the top network. The dotted links between the networks connect manual alphabets that are included in two of the networks.
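The time-wise taxon filtering behind such a stack can be sketched as follows. The alphabets listed are ones mentioned in this miniseries, but the exact cut-off years and the non-overlapping bins are simplifying assumptions for the example; in the figure, some alphabets appear in two networks, which this sketch does not reproduce.

```python
def filter_by_period(alphabets, start, end):
    """Return the taxa whose publication year lies in [start, end)."""
    return sorted(name for name, year in alphabets.items()
                  if start <= year < end)

# Publication years as given in the taxon labels.
alphabets = {
    "Yebra 1593": 1593,
    "Bonet 1620": 1620,
    "Austrian 1786": 1786,
    "French 1800": 1800,
    "Danish 1808": 1808,
    "Russian 1835": 1835,
    "Swedish 1866": 1866,
    "Danish 1907": 1907,
    "Norwegian 1955": 1955,
    "Danish 1967": 1967,
}

# Assumed cut-offs; the upper bound 3000 simply means "no upper limit".
periods = {
    "pre-1840": (0, 1840),
    "1840-1950": (1840, 1950),
    "post-1950": (1950, 3000),
}

layers = {label: filter_by_period(alphabets, start, end)
          for label, (start, end) in periods.items()}
for label, taxa in layers.items():
    print(label, taxa)
```

Each layer would then get its own distance matrix and Neighbor-net, and the layers are stacked from oldest (bottom) to youngest (top).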

Even from these graphs alone, we can say a lot about how ancestors (original manual alphabets in a country) relate to descendants (later and contemporary manual alphabets) and their evolutionary pathways. Here are some examples.

Shortly after the time when the first schools for the deaf were established in continental Europe (late 18th, early 19th centuries), manual alphabets showed quite a diversity, and were very different from their potential Spanish sources, such as Yebra 1593 and Bonet 1620, with the French and Austrian teachers and communities going different ways. The oldest Cyrillic alphabet, Russian 1835, is more closely related to (ancient) Austrian than it is to (ancient) French.

The Swedish manual alphabet of 1866 is a fresh invention. Some hand-shapes may have been borrowed from one or another alphabet in use on the continent, but, as we will see in the next post of the series, the alphabet includes genuinely new forms.

The French tradition was dispersed into the New World (American SL appears to be a direct derivation from the French, while the Brazilian SL is an adaptation) but remained a relatively homogeneous group. On the other hand, the Austrian-origin languages diversified, in particular within the Danish influence zone. Politically, the Danish king ceded Norway to Sweden in the Treaty of Kiel 1814 (note the distance between Norwegian and Danish languages in the late 19th century), while Iceland was a Danish dependency until 1918, when the Danish-Icelandic Act of Union was signed. Furthermore, the German manual alphabets subsequently diverged from the Austrian source.

The Polish manual alphabet, originally an adaptation of the Austrian-Danish manual alphabets (see the graph in the middle), became closer to the Russian group, with the Latvian sign language taking up an intermediate position. The Cyrillic alphabets evolved further away, too (top graph).

In the following post(s) of this miniseries, we will explain what we learned from simple character mapping on the time-taxon-filtered networks, and how to score manual alphabets in the first place.

Follow-up posts in this miniseries

Monday, June 24, 2019

Simulation of lexical change (Open problems in computational diversity linguistics 5)

The fifth problem in my list of open problems in computational diversity linguistics is devoted to the problem of simulating lexical change. In a broad sense, lexical change refers to the way in which the lexicon of a human language evolves over time. In a narrower sense, we would reduce it to the major processes that constitute the changes that affect the words of human languages.

Following Gevaudán (2007: 15-17), we can distinguish three different dimensions along which words can change, namely:
  • the semantic dimension — a given word can change its meaning
  • the morphological dimension — new words are formed from old words by combining existing words or deriving new words with help of affixes, and
  • the stratic dimension — languages may acquire words from their neighbors and thus contain strata of contact.
If we take these three dimensions as the basis of any linguistically meaningful system that simulates lexical change (and I would strongly argue that we should), the task of simulating lexical change can thus be worded as follows:
Create a model of lexical change that simulates how the lexicon of a given language changes over time. This model may be simplifying, but it should account for change along the major dimensions of lexical change, including morphological change, semantic change, and lexical borrowing.
Note that the focus on three dimensions along which a word can change deliberately excludes sound change (which I will treat as a separate problem in an upcoming blogpost). Excluding sound change is justified by the fact that, in the majority of cases, the process proceeds independently from semantic change, morphological change, and borrowing, while the latter three processes often interact.

There are, of course, cases where sound change may trigger the other three processes — for example, in cases where sound change leads to homophonous words in a language that express contrary meanings, which is usually resolved by using another word form for one of the concepts. An example for this process can be found in Chinese, where shǒu (in modern pronunciation) came to mean both "head" and "hand" (spelled as 首 and 手). Nowadays, shǒu remains only in expressions like shǒudū 首都 "capital", while tóu 头 is the regular word for "head".

Since the number of these processes where we have sufficient evidence to infer that sound change triggered other changes is rather small, we will do better to ignore it when trying to design initial models of lexical change. Later models could, of course, combine sound change with lexical change in an overarching framework, but given how the modeling of lexical change is already complex just with the three dimensions alone, it seems useful to put it aside for the moment and treat it as a separate problem.

Why simulating lexical change is hard

For historical linguists, it is obvious why it is hard to simulate lexical change in a computational model. The reason is that all three major processes of lexical change, semantic change, morphological change, and lexical borrowing, are already hard to model and understand themselves.

Morphological change is not only difficult to understand as a process, it is even difficult to infer; and it is for this reason, that we find morphological segmentation as the first example in my list of open problems. The same holds for lexical borrowing, which I discussed as the second example in my list of open problems. The problem of common pathways of semantic change will be discussed in a later post, devoted to the general typology of semantic change processes.

If each of the individual processes that constitute lexical change is itself either hard to model or to infer, it is no wonder that the simulation is also hard.

Traditional insights into the process of lexical change

Important work on lexical change goes back at least to the 1950s, when Morris Swadesh (1909-1967) proposed his theory of lexicostatistics and glottochronology (Swadesh 1952, 1955, Lees 1953). What was important in this context was not the idea that one could compute the divergence time of languages, but the data model which Swadesh introduced. This data model is represented by a word-list in which a particular list of concepts is translated into a particular range of languages. While former work on semantic change had been mostly semasiological — ie. form-based, taking the word as the basic unit and asking how it would change its meaning over time — the new model used concepts as a comparandum, investigating how word forms replaced each other in expressing specific concepts over time. This onomasiological or concept-based perspective has the great advantage of drastically facilitating the sampling of language data from different languages.

When comparing only specific word forms for cognacy, it is difficult to learn something about the dynamics of lexical change through time, since it is never clear how to sample those words that one wants to investigate more closely in a given study. With Swadesh's data model, the sampling process is reduced to the selection of concepts, regardless of whether one knows how many concepts one can find in a given sample of languages. Swadesh was by no means the first to propose this perspective, but he was the one who promulgated it.

Swadesh's data model does not directly measure lexical change, but instead measures the results of lexical change, given that its results surface in the distribution of cognate sets across lexicostatistical word-lists. While historical linguists mostly focused on sound change processes before, often ignoring morphological and semantic change, the lexicostatistical data model moved semantic change, lexical borrowing, and (to a lesser degree also) morphological change into the spotlight of linguistic endeavors. As an example, consider the following quote from Lees (1953), discussing the investigation of change in vocabulary under the label of morpheme decay:
The reasons for morpheme decay, ie. for change in vocabulary, have been classified by many authors; they include such processes as word tabu, phonemic confusion of etymologically distinct items close in meaning, change in material culture with loss of obsolete terms, rise of witty terms or slang, adoption of prestige forms from a superstratum language, and various gradual semantic shifts such as specialization, generalization, and pejoration. [Lees 1953: 114]
In addition to lexicostatistics and the discussions that arose especially from it (including those that criticized the method harshly), I consider the aforementioned model of three dimensions of lexical change by Gevaudán (2007) to be very useful in this context, since it constitutes one of the few attempts to approach the question of lexical change in a formal (or formalizable) way.

Computational approaches

Among the most frequently used models in the historical linguistics literature are those in which lexical change is modeled as a process of cognate gain and cognate loss. Modeling lexical change as a process of word gain and word loss, or root gain and root loss, is in fact straightforward. We well know that languages may cease to use certain words during their evolution, either because the things the words denote no longer exist (think of the word walkman and then try to project the future of the word ipad), or because a specific word form is no longer being used to denote a concept and therefore drops out of the language at some point (think of thorp which meant something like "village", as a comparison with German Dorf "village" shows, but now exists only as a suffix in place names).

Since the gain-loss (or birth-death) model finds a direct counterpart in evolutionary biology, where genome evolution is often modeled as a process involving gain and loss of gene families (Cohen et al. 2008), it is also very easy to apply it to linguistics. The major work on the stochastic description of different gain-loss models has already been done, and we can find very stable software that helps us employ gain-loss models to reconstruct phylogenetic trees (Ronquist and Huelsenbeck 2003).
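To make the idea concrete, here is a deliberately simple, discrete-time sketch of a gain-loss process in Python. It is not the stochastic model implemented in any of the software packages cited; the function name, the rates, the number of steps, and the restriction to at most one gain per step are all assumptions chosen to keep the example short.

```python
import itertools
import random

def simulate_gain_loss(initial, steps, gain_rate, loss_rate, rng, new_ids):
    """Evolve a set of cognate IDs over a number of discrete time steps."""
    cognates = set(initial)
    for _ in range(steps):
        # Loss: each cognate independently drops out with probability loss_rate.
        cognates = {c for c in cognates if rng.random() >= loss_rate}
        # Gain: at most one new cognate per step, with probability gain_rate.
        if rng.random() < gain_rate:
            cognates.add(next(new_ids))
    return cognates

rng = random.Random(42)
ancestor = set(range(100))
# A shared ID counter ensures that independent gains never look like cognates.
new_ids = itertools.count(100)

daughter_a = simulate_gain_loss(ancestor, 50, 0.5, 0.01, rng, new_ids)
daughter_b = simulate_gain_loss(ancestor, 50, 0.5, 0.01, rng, new_ids)
shared = daughter_a & daughter_b
print(len(daughter_a), len(daughter_b), len(shared))
```

In this toy version, everything the two daughter languages share is inherited from the ancestor, which is exactly the kind of signal that presence/absence matrices of cognate sets are meant to capture.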

It is therefore not surprising that gain-loss models are very popular in computational approaches to historical linguistics. Starting from pioneering work by Gray and Jordan (2000) and Gray and Atkinson (2003), they have now been used on many language families, including Austronesian (Gray et al. 2007), Australian languages (Bowern and Atkinson 2012), and most recently also Sino-Tibetan (Sagart et al. 2019). Although scholars (including myself) have expressed skepticism about their usefulness (List 2016), the gain-loss model can be seen as reflecting the quasi-standard of phylogenetic reconstruction in contemporary quantitative historical linguistics.

Despite their popularity for phylogenetic reconstructions, gain-loss models have been used only sporadically in simulation studies. The only attempts that I know of so far are one study by Greenhill et al. (2009), where the authors used the TraitLab software (Nicholls 2013) to simulate language change along with horizontal transfer events, and a study by Murawaki (2015), in which (if I understand the study correctly) a gain-loss model is used to model language contact.

Another approach is reflected in the more "classical" work on lexicostatistics, where lexical change is modeled as a process of lexical replacement within previously selected concept slots. I will call this model the concept-slot model. In this model (and potential variants of it), a language is not a bag of words whose contents change over time, but is more like a chest of drawers, in which each drawer represents a specific concept and the content of a drawer represents the words that can be used to express that given concept. In such a model, lexical change proceeds as a replacement process: a word within a given concept drawer is replaced by another word.
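The chest-of-drawers picture can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions of my own: each drawer holds exactly one word, every drawer has the same constant replacement probability, and the names `replace_words` and `shared_proportion` are hypothetical.

```python
import itertools
import random


def replace_words(lexicon, steps, replacement_rate, rng, new_words):
    """Return a copy of `lexicon` (concept -> word) after `steps` time steps.

    At every step, the word in each concept drawer is replaced by a
    fresh word with probability `replacement_rate`. The set of drawers
    (concepts) itself never changes.
    """
    lexicon = dict(lexicon)
    for _ in range(steps):
        for concept in lexicon:
            if rng.random() < replacement_rate:
                lexicon[concept] = next(new_words)
    return lexicon


def shared_proportion(lex_a, lex_b):
    """Proportion of concept drawers holding the same word in both languages."""
    same = sum(lex_a[c] == lex_b[c] for c in lex_a)
    return same / len(lex_a)


rng = random.Random(1)                                # fixed seed, for reproducibility only
new_words = (f"w{i}" for i in itertools.count())      # endless supply of fresh word labels
concepts = ["hand", "water", "stone", "village", "sun"]
ancestor = {c: next(new_words) for c in concepts}

lang_a = replace_words(ancestor, 20, 0.02, rng, new_words)
lang_b = replace_words(ancestor, 20, 0.02, rng, new_words)
print(shared_proportion(lang_a, lang_b))
```

The shared proportion is the kind of quantity that distance-based lexicostatistics works with; the fixed list of concepts is what limits the model's ability to represent semantic change.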

This model represents the classical way in which Morris Swadesh used to view the evolution of a given language. It is still present in the work of scholars working in the original framework of lexicostatistics (Starostin 2000), but it is used almost exclusively within distance-based frameworks, since a character-based account of the model would require a potentially large number of character states, which usually exceeds the number of character states allowed in the classical software packages for phylogenetic reconstruction.

Similar to the gain-loss model, there have not been many attempts to test the characteristics of this model in simulation studies. The only one known to me is a posthumously published letter from Sergei Starostin (1953-2005) to Murray Gell-Mann (Starostin 2007), in which he describes an attempt to account for his theory that a word's replacement rate increases with the word's age (Starostin 2000) in a computer simulation.

Problems with current models of lexical change

Neither the gain-loss model nor the concept-slot model seems to be misleading when it comes to describing the process of lexical change. However, they both obviously ignore specific and crucial aspects of lexical change that (according to the task stated above) any ambitious simulation of lexical change should try to account for. The gain-loss model, for example, deliberately ignores semantic change and morphological change. It can account for borrowings, which can be easily included in a simulation by allowing contemporary languages to exchange words with each other, but it cannot tell us (since it ignores the meaning of word forms) how the meaning of words changes over time, or how word forms change their shape due to morphological change.

The concept-slot model can, in theory, account for semantic change, but only as far as the concept-slots allow: the number of concepts in this model is fixed and one usually does not assume that it would change. Furthermore, while borrowing can be included in this model, the model does not handle morphological change processes.

In phylogenetic approaches, both models also have clear disadvantages. The main problem of the gain-loss model is the sampling procedure. Since one cannot sample all words of a language, scholars usually derive the cognate sets they use to reconstruct phylogenies from cognate-coded lexicostatistical word-lists. As I have tried to show earlier, in List (2016), this sampling procedure can lead to problems when homology is defined in a loose way. The problem of the concept-slot model is that it cannot be easily applied in phylogenetic inference based on likelihood models (like maximum likelihood or Bayesian inference), since the only straightforward way to handle it would be multi-state models, which are generally difficult to handle.

Initial ideas for improvement

For the moment, I have no direct idea of how to model morphological change, and more research will be needed before we will be able to handle this in models of lexical change. The inability of the gain-loss and the concept-slot models to account for semantic change, however, can be overcome by turning to bipartite graph models of lexical change (see Newman 2010: 32f for details on bipartite graphs). In such a model, the lexicon of a human language is represented by a bipartite graph consisting of concepts as one type of node and word forms (or forms) as another type of node. The association strength of a given word node and a given concept node (or its "reference potential", see List 2014: 21f), ie. the likelihood of a word being used by a speaker to denote a given concept, can be modeled with the help of weighted edges. This model naturally accounts for synonymy (if a meaning can be expressed by multiple words) and polysemy (if a word can express multiple meanings). Lexical change in such a model would consist of the re-arrangement of the weights in the network. Word gain and word loss would occur if a new word node is introduced into the network or an existing node gets dissociated from all of the concepts, respectively.

Sankoff's (1969) bipartite model of the lexicon of human languages

We can find this idea of bipartite modeling of a language's lexicon in the early linguistic work of Sankoff (1969: 28-53), as reflected in the figure above, taken from his dissertation (Figure 5, p. 36). Similarly, Smith (2004) used bipartite form-concept networks (which he describes as a matrix) in order to test the mechanisms by which these vocabularies are transmitted from the perspective of different theories on cultural evolution.
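As a rough illustration of the bipartite idea described above, the following Python sketch stores the lexicon as weighted links between word forms and concepts. It is not Sankoff's or Smith's formalization; the class name `BipartiteLexicon`, its methods, and all weights are hypothetical and chosen only for demonstration.

```python
from collections import defaultdict


class BipartiteLexicon:
    """Weighted bipartite graph linking word forms to concepts."""

    def __init__(self):
        # word -> {concept: weight}; a word exists only while it has links.
        self.weights = defaultdict(dict)

    def set_weight(self, word, concept, weight):
        """Set the association strength; a weight of 0 removes the edge."""
        if weight > 0:
            self.weights[word][concept] = weight
            return
        links = self.weights.get(word, {})
        links.pop(concept, None)
        # Word loss: a word dissociated from all concepts leaves the lexicon.
        if word in self.weights and not links:
            del self.weights[word]

    def words_for(self, concept):
        """All words that can express `concept` (more than one: synonymy)."""
        return sorted(w for w, links in self.weights.items() if concept in links)

    def concepts_for(self, word):
        """All concepts that `word` can express (more than one: polysemy)."""
        return sorted(self.weights.get(word, {}))

    def has_word(self, word):
        return word in self.weights


lex = BipartiteLexicon()
lex.set_weight("thorp", "VILLAGE", 0.8)    # made-up weights
lex.set_weight("village", "VILLAGE", 0.2)  # word gain: a new word node appears
lex.set_weight("head", "HEAD", 0.9)
lex.set_weight("head", "CHIEF", 0.1)

synonyms = lex.words_for("VILLAGE")   # two words for one concept
polysemy = lex.concepts_for("head")   # one word for two concepts

# Lexical change as re-arrangement of weights, ending in word loss.
lex.set_weight("village", "VILLAGE", 1.0)
lex.set_weight("thorp", "VILLAGE", 0.0)
```

Gain-loss and concept-slot behavior both fall out as special cases: the first looks only at which word nodes exist, the second only at the strongest word per concept.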

As I have never actively tried to review the large amount of literature devoted to simulation studies in historical linguistics, biology, and cultural evolution, it is quite possible that this blogpost lacks reference to important studies devoted to the problem. Despite this possibility, we can clearly say that we are lacking simulation studies in historical linguistics. I am furthermore convinced that the problem of handling lexical change in simulation studies is a difficult one, and that we may well have to wait to acquire more knowledge of the key processes involving lexical change in order to address it sufficiently in the future.

While I understand the popularity of gain-loss models in recent work on phylogenetic reconstruction in historical linguistics, I hope that it might be possible to develop more realistic models in the future. It is quite possible that such studies will confirm the superiority of gain-loss models over alternative approaches. But instead of assuming this in an axiomatic way, as we seem to be doing for the time being, I would rather see some proof for this in simulation studies, or in studies where the data fed to the gain-loss algorithms is sampled differently.


Bowern, Claire and Atkinson, Quentin D. (2012) Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88: 817-845.

Cohen, Ofir and Rubinstein, Nimrod D. and Stern, Adi and Gophna, Uri and Pupko, Tal (2008) A likelihood framework to analyse phyletic patterns. Philosophical Transactions of the Royal Society B 363: 3903-3911.

Gévaudan, Paul (2007) Typologie des lexikalischen Wandels. Bedeutungswandel, Wortbildung und Entlehnung am Beispiel der romanischen Sprachen. Tübingen: Stauffenburg.

Gray, Russell D. and Jordan, Fiona M. (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405: 1052-1055.

Gray, Russell D. and Atkinson, Quentin D. (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray, Russell D. and Greenhill, Simon J. and Ross, Malcolm D. (2007) The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2: 360-375.

Greenhill, S. J. and Currie, T. E. and Gray, R. D. (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London, Series B 276: 2299-2306.

Lees, Robert B. (1953) The basis of glottochronology. Language 29: 113-127.

List, Johann-Mattis (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.

Murawaki, Yugo (2015) Spatial structure of evolutionary models of dialects in contact. PLoS One 10: e0134335.

Newman, M. E. J. (2010) Networks: An Introduction. Oxford: Oxford University Press.

Nicholls, Geoff K and Ryder, Robin J and Welch, David (2013) TraitLab: A MatLab package for fitting and simulating binary tree-like data.

Ronquist, Frederik and Huelsenbeck, J. P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572–1574.

Sagart, Laurent, Jacques, Guillaume, Lai, Yunfan, Ryder, Robin, Thouzeau, Valentin, Greenhill, Simon J., List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America 116: 10317–10322. DOI: 10.1073/pnas.1817972116

Sankoff, David (1969) Historical Linguistics as Stochastic Process. McGill University: Montreal.

Smith, Kenny (2004) The evolution of vocabulary. Journal of Theoretical Biology 228: 127-142.

Starostin, Sergej Anatolévič (2000) Comparative-historical linguistics and lexicostatistics. In: Renfrew, Colin, McMahon, April, Trask, Larry (eds.): Time Depth in Historical Linguistics: 1. Cambridge: McDonald Institute for Archaeological Research, pp. 223-265.

Starostin, Sergej A. (2007) Computer-based simulation of the glottochronological process (Letter to M. Gell-Mann). In: S. A. Starostin: Trudy po yazykoznaniyu [S. A. Starostin: Works in Linguistics]. LRC Publishing House, pp. 854-861.

Swadesh, Morris (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96: 452-463.

Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

Monday, June 17, 2019

Ockham's Razor applied, but not used: can we do DNA-scaffolding with seven characters?

One of the most interesting research areas in organismal science is the cross-road between palaeontology and neontology, which puts together a picture marrying the fossil record with molecular-based phylogenies. Unfortunately, when it comes to plant (palaeo-)phylogenetics, some people adhere to outdated analysis frameworks (sometimes with little data).

How to place a fossil?

The fossil record is crucial for neontology as it can provide age constraints (minimum ages when doing node dating) and inform us about the past distribution of a lineage. This distribution, especially in the case of plants that can't run away from unfortunate habitat changes, can be very different from today's.

The main question in this context is whether a fossil represents the stem, ie. a precursor or extinct ancient sister lineage, or the crown group, ie. a modern-day taxon (primarily modern-day genus). For instance, the oldest crown fossil gives the best-possible minimum age for the stem (root) age of a modern lineage, whereas a stem fossil can give (at best) only a rough estimate for the crown age of the next-larger taxon/clade when doing the common node dating of molecular trees (note that fossilized birth-death dating can make use of both).

There are two commonly accepted criteria to identify a crown-group fossil:
  1. Apomorphy-based argues that if a fossil shows a uniquely derived character (ie. an aut- or synapomorphy sensu Hennig) or character suite diagnostic for a modern-day genus, it represents a crown-group fossil.
  2. Phylogeny-based aims to place the fossil in a phylogenetic framework; the position of the fossil in the genus- or species-level tree (most commonly done) or network (rarely done but producing much less biased or flawed results) then informs what it is.
(We will focus on members of modern-day genera, since it becomes trickier for higher-level taxa, see eg. my posts thinking about What is an angiosperm? [part1][part2][why I pondered about it].)

There are three basic options to place a fossil using a phylogenetic tree.
  1. Putting up a morphological matrix, then inferring the tree. A classic approach but, owing to the nature of most morphological data sets, one that leads to a partly wrong tree, as we demonstrated in some posts here on the Genealogical World of Phylogenetic Networks (hence, such an analysis should always be done in a network-based exploratory data analysis framework).
  2. Putting up a mixed molecular-morphological matrix, then inferring a "total evidence" tree. This includes sophisticated approaches that use the molecular data to implement weights on the morphological traits and/or consider the age of the fossils (so-called total evidence dating approaches). This works reasonably well with animal data, provided the matrix includes a lot of morphological traits reflecting aspects of the (molecular-based) phylogeny. It doesn't work too well for plants because we usually have far fewer scorable traits, most of which have evolved convergently or in parallel. Non-trivial plant fossils love to act as rogues during phylogenetic inference.
  3. Optimising the position of a fossil in a molecular-based tree, eg. using the so-called "DNA scaffold approach" (usually using parsimony as optimality criterion) or the evolutionary placement algorithm implemented in RAxML (using maximum likelihood). A special form of this approach is to first map the traits on a (dated) molecular tree, and then find the position where a fossil would fit best.
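The third option, in its parsimony flavor, can be sketched in Python as follows. This is a minimal illustration, not the procedure of any published program: it keeps a rooted binary backbone fixed, attaches the fossil to every branch in turn, and scores each candidate with Fitch parsimony. The tree, the taxon names, the three binary characters, and the function names are all made up for the example.

```python
def fitch_score(tree, states):
    """Fitch parsimony length of one character on a rooted binary tree.

    `tree` is a leaf name (str) or a pair (left, right); `states` maps
    each leaf name to the set of states observed in that taxon.
    """
    def visit(node):
        if isinstance(node, str):
            return set(states[node]), 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: one extra change is needed at this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def attachments(tree, fossil):
    """Yield every tree obtained by adding `fossil` as sister to one subtree."""
    yield (tree, fossil)
    if not isinstance(tree, str):
        left, right = tree
        for new_left in attachments(left, fossil):
            yield (new_left, right)
        for new_right in attachments(right, fossil):
            yield (left, new_right)


def place_fossil(backbone, fossil, matrix):
    """Return the best total score and all attachment trees achieving it.

    `matrix` maps each taxon to a list of state sets, one per character.
    """
    n_chars = len(matrix[fossil])
    scored = []
    for candidate in attachments(backbone, fossil):
        total = sum(
            fitch_score(candidate, {t: matrix[t][i] for t in matrix})
            for i in range(n_chars)
        )
        scored.append((total, candidate))
    best = min(score for score, _ in scored)
    return best, [tree for score, tree in scored if score == best]


# Hypothetical data: the third character is homoplastic (derived in B and D).
raw = {"Out": "000", "A": "100", "B": "101", "C": "010", "D": "011", "Fossil": "101"}
matrix = {taxon: [{int(s)} for s in row] for taxon, row in raw.items()}
backbone = ("Out", (("A", "B"), ("C", "D")))

best_score, best_trees = place_fossil(backbone, "Fossil", matrix)
print(best_score, best_trees)
```

In this toy matrix the fossil is identical to taxon B, so the single most parsimonious placement is as sister to B. Because the Fitch length does not depend on where the tree is rooted, attachments that differ only in the root position always tie, and with few or homoplastic characters many other branches tie as well; that is the practical weakness of the approach.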

Why (standard) phylogenetic tree-based approaches are tricky

Below is a simple example, including three fossils of different age (and often, place) with different character suites.

Even though none of the derived traits (blue and red "1") is a synapomorphy (fide Hennig), we can assign the youngest fossil X to the lineage of genus 1A just based on its unique derived ('apomorphic') character suite. It's likely a crown-group fossil of Clade 1, and may inform a minimum age for the most-recent common ancestor (MRCA) of the two modern-day genera of Clade 1.
Apomorphy-wise, fossils Y and Z cannot be unambiguously placed. The red trait appears to be independently obtained in both clades, and the blue trait may have been
To discern between the options, we'd be well-advised to do character mapping in a probabilistic framework, which requires a tree with independently defined branch-lengths.

Just by using parsimony-based DNA-scaffolding, fossil X would be confirmed as a crown-group fossil and member of genus 1A (being identical to it and different from all others) and fossil Z would end up as a stem-group fossil. Fossil Y, however, would be placed as sister to genus 2C (again, identical to each other and different from all others). Using Y in node dating would then lead to a much too old age estimate for the crown group of Clade 2. In reality, what researchers do with such a seemingly too old fossil is not to use it by the book, as MRCA of genera 2B and 2C, but to inform the MRCA of eg. genera 2A, 2B, and 2C, assuming that the fossil's age and trait set indicate that the 2C morphology is primitive within the clade, or that Y is an extinct sister lineage and the shared derived trait a convergence (parallelism).

Four characters, three homoplastic and one invariant, are surely not enough for DNA-scaffolding, but adding more and more characters has a catch. This is easy to do for the modern-day taxa, for which we also have molecular data, but the preservation of fossils limits how many traits can be added; any trait not preserved in the fossil is effectively useless when placing it (although including non-preserved traits in a total evidence approach may, nonetheless, help the analysis). Which brings us to the real-world example just published in Science:

Wilf P, Nixon KC, Gandolfo MA, Cúneo RA (2019). Eocene Fagaceae from Patagonia and Gondwanan legacy in Asian rainforests. Science 364, 972. Full-text article at Science website.

Why one should not place a fossil using DNA-scaffolding with seven characters

Wilf et al. show (another) spectacularly preserved fossil from the Eocene of Patagonia. Personally, I think that just publishing and shortly describing such a beautiful fossil should be enough to get into the leading biological journals.

But Wilf et al. wanted (needed?) more and came up with the following "phylogenetic analysis" to argue that their fossil is a crown-group Castanoideae, a representative of the modern-day firmly Southeast Asian tropical-subtropical genus Castanopsis, and evidence for a "southern route to Asia hypothesis" (via Antarctica and Australia, both well-studied but so far devoid of any Fagaceae presence, despite the fact that the modern-day climate allows cultivating them as eg. a source of commercially used wood).

Wilf et al.'s Fig. 3 and Table 1 suggest to me that the paper was not critically reviewed by anyone familiar with the molecular genetics of Fagaceae or phylogenetic methods in general; perhaps this is not needed, since the first author is well-merited and the second author a world-leading expert in botanical palaeo-cladistics. However, parsimony-based DNA-scaffolding can be tricky, even with a larger set of characters (see eg. the post on Juglandaceae using a well-done matrix), and using seven is therefore quite bold. Notably, of the seven characters, one is parsimony-uninformative and four are variable within at least one of the included OTUs.

Side note: The tree used as a backbone is outdated and not comprehensive. Plastid and nuclear-molecular data indicate that the castanoids Lithocarpus (mostly tropical SE Asia) and Chrysolepis (temperate N. America) may be sisters. However, the morphologically quite similar Notholithocarpus is not related to either of these, but is instead a close relative of the ubiquitous oaks, genus Quercus (not included in Wilf et al.'s backbone tree), especially subgenus Quercus. Furthermore, the (today Eurasian) castanoid sister pair Castanea (temperate)-Castanopsis (tropical-subtropical) has stronger affinities to the (today and in the past) Eurasian oaks of subgenus Cerris. The Fagaceae also include three distinct monotypic relict genera, the "trigonobalanoids" Formanodendron and Trigonobalanus, SE Asia, and Colombobalanus from Colombia, South America. Using a more up-to-date instead of a 2-decade-old molecular hypothesis would have been a fair request during review, as would compiling a new molecular matrix to infer a tree used as backbone (currently gene banks include > 238,000 nucleotide DNA accessions including complete plastomes). This would have also enabled the authors to map their traits using a probabilistic framework, which can protect to some degree against homoplastic bias but requires a backbone tree with defined branch-lengths.

There are many more problems with the paper and its conclusions, but this critique would be content-related, not network-related. Let's just look at the data and see why Wilf et al. would have been better off not showing any phylogenetic analysis at all (and the impact-driven editors and well-meaning reviewers should have advised against it). Or showing a network instead.

Clades with little character support

The scaffolding placed the Eocene fossil in a clade with both representatives of Castanopsis, from which it differs by 0–2 and 1–4 traits, respectively. Phylogeny-based, the fossil is a stem- or crown-Castanopsis.
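Ranges such as "0–2" arise because some OTUs are scored as polymorphic, so the number of differences depends on which state one picks. The following Python sketch shows one way to compute such a minimum-maximum range; the function name `difference_range` and the four-character scores are invented for illustration and are not Wilf et al.'s data.

```python
def difference_range(otu_a, otu_b):
    """Return (minimum, maximum) number of differing characters.

    Each OTU is a list of state sets, one set per character; a set with
    more than one state represents a polymorphic score.
    """
    # Minimum: characters without any shared state must differ.
    low = sum(1 for a, b in zip(otu_a, otu_b) if not (a & b))
    # Maximum: characters where at least one choice of states differs.
    high = sum(1 for a, b in zip(otu_a, otu_b) if len(a | b) > 1)
    return low, high


# Hypothetical scores for a fossil and two partly polymorphic OTUs.
fossil = [{0}, {1}, {0}, {1}]
otu_1 = [{0}, {0, 1}, {0, 1}, {1}]
otu_2 = [{1}, {0, 1}, {0}, {1}]

range_1 = difference_range(fossil, otu_1)
range_2 = difference_range(fossil, otu_2)
print(range_1, range_2)
```

A minimum of zero means that the fossil cannot be told apart from at least one morph of the OTU, which is quite different from saying that it shares a derived trait with it.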

However, the fossil has a character suite that differs in just a single trait (#6: valve dehiscence) from the (genetically very distant) sister taxon of all other Fagaceae, Fagus (the beech), used here as the outgroup to root the Castanoideae subtree. As far as apomorphies are concerned, the data are inconclusive as to whether the fossil represents a stem-Castanoideae (or extinct Fagaceae lineage) or a Castanopsis. This critical, potentially diagnostic derived trait, partial valve dehiscence, is only shared by the fossil and some but not all modern-day Castanopsis. This particular trait is not mentioned elsewhere in the text, although it is the reason the fossil is placed next to Castanopsis and not the outgroup Fagus in the "phylogenetic analysis".

In the following figure, I have mapped (with parsimony) the putative character mutations on the tree used by Wilf et al.

Black font: shared by Fagus (outgroup) and "Castanoideae". Green font: potential uniquely derived traits. Blue font: traits reconstructed as having evolved in parallel/convergently. Red branches: clades in the used backbone tree that are at odds with currently available molecular data (the N. American relict Notholithocarpus should be sister to the Eurasian Castanea-Castanopsis).

This hardly presents a strong case for a crown-group assignment. Except for partial dehiscence, even the modern-day Castanopsis have few discriminating derived traits; they are living fossils with a primitive ('plesiomorphic') character suite. Intriguingly, they are also genetically less derived than other Castanoideae and the oaks (see eg. the ITS tree in Denk & Grimm 2010).

The actual differentiation pattern

The best way to depict what the character set provides as information for placing the fossil is, of course, the Neighbor-net, as shown next.

Neighbor-net based on Wilf et al.'s seven scored morphological traits used to place the fossil. Green: the current molecular-based phylogenetic synopsis — based mostly on Oh & Manos 2008; Manos et al. 2008; Denk & Grimm 2010. I had the opportunity to get familiar with all of the then-available genetic data when harvesting all Fagaceae data from gene banks in 2012 for a talk in Bordeaux. One complication in getting an all-Fagaceae-tree is that plastids, geographically constrained, and nuclear regions tell partly different stories.

Castanopsis, including the fossil, is morphologically paraphyletic (see also our other posts dealing with paraphyla represented as clades in trees). Note also the long edge-bundle separating the temperate Chrysolepis and chestnuts (Castanea) from their respective cold-intolerant sister genera (Lithocarpus viz Castanopsis): derived traits have been accumulated in parallel within the "Castanoideae". The scored aspects of Fagaceae morphology are very flexible and ~50 million years is a long time, possibly leading to partial valve indehiscence (or losing it) without being part of the same generic lineage. The puzzling differentiation, and the profoundly primitive appearance of the fossil (shared with modern-day Castanopsis), may in fact be the reason the authors (i) did not optimize / discuss very similar, coeval fossils from the Northern Hemisphere interpreted (and cited) as extinct genera (eg. Crepet & Nixon 1989), (ii) left out the two Fagaceae genera occurring today in South America, (iii) opted for classic parsimony and a partly outdated molecular hypothesis, and (iv) just showed a naked cladogram without branch support values as the result of their "phylogenetic analysis" (Please stop using cladograms!).

Based on the scored characters, the position of the fossil in the graph, and on the background of a more up-to-date molecular-based phylogenetic synopsis (the green tree in the figure above), the most parsimonious interpretation (and probably, the most likely) is that the fossil may indeed be a stem-Castanoideae, a representative of the lineage from which the Laurasian oaks evolved at least 55 million years ago (the oldest Quercus fossil was found in SE Asia), or even represent a morphologically primitive, extinct (South) American lineage of the Fagaceae. Regarding the "southern route", Ockham's Razor would favor that they are just a South American extension of the widespread Eocene Laurasian Fagaceae / Castanoideae, since very similar fossils and castaneoid pollen are found in equally old and older sites in North America, Greenland (papers cited by Wilf et al.) and Eurasia but not Australia, New Zealand or Antarctica.

A final note: when you have so few characters to compare, you should use OTUs that are not completely ambiguous in every potentially discriminating character, as scored for the "C. fissa group" — the "Castanopsis group" has a single unambiguously defined, potentially derived trait. Using artificial bulk taxa is generally a bad idea when mapping trait evolution onto a molecular backbone tree. Instead, you should compile a representative placeholder taxa set, with as many taxa as you need (or are feasible) to represent all character combinations seen in the modern species/genera.

Other cited references, with comments
Crepet WL, Nixon KC (1989) Earliest megafossil evidence of Fagaceae: phylogenetic and biogeographic implications. American Journal of Botany 76: 842–855. – introducing a Castanopsis-like infructescence interpreted to represent an extinct genus but very similar to the new Patagonian fossil in its preserved features; and co-occurring with castaneoid pollen (not reported so far for Patagonia) and foliage.

Denk T, Grimm GW (2010) The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366. – includes an all-"Quercaceae" ITS-tree (fig. 3) and -network (fig. 4) using data of ~ 1000 ITS accessions; the nuclear-encoded ITS is so far the only comprehensively sampled gene region that gets the genera and main intra-generic lineages apart (recently confirmed and refined by NGS phylogenomic data), something wide-sampled plastid barcodes struggle with. Analysed with up-to-date methods and avoiding long-branch interference by excluding the only partially alignable Fagus, Castanopsis dissolves into a grade in the all-accessions tree and Quercus is deeply nested within the Castanoideae (as already seen in the 2001 tree used by Wilf et al. as backbone). The species-level PBC neighbor-net prefers a circular arrangement in which Notholithocarpus remains a putative sister of substantially divergent and diversified Quercus, followed by Castanea-Castanopsis, and Lithocarpus, while Chrysolepis is recognized as unique.

Oh S-H, Manos PS (2008) Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57: 434–451. – Probably still the best Fagaceae tree, and surely not a bad basis for probabilistic mapping of morphological traits in the family.

Manos PS, Cannon CH, Oh S-H (2008) Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55: 181–190. – the tree failed to resolve the monophyly of the largest genus, the oaks, but depicts well the data reality when combining ITS with plastid data and, hence, provides a good trade-off guide tree.