Monday, April 29, 2019

Automatic sound law induction (Open problems in computational diversity linguistics 3)

The third problem in my list of ten open problems in computational diversity linguistics is a problem that has (to my knowledge) not yet been considered a genuine problem in computational historical linguistics; until now, it has been discussed by colleagues only indirectly. This problem, which I call the automatic induction of sound laws, can be described as follows:
Starting from a list of words in a proto-language and their reflexes in a descendant language, try to find the rules by which the ancestral language is converted into the descendant language.
Note that by "rules", in this context, I mean the classical notation that phonologists and historical linguists use in order to convert a source sound into a target sound in a specific environment (see Hall 2000: 73-75). If we consider the following ancestral and descendant words from a fictive language, we can easily find the laws by which the input should be converted into an output — namely, an a should be changed to an e, an e should be changed to an i, and a k changes to s if followed by an i but not if followed by an a.

Input Output
papa pepe
mama meme
kaka keke
keke sisi

Short excursus on linguistic notation of sound laws

Based on the general idea of sound change (or sound laws in classical historical linguistics) as some kind of function by which a source sound is taken as input and turned into a target sound as output, linguists use a specific notation system for sound laws. In the simplest form of the classical sound law notation, this process is described in the form s > t, where s is the source sound and t is the target sound. Since sound change often relies on the specific conditions of the surrounding context — i.e. it makes a difference whether a sound occurs at the beginning or the end of a word — context is added as a condition separated by a /, with an underscore _ referring to the sound in its original phonetic environment. Thus, the phenomenon of voiced stops becoming unvoiced at the end of words in German (e.g. d becoming t) can be written as d > t / _$, where $ denotes the end of a word.
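To make the analogy with pattern rewriting concrete, a rule like d > t / _$ can be operationalized with a regular expression. The following Python sketch is my own illustration (the helper and its rule format are not part of any existing tool), treating _$ as the regex end-of-string anchor:

```python
import re

def apply_rule(word, source, target, context="_$"):
    """Apply a sound law of the form `source > target / context`.

    Only the word-final context "_$" is handled in this sketch.
    """
    if context == "_$":
        # "_$" means: the source sound in word-final position
        return re.sub(source + "$", target, word)
    raise NotImplementedError(context)

print(apply_rule("hund", "d", "t"))  # German final devoicing: "hunt"
print(apply_rule("dach", "d", "t"))  # initial d is untouched: "dach"
```

A full parser of the traditional notation would, of course, need to handle arbitrary contexts on both sides of the underscore.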

One can see how close this notation comes to regular expressions, and, according to many scholars, the rules by which languages change with respect to their sound systems do not exceed the complexity of regular grammars. Nevertheless, sound change notation does differ in its scope and annotation conventions. One notable difference is the possibility to express how full classes of sounds change in a specific environment. The German rule of devoicing, for example, generally affects all voiced stops at the end of a word. As a result, one could also annotate it as G > K / _$, where G would denote the sounds [b, d, g] and K their counterparts [p, t, k]. Although we could easily write a single rule for each of the three phenomena here, the rule by which the sounds are grouped into two classes of voiced sounds and their unvoiced counterparts is linguistically more interesting, since it reminds us that the change by which word-final consonants lose the feature of voice is a systemic change, and not a phenomenon applying to some random selection of sounds in a given language.
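The class-based notation can likewise be sketched in code: a rule like G > K / _$ is simply a grouped shorthand for several individual rules. The class definitions below are the German-specific grouping from the example above, and the helper names are my own:

```python
import re

# Language-specific sound classes: German voiced stops (G) and their
# voiceless counterparts (K), matched position by position.
CLASSES = {
    "G": ["b", "d", "g"],
    "K": ["p", "t", "k"],
}

def expand_class_rule(source_class, target_class):
    """Expand G > K / _$ into a list of (source, target) segment rules."""
    return list(zip(CLASSES[source_class], CLASSES[target_class]))

def apply_final_devoicing(word):
    for s, t in expand_class_rule("G", "K"):
        word = re.sub(s + "$", t, word)
    return word

print(apply_final_devoicing("tag"))  # "tak"
print(apply_final_devoicing("lob"))  # "lop"
```

The point of the class symbols is exactly what the expansion makes visible: one systemic statement covers all three segment-level changes at once.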

The problem with this systemic annotation, however, is that the grouping of sounds into classes that change in a similar form is often language-specific. As a result, scholars have to propose new groupings whenever they deal with another language. Since neither the notation of sound values nor the symbols used to group sounds into classes are standardized, it is extremely difficult to compare different proposals made in the literature. As a result, any attempt to solve the problem of automatic sound law induction in historical linguistics would at the same time have to make strict proposals for a standardization of sound law notations used in our field. Standardization can thus be seen as one of the first major obstacles to solving this problem, with the problem of accounting for systemic aspects of sound change as the second one.

Beyond regular expressions

Even if we put the problem of inconsistent annotation and systemic changes to one side, the analogy with regular expressions cannot properly handle all aspects of sound change. When looking at the change from Middle Chinese to Mandarin Chinese, for example, we find a complex pattern by which originally voiced sounds, like [b, d, g, dz] (among others), were either devoiced, becoming [p, t, k, ts], or devoiced and aspirated, becoming [pʰ, tʰ, kʰ, tsʰ]. While it is not uncommon that one sound can change into two variants, depending on the context in which it occurs, the Mandarin sound change in this case is interesting because the context is not a neighboring sound, but is instead the Middle Chinese tone of the syllable in question — syllables with a flat tone (called píng tone in classical terminology) are nowadays voiceless and aspirated, and syllables with one of the three remaining Middle Chinese tones (called shǎng, qù, and rù) are nowadays plain voiceless (see List 2019: 157 for examples).

Since tone is a feature that applies to whole syllables, and not to single sound segments, we are dealing here with so-called supra-segmental features. As the term supra-segmental indicates, the features in question cannot be represented as a sequence of sounds, but need to be thought of as an additional layer, similar to other supra-segmental features in language, such as stress or juncture (indicating word or morpheme boundaries).

In contrast to sequences as we meet them in mathematics and informatics, linguistic sound sequences do not consist solely of letters drawn from an alphabet and lined up in some unique order. They are instead often composed of multiple layers, which are in part hierarchically ordered. Words, morphemes, and phrases in linguistics are thus multi-layered constructs, which cannot be represented by one sequence alone, but could be more fruitfully thought of as analogous to a partitura in music — the score of a piece of orchestral music, in which every voice of the orchestra is given its own sequence of sounds, and all of the different sequences are aligned with each other to form a whole.

The multi-layered character of sound sequences can be seen as similar to a partitura in musical notation.

This multi-layered character of sound sequences in spoken languages constitutes a third complication for the task of automatic sound law induction. Finding the individual laws that turn one stage of a language into a later stage cannot (always) be trivially reduced to the task of finding the finite state transducer that translates a set of input strings into a corresponding set of output strings. Since our input word forms in the proto-language are not simple strings, but rather alignments of the different layers of a word form, a method to induce sound laws needs to be able to handle the multi-layered character of linguistic sequences.

Background for computational approaches to sound law induction

To my knowledge, the question of how to induce sound laws from data on proto- and descendant languages has barely been addressed. What comes closest to the problem are attempts to model sound change from known ancestral languages, such as Latin, to daughter languages, such as Spanish. This is reflected, for example, in the PHONO program (Hartmann 2003), where one can insert data for a proto-language along with a set of sound change rules (provided in a similar form to that mentioned above), which need to be given in a specific order, and are then checked to see whether they correctly predict the descendant forms.

For teaching purposes, I adapted a JavaScript version of a similar system, the Sound Change Applier² by Mark Rosenfelder (from 2012), in which students could try to turn Old High German into modern German, by assigning simple rules as they are traditionally used to describe sound change processes in the linguistic literature. This adaptation (which is available online) compares the attested output with the output generated by a given set of rules, and provides some assessment of the general accuracy of the proposed set of rules. For example, when feeding the system the simple rule an > en /_#, which turns all final instances of -an into -en, 54 out of 517 Old High German words will yield the expected output in modern Standard German.
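The logic behind such a rule-testing tool can be sketched in a few lines of Python. The word pairs below are invented for illustration (they are not the actual Old High German data), and the rule format is a simplified regex stand-in for the traditional notation:

```python
import re

# One ordered rewrite rule: an > en / _# (word-final -an becomes -en)
rules = [("an$", "en")]

# Invented source/target pairs, standing in for the real OHG/German data
pairs = [
    ("tagan", "tagen"),
    ("geban", "geben"),
    ("wort", "wort"),
    ("fisk", "fisch"),  # not covered by the rule above
]

def predict(word, rules):
    """Apply the ordered rule list to a source form."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

correct = sum(predict(src, rules) == tgt for src, tgt in pairs)
print(f"{correct} out of {len(pairs)} predicted correctly")  # 3 out of 4
```

Scoring a rule set is thus trivial; as the next paragraph argues, the hard part is comparing two rule sets that score equally well.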

The problem with these endeavors is, of course, the handling of exceptions, along with the comparison of different proposals. Since we can think of an infinite number of rules by which we could successfully turn a certain number of Old High German strings into Standard German strings, we would need to ask ourselves how we could evaluate different proposals. That some kind of parsimony should play a role here is obvious. However, it is by no means clear (at least to me) how to evaluate the complexity of two systems, since the complexity would not only be reflected in the number of rules, but also in the initial grouping of sounds into classes, which is commonly used to account for systemic aspects of sound change. A system accounting for the problem of sound law induction would try to automate the task of finding the set of rules. The fact that it is difficult even to compare two or more proposals based on human assessment further illustrates why I think that the problem is not trivial.

Another class of approaches is that of word prediction experiments, such as the one by Ciobanu and Dinu (2018) (but see also Bodt and List 2019), in which training data consisting of the source and the target language are used to create a model, which is then successively applied to new data, in order to test how well this model predicts target words from the source words. Since the model itself is not reported in these experiments, but only used in the form of a black box to predict new words, the task cannot be considered to be the same as the task for sound law induction — which I propose as one of my ten challenges for computational historical linguistics — given that we are interested in a method that explicitly returns the model, in order to allow linguists to inspect it.

Problems with the current solutions to sound law induction

Given that no real solutions to the problem exist up to now, it seems somewhat useless to point to the problems of current solutions. What I want to mention in this context, however, are the problems of the solutions presented for word prediction experiments, be they fed by manual data on sound changes (Hartmann 2003), or based on inference procedures (Ciobanu and Dinu 2018, Dekker 2018). Manual solutions like PHONO suffer from the fact that they are tedious to apply, given that linguists have to present all sound changes in their data in an ordered fashion, with the program converting them step by step, always turning the whole input sequence into an intermediate output sequence. The word prediction approaches, in turn, suffer from limitations in feature design.

The method by Ciobanu and Dinu (2018), for example, is based on orthographic data alone, using the Needleman-Wunsch algorithm for sequence alignment (Needleman and Wunsch 1970); and the approach by Dekker (2018) only allows for the use of the limited alphabet of 40 symbols proposed by the ASJP project (Holman et al. 2008). In addition to the limited representation of linguistic sound sequences, be it by resorting to abstract orthography or to abstract reduced phonetic alphabets, none of the methods can handle those kinds of contexts that result from the multi-layered character of speech. Since we know well that these aspects are vital for certain phenomena of sound change, the methods exclude from the beginning an aspect that traditional historical linguists, who might be interested in an automatic solution to the sound law induction problem, would put at the top of their wish-list of what the algorithm should be able to handle.
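For readers unfamiliar with it, the Needleman-Wunsch algorithm mentioned above is a short dynamic-programming procedure for global sequence alignment. The following is a minimal sketch with naive match/mismatch scoring; real applications to sound sequences would use phonetically informed scores rather than the ±1 values assumed here:

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """Minimal Needleman-Wunsch global alignment of two strings."""
    n, m = len(a), len(b)
    # Fill the dynamic-programming matrix of best prefix scores
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(nw_align("tanto", "tant"))  # ('tanto', 'tant-')
```

The gap symbol "-" in the output is what alignment-based methods inspect when inferring which segments correspond across languages.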

Why is automatic sound law induction difficult?

The handling of supra-segmental contexts, mentioned above, is in my opinion also the reason why sound law induction is so difficult, not only for machines, but also for humans. I have so far mentioned three major problems as to why I think sound law induction is difficult. First, we face problems in defining the task properly in historical linguistics, due to a significant lack of standardization. This makes it difficult to decide on the exact output of a method for sound law induction. Second, we have problems in handling the systemic aspects of sound change properly. This does not apply only to automatic approaches, but also to the evaluation of different proposals for the same data made by humans. Third, the multi-layered character of speech requires an enhanced modeling of linguistic sequences, which cannot be modeled as mono-dimensional strings alone, but should rather be seen as alignments of different strings representing different layers (tonal layer, stress layer, sound layer, etc.).

How humans detect sound laws

There are only a few examples in the literature where scholars have tried to provide detailed lists of sound changes from proto- to descendant language (Baxter 1992, Newman 1999). Most examples of individual sound laws proposed in the literature are rarely even tested exhaustively on the data. As a result, it is difficult to assess what humans usually do in order to detect sound laws. What is clear is that historical linguists who have been working a lot on linguistic reconstruction tend to acquire a very good intuition that helps them to quickly check sound laws applied to word forms in their head, and to convert the output forms. This ability is developed in a learning-by-doing fashion, with no specific techniques ever being discussed in the classroom, which reflects the general tendency in historical linguistics to trust that students will learn how to become a good linguist from examples, sooner or later (Schwink 1994: 29). For this reason, it is difficult to take inspiration from current practice in historical linguistics, in order to develop computer-assisted approaches to solve this task.

Potential solutions to the problem

What can we do in order to address the problem of sound law induction in automatic frameworks in the future?

As a first step, we would have to standardize the notation system that we use to represent sound changes. This would need to come along with a standardized phonetic transcription system. Scholars often think that phonetic transcription is standardized in linguistics, specifically due to the use of the International Phonetic Alphabet. As our investigations into the actual application of the IPA have shown, however, the IPA cannot be seen as a standard, but rather as a set of recommendations that are often only loosely followed by linguists. First attempts to standardize phonetic transcription systems for the purpose of cross-linguistic applications have, however, been made, and will hopefully gain more acceptance in the future (Anderson et al. forthcoming).

As a second step, we should invest more time in investigating the systemic aspects of language change cross-linguistically. What I consider important in this context is the notion of distinctive features by which linguists try to group sounds into classes. Since feature systems proposed by linguists differ greatly, with some debate as to whether features are innate and the same for all languages, or instead language-specific (see Mielke 2008 for an overview on the problem), a first step would again consist of making the data comparable, rather than trying to decide in favour of one of the numerous proposals in the literature.

As a third step, we need to work on ways to account for the multi-layered aspect of sound sequences. Here, a first proposal, labelled "multi-tiered sequence representation", has already been made by myself (List and Chacon 2015), based on an idea that I had already used for the phonetic alignment algorithm proposed in my dissertation (List 2014), which itself goes back to the handling of hydrophilic sequences in ClustalW (Thompson et al. 1994). The idea is to define a sound sequence as a sequence of vectors, with each vector (called tier) representing one distinct aspect of the original word. As this representation allows for an extremely flexible modeling of context — which would just consist of an arbitrary number of vector dimensions that could account for aspects such as tone, stress, preceding or following sounds — this representation would allow us to treat words as sequences of sounds while at the same time accounting for their multi-layered structure. Although there remain many unsolved aspects of how to exploit this specific model for phonetic sequences to induce sound laws from ancestor-descendant data, I consider this to be a first step in the direction of a solution to the problem.
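To illustrate the idea, a multi-tiered sequence can be sketched as a list of segment vectors, with the tone tier conditioning a change on the sound tier, much like the Mandarin split discussed earlier. The segment values, tier names, and devoicing table below are illustrative assumptions of mine, not real Middle Chinese data:

```python
# Each segment is a vector of tiers; here just a sound tier and a tone
# tier that is repeated on every segment of the syllable.
DEVOICE = {"b": "p", "d": "t", "g": "k"}

def devoice_initial(syllable):
    """Devoice a voiced initial, aspirating it in píng-tone syllables.

    syllable: list of {"sound": ..., "tone": ...} segment vectors.
    """
    out = []
    for i, seg in enumerate(syllable):
        sound = seg["sound"]
        if i == 0 and sound in DEVOICE:
            sound = DEVOICE[sound]
            if seg["tone"] == "ping":  # the tone tier supplies the context
                sound += "ʰ"
        out.append({"sound": sound, "tone": seg["tone"]})
    return out

syll = [{"sound": s, "tone": "ping"} for s in ["b", "a", "n"]]
print("".join(seg["sound"] for seg in devoice_initial(syll)))  # "pʰan"
```

The context of the change is thus read off a parallel tier rather than off neighboring segments, which is exactly what string-rewriting notation struggles to express.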

Multi-tiered sequence representation for a fictive word in Middle Chinese.


Although it is not necessarily recognized by the field as a real problem of historical linguistics, I consider the problem of automatic sound law induction to be a very important one for our field. If we could infer sound laws from a set of proposed proto-forms and a set of descendant forms, then we could use them to test the quality of the proto-forms themselves, by inspecting the sound laws proposed by a given system. We could also compare sound laws across different language families to see whether we find cross-linguistic tendencies.

Having inferred enough cross-linguistic data on sound laws represented in unified models for sound law notation, we could also use the rules to search for cognate words that have so far been ignored. There is a lot to do, however, until we reach this point. Starting to think about automatic, and also manual, induction of sound laws as a specific task in computational historical linguistics can be seen as a first step in this direction.

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp 1-27.

Baxter, William H. (1992) A handbook of Old Chinese Phonology. Berlin: de Gruyter.

Bodt, Timotheus A. and List, Johann-Mattis (2019) Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages. 1-22. [Preprint, under review, not peer-reviewed]

Ciobanu, Alina Maria and Dinu, Liviu P. (2018) Simulating language evolution: A tool for historical linguistics. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp 68-72.

Dekker, Peter (2018) Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks. University of Amsterdam: Amsterdam.

Hall, T. Alan (2000) Phonologie: Eine Einführung. Berlin and New York: de Gruyter.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp 606-609.

Holman, Eric W. and Wichmann, Søren and Brown, Cecil H. and Velupillai, Viveka and Müller, André and Bakker, Dik (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3: 116-121.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper, presented at the workshop Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE] (2015/09/04, Leiden, Societas Linguistica Europaea).

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Mielke, Jeff (2008) The Emergence of Distinctive Features. Oxford: Oxford University Press.

Needleman, Saul B. and Wunsch, Christian D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Newman, John and Raman, Anand V. (1999) Chinese Historical Phonology: Compendium of Beijing and Cantonese Pronunciations of Characters and their Derivations from Middle Chinese. München: LINCOM Europa.

Schwink, Frederick (1994) Linguistic Typology, Universality and the Realism of Reconstruction. Washington: Institute for the Study of Man.

Thompson, J. D. and Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.

Monday, April 22, 2019

The 2nd Amendment does more than keep King George away

A year ago, in the aftermath of the Florida shooting, I used a neighbor-net as a way to visualize U.S. gun legislation (see the first graph here). In this post, we will use this network to explore some other aspects of American society.

A network illustrating the diversity in U.S. gun legislation. Blue stars – states with a gun registry.

The network picture emphasizes those states where guns are regulated to some extent (in green), but this means that the states at the bottom-left have little or no regulation of gun ownership. Note, first, that the U.S. gun lobby argues that the absence of any gun control is covered by the 2nd Amendment to the U.S. Constitution, which covers the right of citizens to form a "well regulated militia", an amendment installed to protect the freedom of the new republic from the former British sovereign (ie. to "keep King George away").

This claim ignores the fact that "well regulated" implies regulation of some sort, while the network emphasizes its absence in many cases. Besides, the risk of being re-conquered by Her Majesty's Royal Army is quite low these days, with or without Brexit. More to the point, the world itself has changed quite a bit since the 1700s, while the Constitution has had only a few Amendments added and subtracted.

Using the neighbor-net to explore the data, we can see that there is at least one obvious consequence of unregulated gun ownership. For example, the next plot shows the number of gun-related deaths (in 2016) super-imposed on the gun-regulation network.

The total number of firearm-related deaths in 2016 (includes accidents and suicides).
This and more plots can be found in the earlier post:
Visualising U.S. gun legislation, and mapping politics, economics, and population.

There seems to be a good correlation between unregulated gun ownership and the probability of getting shot or shooting yourself — the number of shootings is greatest in the lower-left of the network, where gun ownership is essentially unregulated (see the Gun Violence Archive for current numbers).

Arming every citizen may have helped to fend off King George's Redcoats but, in the long run, a substantial number of Americans (c. 275,000 per year, when compared with Canada's rate) would still be alive if the Colonies had become a dominion of the Crown, like Australia or Canada. Both Canadians and Australians own a lot of firearms per capita (see the Small Arms Survey for up-to-date estimates), but while Canada has long had Europe-style legislation (and low casualty frequencies), Australia implemented such laws more recently, leading to a massive drop in firearm-related deaths (see above).

As a side note, arming every male citizen to secure freedom from a feudal lord was probably a Swiss invention (see the Swiss Federal Charter of 1291, the Bundesbrief). Switzerland has a compulsory general draft of young males, and after this service they take their Sturmgewehr back home for the yearly training exercise, and to be prepared to fend off invaders (until 2007, including the ammunition). The Swiss have a roughly 4-times lower rate of firearm-related deaths (2.8 in 2015, nearly all of them males) — the only EU country approaching the lowest U.S. values is Finland, where the deaths are almost exclusively accidents and suicides.

Other factors

It is important to keep in mind that the United States is a true federation of states, with each state having a substantial amount of autonomy, to a degree rarely found in other countries with a federal organization. Hence, many other aspects differ between the states, not just the substantial differences in gun legislation.

For example, economics differ greatly between the states, and this also shows a reasonable correlation with gun regulation, as seen in this next version of the network. Note that Gross Domestic Product (GDP) is a monetary measure of the market value of all the goods and services produced annually — rich places have high GDP and poor places have lower GDP.

Real gross domestic product per capita mapped on the gun-legislation-based network.
Red, below global U.S. value; green above global U.S. value.
Data source: U.S. Bureau of Economic Analysis.

So, the economically poorer the state, the less likely there is to be gun regulation.

Modern developments include allowing women into the armed forces, and granting them the right to vote. For example, the 19th Amendment to the US Constitution granted women the right to vote, which was passed by Congress June 4, 1919, and ratified on August 18, 1920. This first map shows the situation for the European Union, some parts of which lagged behind the U.S.

Implementation of general right to vote within the countries of the EU (source: Süddeutsche Zeitung).
In the case of Germany and France, the reason was a lost war leading to the (re)establishment of new republics.

Women make up about 50% of the populace and (usually) more than 50% of the electorate (having a generally higher life expectancy), but they are still typically under-represented in parliaments (here are a few examples). The United States is, sadly, a good example of this imbalance. This next map shows that the women in 13 states currently have no same-sex representation in the U.S. Congress.

Female representation in the current U.S. Congress.
The green part of each pie chart indicates the proportion of women representatives.

This leads to the obvious question for this blog post: how does the absence of female representatives (and senators) relate to the absence of gun regulation? So, let's map the above collection of pie charts onto the gun legislation network.

Female representation in the U.S. Congress after 2018 mid-term elections
(includes Senate and House of Representatives).
The c. 700,000 inhabitants of the District of Columbia (DC) have no voting representation in
Congress, but send a non-voting delegate to the House.

There is a general trend — those states with little or no gun regulation (bottom left) have less female representation than those with (some) gun regulation. Perhaps someone took the 2nd Amendment a bit too literally (as the right of every man to carry a gun), and this keeps not only King George away from the country, but also women away from Congress?

Exceptions to this generalization (ranging from 75% down to 33%) are sparsely populated states with only a few members of Congress: New Hampshire (NH, 75%; 2 representatives in addition to the two U.S. senators representing each state), Maine (ME, 2 reps.), West Virginia (WV; 3 reps), Alaska (AK; 1 rep.), New Mexico (NM; 3 reps), and Nevada (NV; 4 reps). All of these states have one thing in common: a substantial proportion of the state is wilderness.

At the other end, some states with relatively high levels of gun regulation, like Maryland (MD; 8 reps), Rhode Island (RI; 2 reps), New Jersey (NJ; 12 reps) and Colorado (CO; 7 reps), lack women in Congress (0–15%, ie. one representative or none). This may relate to these states being very densely populated (MD, RI, NJ), and, irrespective of outside threats, no-one wants their close neighbors running around with guns. Colorado is particular in this sense, because with Denver it includes a major population center (the nucleus of the emerging Front Range megaregion), which has enforced much stricter gun regulation than is found elsewhere in the state.

A map showing Colorado's congressional districts, for the 113th Congress.
Data from the defunct digital version of the U.S. National Atlas.

Do more women in parliament save American lives?

According to a recent Gallup poll, Americans have the highest regard for nurses, a profession mostly occupied by women, and the lowest regard for Members of Congress, a profession mostly occupied by men. Hence, it would make sense to explore the data the other way around. We will explore this in a later post.

Monday, April 15, 2019

Tournament success is not poker success

Let us suppose for a moment that we wish to list the world's best professional poker players. This might be of some interest, because poker is partly a game of luck (the cards are dealt at random) and partly a game of skill (players choose how to play their cards). Indeed, put simply, the idea is to convince your opponents that you have a weak hand when they have a strong one (so that they will bet against you) and a strong hand when they have a weak one (so that they will fold).

One well-known way to assess poker success is to look at tournament winnings. Indeed, Nathan Williams recently did this for The Top 50 Best Poker Players of All Time by simply listing the 50 greatest money earners from The Hendon Mob database. This database accumulates data on the lifetime money winnings for all of those participants who have ever cashed in a live poker tournament.

However, this approach does not work. In fact, there are at least five reasons why this is not appropriate:
  1. Inflation continues unabated. After all, $1 now is not worth as much as $1 was 30 years ago. In fact, something that cost $1 in 1990 would cost a bit more than $2 now (ie. the money has lost about half of its value). So, the value of current winnings cannot be compared to those of the past.
  2. There are more tournaments now than there have ever been. So, there are more opportunities to play them now, and to thereby potentially accumulate more money for the same tournament success rate.
  3. The tournament fields are now generally bigger. This means that the average prize money for each tournament is now much greater than before (since the money is provided by the participants themselves). In particular, the top prizes now provide more money than whole tournaments did 20 years ago.
  4. Some of the best players play online rather than live. Obviously, this is a bit more difficult these days, due to the banning of online poker in the USA, but it is still a significant source of poker income for many people.
  5. Some of the best players do not play many tournaments — instead, they play cash games. Indeed, if you want to make a living playing poker, you may be better off playing for cash rather than for prize money, as tournament success is much more of a lottery.
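Point no. 1, for instance, amounts to deflating each historical cash into present-day dollars. A minimal sketch, assuming a constant 2.4% annual inflation rate (an illustrative figure, not actual CPI data; the function name is my own):

```python
def to_present_value(amount, year_won, present_year=2019, rate=0.024):
    """Compound a past dollar amount forward to present-day dollars."""
    return amount * (1 + rate) ** (present_year - year_won)

# A $1 prize from 1990 is worth roughly $2 in 2019 money:
print(round(to_present_value(1.00, 1990), 2))  # 1.99
```

A serious adjustment would use the actual year-by-year CPI series rather than a flat rate, but the compounding logic is the same.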
The first three reasons all mean that we would have to adjust the tournament winnings, if we wish to have a meaningful assessment of lifetime earnings. As one example of the need to do this, we can look at point no. 3 in a simple way. The first graph shows the current top-100 money earners from The Hendon Mob. For each player, it shows how much of their total earnings came from their biggest single tournament cash.

Note that for the majority of players, a large part of their lifetime winnings came from a single tournament — the median percentage is 18.4% (range 3.8–97.7%). Indeed, for some of the players it is >50%, and for a few it is almost all of their money. Bigger fields mean more money per tournament, and thus bigger cashes when you do well. Note, incidentally, that this graph does contain the top 17 biggest cashes in history (to date).

An alternative approach

So, in order to evaluate players, we actually need a list of criteria that is independent of money won. That is, we need a list of the poker skills of each player. There are several different skills involved in playing poker, and presumably some people are good at some of them, and other people are good at some of the others. A comparison of relative skills is what we need.

This approach was actually tried by Barry Greenstein back in c. 2005. What he did was try to rate a group of 33 of the poker players that he had played against in cash games. He rated these players by style of play, based on ten playing criteria (each scored on a 1–10 scale):
  • Aggressiveness
  • Looseness
  • Short-handed play
  • Limit poker
  • No-limit poker
  • Tournaments
  • Side games
  • Steam control
  • Against weak players
  • Against strong players
Given the time at which this analysis was done (2005), the modern crop of young players are obviously not included, and a few of those people included are no longer playing. However, it is worthwhile looking at the data to see just what can be done with this approach.

Greenstein himself notes: "I don’t think you can add up the ratings in the skill categories to get an accurate comparison of players." He is right; but first let's do it anyway. So, the next graph shows the total score (out of 100) for each player. (Click on the figure to see it at full size.)

The problem here is that we are comparing apples with oranges. That is, the rank ordering of the sum does not make much sense, because it does not group players with similar playing strengths. The rank order would make sense when comparing each feature one at a time, but not for the total. For example, ranking by total winnings does make sense, because we have only one criterion: money (although it is not a useful criterion). This is the basic weakness of having a single rank order.

As one example of how the "overall score" misses important points, note that Eric Seidel and John Juanda have the same total. However, Seidel exceeds Juanda on Steam control, while Juanda exceeds Seidel on Looseness — these are actually two rather different players.

A better way to look at the data is to use a network, as we often do in this blog. The final graph is a NeighborNet (based on the Manhattan distance) of Greenstein's data. Each point represents one of the 33 people. Those people that are near each other in the network have a similar set of scores, while people further apart are progressively more different from each other as poker players.
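For readers who want to try this themselves, the distance matrix behind such a network is simple to compute: the Manhattan distance between two players is the sum of absolute differences across the ten criteria. The players and scores below are hypothetical, not Greenstein's actual ratings.

```python
# Sketch: pairwise Manhattan distances from per-criterion skill scores,
# the kind of matrix a NeighborNet is built from. Scores (1-10 on each
# of ten criteria) are invented, not Greenstein's actual ratings.

scores = {
    "Player X": [9, 8, 7, 8, 9, 7, 8, 6, 9, 8],
    "Player Y": [9, 7, 7, 8, 9, 7, 8, 6, 9, 8],  # nearly identical to X
    "Player Z": [6, 9, 5, 4, 8, 9, 3, 2, 9, 5],  # quite different
}

def manhattan(a, b):
    """Sum of absolute per-criterion differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

names = list(scores)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        print(p, q, manhattan(scores[p], scores[q]))
```

The resulting matrix can then be fed to a NeighborNet implementation (e.g. in SplitsTree) to produce the splits graph.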

As you can see, there is no simple trend from "best" to "worst", but instead a complex set of relationships, just as we would expect. However, the network does show an overall trend of decreasing total score from top to bottom (compare this to the previous graph).

Note, first, that Eric Seidel and John Juanda are on opposite sides of the network (Juanda left, Seidel right). This illustrates how much better the network is as a display of the data, compared to simply summing the scores (as in the previous graph). The network accurately shows the differences in the relative playing styles.

There are some players who are actually gathered together in the network, indicating that they have similar scores across all 10 criteria. For example, Barry Greenstein, Eric Seidel and Howard Lederer rarely differ by more than 1 point on any of the criteria — according to Greenstein, these people have very similar playing styles.

In contrast, Phil Hellmuth and T.J. Cloutier have scores that differ from those of the other players — both have low scores on Side games and Steam control. Gus Hansen is near these two because all three have high scores for Against weak players. Similarly, the legendary Stu Ungar and Patrik Antonius both have high Aggressiveness and Looseness.

There is one final point worth mentioning. As Michel Bettane once said (The absurdity and flattery of scores):
It doesn't take a genius to appreciate the absurdity of giving a number score to a work of art or, worse still, an artist. Salvador Dalí had huge fun scoring great artists (including himself) on the basis of design, color, and composition — but that says far more for his sense of provocation and irony than it does for the principle itself.
Is poker an art, a science or a sport? If it is either of the first two, then scoring players may actually be a Bad Idea.

Monday, April 8, 2019

Next-generation neighbor-nets

Neighbor-nets are a most versatile tool for exploratory data analysis (EDA). Next-generation sequencing (NGS) allows us to tap into an unprecedented wealth of information that can be used for phylogenetics. Hence, it is a natural step to combine the two.

I have been waiting for this (actively-passively), and the time has now come. Getting NGS data has become cheaper and easier, but one still needs considerable resources and fresh material. Hence, NGS papers usually not only use a lot of data, but are also many-authored. You can now find neighbor-nets based on phylogenomic pairwise distances computed from NGS data — for example, in these two recently published open access pre-prints:
  • Pérez Escobar​ OA, Bogarín D, Schley R, Bateman R, Gerlach G, Harpke D, Brassac J, Fernández-Mazuecos M, Dodsworth S, Hagsater E, Gottschling M, Blattner F. 2018. Resolving relationships in an exceedingly young orchid lineage using Genotyping-by-sequencing data. PeerJ Preprint 6:e27296v1
  • Hipp AL, Manos PS, Hahn M, Avishai M, Bodénès C, Cavender-Bares J, Crowl A, Deng M, Denk T, Fitz-Gibbon S, Gailing O, González Elizondo MS, González Rodríguez A, Grimm GW, Jiang X-L, Kremer A, Lesur I, McVay JD, Plomion C, Rodríguez-Correa H, Schulze E-D, Simeone MC, Sork VL, Valencia Avalos S. 2019. Genomic landscape of the global oak phylogeny. bioRxiv DOI:10.1101/587253.

Example 1: A young species aggregate of orchids

Pérez Escobar et al.'s neighbor-nets are based on uncorrected p-distances inferred from a matrix including 13,000 GBS ("genotyping-by-sequencing") loci (see the short introduction to the method on Wikipedia, or the comprehensive PDF from a talk by researchers at Cornell), covering 29 accessions of six orchid species and subspecies.
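An uncorrected p-distance is simply the proportion of differing sites among the sites scored in both sequences. Here is a sketch on toy SNP-like sequences, with "N" standing for missing data; the exact treatment of missing sites is an assumption on my part (check the paper's methods for the details).

```python
# Sketch: uncorrected p-distances from SNP-like sequences, the distance
# used for the orchid neighbor-nets. "N" marks missing data; only sites
# scored in both sequences are compared. Sequences are toy examples.

def p_distance(seq1, seq2, missing="N"):
    """Proportion of mismatches among sites present in both sequences."""
    pairs = [(a, b) for a, b in zip(seq1, seq2)
             if a != missing and b != missing]
    if not pairs:
        return float("nan")  # no shared sites: distance undefined
    return sum(a != b for a, b in pairs) / len(pairs)

acc1 = "ACGTACGTAC"
acc2 = "ACGAACGTNC"   # one mismatch, one missing site
print(p_distance(acc1, acc2))  # 1 mismatch over 9 shared sites
```

Computing this for every pair of accessions gives the distance matrix that the neighbor-net algorithm takes as input.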

They also inferred maximum likelihood trees, and did a coalescent analysis to consider possible tree-incompatible signal: gene-tree incongruence due to potential reticulation and incomplete lineage sorting. They applied the neighbor-net to their data because "split graphs are considered more suitable than phylograms or ultrametric trees to represent evolutionary histories that are still subject to reticulation (Rutherford et al., 2018)" – which is true, although neighbor-nets do not explicitly show a reticulate history.

Here's a fused image of the ML trees (their fig. 1) and the corresponding neighbor-nets (their fig. 2):

Not so "phenetic": the NGS-data neighbor-nets (NNet) show essentially the same thing as the ML trees — the distance matrices reflect putative common origin(s) as much as the ML phylograms do. The numbers at branches and edges show bootstrap support under ML and the NNet optimization.

Groups resolved as clades (Groups I and III), or as grades or clades (Group II; compare A vs. B and C), in the ML trees form simple neighborhoods (relating to one edge-bundle) or more complex ones (defined by two partly compatible edge-bundles; Group I in A) in the neighbor-net splits graphs. Since we are looking at closely related biological units, the evolutionary unfolding likely did not follow a simple dichotomizing tree; hence the ambiguous branch support (left) and competing edge support (right) for some of the groups. Furthermore, each part of a genome will be more discriminative for some aspect of the coalescent and less so for another, which is another source of topological ambiguity (ambiguous BS support) and incompatible signal (as seen in, and handled by, the neighbor-nets). The reconstructions under A, B and C differ in the breadth and gappiness of the included data (all NGS analyses involve data-filtering steps): A includes only loci covered for all taxa, B includes all loci with less than 50% missing data, and C all loci with at least 15% coverage.
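The three filtering schemes can be sketched as a simple coverage filter over alignment columns. The matrix below is a toy example, and the thresholds mirror the A/B/C data sets (complete loci; less than 50% missing; at least 15% coverage, i.e. at most 85% missing).

```python
# Sketch: filtering loci (columns) by missing-data fraction, mimicking
# the three data sets in the orchid study. Toy matrix, "N" = missing.

def filter_loci(matrix, max_missing_frac):
    """Keep columns whose fraction of missing entries is <= threshold."""
    n_taxa = len(matrix)
    n_loci = len(matrix[0])
    keep = []
    for j in range(n_loci):
        missing = sum(matrix[i][j] == "N" for i in range(n_taxa))
        if missing / n_taxa <= max_missing_frac:
            keep.append(j)
    return [[row[j] for j in keep] for row in matrix]

matrix = [list("ACNTN"), list("ACGTN"), list("ANGTN"), list("ACGTA")]

A = filter_loci(matrix, 0.0)    # loci covered for all taxa
B = filter_loci(matrix, 0.49)   # less than 50% missing data
C = filter_loci(matrix, 0.85)   # at least 15% coverage
print(len(A[0]), len(B[0]), len(C[0]))  # columns retained per data set
```

The stricter the threshold, the fewer loci survive, which is why the A, B and C matrices differ in breadth as described above.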

PS: I contacted the first author; the paper is still under review (four peers), a revision is (about to be) submitted, and, with a bit of luck, we'll see it in print soon.

Example 2: The oaks of the world

The Hipp et al. (note that I am an author) neighbor-net is based on model-based distances. The reason I opted (here) for model-based distances instead of uncorrected p-distances is the depth of our phylogeny: our data cover splits that go back to the Eocene, but many of the species found today are relatively young. The dated tree analyses show substantial shifts in diversification rates. Some lineages are diverse today, and possibly were in the past (see the lines in the following graph); in those with few species (*, #), we may be looking at the left-overs of ancient radiations.

A lineage(s)-through-time plot for the oaks (Hipp et al. 2019, fig. 2). Generic diversification probably started in the Eocene around 50 Ma, and between 10–5 Ma parts (usually a single sublineage) of these long-isolated intrageneric lineages (sections) underwent increased speciation.

The data basis is otherwise similar: SNPs (single-nucleotide polymorphisms) generated using a different NGS method, in our case RAD-tagging (RAD-seq), of c. 450 oak individuals covering the entire range of this common tree genus — the most diverse extra-tropical genus of the Northern Hemisphere. There are differences between GBS and RAD-seq SNP data sets — a rule of thumb is that the latter can provide more signal and SNPs, but the single-locus trees are usually less decisive, which can be a problem for coalescent methods and tests for reticulation and incomplete lineage sorting that require a lot of single-locus (or single-gene) trees (see the paper for a short introduction and discussion, and further references).

We also inferred a ML tree, and my leading co-authors did the other necessary and fancy analyses. Here, I will focus on the essential information needed to interpret the neighbor-net that we show (and why we included it at all).

Our fig. 6. Coloring of main lineages (oak sections) same as in the LTT plot. Bluish, the three sections traditionally included in the white oaks (s.l.); red, red oaks; purple, the golden-cup or 'intermediate' (between white and red) oaks — these three groups (five sections) form subgenus Quercus, which except for the "Roburoids" and one species of sect. Ponticae is restricted to the Americas. Yellow to green, the sections and main clades (in our and earlier ML trees) of the exclusively Eurasian subgenus Cerris.

Like Pérez Escobar et al., we noted a very good fit between the distance-matrix-based neighbor-net and the optimised ML tree. Clades with high branch support and intra-clade coherence form distinct clusters, here distinct neighborhoods associated with certain edge bundles (thick colored lines). This tells us that the distance matrix is representative: it captures the primary phylogenetic signal that also informs the tree.

The first thing that we can infer from the network is that we have few missing-data issues in our data. Distance-based methods are prone to missing-data artifacts, and RAD-seq data are (inevitably) rather gappy. It is important to keep in mind that neighbor-nets cannot replace tree analysis in the case of NGS data; they are "just" a tool to explore the overall signal in the matrix. If the network has neighborhoods that contrast with what can be seen in the tree, this can be an indication that one's data are not sufficiently tree-like at all. But it can also just mean that the data are not sufficient to get a representative distance matrix.

Did you notice the little isolated blue dot (Q. lobata)? This is such a case. It has nothing to do with reticulation between the blue and the yellow edges; it's just that the available data don't produce an equally discriminative distance pattern. According to its pairwise distances, this sample is generally much closer to all other oak individuals included in the matrix, in contrast to the other members of its Dumosae clade, which are generally more similar to each other, and to the remainder of the white oaks (s.str., dark blue, and s.l., all bluish).

Close-up on the white oak s.str. neighbor-hood (sect. Quercus) and plot of the preferred dated tree.

Hence, in the tree this sample is placed as sister to all other members and, being closer to the (hypothetical) common ancestor, it triggers a deep Dumosae crown age, c. 10 myr older than the subsequent radiation(s) and as old as the divergence of the rest of the white oaks s.str.

The second observation, which can assist in the interpretation of the ML tree (especially the dated one), is the principal structure (ordering) within each subgenus and section. The neighbor-net is a planar (i.e. 2-dimensional) graph, so the taxa will be put in a circular order. The algorithm essentially identifies the closest relative (which is a candidate for a direct sister, as in a tree) and the second-closest relative. Towards the leaves of the Tree of Life, this is usually a cousin or, in the case of reticulation, the intermixing lineage. Towards the roots, it can reflect the general level of derivation, i.e. the distance to the (hypothetical) ancestor.

Knowing the primary split (between the two subgenera), we can interpret the graph in terms of the general level of (phylogenetic) derivedness.

The overall least derived groups are placed to the left in each subgenus, and the most derived to the right. The reason is long-branch attraction (LBA) stepping in: the red and green groups are the most isolated/unique within their subgenera, and hence they attract each other. This is important to keep in mind when looking at the tree and judging whether (local) LBA may be an issue (parsimony and distance methods will always get the wrong tree in the Felsenstein Zone, but probabilistic methods have a 50% chance of escaping it). In our oak data, we are on the safe side. The red group (sect. Lobatae, the red oaks) is indeed resolved as the first-branching lineage within subgenus Quercus, but within subgenus Cerris it is the yellow group, sect. Cyclobalanopsis. If this were LBA, Cyclobalanopsis would need to be on the right side, next to the red oaks.

The third obvious pattern is the distinct form of each subgraph: we have neighborhoods with long, slim root trunks and others that look like broad fans.

Long, narrow trunks (i.e. distances showing high intra-group coherence and high inter-group distinctness) can be expected for long-isolated lineages with small (founder) population sizes, e.g. lineages that underwent severe or repeated bottleneck situations in the past. Unique genetic signatures will be quickly accumulated (increasing the overall distance to sister lineages), and extinction ensures that only one signature (or a set of very similar ones) survives (low intra-group diversity until the final radiation).

Fans represent gradual, undisturbed accumulation of diversity over a long period of time, e.g. frequent radiation and formation of new species during range and niche expansion. In the absence of stable barriers, we get a very broad, rather unstructured fan like that of the white oaks (s.str.; blue); along a relatively narrow (today, and likely in the past) geographic east-west corridor (here: the 'Himalayan corridor'), we get a more structured, elongated one, as in the case of section Ilex (olive).

Close-up on the sect. Ilex neighborhood, again with the tree plotted. In the tree, we see just sister clades; in the network, we see the strong correlation between geography and genetic diversity patterns, indicating a gradual expansion of the lineage towards the west until finally reaching the Mediterranean. Only sophisticated, explicit ancestral-area analysis could possibly come to a similar result (often without certainty), whereas it is obvious from comparing the tree with the network.

This can go along with higher population sizes and/or more permeable species barriers, both of which will lead to lower intra-group diversity and less tree-compatible signals. Knowing that both section Quercus (white oaks s.str., blue) and section Ilex (olive) evolved and started to radiate at about the same time, it is obvious from the structure of the two fans that the (mostly and originally temperate) white oaks always produced more, but likely less stable, species than the mid-latitude (subtropical to temperate) Ilex oaks, which today span an arc from the Mediterranean via the southern flanks of the Himalayas into the mountains of China and the subtropics of Japan.

Networks can be used to understand, interpret and confirm aspects of the (dated) NGS tree.

The much older stem ages and younger crown ages seen in dated trees may be indicative of bottlenecks, too. But since we typically use relaxed clock models, which allow for rate changes and rely on very few fixed points (e.g. fossil age constraints), we may get (too?) old stem and (much too) young crown ages, especially for poorly sampled groups or unrepresentative data. By looking at the neighbor-net, we can see directly that the relatively old crown ages for the lineages with (today) few species fit with their within-lineage and general distinctness.

The deepest splits: the tree mapped on the neighbor-net.

By mapping the tree onto the network, and thus directly comparing the tree to the network, we can see that different evolutionary processes may need to be considered to explain what we see in the data. It also shows us how much of our tree is (data-wise) trivial, and where it could be worthwhile to take a deeper look, e.g. by applying coalescent networks, generating more data, or recruiting additional data. Last, but not least, it's quick to infer and makes pretty figures.

So, try it out with your NGS data, too.

PS. Model-based distances can be inferred with the same program that many of us use to infer the ML tree: RAxML. We can hence use the same model assumptions for the neighbor-net that we optimised for inferring the tree and establishing branch support.
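As a sketch (the exact flags are an assumption on my part; check the RAxML manual for your version), RAxML v8's -f x algorithm computes pairwise ML distances under the chosen model; the input file and run name below are placeholders:

```shell
# Sketch (assumes RAxML v8; check your manual): compute pairwise ML
# distances under GTR+Gamma. "alignment.phy" and "ml_dists" are
# placeholder names.
raxmlHPC -f x -p 12345 -m GTRGAMMA -s alignment.phy -n ml_dists
# The distances land in RAxML_distances.ml_dists, which can be reshaped
# into a matrix and fed to SplitsTree to infer the neighbor-net.
```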

Monday, April 1, 2019

The Tree of Life (April 1)

The so-called Tree of Life is actually an anastomosing plexus rather than a divaricating tree, due to extensive interconnections between the cell and genome lineages during early single-cell evolution. These connections may have been caused by the process known as horizontal gene transfer.

Furthermore, the alleged Last Universal Common Ancestor may not have been a single coherent group, but may have been a mixture of quite different genotypes. After all, this supposed ancestor does not represent the origin of life, but was itself the end-product of an extensive prior evolutionary history.

These two basic points are illustrated in the following figure.

Happy April 1. For previous posts, see: