Monday, October 14, 2019

Some hitherto unknown genealogical trees of music

In last week's post, I discussed Petter Hellström's recent doctoral thesis: Trees of Knowledge: Science and the Shape of Genealogy. In this thesis he discusses three "genealogical trees" in detail. Augustin Augier’s tree of plant families and Félix Gallet’s family tree of languages have already been covered in this blog (you can look them up using the Search box, to the right), but Henri Montan Berton’s family tree of chords has not.

Indeed, the historical literature at large has pretty much ignored the idea of a genealogical tree being associated with music. Nevertheless, the tree itself is explicitly labeled a Genealogical Tree of Chords. This tree, and its predecessor by François Guillaume Vial, thus deserve examination.

Henri Montan Berton (1767–1844) is well known within the history of music, and his tree was published as an independent broadsheet in two (almost identical) editions, in c. 1807 and 1815. It seems to have been produced as a teaching tool, as indeed were the trees of Augier and Gallet. As Petter Hellström notes, for these authors "genealogy did not necessarily involve chronology or change ... the introduction of family trees into secular knowledge production had more to do with the needs of information management, visualisation and communication".

Berton himself states (translated from the French):
In composing the Genealogical Tree, one has had the intention to present to the eye, at a single glance, the reunion of the great family of Chords, and to demonstrate to the eye that there is only one Primordial [Chord], and that it is the source of all Harmonies.
At the base of the tree is a fundamental bass note along with its 12th and 17th major — this was the harmonic series in 18th century music theory. From here the tree produces 8 branches above, each labeled (at the bottom) with a musical chord, and with another 20 chords labeled further up the branches (all highlighted by arrows at the left). The main trunk (denoted A) is labeled Perfect or Constant Chord. The eight branches are intended to show the relationships between "8 fundamental chords [bottom arrow] and 20 inverted chords [the upper arrows]".

The tree thus displays the harmonic relationships among the chords, rather than any sort of chronological development. It was devised as an aid to learning the fundamentals of music composition.

Berton was not the first to use this idea within music theory. Four decades earlier, in 1766, François Guillaume Vial (1725–?) had produced another broadsheet, this time labeled Genealogical Tree of Harmony.

Like Berton's tree, this is not about chronology, but is about "family relationships" in a different sense. Moreover, in this instance the branching aspect of the tree is abandoned, and the tree foliage is simply festooned with medallions, labeled with chords — it is the different sections of the tree's crown that show relationships, not different branches.

The objective here was to illustrate "the most natural order of harmonic modulation", once again devised as a teaching tool. The two compass roses at the bottom left and right show the circle of fifths (left), guiding horizontal modulation among the chords, and the circle of thirds (right), guiding vertical modulation among the chords.

Vial himself states (translated from the French):
This Genealogical Tree simplifies and allows those who are capable of intonation [to practice] the art of preluding not only on a leading note, but even to change between the most desired modulations of any instrument.
Hellström traces these uses of the "family tree" metaphor in music back to Jean-Philippe Rameau (1683–1764), an influential music theorist. Thus, he concludes that we should:
read the trees of Vial and Berton as graphical codifications of an already established metaphor and manner of thinking about harmony, especially as both authors were informed by Rameau in their understanding of harmony in the first place.
In constructing their respective tree diagrams, Berton and Vial both seized upon an already existing metaphor and made it visible on paper. Their trees are not 'genealogical' in the sense that they charted family history or cross-generational relationships, they are 'genealogical' in the sense that they depict presumably natural, organic relationships, in which every part has its place in the whole, and where every part can be referred back to a common source or root.
These trees do not, therefore, fit into the usual history of genealogical trees, as this blog recognizes them, denoting a chronological history. They would, however, fit neatly into the post on Relationship trees drawn like real trees.

Monday, October 7, 2019

A recent thesis about Trees of Knowledge

Recently, Petter Hellström successfully defended his doctoral thesis:
Trees of Knowledge: Science and the Shape of Genealogy
Department of the History of Science and Ideas
Uppsala University, Sweden
The thesis itself is obviously of great interest to readers of this blog. It is not currently online, but you can obtain a printed or electronic copy by contacting:

Here is the abstract:
This study investigates early employments of family trees in the modern sciences, in order to historicise their iconic status and now established uses, notably in evolutionary biology and linguistics. Moving beyond disciplinary accounts to consider the wider cultural background, it examines how early uses within the sciences transformed family trees as a format of visual representation, as well as the meanings invested in them.
Historical writing about trees in the modern sciences is heavily tilted towards evolutionary biology, especially the iconic diagrams associated with Darwinism. Trees of Knowledge shifts the focus to France in the wake of the Revolution, when family trees were first put to use in a number of disparate academic fields. Through three case studies drawn from across the disciplines, it investigates the simultaneous appearance of trees in natural history, language studies, and music theory. Augustin Augier’s tree of plant families, Félix Gallet’s family tree of dead and living languages, and Henri Montan Berton’s family tree of chords served diverse ends, yet all exploited the familiar shape of genealogy.
While outlining how genealogical trees once constituted a more general resource in scholarly knowledge production — employed primarily as pedagogical tools — this study argues that family trees entered the modern sciences independently of the evolutionary theories they were later made to illustrate. The trees from post-revolutionary France occasionally charted development over time, yet more often they served to visualise organic hierarchy and perfect order. In bringing this neglected history to light, Trees of Knowledge provides not only a rich account of the rise of tree thinking in the modern sciences, but also a pragmatic methodology for approaching the dynamic interplay of metaphor, visual representation, and knowledge production in the history of science.
The trees of Augier and Gallet have been covered in this blog, but that of Berton has not. I will discuss it in the next post.

Monday, September 30, 2019

Typology of semantic change (Open problems in computational diversity linguistics 8)

With this month's problem we are leaving the realm of modeling, which has been the basic aspect underlying the last three problems, discussed in June, July, and August, and enter the realm of typology, or general linguistics. The last three problems that I will discuss, in this and two follow-up posts, deal with the basic problem of collecting or making use of data that allow us to establish typologies, that is, to identify cross-linguistic tendencies for specific phenomena, such as semantic change (this post), sound change (October), or semantic promiscuity (November).

Cross-linguistic tendencies are here understood as tendencies that occur across all languages independently of their specific phylogenetic affiliation, the place where they are spoken, or the time when they are spoken. Obviously, the uniformitarian requirement of independence of place and time is an idealization. As we know well, the capacity for language itself developed, potentially gradually, with the evolution of modern humans, and as a result, it does not make sense to assume that the tendencies of semantic change or sound change were the same through time. This has, in fact, been shown in recent research suggesting that there may be a certain relationship between our diet and the speech sounds that we use in our languages (Blasi et al. 2019).

Nevertheless, in the same way in which we simplify models in physics, as long as they yield good approximations of the phenomena we want to study, we can also assume a certain uniformity for language change. To guarantee this, we may have to restrict the time frame of language development that we want to discuss (eg. the last 2,000 years), or the aspects of language we want to investigate (eg. a certain selection of concepts that we know must have been expressed 5,000 years ago).

For the specific case of a semantic change, the problem of establishing a typology of the phenomenon can thus be stated as follows:
Assuming a certain pre-selection of concepts that we assume were readily expressed in a given time frame, establish a general typology that informs about the universal tendencies by which a word expressing one concept changes its meaning, to later express another concept in the same language.
In theory, we can further relax the conditions of universality and add the restrictions on time and place later, after having aggregated the data. Maybe this would even be the best idea for a practical investigation; but given that the time frames in which we have attested data for semantic changes are rather limited, I do not believe that it would make much of a change.

Why it is hard to establish a typology of semantic change

There are three reasons why it is hard to establish a typology of semantic change. First, there is the problem of acquiring the data needed to establish the typology. Second, there is the problem of handling the data efficiently. Third, there is the problem of interpreting the data in order to identify cross-linguistic, universal tendencies.

The problem of data acquisition results from the fact that we lack data on observed processes of semantic change. Since there are only a few languages with a continuous tradition of written records spanning 500 years or more, we will never be able to derive any universal tendencies from those languages alone, even if languages like Latin and its Romance descendants may be a good starting point, as has been shown by Blank (1997).

Accepting the fact that processes attested only for Romance languages are never enough to fill the huge semantic space covered by the world's languages, the only alternative would be using inferred processes of semantic change — that is, processes that have been reconstructed and proposed in the literature. While it is straightforward to show that the meanings of cognate words in different languages can vary quite drastically, it is much more difficult to infer the direction underlying the change. Handling the direction, however, is important for any typology of semantic change, since the data from observed changes suggests that there are specific directional tendencies. Thus, when confronted with cognates such as selig "holy" in German and silly in English, it is much less obvious whether the change happened from "holy" to "silly" or from "silly" to "holy", or even from an unknown ancient concept to both "holy" and "silly".

As a result, we can conclude that any collection of data on semantic change needs to make crystal-clear upon which types of evidence the inference of semantic change processes is based. Citing only the literature on different language families is definitely not enough. Because of the second problem, this also applies to the handling of data on semantic shifts. Here, we face the general problem of elicitation of meanings. Elicitation refers to the process in fieldwork where scholars use a questionnaire to ask their informants how certain meanings are expressed. The problem here is that linguists have never tried to standardize which meanings they actually elicit. What they use, instead, are elicitation glosses, which they think are common enough to allow linguists to understand to what meaning they refer. As a result, it is extremely difficult to search in field work notes, and even in wordlists or dictionaries, for specific meanings, since every linguist is using their own style, often without further explanations.

Our Concepticon project (List et al. 2019) can be seen as a first attempt to handle elicitation glosses consistently. What we do is link those elicitation glosses that we find in questionnaires, dictionaries, and fieldwork notes to so-called concept sets, each of which reflects a given concept that is given a unique identifier and a short definition. It would go too far to dive deeper into the problem of concept handling here. Interested readers can have a look at a previous blog post I wrote on the topic (List 2018). In any case, any typology of semantic change will need to find a way to address the problem of handling elicitation glosses in the literature, in one way or another.
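To make the idea of linking glosses to concept sets more concrete, here is a minimal sketch in Python. It is not the Concepticon workflow itself: the identifiers (CS-1, CS-2), the tiny lookup table, and the normalization rules are all assumptions made up for illustration.

```python
import re

# Hypothetical concept sets: identifiers and labels are placeholders,
# not actual Concepticon entries.
CONCEPT_SETS = {"CS-1": "HAND", "CS-2": "ARM"}

# Hypothetical lookup from normalized glosses to concept set identifiers.
GLOSS_TO_CONCEPT = {"hand": "CS-1", "arm": "CS-2"}


def normalize(gloss):
    """Reduce an elicitation gloss to a bare, lower-case key."""
    gloss = gloss.strip().lower()
    gloss = re.sub(r"\s*\([^)]*\)", "", gloss)    # drop parenthetical remarks
    gloss = re.sub(r"^(the|a|an|to)\s+", "", gloss)  # drop leading article or "to"
    return re.sub(r"\s+", " ", gloss).strip()


def link(gloss):
    """Return the concept set identifier for a gloss, or None if unlinked."""
    return GLOSS_TO_CONCEPT.get(normalize(gloss))


for raw in ["  The Hand ", "HAND (body part)", "arm", "elbow"]:
    print(repr(raw), "->", link(raw))
```

Glosses that cannot be linked come back as None, which is the situation described for many glosses in practice: they need manual inspection before they can be compared across sources.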

As a last problem, when having assembled data that show semantic change processes across a sufficiently large sample of languages and concepts, there is the problem of analyzing the data themselves. While it seems obvious to identify cross-linguistic tendencies by looking for examples that occur in different language families and different parts of the world, it is not always easy to distinguish between the four major reasons for similarities among languages, namely: (1) coincidence, (2) universal tendencies, (3) inheritance, and (4) contact (List 2019). The only way to avoid being forced to make use of potentially unreliable statistics, to squeeze out the juice of small datasets, is to work on a sufficiently large coverage of data from as many language families and locations as possible. But given that there are no automated ways to infer directed semantic change processes across linguistic datasets, it is unlikely that a collection of data acquired from the literature alone will reach the critical mass needed for such an endeavor.

Traditional approaches

Apart from the above-mentioned work by Blank (1997), which is, unfortunately, rarely mentioned in the literature (potentially because it is written in German), there is an often-cited paper by Wilkins (1996), and preliminary work on directionality (Urban 2011). However, the attempt that addresses the problem most closely is the Database of Semantic Shifts (Zalizniak et al. 2012), which, according to the most recent information on the website, was established in 2002 and has been continuously updated since then.

The basic idea, as far as I understand the principle of the database, is to collect semantic shifts attested in the literature, and to note the type of evidence, as well as the direction, where it is known. The resource is unique: nobody else has tried to establish a collection of semantic shifts attested in the literature, and it is therefore incredibly valuable. However, it also shows what problems we face when trying to establish a typology of semantic shifts.

Apart from the typical technical problems found in many projects shared on the web (missing download access to all data underlying the website, missing deposit of versions on public repositories, missing versioning), the greatest problem of the project is that no apparent attempt was undertaken to standardize the elicitation glosses. This became specifically obvious when we tried to link an older version of the database, which is now no longer available, to our Concepticon project. In the end, I selected some 870 concepts from the database, which were supported by more datapoints, but had to ignore more than 1500 remaining elicitation glosses, since it was not possible to infer in reasonable time what the underlying concepts denote, not to speak of obvious cases where the same concept was denoted by slightly different elicitation glosses. As far as I can tell, this has not changed much with the most recent update of the database, which was published some time earlier this year.

Apart from the afore-mentioned problems of missing standardization of elicitation glosses, the database does not seem to annotate which type of evidence has been used to establish a given semantic shift. An even more important problem, which is typical of almost all attempts to establish databases of change in the field of diversity linguistics, is that the database only shows what has changed, while nothing can be found on what has stayed the same. A true typology of change, however, must show what has not changed along with showing what has changed. As a result, any attempt to pick proposed changes from the literature alone will fail to offer a true typology, that is, a collection of universal tendencies.
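Why the unchanged cases matter can be shown with a small Python sketch. The observations are invented toy data, and the function name shift_rates is my own; the point is only that a rate of change needs a denominator that includes the words that kept their meaning.

```python
from collections import Counter

# Invented toy observations: (meaning of the ancestral word, meaning of its reflex).
# Pairs with identical meanings record that nothing changed.
observations = [
    ("HAND", "HAND"), ("HAND", "HAND"), ("HAND", "HAND"), ("HAND", "ARM"),
    ("ARM", "ARM"), ("ARM", "HAND"),
]


def shift_rates(pairs):
    """Relative frequency of each outcome, per source meaning."""
    totals = Counter(source for source, _ in pairs)
    transitions = Counter(pairs)
    return {
        (source, target): count / totals[source]
        for (source, target), count in transitions.items()
    }


rates = shift_rates(observations)
for (source, target), rate in sorted(rates.items()):
    print(f"{source} -> {target}: {rate:.2f}")
```

If only the two attested shifts were recorded, HAND to ARM and ARM to HAND would look equally common (one case each). With the retentions included, the toy data give quite different rates for the two directions.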

To be fair: the Database of Semantic Shifts is by no means claiming to do this. What it offers is a collection of semantic change phenomena discussed in the linguistic literature. This itself is an extremely valuable, and extremely tedious, enterprise. While I wish that the authors open their data, versionize it, standardize the elicitation glosses, and also host it on stable public archives, to avoid what happened in the past (that people quote versions of the data which no longer exist), and to open the data for quantitative analyses, I deeply appreciate the attempt to approach the problem of semantic change from an empirical, data-driven perspective. To address the problem of establishing a typology of semantic shift, however, I think that we need to start thinking beyond collecting what has been stated in the literature.

Computational approaches

As a first computational approach that comes in some way close to a typology of semantic shifts, there is the Database of Cross-Linguistic Colexifications (List et al. 2018), which was originally launched in 2014, and received a major update in 2018 (see List et al. 2018b for details). This CLICS database, which I have mentioned several times in the past, does not show diachronic data, ie. data on semantic change phenomena, but lists automatically detectable polysemies and homophonies (also called colexifications), instead.

While the approach taken by the Database of Semantic Shifts is bottom-up in some sense, as the authors start from the literature and add those concepts that are discussed there, CLICS is top-down, as it starts from a list of concepts (reflected as standardized Concepticon concept sets) and then checks which languages express more than one concept by one and the same word form.
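The top-down check can be sketched in a few lines of Python. This is only an illustration of the principle, not the CLICS code: the wordlist, the language names, and the word forms are invented.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy wordlist: (language, concept, word form). Not real CLICS data.
wordlist = [
    ("LangA", "HAND", "tak"),
    ("LangA", "ARM", "tak"),
    ("LangA", "FOOT", "nol"),
    ("LangB", "HAND", "mip"),
    ("LangB", "ARM", "bru"),
    ("LangC", "HAND", "se"),
    ("LangC", "ARM", "se"),
]


def colexifications(rows):
    """Map each concept pair to the set of languages expressing both with one form."""
    concepts_by_form = defaultdict(set)
    for language, concept, form in rows:
        concepts_by_form[(language, form)].add(concept)

    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)
    return dict(languages_by_pair)


result = colexifications(wordlist)
for pair, languages in result.items():
    print(pair, sorted(languages))
```

Note that the output is purely synchronic: it tells us that ARM and HAND share a form in two of the toy languages, but nothing about which meaning came first.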

The advantages of top-down approaches are that much more data can be processed, and that one can easily derive a balanced sample in which the same concepts are compared for as many languages as possible. The disadvantage is that such a database will ignore certain concepts a priori, if they do not occur in the data.

Since CLICS lists synchronic patterns without further interpreting them, the database is potentially interesting for those who want to work on semantic change, but it does not help solve the problem of establishing a typology of semantic change itself. In order to achieve this, one would have to go through all attested polysemies in the database and investigate them, searching for potential hints on directions.

A potential way to infer directions for semantic shifts is presented by Dellert (2016), who applies causal inference techniques on polysemy networks to address this task. The problem, as far as I understand the techniques, is that the currently available polysemy databases barely offer enough information needed for these kinds of analyses. Furthermore, it would also be important to see how well the method actually performs in comparison to what we think we already know about the major patterns of semantic change.

Initial ideas for improvement

There does not seem to be a practical way to address our problem by means of computational solutions alone. What we need, instead, is a computer-assisted strategy that starts from the base of a thorough investigation of the criteria that scholars use to infer directions of semantic change from linguistic data. Once these criteria are settled, more or less, one would need to think of ways to operationalize them, in order to allow scholars to work with concrete etymological data, ideally comprising standardized word-lists for different language families, and to annotate them as closely as possible.

Ideally, scholars would propose larger etymological datasets in which they reconstruct whole language families, proposing semantic reconstructions for proto-forms. These would already contain the proposed directions of semantic change, and they would also automatically show where change does not happen. Since we currently lack automated workflows that fully account for this level of detail, one could start by applying methods for cognate detection across semantic slots (cross-semantic cognate detection), which would yield valuable data on semantic change processes, without providing directions, and then add the directional information based on the principles that scholars use in their reconstruction methodology.


Given the recent advances in detection of sound correspondence patterns, sequence comparison, and etymological annotation in the field of computational historical linguistics, it seems perfectly feasible to work on detailed etymological datasets of the languages of the world, in which all information required to derive a typology of semantic change is transparently available. The problem is, however, that it would still take a lot of time to actually analyze and annotate these data, and to find enough scholars who would agree to carry out linguistic reconstruction in a similar way, using transparent tools rather than convenient shortcuts.


Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen:Niemeyer.

Blasi, Damián E. and Steven Moran and Scott R. Moisik and Paul Widmer and Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

List, Johann-Mattis and Simon Greenhill and Cormac Anderson and Thomas Mayer and Tiago Tresoldi and Robert Forkel (2018) CLICS: Database of Cross-Linguistic Colexifications. Version 2.0. Max Planck Institute for the Science of Human History. Jena:

List, Johann-Mattis and Simon Greenhill and Christoph Rzymski and Nathanael Schweikhard and Robert Forkel (2019) Concepticon. A resource for the linking of concept lists (Version 2.1.0). Max Planck Institute for the Science of Human History. Jena:

Dellert, Johannes and Buch, Armin (2016) Using computational criteria to extract large Swadesh Lists for lexicostatistics. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2018) Towards a history of concept list compilation in historical linguistics. History and Philosophy of the Language Sciences 5.10: 1-14.

List, Johann-Mattis (2019) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

Urban, Matthias (2011) Asymmetries in overt marking and directionality in semantic change. Journal of Historical Linguistics 1.1: 3-47.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, pp. 264-304.

Zalizniak, Anna A. and Bulakh, M. and Ganenkov, Dimitrij and Gruntov, Ilya and Maisak, Timur and Russo, Maxim (2012) The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics 50.3: 633-669.

Monday, September 23, 2019

Where are we, 60 years after Hennig?

Phylogenetic analysis is common in the modern study of evolutionary biology, and yet it often seems to be a poorly understood tool. Indeed, it seems to often be seen as nothing more than a tool, and one for which one does not need much expertise.

For example, we do not need to spend much time on Twitter to realize that many evolutionary biologists do not understand even the most basic things about the difference between taxa and characters. Taxa are often referred to as "primitive", particularly by people studying the so-called Origin of Life. However, taxa themselves cannot be either primitive or derived; instead, they are composed of mixtures of primitive and derived characters — they have derived characters relative to their ancestors and primitive ones compared to their descendants.

The logical relationship between common ancestors and monophyletic / paraphyletic groups is also apparently unknown to many evolutionary biologists. There is endless debate about whether the Last Universal Common Ancestor was a Bacterium or an Archaean when, of course, it cannot be either. That is, we sample contemporary organisms for analysis, which come from particular taxonomic groupings, and from these data we infer hypothetical ancestors. However, those ancestors cannot be part of the same taxonomic group as their descendants unless that taxonomic group is monophyletic.

This is all basic stuff, first expounded in the 1950s by Willi Hennig. So, why do so many people apparently still not know any of this 60 years later? I suspect that somewhere along the line the molecular geneticists got the idea that Hennig was part of Parsimony Analysis, and since they adopted Likelihood Analysis, instead, he is thus irrelevant.

However, Hennigian Logic underlies all phylogenetic analyses, of whatever mathematical ilk. All such analyses are based on the search for unique shared derived characters, which is the only basis on which we can objectively produce a rooted phylogenetic tree or network.

In the molecular world, many analysis techniques are based on analyzing the similarity of the taxa. However, similarity is only relevant if it is based on shared derived characters — if it is based on shared primitive characters then it cannot reliably detect phylogenetic history. This was Hennig's basic insight, and it is as true today as it was 60 years ago.

The confusing thing here is that most similarity among taxa will be based on both primitive and derived characters. This means that some of the analysis output reflects phylogenetic history and some does not. The further we go back in evolutionary time, the more likely it is that similarity reflects shared primitive characters rather than shared derived characters. This simple limitation seems to be poorly understood by evolutionary biologists.
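A toy example in Python may help to show the difference. The character matrix is invented, and it assumes that we already know which state is primitive (0) and which is derived (1), for example from a hypothetical ancestor with state 0 everywhere.

```python
from itertools import combinations

# Invented character matrix: 0 = primitive state, 1 = derived state.
# Taxa A and B share one derived character; B has also changed a lot on its own.
taxa = {
    "A": [1, 0, 0, 0, 0, 0],
    "B": [1, 1, 1, 1, 1, 0],
    "C": [0, 0, 0, 0, 0, 1],
}


def overall_similarity(x, y):
    """Number of characters with identical states, primitive or derived."""
    return sum(a == b for a, b in zip(x, y))


def shared_derived(x, y):
    """Number of characters where both taxa show the derived state."""
    return sum(a == 1 and b == 1 for a, b in zip(x, y))


for p, q in combinations(sorted(taxa), 2):
    print(p, q,
          "similarity:", overall_similarity(taxa[p], taxa[q]),
          "shared derived:", shared_derived(taxa[p], taxa[q]))
```

Overall similarity pairs A with C, but only because both have retained primitive states that B has lost. The single shared derived character pairs A with B, which is the grouping that Hennig's logic supports.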

Perhaps it would be a good idea if university courses in molecular evolutionary biology actually taught phylogenetics as a topic of its own, rather than as an incidental tool for studying evolution. After all, there is more to getting a scientific answer than feeding data into a computer program.

Obviously, I may be wrong in painting my picture with such a broad brush. If so, then it must be that the people I have described seem to have gathered on Twitter, like birds of a feather.

And yet, I see the same thing in the literature, as well. Consider this recent paper:
A polyploid admixed origin of beer yeasts derived from European and Asian wine populations. Justin C. Fay, Ping Liu, Giang T. Ong, Maitreya J. Dunham, Gareth A. Cromie, Eric W. Jeffery, Catherine L. Ludlow, Aimée M. Dudley. 2019. PLoS Biology 17(3): e3000147.
This seems to be quite an interesting study of a reticulate evolutionary history involving budding yeasts, from which the authors conclude that:
The four beer populations are most closely related to the Europe/wine population. However, the admixture graph also showed strong support for two episodes of gene flow into the beer lineages resulting in 40% to 42% admixture with the Asia/sake population.

However, they then undo all of their good work with this sentence:
The inferred admixture graph grouped the four beer populations together, with the lager and two ale populations being derived from the lineage leading to the Beer/baking population.
Nonsense! Neither lineage derives from the other, but instead they both derive from a common ancestor. This is like saying that I derive from the lineage leading to my younger brother, when in fact we both derive from the same parents. I doubt that the authors believe the latter idea, so why do they apparently believe the former?

That is a little test that you can all use when writing about phylogenetics. If your words don't make sense for a family history, then they don't make sense for phylogenetics either.

Monday, September 16, 2019

A network of happiness, by ranks

This is a joint post by David Morrison and Guido Grimm

Over a year ago, we showed a network relating to the World Happiness Report 2018 based on the variables used for explaining why people in some countries report themselves to be happier than in other countries. A new WHR report is out for 2019, warranting a new network.

The 2019 Report describes itself as:
a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.
For our purposes, we will simply focus on the happiness scores themselves. So, this time we will base our analysis on the country rankings for the four measures of subjective well-being:
  • Cantril Ladder life-evaluation question in the Gallup World Poll — asks the survey respondents to place the status of their lives on a “ladder” scale ranging from 0 to 10, where 0 means the worst possible life and 10 the best possible life
  • Ladder standard deviation — provides a measure of happiness inequality across the country
  • Positive affect — comprises the average frequency of happiness, laughter and enjoyment on the previous day to the survey (scaled from 0 to 1)
  • Negative affect — comprises the average frequency of worry, sadness and anger on the previous day to the survey (scaled from 0 to 1)
As expected, not a lot has changed between 2018 and 2019. The first graph shows the comparison of the Cantril Ladder scores (the principal happiness measure) for those 153 countries that appear in both reports. Each point represents one country, with the color coding indicating the geographical area (as listed in the network below).

Only three countries (as labeled) show large differences, with Malaysia becoming less happy, and two small African countries improving. As also expected, the European countries (green) tend to be at the top, and the African countries (grey) dominate the bottom scores.

Finland is still ranked #1, with even happier people than in 2018's report. New in the top-10 of the happiest countries is Austria (last year's #12), which took the place of Australia (now #11). At the other end, South Sudan went down from 3.3 to 2.9; this is not really a good start for the youngest state in the world. New to the lowest-ranking ten are Botswana (−0.1, down two places) and Afghanistan (−0.4, down 9).

A network analysis

The four measures of subjective well-being do not necessarily agree with each other, since they measure different things. To get an overview of all four happiness variables simultaneously, we can use a phylogenetic network as a form of exploratory data analysis. [Technical details of our analysis: Qatar was deleted because it has too many missing values. The data used were the simple rankings of the countries for each of the four variables. The Manhattan distance was then calculated; the distances have been displayed as a neighbor-net splits graph.]
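The technical details above can be sketched in a few lines of Python. This is only a minimal illustration: the three countries and their scores are made up, only three variables are used, and the neighbor-net step (which needs dedicated software) is not shown.

```python
def ranks(values):
    """Rank values from 1 (highest value) downward; ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def manhattan(a, b):
    """Manhattan distance: sum of absolute differences between two rank profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Made-up scores for three hypothetical countries (not taken from the report).
scores = {
    "A": [7.8, 0.80, 0.20],
    "B": [7.6, 0.75, 0.25],
    "C": [3.0, 0.50, 0.40],
}
names = list(scores)
n_vars = 3

# Rank the countries separately for each variable.
rank_columns = [ranks([scores[n][v] for n in names]) for v in range(n_vars)]
# Collect each country's ranks into one profile.
rank_rows = {n: [col[i] for col in rank_columns] for i, n in enumerate(names)}

# Pairwise distances between the rank profiles.
distances = {
    (p, q): manhattan(rank_rows[p], rank_rows[q])
    for p in names for q in names if p < q
}
print(distances)
```

The resulting distance table is the kind of input that a neighbor-net analysis takes.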

In the network (shown below), the spatial relationship of the points contains the summary information — points near each other in the network are similar to each other based on the data variables, and the further apart they are then the less similar they are. The points are color-coded based on major geographic regions; and the size of the points represents the Cantril Ladder score. We have added some annotations for the major network groups, indicating which geographical regions are included — these groups are the major happiness groupings.

The rank-based network for 2019 looks quite different from the one based on the explanatory parameters in 2018. Let us have a short look at the clusters, as annotated in the graph.

Cluster 1: The happiest. This includes the welfare states of north-western and central Europe (score > 6.7), as well as Australia, Canada and New Zealand (~7.3), Taiwan (the 25th happiest country in the world, 6.4) and Singapore (#34 with 6.3). For both the positive and negative measures of happiness, the countries rank typically in the top 50, with Czechia ranking lowest regarding positive affects (#74), while the people in Singapore (#1) and Taiwan apparently suffer the fewest negative affects (#2).

Cluster 2: Quite happy. This includes countries like France, with 6.6 making it the happiest one of the group, plus countries along the southern shore of the Baltic Sea, as well as Japan and Hong Kong, but also quite different countries from western Asia such as Kyrgyzstan and Turkmenistan, and Vietnam, the least happy (5.1) of the group. Common to all of them is that they rank in the top third of the standard deviation of the Cantril ladder scores, i.e. their people are equally happy across each country. Towards the right of the group, bridging to Cluster 3, we have countries that rank in the bottom third of positive affects. Potential causes are the high levels of perceived corruption, or the lack of social support and generosity, as in the case of Turkmenistan (#147 in social support, #153 in generosity).

Cluster 3: Not so happy — an Old World group of the lower half (Cantril scores between 5.2, Algeria, and 3.4, Rwanda) that are either doing a bit better than other, equally (un)happy countries regarding positive affects (Myanmar, Madagascar, Rwanda) or negative affects (e.g. Georgia, Ukraine), and are in the top-half when it comes to the SD.

Cluster 4: Generally unhappy — this collects most of the countries of the Sub-saharan cluster of 2018 with Cantril scores ≤ 5, including three of the (still) unhappiest countries in the world: war-ridden Syria, the Central African Republic, and South Sudan, which rank in the bottom-half of all happiness rankings. When it comes to explanations, the ranking table is of little use: Chad, for example, ranks 2nd regarding perceived corruption, and the Central African Republic, generally regarded as a failed state, ranks 16th, and 14th regarding freedom — i.e. it seems to have values here similar to those of the happiest bunch (Cluster 1).

Cluster 5: Pretty unhappy — this includes Asian and African countries that are not much happier than those of Cluster 4 but which rank high when only looking at positive affects. The reasons may include low levels of perceived corruption but also generosity, at least in the case of Bhutan (#25, #13) and South Africa (#24/#1), the latter being the most generous country in the world (something Guido agrees with based on personal experience).

Cluster 6: Partially unhappy — is a very heterogeneous cluster, when we look at the Cantril scores ranging from 7.2 for Costa Rica (#12), a score close to the Top-10 of Cluster 1, to 4.7 for Somalia (#112). Effectively, it collects all states that don't fit ranking-pattern-wise in any of the other clusters. For example, the U.S. (6.9, #19) and U.A.E. (6.8, #21) plot close to each other in the network because both rank between 35 and 70 on the other three variables, ie. lower than the countries of Cluster 1 with not much higher Cantril scores. Mexico, by the way (6.6, #23), performs similarly to the U.S. but ranks much higher regarding positive affects. The latter seems to be a general trend within the other states of the New World in this cluster.

Cluster 7: Really not happy — also covers a wide range, from a Cantril score of 6.0 (Kuwait, #51 in the world) to 3.2 (Afghanistan, #154). It includes the remainder of the Sub-saharan countries, most of the countries in the Arab world, and the unhappy countries within and outside the EU (Portugal, Greece, Serbia, Bosnia & Herzegovina). These are countries that usually rank in the lower half or bottom third regarding all four included variables.

Cluster 8: Increasingly unhappy — these countries bridge between Clusters 1 and 7, starting (upper left in the graph) with Russia (#68, top 10 regarding negative affects) and ending with the Democratic Republic of Congo (#127, Congo Kinshasa in the WHR dataset, ranking like a Cluster 7 country). In between are pretty happy countries such as Israel (#13) and unhappy EU members (Bulgaria, #97). The reason Israel is not in Cluster 1 is its very low ranking regarding positive affects (#104) and its not too high placement when it comes to negative affects (#69), but in contrast to the U.S. it ranks high when it comes to the SD of the Cantril scores — that is, the USA has a great diversity regarding happiness, from billionaires to the very poor, whereas the peoples of most countries are more equally happy. Other very-high ranking countries regarding the latter are Bulgaria, the least-happy country of the EU, and Mongolia.

Monday, September 9, 2019

Lifestyle habits in the states of the USA

People throughout the western world are constantly being reminded that modern lifestyles have many unhealthy aspects. This is particularly true of the United States of America, where obesity (degree of over-weight) is now officially considered to be a medical epidemic. That is, it is a disease, but it is not caused by some organism, such as a bacterium or virus, but is instead a lifestyle disease — it can be cured and prevented only by changing the person's lifestyle.

The Centers for Disease Control and Prevention (CDC), in the USA, publish a range of data collected in their surveys — Nutrition, Physical Activity, and Obesity: Data, Trends and Maps. Their current data include information up to 2017.

These data are presented separately for each state. The data collection includes:
  • Obesity — % of adults who are obese, as defined by the Body Mass Index (>30 is obese)
  • Lack of exercise — % of adults reporting no physical leisure activity; % of adolescents watching 3 or more hours of television each school day
  • Unhealthy eating — % of adults eating less than one fruit per day; % of adolescents drinking soda / pop at least once per day.
The CDC show maps and graphs for these data variables separately, but there is no overall picture of the data collection as a whole. This would be interesting, because it would show us which states have the biggest general problem, in the sense that they fare badly on all or most of the lifestyle measurements. So, let's use a network to produce such a picture.

For our purposes here, I have looked at the three sets of data for adults only. The network will thus show states that have lots of obese adults who get little exercise and do not eat many fruits and vegetables.

As usual for this blog, the network analysis is a form of exploratory data analysis. The data are the percentages of people in each state that fit into the three lifestyle characteristics defined above (obese, no exercise, unhealthy eating). For the network analysis, I calculated the similarity of the states using the Manhattan distance; and a Neighbor-net analysis was then used to display the between-state similarities.

Network of the lifestyle habits in the various US states

The resulting network is shown in the graph. States that are closely connected in the network are similar to each other based on their adult lifestyles, and those states that are further apart are progressively more different from each other. In this case, the main pattern is a gradient from the healthiest states at the top of the network to the most unhealthy at the bottom.

Note that there are seven states separated from the rest at the bottom of the network. These states have far more people with unhealthy lifestyles than do the other US states. In other words, the lifestyle epidemic is at its worst here.

In the top-middle of the network there is a partial separation of states at the left from those at the right (there is no such separation elsewhere in the network). The states at the left are those that have relatively low obesity levels but still fare worse on the other two criteria (exercise and eating). For example, New York and New Jersey have the same sorts of eating and exercise habits as Pennsylvania and Maryland but their obesity levels are lower.

It is clear that the network relates closely to the standard five geographical regions of the USA, as shown by the network colors. The healthiest states are mostly from the Northeast (red), except for Delaware, while the unhealthiest states are from the Southeast (orange), with Florida, Virginia and North Carolina doing much better than the others. The Midwest states are scattered along the middle-right of the network, indicating a middling status. The Southwest states are mostly at the middle-left of the network.

The biggest exception to these regional clusterings is the state of Oklahoma. This is in the bottom (unhealthiest) network group, far from the other Southwest states. This pattern occurs across all three characteristics; for example, Oklahoma has the second-lowest intake of fruit (nearly half the adults don't eat fruit), second only to Mississippi.

These data have also been analyzed by Consumer Protect, who offer some further commentary.


This analysis highlights those seven US states that have quantitatively the worst lifestyles in the country, and where the lifestyle obesity epidemic is thus at its worst.

These poor lifestyles have a dramatic impact on longevity — people cannot expect to live very long if they live an unhealthy lifestyle. The key concept here is the difference between life expectancy (how long people live, on average) and healthy life expectancy (how long people remain actively healthy, on average). This topic is discussed by The US Burden of Disease Collaborators (2018. The state of US health, 1990-2016. Journal of the American Medical Association 319: 1444-1472).

In that paper, the data for the USA show that, for most states, healthy life expectancy is c. 11 years less than the total life expectancy, on average. This big difference is due to unhealthy lifestyles, which eventually catch up with you. As a simple example, the seven states at the bottom of the network are ranked 44-51 in terms of healthy longevity, at least 2.5 years shorter than the national average. (Note: Tennessee is ranked 45th.)

You can see why the CDC is concerned, and why there is considered to be an epidemic.


Some of the seven states highlighted here have other lifestyle problems, as well. For example, if you consult Places in America with the highest STD rates, you will find that they are listed as five of the top ten: 2: Mississippi, 3: Louisiana, 6: Alabama, 9: Arkansas, 10: Oklahoma, 31: Kentucky, and 50: West Virginia.

Monday, September 2, 2019

Losing information in phylogenetic consensus

Any summary loses information, by definition. That is, a summary is used to extract the "main" information from a larger set of information. Exactly how "main" is defined and detected varies from case to case, and some summary methods work better for certain purposes than for others.

A thought experiment that I used to play with my experimental-design students was to imagine that they were all given the same scientific publication, and were asked to provide an abstract of it. Our obvious expectation is that there would be a lot of similarity among those abstracts, which would represent the "important points" from the original — that is, those points of most interest to the majority of the students. However, there would also be differences among the abstracts, as each student would find different points that they think should also be included in the summary. In one sense, the worst abstract would be the one that has the least in common with the other abstracts, since it would be summarizing things that are of less general interest.

The same concept applies to mathematical summaries (aka "averages"), such as the mean, median and mode, which reduce the central location of a dataset to a single number. It also applies to summaries of the variation in a dataset, such as the variance and inter-quartile range. (Note that a confidence interval or standard error is an indication of the precision of the estimate of the central location, not a summary of the dataset variation — this is a point that seems to confuse many people.)

So, it is easy to summarize data and thereby lose important information. For example, if my dataset has two exactly opposing time patterns, then the data average will appear to remain constant through time. I might thus conclude from the average that "nothing is happening" through time when, in fact, two things are happening. I will never find out about my mistake by simply looking at the data summary — I also need to look at the original data patterns.
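The example of two opposing time patterns can be made concrete with a few invented numbers. In this minimal sketch, the two series and their values are purely hypothetical; one rises steadily while the other falls at the same rate.

```python
from statistics import mean

years = list(range(2010, 2020))

# Two invented series with exactly opposing trends.
group_a = [10 + 2 * i for i in range(len(years))]  # rises by 2 per year
group_b = [28 - 2 * i for i in range(len(years))]  # falls by 2 per year

# The yearly average of the two series hides both trends.
yearly_average = [mean(pair) for pair in zip(group_a, group_b)]

for year, a, b, avg in zip(years, group_a, group_b, yearly_average):
    print(year, a, b, avg)
```

Every yearly average is the same number, even though each series changes by 18 units over the decade, so only the original series reveal that anything is happening.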

So, what has this got to do with phylogenetics? Well, a phylogenetic tree is a summary of a dataset, and that summary is, by definition, missing some of the patterns in the data. These patterns might be of interest to me, if I knew about them.

Even worse, phylogenetic data analyses often produce multiple phylogenetic trees, all of which are mathematically equal as summaries of the data. What are we then to do?

One thing that people often do is to compute a Consensus Tree (eg. the majority consensus), which is a summary of the summaries — that is, it is a tree that summarizes the other trees. It would hardly be surprising if that consensus tree is an inadequate summary of the original data. In spite of this, how often do you see published papers that contain any evaluation of their consensus tree as a summary of the original data?

This issue has recently been addressed in a paper uploaded to bioRxiv:
Anti-consensus: detecting trees that have an evolutionary signal that is lost in consensus
Daniel H. Huson, Benjamin Albrecht, Sascha Patz, Mike Steel
Not unexpectedly, given the background of the authors, they explore this issue in the context of phylogenetic networks. As they note:
A consensus tree, such as the majority consensus, is based on the set of all splits that are present in more than 50% of the input trees. A consensus network is obtained by lowering the threshold and considering all splits that are contained in 10% of the trees, say, and then computing the corresponding splits network. By construction and in practice, a consensus network usually shows the majority tree, extended by a number of rectangles that represent local rearrangements around internal nodes of the consensus tree. This may lead to the false conclusion that the input trees do not differ in a significant way because "even a phylogenetic network" does not display any large discrepancies.
That is, sometimes authors do attempt to evaluate their consensus tree, by looking at a network. However, even the network may turn out to be inadequate, because a phylogenetic tree is a much more complex summary than is a simple mathematical average. This is sad, of course.
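The thresholding idea in the quote can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: each tree is reduced to a set of splits, each split is written as the group of taxa on the side that excludes taxon "e", and the 20 input trees are invented. The anti-consensus method itself is not shown.

```python
from collections import Counter


def split(*taxa):
    """One side of a split, stored as a hashable set of taxon names."""
    return frozenset(taxa)


def consensus_splits(trees, threshold):
    """Return the splits present in more than `threshold` (a fraction) of the trees."""
    counts = Counter(s for tree in trees for s in tree)
    return {s for s, c in counts.items() if c / len(trees) > threshold}


# Three invented tree shapes on taxa a..e, each given by its non-trivial splits.
tree_1 = {split("a", "b"), split("a", "b", "c")}
tree_2 = {split("a", "b"), split("a", "b", "d")}
tree_3 = {split("a", "c"), split("a", "c", "d")}
trees = [tree_1] * 12 + [tree_2] * 7 + [tree_3] * 1

majority = consensus_splits(trees, 0.5)  # splits of the majority consensus tree
network = consensus_splits(trees, 0.1)   # splits kept for a consensus network

print(sorted(sorted(s) for s in majority))
print(sorted(sorted(s) for s in network))
```

In this toy data the lower threshold adds the split {a, b, d}, which conflicts with {a, b, c} and so cannot sit on the same tree; the splits seen in only one tree stay invisible at both thresholds.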

So, the new suggestion by the authors is:
To harness the full potential of a phylogenetic network, we introduce the new concept of an anti-consensus network that aims at representing the largest interesting discrepancies found in a set of trees.
This should reveal multiple large patterns, if they exist in the original dataset. Phylogenetic analyses keep moving forward, fortunately.

Monday, August 26, 2019

Statistical proof of language relatedness (Open problems in computational diversity linguistics 7)

The more I advance with the problems I want to present during this year, the more I have to admit to myself, sometimes, that the problem I planned to present is so difficult that I find it even hard to simply present the state-of-the-art. The problem of this month, problem number 7 in my list, is such an example — proving that two or more languages are "genetically related", as historical linguists (incorrectly) tend to say, is not only hard, it is also extremely difficult even to summarize the topic properly.

Typically, colleagues start with the famous but also not very helpful quote of Sir William Jones, who delivered a report to the British Indian Company, thereby mentioning that there might be a deeper relationship between Sanskrit and some European languages (like Greek and Latin). The article, titled The third anniversary discourse, delivered 2 February, 1786, by the president (published in 1798), has by now been quoted so many times that it is better to avoid quoting it another time (but you will find the full quote with references in my reference library).

In contrast to later scholars like Jacob Grimm and Rasmus Rask, however, Jones does not prove anything, he just states an opinion. The reason why scholars like to quote him, is that he seems to talk about probability, since he mentions the impossibility that the resemblances between the languages he observed could have arisen by chance. Since a great deal of the discussion about language relationship centers around the question how chance could be controlled for, it is a welcome quote from the olden times to be used when writing a paper on statistics or quantitative methods. But this does not necessarily mean that Jones really knew what he was writing about, as one can read in detail in the very interesting book by Campbell and Poser (2008), which deals at length with the supposedly overrated role that William Jones played in the early history of historical linguistics.

Macro Families

Returning to the topic at hand: the regularity of sound change and the possibility of proving language relationship in some cases was an unexpected discovery of some linguists during the early 19th century, but what many linguists have been dreaming about since is to expand their methods to such a degree that even deeper relationships could be proven. While the evidence for the relationship of the core Indo-European languages was more or less convincing by itself (as rightfully pointed out by Nichols 1996), scholars have proposed many suggestions of relationship, many of which are no longer followed by the communis opinio. Among these long-range proposals for deep phylogenetic relations are theories that further unite fully established language families, proposing large macro-families — such as Nostratic (uniting Semitic, Indo-European, and many more, depending on the respective version), Altaic (uniting Turkic, Mongolic, Tungusic, Japanese, and Korean, etc.), or Dene-Caucasian (uniting Sino-Tibetan, North Caucasian, and Na-Dene), which span incredibly large areas on earth.

Given that the majority of scholars mistrust these new and risky proposals, and that even scholars who work in the field of long-range comparison often disagree with each other, it is not surprising that at least some linguists became interested in the question of how long-range relationship could be proven in the end. One of the first attempts in this regard was presented by Aharon Dolgopolsky, a convinced Nostratic linguist, who presented a first, very interesting, heuristic procedure to determine deep cognates and deep language relationships, by breaking sounds down to more abstract classes, in order to address the problem that words often no longer look similar due to sound change (Dolgopolsky 1964).

Why it is hard to prove language relationship

Dolgopolsky did not use any statistics to prove his approach, but he emphasized the probabilistic aspect of his endeavor, and derived his "consonant classes" or "sound classes" as well as his very short list of stable concepts from the empirical investigation of a large corpus. The core of his approach, to fix a list of semantic items, presumably "stable" (i.e. slowly changing with respect to semantic shift), and to reduce the complexity of phonetic transcriptions to a core meta-alphabet, has been the basis of many follow-up studies that follow an explicitly quantitative (or statistical) approach.

As of now, most scholars, be they classical or computational, agree that the first stage of historical language comparison consists of the proof that the languages one wants to investigate are, indeed, historically related to each other (for the underlying workflow of historical language comparison, see Ross and Durie). In a blogpost published much earlier (Monogenesis, polygenesis, and militant agnosticism), I have already pointed to this problem, as it is quite different from biology, where independent evolution of life is usually not assumed by scholars, while linguistic research can never really exclude it.

While proving language relationship of closely related languages is often a complete no-brainer, it becomes especially hard when exceeding some critical time depth. Where this time depth lies is not clear by now, but based on our observations regarding the pace at which languages replace existing words with new ones, borrow words, or lose and build grammatical structures, it is clear that it is theoretically possible that a language group could have lost all hints of its ancestry after 5,000 to 10,000 years. Luckily, what is theoretically possible for one language does not necessarily happen with all languages in a given sample, and as a result, we still find enough signal for ancestral languages in quite a few language families of the world, which allows us to draw conclusions that go back about 10,000 years in most cases, if not even deeper in some cases.

Traditional insights into the proof of language relationships

The difficulty of the task is probably obvious without further explanation — the more material a language acquires from its neighbors, and the more it loses or modifies the material it inherited from its ancestors, the more difficult it is for the experts to find the evidence that convinces their colleagues about the phylogenetic affiliation of such a language. While regular sound changes can easily convince people of phylogenetic relationship, the evidence that scholars propose for deeper linguistic groupings is rarely large enough to establish correspondences.

As a result, scholars often resort to other types of evidence, such as certain grammatical peculiarities, certain similarities in the pronunciation of certain words, or external findings (e.g., from archaeology). As Handel (2008) points out, for example, a good indicator of a Sino-Tibetan language is that its words for five, I, and fish start with similar initial sounds and contain a similar vowel (compare Chinese 五, 我, and 魚, going back to MC readings ŋjuX, ŋaX, and ŋjo). While these arguments are often intuitively very convincing (and may also be statistically convincing, as Nichols 1996 argues), this kind of evidence, as mentioned by Handel, is extremely difficult to detect, since the commonalities can be found in so many different regions of a human language system.

While linguists also use sound correspondences to prove and establish relationship, there are no convincing cases known to me in which sound correspondences were employed to prove relationships beyond a certain time depth. One can compare this endeavor to some degree with the work of police commissars who have to find a murderer, and can do so easily if the person responsible left DNA at the spot, while they have to spend many nights in pubs, drinking cheap beer and smoking bad cigarettes, in order to wait for the spark of inspiration that delivers the ultimate proof not based on DNA.

Computational and statistical approaches

Up to now, no computational methods are available to find signals of the kind presented by Handel for Sino-Tibetan, i.e., a general-purpose heuristic to search for what Nichols (1996) calls individual-identifying evidence. So, computational and statistical methods have so far been based on very schematic approaches, which are almost exclusively based on wordlists. A wordlist can hereby be thought of as a simple table with a certain number of concepts (arm, hand, stone, cinema) in the first column, and translation equivalents for these concepts being listed for several different languages in the following columns (see List 2014: 22-24). This format can of course be enhanced (Forkel et al. 2018), but it represents the standard way in which many historical linguists still prepare and curate their data.

What scholars now try to do is to see if they can find some kind of signal in the data that they think would be unlikely to be detected by chance. In general, there are two ways that scholars have explored so far. In the approach proposed by Ringe (1992), the signals that are tested for in the wordlists are sound correspondences, and we can therefore call these approaches correspondence-based approaches to prove language relationship. In the approach of Baxter and Manaster Ramer (2000), which follows the original idea of Dolgopolsky, the data are converted to sound classes first, and cognacy is assumed for words with identical sound classes. Sound-class-based approaches again try to illustrate that the matches that can be identified are unlikely to be due to chance.
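The sound-class idea can be sketched as follows. The class table here is a hypothetical and heavily simplified one, only loosely in the spirit of Dolgopolsky's classes and not the published inventory; the words are rough, made-up ASCII transcriptions; and comparing the first two consonant classes is an assumption chosen for illustration.

```python
# Hypothetical, simplified consonant classes (not the published inventory).
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K",
    "h": "H",
    "m": "M",
    "n": "N",
    "r": "R", "l": "R",
    "w": "W",
    "j": "J",
}


def class_string(word, length=2):
    """Reduce a word to the classes of its first `length` consonants; vowels are skipped."""
    classes = [SOUND_CLASSES[ch] for ch in word if ch in SOUND_CLASSES]
    return "".join(classes[:length])


def is_match(word_a, word_b):
    """Treat two words as potential cognates if their class strings are identical."""
    a, b = class_string(word_a), class_string(word_b)
    return bool(a) and a == b


print(class_string("hand"), class_string("hant"), is_match("hand", "hant"))
print(class_string("fut"), class_string("fus"), is_match("fut", "fus"))
```

The second pair shows the price of the shortcut: two words that differ only in their final consonant fall into different classes here, so a crude class match can miss pairs that a linguist would accept.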

Both approaches have been discussed in quite a range of different papers, and scholars have also tried to propose improvements to the methods. Ringe's correspondence-based approach showed that it can become difficult to prove the relationship of languages formally, although we have very good reasons to assume it based on our standard methods. Baxter and Manaster Ramer (2000) presented a more optimistic case study, in which they argue that their sound-class-based approach would allow them to argue in favor of the relationship of Hindi and English, even if the two languages are separated by at least 10,000 or even more years.

A general problem of Ringe's approach was that he tried to use combinatorics to arrive at his statistical evaluation. This is similar to the way in which Henikoff and Henikoff (1992) developed their BLOSUM matrices for biology, by assuming that the only factor that handles the combination of amino acids in biological sequences is their frequency. Ringe tried to estimate the likelihood of finding matches of word-initial consonants in his data by using a combinatorial approach based on the assumption of simple sound frequencies in the word lists he investigated. The general problem with linguistic sequences, however, is that they are not randomly arranged. Instead, every language has its own system of phonotactic rules, a rather simple grammar that restricts certain letter combinations and favors others. All spoken languages have these systems, and some vary greatly with respect to their phonotactics. As a result, due to the inherent structure of sequences, a bag of symbols approach, as used by Ringe, can have unwanted side effects and invoke misleading estimates regarding the probability of certain matches.
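To make the frequency-only assumption concrete, here is a minimal sketch of a "bag of symbols" estimate. It is not Ringe's actual procedure: the function names and the two tiny word lists are invented, and the only thing modeled is the chance that two word-initial symbols agree if each is drawn independently according to its frequency in its own list.

```python
from collections import Counter


def initial_frequencies(words):
    """Relative frequency of each word-initial symbol in a word list."""
    counts = Counter(word[0] for word in words)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


def chance_match_probability(words_a, words_b):
    """Probability that two independently drawn initials are identical."""
    freq_a = initial_frequencies(words_a)
    freq_b = initial_frequencies(words_b)
    return sum(p * freq_b.get(symbol, 0.0) for symbol, p in freq_a.items())


# Invented word lists for two hypothetical languages.
words_a = ["ta", "ka", "ti", "pa"]
words_b = ["to", "to", "ku", "mu"]

probability = chance_match_probability(words_a, words_b)
expected_matches = probability * len(words_a)
print(probability, expected_matches)
```

Because the estimate looks only at symbol frequencies, it ignores everything that phonotactics imposes on where sounds may occur, which is exactly the weakness described above.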

To avoid this problem, Kessler (2001) proposed the use of permutation tests, by which the random distribution, against which the attested distribution is compared, is generated via the shuffling of the lists. Instead of comparing translations for "apple" in one language with translations for "apple" in another language, one now compares translations for "pear" with translations for "apple", hoping that this — if done often enough — better approximates the random distribution (i.e. the situation in which one compares several known unrelated languages with similar phoneme inventories).
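The shuffling idea can be sketched as follows. This is a simplified illustration rather than Kessler's exact statistic: matches are counted on identical word-initial symbols only, the two eight-item lists are invented, and the function names, the number of runs, and the fixed seed are assumptions made for the example.

```python
import random


def count_matches(list_a, list_b):
    """Count concept slots in which both words start with the same symbol."""
    return sum(a[0] == b[0] for a, b in zip(list_a, list_b))


def permutation_test(list_a, list_b, runs=1000, seed=0):
    """Compare the attested match count with counts from shuffled lists."""
    rng = random.Random(seed)
    observed = count_matches(list_a, list_b)
    shuffled = list(list_b)
    at_least_as_many = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # breaks the link between concept and word
        if count_matches(list_a, shuffled) >= observed:
            at_least_as_many += 1
    # Add one to both counts so the estimated p-value is never exactly zero.
    p_value = (at_least_as_many + 1) / (runs + 1)
    return observed, p_value


# Invented lists: the words in each row translate the same concept.
list_a = ["pata", "tiko", "kuma", "masi", "nolu", "rewa", "sipo", "lanu"]
list_b = ["podi", "teka", "kimo", "muse", "nale", "ruwi", "sapu", "lena"]

observed, p_value = permutation_test(list_a, list_b)
print(observed, p_value)
```

Since the shuffled lists keep the real words of the language, the random distribution automatically respects its sound inventory and phonotactics, which the frequency-only estimate does not.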

Permutation is also the standard in all sound-correspondence-based approaches. In a recent paper, Kassian et al. (2015) used these approaches (first proposed by Turchin et al. 2010) to argue for the relationship of Indo-European and Uralic languages by comparing reconstructed word lists for Proto-Indo-European and Proto-Uralic. As can be seen from the discussion of these findings involving multiple authors, people are still not automatically convinced by a significance test, and scholars have criticized: their choice of test concepts (they used the classical 110-item list by Yakhontov and Starostin), their choice of reconstruction system (they did not use the mysterious laryngeals in their comparison), and the possibility that the findings were due to other factors (early borrowing).

While there have been some more attempts to improve the correspondence-based and the sound-class-based approaches (e.g., Kessler 2007, Kilani 2015, Mortarino 2009), it is unlikely that they will lead to the consolidation of contested proposals on macro families any time soon. Apart from the general problems of many of the current tests, there seem to be too many unknowns that prevent the community from accepting findings, no matter "how" significant they appear. As can be nicely seen from the reaction to the paper by Kassian et al. 2015, a significant test will first raise the typical questions regarding the quality of the data and the initial judgments (which may also at times be biased). Even if all scholars agreed in this case, however, i.e. if one could not criticize anything in the initial test setting, there would still be the possibility to say that the findings reflect early language contact instead of phylogenetic relatedness.

Initial ideas for improvement

What I find unsatisfying about most existing tests is that they do not make exhaustive use of alignment methods. The sound-class-based approach is a shortcut for alignments, but it reduces words to two consonant classes only, and requires an extensive analysis of the words to compare only the root morpheme. It therefore also opens the possibility to bias the results (even if scholars may not intend that directly). While correspondence-based tests are much more elegant in general, they avoid alignments completely, and just pick the first letter in every word. The problem seems to be that — even when using permutations to generate the random distribution — nobody really knows how one should score the significance of sound correspondences in aligned words. I have to admit that I do not know it either. Although the tools for automated sequence comparison that my colleagues and I have been developing in the past (List 2014, List et al. 2018) seem like the best starting point to improve the correspondence-based approach, it is not clear how the test should be performed in the end.

Additionally, I assume also that expanded, fully fledged, tests will ultimately show what I reported back in my dissertation — if we work on limited wordlists, with only 200 items per language, the test will drastically lose its power when certain time depths have been reached. While we can easily prove the relationship of English and German, even with only 100 words, we have a hard time doing the same thing for English and Albanian (see List 2014: 200-203). But expanding the wordlists bears another risk for comparison (as pointed out to me by George Starostin): the more words we add, the more likely it is that they have been borrowed. Thus, we face a general dilemma in historical linguistics: that we are forced to deal with sparse data, since languages tend to lose their historical signal rather quickly.


While there is no doubt that it would be attractive to have a test that would immediately tell one whether languages are related or not, I am becoming more and more skeptical about whether this test would actually help us, specifically when concentrating on pairwise tests alone. The challenge of this problem is not just to design a test that makes sense and does not overly simplify. The challenge is to propagate the test in such a way that it convinces our colleagues that it really works. This, however, is a challenge that is greater than any of the other open problems I have discussed so far in this year.


Baxter, William H. and Manaster Ramer, Alexis (2000) Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, Colin and McMahon, April and Trask, Larry (eds.) Time Depth in Historical Linguistics. Cambridge:McDonald Institute for Archaeological Research, pp. 167-188.

Campbell, Lyle and Poser, William John (2008) Language Classification: History and Method. Cambridge:Cambridge University Press.

Dolgopolsky, Aron B. (1964) Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2: 53-63.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-linguistic data formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5: 1-10.

Handel, Zev (2008) What is Sino-Tibetan? Snapshot of a field and a language family in flux. Language and Linguistics Compass 2: 422-441.

Henikoff, Steven and Henikoff, Jorja G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences 89: 10915-10919.

Jones, William (1798) The third anniversary discourse, delivered 2 February, 1786, by the president. On the Hindus. Asiatick Researches 1: 415-43.

Kassian, Alexei and Zhivlov, Mikhail and Starostin, George S. (2015) Proto-Indo-European-Uralic comparison from the probabilistic point of view. The Journal of Indo-European Studies 43: 301-347.

Kessler, Brett (2001) The Significance of Word Lists. Statistical Tests for Investigating Historical Connections Between Languages. Stanford: CSLI Publications.

Kessler, Brett (2007) Word similarity metrics and multilateral comparison. In: Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, pp. 6-14.

Kilani, Marwan (2015) Calculating false cognates: An extension of the Baxter & Manaster-Ramer solution and its application to the case of Pre-Greek. Diachronica 32: 331-364.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Walworth, Mary and Greenhill, Simon J. and Tresoldi, Tiago and Forkel, Robert (2018) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3: 130–144.

Mortarino, Cinzia (2009) An improved statistical test for historical linguistics. Statistical Methods and Applications 18: 193-204.

Nichols, Johanna (1996) The comparative method as heuristic. In: Durie, Mark (ed.) The Comparative Method Reviewed. New York:Oxford University Press, pp. 39-71.

Ringe, Donald A. (1992) On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82: 1-110.

Ross, Malcolm D. (1996) Contact-induced change and the comparative method. Cases from Papua New Guinea. In: Durie, Mark (ed.) The Comparative Method Reviewed. New York: Oxford University Press, pp. 180-217.

Turchin, Peter and Peiros, Ilja and Gell-Mann, Murray (2010) Analyzing genetic connections between languages by matching consonant classes. Journal of Language Relationship 3: 117-126.

Monday, August 19, 2019

Phylogenetics of chain letters?

The general public and the general media often have no idea what biologists mean by the word "evolution". The word has two possible meanings, and they usually pick the wrong one. Niles Eldredge tried to clarify the situation by referring to them as:
  • transformational evolution — the change in a group of objects resulting from a change in each object (often attributed to Lamarck)
  • variational evolution - the change in a group of objects resulting from a change in the proportion of different types of objects (usually attributed to Darwin).
Charles Darwin changed biology by pointing out that changes in species occur via the latter mechanism, not the former, which had been the predominant previous idea. Sadly, 160 years later, the idea of transformational evolution still seems to prevail in the minds of the general public and the people writing for them.

So, it was with some trepidation that I looked at an article in Scientific American called Chain letters and evolutionary histories (by Charles H. Bennett, Ming Li and Bin Ma. June 2003, pp. 76-81). It was subtitled: "A study of chain letters shows how to infer the family tree of anything that evolves over time, from biological genomes to languages to plagiarized schoolwork."

The "taxa" in their study consist of 33 different chain letters, collected during the period 1980–1995 (8 other letters were excluded), covering the diversity of chain letters as they existed before internet spam became widespread. These letters can be viewed on the Chain Letters Home Page.

The main issue with this study is that there are no clearly defined characters, from which the phylogeny could be constructed. The authors therefore resort to creating a pairwise distance matrix, among the taxa, in a manner (compression) that I have criticized before (Non-model distances in phylogenetics). I have also discussed previous examples where this approach has been used, notably: Phylogenetics of computer viruses? Multimedia phylogeny?

The essential problem, as I see it, is that without a model of character change there is no reliable way to separate phylogenetic information from any other type of information. That is, phylogenetic similarity is a special type of similarity. It is based on the idea of shared derived character states, as these are the only things that are informative about a phylogeny.

Compression, on the other hand, is a general sort of similarity, based on the idea of information complexity. This presumably will contain some useful phylogenetic information, but it will also contain a lot of irrelevance — for example, shared ancestral character states, which are uninformative at best and positively misleading at worst.
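For readers unfamiliar with compression-based distances, the following Python sketch shows the general idea using the normalized compression distance with the standard-library zlib compressor. This is a generic technique swapped in for illustration, not the compressor or formula used by the authors of the article, and the two text snippets are invented.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 mean that one text adds little new information to the
    other; values near 1 mean that the texts share almost nothing.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical snippets, not taken from the actual chain letters.
    letter_1 = "This letter has been sent to you for good luck. " * 5
    letter_2 = "Quarterly rainfall figures were above average in the north. " * 5
    print(ncd(letter_1, letter_1))
    print(ncd(letter_1, letter_2))
```

Note that nothing in this calculation distinguishes shared derived text from shared ancestral text: any repeated string lowers the distance, which is exactly the concern raised above.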

So, the authors can easily produce an unrooted tree from their similarity matrix, which they then proceed to root at one of the letters that they collected early on in their study. This tree is shown here.

However, whether this diagram represents a phylogeny is unknown.

Nevertheless, that does not stop us using an unrooted phylogenetic network as a form of exploratory data analysis, as we have done so often in this blog. This is not intended to produce a rooted evolutionary history, but instead merely to summarize the multivariate information in a comprehensible (and informative) manner. This might indicate whether we are likely to be able to reconstruct the phylogeny. In this case, I have used a NeighborNet to display the similarity matrix, as shown next.

Phylogenetic network of chain letters

It is easy to see that the relationships among the letters are not particularly tree-like. Moreover, the long terminal edges emphasize that much of the complexity information is not shared among the letters, while the shared information is distinctly net-like. So, a simple "phylogenetic tree" (as shown above) is not likely to be representative of the actual evolutionary history.

However, there are actually a few reasonably well-defined groups among the taxa: one at the top, one at the right, and several at the bottom of the network. There are also letters of uncertain affinity, such as L2, L23, L13 and L31. These may reflect phylogenetic history, even though that history is hard to untangle.

Finally, it is worth noting that the history of chain letters, dating back to the 1800s, is discussed in detail by Daniel W. VanArsdale at his Chain Letter Evolution web pages.

Monday, August 12, 2019

Public transit trips in the USA

Public transport, or mass transit, has long been a politically charged issue, throughout the world. However, the modern world now recognizes that it is an effective way to deal with mass movements of people in a manner that respects the use of non-renewable resources.

After all, the only way to continue with autonomous transportation is to get rid of fossil fuels. However, electric cars will not be of much use until we work out where we are going to get all of the needed extra electricity, in a manner that is environmentally friendly. There is not much point in simply moving the burning of fossil fuels from the vehicle (ie. gasoline) to a power station that also burns fossil fuels (eg. coal). There is also a limit to how many rivers there are left to dam for hydroelectric power; and nuclear reactors have gone out of fashion (fortunately). There is also, of course, the matter of how we are going to recycle the used (lithium-ion) batteries from the cars, which is apparently a tougher proposition than recycling the electric motors themselves.

So, until we sort this out, mass transit is a viable option for most conurbations. In this context, a conurbation (or a metropolitan area) is a contiguous area within which large numbers of people move regularly, especially traveling to and from their workplace each weekday. A conurbation often involves multiple cities and towns, as defined by political administrations or contiguous urban development — many people live in one urban area but work in another.

So, naturally, governments collect data on these matters. One such data collection is the U.S. Department of Transportation's National Transit Database. The data consist of "sums of annual ridership (in terms of unlinked passenger trips), as reported by transit agencies to the Federal Transit Administration." Data for three separate modes of transit are included: bus, rail, and paratransit. The data currently cover the years 2002–2018, inclusive.

To look at the data for the 42 U.S. conurbations included, for the year 2018, I have performed this blog's usual exploratory data analysis. I first calculated the transit rate per person, by dividing the annual number of trips for each of the three modes by the conurbation population size. Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network. For this network analysis, I calculated the dissimilarity of the conurbations using the Manhattan distance. A Neighbor-net analysis was then used to display the between-area similarities.
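The first two steps of this analysis (per-person trip rates, then Manhattan distances between areas) can be sketched in a few lines of Python. The area names and all numbers below are invented placeholders, not values from the National Transit Database, and the Neighbor-net step itself is not reproduced here, since it requires dedicated network software.

```python
from itertools import combinations

MODES = ("bus", "rail", "paratransit")

# Hypothetical records: annual unlinked trips per mode, plus population.
AREAS = {
    "Area A": {"population": 1_000_000, "bus": 50_000_000,
               "rail": 30_000_000, "paratransit": 1_000_000},
    "Area B": {"population": 2_000_000, "bus": 40_000_000,
               "rail": 10_000_000, "paratransit": 2_000_000},
    "Area C": {"population": 500_000, "bus": 5_000_000,
               "rail": 0, "paratransit": 250_000},
}


def trip_rates(record):
    """Annual trips per person for each transit mode."""
    return tuple(record[mode] / record["population"] for mode in MODES)


def manhattan(u, v):
    """Manhattan distance: sum of absolute differences per mode."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(areas):
    """Pairwise Manhattan distances between the areas' rate vectors."""
    rates = {name: trip_rates(rec) for name, rec in areas.items()}
    return {(a, b): manhattan(rates[a], rates[b])
            for a, b in combinations(sorted(rates), 2)}


if __name__ == "__main__":
    for pair, dist in distance_matrix(AREAS).items():
        print(pair, round(dist, 2))
```

The resulting pairwise distances are the kind of input that a Neighbor-net analysis takes.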

The resulting network is shown in the graph. Conurbations that are closely connected in the network are similar to each other based on the trip rates, and those areas that are further apart are progressively more different from each other. In this case, there is a simple gradient from the busiest mass transit systems at the top of the network to the least busy at the bottom.

The network shows us that the New York – Newark transit-commuting area (which covers part of three states) is far and away the busiest in the USA. The subway system dominates this mass transit, of course, as it is justifiably world famous, although not always for the best of reasons as far as commuters are concerned.

The San Francisco – Oakland area is in clear second place. Here, bus transit slightly exceeds rail transit. Then follow Washington DC and Boston, both of which also cover parts of three states. In Boston trains out-do buses 2:1, while in Washington it is closer to 1.5:1.

Next comes a group of four conurbations: Chicago, Philadelphia, Portland and Seattle. Two of these cover part of Washington state, but in quite different ways: in Seattle the buses dominate the system 5:1 but in Portland it is only 1.5:1. Chicago and Philadelphia share buses and trains pretty equally.

At the bottom of the network there are two large groups of conurbations, one of which does slightly better than the other at mass transit use. The least-used system is that of San Juan, in Puerto Rico, perhaps not unexpectedly. Of the contiguous U.S. states, Indianapolis (IN) has the least used system, followed by Memphis (TN–MS–AR).

Moving on, we could also look at changes in the total number of transit trips (irrespective of mode) during the period for which data are available: 2002–2018. A network is of little help here. So, it is simplest just to plot the data, as shown in the next graph.

For most of the metropolitan areas there is little in the way of consistent change through time. However, there are some areas that show high correlations between the number of trips and time. These are the areas that have shown the most consistent increase in the number of transit trips from 2002–2018:
  • Chicago (IL–IN)
  • Tampa – St Petersburg (FL)
  • Baltimore (MD)
  • Denver – Aurora (CO)
  • San Francisco – Oakland (CA)
  • Memphis (TN–MS–AR)
  • San Diego (CA)
  • Cleveland (OH)
  • Providence (RI–MA)
  • Orlando (FL)
  • Indianapolis (IN)
  • New York – Newark (NY–NJ–CT)
  • Portland (OR–WA)
  • Minneapolis – St Paul (MN–WI)
Sadly, there are also areas that have shown a consistent decrease in the number of transit trips through time (2002–2018):
  • Kansas City (MO–KS)
  • Columbus (OH)
  • Riverside – San Bernardino (CA)
Presumably these are the areas where the local politicians should be looking into how to address this long-term issue.
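The kind of calculation behind these two lists, a correlation between the number of trips and the year, can be sketched as follows. This is a minimal Python illustration using the Pearson correlation coefficient; the choice of coefficient and the two series of trip counts are assumptions made for the example, not details or data from the analysis above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    years = list(range(2002, 2019))
    # Hypothetical annual trip totals (in millions), for illustration only.
    rising = [100 + 3 * (year - 2002) for year in years]
    falling = [90 - 2 * (year - 2002) for year in years]
    print(pearson_r(years, rising))
    print(pearson_r(years, falling))
```

A coefficient close to +1 indicates a consistent increase through time, a coefficient close to -1 a consistent decrease, and values near 0 indicate no consistent trend.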

Declining transit numbers is a topic discussed around the web; for example: Transit ridership down in most American cities. This article has a graph neatly showing the change in transit numbers from 2017 to 2018. It shows marked decreases, particularly for bus trips, while the few increases almost all involved rail travel. Is this a short-term effect, or the start of a general long-term decline?

Monday, August 5, 2019

Tattoo Monday XIX

Here are two more (large) Charles Darwin tree tattoos, based on his best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday V, Tattoo Monday VI, Tattoo Monday IX, Tattoo Monday XII, and Tattoo Monday XVIII.

Monday, July 29, 2019

Simulation of sound change (Open problems in computational diversity linguistics 6)

The sixth problem in my list of open problems in computational diversity linguistics is devoted to the problem of simulating sound change. When formulating the problem, it is difficult to see what is actually meant, as there are two possibilities for a concrete simulation: (i) one could think of a sound system of a given language and then model how, through time, the sounds change into other sounds; or (ii) one could think of a bunch of words in the lexicon of a given language, and then simulate how these words are changed through time, based on different kinds of sound change rules. I have in mind the latter scenario.

Why simulating sound change is hard

The problem of simulating sound change is hard for several reasons. First of all, the problem is similar to the problem of sound law induction, since we have to find a simple and straightforward way to handle phonetic context (remember that sound change may often only apply to sounds that occur in a certain environment of other sounds). This is already difficult enough, but it could be handled with help of what I called multi-tiered sequence representations (List and Chacon 2015). However, there are four further problems that one would need to overcome (or at least be aware of) when trying to successfully simulate sound change.

The first of these extra problems is that of morphological change and analogy, which usually goes along with "normal" sound change, following what Anttila (1976) calls Sturtevant's paradox — namely, that regular sound change produces irregularity in language systems, while irregular analogy produces regularity in language systems. In historical linguistics, analogy serves as a cover-term for various processes in which words or word parts are rendered more similar to other words than they had been before. Classical examples are children's "regular" plurals of nouns like mouse (eg. mouses instead of mice) or "regular" past tense forms of verbs like catch (e.g., catched instead of caught). In all these cases, perceived irregularities in the grammatical system, which often go back to ancient sound change processes, are regularized on an ad-hoc basis.

One could (maybe one should), of course, start with a model that deliberately ignores processes of morphological change and analogical leveling, when drafting a first system for sound change simulation. However, one needs to be aware that it is difficult to separate morphological change from sound change, as our methods for inference require that we identify both of them properly.

The second extra problem is the question of the mechanism of sound change, where competing theories exist. Some scholars emphasize that sound change is entirely regular, spreading over the whole lexicon (or changing one key in the typewriter), while others claim that sound change may slowly spread from word to word and at times not reach all words in a given lexicon. If one wants to profit from simulation studies, one would ideally allow for a testing of both systems; but it seems difficult to model the idea of lexical diffusion (Wang 1969), given that it should depend on external parameters, like frequency of word use, which are also not very well understood.
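The contrast between the two mechanisms can be illustrated with a deliberately naive Python sketch. Everything here is an assumption for illustration: words are plain strings, a "sound" is a single character, and lexical diffusion is modeled as each affected word adopting the change independently with a fixed probability per time step, which ignores word frequency and all the other external parameters mentioned above.

```python
import random


def regular_change(lexicon, source, target):
    """Regular sound change: every word containing the sound shifts."""
    return [word.replace(source, target) for word in lexicon]


def lexical_diffusion_step(lexicon, source, target, adoption_prob, rng):
    """One time step of diffusion: each affected word shifts with a
    given probability, so the change may not reach the whole lexicon."""
    return [
        word.replace(source, target)
        if source in word and rng.random() < adoption_prob
        else word
        for word in lexicon
    ]


if __name__ == "__main__":
    # Hypothetical toy lexicon.
    lexicon = ["pater", "pes", "piscis", "tres", "porta"]
    print(regular_change(lexicon, "p", "f"))

    rng = random.Random(42)
    state = lexicon
    for step in range(3):
        state = lexical_diffusion_step(state, "p", "f", 0.4, rng)
        print(step, state)
```

With an adoption probability of 1, the diffusion model collapses into regular change, which is one simple way to test both views within a single framework.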

The third extra problem is that of the actual tendencies of sound change, which are also by no means well understood by linguists. Initial work on sound change has been carried out (Kümmel 2008). However, the major work of finding a way to compare the major tendencies of sound change processes across a large sample of the world's languages (ie. the typology of sound change, which I plan to discuss separately in a later post), has not been carried out so far. The reason why we are missing this typology is that we lack clear-cut machine-readable accounts of annotated, aligned data. In such accounts, scholars would provide their proto-forms for the reconstructed languages along with their proposed sound laws in a system that can in fact be tested and run (which would also allow us to estimate the exceptions, or where those systems fail).

But having an account of the tendencies of sound change opens a fourth important problem, apart from the lack of data that we could use to draw a first typology of sound change processes: sound change tendencies are initiated not only by the general properties of speech sounds, but also by the linguistic systems in which these speech sounds are employed. While scholars occasionally mention this, there have been no real attempts to separate the two aspects in a concrete reconstruction of a particular language. The typology of sound change tendencies could thus not simply stop at listing tendencies resulting from the properties of speech sounds, but would also have to find a way to model diverging tendencies because of systemic constraints.

Traditional insights into the process of sound change

When discussing sound change, we need to distinguish mechanisms, types, and patterns. Mechanisms refer to how the process "proceeds", the types refer to the concrete manifestations of the process (like a certain, concrete change), and patterns reflect the systematic perspective of changes (i.e. their impact on the sound system of a given language, see List 2014).

Figure 1: Lexical diffusion

The question regarding the mechanism is important, since it refers to the dispute over whether sound change is happening simultaneously for the whole lexicon of a given language — that is, whether it reflects a change in the inventory of sounds, or whether it jumps from word to word, as the defenders of lexical diffusion propose, whom I mentioned above (see also Chen 1972). While nobody would probably nowadays deny that sound change can proceed as a regular process (Labov 1981), it is less clear as to which degree the idea of lexical diffusion can be confirmed. Technically, the theory is dangerous, since it allows a high degree of freedom in the analysis, which can have a deleterious impact on the inference of cognates (Hill 2016). But this does not mean, of course, that the process itself does not exist. In these two figures, I have tried to contrast the different perspectives on the phenomena.

Figure 2: Regular sound change

To gain a deeper understanding of the mechanisms of sound change, it seems indispensable to work more on models trying to explain how it is actuated in the first place. While most linguists agree that synchronic variation in our daily speech is what enables sound change, it is not entirely clear how certain new variants become fixed in a society. Interesting theories in this context have been proposed by Ohala (1989), who describes distinct scenarios in which sound change can be initiated either by the speaker or by the listener, which would in theory also yield predictable tendencies with respect to the typology of sound change.

The insights into the types and patterns of sound change are, as mentioned above, much more rudimentary, although one can say that most historical linguists have a rather good intuition with respect to what is possible and what is less likely to happen.

Computational approaches

We can find quite a few published papers devoted to the simulation of certain aspects of sound change, but so far, we do not (at least to my current knowledge) find any comprehensive account that would try to feed some 1,000 words to a computer and see how this "language" develops: which sound laws can be observed to occur, and how they change the shape of the given language. What we find, instead, are a couple of very interesting accounts that try to deal with certain aspects of sound change.

Winter and Wedel (2016), for example, test agent-based exemplar models, in order to see how systems maintain contrast despite variation in the realization (Hamann 2014: 259f gives a short overview of other recent articles). Au (2008) presents simulation studies that aim to test to which degree lexical diffusion and "regular" sound change interact in language evolution. Dediu and Moisik (2019) investigate, with the help of different models, to which degree vocal tract anatomy of speakers may have an impact on the actuation of sound change. Stevens et al. (2019) present an agent-based simulation to investigate the change of /s/ to /ʃ/ in English.

This summary of literature is very eclectic, especially because I have only just started to read more about the different proposals out there. What is important for the problem of sound change simulation is that, to my knowledge, there is no approach yet ready to run the full simulation of a given lexicon for a given language, as stated above. Instead, the studies reported so far have a much more fine-grained focus, specifically concentrating on the dynamics of speaker interaction.

Initial ideas for improvement

I do not have concrete ideas for improvement, since the problem's solution depends on quite a few other problems that would need to be solved first. But to address the idea of simulating sound change, albeit only in a very simplifying account, I think it will be important to work harder on our inferences, by making transparent what so far is only implicitly stored in the heads of the many historical linguists in form of what they call their intuition.

During the past 200 years, after linguists started to apply the mysterious comparative method that they had used successfully to reconstruct Indo-European on other language families, the amount of data and number of reconstructions for the world's languages has been drastically increasing. Many different language families have now been intensively studied, and the results have been presented in etymological dictionaries, numerous books and articles on particular questions, and at times even in databases.

Unfortunately, however, we rarely find attempts by scholars to actually provide their findings in a form that would allow one to check the correctness of their predictions automatically. I am thinking in very simple terms here: a scholar who proposes a reconstruction for a given language family should deliver not only the proto-forms with the reflexes in the daughter languages, but also a detailed test of how the proposed sound laws, by which the proto-forms change into the daughter languages, produce the reflexes.

While it is clear that this could not be easily implemented in the past, it is in fact possible now, as we can see from a couple of studies where scholars have tried to compute sound change (Hartmann 2003, Pyysalo 2017, see also Sims-Williams 2018 for an overview on more literature). Although these attempts are unsatisfying, given that they do not account for cross-linguistic comparability of data (eg. they use orthographies rather than unified transcriptions, as proposed by Anderson et al. 2018), they illustrate that it should in principle be possible to use transducers and similar technologies to formally check how well the data can be explained under a certain set of assumptions.
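As a rough indication of what such a formal check could look like, here is a minimal Python sketch that applies an ordered list of sound laws to proto-forms and reports every reflex that the laws fail to predict. It uses plain regular expressions rather than real transducers, and the laws, the proto-forms and the reflexes are all invented for illustration; none of them comes from the cited studies or from an actual reconstruction.

```python
import re

# Hypothetical ordered sound laws: (pattern, replacement).
# Lookaheads are one simple way to encode phonetic context.
SOUND_LAWS = [
    (r"p", "f"),             # p > f in all positions
    (r"t(?=[aeiou])", "θ"),  # t > θ only before a vowel
]


def derive(proto_form, laws):
    """Apply the sound laws in order to a single proto-form."""
    form = proto_form
    for pattern, replacement in laws:
        form = re.sub(pattern, replacement, form)
    return form


def failed_predictions(dataset, laws):
    """Return (proto, attested, predicted) for every mismatch."""
    failures = []
    for proto_form, attested in dataset:
        predicted = derive(proto_form, laws)
        if predicted != attested:
            failures.append((proto_form, attested, predicted))
    return failures


if __name__ == "__main__":
    # Hypothetical proto-forms paired with attested reflexes.
    data = [("pater", "faθer"), ("tres", "θres"), ("porta", "forθa")]
    for failure in failed_predictions(data, SOUND_LAWS):
        print("not explained:", failure)
```

The list of failures is the interesting output: it makes explicit which reflexes remain unexplained under a given set of assumptions, instead of leaving them implicit in the scholar's intuition.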

Without cross-linguistic accounts of the diversity of sound change processes (ie. a first solution to the problem of establishing a first typology of sound change), attempts to simulate sound change will remain difficult. The only way to address this problem is to require a more rigorous coding of data (both human- and machine-readable), and an increased openness of scholars who work on the reconstruction of interesting language families, to help make their data cross-linguistically comparable.

Sign languages

When drafting this post, I promised Guido and Justin that I would take the opportunity, when talking about sound change, to say a few words about the peculiarities of sound change in contrast to other types of language change. The idea was that this would help us to somehow contribute to the mini-series on sign languages, which Guido and Justin have initiated this month (see post number one, two, and three).

I do not think that I have completely succeeded in doing so, as what I have discussed today with respect to sound change does not really point out what makes it peculiar (if it is). But to provide a brief attempt, before I finish this post, I think that it is important to emphasize that the whole debate about regularity of sound change is, in fact, not necessarily about regularity per se, but rather about the question of where the change occurs. As the words in spoken languages are composed of a fixed number of sounds, any change to this system will have an impact on the language as a whole. Synchronic variation of the pronunciation of these sounds offers the possibility of change (for example during language acquisition); and once the pronunciation shifts in this way, all words that are affected will shift along, similar to a typewriter in which you change a key.

As far as I understand, for the time being it is not clear whether a counterpart of this process exists in sign language evolution, but if one wanted to search for such a process, one should, in my opinion, do so by investigating to what degree the signs can be considered as being composed of something similar to phonemes in historical linguistics. In my opinion, the existence of phonemes as minimal meaning-discriminating units in all human languages, including spoken and signed ones, is far from being proven. But if it should turn out that signed languages also recruit meaning-discriminating units from a limited pool of possibilities, there might be the chance of uncovering phenomena similar to regular sound change.

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

Anttila, Raimo (1976) The acceptance of sound change by linguistic structure. In: Fisiak, Jacek (ed.) Recent Developments in Historical Phonology. The Hague, Paris, New York: de Gruyter, pp. 43-56.

Au, Ching-Pong (2008) Acquisition and Evolution of Phonological Systems. Academia Sinica: Taipei.

Chen, Matthew (1972) The time dimension. Contribution toward a theory of sound change. Foundations of Language 8.4: 457-498.

Dediu, Dan and Moisik, Scott (2019) Pushes and pulls from below: Anatomical variation, articulation and sound change. Glossa 4.1: 1-33.

Hamann, Silke (2014) Phonological changes. In: Bowern, Claire (ed.) Routledge Handbook of Historical Linguistics. Routledge, pp. 249-263.

Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp. 606-609.

Hill, Nathan (2016) A refutation of Song’s (2014) explanation of the ‘stop coda problem’ in Old Chinese. International Journal of Chinese Linguistics 2.2: 270-281.

Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

Labov, William (1981) Resolving the Neogrammarian Controversy. Language 57.2: 267-308.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper presented at the workshop "Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE]" (2015/09/04, Leiden, Societas Linguistica Europaea).

Ohala, J. J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

Pyysalo, Jouna (2017) Proto-Indo-European Lexicon: The generative etymological dictionary of Indo-European languages. In: Proceedings of the 21st Nordic Conference of Computational Linguistics, pp. 259-262.

Sims-Williams, Patrick (2018) Mechanising historical phonology. Transactions of the Philological Society 116.3: 555-573.

Stevens, Mary and Harrington, Jonathan and Schiel, Florian (2019) Associating the origin and spread of sound change using agent-based modelling applied to /s/-retraction in English. Glossa 4.1: 1-30.

Wang, William Shi-Yuan (1969) Competing changes as a cause of residue. Language 45.1: 9-25.

Winter, Bodo and Wedel, Andrew (2016) The co-evolution of speech and the lexicon: Interaction of functional pressures, redundancy, and category variation. Topics in Cognitive Science 8: 503-513.