Monday, October 22, 2018

Controversies about structural data in historical linguistics

In the past, there have been many controversies about structural data, that is, the kind of data I introduced in the post written last month. Given the misinterpretation of structural data as being "grammatical", along with the unproven and misleading claim by Nichols (2003) that certain grammatical features are more stable than lexical ones, one can often read about a controversy in linguistics: which aspects are more stable, and therefore more useful for studying deep linguistic relationships, the lexicon or the grammar?

In this context, it is often ignored that we are not talking chiefly about grammar when applying phylogenetic methods to structural datasets. It is also ignored that the original idea of the importance of "grammar" pointed to homologies in complex and concrete morphological paradigms (i.e., individual word forms, that is, predominantly lexical traits), as most prominently discussed by Meillet (1925) and later popularized by Nichols (1996). "Grammar" never pointed to abstract similarities as they are captured in most structural datasets (see the excellent discussion by Dybo and Starostin 2008).

"Grammar" as evidence for deep language relations

Leading scholars in historical linguistics have provided convincing arguments that genetic relationships among languages can only be demonstrated by illustrating regular sound correspondences in concrete form-meaning pairs across the languages under investigation (see especially the very good analysis by Campbell and Poser 2008). In spite of this, the rumor that "grammar" (i.e., structural datasets) might provide a shortcut to detect deep, so far unnoticed, relationships among the languages of the world is very persistent, as reflected in many different studies.

Among the examples, Dunn et al. (2008) claimed that language relationships for Papuan languages of Island Melanesia could be uncovered by means of phonological and grammatical (abstract) structural features; and Longobardi et al. (2015) used syntactic features to compare the development of European languages with the development of European populations. Zhang et al. (2018) used phonological inventories of more than 100 different Chinese dialects, coding the data for simple presence and absence of each of the more than 200 different sounds in the database, and analyzing the data with the STRUCTURE software (Pritchard et al. 2000), whose results tend to be notoriously misinterpreted.
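To make concrete what "coding for simple presence and absence" means, here is a minimal sketch. The three inventories are invented for illustration, and the function name `presence_absence` is my own; this is not the pipeline used by Zhang et al. (2018).

```python
def presence_absence(inventories):
    """Code sound inventories as binary characters (1 = present, 0 = absent)."""
    # One character (column) per sound attested anywhere in the sample.
    sounds = sorted(set().union(*inventories.values()))
    matrix = {
        variety: [1 if sound in inventory else 0 for sound in sounds]
        for variety, inventory in inventories.items()
    }
    return sounds, matrix


# Hypothetical inventories, not real dialect data.
inventories = {
    "Variety A": {"p", "t", "k", "m"},
    "Variety B": {"p", "t", "k", "n", "ŋ"},
    "Variety C": {"p", "t", "m", "n"},
}

sounds, matrix = presence_absence(inventories)
for variety, row in matrix.items():
    print(variety, row)
```

Note that a shared "1" in such a matrix only says that two varieties both have a sound, not that the sound was inherited from a common ancestor.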

What is important about these studies is that none of them (perhaps with the exception of the study by Dunn et al. 2008, but I am in no position to actually judge the findings) could make a convincing case for why structural datasets would provide evidence of deeper relationships than the lexicon. Even the study by Dunn et al., which tests the suitability of their small questionnaire of only 115 structural traits on Oceanic languages, has not since led to any new insights into previously undetected language relationships, contrary to the hope expressed by the authors, "that structural phylogeny is an important new tool for exploring historical relationships between languages" (ibid. 734).

Structural data as a shortcut?

Some scholars who work on structural datasets may find my claims harsh and unjustified. In fact, there are studies that seem to provide evidence that structural datasets perform similarly or equally well compared to phylogenetic methods based on lexical data.

For example, Longobardi et al. (2016) carry out experiments on structural data of phoneme inventories, syntactic features, and "traditional" cognate sets for very small Indo-European datasets, concluding that all of the datasets yield similar results, and that syntactic or phonological features in structural datasets could be used instead of lexical phylogenies.

In contrast, Greenhill et al. (2017) also compare lexical datasets with structural data, for 81 Austronesian languages, but they find that, in general, lexical data is much more stable than structural data, although some structural features seem to be similar to lexical items regarding their stability.

A wish list for future tests

I see two major problems in the debate about the usefulness of structural data in historical linguistics.

First, the studies that confirm that structure might work equally well compared with lexical data are all based on small samples of one specific language family that was analyzed based on very diverse features that were specifically designed to study the languages in question. For me, a true test that some features carry deep historical signal would need to be illustrated for a large set of related and unrelated languages, not just for selected datasets.

Furthermore, to allow for an honest comparison with the lexicon, the selection of features should not contain any lexical characters or characters that could only be extracted with the help of lexical characters. Thus, asking whether the words for "fish", "I", and "five" are pronounced similarly in a language would not be allowed in such a feature collection, because this would follow lexical criteria, and we know very well that this property is a very good proxy for identifying Sino-Tibetan languages (Handel 2008).

Second, and more problematic, is the fact that structural datasets do not provide information on the relatedness of the traits under comparison. While this is no problem for typologists who study shared structural features out of interest in universal tendencies in the languages of the world, it is a problem for the application of phylogenetic software, since the typical approaches in biology treat homoplasy as an exception, while it may often be rather the norm than an exception in structural datasets.


In order to make structural data suitable for historical analyses, much more research needs to be carried out, including specifically a much more thorough study of parallel evolution and geographic convergence (due to language contact) in different language families of the world. A nice illustration for the Indo-European languages is provided by Cathcart et al. (2018).

I would be happy for our field if such research could reveal markers of deep genetic ancestry in the languages of the world, and help us to push the boundaries of linguistic reconstruction. For the time being, however, I remain highly skeptical, especially when scholars try to demonstrate the suitability of "grammatical" comparison with small datasets and idiosyncratically selected feature sets that are not comparable across datasets.


Campbell, L. and W. Poser (2008) Language Classification: History and Method. Cambridge University Press: Cambridge.

Cathcart, C., G. Carling, F. Larson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dunn, M., S. Levinson, E. Lindstroem, G. Reesink, and A. Terrill (2008) Structural phylogeny in historical linguistics: methodological explorations applied in Island Melanesia. Language 84.4: 710-759.

Dybo, A. and G. Starostin (2008) In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.) Aspekty komparativistiki 3. RGGU: Moscow, pp 119-258.

Greenhill, S., C. Wu, X. Hua, M. Dunn, S. Levinson, and R. Gray (2017) Evolutionary dynamics of language systems. Proceedings of the National Academy of Sciences 114.42: E8822-E8829.

Handel, Z. (2008) What is Sino-Tibetan? Snapshot of a field and a language family in flux. Language and Linguistics Compass 2.3: 422-441.

Longobardi, G., S. Ghirotto, C. Guardiano, F. Tassi, A. Benazzo, A. Ceolin, and G. Barbujani (2015) Across language families: Genome diversity mirrors linguistic variation within Europe. American Journal of Physical Anthropology 157.4: 630-640.

Longobardi, G., A. Buch, A. Ceolin, A. Ecay, C. Guardiano, M. Irimia, D. Michelioudakis, N. Radkevich, and G. Jaeger (2016) Correlated Evolution Or Not? Phylogenetic Linguistics With Syntactic, Cognacy, And Phonetic Data. In: The Evolution of Language: Proceedings of the 11th International Conference (EVOLANGX11).

Meillet, A. (1954) La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.

Nichols, J. (1996) The comparative method as heuristic. In: Durie, M. (ed.) The Comparative Method Reviewed. Oxford University Press: New York, pp 39-71.

Nichols, J. (2003) Diversity and stability in language. In: Joseph, B. and R. Janda (eds.) The Handbook of Historical Linguistics. Blackwell: Malden, Mass, pp 283-310.

Pritchard, J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Monday, October 15, 2018

Jumping political parties in Germany's state elections

In one of last year's posts, I showed a neighbour-net for the parties competing in the national election based on political distances inferred from the Wahl-O-Mat questionnaire (A network of political parties competing for the 2017 Bundestag). But Germany is a federal state, and since then, there has been a state election in Lower Saxony, and soon there will be two more, in Bavaria and Hesse. This is a good opportunity to make some network-based comparisons.

It is important to note that there are many political parties in Germany, not just two or three major parties, as in most English-speaking countries. State parliaments can therefore be composed of quite different mixtures of these groups.

The questionnaire

The Wahl-O-Mat is a political information service provided by the BPB, the "Bundeszentrale für politische Bildung". A group of youngsters assisted by scientists puts together a questionnaire of political theses (bullet points), which is sent out to the political parties competing in an election. When participating, as most parties do, they can choose either "agree", "disagree" or "neutral" for each statement.

As a voter, you can fill in the same questionnaire, mark some of the questions as "high importance" (which will be weighted more strongly), and then choose (up to) eight parties for your personal comparison. The result will be a bar chart, showing you the percentage of your personal overlap with each of the parties. The BPB usually provides this service for all federal and state elections.

The problem I always have with this approach is that you don't get any graphical summary information about how the parties agree or disagree with each other to start from. In the worst case scenario, you could have 75% overlap with each of two parties who disagree with each other on 50% of the bullet points!
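The worst case is easy to reproduce with a toy example. The sketch below uses four invented statements and a plain share of identical answers as "overlap"; the real Wahl-O-Mat score also handles neutral answers and double-weighted theses, so treat this only as an illustration of the arithmetic.

```python
def overlap(a, b):
    """Share of statements on which two answer lists are identical."""
    if len(a) != len(b):
        raise ValueError("answer lists must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical answers to four statements: 1 = agree, 0 = disagree.
voter = [1, 1, 1, 1]
party_a = [1, 1, 1, 0]
party_b = [0, 1, 1, 1]

print(overlap(voter, party_a))    # voter vs. party A
print(overlap(voter, party_b))    # voter vs. party B
print(overlap(party_a, party_b))  # the two parties against each other
```

The voter matches each party on three of four statements, while the two parties match each other on only two of four.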

A straightforward solution to this shortcoming is to code the questionnaire as a ternary matrix (0 = "disagree", 1 = "neutral", 2 = "agree"), treat the answers as ordered characters and determine the mean pairwise (Hamming) distances, and then infer a Neighbor-net based on the resulting distance matrix.

This is shown in the first figure, where each labeled point is one of the political parties. The two political extremes are also labeled.

The neighbour-net for the 2017 federal election Wahl-O-Mat questionnaire (original GWoN post from last year; for those interested in further comments, extrapolations, and infographics, see related posts on my Res.I.P blog). The red split denotes the outgoing and new coalition parties (Merkel's "centre-right" CDU/CSU + "centre-left" SPD, the social-democrats), the blue split the most natural minor coalition partner for the CDU/CSU since the Kohl era, the "centrist" liberals (FDP). For the yellow split, see here (in German, but there is a Google translate button).
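The coding and distance step described above can be sketched as follows. I assume here that "ordered" means a step from "disagree" to "agree" counts as 2 and a step to or from "neutral" counts as 1, so the mean distance ranges from 0 to 2. The party names and answers are invented, and the Neighbor-net itself would still be inferred from the resulting matrix in a program such as SplitsTree.

```python
from itertools import combinations

CODES = {"disagree": 0, "neutral": 1, "agree": 2}


def mean_ordered_distance(a, b):
    """Mean absolute difference between two equally long coded answer lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distance_matrix(answers):
    """Return a symmetric nested dict of pairwise political distances."""
    coded = {party: [CODES[r] for r in resp] for party, resp in answers.items()}
    dist = {p: {q: 0.0 for q in coded} for p in coded}
    for p, q in combinations(coded, 2):
        dist[p][q] = dist[q][p] = mean_ordered_distance(coded[p], coded[q])
    return dist


# Hypothetical answers to four statements.
answers = {
    "Party A": ["agree", "agree", "neutral", "disagree"],
    "Party B": ["agree", "neutral", "neutral", "agree"],
    "Party C": ["disagree", "disagree", "agree", "agree"],
}

dist = distance_matrix(answers)
for party, row in dist.items():
    print(party, row)
```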

Political Compasses (for orientation)

Another graphical approach is to use a "political compass" instead. The original can be found at The Political Compass. Parties or persons are placed along two absolute (in the case of the original) axes: an economic left-right x-axis and a social authoritarian-libertarian (in the classic, not US sense, i.e. socially liberal) y-axis. (I encourage everyone to do the test for themselves. I was not surprised to see where I stand in the compass, but others have been. But first do the test, before browsing The Political Compass' highly interesting pages.)

Here's what this looks like for the main German parties (currently six) that also won seats in the newly formed Bundestag, with some orientation points: (in)famous historical figures and the presidential run-offs in the U.S. (most of this blog's readers sit in the U.S.) and France (because I live there, but can't vote).

Overlay of several Political Compass assessments regarding the last major elections in Germany, France and the U.S. Grey dots, (in)famous figures that shaped the modern world; the main German parties are in full colours (all on the economic right, except for the Left Party, Die Linke, which is where social-democrats were in the 70s, when the European model of welfare states was fully implemented). The position of U.S. (both right-authoritarian) and French (relaxed choice between Hitler [fascism] and Friedman [neo-liberalism]) presidential run-offs is provided for comparison.

In Lower Saxony the "Niedersächsische Landeszentrale für politische Bildung", the state's analog of the BPB, hired a Dutch company to provide a compass ("Wahlkompass") linked to the Wahl-O-Mat questionnaire for the 2017 state parliament election.

After filling in the questionnaire, you would be placed in the relative compass, too. Note that (possibly to avoid giving due credit to The Political Compass) the y-axis has been flipped and modified to "progressiv" (progressive) and "konservativ" (conservative). Another reason may be that classifying parties as authoritarian is a bit tricky for a state-funded German institution for historical reasons.

The red marker indicates an all-neutral voter. The placement is a relative one, hence no grid.

The relative positions of the liberals (FDP), the right-wing populists of the "Alternative for Germany" AfD (blue symbol at the bottom), the CDU, SPD and Left Party (Linke) all agree with The Political Compass' assessment of their federal-level counterparts. However, the Green Party is placed much closer to the Left Party on the social y-axis. This has two possible reasons:
  1. The Political Compass bases its assessment on party programs and actual government politics, and the Greens are part of quite a few state governments, and are the major ruling party in Baden-Württemberg, Germany's economically strongest state.
  2. There can be a difference between progressive and libertarian. The Greens are progressive by supporting e.g. equal rights for women or LGBT and other aspects of modern society, but aim to achieve these goals by imposing legislation, which is authoritarian. On the other hand, conservatism – keeping the status-quo – is mutually linked to authoritarian politics. Any social movement will change society, or challenge the status-quo, and hence needs to be constrained or suppressed.
Another difference to the Wahl-O-Mat is that, similar to the questionnaire of The Political Compass, the Lower Saxony Wahlkompass allows six possible answers to each bullet point: "totally disagree" (which I scored as "0"), "disagree" (1), "neutral" (2), "agree" (3), "totally agree" (4), and "No opinion" (?). The latter is quite useful, and would be a useful add-on to the Wahl-O-Mat as well, because there is a difference between being neutral on a matter (could live with it) and having no opinion on it (don't bother). The more refined scale also allows us to treat the answers as ordered multistate characters when inferring the distance matrix, resulting in a more resolved network.

This is shown in the next figure.

Neighbour-net based on the Niedersachsen Wahlkompass questionnaire (full post, in German).
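For the five-point scale with a "No opinion" option, the distance needs a rule for missing data. The sketch below assumes pairwise deletion: a bullet point is skipped for a given pair of parties if either one answered "?". The function name and the example answers are hypothetical.

```python
def ordered_distance(a, b, missing="?"):
    """Mean absolute difference over bullet points answered by both parties.

    Answers are the strings "0" to "4" or the missing-data symbol.
    Returns None if the two parties share no answered bullet point.
    """
    pairs = [
        (int(x), int(y))
        for x, y in zip(a, b)
        if x != missing and y != missing
    ]
    if not pairs:
        return None
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


# Hypothetical answers to five bullet points.
party_a = ["4", "?", "0", "1", "2"]
party_b = ["0", "3", "?", "1", "4"]

print(ordered_distance(party_a, party_b))
# Two parties that strongly disagree on every shared point reach the maximum.
print(ordered_distance(["0", "4", "?"], ["4", "0", "2"]))
```

Under this rule, the distance is always between 0 and 4, regardless of how many bullet points a party left unanswered.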

As you can see, the political-distance-based Neighbor-net splits graph captures the similarity of the political parties to each other quite well. Now the only thing left to do is to add yourself (as a voter or interested third party) to the matrix and then re-infer the Neighbor-net. The basic files to do so (NEXUS-formatted matrices) for this, upcoming (Bavaria, Hesse), and future elections can be found on figshare.
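If you have already computed distances that include your own answers, one way to get them into a network program is a NEXUS file with a distance block. The sketch below is my own helper, not the format of the figshare files; the block layout follows my understanding of the DISTANCES block read by SplitsTree, so check it against the program's documentation before relying on it. The distances are invented.

```python
def to_nexus_distances(dist):
    """Format a symmetric nested distance dict as NEXUS TAXA + DISTANCES blocks."""
    labels = list(dist)
    # Replace spaces so that each label is a single NEXUS token.
    safe = [label.replace(" ", "_") for label in labels]
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"  DIMENSIONS NTAX={len(labels)};",
        "  TAXLABELS " + " ".join(safe) + ";",
        "END;",
        "BEGIN DISTANCES;",
        f"  DIMENSIONS NTAX={len(labels)};",
        "  FORMAT TRIANGLE=BOTH DIAGONAL LABELS;",
        "  MATRIX",
    ]
    for label, name in zip(labels, safe):
        row = " ".join(f"{dist[label][other]:.4f}" for other in labels)
        lines.append(f"  {name} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical distances between a voter and two parties.
dist = {
    "Voter": {"Voter": 0.0, "Party A": 0.5, "Party B": 1.25},
    "Party A": {"Voter": 0.5, "Party A": 0.0, "Party B": 1.0},
    "Party B": {"Voter": 1.25, "Party A": 1.0, "Party B": 0.0},
}

nexus = to_nexus_distances(dist)
print(nexus)
```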

Comparing different elections

As a federal state, Germany has a long tradition of within-party diversity. Most commonly known is that the "Schwesterparteien" (sister parties) CSU and CDU disagree on quite a few points. The CSU is endemic to Bavaria, while the CDU covers the rest of Germany, including the former East Germany (see also my post [in English] on German and French party genealogies after World War II). Hence, they are treated separately by The Political Compass for the 2017 election. The CSU is in general (much) less neo-liberal than the CDU (placed left of it), but (often) more authoritarian, cultivating conservative views. But neither is the CDU a homogeneous formation when compared from state to state, nor are any of the other parties. The following splits graphs, based on the various Wahl-O-Mat questionnaires, illustrate this quite well.

Let's start with the upcoming state elections in Bavaria and Hesse. Here are the two Neighbor-nets.

Reduced Neighbour-nets for Bavaria and Hesse. Parties competing only in one of the states not included.

We note that some parties keep their position relative to each other. For example, the most severe political antagonists in both states are the Left P. (left-libertarian) and the LKR (distinctly right-authoritarian; political distance PD > 1.5).

The latter is a small party collecting the original founder(s) of the AfD. The AfD is usually described as a (far-)right populist party, but started as a Euro-sceptic conservative and distinctly neo-liberal party. This is well captured in the splits graphs, with the LKR placed either as sister to the Bavarian (less neo-liberal) CSU or at a box connecting the (less authoritarian) CDU with the (more left) AfD. Other small parties (Humanist Party, the animal-rights party P!MUT, and the ÖDP, a conservative-green party) are equally stable.

The "right" is more tree-like in Bavaria than in Hesse because the so far all-ruling CSU tries (tried) to follow an old maxim of Franz-Josef Strauß, who said that there should never be a political party right (i.e. more conservative and nationalist) of the CSU in the Bavarian parliament — hence, it is much more similar to the right-wing populist AfD than the Hesse CDU.

In Hesse, the CDU ruled the state for the last four years with the Greens, which explains the position of the Green Party in both graphs. Being the opposition, and strongly opposing CSU policies (both economically and socially), they are much closer to the Left Party in Bavaria, while occupying a position between their coalition partner CDU and the "left" parties (Left P., SPD) in Hesse.

In Hesse, the Green Party effectively takes the position that in Bavaria is filled by the Pirate Party. The latter had a surge a couple of years ago, entering several state parliaments, but is now back to 2% or less. With the Greens moving right, the Pirate Party Hesse remains more similar to the classical "left" of the political spectrum.

Another jumper is "Die PARTEI". This is hardly surprising, because they answer some questions in the Wahl-O-Mat by flipping a coin, or select the answer allowing them to come up with the most satirical arguments for their choice (sometimes not so different from those of certain party policies!).

Compared to the last federal election, the federal-state discrepancy in official party policies is striking, and this is well represented in their answers to the Wahl-O-Mat questionnaires.

Same-scaled, taxon-pruned Neighbour-nets. The "Big-6" (7 in Bavaria), parties either already sitting in the parliaments or with a chance to crack the 5%-hurdle in upcoming elections, in bold. Arrows indicate current ruling coalitions/government parties.

Being a frequent junior partner of the CDU/CSU, but the opposition in Bavaria (for decades) and Hesse (once the dominant party), the federal SPD is drawn much more to the "right" than its state counterparts. But this holds also for the federal CDU in the opposite way, and hence the FDP becomes the closest (still distant) "relative" of the AfD, which campaigned in 2017 with a more neo-liberal program than it does now in Bavaria and Hesse (a necessity for populist parties, as everyone likes free stuff).

The "blue-green" ÖDP comes closer to the Greens, because ecology-related bullet points took a more prominent place in the federal election Wahl-O-Mat. The "net-gap" in between them, and the edges shared by the ÖDP with the AfD or other parties of the "right" (FW, CDU/CSU, FDP), highlight their differences in social policies.

In Lower Saxony fewer parties competed, so let's prune the taxon set further. The Lower Saxony Neighbor-net has a different scale, because a more differentiated answer was possible. If two parties opposed each other on all points, the maximum theoretically possible distance between them in the Lower Saxony matrix would be 4, i.e. they would strongly disagree on all bullet points that have no missing data for either one.

Again, parties in (last year's elections) or with chances to enter parliament (upcoming) in bold, and arrows indicating current or leaving government parties/coalitions.

Note how the Green Party and the SPD are placed with respect to the third main party from the traditional "left", the Left Party, and the FDP in comparison to CDU/CSU and AfD, forming the parliamentary "right". In Lower Saxony, the largest (SPD) and second-largest (CDU) party followed the example of the Bund. The outgoing SPD-Greens coalition lost its tight majority; and although a CDU-FDP-AfD coalition would have had a majority and quite an overlap, involving the AfD in governments has been considered a no-go in Germany to this point (for all involved parties for different reasons).

Also in the Bundestag, the "right" would have a majority, but the SPD is close enough, and obviously Merkel's preferred partner. The polls for the Sunday elections (yesterday, when you read this) predict the CSU will lose its absolute majority. Also, here the natural partner (AfD) will be a no-go, so Bavaria will head towards interesting coalition talks with the Greens, being second in the polls. This would be the first time since 1958. The black-green Hesse government is also likely to lose its majority. However, adding the FDP (called "Jamaica coalition", because of the traditional colors of the three parties) should be no great deal, given its position between the current coalition partners.


The post introducing Neighbor-nets to explore Wahl-O-Mat questionnaires can be found here.

More infographics (including plots of each bullet point on the splits graphs) revolving around political distances expressed in election questionnaires, or politics in general, can be found in my Res.I.P. posts — flagged as "Bundestagswahl" (federal elections, in German or English), "Landtagswahlen" (usually in German), "phylo-networks" (usually in English) and "politics" (again mixed).

Related data are included in a figshare fileset (open data; CC-BY licence), which may get updated when another election happens.

Monday, October 8, 2018

A proper network of Europeans

Back in May this year, Iosif Lazaridis submitted a paper to the arXiv, called: "The evolutionary history of human populations in Europe". It is now online as part of the December 2018 issue of Current Opinion in Genetics & Development (53: 21-27).

Its interest for readers of this blog is the one and only figure that the paper contains. It is a genealogical network, showing the obvious — that the human "family tree" has quite a few reticulations, mostly due to introgression (or admixture, as human geneticists like to call it). Here is the figure, along with the legend. Note that not all of the edges in the network have a direction, so that it is not really a directed acyclic graph (see also First-degree relationships and partly directed networks).

A sketch of European evolutionary history based on ancient DNA
Bronze Age Europeans (~4.5-3kya) were a mixture of mainly two proximate sources of ancestry: (i) the Neolithic farmers of ~8-5kya who were themselves variable mixtures of farmers from Anatolia and hunter-gatherers of mainland Europe (WHG), and (ii) Bronze Age steppe migrants of ~5kya who were themselves a mixture of hunter-gatherers of eastern Europe (EHG) and southern populations from the Near East. Thus, we only have to go ~8 thousand years backwards in time to find at least four sources of ancestry for Europeans. But, each of these sources was also admixed: European hunter-gatherers received genetic input from Siberia and ultimately also from archaic Eurasians, and Near Eastern populations interacted in unknown ways with Europe and Siberia and also had ancestry from ‘Basal Eurasians’, a sister group of the main lineage of all other non-African populations. Dates correspond to sampled populations; in the case of a cluster of populations (such as the WHG), they correspond to the earliest attestation of the group.

Monday, October 1, 2018

Which airlines are the best?

Scientists are known to get about a bit. They attend conferences and give workshops, they go on sabbatical, and sometimes they even have holidays. Many of these activities require them to be in other places than their home city; and to get there they often resort to air travel. This makes it of interest to them to know which airlines are considered to be "good". Scientists may not have much choice about which airlines they can choose to fly, depending on where they live, but they can at least try to fly on one of the good ones.

They are not alone in this desire, and so inevitably there are web sites that provide the necessary information. These include AirHelp Airline Worldwide Rankings; but the best-known listing is the annual one from Skytrax, a UK-based consumer aviation agency.

Each year, Skytrax conducts a survey in which "airline customers around the world" vote for the best airline. The survey results are released at the beginning of each year, and they thus refer to the previous year's survey. Skytrax note that "over 275 airlines were featured in the [current] customer survey but we only feature the top 100 listing."

The Skytrax top-100 data currently exist online for the years 2012-2018 inclusive, which cover the years 2011-2017. It can be useful to consider data for multiple years, because some airlines have greatly improved their ranking through time, while others have slipped back. There are 80 airlines with top-100 data for each of the years 2011-2017, and another 45 airlines that have appeared in the top 100 at least once. A few airlines have also merged during these years.

We can explore the multi-year data for the 80 airlines using a network analysis, to visualize the overall pattern. I first calculated the pairwise Manhattan distances between the airlines, and then plotted these using a NeighborNet graph, as shown in the figure below. Airlines that have similar rankings across the years are near each other in the network; and the further apart they are in the network, the more different their overall rankings are.

As you can see, this is pretty much a linear network, with the best-ranked airlines at the top-right, and then continuing down to the bottom-left. A simple list of the average rankings across the years would be almost as informative. In particular, the top-ranked airlines have remained at the top across the years; and it is only in the middle and especially at the bottom that there has been movement among the rankings (that is, the network broadens out at one spot in the middle and then again at the end).
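The distance step, and the simple average that would be "almost as informative", can be sketched like this. The airlines, years and ranks are invented, and the helper names are my own. As in the analysis above, only airlines with a rank in every year are kept.

```python
from itertools import combinations


def complete_cases(rankings, years):
    """Keep only airlines with a rank in every year; return ranks as lists."""
    return {
        airline: [by_year[y] for y in years]
        for airline, by_year in rankings.items()
        if all(y in by_year for y in years)
    }


def manhattan(a, b):
    """Sum of absolute rank differences across the years."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical ranks, not the Skytrax data.
years = [2015, 2016, 2017]
rankings = {
    "Airline A": {2015: 1, 2016: 2, 2017: 1},
    "Airline B": {2015: 3, 2016: 1, 2017: 2},
    "Airline C": {2015: 90, 2016: 85, 2017: 95},
    "Airline D": {2016: 40, 2017: 38},  # missing 2015, so it is dropped
}

kept = complete_cases(rankings, years)
distances = {
    (a, b): manhattan(kept[a], kept[b]) for a, b in combinations(kept, 2)
}
average = {airline: sum(r) / len(r) for airline, r in kept.items()}

print(distances)
print(sorted(average, key=average.get))  # best average rank first
```

The resulting distance matrix is what would be passed on to a NeighborNet analysis.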

Note that the top end of the list consists mainly of airlines from the Middle East and Asia. Australia has only two airlines, both of which do well in the network, along with the only one from New Zealand. The presence near the top of both Turkish Airlines and Garuda Indonesia may surprise some people.

You will also note that the US airlines are generally closer to the bottom of the network — they are marked in red in the network. The airlines from China are mostly there, also (except Hainan Airlines). It is not a coincidence that neither of the world's two biggest economies runs a high-quality airline. It seems that the only way to do this is actually to rely on government subsidies, which is how most of the top-ranked airlines are doing it.

There are also few discount airlines that make it into the top 50. Put simply, the economics of running an all-economy-class plane do not allow much in the way of customer service (see How Budget Airlines Work). It is actually the first-class and business-class passengers on any given plane that allow it to take off at all, in terms of making money for the airline. This is a classic example of the 80/20 rule: 80% of the money comes from 20% of the passengers (see The Economics of Airline Class).

Finally, in a similar vein, you could also contemplate the sites pertaining to airport quality (e.g. AirHelp Airport Worldwide Rankings, World Airport Awards), as well as the Guide to Sleeping in Airports. There are also sites that tell you which seats to choose in any given plane (e.g. SeatGuru).