Monday, October 22, 2018

Controversies about structural data in historical linguistics

In the past, there have been many controversies about structural data, that is, the kind of data I introduced in the post written last month. Given the misinterpretation of structural data as being "grammatical", along with the unproven and misleading claim by Nichols (2003) that certain grammatical features are more stable than lexical ones, one can often read about a controversy in linguistics: which aspects are more stable, and therefore more useful to study deep linguistic relationships, the lexicon or the grammar?

In this context, it is often ignored that we are not talking chiefly about the grammar when applying phylogenetic studies to structural datasets. It is also ignored that the original idea of the importance of "grammar" was pointing to homologies in complex and concrete morphological paradigms (i.e., individual word forms, that is, predominantly lexical traits), as has been most prominently discussed by Meillet (1925) and later popularized by Nichols (1996). "Grammar" never pointed to abstract similarities as they are captured in most structural datasets (see the excellent discussion by Dybo and Starostin).

"Grammar" as evidence for deep language relations

Leading scholars in historical linguistics have provided convincing arguments that genetic relationships among languages can only be demonstrated by illustrating regular sound correspondences in concrete form-meaning pairs across the languages under investigation (see especially the very good analysis by Campbell and Poser 2008). In spite of this, the rumor that "grammar" (i.e., structural datasets) might provide a shortcut to detect deep, so far unnoticed, relationships among the languages of the world is very persistent, as reflected in many different studies.

Among the examples, Dunn et al. (2008) claimed that language relationships for Papuan languages of Island Melanesia could be uncovered by means of phonological and grammatical (abstract) structural features; and Longobardi et al. (2015) used syntactic features to compare the development of European languages with the development of European populations. Zhang et al. (2018) used phonological inventories of more than 100 different Chinese dialects, coding the data for simple presence and absence of each of the more than 200 different sounds in the database, and analyzing the data with the STRUCTURE software (Pritchard et al. 2000), whose results tend to be notoriously misinterpreted.

What is important about these studies is that none of them (maybe with the exception of the study by Dunn et al. 2008, but I am in no position to actually judge the findings) could make a convincing case for why the structural datasets would provide evidence of deeper relationships than the lexicon could. Even the study by Dunn et al., which tests the suitability of their small questionnaire of only 115 structural traits on Oceanic languages, has since then not led to any new insights into so far undetected language relationships, contrary to the hope expressed by the authors, "that structural phylogeny is an important new tool for exploring historical relationships between languages" (ibid. 734).

Structural data as a shortcut?

Some scholars who work on structural datasets may find my claims harsh and unjustified. In fact, there are studies that seem to provide evidence that structural datasets perform similarly or equally well compared to phylogenetic methods based on lexical data.

For example, Longobardi et al. (2016) carry out experiments on structural data of phoneme inventories, syntactic features, and "traditional" cognate sets for very small Indo-European datasets, concluding that all of the datasets yield similar results, and that syntactic or phonological features in structural datasets could be used instead of lexical phylogenies.

Contrary to this, Greenhill et al. (2017) also experiment on lexical datasets in comparison with structural data for 81 Austronesian languages, but they find that, in general, lexical data is much more stable than structural data, although some structural features seem to be similar to lexical items regarding their stability.

A wish list for future tests

I see two major problems in the debate about the usefulness of structural data in historical linguistics.

First, the studies that confirm that structure might work equally well compared with lexical data are all based on small samples of one specific language family that was analyzed based on very diverse features that were specifically designed to study the languages in question. For me, a true test that some features carry deep historical signal would need to be illustrated for a large set of related and unrelated languages, not just for selected datasets.

Furthermore, to allow for an honest comparison with the lexicon, the selection of features should not contain any lexical characters or characters that could only be extracted with the help of lexical characters. Thus, asking whether the words for "fish", "I", and "five" are pronounced similarly in a language would not be allowed in such a feature collection, because this would follow lexical criteria, and we know very well that this property is a very good proxy for identifying Sino-Tibetan languages (Handel 2008).

Second, and more problematic, is the fact that structural datasets do not provide information on the relatedness of the traits under comparison. While this is no problem for typologists who study shared structural features out of interest in universal tendencies in the languages of the world, it is a problem for the application of phylogenetic software, since the typical approaches in biology treat homoplasy as an exception, while it may often be rather the norm than an exception in structural datasets.


In order to make structural data suitable for historical analyses, much more research needs to be carried out, including specifically a much more thorough study of parallel evolution and geographic convergence (due to language contact) in different language families of the world; a nice illustration for the Indo-European languages is provided by Cathcart et al. (2018).

I would be happy for our field if such research could reveal markers of deep genetic ancestry in the languages of the world, and help us to push the boundaries of linguistic reconstruction. For the time being, however, I remain highly skeptical, especially when scholars try to demonstrate the suitability of "grammatical" comparison with small datasets and idiosyncratically selected feature sets that are not comparable across datasets.


Campbell, L. and W. Poser (2008) Language Classification: History and Method. Cambridge University Press: Cambridge.

Cathcart, C., G. Carling, F. Larson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dunn, M., S. Levinson, E. Lindstroem, G. Reesink, and A. Terrill (2008) Structural phylogeny in historical linguistics: methodological explorations applied in Island Melanesia. Language 84.4: 710-759.

Dybo, A. and G. Starostin (2008) In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.) Aspekty komparativistiki 3. RGGU: Moscow, pp 119-258.

Greenhill, S., C. Wu, X. Hua, M. Dunn, S. Levinson, and R. Gray (2017) Evolutionary dynamics of language systems. Proceedings of the National Academy of Sciences 114.42: E8822-E8829.

Handel, Z. (2008) What is Sino-Tibetan? Snapshot of a field and a language family in flux. Language and Linguistics Compass 2.3: 422-441.

Longobardi, G., S. Ghirotto, C. Guardiano, F. Tassi, A. Benazzo, A. Ceolin, and G. Barbujani (2015) Across language families: Genome diversity mirrors linguistic variation within Europe. American Journal of Physical Anthropology 157.4: 630-640.

Longobardi, G., A. Buch, A. Ceolin, A. Ecay, C. Guardiano, M. Irimia, D. Michelioudakis, N. Radkevich, and G. Jaeger (2016) Correlated Evolution Or Not? Phylogenetic Linguistics With Syntactic, Cognacy, And Phonetic Data. In: The Evolution of Language: Proceedings of the 11th International Conference (EVOLANGX11).

Meillet, A. (1954) La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.

Nichols, J. (1996) The comparative method as heuristic. In: Durie, M. (ed.) The Comparative Method Reviewed. Oxford University Press: New York, pp 39-71.

Nichols, J. (2003) Diversity and stability in language. In: Joseph, B. and R. Janda (eds.) The Handbook of Historical Linguistics. Blackwell: Malden, Mass, pp 283-310.

Pritchard, J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Monday, October 15, 2018

Jumping political parties in Germany's state elections

In one of last year's posts, I showed a neighbour-net for the parties competing in the national election based on political distances inferred from the Wahl-O-Mat questionnaire (A network of political parties competing for the 2017 Bundestag). But Germany is a federal state, and since then, there has been a state election in Lower Saxony, and soon there will be two more, in Bavaria and Hesse. This is a good opportunity to make some network-based comparisons.

It is important to note that there are many political parties in Germany, not just two or three major parties, as in most English-speaking countries. State parliaments can therefore be composed of quite different mixtures of these groups.

The questionnaire

The Wahl-O-Mat is a political information service provided by the BPB, the "Bundeszentrale für politische Bildung". A group of youngsters assisted by scientists puts together a questionnaire of political theses (bullet points), which is sent out to the political parties competing in an election. When participating, as most parties do, they can choose "agree", "disagree" or "neutral" for each statement.

As a voter, you can fill in the same questionnaire, mark some of the questions as "high importance" (which will be weighted more strongly), and then choose (up to) eight parties for your personal comparison. The result will be a bar chart, showing you the percentage of your personal overlap with each of the parties. The BPB usually provides this service for all federal and state elections.

The problem I always have with this approach is that you don't get any graphical summary information about how the parties agree or disagree with each other to start from. In the worst case scenario, you could have 75% overlap with each of two parties who disagree with each other for 50% of the bullet points!
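The worst case is easy to reproduce with a toy example. The following minimal sketch uses four made-up statements and invented answers (none of them taken from a real Wahl-O-Mat), and it counts simple agreement only, ignoring "neutral" answers and importance weights.

```python
def overlap(a, b):
    """Share of statements on which two answer lists coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical answers to four statements (1 = agree, 0 = disagree).
party_a = [1, 1, 1, 1]
party_b = [1, 1, 0, 0]
voter = [1, 1, 1, 0]

print(overlap(voter, party_a))    # 0.75
print(overlap(voter, party_b))    # 0.75
print(overlap(party_a, party_b))  # 0.5
```

The voter's bar chart would show 75% for both parties, with no hint that the two parties disagree with each other on half of the statements.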

A straightforward solution to this shortcoming is to: code the questionnaire as a ternary matrix (0 = "disagree", 1 = "neutral", 2 = "agree"), treat them as ordered characters and determine the mean pairwise (Hamming) distances, and then infer a Neighbor-net based on the resulting distance matrix.

This is shown in the first figure, where each labeled point is one of the political parties. The two political extremes are also labeled.

The neighbour-net for the 2017 federal election Wahl-O-Mat questionnaire (original GWoN post from last year; for those interested in further comments, extrapolations, and infographics, see related posts on my Res.I.P blog). The red split denotes the outgoing and new coalition parties (Merkel's "centre-right" CDU/CSU + "centre-left" SPD, the social-democrats), the blue split the most natural minor coalition partner for the CDU/CSU since the Kohl era, the "centrist" liberals (FDP). For the yellow split, see here (in German, but there is a Google translate button).
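The distance step of the approach described above (ternary coding, ordered characters, mean pairwise distances) can be sketched in a few lines. This is only an illustration under stated assumptions: the three parties and their answers are invented, the function names are mine, and the Neighbor-net inference itself is left to dedicated network software.

```python
from itertools import combinations

# Hypothetical answers of three made-up parties to five statements,
# coded as in the text: 0 = "disagree", 1 = "neutral", 2 = "agree".
answers = {
    "Party A": [2, 2, 0, 1, 2],
    "Party B": [2, 0, 0, 1, 0],
    "Party C": [0, 0, 2, 1, 0],
}

def mean_ordered_distance(a, b):
    """Mean absolute difference between two equally long answer lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def distance_matrix(coded):
    """Symmetric matrix of pairwise distances, keyed by party name."""
    dist = {p: {q: 0.0 for q in coded} for p in coded}
    for p, q in combinations(coded, 2):
        dist[p][q] = dist[q][p] = mean_ordered_distance(coded[p], coded[q])
    return dist

distances = distance_matrix(answers)
for p, q in combinations(answers, 2):
    print(p, q, distances[p][q])
```

Because the characters are treated as ordered, "agree" versus "disagree" counts as 2 and "neutral" versus either counts as 1, so the distance between two parties ranges from 0 to 2.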

Political Compasses (for orientation)

Another graphical approach is to use a "political compass", instead. The original can be found at The Political Compass. Parties or persons are placed along two absolute (in the case of the original) axes: an economic left-right x-axis and a social authoritarian-libertarian (in the classic, not US sense, i.e. socially liberal) y-axis. (I encourage everyone to do the test for themselves. I was not surprised to see where I stand in the compass, but others have been. But first do the test, before browsing The Political Compass' highly interesting pages.)

Here's what this looks like for the main German parties (currently six) that also got seats in the newly formed Bundestag, with some orientation points: (in)famous historical figures and the presidential run-offs in the U.S. (most of this blog's readers sit in the U.S.) and France (because I live there, but can't vote).

Overlay of several Political Compass assessments regarding the last major elections in Germany, France and the U.S. Grey dots, (in)famous figures that shaped the modern world; the main German parties are in full colours (all on the economic right, except for the Left Party, Die Linke, which is where social-democrats were in the 70s, when the European model of welfare states was fully implemented). The position of U.S. (both right-authoritarian) and French (relaxed choice between Hitler, fascism, and Friedman, neo-liberalism) presidential run-offs is provided for comparison.

In Lower Saxony the "Niedersächsische Landeszentrale für politische Bildung", the state's analog of the BPB, hired a Dutch company to provide a compass ("Wahlkompass") linked to the Wahl-O-Mat questionnaire for the 2017 state parliament election.

After filling in the questionnaire, you would be placed in the relative compass, too. Note that (possibly to avoid giving due credit to The Political Compass) the y-axis has been flipped and modified to "progressiv" (progressive) and "konservativ" (conservative). Another reason may be that classifying parties as authoritarian is a bit tricky for a state-funded German institution for historical reasons.

The red marker indicates an all-neutral voter. The placement is a relative one, hence no grid.

The relative positions of the liberals (FDP), the right-wing populists of the "Alternative for Germany" AfD (blue symbol at the bottom), the CDU, SPD and Left Party (Linke) all agree with The Political Compass' assessment of their federal-level counterparts. However, the Green Party is placed much closer to the Left Party on the social y-axis. This has two possible reasons:
  1. The Political Compass bases its assessment on party programs and actual government politics, and the Greens are part of quite a few state governments, and are the major ruling party in Baden-Württemberg, Germany's economically strongest state.
  2. There can be a difference between progressive and libertarian. The Greens are progressive by supporting e.g. equal rights for women or LGBT and other aspects of modern society, but aim to achieve these goals by imposing legislation, which is authoritarian. On the other hand, conservatism – keeping the status-quo – is mutually linked to authoritarian politics. Any social movement will change society, or challenge the status-quo, and hence needs to be constrained or suppressed.
Another difference to the Wahl-O-Mat is that – similar to the questionnaire of The Political Compass – the Lower Saxony Wahlkompass allows six possible answers to each bullet point: "totally disagree" (which I scored as "0"), "disagree" (1), "neutral" (2), "agree" (3), "totally agree" (4), and "No opinion" (?). The latter is quite useful, and would also be a useful add-on to the Wahl-O-Mat, because there is a difference between being neutral on a matter (could live with it) and having no opinion on it (don't bother). The more refined scale also allows us to treat the answers as ordered multistate characters when inferring the distance matrix, resulting in a more resolved network.

This is shown in the next figure.

Neighbour-net based on the Niedersachsen Wahlkompass questionnaire (full post, in German).
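For readers who want to experiment with the five-step coding, here is a minimal sketch of a pairwise distance that skips "No opinion" answers. The answers are invented, the function name is hypothetical, and dropping every statement that either party left unanswered is my assumption about how missing data could be handled, not a description of the software used for the figure.

```python
def wahlkompass_distance(a, b, missing="?"):
    """Mean absolute difference over statements both parties answered.

    Answers are coded 0-4 as in the text; `missing` marks "No opinion".
    Returns None if the two parties share no answered statement.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        return None
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

# Hypothetical answers of three made-up parties to four statements.
party_x = [4, 0, "?", 4]
party_y = [0, 4, 2, 0]
party_z = [3, 1, 2, "?"]

print(wahlkompass_distance(party_x, party_y))  # 4.0
print(wahlkompass_distance(party_x, party_z))  # 1.0
print(wahlkompass_distance(party_y, party_z))  # 2.0
```

The first pair shows the upper end of the scale: two parties that take opposite extreme positions on every statement they both answered end up at a distance of 4.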

As you can see, the political-distance-based Neighbor-net splits graph captures the similarity of the political parties to each other quite well. Now the only thing left to do is to add yourself (as a voter or interested third party) to the matrix and then re-infer the Neighbor-net. The basic files to do so (NEXUS-formatted matrices) for this, upcoming (Bavaria, Hesse), and future elections can be found on figshare.
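If you prefer to compute the distances yourself after adding your own answers, the result can be written out as a NEXUS file with a distances block. The sketch below is a generic example: the layout of the figshare files is not described in this post, so the block structure, the function name, and the toy labels and values are assumptions for illustration only.

```python
def to_nexus(labels, matrix):
    """Return a NEXUS string with a TAXA block and a DISTANCES block."""
    n = len(labels)
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"  DIMENSIONS NTAX={n};",
        "  TAXLABELS " + " ".join(f"'{name}'" for name in labels) + ";",
        "END;",
        "BEGIN DISTANCES;",
        f"  DIMENSIONS NTAX={n};",
        "  FORMAT TRIANGLE=BOTH DIAGONAL LABELS;",
        "  MATRIX",
    ]
    # One row per taxon: quoted label followed by the full row of distances.
    for name, row in zip(labels, matrix):
        lines.append(f"  '{name}' " + " ".join(f"{d:.4f}" for d in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"

# Hypothetical example: two made-up parties plus the voter.
labels = ["Party A", "Party B", "Voter"]
matrix = [
    [0.0, 1.6, 0.4],
    [1.6, 0.0, 1.2],
    [0.4, 1.2, 0.0],
]
nexus_text = to_nexus(labels, matrix)
print(nexus_text)
```

Quoting the labels keeps party names with spaces intact; whether a given program accepts exactly this dialect should be checked against its documentation.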

Comparing different elections

As a federal state, Germany has a long tradition of within-party diversity. Best known is that the "Schwesterparteien" (sister parties) CSU and CDU disagree on quite a few points. The CSU is endemic to Bavaria, while the CDU covers the rest of Germany, including the former East Germany (see also my post [in English] on German and French party genealogies after World War II). Hence, they are treated separately by The Political Compass for the 2017 election. The CSU is in general (much) less neo-liberal than the CDU (placed left of it), but (often) more authoritarian, cultivating conservative views. But neither is the CDU a homogeneous formation when compared from state to state, nor are any of the other parties. The following splits graphs, based on the various Wahl-O-Mat questionnaires, illustrate this quite well.

Let's start with the upcoming state elections in Bavaria and Hesse. Here are the two Neighbor-nets.

Reduced Neighbour-nets for Bavaria and Hesse. Parties competing only in one of the states not included.

We note that some parties keep their position relative to each other. For example, the most severe political antagonists in both states are the Left P. (left-libertarian) and the LKR (distinctly right-authoritarian; political distance PD > 1.5).

The latter is a small party collecting the original founder(s) of the AfD. The AfD is usually described as a (far-)right populist party, but started as a Euro-sceptic conservative and distinctly neo-liberal party. This is well captured in the splits graphs, with the LKR placed either as sister to the Bavarian (less neo-liberal) CSU or at a box connecting the (less authoritarian) CDU with the (more left) AfD. Other small parties (Humanist Party, the animal-rights party P!MUT, and the ÖDP, a conservative-green party) are equally stable.

The "right" is more tree-like in Bavaria than in Hesse because the so far all-ruling CSU tries (tried) to follow an old maxim of Franz-Josef Strauß, who said that there should never be a political party right (i.e. more conservative and nationalist) of the CSU in the Bavarian parliament — hence, it is much more similar to the right-wing populist AfD than the Hesse CDU.

In Hesse, the CDU ruled the state for the last four years with the Greens, which explains the position of the Green Party in both graphs. Being the opposition, and strongly opposing CSU policies (both economically and socially), they are much closer to the Left Party in Bavaria, while occupying a position between their coalition partner CDU and the "left" parties (Left P., SPD) in Hesse.

In Hesse, the Green Party effectively takes the position that in Bavaria is filled by the Pirate Party (the latter had a surge a couple of years ago, entering several state parliaments, but is now back to 2% or less). With the Greens moving right, the Pirate Party in Hesse remains more similar to the classical "left" of the political spectrum.

Another jumper is "Die PARTEI". This is hardly surprising, because they answer some questions in the Wahl-O-Mat by flipping a coin, or select the answer allowing them to come up with the most satirical arguments for their choice (sometimes not so different from those of certain party policies!).

Compared to the last federal election, the federal-state discrepancy in official party policies is striking, and this is well represented in their answers to the Wahl-O-Mat questionnaires.

Same-scaled, taxon-pruned Neighbour-nets. The "Big-6" (7 in Bavaria), parties either already sitting in the parliaments or with a chance to crack the 5% hurdle in upcoming elections, are in bold. Arrows indicate current ruling coalitions/government parties.

Being a frequent junior partner of the CDU/CSU, but the opposition in Bavaria (for decades) and Hesse (where it was once the dominant party), the federal SPD is drawn much more to the "right" than its state counterparts. But this also holds for the federal CDU in the opposite direction, and hence the FDP becomes the closest (still distant) "relative" of the AfD, which campaigned in 2017 with a more neo-liberal program than it does now in Bavaria and Hesse (a necessity for populist parties, as everyone likes free stuff).

The "blue-green" ÖDP comes closer to the Greens, because ecology-related bullet points took a more prominent place in the federal election Wahl-O-Mat. The "net-gap" in between them, and the edges shared by the ÖDP with the AfD or other parties of the "right" (FW, CDU/CSU, FDP), highlight their differences in social policies.

In Lower Saxony fewer parties competed, so let's prune the taxon set further. The Lower Saxony Neighbor-net has a different scale, because a more differentiated answer was possible. If two parties opposed each other on all points, the theoretical maximum distance between them in the Lower Saxony matrix would be 4, i.e. they would strongly disagree on all bullet points that have no missing data for either one.

Again, parties in (last year's elections) or with chances to enter parliament (upcoming) in bold, and arrows indicating current or leaving government parties/coalitions.

Note how the Green Party and the SPD are placed with respect to the third main party from the traditional "left", the Left Party, and the FDP in comparison to CDU/CSU and AfD, forming the parliamentary "right". In Lower Saxony, the largest (SPD) and second-largest (CDU) party followed the example of the Bund. The outgoing SPD-Greens coalition lost its tight majority; and although a CDU-FDP-AfD coalition would have had a majority and quite an overlap, involving the AfD in governments has been considered a no-go in Germany to this point (for all involved parties for different reasons).

Also in the Bundestag, the "right" would have a majority, but the SPD is close enough, and obviously Merkel's preferred partner. The polls for the Sunday elections (yesterday, when you read this) predict the CSU will lose its absolute majority. Also, here the natural partner (AfD) will be a no-go, so Bavaria will head towards interesting coalition talks with the Greens, being second in the polls. This would be the first time since 1958. The black-green Hesse government is also likely to lose its majority. However, adding the FDP (called "Jamaica coalition", because of the traditional colors of the three parties) should be no great deal, given its position between the current coalition partners.


The post introducing Neighbor-nets to explore Wahl-O-Mat questionnaires can be found here.

More infographics (including plots of each bullet point on the splits graphs) revolving around political distances expressed in election questionnaires, or politics in general, can be found in my Res.I.P. posts — flagged as "Bundestagswahl" (federal elections, in German or English), "Landtagswahlen" (usually in German), "phylo-networks" (usually in English) and "politics" (again mixed).

Related data are included in a figshare fileset (open data; CC-BY licence), which may get updated when another election happens.

Monday, October 8, 2018

A proper network of Europeans

Back in May this year, Iosif Lazaridis submitted a paper to the arXiv, called: "The evolutionary history of human populations in Europe". It is now online as part of the December 2018 issue of Current Opinion in Genetics & Development (53: 21-27).

Its interest for readers of this blog is the one and only figure that the paper contains. It is a genealogical network, showing the obvious — that the human "family tree" has quite a few reticulations, mostly due to introgression (or admixture, as human geneticists like to call it). Here is the figure, along with the legend. Note that not all of the edges in the network have a direction, so that it is not really a directed acyclic graph (see also First-degree relationships and partly directed networks).

A sketch of European evolutionary history based on ancient DNA
Bronze Age Europeans (~4.5-3kya) were a mixture of mainly two proximate sources of ancestry: (i) the Neolithic farmers of ~8-5kya who were themselves variable mixtures of farmers from Anatolia and hunter-gatherers of mainland Europe (WHG), and (ii) Bronze Age steppe migrants of ~5kya who were themselves a mixture of hunter-gatherers of eastern Europe (EHG) and southern populations from the Near East. Thus, we only have to go ~8 thousand years backwards in time to find at least four sources of ancestry for Europeans. But, each of these sources was also admixed: European hunter-gatherers received genetic input from Siberia and ultimately also from archaic Eurasians, and Near Eastern populations interacted in unknown ways with Europe and Siberia and also had ancestry from ‘Basal Eurasians’, a sister group of the main lineage of all other non-African populations. Dates correspond to sampled populations; in the case of a cluster of populations (such as the WHG), they correspond to the earliest attestation of the group.

Monday, October 1, 2018

Which airlines are the best?

Scientists are known to get about a bit. They attend conferences and give workshops, they go on sabbatical, and sometimes they even have holidays. Many of these activities require them to be in other places than their home city; and to get there they often resort to air travel. This makes it of interest to them to know which airlines are considered to be "good". Scientists may not have much choice about which airlines they can choose to fly, depending on where they live, but they can at least try to fly on one of the good ones.

They are not alone in this desire, and so inevitably there are web sites that provide the necessary information. These include AirHelp Airline Worldwide Rankings; but the best-known listing is the annual one from Skytrax, a UK-based consumer aviation agency.

Each year, Skytrax conducts a survey in which "airline customers around the world" vote for the best airline. The survey results are released at the beginning of each year, and they thus refer to the previous year's survey. Skytrax note that "over 275 airlines were featured in the [current] customer survey but we only feature the top 100 listing."

The Skytrax top-100 data currently exist online for the years 2012-2018 inclusive, which cover the years 2011-2017. It can be useful to consider data for multiple years, because some airlines have greatly improved their ranking through time, while others have slipped back. There are 80 airlines with top-100 data for each of the years 2011-2017, and another 45 airlines that have appeared in the top 100 at least once. A few airlines have also merged during these years.

We can explore the multi-year data for the 80 airlines using a network analysis, to visualize the overall pattern. I first calculated the Manhattan distances pairwise between the airlines, and then plotted these using a NeighborNet graph, as shown in the figure below. Airlines that have similar rankings across the years are near each other in the network; and the further apart they are in the network then the more different are their overall rankings.

As you can see, this is pretty much a linear network, with the best-ranked airlines at the top-right, and then continuing down to the bottom-left. A simple list of the average rankings across the years would be almost as informative. In particular, the top-ranked airlines have remained at the top across the years; and it is only in the middle and especially at the bottom that there has been movement among the rankings (that is, the network broadens out at one spot in the middle and then again at the end).
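The distance calculation described above, and the simpler average-rank summary, can both be sketched briefly. The airlines and ranks below are made up for illustration (they are not Skytrax data), and the function names are mine.

```python
from itertools import combinations

# Hypothetical yearly ranks for three made-up airlines (one value per year).
ranks = {
    "Airline A": [1, 2, 1, 3],
    "Airline B": [2, 1, 3, 2],
    "Airline C": [60, 75, 58, 90],
}

def manhattan(a, b):
    """Sum of absolute rank differences across the years."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_rank(values):
    """Average rank across the years."""
    return sum(values) / len(values)

for p, q in combinations(ranks, 2):
    print(p, q, manhattan(ranks[p], ranks[q]))

# Airlines sorted from best to worst average rank.
ordered = sorted(ranks, key=lambda name: mean_rank(ranks[name]))
print(ordered)
```

Airlines with consistently similar ranks end up with small distances, which is why a network built from such a matrix tends to look like a line ordered by average rank.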

Note that the top end of the list consists mainly of airlines from the Middle East and Asia. Australia has only two airlines, both of which do well in the network, along with the only one from New Zealand. The presence near the top of both Turkish Airlines and Garuda Indonesia may surprise some people.

You will also note that the US airlines are generally closer to the bottom of the network — they are marked in red in the network. The airlines from China are mostly there, also (except Hainan Airlines). It is not a coincidence that neither of the world's two biggest economies runs a high-quality airline. It seems that the only way to do this is actually to rely on government subsidies, which is how most of the top-ranked airlines are doing it.

There are also few discount airlines that make it into the top 50. Put simply, the economics of running an all-economy-class plane do not allow much in the way of customer service (see How Budget Airlines Work). It is actually the first-class and business-class passengers on any given plane that allow it to take off at all, in terms of making money for the airline, which is a classic example of the 80/20 rule: 80% of the money comes from 20% of the passengers (see The Economics of Airline Class).

Finally, in a similar vein, you could also contemplate the sites pertaining to airport quality (e.g. AirHelp Airport Worldwide Rankings, World Airport Awards), as well as the Guide to Sleeping in Airports. There are also sites that tell you which seats to choose in any given plane (e.g. SeatGuru).

Monday, September 24, 2018

Structural data in historical linguistics

The majority of historical linguists compare words to reconstruct the history of different languages. However, alongside the cognate sets (reflecting shared homologs across the languages under investigation) on which most phylogenetic studies focus, there exists another data type that people have been trying to explore in the past. The nature of this data type is difficult to understand for non-linguists, given that it is very abstract. In the past, it has led to a considerable amount of confusion both among linguists and among non-linguists who tried to use this data for quick (and often also dirty) phylogenetic approaches. For this reason, I figured it would be useful to introduce this type of data in more detail.

This data type can be called "structural". To enable interested readers to experiment with the data themselves, this blogpost comes along with two example datasets that we converted into a computer-readable format (with much help from David), since the original papers only offered the data as PDF files. In future blogposts, we will try to illustrate how the data can, and should, be explored with network methods. In this first blogpost, I will try to explain the basic structure of the data.

Structural data in historical linguistics and language typology

In order to illustrate the type of data we are dealing with here, let's have a look at a typical dataset, compiled by the famous linguist Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.

No. | Feature | Beijing | Suzhou | Meixian | Guangzhou
1 | The third person pronoun is tā, or cognate to it | + | - | - | -
4 | Velars palatalize before high-front vowels | + | + | - | -
7 | The qu-tone lacks a register distinction | + | - | + | -
12 | The word for "stand" is zhàn or cognate to it | + | - | - | -

In this example, the data is based on a questionnaire that provides specific questions; and for each of the languages in the sample, the dataset answers the question with either + or -. Many of these datasets are binary in their nature, but this is not a necessary condition, and questionnaires can also query categorical variables. For example, the major type of word order might have three categories (subject-object-verb, subject-verb-object or other).
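To make the format concrete, the four features from the table above can be stored as strings of + and - and compared by counting mismatches. This is only an illustration of how such data is typically handled: the variable and function names are mine, four features are far too few for any conclusion, and a plain mismatch count is not a recommended method of classification.

```python
from itertools import combinations

# Feature numbers and values as given in the excerpt from Norman (2003).
features = [1, 4, 7, 12]
data = {
    "Beijing": "++++",
    "Suzhou": "-+--",
    "Meixian": "--+-",
    "Guangzhou": "----",
}

def hamming(a, b):
    """Number of features on which two varieties differ."""
    return sum(x != y for x, y in zip(a, b))

for p, q in combinations(data, 2):
    print(p, q, hamming(data[p], data[q]))
```

A categorical feature such as word order would need a symbol per category instead of + and -, but it could be compared in the same way.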

We can also see that the questions can be very diverse. While we often use more or less standardized concept lists for lexical research (such as fixed lists of basic concepts, List et al. 2016), this kind of dataset is much less standardized, due to the nature of the questionnaire: asking for the translation of a concept is more or less straightforward, and the number of possible concepts that are useful for historical research is quite constrained. Asking a question about the structure of a language, however, be it phonological, lexical, based on attested sound changes, or on syntax, provides an incredible number of different possibilities. As a result, it seems close to impossible to standardize these questions across different datasets.

Although scholars often call the data based on these questionnaires "grammatical" (since many questions are directed towards grammatical features, such as word order, presence or absence of articles, etc.), most datasets show a structure in which questions of phonology, lexicon, and grammar are mixed. For this reason, it is misleading to talk of "grammatical datasets"; the term "structural data" seems more adequate, since this is what the datasets were originally designed for: to investigate differences in the structure of different languages, as reflected in the famous World Atlas of Language Structures (Dryer and Haspelmath 2013).

Too much freedom is a restriction

In addition to mixed features that can be observed without knowing the history of the languages under investigation, many datasets (including the one by Norman we saw above) also use explicit "historical" (diachronic in linguistic terminology) questions in their questionnaires. In his paper describing the dataset, Norman defends this practice, as he argues that the goal of his study is to establish a historical classification of the Chinese dialects. With this goal in mind, it seems defensible to make use of historical knowledge and to include observed phenomena of language change in general, and sound change in particular, when compiling a structural dataset for a group of related language varieties.

The extremely diverse nature of questionnaire items in structural datasets, however, makes their interpretation extremely difficult. This becomes especially evident when the data are used in combination with computational methods for phylogenetic reconstruction, where the diversity is problematic for two major reasons.
  1. Since questions are by nature less restricted regarding their content, scholars can easily pick and choose the features in such a way that they confirm the theory they want them to confirm rather than testing it objectively. Since scholars can select suitable features from a virtually unlimited array of possibilities, it is extremely difficult to guarantee the objectivity of a given feature collection. 
  2. If features are mixed, phylogenetic methods that work on explicit statistical models (like gain and loss of character states, etc.) may often be inadequate to model the evolution of the characters, especially if the characters are historical. While a feature like "the language has an article" may be interpreted as a gain-loss process (at some point, the language has no article, then it gains the article, then it loses it, etc.), features showing the results of processes, like "the words that originally started in [k] followed by a front vowel are now pronounced as []", cannot be interpreted as such a process, since the feature itself already describes a process.
For these reasons, all phylogenetic studies that make use of structural data, in contrast to purely lexical datasets, should be treated with great care, not only because they tend to yield unreliable results, but more importantly because they are extremely difficult to compare across different language families, given that scholars have far too much freedom when compiling them. Feature collections provided in structural datasets are an interesting resource for diversity linguistics, but they should not be used to make primary claims about external language history or subgrouping.
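To illustrate what a gain-loss model assumes, here is a minimal Python sketch of the textbook two-state continuous-time Markov model, in which a feature is either absent (0) or present (1). It is not taken from any of the studies discussed here; the function name and the rate values are hypothetical and chosen only for illustration.

```python
import math


def transition_probabilities(gain, loss, t):
    """Two-state gain-loss model for a binary character.

    Returns the matrix P with P[i][j] = probability of being in state j
    after time t when starting in state i (0 = absent, 1 = present).
    """
    total = gain + loss
    decay = math.exp(-total * t)
    p_gain = gain / total * (1.0 - decay)   # 0 -> 1
    p_loss = loss / total * (1.0 - decay)   # 1 -> 0
    return [[1.0 - p_gain, p_gain], [p_loss, 1.0 - p_loss]]


# Hypothetical rates: the feature is gained twice as easily as it is lost.
GAIN, LOSS = 0.2, 0.1

for t in (0.5, 5.0, 50.0):
    p = transition_probabilities(GAIN, LOSS, t)
    print(f"t={t:5.1f}  P(absent->present)={p[0][1]:.3f}  "
          f"P(present->absent)={p[1][0]:.3f}")
```

A feature like "the language has an article" fits this scheme, because it can in principle switch back and forth between the two states. A feature that records the outcome of a specific sound change has no natural "loss" event to go with its "gain", which is one way of seeing why such a model is a poor fit for historical characters.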

Two structural datasets for Chinese dialects

Before I start to bore the already small circle of readers interested in these topics, it seems better to stop discussing the usefulness of structural data at this point, and to introduce the two datasets that were promised at the beginning of the post.

Both datasets target Chinese dialect classification, the former being proposed by Norman (2003), and the latter reflecting a new data collection that was recently used by Szeto et al. (2018) to propose a North-South-split of dialects of Mandarin Chinese with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets. The main idea of this collection is to collect various structural datasets that have been published in the literature in the past, and to give people interested in the data, be it for replication studies or to test alternative approaches, easy access to the data in various formats.

The basic format is based on the format specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples for best practice for both structural and lexical datasets in historical linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and datasets are converted to CLDF (with all languages being linked to the Glottolog database, Hammarström et al. 2018) and also to Nexus format. Each dataset is versioned and may be updated in the future, and interested readers can study the code used to generate the specific data format from the raw files, as well as the Nexus files, to learn how to submit their own datasets to our initiative.
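For readers who have never seen a Nexus file, the following Python sketch writes a small binary character matrix as a Nexus DATA block. It uses only the four example rows from Norman's table shown earlier in this post, and the function name is my own. The Nexus files in the actual collection are generated by different code and may be organized differently, so this is only meant to show the general shape of the format.

```python
def to_nexus(matrix):
    """Serialize {taxon: string of 0/1/? states} as a Nexus DATA block."""
    taxa = sorted(matrix)
    nchar = len(matrix[taxa[0]])
    if any(len(matrix[taxon]) != nchar for taxon in taxa):
        raise ValueError("all taxa need the same number of characters")
    width = max(len(taxon) for taxon in taxa)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(taxa)} NCHAR={nchar};",
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "MATRIX",
    ]
    for taxon in taxa:
        lines.append(f"{taxon.ljust(width)} {matrix[taxon]}")
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"


# Features 1, 4, 7 and 12 from the table above, with + coded as 1 and - as 0.
example = {
    "Beijing": "1111",
    "Suzhou": "0100",
    "Meixian": "0010",
    "Guangzhou": "0000",
}

nexus = to_nexus(example)
print(nexus)
```

A file of this kind can be opened directly by most software for phylogenetic trees and networks, which is the main reason for offering it alongside the CLDF version.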

Final remarks on publishing structural datasets online

Given that we provide only two initial datasets for an enterprise whose general usefulness is highly questionable, readers might ask themselves why we are going through the pain of making data created by other people accessible through the web.

The truth is that the situation in historical linguistics and language typology has for a very long time been very unsatisfactory. Most of the research based on data did not supply the data with the paper, and often authors directly refuse to share the data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the data is hampered because the data are provided only in PDF format, in tables inside the paper (or even worse: in long tables in the supplement of a paper), which forces scholars wishing to check a given analysis themselves to reverse-engineer the data from the PDF. That data is provided in a form difficult to access is not even necessarily the fault of the authors, since some journals even restrict the form of supplementary data to PDF only, giving authors wishing to share their data in an appropriate form a difficult time.

Many colleagues think that it is time to change this, and we can only change it by offering standard ways to share our data. The CLDF format, along with the Nexus files in which the two Chinese datasets are now published in this open repository collection, may hopefully serve as a starting point for larger collaboration among typologists and historical linguists. Ideally, all people who publish papers that make use of structural datasets would submit their data in CLDF and Nexus format, similar to the practice in biology where scholars submit data to GenBank (Benson et al. 2013), so that their colleagues can easily build on their results and test them for potential errors.


Benson D., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers (2013) GenBank. Nucleic Acids Res. 41.Database issue: 36-42.

Bryant D. and V. Moulton (2004) Neighbor-Net. An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21.2: 255-265.

Campbell L. and W. Poser (2008) Language classification: History and method. Cambridge University Press: Cambridge.

Cathcart C., G. Carling, F. Larsson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dryer M. and M. Haspelmath (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.

Forkel R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (forthcoming) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

Hammarström H., R. Forkel, and M. Haspelmath (2018) Glottolog. Version 3.3. Max Planck Institute for Evolutionary Anthropology: Leipzig.

List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.

Norman J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.) The Sino-Tibetan languages. Routledge: London and New York, pp 72-83.

Pritchard J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Szeto P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Zhang M., W. Pan, S. Yan, and L. Jin (2018) Phonemic evidence reveals interwoven evolution of Chinese dialects. bioRxiv.

Monday, September 17, 2018

Getting the wrong tree when reticulations are ignored

One issue that has long intrigued me is what happens when someone constructs a phylogenetic tree under circumstances where there are reticulate evolutionary events in the actual (ie. true) phylogeny itself. That is, a network is required to accurately represent the phylogeny, but a tree is used as the model, instead. How accurate is the tree?

By this, I mean that, if the phylogeny can be thought of as a "tree with reticulations", do we simply get that tree but miss the reticulations, or do we get a different (ie. wrong) tree?

Sometimes, people refer to this situation as having a "backbone tree" — the phylogeny is basically tree-like, but there are a few extra branches, perhaps representing occasional hybridizations or horizontal gene transfers. The phylogenetic tree can then be treated as a close approximation to the true phylogeny, representing the diversification events but not the (rarer) reticulation events.

I have argued against this approach (2014. Systematic Biology 63: 628-638.). Instead of seeing a network as a generalization of a tree, we should see a tree as a simplification of a network. If we do this, then we would construct a network every time; and sometimes that network would be a tree, because there are no reticulation events in the phylogeny. It cannot work the other way around, because we can never get a network if all we ask for is a tree!

Presumably, if there are no reticulations then we should get the same answer (phylogenetic tree) irrespective of whether we simply construct a tree or instead construct a network that turns out to be a tree. But what about the "backbone tree" situation? Here, it has always seemed to me to be possible that we do not get the same tree. If this is so, then constructing a tree and then adding a few reticulations to it (as is often done in the literature) would not work — we would be adding reticulations to the wrong backbone tree.

There are two possible ways in which we can get the wrong backbone tree: the topology might be incorrect, or the branch-lengths might be incorrect (or both). For example, if there are true reticulations and yet we do not include them in our model, I have argued that the branches will be too short (2014. Systematic Biology 63: 847-849.) — two taxa will be genetically similar because of the reticulation events, but the tree-building algorithm can only make them similar on the tree by shortening the branches (not by adding a reticulation).

Fortunately, for at least one tree-building model, Luay Nakhleh and his group have now done some simulations to answer my questions. You may not yet have noticed their results, because they are not necessarily in the most obvious place; so I will highlight them here. The analyses involve the Multispecies Coalescent (MSC) model, which accounts for incomplete lineage sorting during the tree-like part of evolution, as compared to the Multispecies Network Coalescent (MSNC), which adds reticulations (eg. hybridization) to the model.

Dingqiao Wen, Yun Yu, Matthew W. Hahn, Luay Nakhleh (2016) Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. Molecular Ecology 25: 2361-2372.

This paper compares a tree-based analysis (construct a tree first then add reticulations) with a network-based analysis (construct a network) for an empirical genomic dataset. The two results differ.

Dingqiao Wen, Luay Nakhleh (2018) Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Systematic Biology 67: 439-457.

Tucked away in the Supplementary Information are the results of a set of simulations comparing the MSC (using *Beast) and the MSNC (using PhyloNet), with (section 3) and without (section 2) reticulations. The basic conclusion is that, in the presence of reticulation, tree-based methods either get the tree completely wrong, or they get the tree topology right but the branch lengths are "forced" to be very short. A summary of the latter result is shown in the figure above. In the absence of reticulation, both methods produce the same tree.

R.A. Leo Elworth, Huw A. Ogilvie, Jiafan Zhu, and Luay Nakhleh (ms.) Advances in computational methods for phylogenetic networks in the presence of hybridization. [chapter for a forthcoming book]

A summary of the group's work to date. Section 6.3 summarizes the results from the second paper above (Wen and Nakhleh 2018).

Monday, September 10, 2018

Limitations of the new book about HGT networks

This is a joint post by David Morrison and Ajith Harish.

There has been a flurry of reviewing activity recently about the new book:

The Tangled Tree: a Radical New History of Life
David Quammen. 2018. Simon & Schuster.

This book has received glowing reviews.

The book is intended for the general public, rather than for specialists, explaining the "new view" of evolutionary history that includes extensive horizontal gene transfer (HGT), especially in the microbial world. Quammen describes himself as a science, nature and travel writer, so his book is more than just a record of science, and is as much about the people involved as about the scientific theory. In particular, it contains a biography of Carl Woese.

Quammen’s recent New York Times feature article The scientist who scrambled Darwin’s Tree of Life is a very good primer to his book. For us, it indicates that the book has many overlaps with Jan Sapp's earlier book The New Foundations of Evolution: on the Tree of Life (2009. Oxford University Press). The publisher’s advertised selling point of that book is: "This is the first book on (and first history of) microbial evolutionary biology, and that it puts forth a new theory of evolution", with HGT being the new theory. In this sense, the "radical new view" is simply that genetic material can be transferred without sexual reproduction, an idea that goes back rather a long way in history (see The history of HGT), and which is often seen as anti-Darwinian.

Bill Hanage, in his review of Sapp’s book (2010. The trouble with trees. Science 327: 645-646), argues that the book does not put forward a new theory, that the debate is not actually about horizontal gene transfer, and that the Tree of Life is thus far from settled. There are many other interesting points discussed in that review. Furthermore, even after almost 10 years, Hanage’s review of Sapp’s 2009 book can be substituted verbatim as a review of Quammen’s 2018 book! This PDF shows how the book review would read if the author and book names in Hanage’s review were to be substituted [reproduced with the permission of the original author].

The debate allegedly involving HGT is, at heart, about explaining the pattern of extensively mixed genetic material found in the akaryotes. However, simply looking at a pattern does not tell you about the process that created the pattern. In order to study processes, we need a model, in this case a model about how evolution occurs. The "HGT model" is that the Last Universal Common Ancestor (LUCA) of life was a relatively simple organism genetically, and that subsequent evolutionary history has involved complexification of that ancestor, both by diversification and by HGT.

What the two books do not explore is the other major model for the current distribution of genetic material among akaryotes. This alternative scenario is that the LUCA was genetically complex, and that the subsequent evolutionary history involved independent losses of parts of the genetic material — the sporadically shared material is basically coincidental. All that this model requires is that there be evolutionary history prior to the LUCA, during which it became a complex organism from its simple beginnings — the LUCA is merely as far back as we can see into the past, with the prior history being unrecoverable by us (ie. we cannot see past the LUCA bottleneck).

Over the past couple of decades, a number of papers have explored the evidence for the latter idea, from both the RNA and protein perspectives, including:
  • Anthony Poole, Daniel Jeffares, David Penny (1999) Early evolution: prokaryotes, the new kids on the block. BioEssays 21: 880-889.
  • Christos A. Ouzounis, Victor Kunin, Nikos Darzentas, Leon Goldovsky (2006) A minimal estimate for the gene content of the last universal common ancestor — exobiology from a terrestrial perspective. Research in Microbiology 157: 57-68.
  • Miklós Csűrös, István Miklós (2009) Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Molecular Biology and Evolution 26: 2087-2095.
  • Kyung Mo Kim, Gustavo Caetano-Anollés (2011) The proteomic complexity and rise of the primordial ancestor of diversified life. BMC Evolutionary Biology 11: 140.
  • Ajith Harish, Charles G. Kurland (2017) Akaryotes and Eukaryotes are independent descendants of a universal common ancestor. Biochimie 138: 168-183.
Finally, even from the perspective of phylogenetic networks, Quammen's book is very one-sided. In particular, the other processes that lead to reticulate evolution (eg. introgression and hybridization) are pretty much ignored. That is, the focus is on akaryotes not eukaryotes. The latter are also of phylogenetic interest.

Monday, September 3, 2018

More on networks for placing fossils, such as Eocene lantern fruits

A colleague pointed me to a paper published last year in Science about a spectacular fossil find: an Eocene Physalis-fruit with a preserved lampion. In a recent post, I advocated Neighbor-nets as nice and quick tools to place fossils phylogenetically. In this post, I will exemplify this once more, and argue why this would have been even more informative than what the authors showed as graphs.

The study and the data

In their 2017 paper, Wilf et al. (Science 355: 71–75) describe a new fossil find, which, by itself, rejects the often-too-young molecular dating estimates for Solanaceae, the potato-tomato family, the "Nightshades". The Nightshades include many well-known plants in addition to potato and tomato (the latter is phylogenetically a subclade of the potatoes), e.g. the tobacco genus (Nicotiana) and the genus Physalis, which includes several species commercialized as fruits (e.g. P. peruviana, also known as Cape gooseberry or goldenberry) and ornamental plants (e.g. P. alkekengi, the Chinese Lantern).

Just by looking at the pictures showing the fossil (Wilf et al.'s text-Fig. 1), anyone who ever ate a physalis would agree that it was produced by a member of the genus. However, science is not usually about common sense, but about formal reconstructions. Thus, the authors placed their fossil using a total evidence tree approach: they scored 13 morphological traits as binary or ternary characters, concatenated these data with a molecular data set and inferred trees under maximum parsimony (their text-Fig. 2, below) and maximum likelihood (the tree can be found in the supporting information).

Wilf et al.'s total evidence tree (their Fig. 2), showing (quoted from the legend) "Phylogenetic relationships of Physalis infinemundi sp. nov. and selected Solanaceae species. Strict consensus of 2835 most parsimonious trees of 3510 steps (CI = 0.438, RI = 0.726)."

Based on the graph, one can confirm that the fossil (arrow; pictured, too) is part of the core Physalis, but its position within this core clade is unresolved. The Decay index shown indicates that moving the entire branch would require just one step more. This is not overly reassuring given the total length of the tree (3510 steps) and the size of the underlying data (the matrix used has 7070 characters!).

The molecular data were selected from an earlier study (Särkinen et al., BMC Evol. Biol., 2013), but the total evidence matrix is not provided (see this post on why we want to publish our phylogenetic data). But at least the "...morphological matrix developed in this paper is tabulated in the supplementary materials."

This file includes two sheets: the first shows the "raw scores", including four continuous characters, and the second shows the "character scoring" used for the analysis, where the continuous characters were scored (binned) as ternary and binary characters. The information provided is partly wrong, likely as the result of copy & paste errors (this is another reason why it should be obligatory for phylogenetic studies to provide the data as aligned-FASTA or NEXUS files). A corrected version of the "character scores" sheet based on the "raw scores" sheet is included in the figshare submission for this post.

By just filtering this matrix for same-as-in-the-fossil characters, we can identify two extant species that are identical to the fossil in all scored characters: Physalis acutifolia and P. lanceolata. Both are part of the Physalis core clade in Wilf et al.'s total evidence tree, but their position is as unresolved as that of the fossil.
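The filtering step is simple enough to sketch in a few lines of Python. The matrix below is a made-up toy example with hypothetical taxon names and character states, not the scoring from Wilf et al.'s supplement; the point is only to show how missing data ("?") are skipped when comparing each taxon with the fossil.

```python
MISSING = "?"


def differences(fossil, other):
    """Return (number of differing characters, number of characters compared).

    Characters that are missing in either taxon are not compared.
    """
    differing = 0
    compared = 0
    for f, o in zip(fossil, other):
        if f == MISSING or o == MISSING:
            continue
        compared += 1
        if f != o:
            differing += 1
    return differing, compared


def identical_to(fossil_name, matrix):
    """Taxa that match the fossil in every character scored for both."""
    fossil = matrix[fossil_name]
    hits = []
    for name, row in matrix.items():
        if name == fossil_name:
            continue
        differing, compared = differences(fossil, row)
        if compared > 0 and differing == 0:
            hits.append(name)
    return sorted(hits)


# Toy data: five characters with states 0/1/2, "?" = not scored.
toy = {
    "fossil":  "01?21",
    "taxon_A": "01021",
    "taxon_B": "01121",
    "taxon_C": "11021",
    "taxon_D": "0012?",
}

print(identical_to("fossil", toy))
for name in sorted(toy):
    if name != "fossil":
        print(name, differences(toy["fossil"], toy[name]))
```

The per-taxon difference counts from `differences` are the same kind of numbers as the absolute character differences reported for the real data.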

Enlarged part of the above figure, showing the absolute character difference (0 to 5 out of 13 covered characters) between the fossil and other members of the Physalis core clade.

The reason for this becomes clear in the total-evidence maximum-likelihood tree. Here, the fossil is resolved as the sister of P. lanceolata (maximum likelihood bootstrap support: ML-BS < 70, the actual value would have been nice), to which it is identical, both being deeply nested in the Physalis core clade. However, the other identical species (morphologically), P. acutifolia, is placed in the first diverging subclade of the core clade (ML-BS < 70, along with most of the backbone of this clade). The "low" support may have two possible reasons:
  • the fossil, with 99.8% missing data, acts as a 'rogue' taxon; or
  • the genetic data provide little discriminating signal, or ambiguous signals.
Solanaceae genera can be tricky, and the gene sample lacks high-divergent sequence regions. Since the molecular data are not documented, I can't assess how significant this separation is, but it appears to be supported by at least some mutations: the tree-wise distance is about 0.04 expected substitutions; and the two morphologically indistinct (regarding the scored characters) species are genetically distinct (to some degree).

Extract from Wilf et al.'s Fig. S1, showing the Physalinae subtree with the core Physalis clade and the deeply nested fossil P. infinemundi (in bold font). Support is only shown for branches with a ML-BS support ≥70.

Trees may fail to show the obvious, but networks won't

Just by using the Neighbour-net to visualize the signal in the morphological partition, we can directly argue that the fossil is likely to be part of the core Physalis. Thus the fossil, being of Eocene age, rejects the much-too-young age estimates in e.g. the dated tree by Särkinen et al. (the reference for the molecular data used by Wilf et al.).

Neighbour-net splits graph based on the morphological data partition included in Wilf et al.'s "supermatrix".

In contrast to the little information that comes along with the tree shown above (soft-ish polytomy, weak Decay index, potentially decreased ML-BS support), the splits graph highlights the ambiguity (incompatibility) of the morphological signal. The graph shows little tree-likeness, and members of the same (sub)tribe show little coherence (C = Capsiceae, H = Hyoscyameae, J = Juanulloeae, S = Solaneae; W = Withaninae; all represented by de-facto molecular clades with ML-BS ≥ 77 in Wilf et al.'s supplement Fig. S1). There is one notable exception: members of the core Physalis (red dots) are sufficiently distinct from anything else, forming a highly supported clade (ML-BS = 98 in Wilf et al.'s Fig. S1).

The network also shows that the fossil is identical to both P. acutifolia and P. lanceolata.

Neighbour-net after reducing the taxon set to the phylogenetic neighbourhood of the fossil specimen. Filled fields indicate sister/sibling species supported by a ML-BS >= 80 in Wilf et al.'s "total evidence" ML tree.

By focusing on the phylogenetic neighborhood of the fossil, we end up with a spider-web-like graph. This means that the morphological partition has little consistent signal for recognizing potential relatives: the same features are likely to have evolved in parallel (all members of this neighborhood are likely to share a common origin). After all, 50 million years (and more) is a long time for a lineage to end up with a similar fruit (see also the maximum-parsimony character reconstructions on the parsimony strict-consensus tree provided in the supplement to Wilf et al.'s study).

Data and graphs

The Splits-NEXUS files for the Neighbor-nets and NEXUS-versions of Wilf et al.'s Data S1, as well as additional graphics (network with labeled bubbles) can be found on figshare.

Monday, August 27, 2018

Regular cognates: A new term for homology relations in linguistics

The identification of homologous words between genealogically related languages is one of the crucial tasks in historical linguistics. In contrast to biology where, especially at the level of genetic sequences, we find a rather rich terminology contrasting different types of homology among genes and gene sequences, linguistic terminology is still not very precise. Most scholars seem to be content if they can claim that they have identified words that are cognate, which means that they are homologous but have not been borrowed throughout their history.

On various occasions in the past, I have tried to work on a more precise terminology for linguistic frameworks (see for example List 2014 and List 2016, or this earlier blogpost on homology in linguistics). In this context, I have often tried to emphasize that we need to be specifically more careful with the problem of partial cognacy in linguistics, since many words across related languages are not fully homologous, but show homology only in specific parts (List et al. 2016).

Thanks to an increase in accurately annotated linguistic data, resulting specifically from my very productive collaboration with Nathan W. Hill (SOAS, London) on the Burmish languages (see Hill and List 2017), my view has now again changed a bit, and I thought it would be useful to share it here.

Cognacy and homology

The starting point for my earlier proposals to refine the notion of cognacy in linguistics was the rather refined distinction between orthologs, paralogs, and xenologs in molecular biology (Fitch 2000). To account for the distinction between directly inherited (orthologs), duplicated (paralogs), and laterally transferred genes (xenologs), I proposed the terms direct cognates, indirect cognates (inspired by the term oblique cognates by Trask 2000), and indirectly etymologically related words or morphemes (word parts).

While the first and last term are more or less straightforward with respect to linguistic processes, the notion of indirect cognates, however, turned out to be insufficient, given that it is not clear which processes lead to indirect cognacy. Originally, I thought of morphological processes, that is, processes of word formation, by which a word is slightly modified to account for a slightly derived meaning (usually involving processes like suffixation or compounding). My idea was that words that have "experienced" these processes would behave similarly to genes that have been duplicated in biological evolution, and that it would be sufficient to just assign them to a common sub-class of cognates.

However, the research with Nathan W. Hill recently revealed that these terms are insufficient to capture the processes underlying lexical change in historical linguistics.

In order to understand this idea, it is useful to get back to the biological terms and have a closer look at how they distinguish the underlying processes. As far as I understand it, a directly inherited gene sequence may differ from its ancestral sequence due to processes of random mutation, by which the original gene sequence becomes modified throughout its history. In cases of paralogy, the original gene sequence is duplicated and both copies are subsequently inherited. The copies may, during this process, become more different from each other than would be expected when assuming direct inheritance and random mutation. Similarly, in cases of lateral transfer of genetic material, the changes may again be different from the ones introduced by "normal" random mutation.

If we adopt the view of "normal change", as it is employed in the biological processes, we find a counterpart in the process of sound change in linguistics. As I have mentioned earlier, sound change is a systemic process by which certain sounds in certain environments change regularly across all words in the lexicon of a given language. This process is definitely not comparable with random mutation in sequence evolution, since the process involves a class of "letters" in the sound system of a language that are systematically turned into another sound. However, regarding the crucial role that sound change plays in language evolution, it seems that it is in some sense comparable with random mutation resulting in orthologous genes. Sound change is somewhat the baseline of what happens if languages change, and we have the means to identify its traces by searching for regular sound correspondence patterns across related languages (see my earlier blogpost on this matter).
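As a rough illustration of what "searching for regular sound correspondence patterns" means in practice, the following Python sketch counts how often a given pair of word-initial sounds recurs across a handful of well-known German-English cognates. This is a deliberately crude simplification of my own: the initial sounds were segmented by hand, only the first sound of each word is considered, and real workflows compare fully aligned word forms across many languages.

```python
from collections import Counter

# (German word, English word, German initial sound, English initial sound)
# The initial sounds are hand-segmented and simplified for illustration.
PAIRS = [
    ("zehn", "ten", "ts", "t"),
    ("zwei", "two", "ts", "t"),
    ("Zunge", "tongue", "ts", "t"),
    ("Zahn", "tooth", "ts", "t"),
    ("drei", "three", "d", "θ"),
    ("Ding", "thing", "d", "θ"),
    ("Dorn", "thorn", "d", "θ"),
    ("dünn", "thin", "d", "θ"),
    ("Hand", "hand", "h", "h"),
    ("Haus", "house", "h", "h"),
]


def initial_correspondences(pairs):
    """Count each (German sound, English sound) pair in initial position."""
    return Counter((german, english) for _, _, german, english in pairs)


def recurrent(counts, minimum=2):
    """Keep only the correspondences attested at least `minimum` times."""
    return {pattern for pattern, count in counts.items() if count >= minimum}


counts = initial_correspondences(PAIRS)
for (german, english), count in counts.most_common():
    print(f"German {german} : English {english}  ({count} word pairs)")
```

The point of the exercise is that the sounds do not need to be similar to count as evidence: German [ts] and English [t] differ, but they correspond again and again in the same position, and it is this recurrence that makes the correspondence regular.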

That sound change is the default, which can be handled with some confidence, while other processes are much harder to handle (such as word formation, semantic change, or the notorious process of analogical leveling, by which complex paradigms are transformed to reduce complexity but new complexities can also emerge; compare the German irregular plural Morgen-de "mornings", which is built on the template of Abend-e "evenings"), is also the reason why Gévaudan (2007) does not include it among the major processes of lexical change. If we take sound change as the default process of language change and as our key evidence for homologous word relations, however, this means that we can no longer make the distinction between direct and indirect cognates following my earlier proposal, since indirect cognates do not necessarily reflect instances of irregular sound change.

This is in fact easy to illustrate. If we follow the former definition of indirect cognacy, the comparison of German Handschuh "glove" (lit. hand-shoe) with English hand would reflect indirect cognacy, since the German word is a compound of Hand "hand" and Schuh "shoe", and thus a derived word form. The morpheme Hand in this example, however, is phonetically identical with German Hand, and the sound correspondences between the English word and the first element of the German compound are still regular by all means. In fact, only a small number of word formation processes in language evolution also affect the pronunciation of the base forms.

This means, in turn, that any distinction of cognate word forms (and word parts, i.e., morphemes) into direct and indirect ones that is based on the absence or presence of morphological (= word formation) processes, does not tell us much about the degree to which the sound change affecting these word forms was regular. We could state that direct cognates should always reflect regular sound change, since any irregularity would have to be accounted for by alternative explanations (eg. shortening of a given word due to frequent use, assimilation of sounds serving the ease of pronunciation, etc.).

I wonder whether this would be useful for the initial idea behind the concept of direct cognacy. If we find direct cognates, that is, words that we assume were used by a couple of languages without further modification, apart from regular sound change and potentially sporadic sound changes, it seems still useful to assume that these reflect vertical language history better than cognate sets with residues that were exposed to various morphological processes. Thus, when coding direct cognacy in linguistic datasets, sporadic sound change (if it can be illustrated properly) should not serve as an argument against direct cognacy.

The only way around this problem seems to be to establish a further shade of cognacy, which describes the relations among words and morphemes that have been only affected by sound change, in contrast to words whose history reflects various morphological derivations that impact directly on pronunciation, or processes of irregular sound change due to analogical leveling or assimilation. While I first thought that the biological term ortholog would be useful to describe these specific word relations in linguistics, I realized later that, judging from the Ancient Greek meaning of ortholog (ortho "straight, direct" + logos "relation"), the fact that differences are due to regular sound change is not that neatly reflected.

For now, I think that it should be sufficient to use the term regular cognates for those words or word parts for which we can demonstrate that their change followed the regular "laws" of sound change. Regular cognates are thus defined as words or word parts that have been affected only by sound change during their history. This notion deliberately excludes differences in meaning, frequency of use, or whether the word forms are only reflected in compounds or derived word forms. In fact, for some cases, we could even propose that only parts of a word form that no longer bear any meaning of their own (e.g., the first two sounds of a word form) are regular cognates, as long as we can propose good arguments for the regularity of the correspondences.
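To make the criterion a bit more concrete, here is a minimal Python sketch of how one might flag the parts of aligned word pairs that are not backed by recurring correspondences. Everything in it is an assumption made for illustration: the hand-aligned English/German consonant skeletons are simplified toy transcriptions, the function names are hypothetical, and the threshold of two attestations is arbitrary. It is not how any existing tool works, and simple recurrence counts are of course only a crude stand-in for a proper demonstration of regularity.

```python
from collections import Counter

# Toy data (assumption): hand-aligned consonant skeletons of English/German
# word pairs in a simplified transcription, chosen for illustration only.
PAIRS = {
    "ten/zehn":   (["t", "n"], ["ts", "n"]),
    "tin/Zinn":   (["t", "n"], ["ts", "n"]),
    "tame/zahm":  (["t", "m"], ["ts", "m"]),
    "man/Mann":   (["m", "n"], ["m", "n"]),
    "tide/Zeit":  (["t", "d"], ["ts", "t"]),
    "hand/Hand":  (["h", "n", "d"], ["h", "n", "t"]),
    "hound/Hund": (["h", "n", "d"], ["h", "n", "t"]),
    # Usually traced to the same root with different suffixes, so only the
    # initial consonant is expected to correspond regularly.
    "time/Zeit":  (["t", "m"], ["ts", "t"]),
}


def correspondence_counts(pairs):
    """Count how often each aligned sound pair occurs across all word pairs."""
    counts = Counter()
    for seq_a, seq_b in pairs.values():
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        counts.update(zip(seq_a, seq_b))
    return counts


def irregular_columns(pairs, min_count=2):
    """Return, per word pair, the column indices whose correspondence is
    attested fewer than min_count times in the whole dataset."""
    counts = correspondence_counts(pairs)
    return {
        name: [i for i, corr in enumerate(zip(seq_a, seq_b))
               if counts[corr] < min_count]
        for name, (seq_a, seq_b) in pairs.items()
    }


report = irregular_columns(PAIRS)
for name, columns in report.items():
    status = "regular" if not columns else f"unsupported columns: {columns}"
    print(f"{name}: {status}")
```

In this toy dataset, only the second column of time/Zeit lacks support, which mirrors the idea that sometimes only a part of a word form qualifies as a regular cognate.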

Note that our tools for alignment analyses in historical linguistics already account for this property. The EDICTOR (List 2017), a web-based tool for editing, analyzing, and publishing etymological dictionaries, allows users to exclude those parts from an alignment that are assumed to be irregular, as can be seen in the following illustrative alignment of Proto-Germanic *bakanan "to bake". Scholars who want to be explicit about what parts of an alignment they consider to be regular can use this annotation framework to provide more refined analyses.

EDICTOR alignment of regular cognates for Proto-Germanic *bakanan "to bake"
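The idea of excluding parts of an alignment can be sketched in a few lines of Python. This is a toy under stated assumptions, not the EDICTOR's actual data format: the segmentations of the three modern forms are simplified, the choice of ignored columns (the infinitive endings) is hypothetical, and the names ALIGNMENT, IGNORED_COLUMNS, and regular_part are made up for this example.

```python
# Toy alignment (assumption): simplified segmentations of three modern
# reflexes, padded with "-" so that all rows have the same length.
ALIGNMENT = {
    "English bake":  ["b", "eɪ", "k", "-", "-"],
    "German backen": ["b", "a", "k", "ə", "n"],
    "Dutch bakken":  ["b", "ɑ", "k", "ə", "n"],
}

# Hypothetical annotation: columns that the analyst does not want to treat
# as part of the regular correspondences (here the infinitive endings).
IGNORED_COLUMNS = {3, 4}


def regular_part(alignment, ignored):
    """Return a copy of the alignment without the ignored columns."""
    widths = {len(row) for row in alignment.values()}
    if len(widths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return {
        language: [seg for i, seg in enumerate(row) if i not in ignored]
        for language, row in alignment.items()
    }


trimmed = regular_part(ALIGNMENT, IGNORED_COLUMNS)
for language, row in trimmed.items():
    print(language, " ".join(row))
```

Keeping the annotation separate from the alignment itself means the full forms stay available, while any later analysis of sound correspondences can be restricted to the columns considered regular.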

A crucial consequence of using only regularity in the sound correspondences as the criterion to distinguish regular from irregular cognates is that regular cognacy may also be found to hold for borrowings, since borrowings can also be shown to be regular, especially when the contact between the languages was intensive. Identifying regular cognates is furthermore the first and most important step of the classical comparative method (Weiss 2015) for historical language comparison, since (unless we have written evidence for the true relations between languages) regular cognates (as proven by readily aligned cognate sets) are the foundation upon which we build all our hypotheses regarding the external history of languages.

Fitch, W. (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5: 227-231.
Hill, N. and J.-M. List (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.
List, J.-M. (2014) Sequence Comparison in Historical Linguistics. Düsseldorf University Press: Düsseldorf.
List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2: 119-136.
List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers). Association of Computational Linguistics, pp. 599-605.

List, J.-M. (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations, pp. 9-12.
Trask, R. (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh University Press: Edinburgh.
Weiss, M. (2015) The comparative method. In: Bowern, C. and N. Evans (eds.) The Routledge Handbook of Historical Linguistics. Routledge: New York, pp. 127-145.

Wednesday, August 22, 2018

Distinguishability in Phylogenetic Networks, report

We have now completed the workshop, as you can tell from the previous post with some photos. Here is a brief report on what seem to me to be some of the more useful points covered.

We had 10 formal presentations, but we also focused on group discussions for several hours each day. It may be the latter that were the most productive. However, I will briefly summarize the talks first.

I spent my time in the opening talk emphasizing the different viewpoints of network computations, which focus on the patterns that can be detected in the data, and the network users, who are generally more interested in the processes that create those patterns (or are, indeed, absent from the patterns but present in the phylogenetic history, anyway). This highlights the two essential points of the workshop title, that both the patterns and the processes are much harder to untangle for networks than for trees.

Céline Scornavacca then bravely tried to tackle the combined problem, anyway, by trying to produce networks from analyzing the patterns in terms of their processes. The issues immediately become obvious, but she seems to be determined to proceed, regardless. Later in the week, Luay Nakhleh reduced the issue simply to vertical processes (including incomplete lineage sorting but not gene duplication-loss) versus horizontal processes. This creates a tractable problem for parsimony and likelihood, but the current challenge remains the limited number of taxa.

Vincent Moulton, Cécile Ané and Charles Semple dodged the issue by focusing on computations. Charles took on the challenge of trying to create a network version of Neighbor-Joining, which would address the issues of computational speed and taxon sampling, and Vince tackled super-networks, and the conditions required for building networks from a collection of smaller (i.e., incomplete) trees. Both topics remain open questions. Cécile, on the other hand, discussed network models for trait evolution, which is important for the use of phylogenetic comparative methods when using networks.

On the user side, the presentations focused on examples, and the issues encountered when dealing with them. James Whitfield and Axel Janke talked about biology (mostly phylogenomics), while Johann-Mattis List talked about linguistics, and Tiago Tresoldi talked about stemmatology. In some ways, historical linguistics seems to be the odd one out, since many of the issues dealt with are somewhat removed from those in the other fields. However, in biology there are actually two options for producing networks: directly from the data or via "gene trees" (trees derived from non-recombining blocks of sequences). For the humanities, much of the current discussion is about the nature of the data, and how to code it for quantitative analysis.

This brings us to the discussions. While some time was spent on trying to establish whether biologists think that there is a difference between lateral gene transfer and horizontal gene transfer, or between incomplete lineage sorting, ancestral polymorphism and deep coalescence, some productive interchanges also occurred. Here is a summary of four of the most important ones.

There was general agreement that there are several barriers to widespread adoption of network analyses in phylogenetics. These include the development of suitable methods (in the face of indistinguishability), but also an understanding of what methods are currently available, what data are required to apply those methods, what taxon sampling is required to benefit from the methods, and how to use the programs that implement those methods.

One popular suggestion was therefore to produce some sort of "cookbook", to address the complexity of producing networks, given that there are many methods and programs. From the users' point of view this would illustrate what network analyses can do, in terms of finding reticulation patterns in the data; and from the computational point of view it would outline what needs to be done to get the programs to work. The consensus idea was to choose two suitable datasets (yet to be determined), and then have each program author provide analyses of them (including any scripts that are needed).

Following on from this latter point, it was agreed that the programs need easy user interfaces, if they are to become more widely used. Here, the word "widely" includes casual users from outside of phylogenetics, who use phylogenies as only one of many tools in their work. So, users will range from those who need nothing more than a "point and click" control panel (who may be >90% of potential users) to those who would benefit from scripting control of the analyses. The interface needs both a front end, to specify the particular analysis, and a back end, to allow exploration of the output.

Another long-discussed issue was how to popularize networks, which is clearly a major topic. A phylogenetic tree is nothing more than one of the possible networks for any given dataset, and yet the focus is often on trees rather than networks.

To this end, it was noted that the current Wikipedia entry for phylogenetic networks is inadequate, especially compared to the corresponding entry for phylogenetic trees. Not only is this entry out of date, it is in a number of ways misleading. In particular, there needs to be a discussion of the fact that, if a network is a "tree with reticulations", then ignoring the reticulations can result in the wrong tree, and the branch lengths may be severely under-estimated. There are challenges to getting Wikipedia entries changed, especially the wholesale re-writing of an entry, but this will be necessary.

Finally, it was noted that Philippe Gambette's Who is Who in Phylogenetic Networks website is extremely useful but is still poorly known, even within the phylogenetic networks community. We had a long discussion about how to enhance this site, to make it a more general-purpose repository of information about phylogenetic networks. This included a more inclusive database, more comprehensive tagging of keywords, enhanced descriptions of those keywords, and ways to keep the database up to date.

Steven Kelk has the notes from the final session, which was a review of what we achieved during the workshop, and which contains the To Do list. Both he and Philippe have the notes about modifications for the Who is Who in Phylogenetic Networks website, which is likely to be the first outcome-project tackled.

Thank you to everybody who participated in the workshop. It seemed to be very productive, with a number of concrete outcomes that will be interesting to review at the next workshop.