Showing posts with label EDA. Show all posts

Monday, October 15, 2018

Jumping political parties in Germany's state elections


In one of last year's posts, I showed a neighbour-net for the parties competing in the national election, based on political distances inferred from the Wahl-O-Mat questionnaire (A network of political parties competing for the 2017 Bundestag). But Germany is a federal state, and since then there has been a state election in Lower Saxony, and soon there will be two more, in Bavaria and Hesse. This is a good opportunity to make some network-based comparisons.

It is important to note that there are many political parties in Germany, not just two or three major parties, as in most English-speaking countries. State parliaments can therefore be composed of quite different mixtures of these groups.

The questionnaire

The Wahl-O-Mat is a political information service provided by the BPB, the "Bundeszentrale für politische Bildung". A group of youngsters assisted by scientists puts together a questionnaire of political theses (bullet points), which is sent out to the political parties competing in an election. When participating, as most parties do, they can answer each statement with "agree", "disagree" or "neutral".

As a voter, you can fill in the same questionnaire, mark some of the questions as "high importance" (these will be weighted more strongly), and then choose up to eight parties for your personal comparison. The result will be a bar chart, showing you the percentage of your personal overlap with each of the parties. The BPB usually provides this service for all federal and state elections.

The problem I always have with this approach is that you don't get any graphical summary of how the parties agree or disagree with each other to start from. In the worst-case scenario, you could have 75% overlap with each of two parties that disagree with each other on 50% of the bullet points!
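The worst-case scenario can be reproduced with a toy example. The sketch below uses plain matching between answer lists as the overlap measure, which is an assumption made for illustration and not the Wahl-O-Mat's actual scoring scheme; the voter and the two parties are hypothetical.

```python
def overlap(a, b):
    """Share of theses on which two answer lists give the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical answers to four theses
voter   = ["agree", "agree", "agree", "agree"]
party_a = ["agree", "agree", "agree", "disagree"]
party_b = ["disagree", "agree", "agree", "agree"]

print(overlap(voter, party_a))    # voter vs. party A
print(overlap(voter, party_b))    # voter vs. party B
print(overlap(party_a, party_b))  # party A vs. party B
```

The voter matches each party on three of four theses, while the two parties match each other on only two of four. A bar chart of voter-to-party overlaps cannot show this.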

A straightforward solution to this shortcoming is to: code the questionnaire as a ternary matrix (0 = "disagree", 1 = "neutral", 2 = "agree"), treat them as ordered characters and determine the mean pairwise (Hamming) distances, and then infer a Neighbor-net based on the resulting distance matrix.
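As a minimal sketch of the first two steps, the following code scores the answers as 0/1/2 and computes the mean pairwise distance with the answers treated as ordered characters (so "disagree" vs. "agree" counts as 2). The party names and answers are hypothetical, and the function names are mine; the Neighbor-net itself would still be inferred from the resulting matrix in a program such as SplitsTree.

```python
SCORES = {"disagree": 0, "neutral": 1, "agree": 2}

def mean_ordered_distance(a, b):
    """Mean absolute difference between two answer lists (ordered characters)."""
    return sum(abs(SCORES[x] - SCORES[y]) for x, y in zip(a, b)) / len(a)

def distance_matrix(answers):
    """Return the party names and the symmetric pairwise distance matrix."""
    names = sorted(answers)
    matrix = [[mean_ordered_distance(answers[p], answers[q]) for q in names]
              for p in names]
    return names, matrix

# Hypothetical answers to four theses
answers = {
    "Party A": ["agree", "agree", "neutral", "disagree"],
    "Party B": ["agree", "neutral", "neutral", "agree"],
    "Party C": ["disagree", "disagree", "agree", "agree"],
}

names, dm = distance_matrix(answers)
for name, row in zip(names, dm):
    print(name, row)
```

With this coding, the largest possible distance is 2 (two parties giving opposite answers to every thesis).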

This is shown in the first figure, where each labeled point is one of the political parties. The two political extremes are also labeled.

The neighbour-net for the 2017 federal election Wahl-O-Mat questionnaire (original GWoN post from last year; for those interested in further comments, extrapolations, and infographics, see related posts on my Res.I.P blog). The red split denotes the outgoing and new coalition parties (Merkel's "centre-right" CDU/CSU + "centre-left" SPD, the social-democrats), the blue split the most natural minor coalition partner for the CDU/CSU since the Kohl era, the "centrist" liberals (FDP). For the yellow split, see here (in German, but there is a Google translate button).

Political Compasses (for orientation)

Another graphical approach is to use a "political compass", instead. The original can be found at The Political Compass. Parties or persons are placed along two absolute (in the case of the original) axes: an economic left-right x-axis and a social authoritarian-libertarian (in the classic, not US sense, i.e. socially liberal) y-axis. (I encourage everyone to do the test for themselves. I was not surprised to see where I stand in the compass, but others have been. But first do the test, before browsing The Political Compass' highly interesting pages.)

Here's what this looks like for the main German parties (currently six) that also won seats in the newly formed Bundestag, with some orientation points: (in)famous historical figures and the presidential run-offs in the U.S. (most of this blog's readers sit in the U.S.) and France (because I live there, but can't vote).


Overlay of several Political Compass assessments regarding the last major elections in Germany, France and the U.S. Grey dots, (in)famous figures that shaped the modern world; the main German parties are in full colours (all on the economic right, except for the Left Party, Die Linke, which is where social-democrats were in the 70s, when the European model of welfare states was fully implemented). The positions of the U.S. (both right-authoritarian) and French (relaxed choice between Hitler, fascism, and Friedman, neo-liberalism) presidential run-offs are provided for comparison.


In Lower Saxony, the "Niedersächsische Landeszentrale für politische Bildung", the state's analog of the BPB, hired a Dutch company to provide a compass ("Wahlkompass") linked to the Wahl-O-Mat questionnaire for the 2017 state parliament election.

After filling in the questionnaire, you would be placed in the relative compass, too. Note that (possibly to avoid giving due credit to The Political Compass) the y-axis has been flipped and relabeled as "progressiv" (progressive) and "konservativ" (conservative). Another reason may be that classifying parties as authoritarian is a bit tricky for a state-funded German institution, for historical reasons.

The red marker indicates an all-neutral voter. The placement is a relative one, hence no grid.

The relative positions of the liberals (FDP), the right-wing populists of the "Alternative for Germany" AfD (blue symbol at the bottom), the CDU, SPD and Left Party (Linke) all agree with The Political Compass' assessment of their federal-level counterparts. However, the Green Party is placed much closer to the Left Party on the social y-axis. This has two possible reasons:
  1. The Political Compass bases its assessment on party programs and actual government politics, and the Greens are part of quite a few state governments, and are the major ruling party in Baden-Württemberg, Germany's economically strongest state.
  2. There can be a difference between progressive and libertarian. The Greens are progressive by supporting e.g. equal rights for women or LGBT and other aspects of modern society, but aim to achieve these goals by imposing legislation, which is authoritarian. On the other hand, conservatism – keeping the status-quo – is mutually linked to authoritarian politics. Any social movement will change society, or challenge the status-quo, and hence needs to be constrained or suppressed.
Another difference to the Wahl-O-Mat is that, similar to the questionnaire of The Political Compass, the Lower Saxony Wahlkompass allows six possible answers to each bullet point: "totally disagree" (which I scored as "0"), "disagree" (1), "neutral" (2), "agree" (3), "totally agree" (4), and "No opinion" (?). The latter is quite useful, and would also be a useful add-on to the Wahl-O-Mat, because there is a difference between being neutral on a matter (could live with it) and having no opinion on it (don't bother). The more refined scale also allows us to treat the answers as ordered multistate characters when inferring the distance matrix, resulting in a more resolved network.

This is shown in the next figure.

Neighbour-net based on the Niedersachsen Wahlkompass questionnaire (full post, in German).

As you can see, the political-distance-based Neighbor-net splits graph captures the similarity of the political parties to each other quite well. Now the only thing left to do is to add yourself (as a voter or interested third party) to the matrix and then re-infer the Neighbor-net. The basic files to do so (NEXUS-formatted matrices) for this, the upcoming (Bavaria, Hesse), and future elections can be found on figshare.
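Adding yourself amounts to appending one more row to the character matrix before re-running the analysis. The sketch below writes a generic NEXUS DATA block from a dictionary of answer strings; the helper name `to_nexus`, the taxon names and the answers are hypothetical, and the exact layout of the files on figshare may differ from what is produced here.

```python
def to_nexus(matrix, symbols="01234", missing="?"):
    """Format {taxon: answer string} as a simple NEXUS DATA block."""
    lengths = {len(row) for row in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix) + 2
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD SYMBOLS="{symbols}" MISSING={missing};',
        "  MATRIX",
    ]
    for name, row in matrix.items():
        # Taxon names must not contain spaces in this simple layout
        lines.append(f"  {name.ljust(width)}{row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"

# Hypothetical party answers on the 0-4 scale, "?" = no opinion
matrix = {"Party_A": "40231", "Party_B": "13?20"}
matrix["Me"] = "2341?"  # your own answers, added as an extra row

text = to_nexus(matrix)
print(text)
```

The resulting text can be saved to a file and opened in a program that reads NEXUS character matrices to re-infer the distances and the network.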

Comparing different elections

As a federal state, Germany has a long tradition of within-party diversity. The best-known example is that the "Schwesterparteien" (sister parties) CSU and CDU disagree on quite a few points. (The CSU is endemic to Bavaria, while the CDU covers the rest of Germany, including the former East Germany; see also my post [in English] on German and French party genealogies after World War II.) Hence, they are treated separately by The Political Compass for the 2017 election. The CSU is in general (much) less neo-liberal than the CDU (placed left of it), but (often) more authoritarian, cultivating conservative views. But neither is the CDU a homogeneous formation when compared from state to state, nor are any of the other parties. The following splits graphs, based on the various Wahl-O-Mat questionnaires, illustrate this quite well.

Let's start with the upcoming state elections in Bavaria and Hesse. Here are the two Neighbor-nets.

Reduced Neighbour-nets for Bavaria and Hesse. Parties competing only in one of the states not included.

We note that some parties keep their position relative to each other. For example, the most severe political antagonists in both states are the Left P. (left-libertarian) and the LKR (distinctly right-authoritarian; political distance PD > 1.5).

The latter is a small party collecting the original founder(s) of the AfD. The AfD is usually described as a (far-)right populist party, but started as a Euro-sceptic conservative and distinctly neo-liberal party. This is well captured in the splits graphs, with the LKR placed either as sister to the Bavarian (less neo-liberal) CSU or at a box connecting the (less authoritarian) CDU with the (more left) AfD. Other small parties (Humanist Party, the animal-rights party P!MUT, and the ÖDP, a conservative-green party) are equally stable.

The "right" is more tree-like in Bavaria than in Hesse because the so far all-ruling CSU tries (tried) to follow an old maxim of Franz-Josef Strauß, who said that there should never be a political party right (i.e. more conservative and nationalist) of the CSU in the Bavarian parliament — hence, it is much more similar to the right-wing populist AfD than the Hesse CDU.

In Hesse, the CDU ruled the state for the last four years with the Greens, which explains the position of the Green Party in both graphs. Being the opposition, and strongly opposing CSU policies (both economically and socially), they are much closer to the Left Party in Bavaria, while occupying a position between their coalition partner CDU and the "left" parties (Left P., SPD) in Hesse.

In Hesse, the Green Party effectively takes the position that in Bavaria is filled by the Pirate Party; the latter had a surge a couple of years ago, entering several state parliaments, but is now back to 2% or less. With the Greens moving right, the Pirate Party in Hesse remains more similar to the classical "left" of the political spectrum.

Another jumper is "Die PARTEI". This is hardly surprising, because they answer some questions in the Wahl-O-Mat by flipping a coin, or select the answer that allows them to come up with the most satirical arguments for their choice (sometimes not so different from those of certain party policies!).

Compared to the last federal election, the federal-state discrepancy in official party policies is striking, and this is well represented in their answers to the Wahl-O-Mat questionnaires.

Same-scaled, taxon-pruned Neighbour-nets. The "Big-6" (7 in Bavaria), parties either already sitting in the parliaments or with a chance to clear the 5% hurdle in upcoming elections, in bold. Arrows indicate current ruling coalitions/government parties.

Being a frequent junior partner of the CDU/CSU, but the opposition in Bavaria (for decades) and Hesse (once the dominant party), the federal SPD is drawn much more to the "right" than its state counterparts. But this holds also for the federal CDU in the opposite way, and hence the FDP becomes the closest (still distant) "relative" of the AfD, which campaigned in 2017 with a more neo-liberal program than it does now in Bavaria and Hesse (a necessity for populist parties, as everyone likes free stuff).

The "blue-green" ÖDP comes closer to the Greens, because ecology-related bullet points took a more prominent place in the federal election Wahl-O-Mat. The "net-gap" in between them, and the edges shared by the ÖDP with the AfD or other parties of the "right" (FW, CDU/CSU, FDP), highlight their differences in social policies.

In Lower Saxony fewer parties competed, so let's prune the taxon set further. The Lower Saxony Neighbor-net has a different scale, because a more differentiated answer was possible. If two parties opposed each other on all points, the distance between them would reach the theoretical maximum for the Lower Saxony matrix, which is 4, i.e. they would strongly disagree on all bullet points that have no missing data for either one.
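The scale can be checked with a small sketch. It assumes the 0 to 4 coding with "?" for "No opinion" described for the Wahlkompass, and a mean absolute difference over the bullet points that both parties answered; the function name and the answer strings are hypothetical.

```python
def ordered_distance(a, b, missing="?"):
    """Mean absolute score difference over characters scored in both rows."""
    pairs = [(int(x), int(y)) for x, y in zip(a, b)
             if x != missing and y != missing]
    if not pairs:
        return None  # no bullet point answered by both parties
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

# Two hypothetical parties, one "No opinion" answer
print(ordered_distance("40?31", "04210"))

# Two hypothetical parties that strongly disagree on every bullet point
print(ordered_distance("4040", "0404"))
```

The second pair reaches the maximum of 4, twice the maximum of 2 that is possible with the three-state Wahl-O-Mat coding.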

Again, parties in (last year's elections) or with chances to enter parliament (upcoming) in bold, and arrows indicating current or leaving government parties/coalitions.

Note how the Green Party and the SPD are placed with respect to the third main party from the traditional "left", the Left Party, and the FDP in comparison to CDU/CSU and AfD, forming the parliamentary "right". In Lower Saxony, the largest (SPD) and second-largest (CDU) parties followed the example of the Bund. The outgoing SPD-Greens coalition lost its tight majority; and although a CDU-FDP-AfD coalition would have had a majority and quite an overlap, involving the AfD in governments has been considered a no-go in Germany to this point (for all involved parties, for different reasons).

Also in the Bundestag, the "right" would have a majority, but the SPD is close enough, and obviously Merkel's preferred partner. The polls for the Sunday elections (yesterday, when you read this) predict the CSU will lose its absolute majority. Here, too, the natural partner (AfD) will be a no-go, so Bavaria will head towards interesting coalition talks with the Greens, being second in the polls. This would be the first time since 1958. The black-green Hesse government is also likely to lose its majority. However, adding the FDP (called "Jamaica coalition", because of the traditional colors of the three parties) should be no big deal, given its position between the current coalition partners.

Links

The post introducing Neighbor-nets to explore Wahl-O-Mat questionnaires can be found here.

More infographics (including plots of each bullet point on the splits graphs) revolving around political distances expressed in election questionnaires, or politics in general, can be found in my Res.I.P. posts — flagged as "Bundestagswahl" (federal elections, in German or English), "Landtagswahlen" (usually in German), "phylo-networks" (usually in English) and "politics" (again mixed).


Related data are included in a figshare fileset (open data; CC-BY licence), which may get updated when another election happens.

Monday, October 1, 2018

Which airlines are the best?


Scientists are known to get about a bit. They attend conferences and give workshops, they go on sabbatical, and sometimes they even have holidays. Many of these activities require them to be in other places than their home city; and to get there they often resort to air travel. This makes it of interest to them to know which airlines are considered to be "good". Scientists may not have much choice about which airlines they can choose to fly, depending on where they live, but they can at least try to fly on one of the good ones.


They are not alone in this desire, and so inevitably there are web sites that provide the necessary information. These include AirHelp Airline Worldwide Rankings; but the best-known listing is the annual one from Skytrax, a UK-based consumer aviation agency.

Each year, Skytrax conducts a survey in which "airline customers around the world" vote for the best airline. The survey results are released at the beginning of each year, and they thus refer to the previous year's survey. Skytrax note that "over 275 airlines were featured in the [current] customer survey but we only feature the top 100 listing."

The Skytrax top-100 data currently exist online for the years 2012-2018 inclusive, which cover the years 2011-2017. It can be useful to consider data for multiple years, because some airlines have greatly improved their ranking through time, while others have slipped back. There are 80 airlines with top-100 data for each of the years 2011-2017, and another 45 airlines that have appeared in the top 100 at least once. A few airlines have also merged during these years.

We can explore the multi-year data for the 80 airlines using a network analysis, to visualize the overall pattern. I first calculated the Manhattan distances pairwise between the airlines, and then plotted these using a NeighborNet graph, as shown in the figure below. Airlines that have similar rankings across the years are near each other in the network; and the further apart they are in the network then the more different are their overall rankings.
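As a minimal sketch of the distance step, the Manhattan distance between two airlines is the sum of the absolute differences between their rankings, year by year. The airline names and ranks below are made up for illustration and are not Skytrax data.

```python
def manhattan(a, b):
    """Sum of absolute rank differences across the years."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical ranks for three consecutive years
ranks = {
    "Airline_A": [1, 2, 1],
    "Airline_B": [3, 1, 2],
    "Airline_C": [60, 75, 90],
}

names = sorted(ranks)
dm = [[manhattan(ranks[p], ranks[q]) for q in names] for p in names]
for name, row in zip(names, dm):
    print(name, row)
```

Two airlines that stay close in the rankings every year end up with a small distance, while an airline that sits far down the list is distant from both.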


As you can see, this is pretty much a linear network, with the best-ranked airlines at the top-right, and then continuing down to the bottom-left. A simple list of the average rankings across the years would be almost as informative. In particular, the top-ranked airlines have remained at the top across the years; and it is only in the middle and especially at the bottom that there has been movement among the rankings (that is, the network broadens out at one spot in the middle and then again at the end).

Note that the top end of the list consists mainly of airlines from the Middle East and Asia. Australia has only two airlines, both of which do well in the network, along with the only one from New Zealand. The presence near the top of both Turkish Airlines and Garuda Indonesia may surprise some people.

You will also note that the US airlines are generally closer to the bottom of the network — they are marked in red in the network. The airlines from China are mostly there, also (except Hainan Airlines). It is not a coincidence that neither of the world's two biggest economies runs a high-quality airline. It seems that the only way to do this is actually to rely on government subsidies, which is how most of the top-ranked airlines are doing it.

Finally, there are few discount airlines that make it into the top 50. Put simply, the economics of running an all-economy-class plane do not allow much in the way of customer service (see How Budget Airlines Work). It is actually the first-class and business-class passengers on any given plane that allow it to take off at all, in terms of making money for the airline — a classic example of the 80/20 rule: 80% of the money comes from 20% of the passengers (see The Economics of Airline Class).

In a similar vein, you could also contemplate the sites pertaining to airport quality (e.g. AirHelp Airport Worldwide Rankings, World Airport Awards), as well as the Guide to Sleeping in Airports. There are also sites that tell you which seats to choose in any given plane (e.g. SeatGuru).

Monday, September 3, 2018

More on networks for placing fossils, such as Eocene lantern fruits


A colleague pointed me to a paper published last year in Science about a spectacular fossil find: an Eocene Physalis-fruit with a preserved lampion. In a recent post, I advocated Neighbor-nets as nice and quick tools to place fossils phylogenetically. In this post, I will exemplify this once more, and argue why this would have been even more informative than what the authors showed as graphs.

The study and the data

In their 2017 paper, Wilf et al. (Science 355: 71–75) describe a new fossil find, which, by itself, rejects the often-too-young molecular dating estimates for Solanaceae, the potato-tomato family, the "Nightshades". The Nightshades include many well-known plants in addition to potato and tomato (the latter is phylogenetically a subclade of the potatoes): for example, the tobacco genus (Nicotiana), and also the genus Physalis, which includes several species commercialized as fruits (e.g. P. peruviana, also known as Cape gooseberry or goldenberry) and ornamental plants (e.g. P. alkekengi, the Chinese Lantern).

Just by looking at the pictures showing the fossil (Wilf et al.'s text-Fig. 1), anyone who ever ate a physalis would agree that it was produced by a member of the genus. However, science is not usually about common sense, but about formal reconstructions. Thus, the authors placed their fossil using a total evidence tree approach: they scored 13 morphological traits as binary or ternary characters, concatenated these data with a molecular data set and inferred trees under maximum parsimony (their text-Fig. 2, below) and maximum likelihood (the tree can be found in the supporting information).


Wilf et al.'s total evidence tree (their Fig. 2) showing, quoted from the legend, the "Phylogenetic relationships of Physalis infinemundi sp. nov. and selected Solanaceae species. Strict consensus of 2835 most parsimonious trees of 3510 steps (CI = 0.438, RI = 0.726)."

Based on the graph, one can confirm that the fossil (arrow; pictured, too) is part of the core Physalis, but its position within this core clade is unresolved. The Decay index shown indicates that moving the entire branch would require just one step more. This is not overly reassuring given the total length of the tree (3510 steps) and the underlying data (the matrix used has 7070 characters!).

The molecular data were selected from an earlier study (Särkinen et al., BMC Evol. Biol., 2013), but the total evidence matrix is not provided (see this post on why we want to publish our phylogenetic data). But at least the "...morphological matrix developed in this paper is tabulated in the supplementary materials."

This file includes two sheets: the first shows the "raw scores", including four continuous characters, and the second shows the "character scoring" used for the analysis, where the continuous characters were scored (binned) as ternary and binary characters. The information provided is partly wrong, likely as the result of copy & paste errors (this is another reason why it should be obligatory for phylogenetic studies to provide the data as an aligned-FASTA or NEXUS file). A corrected version of the "character scores" sheet based on the "raw scores" sheet is included in the figshare submission for this post.

By just filtering this matrix for same-as-in-the-fossil characters, we can identify two extant species that are identical to the fossil in all scored characters: Physalis acutifolia and P. lanceolata. Both are part of the Physalis core clade in Wilf et al.'s total evidence tree, but their position is as unresolved as that of the fossil.
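The filtering step can be sketched as follows: count, for each extant taxon, the characters that differ from the fossil, skipping characters that are missing in either row. The character strings and species labels below are hypothetical placeholders, not the scores from Wilf et al.'s matrix.

```python
def n_differences(fossil, taxon, missing="?"):
    """Number of characters scored in both rows that differ."""
    return sum(1 for f, t in zip(fossil, taxon)
               if missing not in (f, t) and f != t)

# Hypothetical scores for five characters
fossil = "01?10"
extant = {
    "Species_1": "01110",
    "Species_2": "11010",
    "Species_3": "0??10",
}

diffs = {name: n_differences(fossil, row) for name, row in extant.items()}
identical = [name for name, d in diffs.items() if d == 0]
print(diffs)
print(identical)
```

Note that a taxon with many missing scores can appear identical simply because few characters are compared, so the number of shared scored characters is worth reporting alongside the difference count.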

Enlarged part of the above figure, showing the absolute character difference (0 to 5 out of 13 covered characters) between the fossil and other members of the Physalis core clade.

The reason for this becomes clear in the total-evidence maximum-likelihood tree. Here, the fossil is resolved as the sister of P. lanceolata (maximum likelihood bootstrap support: ML-BS < 70, the actual value would have been nice), to which it is identical, both being deeply nested in the Physalis core clade. However, the other identical species (morphologically), P. acutifolia, is placed in the first diverging subclade of the core clade (ML-BS < 70, along with most of the backbone of this clade). The "low" support may have two possible reasons:
  • the fossil, with 99.8% missing data, acts as a 'rogue' taxon; or
  • the genetic data provide little discriminating signal, or only ambiguous signals.
Solanaceae genera can be tricky, and the gene sample lacks high-divergent sequence regions. Since the molecular data are not documented, I can't assess how significant this separation is, but it appears to be supported by at least some mutations: the tree-wise distance is about 0.04 expected substitutions; and the two morphologically indistinct (regarding the scored characters) species are genetically distinct (to some degree).

Extract from Wilf et al.'s Fig. S1, showing the Physalinae subtree with the core Physalis clade and the deeply nested fossil P. infinemundi (in bold font). Support is only shown for branches with a ML-BS support ≥70.

Trees may fail to show the obvious, but networks won't

Just by using the Neighbour-net to visualize the signal in the morphological partition, we can directly argue that the fossil is likely to be part of the core Physalis. Thus, being of Eocene age, it rejects the much-too-young age estimates in e.g. the dated tree by Särkinen et al. (the reference for the molecular data used by Wilf et al.).

Neighbour-net splits graph based on the morphological data partition included in Wilf et al.'s "supermatrix".

In contrast to the little information that comes along with the tree shown above (soft-ish polytomy, weak Decay index, potentially decreased ML-BS support), the splits graph highlights the ambiguity (incompatibility) of the morphological signal. The graph shows little tree-likeness, and members of the same (sub)tribe show little coherence (C = Capsiceae, H = Hyoscyameae, J = Juanulloeae, S = Solaneae; W = Withaninae; all represented by de-facto molecular clades with ML-BS ≥ 77 in Wilf et al.'s supplement Fig. S1). There is one notable exception: members of the core Physalis (red dots) are sufficiently distinct from anything else, forming a highly supported clade (ML-BS = 98 in Wilf et al.'s Fig. S1).

The network also shows that the fossil is identical to both P. acutifolia and P. lanceolata.

Neighbour-net after reducing the taxon set to the phylogenetic neighbourhood of the fossil specimen. Filled fields indicate sister/sibling species supported by a ML-BS >= 80 in Wilf et al.'s "total evidence" ML tree.

By focusing on the phylogenetic neighborhood of the fossil, we end up with a spider-web-like graph. This means that the morphological partition has little consistent signal for recognizing potential relatives: the same features are likely to have evolved in parallel (all members of this neighborhood are likely to share a common origin). After all, 50 million years (and more) is a long time for a lineage to end up with a similar fruit (see also the maximum-parsimony character reconstructions on the parsimony strict-consensus tree provided in the supplement to Wilf et al.'s study).

Data and graphs

The Splits-NEXUS files for the Neighbor-nets and NEXUS-versions of Wilf et al.'s Data S1, as well as additional graphics (network with labeled bubbles) can be found on figshare.

Monday, July 2, 2018

Reticulation at its best — an example from the oaks


One particular case where networks turn out to be a versatile tool is the study of low-level evolutionary patterns. This is especially so when we leave the comfort zone of well-sorted molecular markers, and use more than a single individual per species. Our recently published data set on (mostly Mediterranean) oaks provides a nice example of this.

Why so few people study oaks at the intra-generic level

Oaks are notoriously difficult to study because they don't bother too much about species boundaries (which can be more or less obvious) and, at one point, decided not to sort their plastids at all (and full plastomes, as I once saw for myself first-hand, won't help). Hence, all reasonable phylogenetic reconstructions of oak evolution have been based on genetic data from the nucleome. However, this poses a new problem: the sequenced nuclear gene regions allow the recognition of the major lineages (which recently have been formalized), but the closer one comes to the species level, the more difficult it is to resolve anything at all.

Even the famous ITS region, which includes the weakly constrained internal transcribed spacer ITS1 and the structurally quite constrained ITS2, and has been frequently advocated as a plant barcode, turns out to be a two-edged sword. Relationships between the major intra-generic lineages are relatively clear, and the ITS is pretty divergent down to the species level, but at the individual level one faces an intra-genomic divergence that often outmatches inter-species differentiation.

In some groups, like the most speciose and most widespread white oaks (sect. Quercus), identical ITS variants exist from individuals / species separated today by thousands of kilometers of ocean or icy wasteland. One possible explanation is that oaks have very large population sizes, and they are wind-pollinated, so that they have a high capacity to permanently homogenize their genepools. Plastids, on the other hand, are only transmitted via the large fruit, the acorns, and the main animal vector for distributing acorns, the jaybirds, are sedentary birds. Their backup-vector, the squirrels are known to hoard a lot of acorns in a single place, but not for migrating globally (unless we assist them).

Nonetheless, we readily notice that the intra-individual differentiation patterns appear not to be entirely random, and so in our study we moved to another nuclear multi-copy spacer known to be more variable than the ITS1 and ITS2 (hence, largely ignored by molecular phylogeneticists) — the 5S intergenic spacer (5S IGS). It didn't help too much for solving the white oak puzzle (in western Eurasia), but did give us new insights into the two other western Eurasian sections: Ilex and Cerris.

The 'host-associate' framework

A cloned 5S-IGS (or ITS) sequence is not a good OTU, because we are usually not interested in a clone phylogeny (a mere sequence genealogy), but in the phylogenetic relationships between the individuals or species carrying the cloned sequence variants: the nuclear spacer population. Even networks struggle with such data, and my colleague Markus Göker came up with the idea to treat this in the form of hosts, the individuals, and associates, the cloned sequences found in the individual (Göker & Grimm, BMC Evol. Biol. 2008 — open access). There are several options to transfer the primary clone (associate) data into individual (host) data.

Options that we tested for transferring associate data into host data.
CM = character matrix, DM = distance matrix. The independent host character matrices (CMhosts) used were morphological matrices. ENT (entropy), FRQ (frequency), CON (strict consensus), MOD (modal consensus), and SIZ (sample size) are character transformations implemented in Markus' g2cef; PBC and MIN are distance transformations implemented in pbc (these and other little helper programmes can be found here).

Using three cloned (ITS) datasets, we found that for these data the "Phylogenetic Bray-Curtis" (PBC — see the next figure) distance transformation outperforms the other tested options.

Computation of the "Phylogenetic Bray-Curtis" distance. It's a modification of the Bray-Curtis dissimilarity using the minimum distance for each covered row/column instead of absence/presence. H1/H2 = hosts with different sets of associates (A1–A6).

Incidental but interesting insights

Whenever I come into contact with such data I advise the use of the PBC distance transformation as the basis for the main individual-level network, but also to run the MIN distance transformation: MIN will just calculate the minimum inter-clone distance between the clone samples of two individuals, and use this as the inter-individual distance.
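The MIN transformation described above can be sketched in a few lines. This is not the code of the pbc program; it is an illustration that assumes aligned clone sequences of equal length and uses the uncorrected p-distance as the inter-clone distance. The individuals and sequences are hypothetical.

```python
def p_distance(s, t):
    """Proportion of differing sites between two aligned sequences."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

def min_distance(clones_1, clones_2, dist=p_distance):
    """Smallest inter-clone distance between two individuals' clone samples."""
    return min(dist(a, b) for a in clones_1 for b in clones_2)

# Hypothetical clone samples of three individuals
ind_1 = ["ACGTAC", "ACGTAA"]
ind_2 = ["ACGTAA", "TCGTTT"]
ind_3 = ["TTGTTT", "TCGATT"]

print(min_distance(ind_1, ind_2))  # share one identical clone
print(min_distance(ind_1, ind_3))  # no similar clones
```

Two individuals that share a single identical variant get a MIN distance of zero, no matter how different their remaining clones are; this is why the MIN network is useful for spotting shared variants and potential hybrids.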

Neighbour-net using the MIN transformation

The MIN network (above) is quite bushy for these data, because we naturally have many shared 5S-IGS variants among individuals of the same species, but occasionally also shared by individuals of different species. Nonetheless, it visualizes some basic differentiation patterns in the clone sample. Compare, for example, the coherent cluster 3, the crenata-suber lineage (the 'Cork Oaks'), in which all individuals share a pair of very similar to identical 5S-IGS clones, with the divergent cluster 4, the 'Vallonea' oaks, in which all individuals have different sets of clones but unique 5S-IGS variants separating them from all other Cerris oaks (long proximal edge bundle).

Furthermore, we have potential F1 hybrids (morphologically intermediate) in our sample, and such hybrids, e.g. tj08, should have very low (to zero) MIN distances with members of their parental lineages.

However, the PBC network (below) is as beautiful as it gets — I really love this transformation, as it always comes up with something usable and interpretable.

Neighbor-net based on PBC-transformed inter-individual distances. See Simeone et al. (PeerJ PrePrints 2018 — open access pre-print) for a discussion.

However, this network was a last-minute addition, because a happy little "accident" happened along the way: the networks we were working with and looking at while drafting the paper were not PBC networks, as I had thought.

It happened this way. Also implemented in Markus' little helper program are AVG, the average inter-clone distance, and MAX, the maximum inter-clone distance. AVG and MAX don't result in a proper distance matrix, because the diagonal will be the average or maximum distance between the clones of a single individual, and not all-zero as it should be (for MIN it's always zero). [We discussed a few options to modify AVG and MAX to ensure a zero diagonal, but couldn't devise something that makes sense.]
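To make the diagonal problem concrete, here is a minimal sketch of the three transformations in Python. It is not the code of Markus' helper program: the clone labels, the distances and the choice to include a clone's zero distance to itself when summarising an individual against itself are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical pairwise distances between four clones (symmetric).
CLONE_DIST = {
    ("a1", "a2"): 0.02,
    ("a1", "b1"): 0.00,
    ("a1", "b2"): 0.10,
    ("a2", "b1"): 0.02,
    ("a2", "b2"): 0.08,
    ("b1", "b2"): 0.10,
}
# Hypothetical assignment of clones to two individuals.
CLONES = {"A": ["a1", "a2"], "B": ["b1", "b2"]}


def clone_distance(x, y):
    """Look up a clone-to-clone distance; a clone has distance 0 to itself."""
    if x == y:
        return 0.0
    return CLONE_DIST.get((x, y), CLONE_DIST.get((y, x)))


def transform(ind_1, ind_2, how):
    """Summarise all clone pairs of two individuals as MIN, AVG or MAX."""
    pairs = [clone_distance(x, y)
             for x, y in product(CLONES[ind_1], CLONES[ind_2])]
    if how == "MIN":
        return min(pairs)
    if how == "MAX":
        return max(pairs)
    if how == "AVG":
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown transformation: {how}")


# Off-diagonal (A vs. B) and diagonal (A vs. A) cells for each transformation.
for how in ("MIN", "AVG", "MAX"):
    print(how, "A-B:", transform("A", "B", how), "A-A:", transform("A", "A", how))
```

With these made-up numbers, the diagonal cell for individual A is zero only under MIN; under AVG and MAX it reflects the divergence among A's own clones, which is why the result is not a proper distance matrix.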

However, the SplitsTree program did not insist on an all-zero diagonal, so the AVG- and MAX-transformed distance matrices will still produce a Neighbor-net. So, what I assumed were PBC networks were in fact AVG networks.


Neighbor-net based on AVG-transformed inter-individual distances.

It took me quite a long time to recognize this "error" because, in contrast to the AVG (and MAX) networks I looked at when we did the 2008 paper, the one for the oaks made a lot of sense. Notably, the suspected F1 hybrids were perfectly resolved, spanning the corresponding boxes, and the species aggregates (clusters) made sense regarding the general geographic setting, the history of the region under study, and their morphology.

Same graph as above, highlighting known or potential F1 hybrids spanning the corresponding boxes.

For these data (with a minimum of four clones available per individual, individuals covering all species, and including the entire range of the section in western Eurasia), the AVG network better shows the potential F1 hybrids (or introgrades) than the (more methodologically sophisticated) PBC network. However, the latter makes more sense regarding speciation processes and the history of the group (because the distance is a "phylogenetic" version of the well-known Bray-Curtis distance).


A "cactus-oak" fusion graph depicting nuclear and plastid differentiation (and evolution) in Quercus Group Cerris.

Take-home message

First, it's always good to delegate work you can do by heart to somebody new to it! This forces its propagation, which is important. More importantly, though, one has one's preferences and established analysis pipelines, and they may have become restricted in scope. I mainly used the -a (AVG), -i (MIN) and -x (MAX transformation) options in the little helper program to quickly summarize some of the primary differentiation data: for example, whether individuals have identical clones (MIN = 0), whether intra-individual divergence is higher than inter-individual divergence (MAX intra-individual > MIN inter-individual), and whether individuals have strongly divergent clones (high MAX). AVG was computed and tabulated but never cherished by me. I always looked at the MIN-transformed networks, since this transformation provides a valid distance matrix, but then ignored them. But I never again tried to infer a Neighbor-net based on AVG or MAX transformations after our 2008 paper.

Second, Neighbor-nets are so quick to infer that there is no resource- or logic-related reason to not just run whatever distance one has on hand or can easily establish. Maybe even the biologically less sound distance will reveal some interesting aspect (there are a lot of biological arguments that can be put forward for dismissing AVG distances in favour of PBC distances).

Paper (pre-print) and open data
Simeone MC, Cardoni S, Piredda R, Imperatori F, Avishai M, Grimm GW, Denk T. 2018. Comparative systematics and phylogeography of Quercus Section Cerris in western Eurasia: inferences from plastid and nuclear DNA variation. PeerJ Preprints 6: e26995v1.
Primary data and analysis files are included in the Online Supplementary Archive: Simeone et al., PeerJ Preprints, doi: 10.7287/peerj.preprints.26995v1/supp-4. (See Readme.txt included in the top folder of the archive.)

Monday, June 18, 2018

To boldly go where no one has gone before – networks of moons


This is a joint post by Timothy Holt and Guido Grimm

One ‘a-phylogenetic’ application of phylogenetic methods is the classification of stellar (in the widest sense) objects, so-called "astrocladistics" (see Didier Fraix-Burnet’s dedicated blog: astrocladistics.org). Traditionally, the objects would be characterized and their (dis)similarity translated into a plot (eg. using PCoA) or a tree (eg. a UPGMA tree). Such cluster analysis frameworks would then be the basis for the classification of the objects.

In ‘astrocladistics’, phylogenetic trees that fulfill the maximum parsimony or minimum evolution criteria are used instead. But why should we stop with trees (see the prior blog post Astrocladistics: a network analysis)? For this post, we have used the matrices of a recent astrocladistic paper by Holt et al. (2018) to highlight an as yet under-explored application of phylogenetic methods in classification: exploratory data analysis (EDA).

Why exploratory data analysis

As noted in the earlier post on astrocladistics, one problem is that one infers phylogenetic trees based on data sets that are not the product of an evolutionary process. Some objects may evolve from others (eg. a satellite may evolve from planetary ring matter), but this is not a dichotomous splitting process through time. And any non-dichotomous process can lead to tree-incompatible signals, which will then hamper tree inference in a biological context. Any tree using astral objects (galaxies, stars, planets, moons) as OTUs is per se a faux phylogeny (some examples for faux phylogenies are collected here and here).

Another problem is a data-inherent bias. The matrices are coded in a fashion that reflects an a priori hypothesis of derivation. For instance, by inferring that objects farther away are older and closer ones are younger, we can make hypotheses about maturation of galaxies, and hard-code this hypothesis into the data matrix. This will infer a tree that was coded into the matrix.

Guido’s starting argument is that when our main goal is classification and not inferring evolutionary relationships, the topology of the tree (or alternative trees) is the least of our concerns. What we want to know is to what degree our data converges to the same groupings and supports coherent classes. This is exploratory data analysis, and Neighbor-nets are then a powerful tool to visualize any differentiation pattern (see some recent a-biological examples: U.S. gun legislation, cryptocurrencies, where to retire Worldwide and within the USA).

Instead of inferring trees, as in the original paper about two satellite systems (Jupiter and Saturn), here we use the matrices to infer Neighbor-nets, map character support (non-parametric bootstrap support) on the resultant networks, and discuss the prospects and perils of ‘astro-Neighbor-nets’ when it comes to classification of astronomical bodies.

Data properties and analysis set-up

In order to construct the matrices, three different types of characteristics were used: dynamical, physical and compositional. Dynamical characteristics are the positions of the various satellites, how far they are away from the planet (semi-major axis), their inclination to the plane and eccentricity of their orbit. Several of the satellites also orbit opposite to the planet's rotation (they are on a retrograde orbit), which is also coded. Physical characteristics are two properties of the satellites: their albedo, or how reflective they are, and their density. Any characteristics related to mass and size are specifically avoided, as this would hide any parent/daughter relationships resulting from breakups. The compositional characteristics are the most numerous ones in the analysis. These are binary characteristics indicating the presence/absence of chemical species, eg. water, iron, methane, etc.

Five of the characteristics, semi-major axis, inclination, eccentricity, albedo and density, are ordered and continuous. These pose a problem for standard cladistic analysis using parsimony, which needs discrete character states. Hence, these characteristics are binned using a Python program. Each character-set is binned independently, and for each of the Jovian and Saturnian systems. The aforementioned Python program iterates the number of bins until a linear regression model between binned and unbinned sets achieves a coefficient of determination (r²) score of > 0.99. All characteristics are binned in a linear fashion, with the majority increasing in progression. The exception to the linear increase is the density character set, with a reversed profile. All of the continuous, binned characteristic sets are (by definition) ordered characters.
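The binning loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program used for the paper: equal-width bins, a search that starts at two bins, a cap of 50 bins, and the example values are all hypothetical choices.

```python
def r_squared(xs, ys):
    """Coefficient of determination of a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no linear relationship to measure
    return (sxy * sxy) / (sxx * syy)


def bin_linear(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot bin a constant characteristic")
    width = (high - low) / n_bins
    # The largest value would fall just outside the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def bin_until_fit(values, threshold=0.99, max_bins=50):
    """Increase the number of bins until binned vs. raw values exceed the r² threshold."""
    for n_bins in range(2, max_bins + 1):
        binned = bin_linear(values, n_bins)
        if r_squared(values, binned) > threshold:
            return n_bins, binned
    raise ValueError("threshold not reached within max_bins")


# Hypothetical continuous characteristic (arbitrary units).
raw = [0.1, 0.4, 0.7, 1.9, 2.2, 3.5, 4.8, 5.0, 7.3, 9.6]
n_bins, states = bin_until_fit(raw)
print(n_bins, states)
```

The returned bin indices can then be used as ordered character states. A reversed profile, as mentioned for density, would simply count the bins from the other end.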

Thus, the matrices comprised two sorts of characters with strongly different properties, when it comes to explicit inferences: binary characters, and highly ordered characters (the binned ones) with up to 11 states. For the graphs used here, we didn’t apply any weighting, which means that in the most extreme case complete difference in a binned character counts 11-times more than a difference in any of the binary characters. This bias is compensated to some degree by the number of binary characters (33, with 31 variable) vs. binned characters (5), when restricting the analysis to well-known planetary objects.

The matrices are comprehensive, and include little-known objects with a lot of missing data (>80% of the characters cannot be scored), which should be included. A matrix-based classification makes most sense, when one uses character sets that are defined for most or all of the objects. Thus, to see how the little-known objects relate to the well-known, we eliminated all poorly covered characters, leaving us with two binary and five binned ones. To not lose the information from the binary characters when calculating the inter-object distances, we gave them a weight of 7–8. This ensures that a 0↔1 difference in a binary character more or less equals the maximum possible difference in a binned character (on average 8 bins for the Jupiter dataset, and 7 for the Saturn data set).
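The effect of up-weighting the binary characters can be illustrated with a small Python sketch. The distance used here (a mean weighted character difference that skips characters missing in either object) is an assumption made for illustration and not necessarily what PAUP* computes; the three-character objects are hypothetical.

```python
def weighted_distance(obj_a, obj_b, kinds, binary_weight):
    """Mean weighted character difference between two objects.

    Binned (ordered) characters contribute the absolute difference in bin
    index; binary characters contribute 0 or binary_weight. Characters that
    are missing (None) in either object are skipped.
    """
    total = 0.0
    compared = 0
    for a, b, kind in zip(obj_a, obj_b, kinds):
        if a is None or b is None:
            continue
        step = abs(a - b)
        total += step * binary_weight if kind == "binary" else step
        compared += 1
    if compared == 0:
        raise ValueError("no character scored in both objects")
    return total / compared


# Hypothetical objects: one binary character, two binned characters (bins 0-7).
kinds = ["binary", "binned", "binned"]
moon_x = [0, 0, 7]
moon_y = [1, 7, 7]

print(weighted_distance(moon_x, moon_y, kinds, binary_weight=1))
print(weighted_distance(moon_x, moon_y, kinds, binary_weight=7))
```

Without weighting, the full seven-bin difference in the second character counts seven times as much as the 0↔1 difference in the first; with a weight of 7, both contribute equally.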

Fig. 1 The orbits of the satellites in the Jovian satellite system. Colours represent traditionally recognized groups.

The Jovian moons (and ring)

The Jovian system (Fig. 1) is dominated by the famous Galilean satellites (moons): Io, Europa, Ganymede, and Callisto. In between these moons and Jupiter there is a faint ring system, and four small satellites, the Amalthea family. Outside the Galilean system, there is a system of 67 small irregular satellites that have much wilder orbits, some going in the opposite direction to the other satellites. These are thought to be captured asteroids.

As seen in Fig. 2, the data analysis yields a somewhat tree-like network. It supports the Galilean Group (the large moons), the association of Amalthea with Jupiter’s Main Ring, and the Himalia Family, but it rejects the traditional division of the remaining well-known moons (captured asteroids) into three families: two of the three Pasiphae Family satellites are very similar to Carme. Although Ananke is somewhat different, it is substantially more similar to Carme and the Pasiphae Family satellites than the inter-group differentiation found elsewhere. One commonly shared idea about classification is that one should erect classes that have similar quality and are defined by high intra-class coherence and inter-class differentiation.

Fig. 2 Neighbor-net of Jovian moons, well-covered by data (<1% or no missing data). Colouration addresses traditionally recognised groups, same colours used as in Fig. 1.

By reducing the character set to those characters that are defined for most or all moons, we naturally take away some of the potential differentiation. Nonetheless, the resulting graph (Fig. 3) provides a structure that may well be used to place less-known objects, identify their closest best-known counterpart(s), erect a classification, and discuss current classification schemes.

Fig. 3 Neighbor-net of all Jovian moons based on a distance-matrix reflecting the known (scored) similarity and differences. Colouration as above.

We can see that we lose differentiation within the well-known (and well-supported; Fig. 2) groups, especially regarding the distinctness among members of the Galilean Group and between them and the Amalthea Family. However, the basic structure of the graph remains the same. Based on the scored data, the Ananke, Pasiphae and Carme Families are not supported. A sub-division may be possible, but would require some re-shuffling of the moons. For instance, a group including Ananke-like satellites would not include Euporie, Iocaste and Hermippe, but may include Callirhoe. A Pasiphae s.str. group would make sense when excluding Aoede, Helike, Sinope, Autonoe, and Eurydome, with the latter three being (nearly) identical to Carme or members of the Carme Family.

Fig. 4 The orbits of the satellites in the Saturnian satellite system. Colours represent traditionally recognised groups.

The Saturnian moons and rings

The Saturn system (Fig. 4) is similar to the Jupiter one.

The ring structure is one of Saturn’s most distinctive features, with structures seen even with a modest telescope. Embedded in the rings are small moonlets. The co-orbitals Janus and Epimetheus, just outside the main rings, swap orbits every four years. There are eight mid-sized satellites, including Titan, a small world in itself with a methane-based weather system. Of the other icy satellites, Mimas, Enceladus, Tethys, Dione and Rhea are embedded in the diffuse E-ring. The source of this ring is cryovolcanic plumes on Enceladus, a possible location for life beyond Earth.

Unique to the Saturn system are Trojan satellites, in the same orbit as their parent satellites, one 60° ahead and one 60° behind. Tethys has Telesto and Calypso as Trojan satellites, while Helene and Polydeuces are Trojan satellites of Dione. Between the orbits of Mimas and Enceladus, there are the Alkyonides (Methone, Anthe and Pallene) recently discovered by the Cassini spacecraft. Each of the Alkyonides has its own faint ring arc composed of similar material to the satellite.

As with Jupiter, there is also a system of 38 small irregular satellites outside the inner system. This system is dominated by Phoebe — at ~240 km across it is six times the size of the next largest irregular. It is also the only irregular satellite to have its photo taken, with the Cassini spacecraft flying within 2,000 km of the surface, taking high-res images as it went. Using these new data, a picture is emerging of Phoebe as a captured outer solar system object.

For Jupiter’s smaller sibling (Saturn), a less tree-like network is inferred (Fig. 5). Since it is not the product of an evolutionary process in a biological sense (ie. a phylogeny), but instead includes patterns related to parent/descendant relationships (rings-moons, breakups), we should not necessarily expect tree-like graphs.

Fig. 5 Neighbor-net of well-known Saturnian planetary objects (<1% or no missing data). Colouration addresses traditionally recognised groups, using the same colours as in Fig. 4.

Nonetheless, the graph could be the basis of an objective classification. The elements of the ‘Main Ring Group’ have high intra-group coherence, but also include Calypso and Telesto of the ‘E Ring Group’. On the other hand, the ‘Outer Satellite Group’ is very heterogeneous. One straightforward option would be to fuse this group with the E Ring Group; another is to exchange Enceladus for Hyperion. The Norse Group’s (= Phoebe Family) representative Phoebe is clearly distinct from any other object and would need a class of its own.

As in the case of Jupiter, we can add and try to classify the remaining little-known objects (Fig. 6), to some degree.

Fig. 6 Neighbor-net of all Saturnian moons and rings based on a distance-matrix reflecting the known (scored) similarity and differences. Colouration as above.

In contrast to Jupiter, the reduced character set (just four characters, one binary, three binned-ordered) loses the differentiation between objects of the Main Ring, E Ring and Outer Satellite groups included in Fig. 5. They are virtually identical for these characters. The two groups not covered in Fig. 5, the Siarnaq Family (named after Inuit gods) and the Albiorix Family (Gallic gods), are close to each other. The Albiorix Family forms a distinct subset of the Siarnaq Family. The moons of the coherent Phoebe Family (named after Norse mythological figures) are all close to each other, and this group includes various newly discovered satellites. Also interesting is the position of the Phoebe Ring compared to its name-giving moon and the remainder of the Phoebe Family.

Comparison with the tree-based analyses

In comparison with the astrocladistical work of Holt et al. (2018), the network-based analysis captures most of the meta-structure of the satellite classifications.

Compared with the Jovian trees, the network-based analysis shows the distinction between the inner Galilean group and the outer ‘irregular’ satellites and separates the Himalia family. The differences are in how each of the analyses handles the retrograde irregulars, the Pasiphae, Carme and Ananke families. It should be noted that these bodies are woefully under-studied, and have very little information available, making any inferences difficult.

In Holt et al.’s trees, the Ananke, Carme and Sinope subfamilies are unresolved, but are supported using Multivariate Hierarchical Cluster Analysis (example provided in Fig. 7). This method uses clustering in parameter space to justify collisional families. Though the particular members are different, the network-based analysis still identifies clusters around the largest irregular satellites, Ananke, Sinope, Carme and Pasiphae. This further supports the theory that these families are remnants of collision breakups. As usual with science, there is far more work to be done here.

Fig. 7 Clustering of several Jovian irregular satellites in three-dimensional parameter space using semi-major axis (a), eccentricity (e) and inclination (i).


In the Saturnian system, the outer satellites also prove to be problematic. Holt et al. split the Aegir and Ymir subfamilies from within the Phoebe family. These subfamilies are distinct from Phoebe and its ring, following a narrative of a different origin for Phoebe and the rest of the irregular satellites. The capture of Phoebe would have major disruptive effects on the satellites. The dynamical characteristics play a large role here because they are the only information available for some of the satellites, so that little sub-structuring can be seen. As with the Jovian irregular satellites, more information is needed.

The inner system of Saturn also warrants mention, particularly the case of Telesto and Calypso, the Trojans of Tethys. In the network analysis, they are associated with main ring objects, rather than with Tethys itself. There is a possibility that these two Trojans are captured main-ring objects, and this would support that hypothesis. Dione, and its Trojan Helene, are both closely associated with one another in both analyses, indicating a parent/daughter relationship (keep in mind that phylogenetic trees cannot discern between parent/daughter and sister relationships).

Phoebe as seen by the Cassini spacecraft, NASA/JPL/Space Science Institute, PIA06064 (NASA provides more than 1000 media files covering the Cassini-Huygens mission)


Boldly gone – networks as tools in classification
 
The idea discussed here appears worth exploring — using distance-based or other (meta-)phylogenetic networks for the classification of objects not necessarily following any phylogeny. It has some obvious advantages over astrocladistics (especially when using maximum parsimony as the tree-optimizing criterion) or traditional classification methods (PCoA, simple clustering approaches):
  • Distance-matrices are easy and quick to generate based on any data; and they can also be used for more traditional classification means such as PCoA.
  • Neighbor-nets are very quick to calculate, and can capture more aspects of the actual differentiation than can cluster analysis (e.g. UPGMA trees, PCoA) or astrocladistic methods; in some sense they represent a fusion of the best aspects of both approaches.
  • In contrast to a tree, where tree-incompatible signals can massively distort branch-length patterns, or rogue objects interfere with establishing a finely resolved topology, a Neighbor-net can be straightforwardly interpreted regarding group coherence.
Perhaps the main disadvantage of this approach is the need for a distance matrix with meaningful pairwise distances. If missing data distort the general (dis)similarity patterns, then the Neighbor-net may have branching (edging) artifacts.

However, using the Neighbor-net as a basis for classification also allows us to quickly test for character sampling bias, eg. by re-calculating the distance input matrix using weighting schemes, or different distance calculations (eg. instead of binning the continuous characters, they could be used as-is), or reduced character or taxon sets. Also, when it comes to classifying non-living objects, it’s always good to keep it as simple as possible, while being able to explore the signal in the data matrix.

More results, the data matrices used, and the template analysis files can be found on figshare. The archive also includes the (simple) NEXUS-formatted files with the PAUP* command blocks we used for the analyses. The one for Jupiter is fully annotated with comments on the code lines for PAUP* to assist inexperienced users and to facilitate export and (subsequent) import into SplitsTree.

The archive also includes the code for and results of a full bootstrap-support analysis (currently two optimality criteria: Least-squares and Maximum parsimony, Maximum likelihood to be added). Even when preferring the astrocladistic approach, networks are handy to summarize the bootstrap pseudoreplicate sample.

Reference

Holt TR, Brown AJ, Nesvorný D, Horner J, Carter B (2018) Cladistical analysis of the Jovian and Saturnian satellite systems. Astrophysical Journal 859(2): 97, 20 pp; arXiv: 1706.0142

Monday, May 7, 2018

Keeping it simple in phylogenetics


This is a post by Guido, with a bit of help from David.

There's an old saying in physics, to the effect that: "If you think you need a more complex model, then you actually need better data." This is often considered to be nonsense in the biological sciences and the humanities, because the data produced by biodiversity is orders of magnitude more complex than anything known to physicists:
The success of physics has been obtained by applying extremely complicated methods to extremely simple systems ... The electrons in copper may describe complicated trajectories but this complexity pales in comparison with that of an earthworm. (Craig Bohren)
Or, more succinctly:
If it isn’t simple, it isn’t physics. (Polykarp Kusch)
So, in both biology and the humanities there has been a long-standing trend towards developing and using more and more complex models for data analysis. Sometimes, it seems like every little nuance in the data is important, and needs to be modeled.


However, even at the grossest level, complexity can be important. For example, in evolutionary studies, a tree-based model is often adequate for analyzing the origin and development of biodiversity, but it is inadequate for studying many reticulation processes, such as hybridization and transfer (either in biology or linguistics, for example). In the latter case, a network-based model is more appropriate.

Nevertheless, the physicists do have a point. After all, it is a long-standing truism in science that we should keep things simple:
We may assume the superiority, all things being equal, of the demonstration that derives from fewer postulates or hypotheses. (Aristoteles) 
It is futile to do with more things that which can be done with fewer. (William of Ockham) 
Plurality must never be posited without necessity. (William of Ockham) 
Everything should be as simple as it can be, but not simpler. (Albert Einstein)
To this end, it is often instructive to investigate your data with a simple model, before proceeding to a more complex analysis.

Simplicity in phylogenetics

In the case of phylogenetics, there are two parts to a model: (i) the biodiversity model (eg. chain, tree, network), and (ii) the character-evolution model. A simple analysis might drop the latter, for example, and simply display the data unadorned by any considerations of how characters might evolve, or what processes might lead to changes in biodiversity.

This way, we can see what patterns are supported by our actual data, rather than by the data processed through some pre-conceived model of change. If we were physicists, then we might find the outcome to be a more reliable representation of the real world. Furthermore, if the complex model and the simple model produce roughly the same answer, then we may not need "better data".


Modern-day geographic distribution of Dravidian languages (Fig. 1 of Kolipakam, Jordan, et al., 2018)

Historical linguistics of Dravidian languages

Vishnupriya Kolipakam, Fiona M. Jordan, Michael Dunn, Simon J. Greenhill, Remco Bouckaert, Russell D. Gray, Annemarie Verkerk (2018. A Bayesian phylogenetic study of the Dravidian language family. Royal Society Open Science) dated the splits within the Dravidian language family in a Bayesian framework. Aware of uncertainty regarding the phylogeny of this language family, they constrained and dated several topological alternatives. Furthermore, they checked how stable the age estimates are when using different, increasingly elaborate linguistic substitution models implemented in the software (BEAST2).

The preferred and unconstrained result of the Bayesian optimization is shown in their Figs 3 and 4 (their Fig. 2 shows the neighbour-net).

Fig. 3 of Kolipakam et al. (2018), constraining the North (purple), South I (red) and South II (yellow) groups as clades (PP := 1)
Fig. 4 of Kolipakam et al. (2018), result of the Bayesian dating using the same model but no constraints. The Central and South II groups are mixed up.

As you can see, many branches have rather low PP support, which is a common (and inevitable) phenomenon when analyzing non-molecular data matrices providing non-trivial signals. This is a situation where support consensus networks may come in handy, which Guido pointed out in his (as yet unpublished) comment to the paper (find it here).

On Twitter, Simon Greenhill (one of the authors) posted a Bayesian PP support network as a reply.

A PP consensus network of the Bayesian tree sample, probably the one used for Fig. 3 of Kolipakam et al. 2018, constraining the North, South I, and South II groups as clades (S. Greenhill, 23/3/2018, on Twitter).

Greenhill himself didn't find it too revealing, but for fans of exploratory data analysis it shows, for example, that the low support for Tulu as sister to the remainder of the South I clade (PP = 0.25) is due to lack of decisive signal. In the case of the low support (PP = 0.37) for the North-Central clade, one faces two alternatives: it's equally likely that the Central Parji and Ollari Gadba are related to the South II group, which forms a highly supported clade (PP = 0.95) including the third language of the Central group (one of the topological alternatives tested by the authors).

A question that pops up is: when we want to explore the signal in this matrix, do we need to consider complex models?

Using the simplest-possible model

The maximum-likelihood inference used here is naive in the sense that each binary character in the matrix is treated as an independent character. The matrix, however, represents a binary sequence of concepts in the lexica of the Dravidian languages (see the original paper for details).

For instance, the first, invariant, character encodes for "I" (same for all languages and coded as "1"), characters 2–16 encode for "all", and so on. Whereas "I" (character 1) may be independent from "all" (characters 2–16), the binary encodings for "all" are inter-dependent, and effectively encode a micro-phylogeny for the concept "all": characters 2–4 are parsimony-informative (ie. split the taxon set into two subsets, and compatible); the remainder are parsimony-uninformative (ie. unique to a single taxon).
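The distinction between invariant, parsimony-uninformative and parsimony-informative binary characters can be checked with a few lines of Python. This is a minimal sketch; the four-taxon columns are made up and are not taken from the Dravidian matrix.

```python
def classify_character(states):
    """Classify one binary character (a column of '0'/'1' states, '?' = missing)."""
    scored = [s for s in states if s in ("0", "1")]
    ones = scored.count("1")
    zeros = scored.count("0")
    if ones == 0 or zeros == 0:
        return "invariant"
    if min(ones, zeros) == 1:
        # One state is unique to a single taxon: no grouping information.
        return "parsimony-uninformative"
    # Each state occurs in at least two taxa: the column defines a split.
    return "parsimony-informative"


# Hypothetical columns for four taxa.
for column in ("1111", "1000", "1100", "1?00"):
    print(column, classify_character(column))
```

A binary character only splits the taxon set into two non-trivial subsets when each state occurs in at least two taxa; this is the condition the last branch tests.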


The binary sequence for "All" defines three non-trivial splits, visualized as branches, which are partly compatible with the Bayesian tree; eg. Kolami groups with members of South I, and within South II we have two groups matching the subclades in the Bayesian tree.

Two analyses were run by the original authors, one using the standard binary model, Lewis’ Mk (1-parameter) model, and allowing for site-specific rate variation modelled using a Gamma-distribution (option -m BINGAMMA). As in the case of morphological data matrices (or certain SNP data sets), and in contrast to molecular data matrices, most of the characters are variable (not constant) in linguistic matrices. The lack of such invariant sites may lead to so-called “ascertainment bias” when optimizing the substitution model and calculating the likelihood.

Hence, RAxML includes an option to correct for this bias for morphological or other binary or multi-state matrices. In the case of the Dravidian language matrix, four out of the over 700 characters (sites) are invariant and were removed prior to rerunning the analysis with the correction applied (option -m ASC_BINGAMMA). The results of both runs show a high correlation: the Pearson correlation coefficient of the bipartition frequencies (bootstrap support, BS) is 0.964. Nonetheless, BS support for individual branches can differ by up to 20 (which may be a genuine or random result, we don't know yet). The following figures show the bootstrap consensus network of the standard analysis and for the analysis correcting for the ascertainment bias.
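Comparing the bipartition frequencies of two bootstrap runs in this way needs little more than a table of splits per run. The Python sketch below uses hypothetical split labels and support values, and it assumes that a split absent from one run counts as zero support there; it is not the script used for the numbers above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant values")
    return sxy / sqrt(sxx * syy)


def compare_support(bs_run_1, bs_run_2):
    """Align two {split: support} tables and summarise how well they agree."""
    splits = sorted(set(bs_run_1) | set(bs_run_2))
    # Assumption: a split not found in a run has zero support in that run.
    xs = [bs_run_1.get(s, 0) for s in splits]
    ys = [bs_run_2.get(s, 0) for s in splits]
    largest_gap = max(abs(x - y) for x, y in zip(xs, ys))
    return pearson(xs, ys), largest_gap


# Hypothetical bootstrap support (in %) for splits found in two runs.
uncorrected = {"s1": 100, "s2": 80, "s3": 40, "s4": 10}
corrected = {"s1": 100, "s2": 90, "s3": 60, "s5": 15}

r, gap = compare_support(uncorrected, corrected)
print(round(r, 3), gap)
```

A high overall correlation can coexist with sizeable differences for individual branches, which is why the largest per-split gap is reported alongside the coefficient.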

Maximum likelihood (ML) bootstrap (BS) consensus network for the standard analysis. Green edges correspond to branches seen in the unconstrained Bayesian tree in Kolipakam et al. (2018, fig. 4), the olive edges to alternatives in the PP support network by S. Greenhill. Edge values show ML-BS support, and PP for comparison.

ML-BS consensus network for the analysis correcting for the ascertainment bias. BSasc annotated at edges in bold font, with BSunc and PP (graph before) provided for comparison. Note the higher tree-likeness of the graph.

Both graphs show that this character-naïve approach is relatively decisive, even more so when we correct for the ascertainment bias. The graphs show relatively few boxes, referring to competing, tree-incompatible signals in the underlying matrix.

Differences involve Kannada, a language that is resolved as equally related to Malayalam-Tamil and Kodava-Yeruva (BSasc = 39/35 when correcting for ascertainment bias, but BSunc < 20/40 using the standard analysis); and Kolami is supported as sister to Koya-Telugu (BSasc = 69 vs. BSunc = 49) rather than Gondi (BSasc < 20, BSunc = 21).

They also show that from a tree-inference point of view, we don't need highly sophisticated models. All branches with high (or unambiguous) PP in the original analysis are also inferred, and can be supported using maximum likelihood with the simple 1-parameter Mk model. This also means that if the scoring were to include certain biases, the models may not correct against this. At best, they help to increase the support and minimize the alternatives, although the opposite can also be true.

For relationships within the Central-South II clade (unconstrained and constrained analyses), the PP were low. The character-naïve maximum likelihood analysis reflects some signal ambiguity, too, and its BS values can occasionally be higher than the PP. BS > PP values directly indicate issues with the phylogenetic signal (e.g. lack of discriminative signal, topological ambiguity), because in general PP tend to overestimate support and BS to underestimate it. The only obvious difference is that maximum likelihood failed to provide support for the putative sister relationship between Ollari Gadba and Parji of the Central group.

The crux with using trees

When inferring a tree as the basis of our hypothesis testing, we do this under the assumption that a series of dichotomies can model the diversification process. Languages are particularly difficult in this respect, because even when we clean the data of borrowings, we cannot be sure that the formation of languages represents a simple split of one unit into two units. Support consensus networks based on the Bayesian or bootstrap tree samples can open a new viewpoint by visualizing internal conflict.
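The idea behind a support consensus network can be sketched in a few lines: collect every bipartition found in the tree sample, count how often it occurs, and keep all bipartitions above a cut-off, whether or not they are compatible with each other. The example below assumes trees written as nested tuples, a cut-off of 0.2 and a made-up sample of ten trees; none of this is the software or the data used for the figures.

```python
from collections import Counter

def leaves(node):
    """Set of taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def splits(tree):
    """Non-trivial bipartitions of a tree.

    Each bipartition is stored as the side that does not contain the
    alphabetically first taxon, so that identical splits compare equal.
    """
    taxa = leaves(tree)
    ref = min(taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits and the root
            found.add(taxa - side if ref in side else side)
        for child in node:
            walk(child)

    walk(tree)
    return found

def split_support(trees, threshold=0.2):
    """Frequency of every bipartition occurring in at least `threshold` of the trees."""
    counts = Counter(s for tree in trees for s in splits(tree))
    return {s: c / len(trees) for s, c in counts.items()
            if c / len(trees) >= threshold}

# Hypothetical sample: six trees favour A+B, four favour A+C
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
support = split_support([tree_1] * 6 + [tree_2] * 4)

for side, freq in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(side), freq)
```

In this toy sample, the split separating D and E is found in every tree, while the two competing alternatives appear with frequencies of 0.6 and 0.4; a consensus network would show both, a consensus tree only one or neither.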

This tree-model conflict may be genuine. For example, when languages evolve and become established, they may end up closer to or farther from their respective sibling languages, and may have undergone some non-dichotomous sorting process. Alternatively, the conflict may be due to character scoring, the way one transforms a lexicon into a sequence of (here) binary characters. The support networks allow exploring these phenomena beyond the model question. Ideally, a BS of 40 vs. 30 means that 40% of the binary characters support the one alternative and 30% support the competing one.

In this respect, historical-linguistic and morphological-biology matrices have a lot in common. Languages and morphologies can provide tree-incompatible signals, or contain signals that support different topologies. By mapping the characters on the alternatives, we can investigate whether this is a genuine signal or one related to our character coding.

Mapping of the binary sequences for the concept "all" (the example used above to illustrate the matrix's basic properties; equalling 15 binary characters) on the ML-BS consensus network. We can see that its evolution is in pretty good agreement with the overall reconstruction. Two binaries support the sister relationship of the South II languages Koya and Telugu, and a third collects most members of the South I group. All other binaries are specific to one language and hence do not conflict with the edges in the network.
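Such a mapping can be roughed out programmatically. The sketch below sorts each binary character into one of four bins: invariant, specific to one language, matching an edge of the network, or matching none. It assumes complete data (no missing states) and uses a toy set of languages, edges and characters; the character names and their states are invented for illustration and are not the actual scoring of "all".

```python
def classify(character, network_splits, taxa):
    """Assign a binary character (dict: taxon -> 0/1) to a category."""
    ones = frozenset(t for t in taxa if character[t] == 1)
    zeros = taxa - ones
    smaller = min(len(ones), len(zeros))
    if smaller == 0:
        return "invariant"
    if smaller == 1:
        return "single language"  # cannot conflict with any edge
    for split in network_splits:
        side = frozenset(split)
        if ones == side or zeros == side:
            return "supports edge"
    return "no matching edge"

# Toy data (hypothetical): five languages, one network edge
taxa = frozenset({"Koya", "Telugu", "Tamil", "Kannada", "Gondi"})
network_splits = [{"Koya", "Telugu"}]

def states(*present):
    """Helper: score 1 for the listed languages, 0 for all others."""
    return {t: int(t in present) for t in taxa}

characters = {
    "all_a": states("Koya", "Telugu"),
    "all_b": states("Tamil"),
    "all_c": states("Tamil", "Gondi"),
}

result = {name: classify(ch, network_splits, taxa)
          for name, ch in characters.items()}
for name, category in result.items():
    print(name, "->", category)
```

Characters ending up in the last bin are the ones worth a closer look, since they either reflect a signal not represented in the network or a scoring decision that needs revisiting.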