Monday, April 22, 2019

The 2nd Amendment does more than keep King George away


A year ago, in the aftermath of the Florida shooting, I used a neighbor-net as a way to visualize U.S. gun legislation (see the first graph here). In this post, we will use this network to explore some other aspects of American society.

A network illustrating the diversity in U.S. gun legislation. Blue stars – states with a gun registry.

The network picture emphasizes those states where guns are regulated to some extent (in green), but this means that the states at the bottom-left have little or no regulation of gun ownership. Note, first, that the U.S. gun lobby argues that the absence of any gun control is covered by the 2nd Amendment to the U.S. Constitution, which covers the right of citizens to form a "well regulated militia", an amendment installed to protect the freedom of the new republic from the former British sovereign (i.e. to "keep King George away").

This claim ignores the fact that "well regulated" implies regulation of some sort, while the network emphasizes its absence in many cases. Besides, the risk of being re-conquered by Her Majesty's Royal Army is quite low these days, with or without Brexit. More to the point, the world itself has changed quite a bit since the 1700s, while the Constitution has had only a few Amendments added and subtracted.

If we start using the neighbor-net to look at the data, then we can see that there is at least one obvious consequence of unregulated gun ownership. For example, the next plot shows the number of gun-related deaths (in 2016) superimposed on the gun-regulation network.

The total number of firearm-related deaths in 2016 (includes accidents and suicides).
Data from worldlifeexpectancy.com; this and more plots can be found here:
Visualising U.S. gun legislation, and mapping politics, economics, and population

There seems to be a good correlation between unregulated gun ownership and the probability of getting shot or shooting yourself — the number of shootings is greatest in the lower-left of the network, where gun ownership is essentially unregulated (see the Gun Violence Archive for current numbers).

Arming every citizen may have helped to fend off King George's Redcoats but, in the long run, a substantial number of Americans (c. 275,000 per year, when compared with Canada's rate) would still be alive if the Colonies had become a dominion of HRM, like Australia or Canada. Both Canadians and Australians own a lot of firearms per capita (see the Small Arms Survey for up-to-date estimates). However, Canada has long had Europe-style legislation (and low casualty frequencies), while Australia implemented it more recently, leading to a massive drop in firearm-related deaths (see above).

As a side note, arming every male citizen to secure freedom from a feudal lord was probably a Swiss invention (see the Swiss Federal Charter of 1291, the Bundesbrief). Switzerland has a compulsory general draft of young males; after this service they take their Sturmgewehr back home (until 2007, including the ammunition) for the yearly training exercise, and to be prepared to fend off invaders. The Swiss have a c. 4-times lower rate of firearm-related deaths than the U.S. (2.8 in 2015 according to GunPolicy.org; nearly all of them males). The only EU country approaching the lowest U.S. values is Finland, and there it is almost exclusively accidents and suicides.

Other factors

It is important to keep in mind that the United States is a true federation of states, with each state having a substantial amount of autonomy, which is not found in any other country with a federal organization. Hence, many other aspects differ between states, not just the substantial differences in gun legislation.

For example, economics differ greatly between the states, and this also shows a reasonable correlation with gun regulation, as seen in this next version of the network. Note that Gross Domestic Product (GDP) is a monetary measure of the market value of all the goods and services produced annually — rich places have high GDP and poor places have lower GDP.

Real gross domestic product per capita mapped on the gun-legislation-based network.
Red, below global U.S. value; green above global U.S. value.
Data source: U.S. Bureau of Economic Analysis.

So, the economically poorer the state, the less likely there is to be gun regulation.

Modern developments include allowing women into the armed forces, and granting them the right to vote. For example, the 19th Amendment to the U.S. Constitution, which was passed by Congress on June 4, 1919, and ratified on August 18, 1920, granted women the right to vote. This first map shows the situation for the European Union, some parts of which lagged behind the U.S.

Implementation of general right to vote within the countries of the EU (source: Süddeutsche Zeitung).
In the case of Germany and France, the reason was a lost war leading to the (re)establishment of new republics.

Women make up about 50% of the populace and (usually) more than 50% of the electorate (having a generally higher life expectancy), but they are still typically under-represented in parliaments (here are a few examples). The United States is, sadly, a good example of this imbalance. This next map shows that the women in 13 states currently have no same-sex representation in the U.S. Congress.

Female representation in the current U.S. Congress.
The green part of each pie chart indicates the proportion of women representatives.

This leads to the obvious question for this blog post: how does the absence of female representatives (and senators) relate to the absence of gun regulation? So, let's map the above collection of pie charts onto the gun legislation network.

Female representation in the U.S. Congress after 2018 mid-term elections
(includes Senate and House of Representatives).
The c. 700,000 inhabitants of DC, District of Columbia, have no representation in
Congress at all, but send a non-voting delegate to the House.

There is a general trend: those states with little or no gun regulation (bottom left) have less female representation than those with (some) gun regulation. Perhaps someone took the 2nd Amendment a bit too literally (the right of every man to carry a gun), and this keeps not only King George away from the country, but also women away from Congress?

Exceptions to the generalization (starting with 75% going down to 33%) are sparsely populated states with only a few members of Congress: New Hampshire (NH, 75%; 2 reps, in addition to the two U.S. senators representing each state), Maine (ME; 2 reps), West Virginia (WV; 3 reps), Alaska (AK; 1 rep), New Mexico (NM; 3 reps), and Nevada (NV; 4 reps). All of these states have one thing in common: a substantial proportion of the state is wilderness.

At the other end, some states with relatively high levels of gun regulation, like Maryland (MD; 8 reps), Rhode Island (RI; 2 reps), New Jersey (NJ; 12 reps) and Colorado (CO; 7 reps), lack women in Congress (0–15%, i.e. one representative or none). This may relate to these states being very densely populated (MD, RI, NJ), and, irrespective of outside threats, no-one wants their close neighbors running around with guns. Colorado is a particular case in this sense, because with Denver it includes a major population center (the nucleus of the emerging Front Range megaregion), which enforced much stricter gun regulation than is found elsewhere in the state.

A map showing Colorado's congressional districts, for the 113th Congress.
Data from the defunct digital version of the U.S. National Atlas.

Do more women in parliament save American lives?

According to a recent Gallup poll, Americans have the highest regard for nurses, a profession mostly occupied by women, and the lowest regard for Members of Congress, a profession mostly occupied by men. Hence, it would make sense to explore the data the other way around. We will explore this in a later post.

Monday, April 15, 2019

Tournament success is not poker success


Let us suppose for a moment that we wish to list the world's best professional poker players. This might be of some interest, because poker is partly a game of luck (the cards are dealt at random) and partly a game of skill (players choose how to play their cards). Indeed, put simply, the idea is to convince your opponents that you have a weak hand when you have a strong one (so that they will bet against you) and a strong hand when you have a weak one (so that they will fold).


One well-known way to assess poker success is to look at tournament winnings. Indeed, Nathan Williams recently did this for The Top 50 Best Poker Players of All Time by simply listing the 50 greatest money earners from The Hendon Mob database. This database accumulates data on the lifetime money winnings for all of those participants who have ever cashed in a live poker tournament.

However, this approach does not work. In fact, there are at least five reasons why this is not appropriate:
  1. Inflation continues unabated. After all, $1 now is not worth as much as $1 was 30 years ago. In fact, something that cost $1 in 1990 would cost a bit more than $2 now (i.e. the money has been devalued to 50%). So, the value of current winnings cannot be compared to those of the past.
  2. There are more tournaments now than there have ever been. So, there are more opportunities to play them now, and to thereby potentially accumulate more money for the same tournament success rate.
  3. The tournament fields are now generally bigger. This means that the average prize money for each tournament is now much greater than before (since the money is provided by the participants themselves). In particular, the top prizes now provide more money than whole tournaments did 20 years ago.
  4. Some of the best players play online rather than live. Obviously, this is a bit more difficult these days, due to the banning of online poker in the USA, but it is still a significant source of poker income for many people.
  5. Some of the best players do not play many tournaments; instead, they play cash games. Indeed, if you want to make a living playing poker, you may be better off playing for cash rather than for prize money, as tournament success is much more of a lottery.
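Point 1 can be made concrete with a small sketch. It converts nominal winnings from different years into dollars of a common reference year. The index values are placeholders chosen only to mirror the rough doubling of prices since 1990 mentioned above (the 2005 value is simply invented); they are not official statistics, and the function name is my own.

```python
# Hypothetical price index, scaled so that 1990 = 1.0.
# Placeholder values: only the 1990 -> 2019 doubling comes from the text above.
PRICE_INDEX = {1990: 1.0, 2005: 1.5, 2019: 2.0}


def to_reference_dollars(amount, year, reference_year=2019, index=PRICE_INDEX):
    """Express a nominal amount won in `year` in dollars of `reference_year`."""
    return amount * index[reference_year] / index[year]


# Two invented cashes of the same nominal size, won in different years.
cash_1990 = to_reference_dollars(1_000_000, 1990)
cash_2019 = to_reference_dollars(1_000_000, 2019)
print(cash_1990, cash_2019)
```

With a real consumer price index in place of the placeholder table, the same two-line function would put all cashes on a comparable footing. It does nothing about points 2 and 3, which concern the number and size of tournaments rather than the value of money.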
The first three reasons all mean that we would have to adjust the tournament winnings, if we wish to have a meaningful assessment of lifetime earnings. As one example of the need to do this, we can look at point no. 3 in a simple way. The first graph shows the current top-100 money earners from The Hendon Mob. For each player, it shows how much of their total earnings came from their biggest single tournament cash.

Note that for the majority of players, a large part of their lifetime winnings came from a single tournament — the median percentage is 18.4% (range 3.8–97.7%). Indeed, for some of the players it is >50%, and for a few it is almost all of their money. Bigger fields mean more money per tournament, and thus bigger cashes when you do well. Note, incidentally, that this graph does contain the top 17 biggest cashes in history (to date).
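For anyone who wants to repeat this calculation on their own data, here is a minimal sketch of the quantity being plotted. The players and amounts are invented for illustration; they are not taken from The Hendon Mob.

```python
from statistics import median

# Invented example data: player -> list of tournament cashes (in dollars).
cashes = {
    "Player A": [8_000_000, 150_000, 50_000],
    "Player B": [400_000, 350_000, 250_000],
    "Player C": [900_000, 100_000],
}


def biggest_cash_share(amounts):
    """Percentage of total winnings that came from the single largest cash."""
    return 100 * max(amounts) / sum(amounts)


shares = {name: biggest_cash_share(a) for name, a in cashes.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of winnings from the biggest cash")
print("median share:", median(shares.values()))
```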

An alternative approach

So, in order to evaluate players, we actually need a list of criteria that is independent of money won. That is, we need a list of the poker skills of each player. There are several different skills involved in playing poker, and presumably some people are good at some of them, and other people are good at some of the others. A comparison of relative skills is what we need.

This approach was actually tried by Barry Greenstein back in c. 2005. What he did was try to rate a group of 33 of the poker players that he had played against in cash games. He rated these players by style of play, based on ten playing criteria (each scored on a 1–10 scale):
  • Aggressiveness
  • Looseness
  • Short-handed play
  • Limit poker
  • No-limit poker
  • Tournaments
  • Side games
  • Steam control
  • Against weak players
  • Against strong players
Given the time at which this analysis was done (2005), the modern crop of young players is obviously not included, and a few of those people included are no longer playing. However, it is worthwhile looking at the data to see just what can be done with this approach.

Greenstein himself notes: "I don’t think you can add up the ratings in the skill categories to get an accurate comparison of players." He is right; but first let's do it anyway. So, the next graph shows the total score (out of 100) for each player. (Click on the figure to see it at full size.)


The problem here is that we are comparing apples with oranges. That is, the rank ordering of the sum does not make much sense, because it does not group players with similar playing strengths. The rank order would make sense when comparing each feature one at a time, but not for the total. For example, ranking by total winnings does make sense, because we have only one criterion: money (although it is not a useful criterion). This is the basic weakness of having a single rank order.

As one example of how the "overall score" misses important points, note that Eric Seidel and John Juanda have the same total. However, Seidel exceeds Juanda on Steam control, while Juanda exceeds Seidel on Looseness; these are actually two rather different players.

A better way to look at the data is to use a network, as we often do in this blog. The final graph is a NeighborNet (based on the Manhattan distance) of Greenstein's data. Each point represents one of the 33 people. Those people that are near each other in the network have a similar set of scores, while people further apart are progressively more different from each other as poker players.
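To show what goes into such a network, here is a minimal sketch of the Manhattan distance computed from score vectors. The three players and their scores are invented (they are not Greenstein's ratings); they are set up so that two players share the same total while having quite different profiles.

```python
from itertools import combinations

CRITERIA = [
    "Aggressiveness", "Looseness", "Short-handed play", "Limit poker",
    "No-limit poker", "Tournaments", "Side games", "Steam control",
    "Against weak players", "Against strong players",
]

# Invented scores (1-10), one value per criterion in the order above.
scores = {
    "Player X": [9, 8, 7, 6, 8, 7, 5, 4, 8, 8],
    "Player Y": [5, 4, 7, 8, 8, 7, 7, 9, 7, 8],
    "Player Z": [9, 8, 7, 6, 8, 8, 5, 4, 8, 8],
}


def manhattan(a, b):
    """Sum of absolute score differences across all criteria."""
    return sum(abs(x - y) for x, y in zip(a, b))


totals = {name: sum(values) for name, values in scores.items()}
distances = {
    (p, q): manhattan(scores[p], scores[q]) for p, q in combinations(scores, 2)
}

print(totals)
for pair, d in distances.items():
    print(pair, d)
```

Players X and Y have identical totals but are far apart, whereas X and Z differ in total but are near neighbors. The full pairwise matrix is what would be handed to a program that implements NeighborNet (SplitsTree is one such program).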


As you can see, there is no simple trend from "best" to "worst", but instead a complex set of relationships, just as we would expect. However, the network does show an overall trend of decreasing total score from top to bottom (compare this to the previous graph).

Note, first, that Eric Seidel and John Juanda are on opposite sides of the network (Juanda left, Seidel right). This illustrates how much better the network is as a display of the data, compared to simply summing the scores (as in the previous graph). The network accurately shows the differences in the relative playing styles.

There are some players who are actually gathered together in the network, indicating that they have similar scores across all 10 criteria. For example, Barry Greenstein, Eric Seidel and Howard Lederer rarely differ by more than 1 point on any of the criteria; according to Greenstein, these people have very similar playing styles.

Alternatively, Phil Hellmuth and T.J. Cloutier have scores that differ from the other players: both have low scores on Side games and Steam control. Gus Hansen is near these two because all three have high scores for Against weak players. Similarly, the legendary Stu Ungar and Patrik Antonius both have high Aggressiveness and Looseness.

There is one final point worth mentioning. As Michel Bettane once said (The absurdity and flattery of scores):
It doesn't take a genius to appreciate the absurdity of giving a number score to a work of art or, worse still, an artist. Salvador Dalí had huge fun scoring great artists (including himself) on the basis of design, color, and composition — but that says far more for his sense of provocation and irony than it does for the principle itself.
Is poker an art, a science or a sport? If it is either of the first two, then scoring players may actually be a Bad Idea.

Monday, April 8, 2019

Next-generation neighbor-nets


Neighbor-nets are a most versatile tool for exploratory data analysis (EDA). Next-generation sequencing (NGS) allows us to tap into an unprecedented wealth of information that can be used for phylogenetics. Hence, it is a natural step to combine the two.

I have been waiting for it (actively-passively) and the time has now come. Getting NGS data has become cheaper and easier, but one still needs considerable resources and fresh material. Hence, NGS papers usually not only use a lot of data, but also are many-authored. You can now find neighbor-nets based on phylogenomic pairwise distances computed from NGS data — for example, in these two recently published open access pre-prints:
  • Pérez Escobar OA, Bogarín D, Schley R, Bateman R, Gerlach G, Harpke D, Brassac J, Fernández-Mazuecos M, Dodsworth S, Hagsater E, Gottschling M, Blattner F. 2018. Resolving relationships in an exceedingly young orchid lineage using Genotyping-by-sequencing data. PeerJ Preprints 6:e27296v1
  • Hipp AL, Manos PS, Hahn M, Avishai M, Bodénès C, Cavender-Bares J, Crowl A, Deng M, Denk T, Fitz-Gibbon S, Gailing O, González Elizondo MS, González Rodríguez A, Grimm GW, Jiang X-L, Kremer A, Lesur I, McVay JD, Plomion C, Rodríguez-Correa H, Schulze E-D, Simeone MC, Sork VL, Valencia Avalos S. 2019. Genomic landscape of the global oak phylogeny. bioRxiv DOI:10.1101/587253.

Example 1: A young species aggregate of orchids

Pérez Escobar et al.'s neighbor-nets are based on uncorrected p-distances inferred from a matrix including 13,000 GBS ("genotyping-by-sequencing") loci (see the short introduction for the method on Wikipedia, or the comprehensive PDF from a talk at/by researchers of Cornell) covering 29 accessions of six orchid species and subspecies.
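For readers unfamiliar with the term, an uncorrected p-distance is simply the proportion of aligned sites at which two sequences differ. The sketch below is a minimal illustration; skipping sites that are missing in either sequence (pairwise deletion) is my assumption, not necessarily what the authors' software does, and the toy sequences are invented.

```python
def p_distance(seq_a, seq_b, missing="-N?"):
    """Proportion of differing sites among sites present in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in missing or b in missing:
            continue  # pairwise deletion: ignore sites missing in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        return float("nan")  # no shared sites, distance undefined
    return differing / compared


# Invented toy alignment.
print(p_distance("ACGTACGTAC", "ACGTTCGTAC"))  # one difference in ten sites
print(p_distance("ACGT--GTAC", "ACGTTCGTAA"))  # one difference in eight shared sites
```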

They also inferred maximum likelihood trees, and did a coalescent analysis to consider possible tree-incompatible signal, i.e. gene-tree incongruence due to potential reticulation and incomplete lineage sorting. They applied the neighbor-net to their data because "split graphs are considered more suitable than phylograms or ultrametric trees to represent evolutionary histories that are still subject to reticulation (Rutherford et al., 2018)", which is true, although neighbor-nets do not explicitly show a reticulate history.

Here's a fused image of the ML trees (their fig. 1) and the corresponding neighbor-nets (their fig. 2):

Not so "phenetic": NGS data neighbor-nets (NNet) show essentially the same as the ML trees; the distance matrices reflect putative common origin(s) as much as the ML phylograms do. The numbers at branches and edges show bootstrap support under ML and the NNet optimization.

Groups resolved as clades (Groups I and III) or as grades or clades (Group II; compare A vs. B and C) in the ML trees form simple (relating to one edge-bundle) or more complex (defined by two partly compatible edge-bundles, Group I in A) neighborhoods in the neighbor-net splits graphs. The evolutionary unfolding (we are looking at closely related biological units) was likely not following a simple dichotomizing tree; hence the ambiguous branch-support (left) and competing edge-support (right) for some of the groups. Furthermore, each part of a genome will be more discriminative for some aspect of the coalescent and less for another, which is another source of topological ambiguity (ambiguous bootstrap support) and incompatible signal (as seen in and handled by the neighbor-nets). The reconstructions under A, B and C differ in the breadth and gappiness of the included data (all NGS analyses involve data filtering steps): A includes only loci covered for all taxa, B includes all loci with less than 50% missing data, and C all loci with at least 15% coverage.
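The three filtering regimes can be sketched as simple thresholds on per-locus coverage. This is only an illustration with an invented toy data set of ten accessions and five loci; the data structure and names are hypothetical, and the authors' actual pipeline is not reproduced here.

```python
# Invented toy data: locus -> set of accessions in which it was recovered.
TAXA = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
recovered = {
    "locus1": set(TAXA),       # complete
    "locus2": set(TAXA[:7]),   # 30% missing
    "locus3": set(TAXA[:5]),   # 50% missing
    "locus4": set(TAXA[:2]),   # 80% missing
    "locus5": set(TAXA[:1]),   # 90% missing
}


def coverage(locus):
    """Fraction of accessions in which the locus was recovered."""
    return len(recovered[locus]) / len(TAXA)


# One predicate per data set, following the description in the caption.
FILTERS = {
    "A": lambda c: c == 1.0,       # only loci covered for all taxa
    "B": lambda c: 1.0 - c < 0.5,  # less than 50% missing data
    "C": lambda c: c >= 0.15,      # at least 15% coverage
}

datasets = {
    name: [locus for locus in recovered if keep(coverage(locus))]
    for name, keep in FILTERS.items()
}
for name, loci in datasets.items():
    print(name, loci)
```

The stricter the filter, the fewer loci survive but the less gappy the resulting matrix, which is the trade-off between breadth and gappiness mentioned above.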

PS. I contacted the first author: the paper is still under review (four peers), a revision is (about to be) submitted and, with a bit of luck, we'll see it in print soon.


Example 2: The oaks of the world

The Hipp et al. (note that I am an author) neighbor-net is based on model-based distances. The reason I opted (here) for model-based distances instead of uncorrected p-distances is the depth of our phylogeny: our data cover splits that go back to the Eocene, but many of the species found today are relatively young. The dated tree analyses show substantial shifts in diversification rates in the lineages that are diverse today, and possibly in the past (see the lines in the following graph); in those with few species (*, #) we may be looking at the left-overs of ancient radiations.
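Why depth matters can be shown with the simplest substitution model. Over long time spans, the same site may change more than once, so the observed proportion of differences underestimates the actual amount of change. The sketch below uses the Jukes-Cantor (JC69) correction purely as an illustration; it is an assumption for this example and not the model we used for the oak data.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an uncorrected p-distance `p`."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The gap between observed and corrected distance widens as p grows.
for p in (0.01, 0.1, 0.3, 0.5):
    print(f"p = {p:.2f}  ->  JC69 d = {jc69_distance(p):.3f}")
```

For shallow divergences the two values are nearly identical, which is why uncorrected p-distances are adequate for a young species aggregate; for deep splits the correction becomes substantial.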

A lineage(s)-through-time plot for the oaks (Hipp et al. 2019, fig. 2). Generic diversification probably started in the Eocene around 50 Ma, and between 10–5 Ma parts (usually a single sublineage) of these long-isolated intrageneric lineages (sections) underwent increased speciation.

The data basis is otherwise similar: SNPs (single-nucleotide polymorphisms) generated using a different NGS method, in our case RAD-tagging (RAD-seq) of c. 450 oak individuals covering the entire range of this common tree genus, the most diverse extra-tropical genus of the Northern Hemisphere. There are differences between GBS and RAD-seq SNP data sets. A rule of thumb is that the latter can provide more signal and SNPs, but the single-locus trees are usually less decisive, which can be a problem for coalescent methods and tests for reticulation and incomplete lineage sorting that require a lot of single-locus (or single-gene) trees (see the paper for a short introduction and discussion, and further references).

We also inferred an ML tree, and my leading co-authors did the other necessary and fancy analyses. Here, I will focus on the essential information needed to interpret the neighbor-net that we show (and why we included it at all).

Our fig. 6. Coloring of main lineages (oak sections) same as in the LTT plot. Bluish, the three sections traditionally included in the white oaks (s.l.); red, red oaks; purple, the golden-cup or 'intermediate' (between white and red) oaks — these three groups (five sections) form subgenus Quercus, which except for the "Roburoids" and one species of sect. Ponticae is restricted to the Americas. Yellow to green, the sections and main clades (in our and earlier ML trees) of the exclusively Eurasian subgenus Cerris.

Like Pérez Escobar et al., we noted a very good fit between the distance-matrix based neighbor-net and the optimised ML tree. Clades with high branch support and intra-clade coherence form distinct clusters, here distinct neighborhoods associated with certain edge bundles (thick colored lines). This tells us that the distance matrix is representative: it captures the prime phylogenetic signal that also informs the tree.

The first thing that we can infer from the network is that we have few missing-data issues in our data. Distance-based methods are prone to missing-data artifacts, and RAD-seq data are (inevitably) rather gappy. It is important to keep in mind that neighbor-nets cannot replace tree analysis in the case of NGS data; they are "just" a tool to explore the overall signal in the matrix. If the network has neighborhoods contrasting with what can be seen in the tree, this can be an indication that one's data are not sufficiently tree-like. But it can also just mean that the data are not sufficient to get a representative distance matrix.

Did you notice the little isolated blue dot (Q. lobata)? This is such a case — it has nothing to do with reticulation between the blue and the yellow edges, it's just that the available data don't produce an equally discriminative distance pattern: according to its pairwise distances, this sample is generally much closer to all other oak individuals included in the matrix in contrast to the other members of its Dumosae clade, which are generally more similar to each other, and to the remainder of the white oaks (s.str., dark blue, and s.l., all bluish).

Close-up on the white oak s.str. neighbor-hood (sect. Quercus) and plot of the preferred dated tree.

In the tree it is hence placed as sister to all other members, and, being closer to the all-ancestor, it triggers a deep Dumosae crown age, c. 10 myr older than the subsequent radiation(s) and as old as the divergence of the rest of the white oaks s.str.

The second observation, which can assist in the interpretation of the ML tree (especially the dated one), is the principal structure (ordering) within each subgenus and section. The neighbor-net is a planar (i.e. 2-dimensional) graph, so the taxa will be put in a circular order. The algorithm essentially identifies the closest relative (which is a candidate for a direct sister, as in a tree) and the second-closest relative. Towards the leaves of the Tree of Life, this is usually a cousin or, in the case of reticulation, the intermixing lineage. Towards the roots, it can reflect the general level of derivation, i.e. the distance to the (hypothetical all-)ancestor.

Knowing the primary split (between the two subgenera), we can interpret the graph with respect to the general level of (phylogenetic) derivedness.

The overall least derived groups are placed to the left in each subgenus, and the most derived to the right. The reason is long-branch attraction (LBA) stepping in: the red and green groups are the most isolated/unique within their subgenera, and hence they attract each other. This is important to keep in mind when looking at the tree and judging whether (local) LBA may be an issue (parsimony and distance methods will always get the wrong tree in the Felsenstein Zone, but probabilistic methods have a 50% chance to escape). In our oak data, we are on the safe side. The red group (sect. Lobatae, the red oaks) is indeed resolved as the first-branching lineage within subgenus Quercus, but within subgenus Cerris it is the yellow group, sect. Cyclobalanopsis. If this were LBA, Cyclobalanopsis would need to be on the right side, next to the red oaks.

The third obvious pattern is the distinct form of each subgraph: we have neighborhoods with long, slim root trunks and others that look like broad fans.

Long, narrow trunks (i.e. distances showing high intra-group coherence and high inter-group distinctness) can be expected for long-isolated lineages with small (founder) population sizes, e.g. lineages that underwent severe or repeated bottleneck situations in the past. Unique genetic signatures will be quickly accumulated (increasing the overall distance to sister lineages), and extinction ensures that only one signature (or a set of very similar ones) survives (low intragroup diversity until the final radiation).

Fans represent gradual, undisturbed accumulation of diversity over a long period of time, e.g. frequent radiation and formation of new species during range and niche expansion. In the absence of stable barriers we get a very broad, rather unstructured fan like the one of the white oaks (s.str.; blue); along a relatively narrow (today and likely in the past) geographic east-west corridor (here: the 'Himalayan corridor') we get a more structured, elongated one, as in the case of section Ilex (olive).

Close-up on the sect. Ilex neighborhood, again with the tree plotted. In the tree, we see just sister clades, in the network we see the strong correlation between geography and genetic diversity patterns, indicating a gradual expansion of the lineage towards the west till finally reaching the Mediterranean. Only sophisticated explicit ancestral area analysis can possibly come to a similar result (often without certainty) which is obvious from comparing the tree with the network.

This can go along with higher population sizes and/or more permeable species barriers, both of which will lead to lower intragroup diversity and less tree-compatible signals. Knowing that both section Quercus (white oaks s.str., blue) and section Ilex (olive) evolved and started to radiate at about the same time, it's obvious from the structure of both fans that the (mostly and originally temperate) white oaks always produced more, but likely less stable, species than the mid-latitude (subtropical to temperate) Ilex oaks, which today span an arc from the Mediterranean via the southern flanks of the Himalayas into the mountains of China and the subtropics of Japan.

Networks can be used to understand, interpret and confirm aspects of the (dated) NGS tree.

The much older stem and young crown ages seen in dated trees may be indicative of bottlenecks, too. But since we typically use relaxed clock models, which allow for rate changes and rely on very few fixed points (e.g. fossil age constraints), we may get (too?) old stem and (much too) young crown ages, especially for poorly sampled groups or unrepresentative data. By looking at the neighbor-net, we can directly see that the relatively old crown ages for the lineages with (today) few species fit with their within-lineage and general distinctness.

The deepest splits: the tree mapped on the neighbor-net.

By mapping the tree onto the network, and thus directly comparing the tree to the network, we can see that different evolutionary processes may be considered to explain what we see in the data. It also shows us how much of our tree is (data-wise) trivial and where it could be worth taking a deeper look, e.g. applying coalescent networks, generating more data, or recruiting additional data. Last, but not least, it's quick to infer and makes pretty figures.

So, try it out with your NGS data, too.

PS. Model-based distances can be inferred with the same program many of us use to infer the ML tree: RAxML. We can hence use the same model assumptions for the neighbor-net that we optimized for inferring the tree and establishing branch support.

Monday, April 1, 2019

The Tree of Life (April 1)


The so-called Tree of Life is actually an anastomosing plexus rather than a divaricating tree, due to extensive interconnections between the cell and genome lineages during early single-cell evolution. These connections may have been caused by the process known as horizontal gene transfer.

Furthermore, the alleged Last Universal Common Ancestor may not have been a single coherent group, but may have been a mixture of quite different genotypes. After all, this supposed ancestor does not represent the origin of life, but was itself the end-product of an extensive prior evolutionary history.

These two basic points are illustrated in the following figure.


Happy April 1.

Monday, March 25, 2019

Automatic detection of borrowing (Open problems in computational diversity linguistics 2)


The second task on my list of 10 open problems in computational diversity linguistics deals with detecting borrowings or language contact. The prototypical case of language contact would be lexical borrowing, where a word is borrowed from one language into another, such as English job, which was adopted by Germans in the rather specific meaning of temporary occupation. More complex cases involve semantic borrowing, where a way of denoting something is borrowed, not the form itself, such as, for example, the use of the word for mouse to denote a computer mouse in many languages of the world.

Even less well understood are cases where specific aspects of grammar have been transferred. German has, for example, a certain number of neuter nouns, all borrowed from Ancient Greek or Latin, in which the plural is built according to (or inspired by) the Greek model: Lexikon has Lexika as plural, Komma has Kommata as plural, and Kompositum has Komposita as plural. While these cases are sporadic in German and thus rather harmless (as are the similar examples in English), there are other cases of language contact where scholars not only suspect that plural forms have been borrowed along with the words (as in German), but that entire paradigms and strategies of grammatical marking have been adopted by one language from a neighboring variety as a result of close language contact.


Why borrowing is hard to detect

Unless we witness them happening directly, most cases of borrowing are difficult to demonstrate consistently. By comparison with lexical borrowing, however, the borrowing of grammar is probably the hardest to show, especially when dealing with abstract categories that could have actually emerged independently. The reason why borrowing is generally hard to deal with, not only in computational approaches, is that detecting borrowing and demonstrating language contact presupposes that alternative explanations are all excluded, such as universal tendencies of language change (i.e., "convergent evolution" in the biological sense), common inheritance, or simple chance.

While we need to exclude alternative possibilities to prove any of the four major types of similarities (coincidental, natural, genealogical, or contact-induced, see List 2014: 55-57), we have a much harder time in doing so when dealing with borrowings, because linguistics has no single established procedure for the identification of borrowings. Instead, we resort to a mix of different types of evidence, which are qualitatively weighted and discussed by the experts. While historical linguistics has developed sophisticated techniques to show that language similarities are genealogical, it has not succeeded in reaching the same level of sophistication for the identification of borrowings.

In this regard, techniques for contact detection are not much different from other, more specific, types of linguistic reconstruction, such as the "philological reconstruction" of ancient pronunciations (Jarceva 1990, Sturtevant 1920), the reconstruction of detailed etymologies (Malkiel 1954), or the reconstruction of syntax (Willis 2011).

Traditional strategies for detecting borrowing

It is not easy to give an exhaustive and clear-cut overview of all of the qualitative methods that scholars make use of in order to detect borrowings among languages. This is at least partially due to the nature of "cumulative-evidence arguments" (Berg 1998) — or arguments based on consilience (Whewell 1840, Wilson 1998) — which are always more difficult to formalize than clear-cut procedures that yield simple, binary results. Despite the difficulty in determining exact workflows, we can identify a couple of proxies that scholars use to assess whether a given trait has been borrowed or not.

One important class of hints is conflicts with possible genealogical explanations. A first type of conflict is represented by similarities shared among unrelated or distantly related languages. Since English mountain has no counterparts in the other Germanic languages, with similar words found only in Romance, we could take this as evidence that the English word was borrowed. Since these conflicts arise from the supposed phylogeny of the languages under consideration, we can speak of phylogeny-related arguments for interference.

A second conflict involves the traits themselves, most prominently observed in the case of irregular sound correspondence patterns. German Damm, for example, is related to English dam, but since the expected correspondence for cognates between English and German would yield a German reflex Tamm (as it is still reflected in Old High German, see Kluge 2002), we can take this as evidence that the modern German term was borrowed (Pfeifer 1993). We can call these cases trait-related arguments for contact.
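The trait-related argument can be illustrated with a minimal Python sketch. It assumes a tiny, hand-picked table of expected word-initial correspondences between English and German (the table, the function name, and the word pairs are illustrative assumptions, and spelling is used as a crude stand-in for sounds). It only flags pairs that break the expected pattern; it does not decide whether borrowing is the cause.

```python
# Hypothetical, hand-picked initial correspondences (English -> German).
# A real study would infer such patterns from many cognate sets,
# and would work on phonetic transcriptions, not on spelling.
EXPECTED_INITIAL = {"d": "t", "t": "z", "p": "pf"}


def initial_conflict(english: str, german: str) -> bool:
    """Return True if the German word lacks the expected initial reflex."""
    for eng_sound, ger_sound in EXPECTED_INITIAL.items():
        if english.lower().startswith(eng_sound):
            return not german.lower().startswith(ger_sound)
    # No rule is known for this initial sound, so nothing is flagged.
    return False


pairs = [("day", "Tag"), ("deep", "tief"), ("dam", "Damm"), ("ten", "zehn")]
flagged = [(eng, ger) for eng, ger in pairs if initial_conflict(eng, ger)]
print(flagged)
```

With these four pairs, only dam and Damm are flagged, which matches the irregularity described above.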

In addition to observations of conflicts, two further types of evidence are of great importance for inferring contact. The first one is areal proximity, and the second one is the assumed borrowability of traits. Given that language contact requires the direct contact of speakers of different languages, it is self-evident that geographical proximity, including proximity by means of travel routes, is a necessary argument when proposing contact relations between different varieties.

Furthermore, since direct evidence confirms that linguistic interference does not act to the same degree on all levels of linguistic organisation, the notion of borrowability also plays an important role. Although scholars tend to have different opinions about the concept, most would probably agree with the borrowability scale proposed by Aikhenvald (2007, p. 5), which ranges from "inflectional morphology" and "core vocabulary", representing aspects resistant to borrowing, up to "discourse structure" and the "structure of idioms", representing aspects that are easy to borrow. How core vocabulary can be defined, and how the borrowability of individual concepts can be determined and ranked, however, has been subject to controversial discussions (Lee and Sagart 2008, Starostin 1995, Tadmor 2009, Zenner et al. 2014).

Computational strategies for contact inference

Despite the large number of quantitative applications proposed during the past two decades, computational approaches for the inference of contact situations are still in their infancy. As of now, none of the few approaches proposed in the past can compete with the classical methods. The reasons for this are twofold. First, given the multiple types of evidence employed by the classical approaches, the formalization of the problem of borrowing detection is difficult. Second, given the limited number and suitability of datasets annotated for different types of linguistic interference, scholars have a hard time in developing algorithms, since they lack data for testing and training.

In principle, all algorithms for contact inference proposed so far make use of the strategies used in the classical approaches. Thus, they infer or determine shared traits among two or more languages, and then determine conflicts in these traits, taking geographical closeness and borrowability into account. In contrast to classical approaches, which combine different types of evidence, computational approaches are usually restricted to one type.

The automatic methods proposed so far can be divided into three classes. The first class employs phylogeny-related conflicts to identify those traits whose evolution cannot be explained with a given phylogenetic tree, explaining the conflicts as resulting from contact. Examples include work where I was involved myself (Nelson-Sathi et al. 2011, List et al. 2014), some early and interesting approaches which did not receive too much attention (Minett and Wang 2003), or have been mostly forgotten by now (Nakhleh et al. 2005), along with a recent study on grammatical features (Cathcart et al. 2018).
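To make the idea of a phylogeny-related conflict concrete, here is a minimal sketch that uses Fitch parsimony, a deliberately simple stand-in and not the gain-loss or network methods of the studies cited above. It counts the minimal number of state changes a presence/absence trait needs on a given tree; a trait that needs more than one change does not fit the tree as a single origin. The tree shape and the trait coding are illustrative assumptions.

```python
def fitch(tree, states):
    """Return (possible root states, minimal number of changes).

    `tree` is a nested tuple of taxon names (strictly binary);
    `states` maps each taxon name to 0 (absent) or 1 (present).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # Disjoint state sets: one extra change is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


# Assumed toy tree, for illustration only.
tree = ((("English", ("German", "Dutch")), "Swedish"),
        ("French", ("Spanish", "Italian")))

# Presence (1) or absence (0) of a 'mountain'-type word; toy coding.
mountain = {"English": 1, "German": 0, "Dutch": 0, "Swedish": 0,
            "French": 1, "Spanish": 1, "Italian": 1}

_, changes = fitch(tree, mountain)
print("changes needed:", changes, "| conflict:", changes > 1)
```

A flag raised this way is only a candidate for closer inspection, since conflicts with a tree can have causes other than borrowing.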

The second class uses techniques for automatic sequence comparison to search for similar words, but not cognate words, across different languages. Here, the most prominent examples include the work by Ark et al. (2007), and later Mennecier et al. (2016), who searched for similar words among languages known to be not related. Further examples include the work by Boc et al. (2010) and Willems et al. (2016), who experimented with tree reconciliation approaches, based on word trees derived from sequence-alignment techniques. There is also an experimental study where I was again involved myself (Hantgan and List forthcoming), in which we tried to identify borrowings by comparing two automatically inferred similarities among words from related and unrelated languages: surface similarities, as reflected by naive alignment algorithms, and deep similarities, reflected by advanced methods that take sound correspondences into account (List 2014).
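What a "surface similarity" might look like can be sketched with a normalized edit distance. This is a naive illustration, not the alignment method of any of the studies cited above; the function names are assumptions, and spelling is used in place of phonetic transcription.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


for other in ("montagne", "Berg"):
    print("mountain", other, round(surface_similarity("mountain", other), 2))
```

High surface similarity between words from unrelated or distantly related languages is the kind of signal this class of methods looks for. Taken alone it says nothing about the direction of borrowing.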

The third class searches for distribution-related conflicts by comparing the amount of shared words within sublists of differing degrees of borrowability. This class is best represented by Sergey Yakhontov's (1926-2018) work on stable and unstable concept lists (Starostin 1991), which assumed that deep historical relations should surface in those parts of the lexicon that are stable and resistant to borrowing, while recent contact-induced relations would surface rather in those parts of the lexicon that are more prone to borrowing. Yakhontov's work was independently re-invented by Chén (1996), and McMahon et al. (2005); but given how difficult it turned out to distinguish concepts prone to borrowing from those resistant to borrowing, it has been largely disregarded for some time now.
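The logic of the sublist comparison can be sketched in a few lines. The two concept lists and the match judgements below are invented for illustration (they are not Yakhontov's lists, and no real language pair is implied); the only point is the comparison of proportions.

```python
def shared_proportion(matches: dict[str, bool], sublist: set[str]) -> float:
    """Proportion of concepts in `sublist` for which two languages share a word."""
    concepts = [concept for concept in sublist if concept in matches]
    if not concepts:
        raise ValueError("no judgements available for this sublist")
    return sum(matches[concept] for concept in concepts) / len(concepts)


# Invented sublists and judgements, for illustration only.
stable = {"hand", "water", "two", "name"}
unstable = {"mountain", "river", "animal", "flower"}
matches = {"hand": False, "water": False, "two": True, "name": False,
           "mountain": True, "river": True, "animal": True, "flower": False}

p_stable = shared_proportion(matches, stable)
p_unstable = shared_proportion(matches, unstable)
print(p_stable, p_unstable)

# More sharing in the unstable than in the stable sublist would point
# towards contact rather than deep inheritance, under this approach.
print("contact suspected:", p_unstable > p_stable)
```

The result is only as good as the stability ranking behind the two sublists, which is exactly the weak point of this class of methods.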

Problems with computational strategies for contact inference

All three classes of approaches discussed so far have certain shortcomings. Phylogeny-based inference of borrowing, for example, tends to drastically overestimate the number of borrowed traits, simply because conflicts in a phylogeny can result from undetected borrowings in the data, but they need not do so (see Appendix 1 of Morrison 2011 on causes of reticulation in biology, which has many parallels to linguistics). Saying that all instances in which a dataset conflicts with a given phylogeny are borrowings is therefore generally a bad idea. It can be used as a very rough heuristic to come up with potentially wrongly annotated homologies in a dataset, which could then be checked again by experts, but deriving stronger claims from it seems problematic.

Sequence comparison techniques applied to unrelated languages are basically safe in my opinion, and the results are very reliable, unless one compares words that occur in all languages, such as "mama" and "papa" (Jakobson 1960, see also "Mama and papa" on Wikipedia).

Using methods for tree reconciliation on individual word trees, calculated from word distances based on phonetic alignment techniques or similar, yields the same problems of over-counting conflicts as we get for phylogeny-based approaches to borrowing. The problem here is a general misunderstanding of the conceptual differences between gene trees in biology, where surface similarity of gene sequences is thought to reflect evolutionary history, and word trees in linguistics. While we can use qualitative methods to draw a word tree for a given set of homologous words, the surface similarity among the words says little, if anything, about their evolutionary history.

Attempts to distinguish borrowed from inherited traits with sublists have lost their popularity in most recent studies. When properly applied, they might, indeed, provide some evidence in the search for borrowings or deep homologies. So far, however, all stability rankings of concepts that have been proposed have been based on too small a number of either concepts (we would need rankings for some 1,000 concepts at least) or languages from which the information was derived. If we could manage to get reliable counts on some 1,000 concepts for a larger sample of the world's languages, this might greatly help our field, as it would provide us with a starting point from which people could search (even qualitatively) for borrowings in their data.

Outlook

Assuming that we currently have no realistic way to operationalize arguments based on consilience, there is no direct hope of having a fully automatic method for detecting borrowings any time soon. By developing promising existing methods further, however, there is hope that we can learn a lot more about borrowing processes in the world's languages. What is needed here is, of course, the data required to apply the methods.

Beyond the automatic approaches for borrowing detection mentioned above, nobody has so far tried to use trait-related conflicts to infer borrowings. Since these are usually considered to be quite reliable by experts in historical linguistics, it seems necessary to work in this direction as well, if we want to tackle the problem of consistent automatic detection of borrowing. Here, my recently proposed framework for a consistent handling and identification of patterns of sound correspondences across multiple languages (List 2019) could definitely be useful, although it will again be challenging to find the right balance of parameters and interpretation, since not all conflicts in sound correspondences necessarily result from borrowings.

Whether it will be possible to identify even the direction of borrowings, when developing these methods further, is an open question. Borrowability accounts might help here, but again, since no clear-cut strategies are being used by scholars, it is difficult to formalize any of the existing qualitative approaches. The greatest challenge will perhaps consist in the creation of a database of known borrowings that could assist digital linguists in testing and training new approaches.

References
Aikhenvald, Alexandra Y. (2007) Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, Alexandra Y. and Dixon, Robert M. W. (eds.) Grammars in Contact. Oxford:Oxford University Press. 1-66.

van der Ark, René and Mennecier, Philippe and Nerbonne, John and Manni, Franz (2007) Preliminary identification of language groups and loan words in Central Asia. In: Proceedings of the RANLP Workshop on Acquisition and Management of Multilingual Lexicons, pp. 13-20.

Berg, Thomas (1998) Linguistic Structure and Change: an Explanation from Language Processing. Gloucestershire:Clarendon Press.

Boc, Alix and Di Sciullo, Anna Maria and Makarenkov, Vladimir (2010) Classification of the Indo-European languages using a phylogenetic network approach. In: Locarek-Junge, H. and Weihs, C. (eds.) Classification as a Tool for Research. Berlin and Heidelberg:Springer. 647-655.

Cathcart, Chundra and Carling, Gerd and Larson, Filip and Johansson, Richard and Round, Erich (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Chén Bǎoyà 陈保亚 (1996) Lùn yǔyán jiēchù yǔ yǔyán liánméng 论语言接触与语言联盟 [Language Contact and Language Unions]. Běijīng 北京:Yǔwén 语文.

Hantgan, Abbie and List, Johann-Mattis (forthcoming) Bangime: Secret language, language isolate, or language island? Journal of Language Contact.

Jakobson, Roman (1960) Why 'Mama' and 'Papa'? In: Perspectives in Psychological Theory: Essays in Honor of Heinz Werner, pp. 124-134.

Jarceva, V. N. (1990) Lingvističeskij enciklopedičeskij slovar'. Moscow: Sovetskaja Enciklopedija.

Kluge, Friedrich (2002) Etymologisches Wörterbuch der deutschen Sprache. Berlin:de Gruyter.

Lee, Yeon-Ju and Sagart, Laurent (2008) No limits to borrowing: The case of Bai and Chinese. Diachronica 25.3: 357-385.

List, Johann-Mattis and Nelson-Sathi, Shijulal and Geisler, Hans and Martin, William (2014) Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36.2: 141-150.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Malkiel, Yakov (1954) Etymology and the structure of word families. Word 10.2-3: 265-274.

McMahon, April and Heggarty, Paul and McMahon, Robert and Slaska, Natalia (2005) Swadesh sublists and the benefits of borrowing: an Andean case study. Transactions of the Philological Society 103: 147-170.

Mennecier, Philippe and Nerbonne, John and Heyer, Evelyne and Manni, Franz (2016) A Central Asian language survey. Language Dynamics and Change 6.1: 57-98.

Minett, James W. and Wang, William S.-Y. (2003) On detecting borrowing. Diachronica 20.2: 289–330.

Morrison, D. A. (2011) An Introduction to Phylogenetic Networks. Uppsala: RJR Productions.

Nakhleh, Luay and Ringe, Don and Warnow, Tandy (2005) Perfect Phylogenetic Networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81.2: 382-420.

Nelson-Sathi, Shijulal and List, Johann-Mattis and Geisler, Hans and Fangerau, Heiner and Gray, Russell D. and Martin, William and Dagan, Tal (2011) Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society of London B: Biological Sciences 278.1713: 1794-1803.

Pfeifer, Wolfgang (1993) Etymologisches Wörterbuch des Deutschen. Berlin: Akademie.

Starostin, Sergej Anatolévic (1991) Altajskaja problema i proischoždenije japonskogo jazyka [The Altaic Problem and the Origin of the Japanese Language]. Moscow: Nauka.

Starostin, Sergej Anatolévic (1995) Old Chinese vocabulary: A historical perspective. In: Wang, William S.-Y. (ed.) The Ancestry of the Chinese Language. Berkeley: University of California Press, pp. 225-251.

Sturtevant, Edgar H. (1920) The Pronunciation of Greek and Latin. Chicago: University of Chicago Press.

Tadmor, Uri (2009) Loanwords in the world's languages. Findings and results. In: Haspelmath, Martin and Tadmor, Uri (eds.) Loanwords in the World's Languages. Berlin and New York: de Gruyter, pp. 55-75.

Whewell, William D. D. (1847) The Philosophy of the Inductive Sciences, Founded Upon Their History. London: John W. Parker.

Willems, Matthieu and Lord, Etienne and Laforest, Louise and Labelle, Gilbert and Lapointe, François-Joseph and Di Sciullo, Anna Maria and Makarenkov, Vladimir (2016) Using hybridization networks to retrace the evolution of Indo-European languages. BMC Evolutionary Biology 16.1: 1-18.

Willis, David (2011) Reconstructing last week's weather: Syntactic reconstruction and Brythonic free relatives. Journal of Linguistics 47.2: 407-446.

Wilson, Edward O. (1998) Consilience: the Unity of Knowledge. New York: Vintage Books.

Zenner, Eline and Speelman, Dirk and Geeraerts, Dirk (2014) Core vocabulary, borrowability and entrenchment. Diachronica 31.1: 74-105.

Monday, March 18, 2019

Which US cities are best for walking, biking and public transport?


In the modern world, there is a lot of discussion about the environmental damage caused by cars and trucks, not least due to their involvement in global climate change. The pro-active parts of this discussion revolve around banning cars, so that parts of cities and towns can return to pedestrian areas (eg. Life in the Spanish city that banned cars; The automotive liberation of Paris), and encouraging alternative modes of transport, particularly bicycles (eg. Copenhagenize your city: the case for urban cycling; Britain wants cycle-friendly cities).

In particular, some cities throughout the world are taking active steps to improve the "walkability" of their centers, including Addis Ababa, Auckland, Denver, Hanoi, London, Manchester and San Francisco (What would a truly walkable city look like?), and the "cyclability" of their inner suburbs, including Calgary, Copenhagen, Eindhoven, Lidzbark, Purmerend, San Sebastian, Utrecht and Vancouver (Top 10 pieces of cycling infrastructure: which country does it right?). On the other hand, there are some cities that have not yet tried to do much about cycling, including Beijing, Cairo, Delhi, Hong Kong, Moscow, Mumbai, Nairobi, Orlando, São Paulo and Sydney (Top 10 worst cities for cycling).


The USA is not usually considered to be at the forefront of this movement, having long ago wedded itself to the cult of the private motor car. However, this does not mean that US cities are all the same in terms of non-car transportation. For example, the Walk Score site, which is part of the Redfin real estate organization, provides a ranking of all US cities and neighborhoods with a population of 200,000 or more, in terms of how friendly they are for: walking, biking and transit.

The ranks are based on a score out of 100 for each location, using various methodologies:
— Walk Score analyzes hundreds of walking routes to nearby amenities; points are awarded based on the distance to amenities in each category.
— Bike Score is calculated by measuring bike infrastructure (lanes, trails, etc), hills, destinations and road connectivity, and the number of bike commuters.
— Transit Score assigns a "usefulness" value to nearby transit routes based on their frequency, type of route (rail, bus, etc), and distance to the nearest stop on the route.
Our interest here is in combining these three pieces of information into a single picture, showing which cities are generally good, at the moment.

Not unexpectedly, the Walk Score and Transit Score are highly correlated (86% shared rankings), while the Bike Score is not as highly correlated with either of these (49% and 42%, respectively). This means that the same cities tend to be good for the first two criteria. The three best cities for the Walk Score are New York, Jersey City and San Francisco, while the top two for the Transit Score are New York and San Francisco. On the other hand, for the Bike Score the top two are Minneapolis and Portland — it would be difficult to imagine either New York or San Francisco as being good for biking!

If we define a "good" score as being >70, then only San Francisco has a score for all three criteria >70, although Boston comes close. On the other hand, Pittsburgh and Washington D.C. have the most consistent scores across the board, because they have uniformly middle-rank scores.

Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, we calculated the similarity of the cities using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-city similarities.
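As a minimal sketch of the distance step, the Manhattan distance between two cities is simply the sum of the absolute differences of their three scores. The cities and scores below are invented placeholders, not the Walk Score data used for the network.

```python
from itertools import combinations

# Invented (walk, bike, transit) scores; NOT the real Walk Score data.
scores = {
    "City A": (88, 70, 84),
    "City B": (60, 82, 55),
    "City C": (40, 45, 30),
}


def manhattan(u, v):
    """Sum of absolute differences across the three scores."""
    return sum(abs(a - b) for a, b in zip(u, v))


# One distance per unordered pair of cities.
distances = {
    (p, q): manhattan(scores[p], scores[q])
    for p, q in combinations(scores, 2)
}

for (p, q), d in distances.items():
    print(f"{p} - {q}: {d}")
```

A pairwise distance matrix of this kind is the input that a Neighbor-net program needs; the network construction itself is not shown here.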

The resulting network of the 98 cities with complete data is shown in the figure. Cities that are closely connected in the network are similar to each other based on how good they are for walking, biking and transit, and those cities that are further apart are progressively more different from each other. The color-coding for the cities is from Megaregions of the United States.


The network generally shows decreasing walking / transit scores from top to bottom, and decreasing biking scores from right to left. We have labeled only the top group of 29 cities, which are distinctly "better" than the remaining 69, plus four unusual cities (at the middle-left).

Note that, as expected, New York, San Francisco and Boston stand out at the top of the network. Note, also, that Minneapolis and Portland are separated in the network from the other cities, because of their high Bike Scores; all of the other cities in the top group have much lower biking scores. Newark, in particular, has a low biking score. New Orleans is at the bottom-left of this group because it has a low Transit Score but not a low Walk Score.

For the four unusual cities, separated at the left of the bottom group: Dallas has a low Transit Score, and Atlanta, Cincinnati and San Diego all have a low Bike Score.

The city at the very bottom-left of the network, which has the lowest score on all three criteria, is Arlington TX. Along the same lines, there is an online graph of The 10 most dangerous states for cyclists, showing Florida way out in front.

Finally, you should be warned about potential problems with rankings like these, based on only a few selected criteria. For example, the real estate site StreetEasy recently tried to compile a list of the 10 Healthiest Neighborhoods in New York City, and ended up listing the Brooklyn industrial area of Red Hook as number 1, which engendered a couple of negative comments, such as:
I guess the fact that the majority of Red Hook’s parkland has been closed for many years due to lead contamination, or the fact that we have one of the highest asthma rates in the city, was overlooked for this study.
Caveat emptor!

Monday, March 11, 2019

Tattoo Monday XVII


Here are seven more tattoos in our compilation of evolutionary tree tattoos from around the internet. For more examples of the circular design for a phylogenetic tree, in a variety of body locations, see Tattoo Monday V, Tattoo Monday VII, Tattoo Monday X and Tattoo Monday XI.

At the bottom of this post is an unusual linearized version of this same type of tree.



Monday, March 4, 2019

Has homoiology been neglected in phylogenetics?


In a recently published pre-print on PaleorXiv, Roland Sookias makes a point for distinguishing between parallelism, ie. shared inherited traits that can be found in some but not all of the offspring of a common ancestor, and convergences in a strict sense, involving similar traits that are not homologous. The former is also known as homoiology, a term Sookias attributes to Ludwig Plate.

As a geneticist working mostly at the tips of the Tree of Plant Life, I'm quite familiar with the (pre-Hennigian) concept: we much more often than not lack Hennig's 'synapomorphies', ie. shared, derived traits exclusive to an evolutionary lineage. But we have many highly diagnostic character suites including 'shared apomorphies' (I think that the angiosperm phylogeneticist Jim Doyle coined the term) that collect the same species or higher taxa, eg. groups of taxa that also form highly supported clades in molecular trees, but are not exclusive. In every plant group you can additionally observe that certain traits are exclusive to some members of one lineage, because the lineage has the genetic-physiological prerequisites to express these traits, while their sister lineages or distant relatives lack this potential. Epigenetics deals with tendencies to express a trait in response to the environment without even changing the genetic code.

If you look close enough, you can find such patterns even at the molecular level.

Molecular evolution of the 5' half of the ITS1 in beeches. Each sequence motif is assigned a state (Ax, Bx etc; x = 0 represents the ancestral state, x > 0 are derived states) and evolution involves usually the gain ("+") or loss ("-") of sequence motifs including some potential genetic homoiologies (see here for context and references).

However, it has apparently been ignored by my fellow paleontologists: Sookias wants to discuss "the neglected concept of homoiology ... in the context of palaeontological phylogenetic methods". Paleontological phylogenetic methods are, of course, tree inferences, and the idea is that recognition of homoiologies can be a means of establishing node support or to "help to choose between equally parsimonious or likely trees". He provides an R function "to calculate two measures for a given tree and matrix: (a) the potential support for clades based on potential homoiologies; and (b) the fit of the tree to all states given the concept of homoiology".

Sookias provides a nice and concise introduction to the problem with some examples, and makes the connection to linguistics (see also Mattis' and my post on the Chinese dialects continuum: How languages lose body parts); so, give the short paper a read. Like all paleontological literature it is strongly influenced by cladistic views, such as that life is monophyletic, and it revolves around the central theme of how to get better supported trees.

My inner geneticist has a fundamental problem with such a goal, because there has (to my knowledge) not been a single morphology-based tree that was fully congruent with a molecular tree with sufficient taxon and gene sampling, which applies also to the real-world data example that Sookias chose (as we will see).

My inner paleontologist also knows that there are highly diagnostic morphs in the fossil record, but diagnostic character suites and morphs reflect as many paraphyla as monophyla. He also knows that the fossil record, provided you find the right fossil from the right time, may alter your perspective on ancestral and derived character states.

An inferred tree (see this post). Given the inferred tree (quasi-dated tree), we would assume that star shapes are primitive (a symplesiomorphy) within the Pointish lineage, and possibly 10-tipped stars; and conclude that the Tenstars are paraphyletic. Greenish is clearly ancestral (a Pointish symplesiomorphy), and bluish derived (a Polygonia synapomorphy).
If we have the full picture, we can confirm star shapes are symplesiomorphic within the Pointish (the first common ancestor being a five-pointed colorless star). However, all greenish stars form a monophylum not a paraphylum.
Having ten tips is a synapomorphy of the monophyletic Tenstars.

So, why should we aim to get more resolved, better supported, morphology-based trees? Any such tree will inevitably include wrong branches!

I argue that, instead, we should just explore the signal in our data matrices using networks. Any potential tree is included in a network. But networks are more comprehensive because they provide not only a single tree but alternative, competing trees. By visualizing the alternatives, we can discern between mere convergence (random similarity), homoiology (parallelism, convergence related to descent), symplesiomorphy (shared, lineage-consistent primitive traits) and synapomorphy (lineage-unique and consistent shared derived traits), which can be very tricky with just a tree. Thus, we can try to evaluate which evolutionary scenario best explains all our data.

Compatibility

The basic problem when using morphological and such-like data sets to infer phylogenies is that most of the scored characters are, to some degree, incompatible with the true tree, ie. the actual evolutionary pathways.

Let's take a hypothetical evolution (no reticulations), in which the x-axis represents the morphological diversification and the y-axis time.


As in real-world data, sister taxa (eg. Species A and B) may have different levels of morphological derivation compared to their common ancestor(s). This leads us to this unrooted true tree in which the branch lengths are proportional to the real (above) amount of change.

Unrooted representation of the above evolution.
All commonly used tree inferences infer unrooted trees.

The only characters providing a taxon bipartition that is fully compatible with the true tree are Hennig's 'synapomorphies':

Clade A–D shares a unique, derived trait.
The character split is fully compatible with the true tree.

Next come Hennig's 'symplesiomorphies' (Sookias' R-script discards them):

Blue is the ancestral state within the ingroup, lost/modified in Species A.
The character split is compatible with the true tree except for A.
In phylogenetic inference, symplesiomorphies will usually stabilize the topology
as there will be enough other characters supporting A as sister of B and Clade A–D(–F).

Homoiologies / parallelisms can be partly compatible:

Blue is a homoiology found in 50% of the species composing Clade A–F.
The character split supports the sister relationship of A and B (compatible aspect)
but joins them with F (incompatible aspect).
A, B and F belong to the same monophylum/clade (semi-compatible aspect).
As long as homoiologies are confined to otherwise
coherent (or flat) subtrees, they will contribute to the overall decision capacity of the data.

Note that without a molecular backbone tree, it may be impossible to distinguish homoiologies from symplesiomorphies – whether a trait will be resolved as either the one or the other in a tree depends solely on its frequency and distribution across the subtree, and the situation in outgroups.

Purple is the plesiomorphy of the ingroup, blue the homoiology
found in members of Clade A–F, evolved twice
Considering the phylogenetic root-tip distances in the true tree, it makes sense that blue is the plesiomorphy of the ingroup retained in the shorter branching members, and purple a homoiology found in the most derived sublineages (again, evolved twice).
Both scenarios require three steps, but probabilistic character mapping methods would prefer the second scenario as they assume the longer the internal branches, the higher the likelihood for a change. To dismiss symplesiomorphies, Sookias' script infers the ancestral state of the MRCA of a clade and only considers states as homoiologies that differ from the inferred ancestral state (the cut-off value can be modified to "less stringently exclude potential symplesiomorphies as homoiologies").
 
Doyle's 'shared apomorphies' are locally compatible:

Blue is a shared apomorphy of the GH lineage, convergently evolved in the
outgroup (see original tree above: the GH lineage is a strongly derived
ingroup lineage evolving into the direction of the outgroup
in contrast to the remainder of the ingroup).
The example above also illustrates how shared apomorphies may trigger branching artifacts such as ingroup-outgroup long-branch attraction. Imagine that GH is not the first diverging branch of the ingroup but instead a strongly derived sublineage nested within Clade A–F, and that we lack the short-branching sister-group but have a large outgroup sample. Any ingroup-outgroup shared apomorphies will then draw GH towards the outgroup-defined ingroup's root and be detrimental to inferring the true tree.

Convergence in a strict sense, ie. superficial or random similarity, is incompatible with the true tree:

Blue is a randomly distributed derived state found in all longer-branched taxa.

A tree-incompatible signal is, naturally, best handled using a network and not by forcing it into a single tree. Unless, of course, we have a sensible molecular tree and can go for total evidence approaches assuming the molecular tree reflects the true tree.

PS: In molecular data, too, the characters incompatible with the true tree may outnumber the compatible ones, but there we have many more characters and (usually but not always) a lot that are not filtered by negative or positive selection. Our stochastic molecular models are certainly never accurate enough to fully model the evolution of our sequences, but they are apparently precise enough for most applications. Even before next-generation sequencing and big data, molecular phylogenies outshone morphological phylogenies, something that paleontologists cannot afford to ignore any more: not because the data are much better (to infer evolution) but because the patterns and processes are much less complex.

Sookias' data example, crocodiles and relatives

The supplement of Sookias' paper includes a morphological character matrix for crocodilians and the resulting molecular tree for the group. Here's Sookias' fig. 3, using these data to make his point for how to select the better-fitting tree using homoiology recognition:


Now, the unsolved problem is: if we don't have a molecular tree, how can we possibly know 0 is a homoiology and not a symplesiomorphy, 1 not a reversal (scenario B) or likely convergence (scenario C), hence, B should be preferred over C (the legend has a little typo, cf. Sookias 2019, p. 3, l. 34)?

The matrix provided as the example is not the best one to make this point. Sookias' script, when stringently eliminating potential symplesiomorphies, identifies, using the molecular tree as input, one potential homoiology for the Crocodylinae, five for their larger clade (including Gavialis and Tomistoma), and one for the alligators' larger clade in a matrix with 117 characters. Seven out of 117 characters (about 6%) can hardly be a game-changer.

What the morpho-data shows

Furthermore, the morphological matrix will give us a single most-parsimonious tree (MPT, using PAUP*'s Branch-and-Bound algorithm), not two or more equally parsimonious alternatives that we need to weigh against each other.

The single most-parsimonious tree that can be inferred from the morpho-matrix (236 steps, CI = 0.64, RI = 0.84). Red branches are conflicting with the topology of the molecular (truer?) tree (green brackets).
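For readers less familiar with the two indices quoted in the caption, here is a small sketch of their standard definitions: the consistency index is the minimum conceivable number of steps divided by the observed number, and the retention index relates the observed steps to the minimum and maximum conceivable numbers. The function names and the per-character example values are hypothetical; only the 236 steps and CI = 0.64 come from the caption, and the back-calculated minimum is approximate because the reported CI is rounded.

```python
def consistency_index(min_steps, observed_steps):
    """CI = M / S: 1.0 means no homoplasy on the tree."""
    return min_steps / observed_steps


def retention_index(min_steps, max_steps, observed_steps):
    """RI = (G - S) / (G - M): share of potential synapomorphy retained."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical binary character: it needs at least 1 step, could need at
# most 4 on the worst possible tree, and takes 2 steps on the tree at hand.
ci_char = consistency_index(1, 2)        # 0.5
ri_char = retention_index(1, 4, 2)       # about 0.67

# Rough back-calculation from the caption: with 236 observed steps and
# CI = 0.64, the minimum conceivable tree length is about 151 steps.
implied_min_steps = round(0.64 * 236)

print(ci_char, round(ri_char, 2), implied_min_steps)
```

Read this way, roughly 85 of the 236 steps on the MPT are extra steps, ie. homoplasy of one kind or another.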

Some of the red branches are supported by pseudo-synapomorphies, which, against the background of the molecular tree, are potential homoiologies for the comprising clade but are interpreted as symplesiomorphies by Sookias' script (provided the molecular branch lengths are sufficient, they might be recognized when using a probabilistic framework to infer the ancestral states).

Not a good example for Sookias' goal, but the matrix shows the limitations of trees when it comes to morphological differentiation. Here's the distance-based, 2-dimensional network for the morphological data:

A Neighbor-net based on Sookias' morphological matrix.
The arrow indicates the position of the assumed root.

The signal from the morphological matrix is quite tree-like, and the structure of the left part of the network is congruent with that of the single MPT (and the molecular tree). On the right-hand side, we find more complexity than we would expect from the single MPT. The data signal is not trivial regarding the position of the root as inferred by Bernissartia, nor regarding the placement of Gavialis and Tomistoma (pink edge bundles), two genera producing a very prominent box-like structure. Although called a "phenetic" approach by cladists, the distance-based network is nonetheless straightforward regarding the identification of monophyletic groups (green) and potential monophyletic groups (yellow); the latter always include the particular alternative seen in the single MPT as well and, in the case of the pink box, also the molecular alternative. The light green monophylum is a necessary consequence of the prior knowledge about the position of the root, and the likely monophyly of Alligator and its relatives (the tree-like subgraph with long internal branches = lots of uniquely shared traits, including potential synapomorphies).
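A Neighbor-net starts from a pairwise distance matrix rather than from the characters themselves. The sketch below shows one simple way such distances can be obtained from a morphological matrix: the proportion of differing characters among those scored in both taxa. This is an assumption for illustration; the taxon names, the toy matrix and the choice of distance are hypothetical and not necessarily what was used for the network above.

```python
from itertools import combinations


def mean_character_distance(row_a, row_b, missing="?"):
    """Proportion of differing characters among those scored in both taxa."""
    scored = [(a, b) for a, b in zip(row_a, row_b)
              if a != missing and b != missing]
    if not scored:                    # nothing comparable between the two taxa
        return float("nan")
    return sum(a != b for a, b in scored) / len(scored)


def distance_matrix(matrix):
    """Pairwise distances for a dict mapping taxon name -> character string."""
    return {(s, t): mean_character_distance(matrix[s], matrix[t])
            for s, t in combinations(sorted(matrix), 2)}


# Hypothetical toy matrix: three taxa, four characters, one missing entry.
toy_matrix = {"taxon1": "0101", "taxon2": "0?11", "taxon3": "1110"}
toy_distances = distance_matrix(toy_matrix)

for pair, d in toy_distances.items():
    print(pair, round(d, 2))
```

The resulting matrix can be exported and read by any distance-based network or tree method; characters scored as missing in one taxon simply do not contribute to that pair.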

Potential synapomorphies that can be inferred from the morpho-matrix alone by mapping the states onto the network. Red, homoiologies reconstructed as synapomorphies ('pseudo-synapomorphies') and (except for one) excluded as potential symplesiomorphies by Sookias' test run of his script (strict and relaxed cut-off).

The network provides more information than can be extracted from the MPT: one Crocodylus is significantly closer to Osteolaemus (the neighborhood defined by the light blue edge bundle, see Sookias' fig. 3A). Crocodylus, however, is likely monophyletic, being generally very similar; and the third genus, Mecistops, is closely linked to (all of) them (neighbourhood defined by the dark blue edge). An inclusive common origin (including the third genus, Mecistops) is, just based on morphology and without using a "phylogenetic" tree inference, beyond question, even though we lack syn- or shared apomorphies (short corresponding edge bundle): Mecistops is obviously closely related to Crocodylus, and Osteolaemus is related to part of the latter, so it's not a bad hypothesis that all three are descendants of the same common ancestor, and that Tomistoma (and Gavialis) branched off the lineage before the Crocodylinae radiated. The only alternative explanation would be that the Crocodylinae show the primitive morphs of the entire lineage, and that the position of Tomistoma and Gavialis is affected by long-branch (-edge) attraction (however, if that is the case, then we should have found a Tomistoma-Gavialis clade in the MPT, since parsimony will always get it wrong in the Felsenstein zone).

The main flaw

But any morphology-based alternative using this data matrix is not fully compatible with the molecular tree, which places Mecistops and Osteolaemus as sister to Crocodylus. Here's the consensus network based on 10,000 bootstrap pseudoreplicate BioNJ trees inferred from the morpho-matrix, highlighting the support for splits compatible with the molecular tree (green) and their competing, partly incongruent (red edge bundles) alternatives (I do the information transfer manually, but those with R-scripting skills can use the functions in the phangorn library; Schliep et al., MEE, 2017; see also David's post):

NJ-Bootstrap (BS) consensus network based on 10,000 pseudoreplicates.
Edges/splits corresponding to clades in the molecular tree
(see Sookias' fig. 3 above) in green, those conflicting the molecular tree in red.
Edge values show BS support (edge-lengths are proportional to NJ-BS support),
while asterisks indicate the branches seen in the MPT.
Obviously, there is some signal in the morpho-matrix compatible with the molecular clades (this can be synapomorphies, symplesiomorphies, homoiologies or shared apomorphies) clashing with the signal of pseudo-synapomorphies etc. supporting the topological alternatives seen in the morpho-based MPT.
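The support values on such a bootstrap consensus network are simply the frequencies with which each split (taxon bipartition) occurs among the pseudoreplicate trees. The following sketch, in Python rather than R, counts these frequencies and flags the splits that also occur in a reference tree. Everything here is hypothetical (taxon labels, the four "bootstrap trees", the reference splits); it illustrates the bookkeeping only and is not the procedure used to produce the figure.

```python
from collections import Counter


def normalise(split, taxa):
    """Name a bipartition by the side that lacks a fixed reference taxon."""
    side = frozenset(split)
    reference_taxon = min(taxa)
    return frozenset(taxa) - side if reference_taxon in side else side


def split_support(trees, taxa):
    """Frequency of each split across trees given as lists of splits."""
    counts = Counter()
    for splits in trees:
        counts.update({normalise(s, taxa) for s in splits})
    return {s: n / len(trees) for s, n in counts.items()}


# Hypothetical data: five taxa, four pseudoreplicate trees, each tree
# written as its list of non-trivial splits.
taxa = {"A", "B", "C", "D", "E"}
pseudoreplicates = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "C"}, {"D", "E"}],
    [{"A", "B"}, {"D", "E"}],
]
reference_splits = {normalise(s, taxa) for s in [{"A", "B"}, {"D", "E"}]}

support = split_support(pseudoreplicates, taxa)
for split, freq in sorted(support.items(), key=lambda item: -item[1]):
    colour = "green" if split in reference_splits else "red"
    print(sorted(split), freq, colour)
```

A consensus network then draws every split above a chosen frequency threshold, including mutually incompatible ones, which is what produces the boxes.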

Assuming the molecular tree is correct, the above reconstruction means that Osteolaemus is morphologically more derived, and hence placed as sister, while Mecistops and Crocodylus retain more primitive character states, and hence lack discriminatory derived traits: a sort of local ingroup-outgroup long-branch attraction (or 'short-branch culling').

What differentiates the Crocodylinae? Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies); red, shared apomorphies (convergence). The Mecistops-Crocodylus pseudo-monophylum is mostly supported by traits shared between Osteolaemus and distant siblings (taxa of the larger alligator clade) and/or the outgroup.

We can also hypothesize that the initial radiation was fast, because the Mecistops-Osteolaemus ancestor did not accumulate a single, unique, discriminating character trait.

Excess of shared derived, pseudo-synapomorphic traits is the reason Tomistoma is not resolved as sister of Gavialis in the MPT — the molecular Gavialis-Tomistoma clade is represented by a morphological grade.

A 'splits rose' showing the basic splits. Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies of the larger clade including Crocodylinae); pink, pseudo-synapomorphies (deep homoiologies or symplesiomorphies of the larger Crocodylinae clade); orange, shared ancestral (plesiomorph) or derived traits (convergent). 

And the homoiologies identified using the molecular tree as input cannot put things right. They are just partly compatible with unproblematic splits, ie. the larger clade including Alligator (character #7), the larger clade including Crocodylinae (#1, #18, #73, #74, #117) or exclusive to the Crocodylinae (#66).

Character mapping of the molecular-inferred homoiologies. The lush green splits represent the molecular splits.

However, if we are ignorant of the molecular tree, we would have to assume that Mecistops is the sister to Crocodylus, and that some of their shared traits not found in Osteolaemus are shared apomorphies (if occurring outside the clade and in the sister clade) or even synapomorphies (if exclusive for Mecistops + Crocodylus), while only those shared by Osteolaemus and C. porosus (#66) can be homoiologies. We also would have no reason to challenge the Gavialis-Tomistoma grade, until we infer networks.

Map of all potential synapomorphies (bold), symplesiomorphies (italics) and homoiologies (plain font) using the morphology-based Neighbor-net as basis. Red, pseudo-synapomorphies: splits seen in the MPT (with or without an alternative in the Neighbor-net) but rejected by the molecular tree.

This is the main flaw of Sookias' idea. To identify homoiologies, we need the same prerequisite as for any of Hennig's concepts: we need to know the true tree. If we use the inferred tree based on the same data that we want to weight (here: use homoiologies for decision making or as a means of node support), then we propagate first-level errors and apply circular reasoning, as with the red-marked pseudo-synapomorphies in the network above; vice versa, all actual (molecular-wise) synapomorphies supporting the molecular Gavialis-Tomistoma clade (dark purple split) would be reconstructed as homoiologies or symplesiomorphies based on the morpho-based single MPT (or morpho-based NJ tree, or probabilistic tree).

And if we have an independent molecular tree, it will make the decision on the fly: putative synapomorphies are the traits that are fully compatible; symplesiomorphies, homoiologies and shared apomorphies are decreasingly compatible; and random convergences are incompatible with the molecular tree.
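The notion of compatibility used here can be made concrete for binary characters: a character is compatible with a tree split if at least one of the four possible intersections between the two bipartitions is empty. The sketch below lists, for a given character, the splits of a reference tree it conflicts with. The taxon labels, the reference splits and the three example characters are hypothetical, and counting conflicting splits is only an illustrative proxy for "decreasing compatibility", not a measure taken from Sookias' paper.

```python
def compatible(split_a, split_b, taxa):
    """Two bipartitions are compatible if one of the four overlaps is empty."""
    a, b = set(split_a), set(split_b)
    a_rest, b_rest = taxa - a, taxa - b
    overlaps = (a & b, a & b_rest, a_rest & b, a_rest & b_rest)
    return any(not overlap for overlap in overlaps)


def conflicting_splits(character, tree_splits, taxa):
    """Tree splits that a binary character (taxon -> '0'/'1') contradicts."""
    derived = {taxon for taxon, state in character.items() if state == "1"}
    return [s for s in tree_splits if not compatible(derived, s, taxa)]


def score(derived_taxa, taxa):
    """Helper: code a character as '1' in derived_taxa and '0' elsewhere."""
    return {t: ("1" if t in derived_taxa else "0") for t in taxa}


# Hypothetical reference ("molecular") tree, given by its non-trivial splits.
taxa = set("ABCDEF")
reference_tree = [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}]

clade_trait = score({"A", "B"}, taxa)          # matches a reference clade
patchy_trait = score({"A", "B", "D"}, taxa)    # clade plus one extra taxon
scattered_trait = score({"B", "E"}, taxa)      # scattered across the tree

for name, trait in [("clade", clade_trait), ("patchy", patchy_trait),
                    ("scattered", scattered_trait)]:
    print(name, len(conflicting_splits(trait, reference_tree, taxa)))
```

Without the reference tree, the same three characters cannot be ranked this way, which is exactly the prerequisite discussed above.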

It is not homoiology but tree-incompatible signal that is neglected in phylogenetics

Sookias points out: "In inference of phylogeny by parsimony, an occurrence of a character state in a part of a tree separated from it by another state is considered simply a homoplasy, and a tree where the occurrences are nearer or further from one another is not more or less parsimonious". In principle, this is true, but it has little consequence in application.

We, usually without realizing it, make frequent use of the discriminating power of potential homoiologies. See the example above, but also when, eg., placing fossils in a molecular framework or doing post-inference character weighting. In these cases, homoiologies (and symplesiomorphies) will stabilize the inference and increase support. For better and worse:
  • Better, because homoiologies will ensure that the fossil is placed in the right molecular-based subtree, and can compensate for the lack of synapomorphies. Imagine an extinct fossil sibling lineage showing only homoiologies shared by Osteolaemus and C. porosus. Using tree-based optimization (eg. RAxML's 'evolutionary placement algorithm'), it would be placed close to the Crocodylinae ancestor, likely next to Osteolaemus. Using a Neighbor-net, it would be placed between Osteolaemus and C. porosus. Either way, the homoiologies would ensure it is nested within the Crocodylinae.
  • Post-inference character weighting, as implemented in eg. TNT, will downweight inferred convergences (ie. higher homoplasy, more stochastically distributed across the tree) more than putative homoiologies (ie. less homoplastic since confined to a single subtree). This can be better or worse. How do we avoid what happened for the crocodiles, where homoiologies are not recognized as such but support (somewhat) misleading clades (ie. act as synapomorphies)? Clades are commonly interpreted as a sufficient criterion to determine monophyly; however, they are not even a necessary one: taxa can be part of a monophyletic group despite not forming an inclusive subtree (ie. clade in a rooted tree), such as the genus Caiman or Gavialis-Tomistoma.
Hence, we should discourage any form of data-self-dependent or post-analysis weighting and instead just explore the signal in our data, using networks.

One thing is also obvious from the crocodile example: if we have enough signal in the morphological data, then we may get one or another thing wrong and, in some cases, may not be able to decide between one or another alternative. However, overall, the morphological differentiation pretty well captures what the genes provide us as the best approximation of the true tree. Even when the matrix includes very few potential synapomorphies and clear homoiologies but a lot of shared apomorphies, most of which were convergently evolved in parts of both major clades.

At least, this will be so when we analyze the data using networks and not just trees (compare the single MPT to the networks).

Using the alternative evolutionary scenarios provided by the networks, we can then look back into our data (see the maps above), to see what may be a homoiology, a symplesiomorphy (very useful for deciding between evolutionary scenarios, as well) or a synapomorphy. The phangorn library (used for Sookias' script) now has network functionality and allows transferring information between trees and networks. An R-affine person may be able to extract lists of potential (partly competing) synapomorphies, symplesiomorphies, and homoiologies directly from the network showing all possible or the most likely trees.

And then use this information to eg. place fossils in a phylogenetic context, or reconstruct evolutionary trends in extinct groups of organisms — reconstruction of evolutionary trends in extant organisms should always rely on morphological data analyzed in a molecular-phylogenetic framework.

Data

A NEXUS-version of Sookias' test matrix (slightly annotated for Mesquite, simple version for PAUP*), tree- and distance matrix files have been added to my figshare collection of morphological matrices.