Monday, March 25, 2019

Automatic detection of borrowing (Open problems in computational diversity linguistics 2)

The second task on my list of 10 open problems in computational diversity linguistics deals with detecting borrowings or language contact. The prototypical case of language contact is lexical borrowing, where a word is borrowed from one language into another, such as English job, which was adopted into German with the rather specific meaning of a temporary occupation. More complex cases involve semantic borrowing, where a way of denoting something is borrowed rather than the form itself, such as the use of the word for mouse to denote a computer mouse in many languages of the world.

Even less well understood are cases where specific aspects of grammar have been transferred. German has, for example, a certain number of neuter nouns, all borrowed from Ancient Greek or Latin, in which the plural is built according to (or inspired by) the Greek or Latin model: Lexikon has Lexika as plural, Komma has Kommata as plural, and Kompositum has Komposita as plural. While these cases are sporadic in German and thus rather harmless (as are the similar examples in English), there are other cases of language contact where scholars not only suspect that plural forms have been borrowed along with the words (as in German), but that entire paradigms and strategies of grammatical marking have been adopted by one language from a neighboring variety as a result of close language contact.

Why borrowing is hard to detect

Unless we witness them happening directly, most cases of borrowing are difficult to demonstrate consistently. By comparison with lexical borrowing, however, the borrowing of grammar is probably the hardest to show, especially when dealing with abstract categories that could have actually emerged independently. The reason why borrowing is generally hard to deal with, not only in computational approaches, is that detecting borrowing and demonstrating language contact presupposes that alternative explanations are all excluded, such as universal tendencies of language change (i.e., "convergent evolution" in the biological sense), common inheritance, or simple chance.

While we need to exclude alternative possibilities to prove any of the four major types of similarities (coincidental, natural, genealogical, or contact-induced, see List 2014: 55-57), we have a much harder time doing so when dealing with borrowings, because linguistics does not have a single established procedure for the identification of borrowings. Instead, we resort to a mix of different types of evidence, which are qualitatively weighted and discussed by the experts. While historical linguistics has developed sophisticated techniques to show that language similarities are genealogical, it has not succeeded in reaching the same level of sophistication for the identification of borrowings.

In this regard, techniques for contact detection are not much different from other, more specific, types of linguistic reconstruction, such as the "philological reconstruction" of ancient pronunciations (Jarceva 1990, Sturtevant 1920), the reconstruction of detailed etymologies (Malkiel 1954), or the reconstruction of syntax (Willis 2011).

Traditional strategies for detecting borrowing

It is not easy to give an exhaustive and clear-cut overview of all of the qualitative methods that scholars make use of in order to detect borrowings among languages. This is at least partially due to the nature of "cumulative-evidence arguments" (Berg 1998) — or arguments based on consilience (Whewell 1840, Wilson 1998) — which are always more difficult to formalize than clear-cut procedures that yield simple, binary results. Despite the difficulty in determining exact workflows, we can identify a couple of proxies that scholars use to assess whether a given trait has been borrowed or not.

One important class of hints consists of conflicts with possible genealogical explanations. A first type of conflict is represented by similarities shared among unrelated or distantly related languages. Since mountain is found only in English among the Germanic languages, while similar words otherwise occur only in Romance, we could take this as evidence that the English word was borrowed. Since these conflicts arise from the supposed phylogeny of the languages under consideration, we can speak of phylogeny-related arguments for interference.

A second conflict involves the traits themselves, most prominently observed in the case of irregular sound correspondence patterns. German Damm, for example, is related to English dam, but since the expected correspondence for cognates between English and German would yield a German reflex Tamm (as it is still reflected in Old High German, see Kluge 2002), we can take this as evidence that the modern German term was borrowed (Pfeifer 1993). We can call these cases trait-related arguments for contact.

In addition to observations of conflicts, two further types of evidence are of great importance for inferring contact. The first one is areal proximity, and the second one is the assumed borrowability of traits. Given that language contact requires the direct contact of speakers of different languages, it is self-evident that geographical proximity, including proximity by means of travel routes, is a necessary argument when proposing contact relations between different varieties.

Furthermore, since direct evidence confirms that linguistic interference does not act to the same degree on all levels of linguistic organisation, the notion of borrowability also plays an important role. Although scholars tend to have different opinions about the concept, most would probably agree with the borrowability scale proposed by Aikhenvald (2007, p. 5), which ranges from "inflectional morphology" and "core vocabulary", representing aspects resistant to borrowing, up to "discourse structure" and the "structure of idioms", representing aspects that are easy to borrow. How core vocabulary can be defined, and how the borrowability of individual concepts can be determined and ranked, however, has been subject to controversial discussions (Lee and Sagart 2008, Starostin 1995, Tadmor 2009, Zenner et al. 2014).

Computational strategies for contact inference

Despite the large number of quantitative applications proposed during the past two decades, computational approaches for the inference of contact situations are still in their infancy. As of now, none of the few approaches proposed in the past can compete with the classical methods. The reasons for this are twofold. First, given the multiple types of evidence employed by the classical approaches, the formalization of the problem of borrowing detection is difficult. Second, given the limited number and suitability of datasets annotated for different types of linguistic interference, scholars have a hard time in developing algorithms, since they lack data for testing and training.

In principle, all algorithms for contact inference proposed so far make use of the strategies used in the classical approaches. Thus, they infer or determine shared traits among two or more languages, and then determine conflicts in these traits, taking geographical closeness and borrowability into account. In contrast to classical approaches, which combine different types of evidence, computational approaches are usually restricted to one type.

The automatic methods proposed so far can be divided into three classes. The first class employs phylogeny-related conflicts to identify those traits whose evolution cannot be explained with a given phylogenetic tree, explaining the conflicts as resulting from contact. Examples include work where I was involved myself (Nelson-Sathi et al. 2011, List et al. 2014), some early and interesting approaches which did not receive too much attention (Minett and Wang 2003), or have been mostly forgotten by now (Nakhleh et al. 2005), along with a recent study on grammatical features (Cathcart et al. 2018).

The second class uses techniques for automatic sequence comparison to search for similar words, but not cognate words, across different languages. Here, the most prominent examples include the work by Ark et al. (2007), and later Mennecier et al. (2016), who searched for similar words among languages known to be not related. Further examples include the work by Boc et al. (2010) and Willems et al. (2016), who experimented with tree reconciliation approaches, based on word trees derived from sequence-alignment techniques. There is also an experimental study where I was again involved myself (Hantgan and List forthcoming), in which we tried to identify borrowings by comparing two automatically inferred similarities among words from related and unrelated languages: surface similarities, as reflected by naive alignment algorithms, and deep similarities, reflected by advanced methods that take sound correspondences into account (List 2014).
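The idea of searching for similar words among unrelated languages can be illustrated with a minimal sketch. It uses a plain normalized edit distance as a stand-in for the "surface similarity" of naive alignment algorithms; it is not the method of any of the studies cited above. The word forms, the two language labels and the threshold of 0.6 are invented for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (or lists of sound segments)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def surface_similarity(a, b):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def borrowing_candidates(wordlist_a, wordlist_b, threshold=0.6):
    """Return (concept, similarity) pairs whose forms look alike.

    Both wordlists map a concept to one word form. If the two languages
    are known to be unrelated, similar forms are candidates for borrowing
    (or for chance resemblance), not proof of it.
    """
    hits = []
    for concept in sorted(set(wordlist_a) & set(wordlist_b)):
        sim = surface_similarity(wordlist_a[concept], wordlist_b[concept])
        if sim >= threshold:
            hits.append((concept, sim))
    return hits


# Invented forms for two hypothetical, unrelated languages.
lang_a = {"computer": "kompjuter", "water": "vasa", "hand": "ruka"}
lang_b = {"computer": "komputa", "water": "mai", "hand": "lima"}

for concept, sim in borrowing_candidates(lang_a, lang_b):
    print(concept, round(sim, 2))
```

A real application would compare phonetic transcriptions rather than spellings, and it would need to filter out nursery words of the "mama" and "papa" type.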

The third class searches for distribution-related conflicts by comparing the amount of shared words within sublists of differing degrees of borrowability. This class is best represented by Sergey Yakhontov's (1926-2018) work on stable and unstable concept lists (Starostin 1991), which assumed that deep historical relations should surface in those parts of the lexicon that are stable and resistant to borrowing, while recent contact-induced relations would surface rather in those parts of the lexicon that are more prone to borrowing. Yakhontov's work was independently re-invented by Chén (1996), and McMahon et al. (2005); but given how difficult it turned out to distinguish concepts prone to borrowing from those resistant to borrowing, it has been largely disregarded for some time now.
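The logic of the sublist comparison can be sketched in a few lines. This is a simplified illustration, not a re-implementation of Yakhontov's procedure: the concept lists, the set of shared items and the function names are all hypothetical.

```python
def shared_proportion(concepts, shared):
    """Fraction of the given concepts for which two languages share a word."""
    concepts = list(concepts)
    if not concepts:
        raise ValueError("concept list must not be empty")
    return sum(1 for c in concepts if c in shared) / len(concepts)


def sublist_contrast(stable, unstable, shared):
    """Compare sharing in a stable and an unstable sublist.

    Returns (p_stable, p_unstable, difference). A clearly positive
    difference points towards inheritance, a clearly negative one
    towards contact. Both readings are heuristic only.
    """
    p_stable = shared_proportion(stable, shared)
    p_unstable = shared_proportion(unstable, shared)
    return p_stable, p_unstable, p_stable - p_unstable


# Hypothetical data: concepts assumed to resist or to favour borrowing,
# and the concepts for which two languages have matching words.
stable = ["I", "you", "two", "water", "name"]
unstable = ["mountain", "river", "market", "boat", "king"]
shared = {"water", "mountain", "market", "boat", "king"}

p_stable, p_unstable, diff = sublist_contrast(stable, unstable, shared)
print(round(p_stable, 2), round(p_unstable, 2), round(diff, 2))
```

The sketch makes the weak point of the approach visible: the result depends entirely on how the concepts are assigned to the two sublists.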

Problems with computational strategies for contact inference

All three classes of approaches discussed so far have certain shortcomings. Phylogeny-based inference of borrowing, for example, tends to drastically overestimate the number of borrowed traits, simply because conflicts in a phylogeny can result from undetected borrowings in the data, but they need not do so (see Appendix 1 of Morrison 2011 on causes of reticulation in biology, which has many parallels to linguistics). Saying that all instances in which a dataset conflicts with a given phylogeny are borrowings is therefore generally a bad idea. It can be used as a very rough heuristic to flag potentially wrongly annotated homologies in a dataset, which could then be checked again by experts, but deriving stronger claims from it seems problematic.
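What a phylogeny-related conflict looks like can be shown with a small sketch based on Fitch parsimony, which counts the minimum number of state changes a trait needs on a given tree. This is a generic textbook technique, not the procedure of any study cited here. The five-language tree is a deliberately simplified assumption, and the trait codes the presence (1) or absence (0) of a word resembling mountain.

```python
def min_changes(tree, states):
    """Fitch parsimony on a binary tree given as nested 2-tuples.

    Leaves are language names; `states` maps each name to a trait state.
    Returns (possible_root_states, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = min_changes(left, states)
    right_set, right_cost = min_changes(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


# Toy tree (an assumption): Germanic languages versus Romance languages.
tree = ((("German", "English"), "Swedish"), ("French", "Italian"))

# Presence of a word resembling "mountain".
mountain = {"German": 0, "English": 1, "Swedish": 0, "French": 1, "Italian": 1}
# A hypothetical trait found only in the Germanic languages.
germanic_only = {"German": 1, "English": 1, "Swedish": 1, "French": 0, "Italian": 0}

for name, trait in [("mountain", mountain), ("germanic_only", germanic_only)]:
    _, changes = min_changes(tree, trait)
    status = "conflict" if changes > 1 else "fits the tree"
    print(name, changes, status)
```

A trait that needs more than one change conflicts with the tree, but the count alone cannot say whether the cause is borrowing, parallel development or an annotation error. This is why such a count is better treated as a flag for expert review than as a result.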

Sequence comparison techniques applied to unrelated languages are basically safe, in my opinion, and the results are very reliable, unless one compares words that occur in all languages, such as "mama" and "papa" (Jakobson 1960, see also "Mama and papa" on Wikipedia).

Using methods for tree reconciliation on individual word trees, calculated from word distances based on phonetic alignment techniques or similar, yields the same problems of over-counting conflicts as we get for phylogeny-based approaches to borrowing. The problem here is a general misunderstanding of the conceptual differences between gene trees in biology, where surface similarity of gene sequences is thought to reflect evolutionary history, and word trees in linguistics. While we can use qualitative methods to draw a word tree for a given set of homologous words, the surface similarity among the words says little, if anything, about their evolutionary history.

Attempts to distinguish borrowed from inherited traits with sublists have lost their popularity in most recent studies. When properly applied, they might, indeed, provide some evidence in the search for borrowings or deep homologies. So far, however, all stability rankings of concepts that have been proposed have been based on too small a number of either concepts (we would need rankings for some 1,000 concepts at least) or languages from which the information was derived. If we could manage to get reliable counts on some 1,000 concepts for a larger sample of the world's languages, this might greatly help our field, as it would provide us with a starting point from which people could search (even qualitatively) for borrowings in their data.


Assuming that we currently have no realistic way to operationalize arguments based on consilience, there is no immediate hope of having a fully automatic method for detecting borrowings any time soon. By developing promising existing methods further, however, there is hope that we can learn a lot more about borrowing processes in the world's languages. What is needed, of course, are the data to which these methods can be applied.

None of the automatic approaches for borrowing detection mentioned above uses trait-related conflicts to infer borrowings; so far, nobody has tried this. Since these conflicts are usually considered to be quite reliable by experts in historical linguistics, it seems essential to work in this direction as well, if we want to tackle the problem of consistent automatic detection of borrowing. Here, my recently proposed framework for the consistent handling and identification of patterns of sound correspondences across multiple languages (List 2019) could definitely be useful, although it will again be challenging to find the right balance of parameters and interpretation, since not all conflicts in sound correspondences necessarily result from borrowings.

Whether it will be possible to identify even the direction of borrowings, when developing these methods further, is an open question. Borrowability accounts might help here, but again, since no clear-cut strategies are being used by scholars, it is difficult to formalize any of the existing qualitative approaches. The greatest challenge will perhaps consist in the creation of a database of known borrowings that could assist digital linguists in testing and training new approaches.

Aikhenvald, Alexandra Y. (2007) Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, Alexandra Y. and Dixon, Robert M. W. (eds.) Grammars in Contact. Oxford:Oxford University Press. 1-66.

van der Ark, René and Mennecier, Philippe and Nerbonne, John and Manni, Franz (2007) Preliminary identification of language groups and loan words in Central Asia. In: Proceedings of the RANLP Workshop on Acquisition and Management of Multilingual Lexicons, pp. 13-20.

Berg, Thomas (1998) Linguistic Structure and Change: an Explanation from Language Processing. Gloucestershire:Clarendon Press.

Boc, Alix and Di Sciullo, Anna Maria and Makarenkov, Vladimir (2010) Classification of the Indo-European languages using a phylogenetic network approach. In: Locarek-Junge, H. and Weihs, C. (eds.) Classification as a Tool for Research. Berlin and Heidelberg:Springer. 647-655.

Cathcart, Chundra and Carling, Gerd and Larson, Filip and Johansson, Richard and Round, Erich (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Chén Bǎoyà 陈保亚 (1996) Lùn yǔyán jiēchù yǔ yǔyán liánméng 论语言接触与语言联盟 [Language Contact and Language Unions]. Běijīng 北京:Yǔwén 语文.

Hantgan, Abbie and List, Johann-Mattis (forthcoming) Bangime: Secret language, language isolate, or language island? Journal of Language Contact.

Jakobson, Roman (1960) Why 'Mama' and 'Papa'? In: Perspectives in Psychological Theory: Essays in Honor of Heinz Werner, pp. 124-134.

Jarceva, V. N. (1990) Lingvističeskij enciklopedičeskij slovar'. Moscow: Sovetskaja Enciklopedija.

Kluge, Friedrich (2002) Etymologisches Wörterbuch der deutschen Sprache. Berlin:de Gruyter.

Lee, Yeon-Ju and Sagart, Laurent (2008) No limits to borrowing: The case of Bai and Chinese. Diachronica 25.3: 357-385.

List, Johann-Mattis and Nelson-Sathi, Shijulal and Geisler, Hans and Martin, William (2014) Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36.2: 141-150.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Malkiel, Yakov (1954) Etymology and the structure of word families. Word 10.2-3: 265-274.

McMahon, April and Heggarty, Paul and McMahon, Robert and Slaska, Natalia (2005) Swadesh sublists and the benefits of borrowing: an Andean case study. Transactions of the Philological Society 103: 147-170.

Mennecier, Philippe and Nerbonne, John and Heyer, Evelyne and Manni, Franz (2016) A Central Asian language survey. Language Dynamics and Change 6.1: 57-98.

Minett, James W. and Wang, William S.-Y. (2003) On detecting borrowing. Diachronica 20.2: 289–330.

Morrison, D. A. (2011) An Introduction to Phylogenetic Networks. Uppsala: RJR Productions.

Nakhleh, Luay and Ringe, Don and Warnow, Tandy (2005) Perfect Phylogenetic Networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81.2: 382-420.

Nelson-Sathi, Shijulal and List, Johann-Mattis and Geisler, Hans and Fangerau, Heiner and Gray, Russell D. and Martin, William and Dagan, Tal (2011) Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society of London B: Biological Sciences 278.1713: 1794-1803.

Pfeifer, Wolfgang (1993) Etymologisches Wörterbuch des Deutschen. Berlin: Akademie.

Starostin, Sergej Anatolévic (1991) Altajskaja problema i proischoždenije japonskogo jazyka [The Altaic Problem and the Origin of the Japanese Language]. Moscow: Nauka.

Starostin, Sergej Anatolévic (1995) Old Chinese vocabulary: A historical perspective. In: Wang, William S.-Y. (ed.) The Ancestry of the Chinese Language. Berkeley: University of California Press, pp. 225-251.

Sturtevant, Edgar H. (1920) The Pronunciation of Greek and Latin. Chicago: University of Chicago Press.

Tadmor, Uri (2009) Loanwords in the world's languages. Findings and results. In: Haspelmath, Martin and Tadmor, Uri (eds.) Loanwords in the World's Languages. Berlin and New York: de Gruyter, pp. 55-75.

Whewell, William D. D. (1847) The Philosophy of the Inductive Sciences, Founded Upon Their History. London: John W. Parker.

Willems, Matthieu and Lord, Etienne and Laforest, Louise and Labelle, Gilbert and Lapointe, François-Joseph and Di Sciullo, Anna Maria and Makarenkov, Vladimir (2016) Using hybridization networks to retrace the evolution of Indo-European languages. BMC Evolutionary Biology 16.1: 1-18.

Willis, David (2011) Reconstructing last week's weather: Syntactic reconstruction and Brythonic free relatives. Journal of Linguistics 47.2: 407-446.

Wilson, Edward O. (1998) Consilience: the Unity of Knowledge. New York: Vintage Books.

Zenner, Eline and Speelman, Dirk and Geeraerts, Dirk (2014) Core vocabulary, borrowability and entrenchment. Diachronica 31.1: 74-105.

Monday, March 18, 2019

Which US cities are best for walking, biking and public transport?

In the modern world, there is a lot of discussion about the environmental damage caused by cars and trucks, not least due to their involvement in global climate change. The pro-active parts of this discussion revolve around banning cars, so that parts of cities and towns can return to pedestrian areas (eg. Life in the Spanish city that banned cars; The automotive liberation of Paris), and encouraging alternative modes of transport, particularly bicycles (eg. Copenhagenize your city: the case for urban cycling; Britain wants cycle-friendly cities).

In particular, some cities throughout the world are taking active steps to improve the "walkability" of their centers, including Addis Ababa, Auckland, Denver, Hanoi, London, Manchester and San Francisco (What would a truly walkable city look like?), and the "cyclability" of their inner suburbs, including Calgary, Copenhagen, Eindhoven, Lidzbark, Purmerend, San Sebastian, Utrecht and Vancouver (Top 10 pieces of cycling infrastructure: which country does it right?). On the other hand, there are some cities that have not yet tried to do much about cycling, including Beijing, Cairo, Delhi, Hong Kong, Moscow, Mumbai, Nairobi, Orlando, São Paulo and Sydney (Top 10 worst cities for cycling).

The USA is not usually considered to be at the forefront of this movement, having long ago wedded itself to the cult of the private motor car. However, this does not mean that US cities are all the same in terms of non-car transportation. For example, the Walk Score site, which is part of the Redfin real estate organization, provides a ranking of all US cities and neighborhoods with a population of 200,000 or more, in terms of how friendly they are for: walking, biking and transit.

The ranks are based on a score out of 100 for each location, using various methodologies:
- Walk Score analyzes hundreds of walking routes to nearby amenities; points are awarded based on the distance to amenities in each category.
- Bike Score is calculated by measuring bike infrastructure (lanes, trails, etc), hills, destinations and road connectivity, and the number of bike commuters.
- Transit Score assigns a "usefulness" value to nearby transit routes based on their frequency, type of route (rail, bus, etc), and distance to the nearest stop on the route.

Our interest here is in combining these three pieces of information into a single picture, showing which cities are generally good, at the moment.

Not unexpectedly, the Walk Score and Transit Score are highly correlated (86% shared rankings), while the Bike Score is not as highly correlated with either of these (49% and 42%, respectively). This means that the same cities tend to be good for the first two criteria. The three best cities for the Walk Score are New York, Jersey City and San Francisco, while the top two for the Transit Score are New York and San Francisco. On the other hand, for the Bike Score the top two are Minneapolis and Portland — it would be difficult to imagine either New York or San Francisco as being good for biking!

If we define a "good" score as being >70, then only San Francisco has a score for all three criteria >70, although Boston comes close. On the other hand, Pittsburgh and Washington D.C. have the most consistent scores across the board, because they have uniformly middle-rank scores.

Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, we calculated the similarity of the cities using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-city similarities.
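The distance step can be sketched as follows. The three cities and their (Walk, Bike, Transit) scores are invented placeholders, not values from the Walk Score site, and the Neighbor-net analysis itself is not reproduced here; the sketch only shows how a Manhattan distance matrix of the kind fed into such an analysis is obtained.

```python
def manhattan(u, v):
    """Manhattan (city-block) distance: sum of absolute score differences."""
    if len(u) != len(v):
        raise ValueError("score vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(scores):
    """Return (names, matrix) with pairwise Manhattan distances."""
    names = sorted(scores)
    matrix = [[manhattan(scores[a], scores[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical (Walk Score, Bike Score, Transit Score) triples.
scores = {
    "City A": (90, 70, 85),
    "City B": (85, 75, 80),
    "City C": (40, 45, 30),
}

names, matrix = distance_matrix(scores)
for name, row in zip(names, matrix):
    print(name, row)
```

Because all three scores are on the same 0 to 100 scale, no standardization is applied before the differences are summed.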

The resulting network of the 98 cities with complete data is shown in the figure. Cities that are closely connected in the network are similar to each other based on how good they are for walking, biking and transit, and those cities that are further apart are progressively more different from each other. The color-coding for the cities is from Megaregions of the United States.

The network generally shows decreasing walking / transit scores from top to bottom, and decreasing biking scores from right to left. We have labeled only the top group of 29 cities, which are distinctly "better" than the remaining 69, plus four unusual cities (at the middle-left).

Note that, as expected, New York, San Francisco and Boston stand out at the top of the network. Note, also, that Minneapolis and Portland are separated in the network from the other cities, because of their high Bike Scores; all of the other cities in the top group have much lower biking scores. Newark, in particular, has a low biking score. New Orleans is at the bottom-left of this group because it has a low Transit Score but not a low Walk Score.

For the four unusual cities, separated at the left of the bottom group: Dallas has a low Transit Score, and Atlanta, Cincinnati and San Diego all have a low Bike Score.

The city at the very bottom-left of the network, which has the lowest score on all three criteria, is Arlington TX. Along the same lines, there is an online graph of The 10 most dangerous states for cyclists, showing Florida way out in front.

Finally, you should be warned about potential problems with rankings like these, based on only a few selected criteria. For example, the real estate site StreetEasy recently tried to compile a list of the 10 Healthiest Neighborhoods in New York City, and ended up listing the Brooklyn industrial area of Red Hook as number 1, which engendered a couple of negative comments, such as:
I guess the fact that the majority of Red Hook’s parkland has been closed for many years due to lead contamination, or the fact that we have one of the highest asthma rates in the city, was overlooked for this study.
Caveat emptor!

Monday, March 11, 2019

Tattoo Monday XVII

Here are seven more tattoos in our compilation of evolutionary tree tattoos from around the internet. For more examples of the circular design for a phylogenetic tree, in a variety of body locations, see Tattoo Monday V, Tattoo Monday VII, Tattoo Monday X and Tattoo Monday XI.

At the bottom of this post is an unusual linearized version of this same type of tree.

Monday, March 4, 2019

Has homoiology been neglected in phylogenetics?

In a recently published pre-print on PaleorXiv, Roland Sookias makes a case for distinguishing between parallelism, ie. shared inherited traits that can be found in some but not all of the offspring of a common ancestor, and convergences in a strict sense, involving similar traits that are not homologous. The former is also known as homoiology, a term Sookias attributes to Ludwig Plate.

As a geneticist working mostly at the tips of the Tree of Plant Life, I'm quite familiar with the (pre-Hennigian) concept: much more often than not, we lack Hennig's 'synapomorphies', ie. shared, derived traits exclusive to an evolutionary lineage. But we have many highly diagnostic character suites including 'shared apomorphies' (I think that the angiosperm phylogeneticist Jim Doyle coined the term) that collect the same species or higher taxa, eg. groups of taxa that also form highly supported clades in molecular trees, but are not exclusive. In every plant group you can additionally observe that certain traits are exclusive to some members of one lineage, because the lineage has the genetic-physiological prerequisites to express these traits, while their sister lineages or distant relatives lack this potential. Epigenetics deals with tendencies to express a trait in response to the environment without even changing the genetic code.

If you look close enough, you can find such patterns even at the molecular level.

Molecular evolution of the 5' half of the ITS1 in beeches. Each sequence motif is assigned a state (Ax, Bx etc; x = 0 represents the ancestral state, x > 0 are derived states) and evolution involves usually the gain ("+") or loss ("-") of sequence motifs including some potential genetic homoiologies (see here for context and references).

However, it has apparently been ignored by my fellow paleontologists: Sookias wants to discuss "the neglected concept of homoiology ... in the context of palaeontological phylogenetic methods". Paleontological phylogenetic methods are, of course, tree inferences, and the idea is that recognition of homoiologies can be a means of establishing node support or to "help to choose between equally parsimonious or likely trees". He provides an R function "to calculate two measures for a given tree and matrix: (a) the potential support for clades based on potential homoiologies; and (b) the fit of the tree to all states given the concept of homoiology".

Sookias provides a nice and concise introduction to the problem with some examples, and makes the connection to linguistics (see also Mattis' and my post on the Chinese dialects continuum: How languages lose body parts); so, give the short paper a read. Like all paleontological literature it is strongly influenced by cladistic views, such as that life is monophyletic, and it revolves around the central theme of how to get better supported trees.

My inner geneticist has a fundamental problem with such a goal, because there has (to my knowledge) not been a single morphology-based tree that was fully congruent with a molecular tree with sufficient taxon and gene sampling, which applies also to the real-world data example that Sookias chose (as we will see).

My inner paleontologist also knows that there are highly diagnostic morphs in the fossil record, but diagnostic character suites and morphs reflect as many paraphyla as monophyla. He also knows that the fossil record, provided you find the right fossil from the right time, may alter your perspective on ancestral and derived character states.

An inferred tree (see this post). Given the inferred tree (quasi-dated tree), we would assume that star shapes are primitive (a symplesiomorphy) within the Pointish lineage, and possibly 10-tipped stars; and conclude that the Tenstars are paraphyletic. Greenish is clearly ancestral (a Pointish symplesiomorphy), and bluish derived (a Polygonia synapomorphy).
If we have the full picture, we can confirm star shapes are symplesiomorphic within the Pointish (the first common ancestor being a five-pointed colorless star). However, all greenish stars form a monophylum not a paraphylum.
Having ten tips is a synapomorphy of the monophyletic Tenstars.

So, why should we aim to get more resolved, better supported, morphology-based trees? Any such tree will inevitably include wrong branches!

I argue that, instead, we should just explore the signal in our data matrices using networks. Any potential tree is included in a network. But networks are more comprehensive because they provide not only a single tree but alternative, competing trees. By visualizing the alternatives, we can distinguish between mere convergence (random similarity), homoiology (parallelism, convergence related to descent), symplesiomorphy (shared, lineage-consistent primitive traits) and synapomorphy (lineage-unique and consistent shared derived traits), which can be very tricky with just a tree. Thus, we can try to evaluate which evolutionary scenario best explains all our data.


The basic problem when using morphological and such-like data sets to infer phylogenies is that most of the scored characters are, to some degree, incompatible with the true tree, ie. the actual evolutionary pathways.

Let's take a hypothetical evolution (no reticulations), in which the x-axis represents the morphological diversification and the y-axis time.

As in real-world data, sister taxa (eg. Species A and B) may have different levels of morphological derivation compared to their common ancestor(s). This leads us to this unrooted true tree in which the branch lengths are proportional to the real (above) amount of change.

Unrooted representation of the above evolution.
All commonly used tree inferences infer unrooted trees.

The only characters providing a taxon bipartition that is fully compatible with the true tree are Hennig's 'synapomorphies':

Clade A–D shares a unique, derived trait.
The character split is fully compatible with the true tree.

Next come Hennig's 'symplesiomorphies' (Sookias' R-script discards them):

Blue is the ancestral state within the ingroup, lost/modified in Species A.
The character split is compatible with the true tree except for A.
In phylogenetic inference, symplesiomorphies will usually stabilize the topology
as there will be enough other characters supporting A as sister of B and Clade A–D(–F).

Homoiologies / parallelisms can be partly compatible:

Blue is a homoiology found in 50% of the species composing Clade A–F.
The character split supports the sister relationship of A and B (compatible aspect)
but joins them with F (incompatible aspect).
A, B and F belong to the same monophylum/clade (semi-compatible aspect).
As long as homoiologies are confined to otherwise
coherent (or flat) subtrees, they will contribute to the overall decision capacity of the data.

Note that without a molecular backbone tree, it may be impossible to distinguish homoiologies from symplesiomorphies – whether a trait will be resolved as either the one or the other in a tree depends solely on its frequency and distribution across the subtree, and the situation in outgroups.

Purple is the plesiomorphy of the ingroup, blue the homoiology
found in members of Clade A–F, evolved twice
Considering the phylogenetic root-tip distances in the true tree, it makes sense that blue is the plesiomorphy of the ingroup retained in the shorter branching members, and purple a homoiology found in the most derived sublineages (again, evolved twice).
Both scenarios require three steps, but probabilistic character mapping methods would prefer the second scenario as they assume the longer the internal branches, the higher the likelihood for a change. To dismiss symplesiomorphies, Sookias' script infers the ancestral state of the MRCA of a clade and only considers states as homoiologies that differ from the inferred ancestral state (the cut-off value can be modified to "less stringently exclude potential symplesiomorphies as homoiologies").
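
Parsimony only counts state changes, so by itself it cannot tell these two readings apart. A minimal Python sketch of Fitch's step counting illustrates this. The tree topology, the taxon labels and the third state given to the outgroup O are assumptions made for illustration; they are not read off the figures.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes (Fitch parsimony) for one character.

    tree   -- nested 2-tuples; leaves are taxon names (strings)
    states -- dict mapping each taxon name to its character state
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):      # leaf: the state set is the observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                     # children agree on at least one state
            return shared
        steps += 1                     # no overlap: one change is needed here
        return left | right

    visit(tree)
    return steps


# Hypothetical topology with outgroup O (an assumption, not the original figure).
TREE = ("O", (("G", "H"), (("E", "F"), (("A", "B"), ("C", "D")))))

# Blue in A, B and F; purple in the rest of the ingroup; a third state in O.
blue_purple = {"A": "blue", "B": "blue", "F": "blue",
               "C": "purple", "D": "purple", "E": "purple",
               "G": "purple", "H": "purple", "O": "other"}

print(fitch_steps(TREE, blue_purple))
```

On this toy tree the character costs three steps, and the function never needs to know which state is the plesiomorphy: the count depends only on how the states are distributed across the tips.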
Doyle's 'shared apomorphies' are locally compatible:

Blue is a shared apomorphy of the GH lineage, convergently evolved in the
outgroup (see original tree above: the GH lineage is a strongly derived
ingroup lineage evolving into the direction of the outgroup
in contrast to the remainder of the ingroup).
The example above also illustrates how shared apomorphies may trigger branching artifacts such as ingroup-outgroup long-branch attraction. Imagine that GH is not the first diverging branch of the ingroup but instead a strongly derived sublineage nested within Clade A–F, and that we lack the short-branching sister-group but have a large outgroup sample. Any ingroup-outgroup shared apomorphies will then draw GH towards the outgroup-defined root of the ingroup, which is detrimental for inferring the true tree.

Convergence in a strict sense, ie. superficial or random similarity, is incompatible with the true tree:

Blue is a randomly distributed derived state found in all longer-branched taxa.

A tree-incompatible signal is, naturally, best handled using a network and not by forcing it into a single tree. Unless, of course, we have a sensible molecular tree and can go for total evidence approaches assuming the molecular tree reflects the true tree.
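
The notion of a character split being fully, partly or not compatible with the true tree can be made concrete with the standard pairwise test for splits: two bipartitions are compatible if at least one of the four intersections of their sides is empty. The sketch below is only an illustration; the nine taxon labels, the assumed unrooted tree and the three example characters are hypothetical and not taken from the figures.

```python
TAXA = frozenset("ABCDEFGHO")  # eight ingroup species plus outgroup O (hypothetical)

# Non-trivial splits of an assumed unrooted tree (O,((G,H),((E,F),((A,B),(C,D))))).
TREE_SPLITS = [frozenset(s) for s in ("AB", "CD", "ABCD", "EF", "ABCDEF", "GH")]


def compatible(side_a, side_b, taxa=TAXA):
    """Two splits are compatible if at least one of the four intersections is empty."""
    a_sides = (side_a, taxa - side_a)
    b_sides = (side_b, taxa - side_b)
    return any(not (x & y) for x in a_sides for y in b_sides)


def compatibility_count(character_split, tree_splits=TREE_SPLITS):
    """Number of tree splits that the character's bipartition does not contradict."""
    return sum(compatible(character_split, s) for s in tree_splits)


# Each character is given as the set of taxa showing the derived (blue) state.
characters = {
    "synapomorphy of A-D": frozenset("ABCD"),
    "homoiology in A, B, F": frozenset("ABF"),
    "scattered convergence": frozenset("ADH"),
}

for name, split in characters.items():
    n = compatibility_count(split)
    print(f"{name}: compatible with {n} of {len(TREE_SPLITS)} tree splits")
```

With these made-up characters, the synapomorphy agrees with all six tree splits, the homoiology with four, and the scattered convergence with only one.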

PS: In molecular data, too, the characters incompatible with the true tree may outnumber the compatible ones, but there we have many more characters and (usually but not always) a lot that are not filtered by negative or positive selection. Our stochastic molecular models are for sure never accurate enough to model molecular evolution for our sequences, but apparently precise enough for most applications. Even before next-generation sequencing and big data, molecular phylogenies outshone morphological phylogenies, something that paleontologists cannot afford to ignore any more; not because the data are much better (to infer evolution) but because the patterns and processes are much less complex.

Sookias' data example, crocodiles and relatives

The supplement of Sookias' paper includes a morphological character matrix for crocodilians and the resulting molecular tree for the group. Here's Sookias' fig. 3, using these data to make his point for how to select the better-fitting tree using homoiology recognition:

Now, the unsolved problem is: if we don't have a molecular tree, how can we possibly know 0 is a homoiology and not a symplesiomorphy, 1 not a reversal (scenario B) or likely convergence (scenario C), hence, B should be preferred over C (the legend has a little typo, cf. Sookias 2019, p. 3, l. 34)?

The matrix provided as the example is not the best one to make this point. Sookias' script, when stringently eliminating potential symplesiomorphies, identifies, using the molecular tree as input, one potential homoiology for the Crocodylinae, five for their larger clade (including Gavialis and Tomistoma), and one for the alligators' larger clade in a matrix with 117 characters. Less than 10% can hardly be a game-changer.

What the morpho-data shows

Furthermore, the morphological matrix will give us a single most-parsimonious tree (MPT, using PAUP*'s Branch-and-Bound algorithm), not two or more equally parsimonious alternatives that we need to weigh against each other.

The single most-parsimonious tree that can be inferred from the morpho-matrix (236 steps, CI = 0.64, RI = 0.84). Red branches are conflicting with the topology of the molecular (truer?) tree (green brackets).

Some of the red branches are supported by pseudo-synapomorphies. Against the background of the molecular tree, these are potential homoiologies of the comprising clade, but Sookias' script interprets them as symplesiomorphies (provided the molecular branch lengths are sufficient, they might be recognized when using a probabilistic framework to infer the ancestral states).

Not a good example for Sookias' goal, but the matrix shows the limitations of trees when it comes to morphological differentiation. Here's the distance-based, 2-dimensional network for the morphological data:

A Neighbor-net based on Sookias' morphological matrix.
The arrow indicates the position of the assumed root.
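
A distance-based network starts from a pairwise distance matrix. The post does not state which distance was used, so the following Python sketch assumes a simple mean character difference: the share of differing states among the characters scored in both taxa. The toy matrix is made up and not taken from Sookias' data.

```python
from itertools import combinations


def mean_character_distance(row_a, row_b, missing="?-"):
    """Share of differing characters among those scored in both taxa."""
    scored = [(a, b) for a, b in zip(row_a, row_b)
              if a not in missing and b not in missing]
    if not scored:
        raise ValueError("no characters scored in both taxa")
    return sum(a != b for a, b in scored) / len(scored)


def distance_matrix(matrix):
    """Pairwise distances for a dict of taxon name -> string of character states."""
    return {(t1, t2): mean_character_distance(matrix[t1], matrix[t2])
            for t1, t2 in combinations(sorted(matrix), 2)}


# Made-up toy matrix with five characters; "?" marks a missing score.
toy = {"Taxon1": "00110", "Taxon2": "0010?", "Taxon3": "11100"}

for pair, d in distance_matrix(toy).items():
    print(pair, round(d, 2))
```

Missing entries are skipped pairwise rather than counted as differences, which is one common choice; polymorphic or ordered characters would need a more careful treatment.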

The signal from the morphological matrix is quite tree-like, and the structure of the left part of the network matches that of the single MPT (and the molecular tree). On the right-hand side, we find more complexity than we would expect from the single MPT. The data signal is not trivial regarding the position of the root as inferred by Bernissartia, nor regarding the placement of Gavialis and Tomistoma (pink edge bundles), two genera producing a very prominent box-like structure. Called a "phenetic" approach by cladists, the distance-based network is nonetheless straightforward regarding the identification of monophyletic groups (green) and potential monophyletic groups (yellow); the latter always include the particular alternative seen in the single MPT as well and, in the case of the pink box, also the molecular alternative. The light green monophylum is a necessary consequence of the prior knowledge about the position of the root, and the likely monophyly of Alligator and its relatives (the tree-like subgraph with long internal branches = lots of uniquely shared traits, including potential synapomorphies).

Potential synapomorphies that can be inferred from the morpho-matrix alone by mapping the states onto the network. Red, homoiologies reconstructed as synapomorphies ('pseudo-synapomorphies') and (except for one) excluded as potential symplesiomorphies by Sookias' test run of his script (strict and relaxed cut-off).

The network provides more information than can be extracted from the MPT: one Crocodylus is significantly closer to Osteolaemus (the neighborhood defined by the light blue edge bundle, see Sookias' fig. 3A). Crocodylus, however, is likely monophyletic, being generally very similar; and the third genus, Mecistops, is closely linked to (all of) them (neighbourhood defined by the dark blue edge). An inclusive common origin (including the third genus, Mecistops) is – just based on morphology and without using a "phylogenetic" tree inference – beyond question, even though we lack syn- or shared apomorphies (short corresponding edge bundle): Mecistops is obviously closely related to Crocodylus, and Osteolaemus is related to part of the latter, so it's not a bad hypothesis that all three are descendants of the same common ancestor, and that Tomistoma (and Gavialis) branched off the lineage before the Crocodylinae radiated. The only alternative explanation would be that the Crocodylinae show the primitive morphs of the entire lineage, and that the position of Tomistoma and Gavialis is affected by long-branch (-edge) attraction (however, if that were the case, then we should have found a Tomistoma-Gavialis clade in the MPT; parsimony will always get it wrong in the Felsenstein zone).

The main flaw

But any morphology-based alternative using this data matrix is not fully compatible with the molecular tree, which places Mecistops and Osteolaemus as sister to Crocodylus. Here's the consensus network based on 10,000 bootstrap pseudoreplicate BioNJ trees inferred from the morpho-matrix, highlighting the support for splits compatible with the molecular tree (green) and their competing, partly incongruent (red edge bundles) alternatives (I do the information transfer manually, but those with R-scripting skills can use the functions in the phangorn library; Schliep et al., MEE, 2017; see also David's post):

NJ-Bootstrap (BS) consensus network based on 10,000 pseudoreplicates.
Edges/splits corresponding to clades in the molecular tree
(see Sookias' fig. 3 above) in green, those conflicting the molecular tree in red.
Edge values show BS support (edge-lengths are proportional to NJ-BS support),
while asterisks indicate the branches seen in the MPT.
Obviously, there is some signal in the morpho-matrix compatible with the molecular clades (this can be synapomorphies, symplesiomorphies, homoiologies or shared apomorphies) clashing with the signal of pseudo-synapomorphies etc. supporting the topological alternatives seen in the morpho-based MPT.
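
The support values in a bootstrap consensus network are simply the share of pseudoreplicate trees that contain a given split. A rough Python sketch of that bookkeeping follows. The tree inference itself (BioNJ in my case) is not implemented; `character_splits` is a hypothetical stand-in that just returns the bipartitions of binary characters, so that the example runs on its own.

```python
import random
from collections import Counter


def bootstrap_columns(matrix, rng):
    """Resample characters (columns) with replacement to build one pseudoreplicate."""
    n_chars = len(next(iter(matrix.values())))
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(row[i] for i in picks) for taxon, row in matrix.items()}


def split_support(matrix, infer_splits, replicates=1000, seed=1):
    """Percentage of pseudoreplicates in which each split was inferred."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(infer_splits(bootstrap_columns(matrix, rng)))
    return {split: 100 * n / replicates for split, n in counts.items()}


def character_splits(matrix):
    """Stand-in for tree inference (an assumption): bipartitions of binary characters."""
    taxa = sorted(matrix)
    splits = set()
    for i in range(len(matrix[taxa[0]])):
        ones = frozenset(t for t in taxa if matrix[t][i] == "1")
        if 2 <= len(ones) <= len(taxa) - 2:
            # Store the side that lacks the first taxon, so each split has one name.
            splits.add(frozenset(taxa) - ones if taxa[0] in ones else ones)
    return splits


# Made-up four-taxon matrix in which two splits compete.
toy = {"W": "1110", "X": "1111", "Y": "0001", "Z": "0000"}
for split, pct in sorted(split_support(toy, character_splits, 200).items(), key=lambda kv: -kv[1]):
    print(sorted(split), f"{pct:.1f}")
```

A consensus network keeps all splits above some support threshold, including mutually incompatible ones, instead of collapsing them into a single tree.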

Assuming the molecular tree is correct, the above reconstruction means that Osteolaemus is morphologically more derived, and hence placed as sister, while Mecistops and Crocodylus retain more primitive character states, and hence lack discriminatory derived traits: a sort of local ingroup-outgroup long-branch attraction (or 'short-branch culling').

What differentiates the Crocodylinae? Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies); red, shared apomorphies (convergence). The Mecistops-Crocodylus pseudo-monophylum is mostly supported by traits shared between Osteolaemus and distant siblings (taxa of the larger alligator clade) and/or the outgroup.

We can also hypothesize that the initial radiation was fast, because the Mecistops-Osteolaemus ancestor did not accumulate a single, unique, discriminating character trait.

Excess of shared derived, pseudo-synapomorphic traits is the reason Tomistoma is not resolved as sister of Gavialis in the MPT — the molecular Gavialis-Tomistoma clade is represented by a morphological grade.

A 'splits rose' showing the basic splits. Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies of the larger clade including Crocodylinae); pink, pseudo-synapomorphies (deep homoiologies or symplesiomorphies of the larger Crocodylinae clade); orange, shared ancestral (plesiomorph) or derived traits (convergent). 

And the homoiologies identified using the molecular tree as input cannot put things right. They are just partly compatible with unproblematic splits, ie. the larger clade including Alligator (character #7), the larger clade including Crocodylinae (#1, #18, #73, #74, #117) or exclusive to the Crocodylinae (#66).

Character mapping of the molecular-inferred homoiologies. The lush green splits represent the molecular splits.

However, if we are ignorant of the molecular tree, we would have to assume that Mecistops is the sister to Crocodylus, and that some of their shared traits not found in Osteolaemus are shared apomorphies (if occurring outside the clade and in the sister clade) or even synapomorphies (if exclusive for Mecistops + Crocodylus), while only those shared by Osteolaemus and C. porosus (#66) can be homoiologies. We also would have no reason to challenge the Gavialis-Tomistoma grade, until we infer networks.

Map of all potential synapomorphies (bold), symplesiomorphies (italics) and homoiologies (plain font) using the morphology-based Neighbor-net as basis. Red, pseudo-synapomorphies: splits seen in the MPT (with or without alternative in the Neighbor-net) but rejected by the molecular tree.

This is the main flaw of Sookias' idea. To identify homoiologies, we need the same prerequisite as for any of Hennig's concepts: we need to know the true tree. If we use the inferred tree based on the same data that we want to weight (here: use homoiologies for decision making or as a means of node support), then we propagate first-level errors, ie. we apply circular reasoning. The red-marked pseudo-synapomorphies in the network above are an example; vice versa, all actual (molecular-wise) synapomorphies supporting the molecular Gavialis-Tomistoma clade (dark purple split) would be reconstructed as homoiologies or symplesiomorphies based on the morpho-based single MPT (or morpho-based NJ tree, or probabilistic tree).

And if we have an independent molecular tree, it will make the decision on the fly: putative synapomorphies are the traits that are fully compatible; symplesiomorphies, homoiologies and shared apomorphies are decreasingly compatible; and random convergences are incompatible with the molecular tree.

It is not homoiology but tree-incompatible signal that is neglected in phylogenetics

Sookias points out: "In inference of phylogeny by parsimony, an occurrence of a character state in a part of a tree separated from it by another state is considered simply a homoplasy, and a tree where the occurrences are nearer or further from one another is not more or less parsimonious". In principle, this is true, but it has little consequence in application.

We, usually without realizing it, make frequent use of the discriminating power of potential homoiologies. See the example above, but also when, eg., placing fossils in a molecular framework or doing post-inference character weighting. In these cases, homoiologies (and symplesiomorphies) will stabilize the inference and increase support. For better and worse:
  • Better, because homoiologies will ensure that the fossil is placed in the right molecular-based subtree, and can compensate for the lack of synapomorphies. Imagine an extinct fossil sibling lineage showing only homoiologies shared by Osteolaemus and C. porosus. Using tree-based optimization (eg. RAxML's 'evolutionary placement algorithm'), it would be placed close to the Crocodylinae ancestor, likely next to Osteolaemus. Using a Neighbor-net, it would be placed between Osteolaemus and C. porosus. Either way, the homoiologies would ensure it is nested within the Crocodylinae.
  • Post-inference character weighting, as implemented in eg. TNT, will downweight inferred convergences (ie. higher homoplasy, more stochastically distributed across the tree) more than putative homoiologies (ie. less homoplastic since confined to a single subtree). This can be better or worse. How do we avoid what happened for the crocodiles, where homoiologies are not recognized as such but support (somewhat) misleading clades, ie. act as synapomorphies? Clades are commonly interpreted as a sufficient criterion to determine monophyly; however, they are not even a necessary one: taxa can be part of a monophyletic group despite not forming an inclusive subtree (ie. clade in a rooted tree), such as the genus Caiman or Gavialis-Tomistoma.
Hence, we should discourage any form of data-self-dependent or post-analysis weighting and instead just explore the signal in our data, using networks.
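
To see why such weighting is self-dependent, consider one simple rule, assumed here purely for illustration and not claimed to be TNT's own scheme: weight each character by its consistency index on the inferred tree, ie. the minimum possible number of steps divided by the number of steps actually needed. The step counts below are hypothetical.

```python
def consistency_index(min_steps, observed_steps):
    """CI of one character: fewest possible steps divided by steps on the tree."""
    if min_steps < 1 or observed_steps < min_steps:
        raise ValueError("need observed_steps >= min_steps >= 1")
    return min_steps / observed_steps


# Hypothetical binary characters (minimum of one step each) with made-up
# step counts, as they would be read off the inferred tree.
observed = {
    "synapomorphy (evolved once)": 1,
    "homoiology (evolved twice in one subtree)": 2,
    "convergence (evolved four times)": 4,
}

weights = {name: consistency_index(1, steps) for name, steps in observed.items()}
for name, w in weights.items():
    print(f"{name}: weight {w:.2f}")
```

The weights depend entirely on the step counts, and the step counts depend on the tree inferred from the very same matrix: a pseudo-synapomorphy that needs only one step on a wrong tree keeps full weight.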

One thing is also obvious from the crocodile example: if we have enough signal in the morphological data, then we may get one or another thing wrong and, in some cases, may not be able to decide between one or another alternative. However, overall, the morphological differentiation pretty well captures what the genes provide us as the best approximation of the true tree, even when the matrix includes very few potential synapomorphies and clear homoiologies but a lot of shared apomorphies, most of which were convergently evolved in parts of both major clades.

At least, this will be so when we analyze the data using networks and not just trees (compare the single MPT to the networks).

Using the alternative evolutionary scenarios provided by the networks, we can then look back into our data (see the maps above), to see what may be a homoiology, a symplesiomorphy (very useful for deciding between evolutionary scenarios, as well) or a synapomorphy. The phangorn library (used for Sookias' script) now has network functionality and allows transferring information between trees and networks. An R-affine person may be able to extract lists of potential (partly competing) synapomorphies, symplesiomorphies, and homoiologies directly from the network showing all possible or the most likely trees.

This information can then be used to, eg., place fossils in a phylogenetic context, or reconstruct evolutionary trends in extinct groups of organisms. Reconstruction of evolutionary trends in extant organisms should always rely on morphological data analyzed in a molecular-phylogenetic framework.


A NEXUS-version of Sookias' test matrix (slightly annotated for Mesquite, simple version for PAUP*), tree- and distance matrix files have been added to my figshare collection of morphological matrices.