Monday, June 25, 2018

Horizontal and vertical language comparison

In the traditional handbooks on historical language comparison, one can often find the claim that there are two fundamentally different, but equally important, means of linguistic reconstruction. One is usually called "external reconstruction" (or alternatively the "comparative method"), and the other is called "internal reconstruction". If we think of sequence comparison in historical linguistics in the form of a table, in which concepts are arranged on the vertical axis, and different languages on the horizontal axis, we can look at the two different modes of language comparison (external vs. internal) as the horizontal and the vertical axes of the table. Horizontal language comparison refers to external reconstruction: scholars compare forms (not necessarily of the same meaning) across the horizontal axis, that is, across different languages. Vertical language comparison refers to internal reconstruction: scholars search inside one and the same language for structures that allow them to infer its older stages.

In past blog posts I have been talking a lot about horizontal / external language comparison, for which the notion of sound correspondences is especially crucial. But in the same way in which we use the evidence across languages to infer the past states of a given language family, we can make use of language-internal evidence to learn more about the history not only of a given language, but also of a group of languages.

Vertical language comparison

A classical example of vertical or internal language comparison is the investigation of paradigms, that is, the inflection systems of the verbs or nouns in a given language. This, of course, makes sense only if the respective languages have verbal or nominal morphology, ie. if we find differences in the verb forms for the first, second, or third person singular or plural, or for the case system. The principle would not work in Chinese, although we have different means to compare languages without inflection vertically, as I'll illustrate below.

As a simplified example of internal reconstruction, consider the verbal paradigm of the verb esse "to be" in Latin:
Person    Singular    Plural
first     sum         sumus
second    es          estis
third     est         sunt

If you try to memorize this pattern, you will quickly realize that it is not regular, and you will have difficulty identifying patterns that assist in memorizing the forms. A much more regular pattern would be the following:
Person    Singular    Plural
first     es-um       es-umus
second    es-Ø        es-tis
third     es-t        es-unt

This pattern would still require us to memorize six different endings, but we could safely remember that the beginning of all forms is the same, and that there are six different endings, accounting for person and number at the same time (which is anyway typical for inflecting languages).

An alternative pattern that would be easier to remember is the following one:
Person    Singular    Plural
first     es-um       s-umus
second    es-Ø        s-tis
third     es-t        s-unt

While it may seem that this pattern is slightly more complicated at first glance, it would still be more regular than the pattern we actually observe, and we would now have two different aspects expressing the meaning of the different forms: the alternation of the root es- vs. s- accounts for the singular-plural distinction, while the endings express again both number and person.
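The hypothetical pattern with root alternation can be written down as a small lookup structure. The following Python sketch is only an illustration of the table above: the names `ROOTS`, `ENDINGS`, `regular_form` and `surface` are made up for this example, and the comparison with the attested forms simply strips the segment boundary and the zero ending.

```python
# Hypothetical regularized paradigm of Latin "esse" (third table above).
# The root expresses number; the ending expresses person and number.
ROOTS = {"singular": "es", "plural": "s"}
ENDINGS = {
    ("first", "singular"): "um",
    ("second", "singular"): "Ø",
    ("third", "singular"): "t",
    ("first", "plural"): "umus",
    ("second", "plural"): "tis",
    ("third", "plural"): "unt",
}

# The forms actually attested in Classical Latin (first table above).
ATTESTED = {
    ("first", "singular"): "sum",
    ("second", "singular"): "es",
    ("third", "singular"): "est",
    ("first", "plural"): "sumus",
    ("second", "plural"): "estis",
    ("third", "plural"): "sunt",
}


def regular_form(person, number):
    """Build the segmented form from root and ending, e.g. 'es-t'."""
    return f"{ROOTS[number]}-{ENDINGS[(person, number)]}"


def surface(form):
    """Drop the morpheme boundary and the zero ending."""
    return form.replace("-", "").replace("Ø", "")


# Cells in which the regular pattern does not yield the attested form.
mismatches = {
    cell for cell in ENDINGS
    if surface(regular_form(*cell)) != ATTESTED[cell]
}
print(sorted(mismatches))
```

The cells collected in `mismatches` are the ones that the regular pattern alone does not account for, and which would need an additional explanation.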

If we look at older stages of Latin, we can, indeed, find evidence for the first person singular, which was written esom in ancient documents (see Meier-Brügger 2002 for details on the reconstruction of this paradigm in Indo-European). If we look at other languages, like Sanskrit and Ancient Greek, we can further see that our alternation between es- and s- in the root (thus our last example) also comes much closer to the supposed ancient state, even if we don't find complete evidence for this in Latin alone.

What we can see, however, is that the inspection of alternating forms of the same root can reveal ancient states of a language. The key assumption is that observed irregularities usually go back to formerly regular patterns.

Horizontal language comparison

The classical example for horizontal or external language comparison is the typical wordlist, in which words with similar meanings across different languages are arranged in tabular form. I have mentioned before that it was in great part Morris Swadesh (1909-1967) who popularized the simple tabular perspective that puts a concept and its various translations in the center of historical language comparison. Before the development of this concept-based approach to historical linguistics, scholars would pick examples based on their similarity in form, allowing for great differences in the semantics of the words being assigned to the same slot of cognate words; and this exclusively form-based approach to external language comparison is still the prevalent one in most branches of historical linguistics.

No matter what approach we employ in this context — be it the concept- or the form-based — as long as we compare forms across different languages, we carry out external language comparison, and our main concern is then the identification of regular sound correspondences across the languages in our sample, which enable us to propose ancestral sounds for the ancestral language.

Problems of vertical language comparison

As can be seen from my above example of the inflection of esse in Latin, it is not obvious how the task of internal language comparison could be formalized and automated. There are two main reasons for this. First, inflection paradigms vary greatly among the languages of the world, which makes it difficult to come up with a common way to investigate them.

Second, since we are usually looking for irregular cases that we try to explain as having evolved from former regularities, it is clear that our data will be extremely sparse. Often, it is only the paradigm of one word that we seek to explain, as we have seen for Latin esse, and patterns of irregularities across many verbs are rather rare (although we can also find examples for this). As a result, internal reconstruction is dealing with even fewer data than external reconstruction, where data are also not necessarily big.

Formalizing the language-internal analysis of word families

Despite the obvious problems of exploiting the language-internal perspective in historical language comparison, there are certain types of linguistic analysis that are amenable to a more formal treatment in this area. One example that we are currently testing is the inference and annotation of word families within a given language. It is well known that a large number of words in human languages are not unrelated atomic units, but have themselves been created from smaller parts. Linguists distinguish derivation and compounding as the major techniques here, by which new words are created from existing ones.

Derivation refers to those cases where a word is being modified by a form unit that could not form a word of its own, usually a suffix or a prefix. As an example, consider the suffix -er in English which can be attached to verbs in order to form a noun that usually describes the person that regularly carries out the action denoted by the original verb (eg. examine → examiner, teach → teacher, etc.). While the original verb form exists without the suffix in the English language, the form -er only occurs attached to verbs. In contrast to derivation, compounding refers to the process by which two word forms that can be used in isolation are merged to form a new expression (compare foot and ball with football).

Searching for suffixes and compounds in unannotated language data is a very difficult task. Although scholars have been working on automatic methods that split a given monolingual dictionary into its smallest meaning-bearing form units (morphemes), these methods usually only work on very large datasets (Creutz and Lagus 2005). Trained linguists, on the other hand, can easily detect patterns, even when working on smaller datasets of a few hundred words.

The reason why linguists are successful in analysing the morphology of languages, in contrast to machine-learning approaches, is that they make active use of their external knowledge about the potential semantics underlying the patterns, while current methods for automatic morpheme detection usually only consider the forms, and disregard the semantics. Semantics, however, are important to distinguish words that form a true family (in that they share cognate material) from words that are similar only due to chance.

It is clear that languages may have words that sound alike but convey different meanings. As an extreme example, consider French paix [pɛ] "peace" vs. pet [pɛ] "fart". Although both words are pronounced the same, we know that they are not cognate, going back to different ancestral forms, as is also reflected in the French writing system. But even if we lacked the evidence of the French orthography, we could easily justify that the words do not form a family, since (a) their meaning is quite different, and (b) their grammatical gender is different as well (la paix vs. le pet). An automatic method that disregards semantics and external evidence (like the orthography or the gender of nouns in our case) cannot distinguish words that are similar due to chance from words that are similar due to their history.

As a further example illustrating the importance of semantics, consider the data for Achang, a Burmish language, spoken in Myanmar (data from Huáng 1992), which is shown in the following graphic (derived from the EDICTOR tool and analyzed by Nathan W. Hill).

Word families in Achang, a Burmish language.

In this figure, we can see six words which all share tɕʰi⁵⁵ (high numbers represent tones) as their first part. As we can see from the detailed analysis of these compounds in Achang, which is given in the column "MORPHEMES" in the figure, our analysis claims that the form tɕʰi⁵⁵, which expresses the concepts "foot" or "leg" in isolation, recurs in the words for "hoof", "claw", "knee", and "thigh", but not in the word for "ant". While the semantic commonalities among the former are plausible, as they all denote body parts which are closely related to "feet" or "legs", we do not find any transparent motivation for why the speakers should have used a compound containing the word for "foot" to denote an ant. Although we cannot demonstrate this at this point, we are hesitant to add the Achang word for "ant" to the word family based on compounds containing the word for "foot".

Bipartite networks of word families

For the time being, we cannot automate this analysis, since we lack data for the testing and training of potential algorithms. We can, however, formalize it in a very straightforward way: with the help of a bipartite network (see Hill and List 2017). Bipartite networks are networks with two kinds of nodes, which are usually thought of as representing different types. While we can easily assign different types to all nodes in any network we are dealing with, bipartite networks only allow us to link nodes of different types. In our bipartite network of word families, the first type of node represents the forms of the words, while the second type represents the meanings attributed to the sub-parts of the words. In the figure above, the former can be found in the column "tokens", where the symbol "+" marks the boundaries, and the latter can be found in the column "MORPHEMES".

The following figure shows the bipartite network underlying the word family relations following from our analysis of words built with the morpheme "foot" in Achang.

Bipartite network of word families: nodes in red text represent the (reconstructed) meaning of the morphemes, and blue nodes the words in which those occur as parts.
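To make the data structure more concrete, here is a minimal Python sketch of such a bipartite network. It is not the EDICTOR implementation. Words are identified by their concept, and morphemes by their gloss. Only the gloss "foot" is taken from the analysis described above; the glosses for the second parts of the compounds and the two parts assumed for "ant" are placeholders invented for this example.

```python
from collections import defaultdict

# Word (by concept) -> glosses of its morphemes.
# Only "foot" reflects the analysis in the text; every "...-part"
# gloss is a placeholder, not real Achang data.
ANNOTATIONS = {
    "foot": ["foot"],
    "hoof": ["foot", "hoof-part"],
    "claw": ["foot", "claw-part"],
    "knee": ["foot", "knee-part"],
    "thigh": ["foot", "thigh-part"],
    "ant": ["ant-part-1", "ant-part-2"],
}


def build_bipartite(annotations):
    """Return edges that link word nodes to morpheme nodes only."""
    return {
        (("word", word), ("morpheme", gloss))
        for word, glosses in annotations.items()
        for gloss in glosses
    }


def word_families(edges):
    """Group words by shared morpheme; keep morphemes with 2+ words."""
    families = defaultdict(set)
    for (_, word), (_, gloss) in edges:
        families[gloss].add(word)
    return {g: ws for g, ws in families.items() if len(ws) > 1}


families = word_families(build_bipartite(ANNOTATIONS))
print(families)
```

Because every edge connects a word node to a morpheme node, a word family is simply the set of word nodes attached to the same morpheme node. In this sketch, the word for "ant" stays outside the "foot" family because its annotation does not contain that gloss, even though its form would suggest otherwise.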


The bipartite network above shows only a small part of the word family structure of one language, and the analysis and formalization of word families with the help of bipartite networks thus remains exemplary and anecdotal. I hope, however, that the example illustrates how important it is to keep in mind that language change is not only about sound shifts that can be analyzed with the help of language-external, horizontal comparison. Investigating the vertical (the language-internal) perspective of language evolution is not only fascinating, offering many so far unresolved methodological problems, it is at least as important as the horizontal perspective for a proper understanding of the dynamics underlying language change.


Creutz M. and Lagus K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology, 2005, 81.

Hill N. and List J.-M. (2017) Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1. 47–76.

Meier-Brügger M. (2002) Indogermanische Sprachwissenschaft. de Gruyter: Berlin.

Huáng Bùfán 黃布凡 (1992) Zàngmiǎn yǔzú yǔyán cíhuì [A Tibeto-Burman lexicon]. Zhōngyāng Mínzú Dàxué 中央民族大学 [Central Institute of Minorities]: Běijīng 北京.

Monday, June 18, 2018

To boldly go where no one has gone before – networks of moons

This is a joint post by Timothy Holt and Guido Grimm

One ‘a-phylogenetic’ application of phylogenetic methods is the classification of stellar (in the widest sense) objects, so-called "astrocladistics" (see Didier Fraix-Burnet’s dedicated blog). Traditionally, the objects would be characterized and their (dis)similarity translated into a plot (eg. using PCoA) or a tree (eg. a UPGMA tree). Such cluster analysis frameworks would then be the basis for the classification of the objects.

In ‘astrocladistics’, phylogenetic trees that fulfill the maximum parsimony or minimum evolution criteria are used instead. But why should we stop with trees (see the prior blog post Astrocladistics: a network analysis)? For this post, we have used the matrices of a recent astrocladistic paper by Holt et al. (2018) to highlight an as yet under-explored application of phylogenetic methods in classification: exploratory data analysis (EDA).

Why exploratory data analysis

As noted in the earlier post on astrocladistics, one problem is that one infers phylogenetic trees based on data sets that are not the product of an evolutionary process. Some objects may evolve from others (eg. a satellite may evolve from planetary ring matter), but this is not a dichotomous splitting process through time. And any non-dichotomous process can lead to tree-incompatible signals, which will then hamper tree inference in a biological context. Any tree using astral objects (galaxies, stars, planets, moons) as OTUs is per se a faux phylogeny (some examples for faux phylogenies are collected here and here).

Another problem is a data-inherent bias. The matrices are coded in a fashion that reflects an a priori hypothesis of derivation. For instance, by inferring that objects farther away are older and closer ones are younger, we can make hypotheses about maturation of galaxies, and hard-code this hypothesis into the data matrix. This will infer a tree that was coded into the matrix.

Guido’s starting argument is that when our main goal is classification and not inferring evolutionary relationships, the topology of the tree (or alternative trees) is the least of our concerns. What we want to know is to what degree our data converge on the same groupings and support coherent classes. This is exploratory data analysis, and Neighbor-nets are then a powerful tool to visualize any differentiation pattern (see some recent a-biological examples: U.S. gun legislation, cryptocurrencies, where to retire Worldwide and within the USA).

Instead of inferring trees, as in the original paper about two satellite systems (Jupiter and Saturn), here we use the matrices to infer Neighbor-nets, map character support (non-parametric bootstrap support) on the resultant networks, and discuss the prospects and perils of ‘astro-Neighbor-nets’ when it comes to classification of astronomical bodies.

Data properties and analysis set-up

In order to construct the matrices, three different types of characteristics were used: dynamical, physical and compositional. Dynamical characteristics are the positions of the various satellites, how far they are away from the planet (semi-major axis), their inclination to the plane and eccentricity of their orbit. Several of the satellites also orbit opposite to the planet's rotation (they are on a retrograde orbit), which is also coded. Physical characteristics are two properties of the satellites: their albedo, or how reflective they are, and their density. Any characteristics related to mass and size are specifically avoided, as this would hide any parent/daughter relationships resulting from breakups. The compositional characteristics are the most numerous ones in the analysis. These are binary characteristics indicating the presence/absence of chemical species, eg. water, iron, methane, etc.

Five of the characteristics (semi-major axis, inclination, eccentricity, albedo and density) are ordered and continuous. These pose a problem for standard cladistic analysis using parsimony, which needs discrete character states. Hence, these characteristics are binned using a Python program. Each character set is binned independently, and separately for the Jovian and Saturnian systems. The program iterates over the number of bins until a linear regression model between binned and unbinned sets achieves a coefficient of determination (r²) of > 0.99. All characteristics are binned in a linear fashion, with the majority increasing in progression. The exception to the linear increase is the density character set, with a reversed profile. All of the continuous, binned characteristic sets are (by definition) ordered characters.
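The binning step can be sketched in a few lines of Python. This is not the program used by Holt et al., which is not reproduced here. It is a minimal illustration that assumes equal-width bins and uses the squared Pearson correlation between bin index and raw value as the r² of a simple linear regression. The function names, the upper limit of 50 bins, and the example values are all made up for this sketch, and the reversed profile used for density is not handled.

```python
def bin_values(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def r_squared(x, y):
    """r² of a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def bin_until_fit(values, threshold=0.99, max_bins=50):
    """Increase the number of bins until r² exceeds the threshold."""
    for n_bins in range(2, max_bins + 1):
        states = bin_values(values, n_bins)
        if r_squared(states, values) > threshold:
            return n_bins, states
    raise ValueError("no binning reached the requested r²")


# Hypothetical continuous character (arbitrary units), not real data.
VALUES = [0.5, 1.1, 1.8, 2.4, 7.4, 11.5, 11.7, 12.6, 21.3, 23.4, 23.9]
n_bins, states = bin_until_fit(VALUES)
print(n_bins, states)
```

The resulting bin indices can then be treated as the states of an ordered character, with the number of states depending on how many bins were needed for the given data.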

Thus, the matrices comprised two sorts of characters with strongly different properties, when it comes to explicit inferences: binary characters, and highly ordered characters (the binned ones) with up to 11 states. For the graphs used here, we didn’t apply any weighting, which means that in the most extreme case a complete difference in a binned character counts 11 times more than a difference in any of the binary characters. This bias is compensated to some degree by the number of binary characters (33, with 31 variable) vs. binned characters (5), when restricting the analysis to well-known planetary objects.

The matrices are comprehensive, and include little-known objects with a lot of missing data (>80% of the characters cannot be scored), which should nonetheless be included. A matrix-based classification makes most sense when one uses character sets that are defined for most or all of the objects. Thus, to see how the little-known objects relate to the well-known, we eliminated all poorly covered characters, leaving us with two binary and five binned ones. To not lose the information from the binary characters when calculating the inter-object distances, we gave them a weight of 7–8. This ensures that a 0↔1 difference in a binary character more or less equals the maximum possible difference in a binned character (on average 8 bins for the Jupiter data set, and 7 for the Saturn data set).
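The effect of the weighting can be illustrated with a small Python sketch. The distances for this post were calculated with PAUP*, so the function below is not the code that was used. It is a simplified stand-in that assumes ordered (absolute) differences for binned characters, a fixed weight for binary ones, and averaging over those characters that are scored in both objects. The name `weighted_distance` and its parameters are hypothetical.

```python
def weighted_distance(a, b, kinds, binary_weight=8):
    """Mean weighted character difference between two objects.

    a, b  -- lists of character states; None marks missing data
    kinds -- "binary" or "binned" for each character
    """
    total, compared = 0.0, 0
    for state_a, state_b, kind in zip(a, b, kinds):
        if state_a is None or state_b is None:
            continue  # skip characters not scored in both objects
        diff = abs(state_a - state_b)
        if kind == "binary":
            diff *= binary_weight  # up-weight 0/1 differences
        total += diff
        compared += 1
    return total / compared if compared else None


KINDS = ["binary", "binned"]
# A single 0/1 difference ...
d_binary = weighted_distance([0, 0], [1, 0], KINDS)
# ... weighs as much as a difference of eight bins.
d_binned = weighted_distance([0, 0], [0, 8], KINDS)
print(d_binary, d_binned)
```

With a weight of 8, the two example pairs end up at the same distance, which is the balance between character types that the weighting is meant to achieve.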

Fig. 1 The orbits of the satellites in the Jovian satellite system. Colours represent traditionally recognized groups.

The Jovian moons (and ring)

The Jovian system (Fig. 1) is dominated by the famous Galilean satellites (moons): Io, Europa, Ganymede, and Callisto. In between these moons and Jupiter there is a faint ring system, and four small satellites, the Amalthea family. Outside the Galilean system, there is a system of 67 small irregular satellites that have much wilder orbits, some going in the opposite direction to the other satellites. These are thought to be captured asteroids.

As seen in Fig. 2, the data analysis produces a somewhat tree-like network. It supports the Galilean Group (the large moons), the association of Amalthea with Jupiter’s Main Ring, and the Himalia Family, but it rejects the traditional division of the remaining well-known moons (captured asteroids) into three families: two of the three Pasiphae Family satellites are very similar to Carme. Although Ananke is somewhat different, it is substantially more similar to Carme and the Pasiphae Family satellites than the inter-group differentiation found elsewhere. One commonly shared idea about classification is that one should erect classes that have similar quality and are defined by high intra-class coherence and inter-class differentiation.
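The criterion of intra-class coherence and inter-class differentiation can also be checked numerically on a distance matrix. The following Python sketch compares the mean distance within proposed classes with the mean distance between them. The object labels, the class names and all distances are invented for illustration and do not come from the moon matrices.

```python
from itertools import combinations


def mean(numbers):
    return sum(numbers) / len(numbers) if numbers else None


def coherence(dist, groups):
    """Return (mean intra-class, mean inter-class) distance.

    dist   -- dict mapping frozenset({a, b}) to a pairwise distance
    groups -- dict mapping a class name to its member objects
    """
    label = {obj: name for name, members in groups.items() for obj in members}
    intra, inter = [], []
    for a, b in combinations(sorted(label), 2):
        d = dist[frozenset((a, b))]
        (intra if label[a] == label[b] else inter).append(d)
    return mean(intra), mean(inter)


# Invented example: two candidate classes with two objects each.
GROUPS = {"class A": ["A1", "A2"], "class B": ["B1", "B2"]}
DIST = {
    frozenset(("A1", "A2")): 0.1,
    frozenset(("B1", "B2")): 0.2,
    frozenset(("A1", "B1")): 0.8,
    frozenset(("A1", "B2")): 0.9,
    frozenset(("A2", "B1")): 0.7,
    frozenset(("A2", "B2")): 0.8,
}
intra, inter = coherence(DIST, GROUPS)
print(intra, inter)
```

A proposed classification in which the intra-class mean approaches the inter-class mean would be a candidate for the kind of re-shuffling discussed for the retrograde irregular satellites.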

Fig. 2 Neighbor-net of Jovian moons, well-covered by data (<1% or no missing data). Colouration addresses traditionally recognised groups, same colours used as in Fig. 1.

By reducing the character set to those characters that are defined for most or all moons, we naturally take away some of the potential differentiation. Nonetheless, the resulting graph (Fig. 3) provides a structure that may well be used to place less-known objects, identify their closest best-known counterpart(s), erect a classification, and discuss current classification schemes.

Fig. 3 Neighbor-net of all Jovian moons based on a distance-matrix reflecting the known (scored) similarity and differences. Colouration as above.

We can see that we lose differentiation within the well-known (and well-supported; Fig. 2) groups, especially regarding the distinctness among members of the Galilean Group and their distinctness from the Amalthea Family. However, the basic structure of the graph remains the same. Based on the scored data, the Ananke, Pasiphae and Carme Families are not supported. A sub-division may be possible, but would require some re-shuffling of the moons. For instance, a group including Ananke-like satellites would not include Euporie, Iocaste and Hermippe, but may include Callirhoe. A Pasiphae s.str. group would make sense when excluding Aoede, Helike, Sinope, Autonoe, and Eurydome, with the latter three being (nearly) identical to Carme or members of the Carme Family.

Fig. 4 The orbits of the satellites in the Saturnian satellite system. Colours represent traditionally recognised groups.

The Saturnian moons and rings

The Saturn system (Fig. 4) is similar to the Jupiter one.

The ring structure is one of Saturn’s most distinctive features, with structures seen even with a modest telescope. Embedded in the rings are small moonlets. The co-orbitals Janus and Epimetheus, just outside the main rings, swap orbits every four years. There are eight mid-sized satellites, including Titan, a small world in itself with a methane-based weather system. Of the other icy satellites, Mimas, Enceladus, Tethys, Dione and Rhea are embedded in the diffuse E-ring. The source of this ring is cryovolcanic plumes on Enceladus, a possible location for life beyond Earth.

Unique to the Saturn system are Trojan satellites, in the same orbit as their parent satellites, one 60° ahead and one 60° behind. Tethys has Telesto and Calypso as Trojan satellites, while Helene and Polydeuces are Trojan satellites of Dione. Between the orbits of Mimas and Enceladus, there are the Alkyonides (Methone, Anthe and Pallene), recently discovered by the Cassini spacecraft. Each of the Alkyonides has its own faint ring arc, composed of material similar to that of the satellite.

As with Jupiter, there is also a system of 38 small irregular satellites outside the inner system. This system is dominated by Phoebe — at ~240 km across it is six times the size of the next largest irregular. It is also the only irregular satellite to have its photo taken, with the Cassini spacecraft flying within 2,000 km of the surface, taking high-res images as it went. Using these new data, a picture is emerging of Phoebe as a captured outer solar system object.

For Jupiter’s smaller sibling (Saturn), a less tree-like network is inferred (Fig. 5). Since the system is not the product of an evolutionary process in a biological sense (ie. a phylogeny), but instead includes patterns related to parent/descendant relationships (rings-moons, breakups), we should not necessarily expect tree-like graphs.

Fig. 5 Neighbor-net of well-known Saturnian planetary objects (<1% or no missing data). Colouration addresses traditionally recognised groups, using the same colours as in Fig. 4.

Nonetheless, the graph could be the basis of an objective classification. The elements of the ‘Main Ring Group’ have high intra-group coherence, but also include Calypso and Telesto of the ‘E Ring Group’. On the other hand, the ‘Outer Satellite Group’ is very heterogeneous. One straightforward option would be to fuse this group with the E Ring Group; another is to exchange Enceladus for Hyperion. The Norse Group’s (= Phoebe Family) representative Phoebe is clearly distinct from any other object and would need a class of its own.

As in the case of Jupiter, we can add and try to classify the remaining little-known objects (Fig. 6), to some degree.

Fig. 6 Neighbor-net of all Saturnian moons and rings based on a distance-matrix reflecting the known (scored) similarity and differences. Colouration as above.

In contrast to Jupiter, the reduced character set (just four characters, one binary, three binned-ordered) loses the differentiation between objects of the Main Ring, E Ring and Outer Satellite groups included in Fig. 5. They are virtually identical for these characters. The two groups not covered in Fig. 5, the Siarnaq (named after Inuit gods) and Albiorix Family (Gallic gods), are close to each other. The Albiorix Family forms a distinct subset of the Siarnaq Family. The moons of the coherent Phoebe Family (named after Norse mythological figures) are all close to each other, and this group includes various newly discovered satellites. Also interesting is the position of the Phoebe Ring compared to its name-giving moon and the remainder of the Phoebe Family.

Comparison with the tree-based analyses

In comparison with the astrocladistical work of Holt et al. (2018), the network-based analysis captures most of the meta-structure of the satellite classifications.

Compared with the Jovian trees, the network-based analysis shows the distinction between the inner Galilean group and the outer ‘irregular’ satellites and separates the Himalia family. The differences are in how each of the analyses handles the retrograde irregulars, the Pasiphae, Carme and Ananke families. It should be noted that these bodies are woefully under-studied, and have very little information available, making any inferences difficult.

In Holt et al.’s trees, the Ananke, Carme and Sinope subfamilies are unresolved, but are supported using Multivariate Hierarchical Cluster Analysis (example provided in Fig. 7). This method uses clustering in parameter space to justify collisional families. Though the particular members are different, the network-based analysis still identifies clusters around the largest irregular satellites, Ananke, Sinope, Carme and Pasiphae. This further supports the theory that these families are remnants of collision breakups. As usual with science, there is far more work to be done here.

Fig. 7 Clustering of several Jovian irregular satellites in three-dimensional parameter space using semi-major axis (a), eccentricity (e) and inclination (i).

In the Saturnian system, the outer satellites also prove to be problematic. Holt et al. split the Aegir and Ymir subfamilies from within the Phoebe family. These subfamilies are distinct from Phoebe and its ring, following a narrative of a different origin for Phoebe and the rest of the irregular satellites. The capture of Phoebe would have had major disruptive effects on the satellites. The dynamical characteristics play a large role here, as they are the only information available for some of the satellites, so that little sub-structuring can be seen. As with the Jovian irregular satellites, more information is needed.

The inner system of Saturn also warrants mention, particularly the case of Telesto and Calypso, the Trojans of Tethys. In the network analysis, they are associated with main ring objects, rather than with Tethys itself. There is a possibility that these two Trojans are captured main-ring objects, and this would support that hypothesis. Dione, and its Trojan Helene, are both closely associated with one another in both analyses, indicating a parent/daughter relationship (keep in mind that phylogenetic trees cannot discern between parent/daughter and sister relationships).

Phoebe as seen by the Cassini spacecraft, NASA/JPL/Space Science Institute, PIA06064 (NASA provides more than 1000 media files covering the Cassini-Huygens mission)

Boldly gone – networks as tools in classification
The idea discussed here appears worth exploring — using distance-based or other (meta-)phylogenetic networks for the classification of objects not necessarily following any phylogeny. It has some obvious advantages over astrocladistics (especially when using maximum parsimony as the tree-optimizing criterion) or traditional classification methods (PCoA, simple clustering approaches):
  • Distance-matrices are easy and quick to generate based on any data; and they can also be used for more traditional classification means such as PCoA.
  • Neighbor-nets are very quick to calculate, and can capture more aspects of the actual differentiation than can cluster analysis (e.g. UPGMA trees, PCoA) or astrocladistic methods; in some sense they represent a fusion of the best aspects of both approaches.
  • In contrast to a tree, where tree-incompatible signals can massively distort branch-length patterns, or rogue objects interfere with establishing a finely resolved topology, a Neighbor-net can be straightforwardly interpreted regarding group coherence.
Perhaps the main disadvantage of this approach is the need for a distance matrix with meaningful pairwise distances. If missing data distort the general (dis)similarity patterns, then the Neighbor-net may have branching (edging) artifacts.

However, using the Neighbor-net as a basis for classification also allows us to quickly test for character sampling bias, eg. by re-calculating the distance input matrix using weighting schemes, or different distance calculations (eg. instead of binning the continuous characters, they could be used as-is), or reduced character or taxon sets. Also, when it comes to classifying non-living objects, it’s always good to keep it as simple as possible, while being able to explore the signal in the data matrix.

More results, the data matrices used, and the template analysis files can be found on figshare. The archive also includes the (simple) NEXUS-formatted files with the PAUP* command blocks we used for the analyses. The one for Jupiter is fully annotated with comments on the code lines for PAUP* to assist inexperienced users and to facilitate export and subsequent import into SplitsTree.

The archive also includes the code for and results of a full bootstrap-support analysis (currently two optimality criteria: Least-squares and Maximum parsimony, Maximum likelihood to be added). Even when preferring the astrocladistic approach, networks are handy to summarize the bootstrap pseudoreplicate sample.


Holt TR, Brown AJ, Nesvorný D, Horner J, Carter B (2018) Cladistical analysis of the Jovian and Saturnian satellite systems. Astrophysical Journal 859(2): 97, 20 pp; arXiv: 1706.0142

Monday, June 11, 2018

Want to place a fossil in a minute? Just use Neighbour-nets

Palaeontological phylogenetic researchers typically put a lot of effort into inferring trees. It has been argued (and occasionally pointed out during manuscript reviews) that only by placing a fossil in an explicitly phylogenetic framework can we assess what it represents. I sympathize with this notion, but in most cases we don't need any elaborate analysis to do it — a quick network-based analysis will do the trick.

In this post, I'll demonstrate my point using the most recent matrix presented by two eminent plant morphology veterans. In a fresh-off-the-press paper, James Doyle & Peter Endress provide a "Phylogenetic analysis of Cretaceous fossils related to Chloranthaceae and their evolutionary implications" (Botanical Review) using their morphology matrix focussing on early diverging angiosperm lineages, which was originally used for a paper by Saarela et al. (Nature 446: 312–315, 2007) and has been continually updated.

Like all morphological matrices that aim to cover as much as possible, Doyle & Endress' matrix does not provide any strong tree-like signal, and hence it is of little use for inferring phylogenetic trees. Doyle & Endress deal with this issue by using a (more or less molecular-based) backbone tree enforcing several clades for the modern taxa, and then trying to find the most parsimonious placement of the fossil(s). This approach works to some degree but has two problems: one theoretical and one practical.

First, the backbone tree, or any molecular-informed topology, is usually some steps longer than the most-parsimonious trees that could be inferred on their matrix. In other words, morphological evolution in plants doesn't fully fulfill Ockham's Razor. Why should this also be the case for the fossils?

Second, moving a fossil through the branches to find the best placement takes some time, and will lead to many equally parsimonious solutions. Not infrequently, the fossil can be placed on quite distant branches, producing trees that are only a few steps longer.

A graph I made depicting the 'parsimoniousness' of placing a fossil, Monetianthus (a Cretaceous water lily), within a given topology using an earlier version of Doyle & Endress' matrix (fig. 7 in Friis et al., Int. J. Plant Sci., 2009). The number of additional steps was estimated by moving the target taxon, the fossil, to the accordingly coloured branch of the tree. (PS: To show that the fossil is a water lily, a member of the Nymphaeaceae, we used a Neighbour-net.)

For Doyle & Endress' papers this is no big problem, because they just show the best placements as well as those a few steps longer. For example:
Placing the Chloranthistemon species on the stem lineage of Sarcandra and Chloranthus is four steps less parsimonious than placing them on the stem of Chloranthus. For perspective, only two steps are added if the Asteropollis plant is moved to the stem of the whole family. If a four-step parsimony debt is accepted in moving Chloranthistemon to a morphologically less favored position, one may ask why the Asteropollis plant is considered a reliable minimum age constraint for the family.
But given that morphological evolution is not necessarily parsimonious, and that even the modern taxa can show variable root-to-tip path lengths, I always remain skeptical of this approach.

A Bayesian-inferred angiosperm tree based on a total-evidence matrix, built from a curated version of Soltis et al.'s 2011 matrix and including the 2010 version of Doyle & Endress' matrix as morphological partition (provided as open data @ figshare). Note that many fossils (Cretaceous, ~100 Ma) have longer terminal branches than their surviving relatives (which made Bayesian total-evidence dating impossible).

Aside from this, the matrix signal is pretty straightforward when it comes to deciding on the potential position of the fossil in the angiosperm part of the Tree of Life. And the analysis takes (literally) moments.

You just take the matrix, calculate mean pairwise morphological distances (done in a blink), export the distance matrix as a NEXUS-formatted file, and load it into SplitsTree, which will give you a Neighbour-net (in another blink).
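For readers who would like to see the distance step spelled out, here is a minimal Python sketch under some loud assumptions. I take "mean pairwise morphological distance" to be the proportion of differing characters among those scored in both taxa, I treat `?` and `-` as missing, and I ignore polymorphic scores. The taxon names and character strings are invented toy data, not taken from Doyle & Endress' matrix, and the exact FORMAT tokens your SplitsTree version accepts in a DISTANCES block should be checked against its manual.

```python
from itertools import combinations

MISSING = {"?", "-"}  # assumed codes for missing / inapplicable scores

def mean_distance(a, b):
    """Proportion of differing characters among those scored in both taxa."""
    shared = [(x, y) for x, y in zip(a, b)
              if x not in MISSING and y not in MISSING]
    if not shared:
        return float("nan")  # no comparable character at all
    return sum(x != y for x, y in shared) / len(shared)

def distance_matrix(matrix):
    """Full symmetric distance table for a {taxon: character string} dict."""
    taxa = list(matrix)
    dist = {t: {u: 0.0 for u in taxa} for t in taxa}
    for t, u in combinations(taxa, 2):
        dist[t][u] = dist[u][t] = mean_distance(matrix[t], matrix[u])
    return taxa, dist

def to_nexus(taxa, dist):
    """Write the table as a NEXUS file with TAXA and DISTANCES blocks."""
    n = len(taxa)
    lines = ["#NEXUS", "",
             "BEGIN TAXA;", f"  DIMENSIONS NTAX={n};",
             "  TAXLABELS " + " ".join(taxa) + ";", "END;", "",
             "BEGIN DISTANCES;", f"  DIMENSIONS NTAX={n};",
             "  FORMAT LABELS DIAGONAL TRIANGLE=BOTH;", "  MATRIX"]
    for t in taxa:
        row = " ".join(f"{dist[t][u]:.4f}" for u in taxa)
        lines.append(f"    {t} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)

# Hypothetical toy matrix: two modern taxa and one poorly preserved fossil
toy = {"ModernA": "0101?", "ModernB": "0111-", "FossilX": "1?110"}
taxa, dist = distance_matrix(toy)
nexus_text = to_nexus(taxa, dist)
print(nexus_text)
```

Taxon pairs without any shared scored character come out as `nan` here; such pairs would have to be dealt with (e.g. by dropping one of the taxa) before the file is of any use for a Neighbour-net.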

A Neighbour-net based on Doyle & Endress' 2018 matrix including only the modern-day taxa.

Most members of the well-established clades, the main angiosperm lineages, cluster in the Neighbour-net (bracketed names point to somewhat scattered clades). If the signal from a fossil is trivial, the fossil will be nested within the respective cluster. A signal is trivial when a fossil has a character suite that makes it much more similar to one of the clusters than to any other, which usually means that it is part of the same evolutionary lineage. Convergences may be common, and characters homoplasious, but evolving the exact same suite of characters while not sharing common ancestry is quite unlikely.

The Neighbour-net including the matrix' fossils. Note that the relative position of the Eudicots has changed and Circaeaster and Euptelea are placed closer to the other Ranunculales, although the pairwise distances between all modern taxa have not changed. The re-arrangement is solely a fossil-inclusion effect. By adding fossils attracted to Ceratophyllum, this enigmatic and isolated genus is drawn away from the most basal eudicots.

Two of the fossils appear to have unique character suites, placing them intermediate between two phylogenetically isolated plants, the unique Amborella (still considered the earliest branching modern-day angiosperm; one species on New Caledonia) and Ceratophyllum, an equally enigmatic water plant. The remainder are clearly members of the Chloranthales.

Having identified the phylogenetic neighbourhood of the fossils, we can then focus on this neighbourhood (in SplitsTree: select the OTUs it comprises; then go to menu Data > Keep only selected taxa).

The Neighbour-net after all non-neighbourhood OTUs have been removed.

From this graph we can directly conclude:
  • The Asteropollis plant is a close relative, likely a sister or early representative, of Hedyosmum.
  • Couperites represents an early and substantially diverged lineage within the Chloranthales; its closest living relative appears to be Ascarina, next to Hedyosmum the most derived living Chloranthales (here, one would need to see if there is any shared character suite).
  • Zlatkocarpus is an ancestral form of the Chloranthales core clade, comprising Ascarina and the sister taxa Chloranthus and Sarcandra (one would need to check the possibility that this could be a missing data artefact).
  • Canrightia and Canrightiopsis are sister lineages or precursors of Chloranthus and Sarcandra (see the open access tree-and-network-based paper by Friis et al. 2015, Grana 54: 184–212).
  • The Pennipollis plant is an ancient isolated Chloranthales lineage, with no living relative.
  • Appomattoxia and Pseudoasterophyllites may be (very) distant relatives of Ceratophyllum, the latter an isolated genus that has long been and still is a problem for molecular phylogenies. Alternatively, they may represent early and extinct angiosperm lineages (or the same lineage) with no modern counterparts.
To say anything beyond this based on the current data set quickly leaves the grounds of objectivity, and requires a priori assumptions about the importance of certain morphological traits being shared or not (i.e. not expressed, not just missing due to poor preservation).

[For those interested in a formal discussion of these results, see Doyle & Endress, 2018, pp. 7–25.]

To refine the analysis, we can just reduce the character set to the characters scored for the fossils.

A Neighbour-net based on distances computed using a character subset generated by excluding all invariable characters (in PAUP*: Exclude constant) for the taxa included in the network (66 characters, including eight not defined for any fossil taxon). Grey depicts a (molecular-data backed) tree hypothesis that could explain the observed differentiation pattern.
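The character filtering can be sketched outside PAUP* as well. The following Python snippet is only an illustration under my own assumptions: `?` and `-` mark missing scores, the function names `variable_columns`, `scored_columns` and `subset` are made up for this example, and the three-taxon matrix is invented toy data. It finds the characters that vary among the included taxa (the equivalent of excluding constant characters) and, as an optional stricter filter, those that are also scored in at least one fossil.

```python
MISSING = {"?", "-"}  # assumed codes for missing / inapplicable scores

def variable_columns(matrix):
    """Indices of characters showing more than one scored state."""
    n_chars = len(next(iter(matrix.values())))
    keep = []
    for i in range(n_chars):
        states = {row[i] for row in matrix.values()} - MISSING
        if len(states) > 1:
            keep.append(i)
    return keep

def scored_columns(matrix, taxa):
    """Indices of characters scored in at least one of the given taxa."""
    n_chars = len(next(iter(matrix.values())))
    return [i for i in range(n_chars)
            if any(matrix[t][i] not in MISSING for t in taxa)]

def subset(matrix, columns):
    """Reduce every taxon's character string to the chosen columns."""
    return {t: "".join(row[i] for i in columns) for t, row in matrix.items()}

# Hypothetical toy matrix: two modern taxa and one fossil
toy = {"ModernA": "0101", "ModernB": "0110", "FossilX": "0?1?"}

variable = variable_columns(toy)                  # constant characters dropped
in_fossil = set(scored_columns(toy, ["FossilX"]))
strict = [i for i in variable if i in in_fossil]  # variable AND scored in a fossil
reduced = subset(toy, strict)
print(variable, strict, reduced)
```

The reduced matrix can then go through the same distance calculation as before. Note that the stricter filter throws out characters that still separate the modern taxa, so the two networks need not look the same.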

To take the next step, morphological matrices alone will have no practical use, because they will not allow us to identify fast- vs. slow-evolving traits and lineages. A lineage that goes through bottlenecks or colonizes new niches will be genetically, and morphologically, more distinct than one that remained in calm waters. One could map the preserved traits found in each fossil onto a molecular tree, and include the information about actual branch lengths in that tree to put forward hypotheses about the ancestral state (e.g. Mendes et al., Grana 53: 283–301 [open access] for Lardizabalaceae), and possibly even time the divergences. This would enable one to compare the situation in the fossils with the top-down hypotheses about morphological evolution for the very same time period.

But that would be mostly tree-based analyses, and thus nothing for a Genealogical World of Phylogenetic Networks post.

Data — In case you are interested in the primary data matrix, a ready-to-use NEXUS version and the raw Split-NEXUS files have been uploaded to figshare.

Monday, June 4, 2018

Tattoo Monday XV

In the previous post of this series I considered tattoos among modern women. To balance this, here are some circular phylogenetic trees of various sizes on the torsos of men.