The Genealogical World of Phylogenetic Networks: November 2018

Monday, November 26, 2018

How languages lose body parts: once more about structural data in historical linguistics

This is a joint post by Guido Grimm and Johann-Mattis List.

Mattis’ last two blog posts dealt with problems of what linguists call "structural data". Here we discuss what this means for the inference of relationships between languages.

A closer look at structural data: the questionnaire issue

As pointed out before, what is called structural data in comparative linguistics is a very diverse mix of data solely unified by the idea of having some kind of questionnaire that a linguist may use when going into the field and trying to describe a certain language. These questionnaires are a bit different from the traditional concept lists usually used for the purpose of historical language comparison (see the collection of different lists in the Concepticon project by List et al. 2016). The main difference is that they are based on an imaginary question that a field worker asks an informant (which could as well be a written grammar of the language in question). Since questions can be asked in many different ways, while concepts in historical language comparison are usually restricted to the so-called "basic vocabulary", the diversity of structural datasets is much greater than the diversity we encounter when comparing questionnaires based on concept lists.

When analyzing these data, we deal with characters of very different nature, and likely different evolutionary pathways or histories. A biological analogy would probably be (true) total evidence data sets that combine genetic data from: genes/genomes with different inheritance pathways (paternally, maternally, biparentally; basic information level), morphological-anatomical data (visible form, phenotypic), palaeontological data (historical evidence), ontogenetic (life-history stages, developmental features), and biochemical data (expression level). The only difference is probably that the linguistic characters’ histories may be more complex. [Side-remark: ‘total evidence’ datasets found in the biological literature are typically just a combination of genetic and morphological data, allowing for the inclusion of extinct/fossil taxa.]

To give a specific example, let's have a look at the Chinese dataset by Szeto et al. (2018), mentioned in Mattis' blogpost from September. This dataset is now accessible as a GitHub repository (https://github.com/cldf-datasets/szetosinitic). Mattis added some information regarding the different features of the questionnaire. We list these features in slightly abbreviated form in the table below, adding rough categorizations by Mattis in the Comment column.

ID	Description	Comment
p-1	5 or more tone categories	phonological / diachronic
p-2	Retroflex fricative initials	phonological / diachronic
p-3	Bilabial nasal coda	phonological / diachronic
p-4	Stop codas	phonological / diachronic
p-5	Monosyllabic word for 'snake'	lexical
p-6	Differentiation between 'hand' and 'arm'	lexical / semantic
p-7	Differentiation between 'defecate' and 'urinate'	lexical / semantic
p-8	Differentiation between 'eat' and 'drink'	lexical / semantic
p-9	Semantically void suffix in 'table'	lexical
p-10	Different classifiers for humans and pigs	lexical / semantic
p-11	[CLF N] constructions in subject position with definite reference	syntactic
p-12	Reduplicated monosyllabic nouns	morphological
p-13	Post-verbal modal auxiliary developed from 'get/acquire'	syntactic / diachronic
p-14	Modified-modifier order in animal gender marking	morphological / syntactic
p-15	Post-verbal adverb meaning 'first'	lexical / syntactic
p-16	[V DO IO] order in double object dative constructions	syntactic
p-17	'Give' as a disposal marker	syntactic / diachronic
p-18	'Give' as a passive marker	syntactic / diachronic
p-19	'Go' as a post-VP associated motion marker	syntactic / diachronic
p-20	Marker-Standard-Adjective order in comparatives	syntactic
p-21	case system	morphological / syntactic

Mattis has tried to characterize the features, i.e. the characters of the matrix, by generalizing linguistic categories: "phonological", pointing roughly to questions about pronunciation (the biological equivalent would be phenotypic traits in morphology or anatomy); "lexical", pointing to the words in the lexicon (this would be the DNA of a language); "morphological", pointing to the ways in which words are constructed; and "syntactic", pointing to the ways in which words are combined to form sentences. In combination, “morphological” and “syntactic” are equivalent to ‘meta-level’ biological traits, such as development-related features, ontogenetic evidence, and biochemical composition: the ways in which the genetic code is expressed or used in a living organism in adaptation to the environment.

Mattis also flagged some characters as "diachronic", to mark whether the respective feature was selected by the authors due to their independent knowledge about the history of the Chinese dialects. This is something rarely possible in biology, but imagine that we could go back in time to literally observe the evolution of a lineage over a given time-period, and code this observed evolution as traits. Note that this is not entirely science-fiction — there are two examples where we can observe directly pathways of biological evolution: mutation patterns in viruses, and horizontal modification of marine morphs in high-resolution sediment cores.

While one can discuss to what degree a certain feature should belong to this category, it is rather obvious that all phonological features are diachronic, because they name distinctions that reflect well-known processes of sound change, which happened in a couple of Chinese dialects and have been proposed in the past by dialectologists in order to classify the Chinese dialects historically.

For example, consider feature p-3 of the questionnaire: Does a given dialect have a syllable that ends in [-m]? From the history of the Chinese dialects we know that the [-m] was present in Middle Chinese, but later merged with [-n] and [-ŋ] in many varieties. Given that we know that this happened, and that we know that people have used this to mark a split, especially between the "innovative" dialects in the North and the "conservative" ones in the South, it is clear that this feature bears explicit historical information. The same holds for all phonological features that we find in the data: p-1, the number of different tones in the dialects, again roughly reflects the differences between languages in the North and in the South (the North having lost many tones); p-2 reflects the retention or specific development of retroflex sounds (similar to sh in English as opposed to s) mostly in the North; and p-4 reflects whether a variety has syllables that can end in [-p, -t, -k], again a feature characteristic of the more "conservative" varieties in the South of China.

Figure 1: Overlap of features in Szeto et al.'s (2018) structural feature collection of Chinese dialects

Four lexical features have further been flagged as "semantic"; we query here existing or missing distinctions of concepts. People who learned, for example, Russian or certain German dialects know that it is rather common to have a single word for what other languages call "arm" and "hand" (see the respective entry in the CLICS database) or "foot" and "leg".

This diverse feature collection is coded as binary characters, reflected by presence/absence, or a yes/no answer to the question in the questionnaire. The choice of features is very selective. A biological analogy would be a matrix collecting incompatible splits of paternal (molecular) genealogies, along with a few prominent phenotypical traits (reflecting major evolutionary steps), and some traits that we expect to be primarily triggered not by genetics (inheritance) but by expression or adaptation to the environment. Biologists would not phylogenetically analyze such diverse and complex, potentially selection-biased data (although it could be very interesting), but linguists do.

In this context, it is remarkable, but also typical for this kind of data, that the 21-character feature collection by Szeto et al. (2018) has no feature in common with the collection by Norman (2003), a 15-character-matrix, which we also converted to our Cross-Linguistic Data Formats (see Forkel et al. 2018) in order to increase the data comparability.

Figure 2: A Neighbor-net splits graph of the structural data by Szeto et al. (2018).

The typification, coded as binary matrix to infer the Neighbor-net splits graph in Figure 2, demonstrates some basic characteristics of such 2-dimensional graphs. Note four of the 'characters' (typification categories) correlate with an edge(-bundle) in the network, separating the 'taxa' (the queried features). All "semantic" taxa are also "lexical", but "lexical" is more comprehensive, hence, "semantic" is placed as 'descendant' of "lexical" (Neighbor-nets can visualize ancestor-descendant relationships to some degree). "Morphological" taxa are either just "morphological" or also "syntactic", hence the pronounced box.

For "diachronic" and "syntactic", we have no corresponding edge(-bundle), because one taxon is also "lexical", but the others are "diachronic" and "syntactic"; this is a conflict that cannot be resolved with two dimensions. To visualize all the resultant 'taxon' splits, also called taxon bipartitions, we would need a third dimension. Lacking a third dimension, the Neighbor-net prioritizes keeping most "syntactic" taxa together, because the "diachronic-syntactic" taxa are closer to "syntactic" (max. 1 'character' difference) than to "diachronic-phonological" (2 'character' differences). The "syntactic-lexical" taxon has to be placed apart because it is equally close to "lexical" and "syntactic" 'taxa', but differs much from "morphological-syntactic" or "diachronic-syntactic", the closest two relatives of "syntactic"-only 'taxa'. It is resolved closer to the centre of the graph, because it is more closely related to the other "syntactic" taxa than to the rest of the "lexical" taxa. This is also the reason why the "syntactic"-only taxa have to be placed farther out: "diachronic-phonological" and "syntactic-lexical" are closer to the other endpoints, and the distance of "syntactic"-only to "diachronic-phonological", "lexical" and "morphological" should be as large as possible.
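The 'character' differences quoted above can be recounted directly from the Comment column of the table. The following Python snippet is only a minimal sketch of that counting step; the helper names `as_vector` and `hamming` are ours, and the Neighbor-net algorithm itself is not reproduced here.

```python
# Categories used in the Comment column of the feature table.
CATEGORIES = ["phonological", "lexical", "semantic",
              "morphological", "syntactic", "diachronic"]

def as_vector(labels):
    """Encode a set of category labels as a binary presence/absence vector."""
    return [1 if c in labels else 0 for c in CATEGORIES]

def hamming(a, b):
    """Number of categories in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Typification of some features, copied from the table above.
diachronic_syntactic = as_vector({"syntactic", "diachronic"})        # e.g. p-17
diachronic_phonological = as_vector({"phonological", "diachronic"})  # e.g. p-3
syntactic_only = as_vector({"syntactic"})                            # e.g. p-16
syntactic_lexical = as_vector({"lexical", "syntactic"})              # p-15
lexical_only = as_vector({"lexical"})                                # e.g. p-5

print(hamming(diachronic_syntactic, syntactic_only))           # 1
print(hamming(diachronic_syntactic, diachronic_phonological))  # 2
print(hamming(syntactic_lexical, lexical_only))                # 1
print(hamming(syntactic_lexical, syntactic_only))              # 1
```

The last two lines show why the "syntactic-lexical" feature is pulled in two directions: it is exactly one step away from both of its 'parent' categories.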

Losing body parts: How data coding masks underlying processes

Most typologists collecting structural data are not per se interested in phylogenies. Yet the fact that scholars deliberately collect historical (diachronic) features shows that, even if they would not necessarily admit it, they have a genuine interest in uncovering the history of the languages in question; or at least, in how closely related the languages (or here: dialects) are. But this requires understanding the characters we analyze, the collected "structural data".

In evolutionary biology, the key question people (should) ask when trying to select characters is how their change can be modeled on a tree or a network. What processes could be expected that shaped the data? What is behind the diversity? Is similarity or dissimilarity instigated by:

[A] inheritance, i.e. passed from an ancestor to all / some of its descendants,
[B] random mutation and/or sorting, i.e. the product of a stochastic, evolutionary neutral process,
[C] non-random mutation, i.e. processes that recur frequently, may be beneficial and positively (gain, or negatively: loss) selected for, or
[D] secondary contact, mixing of lineages by hybridization (symmetric mixing) and introgression (asymmetric mixing)?

[A]–[C] are vertical processes following a tree, even if the tree does not necessarily need to be the same; [D] is (mostly) horizontal and can only be modeled using a network. For each of the above, we can find an analogy in the evolution of languages.

In addition, process [C], and to a lesser extent [D], can lead to what biologists call 'homoplasy', meaning that the same feature is observed in two unrelated or distantly related taxa. In the context of phylogenetic inferences, homoplasies inflict tree-incompatible signals, seemingly reticulate patterns originating from a tree-like evolution. Structural (or other) linguistic data and phenotypical biological data have a lot in common: complex processes are boiled down to mere absence or presence of features (or traits, as they are called in biology).

Figure 3: Basic evolutionary processes we need to consider when looking at linguistic data. Or biological traits, when we replace simplification by adaptive evolution, positively selected traits.

If we check the features in our table above, and ask: to which degree can they be used to model these processes (see also David's last post on illogic in phylogenetics), e.g. simply distinguish between similarity by chance, relatedness, or secondary contact (mixing), we can easily see that they are by no means optimal for evolutionary investigations. This is not necessarily because of the processes they involve, but more importantly because of the data sampling, which makes modeling almost impossible, with each character needing its own model.

As an example, take the feature p-6 in our table. Whether or not a language makes a distinction between "arm" and "hand" does not seem to follow specific geographic or genealogical patterns. The following figure shows a plot from the CLICS database (List et al. 2018), visualizing the most frequently recurring polysemies (or colexifications) centering around the concept "arm". The full visualization is available in CLICS, where one can hover with the mouse over the link between "arm" and "hand" (marked in green below).

Figure 4: Colexification network in the CLICS database.

From eye-balling the data, it is hard to find a consistent geographic / language-family pattern, which suggests that the feature p-6 is likely to show a high degree of homoplasy in the languages of the world. Obviously, different people decided not to distinguish between "hand" and "arm". But the example of the Sami languages in northern Scandinavia also demonstrates that some people using related, long-isolated languages consistently don't make the distinction. Here, the homoplasy is inherited (lineage-conserved). A biological analogy would be the rarely applied difference between a 'convergence' (a trait is independently evolved in different lineages) and a 'parallelism' (a trait is expressed by different but not all members of the same lineage).

Figure 5: Geographic distribution of arm/hand colexifications in the CLICS database.

A specific analogy to the "hand-arm" colexification / differentiation pattern is leaf shedding in oaks and their relatives (Fagaceae, the beech family). Some oak lineages (section Cerris of oaks, beech trees, chestnuts) are essentially or strictly deciduous, others (sections Cyclobalanopsis, Ilex, the sister sections of Cerris; Castanopsis, the sister genus of chestnuts) are always evergreen, and the biggest group (number of species) of all Fagaceae, subgenus Quercus, includes evergreen (1 section), mixed (the two by far largest sections), and deciduous (1 nearly extinct section) sublineages. To some extent this is linked to the climate in which the species thrive (high latitudes and/or per-humid = deciduous, low latitude and/or seasonally dry = evergreen), but consistently evergreen and deciduous lineages do co-exist.

Looking at the Chinese dialects, we see that p-6 represents a trivial split in the network.

Figure 6: A Neighbor-net inferred from the Szeto et al. matrix. Dialects that distinguish "arm" and "hand" with filled dots ('1' for character 6 in the matrix), those that don't ('0') with empty dots. We can put a single line separating all don't- from do-taxa (dialects), i.e. a bipartition of the taxon set fitting the character partition seen in (p-)6.

But, given the general patterning of the feature on a global scale, does this really mean that it is inherited — that is, a good feature to reflect relatedness?

Whether a feature is likely to be homoplastic is just one part of the story. Linguists typically have more information about how things change than do biologists, putting a double-edged sword in their hands (that they hardly ever use). Asking whether "hand" and "arm" are expressed by distinct words does not consider the underlying processes. Here, we can assume at least three different character states, namely:

"arm" and "hand" are expressed by the same word, which is the original word for "arm",
"arm" and "hand" are expressed by the same word, which is the original word for "hand", and
"arm" and "hand" are expressed by different words.

We could even have a fourth state, in which a single word was always used, throughout the long history of the ancestral languages, to express "arm or hand" (i.e., both body parts). In that case, no differentiation and no later generalization from either "arm" or "hand" took place.

Figure 7: Left, current scoring; right, scoring taking into account the actual mutation process.

From Ancient Chinese, we know that "1" (Yes, I do differentiate between "arm" and "hand") was most likely the original state. We can further assume that once the distinction is dropped, it is less likely to come back again (although this can, of course, also happen). That is, our model involves two possible mutations (vertical process): we lose the word for "arm" due to its replacement by "hand", or we lose the word for "hand" due to its replacement by "arm", each with its own probability.

Figure 8: Probability distribution for transitions involving "hand" and "arm".

The probability, mutation or not, and which mutation, relates to four principal driving factors:

probability of random loss (mutation)
probability of random gain (mutation)
global linguistic tendencies
regional socially-enforced preference

Establishing p_-arm (loss of "arm") and p_-hand (loss of "hand") is not trivial, because they may be affected by what the word for "arm" and the word for "hand" actually are (for simplicity we will assume that p_+arm and p_+hand are close to 0). We could expect a higher tendency to keep the word that is easier to pronounce or less easy to confuse with other words and, hence, is easier to understand. If two dialects with different states come into contact, this may also influence the decision to take over a state or not. In everyday language, a distinction between "arm" and "hand" may be useless because of the clear context in which both words are used, so p_1-word > p_2-words. However, closeness to administration centers or areas with a higher percentage of educated people could decrease p_1-word, because it may be considered a sign of poor social standing to not make the difference between "arm" and "hand".
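To get a feeling for how the two loss probabilities interact, one can iterate a very simple three-state model. The Python sketch below is purely illustrative: it assumes that gains are impossible (p_+arm = p_+hand = 0), that the probabilities are constant per time step, and it uses invented numbers; none of these assumptions is derived from the dialect data.

```python
def step(dist, p_loss_arm, p_loss_hand):
    """Advance the state distribution by one time step.

    dist = (two_words, only_hand_word, only_arm_word).
    Gains are assumed to be impossible (p_+arm = p_+hand = 0), so the two
    one-word states are absorbing.
    """
    two, only_hand, only_arm = dist
    return (
        two * (1.0 - p_loss_arm - p_loss_hand),
        only_hand + two * p_loss_arm,   # "arm" replaced by the "hand" word
        only_arm + two * p_loss_hand,   # "hand" replaced by the "arm" word
    )

def evolve(n_steps, p_loss_arm, p_loss_hand):
    """Start with the distinction intact and apply `step` n_steps times."""
    dist = (1.0, 0.0, 0.0)
    for _ in range(n_steps):
        dist = step(dist, p_loss_arm, p_loss_hand)
    return dist

# Hypothetical values, chosen only for illustration.
two, only_hand, only_arm = evolve(n_steps=10, p_loss_arm=0.02, p_loss_hand=0.01)
print(round(two, 3), round(only_hand, 3), round(only_arm, 3))
```

Under these assumptions the share of lineages keeping both words shrinks geometrically, and the ratio of the two one-word outcomes simply mirrors the ratio of the two loss probabilities. Contact and social preference, as described above, would break both regularities.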

Figure 9: Vertical and horizontal processes involving transitions of "hand" and "arm".

Estimating p can only be left to phylogenetic algorithms (unless more detailed information is available). But we can (and should) design the questionnaire to capture as many of the processes as possible. In this case, to not only ask whether there is a distinction between "arm" and "hand", but also to find out whether the word "arm" or "hand" is used, e.g. by using two questions/binary characters:

Do we use "hand"?
Do we use "arm"?

Note that this question requires quite a deal of knowledge about the languages under investigation, since it may not be trivial to find out what was the "original" word for "arm" or "hand".
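As a minimal sketch of what the two-question coding gains over the single yes/no character, the following Python snippet recodes the three states discussed above. The state labels and function names are hypothetical and only serve the illustration.

```python
# Hypothetical state labels for the three-state coding discussed above.
ARM_WORD_FOR_BOTH = "arm-word covers both"
HAND_WORD_FOR_BOTH = "hand-word covers both"
TWO_WORDS = "two different words"

def recode(state):
    """Return (uses_hand_word, uses_arm_word) as two binary characters."""
    if state == TWO_WORDS:
        return (1, 1)
    if state == HAND_WORD_FOR_BOTH:
        return (1, 0)
    if state == ARM_WORD_FOR_BOTH:
        return (0, 1)
    raise ValueError(f"unknown state: {state!r}")

def old_coding(state):
    """The original single yes/no character: is there a distinction?"""
    return 1 if state == TWO_WORDS else 0

for state in (TWO_WORDS, HAND_WORD_FOR_BOTH, ARM_WORD_FOR_BOTH):
    print(state, old_coding(state), recode(state))
```

The single character assigns "0" to both one-word states, whereas the pair of characters keeps them apart, so two different losses are no longer scored as the same event.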

Therefore, a further step would be to replace the binary characters by a value measuring the similarity between the words used for "hand" and those used for "arm". One could again argue that adding this information would add historical information to the feature, but it is clear that the abstract nature of the question is hiding important phylogenetic (and also typological) information from us.
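One conceivable way to obtain such a similarity value is a normalized edit distance between the two word forms. The Python sketch below assumes this particular measure (the post does not prescribe one) and uses invented placeholder strings rather than real dialect forms.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(word_a, word_b):
    """1.0 for identical forms, 0.0 for maximally different ones."""
    longest = max(len(word_a), len(word_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(word_a, word_b) / longest

# Invented placeholder forms, not real dialect data.
print(similarity("tapa", "tapa"))   # 1.0
print(similarity("tapa", "kilu"))   # 0.0
print(similarity("tapa", "tapu"))   # 0.75
```

A value of 1.0 would correspond to full colexification, while intermediate values could flag partially overlapping forms such as compounds; for real data one would compare phonetic transcriptions rather than raw strings.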

It seems, therefore, that instead of asking whether or not there is a distinction between "arm" and "hand", it would make much more sense to trace the cognacy (or homology) of the expressions for "arm" and "hand" across all taxa (languages, dialects), and think of ways in which this could be scored and modeled by phylogenetic analyses. The structural data framework with its features based on simple yes-no questions therefore inevitably leads to a misinterpretation of processes when analyzing the data with phylogenetic software.

The need for exploratory data analysis

In reality, structural (or other) data sets in linguistics face problems similar to the ones palaeontologists face when trying to establish phylogenetic relationships between fossils (extinct organisms) — the probability for a mutation (visible change) is largely unknown, and differs not only from character to character but also within the same characters. A state 0, 1, 2 etc. may have a higher probability to manifest (or get lost) in one lineage than in another.

In addition, the linguistic problems recur in a similar way to that of biologists working close to and below the species level (see also Guido's post on population dynamics and individual-based fossil phylogenies) — reticulation is rather the rule than the exception, as similarity is triggered by contact, so that horizontal processes, not inheritance, may dominate evolutionary dynamics. Thus, the diversity pattern cannot be modeled by a tree alone. Establishing explicit probabilistic frameworks to deal with this may not only be difficult but even impossible (given the available data). Meanwhile, however, one can embrace exploratory data analysis as a heuristic tool.

So, let's look at the example. As in the original paper, we used the binary matrix of the 21 characters to infer a planar, 2-dimensional (meta-)phylogenetic network, a Neighbor-net splits graph. The resulting graph is a longitudinally inflated spider-web, with its endpoints defined by the southern Chinese dialects (e.g. Guangzhou, Nanning, Taishan) and the north-central (e.g. Linxia and Xining) dialects. The latter are significantly closer (geographically and data-wise) to the Beijing version of Chinese.

Figure 10: The Neighbor-net based on simple mean (Hamming) pairwise binary character distances

The first thing to note is that the matrix includes dialects that are indistinct (green stars) for all 21 characters, and some that are geographically and data-wise very similar to each other, while being distinct from all others (green ovals). In biology, we call this (taxic, lineage-)coherence. In addition to Linxia and Xining, we have Nanchang and Lichuan characterized by elongated ('tree-like') terminal edge-bundles. These obviously represent closely related dialects sharing a long(er) common history.

Others have more than one possible closest relative. For instance, Liuzhou may share quite a few features with Guangzhou, but it is equally close to the Nanchang-Lichuan pair (yellow fields). Dongtai (orange star) is unique, but its 'neighborhood' (orange-ish brackets), as defined by shared edge-bundles, includes Changsha (which again is most closely related to Jiujiang) and Taiyuan plus Baotou, the latter two being substantially closer to the Beijing (red star) group.

Similar to Dongtai, and also connected to the central part of the graph, are dialects with long terminal branches (edges). Hefeng (blue star) is substantially different from Dongtai, and only has one further dialect in its neighborhood (blue bracket), Wangrong, a close relative of the Beijing group. The Wuhan, Chengdu, and Guiyang (gray field) dialects appear, on the other hand, to be completely isolated.

As explained above, there are different processes, vertical and horizontal ones, that may trigger similarity, and we want to get an idea as to which character may be influenced by which process. From the graph, several aspects are obvious:

geographic closeness plays a major role,
the signal provided by the data is not tree-like,
the data is highly homoplastic, and includes internal conflict.

Not so obvious is whether this situation is due to random or evolutionary directed similarity, or reticulation. Since the graph is planar, and puts the Chinese dialects in a circular order, we can order the character matrix accordingly to see how the traits form groups (which could be called cliques in this context). In the next step, we can then map each character onto this network, to see how well they fit with the overall similarity pattern. We showed this above for p-6 (hand-arm-distinction, one split), and here we add a character with quite a poor fit, p-17 (syntactic-diachronic), "give" as a disposal marker.
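The idea of reading a character along the circular order can be sketched in a few lines of Python. The check below only tests whether a binary character cuts the circle into exactly two arcs, which is a necessary condition for drawing it as a single line through the planar graph; it does not tell us whether the Neighbor-net actually contains a matching edge-bundle. The function names and the example vectors are hypothetical.

```python
def count_blocks(states):
    """Count maximal runs of identical states around a circular ordering."""
    n = len(states)
    if n == 0:
        return 0
    boundaries = sum(states[i] != states[(i + 1) % n] for i in range(n))
    # No boundary at all means a constant character: one block.
    return boundaries if boundaries > 0 else 1

def fits_circular_split(states):
    """True if the character cuts the circle into exactly two arcs."""
    return count_blocks(states) == 2

# Hypothetical states of two characters, listed in the circular order of taxa.
clean = [1, 1, 1, 0, 0, 0, 0, 1]      # the 1s form one contiguous arc
patchy = [1, 0, 1, 1, 0, 0, 1, 0]     # the 1s are scattered

print(count_blocks(clean), fits_circular_split(clean))    # 2 True
print(count_blocks(patchy), fits_circular_split(patchy))  # 6 False
```

A character like `clean` behaves as p-6 does in the dialect network, while a character like `patchy` needs several separate groups, comparable to the poor fit described for p-17.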

Figure 11: Character mapping for p-17 (filled dots, "give" used as disposal marker; empty, not used), with the p-6 split indicated as well. Red, splits (taxon bipartitions defined by character cliques) that have no corresponding edge-bundle (neighborhood); blue, splits with neighborhood; green, unique, isolated change (deviation from the rule) within the neighborhood.

The number of inferred mutations in the map uses Ockham’s Razor, upon which parsimony (tree and network) inference relies as well. Using such a map, we can even provide an estimate for how likely (qualitatively speaking) a change is under the assumption that neighborhoods in the graph represent either exchange (homogenization) between closely related dialects or are inherited, reflecting both horizontal and vertical relatedness. Mapping characters on a 2-dimensional network allows finding a scenario beyond a single tree hypothesis.
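For readers unfamiliar with parsimony counting, the following Python sketch shows the classical tree-based version (Fitch's small-parsimony count) for a single binary character. It is a simplification: the maps in this post are drawn on a network, not on a tree, and the toy tree, taxon names and states below are invented.

```python
def fitch(tree, states):
    """Return (possible_states, min_changes) for a rooted binary tree.

    `tree` is either a leaf name (str) or a pair (left, right);
    `states` maps each leaf name to its observed state (0 or 1).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical toy tree and character; the taxon names are placeholders.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
toy_states = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 1}

_, changes = fitch(toy_tree, toy_states)
print(changes)  # 2
```

The count is a lower bound on the number of changes for the given tree, which is exactly the sense in which "at least" is used in the character maps.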

For p-6, we need just one change (i.e. loss in all more south-bound dialects), but we don't find an edge bundle corresponding to this unique change. Given what we discussed above about p-6, there may well have been more independent losses than the single reconstructed one. Social preference or general contact favoring retention of the primitive state of having two words could explain why dialects closer to the Beijing dialect area have a "1", although not all are closely related in general.

For p-17, we need at least four (independent) changes from "0" → "1", two of which have a corresponding edge bundle (blue, Nanchang plus Lichuan, Changsha plus Dongtai), one isolated (green, Luoyang), and one without a corresponding edge bundle (Wuhan and Hefeng dialects). The (equally parsimonious) alternative for p-17 would be a series of gains and losses, with the same number of steps:

Figure 12: Alternative scenario for p-17.

This is where one needs to consider additional knowledge about the probability of getting or retaining a certain feature. The state shared by most dialects across the entire net is “0”, irrespective of overall similarity, which would make it a natural pick for the primitive state. Thus, we would assume four (or more) changes from 0 → 1 (acquisition of the queried feature), rather than two independent acquisitions (starting with the Beijing group; note that the position of the root will not change the number of needed changes), followed by a loss (1 → 0) in many southbound dialects and a re-gain (0 → 1) in the Nanchang + Lichuan dialects.

The same assessment can be made for all of the characters, and we end up with something like this:

Figure 13: Fully annotated split network of the data. Changes relating to edge-bundles are colored accordingly; arcs indicate changes without a corresponding edge-bundle. Note the prominent yellow split that defines a neighborhood of dialects most similar to the Beijing dialect, although there is no character supporting this edge. The rather poor fit of many character splits (cliques) with edge-bundles relates to the fact that we visualize a highly complex diversification (multi-dimensional processes) using a planar, 2-dimensional graph.

While this figure may be confusing at first sight, it comprehensively shows what the characters contribute to the overall graph. We can discriminate more-likely from less-likely mutations (how many changes are needed at least), but also the character assemblies shared by putatively closely related dialects.

p-3 and p-11 are typical features of Guangzhou and allied dialects within the southern Chinese complex. p-3 is also present in Lichuan, and p-11 in Jixi (thus in not so distant dialects).
Features p-6 to p-9, p-16, and p-19 form a diagnostic suite for the Guangzhou dialects and other dialects related to them in one way or another, and distinguish them from, e.g., the Beijing group.
The latter, the Beijing group, has fewer diagnostic character assemblies. One characteristic sequence could be p-1, p-2, p-12, p-14, but this includes three features with a minimum of 3+ changes. Similarity here is mostly the result of a lack of (potentially) derived features (hence, the character-unsupported yellow edge-bundle defining a Beijing-including neighborhood).

Outlook and summary

In this re-investigation, we have, once more, commented on the problems we see with the use of structural features for the purpose of historical language comparison and phylogenetic reconstruction. We see the major problems in the (often) unfortunate choice of question, resulting in elicitations of features that cannot be easily modeled with current software for phylogenetic analyses. It is important to keep in mind, in linguistics and phylogenetics, that we can infer trees or networks based on data of no matter what quality and information content. But before we present the result, we should have taken a look at the primary data.

Does it fit with the resulting graph, or not?
Where does it fit, and where not?

In the context of our critique of linguistic questionnaires, the mapping strategy discussed above opens a potential avenue to identify:

stable / unstable features (geographically or evolution-wise) and
coherent / incoherent features.

Based on this, we can then inquire as to which degree language (or dialect) groups influenced, stabilized or modified each other by geographic proximity.

Inference-wise, the natural next step would be to use the information about the minimum number of necessary changes to counter-weight characters. This would eventually allow the use of median-network (and related) approaches on the data, which are currently the only way to explicitly identify ancestors using phylogenetic reconstructions. With the current matrices, the extreme homoplasy makes an unweighted application of median networks and related methods impossible.
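What such a counter-weighting could look like is sketched below in Python. The inverse weighting (weight = 1 / minimum number of changes) is just one possible scheme that we assume for illustration, and the numbers are invented; the post does not commit to a particular formula.

```python
def weights_from_changes(min_changes):
    """Down-weight each character by its minimum number of changes."""
    return [1.0 / max(1, c) for c in min_changes]

def weighted_distance(row_a, row_b, weights):
    """Weighted mean character distance between two binary rows."""
    differing = sum(w for a, b, w in zip(row_a, row_b, weights) if a != b)
    return differing / sum(weights)

# Hypothetical data: three characters needing 1, 4 and 2 changes at least.
weights = weights_from_changes([1, 4, 2])
dialect_x = [1, 0, 1]
dialect_y = [0, 1, 1]

print(weights)  # [1.0, 0.25, 0.5]
print(weighted_distance(dialect_x, dialect_y, weights))
```

With equal weights the two hypothetical dialects would differ in two of three characters; with the counter-weights, the disagreement in the highly homoplastic second character contributes much less to the distance.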

References

Forkel, R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 2393-2400.

List, J.-M., M. Walworth, S. Greenhill, T. Tresoldi, and R. Forkel (2018) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2: 130–144.

Norman, J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.): The Sino-Tibetan languages. Routledge: London and New York, pp. 72-83.

Szeto, P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Supplementary data

The data we used to create the analyses and figures provided in this post are available at https://github.com/cldf-datasets/szetosinitic/tree/master/examples

Monday, November 19, 2018

The curiously converted logic of phylogenetics

Phylogenetic analysis involves describing patterns, not studying processes. That is, we cannot conduct a manipulative experiment to study evolutionary history. All we can do is collect naturally occurring data, and then try to detect relevant patterns in it. Thus, in a descriptive study we investigate processes by examining the patterns they produce, not by manipulating the processes themselves, which is what we would do in an experimental study.

Obviously, one of the limitations of this procedure is that the patterns we need may not be in the data we have at hand. It is this limitation that leads some scientists to claim that descriptive studies are not part of science. However, this is not the majority view. [See Mattis' later post, on Patterns, processes, abduction, and consilience]

Equally importantly, there is a logical limitation to descriptive studies, as well, which I have rarely seen mentioned. In the world of logic, propositions cannot be converted; and yet converting propositions is exactly what is done by all descriptive analyses. [The four terms used in logic are defined at the bottom of this post.]

Our initial logic works from process to pattern (if p, then q), but we interpret it the other way around, that a specified pattern must be created by a particular process (if q, then p). Thus:

we expect this specific process to produce that particular pattern
therefore, when we see that particular pattern we can infer this specific process.

The problem here is the second statement, which is the logical converse of the first statement (the proposition). The inference is illogical, because other processes might also create the same pattern, in which case our inference can be wrong.

The Monty Python comedy team had a go at this in their Logician skit on "The Holy Grail" album (but not in the movie of the same name). Their example concerned a 1950s-60s singer called Alma Cogan, who died in 1966. Their inference was:

all of Alma Cogan is dead
therefore, all dead people are Alma Cogan.

This is illogical, because there is more to being dead than simply being Alma Cogan — logical propositions can be only partially converted.

The same logical fallacy has also been pointed out in the application of statistics to ecology. Stuart Hurlbert (1990. Spatial distribution of the Montane Unicorn. Oikos 58: 257-271) assessed the use of the Poisson probability distribution as evidence for random spatial distributions of organisms. The inference is:

for a Poisson distribution, the variance equals the mean
therefore, if the variance equals the mean we can infer a Poisson distribution.

His paper points out many real datasets where the variance equals the mean but the data do not fit a Poisson distribution. He concluded: "Each population showed a different pattern of aggregation and none corresponded to a Poisson distribution. The variance:mean ratio is useless as a measure of departure from randomness, though it is widely recommended as such."
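To make the point concrete, here is a minimal sketch with invented quadrat counts (not data from Hurlbert's paper): half of the quadrats are empty and the other half hold exactly two individuals. The function names and the numbers are assumptions chosen for illustration. The variance equals the mean, yet the counts look nothing like a Poisson sample.

```python
from collections import Counter
from math import exp, factorial


def mean_and_variance(counts):
    """Return the mean and the population variance of a list of counts."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((x - mean) ** 2 for x in counts) / n
    return mean, variance


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


# Hypothetical data: 50 empty quadrats and 50 quadrats with two individuals each.
counts = [0, 2] * 50

mean, variance = mean_and_variance(counts)
observed = Counter(counts)

# Number of quadrats with exactly one individual that a Poisson model
# with the same mean would lead us to expect.
expected_ones = len(counts) * poisson_pmf(1, mean)

print(f"mean = {mean}, variance = {variance}")
print(f"quadrats with one individual: observed {observed[1]}, "
      f"expected under Poisson {expected_ones:.1f}")
```

The pattern (variance equals mean) is there, but the process we would like to infer from it (random placement) is not what generated these numbers.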

These are simply examples of a general problem: we cannot convert a proposition and expect to be right all of the time, or even most of the time. The issue applies to all phylogenetic analyses, whether they involve the assessment of homology, or the construction of trees and networks: in each case we are inferring particular evolutionary processes from the observation of particular patterns in our data. For example, our model of the process of speciation implies a tree model of evolution, and therefore every time we get a "well-supported tree" we treat it as the true phylogeny. This will not work if other processes are occurring, such as hybridization.

I will finish with one specific example from network analysis. The D-statistic is used in the so-called ABBA-BABA test for detecting introgression among taxa (see Networks of admixture or introgression). The logic works from process to pattern (introgression would create a particular gene-tree pattern), but we interpret it the other way around — we see the specified gene pattern and we thereby infer the presence of introgression.
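For readers who have not met the statistic, the following is a minimal sketch of how it is commonly computed, assuming the usual four-taxon arrangement (((P1, P2), P3), Outgroup) and the standard definition D = (nABBA - nBABA) / (nABBA + nBABA). The function name and the toy sequences are invented for illustration; real analyses add significance testing (e.g. block jackknifing), which is omitted here.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns and return (abba, baba, D).

    The arguments are aligned sequences of equal length. The outgroup
    state is treated as ancestral ("A"), the other state as derived ("B").
    """
    valid = set("ACGT")
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        states = {a, b, c, o}
        if len(states) != 2 or not states <= valid:
            continue  # keep only biallelic sites without gaps or ambiguities
        if a == o and b == c:
            abba += 1  # P2 and P3 share the derived state
        elif b == o and a == c:
            baba += 1  # P1 and P3 share the derived state
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return abba, baba, d


# Hypothetical toy alignment: three ABBA sites, one BABA site,
# one invariant site and one BBAA site (which is ignored).
p1 = "ACTGAC"
p2 = "GTAAAC"
p3 = "GTAGAT"
out = "ACTAAT"

abba, baba, d = d_statistic(p1, p2, p3, out)
print(abba, baba, d)
```

Note that the code only counts a pattern. Nothing in it knows about introgression, which is exactly the point made above: the step from an excess of ABBA sites to a process is ours, not the statistic's.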

This issue of illogic is definitely a limitation of phylogenetic analysis.

The terms of logical analysis:

Proposition: if p, then q
Inverse: if not p, then not q
Converse: if q, then p
Contrapositive: if not q, then not p
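The four terms can be checked mechanically with a truth table. The sketch below treats "if a, then b" as material implication, which is the standard reading in propositional logic; the variable and function names are my own.

```python
from itertools import product


def implies(a, b):
    """Material implication: 'if a, then b' is false only when a is true and b is false."""
    return (not a) or b


forms = {
    "proposition": lambda p, q: implies(p, q),
    "inverse": lambda p, q: implies(not p, not q),
    "converse": lambda p, q: implies(q, p),
    "contrapositive": lambda p, q: implies(not q, not p),
}

# Evaluate every form for all four combinations of p and q.
cases = list(product([True, False], repeat=2))
tables = {name: [form(p, q) for p, q in cases] for name, form in forms.items()}

for name, table in tables.items():
    print(f"{name:15s}", table)

# Which forms always have the same truth value as the proposition?
equivalent = [name for name, table in tables.items()
              if name != "proposition" and table == tables["proposition"]]
print("equivalent to the proposition:", equivalent)
```

Only the contrapositive is equivalent to the proposition. The converse and the inverse are equivalent to each other, but not to the proposition, which is why converting a proposition is not a safe inference.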

Monday, November 12, 2018

More heretic bits: networks for (more) recent matrices published in Cladistics

This is Part 2 of a 2-part blog series. Part 1 covered some history, while this post has three (more) recently published matrices, and the take-home message.

Jumping forward in time, welcome to the 21st century

In Part 1, I showed several networks generated based on some early phylogenetic matrices published in the first volumes of the journal Cladistics. In this post, we will look at the most recent data matrices and trees uploaded to TreeBASE, covering the past seven years.

Nearly a generation later, and facing the "molecular revolution", some researchers (fortunately) still compile morphological matrices. This is often overlooked but important work: genes and genomes can be sequenced by machines, and the only thing we need to do is to feed these machine-generated data into other powerful machines (and programs) to get a phylogenetic tree, or network. But no software or computer cluster can (so far) study anatomy, and generate a morphological matrix. The latter is paramount when we want to put fossils, usually devoid of DNA, in a (molecular) phylogenetic context. We need to do this when we aim to reconstruct histories in space and time.

Nevertheless, we can't ignore the fact that these important data are (still) far from tree-like. What holds for the matrices of the 1980s (see the end of Part 1) still applies now.

So, let's have a look at the three most recent data sets (one morphological, two molecular) published in Cladistics that have their data matrix in TreeBASE.

The morphological dataset

Beutel et al. (2011; submission S11976) provided a "robust phylogeny of ... Holometabola", and note in their abstract: "Our results show little congruence with studies based on rRNA, but confirm most clades retrieved in a recent study based on nuclear genes."

Without having read the study, I can guess which clades (likely used here as a synonym for monophyletic group; but see David's post on Hennig and Cladistics) were confirmed, just by looking at the Neighbor-net inferred from this matrix. The data matrix contains 356 multistate characters (with up to six states) scored and annotated for 34 taxa, including polymorphisms and some gaps ("–") as well as missing data ("?"). (Standard tree- or network-inference doesn't differentiate between gaps and missing data, but some people find it important to distinguish between "not applicable" and "not known" in a matrix.)

Neighbor-net inferred from simple pairwise distances computed based on Beutel et al.'s matrix. Brackets show my ad hoc assessment of candidates for monophyla (here: likely represented by clades in no matter how optimized trees).
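For those who wonder what "simple pairwise distances" amount to, here is a minimal sketch, assuming a mean Hamming distance in which gaps and missing data are skipped alike. The toy matrix and all names are invented; polymorphic cells, which the real matrix contains, are not handled here.

```python
from itertools import combinations

MISSING = {"?", "-"}  # "not known" and "not applicable" are treated the same way


def mean_hamming(row_a, row_b):
    """Proportion of differing states among the characters scored in both taxa.

    Returns None when the two taxa have no scored character in common.
    """
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a in MISSING or b in MISSING:
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else None


# Hypothetical toy matrix: taxon name -> string of character states.
matrix = {
    "TaxonA": "0012?1",
    "TaxonB": "0112-1",
    "TaxonC": "1102?0",
    "TaxonD": "??0211",
}

distances = {
    (x, y): mean_hamming(matrix[x], matrix[y])
    for x, y in combinations(matrix, 2)
}

for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

A distance matrix like this is all a Neighbor-net needs as input. Note that TaxonD is compared on only three characters, so its distances rest on much less evidence than the others.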

How did I postulate the monophyla? By deduction: if two or more OTUs are much more similar to each other than to anything else in the matrix, they likely are part of the same evolutionary lineage, ie. have a common origin (= monophyletic in a pre-Hennigian sense). When the matrix covers the group and its morphospace well, such a lineage has a good chance of being inclusive (= monophyletic fide Hennig; for the covered OTUs). This is especially so when there is a good deal of homoplasy (the provided tree has a CI of 0.44 and an RC of 0.33), because convergences should be more randomly distributed than lineage-specific/-conserved traits. The latter need not be (or have been, at some point in time) synapomorphies, ie. shared, derived, unique traits; they could also be diagnostic suites of characters that evolved in parallel within a lineage and were passed on to all (or most) of the descendants.
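As a reminder of what CI and RC measure, here is a minimal sketch using the standard ensemble definitions: CI = M/S, RI = (G - S)/(G - M) and RC = CI × RI, where M, S and G are the minimum possible, observed and maximum possible numbers of steps, summed over all characters. The per-character step counts below are invented; in practice a program such as PAUP* reports these values for a given tree.

```python
def homoplasy_indices(characters):
    """Return (CI, RI, RC) for a list of (min_steps, observed_steps, max_steps) tuples."""
    m = sum(c[0] for c in characters)  # minimum conceivable number of steps
    s = sum(c[1] for c in characters)  # steps actually required on the tree
    g = sum(c[2] for c in characters)  # maximum conceivable number of steps
    ci = m / s
    ri = (g - s) / (g - m)
    return ci, ri, ci * ri


# Hypothetical step counts for four characters on some tree.
characters = [(1, 1, 3), (1, 2, 4), (2, 5, 6), (1, 2, 2)]

ci, ri, rc = homoplasy_indices(characters)
print(f"CI = {ci:.2f}, RI = {ri:.2f}, RC = {rc:.2f}")
```

Read the other way around, a CI of 0.44 together with an RC of 0.33 corresponds to an RI of about 0.75.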

The first molecular dataset

Let's look at the signal in the two molecular matrices.

In 2016, Gaspar and Almeida (submission S19167) tested generic circumscriptions in a group of ferns by "assembl[ing] the broadest dataset thus far, from three plastid regions (rbcL, rps4-trnS, trnL-trnF) ... includ[ing] 158 taxa and 178 newly generated sequences". They found: "three subfamilies each corresponding to a highly supported clade across all analyses (maximum parsimony, Bayesian inference, and maximum likelihood)."

The total matrix has 3250 characters, of which 1641 are constant and 1189 are parsimony-informative. This is quite a lot for such a matrix, and, by itself, rules out parsimony for tree-inference. If half of the nucleotide sites are variable, then the rate of character change was high, and parsimony is statistically robust only when the rate of change was low. High mutation rates or high levels of divergence may also pose problems for distance methods and other optimality criteria, all closely related to parsimony.
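The three site categories are easy to count yourself. The sketch below assumes the usual definition of a parsimony-informative site (at least two states, each found in at least two taxa) and ignores gaps and ambiguity codes; the toy alignment and the function name are invented.

```python
from collections import Counter


def classify_sites(alignment):
    """Return the number of (constant, variable, parsimony-informative) columns.

    alignment is a list of equal-length sequences. Informative sites are a
    subset of the variable sites.
    """
    constant = variable = informative = 0
    for column in zip(*alignment):
        states = Counter(state for state in column if state in "ACGT")
        if len(states) <= 1:
            constant += 1
            continue
        variable += 1
        # Informative: at least two states that each occur in two or more taxa.
        if sum(1 for n in states.values() if n >= 2) >= 2:
            informative += 1
    return constant, variable, informative


# Hypothetical toy alignment with four taxa and six sites.
alignment = [
    "ACGTAC",
    "ACGTTC",
    "ATGAAC",
    "ATG-TG",
]

constant, variable, informative = classify_sites(alignment)
print(constant, variable, informative)
```

Applied to the numbers above, 3250 - 1641 = 1609 sites are variable, of which 1189 are parsimony-informative and the remaining 420 are not.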

The file includes three trees, labelled "vero" (which, in Italian, means "true"), "Fig._1" and "MPT". "Vero" and "Fig._1" come with branch lengths; judging from the values (<< 1), they are probabilistic trees (of some sort); the "MPT" is (as usual) provided as a cladogram without branch-lengths. It may be that the authors had to add the parsimony tree just to fulfill editorial policies, while being convinced "vero" is the much better tree. "Vero" is a fully resolved tree (the ML tree?), while "Fig._1" (Bayesian?) and "MPT" include polytomies.

Using PAUP*'s "describe" function, we learn that the "MPT" is 5101 steps long and has a CI of 0.41 and RC of 0.33. Nucleotide sequence data can be notoriously homoplasious, as we repeat the same four states into infinity and have to deal with an unknown but usually significant amount of back mutations. This adds to the other problems for parsimony:

transitions are more likely to happen than transversions; and
in coding gene regions, such as the rbcL, some sites (3rd codon positions) mutate much faster than others.
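Both effects can be made visible with a simple tally of pairwise differences. The sketch below assumes two aligned, in-frame coding sequences that start at a first codon position; the sequences and names are invented for illustration.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def ts_tv_by_codon_position(seq_a, seq_b):
    """Tally [transitions, transversions] separately for codon positions 1, 2 and 3."""
    counts = {1: [0, 0], 2: [0, 0], 3: [0, 0]}
    for index, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical sites, gaps and ambiguity codes are skipped
        pair = {a, b}
        # A transition stays within purines (A, G) or within pyrimidines (C, T).
        is_transition = pair <= PURINES or pair <= PYRIMIDINES
        counts[index % 3 + 1][0 if is_transition else 1] += 1
    return counts


# Hypothetical pair of aligned coding sequences (four codons each).
seq_a = "ATGGCCAAATTT"
seq_b = "ATGTCTAAGTTC"

counts = ts_tv_by_codon_position(seq_a, seq_b)
for position, (ts, tv) in counts.items():
    print(f"codon position {position}: {ts} transitions, {tv} transversions")
```

Unweighted parsimony counts every one of these differences as a single step, regardless of the kind of substitution or the codon position.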

Still, parsimony trees are not necessarily wrong. Neither are NJ trees; and there are also datasets where probabilistic methods struggle, eg. when the likelihood surface of the treespace is flat.

So, the first question is: how different are the three trees provided? Rather than having to show three graphs, we can show the (strict) Consensus network of those trees.
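For the curious, the idea can be sketched in a few lines, assuming that the network simply displays every split (bipartition) that occurs in at least one of the input trees. The three toy trees, written as nested tuples, are invented and are not the TreeBASE trees; drawing the network from the splits is left to a program such as SplitsTree.

```python
def nontrivial_splits(tree, taxa):
    """Collect the non-trivial splits of a tree written as nested tuples of taxon names.

    Each split is stored as the side that lacks the reference taxon, so
    that the same bipartition compares equal across trees.
    """
    reference = min(taxa)
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        rest = taxa - below
        if len(below) > 1 and len(rest) > 1:
            splits.add(rest if reference in below else below)
        return below

    leaves_below(tree)
    return splits


taxa = frozenset("ABCDE")

# Hypothetical input trees; the third one contains a polytomy.
trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    (("A", "B", "C"), ("D", "E")),
]

split_sets = [nontrivial_splits(tree, taxa) for tree in trees]
network_splits = set().union(*split_sets)       # shown in the consensus network
shared_splits = set.intersection(*split_sets)   # shown in a strict consensus tree

for split in sorted(network_splits, key=sorted):
    side = "".join(sorted(split))
    other = "".join(sorted(taxa - split))
    n_trees = sum(split in s for s in split_sets)
    print(f"{other} | {side}: found in {n_trees} of {len(trees)} trees")
```

The strict consensus tree would keep only the one split shared by all trees and collapse the rest; the network keeps the competing alternatives visible as boxes.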

A strict consensus network summarizing the topologies of the three trees provided in the TreeBASE submission of Gaspar & Almeida.

The main difference is between "vero" and the other two — "Fig. 1" and the "MPT" are very similar (and both include polytomies). There are three main scenarios for a Consensus network like this with respect to the high portion of variable sites:

"Fig. 1" is a Jukes-Cantor model-based tree,
"Fig. 1" is an uncorrected p-distance based tree, or
most of the variation is between ingroup (the subtree including all Blechnum) and outgroup (the other subtree).

"Vero" is still quite congruent, so the model used here can't be too much different, either.

What should ring one's alarm bells are, however, the many grade-like / staircase subtrees, which are unusual for a molecular data set. Staircases imply that each subsequent dichotomous speciation event resulted in a single species and a further diversifying lineage: multiple, consistently occurring budding events.

The same graph, with arrows showing grade evolution. Often found in morpho-data-based trees with ancestral, more ancient, and derived (from them), modern forms, but should ring an alarm bell when common in a molecular tree. Major clades (found in all three trees) are labelled for comparison with the next graph.

Let's compare this to the Neighbor-net (usually, I would use model-based distances in such a case, but here we can do with uncorrected p-distances).

A Neighbor-net inferred from uncorrected p-distances based on Gaspar & Almeida's matrix; the major clades are labelled as in the preceding graph. Note the isolated, long-branch blue dots with asterisks, indicating the position of the first diverged species in the large clades G and I. Genuine signal or missing data artefact?
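To see what "uncorrected" means, compare the p-distance with the simplest model-based correction, the Jukes-Cantor distance d = -3/4 ln(1 - 4p/3). This is a minimal sketch with invented sequences and function names; more realistic models (e.g. with rate heterogeneity) would be used in practice.

```python
from math import log


def p_distance(seq_a, seq_b):
    """Uncorrected proportion of differing sites; gaps and ambiguities are skipped."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no site is scored in both sequences")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined once p reaches 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * log(1 - 4 * p / 3)


# Hypothetical pair of sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
print(f"p = {p:.3f}, JC = {jukes_cantor(p):.3f}")

# The correction grows with divergence: small for similar sequences, large for distant ones.
for value in (0.05, 0.3, 0.6):
    print(f"p = {value:.2f} -> JC = {jukes_cantor(value):.3f}")
```

For closely related sequences the two distances are nearly the same, which is why uncorrected distances can suffice within a group. The more divergent the sequences, the more the p-distance underestimates the amount of change.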

The Neighbor-net shows only a limited number of tree-like portions, but does correspond with the main clades above. Only A and B are dissolved, which are the two first diverging clades in the original trees (preceding graph). Some OTUs are placed close to the centre of the graph, or even along a tree-like portion (purple dots), a behaviour known from actual ancestors: some OTUs apparently have sequences that may be literally ancestral to others. This explains the grade structure seen in the original trees. Others (violet dots) create boxes, which may reflect a genuine ambiguous signal, or just be missing data leading to ambiguous pairwise distances. The latter (missing data artefact) is behind the misplacement of the four OTUs (red dots): missing data can inflate pairwise distances severely. And, like parsimony, distance-based methods are more vulnerable to long-branch(edge)-attraction than probabilistic methods.

Model-based distances may help clean this up a bit, but the networks needed for this kind of data are Support consensus networks (see e.g. Schliep et al., MEE, 2017). The split appearance of the Neighbor-net hints at internal signal conflict and, with respect to the high number of variable sites (note the sometimes extremely long terminal edges), saturation issues. Two major questions would be:

How do the different markers (coding gene vs. inter-genic spacers with different levels of diversity; rps4-trnS is typically more divergent than the trnL-trnF spacer) resolve relationships, and which clades / topological alternatives receive unanimous support?
Does it make a difference to run a fully partitioned (ML) analysis vs. an unpartitioned one vs. one excluding the 3rd codon position in the gene?

For intra-clade evolutionary pathways, it would be worthwhile to give median networks and suchlike a try, as parsimony methods that can discern ancestor-descendant relationships.

The second molecular dataset

The most recent data are from Kuo et al. (2017; submission S20277), who inferred a "robust ... phylogeny" (see Part 1, Jamieson et al. 1987, and Beutel et al., above) for a group of ferns, focusing on the taxonomy of a single genus, Deparia, that now includes five traditionally recognized genera. In the abstract it says: "... seven major clades were identified, and most of them were characterized by inferring synapomorphies using 14 morphological characters".

The matrix includes the molecular characters used to infer the major clades plus two trees, labelled "bestREP1" and "rep9BEST", both with branch lengths. Branch length values indicate that "bestREP1" could be parsimony-optimized (with averaged or weighted branch lengths), while "rep9BEST" is either a ML or Bayesian tree (technically, it could be a distance-based tree, too, but I don't think such "phenetics" are condoned by Cladistics).

Re-calculated, the first tree ("bestREP1") is shorter (3024 steps) than the one of Gaspar & Almeida, reflecting the much lower number of parsimony-informative sites (979). Many of the sites differ only between the focal genus and the outgroups, which is well visible in the Neighbor-net. [For those of you unfamiliar with Neighbor-nets, a parsimony analysis of these data takes hours, or days depending on the software and computer, while the distance matrix and the resultant Neighbor-net is inferred in a blink.]

The Neighbor-net based on Kuo et al.'s data. Why do we need to include long-branching, distant outgroups when we just want to bring order in a genus? Because to test monophyly, we need a rooted tree (ambiguous or not, or even biased by branching artefacts).

Let's remove the distant, long-branching outgroups, which (as we can see in the Neighbor-net) at best provide ambiguous signal for rooting the ingroup — at worst, they trigger ingroup-outgroup branching artefacts. What could a Neighbour-net have contributed regarding taxonomy and the seven major monophyletic intrageneric groups ("clades")? Pretty much everything needed for the paper, I guess (judging from the abstract).

Same data as above, but outgroups removed. The structure of this Neighbour-net allows us to identify seven likely candidates for monophyla ("1"–"7"), with "1" and "2" being obvious sister lineages. Colours refer to the clusters ("A"–"E") annotated above.

On a side note: by removing the long-branching, distant outgroups, taxon "T" is resolved as a probable member of the putative monophyletic group "5" (= "E" in the full graph with outgroups, and surely a high-supported subtree in any ingroup-only reconstruction, method-independent). Placing the root between "T" and the rest of the genus implies that "5" is a paraphyletic group comprising species that haven't evolved and diversified at all (ie. are genetically primitive), in stark contrast to the other main intra-generic lineages. This is not impossible, but quite unlikely. More likely is the second scenario (primary split between "1"–"3" and "4"–"7"). Having "4" as sister to the rest could be an alternative, too.

This is where Hennig's logic could be of help: find and tabulate putative synapomorphies to argue for a set and root that makes the most sense regarding morphological evolution and molecular differentiation.

The take-home message(s)

We have argued before that it is in the ultimate interest of science and scientists to give access to phylogenetic data. No matter where one stands regarding phylogenetic philosophy, we should publish our data, so that people can do analyses of their own. Discussion should be based on results, not philosophies.

When you deal with morphological data, you should never be content with inferring a single tree (parsimony or other). You have to use networks.

The Neighbor-net was born as late as 2002 (Bryant & Moulton, 2002, in: Guigó R, and Gusfield D, eds, Algorithms in Bioinformatics, Second International Workshop, WABI, p. 375–391; paywalled) and made known to biologists in 2004 (same authors, same title, in Mol. Biol. Evol. 21:255–265), so that authors before this time did not have access to its benefits. Similarly, Consensus networks arrived around about the same time (Holland & Moulton 2003, in: Benson G, and Page R, eds, Algorithms in Bioinformatics: Third International Workshop, WABI, p. 165–176). However, the Genealogical World of Phylogenetic Networks has been here for six years now (first post February 2012). So there is now no excuse for publishing a cladogram without having explored the tree-likeness of your matrix' signal!

Neighbor-nets like the ones I showed in this 2-piece post (or that can be found in many of our other posts) are a quick and essential tool to explore the basic signal in your matrix:

How tree-like is it?
Where are the potential conflicts, obscurities?
What are the principal evolutionary alternatives (competing topologies)?
What is well supported (especially regarding taxonomy and the question of monophyly)?

Even if you don't use it in your paper, the network will tell you what you are dealing with when you start inferring trees.

The second essential tool is the much under-used Support consensus network, not shown in this post but in plenty of our other posts (and many papers I co-authored; for a comprehensive collection of network-related literature see Who's who in phylogenetic networks by Philippe Gambette). Support consensus networks estimate and visualize the robustness of the signal for competing topological (tree) alternatives.
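The principle can be sketched as follows, assuming that each bootstrap (or posterior) tree has already been reduced to its set of non-trivial splits and that the network displays every split reaching a chosen frequency cut-off. The replicate trees, the cut-off of 0.2 and all names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

TAXA = frozenset("ABCDE")


def split_frequencies(replicates):
    """Proportion of replicate trees in which each split occurs."""
    counts = Counter(split for tree in replicates for split in tree)
    return {split: n / len(replicates) for split, n in counts.items()}


def are_compatible(split_1, split_2, taxa=TAXA):
    """Two splits fit on one tree if at least one pair of their sides does not overlap."""
    sides_1 = (split_1, taxa - split_1)
    sides_2 = (split_2, taxa - split_2)
    return any(not (x & y) for x in sides_1 for y in sides_2)


# Splits are written as the side that lacks taxon "A" or, as here, one fixed side.
AB, AC, CD, DE = (frozenset(s) for s in ("AB", "AC", "CD", "DE"))

# Hypothetical sample of ten bootstrap trees.
replicates = [{AB, DE}] * 6 + [{AC, DE}] * 3 + [{AB, CD}]

frequencies = split_frequencies(replicates)
threshold = 0.2
displayed = {s: f for s, f in frequencies.items() if f >= threshold}
conflicts = [(a, b) for a, b in combinations(displayed, 2)
             if not are_compatible(a, b)]

for split, freq in displayed.items():
    print("".join(sorted(split)), freq)
for a, b in conflicts:
    print("conflict:", "".join(sorted(a)), "vs", "".join(sorted(b)))
```

A bootstrap consensus tree would show only the split AB with a support of 0.7. The network also shows its competitor AC at 0.3, which is the information one needs to judge how robust the preferred topology is.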

Consensus networks should also be obligatory for those molecular data where even probabilistic methods fail to find a single fully resolved, highly supported tree.

If the editors of Cladistics are really dedicated to parsimony, they should not insist only on a parsimony tree (often provided as a cladogram), but should also require parsimony-based networks:

strict Consensus networks to summarize the MPT samples instead of the standard strict Consensus cladograms;
bootstrap Support consensus networks showing the signal strength and support for alternative trees/competing clades (TNT has many bootstrapping options to play around with); and
Median networks and such-like for datasets with few mutations, and low levels of expected homoplasy.

This is what the 2016 #parsimonygate uproar (see Part 1) should have been about (12 years after Neighbor-nets, and 11 years after Consensus networks). Not the prioritizing of parsimony, but the naivety or ignorance towards pitfalls of (parsimony or other) trees inferred from data not providing tree-like signal or riddled by internal conflict.
This is a problem not limited to Cladistics, but found, in my modest experience in professional science (c. 20 years), in many other journals as well (e.g. Bot. J. Linn. Soc., Taxon, Mol. Phyl. Evol., J. Biogeogr., Syst. Biol., Nature, Science).

Hence, here are my suggestions for future conference buttons, instead of those shown in Part 1.


No Cladograms!	Use Neighbour-nets!	Support Consensus Networks as obligatory!

Further reading for those who mistrust trees or become network-curious in general

In this blog, under the label "EDA" you will find all sorts of data-display / data-explaining networks, biological and non-biological ones; and the labels "neighbor-net" and "consensus networks" will point you to posts using these networks.
For problem trees – ancestor-descendant relationships, see this recent post and the posts linked there. In this context, don't miss our posts on median networks.
The label "treelikeness" brings you to posts questioning trees inferred from non-treelike data.
The labels "cladistics" and "philosophy" also include more conceptual posts in our striving for less tree-thinking and more network-thinking.
The labels "phylo-networks" and "branch support" collect similar posts on my science-and-other-stuff blog Res.I.P.

Monday, November 5, 2018

A bit of heresy: networks for matrices used in Cladistics studies

[This is Part 1 of a two-part topic – this one is Historical matrices from the 1980s]

When I first came into contact with phylogenetics (usually based on morphological data sets, back then) and after reading Hennig's book (the original German version, published in 1950), I dreamed about publishing in Cladistics, the journal of the Willi Hennig Society (WHS). I never did. In this post, I show why.

Later on, in 2016, Cladistics achieved renewed fame due to an editorial that triggered a twitter uproar under the hashtag #parsimonygate. A lot of people were shocked to read in the editorial that the journal (still) prefers and requires parsimony-based inferences (in fact, parsimony-based trees). Some people, like Joe Felsenstein, were not at all surprised. I wasn't either, because Cladistics is the journal of the Willi Hennig Society (WHS), which has always been dedicated to parsimony: "Ockham told Popper told Hennig to use parsimony" (see the historical summary by Felsenstein in Systematic Biology, 2001; free access).

Historical buttons that you (allegedly) could get at meetings of the WHS. Left: Joe Felsenstein; right: L for Likelihood. Just a gag, of course! Nothing serious behind it.

In the good old days, when the "Phylogenetic Wars" were still on (in the 1980s, petering out in the 90s), they would invite a probability-ist to their conference to tear him down. My first phylogenetic paper (2002) got a negative review (ie. rejection, invitation to resubmit) by a WHS member solely because it did not include a parsimony tree, which he described as "standard these days". More recently, they ensured free access to TNT, the current main software for doing parsimony analysis and an essential tool for many palaeontologists.

I stopped using parsimony trees very early in my career, but I'm still a great fan of the family of methods based on median networks, which operate under the same parsimony criterion (Clades, Cladograms, ...; Using Median networks ...). Fate exposed me early to the Neighbor-nets, which can be used as a quick check of how tree-like the signal is in data matrices, to start with.

The thing that bugged me most concerning many journals, including Cladistics, is not a focus on parsimony, but the lack of data documentation and easy data access. To me, it seems natural to use a service like TreeBASE, when my main dedication is to tree-inference. TreeBASE allows you to provide your data and inferred trees to the general public in the common NEXUS format, so that other people can make use of it.

Luckily, some authors of Cladistics upload their data (about one study per 1–3 years). So, here are some data-display networks showing the strengths and weaknesses of the parsimony trees in the original publications, which have been randomly selected from among the oldest ones and the newest ones (I found) in TreeBASE. I won't discuss the actual results, as Cladistics is pay-walled, so just enjoy the graphs.

The oldest one (in my list), Dahlgren & Bremer 1985, TreeBASE submission number S231

The submission (a binary matrix, including some missing data; published in the first volume of Cladistics) comes with three angiosperm trees: one composite order-level tree, plus two empirical trees labelled as "Fig. 2" and "Fig. 3" using the family-level OTUs in the matrix. The latter two look like this:

Connected cladograms of "Fig. 2" and "Fig. 3", the result of two parsimony analyses. Jumping taxa/clades highlighted with colours.

That the matrix is not only highly homoplasious (CI = 0.28) but has a severe signal problem, becomes obvious when inferring a NJ tree, providing a third topology.

A NJ tree (fulfilling least-squares optimality criterion for phylogenetic trees) from the same matrix: blue, branches incongruent among the original trees and the NJ tree. Color coding: light blue, branch congruent to "Fig. 2" tree (different in "Fig. 3" tree); green, branch found in all three trees; red, branch incongruent to consistent placement in both original trees.

Not surprisingly, the Neighbor-net inferred from simple (mean) Hamming distances is a spider-web, as the matrix' signal is not tree-like at all — all non-green branches above, or their conflicting alternatives, receive low to very low bootstrap support, independent of the optimality criterion used.

The Neighbor-net inferred from Dahlgren & Bremer's matrix.

Despite its spider-web structure, we do learn quite a lot from the Neighbor-net regarding what is behind the clades in the original trees. For example, we can overlay a Dahlgrenogram representing the top-most subtree of the "Fig. 2" tree.

Blue, red and yellow fields denote (sub)clades in Dahlgren & Bremer's "Fig. 2" tree that compose the top clade (grey).

The same could be done for all the other clades.

TreeBASE submission S329, worms (Oligochaeta) by Jamieson et al. (1987)

The more perfect a character matrix is regarding tree-inference (ie. the more tree-compatible its characters), the more similar the NJ and parsimony trees will be (as will any other tree, under any other optimality criterion), as we can see in this second example, published in the third volume of Cladistics.
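What "tree-compatible" means is easiest to show for binary characters, where pairwise compatibility can be checked with the so-called four-gamete test: two characters fit on the same tree without homoplasy unless all four state combinations (00, 01, 10, 11) occur among the taxa. This is a minimal sketch with invented characters; the matrix discussed here is multistate and would need a more general test.

```python
from itertools import combinations


def pairwise_compatible(char_a, char_b):
    """Four-gamete test for two binary characters scored across the same taxa.

    Taxa with missing data ("?" or "-") in either character are skipped.
    """
    seen = {(a, b) for a, b in zip(char_a, char_b) if a in "01" and b in "01"}
    return len(seen) < 4


# Hypothetical binary characters, each scored for the same five taxa.
characters = {
    "c1": "11000",
    "c2": "11100",
    "c3": "01100",
}

results = {
    (x, y): pairwise_compatible(characters[x], characters[y])
    for x, y in combinations(characters, 2)
}

for pair, ok in results.items():
    print(pair, "compatible" if ok else "incompatible")

share = sum(results.values()) / len(results)
print(f"{share:.0%} of character pairs are compatible")
```

In a matrix where (nearly) all character pairs are compatible, every sensible method has to find (nearly) the same tree.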

The tree (the abstract notes a single most-parsimonious tree) was inferred from a multistate matrix with up to seven states, possibly including some characters that should be treated as ordered, but such specifics are not included in the original NEXUS file, so we will treat them as unordered.

Aside from grades becoming clades (and vice versa), the published tree (unordered: 102 steps, high CI = 0.81, RC = 0.53) and the NJ tree are quite similar, even regarding their relative branch-lengths.

Two phylograms: left, the original MPT, right, a NJ tree, shared branches in green, (partly) conflicting ones in orange. Cladists address the left tree as "phylogenetic", the right one as "phenetic", but both are equally valid solutions using different optimality criteria.

Moreover, the Neighbor-net is much less complex than in the previous examples, with individual edges corresponding to branches in both trees — Neighbor-nets are truly meta-phylogenetic graphs.

Splits found in the original MPT in green, when corresponding with edges in the Neighbour-net, and orange, when there is no corresponding edge (according to the abstract, the authors discuss alternatives to certain branches in their tree). Edges found in the NJ tree (providing an alternative topology/phylogenetic hypothesis) in blue.

Submission S349, an amniote phylogeny by Gauthier et al. (1988)

This is a matrix much to my liking, as it includes extinct taxa, with quite impressive dimensions (computers back in 1988 were awfully slow): 316 characters with up to four states for 31 taxa. Naturally, it includes a lot of missing data, as do all fossil-including matrices.

Missing data is potentially a bigger problem for distance-based approaches than for character-based ones like parsimony, maximum likelihood or Bayesian inference: when there is little character overlap between the fossil taxa, their pairwise distances will be distorted. Missing data can be an equal problem for tree-inference: depending on which characters are missing, many different topologies are equally optimal, or nearly so. In Gauthier et al.'s matrix 10% of the characters are parsimony-uninformative.

Similar to the angiosperm matrix, Gauthier et al.'s tree has a relatively low CI (0.45) and RC (0.33), i.e. there is homoplasy adding to the missing data as a source of incompatible, tree-unlike signals.

Just by comparing the NJ tree to the parsimony tree, we can see that distance distortion because of missing data is no big deal for this matrix.

The trees are largely congruent, with three striking exceptions: the birds (Aves), the crocodiles (Crocodylia) and turtles (Testudines) are not placed as sisters to the lineage leading to modern-day mammals (tree provided by Gauthier et al.), but fall in the "dinosaur"-only clade in the NJ tree (compare with the current Tree of Life: Archosauria). This makes sense (data-wise), because in Gauthier's matrix the taxon pairs Aves + Ornithosuchia and Crocodylia + Pseudosuchia are identical in their shared defined characters (ie. zero-distance pairs). Obviously, the parsimony tree comes with some implicit assumptions: the unweighted/unordered single most-parsimonious tree PAUP* infers for the matrix using the branch-and-bound algorithm has only 510 steps, a higher CI (0.66) and RC (0.59), and is largely congruent with the NJ tree; except that Captorhinidae and Testudines are sisters and Casea, Ophiacodon and Edaphosaurus form a grade not a clade.

As in the other cases so far, the Neighbor-net well captures the actual data situation.

Blue edge bundles refer to splits shared with both the NJ tree and the (inferred, not provided) MPT. Note that some splits in the NJ tree and/or the MPT have no counterpart in the Neighbour-net. One split found in the MPT but not in the NJ tree has a corresponding edge in the Neighbour-net (light blue).

The thin "upper trunk" in the Neighbor-net further shows that the matrix provides a strong signal for an increase of shared derived ('mammalian') and decrease of shared ancestral ('reptilian') traits, which is a bias. Although the MPT and NJ tree agree well, the matrix provides clear tree-like signal only for terminal relationships in the other main, inferred clade. The thinning trunk may also indicate a taxon sampling issue. Well-sampled phylogenetic data sets usually result in more star-like networks (see eg. graphs in this post on fossil and extant walnuts, dinosaurs, spermatophytes, or the above ones and the next one) in contrast to non-phylogenetic data sets (see eg. the posts on breast sizes, airlines, or moons).

Take-home message in the middle of the film

Even though they are arbitrary choices, the three matrices above show what phylogeneticists had to work with in the 1980s morphological datasets:

... trapped in homoplasy (Dahlgren & Bremer, 1985): datasets in which phylogenetic relationships were obscured behind highly ambiguous, non-treelike signal;
... asking for a model (Jamieson et al., 1987): datasets with partly consistent signal, but not consistent enough to result in the same tree independent of the optimality criterion;
... encoding a tree (Gauthier et al., 1988): datasets tweaked to promote a certain evolutionary hypothesis, including (superficially) simple series of gradual evolution and ancestor-descendant pairs (see Trivial data, not so trivial graphs). Such data will result in a single optimal tree (method independent!) dominated by staircase-like subtrees. This may be fine for a cladist, but nothing a phylogeneticist / evolutionary biologist could really be content with (not in the 1980s, or before 1950).

Top, two phylogenetic trees sketched by Darwin; bottom, Hilgendorf's (1866) phylogenetic tree. There are quite a few before 1950 (eg. Pojárkova, 1933, Acta Institute of Botany, Academy of Sciences of the USSR, ser. 1, 1: 225–374; unfortunately, I have no copy/scan).