Tuesday, August 29, 2017

More non-treelike data forced into trees: a glimpse into the dinosaurs

Plant morphological data sets including fossil taxa can be riddled with incompatible data patterns (e.g. see my first post), and this can be a bit mind-blowing when it comes to tracing evolution over time. So, let’s move on to something potentially simpler: extinct groups of animals.

Until a time-machine is invented, phylogenetic hypotheses for groups such as the many extinct lineages of dinosaurs will have to be based on morphological data sets. Dinosaur fossils are nowhere near as frequent as plant fossils (often isolated organs); but when a complete or partial skeleton is found, this specimen allows scoring more characters than is possible for even a higher-level composite plant taxon. For instance, the largest (character-wise) plant data matrices, using composite taxa and operating at the level of genera and above, including fossils, have a little over 100 characters, whereas dinosaur matrices like the one used by Tschopp, Mateus & Benson (2015) can have several hundred characters.

Classification of dinosaurs tries to apply the principles of ‘cladistics’ (see also http://tolweb.org/Dinosauria), a classification system established by Hennig (1950). Cladistic classification – Hennig did not propose any inference framework – aims to identify exclusively shared derived traits (synapomorphies), and consequently groups of taxa (originally species) that share an inclusive common origin, Hennig's “monophyla”. [In contrast to Haeckel’s (1866) concept of monophyletic groups, which just assumed a common origin, but did not require inclusiveness.] For some reason, which seems to have no scientific basis, but can be understood in a historical context (Felsenstein 2001, 2004: chapter 10), cladistics has been synonymised with parsimony analysis, one of the optimality criteria to infer one-dimensional graphs reflecting a series of dichotomous splits (phylogenetic trees). A basic assumption of cladistic studies is that a clade in a parsimony-inferred tree equals a monophylum (which is not necessarily the case, see e.g. Scotland & Steel 2015 for binary data).

In palaeontology (and systematic biology to some degree) it is common not to show a phylogram, a phylogenetic tree with branch-lengths, but a cladogram. These cladograms rarely depict the optimised (or one of the equally optimal) tree(s), but instead show the strict consensus tree of the found equally parsimonious trees (or potentially most-parsimonious trees) (MPTs). This is also the case for the study by Tschopp et al., used here as an example of the generally non-treelike data used in studies dealing with extinct groups of animals.

David provided a list of questions for exploratory data analysis (EDA), which can (and should) be asked when trying to infer phylogenies based on morphological data. I will look at some of them here.

First question: Are the data tree-like?

The data matrix of Tschopp et al. is impressive (much like the paper itself, with its 298 pages). The authors scored 477 characters (243 new) for (a final set of) 81 “operational taxonomic units” (OTUs). The OTUs are typically specimens in the case of the ingroup, and include several outgroup species for rooting the phylogenetic tree. There are lots of gaps in the matrix (65% missing data), which relates to the inclusion of poorly known fossil specimens, which the authors tried to classify using parsimony inference and pairwise distances. The authors note (p. 163): “Given the low consistency index (CI) and thus high number of homoplasies in the dataset, an additional analysis with the same settings was conducted using implied weighting (iw).” In addition to signal ambiguity related to general homoplasy and ontogeny, the authors note character overlap effects and deformation (pp. 166ff). So, there are quite a few different sources of incompatible, non-treelike signal.

With equal weighting and including all 81 OTUs, the authors ended up with 60,000 equally parsimonious trees (possibly more; this was the maximum number allowed by computational constraints). This produced a strict consensus (SC) tree with just 12 nodes, in which “all ingroup specimens formed one large polytomy”. ‘Implied weighting’ led to a slightly more resolved SC tree. ‘Implied weighting’ is an a posteriori means of down-weighting characters that conflict with the inferred tree. The authors further identified some (4, 8, or 15) OTUs accounting for most of the “instability”. A posteriori filtering of these putative rogue taxa led to SC trees that were much better resolved (Fig. 1).
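
To make the two quantities mentioned here concrete, the following is a minimal sketch of the ensemble consistency index (minimum possible steps divided by observed steps) and of the fit function commonly used for implied weighting, k / (k + extra steps). The step counts and the concavity constant k = 3 are hypothetical illustration values, not the settings or results of Tschopp et al.

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble CI: total minimum possible steps / total observed steps."""
    return sum(min_steps) / sum(observed_steps)


def implied_weight_fit(min_steps, observed_steps, k=3.0):
    """Per-character fit k / (k + extra steps); more homoplasy means lower fit."""
    return [k / (k + (obs - m)) for m, obs in zip(min_steps, observed_steps)]


# Hypothetical example: four binary characters (minimum of 1 step each)
# that need 1, 2, 4 and 7 steps on some tree.
min_steps = [1, 1, 1, 1]
observed_steps = [1, 2, 4, 7]

ci = consistency_index(min_steps, observed_steps)
fits = implied_weight_fit(min_steps, observed_steps, k=3.0)
print(round(ci, 3), [round(f, 3) for f in fits])
```

The character without homoplasy keeps a fit of 1, whereas the character with six extra steps drops to a third of that: characters that conflict with the tree count for less.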

Fig. 1 The six strict consensus trees shown by Tschopp et al. The red crosses indicate the OTUs that were pruned from the MPT tree sample to increase the resolution of the SC tree. For the first tree, I added the information on the fraction of missing data (blue dots).

Both tree-like and non-treelike data can collapse strict consensus trees, but the large number of MPTs can be a first indication that the data are not tree-like. The MPT samples inferred by Tschopp et al. are not included in the documentation (following the current standard; see also data uploaded to TreeBase). Using the quick-analysis option in PAUP* (random heuristic search, 100 replicates, CHUCK-options set), I found 3,000 equally parsimonious trees, which are only slightly worse (1983 steps) than the 60,000 MPTs (1979 steps reported) combined in Tschopp et al.’s unweighted cladogram.

Using the consensus network approach (Holland & Moulton 2003) for summarising the parsimony-tree sample (no cut-off value), we can get a first impression of the signal in the matrix (Fig. 2). The data allow for a great number of topological alternatives — they are generally not tree-like. Only a few relationships are unambiguous in this collection. The fan-like topological features (composed typically of low-dimensional boxes) relate to: (a) jumping OTUs (rogue taxa), (b) uncertainty regarding relationships between related OTUs consistently found in the same subtree, and (c) the exact composition of the subtrees. In contrast to the strict consensus tree, the network visualises the tree-unlikeliness of the data expressed in the MPT collection, revealing extremely ‘rogue’-ish OTUs (e.g. Diplodocus_YPM_1922) and OTUs with indiscriminate signal (e.g. FMNH_P25112), and also allows us to qualify the ‘rogueness’ of all other OTUs.

Fig. 2 Strict consensus network (all edge-lengths set to 1) of 3000 equally parsimonious trees, inferred from Tschopp et al.'s matrix. This graph is the network equivalent of the commonly seen strict consensus cladograms (Fig. 1). Note that the tree sample is slightly suboptimal and likely not comprehensive.
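
The difference between a strict consensus tree and a consensus network can be sketched in a few lines. This is an illustration only, not the software used for Fig. 2: the three toy trees and taxon names are hypothetical, and trees are written as nested tuples rather than read from a tree file. A strict consensus tree keeps only the splits found in every tree; a consensus network without cut-off keeps all splits, including mutually incompatible ones (which produce the boxes).

```python
from collections import Counter
from itertools import combinations


def tree_splits(tree, taxa):
    """Non-trivial splits of a tree given as nested tuples of taxon names."""
    taxa = frozenset(taxa)
    reference = min(taxa)  # each split is stored as the side lacking this taxon
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        if 1 < len(clade) < len(taxa) - 1:
            splits.add(taxa - clade if reference in clade else clade)
        return clade

    walk(tree)
    return splits


def compatible(a, b, taxa):
    """Two splits fit on one tree if one of the four side intersections is empty."""
    taxa = frozenset(taxa)
    return any(not (p & q) for p in (a, taxa - a) for q in (b, taxa - b))


taxa = ["A", "B", "C", "D", "E"]
trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    ((("A", "B"), "C"), ("D", "E")),
]

counts = Counter(s for t in trees for s in tree_splits(t, taxa))
strict = {s for s, n in counts.items() if n == len(trees)}  # strict consensus tree
network = set(counts)                                       # consensus network, no cut-off
conflicts = [(a, b) for a, b in combinations(network, 2)
             if not compatible(a, b, taxa)]
print(len(strict), len(network), len(conflicts))
```

In this toy sample only the D+E split survives in the strict consensus, while the network retains the two competing placements of B and C as one incompatible pair of splits.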

One pre-inference measure for tree-likeness is the Delta Value (DV) introduced by Holland et al. (2002); see e.g. Auch et al. (2006) and Göker & Grimm (2008) for applications. The matrix DV is 0.47, which is very high, even for a morphological matrix. The individual DVs (iDV) range between 0.417 and 0.577, which means that no set of OTUs provides a tree-like signal. The complete data are not tree-like, and hence the failure to find unambiguous relationships, even when a comprehensive tree search and ‘implied weighting’ are used (see Tschopp et al. 2015). Extreme iDV (> 0.55) correlate with (relatively) high proportions of missing data (75–98%, i.e. 10–119 defined characters; Fig. 3), indicating that missing data are a problem for inferences and the calculation of the pairwise distance matrix.
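
For readers unfamiliar with Delta Values, here is a minimal sketch of the quartet-based calculation as described by Holland et al. (2002): for each quartet, the three possible pairings of distances are summed and sorted, and delta is (largest − middle) / (largest − smallest). The matrix DV is the mean over all quartets, and an individual DV is the mean over the quartets that include that taxon. The two 4-taxon distance matrices are hypothetical toy cases, and this brute-force loop is only practical for small matrices.

```python
from itertools import combinations


def delta_values(dist):
    """Return (matrix delta, per-taxon deltas) for a symmetric distance matrix."""
    n = len(dist)
    per_taxon_sum = [0.0] * n
    per_taxon_count = [0] * n
    total, quartets = 0.0, 0
    for x, y, u, v in combinations(range(n), 4):
        sums = sorted((dist[x][y] + dist[u][v],
                       dist[x][u] + dist[y][v],
                       dist[x][v] + dist[y][u]))
        spread = sums[2] - sums[0]
        # Tree-like quartet: the two largest sums are equal, so delta = 0.
        delta = (sums[2] - sums[1]) / spread if spread > 0 else 0.0
        total += delta
        quartets += 1
        for taxon in (x, y, u, v):
            per_taxon_sum[taxon] += delta
            per_taxon_count[taxon] += 1
    individual = [s / c for s, c in zip(per_taxon_sum, per_taxon_count)]
    return total / quartets, individual


# Distances that fit the tree ((A,B),(C,D)) exactly.
tree_like = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]

# Distances around a square A-B-C-D: two equally good, conflicting groupings.
square = [[0, 1, 2, 1],
          [1, 0, 1, 2],
          [2, 1, 0, 1],
          [1, 2, 1, 0]]

print(delta_values(tree_like)[0], delta_values(square)[0])
```

Values near 0 indicate tree-like distances, values near 1 indicate maximally box-like ones.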

Fig. 3 XY-plot showing the individual Delta Values (a measure for treelike signal) in relation to the proportion of missing data. The green "comfort zone" indicates iDVs favorable for tree-inference (based on personal experience).

Subsequent question: Why are the data not tree-like?

In his post, David listed four possible reasons for non-tree-like data:
  (a) uninformative data: a “bush”,
  (b) weakly tree-like data: a “tree obscured by vines”,
  (c) data containing several strongly incompatible relationships: a “structured network”,
  (d) confusing or random data: a “spider-web”.
Lacking branch-lengths, the MPT consensus network above provides no information regarding (a), and limited information regarding (b) and (c). Only (d) can be excluded as a main source of non-tree-like signal for the dinosaur data: higher-than-3-dimensional boxes are rare.

Fig. 4 Bootstrap (BS) consensus network based on 10,000 BS (pseudo)replicates. Trivial splits in grey, splits without strong alternatives in blue, conflicting splits (always two alternatives) in red. All splits found in less than 20% of the BS replicates are not shown, and edge lengths are proportional to the split frequencies.

Figure 4 shows the bootstrap support network based on 10,000 parsimony bootstrap pseudoreplicates (generated following Müller 2005). Some terminal sister relationships seen in the original, taxon-reduced, unweighted or weighted SC trees rely on quite robust, unconflicted signal; a few others are only supported by a small fraction of the characters, but all competing alternatives are supported by even fewer (blue edges in the graph). Thus, it is a “Maybe” for (a) (see also Fig. 5), and a “Yes” for (b) (compare Figs 2 and 4). The character suites of many OTUs provide no robust signal to place them; their position in the set of trees is based on the signal of relatively (large matrix!) few characters, or the result of branching artefacts as we force non-treelike data into a tree. The robust signal for some terminal clades may be obscured by ambiguous signal of potential additional members of the clade, or OTUs similar to only part of a clade (the “vines”).
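
The resampling step behind such bootstrap support values can be sketched as follows. This is a generic illustration of non-parametric bootstrapping of characters, not the search strategy of Müller (2005); the toy matrix, OTU names and seed are hypothetical. Each pseudoreplicate draws as many characters (columns) as the original matrix, with replacement; a tree is then inferred from each pseudoreplicate (not shown here), and the frequency of each split across those trees gives its support.

```python
import random


def bootstrap_matrix(matrix, rng):
    """Resample characters (columns) with replacement, keeping all OTUs."""
    n_chars = len(next(iter(matrix.values())))
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {otu: [states[c] for c in columns] for otu, states in matrix.items()}


# Hypothetical toy matrix: OTU name -> character states ('?' = missing).
matrix = {
    "otu_1": "00110?",
    "otu_2": "0101?0",
    "otu_3": "011010",
    "otu_4": "1?1011",
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_matrix(matrix, rng) for _ in range(3)]
for rep in replicates:
    print({otu: "".join(states) for otu, states in rep.items()})
```

Because whole columns are drawn, a character supported by only a few columns will be missing from many pseudoreplicates, which is why weakly supported splits receive low bootstrap frequencies.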

We can also observe some pronounced 2-dimensional boxes: here the signal from the data matrix has no preference for a single alternative, but indicates two competing alternatives (red edges in the graph), i.e. also a possible “Yes” for (c). In the case of morphological data, reticulate signals do not necessarily indicate reticulation in an evolutionary sense. They can be triggered by two (more or less related) lineages evolving into the same morphospace, or the co-existence of ancestral and derived forms (see also this post). No spider-web-like portions (high-dimensional boxes) are seen (and are also largely missing from the MPT consensus network in Fig. 2), so we can exclude chaotic signal as reason (d) for the tree-unlikeliness of the data.

Fig. 5 Neighbour-net splits graph based on pairwise (Hamming) distances computed with PAUP* using the Tschopp et al. matrix.

Figure 5 shows the unfiltered, simple (Hamming) distance-based neighbour-net (NNet) for the same matrix. Mirroring the high matrix DV and iDVs, the NNet has only a few tree-like portions, but nevertheless reflects a high diversity (long terminal edges); pairwise distances range between 0 (no difference in data-covered characters) and 1 (all characters are different). Some OTUs are placed close to or in the boxy centre of the graph or the root trunks of terminal groups. Such a placement is either indicative of ancestry (see my earlier post), which is a special case of reason (c), or a lack of discriminative signal, i.e. reason (a) for non-treelike data. Here, it appears to be mostly the latter: the iDV are high, and the highest iDV relate to high proportions of missing data (more than 75%).

High proportions of missing data do not necessarily result in high DV (here 75% missing data equals c. 120 defined characters, which could be more than enough to place a taxon). But quite a few OTUs have zero pairwise-distances to a set of diverse OTUs that are not closely related. In total, 74 of the 81 OTUs show a zero-distance to at least one other OTU, with Diplodocus YPM 1922 (98% missing data) being the most extreme case of a non-distinct OTU: it has a zero-distance to 66 OTUs, including one outgroup taxon. Such a pattern is impossible from an evolutionary point of view (even an ancestor cannot be identical to all of its offspring when they diversified), and is a missing-data artefact. The NNet resolves this data insufficiency by placing the highly ambiguous OTUs in the centre of the graph, whereas parsimony (or other tree inference) deals with this effectively unsolvable problem by providing some, many, or all theoretically possible placements of the problematic OTU (the OTU turns ‘rogue’) as equally optimal (large fans in Fig. 2) but without support (Fig. 4).
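
How such zero-distances arise can be shown with a small sketch. It assumes that the pairwise distance is the proportion of differing characters among those scored in both OTUs (missing entries are skipped pairwise), which is how I understand the simple mean-character-difference option; the toy matrix and OTU names are hypothetical.

```python
def mean_character_distance(a, b, missing="?"):
    """Proportion of differing states among characters scored in both OTUs."""
    shared = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not shared:
        return None  # no overlapping characters: distance undefined
    return sum(x != y for x, y in shared) / len(shared)


# Three well-scored, mutually distinct OTUs and one very incomplete specimen.
matrix = {
    "well_A": "00110",
    "well_B": "01010",
    "well_C": "01101",
    "sparse": "0????",
}

zero_partners = {
    otu: sum(
        1
        for other in matrix
        if other != otu
        and mean_character_distance(matrix[otu], matrix[other]) == 0
    )
    for otu in matrix
}
print(zero_partners)
```

The sparse specimen is "identical" to all three well-scored OTUs, although these differ from each other in 40–60% of their characters: its single scored character simply does not discriminate between them.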

There are two options to infer phylogenetic trees, or to test alternative evolutionary hypotheses using Tschopp et al.’s matrix with its tree-unlike data.
  1. One is to reduce the taxon set to those OTUs with less than 50% of missing data, to produce a backbone tree or network (matrix DV = 0.28; iDV range between 0.219 and 0.352; Fig. 6), then to evaluate the position (or possible positions) of each other OTU within this backbone (using ‘+1 OTU’ neighbour-nets, parsimony-optimisation or algorithms such as the evolutionary placement algorithm implemented in RAxML; Berger & Stamatakis 2010; Berger, Krompass & Stamatakis 2011). Then finalise with group-restricted taxon and character subsets to study within-group relationships.
  2. The other is to cut the matrix into pieces and taxon sets with good data overlap. Then assess the correlation between these submatrices (e.g. using Pearson’s correlation coefficient) and their tree-likeness (using Delta Values). Then use consensus networks and/or supernetworks to investigate potential incongruences, and to summarise topological alternatives.
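
The first steps of both options can be sketched as follows: select the backbone OTUs by their proportion of missing data, then compare the pairwise distances obtained from two character partitions using Pearson's correlation coefficient. The toy matrix, the 50% threshold as a function parameter, and the split into two blocks of four characters are hypothetical choices for illustration only.

```python
from itertools import combinations
from math import sqrt


def missing_fraction(states, missing="?"):
    return states.count(missing) / len(states)


def backbone_taxa(matrix, max_missing=0.5):
    """OTUs with less than max_missing of their characters unscored."""
    return [otu for otu, s in matrix.items() if missing_fraction(s) < max_missing]


def distance(a, b, missing="?"):
    """Proportion of differing states among characters scored in both OTUs."""
    shared = [(x, y) for x, y in zip(a, b) if missing not in (x, y)]
    return sum(x != y for x, y in shared) / len(shared)


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant distances")
    return sxy / sqrt(sxx * syy)


matrix = {
    "A": "00000000",
    "B": "00110011",
    "C": "11111101",
    "D": "1???????",  # 7 of 8 characters missing: excluded from the backbone
}

backbone = backbone_taxa(matrix, max_missing=0.5)
pairs = list(combinations(backbone, 2))
block_1 = [distance(matrix[a][:4], matrix[b][:4]) for a, b in pairs]
block_2 = [distance(matrix[a][4:], matrix[b][4:]) for a, b in pairs]
r = pearson(block_1, block_2)
print(backbone, round(r, 3))
```

A low correlation between partitions would be a hint that they carry different signals and should be explored separately before being combined.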

Fig. 6 Neighbour-net (NNet) for a taxon-reduced set, only including OTUs with more than 50% of defined characters. These data result in a single most-parsimonious tree, which is largely congruent with the main splits in the NNet (blue), except for three poorly supported branches (red). Numbers indicate neighbour-joining and parsimony bootstrap support for branches in the MPT and corresponding edges in the NNet and their alternatives.

Palaeontologists: Please stop using strict consensus trees, and start with EDA

To fill the deeper parts of the Tree of Life with life, we cannot get around morphological data and phylogenetic inferences based on these data. Most of Earth’s diversity is extinct, so their molecular data are (largely) lost to science. But no matter whether we work with extinct plants or animals, or with matrices containing many or few morphological characters, we should keep a close eye on the primary signals in those matrices. Are the data tree-like? Are there rogue taxa, and how/why do they affect the inferences? How discriminatory are the data regarding competing alternative hypotheses? Does taxon and character sampling matter? Networks (planar or n-dimensional) can help to: (1) assess the potential of the data for tree inference, and (2) discuss the putative monophyly of groups and their alternatives.

The signal from morphological data matrices is complex, and the data are rarely tree-like. Irrespective of whether one wants to stick with parsimony or not, tree-based and support consensus networks should by now have long replaced the strict (or majority-rule) consensus trees in “cladistic” or general-phylogenetic studies dealing with extinct groups of organisms.

A posteriori methods to filter or down-weight characters not fitting the inferred tree(s) ignore the fact that morphological differentiation typically cannot be explained by a single tree (leaving aside that total evidence and DNA-constrained analyses demonstrate that morphological evolution is not parsimonious at all). There are too many sources of signal incompatible with the true tree.

In the light of ambiguous and potentially biased signals (outlined and discussed by Tschopp et al. 2015 for their data), the focus of cladistic or other phylogenetic studies that aim to fill the Tree of Life with extinct branches cannot be to infer a clean(ed) tree. Instead, the focus should be on exploring the signals in the data and assessing their capacity to exclude or support evolutionary scenarios. A well understood topological uncertainty is always better than a poorly supported clade.

Regarding the Tree of Life, we should start representing uncertainty as-is (i.e. showing the currently competing alternatives), and reserve polytomies for cases where we really have no idea at all. Also, we should place potential ancestors (ancestral forms) where they belong: at the root nodes of their descendant lineages (the forms derived from them).


Auch AF, Henz SR, Holland BR, Göker M. (2006) Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences. BMC Bioinformatics 7:350.

Berger SA, Krompass D, Stamatakis A. (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under Maximum Likelihood. Systematic Biology 60:291–302.

Berger SA, Stamatakis A. (2010) Accuracy of morphology-based phylogenetic fossil placement under Maximum Likelihood. IEEE/ACS International Conference on Computer Systems and Applications (AICCSA). Hammamet: IEEE. p. 1–9.

Felsenstein J. (2001) The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.

Felsenstein J. (2004) Inferring phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Göker M, Grimm GW. (2008) General functions to transform associate data to host data, and their use in phylogenetic inference from sequences with intra-individual variability. BMC Evolutionary Biology 8:86.

Haeckel E. (1866) Generelle Morphologie der Organismen. Berlin: Georg Reiner.

Hennig W. (1950) Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag.

Holland B, Moulton V. (2003) Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176.

Holland BR, Huber KT, Dress A, Moulton V. (2002) Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19:2051–2059.

Müller KF. (2005) The efficiency of different search strategies for estimating parsimony, jackknife, bootstrap, and Bremer support. BMC Evolutionary Biology 5:58.

Scotland RW, Steel M. (2015) Circumstances in which parsimony but not compatibility will be provably misleading. Systematic Biology 64:492–504. [preprint]

Tschopp E, Mateus O, Benson RBJ. (2015) A specimen-level phylogenetic analysis and taxonomic revision of Diplodocidae (Dinosauria, Sauropoda). PeerJ 3:e857.

Post-script: Why distance-based approaches?

Distance-based approaches may be still refuted by hard-core cladists as “unphylogenetic” or “phenetic” (again, see Felsenstein 2004 for the historical reasons, and why this is wrong), particularly when acting as anonymous reviewers of palaeontological papers. But the simple fact is: a character matrix not allowing inference of a pairwise distance matrix with at least some tree-like signal, should not be used to infer phylogenetic trees (no matter which optimality criterion is used).

A perfect character matrix, i.e. a matrix in which each dichotomy is subsequently followed by one or several strictly synapomorphic changes will, of course, result in a single MPT. But it will also provide a simple (Hamming) mean distance matrix allowing us to infer a neighbour-joining tree fulfilling the least-squares or minimum evolution optimality criteria, and this will be identical to the MPT and a corresponding NNet without any box-like portions. It will also be the most probable topology that can be inferred using maximum likelihood or Bayesian inference.

When different tree inference methods come to substantially different results for morphological matrices, the signal from the primary matrix is likely not to be tree-like, and internal conflict then needs to be explored. The more tree-like the matrix, the less it will be affected by methodological differences (e.g. Fig. 6; the only branches of the MPT not fitting the preferred splits in the NNet have low support, and compete with equally low supported splits seen in the NNet that receive high support from NJ-bootstrapping).

Distance-based analyses are much faster than parsimony, maximum likelihood, and Bayesian inferences; and they are not restricted to inferring phylogenetic trees. Within the same time that I need to perform a comprehensive tree and branch support analysis, I can generate hundreds of NNets using different taxon and character subsets of my matrix, and thus explore its many signals. One can employ different distance measures to deal with continuous or ordered categorical data, and then directly see the effect on the reconstruction. Eventually, one may find a subset that provides the most tree-like signal, which will be the best possible basis for the final tree-inference (in case an evolutionary tree is what is wanted) and branch support analysis.
