A general assumption in phylogenetics is: the more the better. The more data my matrix includes, the better will be my tree. The more taxa I include, the better will be my phylogenetic analysis. But is this true when we include (or rely on) fossils? After all, there is an old saying: less is more; and in this post I will show you that it is often true here, too.
Perfect data – how to recognize unproblematic topologies
In the first post of this series (Farris and Felsenstein), I introduced two matrices, a Farris Zone matrix and a Felsenstein Zone matrix, with the same set of tip taxa: three extant genera and three early fossils, one for each generic lineage.
The Farris Zone matrix provides a perfect signal. No matter which inference criterion one uses, one always gets the true tree. In such a case, the taxon sampling should be irrelevant; and it is. Any 5-taxon sub-tree correctly shows only splits found in the 6-taxon true tree — shown below are the actual most parsimonious trees (MPT) of each inference using the branch-and-bound algorithm.
|Six most-parsimonious trees showing the topology of the true tree; trees are midpoint-rooted and have the same scale.|
Note: NJ/LS and ML would give the same result for this experiment.
Consequently, for the perfect case, the SuperNetwork of the six 5-taxon trees is the 6-taxon true tree.
|Z-closure SuperNetwork (Huson et al. 2004) of the 5-taxon MPTs generated with SplitsTree (walkthrough at the end of the post) depicting the true tree.|
Therefore, the simplest test to check for potential topological issues in any set of data is to sub-sample the taxa by sequentially pruning a single taxon, infer the resulting group of trees (which I will call minus-one trees), and then summarize this tree sample in the form of a SuperNetwork. If the data have no signal issues – and the inferred all-inclusive tree is unbiased – all minus-one trees will be congruent with the all-inclusive inferred tree. The resulting SuperNetwork will then be a tree matching the inferred all-inclusive tree.
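The pruning step can be sketched in a few lines. This is only a rough illustration, assuming the matrix is held as a Python dictionary mapping taxon names to character strings; the function name and the toy scores are made up, and the tree inference itself would still be done in your program of choice.

```python
def minus_one_matrices(matrix):
    """Yield (dropped_taxon, sub_matrix) pairs, pruning one taxon at a time."""
    for dropped in matrix:
        pruned = {taxon: chars for taxon, chars in matrix.items() if taxon != dropped}
        yield dropped, pruned

# Toy matrix with made-up scores (hypothetical, not one of the matrices used in this post)
toy = {"A": "0110", "B": "0111", "C": "0010", "D": "0011", "O": "1000", "Z": "0000"}

# One sub-matrix per dropped taxon, keyed by the name of the taxon that was removed
subsets = dict(minus_one_matrices(toy))
```

Each of the six sub-matrices would then be analysed separately, giving the minus-one tree sample.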
On the other hand, if removing a single taxon has a significant effect on the inferred tree, then this either means you need this taxon to get the right tree or that this taxon is causing bias. We cannot assume that trees with many taxa are better than trees with fewer taxa. Only if a topology is independent of taxon sampling can we be sure that we are looking at a true tree (or one inevitable with the data at hand).
Taxon-sampling matters? Then the all-inclusive tree may be biased
Real data matrices are far from perfect. Paleophylogenetic matrices, for instance, not only include a lot of missing data, which limits the decision capacity of any phylogenetic inference, but, being restricted to morphological traits, usually also show high levels of homoplasy, that is, similarity that conflicts with, or only partially agrees with, the phylogeny (here are some related posts: Has homoiology been neglected in phylogeny? Should we bother about character dependency? Please stop using cladograms! The curious case[s] of tree-like matrices with no synapomorphies and More non-treelike data forced into trees: a glimpse into the dinosaurs). While some OTUs are primitive in their character suites, others are highly derived. Often, without realizing it, we are inferring within or close to the Felsenstein Zone.
If we repeat the same minus-one experiment, but now use the Felsenstein Zone matrix instead, we end up with something quite different. We get three most-parsimonious tree (MPT) solutions each when eliminating the outgroup genus O or its fossil Z; eliminating the genera A and B or their fossils C and D, respectively, leads to a single MPT in each case. This yields a total of 10 MPTs.
|First row rooted with Z, all other trees mid-point rooted. All trees have the same scale.|
By pruning the long-branching genera A or B, even parsimony analysis gets the correct tree because we have eliminated the source of the long-branch attraction. Adding fossils to break down long branches can be effective (classic paper: Wiens 2005), but dropping long-branching tip taxa works just as well. Changing between a close outgroup (fossil Z) and a distant outgroup (genus O) has little benefit here.
In this case, the resulting SuperNetwork of our 10 MPTs is not a tree but a network that includes alternative clades: wrong ones (orange), ie. not monophyletic, and correct ones (green), ie. branches (internodes, bipartitions) reflecting the monophyletic lineages of the true tree.
|Comprehensive Z-closure SuperNetwork of the 10 minus-one MPTs inferred based on the Felsenstein Zone matrix. The network includes all split patterns found in the MPT sample.|
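The same kind of conflict can be checked without drawing a network, by comparing the splits of a minus-one tree against those of the all-inclusive tree. The sketch below is a minimal illustration and not what SplitsTree does internally: it assumes Newick strings without internal node labels, the function names are my own, and the three topologies are hypothetical examples rather than the trees inferred above.

```python
import re

def clades(newick):
    """Leaf sets of all subtrees; assumes a parenthesised tree without internal labels."""
    text = re.sub(r":[^,();]+", "", newick)  # drop branch lengths
    stack, found = [], []
    for tok in re.findall(r"[(),]|[^(),;\s]+", text):
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.append(frozenset(done))
            if stack:
                stack[-1] |= done  # leaves of a subtree also belong to its parent
        elif tok != ",":
            stack[-1].add(tok)
    return found

def splits(newick):
    """Non-trivial bipartitions (at least two taxa on each side) of a tree."""
    found = clades(newick)
    taxa = max(found, key=len)
    return {frozenset((c, taxa - c)) for c in found if 1 < len(c) < len(taxa) - 1}

def compatible(s1, s2):
    """Two splits are compatible if, on their shared taxa, one pair of sides is disjoint."""
    (a, b), (c, d) = tuple(s1), tuple(s2)
    common = (a | b) & (c | d)
    a, b, c, d = (side & common for side in (a, b, c, d))
    return any(not (x & y) for x in (a, b) for y in (c, d))

def conflicting_splits(full_tree, minus_one_tree):
    """All pairs of splits (full tree, minus-one tree) that contradict each other."""
    return [(s, t) for s in splits(full_tree) for t in splits(minus_one_tree)
            if not compatible(s, t)]

# Hypothetical topologies for illustration only
full_tree = "((A,B),(C,D),(O,Z));"        # all-inclusive tree
agreeing = "(B,(C,D),(O,Z));"             # minus-A tree congruent with it
disagreeing = "((B,D),C,(O,Z));"          # minus-A tree with a conflicting clade

no_conflicts = conflicting_splits(full_tree, agreeing)
conflicts = conflicting_splits(full_tree, disagreeing)
```

An empty result means the minus-one tree is congruent with the all-inclusive tree; any reported pair points to a branch that depends on taxon sampling.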
A real-world example
To give an example of how sequentially dropping one taxon works with real-world data, we'll use the exhaustive 700-character matrix for bird-related dinosaurs provided by Hartman et al. (2019).
With its total of 501 taxa (OTUs), the apparent rationale behind the matrix is that, by including as many taxa as possible, one gets the best-possible (parsimony) trees, irrespective of the signal quality provided by individual OTUs. However, the full matrix cannot be forced into a single optimal parsimony tree, due to missing data (72% of the matrix's cells are undefined or ambiguous, ie. 255969 cells) and a scarcity of synapomorphies (in a Hennigian sense). This is discussed in Hartman et al.; see also the related Q&A.
Here, in light of the computational effort and to avoid heuristics when searching for the MPTs, we'll use a pruned sub-matrix. For our first experiment, we take 15 of the 19 best-covered OTUs: OTU pairs / triplets that are much more similar to each other than to any other OTU are reduced to the best-covered representative.
The 19-taxon matrix that I used in a previous post (Large morphomatrices – trivial signal) had only one most-parsimonious tree solution, showing only clades in agreement with current opinion, which assumes a largely staircase-like evolution from dinosaurs to modern birds (Tree of Life). In contrast to the full matrix, the 19-taxon matrix provided high support for most clades (method-independent), reflecting the number of scored traits. The extant taxa, representatives of modern birds (duck, turkey and ostrich, all edible), have many derived characters, with the extinct bird genus Lithornis being placed in-between ostrich and duck + turkey.
|The optimal topologies for the 19 best-covered taxon matrix. Green, the single most-parsimonious tree. Clade names copied from Wikipedia/Tree of Life.|
The ML and NJ/LS (except for one branch) trees were topologically identical; each branch is supported by about 100 inferred changes. The signal from the matrix should be straightforward.
The SuperNetwork with tree-size weighted mean edge lengths (the default in SplitsTree), which summarizes the results of exhaustive branch-and-bound searches on the 15 minus-one matrices (each resulting in a single optimal MPT), has a tree-like structure.
Conflicting clades are found in only two of the 15 inferred MPTs, being represented by short branches (their length in the other 14 trees is counted as zero).
Nonetheless, these conflicts received considerable character support. The frequency of a split in the minus-1 tree sample is irrelevant (see the A-B LBA problem discussed above — any tree including A and B showed the wrong clade). When summarizing our tree sample (especially when using MPTs), we should hence opt for a SuperNetwork, in which the edge lengths give the minimum branch lengths found in the MPT collection, ie. the edge length reflects the minimum length of the branch in all trees showing that branch.
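To see why the choice of edge-length summary matters, here is a simplified sketch. It is not SplitsTree's implementation: it uses a plain unweighted mean that counts a missing branch as zero, the function name is my own, and the branch lengths are invented numbers.

```python
def summarize_edge_lengths(trees):
    """trees: list of dicts, each mapping a split label to its branch length in one tree."""
    all_splits = set().union(*trees)
    summary = {}
    for split in all_splits:
        present = [tree[split] for tree in trees if split in tree]
        summary[split] = {
            # shortest length among the trees that actually show the branch
            "min": min(present),
            # average over all trees, counting trees without the branch as zero
            "mean_absent_as_zero": sum(present) / len(trees),
        }
    return summary

# Hypothetical sample: 15 trees, one of which shows a rare, conflicting branch
trees = [{"common": 100, "rare": 30}] + [{"common": 100} for _ in range(14)]
summary = summarize_edge_lengths(trees)
```

With these made-up values, the rare branch shrinks from 30 to 2 under the mean, while the minimum keeps its full length of 30, so the conflict stays visible in the graph.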
|Same SuperNetwork as above, but using the "Min" option instead of the default setting for computing edge lengths.|
Without Dromiceiomimus – representing an earlier diverged lineage and step in bird evolution – the Dromaeosauridae clade, which is probably monophyletic (Wikipedia), flips and dissolves into a grade. By removing the intermediate step, we seem to create some ingroup-outgroup (long-branch) attraction.
Anas, the duck, forms the morphological link to Lithornis – with a mean morphological pairwise Hamming distance (MD) of 0.23, Anas is the most-similar OTU; and, hence, the MPT places Lithornis as sister to Anas + Meleagris (turkey; MD = 0.17). By eliminating Anas, the remaining contemporary birds form a clade — the modern birds (Neornithes) are assumed to be monophyletic but do not form a clade in the all-inclusive MPT (Struthio, the ostrich, is morphologically more distant from duck, turkey and Lithornis).
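For readers who want to compute such distances themselves, a minimal sketch follows. It assumes that "?" and "-" mark missing or inapplicable cells and that only characters scored in both OTUs are compared; polymorphic scores are not handled, and the values given in this post may have been obtained with different settings. The function name and the toy strings are made up.

```python
MISSING = {"?", "-"}  # assumed symbols for missing / inapplicable cells

def mean_hamming(a, b):
    """Proportion of differing characters among those scored in both OTUs."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in MISSING and y not in MISSING]
    if not pairs:
        return None  # no character scored in both OTUs
    return sum(x != y for x, y in pairs) / len(pairs)

# Toy character strings (hypothetical, not taken from the Hartman et al. matrix)
distance = mean_hamming("0101?", "0111-")
undefined = mean_hamming("??", "01")
```

Here four characters are scored in both strings and one of them differs, giving a distance of 0.25; the second call returns None because nothing can be compared.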
Even the most comprehensive, least gappy of paleophylogenetic matrices have substantial signal issues. If a tree inference is dependent on which OTUs are sampled, we cannot assume that we will automatically get better trees simply by including everything we have. Some OTUs (in our experiment: Dromiceiomimus) will stabilize correct aspects of a tree, while others will manifest bias or error (here: Anas). It's unlikely that a wrong, ie. not monophyletic, clade created by the attraction of two well-sampled taxa can be broken down by adding numerous taxa showing only a fraction of defined characters. SuperNetworks of minus-one trees can point you to the critical OTUs and unstable branching patterns of your (backbone) phylogeny.
PS. Personally, I would analyze a matrix with these properties, and a taxon sample spanning more than 150 myrs of evolution (from Allosaurus to modern birds), using ML not MP. I used MP in this post only because paleontologists are still very fond of it (not a few still discard anything else as unfit for their data). ML is less prone to long-branch attraction, results in a single tree (easier to compare when using larger taxon samples), and is speedy these days, allowing for more in-depth experiments towards the end of the exploratory data analysis. Both IQ-Tree (homepage; includes links to online servers) and RAxML-NG (open access paper providing essential links / github; implemented on various online servers) can quickly infer ML trees and establish branch support (including but not restricted to nonparametric bootstrapping) using binary and multistate data.
Walk-through for computing Z-closure SuperNetworks (Huson et al. 2004) in SplitsTree (v. 4, since v. 5 is still not fully functional):
- Make sure the tree sample for reading is in Newick format, including branch-length information. The trees can be in a single file or multiple files.
- Start SplitsTree.
- To read in the tree sample:
- File > Open, if your trees are in one file;
- File > Tools > Load multiple trees, if your trees (eg. minus-1 MPTs) are in different files.
- Go to Networks > SuperNetwork. Choose "Min" for "Edge Weight" in the pop-up analysis window for the first graph. You can also try out "Mean"/"Sum" (short, rare alternatives will be less prominent), "AverageRelative" (trade-off) or "None" (branch-lengths in the minus-one tree sample are ignored). When using simple tree samples (little topological variation, matrix with fairly stringent signals), a single run (default) suffices. Increasing the number of runs (eg. to 100) ensures no branching pattern in the minus-one tree sample gets lost. For instance, for the Felsenstein Zone matrix, a single run will give you a SuperNetwork capturing the major conflicting aspects, while 100 runs will lead to a higher dimensional graph that includes the correct BD and AC clades as alternatives. If you would like to view the overall best-fitting tree instead of a network, tick "SuperTree".
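If you would rather use File > Open, the minus-one trees can be collected into a single file beforehand. A minimal sketch, assuming each input text holds one Newick tree per line; the function name is my own choice and the two trees are invented.

```python
def merge_newick(texts):
    """Combine the contents of several Newick files into one multi-tree text."""
    trees = []
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if line:
                # make sure every tree is terminated by a semicolon
                trees.append(line if line.endswith(";") else line + ";")
    return "\n".join(trees) + "\n"

# Hypothetical file contents, for illustration only
combined = merge_newick([
    "((B:1,D:1):2,C:1,(O:1,Z:1):3);\n",
    "((A:1,C:1):2,B:1,(O:1,Z:1):3)",
])
```

The returned text can then be written to disk, for example with `pathlib.Path("minus_one_trees.tre").write_text(combined)`, where the file name is only a placeholder.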
Hartman S, Mortimer M, Wahl WR, Lomax DR, Lippincott J, Lovelace DM (2019) A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.
Huson DH, Dezulian T, Kloepper T, Steel MA (2004) Phylogenetic super-networks from partial trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 151–158.
Wiens JJ (2005) Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Systematic Biology 54: 731–742.