Monday, February 17, 2020

Large morphomatrices – trivial signal

In my last post about fossils, Farris and Felsenstein Zones, I gave an example of a trivial (signal-wise perfect) binary phylogenetic matrix, which will give us the true tree no matter which optimality criterion we use. In this post, we will look at a real-world example, a huge bird-theropod matrix.
S. Hartman, M. Mortimer, W. R. Wahl, D. R. Lomax, J. Lippincott, D. M. Lovelace
A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.
What intrigued me about this particular paper (I have no idea about dinosaurs, but the documentation, pictures and data, and presentation seem impeccable) was the following sentence:
The analysis resulted in >99999 most parsimonious trees with a length of 12,123 steps. The recovered trees had a consistency index of 0.073, and a retention index of 0.589.
What can you possibly do with strict consensus trees (Losing information in phylogenetic consensus) based on an unknown number of MPTs that have a CI converging to 0 (but an RI of 0.6; The curious case[s] of tree-like matrices with no synapomorphies)? And isn't this a case for some network-based exploratory data analysis?
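For readers less familiar with the two indices, here is a minimal Python sketch of how the ensemble consistency index (CI) and retention index (RI) relate to tree length. The function names and the toy numbers are illustrative assumptions; only the CI of 0.073 and the length of 12,123 steps come from the quoted sentence.

```python
def consistency_index(min_steps, observed_steps):
    """CI: minimum conceivable number of steps divided by the observed tree length."""
    return min_steps / observed_steps


def retention_index(min_steps, max_steps, observed_steps):
    """RI: observed length rescaled between the maximum and minimum conceivable lengths."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Toy numbers (hypothetical, not from the paper):
toy_ci = consistency_index(min_steps=10, observed_steps=25)
toy_ri = retention_index(min_steps=10, max_steps=40, observed_steps=25)

# Back-of-envelope reading of the quoted values: a CI of 0.073 at 12,123 steps
# implies a minimum conceivable length of roughly 885 steps (CI is rounded).
implied_min_steps = 0.073 * 12123
```

A CI close to 0 therefore only says that the observed trees need many times more changes than the theoretical minimum; it says nothing about how many of the shared states are retained on the tree, which is what the RI measures.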

The complete matrix has 501 taxa and 700 characters (the largest plant morphological matrices have hardly more than 100 characters) but also a gappyness of 72%. In this case, 255,969 of the 353,500 cells in the matrix are ambiguous or undefined (missing). The matrix is a (rich) Swiss cheese with very big holes. The high number of MPTs is hence not surprising, and neither is the low CI.

Why run elaborate tree inferences on such a Swiss cheese matrix? One answer is that (some) vertebrate palaeophylogeneticists are convinced that few-taxa, many-character matrices can lead to wrong clades (clades that are not monophyletic), and that each added taxon, no matter how many characters can be scored, will lead to a better tree by eliminating (parsimony) branching artifacts (see Q&A to the paper). At least 56 of the 501 taxa have 5% or fewer defined characters; still, with 700 characters, 5% equals up to 35 defined traits, which is more than we can recruit for most plant fossils. The median missing data proportion is 74%: half of the taxa are scored for no more than 26% (182 out of 700) of the characters. Can such taxa really save the all-inclusive tree from branching artefacts, or is the high number of MPTs an indication of signal conflicts and data-gap issues?
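The gappyness figures above are easy to reproduce for any matrix. The following is a minimal sketch in Python, assuming each taxon is stored as a string with one symbol per character and that "?" and "-" mark undefined cells; polymorphic or partially ambiguous codings would need extra handling. The toy matrix and all names are hypothetical.

```python
from statistics import median

MISSING = {"?", "-"}  # assumed symbols for undefined cells


def missing_fraction(row):
    """Proportion of undefined cells in one taxon's row."""
    return sum(1 for state in row if state in MISSING) / len(row)


def gappyness_summary(matrix):
    """Return per-taxon, overall and median proportions of missing data."""
    per_taxon = {taxon: missing_fraction(row) for taxon, row in matrix.items()}
    total_cells = sum(len(row) for row in matrix.values())
    missing_cells = sum(
        sum(1 for state in row if state in MISSING) for row in matrix.values()
    )
    return per_taxon, missing_cells / total_cells, median(per_taxon.values())


# Toy matrix: four taxa, five characters.
toy = {
    "taxon_A": "0101?",
    "taxon_B": "01???",
    "taxon_C": "?????",
    "taxon_D": "11010",
}
per_taxon, overall, med = gappyness_summary(toy)
```

Sorting `per_taxon` by value is then all that is needed to pick a data-dense subset, such as the taxa with less than 15% missing data used further down.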

For this post, we will just look at the tip of the iceberg. What is the signal from the 700 characters to start with?

The basic signal

Here's the heat map for the 19 taxa that have a gappyness of less than 15% (i.e. at least 595 of 700 possible characters are defined). The taxon order is mostly the one from the original matrix, sorted by phylogenetic groups; for more orientation, I added next-inclusive superclass "Clades" from Wikipedia (so apologies for any errors).

In my last post, I showed that evolutionary lineages (and monophyly) can be directly deduced from such a heat map following the simple logic: two taxa sharing a (direct) common origin are usually more similar to each other than to a third, fourth etc. taxon not part of the same lineage. Exceptions include fossils close to the last common ancestors lacking advanced traits.
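The logic can be tried out on a toy matrix. This sketch assumes a simple mean character difference, counted only over the characters that are defined in both taxa; which distance was used for the heat map is not stated here, so treat the choice, the names and the data as illustrative.

```python
def pairwise_distance(a, b, missing="?-"):
    """Mean character difference over characters defined in both taxa.

    Returns None if the two taxa share no defined character.
    """
    shared = [(x, y) for x, y in zip(a, b) if x not in missing and y not in missing]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def distance_matrix(matrix):
    """All pairwise distances, keyed by (taxon, taxon)."""
    taxa = list(matrix)
    return {(a, b): pairwise_distance(matrix[a], matrix[b]) for a in taxa for b in taxa}


# Toy matrix: two members of one lineage and one taxon outside it.
toy = {
    "sister_1": "001101",
    "sister_2": "0011?0",
    "outsider": "110000",
}
dist = distance_matrix(toy)
```

In the toy example the two sisters differ in one of their five shared characters, while each differs from the outsider in most characters. That is the pattern showing up as blocks of low distance along the diagonal of the heat map.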

The outgroup taxa as used (in this taxon sample: Allosaurus to Tyrannosaurus) are most similar to each other but not monophyletic. One (Allosaurus) represents the sister lineage of, and the other an early split within, the lineage that led to the birds (Coelurosauria: Tyrannoraptora). The extinct (monophyletic) families (Tyrannosauridae, Ornithomimidae, Dromaeosauridae) are, however, well visible, being defined by low intra-family and higher inter-family pairwise distances. The same is true for the direct relatives (Clade Ornithurae) of modern birds (class Aves).

Very typical for such datasets is the increasing distance between the (primitive?) outgroups and the most derived, modern-day taxa (living birds: Struthio – ostrich, Anas – duck, Meleagris – turkey). Closest relatives in the taxon set, phylogenetically and time-wise, are (much) more similar than distant ones. Allosaurus may be most similar to the tyrannosaurs, not because of common ancestry but because both are scored as being primitive with respect to the group of interest.

The only tree

This situation becomes very obvious from the only possible (single-optimal) tree that can be inferred from this matrix, when visualized as a phylogram (Stop using cladograms!).

The ML, MP and LS/NJ trees overlaid and scaled to equal root (first split within Tyrannoraptora) to tip (split between Anas and Meleagris) distance (phylogenetic distance, via the tree). Pink, the LS clade conflicting with ML and MP trees, and Wikipedia's tree(s).

No matter which optimisation criterion is used (here Least-Squares via Neighbor-joining, Maximum Parsimony, Maximum Likelihood), the result is the same. The only exception is that the NJ/LS tree places Archaeopteryx as sister to Dromaeosauridae; and the relative branch lengths of roots vs. tips also differ.

Because our matrix has favorable properties (few taxa, many defined characters), it's straightforward to establish branch support. This is a bit frowned upon in palaeontological circles, but having dealt with morphological evolution in cases where we have molecular data, I want to know how robust my clades are, and what may be the alternatives, before I conclude that they reflect monophyly. Bootstrapping coupled with consensus networks is a quick and simple way to test robustness and investigate ambiguous support (Connecting tree and network edges).
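The principle behind such a support analysis can be sketched in a few lines of Python: resample the characters with replacement, infer a tree from each pseudoreplicate, and tally how often each split turns up. Everything here is a simplified stand-in and not what PAUP* or RAxML do internally: the toy matrix has only four taxa and no missing data, and `quartet_split` (a hypothetical helper) picks the split with the smallest summed within-pair distance instead of running a real NJ, MP or ML search.

```python
import random
from collections import Counter


def hamming(a, b):
    """Proportion of differing characters (toy data has no missing cells)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def resample_columns(matrix, rng):
    """One bootstrap pseudoreplicate: draw characters with replacement."""
    n_chars = len(next(iter(matrix.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: [row[c] for c in cols] for taxon, row in matrix.items()}


def quartet_split(matrix):
    """Stand-in for tree inference on exactly four taxa."""
    a, b, c, d = sorted(matrix)
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def score(split):
        (p, q), (r, s) = split
        return hamming(matrix[p], matrix[q]) + hamming(matrix[r], matrix[s])

    best = min(options, key=score)  # ties go to the first option
    return frozenset(frozenset(side) for side in best)


def split_support(matrix, replicates=1000, cutoff=0.15, seed=1):
    """Split frequencies across pseudoreplicates, keeping those >= cutoff."""
    rng = random.Random(seed)
    counts = Counter(
        quartet_split(resample_columns(matrix, rng)) for _ in range(replicates)
    )
    return {s: n / replicates for s, n in counts.items() if n / replicates >= cutoff}


toy = {"A": "1111000000", "B": "1110000000", "C": "0000001111", "D": "0000000111"}
support = split_support(toy, replicates=200)
```

With a conflict-free toy matrix like this one, only the A+B versus C+D split survives. With conflicting signal, several mutually incompatible splits can pass the cutoff, and it is exactly those that a support consensus network draws as boxes.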

The BS support consensus networks for NJ/LS and ML have only a single reticulation each.

Rooted support consensus networks based on the NJ/LS (10,000 pseudoreplicates, PAUP*) and ML bootstrap (100, number of necessary replicates determined by bootstop criterion implemented in RAxML) samples. Only splits are shown that occurred in at least 15% of the BS pseudoreplicates.

The MP BS support consensus network, however, has many more reticulations.

Rooted MP-BS support consensus network (10,000 BS pseudoreplicates, PAUP*). Green — edge bundles corresponding to clades in the all-optimal tree(s); orange — less supported conflicting alternatives; red – higher supported conflicting alternatives; pink – wrong clade in NJ/LS tree.

We can make two generally relevant observations here:
  1. The wrong Archaeopteryx-Dromaeosauridae clade (pink edge/branch) masks a split in NJ bootstrap (BS) support: 68 for the wrong clade, 31 for the right one. While resampling under ML appears to be insensitive to this conflict, MP is not.
  2. The NJ and ML support networks are very tree-like and near-congruent, and all clades in the inferred tree have high to unambiguous support, whereas the MP network is much more boxy. In some cases the split in agreement with the all-optimal tree has lower BS support than an alternative (here usually in conflict with the gold tree).
Similar observations can be made with other data sets: although NJ/LS and ML optimisation are fundamentally different (distance- vs. character-based, equal change vs. varying probability of change), they show more agreement with each other when it comes to supporting a topology (or topological alternatives) than MP (character-based like ML, but all changes are treated as equal like NJ/LS). MP is a very conservative approach, highly dependent on possibly a few discerning characters. If they are missing from the BS pseudoreplicate, the backbone tree collapses or changes, and BS values may decrease rapidly. This is so even for a very data-dense matrix like the one used here (few taxa, many characters, low gappyness).
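How easily the discerning characters drop out of a pseudoreplicate is simple arithmetic. A bootstrap pseudoreplicate of n characters is n draws with replacement, so the chance that k particular characters are all absent is (1 - k/n)^n. The Python sketch below just evaluates that expression; the function name and the choice of k are illustrative, and it assumes every character is equally likely to be drawn.

```python
def prob_all_absent(n_chars, k):
    """Chance that none of k given characters is drawn in n_chars draws with replacement."""
    return (1 - k / n_chars) ** n_chars


# A single character is missing from roughly 37% of all pseudoreplicates,
# close to exp(-1), almost regardless of how long the matrix is.
p_one = prob_all_absent(700, 1)

# A clade resting on three characters loses all of them far less often.
p_three = prob_all_absent(700, 3)
```

A clade that hinges on a single character can thus hardly reach high BS support under MP, however many other characters the matrix contains.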

On the positive side, we can expect that MP will produce fewer false positives. On the negative side, it is also more dependent on character coverage, and will produce many more false negatives. Any fossil lacking the crucial characters (or showing too few of them) may still be resolved (placed and supported) under NJ/LS and ML but not using MP. When inferring trees, these fossils will quickly increase the number of MPTs and decrease branch support for the part of the tree they interact with. Personally, given how hard it can be to place a fossil per se with the data at hand, I have always preferred a method that can give some result and point towards possible alternatives (even at the risk of including erroneous ones), rather than no result at all.

The simplest of networks

Naturally, we can use the distance matrix directly to infer a Neighbor-net, and explore the basic differentiation signal beyond trees but also with regard to the all-optimal tree.

Neighbor-net based on the pairwise distance matrix. Coloration highlights edges found (or not) in the optimised trees.

The Neighbor-net recovers the clades from the all-optimal tree (green, purple the NJ/LS-unique branch), but shows additional edges (orange). The principal signal in the data has, for instance, problems with placing Archaeopteryx, because it is (signal-wise) intermediate between the Avebrevicauda, the lineage including modern birds, and the Dromaeosauridae, their sister lineage (note that the vertebrate fossil record is considered to be free of ancestors and precursors; all fossils represent extinct sister lineages – evolutionary dead-ends). Skeleton IGM 100042 (an oviraptorid), placed as sister to both in the all-optimal tree, also lacks obvious affinities: this is a taxon where the tree inference makes a decision that is not based on a trivial signal encoded in the matrix.

The central boxy part of the Neighbor-net correlates with the 2/3-dimensional part of the parsimony BS consensus network: to resolve these relationships, we need a large set of characters (under MP). On the other hand, recognizing the Ornithurae, members of an extinct family, or a relative of IGM 100042, should be straightforward even with a limited number of defined characters. Based on the Neighbor-net, which is inferred in a blink no matter how large the matrix, we can also decide which taxa interfere with and which ones facilitate tree inference. The more tree-like the Neighbor-net graph becomes, the easier it is for a tree inference to be made.

Placing fossils, quickly and easily

Using this backbone graph, it is easy to assess in which phylogenetic neighborhood a newly coded fossil falls, e.g. Hesperornithoides, the fossil newly described in Hartman et al. and scored for 267 unambiguously defined traits.

Neighbor-net including Hesperornithoides.

Hesperornithoides is obviously a member of the Eumaniraptora (= Paraves), morphologically somewhat intermediate between the Avialae, the "flying dinosaurs", and Dromaeosauridae, but doesn't seem to be part of either of these sister lineages. The graph lacks a prominent neighborhood; the Archaeopteryx-Bambiraptor neighborhood may reflect local long-edge attraction (note the long terminal edges) or convergent evolution in both taxa and, possibly, also the Hesperornithoides lineage. Just based on this simple and quick-to-infer network, Hartman et al.'s title "A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight" appears to be correct (in future posts, we may come back to this morphological supermatrix to see what else networks could have quickly shown).

One should be willing to leave the phylogenetic beaten track, i.e. relying on strict consensus parsimony trees as the sole basis for phylogenetic hypotheses. The Neighbor-net is a valuable tool for quick pre- and post-analysis because it can:
  • visualize how coherent the clades in our trees are,
  • show how easy it will be for the tree inference (especially MP) to find and support clades,
  • help to differentiate ambiguous from important taxa, and finally,
  • assess whether a new fossil really requires an in-depth re-analysis of the full matrix (and dealing with >99,999 MPTs) instead of using a more focussed taxon (and character) set.
