Monday, September 7, 2020

Fossils and Networks 3 – (deleting and) adding one tip

In the last Fossils and Networks post, we explored the use of SuperNetworks to identify both safe and problematic branching patterns by removing one OTU and re-evaluating the analysis. Here, we'll take the opposite approach, and see what we can learn from adding one OTU to our analysis.

Breaking and supporting wrong branches

We start again with the artificial Felsenstein Zone matrix that results in a wrong AB clade. Here's the original true tree used to generate the matrix.

Because of convergent/parallel evolution in the modern taxa (genera O, A and B) and primitive characters of their fossil sisters, any phylogenetic inference method will find the wrong tree, with an A + B | rest split.

In the Felsenstein Zone, parsimony will always get the wrong tree due to long-branch attraction (LBA), while Maximum likelihood has a 50:50 chance to escape LBA. To break down the LBA between A and B, we need a fossil that is, from an evolutionary point of view, intermediate between D and B.

If we add a fossil E that features 1 out of 3 derived traits found in the BD lineage (including the only synapomorphy of BD), we end up with two alternative parsimony trees: one with a wrong topology and the other the correct topology, as shown here.

By adding a fossil F featuring 2 out of 3 derived traits, we increase the number of most-parsimonious trees (MPTs) to three alternatives, all of which fall prey to A-B+F LBA, as shown next.

Convergent evolution is a problem for tree inference, but selection bias and homoiologies are worse, as they involve the accumulation of the same advanced trait within some but not all members of a lineage (see the post: Has homoiology been neglected in phylogenetics?). This is worse because the characters will enforce attraction between long-branching, highly evolved (more modern) taxa. A and B are siblings, but by enforcing an ABF clade, we will inevitably misinterpret the most primitive members of the ingroup, C and D. Hence, we may draw wrong conclusions about evolution in the A–F lineage.

Because E is virtually half-way evolved between D and F, and F is the next step towards B, the all-inclusive tree gets it right. We infer a single optimal tree, shown here.

PS: Also, in this case we could use any other optimality criterion (Maximum Likelihood, Least-squares, Minimum Evolution) and we would end up with the same tree.

Missing the important bits

That last observation is encouraging: the more fossils we include in our matrix and the better they reflect the evolutionary trends within a group (here from a D-like ancestor via E to F and B), the greater the chance of ending up with the true tree. There's only one drawback: in real-world data sets, we may miss exactly those traits in the fossil sample that are needed in order to infer (or stabilize) the true tree.

(Paleo-)Parsimonists have frequently argued that missing data are unproblematic, which is true in one sense, as shown in the above example. The commonly used strict consensus tree has no wrong branches, because it only has one, which is the trivial ingroup-outgroup split. The much less commonly used Adams consensus tree has one more branch, which is wrong: the ABF clade.

As always in such cases, the strict Consensus network visualizes the MPT sample best (again exemplifying why we should stop using cladograms).
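The difference between the two summaries can be sketched in a few lines. This is a minimal illustration on hypothetical toy trees (not the MPTs of this post): each tree is reduced to its non-trivial splits, the strict consensus keeps only the splits found in every tree, and a consensus network can also keep splits found in only some of them. The function names and the `threshold` parameter are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical toy trees (NOT the MPTs from the post). Each tree is a set of
# non-trivial splits; a split is stored as the side that excludes outgroup O.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("CD")},
]

def split_frequencies(trees):
    """Count in how many trees each split occurs."""
    counts = Counter()
    for splits in trees:
        counts.update(splits)
    return counts

def strict_consensus(trees):
    """Splits present in every tree."""
    return set.intersection(*(set(t) for t in trees))

def consensus_splits(trees, threshold):
    """Splits present in more than `threshold` (a fraction) of the trees."""
    freqs = split_frequencies(trees)
    return {s for s, k in freqs.items() if k / len(trees) > threshold}

def label(split):
    """Readable label for a split, e.g. 'AB'."""
    return "".join(sorted(split))

strict = strict_consensus(trees)
network = consensus_splits(trees, threshold=0.0)

print(sorted(label(s) for s in strict))
print(sorted(label(s) for s in network))
```

With a threshold of zero, every split seen in at least one tree is kept, so conflicting alternatives stay visible instead of collapsing into a polytomy.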

The price for not having false positives is that we can no longer infer a single most-parsimonious tree or a few alternative trees, but could easily end up with scores of them. Here, we have 41 MPTs for an 8-taxon dataset that include fairly wrong trees*, although some of them are closer to the true tree (green and olive edges in the strict Consensus network above). For large matrices, or matrices lacking tree-like signals, the number of MPTs can easily reach tens or hundreds of thousands. Lacking critical traits in E (14 out of 46 characters missing) and F (7 missing), we may escape LBA at the cost of decisiveness. If we do have those traits only in F but not E, we will enforce LBA between A and B.
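To put the 41 MPTs into perspective, it helps to count how many fully resolved, unrooted trees exist for a given number of tips. The following is a minimal sketch using the standard double-factorial formula (2n − 5)!!; the function name is only illustrative.

```python
def count_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating trees for n labelled tips: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count

total = count_unrooted_binary_trees(8)
print(total)                 # 10395
print(f"{41 / total:.2%}")   # share of all possible trees that are MPTs
```

So 41 equally optimal trees is a tiny fraction of the tree space for eight tips, but still far too many to pick one topology with any confidence.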

Plus-1-trees (and SuperNetworks)

Before adding a taxon as an additional leaf to our tree, we may be interested in what that taxon does to our tree: can it trigger a topological change or does it fall in line? We will again take the dinosaur-to-bird matrix of Hartman et al. (2019, PeerJ 7: e7247) as a real-world example. This includes everything from well-covered highly derived and most primitive taxa, to those that lack discriminatory signal in general (i.e. are unresolved), plus the one or two rogue taxa, with ambiguous phylogenetic affinities creating topological conflict. (Note: the commonly reported strict consensus trees cannot distinguish between those two alternatives.)

The best-covered 15 taxa provide us with a single optimal tree that is in agreement with current opinion (shown below). However, this struggles to resolve the clade of modern birds because the extinct Lithornis is being attracted by Anas, the duck. When we remove Dromiceiomimus (as shown in Fossils and Networks 2), we end up with a putatively wrong Dromaeosauridae grade, because of LBA between the most distinct Dromaeosauridae, Velociraptor and Bambiraptor, and the distantly related (to flying dinosaurs) Allosaurus, Tyrannosaurus and the IGM 10042 skeleton.

Two of the Minus-1 trees generated for the last post of this series.

For our experiment, we will take this (partly) wrong tree, and add every other taxon included in the Hartman et al. (2019) matrix as a 15th tip. We can then perform a branch-and-bound search to infer these 14-Plus-1 tree(s). When we browse through the inferred MPTs, we can see that many taxa fall in line with the wrong topology, including a few that, in addition, increase uncertainty for branches correctly resolved in the minus-Dromiceiomimus tree.
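The bookkeeping behind such a Plus-1 experiment is simple and can be sketched as follows. This is only an illustration with hypothetical taxon labels: it builds the taxon sets (backbone plus one extra tip), while the actual tree search for each set would still be run in whatever parsimony program is used.

```python
def plus_one_taxon_sets(backbone, all_taxa):
    """Yield one taxon set per candidate: the backbone plus a single extra tip."""
    backbone_set = set(backbone)
    for taxon in all_taxa:
        if taxon not in backbone_set:
            yield list(backbone) + [taxon]

# Hypothetical labels; the backbone used in the post has 14 taxa.
backbone = ["O", "A", "B", "C"]
all_taxa = ["O", "A", "B", "C", "D", "E", "F"]

taxon_sets = list(plus_one_taxon_sets(backbone, all_taxa))
for taxon_set in taxon_sets:
    # Each set would be exported as its own matrix and analysed separately.
    print(taxon_set)
```

Each candidate gets its own small analysis, which keeps every search tractable and makes the effect of a single added tip directly comparable across taxa.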

Out of the 485 candidate trees, only 10** have a set of characters that can compensate for the missing Dromiceiomimus, leading to Plus-1 trees that show a Dromaeosauridae clade, as shown here.

Two of the ten Plus-1 trees, where the added tip saves the inference from LBA. Numbers give the number of defined characters (scored traits). Both Halszkaraptor and Zhenyuanlong are classified as Dromaeosauridae; however, only the better-covered taxon is placed as sister to the Dromaeosauridae included in the original 14-taxon tree.

The presence of the deep-branching Compsognathus (Tyrannoraptora: ... :Neocoelurosauria: †Compsognathidae) triggers an Archaeopteryx-Dromaeosauridae clade.

In the case of the relatively deep-branching Garudimimus (... :Neocoelurosauria: Maniraptoriformes: †Ornithomimosauria: †Deinocheiridae) and Epidexipteryx (... : Maniraptoriformes: ... : : ... : Paraves: †Scansoriopterygidae), one or two of the two or three MPTs show the wrong grade, while the remaining one shows the clade.

Note: the relatively low number of scored traits for Epidexipteryx can avoid the LBA that leads to a Dromaeosauridae grade, but misplaces the taxon within the Plus-1 MPTs: its family, the Scansoriopterygidae, is considered to represent the sister lineage (Wikipedia, referring to Godefroit et al. 2013 Nature 498: 359–362) of the Eumaniraptora, which include the Dromaeosauridae as the first-diverging branch.

We can also summarize the outcome, a collection of 640 Plus-1 MPTs, in the form of a z-closure SuperNetwork, as we did for the Minus-1 trees in the previous Fossils and Networks post (shown next).

This SuperNetwork is quite boxy, and may be only semi-comprehensive (I used only 20 runs, which took half a day). Matching 485 tips into a 14-taxon backbone tree is not the kind of tree sample that the SuperNetwork was originally designed for!

Only four edges, fat and blue, are without alternatives. In all other cases, the added tip triggered the creation of several alternatives: the highest dimension for the boxes is five, but most have four or fewer dimensions. Regarding our problem of saving the Dromaeosauridae clade, we can see that the topological change depends on very few characters, with Microraptor being very close to the divergence but a bit more bird-like (in a very broad sense), while the other two are much more derived.

Close-up on the Dromaeosauridae part of the network, with all tips labeled. Pie charts give the percentage of scored traits/missing data. * – Tips that saved the inference from LBA (see above).

Note the length of some of the colored edges, especially the light green ones, which reflect a Dromaeosauridae clade. Other Dromaeosauridae taxa not only increase the diversity but may also create substantial topological ambiguity (bluish and greenish edge bundles; same color = same split) and branching bias.

Take-home message

Creating morphological supermatrixes makes a lot of sense, because it ensures normalization and facilitates universal comparability, which is crucial also for paleobiology. However, even more than molecular phylogenies, paleophylogenies are affected by character and taxon sampling. This is nothing new, and much debate has dealt with which parsimony strict consensus cladogram is the better one.

I suggest taking a new route. Instead of using morphological supermatrixes to infer trees – for this matrix, Hartman et al. found millions of equally optimal parsimony trees further filtered by post-analysis, initial tree topology informed character weighting (as implemented in TNT) – we should use them to generate subsets and engage in exploratory data analysis. This will pinpoint strengths and weaknesses of the data and its individual taxa. Rather than producing evolutionarily meaningless soft polytomies, one should study the reasons for any topological ambiguity. After all, one simple reason for unstable branching patterns may be that all so-far inferred trees are biased, only differently.

The SuperNetwork can assist us in putting together taxon sets that could allow not only a simple tree inference but also topology testing.
  • If we want to test the stability of, e.g., the Dromaeosauridae clade against taxon sampling, it will be of little use to include the most primitive (anything outside Maniraptora) and much more advanced taxa (Avialae including modern birds) of the 501-taxon matrix. On the one hand, the most primitive taxa will only increase the computational load, because our inferred tree not only optimizes branches we are interested in, but also irrelevant ones, using taxa that largely lack discriminative signal for the branches of interest or at all. On the other hand, the most derived taxa may bias the tree inference by providing strong terminal signals outcompeting potentially conflicting weak basal signals.
  • If we want to test the stability of the backbone phylogeny against adding taxa and entire lineages, we may prefer short-branched over long-branched taxa, in order to avoid (local) LBA (especially when we want to stick to parsimony). The terminal edges in the SuperNetwork indicate the minimum number of unique changes for each tip added to the 14-taxon tree. As also seen in our hypothetical example: E and F only break down the wrong AB clade because they are identical (or very similar) to the last common ancestors of E+F+B and F+B, respectively.
In a future post, I'll come back to the issue of identifying taxa that are game changers, using a simple and quick tree-based approach: the so-called "evolutionary placement algorithm", first implemented in RAxML.

For any of you who really don't like networks, but still find no comfort in comb-like strict consensus cladograms either: just tick the SuperTree option when inferring the SuperNetwork. But only if your input trees converge to a shared topology. Otherwise the result may look like this:

A SuperTree based on the 640 Plus-1 MPTs.

* Somebody familiar with Consensus networks and with morphological data partitions providing complex signal can extract a phylogenetic hypothesis from this boxy network for the included taxa. In general, the distance along the network edges represents a phylogenetic distance, and thus gives a direct measure of how derived a taxon is.

For example, C and D are closer to the outgroup and placed close to the centre of the graph, which is exactly where a primitive ingroup taxon, with an ancestral morphology, would be placed. F is most likely a sister of B. The olive EF | rest split supports a potential common origin of E, F, and B (long green edge bundle). Hence, A can only represent a distant, strongly evolved sister lineage (both the alternative AB and ABF clade have less character support). Also, since the graph depicts E as least derived of the four (irrespective of the topological alternatives), its affinity to F and B has more value than the affinity between A and B, both being long-branched, and hence susceptible to LBA. D fits into the picture: the olive DE edge either (1) represents a common origin, which would make D an early member of the red lineage; or (2) reflects similarity due to shared primitive traits within the ingroup, which would make D an early member of an ABEF lineage. C, in contrast to D, has no clear affinities with any other ingroup member, and so can only be interpreted as an early, very primitive form with uncertain phylogenetic relationships. The (true tree) mutual monophyly of the red and blue ingroup lineages has very little character support in the matrix, and hence cannot possibly be resolved.

** Systematically they cover a range of maniraptoran ('hand hunters') families 'below' the Avialae ('flying' dinosaurs) including, in addition to two Dromaeosauridae (Halszkaraptor, Zhenyuanlong, trees shown above), members of †Alvarezsauroidea (Haplocheirus), †Caudipteridae (Caudipteryx), †Sinovenatorinae (Sinovenator), †Therizinosauroidea or related (Beipiaosaurus, Jianchangosaurus) and †Troodontidae (Gobivenator, Sinornithoides). Caihong is a member of the †Anchiornithidae, which Wikipedia flags as "Avialae ?". These OTUs show data coverage far above the median (74% missing), with 278 (Caihong) to 558 (Caudipteryx) defined characters (out of a total of 700).
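The coverage figures above can be turned into missing-data percentages with a little arithmetic. This is a small sketch using only the numbers quoted in this footnote; the helper name is illustrative.

```python
TOTAL_CHARACTERS = 700  # total number of characters, as quoted above

def missing_share(defined, total=TOTAL_CHARACTERS):
    """Fraction of characters that are NOT scored for a taxon."""
    return 1 - defined / total

for name, defined in [("Caihong", 278), ("Caudipteryx", 558)]:
    print(f"{name}: {missing_share(defined):.1%} missing")

# A taxon at the median (74% missing) has roughly this many scored characters:
median_defined = round(TOTAL_CHARACTERS * (1 - 0.74))
print(median_defined)
```

In other words, even the least complete of the ten "rescuing" taxa has about 60% missing data, compared with 74% for the median taxon.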
