Monday, November 5, 2018

A bit of heresy: networks for matrices used in Cladistics studies

[This is Part 1 of a two-part topic – this one is Historical matrices from the 1980s]

When I first came into contact with phylogenetics (usually based on morphological data sets, back then) and after reading Hennig's book (the original German version, published in 1950), I dreamed about publishing in Cladistics, the journal of the Willi Hennig Society (WHS). I never did. In this post, I show why.

Later on, in 2016, Cladistics achieved renewed fame due to an editorial that triggered a Twitter uproar under the hashtag #parsimonygate. A lot of people were shocked to read in the editorial that the journal (still) prefers and requires parsimony-based inferences (in fact, parsimony-based trees). Some people, like Joe Felsenstein, were not at all surprised. I wasn't either, because Cladistics is the journal of the Willi Hennig Society (WHS), which has always been dedicated to parsimony: "Ockham told Popper told Hennig to use parsimony" (see the historical summary by Felsenstein in Systematic Biology, 2001; free access).

Historical buttons that you (allegedly) could get at meetings of the WHS. Left: Joe Felsenstein; right: L for Likelihood. Just a gag, of course! Nothing serious behind it.

In the good old days, when the "Phylogenetic Wars" were still on (in the 1980s, petering out in the 90s), they would invite a probability-ist to their conference to tear him down. My first phylogenetic paper (2002) got a negative review (i.e. rejection, invitation to resubmit) from a WHS member solely because it did not include a parsimony tree, which he described as "standard these days". More recently, they ensured free access to TNT, the current main software for doing parsimony analysis and an essential tool for many palaeontologists.

I stopped using parsimony trees very early in my career, but I'm still a great fan of the family of methods based on median networks, which operate under the same parsimony criterion (Clades, Cladograms, ...; Using Median networks ...). Fate exposed me early to the Neighbor-nets, which can be used as a quick check of how tree-like the signal is in data matrices, to start with.

The thing that bugged me most concerning many journals, including Cladistics, is not a focus on parsimony, but the lack of data documentation and easy data access. To me, it seems natural to use a service like TreeBASE, when my main dedication is to tree-inference. TreeBASE allows you to provide your data and inferred trees to the general public in the common NEXUS format, so that other people can make use of it.

Luckily, some authors of Cladistics upload their data (about one study per 1–3 years). So, here are some data-display networks showing the strengths and weaknesses of the parsimony trees in the original publications, which have been randomly selected from among the oldest ones and the newest ones (I found) in TreeBASE. I won't discuss the actual results, as Cladistics is pay-walled, so just enjoy the graphs.

The oldest one (in my list), Dahlgren & Bremer 1985, TreeBASE submission number S231

The submission (a binary matrix, including some missing data; published in the first volume of Cladistics) comes with three angiosperm trees: one composite order-level tree, plus two empirical trees labelled as "Fig. 2" and "Fig. 3" using the family-level OTUs in the matrix. The latter two look like this:

Connected cladograms of "Fig. 2" and "Fig. 3", the result of two parsimony analyses. Jumping taxa/clades highlighted with colours.

That the matrix is not only highly homoplasious (CI = 0.28) but also has a severe signal problem becomes obvious when inferring a NJ tree, which provides a third topology.
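
For readers who want to see what is behind a CI value: the consistency index is the minimum number of state changes the characters could require on any tree, divided by the number of changes they actually require on the tree at hand. The following is only a minimal sketch, not the procedure implemented in PAUP* or TNT. It assumes unordered characters, a rooted binary tree written as nested tuples, and "?" as the missing-data symbol; the toy tree and matrix are made up for illustration and have nothing to do with Dahlgren & Bremer's data.

```python
def fitch_steps(tree, states, missing="?"):
    """Minimum number of changes of one unordered character on a binary tree (Fitch's algorithm).

    tree   -- nested 2-tuples with taxon names (str) as leaves (hypothetical toy format)
    states -- dict mapping taxon name -> state symbol
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):                      # leaf
            s = states[node]
            return None if s == missing else {s}       # None = "fits any state"
        left, right = (state_set(child) for child in node)
        if left is None:
            return right
        if right is None:
            return left
        common = left & right
        if common:                                     # children agree: no change needed
            return common
        steps += 1                                     # children disagree: one change
        return left | right

    state_set(tree)
    return steps


def consistency_index(tree, matrix, missing="?"):
    """Ensemble CI = sum of minimum possible steps / sum of steps on this tree."""
    n_chars = len(next(iter(matrix.values())))
    min_steps = obs_steps = 0
    for i in range(n_chars):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        observed_states = {s for s in column.values() if s != missing}
        min_steps += max(len(observed_states) - 1, 0)  # unordered: states minus one
        obs_steps += fitch_steps(tree, column, missing)
    return min_steps / obs_steps if obs_steps else 1.0


# Made-up example: five taxa, three binary characters, one missing cell.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
toy_matrix = {"A": "110", "B": "111", "C": "100", "D": "001", "E": "00?"}

print(consistency_index(toy_tree, toy_matrix))  # 0.75 (3 minimum steps / 4 observed steps)
```

A CI of 1 means that no character changes more often than it has to; the lower the value, the more homoplasy the tree has to assume.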

A NJ tree (fulfilling least-squares optimality criterion for phylogenetic trees) from the same matrix: blue, branches incongruent among the original trees and the NJ tree. Color coding: light blue, branch congruent to "Fig. 2" tree (different in "Fig. 3" tree); green, branch found in all three trees; red, branch incongruent to consistent placement in both original trees.

Not surprisingly, the Neighbor-net inferred from simple (mean) Hamming distances is a spider-web, as the matrix's signal is not tree-like at all: all non-green branches above, or their conflicting alternatives, receive low to very low bootstrap support, independent of the optimality criterion used.
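
A mean Hamming distance is simply the proportion of characters in which two taxa differ. The sketch below shows one plausible way to compute it from a character matrix; skipping every character that is missing in either taxon (pairwise deletion) is my assumption here, and programs may handle missing and polymorphic cells differently. The four-taxon matrix is invented for illustration.

```python
from itertools import combinations


def mean_hamming(row_a, row_b, missing="?"):
    """Proportion of differing characters among those scored in both taxa."""
    shared = [(a, b) for a, b in zip(row_a, row_b) if a != missing and b != missing]
    if not shared:
        return None                      # no overlap at all: distance undefined
    return sum(a != b for a, b in shared) / len(shared)


def distance_matrix(matrix, missing="?"):
    """All pairwise mean Hamming distances, keyed by (taxon, taxon)."""
    return {
        (s, t): mean_hamming(matrix[s], matrix[t], missing)
        for s, t in combinations(sorted(matrix), 2)
    }


# Invented toy matrix; "?" marks missing data.
toy = {"A": "0101?", "B": "0111?", "C": "1?000", "D": "01???"}

for pair, dist in distance_matrix(toy).items():
    print(pair, dist)
```

Note that A and D end up with a distance of zero although D is scored for only two characters: taxa that agree in everything they share are indistinguishable to a distance method, however little they share.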

The Neighbor-net inferred from Dahlgren & Bremer's matrix.

Despite its spider-web structure, we do learn quite a lot from the Neighbor-net regarding what is behind the clades in the original trees. For example, we can overlay a Dahlgrenogram representing the top-most subtree of the "Fig. 2" tree.

Blue, red and yellow fields denote (sub)clades in Dahlgren & Bremer's "Fig. 2" tree that compose the top clade (grey).

The same could be done for all the other clades.

TreeBASE submission S329, worms (Oligochaeta) by Jamieson et al. (1987)

The better suited a character matrix is for tree inference (i.e. the more tree-compatible its characters are), the more similar the NJ tree and the parsimony tree will be (or any other tree, under any other optimality criterion), as we can see in this second example, published in the third volume of Cladistics.

The tree (the abstract notes a single most-parsimonious tree) was inferred from a multistate matrix with up to seven states, possibly including some characters that should be treated as ordered, but such specifics are not included in the original NEXUS file, so we will treat them as unordered.

Aside from grades becoming clades (and vice versa), the published tree (unordered: 102 steps, high CI = 0.81, RC = 0.53) and the NJ tree are quite similar, even regarding their relative branch-lengths.

Two phylograms: left, the original MPT, right, a NJ tree, shared branches in green, (partly) conflicting ones in orange. Cladists address the left tree as "phylogenetic", the right one as "phenetic", but both are equally valid solutions using different optimality criteria.
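
Colouring branches as shared or conflicting comes down to comparing the splits (taxon bipartitions) of the two trees. The sketch below shows the idea on two invented six-taxon trees written as nested tuples; it is not how the figure above was produced, and the tree format and names are assumptions made for this example only.

```python
def splits(tree):
    """Non-trivial splits (bipartitions of the taxon set) of a tree given as nested tuples."""
    clusters = set()

    def leaves(node):
        if isinstance(node, str):                          # leaf
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clusters.add(below)
        return below

    all_taxa = leaves(tree)
    result = set()
    for side in clusters:
        other = all_taxa - side
        if len(side) > 1 and len(other) > 1:               # skip trivial splits
            result.add(frozenset([side, other]))           # unordered pair = unrooted split
    return result


# Two invented trees that disagree only on the position of B and C.
tree_mp = ((("A", "B"), "C"), ("D", ("E", "F")))
tree_nj = ((("A", "C"), "B"), ("D", ("E", "F")))

shared = splits(tree_mp) & splits(tree_nj)      # would be drawn in green
only_mp = splits(tree_mp) - splits(tree_nj)     # conflicting, would be drawn in orange
only_nj = splits(tree_nj) - splits(tree_mp)

print(len(shared), len(only_mp), len(only_nj))  # 2 1 1
```

The number of splits found in only one of the two trees (here 1 + 1 = 2) is the Robinson-Foulds distance, a common measure of how different two topologies are.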

Moreover, the Neighbor-net is much less complex than in the previous example, with individual edges corresponding to branches in both trees; Neighbor-nets are truly meta-phylogenetic graphs.

Splits found in the original MPT in green, when corresponding with edges in the Neighbor-net, and orange, when there is no corresponding edge (according to the abstract, the authors discuss alternatives to certain branches in their tree). Edges found in the NJ tree (providing an alternative topology/phylogenetic hypothesis) in blue.

Submission S349, an amniote phylogeny by Gauthier et al. (1988)

This is a matrix much to my liking, as it includes extinct taxa, with quite impressive dimensions (computers back in 1988 were awfully slow): 316 characters with up to four states for 31 taxa. Naturally, it includes a lot of missing data, as do all fossil-including matrices.

Missing data is potentially a bigger problem for distance-based approaches than for character-based ones like parsimony, maximum likelihood or Bayesian inference: when there is little character overlap between the fossil taxa, their pairwise distances will be distorted. Missing data can be an equal problem for tree-inference, however: depending on which characters are missing, many different topologies are equally optimal, or nearly so. In Gauthier et al.'s matrix, 10% of the characters are parsimony-uninformative.
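
Both quantities mentioned here are easy to check for any matrix. A character is parsimony-informative when at least two of its states each occur in at least two taxa; the character overlap of a taxon pair is the number of characters scored in both. The sketch below uses an invented five-taxon matrix with "?" for missing data; it does not reproduce the numbers for the amniote matrix.

```python
from collections import Counter


def is_parsimony_informative(column, missing="?"):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(s for s in column if s != missing)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def uninformative_fraction(matrix, missing="?"):
    """Share of characters (columns) that cannot discriminate between trees under parsimony."""
    columns = list(zip(*matrix.values()))
    uninformative = sum(not is_parsimony_informative(col, missing) for col in columns)
    return uninformative / len(columns)


def shared_characters(row_a, row_b, missing="?"):
    """Number of characters scored in both taxa (the basis of their pairwise distance)."""
    return sum(a != missing and b != missing for a, b in zip(row_a, row_b))


# Invented toy matrix: rows are taxa, columns are characters.
toy = {"A": "0011?", "B": "0010?", "C": "0101?", "D": "1?000", "E": "1?10?"}

print(uninformative_fraction(toy))              # 0.4 (characters 2 and 5)
print(shared_characters(toy["A"], toy["D"]))    # 3
```

A distance based on only a handful of shared characters is a much rougher estimate than one based on hundreds, which is why sparsely scored fossils can end up in odd places in a distance-based tree or network.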

Similar to the angiosperm matrix, Gauthier et al.'s tree has a relatively low CI (0.45) and RC (0.33), i.e. there is homoplasy adding to the missing data as a source of incompatible, tree-unlike signals.

Just by comparing the NJ tree to the parsimony tree, we can see that distance distortion because of missing data is no big deal for this matrix.

The trees are largely congruent, with three striking exceptions: the birds (Aves), the crocodiles (Crocodylia) and turtles (Testudines) are not placed as sisters to the lineage leading to modern-day mammals (tree provided by Gauthier et al.), but fall in the "dinosaur"-only clade in the NJ tree (compare with the current Tree of Life: Archosauria). This makes sense (data-wise), because in Gauthier et al.'s matrix the taxon pairs Aves + Ornithosuchia and Crocodylia + Pseudosuchia are identical in their shared defined characters (i.e. zero-distance pairs). Obviously, the published parsimony tree comes with some implicit assumptions: the unweighted/unordered single most-parsimonious tree PAUP* infers for the matrix using the branch-and-bound algorithm has only 510 steps, a higher CI (0.66) and RC (0.59), and is largely congruent with the NJ tree, except that Captorhinidae and Testudines are sisters and Casea, Ophiacodon and Edaphosaurus form a grade, not a clade.

As in the other cases so far, the Neighbor-net well captures the actual data situation.

Blue edge bundles refer to splits shared with both the NJ tree and the (inferred, not provided) MPT. Note that some splits in the NJ tree and/or the MPT have no counterpart in the Neighbor-net. One split found in the MPT but not in the NJ tree has a corresponding edge in the Neighbor-net (light blue).

The thin "upper trunk" in the Neighbor-net further shows that the matrix provides a strong signal for an increase of shared derived ('mammalian') and decrease of shared ancestral ('reptilian') traits, which is a bias. Although the MPT and NJ tree agree well, the matrix provides clear tree-like signal only for terminal relationships in the other main, inferred clade. The thinning trunk may also indicate a taxon sampling issue. Well-sampled phylogenetic data sets usually result in more star-like networks (see e.g. graphs in this post on fossil and extant walnuts, dinosaurs, spermatophytes, or the above ones and the next one) in contrast to non-phylogenetic data sets (see e.g. the posts on breast sizes, airlines, or moons).

Take-home message in the middle of the film

Even though they are arbitrary choices, the three matrices above show what phylogeneticists had to work with in the 1980s, i.e. morphological datasets:
  • ... trapped in homoplasy (Dahlgren & Bremer, 1985): datasets in which phylogenetic relationships were obscured behind highly ambiguous, non-treelike signal;
  • ... asking for a model (Jamieson et al., 1987): datasets with partly consistent signal, but not consistent enough to result in the same tree independent of the optimality criterion;
  • ... encoding a tree (Gauthier et al., 1988): datasets tweaked to promote a certain evolutionary hypothesis, including (superficially) simple series of gradual evolution and ancestor-descendant pairs (see Trivial data, not so trivial graphs). Such data will result in a single optimal tree (method independent!) dominated by staircase-like subtrees. This may be fine for a cladist, but is nothing a phylogeneticist / evolutionary biologist could really be content with (not in the 1980s, or before 1950).

Top, two phylogenetic trees sketched by Darwin; bottom, Hilgendorf's (1866) phylogenetic tree. There are quite a few before 1950 (e.g. Pojárkova, 1933, Acta Institute of Botany, Academy of Sciences of the USSR, ser. 1, 1: 225–374; unfortunately, I have no copy/scan).
