Showing posts with label Bootstrap. Show all posts
Showing posts with label Bootstrap. Show all posts

Monday, August 6, 2018

Trivial data, but not so trivial graphs


One may expect that perfectly compatible, trivial data will lead to perfect trees that are trivial to interpret. And this may really be the case when phylogenetics is restricted to contemporary taxa and molecular data. Adding to various earlier posts that deal with data patterns and their representation in inference graphs (e.g. Networks can outperform PCA..., Stacking neighbour-nets..., Clades, cladograms, cladistics ... and networks ...), I will show in this post what we get when we deal with very trivial, straightforward to interpret, data.

Two trivial scenarios: a linear and a dichotomous evolutionary sequence

The virtual data matrix for our experiment comprises seven taxa (OTUs) from different time scales and six binary (Dollo) characters. There are two historical scenarios that are supported by patterns in the data (see the first figure).

The linear scenario has a mother taxon that evolves by acquiring a unique, persistent trait, and is replaced by its daughter taxon through time. In contrast, the dichotomous scenario has two subsequent events of cladogenesis: the all-ancestor A splits into two taxa (B, E), each defined by a unique change in a binary character passed on to their descendants. B and E then underwent a second cladogenetic event, giving rise to C+D and F+G.


The resultant data matrices have different properties. In the case of the linear evolution, all changes lead to synapomorphies sensu Hennig (characters #1–#5) along with one terminal autapomorphy of the latest member of the lineage, G (character #6).

In the case of the dichotomous evolution, we have two synapomorphies supporting the BCD and EFG clades (characters #1, #4), respectively, and four autapomorphies (each one for C, D, F and G, the youngest set of taxa).

The following figure shows the character-based splits (taxon bipartitions) for the linear evolution scenario:
(Trivial splits, one taxon separated from all others, in blue)

Reconstructing the (true) evolutionary pathway is trivial based on this perfect split pattern, especially if we know that A is the oldest taxon and G the youngest.


It's equally straighforward for our second scenario, with perfectly dichotomous evolution:


Character 1 and character 4 define taxon cliques comprising B,C,D and E,F,G. The remaining characters indicate that C,D and F,G derive from B and E, respectively.


Explicit inferences

As stated above, the data properties for both scenarios are different. The matrices have a different number of parsimony-informative characters (4 for linear, 2 for dichotomous). Accordingly, the reconstructed optimal trees (here using the maximum parsimony, least-squares, and maximum likelihood criteria), are better resolved / more correct for the linear than for the dichotomous evolution.

MPT = most-parsimonious tree; ML = maximum likelihood. *Corrected for ascertainment bias.

Using all of the variable characters, NJ and ML are generally more decisive and produce higher support for the right branches. But for the dichotomous evolution scenario, they also show ghost-clades ("para-clades" as they include close relatives sharing a recent common origin, but do not represent monophyletic groups sensu Hennig) with low support. The corresponding MPT has no ghost-clades, but it also provides no clues to how B,C,D and E,F,G are related to each other.

Beyond this, and as can be seen in many real-world examples, there is no fundamental difference between character-based inferences such as maximum parsimony (MP) or maximum likelihood (ML) and distance-based inferences (NJ) fulfilling (here) the least-squares criterion (sometimes still called "phenetic" inferences in contrast to the "phylogenetic" parsimony, Bayesian inference and maximum likelihood).

The differences diminish further when we look at the phylograms instead of the cladograms, as shown next.


Another observation we can make is that for the linear-evolution scenario (four synapormophies), the ascertainment bias correction under ML has little effect, but it is crucial for the dichotomous evolution (two synapomorphies) to get sensible branch lengths.

Parsimony provides the most conservative (and least decisive) results for the dichotomous-evolution scenario, also because of the way I applied it: PAUP* allows optimizing trees with hard polytomies when using the default branch-and-bound search (for tree inference as well as bootstrapping), whereas the NJ / BioNJ algorithm and the ML implementation in RAxML will always produce fully dichotomized trees, including zero-length or near-zero-length branches. This explains the difference in the support values of preferred and alternative splits.


(Non-filtered) Bootstrap support consensus networks for the linear evolution scenario. Same scale for all graphs, trivial splits (dashed lines) collapsed.
(Non-filtered) Bootstrap support consensus networks for the dichotomous evolution scenario.


Trees are not wrong, but they miss the point

None of the graphs above show anything strongly erroneous, but they also don't fully capture the evolutionary pathways — that is, the actual ancestor-descendant relationships. This is because our taxon set includes ancestral forms, which, in traditional trees, have to be placed as sisters to part or all of their descendants. Networks provide a quick solution to this limitation.

Median-joining networks inferred with NETWORK 5.0.0.3 for both scenarios, with the inferred (and real) character changes annotated along edges.

Neighbour-nets inferred with SplitsTree 4.13.1 for both scenarios, based on the mean (Hamming) pairwise distances.

The two (perfectly tree-like) graphs, one parsimony-based, the other distance-based, look identical, and place all of the taxa exactly where they should be: the ancestors on the nodes ("medians"), and their (latest) descendants at the tips. But note that in the case of the Neighbour-net this is a visual illusion / approximation: in fact, the ancestors are actually connected by zero-length edges to the node they appear to be sitting on.

Given that both scenarios used here produce trivial, straightforward to interpret, data patterns (see the first figures), the failure of the traditional tree inferences to get it completely right can be a bit unsettling. Trees including primitive-old and derived-new forms are common in the (palaeontological) literature, and typically show many branches lacking high support (note that only ML produced a bootstrap support >90 for a true-tree branch, and only for the linear evolution scenario). To address evolution over time, networks should hence be standard applications, rather than the exception. Cladograms should be long gone, as they show very little beyond the most trivial.

If we want trees (and many of us want trees!), we need tree inferences that can optimize an older taxon on an internal branch or node, to accommodate potentially ancestral forms.

Related blog posts

In Clades, cladograms, cladistics, and why networks are inevitable, I argue that we cannot get around networks when we aim to study taxa from different time scales using their morphologies.

Digging deeper: Population dynamics and individual-based fossil phylogenies raises the question of what we deal with when we use individual fossils (i.e. long-dead individuals) as OTUs in our phylogenetic inferences.
 
Monophyletic groups in networks by David gives an introduction into (fringe) terminology. What to do when dealing with more than a single most-recent common ancestor and past reticulation?

Networks and most recent common ancestors by David discusses the concepts of conservative MRCAs (most recent common ancestors), fuzzy MRCAs and (alternative) LCA — lowest (last) common ancestors in the face of reticulation.

In Stacking neighbour-nets: ancestors and descendants, I outline how one may (and why one should) stack Neighbour-nets to analyse the evolutionary history of a group including (mostly) fossil representatives.

The first Darwinian evolutionary tree[s] show features one rarely finds in a modern-day phylogenetic tree: ancestral and descendant forms, ancestral taxa addressed as species and not higher taxa, and gradual transition between forms (post by David).

Tree metaphors and mathematical trees by David, which introduces János Podani's concept about "branching silhouettes" and how to depict an actual evolutionary tree.

Where have all the ancestors gone? discusses the common notion that we don't have to deal with ancestor-descendant problems in phylogenetics at all, because the scarcity of the (terrestrial) fossil records ensures to only find extinct side (sister) lineages. 
 

Monday, April 9, 2018

The curious case(s) of tree-like matrices with no synapomorphies


(This is a joint post by Guido Grimm and David Morrison)

Phylogenetic data matrices can have odd patterns in them, which presumably represent phylogenetic signals of some sort. This seems to apply particularly to morphological matrices. In this post, we will show examples of matrices that are packed with homoplasious characters, and thus lead to trees with a low Consistency Index (CI), but which nevertheless have high tree-likeness, as measured by a high Retention Index (RI) and a low matrix Delta Value (mDV). We will also try to explore the reasons for this apparently contradictory situation.

Background

A colleague of ours was recently asked, when trying to publish a paper, to explain why there were low CI but high RI values in his study. This reminded Guido of a set of analyses he started about a decade ago, using an arbitrary selection of plant morphological matrices he had access to.

The idea of that study was to advocate the use of networks for phylogenetic studies using morphological matrices, based on the two dozen data sets that he had at hand. The datasets were each used to infer trees and quantify branch support, under three different optimality criteria: least-squares (via neighbour-joining, NJ), maximum likelihood, and maximum parsimony. This study was was never wrapped up for a formal paper, for several reasons (one being that 10 years ago Guido had absolutely no idea which journal could possibly consider to publish such a paper, another that he struggled to find many suitable published matrices).

The signals detected in the collected matrices were quite different from each other. The set included matrices with very high matrix Delta Values (mDV), nontree-like signals, and astonishingly low mDVs, for a morphological matrix. Equally divergent were the CI and RI of the inferred equally most-parsimonious trees (MPT) and the NJ tree. The data for the MPTs and the primary matrices are shown in the first graph, as a series of scatterplots, where each axis covers the values 0-1. (Note: in most cases the NJ topologies are as optimal as the MPTs, and have similar CI and RI values.)


As you can see, the CI values (parsimony-uninformative characters not considered) are not correlated with either the RI or mDV values, whereas the latter two are highly correlated, with one exception.

The most tree-like matrix (mDV = 0.184, which is a value typically found for molecular matrices allowing for inference of unambiguous trees) was the one of Hufford & McMahon (2004) on Besseya and Synthyris. The number of MPTs was undetermined —using a ChuckScore of 39 steps (the best value found in test runs), PAUP* found more than 80,000 MPTs with a CI of 0.39 (third-lowest of all of the datasets), but an RI of 0.9 (highest value found).

A strict consensus network of the 80,003 equally parsimonious solutions, the network equivalent to the commonly seen strict consensus tree cladograms. Trivial splits are collapsed. Colours solely added for orientation (see next graph).

Oddly, the NJ tree had the same number of steps (under parsimony), but a much higher CI (0.69). The proportion of branches with a boostrap support of > 50% was twice as large in a distance-based framework than using parsimony.

Bootstrap consensus networks based on 10,000 pseudoreplicates each. Left, distance-based and inferred using the Neighbour-Joining algorithm; right, using a branch-and-bound search under parsimony as optimality criterion (one tree saved per replicate). Edge-lengths reflect branch support of sole or competing alternatives; alternatives found in less than 20% of the replicates not shown; trivial splits are collapsed. Same colour scheme than above for orientation.

The Neighbour-net based on this matrix has quite an interesting structure. Tree-like portions are clearly visible (hence, the low mDV) but the branches are not twigs but well developed trunks. The large number of MPTs is mainly due to the relative indistinctness of many OTUs from each other.


Neighbour-net based on simple mean (Hamming) morphological distances. Same colour scheme as above.
This distance-based 2-dimensional graph captures all main aspects of the tree inferences and bootstrap analyses, with one notable exception: B. alpina which is clearly part of the red clade in the tree-based analyses. We can see that the orange group, B. wyomingensis and close relatives, is (morphology-wise) less derived than the red species group. Although B. alpina is usually placed in a red clade, it would represent a morphotype much more similar to the orange cluster as it lacks most of the derived character suite that defines the rest of the red clade. In trees, B. alpina is accordingly connected to the short red root branch as first diverging "sister" with a very short to zero-long terminal branch, but in the network it is placed intermediate between the poorly differentiated but morphologically inhomogenous oranges and the strongly derived reds — being a slightly reddish orange. This reddishness may reflect a shared common origin of B. alpina and the other reds, in which case the tree-based inferences show us the true tree. Or just a parallel derivation in a member of the B. wyoming species aggregate, in which case the unambiguous clade would be a pseudo-monophylum (see also our recent posts on Clades, cladistics, and why networks are inevitable and Let's distinguish between Hennig and cladistics).

Interpretation, what does low CI but high RI stand for?

The distinction between the Consistency Index and the Retention index has been of long-standing practical importance in phylogenetics. For a detailed discussion, you can consult the paper by Gavin Naylor and Fred Kraus (The Relationship between s and m and the Retention Index. Systematic Biology 44: 559-562. 1995).

For each character, the consistency index is the fraction of changes in a character that are implied to be unique on any given tree (ie. one change for each character state): m / s, where m = the minimum possible number if character-state changes on the tree, and s = the observed number if character-state changes on the tree. The sum of these values across all characters is the ensemble consistency index for the dataset (CI).

The retention index (also called the homoplasy excess ratio) for each character quantifies the apparent synapomorphy in the character that is retained as synapomorphy on the tree: (g - s) / (g - m), where g = the greatest amount of change that the character may require on the tree. Once again, the sum of these values across all characters is the ensemble retention index for the dataset (RI).

Both CI and RI are comparative measures of homoplasy — that is, the degree to which the data fit the given tree. However, CI is negatively correlated with both the number of taxa and the number of characters, and it is inflated by the inclusion of parsimony-uninformative characters. RI is less sensitive to these characteristics. However, RI is inflated by the presence of unique states in multi-state characters that have some other states shared among taxa and, therefore, are potentially synapomorphic.

It is these different responses to character-state distributions (among the taxa) that apparently create the situation noted above for morphological data. Neither CI nor RI directly measures tree-likeness, but instead they are related to homoplasy. So, it is the relative character-state distributions among the taxa that matter in determining their values, not just the tree itself.

For example, increasing the number of states per character will, in general, increase CI faster than RI. Increasing the number of states that per character that occur in only one taxon will, in general, increase RI faster than CI.

Take-home message

This is just another example demonstrating that morphological data sets should not be used to infer (parsimony) trees alone, but analysed using a combination of Neighbour-nets and support Consensus Networks. No matter which optimality criterion is preferred by the researcher, the signal in such matrices is typically not trivial. It calls for exploratory data analysis, and inference methods that are able to capture more than a trivial sequence of dichotomies.

[Update 10/9/2018] Related data files can now be found in my Collection of morphological matrices (some including extinct taxa) and related phylogenetic inferences (Version 2) on figshare

Tuesday, October 17, 2017

Networks, not trees, identify "weak spots" in phylogenetic trees


A major application of networks in exploratory data analysis is to identify signal oddities and visualise ambiguity. Thus, they would be the natural choice when it comes to pinpointing weaknesses in phylogenetic trees. This is particularly so when the aim is to propose a relatively stable (and intuitive) ‘phylogenetic’ (identifying likely monophyla sensu Hennig) or ‘cladistic’ (clade-based) systematic framework for a group of organsims. In other words, whenever we try to translate branching patterns into monophyletic groups.

‘Weak spots’ in phylogenetic trees are relationships with either little or ambiguous support, or branching patterns strongly affected by sampling (taxa and characters). These are topological phenomena that are rather the rule than the exception when studying extinct groups of organisms (e.g. spermatophytes or ‘long-necks’).

One example appears to be probably one of the fiercest group of marine predators: the mosasaurs (mosasauroid squamates; Madzia & Cau 2017). I will discuss this example in this post.

Fig. 1. The tree-based systematic groups of mosasaurs (Mosasauroidae plus ancient relatives) when applying Madzia & Cau's nomenclature to their Bayesian-inferred majority-rule consensus tree. Most higher taxa (above genus) are "branch-based", except for the "node-based" Mosasauridae, Russellosaurina (wrong suffix, kept as rank-less taxon by the authors), Tethysaurinae, and Yaguarasaurinae. Genera represented by a single OTU in blue, 'non-monophyletic' genera in red. Thick branches received near unambiguous support (PP ≥ 0.95)

Madzia & Cau “re-examined a data set that results from modifications assembled in the course of the last 20 years and performed multiple parsimony analyses and Bayesian tip-dating analysis” in order to identify the ‘weak spots’ and take them into account when providing a revised cladistic nomenclature of “the ‘traditionally’ recognized mosasauroid clades” (Fig. 1). They define possibly monophyletic groups via recurring branching patterns in their various trees, along with the position of key taxa in those trees (see their chapter Phylogenetic [in fact: cladistic] nomenclature). This allows the groups to “self-destruct” when not forming a clade, and to be replaced.

Although the combination of unweighted and differentially weighted parsimony and Bayesian tip-dating analyses could be methodologically interesting (when examined in detail), it is hardly necessary in order to identify weaknesses and strengths of the data matrix used – going back to Bell 1997, and being emended since (see Introduction of Madzia & Cau) – to define possible monophyletic (or other) groups. A quick and simple neighbour-net splits graph would have done the trick, too.

The situation regarding tree inference, e.g. parsimony

The mosasaurid data matrix suffers from the typical problems: ambiguous, highly homoplasious signals, paired with a few missing data issues (typically lack of data overlap). Adding to this is the miscellaneous signal from taxa regarded as outgroups (here: ancient potential members of the mosasaurs): Adriosaurus suessi (which the authors used to root their trees), Dolichosaurus longicollis, and Ponto-saurus kornhuberi. Accordingly, standard parsimony analysis fails to provide a useful result for about half of the taxa, when documented in the traditional fashion (see my last post) — a strict consensus cladogram of all most parsimonious trees (MPTs) is shown in Fig. 2A.

Fig. 2 Strict consensus graphs based on 152 equally (most) parsimonious trees inferred from the matrix (all characters treated as unweighted and unordered) using PAUP*. Green, unambiguous placement/grouping; turquois, weakly 'rogue-ish', red, rogue taxa

But even the Adams consensus tree (Fig. 2B) is more informative, and the (near) strict consensus network (only showing splits that occur in more than a single MPT) highlights where the equally parsimonious solutions agree and disagree, and which taxa act more ‘rogueish’ than others (Fig. 2C). Weighting and Bayesian inference naturally produce more resolved trees; but the question remains whether the overall higher to unambiguous branch support sufficiently reflects the signal in the character matrix.

Data sets of extinct organisms need neighbour-nets, to start with

The consensus network of the most (equally) parsimonious trees (MPT; Fig. 2C) informs us about equally valid topological alternatives and ‘rogueness’. Using the branch-length averaging option, we can visualize character support to some degree for the alternatives. But there is a quicker and more comprehensive alternative, when it comes to (tree-)incompatible signal.

The neighbour-net (Fig. 3) directly identifies potentially strong signals and ‘weak spots’. First, we can see that the outgroup taxa are not clustered, which is never good. Obviously, they are not too useful to infer an ingroup root (Madzia & Cau discuss the outgroup sampling bias). Only one of the outgrops, Pontosaurus, is placed closed to the Aigialosauridae, which collects the earliest diverging Mosasauroideae lineage (see Fig. 1). Their signals are likely to mess-up any tree inference (Fig. 2).

Fig. 3 The neighbour-net based on simple (Hamming) mean distances inferred from Madzia & Cau's matrix. Colouring as in Fig. 1

Trivial (data-wise) lineages are e.g. the Tylosaurinae, supported by a very long narrow branch— this lineage is characterised by high group coherence and distinctness to any other taxon/taxon group and will inevitably have high support and placed close (phylogenetically and absolute) to the Plioplate-carpinae (Figs 2, 3). The Mosasaurinae are equally well circumscribed, with only one putative member, Dallasaurus, being substantially apart from the rest, and bridging Mosasaurinae and Halisaurinae, their putative sisters. Hence, trees will favour splits rejecting the "Natantia" group unless Dallasaurus is excluded from the inference.

Species of the same genera are conspicuously grouped; this differs from Madzia & Cau’s trees, where Mosasaurus or Prognathodon species are collected in the same subtrees, but are “non-monophyletic”, i.e. do not form an exclusive clade. Based on the neighbour-net, the main reason may be terminal noise and resulting flat likelihood surfaces (hence, low posterior probabilities). The placement of the older members of the mosasaurs (classified as Tethysaurinae and Yaguarasaurinae) to each other, and the slightly older outgroup taxa, is clearly difficult with this matrix, even though there is no ambiguity, e.g. in the MPT sample (Fig. 2). Hence, the branch-lengths do not reflect synapomorphies or rarely shared apomorphies in this subtree, but instead shared convergences — a perfect phylogeny always generates a perfectly tree-like distance matrix.

Oddly placed taxa in the neighbour-net? Probably unrepresentative distances; and the quick fix

In contrast to trees, the network in Fig. 3 fails to resolve a likely position for one Prognathodon species: P. currii, and the large associated box indicates a data issue. The pairwise distances of the oddly placed P. currii and the probably misplaced Dolichosaurus, are poorly defined: both have zero-distances to non-similar taxa, but also to each other. But whereas Dolichosaurus differs from other members of Prognathodon by mean morphological distances (MD) of 0.5–1.0 (1.0 means it differs in all defined characters!), P. currii is much more similar to its congeners (MD = 0.17–0.27 and 0.46). Their other affinities also lie with strongly different taxon sets.

Their position in the neighbour-net is the result of a missing data artefact. Being just a 2-dimensional graph, such severe signal ambiguity cannot be resolved. Unrepresentative distances are the major (only) obstacle for neighbour-nets in the context of extinct groups. Trees are more decisive in such cases, when the few covered characters fit well the preferred tree's topology. By removing the outgroup taxa and P. currii, we can generate a neighbour-net (Fig. 4) in-line, and going beyond the Bayesian-tree-based groups suggested by Madzia & Cau (Fig. 1).

Fig. 4 Same data and method as shown in Fig. 3; four OTUs were excluded, the non-Mosasauroidea (outgroup) and the misplaced Prognathodon currii

Using networks to define taxonomic groups

Just based on the neighbour-nets (Figs 3, 4), circumscription of genera and higher taxa can be discussed (assuming that morphology mirrors phylogeny). For instance, Mosasaurus can be kept as-is or can include Plotosaurus; whereas the Clidastes form a clearly distinct taxon (whether paraphyletic/ monophyletic or clade/grade may be impossible to decide, see Fig. 1). Including (all) Prognathodon in the Globidensini remains an option; Eremiasaurus may be included, too, or included in the likely sister clade, the Mosasaurini.  

Dallasaurus is not only the oldest possible but clearly the most unique (primitive?) member of the Mosasaurinae, and the Halisaurinae likely represent their early diverged sister lineage. Treating Tylosaurinae and Plioplatecarpinae as reciprocally monophyletic sister lineages makes sense with respect to the older taxa and the co-eval Mosasaurinae-Halisaurinae lineage. The ancient forms are generally more similar to Plioplatecarpinae (+ Tylosaurinae) than to the Mosasaurinae and Halisaurinae lineages; but whether they should be included in the same systematic group ("Russellosaurina") cannot be judged based on the data matrix or the inferred trees (see also Figs 1, 2). Their topological attraction may be due to more shared primitive features (Hennig's ‘symplesiomorphies’), and the "Russellosaurina" could be a paraphyletic clade.

An interesting pronounced central edge bundle in the network in Fig. 4, which agrees well with Madzia & Cau's Bayesian consensus tree (Fig. 1), is the one separating all oldest, potentially more primitive taxa/lineages (> 90 Ma) from the later more diversified lineages (Mosasaurinae, Halisaurinae, Plioplatecarpinae, and Tylosaurinae). Regarding primitiveness vs. derivedness, an option to map characters on networks and extract alternative trees directly from the network would be handy (see also David’s 500th post).


Fig. 5 Bootstrap (BS) support network based on 10,000 BS (pseudo)replicates optimised under parsimony. Splits are shown that occurred only in at least 20% of the BS replicates; trivial splits are collapsed. Some taxa have low, but unchallenged support, in other cases no preference at all is found (e.g. for the highest level bracketing taxa) or two alternatives compete with each other.

Also in the case of the mosasaurs: when we want to use phylogenetic trees as the sole (or main) basis for classification, rather than neighbour-nets (see my last post) and common sense backed up by EDA (e.g. Fig. 4; Bomfleur et al. 2017), the method of choice would be the support consensus networks based on parsimony (example provided in Fig. 5), least-squares, and/or likelihood bootstrapping pseudoreplicate samples. in addition to or instead of the Bayesian-inferred topologies sample. The posterior probabilities in Madzia & Cau’s tip-dated tree and Bayesian majority-rule consensus tree include values << 1.0, which already can be an indication of very strong signal conflict or just lack of discriminating signal (flat likelihood surfaces).

We should not be over-confident in PP, when the underlying data are not tree-like at all, as they too easily tilt towards one alternative (see also Zander 2004). The same holds for post-analysis character weighting, designed to eliminate (down-weigh) conflicting signals. While parsimony and distance methods are more easily affected by branching artefacts, probabilistic methods may struggle with flat likelihood surfaces. Thus, bootstrap support networks should be the first choice for ‘phylogenetic’ (by identifying Hennigian monophyla) or ‘cladistic’ (clade-based) classification as they show the robustness of the signal for the preferred and other topological alternatives, and can be generated under different optimality criteria. Having a certain support for a clade is nice, but one should always consider the support for alternatives, and consider how many characters support or oppose an alternative.

Morphological matrices need to be analysed using network approaches

Madzia & Cau’s study is methodologically interesting by providing a tip-dated Bayesian tree for an extinct group of organisms. A one-to-one comparison of their parsimony-BS support using different character and weighting schemes vs. Bayesian PP may be interesting, too — note the difference between the tip-dated tree and the majority rule consensus trees for several critical branches. However, following the current standard practice, no BS pseudoreplicate and Bayesian saved topologies samples were provided. Regarding the main objective, the identification of ‘weak spots’ to propose enhanced systematic groups, networks (Figs 2–5) would have been more informative and straightforward.

No matter what classification philosophy is applied, when we deal with morphological matrices of extinct groups of organisms, the first step should always be to explore the primary signal in the data before we infer trees using (highly) sophisticated methods, and interpret them — the latter may actually obscure ‘weak spots’ rather than identifying them. The quickest analyses are neighbour-nets, but watch out for odd pairwise distance patterns (easily visualised using heat maps)!

The second step is producing support consensus networks, for the fine-tuning and to decide on the most probable trees to explain the data. Regarding classification, we should ask ourselves whether we really want inevitably unstable clade-based classification systems (when dealing with extinct organisms), or robust ones that reflect the general data situation and include potentially or likely paraphyletic taxa (see e.g. Clidastes in Figs 2–5 and Madzia & Cau's trees, and their elaborate discussion of higher level taxa, which – to a good degree – could become superfluous when allowing paraphyletic taxa).

Links

All graphics, and some primary data files, are publicly available from figshare. An archive including all re-analysis files can be downloaded at www.palaeogrimm.org.

References

Bell GL (1997) A phylogenetic revision of North American and Adriatic Mosasauroidea. In: Callaway JM, and Nicholls EL, eds. Ancient Marine Reptiles. San Diego: Academic Press, pp. 293–332 [cited from Madzia & Cau 2017]

Bomfleur B, Grimm GW, McLoughlin S. 2017. The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5:e3433. https://peerj.com/articles/3433/.

Madzia D, Cau A (2017) Inferring 'weak spots' in phylogenetic trees: application to mosasauroid nomenclature. PeerJ 5: e3782. https://peerj.com/articles/3782/.

Zander RH (2004) Minimal values of reliability of Bootstrap and Jackknife proportions, Decay index, and Bayesian posterior probability. PhyloInformatics 2: 1–13.

Tuesday, July 4, 2017

Should we try to infer trees on tree-unlikely matrices?


Spermatophyte morphological matrices that combine extinct and extant taxa notoriously have low branch support, as traditionally established using non-parametric bootstrapping under parsimony as optimality criterion. Coiro, Chomicki & Doyle (2017) recently published a pre-print to show that this can be overcome to some degree by changing to Bayesian-inferred posterior probabilities. They also highlight the use of support consensus networks for investigating potential conflict in the data. This is a good start for a scientific community that so far has put more of their trust in either (i) direct visual comparison of fossils with extant taxa or (ii) collections of most parsimonious trees inferred based on matrices with high level of probably homoplasious characters and low compatibility. But do those matrices really require or support a tree? Here, I try to answer this question.

Background

Coiro et al. mainly rely on a recent matrix by Rothwell & Stockey (2016), which marks the current endpoint of a long history of putting up and re-scoring morphology-based matrices (Coiro et al.’s fig. 1b). All of these matrices provide, to various degrees, ambiguous signal. This is not overly surprising, as these matrices include a relatively high number of fossil taxa with many data gaps (due to preservation and scoring problems), and combine taxa that perished a hundred or more millions years ago with highly derived, possibly distant-related modern counterparts.

Rothwell & Stockey state (p. 929) "As is characteristic for the results from the analysis of matrices with low character state/taxon ratios, results of the bootstrap analysis (1000 replicates) yielded a much less fully resolved tree (not figured)." Coiro et al.’s consensus trees and network based on 10,000 parsimony bootstrap replicates nicely depicts this issue, and may explain why Rothwell & Stockey decided against showing those results. When studying an earlier version of their matrix (Rothwell, Crepet & Stockey 2009), they did not provide any support values, citing a paper published in 2006, where the authors state (Rothwell & Nixon 2006, p. 739): “… support values, whether low or high for particular groups, would only mislead the reader into believing we are presenting a proposed phylogeny for the groups in question. Differences among most-parsimonious trees are sufficient to illuminate the points we wish to make here, and support values only provide what we consider to be a false sense of accuracy in these assessments”.

Do the data support a tree?

The problem is not just low support. In fact, the tree showed by Rothwell & Stockey with its “pectinate arrangement” conflicts in parts with the best-supported topology, a problem that also applied to its 2009 predecessor. This general “pectinate” arrangement of a large, low or unsupported grade is not uncommon for strict consensus trees based on morphological matrices that include fossils and extant taxa (see e.g. the more proximal parts of the Tree of Life, e.g. birds and their dinosaur ancestors).

The support patterns indicate that some of the characters are compatible with the tree, but many others are not. Of the 34 internodes (branches) in the shown tree (their fig. 28 shows a strict consensus tree based on a collection of equally parsimonious trees), 12 have lower bootstrap support under parsimony than their competing alternatives (Fig. 1). Support may be generally low for any alternative, but the ones in the tree can be among the worst.

The main problem is that the matrix simply does not provide enough tree-like signal to infer a tree. Delta Values (Holland et al. 2002) can be used as a quick estimate for the treelikeliness of signal in a matrix. In the case of large all-spermatophyte matrices (Hilton & Bateman 2006; Friis et al. 2007; Rothwell, Crepet & Stockey 2009; Crepet & Stevenson 2010), the matrix Delta Values (mDV) are ≥ 0.3. For comparison, molecular matrices resulting in more or less resolved trees have mDV of ≤ 0.15. The individual Delta Values (iDV), which can be an indicator of how well a taxon behaves during tree inference, go down to 0.25 for extant angiosperms – very distinct from all other taxa in the all-spermatophyte matrices with low proportions of missing data/gaps – and reach values of 0.35 for fossil taxa with long-debated affinities.

The newest 2016 matrix is no exception with a mDV of 0.322 (the highest of all mentioned matrices), and iDVs range between 0.26 (monocots and other extant angiosperms) and 0.39 for Doylea mongolica (a fossil with very few scored characters). In the original tree, Doylea (represented by two taxa) is part of the large grade and indicated as the sister to Gnetidae (or Gnetales) + angiosperms (molecular trees associate the Gnetidae with conifers and Ginkgo). According to the bootstrap analysis, Doylea is closest to the extant Pinales, the modern conifers. Coiro et al. found the same using Bayesian inference. Their posterior probability (PP) of a Doylea-Podocarpus-Pinus clade is 0.54, and Rothwell & Stockey’s Doylea-Ginkgo-angiosperm clade conflicts with a series of splits with PPs up to 0.95.

Figure 1. Parsimony bootstrap network based on 10,000 pseudoreplicate trees
inferred from the matrix of Rothwell & Stockey.
Edges not found in the authors’ tree in red, edges also found in the tree in green.
Extant taxa in blue bold font. The edge length is proportional to the frequency of the
according split (taxon bipartition, branch in a possible tree) in the pseudoreplicate
tree sample. The network includes all edges of the authors’ tree except for
Doylea + Gnetidae + Petriellales + angiosperms vs. all other gymnosperms and
extinct seed plant groups. Such a split has also no bootstrap support (BS < 10)
using least-square and maximum likelihood optimum criteria.

Do the data require a tree?

As David made a point in an earlier post, neighbour-nets are not really “phylogenetic networks” in the evolutionary sense. Being unrooted and 2-dimensional, they don’t depict a phylogeny, which has to be a sort of (rooted) tree, a one-dimensional graph with time as the only axis (this includes reticulation networks where nodes can be the crossing point of two internodes rather than their divergence point). The neighbour-net algorithm is an extension into two dimensions of the neighbour-joining algorithm, the latter infers a phylogenetic tree serving a distance criterion such as minimum evolution or least-squares (Felsenstein 2004). Essentially, the neighbour-net is a ‘meta-phylogenetic’ graph inferring and depicting the best and second-best alternative for each relationship. Thus, neighbour-nets can help to establish whether the signal from a matrix, treelike or not as it is the cases here, supports potential and phylogenetic relationships, and explore the alternatives much more comprehensively than would be possible with a strict-consensus or other tree (Fig. 2).

Figure 2. Neighbour-net based on a mean distance matrix inferred
from the matrix of Rothwell & Stockey.
The distance to the "progymnosperms", a potential ancestral group of the
seed plants, can be taken as a measurement for the derivedness of each
major group. The primitive seed ferns are placed between progymnosperms
 and the gymnosperms connected by partly compatible edge bundles; the
putatively derived "higher seed ferns" isolated between the progymnosperms
and the long-edged angiosperms. Shared edge-bundles and 'neighbourness'
reflect quite well potential phylogenetic relationships and eventual ambiguities,
as in the case of Gnetidae. Colouring as in Figure 1; some taxon names
are abbreviated.

In addition, neighbour-nets usually are better backgrounds to map patterns of conflicting or partly conflicting support seen in a bootstrap, jackknife or Bayesian-inferred tree sample. In Fig. 3, I have mapped the bootstrap support for alternative taxon bipartitions (branches in a tree) on the background of the neighbour-net in Fig. 2.

Obvious and less-obvious relationships are simultaneously revealed, and their competing support patterns depicted. Based on the graph, we can see (edge lengths of the neighbour-net) that there is a relatively weak primary but substantial bootstrap support for the Petriellales (a recently described taxon new to the matrix) as sister to the angiosperms. Several taxa, or groups of closely related taxa, are characterised by long terminal edges/edge bundles, rooting in the boxy central part of the graph. Any alternative relationship of these taxa/taxon groups receives equally low support, but there are notable differences in the actual values.

There is little signal to place most of the fossil “seed ferns” (extinct seed plants) in relation to the modern groups, and a very ambiguous signal regarding the relationship of the Gnetidae (or Gnetales) with the two main groups of extant seed plants, the conifers (Pinidae; see C. Earle’s gymnosperm database) and angiosperms (for a list and trees, see P. Stevens’ Angiosperm Phylogeny Website).

The Gnetidae is a strongly distinct (also genetically) group of three surviving genera, being a persistent source of headaches for plant phylogeneticists. Placed as sister to the Pinaceae (‘Gnepine’ hypothesis) in early molecular trees (long-branch attraction artefact), the currently favoured hypothesis (‘Gnetifer’) places the Gnetidae as sister to all conifers (Pinatidae) in an all-gymnosperm clade (including Gingko and possibly the cycads).

As favoured by the branch support analyses, and contrasting with the preferred 2016 tree, the two Doyleas are placed closest to the conifers, nested within a commonly found group including the modern and ancient conifers and their long-extinct relatives (Cordaitales), and possibly Ginkgo (Ginkgoidae). In the original parsimony strict consensus tree, they are placed in the distal part as sister to a Gnetidae and Petriellales + angiosperms (possibly long-branch attraction). The grade including the ‘primitive seed ferns’ (Elkinsia through Callistophyton), seen also in Rothwell and Stockey’s 2016 tree, may be poorly supported under maximum parsimony (the criterion used to generate the tree), but receives quite high support when using a probabilistic approach such as maximum likelihood bootstrapping or Bayesian inference to some degree (Fig. 3; Coiro, Chomicki & Doyle 2017).

Figure 3. Neighbour-net from above used to map alternative support patterns.
Numbers refer to non-parametric bootstrap (BS) support for alternative phylogenetic
splits under three optimality criteria: maximum likelihood (ML) as implemented in
RAxML (using MK+G model), maximum parsimony (MP), and least-squares
(via neighbour-joining, NJ; using PAUP*); and Bayesian posterior probabilties
(using MrBayes 3.2; see Denk & Grimm 2009, for analysis set-up). The circular
arrangement of the taxa allows tracking most edges in the authors’ tree and their,
sometimes better supported, alternatives. The edge lengths provide direct
information about the distinctness of the included taxa to each other; the structure
of the graph informs about the how tree-like the signal is regarding possible
phylogenetic relationships or their alternatives. Colouring as in Figure 1;
some taxon names are abbreviated.

Numerous morphological matrices provide non-treelike signals. A tree can be inferred, but its topology may be only one of many possible trees. In the framework of total evidence, this may be not such a big problem, because the molecular partitions will predefine a tree, and fossils will simply be placed in that tree based on their character suites. Without such data, any tree may be biased and a poor reflection of the differentiation patterns.

By not forcing the data in a series of dichotomies, neighbour-nets provide a quick, simple alternative. Unambiguous, well-supported branches in a tree will usually result in tree-like portions of the neighbour net. Boxy portions in the neighbour-net pinpoint the ambiguous or even problematic signals from the matrix. Based on the graph, one can extract the alternatives worth testing or exploring. Support for the alternatives can be established using traditional branch support measures. Since any morphological matrix will combine those characters that are in line with the phylogeny as well as those that are at odds with it (convergences, character misinterpretations), the focus cannot be to infer a tree, but to establish the alternative scenarios and the support for them in the data matrix.

References

Coiro M, Chomicki G, Doyle JA. 2017. Experimental signal dissection and method sensitivity analyses reaffirm the potential of fossils and morphology in the resolution of seed plant phylogeny. bioRxiv DOI:10.1101/134262

Crepet WL, Stevenson DM. 2010. The Bennettitales (Cycadeoidales): a preliminary perspective of this arguably enigmatic group. In: Gee CT, ed. Plants in Mesozoic Time: Morphological Innovations, Phylogeny, Ecosystems. Bloomington: Indiana University Press, pp. 215-244.

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83-100.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Friis EM, Crane PR, Pedersen KR, Bengtson S, Donoghue PCJ, Grimm GW, Stampanoni M. 2007. Phase-contrast X-ray microtomography links Cretaceous seeds with Gnetales and Bennettitales. Nature 450: 549-552 [all important information needed for this post is in the supplement to the paper; a figure showing the actual full analysis results can be found at figshare]

Hilton J, Bateman RM. 2006. Pteridosperms are the backbone of seed-plant phylogeny. Journal of the Torrey Botanical Society 133: 119-168.

Holland BR, Huber KT, Dress A, Moulton V. 2002. Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19: 2051-2059.

Rothwell GW, Crepet WL, Stockey RA. 2009. Is the anthophyte hypothesis alive and well? New evidence from the reproductive structures of Bennettitales. American Journal of Botany 96: 296–322.

Rothwell GW, Nixon K. 2006. How does the inclusion of fossil data change our conclusions about the phylogenetic history of the euphyllophytes? International Journal of Plant Sciences 167: 737–749.

Rothwell GW, Stockey RA. 2016. Phylogenetic diversification of Early Cretaceous seed plants: The compound seed cone of Doylea tetrahedrasperma. American Journal of Botany 103: 923–937.

Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.

Wednesday, April 16, 2014

Some things you probably don't know about the bootstrap


The following text was written a few years ago, but much of it never got published. So, I thought that this might be a good opportunity to make it available, since what it says is still true today.

Since a phylogenetic tree is interpreted in terms of the monophyletic groups that it hypothesizes, it is important to quantitatively assess the robustness of all of these groups (i.e. the degree of support for each branch in the tree) — is the support for a particular group any better than would be expected from a random data set? This issue of clade robustness is the same as assessing branch support on the tree, since each branch represents a clade. Many different techniques have been developed, including:
  1. analytical procedures, such as interior-branch tests (Nei et al. 1985; Sneath 1986), likelihood-ratio tests (Felsenstein 1988; Huelsenbeck et al. 1996b), and clade significance (Lee 2000);
  2. resampling procedures, such as the bootstrap (Felsenstein 1985), the jackknife (Lanyon 1985), topology-dependent permutation (Faith 1991), and clade credibility or posterior probability (Larget and Simon 1999); and
  3. non-statistical procedures, such as the decay index (Bremer 1988), clade stability (Davis 1993), and spectral signals (Hendy and Penny 1993).
Of these, far and away the most popular and widely used method has been the bootstrap technique (Holmes 2003; Soltis and Soltis 2003).


The bootstrap

This method was first introduced by Efron (1979) as an alternative method to jackknifing for producing standard errors on estimates of central location other than the mean (e.g. the median), but it has since been expanded to cover probabilistic confidence intervals as well (Efron and Tibshirani 1993; Davison and Hinkley 1997). It was introduced into phylogenetic studies by Penny et al. (1982) and then formalized by Felsenstein (1985), who suggested that it could be implemented by holding the taxa constant and resampling the characters randomly with replacement, the tree-building analysis then being applied to each of the bootstrap resamples.

Bootstrapping is a monte carlo procedure that generates "pseudo" data sets from the original data, and then uses these new data sets for its inferences. That is, it tries to derive the population inferences (i.e the "true" answer) from repeated generation of new samples, each sample being constrained by the characteristics of the original data sample. It thus relies on an explicit analogy between the sample and the appropriate population: that sampling from the sample is the same as sampling from the population. Clearly, the strongest requirement for bootstrapping to work is that the sample be a reasonable representation of the population.

Bootstrap confidence intervals are only ever approximate, especially for complex data structures, as they are a fundamentally more ambitious measure of accuracy than is a simple standard error (SE). For example, the usual formula for calculating a confidence interval (CI) when the population frequency distribution is assumed to be normal is: CI = t * SE, where t is the Student t-value associated with the particular sample size and confidence percentage required. However, the main use of bootstrapping is in situations where the population frequency distribution is either indeterminate or is difficult to obtain empirically, and so this simple formula cannot be applied. Getting from the standard error to a confidence interval is then not straightforward. As a result, there are actually several quite distinct procedures for performing bootstrapping (Carpenter and Bithell 2000), with varying degrees of expected success.

Types of bootstrap

The original technique is called the percentile bootstrap. It is based on the principle of using the minimum number of ad hoc assumptions, and so it merely counts the percentage of bootstrap resamples that meet the specified criteria. F§or example, to estimate the standard error of a median, the median can be calculated for each bootstrap resample and then the standard deviation of the resulting frequency distribution will be the estimated standard error of the original median. The method is thus rather simplistic, and is often referred to as the naïve bootstrap, because it assumes no knowledge of how to calculate population estimates. It is a widespread method, as it can be applied even when the other methods cannot. However, it is known to have certain problems associated with the estimates produced, particularly for confidence intervals, such as bias and skewness (especially when the parent frequency distribution is not symmetrical). These were pointed out right from the start (Efron 1979), and efforts have subsequently been made to deal with them. Nevertheless, this is the form of bootstrap introduced by Felsenstein (1985), and it is the one used by most phylogeny computer programs. It is therefore the one that will be discussed in more detail below.

These known problems with the naïve bootstrap can be overcome by using bias-corrected (BC) bootstrap estimates — that is, the bias is estimated and removed from the calculation of the confidence interval. Possible dependence of the standard error on the parameter being estimated, which creates skewness, can be dealt with by using bias-corrected and accelerated (BCa) bootstrap estimates, so that the bias and skewness are both estimated and removed from the calculation of the confidence interval. The BCa method is the one usually recommended for use (Carpenter and Bithell 2000), because it corrects for both bias and skewness. This method is much slower to calculate than the simple percentile bootstrap, because it requires an extra parameter to be estimated for each of the bias and skewness corrections, and the latter correction is actually estimated by performing a separate jackknife analysis on each bootstrap resample (which means that the analysis can take 100 times as long as a naïve analysis). There have been several attempts to apply this form of correction methodology to bootstrapping in a phylogenetic context (Rodrigo 1993; Zharkikh and Li 1995; Efron et al. 1996; Shimodaira 2002), but while these can be successful at correcting bias and skewness (Sanderson and Wojciechowski 2000) these have not caught on, possibly because of the time factor involved.

Alternatively, we can decide not to be naïve when calculating confidence intervals, and to therefore calculate them in the traditional manner, using the standard error and the t-distribution. However, we then need to overcome any non-normal distribution problems of these two estimates by estimating both of them using bootstrapping. That is, bootstrapped-t confidence intervals are derived by calculating both the standard error and the t-value using bootstrapping, and then calculating the confidence interval as ±t * SE. To many people, this is the most natural way to calculate confidence intervals, since it matches the usual parametric procedure, and thus it is frequently recommended (Carpenter and Bithell 2000). Once again, this method is much slower to calculate than the percentile bootstrap, because the t-value is actually estimated by performing a separate bootstrap analysis on each bootstrap resample (which means that the analysis can take 100 times as long as a naïve analysis). This methodology seems not to have yet been suggested in a phylogenetic context, and in any case the time factor may be restrictive.

It is also possible to calculate test-inversion confidence intervals. This idea is based on the reciprocal relationship of statistical tests and confidence intervals, where (for example) non-overlapping 95% confidence intervals indicate statistically significant patterns at p<0.05 and vice versa. Thus, if we work out the situation where the pattern has a probability of p=0.05 of occurring by chance, then this defines the 95% confidence limit of the pattern. Clearly, this can be a complex process, especially for two-sided tests (which double the required number of calculations), as it can only be done by iteratively modifying the pattern and re-calculating the probability until the solution is reached. Once again, no-one yet seems to have suggested this methodology in a phylogenetic context, which is not unexpected given the general problems in deciding how to test branches statistically.

The above methods all count as non-parametric bootstrap methods. More recently, parametric bootstrapping methods have also been developed, which make the more restrictive assumption that a parametric model can be applied to the data (e.g. that the standard deviation of the parameter can be reliably estimated). In parametric bootstrapping we generate simulated datasets based on the assumed frequency distribution of the data, rather than by resampling from the data set itself. That is, instead of sampling from the sample, we sample from the assumed theoretical distribution to generate the set of bootstrap samples. We can then apply the percentile, BCa or bootstrap-t methods, described above, in the usual way. Clearly this method assumes that we know the appropriate frequency distribution; and the method will only be appropriate if this assumption is true, but not otherwise. However, if the assumption is correct, then this can be the most powerful method (Huelsenbeck et al. 1996a; Newton 1996) because it is not dependent on the representativeness of the data sample. The method has been introduced into phylogenetics in several contexts (Goldman 1993; Adell and Dopazo 1994; Huelsenbeck et al. 1996a), but the appropriate frequency distribution for branch support is not obvious (i.e. a phylogeny is a complex structure and cannot be represented by a single number but rather requires a model of sequence evolution and a model tree) and so it is not used for this purpose.

Issues with the bootstrap

Thus, for several reasons, all of the best bootstrapping methods are not likely to be available when assessing the robustness of a phylogenetic tree, and we are left with the naïve percentile bootstrap, which can be expected a priori to provide biased and skewed estimates of confidence intervals (because the frequency distribution associated with tree branches will not be symmetrical). Sadly, these problems have been repeatedly confirmed for the assessment of branch support in phylogenetic tree-building, both theoretically (Zharkikh and Li 1992a, 1992b; Felsenstein and Kishino 1993; Li and Zharkikh 1994; Sitnikova et al. 1995; Berry and Gascuel 1996; Efron et al. 1996; Huelsenbeck et al. 1996a; Newton 1996; Sanderson and Wojciechowski 2000; Suzuki et al. 2002; Alfaro et al. 2003; Erixon et al. 2003; Galtier 2004; Huelsenbeck and Rannala 2004) and empirically (Sanderson 1989; Hillis and Bull 1993; Buckley et al. 2001; Buckley and Cunningham 2002; Wilcox et al. 2002; Taylor and Piel 2004).

An example of the relationship between naïve bootstrap probabilities and the true probability of a false positive result, showing that percentile bootstrap indices >75% tend to be underestimates of the amount of support while they are overestimates below this level. The graph is based upon 1000 bootstrap resamples of 100 simulated characters for a clade of three taxa plus outgroup (based on data presented by Zharkikh and Li 1992a). The true probability represents the amount of character support for the clade in the simulated data, while the bootstrap probability is the proportion of resamples that included the clade.

These studies have demonstrated that the probability of bootstrap resampling supporting the true tree may be either under- or overestimated, depending on the particular situation. For example, bootstrap values >75% tend to be underestimates of the amount of support, while they may be overestimates below this level, as shown in the first graph (above). That is, when the branch support is strong (i.e. the clade is part of the true tree) there will be an underestimation and when the support is weak (i.e. the clade is not part of the true tree) there will be an overestimation. This situation has been reported time and time again, with various theoretical explanations (e.g. Felsenstein and Kishino 1993; Efron et al. 1996; Newton 1996), although there are dissenting voices (e.g. Taylor and Piel 2004) as would be expected for a complex situation. Unfortunately, practitioners seem to ignore this fact, and to assume incorrectly that bootstrap values are always underestimates.

Just as importantly, the theoretical studies show that the pattern of over- and underestimation depends on (i) the shape of the tree and the branch lengths, (ii) the number of taxa, (iii) the number of characters, (iv) the evolutionary model used, and (v) the number of bootstrap resamples. This was first reported by Zharkikh and Li (1992a), and has been reconfirmed since then. For example, with few characters the bootstrap index tends to overestimate the support for a clade and to underestimate it for more characters. This is particularly true if the number of phylogenetically informative characters is increased or the number of non-independent characters is increased; and the index becomes progressively more conservative (i.e. lower values) as the number of taxa is increased.

Moreover, these patterns of under- and overestimation are increased with an increasing number of bootstrap replications, as shown in the next graph — this called "being wrong, with confidence".

An example of the relationship between the true clade probability and the observed non-parametric bootstrap proportion for two simulated data sets with different numbers of characters (as shown). The lines are based on data presented by Zharkikh & Li, (1995) for 1000 bootstrap resamples of a clade of three taxa plus outgroup.

The following graph pair of graphs show the effect of varying the evolutionary model used to generate the data, where under-specification of the analysis model leads to a general over-estimate of the true probability (cross-over at p=0.8, as shown in the first graph of the pair), while matching the generating and analysis models leads to a general under-estimation (cross-over at p=0.3, as shown in the second graph of the pair).

An example of the relationship between the true tree probability and the difference between the observed percentile bootstrap proportion and the true probability for two simulated data sets. The label in the bottom corner shows the substitution model used to simulate the data, then the model assumed in the bootstrap analysis (the sequence length is 100 nucleotides); JC69 = Jukes-Cantor, GTRG = general time- reversible + gamma-distributed among-site rate variation. The points are based on data presented by Huelsenbeck & Rannala (2004).

These are serious issues, which seem to be often ignored by practitioners. We can't just assume that the "true" support value is larger than our observed bootstrap value. In particular, this means that bootstrap values are not directly comparable between trees, even for the same taxa, and thus there can be no "agreed" level of bootstrap support that can be considered to be "statistically significant". A bootstrap value of 90% on a branch on one tree may actually represent less support than a bootstrap value of 85% on another tree, depending on the characteristics of the dataset concerned and the bootstrapping procedure used (although within a single tree the values should be comparable).

This complex situation means that we have to consider carefully how best to interpret bootstrap values in a phylogenetic context (Sanderson 1995). The bootstrap proportion (i.e. the proportion of resampled trees containing the branch/clade of interest) has variously been interpreted as (Berry and Gascuel 1996):
  1. a measure of reliability, telling us what would be expected to happen if we repeated our experiment;
  2. a measure of accuracy, telling us about the probability of our experimental result being true; and
  3. a measure of confidence, interpreted as a conditional probability similar to those in standard statistical hypothesis tests (i.e. measuring Type I errors or false positives).
The bootstrap was originally designed for purpose (1), and all of the problems identified above relate to trying to use it for purposes (2) and (3). The values derived from the naïve bootstrap need correcting for purposes (2) and (3), and the degree of correction depends on the particular data set being examined (Efron et al. 1996; Goloboff et al. 2003).

The issue of support values depending on the number of bootstrap replicates is also of interest. It is usually recommended that at least 1,000–2,000 bootstrap resamples are taken for estimating confidence intervals, and this generality has been applied to phylogenetic trees (Hedges 1992). However, it is important to recognize that these suggestions relate to the precision of the confidence estimates not to their accuracy. Accuracy refers to how close the estimates are to the true value (i.e. correctness) while precision refers to how variable are the estimates (i.e. repeatability). Accuracy depends on a complex set of characteristics many of which have nothing to do with bootstrap replication. Precision, on the other hand, is entirely to do with the number of bootstrap replicates and the expected accuracy of the estimates. As shown in the next graph, 100 replicates at a conventional level of accuracy produces estimates that are expected to be within ±4% of the "true" values while 2,000 replicates produces estimates ±1%. This needs to be borne in mind when deciding whether to call a particular value "significant support" or not.

The number of bootstrap replicates needed to achieve a specified amount of precision, given statistical testing at two different levels of probability. For example (as shown by the dotted line), 100 bootstrap replicates means that, if the bootstrap value is accurate at the 95% confidence level, then the estimated bootstrap percentage will be precise to ±4.3%. In order to get ±1% precision then nearly 2,000 bootstrap replicates are needed.

There have also been attempts to overcome some of the practical limitations of bootstrapping for large data sets by adopting heuristic procedures, including resampling estimated likelihoods for maximum-likelihood analyses (Waddell et al. 2002) and reduced tree-search effort for the bootstrap replicates. However, approaches using reduced tree-search effort produce even more conservative estimates of branch support, and the magnitude of the effect increases with decreasing bootstrap values (DeBry and Olmstead 2000; Mort et al. 2000; Sanderson and Wojciechowski 2000).

References

Adell J.C., Dopazo J. 1994. Monte Carlo simulation in phylogenies: an application to test the constancy of evolutionary rates. J. Mol. Evol. 38, 305-309.

Alfaro M.E., Zoller S., Lutzoni F. 2003. Bayes or bootstrap? A simulation study comparing the performance of bayesian markov chain monte carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol. Biol. Evol. 20, 255-266.

Berry V., Gascuel O. 1996. On the interpretation of bootstrap trees: appropriate threshold of clade selection and induced gain. Mol. Biol. Evol. 13, 999-1011.

Bremer K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42, 795-803.

Buckley T.R., Cunningham C.W. 2002. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Mol. Biol. Evol. 19, 394-405.

Buckley T.R., Simon C., Chambers G.K. 2001. Exploring among-site rate variation models in a maximum likelihood framework using empirical data: effects of model assumptions on estimates of topology, branch lengths and bootstrap support. Syst. Biol. 50, 67-86.

Carpenter J., Bithell J. 2000. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 1141-1164.

Davis J.I. 1993. Character removal as a means for assessing the stability of clades. Cladistics 9, 201-210.

Davison A.C., Hinkley D.V. 1997. Bootstrap Methods and Their Applications. Cambridge Uni. Press, Cambridge.

DeBry R.W., Olmstead R.G. 2000. A simulation study of reduced tree-search effort in bootstrap resampling analysis. Syst. Biol. 49, 171-179.

Efron B. 1979. Bootstrapping methods: another look at the jackknife. Ann. Stat. 7, 1-26.

Efron B., Halloran E., Holmes S. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Nat. Acad. Sci. U.S.A. 93, 7085-7090.

Efron B., Tibshirani R.J. 1993. An Introduction to the Bootstrap. Chapman & Hall, London.

Erixon P., Svennblad B., Britton T., Oxelman B. 2003. Reliability of bayesian probabilities and bootstrap frequencies in phylogenetics. Syst. Biol. 52, 665-673.

Faith D.P. 1991. Cladistic permutation tests for monophyly and nonmonophyly. Syst. Zool. 40, 366-375.

Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783-791.

Felsenstein J. 1988. Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22, 521-565.

Felsenstein J., Kishino H. 1993. Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Syst. Biol. 42, 193-200.

Galtier N. 2004. Sampling properties of the bootstrap support in molecular phylogeny: influence of nonindependence among sites. Syst. Biol. 53, 38-46.

Goldman N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36, 182-198.

Goloboff P.A., Farris J.S., Källersjö M., Oxelman B., Ramırez M.J., Szumik C.A. 2003. Improvements to resampling measures of group support. Cladistics 19, 324-332.

Hedges S.B. 1992. The number of replications needed for accurate estimation of the bootstrap P value in phylogenetic studies. Mol. Biol. Evol. 9, 366-369.

Hendy M.D., Penny D. 1993. Spectral analysis of phylogenetic data. J. Classific. 10, 5-24.

Hillis D.M., Bull J.J. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42, 182-192.

Holmes S. 2003. Bootstrapping phylogenetic trees: theory and methods. Statist. Sci. 18, 241-255.

Huelsenbeck J.P., Hillis D.M., Jones R. 1996a. Parametric bootstrapping in molecular phylogenetics: applications and performance. In: Ferraris, J.D., Palumbi, S.R. (Eds), Molecular

Huelsenbeck J.P., Hillis D.M., Nielsen R. 1996b. A likelihood ratio test of monophyly. Syst. Biol. 45, 546-558.

Huelsenbeck J.P., Rannala B. 2004. Frequentist properties of bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol. 53, 904-913.

Lanyon S.M. 1985. Detecting internal inconsistencies in distance data. Syst. Zool. 34, 397-403.

Larget B., Simon D.L. 1999. Markov chain monte carlo algorithms for the bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16, 750-759.

Lee M.S.Y. 2000. Tree robustness and clade significance. Syst. Biol. 49, 829-836.

Li W.-H., Zharkikh A. 1994. What is the bootstrap technique? Syst. Biol. 43, 424-430.

Mort M.E., Soltis P.S., Soltis D.E., Mabry M.L. 2000. Comparison of three methods for estimating internal support on phylogenetic trees. Syst. Biol. 49, 160-171.

Nei M., Stevens J.C., Saitou M. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2, 66-85.

Newton M.A. 1996. Bootstrapping phylogenies: large deviations and dispersion effects. Biometrika 83, 315-328.

Penny D., Foulds L.R., Hendy M.D. 1982. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297, 197-200.

Rodrigo A.G. 1993. Calibrating the bootstrap test of monophyly. Int. J. Parasitol. 23, 507-514.

Sanderson M.J. 1989. Confidence limits on phylogenies: the bootstrap revisited. Cladistics 5, 113-129.

Sanderson M.J. 1995. Objections to bootstrapping phylogenies: a critique. Syst. Biol. 44, 299-320.

Sanderson M.J., Wojciechowski M.F. 2000. Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Syst. Biol. 49, 671-685.

Shimodaira H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492-508.

Sitnikova T., Rzhetsky A., Nei M. 1995. Interior-branch and bootstrap tests of phylogenetic trees. Mol. Biol. Evol. 12, 319-333.

Sneath P.H.A. 1986. Estimating uncertainty in evolutionary trees from Manhattan-distance triads. Syst. Zool. 35, 470–488.

Soltis P.S., Soltis D.E. 2003. Applying the bootstrap in phylogeny reconstruction. Statist. Sci. 18, 256-267.

Suzuki Y., Glazko G.V., Nei M. 2002. Overcredibility of molecular phylogenies obtained by bayesian phylogenetics. Proc. Nat. Acad. Sci. U.S.A. 99, 16138-16143.

Taylor D.J., Piel W.H. 2004. An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol. Biol. Evol. 21, 1534-1537.

Waddell P.J., Kishino H. and Ota, R. 2002). Very fast algorithms for evaluating the stability of ML and Bayesian phylogenetic trees from se- quence data. Genome Informatics 13, 82-92.

Wilcox T.P., Zwickl D., Heath T.A., Hillis D.M. 2002. Phylogenetic relationships of the dwarf boas and a comparison of bayesian and bootstrap measures of phylogenetic support. Mol. Phylogenet. Evol. 25, 361-371.

Zharkikh A., Li W.-H. 1992a. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol. Biol. Evol. 9, 1119-1147.

Zharkikh A., Li W.-H. 1992b. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. II. Four taxa without a molecular clock. J. Mol. Evol. 35, 356-366.

Zharkikh A., Li W.-H. 1995. Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4, 44-63.