Showing posts with label Bootstrap. Show all posts

Wednesday, August 7, 2013

Network of apple cultivars


As is always emphasized in this blog, it is best to explore the nature of any phylogenetic dataset, before proceeding to a formal data analysis. Usually, I discuss examples where important insights are revealed by using a phylogenetic network as a form of Exploratory Data Analysis. Here, instead, I note an example where there are few noteworthy features, in addition to those emphasized by the phylogenetic tree — some datasets really are tree-like.


The paper under discussion is:
Nikiforova S.V., Cavalieri D., Velasco R., Goremykin V. (2013) Phylogenetic analysis of 47 chloroplast genomes clarifies the contribution of wild species to the domesticated apple maternal line. Molecular Biology & Evolution 30: 1751-1760.

The data involve 47 chloroplast genomes from cultivated apple varieties and wild apple species (genus Malus). The nucleotide alignment is 134,553 bp; and the dataset is available in the Dryad database.

The authors did check some of the basic assumptions of their proposed phylogenetic analysis, such as whether the nucleotide substitutions are saturated and whether the nucleotide composition is homogeneous. They find that the data are very well-behaved: the alignment is unproblematic, so there is no ambiguity about homology; the P-distances are essentially equal to the corrected distances, so that it is unimportant which nucleotide substitution model is chosen; the nucleotide composition is homogeneous; and most of the site variation is binary. They conclude that: "phylogenetic signal is well preserved in the data and is not distorted by multiple substitutions and strong compositional bias."
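To see why near-equality of uncorrected and corrected distances makes the choice of model unimportant, here is a minimal sketch. It assumes the simplest correction (Jukes-Cantor, JC69) purely for illustration; the paper's own correction may differ, and the toy sequences are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites where either sequence is not A, C, G or T."""
    valid = "ACGT"
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared

def jc69_distance(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical toy sequences, not taken from the apple dataset.
p = p_distance("ACGTAC-T", "ACGTTCAT")
print(p, jc69_distance(p))

# For small p the correction is negligible, which is the situation described above.
print(jc69_distance(0.01))
```

When the p-distances are small, as they are among closely related chloroplast genomes, the corrected value differs from the uncorrected one only in the later decimal places, so any reasonable model will give much the same distances.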

This does not, however, examine whether the phylogenetic signal is tree-like or not. This is best done with a phylogenetic network. So, I have used a NeighborNet network based on the P-distances, as shown below.

NeighborNet network,
with some of the labels (names and bootstrap values) reproduced from the original tree.

In their tree-based analysis (a bootstrapped maximum-likelihood tree) the authors recognize five monophyletic groups (labeled A to E) plus the outgroup Pyrus. The network reveals that the major groups (A–E) are tree-like except for three things:
  1. the A + B grouping has 87% bootstrap support in the tree-based analysis but is not supported by the network analysis;
  2. the grouping of M. zhaojiaoensis with group C has 90% bootstrap support in the tree-based analysis but is not supported by the network analysis;
  3. the relationship of M. fusca and M. micromalus to group A is not clear in the network.
Points (1) and (2) indicate that only branches with 100% bootstrap values (nothing less) are well-supported by the data. Indeed, the branches with 90% and 87% support are very short branches, so there is no significant character data support.

For point (3), the tree-building analysis makes a somewhat arbitrary decision to resolve the conflicting relationships — it shows M. fusca as the sister to group A, but it includes M. micromalus within the group.

Otherwise, the authors' confidence in their tree-based results seems to be well justified.

Monday, January 21, 2013

EDA or post-optimality analysis of phylogenetic data?


These days, phylogeneticists usually build trees to express the evolutionary history of their samples. As part of this procedure, they also show an interest in the "quality" of their trees. This is a very vaguely defined concept, probably because it has something to do with accuracy, or correctness, which is something we can know almost nothing about. So, instead, we resort to a whole swag of other concepts, such as resolution, robustness, sensitivity and stability, which are related to precision rather than to accuracy.

We implement these "precision" ideas in many ways, including: (i) analytical procedures, such as interior-branch tests, likelihood-ratio tests, clade significance, and the incongruence length difference test; (ii) statistical procedures, such as the ubiquitous non-parametric bootstrap and posterior probabilities, the jackknife, topology-dependent permutation, and clade credibility; and (iii) non-statistical procedures, such as the decay index, clade stability, data decisiveness, and spectral signals.

Most of these are forms of what is called post-optimality analysis — the tree is first calculated and then we evaluate it. In the current issue of Bioinformatics there is a paper by Saad Sheikh, Tamer Kahveci, Sanjay Ranka and Gordon Burleigh (Stability analysis of phylogenetic trees. Bioinformatics 2013 29(2): 166-174) that provides yet another take on the same theme:
We define measures that assess the stability of trees, subtrees and individual taxa with respect to changes in the input sequences. Our measures consider changes at the finest granularity in the input data (i.e. individual nucleotides).
Basically, the idea is to see how much the input would need to be changed in order to cause a change in the tree topology. For example, the authors quantify the minimum edit distance required to create a specified Robinson-Foulds tree distance from the optimal tree, although any similar distances could be used instead. Their basic purpose was to develop a method that could be effective for very large datasets, which most of the alternatives cannot.
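For readers unfamiliar with the Robinson-Foulds distance, it counts the bipartitions that appear in one tree but not in the other. The sketch below is only an illustration of that definition, not the implementation of Sheikh et al.; the nested-tuple tree encoding and the function names are my own assumptions.

```python
def clades(node, out):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    if not isinstance(node, tuple):
        return frozenset([node])
    below = frozenset().union(*(clades(child, out) for child in node))
    out.append(below)
    return below

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side lacking a reference taxon."""
    found = []
    taxa = clades(tree, found)
    reference = min(taxa)
    splits = set()
    for clade in found:
        side = taxa - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
    return splits

def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))

# Hypothetical five-taxon trees that differ in the placement of B and C.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))
```

Note that the measure is defined entirely in terms of tree branches, which is why a stability analysis built on it inherits the assumption that a tree is the appropriate structure.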

This raises the question of whether a post-optimality analysis is the best approach in the first place. This type of approach assumes a tree as the basic structure, and fails to consider alternative structures that might be more appropriate for the data.

Exploratory data analysis (EDA), if performed effectively, can achieve the same result (an assessment of "stability") while at the same time revealing whether a tree is actually the best structure. It does this by evaluating the dataset directly, a priori, rather than evaluating the data relative to a tree, a posteriori. Evaluating the tree in terms of the data is not the same thing as evaluating the data independently of any tree.

Of the methods listed above, the only one that evaluates the data in a tree-independent manner is the use of spectral signals. Another approach is, of course, to use a data-display network, which provides a very convenient picture of the data, and will thus reveal whether a tree-building analysis is a good idea or not.

An example

To explore the essential difference between EDA and post-optimality analysis, we can look at one of the example datasets used by Sheikh et al. to illustrate their method.

This dataset involves sequences of 169 species of mammals, published by Meredith et al. (Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 2011 334(6055): 521-524). There are 35,603 aligned nucleotides, concatenated from 26 gene fragments. The sampling includes nearly all mammalian families, plus five vertebrate outgroups.

Meredith et al. note about their data:
Phylogenetic relations from maximum likelihood (ML) and Bayesian methods are well resolved across the mammalian tree. More than 90% of the nodes have bootstrap support of ≥ 90% and Bayesian posterior probabilities of ≥ 0.95. Amino acid and DNA ML trees are in agreement for 163 out of 168 internal nodes.
Not surprisingly, Sheikh et al. reach a similar enthusiastic conclusion:
The Mammals dataset is highly stable. There is not a single move (R = 1) possible for an edit distance of up to 530 nucleotides. Even if we place an extremely high limit of E = 1000, the biggest move possible is RF = 5. Thus, the stability measures provide an explicit guarantee that there is no move possible for E = 500 and any values of R within 1 SPR distance. This also demonstrates the power of building phylogenies from large densely sampled datasets.
However, this enthusiasm contradicts some well-known previous results. For example, Meredith et al. also note:
Several nodes that remain difficult to resolve (e.g., placental root) have variable support between studies of rare genomic changes, as well as genome-scale data sets, which suggest that diversification was not fully bifurcating or occurred in such rapid succession that phylogenetic signal tracking true species relations may not be recoverable with current methods.
A simple exploratory analysis makes the situation clear, in a way that neither the bootstrap / posterior-probability approach of Meredith et al. nor the edit-distance / tree-distance approach of Sheikh et al. does. If we stick to the simple parsimony approach of the latter (rather than the model-based approach of the former), then we can analyze the dataset with Hamming distances and a NeighborNet graph.


First, the root of the mammals is not clear. The published tree places the root on the branch leading to monotremes, but in the network the outgroup involves a reticulation. This is caused by an ambiguous relationship between Echinops telfairi (Tenrecidae) and (i) the {outgroup + monotremes} and (ii) the Afrotheria. In the tree the Afrotheria is united by a "strongly supported node".

Second, the root of the placentals is very unclear. Most of the major groups of placentals form clusters, but the relationships among these clusters are very obscure. The data are bush-like within the placentals, rather than tree-like, both at the level of the four major groups (named in the graph) and within each of those groups. In the published tree, some of these subgroups are well-supported, but others involve disagreement between the DNA and amino-acid trees, while others have < 90% bootstrap support.

It is not immediately obvious that a tree-building analysis is going to be of much use for this dataset. There is certainly some "power of building phylogenies from large densely sampled datasets", but this does not automatically mean that those phylogenies will be tree-like. Evolution involves a more diverse process than that, and post-optimality analyses based on a single model may be very misleading about that diversity.

Wednesday, December 5, 2012

How networks differ from bootstrapped trees

I have noted before (Networks and bootstraps as tree-support criteria) that data-display networks can produce quite different results from bootstrap values on phylogenetic trees. For example, a splits-graph assesses character support for alternative bipartitions of the dataset, whereas a bootstrapped tree assesses support only for those branches that appear in the tree. These two data evaluations will often be congruent, but they can also differ notably. Here, I use a published dataset to illustrate the two ways in which they can differ.

The dataset is from: Wang N., Braun E.L., Kimball R.T. (2012) Testing hypotheses about the sister group of the Passeriformes using an independent 30-locus data set. Molecular Biology and Evolution 29: 737–750. There are 28 taxa and 25,700 aligned nucleotides.

I used the SplitsTree program to calculate (i) a Neighbor-Joining tree with 1,000 bootstrap pseudoreplicates, and (ii) a NeighborNet graph. In both cases the simple p-distance was used.

Assessment

The graph below shows the split weights (or edge lengths) for the 58 splits that were included in the NeighborNet graph. These form a collection of what are called circular splits, and it is important to note that this collection does not include all of the splits supported by the data. Those splits not in the NeighborNet graph are shown in green with a split weight of 0.00001 (rather than zero), to accommodate the log scale.

The graph also shows the bootstrap percentages for all of the splits in the NeighborNet graph plus all of those branches with a bootstrap frequency greater than 1/100. Splits that did not appear in any of the bootstrap pseudoreplicates are shown in pink.


Those splits / branches where there is an approximate agreement between the tree and the network are shown in blue. There is a roughly s-shaped relationship between the split weights and the bootstrap percentages, so that an increase in one is associated with an increase in the other.

However, in the range of the graph where there is 100% bootstrap support there are 8 splits (in pink) with a large split weight but 0% bootstrap support. These are splits that contradict at least one better-supported split. The better-supported splits appear as branches in the Neighbor-Joining tree at the expense of these 8 splits. This is the limitation of a tree representation, as it cannot accommodate alternative patterns, no matter how well-supported they are by the character data.

The important thing to realize is that these splits cannot appear in any bootstrap pseudoreplicate, because they are out-weighed in the character resampling performed by the bootstrap procedure. For each bootstrap pseudoreplicate the resampled data are forced into a tree, and thus all contradictory splits are ignored each time, no matter how well-supported they are. These splits therefore get 0% bootstrap support even though there is considerable character support for them.
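The resampling step described here can be sketched as follows. This is a minimal illustration under my own assumptions: the alignment is a dictionary of equal-length strings, and `tree_splits` is a hypothetical user-supplied function that builds a tree from an alignment and returns its splits (it stands in for whatever tree-building program is used).

```python
import random
from collections import Counter

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; alignment maps taxon -> sequence."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    n_sites = lengths.pop()
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}

def bootstrap_support(alignment, tree_splits, n_reps, seed=0):
    """Frequency of each split among the trees built from the pseudoreplicates.

    tree_splits is a hypothetical callable: alignment -> iterable of hashable splits.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reps):
        replicate = bootstrap_alignment(alignment, rng)
        # Only splits that made it into this replicate's tree are tallied.
        counts.update(tree_splits(replicate))
    return {split: count / n_reps for split, count in counts.items()}
```

The tally is taken over the splits of each tree, not over the resampled characters themselves, so a split that always loses to a slightly better-supported competitor is never counted at all.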

Equally importantly, there is one edge that appears in the bootstrap assessment with high support (87%) but which does not appear in the NeighborNet graph at all, shown in green in the graph. The first step of the NeighborNet algorithm decides on a set of circular splits, and only these splits will appear in the splits graph, no matter how well-supported other splits might be.

In this example, there is a nested set of taxa that appears in the tree ((13,14)((15,16)17)), but the NeighborNet finds greater support for several contradictory partitions such as {16,17,27,28} and {15,16,17,27}, and thus cannot display the partition {15,16,17}, although it can accommodate the other three partitions: {13,14}, {15,16} and {13,14,15,16,17}.

So, the nested set gets high bootstrap support simply because it fits onto a tree. That is, given the partitions {13,14}, {15,16}, and {13,14,15,16,17}, all of which are well supported by the character data, then {15,16,17} will be supported as well because it fits neatly onto the tree, irrespective of the strength of its character support (the location of 17 is not well supported no matter where it is on the tree).
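The notion of partitions "fitting onto a tree" can be made concrete with the standard pairwise compatibility test: two splits can sit on the same tree only if at least one of the four intersections of their sides is empty. The sketch below applies this to the taxon numbers quoted above, assuming the 28 taxa are simply numbered 1 to 28; the function name is my own.

```python
from itertools import combinations

def compatible(split_a, split_b, taxa):
    """True if the two splits can both be displayed on a single tree."""
    everything = frozenset(taxa)
    a, b = frozenset(split_a), frozenset(split_b)
    a_rest, b_rest = everything - a, everything - b
    parts = (a & b, a & b_rest, a_rest & b, a_rest & b_rest)
    return any(not part for part in parts)

taxa = range(1, 29)  # assumed numbering of the 28 taxa

# The four partitions that make up the nested set in the tree.
tree_splits = [{13, 14}, {15, 16}, {15, 16, 17}, {13, 14, 15, 16, 17}]
print(all(compatible(a, b, taxa) for a, b in combinations(tree_splits, 2)))

# One of the contradictory partitions favoured by the NeighborNet.
print(compatible({15, 16, 17}, {16, 17, 27, 28}, taxa))
```

The first check confirms that the four tree partitions are mutually compatible, while the second shows that {15,16,17} cannot share a tree with {16,17,27,28}.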

Conclusion

Trees and networks will often be in agreement, especially if the data are very tree-like. However, they can differ in two ways: (i) the network may show well-supported character patterns that are not included in the tree, and (ii) the tree may show well-supported branches that are not accommodated by the network-building algorithm. Well-supported branches on a tree are not necessarily well-supported by the character data, and absence of a branch from a tree does not necessarily mean that it has little character support.

Monday, October 8, 2012

Open questions about evolutionary networks, part 3


There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be "open questions" — ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). This blog post finishes my list of some of these topics (see Part 1 and Part 2).

Robustness of branch/reticulation estimates

It is de rigueur in the world of phylogenetic tree building to pepper the tree branches with bootstrap values or posterior probabilities, or frequently both, especially if these estimates are >50%. On the other hand, these values are almost never seen in the world of phylogenetic networks.

If there is a direct link between the network and some character-state data, then bootstrap values can be calculated for a network in the same manner as for a tree: one simply builds many networks from the re-sampled character data. However, this procedure may not be computationally feasible if the network method does not have a practical running time.

Moreover, this procedure is not necessarily straightforward for other types of data from which we might build a network. For example, if we are building a network by minimizing the number of reticulations needed to reconcile a set of conflicting trees, the application of the bootstrap has not yet been evaluated. The computational focus to date has been on the optimization problem, not on the re-sampling problem. And, of course, in the absence of a likelihood model for reticulation events, posterior probabilities cannot be calculated at all.

So, this is another area where the lack of methods commonly associated with tree building seems to be a handicap for the widespread acceptance of network-based methodology.

Can biologists correctly interpret networks?

I have used this quote in an earlier blog post, but it is relevant again here. Baum and Smith (2012, Tree Thinking: An Introduction to Phylogenetic Biology) have noted the following:

"We do not know why it should be so, but we have learned from working with thousands of students that, without contrary training, people tend to have a one-dimensional and progressive view of evolution. We tend to tell evolution as a story with a beginning, a middle, and an end. Against that backdrop, phylogenetic trees are challenging; they are not linear but branching and fractal, with one beginning and many equally valid ends. Tree thinking is, in short, counterintuitive."

This is a well-studied problem. For example, there have been a number of studies of students taking introductory biology courses at tertiary institutions (mostly in the U.S.A.), aimed at identifying the "major misconceptions" entertained by these students. Certain basic problems are discussed by almost all of the authors concerned (both inside and outside the U.S.A.). I have written more extensively on this topic in a post at the Scientopia blog (Ambiguity in phylogenies), which you can read if you are unfamiliar with the current state of affairs. That blog post lists most of the important issues as well as the available literature.

That evolution professionals often suffer the same sort of problem is also well known. I have written more extensively on this topic in a previous post at this blog (Evolutionary trees: old wine in new bottles?). This blog post also lists the relevant literature.

What is worse, some professional organizations apparently know no better. For example, the Federation of American Societies for Experimental Biology (FASEB), which describes itself as "the policy voice of biological and biomedical researchers" in the U.S.A., has this Advocacy Card on their web site:


FASEB was also giving away similar bumper stickers at the recent 20th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB — July 2012, in Long Beach, CA), as discussed at the Byte Size Biology blog. Clearly, this image confounds linear evolution with tree-based evolution — this distinction is crucial to phylogenetic analysis, and yet confusion about these two things is rampant.

This leads me to an obvious question: if people have so much trouble going from a linear view of evolution to a tree-based view, are they going to have even more trouble going to a network-based view?

I cannot answer this question (yet). At one extreme, maybe the big conceptual leap is going from a chain to a tree, and a network is just a complicated tree, so that the conceptual leap is not great. Alternatively, maybe a tree is difficult because it is a set of linked and overlapping chains, and therefore a network is very difficult because it is a set of linked and overlapping trees. Maybe reality will turn out to be somewhere in between these two extremes.

There are at least two issues that are likely to be of importance here, in addition to those concerned with trees:

  1. it is difficult to recognize monophyletic groups (clades) in a network, because the ancestry of any one taxon may be complicated (eg. what is a Most Recent Common Ancestor in a reticulated network? — see this blog post);
  2. it is difficult to distinguish the different possible causes of reticulations (recombination, hybridization, HGT).

We will presumably find out how difficult things are after we have developed a set of widely used methods for constructing evolutionary networks.

Wednesday, April 25, 2012

Networks and bootstraps as tree-support criteria


It has been pointed out several times in the literature (eg. Wägele & Mayer 2007; Wägele et al. 2009; Morrison 2010) that network analyses and, for example, bootstrap analyses of trees do not necessarily show the same amount of "support" for a tree. This occurs because branch support values can be independent of character support.

Consequently, many apparently "well-supported" trees published in the literature are often not well-supported by the original data at all. That is, incongruences in the data are ignored by all tree-building algorithms, by definition. Indeed, this problem may be almost universal in the literature, because very few papers provide any evidence that the tree-likeness of the data has been evaluated by the authors.

Since this point seems to be poorly understood by most workers, it is worth re-iterating here with an example. The three references cited above provide other examples where bootstrap analyses and network analyses yield very different conclusions about the support for phylogenetic trees.

The basic distinction between networks and bootstrapped trees is this: use of a data-display network, such as a splits graph, evaluates the character (or distance) data independently of any tree, whereas a bootstrap analysis evaluates the data solely in terms of a tree. For example, a bootstrap analysis records the trees at each iteration (or replicate) rather than recording the bootstrapped character set itself, and many different character sets can produce the same tree. Therefore, a bootstrap analysis does not directly assess the character support for a tree. Neither does a posterior probability from a Bayesian analysis.

The importance of this distinction for phylogenetics is that a tree analysis forces the data into a tree irrespective of how well the data fit that tree. All that is required is that the tree be the optimal one based on a particular criterion (parsimony, likelihood, etc), while the degree of fit of the data and tree is effectively treated as immaterial to the analysis. This is true at each bootstrap iteration, as well, so that all we learn from a bootstrap analysis is which tree branches are the best supported — we do not learn anything directly about the support of the data for a tree in the first place.

Literally, bootstrap values represent "branch support" rather than "tree support"; and a similar thing can be said for Bayesian posterior probabilities. [This issue is discussed further in this later blog post: How networks differ from bootstrapped trees.]

This can be illustrated with a simple empirical example. The data are taken from my Primer of Phylogenetic Networks. The original data are 1,687 aligned nucleotide positions of two genes from five species of the plant genus Viburnum. However, only 43 of the characters vary among these five species. It is expected a priori that V. prunifolium is a hybrid between V. rufidulum and V. lentago, so that a single well-supported tree is not necessarily likely.

Median network.

The Median network for the data is shown in the first figure, with the branches labelled by the characters that "support" them. Other types of splits graphs have the same topology as this one (eg. NeighborNet based on uncorrected distances), since the characters are all binary and are never more than pairwise incompatible. This means that all of the character data are displayed in the graph. The netted region in the graph is created by four characters (3, 32, 41, 42) that are incompatible with nine others. Thus, there is no unambiguously supported branch (other than the terminal ones), let alone support for a single tree.

Neighbor-Joining tree, with NJ (above) and Parsimony (below) bootstrap values.

Nevertheless, both Neighbor-Joining (based on uncorrected distances) and Parsimony analyses of the data produce a tree that is well-supported by bootstrap analyses, as shown in the second figure. In particular, note that there is strong support in both analyses (based on 100,000 bootstrap replicates) for the branch uniting V. prunifolium and V. rufidulum, even though the data indicate that this arrangement is supported by 3 characters and contradicted by 2 other characters.

Bayesian tree, with posterior probabilities (above) and Maximum-likelihood bootstrap values (below).

Both the Maximum-Likelihood and the Bayesian analyses deal with the situation in a somewhat different manner, as shown in the third figure. Based on a GTR+G+I model (and 5,000 sampled or re-sampled trees), they correctly recognize the relative lack of data support for uniting V. prunifolium and V. rufidulum (the character support is 3/5 = 60%). However, they both greatly over-estimate the character support for the branch involving V. lantanoides and V. nudum, which is supported by 5 characters and contradicted by 3 other characters (5/8 = 62.5% support, much the same proportion). The extra number of characters (8 versus 5) apparently makes a big difference to the evaluation of branch support.
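The character-support percentages used here are simply the number of characters that match a branch divided by the number that either match or contradict it. A minimal sketch of that tally follows, assuming each binary character is given as the set of taxa sharing one state; the taxon labels and characters are hypothetical stand-ins, not the Viburnum data.

```python
def split_of(character, taxa):
    """Canonical form of a binary character: the side lacking the first taxon."""
    everything = frozenset(taxa)
    side = frozenset(character)
    return everything - side if min(everything) in side else side

def incompatible(char_a, char_b, taxa):
    """True if all four combinations of the two characters' states occur."""
    everything = frozenset(taxa)
    a, b = frozenset(char_a), frozenset(char_b)
    return all((a & b, a - b, b - a, everything - (a | b)))

def character_support(target, characters, taxa):
    """Return (supporting, contradicting, proportion) for a target branch."""
    wanted = split_of(target, taxa)
    supporting = sum(1 for c in characters if split_of(c, taxa) == wanted)
    contradicting = sum(1 for c in characters if incompatible(c, target, taxa))
    total = supporting + contradicting
    return supporting, contradicting, (supporting / total if total else None)

# Hypothetical data: three characters unite a+b, two unite b+c, one unites d+e.
taxa = "abcde"
characters = [{"a", "b"}] * 3 + [{"b", "c"}] * 2 + [{"d", "e"}]
print(character_support({"a", "b"}, characters, taxa))
```

Characters that are compatible with the branch but do not define it (the d+e character in this toy case) count neither for nor against it, which is the convention implied by the figures quoted above.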

Thus, there is no reason to expect branch support values of any ilk to represent character support for that branch; and there is no simple relationship between the two things. The mere fact that character data can repeatedly be shoe-horned into the same tree does not mean that the data offer much support for that tree!

If you want an evaluation of the tree-likeness of the original data, you need to use either a data-display network or some other non-tree evaluation method. Only then can we directly assess the tree support.

References

Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology & Evolution 27: 1044-1057.

Wägele J.W., Letsch H., Klussmann-Kolb A., Mayer C., Misof B., Wägele H. (2009) Phylogenetic support values are not necessarily informative: the case of the Serialia hypothesis (a mollusk phylogeny). Frontiers in Zoology 6: 12.

Wägele J.W., Mayer C. (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.