Wednesday, April 25, 2012

Networks and bootstraps as tree-support criteria


It has been pointed out several times in the literature (eg. Wägele & Mayer 2007; Wägele et al. 2009; Morrison 2010) that network analyses and, for example, bootstrap analyses of trees do not necessarily show the same amount of "support" for a tree. This occurs because branch support values can be independent of character support.

Consequently, many apparently "well-supported" trees published in the literature are not well supported by the original data at all. That is, incongruences in the data are ignored by all tree-building algorithms, by definition. Indeed, this problem may be almost universal in the literature, because very few papers provide any evidence that the tree-likeness of the data has been evaluated by the authors.

Since this point seems to be poorly understood by most workers, it is worth re-iterating here with an example. The three references cited above provide other examples where bootstrap analyses and network analyses yield very different conclusions about the support for phylogenetic trees.

The basic distinction between networks and bootstrapped trees is this: use of a data-display network, such as a splits graph, evaluates the character (or distance) data independently of any tree, whereas a bootstrap analysis evaluates the data solely in terms of a tree. For example, a bootstrap analysis records the tree found at each iteration (or replicate) rather than recording the bootstrapped character set itself, and many different character sets can produce the same tree. Therefore, a bootstrap analysis does not directly assess the character support for a tree. Neither does a posterior probability from a Bayesian analysis.
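To make the "records the trees, not the characters" point concrete, here is a minimal Python sketch. Everything in it is hypothetical: the toy matrix `COLUMNS` is invented for four taxa, and `best_split` is only a stand-in for a real tree search (it simply picks the 2|2 split favoured by the most characters, breaking ties alphabetically). It is not any particular program's bootstrap implementation.

```python
import random
from collections import Counter

# Hypothetical toy matrix: 7 binary characters for 4 taxa.
# Each column lists the states for taxa (A, B, C, D).
COLUMNS = [
    (1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1),  # three characters favour AB|CD
    (1, 0, 1, 0), (0, 1, 0, 1),                # two characters favour AC|BD
    (1, 0, 0, 0), (0, 0, 0, 1),                # uninformative singletons
]
TAXA = "ABCD"


def split_of(column):
    """Return the 2|2 split a binary character supports, or None."""
    ones = {TAXA[i] for i, state in enumerate(column) if state == 1}
    if len(ones) != 2:
        return None
    side = ones if "A" in ones else set(TAXA) - ones
    other = set(TAXA) - side
    return "".join(sorted(side)) + "|" + "".join(sorted(other))


def best_split(columns):
    """Stand-in for a tree search: the split with the most supporting characters.

    Ties are broken alphabetically, which is an arbitrary assumption.
    """
    counts = Counter(s for s in map(split_of, columns) if s is not None)
    if not counts:
        return None
    top = max(counts.values())
    return min(s for s, n in counts.items() if n == top)


def bootstrap(columns, replicates, seed=0):
    """Resample columns with replacement; keep only the winning split."""
    rng = random.Random(seed)
    trees = Counter()
    character_sets = set()  # tracked here only to show what is normally discarded
    for _ in range(replicates):
        sample = [rng.choice(columns) for _ in columns]
        character_sets.add(tuple(sorted(sample)))
        trees[best_split(sample)] += 1
    return trees, character_sets


trees, character_sets = bootstrap(COLUMNS, replicates=1000)
print(len(character_sets), "distinct resampled character sets")
print(dict(trees))
```

The replicates pass through hundreds of distinct character sets, but the summary that survives is a count over at most a handful of splits. The conflict within each resampled data set is never reported.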

The importance of this distinction for phylogenetics is that a tree analysis forces the data into a tree irrespective of how well the data fit that tree. All that is required is that the tree be the optimal one based on a particular criterion (parsimony, likelihood, etc), while the degree of fit of the data and tree is effectively treated as immaterial to the analysis. This is true at each bootstrap iteration, as well, so that all we learn from a bootstrap analysis is which tree branches are the best supported — we do not learn anything directly about the support of the data for a tree in the first place.

Literally, bootstrap values represent "branch support" rather than "tree support"; and a similar thing can be said for Bayesian posterior probabilities. [This issue is discussed further in this later blog post: How networks differ from bootstrapped trees.]

This can be illustrated with a simple empirical example. The data are taken from my Primer of Phylogenetic Networks. The original data are 1,687 aligned nucleotide positions of two genes from five species of the plant genus Viburnum. However, only 43 of the characters vary among these five species. It is expected a priori that V. prunifolium is a hybrid between V. rufidulum and V. lentago, so that a single well-supported tree is not necessarily likely.

Median network.

The Median network for the data is shown in the first figure, with the branches labelled by the characters that "support" them. Other types of splits graphs have the same topology as this one (eg. NeighborNet based on uncorrected distances), since the characters are all binary and are never more than pairwise incompatible. This means that all of the character data are displayed in the graph. The netted region in the graph is created by four characters (3, 32, 41, 42) that are incompatible with nine others. Thus, there is no unambiguously supported branch (other than the terminal ones), let alone support for a single tree.
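For readers unfamiliar with pairwise incompatibility, the usual test for binary characters can be sketched in a few lines of Python: two binary characters are compatible with a common tree unless all four state combinations (00, 01, 10, 11) occur among the taxa. The character names and states below are hypothetical and are not the Viburnum data; this is only an illustration of the test, not the procedure used to build the Median network.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Two binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(char_a, char_b))) < 4


def incompatible_pairs(characters):
    """Return the names of every pair of characters that fails the test."""
    return [
        (name_a, name_b)
        for (name_a, states_a), (name_b, states_b)
        in combinations(characters.items(), 2)
        if not compatible(states_a, states_b)
    ]


# Hypothetical states for five taxa, listed in a fixed taxon order.
characters = {
    "c1": (0, 0, 1, 1, 1),
    "c2": (0, 0, 0, 1, 1),
    "c3": (0, 1, 1, 0, 1),
}

print(incompatible_pairs(characters))
```

Here c1 and c2 can sit on the same tree, whereas c3 conflicts with both of them. In a splits graph, conflicts of this kind are what produce the netted regions.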

Neighbor-Joining tree, with NJ (above) and Parsimony (below) bootstrap values.

Nevertheless, both Neighbor-Joining (based on uncorrected distances) and Parsimony analyses of the data produce a tree that is well-supported by bootstrap analyses, as shown in the second figure. In particular, note that there is strong support in both analyses (based on 100,000 bootstrap replicates) for the branch uniting V. prunifolium and V. rufidulum, even though the data indicate that this arrangement is supported by 3 characters and contradicted by 2 other characters.

Bayesian tree, with posterior probabilities (above) and Maximum-likelihood bootstrap values (below).

Both the Maximum-Likelihood and the Bayesian analyses deal with the situation in a somewhat different manner, as shown in the third figure. Based on a GTR+G+I model (and 5,000 sampled or re-sampled trees), they correctly recognize the relative lack of data support for uniting V. prunifolium and V. rufidulum (the character support is 3/5=60%). However, they both greatly over-estimate the character support for the branch involving V. lantanoides and V. nudum, which is supported by 5 characters and contradicted by 3 other characters (5/8=62.5%, so again roughly 60% support). The extra number of characters (8 versus 5) apparently makes a big difference to the evaluation of branch support.
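The character-support figures quoted here are simple proportions: the number of characters supporting a branch divided by the number of characters that either support or contradict it. A small Python sketch of that arithmetic follows; the function name `character_support` is just an illustrative label, not a standard statistic from any package.

```python
def character_support(supporting, contradicting):
    """Proportion of the relevant characters that agree with a branch."""
    return supporting / (supporting + contradicting)


# (supporting, contradicting) character counts as given in the text.
branches = {
    "V. prunifolium + V. rufidulum": (3, 2),
    "V. lantanoides + V. nudum": (5, 3),
}

for name, (sup, con) in branches.items():
    print(f"{name}: {sup}/{sup + con} = {character_support(sup, con):.1%}")
```

The two branches have nearly identical character support by this measure (60.0% and 62.5%), which is why the very different branch support values they receive are worth noting.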

Thus, there is no reason to expect branch support values of any ilk to represent character support for that branch; and there is no simple relationship between the two things. The mere fact that character data can repeatedly be shoe-horned into the same tree does not mean that the data offer much support for that tree!

If you want an evaluation of the tree-likeness of the original data, you need to use either a data-display network or some other non-tree evaluation method. Only then can we directly assess the tree support.

References

Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology & Evolution 27: 1044-1057.

Wägele J.W., Letsch H., Klussmann-Kolb A., Mayer C., Misof B., Wägele H. (2009) Phylogenetic support values are not necessarily informative: the case of the Serialia hypothesis (a mollusk phylogeny). Frontiers in Zoology 6: 12.

Wägele J.W., Mayer C. (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.

2 comments:

  1. One issue with "phylogenetic networks" is that they too are subject to artifacts. Consider a data set generated by ordinary stochastic processes (say GTR evolution with equal rates at all sites). This will generate some sites that have patterns of bases that conflict with the true tree. We can solve that by inferring a network. It may look something like the network you show above, even in cases where the data come from a simulation in which we know that the truth is a tree.

    What we need is a statistical inference machinery for testing networks against trees and seeing if the evidence in favor of the network is strong enough to support it. Among the problems in developing this is that there is no single statistical model to be used in a network. Some networks are from hybridization, some from horizontal gene transfer, some even from recombination among nearly-asexual lineages within a species. Some are even from coalescent phenomena within a species tree. We need to have some understanding of which of these to assume as the alternative to a tree, and then use the appropriate model for that.

    Just saying "see, it turns out to be a network" is not enough.

  2. Joe, you have touched on a very important point. If one treats a data-display network (eg. a splits graph) as an evolutionary diagram then many of the reticulations will be "false positives" — they will represent stochastic variability rather than hybridization, HGT etc. This has been shown in the literature a number of times using simulated data. Indeed, in the example above that is what several of the reticulations undoubtedly are (in my view).

    I have always argued that such diagrams are best used as exploratory data analysis (EDA) *before* deciding whether a tree model will be adequate. If it is not, then some more complex model will be needed. My main point in this post is that first building a tree and then bootstrapping it (for example) does not provide the same EDA information. Nevertheless, a data-display network cannot tell you whether your study organisms have a tree-like evolutionary history or not; it evaluates the support of the data for any underlying tree, not the existence of such a tree.

    This leads to the need for some objective (optimization?) procedure for deciding just how complex a model will be required for the analysis of each and every dataset. This is the other point that you rightly make, as such models could involve any one or more of several reticulation processes.

    At the moment, the most common null model emerging in the literature seems to be a tree that takes into account deep coalescence; and only when this model is "rejected" in some way is reticulation invoked. This model is derived from the coalescent, of course, but it also makes conceptual sense in terms of evolutionary mechanisms that can create genealogical complexity without reticulations that result from horizontal evolution. One potential problem with this model is that a duplication-loss history (hidden paralogy) has the same effect — complexity without horizontal evolution.

    Another problem is that it is not immediately obvious, at least to me, what the criteria should be for rejecting this null model. As you note, explicit alternative models are needed, which is something that biologists need to address. Moreover, the potential complexity of a model that invokes all known causes of non-tree evolutionary histories gives one pause for some serious thought. A tree is *definitely* the simplest model!
