## Wednesday, January 9, 2013

### We should present bayesian phylogenetic analyses using networks

Bayesian methods differ from other forms of probabilistic analysis in that they are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. That is, Bayesian analysis is not about identifying the most likely outcome, it is about quantifying the relative likelihood of all possible outcomes. In this sense, it is quite distinct from other probabilistic methods, such as those based on estimating the optimal outcome under criteria such as maximum likelihood, maximum parsimony or minimum evolution. As far as likelihood is concerned, Bayesian analysis deals with (maximum) integrated likelihood rather than (maximum relative) likelihood (Steel & Penny 2000).

In phylogenetic analysis this creates a potentially confusing situation, as the result of most Bayesian analyses is presented as a single tree, rather than showing the probability distribution of all trees. Certainly, some of the information from the probability distribution is used in the tree – usually the posterior probabilities that are attached to each of the tree branches – but this is a poor visual summary of the available information. A better approach would be to use a network to display the actual probability distribution.

Theory

Bayesian phylogeny programs such as MrBayes produce a file (or files) with a copy of every tree that was sampled by the MCMC run (or runs). These files can then be manipulated (eg. using the "sumt" command in MrBayes) to exclude the first trees as part of the burn-in; and a smaller file is then produced with a copy of the unique trees found, along with their frequency of occurrence in the sample (eg. in the "trprobs" file from MrBayes). It is this sample of trees that is summarized to produce the so-called MAP tree (maximum a posteriori probability tree) and its associated branch posterior probabilities.

Ideally, we would take this file and produce a consensus network rather than a consensus tree. The tree produced by MrBayes is built from the best-supported branches of the set of trees sampled, but only a set of compatible branches can be included in the consensus tree (see the original work of Margush and McMorris 1981). Any well-supported but incompatible branches will not be shown, and it is the absence of these branches that causes the phylogenetic tree to deviate from the standard Bayesian philosophy of presenting a probability distribution. A consensus network solves this problem because it is specifically designed to present a specified percentage of the incompatible branches as well (Holland et al. 2004).

The idea of using a consensus network rather than a consensus tree was first suggested by Holland et al. (2005), although it has rarely been used in practice (eg. Gibb & Penny 2010). Indeed, consensus networks could also be used to present the results of bootstrap analyses, or a set of equally parsimonious trees (Holland et al. 2006).

It is important to note that a consensus network is unrooted, and therefore it is solely a mathematical summary of the data, and cannot be treated as an evolutionary diagram. The MAP tree is, strictly speaking a form of consensus tree, and so it is technically also solely a mathematical summary of the data. It is, however, conventional to treat it as an evolutionary diagram.

An example

The example data are from Schnittger et al. (2012), being sequence data for the beta-tubulin gene of 17 samples of the genus Babesia. Details of the Bayesian analysis using MrBayes are provided here, which yielded 100,000 trees in the final MCMC sample. A nexus file with the sequence data, the 95% credible set of trees, and the associated bipartitions can be found here. The rooted MAP tree produced by MrBayes is shown in the first diagram, with the posterior probabilities (PP) labelled on each branch.

 Bayesian MAP tree.

There are several ways that a consensus network could be presented. The standard way would be to choose some percentage of the MCMC trees and to show the network of those trees. For example, the MAP tree is simply a consensus that includes all of the bipartitions that occur in at least 50% of the trees, plus all of the other bipartitions that are compatible with those bipartitions.

As an alternative, here is the unrooted consensus network with all of the bipartitions that occur in at least 5% of the trees. It is important to note that the branch lengths in the following consensus networks represent the branch support not the estimated number of substitutions (as in the MAP tree). For example, the relationships among the Babesia microti sequences have very short branches in the MAP tree (ie. few substitutions) but long branches in the consensus networks (ie. good support).

 Consensus network set at 5%.

This network makes visually clear where the major incompatibilities are among the MCMC trees, which the MAP consensus tree does not (unless one checks the PP values). The major set of boxes in the network involve the branches with PP values of c. 0.4. Note also that this network also approximates the 95% credible set of trees, as it excludes those bipartitions that occur in <5% of the trees.

Unfortunately, this network is visually rather cluttered, with a set of multi-dimensional boxes, and so maybe a simpler network would suffice. Instead, here is the consensus network with all of the bipartitions that occur in at least 20% of the trees. This network still emphasizes the main area of incompatibility among the trees, while losing the less-important incompatibilities on the other branches.

 Consensus network set at 20%.

Another approach to simplifying the consensus network is to present a set of what are called weakly compatible bipartitions, rather than choosing some percentage of the bipartitions. This is shown in the next network. Note that visually it is a compromise between the above two networks, and may thus be preferable, as it makes clear the range of branches involved in the incompatibilities without needing to use multi-dimensional boxes to do so (ie. the graph is planar rather than 3-dimensional).

 Consensus network of weakly compatible bipartitions.

Finally, we can emphasize the relationship between the network and bipartition compatibility by plotting the consensus network consisting of a set of compatible bipartitons. This forms a tree that has the same topology as the MAP tree (since it is constructed in exactly the same manner), but here the branch lengths represent the branch support rather than the estimated number of substitutions.

 Consensus network of compatible bipartitions.

Current practical problems

The simplest way to implement the use of consensus networks to display the set of Bayesian trees would be to use SplitsTree to produce the consensus network directly from the (MrBayes) nexus-format MCMC treefile. However, this treefile will usually contain tens of thousands of trees, which is pushing the limit of SplitsTree. Alternatively, we could read in the smaller nexus-format "trprobs" file, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree does not currently read tree weights in nexus treefiles. So, Holland et al. (2005, 2006) produced some Python scripts to create the required nexus files, which can then be input to SplitsTree (or SpectroNet).

Alternatively, MrBayes also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than the trees, and so this can also be used instead of the treefile. There are two practical problems with this approach. First, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. Second, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions.

I used the latter approach to analyze the example data.

Thanks to Bernie Cohen for prompting me to think about networks and Bayesian analysis, and to Barbara Holland for her help with getting the calculations done.

References

Gibb GC, Penny D (2010) Two aspects along the continuum of pigeon evolution: a South-Pacific radiation and the relationship of pigeons within Neoaves. Molecular Phylogenetics and Evolution 56: 698-706.

Holland BR, Delsuc F, Moulton V (2005) Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54:66-76.

Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Molecular Biology and Evolution 21: 1459-1461.

Holland BR, Jermiin LS, Moulton V (2006) Improved consensus network techniques for genome-scale phylogeny. Molecular Biology and Evolution 23: 848-855.

Margush T, McMorris FR (1981) Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.

Schnittger L, Rodriguez AE, Florin-Christensen M, Morrison DA (2012) Babesia: a world emerging. Infection, Genetics and Evolution 12: 1788-1809.

Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17: 839-850.

#### 1 comment:

1. Thanks for being stimulated -- to very good effect.

However, miserably incapable users like me actually need to have software readily available and easy to operate. Given that, I would certainly look at my next Bayesian analyses in network form; it might shed useful light. BLC