The Genealogical World of Phylogenetic Networks: Posterior probability

Showing posts with label Posterior probability. Show all posts

Tuesday, July 4, 2017

Should we try to infer trees on tree-unlikely matrices?

Spermatophyte morphological matrices that combine extinct and extant taxa notoriously have low branch support, as traditionally established using non-parametric bootstrapping under parsimony as optimality criterion. Coiro, Chomicki & Doyle (2017) recently published a pre-print to show that this can be overcome to some degree by changing to Bayesian-inferred posterior probabilities. They also highlight the use of support consensus networks for investigating potential conflict in the data. This is a good start for a scientific community that so far has put more of their trust in either (i) direct visual comparison of fossils with extant taxa or (ii) collections of most parsimonious trees inferred based on matrices with high level of probably homoplasious characters and low compatibility. But do those matrices really require or support a tree? Here, I try to answer this question.

Background

Coiro et al. mainly rely on a recent matrix by Rothwell & Stockey (2016), which marks the current endpoint of a long history of putting up and re-scoring morphology-based matrices (Coiro et al.’s fig. 1b). All of these matrices provide, to various degrees, ambiguous signal. This is not overly surprising, as these matrices include a relatively high number of fossil taxa with many data gaps (due to preservation and scoring problems), and combine taxa that perished a hundred or more millions years ago with highly derived, possibly distant-related modern counterparts.

Rothwell & Stockey state (p. 929) "As is characteristic for the results from the analysis of matrices with low character state/taxon ratios, results of the bootstrap analysis (1000 replicates) yielded a much less fully resolved tree (not figured)." Coiro et al.’s consensus trees and network based on 10,000 parsimony bootstrap replicates nicely depicts this issue, and may explain why Rothwell & Stockey decided against showing those results. When studying an earlier version of their matrix (Rothwell, Crepet & Stockey 2009), they did not provide any support values, citing a paper published in 2006, where the authors state (Rothwell & Nixon 2006, p. 739): “… support values, whether low or high for particular groups, would only mislead the reader into believing we are presenting a proposed phylogeny for the groups in question. Differences among most-parsimonious trees are sufficient to illuminate the points we wish to make here, and support values only provide what we consider to be a false sense of accuracy in these assessments”.

Do the data support a tree?

The problem is not just low support. In fact, the tree showed by Rothwell & Stockey with its “pectinate arrangement” conflicts in parts with the best-supported topology, a problem that also applied to its 2009 predecessor. This general “pectinate” arrangement of a large, low or unsupported grade is not uncommon for strict consensus trees based on morphological matrices that include fossils and extant taxa (see e.g. the more proximal parts of the Tree of Life, e.g. birds and their dinosaur ancestors).

The support patterns indicate that some of the characters are compatible with the tree, but many others are not. Of the 34 internodes (branches) in the shown tree (their fig. 28 shows a strict consensus tree based on a collection of equally parsimonious trees), 12 have lower bootstrap support under parsimony than their competing alternatives (Fig. 1). Support may be generally low for any alternative, but the ones in the tree can be among the worst.

The main problem is that the matrix simply does not provide enough tree-like signal to infer a tree. Delta Values (Holland et al. 2002) can be used as a quick estimate for the treelikeliness of signal in a matrix. In the case of large all-spermatophyte matrices (Hilton & Bateman 2006; Friis et al. 2007; Rothwell, Crepet & Stockey 2009; Crepet & Stevenson 2010), the matrix Delta Values (mDV) are ≥ 0.3. For comparison, molecular matrices resulting in more or less resolved trees have mDV of ≤ 0.15. The individual Delta Values (iDV), which can be an indicator of how well a taxon behaves during tree inference, go down to 0.25 for extant angiosperms – very distinct from all other taxa in the all-spermatophyte matrices with low proportions of missing data/gaps – and reach values of 0.35 for fossil taxa with long-debated affinities.

The newest 2016 matrix is no exception with a mDV of 0.322 (the highest of all mentioned matrices), and iDVs range between 0.26 (monocots and other extant angiosperms) and 0.39 for Doylea mongolica (a fossil with very few scored characters). In the original tree, Doylea (represented by two taxa) is part of the large grade and indicated as the sister to Gnetidae (or Gnetales) + angiosperms (molecular trees associate the Gnetidae with conifers and Ginkgo). According to the bootstrap analysis, Doylea is closest to the extant Pinales, the modern conifers. Coiro et al. found the same using Bayesian inference. Their posterior probability (PP) of a Doylea-Podocarpus-Pinus clade is 0.54, and Rothwell & Stockey’s Doylea-Ginkgo-angiosperm clade conflicts with a series of splits with PPs up to 0.95.

Figure 1. Parsimony bootstrap network based on 10,000 pseudoreplicate trees
inferred from the matrix of Rothwell & Stockey.
Edges not found in the authors’ tree in red, edges also found in the tree in green.
Extant taxa in blue bold font. The edge length is proportional to the frequency of the
according split (taxon bipartition, branch in a possible tree) in the pseudoreplicate
tree sample. The network includes all edges of the authors’ tree except for
Doylea + Gnetidae + Petriellales + angiosperms vs. all other gymnosperms and
extinct seed plant groups. Such a split has also no bootstrap support (BS < 10)
using least-square and maximum likelihood optimum criteria.

Do the data require a tree?

As David made a point in an earlier post, neighbour-nets are not really “phylogenetic networks” in the evolutionary sense. Being unrooted and 2-dimensional, they don’t depict a phylogeny, which has to be a sort of (rooted) tree, a one-dimensional graph with time as the only axis (this includes reticulation networks where nodes can be the crossing point of two internodes rather than their divergence point). The neighbour-net algorithm is an extension into two dimensions of the neighbour-joining algorithm, the latter infers a phylogenetic tree serving a distance criterion such as minimum evolution or least-squares (Felsenstein 2004). Essentially, the neighbour-net is a ‘meta-phylogenetic’ graph inferring and depicting the best and second-best alternative for each relationship. Thus, neighbour-nets can help to establish whether the signal from a matrix, treelike or not as it is the cases here, supports potential and phylogenetic relationships, and explore the alternatives much more comprehensively than would be possible with a strict-consensus or other tree (Fig. 2).

Figure 2. Neighbour-net based on a mean distance matrix inferred
from the matrix of Rothwell & Stockey.
The distance to the "progymnosperms", a potential ancestral group of the
seed plants, can be taken as a measurement for the derivedness of each
major group. The primitive seed ferns are placed between progymnosperms
and the gymnosperms connected by partly compatible edge bundles; the
putatively derived "higher seed ferns" isolated between the progymnosperms
and the long-edged angiosperms. Shared edge-bundles and 'neighbourness'
reflect quite well potential phylogenetic relationships and eventual ambiguities,
as in the case of Gnetidae. Colouring as in Figure 1; some taxon names
are abbreviated.

In addition, neighbour-nets usually are better backgrounds to map patterns of conflicting or partly conflicting support seen in a bootstrap, jackknife or Bayesian-inferred tree sample. In Fig. 3, I have mapped the bootstrap support for alternative taxon bipartitions (branches in a tree) on the background of the neighbour-net in Fig. 2.

Obvious and less-obvious relationships are simultaneously revealed, and their competing support patterns depicted. Based on the graph, we can see (edge lengths of the neighbour-net) that there is a relatively weak primary but substantial bootstrap support for the Petriellales (a recently described taxon new to the matrix) as sister to the angiosperms. Several taxa, or groups of closely related taxa, are characterised by long terminal edges/edge bundles, rooting in the boxy central part of the graph. Any alternative relationship of these taxa/taxon groups receives equally low support, but there are notable differences in the actual values.

There is little signal to place most of the fossil “seed ferns” (extinct seed plants) in relation to the modern groups, and a very ambiguous signal regarding the relationship of the Gnetidae (or Gnetales) with the two main groups of extant seed plants, the conifers (Pinidae; see C. Earle’s gymnosperm database) and angiosperms (for a list and trees, see P. Stevens’ Angiosperm Phylogeny Website).

The Gnetidae is a strongly distinct (also genetically) group of three surviving genera, being a persistent source of headaches for plant phylogeneticists. Placed as sister to the Pinaceae (‘Gnepine’ hypothesis) in early molecular trees (long-branch attraction artefact), the currently favoured hypothesis (‘Gnetifer’) places the Gnetidae as sister to all conifers (Pinatidae) in an all-gymnosperm clade (including Gingko and possibly the cycads).

As favoured by the branch support analyses, and contrasting with the preferred 2016 tree, the two Doyleas are placed closest to the conifers, nested within a commonly found group including the modern and ancient conifers and their long-extinct relatives (Cordaitales), and possibly Ginkgo (Ginkgoidae). In the original parsimony strict consensus tree, they are placed in the distal part as sister to a Gnetidae and Petriellales + angiosperms (possibly long-branch attraction). The grade including the ‘primitive seed ferns’ (Elkinsia through Callistophyton), seen also in Rothwell and Stockey’s 2016 tree, may be poorly supported under maximum parsimony (the criterion used to generate the tree), but receives quite high support when using a probabilistic approach such as maximum likelihood bootstrapping or Bayesian inference to some degree (Fig. 3; Coiro, Chomicki & Doyle 2017).

Figure 3. Neighbour-net from above used to map alternative support patterns.
Numbers refer to non-parametric bootstrap (BS) support for alternative phylogenetic
splits under three optimality criteria: maximum likelihood (ML) as implemented in
RAxML (using MK+G model), maximum parsimony (MP), and least-squares
(via neighbour-joining, NJ; using PAUP*); and Bayesian posterior probabilties
(using MrBayes 3.2; see Denk & Grimm 2009, for analysis set-up). The circular
arrangement of the taxa allows tracking most edges in the authors’ tree and their,
sometimes better supported, alternatives. The edge lengths provide direct
information about the distinctness of the included taxa to each other; the structure
of the graph informs about the how tree-like the signal is regarding possible
phylogenetic relationships or their alternatives. Colouring as in Figure 1;
some taxon names are abbreviated.

Numerous morphological matrices provide non-treelike signals. A tree can be inferred, but its topology may be only one of many possible trees. In the framework of total evidence, this may be not such a big problem, because the molecular partitions will predefine a tree, and fossils will simply be placed in that tree based on their character suites. Without such data, any tree may be biased and a poor reflection of the differentiation patterns.

By not forcing the data in a series of dichotomies, neighbour-nets provide a quick, simple alternative. Unambiguous, well-supported branches in a tree will usually result in tree-like portions of the neighbour net. Boxy portions in the neighbour-net pinpoint the ambiguous or even problematic signals from the matrix. Based on the graph, one can extract the alternatives worth testing or exploring. Support for the alternatives can be established using traditional branch support measures. Since any morphological matrix will combine those characters that are in line with the phylogeny as well as those that are at odds with it (convergences, character misinterpretations), the focus cannot be to infer a tree, but to establish the alternative scenarios and the support for them in the data matrix.

References

Coiro M, Chomicki G, Doyle JA. 2017. Experimental signal dissection and method sensitivity analyses reaffirm the potential of fossils and morphology in the resolution of seed plant phylogeny. bioRxiv DOI:10.1101/134262

Crepet WL, Stevenson DM. 2010. The Bennettitales (Cycadeoidales): a preliminary perspective of this arguably enigmatic group. In: Gee CT, ed. Plants in Mesozoic Time: Morphological Innovations, Phylogeny, Ecosystems. Bloomington: Indiana University Press, pp. 215-244.

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83-100.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Friis EM, Crane PR, Pedersen KR, Bengtson S, Donoghue PCJ, Grimm GW, Stampanoni M. 2007. Phase-contrast X-ray microtomography links Cretaceous seeds with Gnetales and Bennettitales. Nature 450: 549-552 [all important information needed for this post is in the supplement to the paper; a figure showing the actual full analysis results can be found at figshare]

Hilton J, Bateman RM. 2006. Pteridosperms are the backbone of seed-plant phylogeny. Journal of the Torrey Botanical Society 133: 119-168.

Holland BR, Huber KT, Dress A, Moulton V. 2002. Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19: 2051-2059.

Rothwell GW, Crepet WL, Stockey RA. 2009. Is the anthophyte hypothesis alive and well? New evidence from the reproductive structures of Bennettitales. American Journal of Botany 96: 296–322.

Rothwell GW, Nixon K. 2006. How does the inclusion of fossil data change our conclusions about the phylogenetic history of the euphyllophytes? International Journal of Plant Sciences 167: 737–749.

Rothwell GW, Stockey RA. 2016. Phylogenetic diversification of Early Cretaceous seed plants: The compound seed cone of Doylea tetrahedrasperma. American Journal of Botany 103: 923–937.

Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.

Wednesday, August 14, 2013

How to construct a consensus network from the output of a bayesian tree analysis

In an earlier blog post I argued that We should present bayesian phylogenetic analyses using networks. The rationale for this is that a bayesian analysis is concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. In phylogenetics based on Markov Chain Monte Carlo (MCMC) methods, which produce a set of trees sampled in proportion to their posterior probabilites, the tree topologies can thus be summarized using a consensus network. This should be more in keeping with the bayesian philosophy than is producing a single tree, the so-called MAP tree. The MAP tree is based on combining those taxon partitions with the greatest frequency in the MCMC sample, so that the probability distribution is reduced down to a single tree with posterior probability values on the branches. On the other hand, a network produced from all of the partitions that appear in the MCMC sample, weighted according to their frequency, would be much closer to the bayesian aim. [Note: For a clarification of this point, see Leonardo de Oliveira Martins' comment at the end of this post.]

The practical issue with trying to do this is that at the moment it is not straightforward to get the consensus network from the output of any of the bayesian computer programs. These programs usually produce a file containing all of the sampled trees (from which the burn-in trees can be deleted). The simplest way to get the consensus network would be to use the SplitsTree or Spectronet programs to produce the network directly from this treefile. This can be done for files with a small number of trees; but no-one recommends doing bayesian analysis with a small number of MCMC-sampled trees. When the treefile contains tens of thousands of trees this is pushing the limit of SplitsTree and Spectronet, and they crash.

An alternative is to use the smaller "trprobs" file that is provided by, for example MrBayes, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree and SectroNet do not currently read tree weights in treefiles. So, Holland et al. (2005, 2006) produced a Python script to create the required input files, which can then be input to SplitsTree or SpectroNet. A copy of this script is provided here.

However, this approach is still limited, and so it is not the approach that I used in the example analysis provided in my previous blog post. The MrBayes program, for example, also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than from the trees, and so this information can also be used instead of the treefile. This can be provided to SplitsTree in a nexus-formatted Splits block (derived from the bipartitions) rather than in a Trees block (derived from the treefile).

There are two practical problems with this approach. First, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions, in order to decide how many of the bipartitions should be included in the network. This makes the process tedious. Second, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. I will try to explain this process here.

Manual procedure

The nexus-format file used for my previous analysis is here. It contains the original sequence data (the Data block), the instructions for MrBayes (the MrBayes block), the treefile produced by MrBayes (the Trees block), and the bipartitions information (the Splits block). This should reproduce the first consensus network shown in the previous blog post, and thus it shows you what a nexus-formatted file looks like.

The Splits block looks like this. The first column of the Matrix is simply an index to label the splits; the second column is the bipartition weight; and the third column lists the taxa in one of the two parts of each bipartition (you can choose either partition, but clearly it is quicker to list the taxa in the smaller partition).

BEGIN Splits; [Bipartitions occurring in >5% of the trees]
DIMENSIONS ntax=17 nsplits=46;
FORMAT labels=no weights=yes confidences=no intervals=no;
MATRIX
[1] 1.000000 1,
[2] 1.000000 2,
[3] 1.000000 3,
[4] 1.000000 4,
[5] 1.000000 5,
[6] 1.000000 6,
[7] 1.000000 7,
[8] 1.000000 8,
[9] 1.000000 9,
[10] 1.000000 10,
[11] 1.000000 11,
[12] 1.000000 12,
[13] 1.000000 13,
[14] 1.000000 14,
[15] 1.000000 15,
[16] 1.000000 16,
[17] 1.000000 17,
[18] 1.000000 12 13 14 15 16 17,
[19] 1.000000 16 17,
[20] 1.000000 12 13 14 15,
[21] 0.990441 9 10,
[22] 0.986401 12 13,
[23] 0.981162 3 4 7,
[24] 0.950994 2 5 6 8 9 10 11 12 13 14 15 16 17,
[25] 0.940455 3 4,
[26] 0.884189 12 13 15,
[27] 0.858771 5 6 8 9 10 11 12 13 14 15 16 17,
[28] 0.467503 5 6 8 11 12 13 14 15 16 17,
[29] 0.359641 6 8,
[30] 0.327114 6 11,
[31] 0.299736 5 11,
[32] 0.298256 6 8 11 12 13 14 15 16 17,
[33] 0.264989 11 12 13 14 15 16 17,
[34] 0.264549 6 8 11,
[35] 0.229872 6 11 12 13 14 15 16 17,
[36] 0.218812 5 6 8 11,
[37] 0.165447 5 11 12 13 14 15 16 17,
[38] 0.146118 9 1012 13 14 15 16 17,
[39] 0.144918 6 8 12 13 14 15 16 17,
[40] 0.135209 5 68 9 10 11,
[41] 0.130750 6 12 13 14 15 16 17,
[42] 0.114961 5 12 13 14 15 16 17,
[43] 0.109871 5 9 10,
[44] 0.105942 14 15,
[45] 0.084963 2 5 6 8 9 10 11,
[46] 0.070264 5 9 10 11 12 13 14 15 16 17,
;
END; [Splits]

The information about the taxa occurring in each bipartiton is taken from the following table, which appears in the MrBayes output. The ID is used as the first column of the Splits block; and the asterisks have simply been converted to the relevant taxon number (the Partition columns represent taxa 1–17, in order, so that an asterisk indicates that the particular taxon is included in that partition).

ID -- Partition
-----------------------
1 -- .****************
2 -- .*...............
3 -- ..*..............
4 -- ...*.............
5 -- ....*............
6 -- .....*...........
7 -- ......*..........
8 -- .......*.........
9 -- ........*........
10 -- .........*.......
11 -- ..........*......
12 -- ...........*.....
13 -- ............*....
14 -- .............*...
15 -- ..............*..
16 -- ...............*.
17 -- ................*
18 -- ...........******
19 -- ...............**
20 -- ...........****..
21 -- ........**.......
22 -- ...........**....
23 -- ..**..*..........
24 -- .*..**.**********
25 -- ..**.............
26 -- ...........**.*..
27 -- ....**.**********
28 -- ....**.*..*******
29 -- .....*.*.........
30 -- .....*....*......
31 -- ....*.....*......
32 -- .....*.*..*******
33 -- ..........*******
34 -- .....*.*..*......
35 -- .....*....*******
36 -- ....**.*..*......
37 -- ....*.....*******
38 -- ........**.******
39 -- .....*.*...******
40 -- ....**.****......
41 -- .....*.....******
42 -- ....*......******
43 -- ....*...**.......
44 -- .............**..
45 -- .*..**.****......
46 -- ....*...*********
-----------------------

The bipartition weights come from this table, which also appears in the MrBayes output. The relevant information is in column three. The IDs are the same as in the previous table. (Note that IDs 1–17 have a "Probab." of 1.000000 by definition.)

Summary statistics for informative taxon bipartitions
ID #obs Probab. Sd(s)+ Min(s) Max(s) Nruns
------------------------------------------------------------------
18 100008 1.000000 0.000000 1.000000 1.000000 8
19 100008 1.000000 0.000000 1.000000 1.000000 8
20 100008 1.000000 0.000000 1.000000 1.000000 8
21 99052 0.990441 0.001920 0.987521 0.993121 8
22 98648 0.986401 0.002285 0.984481 0.991041 8
23 98124 0.981162 0.004318 0.974082 0.986721 8
24 95107 0.950994 0.010545 0.939765 0.964723 8
25 94053 0.940455 0.004211 0.934885 0.946564 8
26 88426 0.884189 0.019000 0.851532 0.919526 8
27 85884 0.858771 0.018347 0.839453 0.885449 8
28 46754 0.467503 0.033135 0.432525 0.528278 8
29 35967 0.359641 0.072244 0.286057 0.512679 8
30 32714 0.327114 0.048225 0.243341 0.390609 8
31 29976 0.299736 0.048974 0.233661 0.382049 8
32 29828 0.298256 0.050175 0.201584 0.358851 8
33 26501 0.264989 0.049249 0.185345 0.345732 8
34 26457 0.264549 0.039611 0.204304 0.316375 8
35 22989 0.229872 0.042602 0.141909 0.275578 8
36 21883 0.218812 0.031232 0.160867 0.246460 8
37 16546 0.165447 0.051763 0.116151 0.251500 8
38 14613 0.146118 0.023326 0.094152 0.172466 8
39 14493 0.144918 0.018488 0.108631 0.164947 8
40 13522 0.135209 0.020371 0.106072 0.156707 8
41 13076 0.130750 0.017912 0.108391 0.161507 8
42 11497 0.114961 0.017107 0.096072 0.143429 8
43 10988 0.109871 0.016253 0.075194 0.123830 8
44 10595 0.105942 0.018071 0.072074 0.136069 8
45 8497 0.084963 0.016141 0.065035 0.102872 8
46 7027 0.070264 0.021858 0.051276 0.108951 8
------------------------------------------------------------------

So, some of the information in these two output tables is used to manually produce the Splits block, in the appropriate format, as shown. It would also be possible to write a script to automate this process (e.g. in Perl or Python).

Producing the separate files with Splits blocks containing different percentages of bipartitions is straightforward. As shown above, for the example data there are 46 bipartitions needed for the weights to sum to >0.95 (which is the MrBayes default number). If we choose 0.90 as the sum instead, then only the first 44 bipartitions are needed for the example data, while 0.85 requires only the first 37, and so on:
0.95 46
0.90 44
0.85 37
0.80 36
0.75 34
0.67 29
0.50 27
All that is needed is to delete the unwanted bipartitions from the Splits block, and then save the result as a separate nexus file.

Thanks to Mark for asking me to provide this blog post.

Wednesday, April 24, 2013

Cloudograms and data-display networks

I have previously noted that splits graphs are a logical way to present the results of Bayesian analyses (We should present bayesian phylogenetic analyses using networks). Bayesian analyses are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. Thus, the result of a Bayesian phylogenetic analysis should not be as a single tree (the so-called MAP tree or maximum a posteriori probability tree), but should instead show the probability distribution of all of the sampled trees. This can easily be done with a consensus network, as illustrated by example in the previous blog post.

An interesting alternative way of visualizing the probability distribution of trees is what has been called a Cloudogram, an idea introduced by Remco R. Bouckaert (2010, DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 26: 1372-1373). This diagram superimposes the set of all trees arising from an analysis. Dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement. This idea can be best illustrated by a few published examples.

The first cloudogram is from Figure 4 of Chaves JA, Smith TB (2011) Evolutionary patterns of diversification in the Andean hummingbird genus Adelomyia. Molecular Phylogenetics and Evolution 60: 207-218.

In this case the MAP tree has been superimposed on the cloudogram.

Species-tree with the highest posterior probability (PP > 80) superimposed upon
a cloudogram of the entire posterior distribution of species-trees recovered in BEAST.
Areas where the majority of trees agree in topology and branch length are shown as
darker areas (well-supported clades), while areas with little agreement as webs.

The next one is from Figure 2 of Pabijan M, Crottini A, Reckwell D, Irisarri I, Hauswaldt JS, Vences M (2012) A multigene species tree for Western Mediterranean painted frogs (Discoglossus). Molecular Phylogenetics and Evolution 64: 690-696.

Posterior density of 2700 species trees (‘‘cloudogram’’) representing the entire posterior distribution
of species trees (270,000 trees post-burnin) from the BEAST analysis based on seven nuclear loci and
4 mitochondrial gene fragments. The species tree with the highest posterior probability is nested within
the set; values indicate posterior probabilities associated with this consensus tree. Areas where many
species trees agree on topology and/or branch lengths are densely colored.

The next one is from Figure 1 of Lerner HR, Meyer M, James HF, Hofreiter M, Fleischer RC (2011) Multilocus resolution of phylogeny and timescale in the extant adaptive radiation of Hawaiian honeycreepers. Current Biology 21: 1838-1844.

In this case the data are more tree-like than the previous two examples.

Cloudogram showing all trees resulting from a Bayesian analysis of whole
mitogenomes (19,601 trees; 14,449 bps). Variation in timing of divergences is
shown as variation (i.e., fuzziness) along the x axis. Darker branches represent a
greater proportion of corresponding trees. All nodes have support values >0.99.

The final one is from Figure 2 of McCormack JE, Faircloth BC, Crawford NG, Gowaty PA, Brumfield RT, Glenn TC (2012) Ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis. Genome Research 22: 746-754.

This analysis involves bootstraps rather than Bayesian samples, showing that the same principle applies.

Evolutionary history of placental mammals resolved from conflicting
gene histories. Widespread consensus among 1000 species-tree bootstrap
replicates of the same 183-locus data set. STEAC trees are depicted because
the branch lengths allow for better visualization of branching patterns, but
STAR results supported the same topology. Cones emanating from terminal
tips of species trees (red arrows) indicate disagreement among bootstrap
replicates.

It would be nice to illustrate this further by direct comparison with a splits graph of the same dataset that I used in the previous blog post. Unfortunately, the computer program available (DensiTree) has the same practical limitation as the SplitsTree program (as mentioned in the previous post) — it does not read the MrBayes ".trprobs" file because it ignores the tree weights. This means that one has to enter the entire treefile (with thousands of trees), and I have not yet done that. Moreover, the program relies very much on having branch lengths for each tree — the output is really quite odd without them, with the taxa appearing in a series of steps rather than connected by straight branches. My previous analysis did not use branch lengths, as they are not needed for the consensus network, in which edge lengths represent support rather than character evolution.

Wednesday, January 9, 2013

We should present bayesian phylogenetic analyses using networks

Bayesian methods differ from other forms of probabilistic analysis in that they are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. That is, Bayesian analysis is not about identifying the most likely outcome, it is about quantifying the relative likelihood of all possible outcomes. In this sense, it is quite distinct from other probabilistic methods, such as those based on estimating the optimal outcome under criteria such as maximum likelihood, maximum parsimony or minimum evolution. As far as likelihood is concerned, Bayesian analysis deals with (maximum) integrated likelihood rather than (maximum relative) likelihood (Steel & Penny 2000).

In phylogenetic analysis this creates a potentially confusing situation, as the result of most Bayesian analyses is presented as a single tree, rather than showing the probability distribution of all trees. Certainly, some of the information from the probability distribution is used in the tree – usually the posterior probabilities that are attached to each of the tree branches – but this is a poor visual summary of the available information. A better approach would be to use a network to display the actual probability distribution.

Theory

Bayesian phylogeny programs such as MrBayes produce a file (or files) with a copy of every tree that was sampled by the MCMC run (or runs). These files can then be manipulated (eg. using the "sumt" command in MrBayes) to exclude the first trees as part of the burn-in; and a smaller file is then produced with a copy of the unique trees found, along with their frequency of occurrence in the sample (eg. in the "trprobs" file from MrBayes). It is this sample of trees that is summarized to produce the so-called MAP tree (maximum a posteriori probability tree) and its associated branch posterior probabilities.

Ideally, we would take this file and produce a consensus network rather than a consensus tree. The tree produced by MrBayes is built from the best-supported branches of the set of trees sampled, but only a set of compatible branches can be included in the consensus tree (see the original work of Margush and McMorris 1981). Any well-supported but incompatible branches will not be shown, and it is the absence of these branches that causes the phylogenetic tree to deviate from the standard Bayesian philosophy of presenting a probability distribution. A consensus network solves this problem because it is specifically designed to present a specified percentage of the incompatible branches as well (Holland et al. 2004).

The idea of using a consensus network rather than a consensus tree was first suggested by Holland et al. (2005), although it has rarely been used in practice (eg. Gibb & Penny 2010). Indeed, consensus networks could also be used to present the results of bootstrap analyses, or a set of equally parsimonious trees (Holland et al. 2006).

It is important to note that a consensus network is unrooted, and therefore it is solely a mathematical summary of the data, and cannot be treated as an evolutionary diagram. The MAP tree is, strictly speaking a form of consensus tree, and so it is technically also solely a mathematical summary of the data. It is, however, conventional to treat it as an evolutionary diagram.

An example

The example data are from Schnittger et al. (2012), being sequence data for the beta-tubulin gene of 17 samples of the genus Babesia. Details of the Bayesian analysis using MrBayes are provided here, which yielded 100,000 trees in the final MCMC sample. A nexus file with the sequence data, the 95% credible set of trees, and the associated bipartitions can be found here. The rooted MAP tree produced by MrBayes is shown in the first diagram, with the posterior probabilities (PP) labelled on each branch.

Bayesian MAP tree.

There are several ways that a consensus network could be presented. The standard way would be to choose some percentage of the MCMC trees and to show the network of those trees. For example, the MAP tree is simply a consensus that includes all of the bipartitions that occur in at least 50% of the trees, plus all of the other bipartitions that are compatible with those bipartitions.

As an alternative, here is the unrooted consensus network with all of the bipartitions that occur in at least 5% of the trees. It is important to note that the branch lengths in the following consensus networks represent the branch support not the estimated number of substitutions (as in the MAP tree). For example, the relationships among the Babesia microti sequences have very short branches in the MAP tree (ie. few substitutions) but long branches in the consensus networks (ie. good support).

Consensus network set at 5%.

This network makes visually clear where the major incompatibilities are among the MCMC trees, which the MAP consensus tree does not (unless one checks the PP values). The major set of boxes in the network involve the branches with PP values of c. 0.4. Note also that this network also approximates the 95% credible set of trees, as it excludes those bipartitions that occur in <5% of the trees.

Unfortunately, this network is visually rather cluttered, with a set of multi-dimensional boxes, and so maybe a simpler network would suffice. Instead, here is the consensus network with all of the bipartitions that occur in at least 20% of the trees. This network still emphasizes the main area of incompatibility among the trees, while losing the less-important incompatibilities on the other branches.

Consensus network set at 20%.

Another approach to simplifying the consensus network is to present a set of what are called weakly compatible bipartitions, rather than choosing some percentage of the bipartitions. This is shown in the next network. Note that visually it is a compromise between the above two networks, and may thus be preferable, as it makes clear the range of branches involved in the incompatibilities without needing to use multi-dimensional boxes to do so (ie. the graph is planar rather than 3-dimensional).

Consensus network of weakly compatible bipartitions.

Finally, we can emphasize the relationship between the network and bipartition compatibility by plotting the consensus network consisting of a set of compatible bipartitons. This forms a tree that has the same topology as the MAP tree (since it is constructed in exactly the same manner), but here the branch lengths represent the branch support rather than the estimated number of substitutions.

Consensus network of compatible bipartitions.

Current practical problems

The simplest way to implement the use of consensus networks to display the set of Bayesian trees would be to use SplitsTree to produce the consensus network directly from the (MrBayes) nexus-format MCMC treefile. However, this treefile will usually contain tens of thousands of trees, which is pushing the limit of SplitsTree. Alternatively, we could read in the smaller nexus-format "trprobs" file, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree does not currently read tree weights in nexus treefiles. So, Holland et al. (2005, 2006) produced some Python scripts to create the required nexus files, which can then be input to SplitsTree (or SpectroNet).

Alternatively, MrBayes also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than the trees, and so this can also be used instead of the treefile. There are two practical problems with this approach. First, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. Second, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions.

I used the latter approach to analyze the example data. [This is explained in more detail in a later post: How to construct a consensus network from the output of a bayesian tree analysis.]

Thanks to Bernie Cohen for prompting me to think about networks and Bayesian analysis, and to Barbara Holland for her help with getting the calculations done.

References

Gibb GC, Penny D (2010) Two aspects along the continuum of pigeon evolution: a South-Pacific radiation and the relationship of pigeons within Neoaves. Molecular Phylogenetics and Evolution 56: 698-706.

Holland BR, Delsuc F, Moulton V (2005) Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54:66-76.

Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Molecular Biology and Evolution 21: 1459-1461.

Holland BR, Jermiin LS, Moulton V (2006) Improved consensus network techniques for genome-scale phylogeny. Molecular Biology and Evolution 23: 848-855.

Margush T, McMorris FR (1981) Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.

Schnittger L, Rodriguez AE, Florin-Christensen M, Morrison DA (2012) Babesia: a world emerging. Infection, Genetics and Evolution 12: 1788-1809.

Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17: 839-850.

Monday, October 8, 2012

Open questions about evolutionary networks, part 3

There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be "open questions" — ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). This blog post finishes my list of some of these topics (see Part 1 and Part 2).

Robustness of branch/reticulation estimates

It is de rigueur in the world of phylogenetic tree building to pepper the tree branches with bootstrap values or posterior probabilities, or frequently both, especially if these estimates are >50%. On the other hand, these values are almost never seen in the world of phylogenetic networks.

If there is a direct link between the network and some character-state data, then bootstrap values can be calculated for a network in the same manner as for a tree — one simply builds many networks from the re-sampled character data. However, this procedure may not be quite as computationally feasible, if the network method does not have a practical computational running time.

Moreover, this procedure is not necessarily straightforward for other types of data from which we might build a network. For example, if we are building a network by minimizing the number of reticulations needed to reconcile a set of conflicting trees, the application of the bootstrap has not yet been evaluated. The computational focus to date has been on the optimization problem, not on the re-sampling problem. And, of course, in the absence of a likelihood model for reticulation events, posterior probabilities cannot be calculated at all.

So, this is another area where the lack of methods commonly associated with tree building seems to be a handicap for the widespread acceptance of network-based methodology.

Can biologists correctly interpret networks?

I have used this quote in an earlier blog post, but it is relevant again here. Baum and Smith (2012, Tree Thinking: An Introduction to Phylogenetic Biology) have noted the following:

"We do not know why it should be so, but we have learned from working with thousands of students that, without contrary training, people tend to have a one-dimensional and progressive view of evolution. We tend to tell evolution as a story with a beginning, a middle, and an end. Against that backdrop, phylogenetic trees are challenging; they are not linear but branching and fractal, with one beginning and many equally valid ends. Tree thinking is, in short, counterintuitive."

This is a well-studied problem. For example, there have been a number of studies of students taking introductory biology courses at tertiary institutions (mostly in the U.S.A.), aimed at identifying the "major misconceptions" entertained by these students. Certain basic problems are discussed by almost all of the authors concerned (both inside and outside the USA). I have written more extensively on this topic in a post at the Scientopia blog (Ambiguity in phylogenies), which you can read if you are unfamiliar with the current state of affairs. That blog post lists most of the important issues as well as the available literature.

That evolution professionals often suffer the same sort of problem is also well known. I have written more extensively on this topic in a previous post at this blog (Evolutionary trees: old wine in new bottles?). This blog post also lists the relevant literature.

What is worse, some professional organizations apparently know no better. For example, the Federation of American Societies for Experimental Biology (FASEB), which describes itself as "the policy voice of biological and biomedical researchers" in the U.S.A., has this Advocacy Card on their web site:

FASEB was also giving away similar bumper stickers at the recent 20th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB — July 2012, in Long Beach, CA), as discussed at the Byte Size Biology blog. Clearly, this image confounds linear evolution with tree-based evolution — this distinction is crucial to phylogenetic analysis, and yet confusion about these two things is rampant.

This leads me to an obvious question: if people have so much trouble going from a linear view of evolution to a tree-based view, are they going to have even more trouble going to a network-based view?

I cannot answer this question (yet). At one extreme, maybe the big conceptual leap is going from a chain to a tree, and a network is just a complicated tree, so that the conceptual leap is not great. Alternatively, maybe a tree is difficult because it is a set of linked and overlapping chains, and therefore a network is very difficult because it is a set of linked and overlapping trees. Maybe reality will turn out to be somewhere in between these two extremes.

There are at least two issues that are likely to be of importance here, in addition to those concerned with trees:

it is difficult to recognize monophyletic groups (clades) in a network, because the ancestry of any one taxon may be complicated (eg. what is a Most Recent Common Ancestor in a reticulated network? — see this blog post);
it is difficult to distinguish the different possible causes of reticulations (recombination, hybridization, HGT).

We will presumably find out how difficult things are after we have developed a set of widely used methods for constructing evolutionary networks.

Wednesday, April 25, 2012

Networks and bootstraps as tree-support criteria

It has been pointed out several times in the literature (eg. Wägele & Mayer 2007; Wägele et al. 2009; Morrison 2010) that network analyses and, for example, bootstrap analyses of trees do not necessarily show the same amount of "support" for a tree. This occurs because branch support values can be independent of character support.

Consequently, many apparently "well-supported" trees published in the literature are often not well-supported by the original data at all. That is, incongruences in the data are ignored by all tree-building algorithms, by definition. Indeed, this problem may be almost universal in the literature, because very few papers provide any evidence that the tree-likeness of the data has been evaluated by the authors.

Since this point seems to poorly understood by most workers, it is worth re-iterating here with an example. The three references cited above provide other examples where bootstrap analyses and network analyses yield very different conclusions about the support for phylogenetic trees.

The basic distinction between networks and bootstrapped trees is this: use of a data-display network, such as a splits graph, evaluates the character (or distance) data independently of any tree, whereas a bootstrap analysis evaluates the data solely in terms of a tree. For example, a bootstrap analysis records the trees at each iteration (or replicate) rather than recording the bootstrapped character set itself, and many different character sets can produce the same tree. Therefore, a bootstrap analysis does not directly assess the character support for a tree. Neither does a posterior probability from a bayesian analysis.

The importance of this distinction for phylogenetics is that a tree analysis forces the data into a tree irrespective of how well the data fit that tree. All that is required is that the tree be the optimal one based on a particular criterion (parsimony, likelihood, etc), while the degree of fit of the data and tree is effectively treated as immaterial to the analysis. This is true at each bootstrap iteration, as well, so that all we learn from a bootstrap analysis is which tree branches are the best supported — we do not learn anything directly about the support of the data for a tree in the first place.

Literally, bootstrap values represent "branch support" rather than "tree support"; and a similar thing can be said for bayesian posterior probabilities. [This issue is discussed further in this later blog post: How networks differ from bootstrapped trees.]

This can be illustrated with a simple empirical example. The data are taken from my Primer of Phylogenetic Networks. The original data are 1,687 aligned nucleotide positions of two genes from five species of the plant genus Viburnum. However, only 43 of the characters vary among these five species. It is expected a priori that V. prunifolium is a hybrid between V. rufidulum and V. lentago, so that a single well-supported tree is not necessarily likely.

Median network. Click to enlarge.

The Median network for the data is shown in the first figure, with the branches labelled by the characters that "support" them. Other types of splits graphs have the same topology as this one (eg. NeighborNet based on uncorrected distances), since the characters are all binary and are never more than pairwise incompatible. This means that all of the character data are displayed in the graph. The netted region in the graph is created by four characters (3, 32, 41, 42) that are incompatible with nine others. Thus, there is no unambiguously supported branch (other than the terminal ones), let alone support for a single tree.

Neighbor-Joining tree, with NJ (above) and Parsimony (below) bootstrap values. Click to enlarge.

Nevertheless, both Neighbor-Joining (based on uncorrected distances) and Parsimony analyses of the data produce a tree that is well-supported by bootstrap analyses, as shown in the second figure. In particular, note that there is strong support in both analyses (based on 100,000 bootstrap replicates) for the branch uniting V. prunifolium and V. rufidulum, even though the data indicate that this arrangement is supported by 3 characters and contradicted by 2 other characters.

Bayesian tree, with posterior probabilities (above) and Maximum-likelihood bootstrap values (below).
Click to enlarge.

Both the Maximum-Likelihood and the Bayesian analyses deal with the situation in a somewhat different manner, as shown in the third figure. Based on a GTR+G+I model (and 5,000 sampled or re-sampled trees), they correctly recognize the relative lack of data support for uniting V. prunifolium and V. rufidulum (the character support is 3/5=60%). However, they both greatly over-estimate the character support for the branch involving V. lantanoides and V. nudum, which is supported by 5 characters and contradicted by 3 other characters (5/8=60% support). The extra number of characters (8 versus 5) apparently makes a big difference to the evaluation of branch support.

Thus, there is no reason to expect branch support values of any ilk to represent character support for that branch; and there is no simple relationship between the two things. The mere fact that character data can repeatedly be shoe-horned into the same tree does not mean that the data offer much support for that tree!

If you want an evaluation of the tree-likeness of the original data, you need to use either a data-display network or some other non-tree evaluation method. Only then can we directly assess the tree support.

References

Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology & Evolution 27: 1044-1057.

Wägele J.W., Letsch H., Klussmann-Kolb A., Mayer C., Misof B., Wägele H. (2009) Phylogenetic support values are not necessarily informative: the case of the Serialia hypothesis (a mollusk phylogeny). Frontiers in Zoology 6: 12.

Wägele J.W., Mayer C. (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.