Wednesday, December 5, 2012

How networks differ from bootstrapped trees

I have noted before (Networks and bootstraps as tree-support criteria) that data-display networks can produce quite different results from bootstrap values on phylogenetic trees. For example, a splits-graph assesses character support for alternative bipartitions of the dataset, whereas a bootstrapped tree assesses  support for those branches that appear only in the tree. These two data evaluations will often be congruent, but they can also differ notably. Here, I use a published dataset to illustrate the two ways in which they can differ.

The dataset is from: Wang N., Braun E.L., Kimball R.T. (2012) Testing hypotheses about the sister group of the Passeriformes using an independent 30-locus data set. Molecular Biology and Evolution 29: 737–750. There are 28 taxa and 25,700 aligned nucleotides.

I used the SplitsTree program to calculate (i) a Neighbor-Joining tree with 1,000 bootstrap pseudoreplicates, and (ii) a NeighborNet graph. In both cases the simple p-distance was used.

Assessment

The graph below shows the split weights (or edge lengths) for the 58 splits that were included in the NeighborNet graph. These form a collection of what are called circular splits, and it is important to note that this collection does not include all of the splits supported by the data. Those splits not in the NeighborNet graph are shown in green with a split weight of 0.00001 (rather than zero), to accommodate the log scale.

The graph also shows the bootstrap percentages for all of the splits in the NeighborNet graph plus all of those branches with a bootstrap frequency greater than 1/100.  Splits that did not appear in any of the bootstrap pseudoreplicates are shown in pink.


Those splits / branches where there is an approximate agreement between the tree and the network are shown in blue. There is a roughly s-shaped relationship between the split weights and the bootstrap percentages, so that an increase in one is associated with an increase in the other.

However, in the range of the graph where there is 100% bootstrap support there are 8 splits (in pink) with a large split weight but 0% bootstrap support. These are splits that contradict at least one better-supported split. The better-supported splits appear as branches in the NeighborJoining tree at the expense of these 8 splits. This is the limitation of a tree representation, as it cannot accommodate alternative patterns, no matter how well-supported they are by the character data.

The important thing to realize is that these splits cannot appear in any bootstrap pseudoreplicate, because they are out-weighed in the character resampling performed by the bootstrap procedure. For each bootstrap pseudoreplicate the resampled data are forced into a tree, and thus all contradictory splits are ignored each time, no matter how well-supported they are. These splits therefore get 0% bootstrap support even though there is considerable character support for them.

Equally importantly, there is one edge that appears in the bootstrap assessment with high support (87%) but which does not appear in the NeighborNet graph at all, shown in green in the graph. The first step of the NeighborNet algorithm decides on a set of circular splits, and only these splits will appear in the splits graph, no matter how well-supported other splits might be.

In this example, there is a nested set of taxa that appears in the tree ((13,14)((15,16)17)), but the NeighborNet finds greater support for several contradictory partitions such as {16,17,27,28} and {15,16,17,27}, and thus cannot display the partition {15,16,17}, although it can accommodate the other three partitions: {13,14}, {15,16} and {13,14,15,16,17}.

So, the nested set gets high bootstrap support simply because it fits onto a tree. That is, given the partitions {13,14}, {15,16}, and {13,14,15,16,17}, all of which are well supported by the character data, then {15,16,17} will be supported as well because it fits neatly onto the tree, irrespective of the strength of its character support (the location of 17 is not well supported no matter where it is on the tree).

Conclusion

Trees and networks will often be in agreement, especially if the data are very tree-like. However, they can differ in two ways: (i) the network may show well-supported character patterns that are not included in the tree, and (ii) the tree may show well-supported branches that are not accommodated by the network-building algorithm. Well-supported branches on a tree are not necessarily well-supported by the character data, and absence of a branch from a tree does not necessarily mean that it has little character support.

No comments:

Post a Comment