Monday, January 21, 2013

EDA or post-optimality analysis of phylogenetic data?

These days, phylogeneticists usually build trees to express the evolutionary history of their samples. As part of this procedure, they also show an interest in the "quality" of their trees. This is a very vaguely defined concept, probably because it has something to do with accuracy, or correctness, which is something we can know almost nothing about. So, instead, we resort to a whole swag of other concepts, such as resolution, robustness, sensitivity and stability, which are related to precision rather than to accuracy.

We implement these "precision" ideas in many ways, including: (i) analytical procedures, such as interior-branch tests, likelihood-ratio tests, clade significance, and the incongruence length difference test; (ii) statistical procedures, such as the ubiquitous non-parametric bootstrap and posterior probabilities, the jackknife, topology-dependent permutation, and clade credibility; and (iii) non-statistical procedures, such as the decay index, clade stability, data decisiveness, and spectral signals.

Most of these are forms of what is called post-optimality analysis: the tree is first calculated, and then we evaluate it. In the current issue of Bioinformatics there is a paper by Saad Sheikh, Tamer Kahveci, Sanjay Ranka and Gordon Burleigh (Stability analysis of phylogenetic trees. Bioinformatics 2013 29(2): 166-174) that provides yet another take on the same theme:
"We define measures that assess the stability of trees, subtrees and individual taxa with respect to changes in the input sequences. Our measures consider changes at the finest granularity in the input data (i.e. individual nucleotides)."
Basically, the idea is to see how much the input would need to be changed in order to cause a change in the tree topology. For example, the authors quantify the minimum edit distance required to produce a specified Robinson-Foulds tree distance from the optimal tree, although any similar distance could be used instead. Their basic purpose was to develop a method that is effective for very large datasets, which most of the alternatives cannot handle.
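
For readers unfamiliar with the Robinson-Foulds distance, here is a minimal sketch of how it can be computed for two unrooted trees on the same taxa. It is only an illustration, not the implementation of Sheikh et al.: the nested-tuple tree representation and the function names (leaf_set, splits, robinson_foulds) are assumptions made for this example.

```python
def leaf_set(tree):
    """Return the set of taxon names in a tree written as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions (splits) of an unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    taxa = leaf_set(tree)
    ref = min(taxa)  # arbitrary but consistent reference taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if ref in below else below
        # Skip trivial splits (a single taxon versus all of the others).
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Count the splits that occur in exactly one of the two trees."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("both trees must contain the same taxa")
    return len(splits(tree1) ^ splits(tree2))


# Toy trees (hypothetical taxa A-E) that differ in one internal branch.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
rf_same = robinson_foulds(t1, t1)
rf_diff = robinson_foulds(t1, t2)
```

In this toy case the two trees share the split {D, E} versus the rest but disagree about the other internal branch, so each tree has one split that the other lacks.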

What this approach raises is the question of whether a post-optimality analysis is the best approach in the first place. This type of approach assumes a tree as the basic structure, and fails to consider alternative structures that might be more appropriate for the data.

Exploratory data analysis (EDA), if performed effectively, can achieve the same result (an assessment of "stability") while at the same time revealing whether a tree is actually the best structure. It does this by evaluating the dataset directly, a priori, rather than evaluating the data relative to a tree, a posteriori. Evaluating the tree in terms of the data is not the same thing as evaluating the data independently of any tree.

Of the methods listed above, the only one that evaluates the data in a tree-independent manner is the use of spectral signals. Another approach is, of course, to use a data-display network, which provides a very convenient picture of the data, and will thus reveal whether a tree-building analysis is a good idea or not.
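
As one concrete illustration of what evaluating the data in a tree-independent manner can look like, here is a minimal sketch of a quartet-based tree-likeness score computed directly from pairwise distances. It follows the idea of the delta scores of Holland et al. (2002), which is a technique swapped in for illustration and is not one of the methods used in this post. The function names and the two toy distance matrices are assumptions made for this example.

```python
from itertools import combinations


def quartet_delta(d, w, x, y, z):
    """Delta score of one quartet: 0 if tree-like, up to 1 if box-like.

    d is a nested dict of pairwise distances, d[a][b]. For distances that
    fit a tree exactly, the two largest of the three pairwise sums are
    equal (the four-point condition), which gives a score of 0.
    """
    sums = sorted(
        [d[w][x] + d[y][z], d[w][y] + d[x][z], d[w][z] + d[x][y]],
        reverse=True,
    )
    largest, middle, smallest = sums
    if largest == smallest:  # star-like quartet with no internal signal
        return 0.0
    return (largest - middle) / (largest - smallest)


def mean_delta(d):
    """Average the quartet delta score over every set of four taxa."""
    taxa = sorted(d)
    scores = [quartet_delta(d, *q) for q in combinations(taxa, 4)]
    return sum(scores) / len(scores)


def symmetric(pairs, taxa):
    """Build a nested-dict distance matrix from one triangle of values."""
    d = {a: {b: 0 for b in taxa} for a in taxa}
    for (a, b), value in pairs.items():
        d[a][b] = d[b][a] = value
    return d


taxa = ["A", "B", "C", "D"]
# Distances that fit the tree ((A,B),(C,D)) with unit branch lengths.
tree_like = symmetric(
    {("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
     ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3}, taxa)
# Distances from two equally weighted, incompatible splits (a square).
box_like = symmetric(
    {("A", "B"): 1, ("C", "D"): 1, ("A", "C"): 1,
     ("B", "D"): 1, ("A", "D"): 2, ("B", "C"): 2}, taxa)

tree_score = mean_delta(tree_like)
box_score = mean_delta(box_like)
```

The point of the sketch is that no tree is ever built: the score is calculated from the distances alone, so a high value warns us that the data are not tree-like before any tree-building analysis is attempted.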

An example

To explore the essential difference between EDA and post-optimality analysis, we can look at one of the example datasets used by Sheikh et al. to illustrate their method.

This dataset involves sequences of 169 species of mammals, published by Meredith et al. (Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 2011 334(6055): 521-524). There are 35,603 aligned nucleotides, concatenated from 26 gene fragments. The sampling includes nearly all mammalian families, plus five vertebrate outgroups.

Meredith et al. note about their data:
"Phylogenetic relations from maximum likelihood (ML) and Bayesian methods are well resolved across the mammalian tree. More than 90% of the nodes have bootstrap support of ≥ 90% and Bayesian posterior probabilities of ≥ 0.95. Amino acid and DNA ML trees are in agreement for 163 out of 168 internal nodes."
Not surprisingly, Sheikh et al. reach a similar enthusiastic conclusion:
"The Mammals dataset is highly stable. There is not a single move (R = 1) possible for an edit distance of up to 530 nucleotides. Even if we place an extremely high limit of E = 1000, the biggest move possible is RF = 5. Thus, the stability measures provide an explicit guarantee that there is no move possible for E = 500 and any values of R within 1 SPR distance. This also demonstrates the power of building phylogenies from large densely sampled datasets."
However, this enthusiasm contradicts some well-known previous results. For example, Meredith et al. also note:
"Several nodes that remain difficult to resolve (e.g., placental root) have variable support between studies of rare genomic changes, as well as genome-scale data sets, which suggest that diversification was not fully bifurcating or occurred in such rapid succession that phylogenetic signal tracking true species relations may not be recoverable with current methods."
A simple EDA makes the situation clear, which neither the bootstrap / posterior-probability approach of Meredith et al. nor the edit-distance / tree-distance approach of Sheikh et al. does. If we stick to the simple parsimony approach of the latter (rather than the model-based approach of the former), then we can analyze the dataset with Hamming distances and a NeighborNet graph.
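
The distance step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used for the graph discussed here: the function names, the toy alignment, and the choice to skip sites with a gap or an ambiguous base ("-", "?", "N") and to report the proportion of differing sites are all assumptions made for this example. The resulting matrix would then be passed to a program that implements NeighborNet, such as SplitsTree.

```python
from itertools import combinations


def hamming(seq_a, seq_b, skip="-?N"):
    """Proportion of compared sites at which two aligned sequences differ.

    Sites where either sequence has a character listed in `skip` are
    ignored (an assumption of this sketch, not a universal convention).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in skip or y in skip:
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")


def distance_matrix(alignment):
    """Pairwise distances for a dict mapping taxon name -> aligned sequence."""
    names = list(alignment)
    dist = {(name, name): 0.0 for name in names}
    for p, q in combinations(names, 2):
        dist[p, q] = dist[q, p] = hamming(alignment[p], alignment[q])
    return dist


# Hypothetical toy alignment; the real dataset has 35,603 sites.
toy = {
    "taxon1": "ACGTACGT",
    "taxon2": "ACGTACGA",
    "taxon3": "AC-TTCGA",
}
dist = distance_matrix(toy)
```

Note that nothing in this calculation refers to a tree. The distances summarize the data directly, which is what allows the network to show conflicting signals that a single tree would hide.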

First, the root of the mammals is not clear. The published tree places the root on the branch leading to monotremes, but in the network the outgroup involves a reticulation. This is caused by an ambiguous relationship between Echinops telfairi (Tenrecidae) and (i) the {outgroup + monotremes} and (ii) the Afrotheria. In the tree the Afrotheria is united by a "strongly supported node".

Second, the root of the placentals is very unclear. Most of the major groups of placentals form clusters, but the relationships among these clusters are very obscure. The data are bush-like within the placentals, rather than tree-like, both at the level of the four major groups (named in the graph) and within each of those groups. In the published tree, some of these subgroups are well-supported, but others involve disagreement between the DNA and amino-acid trees, while others have < 90% bootstrap support.

It is not immediately obvious that a tree-building analysis is going to be of much use for this dataset. There is certainly some "power of building phylogenies from large densely sampled datasets", but this does not automatically mean that those phylogenies will be tree-like. Evolution involves a more diverse process than that, and post-optimality analyses based on a single model may be very misleading about that diversity.
