There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be "open questions" — ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). This blog post continues my list of some of these topics (see Part 1 and Part 3).
Separating randomness and rooting from reticulation
There are at least three quite distinct causes of incompatible patterns in a phylogenetic dataset:
- Randomness, which is expected to create stochastic variation (such as homoplasy), but which may also be due to bias (eg. selection);
- Rooting, with different "gene trees" being rooted in different places; and
- Reticulation, which can have any one of several causes (eg. hybridization, HGT, recombination).
I have previously discussed published examples in which several trees have been presented, from different gene segments, that differ from each other in the location of their outgroup root: eg. Figures 4.7 and 4.27 of my book (Introduction to Phylogenetic Networks), and also the Grass Phylogeny Working Group dataset (see this blog post). In at least one of these cases, there are no reticulate evolutionary events at all, merely an uncertain root. That is, a network was constructed showing putative hybridizations and yet the only evolutionary pattern in the data was that the single unrooted species tree had different roots in the different gene trees.
In all of these cases, it is difficult to present an evolutionary network, because many of the resulting reticulations reflect the differences in the outgroup roots rather than true evolutionary reticulation events. Clearly, we cannot accept a situation where incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes. This is further discussed in the next section.
Randomness refers to uncertainty in any of the relationships depicted by the tree. Stochastic variation has long been recognized in phylogenetics, and it is the principal issue that most tree-building methods try to address in their algorithms. Biologically, stochastic variation usually arises from short evolutionary intervals (represented as short branch lengths in the tree), but may also arise from inadequate tree-building models, etc. It is the problem that branch-support estimates are designed to quantify, such as bootstrap values or posterior probabilities.
In the "normal" statistical world, random data variation is assumed to be associated with estimation errors. For phylogenetic data, these might include incorrect data (eg. contamination), inappropriate sampling, and model mis-specification. Alternatively, these errors might lead to bias rather than random variation. If so, then the sources of bias should be dealt with via exploratory data analysis, and the offending information can then be corrected or deleted.
However, when we are specifically trying to study reticulate evolution, there will also be many possible biological causes of data conflicts, which are not the result of either reticulation or estimation errors, such as homoplasy (parallelism, convergence, reversal), duplication/loss, and various complex molecular activities (such as sequence inversion, duplication, and transposition). All of these issues need to be dealt with under the concept of "non-reticulation variation".
Separating reticulation-caused data conflict from non-reticulation data conflict requires a null model for reticulation. This is discussed below.
Standardizing the root
Rooting is, to my mind, a problem that has not yet been dealt with properly in phylogenetics. I find that the differences among a set of gene trees are often little more than the relative location of the root (the common ancestor). That is, the unrooted gene trees are (almost) identical, but they have been rooted in somewhat different places (often not too far from each other). In one sense, this is simply Randomness occurring with respect to the root. However, its effect can be great, because it can potentially affect all of the rest of the network, whereas Randomness in most other locations will have only local effects on the topology.
Situations where incompatibility among trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes, can be dealt with by pre-processing of the data (prior to network analysis). Here, I will make a few suggestions, just to get the ball rolling.
If we have a set of "gene trees", then problems with incompatible rooting might be dealt with using polychotomies. That is, we could try to create a set of rooted gene trees with the "same" root by deleting conflicting basal edges from the tree. For example, an algorithm might look like this:
- unroot all of the gene trees
- find the most-common root — the root location in an unrooted tree defines a split, so the most-common root will be the root-split that occurs in the largest number of trees (unless there are multiple outgroup taxa that make the ingroup non-monophyletic)
- any rooted gene tree consistent with that root (displays that split) can be used unmodified
- any gene tree with a nearby root could then be modified so that some of its edges are contracted into a polychotomy until the unrooted tree is consistent with the common root, and the resulting less-refined tree would then be used as the rooted tree; obviously, it would be necessary to explicitly define "nearby"
- the remaining gene trees would then be set aside and not used in the network analysis.
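The steps above can be sketched in Python. This is only an illustration under stated assumptions, not a tested method: trees are nested tuples with a binary root, all trees share the same taxa, and "nearby" is arbitrarily taken to mean that at most `max_conflicts` splits conflict with the common root split. The function names and the `max_conflicts` parameter are hypothetical, and ties for the most-common root are broken by first occurrence.

```python
from collections import Counter

def leaves(node):
    """Leaf names below a node; a tree is a nested tuple, a leaf is a string."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def normalize(side, taxa):
    """Name a split by the side that lacks the alphabetically first taxon."""
    return taxa - side if min(taxa) in side else side

def splits(tree, taxa):
    """Non-trivial splits of the tree once its root is ignored."""
    found, stack = set(), [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaves(node)
        if 2 <= len(side) <= len(taxa) - 2:
            found.add(normalize(side, taxa))
        stack.extend(node)
    return found

def compatible(a, b, taxa):
    """Two splits are compatible if some pair of their sides is disjoint."""
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def standardize_roots(trees, max_conflicts=1):
    """Sort trees into kept, contracted and set-aside (assumed definition of 'nearby')."""
    taxa = leaves(trees[0])
    # The root of a binary rooted tree sits on one edge of the unrooted tree.
    root_splits = [normalize(leaves(tree[0]), taxa) for tree in trees]
    common = Counter(root_splits).most_common(1)[0][0]
    kept, contracted, set_aside = [], [], []
    for index, tree in enumerate(trees):
        tree_splits = splits(tree, taxa)
        conflicts = {s for s in tree_splits if not compatible(s, common, taxa)}
        if not conflicts:
            kept.append(index)            # usable unmodified
        elif len(conflicts) <= max_conflicts:
            # Contracting an edge removes its split, leaving a polychotomy.
            contracted.append((index, tree_splits - conflicts))
        else:
            set_aside.append(index)       # root is too far away
    return common, kept, contracted, set_aside

gene_trees = [
    (("O1", "O2"), (("A", "B"), "C")),
    (("O1", "O2"), (("A", "C"), "B")),
    (("O1", ("O2", "A")), ("B", "C")),
    (("O1", "A"), (("O2", "B"), "C")),
]
common, kept, contracted, set_aside = standardize_roots(gene_trees)
```

In this toy example the first two trees share the outgroup root and are kept, the third needs one edge contracted, and the fourth conflicts in two places and is set aside. Working with splits rather than tree objects keeps the contraction step trivial, but a real implementation would need to convert the contracted split sets back into rooted trees.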
If the ingroup is not monophyletic, then the biologist should fix this before proceeding with the network analysis. This is a "biological problem" of sampling, not a mathematical one — perhaps the problem arises from deep coalescence, for example. If there is no clear "most-common root" among the trees, then perhaps we could define an "average" or centroid root of some sort. We would then proceed with the rest of the method.
An alternative to this "polychotomy method" might be to use the coalescent to construct an "approximate" species tree from the multiple gene trees (there are now several methods to do this), and then in the network analysis we could allow the gene trees to differ from the species tree only with respect to the poorly supported branches in the species tree. That is, we would use the well-supported parts of the coalescent tree as a backbone common to all of the gene trees, and for the uncertain parts we would use each of the gene trees. However, I am not certain of the applicability of the coalescent to higher taxa (as opposed to closely related species).
I have often thought that duplication-loss is another potential cause of problems with the root. It is not immediately obvious how to approach this, but some suggestions have been made by Burleigh et al. (2011).
A different strategy would be to try all possible roots and see which one(s) minimize the network complexity. This might be computationally intensive, depending on the size of the dataset and the network method used. It might be necessary to restrict the roots tested to those observed among the input trees.
Null models for reticulation
Once we have the root standardized for the dataset, we are then set the task of separating reticulation-caused data conflict from non-reticulation data conflict. This requires a null model for data conflict — any data conflict that cannot be accommodated by the null model is a candidate for explanation as the result of a reticulation event.
Looking at the literature, it seems to me that the most commonly accepted null model is deep coalescence (incomplete lineage sorting) (Meng and Kubatko 2009; Kubatko and Meng 2010). For example, a maximum-likelihood method has been developed that models hybridization in the presence of deep coalescence (Kubatko 2009). One can also use the coalescent as an optimality criterion to choose among alternative networks, with lineage sorting under the coalescent as the null hypothesis (Huson et al. 2005; Buckley et al. 2006; Than et al. 2007; Lyngsø et al. 2008; Joly et al. 2009).
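To make the idea of a coalescent null model concrete, here is a minimal sketch for the simplest case, which is not taken from the papers cited above: a rooted species triplet ((A,B),C) whose internal branch has length t in coalescent units. Under deep coalescence alone, the standard result is that each of the two discordant gene-tree topologies has probability exp(-t)/3, so they are expected to be equally frequent; a strong asymmetry between them is the kind of conflict that the null model cannot accommodate. The function names are hypothetical, and the test shown is a plain two-sided exact binomial test.

```python
import math

def triplet_probabilities(t):
    """Gene-tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units.
    """
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

def minor_asymmetry_pvalue(n_ac, n_bc):
    """Exact two-sided binomial test that both discordant topologies
    are equally frequent, given their observed counts."""
    n = n_ac + n_bc
    observed = math.comb(n, n_ac)
    # Sum all outcomes that are no more likely than the observed one.
    as_extreme = sum(
        math.comb(n, i) for i in range(n + 1) if math.comb(n, i) <= observed
    )
    return as_extreme / 2 ** n

expected = triplet_probabilities(1.0)
p_balanced = minor_asymmetry_pvalue(12, 10)   # roughly equal minor topologies
p_skewed = minor_asymmetry_pvalue(20, 2)      # one minor topology dominates
```

A small p-value here only says that deep coalescence by itself is a poor explanation of the counts; it does not identify which other process is responsible.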
However, the sole use of deep coalescence effectively ignores the other non-reticulation causes of data conflict, as listed above (under Separating randomness and rooting from reticulation). Now, I suppose that it is possible that this approach will work in practice, but it seems unlikely to me that this will be so. Effectively, this approach assumes that the gene trees correspond to the true underlying coalescent trees. This is unlikely because the gene trees are inferred and therefore can be incorrect, due to the other (listed) non-reticulation causes of data conflict. Moreover, if there are multiple types of reticulation event occurring then the approach might fail. For example, if one wishes to study hybridization, then the coalescence methods assume that recombination occurs only between and not within the regions used to infer the gene trees, which is also unlikely.
So, a more comprehensive null model seems to be needed, one that includes more than simply traditional statistical randomness plus deep coalescence. The default expectation at the moment seems to be that deep coalescence occurs above the species level, so that all data sets should be tree-like, whereas the objective here is to detect the non-tree-like parts of evolutionary history.
Dealing with stochastic error and bias
In addition to null models, we may also need pre-processing to deal with stochastic error and bias. There is a limit to what can be done with a single null model, and phylogenetic data are rarely simple. Here, I make a few suggestions, once again to start some discussion.
If we have a set of "gene trees", then perhaps the most obvious approach is to delete uncertain edges. That is, they would appear as polychotomies in the gene trees. This allows refined versions of these trees to be represented in the network, rather than requiring extra edges in a network to accommodate all of them. An alternative is to weight all of the edges with respect to their data "support", with the expectation that poorly supported edges would only appear in the network if they are consistently supported across a number of the gene trees.
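Both options can be sketched in a few lines of Python. The representation is an assumption made for illustration: each gene tree is a dictionary mapping a cluster (a frozenset of taxon names) to its support value, and the cut-offs of 70 and 100 are arbitrary placeholders rather than recommendations.

```python
from collections import defaultdict

def collapse_weak_edges(gene_tree, min_support=70.0):
    """Drop clusters whose support is below the cut-off, leaving polychotomies."""
    return {c: s for c, s in gene_tree.items() if s >= min_support}

def pooled_support(gene_trees):
    """Sum the support that each cluster receives across all gene trees."""
    totals = defaultdict(float)
    for gene_tree in gene_trees:
        for cluster, support in gene_tree.items():
            totals[cluster] += support
    return dict(totals)

AB, CD, DE = frozenset("AB"), frozenset("CD"), frozenset("DE")
gene_trees = [
    {AB: 95.0, CD: 40.0},
    {AB: 88.0, DE: 55.0},
    {AB: 60.0, DE: 62.0},
]

# Option 1: delete uncertain edges tree by tree.
collapsed = [collapse_weak_edges(tree) for tree in gene_trees]

# Option 2: keep an edge if its pooled support is high enough.
totals = pooled_support(gene_trees)
consistent = {c for c, total in totals.items() if total >= 100.0}
```

In this toy example the cluster DE never reaches the per-tree cut-off, so the first option discards it everywhere, whereas the second option retains it because two gene trees support it moderately.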
I think that there are two types of support that could be relevant to uncertainty: (1) classic branch support, such as bootstrap values; and (2) the set of multiple equally optimal or nearly optimal trees. These two types coincide in Bayesian analysis, as it is currently implemented in phylogenetics, because in Bayesian analysis the branch support is derived from the set of nearly optimal trees. I suspect that (2) may be a better idea than (1), because it expresses something about the tree itself rather than each edge alone; and it is used in the SpNet method (Nakhleh et al. 2005), for example, where each gene tree is a consensus tree of several nearly-optimal trees. The appeal of using polychotomies is that it is simple. The main arguments against it may be the work required for the calculations in methods such as maximum likelihood (both parsimony and Bayesian analyses do the necessary calculations anyway), and the fact that it may create non-dense sets of triplets (Jansson and Sung 2006), for example.
Another idea might be to delete taxa that have no consistent position among the input trees. The idea here is that biologically we are looking for things like hybridization and HGT, and we are not expecting this to involve any one taxon in combination with many other taxa. Therefore, extremely uncertain positions are unlikely to reflect Reticulation but rather Randomness (or lack of information). Creating polychotomies would lose a lot of information in this situation, and so it would be better to flag these taxa as problematic, and then leave them out of the network analysis. This is basically the concept used for largest common pruned trees (or agreement subtrees), except that here we don't prune the data all the way down to a tree (see Abby et al. 2010). This also seems to be the idea behind the Dendroscope program's option to deal only with clusters that appear in a certain percentage of the trees. The problem with the Dendroscope approach, however, is that a cluster generated by HGT (say) that appears in only one tree will be ignored. It would thus be better to use the variation in position of individual taxa, rather than presence/absence of clusters.
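One crude way to measure the variation in position of individual taxa is to prune each taxon in turn and count how many distinct topologies remain among the gene trees. The sketch below assumes that each rooted gene tree is given as a set of clusters (frozensets of taxon names) on the same taxa; the function names and the flagging rule are hypothetical, and this is not the procedure of Abby et al. (2010) or of Dendroscope.

```python
def restrict(clusters, taxon, n_taxa):
    """Non-trivial clusters of a rooted tree after pruning one taxon."""
    pruned = (cluster - {taxon} for cluster in clusters)
    return frozenset(c for c in pruned if 2 <= len(c) <= n_taxa - 2)

def topologies_without(trees, taxa):
    """For each taxon, count the distinct topologies left once it is pruned."""
    return {
        taxon: len({restrict(tree, taxon, len(taxa)) for tree in trees})
        for taxon in taxa
    }

taxa = frozenset("ABCDR")
trees = [
    {frozenset("AB"), frozenset("CD"), frozenset("ABCD")},   # R outside
    {frozenset("AR"), frozenset("ABR"), frozenset("CD")},    # R next to A
    {frozenset("AB"), frozenset("CR"), frozenset("CDR")},    # R next to C
]

baseline = len({frozenset(tree) for tree in trees})
scores = topologies_without(trees, taxa)
# Flag taxa whose removal reduces the number of distinct topologies.
flagged = [taxon for taxon in sorted(taxa) if scores[taxon] < baseline]
```

Here the three trees differ only in the placement of R, so pruning R leaves a single topology and R is the only taxon flagged. Unlike a cluster-frequency filter, this does not discard a cluster merely because it appears in one tree.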
References
Abby S.S., Tannier E., Gouy M., Daubin V. (2010) Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics 11: 324.
Buckley T., Cordeiro M., Marshall D., Simon C. (2006) Differentiating between hypotheses of lineage sorting and introgression in New Zealand alpine cicadas (Maoricicada Dugdale). Systematic Biology 55: 411-425.
Burleigh J.G., Bansal M.S., Eulenstein O., Hartmann S., Wehe A., Vision T.J. (2011) Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology 60: 117-125.
Huson D.H., Klöpper T., Lockhart P.J., Steel M.A. (2005) Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.
Jansson J., Sung W.-K. (2006) Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science 363: 60-68.
Joly S., McLenachan P.A., Lockhart P.J. (2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.
Kubatko L.S. (2009) Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.
Kubatko L.S., Meng C. (2010) Accommodating hybridization in a multilocus phylogenetic network. In: Knowles L.L., Kubatko L.S. (eds) Estimating Species Trees: Practical and Theoretical Aspects, pp. 99-113. Wiley-Blackwell, Hoboken NJ.
Lyngsø R.B., Song Y.S., Hein J. (2008) Accurate computation of likelihoods in the coalescent with recombination via parsimony. Lecture Notes in Computer Science 4955: 463-477.
Meng C., Kubatko L.S. (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.
Nakhleh L., Warnow T., Linder C.R., St John K. (2005) Reconstructing reticulate evolution in species — theory and practice. Journal of Computational Biology 12: 796-811.
Than C., Ruths D., Innan H., Nakhleh L. (2007) Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535.
Very interesting article, thanks! I agree that rooting has not been done properly in phylogenetics and have also been thinking about this problem, although in the context of building supertrees as a start. Considering each rooting to minimize supertree complexity is very computationally expensive and, interestingly enough, seems to provide slightly worse results in my tests than trying to root the gene trees according to the partially built supertree. As you might imagine, this strategy heavily biases the search (and would be even more difficult to do in network construction) so I have also been thinking of methods to find "most common rootings" among the trees. Thanks for the idea.