Thursday, October 4, 2012

Open questions about evolutionary networks, part 1


There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be "open questions" — ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). In this blog post I list some of these, and provide a brief introduction to them. I will continue the list in future blog posts (Part 2 and Part 3).

Optimization criteria

Phylogenetic analysis has been treated mainly as a mathematical optimization problem. There seem to be two types of data that can be optimized:

  • (i) character-state changes (eg. nucleotide substitution, nucleotide insertion / deletion, or their amino acid equivalents), and
  • (ii) character-block events (eg. inversion, duplication / loss, transposition, recombination, hybridization, horizontal gene transfer).

To date, phylogenetic tree-building has concentrated on (i), and methods have been developed using optimization criteria such as minimum distance, maximum parsimony, maximum likelihood, and bayesian analysis (which, strictly speaking, does not involve an optimization criterion).

Most of the data-display network methods have also been based on optimizing data-type (i), notably the splits-graph methods (see this primer), which conceptually can be seen as based on either maximum parsimony or minimum distance. Moreover, it is possible to optimize the character data directly onto a network by maximizing either the parsimony scores (eg. Hein 1990, 1993; Dickerman 1998; Nakhleh et al. 2005; Jin et al. 2006a, 2007a, 2007b) or the likelihood scores (eg. von Haeseler and Churchill 1993; Strimmer and Moulton 2000; Strimmer et al. 2001; Jin et al. 2006b; Snir and Tuller 2009). The likelihood scores can also be evaluated in a bayesian context (Radice 2011).

However, evolutionary networks can differ from evolutionary trees by explicitly taking into account data-type (ii), either instead of or in addition to (i). So far, maximum parsimony has been the criterion of choice for doing this, in the sense that the available methods minimize the count of the number of events. For example, a large amount of work has been done to minimize the number of reticulation nodes when reconciling a set of incompatible phylogenetic trees, or alternatively minimizing the level (see this blog post).

However, this means that there are currently few available likelihood-based methods that will allow us to build networks directly from quantitative evolutionary models of how non-tree events occur. The most obvious exception here is the recent development of Admixture graphs (see this blog post), some at least of which are based on an approximate maximum-likelihood model (Pickrell and Pritchard 2012).

This seems to be a serious omission, given that model-based methods are among the most widely used of those available for phylogenetic trees, at least among those users who want a robust analysis (Kelchner and Thomas 2006). Likelihood has effectively replaced maximum parsimony as an optimization criterion for tree building. (The quick-and-dirty distance-based methods will probably always out-rank the other methods, because they can be useful as a "first approximation".)

It may not be easy to create likelihood models for non-tree events, perhaps even more so given the number of different types of events that need to be modelled. Nevertheless, the lack of such models seems to be a handicap for the widespread acceptance of network-based methods.

Partitioned models for likelihood analyses 

This topic is a direct extension of the previous one. Current likelihood models for tree-building analyses can be applied independently to different partitions of the type-(i) character data, and this partitioning is considered to be a valuable part of any likelihood analysis (eg. Blair and Murphy 2011). Indeed, it is the desirability of model partitioning that seems to be a major component of the increasing move from maximum likelihood to bayesian analysis, as well as the ease of implementing models that deal with heterogeneity among and within lineages (especially relaxed molecular clocks).

Partitioned models allow us to add complexity that can deal with heterogeneity within a dataset (Endicott et al. 2009), by a priori or a posteriori choice of partitions with greater inter- than intra-partition variability in substitution rates. For example, there is substitution-rate heterogeneity within genes (eg. different codon positions in protein-coding genes, paired versus unpaired positions in RNA-coding genes), as well as between genes (e.g. house-keeping genes versus rRNA-coding genes), between coding and non-coding regions (e.g. introns versus exons, as well as transcribed spacers and the mitochondrial control region), and between genomes (e.g. nuclear versus mitochondrial). Failure to correctly account for this heterogeneity can seriously mislead phylogenetic analyses; and automated procedures for devising partition schemes have now been developed (eg. Lanfear et al. 2012).

Partitioning is not a panacea for heterogeneity, of course, and there are potential problems that need to be addressed concerning partition choice and its consequences (see Brown et al. 2010; Marshall 2010; Fan et al. 2011). None of these issues have yet been addressed in the context of evolutionary networks, although there seems to be no barrier to the use of partitioning for network likelihood models. On the other hand, dealing with evolutionary heterogeneity among and within lineages may actually be a bigger problem, given the increased complexity of the lineages in a network.

Mixture models for likelihood analyses

An alternative approach to dealing with heterogeneity is through the use of mixture models. Here, the likelihood of each character is calculated under more than one model, and these likelihoods are then combined. For example, the parameters of several substitution models, as well as the probability with which each model applies to each alignment position, can be determined directly from the data. Such models have been developed for nucleotide (Pagel and Meade 2004) and amino-acid (Le et al. 2008) sequences, but this is otherwise a very under-explored part of phylogenetic analysis. Nevertheless, computer programs are becoming more readily available (eg. Stamatakis 2006).

It would presumably be possible to combine data types (i) and (ii) using this approach. Indeed, this has obvious theoretical advantages for networks, although the resulting models may be overly complex. It seems likely that the ability to model, say, hybridization versus recombination, as alternative causes of reticulations in a phylogeny will be a part of any successful attempt to produce a widely used method of phylogenetic analysis.

References

Blair C., Murphy R.W. (2011) Recent trends in molecular phylogenetic analysis: where to next? Journal of Heredity 102: 130-138.

Brown J.M., Hedtke S.M., Lemmon A.R., Moriarty Lemmon E. (2010) When trees grow too long: investigating the causes of highly inaccurate bayesian branch-length estimates. Systematic Biology 59: 145-161.

Dickerman A.W. (1998) Generalizing phylogenetic parsimony from the tree to the forest. Systematic Biology 47: 414-426.

Endicott P., Ho S.Y.W., Metspalu M., Stringer C. (2009) Evaluating the mitochondrial timescale of human evolution. Trends in Ecology and Evolution 24: 515-521.

Fan Y., Wu R., Chen M.-H., Kuo L., Lewis P.O. (2011) Choosing among partition models in bayesian phylogenetics. Molecular Biology and Evolution 28: 523-532.

Hein J. (1990) Reconstructing evolution of sequences subject to recombination using parsimony. Mathematical Biosciences 98: 185-200.

Hein J. (1993) A heuristic method to reconstruct the history of sequences subject to recombination. Journal of Molecular Evolution 36: 396-405.

Jin G., Nakhleh L., Snir S., Tuller T. (2006a) Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics 23: e123-e128.

Jin G., Nakhleh L., Snir S., Tuller T. (2006b) Maximum likelihood of phylogenetic networks. Bioinformatics 22: 2604-2611.

Jin G., Nakhleh L., Snir S., Tuller T. (2007a) Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Molecular Biology and Evolution 24: 324-337.

Jin G., Nakhleh L., Snir S., Tuller T. (2007b) A new linear-time heuristic algorithm for computing the parsimony score of phylogenetic networks: theoretical bounds and empirical performance. Lecture Notes in Bioinformatics 4463: 61-72.

Kelchner S.A., Thomas M.A. (2006) Model use in phylogenetics: nine key questions. Trends in Ecology and Evolution 22: 87-94.

Lanfear R., Calcott B., Ho S.Y.W., Guindon S. (2012) PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution 29: 1695-1701.

Le S.Q., Lartillot N., Gascuel O. (2008) Phylogenetic mixture models for proteins. Philosophical Transactions of the Royal Society of London, B: Biological Sciences 363: 3965-3976.

Marshall D.C. (2010) Cryptic failure of partitioned bayesian phylogenetic analyses: lost in the land of long trees. Systematic Biology 59: 108-117.

Nakhleh L., Jin G., Zhao F., Mellor-Crummey J. (2005) Reconstructing phylogenetic networks using maximum parsimony. In: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, pp. 93-102. IEEE Computer Society, Washington DC.

Pagel M., Meade A. (2004) A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology 53: 571-581.

Pickrell J.K., Pritchard J.K. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. Unpublished ms (http://arxiv.org/abs/1206.2332).

Radice R. (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.

Snir S., Tuller T. (2009) The NET-HMM approach: phylogenetic network inference by combining maximum likelihood and hidden markov models. Journal of Bioinformatics and Computational Biology 7: 625-644.

Stamatakis A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.

Strimmer K., Moulton V. (2000) Likelihood analysis of phylogenetic networks using directed graphical methods. Molecular Biology and Evolution 17: 875-881.

Strimmer K., Wiuf C., Moulton V. (2001) Recombination analysis using directed graphical models. Molecular Biology and Evolution 18: 97-99.

von Haeseler A., Churchill G.A. (1993) Network models for sequence evolution. Journal of Molecular Evolution 37: 77-85.

1 comment: