The Genealogical World of Phylogenetic Networks: Is rate variation among lineages actually due to reticulation?

Non-congruence among characters has traditionally been attributed solely to so-called vertical evolutionary processes (parent to offspring), which can be represented in a phylogenetic tree. For example, phenotypic incongruence was originally attributed solely to homoplasy (convergence, parallelism, reversal). For molecular data this could be modeled with DNA substitutions and indels, along with allowance for variable rates in different genic regions (e.g. invariant sites, or the well-known gamma model of rate variation).

This approach was not all that successful, and so the substitution models were made more complex, by allowing different evolutionary rates in different branches of the tree (e.g. substitutions are more or less common in some parts of the tree compared to others). For many researchers this is still as sophisticated as their phylogenetic models get (Schwartz & Mueller 2010), allowing for a relaxed molecular clock in their model rather than imposing a strict clock.

There is, however, a fundamental limitation to trying to make any one model more sophisticated: the more complex model will probably fit the data better but it might be fitting details rather than the main picture. Consider the illustration below. There is a lot of variation among these six animals and yet they are all basically the same. If I wish to devise a model to describe them, do I need a sophisticated model that describes all the nuances of their shape variation, or do I need a simple model that recognizes that they are all five-pointed stars? The answer depends on my purpose — if I wish to identify them to class then it is the latter, if I wish to identify them to species then it might be the former.

Vertical process models

This is relevant to phylogenetics. For example, if I wish to estimate a species tree from a set of gene trees, do I need a complex model that deals with all of the evolutionary nuances of the individual gene trees, or a simpler model that ignores the details and instead estimates what the trees have in common? It has been argued that the latter will be more useful under these circumstances. On the other hand, if I am studying gene evolution itself, I may be better off with the former.

So, adding things like rate variation among lineages (and also rate variation along genes) will usually produce "better fitting" models. However, this is fit to the data, and the fit between data and model is not the important issue, because this increases precision but does not necessarily increase accuracy.

Therefore, modern interest is in changing the fundamentals of the model, rather than changing its details. There are many possible causes of gene-tree incongruence, and maybe these should be in the model in order to increase accuracy.

For example, there has been interest in adding other vertical processes to the tree-building model, most notably incomplete lineage sorting (ILS) and gene duplication-loss (DL). ILS means that gene trees are not expected to exactly match the species tree, but will vary stochastically around that tree, with probabilities that can be calculated using the coalescent. DL means that gene copies appear and disappear during evolution, so that gene sequence variation is due to hidden paralogy as well as to orthology.

ILS has been modeled by being integrated into a more sophisticated DNA substitution model (see the papers in Knowles & Kubatko 2010). Originally, DL was dealt with at the whole-gene level (Slowinski and Page 1999; Ma et al. 2000), but there have been recent attempts to integrate this into the DNA substitution models, as well (Åkerborg et al. 2009; Rasmussen & Kellis 2012). These models are not yet widely used, and so most published empirical species trees still rely on modeling incongruence using rate variation among branches.

Horizontal process models

However, this whole approach restricts the phylogenetic model to vertical processes alone. It is entirely possible that the sequence variation that is being attributed to rate variation among branches is actually being caused by horizontal evolutionary processes, such as recombination, hybridization, introgression or horizontal gene transfer (HGT). For example, an influx of genetic material from outside a lineage could be mis-interpreted as an increase in the rate of substitutions and indels within that lineage. That is, long branches might represent introgression (or HGT) rather than in situ rate variation. If this is true then we would be modeling the wrong thing.

There has been little explicit discussion of this point in the literature. Syvanen (1987) seems to have been among the first. However, his premise was that the molecular clock is ultimately correct (and that "the basic observation has been that different macromolecules yield roughly the same phylogenetic picture"), and he was arguing that HGT does not necessarily violate the clock. Our modern perspective is, of course, that a strict clock is unlikely unless it has been demonstrated, and that genes are incongruent as often as they are congruent.

Recent models for ILS and DL have started to broach this issue, by adding reticulation to their underlying models. Rather oddly, this has usually been described as:

ILS + hybridization (Meng & Kubatko 2009; Kubatko 2009; Joly et al. 2009; Bloomquist & Suchard 2010; Yu et al. 2011; Marcussen et al. 2012; Jones et al. 2013; Yu et al. 2013); and
DL + HGT (Mirkin et al. 2003; Górecki 2004; Hallett et al. 2004; Csürös & Miklós 2006; Doyon et al. 2010; Tofigh et al. 2011; Bansal et al. 2012; Sjöstrand et al. 2012).

This pairwise association seems to reflect historical accident, rather than any actual mathematical difference in procedure — the gene-tree incongruence patterns are essentially the same for hybridization, introgression and HGT, as well as recombination. In the mathematical models, all we can really talk about is "reticulation" — it is up to the biologist to determine the nature of the horizontal process in each case.

Conclusion

The point here is essentially the same one that I made in a previous post (Resistance to network thinking). Currently, phylogenetics is approached in a very conservative manner. The "old way" is the best way, and things change very slowly. The currently popular phylogenetic models are simply variants of the same models that have been used for 30 years. Temporal rate variation (among lineages) and spatial rate variation (along genes) have been added to the original model from the 1970s, but not yet more complex vertical processes (ILS or DL), and not yet horizontal processes. For these, specialist programs need to be used.

Essentially, all variation in branch length is still attributed to homoplasy and rate variation, rather than considering the myriad of other biological processes that will produce the same apparent phenomen. With this attitude we might be getting more precise models but not necessarily more accurate one.

References

Åkerborg Ö, Sennblad B, Arvestad L, Lagergren J (2009) Simultaneous bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences of the USA 106: 5714-5719.

Bansal MS, Alm EJ, Kellis M (2012) Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 28: i283-i291.

Bloomquist EW, Suchard MA (2012) Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Systematic Biology 59: 27-41.

Csürös M, Miklós I (2006) A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. Lecture Notes in Computer Science 3909: 206-220.

Doyon J-P, Scornavacca C, Gorbunov KY, Szöllösi GJ, Ranwez V, Berry V (2019) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Lecture Notes in Computer Science 6398: 93-108.

Górecki P (2004) Reconciliation problems for duplication, loss and horizontal gene transfer. In: Bourne PE, Gusfield D (editors). Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, pp. 316-325. ACM Press, New York.

Hallett M, Lagergren J, Tofigh A (2004) Simultaneous identification of duplications and lateral transfers. In: Bourne PE, Gusfield D (editors). Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, pp. 347-356. ACM Press, New York.

Joly S, McLenachan PA, Lockhart PJ (2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Knowles LL, Kubatko LS (editors) (2010) Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, Hoboken NJ.

Kubatko L (2009) Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.

Ma B, Li M, Zhang L (2000) From gene trees to species trees. SIAM Journal on Computing 30:
729-752.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Meng C, Kubatko LS (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.

Mirkin BG, Fenner TI, Galperin MY, Koonin EV (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evolutionary Biology 3: 2.

Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research 22: 755-765.

Schwartz RS, Mueller RL (2010) Variation in DNA substitution rates among lineages erroneously inferred from simulated clock-like data. PLoS One 5: e9649.

Sjöstrand J, Sennblad B, Arvestad L, Lagergren J (2012) DLRS: gene tree evolution in light of a species tree. Bioinformatics 28: 2994-2995.

Slowinski J, Page RDM (1999) How should species phylogenies be inferred from sequence
data? Systematic Biology 48: 814-825.

Syvanen M (1987) Molecular clocks and evolutionary relationships: possible distortions due to horizontal gene flow. Journal of Molecular Evolution 26: 16-23.

Tofigh A, Hallett M, Lagergren J (2011) Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 517-535.

Yu Y, Barnett RM, Nakhleh L (2013) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.

Thursday, December 19, 2013

Is rate variation among lineages actually due to reticulation?

No comments:

Post a Comment