Wednesday, September 23, 2015

Uses of MUL-trees for evolutionary networks


Creating evolutionary phylogenetic networks is currently a somewhat ad hoc procedure, with a number of competing strategies based on various models of how gene flow occurs.

One possibility is to use multi-labeled trees (MUL-trees). A MUL-tree has leaves that are not uniquely labeled by a set of species (i.e. each species can appear more than once). This means that multiple gene trees can be represented by a single MUL-tree, with different combinations of the leaf labels representing different gene trees. The MUL-tree can, in turn, be represented as a reticulating network.
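To make the idea of "different combinations of the leaf labels" concrete, here is a minimal Python sketch. It assumes a toy representation that is not taken from any of the programs mentioned in this post: a tree is a nested tuple, a leaf is a string, and branch lengths are ignored. The function names and the four-species example are hypothetical. Choosing one copy of each repeated label, and pruning the other copies, yields one single-labeled tree per combination.

```python
from itertools import product

def leaves(tree, path=()):
    """Yield (path, label) for every leaf of a nested-tuple tree."""
    if isinstance(tree, str):
        yield path, tree
    else:
        for i, child in enumerate(tree):
            yield from leaves(child, path + (i,))

def prune(tree, keep, path=()):
    """Keep only the leaves whose path is in `keep`; collapse one-child nodes."""
    if isinstance(tree, str):
        return tree if path in keep else None
    kids = [prune(child, keep, path + (i,)) for i, child in enumerate(tree)]
    kids = [k for k in kids if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def displayed_trees(mul_tree):
    """Yield one single-labeled tree per choice of leaf copy for each label."""
    by_label = {}
    for path, label in leaves(mul_tree):
        by_label.setdefault(label, []).append(path)
    for choice in product(*by_label.values()):
        yield prune(mul_tree, set(choice))

# Hypothetical MUL-tree in which species "X" appears twice.
mul_tree = ((("A", "X"), "B"), ("X", "C"))
for tree in displayed_trees(mul_tree):
    print(tree)
```

With two copies of "X" and one copy of every other species, the sketch produces two single-labeled trees, one placing "X" next to "A" and one placing it next to "C".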

The most obvious uses of a MUL-tree are where there are multiple copies of genes within an organism, as each gene copy can be represented independently in the MUL-tree. This will apply when there has been gene duplication, for example, or when there has been polyploidy (i.e. multiple copies of the entire genome). Computer programs such as PADRE or MulRF can then be used to derive an optimal single-labeled species network from the MUL-tree.

However, this same strategy can also be used whenever there is conflict among gene trees. In this scenario, the conflicting genes are treated as different leaves in the MUL-tree. One labeled leaf would have the data for the first gene, with the second gene entered as missing data, and the second leaf would then have the inverse situation (the data for gene one are missing and those for gene two are present).
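The construction of the duplicated leaves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is a plain dictionary of taxon to partition to sequence, the function name `duplicate_taxon` and the suffix naming are hypothetical, and the sequences are invented toy strings rather than real data.

```python
def duplicate_taxon(matrix, taxon, partitions, missing="?"):
    """Replace `taxon` by one copy per partition.

    Each copy keeps the data for a single partition and has every other
    partition filled with the missing-data symbol.
    """
    lengths = {p: len(matrix[taxon][p]) for p in partitions}
    out = {name: dict(seqs) for name, seqs in matrix.items() if name != taxon}
    for keep in partitions:
        out[f"{taxon}_{keep}"] = {
            p: matrix[taxon][p] if p == keep else missing * lengths[p]
            for p in partitions
        }
    return out

# Toy matrix with invented sequences for two partitions.
matrix = {
    "E_lusitanica": {"CP": "ACGT", "ITS": "GGC"},
    "other_species": {"CP": "ACGA", "ITS": "GGT"},
}
combined = duplicate_taxon(matrix, "E_lusitanica", ["CP", "ITS"])
for name, seqs in combined.items():
    print(name, seqs)
```

The conflicting taxon ends up as two rows, each informative for only one partition, while all other taxa are left unchanged.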

This can be illustrated by a recent example from the genus Erica (heathers), from Mugrabi de Kuppler et al. (2015). The authors were interested in whether the observed gene tree conflict in Erica lusitanica could be the result of hybridisation between morphologically dissimilar species, as has previously been suggested.

They collected sequence data for a number of plastid regions as well as the nuclear ribosomal ITS region. The observed conflict was between the plastid (chloroplast) and nuclear sequences. They note:
A targeted supermatrix strategy was employed, whereby more variable ITS and trnL-trnF spacer sequences were obtained for most samples, and the other, mostly less variable chloroplast markers were added for selected taxa in order to improve resolution of deeper nodes in the chloroplast tree. 
Where gene tree conflict was identified, the taxa with conflicting phylogenetic signals were duplicated in a combined matrix following the approach of Pirie et al. (2008, 2009) in order to infer a single multi-labelled "taxon duplication" tree. [This occurred for only one species. Thus, one leaf label for E. lusitanica has the data only for the chloroplast sequences, and the other leaf has the data only for the nuclear sequence.]


The figure shows the result of the coalescent BEAST analysis of the multi-labeled data, with E. lusitanica appearing twice in the MUL-tree. Inset is the resulting single-labeled network, with E. lusitanica appearing once, as a reticulation.

This is an interesting application of MUL-trees. However, there are two issues that I wish to highlight about the procedure.

First, the reticulation as shown in the example is not actually time-consistent, given that the horizontal axis of the MUL-tree is scaled to time. This could, for example, be resolved by having "E. lusitanica CP" attached to a ghost lineage.
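The time-consistency requirement can be stated as a simple interval test: the two branches joined by a reticulation must co-exist for some span of time. The sketch below assumes each branch is given as a pair (older age, younger age) in time units before present. The function name and all ages are hypothetical and are not read off the figure.

```python
def overlap(branch_a, branch_b):
    """Return the time interval shared by two branches, or None.

    Each branch is (older_age, younger_age), in time before present.
    """
    older = min(branch_a[0], branch_b[0])
    younger = max(branch_a[1], branch_b[1])
    return (older, younger) if older > younger else None

# Two parental branches that never co-exist: not time-consistent.
print(overlap((5.0, 3.0), (2.0, 0.0)))

# Extending the first branch with an unsampled (ghost) continuation
# to the present restores an overlap.
print(overlap((5.0, 0.0), (2.0, 0.0)))
```

In this sketch a reticulation is only allowed where `overlap` returns an interval, which is what the ghost lineage achieves.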

Second, the data matrix from which the MUL-tree is created will, by definition, have a non-random distribution of missing data. Such non-randomness is known to cause instability and spurious branch support in likelihood analyses (Simmons 2012). In the example, it is exacerbated by further non-randomness in the acquisition of the plastid sequences. So, if this form of MUL-tree analysis is to be pursued, then this potential limitation should be investigated.
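One way to see the non-randomness is to tabulate, for each taxon and each partition, the fraction of characters that are missing. The following Python sketch assumes a matrix stored as a dictionary of taxon to partition to sequence, with "?" and "-" treated as missing. The function name and the toy sequences are hypothetical.

```python
def missing_fraction(matrix, missing="?-"):
    """Return {taxon: {partition: fraction of missing characters}}."""
    return {
        taxon: {
            part: sum(ch in missing for ch in seq) / len(seq) if seq else 1.0
            for part, seq in seqs.items()
        }
        for taxon, seqs in matrix.items()
    }

# Toy matrix: a duplicated leaf lacks one partition entirely, while
# another taxon is only partly sampled for the plastid partition.
matrix = {
    "X_CP": {"CP": "ACGT", "ITS": "???"},
    "Y": {"CP": "AC??", "ITS": "GGC"},
}
for taxon, fractions in missing_fraction(matrix).items():
    print(taxon, fractions)
```

A duplicated leaf always scores 1.0 for the partition it lacks, so the missing data are concentrated in whole blocks rather than scattered across the matrix.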

References

Mugrabi de Kuppler AL, Fagúndez J, Bellstedt DU, Oliver EGH, Léon J, Pirie MD (2015) Testing reticulate versus coalescent origins of Erica lusitanica using a species phylogeny of the northern heathers (Ericeae, Ericaceae). Molecular Phylogenetics and Evolution 88: 121-131.

Pirie MD, Humphreys AM, Galley C, Barker NP, Verboom GA, Orlovich D, Draffin SJ, Lloyd K, Baeza CM, Negritto M, Ruiz E, Cota Sanchez JH, Reimer E, Linder HP (2008) A novel supermatrix approach improves resolution of phylogenetic relationships in a comprehensive sample of danthonioid grasses. Molecular Phylogenetics and Evolution 48: 1106-1119.

Pirie MD, Humphreys AM, Barker NP, Linder HP (2009) Reticulation, data combination, and inferring evolutionary history: an example from Danthonioideae (Poaceae). Systematic Biology 58: 612-628.

Simmons MP (2012) Radical instability and spurious branch support by likelihood when applied to matrices with non-random distributions of missing data. Molecular Phylogenetics and Evolution 62: 472-484.

3 comments:

  1. Belatedly discovered this post via Altmetric – it’s great that you’ve engaged with our research.

    On missing data: we can refer to the body of work put out there particularly by John Wiens & co. demonstrating that missing data is in principle unproblematic for phylogenetic inference, including molecular dating (Wiens and Morrill, 2011; Zheng and Wiens, 2015). Nevertheless, the potentially large proportion of non-randomly distributed missing data introduced when using taxon duplication is indeed something to be careful about. As an extreme example, we applied the approach to a virus dataset (Visser et al., 2012) including multiple recombinant strains. This introduced a very high proportion of missing data, and critically we didn’t have enough non-recombinant strains in the dataset to be able to infer with confidence the relationships of closely related duplicated taxa (that shared no common positions in the alignment). Our solution was to use the pattern of unique shared breakpoints as (genomic scale) evidence for constraining monophyly in duplicated clades, and in this way we were able to infer the timing of recombination events that led to pathogenic virus strains (despite the Swiss cheese-like nature of the underlying matrix).

    On time-consistency: I’d argue that in the Erica lusitanica example the mul-tree is in fact consistent, but in other examples, particularly where hybrids have subsequently radiated into multiple species (or strains, such as in our virus example), this indeed might not be the case. The two or more internal branches that represent ancestors that hybridised clearly have to overlap in time or it makes no sense. I’d actually interpret this as an additional source of information with potential for improving our molecular dating estimates. In a correctly inferred chronogram, those branches will overlap. If they don’t, that would reveal a problem with the analyses (not so unusual in the dicey world of molecular dating), and that problem might even be addressed by ensuring that they do.

    Visser, J.C., Bellstedt, D.U., & Pirie, M.D. 2012. The recent recombinant evolution of a major crop pathogen, Potato Virus Y. PLoS ONE 7:e50631. http://dx.doi.org/10.1371/journal.pone.0050631

    Wiens, J.J., & Morrill, M.C. 2011. Missing Data in Phylogenetic Analysis: Reconciling Results from Simulations and Empirical Data. Syst. Biol. 60:719-731. http://dx.doi.org/10.1093/sysbio/syr025

    Zheng, Y., & Wiens, J.J. 2015. Do missing data influence the accuracy of divergence-time estimation with BEAST? Mol Phylogenet Evol 85C:41-49. http://dx.doi.org/10.1016/j.ympev.2015.02.002

    Replies
    1. Thanks for your comment. Here are some thoughts in reply.

      Missing data: all of the work that you quote refers to constructing trees. There is, as far as I know, no published evaluation of the effect of missing data on combining trees into a network. This is unlikely to be a trivial problem, because networks are exponentially more complex than trees.

      Time consistency: the two ancestral taxa that are involved in the identified reticulation exist at non-overlapping times in the time-scaled phylogeny. Therefore, they are not time consistent, by definition. As I noted, one solution is to have an unsampled continuation of one of the two taxa.

  2. Thank you for the response.

    On missing data: I can certainly imagine how missing data could impact networks that are inferred directly from sequence data. It's less obvious to me how this applies to networks summarised from trees (assuming of course that those trees were correctly inferred).

    On time consistency: on reflection, I do see what you mean. It strikes me as an issue with the network rather than as one with the tree/chronogram itself. When I add a ghost lineage to the MUL-tree (sister to E. lusitanica CP) this moves the reticulation onto the (more recent) stem lineage of this newly defined clade, which is indeed a better representation.
