The Genealogical World of Phylogenetic Networks: Why the emperor has no clothes on

In the final part of this series dissecting angiosperm gene trees (see: Why the emperor has no clothes on – part 1 and part 2), we will enter muddy ground. Using our example data set, we will try to make a call on whether or not there has been any (detectable) major reticulation in the deep branches of the angiosperm tree.

What triggers conflicting gene histories

Before we look at the data, it may be a good idea to set the scene using simple theoretical examples of what we may look at.

Our two genes, represented by circle and pentagon (could be multigene regions or entire genomes), both follow the same evolutionary history (the gray background tree). In the left lineage, we have a bit of incomplete lineage sorting, because the ancestor was polymorphic for the circles. In the right lineage, we have different fixation rates: the circles evolve faster than the pentagons. With molecular data we usually don't have the ancestors, only the tips, which makes any inference less than straightforward.

Because of incomplete lineage sorting and different fixation rates in the left and right lineages, the circle gene tree gets the phylogeny pretty wrong. The pentagon gene tree comes closer to the reality – we only infer two sister clades where there is a grade. (With real-world data, the branch support values could give one a clue that three of the inferred blue clades have a higher quality than the fourth supporting a pseudo-monophylum.) The circle and pentagon trees are largely incongruent despite sharing the same history; and we may infer a pseudo-hybrid (the first diverging lineage within the right clade).

Combining these data may allow us to infer a tree that fits the real tree much better. In the left clade the trivial pentagon signal can out-compete the misleading circle signal, and avoid the misplacement of the first diverging lineage of the right clade. In the right clade, the circle signal can help to correct for the pseudo-clade.

Now we can add a late reticulation, and re-infer the gene trees.

Because of the reticulation (the circles are biparentally inherited, the pentagons maternally), the gene trees are more congruent than in the example above (circle and pentagon get it a bit wrong in the left clade), except for the hybrid and its pseudo-hybrid parent. The gene conflict in placing the lineage cross (part of the left clade in the circle-based tree, part of the right clade in the pentagon tree) well reflects its hybrid origin.

Different histories of nuclear genes vs. plastid / mitochondrial genes?

The easiest way to catch reticulation is to compare trees based on plastid / mitochondrial data (maternally inherited) vs. nuclear data (biparentally inherited). If reticulation happened in the past, we can expect that the maternal and biparental genealogy diverge from each other (see part 2).

Strict Consensus network of the plastid (data from 3 protein-coding genes + 1 partly coding gene region), mitochondrial (3 protein-coding genes) and nuclear trees (2 nrDNAs). The bold lines represent generally accepted phylogenetic splits (APG IV tree, see also Stevens' comprehensive Angiosperm Phylogeny Website).

This network is much more box-like than one would have expected based on the combined tree that can be inferred from the data (Part 1). But are we looking at largely decoupled histories?
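Why a consensus network becomes boxy can be illustrated with a minimal Python sketch. It relies on the standard compatibility criterion for splits (bipartitions): two splits fit on one tree only if at least one of the four intersections of their sides is empty, and every incompatible pair adds a box to the network. The taxon names, the two toy trees and the function names are hypothetical and not taken from the angiosperm data.

```python
from itertools import combinations


def normalise(split, taxa, reference):
    """Return the side of the bipartition that does not contain the reference taxon."""
    side = frozenset(split)
    return taxa - side if reference in side else side


def compatible(a, b, taxa):
    """Two splits fit on one tree if at least one of the four side intersections is empty."""
    sides_a = (a, taxa - a)
    sides_b = (b, taxa - b)
    return any(not (x & y) for x in sides_a for y in sides_b)


def conflicting_pairs(trees, taxa, reference):
    """Pool the splits of all trees and list every pair that cannot sit on one tree."""
    taxa = frozenset(taxa)
    # The same split written from either side collapses to one entry.
    splits = {normalise(s, taxa, reference) for tree in trees for s in tree}
    ordered = sorted(splits, key=sorted)
    return [(a, b) for a, b in combinations(ordered, 2) if not compatible(a, b, taxa)]


# Hypothetical toy data: the non-trivial splits of two five-taxon trees.
taxa = {"A", "B", "C", "D", "E"}
plastid_tree = [{"A", "B"}, {"A", "B", "C"}]
nuclear_tree = [{"B", "C"}, {"D", "E"}]

for a, b in conflicting_pairs([plastid_tree, nuclear_tree], taxa, reference="E"):
    print(sorted(a), "conflicts with", sorted(b))
```

Counting such pairs says how boxy the network is, not why: incomplete lineage sorting, weak signal and reticulation can all produce incompatible splits.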

This mess is hardly surprising. The combined tree is constrained by the plastid tree, specifically by the signal from the matK gene (Part 1), while the remaining plastid genes (from a different part of the plastome) fall into line. The mitochondrial tree combines genes that on their own inform poorly resolved trees riddled with branching artifacts (Part 2). The nuclear tree, on the other hand, combines the most and least divergent nuclear genes widely known. Because of this, the two genes conflict topologically with each other.

18S-25S rDNA tanglegram. The branch numbers show each gene's bootstrap support (BS) deviating from the combined BS support for the respective branch (indicated by line thickness): green, increased BS support when combining both genes, red, decreased BS support.

However, they are part of the same multi-copy coding unit (the 35S nuclear rDNA) that has very particular evolutionary constraints, such as structural constraints, affected by completeness of concerted evolution and intra-genomic recombination. Polyploid grasses, for example, can have up to four different collections of 35S rDNA, reflecting four different evolutionary origins, being part of the A, B, C or D genomes. You end up with what is called a multi-labelled tree: the A, B, C and D-genome variants of the same taxon pop up (consistently) in different parts of the tree, and you can have recombinants. If we look into the 18S vs. 25S data, however, we find no consistent sequence patterns supporting the topological conflicts between the two trees, or examples for recombination.

As in our theoretical example, each of the trees has certain strengths, and its own set of weaknesses, some of which can be overcome when combining the data (e.g. branches with increased combined support in the 18S-25S tanglegram).
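The colour coding used in the 18S-25S tanglegram can be mimicked with a few lines of Python. This is only a sketch of the bookkeeping: the branch labels and BS values are invented, and treating a branch that is missing from a gene's bootstrap sample as BS = 0 is an assumption made here for simplicity.

```python
def support_shift(single_bs, combined_bs):
    """Compare single-gene with combined bootstrap support, branch by branch."""
    shifts = {}
    for branch, combined in combined_bs.items():
        # Assumption: a branch never seen in the single-gene replicates has BS = 0.
        single = single_bs.get(branch, 0)
        delta = combined - single
        if delta > 0:
            colour = "green"      # support increases when both genes are combined
        elif delta < 0:
            colour = "red"        # support decreases when both genes are combined
        else:
            colour = "unchanged"
        shifts[branch] = (delta, colour)
    return shifts


# Hypothetical BS values, not taken from the actual analysis.
bs_18s = {"clade X": 45, "clade Y": 88, "clade Z": 60}
bs_combined = {"clade X": 82, "clade Y": 71, "clade Z": 60}

for branch, (delta, colour) in support_shift(bs_18s, bs_combined).items():
    print(f"{branch}: {delta:+d} ({colour})")
```

Tabulating these shifts per gene is a quick way to see which branches profit from combining the data and which ones suffer from competing signal.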

Bootstrap (BS) Consensus networks for the combined cp (upper left), mt (upper right), nc (lower left) and full data (lower right). Branches without numbers: BS = 100. Splits conflicting with those present based on the full data highlighted by red font (all with BS < 100).

In contrast to the boxy network appearance and the substantial conflict between the single gene trees (Part 2), most of the relationships (e.g. the major clade roots but also many intra-clade relationships) receive high or unambiguous support in all three trees*. Despite the disparate signals, the data seem to converge on a coalescent. If the genomes had different histories, they wouldn't converge so easily. Also, we would expect to see more consistent conflict between the "genome" trees than between the single-gene trees of the same genome, since the nuclear rDNA is biparentally inherited while the plastid and mitochondrial DNAs are passed on via the mothers only. Many of the angiosperms in our data reproduce sexually.

So far, no conclusive evidence for reticulation

Mere gene-tree incongruence is a poor basis for concluding that gene histories are decoupled. We need to dig for sequence-based evidence for reticulation and recombination. For instance, we might find a clearly derived sequence pattern exclusive to the right clade in a member of the left clade.

The importance of rare genomic changes when interpreting conflicting gene trees. The left and right clades obtained a unique and conserved gene or sequence feature before they diversified. The hybrid is the only taxon showing both.
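The kind of screening described here can be sketched in Python as follows. The sketch assumes that rare genomic changes have already been scored as presence/absence per taxon and that each taxon has been assigned to a clade beforehand; all taxon and marker names are hypothetical. A marker counts as "foreign" for a taxon when it is fixed in another clade but absent from the rest of the taxon's own clade.

```python
def foreign_markers(features, clades):
    """For each taxon, list markers fixed in another clade but absent from the rest of its own clade."""
    hits = {}
    for taxon, own in features.items():
        home = clades[taxon]
        # Markers carried by the other members of the taxon's own clade.
        kin = [features[t] for t in features if clades[t] == home and t != taxon]
        for other in sorted(set(clades.values()) - {home}):
            members = [features[t] for t in features if clades[t] == other]
            fixed = set.intersection(*members)  # present in every member of the other clade
            exclusive = {m for m in fixed if not any(m in k for k in kin)}
            found = sorted(own & exclusive)
            if found:
                hits.setdefault(taxon, []).extend(found)
    return hits


# Hypothetical scoring: "Lx" sits in the left clade but also carries the right-clade marker.
features = {
    "L1": {"indel_L"},
    "L2": {"indel_L"},
    "Lx": {"indel_L", "inversion_R"},
    "R1": {"inversion_R"},
    "R2": {"inversion_R"},
}
clades = {"L1": "left", "L2": "left", "Lx": "left", "R1": "right", "R2": "right"}

print(foreign_markers(features, clades))
```

A hit from such a screen is a candidate for closer inspection, not proof of hybrid origin; parallel gains or mis-scored markers would show up in the same way.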

This is where the Walker et al. (2019) and Sullivan et al. (2017) studies seem to fall short — they don't give any example, gene, gene region, or recognizable lineage-diagnostic sequence pattern that could be used as direct evidence for decoupled gene histories and/or reticulation.

For my data set, I cannot pinpoint such evidence either. All high(er)-supported conflict seems to be related to lineage sorting and data/signal issues, the inability of certain gene regions to resolve relationships in parts of the angiosperm tree, or falling prey to (more local than global) long-branch attraction. When looking at the sequences, there's no reason to question, for example, the assumed monophyly of the main lineages and orders, in spite of the topological conflict we face when analyzing these data. If there was reticulation between the ancestors of angiosperm lineages, or later on between the already formed lineages, it left no obvious imprint in the data.

Thus, after having investigated aspects of the seeming conflict by going back to the data (checking highly divergent and conserved sequence patterns, tabulating the partly competing BS support of the single genes, and minus-one gene analyses), I did not hesitate to combine these data and use a Bayesian total-evidence dating procedure. (We never published the results because mid-Cretaceous angiosperm fossils have much too derived morphologies for total evidence dating; when left unconstrained, MrBayes optimized towards an angiosperm root age of 4.5 Ba, which was the in-built maximum).

A total-evidence Bayes tree based on the full data set. Stars indicate the position of fossil taxa (mid-Cretaceous). Note their relatively long terminal branches, a situation total-evidence dating cannot handle. The matrix can be found at figshare: A basic total evidence matrix for basal angiosperms – combining Soltis et al (2011) with Doyle & Endress (2010).

An example for actual reticulation resulting in gene tree conflict

Working at the coal-face of evolution, I have encountered examples of apparently real reticulation (when analysing biparentally inherited nuclear data). The most compelling was probably the ancient relictual genotypes and pseudogenes that point towards ancient reticulation in the widely known plane trees, Platanus. Platanus subgenus Platanus (which includes all but one species, P. kerrii, a relict of a distant lineage growing in tropical-hot subtropical lowland forests of North Vietnam) falls into two main lineages characterized by unique sets of genotypes, the ANA clade (Atlantic-facing North and Mesoamerica) and the PNA-E clade (NW. Mexico, California and Mediterranean).

Haplo/-genotypic composition of Platanus (Grimm & Denk, Taxon, 2010, ES2 [PDF]). Platanus kerrii represents the sole surviving relative within the Platanaceae (genetically very distinct), an old lineage of angiosperm trees (going back deep into the Cretaceous). Their next kin today are, according to angiosperm molecular trees, the enigmatic Proteaceae, a Gondwanan relict (represented in our angiosperm data by Petrophile). For an even more comprehensive genotypic study that also covers plastid markers, check out De Castro et al., Ann. Bot., 2013 [open access].

Individuals in the contact zone between species of the two main lineages (including hybrids) can be heterozygotic / polymorphic for at least one of the sequenced nuclear regions, so that identification of recent hybrids is straightforward. Beyond this, genetically inconspicuous members of the ANA clade may show ITS pseudogenes from the PNA-E clade (stippled line in the figures above and below). Furthermore, two of the ANA clade species, P. palmeri (pa) and P. rzedowskii (rz), which grow closest to the populations of the PNA-E clade, show (predominantly) a PNA-E LEAFY genotype. However, this is not the genotype found in the close-by American PNA-E species (ra, ge) but one whose sequence is phylogenetically closer to that of the Mediterranean species, P. orientalis (or), on the other side of the globe.

Overlay of the LEAFY, 5S-IGS and ITS histories in Platanus. This doodle is based on tree- and network-inferences coupled with PCR-RFLP-based genotyping and in-depth analysis of mutation patterns in length-polymorphic sequence regions (Grimm & Denk 2010, ES1). P. x hispanica is the well-known ornamental alley/park tree, the 'London plane', a cultivated historical hybrid (mid 18th century) of the most hardy North American plane, P. occidentalis, and the frost-vulnerable Mediterranean plane, P. orientalis. In the Mediterranean, due to frequent backcrossing, one can find morphologically mixed individuals showing only the P. orientalis genotypes or homogenous (American or European) type individuals showing occidentalis and orientalis genotypes (see e.g. Pilotti et al., Euphytica, 2009).

Further reading

An animal example of seemingly incongruent single-gene trees that may well be the product of a largely shared evolutionary history is the autosomal intron data compiled for bears by Kutschera et al. (2014. Bears in a forest of gene trees: Phylogenetic inference is complicated by incomplete lineage sorting and gene flow. Mol. Biol. Evol. 31:2004–2017). Rather than forming a "forest of trees", each gene tree is poorly resolved but, when combined, the data allow inferring a phylogeny that matches quite well the paternal genealogy based on Y-chromosome data, both in strong conflict with the maternal genealogy inferred from mitochondriomes (see Part 2).

In Supplement File S6 [PDF] of Grímsson et al. (2018, Grana 57:16–116), I outline how ambiguous signal from combined gene regions relates to the poor support of critical branches in the Loranthaceae tree; see also the related posts: Using consensus networks to understand poor roots and Trivial but illogical – reconstructing the biogeographic history of the Loranthaceae (again). Some gene-tree conflicts are possibly linked to different histories (nuclear vs. chloroplast data), while others are a mix of insufficient signal and missing data (between chloroplast genes).

In a previous post (All solved a decade ago: the asterisk branch in the Fagales phylogeny), I give another example using an old Fagales matrix, which resulted in a tree that, even today, is the gold standard of Fagales phylogeny. The matrix combines a highly conserved nuclear gene (18S) conflicting with the plastid genes and complemented by an entirely uninformative mitochondrial gene (matR) to provide a "tree based on all three genomes". Also in this case the three-genome tree is essentially the matK tree.

* That doesn't mean that all highly supported, unconflicted relationships must be true. Note that just by combining a few genes, we obtain near-unambiguous support for the split between Mesangiosperms and the ANA-grade + gymnosperms, one of the splits defining the root and "basal" part of the angiosperm tree. The outgroup-inferred root is well fixed, even when using nuclear data, despite the fact that the 18S signal (the one showing the least ingroup-outgroup genetic distance) doesn't support such a root, whereas the 25S does (see part 2), being more divergent and prone to ingroup-outgroup long-branch attraction (LBA). That we have LBA issues with the data is obvious from a tiny detail: Ginkgo is supported with BS > 70 as sister of Podocarpus, which is wrong based on all we know about gymnosperms (see also Earle's gymnosperm database and literature cited therein). The likely correct split, Ginkgo as sister to Cycas, is present in the nc tree, but represents a much less supported alternative (BS <= 25). It is also obvious when one looks at the alignment(s): Cycas and Ginkgo share some potential genetic 'synapomorphies' in the low-divergent, generally conserved regions (e.g. 18S, stem regions of 25S), but there are essentially none for Ginkgo + Podocarpus.

Monday, November 18, 2019

Why the emperor has no clothes on – conflict or not?
