Monday, October 21, 2019

Why the emperor has no clothes on – the mighty matK

In a recent paper published in PeerJ, Walker et al. (2019) take a close look at the complete plastome data of angiosperms. Although they don't find anything fundamentally new — well, at least not for those of us who have looked at the oligogene datasets we worked with — it's nice to see that somebody has been willing to do it in a very comprehensive way, and thereby published what some of us have long known:
  • A combined tree is not the sum of the genes that have been combined;
  • Single-gene trees can tell you very different stories.
Even if the overall branch support is pretty high, we should always be aware of internal data conflict.

When looked at closely, the emperor, in this case the Angiosperm Phylogeny Group (APG) complete plastome tree, may not be entirely naked, but is clothed in very few of the many garments at his disposal. Effectively, the branches in the plastome reference tree draw their support from very few of the 79 genes/gene regions in the plastome. As Walker et al. note:
"Of the most commonly used markers, matK, greatly outperforms rbcL; however, the rarely used gene rpoC2 is the top-performing gene in every analysis. We find that rpoC2 reconstructs angiosperm phylogeny as well as the entire concatenated set of protein-coding chloroplast genes."

Fig. 1 from Walker et al. showing the (lack of) individual gene support for the angiosperm reference phylogeny.

However, there is one aspect of the paper that calls for a network-based blog post:
"Following the typical assumptions of chloroplast inheritance [i.e. that the entire plastome shares a common history being passed on solely by the mother in angiosperms], we would expect all genes in the plastomes to share the same evolutionary history. We would also expect all plastid genes to show similar patterns of conflict when compared to non-plastid inferred phylogenies ... Our results, however, discussed below, frequently conflict with these common assumptions about chloroplast inheritance and evolutionary history."
Getting incongruent branches in the single-gene trees, including a few highly supported ones, is taken as evidence for different histories potentially mixed within the plastome. Walker et al. give references for (potential) recombination and reticulation in plastomes.

I asked whether this logic isn't a bit naive about tree inference. In their response, they pointed to the paper by Sullivan et al. (Mol. Biol. Evol. 2017). These authors tested for recombination in Picea (spruce) plastomes, then split the complete plastomes into three structural units, and found two embedded conflicting phylogenies, as shown in the next figure.

Fig. 4 from Sullivan et al. (2017). F1 and F2 are structural regions comprising most of the large single-copy unit; F3 comprises the two (duplicate) inverted-repeat regions and the small single-copy unit of the Picea plastomes.

This seems to be a compelling case (but note the BS < 100 for conflicting critical branches). It is also quite possible, since gymnosperm plastomes, in contrast to angiosperms, may be paternally or biparentally inherited. But is it a valid assumption that each single-gene tree (or, in Sullivan et al.'s case, trees based on multigene regions) reflects the true tree of that gene or gene complex? That is, even if I assume that all of the genes in my matrix share the same history, must they support the same inferred tree?

Since I have worked a lot at low taxonomic levels, and often with other people's (plastid) data (during my entire career, I remained faithful to the nuclear-encoded ribosomal DNA spacers), my spontaneous answer would be: Absolutely not! Topological conflict may hint towards decoupled gene histories: it is a necessary criterion, but not a sufficient one.

There are quite a lot of evolutionary scenarios that will lead to data inevitably supporting wrong branches, or false positives (see also Walker et al.'s discussion). Even if evolution is a strictly dichotomous process (which it clearly isn't):
  • low divergence may result in primitive (underived) sequences ('genetic symplesiomorphies') being shared by distant taxa
  • high divergence may result in saturation, which ultimately triggers branching artifacts
  • long isolation coupled with small active population sizes, repeated bottleneck / massive extinction events and/or lack of radiations will lead to sequences that are different from anything else in our data (in angiosperms, this phenomenon has a name: Ceratophyllum).
In fact, the very argument for angiosperm molecular phylogeneticists to move away from using single-gene phylogenies was that these first single-gene trees had branches that made little sense, especially when based on plastid data.

Single-gene trees will get things wrong. The more signal we add, usually by adding additional gene regions, the more we will reduce these errors (this is the best-case scenario, but see Delsuc et al., Nature Rev. Genet. 2005). Thus, if some gene trees conflict more with the combined tree than do others, it can be for two possible reasons:
  1. The conflicting genes had indeed different evolutionary histories. However, this would have to involve intra-plastome recombination and heteroplasmy, which so far have been very rarely documented in angiosperms.
  2. All genes had the same evolutionary history, but some of the data get more aspects of this true tree right than do others (and, of course, some are wrong that others get right).
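
The comparison behind this argument can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Walker et al. or in this post: the trees are hypothetical six-taxon toy trees written as nested tuples, and the function names are invented for this sketch. It lists which bipartitions (splits) of a gene tree are shared with a reference tree and which are incompatible with it.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree`; append each clade's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves

def splits(tree):
    """Non-trivial splits of the unrooted tree, stored as the side lacking a fixed taxon."""
    collected = []
    taxa = leaf_sets(tree, collected)
    anchor = min(taxa)  # arbitrary but fixed taxon used to normalize each split
    result = set()
    for side in collected:
        if anchor in side:
            side = taxa - side
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial (single-leaf) splits
            result.add(side)
    return result, taxa

def compatible(a, b):
    """Two normalized splits fit on one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def compare(gene_tree, reference_tree):
    """Classify the gene tree's splits relative to the reference tree."""
    gene, gene_taxa = splits(gene_tree)
    ref, ref_taxa = splits(reference_tree)
    if gene_taxa != ref_taxa:
        raise ValueError("trees must have the same set of leaves")
    shared = gene & ref
    conflicting = {s for s in gene - ref
                   if any(not compatible(s, r) for r in ref)}
    return shared, conflicting

# Hypothetical toy trees (taxon labels are placeholders, not real data).
reference = ((("A", "B"), "C"), (("D", "E"), "F"))
gene_tree = ((("A", "C"), "B"), (("D", "E"), "F"))

shared, conflicting = compare(gene_tree, reference)
print("shared:", sorted("".join(sorted(s)) for s in shared))
print("conflicting:", sorted("".join(sorted(s)) for s in conflicting))
```

Such a tally is purely descriptive. It reports how many edges conflict, but it cannot by itself tell reason 1 from reason 2.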

And the matK said: "I'm your lord, follow my lead"

Walker et al. (all their scripts and results files can be found on GitHub) find that only a few of the genes essentially make up the combined tree. One of them is an old reliable pal of angiosperm phylogeneticists, the chloroplast matK gene. The literature is full of "multigene" trees that are effectively matK gene trees using enlarged matrices. The matK determines a topology, and by adding genes that cannot compete with it (being too conserved, too variable or just inconsistently different), we reinforce this topology. Only branches unresolved by matK will be further optimized using the added data.

Let's look at an example.

For the purpose of this post (and the follow-up), I'll use an old angiosperm matrix that I have in stock (I know the quirks of this matrix). For analysis, I eliminated all of the OTUs with missing gene partitions, mainly to make sure that all of the trees and bootstrap (BS) pseudoreplicate trees have the same set of leaves, so I can summarize the tree samples using consensus networks.
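
The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the actual matrix: the dictionary layout, the set of missing-data symbols and the short sequences are all invented for the example, and the taxon names serve only as labels.

```python
# Hypothetical toy data: taxon -> {partition name -> aligned sequence}.
# The sequences are invented for illustration only.
matrix = {
    "Amborella":     {"matK": "ATGGCA", "rbcL": "ATGTCA", "18S": "GGCTAA"},
    "Ceratophyllum": {"matK": "ATGACA", "rbcL": "------", "18S": "GGCTAT"},
    "Podocarpus":    {"matK": "AT??CA", "rbcL": "ATGTCG", "18S": "GGCTTA"},
    "Nymphaea":      {"matK": "ATGGCT", "rbcL": "ATGTCA"},  # 18S absent
}

# Assumed symbols for gaps and missing or undetermined nucleotides.
MISSING = set("-?Nn")

def has_data(seq):
    """True if the sequence holds at least one non-missing character."""
    return any(ch not in MISSING for ch in seq)

def complete_otus(matrix, partitions):
    """Keep only the OTUs that have data in every listed partition."""
    return {
        taxon: parts
        for taxon, parts in matrix.items()
        if all(has_data(parts.get(p, "")) for p in partitions)
    }

partitions = ["matK", "rbcL", "18S"]
kept = complete_otus(matrix, partitions)
print(sorted(kept))
```

An OTU with a partly blanked-out gene is kept; only OTUs lacking a partition entirely are dropped, so every single-gene tree ends up with the same leaf set.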

Here's my combined tree, unpartitioned.

Gray – current APG IV classification, "gold tree" (primary relationships within Mesangiospermae still a matter of debate)
And here is the fully partitioned one (over-parametrized; with each gene/codon position treated as a separate data partition).

Essentially the same tree (some branches elongated, others shortened), except that the eudicot clade and the Ceratophyllum-monocot clade have swapped positions. Both trees have the same scale.

Even though my matrix includes only relatively few genes (just 21,550 sites), the tree gets the main aspects of the APG IV standard tree. The support for most of the branches is nearly unambiguous (irrespective of data partition), with the exception of some deep-down relationships within the Mesangiospermae (a long-standing issue, called the "dirty dozen"). The fact that the unpartitioned and partitioned analyses agree for the most part indicates that the signal in my matrix has no model-related issues (at least, none we could fix by using "better" models).

And the matK tree mirrors the fully partitioned tree, as shown here.

A tanglegram of the matK and the combined trees. Shown is the matK BS support for shared and conflicting edges. Orange asterisks: the monocot subtrees have the same structure, but when using only matK, the conifer outgroup Podocarpus is nested deep within.

The similarity is indeed striking, in particular since the gene sample in the matrix comprises data from:
  • two of the nuclear-encoded ribosomal RNA genes (18S, 25S; biparentally inherited) that did follow partly different evolutionary trajectories, as e.g. well-studied in the case of Fagales (being a derived eudicot, not included in my matrix)
  • six chloroplast genes/gene regions (maternally inherited), including the classics rbcL and matK but also rpoC2, the most informative gene identified by Walker et al.
  • three mitochondrial genes (also maternally inherited, but most mutations are, amino-acid-wise, synonymous, being concentrated at the third codon position).
The main things that matK gets wrong* in contrast to the combined tree are deep divergences represented by (very) short branches, in the part of the graph following the (very rapid) split of the mesangiosperm common ancestor (known as "Darwin's abominable mystery").

Also, it nests Podocarpus, the conifer in the outgroup, with unambiguous support in the monocots — which clearly is wrong, a false positive. Looking into the alignment, we can see that the reason for this is a mix of moderate-LBA (long-branch attraction) with missing-data-culling. To minimize LBA artifacts in the matrix originally used, I blanked out parts of the matK in the outgroup (which included a more derived conifer, Pinus, but also the extremely divergent gnetophytes); parts that were not straightforwardly alignable with the angiosperm matK.

The best way to illustrate internal signal conflict is, however, to directly show the BS Consensus network, not mapping support on two alternative topologies as seen in the tanglegram.

BS Consensus network based on 150 matK BS pseudoreplicates (number of necessary BS replicates determined by Pattengale et al.'s extended majority rule bootstop criterion, as implemented in RAxML)

When looking at BS << 100 and the boxes of competing splits in BS-support networks, it is important to keep in mind that low support can have two reasons:
  • Lack of decisive signal, because the BS pseudoreplicates will have (semi-)random or biased branching patterns; in the tree this surfaces usually as low (when random) to moderately high (when biased) support associated with (very) short branches.
  • Conflicting signals, i.e. signals incompatible with a single tree; depending on which site is eliminated or duplicated during resampling, the BS pseudoreplicate will show one or another topology; strong, deep conflict can surface in a tree by low support associated with (normally) long internal branches but also relatively high support for one alternative topology, the other only manifesting in very long terminal branches.
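
How such boxes arise can be sketched with a toy BS sample. This is a hedged illustration, not the software or data behind the network above: the ten pseudoreplicate trees, the five placeholder taxa and the 0.2 cut-off are assumptions made for the example, and each tree is given directly as its set of non-trivial splits (each written as its smaller side).

```python
from collections import Counter
from itertools import combinations

taxa = frozenset("ABCDE")  # placeholder taxon labels

# Hypothetical BS sample: every tree has AB|CDE, but the pseudoreplicates
# disagree on whether C groups with D (6 trees) or with E (4 trees).
ab, cd, ce = frozenset("AB"), frozenset("CD"), frozenset("CE")
replicates = [{ab, cd}] * 6 + [{ab, ce}] * 4

def split_frequencies(replicates):
    """Fraction of pseudoreplicate trees in which each split occurs."""
    counts = Counter(split for tree in replicates for split in tree)
    return {split: n / len(replicates) for split, n in counts.items()}

def compatible(a, b, taxa):
    """Two splits fit on one tree if at least one of the four intersections is empty."""
    a2, b2 = taxa - a, taxa - b
    return not (a & b) or not (a & b2) or not (a2 & b) or not (a2 & b2)

def label(split):
    """Readable form of a split, e.g. 'AB|CDE'."""
    return "".join(sorted(split)) + "|" + "".join(sorted(taxa - split))

freqs = split_frequencies(replicates)

# Keep every split found in at least 20% of the trees (assumed cut-off).
network = {s: f for s, f in freqs.items() if f >= 0.2}

# Pairs of retained splits that cannot sit on the same tree: these are
# what a consensus network draws as boxes instead of single edges.
boxes = [(a, b) for a, b in combinations(network, 2)
         if not compatible(a, b, taxa)]

for split, f in sorted(network.items(), key=lambda kv: -kv[1]):
    print(label(split), f)
for a, b in boxes:
    print("box:", label(a), "vs", label(b))
```

In this toy sample, two alternatives share the support between them, which is the pattern expected from conflicting signal. A lack of decisive signal would instead tend to spread the support over many alternatives, each with low frequency.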
Regarding Walker et al.'s results, we now need to ask:
  1. Are the non-conflicted branches in the combined tree (major clades equal to the gold tree) the result of shared history of all of the included genes, or just that of the matK?
  2. Is the conflict with the combined tree and locally ambiguous signals due to a different history of the matK, located in the large single-copy unit, and the other genes, or just matK's inability to get certain things right?
In this case, all relatively high-supported conflicting matK splits are either (i) associated with very short internal branches in the tree, the non-discriminative product of a fast ancient radiation, or (ii) the result of an obvious data/branching artifact, i.e. the misplaced Podocarpus.

So far, nothing challenges the assumption that the combined genes followed the same history. Whether the other genes reveal something else, we'll see in my next post.

* or right: APG IV treats Ceratophyllum as the "probable sister of the eudicots" (see also Stevens' Angiosperm Phylogeny Website).
