Monday, December 2, 2019

Trees informing networks explaining trees

Working at the coalface of evolution, one phenomenon always intrigued me: How does the signal in the data build up a tree? Especially since we have to assume some sort of reticulation happened at some point — evolution is rarely a strictly dichotomous process, which we would model by a tree. In earlier posts, we have covered the difference between clades and grades in a tree, and Hennig's concepts of monophyly and paraphyly. Clearly, in the light of actual evolutionary processes, the cladistic approach synonymizing clades with monophyly is a simplification at best, and naive at worst.

In this post, I will discuss a real-world example using molecular data put together for a (probably) recently evolved plant genus, Drosanthemum, as discussed in this paper:

Liede-Schumann S, Grimm GW, Nürk NM, Potts AJ, Meve U, Hartmann HEK. 2019. Phylogenetic relationships in the southern African genus Drosanthemum (Ruschioideae, Aizoaceae). bioRxiv preprint.
These days, next-generation sequencing (NGS) and phylogenomic data may provide what you need to resolve everything from the beginning of life to the very tips of the Tree of Life (ToL; which, at the root and tips is probably more of a Network of Life). However, this has two shortcomings: You need a lot of money, and a lot of DNA. Given the number of modern-day plant species, including not a few that are in flux, it's pretty safe to assume that I won't live long enough to have all of the ToL leaves resolved by NGS data.

On the hand, there are a countless numbers of scientists with taxonomic expertise struggling for funding; and classic Sanger sequencing has become very cheap. Thus, Oligogene (fossil) data sets will remain in use for quite some time. We do, however, have to deal with their shortcomings, such as not giving us a fully resolved phylogenetic tree, but instead providing partially diffuse signals. Nevertheless, we can get a lot of insights by combining traditional tree and network inferences.

The tree

To test systematic concepts and construct a species phylogeny for Drosanthemum, we tapped into four non-coding plastid gene regions, which, following earlier research, were the most divergent within the larger group:
  • the close intergenic plastid spacers trnK-rps16 and rps16-trnQ,
  • the trnS-trnG intergenic spacer, and
  • the rpl16 intron.
Following popular demand, we also sequenced the nuclear-encoded ITS region used in earlier phylogenetic studies (despite being quite useless for tree inference in the larger lineage, being much too conserved).

We did a full analysis, a single-gene tree inference, and bootstrapping vs. combined analysis, with or without data partitioning. We concluded that the combined plastid (not including ITS data) tree does provide a good phylogenetic-systematic framework for the genus.

Our Drosanthemum tree, rooted using the most probable rooting scenario (following an outgroup-EPA analysis; see Liede-Schumann et al., fig. 4 and Supplement file S4). Major clades and subclades are annotated, on the left the morphological subgenera associated with each major clade.

Why consider it to be good? Well, it fits amazingly well with the morphology-based systematics. Evolution doesn't follow a straight path: (i) reticulation will happen at least during the formation of species; (ii) there will always be some incomplete lineage sorting of geno- and phenotypes; and (iii) morphologies have substantially different evolutionary constraints from noncoding plastid gene regions. If it ends up in a good match, it won't be by coincidence but more likely because our inferred phylogeny captures well the true tree (coalescent).

Regarding monophyly, the tree is hence well suited to construct a framework: most of our major clades are linked to a specific, clade-unique morphology. (Don't hope to find too many autapomorphies, in plants common origin manifests typically in diagnostic character suites rather than individual aut-/synapomorphies.) The exception is subgenus Drosanthemum, which is apparently diphyletic — this term is meant literally, not just because its members form two molecular clades. Furthermore, although not visible on the tree, Clade IV / Vespertina may well be evolved from Clade III / Drosanthemum. The III-IV grade represents a monophyletic group, and Clade III may be paraphyletic.

At this point, you may be thinking: Guido has lost it; but bear with me.

Ancestral and derived haplotypes

The point is, I know our data. When we look at the sequence patterns in the gene regions, we can readily see that Clade III (subgenus Drosanthemum) and IV (subgenus Speciosa) likely had a (genetic) common ancestor different of that from the more evolved clades I (subgenus Drosanthemum) and II, and hence the high support and increased branch length of the corresponding branches — I + II and III + IV could well be reciprocally monophyletic. Realizing that clades III and IV are part of the same evolutionary lineage, we can take a closer look at them using, for example, Median networks. Parsimony is generally inferior to probabilistic methods when dealing with (mostly) neutral but stochastic mutation patterns. However, since we are very close to the coalface of evolution, we are dealing with rather minute changes — too minute for ML to make a well-informed call (also, there is no ML counterpart to haplotype networks).

To not miss something in our data (or overweight indels and linked mutation patterns), we do not just use the nucleotide sequences but we tabulate and code each mutational pattern — simple ones like single-nucleotide polymorphisms (SNPs) and duplications, but also complex ones, sequence patterns in length-polymorphic, sequentially diverse parts. The next figure shows an example:

"Export" refers to the unaltered export of (parsimony-informative) variable sites of the aligned nucleotide matrix; "recoded" the correction for excess mutations (when treating gaps as 5th base) in order to ensure the coding matches the number of steps in the theory of Median networks (see Liede-Schumann et al., supplement file S2).

Now, we are operating above the species level, which is outside the comfort zone of Median networks, which were originally designed for investigating within-species population structure. We are dealing with signals from phylogenetically sorting (eg. evolution of complex sequence patterns, see example above) mixed with (partly) convergent/ homoiologous patterns (eg. duplications, which are very rarely lineage-specific in plant plastomes). The resulting Median networks are quite complex, as shown next.

The output from NETWORK for Clade III + IV and the trnG-trnS intergenic spacer. In total, the matrix codes for 14 mutational patterns (10 SNPs, 3 indels, 1 length-polymorphic region involving a SNP; see sheet Clade3&4.trnGS in f_Haplotyping.xlsx in folder 1_main_data_and_results the online supporting archive @ DataDryad); the red edge numbers indicate which pattern changed, the bubble colors refer to the group: Cyan, Subgrade IIIa; blue, Subclade IIIb; yellow, Clade IV (note: Clade III and IV differ from other major clades by uniquely shared patterns)

One option would be to weight the characters. However, it is pretty difficult to come up with a weighting scheme given that we deal with very different mutation patterns, which include everything from simple transitions to reorganization of length-polymorphic regions. When looking at the SNPs, AG transitions appear to be more probable than AC transversions, but some AG transitions are highly diagnostic for clades, while some AC transversions seem random. Instead of getting lost in weighting (and self-enforcing bias), we compare them across the four gene regions by collapsing haplotype groups and their (diffuse) subnets, as shown next.

'Condensed' Median networks for the Clade III + IV lineage, parts of the graphs collecting sequentially similiar members of one group are replaced by bubbles (cf. Liede-Schumann et al., fig. 6).

Note that, in contrast to traditional haplotype networks, the bubbles in the figure don't represent the number of accessions of the haplotype, but instead are the sequential diversity of the collapsed haplotype group. From the graphs superimposed on the background of the combined tree, it is straightforward to see which haplotype maybe ancestral within a lineage and which haplotype is derived and how they relate to each other.

Paraphyletic "clades"

We now know how the haplotypes of each covered gene region are related to each other, and which species have substantially derived sequences, and which species have putatively ancestral sequences. Using the networks and by comparison with the sequence patterns in the sister group(s), we could even reconstruct an hypothetical haplotype of the common ancestor. But just by comparing the median networks for each gene regions with the corresponding subtrees in the combined tree we can (try to) interpret our clades and grades as monophyletic or paraphyletic.

Fig. 5 from Liede-Schumann et al. (2019) showing the 'condensed' Median networks for Clade I/ Drosanthemum (s.str.)

Members of Subclade Ib, the subtree with the worst support within Clade I / Drosanthemum (s.str.), may represent the survivors of the initial radiation, and hence are a paraphyletic group. They are resolved as a clade in the tree because of the signal from the trnS-trnG region producing a clear split between the three groups. However, this is also the most-conserved gene region, and when compared with the mutational patterns in the other clades (especially the sister clade, Clade II), it would not be far fetched to conclude that the trnS-trnG haplotype B is the original haplotype of the entire lineage.

The distinctive feature of Subclade Ib in the trnS-trnG is a complex duplication pattern not found in the otherwise genetically more coherent subclades Ia and Ib, as shown next.

This looks like a simple evolutionary sequence, with Clade Ia and Ic having retained the original pattern, with the complex pattern being a derived, clade-unique feature of Clade Ib (an autapormorphy for the corresponding monophyletic group).

But when we add the patterns of Clade II, the reciprocally monophyletic sister clade, it's not that simple anymore, as shown next.

Why one should be careful with gap-coding: even complex plastid duplication patterns evolve in parallel (or convergently). No matter whether X-Y or X-Y'-X-Y is the ancestral pattern, we have one/two convergent mutations in parts of Clades I and II; either duplication of X and insertion of Y' or (subsequent) deletion of X-Y'.

Realizing that a few clades in our tree may be paraphyletic gives us a new edge on our data and phylogenetic framework that can be further elaborated. Because they directly point towards a first, quick radiation that predates the formation of the monophyletic molecular clades (this is only a tautonym in cladistics, not in phylogenetics) — the members of paraphyletic molecular clades are genetically distinct (long terminal branches, typically low and/or ambiguous support for the clade root) or little evolved survivors (short root and terminal branches, but relatively high root branch support).

Furthermore, we can now see why some species act a bit roguish, are difficult to resolve, or inflict internal data conflict.

'Condensed' Medium networks for the sister clades V and VI (modified after Liede-Schumann et al., fig. 7)

Drosanthemum gracillimum is the only species our tree that doesn't resolve as a member of one of the two main (definitely monophyletic) subclades within Clade V: Subclade Va / Speciosa and Vb / Ossicula, genetically close but morphologically distinct sisters. We had no material of this species for our analysis, and instead used available GenBank data (out of curiosity). Its trnS-trnG and rps16-trnQ haplotypes are unique but rather ancestral within Clade V, and hence the tree cannot resolve where to place it.

Another example for how ancestry of sequences contribute to topological conflict or ambiguity in intrageneric phylogenies, but also illustrating the limitation of our approach, is one individual of D. striatum. It's the only member of Subclade Vb / Ossicula with a Subclade Va / Speciosa-type rps16-trnQ. With respect to my last blog post, the simplest explanation is that it just retained a less derived rps16-trnQ haplotype. However, this spacer includes a high-divergent, genotaxomomically valuable region that we had to exclude from all analyses (but included in our spreadsheet haplotyping.xlsx). In this, it shows a very unique, complex, apparently derived pattern shared with a few other members of the sister Clade Va. Maybe there was some reticulation and plastome-recombination at work here (contamination can be ruled out, as the material was processed twice).

Just try it with your own data

We cannot all afford perfect, often seemingly trivial, NGS / phylogenomic data. Combined trees can inform us about groups sharing a likely (mostly inclusive) common origin, such as molecular clades with fair support and distinctly long root branches and/or shared unique morphologies (ie. "monophyla" in a strict Hennigian sense). Clade-restricted haplotype networks can help us to understand the molecular evolution in these groups, free from the assumption of dichotomy and time equality.

By definition, all tip sequences represent the same time (today) in a tree, so they can only be sisters not ancestors and descendants. In reality, when we approach the coalface, we have some sequence patterns or actual sequences that are ancestral to others, because the species carrying them didn't evolve as much and as fast as their sibling(s). At some time, different parts evolved at different speeds within one lineage (see the examples above).

The networks hence fill a gap that the tree can't possibly resolve. They allow to understand why the tree may make more sense in certain parts than in others; and where it is probably 100% reliable and where we may want to have a closer look. Furthermore, only the networks can tell us if there is some real conflict in the data: different gene regions reflecting different histories.


As a careful reader ,you may have noticed that we skipped the ITS sequence entirely. The reason is shown in the following two graphs.

The first one shows a statistical parsimony network of all of our ITS data compiled for the species included in the plastid combined tree.

A statistical parsimony network based on the ITS data. Colors give the main cp clades (see Liede-Schumann et al., Supplement file S3)

The network approaches a spider-web, as shown above. The reason for this is that there are only a limited number of ITS positions where Drosanthemum fixes mutations (notably nearly exclusively SNPs, with no length-polymorphism). So, the genus is likely a young one, much like its sister clade the Ruschideae, which also mostly shows randomly distribution ITS mutation patterns.

Inferring an ITS tree is possible but useless, in that the data don't provide a clear signal. Furthermore, when we map the observed mutational patterns onto the plastid tree, we see a lot of messing up towards the leaves; but, in principle, it's all just sorting along the shared coalescent. We can identify those ITS mutation that a (plastid-)clade specific and lineage-diagnostic, including ITS-"synapormophies" for plastid-inferred clades that are likely monophyletic (being correlated to a distinct morphology and supported by derived, uniquely shared sequence patterns).

ITS genotypes mapped on the plastid tree, pointing to a largely congruent history with incomplete (ITS) lineage sorting. CU = clade-unique sequence pattern; Sh = shared, not unique, sequence pattern.
This opens the door to quickly screen for individuals / species that don't fall in line of the coalescent but are the product of (deep) reticulation (either using bulk sequencing and NGS genotyping or traditional cheap methods such as PCR-RFLP).