The Genealogical World of Phylogenetic Networks: Reticulation at its best

One particular case where networks turn out to be a versatile tool is the study of low-level evolutionary patterns. This is especially so when we leave the comfort zone of well-sorted molecular markers, and use more than a single individual per species. Our recently published data set on (mostly Mediterranean) oaks, provides a nice example of this.

Why so few people study oaks at the intra-generic level

Oaks are notoriously difficult to study because they don't bother too much about species boundaries (which can be more or less obvious) and – at one point – decided to not sort their plastids at all (and full plastomes, as I once saw for myself first-hand, won't help). Hence, all reasonable phylogenetic reconstructions of oak evolution have been based on genetic data from the nucleome. However, this imposes a new problem — the sequenced nuclear gene regions allow the recognition of the major lineages (which recently have been formalized), but the closer one comes to the species level the more difficult it is to resolve anything at all.

Even the famous ITS region, which includes the weakly constrained internal transcribed spacer ITS1, and the structurally quite constrained ITS2, and have been frequently advocated as plant barcodes, turns out to be a two-edged sword. Relationships between the major intra-generic lineages is relatively clear, the ITS is pretty divergent down to the species level, but at the individual level, one faces a intra-genomic divergence that often outmatches inter-species differentiation.

In some groups, like the most speciose and most widespread white oaks (sect. Quercus), identical ITS variants exist from individuals / species separated today by thousands of kilometers of ocean or icy wasteland. One possible explanation is that oaks have very large population sizes, and they are wind-pollinated, so that they have a high capacity to permanently homogenize their genepools. Plastids, on the other hand, are only transmitted via the large fruit, the acorns, and the main animal vector for distributing acorns, the jaybirds, are sedentary birds. Their backup-vector, the squirrels are known to hoard a lot of acorns in a single place, but not for migrating globally (unless we assist them).

Nonetheless, we readily notice that the intra-individual differentiation patterns appear not to be entirely random, and so in our study we moved to another nuclear multi-copy spacer known to be more variable than the ITS1 and ITS2 (hence, largely ignored by molecular phylogeneticists) — the 5S intergenic spacer (5S IGS). It didn't help too much for solving the white oak puzzle (in western Eurasia), but did give us new insights into the two other western Eurasian sections: Ilex and Cerris.

The 'host-associate' framework

A cloned 5S-IGS (or ITS) sequence is not a good OTU, because we are usually not interested in a clone phylogeny (a mere sequence genealogy), but in the phylogenetic relationships between the individuals or species carrying the cloned sequence variants: the nuclear spacer population. Even networks struggle with such data, and my colleague Markus Göker came up with the idea to treat this in the form of hosts, the individuals, and associates, the cloned sequences found in the individual (Göker & Grimm, BMC Evol. Biol. 2008 — open access). There are several options to transfer the primary clone (associate) data into individual (host) data.

Options that we tested for transferring associate data into host data.
CM = character matrix, DM = distance matrix. CMhosts, independent used were morphological matrices. ENT — entropy, FRQ — frequence, CON — strict consensus, MOD — modal consensus, and SIZ — sample size, are character transformations implemented in Markus' g2cef, PBC and MIN are distance transformations implemented in pbc (these and other little helper programmes can be found here).

Using three cloned (ITS) datasets, we found that for these data the "Phylogenetic Bray-Curtis" (PBC — see the next figure) distance transformation outperforms the other tested options.

Computation of the "Phylogenetic Bray-Curtis" distance. It's a modification of the Bray-Curtis dissimilarity using the minimum distance for each covered row/column instead absence/presence. H1/H2 = hosts with different sets of associates (A1–A6)

Incidental but interesting insights

Whenever I come into contact with such data I advise the use of the PBC distance transformation as the basis for the main individual-level network, but also to run the MIN distance transformation: MIN will just calculate the minimum inter-clone distance between the clone samples of two individuals, and use this as the inter-individual distance.

Neighbour-net using the MIN transformation

The MIN network (above) is quite bushy for these data, because we naturally have many shared 5S-IGS variants among individuals of the same species, but occasionally also shared by individuals of different species. Nonetheless, it visualizes some basic differentiation patterns in the clone sample: compare e.g. the coherent cluster 3, the crenata-suber lineage (the 'Cork Oaks') — all individuals share a pair of very similar to identical 5S-IGS clones; and the divergent cluster 4, the 'Vallonea' oaks — all individuals have different sets of clones, but uniuqe 5S-IGS variants separating them from all other Cerris oaks (long proximal edge bundle).

Furthermore, we have potential F1 hybrids (morphologically intermediate) in our sample, and such hybrids, e.g. tj08, should have very low (to zero) MIN distances with members of their parental lineages.

However, the PBC network (below) is as beautiful as it gets — I really love this transformation, as it always comes up with something usable and interpretable.

Neighbor-net based on PBC-transformed inter-individual distances. See Simeone et al. (PeerJ PrePrints 2018 — open access pre-print) for a discussion.

However, this network was a last minute addition, because a happy little "accident" happened along the way, and the networks we were working with and looking at while drafting the paper where not PBC networks, as I thought.

It happened this way. Also implemented in Markus' little helper program are AVG, the average inter-clone distance, and MAX, the maximum inter-clone distance. AVG and MAX don't result in a proper distance matrix, because the diagonal will be the average or maximum distance between the clones of a single individual, and not all-zero as it should be (for MIN it's always zero). [We discussed a few options to modify AVG and MAX to ensure a zero diagonal, but couldn't devise something that makes sense.]

However, the SplitsTree program didn't bother about an all-zero diagonal, so the AVG and MAX transformed distance matrices will produce a Neighbor-net. So, what I assumed were PBC networks were in fact AVG networks.

Neighbor-net based on AVG-transformed inter-individual distances.

It took me quite long to recognize this "error" because, in contrast to the AVG (and MAX) networks I looked at when we did the 2008 paper, the one for the oaks made a lot of sense. Notably, the suspected F1 hybrids were perfectly resolved spanning up according boxes, and the species aggregates (clusters) did make sense regarding the general geographic setting, the history of the region under study, and their morphology.

Same graph as above, highlighting known or potential F1-hybrids spanning up according boxes.

For these data (with a minimum of four clones available per individual, individuals covering all species, and including the entire range of the section in western Eurasia), the AVG network better shows the potential F1 hybrids (or introgrades) than the (more methodologically sophisticated) PBC network. However, the latter makes more sense regarding speciation processes and the history of the group (because, the distance is a "phylogenetic" version of the well-known Bray-Curtis distances).

A "cactus-oak" fusion graph depicting nuclear and plastid differentiation (and evolution) in Quercus Group Cerris.

Take-home message

First, it's always good to delegate work you can do by heart to somebody new to it! This forces its propagation, which is important. More importantly, though, one has ones preferences and established analysis pipelines, and they may have become restricted in scope. I mainly used the -a (AVG), -i (MIN) and -x (MAX transformation) options in the little helper program to quickly summarize some of the primary differentiation data — for example, individuals have identical clones (MIN = 0), intra-individual divergence may be higher or not than inter-individual (MAX intra-individual > MIN inter-individual), and individuals may have strongly divergent clones (high MAX). AVG was computed and tabulated but never cherished by me. I always looked at the MIN transformed networks, since this provides a valid distance matrix, but then ignored them. But I never again tried to infer a Neighbor-net based on AVG or MAX transformations after our 2008 paper.

Second, Neighbor-nets are so quick to infer that there is no resource- or logic-related reason to not just run whatever distance one has on hand or can easily establish. Maybe even the biologically less-sound will reveal some interesting aspect (there are a lot of biological arguments that can be put forward for dismissing AVG distances in favour of PBC distances)

Paper (pre-print) and open data
Simeone MC, Cardoni S, Piredda R, Imperatori F, Avishai M, Grimm GW, Denk T. 2018. Comparative systematics and phylogeography of Quercus Section Cerris in western Eurasia: inferences from plastid and nuclear DNA variation. PeerJ Preprints 6: e26995v1.
Primary data and analysis files are included in the Online Supplemantary Archive: Simeone et al., PeerJ Preprints, doi: 10.7287/peerj.preprints.26995v1/supp-4. (See Readme.txt included in the topfolder of the archive.)

Monday, July 2, 2018

Reticulation at its best — an example from the oaks

No comments:

Post a Comment