Monday, April 8, 2019

Next-generation neighbor-nets

Neighbor-nets are a most versatile tool for exploratory data analysis (EDA). Next-generation sequencing (NGS) allows us to tap into an unprecedented wealth of information that can be used for phylogenetics. Hence, it is a natural step to combine the two.

I have been waiting for it (actively-passively) and the time has now come. Getting NGS data has become cheaper and easier, but one still needs considerable resources and fresh material. Hence, NGS papers usually not only use a lot of data, but are also many-authored. You can now find neighbor-nets based on phylogenomic pairwise distances computed from NGS data, for example in these two recently published open access pre-prints:
  • Pérez Escobar OA, Bogarín D, Schley R, Bateman R, Gerlach G, Harpke D, Brassac J, Fernández-Mazuecos M, Dodsworth S, Hagsater E, Gottschling M, Blattner F. 2018. Resolving relationships in an exceedingly young orchid lineage using Genotyping-by-sequencing data. PeerJ Preprints 6:e27296v1
  • Hipp AL, Manos PS, Hahn M, Avishai M, Bodénès C, Cavender-Bares J, Crowl A, Deng M, Denk T, Fitz-Gibbon S, Gailing O, González Elizondo MS, González Rodríguez A, Grimm GW, Jiang X-L, Kremer A, Lesur I, McVay JD, Plomion C, Rodríguez-Correa H, Schulze E-D, Simeone MC, Sork VL, Valencia Avalos S. 2019. Genomic landscape of the global oak phylogeny. bioRxiv DOI:10.1101/587253.

Example 1: A young species aggregate of orchids

Pérez Escobar et al.'s neighbor-nets are based on uncorrected p-distances inferred from a matrix including 13,000 GBS ("genotyping-by-sequencing") loci (see the short introduction for the method on Wikipedia, or the comprehensive PDF from a talk at/by researchers of Cornell) covering 29 accessions of six orchid species and subspecies.
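For readers unfamiliar with the term, an uncorrected p-distance is simply the proportion of aligned sites at which two sequences differ. The following is a minimal sketch, not the pipeline used in the preprint: the function names, the toy alignment, and the choice to skip sites with missing data (pairwise deletion) are assumptions made for illustration. Ambiguity codes are treated as ordinary differing characters here, which is a simplification.

```python
from itertools import combinations

MISSING = set("N?-")  # assumed symbols for missing data and gaps


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites scored in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (equal length)")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # pairwise deletion: ignore sites missing in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        return float("nan")  # no shared sites, distance undefined
    return differing / compared


def p_distance_matrix(alignment):
    """alignment: dict name -> aligned sequence; returns a symmetric dict of distances."""
    dist = {}
    for a, b in combinations(sorted(alignment), 2):
        d = p_distance(alignment[a], alignment[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Hypothetical toy alignment, three accessions and eight sites
toy = {"acc1": "ACGTACGT", "acc2": "ACGTACGA", "acc3": "ACNTTCGA"}
matrix = p_distance_matrix(toy)
```

A matrix like this (written out in a format the network software can read) is all a neighbor-net needs as input.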

They also inferred maximum likelihood trees, and did a coalescent analysis to consider possible tree-incompatible signal and gene-tree incongruence due to potential reticulation and incomplete lineage sorting. They applied the neighbor-net to their data because "split graphs are considered more suitable than phylograms or ultrametric trees to represent evolutionary histories that are still subject to reticulation (Rutherford et al., 2018)", which is true, although neighbor-nets do not explicitly show a reticulate history.

Here's a fused image of the ML trees (their fig. 1) and the corresponding neighbor-nets (their fig. 2):

Not so "phenetic": NGS data neighbor-nets (NNet) show essentially the same as ML trees: the distance matrices reflect putative common origin(s) as much as the ML phylograms. The numbers at branches and edges show bootstrap support under ML and the NNet optimization.

Groups resolved as clades (Groups I and III) or as grades or clades (Group II; compare A vs. B and C) in the ML trees form simple (relating to one edge-bundle) or more complex (defined by two partly compatible edge-bundles; Group I in A) neighborhoods in the neighbor-net splits graphs. Because we are looking at closely related biological units, their evolution likely did not follow a simple dichotomizing tree; hence the ambiguous branch support (left) and competing edge support (right) for some of the groups. Furthermore, each part of a genome will be more discriminative for some aspects of the coalescent and less for others, which is another source of topological ambiguity (ambiguous BS support) and incompatible signal (as seen in and handled by the neighbor-nets). The reconstructions under A, B and C differ in the breadth and gappiness of the included data (all NGS analyses involve data filtering steps): A includes only loci covered for all taxa, B includes all loci with less than 50% missing data, and C all loci with at least 15% coverage.
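The three filtering regimes (A, B, C) boil down to a coverage threshold per locus. Here is a minimal sketch of such a filter; the function name, the data layout (a dict mapping each locus to the set of taxa in which it was recovered) and the toy values are assumptions, and the exact threshold definitions used in the preprint may differ in detail (for instance strict versus non-strict inequality).

```python
def filter_loci(presence, n_taxa, min_coverage):
    """Keep loci recovered in at least `min_coverage` (0..1) of the taxa.

    presence: dict locus -> set of taxa in which the locus was recovered.
    """
    return sorted(
        locus
        for locus, taxa in presence.items()
        if len(taxa) / n_taxa >= min_coverage
    )


# Hypothetical toy data: three loci, four taxa
presence = {
    "locus1": {"t1", "t2", "t3", "t4"},  # covered for all taxa
    "locus2": {"t1", "t2", "t3"},        # 75% coverage
    "locus3": {"t1"},                    # 25% coverage
}

complete = filter_loci(presence, 4, 1.0)   # roughly regime A
half = filter_loci(presence, 4, 0.5)       # roughly regime B
relaxed = filter_loci(presence, 4, 0.15)   # roughly regime C
```

Relaxing the threshold adds loci (more signal) at the price of a gappier matrix, which is exactly the trade-off visible when comparing the three reconstructions.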

PS. I contacted the first author: the paper is still under review (four peers), a revision is (about to be) submitted, and, with a bit of luck, we'll see it in print soon.

Example 2: The oaks of the world

The Hipp et al. neighbor-net (note that I am an author) is based on model-based distances. The reason I opted (here) for model-based distances instead of uncorrected p-distances is the depth of our phylogeny: our data cover splits that go back to the Eocene, but many of the species found today are relatively young. The dated tree analyses show substantial shifts in diversification rates in the lineages that are diverse today, and possibly in the past (see the lines in the following graph); in those with few species (*, #), we may be looking at the left-overs of ancient radiations.
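Why depth matters can be illustrated with the simplest possible substitution model. The sketch below uses the Jukes-Cantor (JC69) correction, which is not the model used for the oak data (those distances came from the model optimized for the ML analysis); it is swapped in here only because it has a closed form. The function name is an assumption.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is undefined for p >= 0.75 (saturation)")
    # Expected substitutions per site, accounting for multiple hits at the same site
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction is negligible for shallow divergences and grows with depth
for p in (0.01, 0.10, 0.30):
    print(f"p = {p:.2f}  ->  corrected d = {jc69_distance(p):.4f}")
```

For closely related samples the corrected and uncorrected values are nearly identical, but the deeper the split, the more the uncorrected p-distance underestimates the actual divergence.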

A lineage(s)-through-time plot for the oaks (Hipp et al. 2019, fig. 2). Generic diversification probably started in the Eocene around 50 Ma, and between 10–5 Ma parts (usually a single sublineage) of these long-isolated intrageneric lineages (sections) underwent increased speciation.

The data basis is otherwise similar: SNPs (single-nucleotide polymorphisms) generated using a different NGS method, in our case RAD-tagging (RAD-seq), of c. 450 oak individuals covering the entire range of this common tree genus, the most diverse extra-tropical genus of the Northern Hemisphere. There are differences between GBS and RAD-seq SNP data sets: a rule of thumb is that the latter can provide more signal and SNPs, but the single-locus trees are usually less decisive, which can be a problem for coalescent methods and tests for reticulation and incomplete lineage sorting that require a lot of single-locus (or single-gene) trees (see the paper for a short introduction and discussion, and further references).

We also inferred an ML tree, and my leading co-authors did the other necessary and fancy analyses. Here, I will focus on the essential information needed to interpret the neighbor-net that we show (and why we included it at all).

Our fig. 6. Coloring of main lineages (oak sections) same as in the LTT plot. Bluish, the three sections traditionally included in the white oaks (s.l.); red, red oaks; purple, the golden-cup or 'intermediate' (between white and red) oaks — these three groups (five sections) form subgenus Quercus, which except for the "Roburoids" and one species of sect. Ponticae is restricted to the Americas. Yellow to green, the sections and main clades (in our and earlier ML trees) of the exclusively Eurasian subgenus Cerris.

Like Pérez Escobar et al., we noted a very good fit between the distance-matrix based neighbor-net and the optimised ML tree. Clades with high branch support and intra-clade coherence form distinct clusters, here distinct neighborhoods associated with certain edge bundles (thick colored lines). This tells us that the distance matrix is representative: it captures the primary phylogenetic signal that also informs the tree.

The first thing that we can infer from the network is that we have few missing-data issues in our data. Distance-based methods are prone to missing-data artifacts, and RAD-seq data are (inevitably) rather gappy. It is important to keep in mind that neighbor-nets cannot replace tree analysis in the case of NGS data; they are "just" a tool to explore the overall signal in the matrix. If the network has neighborhoods contrasting with what can be seen in the tree, this can be an indication that one's data are not sufficiently tree-like at all. But it can also just mean that the data are not sufficient to get a representative distance matrix.

Did you notice the little isolated blue dot (Q. lobata)? This is such a case — it has nothing to do with reticulation between the blue and the yellow edges, it's just that the available data don't produce an equally discriminative distance pattern: according to its pairwise distances, this sample is generally much closer to all other oak individuals included in the matrix in contrast to the other members of its Dumosae clade, which are generally more similar to each other, and to the remainder of the white oaks (s.str., dark blue, and s.l., all bluish).

Close-up on the white oak s.str. neighborhood (sect. Quercus) and plot of the preferred dated tree.

In the tree it is hence placed as sister to all other members, and, being closer to the all-ancestor, it triggers a deep Dumosae crown age, c. 10 myr older than the subsequent radiation(s) and as old as the divergence of the rest of the white oaks s.str.

The second observation, which can assist in the interpretation of the ML tree (especially the dated one), is the principal structure (ordering) within each subgenus and section. The neighbor-net is a planar (i.e. 2-dimensional) graph, so the taxa will be put in a circular order. The algorithm essentially identifies the closest relative (which is a candidate for a direct sister, as in a tree) and the second-closest relative. Towards the leaves of the Tree of Life, this is usually a cousin or, in the case of reticulation, the intermixing lineage. Towards the roots, it can reflect the general level of derivation, the distance to the (hypothetical all-)ancestor.
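The consequence of the circular order is that a neighbor-net can only display splits that cut the circle into two contiguous arcs. The sketch below illustrates just this property, not the neighbor-net algorithm itself (the agglomeration and split-weight estimation steps are left out); the function names and the five-taxon ordering are made up for illustration.

```python
from itertools import combinations


def circular_splits(order):
    """All splits that cut the circular ordering into two contiguous arcs."""
    n = len(order)
    taxa = frozenset(order)
    splits = set()
    for i, j in combinations(range(n), 2):
        arc = frozenset(order[i:j])  # non-empty arc that never covers all taxa
        splits.add(frozenset((arc, taxa - arc)))
    return splits


def fits_ordering(group, order):
    """True if `group` occupies one contiguous arc of the circular ordering."""
    group = set(group)
    n = len(order)
    if not group or not group < set(order):
        return False  # empty, complete, or containing unknown taxa
    # Walk once around the circle and count member/non-member boundaries
    changes = sum(
        (order[k] in group) != (order[(k + 1) % n] in group) for k in range(n)
    )
    return changes == 2


order = ["A", "B", "C", "D", "E"]  # hypothetical circular ordering
n_splits = len(circular_splits(order))
```

With n taxa there are n(n-1)/2 such splits (including the trivial ones separating a single taxon), far more than the 2n-3 a tree can hold, which is why a taxon can be shown next to both its closest and its second-closest relative. A group such as {A, C} in the toy ordering cannot be displayed because B sits between them.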

Knowing the primary split (between the two subgenera), we can interpret the graph in terms of the general level of (phylogenetic) derivedness.

The overall least derived groups are placed to the left in each subgenus, and the most derived to the right. The reason is long-branch attraction (LBA) stepping in: the red and green groups are the most isolated/unique within their subgenera, and hence they attract each other. This is important to keep in mind when looking at the tree and judging whether (local) LBA may be an issue (parsimony and distance methods will always get the wrong tree in the Felsenstein Zone, but probabilistic methods have a 50% chance to escape). In our oak data, we are on the safe side. The red group (sect. Lobatae, the red oaks) is indeed resolved as the first-branching lineage within subgenus Quercus, but within subgenus Cerris it is the yellow group, sect. Cyclobalanopsis. If this were LBA, Cyclobalanopsis would need to be on the right side, next to the red oaks.

The third obvious pattern is the distinct form of each subgraph: we have neighborhoods with long, slim root trunks and others that look like broad fans.

Long, narrow trunks (i.e. distances show high intra-group coherence and high inter-group distinctness) can be expected for long-isolated lineages with small (founder) population sizes, e.g. lineages that underwent severe or repeated bottlenecks in the past. Unique genetic signatures will be quickly accumulated (increasing the overall distance to sister lineages), and extinction ensures that only one signature (or very similar ones) survives (low intragroup diversity until the final radiation).

Fans represent gradual, undisturbed accumulation of diversity over a long period of time, e.g. frequent radiation and formation of new species during range and niche expansion. In the absence of stable barriers, we get a very broad, rather unstructured fan like that of the white oaks (s.str.; blue); along a relatively narrow (today and likely in the past) geographic east-west corridor (here: the 'Himalayan corridor'), we get a more structured, elongated one, as in the case of section Ilex (olive).

Close-up on the sect. Ilex neighborhood, again with the tree plotted. In the tree, we see just sister clades; in the network, we see the strong correlation between geography and genetic diversity patterns, indicating a gradual expansion of the lineage towards the west till finally reaching the Mediterranean. Only a sophisticated, explicit ancestral area analysis could possibly arrive at a similar result (often without certainty), a result that is obvious when comparing the tree with the network.

This can go along with higher population sizes and/or more permeable species barriers, both of which will lead to lower intragroup diversity and less tree-compatible signals. Knowing that both section Quercus (white oaks s.str., blue) and Ilex (olive) evolved and started to radiate at about the same time, it's obvious from the structure of both fans that the (mostly and originally temperate) white oaks always produced more, but likely less stable, species than the mid-latitude (subtropical to temperate) Ilex oaks, which today span an arc from the Mediterranean via the southern flanks of the Himalayas into the mountains of China and the subtropics of Japan.

Networks can be used to understand, interpret and confirm aspects of the (dated) NGS tree.

The much older stem and young crown ages seen in dated trees may be indicative of bottlenecks, too. But since we typically use relaxed clock models, which allow for rate changes and rely on very few fixed points (e.g. fossil age constraints), we may get (too?) old stem and (much too) young crown ages, especially for poorly sampled groups or unrepresentative data. By looking at the neighbor-net, we can directly see that the relatively old crown ages for the lineages with (today) few species fit with their within-lineage and general distinctness.

The deepest splits: the tree mapped on the neighbor-net.

By mapping the tree onto the network, and thus directly comparing the tree to the network, we can see that different evolutionary processes may be considered to explain what we see in the data. It also shows us how much of our tree is (data-wise) trivial and where it could be worth taking a deeper look, e.g. applying coalescent networks, generating more data, or recruiting additional data. Last, but not least, it's quick to infer and makes pretty figures.

So, try it out with your NGS data, too.

PS. Model-based distances can be inferred with the same program many of us use to infer the ML tree: RAxML. We can hence use the same model assumptions for the neighbor-net that we optimized for inferring the tree and establishing branch support.
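To get from a list of pairwise distances to the square matrix that network software expects, a few lines of scripting suffice. This is a minimal sketch under a stated assumption: the distances are available as whitespace-separated triples (taxon A, taxon B, distance), one pair per line. Check this against your own output file before relying on it. The function name, the sample names and the numbers are made up for illustration.

```python
def read_pairwise_distances(lines):
    """Build a symmetric distance table from 'taxonA taxonB distance' lines."""
    names = []
    dist = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        a, b, d = fields[0], fields[1], float(fields[2])
        for name in (a, b):
            if name not in names:
                names.append(name)
        dist[(a, b)] = dist[(b, a)] = d
    # Square matrix in order of first appearance; the diagonal is zero.
    # A missing pair raises KeyError, which flags an incomplete input.
    matrix = [[0.0 if r == c else dist[(r, c)] for c in names] for r in names]
    return names, matrix


# Made-up example values, not taken from the oak data
example = """
Q_alba Q_robur 0.012
Q_alba Q_rubra 0.031
Q_robur Q_rubra 0.033
""".splitlines()

names, matrix = read_pairwise_distances(example)
```

The resulting names and matrix can then be written out in whatever distance format your network program reads.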
