Distance matrices offer many avenues for exploring data. A common method is Principal Component Analysis (PCA). A much less common method is the use of Neighbour-nets. We have previously compared PCA and Neighbour-nets using theoretical data. In this post, I'll compare a PCA graph and the corresponding Neighbour-net using some empirical data.
Genetic differentiation in Mediterranean oaks
In the paper by Vitelli et al. (2017), we explored the phylogeographic structuring of a group of Mediterranean oak species. The species represented the westernmost populations of one of the main Eurasian oak lineages: the evergreen Quercus section Ilex ("Ilex oaks"; see Denk et al. 2017 for an up-to-date classification of oaks; see also this figshare-spread-sheet). It was a follow-up study to the one by Simeone et al. (2016).
We found that the most widespread species, Quercus ilex, carries plastids of quite different origins. The 2016 paper identified three main plastid haplotypes in the Ilex oaks: the unique (within the entire genus) "Euro-Med" haplotype; the "Cerris-Ilex" haplotype shared with western Eurasian members of (essentially deciduous) section Cerris, the sister clade of section Ilex (see Denk & Grimm 2010; confirmed by NGS SNP data, Hipp et al. 2015); and the "WAHEA" haplotype, an east-bound haplotype of section Ilex. Vitelli et al. aimed to characterise the range of these three main haplotypes throughout the four Ilex oak species found in the Mediterranean.
Figure 1 shows the two multivariate data analyses, along with a map of the sample locations.
|Fig. 1 Phylogeographic structure of Quercus section Ilex around the Mediterranean (after Vitelli et al. 2017). a. PCA graph, and b. Neighbour-net based on the same inter-haplotype pairwise distance matrix. c. A map depicting the distribution of main haplotype groups labelled by Roman numerals: I haplotypes of the "WAHEA" lineage, II "Cerris-Ilex"-lineage, III–VI, subtypes of the "Euro-Med" lineage (cf. Simeone et al. 2016, fig. 1)|
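For readers who want to try the PCA side of such a comparison themselves: an ordination computed directly from a pairwise distance matrix is commonly done as principal coordinates analysis (classical multidimensional scaling). The sketch below assumes this variant; whether it matches the exact procedure behind Fig. 1a is not stated here. The function name `classical_mds` and the four-taxon toy matrix are made up for illustration and are not the oak data.

```python
import numpy as np


def classical_mds(dist, n_axes=2):
    """Principal coordinates (classical MDS) of a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower's transformation).
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    # eigh returns eigenvalues in ascending order, so re-sort descending.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues signal non-Euclidean distances; clip them to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals


# Hypothetical toy distances (corners of a 3 x 4 rectangle), NOT the oak data.
labels = ["A", "B", "C", "D"]
toy = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])
coords, eigvals = classical_mds(toy)
for label, (x, y) in zip(labels, coords):
    print(f"{label}: axis1={x:.3f} axis2={y:.3f}")
```

Because the toy distances are exactly Euclidean in two dimensions, the two axes reproduce them without loss. Real genetic distance matrices rarely behave this well, which is one reason to look at the same matrix with a second method.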
Regarding the overall diversification pattern, the PCA graph and the Neighbour-net show similar things. The "Euro-Med" lineage is the most diverse group, with four subgroups — two larger (and widespread) ones (haplotypes IV, V) and two rare ones (III, VI) only found in the Aegean region.
- According to the PCA, haplotype III (colored olive) is intermediate between "Euro-Med" IV (blue) and haplotype II (yellow), which represents another lineage of oak haplotypes, the Aegean/Northern Turkish "Cerris-Ilex" lineage. The same can be seen in the Neighbour-net.
- The PCA further places haplotype VI (red) as equidistant to all of the other types, with IV and I (green; representing the oriental "WAHEA" lineage) being a bit closer. In the Neighbour-net, we can sum up the lengths of the connecting edge bundles to find the same pattern. A difference between the two analyses is that, in the Neighbour-net, VI is connected only with part of V (purple) by a pronounced edge bundle, and is not connected to I (green). This is strikingly different from III, which shares an edge bundle with II and IV+V.
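Summing up edge bundles has a simple arithmetic reading: in a splits graph, each bundle of parallel edges stands for one split (a bipartition of the taxa) with a weight, and the distance displayed between two taxa is the sum of the weights of all splits that separate them. Here is a minimal sketch of that sum. The taxa `a` to `d`, the splits and their weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def split_distance(x, y, splits):
    """Sum the weights of all splits that separate taxon x from taxon y.

    Each split is given as (one side of the bipartition, weight); a split
    separates x and y when exactly one of them lies on the listed side.
    """
    return sum(weight for side, weight in splits if (x in side) != (y in side))


# Hypothetical weighted splits over the taxa a, b, c, d (illustration only).
# {a, b} and {b, c} are incompatible, so together they form a box, not a tree.
toy_splits = [
    (frozenset({"a"}), 0.5),
    (frozenset({"b"}), 0.25),
    (frozenset({"a", "b"}), 0.125),
    (frozenset({"b", "c"}), 0.0625),
]

print(split_distance("a", "b", toy_splits))  # 0.5 + 0.25 + 0.0625 = 0.8125
print(split_distance("a", "d", toy_splits))  # 0.5 + 0.125 = 0.625
print(split_distance("c", "d", toy_splits))  # 0.0625
```

Two taxa that share a pronounced edge bundle sit on the same side of a heavy split, so that weight never enters the distance between them, while it is added to the distance of each of them to everything on the other side.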
At this point in the analysis, we can exploit the fact that a Neighbour-net acts both as a distance-based 2-dimensional graph and as a meta-phylogenetic network (Fig. 2). Based on the PCA, which also is a 2-dimensional depiction of the differentiation, one may be tempted to interpret VI as a bridge between IV/V and I, not much different from how III bridges between II and IV (Fig. 1). On the other hand, the network (Figs 1, 2) informs us that VI is a likely relative of V, which in turn is a likely relative of IV; and the only connection between I and VI is their increasing distinctness from the other haplotypes of the "Euro-Med" lineage, III/IV/V.
|Fig. 2 The main splits expressed in the Neighbour-net. III may either be sister to II, or be part of a clade comprising IV and V.|
Using the main split patterns in the Neighbour-net, we can infer the one phylogenetic hypothesis, a tree, that can accommodate them all (Fig. 3).
|Fig. 3 The tree solution congruent with the major split patterns (Fig. 2).|
I rejected the alternative sister relationship between II and III because this would imply a sister clade that only includes IV, V and VI but not III, which clashes with the affinity of III to IV and V (Fig. 2). Interpreting III as a sister of IV and V explains its affinity both to II (putative sister lineage to III–VI) and to IV and V.
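The reason a choice has to be made at all is that the two placements of III correspond to incompatible splits: two splits fit on the same tree only if at least one of the four intersections of their sides is empty. The sketch below checks this for the two alternatives named in the caption of Fig. 2. Encoding the second alternative as the taxon set {III, IV, V} is my literal reading of that caption and is an assumption; the function name `compatible` is likewise just illustrative.

```python
def compatible(side_a, side_b, taxa):
    """Return True if two splits can be displayed on the same tree.

    Each split is given by one of its sides; the other side is the
    complement within `taxa`. Two splits are compatible if at least one
    of the four pairwise intersections of their sides is empty.
    """
    a, b = set(side_a), set(side_b)
    rest_a, rest_b = taxa - a, taxa - b
    return any(not (x & y) for x in (a, rest_a) for y in (b, rest_b))


taxa = {"I", "II", "III", "IV", "V", "VI"}

# The two alternatives from the Fig. 2 caption (taxon sets are an assumption).
iii_sister_to_ii = {"II", "III"}
iii_with_iv_v = {"III", "IV", "V"}

print(compatible(iii_sister_to_ii, iii_with_iv_v, taxa))   # False
print(compatible({"IV", "V"}, iii_with_iv_v, taxa))         # True (nested)
```

The first pair fails because III sits on the listed side of both splits while II, IV+V, and I+VI fill the other three intersections. The same holds if the second split is read as including VI, so the conclusion does not hinge on that detail.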
We might accept that all three plastome lineages are reciprocally monophyletic (in a quite broad sense), meaning that each lineage evolved from a pool of closely related mother plants. If so, then the higher similarity between III ("Euro-Med") and II ("Cerris-Ilex") may represent a relative lack of derivation, whereas the dissimilarity of VI ("Euro-Med") and I ("WAHEA") to all other types can be due to a higher level of distinctness. And we can come up with a "cactus"-type metaphorical tree (Fig. 4) explaining the Neighbour-net (and PCA graph).
|Fig. 4 A "cactus"-type tree metaphor for the evolution of oak plastomes (based on the results of Simeone et al. 2016, Vitelli et al. 2017, and – outside the focus group, i.e. Mediterranean oaks of Subgenus Cerris – some partly arcane, not yet published knowledge that I have access to)|
There's no reason to stop with a PCA
One empirical example is far from being conclusive, but it shows what Neighbour-nets have to offer.
Trees are fine for proposing phylogenetic hypotheses, but we should always be aware of equally valid alternatives to the tree that we have optimized. And with increasing numbers of taxa, inferring optimal trees and assessing their alternatives require increasing effort and checking. For many questions, PCA has been used as a quick alternative, including in large-sample genetic studies (see Continued misuse of PCA in genomics studies).
Neighbour-nets are just a natural step further towards a phylogeny, one that comes with very little extra effort and can use the same data basis: a matrix of pairwise distances. In the case of genetic data, which usually reflect at least the main aspects of the actual phylogeny behind them (the "true tree", trivial or complex), they should be obligatory. They are much more than just a clustering approach (even though their algorithm is based on a cluster algorithm) or a bivariate analysis. Neighbour-nets are meta-phylogenetic networks that have the capacity to contain the one or many topologies explaining the data. When it comes to recognising "natural" (coherent and equal) groups, they are as straightforward as PCA, in contrast to phylogenetic trees.
I would have liked to add some more examples with non-genetic data, that is, data sets where the distances are not the result of an explicit phylogenetic process. But this requires much more effort, since none of the PCA studies I browsed had documented the distance data or matrix they used. However, I'm sure that inferring a Neighbour-net from whatever similarity data were used for a PCA can be a fruitful and revealing endeavour (and this is the reason why you find Neighbour-nets based on U.S. gun legislation, breast sizes, languages, cryptocurrencies, etc. on this blog, but few PCAs). So, try it out the next time you make a PCA, and share the results, e.g. by using our comment option or even a post as guest-blogger.
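If you want to try this with a matrix you already have for a PCA, the main practical step is getting the distances into a file that a network program can read. The sketch below writes a NEXUS file with a taxa block and a distances block. The block layout follows my understanding of what SplitsTree accepts and should be checked against the program's documentation; the function name `to_nexus_distances`, the taxon labels and the numbers are hypothetical. Labels are assumed to contain no whitespace or punctuation.

```python
def to_nexus_distances(labels, dist):
    """Format a square distance matrix as NEXUS taxa + distances blocks."""
    n = len(labels)
    if any(len(row) != n for row in dist) or len(dist) != n:
        raise ValueError("distance matrix must be square and match the labels")
    lines = [
        "#NEXUS",
        "",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={n};",
        "  TAXLABELS " + " ".join(labels) + ";",
        "END;",
        "",
        "BEGIN distances;",
        f"  DIMENSIONS ntax={n};",
        # Assumed format line: full matrix, diagonal included, labels first.
        "  FORMAT triangle=both diagonal labels=left;",
        "  MATRIX",
    ]
    for label, row in zip(labels, dist):
        lines.append("    " + label + " " + " ".join(f"{v:.4f}" for v in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical three-taxon example, not taken from any of the cited studies.
toy_labels = ["TaxonA", "TaxonB", "TaxonC"]
toy_dist = [
    [0.0, 0.1, 0.3],
    [0.1, 0.0, 0.25],
    [0.3, 0.25, 0.0],
]
nexus_text = to_nexus_distances(toy_labels, toy_dist)
print(nexus_text)
```

The returned string can be saved to a file (for instance with `open("toy.nex", "w")`) and opened in the network program, where the Neighbour-net is then computed from the distances.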
Don't miss these earlier posts on similar topics:
- Distortions and artifacts in Principal Components Analysis for analysis of genome data
- Networks can outperform PCA ordinations in phylogenetic analysis
- Network map of the Ukraine
Also, this paper introduces Neighbour-nets to the wider audience of multivariate data analyses:
- Morrison, D.A. (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296–312.
Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366.
Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. 2017. An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, Sancho-Knapik D, eds. Oaks Physiological Ecology. Heidelberg, New York: Springer, pp. 13–38. Free Pre-Print at bioRxiv [major change: Ponticae and Virentes accepted as additional sections in final version]
Hipp AL, Manos P, McVay JD, ... , Avishai M, Simeone MC. 2015 [abstract]. A phylogeny of the World's oaks. Botany 2015. Edmonton.
Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. 2016. Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4: e1897 [open access, comments/questions welcomed]
Vitelli M, Vessella F, Cardoni S, Pollegioni P, Denk T, Grimm GW, Simeone MC. 2017. Phylogeographic structuring of plastome diversity in Mediterranean oaks (Quercus Group Ilex, Fagaceae). Tree Genetics and Genomes 13:3.