Showing posts with label neighbour-net. Show all posts

Monday, August 6, 2018

Trivial data, but not so trivial graphs


One may expect that perfectly compatible, trivial data will lead to perfect trees that are trivial to interpret. And this may really be the case when phylogenetics is restricted to contemporary taxa and molecular data. Adding to various earlier posts that deal with data patterns and their representation in inference graphs (e.g. Networks can outperform PCA..., Stacking neighbour-nets..., Clades, cladograms, cladistics ... and networks ...), I will show in this post what we get when we deal with very trivial, straightforward to interpret, data.

Two trivial scenarios: a linear and a dichotomous evolutionary sequence

The virtual data matrix for our experiment comprises seven taxa (OTUs) from different time scales and six binary (Dollo) characters. There are two historical scenarios that are supported by patterns in the data (see the first figure).

The linear scenario has a mother taxon that evolves by acquiring a unique, persistent trait, and is replaced by its daughter taxon through time. In contrast, the dichotomous scenario has two subsequent events of cladogenesis: the all-ancestor A splits into two taxa (B, E), each defined by a unique change in a binary character passed on to their descendants. B and E then underwent a second cladogenetic event, giving rise to C+D and F+G.


The resultant data matrices have different properties. In the case of the linear evolution, all changes lead to synapomorphies sensu Hennig (characters #1–#5) along with one terminal autapomorphy of the latest member of the lineage, G (character #6).

In the case of the dichotomous evolution, we have two synapomorphies supporting the BCD and EFG clades (characters #1, #4), respectively, and four autapomorphies (each one for C, D, F and G, the youngest set of taxa).

The following figure shows the character-based splits (taxon bipartitions) for the linear evolution scenario:
(Trivial splits, one taxon separated from all others, in blue)

Reconstructing the (true) evolutionary pathway is trivial based on this perfect split pattern, especially if we know that A is the oldest taxon and G the youngest.


It's equally straightforward for our second scenario, with perfectly dichotomous evolution:


Character 1 and character 4 define taxon cliques comprising B,C,D and E,F,G. The remaining characters indicate that C,D and F,G derive from B and E, respectively.
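To make the "perfect split pattern" concrete, here is a minimal sketch that turns each binary character into a split and checks that all pairs of splits are compatible (two splits are compatible when at least one of the four pairwise intersections of their sides is empty). The two matrices are my reconstruction from the verbal description above, not the original files; in particular, which of characters #2, #3, #5 and #6 is the autapomorphy of C, D, F or G in the dichotomous scenario is an assumption.

```python
from itertools import combinations

# Hypothetical reconstruction of the two matrices from the text:
# rows are taxa, columns are characters #1..#6 (0 = ancestral, 1 = derived).
LINEAR = {
    "A": "000000", "B": "100000", "C": "110000", "D": "111000",
    "E": "111100", "F": "111110", "G": "111111",
}
DICHOTOMOUS = {  # assumed assignment of the four autapomorphies
    "A": "000000",
    "B": "100000", "C": "110000", "D": "101000",
    "E": "000100", "F": "000110", "G": "000101",
}


def character_splits(matrix):
    """Return one (derived taxa, remaining taxa) bipartition per character."""
    n_chars = len(next(iter(matrix.values())))
    all_taxa = frozenset(matrix)
    splits = []
    for i in range(n_chars):
        derived = frozenset(t for t, row in matrix.items() if row[i] == "1")
        splits.append((derived, all_taxa - derived))
    return splits


def compatible(split1, split2):
    """Two splits fit on one tree if one of the four intersections is empty."""
    (a1, b1), (a2, b2) = split1, split2
    return not (a1 & a2) or not (a1 & b2) or not (b1 & a2) or not (b1 & b2)


def all_compatible(matrix):
    """True if every pair of character splits is compatible."""
    return all(compatible(s, t) for s, t in combinations(character_splits(matrix), 2))


# A conflicting pair for contrast: AB | rest versus BC | rest.
ALL = frozenset("ABCDEFG")
CONFLICT = (
    (frozenset("AB"), ALL - frozenset("AB")),
    (frozenset("BC"), ALL - frozenset("BC")),
)
```

For both reconstructed matrices `all_compatible` returns `True`, which is just another way of saying that the data contain no conflicting signal and fit a single tree.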


Explicit inferences

As stated above, the data properties for both scenarios are different. The matrices have a different number of parsimony-informative characters (4 for linear, 2 for dichotomous). Accordingly, the reconstructed optimal trees (here using the maximum parsimony, least-squares, and maximum likelihood criteria), are better resolved / more correct for the linear than for the dichotomous evolution.
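The counts of parsimony-informative characters can be checked with a few lines of code. This is a sketch using the usual definition (a character is parsimony-informative if at least two states each occur in at least two taxa); the matrices are again my reconstruction from the text, and the placement of the four autapomorphies in the dichotomous matrix is an assumption.

```python
from collections import Counter

# Hypothetical reconstruction of the two matrices (taxa A..G, characters #1..#6).
LINEAR = {
    "A": "000000", "B": "100000", "C": "110000", "D": "111000",
    "E": "111100", "F": "111110", "G": "111111",
}
DICHOTOMOUS = {  # assumed assignment of the four autapomorphies
    "A": "000000",
    "B": "100000", "C": "110000", "D": "101000",
    "E": "000100", "F": "000110", "G": "000101",
}


def is_parsimony_informative(column):
    """At least two states must each be present in at least two taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_characters(matrix):
    """Return the 1-based numbers of the parsimony-informative characters."""
    n_chars = len(next(iter(matrix.values())))
    return [
        i + 1
        for i in range(n_chars)
        if is_parsimony_informative([row[i] for row in matrix.values()])
    ]
```

In the linear matrix, characters #1 and #6 each separate a single taxon (A or G) from all others and drop out, leaving four; in the dichotomous matrix only the two clade-defining characters remain.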

MPT = most-parsimonious tree; ML = maximum likelihood. *Corrected for ascertainment bias.

Using all of the variable characters, NJ and ML are generally more decisive and produce higher support for the right branches. But for the dichotomous evolution scenario, they also show ghost-clades ("para-clades" as they include close relatives sharing a recent common origin, but do not represent monophyletic groups sensu Hennig) with low support. The corresponding MPT has no ghost-clades, but it also provides no clues to how B,C,D and E,F,G are related to each other.

Beyond this, and as can be seen in many real-world examples, there is no fundamental difference between character-based inferences such as maximum parsimony (MP) or maximum likelihood (ML) and distance-based inferences (NJ) fulfilling (here) the least-squares criterion (sometimes still called "phenetic" inferences in contrast to the "phylogenetic" parsimony, Bayesian inference and maximum likelihood).

The differences diminish further when we look at the phylograms instead of the cladograms, as shown next.


Another observation we can make is that for the linear-evolution scenario (four synapomorphies), the ascertainment bias correction under ML has little effect, but it is crucial for the dichotomous evolution (two synapomorphies) to get sensible branch lengths.

Parsimony provides the most conservative (and least decisive) results for the dichotomous-evolution scenario, also because of the way I applied it: PAUP* allows optimizing trees with hard polytomies when using the default branch-and-bound search (for tree inference as well as bootstrapping), whereas the NJ / BioNJ algorithm and the ML implementation in RAxML will always produce fully dichotomized trees, including zero-length or near-zero-length branches. This explains the difference in the support values of preferred and alternative splits.


(Non-filtered) Bootstrap support consensus networks for the linear evolution scenario. Same scale for all graphs, trivial splits (dashed lines) collapsed.
(Non-filtered) Bootstrap support consensus networks for the dichotomous evolution scenario.


Trees are not wrong, but they miss the point

None of the graphs above show anything strongly erroneous, but they also don't fully capture the evolutionary pathways — that is, the actual ancestor-descendant relationships. This is because our taxon set includes ancestral forms, which, in traditional trees, have to be placed as sisters to part or all of their descendants. Networks provide a quick solution to this limitation.

Median-joining networks inferred with NETWORK 5.0.0.3 for both scenarios, with the inferred (and real) character changes annotated along edges.

Neighbour-nets inferred with SplitsTree 4.13.1 for both scenarios, based on the mean (Hamming) pairwise distances.

The two (perfectly tree-like) graphs, one parsimony-based, the other distance-based, look identical, and place all of the taxa exactly where they should be: the ancestors on the nodes ("medians"), and their (latest) descendants at the tips. But note that in the case of the Neighbour-net this is a visual illusion / approximation: in fact, the ancestors are actually connected by zero-length edges to the node they appear to be sitting on.
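The zero-length edges can be verified directly from the distances, without running SplitsTree. The sketch below computes mean Hamming distances and then, for each taxon x, the quantity min over all pairs y, z of (d(x,y) + d(x,z) - d(y,z)) / 2, which for an additive (perfectly tree-like) distance matrix equals the length of the terminal edge leading to x. This is an independent check and not what the Neighbour-net algorithm does internally; the matrices are my reconstruction from the text, with the assignment of autapomorphies in the dichotomous case assumed.

```python
from itertools import combinations

# Hypothetical reconstruction of the two matrices (taxa A..G, characters #1..#6).
LINEAR = {
    "A": "000000", "B": "100000", "C": "110000", "D": "111000",
    "E": "111100", "F": "111110", "G": "111111",
}
DICHOTOMOUS = {  # assumed assignment of the four autapomorphies
    "A": "000000",
    "B": "100000", "C": "110000", "D": "101000",
    "E": "000100", "F": "000110", "G": "000101",
}


def mean_hamming(matrix):
    """Proportion of differing characters for every ordered pair of taxa."""
    n_chars = len(next(iter(matrix.values())))
    return {
        (s, t): sum(a != b for a, b in zip(matrix[s], matrix[t])) / n_chars
        for s in matrix
        for t in matrix
    }


def terminal_edge_length(dist, taxon, taxa):
    """Terminal edge length of `taxon`, assuming an additive distance matrix."""
    others = [t for t in taxa if t != taxon]
    return min(
        (dist[taxon, y] + dist[taxon, z] - dist[y, z]) / 2
        for y, z in combinations(others, 2)
    )


def taxa_on_nodes(matrix, tolerance=1e-9):
    """Taxa whose terminal edge has (numerically) zero length."""
    dist = mean_hamming(matrix)
    taxa = sorted(matrix)
    return [t for t in taxa if terminal_edge_length(dist, t, taxa) < tolerance]
```

With these matrices the taxa sitting on nodes are B to F in the linear scenario (A and G are the two ends of the chain) and A, B and E in the dichotomous scenario, matching the placement of the ancestors in the graphs.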

Given that both scenarios used here produce trivial, straightforward to interpret, data patterns (see the first figures), the failure of the traditional tree inferences to get it completely right can be a bit unsettling. Trees including primitive-old and derived-new forms are common in the (palaeontological) literature, and typically show many branches lacking high support (note that only ML produced a bootstrap support >90 for a true-tree branch, and only for the linear evolution scenario). To address evolution over time, networks should hence be standard applications, rather than the exception. Cladograms should be long gone, as they show very little beyond the most trivial.

If we want trees (and many of us want trees!), we need tree inferences that can optimize an older taxon on an internal branch or node, to accommodate potentially ancestral forms.

Related blog posts

In Clades, cladograms, cladistics, and why networks are inevitable, I argue that we cannot get around networks when we aim to study taxa from different time scales using their morphologies.

Digging deeper: Population dynamics and individual-based fossil phylogenies raises the question of what we deal with when we use individual fossils (i.e. long-dead individuals) as OTUs in our phylogenetic inferences.
 
Monophyletic groups in networks by David gives an introduction into (fringe) terminology. What to do when dealing with more than a single most-recent common ancestor and past reticulation?

Networks and most recent common ancestors by David discusses the concepts of conservative MRCAs (most recent common ancestors), fuzzy MRCAs and (alternative) LCA — lowest (last) common ancestors in the face of reticulation.

In Stacking neighbour-nets: ancestors and descendants, I outline how one may (and why one should) stack Neighbour-nets to analyse the evolutionary history of a group including (mostly) fossil representatives.

The first Darwinian evolutionary tree[s] show features one rarely finds in a modern-day phylogenetic tree: ancestral and descendant forms, ancestral taxa addressed as species and not higher taxa, and gradual transition between forms (post by David).

Tree metaphors and mathematical trees by David, which introduces János Podani's concept about "branching silhouettes" and how to depict an actual evolutionary tree.

Where have all the ancestors gone? discusses the common notion that we don't have to deal with ancestor-descendant problems in phylogenetics at all, because the scarcity of the (terrestrial) fossil record ensures that we only find extinct side (sister) lineages.
 

Monday, July 2, 2018

Reticulation at its best — an example from the oaks


One particular case where networks turn out to be a versatile tool is the study of low-level evolutionary patterns. This is especially so when we leave the comfort zone of well-sorted molecular markers, and use more than a single individual per species. Our recently published data set on (mostly Mediterranean) oaks, provides a nice example of this.

Why so few people study oaks at the intra-generic level

Oaks are notoriously difficult to study because they don't bother too much about species boundaries (which can be more or less obvious) and – at one point – decided to not sort their plastids at all (and full plastomes, as I once saw for myself first-hand, won't help). Hence, all reasonable phylogenetic reconstructions of oak evolution have been based on genetic data from the nucleome. However, this imposes a new problem — the sequenced nuclear gene regions allow the recognition of the major lineages (which recently have been formalized), but the closer one comes to the species level the more difficult it is to resolve anything at all.

Even the famous ITS region, which includes the weakly constrained internal transcribed spacer ITS1 and the structurally quite constrained ITS2, and has been frequently advocated as a plant barcode, turns out to be a two-edged sword. Relationships between the major intra-generic lineages are relatively clear, and the ITS is pretty divergent down to the species level, but at the individual level one faces an intra-genomic divergence that often outmatches inter-species differentiation.

In some groups, like the most speciose and most widespread white oaks (sect. Quercus), identical ITS variants exist from individuals / species separated today by thousands of kilometers of ocean or icy wasteland. One possible explanation is that oaks have very large population sizes, and they are wind-pollinated, so that they have a high capacity to permanently homogenize their genepools. Plastids, on the other hand, are only transmitted via the large fruit, the acorns, and the main animal vector for distributing acorns, the jaybirds, are sedentary birds. Their backup-vector, the squirrels are known to hoard a lot of acorns in a single place, but not for migrating globally (unless we assist them).

Nonetheless, we readily notice that the intra-individual differentiation patterns appear not to be entirely random, and so in our study we moved to another nuclear multi-copy spacer known to be more variable than the ITS1 and ITS2 (hence, largely ignored by molecular phylogeneticists) — the 5S intergenic spacer (5S IGS). It didn't help too much for solving the white oak puzzle (in western Eurasia), but did give us new insights into the two other western Eurasian sections: Ilex and Cerris.

The 'host-associate' framework

A cloned 5S-IGS (or ITS) sequence is not a good OTU, because we are usually not interested in a clone phylogeny (a mere sequence genealogy), but in the phylogenetic relationships between the individuals or species carrying the cloned sequence variants: the nuclear spacer population. Even networks struggle with such data, and my colleague Markus Göker came up with the idea to treat this in the form of hosts, the individuals, and associates, the cloned sequences found in the individual (Göker & Grimm, BMC Evol. Biol. 2008 — open access). There are several options to transfer the primary clone (associate) data into individual (host) data.

Options that we tested for transferring associate data into host data.
CM = character matrix, DM = distance matrix. The independent CMhosts used were morphological matrices. ENT = entropy, FRQ = frequency, CON = strict consensus, MOD = modal consensus and SIZ = sample size are character transformations implemented in Markus' g2cef; PBC and MIN are distance transformations implemented in pbc (these and other little helper programmes can be found here).

Using three cloned (ITS) datasets, we found that for these data the "Phylogenetic Bray-Curtis" (PBC — see the next figure) distance transformation outperforms the other tested options.

Computation of the "Phylogenetic Bray-Curtis" distance. It's a modification of the Bray-Curtis dissimilarity using the minimum distance for each covered row/column instead of absence/presence. H1/H2 = hosts with different sets of associates (A1–A6).

Incidental but interesting insights

Whenever I come into contact with such data I advise the use of the PBC distance transformation as the basis for the main individual-level network, but also to run the MIN distance transformation: MIN will just calculate the minimum inter-clone distance between the clone samples of two individuals, and use this as the inter-individual distance.

Neighbour-net using the MIN transformation

The MIN network (above) is quite bushy for these data, because we naturally have many 5S-IGS variants shared among individuals of the same species, and occasionally also by individuals of different species. Nonetheless, it visualizes some basic differentiation patterns in the clone sample: compare e.g. the coherent cluster 3, the crenata-suber lineage (the 'Cork Oaks'), in which all individuals share a pair of very similar to identical 5S-IGS clones; and the divergent cluster 4, the 'Vallonea' oaks, in which all individuals have different sets of clones but unique 5S-IGS variants separating them from all other Cerris oaks (long proximal edge bundle).

Furthermore, we have potential F1 hybrids (morphologically intermediate) in our sample, and such hybrids, e.g. tj08, should have very low (to zero) MIN distances with members of their parental lineages.

However, the PBC network (below) is as beautiful as it gets — I really love this transformation, as it always comes up with something usable and interpretable.

Neighbor-net based on PBC-transformed inter-individual distances. See Simeone et al. (PeerJ PrePrints 2018 — open access pre-print) for a discussion.

However, this network was a last-minute addition, because a happy little "accident" happened along the way: the networks we were working with and looking at while drafting the paper were not PBC networks, as I had thought.

It happened this way. Also implemented in Markus' little helper program are AVG, the average inter-clone distance, and MAX, the maximum inter-clone distance. AVG and MAX don't result in a proper distance matrix, because the diagonal will be the average or maximum distance between the clones of a single individual, and not all-zero as it should be (for MIN it's always zero). [We discussed a few options to modify AVG and MAX to ensure a zero diagonal, but couldn't devise something that makes sense.]
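The diagonal problem is easy to reproduce. The following is a minimal sketch of the three transformations, not Markus' program: the function names, the toy clone sequences and the plain Hamming proportion used as inter-clone distance are all hypothetical. I also assume that the host-to-host value is taken over all pairs of clones, including a clone paired with itself, which is what makes the MIN diagonal zero.

```python
from itertools import product
from statistics import mean


def clone_distance(seq_a, seq_b):
    """Toy inter-clone distance: proportion of differing aligned positions."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def host_distance(clones_a, clones_b, how):
    """MIN, AVG or MAX of all inter-clone distances between two hosts."""
    values = [clone_distance(a, b) for a, b in product(clones_a, clones_b)]
    return {"MIN": min, "AVG": mean, "MAX": max}[how](values)


def host_matrix(hosts, how):
    """Full host-by-host matrix, including the diagonal."""
    return {(g, h): host_distance(hosts[g], hosts[h], how) for g in hosts for h in hosts}


# Hypothetical hosts (individuals) with two cloned variants each;
# H1 and H2 share one identical clone.
HOSTS = {
    "H1": ["ACGT", "ACGA"],
    "H2": ["ACGA", "TCGA"],
}
```

For these toy data the MIN matrix has a zero diagonal and a zero H1-H2 distance (the shared clone), whereas AVG and MAX put the intra-individual divergence on the diagonal (0.125 and 0.25 for H1).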

However, the SplitsTree program did not care whether the diagonal was all-zero, so the AVG- and MAX-transformed distance matrices will still produce a Neighbor-net. So, what I assumed were PBC networks were in fact AVG networks.


Neighbor-net based on AVG-transformed inter-individual distances.

It took me quite a while to recognize this "error" because, in contrast to the AVG (and MAX) networks I looked at when we did the 2008 paper, the one for the oaks made a lot of sense. Notably, the suspected F1 hybrids were perfectly resolved, spanning the corresponding boxes, and the species aggregates (clusters) did make sense regarding the general geographic setting, the history of the region under study, and their morphology.

Same graph as above, highlighting known or potential F1 hybrids spanning the corresponding boxes.

For these data (with a minimum of four clones available per individual, individuals covering all species, and including the entire range of the section in western Eurasia), the AVG network better shows the potential F1 hybrids (or introgrades) than the (more methodologically sophisticated) PBC network. However, the latter makes more sense regarding speciation processes and the history of the group (because the distance is a "phylogenetic" version of the well-known Bray-Curtis distance).


A "cactus-oak" fusion graph depicting nuclear and plastid differentiation (and evolution) in Quercus Group Cerris.

Take-home message

First, it's always good to delegate work you can do by heart to somebody new to it! This forces its propagation, which is important. More importantly, though, one has one's preferences and established analysis pipelines, and they may have become restricted in scope. I mainly used the -a (AVG), -i (MIN) and -x (MAX transformation) options in the little helper program to quickly summarize some of the primary differentiation data: for example, whether individuals have identical clones (MIN = 0), whether intra-individual divergence is higher than inter-individual divergence (MAX intra-individual > MIN inter-individual), and whether individuals have strongly divergent clones (high MAX). AVG was computed and tabulated but never cherished by me. I always looked at the MIN-transformed networks, since this transformation provides a valid distance matrix, but then ignored them. And I never again tried to infer a Neighbor-net based on AVG or MAX transformations after our 2008 paper.

Second, Neighbor-nets are so quick to infer that there is no resource- or logic-related reason not to just run whatever distance one has on hand or can easily establish. Maybe even the biologically less sound distance will reveal some interesting aspect (there are a lot of biological arguments that can be put forward for dismissing AVG distances in favour of PBC distances).

Paper (pre-print) and open data
Simeone MC, Cardoni S, Piredda R, Imperatori F, Avishai M, Grimm GW, Denk T. 2018. Comparative systematics and phylogeography of Quercus Section Cerris in western Eurasia: inferences from plastid and nuclear DNA variation. PeerJ Preprints 6: e26995v1.
Primary data and analysis files are included in the Online Supplementary Archive: Simeone et al., PeerJ Preprints, doi: 10.7287/peerj.preprints.26995v1/supp-4. (See Readme.txt included in the top folder of the archive.)

Monday, June 11, 2018

Want to place a fossil in a minute? Just use Neighbour-nets


Palaeontological phylogenetic researchers typically put a lot of effort into inferring trees. It has been argued (and occasionally pointed out during manuscript reviews) that only by placing a fossil in an explicitly phylogenetic framework can we assess what it represents. I sympathize with this notion, but in most cases we don't need any elaborate analysis to do it — a quick network-based analysis will do the trick.

In this post, I'll demonstrate my point using the most recent matrix presented by two eminent plant morphology veterans. In a fresh-off-the-press paper, James Doyle & Peter Endress provide a "Phylogenetic analysis of Cretaceous fossils related to Chloranthaceae and their evolutionary implications" (Botanical Review) using their morphology matrix focussing on early diverging angiosperm lineages, which was originally used for a paper by Saarela et al. (Nature 446: 312–315, 2007) and has been continually updated.

Like all morphological matrices that aim to cover as much as possible, Doyle & Endress' matrix does not provide any strong tree-like signal, and hence it has little use for inferring phylogenetic trees. Doyle & Endress deal with this issue by using a (more or less molecular-based) backbone tree enforcing several clades for the modern taxa, and then trying to find the most parsimonious placement of the fossil(s). This approach works to some degree but has two problems: one theoretical and one practical.

First, the backbone tree, or any molecular-informed topology, is usually some steps longer than the most-parsimonious trees that could be inferred on their matrix. In other words, morphological evolution in plants doesn't fully fulfill Ockham's Razor. Why should this also be the case for the fossils?

Second, moving a fossil through the branches to find the best-placement takes some time, and will lead to many equally parsimonious solutions. Not rarely, the fossil can be placed on quite distant branches, producing trees that are only a few steps longer.

A graph I made depicting the 'parsimoniousness' of placing a fossil, Monetianthus (a Cretaceous water lily), within a given topology using an earlier version of Doyle & Endress' matrix (fig. 7 in Friis et al., Int. J. Plant Sci., 2009). The number of additional steps was estimated by moving the target taxon, the fossil, to the accordingly coloured branch of the tree. (PS: To show that the fossil is a water lily, a member of the Nymphaeaceae, we used a Neighbour-net.)
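The "move the fossil to every branch and count the extra steps" procedure can be sketched in a few lines. This is a toy illustration under stated assumptions, not the analysis behind the figure: the four-taxon backbone, the taxon names W to Z and F, and the matrix are hypothetical, characters are treated as unordered (Fitch parsimony), "?" is read as "any observed state", and the backbone is a rooted binary tree written as nested pairs.

```python
def label(node):
    """Newick-like label for a leaf (str) or an internal node (pair of subtrees)."""
    return node if isinstance(node, str) else "(" + ",".join(label(c) for c in node) + ")"


def fitch(node, column):
    """Return (state set, number of steps) for one character below `node`."""
    if isinstance(node, str):
        return column[node], 0
    left_set, left_steps = fitch(node[0], column)
    right_set, right_steps = fitch(node[1], column)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total Fitch length of `tree` summed over all characters."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        universe = {row[i] for row in matrix.values()} - {"?"}
        if not universe:  # character scored in no taxon at all
            continue
        column = {t: (universe if row[i] == "?" else {row[i]}) for t, row in matrix.items()}
        total += fitch(tree, column)[1]
    return total


def placements(tree, fossil):
    """Yield (target subtree, new tree) with the fossil as sister to each subtree."""
    yield tree, (tree, fossil)
    if not isinstance(tree, str):
        left, right = tree
        for target, new_left in placements(left, fossil):
            yield target, (new_left, right)
        for target, new_right in placements(right, fossil):
            yield target, (left, new_right)


def extra_steps(backbone, fossil, matrix):
    """Steps added by each placement relative to the best placement."""
    lengths = {
        label(target): tree_length(new_tree, matrix)
        for target, new_tree in placements(backbone, fossil)
    }
    best = min(lengths.values())
    return {target: length - best for target, length in lengths.items()}


# Hypothetical toy data: backbone of four modern taxa plus one fossil F.
BACKBONE = (("W", "X"), ("Y", "Z"))
MATRIX = {"W": "1010", "X": "1000", "Y": "0100", "Z": "0101", "F": "1000"}
EXTRA = extra_steps(BACKBONE, "F", MATRIX)
```

Even in this tiny example the fossil has several equally parsimonious positions: attaching it to W, to X, or to the central branch costs nothing extra, and only the positions next to Y or Z are two steps longer.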

For Doyle & Endress' papers this is no big problem, because they just show the best placements as well as those a few steps longer. For example:
Placing the Chloranthistemon species on the stem lineage of Sarcandra and Chloranthus is four steps less parsimonious than placing them on the stem of Chloranthus. For perspective, only two steps are added if the Asteropollis plant is moved to the stem of the whole family. If a four-step parsimony debt is accepted in moving Chloranthistemon to a morphologically less favored position, one may ask why the Asteropollis plant is considered a reliable minimum age constraint for the family.
But with respect to the fact that morphological evolution is not necessarily parsimonious, and that even the modern taxa can show variable root to tip pathlength distances, I always remain skeptical of this approach.


A Bayesian-inferred angiosperm tree based on a total-evidence matrix, built from a curated version of Soltis et al.'s 2011 matrix and including the 2010-version of Doyle & Endress' matrix as morphological partition (provided as open data @ figshare). Note that many fossils (Cretaceous, ~100 Ma) have longer terminal branches than their surviving relatives (which made Bayesian total-evidence dating impossible).

Aside from this, the matrix signal is pretty straightforward when it comes to deciding on the potential position of the fossil in the angiosperm part of the Tree of Life. And the analysis takes (literally) moments.

You just take the matrix, calculate mean pairwise morphological distances (done in a blink), export the distance matrix as NEXUS-formatted file and input this to SplitsTree, which will give you a Neighbour-net (in another blink).
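For readers who prefer a script for this step, here is a minimal sketch of the matrix-to-distances part. It assumes single-symbol, unordered character states, treats "?" and "-" as missing and skips them pairwise, and writes a NEXUS DISTANCES block; the taxon names and the toy matrix are hypothetical, polymorphic or ordered characters are not handled, and the FORMAT keywords should be checked against the SplitsTree version in use.

```python
def mean_pairwise_distance(row_a, row_b, missing="?-"):
    """Proportion of differing states among characters scored in both taxa."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a not in missing and b not in missing]
    if not compared:
        raise ValueError("no characters scored in both taxa")
    return sum(a != b for a, b in compared) / len(compared)


def to_nexus_distances(matrix):
    """Return a NEXUS string with a TAXA and a DISTANCES block (full square matrix)."""
    taxa = list(matrix)  # names must not contain spaces or NEXUS punctuation
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"  DIMENSIONS NTAX={len(taxa)};",
        "  TAXLABELS " + " ".join(taxa) + ";",
        "END;",
        "BEGIN DISTANCES;",
        f"  DIMENSIONS NTAX={len(taxa)};",
        "  FORMAT LABELS=LEFT DIAGONAL TRIANGLE=BOTH;",
        "  MATRIX",
    ]
    for s in taxa:
        row = " ".join(f"{mean_pairwise_distance(matrix[s], matrix[t]):.4f}" for t in taxa)
        lines.append(f"  {s} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical toy matrix: two modern taxa and one incompletely scored fossil.
TOY = {"Modern1": "0101?", "Modern2": "0111-", "Fossil1": "11?1?"}
NEXUS_TEXT = to_nexus_distances(TOY)
```

Writing `NEXUS_TEXT` to a file gives something that can be opened as a distance matrix; the Neighbour-net itself is then computed by SplitsTree, not by this script.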

A Neighbour-net based on Doyle & Endress' 2018 matrix including only the modern-day taxa.

Most members of the well-established clades, main angiosperm lineages, cluster in the Neighbour-net (bracketed names point to somewhat scattered clades). In case the signal from a fossil is trivial, it will be nested within the respective cluster. Trivial signals are when a fossil has a character suite that indicates it is much more similar to one of the clusters than to any other, which usually means that it is part of the same evolutionary lineage. Convergences may be common, and characters homoplasious, but evolving the exact same suite of characters while not sharing common ancestry is quite unlikely.

The Neighbour-net including the matrix' fossils. Note that the relative position of the Eudicots has changed and Circaeaster and Euptelea are placed closer to the other Ranunculales, although the pairwise distances between all modern taxa have not changed. The re-arrangement is solely a fossil-inclusion effect. By adding fossils attracted to Ceratophyllum, this enigmatic and isolated genus is drawn away from the most basal eudicots.

Two of the fossils appear to have unique character suites, placing them intermediate between two phylogenetically isolated plants, the unique Amborella (still considered the earliest branching modern-day angiosperm; one species on New Caledonia) and Ceratophyllum, an equally enigmatic water plant. The remainder are clearly members of the Chloranthales.


Having identified the phylogenetic neighbourhood of the fossils, we can then focus on this neighbourhood (in SplitsTree: select the OTUs it comprises; then go to menu Data > Keep only selected taxa).

The Neighbour-net after all non-neighbourhood OTUs have been removed.

From this graph we can directly conclude:
  • The Asteropollis plant is a close relative, likely a sister or early representative, of Hedyosmum.
  • Couperites represents an early and substantially diverged lineage within the Chloranthales; its closest living relative appears to be Ascarina, which is, next to Hedyosmum, the most derived living member of the Chloranthales (here, one would need to see if there is any shared character suite).
  • Zlatkocarpus is an ancestral form of the Chloranthales core clade, comprising Ascarina and the sister taxa Chloranthus and Sarcandra (one would need to check the possibility that this could be a missing data artefact).
  • Canrightia and Canrightiopsis are sister lineages or precursors of Chloranthus and Sarcandra (see the open access tree-and-network-based paper by Friis et al. 2015, Grana 54: 184–212).
  • The Pennipollis plant is an ancient isolated Chloranthales lineage, with no living relative.
  • Appomattoxia and Pseudoasterophyllites may be (very) distant relatives of Ceratophyllum, the latter an isolated genus that has long been and still is a problem for molecular phylogenies. Alternatively, they may represent early and extinct angiosperm lineages (or the same lineage) with no modern counterparts.
To say anything beyond this based on the current data set quickly leaves the grounds of objectivity, and requires a priori assumptions about the importance of certain morphological traits being shared or not (i.e. not expressed, not just missing due to poor preservation).

[For those interested in a formal discussion of these results, see Doyle & Endress, 2018, pp. 7–25.]

To refine the analysis, we can just reduce the character set to the characters scored for the fossils.

A Neighbour-net based on distances computed using a character-subset generated by excluding all invariable characters (in PAUP*: Exclude constant) for the taxa included in the network (66 characters, including eight not defined for any fossil taxon). Grey depicts a (molecular-data backed) tree hypothesis that could explain the observed differentiation pattern.
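The character filtering can also be scripted. The sketch below keeps characters that are variable among the included taxa and, optionally, scored in at least one fossil; it is an illustration with hypothetical names and a toy matrix, it assumes single-symbol states with "?" and "-" as missing, and it is not a reimplementation of PAUP*'s Exclude command.

```python
def select_characters(matrix, fossils, require_fossil_score=False, missing="?-"):
    """Return 0-based indices of characters that pass the filters."""
    n_chars = len(next(iter(matrix.values())))
    keep = []
    for i in range(n_chars):
        # States actually observed among the included taxa (missing data ignored).
        observed = {row[i] for row in matrix.values()} - set(missing)
        variable = len(observed) >= 2
        scored_in_fossil = any(matrix[f][i] not in missing for f in fossils)
        if variable and (scored_in_fossil or not require_fossil_score):
            keep.append(i)
    return keep


def subset(matrix, indices):
    """Reduce every row of the matrix to the selected characters."""
    return {taxon: "".join(row[i] for i in indices) for taxon, row in matrix.items()}


# Hypothetical toy matrix: two modern taxa and one fossil.
TOY = {"Modern1": "0100", "Modern2": "0111", "Fossil1": "0?1?"}
VARIABLE_ONLY = select_characters(TOY, fossils=["Fossil1"])
FOSSIL_SCORED = select_characters(TOY, fossils=["Fossil1"], require_fossil_score=True)
```

With `require_fossil_score=False` the filter only drops invariable characters (here keeping the third and fourth); switching it on additionally drops variable characters that no fossil is scored for (keeping only the third).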

To take the next step, morphological matrices alone will have no practical use, because they will not allow us to identify fast vs. slow evolving traits and lineages. A lineage that goes through bottlenecks or colonizes new niches will be genetically, and morphologically, more distinct than one that remained in calm waters. One could map the preserved traits found in each fossil onto a molecular tree, and include the information about actual branch lengths in that tree to put forward hypotheses about the ancestral state (e.g. Mendes et al., Grana 53: 283–301 [open access] for Lardizabalaceae), and possibly even time the divergences. This would enable one to compare the situation in the fossils with the top-down hypotheses about morphological evolution for the very same time period.

But that would be mostly tree-based analyses, and thus nothing for a Genealogical World of Phylogenetic Networks post.

Data — In case you are interested in the primary data matrix, a ready-to-use NEXUS version and the raw Split-NEXUS files have been uploaded to figshare.

Monday, April 23, 2018

A (wal)nut to crack – what a network tells you that no tree can


In this post, I will show a network that I generated some time ago as illustration of a point: morphological data should not be used to infer trees, but networks, instead — especially when the goal is to place some fossils in a modern-day phylogenetic framework.

In 2007, Manos et al. (Systematic Biology 56:412–430) published an interesting phylogenetic study that provided a phylogenetic framework to place some enigmatic fossils of the Juglandaceae, the walnut family. Following my preferred procedure (presumably without realizing it), they recruited a palaeobotanical expert to erect a morphological partition.

Given the high quality of the matrix, this is an ideal example to demonstrate the utility of networks in (palaeo)phylogenetic research and to discuss the question of potential ancestor-descendant relationships, and their poor representation in trees (especially cladograms). Phylogenetic relationships within modern Juglandaceae are relatively well resolved. Rhoiptelea, a relict genus found in the mountains of northern Vietnam and south-western China, is sister to the remainder of the family; it is now subfamily Rhoipteleoideae, but was traditionally treated as its own family. Rhoiptelea is a living fossil: flowers with fitting in-situ pollen and seeds have been found in the Late Cretaceous (Heřmanová et al. 2011, IJPS 172: 285–293; cryptically named Budvaricarpus serialis, the "Serial Budvarseed", because one is not allowed to use a modern-day genus for naming an 85–90 million year old angiosperm, even when it looks the same). The remainder of the Juglandaceae falls into two main clades, recognized as subfamilies:
  1. the Juglandoideae — the walnuts (Juglans) and their closest relatives: the (eastern) North American-East Asian disjunct genus Carya, the Eurasian relict genus Pterocarya (mainly Transcaucasia, East Asia), and the monotypic genera Cyclocarya and Platycarya.
  2. the Engelhardioideae — a group of tropical-subtropical, mostly relict genera: Alfaroa + Oreomunnea in the equatorial regions of the New World; and South East Asian-Malesian genus Engelhardia and the, probably monotypic, Alfaropsis widespread in China (sometimes still included in Engelhardia; e.g. current Flora of China, despite unambiguous molecular and morphological evidence).
Juglandaceae produce (winged) seeds and pollen that are relatively easy to identify. They are well-known and very common companions of palaeontologists during much of the Cenozoic, especially the (today geographically very restricted) Engelhardioideae. But in addition to the modern genera, the family includes some very interesting, unique fossils — the idea is to place these in a phylogenetic framework.

Results of the study of Manos et al. (2007).
Arrows indicate the position of the fossils. a) A majority rule consensus cladogram using a cut-off of 50 based on the morphological partition; b) the total evidence counterpart.

As can be seen from the above trees (taken from the paper), morphology reflects some of the molecular phylogenetic relationships — the Juglandoideae are supported as a clade, as are most genera (except for Engelhardia and Oreomunnea). Two fossils, Pal(a)eoplatycarya and Platycarya americana were resolved as sister taxa to their modern counterpart, Platycarya strobilacea; and the two enigmatic fossils Polyptera (the "many-winged one") and Cruciptera (the "cross-winged one") could be associated with the Juglandoideae. The total evidence approach indicated that Cruciptera is part of the "crown-group" Juglandoideae, in contrast to Polyptera, that appears at a more "basal" (root-proximal) position in this subclade. A sixth fossil, Pal(a)eooreomunnea could not be resolved with certainty (placed as sister to all Juglandoideae in the total evidence tree). As the name indicates, literally the "Ancient Oreomunnea", we would have expected it to group with the Engelhardioideae, which form a clade in the total evidence tree.

This is okay so far as it goes but, beyond potential sister relationships, these cladograms show very little. When I place a fossil such as Cruciptera in the phylogeny, I would like to know whether it is more closely related to Juglans, Pterocarya or Cyclocarya. Is it an early sister lineage of all of these, or even a precursor? Cladograms cannot answer such questions.

The persistent issue of pseudo-clades

It has been pointed out in earlier posts that clades/grades are not necessarily synonyms of Hennig's concepts of monophyly and paraphyly, mainly because of convergent evolution creating data splits that are incongruent with the true tree. Parsimony-based analyses are especially vulnerable, because each change represents a step to be optimized.

One alternative method to place fossils in a (molecular-based) phylogenetic framework is the evolutionary placement algorithm (EPA; Berger & Stamatakis 2010, AICCSA conference paper). This changes to a probabilistic framework, and queries each fossil alone using its morphological partition but using the molecular-based tree as framework.

Summarized result of the evolutionary placement algorithm as implemented in RAxML.
Each number gives the probability of attaching the fossil to the corresponding branch, using maximum likelihood as optimality criterion.

This gives the above tree as the result for the Walnut data set. Palaeooreomunnea is now unambiguously linked to one of the two included species of Oreomunnea, O. mexicana. Cruciptera is associated (again unambiguously) with Cyclocarya. Furthermore, not only are Palaeoplatycarya and the extinct North American Platycarya relatives of the modern-day Platycarya, but so is Polyptera. According to the original analysis, Polyptera was the first-branching member of the remainder of the Juglandoideae, i.e. all genera except Platycarya.

And the network shows us why

The most important problem with morphological data sets is that their signals are complex, and usually not very tree-like. Hence, whenever we optimize fossils along a tree (either by directly analyzing the morphological data or by some form of total evidence approach), the analysis has to fit in this odd little OTU at all cost, even when it means collapsing an entire clade. Simultaneous optimisation of two or more fossils triggers further branching artifacts, and may decrease branch support, because we have no molecular data compensating for eventual branch attraction conflicting with the actual phylogeny.

Let's take Polyptera as an example. If we de-root the trees, the original total evidence placement and the ML-EPA are not that different from each other: Polyptera is just moved one node. An easily inferred Neighbour-net, which is not 1-dimensional like a phylogenetic tree, but 2-dimensional, shows the reason why (and only by using the morphological data partition).

The neighbour-net based on the Manos et al.'s morpho-data partition.
Numbers at branches represent nonparametric bootstrap support (Least-squares and Maximum parsimony criteria) and Bayesian posterior probabilities.

  • We can see that Polyptera has a unique morphology (it shows the longest terminal edge of all fossils), making it equally similar to Platycarya and the remaining Juglandoideae: Juglans, Pterocarya, Cyclocarya, and Carya (Annamocarya is a not-widely-accepted Chinese genus, genetically indistinct from other East Asian Carya). This explains its instability in tree-based reconstructions. Assuming that Rhoiptelea points to the actual root, one could use the relatively high branch support values as an argument to say that Polyptera evolved after Platycarya split from the remainder of the Juglandoideae. But the network shows that the signal is not that straightforward, and Polyptera may just be a third lineage within the Juglandoideae (note the short orange edge bundle in contrast to the large red and green ones). A crucial question to check, also regarding the ML-EPA result, is whether the orange-edge clade (including Polyptera) is supported by uniquely shared characters and is not just a tree-branching artifact caused by the distinctness of the Platycarya group. Being substantially distinct (genetically and morphologically) from the remainder of the Juglandoideae, the members of the Platycarya group must be placed as sister taxa. Polyptera, being a fossil, is not that distinct and is hence placed in the Juglandoideae core clade. Distance-based and parsimony methods are more vulnerable to long-branch attraction (or short-branch culling) than is ML; and Bayesian analysis optimizes to a tree best conforming to all signals in the data (compatible or not).
  • Cruciptera is more similar to Cyclocarya and Pterocarya than to Juglans, and represents a more primitive (ancestral) form. Based on the position of Cyclocarya and Pterocarya, we can directly conclude that they are morphologically less derived than Juglans, their sister taxon. Hence, one should be careful interpreting Cruciptera as a precursor of e.g. Pterocarya, and would have to go back into the matrix and assess which characters differentiate within this part of the graph, in order to decide whether the similarity between them is a genuine representation of shared (common) origin, and not just due to symplesiomorphies.
  • The fossil counterparts of modern-day Platycarya span a quite prominent box-like structure in the network, but the blue edge has little support from tree-based analyses. A simple explanation would be that these two are more ancient members of the Platycarya lineage, and are less derived than their modern counterpart and the other Juglandoideae.
  • Palaeooreomunnea is placed as one would expect for an ancestral form of the Engelhardioideae. It is clearly closer to the New World pair Alfaroa and Oreomunnea than to the Old World Alfaropsis and Engelhardia.
Data & software for EPA

The data matrix that I used for the ML-EPA, the Neighbour-net and the competing branch support analyses can be found in the supplementary information of the original paper.

EPA is implemented in RAxML since Version 7 and usually used to place environmental short sequence reads (Berger et al. 2011, Syst. Biol. 60:291–302). For a published application of EPA to place fossils, see e.g. Bomfleur et al. 2015, BMC Evol. Biol. 15:126.

Monday, April 9, 2018

The curious case(s) of tree-like matrices with no synapomorphies


(This is a joint post by Guido Grimm and David Morrison)

Phylogenetic data matrices can have odd patterns in them, which presumably represent phylogenetic signals of some sort. This seems to apply particularly to morphological matrices. In this post, we will show examples of matrices that are packed with homoplasious characters, and thus lead to trees with a low Consistency Index (CI), but which nevertheless have high tree-likeness, as measured by a high Retention Index (RI) and a low matrix Delta Value (mDV). We will also try to explore the reasons for this apparently contradictory situation.

Background

A colleague of ours was recently asked, when trying to publish a paper, to explain why there were low CI but high RI values in his study. This reminded Guido of a set of analyses he started about a decade ago, using an arbitrary selection of plant morphological matrices he had access to.

The idea of that study was to advocate the use of networks for phylogenetic studies using morphological matrices, based on the two dozen data sets that he had at hand. The datasets were each used to infer trees and quantify branch support, under three different optimality criteria: least-squares (via neighbour-joining, NJ), maximum likelihood, and maximum parsimony. This study was never wrapped up for a formal paper, for several reasons (one being that 10 years ago Guido had absolutely no idea which journal could possibly consider publishing such a paper, another that he struggled to find many suitable published matrices).

The signals detected in the collected matrices were quite different from each other. The set included matrices with very high matrix Delta Values (mDV), nontree-like signals, and astonishingly low mDVs, for a morphological matrix. Equally divergent were the CI and RI of the inferred equally most-parsimonious trees (MPT) and the NJ tree. The data for the MPTs and the primary matrices are shown in the first graph, as a series of scatterplots, where each axis covers the values 0-1. (Note: in most cases the NJ topologies are as optimal as the MPTs, and have similar CI and RI values.)


As you can see, the CI values (parsimony-uninformative characters not considered) are not correlated with either the RI or mDV values, whereas the latter two are highly correlated, with one exception.
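For readers who have not met the Delta Value before, here is a minimal sketch of how such a tree-likeness score can be computed from a pairwise distance matrix. It assumes the usual quartet-based definition (for each set of four taxa, compare the three possible pairings of distance sums) and takes the matrix-wide value as the plain mean over all quartets; the averaging actually used for the mDV values reported here may differ in detail. The two toy matrices and all function names are invented for illustration.

```python
from itertools import combinations


def symmetric(pairs):
    """Turn {(a, b): distance} into a nested lookup d[a][b] == d[b][a]."""
    d = {}
    for (a, b), dist in pairs.items():
        d.setdefault(a, {})[b] = dist
        d.setdefault(b, {})[a] = dist
    return d


def quartet_delta(d, x, y, u, v):
    """Delta of one quartet: 0 = perfectly tree-like, 1 = maximally box-like."""
    largest, middle, smallest = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]],
        reverse=True,
    )
    if largest == smallest:  # star-like quartet: no conflict to measure
        return 0.0
    return (largest - middle) / (largest - smallest)


def mean_delta(d):
    """Mean delta over all quartets (assumed stand-in for a matrix Delta Value)."""
    values = [quartet_delta(d, *q) for q in combinations(sorted(d), 4)]
    return sum(values) / len(values)


# Toy distances that fit the tree ((A,B),(C,D)) exactly.
tree_like = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
                       ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3})
# Same taxa, but A is also close to C: the three pairings now disagree.
conflicting = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 2,
                         ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3})

print(mean_delta(tree_like), mean_delta(conflicting))
```

With real matrices the number of quartets grows quickly with the number of OTUs, so this brute-force loop is only meant to make the definition concrete.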

The most tree-like matrix (mDV = 0.184, which is a value typically found for molecular matrices allowing for inference of unambiguous trees) was that of Hufford & McMahon (2004) on Besseya and Synthyris. The number of MPTs was undetermined: using a ChuckScore of 39 steps (the best value found in test runs), PAUP* found more than 80,000 MPTs with a CI of 0.39 (third-lowest of all of the datasets), but an RI of 0.9 (highest value found).

A strict consensus network of the 80,003 equally parsimonious solutions, the network equivalent to the commonly seen strict consensus tree cladograms. Trivial splits are collapsed. Colours solely added for orientation (see next graph).

Oddly, the NJ tree had the same number of steps (under parsimony), but a much higher CI (0.69). The proportion of branches with a bootstrap support of > 50% was twice as large in a distance-based framework as under parsimony.

Bootstrap consensus networks based on 10,000 pseudoreplicates each. Left, distance-based and inferred using the Neighbour-Joining algorithm; right, using a branch-and-bound search under parsimony as optimality criterion (one tree saved per replicate). Edge-lengths reflect branch support of sole or competing alternatives; alternatives found in less than 20% of the replicates not shown; trivial splits are collapsed. Same colour scheme as above for orientation.
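The bookkeeping behind such a bootstrap consensus network can be sketched in a few lines: collect the non-trivial splits of every replicate tree, count how often each occurs, and keep those at or above a cut-off (20% here, as in the figure). This is only an illustration under simplifying assumptions: trees are written as nested Python tuples rather than read from a tree file, the ten replicate trees are invented, and all names are hypothetical. Drawing the network from the retained splits is left to dedicated software.

```python
from collections import Counter


def leaves(node):
    """Set of taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def nontrivial_splits(tree):
    """Bipartitions of one tree, each named by the side lacking a reference taxon."""
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        # Skip trivial splits (a single taxon against all others) and the root.
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(all_taxa - side if reference in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def split_support(trees, cutoff=0.2):
    """Frequency of each split among the trees, dropping those below the cut-off."""
    counts = Counter(s for tree in trees for s in nontrivial_splits(tree))
    return {s: c / len(trees) for s, c in counts.items() if c / len(trees) >= cutoff}


# Ten invented replicate trees: three competing topologies.
replicates = (
    [(("A", "B"), ("C", ("D", "E")))] * 6
    + [(("A", "C"), ("B", ("D", "E")))] * 3
    + [(("A", "D"), ("B", ("C", "E")))] * 1
)
support = split_support(replicates)
for split, freq in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(split), freq)
```

In this toy case the split D+E is found in 90% of the replicates, while the two competing placements of the remaining taxa (60% vs. 30%) are both kept, which is exactly the kind of alternative a single consensus tree would hide.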

The Neighbour-net based on this matrix has quite an interesting structure. Tree-like portions are clearly visible (hence, the low mDV) but the branches are not twigs but well developed trunks. The large number of MPTs is mainly due to the relative indistinctness of many OTUs from each other.


Neighbour-net based on simple mean (Hamming) morphological distances. Same colour scheme as above.
This distance-based 2-dimensional graph captures all main aspects of the tree inferences and bootstrap analyses, with one notable exception: B. alpina, which is clearly part of the red clade in the tree-based analyses. We can see that the orange group, B. wyomingensis and close relatives, is (morphology-wise) less derived than the red species group. Although B. alpina is usually placed in a red clade, it represents a morphotype much more similar to the orange cluster, as it lacks most of the derived character suite that defines the rest of the red clade. In trees, B. alpina is accordingly connected to the short red root branch as first diverging "sister" with a very short to zero-length terminal branch, but in the network it is placed intermediate between the poorly differentiated but morphologically inhomogeneous oranges and the strongly derived reds, being a slightly reddish orange. This reddishness may reflect a shared common origin of B. alpina and the other reds, in which case the tree-based inferences show us the true tree. Or it may just reflect a parallel derivation in a member of the B. wyomingensis species aggregate, in which case the unambiguous clade would be a pseudo-monophylum (see also our recent posts on Clades, cladistics, and why networks are inevitable and Let's distinguish between Hennig and cladistics).
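The "simple mean (Hamming) distance" used as input for such a Neighbour-net is easy to compute by hand. The sketch below assumes each OTU is scored as a string of single-symbol character states, treats "?" and "-" as missing, and divides the number of differences by the number of characters scored in both OTUs; polymorphic cells are not handled. The three OTUs are invented and do not come from the Besseya/Synthyris matrix.

```python
def mean_hamming(row_a, row_b, missing=("?", "-")):
    """Proportion of differing characters among those scored in both OTUs."""
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a in missing or b in missing:
            continue  # ignore characters not scored in one of the two OTUs
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no character scored in both OTUs")
    return differing / compared


def distance_matrix(matrix):
    """Pairwise mean Hamming distances as a nested dict d[a][b]."""
    names = sorted(matrix)
    return {a: {b: mean_hamming(matrix[a], matrix[b]) for b in names} for a in names}


# Invented toy matrix: otu2 is identical to otu1 wherever it is scored.
toy = {"otu1": "00110", "otu2": "0011?", "otu3": "11000"}
distances = distance_matrix(toy)
print(distances["otu1"]["otu2"], distances["otu1"]["otu3"], distances["otu2"]["otu3"])
```

Note that otu1 and otu2 come out with zero distance, which is the situation behind the "relative indistinctness of many OTUs" that inflates the number of equally parsimonious trees.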

Interpretation, what does low CI but high RI stand for?

The distinction between the Consistency Index and the Retention index has been of long-standing practical importance in phylogenetics. For a detailed discussion, you can consult the paper by Gavin Naylor and Fred Kraus (The Relationship between s and m and the Retention Index. Systematic Biology 44: 559-562. 1995).

For each character, the consistency index is the fraction of changes in a character that are implied to be unique on any given tree (i.e. one change for each character state): m / s, where m = the minimum possible number of character-state changes on the tree, and s = the observed number of character-state changes on the tree. Summing m and s separately across all characters, and then taking the ratio of the two sums, gives the ensemble consistency index for the dataset (CI).

The retention index (also called the homoplasy excess ratio) for each character quantifies the apparent synapomorphy in the character that is retained as synapomorphy on the tree: (g - s) / (g - m), where g = the greatest amount of change that the character may require on the tree. Once again, summing g, s and m across all characters before forming the ratio gives the ensemble retention index for the dataset (RI).
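To make the two indices concrete, here is a small sketch that computes them for a toy example. It assumes unordered characters, a rooted binary tree written as nested tuples, and Fitch's algorithm to count the observed steps s; m is taken as the number of states minus one and g as the number of taxa minus the count of the most frequent state. The ensemble values are formed from the summed m, s and g. Tree, characters and function names are all invented for illustration.

```python
from collections import Counter


def fitch_steps(tree, states):
    """Observed number of changes (s) of one unordered character on the tree."""
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # no shared state: one change is needed at this node
        return left | right

    state_set(tree)
    return steps


def min_max_steps(states):
    """Minimum (m) and maximum (g) possible changes for one unordered character."""
    counts = Counter(states.values())
    return len(counts) - 1, len(states) - max(counts.values())


def ensemble_ci_ri(tree, characters):
    """Ensemble CI = M / S and RI = (G - S) / (G - M), using summed m, s, g."""
    M = S = G = 0
    for states in characters:
        m, g = min_max_steps(states)
        M, S, G = M + m, S + fitch_steps(tree, states), G + g
    # Assumes at least one parsimony-informative character, so that G > M.
    return M / S, (G - S) / (G - M)


tree = ((("A", "B"), ("C", "D")), ("E", "F"))
characters = [
    dict(zip("ABCDEF", "110000")),  # fits the tree: one change
    dict(zip("ABCDEF", "101000")),  # homoplasious: two changes
]
ci, ri = ensemble_ci_ri(tree, characters)
print(round(ci, 3), ri)
```

Here M = 2, S = 3 and G = 4, so the toy data give CI = 2/3 and RI = 0.5.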

Both CI and RI are comparative measures of homoplasy — that is, the degree to which the data fit the given tree. However, CI is negatively correlated with both the number of taxa and the number of characters, and it is inflated by the inclusion of parsimony-uninformative characters. RI is less sensitive to these characteristics. However, RI is inflated by the presence of unique states in multi-state characters that have some other states shared among taxa and, therefore, are potentially synapomorphic.

It is these different responses to character-state distributions (among the taxa) that apparently create the situation noted above for morphological data. Neither CI nor RI directly measures tree-likeness, but instead they are related to homoplasy. So, it is the relative character-state distributions among the taxa that matter in determining their values, not just the tree itself.

For example, increasing the number of states per character will, in general, increase CI faster than RI. Increasing the number of states per character that occur in only one taxon will, in general, increase RI faster than CI.

Take-home message

This is just another example demonstrating that morphological data sets should not be used to infer (parsimony) trees alone, but analysed using a combination of Neighbour-nets and support Consensus Networks. No matter which optimality criterion is preferred by the researcher, the signal in such matrices is typically not trivial. It calls for exploratory data analysis, and inference methods that are able to capture more than a trivial sequence of dichotomies.

[Update 10/9/2018] Related data files can now be found in my Collection of morphological matrices (some including extinct taxa) and related phylogenetic inferences (Version 2) on figshare

Monday, April 2, 2018

Things you can learn in a blink about your data


As phylogeneticists, we commonly have to deal with data that we don't initially understand. In this post, I'll use a recently published 8-gene dataset on lizards to show how much can be learned prior to any deeper analysis, just from producing a few Neighbour-nets.

The data

Solovyeva et al. (Cenozoic aridization in Central Eurasia shaped diversification of toad-headed agamas, PeerJ, 2018) sampled species of toad-headed agamas (lizards) across their natural range (north-western China to the western side of the Caspian Sea), to study their genetic differentiation in time and space. To do so they used two datasets. The mitochondrial data covers four gene regions: coxI, cytB, nad2, and nad4, and are complemented by four nuclear gene regions: AKAP9, NKTR, BDNF, RAG1.

This caught my eye, because the authors' preferred trees have a bunch of low branch-support values, so that this would be a good opportunity to advocate some Consensus networks. They also report only values above a certain threshold, as apparently recommended by several reviewers. My reviewers have often recommended the same, but I always ignored this; I believe we should give the value, because it makes a difference if it's just below the threshold (e.g. bootstrap support, BS, of 49), or non-existent (BS < 5). The authors also note that their mitochondrial and nuclear genealogies are not fully congruent. In short, the signal from their matrix is probably not trivial, but could be interesting.

In contrast to many other journals, PeerJ has a strict open-data policy. Solovyeva et al. provide each gene as FASTA-formatted alignment as Supporting Information. So let's have some quick-and-dirty Neighbour-nets.

Using Neighbour-nets to decide on an analysis strategy

A comprehensive outgroup sampling can avoid outgroup-rooting artefacts, but adding very distant outgroups comes at a price. We need to invest much more computational effort, because the inference programmes not only try to optimize our focus group, but the entire taxon set. Another principal question is: what can an outgroup taxon provide as information for rooting an ingroup, while being completely different? Furthermore, when we do an ML (or Bayesian) analysis, e.g. with RAxML, we leave it to the program to optimize a substitution model (even when we predefine a model, its parameters will usually be optimized by the inference software on the fly). By adding distant outgroups, we optimize a model for them plus our focus group — by not using any outgroup, we optimize a model suiting just the situation in our focus group.

Fig. 1 shows the neighbour-net (uncorrected, codon-naive p-distances) for the first of the mitochondrial genes, coxI (the others are similar), which tells us a lot about the data to be used for the tree inferences.

Fig. 1 Neighbour-net based on mitochondrial (coxI) uncorrected p-distances. The diffuse, non-treelike signal expressed in the A and B fans will be a hard nut for the tree inference, and will have little influence on questions dealing with the focal genus.
We can see that outgroup diversity is much higher than for the focus group, and that most outgroup taxa are very distinct from the ingroup. Looking at the closest outgroups (Stellagama, Agama, Laudakia, Paralaudakia, Xenagama, Pseudotrapelus), we see that finding an unambiguous sister taxon to the focal genus will be difficult. And we can see that including more-distant taxa just gives the algorithm much more work (note the A and B bushes), but will hardly have any benefit for rooting the ingroup.

We also can see that the 3rd codon position is probably saturated to some degree, and that we will be dealing with a high level of stochasticity (randomly distributed mutation patterns) here — all terminal edges are long to very long. Since the same thing holds for the other three mitochondrial regions, it would not be a bad idea to do an additional inference including only the 1st and 2nd codon positions, in case all taxa should be included.
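A quick way to check the suspicion about the 3rd codon position is to compute the uncorrected p-distances separately for the 1st+2nd and the 3rd positions. The sketch below assumes an in-frame alignment that starts at codon position 1 and skips any site where either sequence has a gap or an ambiguity code; the two short sequences are invented, not taken from the coxI alignment.

```python
def p_distance(seq_a, seq_b, positions=(1, 2, 3)):
    """Uncorrected p-distance restricted to the given codon positions (1-based)."""
    valid = set("ACGT")
    compared = differing = 0
    for index, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if index % 3 + 1 not in positions:
            continue  # site belongs to a codon position we are not looking at
        if a not in valid or b not in valid:
            continue  # gap or ambiguity code in at least one sequence
        compared += 1
        differing += a != b
    return differing / compared if compared else float("nan")


# Invented in-frame toy sequences: both differences sit at 3rd positions.
seq1 = "ATGGCCAAA"
seq2 = "ATGGCTAAG"
all_sites = p_distance(seq1, seq2)
first_second = p_distance(seq1, seq2, positions=(1, 2))
third_only = p_distance(seq1, seq2, positions=(3,))
print(all_sites, first_second, third_only)
```

If the 3rd-position distances between distant taxa pile up near a plateau while the 1st+2nd-position distances keep increasing, that would support running the additional inference without 3rd positions.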

Using Neighbour-nets to understand the basic signal properties of your data

Fig. 2 shows the Neighbour-net (again, uncorrected p-distances) for one of the nuclear genes, AKAP9. The outgroup sample is somewhat different, but we can immediately see that this gene has more potential to infer unambiguous phylogenetic relationships among the sampled taxa: the graph has distinctly tree-like portions. We also see that saturation of the 3rd codon position is much less of an issue here, compared to the coxI gene (Fig. 1); the terminal edges are comparatively short, with respect to the central edge bundles. [Nonetheless, it is never wrong to analyze coding gene data partitioned: 1st and 2nd codon positions vs. 3rd codon position.]


Fig. 2 Neighbour-net based on the nuclear (AKAP9) genetic distances. Note the much more treelike structure of the graph, the generally shorter terminal edges, and last-but-not-least the notable difference between ingroup (focal genus) and outgroup taxa.
For the general differentiation patterns, compare the minute extent of the focal group, green background in Fig. 2 vs. the prominent bush in Fig. 1. It is clear that including distant outgroups will not have any benefit. We may even consider reducing the outgroup sample (if one has to include an outgroup at all) to the two genetically closest genera Stellagama and Paralaudakia.

Similarly structured graphs are found for the other three nuclear genes.


Producing some quick Neighbour-nets doesn't hurt

Sometimes reviewers will pick on them — "distance-based phenetic method" is something I used to get a lot. In this case, you can still produce them just to get some basic impressions on your data set. This will help you to understand the results of your tree inferences, including why some of your branches have ambiguous support.

It comes as little surprise that the taxa one can identify, in these networks, as likely sister genera of the focal genus, come up as sister taxa in the explicit phylogenetic analyses done by Solovyeva et al. (e.g. their fig. 2, showing the combined mitochondrial tree, and their fig. 3, showing the combined nuclear tree).

Solovyeva et al. (2018) performed some incongruence tests (AU-topology test) using single-gene inferences (going further than many other studies), but did not dig deeper. One of the authors answered my question about potential signal issues that may cause topological incongruence between ML and Bayesian trees, as well as ambiguous support, but he considers this to be solely a problem with methods: different algorithms prefer different phylogenies. Having looked at the basic differentiation pattern in the gene regions using Neighbour-nets, it may be more than just an issue with methods, since ML and Bayesian analysis should always support the same splits when using the same or similar substitution models.

Like many other studies, the authors also use the data for Bayesian dating and dating-dependent biogeographic analysis. Lacking any ingroup fossils, the authors could only constrain nodes within the outgroup subtree, which are nodes far from those that they discuss and estimate. I have my doubts that we can put much faith in the uncorrelated clock process to handle such extreme differences between focus group (ingroup) and (constrained) outgroup-taxon lineages as seen in Fig. 2. Estimates for rate shifts between outgroup and ingroup usually render ingroup age estimates to be too young, compared to age estimates obtained with ingroup fossils. This is something that can be directly deduced from a graph like the one in Fig. 2.

Data and networks can be found at figshare

The original paper provides a comprehensive supplement with a lot of interesting information, but the FASTA-files, each comprising a single gene region and a few editing issues, are not yet ready to use. Hence, I transformed them into NEXUS-files, and generated a combined data matrix. The files and the Neighbour-nets for each gene region (and a full single-gene maximum likelihood analysis) can be found on figshare.
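For readers who want to do the same with their own data, a bare-bones FASTA-to-NEXUS conversion needs little more than the following. This is a sketch under simplifying assumptions, not the procedure used for the figshare files: it keeps only the first word of each FASTA header as the taxon name, expects names without spaces or punctuation that would need quoting in NEXUS, and refuses input in which the sequences differ in length.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence (input order kept)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header as taxon name
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def to_nexus(records, datatype="DNA"):
    """Write an aligned set of sequences as a simple NEXUS DATA block."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; this is not an alignment")
    width = max(len(name) for name in records)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={lengths.pop()};",
        f"  FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
        "  MATRIX",
    ]
    lines += [f"  {name.ljust(width)}  {seq}" for name, seq in records.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Invented two-sequence example.
example = ">taxon_a first gene\nACGT\nAC\n>taxon_b\nAC-TAC\n"
nexus = to_nexus(read_fasta(example))
print(nexus)
```

The length check is the useful part in practice: files with editing issues tend to fail it, which flags them before they reach a network or tree inference program.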

Monday, March 19, 2018

Comparing neighbour-nets and PCA graphs – the example of Mediterranean oaks


Distance matrices offer many avenues for exploring data. A common method is Principal Component Analysis (PCA). A much less common method is the use of Neighbour-nets. We have previously compared PCA and Neighbor-nets using theoretical data. In this post, I'll compare a PCA graph and the corresponding Neighbour-net using some empirical data.

Genetic differentiation in Mediterranean oaks

In the paper by Vitelli et al. (2017), we explored the phylogeographic structuring of a group of Mediterranean oak species. The species represented the westernmost populations of one of the main Eurasian oak lineages: the evergreen Quercus section Ilex ("Ilex oaks"; see Denk et al. 2017 for an up-to-date classification of oaks; see also this figshare-spread-sheet). It was a follow-up study to the one by Simeone et al. (2016).

We found that one species, the most widespread (Quercus ilex), carries plastids from quite different origins. The 2016 paper identified three main plastid haplotypes in the Ilex oaks: the unique (within the entire genus) "Euro-Med" haplotype; the "Cerris-Ilex" haplotype shared with western Eurasian members of (essentially deciduous) section Cerris, the sister clade of section Ilex (see Denk & Grimm 2010; confirmed by NGS SNP data, Hipp et al. 2015); and the "WAHEA" haplotype, an east-bound haplotype of section Ilex. Vitelli et al. aimed to characterise the range of these three main haplotypes throughout the four Ilex oak species found in the Mediterranean.

Figure 1 shows the two multivariate data analyses, along with a map of the sample locations.

Fig. 1 Phylogeographic structure of Quercus section Ilex around the Mediterranean (after Vitelli et al. 2017). a. PCA graph, and b. Neighbour-net based on the same inter-haplotype pairwise distance matrix. c. A map depicting the distribution of main haplotype groups labelled by Roman numerals: I haplotypes of the "WAHEA" lineage, II "Cerris-Ilex"-lineage, III–VI, subtypes of the "Euro-Med" lineage (cf. Simeone et al. 2016, fig. 1)

Regarding the overall diversification pattern, the PCA graph and the Neighbour-net show similar things. The "Euro-Med" lineage is the most diverse group, with four subgroups — two larger (and widespread) ones (haplotypes IV, V) and two rare ones (III, VI) only found in the Aegean region.
  • According to the PCA, haplotype III (colored olive) is intermediate between "Euro-Med" IV (blue) and the haplotype II (yellow), which represents another lineage of oak haplotypes, the Aegean/Northern Turkish "Cerris-Ilex" lineage. The same can be seen in the Neighbour-net.
  • The PCA further places haplotype VI (red) as equidistant to all of the other types, with IV and I (green; representing the oriental "WAHEA" lineage) being a bit closer. In the Neighbour-net, we can sum up the length of the connecting edge-bundles to find the same pattern. A difference between the two analyses is that VI is connected only with part of V (purple) by a pronounced edge bundle, but not connected to I (green). This is strikingly different from III, which shares an edge bundle with II and IV+V.

At this point in the analyses, we can make use of the dual nature of the Neighbour-net: it is both a distance-based 2-dimensional graph and a meta-phylogenetic network (Fig. 2). Based on the PCA, which also is a 2-dimensional depiction of the differentiation, one may be tempted to interpret VI as a bridge between IV/V and I, not much different from how III bridges between II and IV (Fig. 1). On the other hand, the network (Figs 1, 2) informs us that VI is a likely relative of V, which in turn is a likely relative of IV; and the only connection between I and VI is their increasing distinctness to the other haplotypes of the "Euro-Med" lineage, III/IV/V.

Fig. 2 The main splits expressed in the neighbour-net. III may either be sister to II, or is part of a clade comprising IV and V.

Using the main split patterns in the Neighbour-net, we can infer the one phylogenetic hypothesis, a tree, that can accommodate them all (Fig. 3).

Fig. 3 The tree solution congruent with the major split patterns (Fig. 2).

I rejected the alternative sister relationship between II and III because this would imply a sister clade that only includes IV, V and VI but not III, which clashes with the affinity of III to IV and V (Fig. 2). Interpreting III as a sister of IV and V explains both its affinity to II (putative sister lineage to III–VI) and to IV and V.
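The reasoning above rests on split compatibility: two splits can sit on the same tree only if at least one of the four intersections of their sides is empty. A minimal sketch of that check follows. The three splits are my own encoding of the alternatives described in the text (II+III, III+IV+V, and the "Euro-Med" group III–VI) and are not read off the actual network file.

```python
def compatible(split_a, split_b, taxa):
    """True if both splits can be displayed by one tree (some side-intersection is empty)."""
    a, b = frozenset(split_a), frozenset(split_b)
    rest_a, rest_b = taxa - a, taxa - b
    return any(not (x & y) for x in (a, rest_a) for y in (b, rest_b))


haplotypes = frozenset({"I", "II", "III", "IV", "V", "VI"})
ii_iii = {"II", "III"}                  # III as sister of II
iii_iv_v = {"III", "IV", "V"}           # III grouped with IV and V
euro_med = {"III", "IV", "V", "VI"}     # the "Euro-Med" lineage as a whole

print(compatible(ii_iii, iii_iv_v, haplotypes))    # the two alternatives for III
print(compatible(iii_iv_v, euro_med, haplotypes))  # III+IV+V nested in "Euro-Med"
print(compatible(ii_iii, euro_med, haplotypes))    # II+III vs. "Euro-Med"
```

Under this encoding, II+III is incompatible with both other splits, whereas III+IV+V nests within the "Euro-Med" group, so only the second alternative yields a single tree accommodating the remaining main splits.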

We might accept that all three plastome lineages are reciprocally monophyletic (in a quite broad sense), meaning that each lineage evolved from a pool of closely related mother plants. If so, then the higher similarity between III ("Euro-Med") and II ("Cerris-Ilex") may represent a relative lack of derivation, whereas the dissimilarity between VI ("Euro-Med") and I ("WAHEA") to all other types can be due to a higher level of distinctness. And we can come up with a "cactus"-type metaphorical tree (Fig. 4) explaining the Neighbour-net (and PCA graph).

Fig. 4 A "cactus"-type tree metaphor for the evolution of oak plastomes (based on the results of Simeone et al. 2016, Vitelli et al. 2017, and – outside the focus group, i.e. Mediterranean oaks of Subgenus Cerris – some partly arcane, not yet published knowledge, I have access to)
We thus learn more from the Neighbor-net than from the PCA.

There's no reason to stop with a PCA

One empirical example is far from being conclusive, but it shows what the Neighbour-nets have to offer.

Trees are fine for proposing phylogenetic hypotheses, but we should always be aware of equally valid alternatives to the tree that we have optimized. And with increasing numbers of taxa, inferring optimal trees and assessing their alternatives require increasing effort, and checking. For many questions, PCA has been used as a quick alternative, including in large-sample genetic studies (see Continued misuse of PCA in genomics studies).

Neighbour-nets are just a natural step further towards a phylogeny, which comes with very little extra effort and can use the same data basis: a matrix of pairwise distances. In the case of genetic data, which usually reflects at least the main aspects of the actual phylogeny (trivial or complex) behind it, the "true tree", they should be obligatory. They are much more than just a clustering approach (even though their algorithm is based on a cluster algorithm) or a bivariate analysis. Neighbour-nets are meta-phylogenetic networks that have the capacity to contain the one or many topologies explaining the data. They are as straightforward as PCA, when it comes to recognising "natural", coherent and equal, groups (in contrast to phylogenetic trees).

Postscript

I would have liked to add some more examples with non-genetic data, i.e. data sets where the distances are not the result of an explicit phylogenetic process. But this requires much more effort, since none of the PCA studies I browsed had documented the distance data/matrix used. However, I'm sure that inferring a Neighbour-net based on no-matter-what similarity data used for PCA can be a fruitful and revealing endeavour (and the reason why you find Neighbour-nets based on U.S. gun legislation, breast sizes, languages, cryptocurrencies, etc. on this blog, but few PCAs). So, try it out the next time you make a PCA, and share the results e.g. by using our comment option or even a post as guest-blogger.

References

Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366.

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. 2017. An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Heidelberg, New York: Springer, p. 13–38. Free Pre-Print at bioRxiv [major change: Ponticae and Virentes accepted as additional sections in final version]

Hipp AL, Manos P, McVay JD, ... , Avishai M, Simeone MC. 2015 [abstract]. A phylogeny of the World's oaks. Botany 2015. Edmonton.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. 2016. Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4: e1897 [open access, comments/questions welcomed]

Vitelli M, Vessella F, Cardoni S, Pollegioni P, Denk T, Grimm GW, Simeone MC. 2017. Phylogeographic structuring of plastome diversity in Mediterranean oaks (Quercus Group Ilex, Fagaceae). Tree Genetics and Genomes 13:3.