Showing posts with label Median network. Show all posts

Monday, June 1, 2020

To what degree are Median-joining networks phylogenetic?


In a comment to the recent paper by Forster et al. (2020), Sánchez-Pacheco et al. (2020) argue that Forster et al.'s analysis is "neither phylogenetic nor evolutionary" because it's based on the use of a Median-joining network. They don't re-analyse the data, but instead mostly refer to a paper they published four years ago in Cladistics (Kong et al. 2016), the journal of the Willi-Hennig Society.

In that earlier paper, Kong et al. conclude:
Other than fast computation and very attractive graphics, MJNs [Median-joining networks] harbour no virtue for phylogenetic inference. MJNs are distance-based, unrooted branching diagrams with cycles that say nothing about the evolutionary history due to the absence of direction. MJ was introduced in 1999 and, in contrast to most scientific ideas, its application has spread rapidly through copying the methods of others, and, unfortunately, with little further scrutiny. We hope that the theoretical arguments presented here can reverse this trend.
It seems unlikely that it will, as I will argue here.

What makes a graph a phylogenetic tree or network – direction

Kong et al. argue that a line graph needs to be directed (ie. the edges indicate a time direction) in order to represent a phylogeny, which is a good point. After all, a phylogenetic tree is a directed (rooted) branching diagram that represents the hypothesized relationships among the organisms under study.

A phylogenetic tree (see also: Fritz Müller and the first phylogenetic tree)

A phylogenetic network is the generalization of a phylogenetic tree, as it combines lineage splits (divergences) with lineage anastomoses.

A phylogenetic network including a reticulation leading to a circle in the graph — B is the product of crossing of lineages that produced its sisters A and C.

Since a MJ network is, per se, an undirected graph, it thus cannot be an explicit phylogenetic network.

However, following this argument, few inferred trees are directly a phylogenetic tree, either — including the Nextstrain-generated tree on the GISAID page that is promoted by Mavian et al. 2020 (which is another comment to Forster et al., focusing on data issues). Irrespective of which criterion we use to optimize the tree, almost all trees we infer (with no matter what tree inference software) are unrooted graphs — in general, we root them only after the analysis, by defining one leaf or a subtree as an outgroup. (Note, this includes those based on parsimony, the method of choice of the Willi-Hennig Society and Cladistics to this day.)

The difference between inference and interpretation: Using the tip sequences, we can infer a single most parsimonious tree (6 steps long; using PAUP*'s branch-and-bound or NETWORK's MJ algorithm), which is also the most likely and the shortest (distance-based) unrooted tree. By defining a root – here: one taxon designated as outgroup and assuming that all single-taxon-unique sequence patterns are autapomorphies – we can interpret the inferred tree as four different phylogenetic trees.

The same can be said of MJ networks — outgroup rooting can be applied (Finding the CoV-2 root).
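The step from inference to interpretation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list, the node names (`n1`, `n2`) and the function `root_with_outgroup` are hypothetical, and the sketch simply places a root on the edge leading to the chosen outgroup and orients all other edges away from it.

```python
from collections import defaultdict, deque


def root_with_outgroup(edges, outgroup, root_label="root"):
    """Orient an unrooted tree by placing a root on the outgroup's edge.

    `edges` is a list of undirected (u, v) pairs; the result is a list of
    directed (parent, child) pairs. Hypothetical helper, for illustration.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if len(adj[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    (neighbour,) = adj[outgroup]
    # Split the outgroup edge and insert the root node on it.
    adj[outgroup].discard(neighbour)
    adj[neighbour].discard(outgroup)
    adj[root_label] = {outgroup, neighbour}
    adj[outgroup].add(root_label)
    adj[neighbour].add(root_label)
    # Breadth-first traversal from the root gives every edge a direction.
    directed, seen, queue = [], {root_label}, deque([root_label])
    while queue:
        parent = queue.popleft()
        for child in sorted(adj[parent]):
            if child not in seen:
                seen.add(child)
                directed.append((parent, child))
                queue.append(child)
    return directed


# One unrooted four-taxon tree (hypothetical), two different outgroups.
unrooted = [("A", "n1"), ("B", "n1"), ("n1", "n2"), ("C", "n2"), ("D", "n2")]
rooted_on_d = root_with_outgroup(unrooted, "D")
rooted_on_a = root_with_outgroup(unrooted, "A")
```

The inferred graph is the same in both calls; only the direction of the internal edge between `n1` and `n2` changes with the choice of outgroup, which is the interpretation step described above.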

Difficulty in depicting ancestor-descendant relationships

A phylogenetic relationship focuses on ancestors, which, for the purpose of inferring a phylogenetic tree, are considered to be purely hypothetical, although they are not hypothetical in a MJ network (or related graphs). We can easily create character sets where the inferred tree will not "represent the hypothesized relationship". Most parsimony studies show a strict consensus cladogram of most-parsimonious trees (MPT). This is unproblematic, as long as all leaves have the same age, and all of the cladogenic events resulted in unique, lineage-conserved character patterns. We then:
  1. would only infer but a single MPT;
  2. have no zero-length branches.
So, following Kong et al.'s logic, the trees inferred from any dataset that results in more than one MPT and has subtrees including zero-length branches (like our example above) cannot qualify as phylogenetic trees.

Median-joining networks are, like MP trees (both use parsimony as the optimality criterion), vulnerable to homoplasy (Using Median networks to study SARS-CoV-2; see also Mavian et al. 2020), but while a MP tree (or any other tree we infer) cannot resolve ancestor-descendant relationships, MJ networks can (see eg. Why do we still use tree for Neanderthal genealogy).

Median (or MJ) network, left, and MPT, right, inferred from a perfect matrix. "x" = all-ancestor, ie. represents the root. "a" is the ancestor of "B" and "C", "d" of "f" to "H", "f" of "G" and "H". While the median network depicts all ancestor-descendant relationships, the MPT only depicts them indirectly by trichotomies including the ancestor as zero-length branch.
Imperfect matrices (data including homoplasy) lead to wrong edges and branches. Being able to recognize ancestors, the MJ network comes closer to the phylogenetic tree (same as above; from Clades, cladograms, cladistics, and why networks are inevitable).

Hence, Bandelt et al.'s (1999) statement, as cited by Kong et al., that “reconstructing phylogenies from intraspecific data ... is often a challenging task because of large sample sizes and small genetic distances between individuals”. Such data results in largely uninformative, comb-like MPT strict consensus trees. This is because identical sequences, equally probable alternative pathways, non-dichotomous differentiation patterns, and ancestral sequence variants present in the data increase the number of MPTs (sometimes to near-infinity). This leads to the collapse of branches in the strict consensus tree used to summarize the MPT sample. Probabilistic methods struggle, too, because the likelihood surface of the tree space is too flat to make a call.
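The collapse of branches in a strict consensus can be illustrated with a small Python sketch. It assumes that each tree is represented simply as a set of clades (each clade a set of tip labels); the three "MPTs" and the helper names are hypothetical and not taken from any dataset discussed here.

```python
def clade(*taxa):
    """Hypothetical helper: a clade is just the set of its tip labels."""
    return frozenset(taxa)


def strict_consensus(trees):
    """Keep only the clades that occur in every tree of the sample."""
    if not trees:
        return set()
    shared = set(trees[0])
    for tree in trees[1:]:
        shared &= set(tree)
    return shared


# Three equally parsimonious trees that disagree on how A, B and C are related.
mpt_1 = {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")}
mpt_2 = {clade("B", "C"), clade("A", "B", "C"), clade("D", "E")}
mpt_3 = {clade("A", "C"), clade("A", "B", "C"), clade("D", "E")}

consensus = strict_consensus([mpt_1, mpt_2, mpt_3])
```

Only the clades {A, B, C} and {D, E} survive, so A, B and C end up in an unresolved trichotomy. With many near-identical sequences the same effect produces the comb-like consensus trees mentioned above.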

[Kong et al. point to the mathematical definition of 'network', as "nothing more than an unrooted branching diagram with reticulation" but not of 'tree', which they consider is always a directed acyclic graph, ie. synonym to 'phylogenetic tree'. However, it is, inference-wise, clearly nothing more than an unrooted branching diagram without reticulation.]

Confusing heuristics with principle

To discredit the MJ network, Kong et al. then "... focus on its phenetic nature."

There is a tendency among cladists to dismiss a method as "distance-based", as this is treated as synonymous with phenetics. In reply, Joe Felsenstein commented on this alleged fundamental difference between distance-based and parsimony methods of tree inference (Felsenstein 2004, Chapter 10, p. 145f, The irrelevance of classification):
The terminology is also affected by the lingering emphasis on classification. Many systematists believe that it is important to label certain methods (primarily parsimony methods) as "cladistic" and others (distance matrix methods, for example) as "phenetic". These are terms that have rather straightforward meaning when applied to methods of classification. But are they appropriate for methods of inferring phylogenies? I don't think they are. Making this distinction implies that something fundamental is missing from the "phenetic" methods, that they are ignoring information that the "cladistic" methods do not. In fact, both methods can be considered to be statistical methods, making their estimates in slightly different ways.
The following chapter in Felsenstein's book (Chapter 11, pp. 147–175) deals exclusively with the "phenetic" distance matrix methods because they were the first to be used to infer phylogenetic trees (their limitations are outlined on pp. 174f).

Because the inference of MJ network starts from the generation of a Minimum-spanning network, which is generated from a distance matrix, Kong et al. argue the MJ network is merely a distance-based graph, ie. "phenetic", and "not phylogenetic". Any NP hard problem requires heuristics but, just because we use a distance-based graph to start with, doesn't determine whether the end-product is or is not a distance-based graph.
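To make the distance-based starting point concrete, here is a minimal Python sketch of a minimum spanning network built from pairwise Hamming distances. It is not NETWORK's implementation: the function names and the toy sequences are assumptions, and the sketch uses the textbook property that an edge belongs to some minimum spanning tree exactly when its endpoints are not already connected by strictly shorter edges.

```python
from itertools import combinations, groupby


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_network(seqs):
    """Union of all minimum spanning trees, as (u, v, distance) edges."""
    names = sorted(seqs)
    parent = {name: name for name in names}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    pairs = sorted(
        (hamming(seqs[u], seqs[v]), u, v) for u, v in combinations(names, 2)
    )
    network = []
    for dist, group in groupby(pairs, key=lambda item: item[0]):
        # Decide for the whole distance class first, then merge components,
        # so that equally short alternatives are all retained.
        keep = [(u, v, dist) for _, u, v in group if find(u) != find(v)]
        for u, v, _ in keep:
            parent[find(u)] = find(v)
        network.extend(keep)
    return network


# Toy data (hypothetical): four binary types forming a square.
toy = {"a": "00", "b": "01", "c": "10", "d": "11"}
msn = minimum_spanning_network(toy)
```

For the toy data all four one-step edges are kept, which gives a cycle. The distance matrix is only the scaffold here; the MJ procedure goes on to add median vectors and evaluates the result by parsimony.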

For instance, the Neighbor-joining (NJ) algorithm (Felsenstein 2004, p.166ff) is a cluster algorithm, which finds a phylogenetic tree fulfilling either the Minimum evolution (ME, p. 159f) or Least-squares (LS) criteria (p. 148ff). Thus, the tree inferred is, indeed, based on a distance-matrix via NJ, but it is not a cluster dendrogram — instead, it is a ME or LS optimised phylogenetic tree. Similarly, FastTree, IQTree, and RAxML are extremely fast programmes to infer Maximum likelihood (ML) phylogenetic trees; but, while FastTree and IQTree start with "phenetic" Neighbor-joining trees, RAxML (like GARLI before) infers first a quick-and-dirty parsimony tree. The final product in both cases is a topology optimized under ML, and the results are hence ML trees and not distance-based or MP trees (even though they started that way).

The final MJ network shows the most parsimonious evolutionary pathways that change one sequence type into another. When you infer it with the NETWORK program, all inferred mutations are mapped onto the final graph, and, using the Steiner post-analysis step, you can look through all of the MP trees that have been included in this graph. However, according to Kong et al. these are not MP trees:
[Following Farris (1970)] Invoking principles of parsimony does not validate a phenetic technique as being a phylogenetic method. Indeed, the best Steiner trees are not necessarily the most parsimonious trees.
Kong et al. did not provide any real-world data examples, possibly because they would be very difficult to find. Just take my simple example above: clearly the MJ tree is actually a most-parsimonious solution to the data. Alternatively, you could take any data set for which you can infer plausible MPTs (ie. data where the rate of change is low), eg. using the TNT program, and compare the result – the Consensus network of all MPTs, not the collapsed strict consensus tree (Stop using cladograms!) – with the Steiner trees inferred using NETWORK and the MJ algorithm.

Are medians ancestors? Do cycles represent reticulation?

Kong et al.'s final point is:
BEA99 [ie. Bandelt et al. 1999] stressed that median vectors can be interpreted biologically as existing unsampled or extinct ancestral sequences (i.e. they can represent missing intermediates; Fig. 3). However, a median vector in an MJ analysis is a sequence generated by majority, and is a mathematically drawn point in the final MJN that connects a triplet of sequences. The resulting “evolutionary paths in the form of cycles” (BEA99, p. 37) merely illustrates the failure of the algorithm to choose between alternative, equally optimal connections due to the modification of Kruskal’s algorithm. Consequently, a cycle represents an analytical artefact rather than an evolutionary scenario (Salzburger et al., 2011).
It is obvious to anyone who has ever used MJ networks, and is familiar with their own data, that Medians are likely to be ancestors, and that medians separated by parallel edge bundles are usually alternative ancestors. But, like all inferences, MJ networks may not capture the complex truth.
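What "a sequence generated by majority" means can be shown with a short Python sketch. It is a simplification, not the MJ algorithm itself: the function name is hypothetical, the sequences are invented, and columns with three different states are merely flagged with "?" instead of being expanded into several candidate medians.

```python
from collections import Counter


def median_vector(a, b, c):
    """Column-wise majority consensus of three aligned sequences."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(a, b, c):
        state, count = Counter(column).most_common(1)[0]
        # Without a majority (three different states) the column is ambiguous.
        consensus.append(state if count >= 2 else "?")
    return "".join(consensus)


# The median can coincide with a sampled sequence ...
sampled_median = median_vector("AAGU", "ACGU", "AAGC")
# ... or be a type that is not in the sample at all.
unsampled_median = median_vector("ACGU", "AAGC", "GAGU")
```

In the first call the median is identical to one of the three input sequences, ie. a sampled type sits at the internal node. In the second call the median is a fourth, unsampled type that differs from each input by a single step, which is the situation read as a missing intermediate.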

A phylogeny involving a recombination ...
... and the MJ network that can be inferred on the same data including two wrong edges (red). The West-1/East-ancestor recombinant is resolved as the product of hybridisation of the West- and East-ancestors, while West-2, a descendant of West-1, is resolved as hybrid of West-1 and the recombinant. Any tree included in the network would have 7 steps (ie. is most-parsimonious).
These reconstructed medians thus do bring Kong et al. to their only valid point, which, however, doesn't apply to the method as proposed, but is instead a common misinterpretation of MJ networks — their cycles do not necessarily reflect reticulation.

Bandelt et al. (1999) clearly state that the MJ network is only an approximation, to deal with complex situations. The cycles usually represent equally optimal alternative pathways, and are usually the result of homoplasy but not reticulation. The final goal is hence to get a graph with as few reticulations as possible but as many as are necessary (see NETWORK manual on selection of the epsilon parameter and weighting).

The Sánchez-Pacheco et al. critique of Forster et al.

As I showed in an earlier post using actual CoV-2 data, only this part of Sánchez-Pacheco et al.'s critique of Forster et al.'s paper is valid: we do need to be very careful before we interpret parallel edge bundles in virus-based (or other) MJ networks as being evidence for reticulation. MJ networks can be phylogenetic networks, but they are still consensus networks of competing, equally parsimonious alternatives. If we take a strict position, then most MJ networks are probably not phylogenetic networks; but neither are all trees phylogenetic trees.

Everything else in their comment is simply cladistic lore. Most importantly, their critique ignores the fact that the obvious alternative to MJ networks when analysing low-divergent virus data, which is parsimony-based trees, has exactly the same data-inherent shortcomings, ie. vulnerability to homoplasy and the impossibility of detecting and reconstructing recombination. Trees also have an extra one: they treat all samples as being of the same age and generation, and thus have to resolve actual ancestors as sisters, which increases the number of possible, equally parsimonious solutions.

The 19, 7-step long MPTs that can be inferred for the recombination example using PAUP*'s branch-and-bound algorithm – rooted with the Source, the common ancestor of all ("AllAnc") – and their strict consensus tree (gray background, 11-steps long). "Best" shows a phylogenetic tree that comes closest to the true tree: ancestors are resolved as zero-length tips in clades including their descendants. "Close" denotes trees that only misplace the recombinant (purple), which – being a recombinant of the East ancestor (red) and West-1 (blue) – should be placed in a tree as sister to either parent.

The consequence of this is that what Sánchez-Pacheco et al. and Kong et al. criticize about the MJ networks applies even more to the predominately used phylogenetic trees. As David pointed out earlier (Problems with the phylogeny of coronaviruses): virus trees may be inferred using phylogenetic methods but they effectively depict only similarity patterns.

References

Bandelt H-J, Forster P, Röhl A. 1999. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37-48.

Forster P, Forster L, Renfrew C, Forster M. 2020. Phylogenetic network analysis of SARS-CoV-2 genomes. PNAS 117:9241–9243.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Mavian C, Kosakovsky Pond S, Marini S, Rife Magalis B, Vandamme A-M, Dellicour S, Scarpino SV, Houldcroft C, Villabona-Arenas J, Paisie TK, Trovão NS, Boucher C, Zhang Y, Scheuermann RH, Gascuel O, Tsan-Yuk Lam T, Suchard MA, Abecasis A, Wilkinson E, de Oliveira T, Bento AI, Schmidt HA, Martin D, Hadfield J, Faria N, Grubaugh ND, Neher RA, Baele G, Lemey P, Stadler T, Albert J, Crandall KA, Leitner T, Stamatakis A, Prosperi M, Salemi M. 2020. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable. PNAS doi:10.1073/pnas.2007295117

Sánchez-Pacheco S, Kong S, Pulido-Santacruz P, Murphy RW, Kubatko L. 2020. Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary. PNAS, doi:10.1073/pnas.2007062117.

Monday, May 4, 2020

Finding the CoV-2 root


In my last post, I looked at the prospects and pitfalls of using Median networks to trace virus evolution in the case of the SARS-CoV-2 virus. In this post, I will explore how we can try to root the CoV-2 MJ network, and why using an outgroup, as done by Forster et al., PNAS (2020), is not the best choice.

We'll stick to our 88 sequence dataset because I have already investigated its characteristics in my last post (XLSX-file included in the figshare file set). Here's the unweighted MJ network that can be inferred from these data, including all 146 mutation patterns (145 characters because one indel overlaps with a SNP – single-nucleotide polymorphism).

Median-joining network for the 88 samples in our early March harvest, color-coded for provenance and with sample dates. Four mutations (purples) are resolved as homoplasies. Red edges – potential recombination with unsampled types, line thickness gives here the number of deviating SNPs. Forster et al's Types given for orientation.

As in Forster et al.'s graph, we have one box in the central part of the graph, probably between Forster et al.'s type B (the big pie in the center and its satellites) and their type C (here: the long-edge global group including the Australian and European samples).

There's a useful rule-of-thumb in population genetics: a widespread, frequent haplotype with many satellite types is often the ancestral type of the investigated sample. This, in our case, includes the reference CoV-2 genome ("Wuhan 1"; NC_045512, sampled 26/12/2019). Having investigated in detail the data behind the graph (see the last post; adding sample date, provenance, graph above), we can put forward hypotheses as to what degree the parallel edge bundles represent alternative evolutionary scenarios, or are alternatively the result of potential recombinants between CoV-2 sub-lineages.

This allows us to depict an evolutionary scenario for our early samples, to picture (i) how the putative original variant (Wuhan 1/Type B) was distributed during the initial phase (largely unmodified; light gray arrows in the next figure), (ii) where mutations happened to give rise to sequentially new (sub)types, and (iii) where recombination may have happened (crosses in the figure). Some links (the dotted lines) require further data in order to decide whether the shared mutation is lineage-diagnostic (as indicated by the MJ network) or a convergence.


Early evolution of CoV-2 in time (earliest dates) and space (coloring). Different grays distinguish the main two/three lineages: 20% gray, original Wuhan type (Forster et al.'s type B), dispersed unmodified to rest of China (sampled), Nepal (not sampled), the cruiseship (sampled) and North America (not sampled); 40% gray, potential type C differing by one transversion (basic type not sampled); 60% gray, Forster et al.'s type A differing by two transitions, basic variant found in a sample from Taiwan (Jan 31st). The circle sizes give the number of additional mutations within a lineage and geographic cluster; the x indicate potential recombination (within or between main types/lineages).

The early samples demonstrate that the later USA samples were infected by various (sub)types by mid-/end of January (by up to six lineages), while most of the variation arising in locked-down Wuhan did not escape (at this early stage) — the earliest two samples from 23/12 (MT019529) and 26/12 (reference genome) differ by three mutations.

The quarantined cruise-ship in Japan was infected with the unmodified Wuhan 1 type, which then evolved within the vessel's population. So, this quarantine worked, because the vessel's mutated viruses are not found elsewhere. While the 11121-transition has probably been propagated in the vessel's population via recombination, its occurrence outside (in the Jetsetter/USA lineage, type C?, and USA-Type A) could be due to homoplasy: both the Jetsetter/USA and the A-type USA genomes are (strongly) derived. The 24072 and 28892-transitions point to reticulation between (less evolved) American B- and (highly evolved) A-type lineages; the MJ network can't resolve the resulting box because the American A-type showing the 24072 mutation is strongly derived.

Note: It's also interesting to compare our graph with the tree-based virus "phylogeny" on the GISAID page, which doesn't seem to include the cruise-ship samples. Note that most of the deep branches of the GISAID tree are unsupported ("no mutation"), and samples identical to the reference can be found among the early samples of most main "clades" depicted in the GISAID "phylogeny".

Substitution probabilities

It is also straightforward to identify likely (→ U) and less likely substitutions (all others), as shown in the table.


There is a clear substitutional bias: transitions are more likely than transversions, so the approximate substitution model is abaaba for substitutions replacing the reference / CoV-2-consensus nucleotide. But the model is asymmetrical: Us are more likely to replace Cs than vice versa, while A/G transitions are balanced. Stochastically distributed singleton/rare mutations have, in general, a high probability of showing a U. So, a shared C is more likely to be a conserved, shared ancestral pattern (what Hennig called a "symplesiomorphy"). A shared U may be a uniquely shared, derived pattern (a "synapomorphy"), or a convergently (in parallel) obtained, derived pattern, a homoplasy. Low-frequency Cs, but also As and Gs at predominately U positions, are most probably synapomorphies as well (based on the data situation and observed substitution probabilities).
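A small Python sketch shows how such a tally and the six-letter model code fit together. It assumes the common convention that the six letters refer to the substitution pairs AC, AG, AU, CG, CU, GU in that order; the function names and the list of observed substitutions are hypothetical and are not the counts from the table.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CU")
# Assumed convention: one letter per unordered pair, in alphabetical order.
PAIR_ORDER = ("AC", "AG", "AU", "CG", "CU", "GU")
MODEL = "abaaba"


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    if ref == alt:
        raise ValueError("a substitution needs two different nucleotides")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def rate_class(ref, alt):
    """Rate class of a substitution under the six-letter model code."""
    return MODEL[PAIR_ORDER.index("".join(sorted((ref, alt))))]


def tally(substitutions):
    """Count directed substitutions, eg. 'C>U' separately from 'U>C'."""
    return Counter(f"{ref}>{alt}" for ref, alt in substitutions)


# Hypothetical observations as (reference, replacement) pairs.
observed = [("C", "U"), ("C", "U"), ("U", "C"), ("G", "A"), ("A", "G"), ("G", "U")]
counts = tally(observed)
```

Under this reading, abaaba puts both transition pairs (AG and CU) into class "b" and all transversions into class "a". The directed tally is what reveals the asymmetry: a symmetric six-letter code cannot express that C to U is more frequent than U to C.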

Currently, there is no maximum likelihood analog to Median networks, but one could weight mutation patterns differently (see, e.g., guidelines provided in NETWORK under the Help > About menu item in the Median Joining analysis window).

With each successive virus generation, the probability for a homoplasious U increases. Thus, when using MJ networks for virus evolution, we should consider analyzing the data at different time-points, rather than including all of the data in one large analysis (see also our posts on stacking Neighbor-nets: introduction, fossil king ferns, and manual alphabets).

Homoplasy + distant outgroups = wrong roots

By relying on a distantly related sister-lineage to infer an outgroup root of the MJ network, Forster et al. likely got the basic relationships wrong.

Central part of the original outgroup-rooted "phylogenetic network". Coloring after Forster et al. (2020).

Their Type A is probably not ancestral to Type B/Wuhan 1, but derived from it or representing an early split.

Same graph, mutation arrows taking into account observed mutation probabilities (our 88 genomes data) and assuming that there was no recombination among earliest types of each lineage.

The 3 Us shared by the bat outgroup and (part of) Type A (8782, 18060, 29095) likely represent homoplasy in distantly related sister lineages (cf. our last SARS virus post). Being homoplasies, they produce a network box reflecting alternative mutational pathways but not recombination. Homoplasious (convergently evolved) mutational patterns accumulate with increasing phylogenetic distance. Neutral mutations have a generally higher chance to replace a C by a U, back-mutations are less likely, and some sites are more likely to be mutated than others. Hence, there is a good chance that the bat sister-CoV-virus shows more shared mutational patterns with a derived CoV-2 lineage (ie. derived Type A variants) than with the ancestral one (Type B). Distant outgroups should not be used to root Median networks (see also: How do we interpret a rooted median network).

The only possibly genuine mutation would be the shared C (Forster et al.'s pos. 28144, pos. 28219 in our alignment) opposing a U in all Type B and Type C, differentiated only by two incompatible mutations, G → U transitions. The U at pos 28144 may have evolved in parallel in the B and C types; and the actual all-ancestor of CoV-2 (as indicated above) is neither included in Forster et al.'s sample, nor in the current GISAID sample (or our harvest).

Outlook

It will be interesting to infer MJ networks on time-stamped and geo-referenced subsamples collected in the GISAID database, once the virus has had half a year (or more) to evolve, to see (i) how common homoplasy is, (ii) which sites are likely to accumulate → U substitutions independently of ancestry and (iii) whether there are further and more obvious examples of recombination. The further the genotypes evolve from the original stock, the more diagnostic their sequences may become, and the easier it will be to decide whether shared but incompatible sequence features are the result of homoplasy or recombination.

Monday, April 20, 2020

Using Median Networks to study SARS-CoV-2


One software package essential for my research has been the free-/shareware NETWORK by Fluxus Engineering. NETWORK can (now) read in PHYLIP- (and NEXUS-)formatted sequence files to infer Reduced Median (RM) and Median-joining (MJ) networks. The people behind NETWORK have just landed a sort of scientific scoop by publishing a Phylogenetic network analysis of SARS-CoV-2 genomes in PNAS — this is the first such network to be published (appearing the same day as our previous blog post).

Why use Median networks

A full Median Network depicts all possible direct mutational links between the sampled sequences in a data set, hence, is rarely seen in published papers. Here's an example from my own (unpublished) research on oaks.

A full Median network for the 5S nrDNA intergenic spacer (5S-IGS) data of Mediterranean oaks (Quercus sect. Ilex). The numbers on the edges give mutated alignment positions; the abbreviations show the provenance of the sequences (reflecting inter-population and intra-genomic variation); and the coloration shows the general 5S-IGS variant (genotype, also called "ribotype" in the literature).

Such graphs can easily get very complex, meaning that the full Median network is often impractical. So, NETWORK gives you two practical options to analyze the data while decreasing the complexity of the resulting graph. One can:
  1. infer the so-called Reduced Median networks (Bandelt et al. 1995; mostly used for binary or RY-transformed data) or
  2. apply the Median-joining (MJ) network algorithm (Bandelt et al. 1999).
[PS: When choosing an inference in NETWORK, you can view a how-to-do step-by-step explanation via Help → About.]
Basically, the MJ network is a summary of the possible parsimony trees for the data, not unlike a strict consensus network of most-parsimonious trees. NETWORK's in-built viewer allows browsing through the parsimony trees that make up the network. The subtle but very important difference is that the sampled sequences are not regarded exclusively as network tips but can be resolved as internal nodes of the graph, the so-called medians. A median represents the "ancestral type" from which the more terminal types were evolved. So, in contrast to a phylogenetic tree (or consensus network), the MJ network can depict ancestor-descendant relationships (see also: Reconstructing ancestors in splits graphs; Clades, cladograms, cladistics, and why networks are inevitable).

This makes Median (in particular MJ) networks better suited to investigating virus phylogenies than phylogenetic trees, because we have to expect that our sample includes ancestral and derived variants of the virus' RNA: some of the OTUs are expected to be placed on internal nodes of the phylogenetic tree/network.

So, Forster et al., in their paper, harvested a data repository dedicated to epidemiological data (GISAID), and provided the following MJ network based on complete CoV-2 genomes.


Forster et al. highlight some (tree-like) features of their MJ network that fit with individual patient travel histories and assumed virus propagation patterns (their data and NETWORK-files can be found here).

The central part of Forster et al.'s MJ network is characterized by several boxes.

Close-up of the central part, the differentiation of the original Type A (as defined by the bat sistergroup) into B and C types. Note that most of the (likely synonymous) mutations during the initial differentiation phase are transitions from U to C, assuming the sistergroup can inform the ingroup root. The reference sequence (Wuhan 1; NC_045512, sampled Dec 2019) has an ancestral B type, derived from a globally distributed A-type intermediate between B and the not-sampled last common ancestor ("original genome").

There is a reason why you don't find a MJ network in our last post on coronavirus genomes (aside from the fact that we took non-annotated data from gene banks and hence lacked quick-to-access background information): inferring a MJ network for the CoV-2 group seems premature at this point. Its interpretation as a phylogenetic network (arrows above) is problematic because we have parallel edges in the graph, and thus cannot infer unique evolutionary pathways.

Let's look at what I mean.

Homoplasy is bad, but recombination is worse

In the "Significance" section of their paper, Forster et al. state
These genomes are closely related and under evolutionary selection in their human hosts, sometimes with parallel evolution events, that is, the same virus mutation emerges in two different human hosts. This makes character-based phylogenetic networks the method of choice for reconstructing their evolutionary paths and their ancestral genome in the human host.
"Parallel evolution events", ie. homoplasy, are the major shortcoming of Median networks, when we interpret them as phylogenetic networks. In a phylogenetic network, a reticulation (forming a "box" in the graph) represents a reticulation event; and the most common in viruses are recombinations.

Let's take the following simple example with four sites (SNPs – single nucleotide polymorphisms) mutated with every generation of the virus, plus one homoplasy (transition from A to G at the fourth SNP) and a final recombination event.


Not including the recombinant, the MJ network (below) depicts the true phylogenetic network, which, in the absence of a reticulate event, is a tree. One benefit of the MJ network for non-trivial phylogenies is that the graph is not restricted to dichotomous speciation events: one virus sequence may be the source of more than two offspring. The commonly seen phylogenetic trees struggle with such a data situation: they assume that all ancestors are gone (not represented in the data) and have been replaced by exactly two offspring.

Note: The inferred MJ network is an undirected, unrooted graph.
By knowing the source (the all-ancestor), we can interpret it as
a directed phylogenetic network.

When we include the recombinant in this analysis, the MJ network depicts what could be a phylogenetic network. However, it is a wrong one.

The West-1/East-ancestor recombinant is resolved as hybrid/cross of
West- and East-ancestors, and West-2 as cross of West-1 and the
Recombinant. False edges are in red.

It is wrong because Median networks, like parsimony or probabilistic trees, assume that every difference in the sequence is due to a mutation. The East-ancestor mutated only the last of the SNPs in the example. The West-lineage mutated the first SNP, then the third one, and finally (parallel to the East-lineage), the last SNP. Only the last 'West' mutation is found in the recombinant, because it recombined the first half of the West-1 genome with the second half of the East-ancestor.
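Why a mutation-only method misplaces the recombinant can be checked with a toy encoding in Python. The binary coding below (0 = source state, 1 = mutated) is an assumption, one reading of the four-SNP example in which the West lineage mutates SNP 1 and then SNP 3 before its final change at SNP 4; the helper names are hypothetical and the real figure may contain further types.

```python
def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def recombine(left_parent, right_parent, breakpoint):
    """First part of one genome joined to the remainder of another."""
    return left_parent[:breakpoint] + right_parent[breakpoint:]


types = {
    "Source": "0000",
    "East-ancestor": "0001",  # mutated only the last SNP
    "West-ancestor": "1000",  # first West mutation
    "West-1": "1010",         # second West mutation
    "West-2": "1011",         # last SNP mutated in parallel to the East lineage
}
types["Recombinant"] = recombine(types["West-1"], types["East-ancestor"], 2)


def one_step_neighbours(name, seqs):
    """All types that differ from `name` by exactly one mutation."""
    return sorted(
        other
        for other in seqs
        if other != name and hamming(seqs[name], seqs[other]) == 1
    )


recombinant_links = one_step_neighbours("Recombinant", types)
```

Under this encoding the recombinant is a single step away from the West- and East-ancestors and from West-2, but two steps away from its true parent West-1. A method that explains every difference by mutation therefore connects it to the ancestors and draws West-2 between West-1 and the recombinant, which matches the false edges described in the caption above.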

However, homoplasy on its own can also produce reticulations in the network, as shown next.

The descendant of the East-ancestor shows a West-lineage mutation, leading to a
sequence identical to that of the West-1 x East-ancestor recombinant.

MJ networks can be, but are not always, phylogenetic networks. That is, a box in a MJ network may reflect either of two different things:
  • homoplasy, ie. alternative evolutionary pathways
  • reticulation events.
A Median-Joining network is not enough to study viruses

In their "Significance" section, Forster et al. continue:
The network method has been used in around 10,000 phylogenetic studies of diverse organisms, and is mostly known for reconstructing the prehistoric population movements of humans and for ecological studies, but is less commonly employed in the field of virology.
However, using these networks is tricky, because they (like any parsimony method) struggle with homoplasy, and (like all tree inferences) they cannot handle recombinants. A virus MJ network provides a display of mutation sites in an evolutionary context that, in the presence of ancestor-descendant relationships, does better than a Consensus network of most-parsimonious trees; but it is not a phylogenetic network per se.

Forster et al. provide free access to their data, but only as an RDF file, which is NETWORK's matrix format; and there is no data export option in the freeware version of the program. So, we cannot do any quick downstream investigation of the "published" dataset (and have to rely on our own harvest, as for the previous post, available via figshare).

The reason we can apply Median networks to complete CoV-2 genomes at all is their low divergence. In the data from our previous post (sampled between December 2019 and March 1st 2020, with a focus on China and the USA), our Group 7 sequences (= SARS-CoV-2) show 146 mutation patterns: 141 site variations and five 3 to 15 nt-long deletions in a stretch covering ~29,700 of the up to 30,000 basepairs of 88 CoV-2 genomes (ends trimmed for missing data). There are also polymorphic base calls in the data, but no prior way to judge whether these represent genuine host polymorphism or simply mediocre sequencing.

Are we detecting homoplasy, or is it recombination?

Since the overall divergence is low, and we have nearly 30,000 basepairs (i.e. 10,000+ for synonymous substitutions underlying ± neutral evolution), we can fairly rule out random homoplasy creating the network patterns. The chance that two independent virus lineages mutate the same position of a total of 30,000 by accident is low. Indeed, most SNPs and three of the deletions occur only in a single sequence, stochastically distributed across the genomes. So, we have:
  • 111 singletons: 94 SNPs, including one set of linked SNPs (6 SNPs, stretching across 50 nt), 13 possible intra-host polymorphisms (PIHP), and 4 deletions.
  • 35 parsimony-informative patterns: 34 SNPs, of which eight involve PIHP, and 1 deletion.
We may still have homoplasy, even in the parsimony-informative sites, because some positions may be more susceptible to mutations than are others, and some mutations may be generally beneficial for the virus' spread. If the sample is large enough, then these should be easy to spot, because they should be frequent, and show character splits incompatible with the rest of the sequences.
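
One simple way to screen for such incompatible character splits is the pairwise compatibility check for binary characters (often called the four-gamete test). The sketch below is an illustration under assumptions: the site names and the 0/1 columns are made up, and each site is reduced to two states (reference vs. mutated). Two sites conflict with a single tree when all four state combinations occur among the genomes.

```python
from itertools import combinations


def compatible(col_a, col_b):
    """Two binary characters fit on one tree unless all four state
    combinations (00, 01, 10, 11) occur among the sequences."""
    return len(set(zip(col_a, col_b))) < 4


def incompatible_pairs(columns):
    """Return all pairs of site names whose splits conflict."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if not compatible(columns[a], columns[b])
    ]


# Hypothetical data: one string per site, one character per genome.
sites = {
    "site_A": "00001111",
    "site_B": "00000011",
    "site_X": "01000101",  # candidate for homoplasy or recombination
}

print(incompatible_pairs(sites))
```

A site that conflicts with many otherwise compatible sites is a candidate, but the check alone cannot say whether homoplasy or recombination caused the conflict.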

In our data, there are two candidates for homoplasy among the parsimony-informative patterns, both of them mutations from G or C in the reference and majority of genomes to U.

Example 1

At alignment position 11121, the majority G is replaced by U in nine genomes, and by C in one. If we exclude recombination as a cause, then it represents a safe homoplasy, because U-carrying genomes show rare additional mutations deviating from the consensus (which is identical to the reference genome, "Wuhan 1") that are also seen in G-carrying genomes. Those mutations can be located at the start, center or end of the genomes. In addition, we find one transversion at the G/U site. This could indicate that the G → U/C site is subject to an increased probability of mutation, and hence homoplasy.

Genomes sharing rare mutations in addition to G/U variation at alignment position 11121. The first occurrence of the U-mutation, not accompanied by any other mutation, was discovered by Japanese researchers on the docked cruise ship. The thickness of the lines shows the number of genomes with identical mutation patterns in the parsimony-informative sites (1 pt = 1 genome), the size of the majority base, always found in the reference genome, its frequency (0.5 pt = 1 genome). The "jet setter" host is a Brazilian coming home from Switzerland via Italy.
However, six of the nine accessions are from the "Cruise A" sample, the early quarantined Diamond Princess. Given the setting (a closed, densely populated space) and usually diverse host populations on cruise ships, the otherwise unchanged CoV-2 U-strain (top) and already modified G-strains present in the ship's population may just have recombined: the sequences up- and down-stream of the G/U/C site can be identical in various CoV-2 lineages for hundreds of basepairs.

Example 2

An analogous situation is found for the other candidate position, alignment position 24072 (black arrow), where a C is replaced by U in four genomes. One genome (MN988713; from Illinois, USA, sampled Jan 21st) shows the polymorphism: Y (= C/U). In MN988713, 7 more of the 35 parsimony-informative SNPs are polymorphic: the sequence is a near-perfect (gray arrow) consensus of the original "Wuhan 1" type and a strongly derived type (probably Forster et al.'s A cluster) from a second Illinois host sampled a week later, Jan 28th (MT044257).

Black and gray arrows highlight sites indicative of homoplasy or within-USA recombination. The polymorphic Illinois genome represents a strict consensus of the second Illinois strain (sampled one week later; directly derived from the California strain, derived within the Type A cluster) and a (not sampled) sequence differing from the Wuhan 1 type (Type B) by one point mutation shared with two North American samples from end of January.

If we assume that the lab didn't just mix up or cross-contaminate the IL1 and IL2 samples, then the MN988713 host was infected twice by the CoV-2 virus: once by the original strain (Forster et al.'s Type B), and a second time by an evolved strain, being the tip of a new CoV-2 lineage that can be traced back (by congruent mutation patterns) to Jan 10th, Shenzhen (Guangdong, China) characterized by two C → U transitions at alignment pos. 8820 and 28182 (Forster et al.'s Type A).
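
What a "strict consensus" of two co-occurring strains looks like in a base call can be sketched as follows. The two short sequences are hypothetical and only mimic the situation (two C → U differences between the strains); the lookup table uses the standard IUPAC ambiguity codes for two-base combinations, with U in place of T.

```python
# IUPAC codes for single bases and two-base ambiguities (RNA alphabet).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("U"): "U",
    frozenset("AG"): "R", frozenset("CU"): "Y", frozenset("CG"): "S",
    frozenset("AU"): "W", frozenset("GU"): "K", frozenset("AC"): "M",
}


def strict_consensus(seq_a, seq_b):
    """Site-wise consensus of two aligned sequences: identical bases are
    kept, differing bases are replaced by their ambiguity code."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(IUPAC[frozenset((a, b))] for a, b in zip(seq_a, seq_b))


# Hypothetical five-site excerpts of two strains in one host.
original_type = "ACCGC"
derived_type = "AUCGU"

print(strict_consensus(original_type, derived_type))  # AYCGY
```

A genome that is polymorphic exactly at the sites where two sampled strains differ, and nowhere else, is what we would expect from a double infection (or a mixed sample).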

Distinguishing homoplasy and recombination

With a growing set of samples, and given that the virus is free to mutate further in a large number of hosts, it might become easier and more straightforward to distinguish homoplasy from recombination. It is possible that incongruent character splits have not one but two reasons: they have evolved in parallel but also have been propagated by recombination. The U replacing a G or C (or A) at the same site in one accession reflects a different history from another accession. Homoplasy and recombination result in the same graph inferences.

I agree with Forster et al. that the MJ network is under-used in virology (and other biological disciplines: eg. Why do we still use trees for the Neandertal genealogy; Using median networks to understand the evolution of genera) because it is a perfect tool — especially when used as a data-display network (eg. Networks can outperform PCA ordinations in phylogenetic analysis; Can we depict the evolution of highly conserved gene regions such as the ribosomal RNA genes). It facilitates grouping genotypes, to define ancestors and descendants, and to put them in a preliminary evolutionary framework.

But it cannot replace investigating the sequence mutation patterns, especially when we want to look out for intra-host variation — that is, a patient carrying more than one virus strain (parsimony treats polymorphism as missing data) — and recombinants. Visual inspection and tabulation can do this, although it takes a lot more time (and space).

Inferring a MJ network is Step 1. The obligatory Step 2 is to assess how conserved and/or phylogenetically informative the reconstructed mutation patterns are. This also can help to identify wrong roots inferred via outgroups. Forster et al.'s Type A is likely not the ancestral type, and the shared U-sites with the bat-virus outgroup are due to homoplasy, instead, as I will show in the next post (in two weeks' time).

Data

The complete tabulation of mutation patterns (EXCEL spread sheets) and the CoV-2-only alignment in ready-to-use NEXUS and (extended) PHYLIP format have been added to our figshare coronavirus data and file collection.

Grimm G, Morrison D (2020) Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581.v2

References

Bandelt H-J, Forster P, Sykes BC, Richards MB (1995) Mitochondrial portraits of human populations using median networks. Genetics 141: 743–753.

Bandelt H-J, Forster P, Röhl A (1999) Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16: 37–48.

Monday, December 2, 2019

Trees informing networks explaining trees


Working at the coalface of evolution, one phenomenon always intrigued me: How does the signal in the data build up a tree? Especially since we have to assume some sort of reticulation happened at some point — evolution is rarely a strictly dichotomous process, which we would model by a tree. In earlier posts, we have covered the difference between clades and grades in a tree, and Hennig's concepts of monophyly and paraphyly. Clearly, in the light of actual evolutionary processes, the cladistic approach synonymizing clades with monophyly is a simplification at best, and naive at worst.

In this post, I will discuss a real-world example using molecular data put together for a (probably) recently evolved plant genus, Drosanthemum, as discussed in this paper:

Liede-Schumann S, Grimm GW, Nürk NM, Potts AJ, Meve U, Hartmann HEK. 2019. Phylogenetic relationships in the southern African genus Drosanthemum (Ruschioideae, Aizoaceae). bioRxiv preprint.
These days, next-generation sequencing (NGS) and phylogenomic data may provide what you need to resolve everything from the beginning of life to the very tips of the Tree of Life (ToL; which, at the root and tips is probably more of a Network of Life). However, this has two shortcomings: You need a lot of money, and a lot of DNA. Given the number of modern-day plant species, including not a few that are in flux, it's pretty safe to assume that I won't live long enough to have all of the ToL leaves resolved by NGS data.

On the other hand, there are countless scientists with taxonomic expertise struggling for funding; and classic Sanger sequencing has become very cheap. Thus, Oligogene (fossil) data sets will remain in use for quite some time. We do, however, have to deal with their shortcomings, such as not giving us a fully resolved phylogenetic tree, but instead providing partially diffuse signals. Nevertheless, we can get a lot of insights by combining traditional tree and network inferences.

The tree

To test systematic concepts and construct a species phylogeny for Drosanthemum, we tapped into four non-coding plastid gene regions, which, following earlier research, were the most divergent within the larger group:
  • the close intergenic plastid spacers trnK-rps16 and rps16-trnQ,
  • the trnS-trnG intergenic spacer, and
  • the rpl16 intron.
Following popular demand, we also sequenced the nuclear-encoded ITS region used in earlier phylogenetic studies (despite being quite useless for tree inference in the larger lineage, being much too conserved).

We did a full analysis: single-gene vs. combined tree inference and bootstrapping, with or without data partitioning. We concluded that the combined plastid tree (not including the ITS data) provides a good phylogenetic-systematic framework for the genus.

Our Drosanthemum tree, rooted using the most probable rooting scenario (following an outgroup-EPA analysis; see Liede-Schumann et al., fig. 4 and Supplement file S4). Major clades and subclades are annotated, on the left the morphological subgenera associated with each major clade.

Why consider it to be good? Well, it fits amazingly well with the morphology-based systematics. Evolution doesn't follow a straight path: (i) reticulation will happen at least during the formation of species; (ii) there will always be some incomplete lineage sorting of geno- and phenotypes; and (iii) morphologies have substantially different evolutionary constraints from noncoding plastid gene regions. If it ends up in a good match, it won't be by coincidence but more likely because our inferred phylogeny captures well the true tree (coalescent).

Regarding monophyly, the tree is hence well suited to construct a framework: most of our major clades are linked to a specific, clade-unique morphology. (Don't hope to find too many autapomorphies; in plants, common origin typically manifests in diagnostic character suites rather than individual aut-/synapomorphies.) The exception is subgenus Drosanthemum, which is apparently diphyletic (this term is meant literally, not just because its members form two molecular clades). Furthermore, although not visible on the tree, Clade IV / Vespertina may well have evolved from Clade III / Drosanthemum. The III-IV grade represents a monophyletic group, and Clade III may be paraphyletic.

At this point, you may be thinking: Guido has lost it; but bear with me.

Ancestral and derived haplotypes

The point is, I know our data. When we look at the sequence patterns in the gene regions, we can readily see that Clade III (subgenus Drosanthemum) and IV (subgenus Vespertina) likely had a (genetic) common ancestor different from that of the more evolved clades I (subgenus Drosanthemum) and II, and hence the high support and increased branch length of the corresponding branches: I + II and III + IV could well be reciprocally monophyletic. Realizing that clades III and IV are part of the same evolutionary lineage, we can take a closer look at them using, for example, Median networks. Parsimony is generally inferior to probabilistic methods when dealing with (mostly) neutral but stochastic mutation patterns. However, since we are very close to the coalface of evolution, we are dealing with rather minute changes, too minute for ML to make a well-informed call (also, there is no ML counterpart to haplotype networks).

To not miss something in our data (or overweight indels and linked mutation patterns), we do not just use the nucleotide sequences but we tabulate and code each mutational pattern — simple ones like single-nucleotide polymorphisms (SNPs) and duplications, but also complex ones, sequence patterns in length-polymorphic, sequentially diverse parts. The next figure shows an example:

"Export" refers to the unaltered export of (parsimony-informative) variable sites of the aligned nucleotide matrix; "recoded" the correction for excess mutations (when treating gaps as 5th base) in order to ensure the coding matches the number of steps in the theory of Median networks (see Liede-Schumann et al., supplement file S2).

Now, we are operating above the species level, which is outside the comfort zone of Median networks, which were originally designed for investigating within-species population structure. We are dealing with signals from phylogenetic sorting (eg. evolution of complex sequence patterns, see example above) mixed with (partly) convergent/homoiologous patterns (eg. duplications, which are very rarely lineage-specific in plant plastomes). The resulting Median networks are quite complex, as shown next.

The output from NETWORK for Clade III + IV and the trnG-trnS intergenic spacer. In total, the matrix codes for 14 mutational patterns (10 SNPs, 3 indels, 1 length-polymorphic region involving a SNP; see sheet Clade3&4.trnGS in f_Haplotyping.xlsx in folder 1_main_data_and_results of the online supporting archive @ DataDryad); the red edge numbers indicate which pattern changed, the bubble colors refer to the group: cyan, Subgrade IIIa; blue, Subclade IIIb; yellow, Clade IV (note: Clade III and IV differ from other major clades by uniquely shared patterns).

One option would be to weight the characters. However, it is pretty difficult to come up with a weighting scheme given that we deal with very different mutation patterns, which include everything from simple transitions to reorganization of length-polymorphic regions. When looking at the SNPs, AG transitions appear to be more probable than AC transversions, but some AG transitions are highly diagnostic for clades, while some AC transversions seem random. Instead of getting lost in weighting (and self-enforcing bias), we compare them across the four gene regions by collapsing haplotype groups and their (diffuse) subnets, as shown next.

'Condensed' Median networks for the Clade III + IV lineage; parts of the graphs collecting sequentially similar members of one group are replaced by bubbles (cf. Liede-Schumann et al., fig. 6).

Note that, in contrast to traditional haplotype networks, the bubbles in the figure don't represent the number of accessions of the haplotype, but instead the sequential diversity of the collapsed haplotype group. From the graphs superimposed on the background of the combined tree, it is straightforward to see which haplotype may be ancestral within a lineage, which haplotype is derived, and how they relate to each other.

Paraphyletic "clades"

We now know how the haplotypes of each covered gene region are related to each other, which species have substantially derived sequences, and which species have putatively ancestral sequences. Using the networks and by comparison with the sequence patterns in the sister group(s), we could even reconstruct a hypothetical haplotype of the common ancestor. But just by comparing the median networks for each gene region with the corresponding subtrees in the combined tree we can (try to) interpret our clades and grades as monophyletic or paraphyletic.

Fig. 5 from Liede-Schumann et al. (2019) showing the 'condensed' Median networks for Clade I/ Drosanthemum (s.str.)

Members of Subclade Ib, the subtree with the worst support within Clade I / Drosanthemum (s.str.), may represent the survivors of the initial radiation, and hence are a paraphyletic group. They are resolved as a clade in the tree because of the signal from the trnS-trnG region producing a clear split between the three groups. However, this is also the most-conserved gene region, and when compared with the mutational patterns in the other clades (especially the sister clade, Clade II), it would not be far fetched to conclude that the trnS-trnG haplotype B is the original haplotype of the entire lineage.

The distinctive feature of Subclade Ib in the trnS-trnG region is a complex duplication pattern not found in the otherwise genetically more coherent subclades Ia and Ic, as shown next.


This looks like a simple evolutionary sequence, with Clade Ia and Ic having retained the original pattern, and the complex pattern being a derived, clade-unique feature of Clade Ib (an autapomorphy for the corresponding monophyletic group).

But when we add the patterns of Clade II, the reciprocally monophyletic sister clade, it's not that simple anymore, as shown next.

Why one should be careful with gap-coding: even complex plastid duplication patterns evolve in parallel (or convergently). No matter whether X-Y or X-Y'-X-Y is the ancestral pattern, we have one/two convergent mutations in parts of Clades I and II; either duplication of X and insertion of Y' or (subsequent) deletion of X-Y'.
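
How much the coding decision matters for the step count can be shown with a small sketch. It is an illustration under assumptions, not the recoding procedure documented in the supplement: the two aligned sequences are made up, and the helper names are hypothetical. Treating gaps as a fifth base counts every gapped column as a step, whereas coding each contiguous indel as one mutational pattern counts one event.

```python
import re


def steps_gaps_as_fifth_base(seq_a, seq_b):
    """Count every differing alignment column, gap columns included."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


def steps_recoded(seq_a, seq_b):
    """Count each contiguous run of gap-vs-base columns as one event,
    plus one step per substitution.

    Simplification: adjacent indels are merged into a single run."""
    pattern = "".join(
        "g" if (x == "-") != (y == "-")   # indel column
        else "s" if x != y                # substitution
        else "."                          # identical column
        for x, y in zip(seq_a, seq_b)
    )
    indel_events = len(re.findall(r"g+", pattern))
    return indel_events + pattern.count("s")


# Hypothetical aligned sequences: one 3-nt deletion and one substitution.
seq_1 = "ACGTTTGA"
seq_2 = "ACG---GC"

print(steps_gaps_as_fifth_base(seq_1, seq_2))  # 4
print(steps_recoded(seq_1, seq_2))             # 2
```

Neither count protects against the problem shown in the figure: an indel coded as a single event can still have arisen more than once.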

Realizing that a few clades in our tree may be paraphyletic gives us a new edge on our data and phylogenetic framework that can be further elaborated, because they directly point towards a first, quick radiation that predates the formation of the monophyletic molecular clades (this is only a tautology in cladistics, not in phylogenetics). The members of paraphyletic molecular clades are genetically distinct (long terminal branches, typically low and/or ambiguous support for the clade root) or little evolved survivors (short root and terminal branches, but relatively high root branch support).

Furthermore, we can now see why some species act a bit roguish, are difficult to resolve, or inflict internal data conflict.

'Condensed' Median networks for the sister clades V and VI (modified after Liede-Schumann et al., fig. 7)

Drosanthemum gracillimum is the only species in our tree that doesn't resolve as a member of one of the two main (definitely monophyletic) subclades within Clade V: Subclade Va / Speciosa and Vb / Ossicula, genetically close but morphologically distinct sisters. We had no material of this species for our analysis, and instead used available GenBank data (out of curiosity). Its trnS-trnG and rps16-trnQ haplotypes are unique but rather ancestral within Clade V, and hence the tree cannot resolve where to place it.

Another example of how the ancestry of sequences contributes to topological conflict or ambiguity in intrageneric phylogenies, which also illustrates the limitation of our approach, is one individual of D. striatum. It's the only member of Subclade Vb / Ossicula with a Subclade Va / Speciosa-type rps16-trnQ. With respect to my last blog post, the simplest explanation is that it just retained a less derived rps16-trnQ haplotype. However, this spacer includes a highly divergent, genotaxonomically valuable region that we had to exclude from all analyses (but included in our spreadsheet haplotyping.xlsx). In this, it shows a unique, complex, apparently derived pattern shared with a few other members of the sister Clade Va. Maybe there was some reticulation and plastome-recombination at work here (contamination can be ruled out, as the material was processed twice).

Just try it with your own data

We cannot all afford perfect, often seemingly trivial, NGS / phylogenomic data. Combined trees can inform us about groups sharing a likely (mostly inclusive) common origin, such as molecular clades with fair support and distinctly long root branches and/or shared unique morphologies (ie. "monophyla" in a strict Hennigian sense). Clade-restricted haplotype networks can help us to understand the molecular evolution in these groups, free from the assumption of dichotomy and time equality.

By definition, all tip sequences represent the same time (today) in a tree, so they can only be sisters not ancestors and descendants. In reality, when we approach the coalface, we have some sequence patterns or actual sequences that are ancestral to others, because the species carrying them didn't evolve as much and as fast as their sibling(s). At some time, different parts evolved at different speeds within one lineage (see the examples above).

The networks hence fill a gap that the tree can't possibly resolve. They allow us to understand why the tree may make more sense in certain parts than in others; and where it is probably 100% reliable and where we may want to have a closer look. Furthermore, only the networks can tell us if there is some real conflict in the data: different gene regions reflecting different histories.



Epilogue

As a careful reader, you may have noticed that we skipped the ITS sequence entirely. The reason is shown in the following two graphs.

The first one shows a statistical parsimony network of all of our ITS data compiled for the species included in the plastid combined tree.

A statistical parsimony network based on the ITS data. Colors give the main cp clades (see Liede-Schumann et al., Supplement file S3)

The network approaches a spider-web, as shown above. The reason for this is that there are only a limited number of ITS positions where Drosanthemum fixes mutations (notably nearly exclusively SNPs, with no length-polymorphism). So, the genus is likely a young one, much like its sister clade the Ruschieae, which also mostly shows randomly distributed ITS mutation patterns.

Inferring an ITS tree is possible but useless, in that the data don't provide a clear signal. Furthermore, when we map the observed mutational patterns onto the plastid tree, we see a lot of messing up towards the leaves; but, in principle, it's all just sorting along the shared coalescent. We can identify those ITS mutations that are (plastid-)clade specific and lineage-diagnostic, including ITS-"synapomorphies" for plastid-inferred clades that are likely monophyletic (being correlated to a distinct morphology and supported by derived, uniquely shared sequence patterns).

ITS genotypes mapped on the plastid tree, pointing to a largely congruent history with incomplete (ITS) lineage sorting. CU = clade-unique sequence pattern; Sh = shared, not unique, sequence pattern.
This opens the door to quickly screen for individuals / species that don't fall in line with the coalescent but are the product of (deep) reticulation (either using bulk sequencing and NGS genotyping or traditional cheap methods such as PCR-RFLP).

Monday, February 18, 2019

Can we depict the evolution of highly conserved genes, such as the ribosomal RNA genes?


Median networks have been designed to put within-species haplotypes into an explicit evolutionary framework. They are exclusively parsimony-based, but differ from traditional trees by treating operational taxonomic units (OTUs) as both potential tips and ancestors. Ancestors are placed at internal nodes ('medians'). The latter makes them interesting for hypotheses about sequence evolution; but, like all parsimony-based methods, they suffer from high levels of homoplasy, which is a common feature of genetic data sets.
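
The 'median' itself is easy to illustrate: for three aligned sequences, it is the site-wise majority state, i.e. the sequence that minimises the summed number of differences to all three. The sketch below shows only this single step on made-up sequences; it is not an implementation of the Median-joining algorithm, which combines such medians with further steps.

```python
from collections import Counter


def median_vector(seq_a, seq_b, seq_c):
    """Site-wise majority state of three aligned sequences.

    Raises ValueError where all three sequences differ, because the
    median is then not unique."""
    median = []
    for states in zip(seq_a, seq_b, seq_c):
        state, count = Counter(states).most_common(1)[0]
        if count < 2:
            raise ValueError("three different states: median is not unique")
        median.append(state)
    return "".join(median)


def hamming(a, b):
    """Number of sites at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical haplotypes, each with one private mutation.
haplotypes = ["AAGT", "ACAT", "CAAT"]
median = median_vector(*haplotypes)

print(median)                                   # AAAT
print([hamming(median, h) for h in haplotypes])  # [1, 1, 1]
```

If the median is identical to a sampled sequence, that OTU sits at an internal node of the network; if not, the median is an inferred (unsampled) ancestor.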

Can we use median networks to better understand evolution far above the species level?

In order to test this, I generated a median network using data on the nuclear-encoded 5.8S rDNA of Fagales. This is a flowering plant (angiosperm) order, which includes well-known trees such as oaks, beeches, chestnuts, walnuts, alder, birch and hazel, but also the enigmatic 'false beech' (Nothofagus s.l., the traditional four subgenera have been elevated to genera by Heenan & Smissen 2013), a Gondwanan element that (for some time) has intrigued biogeographers.

Why I have always loved nrDNA

When I was a young (phylo-)geneticist, my boss, a geneticist who sequenced genes such as the rRNA genes before PCR made it easy, pointed me to the works of Mark Hershkovitz, Louise Lewis, and Edith Zimmer about evolution of the nuclear-encoded ribosomal RNA genes (nrDNA) in angiosperms. Long pre-dating the era of big data and self-evident, trivial phylogenies (ie. data sets allowing for the inference of a fully resolved, unambiguously supported tree), Hershkovitz and co-workers sought to extract as much information as possible from the best-known gene region available back then (mid-late 90s): the internal transcribed spacers (ITS1, ITS2) of the 35S rDNA, the cistron encoding the genes for the 18S, 5.8S and 25S (or 28S, but not "26S") nuclear ribosomal RNA.
  • Hershkovitz MA, Lewis LA. 1996. Deep-level diagnostic value of the rDNA-ITS region. Molecular Biology and Evolution 13:1276–1295.
  • Hershkovitz MA, Zimmer EA. 1996. Conservation patterns in angiosperm rDNA ITS2 sequences. Nucleic Acids Research 24:2857–2867.
  • Hershkovitz MA, Zimmer EA, Hahn WJ. 1999. Ribosomal DNA sequences and angiosperm systematics. In: Hollingsworth PM, Bateman RM, and Gornall RJ, eds. Molecular Systematics and Plant Evolution. London: Taylor & Francis, pp. 268–326.
The ITS1 and ITS2 are highly divergent, non-coding but transcribed intergenic spacers within the structurally and sequentially much more conserved nrDNA, which distinguishes them from nearly all other non-coding regions. More often than not, their sequences are impossible to align across high-ranking taxa such as families or orders. The brilliance of Hershkovitz et al.'s work was to just go a level-up by identifying shared general sequence patterns, and to put them in an evolutionary context.

Bird's-eye view of the ITS region (consensed for sequence groups) in Fagales including sequences of the two outgroups used in Li et al. 2004 (zoom-in and try to figure out where they are). The position of the ITS(1) cleavage site is indicated, a highly conserved, AT-dominated sequence motif within the ITS1. The "Nothofagus deletion" (Manos 1997), gray area seen in some of the topmost variants in the 5.8S rDNA, is a sequencing/editing artifact (newer sequences all have a complete 5.8S rDNA). Most of these data are more than 15 years old (see references provided at the end of the post) and may include more data artifacts, especially in the length-polymorphic portions. Nonetheless, part of the data were included in the dating studies of Sauquet et al. (2012) and Xing et al. (2014) to compensate for the lack of resolution of the also included plastid regions towards the tips of the Fagales tree (intrafamily and -generic relationships).

Accordingly, in my (open access) Ph.D. thesis you'll find not a few figures depicting the potential evolution of sequence patterns in the ITS1 and ITS2 of maples and the beech trees.

I could probably write a book taking up where Hershkovitz et al. stopped, but this would be: a) very subjective, and b) too complex and marginal for the 21st century. Very few people would read it. We have grown accustomed to simple graphs as metaphors of evolution and, thanks to big data, we have become reluctant to discuss the results ex machina. Also, I would have needed a score of students to pursue all the avenues that I glimpsed into; e.g. the following pic:

Evolution of the 5'-end of the ITS1 in basal eudicots (looking at divergences that happened, at least, 100 myrs ago).

The other way around

If the more conserved sequence patterns within the ITS1 and ITS2 can be informative about evolution at a much higher level (which they are), the next question is: what can we learn from the sequence patterns in the highly-conserved portions of the rDNA linked with the ITS1 and ITS2? Historic-genetically, the ITS1 is fundamentally different from the ITS2. The former, ITS1, is an intergenic spacer, which has no secondary structure (although you can find reconstructions in literature) as it is split into two parts right after transcription (the ITS1 cleavage site is quite conserved, and a main topic in the papers by Hershkovitz and Zimmer). The latter, ITS2, has been evolutionarily derived from the first variable portion of the large ribosomal subunit (LSU), the 25S (28S) rDNA. In primitive organisms, there is hence no 5.8S rDNA and ITS2.

This geno-evolutionary history is also the reason for the structural linkage between the 5.8S rRNA and the 5' end of the 25S (28S) rRNA. Here's a zoom-in on the part that we are interested in.


For better orientation, I have named some of the extremely conserved secondary structure elements of the (mature) 5.8S rRNA. Note that the "Gingerbread Man" structure is very conserved in angiosperm sequences although it only contains three very short stems. The "Pimple" and the "Needle" are so-called hairpins — a strictly complementary stem part is capped by a short, non-complementary tip ('semi-loop'): a 3- and 4-nt long motif, respectively, in Arabidopsis and all Fagales (in some species of Lithocarpus, the tropical 'stone nut' and relative of oaks, the "Needle" has two extra nucleotides).

5.8S rDNA in Fagales

I chose the Fagales because I have worked on them a lot, they are a pretty small group, and except for one "asterisk branch" their inter-family relationships are solved.

Basic signal in Li et al. (2004)'s matrix. Inter-family relationships are, data-wise, fairly trivial, hence, the tree-like Neighbor-net. Only the placement of the Myricaceae with respect to Juglandaceae (now incl. Rhoipteleaceae) and Betulaceae + allies is not unambiguously resolved (see this post)

Oaks have received a lot of attention from population geneticists, like other widespread species or species complexes. Those studies, using Median networks and related methods such as Statistical Parsimony, revealed very complex genetic diversity patterns. On the other hand, the Fagales lineage has been fairly neglected by plant phylogeneticists, although it comprises many of the dominant, ecologically and economically most important trees of the Northern Hemisphere (and the enigmatic Gondwanan Nothofagaceae). The early studies found evidence for deep nuclear-plastid incongruences, but only in recent years have the first (non-comprehensive) complete plastome phylogenies and dated all-Fagales trees surfaced (which do contain one or other common error and misinterpretation of results).

For one family, the southern hemispheric, tropical-subtropical Casuarinaceae, we have no (reliable) ITS data at all; also missing is one of the genera of the Juglandaceae: Engelhardia (s.str.; most data in gene banks labelled as Engelhardia is from Alfaropsis; cf. Manchester 1987 and Manos et al. 2007, but see Zhang et al. 2013).

In total, we find 17 variable sites at and above the genus level in the 5.8S rDNA of Fagales. There are three in the core parts, structurally linked to the 5' 25S rRNA, two in the 'Gingerbread Man', three in the 5' and 3' trails, and the rest are in the 'Needle'.
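Counting variable sites of this kind is easy to reproduce from an alignment. The following is a minimal sketch, not the procedure actually used for the Fagales data: the function name `variable_sites` and the toy alignment are hypothetical, and ambiguity codes and gaps are simply ignored when deciding whether a column varies.

```python
def variable_sites(alignment):
    """Return the 0-based column indices at which the aligned sequences
    show more than one unambiguous nucleotide (A, C, G or T).

    `alignment` maps a label (e.g. a genus name) to an aligned sequence;
    all sequences are assumed to have the same length.
    """
    columns = zip(*alignment.values())
    return [
        i
        for i, column in enumerate(columns)
        # Ambiguity codes and gaps are skipped; only A/C/G/T are compared.
        if len({base for base in column if base in "ACGT"}) > 1
    ]


# Hypothetical toy alignment, NOT real 5.8S rDNA data.
toy = {
    "genusA": "ACGTTA",
    "genusB": "ACGTCA",
    "genusC": "ATGTCA",
}

print(variable_sites(toy))  # [1, 4]
```

Restricting the comparison to unambiguous bases is a deliberate simplification: a column in which one genus merely carries an ambiguity code is not counted as variable here.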

Unique mutations and mutational trends (arrows) in the 5.8S rDNA in Fagales. Circles highlight the basepairs differing from the reference (Arabidopsis 5.8S rRNA). Blue, mutations found within more than one major lineage, pink, lineage-conserved (diagnostic) mutations; red, mutations restricted to a single genus; green, genetic (syn)apomorphies of the 5.8S rDNA of Fagales. Be = Betulaceae; Ju = Juglandaceae; My = Myricaceae; No = Nothofagaceae; Fagaceae include Fagus (Fa, the beech) and the remainder ("Quercaceae": Qu), which are genetically substantially distinct from Fagus.

Many mutations are genus-coherent; increased intrageneric variation is found in the 5'-tail and the part encoding the 4(6)-nt long 'semi-loop' sequence of the "Needle" (pos. 120–142 in the rRNA of Arabidopsis thaliana):

A (near-)full Median network for the tip of the 'Needle'. In a few Lithocarpus (a "Quercaceae" genus) the sequence is 6-nt-long, which would result in an elongated hairpin (paired basepairs are underlined). The ATTC is a genetic symplesiomorphy.

Exceptions are Fagus and Quercus, which can show substantial intragenomic ITS divergence, Lithocarpus (the most divergent genus, ITS-wise), and Nothofagus s.l. (between the former subgenera, now genera). In these cases, the intra-(sub)generic variation includes the putatively ancestral nucleotide and/or nucleotide shared with other genera of the family; eg. at pos. 123, all Fagales have a C, Fagus can have either C or T (= Y), and Quercus can show any of the four nucleotides (= N).
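The ambiguity codes mentioned here (Y for C or T, N for any of the four nucleotides) are the standard IUPAC nucleotide codes. As a small illustration, the sketch below maps the set of nucleotides observed at one site to its IUPAC code; the helper name `consensus_code` is my own and purely illustrative.

```python
# IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_code(bases):
    """Return the IUPAC code for the nucleotides observed at one site."""
    return IUPAC[frozenset(base.upper() for base in bases)]


print(consensus_code("C"))     # C (all Fagales at pos. 123)
print(consensus_code("CT"))    # Y (Fagus: C or T)
print(consensus_code("ACGT"))  # N (Quercus: any of the four)
```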

A Median-network for the 5.8S rDNA

Ambiguities can be detrimental for resolution in standard parsimony implementations. The NETWORK program, for instance, warns that a code of "N" may render the result less reliable, and this applies also to the other ambiguity codes. If we include the intra-generic polymorphisms as ambiguity codes, NETWORK runs for quite a long time: too many solutions are equally parsimonious (for this experiment I used genus-consensus data, being interested in the deep splits).

But when we resolve the intra-generic polymorphisms prior to analysis by treating them as satellite types, ie. assuming the family-shared nucleotide represents the ancestral state within the according lineage, we quickly get the following result:

Edges colored to trace the same mutational step. Bubbles indicate the position of the (basic) 5.8S rDNA genotypes for the genera in each family-level lineage.
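The resolution step applied before this analysis (replacing an intra-generic polymorphism by the family-shared nucleotide, assumed to be ancestral) can be sketched in a few lines. This is only an illustration of the stated assumption, not a function of the NETWORK program: the names `resolve_site` and `resolve_sequence` are hypothetical, and a code is resolved only when it actually includes the family-shared base.

```python
# Bases covered by each IUPAC nucleotide code.
EXPANSIONS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_site(code, family_base):
    """Resolve one ambiguity code to the family-shared base, if compatible.

    If the code does not include the family-shared base (or is a gap or an
    unknown symbol), it is returned unchanged.
    """
    code, family_base = code.upper(), family_base.upper()
    if family_base in EXPANSIONS.get(code, ""):
        return family_base
    return code


def resolve_sequence(genus_seq, family_seq):
    """Apply resolve_site column by column to two aligned sequences."""
    return "".join(resolve_site(c, f) for c, f in zip(genus_seq, family_seq))


# Hypothetical example: Y and N both include the family-shared C.
print(resolve_sequence("AYNG", "ACCG"))  # ACCG
# R (A or G) does not include C, so it is left as it is.
print(resolve_sequence("ARG", "ACG"))    # ARG
```

Leaving incompatible codes untouched keeps the sketch conservative: it never invents a state that was not observed in the genus.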

This is still not a trivial graph, but it:
  • provides a framework on which we can develop our evolutionary scenario;
  • visualizes how mutational patterns may be linked;
  • tells us directly how derived (genetically) and unique (isolated) the genera are.
Since the 5.8S rDNA is part of a multi-copy (potentially multi-loci, Ribeiro et al. 2011) gene region, uniqueness gives us an idea about how reduced a lineage is. Bottlenecks will eliminate intra-lineage diversity and unique mutational patterns are more likely to accumulate in a species-poor lineage with small population sizes.

But since it is a vital gene region underlying strong sequential and structural constraints, evolution is not neutral: the graph has little tree-likeness. However, the graph looks like graphs that one expects for fast ancient radiations.

There are more interesting details. For instance, we have no mutation that consistently separates the earliest diverging lineages (given the currently accepted root), the Nothofagaceae and the Fagaceae (s.l.), from the remainder of the order (called "higher hamamelids" in classic systematic literature). We also see that the 5.8S rDNA shows the Fagaceae should be monotypic: Fagus is more different from its siblings, the 'Quercaceae', than it is from the first-diverging Nothofagaceae or the common ancestor of the "higher hamamelids". Fagaceae s.str. and 'Quercaceae' are without a doubt sister lineages, but this also applies to Betulaceae and Ticodendraceae (differing only by three point mutations), with the Betulaceae being just one point mutation away from its more distant sibling (phylogenetically speaking), the Juglandaceae. Furthermore, for Ticodendron-Betulaceae we can postulate a sequentially unique common ancestor, but we can't do the same for Fagus-'Quercaceae'.

Either the 5.8S rDNA evolved much faster in Fagus than in most other lineages, or Fagus split away from its sisters prior to the radiation of the "higher hamamelids" and shortly after their respective ancestors became isolated. This second scenario fits nicely with recent fossil findings tracing the Fagus lineage back to the late Cretaceous (at least 80 Ma; Grímsson et al. 2016, supplement includes a digression on all-Fagales dating attempts).

Reconstruction of ancestral genepools

Using the split patterns in the network to extract an evolutionary tree could be hazardous, since we are looking at strongly interconnected mutational patterns filtered by selective pressure (maintaining a functional structure) in a gene region that evolves very slowly: some sites can or did accumulate mutations (the 'Needle' and the trails), others can't and did not (the remainder of the 5.8S rDNA) in the Fagales lineage. At least, mutations were not fixed over a long evolutionary time: the data include at least as many variable sites at which the shared, family-typical nucleotide (or even one shared with Arabidopsis, a quite distant relative of Fagales) is occasionally replaced within a single genus, species or genome.

But since we know the phylogeny of the Fagales, we can, based on the Median(-joining) network(s), infer the evolution of the 5.8S rDNA (i.e. the rDNA gene pool) over time:

Results of the Median-joining analysis mapped on the currently accepted Fagales tree. Clade-characteristic mutations are highlighted by according colors; black, homoplastic mutations that occurred independently in two lineages, gray, in more than two.

Regarding the 'asterisk branch', the 5.8S rDNA provides few extra clues, unless we want to re-include a third hypothesis: that the Myricaceae are sister to Juglandaceae + Betulaceae and allies. This would be the most fitting explanation for the 5.8S rDNA diversity. It also would explain why they can be either sister to Betulaceae and allies or Juglandaceae. Ancestors, or slower evolving sisters diverging shortly before a radiation, will do such a thing.

In this context, one should point out that unequivocal fossils representing various modern genera of all families are known from the early Paleogene; many pop up in early Eocene (~ 50 Ma) intramontane basins of northwestern North America. The oldest modern genus and a possible living fossil is the first diverging Juglandaceae: Rhoiptelea. Its pollen can be found from the Maastrichtian onwards in North America and elsewhere, and a fossil showing the unique Rhoiptelea-flower and fitting pollen can be found in the late Turonian-Santonian (~90 Ma) of Bohemia (Heřmanová et al. 2011; the authors, however, decided to name it Budvaricarpus and tone down the striking resemblance to modern-day Rhoiptelea).

Of course, since we use network-based approaches, we can conceptualize the 5.8S rDNA sequence patterns and inferred evolution as a subsequent breaking up and sorting of once-shared gene pools:

A 'coral' tree metaphor for the evolution of the 5.8S rDNA in Fagales (using an alternative, one-node-shifted root).

I chose an alternative root because it is the one that makes most sense regarding the fossil-morphological, palaeoclimatological/-vegetation and highly conserved genetic patterns (thinking of the 18S rDNA). The labels are, of course, a gross simplification: it is likely that the all-ancestor was a tropical-subtropical plant as well (the genetically most unique and potentially earliest isolated genera of the 'Quercaceae' are exclusively tropical-subtropical), and Myricaceae, Betulaceae and Juglandoideae can today be found deep into the temperate zone, some even thriving in boreal and polar climates. But posts can afford to trigger discussion.

The vertical axis reflects not only the derivedness of the 5.8S rDNA, but also the potential sequence of divergences back in time. The horizontal axis represents the taxonomic-geographic breadth over time (very roughly, tapering means higher diversity/greater range in the past than today) and, towards the tips, the genetic within-lineage diversity seen in the ITS1 and ITS2 (in Myricaceae, it would be close to a point, were it not for one species: Myrica gale, the bog myrtle or sweetgale, beloved in Scotland and Scandinavia – see this Dane's video for how to use it).

Just a curious experiment?

Now, to most readers this post may just be a strange example with little general relevance for phylogenetics. But consider the following.
  1. When we infer deeper phylogenetic relationships, we usually rely on sequence differentiation in coding-gene regions. Like the rRNA genes, the tRNA genes need to fulfill secondary (and tertiary) structural constraints to maintain their vital functions. All other genes code for proteins, which also need to fulfill structural constraints (secondary, tertiary and quaternary structures). Their essential functions rely on keeping a specific amino-acid sequence, which is translated from DNA sequences.
  2. We do this inference under the assumption that molecular evolution is neutral, which, as can be seen in the case of the 5.8S rDNA, is apparently not the case. Mutations that would negatively affect the function of the DNA-transcripts are strongly selected against.
Many of our trees make sense nonetheless, but we should keep a wary eye on all of those branches that draw their support from only one or two gene regions (a common issue of oligo-gene trees like the one by Li et al. 2004), or very few mutations, especially when we are producing an ultrametric tree. How sensible can a divergence age estimate be when the data behind it are four mutations in the monotypic lineage and zero in its more diverse sister clade?



Cited literature and further reading (with comments).

ITS studies (some mixed with further data and results that were ignored by all-Fagales dating studies that included the data)
  • Acosta MC, Premoli AC. 2010. Evidence of chloroplast capture in South American Nothofagus (subgenus Nothofagus, Nothofagaceae). Molecular Phylogenetics and Evolution 54:235–242. See also Premoli AC, Mathiasen P, Acosta MC, Ramos VA. 2012. Phylogeographically concordant chloroplast DNA divergence in sympatric Nothofagus s.s. How deep can it be? New Phytologist 193:261–275. — Just two brilliant papers that only leave one question open: is this different in the Australasian genera of the Nothofagaceae?
  • Cannon CH, Manos PS. 2003. Phylogeography of the Southeast Asian stone oaks (Lithocarpus). Journal of Biogeography 30:211–226. — A very well-done paper that still doesn't need to fear comparison with more recent biogeographic papers on Fagales genera, which have access to more elaborate inference methods while using much poorer data samples.
  • Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59:351–366. — Since this is mine, I should not give myself an assessment. Just some info: it was the sloppiest draft we ever submitted, and it passed the review process rather smoothly. But it used 600+ new ITS and 900+ new 5S-IGS sequences, and although it provided a comprehensive ITS tree (new and all data stored in gene banks), the conclusions relied mostly on networks based on inter-clonal and inter-individual distances and ML bootstrap pseudoreplicate samples. I'm pretty sure it's still hard to find a similar paper.
  • Denk T, Grimm G, Stögerer K, Langer M, Hemleben V. 2002. The evolutionary history of Fagus in western Eurasia: Evidence from genes, morphology and the fossil record. Plant Systematics and Evolution 232:213–236. — My first phylogenetic paper (using only about 100 ITS sequences) and one of my most-cited papers; published only because the editor ignored the opinions of two reviewers.
  • Denk T, Grimm GW, Hemleben V. 2005. Patterns of molecular and morphological differentiation in Fagus: implications for phylogeny. American Journal of Botany 92:1006–1016. — the follow-up paper, including all beech species.
  • Forest F, Bruneau A. 2000. Phylogenetic analysis, organization, and molecular evolution of the non-transcribed spacer of 5S ribosomal RNA genes in Corylus (Betulaceae). International Journal of Plant Sciences 161:793–806. — Likely the reason for the 2005 study by Forest et al., a great paper (especially when compared to other phylogenetic papers published in the same journal back then and much later). The 5S-IGS has rarely been studied because it is difficult to handle (usually one needs to clone because of intraindividual length-polymorphism). But it provides an unsurpassed resolution at the intrageneric level that has only been matched in recent years by the accumulation of NGS SNP data.
  • Forest F, Savolainen V, Chase MW, Lupia R, Bruneau A, Crane PR. 2005. Teasing apart molecular- versus fossil-based error estimates when dating phylogenetic trees: a case study in the birch family (Betulaceae). Systematic Botany 30:118–133. — A pivotal, still valid study using ITS and 5S-IGS data, even though the divergence age estimates are probably much too old (an aspect demonstrating the quality of the study; back then, molecular age estimates were usually much too young). Forest and Bruneau published several other papers of equal quality on other plant groups, and I suspect there is an interesting publication story given the author list and the dissemination platform.
  • Grimm GW, Denk T, Hemleben V. 2007. Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5:291–309. — A crazy experiment, but one that, years later, would bring me my first paper in Systematic Biology [PDF] (10-times higher impact factor) because it was the only piece of science providing a way-out for a young researcher in South Africa.
  • Manos PS. 1997. Systematics of Nothofagus (Nothofagaceae) based on rDNA spacer sequences (ITS): taxonomic congruence with morphology and plastid sequences. American Journal of Botany 84:1137–1155. — A typical study for the time, maybe not ground-breaking but opening an interesting path and still the basis for molecular systematics of Nothofagaceae (getting such data in the late 90s was not easy). Interestingly, no-one in Australia or New Zealand ever took the thread up (but see Knapp et al. 2005); the only properly studied genus (then a subgenus) of Nothofagaceae is Nothofagus s.str. (Acosta & Premoli 2010; Premoli et al. 2012).
  • Manos PS, Doyle JJ, Nixon KC. 1999. Phylogeny, biogeography, and processes of molecular differentiation in Quercus subgenus Quercus (Fagaceae). Molecular Phylogenetics and Evolution 12:333–349. [PDF] — The counterpart to the above for oaks, it took nearly two decades to assemble more data on American oaks than used for this study.
  • Manos PS, Stone DE. 2001. Evolution, phylogeny, and systematics of the Juglandaceae. Annals of the Missouri Botanical Garden 88:231–269. — An exemplary paper for two reasons (and despite the fact that it just shows cladograms): 1) it combined morphological and chemotaxonomic data with ITS and plastid data (rbcL-atpB and trnL-trnF intergenic spacer); 2) pretty much got the still accepted tree. Also proof-of-point that, even 20 years ago, studies in low-impact journals were not rarely better than those in high-fly ones. (Note the number of pages; decent research needs space!)
  • Manos PS, Zhou ZK, Cannon CH. 2001. Systematics of Fagaceae: Phylogenetic tests of reproductive trait evolution. International Journal of Plant Sciences 162:1361–1379. — For years to come the basis for Fagaceae systematics.
  • Muir G, Fleming CC, Schlötterer C. 2001. Three divergent rDNA clusters predate the species divergence in Quercus petraea (Matt.) Liebl. and Quercus robur L. Molecular Biology and Evolution 18:112–119. — Only about two species, but setting the scene: ITS evolution in Fagales (and probably any other wind-pollinated tree) can be very complex at the very basic level.
  • Ribeiro T, Loureiro J, Santos C, Morais-Cecílio L. 2011. Evolution of rDNA FISH patterns in the Fagaceae. Tree Genetics and Genomes 7:1113–1122. — A must-read for everyone using ITS data in Fagales.
Phylogenetic studies at and above family level
Betulaceae: see Forest et al. (2005) and Grimm & Renner (2013, following section).
Casuarinaceae: see 'Phylogeny' section on Stevens' Angiosperm Phylogeny Website (never bothered myself with them, since they lack ITS data).
Fagaceae: see Manos et al. (2001), tree in Denk & Grimm (2010)
  • Oh S-H, Manos PS. 2008. Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57:434–451. — The molecular basis for Fagaceae systematics.
  • Manos PS, Cannon CH, Oh S-H. 2008. Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55:181–190. — The only paper providing a tangible plastid-informed phylogeny.
Juglandaceae:
  • Manos PS, Soltis PS, Soltis DE, Manchester SR, Oh S-H, Bell CD, Dilcher DL, Stone DS. 2007. Phylogeny of extant and fossil Juglandaceae inferred from the integration of molecular and morphological data sets. Systematic Biology 56:412–430. — I would have used a different set of analyses but the paper (and used data) provides the basis for Juglandaceae phylogenetics and systematics (see Manos & Stone 2001)
Nothofagaceae: Manos (1997), Knapp et al. (2005, following section).
Fagales dating studies (naturally including phylogenies)
  • Grimm GW, Renner SS. 2013. Harvesting GenBank for a Betulaceae supermatrix, and a new chronogram for the family. Botanical Journal of the Linnean Society 172:465–477. [PDF] — a little experiment we made and submitted to a respectable but low-impact journal because the results were not really ground-shaking. Exemplifies how I think one should harvest gene banks for dating studies (check out the supplement files), hence, providing a striking contrast to the much more ambitious papers by Xiang et al. (2014) and Xing et al. (2014). In that aspect, possibly a must-read for reviewers and editors of large-scale, harvest papers.
  • Knapp M, Stöckler K, Havell D, Delsuc F, Sebastiani F, Lockhart PJ. 2005. Relaxed molecular clock provides evidence for long-distance dispersal of Nothofagus (Southern Beech). PLoS Biology 3:e14. — A very interesting paper, because it rejects two of the scenarios later tested by Sauquet et al. (2012) and found to produce strange estimates; also, it provides some new sequences of higher quality, none of which was included for the 2012 paper. The author list is quite interesting, too: the last author (GoogleScholar) was the only botanist who challenged tree-thinking from the very start and embraced splits graphs as an alternative to trees. The fourth author wrote a classic paper that everyone working with big data should have read: Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361–375.
  • Sauquet H, Ho SY, Gandolfo MA, Jordan GJ, Wilf P, Cantrill DJ, Bayly MJ, Bromham L, Brown GK, Carpenter RJ, Lee DM, Murphy DJ, Sniderman JM, Udovicic F. 2012. Testing the impact of calibration on molecular divergence times using a fossil-rich group: the case of Nothofagus (Fagales). Systematic Biology 61:289–313 — in principle, an interesting idea; unfortunately the instability of dating estimates observed may be mostly due to data artifacts. The authors use unrepresentative, old data (which is puzzling, since the understudied Nothofagaceae grow in Australia, New Zealand and the French New Caledonia, and the authors are from France, Australia and New Zealand) including not a few editing/sequencing artifacts, insufficient sampling and internal signal conflict caused by combining low-divergent plastid genes and introns with high-divergent ITS data. The main test compares apples (Nothofagaceae) with pears (the rest of Fagales as sister clade); for details see this draft [PDF], which I put together for applications (the data documentation of Sauquet et al. is exemplary, hence, it was very easy to look into the data basis).
  • Xiang X-G, Wang W, Li R-Q, Lin L, Liu Y, Zhou Z-K, Li Z-Y, Chen Z-D. 2014. Large-scale phylogenetic analyses reveal fagalean diversification promoted by the interplay of diaspores and environments in the Paleogene. Perspectives in Plant Ecology, Evolution and Systematics 16:101–110 — an ambitious experiment, with even more data-related problems than the study of Sauquet et al. While Sauquet et al. used placeholder sequences for each included genus (and dropped some because their data inflicted too much topological ambiguity), Xiang et al. blindly harvested all data of commonly sequenced plastid "barcodes" (rbcL, matK, trnL/LF region, rbcL-atpB spacer) to infer a species-level tree. Outdated, invalid taxa were not corrected for; the used gene sample can show little to no variation below the genus level (which makes dating, and barcoding, impossible). Furthermore, plastid diversification is partly or fully decoupled from speciation processes in the four genera that have been studied using more than a single individual per species (Nothofagus s.str., Fagus, Quercus, Ostryopsis).
  • Xing Y, Onstein RE, Carter RJ, Stadler T, Linder HP. 2014. Fossils and large molecular phylogeny show that the evolution of species richness, generic diversity, and turnover rates are disconnected. Evolution 68:2821–2832 — very similar to the Xiang et al. approach but even more flawed (poor control over used data, poor selection of markers, several problems with the dating approach, which is the basis for estimating the crucial turnover rates). Xiang et al. and Xing et al. show what happens when large-scale meta-analyses are conducted by researchers with no idea about the studied organisms.
  • Zhang J-B, Li R-Q, Xiang X-G, Manchester SR, Lin L, Wang W, Wen J, Chen Z-D. 2013. Integrated fossil and molecular data reveal the biogeographic diversification of the eastern Asian-eastern North American disjunct hickory genus (Carya Nutt.). PLoS ONE 8:e70449. — Focuses on one genus but includes data from all Juglandaceae and gives a typical example for plant biogeographic studies using dated trees (the fourth author is the expert on the fossil record of Juglandaceae, so there are few data issues). It's open access and quite short; give it a read and then try to figure out what the point of the paper is (I looked at the provided data matrix, too, and found quite interesting genetic patterns that completely escaped the authors; it is never wrong to look over your alignment when this is still possible).
Other cited literature
  • Grímsson F, Grimm GW, Zetter R, Denk T. 2016. Cretaceous and Paleogene Fagaceae from North America and Greenland: evidence for a Late Cretaceous split between Fagus and the remaining Fagaceae. Acta Palaeobotanica 56:247–305.
  • Heenan PB, Smissen RD. 2013. Revised circumscription of Nothofagus and recognition of the segregate genera Fuscospora, Lophozonia, and Trisyngyne (Nothofagaceae). Phytotaxa 146:1–31.
  • Heřmanová Z, Kvaček J, Friis EM. 2011. Budvaricarpus serialis Knobloch & Mai, an unusual new member of the Normapolles complex from the Late Cretaceous of the Czech Republic. International Journal of Plant Sciences 172:285–293.
  • Manchester SR. 1987. The fossil history of the Juglandaceae. St. Louis: Missouri Botanical Garden. [book-like paper]