Showing posts with label Consensus network. Show all posts
Showing posts with label Consensus network. Show all posts

Tuesday, August 29, 2017

More non-treelike data forced into trees: a glimpse into the dinosaurs


Plant morphological data sets including fossil taxa can be riddled with incompatible data patterns (e.g. see my first post), and this can be a bit mind-blowing when it comes to tracing evolution over time. So, let’s move on to something potentially more simple: extinct groups of animals.

Until a time-machine is invented, phylogenetic hypotheses for groups such as the many extinct lineages of dinosaurs will have to be based on morphological data sets. Dinosaur fossils are nowhere near as frequent as as plant fossils (often isolated organ); but when a complete or partial skeleton is found, this specimen allows scoring more characters than is possible for even a higher-level composite plant taxon. For instance, the largest (character-wise) plant data matrices, using composite taxa and operating at the level of genera and above, including fossils, have a little over 100 characters, whereas dinosaur matrices like the one used by Tschopp, Mateus & Benson (2015) can have several hundreds of characters.

Classification of dinosaurs tries to apply the principles of ‘cladistics’ (see also http://tolweb.org/Dinosauria), a classification system established by Hennig (1950). Cladistic classification – Hennig did not propose any inference framework – aims to identify exclusively shared derived traits (synapomorphies), and consequently groups of taxa (originally species) that share an inclusive common origin, Hennig's “monophyla”. [In contrast to Haeckel’s (1866) concept of monophyletic groups, which just assumed a common origin, but did not require inclusiveness.] For some reason, which seem to have no scientific basis, but can be understood in a historical context (Felsenstein 2001, 2004: chapter 10), cladistics has been synonymised with parsimony analysis, one of the optimality criteria to infer one-dimensional graphs reflecting a series of dichotomous splits (phylogenetic trees). A basic assumption of cladistic studies is that a clade in a parsimony-inferred tree equals a monophylum (which is not necessarily the case, see e.g. Scotland & Steel 2015 for binary data).

In palaeontology (and systematic biology to some degree) it is common not to show a phylogram, a phylogenetic tree with branch-lengths, but a cladogram. These cladograms rarely depict the optimised (or one of the equally optimal) tree(s), but instead show the strict consensus tree of the found equally parsimonious trees (or potentially most-parsimonious trees) (MPTs). This is also the case for the study by Tschopp et al., used here as an example of the generally non-treelike data used in studies dealing with extinct groups of animals.

David provided a list of questions for exploratory data analysis (EDA), which can (and should) be asked when trying to infer phylogenies based on morphological data. I will look at some of them here.

First question: Are the data tree-like?

The data matrix of Tschopp et al. is impressive (much like the paper itself, with its 298 pages). The authors scored 477 characters (243 new) for (a final set of) 81 “operational taxonomic units” (OTUs). The OTUs are typically specimens in the case of the ingroup, and include several outgroup species for rooting the phylogenetic tree. There are lots of gaps in of the matrix (65% missing data), which relates to the inclusion of poorly known fossil specimens, which the authors tried to classify using parsimony inference and pairwise distances. The authors note (p. 163): “Given the low consistency index (CI) and thus high number of homoplasies in the dataset, an additional analysis with the same settings was conducted using implied weighting (iw).” In addition to signal ambiguity related to general homoplasy and ontogeny, the authors note character overlap effects and deformation (pp. 166ff). So, there are quite a few different sources of incompatible, non-treelike signal.

With equal weighting and including all 81 OTUs, the authors ended up with 60,000 equally parsimonious trees (possibly more — this was the maximum number limited by computational constraints). This produced a strict consensus (SC) tree with just 12 nodes, in which “all ingroup specimens formed one large polytomy”. The ‘implied weighting’ lead to a slightly more resolved SC tree. ‘Implied weighting’ is a posterior means to downweigh characters conflicting with the inferred tree. The authors further identified some (4, 8, or 15) OTUs accounting for most of the “instability”. A posteriori filtering of these putative rogue taxa led to SC trees that were much better resolved (Fig. 1).

Fig. 1 The six strict consensus trees shown by Tschopp et al. The red crosses indicate the OTUs that were pruned from the MPT tree sample to increase the resolution of the SC tree. For the first tree, I added the information on the fraction of missing data (blue dots).

Both tree-like and non-treelike data can collapse strict consensus trees, but the large number of MPTs can be a first indication that the data are not tree-like. The MPT samples inferred by Tschopp et al. are not included in the documentation (following the current standard; see also data uploaded to TreeBase). Using the quick-analysis option in PAUP* (random heuristic search, 100 replicates, CHUCK-options set), I found 3,000 equally parsimonious trees, which are only slightly worse (1983 steps) than the 60,000 MPTs (1979 steps reported) combined in Tschopp et al.’s unweighted cladogram.

Using the consensus network approach (Holland & Moulton 2003) for summarising the parsimony-tree sample (no cut-off value), we can get a first impression of the signal in the matrix (Fig. 2). The data allow for a great number of topological alternatives — they are generally not tree-like. Only a few relationships are unambiguous in this collection. The fan-like topological features (composed typically of low-dimensional boxes) relate to: (a) jumping OTUs (rogue taxa), (b) uncertainty regarding relationships between related OTUs consistently found in the same subtree, and (c) the exact composition of the subtrees. In contrast to the strict consensus tree, the network visualises the tree-unlikeliness of the data expressed in the MPT collection, revealing extremely ‘rogue’-ish OTUs (e.g. Diplodocus_YPM_1922) and OTUs with indiscriminate signal (e.g. FMNH_P25112), and also allows us to qualify the ‘rogueness’ of all other OTUs.

Fig. 2 Strict consensus network (all edge-lengths set to 1) of 3000 equally parsimonious trees, inferred from Tschopp et al.'s matrix. This graph is the network equivalent of the commonly seen strict consensus cladograms (Fig. 1). Note that the tree sample is slightly suboptimal and likely incomprehensive.

One pre-inference measure for tree-likeness is the Delta Value (DV) introduced by Holland et al. (2002); see e.g. Auch et al. (2006) and Göker & Grimm (2008) for applications. The matrix DV is 0.47, which is very high, even for a morphological matrix. The individual DVs (iDV) range between 0.417 and 0.577, which means that no set of OTU provides a tree-like signal. The complete data are not tree-like, and hence the failure to find unambiguous relationships, even when a comprehensive tree search and ‘implicit weighting’ are used (see Tschopp et al. 2015). Extreme iDV (> 0.55) correlate with (relatively) high proportions of missing data (75–98%, i.e. 10–119 defined characters; Fig. 3), indicating that missing data are a problem for inferences and the calculation of the pairwise distance matrix.

Fig. 3 XY-plot showing the individual Delta Values (a measure for treelike signal) in relation to the proportion of missing data. The green "comfort zone" indicates iDVs favorable for tree-inference (based on personal experience).

Subsequent question: Why are the data not tree-like?

In his post, David listed four possible reasons for non-tree-like data:
  (a) uninformative data: a “bush”,
  (b) weakly tree-like data: a “tree obscured by vines”,
  (c) data containing several strongly incompatible relationships: a “structured network”,
  (d) confusing or random data: a “spider-web”.
Lacking branch-lengths, the MPT consensus network above provides no information regarding (a), and limited information regarding (b) and (c). Only (d) can be excluded as a main source of non-tree-like signal for the dinosaur data: higher-than-3-dimensional boxes are rare.

Fig. 4 Boostrap (BS) consensus network based on 10,000 BS (pseudo)replicates. Trivial splits in grey, splits without strong alternatives in blue, conflicting splits (always two alternatives) in red. All splits found in less than 20% of the BS replicates not shown, and edge length are proportional to the split frequencies.

Figure 4 shows the bootstrap support network based on 10,000 parsimony bootstrap pseudoreplicates (generated following Müller 2005). Some terminal sister relationships seen in the original, taxon-reduced, unweighted or weighted SC trees rely on quite robust, unconflicted signal, a few others are only supported by a small fraction of the characters, but all competing alternatives even less (blue edges in the graph). Thus, it is a “Maybe” for (a) (see also Fig. 5), and a “Yes” for (b) (compare Figs 2 and 4). The character suites of many OTUs provide no robust signal to place them; their position in the set of trees is based on the signal of relatively (large matrix!) few characters, or the result of branching artefacts as we force non-treelike data into a tree. The robust signal for some terminal clades may be obscured by ambiguous signal of potential additional members of the clade, or OTUs similar to only part of a clade (the “vines”).

We can also observe some pronounced 2-dimensional boxes: here the signal from the data matrix has no preference for a single alternative, but indicates two competing alternatives (red edges in the graph), i.e. also a possible “Yes” for (c). In the case of morphological data, reticulate signals do not necessarily indicate reticulation in an evolutionary sense. They can be triggered by two (more or less related) lineages evolving into the same morphospace, or the co-existence of ancestral and derived forms (see also this post). No spider-web-like portions (high-dimensional boxes) are seen (and are also largely missing from the MPT consensus network in Fig. 2), so we can exclude chaotic signal as reason (d) for the tree-unlikeliness of the data.

Fig. 5 Neighbour-net splits graph based on pairwise (Hamming) distances computed with PAUP* using the Tschopp et al. matrix.

Figure 5 shows the unfiltered, simple (Hamming) distance-based neighbour-net (NNet) for the same matrix. Mirroring the high matrix DV and iDVs, the NNet has only a few tree-like portions, but nevertheless reflects a high diversity — long terminal edges; pairwise distances range between 0 (no difference in data-covered characters) and 1 (all characters are different). Some OTUs are placed closed to or in the boxy centre of the graph or the root trunks of terminal groups. Such a placement is either indicative of ancestry (see my earlier post), which is a special case of reason (c), or a lack of discriminative signal, i.e. reason (a) for non-treelike data. Here, it appears to be mostly the latter: the iDV are high, and the highest iDV relate to high proportions of missing data (more than 75%).

High proportions of missing data do not necessarily result in high DV (here 75% missing data equals c. 150 defined characters, which could be more than enough to place a taxon). But not a few OTUs have zero pairwise-distances to a set of diverse OTUs that are not closely related. In total, 74 of the 81 OTUs show a zero-distance to at least one other OTU; with Diplodocus YPM 1922 (98% missing data) being the most-extremely non-distinct OTU: it has a zero-distance to 66 OTUs, including one outgroup taxon. Such a pattern is impossible from an evolutionary point of view (even an ancestor cannot be identical to all of its off-spring when they diversified). and is a missing data artefact. The NNet resolves this data insufficiency by placing the highly ambiguous OTUs in the centre of the graph, whereas parsimony (or other tree inference) deals with this effectively unsolvable problem by providing some, many, or all theoretically possible placements of the problematic OTU (the OTU turns ‘rogue’) as equally optimal (large fans in Fig. 2) but without support (Fig. 4).

There are two options to infer phylogenetic trees, or to test alternative evolutionary hypotheses using Tschopp et al.’s matrix with its tree-unlike data.
  1. One is to reduce the taxon set to those OTUs with less than 50% of missing data, to produce a backbone tree or network (matrix DV = 0.28; iDV range between 0.219–0.352; Fig. 6), Then  to evaluate the position (or possible positions) of each other OTU within this backbone (using ‘+1 OTU’ neighbour-nets, parsimony-optimisation or algorithms such as the evolutionary placement algorithm implemented in RAxML; Berger & Stamatakis 2010; Berger, Krompass & Stamatakis 2011). Then finalise with group-restricted taxon and character subsets to study within-group relationships.
  2. The other is to cut the matrix into pieces and taxon sets with good data overlap. Then assess the correlation between these submatrices (e.g. using Pearson’s correlation coefficient) and their tree-likeness (using Delta Values). Then use consensus networks and/or supernetworks to investigate potential incongruences, and to summarise topological alternatives.

Fig.6 Neighbour-net (NNet) for a taxon-reduced set, only including OTUs with more than 50% of defined characters. These data result in a single most-parsimonious tree, which is largely congruent to the main splits in the NNet (blue), except for a three poorly supported branches (red). Numbers indicate neighbour-joining and parsimony bootstrap support for branches in the MPT and corresponding edges in the NNet and their alternatives.

Palaeontologists: Please stop using strict consensus trees, and start with EDA

To fill the deeper parts of the Tree of Life with life, we cannot get around morphological data and phylogenetic inferences based on these data. Most of Earth’s diversity is extinct, so their molecular data are (largely) lost to science. But no matter whether we work with extinct plants or animals, or with matrices containing many or few morphological characters, we should keep a close eye on the primary signals in those matrices. Are the data tree-like? Are there rogue taxa, and how/why do they affect the inferences? How discriminatory are the data regarding competing alternative hypotheses? Does taxon and character sampling matter? Networks (planar or n-dimensional) can help to: (1) assess the potential of the data for tree inference, and (2) discuss the putative monophyly of groups and their alternatives.

The signal from morphological data matrices is complex, and the data are rarely tree-like. Irrespective of whether one wants to stick with parsimony or not, tree-based and support consensus networks should by now have long replaced the strict (or majority-rule) consensus trees in “cladistic” or general-phylogenetic studies dealing with extinct groups of organisms.

Posteriori methods to filter or down-weight characters not fitting the inferred tree(s) ignore the fact that morphological differentiation typically cannot be explained by a single tree (leaving aside, that total evidence and DNA-constrained analysis demonstrate that morphological evolution is not parsimonious at all). There are too many sources of signal incompatible with the true tree.

In the light of ambiguous and potentially biased signals (outlined and discussed by Tschopp et al. 2015 for their data), the focus of cladistic or other phylogenetic studies that aim to fill the Tree of Life with extinct branches cannot be to infer a clean(ed) tree. Instead, the focus should be on exploring the signals in the data and assessing their capacity to exclude or support evolutionary scenarios. A well understood topological uncertainty is always better than a poorly supported clade.

Regarding the Tree of Life, we should start representing uncertainty as-is (i.e. showing the currently competing alternatives), and reserve polytomies for cases where we really have no idea at all. Also, we should place potential ancestors (ancestral forms) where they belong: at the root nodes of their descendant lineages (the forms derived from them).

References

Auch AF, Henz SR, Holland BR, Göker M. (2006) Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences. BMC Bioinformatics 7:350.

Berger SA, Krompass D, Stamatakis A. (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under Maximum Likelihood. Systematic Biology 60:291–302.

Berger SA, Stamatakis A. (2010) Accuracy of morphology-based phylogenetic fossil placement under Maximum Likelihood. IEEE/ACS International Conference on Computer Systems and Applications (AICCSA). Hammamet: IEEE. p. 1-9.

Felsenstein J. (2001) The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.

Felsenstein J. (2004) Inferring phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Göker M, Grimm GW. (2008)General functions to transform associate data to host data, and their use in phylogenetic inference from sequences with intra-individual variability. BMC Evolutionary Biology 8:86.

Haeckel E. (1866) Generelle Morphologie der Organismen. Berlin: Georg Reiner.

Hennig W. (1950) Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag.

Holland B, Moulton V. (2003) Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176.

Holland BR, Huber KT, Dress A, Moulton V. (2002) Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19:2051-2059.

Müller KF. (2005) The efficiency of different search strategies for estimating parsimony, jackknife, bootstrap, and Bremer support. BMC Evolutionary Biology 5:58.

Scotland RW, Steel M. (2015) Circumstances in which parsimony but not compatibility will be provably misleading. Systematic Biology 64:492–504. [preprint]

Tschopp E, Mateus O, Benson RBJ. (2015) A specimen-level phylogenetic analysis and taxonomic revision of Diplodocidae (Dinosauria, Sauropoda). PeerJ 3:e857.

Post-script: Why distance-based approaches?

Distance-based approaches may be still refuted by hard-core cladists as “unphylogenetic” or “phenetic” (again, see Felsenstein 2004 for the historical reasons, and why this is wrong), particularly when acting as anonymous reviewers of palaeontological papers. But the simple fact is: a character matrix not allowing inference of a pairwise distance matrix with at least some tree-like signal, should not be used to infer phylogenetic trees (no matter which optimality criterion is used).

A perfect character matrix, i.e. a matrix in which each dichotomy is subsequently followed by one or several strictly synapomorphic changes will, of course, result in a single MPT. But it will also provide a simple (Hamming) mean distance matrix allowing us to infer a neighbour-joining tree fulfilling the least-squares or minimum evolution optimality criteria, and this will be identical to the MPT and a corresponding NNet without any box-like portions. It will also be the most probable topology that can be inferred using maximum likelihood or Bayesian inference.

When different tree inference methods come to substantially different results for morphological matrices, the signal from the primary matrix is likely not to be tree-like, and internal conflict then needs to be explored. The more tree-like is the matrix, then the less it will be affected by methodological differences (e.g. Fig. 6; the only branches of the MPT not fitting the preferred splits in the NNet have low support, and compete with equally low supported splits seen in the NNet that receive high support from NJ-bootstrapping).

Distance-based analyses are much faster than parsimony, maximum likelihood, and Bayesian inferences; and they are not restricted to inferring phylogenetic trees. Within the same time that I need to perform a comprehensive tree and branch support analysis, I can generate hundreds of NNets using different taxon and character subsets of my matrix, and thus explore its many signals. One can employ different distance measures to deal with continuous or ordered categorical data, and then directly see the effect on the reconstruction. Eventually, one may find a subset that provides the most tree-like signal, which will be the best possible basis for the final tree-inference (in case an evolutionaru tree is what is wanted) and branch support analysis.

Tuesday, July 4, 2017

Should we try to infer trees on tree-unlikely matrices?


Spermatophyte morphological matrices that combine extinct and extant taxa notoriously have low branch support, as traditionally established using non-parametric bootstrapping under parsimony as optimality criterion. Coiro, Chomicki & Doyle (2017) recently published a pre-print to show that this can be overcome to some degree by changing to Bayesian-inferred posterior probabilities. They also highlight the use of support consensus networks for investigating potential conflict in the data. This is a good start for a scientific community that so far has put more of their trust in either (i) direct visual comparison of fossils with extant taxa or (ii) collections of most parsimonious trees inferred based on matrices with high level of probably homoplasious characters and low compatibility. But do those matrices really require or support a tree? Here, I try to answer this question.

Background

Coiro et al. mainly rely on a recent matrix by Rothwell & Stockey (2016), which marks the current endpoint of a long history of putting up and re-scoring morphology-based matrices (Coiro et al.’s fig. 1b). All of these matrices provide, to various degrees, ambiguous signal. This is not overly surprising, as these matrices include a relatively high number of fossil taxa with many data gaps (due to preservation and scoring problems), and combine taxa that perished a hundred or more millions years ago with highly derived, possibly distant-related modern counterparts.

Rothwell & Stockey state (p. 929) "As is characteristic for the results from the analysis of matrices with low character state/taxon ratios, results of the bootstrap analysis (1000 replicates) yielded a much less fully resolved tree (not figured)." Coiro et al.’s consensus trees and network based on 10,000 parsimony bootstrap replicates nicely depicts this issue, and may explain why Rothwell & Stockey decided against showing those results. When studying an earlier version of their matrix (Rothwell, Crepet & Stockey 2009), they did not provide any support values, citing a paper published in 2006, where the authors state (Rothwell & Nixon 2006, p. 739): “… support values, whether low or high for particular groups, would only mislead the reader into believing we are presenting a proposed phylogeny for the groups in question. Differences among most-parsimonious trees are sufficient to illuminate the points we wish to make here, and support values only provide what we consider to be a false sense of accuracy in these assessments”.

Do the data support a tree?

The problem is not just low support. In fact, the tree showed by Rothwell & Stockey with its “pectinate arrangement” conflicts in parts with the best-supported topology, a problem that also applied to its 2009 predecessor. This general “pectinate” arrangement of a large, low or unsupported grade is not uncommon for strict consensus trees based on morphological matrices that include fossils and extant taxa (see e.g. the more proximal parts of the Tree of Life, e.g. birds and their dinosaur ancestors).

The support patterns indicate that some of the characters are compatible with the tree, but many others are not. Of the 34 internodes (branches) in the shown tree (their fig. 28 shows a strict consensus tree based on a collection of equally parsimonious trees), 12 have lower bootstrap support under parsimony than their competing alternatives (Fig. 1). Support may be generally low for any alternative, but the ones in the tree can be among the worst.

The main problem is that the matrix simply does not provide enough tree-like signal to infer a tree. Delta Values (Holland et al. 2002) can be used as a quick estimate for the treelikeliness of signal in a matrix. In the case of large all-spermatophyte matrices (Hilton & Bateman 2006; Friis et al. 2007; Rothwell, Crepet & Stockey 2009; Crepet & Stevenson 2010), the matrix Delta Values (mDV) are ≥ 0.3. For comparison, molecular matrices resulting in more or less resolved trees have mDV of ≤ 0.15. The individual Delta Values (iDV), which can be an indicator of how well a taxon behaves during tree inference, go down to 0.25 for extant angiosperms – very distinct from all other taxa in the all-spermatophyte matrices with low proportions of missing data/gaps – and reach values of 0.35 for fossil taxa with long-debated affinities.

The newest 2016 matrix is no exception with a mDV of 0.322 (the highest of all mentioned matrices), and iDVs range between 0.26 (monocots and other extant angiosperms) and 0.39 for Doylea mongolica (a fossil with very few scored characters). In the original tree, Doylea (represented by two taxa) is part of the large grade and indicated as the sister to Gnetidae (or Gnetales) + angiosperms (molecular trees associate the Gnetidae with conifers and Ginkgo). According to the bootstrap analysis, Doylea is closest to the extant Pinales, the modern conifers. Coiro et al. found the same using Bayesian inference. Their posterior probability (PP) of a Doylea-Podocarpus-Pinus clade is 0.54, and Rothwell & Stockey’s Doylea-Ginkgo-angiosperm clade conflicts with a series of splits with PPs up to 0.95.

Figure 1. Parsimony bootstrap network based on 10,000 pseudoreplicate trees
inferred from the matrix of Rothwell & Stockey.
Edges not found in the authors’ tree in red, edges also found in the tree in green.
Extant taxa in blue bold font. The edge length is proportional to the frequency of the
according split (taxon bipartition, branch in a possible tree) in the pseudoreplicate
tree sample. The network includes all edges of the authors’ tree except for
Doylea + Gnetidae + Petriellales + angiosperms vs. all other gymnosperms and
extinct seed plant groups. Such a split has also no bootstrap support (BS < 10)
using least-square and maximum likelihood optimum criteria.

Do the data require a tree?

As David made a point in an earlier post, neighbour-nets are not really “phylogenetic networks” in the evolutionary sense. Being unrooted and 2-dimensional, they don’t depict a phylogeny, which has to be a sort of (rooted) tree, a one-dimensional graph with time as the only axis (this includes reticulation networks where nodes can be the crossing point of two internodes rather than their divergence point). The neighbour-net algorithm is an extension into two dimensions of the neighbour-joining algorithm, the latter infers a phylogenetic tree serving a distance criterion such as minimum evolution or least-squares (Felsenstein 2004). Essentially, the neighbour-net is a ‘meta-phylogenetic’ graph inferring and depicting the best and second-best alternative for each relationship. Thus, neighbour-nets can help to establish whether the signal from a matrix, treelike or not as it is the cases here, supports potential and phylogenetic relationships, and explore the alternatives much more comprehensively than would be possible with a strict-consensus or other tree (Fig. 2).

Figure 2. Neighbour-net based on a mean distance matrix inferred
from the matrix of Rothwell & Stockey.
The distance to the "progymnosperms", a potential ancestral group of the
seed plants, can be taken as a measurement for the derivedness of each
major group. The primitive seed ferns are placed between progymnosperms
 and the gymnosperms connected by partly compatible edge bundles; the
putatively derived "higher seed ferns" isolated between the progymnosperms
and the long-edged angiosperms. Shared edge-bundles and 'neighbourness'
reflect quite well potential phylogenetic relationships and eventual ambiguities,
as in the case of Gnetidae. Colouring as in Figure 1; some taxon names
are abbreviated.

In addition, neighbour-nets usually are better backgrounds to map patterns of conflicting or partly conflicting support seen in a bootstrap, jackknife or Bayesian-inferred tree sample. In Fig. 3, I have mapped the bootstrap support for alternative taxon bipartitions (branches in a tree) on the background of the neighbour-net in Fig. 2.

Obvious and less-obvious relationships are simultaneously revealed, and their competing support patterns depicted. Based on the graph, we can see (edge lengths of the neighbour-net) that there is a relatively weak primary but substantial bootstrap support for the Petriellales (a recently described taxon new to the matrix) as sister to the angiosperms. Several taxa, or groups of closely related taxa, are characterised by long terminal edges/edge bundles, rooting in the boxy central part of the graph. Any alternative relationship of these taxa/taxon groups receives equally low support, but there are notable differences in the actual values.

There is little signal to place most of the fossil “seed ferns” (extinct seed plants) in relation to the modern groups, and a very ambiguous signal regarding the relationship of the Gnetidae (or Gnetales) with the two main groups of extant seed plants, the conifers (Pinidae; see C. Earle’s gymnosperm database) and angiosperms (for a list and trees, see P. Stevens’ Angiosperm Phylogeny Website).

The Gnetidae is a strongly distinct (also genetically) group of three surviving genera, being a persistent source of headaches for plant phylogeneticists. Placed as sister to the Pinaceae (‘Gnepine’ hypothesis) in early molecular trees (long-branch attraction artefact), the currently favoured hypothesis (‘Gnetifer’) places the Gnetidae as sister to all conifers (Pinatidae) in an all-gymnosperm clade (including Gingko and possibly the cycads).

As favoured by the branch support analyses, and contrasting with the preferred 2016 tree, the two Doyleas are placed closest to the conifers, nested within a commonly found group including the modern and ancient conifers and their long-extinct relatives (Cordaitales), and possibly Ginkgo (Ginkgoidae). In the original parsimony strict consensus tree, they are placed in the distal part as sister to a Gnetidae and Petriellales + angiosperms (possibly long-branch attraction). The grade including the ‘primitive seed ferns’ (Elkinsia through Callistophyton), seen also in Rothwell and Stockey’s 2016 tree, may be poorly supported under maximum parsimony (the criterion used to generate the tree), but receives quite high support when using a probabilistic approach such as maximum likelihood bootstrapping or Bayesian inference to some degree (Fig. 3; Coiro, Chomicki & Doyle 2017).

Figure 3. Neighbour-net from above used to map alternative support patterns.
Numbers refer to non-parametric bootstrap (BS) support for alternative phylogenetic
splits under three optimality criteria: maximum likelihood (ML) as implemented in
RAxML (using MK+G model), maximum parsimony (MP), and least-squares
(via neighbour-joining, NJ; using PAUP*); and Bayesian posterior probabilties
(using MrBayes 3.2; see Denk & Grimm 2009, for analysis set-up). The circular
arrangement of the taxa allows tracking most edges in the authors’ tree and their,
sometimes better supported, alternatives. The edge lengths provide direct
information about the distinctness of the included taxa to each other; the structure
of the graph informs about the how tree-like the signal is regarding possible
phylogenetic relationships or their alternatives. Colouring as in Figure 1;
some taxon names are abbreviated.

Numerous morphological matrices provide non-treelike signals. A tree can be inferred, but its topology may be only one of many possible trees. In the framework of total evidence, this may be not such a big problem, because the molecular partitions will predefine a tree, and fossils will simply be placed in that tree based on their character suites. Without such data, any tree may be biased and a poor reflection of the differentiation patterns.

By not forcing the data in a series of dichotomies, neighbour-nets provide a quick, simple alternative. Unambiguous, well-supported branches in a tree will usually result in tree-like portions of the neighbour net. Boxy portions in the neighbour-net pinpoint the ambiguous or even problematic signals from the matrix. Based on the graph, one can extract the alternatives worth testing or exploring. Support for the alternatives can be established using traditional branch support measures. Since any morphological matrix will combine those characters that are in line with the phylogeny as well as those that are at odds with it (convergences, character misinterpretations), the focus cannot be to infer a tree, but to establish the alternative scenarios and the support for them in the data matrix.

References

Coiro M, Chomicki G, Doyle JA. 2017. Experimental signal dissection and method sensitivity analyses reaffirm the potential of fossils and morphology in the resolution of seed plant phylogeny. bioRxiv DOI:10.1101/134262

Crepet WL, Stevenson DM. 2010. The Bennettitales (Cycadeoidales): a preliminary perspective of this arguably enigmatic group. In: Gee CT, ed. Plants in Mesozoic Time: Morphological Innovations, Phylogeny, Ecosystems. Bloomington: Indiana University Press, pp. 215-244.

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83-100.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Friis EM, Crane PR, Pedersen KR, Bengtson S, Donoghue PCJ, Grimm GW, Stampanoni M. 2007. Phase-contrast X-ray microtomography links Cretaceous seeds with Gnetales and Bennettitales. Nature 450: 549-552 [all important information needed for this post is in the supplement to the paper; a figure showing the actual full analysis results can be found at figshare]

Hilton J, Bateman RM. 2006. Pteridosperms are the backbone of seed-plant phylogeny. Journal of the Torrey Botanical Society 133: 119-168.

Holland BR, Huber KT, Dress A, Moulton V. 2002. Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19: 2051-2059.

Rothwell GW, Crepet WL, Stockey RA. 2009. Is the anthophyte hypothesis alive and well? New evidence from the reproductive structures of Bennettitales. American Journal of Botany 96: 296–322.

Rothwell GW, Nixon K. 2006. How does the inclusion of fossil data change our conclusions about the phylogenetic history of the euphyllophytes? International Journal of Plant Sciences 167: 737–749.

Rothwell GW, Stockey RA. 2016. Phylogenetic diversification of Early Cretaceous seed plants: The compound seed cone of Doylea tetrahedrasperma. American Journal of Botany 103: 923–937.

Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.

Wednesday, August 14, 2013

How to construct a consensus network from the output of a bayesian tree analysis


In an earlier blog post I argued that We should present bayesian phylogenetic analyses using networks. The rationale for this is that a bayesian analysis is concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. In phylogenetics based on Markov Chain Monte Carlo (MCMC) methods, which produce a set of trees sampled in proportion to their posterior probabilites, the tree topologies can thus be summarized using a consensus network. This should be more in keeping with the bayesian philosophy than is producing a single tree, the so-called MAP tree. The MAP tree is based on combining those taxon partitions with the greatest frequency in the MCMC sample, so that the probability distribution is reduced down to a single tree with posterior probability values on the branches. On the other hand, a network produced from all of the partitions that appear in the MCMC sample, weighted according to their frequency, would be much closer to the bayesian aim. [Note: For a clarification of this point, see Leonardo de Oliveira Martins' comment at the end of this post.]

The practical issue with trying to do this is that at the moment it is not straightforward to get the consensus network from the output of any of the bayesian computer programs. These programs usually produce a file containing all of the sampled trees (from which the burn-in trees can be deleted). The simplest way to get the consensus network would be to use the SplitsTree or Spectronet programs to produce the network directly from this treefile. This can be done for files with a small number of trees; but no-one recommends doing bayesian analysis with a small number of MCMC-sampled trees. When the treefile contains tens of thousands of trees this is pushing the limit of SplitsTree and Spectronet, and they crash.

An alternative is to use the smaller "trprobs" file that is provided by, for example MrBayes, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree and SectroNet do not currently read tree weights in treefiles. So, Holland et al. (2005, 2006) produced a Python script to create the required input files, which can then be input to SplitsTree or SpectroNet. A copy of this script is provided here.

However, this approach is still limited, and so it is not the approach that I used in the example analysis provided in my previous blog post. The MrBayes program, for example, also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than from the trees, and so this information can also be used instead of the treefile. This can be provided to SplitsTree in a nexus-formatted Splits block (derived from the bipartitions) rather than in a Trees block (derived from the treefile).

There are two practical problems with this approach. First, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions, in order to decide how many of the bipartitions should be included in the network. This makes the process tedious. Second, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. I will try to explain this process here.

Manual procedure

The nexus-format file used for my previous analysis is here. It contains the original sequence data (the Data block), the instructions for MrBayes (the MrBayes block), the treefile produced by MrBayes (the Trees block), and the bipartitions information (the Splits block). This should reproduce the first consensus network shown in the previous blog post, and thus it shows you what a nexus-formatted file looks like.

The Splits block looks like this. The first column of the Matrix is simply an index to label the splits; the second column is the bipartition weight; and the third column lists the taxa in one of the two parts of each bipartition (you can choose either partition, but clearly it is quicker to list the taxa in the smaller partition).

BEGIN Splits; [Bipartitions occurring in >5% of the trees]
  DIMENSIONS ntax=17 nsplits=46;
  FORMAT labels=no weights=yes confidences=no intervals=no;
MATRIX
[1]   1.000000 1,
[2]   1.000000 2,
[3]   1.000000 3,
[4]   1.000000 4,
[5]   1.000000 5,
[6]   1.000000 6,
[7]   1.000000 7,
[8]   1.000000 8,
[9]   1.000000 9,
[10]   1.000000 10,
[11]   1.000000 11,
[12]   1.000000 12,
[13]   1.000000 13,
[14]   1.000000 14,
[15]   1.000000 15,
[16]   1.000000 16,
[17]   1.000000 17,
[18]   1.000000 12 13 14 15 16 17,
[19]   1.000000 16 17,
[20]   1.000000 12 13 14 15,
[21]   0.990441 9 10,
[22]   0.986401 12 13,
[23]   0.981162 3 4 7,
[24]   0.950994 2 5 6 8 9 10 11 12 13 14 15 16 17,
[25]   0.940455 3 4,
[26]   0.884189 12 13 15,
[27]   0.858771 5 6 8 9 10 11 12 13 14 15 16 17,
[28]   0.467503 5 6 8 11 12 13 14 15 16 17,
[29]   0.359641 6 8,
[30]   0.327114 6 11,
[31]   0.299736 5 11,
[32]   0.298256 6 8 11 12 13 14 15 16 17,
[33]   0.264989 11 12 13 14 15 16 17,
[34]   0.264549 6 8 11,
[35]   0.229872 6 11 12 13 14 15 16 17,
[36]   0.218812 5 6 8 11,
[37]   0.165447 5 11 12 13 14 15 16 17,
[38]   0.146118 9 1012 13 14 15 16 17,
[39]   0.144918 6 8 12 13 14 15 16 17,
[40]   0.135209 5 68 9 10 11,
[41]   0.130750 6 12 13 14 15 16 17,
[42]   0.114961 5 12 13 14 15 16 17,
[43]   0.109871 5 9 10,
[44]   0.105942 14 15,
[45]   0.084963 2 5 6 8 9 10 11,
[46]   0.070264 5 9 10 11 12 13 14 15 16 17,
;
END; [Splits]

The information about the taxa occurring in each bipartiton is taken from the following table, which appears in the MrBayes output. The ID is used as the first column of the Splits block; and the asterisks have simply been converted to the relevant taxon number (the Partition columns represent taxa 1–17, in order, so that an asterisk indicates that the particular taxon is included in that partition).

      ID -- Partition
      -----------------------
       1 -- .****************
       2 -- .*...............
       3 -- ..*..............
       4 -- ...*.............
       5 -- ....*............
       6 -- .....*...........
       7 -- ......*..........
       8 -- .......*.........
       9 -- ........*........
      10 -- .........*.......
      11 -- ..........*......
      12 -- ...........*.....
      13 -- ............*....
      14 -- .............*...
      15 -- ..............*..
      16 -- ...............*.
      17 -- ................*
      18 -- ...........******
      19 -- ...............**
      20 -- ...........****..
      21 -- ........**.......
      22 -- ...........**....
      23 -- ..**..*..........
      24 -- .*..**.**********
      25 -- ..**.............
      26 -- ...........**.*..
      27 -- ....**.**********
      28 -- ....**.*..*******
      29 -- .....*.*.........
      30 -- .....*....*......
      31 -- ....*.....*......
      32 -- .....*.*..*******
      33 -- ..........*******
      34 -- .....*.*..*......
      35 -- .....*....*******
      36 -- ....**.*..*......
      37 -- ....*.....*******
      38 -- ........**.******
      39 -- .....*.*...******
      40 -- ....**.****......
      41 -- .....*.....******
      42 -- ....*......******
      43 -- ....*...**.......
      44 -- .............**..
      45 -- .*..**.****......
      46 -- ....*...*********
      -----------------------

The bipartition weights come from this table, which also appears in the MrBayes output. The relevant information is in column three. The IDs are the same as in the previous table. (Note that IDs 1–17 have a "Probab." of 1.000000 by definition.)

      Summary statistics for informative taxon bipartitions
      ID   #obs      Probab.     Sd(s)+      Min(s)      Max(s)   Nruns 
      ------------------------------------------------------------------
      18  100008    1.000000    0.000000    1.000000    1.000000    8
      19  100008    1.000000    0.000000    1.000000    1.000000    8
      20  100008    1.000000    0.000000    1.000000    1.000000    8
      21   99052    0.990441    0.001920    0.987521    0.993121    8
      22   98648    0.986401    0.002285    0.984481    0.991041    8
      23   98124    0.981162    0.004318    0.974082    0.986721    8
      24   95107    0.950994    0.010545    0.939765    0.964723    8
      25   94053    0.940455    0.004211    0.934885    0.946564    8
      26   88426    0.884189    0.019000    0.851532    0.919526    8
      27   85884    0.858771    0.018347    0.839453    0.885449    8
      28   46754    0.467503    0.033135    0.432525    0.528278    8
      29   35967    0.359641    0.072244    0.286057    0.512679    8
      30   32714    0.327114    0.048225    0.243341    0.390609    8
      31   29976    0.299736    0.048974    0.233661    0.382049    8
      32   29828    0.298256    0.050175    0.201584    0.358851    8
      33   26501    0.264989    0.049249    0.185345    0.345732    8
      34   26457    0.264549    0.039611    0.204304    0.316375    8
      35   22989    0.229872    0.042602    0.141909    0.275578    8
      36   21883    0.218812    0.031232    0.160867    0.246460    8
      37   16546    0.165447    0.051763    0.116151    0.251500    8
      38   14613    0.146118    0.023326    0.094152    0.172466    8
      39   14493    0.144918    0.018488    0.108631    0.164947    8
      40   13522    0.135209    0.020371    0.106072    0.156707    8
      41   13076    0.130750    0.017912    0.108391    0.161507    8
      42   11497    0.114961    0.017107    0.096072    0.143429    8
      43   10988    0.109871    0.016253    0.075194    0.123830    8
      44   10595    0.105942    0.018071    0.072074    0.136069    8
      45    8497    0.084963    0.016141    0.065035    0.102872    8
      46    7027    0.070264    0.021858    0.051276    0.108951    8
      ------------------------------------------------------------------

So, some of the information in these two output tables is used to manually produce the Splits block, in the appropriate format, as shown. It would also be possible to write a script to automate this process (e.g. in Perl or Python).

Producing the separate files with Splits blocks containing different percentages of bipartitions is straightforward. As shown above, for the example data there are 46 bipartitions needed for the weights to sum to >0.95 (which is the MrBayes default number). If we choose 0.90 as the sum instead, then only the first 44 bipartitions are needed for the example data, while 0.85 requires only the first 37, and so on:
0.95  46
0.90  44
0.85  37
0.80  36
0.75  34
0.67  29
0.50  27
All that is needed is to delete the unwanted bipartitions from the Splits block, and then save the result as a separate nexus file.

Thanks to Mark for asking me to provide this blog post.

Wednesday, January 9, 2013

We should present bayesian phylogenetic analyses using networks


Bayesian methods differ from other forms of probabilistic analysis in that they are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. That is, Bayesian analysis is not about identifying the most likely outcome, it is about quantifying the relative likelihood of all possible outcomes. In this sense, it is quite distinct from other probabilistic methods, such as those based on estimating the optimal outcome under criteria such as maximum likelihood, maximum parsimony or minimum evolution. As far as likelihood is concerned, Bayesian analysis deals with (maximum) integrated likelihood rather than (maximum relative) likelihood (Steel & Penny 2000).

In phylogenetic analysis this creates a potentially confusing situation, as the result of most Bayesian analyses is presented as a single tree, rather than showing the probability distribution of all trees. Certainly, some of the information from the probability distribution is used in the tree – usually the posterior probabilities that are attached to each of the tree branches – but this is a poor visual summary of the available information. A better approach would be to use a network to display the actual probability distribution.

Theory

Bayesian phylogeny programs such as MrBayes produce a file (or files) with a copy of every tree that was sampled by the MCMC run (or runs). These files can then be manipulated (eg. using the "sumt" command in MrBayes) to exclude the first trees as part of the burn-in; and a smaller file is then produced with a copy of the unique trees found, along with their frequency of occurrence in the sample (eg. in the "trprobs" file from MrBayes). It is this sample of trees that is summarized to produce the so-called MAP tree (maximum a posteriori probability tree) and its associated branch posterior probabilities.

Ideally, we would take this file and produce a consensus network rather than a consensus tree. The tree produced by MrBayes is built from the best-supported branches of the set of trees sampled, but only a set of compatible branches can be included in the consensus tree (see the original work of Margush and McMorris 1981). Any well-supported but incompatible branches will not be shown, and it is the absence of these branches that causes the phylogenetic tree to deviate from the standard Bayesian philosophy of presenting a probability distribution. A consensus network solves this problem because it is specifically designed to present a specified percentage of the incompatible branches as well (Holland et al. 2004).

The idea of using a consensus network rather than a consensus tree was first suggested by Holland et al. (2005), although it has rarely been used in practice (eg. Gibb & Penny 2010). Indeed, consensus networks could also be used to present the results of bootstrap analyses, or a set of equally parsimonious trees (Holland et al. 2006).

It is important to note that a consensus network is unrooted, and therefore it is solely a mathematical summary of the data, and cannot be treated as an evolutionary diagram. The MAP tree is, strictly speaking a form of consensus tree, and so it is technically also solely a mathematical summary of the data. It is, however, conventional to treat it as an evolutionary diagram.

An example

The example data are from Schnittger et al. (2012), being sequence data for the beta-tubulin gene of 17 samples of the genus Babesia. Details of the Bayesian analysis using MrBayes are provided here, which yielded 100,000 trees in the final MCMC sample. A nexus file with the sequence data, the 95% credible set of trees, and the associated bipartitions can be found here. The rooted MAP tree produced by MrBayes is shown in the first diagram, with the posterior probabilities (PP) labelled on each branch.

Bayesian MAP tree.

There are several ways that a consensus network could be presented. The standard way would be to choose some percentage of the MCMC trees and to show the network of those trees. For example, the MAP tree is simply a consensus that includes all of the bipartitions that occur in at least 50% of the trees, plus all of the other bipartitions that are compatible with those bipartitions.

As an alternative, here is the unrooted consensus network with all of the bipartitions that occur in at least 5% of the trees. It is important to note that the branch lengths in the following consensus networks represent the branch support not the estimated number of substitutions (as in the MAP tree). For example, the relationships among the Babesia microti sequences have very short branches in the MAP tree (ie. few substitutions) but long branches in the consensus networks (ie. good support).

Consensus network set at 5%.

This network makes visually clear where the major incompatibilities are among the MCMC trees, which the MAP consensus tree does not (unless one checks the PP values). The major set of boxes in the network involve the branches with PP values of c. 0.4. Note also that this network also approximates the 95% credible set of trees, as it excludes those bipartitions that occur in <5% of the trees.

Unfortunately, this network is visually rather cluttered, with a set of multi-dimensional boxes, and so maybe a simpler network would suffice. Instead, here is the consensus network with all of the bipartitions that occur in at least 20% of the trees. This network still emphasizes the main area of incompatibility among the trees, while losing the less-important incompatibilities on the other branches.

Consensus network set at 20%.

Another approach to simplifying the consensus network is to present a set of what are called weakly compatible bipartitions, rather than choosing some percentage of the bipartitions. This is shown in the next network. Note that visually it is a compromise between the above two networks, and may thus be preferable, as it makes clear the range of branches involved in the incompatibilities without needing to use multi-dimensional boxes to do so (ie. the graph is planar rather than 3-dimensional).

Consensus network of weakly compatible bipartitions.

Finally, we can emphasize the relationship between the network and bipartition compatibility by plotting the consensus network consisting of a set of compatible bipartitons. This forms a tree that has the same topology as the MAP tree (since it is constructed in exactly the same manner), but here the branch lengths represent the branch support rather than the estimated number of substitutions.

Consensus network of compatible bipartitions.

Current practical problems

The simplest way to implement the use of consensus networks to display the set of Bayesian trees would be to use SplitsTree to produce the consensus network directly from the (MrBayes) nexus-format MCMC treefile. However, this treefile will usually contain tens of thousands of trees, which is pushing the limit of SplitsTree. Alternatively, we could read in the smaller nexus-format "trprobs" file, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree does not currently read tree weights in nexus treefiles. So, Holland et al. (2005, 2006) produced some Python scripts to create the required nexus files, which can then be input to SplitsTree (or SpectroNet).

Alternatively, MrBayes also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than the trees, and so this can also be used instead of the treefile. There are two practical problems with this approach. First, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. Second, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions.

I used the latter approach to analyze the example data. [This is explained in more detail in a later post: How to construct a consensus network from the output of a bayesian tree analysis.]

Thanks to Bernie Cohen for prompting me to think about networks and Bayesian analysis, and to Barbara Holland for her help with getting the calculations done.

References

Gibb GC, Penny D (2010) Two aspects along the continuum of pigeon evolution: a South-Pacific radiation and the relationship of pigeons within Neoaves. Molecular Phylogenetics and Evolution 56: 698-706.

Holland BR, Delsuc F, Moulton V (2005) Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54:66-76.

Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Molecular Biology and Evolution 21: 1459-1461.

Holland BR, Jermiin LS, Moulton V (2006) Improved consensus network techniques for genome-scale phylogeny. Molecular Biology and Evolution 23: 848-855.

Margush T, McMorris FR (1981) Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.

Schnittger L, Rodriguez AE, Florin-Christensen M, Morrison DA (2012) Babesia: a world emerging. Infection, Genetics and Evolution 12: 1788-1809.

Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17: 839-850.