Wednesday, May 16, 2012
GPWG Poaceae dataset
In a previous post, Steven mentioned that one of the datasets from the Grass Phylogeny Working Group has played an unexpectedly prominent role in evaluation of hybridization network algorithms.
These algorithms work by trying to construct a network from a set of rooted trees with overlapping sets of taxa, and the GPWG dataset provides six such trees, one from each of six different molecular loci. This dataset seems to have been introduced into the network literature by Bordewich et al. (2007), although it had previously been used for evaluations of supertree methods (Salamin et al. 2002; Schmidt 2003).
The data used consist of DNA sequences of three nuclear loci and three chloroplast genes. The original publication also provides data for morphology and restriction sites, but these have not been used for the network analyses. One reason for interest in this dataset is the possibility of reticulation signals between the nuclear and chloroplast data sources. There are 66 taxa, although nearly half of them are composites formed from data for several different species in the same genus, and only a few of the taxa have data for all six datasets (the number of taxa varies from 19 to 65 per dataset). The data available are summarized in Table 7.1 from Schmidt (2003).
An important point about these data is that in the original GPWG publication the six gene trees were strict consensus trees from maximum-parsimony analyses, and so they have quite a number of polychotomies. These polychotomies were intended by the authors [personal communication] to express uncertainty about the topologies of the trees.
However, this uncertainty is not shown in the trees that have been used for network evaluation. According to Bordewich et al., the trees that they (and everyone else) used were reconstructed using the fastDNAmL program (i.e. maximum likelihood), and were supplied by Heiko Schmidt (see Schmidt 2003, p.74). As expected, there are no polychotomies in these ML trees and no indication of uncertain topology; and, of course, the tree topologies are somewhat different from the parsimony trees.
An important consequence is that there is more incompatibility among the dichotomous maximum-likelihood trees than there is among the polychotomous maximum-parsimony trees. That is, many of the ML incompatibilities are related to uncertainties in the MP trees. Unfortunately, most of the network algorithms that have been evaluated using these data require strictly dichotomous trees.
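To make the point about polychotomies concrete, here is a minimal Python sketch. It is not taken from any of the programs discussed in this post. It assumes that the rooted trees are written as nested tuples of taxon names, that all trees have the same taxon set, and that two clusters are compatible when they are disjoint or one contains the other. The toy trees and the function names (clusters, compatible, conflicts) are invented purely for illustration.

```python
def clusters(tree):
    """Return the non-trivial clusters (sets of leaf names) of a rooted tree.

    The tree is a nested tuple of strings; this representation is an
    assumption made for illustration only.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the cluster of all taxa carries no information
    return found


def compatible(a, b):
    """Two clusters fit on one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def conflicts(tree1, tree2):
    """List every pair of clusters (one from each tree) that is incompatible."""
    return [(a, b)
            for a in clusters(tree1)
            for b in clusters(tree2)
            if not compatible(a, b)]


# Toy example: one unresolved (consensus-like) tree and two full resolutions.
mp_like = (("A", "B", "C"), "D")   # polychotomy among A, B, C
ml_1 = ((("A", "B"), "C"), "D")    # one dichotomous resolution
ml_2 = ((("B", "C"), "A"), "D")    # a different dichotomous resolution

print(len(conflicts(mp_like, ml_2)))  # 0
print(len(conflicts(ml_1, ml_2)))     # 1
```

In this toy case the unresolved tree conflicts with neither resolution, whereas the two resolved trees conflict with each other, even though both are consistent with the same uncertain consensus.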
Also, the root seems to create problems for these data. The GPWG trees are all rooted with this topology:
However, the position of this 7-taxon outgroup relative to the rest of the taxa varies among the gene trees. That is, the connection between the outgroup and the ingroup differs between the gene trees. So, some of the incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes.
Some of the ML datasets available have trees with the same set of ingroup / outgroup relationships as the GPWG trees, for example those datasets available with the CASS algorithm. However, some of the ML trees presented in the literature seem to be rooted in quite a different place, and this place differs between the gene trees. For example, the data as presented with the HybridInterleave program are provided as 15 pairs of subtrees rather than as six complete trees. In these data, not only are the gene trees apparently rooted in different places, but the different subsets presented of the same gene tree are also sometimes rooted in different places.
It seems to me that there are two consequences arising from these points: (i) it is unnecessarily hard to construct a network from the ML data (because not all of the data signals relate to reticulation), and (ii) the resulting networks (as published) look rather unrealistic to a biologist (there are far too many reticulation nodes). Perhaps this isn't the most realistic dataset to be using for the evaluation of network algorithms.
Another commonly used dataset is the Ranunculus data from Lockhart et al. (2001). In this dataset much of the incompatibility signal also seems to be associated with an uncertain position for the root (see Morrison 2011, Fig. 4.7). In this case there are two gene trees (one nuclear and one chloroplast) that have similar unrooted topologies but have different outgroup-derived root locations. Dealing with root uncertainty may thus be one of the biggest confounding problems when trying to identify reticulation events.
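The effect of root placement alone can be illustrated with a small Python sketch. It is a toy example, not the Ranunculus or GPWG data. It assumes rooted trees written as nested tuples of taxon names on the same taxon set, and the function names (leaf_set, clusters, splits, rooted_conflicts) are invented for illustration. The two trees below have identical unrooted topologies but different root positions.

```python
def leaf_set(node):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def clusters(tree):
    """Rooted clusters with at least two taxa, excluding the full taxon set."""
    if isinstance(tree, str):
        return set()
    found = set()
    for child in tree:
        if not isinstance(child, str):
            found.add(leaf_set(child))
            found |= clusters(child)
    return found


def splits(tree):
    """Non-trivial bipartitions of the taxa, i.e. the tree with its root ignored."""
    taxa = leaf_set(tree)
    return {frozenset([c, taxa - c])
            for c in clusters(tree)
            if len(taxa - c) > 1}


def rooted_conflicts(tree1, tree2):
    """Pairs of rooted clusters that are neither disjoint nor nested."""
    return [(a, b)
            for a in clusters(tree1)
            for b in clusters(tree2)
            if not (a.isdisjoint(b) or a <= b or b <= a)]


# Same unrooted tree, rooted on two different branches.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = (("A", "B"), ("C", ("D", "E")))

print(splits(tree_a) == splits(tree_b))       # True
print(len(rooted_conflicts(tree_a, tree_b)))  # 1
```

The unrooted comparison finds no disagreement at all, but the rooted comparison reports a conflict (the cluster A, B, C against the cluster C, D, E) that an algorithm working on rooted trees would have to explain with a reticulation.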
The original GPWG data are available at:
The nexus data matrix is available at:
[In this dataset, 0=A, 1=C, 2=G, 3=T]
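For anyone wanting to convert the numeric coding back to nucleotide letters, a minimal Python sketch follows. The function name decode and the example row are hypothetical. The sketch also assumes that any other symbol in a row (such as a gap or missing-data character) should be passed through unchanged, which may need adjusting for the actual matrix.

```python
# Mapping stated with the dataset: 0=A, 1=C, 2=G, 3=T.
CODE_TO_BASE = {"0": "A", "1": "C", "2": "G", "3": "T"}


def decode(row):
    """Translate one numerically coded sequence into nucleotide letters.

    Characters outside 0-3 are kept as they are (an assumption).
    """
    return "".join(CODE_TO_BASE.get(ch, ch) for ch in row)


print(decode("0123-?"))  # ACGT-?
```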
A nexus treefile with the original six GPWG (consensus parsimony) trees is available at:
A dendroscope treefile with the six ML trees is available at:
Bordewich M., Linz S., St. John K., Semple C. (2007) A reduction algorithm for computing the hybridization number of two trees. Evolutionary Bioinformatics 3: 86-98.
Grass Phylogeny Working Group (2001) Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden 88: 373-457.
Lockhart P., McLenachan P.A., Havell D., Glenny D., Huson D., Jensen U. (2001) Phylogeny, radiation, and transoceanic dispersal of New Zealand alpine buttercups: molecular evidence under split decomposition. Annals of the Missouri Botanical Garden 88: 458-477.
Morrison D.A. (2011) Introduction to Phylogenetic Networks. RJR Productions, Uppsala.
Salamin N., Hodkinson T.R., Savolainen V. (2002) Building supertrees: an empirical assessment using the grass family (Poaceae). Systematic Biology 51: 136-150.
Schmidt H.A. (2003) Phylogenetic Trees From Large Datasets. PhD thesis, Heinrich Heine University, Düsseldorf.
Wu Y. (2010) Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics 26: i140-i148.