Exploratory data analysis (EDA) is an important part of biological data analysis. It is a form of a priori analysis, where the data are investigated before any substantial hypothesis-testing analysis occurs.
EDA seeks to investigate the patterns that exist in a dataset without too many constraints on what those patterns might be. It should reveal any patterns and their relative strengths, in a quick and easily employed manner, preferably using easily digested pictures. In phylogenetics, there are now many types of networks available for exploring the extent to which data are tree-like, and the nature and location of any non-tree-like patterns. These are ideal for EDA.
It is important to recognize that EDA is not a phylogenetic analysis, in the sense that it is not intended to reveal evolutionary history. It is an exploration of the data intended to reveal the major patterns and their relationships. This information will help a phylogeneticist make decisions about how best to analyze their data and, indeed, even whether a phylogenetic analysis of the data is worthwhile. In this sense it is unfortunate that the sorts of networks that I discuss here are often called "phylogenetic networks", since they do not perform phylogenetic analysis.
EDA should not proceed in a haphazard manner, as the more one looks at a dataset the more patterns one is likely to see. In particular, there is always the potential danger of detecting spurious patterns, which may seem to be biologically important, especially if they confirm some of the pre-conceived ideas of the phylogeneticist. Therefore, it is important to have a clear goal when employing EDA, which might involve asking a set of explicit questions when viewing the data.
These questions should not constrain the exploration of the data, as there are many possible questions, directed towards many different goals. Instead, they should be intended to discover possible constraints on subsequent analyses. Indeed, they may even reveal testable hypotheses that would otherwise go unrecognized. (You should not, of course, both discover and test hypotheses using the same dataset!)
As an example, I discuss some questions that might be important for EDA in phylogenetics. These are not the only possible questions, of course, but they seem to be among the more commonly seen ones in the literature. That is, they focus on the contemporary approach to phylogenetics as a tree-building exercise. These represent a basic set of questions that should be answered before a tree-building analysis could begin. Similar questions could be asked, for example, about reticulation processes in evolutionary history.
These questions can be asked about the whole data or about subsets of the data (eg. genomes, genes, loci), which will indicate whether the variation is within-gene or between-gene.
(1) Are the data tree-like?
(2) If not, then is this because they are:
(a) uninformative about relationships (a bush)
(b) weakly tree-like (a tree obscured by vines)
(c) characterized by several strong incompatible relationships (a structured network)
(d) confused about relationships (an unstructured network — a spiderweb)
(3) If the data are tree-like, then are the trees of data subsets:
The important issues that these questions are intended to address include:
(i) are there strong patterns in the data that might answer the experimental question? (finding suitable data for a specific phylogenetic question is not necessarily easy);
(ii) what are the patterns? (different patterns are likely to have different biological causes);
(iii) is any data incompatibility principally between genes or within genes? (these patterns are likely to be caused by different biological processes);
(iv) are the incompatibilities the result of reticulation processes (eg. recombination, hybridization, lateral gene transfer) or not (eg. incomplete lineage sorting, gene duplication-loss)?
These questions can be addressed using programs such as SplitsTree, which implements a range of network analyses based on unrooted splits graphs. Probably the most commonly encountered of these methods is NeighborNet. However, it is not currently possible to directly extract answers to the above questions using this program. For example, we cannot simultaneously display the networks for different subsets of the data, to directly compare them. Nevertheless, I can illustrate the idea with an example.
The data are from O'Donnell et al. (2000). There are data for six partial gene sequences (334-1336 nt per gene) from each of 27 samples of Fusarium graminearum fungi.
We can start the EDA with a NeighborNet analysis of the entire dataset:
This is rather tree-like, with one major reticulation, involving sample NRRL_28721. Note that we have not imposed a root on the data, because it is often the root location that is the most ambiguous part of the data. We can confirm that this sample is the only one causing the non-tree-like behaviour by temporarily deleting it:
Indeed, the data are now very tree-like. Thus, the complete dataset seems to fit category 2c from the list above. Given a root location, this might then be expected to be a good estimate of the phylogeny.
We can further investigate the incompatible data pattern by looking at each of the six gene subsets individually, in order to locate the possible cause of the reticulation involving NRRL_28721. It turns out that there are several different patterns revealed by the subsets.
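Looking at subsets separately amounts to computing one distance matrix per gene, plus one for the concatenated data, and building a network from each. The following is a minimal sketch of that step only, assuming a concatenated alignment held as a dictionary of equal-length strings and a table of column ranges per gene. The names `toy_alignment` and `toy_partitions` and the use of uncorrected p-distances are illustrative assumptions, not the settings used for the networks shown here.

```python
from itertools import combinations

def p_distance(seq_a, seq_b, missing="-?N"):
    """Proportion of differing sites, skipping sites with gaps or missing data."""
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in missing or b in missing:
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else 0.0

def distance_matrix(alignment, start=0, end=None):
    """Pairwise p-distances over the alignment columns start..end (a Python slice)."""
    names = sorted(alignment)
    dist = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = p_distance(alignment[x][start:end], alignment[y][start:end])
        dist[x][y] = dist[y][x] = d
    return dist

# Hypothetical toy data: three samples, two "genes" of four columns each.
toy_alignment = {
    "s1": "ACGTAAAA",
    "s2": "ACGTAATT",
    "s3": "ACCTAATT",
}
toy_partitions = {"geneA": (0, 4), "geneB": (4, 8)}

# One matrix per gene subset, and one for the whole dataset.
per_gene = {gene: distance_matrix(toy_alignment, start, end)
            for gene, (start, end) in toy_partitions.items()}
combined = distance_matrix(toy_alignment)
```

In this toy case geneA separates s3 from the other two samples, while geneB separates s1, so the two subsets carry different signals that are blended in the combined matrix. Each matrix would then be passed to a network method such as NeighborNet.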
The Ligase2 and Ligase3 genes produce poorly resolved bushes (category 2a), with many of the samples being identical:
The Ligase1 and Permease1 genes produce bushes with a netted centre (a combination of categories 2a and 2d):
The Permease2 gene apparently shows a reticulation, but a look at the original sequence alignment shows that this involves an incompatibility between 1 character and 2 other characters; and so it is likely to be of little interest:
In my experience, this is an important consideration when using EDA in phylogenetics. A pattern that looks large in a network may be relatively trivial in the original data. This can happen whenever there is little information in the dataset, so that even the smallest pattern looks large.
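One way to check how much of the alignment lies behind an apparent reticulation is to count the pairwise incompatible characters directly. The sketch below assumes two-state characters and uses the four-gamete test: two such characters cannot both fit a single tree without homoplasy if all four combinations of their states occur. The columns in `toy_columns` are invented for illustration, and multi-state characters would need a more general compatibility test.

```python
from collections import Counter
from itertools import combinations

def compatible(col_a, col_b):
    """Four-gamete test for two binary characters scored on the same samples."""
    return len(set(zip(col_a, col_b))) < 4

def incompatible_pairs(columns):
    """Index pairs (i, j) of columns that fail the four-gamete test."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(columns), 2)
            if not compatible(a, b)]

def conflict_counts(columns):
    """Number of incompatibilities in which each column is involved."""
    counts = Counter()
    for i, j in incompatible_pairs(columns):
        counts[i] += 1
        counts[j] += 1
    return counts

# Hypothetical toy data: three binary characters scored on five samples.
toy_columns = ["AAGGG", "CCCTT", "TATTA"]

pairs = incompatible_pairs(toy_columns)
counts = conflict_counts(toy_columns)
```

Here `pairs` is `[(0, 2), (1, 2)]`, so a single character (column 2) conflicts with two others. This is the same kind of small incompatibility that can look large in a network drawn from a dataset with little information.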
The Trichothene gene also produces a net-centered bush, but it is the only gene that shows the reticulation involving NRRL_28721. It is therefore a gene of particular interest for the EDA:
What this EDA shows is that most of the individual genes contain little information on their own, but when combined they show a strong tree-like pattern. That is, the genes complement each other in a phylogenetic analysis. Moreover, the reticulation involving NRRL_28721 appears in only one gene. Indeed, investigation of the alignment suggests a recombination event, with a cross-over in this particular gene. The other genes share one or the other of the two patterns associated with this cross-over.
A final tree-building analysis would therefore be expected to be productive, except that NRRL_28721 will not have a stable position when forced into a tree. On the other hand, a network would be a more appropriate display of the data, as it could show the recombination event associated with NRRL_28721.
Some desiderata for EDA in phylogenetics
Given what I have said above, it would be convenient if there were quantitative measures to identify the different types of network structure, which could then be displayed on each graph as part of the EDA protocol. For example, we need to distinguish these graph forms from each other:
Few edges (ie. many taxa are identical)
Star tree (ie. unresolved relationships)
Resolved tree (ie. clear relationships)
Structured network (ie. several clear conflicting relationships)
Unstructured network (ie. unclear relationships)
Unsurprisingly, there is no single measure that could distinguish these from each other. We will therefore need several measures, to be used together.
At the moment, two measures of the degree of reticulation have been proposed, the δ-score (Holland et al. 2002) and the Q-residual (Gray et al. 2010). The main difference is in the normalization constant. Both of them are currently reported by SplitsTree, as an option. Basically, they produce small values for trees and bushes, larger values for structured networks, and the largest values for spider webs. For example, deleting NRRL_28721 from the above dataset reduces the δ-score from 0.1341 to 0.1083 (the scores can range from 0 to 1).
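To make the δ-score concrete, here is a minimal sketch of the calculation as I understand it from Holland et al. (2002). For each set of four samples, the three possible pairings of the pairwise distances are summed, and δ is the difference between the largest and middle sums divided by the difference between the largest and smallest sums. The score is 0 when the four-point condition for a tree holds exactly. The function names and the toy distances are assumptions for illustration, and the averaging used by SplitsTree may differ in detail.

```python
from itertools import combinations

def symmetric(pairs):
    """Build a nested, symmetric distance lookup from {(x, y): distance}."""
    d = {}
    for (x, y), value in pairs.items():
        d.setdefault(x, {})[y] = value
        d.setdefault(y, {})[x] = value
    return d

def quartet_delta(d, w, x, y, z):
    """Delta for one quartet: (largest - middle) / (largest - smallest) pairing sum."""
    smallest, middle, largest = sorted(
        [d[w][x] + d[y][z], d[w][y] + d[x][z], d[w][z] + d[x][y]]
    )
    if largest == smallest:
        return 0.0  # all three sums equal (a star); treated as tree-like
    return (largest - middle) / (largest - smallest)

def mean_delta(d):
    """Average delta over all quartets of samples."""
    taxa = sorted(d)
    if len(taxa) < 4:
        raise ValueError("at least four samples are needed")
    scores = [quartet_delta(d, *quartet) for quartet in combinations(taxa, 4)]
    return sum(scores) / len(scores)

# Hypothetical distances that fit the tree ((a,b),(c,d)) exactly.
tree_like = symmetric({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
                       ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3})
# Hypothetical distances in which two groupings compete.
conflicting = symmetric({("a", "b"): 1, ("c", "d"): 1, ("a", "c"): 2,
                         ("b", "d"): 2, ("a", "d"): 3, ("b", "c"): 3})

tree_score = mean_delta(tree_like)
conflict_score = mean_delta(conflicting)
```

The tree-like distances give a score of 0, while the conflicting distances give 0.5.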
Thus, these measures can potentially be used to distinguish among the last three graph types in the list, but not among the first three. Wichmann et al. (2011) expressed an apparent preference for the δ-score over the Q-residual, based on an evaluation of a linguistic dataset. Unfortunately, these scores have no statistical interpretation. As noted by Gray et al.:
"We have implemented and tested a number of schemes for assessing the significance of delta score and Q-residual values, including non-parametric and parametric bootstrapping. Unfortunately, and curiously, none have proven to be sufficiently powerful and robust."
It is possible to suggest other measures that might also be useful for quantifying the different types of graph structure. For example, for a NeighborNet graph, which is closely related to an unrooted Neighbor-Joining (NJ) tree, we could consider the following quantitative measures to characterize each type of structure, relative to a fully resolved tree:
how many splits there are relative to the number of samples
how many internal splits there are relative to the minimum number necessary for a fully resolved tree
how many of the NJ tree splits form parallel sets in the NeighborNet of the same data (as opposed to single edges)
how many extra splits there are in the NeighborNet relative to the NJ tree
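Three of these counts can be sketched directly from the split sets, as below. This is only an illustration, assuming each split is supplied as the set of samples on one side, and using the fact that a fully resolved unrooted tree of n samples has n - 3 internal splits. The function names and toy splits are hypothetical. The measure based on parallel edge sets is omitted, because it needs the drawn graph rather than the split sets alone.

```python
def normalise(split, taxa):
    """Represent a split by its smaller side, so both sides give the same key."""
    side = frozenset(split)
    other = frozenset(taxa) - side
    return min(side, other, key=lambda s: (len(s), sorted(s)))

def split_measures(network_splits, tree_splits, taxa):
    """Simple split counts for a network, relative to a tree of the same samples."""
    taxa = frozenset(taxa)
    n = len(taxa)
    if n < 4:
        raise ValueError("at least four samples are needed")
    net = {normalise(s, taxa) for s in network_splits}
    tree = {normalise(s, taxa) for s in tree_splits}
    internal = {s for s in net if len(s) > 1}  # trivial splits isolate one sample
    return {
        "splits_per_sample": len(net) / n,
        "internal_vs_resolved_tree": len(internal) / (n - 3),
        "extra_vs_tree": len(net - tree),
    }

# Hypothetical toy data: five samples, a resolved tree, and a network
# that adds one split conflicting with the tree.
toy_taxa = {"a", "b", "c", "d", "e"}
trivial = [{t} for t in toy_taxa]
toy_tree = trivial + [{"a", "b"}, {"d", "e"}]
toy_network = toy_tree + [{"b", "c"}]

measures = split_measures(toy_network, toy_tree, toy_taxa)
```

For the toy network there are 1.6 splits per sample, 1.5 times as many internal splits as a fully resolved tree needs, and one split beyond those in the tree. A bush would instead give a ratio of internal splits well below 1.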
I am sure that others can easily be devised, as well. I encourage people to have a think about this, so that some consensus might be reached about desiderata for EDA in phylogenetics. The computational people cannot develop algorithms until the biologists tell them what is needed.