Wednesday, June 8, 2016

Why do so few biologists look at their phylogenetic data?

Most data analyses involve processing the data using some model. For example, standard parametric statistical tests assume a normal distribution for the "error" term, as well as equal variances and linear relationships between the variables. If these model assumptions do not hold, then any inferences from the tests may be incorrect.

It is possible to look at any dataset in a model-free manner, although this does not necessarily lead to any strong inferences. Looking at data is usually called exploratory data analysis. This is often done using graphs of various types.

Exactly the same principle applies to phylogenetics. A phylogenetic tree is an inference from the data via a given model. The inference is a reconstructed genealogical history assuming a divergent tree. In this context, different models will often (usually?) give different inferences.

Therefore, most phylogeneticists never actually see their data. What they see, instead, is the data as processed through some model. That is, they see inferences from the model, not the original data. Models are important, but for a scientist the data should be even more important.

It is thus interesting that so many phylogeneticists skip the step of looking at their data, and proceed immediately to the model-based inference. As a result, many of the disagreements throughout the literature end up being about the models and not the data. There are very strong opinions about which models should be used, with less attention being paid to whether the data contain sufficient information to answer the original scientific question in the first place.

A specific example of this was discussed in some earlier blog posts:
Conflicting placental roots: network or tree?
Why are there conflicting placental roots?
In this example there are three possible genealogical patterns, each of which has been reported to receive strong support from model-based tree inference of nucleotide sequences. However, when looking at the sequence data themselves, in a model-free manner using data-display networks, any one dataset shows all three possible patterns. So, any inference of a single tree is coming from the model not from the data. That is, the data do not distinguish between the three genealogies, but the models do discriminate amongst them.
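To make the idea of looking at the sequences themselves a little more concrete, here is a minimal Python sketch that counts how many alignment columns are compatible with each of the three possible groupings of four taxa. It is only an illustration: the function name split_support and the toy sequences are made up for this example, they are not taken from the placental datasets, and a data-display network does considerably more than this simple count.

```python
def split_support(a, b, c, d):
    """Count alignment columns that support each of the three splits of taxa A, B, C, D.

    A column supports a split when the two taxa on each side share a base
    and the two sides differ. No substitution model is involved.
    """
    if not (len(a) == len(b) == len(c) == len(d)):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for w, x, y, z in zip(a, b, c, d):
        # Skip columns containing gaps or ambiguity codes.
        if any(base not in "ACGT" for base in (w, x, y, z)):
            continue
        if w == x and y == z and w != y:
            counts["AB|CD"] += 1
        elif w == y and x == z and w != x:
            counts["AC|BD"] += 1
        elif w == z and x == y and w != x:
            counts["AD|BC"] += 1
    return counts


# Toy aligned sequences (hypothetical, for illustration only).
seq_a = "AAGGCTTAA"
seq_b = "AAGCCTAAG"
seq_c = "GGGGTTTCG"
seq_d = "GGGCTTACA"

print(split_support(seq_a, seq_b, seq_c, seq_d))
# prints {'AB|CD': 4, 'AC|BD': 2, 'AD|BC': 1}
```

In this toy case all three groupings receive some support from the columns, which is the kind of pattern that a single inferred tree would hide.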

It is worth mentioning here that a haplotype network is not a genealogy. Instead, it is a summary of a population dataset, which may contain some phylogenetic patterns or it may not. So, a haplotype network is closer to exploratory data analysis than it is to model-based inference. This point is clearly made by Jessica W. Leigh and David Bryant (2015. PopART: full-feature software for haplotype network construction. Methods in Ecology and Evolution 6: 1110-1116):
"The haplotype networks do provide, however, a concise and accessible representation of the data themselves, one aspect which is often lost in methods heavily dependent on model-based inference."
Looking at the data before you start processing them can be a very good idea. After all, you may be able to avoid unlikely inferences.
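As a rough illustration of the sort of model-free summary that a haplotype network starts from, the following Python sketch collapses a sample of aligned sequences into distinct haplotypes with their frequencies, and counts the differences between each pair. The function name haplotype_summary and the sample sequences are hypothetical; this is not the algorithm used by PopART or any other program, just the raw tallies one could inspect before any network or tree is built.

```python
from collections import Counter
from itertools import combinations


def haplotype_summary(sequences):
    """Return (frequencies, pairwise_differences) for a list of aligned sequences."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    # Identical sequences are treated as one haplotype.
    frequencies = Counter(sequences)
    haplotypes = sorted(frequencies)
    # Number of differing sites between each pair of distinct haplotypes.
    differences = {
        (h1, h2): sum(x != y for x, y in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    }
    return frequencies, differences


# Toy population sample (hypothetical, for illustration only).
sample = ["ACGT", "ACGT", "ACGA", "TCGA", "ACGT"]
frequencies, differences = haplotype_summary(sample)
print(dict(frequencies))
print(differences)
```

With this sample, the haplotype ACGT occurs three times and the other two occur once each; ACGA differs from each of the others at one site, while ACGT and TCGA differ at two.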


  1. I agree with using EDA in phylogenetic analysis; I think it is very important. I have one question. EDA has been attacked or criticized from a philosophical point of view, for example by Grant & Kluge (2003, 2004), Patterson (1982) and Hempel (1965). What is your opinion of these criticisms? Is it possible to have a philosophical background that justifies the use of EDA in phylogeny?


    1. The papers to which you refer all have a specific philosophical point of view, and one that is not necessarily shared by many others. From their point of view phylogenetic data can be analyzed in one way only, and therefore all other possibilities are rejected a priori (including EDA). The usual philosophical rationale for EDA is a statistical one: our data provide statistical estimates of real phenomena, and do not necessarily provide accurate estimates. Therefore, we need to examine the data for biases (leading to inaccuracies) or unexpected patterns (leading to mis-estimates of reality). So, EDA has a sound philosophical basis within mathematics.