Monday, December 10, 2012

Data enrichment in phylogenetics

Since this is post #100 in this blog, I thought that we might celebrate with something humorous. Since evolutionists often have a tough time, this post is about how to get more out of your phylogenetic analyses than you previously thought was possible.

In 1957, Henry R. Lewis published an article about The Data-Enrichment Method (Operations Research 5: 551-554). This method was intended "to improve the quality of inferences drawn from a set of experimentally obtained data ... without recourse to the expense and trouble of increasing the size of the sample data." This distinguishes the method from similarly named techniques, such as the likelihood method of Data Augmentation, which require actual data.

Clearly, such a method is of great interest to all empirical scientists, especially those without much grant money. Indeed, The Data Enrichment Method was immediately expanded by other interested parties (see Operations Research 5: 858-859, and 6: 136), who pointed out that it can be applied iteratively to great effect, and that it can be used to support an hypothesis and also its opposite.

The important requirements for the Data Enrichment Method are: (i) a nested set of data patterns, and (ii) an a priori expectation about what should be the answer to the experimental question. All scientists should have the latter, of course, since they are supposed to be testing the expectation by calling it an "hypothesis".

Most interestingly for us, phylogeneticists will often be able to meet requirement (i), as well, because their data often form a nested set, representing the shared derived character states from which a phylogenetic tree will be derived. I therefore once wrote an article examining the application of The Data Enrichment Method to phylogenetics, where it does indeed work very well. You do need at least some data to start with, and so it does not free you entirely from the inconvenience and embarrassment of uncontrollable empirical results.

This article appeared in 1992 in the Australian Systematic Botany Society Newsletter 71: 2–5. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy of the paper:
A new method for increasing the robustness of cladistic analyses

After reading it, you might like to think about how to apply this method to phylogenetic networks. The mixture of horizontal gene flow with vertical descent breaks the simple nested data pattern of a phylogenetic tree, which complicates the application of data enrichment to networks.


  1. "presumably non-one has read this article since then."

    That may be because the methods therein are highly biased, based in circular reasoning. The virtual characters are inferred from the very topology that they support. Or did I miss something?

  2. Yes, you missed the fact that it is intended to be humorous.

  3. So it was just satirizing the arguments back then on the merits of doing parsimony versus likelihood and how to code data for phylogenetics? Then I missed the wink, wink in this post and the paper.

    Could have (and did) fool a few of us!

  4. I have now changed the first sentence of the blog to be a bit more explicit. I don't intend fooling anyone - I would rather they smile.

  5. haha, I appreciate a dry sense of humor even when I can't detect it.