Saturday, April 28, 2012

Validating methods for constructing evolutionary phylogenetic networks

Many researchers working on constructing evolutionary (i.e. explicit, as opposed to implicit/data-display) phylogenetic networks encounter the problem that, at present, there are not many options for validating the biological relevance of their methods. In other words, how does a researcher verify whether the network produced by his/her latest algorithm is a biologically plausible approximation of reality? This is of critical importance because, unlike implicit/data-display networks, evolutionary phylogenetic networks seek to produce an explicit hypothesis of what actually happened.

Ideally there should be a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This could then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks. Unfortunately, as far as I am aware there are very few such “reference” datasets in circulation – if any. There seem to be multiple reasons for this. Within biology, reticulate evolution is still a comparatively new topic, one which encompasses an entire range of evolutionary time-scales and phenomena. I can fully appreciate that trying to get a grip on even a tiny part of this world is an immensely complex task for biologists! This is probably why biological validation of algorithmic methods, if it happens at all, still requires collaborating biologists to perform a labour-intensive and highly case-specific analysis. It will be a massive challenge to move beyond such ad-hoc models of validation.

On the algorithmic side there are also plenty of issues. Input-side and output-side limitations of existing software are well-known. Expressed deliberately sharply: it is not often that one encounters a biologist who has two fully-refined, unambiguously rooted gene trees on the same set of taxa, who wants a reticulation-minimal solution, and who does not mind if ancestors can hybridize with descendants. Faced with such limitations computer scientists inevitably resort to simulations or try to analyse the same dataset that the last group of computer scientists used, which is (sigh…) probably the Grass Phylogeny Working Group's Poaceae dataset. Simulations tend to use a variety of plausible-sounding techniques (e.g. random rSPR moves to simulate HGT, or – at the population-genomic level – techniques for simulating recombination) but to what extent do these simulations really approximate reality?
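To make the rSPR-based simulation idea concrete, here is a minimal sketch in Python. The `Node` class and helper names are my own invention, not from any particular package; the move implemented is one common rooted-SPR variant (prune a random subtree, suppress the resulting degree-two node, regraft onto a random remaining edge, with regrafting above the root also allowed). A real HGT simulator would of course add branch lengths, transfer rates and time constraints on donor/recipient lineages.

```python
import random

class Node:
    """A node in a rooted binary tree; leaves carry a name (hypothetical helper class)."""
    def __init__(self, name=None, children=None):
        self.name = name
        self.parent = None
        self.children = children or []
        for c in self.children:
            c.parent = self

def nodes_below(node):
    """Yield every node in the subtree rooted at `node`, including `node` itself."""
    yield node
    for c in node.children:
        yield from nodes_below(c)

def leaf_names(node):
    return sorted(n.name for n in nodes_below(node) if not n.children)

def random_rspr(root, rng):
    """Apply one random rooted-SPR move in place; return the (possibly new) root."""
    # 1. Prune: pick any non-root node v and detach its subtree.
    v = rng.choice([n for n in nodes_below(root) if n.parent is not None])
    p = v.parent
    p.children.remove(v)
    sibling = p.children[0]           # p had two children; suppress p entirely
    if p.parent is None:              # p was the root: the sibling becomes the root
        sibling.parent = None
        root = sibling
    else:
        p.parent.children[p.parent.children.index(p)] = sibling
        sibling.parent = p.parent
    # 2. Regraft: subdivide a random edge of the remaining tree.
    #    (Choosing the root here regrafts above it, creating a new root.)
    w = rng.choice(list(nodes_below(root)))
    u = w.parent                      # capture before Node() overwrites it
    new = Node(children=[w, v])
    if u is None:
        root = new
    else:
        u.children[u.children.index(w)] = new
        new.parent = u
    return root

# Example: apply a chain of random rSPR moves to a five-leaf tree.
a, b, c, d, e = (Node(x) for x in "abcde")
tree = Node(children=[Node(children=[a, b]),
                      Node(children=[c, Node(children=[d, e])])])
rng = random.Random(42)
for _ in range(20):
    tree = random_rspr(tree, rng)
```

After any number of moves the leaf set is unchanged and the tree stays binary – which is exactly the worry raised above: the simulation is internally consistent, but nothing guarantees that a chain of uniform-random transfers resembles real reticulate evolution.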

My concern is that, at the moment, biologists and computer scientists are locked in an unhealthy embrace, both expecting the other group to come up with “real” networks. This could be dangerous. I’ve seen biologists adjust their hypotheses based on the output of evolutionary phylogenetic network software. But those computer programs often lack any form of biological validation: not because algorithm designers are bad people aiming to mislead, but because the apparently intractable character of the associated optimization problems forces computer scientists to make all kinds of restrictions and assumptions which are not necessarily compatible with the concerns of biologists. In any case: it’s clearly not helpful if hypotheses derived this way find their way back into the literature with an “approved by biologists” stamp.

How, then, to transform this embrace into something more virtuous? One possibility could be a structured collaboration between groups in the phylogenetic network community to produce and disseminate at least a small number of rigorously validated reference datasets which can serve as benchmarks. Is this realistic?

Very curious to hear what you think!

Note: The suggested database now exists: Datasets for validating algorithms for evolutionary networks

1 comment:

  1. I think this is a great idea. As a computer scientist who has mostly worked on the simple two-tree problem you (rightfully) criticize, I find it very difficult to determine what, exactly, I should be working towards next. For example, I am currently working on dropping the fully-refined assumption. I decided to work on this based on (1) hearing that this assumption was limiting and (2) what seems possible to do efficiently. In other words, I chose the improvement to current methods that I believe will be (1) useful and (2) possible. However, (1) often takes a backseat to (2). It would be very helpful to have some sort of standard, even just to see what biologists actually want.

    To be practical, though, such a set of rigorously validated reference datasets would need to cover a variety of types of input and output (e.g. a set of binary rooted trees, multifurcating rooted trees, binary unrooted trees, etc.) both to allow incremental improvement and to provide context for an "end goal". Otherwise, someone working on a method that is not covered by the reference set is still forced to use simulations. There will certainly be difficulty in choosing what to incorporate in the reference set, what part of the reference set to work on first, what the "answers" should look like, and so on. As such, I think the largest issues a structured collaboration will face are how to structure the collaboration and what the end result should be.