Many researchers working on constructing evolutionary (i.e. explicit, as opposed to implicit/data-display) phylogenetic networks encounter the problem that, at present, there are not many options for validating the biological relevance of their methods. In other words, how does a researcher verify whether the network produced by his/her latest algorithm is a biologically plausible approximation of reality? This is of critical importance because, unlike implicit/data-display networks, evolutionary phylogenetic networks seek to produce an explicit hypothesis of what actually happened.
Ideally, there would be a repository of biological datasets for which there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. Such a repository could then serve as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks. Unfortunately, as far as I am aware there are very few such “reference” datasets in circulation – if any. There seem to be multiple reasons for this. Within biology, reticulate evolution is still a comparatively new topic, one which encompasses an entire range of evolutionary time-scales and phenomena. I can fully appreciate that trying to get a grip on even a tiny part of this world is an immensely complex task for biologists! This is probably why biological validation of algorithmic methods, if it happens at all, still requires collaborating biologists to perform a labour-intensive and highly case-specific analysis. It will be a massive challenge to move beyond such ad-hoc models of validation.
On the algorithmic side there are also plenty of issues. Input-side and output-side limitations of existing software are well-known. Expressed deliberately sharply: it is not often that one encounters a biologist who has two fully-resolved, unambiguously rooted gene trees on the same set of taxa, who wants a reticulation-minimal solution, and who does not mind if ancestors can hybridize with descendants. Faced with such limitations, computer scientists inevitably resort to simulations, or try to analyse the same dataset that the last group of computer scientists used, which is (sigh…) probably the Grass Phylogeny Working Group's Poaceae dataset. Simulations tend to use a variety of plausible-sounding techniques (e.g. random rSPR moves to simulate HGT, or – at the population-genomic level – techniques for simulating recombination), but to what extent do these simulations really approximate reality?
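To make concrete what such an HGT-style simulation step typically involves, here is a minimal sketch of one random rooted SPR (subtree prune and regraft) move: a subtree is detached and reattached on a randomly chosen edge elsewhere in the tree. The tree representation (a plain dict of internal node → children) and the function name are my own illustrative choices, not taken from any particular simulation package.

```python
import random

def random_rspr(children, root, rng=random):
    """Apply one random rooted SPR move to a rooted binary tree --
    a crude stand-in for a single HGT event in a simulation.

    children: dict mapping each internal node to its [left, right]
    children; leaves do not appear as keys.
    Returns (children, root) of the new tree; the input is not modified.
    """
    children = {n: list(cs) for n, cs in children.items()}
    parent = {c: n for n, cs in children.items() for c in cs}

    def subtree(v):
        nodes, stack = set(), [v]
        while stack:
            n = stack.pop()
            nodes.add(n)
            stack.extend(children.get(n, []))
        return nodes

    # 1. Prune: pick a non-root node v, detach the subtree rooted at v,
    #    and suppress the resulting degree-2 node p.
    v = rng.choice(list(parent))
    p = parent[v]
    s = [c for c in children[p] if c != v][0]   # v's sibling
    del children[p]
    if p == root:
        root = s
    else:
        g = parent[p]
        children[g][children[g].index(p)] = s

    # 2. Regraft: subdivide a random edge (u, w) of the remaining tree
    #    with a new node and hang v below it. (Regrafting back onto the
    #    old location simply recreates the input tree.)
    pruned = subtree(v)
    edges = [(u, w) for u in children if u not in pruned
             for w in children[u]]
    new = p                                     # reuse the pruned node's label
    if edges:
        u, w = rng.choice(edges)
        children[u][children[u].index(w)] = new
        children[new] = [w, v]
    else:                                       # remaining tree is a single leaf
        children[new] = [root, v]
        root = new
    return children, root
```

A simulated HGT history is then just a sequence of such moves applied to a starting species tree; the point of the paragraph above is precisely that it is unclear how faithfully a uniform-random sequence of these moves reflects real transfer events.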
My concern is that, at the moment, biologists and computer scientists are locked in an unhealthy embrace, each expecting the other group to come up with “real” networks. This could be dangerous. I’ve seen biologists adjust their hypotheses based on the output of evolutionary phylogenetic network software. But those computer programs often lack any form of biological validation: not because algorithm designers are bad people aiming to mislead, but because the apparently intractable character of the associated optimization problems forces computer scientists to make all kinds of restrictions and assumptions which are not necessarily compatible with the concerns of biologists. In any case, it’s clearly not helpful if hypotheses derived this way find their way back into the literature with an “approved by biologists” seal of approval.
How, then, to transform this embrace into something more virtuous? One possibility would be a structured collaboration between groups in the phylogenetic network community to produce and disseminate at least a small number of rigorously validated reference datasets that can serve as benchmarks. Is this realistic?
Very curious to hear what you think!
Note: The suggested database now exists: Datasets for validating algorithms for evolutionary networks