
Wednesday, November 12, 2014

Archiving of phylogenetics data


The draft Minimum Information About a Phylogenetic Analysis standard (Leebens-Mack et al. 2006) suggests that all relevant information about every published phylogenetic analysis should be archived, so that it can be scrutinized by later researchers, either for validation or for re-use. The issues here are both preservation of the information (data and analysis protocols) and open access to it.

In this blog we have already noted criticism of the bioinformatics side of this archiving: there have been repeated claims that many computer programs are poorly maintained (Poor bioinformatics?) as well as poorly archived (Archiving of bioinformatics software).

Anyone who has ever tried to get data out of a biologist will know that the data-related part of the standard fares no better. My own success rate when requesting data, from all areas of biology and not just phylogenetics, has been less than 20% over the past 25 years. The responses have been, in order: (i) no response (>50%); (ii) "a student / postdoc / colleague has the data, not me"; and (iii) "I have moved recently and don't know where the data are". My most recent attempt, to get the data from Collard et al. (2006), was ultimately unsuccessful even after several attempts.


For phylogenetics, this situation has recently been quantified and analyzed by Magee et al. (2014). They tried to collect phylogenetic data (comprising nucleotide sequence alignments and tree files) from 217 published studies. Of these, 54 (25%) had at least some part of the data (alignment or tree) archived in an online repository, and 91 (42%) were obtained by direct solicitation, but in 72 cases (33%) nothing could be obtained even after three requests. Overall, complete datasets (both tree and alignment) were available for only 40% of the studies.

The authors note that the data were more likely to be deposited in online archives and/or shared upon request when the publishing journal had a strong data-sharing policy. Furthermore, recent policy initiatives and infrastructural changes involving data repositories have had a positive impact. The TreeBASE phylogenetic-data repository has existed for more than 20 years, but its use has been sporadic. However, the recent establishment of the Joint Data Archiving Policy by a consortium of journals, which requires the submission of data to online archives as a condition of publication, together with the concomitant establishment of the Dryad repository for evolutionary and ecological data, has seen a surge in the archiving of data.

So, all in all, things have been no better on the bio side than the informatics side of bioinformatics.

Stoltzfus et al. (2012) have identified a number of possible barriers to successful data archiving, including lack of awareness of options and policies, perception that benefits do not justify burden, and an active desire to restrict data access. Importantly, there are also a number of practical issues even for those people who do wish to archive their data:
  • inconvenience of gathering complete data and metadata
  • inconvenience of format conversions needed for archiving
  • frustration when some data don't fit the archive's data model
  • poor and undocumented archive submission interfaces.
For the readers of this blog, the third issue is possibly the most important one — all current repositories are based on a tree model for phylogenetics, and therefore network phylogenies are frustrating to deal with.

In order to improve the overall situation, Cranston et al. (2014) have made explicit suggestions for best practices when archiving. They provide ten simple guidelines that, if followed, will result in you providing open access to your data and analyses, even if the publishing journal does not force you to do so.

Footnote: I have been reminded that archiving data in PDF format is inappropriate. Trying to extract text (such as a dataset) from a PDF file can be difficult, because there is no standard format for storing the text. Consequently, different PDF readers will extract the text in different ways, and in all cases the output is likely to need extensive manual re-formatting in order to recover the original formatting of the text that went into the PDF file. In my experience, Google Chrome may do the least-worst job.
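As a rough illustration of why PDF is such an awkward archival format, the sketch below (with a hypothetical file name) extracts the same document with two different Python libraries, pypdf and pdfminer.six, which I am assuming are installed; their outputs typically disagree on line breaks and whitespace, which is exactly the manual re-formatting problem described above.

```python
# Sketch: extract the same PDF with two libraries and compare the results.
# Assumes pypdf and pdfminer.six are installed; "archived_dataset.pdf" is hypothetical.
from pypdf import PdfReader
from pdfminer.high_level import extract_text

path = "archived_dataset.pdf"

# Extraction 1: pypdf reconstructs the text page by page.
pypdf_text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

# Extraction 2: pdfminer.six uses its own layout analysis.
pdfminer_text = extract_text(path)

# The two extractions rarely agree on line breaks and spacing, so any sequence
# alignment stored this way usually needs manual re-formatting afterwards.
print(pypdf_text[:500])
print("---")
print(pdfminer_text[:500])
```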

References

Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.

Cranston K, Harmon LJ, O'Leary MA, Lisle C (2014) Best practices for data sharing in phylogenetic research. PLoS Currents Jun 19;6.

Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, Cunningham CW, dePamphilis C, deSalle R, Doyle JJ, Eisen JA, Gu X, Harshman J, Jansen RK, Kellogg EA, Koonin EV, Mishler BD, Philippe H, Pires JC, Qiu YL, Rhee SY, Sjölander K, Soltis DE, Soltis PS, Stevenson DW, Wall K, Warnow T, Zmasek C (2006) Taking the first steps towards a standard for reporting on phylogenies: Minimum Information About a Phylogenetic Analysis (MIAPA). OMICS 10: 231-237.

Magee AF, May MR, Moore BR (2014) The dawn of open access to phylogenetic data. PLoS One 9: e110268.

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie EL, Kumar S, Rosauer DF, Vos RA (2012) Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis. BMC Research Notes 5: 574.

Wednesday, January 29, 2014

More datasets for validating network algorithms


Four more datasets have been added to the Datasets blog page. These are:
  • 1 stemmatology study where the manuscript history is known from experimentation
  • 2 stemmatology studies where the manuscript history is known from experimentation, and where there is reticulation caused by contamination
  • 1 plant study where recombination is known.

These are the first three studies to be added from the social sciences, all of them from experimental manipulation of text copying.

Unfortunately, it is unlikely that suitable datasets will be found from other parts of the social sciences, such as linguistics; but please tell us if you know of any relevant studies, where the phylogenetic history is known or inferred independently of the dataset itself.

Wednesday, January 30, 2013

More datasets for validating network algorithms


Ten more datasets have been added to the Datasets blog page. These are:
  • 2 plant studies where hybrids are known from experimentation
  • 3 more plant studies where natural hybrids are known
  • 5 studies (fungi, plants, protozoa, viruses, animals) where recombination is known.

A comment

It is worth noting something that has become obvious to me while compiling these datasets — the mathematical model often applied to hybridization networks cannot easily be applied to many of the datasets collected by biologists. The usual mathematical model involves incompatibility between two or more trees for the same set of taxa, for example from different genes or genomes. The incompatibilities are resolved by postulating one or more reticulations in the network.

However, the data produced by biologists often involve only a single nuclear gene, most frequently the Internal Transcribed Spacer (ITS) region, so that the biologists do not have multiple trees. Instead, hybrids are detected by additive polymorphisms at alignment positions within the study gene. These polymorphisms arise either (i) from the polyploid nature of the hybrids (there are multiple copies of each chromosome, each of which may carry a gene copy from either parental species), or (ii) from multiple paralogous copies of the genes (the rRNA region, which contains the ITS, usually has many tandemly repeated copies of the genes; these are normally homogenized by concerted evolution, but in a hybrid any of them may carry a gene copy from either parental species).
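As a minimal sketch of the additive-polymorphism idea, the code below scans an alignment for columns where a putative hybrid shows an IUPAC ambiguity code that combines the two (differing) parental bases; the taxon names, the toy alignment, and the helper function are all hypothetical.

```python
# Sketch: flag alignment columns where a putative hybrid shows an additive
# polymorphism, i.e. an IUPAC ambiguity code covering both parental bases.
# The taxon names and the toy alignment are hypothetical.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}

def additive_positions(alignment, hybrid, parent1, parent2):
    """Return columns where the hybrid's base set equals the union of the
    (differing) parental bases -- the classic ITS signature of hybridity."""
    columns = []
    for i, (h, p1, p2) in enumerate(zip(alignment[hybrid],
                                        alignment[parent1],
                                        alignment[parent2])):
        bases1, bases2 = IUPAC.get(p1, set()), IUPAC.get(p2, set())
        if bases1 and bases2 and bases1 != bases2:
            if IUPAC.get(h, set()) == bases1 | bases2:
                columns.append(i)
    return columns

alignment = {              # toy ITS alignment
    "parentA": "ACGTACGT",
    "parentB": "ACGTATGT",
    "hybrid":  "ACGTAYGT",   # Y = C/T at the position where the parents differ
}
print(additive_positions(alignment, "hybrid", "parentA", "parentB"))  # [5]
```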

This means that it is difficult to use any current evolutionary network for the phylogenetic analysis of many of the datasets used for detecting hybridization. In turn, this suggests that we may need a different model, one based on additive polymorphisms rather than incongruent trees.

The usual mathematical model for lateral-transfer networks is actually the same as for hybridization networks, since the only real difference between HGT and hybridization is that HGT does not occur via sexual reproduction while hybridization does. (Also, hybridization often involves whole genomes while HGT usually involves partial genomes.) Importantly, the mathematical model does seem to apply to the sort of datasets collected by biologists when they are studying HGT. That is, HGT is detected by incompatibility between two or more trees for the same set of taxa. Indeed, this model is usually the only evidence for HGT, unlike hybridization and recombination where there is often evidence that is independent of the network model.

Wednesday, January 16, 2013

Datasets for validating algorithms for evolutionary networks


Steven Kelk has previously raised this issue in the post Validating methods for constructing evolutionary phylogenetic networks: there are currently not many options for validating the biological relevance of methods for constructing evolutionary phylogenetic networks. These are phylogenetic networks intended to represent evolutionary history, such as HGT networks, hybridization networks, and recombination networks.

Thus, we need a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This could then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks.

This issue was discussed at some length at the Workshop: The Future of Phylogenetic Networks. It was suggested by Leo van Iersel that a practical starting point would be to use this blog as a link to suitable datasets. As people become aware of such datasets, a blog post would be published with the details, and the dataset would be linked from one of the blog Pages.

This page now exists (Datasets), and can be accessed at the top right of each blog page. Everyone is encouraged to contribute to this "database", which you can do by sending details about potential datasets to me by email.

In another post, What should a database of datasets look like?, I have noted that there have been four suggested approaches to acquiring datasets for evaluating algorithms (in order of increasing reality):
  1. simulate datasets under one or more data-generation models
  2. create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
  3. use datasets where the postulated reticulation events have been independently confirmed
  4. experimentally create taxa with a known evolutionary history.
It seems unnecessary to store datasets of type (1), since they can be created to order by computer programs. Datasets of type (2) are rare, but would be suitable for the database.

Datasets of type (4) currently exist for tree-like evolutionary histories but not yet, as far as I know, for reticulated histories. I have added the known (and available) ones to the database.

Datasets of type (3) are likely to form the bulk of the database, and I have started this part of the database with some example datasets involving hybridization.

For the latter datasets, it is important to note a potential problem: the degree to which the postulated reticulation events have really been independently confirmed. I suspect that only weak evidence has been accepted for far too many datasets. This is particularly true for those involving horizontal gene transfer (HGT), where mere incongruence between genes is presented as the sole "evidence". More than this is required (see Than C, Ruths D, Innan H, Nakhleh L. 2007. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535).

Saturday, September 8, 2012

Poor bioinformatics?


There have been a number of recent posts in the blogosphere about what is perceived to be the rather poor quality of many computer programs in bioinformatics. Basically, many bioinformaticians aren't taking seriously the need to properly engineer software, with full documentation and standard practices for development and versioning.

I thought that I might draw your attention to a few of the posts here, for those of you who write code. Most of the posts have a long series of comments, which are themselves worth reading, along with the original post.

At the Byte Size Biology blog, Iddo Friedberg discusses the nature of disposable programs in research, which are written for one specific purpose and then effectively thrown away:
Can we make accountable research software?
Such programs are "not done with the purpose of being robust, or reusable, or long-lived in development and versioning repositories." I have much sympathy for this point of view, since all of my own programs are of this throw-away sort.

However, Deepak Singh, at the Business|Bytes|Genes|Molecules blog, fails to see much point to this sort of programming:
Research code
He argues that disposable code creates a "technical debt" from which the programmer will not recover.

Titus Brown, at the Living in an Ivory Basement blog, extends the discussion by considering the consequences of publishing (or not publishing) this sort of code:
Anecdotal science
He considers that failure to properly document and release computer code makes the work anecdotal bioinformatics rather than computational science. He laments the pressure to publish code that is not yet ready for prime-time, and the fact that computational work is treated as secondary to the experimental work. Having myself encountered this latter attitude from experimental biologists (the experiment gets two pages of description and the data analysis gets two lines), I entirely agree. Titus concludes with this telling comment: "I would never recommend a bioinformatics analysis position to anyone — it leads to computational science driven by biologists, which is often something we call 'bad science'." Indeed, indeed.

Back at the Business|Bytes|Genes|Molecules blog, Deepak Singh also agrees that "a lot of computational science, at least in the life sciences, is very anecdotal and suffers from a lack of computational rigor, and there is an opaqueness that makes science difficult to reproduce":
Titus has a point

This leads on to another post from Iddo Friedberg (at the Byte Size Biology blog), which takes up this concern:
The Bioinformatics Testing Consortium
This group intends to act as testers for bioinformatics software, providing a means to validate the quality of the code. This is a good, if somewhat ambitious, idea.


Finally, Dave Lunt, at the EvoPhylo blog, takes this to the next step, by considering the direct effect on the reproducibility of scientific research:
Reproducible Research in Phylogenetics
He notes that bioinformatics workflows are often complex, using pipelines to tie together a wide range of programs. This makes the data analysis difficult to reproduce if it needs to be done manually. Hence, he champions "pipelines, workflows and/or script-based automation", with the code made available as part of the Methods section of publications.

Wednesday, May 16, 2012

GPWG Poaceae dataset


In a previous post, Steven mentioned that one of the datasets from the Grass Phylogeny Working Group has played an unexpectedly prominent role in evaluation of hybridization network algorithms.

These algorithms work by trying to construct a network from a set of rooted trees with overlapping sets of taxa; and the GPWG dataset provides six such trees, one from each of six different molecular loci. This dataset seems to have been introduced into the network literature by Bordewich et al. (2007), although it had previously been used for evaluations of supertree methods (Salamin et al. 2002; Schmidt 2003).

The data used consist of DNA sequences of three nuclear loci and three chloroplast genes. The original publication also provides data for morphology and restriction sites, but these have not been used for the network analyses. One reason for interest in this dataset is the possibility of reticulation signals between the nuclear and chloroplast data sources. There are 66 taxa, although nearly half of them are composites formed from data for several different species in the same genus, and only a few of the taxa have data for all six loci (the number of taxa varies from 19 to 65 per locus). The data available are summarized in Table 7.1 of Schmidt (2003).

An important point about these data is that in the original GPWG publication the six gene trees were strict consensus trees from maximum-parsimony analyses, and so they have quite a number of polychotomies. These polychotomies were intended by the authors [personal communication] to express uncertainty about the topologies of the trees.

However, this uncertainty is not shown in the trees that have been used for network evaluation. According to Bordewich et al., the trees that they (and everyone else) used were reconstructed using the fastDNAml program (ie. maximum likelihood), and were supplied by Heiko Schmidt (see Schmidt 2003, p. 74). As expected, there are no polychotomies in these ML trees and no indication of topological uncertainty; and, of course, the tree topologies are somewhat different from those of the parsimony trees.

An important consequence is that there is more incompatibility among the dichotomous maximum-likelihood trees than there is among the polychotomous maximum-parsimony trees. That is, many of the ML incompatibilities are related to uncertainties in the MP trees. Unfortunately, most of the network algorithms that have been evaluated using these data require strictly dichotomous trees.

Also, the root seems to create problems for these data. The GPWG trees are all rooted with this topology:
  (Flagellaria,((Elegia,Baloskion),(Joinvillea,((Streptochaeta,Anomochloa),(Pharus,(ingroup))))))
However, the position of this 7-taxon outgroup relative to the rest of the taxa varies among the gene trees. That is, the connection between the outgroup and the ingroup differs between the gene trees. So, some of the incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes.
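As a minimal sketch of one way to remove this source of incompatibility, the code below re-roots each ML gene tree on the same outgroup taxon before any comparison is made. It assumes the ete3 package; the tree file names are hypothetical, and the taxon names are those of the GPWG outgroup shown above.

```python
# Sketch: give every ML gene tree the same root by re-rooting on a single,
# shared outgroup taxon before any tree comparison, so that differing root
# placements are not mistaken for reticulation. Assumes the ete3 package;
# the tree file names are hypothetical.
from ete3 import Tree

# GPWG outgroup taxa, in order of preference for rooting
OUTGROUP = ["Flagellaria", "Elegia", "Baloskion", "Joinvillea",
            "Streptochaeta", "Anomochloa", "Pharus"]

for path in ["locus1.nwk", "locus2.nwk", "locus3.nwk"]:   # hypothetical file names
    tree = Tree(path)
    leaves = set(tree.get_leaf_names())
    root_taxon = next((name for name in OUTGROUP if name in leaves), None)
    if root_taxon is None:
        continue                      # no outgroup taxon sampled for this locus
    tree.set_outgroup(root_taxon)     # same outgroup leaf for every gene tree
    print(path, tree.write())
```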

Some of the ML datasets available have trees with the same set of ingroup / outgroup relationships as the GPWG trees, for example those datasets available with the CASS algorithm. However, some of the ML trees presented in the literature seem to be rooted in quite different places, and these places differ between the gene trees. For example, in the data provided with the HybridInterleave program, which are presented as 15 pairs of subtrees rather than as six complete trees, not only are the gene trees apparently rooted in different places, but the different subsets of the same gene tree are also sometimes rooted in different places.

It seems to me that there are two consequences arising from these points: (i) it is unnecessarily hard to construct a network from the ML data (because not all of the data signals relate to reticulation), and (ii) the resulting networks (as published) look rather unrealistic to a biologist (there are far too many reticulation nodes). Perhaps this isn't the most realistic dataset to be using for the evaluation of network algorithms.

Another commonly used dataset is the Ranunculus data from Lockhart et al. (2001). In this dataset much of the incompatibility signal also seems to be associated with an uncertain position for the root (see Morrison 2011, Fig. 4.7). In this case there are two gene trees (one nuclear and one chloroplast) that have similar unrooted topologies but have different outgroup-derived root locations. Dealing with root uncertainty may thus be one of the biggest confounding problems when trying to identify reticulation events.

The original GPWG data are available at:
http://www.eeob.iastate.edu/faculty/profiles/ClarkL/GPWG-2001-Appendices.pdf

The nexus data matrix is available at:
http://www.umsl.edu/services/kellogg/gpwg/matrix.html
[In this dataset, 0=A, 1=C, 2=G, 3=T]
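For what it is worth, a one-line translation table is enough to convert such numerically coded rows back to nucleotides; the example row below is made up.

```python
# Sketch: convert a numerically coded GPWG matrix row (0=A, 1=C, 2=G, 3=T)
# back to nucleotide symbols; the example row is hypothetical.
DECODE = str.maketrans("0123", "ACGT")

coded_row = "0123301?-2"             # '?' and '-' pass through unchanged
print(coded_row.translate(DECODE))   # ACGTTAC?-G
```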

A nexus treefile with the original six GPWG (consensus parsimony) trees is available at:
http://acacia.atspace.eu/data/GPWG.tre

A dendroscope treefile with the six ML trees is available at:
http://sites.google.com/site/cassalgorithm/data-sets

References

Bordewich M., Linz S., St. John K., Semple C. (2007) A reduction algorithm for computing the hybridization number of two trees. Evolutionary Bioinformatics 3: 86-98.

Grass Phylogeny Working Group (2001) Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden 88: 373-457.

Lockhart P., McLenachan P.A., Havell D., Glenny D., Huson D., Jensen U. (2001) Phylogeny, radiation, and transoceanic dispersal of New Zealand alpine buttercups: molecular evidence under split decomposition. Annals of the Missouri Botanical Garden 88: 458-477.

Morrison D.A. (2011) Introduction to Phylogenetic Networks. RJR Productions, Uppsala.

Salamin N., Hodkinson T.R., Savolainen V. (2002) Building supertrees: an empirical assessment using the grass family (Poaceae). Systematic Biology 51: 136-150.

Schmidt H.A. (2003) Phylogenetic Trees From Large Datasets. PhD thesis, Heinrich Heine University, Düsseldorf.

Wu Y. (2010) Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics 26: i140-i148.

Wednesday, May 2, 2012

What should a database of datasets look like?


In the previous post, Steven made the very good point that we need a "database" of datasets that can be used to evaluate algorithms for phylogenetic networks. In biological terms, we currently lack a "gold standard" with which to compare the results of our data analyses. This is an important point, to which it is worth adding a few biological notes.

Validation (or evaluation) is a common analytical problem, not just in biology, and it has been addressed in many different circumstances. For example, another area within which I work is multiple DNA sequence alignment. Between 1999 and 2005 several different databases of empirical alignments were developed: BAliBase, OXBench, PREFAB, SABmark, and BRAliBase. These were created independently of each other, and were ostensibly designed for somewhat different purposes. Evaluations of computer algorithms since that time tend to have used several of these databases as their gold standard; and it is quite obvious that success using one of the databases does not imply success with any of the others.

This background has led me to the conclusion that a database needs a structured set of data, with the structure addressing all of the different biological issues that are likely to be important. For example, phylogenetic networks are used to analyze datasets that contain "reticulation" events such as hybridization & introgression, lateral gene transfer, recombination, and genome fusion, but such events can be confounded by other processes such as deep coalescence and hidden paralogy. So, a truly valuable database would have datasets that encompass not only all of these possibilities but their combinations as well. Chris Whidden's comment on Steven's post discusses the same important issue, but from the point of view of the mathematical requirements for finding the optimal network.
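As a concrete (and purely hypothetical) illustration of such a structure, the sketch below defines one possible metadata record for a database entry; every field name is an assumption on my part, not an existing schema.

```python
# Sketch: one possible metadata record for entries in such a structured database;
# all field names are hypothetical, not an existing schema.
from dataclasses import dataclass, field

@dataclass
class NetworkBenchmarkEntry:
    name: str
    taxa: int
    reticulation_events: list = field(default_factory=list)  # e.g. ["hybridization", "LGT"]
    confounders: list = field(default_factory=list)          # e.g. ["deep coalescence"]
    evidence: str = "independent experiment"                  # how the events were confirmed
    data_url: str = ""

entry = NetworkBenchmarkEntry(
    name="example hybrid dataset",
    taxa=42,
    reticulation_events=["hybridization", "introgression"],
    confounders=["hidden paralogy"],
)
print(entry)
```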

Such a database is a very ambitious goal. More to the point, I doubt that such a database could be created without widespread collaboration, as both Steven and Chris have emphasized.

BAliBase (referred to above) attempted to have a structured set of multiple alignments, but it is interesting to note that this structure was mostly ignored by subsequent users of the database. The users simply pooled all the different groups of alignments together and came up with an "average" success for the alignment algorithms, rather than discovering (as they would have done; see Morrison 2006) that different algorithms have different degrees of success depending on the particular characteristics of the dataset being analyzed. We should not make the same mistake when evaluating network algorithms.

I think that there have been four suggested approaches to acquiring datasets for evaluating tree/network algorithms (in order of increasing reality):
  1. simulation under one or more data-generation models
  2. create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
  3. use datasets where the postulated events have been independently confirmed
  4. experimentally create taxa with a known evolutionary history.
Option (1) has been used by many workers to evaluate tree-building algorithms, and the models have been readily adapted for phylogenetic networks (eg. Bandelt et al. 2000; Morin 2007; Woolley et al. 2008). Indeed, this has been the most common strategy for evaluating network algorithms, although there seems to be little consensus so far on what data-generation model(s) to use. The basic limitation here is that simulations (a) show the success of the algorithms only relative to how well they fit the model used to simulate the data, and (b) leave unknown the relationship between the simulation model and the "real world".

Option (2) has rarely been used for networks (eg. Vriesendorp 2007). The basic idea is to create "known" reticulations by combining parts of pre-existing datasets that lack reticulation signals. One can either combine whole datasets that contain mutually incompatible signals, or one can create individual taxa that have parts of their data taken from different reticulation-free datasets. This is a promising approach to "experimental" phylogenetics, although lack of prior experience means that we do not yet know how to use this strategy most effectively.
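A minimal sketch of this idea, under the assumption that the source alignments are reticulation-free and of equal length: splice the first part of a sequence from one dataset onto the rest of a sequence from another, giving a "mosaic" taxon whose reticulate history is then known exactly. All names and the breakpoint below are hypothetical.

```python
# Sketch of option (2): build an artificial "mosaic" taxon by splicing together
# aligned sequence segments taken from two reticulation-free source datasets.
# The alignments, taxon names and breakpoint are all hypothetical.
def make_mosaic(seq_from_a, seq_from_b, breakpoint):
    """Take the first part of one aligned sequence and the rest of another,
    mimicking a recombinant/hybrid whose true history is then known."""
    assert len(seq_from_a) == len(seq_from_b)
    return seq_from_a[:breakpoint] + seq_from_b[breakpoint:]

dataset_a = {"taxon1": "ACGTACGTACGT"}   # from one reticulation-free alignment
dataset_b = {"taxon2": "ACGTTTTTACGA"}   # from another such alignment

mosaic = make_mosaic(dataset_a["taxon1"], dataset_b["taxon2"], breakpoint=6)
print(mosaic)   # ACGTACTTACGA -> a known "reticulate" taxon for benchmarking
```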

Option (3) is an obvious approach to collating data (McDade 1990), and has been used for evaluating tree-building algorithms (eg. McDade 1992; Leitner et al. 1996; Lemey et al. 2005). It has been applied, for example, to fast-evolving organisms such as viruses, where the transmission history can sometimes be independently checked. Also, hybrids can often be experimentally verified; and Vriesendorp (2007) lists several such datasets for plants. The problem here is the degree to which the postulated reticulation events have truly been independently confirmed. A network reticulation may look like good evidence in favor of a "suspected" hybrid, for example, but it is not really independent evidence of anything in particular. I suspect that this weak sort of reasoning lies behind far too many of the datasets that have been used for evaluating network algorithms, making them unsuitable for that purpose.

Option (4) has occasionally been used for evaluating tree-building algorithms (eg. Hillis et al. 1992; Cunningham et al. 1998; Sanson et al. 2002) but not, as far as I know, network algorithms. The idea is to experimentally manipulate some biological organisms in the laboratory to create a known evolutionary history, against which subsequent data analyses can be compared. Realistically, this restricts the datasets to viruses and phages, as these can be manipulated within a reasonable timeframe.

We need to think about which of these options we wish to adopt. Perhaps all of them?

Note: The suggested database now exists: Datasets for validating algorithms for evolutionary networks

References

Bandelt H.-J., Macaulay V., Richards M. (2000) Median networks: speedy construction and greedy reduction, one simulation, and two case studies from human mtDNA. Molecular Phylogenetics & Evolution 16: 8–28.

Cunningham C.W., Zhu H., Hillis D.M. (1998) Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies. Evolution 52: 978-987.

Hillis D.M., Bull J.J., White M.E., Badgett M.R., Molineux I.J. (1992) Experimental phylogenetics: generation of a known phylogeny. Science 255: 589-592.

Leitner T., Escanilla D., Franzén C., Uhlén M., Albert J. (1996) Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proceedings of the National Academy of Sciences of the USA 93: 10864-10869.

Lemey P., Derdelinckx I., Rambaut A., Van Laethem K., Dumont S., Vermeulen S., Van Wijngaerden E., Vandamme A.-M. (2005) Molecular footprint of drug-selective pressure in a Human Immunodeficiency Virus transmission chain. Journal of Virology 79: 11981–11989.

McDade L.A. (1990) Hybrids and phylogenetic systematics I. Patterns of character expression in hybrids and their implications for cladistic analysis. Evolution 44: 1685–1700.

McDade L.A. (1992) Hybrids and phylogenetic systematics II. The impact of hybrids on cladistic analysis. Evolution 46: 1329–1346.

Morin M.M. (2007) Phylogenetic Networks: Simulation, Characterization, and Reconstruction. PhD Thesis, University of New Mexico.

Morrison D.A. (2006) Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19: 479-539.

Sanson G.F.O., Kawashita S.Y., Brunstein A., Briones M.R.S. (2002) Experimental phylogeny of neutrally evolving DNA sequences generated by a bifurcate series of nested polymerase chain reactions. Molecular Biology & Evolution 19: 170–178.

Vriesendorp B. (2007) Phylogenetworks: Exploring Reticulate Evolution and its Consequences for Phylogenetic Reconstruction. PhD Thesis, Wageningen University.

Woolley S.M., Posada D., Crandall K.A. (2008) A comparison of phylogenetic network methods using computer simulation. PLoS One 3: e1913.

Saturday, April 28, 2012

Validating methods for constructing evolutionary phylogenetic networks

Many researchers working on constructing evolutionary (i.e. explicit, as opposed to implicit/data-display) phylogenetic networks encounter the problem that, at present, there are not many options for validating the biological relevance of their methods. In other words, how does a researcher verify whether the network produced by his/her latest algorithm is a biologically plausible approximation of reality? This is of critical importance because, unlike implicit/data-display networks, evolutionary phylogenetic networks seek to produce an explicit hypothesis of what actually happened.

Ideally there should be a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This can then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks. Unfortunately, as far as I am aware there are very few such “reference” datasets in circulation – if any.  There seem to be multiple reasons for this. Within biology reticulate evolution is still a comparatively new topic which actually encompasses an entire range of evolutionary time-scales and phenomena. I can fully appreciate that trying to get a grip on even a tiny part of this world is an immensely complex task for biologists! This is probably why biological validation of algorithmic methods, if it happens at all, still requires collaborating biologists to perform a labour-intensive and highly case-specific analysis. It will be a massive challenge to move beyond such ad-hoc models of validation.

On the algorithmic side there are also plenty of issues. Input-side and output-side limitations to existing software are well-known. Expressed deliberately sharply: it is not often that one encounters a biologist who has two fully-refined, unambiguously rooted gene trees on the same set of taxa, who wants a reticulation-minimal solution, and who does not mind if ancestors can hybridize with descendants. Faced with such limitations, computer scientists inevitably resort to simulations, or try to analyse the same dataset that the last group of computer scientists used, which is (sigh…) probably the Grass Phylogeny Working Group's Poaceae dataset. Simulations tend to use a variety of plausible-sounding techniques (e.g. random rSPR moves to simulate HGT, or – at the population-genomic level – techniques for simulating recombination), but how far do these simulations really approximate reality?

My concern is that, at the moment, biologists and computer-scientists are locked in an unhealthy embrace, both expecting the other group to come up with “real” networks. This could be dangerous. I’ve seen biologists adjust their hypotheses based on the output of evolutionary phylogenetic network software. But those computer programs often lack any form of biological validation: not because algorithm designers are bad people aiming to mislead but because the apparently intractable character of the associated optimization problems forces computer scientists to make all kinds of restrictions and assumptions which are not necessarily compatible with the concerns of biologists. In any case: it’s clearly not helpful if hypotheses derived this way find their way back into the literature with an “approved by biologists” seal of approval.

How, then, to transform this embrace into something more virtuous? One possibility could be a structured collaboration between groups in the phylogenetic network community to produce and disseminate at least a small number of rigorously validated reference datasets which can serve as benchmarks. Is this realistic?

Very curious to hear what you think!

Note: The suggested database now exists: Datasets for validating algorithms for evolutionary networks