Wednesday, January 16, 2013

Datasets for validating algorithms for evolutionary networks


Steven Kelk has previously raised the issue of Validating methods for constructing evolutionary phylogenetic networks: there are currently not many options for validating the biological relevance of methods for constructing evolutionary phylogenetic networks. These are phylogenetic networks intended to represent evolutionary history, such as HGT networks, hybridization networks, and recombination networks.

Thus, we need a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This could then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks.

This issue was discussed at some length at the Workshop: The Future of Phylogenetic Networks. It was suggested by Leo van Iersel that a practical starting point would be to use this blog as a link to suitable datasets. As people become aware of such datasets, a blog post would be published with the details, and the dataset would be linked from one of the blog Pages.

This page now exists (Datasets), and can be accessed at the top right of each blog page. Everyone is encouraged to contribute to this "database", which you can do by sending details about potential datasets to me by email.

In another post, What should a database of datasets look like?, I have noted that there have been four suggested approaches to acquiring datasets for evaluating algorithms (in order of increasing realism):
  1. simulate datasets under one or more data-generation models
  2. create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
  3. use datasets where the postulated reticulation events have been independently confirmed
  4. experimentally create taxa with a known evolutionary history.
It seems unnecessary to store datasets of type (1), since they can be created to order by computer programs. Datasets of type (2) are rare, but would be suitable for the database.
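As a rough illustration of what a type (2) dataset involves, here is a minimal Python sketch that creates an artificial mosaic taxon by joining two aligned sequences at a chosen cut point. The function name make_mosaic, the taxon names and the toy sequences are all invented for this example; a real dataset would start from a genuine alignment and would probably use several cut points.

```python
def make_mosaic(parent_a: str, parent_b: str, cut: int) -> str:
    """Return the first `cut` columns of parent_a followed by the rest of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parent sequences must be aligned to the same length")
    if not 0 <= cut <= len(parent_a):
        raise ValueError("cut point must lie within the alignment")
    return parent_a[:cut] + parent_b[cut:]


# Toy aligned sequences (hypothetical data, for illustration only).
alignment = {
    "taxon_A": "AAAAAAAAAA",
    "taxon_B": "CCCCCCCCCC",
    "taxon_C": "GGGGGGGGGG",
}

# Add an artificial recombinant of taxon_A and taxon_B with a cut after column 4.
alignment["mosaic_AB"] = make_mosaic(alignment["taxon_A"], alignment["taxon_B"], 4)
```

Because the parents and the cut point are chosen by the person building the dataset, the "true" reticulation event is known exactly, which is what makes such a dataset usable for validation.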

Datasets of type (4) currently exist for tree-like evolutionary histories but not yet, as far as I know, for reticulated histories. I have added the known (and available) ones to the database.

Datasets of type (3) are likely to form the bulk of the database, and I have started this part of the database with some example datasets involving hybridization.

For the latter datasets, an important potential problem is the degree to which the postulated reticulation events have actually been independently confirmed. I suspect that far too many datasets rest on only weak evidence. This is particularly true for those involving horizontal gene transfer (HGT), where mere incongruence between genes is presented as the sole "evidence". More than this is required (see Than C, Ruths D, Innan H, Nakhleh L. 2007. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535).
