Wednesday, January 30, 2013

More datasets for validating network algorithms


Ten more datasets have been added to the Datasets blog page. These are:
  • 2 plant studies where hybrids are known from experimentation
  • 3 more plant studies where natural hybrids are known
  • 5 studies (fungi, plants, protozoa, viruses, animals) where recombination is known.

A comment

It is worth noting something that has become obvious to me while compiling these datasets: the mathematical model often applied to hybridization networks cannot easily be applied to many of the datasets collected by biologists. The usual mathematical model involves incompatibility between two or more trees for the same set of taxa, for example trees from different genes or genomes. The incompatibilities are resolved by postulating one or more reticulations in the network.
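As a rough illustration of what "incompatibility between trees" means here, the sketch below represents each rooted tree as a set of clusters (the taxa descending from each internal branch) and reports pairs of clusters that overlap without one containing the other. The taxon names, the two toy gene trees and the function names are made up for illustration; they do not come from any of the datasets.

```python
from itertools import product


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def conflicts(tree1, tree2):
    """Return the pairs of clusters (one from each tree) that are incompatible."""
    return [(a, b) for a, b in product(tree1, tree2) if not compatible(a, b)]


# Hypothetical gene trees on taxa A-D, written as lists of clusters.
gene1 = [frozenset("AB"), frozenset("ABC")]  # (((A,B),C),D)
gene2 = [frozenset("BC"), frozenset("ABC")]  # ((A,(B,C)),D)

# Only the clusters {A,B} and {B,C} clash; a network method would
# explain such a clash by postulating a reticulation.
clashes = conflicts(gene1, gene2)
print(clashes)
```

Any non-empty result means the two trees cannot both be displayed by a single tree, which is the situation the usual network model is designed to resolve.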

However, the data produced by biologists often involve only a single nuclear gene, most frequently the Internal Transcribed Spacer (ITS) region, so that the biologists do not have multiple trees. Instead, hybrids are detected by additive polymorphisms at alignment positions within the study gene. These polymorphisms arise from either (i) the polyploid nature of the hybrids (there are multiple copies of each chromosome, each of which may have a gene copy from either parental species), or (ii) multiple paralogous copies of the genes (the rRNA region, which contains the ITS, usually has many tandemly repeated copies of the genes, which are homogenized by concerted evolution, but in a hybrid any of them may have a gene copy from either parental species).
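To make the idea of an additive polymorphism concrete, here is a minimal sketch that scans three aligned sequences and flags positions where the two putative parents have different bases and the candidate hybrid shows exactly the combination of the two. It assumes that polymorphic positions have been coded with the standard IUPAC ambiguity symbols (for example R for A or G); the function name and the short sequences are hypothetical, and real studies would need to deal with alignment quality and intra-parental variation as well.

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def additive_sites(parent1, parent2, candidate):
    """Return 0-based positions where the parents differ and the
    candidate carries exactly the union of the parental bases."""
    sites = []
    for i, column in enumerate(zip(parent1, parent2, candidate)):
        p, q, h = (set(IUPAC.get(c.upper(), "")) for c in column)
        if not (p and q and h):
            continue  # skip gaps and unrecognised symbols
        if p.isdisjoint(q) and h == p | q:
            sites.append(i)
    return sites


# Made-up aligned sequences (not real ITS data).
parent_a = "ACGTAC"
parent_b = "ACATGC"
hybrid = "ACRTRC"

print(additive_sites(parent_a, parent_b, hybrid))
```

A candidate with many such positions, and few positions that contradict the pattern, is the kind of evidence for hybrid origin that a single-gene dataset can provide without any second tree.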

This means that it is difficult to use any current evolutionary-network method for the phylogenetic analysis of many of the datasets used for detecting hybridization. In turn, this suggests that we may need a different model, one based on additive polymorphisms rather than incongruent trees.

The usual mathematical model for lateral-transfer networks is actually the same as for hybridization networks, since the only real difference between horizontal gene transfer (HGT) and hybridization is that HGT does not occur via sexual reproduction while hybridization does. (Also, hybridization often involves whole genomes while HGT usually involves partial genomes.) Importantly, the mathematical model does seem to apply to the sort of datasets collected by biologists when they are studying HGT. That is, HGT is detected by incompatibility between two or more trees for the same set of taxa. Indeed, this incompatibility is usually the only evidence for HGT, unlike hybridization and recombination, where there is often evidence that is independent of the network model.
