Wednesday, September 18, 2013
Checking data errors with phylogenetic networks
Data-display networks can be used for a number of purposes, for example: exploratory data analysis, displaying data patterns, displaying data conflicts, summarizing analysis results, and testing phylogenetic hypotheses. One of the more important, but currently under-valued, purposes is detecting data errors.
For instance, networks can help you detect data-sampling errors or outliers (e.g. wrong specimen identification, diseased specimens), as well as data-collection errors (e.g. extracting the wrong DNA, amplifying the wrong gene, sequencing artifacts) and data-processing errors (e.g. data-entry mistakes, incorrect alignment). These types of errors will likely show up as reticulations in a network, especially a splits graph.
Perhaps the most powerful use of such networks is in conjunction with a database of gold-standard or benchmark sequences. Comparison of all new sequences with the database would allow for a systematic quality check, because the network structure of the database is already known, and any deviation from this structure highlights potential problems ("identifying idiosyncrasies that cannot be attributed to natural evolutionary processes") or indicates novel sequence variation. Much of this process can be effectively automated by computer scripts.
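As a rough illustration of how such a scripted check might look, here is a minimal Python sketch. It is not the quasi-median network method used in the papers listed below; instead it uses a much simpler stand-in, the four-gamete (pairwise site compatibility) test, to ask whether a new sequence introduces conflicts that are absent from the benchmark set. It assumes the sequences are already aligned and of equal length, and the function names (`binary_sites`, `incompatible_pairs`, `check_new_sequence`) are hypothetical, chosen only for this example.

```python
from itertools import combinations


def binary_sites(seqs):
    """Return the indices of alignment columns that show exactly two states."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) == 2]


def incompatible_pairs(seqs):
    """Return the pairs of binary sites that fail the four-gamete test.

    Two binary sites are incompatible (they cannot both fit a single tree
    without homoplasy) when all four combinations of their states occur.
    """
    bad = set()
    for i, j in combinations(binary_sites(seqs), 2):
        if len({(s[i], s[j]) for s in seqs}) == 4:
            bad.add((i, j))
    return bad


def check_new_sequence(benchmark, new_seq):
    """List the site pairs that become incompatible only once new_seq is added.

    Assumption: benchmark is a list of aligned, equal-length strings, and
    new_seq has been aligned to them. An empty result means the new sequence
    adds no conflict under this simple test.
    """
    if any(len(s) != len(new_seq) for s in benchmark):
        raise ValueError("all sequences must be aligned to the same length")
    before = incompatible_pairs(benchmark)
    after = incompatible_pairs(benchmark + [new_seq])
    return sorted(after - before)


# Toy data, invented for illustration only.
benchmark = ["AAAA", "AACC", "GGAA"]
print(check_new_sequence(benchmark, "AACC"))  # matches a known sequence
print(check_new_sequence(benchmark, "GGCC"))  # mixes two known types
```

In this toy example the second query combines the first half of one benchmark sequence with the second half of another, so it creates new site conflicts, which is the kind of pattern that would appear as a reticulation in a network. A flagged sequence is only a candidate for closer inspection: the conflict could equally reflect genuine novel variation rather than an error.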
To date, the champion of this use of networks has been Hans-Jürgen Bandelt, who has presented a number of interesting practical examples over the past dozen years. Below, I have included an annotated list of some of the more interesting publications in this area.
Bandelt H-J, Lahermo P, Richards M, Macaulay V (2001) Detecting errors in mtDNA data by phylogenetic analysis. International Journal of Legal Medicine 115: 64-69. — The first to suggest phylogenetic analysis as a component of data-quality checking, although networks are not explicitly mentioned
Bandelt H-J, Quintana-Murci L, Salas A, Macaulay V (2002) The fingerprint of phantom mutations in mitochondrial DNA data. American Journal of Human Genetics 71: 1150-1160. — The first to explicitly suggest using networks, and then use median and quasi-median networks to detect errors in published human mtDNA control-region datasets
Bandelt H-J, Kivisild T (2006) Quality assessment of DNA sequence data: autopsy of a mis-sequenced mtDNA population sample. Annals of Human Genetics 70: 314-326. — Use quasi-median networks to detect errors in a published human mtDNA control-region dataset
Bandelt H-J, Dür A (2007) Translating DNA data tables into quasi-median networks for parsimony analysis and error detection. Molecular Phylogenetics and Evolution 42: 256-271. — Discuss the use of quasi-median networks for error detection, and re-visit the analysis of Bandelt and Kivisild (2006)
Parson W, Dür A (2007) EMPOP — A forensic mtDNA database. Forensic Science International: Genetics 1: 88-92. — Use quasi-median networks to detect mtDNA errors in forensic data by comparison with a benchmark database
Kong Q-P, Salas A, Sun C, Fuku N, Tanaka M, Zhong L, Wang C-Y, Yao Y-G, Bandelt H-J (2008) Distilling artificial recombinants from large sets of complete mtDNA genomes. PLoS ONE 3: e3016. — Use median networks to detect possible artificial recombinant sequences in molecular databases (i.e. chimeric sequences resulting from laboratory-induced errors)
Bandelt H-J, Yao Y-G, Bravi CM, Salas A, Kivisild T (2009) Median network analysis of defectively sequenced entire mitochondrial genomes from early and contemporary disease studies. Journal of Human Genetics 54: 174-181. — Use median networks to detect possible errors in human mtDNA genomes intended to find sequence mutations associated with particular diseases