Datasets


This is a compilation of links to empirical datasets that may be useful for validating the mathematical algorithms associated with phylogenetic networks that are intended to represent evolutionary history. For each dataset, an aligned data file is provided, along with annotation notes.

For each dataset, the following information is provided:
  • Name: a unique name for the dataset
  • Source: the publication used as the source for the data
  • Zip file: contains the nexus file, the annotation notes, and a PDF copy of the source paper
  • Nexus file: a text version of the nexus-formatted file for quick viewing
  • Notes: a brief explanation of what the data are about, and what phylogenetic history they represent
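
Since each dataset is distributed as a nexus-formatted text file, a minimal sketch of loading one into memory may be a useful starting point. The function name `read_nexus_matrix` and the three-taxon example are hypothetical and are not taken from any of the datasets. The sketch assumes a simple, non-interleaved MATRIX block with one taxon per line and no spaces in taxon names; files that do not meet these assumptions would need a full NEXUS parser.

```python
import re


def read_nexus_matrix(text):
    """Return {taxon: characters} from the first MATRIX block of a NEXUS string.

    Assumption: the matrix is non-interleaved, with one taxon per line and
    taxon names that contain no spaces.
    """
    # Remove NEXUS comments, which are enclosed in square brackets.
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Capture everything between the MATRIX keyword and the next semicolon.
    match = re.search(r"\bmatrix\b(.*?);", text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no MATRIX block found")
    matrix = {}
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # The first token is the taxon name; the rest is character data.
            matrix[parts[0]] = "".join(parts[1:])
    return matrix


# Made-up example data, used only to illustrate the expected layout.
example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=8;
  FORMAT DATATYPE=DNA MISSING=? GAP=-;
  MATRIX
    taxonA ACGTACGT
    taxonB ACGTAC-T [a comment]
    taxonC ACGAACGT
  ;
END;
"""

alignment = read_nexus_matrix(example)
for name, characters in alignment.items():
    print(name, len(characters), characters)
```

In an aligned data file every row should have the same length, so comparing the row lengths is a quick sanity check before passing the matrix to a tree or network method.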


Part 1
Datasets where the history is a tree

These serve as negative controls for network algorithms.

Datasets where the history is known from experimentation


(1)
Name: Sanson
Source: Sanson GF, Kawashita SY, Brunstein A, Briones MR (2002) Experimental phylogeny of neutrally evolving DNA sequences generated by a bifurcate series of nested polymerase chain reactions. Molecular Biology and Evolution 19: 170-178.
Zip file: Sanson.zip
Nexus file: SansonLeaves.nex
Notes: complete small-subunit rDNA gene sequences from Trypanosoma cruzi; an easy tree — it is recovered by all analyses and all models

(2)
Name: Hillis
Source: Hillis DM, Bull JJ, White ME, Badgett MR, Molineux IJ (1992) Experimental phylogenetics: generation of a known phylogeny. Science 255: 589-592.
Zip file: Hillis.zip
Nexus file: Hillis.nex
Notes: three blocks with partial gene sequences from bacteriophage T7; no model gets the tree quite right

(3)
Name: Cunningham
Source: Cunningham CW, Zhu H, Hillis DM (1998) Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies. Evolution 52: 978-987.
Zip file: Cunningham.zip
Nexus file: Cunningham.nex
Notes: three complete gene sequences plus two partial gene sequences from bacteriophage T7; almost a star tree, and no model gets the tree right

(4)
Name: Cunningham2
Source: Cunningham CW, Jeng K, Husti J, Badgett M, Molineux IJ, Hillis DM, Bull JJ (1997) Parallel molecular evolution of deletions and nonsense mutations in bacteriophage T7. Molecular Biology and Evolution 14: 113-116.
Zip file: Cunningham2.zip
Nexus file: Cunningham2.nex
Notes: two partial gene sequences from bacteriophage T7; almost a star tree

(5)
Name: Sousa
Source: Sousa A, Zé-Zé L, Silva P, Tenreiro R (2008) Exploring tree-building methods and distinct molecular data to recover a known asymmetric phage phylogeny. Molecular Phylogenetics and Evolution 48: 563-573.
Zip file: Sousa.zip
Nexus file: Sousa.nex
Notes: nine blocks with partial gene sequences from bacteriophage T7

(6)
Name: Parzival
Source: Spencer M, Davidson EA, Barbrook AC, Howe CJ (2004) Phylogenetics of artificial manuscripts. Journal of Theoretical Biology 227: 503-511.
Zip file: Parzival.zip
Nexus file: Parzival.nex
Notes: one block of text from the medieval German poem "Parzival", manually copied several times


Datasets where the history is known from retrospective observation


(1)
Name: Leitner
Source: Leitner T, Escanilla D, Franzén C, Uhlén M, Albert J (1996) Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proceedings of the National Academy of Sciences of the USA 93: 10864-10869.
Zip file: Leitner.zip
Nexus file: Leitner.nex
Notes: two partial gene sequences from HIV-1 virus; no model gets the tree quite right

(2)
Name: Lemey
Source: Lemey P, Derdelinckx I, Rambaut A, Van Laethem K, Dumont S, Vermeulen S, Van Wijngaerden E, Vandamme A-M (2005) Molecular footprint of drug-selective pressure in a Human Immunodeficiency Virus transmission chain. Journal of Virology 79: 11981-11989.
Zip file: Lemey.zip
Nexus file: Lemey.nex
Notes: two partial gene sequences from HIV-1 virus; most models get the tree almost right



Datasets where the history is known from simulation


(1)
Name: Camin
Source: Sokal RR (1983) A phylogenetic analysis of the Caminalcules. I. The data base. Systematic Zoology 32: 159-184.
Zip file: Camin.zip
Nexus file: 2 separate files (see the Zip file)
Notes: morphological features of artificial organisms; there are two data files, one containing only the 29 extant organisms (and for which the tree is provided) and one with both the 29 extant organisms and the 48 fossil organisms
Caveat emptor: These are simulated data, and therefore do not necessarily match real data in all respects


Part 2
Datasets where the history is reticulated

Datasets where the evidence of reticulation is independent of the dataset

Datasets where the history is known from experimentation


(i) Hybridization and Introgression


(1)
Name: Feliner
Source: Fuertes Aguilar J, Rosselló JA, Nieto Feliner G (1999) Nuclear ribosomal DNA (nrDNA) concerted evolution in natural and artificial hybrids of Armeria (Plumbaginaceae). Molecular Ecology 8: 1341-1346.
Zip file: Feliner.zip
Nexus file: Feliner.nex
Notes: one gene sequence from Armeria plants; there are three artificial hybrids, which differ only by having additive polymorphic nucleotides in some of the six positions at which the parents differ

(2)
Name: McDade
Source: McDade LA (1997) Hybrids and phylogenetic systematics. III. Comparison with distance methods. Systematic Botany 22: 669-683.
Zip file: McDade.zip
Nexus file: McDade.nex
Notes: morphology from Aphelandra plants; there are 17 artificial hybrids, originally intended to be analyzed with each F1 hybrid added individually to the set of F0 species


(ii) Text Contamination


(1)
Name: Heinrichi
Source: Roos T, Heikkilä T (2009) Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets. Literary and Linguistic Computing 24: 417-433.
See also the Computer-Assisted Stemmatology Challenge web page.
Zip file: Heinrichi.zip
Nexus file: Heinrichi.nex
Notes: one block of text from the late medieval Finnish folktale "Piispa Henrikin Surmavirsi", manually copied several times, with contamination among copies and deliberately deleted text

(2)
Name: Besoin
Source: Baret PV, Macé C, Robinson P (2006) Testing methods on an artificially created textual tradition. In Macé C, Baret P, Bozzi A, Cignoni L (eds) The Evolution of Texts: Confronting Stemmatological and Genetical Methods, pp 255-281. Istituti Editoriali e Poligrafici Internazionali, Pisa.
See also the Computer-Assisted Stemmatology Challenge web page.
Zip file: Besoin.zip
Nexus file: Besoin.nex
Notes: one block of text from the modern French "Notre besoin de consolation est impossible à rassasier", manually copied several times, with contamination in one copy and deliberately deleted text




(iii) Pedigree

(1)
Name: Eclipse
Source: Bower MA, Campana MG, Nisbet RER, Weller R, Whitten M, Edwards CJ, Stock F, Barrett E, O'Connell TC, Hill EW, Wilson AM, Howe CJ, Barker G, Binns M (2012a) Truth in the bones: resolving the identity of the founding elite thoroughbred racehorses. Archaeometry 54: 916-925.
Zip file: Eclipse.zip
Nexus file: Eclipse.nex
Notes: mitochondrial control region from historical thoroughbred stallions; there are two reticulations from male ancestors



Datasets where the reticulation is inferred


(i) Hybridization and Introgression


(1)
Name: Donoghue
Source: Donoghue MJ, Baldwin BG, Li J, Winkworth RC (2004) Viburnum phylogeny based on chloroplast trnK intron and nuclear ribosomal ITS DNA sequences. Systematic Botany 29: 188-198.
Zip file: Donoghue.zip
Nexus file: DonoghueSubset.nex
Notes: two partial gene sequences from Viburnum plants; Viburnum prunifolium is a hybrid

(2)
Name: Rieseberg
Source: Rieseberg LH (1991) Homoploid reticulate evolution in Helianthus (Asteraceae): evidence from ribosomal genes. American Journal of Botany 78: 1218-1237.
Zip file: Rieseberg.zip
Nexus file: Rieseberg.nex
Notes: two restriction-site sets from Helianthus plants; Helianthus anomalus, Helianthus deserticola and Helianthus paradoxus are hybrids

(3)
Name: Atchley
Source: Atchley WR, Fitch WM (1991) Gene trees and the origins of inbred strains of mice. Science 254: 554-558.
Zip file: Atchley.zip
Nexus file: Atchley.nex
Notes: percentage allelic differences for 144 gene loci from laboratory mice; SEA, CBA and C3H are hybrids, but only the first one appears to be detectable in the dataset

(4)
Name: Beardsley
Source: Beardsley PM, Schoenig SE, Whittall JB, Olmstead RG (2004) Patterns of evolution in western North American Mimulus (Phrymaceae). American Journal of Botany 91: 474-489.
Zip file: Beardsley.zip
Nexus file: BeardsleyAll.nex
Notes: three partial gene sequences from Mimulus plants; Mimulus evanescens is a hybrid

(5)
Name: Hoggard
Source: Hoggard GD, Kores PJ, Molvray M, Hoggard RK (2004) The phylogeny of Gaura (Onagraceae) based on ITS, ETS, and trnL-F sequence data. American Journal of Botany 91: 139-148.
Zip file: Hoggard.zip
Nexus file: Hoggard.nex
Notes: three partial gene sequences from Gaura plants; Gaura drummondii is a hybrid

(6)
Name: Alice
Source: Alice LA, Eriksson T, Eriksen B, Campbell CS (2001) Hybridization and gene flow between distantly related species of Rubus (Rosaceae): evidence from nuclear ribosomal DNA internal transcribed spacer region sequences. Systematic Botany 26: 769-778.
Zip file: Alice.zip
Nexus file: Alice.nex
Notes: one partial sequence from Rubus plants; five hybrids, but three are similar to the parents

(7)
Name: Howarth
Source: Howarth DG, Baum DA (2005) Genealogical evidence of homoploid hybrid speciation in an adaptive radiation of Scaevola (Goodeniaceae) in the Hawaiian Islands. Evolution 59: 948-961.
Zip file: Howarth.zip
Nexus file: Howarth.nex
Notes: four partial gene sequences from Scaevola plants; there are three samples of the hybrid Scaevola procera

(8)
Name: Moody
Source: Moody ML, Rieseberg LH (2012) Sorting through the chaff, nDNA gene trees for phylogenetic inference and hybrid identification of annual sunflowers (Helianthus sect. Helianthus). Molecular Phylogenetics and Evolution 64: 145-155.
Zip file: Moody.zip
Nexus files: 11 separate files (see the Zip file)
Notes: eleven partial gene sequences from Helianthus plants, with multiple accessions for many of the species, and multiple alleles for many of the accessions; Helianthus anomalus, Helianthus deserticola and Helianthus paradoxus are hybrids; some recombinants have also been detected
Caveat emptor: There are discrepancies between Table 1 and Figure 1 in the paper, and between both of these and the dataset; these are detailed in the Excel spreadsheet in the Zip file


(ii) Recombination


(1)
Name: ODonnell
Source: O’Donnell K, Kistler HC, Tacke BK, Casper HH (2000) Gene genealogies reveal global phylogeographic structure and reproductive isolation among lineages of Fusarium graminearum, the fungus causing wheat scab. Proceedings of the National Academy of Sciences of the USA 97: 7905-7910.
Zip file: ODonnell.zip
Nexus file: ODonnellAll.nex
Notes: six partial gene sequences from Fusarium fungi; NRRL_28338 and NRRL_28721 are recombinants

(2)
Name: Bollyky
Source: Bollyky PL, Rambaut A, Harvey PH, Holmes EC (1996) Recombination between sequences of Hepatitis B Virus from different genotypes. Journal of Molecular Evolution 42: 97-102.
Zip file: Bollyky.zip
Nexus file: Bollyky.nex
Notes: complete genome sequences from Hepatitis B viruses; HBVDNA and HPBADWl are recombinants

(3)
Name: Starr
Source: Starr JR, Gravel G, Bruneau A, Muasya AM (1996) Phylogenetic implications of a unique 5.8s nrDNA insertion in Cyperaceae. Aliso 23: 84-98.
Zip file: Starr.zip
Nexus file: Starr.nex
Notes: one partial gene sequence from sedge and rush plants; Oxychloe andina is a chimeric sequence

(4)
Name: Cooper
Source: Cooper MA, Adam RD, Worobey M, Sterling CR (2007) Population genetics provides evidence for recombination in Giardia. Current Biology 17: 1984-1988.
Zip file: Cooper.zip
Nexus file: Cooper.nex
Notes: three partial chromosome sequences from Giardia protozoa; Giardia intestinalis isolate 335 is a recombinant

(5)
Name: Aoyama
Source: Aoyama J, Nishida M, Tsukamoto K (2001) Molecular phylogeny and evolution of the freshwater eel, genus Anguilla. Molecular Phylogenetics and Evolution 20: 450-459.
Zip file: Aoyama.zip
Nexus file: Aoyama.nex
Notes: one partial gene sequence from Anguilla eels; Anguilla bicolor bicolor is a recombinant

(6)
Name: Sessa
Source: Sessa EB, Zimmer EA, Givnish TJ (2012) Unraveling reticulate evolution in North American Dryopteris (Dryopteridaceae). BMC Evolutionary Biology 12: 104.
Zip file: Sessa.zip
Nexus file: Sessa.nex
Notes: eight partial gene sequences from Dryopteris ferns; Dryopteris celsa EBS27 is a recombinant


(iii) Lateral Gene Transfer


To be added



(iv) Word Borrowing


(1)
Name: List
Source: List J-M, Nelson-Sathi S, Geisler H, Martin W (2013) Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36: 141-150.
Zip file: List.zip
Nexus file: 2 separate files (see the Zip file)
Notes: presence/absence of sets of cognate words; there are two data files, one with known borrowings (loan words) included and one without; extensive word borrowings are known in several languages
