Wednesday, September 2, 2015

Is this a "gold standard" dataset?


I have just added another dataset to our database. This one is of considerable interest, because it is a complex one. As the authors note, it is likely to contain ancient hybrid speciation, recent introgression and deep coalescence. Thus, identifying recent hybrids will be problematic.
Michael L. Moody and Loren H. Rieseberg (2012) Sorting through the chaff, nDNA gene trees for phylogenetic inference and hybrid identification of annual sunflowers (Helianthus sect. Helianthus). Molecular Phylogenetics and Evolution 64: 145–155.
There are 29 accessions from 13 species, with data for 11 loci in 5 linkage groups (a total of 8,077 aligned nucleotides). The accessions have sequences for either 1 or 2 of the alleles, and sometimes 3 (the latter are likely to be the result of PCR artifacts). The authors have also tried to identify recombinant sequences. Three of the species are previously identified hybrid taxa.

Unfortunately, adding this dataset to the database has also been problematic, because there are internal inconsistencies. For complete consistency, Figure 1 of the paper should agree with its own Table 1, and the GenBank data should agree with both of them. However, this three-way consistency exists for only 2 of the 11 loci. For the rest, in 7 instances the GenBank dataset is the odd one out, in 4 instances it is the table, and in 4 instances it is the figure. For the data discrepancies, in 2 cases a sequence is missing, in 1 case there is an extra sequence, and for the remaining 2 pairs it is likely that the sequences are mis-labelled.
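To make the three-way check concrete, here is a minimal sketch in Python of how such a comparison could be automated for a single locus. It is not the procedure I actually followed; the function name `odd_ones_out` and the sequence labels in the example are hypothetical, and it assumes that each source (figure, table, GenBank) can be reduced to a set of sequence labels.

```python
from typing import Dict, Set


def odd_ones_out(figure: Set[str], table: Set[str], genbank: Set[str]) -> Dict[str, str]:
    """Report every sequence label that is not present in all three sources.

    Hypothetical helper: each argument is the set of sequence labels
    that one source lists for a single locus.
    """
    sources = {"figure": figure, "table": table, "genbank": genbank}
    report: Dict[str, str] = {}
    for label in sorted(figure | table | genbank):
        present = [name for name, labels in sources.items() if label in labels]
        if len(present) == 3:
            continue  # all three sources agree on this label
        if len(present) == 2:
            # Two sources list the label, so the third is the odd one out.
            missing = next(name for name in sources if name not in present)
            report[label] = f"missing from {missing}"
        else:
            # Only one source lists the label, so that source is the odd one out.
            report[label] = f"only in {present[0]}"
    return report


# Made-up labels for one locus, purely for illustration.
example = odd_ones_out(
    figure={"acc1_a", "acc1_b", "acc2_a"},
    table={"acc1_a", "acc1_b", "acc2_a"},
    genbank={"acc1_a", "acc2_a", "acc2_b"},
)
for label, problem in example.items():
    print(label, "->", problem)
```

A check like this only flags missing and extra sequences. Mis-labelled sequences show up as a matched pair of "missing" and "only in" entries, and still need to be resolved by eye.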

It is therefore not immediately obvious to what extent this counts as a "gold standard" dataset. I have included it because of its intrinsic interest, but obviously with a caveat emptor warning. Sadly, this sort of situation has been all too common in my search for suitable datasets.
