Wednesday, September 16, 2015

Some new additions to the dataset database

Recently, I have added three new datasets to the database of "gold standards" that might be used to evaluate network algorithms. All three differ from what has previously been included, so I will briefly discuss them here.

Pedigree data

I have included a known pedigree from a small group of thoroughbred stallions (Eclipse dataset) for which there are mitochondrial D-loop (control region) sequences. Pedigrees are networks, not trees, whenever there is inter-breeding among close relatives, and so their inclusion in the database is needed.
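To make the tree-versus-network point concrete, here is a minimal sketch that flags the individuals whose sire and dam are related, which is where a pedigree stops being tree-like. The pedigree below is an invented toy example with hypothetical names (it is not the Eclipse dataset), and the function names are my own.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy pedigree: individual -> (sire, dam); None marks an unknown parent.
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def ancestors(individual: str, pedigree: Pedigree) -> Set[str]:
    """Return every known ancestor of an individual."""
    found: Set[str] = set()
    stack = [p for p in pedigree.get(individual, (None, None)) if p is not None]
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.add(current)
        stack.extend(p for p in pedigree.get(current, (None, None)) if p is not None)
    return found


def inbred_individuals(pedigree: Pedigree) -> Set[str]:
    """Individuals whose sire and dam share an ancestor (or one descends from the other)."""
    result: Set[str] = set()
    for individual, (sire, dam) in pedigree.items():
        if sire is None or dam is None:
            continue
        sire_line = ancestors(sire, pedigree) | {sire}
        dam_line = ancestors(dam, pedigree) | {dam}
        if sire_line & dam_line:
            result.add(individual)
    return result


toy_pedigree: Pedigree = {
    "A": (None, None),
    "B": (None, None),
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),  # mating between full siblings: the pedigree closes a loop here
    "F": (None, None),
    "G": ("E", "F"),  # parents unrelated, so no new loop
}

print(inbred_individuals(toy_pedigree))
```

An empty result would mean that no two lines of descent ever rejoin, while any non-empty result marks a loop that a tree cannot represent.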

There are practical problems with including more pedigrees. Most of the known pedigrees do not have readily available sequence data associated with them, as the collected data have been mainly for features associated with disease syndromes. Conversely, most of the available sequence data are not associated with known pedigrees, although for humans they are often taken from known social / linguistic / geographical groups (usually based on the place of birth of all four grandparents).

Language data

The database currently contains only a few examples from the social sciences, notably some experimental manipulations from stemmatology. However, there is so far nothing from linguistics, mainly because the phylogenetic history of languages is often poorly known. Nevertheless, languages form networks whenever there is borrowing of words (ie. loan words) between languages (usually as a result of geographical contact), and so their inclusion is desirable.

I have now included one dataset (the List dataset) taken from what appears to be the best-curated source of linguistic data, the Indo-European Lexical Cognacy Database. Known loan words are explicitly tagged in this source; and the phylogenetic relationships of many Indo-European languages are also tolerably well known (eg. see Ethnologue: Languages of the World).
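As a rough illustration of how tagged loans turn a language tree into a network, the sketch below adds borrowing edges to a tree and reports the languages that end up with more than one source. The language names, the tree, and the loan list are all invented placeholders, and the data layout is an assumption of mine, not the format used by the Indo-European Lexical Cognacy Database.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reticulate_nodes(
    tree_parent: Dict[str, str], loans: List[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each language that has more than one source to its list of sources.

    tree_parent maps a language to its parent in the tree;
    loans is a list of (donor, recipient) borrowing events.
    """
    sources: Dict[str, List[str]] = defaultdict(list)
    for child, parent in tree_parent.items():
        sources[child].append(parent)
    for donor, recipient in loans:
        if donor not in sources[recipient]:
            sources[recipient].append(donor)
    return {lang: srcs for lang, srcs in sources.items() if len(srcs) > 1}


# Hypothetical tree and a single hypothetical borrowing event.
toy_tree = {
    "Proto1": "Root",
    "Proto2": "Root",
    "LangA": "Proto1",
    "LangB": "Proto1",
    "LangC": "Proto2",
}
toy_loans = [("LangC", "LangB")]  # LangB borrows from LangC

print(reticulate_nodes(toy_tree, toy_loans))
```

With no loans the result is empty and the history is a tree; each language reported here corresponds to a reticulation that a network method would be expected to recover.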

Simulated data

I have not previously included simulated data, for two reasons. First, such data can easily be generated anew each time a set is required; and even when this is impractical, there are readily available datasets online (eg. see the compilation at utcs Phylogenetics). Second, and more importantly, simulations are based on a model (eg. using Brownian motion, Ornstein–Uhlenbeck, or Markov chains), and therefore they capture only a subset of reality. Simulations are useful for situations involving a few well-defined variables, but they are much less useful for multivariate data such as occur in phylogenetics.
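To show how little is needed to generate such data anew, here is a minimal sketch that simulates one continuous character along a tree under Brownian motion, where each branch adds a normal increment with variance proportional to its length. The tree, the branch lengths, and the parameter names are hypothetical choices of mine, and the sketch covers only this one model.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical tree: node -> list of (child, branch_length).
Tree = Dict[str, List[Tuple[str, float]]]


def simulate_brownian(
    tree: Tree,
    root: str,
    root_value: float = 0.0,
    sigma: float = 1.0,
    seed: int = 0,
) -> Dict[str, float]:
    """Simulate one continuous character at every node of the tree.

    Each child value is the parent value plus a normal increment whose
    standard deviation is sigma * sqrt(branch_length).
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    values = {root: root_value}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, length in tree.get(node, []):
            values[child] = values[node] + rng.gauss(0.0, sigma * length ** 0.5)
            stack.append(child)
    return values


toy_tree: Tree = {
    "root": [("n1", 1.0), ("t3", 2.0)],
    "n1": [("t1", 1.0), ("t2", 1.0)],
}

print(simulate_brownian(toy_tree, "root", seed=42))
```

Changing the seed gives a fresh dataset from the same model, which is exactly why a stored copy adds little; it also shows how narrow the model is, since everything about the character is determined by two parameters and the tree.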

Nevertheless, I have included one well-known dataset, the Caminalcules (Camin dataset). These data were simulated manually back in the 1960s, and they include morphological features for both extant and fossil organisms. Over the years, the data have been widely used in the teaching of systematics, particularly in the U.S.A. (see "Pasta have no phylogeny, so don't try to give them one"). The data are strictly tree-like, and they do match real datasets in a number of ways (see Sokal 1983). However, there are also known ways in which they differ detectably from real data (see Holman 1986; Wirth 1993).


Holman EW (1986) A taxonomic difference between the Caminalcules and real organisms. Systematic Zoology 35: 259-261.

Sokal RR (1983) A phylogenetic analysis of the Caminalcules. I. The data base. Systematic Zoology 32: 159-184.

Wirth U (1993) Caminalcules and Didaktozoa: imaginary organisms as test-examples for systematics. In: Opitz O, Lausen B, Klar R (eds) Information and Classification: Concepts, Methods and Applications, pp. 421-433. Springer, Berlin.
