Tuesday, April 26, 2016

Phylogeny of a dataset


Phylogenetic methods have been applied to all sorts of research fields, including biology, linguistics, stemmatology and archaeology. There are many posts in this blog discussing examples of these applications, both good and bad.

However, some time ago a paper appeared that tried to apply these methods to datasets themselves, instead:
Andrea K. Thomer, Nicholas M. Weber (2014) The phylogeny of a dataset. In: Andrew Grove (ed.) Proceedings of the 77th ASIS&T Annual Meeting: Connecting Collections, Cultures, and Communities, Volume 51. ASIS&T, Silver Spring, Maryland 20910, USA.
The authors do a creditable job of describing phylogenetics for the uninitiated, but I am not convinced that their empirical application to "digital objects" works particularly well.

They describe their application as follows:
The digital objects under examination are different versions of the International Comprehensive Ocean and Atmosphere dataset (ICOADS).
ICOADS data consist of marine surface measurements and observations (e.g. sea-surface temperature, sea-level pressure, wave swell, wind direction, etc.) that have been digitized from historical ship logs, or taken from floating buoys. As a result of the broad time periods that the dataset covers (approximately 450 years, 1662–2014) the quality and reliability of the data varies considerably.
Much like a piece of software, ICOADS is an evolving dataset with intermittent releases. Version 1.0 – called simply COADS – was publically [sic] released in 1987, and contained almost 100 million historical observations starting in 1854 and continuing to 1979.
Thus, understanding the ways in which ICOADS evolved into new versions, and gave rise to "offspring" datasets over a thirty-year period is the focus of the case study presented below.
The significant properties being used as phylogenetic characters included: Entry Title, Entry ID, Summary, Geographic Coverage, Start Date, End Date, Geographic Resolution, Temporal Resolution, Scientific Keywords (often dataset parameters), Geographic Keywords, Sources (platform of data collection), and Instruments. Once collected, each field was converted into binary codes for "presence" or "absence" of individual keywords.
The problem here is that there is no implication that any of these characters are phylogenetically informative (i.e. inherited), and thus that shared features might represent synapomorphies. In applications to linguistics, stemmatology and archaeology, on the other hand, it is at least likely that shared similarities might represent synapomorphies.
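
To make the coding step concrete, here is a minimal Python sketch of how keyword fields might be turned into presence/absence characters. This is my own illustration, not the authors' code: the function name `binary_characters`, the dataset names and the keywords are all invented, and are not taken from the paper or from the ICOADS metadata.

```python
def binary_characters(records):
    """Code each record's keywords as presence (1) / absence (0) characters."""
    # The sorted union of all keywords gives a stable character order.
    characters = sorted({kw for kws in records.values() for kw in kws})
    matrix = {
        name: [1 if ch in kws else 0 for ch in characters]
        for name, kws in records.items()
    }
    return characters, matrix


# Hypothetical metadata records (names and keywords are invented).
records = {
    "dataset_A": {"sea surface temperature", "ship"},
    "dataset_B": {"sea surface temperature", "buoy", "ship"},
    "dataset_C": {"wind direction", "buoy"},
}

characters, matrix = binary_characters(records)
print(characters)
for name, row in matrix.items():
    print(name, row)
```

Note that nothing in this coding records whether a keyword was copied from a parent dataset or simply assigned independently by a curator, which is exactly why a shared 1 need not be a synapomorphy.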

Given these data, the analyses cluster the datasets based on similarity; indeed, the authors explicitly refer to their tree-based analyses as "clustering algorithms". However, this form of analysis does not necessarily reveal history, because none of the analyses is explicitly historical. Historical patterns will be included in the outcome, but they will not necessarily be separable from patterns resulting from any other source. The resulting groups of datasets therefore may or may not have historical meaning. The authors do, however, have a series of hypotheses (the groups) that can now be subject to scrutiny for possible historical interpretations.
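
The point about similarity can be illustrated with a small Python sketch. I use the Jaccard distance here purely as an assumed example of a similarity measure for binary characters; I am not claiming it is the measure the authors used, and the character matrix is invented.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 character vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero vectors are treated as identical.
    return 1 - shared / either if either else 0.0


def pairwise_distances(matrix):
    """Distance for every unordered pair of rows in the matrix."""
    return {
        (p, q): jaccard_distance(matrix[p], matrix[q])
        for p, q in combinations(sorted(matrix), 2)
    }


# Hypothetical presence/absence matrix (invented values).
matrix = {
    "dataset_A": [0, 1, 1, 0],
    "dataset_B": [1, 1, 1, 0],
    "dataset_C": [1, 0, 0, 1],
}

distances = pairwise_distances(matrix)
closest = min(distances, key=distances.get)
print(distances)
print("most similar pair:", closest)
```

The distances are symmetric and carry no direction: the most similar pair is simply the pair that shares the most keywords. Whether that pair is also related by descent is a separate question that the numbers themselves cannot answer.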

For our purposes it is also worth noting that the authors do recognize one limitation of their analytic approach when applied to datasets:
A purely tree-based phylogenetic approach is also incapable of showing the exchange of traits between different lineages of digital objects, or cases in which several organisms merge into one; thus a reticulating network may be needed in lieu of a bifurcating tree.