Monday, February 1, 2016

Tardigrades and phylogenetic networks

In this blog we have always championed the use of Exploratory Data Analysis prior to phylogenetic analyses. This approach explores the characteristics of the data before making formal inferences about possible evolutionary scenarios. One of the reasons for doing this is the possibility of data errors. That is, we need to distinguish between estimation errors deriving from our experimental procedures and real biological scenarios, because both of these will result in complex patterns in our data.

One possible classification of the potential causes of complex data patterns in phylogenetics is this:

Estimation errors
(i) incorrect data
— inadequate data-collection protocol
— poor laboratory / museum / herbarium technique
— lack of quality control after data collection
— misadventure
(ii) inappropriate sampling
— distant outgroup
— rapid evolutionary rates
— short internal branches
(iii) model mis-specification
— wrong assessment of primary homology
— wrong substitution model
— different optimality criteria

Biological complexity
(iv) analogy
— parallelism
— convergence
— reversal
(v) homology
— deep coalescence
— duplication–loss
— hybridization
— introgression
— recombination
— horizontal gene transfer
— genome fusion

The scientific literature has a number of prime examples where people have asserted a case of biological complexity that has subsequently been questioned, and attributed to estimation errors instead.

For example, many of you will have noted the recent attention given to the release of genome sequences from tardigrades, a group of microscopic animals often claimed to be the animals most resistant to extreme environmental conditions. Two rival papers have appeared:
Thomas C. Boothby et al. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences of the USA 112: 15976–15981.
Georgios Koutsovoulos et al. (2015) The genome of the tardigrade Hypsibius dujardini. BioRxiv preprint 33464. [Now published as: Georgios Koutsovoulos et al. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences of the USA]
The former paper attributes the observed phylogenetic complexity to horizontal gene transfer (group v in the list above), while the latter attributes it to sequencing errors (group i). This situation is discussed in more detail elsewhere on the web, for example:
Rival scientists cast doubt upon recent discovery about invincible animals
How did these indestructible pond critters get their genes?
This difference in the possible cause of complexity matters for the use of phylogenetic networks, because both estimation errors and biological complexity will appear as reticulation patterns in any network. It is especially important when asserting evolutionary scenarios such as horizontal gene transfer, because usually the only evidence for such gene flow is the complexity of the phylogenetic network; that is, there is no independent experimental evidence, and we are relying entirely on the phylogenetic pattern analysis. Estimation errors must thus be eliminated prior to the phylogenetic analysis if we are to produce a high-quality network.
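
As a minimal sketch of why a network cannot, by itself, tell these two causes apart, consider how reticulation arises in a splits-based data-display network: it appears wherever two splits (bipartitions of the taxa) are incompatible, meaning that all four intersections of their sides are non-empty. The taxon labels and splits below are hypothetical, chosen only for illustration, and the function names are my own. The check reports that a conflict exists; it says nothing about whether the conflict comes from contamination or from real gene flow.

```python
from itertools import combinations

def is_compatible(split_a, split_b, taxa):
    """Two splits are compatible if at least one of the four
    intersections of their sides is empty."""
    a1 = frozenset(split_a)
    a2 = frozenset(taxa) - a1
    b1 = frozenset(split_b)
    b2 = frozenset(taxa) - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))

def incompatible_pairs(splits, taxa):
    """Return every pair of splits that cannot fit on the same tree."""
    return [(a, b) for a, b in combinations(splits, 2)
            if not is_compatible(a, b, taxa)]

# Hypothetical taxa and splits; each split is given by one of its sides.
taxa = {"A", "B", "C", "D", "E"}
splits = [
    frozenset({"A", "B"}),       # AB | CDE
    frozenset({"A", "B", "C"}),  # ABC | DE
    frozenset({"B", "C"}),       # BC | ADE, conflicts with AB | CDE
]

conflicts = incompatible_pairs(splits, taxa)
for a, b in conflicts:
    print(sorted(a), "conflicts with", sorted(b))
```

The conflicting pair would be drawn as a box in the network whether taxon B's sequence is contaminated or B genuinely shares genes with both A and C, which is why the source of the incongruence has to be established from evidence outside the network.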

The current situation potentially has unfortunate consequences. For example, there are continual comments, particularly from zoologists, that horizontal gene flow is rare, even though there is a large amount of evidence to the contrary. Situations like the current one can only add fuel to this argument if strong claims of gene flow turn out to be erroneous. There is no quantitative basis for an assertion that gene flow is rare in zoology: those who have looked for reticulate evolution in animals have found it, and those who haven't haven't.

In the end, data-display networks are useful for displaying incongruent data patterns, but the source of the incongruence needs to be identified before these networks are turned into evolutionary networks (either explicitly drawn or verbally implied).


  1. Disclaimer - I'm one of the authors on GD Koutsovoulos et al. (2015) The genome of the tardigrade Hypsibius dujardini. BioRxiv preprint 33464.

    I agree with you that people who look for reticulate evolution will probably find it. Just to clarify, our lab has published often on HGT (eg Wolbachia fragments in nematodes), we are not HGT deniers ;-)

    Tardigrades do have a small (and unremarkable, ~1%) amount of HGT. Boothby et al. simply failed to check for extensive contamination, which is easy to spot using tools like Blobtools and Anvi'o. In fact, the creators of Anvi'o have independently identified complete bacterial genomes in the tardigrade data set, strongly supporting the assertion that the 17% HGT in the Boothby et al. genome is almost entirely an artifact of bacterial contamination (see Delmont TO, Eren AM. (2016) Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies. PeerJ PrePrints 4:e1695v1).

  2. Thanks for updating your blog post with a link to the PNAS paper!