Monday, February 8, 2016

The network of woodpeckers, etc


The world continues on its merry way, searching for fragments of the Tree of Life. That is, research papers continue to be published that give no credence to the possibility of reticulate evolutionary history, especially in zoology.

A recent case in point is this one:
Matthew J. Dufort (2016) An augmented supermatrix phylogeny of the avian family Picidae reveals uncertainty deep in the family tree. Molecular Phylogenetics and Evolution 94: 313–326.
The author constructed a supermatrix of 26 loci for 78 taxa of the bird family containing woodpeckers, piculets, and wrynecks. The author used an array of phylogenetic techniques, including the construction of maximum-likelihood "gene" trees, several different maximum-likelihood species trees, plus time trees. All of these methods presuppose that the evolutionary history of the species was strictly tree-like.

We can use an exploratory data analysis to evaluate how plausible this fundamental assumption is. I constructed a SuperNetwork based on the 26 gene trees produced by the author, using the SplitsTree program, as shown here. The network is basically tree-like, with one major exception.


I have labeled only one taxon, which seems to be the culprit for the major non-tree-likeness. This species appears in 10 of the gene trees, being part of the Leiopicus group (as expected) in 7 of the rooted trees. It is unexpectedly related to the Picus group in two of the other rooted trees, and is close to Dryocopus in the remaining tree.
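
The kind of incongruence described above can be illustrated with a small script. This is only a minimal sketch, not the SuperNetwork method used by SplitsTree: it assumes rooted gene trees written as nested Python tuples, and the taxon names (X, L1, L2, P1, P2) and both toy trees are hypothetical stand-ins rather than data from the paper. It lists the pairs of clades from two trees that cannot both occur in a single tree, after restricting each clade to the taxa the two trees share.

```python
def clusters(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        # A leaf contributes its own name and no internal clades.
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def compatible(a, b):
    """Two clades fit on one rooted tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def conflicts(tree1, tree2):
    """List the pairs of clades (one from each tree) that are incompatible."""
    leaves1, clades1 = clusters(tree1)
    leaves2, clades2 = clusters(tree2)
    shared = leaves1 & leaves2  # gene trees often sample different taxa
    restricted1 = {clade & shared for clade in clades1}
    restricted2 = {clade & shared for clade in clades2}
    return [(a, b) for a in restricted1 for b in restricted2
            if not compatible(a, b)]


# Hypothetical toy trees: taxon X groups with the L taxa in one gene tree
# and with the P taxa in the other.
tree_a = ((("X", "L1"), "L2"), ("P1", "P2"))
tree_b = (("L1", "L2"), (("X", "P1"), "P2"))

for clade_a, clade_b in conflicts(tree_a, tree_b):
    print(sorted(clade_a), "conflicts with", sorted(clade_b))
```

In this toy pair every conflicting pair of clades involves taxon X, which is the signature of a single taxon moving between groups. As with the network itself, the check only locates the incongruence; it says nothing about its cause.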

This network EDA does not, of course, imply the existence of reticulate evolution. It does, however, highlight a pattern of incongruence that requires explanation, if the history of these birds is to be fully elucidated. Reticulate evolution remains one of the possible explanations, pending further investigation.

Monday, February 1, 2016

Tardigrades and phylogenetic networks


In this blog we have always championed the use of Exploratory Data Analysis prior to phylogenetic analyses. This approach explores the characteristics of the data before making formal inferences about possible evolutionary scenarios. One of the reasons for doing this is the possibility of data errors. That is, we need to distinguish between estimation errors deriving from our experimental procedures and real biological scenarios, because both of these will result in complex patterns in our data.

One possible classification of the potential causes of complex data patterns in phylogenetics is this:

Estimation errors
(i) incorrect data
— inadequate data-collection protocol
— poor laboratory / museum / herbarium technique
— lack of quality control after data collection
— misadventure
(ii) inappropriate sampling
— distant outgroup
— rapid evolutionary rates
— short internal branches
(iii) model mis-specification
— wrong assessment of primary homology
— wrong substitution model
— different optimality criteria

Biological complexity
(iv) analogy
— parallelism
— convergence
— reversal
(v) homology
— deep coalescence
— duplication–loss
— hybridization
— introgression
— recombination
— horizontal gene transfer
— genome fusion

The scientific literature has a number of prime examples where people have asserted a case of biological complexity that has subsequently been questioned, and attributed to estimation errors instead.

For example, many of you will have noted the recent attention given to the release of various genome sequences from the Tardigrades, a group of microscopic animals often said to be the world's most resistant to extreme environmental conditions. Two rival papers have appeared:
Thomas C. Boothby et al. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences of the USA 112: 15976–15981.
Georgios Koutsovoulos et al. (2015) The genome of the tardigrade Hypsibius dujardini. BioRxiv preprint 33464.
The former paper attributes the observed phylogenetic complexity to horizontal gene transfer (group v in the list above) while the latter attributes it to sequencing errors (group i). This situation is discussed in more detail elsewhere on the web, for example:
Rival scientists cast doubt upon recent discovery about invincible animals
How did these indestructible pond critters get their genes?
This difference in possible cause (of complexity) matters particularly for the use of phylogenetic networks, because both estimation errors and biological complexity will appear as reticulation patterns in any network. It is especially important for the assertion of evolutionary scenarios such as horizontal gene transfer, because usually the only evidence for any such gene flow is the complexity of the phylogenetic network. That is, there is no independent experimental evidence, and we are relying entirely on the phylogenetic pattern analysis. Estimation errors must thus be eliminated prior to the phylogenetic analysis, if we are to produce a high-quality network.

The current situation potentially has unfortunate consequences. For example, there are continual comments that horizontal gene flow is rare, particularly from zoologists, even though there is a large amount of evidence to the contrary. Situations like the current one can only add fuel to this argument, if strong claims of gene flow turn out to be erroneous. There is no quantitative basis for an assertion that gene flow is rare in zoology — those who have looked for reticulate evolution in animals have found it, and those who haven't haven't.

In the end, data-display networks are useful for displaying incongruent data patterns, but the source of the incongruence needs to be identified before these networks are turned into evolutionary networks (either explicitly drawn or verbally implied).