Wednesday, October 14, 2015

Problems with manually constructing networks


I wrote recently about whether explicit network methods are currently used in practice to construct evolutionary networks (Are networks actually used to explore reticulate histories?), and noted that they usually are not. Here I explore in a bit more detail another example, and point out a couple of limitations of constructing such networks manually.

Earlier this year a paper was published exploring the Anopheles gambiae species complex, this group of mosquitoes being the principal vector of the malaria parasite:
Fontaine MC, et al. (2015) Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347: 1258524.
There are about 450 known species of anopheline mosquitoes, which transmit five species of malaria parasite to humans, and many other malaria species to most other vertebrates. The genomes of the six Anopheles species studied by Fontaine et al. were included as part of a genome study published simultaneously, which also included other Anopheles species:
Neafsey DE, et al. (2015) Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science 347: 1258522.

Both groups of researchers constructed a phylogenetic tree of their organisms, but Fontaine et al. then added reticulations to their tree (thus manually forming an evolutionary network). The reticulations represent putative introgression among members of the An. gambiae species complex, many of which have overlapping distributions within sub-Saharan Africa.

Fontaine et al. constructed their network by trying to take into account incomplete lineage sorting (which Neafsey et al. apparently did not — they left the An. gambiae species complex as an unresolved polychotomy). This is all well and good, and it matches the current paradigm in the literature where hybridization / introgression (a process involving horizontal gene flow that creates gene-tree discordance) is studied in association with ILS (a process involving vertical inheritance but which also creates gene-tree discordance). The alternative paradigm is that lateral gene transfer (a process involving horizontal gene flow that creates gene-tree discordance) is studied in association with gene duplication–loss (a process involving vertical inheritance but which also creates gene-tree discordance).

However, this might not be the best strategy in this particular case. In the companion paper by Neafsey et al., they note that for their 16 genomes:
Copy-number variation in homologous gene families also reveals striking evolutionary dynamism. Analysis of 11,636 gene families ... indicates a rate of gene gain / loss higher by a factor of at least 5 than that observed for 12 Drosophila genomes.
Under these circumstances, why ignore the possibility that gene duplication and selective loss have created gene-tree discordance? This possibility is not even mentioned by Fontaine et al. Also not mentioned are other possible sources of gene-tree discordance that are associated with vertical inheritance (e.g. balancing selection), although the authors do at one stage concern themselves with the possibility of unequal rates of evolution among the chromosomes.

Their data-analysis strategy was this:
To infer the correct species branching order in the face of anticipated ILS and introgression, maximum-likelihood (ML) phylogenies were constructed from 50-kilobase (kb) non-overlapping windows across the alignments (referred to here as "gene trees" regardless of their protein-coding content), considering six in-group species rooted alternatively with An. christyi or An. epiroticus (n = 4063 windows).
They found a total of 85 different gene-tree topologies, some of them occurring much more frequently than others. They plotted these onto the four autosomal chromosome arms (2L, 2R, 3L, 3R) plus the X chromosome, and found that the X chromosome favoured very different gene trees from those favoured by the autosomes.
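The tallying step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' pipeline: the function names `canonical` and `tally`, the three-taxon labels, and the window data are all made up for the example, and the Newick strings are assumed to be rooted topologies with no branch lengths or whitespace.

```python
from collections import Counter, defaultdict


def canonical(newick: str) -> str:
    """Return a canonical string for a rooted topology.

    Assumes branch-length-free Newick with no whitespace. Children of each
    node are sorted, so "((gam,col),ara);" and "(ara,(col,gam));" compare equal.
    """
    s = newick.strip().rstrip(";")
    pos = 0

    def parse() -> str:
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # skip "("
            children = [parse()]
            while s[pos] == ",":
                pos += 1  # skip ","
                children.append(parse())
            pos += 1  # skip ")"
            return "(" + ",".join(sorted(children)) + ")"
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    return parse()


def tally(windows):
    """Count canonical topologies per chromosome from (chromosome, newick) pairs."""
    counts = defaultdict(Counter)
    for chrom, newick in windows:
        counts[chrom][canonical(newick)] += 1
    return counts


# Hypothetical window trees, invented for illustration (not data from the paper).
windows = [
    ("X", "((gam,col),ara);"),
    ("X", "(ara,(col,gam));"),
    ("2L", "((gam,ara),col);"),
    ("2L", "((ara,gam),col);"),
    ("2L", "((gam,col),ara);"),
]

counts = tally(windows)
for chrom in sorted(counts):
    # Report the most frequent topology on each chromosome.
    print(chrom, counts[chrom].most_common(1))
```

Note that a tally like this only shows which topologies dominate where. It says nothing about which biological process produced the discordance, which is the point at issue in the rest of this post.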

From this analysis, the authors constructed a phylogenetic network (shown in the next figure) based on a species tree (black lines) with reticulations added (green arrows) to indicate introgression. I have added two labels ("Vertical" and "Horizontal") to emphasize the authors' interpretation of the evolutionary flow of genetic information, separated into vertical inheritance and horizontal gene flow (introgression).


The authors interpret the horizontal gene flow as being introgression because:
Autosomal introgression between An. arabiensis and the ancestor of An. gambiae [gam] + An. coluzzii [col] has long been postulated and could explain the strong discordance between the dominant tree topologies of the X and autosomes.
The idea of the introgression being autosomal seems to be based on the idea that the "true species tree" is the one shown by the genes that mediate male and female fertility (i.e. the sex chromosomes).

The authors note that, for a "definitive interpretation of these conflicting signals" between the gene trees, they need to have "the correct species branching order". I have raised a number of times in this blog the difficulty of constructing a "species tree" in the face of reticulation. If there is evidence for horizontal gene flow in the data then how do we first extract just the vertical inheritance? The authors attempted to address this question in a section entitled "Tree height reveals the true species branching order in the face of introgression". Their argument is this:
To infer the correct historical branching order, we applied a strategy based on sequence divergence ... Because introgression will reduce sequence divergence between the species exchanging genes, we expect that the correct species branching order revealed by gene trees constructed from non-introgressed sequences will show deeper divergences than those constructed from introgressed sequences. If the hypothesis of autosomal introgression is correct, this implies that the topologies supported by the X chromosome should show significantly higher divergence times ... than topologies supported by the autosomes.
This, indeed, was what they found; and so they concluded that the X chromosome topology represents the species tree, and the autosomes are showing introgression. However, this seems to be a somewhat specious argument. Maybe introgression does lower tree height, but I don't think that we should conclude from this that lowered tree height indicates introgression. We cannot simply invert this argument (i.e. A causes B, and therefore B implies A), because there may be other differences between the autosomes and the X chromosome that also affect relative tree height, such as unequal gene duplication–loss, convergence, unequal evolutionary rates, balancing selection, and so on.
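To make the logical point concrete, here is a toy calculation in Python. It assumes a strict molecular clock, under which the expected pairwise divergence is twice the substitution rate multiplied by the time since the split. The function name and all of the numbers are arbitrary choices for illustration, and nothing here is an estimate for the Anopheles genomes.

```python
def expected_divergence(rate: float, split_time: float) -> float:
    """Expected pairwise divergence under a strict clock.

    Two lineages each accumulate changes at `rate` for `split_time`,
    hence the factor of 2. Units are arbitrary.
    """
    return 2.0 * rate * split_time


# Reference case: an X-linked window with an old split and a baseline rate.
x_chrom = expected_divergence(rate=1.0, split_time=1.0)

# Scenario 1: an autosomal window whose coalescence is recent because of
# introgression, evolving at the same rate as the X.
auto_introgressed = expected_divergence(rate=1.0, split_time=0.5)

# Scenario 2: an autosomal window with NO introgression (same old split),
# but evolving at half the rate.
auto_slow = expected_divergence(rate=0.5, split_time=1.0)

print(x_chrom, auto_introgressed, auto_slow)
```

Both autosomal scenarios give the same divergence, and both are shallower than the X. So a shallower tree, taken on its own, cannot distinguish introgression from a difference in rate; some independent information is needed to separate them.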

Therefore, we should not be surprised if the authors have got it wrong about whether it is the X chromosome or the autosomes that show the "true species tree" (if there is one). That is, the edge labelled "Vertical" in the above network may actually represent the horizontal gene flow, while the edge labelled "Horizontal" may actually represent the vertical inheritance.

Finally, there is a published commentary on the two Anopheles papers:
Clark AG, Messer PW (2015) Conundrum of jumbled mosquito genomes. Science 347: 27-28.
These authors appropriately note that:
Fontaine et al. adhere to a classical view that there is a "true species tree" ... But given that the bulk of the genome has a network of relationships that is different from this true species tree, perhaps we should dispense with the tree and acknowledge that these genomes are best described by a network, and that they undergo rampant reticulate evolution.
This alternative philosophy requires an integrated method for constructing the network, rather than manually constructing a species tree and then adding reticulations. Such a method would construct a network from first principles, and then reveal whether the species phylogeny is tree-like or not, rather than assuming a priori that it is a tree. There are a number of methods being developed for doing this.
