Showing posts with label Evolutionary network. Show all posts

Wednesday, September 23, 2015

Uses of MUL-trees for evolutionary networks


Creating evolutionary phylogenetic networks is currently a somewhat ad hoc procedure, with a number of competing strategies based on various models of how gene flow occurs.

One possibility is to use multi-labeled trees. Here, multiple gene trees can be represented by a single multi-labeled tree (a MUL-tree), which in turn can also be represented as a reticulating network. A MUL-tree has leaves that are not uniquely labeled by a set of species (ie. each species can appear more than once). This means that multiple gene trees can be represented by a single MUL-tree, with different combinations of the leaf labels representing different gene trees.
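As a rough illustration of that last point, here is a minimal Python sketch. The nested-tuple encoding, the toy tree and the function names are assumptions made for illustration, not the format used by any particular program. It enumerates the single-labeled trees obtained by keeping exactly one copy of each repeated leaf label and pruning the others.

```python
from itertools import product

# A toy MUL-tree as nested tuples; leaves are species names and "B" appears twice.
MUL_TREE = ((("A", "B"), "C"), ("B", "D"))

def leaves(tree, path=()):
    """Yield (path, label) for every leaf; path is the tuple of child indices."""
    if isinstance(tree, str):
        yield path, tree
    else:
        for i, child in enumerate(tree):
            yield from leaves(child, path + (i,))

def prune(tree, keep, path=()):
    """Restrict the tree to the leaf paths in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if path in keep else None
    kids = [prune(child, keep, path + (i,)) for i, child in enumerate(tree)]
    kids = [k for k in kids if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def single_labeled_trees(tree):
    """Yield one tree per way of choosing a single copy of each label."""
    by_label = {}
    for path, label in leaves(tree):
        by_label.setdefault(label, []).append(path)
    labels = sorted(by_label)
    for choice in product(*(by_label[label] for label in labels)):
        yield prune(tree, set(choice))

for t in single_labeled_trees(MUL_TREE):
    print(t)
```

With two copies of "B", the toy MUL-tree yields two different single-labeled trees, one for each placement of "B".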

The most obvious uses of a MUL-tree are where there are multiple copies of genes within an organism, as each gene copy can be represented independently in the MUL-tree. This will apply when there has been gene duplication, for example, or when there has been polyploidy (ie. multiple copies of the entire genome). Computer programs such as PADRE or MulRF can then be used to derive an optimal single-labeled species network from the MUL-tree.

However, this same strategy can also be used whenever there is conflict among gene trees. In this scenario, the conflicting genes are treated as different leaves in the MUL-tree. One labeled leaf would have the data for the first gene, with the second gene entered as missing data, and the second leaf would then have the inverse situation (the data for gene one are missing and those for gene two are present).
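The duplication step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `duplicate_taxon` and the toy sequences are invented here, and are not taken from any of the cited papers or programs.

```python
def duplicate_taxon(matrix, taxon, groups, missing="?"):
    """Split `taxon` into one row per group of partitions.

    matrix : {taxon name: {partition name: aligned sequence}}
    groups : {label suffix: set of partitions whose data that copy keeps}
    Partitions outside a copy's group are filled with the missing-data symbol.
    """
    out = {}
    for name, seqs in matrix.items():
        if name != taxon:
            out[name] = dict(seqs)  # taxa without conflict are copied unchanged
            continue
        for suffix, kept in groups.items():
            out[f"{name} {suffix}"] = {
                part: (seq if part in kept else missing * len(seq))
                for part, seq in seqs.items()
            }
    return out

# Toy alignment (invented sequences, not real data).
toy = {
    "species X": {"plastid": "ACGTAC", "ITS": "GGCA"},
    "E. lusitanica": {"plastid": "ACGTTC", "ITS": "GGTA"},
}
dup = duplicate_taxon(toy, "E. lusitanica", {"CP": {"plastid"}, "ITS": {"ITS"}})
for name, seqs in dup.items():
    print(name, seqs)
```

Each copy of the conflicting taxon keeps the data for one gene and has only missing-data symbols for the other.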

This can be illustrated by a recent example of the Erica (heather plants) genus, from Mugrabi de Kuppler et al. (2015). The authors were interested in whether the observed gene tree conflict in Erica lusitanica could be the result of hybridisation between morphologically dissimilar species, as this has previously been suggested.

They collected sequence data for a number of plastid regions as well as the nuclear ribosomal ITS region. The observed conflict was between the plastid (chloroplast) and nuclear sequences. They note:
A targeted supermatrix strategy was employed, whereby more variable ITS and trnL-trnF spacer sequences were obtained for most samples, and the other, mostly less variable chloroplast markers were added for selected taxa in order to improve resolution of deeper nodes in the chloroplast tree. 
Where gene tree conflict was identified, the taxa with conflicting phylogenetic signals were duplicated in a combined matrix following the approach of Pirie et al. (2008, 2009) in order to infer a single multi-labelled "taxon duplication" tree. [This occurred for only one species. Thus, one leaf label for E. lusitanica has the data only for the chloroplast sequences, and the other leaf has the data only for the nuclear sequence.]


The figure shows the result of the coalescent BEAST analysis of the multi-labeled data, with E. lusitanica appearing twice in the MUL-tree. Inset is the resulting single-labeled network, with E. lusitanica appearing once, as a reticulation.

This is an interesting application of MUL-trees. However, there are two issues that I wish to highlight about the procedure.

First, the reticulation as shown in the example is not actually time-consistent, given that the horizontal axis of the MUL-tree is scaled to time. This could, for example, be resolved by having "E. lusitanica CP" attached to a ghost lineage.
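One common way to formalize time consistency is to require that both parental lineages of a reticulation exist at the moment of the event, so that the two reticulation edges leave their parental branches at the same time. The following is a minimal sketch of that check, assuming each reticulation is described simply by the ages at which its edges attach. The data structure, the names and the ages are hypothetical, and are not values from the paper.

```python
def inconsistent_reticulations(attach_ages, tol=1e-9):
    """Return the hybrids whose reticulation edges attach at different times.

    attach_ages : {hybrid name: list of ages (time before present) at which
                   each of its parental edges leaves the parental branch}
    """
    return sorted(
        hybrid
        for hybrid, ages in attach_ages.items()
        if max(ages) - min(ages) > tol
    )

# Hypothetical attachment ages, in millions of years.
example = {"hybrid 1": [4.0, 4.0], "hybrid 2": [9.5, 3.2]}
print(inconsistent_reticulations(example))
```

A flagged hybrid can be made consistent by letting the older attachment lead to an unsampled (ghost) lineage that persists until the time of the younger attachment.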

Second, the data matrix from which the MUL-tree is created will have a non-random distribution of missing data, by definition. This non-randomness is known to have a bad effect on likelihood analyses (Simmons 2012). In the example, the non-randomness is exacerbated by further non-randomness in the acquisition of the plastid sequences. So, if this form of MUL-tree analysis is to be pursued then maybe this potential limitation should be investigated.

References

Mugrabi de Kuppler AL, Fagúndez J, Bellstedt DU, Oliver EGH, Léon J, Pirie MD (2015) Testing reticulate versus coalescent origins of Erica lusitanica using a species phylogeny of the northern heathers (Ericeae, Ericaceae). Molecular Phylogenetics and Evolution 88: 121-131.

Pirie MD, Humphreys AM, Galley C, Barker NP, Verboom GA, Orlovich D, Draffin SJ, Lloyd K, Baeza CM, Negritto M, Ruiz E, Cota Sanchez JH, Reimer E, Linder HP (2008) A novel supermatrix approach improves resolution of phylogenetic relationships in a comprehensive sample of danthonioid grasses. Molecular Phylogenetics and Evolution 48: 1106-1119.

Pirie MD, Humphreys AM, Barker NP, Linder HP (2009) Reticulation, data combination, and inferring evolutionary history: an example from Danthonioideae (Poaceae). Systematic Biology 58: 612-628.

Simmons MP (2012) Radical instability and spurious branch support by likelihood when applied to matrices with non-random distributions of missing data. Molecular Phylogenetics and Evolution 62: 472-484.

Monday, May 4, 2015

A geek network


I have noted before that many of the diagrams on the web purporting to show "evolution" actually show transformational evolution rather than variational evolution, as is done in biology and the historical social sciences (eg. Non-phylogenetic trees; Evolution and timelines; The evolutionary March of Progress in popular culture).

This diagram seems to be an improvement, however. Perhaps its geekiness is responsible for this?


This is an evolutionary network because it is rooted, at "Geekus Prime". You will note that it is a population network rather than strictly a phylogenetic network. That is, many of the internal nodes are labeled with extant taxa, so that both ancestors and their descendants appear. It is a network rather than a tree, because the "World of Warcraft Geek" is a hybrid between the "Dungeons and Dragons Geek" and the ancestor of the "Video Game Geek".

Wednesday, December 17, 2014

Current methods for evolutionary networks


It has been noted before that we have a wide range of mathematical techniques available for producing data-display networks, most notably the many variants of splits graphs (see Huson & Scornavacca 2011). For example, NeighborNets and Consensus networks are commonly encountered in the phylogenetics literature, and Reduced median networks and Median-joining networks are commonly used for haplotype networks in population biology.

However, there are few techniques used to produce evolutionary networks. Studies of reticulate evolutionary histories, which include recombination networks, hybridization networks, introgression networks and HGT networks, have no unifying theme as yet. So, the biological literature has many papers in which biologists struggle with reticulate evolutionary histories using ad hoc collections of techniques, which often boil down to simply presenting incongruent phylogenetic trees from different datasets (see Morrison 2014a).

So, maybe a brief look at the current state of play with evolutionary networks would be useful. There are enough worthwhile techniques out there for people to be using them more often than they are.

Assumptions

Almost all current phylogenetic methods assume that the basic building unit is a non-recombining sequence block, for which the evolutionary history is strictly tree-like. We tend to call these blocks "genes" and their history "gene trees", but this is just for semantic convenience. In practice, we first collect data for various loci, and we then simply make the assumption that there is recombination between the loci but not within them. This is basically the assumption of independence between loci. At the limit, each nucleotide along a chromosome has a tree-like history, but for aggregations of nucleotides it is all assumptions.

Furthermore, we assume that there are no data errors that will confound any reconstruction of the phylogenetic trees. Possible sources of error include: incorrect data (e.g. contamination), inappropriate sampling (taxa or characters), and model mis-specification. Any of these errors will lead to stochastic variation at best and to bias at worst.

Gene-tree incongruence

Reticulate evolutionary processes lead to gene trees that are not all congruent. However, there are two other processes that have been widely recognized as also producing gene-tree incongruence, but which do not involve reticulation in the strict sense: incomplete lineage sorting (deep coalescence; ancestral polymorphism), and gene duplication-loss.

Many studies have now shown that stochastic variation due to ILS can be very large (see Degnan & Rosenberg 2009), and that this varies in relation to both the population sizes of the taxa and the times between divergence events. The expectation of completely congruent gene trees is thus very naive, even when the evolutionary history of the taxa has been strictly tree-like. A number of methods have been developed to reconstruct species trees in the face of ILS (Nakhleh 2013).
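For the simplest case, this dependence can be written down exactly. For a rooted three-taxon species tree ((A,B),C), the standard multispecies coalescent result is that a gene tree matches the species tree with probability 1 - (2/3)exp(-T), where T is the length of the internal branch in coalescent units (generations divided by 2N for a diploid population of effective size N). Here is a small sketch, with the function and parameter names chosen purely for illustration:

```python
import math

def three_taxon_probs(generations, n_e):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    generations : length of the internal branch, in generations
    n_e         : effective (diploid) population size on that branch
    """
    t = generations / (2.0 * n_e)            # branch length in coalescent units
    discordant_each = math.exp(-t) / 3.0     # each of the two discordant topologies
    return {
        "concordant": 1.0 - 2.0 * discordant_each,
        "discordant_each": discordant_each,
    }

# Same divergence interval, increasing population size.
for n_e in (1_000, 10_000, 100_000):
    print(n_e, three_taxon_probs(20_000, n_e))
```

Short internal branches or large populations push the concordant probability down towards one third, at which point all three topologies are equally likely.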

DL involves gene duplication (which can be repeated to create gene families) followed by selective gene loss. The phylogenetic history of the genes is usually presented as an unfolded species tree, where each gene copy has its own part of the tree. A number of methods have been developed to reconstruct gene DL histories given a "known" species tree, which is called gene-tree reconciliation (Szöllősi et al 2015). However, our interest here is in the reverse process, in which reconstructed but incongruent gene trees are combined into a single species tree, given a model of duplication and selective loss, which is called species-tree inference (which is the same as cophylogeny reconstruction; Drinkwater & Charleston 2014).

Reticulations

Known biological processes such as recombination, reassortment, hybridization, introgression and horizontal gene transfer all create reticulate phylogenetic histories. However, it is a moot point as to whether these processes can be distinguished from each other solely in the context of an evolutionary network (Holder et al 2001; Morrison 2015). These evolutionary processes operate by distinct biological mechanisms, but the evolutionary patterns that they create can all be rather similar. The processes all result in gene flow among contemporaneous organisms (usually called horizontal flow or transfer), whereas other evolutionary processes involve gene flow from parent to offspring (usually called vertical inheritance), including ILS and DL. These gene flows create incongruent gene histories, which we may detect directly in the data or via reconstructed gene trees. The patterns of incongruence do not necessarily allow us to infer the causal process.

There are a number of differences in pattern, but the consistency of these is doubtful. Polyploid hybridization produces the most distinctive pattern, because there is duplication of the genome in the hybrid. However, subsequent aneuploidy will serve to obscure this pattern. Homoploid hybridization nominally involves 50% of the genome coming from different sources, while introgression ultimately involves a smaller percentage. However, in practice, genome mixtures vary continuously from 0 to 50%. HGT also involves a small percentage of the genome, but in theory it also can vary from 0 to 50%. Reassortment produces mixtures of viral genes, which can occur in such a great number that reconstructing the history is severely problematic.

So, in the absence of independent experimental evidence, distinguishing one form of evolutionary network from another is almost a matter of definition. This has become increasingly obvious in the methodological literature, where semantic confusion abounds.

For example, a network produced directly from a set of characters has usually been called a "recombination network", while one produced from a set of trees has usually been called a "hybridization network", irrespective of what processes the gene trees represent. Furthermore, models that add reticulation events to DL trees have usually referred to the horizontal gene flow as "HGT", whereas models that add reticulation events to ILS trees have usually referred to the horizontal gene flow as "hybridization" (Morrison 2014a). Studies of horizontal gene flow during human evolution have usually referred to "admixture", which is a more process-neutral term.

In many, if not most, cases we might all be better off if network methods simply distinguish gene flow among contemporaries (horizontal) from gene inheritance between generations (vertical), rather than trying to infer a process; process inference can often best take place after network construction. This does not help anthropologists, of course, who are dealing with evolutionary networks where oblique gene flow is possible (so that they do not have the problem of time inconsistency in evolutionary networks).

Methods

There seems to be a dichotomy of purposes in current method development, which is neatly summarized by the contrasting theoretical views of Mindell (2013) and Morrison (2014b). These views each recognize that evolutionary history involves both vertical and horizontal processes, but they reconstruct the resulting evolutionary patterns as a species tree and a species network, respectively. Obviously, this blog is dedicated to the latter point of view, but it is the former one (the so-called Tree of Life) that currently seems to dominate the literature.

Focussing on gene-tree inference, Szöllősi et al (2015) provide a comprehensive review of the various models that have been used to describe the dependence between gene trees and species trees. Essentially, gene trees are contained within the species tree, and they may differ from it in relative branch lengths and/or topology. The differences between genes and species are the result of population-level processes, often modeled using the coalescent. These authors recognize four current classes of probabilistic model that combine different evolutionary processes:
  • the DLCoal model, which combines coalescence and DL
  • the DTLSR model and the ODT model, both of which combine gene transfer and DL
  • models that combine hybridization and ILS
  • models of allopolyploidization.
When inferring species trees from gene trees (species-tree inference), we basically combine the scores for all of the gene trees, and then search for the species tree with the best overall score. This involves adding the scores in parsimony analyses, or multiplying the conditional probabilities in likelihood analyses (ie. maximum-likelihood or bayesian context). Many methods have been developed for inferring a species tree based on multi-locus data. These differ in whether the gene and species trees are estimated simultaneously or sequentially, and in how the gene trees are used to infer the species tree. Nakhleh (2013) and Szöllősi et al (2015) discuss both parsimony and likelihood methods for species-tree inference based on either ILS or DL models.
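The score-combining step can be sketched as follows. This is a toy illustration only: the per-gene numbers are invented, the candidate species trees are just labels, and the function names are assumptions rather than part of any published method. Summing logarithms is equivalent to multiplying the probabilities, but avoids numerical underflow when there are many loci.

```python
import math

def combined_parsimony(costs):
    """Parsimony: add the per-gene reconciliation costs."""
    return sum(costs)

def combined_log_likelihood(probs):
    """Likelihood: multiply per-gene probabilities, done as a sum of logs."""
    return sum(math.log(p) for p in probs)

def best_tree(candidates, combine, pick):
    """Return the candidate whose combined score is best (pick is min or max)."""
    return pick(candidates, key=lambda name: combine(candidates[name]))

# Invented per-gene scores for two candidate species trees and three genes.
costs = {"tree 1": [0, 2, 1], "tree 2": [1, 1, 3]}
probs = {"tree 1": [0.5, 0.1, 0.2], "tree 2": [0.3, 0.3, 0.3]}

print(best_tree(costs, combined_parsimony, min))       # lowest total cost
print(best_tree(probs, combined_log_likelihood, max))  # highest likelihood
```

In this toy case the two criteria prefer different trees, which is a reminder that the choice of optimality criterion matters.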

Extending these ideas to infer networks (rather than species trees) is a bit more tricky, and most of the work to date has involved combining hybridization and ILS. There has been no recent summary of the ideas. However, calculating the parsimony score of a network, given a set of gene-tree topologies, has been addressed by Yu et al (2011); and Yu et al (2013a) have extended these ideas to heuristically search the network space for the optimal network (the one that minimizes the number of extra reticulation lineages in a species tree). Furthermore, methods for computing the likelihood of a phylogenetic network, given a set of gene-tree topologies, have been devised by Yu et al (2012, 2013b); and Yu et al (2014) have extended these ideas to heuristically search for the maximum-likelihood network for limited cases of introgression or hybridization (since they differ only in degree).

There are also several methods that simply use gene-tree incongruence to infer reticulation events in a species network (Huson et al 2010). Basically, these methods combine gene trees into "hybridization networks" by minimizing the number of reticulations required for reconciliation, measured either by counting the reticulations or calculating the network level. The combinatorial optimization can be based on trees, triplets or clusters, using parsimony as the optimality criterion. These methods model homoploid hybridization by assuming that reticulation is the sole cause of all gene-tree incongruence. This means that they are likely to overestimate the amount of reticulation in a dataset when other processes are co-occurring.
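The quantity being minimized is easy to compute once a candidate network is in hand. Here is a minimal sketch, assuming the network is given as a list of directed (parent, child) edges, with names invented for illustration. It counts reticulations as the number of extra parents, summed over all nodes that have more than one parent. It does not compute the network level, which requires finding biconnected components.

```python
def reticulation_number(edges):
    """Sum of (in-degree - 1) over all nodes with more than one parent."""
    indegree = {}
    for parent, child in edges:
        indegree[child] = indegree.get(child, 0) + 1
    return sum(d - 1 for d in indegree.values() if d > 1)

# A toy tree, and a toy network in which leaf H has two parents.
tree = [("root", "x"), ("root", "C"), ("x", "A"), ("x", "B")]
network = [
    ("root", "x"), ("root", "y"),
    ("x", "A"), ("x", "H"),
    ("y", "H"), ("y", "C"),
]

print(reticulation_number(tree), reticulation_number(network))
```

A search procedure would then prefer, among the networks that display all of the input trees, the one with the smallest such count.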

The most completely developed network methods involve data for allopolyploid hybrids. Here, there are multiple copies of each gene, one in each copy of the genome, so that allopolyploid hybrids have more copies than do their diploid parent taxa. To construct a hybridization network topology, Huber et al (2006) developed a parsimony method based on first estimating a multi-labeled gene tree, and then searching for the single-labeled network that best accommodates the multiple gene patterns. The model has been extended to heuristically include ILS (Marcussen et al 2012), as well as dates for the internal nodes (Marcussen et al 2015). Jones et al (2013) have also developed models that incorporate ILS in a bayesian context, but only for the case of a single hybridization event between two diploid species (an allotetraploid).

Species-tree inference for a pair of gene phylogenies that may be networks rather than trees has been considered in terms of parsimony by Drinkwater & Charleston (2014).

This brings us to the matter of introgression. The massive recent influx of genome-scale data for hominids has led to the development of methods explicitly for the analysis of what is termed admixture among the lineages. These methods basically work by constructing a phylogenetic tree that includes admixture events, the topology inference being based on allele frequencies. There has been no formal comparison of the methods, and not much application to non-humans. Three such methods have been produced so far (Patterson et al 2012; Pickrell & Pritchard 2012; Lipson et al 2013).
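To give a flavour of what an allele-frequency approach involves, one of the summaries used in this line of work is the f3 statistic of Patterson et al (2012), which is the average over loci of (c - a)(c - b) for allele frequencies a, b and c in populations A, B and C. A clearly negative value suggests that C is a mixture of populations related to A and B. The sketch below uses invented frequencies, and it omits the finite-sample bias correction and the significance testing that are used in practice.

```python
def f3(c, a, b):
    """Naive f3(C; A, B): mean over loci of (c - a) * (c - b)."""
    if not (len(a) == len(b) == len(c)) or not c:
        raise ValueError("need equal-length, non-empty frequency lists")
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)

# Invented allele frequencies at two loci.
pop_a = [0.1, 0.9]
pop_b = [0.9, 0.1]
pop_c = [0.5, 0.5]   # intermediate between A and B at every locus

print(f3(pop_c, pop_a, pop_b))
```

Here the target population is intermediate between the two sources at both loci, so the statistic comes out negative.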

Recombination has been somewhat the poor cousin to other causes of reticulation, as most network methods assume it to be absent. Nevertheless, Gusfield (2014) has recently provided an ample survey of the study methods available to date.

References

Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution 24: 332-340.

Drinkwater B, Charleston MA (2014) An improved node mapping algorithm for the cophylogeny reconstruction problem. Coevolution 2: 1-17.

Gusfield D (2014) ReCombinatorics: the Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Cambridge.

Holder MT, Anderson JA, Holloway AK (2001) Difficulties in detecting hybridization. Systematic Biology 50: 978-982.

Huber KT, Oxelman B, Lott M, Moulton V (2006) Reconstructing the evolutionary history of polyploids from multilabeled trees. Molecular Biology & Evolution 23: 1784-1791.

Huson D, Rupp R, Scornavacca C (2010) Phylogenetic Networks: Concepts, Algorithms, and Applications. Cambridge University Press, Cambridge.

Huson DH, Scornavacca C (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biology & Evolution 3: 23-35.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Lipson M, Loh P-R, Levin A, Reich D, Patterson N, Berger B (2013) Efficient moment-based inference of population admixture parameters and sources of gene flow. Molecular Biology & Evolution 30: 1788-1802.

Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS (2015) From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Systematic Biology 64: 84-101.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid north American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Morrison DA (2014a) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

Morrison DA (2014b) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

Morrison DA (2015, in press) Pattern recognition in phylogenetics: trees and networks. In: Elloumi M, Iliopoulos CS, Wang JTL, Zomaya AY (eds) Pattern Recognition in Computational Molecular Biology: Techniques and Approaches. Wiley, New York.

Nakhleh L (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology & Evolution 28: 719-728.

Patterson NJ, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D (2012) Ancient admixture in human history. Genetics 192: 1065-1093.

Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics 8: e1002967.

Szöllősi GJ, Tannier E, Daubin V, Boussau B (2015) The inference of gene trees with species trees. Systematic Biology 64: e42-e62.

Yu Y, Barnett RM, Nakhleh L (2013a) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Degnan JH, Nakhleh L (2012) The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics 8: e1002660.

Yu Y, Dong J, Liu KJ, Nakhleh L (2014) Maximum likelihood inference of reticulate evolutionary histories. Proceedings of the National Academy of Sciences of the USA 111: 16448-16453.

Yu Y, Ristic N, Nakhleh L (2013b) Fast algorithms and heuristics for phylogenomics under ILS and hybridization. BMC Bioinformatics 14: S6.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.

Wednesday, May 28, 2014

Phylogenetic networks and "evolutionary networks"


Complex networks are found in all parts of biology, graphically representing biological patterns and, if they are directed networks, also their causal processes. Directed networks are currently used to model various aspects of biological systems, such as gene regulation, protein interactions, metabolic pathways, ecological interactions, and evolutionary histories.

Two types of networks can be distinguished, and this distinction seems to me to be very important. Most networks are what might be called observed networks, in the sense that the nodes and edges represent empirical observations. For example, a food web consists of nodes representing animals with connecting edges representing who eats whom. Similarly, in a gene regulation network the genes (nodes) are connected by edges showing which genes affect the functioning of which other genes. In all cases, the presence of the nodes and edges in the graph is based on experimental data. These are collectively called interaction networks or regulation networks.

However, when studying historical patterns and processes not all of the nodes and edges can be observed. So, instead, they are inferred as part of the data-analysis procedure. That is, we infer the patterns as well as the processes; and we can call these inferred networks. In this case, the empirical data may consist solely of the leaf nodes, and we infer the other nodes plus all of the edges. For example, every person has two parents, and even if we do not observe those parents we can infer their existence with confidence, as we also can for the grandparents, and so on back through time with a continuous series of ancestors. Alternatively, we may also observe some of the internal nodes of the network, such as when we do record the parents and grandparents because they are contemporaneous (ie. their generations overlap). This type of pattern can be represented as a genealogical network, when referring to individual organisms, or a phylogenetic network when referring to groups (populations, species, or larger taxonomic groups).

What, then, are the things often referred to as "evolutionary networks" but which are clearly not phylogenetic networks? They are of the first type, the interaction networks. In an evolutionary network the observed nodes are directly connected to each other to represent some aspect of evolution. This aspect may have some component of phylogeny to it, but there is more to the study of evolution than solely phylogenetic history.

For example, directed LGT (dLGT) networks connect nodes representing contemporary organisms with edges that represent inferred lateral gene transfer. That is, the evolutionary networks show gene sharing. This is obviously related to the phylogeny of the organisms, but the network does not display the phylogeny itself. This first example (from Ovidiu Popa, Einat Hazkani-Covo, Giddy Landan, William Martin, Tal Dagan. 2011. Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Research 21: 599-609) shows "32,028 polarized lateral recipient–donor protein-coding gene transfer events" inferred from "the completely sequenced genomes of 657 prokaryote species".


The concept of a gene-sharing network as an evolutionary network has also been applied to viruses and their relatives, for example, as shown by this next diagram (from Natalya Yutin, Didier Raoult, Eugene V Koonin. 2013. Virophages, polintons, and transpovirons: a complex evolutionary network of diverse selfish genetic elements with different reproduction strategies. Virology Journal 10: 158).


The question, then, is what to make of diagrams that combine both a phylogenetic tree and this type of evolutionary network, such as is done in the Minimal Lateral Network. This next example is from linguistics rather than biology (from Johann-Mattis List, Shijulal Nelson-Sathi, Hans Geisler, William Martin. 2013. Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36: 141-150), and it superimposes the sharing network and the phylogenetic tree. (For a discussion in the context of LGT, see also Tal Dagan. 2011. Phylogenomic networks. Trends in Microbiology 19: 483-491).


In this diagram, the tree explicitly represents the phylogenetic history of the languages while the evolutionary network represents possible borrowings of words, with thicker lines representing more borrowed words. Clearly, the network also contains phylogenetic information of some sort. For example, the connection of the root of the Romance languages to English reflects the conquest of Britain by the French-speaking Normans, which modified the Old-German heritage of Old English. However, the diagram as a whole is a hybrid, rather than being a coherent phylogenetic network in the simplest sense (ie. a reticulation network).

To see this clearly, note that the phylogenetic tree is not fully resolved and that the evolutionary network does suggest possible resolutions for several of the polychotomies, such as the relationship of Armenian and Greek, the relationship of Albanian to the Romance languages, and the relationship of the Gaelic languages to the Romance languages. So, in some cases the evolutionary network helps resolve the phylogenetic tree rather than forming a reticulating network.

It would be possible to derive a phylogenetic network from this minimal lateral network, but as it stands it is a combination of a phylogenetic tree and a so-called evolutionary network.

Wednesday, February 5, 2014

NSF and reticulating phylogenies


In mid January, the US National Science Foundation released a new Program Solicitation, NSF 14-527. This is called the Genealogy of Life (GoLife).

The text notes:
This solicitation represents the successor program to the Assembling the Tree of Life program. The name has been changed to include projects covering the complexity of phylogenetic patterns across all of life.
So, it replaces the previous documents NSF 10-513 (the Assembling the Tree of Life program, AToL) and NSF 11-534 (the Assembling, Visualizing and Analyzing the Tree of Life program, AVAToL). The latter program has its own web page, which gives you an idea of what it has been about.

For those of us interested in phylogenetic networks, the following parts of the Introduction to the new program are of particular interest:
Understanding the tree of life has been a goal of evolutionary biologists since the time of Darwin. During the past decade, unprecedented gains in gathering and analyzing phylogenetic data have demonstrated increasingly complex genealogical patterns. 
The GoLife program builds upon the AToL program by accommodating the complexity of diversification patterns across all of life's history. Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted as a single, typological, bifurcating tree.
This is very good news, and a major step forward. The two Tree of Life programs were implicitly based on a rather unrealistic assumption about the shape of phylogenetic history. This new move has been a long time coming, both during prior stakeholder discussions and within the various committees. There are a number of people who will be personally pleased that this has come to fruition.

The focus of the new program is solely on biology, however:
Proposals should focus on poorly sampled clades or data layers within the Genealogy of Life where new data will have a profound impact on our understanding of the pattern of life's evolution ... Additional examples of projects that will not be considered by this program include: ... 5) projects that are solely focused on the development of new computational methods or technologies. 
This is a pity, because there needs to be method development before reticulating phylogenies can be constructed in a manner similar to what is currently used for tree building. There is likely to be a major need for practical large-scale methods for producing evolutionary networks, which unfortunately will be supported only outside of this particular program.

Furthermore:
The project should include a plan for integration and standardization of data consistent with three AVAToL projects: Open Tree of Life, ARBOR, and Next Generation Phenomics.
This is an important requirement, as scientific data tends to disappear into a black hole unless it is prised out of the scientists' hands. The problem is that the Open Tree of Life appears to currently have no mechanism for dealing with non-tree phylogenies!

Thursday, December 12, 2013

Textbooks and phylogenetic networks


The question has been asked as to which of the current general books about phylogenetics actually cover phylogenetic networks. There are collections of essays where networks are covered, and there are specialist books, of course, but the question here is about general introductory books. While a number of books mention tree incongruence, and that this phenomenon could be represented using a reticulating graph, there appear to be only two books that specifically cover the topic of phylogenetic networks.

(1)
Barry G. Hall (2011) Phylogenetic Trees Made Easy: A How-To Manual, Fourth Edition. Sinauer Associates, Sunderland MA.

The first three editions (2001, 2004, 2008) discussed trees only, but the fourth edition has added a chapter on networks. Chapter 15 (pp. 219-248) explicitly notes that "The material presented here is drawn almost entirely from the new book Phylogenetic Networks: Concepts Algorithms and Applications", which the author also notes was "made available to me in manuscript prior to its publication."

There are five sections in the chapter:
  Why Trees Are Not Always Sufficient
  Unrooted and Rooted Phylogenetic Networks
  Learn More about Phylogenetic Networks
  Using SplitsTree to Estimate Unrooted Phylogenetic Networks
  Using Dendroscope to Estimate Rooted Networks from Rooted Trees
The first three sections are theoretical introductions to the topic, and the final two sections proceed through a worked example (a different one each).

The book provides a basic introduction to phylogenetics, which is its intent. So, the network topics are presented in a straightforward manner, which makes them easy to grasp. The worked examples are cookbook style, intended solely to get you started using the two chosen computer programs.

The author is to be congratulated for producing not only the first, but so far the only, general book that covers evolutionary networks.

(2)
Philippe Lemey, Marco Salemi, Anne-Mieke Vandamme (editors) (2009) The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Second Edition. Cambridge University Press, Cambridge.

The first edition (2003) had a chapter on SplitsTree by Vincent Moulton, and this was revised in the current edition to Split Networks: a Tool for Exploring Complex Evolutionary Relationships in Molecular Data, Chapter 21 (pp. 631-653), by Vincent Moulton and Katharina Huber.

The chapter provides a general introduction to the theory of splits graphs and their uses; and the practical exercises use SplitsTree. This was the first general book on phylogenetics to include networks, although evolutionary networks are not covered.

Conclusion

The coverage of networks is the final topic in the book in both cases, so it can hardly be claimed to have an important place. Nevertheless, these books are at least one step ahead of their competitors.

All of these books are examples of the contemporary focus on congruent tree patterns in evolution, with reticulate relationships being almost an afterthought. There is nothing in the word "phylogeny" that specifies a shape for evolutionary history — it comes from the Greek phylon "race" + geneia "origin". Evolutionary groups may arise by either vertical or horizontal processes, and so evolution may be tree-like or it may not. The current focus almost exclusively on trees is therefore somewhat misplaced.

Tuesday, November 5, 2013

Using constraints to get a handle on the space of phylogenetic networks?


The following two problems will be familiar to researchers working on evolutionary phylogenetic networks.

1) The severe computational intractability associated with globally optimizing most objective functions over the space of phylogenetic networks.

2) The fact that within the space of potential solutions, there are typically very many that an end-user biologist will want to exclude from consideration, for context-specific biological reasons that the software does not know about. This hidden information often only becomes available at the end of the analysis. It is not unusual to receive comments such as: "Thanks for the networks but they can't be good, because experimentalists strongly believe that taxon X is a hybrid of taxa Y and Z, and we also think that taxon group C should be monophyletic ... this is not visible in your networks."

In a recent opinion piece added to the arXiv ("Fighting network space: it is time for an SQL-type language to filter phylogenetic networks"), David, Simone Linz and I pose the question of whether it might be possible to address both of these problems at the same time, using constraint-based modelling. The core of the idea is that, via some kind of comparatively easy-to-use modelling language (e.g. something with an SQL flavour), the end-user biologist should be able to specify characteristics that all candidate solutions must (or must not) have.

The win-win scenario would be that this (a) tempers the intractability of the search problem, by cutting out large swathes of irrelevant networks in the vast search space and (b) invites biologists to incorporate their context-specific knowledge "upstream", reducing the risk that the networks generated by the software are mis-interpreted. In the context of phylogenetic trees, the idea is not new: in 1986 Constantinescu and Sankoff showed that the use of a constrained tree indeed reduces the search space remarkably.

It seems a natural idea to do this for networks, but the question of course is how feasible all this is. Constraint-based pruning of intractable search spaces is seductive but technically challenging for all kinds of reasons. Depending on the constraints used, it might help a lot or a little; it is certainly no silver bullet. We might nevertheless hope that in many cases end-user biologists have so much implicit knowledge that the search space is massively shrunk. The question of the modelling language is also tricky because we need to decide upon a set of network constraints that biologists want and need: the dominant topological feature of trees, the clade, is no longer sufficient to describe (or constrain) the topologically richer space of phylogenetic networks. Furthermore, the constraints themselves should not become a new source of intractability.

In the opinion piece we make a few basic suggestions for atomic network constraints and how they might be combined via an SQL-style language. This, of course, is only the starting point for what we hope will be a wider discussion.
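
To make the idea of combining atomic constraints a little more concrete, here is a minimal Python sketch. It is not the language proposed in the opinion piece: the dictionary summary of a network, the two constraint names (is_hybrid_of, is_monophyletic) and the select function are all hypothetical, chosen only to mirror the "taxon X is a hybrid of Y and Z" and "group C should be monophyletic" comments quoted above.

```python
from typing import Any, Callable, Dict, List

# Hypothetical summary of a candidate network: which taxa are hybrids (and of
# which parents), and which taxon groups appear as clades.
Network = Dict[str, Any]
Constraint = Callable[[Network], bool]


def is_hybrid_of(taxon: str, parent_a: str, parent_b: str) -> Constraint:
    """Atomic constraint: `taxon` must be a hybrid of the two named parents."""
    wanted = frozenset({parent_a, parent_b})
    return lambda net: net["hybrids"].get(taxon) == wanted


def is_monophyletic(*taxa: str) -> Constraint:
    """Atomic constraint: the named taxa must form a clade in the network."""
    group = frozenset(taxa)
    return lambda net: group in net["clades"]


def select(networks: List[Network], *constraints: Constraint) -> List[Network]:
    """Keep the networks that satisfy every constraint (like WHERE ... AND ...)."""
    return [net for net in networks if all(c(net) for c in constraints)]


# Three made-up candidate networks.
candidates = [
    {"name": "N1", "hybrids": {"X": frozenset({"Y", "Z"})}, "clades": {frozenset({"A", "B"})}},
    {"name": "N2", "hybrids": {}, "clades": {frozenset({"A", "B"})}},
    {"name": "N3", "hybrids": {"X": frozenset({"Y", "Z"})}, "clades": set()},
]

kept = select(candidates, is_hybrid_of("X", "Y", "Z"), is_monophyletic("A", "B"))
kept_names = [net["name"] for net in kept]
print(kept_names)
```

Note that this sketch only filters a list of networks that has already been generated. The harder and more interesting goal discussed above is to use the same constraints to prune the search itself.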

We're very keen to hear your thoughts about this!

Wednesday, October 30, 2013

Next Generation Sequencing and phylogenetic networks


I have recently been doing a course (along with a bunch of postgraduate students) on Massively Parallel Sequencing, also known as Next Generation Sequencing (NGS). This was a partially successful attempt to teach an old dog some new tricks. More to the point, it has prompted me to think about NGS in relation to phylogenetic networks. Most of the published discussions have focussed on trees, rather than networks.


NGS can potentially provide a fast and cost-effective means of generating multilocus sequence data for phylogenetics (Rannala & Yang 2008; McCormack et al. 2013; Moriarty Lemmon & Lemmon 2013). Unfortunately, the cost for the number of samples typically employed in phylogenetics is currently still beyond the reach of most researchers.

NGS and phylogenetics

Nevertheless, we are sometimes told things like: "The fields of phylogenetics and phylogeography are on the cusp of a revolution, enabled by the rapid expansion of genomic resources and explosion of new genome sequencing technologies." This is probably over-stating the case, as noted by McCormack et al. (2013):
Despite this obvious potential, NGS has been slow to take root in phylogeography and phylogenetics compared to other fields like metagenomics and disease genetics. We suggest that this lag has been caused by four specific aspects of phylogeographic and phylogenetic research: the predominant focus on non-model organisms, the need for sequencing large numbers of samples per species, the lack of consensus regarding library preparation protocols for particular research questions, and the transitional state of the technology (whole-genome data are still neither cost-effective, nor even desirable for phylogeography and phylogenetics, but are paradoxically easier to collect).
Another issue is the historical importance of utilizing gene trees in phylogeography and phylogenetics. Gene trees are most robustly inferred from loci with high information content, for example, a non-recombining locus containing a series of linked SNPs. Individual SNPs, on the other hand, have low information content on a per-locus basis and have been used predominately with classification methods such as Structure and Principal components analysis ... While distance-based genealogies and phylogenies can be built from unlinked SNPs, this ignores models of molecular substitution and probabilistic tree-searching algorithms that have led to more robust phylogenetic inference in the last several decades.
Furthermore, no-one has yet shown that many of the questions currently being asked by phylogeneticists will actually benefit from genomic data. We may well be able to answer some new questions, but that is quite a different thing from a revolution. The essence here is that in science the questions must come first. Collecting data for the sake of it is usually unproductive. So, we need a clear demonstration that genomics is actually needed in phylogenetics (as opposed to other disciplines, where it may indeed be very useful). If increased volume of data will solve a phylogenetic problem then that is good, but there is no necessary reason to expect that it will happen. Statistically, the extra data can lead to improved precision but not necessarily improved accuracy. In science, targeted data collection has always been the most productive approach to any clearly stated experimental question.

For example, the estimated relationships among humans, chimpanzees, and gorillas did not change as a result of genome sampling (Galtier and Daubin 2008), nor did those of malaria species (Kuo et al. 2008), nor those of mammal superorders (Hallström and Janke 2010). (I have discussed the mammal example in a previous blog post: Why are there conflicting placental roots?). In all three cases, the relationships were just as complex after the genome sequencing as before — the resolution of controversial branches in our trees did not occur as a result of increased access to character data.

In this sense, a small sample of representative gene sequences should reveal just as much of the genealogical truth as will a genome-wide sample. A more recent empirical example is presented by O'Neill et al. (2013), who found that including less informative loci added so much noise to the phylogenetic signal that the analysis eventually broke down. The issue here is that as data volume increases so does the potential occurrence of systematic bias due to model mis-specification.

This sort of problem can easily be visualized using phylogenetic networks, in which genome-scale data frequently produce unresolved bushes rather than tree-like phylogenies. I have provided a couple of examples in a previous post (When is there support for a large phylogeny?). Another example is provided by Beiko (2011), which I have reproduced below.

This all suggests that we will need to think carefully about how to apply phylogenetic networks to genome-scale data. Much of the lack of resolution may very well come from the nature of NGS, rather than from the actual evolutionary history.


NGS and networks

There are a number of potential problems with NGS. These may not matter so much for tree-building algorithms, but it is a different matter for networks.

[1]  Increased homoplasy due to sequencing errors
An error rate of even 0.01% is considered good in NGS (eg. Roche 454: 1%; Illumina HiSeq: 0.1%; Life SOLiD: 0.01%), but when this is extrapolated to the genome scale it results in thousands of errors. Networks are sensitive to this magnitude of stochastic error. Indeed, I have already written about the use of phylogenetic networks specifically to identify data errors (Checking data errors with phylogenetic networks).
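
The extrapolation is simple arithmetic, sketched below in Python. The error rates are the ones quoted above; the data volume of 10 million bases is a hypothetical figure chosen for illustration, and the calculation assumes independent per-base errors and ignores any correction gained from overlapping reads.

```python
def expected_errors(error_rate_percent: float, bases_sequenced: int) -> float:
    """Expected number of erroneous base calls, assuming independent errors."""
    return bases_sequenced * error_rate_percent / 100.0


# Error rates as quoted in the text; BASES is a hypothetical data volume.
RATES_PERCENT = {"Roche 454": 1.0, "Illumina HiSeq": 0.1, "Life SOLiD": 0.01}
BASES = 10_000_000

errors = {platform: expected_errors(rate, BASES) for platform, rate in RATES_PERCENT.items()}
for platform, count in errors.items():
    print(f"{platform}: about {count:,.0f} expected errors in {BASES:,} bases")
```

Even at the lowest of these rates, the expected number of errors in this hypothetical data set is on the order of a thousand.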

[2]  Increased homoplasy due to intra-gene processes
These include substitutions, deletions, duplications (especially tandem repeats), inversions, and translocations. These processes can potentially reveal evolutionary history, but we have little idea about how best to process the data in a way that will reveal that history. Currently, we deal with this by lumping most of the processes together as "indels".

[3]  Increased homoplasy due to inter-gene processes
The most common processes known to confound attempts to identify reticulate evolution are incomplete lineage sorting and gene duplication–loss. There are several methods available for addressing these issues in the context of estimating phylogenetic trees, but their applicability to networks is still being assessed.

[4]  Increased homoplasy in non-coding regions
Sanger-style sequencing is usually targeted towards gene-coding regions or their introns, but genome-scale data can include what is currently called "junk DNA". The evolutionary processes in these regions are unknown, as is their applicability to phylogenetic analysis.

[5]  Inadequacies due to data-processing methods
The analysis of NGS data is often a black art — each paper seems to provide its own way of processing the data. This has been a cause of concern expressed in the literature (e.g. Check Hayden 2012; Editorial 2012a, 2012b; MacArthur 2012), especially in the light of the poor documentation and archiving of bioinformatics programs. I have discussed this issue in some previous posts (Poor bioinformatics?; Archiving of bioinformatics software). Perhaps the most talked-about problem is ascertainment bias — there is a brief discussion of this at the end of this post.

Network analysis of NGS data

All of this might make the application of networks to phylogenomics problematic in many cases, because we already have enough challenges dealing with the data from Sanger-style sequencing, without having them be orders of magnitude worse. It will therefore be very interesting to see what emerges from the current attempts to apply phylogenetic networks to NGS data.

There have been a few applications of EDA (exploratory data analysis) programs such as SplitsTree, mostly involving bacteria and viruses, and often in the context of detecting recombination. Not all of these studies have produced networks that look bushy, as shown by the example below (from Söderlund et al. 2013). SplitsTree is mostly limited by the number of samples, not by the number of characters, so genomic data are not a particular analysis issue for algorithms such as neighbor-net. However, you might like to calculate your inter-sample distances outside the program, unless you want the simple p-distance. (Popular genome-scale alternatives include Fst.)
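
As a rough illustration of calculating distances outside the program, the following Python sketch computes the simple p-distance (the proportion of compared sites that differ) for every pair of samples in a small, made-up alignment. The function name, the treatment of "-", "?" and "N" as missing data, and the toy sequences are all assumptions made for this example. An alternative measure would replace p_distance, and exporting the matrix in a format that a network program can read is not shown.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites that differ, skipping gaps and missing data."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "?", "N"}
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy alignment with made-up sample names and sequences.
alignment = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACCTAC-A"}

distances = {
    (a, b): p_distance(alignment[a], alignment[b])
    for a, b in combinations(sorted(alignment), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```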

There have also been programs developed for the study of admixture (a.k.a. introgression) in human genomes, such as TreeMix, AdmixTools, and MixMapper, and these might repay wider exploration. I have discussed some of these programs in a previous post (Admixture graphs – evolutionary networks for population biology). Essentially, they first construct a tree and then add reticulations based on various criteria. As is usual with this approach, there is the problem of constructing the initial tree in the presence of reticulation processes, and there seems to be no clear criterion about when to stop adding reticulations — optimization criteria always increase as reticulations are added, so that increasingly complex networks will always be preferred mathematically.


Note — a common data-processing problem

The following explanation of one type of ascertainment bias is adapted from the Fluxus Engineering web site:
For each DNA sample, a large number of short sequences are generated by the NGS sampling. Genomic variants are estimated from the consensus of these NGS sequences, after filtering the sequences for artifacts. Variant lists are never complete — the greater the sequence length, the greater the fraction of the genome that can be sequenced, but there are always uncharted regions which vary from sample to sample. The sampled genome sequences are then compared to a reference genome. NGS software usually reports SNP variants only if they do not match the reference genotype, and if there is sufficient evidence that they are non-reference. Non-reported variants do not necessarily match the reference genotype — they can just as well be sequencing failures, or coverage gaps, or insufficient evidence for a non-reference variant. Networks generated from such data are likely to consist largely of artifacts.
References

Beiko RG (2011) Telling the whole story in a 10,000-genome world. Biology Direct 6: 34.

Check Hayden E (2012) RNA studies under fire. Nature 484: 428.

Editorial (2012a) Must try harder. Nature 483: 509.

Editorial (2012b) Error prone. Nature 487: 406.

Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 363: 4023-4029.

Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology & Evolution 27: 2804-2816.

Kuo C-H, Wares JP, Kissinger JC (2008) The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology & Evolution 25: 2689-2698.

Moriarty Lemmon E, Lemmon AR (2013) High-throughput genomic data in systematics and phylogenetics. Annual Review of Ecology, Evolution & Systematics 44: 19.1-19.23.

MacArthur D (2012) Face up to false positives. Nature 487: 427-428.

McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT (2013) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution 66: 526-538.

O'Neill EM, Schwartz R, Bullock CT, Williams JS, Shaffer HB, Aguilar-Miguel X, Parra-Olea G, Weisrock DW (2013) Parallel tagged amplicon sequencing reveals major lineages and phylogenetic structure in the North American tiger salamander (Ambystoma tigrinum) species complex. Molecular Ecology 22: 111-129.

Rannala B, Yang Z (2008) Phylogenetic inference using whole genomes. Annual Review of Genomics and Human Genetics 9: 217-231.

Söderlund R, Jernberg C, Källman C, Hedenström I, Eriksson E, Bongcam-Rudloff E, Aspán A (2013) Rapid whole genome sequencing investigation of a familial outbreak of E. coli O121:H19 with a sheep farm as the suspected source. EMBnet Journal 19 suppl.A: 89-90.

Wednesday, October 16, 2013

What are evolutionary networks currently used for?


These days, there are many unrooted affinity-type networks used to display conflicting phylogenetic signals. There are many different methods available, although the various forms of splits graphs seem to dominate, especially NeighborNet and Consensus Networks (for species-level data), and Reduced Median Networks and Median Joining Networks (for population-level data). However, phylogeneticists are interested in genealogies, not just data displays.

Unfortunately, rooted evolutionary networks are not so well off. There is a great need for such networks in phylogenetics, but there are very few automated methods available for constructing them. These networks are needed whenever a genealogy involves reticulation processes rather than solely divergence. Divergence produces a tree-like evolutionary history, but reticulation processes do not, and so they require network methods.

Due to the lack of obvious methods, most current research papers still do not illustrate reticulate evolution with a genealogy. A collection of ad hoc methods is usually applied to the data, and the evolutionary processes are then inferred from this. However, the use of a network to illustrate the inferred genealogy is rather rare.

Indeed, for species-level studies most papers simply present a set of incongruent gene trees, although some of them also illustrate either (i) the tree derived from the combined data, or (ii) a consensus tree with or without the conflicting relationships, or (iii) a pair of cophylogeny trees. Occasionally, the hybrid origin of some of the species, for example, is illustrated, but the putative parents are not connected in a phylogeny.

Population-level studies often present unrooted haplotype networks, illustrating processes such as hybridization and introgression between closely related species, or the evolution of domesticated species.

However, these ad hoc methods do not mean that evolutionary networks do not appear in the literature. In this blog post I include a representative sample of rooted networks that are intended to illustrate inferred genealogies. They are grouped according to the evolutionary processes being studied (see Reticulation patterns and processes in phylogenetic networks). I have also briefly indicated how the networks were constructed.

Homoploid Hybridization

Hybridization is commonly studied in the literature, and phylogenetic networks appear not infrequently. This first example was constructed by the unreleased program HyperPars.

Dickerman AW (1998) Generalizing phylogenetic parsimony from the tree to the forest. Systematic Biology 47: 414-426.


This next example was constructed by the program SplitsTree. Note that the root of the network is not clearly indicated.

Pirie MD, Humphreys AM, Barker NP, Linder HP (2009) Reticulation, data combination, and inferring evolutionary history: an example from Danthonioideae (Poaceae). Systematic Biology 58: 612-628.


This example was constructed manually from a set of gene trees. Note that it is drawn in a rather unusual style for indicating hybridization.

Sang T, Crawford D, Stuessy T (1997) Chloroplast DNA phylogeny, reticulate evolution, and biogeography of Paeonia (Paeoniaceae). American Journal of Botany 84: 1120-1136.


Polyploid Hybridization

Polyploid hybridization is probably the most likely type of study to have a phylogenetic network. This is at least partly because there is a computer program, Padre, to automate much of the work. This program was used to construct this first network.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.


This next example was also constructed by the program Padre.

Sessa EB, Zimmer EA, Givnish TJ (2012) Unraveling reticulate evolution in North American Dryopteris (Dryopteridaceae). BMC Evolutionary Biology 12: 104.


This example was constructed manually from a gene tree.

Marhold K, Lihová J (2006) Polyploidy, hybridization and reticulate evolution: lessons from the Brassicaceae. Plant Systematics and Evolution 259: 143-174.


Introgressive Hybridization

Introgression is a widely studied phenomenon. However, rooted evolutionary networks are rarely presented. This first one was constructed manually from a set of gene trees.

Koblmüller S, Duftner N, Sefc KM, Aibara M, Stipacek M, Blanc M, Egger B, Sturmbauer C (2007) Reticulate phylogeny of gastropod-shell-breeding cichlids from Lake Tanganyika — the result of repeated introgressive hybridization. BMC Evolutionary Biology 7: 7.


The next example was also constructed manually from a set of gene trees.

Morgan DR (2003) nrDNA external transcribed spacer (ETS) sequence data, reticulate evolution, and the systematics of Machaeranthera (Asteraceae). Systematic Botany 28: 179-190.


This example was constructed by the program SplitsTree.

Labate JA, Robertson LD (2012) Evidence of cryptic introgression in tomato (Solanum lycopersicum L.) based on wild tomato species alleles. BMC Plant Biology 12: 133.


Horizontal Gene Transfer

HGT is a hot topic these days, both among prokaryotes and among eukaryotes, although most papers do not present a phylogenetic network. The first example was constructed by program Sprit from the species tree and a gene tree.

Walsh AM, Kortschak RD, Gardner MG, Bertozzi T, Adelson DL (2013) Widespread horizontal transfer of retrotransposons. Proceedings of the National Academy of Sciences USA 110: 1012-1016.


This next example was constructed manually from a gene tree.

Delwiche CF, Palmer JD (1996) Rampant horizontal transfer and duplication of rubisco genes in eubacteria and plastids. Molecular Biology and Evolution 13: 873-882.


This example was constructed manually from incongruence among a series of gene trees.

Richards TA, Soanes DM, Foster PG, Leonard G, Thornton CR, Talbot NJ (2009) Phylogenomic analysis demonstrates a pattern of rare and ancient horizontal gene transfer between plants and fungi. The Plant Cell 21: 1897-1911.


Homologous Recombination

Intra-genic recombination is often studied without reference to a network. Nevertheless, several programs exist, and this particular network was constructed by program Kwarg.

Jenkins PA, Song YS, Brem RB (2012) Genealogy-based methods for inference of historical recombination and gene flow and their application in Saccharomyces cerevisiae. PLoS One 7: e46947.


Chromosomal rearrangements are studied rather rarely. This network was constructed manually from a phylogenetic tree. Note that the root of the network is not clearly indicated.

Rumpler Y, Hauwy M, Fausser JL, Roos C, Zaramody A, Andriaholinirina N, Zinner D (2011) Comparing chromosomal and mitochondrial phylogenies of the Indriidae (Primates, Lemuriformes). Chromosome Research 19: 209-224.


Viral Reassortment

Reassortment of segmented viruses produces very complex networks. This one is a partial network, constructed manually from a series of phylogenetic analyses.

Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, Pybus OG, Ma SK, Cheung CL, Raghwani J, Bhatt S, Peiris JS, Guan Y, Rambaut A (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459(7250): 1122-1125.


Genome Fusion

This is a difficult topic to study. As is almost always done, this network was constructed manually from a phylogenetic tree.

Thiergart T, Landan G, Schenk M, Dagan T, Martin WF (2012) An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin. Genome Biology and Evolution 4: 466-485.


Apomixis

This topic rarely involves networks. This network was constructed manually from the output of program SplitsTree.

Dyer RJ, Savolainen V, Schneider H (2012) Apomixis and reticulate evolution in the Asplenium monanthes fern complex. Annals of Botany 110: 1515-1529.


Removing Convergence

This is an unusual use of a network, but the author notes that "the use of reticulations clarifies the phylogeny by factoring out apparent convergence, even though there is no reason to think that actual hybridization or introgression has occurred." The network was constructed by an unreleased program.

Alroy J (1995) Continuous track analysis: a new phylogenetic and biogeographic method. Systematic Biology 44: 152-178.


Wednesday, October 2, 2013

Reticulation patterns and processes in phylogenetic networks


When it comes to phylogenetic networks, there is often misunderstanding between biological and computational scientists, because the former tend to focus on the biological processes underlying the network whereas the latter focus on the patterns needing to be analyzed to produce the networks.

Here, I try to provide a summary of the different processes and patterns involved in reticulation, so that both "sides" get an overview, and hopefully can communicate more easily. I am principally discussing the development of networks that display evolutionary history.

In phylogenetics, historical processes create contemporary patterns, and we then try to detect those patterns, and assess them in order to determine what process created each pattern. Computationally, algorithms will detect certain data patterns and display them in a directed acyclic graph, which is then interpreted biologically. What needs to happen is for us to identify the possible patterns created by the different processes, so that algorithms can be developed that will detect them. It is doubtful that an algorithm will be able to identify all individual processes — it will be up to biologists to work out what process created each pattern detected.

In what follows, there are major simplifications from both the biological and computational points of view, so please be aware of that. In particular, note that I have not discussed either deep coalescence or gene duplication-loss which, if present, will confound the detection of reticulation patterns.

Hybridization (hybrid speciation)

This is the formation of a new species via sexual reproduction. There are two basic forms that are of interest:
Homoploid Hybridization, in which one copy of the genome is inherited from each parent species (eg. diploid parents create a diploid hybrid);
Polyploid Hybridization, in which multiple copies of the genome are inherited from each parent species (eg. diploid parents create a polyploid hybrid).


Polyploid hybridization is usually assessed by sequencing each copy of the genome in the hybrid species, and treating each copy as a terminal in the data analysis. This produces a multi-labelled genome tree, which is then turned into a single-labelled species network.
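
The first step in working with a multi-labelled tree is simply to see which species label more than one terminal. The Python sketch below does only that, for a made-up tree in simple Newick format (no quoted labels). It does not attempt the much harder step of turning the tree into a species network, and the function name and the example tree are illustrative assumptions.

```python
import re
from collections import Counter
from typing import Dict, List


def leaf_labels(newick: str) -> List[str]:
    """Return the leaf labels of a simple Newick string, in order of appearance.

    Leaf labels directly follow "(" or ","; internal node labels follow ")"
    and branch lengths follow ":", so neither is picked up.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def multi_labelled_taxa(newick: str) -> Dict[str, int]:
    """Taxa that label more than one leaf, with the number of copies."""
    counts = Counter(leaf_labels(newick))
    return {taxon: n for taxon, n in counts.items() if n > 1}


# Made-up multi-labelled tree: species B appears twice, as it would if two
# genome copies of a polyploid were treated as separate terminals.
mul_tree = "((A:0.1,B:0.2):0.3,((C:0.1,B:0.1):0.2,D:0.4):0.1);"

labels = leaf_labels(mul_tree)
repeated = multi_labelled_taxa(mul_tree)
print(labels, repeated)
```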

At the species level, homoploid hybridization is usually assessed by sequencing several genes in the hybrid species (often from both the nuclear and non-nuclear genomes) and producing independent gene trees. The species network is created by resolving conflicts among the gene trees. This form of analysis assumes a data pattern that is very similar to that of HGT.
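
Before conflicts among gene trees can be resolved they have to be found. The Python sketch below shows one common way of doing so for rooted trees on the same set of taxa: list the clades of each tree, and report pairs of clades that overlap without one containing the other. The nested-tuple representation of a tree and the taxon names (P1, P2, H, O) are assumptions made for the example, and this is only the detection step, not a network construction method.

```python
from typing import FrozenSet, List, Set, Tuple, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the leaf set of every internal node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset({node})
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def conflicting_clades(tree_a: Tree, tree_b: Tree) -> List[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Pairs of clades (one per tree) that overlap but are not nested."""
    return [
        (x, y)
        for x in clades(tree_a)
        for y in clades(tree_b)
        if x & y and not (x <= y or y <= x)
    ]


# Made-up gene trees: taxon H groups with P1 in one gene and with P2 in the other.
gene_1 = ((("P1", "H"), "P2"), "O")
gene_2 = (("P1", ("P2", "H")), "O")

conflicts = conflicting_clades(gene_1, gene_2)
print(conflicts)
```

In this toy case the only conflict is between the clade containing P1 and H and the clade containing P2 and H, which is the pattern expected if H were a hybrid of P1 and P2, although, as noted above, HGT could produce the same pattern.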

In population studies, homoploid hybridization is usually assessed at the sequence level, using multiple-copy nuclear genes, where hybrids are detected by additive polymorphisms at some alignment positions.
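
An additive polymorphism can be recognized when the two putative parents have different bases at an alignment position and the putative hybrid shows both of them, usually recorded as an IUPAC ambiguity code. The Python sketch below lists such positions for three aligned sequences. The function name and the sequences are invented for illustration, and real data would also need to take sequencing error and within-species polymorphism into account.

```python
from typing import List

# IUPAC codes for two-base ambiguities.
IUPAC_TWO_BASE = {
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def additive_sites(parent_a: str, parent_b: str, candidate: str) -> List[int]:
    """Zero-based alignment positions where the candidate shows both parental bases."""
    sites = []
    columns = zip(parent_a.upper(), parent_b.upper(), candidate.upper())
    for position, (a, b, h) in enumerate(columns):
        parents_differ = a != b and a in "ACGT" and b in "ACGT"
        if parents_differ and IUPAC_TWO_BASE.get(h) == {a, b}:
            sites.append(position)
    return sites


# Made-up aligned sequences.
parent_1 = "ACGTAC"
parent_2 = "ACATGC"
putative_hybrid = "ACRTRC"

sites = additive_sites(parent_1, parent_2, putative_hybrid)
print(sites)
```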

Introgression (introgressive hybridization)

This is the transfer of genetic material from one species to another via sexual reproduction. This happens when hybrid individuals back-cross preferentially to one of the parental species, rather than forming a new hybrid species. It can involve anything from 1-49% of the genome (at 50% it is best called hybridization). The data pattern created is very similar to that of HGT (the transfer of genetic material from one species to another via non-sexual means).


It is usually assessed at the population level, by sequencing one or more genes (often from both the nuclear and non-nuclear genomes) from many individuals, and demonstrating that identical haplotypes (haploid genotypes) occur in what are recognized as separate species. This is done by constructing a haplotype network. Often, individuals are detected where the non-nuclear haplotype differs from the nuclear haplotype (as shown in the figure).
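
The basic check described here, whether an identical haplotype turns up in what are recognized as separate species, can be sketched in a few lines of Python. The sample records, species names and sequences below are invented, and the sketch does not build the haplotype network itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def shared_haplotypes(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Map each haplotype found in more than one species to those species.

    `samples` holds (individual, species, haplotype sequence) records.
    """
    species_by_haplotype: Dict[str, Set[str]] = defaultdict(set)
    for _individual, species, haplotype in samples:
        species_by_haplotype[haplotype].add(species)
    return {h: s for h, s in species_by_haplotype.items() if len(s) > 1}


# Made-up records: individual, species, haplotype.
records = [
    ("ind1", "species_A", "ACGTTA"),
    ("ind2", "species_A", "ACGTTG"),
    ("ind3", "species_B", "ACGTTG"),
    ("ind4", "species_B", "ACCTTG"),
]

shared = shared_haplotypes(records)
print(shared)
```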

Horizontal Gene Transfer

This is the transfer of genetic material from one species to another via non-sexual means (eg. transformation, transduction, or conjugation). The data pattern created is very similar to that of introgression (the transfer of genetic material from one species to another via sexual reproduction).

It is sometimes assessed by sequencing several genes and producing independent gene trees. The species network is created by resolving conflicts among the gene trees. This form of analysis assumes data that are very similar to those of homoploid hybridization or recombination.

Alternatively, it is often assessed by comparing gene trees to a species tree (either pre-specified, or derived from multi-gene data). The species network is created by resolving conflicts between the gene trees and the species tree.

Homologous Recombination and Viral Reassortment

These involve homologous parts of a genome breaking apart and re-arranging themselves, often during sexual reproduction. With cross-over the two genomes exchange material, and with gene conversion one genome acquires material from the other. There are three basic forms that are of interest:
Intra-genic Recombination, in which the break-points occur within a single gene;
Inter-genic Recombination, in which the break-points occur in different genes or non-coding spaces between genes;
Reassortment, in which segmented viruses re-combine their segments to create new strains (similar to gene conversion); this is basically inter-genic recombination without sex.


Intra-genic recombination is usually analyzed at the sequence level, based on ordered data. The gene network is constructed by identifying break-points, and thus the recombined segments. It is also possible for one of the donors of a recombined sequence to be missing from the dataset, in which case the data pattern will be the same as for HGT without the donor sampled.

Inter-genic recombination will produce the same pattern as hybridization, if both break-points are outside the region sequenced. Furthermore, homoploid hybridization can be thought of as recombination of whole chromosomes.

Viral reassortment is usually assessed by comparing strains with each other based on presence-absence of segmental haplotypes (rather similar to haplotyping of sexual organisms). This is a unique form of analysis, and it can produce incredibly complex networks.

Summary

| Process | Evaluation method |
|---|---|
| Polyploid hybridization (species) | multi-labelled tree |
| Homoploid hybridization (species) | incongruent gene trees |
| Homoploid hybridization (population) | sequence additive polymorphisms |
| Introgression (population) | haplotype network |
| Horizontal gene transfer (species) | incongruent gene trees; incongruent gene/species trees |
| Intra-genic recombination | sequence break-points |
| Inter-genic recombination | incongruent gene trees |
| Reassortment (population) | haplotype network |

It may be impossible ever to reliably distinguish homoploid hybridization, introgression, HGT and inter-genic recombination from each other by pattern analysis alone, at least not without genome-scale data.

Wednesday, July 31, 2013

Trends in Genetics: The Future of Phylogenetic Networks


A couple of weeks ago I reported on those journal covers that I know illustrate phylogenetic networks. I am happy to report that networks have now also made it onto the cover of Volume 29 Issue 8 of Trends in Genetics. The cover illustration combines the traditional tree metaphor for phylogenetics with the new metaphor of a network.


The cover story is the review article by Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James McInerney, David Morrison, Luay Nakhleh, Mike Steel, Leen Stougie and James Whitfield: Networks: expanding evolutionary thinking, on pages 439-441.

The article is one of the tangible outcomes of the workshop last October, at the Lorentz Center in The Netherlands: The Future of Phylogenetic Networks. The workshop participants agreed that we should be active in promoting the use of networks for evolutionary analyses, and this article, written by a group of biologists and computational biologists, seeks to do just that.

There will be further outcomes of the workshop, including follow-up meetings at the same venue.

Wednesday, July 24, 2013

A rant about the term "evolutionary network"


Mostly, I just rant to myself, and so I have generally avoided doing so in this blog. But this time I intend to make an exception.

The expression "evolutionary network" has become completely meaningless in science, and this is a pity. This has happened because it has been applied to so many unrelated concepts that we can no longer work out what anyone means when they use it, without reading the rest of their text to work out the context.

Networks are, of course, ubiquitous in areas as diverse as the social sciences, biology, computer science, physics and economics, and consequently there is an extensive literature on the subject. This means that the term "evolutionary network" has a different meaning in various assorted areas of intellectual activity, such as neural networks, systems biology and quality measurement, as well as the usage in phylogenetics. What is annoying me, however, is that biologists use the term in oodles of different ways, as well.

Partly, this issue arises because of the use by computer scientists of known biological processes as models for developing computer algorithms, which are then named after the process that provided the inspiration (e.g. so-called genetic algorithms). Partly, the problem comes from claiming that a particular process (or something analogous to it) does actually occur in some particular field of study, and therefore using the relevant name (e.g. so-called evolutionary computing). But the problem in biology is that everyone claims that they are studying evolution, and therefore whatever they do can be called "evolutionary".

The essential point in biology is, naturally, that most patterns are the product of one or more evolutionary processes, to one degree or another. That does not, however, justify calling all patterns and processes "evolutionary". For example, observed similarity (of genes, genomes, organisms, species, etc) may or may not have a large evolutionary component — similarity may be the result of either proximal processes (which may be ecological, rather than strongly evolutionary) or ultimate processes (which are very likely to be evolutionary).

This was one of the strongest arguments for the distinction that has been made between phenetics (based on overall similarity) and phylogenetics (based on genealogy). A phenogram (expressing observed similarity) and a phylogram (expressing inferred genealogy) may be two very different things for any given group of objects. There seems to be no real justification for the merging of these two ideas; and yet this seems to be occurring increasingly.

The latest salvo that blurs the distinction between similarity and genealogy has been fired by Halary et al. (2013. EGN: a wizard for construction of gene and genome similarity networks. BMC Evolutionary Biology 13: 146), who have this to say:
"Here, we introduce a simple but powerful software program, EGN (for Evolutionary Gene and genome Network), for the reconstruction of similarity networks from large molecular datasets."
To explain this, in an earlier paper Alvarez-Ponce and colleagues (2013. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proceedings of the National Academy of Sciences of the USA 110: E1594–E1603) developed the idea of a gene similarity network, the name of which tells you exactly what it is. It is a non-phylogenetic network in which the edges directly connect observed genes based on their similarity; that is, it extends the classical concept of gene families. The authors present various reasons to justify their claim that "gene similarity networks have the potential to explore deeper relationships than phylogenetic trees".
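A bare-bones sketch may help show what such a network is. Assume that pairwise similarity scores between genes are already available (in practice these would come from a sequence-comparison tool); the gene names, scores, threshold and function names below are all hypothetical, and this is not a description of how EGN itself works. Genes are the nodes, an edge joins any two genes whose similarity meets the threshold, and the connected components play the role of gene families.

```python
from collections import defaultdict

def similarity_network(scores, threshold):
    """Build an undirected graph in which genes are nodes and an edge joins
    two genes whose pairwise similarity is at least `threshold`."""
    graph = defaultdict(set)
    for gene_a, gene_b, similarity in scores:
        graph[gene_a]                      # touch both keys so every gene is a node
        graph[gene_b]
        if similarity >= threshold:
            graph[gene_a].add(gene_b)
            graph[gene_b].add(gene_a)
    return dict(graph)

def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                       # depth-first walk from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical pairwise similarity scores between five genes.
scores = [("g1", "g2", 0.9), ("g2", "g3", 0.8), ("g1", "g3", 0.3),
          ("g3", "g4", 0.2), ("g4", "g5", 0.95)]
families = connected_components(similarity_network(scores, threshold=0.7))
print(sorted(sorted(f) for f in families))
```

Here g1 and g3 end up in the same component even though they are not directly similar, because both are similar to g2. Nothing in the construction refers to ancestry or descent; only observed similarity is used.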

The follow-up paper by Halary et al. (the one under discussion here) describes a computer program that automates the production of these gene similarity networks. But why have they called the program "Evolutionary Gene Network" rather than some version of "Gene Similarity Network"? This name is not only blatantly misleading but downright confusing. The network produced can be used to explore evolutionary history, sure, but it does not represent anything directly evolutionary. The evolutionary interpretation is in the mind of the beholder, not in the network algorithm.

I encourage everyone to be careful when naming their programs. A program name can mislead naive users if the name is disconnected from the program's purpose. Even the program SplitsTree mostly produces networks these days, and very rarely trees!

The term "evolutionary network" in biology, at least, could be usefully restricted to those networks representing evolutionary history directly (e.g. Thiergart et al. 2012. An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin. Genome Biology & Evolution 4: 466-485).