Showing posts with label HGT network. Show all posts

Monday, April 1, 2013

Empedocles, Lucretius and lateral gene transfer


Empedocles (c. 490–430 BCE) and Lucretius (c. 99-55 BCE) have been credited with first articulating the theory of "survival of the fittest" (Sedgley 2003). However, this is of interest only to Darwinian scholars, who focus solely on trees. What is of more interest to scholars of phylogenetic networks is that these same two philosophers have also been credited with first suggesting the doctrine of horizontal gene transfer (Wilkins 2009). Gene transfer is, of course, an important source of reticulate evolution.

Empedocles was a Greek philosopher, a citizen of what is now Agrigento, in Sicily. He is perhaps most famous for first outlining the elemental theory of the physical world (i.e. air, earth, fire and water). Moreover, he identified two fundamental forces, which he called love and strife. Love is the force that brings objects together, while strife is the force that drives them apart. Empedocles postulated that the universe was once condensed into a tight sphere by the force of love, and that strife later exploded this into an expanding mass. This has been seen as a forerunner of modern ideas about the Big Bang and the subsequent expanding universe.

More importantly for our purposes, Empedocles had a physical theory about the random development of living forms. According to this theory, life first emerged as a collection of disassociated body parts, which wandered about on their own, without the intervention of divine power. These were not parts severed from previously complex beings, but each functioned in its own right as an independent "single-limbed" being. Complex creatures were then created by the accidental combination of these disparate limbs and organs. If the correct parts combined, then the creature would survive and go on to found a species, but if the wrong combination occurred then the creature would perish; only those with the most suitable combinations survived, by a process that we now call natural selection.

Empedocles' hypothesized hybrid creatures were mocked by later Greek philosophers, notably Aristoteles (384–322 BCE) and Epicurus (341–270 BCE), and their followers. They derided these monsters as "roll-walking creatures with hands not properly articulated or distinguishable" and as "ox-headed man-creatures". It was Lucretius who resurrected Empedocles' idea, in the fifth part of his only known work (the poem De Rerum Natura), which was about the beliefs of Epicureanism; Lucretius was the first writer to introduce Roman readers to Epicurean philosophy.

Titus Lucretius Carus was a Roman poet and philosopher, apparently resident in Rome itself. He is perhaps most famous for his atomistic view of the physical world (everything is built up from collections of indivisible particles). More importantly for our purposes, Lucretius expounded a similar theory to that of Empedocles, namely that originally a set of randomly composed monsters sprang up, of which only the fittest survived. However, whereas Empedocles described isolated limbs as the starting point, Lucretius described whole organisms with defective combinations of body parts (what we would now call congenital defects), so that his maladapted creatures were formed at the atomic level rather than at the macroscopic level of whole limbs. Also, in Lucretius' theory there was apparently no inter-species mingling of limbs, as there was in Empedocles' version.

These two related theories of zoogony appear to have lain dormant for a couple of thousand years, crushed under the iron fist of both Aristotelianism and the early Christian era. Even into the 1900s, biology could be best described as being essentially an extension of Aristoteles' philosophical ideas (Mayr 1982). Nevertheless, slowly the idea of natural selection was re-introduced to biology, notably with the work of Étienne Geoffroy Saint-Hilaire (1772–1844), and culminating in the work of Alfred Russel Wallace (1823–1913) and Charles Robert Darwin (1809–1882).

However, even after the introduction of this evolutionary idea, the focus was on the inheritance of morphological modifications, not on the admixture of parts inherited from different organisms; and so only half of Empedocles' ideas were accepted.

It took until the dawn of the 20th century for the Russian lichenologist Constantin Sergeevich Mereschkowsky (1855-1921) to first outline a cellular version of Empedocles' vision. It had recently been shown that lichens involve a symbiotic relationship between fungi and algae, very much along the lines first envisioned more than 2,200 years before. Mereschkowsky extended this idea to the sub-cellular level, with the explicit goal of explaining the evolutionary development of land plants from algae-like forms of life, postulating that chloroplasts originated as symbiotic blue-green algae. The German histologist Richard Altman (1852-1900) had already hinted that what we now call mitochondria (he called them bioblasts) are bacterial symbionts. It was some time later that the American anatomist Ivan Emanuel Wallin (1883-1969) published Symbionticism and the Origin of Species, in which he explicitly suggested that symbiotic bacteria have played a fundamental role in the evolution of species.

This development culminated in the suggestion that genes themselves can be transferred between distant organisms, thus bringing thought down to the atomistic level envisioned by Lucretius. This revealed the hybrid nature of many genomes, even in situations where phenotypic admixture is not manifest. The first description of horizontal gene transfer is usually credited to Victor J. Freeman (in 1951), who demonstrated that the transfer of a viral gene into a bacterium could create a virulent strain from a non-virulent strain. Since then, lateral gene transfer has been widely reported as an important component of prokaryote evolution; and it has increasingly been reported in eukaryotes as well.

We have thus come full circle. Empedocles first introduced the theory of "survival of the fittest", which took nearly 2,300 years to be re-discovered by science, as well as outlining the basic concept of "horizontal gene transfer", which took an extra century for its renaissance.

All of the information presented here is factually correct. However, only on All Fools' Day can the facts be combined in this outrageous way, and such a history be told with a straight face.

References

Mayr E. (1982) The Growth of Biological Thought: Diversity, Evolution and Inheritance. Belknap Press, Cambridge MA.

Sedgley D. (2003) Lucretius and the new Empedocles. Leeds International Classical Studies 2.4.

Wilkins J.S. (2009) New work on lateral transfer shows that Darwin was wrong. ScienceBlogs Evolving Thoughts March 31 2009.

Wednesday, January 30, 2013

More datasets for validating network algorithms


Ten more datasets have been added to the Datasets blog page. These are:
  • 2 plant studies where hybrids are known from experimentation
  • 3 more plant studies where natural hybrids are known
  • 5 studies (fungi, plants, protozoa, viruses, animals) where recombination is known.

A comment

It is worth noting something that has become obvious to me while compiling these datasets — the mathematical model often applied to hybridization networks cannot easily be applied to many of the datasets collected by biologists. The usual mathematical model involves incompatibility between two or more trees for the same set of taxa, for example from different genes or genomes. The incompatibilities are resolved by postulating one or more reticulations in the network.
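The notion of "incompatibility" between trees can be made concrete with a small sketch. It assumes that each tree is summarised by its non-trivial clusters, and it uses the standard criterion that two clusters are compatible when they are disjoint or nested. The function names and the two toy trees are my own, invented for illustration, and are not taken from any of the datasets.

```python
def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    c1, c2 = set(c1), set(c2)
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def conflicts(clusters1, clusters2):
    """List every pair of clusters (one from each tree) that cannot share a tree."""
    return [(a, b) for a in clusters1 for b in clusters2 if not compatible(a, b)]


# Toy example (hypothetical): two gene trees for the same three taxa.
tree1 = [{"A", "B"}]  # ((A,B),C)
tree2 = [{"B", "C"}]  # (A,(B,C))

if __name__ == "__main__":
    found = conflicts(tree1, tree2)
    print(len(found), "conflicting cluster pair(s)")
```

Any conflicting pair means that no single tree can contain both clusters, so at least one reticulation has to be postulated if both gene trees are to be displayed.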

However, the data produced by biologists often involve only a single nuclear gene, most frequently the Internal Transcribed Spacer region, so that the biologists do not have multiple trees. Instead, hybrids are detected by additive polymorphisms at alignment positions within the study gene. These polymorphisms arise either from (i) the polyploid nature of the hybrids (there are multiple copies of each chromosome, each of which may have a gene copy from either parental species), or (ii) from multiple paralogous copies of the genes (the rRNA region, which contains the ITS, usually has many tandemly repeated copies of the genes, which are homogenized by concerted evolution, but in a hybrid any of them may have a gene copy from either parental species).

This means that it is difficult to use any current evolutionary network for the phylogenetic analysis of many of the datasets used for detecting hybridization. In turn, this suggests that we may need a different model, one based on additive polymorphisms rather than incongruent trees.
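As a rough illustration of what a model based on additive polymorphisms would start from, the sketch below flags alignment positions at which the two putative parents differ and the putative hybrid carries an ambiguity code that is exactly the union of the parental bases. The sequences are invented toy data, the function name is hypothetical, and only the two-base IUPAC codes are handled; it is a sketch of the idea, not a method used in any of the studies listed.

```python
# IUPAC nucleotide codes, restricted to single bases and two-base ambiguities.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def additive_sites(parent1, parent2, hybrid):
    """Return 0-based alignment positions where the hybrid is additive."""
    sites = []
    for i, (p, q, h) in enumerate(zip(parent1, parent2, hybrid)):
        bases_p, bases_q, bases_h = IUPAC[p], IUPAC[q], IUPAC[h]
        # Additive: the parents differ, and the hybrid shows both parental states.
        if bases_p != bases_q and bases_h == bases_p | bases_q:
            sites.append(i)
    return sites


# Invented toy alignment (not real ITS data).
parent1 = "ACGTACGT"
parent2 = "ACATACCT"
hybrid = "ACRTACST"

if __name__ == "__main__":
    print(additive_sites(parent1, parent2, hybrid))
```

Note that nothing here involves a tree: the evidence for hybridization comes from the pattern within a single gene, which is why the usual tree-incongruence model has nothing to work with.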

The usual mathematical model for lateral-transfer networks is actually the same as for hybridization networks, since the only real difference between HGT and hybridization is that HGT does not occur via sexual reproduction while hybridization does. (Also, hybridization often involves whole genomes while HGT usually involves partial genomes.) Importantly, the mathematical model does seem to apply to the sort of datasets collected by biologists when they are studying HGT. That is, HGT is detected by incompatibility between two or more trees for the same set of taxa. Indeed, this model is usually the only evidence for HGT, unlike hybridization and recombination where there is often evidence that is independent of the network model.

Wednesday, January 16, 2013

Datasets for validating algorithms for evolutionary networks


Steven Kelk has previously raised the issue of Validating methods for constructing evolutionary phylogenetic networks: there are currently not many options for validating the biological relevance of methods for constructing evolutionary phylogenetic networks. These are phylogenetic networks intended to represent evolutionary history, such as HGT networks, hybridization networks, and recombination networks.

Thus, we need a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This could then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks.

This issue was discussed at some length at the Workshop: The Future of Phylogenetic Networks. It was suggested by Leo van Iersel that a practical starting point would be to use this blog as a link to suitable datasets. As people become aware of such datasets, a blog post would be published with the details, and the dataset would be linked from one of the blog Pages.

This page now exists (Datasets), and can be accessed at the top right of each blog page. Everyone is encouraged to contribute to this "database", which you can do by sending details about potential datasets to me by email.

In another post, What should a database of datasets look like?, I have noted that there have been four suggested approaches to acquiring datasets for evaluating algorithms (in order of increasing reality):
  1. simulate datasets under one or more data-generation models
  2. create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
  3. use datasets where the postulated reticulation events have been independently confirmed
  4. experimentally create taxa with a known evolutionary history.
It seems unnecessary to store datasets of type (1), since they can be created to order by computer programs. Datasets of type (2) are rare, but would be suitable for the database.

Datasets of type (4) currently exist for tree-like evolutionary histories but not yet, as far as I know, for reticulated histories. I have added the known (and available) ones to the database.

Datasets of type (3) are likely to form the bulk of the database, and I have started this part of the database with some example datasets involving hybridization.

For the latter datasets, it is important to note the potential problem of the degree to which the postulated reticulation events have been independently confirmed. I suspect that only weak evidence has been applied to far too many datasets. This is particularly true for those involving horizontal gene transfer (HGT), where mere incongruence between genes is presented as the sole "evidence". More than this is required (see Than C, Ruths D, Innan H, Nakhleh L. 2007. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535).

Wednesday, July 18, 2012

The first gene transfer (HGT) network (1910)


I have previously noted in this blog that the first two published phylogenetic networks (by Buffon in 1755 and Duchesne in 1766) were hybridization networks. This leads to the obvious question: what was the first phylogenetic network illustrating horizontal gene transfer (HGT)?

This depends, of course, on exactly how one defines "HGT". If we require explicit reference to genes, then this must post-date the origin of our current understanding of genetics and the nature of genetic material. The first description of HGT is usually credited to Victor Freeman (1951), which thus sets an earliest possible date. However, it will take quite some bibliographic investigation to work out who first illustrated this with a phylogenetic network (none of the earliest reports were concerned with phylogeny). [See the later post The first HGT network.]

However, if we consider HGT to be a subset of genome transfer (or genome fusion), which is the horizontal transfer of an entire organismal genome, then a much earlier date becomes possible. This is because the idea of endosymbiosis, which posits eukaryote organelles as the acquisition of different bacterial genomes, dates back more than a century.

For example, Constantin Mereschkowsky (1905) developed his symbiogenesis theory with the explicit goal of explaining the evolutionary development of land plants from algae-like forms of life, postulating that chloroplasts originated as symbiotic blue-green algae. Richard Altman (1890) had already proposed (indirectly) that what we now call mitochondria are also symbionts.

Mereschkowsky (1910) then took this idea further, and developed a scenario for the origin of the nucleus and cytoplasm from two kinds of organisms and two kinds of protoplasm, called mycoplasm and amoeboplasm. Each kind of protoplasm had an origin in different historical epochs. He illustrated this two-stage symbiosis idea with an explicit network, which appears on page 366 of his paper.


Mereschkowsky's own interpretation of this diagram as a genome-transfer network thus seems clear enough, even though he makes no explicit reference to a genome.

References

Altman R. (1890) Die Elementarorganismen und ihre Beziehungen zu den Zellen. Veit, Leipzig.

Freeman V.J. (1951) Studies on the virulence of bacteriophage-infected strains of Corynebacterium diphtheriae. Journal of Bacteriology 61: 675–688.

Mereschkowsky C. (1905) Über Natur und Ursprung der Chromatophoren im Pflanzenreiche. Biologisches Centralblatt 25: 593–604.

Mereschkowsky C. (1910) Theorie der zwei Plasmaarten als Grundlage der Symbiogenese, einer neuen Lehre von der Entstehung der Organismen. Biologisches Centralblatt 30: 278–303, 321–347, 353–367.

Monday, February 27, 2012

A fundamental limitation of hybridization networks?


In a "hybridization" network, reticulation cycles with three or fewer outgoing arcs are not uniquely defined with respect to trees, clusters or triplets. This point was first noted by Gambette and Huber (2009), although this work will not be formally published until later this year (Gambette and Huber 2012). This seems to be a fundamental mathematical limitation of such networks, which thereby limits what biologists can expect to achieve by performing a network analysis. It is thus a very important point for biologists to understand, as it can currently lead to incorrect interpretation of phylogenetic networks.


The figure shows two incompatible inputs and the three networks resulting from a hybridization model. The inputs are shown in the figure as trees, triplets and clusters, since in this example these three are identical. There are three taxa (labeled A, B, C), which form two triplets (labeled 1, 2), as shown. (The third possible triplet is not part of this discussion.) Obviously, these triplets also represent two trees, and those trees have two non-trivial clusters.

The figure also shows the three networks (labeled a, b, c) that are encoded (uniquely described) by these triplets / trees / clusters. The relevant arcs of the networks that must be deleted to induce each triplet / tree / cluster are labeled (i.e. deleting arc 1 induces triplet / tree / cluster 1, and similarly for arc 2).

These three networks each have a single reticulation cycle with a single reticulation node (i.e. they are level-1 networks) and three outgoing arcs. Note that the three networks differ only in the direction of two of their arcs. Note, also, that the fourth possible combination of these two arcs produces a graph with two roots, which is invalid as a phylogenetic network.

So, these three networks are all associated with the same trees, clusters and triplets. In practice, this means that any one of taxa A, B or C can be attached to the reticulation node. Any network containing such a cycle is not unique – we cannot mathematically distinguish between the three different cycle topologies.
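This ambiguity can be checked directly with a short sketch. The arc lists below are my own reconstruction from the verbal description (a root, three cycle nodes x, y, z carrying leaves A, B, C, and two arcs whose direction varies); the node names and the assignment of the labels a, b, c are assumptions, not a transcription of the figure. For each network, the code deletes one incoming arc of the reticulation node in turn and records the non-trivial cluster of the resulting tree.

```python
from itertools import product

LEAVES = {"A", "B", "C"}


def make_network(xy_arc, yz_arc):
    """Fixed arcs plus the two arcs whose direction varies (assumed layout)."""
    fixed = [("root", "x"), ("root", "z"), ("x", "A"), ("y", "B"), ("z", "C")]
    return fixed + [xy_arc, yz_arc]


NETWORKS = {
    "a": make_network(("x", "y"), ("z", "y")),  # B hangs below the reticulation
    "b": make_network(("x", "y"), ("y", "z")),  # C hangs below the reticulation
    "c": make_network(("y", "x"), ("z", "y")),  # A hangs below the reticulation
}


def roots(arcs):
    """Nodes with no incoming arc."""
    heads = {v for _, v in arcs}
    return {u for u, _ in arcs} - heads


def displayed_trees(arcs):
    """Each displayed tree, summarised by its set of non-trivial clusters."""
    incoming = {}
    for arc in arcs:
        incoming.setdefault(arc[1], []).append(arc)
    # One list of incoming arcs per reticulation node (in-degree > 1).
    choices = [ins for ins in incoming.values() if len(ins) > 1]
    trees = set()
    for kept in product(*choices):
        # Keep exactly one incoming arc per reticulation node, drop the rest.
        dropped = {a for ins in choices for a in ins} - set(kept)
        children = {}
        for u, v in arcs:
            if (u, v) not in dropped:
                children.setdefault(u, []).append(v)

        def below(node):
            if node in LEAVES:
                return frozenset([node])
            return frozenset().union(*(below(c) for c in children.get(node, [])))

        clusters = {below(n) for n in children}
        trees.add(frozenset(c for c in clusters if 1 < len(c) < len(LEAVES)))
    return trees


if __name__ == "__main__":
    for name, arcs in NETWORKS.items():
        found = sorted(sorted(c) for tree in displayed_trees(arcs) for c in tree)
        print(name, found)
    # The fourth orientation of the two variable arcs leaves node y without a parent.
    print(sorted(roots(make_network(("y", "x"), ("y", "z")))))
```

Under these assumptions all three networks report the same pair of clusters, {A, B} and {B, C}, so the input alone cannot say which of A, B or C sits below the reticulation node. The final line shows why the fourth orientation is excluded: it has two nodes without an incoming arc.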

In one sense, this indistinguishability is a mathematically "trivial" ambiguous case. However, this should not make it an under-valued point, because it is likely to have enormous impact on the biological interpretation of networks. After all, every hybridization or horizontal gene transfer potentially creates a reticulation cycle with three outgoing arcs. For example, hybridization between sister taxa will create this situation, although hybridization between non-sister taxa may not (as shown below). When this situation does occur, it will be difficult for us to identify the affected taxa from the network topology alone. This is one fundamental mathematical limitation of using trees (or their subsets such as triplets and clusters) to construct networks.


What is even worse, current computer implementations usually output only one network solution (see Albrecht et al. 2012). If a computer program outputs only a single one of a set of optimal networks, then this may be very misleading. In the case discussed here there are three optimal networks, and biologists might identify the wrong taxon as being the hybrid, depending on which of the three equally optimal networks the program chooses to output. This is an unacceptable situation, and the set of all optimal networks must be produced by each algorithm.

Finally, we may need other (biological) criteria for determining the reticulation taxon. For example, the three networks above represent three different biological scenarios. In scenarios "b" and "c", a daughter taxon apparently hybridizes with its parent taxon, whereas in scenario "a" two daughters hybridize. In other words, temporal order may be deemed to be violated in "b" and "c", thus potentially eliminating them as candidate scenarios. We need, however, to be careful about using this type of argument, as it has not previously been necessary in phylogenetics.
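One way to phrase this temporal criterion as a test is sketched below: a network is flagged if one parent of a reticulation node is itself an ancestor of the other parent. This is only one possible reading of "temporal order", and the arc lists are again my own reconstruction of the three scenarios from the text (node names x, y, z and the mapping to the labels a, b, c are assumptions).

```python
def parent_map(arcs):
    """Map each node to the set of its parents."""
    parents = {}
    for u, v in arcs:
        parents.setdefault(v, set()).add(u)
    return parents


def ancestors(arcs, node):
    """All nodes from which `node` can be reached by following arcs."""
    parents = parent_map(arcs)
    seen, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def violates_temporal_order(arcs):
    """True if some reticulation has one parent that is an ancestor of the other."""
    for node, ps in parent_map(arcs).items():
        if len(ps) == 2:
            p, q = sorted(ps)
            if p in ancestors(arcs, q) or q in ancestors(arcs, p):
                return True
    return False


# Assumed layout: the same five fixed arcs, plus two arcs of varying direction.
FIXED = [("root", "x"), ("root", "z"), ("x", "A"), ("y", "B"), ("z", "C")]
SCENARIOS = {
    "a": FIXED + [("x", "y"), ("z", "y")],  # two sister lineages hybridize
    "b": FIXED + [("x", "y"), ("y", "z")],  # a descendant crosses with an ancestor
    "c": FIXED + [("y", "x"), ("z", "y")],  # a descendant crosses with an ancestor
}

if __name__ == "__main__":
    for name, arcs in SCENARIOS.items():
        print(name, violates_temporal_order(arcs))
```

Under this reading only scenario "a" passes the check. The same caution applies to the code as to the argument itself: an apparent ancestor may simply stand in for an unsampled contemporary lineage, so a flag here is a prompt for closer inspection rather than grounds for rejection.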

References

Albrecht B., Scornavacca C., Cenci A., Huson D.H. (2012) Fast computation of minimum hybridization networks. Bioinformatics 28: 191-197.

Gambette P., Huber K.T. (2009) A note on encodings of phylogenetic networks of bounded level. Unpublished ms at: arXiv:0906.4324v1. Tue 23 Jun 2009.

Gambette P., Huber K.T. (2012) On encodings of phylogenetic networks of bounded level. Journal of Mathematical Biology [in press].