Showing posts with label Phylogeny. Show all posts

Monday, October 13, 2014

The phylogeny of plastic bag ties


Some years ago Larisa Lehmer, Bruce Ragsdale, John Daniel, Edwin Hayashi and Robert Kvalstad published a medical report about an ingested plastic bag closure caught in someone's colon (Plastic bag clip discovered in partial colectomy accompanying proposal for phylogenic plastic bag clip classification. BMJ Case Reports 2011). This sounds quite painful.


What is more interesting, though, is that the report was accompanied by a phylogenetic and taxonomic evaluation of plastic ties in general, which the authors named Occlupanids.


Note that the proposed morphological changes in the phylogeny match Cope's Rule of phyletic size increase, as discussed in a previous blog post (Steven Jay Gould was wrong).


Shortly afterwards, one of the authors, John Daniel, set up a web page with a more detailed analysis, under the guise of the Holotypic Occlupanid Research Group (HORG).

Among a lot of other interesting information, there is a revised phylogenetic analysis.


Given the data, it seems fairly clear that the genealogical relationship among these objects is reticulate, and that the trees should thus actually be networks. This follows from the simple fact that these phylogenies are rather uninformative (they are bushes showing a few character transformation series). Also, note that contemporary taxa are ancestors, so that the diagrams are more like population networks than species networks.

These ties are used for packets of sliced bread (a relatively recent invention), and so there has been an explosion of Occlupanid forms as they occupy a new adaptive zone. This is a classic instance of recent speciation that is not yet complete. Occlupanids have now reached pest proportions, except where governments have instituted eradication programmes (such as Europe, where they are no longer found).

Part of the difficulty of analysis is that the objects shown constitute only a small part of the known diversity of Occlupanids (e.g. see this photo and this one). There are a number of manufacturers, and their products constitute separate historical lineages. Morphological features have been transferred from one lineage to another, which is a classic case of reticulate history that has not been taken into account in the above phylogenies.

Indeed, the HORG page is not the only detailed web resource about bread ties — see also the now-defunct but fascinating Transactoid page.

Wednesday, October 1, 2014

A fundamental limitation of pedigrees and networks but not trees


It would be nice to think that genealogical history can be reconstructed with ease. However, this is known not to be so. In particular, being able to reconstruct an overall history from a collection of sub-histories, which can be thought of as the "building blocks", is not guaranteed.

That is, even given a complete collection of all of the sub-histories it is not necessarily possible to reconstruct a unique overall history. In other words, there can be pairs of graphs that do not represent the same evolutionary histories, but still display exactly the same collection of building blocks. ("Display" means roughly that a building block can be obtained by simply deleting some of the edges and vertices in the graph.) Mathematically, the sub-histories do not determine (or encode) the history.


For example, it is known that pedigrees cannot necessarily be reconstructed from a collection of all of the sub-pedigrees (Thatte 2008). Pedigrees are the traditional "family trees" showing the ancestry of individuals. Pedigrees differ from phylogenies in that all of the individuals have two parents (rather than possibly having a single immediate ancestor) and there are probably multiple roots (unless there is considerable inbreeding).

Phylogenetic trees, on the other hand, can be uniquely reconstructed from a collection of all of the possible sub-trees (see Dress et al. 2012). This is one of the things that makes trees valuable as a phylogenetic model: it is theoretically possible to collect enough information to construct a unique phylogenetic tree.
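To make the idea of building blocks concrete, here is a minimal Python sketch that lists the rooted triplets (3-taxon sub-trees) displayed by a small rooted tree. The nested-tuple representation, the function names and the two four-taxon trees are assumptions made purely for illustration; they are not taken from Dress et al.

```python
from itertools import combinations

def clusters(tree):
    """Return (leaf set, clusters of internal nodes) for a nested-tuple rooted tree."""
    if isinstance(tree, str):          # a leaf is just its label
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)               # the cluster of this internal node
    return leaves, found

def triplets(tree):
    """Return the set of rooted triplets (x, y, outsider), meaning xy|outsider."""
    leaves, found = clusters(tree)
    result = set()
    for trio in combinations(sorted(leaves), 3):
        for outsider in trio:
            x, y = [t for t in trio if t != outsider]
            # xy|outsider holds if some cluster contains x and y but not the outsider
            if any(x in c and y in c and outsider not in c for c in found):
                result.add((x, y, outsider))
    return result

# Two hypothetical trees on the same four taxa
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(sorted(triplets(tree_1)))
print(triplets(tree_1) == triplets(tree_2))
```

The two trees differ, and so do their triplet sets, which is the tree property described above. The networks discussed next are exactly the cases where this kind of comparison fails to tell two histories apart.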

Rooted phylogenetic networks do not, however, share this property. For some time it has been known that networks cannot necessarily be built from their building blocks, whether those blocks are rooted trees (Willson 2011) or triplets (= rooted 3-taxon trees) or clusters (= rooted sub-trees = clades) (Gambette and Huber 2012).

This is illustrated in the next figure (adapted from Huber et al.), which shows two networks at the top and below that the four trees that are displayed by both of them (by deleting one of each pair of incoming edges at the two reticulation nodes). Given these four trees we cannot reconstruct a unique network, and yet they are the only four trees associated with either network.
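The operation of "displaying" a tree can be sketched in a few lines of Python. The example below enumerates the trees displayed by a rooted network by keeping exactly one incoming edge at each reticulation node. The edge-list representation, the function name and the small one-reticulation network are hypothetical, chosen only for illustration; this is not the pair of networks from the figure. Each tree is summarized by its set of clusters, so that nodes left with a single child, or with no leaves below them, do not affect the comparison.

```python
from itertools import product

def displayed_trees(edges, leaves):
    """Return the trees displayed by a rooted network, each as a frozenset of clusters.

    edges  : list of (parent, child) pairs
    leaves : set of leaf labels
    """
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, []).append(parent)
    root = next(p for p, _ in edges if p not in parents)
    reticulations = sorted(n for n, ps in parents.items() if len(ps) > 1)

    trees = set()
    # Keep exactly one incoming edge at every reticulation node.
    for choice in product(*(parents[r] for r in reticulations)):
        kept = dict(zip(reticulations, choice))
        children = {}
        for parent, child in edges:
            if child in kept and kept[child] != parent:
                continue  # this incoming edge is deleted
            children.setdefault(parent, []).append(child)

        found_clusters = set()

        def leaves_below(node):
            if node in leaves:
                found = frozenset([node])
            else:
                found = frozenset().union(
                    *(leaves_below(c) for c in children.get(node, []))
                )
            if found:  # dead ends left by a deleted edge are ignored
                found_clusters.add(found)
            return found

        leaves_below(root)
        trees.add(frozenset(found_clusters))
    return trees

# Hypothetical network: leaf b hangs below reticulation node h, whose parents are u and v
edges = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
         ("v", "h"), ("v", "c"), ("h", "b")]
for tree in displayed_trees(edges, {"a", "b", "c"}):
    print(sorted("".join(sorted(c)) for c in tree))
```

With two reticulation nodes, each having two incoming edges, the same loop runs over four choices, which matches the four trees mentioned above. Two networks are indistinguishable by their trees whenever this function returns the same set for both.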


To make matters worse, Huber et al. (in press) have now revealed that we can't reconstruct rooted phylogenetic networks even from sub-networks. To do this they show that networks cannot necessarily be built from trinets (= rooted 3-taxon networks). Certain types of networks (e.g. level-1, level-2, tree-child) can be reconstructed (van Iersel and Moulton 2014), but Huber et al. show the example in the second figure, which shows two networks at the top and below that the four trinets that are displayed by both of them. Given these four trinets we cannot reconstruct a unique network, and yet they are the only four trinets associated with either network.


This means that "even if all of the building blocks for some reticulate evolutionary history were to be taken as the input for any given network building method, the method might still output an incorrect history." The best analogy here is Humpty Dumpty — even given all of the pieces, we literally might not be able to put him back together again. We could if he is a rooted tree, but we cannot guarantee it if he is a rooted network or pedigree.

This may not matter in practice, given that we don't yet know the circumstances under which it is possible to uniquely reconstruct networks, but it does mean that we acquire a certain degree of uncertainty as we move from "tree thinking" to "network thinking".

References

Dress A, Huber KT, Koolen J, Moulton V, Spillner A (2012) Basic Phylogenetic Combinatorics. Cambridge University Press.

Gambette P, Huber K (2012) On encodings of phylogenetic networks of bounded level. Journal of Mathematical Biology 65: 157-180.

Huber KT, van Iersel L, Moulton V, Wu T (in press) How much information is needed to infer reticulate evolutionary histories? Systematic Biology

van Iersel L, Moulton V (2014) Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology 68: 1707-1729.

Thatte BD (2008) Combinatorics of pedigrees I: counterexamples to a reconstruction problem. SIAM Journal on Discrete Mathematics 22: 961-970.

Willson SJ (2011) Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 785-796.

Wednesday, September 3, 2014

Charles Darwin and the coalescent


The full title of Charles Darwin's most famous book was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. It is important to note that this title juxtaposes the concepts of between-species variation and within-species variation (Darwin usually referred to "races" rather than to "breeds", "subspecies", etc). This was one of his major insights: the idea that there is a continuum of variation in biology through time (or, as he put it, that it is arbitrary whether variants are treated as different races or as different species).

As I recently noted, this paved the way for between-species phylogenies to be seen as directly analogous to within-species genealogies (The role of biblical genealogies in phylogenetics); previous applications of genealogies to non-humans (such as those of Buffon and Duchesne) had been explicitly restricted to within-species relationships.

This conceptual integration of within-species and between-species relationships has become explicit in modern biology by using multispecies coalescent models to integrate population genetics and phylogenetics. As noted by Reid et al. (2014):
"These models treat populations, rather than alleles sampled from a single individual, as the focal units in phylogenetic trees. The multispecies coalescent model connects traditional phylogenetic inference, which seeks primarily to infer patterns of divergence between species, and population genetic inference, which has typically focused on intraspecific evolutionary processes. The development of these models was motivated by the common empirical observation that genealogies estimated from different genes are often discordant and the discovery that, if ignored, this discordance can bias parameters of direct interest to systematists, such as the relationships and divergence times among species."
However, as specifically emphasized by Reid et al.:
"In order to reconcile discordance among gene trees and uncover true species relationships, the first gene tree/species tree models assumed that discordance is solely the result of stochastic coalescence of gene lineages within a species phylogeny ... Coalescent stochasticity, however, is not the only source of gene tree discordance. Selection, hybridization, horizontal gene transfer, gene duplication/extinction, recombination, and phylogenetic estimation error can also result in discordance."
They examined this situation by studying the fit of the multispecies coalescent model:
"to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists ...
"Our analyses suggest that poor fit to the multispecies coalescent model can mislead inference in empirical studies. In the case of recent hybridization, the consequences may be severe, as species divergences are forced to post-date gene divergences ... When topological conflict among coalescent genealogies is the result of ancient hybridization, balancing selection, or gene duplication and extinction, the consequences may be less severe."
In other words, tree-based phylogenetics is inadequate in practice because of gene flow. Within-species genealogies and between-species phylogenies intersect in the concept of a network, not a tree. That is, the multispecies coalescent needs to be based on a network model not a tree model:
"The biological processes that generate variation in gene tree topologies should be explicitly modeled, as should relevant dynamics of molecular evolution. Increasingly complex multispecies coalescent models are being implemented, but there are tradeoffs. Some examine gene duplication and extinction or migration but cannot estimate divergence times."
So, current models are inadequate. It will be interesting to see how these approaches develop to incorporate gene flow (reticulation) into what has heretofore been a tree model (modeling only ancestor-descendant relationships), as we are still in need of methods for estimating rooted evolutionary networks.

Reference

Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC (2014) Poor fit to the multispecies coalescent is widely detectable in empirical data. Systematic Biology 63: 322-333.

Wednesday, August 20, 2014

The role of biblical genealogies in phylogenetics


Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.

However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in the attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden. [Note: most children are not shown]
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.

If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). If we include the rest of the paternal lines of descent, then family genealogies can be represented as a tree. Finally, if we also include some or all of the maternal lineages, then family genealogies can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, first cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the complex genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)

This idea of genealogical chains, trees and networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See the Note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience (Hellström 2011), but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. As with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this discussion had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).

References

Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie Royale.

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Hellström N.P. 2011. The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Pax F.A. 1888. Monographische Übersicht über die Arten der Gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10: 75-241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

Thursday, July 3, 2014

Are genotype or phenotype data more tree-like?


I recently wrote a manuscript comparing the tree-likeness of phylogenetic data in biology and anthropology (see Are phylogenetic patterns the same in anthropology and biology?). While doing so, I also made a comparison of genotype and phenotype data within biology.

The comparison is based on maximum-parsimony analyses of the data, using the (ensemble) Retention Index (RI) as the measure of tree-likeness. If RI = 1 then all of the characters are compatible with the same tree, whereas if RI = 0 then none of them are pairwise compatible. As the graph shows, the genotype data are considerably less tree-like than are the phenotype data (mean RI ≈ 0.5 versus 0.7, respectively).
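For readers unfamiliar with the measure, the ensemble Retention Index is usually defined as RI = (G − S) / (G − M), where S is the total number of steps on the tree, M is the minimum possible number of steps, and G is the maximum possible number, each summed over all characters. The Python sketch below is only an illustration of that formula for unordered characters on a fixed, rooted binary tree, counting steps with the Fitch algorithm. The function names, the nested-tuple tree and the tiny two-character matrix are assumptions made up for the example; the analyses reported here used PAUP*, not this code.

```python
from collections import Counter

def fitch_steps(tree, states):
    """Return (possible root states, number of steps) for one unordered character."""
    if isinstance(tree, str):                     # a leaf: its observed state, no steps
        return {states[tree]}, 0
    left, right = tree                            # rooted binary tree assumed
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def ensemble_ri(tree, matrix):
    """matrix maps each taxon to a string of character states of equal length."""
    taxa = list(matrix)
    total_s = total_m = total_g = 0
    for i in range(len(matrix[taxa[0]])):
        states = {taxon: matrix[taxon][i] for taxon in taxa}
        counts = Counter(states.values())
        total_s += fitch_steps(tree, states)[1]       # observed steps on this tree
        total_m += len(counts) - 1                    # fewest steps on any tree
        total_g += len(taxa) - max(counts.values())   # most steps on any tree
    if total_g == total_m:
        return None                                   # RI is undefined in this case
    return (total_g - total_s) / (total_g - total_m)

# Hypothetical data: the first character fits the tree, the second conflicts with it
tree = (("a", "b"), ("c", "d"))
matrix = {"a": "11", "b": "10", "c": "01", "d": "00"}
print(ensemble_ri(tree, matrix))
```

With one character that fits the tree perfectly and one that fits as badly as possible, the ensemble value comes out halfway between the two extremes.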

It would be interesting to know whether other people have observed this pattern. If it is general, then what causes it? Are the phenotype characters being chosen (subconsciously or not) because they show nested grouping patterns (which lend themselves automatically to a tree representation)? Or do the genotype data inherently have more stochastic variation? Does this mean that we should always be using phylogenetic networks for the representation of genotype data?


You can read the manuscript if you want the details of the analyses. Briefly, the initial collections of datasets were taken from Collard et al. (Evolution and Human Behavior 27: 169-184; 2006) — the graphed data are taken from the paper as I never managed to get the original datasets from the authors. I then supplemented this information with phenotype datasets from TreeBase (total of n=31) and miscellaneous genotype datasets from the literature (n=15). All of the datasets refer to vertebrates and insects (with one phenotype dataset from spiders). My parsimony analyses used the parsimony ratchet and PAUP*.

Wednesday, June 25, 2014

Non-phylogenetic trees


I recently published a post on Evolution and timelines, in which I pointed out that presenting historical data as a timeline is a very poor way of representing an evolutionary history. Evolutionary history is much better presented as a phylogeny, which will be either a tree or a network. However, this does not mean that all histories that are presented as a tree, for example, necessarily represent a phylogeny.

I have encountered a few examples of history-as-tree that seem to have very little connection to a phylogeny. That is, the relationships among the objects are presented along the branches of a tree, but the relationships along the branches seem to be little more than a timeline. So, the whole structure is simply a series of interconnected timelines.

Consider this first example, which is a poster purporting to show for the USA:
"the evolution of jazz in its more than one hundred year history. From Archaic to Avant Garde, from blues to bebop, from radio to fusion, from spirituals to swing, from Armstrong to Zawinul, the jazz pedigree presents the diverse history and development of jazz in a clear way."

Perhaps it is the strong central trunk that gives it away as a non-phylogeny. The side-branches do group the jazz performers roughly by genre, but that is all they do. The actual title is a bit more accurate about the content — it is a "Story" rather than a phylogeny.

This poster is accompanied by a European counterpart with an even stronger central trunk. It is labeled as a "Community", but it still claims to "display the history and development of European jazz".


As another example, in 1946, the magazine P.M. published a tree by Ad Reinhardt with a sardonic view of modern American art. [Thanks to Joachim Dagg for alerting me to this example.]


At least there is no central trunk this time, but the clustering of artists along the branches seems to have less to do with phylogenetic history than with artistic genre (and satire). There was a follow-up example 15 years later, in which the sardonic humor plays much the strongest role in the relationships represented.


Finally, here is an example of a timeline that really should be represented using a phylogenetic tree. It is difficult to believe that the group of professions illustrated form a transformational series, as implied by the timeline that is actually shown. Most of the entrepreneur groups depicted actually still exist to this day, rather than being extinct, and so we have here a history of variational evolution, instead of a transformation.


Monday, June 23, 2014

Phylogenetic trivia


Phylogenetics plays no part in games like Trivial Pursuit, but the web offers more opportunities. The Fun Trivia web site, for example, offers a page on Phylogenetics. You should try it, and see how well you do.


The answers (and explanations) are quite good, but the wording of some of the questions leaves a lot to be desired.

Wednesday, May 28, 2014

Phylogenetic networks and "evolutionary networks"


Complex networks are found in all parts of biology, graphically representing biological patterns and, if they are directed networks, also their causal processes. Directed networks are currently used to model various aspects of biological systems, such as gene regulation, protein interactions, metabolic pathways, ecological interactions, and evolutionary histories.

Two types of networks can be distinguished, and this distinction seems to me to be very important. Most networks are what might be called observed networks, in the sense that the nodes and edges represent empirical observations. For example, a food web consists of nodes representing animals with connecting edges representing who eats whom. Similarly, in a gene regulation network the genes (nodes) are connected by edges showing which genes affect the functioning of which other genes. In all cases, the presence of the nodes and edges in the graph is based on experimental data. These are collectively called interaction networks or regulation networks.

However, when studying historical patterns and processes not all of the nodes and edges can be observed. So, instead, they are inferred as part of the data-analysis procedure. That is, we infer the patterns as well as the processes; and we can call these inferred networks. In this case, the empirical data may consist solely of the leaf nodes, and we infer the other nodes plus all of the edges. For example, every person has two parents, and even if we do not observe those parents we can infer their existence with confidence, as we also can for the grandparents, and so on back through time with a continuous series of ancestors. Alternatively, we may also observe some of the internal nodes of the network, such as when we do record the parents and grandparents because they are contemporaneous (i.e. their generations overlap). This type of pattern can be represented as a genealogical network, when referring to individual organisms, or a phylogenetic network when referring to groups (populations, species, or larger taxonomic groups).

What, then, are the things often referred to as "evolutionary networks" but which are clearly not phylogenetic networks? They are of the first type, the interaction networks. In an evolutionary network the observed nodes are directly connected to each other to represent some aspect of evolution. This aspect may have some component of phylogeny to it, but there is more to the study of evolution than solely phylogenetic history.

For example, directed LGT (dLGT) networks connect nodes representing contemporary organisms with edges that represent inferred lateral gene transfer. That is, the evolutionary networks show gene sharing. This is obviously related to the phylogeny of the organisms, but the network does not display the phylogeny itself. This first example (from Ovidiu Popa, Einat Hazkani-Covo, Giddy Landan, William Martin, Tal Dagan. 2011. Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Research 21: 599-609) shows "32,028 polarized lateral recipient–donor protein-coding gene transfer events" inferred from "the completely sequenced genomes of 657 prokaryote species".


The concept of a gene-sharing network as an evolutionary network has also been applied to viruses and their relatives, for example, as shown by this next diagram (from Natalya Yutin, Didier Raoult, Eugene V Koonin. 2013. Virophages, polintons, and transpovirons: a complex evolutionary network of diverse selfish genetic elements with different reproduction strategies. Virology Journal 10: 158).


The question, then, is what to make of diagrams that combine both a phylogenetic tree and this type of evolutionary network, such as is done in the Minimal Lateral Network. This next example is from linguistics rather than biology (from Johann-Mattis List, Shijulal Nelson-Sathi, Hans Geisler, William Martin. 2013. Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36: 141-150), and it superimposes the sharing network and the phylogenetic tree. (For a discussion in the context of LGT, see also Tal Dagan. 2011. Phylogenomic networks. Trends in Microbiology 19: 483-491).


In this diagram, the tree explicitly represents the phylogenetic history of the languages while the evolutionary network represents possible borrowings of words, with thicker lines representing more borrowed words. Clearly, the network also contains phylogenetic information of some sort. For example, the connection of the root of the Romance languages to English reflects the conquest of Britain by the French-speaking Normans, which modified the Old-German heritage of Old English. However, the diagram as a whole is a hybrid, rather than being a coherent phylogenetic network in the simplest sense (i.e. a reticulation network).

To see this clearly, note that the phylogenetic tree is not fully resolved and that the evolutionary network does suggest possible resolutions for several of the polychotomies, such as the relationship of Armenian and Greek, the relationship of Albanian to the Romance languages, and the relationship of the Gaelic languages to the Romance languages. So, in some cases the evolutionary network helps resolve the phylogenetic tree rather than forming a reticulating network.

It would be possible to derive a phylogenetic network from this minimal lateral network, but as it stands it is a combination of a phylogenetic tree and a so-called evolutionary network.

Wednesday, May 21, 2014

Phylogenetics of computer viruses?


There is a difference between phylogenetics and clustering or classification. The latter processes put objects into groups based on some intrinsic features, but the former uses their intrinsic features to express their evolutionary history. Not all objects have an evolutionary history, even though they can all be put into groups. Furthermore, even objects that do have a history do not necessarily have an evolutionary history. Evolution involves ancestor-descendant relationships (as well as sister-group relationships), and not all of history involves ancestors and descendants.

This distinction is important for the use of phylogenetics as a metaphor for the history of non-biological objects. Outside of biology, many things are claimed to have a "phylogenetic history", including languages and most human artifacts. As I have noted before, one has to be careful when applying this metaphor (see False analogies between anthropology and biology).

One particular example that I have encountered involves the development of computer viruses and other malware (Iliopoulos et al. 2008). Metaphorically, such viruses can be seen to be phylogenetically related, because new viruses are often based on previous ones — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the metaphor is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Sorkin (1994) seems to have been the first to discuss the possibility of computer virus evolution, but the first empirical attempt to reconstruct a digital phylogeny appears to have been by Hull (1995b), who studied the Stoned computer virus (a virus that infected the boot sector of PCs between 1990 and 1995), as shown below.


Since then, phylogenetics has been a popular topic in the study of computer malware (e.g. Goldberg et al. 1996; Carrera & Erdélyi 2004; Karim et al. 2005a,b; Ma et al. 2006; Wehner 2007; Walenstein et al. 2007; Wagener et al. 2008; Hayes et al. 2009; Khoo & Lió 2011; Guan et al. 2012). As noted by Webster & Malcolm (2007) these studies "present classifications of malware based on phylogenetic trees, in which the lineage of computer viruses can be traced and a 'family tree' of viruses constructed based on similar behaviors." (Note that there are other possible uses of phylogenetics related to computer programming, as exemplified by Ji et al. 2008.)

In all cases the phylogeny was produced using a distance-based clustering algorithm (i.e. the tree is a phenetic one). Some of the distances are well motivated in terms of historical changes in the intrinsic attributes of the malware, but that does not necessarily make the resulting phenogram a phylogeny. So, the methods use basic clustering techniques to produce a tree, thus treating classification as phylogenetics (or phenetics as phylogenetics). This certainly clusters the objects, but there is no necessary reason for the clustering to reflect phylogeny.
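To show how little such a method needs in order to produce a tree, here is a minimal Python sketch of one common distance-based clustering algorithm, UPGMA (average linkage). This is an assumed stand-in, since the cited papers use a variety of distances and algorithms; the function name, the sample labels and the distance values are hypothetical.

```python
from itertools import combinations

def upgma(distances):
    """Cluster by average linkage; distances maps (x, y) pairs to a distance.

    Each unordered pair must appear once. The result is a nested tuple.
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    size = {}                                   # current clusters and their sizes
    for pair in distances:
        for name in pair:
            size[name] = 1
    while len(size) > 1:
        # Merge the two closest clusters.
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        for z in size:
            if z != x and z != y:
                # Distance to the new cluster is the size-weighted average.
                d[frozenset((merged, z))] = (
                    size[x] * d[frozenset((x, z))] + size[y] * d[frozenset((y, z))]
                ) / (size[x] + size[y])
        size[merged] = size[x] + size[y]
        del size[x], size[y]
    return next(iter(size))

# Hypothetical pairwise distances among four malware samples
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(distances))
```

The algorithm returns a nested grouping for any set of distances, whether or not the samples share ancestor-descendant relationships, which is why a phenogram on its own is not evidence of phylogeny.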

Thus, a simple concept of clustering is inadequate, even though it can be used to construct a tree. A phylogenetic tree expresses the nested hierarchy formed by the shared derived character states, but not by anything else. A tree expresses nested clusters, but it is a form of "special nesting" that expresses a phylogeny. Only one form of tree is relevant to phylogenetics, and trees formed in other ways are likely to be suitable only under the simplest circumstances.

Furthermore, it is clear that these papers generally conceive of a virus phylogeny as tree-like. As noted by Hull (1995a):
Computer viruses evolve in complex ways not usually encountered in nature. The transplantation of large segments of computer code from one virus to another need not represent evolutionary relationship, for example. A newer virus may just represent a debugged or patched earlier version. The virus author may have deliberately incorporated parts of other viruses as a short cut, or because the plagiarized code is useful. If the virus incorporates code generating 'engines', similar code may appear in viruses with no other similarities. Structural similarities deriving from functional similarities likewise derive from several sources.
As we now know, these sorts of evolutionary events are usually found in nature, but they create reticulate histories, involving horizontal as well as vertical evolution. That is, computer virus phylogenetics involves reticulation — new computer code takes bits and pieces from various previous viruses. In particular, there is also what is called "oblique" evolution, in which there is horizontal evolution between generations. This is a characteristic of many histories involving human artifacts (see Time inconsistency in evolutionary networks), and it allows information to "time travel", so that the information available for horizontal transmission can come from the distant past as well as from the present.

So, malware evolution is not tree-like. Only two of the papers cited above seem to acknowledge this fact. Khoo & Lió (2011) were quite conventional in using splits graphs rather than unrooted trees to display their data, although they do not specify the algorithm for producing the networks. They do, however, claim that "networks were more useful for visualising short nop-equivalent code metamorphism than trees".

Goldberg et al. (1996) were more innovative, and analyzed their data using what they called a "phyloDAG", which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network). Interestingly, they note that "Beyond the computer virus realm for which it was conceived, the phyloDAG is also a plausible model for evolution of bacterial populations." Indeed, the possibility of multiple roots has been explicitly suggested for prokaryote phylogenetics (see Can networks have multiple roots?). I wouldn't doubt that it is also feasible for language history.
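
As a rough illustration of what a directed network with multiple roots looks like as a data structure, here is a minimal sketch that stores each node's list of parents. The virus names are hypothetical, and this is not the phyloDAG construction algorithm of Goldberg et al. (1996), only a picture of the kind of object it produces.

```python
def roots(parents):
    """Nodes with no parents in a directed network."""
    return sorted(n for n, ps in parents.items() if not ps)

def reticulations(parents):
    """Nodes with more than one parent, ie. where lineages merge."""
    return sorted(n for n, ps in parents.items() if len(ps) > 1)

# Hypothetical viruses: v3 borrows code from both v1 and v2, and
# v1 and v2 have no known common ancestor in the sample.
parents = {
    "v1": [],
    "v2": [],
    "v3": ["v1", "v2"],
    "v4": ["v3"],
}

print(roots(parents))          # nodes that start a lineage
print(reticulations(parents))  # nodes that combine lineages
```

A tree is the special case in which there is exactly one root and no node has more than one parent.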

References

Carrera E, Erdélyi G (2004) Digital genome mapping – advanced binary malware analysis. Virus Bulletin Conference 2004.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Guan Q, Tang Y, Liu X (2012) A malware homologous analysis method based on sequence of system function. Advanced Science and Technology Letters, ASTL 15: Advanced Computer Science and Technology. Science and Engineering Research Support Society, Sandy Bay, Tasmania, Australia.

Hayes M, Walenstein A, Lakhotia A (2009) Evaluation of malware phylogeny modelling systems using automated variant generation. Journal in Computer Virology 5: 335-343.

Hull DB (1995a) Computer viruses: naming and classification. Virus Bulletin Sept: 15-17.

Hull DB (1995b) Computer viruses: naming and classification, part II. Virus Bulletin Oct: 16-17.

Iliopoulos D, Adami C, Ször P (2008) Darwin inside the machines: malware evolution and the consequences for computer security. Virus Bulletin Conference 2008.

Ji J-H, Park S-H, Woo G, Cho H-G (2008) Generating pylogenetic tree of homogeneous source code in a plagiarism detection system. International Journal of Control, Automation, and Systems 6: 809-817.

Karim ME, Walenstein A, Lakhotia A (2005a) Malware phylogeny using maximal pi-patterns. Proceedings of the EICAR 2005 Conference, pp 156-174.

Karim ME, Walenstein A, Lakhotia A, Parida L (2005b) Malware phylogeny generation using permutations of code. Journal in Computer Virology 1: 13-23.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Ma J, Dunagan J, Wang HJ, Savage S, Voelker GM (2006) Finding diversity in remote code injection exploits. Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp 53-64. ACM, New York.

Sorkin GB (1994) Grouping related computer viruses into families. Proceedings of the IBM Security ITS 1994.

Wagener G, State R, Dulaunoy A (2008) Malware behaviour analysis. Journal in Computer Virology 4: 279-287.

Walenstein A, Hayes M, Lakhotia A (2007) Phylogenetic comparisons of malware. Virus Bulletin Conference 2007.

Webster M, Malcolm G (2007) Classification of computer viruses using the theory of affordances. Second International Workshop on the Theory of Computer Viruses.

Wehner S (2007) Analyzing worms and network traffic using compression. Journal of Computer Security 15: 303-320. (arXiv:cs/0504045v1, 2007)

Monday, March 24, 2014

Trees, treemaps and networks

Hierarchically arranged information has traditionally been represented as a tree. However, this is not the only way that this information can be pictured. As noted by Manuel Lima (Visualization Metaphors: Old & New):
As one of the most hailed methods of modern information visualization, the treemap has truly become an epitome of the recent growth of the field and one of the most widespread methods for visualizing hierarchies.
Isabel Meirelles (Design for Information: An Introduction to the Histories, Theories, and Best Practices Behind Effective Information Visualizations. Rockport Publishers, 2013) provides this illustration as an example of the different ways to represent hierarchies:


So, treemaps display the tree information as a set of nested rectangles — each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. The main advantage of using a map as a representation is that the size and colour of the rectangles can be used to represent other information about each tree leaf. (Note: This treemap concept should not be confused with Mike Charleston's program TreeMap, which maps the relationships between two phylogenetic trees, nor with MLTreemap, which maps an unidentified DNA sequence onto a phylogenetic tree.)
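
The nesting can be sketched in a few lines. The following is a minimal "slice-and-dice" layout, assuming a tree written as nested lists and rectangles sized by the number of leaves they contain; the function names and the example tree are my own, and real treemap software offers other tiling schemes.

```python
def leaf_count(tree):
    """Number of leaves under a node; a leaf is any non-list value."""
    if not isinstance(tree, list):
        return 1
    return sum(leaf_count(child) for child in tree)

def treemap(tree, x, y, w, h, vertical=True, out=None):
    """Slice-and-dice layout: returns {leaf: (x, y, width, height)}."""
    if out is None:
        out = {}
    if not isinstance(tree, list):
        out[tree] = (x, y, w, h)
        return out
    total = leaf_count(tree)
    used = 0
    for child in tree:
        n = leaf_count(child)
        if vertical:
            # Split the rectangle left-to-right, then alternate direction.
            treemap(child, x + w * used / total, y, w * n / total, h, False, out)
        else:
            # Split the rectangle top-to-bottom, then alternate direction.
            treemap(child, x, y + h * used / total, w, h * n / total, True, out)
        used += n
    return out

# Hypothetical tree ((A, B), C) drawn on a 6 x 4 canvas.
layout = treemap([["A", "B"], "C"], 0, 0, 6, 4)
for leaf, rect in layout.items():
    print(leaf, rect)
```

Here the branch (A, B) gets a rectangle that is then tiled by A and B, while C gets the remaining rectangle. Colour or size could be driven by any other per-leaf value instead of the leaf count.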

Modern treemaps were developed in 1991 by Ben Shneiderman, who has conveniently provided a description of the history and initial development of the idea (Treemaps for space-constrained visualization of hierarchies). Not unexpectedly, this idea has been adopted in biology. For example, taxonomic hierarchies are sometimes represented using a treemap, such as in BioNames (which displays the taxonomic groups recognised by the Index to Organism Names database), and the Natural Science Museum of Barcelona (which allows interactive access to the database records via a taxonomic hierarchy). It has also been used to display the gene ontology associated with gene expression data from microarray studies (Visualization and analysis of microarray and gene ontology data with treemaps).

In addition, it has been suggested that treemaps could be used to represent phylogenetic trees (Using treemaps to visualize phylogenetic trees. 6th International Symposium on Biological and Medical Data Analysis, 2005. Lecture Notes in Computer Science 3745: 283-293); and there is an associated computer program. An example is shown below, in which the rectangles are coloured by their taxonomy — the circles highlight two sequences that are misplaced in the tree (ie. their tree location does not match their taxonomy).


This approach to displaying phylogenies has not really caught on (ie. phylogeneticists have stuck to the "node-link" layout). The treemap approach works best with a fixed-level hierarchy, such as the taxonomic hierarchy or the gene ontology hierarchy. In phylogenetics, on the other hand, branch lengths are variable, so that there is no fixed-level hierarchy. Treemaps work well for displaying information about groups that might be recognized in the tree, but not for the tree itself.

Nevertheless, similar methods were suggested long before the invention of computers (two early examples are noted by Manuel Lima, in the blog post linked above). Indeed, we end up with a treemap if we simply cut slices out of the tree, as shown by the next picture (taken from Isabel Meirelles' book), which shows Maximilian Fürbringer's tree of bird relationships from 1888 (published in Untersuchungen zur Morphologie und Systematik der Vögel). On the left is the side view of the tree, and on the right are three slices through the tree branches (as viewed from above). This produces a circular treemap rather than a rectangular one, which is admittedly a less efficient use of the visualization space.


Finally, we can consider the relationship of these ideas to phylogenetic networks. A network is not a nested hierarchy, but instead involves a collection of over-lapping sets. This can be represented as a Venn diagram, for example, but not as a treemap. This form of visualization has also been a long-standing suggestion in phylogenetics. The final picture shows Georg August Goldfuss' "system of animals" from 1817 (published in Ueber die Entwicklungsstufen des Thieres). It is a set of nested egg-shaped sets, expressing his ideas about affinity relationships, with one set over-lapping several of the others, representing a non-nested series of relationships. There is nothing new under the sun!


Monday, March 3, 2014

Has phylogenetics reached its apogee?


Few people had heard of phylogenetics before 1970. It was during that decade that explicit methods for constructing phylogenetic trees came to prominence, although such methods had first appeared in the late 1950s. These methods appeared first in systematics, based on parsimony (1970s), and then in genetics, based on likelihood (1980s). These days, phylogenetics is seen as ubiquitous in biology, but it is interesting to consider whether this idea can be quantified.

Joseph Hughes (2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12: 178) had a go at this by trying to extract information from the PubMed bibliographic database. Here, I have expanded on this approach.

I searched PubMed for the string phylogen*, thus including words like "phylogeny" and "phylogenetics", as well as unusual variations on these words. I searched the full bibliographic record (including the abstract), and I also restricted the search to the Title field. I did this for every calendar year from 1970 to 2012 inclusive (the 2013 data are currently still incomplete in the database).


The results are shown in the first graph, and the second graph shows the details of the title search alone. The data are expressed as a percentage of the total number of PubMed records for each year.
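
For anyone wanting to repeat the exercise, the calculation can be sketched as follows. This assumes the standard PubMed field tags [Title] and [dp] (date of publication); the exact form of the queries used for the graphs is not recorded here, so the search strings are a plausible reconstruction, and the counts in the example are hypothetical rather than real PubMed results.

```python
def pubmed_terms(year):
    """Search strings for one calendar year (an assumed reconstruction)."""
    return {
        "all": f"phylogen* AND {year}[dp]",            # any field, incl. abstract
        "title": f"phylogen*[Title] AND {year}[dp]",   # Title field only
        "total": f"{year}[dp]",                        # every record for the year
    }

def percentage(hits, total):
    """Hits as a percentage of all records for the same year."""
    return 100.0 * hits / total

terms = pubmed_terms(1999)
print(terms["title"])

# Hypothetical counts, for illustration only.
example = percentage(hits=150, total=10000)
print(example)
```

Expressing the hits as a percentage of the yearly total matters because the database itself grows every year, so raw counts would rise even if interest in phylogenetics were constant.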


So, fewer than 2% of the current papers in biology mention phylogenetics in their title or abstract. This does not, of course, mean that the remaining papers do not mention the topic at all, as they could do so under some other name (eg. "evolutionary tree", "genealogy", etc), or do so in a way that does not make it into the abstract. Still, it seems to me that this is a rather low number.

The erratic nature of the data before 1975 is probably a by-product of the quality of the PubMed data for that time. However, the clear upper asymptote in the data this century is not artifactual, but real. The average maximum value for the "All" data is ~1.54%, reached in 2009, while the average for "Title only" is ~0.17%, reached in 2004. This seems to imply that phylogenetics has now saturated the market, and is as ubiquitous as it will be, unless something new comes along to change it.

The initial rise in usage of the phylogenetic methods coincided with the release of computer programs that implemented them. Wagner78 was released for mainframe computers in 1978, followed by Phylip in 1980. Phylip was the first to be ported to microcomputers; but it was the release of the PC version of PAUP (v. 2.4) in December 1985 that came to dominate the next 10 years. Hennig86, the successor to Wagner78, was released in 1988.

However, the rapid growth in usage coincided with the growth of molecular genetics. The patent applications for PCR were filed in 1985, and the first paper based on it was also published that year. The technology started to be used for human diagnostics during 1986, and PCR became a basic research tool in molecular biology from c.1989. (Science selected PCR as the major scientific development of 1989.) The journal Molecular Biology and Evolution was founded in 1983, and Molecular Phylogenetics and Evolution in 1992.

The inflection point in the graph is c.1999, which indicates where the slow-down in growth occurred. Coincidentally, it was in 1999 that the Journal of Molecular Evolution announced that it would henceforth exclude molecular phylogenetics (and research on the origin of life), except in cases that have "a special significance and impact." Phylogenetics was now seen as a tool of evolutionary analysis rather than an end in itself.

By this stage, bayesian methods were being proposed, and MrBayes was released in 2001, rapidly becoming the predominant program. However, this was simply a transformation of the existing methodology, rather than being a major new component of data analysis in the way the very first programs were. Furthermore, the rise in usage of genome data seems also to be a transformation, rather than a major addition to data collection the way sequence data were.

Thus, it took 30 years (c. 1978–2008) for the phylogenetics revolution to be complete. Mind you, it had already taken about 100 years from 1859 for quantitative methods to first be proposed.

Thursday, February 27, 2014

Roots and the phylogenetics of mythology


A few weeks ago I discussed the phylogenetic analysis of the tale of Little Red Riding Hood (The phylogenetics of Little Red Riding Hood). In that case, I pointed out that historical reconstructions require a rooted tree, and I discussed various possible methods for rooting the unrooted trees produced by the data analyses.

This is not the only time that phylogenetics has been applied to myths or tales. For example, d'Huy (2013a) has studied the prehistoric Polyphemus tale belonging to the European and North Amerindian areas, and d'Huy (2013b) has studied the mythological motif of the Cosmic Hunt linked to the Big Dipper constellation (typical for northern and central Eurasia and for the Americas but unknown on other continents). In the first case a binary matrix of 98 characteristics for 44 versions of the tale was used, and in the latter 93 characteristics for 47 versions. Both of these studies have rooted trees.

In the latter case, a novel method of rooting the tree was used. The unrooted tree was successively rooted with each of the likely versions of the tale as outgroup. In each case the ancestral tale (the protomyth) was reconstructed and the ancestral states of the tale's characteristics (called mythemes) were determined. The author then "selected the version that holds the majority of the wide shared mythemes (>50%) as the better root."

Unfortunately, this produced an unexpected root, as shown in the tree below. The colors in the tree refer to various geographical groupings of the tale versions.


So, I re-analyzed the data using the rooting methods that I previously applied to the Red Riding Hood analysis:
  • For the bayesian analysis, I used MrBayes (2 runs, 4 chains, 1,000,000 generations, sampling frequency 1000, 25% burnin) with a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
  • For the neighbor-joining tree I used the BioNJ algorithm in PAUP*, and found the midpoint root.
  • For the parsimony analysis, I used a 200-replicate parsimony-ratchet search via PAUP*, calculated the branch lengths of the majority-rule consensus tree with ACCTRAN optimization, and found the midpoint root.
These three alternative roots are also shown on the tree. They seem more likely than the published root.
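
The midpoint rooting used for the neighbor-joining and parsimony trees is simple enough to sketch. The following is a minimal illustration, not the PAUP* implementation: it finds the longest leaf-to-leaf path in an unrooted tree and reports where the halfway point falls. The tree, its branch lengths and the function names are hypothetical.

```python
def build(edges):
    """Unrooted tree as {node: {neighbour: branch_length}}."""
    tree = {}
    for u, v, length in edges:
        tree.setdefault(u, {})[v] = length
        tree.setdefault(v, {})[u] = length
    return tree

def path_lengths(tree, start):
    """Distance and predecessor of every node, measured from `start`."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        node = stack.pop()
        for nbr, length in tree[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + length
                prev[nbr] = node
                stack.append(nbr)
    return dist, prev

def midpoint_root(tree):
    """Return (u, v, d): the root lies on edge u-v, at distance d from u."""
    leaves = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    # The leaf farthest from any starting leaf is one end of the longest path.
    dist, _ = path_lengths(tree, leaves[0])
    a = max(leaves, key=dist.get)
    dist, prev = path_lengths(tree, a)
    b = max(leaves, key=dist.get)
    half = dist[b] / 2
    # Walk back from b towards a until the next step would pass halfway.
    node = b
    while dist[prev[node]] > half:
        node = prev[node]
    u = prev[node]
    return u, node, half - dist[u]

# Hypothetical four-taxon tree with internal nodes X and Y.
tree = build([("A", "X", 1.0), ("B", "X", 3.0), ("X", "Y", 2.0),
              ("C", "Y", 1.0), ("D", "Y", 4.0)])
print(midpoint_root(tree))
```

In this example the longest path runs from B to D (length 9), so the root is placed on the internal branch, 4.5 units from each of those two tips. The method assumes roughly clock-like branch lengths, which is why it is only one of several options.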

Geographically, the root chosen by the author's method is within the red group (tales from Asia), based on the idea that "arguments in favour of localization of protypical Cosmic Hunt in Asia seem persuasive (Berezkin 2005)." Unfortunately, this a priori argument seems to have excluded any testing of the possibility that more than one version is the sister to the remaining tales — that is, only single outgroups were considered.

On the other hand, all three of the alternative roots group the tales into two major clades. For the bayesian-clock root the two clades have distinct animal motifs, a herbivore and a carnivore, respectively. These clades do not correspond to any of the three variants recognized by Berezkin (2005).

The bayesian-clock root puts the red-colored (Asia) versions of the tale into one of the two major clades, as it also does with the orange group (Africa), which makes this root more consistent with the geographical groupings: all of the geographical groups are in only one of the two major clades, except for the purple group (American coast-plateau / British Columbia). Both the parsimony and NJ roots do the same thing, but they also split the pink group (northeastern America), as well as the purple group, between the two major clades, which reduces their geographical consistency compared to the bayesian-clock root.

The bayesian-clock root does not support the suggestion that the Cosmic Hunt myth originated in Asia. Indeed, the bayesian tree does not support any particular geographical location. Furthermore, the polyphyly of the purple group presents an intriguing aspect of the tale's history.

References

Yuri Berezkin (2005) The cosmic hunt: variants of a Siberian—North-American myth. Folklore 31: 79-100.

Julien d'Huy (2013a) Polyphemus (Aa. Th. 1137): a phylogenetic reconstruction of a prehistoric tale. Nouvelle Mythologie Comparée 1: 1-21.

Julien d'Huy (2013b) A cosmic hunt in the Berber sky: a phylogenetic reconstruction of a Palaeolithic mythology. Les Cahiers de l’AARS 16: 93-106.

Wednesday, January 22, 2014

Blogs about phylogenetics


I have occasionally been asked about what blogs currently exist in phylogenetics, because there seem to be very few. There are blogs in related areas, such as phyloinformatics, evolutionary biology, and systematics, but very few blogs dedicated primarily to phylogenetics (not just occasionally mentioning it).

Below is a list of the current and former blogs that I know about. In each case I have provided basic information taken from the blog itself. Please let me know about any suitable blogs that have been missed. [Updated 15 October 2014]


Current General Blogs


The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis. This blog is about the use of networks in phylogenetic analysis, as a replacement for (or an adjunct to) the usual use of trees. This topic has received considerable attention in the biological literature, not least in microbiology (where horizontal gene transfer is often considered to be rampant) and botany (where hybridization has always been considered to be common). It has also received increasing attention in the computational sciences.

Contributors: David Morrison, Steven Kelk, Leo van Iersel, Mike Charleston, Jesper Jansson
Started: 25 February 2012


TreeThinkers

TreeThinkers is a blog devoted to phylogenetic and phylogeny-based inference. We aim to use it as a place to discuss recent research and methods; to ask and answer questions; and serve as a general resource for news and trivia in phylogenetics. Although the blog is associated with the Bodega workshop, we welcome posts and participation from the entire phylogenetics community.

Contributors: Bastien Boussau, Gideon Bradburd, Jeremy Brown, Rich Glor, Tracy Heath, David Hillis, Sebastian Höhna, Luke Mahler, Mike May, Brian Moore, Samantha Price, Peter Wainwright
Editor: Bob Thomson
Started: 2 October 2012


Open Tree of Life

The tree of life links all biodiversity through a shared evolutionary history. This project will produce the first online, comprehensive first-draft tree of all 1.8 million named species, accessible to both the public and scientific communities. Assembly of the tree will incorporate previously-published results, with strong collaborations between computational and empirical biologists to develop, test and improve methods of data synthesis. This initial tree of life will not be static; instead, we will develop tools for scientists to update and revise the tree as new data come in.

Contributors: Robin Blom, Karen Cranston, Karl Gude, Mark Holder, Rosemary Keane, Rick Ree
Started: 8 April 2012


EvoPhylo

Evolution, phylogenetics, bioinformatics, stuff.

Contributor: Dave Lunt
Started: 30 January 2008


The Bayesian Kitchen

Statistical inference and evolutionary biology. Undoubtedly, since its introduction in phylogenetics in the late 90's, Bayesian inference has become an essential part of current applied statistical work in evolutionary sciences. However, there are still many problems, computational, theoretical and even foundational. After ten years of applied Bayesian work in phylogenetics and in evolutionary genetics, I feel the need to step back and re-think the whole thing.

Contributor: Nicolas Lartillot
Started: 24 December 2013


Phylogenetics...

Musings on eukaryote evolution.

Contributor: Marko Prous
Started: 31 December 2013



Current Program Blogs


Phylogenetic Tools for Comparative Biology

This web-log chronicles the development of new tools for phylogenetic analyses in the phytools R package. Unless you are reading a very recent page of the blog, I recommend that you install the latest CRAN version of phytools (or latest beta release) before attempting to replicate any of the analyses of this site. That is because the linked functions may be archived, and very likely have been replaced by newer versions.

Contributor: Liam Revell
Started: 11 December 2010 (at Blogspot)


Osiris Phylogenetics

Accessible and reproducible phylogenetics using the Galaxy workflow system.

Contributor: Todd Oakley
Started: 7 September 2012



Announces the introduction of new tools for phylogenetic analyses in the Beast 2 package, as well as discussing usage issues with the current version, along with tips and tricks.

Contributor: Remco Bouckaert
Started: 18 March 2014



Blogs Currently in Limbo


Dechronization

Dechronization is authored by evolutionary biologists interested in the development and application of methods for estimating phylogeny and making phylogeny-based inferences. The goal of the blog is to provide a forum for discussion of the latest research and methods, while also providing anecdotes, tidbits of natural history, and other related information.

Contributors: Rich Glor, Luke Harmon, Brian Moore, Tom Near, Dan Rabosky, Liam Revell
Started: 29 April 2008      Last post: 6 June 2011


CYPHY - Cybertaxonomy and Phylogenetics

Mostly harmless pointing at things pertaining to cybertaxonomy and phylogenetics.

Contributor: Matt Yoder
Started: 6 November 2007      Last post: 23 February 2011


Phylogeny etc.

Meditations on phylogenetic inference.

Contributor: Bruce Rannala
Started: 6 March 2014      Last post: 6 March 2014


Fish Phylogenetics

I created this new blog to share thoughts on work from my research group on the phylogenetics and evolutionary biology of fishes. This will provide a forum to share insight about the studies that we publish, discuss important scientific aspects of fish diversity, reflect on my experiences teaching ichthyology (the study of fishes), and to comment and review contributions by other researchers.

Contributor: Tom Near
Started: 23 August 2012      Last post: 15 September 2012


Taxonomy Phylogeny

Taxonomies group organisms according to phenotype, while phylogenetic systems groups organisms according to shared evolutionary heritage.

Contributor: ???
Started: 1 January 2008      Last post: 31 December 2010


Phylogenetic Geek

A bag of info on phylogenetics.

Contributor: ???
Started: 5 August 2011      Last post: 16 September 2011

Monday, January 20, 2014

Faux phylogenies II


It is possible to produce a phylogeny of any group of objects that vary in their intrinsic characteristics, and where those characteristics can be inferred to vary through time. I have previously reported some examples of a Tree of Life where "life" has been interpreted very broadly, to include legendary figures, cartoon animals, pokémon, and dragons (see Faux phylogenies). Here, I broaden the scope even further.

Phylogeny of taste

This first example comes from the July-August 1998 edition of the Annals of Improbable Research (vol. 4, no. 4), in which Joe Staton published an article entitled Tastes like chicken? It contains the following tetrapod phylogeny onto which has been mapped what they taste like. Note that Homo sapiens is included.


Phylogeny of breakfast

Following the taste theme, Nash Turley works on community phylogenetics, and this has led him to contemplate the phylogenetics of his own breakfast. This vegetarian feast contains 15 species in 11 families.


Insect blog phylogenetics

Moving on to cultural evolution, Morgan Jackson has investigated how insect blogs are related to each other. His phylogenetic analysis of entomology blogs was based on blog morphology, physiology, geography, ecology and behaviour. It produced the following tree, onto which has been plotted the insect families concerned.


Evolution blog phylogenetics

In a similar vein, when the blogger known as Psi Wavefunction hosted the 20th Carnival of Evolution, this was summarized as a phylogenetic tree. The tree was produced by the simple expedient of aligning the URL addresses of the Carnival submissions and performing a parsimony analysis, based on treating the letters as amino acid codes. I can't believe that it worked.


Android bubble shooter games

Finally, Megafouna Software has produced a phylogeny of Android bubble shooter games, based on a small set of their features.


Thursday, December 19, 2013

Is rate variation among lineages actually due to reticulation?


Non-congruence among characters has traditionally been attributed solely to so-called vertical evolutionary processes (parent to offspring), which can be represented in a phylogenetic tree. For example, phenotypic incongruence was originally attributed solely to homoplasy (convergence, parallelism, reversal). For molecular data this could be modeled with DNA substitutions and indels, along with allowance for variable rates in different genic regions (e.g. invariant sites, or the well-known gamma model of rate variation).

This approach was not all that successful, and so the substitution models were made more complex, by allowing different evolutionary rates in different branches of the tree (e.g. substitutions are more or less common in some parts of the tree compared to others). For many researchers this is still as sophisticated as their phylogenetic models get (Schwartz & Mueller 2010), allowing for a relaxed molecular clock in their model rather than imposing a strict clock.

There is, however, a fundamental limitation to trying to make any one model more sophisticated: the more complex model will probably fit the data better but it might be fitting details rather than the main picture. Consider the illustration below. There is a lot of variation among these six animals and yet they are all basically the same. If I wish to devise a model to describe them, do I need a sophisticated model that describes all the nuances of their shape variation, or do I need a simple model that recognizes that they are all five-pointed stars? The answer depends on my purpose — if I wish to identify them to class then it is the latter, if I wish to identify them to species then it might be the former.


Vertical process models

This is relevant to phylogenetics. For example, if I wish to estimate a species tree from a set of gene trees, do I need a complex model that deals with all of the evolutionary nuances of the individual gene trees, or a simpler model that ignores the details and instead estimates what the trees have in common? It has been argued that the latter will be more useful under these circumstances. On the other hand, if I am studying gene evolution itself, I may be better off with the former.

So, adding things like rate variation among lineages (and also rate variation along genes) will usually produce "better fitting" models. However, this is fit to the data, and the fit between data and model is not the important issue, because this increases precision but does not necessarily increase accuracy.

Therefore, modern interest is in changing the fundamentals of the model, rather than changing its details. There are many possible causes of gene-tree incongruence, and maybe these should be in the model in order to increase accuracy.

For example, there has been interest in adding other vertical processes to the tree-building model, most notably incomplete lineage sorting (ILS) and gene duplication-loss (DL). ILS means that gene trees are not expected to exactly match the species tree, but will vary stochastically around that tree, with probabilities that can be calculated using the coalescent. DL means that gene copies appear and disappear during evolution, so that gene sequence variation is due to hidden paralogy as well as to orthology.
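
The simplest case of the ILS calculation can be written down directly. For a rooted three-species tree ((A,B),C), the standard coalescent result is that a gene tree matches the species tree with probability 1 - (2/3)exp(-t), where t is the length of the internal branch in coalescent units, and each of the two discordant topologies has probability (1/3)exp(-t). The sketch below just evaluates that formula; the function name is my own, and it ignores everything else (such as estimation error in the gene trees).

```python
import math

def gene_tree_probs(t):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units.
    """
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

for t in (0.0, 0.5, 1.0, 3.0):
    probs = gene_tree_probs(t)
    print(t, round(probs["((A,B),C)"], 3))
```

When the internal branch is very short, all three topologies are nearly equally likely, so substantial gene-tree incongruence is expected even without any horizontal process.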

ILS has been modeled by being integrated into a more sophisticated DNA substitution model (see the papers in Knowles & Kubatko 2010). Originally, DL was dealt with at the whole-gene level (Slowinski & Page 1999; Ma et al. 2000), but there have been recent attempts to integrate this into the DNA substitution models as well (Åkerborg et al. 2009; Rasmussen & Kellis 2012). These models are not yet widely used, and so most published empirical species trees still rely on modeling incongruence using rate variation among branches.

Horizontal process models

However, this whole approach restricts the phylogenetic model to vertical processes alone. It is entirely possible that the sequence variation that is being attributed to rate variation among branches is actually being caused by horizontal evolutionary processes, such as recombination, hybridization, introgression or horizontal gene transfer (HGT). For example, an influx of genetic material from outside a lineage could be mis-interpreted as an increase in the rate of substitutions and indels within that lineage. That is, long branches might represent introgression (or HGT) rather than in situ rate variation. If this is true then we would be modeling the wrong thing.

There has been little explicit discussion of this point in the literature. Syvanen (1987) seems to have been among the first. However, his premise was that the molecular clock is ultimately correct (and that "the basic observation has been that different macromolecules yield roughly the same phylogenetic picture"), and he was arguing that HGT does not necessarily violate the clock. Our modern perspective is, of course, that a strict clock is unlikely unless it has been demonstrated, and that genes are incongruent as often as they are congruent.

Recent models for ILS and DL have started to broach this issue, by adding reticulation to their underlying models. Rather oddly, this has usually been described as:
  • ILS + hybridization (Meng & Kubatko 2009; Kubatko 2009; Joly et al. 2009; Bloomquist & Suchard 2010; Yu et al. 2011; Marcussen et al. 2012; Jones et al. 2013; Yu et al. 2013); and
  • DL + HGT (Mirkin et al. 2003; Górecki 2004; Hallett et al. 2004; Csürös & Miklós 2006; Doyon et al. 2010; Tofigh et al. 2011; Bansal et al. 2012; Sjöstrand et al. 2012).
This pairwise association seems to reflect historical accident, rather than any actual mathematical difference in procedure — the gene-tree incongruence patterns are essentially the same for hybridization, introgression and HGT, as well as recombination. In the mathematical models, all we can really talk about is "reticulation" — it is up to the biologist to determine the nature of the horizontal process in each case.

Conclusion

The point here is essentially the same one that I made in a previous post (Resistance to network thinking). Currently, phylogenetics is approached in a very conservative manner. The "old way" is the best way, and things change very slowly. The currently popular phylogenetic models are simply variants of the same models that have been used for 30 years. Temporal rate variation (among lineages) and spatial rate variation (along genes) have been added to the original model from the 1970s, but not yet more complex vertical processes (ILS or DL), and not yet horizontal processes. For these, specialist programs need to be used.

Essentially, all variation in branch length is still attributed to homoplasy and rate variation, rather than considering the myriad other biological processes that will produce the same apparent phenomena. With this attitude we might be getting more precise models but not necessarily more accurate ones.

References

Åkerborg Ö, Sennblad B, Arvestad L, Lagergren J (2009) Simultaneous bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences of the USA 106: 5714-5719.

Bansal MS, Alm EJ, Kellis M (2012) Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 28: i283-i291.

Bloomquist EW, Suchard MA (2010) Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Systematic Biology 59: 27-41.

Csürös M, Miklós I (2006) A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. Lecture Notes in Computer Science 3909: 206-220.

Doyon J-P, Scornavacca C, Gorbunov KY, Szöllösi GJ, Ranwez V, Berry V (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Lecture Notes in Computer Science 6398: 93-108.

Górecki P (2004) Reconciliation problems for duplication, loss and horizontal gene transfer. In: Bourne PE, Gusfield D (editors). Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, pp. 316-325. ACM Press, New York.

Hallett M, Lagergren J, Tofigh A (2004) Simultaneous identification of duplications and lateral transfers. In: Bourne PE, Gusfield D (editors). Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, pp. 347-356. ACM Press, New York.

Joly S, McLenachan PA, Lockhart PJ (2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Knowles LL, Kubatko LS (editors) (2010) Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, Hoboken NJ.

Kubatko L (2009) Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.

Ma B, Li M, Zhang L (2000) From gene trees to species trees. SIAM Journal on Computing 30: 729-752.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Meng C, Kubatko LS (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.

Mirkin BG, Fenner TI, Galperin MY, Koonin EV (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evolutionary Biology 3: 2.

Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research 22: 755-765.

Schwartz RS, Mueller RL (2010) Variation in DNA substitution rates among lineages erroneously inferred from simulated clock-like data. PLoS One 5: e9649.

Sjöstrand J, Sennblad B, Arvestad L, Lagergren J (2012) DLRS: gene tree evolution in light of a species tree. Bioinformatics 28: 2994-2995.

Slowinski J, Page RDM (1999) How should species phylogenies be inferred from sequence data? Systematic Biology 48: 814-825.

Syvanen M (1987) Molecular clocks and evolutionary relationships: possible distortions due to horizontal gene flow. Journal of Molecular Evolution 26: 16-23.

Tofigh A, Hallett M, Lagergren J (2011) Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 517-535.

Yu Y, Barnett RM, Nakhleh L (2013) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.

Monday, December 16, 2013

Phylogenetics, ecologist style


Many of us are familiar with how a phylogeneticist, systematist or evolutionary biologist constructs a phylogenetic tree. However, ecologists apparently do it differently. Scott Chamberlain explains this procedure in one of his blog posts (Networks phylogeny):
There were about 500 species to make a phylogeny for, including birds and insects, and many species that were bound to end up as large polytomies. I couldn't in reasonable time make a molecular phylogeny for this group of species, so I made one ecologist style.
That is, I:
  • Created a topology using Mesquite software from published phylogenies, then
  • Got node age estimates from timetree.org (p.s. Wish I could use the new datelife.org, but there isn't much there quite yet), then
  • Used the bladj function in Phylocom to stretch out the branch lengths based on the node estimates.
Unfortunately, this process can't all be collected in an R script.
He then describes this process in more detail, which he hopes "makes it more reproducible". Here is his final tree (produced by FigTree).


This is an interesting bioinformatic solution to a biological problem, when empirical data collection has failed. I am not sure that I can recommend its widespread use, though.

Wednesday, December 4, 2013

The phylogenetics of Little Red Riding Hood


A couple of weeks ago we received an unexpected influx of visitors to this blog, directed here by an article at the NBC News site. This article cited one of our blog posts (Network analysis of Genesis 1:3) as an example of the use of phylogenetic analysis in stemmatology (the discipline that attempts to reconstruct the transmission history of a written text). The NBC article itself is about a recently published paper that applies these same techniques to an oral tradition instead: the tale of Little Red Riding Hood. This paper has generated much interest on the internet, being reported in many blog posts, on many news sites, and in many tweets. After all, the young lady in red has been known for centuries throughout the Old World.


Needless to say, I had a look at this paper (Jamshid J. Tehrani. 2013. The phylogeny of Little Red Riding Hood. PLoS One 8: e78871). The author collated data on various characteristics of 58 versions of several folk tales, such as plot elements and physical features of the participants. These tales included Little Red Riding Hood (known as Aarne-Uther-Thompson tale ATU 333), which has long been recorded in European oral traditions, along with variants from other regions, including Africa and East Asia (where it is known as The Tiger Grandmother), as well as another widespread international folk tale The Wolf and the Kids (ATU 123), which has been popular throughout Europe and the Middle East. As the author notes: "since folk tales are mainly transmitted via oral rather than written means, reconstructing their history and development across cultures has proven to be a complex challenge."

He produced phylogenetic trees from both parsimony and bayesian analyses, along with a neighbor-net network. He concluded: "The results demonstrate that ... it is possible to identify ATU 333 and ATU 123 as distinct international types. They further suggest that most of the African tales can be classified as variants of ATU 123, while the East Asian tales probably evolved by blending together elements of both ATU 333 and ATU 123." His network is reproduced here.


There is one major problem with this analysis: all three graphs are unrooted, and you can't determine a history from an unrooted graph. A phylogeny needs a root, in order to determine the time direction of history. Without time, you can't distinguish an ancestor from a descendant — the one becomes the other if the time direction is reversed. Unfortunately, the author makes no reference to a root, at all.

So, his recognition of three main "clusters" in his graphs is unproblematic (ATU 333; East Asian; and ATU 123 + African), although the relationship of these clusters to the "India" sample is not clear (as shown in the network). On the other hand, his conclusions about the relationships among these three groups are not actually justified in the paper itself.

Rooting the trees

So, the thing to do is put a root on each of the graphs. We cannot do this for the network, but we can root the two trees, and we can take the nearest tree to the network and root that, instead.

There are several recognized ways to root a tree in phylogenetics (Huelsenbeck et al. 2002; Boykin et al. 2010):
  1. a character transformation series (i.e. non-reversible substitution models)
  2. an outgroup
  3. mid-point rooting
  4. assume clock-like character replacement (e.g. the molecular clock).
The first one implies that we know the order in which at least some of the characters changed through time, which is not true for these folk tales. The second one requires us to know the next most closely related folk tale, which we cannot decide in this case. The third one is always possible, for any tree with branch lengths; and the fourth one is possible if a likelihood model has been used to model character changes. So, in this case, we can apply both options 3 and 4.
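Mid-point rooting (option 3) is simple enough to sketch in a few lines: find the two tips that are farthest apart on the unrooted tree, and place the root halfway along the path between them. The code below is only a minimal illustration under my own assumptions (an edge-list representation, made-up taxon names and branch lengths); it is not how PAUP* or SplitsTree implement the procedure.

```python
from collections import defaultdict


def farthest_leaf(adj, start):
    """Return (leaf, distance, path) for the leaf farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if len(adj[node]) == 1 and dist > best[1]:
            best = (node, dist, path)
        for neighbour, length in adj[node]:
            if neighbour != parent:
                stack.append((neighbour, node, dist + length, path + [neighbour]))
    return best


def midpoint_root(edges):
    """Locate the mid-point root of an unrooted tree.

    `edges` is a list of (node_u, node_v, branch_length) tuples.
    Returns (u, v, d): the root lies on edge u-v, at distance d from u.
    """
    adj = defaultdict(list)
    lengths = {}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
        lengths[frozenset((u, v))] = w

    # Two passes find the longest tip-to-tip path (the tree "diameter").
    any_leaf = next(n for n in adj if len(adj[n]) == 1)
    leaf_a, _, _ = farthest_leaf(adj, any_leaf)
    _, diameter, path = farthest_leaf(adj, leaf_a)

    # Walk along that path until half of its length has been covered.
    half = diameter / 2.0
    walked = 0.0
    for u, v in zip(path, path[1:]):
        w = lengths[frozenset((u, v))]
        if walked + w >= half:
            return u, v, half - walked
        walked += w


if __name__ == "__main__":
    # Hypothetical four-tip tree with one long branch leading to tip C.
    example = [("A", "x", 1.0), ("B", "x", 1.0), ("x", "y", 1.0),
               ("C", "y", 4.0), ("D", "y", 1.0)]
    print(midpoint_root(example))
```

Note that the method relies entirely on the branch lengths, so a single long branch (such as the one leading to C in the example) will attract the root.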

I therefore did the following:
  • For the parsimony analysis, I imported the author's consensus tree into PAUP* (the program he used to produce it), calculated the branch lengths with ACCTRAN optimization, and found the midpoint root.
  • For the bayesian analysis, I re-ran the MrBayes analysis exactly as described by the author, except that I added a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
  • For the phylogenetic network, the neighbor-net is basically the network equivalent of a neighbor-joining tree, and so I calculated this in SplitsTree (the program the author used), and found the midpoint root.
  • Also, the strict clock version of a neighbor-joining tree is a UPGMA tree, which I calculated using SplitsTree.
The complete trees can be seen elsewhere (ParsimonyMidpoint; BayesRelaxed; NJmidpoint; UPGMA), but the figure below shows the relevant parts of the four rooted trees. As you can see, the first three analyses agree on the root location (shown at the left of each graph), with only the UPGMA tree suggesting an alternative.
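To make the strict-clock idea concrete, here is a minimal sketch of UPGMA clustering. It repeatedly joins the two closest clusters, places the new node at half their distance (so that all tips are equidistant from the root), and averages the distances to the remaining clusters weighted by cluster size. The taxon names and distances are invented for illustration, and this is not the SplitsTree implementation.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a rooted, ultrametric tree by UPGMA.

    names:     list of tip labels
    distances: dict mapping (label_a, label_b) to a pairwise distance
    Returns (tree, root_height), where tree is a nested tuple of labels.
    """
    # Each active cluster maps to (number of tips, height of its node).
    info = {name: (1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in distances.items()}

    while len(info) > 1:
        # Join the closest pair of clusters.
        a, b = min(combinations(info, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2.0
        size_a, _ = info.pop(a)
        size_b, _ = info.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for other in info:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        info[merged] = (size_a + size_b, height)

    tree, (_, root_height) = next(iter(info.items()))
    return tree, root_height


if __name__ == "__main__":
    # Hypothetical clock-like distances among four tips.
    tips = ["A", "B", "C", "D"]
    dist = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
            ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0}
    print(upgma(tips, dist))
```

Because the root is a by-product of the clock assumption, UPGMA will place it wrongly whenever rates differ markedly among lineages, which is one possible reason for it to disagree with the other three analyses.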


Having the East Asian samples as the sister to the other tales does not match what would be expected for the historical scenario suggested by the original author from his unrooted graphs — that the East Asian tales "evolved by blending together elements of both ATU 333 and ATU 123".

Instead, this placement exactly matches an alternative theory that the author explicitly rejects: "One intriguing possibility raised in the literature on this topic ... is that the East Asian tales represent a sister lineage that diverged from ATU 333 and ATU 123 before they evolved into two distinct groups. Thus, ... the East Asian tradition represents a crucial 'missing link' between ATU 333 and ATU 123 that has retained features from their original archetype ... Although it is tempting to interpret the results of the analyses in this light, there are several problems with this theory."

The UPGMA root, on the other hand, would be consistent with the blending theory for the origin of the East Asian tales. However, this tree actually presents the African tales as distinct from ATU 123, rather than being a subset of it.

Anyway, the bottom line is that you shouldn't present scenarios without a time direction. History goes from the past towards the present, and you therefore need to know which part of your graph is the oldest part. A family tree isn't a tree unless it has a root.

References

Boykin LM, Kubatko LS, Lowrey TK (2010) Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae). Molecular Phylogenetics & Evolution 54: 687-700.

Huelsenbeck J, Bollback J, Levine A (2002) Inferring the root of a phylogenetic tree. Systematic Biology 51: 32-43.