Showing posts with label Rooting. Show all posts
Showing posts with label Rooting. Show all posts

Tuesday, December 12, 2017

Using consensus networks to understand poor roots


The standard way to root a tree is by outgroup rooting. Comprehensive outgroup sampling should minimize bias, such as ingroup-outgroup long-branch attraction (IO-LBA); but this may be impossible to avoid, when ingroup and all possible outgroups are very distant from each other. Using a recent all-Santalales up-to-7-gene dataset (Su et al. 2015) as an example, I will show how bootstrap consensus networks can illuminate topological issues associated with poorly supported roots.

These analyses should be obligatory whenever critical branches lack unambiguous support.

Some background: nothing to see, all is solved

The reason, why I had to re-analyse some of the data of Su et al., was that we dared to discuss alternatives to the accepted Loranthaceae (hereafter "loranth") root in our paper on loranth pollen (Grímsson, Grimm & Zetter 2017). Mainly for the artwork, I harvested gene banks for data on loranths in 2014 (re-checked in 2016, but no-one is studying this highly interesting group anymore). Earlier studies had used different taxon and gene sets, and produced cladograms (trees without branch lengths).

To assess the systematic value of loranth pollen, we needed to know how well resolved are intra-family relationships. Many deeper branches seen in the trees were not supported. What would be the competing alternatives? Unfortunately, one of our reviewers (#1), a veteran expert on plant parasites including loranths but not phylogenetics, was tremendously alienated by our tree, notably, the non-supported parts. He refused to understand the difference between a simple tree topology and branch support. Matrices producing ambiguous signals not-rarely lead to optimized trees that do not show the best-supported topological alternatives. Hence, for mapping our pollen, we did not use our tree, but the ML bootstrap consensus network (Fig. 1; which #1 refused even to look at). Fig. 2 shows the first plot focusing on the deepest, ambiguous relationships.

Fig. 1 The basis for our paper: a ML bootstrap consensus network, rooted under the commonly accepted assumption that the monotypic Nuytsia represents the first-diverging lineage.

Fig. 2 Example of what we did. My co-authors (re-)studied loranth pollen using high-res SEM microscopy; and then we plotted the pollen on the relevant part of the ML bootstrap consensus network (Fig. 1). Green background: the three surviving root parasites of the root-parasitic grade (not supported based on our data set) in outgroup rooted trees; orange background: the palynologically odd Tupeia. Note the lack of clear signal regarding root-proximal relationships, an indication for fast ancient radiation and diversification involving both root and aerial parasitic lineages.

Reviewer #1 pointed to the recent paper of Su et al., which (according to his interpretation) resolved the intrafamily relationships. Since our tree was topologically somewhat different, and did not provide the same level of support, it had to be wrong. My argument was that the genetic data are inconclusive regarding critical branches in the loranth subtree. This is obvious from our bootstrap network (not including any outgroup, to avoid ingroup-outgroup branching artefacts), and also from the earlier molecular phylogenies, including Su et al.’s. Plus, to focus on the loranths we included the best-sampled and most-divergent plastid region, a non-coding region, which was not included in Su et al. because it is unalignable across the order.

A particularly "absurd" notion (according to #1) was that we suggested using pollen morphology to pre-define an alternative root (the best-fitting alternative according to Bayes factors; Grímsson, Kapli et al. 2017). All loranths share a unique, easy to distinguish, triangular pollen morph (Fig. 1) that can be traced back till the Eocene, except for one monotypic genus: Tupeia (antarctica) from New Zealand (ie. at the margin of their modern distribution). Tupeia’s pollen is spheroidal (Fig. 1), the basic form found all over the Santalales tree (Fig. 3). Fig. 4 shows alternative rooting scenarios using Su et al.'s loranth subtree.

Fig. 3 Schematic drawings of Santalales pollen, mapped on the preferred tree of Su et al. (cf. Grímsson, Zetter & Grimm, 2017, pp. 36–41)

Fig. 4 I used this figure in one of the rebuttal letters to explain basic evolutionary concepts triggered by different rooting scenarios using the same tree, the one preferred by Su et al. (#1, and #3, did not realize trees are optimized unrooted). Note: aerial parasitism evolved independently several times in the Santalales (Vidal-Russell & Nickrent 2007), but the triangular loranth pollen is unique. Using Bayes Factors, we established that the pollen-based root is the best-fitting for the data set we used for node dating in Grímsson, Kapli et al. (2017).

Why I lacked confidence in the Su et al. tree(s)

Su et al.’s study establishes relationships within the Santalales using a quite gappy up-to-7-gene data set (3 nuclear + 3 plastid + 1 mitochondrial genes) to revisit the placement of an originally poorly defined family with extremely long-branching members, the Balanophoraceae s.l. They found two main clades, one – the resurrected Mystropetalaceae – reconstructed as sister of the loranths (see scheme in Fig. 3). From what the authors show (but do not discuss), there are a lot of signal issues with their data:
  • The standard nucleotide tree (partition scheme unclear) in the main text differs in many critical aspects (branching patterns, root-tip length ratios, support values) from the two in the supplement (one based on removing the 3rd codon positions, and one amino acid based).
  • Many branches have high posterior-probabilities but low ML bootstrap values (BS), a pattern typical for mixed-genome oligogene data sets with intrinsic conflict.
Regarding the loranth root question:
  • The first three branches within the loranth subtree, defining a ‘root parasitic grade’, have low to high posterior probabilities (PP = 0.65 / 0.99 / 0.52) and consistently low ML bootstrap supports (BS = <50 / 53 / <50).
  • The sister clades have extremely long branches, and the putative sister clade, the Mystropetalaceae, is only covered for nuclear and mitochondrial data. All probabilistic inference methods treat missing data and gaps as “N”; ie. the terminal probability vector p equals (1,1,1,1) for p(A,C,G,T). In case the signal from the best-sampled regions is clear, missing data is no issue; if there is conflict then missing data can be a problem.
In addition, I’m generally skeptical when it comes to “basal grades” and roots inferred by very distinct outgroups. Except for the three species forming the root-parasitic grade (one in South America, two in Australasia), all loranths are aerial parasites. Loranths are a ubiquitous, highly competitive tropical-subtropical group extending into temperate climates with proper summers and little snow. They have a very extensive and relatively old pollen record across all pieces of Gondwana and Laurasia (Grímsson, Kapli et al. 2017, and references therein). The main lineages are sorted continent-wise, and the shift from root to aerial parasitism is supposed to have happened several times within the Santalales and its families. But only once in loranths? And how could only one root-parasitic lineage survive in South America, and two more in Australasia; all with but a single species?

What’s behind the ambiguous bootstrap supports

For my head-to-head with the editor and reviewers (the editor had invited a #3 reviewer, an anonymous “expert on phylogeny”), I re-analysed Su et al.’s data subset on loranths plus sister groups. By reducing the outgroup sample to the sister clades only (Misodendraceae+Schoepfiaceae, Mystropetalaceae), I increased the BS support (100 / 62 / 60) for the root parasitic grade. This indicates either that Su et al.’s analysis was not comprehensive or that IO-LBA (ingroup-outgroup long-branch attraction) is prominent.

Large outgroup samples, all other Santalales + few non-Santalales in Su et al. vs. only loranths and sister clades in my re-analysis, may alleviate LBA to some degree (eg. Sanderson et al. 2000; Stefanović, Rice & Palmer 2004). Higher support with fewer outgroups than with many in the same dataset can be an alarm flag.

The ingroup-only bootstrap consensus network (Fig. 5) shows moderate, but unchallenged support for a split between root and aerial parasites, which agrees with Su et al.’s all-Santalales tree. The low support for the first and second branch in outgroup-rooted trees relates to an ambiguous signal between the root parasites — the signal from the combined data are ambiguous as to whether Atkinsonia (the second Australasian root parasite) or Gaiadendron (the South American root parasite) is the second diverging branch. Note that the position of our (pollen-based) alternative root candidate Tupeia is also unresolved (no split with a BS > 20; compare Figs 1 and 5), which could be expected for a potentially ancestral, relatively underived taxon — a taxon literally close to the family's root.

For most other deeper branches with high PP but low BS in Su et al.’s loranth subtree, the network shows two competing topological alternatives. There is thus an intrinsic signal conflict in Su et al.’s matrix, which explains the topological differences in earlier studies using different gene and taxon sets, preferring the one or other alternative in the final cladogram.

Fig. 5 ML Bootstrap consensus network using Su et al.'s loranth data subset. A, the outgroup-inferred root; B, pollen-based root; C, a clock-based root inferred by Grímsson, Kapli et al. (2017). In contrast to Su et al.'s tree the network distinguishes between ambiguous and unambiguous relationships. Not too different from what we found using a more data-dense and divergent matrix (Fig. 1)

How to deconstruct a tree:

First step — single-gene trees and support networks

There are two main reasons for inferring single-gene trees and establishing single-gene support:
  1. It is the only way assess the level of incongruence in the gene samples.
  2. They inform about the amount of signal that each gene region contributes to the combined tree.
In case of Su et al.’s data set, it becomes clear that the combined topology is supported primarily by signal from the best-sampled (and most variable) plastid gene included in the data set: the plastid matK gene. The second, much less divergent, plastid gene (rbcL) partly reinforces the matK signal, or at least does not contradict it (except for a few odd-balls, which might be possible mix-ups during sequence generation).

It is also clear that the nuclear and plastid data fit aspect-wise, eg. both support or at least don’t reject potential clades, but not entirely. Particularly, the placements of the sister groups of the loranths within the loranth tree are unstable, ie. the position of the outgroup-inferred root.

Fig. 6 One-to-one comparison of trees inferred using the two most divergent nuclear and plastid genes included in Su et al.'s matrix: the matK (plastid) and 25S rDNA (nuclear-encoded ribosomal RNA gene; same scale). The root-parasitic grade and a loranth | sister clades split is moderately to well-supported using matK data, but not using 25S rDNA data. No plastid data were included for the assumed direct sister of loranths, the Mystropetalaceae. Note also the position of Tupeia (dark red) in both graphs, the only loranth without loranth-typical pollen.

Last, the single-gene trees and bootstrap networks illustrate a severe taxon sampling bias. The third nuclear marker, the RBS2 gene, is only covered for five ingroup taxa including Nuytsia but no other root parasite, and it fails to resolve the two aerial tribes, which are otherwise genetically distinct. Analysis of this small subset illustrates nicely the perils of taxon sampling (Fig. 7). The third plastid gene, the accD gene (7 ingroup representatives), supports Nuytsia as the first diverging lineage, but rejects the root parasite grade with equally high support (BS = 87)

Fig. 7 Some RPB2 trees, the new nuclear gene included by Su et al. (not included in our data set lacking taxonomic coverage). A, the tree including the sister clades of loranths. B, ingroup loranth-only tree, codon-positions treated as different data partitions; C, same, unpartitioned analysis. D, the "true tree" (according #1), ie. topology seen in Su et al. Green, splits in agreement with Su et al.'s tree, red, splits contrasting Su et al.'s tree.

The only mitochondrial gene included in the data set (matR) rejects the root-parasitic grade, in addition to the matK-only based Elytrantheae-Lorantheae clade seen in Su et al.’s tree, with nearly unambiguous support. However, it is in general low-divergent, and it may be that the authors relied on fragmentary (potentially pseudogene) data for some taxa. There are huge indels in the gene not found in any other angiosperm, and the two sequenced Loranthus are conspicuously distinct to the other loranths (Fig. 8).

Fig. 8 Neighbour-net based on Su et al.'s matR data. Numbers show the ML bootstrap support for (alternative) splits when sister groups are included or excluded. Note the position of the Mystropetalaceae, possible affected by IO-LBA with the Loranthinae (the latter's signal is in stark contrast to any other gene region). Two root-parasitic species are included: NuyFl = Nuytsia, GaiaPun = Gaiadendron; no matR for Tupeia.

One of the arguments of reviewer #1 was that Su et al. obtained a better resolved phylogeny because they sampled these extra gene regions. In fact, they obtained a better supported phylogeny because they added diffuse-signal or under-sampled gene regions. Hence, they enforced the dominant matK-preferred topology (high PP, low BS), while weaking locally conflicting signal from the nuclear 18S and 25S rDNA.

The only constant is that, no matter which gene region, Nuytsia is always the most distinct taxon of all species close to the root node. With respect to the very distant sister clades, IO-LBA may be inevitable. The BS consensus networks favor a closer relationship of the root parasites, and there is relatively strong signal for a root parasitic clade in the matK (Fig. 5), so that the biased outgroup-root then becomes a proximal (‘basal’) grade (Fig. 6, left).

Data effects of the LBA-missing sort also challenge the placement of the Mystropetalaceae as sister to the loranths. The plastid data establish that Misodendraceae and Schoepfiaceae (M+S) are distinct from the loranths. They provide unambiguous support for a loranth versus M+S split. Having no data on any plastid gene, the extremely long-branched Mystropetalaceae fall in line. Being very distinct from both M+S and loranths in the nuclear (Fig. 5) and mitochondrial (Fig. 8) genes, the tree places them next to the loranths, because the latter's root node and many of its members are much more distant from the all-ancestor than those of the M+S clade (compare with Su et al.'s trees; keep in mind that missing data may increase branch-lengths).

Second step—remove a taxon, or two, or a gene region

If the hypothesis of a root-parasitic grade holds, then one should be able to remove the first diverging taxon (Nuytsia in this example) from the dataset, and still have the position of the second (Atkinsonia) and third diverging (Gaiadendron) remain the same. Furthermore, the support for the corresponding branches should at best increase, and at least not decrease. Atkinsonia is relatively distinct, but Gaiadendron is not (as reflected by their pollen morphology, Fig. 2).

In the case of Su et al.’s data, removal of Nuytsia collapses the root-proximal part of the ingroup tree (Fig. 9). ML-tree inferences select a sub-optimal topology, because the signal weakens for the splits {Atkinsonia + outgroups | all other Loranthaceae} and {root parasites + outgroups | aerial parasites}. The competing alternative (BS = 23) is a first-diverging Atkinsonia-Gaiadendron clade. When both Nuytsia and Atkinsonia are removed, Gaiadendron is placed within the aerial parasites, and the Elytrantheae, a low-diverged Australasian tribe, are favored as first-diverging loranth clade (BS increases from 15 to 39; any alternative BS is < 20; see also their placements in Fig. 5).

Fig. 9 ML tree and bootstrap (top-right) consensus network using Su et al.'s data with the putative first-branching Nuytsia eliminated from the equation (see above). The tree is rooted according to Su et al.'s tree.

Equally disastrous for the support of a root parasitic grade is the elimination of matK from the combined data set. This is a major advantage of BS or support consensus networks: one can estimate whether taxon or gene sampling increases or decreases support for competing topologies; and not only the combined trees’ topologies.

Support consensus networks should be obligatory when studying non-trivial or suboptimal data sets

When it comes to extremely long-branching roots and sister groups, one should generally be careful with combined-data roots defined by outgroups (see also our discussion of the outgroup-inferred Osmundaceae root in Bomfleur, Grimm & McLoughlin 2015). But when the crucial branches in addition lack consistent support, one should actually be alarmed.

In the case of gappy oligogene (as used by Su et al.) or multigene data, one should always test the primary signal in the combined gene regions. Fast and efficient ML implementations make it very easy to infer and compare single-gene trees, and thus establish differential branch support patterns, which can be visualized and investigated using support consensus networks. As biologists working with complex systems and, likely, non-trivial evolutionary pathways, we should not be afraid of conflicting splits patterns, but deal with them!

It can also be useful to plot (tabulate) competing support values from different analysis. RAxML has an option that allows testing the direct correlation between two bootstrap (or Bayesian) tree samples, without being restricted to any particular tree. The Phangorn library for R includes functions to map trees into networks, and transfer support from networks to trees. In my experience, many nuclear and plastid incongruences are:
  1. tree-based, meaning the trees are different, but the supports for topological alternatives are not conflicting;
  2. resolution-based, meaning the nuclear data illuminate different parts of the evolutionary history compared to the plastid data;
  3. rogue-based, meaning there are only a few taxa with strongly conflicting signals in the tree.
Support consensus networks can pinpoint all three scenarios. In the case of scenarios 1 and 2, a combined analysis still works, as also in the case of scenario 3 when the rogues are eliminated (from the combined analysis, not from the study, which unfortunately still a tradition in phylogenetics)

Uploading a matrix and the optimized tree to (eg.) TreeBase is just the first step (although many TreeBase submissions include only naked trees). In the case of ambiguous branch support (BS < 80 or 85, depending on the proportion of nuclear vs. mitochondrial and plastid gene regions; PP < 1.00), much more important would be to document the bootstrap replicate and Bayesian tree samples.

Furthermore, reviewers and editors should not encourage the publishing of trees where low support is masked by cut-offs (eg. BS < 50, PP < 0.95). Which is, unfortunately, the editorial policy of the journal that published the Su et al. study, and the reason for the scarcity of studies showing support consensus or other networks that can illuminate signal issues (but see eg. Denk & Grimm 2010, Wanntorp et al. 2014, and Khanum et al. 2016, published in Taxon). The latter a common feature in plant molecular data sets, independent of the taxonomic hierarchical level.

Further reading, and data

The full details of my re-analysis of Su et al.’s data subset, including loranths and sister clades, can be found in File S6 [PDF] to Grímsson, Grimm & Zetter (2017), included in the Online Supporting Archive (9.5 MB; journal hompage/mirror]. The relevant data and analysis files can be found in subfolder "Su_el_al." of the archive.

References

Bomfleur B, Grimm GW, McLoughlin S (2015) Osmunda pulchella sp. nov. from the Jurassic of Sweden — reconciling molecular and fossil evidence in the phylogeny of modern royal ferns (Osmundaceae). BMC Evolutionary Biology 15: 126. http://dx.doi.org/10.1186/s12862-015-0400-7

Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59:351–366. — Includes neighbour-nets based on inter-individual and uncorrected p-distances, with ML branch support mapped and tabulated.

Grímsson F, Grimm GW, Zetter R (2017) Evolution of pollen morphology in Loranthaceae. Grana DOI:10.1080/00173134.2016.1261939. http://dx.doi.org/10.1080/00173134.2016.1261939

Grímsson F, Kapli P, Hofmann C-C, Zetter R, Grimm GW (2017) Eocene Loranthaceae pollen pushes back divergence ages for major splits in the family. PeerJ 5: e3373 [e-pub]. https://peerj.com/articles/3373/ Don't miss reading the peer review documentation, which includes some interesting discussion. (If you agree that the peer review should be transparent in general, then you can sign up here for fighting the windmills.)

Khanum R, Surveswaran S, Meve U, Liede-Schumann S. 2016. Cynanchum (Apocynaceae: Asclepiadoideae): A pantropical Asclepiadoid genus revisited. Taxon 65:467–486. — Includes trees, median-joining and bootstrap consensus networks [I really enjoyed working on this, but retreated from co-authorship to protest against Taxon's editorial policies.]

Sanderson MJ, Wojciechowski MF, Hu J-M, Sher Khan T, Brady SG (2000) Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Molecular Biology and Evolution 17: 782-797.

Stefanović S, Rice DW, Palmer JD (2004) Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evolutionary Biology 4: 35. http://biomedcentral.com/1471-2148/4/35

Su H-J, Hu J-M, Anderson FE, Der JP, Nickrent DL (2015) Phylogenetic relationships of Santalales with insights into the origins of holoparasitic Balanophoraceae. Taxon 64: 491-506.

Vidal-Russell R, Nickrent DL (2008) The first mistletoes: origins of aerial parasitism in Santalales. Molecular Phylogenetics and Evolution 47: 523-537.

Wanntorp L, Grudinski M, Forster PI, Muellner-Riehl AN, Grimm GW. 2014. Wax plants (Hoya, Apocynaceae) evolution: epiphytism drives successful radiation. Taxon 63:89–102. — Includes trees and a "splits rose" [The editors forced us do drop some quite nice bootstrap consensus and median-joining networks; see also this post]

Wednesday, May 21, 2014

Phylogenetics of computer viruses?


There is a difference between phylogenetics and clustering or classification. The latter processes put objects into groups based on some intrinsic features, but the former uses their intrinsic features to expresses their evolutionary history. Not all objects have an evolutionary history, even though they can all be put into groups. Furthermore, even objects that do have a history do not necessarily have an evolutionary history. Evolution involves ancestor-descendant relationships (as well as sister-group relationships), and not all of history involves ancestors and descendants.

This distinction is important for the use of phylogenetics as a metaphor for the history of non-biological objects. Outside of biology, many things are claimed to have a "phylogenetic history", including languages and most human artifacts. As I have noted before, one has to be careful when applying this metaphor (see False analogies between anthropology and biology).

One particular example that I have encountered involves the development of computer viruses and other malware (Iliopoulos et al. 2008). Metaphorically, such viruses can be seen to be phylogenetically related, because new viruses are often based on previous ones — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the metaphor is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Sorkin (1994) seems to have been the first to discuss the possibility of computer virus evolution, but the first empirical attempt to reconstruct a digital phylogeny appears to have been by Hull (1995b), who studied the Stoned computer virus (a virus that infected the boot sector of PCs between 1990 and 1995), as shown below.


Since then, phylogenetics has been a popular topic in the study of computer malware (eg. Goldberg et al. 1996; Carrera & Erdélyi 2004; Karim et al. 2005a,b; Ma et al. 2006; Wehner 2007; Walenstein et al. 2007; Wagener et al. 2008: Hayes et al. 2009; Khoo & Lió 2011; Guan et al. 2012). As noted by Webster & Malcolm (2007) these studies "present classifications of malware based on phylogenetic trees, in which the lineage of computer viruses can be traced and a 'family tree' of viruses constructed based on similar behaviors." (Note that there are other possible uses of phylogenetics related to computer programming, as exemplified by Ji et al. 2008.)

In all cases the phylogeny was produced using a distance-based clustering algorithm (ie. the tree is a phenetic one). Some of the distances are well motivated in terms of historical changes in the intrinsic attributes of the malware, but that does not necessarily make the resulting phenogram a phylogeny. So, the methods use basic clustering techniques to produce a tree, thus treating classification as phylogenetics (or phenetics as phylogenetics). This certainly clusters the objects, but there is no necessary reason for the clustering to reflect phylogeny.

Thus, a simple concept of clustering is inadequate, even though it can be used to construct a tree. A phylogenetic tree expresses the nested hierarchy formed by the shared derived character states, but not by anything else. A tree expresses nested clusters, but it is a form of "special nesting" that expresses a phylogeny. Only one form of tree is relevant to phylogenetics, and trees formed in other ways are likely to be suitable only under the simplest circumstances.

Furthermore, the general conception in these papers of a virus phylogeny as tree-like is clear. As noted by Hull (1995a):
Computer viruses evolve in complex ways not usually encountered in nature. The transplantation of large segments of computer code from one virus to another need not represent evolutionary relationship, for example. A newer virus may just represent a debugged or patched earlier version. The virus author may have deliberately incorporated parts of other viruses as a short cut, or because the plagiarized code is useful. If the virus incorporates code generating 'engines', similar code may appear in viruses with no other similarities. Structural similarities deriving from functional similarities likewise derive from several sources.
As we now know, these sorts of evolutionary events are usually found in nature, but they create reticulate histories, involving horizontal as well as vertical evolution. That is, computer virus phylogenetics involves reticulation — new computer code takes bits and pieces from various previous viruses. In particular, there is also what is called "oblique" evolution, in which there is horizontal evolution between generations. This is a characteristic of many histories involving human artifacts (see Time inconsistency in evolutionary networks), and it allows information to "time travel", so that the information available for horizontal transmission can come from the distant past as well as from the present.

So, malware evolution is not tree-like. Only two of the papers cited above seem to acknowledge this fact. Khoo & Lió (2011) were quite conventional in using splits graphs rather than unrooted trees to display their data, although they do not specify the algorithm for producing the networks. They do, however, claim that "networks were more useful for visualising short nop-equivalent code metamorphism than trees".

Goldberg et al. (1996) were more innovative, and analyzed their data using what they called a "phyloDAG", which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network). Interestingly, they note that "Beyond the computer virus realm for which it was conceived, the phyloDAG is also a plausible model for evolution of bacterial populations." Indeed, the possibility of multiple roots has been explicitly suggested for prokaryote phylogenetics (see Can networks have multiple roots?). I wouldn't doubt that it is also feasible for language history.

References

Carrera E, Erdélyi G (2004) Digital genome mapping – advanced binary malware analysis. Virus Bulletin Conference 2004.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Guan Q, Tang Y, Liu X (2012) A malware homologous analysis method based on sequence of system function. Advanced Science and Technology Letters, ASTL 15: Advanced Computer Science and Technology. Science and Engineering Research Support Society, Sandy Bay, Tasmania, Australia.

Hayes M, Walenstein A, Lakhotia A (2009) Evaluation of malware phylogeny modelling systems using automated variant generation. Journal in Computer Virology 5: 335-343.

Hull DB (1995a) Computer viruses: naming and classification. Virus Bulletin Sept: 15-17.

Hull DB (1995b) Computer viruses: naming and classification, part II. Virus Bulletin Oct: 16-17.

Iliopoulos D, Adami C, Ször P (2008) Darwin inside the machines: malware evolution and the consequences for computer security. Virus Bulletin Conference 2008.

Ji J-H, Park S-H, Woo G, Cho H-G (2008) Generating pylogenetic tree of homogeneous source code in a plagiarism detection system. International Journal of Control, Automation, and Systems 6: 809-817.

Karim ME, Walenstein A, Lakhotia A (2005a) Malware phylogeny using maximal pi-patterns. Proceedings of the EICAR 2005 Conference, pp 156-174.

Karim ME, Walenstein A, Lakhotia A, Parida L (2005b) Malware phylogeny generation using permutations of code. Journal in Computer Virology 1: 13-23.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Ma J, Dunagan J, Wang HJ, Savage S, Voelker GM (2006) Finding diversity in remote code injection exploits. Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp 53-64. ACM, New York.

Sorkin GB (1994) Grouping related computer viruses into families. Proceedings of the IBM Security ITS 1994.

Wagener G, State R, Dulaunoy A (2008) Malware behaviour analysis. Journal in Computer Virology 4: 279–287.

Walenstein A, Hayes M, Lakhotia A (2007) Phylogenetic comparisons of malware. Virus Bulletin Conference 2007.

Webster M, Malcolm G (2007) Classification of computer viruses using the theory of affordances. Second International Workshop on the Theory of Computer Viruses.

Wehner S (2007) Analyzing worms and network traffic using compression. Journal of Computer Security 15: 303-320. (arXiv:cs/0504045v1, 2007)

Wednesday, April 30, 2014

Reconstructing ancestors in a splits network?


A splits graph is an unrooted phylogenetic network (see How to interpret splits graphs). However, sometimes they are treated as being rooted networks, and under these circumstances it is assumed that they therefore represent a phylogeny. Nevetheless, it is important to recognize that a rooted splits graph does not explicitly represent a phylogeny, because reticulations in the graph represent uncertainty not genealogy (see How do we interpret a rooted haplotype network?).

A corollary to this is that reconstructing "ancestors" in a splits graph is problematic. The nodes do not necessarily represent inferred ancestors, because their actual role is to support the corners of the parallelograms formed by intersecting sets of parallel edges in the graph. Some of the nodes may, indeed, represent ancestors but there is no way to determine this from the network itself.

An example

Let's look at a specific example, taken from the paper by J. Miguel Díaz-Báñez, Giovanna Farigu, Francisco Gómez, David Rappaport & Godfried T. Toussaint (2004) El Compás flamenco: a phylogenetic analysis. Proceedings of BRIDGES Conference: Mathematical Connections in Art, Music and Science, pp. 61-70.

The authors provide an analysis of the hand-clapping patterns of the flamenco music of Andalucia, in southern Spain. There are four recognized patterns, plus the fandango pattern, and the authors use two different distance measures to assess their rhythmic similarities. They produce unrooted phylogenetic networks based on each of these distances, which turn out (on reanalysis of their data) to be NeighborNets (the authors refer to them as "SplitsTrees").

The authors ignore the fact that "it is well established that the fountain of flamenco music is the fandango", which would make the fandango the outgroup for rooting if we did wish to treat the networks as rooted. Instead, they try to "reconstruct the 'ancestral' rhythms corresponding to the nodes" by using mid-point rooting. This procedure can easily be applied to unrooted phylogenetic trees, but its application to networks is problematic because there are multiple paths through the graph, and there may thus be several points that qualify as the mid-point.


For one of their networks, shown in the first graph, the authors identify the single mid-point, and then try to reconstruct "the ancestral rhythm closest to the 'center'", based on the node closest to the mid-point. They do this by "trial and error", based on the distances from the identified "ancestral" node to all of the leaves. That is, they find the hand-clap pattern that has the required distance to all of the leaves.

The authors do not tackle this procedure for their other network. In this case, as shown in the next graph, there are two mid-point locations. While there is a node that is equally close to these two locations, which might therefore qualify as "the ancestral rhythm closest to the center", it is difficult to reconstruct its actual rhythm.


Finally, if we try to identify the "ancestral rhythm" using the node identified by the fandango outgroup, then the result is dramatically different to that produced by the mid-point method, for both networks.

Thursday, February 27, 2014

Roots and the phylogenetics of mythology


A few weeks ago I discussed the phylogenetic analysis of the tale of Little Red Riding Hood (The phylogenetics of Little Red Riding Hood). In that case, I pointed out that historical reconstructions require a rooted tree, and I discussed various possible methods for rooting the unrooted trees produced by the data analyses.

This is not the only time that phylogenetics has been applied to myths or tales. For example, d'Huy (2013a) has studied the prehistoric Polyphemus tale belonging to the European and North Amerindian areas, and d'Huy (2013b) has studied the mythological motif of the Cosmic Hunt linked to the Big Dipper constellation (typical for northern and central Eurasia and for the Americas but unknown on other continents). In the first case a binary matrix of 98 characteristics for 44 versions of the tale was used, and in the latter 93 characteristics for 47 versions. Both of these studies have rooted trees.

In the latter case, a novel method of rooting the tree was used. The unrooted tree was successively rooted with each of the likely versions of the tale as outgroup. In each case the ancestral tale (the protomyth) was reconstructed and the ancestral states of the tale's characteristics (called mythemes) were determined. The author then "selected the version that holds the majority of the wide shared mythemes (>50%) as the better root."

Unfortunately, this produced an unexpected root, as shown in the tree below. The colors in the tree refer to various geographical groupings of the tale versions.


So, I re-analyzed the data using the rooting methods that I previously applied to the Red Riding Hood analysis:
  • For the bayesian analysis, I used MrBayes (2 runs, 4 chains, 1,000,000 generations, sampling frequency 1000, 25% burnin) with a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
  • For the neighbor-joining tree I used the BioNJ algorithm in PAUP*, and found the midpoint root.
  • For the parsimony analysis, I used a 200-replicate parsimony-ratchet search via PAUP*, calculated the branch lengths of the majority-rule consensus tree with ACCTRAN optimization, and found the midpoint root.
These three alternative roots are also shown on the tree. They seem more likely than the published root.

Geographically, the root chosen by the author's method is within the red group (tales from Asia), based on the idea that "arguments in favour of localization of protypical Cosmic Hunt in Asia seem persuasive (Berezkin 2005)." Unfortunately, this a priori argument seems to have excluded any testing of the possibility that more than one version is the sister to the remaining tales — that is, only single outgroups were considered.

On the other hand, all three of the alternative roots group the tales into two major clades. For the bayesian-clock root the two clades have distinct animal motifs, a herbivore and a carnivore, respectively. These clades do not correspond to any of the three variants recognized by Berezkin (2005).

The bayesian-clock root puts the red-colored (Asia) versions of the tale into one of the two major clades, as it also does with the orange group (Africa), which makes this root more consistent with the geographical groupings — that is, all of the geographical groups are in only one of the two major clades, except for the purple group (American coast-plateau / British Columbia). Both the Parsimony and NJ roots do the same thing, but as well as the purple group they also split the pink group (northeastern America) between the two major clades, which reduces their geographical consistency compared to the bayesian-clock root.

The bayesian-clock root does not support the suggestion that the Cosmic Hunt myth originated in Asia. Indeed, the bayesian tree does not support any particular geographical location. Furthermore, the polyphyly of the purple group presents an intriguing aspect of the tale's history.

References

Yuri Berezkin (2005) The cosmic hunt: variants of a Siberian—North-American myth. Folklore 31: 79-100.

Julien d'Huy (2013a) Polyphemus (Aa. Th. 1137): a phylogenetic reconstruction of a prehistoric tale. Nouvelle Mythologie Comparée 1: 1-21.

Julien d'Huy (2013b) A cosmic hunt in the Berber sky: a phylogenetic reconstruction of a Palaeolithic mythology. Les Cahiers de l’AARS 16: 93-106.

Wednesday, December 4, 2013

The phylogenetics of Little Red Riding Hood


A couple of weeks ago we received an unexpected influx of visitors to this blog, being directed here by at article at the NBC News site. This article cited one of our blog posts (Network analysis of Genesis 1:3) as an example of the use of phylogenetic analysis in stemmatology (the discipline that attempts to reconstruct the transmission history of a written text). The NBC article itself is about a recently published paper that applies these same techniques to an oral tradition instead — the tale of Little Red Riding Hood. This paper has generated much interest on the internet, being reported in many blog posts, on many news sites, and in many twitter tweets. After all, the young lady in red has been known for centuries throughout the Old World.


Needless to say, I had a look at this paper (Jamshid J. Tehrani. 2013. The phylogeny of Little Red Riding Hood. PLoS One 8: e78871). The author collated data on various characteristics of 58 versions of several folk tales, such as plot elements and physical features of the participants. These tales included Little Red Riding Hood (known as Aarne-Uther-Thompson tale ATU 333), which has long been recorded in European oral traditions, along with variants from other regions, including Africa and East Asia (where it is known as The Tiger Grandmother), as well as another widespread international folk tale The Wolf and the Kids (ATU 123), which has been popular throughout Europe and the Middle East. As the author notes: "since folk tales are mainly transmitted via oral rather than written means, reconstructing their history and development across cultures has proven to be a complex challenge."

He produced phylogenetic trees from both parsimony and bayesian analyses, along with a neighbor-net network. He concluded: "The results demonstrate that ... it is possible to identify ATU 333 and ATU 123 as distinct international types. They further suggest that most of the African tales can be classified as variants of ATU 123, while the East Asian tales probably evolved by blending together elements of both ATU 333 and ATU 123." His network is reproduced here.


There is one major problem with this analysis: all three graphs are unrooted, and you can't determine a history from an unrooted graph. A phylogeny needs a root, in order to determine the time direction of history. Without time, you can't distinguish an ancestor from a descendant — the one becomes the other if the time direction is reversed. Unfortunately, the author makes no reference to a root, at all.

So, his recognition of three main "clusters" in his graphs is unproblematic (ATU 333; East Asian; and ATU 123 + African) although the relationship of these clusters to the "India" sample is not clear (as shown in the network). On the other hand, his conclusions about the relationships among these three groups is not actually justified in the paper itself.

Rooting the trees

So, the thing to do is put a root on each of the graphs. We cannot do this for the network, but we can root the two trees, and we can take the nearest tree to the network and root that, instead.

There are several recognized ways to root a tree in phylogenetics (Huelsenbeck et al. 2002; Boykin et al. 2010):
  1. a character transformation series (i.e. non-reversible substitution models)
  2. an outgroup
  3. mid-point rooting
  4. assume clock-like character replacement (e.g. the molecular clock).
The first one implies that we know the order in which at least some of the characters changed through time, which is not true for these folk tales. The second one requires us to know the next most closely related folk tale, which we cannot decide in this case. The third one is always possible, for any tree; and the fourth one is possible if a likelihood model has been used to model character changes. So, in this case, we can apply both of options 3 and 4.

I therefore did the following:
  • For the parsimony analysis, I imported the author's consensus tree into PAUP* (the program he used to produce it), calculated the branch lengths with ACCTRAN optimization, and found the midpoint root.
  • For the bayesian analysis, I re-ran the MrBayes analysis exactly as described by the author, except that I added a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
  • For the phylogenetic network, the neighbor-net is basically the network equivalent of a neighbor-joining tree, and so I calculated this in SplitsTree (the program the author used), and found the midpoint root.
  • Also, the strict clock version of a neighbor-joining tree is a UPGMA tree, which I calculated using SplitsTree.
The complete trees can be seen elsewhere (ParsimonyMidpoint; BayesRelaxed; NJmidpoint; UPGMA), but the figure below shows the relevant parts of the four rooted trees. As you can see, the first three analyses agree on the root location (shown at the left of each graph), with only the UPGMA tree suggesting an alternative.


Having the East Asian samples as the sister to the other tales does not match what would be expected for the historical scenario suggested by the original author from his unrooted graphs — that the East Asian tales "evolved by blending together elements of both ATU 333 and ATU 123".

Instead, this placement exactly matches an alternative theory that the author explicitly rejects: "One intriguing possibility raised in the literature on this topic ... is that the East Asian tales represent a sister lineage that diverged from ATU 333 and ATU 123 before they evolved into two distinct groups. Thus, ... the East Asian tradition represents a crucial 'missing link' between ATU 333 and ATU 123 that has retained features from their original archetype ... Although it is tempting to interpret the results of the analyses in this light, there are several problems with this theory."

The UPGMA root, on the other hand, would be consistent with the blending theory for the origin of the East Asian tales. However, this tree actually presents the African tales as distinct from ATU 123, rather than being a subset of it.

Anyway, the bottom line is that you shouldn't present scenarios without a time direction. History goes from the past towards the present, and you therefore need to know which part of your graph is the oldest part. A family tree isn't a tree unless it has a root.

References

Boykin LM, Kubatko LS, Lowrey TK (2010) Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae). Molecular Phylogenetics & Evolution 54: 687-700.

Huelsenbeck J, Bollback J, Levine A (2002) Inferring the root of a phylogenetic tree. Systematic Biology 51: 32-43.

Wednesday, September 25, 2013

How do we interpret a rooted haplotype network?


A splits graph is an unrooted phylogenetic network (see How to interpret splits graphs). It can be produced by any of several algorithms, including distance-based methods such as NeighborNet and Split Decomposition, character-based methods such as Median Networks and Parsimony Splits, and tree-based methods such as Consensus Networks and SuperNetworks.

Such graphs can also be produced by methods that conceptually modify Median Networks, such as Reduced Median Networks and Median-Joining Networks. These two methods are popular in population genetics, especially as related to Homo sapiens, where they are used as haplotype networks (or 1-step networks); and it is their use as haplotype networks that I wish to discuss here.

Haplotype networks represent the relationships among the different haploid genotypes observed in the dataset (ie. identical sequences are pooled into a single terminal). They are usually drawn unrooted, which is quite sensible for within-species data, where the root location is often unknown. However, there are occasions when a root is provided, and authors then interpret the splits graph as a directed network. This is directly analogous to starting with an unrooted phylogenetic tree and adding a root (usually via an outgroup), so that the rooted tree can be interpreted as a genealogical history. In moving from an unrooted to a rooted tree, each branch acquires a direction (away from the root), and the internal nodes become hypothetical ancestors.

However, this is problematic for all types of unrooted network. In the case of splits graphs, each edge acquires an unambiguous direction, as for a tree, but not every internal node can necessarily be interpreted as a hypothetical ancestor. How, then, do we interpret the rooted haplotype network?

An example

Let's look at a specific example, taken from the recent paper by Witas et al. (2013).


Figure 4 from this paper shows a haplotype network of four mtDNA HVR1 (hypervariable region 1 of the control region) samples from Ancient Mesopotamia (the middle Euphrates valley between 2500 BC and 500 AD), compared to contemporary samples from five different geographical regions. It shows that the ancient samples fit neatly into modern genetic variation from southern and eastern Asia, rather than from eastern Europe.

However, note that a root is also explicitly indicated. I explain below where this root comes from, but first let's concentrate on what happens if we treat the network as rooted.


This is a Median-Joining Network, and thus it is a splits graph. As such, the root provides unambiguous directions for all of the branches, based on the principle that the network must be a directed acyclic graph with only one root. This is shown by the arrows in the modified figure. Furthermore, all of the internal nodes can be interpreted as a hypothetical ancestors, except for the two reticulations in the graph, labelled A and B.

These reticulations are created by contradictory patterns involving the characters labelled 16276, 16185 and 16311. In a rooted splits graph, reticulations represent uncertainty about the order of character changes, rather than representing reticulate evolution (eg. recombination, hybridization, etc). In this case, we cannot determine whether character 16311 changes before or after the changes in characters 16185 and 16276.

So, it is important to recognize that a rooted splits graph does not explicitly represent a phylogeny, because reticulations in the graph represent uncertainty not genealogy.

The simplest interpretation of a this type of rooted splits graph is usually that the network represents a set of most-parsimonious trees, rather than a single parsimony tree. The different trees can be obtained by resolving the reticulations (ie. by deciding what order the character changes occur in). This relationship between the rooted haplotype network and a parsimony tree is shown by the following example from Jansen et al. (2002).


This is a network of 93 mtDNA control-region haplotypes from horses. It is also a Median-Joining Network, although the data were pre-processed using a Reduced Median Network. Node A6 is the root, based on equid outgroups. The solid lines indicates one of the most-parsimonious trees contained within the network — for every reticulation, one particular order of the character changes has been selected by the authors in order to postulate this particular tree. The non-chosen parts of the network are indicated by dotted lines — these are part of alternative most-parsimonious trees.

Explanation of the human mtDNA root

mtDNA is usually treated as a non-recombining locus, and so it should evolve along a tree. A rooted global tree has therefore been produced for humans, based on parsimony analysis of the mtDNA genome (Torroni et al. 2000; van Oven and Kayser 2009). Groups and subgroups of this tree have been labelled as haplotypes, such as haplotype group M shown in the top figure, and sub-haplogroups, such as M4b, M49 and M61. These are (monophyletic) clades in the mtDNA tree that have been highlighted for convenience. Parsimony analysis has been used to reconstruct the ancestral sequences in the tree (Behar et al. 2012), and these ancestral sequences can be used to assign new sequences to their appropriate place in the rooted tree (Blanco et al. 2011).

The basic limitation of this approach is that the haplogroups and sub-haplogroups are based on a non-unique parsimony tree. There are many equally parsimonious trees for the dataset, any one of which could have been chosen to define the haplogroups. In spite of this limitation, the predefined haplogroups are treated by many people as actually designating specific mitochondrial lineages, rather than merely being groups of convenience, which is what they are.

References

Behar DM, van Oven M, Rosset S, Metspalu M, Loogväli EL, Silva NM, Kivisild T, Torroni A, Villems R (2012) A "Copernican" reassessment of the human mitochondrial DNA tree from its root. American Journal of Human Genetics 90: 675-684.

Blanco R, Mayordomo E, Montoya J, Ruiz-Pesini E (2011) Rebooting the human mitochondrial phylogeny: an automated and scalable methodology with expert knowledge. BMC Bioinformatics 12: 174.

Jansen T, Forster P, Levine MA, Oelke H, Hurles M, Renfrew C, Weber J, Olek K (2002) Mitochondrial DNA and the origins of the domestic horse. Proc Natl Acad Sci USA 99: 10905-10910.

Torroni A, Achilli A, Macaulay V, Richards M, Bandelt H-J (2000) Harvesting the fruit of the human mtDNA tree. Trends in Genetics 22: 339-345.

van Oven M, Kayser M (2009) Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human Mutation 30: E386-E394.

Witas HW, Tomczyk J, Jędrychowska-Dańska K, Chaubey G, Płoszaj T (2013) mtDNA from the Early Bronze Age to the Roman period suggests a genetic link between the Indian subcontinent and Mesopotamian cradle of civilization. PLoS One 8(9): e73682.

Wednesday, September 4, 2013

Mis-interpreting phylogenetic trees


I have noted before that biologists have used various metaphors or models for phylogenetic relationships, including a chain, a tree, and a network. I, and other people, have also noted that interpreting the relationships shown by these structures is not always easy for novices, and sometimes even for experts (see Ambiguity in phylogenies).

A chain is 1-dimensional, and so interpreting its relationships is usually straightforward. However, a tree is not a simple linear concept, as it consists of a set of inter-linked chains. It is clear from the literature that a tree is a structure where many people find it easy to interpret relationships incorrectly. Here, I illustrate this with an example.

The example is taken from this paper:
L. Luca Cavalli-Sforza (1997) Genes, peoples, and languages. Proceedings of the National Academy of Sciences of the U.S.A. 94: 7719-7724.
The first illustration is Figure 1 from that paper.


In the text, the author also notes this about the figure:
The most important difference is in the position of Europe, which with neighbor joining branches out first after the splitting of Africans and non-Africans and with maximum likelihood [sic!] is the last but one.
This interpretation is incorrect, because it is the position of Oceania that differs between the two trees (not Europe), as shown in the illustration below.


In the original figure, tree (a) is rooted while tree (b) is unrooted. In order to directly compare them, we need to root the tree in (b), as shown in the first row of the illustration. Note that I have re-ordered the areas in Figure a, but I have not changed the relationships as shown by the tree. One of the most common mis-interpretations of trees is to think that the linear order (top to bottom) has some meaning (see Ambiguity in phylogenies), but it does not.

Then, in order to identify the difference between the pair of rooted trees, we simply delete each of the areas in turn, which is shown in rows 2 to 5 of my figure. Only the deletion of Oceania makes the two rooted trees identical (row 3). The deletion of Europe (row 2) does not do this, and so the position of Europe should not be identified as a key difference between the two trees.

The literature is replete with this sort of simple interpretational mistake concerning trees. The concern for those of us involved is:
What will it be like for people to interpret networks, which are sets of inter-linked trees?

Monday, January 21, 2013

EDA or post-optimality analysis of phylogenetic data?


These days, phylogeneticists usually build trees to express the evolutionary history of their samples. As part of this procedure, they also show an interest in the "quality" of their trees. This is a very vaguely defined concept, probably because it has something to do with accuracy, or correctness, which is something we can know almost nothing about. So, instead, we resort to a whole swag of other concepts, such as resolution, robustness, sensitivity and stability, which are related to precision rather than to accuracy.

We implement these "precision" ideas in many ways, including: (i) analytical procedures, such as interior-branch tests, likelihood-ratio tests, clade significance, and the incongruence length difference test; (ii) statistical procedures, such as the ubiquitous non-parametric bootstrap and posterior probabilities, the jackknife, topology-dependent permutation, and clade credibility; and (iii) non-statistical procedures, such as the decay index, clade stability, data decisiveness, and spectral signals.

Most of these are forms of what is called post-optimality analysis — the tree is first calculated and then we evaluate it. In the current issue of Bioinformatics there is a paper by Saad Sheikh, Tamer Kahveci, Sanjay Ranka and Gordon Burleigh (Stability analysis of phylogenetic trees. Bioinformatics 2013 29(2): 166-174) that provides yet another take on the same theme:
We define measures that assess the stability of trees, subtrees and individual taxa with respect to changes in the input sequences. Our measures consider changes at the finest granularity in the input data (i.e. individual nucleotides).
Basically, the idea is to see how much the input would need to be changed in order to cause a change in the tree topology. For example, the authors quantify the minimum edit distance required to create a specified Robinson-Foulds tree distance from the optimal tree, although any similar distances could be used instead. Their basic purpose was to develop a method that could be effective for very large datasets, which most of the alternatives cannot.

What this approach begs is the question as to whether a post-optimality analysis is the best approach in the first place. This type of approach assumes a tree as the basic structure, and fails to consider alternative structures that might be more appropriate for the data.

Exploratory data analysis (EDA), if performed effectively, can achieve the same result (an assessment of "stability") while at the same time revealing whether a tree is actually the best structure. It does this by evaluating the dataset directly, a priori, rather than evaluating the data relative to a tree, a posteriori. Evaluating the tree in terms of the data is not the same thing as evaluating the data independently of any tree.

Of the methods listed above, the only one that evaluates the data in a tree-independent manner is the use of spectral signals. Another approach is, of course, to use a data-display network, which provides a very convenient picture of the data, and will thus reveal whether a tree-building analysis is a good idea or not.

An example

To explore the essential difference between EDA and post-optimality analysis, we can look at one of the example datasets used by Sheikh et al. to illustrate their method.

This dataset involves sequences of 169 species of mammals, published by Meredith et al. (Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 2011 334(6055): 521-524). There are 35,603 aligned nucleotides, concatenated from 26 gene fragments. The sampling includes nearly all mammalian families, plus five vertebrate outgroups.

Meredith et al. note about their data:
Phylogenetic relations from maximum likelihood (ML) and Bayesian methods are well resolved across the mammalian tree. More than 90% of the nodes have bootstrap support of ≥ 90% and Bayesian posterior probabilities of ≥ 0.95. Amino acid and DNA ML trees are in agreement for 163 out of 168 internal nodes.
Not surprisingly, Sheikh et al. reach a similar enthusiastic conclusion:
The Mammals dataset is highly stable. There is not a single move (R = 1) possible for an edit distance of up to 530 nucleotides. Even if we place an extremely high limit of E = 1000, the biggest move possible is RF = 5. Thus, the stability measures provide an explicit guarantee that there is no move possible for E = 500 and any values of R within 1 SPR distance. This also demonstrates the power of building phylogenies from large densely sampled datasets.
However, this enthusiasm contradicts some well-known previous results. For example, Meredith et al. also note:
Several nodes that remain difficult to resolve (e.g., placental root) have variable support between studies of rare genomic changes, as well as genome-scale data sets, which suggest that diversification was not fully bifurcating or occurred in such rapid succession that phylogenetic signal tracking true species relations may not be recoverable with current methods.
A simple EDA analysis makes the situation clear, which is not done by either the bootstrap / posterior-probability approach of Meredith et al. or the edit-distance / tree-distance approach of Sheikh et al. If we stick to the simple parsimony approach of the latter (rather than the model-based approach of the former), then we can analyze the dataset with hamming distances and a NeighborNet graph.


First, the root of the mammals is not clear. The published tree places the root on the branch leading to monotremes, but in the network the outgroup involves a reticulation. This is caused by an ambiguous relationship between Echinops telfairi (Tenrecidae) and (i) the {outgroup + monotremes} and (ii) the Afrotheria. In the tree the Afrotheria is united by a "strongly supported node".

Second, the root of the placentals is very unclear. Most of the major groups of placentals form clusters, but the relationships among these clusters are very obscure. The data are bush-like within the placentals, rather than tree-like, both at the level of the four major groups (named in the graph) and within each of those groups. In the published tree, some of these subgroups are well-supported, but others involve disagreement between the DNA and amino-acid trees, while others have < 90% bootstrap support.

It is not immediately obvious that a tree-building analysis is going to be of much use for this dataset. There is certainly some "power of building phylogenies from large densely sampled datasets", but this does not automatically mean that those phylogenies will be tree-like. Evolution involves a more diverse process than that, and post-optimality analyses based on a single model may be very misleading about that diversity.

Saturday, October 6, 2012

Open questions about evolutionary networks, part 2


There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be "open questions" — ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). This blog post continues my list of some of these topics (see Part 1 and Part 3).

Separating randomness and rooting from reticulation

There are at least three quite distinct causes of incompatible patterns in a phylogenetic dataset:
  • Randomness, which is expected to create stochastic variation (such as homoplasy), but which may also be due to bias (eg. selection);
  • Rooting, with different "gene trees" being rooted in different places; and
  • Reticulation, which can have any one of several causes (eg. hybridization, HGT, recombination).
If we want an evolutionary network to display only Reticulation then we need to deal with the first two issues, either before-hand or at the same time.

I have previously discussed published examples in which several trees have been presented, from different gene segments, that differ from each other in the location of their outgroup root — eg Figures 4.7 and 4.27 of my book (Introduction to Phylogenetic Networks), and also in the Grass Phylogeny Working Group dataset (see this blog post). In at least one of these cases, there are no reticulate evolutionary events at all, merely an uncertain root. That is, a network was constructed showing putative hybridizations and yet the only evolutionary pattern in the data was that the single unrooted species tree had different roots in the different gene trees.

In all of these cases, it is difficult to present an evolutionary network, because many of the resulting reticulations reflect the differences in the outgroup roots rather than true evolutionary reticulation events. Clearly, we cannot accept a situation where incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes. This is further discussed in the next section.

Randomness refers to uncertainty in any of the relationships depicted by the tree. Stochastic variation has long been recognized in phylogenetics, and it is the principal issue that most tree-building methods try to address in their algorithms. Biologically, stochastic variation usually arises from short evolutionary intervals (represented as short branch lengths in the tree), but may also arise from inadequate tree-building models, etc. It is the problem that branch-support estimates are designed to quantify, such as bootstrap values or posterior probabilities.

In the "normal" statistical world, random data variation is assumed to be associated with estimation errors. For phylogenetic data, these might include incorrect data (eg. contamination), inappropriate sampling, and model mis-specification. Alternatively, these errors might lead to bias rather than random variation. If so, then the sources of bias should be dealt with via exploratory data analysis, and the offending information can then be corrected or deleted.

However, when we are specifically trying to study reticulate evolution, there will also be many possible biological causes of data conflicts, which are not the result of either reticulation or estimation errors, such as homoplasy (parallelism, convergence, reversal), duplication/loss, and various complex molecular activities (such as sequence inversion, duplication, and transposition). All of these issues need to be dealt with under the concept of "non-reticulation variation".

Separating reticulation-caused data conflict from non-reticulation data conflict requires a null model for reticulation. This is discussed below.

Standardizing the root

Rooting is, to my mind, a problem that has not yet been dealt with properly in phylogenetics. I find that the differences among a set of gene trees are often little more that the relative location of the root (the common ancestor). That is, the unrooted gene trees are (almost) identical, but they have been rooted in somewhat different places (often not too far from each other). In one sense, this is simply Randomness occurring with respect to the root. However, its effect can be great, because it can potentially affect all of the rest of the network, whereas Randomness in most other locations will have only local effects on the topology.

Situations where incompatibility among trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes, can be dealt with by pre-processing of the data (prior to network analysis). Here, I will make a few suggestions, just to get the ball rolling.

If we have a set of "gene trees", then problems with incompatible rooting might be dealt with using polychotomies. That is, we could try to create a set of rooted gene trees with the "same" root by deleting conflicting basal edges from the tree. For example, an algorithm might look like this:
  1. unroot all of the gene trees
  2. find the most-common root — the root location in an unrooted tree defines a split, so the most-common root will be the root-split that occurs in the largest number of trees (unless there are multiple outgroup taxa that make the ingroup non-monophyletic)
  3. any rooted gene tree consistent with that root (displays that split) can be used unmodified
  4. any gene tree with a nearby root could then be modified so that some of its edges are contracted into a ploychotomy until the unrooted tree is consistent with the common root, and the resulting less-refined tree would then be used as the rooted tree — obviously, it would be necessary to explicitly define "nearby"
  5. the remaining gene trees would then be set aside and not used in the network analysis.
This algorithm might work more often that not in practice, although it is easy to think of situations where it will be a very uncertain procedure.

If the ingroup is not monophyletic, then the biologist should fix this before proceeding with the network analysis. This is a "biological problem" of sampling, not a mathematical one — perhaps the problem arises from deep coalescence, for example. If there is no clear "most-common root" among the trees, then perhaps we could define an "average" or centroid root of some sort. We would then proceed with the rest of the method.

An alternative to this "polychotomy method" might be to use the coalescent to construct an "approximate" species tree from the multiple gene trees (there are now several methods to do this), and then in the network analysis we could allow the gene trees to differ from the species tree only with respect to the poorly supported branches in the species tree. That is, we would use the well-supported parts of the coalescent tree as a backbone common to all of the gene trees, and for the uncertain parts we would use each of the gene trees. However, I am not certain of the applicability of the coalescent to higher taxa (as opposed to closely related species).

I have often thought that duplication-loss is another potential cause of problems with the root. It is not immediately obvious how to approach this, but some suggestions have been made by Burleigh et al. (2011).

A different strategy would be to try all possible roots and see which one(s) minimize the network complexity. This might be computationally intensive, depending on the size of the dataset and the network method used. It might be necessary to restrict the roots tested to those observed among the input trees.

Null models for reticulation

Once we have the root standardized for the dataset, we are then set the task of separating reticulation-caused data conflict from non-reticulation data conflict. This requires a null model for data conflict — any data conflict that cannot be accommodated by the null model is a candidate for explanation as the result of a reticulation event.

Looking at the literature, it seems to me that the most commonly accepted null model is deep coalescence (incomplete lineage sorting) (Meng and Kubatko 2009; Kubatko and Meng 2010). For example, a maximum-likelihood method has been developed that models hybridization in the presence of deep coalescence (Kubatko 2009). One can also use the coalescent as an optimality criterion to choose among alternative networks, with lineage sorting under the coalescent as the null hypothesis (Huson et al. 2005; Buckley et al. 2006; Than et al. 2007; Lyngsø et al. 2008; Joly et al. 2009).

However, the sole use of deep coalescence effectively ignores the other non-reticulation causes of data conflict, as listed above (under Separating randomness and rooting from reticulation). Now, I suppose that it is possible that this approach will work in practice, but it seems unlikely to me that this will be so. Effectively, this approach assumes that the gene trees correspond to the true underlying coalescent trees. This is unlikely because the gene trees are inferred and therefore can be incorrect, due to the other (listed) non-reticulation causes of data conflict. Moreover, if there are multiple types of reticulation event occurring then the approach might fail. For example, if one wishes to study hybridization, then the coalescence methods assume that recombination occurs only between and not within the regions used to infer the gene trees, which is also unlikely.

So, a more comprehensive null model seems to be needed, one that includes more than simply traditional statistical randomness plus deep coalescence. The default expectation at the moment seems to be that deep coalescence occurs above the species level, so that all data sets should be tree-like, whereas the objective here is to detect the non-tree-like parts of evolutionary history.

Dealing with stochastic error and bias

In addition to null models, we may also need pre-processing to deal with stochastic error and bias. There is a limit to what can be done with a single null model, and phylogenetic data are rarely simple. Here, I make a few suggestions, once again to start some discussion.

If we have a set of "gene trees", then perhaps the most obvious approach is to delete uncertain edges. That is, they would appear as polychotomies in the gene trees. This allows refined versions of these trees to be represented in the network, rather than requiring extra edges in a network to accommodate all of them. An alternative is to weight all of the edges with respect to their data "support", with the expectation that poorly supported edges would only appear in the network if they are consistently supported across a number of the gene trees.

I think that there are two types of support that could be relevant to uncertainty: (1) classic branch support, such as bootstrap values; and (2) the set of multiple equally optimal or nearly optimal trees. These two types coincide in bayesian analysis, as it is currently implemented in phylogenetics, because in bayesian analysis the branch support is derived from the set of nearly optimal trees. I suspect that (2) may be a better idea than (1), because it expresses something about the tree itself rather than each edge alone; and it is used in the SpNet method (Nakhleh et al. 2005), for example, where each gene tree is a consensus tree of several nearly-optimal trees. The appeal of using polychotomies is that it is simple. The main arguments against it may be the work required for the calculations in methods such as maximum likelihood (both parsimony and bayesian analyses do the necessary calculations anyway), and the fact that it may create non-dense sets of triplets (Jansson and Sung 2006), for example.

Another idea might be to delete taxa that have no consistent position among the input trees. The idea here is that biologically we are looking for things like hybridization and HGT, and we are not expecting this to involve any one taxon in combination with many other taxa. Therefore, extremely uncertain positions are unlikely to reflect Reticulation but rather Randomness (or lack of information). Creating polychotomies would lose a lot of information in this situation, and so it would be better to flag these taxa as problematic, and then leave them out of the network analysis. This is basically the concept used for largest common pruned trees (or agreement subtrees), except that here we don't prune the data all the way down to a tree (see Abby et al. 2010). This also seems to be the idea behind the Dendroscope program's option to deal only with clusters that appear in a certain percentage of the trees. The problem with the Dendroscope approach, however, is that a cluster generated by HGT (say) that appears in only one tree will be ignored. It would thus be better to use the variation in position of individual taxa, rather than presence/absence of clusters.

References

Abby S.S., Tannier E., Gouy M., Daubin V. (2010) Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics 11: 324.

Buckley T., Cordeiro M., Marshall D., Simon C. (2006) Differentiating between hypotheses of lineage sorting and introgression in New Zealand Alpine cicadas (Maoricicada dugdale). Systematic Biology 55: 411-425.

Burleigh J.G., Bansal M.S., Eulenstein O., Hartmann S., Wehe A., Vision T.J. (2011) Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology 60: 117-125.

Huson D.H., Klöpper T., Lockhart P.J., Steel M.A. (2005) Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.

Jansson J., Sung W.-K. (2006) Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computational Science 363: 60-68.

Joly S., McLenachan P.A., Lockhart P.J. (2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.

Kubatko L.S. (2009) Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.

Kubatko L.S., Meng C. (2010) Accommodating hybridization in a multilocus phylogenetic network. In: Knowles L.L., Kubatko L.S. (eds) Estimating Species Trees: Practical and Theoretical Aspects, pp. 99-113. Wiley-Blackwell, Hoboken NJ.

Lyngsø R.B., Song Y.S., Hein J. (2008) Accurate computation of likelihoods in the coalescent with recombination via parsimony. Lecture Notes in Computer Science 4955: 463-477.

Meng C., Kubatko L.S. (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.

Nakhleh L., Warnow T., Linder C.R., St John K. (2005) Reconstructing reticulate evolution in species — theory and practice. Journal of Computational Biology 12: 796-811.

Than C., Ruths D., Innan H., Nakhleh L. (2007) Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535.

Wednesday, July 11, 2012

Evolutionary trees: old wine in new bottles?


Biblical scripture tells us about the dangers of putting new wine into old bottles (Matthew 9:17, Mark 2:22, Luke 5:37). Here, I will say a few words about the equal folly of putting old wine into new bottles.

A number of authors have considered the iconography used to display biological (and other) relationships, and noted that it both reflects the way biologists think and it influences their ongoing thought processes (Brace 1981; Stevens 1984; O'Hara 1992; Stevens 1994; Clark 2001; Ragan 2009; Gontier 2011; Tassy 2011). In particular, several of these authors have noted that the Darwinian revolution did not much change the way that many biologists thought about biological relationships although it did to some extent change the way they presented them. That is, they practiced the art of putting old wine (their old way of thinking, based on a linear series of relationships) into new bottles (an evolutionary tree, allegedly based on genealogical history).

Changing iconography

The following series of icons illustrates this historical process. We started with the image of a series of steps ascending from "lower" forms to "higher" ones, an idea apparently originating with Aristoteles, with ourselves near the top (eg. Llull 1304, which illustrates the Ladder of the Intellect).

From Llull (1304).
Note that Homo is between the plant & animal steps and the sky, angel & god steps.
Click to enlarge.

We then converted this to the formal Scala Naturae, or Great Chain of Being (e.g. Bonnet 1745), which eliminates the non-living forms and non-earthly beings and thus restricts itself to observed biological phenomena.

From Bonnet (1745).
Note that Homo is at the top of the chain.
Click to enlarge.

Lamarck (1809) challenged this simple linear series, using instead a branching diagram to represent transformations among biological forms, with each form transforming into another form through time. However, Lamarck's version of evolution was still essentially a transformation series among organisms, and thus his diagram was simply a slightly branched version of the Scala Naturae.

Darwin (1859) then challenged the very idea of transformation series, insisting upon both the origin of new biological forms and the extinction of some of the old forms. He used a bush as his icon, but we have always referred to it as a tree. Indeed, it is worth noting that Darwin never explicitly refers to any of his diagrams as a "tree" (see previous blog post), referring instead to "descent with modification". The diagrams were intended to represent his ideas on continuity of descent through time, and the role of speciation in increasing biodiversity and extinction in counter-balancing this (see Kutschera 2011). Their primary purpose was not the representation of phylogenetic history.

From Darwin (1859).
Click to enlarge.

Darwin's "Tree of Life" metaphor is thus quite independent of his diagrams, both published and unpublished (see Penny 2011). However, it is the one paragraph of the Origin containing this poetic imagery that seems to have been inspired people. Indeed, Wallace (1855), the co-discoverer of "Darwinian evolution", had already used a very similar image, when he noted: "the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others".

This theoretical icon was taken up by empirical scientists such as Mivart (1865) and Haeckel (1866), who then used a bush as the icon for their presentation of explicit hypotheses of genealogical relationship.

From Mivart (1865).
Note that Homo is on a side-branch.
Click to enlarge.

From Haeckel (1866).
Note that Homo is on a side-branch.
Click to enlarge.

However, this lead was quickly abandoned by many people, including notably Haeckel himself (1874), who subsequently drew trees with a distinct central trunk. They thus, in effect, re-drew the Scala Naturae as a tree rather than as a chain, the only important difference being that some of the forms appear on side-branches rather than along the main trunk. We might call this icon an implicit Scala Naturae rather than an explicit one. *

From Haeckel (1874).
Note that Homo is at the top of the central trunk.
Click to enlarge.

From Smallwood et al. (1948).
Note that Homo is at the top of the central trunk.
Click to enlarge. 

This approach can be used to put any chosen organism at the crown of the tree, not just human beings, as illustrated by Scott (1986). This is the fundamental difference from a chain — a chain is linked so that there are only two possible ends, but a tree can be drawn so that any part of the tree is at the crown.

From Scott (1986).
Click to enlarge.

This sequence of icons shows you that, indeed, one can (metaphorically) put old wine in new bottles — the Scala Naturae can be put into an evolutionary tree. This thinking can be detected throughout modern evolutionary publications (Nee 2005). Unfortunately, it re-inforces the view that evolution is a purposeful and goal-directed process, which runs counter to current scientific understanding.

In this sense, we have lost much of what Darwin tried to tell us back in 1859: that the history of life is a multi-stemmed bush not a single-stemmed tree.

Effect on evolutionary biology

Baum and Smith (2012) have noted the following:

"We do not know why it should be so, but we have learned from working with thousands of students that, without contrary training, people tend to have a one-dimensional and progressive view of evolution. We tend to tell evolution as a story with a beginning, a middle, and an end. Against that backdrop, phylogenetic trees are challenging; they are not linear but branching and fractal, with one beginning and many equally valid ends. Tree thinking is, in short, counterintuitive."

That is, the problem illustrated above is still widespread.

The effects of the distorted iconography manifest themselves in several ways in modern evolutionary biology. This topic has received considerable attention in the literature, and there are a number of very readable expositions on various parts of it (e.g. O'Hara 1992, 1997; Krell and Cranston 2004; Baum et al. 2005; Crisp and Cook 2005; Gregory 2008; Omland et al. 2008; Sandvik 2009; MacDonald and Wiley 2012). Some of the main points of potential bias are:
(i) presenting a sequence of contemporary taxa so that a main axis passes through the diagram (as I have illustrated above); (ii) the left–right ordering of the taxa at the tips (which is mistakenly interpreted as representing an evolutionary series); (iii) the selective pruning of side branches (thus making one line of evolution more prominent); (iv) the use of paraphyletic groups; and (v) the differential resolution of branches.

In particular, in a bush the relationships among monophyletic groups (clades) are equal, in the sense that each clade is the sister to some other clade and vice versa. Thus, clades cannot be "basal" or "crown", because each single clade branches from some other single clade, rather than each clade being a side-branch from a main stem. Logically, at each speciation event two new species arise, rather than one species producing an extra offshoot species. There is no main stem in an evolutionary tree, but instead there is a series of branches leading to a series of twigs, even if some of the branches do have more twigs than others.

Networks

My interest in raising these issues here is in considering whether these apparently widespread problems are likely to affect the use of reticulating phylogenetic networks as much as they do dichotomous phylogenetic trees. Since I recognize two forms of phylogenetic network being used in practice, there are two situations to evaluate when considering an answer to this question.

Unrooted networks
If the icons used for displaying biological relationships are non-genealogical "webs, nets, maps, or other basically horizontal, planar, reticulating structures" (Stevens 1984), as discussed in a previous blog post, then there seems to be little likelihood of linear distortions being produced. This is a simple by-product of the unrooted nature of the diagrams and the consequent lack of direction in the relationships (ie. the relationships between taxa are symmetrical).

Rooted networks
Any rooted, and thus explicitly directional, diagram can be subjected to linear distortions by the simple expedient of emphasizing some directions at the expense of others. In this sense, rooted reticulating networks and non-reticulating trees are prone to the same potential problems. It might be argued that reticulations make it harder to emphaszie a single direction, particularly if there are many reticulations, but this seems to be a weak argument. We will thus need to guard ourselves against the implicit Scala Naturae when employing evolutionary networks (e.g. hybridization networks, HGT networks, recombination networks; see previous blog post) just as much as when employing evolutionary trees.

* Footnote: In Haeckel's defence, I should quote from Richards (2011): "Haeckel regarded these two types of diagrams as having different purposes. The first represented ... a proper stem-tree, one highly branched. The latter diagram simply looked back from a given organism — in this case man — to its lineal progenitors. It's as if one began with the first kind of tree and traced back the series of man's direct ancestors— and this would result in that second kind of tree. Haeckel had not precipitously regressed, within the space of a few years, into a dogmatic teleologist."

References

Baum D.A., Smith S.D. (2012) Tree Thinking: An Introduction to Phylogenetic Biology. Roberts & Company Publishers, Greenwood Village, CO.

Baum D.A., Smith S.D., Donovan S.S. (2005) The tree-thinking challenge. Science 310: 979-980.

Brace C.L. (1981) Tales of the phylogenetic woods: the evolution and significance of evolutionary trees. American Journal of Physical Anthropology 56: 411-429.

Clark CA. (2001) Evolution for John Doe: pictures, the public, and the Scopes trial debate. Journal of American History 87: 1275-1303.

Crisp MD., Cook LG. (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.

Gontier N. (2011) Depicting the Tree of Life: the philosophical and historical roots of evolutionary tree diagrams. Evolution: Education and Outreach 4: 515-538.

Gregory TR. (2008) Understanding evolutionary trees. Evolution: Education and Outreach 1: 121-137.

Krell F-T., Cranston PS. (2004) Which side of a tree is more basal? Systematic Entomology 29: 279-281.

Kutschera U. (2011) From the scala naturae to the symbiogenetic and dynamic tree of life. Biology Direct 6: 33.

MacDonald T., Wiley EO. (2012) Communicating phylogeny: evolutionary tree diagrams in museums. Evolution: Education and Outreach 5: 14-28.

Nee S. (2005) The great chain of being. Nature 435: 429.

O'Hara R.J. (1992) Telling the tree: narrative representation and the study of evolutionary history. Biology and Philosophy 7: 135-160.

O'Hara RJ. (1997) Population thinking and tree thinking in systematics. Zoologica Scripta 26: 323-329.

Omland K.E., Cook L.G., Crisp M.D. (2008) Tree thinking for all biology: the problem with reading phylogenies as ladders of progress. BioEssays 30: 854-867.

Penny D. (2011) Darwin’s theory of descent with modification, versus the biblical Tree of Life. PLoS Biology 9: e1001096.

Ragan M. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Richards R.J. (2011) Images of evolution. American Scientist 99: 165-167.

Sandvik H. (2009) Anthropocentrisms in cladograms. Biology and Philosophy 24: 425-440.

Stevens P.F. (1984) Metaphors and typology in the development of botanical systematics 1690-1960, or the art of putting new wine in old bottles. Taxon 33: 169-211.

Stevens P.F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York.

Tassy P. (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Wallace AR. (1855) On the law which has regulated the introduction of new species. Annals and Magazine of Natural History (n.s.) 16: 184-196.

Image Sources

Bonnet C. (1745) Traité d'Insectologie, premier parte. Durand, Paris.

Darwin C. (1859) On the Origin of Species. John Murray, London.

Haeckel E. (1866) Generelle Morphologie der Organismen. Reimer, Berlin.

Haeckel E. (1874) Anthropogenie oder Entwickelungsgeschichte des Menschen. Engelmann, Leipzig.

Lamarck J-B. (1809) Philosophie Zoologique. Dentu et l'Auteur, Paris.

Llull R.  (1304, published 1512) Liber de Ascensu et Descensu Intellectus. Jorge Costilla, Valencia.

Mivart St G. (1865) Contributions towards a more complete knowledge of the axial sleketon in the Primates. Proceedings of the Zoological Society of London 33: 545-592.

Scott J.A. (1986) The Butterflies of North America: a Natural History and Field Guide. Stanford Uni. Press, Stanford.

Smallwood WM., Reveley IL., Bailey G.A., Dodge RA. (1948) Elements of Biology. Allyn & Bacon, Boston.