Wednesday, February 6, 2013

Is there a philosophy of phylogenetic networks?

In some previous blog posts I have discussed the role of phylogenetic networks in science (Are phylogenetic networks as scientific as trees?), particularly in terms of Description, explanation and prediction in phylogenetics. In this post I will look at the philosophy of phylogenetic networks, in terms of whether there is a strong basis for treating the mathematical analyses as having biological relevance.

This is an important point, because there are theoretically an infinite number of ways to mathematically analyze a set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. For example, there is a big difference between a mathematical summary of a set of numbers and any biological interpretation of that summary. The mode, for instance, is a neat mathematical measure of the central location of a biological dataset that also nominates one of the biological objects represented by that dataset, while the mean is an estimate of the central location that rarely describes any biological object at all. So, a mode describes biology directly while a mean does not necessarily do so.
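The contrast between the mode and the mean can be seen in a small sketch. The petal counts below are made-up illustrative data, not taken from any study; the point is only that the mode is always one of the observed values, while the mean need not be.

```python
from statistics import mean, multimode

# Hypothetical data: number of petals counted on ten flowers.
petal_counts = [5, 5, 5, 5, 6, 5, 4, 5, 8, 5]

modes = multimode(petal_counts)  # most frequent value(s); always observed values
average = mean(petal_counts)     # arithmetic mean; need not be an observed value

print("mode(s):", modes)
print("mean:", average)
print("every mode is an observed value:", all(m in petal_counts for m in modes))
print("mean is an observed value:", average in petal_counts)
```

Here the mode points at flowers that actually exist in the sample, whereas no flower has the mean number of petals.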

Given that there seem to be two quite different uses for phylogenetic networks, there are likely to be two different philosophical bases. The first of these is easier to deal with than the second one.

Data-display networks

Data-display networks are usually unrooted, and are intended to display the major patterns of character variation in a dataset. There is no necessary implication that any of these patterns are due to the evolutionary history of the organisms concerned, although it is very likely that many of the patterns will reflect that history, either directly or indirectly. I have therefore repeatedly emphasized the role of these networks in Exploratory Data Analysis (EDA).

This means that the obvious philosophical basis for data-display networks is the same as for EDA. There is a strong mathematical basis for EDA and this is considered to have direct relevance to biological studies. EDA has been explored in a number of works, both in general (eg. Tukey 1977; Hartwig & Dearing 1979; Tufte 1983, 1997; Ellison 2001; Behrens & Yu 2003; Young et al. 2006) and also within phylogenetics (eg. Bandelt 2005; Wägele & Mayer 2007; Morrison 2010). These can be consulted for further information.

The mathematical basis of EDA is to summarize the main characteristics of a dataset in an easily digested form, usually with graphs, without using an explicit statistical model or having formulated an a priori hypothesis. EDA is thus promoted as a counterpoint to confirmatory data analysis (ie. statistical hypothesis testing). The mathematics is not rigid, although various tools have been developed over more than a century. EDA is as relevant to biology as it is to all subjects where data are collected and analysed.

Evolutionary networks

Evolutionary networks, on the other hand, are rooted networks intended to elucidate phylogenetic history. Unlike phylogenetic trees, evolutionary networks explicitly allow for reticulation events (horizontal evolution) as well as descent from parent to offspring (vertical evolution). They are therefore usually seen as a logical generalization of phylogenetic trees.

So, the obvious philosophical basis for evolutionary networks is the same as for phylogenetic trees. However, this inference is not as clear as we might like it to be. For phylogenetic trees there is a rationale for treating the mathematical tree diagram as a representation of evolutionary history; but it is harder to apply the same rationale to evolutionary networks.

The three logical steps to inference using phylogenetic trees are outlined in the figure.

First, we start with some genotypic data, which we transform into a mathematical summary (a directed acyclic graph, or DAG) via some quantitative model. Each of these models has an explicit mathematical and/or philosophical basis; for example, maximum likelihood has a well-established mathematical foundation, as does Bayesian analysis. However, there is no necessary biological foundation to these quantitative models, and they are simply convenient mathematical summaries, just like the mean. (Indeed, the mean is the maximum-likelihood estimate of the central location of a set of numbers when the errors are assumed to be normally distributed.)
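The parenthetical remark about the mean can be made explicit. As a sketch, assume (this assumption is not stated above) that the observations $x_1, \dots, x_n$ are independent draws from a normal distribution with unknown location $\mu$ and fixed variance $\sigma^2$. The log-likelihood and its maximizer are then:

```latex
\begin{align*}
\ell(\mu) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}
             \exp\!\left(-\frac{(x_i-\mu)^{2}}{2\sigma^{2}}\right)
           = -\frac{n}{2}\log\!\left(2\pi\sigma^{2}\right)
             - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}, \\
\frac{d\ell}{d\mu} &= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.
\end{align*}
```

The second derivative is $-n/\sigma^{2} < 0$, so this stationary point is a maximum. The result depends on the chosen model: under a Laplace (double-exponential) error model, for example, the maximum-likelihood estimate of location is the median instead. The estimate is a property of the model, not of the biology.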

The second step is to provide a biological basis for further inference. This is the importance of Willi Hennig in the history of phylogenetics — he provided the logical inference that a divergent mathematical tree can be treated as a representation of the gene or character history, because the tree-like patterns are formed from a nested series of shared derived character states (synapomorphies). That is, the mathematical summary can be logically inferred to represent a biological concept, the character history.
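The idea of a nested series of shared derived character states can be illustrated with a minimal sketch. It assumes that character polarity is already known and that each character is summarized by the set of taxa showing its derived state; the taxa, the character names and the `is_nested` helper are hypothetical, introduced only for illustration.

```python
from itertools import combinations

def is_nested(characters):
    """Return True if the derived-state taxon sets form a nested hierarchy.

    `characters` maps a character name to the set of taxa showing its
    derived state. The sets are nested when every pair is either disjoint
    or one contains the other.
    """
    for a, b in combinations(characters.values(), 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Hypothetical taxa A-D; each character lists the taxa sharing the derived state.
tree_like = {
    "c1": {"A", "B", "C", "D"},
    "c2": {"A", "B"},
    "c3": {"C", "D"},
    "c4": {"A"},
}

# c2 and c3 overlap in taxon B without one containing the other.
conflicting = {
    "c1": {"A", "B", "C", "D"},
    "c2": {"A", "B"},
    "c3": {"B", "C"},
}

print(is_nested(tree_like))
print(is_nested(conflicting))
```

The first set of characters can be drawn as a single diverging tree, while the second cannot, because taxon B shares one derived state with A and another with C.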

In the third step we infer that a set of gene and/or character histories will, when combined in some way, also represent the organismal history. That is, we infer that gene histories represent organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

So, there is a philosophy to the use of trees for phylogenetic inference, involving three steps (mathematical, logical, practical). There may be mis-estimation of the evolutionary history in practice, of course, perhaps through mis-estimation of the trees or non-representative gene samples, but we cannot expect any method to be perfect. We simply accept that the method we have is the best one we can find, and that it provides a logical basis for inference.

The question is: how do we apply this philosophy to evolutionary networks?

It is sometimes argued that a network is a set of overlapping (partly incompatible) trees. For example, each genetic locus might show a tree-like evolutionary history, but this history might not be the same as that of any other locus in the same organism. If we adopt this viewpoint then we could consider it unproblematic to use the same philosophy as for trees. That is, at step 1 we produce a set of trees, at step 2 we infer these to represent a set of gene histories, and at step 3 we combine the histories. The only important difference would thus be at step 3, where we combine the genotypic trees in a way that allows for reticulation in the organismal history, rather than insisting that the organismal history be strictly tree-like.
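The view of a network as a set of overlapping trees can be sketched as follows. The example assumes a rooted network stored as a mapping from each node to its list of parents, where a reticulation node is simply a node with two parents; the node names and the `displayed_trees` helper are hypothetical. Choosing one parent for every reticulation node yields one tree, and enumerating all such choices yields what are sometimes called the trees displayed by the network.

```python
from itertools import product

# Hypothetical rooted network, stored as child -> list of parents.
# Node "H" is a reticulation: it has two parents, "X" and "Y".
network = {
    "X": ["root"],
    "Y": ["root"],
    "A": ["X"],
    "H": ["X", "Y"],
    "B": ["Y"],
}

def displayed_trees(parents):
    """Yield one child -> parent mapping per way of resolving reticulations."""
    children = list(parents)
    # For each node pick exactly one of its parents; nodes with a single
    # parent contribute only one option, so only reticulations branch.
    for choice in product(*(parents[c] for c in children)):
        yield dict(zip(children, choice))

trees = list(displayed_trees(network))
for tree in trees:
    print(tree)
```

With a single reticulation node there are two resolutions, one in which H descends from X and one in which it descends from Y. The sketch does not suppress nodes left with a single child, which a fuller treatment would normally do.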

This is an issue that was debated in the late 1970s and early 1980s, when cladists first tried to come to grips with reticulations in a cladogram (eg. Bremer & Wanntorp 1979; Funk 1981, 1985; Humphries 1983; Nelson 1983; Wagner 1983; Wanntorp 1983). It has resurfaced occasionally since then (eg. Skála & Zrzavy 1994; Brower et al. 1996; Lienau & DeSalle 2009), with the consensus apparently being that for reticulating phylogenies this argument is acceptable.

However, it has also been argued that an evolutionary network is not simply a collection of trees. It is often contended, especially by those people dealing with prokaryotes (eg. Doolittle 1999, 2009; Bapteste et al. 2009, 2012), that there is no underlying tree-like structure in much of organismal history — biological history is an anastomosing plexus, instead. If we adopt this viewpoint then we cannot apply the three-step logic as outlined above. We still need to deal with the three steps (biological data to mathematical DAG, DAG to character evolution, characters to organismal evolution), but the DAG will have reticulations rather than being a diverging tree. So, we cannot apply Hennigian logic at step 2, because in a reticulated DAG the characters do not form a nested series of shared derived character states.

So, where are we to get our philosophy under these circumstances? How do we justify the inference that the mathematical summary represents evolutionary history? I have not yet seen this issue discussed in the literature.

References

Bandelt H-J (2005) Exploring reticulate patterns in DNA sequence data. In: Bakker FT, Chatrou LW, Gravendeel B, Pelser PB, eds. Plant Species-Level Systematics: New Perspectives on Pattern and Process. Koeltz, Königstein, pp 245-269.

Bapteste E, Lopez P, Bouchard F, Baquero F, McInerney JO, Burian RM (2012) Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proceedings of the National Academy of Sciences of the USA 109: 18266-18272.

Bapteste E, O'Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, Franklin-Hall L, Lapointe FJ, Dupré J, Dagan T, Boucher Y, Martin W (2009) Prokaryotic evolution and the tree of life are two different things. Biology Direct 4: 34.

Behrens JT, Yu CH (2003) Exploratory data analysis. In: Schinka JA, Velicer WF, eds. Handbook of Psychology, Vol. 2: Research Methods in Psychology. John Wiley & Sons, Hoboken, pp 33-64.

Bremer K, Wanntorp H-E (1979) Hierarchy and reticulation in systematics. Systematic Zoology 28: 624-627.

Brower AVZ, DeSalle R, Vogler AP (1996) Gene trees, species trees, and systematics: a cladistic perspective. Annual Review of Ecology and Systematics 27: 423-450.

Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284: 2124-2128.

Doolittle WF (2009) The practice of classification and the theory of evolution, and what the demise of Charles Darwin's tree of life hypothesis means for both of them. Philosophical Transactions of the Royal Society of London B Biological Sciences 364: 2221-2228.

Ellison AM (2001) Exploratory data analysis and graphic display. In: Scheiner SM, Gurevitch J, eds. Design and Analysis of Ecological Experiments, 2nd ed. Oxford University Press, Oxford, pp 37-62.

Funk VA (1981) Special concerns in estimating plant phylogenies. In: Funk VA, Brooks DR, eds. Advances in Cladistics: Proceedings of the First Meeting of the Willi Hennig Society. New York Botanical Garden Press, New York, pp 73-86.

Funk VA (1985) Phylogenetic patterns and hybridization. Annals of the Missouri Botanical Garden 72: 681-715.

Hartwig F, Dearing BE (1979) Exploratory Data Analysis. Sage, Newbury Park.

Humphries CJ (1983) Primary data in hybrid analysis. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia University Press, New York, pp 89-103.

Lienau EK, DeSalle R (2009) Evidence, content and collaboration and the tree of life. Acta Biotheoretica 57: 187-199.

Morrison DA (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044-1057.

Nelson GJ (1983) Reticulation in cladograms. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia University Press, New York, pp 105-111.

Skála Z, Zrzavy J (1994) Phylogenetic reticulations and cladistics: discussion of methodological concepts. Cladistics 10: 305-313.

Tufte ER (1983) The Visual Display of Quantitative Information. Graphics Press, Cheshire.

Tufte ER (1997) Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire.

Tukey JW (1977) Exploratory Data Analysis. Addison-Wesley, Reading.

Wägele JW, Mayer C (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.

Wagner WH (1983) Reticulistics: The recognition of hybrids and their role in cladistics and classification. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia University Press, New York, pp 63-79.

Wanntorp H-E (1983) Reticulated cladograms and the identification of hybrid taxa. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia University Press, New York, pp 81-88.

Young FW, Valero-Mora PM, Friendly M (2006) Visual Statistics: Seeing Data with Dynamic Interactive Graphics. Wiley, Hoboken.