Wednesday, November 28, 2012
Are phylogenetic networks as scientific as trees?
Description, explanation, prediction
Science can be characterized as involving: (i) description, (ii) explanation, and (iii) prediction. As scientists, we need objective and repeatable methods for all three of these. For example, we have devised quantitative methods of description involving standardized units of measurement, often using machines to perform the actual measuring. We also have modeling procedures that allow us to explicitly incorporate explanatory ideas and to make predictions; and we have philosophical methods for assessing whether inferences are justified or not.
Philosophers of science have tended to focus on the role of explanation (ii) in science, often to the exclusion of description (i) and prediction (iii), but practicing scientists frequently spend more time on (i) than on either (ii) or (iii), especially in biology. Moreover, physical scientists frequently combine all three simultaneously, using mathematical equations not only to describe the observed data but also to explain them (via the components that are included in the underlying mathematical model) and to predict as-yet unobserved phenomena (by arithmetical extrapolation).
It seems to me that one of the things that makes the study of evolution a science (rather than a study of natural history) is the recent effort to reconstruct evolutionary history in an objective and repeatable manner (rather than producing untestable historical scenarios). These phylogenetic analyses have usually been based on a tree model, although the adequacy of this model has recently been questioned.
However, one issue that I have not seen addressed in the literature is the effect on the description / explanation / prediction triumvirate if phylogenetics moves from a tree model to a network model.
[Added note: see the next blog post for a further explanation and examples of Description, explanation and prediction in phylogenetics.]
Trees and networks
Using a phylogenetic tree to describe biodiversity is uncomplicated: the tree describes the historical relationships among the taxa. Using the tree for explanation is also uncomplicated: many of the intrinsic characteristics of organisms are the result of inheritance from their ancestors, and therefore characteristics that are shared among taxa can be explained as resulting from shared common ancestors. Finally, using the tree for prediction simply involves the reverse logic: shared ancestry predicts the existence of shared characteristics, which may not yet have been observed.
This is actually a point that Darwin makes when introducing the tree metaphor in On the Origin of Species (1859). He points out that many previously unexplained facets of biology become explainable if one adopts the concept of a phylogenetic tree (for example, so-called natural classifications, or the obvious relationships among languages).
In this context, note the potential importance of the distinction between pattern reconstruction and process explanation. For example, description (i) can be done from the perspective of simply displaying patterns, but this is likely to preclude explanation (ii) and prediction (iii). Description may thus be best done from the perspective of displaying only those patterns that are related to particular processes. Jonathan Losos (2011, Seeing the forest for the trees: the limitations of phylogenies in comparative biology. American Naturalist 177: 709-727), for example, has noted that "phylogenies are much more informative about pattern than they are about process."
Nevertheless, replacing the tree model with a network model is not necessarily straightforward, because the studied history now involves both horizontal and vertical descent. If we conceive of a network as being a set of inter-connected trees, then the tree components represent the vertical ancestor-to-offspring history while the reticulations (connecting the trees) represent the horizontal components of the history.
In this view, using a phylogenetic network to describe biodiversity is the same as for a tree — the network describes the historical relationships among the taxa, with a clear indication of the pathways of the vertical and horizontal components of that history.
Unfortunately, the same cannot necessarily be said for explanation. Without an indication of exactly which characteristics are involved in the reticulations, we cannot have an unambiguous explanation. Characteristics that are shared among taxa may be explained by either shared ancestors (a vertical explanation) or by reticulation (a horizontal explanation). A network topology alone will not necessarily provide an unambiguous explanation, whereas a tree topology can do so.
A more extreme problem arises for prediction. When predicting the existence of shared characteristics, should the prediction be based on shared vertical ancestry or shared horizontal history, or both? Since we are predicting the unknown, how can we decide on the appropriate prediction framework? With a tree there is no such choice to be made, and thus no ambiguity.
If reticulation occurs, then we can "explain" almost any set of observations by postulating a suitable reticulation event; and we could "predict" almost any future event in the same way. So, it seems that network models are not practical for explanation and prediction in quite the same way as are tree models alone. The extra complexity available for network description potentially becomes ambiguity when used for explanation or prediction.
This issue manifests itself in a number of ways. For instance, mathematical algorithms would need to be based on optimization criteria that have some biological relevance in terms of explanation, not just description. For example, minimizing the number of reticulations when constructing a network involves descriptive parsimony: we describe the data using a tree model plus the minimum possible number of reticulations. However, this does not involve ontological parsimony, in the sense that we are not thereby postulating that evolution proceeds in such a parsimonious manner. Descriptive parsimony does not necessarily provide a phylogenetic network that is best as an explanatory framework, nor as a predictive tool. The same can be said about maximum-parsimony trees, of course, but they are rarely used these days.
Moreover, phylogenetic networks may not even provide a concise description of reticulate evolution. For example, if two gene trees differ by just one so-called Rooted Subtree Prune and Regraft (rSPR) move then we can represent them by a network with one reticulation node (the two trees that are embedded in the network are simply the two gene trees). However, if the trees differ by two or more rSPR moves then a large number of reticulations may be needed in order to embed the two trees. So, a network can be a simple description of two conflicting trees, or it can be much more complex than the two trees themselves.
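To make the single-reticulation case concrete, here is a minimal Python sketch. The toy network, its node names (root, x, y, h) and the taxa A, B and C are all hypothetical, invented purely for illustration. The sketch keeps one incoming edge for each reticulation node, and then lists the clusters (sets of taxa below each internal node) of the resulting tree. With a single reticulation there are two parent choices, and so two embedded trees.

```python
from itertools import product

# Hypothetical toy network: each internal node maps to its children.
# Node "h" is the only reticulation, because it has two parents (x and y).
NETWORK = {
    "root": ["x", "y"],
    "x": ["A", "h"],
    "y": ["C", "h"],
    "h": ["B"],
}

def parents_of(network):
    """Invert the child lists into a node -> list-of-parents map."""
    parents = {}
    for node, children in network.items():
        for child in children:
            parents.setdefault(child, []).append(node)
    return parents

def displayed_trees(network):
    """Return one tree per way of keeping a single parent for each reticulation."""
    parents = parents_of(network)
    retics = sorted(n for n, ps in parents.items() if len(ps) > 1)
    trees = []
    for choice in product(*(parents[r] for r in retics)):
        kept = dict(zip(retics, choice))
        # Keep an edge unless it enters a reticulation from a non-chosen parent.
        trees.append({
            node: [c for c in children if kept.get(c, node) == node]
            for node, children in network.items()
        })
    return trees

def clusters(tree, root="root"):
    """Collect every set of two or more taxa found below some node of the tree."""
    found = set()

    def leaves_below(node):
        if node not in tree:  # nodes without a child list are taxa
            return frozenset([node])
        below = frozenset().union(*(leaves_below(c) for c in tree[node]))
        if len(below) > 1:
            found.add(below)
        return below

    leaves_below(root)
    return found

if __name__ == "__main__":
    for tree in displayed_trees(NETWORK):
        print(sorted(sorted(c) for c in clusters(tree)))
```

One embedded tree groups B with A and the other groups B with C, which are the two conflicting gene trees. This enumeration is only a sketch: the number of parent choices doubles with each extra reticulation that has two parents, and the code makes no attempt to find the smallest network for a given pair of trees.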
What I have said so far refers to evolutionary networks, which are intended to explicitly reflect evolutionary history. It is worth pointing out that data-display networks, on the other hand, are intended to provide description but not explanation or prediction. That is, they display the observed data without necessarily providing any explanation for the patterns displayed or necessarily allowing explicit predictions. Nevertheless, they are intended to provide insights that might contribute to explanations, and therefore predictions. They play a valuable role in exploring data to find the best description and to identify possible explanations.