Showing posts with label Philosophy. Show all posts

Monday, December 10, 2012

Data enrichment in phylogenetics


Since this is post #100 in this blog, I thought that we might celebrate with something humorous. Since evolutionists often have a tough time, this post is about how to get more out of your phylogenetic analyses than you previously thought was possible.

In 1957, Henry R. Lewis published an article about The Data-Enrichment Method (Operations Research 5: 551-554). This method was intended "to improve the quality of inferences drawn from a set of experimentally obtained data ... without recourse to the expense and trouble of increasing the size of the sample data." This distinguishes the method from similarly named techniques, such as the likelihood method of Data Augmentation, which require actual data.

Clearly, such a method is of great interest to all empirical scientists, especially those without much grant money. Indeed, The Data Enrichment Method was immediately expanded by other interested parties (see Operations Research 5: 858-859, and 6: 136), who pointed out that it can be applied iteratively to great effect, and that it can be used to support an hypothesis and also its opposite.

The important requirements for the Data Enrichment Method are: (i) a nested set of data patterns, and (ii) an a priori expectation about what should be the answer to the experimental question. All scientists should have the latter, of course, since they are supposed to be testing the expectation by calling it an "hypothesis".

Most interestingly for us, phylogeneticists will often be able to meet requirement (i), as well, because their data often form a nested set, representing the shared derived character states from which a phylogenetic tree will be derived. I therefore once wrote an article examining the application of The Data Enrichment Method to phylogenetics, where it does indeed work very well. You do need at least some data to start with, and so it does not free you entirely from the inconvenience and embarrassment of uncontrollable empirical results.

This article appeared in 1992 in the Australian Systematic Botany Society Newsletter 71: 2–5. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy of the paper:
A new method for increasing the robustness of cladistic analyses

After reading it, you might like to think about how to apply this method to phylogenetic networks. The mixture of horizontal gene flow with vertical descent breaks the simple nested data pattern of a phylogenetic tree, which complicates the application of data enrichment to networks.
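The idea of a "nested data pattern" can be made concrete with a small sketch. It assumes that each character is summarized as the set of taxa sharing its derived state, and it uses four hypothetical taxa (A to D) that are not taken from any real dataset. The function name `is_nested` is likewise just an illustrative choice.

```python
from itertools import combinations

def is_nested(clusters):
    """Return True if every pair of clusters is disjoint or one contains the other."""
    sets = [frozenset(c) for c in clusters]
    for a, b in combinations(sets, 2):
        # Two clusters conflict when they overlap without one containing the other.
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Hypothetical taxa A-D; each cluster is the set of taxa sharing one derived state.
tree_like = [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]

# A character shared only by B and C (as horizontal gene flow might produce)
# overlaps {A, B} without being nested in it.
with_transfer = tree_like + [{"B", "C"}]

print(is_nested(tree_like))      # True
print(is_nested(with_transfer))  # False
```

A set of clusters that passes this check can be drawn as a single tree; one that fails it needs at least one reticulation, or an explanation in terms of error or homoplasy.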

Friday, November 30, 2012

Description, explanation and prediction in phylogenetics


My recent post on the relationship between phylogenetic trees and networks (Are phylogenetic networks as scientific as trees?) has generated some comment, particularly with regard to the way in which description, explanation and prediction apply to phylogenetics.

By way of explanation, I have included here a specific example each of description, explanation and prediction using phylogenetic trees. They all come from my studies of one particular taxonomic group.

Description

The phylum Apicomplexa (sometimes also known as Sporozoa) forms a large and diverse group of unicellular protists with a wide environmental distribution. They are obligate intracellular parasites, being the only large taxonomic group whose members are entirely parasitic. The phylum is traditionally considered to contain four clearly defined groups: the Coccidians, the Gregarines, the Haemosporidians and the Piroplasmids. The phylogenetic tree shown here (from Morrison 2009) is based on complete 18S rDNA sequences.


This tree is, in one sense, nothing more than a mathematical summary of some of the patterns in the aligned nucleotide data. However, if we accept the idea that this data summary represents the evolutionary history of the organisms (ie. the data summary represents the gene history and the gene history represents the organismal history), then the tree is also a quantitative description of that history.

In this particular example, however, the description is likely to be wrong, in at least some details. For example, it seems improbable that the Haemosporidians (Plasmodium and Hepatocystis) are derived from within the Gregarines. This placement is more likely to be the result of long-branch attraction, so that the data summary is in error (as the consequence of a mathematical artefact), which leads to an inaccurate description of the evolutionary history.

Explanation

Cryptosporidium causes cryptosporidiosis in mammals. It has traditionally been classified with the Coccidians (see Ellis et al. 1998), a placement first suggested in 1907, based on features of the life-cycle, the macro- and microgamonts, and the oocysts (see Beĭer 2000). However, drugs that help treat coccidial infections (such as coccidiosis, toxoplasmosis, neosporosis and sarcocystosis in vertebrates) do not work on Cryptosporidium, an observation that has long puzzled parasitologists.

The earliest phylogenetic analyses of 18S rDNA from Apicomplexans called this taxonomic placement into question (Johnson et al. 1990), and this was repeatedly confirmed by later analyses (eg. Morrison & Ellis 1997). However, these analyses did not include representatives of all of the Apicomplexan groups (ie. they sampled only Coccidians, Haemosporidians and Piroplasmids), and the first analyses to also include the Gregarines (which infect invertebrates) indicated a sister-group relationship between them and Cryptosporidium (Carreno et al. 1999). This phylogenetic placement of Cryptosporidium as sister to the Gregarines is the currently accepted one (Barta & Thompson 2006, Leander 2007, Morrison 2009).

Thus, the currently accepted phylogeny explains why the anti-coccidial drugs do not work on Cryptosporidium — it is not a Coccidian. The traditional taxonomy does not provide any such explanation.

Prediction

Taxon sampling has been almost entirely opportunistic within the Apicomplexa, as it almost always is in parasitology. Opportunities for sampling arise principally from studies of medical diseases (eg. malaria, cryptosporidiosis and toxoplasmosis) and of veterinary diseases (eg. coccidiosis, neosporosis and babesiosis). This can create practical problems (eg. in epidemiology), such as when dealing with parasites that have a two-host life cycle but where only one of the hosts is known.

Sarcocystis is part of the Coccidia, causing sarcocystosis in vertebrates. It has a two-host (or indirect) life cycle: the definitive host (in which sexual reproduction occurs) is usually a carnivore, while the intermediate host (where asexual reproduction occurs) is usually a herbivore. Sometimes, parasites have been collected only in the intermediate host, and thus we need to predict the definitive host species, in order to direct the search for it. (Importantly, targeted searches use fewer experimental animals.) This prediction can be done using a phylogeny, as the prediction then comes from known hosts for the other parasite species within the same clade (monophyletic group).


The 18S rDNA phylogeny shown here is for part of Sarcocystis (it is taken from Morrison et al. 2004), and it also shows the known host species for each parasite species. This phylogeny can be used to predict that the most likely definitive host for Sarcocystis species V would be the same as the host for the other species in the monophyletic group labelled A, which would thus be a canid. Similarly, the predicted definitive host for Sarcocystis sinensis would be the same as the host for the other species in the monophyletic group labelled B, which is thus probably humans but possibly a felid.

In three cases this form of prediction of the definitive host of Sarcocystis species was tested by subsequent experimental infection studies (Dahlgren & Gjerde 2010; Gjerde & Dahlgren 2010), and the predictions were all confirmed to be correct.
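The clade-based reasoning used for these predictions can be sketched in a few lines. This is only an illustration under stated assumptions: the tree, the species names (sp1 to sp5, spX, spY) and the hosts are all hypothetical, and are not the data of Morrison et al. (2004). The function `predict_host` is an illustrative helper that tallies the known hosts in the smallest clade that contains the target species and at least one species with a known host.

```python
from collections import Counter

# Hypothetical rooted tree as nested tuples; leaves are parasite species names.
tree = ((("sp1", "sp2"), "spX"), ("sp3", "sp4", "sp5", "spY"))

# Hypothetical known definitive hosts (spX and spY are the unknowns).
known_hosts = {"sp1": "canid", "sp2": "canid",
               "sp3": "human", "sp4": "human", "sp5": "felid"}

def leaves(node):
    """List the leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def predict_host(node, target, hosts):
    """Tally known hosts in the smallest informative clade containing target."""
    if isinstance(node, str) or target not in leaves(node):
        return None
    # Prefer a smaller (more deeply nested) clade if one is informative.
    for child in node:
        tally = predict_host(child, target, hosts)
        if tally is not None:
            return tally
    tally = Counter(hosts[leaf] for leaf in leaves(node)
                    if leaf != target and leaf in hosts)
    return tally if tally else None

print(predict_host(tree, "spX", known_hosts).most_common())  # [('canid', 2)]
print(predict_host(tree, "spY", known_hosts).most_common())  # [('human', 2), ('felid', 1)]
```

The tally for spY mirrors the hedged kind of prediction made above, where one host is the most likely but a second cannot be excluded. A real analysis would also need to take branch support and incomplete host records into account, which this sketch ignores.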

References

Barta JR, Thompson RCA (2006) What is Cryptosporidium? Reappraising its biology and phylogenetic affinities. Trends in Parasitology 22: 463-468.

Beĭer TV (2000) [Article in Russian, with English abstract.] [Further comment on the coccidian nature of cryptosporidia (Sporozoa: Apicomplexa)]. Parazitologiia 34: 183-195.

Carreno RA, Martin DS, Barta JR (1999) Cryptosporidium is more closely related to the Gregarines than to Coccidia as shown by phylogenetic analysis of Apicomplexan parasites inferred using small-subunit ribosomal RNA gene sequences. Parasitology Research 85: 899-904.

Dahlgren SS, Gjerde B (2010) The red fox (Vulpes vulpes) and the arctic fox (Vulpes lagopus) are definitive hosts of Sarcocystis alces and Sarcocystis hjorti from moose (Alces alces). Parasitology 137: 1547-1557.

Ellis JT, Morrison DA, Jeffries AC (1998) The phylum Apicomplexa: an update on the molecular phylogeny. In GH Coombs, K Vickerman, MA Sleigh, A Warren (eds) Evolutionary Relationships Among Protozoa (Kluwer, Dordrecht) pp. 255-274.

Gjerde B, Dahlgren SS (2010) Corvid birds (Corvidae) act as definitive hosts for Sarcocystis ovalis in moose (Alces alces). Parasitology Research 107: 1445-1453.

Johnson AM, Fielke R, Lumb R, Baverstock PR (1990) Phylogenetic relationships of Cryptosporidium determined by ribosomal RNA sequence comparison. International Journal for Parasitology 20: 141-147.

Leander BS (2007) Marine Gregarines: evolutionary prelude to the Apicomplexan radiation? Trends in Parasitology 24: 60-67.

Morrison DA (2009) Evolution of the Apicomplexa: where are we now? Trends in Parasitology 25: 375-382.

Morrison DA, Bornstein S, Thebo P, Wernery U, Kinne J, Mattsson JG (2004) The current status of the small subunit rRNA phylogeny of the Coccidia (Sporozoa). International Journal for Parasitology 34: 501-514.

Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441.

Wednesday, November 28, 2012

Are phylogenetic networks as scientific as trees?


Description, explanation, prediction

Science can be characterized as involving: (i) description, (ii) explanation, and (iii) prediction. As scientists, we need objective and repeatable methods for all three of these. For example, we have devised quantitative methods of description involving standardized units of measurement, often involving machines to perform the actual measuring. We also have modeling procedures that allow us to explicitly incorporate explanatory ideas, as well as for making predictions; and we have philosophical methods for assessing whether inferences are justified or not.

Philosophers of science tend to have focussed on the role of explanation (ii) in science, often to the exclusion of description (i) and prediction (iii), but practicing scientists frequently spend more time on (i) than on either (ii) or (iii), especially in biology. Moreover, physical scientists frequently combine all three simultaneously, using mathematical equations not only to describe the observed data but also to explain it (via the components that are included in the underlying mathematical model) and to predict as-yet unobserved phenomena (by arithmetical extrapolation).

It seems to me that one of the things that makes the study of evolution a science (rather than being a study of natural history) is our recent attempts to reconstruct evolutionary history in an objective and repeatable manner (rather than producing untestable historical scenarios). These phylogenetic analyses have usually been based on a tree model, although the adequacy of this model has recently been questioned.

However, one issue that I have not seen addressed in the literature is the effect on the description / explanation / prediction triumvirate if phylogenetics moves from a tree model to a network model.

[Added note: see the next blog post for a further explanation and examples of Description, explanation and prediction in phylogenetics.]

Trees and networks

Using a phylogenetic tree to describe biodiversity is uncomplicated — the tree describes the historical relationships among the taxa. Furthermore, using the tree for explanation is also uncomplicated — many of the intrinsic characteristics of organisms are the result of inheritance from their ancestors, and therefore characteristics that are shared among taxa can be explained as resulting from shared common ancestors. Furthermore, using the tree for prediction simply involves the reverse logic — shared ancestry predicts the existence of shared characteristics, which may not yet have been observed.

This is actually a point that Darwin makes when introducing the tree metaphor in his book (1859). He points out that many previously unexplained facets of biology become explainable if one adopts the concept of a phylogenetic tree (for example, so-called natural classifications, or the obvious relationships among languages).

In this context, note the potential importance of the distinction between pattern reconstruction and process explanation. For example, description (i) can be done from the perspective of simply displaying patterns, but this is likely to preclude explanation (ii) and prediction (iii). Description may thus be best done from the perspective of displaying patterns that are related solely to particular processes. Jonathan Losos (2011, Seeing the forest for the trees: the limitations of phylogenies in comparative biology. American Naturalist 177: 709-727), for example, has noted that "phylogenies are much more informative about pattern than they are about process."

Nevertheless, replacing the tree model with a network model is not necessarily straightforward, because the studied history now involves both horizontal and vertical descent. If we conceive of a network as being a set of inter-connected trees, then the tree components represent the vertical ancestor-to-offspring history while the reticulations (connecting the trees) represent the horizontal components of the history.

In this view, using a phylogenetic network to describe biodiversity is the same as for a tree — the network describes the historical relationships among the taxa, with a clear indication of the pathways of the vertical and horizontal components of that history.

Unfortunately, the same cannot necessarily be said for explanation. Without an indication of exactly which characteristics are involved in the reticulations, we cannot have an unambiguous explanation. Characteristics that are shared among taxa may be explained by either shared ancestors (a vertical explanation) or by reticulation (a horizontal explanation). A network topology alone will not necessarily provide an unambiguous explanation, whereas a tree topology can do so.

A more extreme problem arises for prediction. When predicting the existence of shared characteristics, should the prediction be based on shared vertical ancestry or shared horizontal history, or both? Since we are predicting the unknown, how can we decide on the appropriate prediction framework? With a tree there is no such choice to be made, and thus no ambiguity.

If reticulation occurs, then we can "explain" almost any set of observations by postulating a suitable reticulation event; and we could "predict" almost any future event in the same way. So, it seems that network models are not practical for explanation and prediction in quite the same way as are tree models alone. The extra complexity available for network description potentially becomes ambiguity when used for explanation or prediction.

This issue manifests itself in a number of ways. For instance, mathematical algorithms would need to be based on optimization criteria that have some biological relevance in terms of explanation, not just description. For example, minimizing the number of reticulations when constructing a network involves descriptive parsimony: we describe the data using a tree model plus the minimum possible number of reticulations. However, this does not involve ontological parsimony, in the sense that we are not thereby postulating that evolution proceeds in such a parsimonious manner. Descriptive parsimony does not necessarily provide a phylogenetic network that is best as an explanatory framework, nor as a predictive tool. The same can be said about maximum-parsimony trees, of course, but they are rarely used these days.

Moreover, phylogenetic networks may not even provide a concise description of reticulate evolution. For example, if two gene trees differ by just one so-called Rooted Subtree Prune and Regraft (rSPR) move then we can represent them by a network with one reticulation node (the two trees that are embedded in the network are simply the two gene trees). However, if the trees differ by two or more rSPR moves then a large number of reticulations may be needed in order to embed the two trees. So, a network can be a simple description of two conflicting trees, or it can also be much more complex than those two trees.
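The simple case, in which two trees differing by a single rSPR move are embedded in a network with one reticulation node, can be illustrated with a small sketch. Everything here is a hypothetical toy: four taxa (a to d), internal nodes r, u, v, w, and one reticulation node h with two parents. The function `displayed_trees` is an illustrative helper that keeps one parent edge per reticulation node and reports each resulting tree as its set of non-trivial clusters.

```python
from itertools import product

# Hypothetical network, stored as child -> list of parents.
# Node "h" is the single reticulation: it has two parents, "v" and "w".
parents = {
    "u": ["r"], "d": ["r"],
    "v": ["u"], "w": ["u"],
    "a": ["v"], "c": ["w"],
    "h": ["v", "w"],
    "b": ["h"],
}
leaf_names = {"a", "b", "c", "d"}

def displayed_trees(parents, leaf_names):
    """Yield the non-trivial clusters of each tree embedded in the network."""
    retic = [n for n, ps in parents.items() if len(ps) > 1]
    for choice in product(*(parents[n] for n in retic)):
        # Keep exactly one parent edge for each reticulation node.
        one_parent = {n: ps[0] for n, ps in parents.items()}
        one_parent.update(zip(retic, choice))
        # Walk from each leaf to the root, adding it to every ancestor's cluster.
        below = {}
        for leaf in leaf_names:
            node = leaf
            while node in one_parent:
                node = one_parent[node]
                below.setdefault(node, set()).add(leaf)
        yield {frozenset(c) for c in below.values()
               if 1 < len(c) < len(leaf_names)}

for clusters in displayed_trees(parents, leaf_names):
    print(sorted("".join(sorted(c)) for c in clusters))
# ['ab', 'abc']   i.e. (((a,b),c),d)
# ['abc', 'bc']   i.e. ((a,(b,c)),d)
```

The two embedded trees differ only in where b attaches, which is a single rSPR move. Note that this enumeration doubles with every added reticulation node, so it is only practical for very small examples.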

What I have said so far refers to evolutionary networks, which are intended to explicitly reflect evolutionary history. It is worth pointing out that data-display networks, on the other hand, are intended to provide description but not explanation or prediction. That is, they display the observed data without necessarily providing any explanation for the patterns displayed or necessarily allowing explicit predictions. Nevertheless, they are intended to provide insights that might contribute to explanations, and therefore predictions. They play a valuable role in exploring data to find the best description and to identify possible explanations.