Showing posts with label Philosophy. Show all posts

Wednesday, April 15, 2015

What we know, what we know we can know, and what we know we cannot know


This is a guest blog post by:

Johann-Mattis List

Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

What we know, what we know we can know, and what we know we cannot know: Ontological facts and epistemological reality in historical linguistics and evolutionary biology

In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).

This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.


The task of scholars working in these disciplines is similar to the task of a crime investigator or a doctor: Detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that led to the observed diversity of life and languages, respectively.

What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this reasoning mode in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction is used to infer a currently unknown general rule from an initial state and its result state, while deduction infers the result state from an initial state and a general rule. Abduction, by contrast, seeks to infer initial states from result states by employing a general rule.
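The three modes can be contrasted in a few lines of Python. This is only a toy sketch: the sound-change rules (word-initial p becomes f, word-initial k becomes h), the function names, and the word forms are illustrative assumptions, not taken from Peirce or from any actual reconstruction.

```python
# Toy illustration of deduction, induction, and abduction.
# The rules and word forms below are hypothetical examples.

def p_to_f(word):
    """General rule: word-initial p becomes f."""
    return "f" + word[1:] if word.startswith("p") else word

def k_to_h(word):
    """Alternative rule: word-initial k becomes h."""
    return "h" + word[1:] if word.startswith("k") else word

def deduce(initial, rule):
    """Deduction: initial state + general rule -> result state."""
    return rule(initial)

def induce(pairs, candidate_rules):
    """Induction: (initial, result) pairs -> rules consistent with all pairs."""
    return [r for r in candidate_rules
            if all(r(initial) == result for initial, result in pairs)]

def abduce(result, rule, candidate_initials):
    """Abduction: result state + general rule -> possible initial states."""
    return [c for c in candidate_initials if rule(c) == result]

print(deduce("pater", p_to_f))
print([r.__name__ for r in induce([("pater", "fater")], [p_to_f, k_to_h])])
print(abduce("fater", p_to_f, ["pater", "fater", "mater"]))
```

In this sketch abduction returns more than one candidate initial state, because the rule alone cannot decide between them.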

What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.

Historical linguistics and the limits of knowledge

In historical linguistics, discussions regarding the limits of our knowledge have been centered around the question of the "nature of the proto-language". Using comparative techniques, in the second half of the 19th century linguists started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language which was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of knowledge representation in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.

As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):
According to modern historical linguistic theory, these words are all assumed to go back to the same ancestral word in Indo-European. The reconstructed pronunciation of the ancestral form is traditionally represented as *séh₂u̯el- "sun" and an approximate pronunciation of the nominative singular would be [soxwl] (with [x] indicating the same sound as the ch in German Rauch "smoke").

These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphs, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, controversially debated. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as pure fiction (Schmidt 1872).

Linear B

While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.

Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.

As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. In a similar way to that with which we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.


Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 [1998]: 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.

Conclusion

What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.

First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some cluster procedure to cluster genes in sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.
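What such a procedure amounts to in practice can be sketched as follows. The similarity measure (the ratio from Python's difflib applied to spellings), the thresholds, the single-linkage clustering, and the word list are all assumptions chosen for illustration; real homolog or cognate detection uses far more elaborate scoring.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Surface similarity in [0, 1]; a stand-in for a real scoring scheme."""
    return SequenceMatcher(None, a, b).ratio()

def cluster(items, threshold):
    """Single-linkage clustering: items end up together if a chain of
    pairwise similarities at or above the threshold connects them."""
    clusters = []
    for item in items:
        # Find every existing cluster that the new item links to.
        linked = [c for c in clusters
                  if any(similarity(item, other) >= threshold for other in c)]
        merged = [item]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

words = ["sonne", "sunna", "sun", "soleil", "sol"]  # illustrative spellings
for threshold in (0.4, 0.8):
    print(threshold, cluster(words, threshold))
```

Different thresholds can yield different sets of "presumed homologs", and nothing in the procedure itself says which threshold, if any, recovers true common descent. The output is a construct in exactly the sense described above.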

Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.

References

Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burks. 8 vols. Cambridge MA: Harvard University Press.

Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprachen. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.

Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.

Statt DA, comp. (1981 [1998]) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.

Monday, February 16, 2015

An Hennigian analysis of the Eukaryotae


As usual at the beginning of the week, this blog presents something in a lighter vein.

Homologies lie at the heart of phylogenetic analysis. They express the historical relationships among the characters, rather than the historical relationships of the taxa. As such, homology assessment is the first step of a phylogenetic analysis, while building a tree or network is the second step.

With a colleague (Mike Crisp, now retired), I once wrote a tongue-in-cheek article about how to mis-interpret homologies, and the consequences of this for any subsequent tree-building analysis. This article appeared in 1989 in the Australian Systematic Botany Society Newsletter 60: 24–26. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy [1.2 MB] of the paper:
An Hennigian analysis of the Eukaryotae


Wednesday, February 4, 2015

Do biologists over-interpret computer simulations?


Computer simulations are an important part of phylogenetics, not least because people use them to evaluate analytical methods, for example for alignment strategies or network and tree-building algorithms.

For this reason, biologists often seem to expect that there is some close connection between simulation "experiments" and the performance of data-analysis methods in phylogenetics, and yet the experimental results often have little to say about the methods' performance with empirical data.

There are two reasons for the disconnection between simulations and reality, the first of which is tolerably well known. This is that simulations are based on a mathematical model, and the world isn't (in spite of the well-known comment from James Jeans that "God is a mathematician"). Models are simplifications of the world with certain specified characteristics and assumptions. Perhaps the most egregious assumption is that variation associated with the model involves independent and identically distributed (IID) random variables. For example, simulation studies of molecular sequences make the IID assumption, by generating substitutions and indels at random in the simulated sequences (called stochastic modeling). This IID assumption is rarely true, and therefore simulated sequences deviate strongly from real sequences, where variation occurs distinctly non-randomly and non-independently, both in space and time.
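A minimal sketch of what the IID assumption looks like in a sequence simulation is given here. The function name, the substitution probability, and the uniform choice among the other three nucleotides are assumptions for illustration; the sketch ignores indels and does not correspond to any particular published simulator.

```python
import random

def simulate_iid_substitutions(sequence, p, rng):
    """Return a copy of `sequence` in which every site independently
    mutates to a different nucleotide with the same probability p."""
    alphabet = "ACGT"
    out = []
    for base in sequence:
        if rng.random() < p:
            # Same rule at every site, regardless of position or neighbours.
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(1)
ancestor = "ACGT" * 10
descendant = simulate_iid_substitutions(ancestor, 0.1, rng)
print(ancestor)
print(descendant)
```

Every site is treated identically and independently of its neighbours, which is precisely the feature that real sequences tend to lack.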


The second problem with simulations seems to be less well understood. This is that they are not intended to tell you anything about which data-analysis method is best. Instead, whatever analysis method matches the simulation model most closely will almost always do best, irrespective of any characteristics of the model.

To take a statistical example, consider assessing the t-test versus the Mann-Whitney test — this is the simplest form of statistical analysis, comparing two groups of data. If we simulate the data using a normal probability distribution, then we know a priori that the t-test will do best, because its assumptions perfectly match the model. What the simulation will tell us is how well the t-test does under perfect conditions; and indeed we find that its success is 100%. Furthermore, the Mann-Whitney test scores about 95%, which is pretty good. But we know a priori that it will do worse than the t-test; what we want to know is how much worse. All of this tells us nothing about which test we should use. It only tells us which method most closely matches the simulation model, and how close it gets to perfection. If we change the simulation model to one where we do not know a priori which analysis method is closest (eg. a lognormal distribution), then the simulation will tell us which it is.

This is what mathematicians intended simulations for — to compare methods relative to the models for which they were designed, and to deviations from those models. So, simulations evaluate models as much as methods. They will mainly tell you which model assumptions are important for your chosen analysis method. To continue the example, non-normality matters for the t-test when the null hypothesis being tested is true, but not when it is false. Instead, inequality of variances matters for the t-test when the null hypothesis is false. This is easily demonstrated using simulations, as it also is for the Mann-Whitney test. But does it tell you whether to use t-tests or Mann-Whitney tests?

This is not a criticism of simulations as such, because mathematicians are interested in the behaviour of their methods, such as their consistency, efficiency, power, and robustness. Simulations help with all of these things. Instead it is a criticism of the way simulations are used (or interpreted) by biologists. Biologists want to know about "accuracy" and about which method to use. Simulations were never intended for this.

To take a first phylogenetic example, people simulate sequence data under likelihood models, and then note that maximum likelihood tree-building does better than parsimony. Maximum likelihood matches the model better than parsimony, so we know a priori that it will do better. What we learn is how well maximum likelihood does under perfect conditions (it is some way short of 100%) and how well parsimony does relative to maximum likelihood.

As a second example, we might simulate sequence-alignment data with the gaps in multiples of three nucleotides. We then discover that an alignment method that puts gaps in multiples of three does better than ones that allow any size of gap. So what? We know a priori which method matches the model. What we don't know is how well it does (it is not 100%), and how close to it the other methods will get. But this is all we learn. We learn nothing about which method we should use.

So, it seems to me that biologists often over-interpret computer simulations. They are tempted not to see the results for what they are, which is simply an exploration of one set of models versus other models within the specified simulation framework. The results have little to say about the data-analysis methods' performance with empirical data in phylogenetics.

Monday, March 3, 2014

Has phylogenetics reached its apogee?


Few people had heard of phylogenetics before 1970. It was during that decade that explicit methods for constructing phylogenetic trees came to prominence, although such methods had first appeared in the late 1950s. These methods appeared first in systematics, based on parsimony (1970s), and then in genetics, based on likelihood (1980s). These days, phylogenetics is seen as ubiquitous in biology, but it is interesting to consider whether this idea can be quantified.

Joseph Hughes (2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12: 178) had a go at this by trying to extract information from the PubMed bibliographic database. Here, I have expanded on this approach.

I searched PubMed for the string phylogen*, thus including words like "phylogeny" and "phylogenetics", as well as unusual variations on these words. I searched the full bibliographic record (including the abstract), and I also restricted the search to the Title field. I did this for every calendar year from 1970–2012 inclusive (the 2013 data are currently still incomplete in the database).


The results are shown in the first graph, and the second graph shows the details of the title search alone. The data are expressed as a percentage of the total number of PubMed records for each year.


So, less than 2% of the current papers in biology mention phylogenetics in their titles or abstracts. This does not, of course, mean that the remaining papers don't mention the topic at all, as they could do so under some other name (eg. "evolutionary tree", "genealogy", etc), or in a way that does not make it into the abstract. Still, it seems to me that this is a rather low number.

The erratic nature of the data before 1975 is probably a by-product of the quality of the PubMed data for that time. However, the clear upper asymptote in the data this century is not artifactual, but real. The average maximum value for the "All" data is ~1.54%, reached in 2009, while the average for "Title only" is ~0.17%, reached in 2004. This seems to imply that phylogenetics has now saturated the market, and is as ubiquitous as it will be, unless something new comes along to change it.

The initial rise in usage of the phylogenetic methods coincided with the release of computer programs that implemented them. Wagner78 was released for mainframe computers in 1978, followed by Phylip in 1980. Phylip was the first to be ported to microcomputers; but it was the release of the PC version of PAUP (v. 2.4) in December 1985 that came to dominate the next 10 years. Hennig86, the successor to Wagner78, was released in 1988.

However, the rapid growth in usage coincided with the growth of molecular genetics. The patent applications for PCR were filed in 1985, and the first paper based on it was also published that year. The technology started to be used for human diagnostics during 1986, and PCR became a basic research tool in molecular biology from c.1989. (Science selected PCR as the major scientific development of 1989.) The journal Molecular Biology and Evolution was founded in 1983, and Molecular Phylogenetics and Evolution in 1992.

The inflection point in the graph is c.1999, which indicates where the slow-down in growth occurred. Coincidentally, it was in 1999 that the Journal of Molecular Evolution announced that it would henceforth exclude molecular phylogenetics (and research on the origin of life), except in cases that have "a special significance and impact." Phylogenetics was now seen as a tool of evolutionary analysis rather than an end in itself.

By this stage, Bayesian methods were being proposed, and MrBayes was released in 2001, rapidly becoming the predominant program. However, this was simply a transformation of the existing methodology, rather than being a major new component of data analysis in the way the very first programs were. Furthermore, the rise in usage of genome data seems also to be a transformation, rather than a major addition to data collection the way sequence data were.

Thus, it took 30 years (c. 1978–2008) for the phylogenetics revolution to be complete. Mind you, it had already taken 100 years from 1859 for quantitative methods to first be proposed.

Thursday, December 19, 2013

Is rate variation among lineages actually due to reticulation?


Non-congruence among characters has traditionally been attributed solely to so-called vertical evolutionary processes (parent to offspring), which can be represented in a phylogenetic tree. For example, phenotypic incongruence was originally attributed solely to homoplasy (convergence, parallelism, reversal). For molecular data this could be modeled with DNA substitutions and indels, along with allowance for variable rates in different genic regions (e.g. invariant sites, or the well-known gamma model of rate variation).

This approach was not all that successful, and so the substitution models were made more complex, by allowing different evolutionary rates in different branches of the tree (e.g. substitutions are more or less common in some parts of the tree compared to others). For many researchers this is still as sophisticated as their phylogenetic models get (Schwartz & Mueller 2010), allowing for a relaxed molecular clock in their model rather than imposing a strict clock.

There is, however, a fundamental limitation to trying to make any one model more sophisticated: the more complex model will probably fit the data better but it might be fitting details rather than the main picture. Consider the illustration below. There is a lot of variation among these six animals and yet they are all basically the same. If I wish to devise a model to describe them, do I need a sophisticated model that describes all the nuances of their shape variation, or do I need a simple model that recognizes that they are all five-pointed stars? The answer depends on my purpose — if I wish to identify them to class then it is the latter, if I wish to identify them to species then it might be the former.


Vertical process models

This is relevant to phylogenetics. For example, if I wish to estimate a species tree from a set of gene trees, do I need a complex model that deals with all of the evolutionary nuances of the individual gene trees, or a simpler model that ignores the details and instead estimates what the trees have in common? It has been argued that the latter will be more useful under these circumstances. On the other hand, if I am studying gene evolution itself, I may be better off with the former.

So, adding things like rate variation among lineages (and also rate variation along genes) will usually produce "better fitting" models. However, this is fit to the data, and the fit between data and model is not the important issue, because this increases precision but does not necessarily increase accuracy.

Therefore, modern interest is in changing the fundamentals of the model, rather than changing its details. There are many possible causes of gene-tree incongruence, and maybe these should be in the model in order to increase accuracy.

For example, there has been interest in adding other vertical processes to the tree-building model, most notably incomplete lineage sorting (ILS) and gene duplication-loss (DL). ILS means that gene trees are not expected to exactly match the species tree, but will vary stochastically around that tree, with probabilities that can be calculated using the coalescent. DL means that gene copies appear and disappear during evolution, so that gene sequence variation is due to hidden paralogy as well as to orthology.
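For the simplest case, these probabilities have a closed form. With three species related as ((A,B),C) and one gene copy sampled from each, the standard coalescent result is that the gene tree matches the species tree with probability 1 - (2/3)e^(-T), where T is the length of the internal branch in coalescent units, and each of the two discordant topologies has probability (1/3)e^(-T). The sketch below simply evaluates this formula; the function name and the example branch lengths are assumptions, and a constant population size along the internal branch is assumed.

```python
import math

def gene_tree_probabilities(t_coalescent):
    """Three-species case ((A,B),C): probability of each rooted gene-tree
    topology, given the internal branch length in coalescent units."""
    p_discordant = math.exp(-t_coalescent) / 3   # each of the two discordant trees
    return {
        "((A,B),C)": 1 - 2 * p_discordant,       # matches the species tree
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }

for t in (0.1, 1.0, 5.0):
    probs = gene_tree_probabilities(t)
    print(t, {k: round(v, 3) for k, v in probs.items()})
```

Short internal branches push all three topologies towards equal probability, so a large fraction of gene trees is expected to disagree with the species tree even without any horizontal process.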

ILS has been modeled by being integrated into a more sophisticated DNA substitution model (see the papers in Knowles & Kubatko 2010). Originally, DL was dealt with at the whole-gene level (Slowinski & Page 1999; Ma et al. 2000), but there have been recent attempts to integrate this into the DNA substitution models, as well (Åkerborg et al. 2009; Rasmussen & Kellis 2012). These models are not yet widely used, and so most published empirical species trees still rely on modeling incongruence using rate variation among branches.

Horizontal process models

However, this whole approach restricts the phylogenetic model to vertical processes alone. It is entirely possible that the sequence variation that is being attributed to rate variation among branches is actually being caused by horizontal evolutionary processes, such as recombination, hybridization, introgression or horizontal gene transfer (HGT). For example, an influx of genetic material from outside a lineage could be mis-interpreted as an increase in the rate of substitutions and indels within that lineage. That is, long branches might represent introgression (or HGT) rather than in situ rate variation. If this is true then we would be modeling the wrong thing.

There has been little explicit discussion of this point in the literature. Syvanen (1987) seems to have been among the first. However, his premise was that the molecular clock is ultimately correct (and that "the basic observation has been that different macromolecules yield roughly the same phylogenetic picture"), and he was arguing that HGT does not necessarily violate the clock. Our modern perspective is, of course, that a strict clock is unlikely unless it has been demonstrated, and that genes are incongruent as often as they are congruent.

Recent models for ILS and DL have started to broach this issue, by adding reticulation to their underlying models. Rather oddly, this has usually been described as:
  • ILS + hybridization (Meng & Kubatko 2009; Kubatko 2009; Joly et al. 2009; Bloomquist & Suchard 2010; Yu et al. 2011; Marcussen et al. 2012; Jones et al. 2013; Yu et al. 2013); and
  • DL + HGT (Mirkin et al. 2003; Górecki 2004; Hallett et al. 2004; Csürös & Miklós 2006; Doyon et al. 2010; Tofigh et al. 2011; Bansal et al. 2012; Sjöstrand et al. 2012).
This pairwise association seems to reflect historical accident, rather than any actual mathematical difference in procedure — the gene-tree incongruence patterns are essentially the same for hybridization, introgression and HGT, as well as recombination. In the mathematical models, all we can really talk about is "reticulation" — it is up to the biologist to determine the nature of the horizontal process in each case.

Conclusion

The point here is essentially the same one that I made in a previous post (Resistance to network thinking). Currently, phylogenetics is approached in a very conservative manner. The "old way" is the best way, and things change very slowly. The currently popular phylogenetic models are simply variants of the same models that have been used for 30 years. Temporal rate variation (among lineages) and spatial rate variation (along genes) have been added to the original model from the 1970s, but not yet more complex vertical processes (ILS or DL), and not yet horizontal processes. For these, specialist programs need to be used.

Essentially, all variation in branch length is still attributed to homoplasy and rate variation, rather than considering the myriad of other biological processes that will produce the same apparent phenomenon. With this attitude we might be getting more precise models but not necessarily more accurate ones.

References

Åkerborg Ö, Sennblad B, Arvestad L, Lagergren J (2009) Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences of the USA 106: 5714-5719.

Bansal MS, Alm EJ, Kellis M (2012) Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 28: i283-i291.

Bloomquist EW, Suchard MA (2010) Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Systematic Biology 59: 27-41.

Csürös M, Miklós I (2006) A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. Lecture Notes in Computer Science 3909: 206-220.

Doyon J-P, Scornavacca C, Gorbunov KY, Szöllösi GJ, Ranwez V, Berry V (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Lecture Notes in Computer Science 6398: 93-108.

Górecki P (2004) Reconciliation problems for duplication, loss and horizontal gene transfer. In: Bourne PE, Gusfield D (editors). Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, pp. 316-325. ACM Press, New York.

Hallett M, Lagergren J, Tofigh A (2004) Simultaneous identification of duplications and lateral transfers. In: Bourne PE, Gusfield D (editors). Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, pp. 347-356. ACM Press, New York.

Joly S, McLenachan PA, Lockhart PJ (2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Knowles LL, Kubatko LS (editors) (2010) Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, Hoboken NJ.

Kubatko L (2009) Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.

Ma B, Li M, Zhang L (2000) From gene trees to species trees. SIAM Journal on Computing 30: 729-752.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Meng C, Kubatko LS (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.

Mirkin BG, Fenner TI, Galperin MY, Koonin EV (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evolutionary Biology 3: 2.

Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research 22: 755-765.

Schwartz RS, Mueller RL (2010) Variation in DNA substitution rates among lineages erroneously inferred from simulated clock-like data. PLoS One 5: e9649.

Sjöstrand J, Sennblad B, Arvestad L, Lagergren J (2012) DLRS: gene tree evolution in light of a species tree. Bioinformatics 28: 2994-2995.

Slowinski J, Page RDM (1999) How should species phylogenies be inferred from sequence data? Systematic Biology 48: 814-825.

Syvanen M (1987) Molecular clocks and evolutionary relationships: possible distortions due to horizontal gene flow. Journal of Molecular Evolution 26: 16-23.

Tofigh A, Hallett M, Lagergren J (2011) Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 517-535.

Yu Y, Barnett RM, Nakhleh L (2013) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.

Wednesday, August 21, 2013

Conflicting placental roots: network or tree?


In this blog we champion networks as a fundamental model for phylogenetics. Networks are more general than trees, in the sense that some networks are more tree-like than are others. However, I have noted before that the current trend in phylogenetics seems to be to try to use more and more complex trees as the phylogenetic model, rather than embracing networks as a more flexible model (Resistance to network thinking).

An interesting example of this trend is in the current issue of Molecular Biology & Evolution. There are two articles that investigate the root of the placental clade, by Morgan et al. and Romiguier et al., along with an editorial commentary by Teeling & Hedges.

The "placental root" problem has been difficult to resolve as a bifurcating process because different genetic datasets support different trees. As noted by Teeling & Hedges: "Untangling the root of the evolutionary tree of placental mammals has been nearly an impossible task. The good news is that only three possibilities are seriously considered ... Now, two groups of researchers have scrutinized the largest available genomic data sets bearing on the question and have come to opposite conclusions". The three alternative tree histories for the clade root are shown in the figure.


Both of the new empirical studies are based on the protein-coding sequences for most of the 40 currently available mammalian genomes. Morgan et al. use heterogeneous substitution models to account for tree and dataset heterogeneity, and get strong support for option (c). Romiguier et al. divide their dataset into GC-rich and AT-rich genes, conclude that the GC-rich genes are most likely to suffer from long-branch attraction, and get strong support from the AT-rich genes for option (a).

Teeling & Hedges continue: "Needless to say, more research is needed." No! Previous genome-scale analyses of more than one million amino acid sites from orthologous protein-coding genes have not rejected any of the three alternatives, despite the statistical estimate that 20,000 amino acid sites should be sufficient to resolve the question at this level of divergence given the tree structure, branch lengths, and number of substitutions (Hallström & Janke 2010). Doesn't this mean that we have enough evidence already?

Clearly, the conflicting results should lead the reader to at least consider the idea that something might be wrong with the underlying tree model itself. Both of these new analyses are still based on tree models, no matter how sophisticated those models might be (see also the several other papers cited by Teeling & Hedges), and no matter how much data are involved.

An alternative perspective is provided by Hallström & Janke (2010): "Mammalian evolution may not be strictly bifurcating". Their network analysis of retroposon insertion data supports an alternative hypothesis for the history of placentals: the early divergences involved incomplete lineage sorting and hybridization. Neither of these two evolutionary processes is accounted for in the tree models of Morgan et al. and Romiguier et al., but both can be integral parts of a network model.

Conclusion

I think that we can see the suggested move from trees to networks as a form of Kuhnian paradigm shift. In Kuhn's historical model, during the period of "normal science" the failure of results to conform to the current paradigm is not seen as refuting the paradigm, but instead is seen as resulting from errors by researchers (e.g. use of inadequate models, acquisition of unreliable data). However, in the Kuhn model, as anomalous results accumulate a new paradigm emerges that subsumes the old results along with the anomalous results, forming a single new framework or paradigm.

Non-tree-like phylogenetic results are currently not seen by most phylogeneticists as refuting the paradigm of a phylogenetic tree, but instead are the result of inadequate phylogenetic tree-models and/or insufficient data (as exemplified by Salichos and Rokas 2013). Nevertheless, these results can also be seen as refuting that paradigm. In that case, a shift to network thinking would embrace all of the tree results as well as the non-tree ones, and would thus form a viable new paradigm.

We should not really call this a Kuhnian "revolution", of course, since tree-thinking and network-thinking are not incompatible, but rather the one is an extension of the other.

Note: There is a follow-up post — Why are there conflicting placental roots?

References

Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology & Evolution 27: 2804-2816.

Morgan CC, Foster PG, Webb AE, Pisani D, McInerney JO, O’Connell MJ (2013) Heterogeneous models place the root of the placental mammal phylogeny. Molecular Biology & Evolution 30: 2145-2156.

Romiguier J, Ranwez V, Delsuc F, Galtier N, Douzery EJP (2013) Less is more in mammalian phylogenomics: AT-rich genes minimize tree conflicts and unravel the root of placental mammals. Molecular Biology & Evolution 30: 2134-2144.

Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327-331.

Teeling EC, Hedges SB (2013) Making the impossible possible: rooting the tree of placental mammals. Molecular Biology & Evolution 30: 1999-2000.

Monday, July 8, 2013

Why people feel older than they are


As always at the beginning of the week, this blog presents something in a lighter vein. However, this week we depart from phylogenetic networks entirely, and delve into the general life of people, instead.

The passage of time is a curious thing, which varies not only with the speed of the observer but also with the age of the observer. Albert Einstein has written about the former phenomenon, and I once wrote a tongue-in-cheek article about the latter one, which I present here.

It turns out, according to my analysis, that your perception of time varies in a precisely quantifiable way depending on your age. You feel exactly as young as you actually are only at ages 0 and 73 years; in between, you feel older than you are.

This article appeared in 1991 in the Australian Biologist 4: 187-190, a journal published by the Australian Institute of Biology. I specifically wrote about biologists, but the analysis applies to all humans. Sadly, this journal has no web page, and little has been heard about it since volume 17 (2004).

Since printed copies of the journal are held by only a few libraries in Australia, presumably no-one has read this article since 1991. Nevertheless, you should read it, and so I have linked to a PDF copy [1.8 MB] of the paper:
Why biologists feel older than they are

Wednesday, May 15, 2013

Resistance to network thinking


Phylogeneticists are used to the idea of tree thinking, in which evolutionary history is seen as a branching tree-like pattern. Clearly, for many phylogeneticists this has not yet been extended to network thinking, in which evolutionary history can also be seen as a reticulating network. Indeed, I have recently come across several people who have actively insisted that "trees are still central" to phylogenetics (to quote one of my correspondents). As Mindell (2013) has claimed, the Tree of Life is still a useful metaphor, model and heuristic device.

So, there is not just indifference to networks but there seems also to be some resistance to them. This is somewhat unexpected, as a network simplifies to a tree if there are no incompatible phylogenetic signals, and so there is no intrinsic reason to restrict phylogenies to being tree-like.
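The claim that a network simplifies to a tree can be illustrated with splits (bipartitions of the taxa). The following is only a minimal sketch, assuming the standard pairwise-compatibility criterion for splits; the taxon names, the helper functions and the three example splits are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

# Hypothetical taxon set, used only for illustration.
TAXA = frozenset({"human", "chimp", "gorilla", "orangutan", "gibbon"})


def make_split(side, taxa=TAXA):
    """Return a split as a pair (side, complement of side)."""
    side = frozenset(side)
    return (side, taxa - side)


def compatible(split_a, split_b):
    """Two splits are compatible if at least one of the four
    intersections between their sides is empty."""
    return any(not (x & y) for x in split_a for y in split_b)


def is_tree_like(splits):
    """A set of splits fits on a single tree when every pair is compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


s1 = make_split({"human", "chimp"})
s2 = make_split({"human", "chimp", "gorilla"})
s3 = make_split({"human", "gorilla"})  # conflicts with s1

print(is_tree_like([s1, s2]))      # compatible signals: a tree suffices
print(is_tree_like([s1, s2, s3]))  # conflicting signals: a network is needed
```

When all of the splits are compatible, the display of those splits is a tree; a network is only needed once some pair of signals conflicts, which is the sense in which the tree is a special case.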

As a typical example from the literature, Losos et al. (2012) have recently commented:
Although molecular data have rarely changed our understanding of the major multicellular groups of the evolutionary tree of life, they have suggested changes in the relationships within many groups, such as the evolutionary position of whales in the clade of even-toed ungulates. Further investigation has usually resolved conflicts, often by revealing inadequacies in previous morphological studies. This has led to a presumption by many in favor of molecular data.
Needless to say this is a biased point of view, because conflicts can also be resolved by revealing inadequacies in molecular studies. For example, molecular analyses involve many subjective decisions about substitution models and rates of molecular change, and any one of the underlying assumptions may be violated. There is no theoretical justification for favouring one source of data over another.

Similarly, there is no theoretical justification for trying to resolve conflicts by preferring one hypothesis over another. Phylogenetic conflicts can also be "resolved" by recognizing that evolutionary history is not necessarily tree-like. Losos et al. do not even consider this possibility:
When two phylogenies are fundamentally discordant, at least one data set must be misleading.
In fact, the only misleading thing here is the word "must", because both datasets may be perfectly correct but are simply the product of two different evolutionary histories.

This point is perhaps most obvious when comparing molecular datasets. The evolutionary history revealed by between-gene evolutionary processes (e.g. recombination, hybridization, horizontal gene transfer) often conflicts with that from within-gene processes (e.g. nucleotide substitutions and insertions / deletions), and this leads to a reticulating evolutionary history.

Indeed, the more we learn about genomes the less tree-like does the evolutionary history of species seem to be. There are long-standing controversies regarding the evolutionary history of many taxonomic groups, and it has been hoped that genome-scale data would resolve these controversies. However, to date none of these controversies has been satisfactorily resolved into an unambiguous tree-like genealogical history using genome data. They all apparently involve reticulate evolutionary processes.

For example, the estimated relationships among humans, chimpanzees and gorillas did not change as a result of genome sampling (Galtier and Daubin 2008), nor did those of malaria species (Kuo et al. 2008) nor those of placental superorders (Hallström and Janke 2010). In all three cases the estimated relationships were just as complex after the genome sequencing as before. The resolution of controversial branches in our trees has not occurred as a result of increased access to character data or improved data analyses, but our recognition of reticulating relationships certainly has occurred.

There are many other examples where increased character sampling is yet to resolve long-standing controversies about branching patterns, and where reticulation may also be the true explanation. Birds seem to provide many of these examples (eg. Smith et al. 2013), but insects are a rich source as well (eg. Thomas et al. 2013), and sometimes even plants (eg. Goremykin et al. 2013).

Clearly, when two or more phylogenies are fundamentally discordant, none of the datasets needs to be misleading, because a reticulating history may be involved. Network thinking should thus be a standard tool in the arsenal of every phylogeneticist. Tree thinking excludes networks but network thinking does not exclude trees, and so the more general model will always be the more useful one.

[Note: An empirical example is discussed in this later blog post: Conflicting placental roots: network or tree?]

References

Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences 363: 4023-4029.

Goremykin VV, Nikiforova SV, Biggs PJ, Zhong B, Delange P, Martin W, Woetzel S, Atherton RA, McLenachan PA, Lockhart PJ (2013) The evolutionary root of flowering plants. Systematic Biology 62: 50-61.

Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology and Evolution 27: 2804-2816.

Kuo C-H, Wares JP, Kissinger JC (2008) The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology and Evolution 25: 2689-2698.

Losos JB, Hillis DM, Greene HW (2012) Who speaks with a forked tongue? Science 338: 1428-1429.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Smith JV, Braun EL, Kimball RT (2013) Ratite nonmonophyly: independent evidence from 40 novel loci. Systematic Biology 62: 35-49.

Thomas JA, Trueman JW, Rambaut A, Welch JJ (2013) Relaxed phylogenetics and the Palaeoptera problem: resolving deep ancestral splits in the insect phylogeny. Systematic Biology 62: 285-297.

Wednesday, May 8, 2013

Journal of Phylogenetics & Evolutionary Biology?


Many of you will have recently received an email (or two) announcing the impending inaugural issue of the Journal of Phylogenetics & Evolutionary Biology, "an open access, peer-reviewed journal which aims to provide the most rapid and reliable source of information on current developments in the field of phylogenetics and evolutionary biology."

The journal promotional material notes that: "The emphasis will be on publishing quality papers [that will] help establish its high standard and facilitate the journal to be indexed by prestigious ISI and PubMed". Sadly, the journal's flyer indicates that the journal is unlikely to achieve any of these aims, because the people in charge have very little idea of what phylogenetics is:


Only one of these images explicitly relates to a rooted evolutionary history (and it even has reticulations!), but the other images vary from irrelevant to downright wrong.

Publishing "quality papers" will get them nowhere, since we cannot tell whether they will be high quality or low quality, good quality or poor quality. I am sure they will have some sort of quality, because even a used car has that. Caveat emptor. Moreover, perpetuating the transformational view of evolution will not attract the favourable attention of either ISI or PubMed, although this particular viewpoint might be appropriate for the evolution of scientific publishing:


Wednesday, April 17, 2013

When is a tree structure a phylogeny?


I have noted before today that many people seem to treat non-biological phylogenetic attributes as being analogous to genotypes whereas most such data are much more similar to phenotypes (eg. False analogies between anthropology and biology; The Music Genome Project is no such thing). This inappropriate analogy can lead to problems, such as incorrect conclusions regarding familial relationships.

In a similar vein, another problem is the appropriation of the word "phylogeny" to refer to non-evolutionary types of tree. A web search for phylogeny will lead you to many sites where the tree structure being referenced is very unlike an evolutionary history.

Systematists have long dealt with this issue as manifest in the confusion between classification and phylogeny. Biological classification is usually treated as most informative (eg. explanatory, predictive) when based on a phylogeny, but a phylogeny is not automatically a classification, and a classification is not automatically a phylogeny.

The best known example is the NCBI Taxonomy, as used by the GenBank database. This is one of the most commonly used classification schemes today, but in bioinformatics it is frequently used as a phylogeny as well as a classification. This is in spite of the fact that NCBI offers the following disclaimer:
The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such.
The issue here is that the classification is hierarchical and can therefore be expressed as a tree, and the same can be said of the nested relationships in a phylogeny. However, not all trees are phylogenies, and the NCBI Taxonomy is a classification that is not necessarily phylogenetic.

More recently, the word phylogeny has been adopted by the computational world to refer to many hierarchical clustering patterns. For example, consider this definition from FreeBase:
The phylogeny pattern is a major pattern within ontology / schema modelling, and is prevalent in many schemas in Freebase. Commonly related are the parent-child pattern and the containment pattern.
In other words, parent-child patterns are phylogenetic, which is literally true as far as it goes, but a two-level hierarchy fits this pattern without being anything more than a trivial phylogeny in the biological sense. An example is the Wikipedia music entries (eg. Rock music), which have a genre and several subgenres, along with fusion genres — this produces a shallow but broad "tree". Indeed, FreeBase has this to say about their own attempt to implement this idea:
One issue is that the some of the data in the music genre hierarchy in Freebase seems to attempt to show a genealogy of genres, rather than family groupings, which is counter to the way that parent and child Media genres are defined.
This seems to be a rather confused set of analogies involving families and genealogies. The false analogy between a tree and a phylogeny seems to have created this confusion. A genealogy expresses family groups (as does a phylogeny), but not all of those potential groups need be expressed in a classification.

It seems to me that it would be simpler for the computational world to refer to a hierarchy rather than a phylogeny.

Wednesday, February 13, 2013

Pasta have no phylogeny (so don't try to give them one)


If you feed arbitrary data into a phylogenetic analysis then you will always get something out again, but it will in all probability be meaningless. For example, non-living objects do not have a phylogenetic history, at least not in the same way as living objects. Cultural objects certainly do have a history, and many aspects of that history may be similar to evolutionary patterns (see False analogies between anthropology and biology), but we cannot take it for granted that all historical patterns can be treated as analogous to those induced by evolution.

Even data that can be placed in a hierarchy do not necessarily represent a phylogeny — a phylogeny may well be a tree but not every tree is necessarily a phylogeny. Even having an evolutionary history does not mean that there is a phylogenetic history — for example, if the evolution is transformational then the history will be a chain rather than a tree or network.

Olivier Rieppel (2010) mentioned pasta as an example of this important distinction, because the features of pasta contain almost no phylogenetic information at all. Pasta has a history, sure, but it is not a phylogenetic history, in the Darwinian sense of variational evolution. Nor, incidentally, does pasta have a transformational history, either. The different types of pasta were not derived by a historical process of descent with modification, but are instead simply different expressions of a small set of basic ideas about what shapes you can make out of noodles (which are themselves little more than durum wheat flour mixed with water). The key feature of phylogenetic datasets is congruence among different character sets, and this is what we detect as phylogenetic signal, but there is no such congruence among the characteristics of pasta. (For a detailed analysis, with figures, see the interesting book by George Legendre 2011.)

In spite of this, pasta actually is used by many institutions (particularly in the USA) as an example to teach school and/or undergraduate students about phylogenetic analysis. I won't list them all here, but the simple Internet search that I just did quickly produced at least six of them. Here is an example datasheet from one of them, to illustrate the idea:


There are, unfortunately, many other examples of inanimate objects being used as introductory examples for phylogenetic analysis, even when those objects have no obvious phylogenetic history, including: Paper-clips (discussed by Petroski 1992); Nuts & bolts (discussed by Nickels & Nelson 2005); and Biscuits (discussed by Madden 2011).

This violates all common sense in phylogenetics. Indeed, it is actually anti-phylogenetics because it promulgates the idea that there is nothing special about evolutionary history, as distinct from any other sort of history. As noted by Erin Naegle (2009):
Biological organisms have descended from common ancestors. This is not true of manmade objects such as hardware or pasta. While constructing trees of objects may be motivating for students, such exercises are removed from evolutionary theory. Using inanimate objects may give students the impression that all trees are equally correct, since there is no inherently correct way to place objects on a tree.
Part of the problem with almost all of these class exercises seems to be confusion between classification and phylogeny, since these seem frequently to be taught as part of the same exercise. As noted by Nickels & Nelson (2005):
Perhaps the most common — but ultimately self-defeating — approach in teaching about biological classification uses the arrangement of manufactured objects (hardware, furniture, whatever) in an attempt to illustrate the principles of biological classification. This approach assumes that classifying manufactured objects is fundamentally similar to classifying biological organisms. Unfortunately, this assumption is wrong in important ways  ...  simply put, taxonomic classifications of organisms are fundamentally different from the classifications of other things. And this distinction is the key point that students need to grasp.
All objects can be classified, and many objects can be classified using a hierarchical scheme, which can then be represented as a tree; but this does not make that tree a phylogeny. Classifications can be derived from any set of data, but they are particularly suitable for datasets with an intrinsic hierarchical pattern. Since the phylogenetic patterns in many groups of organisms are tree-like, they can be conveniently represented in a hierarchical classification. However, this logic cannot be inverted — just because we have a hierarchical classification does not mean that it came from a phylogenetic pattern.

I have always been acutely aware of this potential problem when I have used phylogenetic networks to analyze datasets where there is unlikely to be an evolutionary cause to the multivariate patterns, such as the Eurovision Song Contest, the FIFA World Cup, Scotch whiskies, Bordeaux wine, fast food, or lists of celebrities (see the Analyses page of this blog). In these cases I have explicitly emphasized that the analysis is intended as an Exploratory Data Analysis (EDA) not a phylogenetic analysis. This distinction is an important one in phylogenetics — any patterns detected by the EDA may, indeed, result from a phylogenetic history, but equally they may not do so. In this sense it is unfortunate that the output is still called a phylogenetic network.

I am not the first person to point out the problem of using inanimate objects for phylogenetics (Nickels & Nelson 2005; Naegle 2009; Meisel 2010). If anything, manufactured goods may provide a suitable example of horizontal transfer (Meisel 2010), but this seems a bit advanced for an introductory class of students.

Are there, then, any good examples that could be used to provide students with a simple and easy introduction to phylogenetic analysis? All that is required is that the objects actually have a phylogenetic history, and that a dataset for the objects can be collected by the students in a straightforward but entertaining manner.

As one example, Nelson & Nickels (2000) suggest using humans as the exemplar, and there is a web page pursuing this idea at the Evolution and the Nature of Science Institutes. Alternatively, one could use the example of the fictional Caminalcules (Gendron 2000), which is discussed both here and here. Other examples are limited solely by your own imagination.

References

Gendron RP (2000) The classification & evolution of Caminalcules. American Biology Teacher 62: 570-576.

Legendre GL (2011) Pasta by Design. Thames & Hudson, London.

Madden D (2011) DNA to Darwin: Introductory Activities, Teacher's Guide. NCBE, University of Reading.

Meisel RP (2010) Teaching tree-thinking to undergraduate biology students. Evolution: Education and Outreach 3: 621-628.

Naegle E (2009) Patterns of Thinking about Phylogenetic Trees: A Study of Student Learning and the Potential of Tree Thinking to Improve Comprehension of Biological Concepts. Doctor of Arts thesis, Idaho State University.

Nelson CE, Nickels MK (2000) Using humans as a central example in teaching undergraduate biology labs. In: Karcher SJ (editor) Tested Studies for Laboratory Teaching, Volume 22. Proceedings of the 22nd Workshop / Conference of the Association for Biology Laboratory Education (ABLE), pp 332-365.

Nickels MK, Nelson CE (2005) Beware of nuts and bolts: putting evolution into the teaching of biological classification. American Biology Teacher 67: 283-289.

Petroski H (1992) The evolution of artifacts. American Scientist 80: 416-420.

Rieppel O (2010) The series, the network, and the tree: changing metaphors of order in nature. Biology and Philosophy 25: 475-496.

Wednesday, February 6, 2013

Is there a philosophy of phylogenetic networks?


In some previous blog posts I have discussed the role of phylogenetic networks in science (Are phylogenetic networks as scientific as trees?), particularly in terms of Description, explanation and prediction in phylogenetics. In this post I will look at the philosophy of phylogenetic networks, in terms of whether there is a strong basis for treating the mathematical analyses as having biological relevance.

This is an important point, because there are theoretically an infinite number of ways to mathematically analyze a set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. For example, there is a big difference between a mathematical summary of a set of numbers and any biological interpretation of that summary. The mode, for instance, is a neat mathematical measure of the central location of a biological dataset that also nominates one of the biological objects represented by that dataset, while the mean is an estimate of the central location that rarely describes any biological object at all. So, a mode describes biology directly while a mean does not necessarily do so.
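The difference between the mode and the mean can be shown with a small sketch. The petal counts below are invented for illustration and are not taken from any real dataset.

```python
from statistics import mean, mode

# Hypothetical petal counts from seven flowers (illustrative values only).
petal_counts = [5, 5, 4, 5, 6, 5, 8]

central_mode = mode(petal_counts)  # the most frequently observed value
central_mean = mean(petal_counts)  # the arithmetic average

# The mode is always one of the observed objects; the mean need not be.
print(central_mode, central_mode in petal_counts)
print(central_mean, central_mean in petal_counts)
```

Here the mode corresponds to flowers that actually exist in the sample, whereas the mean is a fractional petal count that no flower has.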

Given that there seem to be two quite different uses for phylogenetic networks, there are likely to be two different philosophical bases. The first of these is easier to deal with than the second one.

Data-display networks

Data-display networks are usually unrooted, and are intended to display the major patterns of character variation in a dataset. There is no necessary implication that any of these patterns are due to the evolutionary history of the organisms concerned, although it is very likely that many of the patterns will reflect that history, either directly or indirectly. I have therefore repeatedly emphasized the role of these networks in Exploratory Data Analysis (EDA).

This means that the obvious philosophical basis for data-display networks is the same as for EDA. There is a strong mathematical basis for EDA and this is considered to have direct relevance to biological studies. EDA has been explored in a number of works, both in general (eg. Tukey 1977;  Hartwig & Dearing 1979; Tufte 1983, 1997; Ellison 2001; Behrens & Yu 2003; Young et al. 2006) and also within phylogenetics (eg. Bandelt 2005; Wägele & Mayer 2007; Morrison 2010). These can be consulted for further information.

The mathematical basis of EDA is to summarize the main characteristics of a dataset in an easily digested form, usually with graphs, without using an explicit statistical model or having formulated an a priori hypothesis. EDA is thus promoted as a counterpoint to confirmatory data analysis (ie. statistical hypothesis testing). The mathematics is not rigid, although various tools have been developed over more than a century. EDA is as relevant to biology as it is to all subjects where data are collected and analysed.

Evolutionary networks

Evolutionary networks, on the other hand, are rooted networks intended to elucidate phylogenetic history. Unlike phylogenetic trees, evolutionary networks explicitly allow for reticulation events (horizontal evolution) as well as descent from parent to offspring (vertical evolution). They are therefore usually seen as a logical generalization of phylogenetic trees.

So, the obvious philosophical basis for evolutionary networks is the same as for phylogenetic trees. However, this inference is not as clear as we might like it to be. For phylogenetic trees there is a rationale for treating the mathematical tree diagram as a representation of evolutionary history; but it is harder to apply the same rationale to evolutionary networks.

The three logical steps to inference using phylogenetic trees are outlined in the figure.


First, we start with some genotypic data, which we transform into a mathematical summary (a DAG) via some quantitative model. Each of these models has an explicit mathematical and/or philosophical basis; for example, maximum likelihood has a well-established mathematical foundation, as does Bayesian analysis. However, there is no necessary biological foundation to these quantitative models, and they are simply convenient mathematical summaries, just like the mean. (Indeed, for normally distributed data, the mean is the maximum-likelihood estimate of the central location of a set of numbers.)
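The parenthetical remark about the mean can be written out as a short derivation. This sketch assumes that the numbers x_1, ..., x_n are independent draws from a normal distribution with unknown location mu and fixed variance sigma^2; that assumption and the symbols are mine, introduced only for illustration.

```latex
\begin{align}
\ell(\mu) &= -\frac{n}{2}\log\left(2\pi\sigma^{2}\right)
             - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(x_i-\mu\right)^{2} \\
\frac{\partial \ell}{\partial \mu}
          &= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}\left(x_i-\mu\right) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
\end{align}
```

The second derivative is -n/sigma^2, which is negative, so the sample mean maximizes the log-likelihood. Under a different error model the maximum-likelihood estimate of location would generally be a different summary, which underlines that the choice of model is mathematical rather than biological.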

The second step is to provide a biological basis for further inference. This is the importance of Willi Hennig in the history of phylogenetics — he provided the logical inference that a divergent mathematical tree can be treated as a representation of the gene or character history, because the tree-like patterns are formed from a nested series of shared derived character states (synapomorphies). That is, the mathematical summary can be logically inferred to represent a biological concept, the character history.

In the third step we infer that a set of gene and/or character histories will, when combined in some way, also represent the organismal history. That is, we infer that gene histories represent organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

So, there is a philosophy to the use of trees for phylogenetic inference, involving three steps (mathematical, logical, practical). There may be mis-estimation of the evolutionary history in practice, of course, perhaps through mis-estimation of the trees or non-representative gene samples, but we cannot expect any method to be perfect. We simply accept that the method we have is the best one we can find, and that it provides a logical basis for inference.

The question is: how do we apply this philosophy to evolutionary networks?

It is sometimes argued that a network is a set of overlapping (partly incompatible) trees. For example, each genetic locus might show a tree-like evolutionary history, but this history might not be the same as any other locus in the same organism. If we adopt this viewpoint then we could consider it unproblematic to use the same philosophy as for trees. That is, at step 1 we produce a set of trees, at step 2 we infer these to represent a set of gene histories, and at step 3 we combine the histories. The only important difference would thus be at step 3, where we combine the genotypic trees in a way that allows for reticulation in the organismal history, rather than insisting that the organismal history be strictly tree-like.

This is an issue that was debated back in the 1980s, when cladists first tried to come to grips with reticulations in a cladogram (eg. Bremer & Wanntorp 1979; Funk 1981, 1985; Humphries 1983; Nelson 1983; Wagner 1983; Wanntorp 1983). It has resurfaced occasionally since then (eg. Skála & Zrzavy 1994; Brower et al. 1996; Lienau & DeSalle 2009), with the consensus apparently being that for reticulating phylogenies this argument is acceptable.

However, it has also been argued that an evolutionary network is not simply a collection of trees. It is often contended, especially by those people dealing with prokaryotes (eg. Doolittle 1999, 2009; Bapteste et al. 2009, 2012), that there is no underlying tree-like structure in much of organismal history — biological history is an anastomosing plexus, instead. If we adopt this viewpoint then we cannot apply the three-step logic as outlined above. We still need to deal with the three steps (biological data to mathematical DAG, DAG to character evolution, characters to organismal evolution), but the DAG will have reticulations rather than being a diverging tree. So, we cannot apply Hennigian logic at step 2, because in a reticulated DAG the characters do not form a nested series of shared derived character states.
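To make the failure of the nested-series condition concrete, here is a minimal sketch in Python. It assumes binary characters whose ancestral state is already known, so that each character can be written as the set of taxa showing the derived state; the taxon and character names are hypothetical. A divergent tree requires every pair of such sets to be either nested or disjoint, whereas a reticulation can produce partly overlapping sets.

```python
from itertools import combinations


def is_nested_series(characters):
    """Return True if every pair of derived-state taxon sets is nested or disjoint."""
    for a, b in combinations(characters.values(), 2):
        # Overlapping sets where neither contains the other cannot fit one diverging tree.
        if a & b and not (a <= b or b <= a):
            return False
    return True


# Hypothetical characters: each maps to the set of taxa showing the derived state.
tree_like = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"D", "E"},
}

# Taxon B shares one derived state with A and another with C, as a hybrid might.
reticulate = {
    "c1": {"A", "B"},
    "c2": {"B", "C"},
}

print(is_nested_series(tree_like))   # True
print(is_nested_series(reticulate))  # False
```

The check only reports that the nested pattern is absent; it says nothing about which biological process produced the conflict.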

So, where are we to get our philosophy under these circumstances? How do we justify the inference that the mathematical summary represents evolutionary history? I have not yet seen this issue discussed in the literature.

References

Bandelt H-J (2005) Exploring reticulate patterns in DNA sequence data. In: Bakker FT, Chatrou LW, Gravendeel B, Pelser PB, eds. Plant Species-Level Systematics: New Perspectives on Pattern and Process. Koeltz, Königstein, pp 245-269.

Bapteste E, Lopez P, Bouchard F, Baquero F, McInerney JO, Burian RM (2012) Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proceedings of the National Academy of Sciences of the USA 109: 18266-18272.

Bapteste E, O'Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, Franklin-Hall L, Lapointe FJ, Dupré J, Dagan T, Boucher Y, Martin W (2009) Prokaryotic evolution and the tree of life are two different things. Biology Direct 4: 34.

Behrens JT, Yu CH (2003) Exploratory data analysis. In: Schinka JA, Velicer WF, eds. Handbook of Psychology, Vol. 2: Research Methods in Psychology. John Wiley & Sons, Hoboken, pp 33-64.

Bremer K, Wanntorp H-E (1979) Hierarchy and reticulation in systematics. Systematic Zoology 28: 624-627.

Brower AVZ, DeSalle R, Vogler AP (1996) Gene trees, species trees, and systematics: a cladistic perspective. Annual Review of Ecology and Systematics 27: 423-450.

Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284: 2124-2128.

Doolittle WF (2009) The practice of classification and the theory of evolution, and what the demise of Charles Darwin's tree of life hypothesis means for both of them. Philosophical Transactions of the Royal Society of London B Biological Sciences 364: 2221-2228.

Ellison AM (2001) Exploratory data analysis and graphic display. In: Scheiner SM, Gurevitch J, eds. Design and Analysis of Ecological Experiments, 2nd ed. Oxford University Press, Oxford, pp 37-62.

Funk VA (1981) Special concerns in estimating plant phylogenies. In: Funk VA, Brooks DR, eds. Advances in Cladistics: Proceedings of the First Meeting of the Willi Hennig Society. New York Botanical Garden Press, New York, pp 73-86.

Funk VA (1985) Phylogenetic patterns and hybridization. Annals of the Missouri Botanical Garden 72: 681-715.

Hartwig F, Dearing BE (1979) Exploratory Data Analysis. Sage, Newbury Park.

Humphries CJ (1983) Primary data in hybrid analysis. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 89-103.

Lienau EK, DeSalle R (2009) Evidence, content and collaboration and the tree of life. Acta Biotheoretica 57: 187-199.

Morrison DA (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044-1057.

Nelson GJ (1983) Reticulation in cladograms. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 105-111.

Skála Z, Zrzavy J (1994) Phylogenetic reticulations and cladistics: discussion of methodological concepts. Cladistics 10: 305-313.

Tufte ER (1983) The Visual Display of Quantitative Information. Graphics Press, Cheshire.

Tufte ER (1997) Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire.

Tukey JW (1977) Exploratory Data Analysis. Addison-Wesley, Reading.

Wägele JW, Mayer C (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.

Wagner WH (1983) Reticulistics: The recognition of hybrids and their role in cladistics and classification. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 63-79.

Wanntorp H-E (1983) Reticulated cladograms and the identification of hybrid taxa. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 81-88.

Young FW, Valero-Mora PM, Friendly M (2006) Visual Statistics: Seeing Data with Dynamic Interactive Graphics. Wiley, Hoboken.

Wednesday, January 2, 2013

False analogies between anthropology and biology


There has been much talk over the past few decades about the extent to which the various disciplines within anthropology (in the broad sense) can use, or benefit from, methodological techniques developed in other disciplines, notably biology (see Mace et al. 2005; Forster & Renfrew 2006; Lipo et al. 2006). This has been particularly true for historical studies of languages (ie. linguistics), past cultures (ie. archaeology) and physical type (ie. physical / biological anthropology). The use of, for example, phylogenetic methods seems to be relatively unproblematic in the latter case (studies of the origin and development of humans as a species; Holliday 2003), although this field is concerned as much with population genetics as it is with species phylogenies. (Note that I am leaving cultural anthropology out of the discussion, as it seems to be less concerned with historical studies.)

However, the use of phylogenetic methods in archaeology and linguistics is based on an analogy between human cultural evolution and biological evolution. This analogy assumes that the underlying processes of historical change in anthropology and biology are similar enough that the analytical methods can be combined. (Note that I am using the word anthropology in the broadest sense, to include linguistics and archaeology.) So, both anthropology and biology apparently involve an evolutionary process, in which the study objects form groups that change via modification of their intrinsic attributes, the attributes being transformed through time from ancestral to derived states (often called "innovations" in anthropology). That is, it is the groups of objects that change through time (variational evolution) rather than the objects themselves changing (transformational evolution). Thus, if one group acquires a new (derived or advanced) character state while the rest do not (ie. they retain the ancestral or primitive state) then this group forms a separate historical lineage that diverges from the other populations, and maintains its own historical tendencies and fate. A search for derived character states that are shared among the groups allows us to reconstruct the evolutionary history.

However, this apparent similarity is basically a metaphor, because human culture is not a collection of biological objects. In Popperian terms, biology is part of the "world that consists of physical bodies" while culture and linguistics are part of the "world of the products of the human mind". Therefore, if we are drawing an analogy between anthropological studies and biological studies, and using this analogy to justify the use of certain analytical techniques, then we need to understand the analogy thoroughly. Here, I argue that in some important ways the currently used analogy is wrong from the biological perspective, and that this has important consequences for anthropological research.

Analogies

The analogy between anthropology and biology has recently focused on the possible relationship between anthropological entities and genes (eg. Mace & Holden 2005; Tëmkin & Eldredge 2007; Croft 2008; Pagel 2009; Steele et al. 2010; Howe & Windram 2011). However, this seems to be a false analogy, as there is no observable equivalent to a gene in the anthropological world (other than inside any biological organisms being studied). Memes, for example, are not observable objects in the way that genes are. So, the analogy between real replicators in biology (genes) and theoretical replicators in anthropology is inappropriate.

However, biology recognizes a distinction between genotype, which is the collection of genes and other associated material in an organism, and phenotype, which is the product of interactions between genes and also between genes and their environment. The DNA, RNA and proteins in an organism are usually taken to represent the genotype, whereas the cells, tissues and organs constitute the phenotype of an individual. To quote Richard Lewontin (in the Stanford Encyclopedia of Philosophy): "the actual correspondence between genotype and phenotype is a many–many relation in which any given genotype corresponds to many different phenotypes and there are different genotypes corresponding to a given phenotype."

The better analogy between anthropology and biology is thus with the phenotype, not the genotype. Genetic material stores information that allows it to replicate itself, either exactly or with modification, and this is the basis of the distinction between living and non-living objects. Nothing in archaeology or linguistics, for example, possesses these properties, and to form an analogy between anthropological entities and genes is thus potentially misleading. In particular, genetic material is based on standardized fundamental units (the nucleotides and amino acids), which have no simple counterpart in anthropology.

An analogy between anthropological entities and phenotypes is much more reasonable, however. Phenotypic entities, such as cells and organs, seem to have much more in common conceptually with anthropological entities, such as phonemes and words in linguistics and stemmatology. Most importantly, it is the phenotype that takes part in evolutionary processes, not the genotype alone (genes are just part of the "replicator story", as DNA on its own does nothing except denature slowly), and so it is actually the more useful comparison. Indeed, up until the 1990s phenotypes were the basic unit of phylogenetics in biology, and it is only since then that biologists have switched wholesale to genotypes for constructing phylogenies. Anthropologists cannot make this switch, and need to remain "phenotype phylogeneticists" instead.

The important point to note is that evolutionary anthropology is a study of historical relationships rather than specifically "genetic" ones. That is, while cultural transmission is qualitatively different from genetic transmission, that does not invalidate a study of history. Genes are passed directly to offspring whereas culture involves behaviour that is transmitted by social learning; for example, manuscripts are copied by hand, languages are learned by imitating parents, and musical instruments are deliberately designed by professionals. Biological transmission is thus different from anthropological transmission, but both types of transmission produce a history.

Phenotypes have historical relationships just as genotypes do, as is now recognized by the resurgence of interest in evolutionary developmental biology (also known as evo-devo). No analogy with genetics is necessary for evolutionary studies of anthropology. Moreover, not all genetic relationships are necessarily evolutionary (much of population genetics, for example, can be conducted without an evolutionary framework), although it is likely that they will all have a strong evolutionary component. (Note that in anthropology vertical phylogenetic descent is sometimes confusingly referred to as the "genetic relationship", perhaps as a result of Noam Chomsky's work, and phylogenies are sometimes referred to as "classifications".)

Consequences

Since phenotypes evolve, they can be an appropriate unit of study in phylogenetics, and can therefore be an appropriate analogue for the study of cultural histories. The distinction between genotype and phenotype as the appropriate analogy is not a trivial one. In particular, the change of perspective seems to make clearer a number of issues that have been raised concerning the application of phylogenetic methods in anthropology.

Homology

First, it is often difficult to work out the homologies between phenotypic entities from divergent groups, just as it is for anthropological entities. If phylogenetics is a search for shared derived character states, then we need to be comparing the same character states in different groups (ie. comparing like with like based on common ancestry). However, shared derived character states are not conveniently labeled as such on the objects themselves. We thus need to infer homology before we can infer phylogeny (or at least do this simultaneously), and this is often more difficult for phenotypes than for genotypes.

Phenotypic homology sometimes causes confusion even among evolutionary biologists. The basic issue is often which features should be seen as different states of the same character. As a cultural example, Tëmkin & Eldredge (2007) discuss the problem of the valves in a cornet, as "the Périnet valve did not derive from the Stölzel valve but rather was an alternative design solution" (alternative designs are quite common for manufactured objects). Thus, neither can be considered to be the ancestral state of a single character (valve type), even though the Stölzel valve predated the Périnet. Most biologists would solve this "problem" by having two separate characters, so that each valve type is either present or absent, thus effectively having a combined total of four character states. This allows a cornet to have either all Stölzel or all Périnet valves, or a combination of both (which a few instruments do have; Eldredge 2002). A cornet that has neither type of valve is called a post horn, this being the instrument from which the cornet was originally derived.
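The two-character coding of the valves can be sketched as follows. This is only an illustration of the coding scheme described above; the character names and the `describe` helper are hypothetical, and the labels simply restate the four combinations mentioned in the text.

```python
from itertools import product

# Hypothetical coding: two independent presence/absence characters,
# instead of one "valve type" character with an assumed ancestral state.
CHARACTERS = ("stolzel_valve", "perinet_valve")


def describe(stolzel, perinet):
    """Label one combination of the two presence (1) / absence (0) characters."""
    if stolzel and perinet:
        return "cornet with both valve types"
    if stolzel:
        return "cornet with Stölzel valves only"
    if perinet:
        return "cornet with Périnet valves only"
    return "post horn (neither valve type)"


# Enumerate every combination of the two binary characters.
codings = {states: describe(*states) for states in product((0, 1), repeat=len(CHARACTERS))}

for states, label in codings.items():
    print(dict(zip(CHARACTERS, states)), "->", label)
```

Coding the valves this way makes no claim that either design is derived from the other, which is the point of the example.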

The search for an objective method of determining phenotypic homology has been a long one (Rieppel 2007), and is not by any means resolved; perhaps the most interesting discussion of an objective procedure is that of Jardine (1967). In particular, homoplasy (convergence / parallelism / reversal) is often a phenotypic phenomenon, as the genotype of the organisms concerned is almost always different in some way. That is, phenotypic homoplasy is usually the result of mistaken homology assessment, whereas genotypic homoplasy usually results from the fact that there are so few units of comparison (eg. four nucleotides). It has been suggested that homoplasy may be even more common in anthropology than in biology (Tëmkin & Eldredge 2007). Indeed, in culture it can be difficult even to decide on the units of comparison (eg. phonemes? syllables? words?), which is quite characteristic of phenotypic studies, and the "taxa" often need to be constructed for analysis (eg. tools, customs, etc).

Furthermore, it is likely to be inappropriate to use an analogy with molecular sequence alignment when discussing cultural and linguistic homologies (Covington 1996; Kondrak 2003; Pagel 2009). Computerized algorithms are usually used to align molecular data and thus make decisions about character-state homology, mostly based on overall similarity. However, homology of phenotypic characteristics requires careful comparative studies to determine what are called topological relations (or connectivity) among the character states, often based on ontogenetic development (Rieppel 2007); this is called "special similarity". It might be difficult to use ontogeny as an analogue for cultural development, since ontogeny refers to the sequential expression of genes, but topological relationships have obvious analogues in linguistics; for example, words consist of both primary structure (phonemes) and secondary structure (morphemes) (List 2012).

Reticulation

Second, it is likely that there will be a greater degree of reticulate evolution in archaeological and linguistic studies. This conclusion follows from the differences in barriers to horizontal flow of information — there are both weak and strong barriers in biology but only weak ones in anthropology.

In biology there are both pre-zygotic and post-zygotic barriers to gene flow, which refer to those acting to prevent the formation of a zygote and those acting after zygote formation, respectively. It is the latter that are most effective in creating reproductive isolation between taxa. Pre-zygotic mechanisms, such as geographical isolation (different locations), ecological isolation (different habitats), temporal isolation (different times), mechanical isolation (different physical structures) and ethological isolation (different behaviours), have obvious analogues in anthropological studies, but these barriers are often not completely effective, such as when species that were previously spatially separated encounter each other for the first time. Post-zygotic mechanisms, such as cross-incompatibility (inability of gametes to fuse), hybrid inviability (failure of zygotes to survive), hybrid sterility (failure of zygotes to reproduce) and hybrid breakdown (failure of second generation hybrids to survive), are strictly genetic mechanisms and they have no obvious analogue in anthropological studies. They are usually very effective barriers to gene flow, and indeed are the principal basis of the biological species concept, for example.

The important point to note is that the post-zygotic barriers are directly under genetic control whereas the pre-zygotic barriers are only indirectly genetically controlled (eg. habitat selection might be genetically determined, and if their habitats are different then two species will be reproductively isolated). This means that the post-zygotic barriers are much stronger. It also means that they are not available in the analogy between anthropology and phenotype.

Weak barriers mean that archaeological and linguistic aggregations are likely to form fuzzy clusters rather than clearly defined groups, just as they do for human races (Fuzzy clusters). Fuzzy clusters are not likely to form clear-cut evolutionary lineages, at least as far as vertical descent is concerned (Eldredge 2011).

Thus, because anthropological studies involve only weak barriers to the horizontal flow of information, reticulate evolution is predicted to be more prevalent than it is in biology. That is, the horizontal component of evolution may even be as large as the vertical one (and possibly more important), because there are none of the strong genetic ("post-zygotic") barriers to flow. Indeed, the use of trees as a model for archaeological and linguistic studies has been questioned repeatedly in recent years, on various grounds (eg. Southworth 1964; Hoenigswald 1990; Moore 1994; Dewar 1995; Ben Hamed & Wang 2006; Tëmkin & Eldredge 2007), usually in favor of reticulation models. Moreover, the earliest representations of historical relationships were networks rather than trees (Gallet), even in biology (Buffon, Duchesne), and since then many alternative reticulation metaphors have been developed (Metaphors). This suggests that the focus on trees has been a distraction from the more obvious model of a network in anthropology.

Networks and trees

One point of confusion here seems to be that trees have been treated as representations of temporal relationships while networks have been treated as representations of spatial relationships. Indeed, this seems to be at the heart of the apparent differences of opinion about the two models — the tree advocates are emphasizing time whereas the network advocates are emphasizing space. The practical problem here is that there are currently no quantitative methods for combining the two. Tree-building algorithms in biology do not allow for reticulation, and the common network algorithms (such as neighbor-net, median-joining, reduced median) solely show static relationships, without any sense that the inferred nodes represent ancestors or the edges connecting the nodes represent evolutionary change. In these commonly used algorithms, the nodes are there solely to support the network structure, and the edges solely express the degree of character difference between the nodes.

For phylogenetic trees there is a rationale for treating the tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared derived character states (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference). However, no such rationale exists for most of the current network methods. The network still represents a mathematical summary of the data, but there is no logic for direct inference about biology. It is almost certain that the mathematical summary represents real biological patterns, but there is no necessity that those patterns are evolutionary ones.

The increasing appearance of neighbor-net networks in the linguistic and archaeological literature (eg. Ben Hamed 2005; Bryant et al. 2005; Bowern 2010; Gray et al. 2010; Heggarty et al. 2010; Dediu & Levinson 2012), for example, is thus based on trying to infer temporal patterns from the network display of spatial patterns, even though there is no explicit rationale for being able to do this — the networks may represent history and they may not. Clearly, what we need are quantitative methods that allow the direct inference of both vertical and horizontal evolutionary patterns — that is, we need phylogenetic networks rather than phylogenetic trees. Moreover, these networks need to be based on models of phenotypic variation not genotypic variation (eg. Lewis 2001). Nakhleh et al. (2005), Warnow et al. (2006) and Erdem et al. (2006) are among the few to have tackled this issue in anthropology.

Note that none of the above discussion is meant to contrast a tree model with a network model in a mutually exclusive way. Mathematically, trees form a subset of networks. Therefore, we do not need to choose between the two as the most appropriate model; we can always choose a network model, and the resulting network will be more or less tree-like depending on the data. So, it is not necessary to decide whether anthropological data are more or less tree-like than biological data (Collard et al. 2006), nor should it be necessary to decide whether horizontal transmission invalidates cultural phylogenetic trees (O'Brien et al. 2002; Greenhill et al. 2009; Currie et al. 2010b); we should simply incorporate any reticulations into the phylogeny rather than decide they are too small to need to include them.
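The statement that trees form a subset of networks can be illustrated with a small sketch. It assumes a rooted, connected, acyclic network given as a list of (parent, child) edges; the node names are hypothetical. Under that assumption, the network is a tree exactly when no node has more than one parent, and the number of multi-parent nodes gives a crude indication of how far it departs from a tree.

```python
from collections import Counter


def reticulation_nodes(edges):
    """Return the nodes with more than one parent in a list of (parent, child) edges."""
    parent_count = Counter(child for _, child in edges)
    return sorted(node for node, count in parent_count.items() if count > 1)


def is_tree_like(edges):
    """A rooted acyclic network is a tree when it has no reticulation nodes."""
    return not reticulation_nodes(edges)


# Hypothetical rooted tree: every node has exactly one parent (except the root).
tree_edges = [("root", "x"), ("root", "C"), ("x", "A"), ("x", "B")]

# Hypothetical network: node H receives input from two parents, x and y.
network_edges = [
    ("root", "x"), ("root", "y"),
    ("x", "A"), ("x", "H"),
    ("y", "H"), ("y", "C"),
]

print(is_tree_like(tree_edges), reticulation_nodes(tree_edges))        # True []
print(is_tree_like(network_edges), reticulation_nodes(network_edges))  # False ['H']
```

The same data structure holds both cases, so nothing has to be decided in advance about whether the history is tree-like.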

In this sense, many of the recent anthropological papers that are based solely on a tree model seem to be misguided, no matter how sophisticated the mathematics of their analyses may be (Gray & Atkinson 2003; Gray et al. 2009; Currie et al. 2010a; Dunn et al. 2011; Gray et al. 2011; Bouckaert et al. 2012). For example, if a dataset is admittedly affected by horizontal transfer, it is unlikely that any tree-building algorithm will correctly construct the tree-like pattern of vertical descent. Thus, even if our model for evolutionary history is "a tree obscured by vines", we will still find it difficult to reconstruct the tree unless we explicitly move the vines out of the way first. It is for this reason, for example, that in linguistics many studies are based on the Swadesh list of words, which is clearly (and intentionally) biased towards words that have been inherited vertically, with little or no horizontal transfer (eg. Bouckaert et al. note: "the cognate data we use excludes known cases of borrowing"). Under these circumstances, it is hardly surprising that authors so often find their phylogenies to be tree-like, since they are deliberately ignoring the vines! Networks are likely to reveal both the tree and the vines (eg. otherwise hidden lexical borrowing; Nelson-Sathi et al. 2011).

Finally, it is worth mentioning the network methods that have been developed for within-species (ie. population) data, particularly mtDNA sequences. These include those methods related to median networks (eg. median-joining, reduced median), but also include those related to one-step networks (eg. statistical parsimony, minimum-spanning). In many anthropological situations, it is likely that these will be more useful than methods related to phylogenetic trees (see the examples in Barbrook et al. 1998; Forster et al. 1998; Forster & Toth 2003; Spencer et al. 2004; Lipo 2006). Bouckaert et al. (2012) take this analogy even further, by using a phylogeny-based epidemiological model of population spread.

Time consistency

The third consequence of rejecting the genotype analogy is that time consistency is no longer required. Organisms store the information (that is vertically and horizontally transmitted) in genes that they carry with them, which restricts reticulation to occurring only between contemporaries. However, while cultural artefacts clearly display their information, they do not transmit it themselves, and it must instead be interpreted by humans. Furthermore, language and culture store their "information" externally, either in the minds of people or in permanent or semi-permanent records (either written or pictorial).

Thus, in anthropology the information available for horizontal transmission can come from the distant past, as well as from the present; the only direction that cultural information cannot flow is from the future to the past. In this sense, extinction seems to be much rarer in archaeology and linguistics than in biology, because information can be stored indefinitely, rather than disappearing along with the possessing species. I have illustrated time inconsistency twice before, with respect to both computers and computer languages, and Tëmkin & Eldredge (2007) illustrate it with musical instruments.
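The difference between the two constraints can be sketched in a few lines of Python. This is a deliberate simplification rather than the formal definition of time consistency used for phylogenetic networks: each lineage is reduced to a hypothetical (first appearance, last appearance) interval in arbitrary, made-up time units, and a horizontal transfer is checked against each rule.

```python
def contemporaries(donor, recipient):
    """Biological-style rule: the two lifespans must overlap for transfer to occur."""
    return donor[0] <= recipient[1] and recipient[0] <= donor[1]


def not_from_future(donor, recipient):
    """Cultural-style rule: the donor need only have appeared before the recipient ends."""
    return donor[0] <= recipient[1]


# Hypothetical (first appearance, last appearance) intervals in arbitrary time units.
old_design = (0, 10)
new_design = (50, 60)

# An old design cannot donate to a much later one if both must be alive together...
print(contemporaries(old_design, new_design))   # False
# ...but it can if its information is stored externally and merely has to pre-date the recipient.
print(not_from_future(old_design, new_design))  # True
# Information still cannot flow from the future to the past.
print(not_from_future(new_design, old_design))  # False
```

Under the looser rule, a network edge may connect nodes that never co-existed, which is what makes a cultural network time inconsistent in the biological sense.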

Part of the issue here is also that archaeological objects are often not contemporaneous, whereas most biological studies are based on data from contemporary organisms (Lipo 2006). This means that in archaeological phylogenetics the study objects appear at internal nodes in the phylogeny as well as at the tips (the data are diachronic), whereas in biology they occur only at the tips (the internal nodes are hypothetical ancestors). In this case, it may be better to consider an archaeological analogy with the incorporation into the phylogenetic histories of full stratigraphic information from fossils (eg. Sumrall 2005; Tëmkin & Eldredge 2007; Fisher 2008).

Historical anthropology is often concerned with "origins" and putting dates on those origins (Gray et al. 2011), and therefore the study interest is where the analytical uncertainty is greatest, since this is the place where there are fewest data. This is quite different to much of the use of phylogenetic techniques in biology, where the relationships of contemporary organisms are the primary interest. Of particular concern are estimates of rates of divergence, for which there appear to be few mathematical models in archaeology. Small changes in rates can have large effects on estimates of origins and their dates, as can changes of rates along lineages.

Disconnection of phenotype and phylogeny

The fourth consequence is that there is often a lack of association between phylogeny and phenotype. There are examples in the literature of phenotypic changes not being directly associated with the phylogeny. Losos (2011) discusses a number of these within biology, and Tëmkin & Eldredge (2007) discuss a couple of cultural examples. In these cases, it is not possible to reconstruct the evolutionary history from phenotypic data, nor indeed to infer the phenotypes from an hypothesis of evolutionary history. In these cases phylogenetics does not aid the study of contemporary patterns.

This is particularly relevant when attempting to reconstruct ancestral phenotypes. Because of the difference between cultural transmission (copied from person to person) and biological transmission (genes are passed directly), there is no necessary reason to assume that ancestral states can be reconstructed from a knowledge of phylogenetic history (see the Evolving Thoughts blog). This also applies when trying to reconstruct characteristics from an independent phylogeny, such as reconstructing a cultural history from a linguistic phylogeny (eg. Walker et al. 2012).

Furthermore, it is possible that archaeological and linguistic concepts (eg. cultural artefacts and languages, respectively) do not form integrated wholes, in the way that biological organisms must. That is, anthropological characters (or groups of characters) can often change independently of each other, and this will create a set of independent phylogenetic histories, so that there is no coherent "entity" with a single history. This situation is likely to be worse than the possibly analogous situation with independent gene histories in biology (Tëmkin & Eldredge 2007).

In addition, cultural evolution may occur faster than biological evolution (Perreault 2012), which makes reconstruction of ancient events more difficult. We might also question whether different cultural artefacts and languages each share a single common ancestor — that is, they are potentially polyphyletic rather than monophyletic.

Process analogies

Finally, we can consider possible analogies of anthropological processes with horizontal genotypic processes, such as introgression, hybridization, recombination, horizontal gene transfer (HGT), and genome fusion. These analogies are sometimes invoked in the linguistic and archaeological literature, but this is not necessarily appropriate given the overall analogy with phenotype rather than genotype.

Introgression is usually treated as a process of admixture, where genetic information from one group moves to another via sexual reproduction. Here, an analogy might be appropriate for anthropology, it being the closest analogy to what anthropologists have called "diffusion". However, it is worth noting that biological admixture initially involves the move of an entire copy of the genome, which might be unlikely for cultural phenomena. Hybridization, on the other hand, involves the creation of a new evolutionary lineage, separate from the parental ones but containing one or more copies of the genome of each of those parents. Creole languages might be an example where this analogy is appropriate, since the parental languages are usually clearly identifiable; but otherwise hybridization seems to be a poor analogy, even though it is commonly invoked in the literature.

Recombination also involves sexual reproduction, but usually refers to the mixing of genes before reproduction occurs, so that the offspring do not have a complete set of genes from any one grandparent. This analogy frequently appears in the literature, often as a synonym for the same phenomena that other people call hybridization, but I suspect that introgression would be a better analogy for the topics included. Examples analogous to recombination might be a single manufacturer "providing all permutations and combinations to the marketplace" of their products (e.g. Courtois' cornets in the late 1850s; Eldredge 2002), or cases where "a scribe used more than one copy of a text when making his or her own" (called contamination; Howe & Windram 2011).

HGT refers to non-sexual transfer of genetic material, often small amounts rather than whole genomes. Clearly, word borrowing would be a prime example where this analogy might be appropriate. Genome fusion refers to the non-sexual transfer of whole genomes, and thus has a similar outcome to hybridization, but between distantly related organisms instead.

Conclusion

We need to drop the idea that there is an analogy between anthropological entities and biological genotypes, and recognize that the better analogy is with phenotypes. The analogy with genotypes is not a productive one, and may even be a positively misleading form of "gene envy". If we accept the qualitative analogy with phenotype, then we can also accept the quantitative consequences of this analogy, which include the idea that trees are much more likely to be inadequate models for cultural history than they apparently are in biology.

The mere fact that one can interpret certain cultural phenomena as showing features analogous to those in biology does not mean that the alleged analogy is of any practical use. We need to understand the analogies more thoroughly, in order to decide whether adopting the analogies is the best thing to do. Analogies are only useful tools for research if they direct that research into productive areas, or provide interpretive insights that would otherwise be unavailable. Otherwise, analogy is merely a topic of conversation.

The main advantage of the phylogenetic analogy is that it focuses attention on the important role of unique "accidents" in determining evolutionary history. The main disadvantage seems to be that the processes involved with these accidents are quite different in biology and anthropology, so that the focus is not always fruitful.

References

Barbrook AC, Howe CJ, Blake N, Robinson P (1998) The phylogeny of The Canterbury Tales. Nature 394: 839.

Ben Hamed M (2005) Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China's demic history. Proceedings of the Royal Society of London series B 272: 1015-1022.

Ben Hamed M, Wang F (2006) Stuck in the forest: trees, networks and Chinese dialects. Diachronica 23: 29-60.

Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, Drummond AJ, Gray RD, Suchard MA, Atkinson QD (2012) Mapping the origins and expansion of the Indo-European language family. Science 337: 957-960.

Bowern C (2010) Historical linguistics in Australia: trees, networks and their implications. Philosophical Transactions of the Royal Society of London series B 365: 3845-3854.

Bryant D, Filimon F, Gray RD (2005) Untangling our past: languages, trees, splits and networks. In: Mace et al. (eds), pp. 67-83.

Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.

Covington MA (1996) An algorithm to align words for historical comparison. Computational Linguistics 22: 481-496.

Croft W (2008) Evolutionary linguistics. Annual Review of Anthropology 37: 219-234.

Currie TE, Greenhill SJ, Gray RD, Hasegawa T, Mace R (2010a) The rise and fall of political complexity in island SE Asia and the Pacific. Nature 467: 801-804.

Currie TE, Greenhill SJ, Mace R (2010b) Is horizontal transmission really a problem for phylogenetic comparative methods? A simulation study using continuous cultural traits. Philosophical Transactions of the Royal Society of London series B 365: 3903-3912.

Dediu D, Levinson SC (2012) Abstract profiles of structural stability point to universal tendencies, family-specific factors, and ancient connections between languages. PLoS ONE 7: e45198.

Dewar RE (1995) Of nets and trees: untangling the reticulate and dendritic in Madagascar prehistory. World Archaeology 26: 301-318.

Dunn M, Greenhill SJ, Levinson SC, Gray RD (2011) Evolved structure of language shows lineage-specific trends in word-order "universals". Nature 473: 79-82.

Eldredge N (2002) A brief history of piston-valved cornets. Historic Brass Society Journal 14: 337-390.

Eldredge N (2011) Paleontology and cornets: thoughts on material cultural evolution. Evolution: Education and Outreach 4: 364-373.

Erdem E, Lifschitz V, Ringe D (2006) Temporal phylogenetic networks and logic programming. Theory and Practice of Logic Programming 6: 539-558.

Fisher DC (2008) Stratocladistics: integrating temporal data and character data in phylogenetic inference. Annual Review of Ecology, Evolution and Systematics 39: 365-385.

Forster P, Renfrew C (eds) (2006) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute for Archaeological Research, Cambridge.

Forster P, Toth A (2003) Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Proceedings of the National Academy of Sciences of the USA 100: 9079-9084.

Forster P, Toth A, Bandelt H-J (1998) Evolutionary network analysis of word lists: visualising the relationships between Alpine Romance languages. Journal of Quantitative Linguistics 5: 174-187.

Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray RD, Atkinson QD, Greenhill SJ (2011) Language evolution and human history: what a difference a date makes. Philosophical Transactions of the Royal Society of London series B 366: 1090-1100.

Gray RD, Bryant D, Greenhill SJ (2010) On the shape and fabric of human history. Philosophical Transactions of the Royal Society of London series B 365: 3923-3933.

Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323: 479-483.

Greenhill SJ, Currie TE, Gray RD (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London series B 276: 2299-2306.

Heggarty P, Maguire W, McMahon A (2010) Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.

Hoenigswald HM (1990) Does language grow on trees? Ancestry, descent, regularity. Proceedings of the American Philosophical Society 134: 10-18.

Holliday TW (2003) Species concepts, reticulation, and human evolution [with discussion]. Current Anthropology 44: 653-673.

Howe CJ, Windram HF (2011) Phylomemetics — evolutionary analysis beyond the gene. PLoS Biology 9: e1001069.

Jardine N (1967) The concept of homology in biology. British Journal for the Philosophy of Science 18: 125-139.

Kondrak G (2003) Phonetic alignment and similarity. Computers and the Humanities 37: 273-291.

Lewis PO (2001) A likelihood approach to inferring phylogeny from discrete morphological characters. Systematic Biology 50: 913-925.

Lipo CP (2006) The resolution of cultural phylogenies using graphs. In: Lipo et al. (eds), pp. 89-107.

Lipo CP, O’Brien MJ, Collard M, Shennan SJ (eds) (2006) Mapping our Ancestors: Phylogenetic Approaches in Anthropology and Prehistory. AldineTransaction, New Brunswick NJ.

List J-M (2012) Improving phonetic alignment by handling secondary sequence structures. In: Hinrichs E, Jäger G (eds) Computational Approaches to the Study of Dialectal and Typological Variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012.

Losos J (2011) Seeing the forest for the trees: the limitations of phylogenies in comparative biology. American Naturalist 177: 709-727.

Mace R, Holden CJ (2005) A phylogenetic approach to cultural evolution. Trends in Ecology and Evolution 20: 116-121.

Mace R, Holden CJ, Shennan SJ (eds) (2005) The Evolution of Cultural Diversity: a Phylogenetic Approach. UCL Press, London.

Moore JH (1994) Putting anthropology back together again: the ethnogenetic critique of cladistic theory. American Anthropologist 96: 925-948.

Nakhleh L, Ringe DJ, Warnow T (2005) Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420.

Nelson-Sathi S, List J-M, Geisler H, Fangerau H, Gray RD, Martin W, Dagan T (2011) Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society of London series B 278: 1794-1803.

O’Brien MJ, Lyman RL, Darwent JA (2002) Cladistics and archaeological phylogeny. In: Martínez G, Lanata JL (eds) Perspectivas Integradoras entre Arqueología y Evolución. Teoría, Métodos y Casos de Aplicación. INCUAPA–UNC, Olavarría, Argentina, pp. 175-186.

Pagel M (2009) Human language as a culturally transmitted replicator. Nature Reviews Genetics 10: 405-415.

Perreault C (2012) The pace of cultural evolution. PLoS ONE 7: e45150.

Rieppel O (2007) Homology: a philosophical and biological perspective. In: Henke W, Tattersall I (eds) Handbook of Paleoanthropology: Vol I: Principles, Methods and Approaches. Springer-Verlag, Berlin, pp. 217-240.

Southworth FC (1964) Family-tree diagrams. Language 40: 557-565.

Spencer M, Wachtel K, Howe CJ (2004) Representing multiple pathways of textual flow in the Greek manuscripts of the Letter of James using reduced median networks. Computers and the Humanities 38: 1-14.

Steele J, Jordan P, Cochrane E (2010) Evolutionary approaches to cultural and linguistic diversity. Philosophical Transactions of the Royal Society of London series B 365: 3781-3785.

Sumrall CD (2005) Fossils in phylogenetic reconstruction. In: Encyclopedia of Life Sciences.

Tëmkin I, Eldredge N (2007) Phylogenetics and material cultural evolution. Current Anthropology 48: 146-153.

Walker RS, Wichmann S, Mailund T, Atkisson CJ (2012) Cultural phylogenetics of the Tupi language family in lowland South America. PLoS ONE 7: e35025.

Warnow T, Evans SN, Ringe DA, Nakhleh L (2006) A stochastic model of language evolution that incorporates homoplasy and borrowing. In: Forster & Renfrew (eds), pp. 75-87.