Wednesday, January 30, 2013

More datasets for validating network algorithms

Ten more datasets have been added to the Datasets blog page. These are:
  • 2 plant studies where hybrids are known from experimentation
  • 3 more plant studies where natural hybrids are known
  • 5 studies (fungi, plants, protozoa, viruses, animals) where recombination is known.

A comment

It is worth noting something that has become obvious to me while compiling these datasets — the mathematical model often applied to hybridization networks cannot easily be applied to many of the datasets collected by biologists. The usual mathematical model involves incompatibility between two or more trees for the same set of taxa, for example from different genes or genomes. The incompatibilities are resolved by postulating one or more reticulations in the network.
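As a minimal sketch of what "incompatibility" means in this model, each tree branch can be treated as a split (bipartition) of the taxon set. Two splits can occur in the same tree only if at least one of the four intersections of their sides is empty. The taxon labels and the function name below are hypothetical, chosen only for illustration.

```python
def compatible(split1, split2, taxa):
    """Return True if two splits of `taxa` can both occur in a single tree.

    Each split is given as one side (a set of taxa); the other side is
    its complement. Two splits are compatible if and only if at least
    one of the four side-by-side intersections is empty.
    """
    taxa = frozenset(taxa)
    a1, a2 = frozenset(split1), frozenset(split2)
    b1, b2 = taxa - a1, taxa - a2
    return any(not (x & y) for x in (a1, b1) for y in (a2, b2))


# Hypothetical taxa: one gene tree groups A with B, another groups B with C.
taxa = {"A", "B", "C", "D", "E"}
print(compatible({"A", "B"}, {"D", "E"}, taxa))  # True: both fit on one tree
print(compatible({"A", "B"}, {"B", "C"}, taxa))  # False: the trees conflict
```

A conflicting pair of this kind is what the network model resolves by postulating a reticulation.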

However, the data produced by biologists often involve only a single nuclear gene, most frequently the Internal Transcribed Spacer region, so that the biologists do not have multiple trees. Instead, hybrids are detected by additive polymorphisms at alignment positions within the study gene. These polymorphisms arise either from (i) the polyploid nature of the hybrids (there are multiple copies of each chromosome, each of which may have a gene copy from either parental species), or (ii) from multiple paralogous copies of the genes (the rRNA region, which contains the ITS, usually has many tandemly repeated copies of the genes, which are homogenized by concerted evolution, but in a hybrid any of them may have a gene copy from either parental species).
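To make the idea of an additive polymorphism concrete, here is a rough sketch that scans an alignment for positions where a candidate hybrid is polymorphic for exactly the two bases that distinguish its putative parents. It assumes that polymorphic positions are recorded with IUPAC two-base ambiguity codes; the sequences and the function name are invented for illustration and are not taken from any of the datasets.

```python
# IUPAC two-base ambiguity codes, assumed here to record
# polymorphic positions in the sequenced gene.
IUPAC = {
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def additive_sites(parent1, parent2, candidate):
    """Return 0-based alignment positions where `candidate` is polymorphic
    for exactly the two different bases found in the putative parents."""
    sites = []
    for i, (p, q, h) in enumerate(zip(parent1, parent2, candidate)):
        if p != q and IUPAC.get(h) == {p, q}:
            sites.append(i)
    return sites


# Hypothetical aligned fragments (not real data).
p1 = "ACGTACGT"
p2 = "ACATACCT"
hy = "ACRTACST"
print(additive_sites(p1, p2, hy))  # [2, 6]
```

Note that this works from a single alignment, with no gene trees involved, which is why such data do not fit the tree-incompatibility model.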

This means that it is difficult to use any current evolutionary network for the phylogenetic analysis of many of the datasets used for detecting hybridization. In turn, this suggests that we may need a different model, one based on additive polymorphisms rather than incongruent trees.

The usual mathematical model for lateral-transfer networks is actually the same as for hybridization networks, since the only real difference between HGT and hybridization is that HGT does not occur via sexual reproduction while hybridization does. (Also, hybridization often involves whole genomes while HGT usually involves partial genomes.) Importantly, the mathematical model does seem to apply to the sort of datasets collected by biologists when they are studying HGT. That is, HGT is detected by incompatibility between two or more trees for the same set of taxa. Indeed, this model is usually the only evidence for HGT, unlike hybridization and recombination where there is often evidence that is independent of the network model.

Monday, January 28, 2013

Cornets: from a tree to a network

I have commented before on the fact that, even in situations where people believe that evolutionary history is reticulate, phylogenies are still presented as a tree rather than as a network (see Why do we still use trees for the Neandertal genealogy? and Why do we still use trees for the dog genealogy?); and I shall presumably comment on this again in the future.

However, this blog post is about someone who stopped using a tree and started using a network: Niles Eldredge. More than 10 years ago (Eldredge 2000) he addressed the idea of the evolutionary history of the musical instrument known as a cornet (a soprano brasswind instrument similar to a trumpet), of which he had been a collector and historian for many decades (see Eldredge 2002).

Originally, he presented a phylogenetic tree of selected cornets, covering the range of known phenotypic variation, based on several diagnosable morphological features (described by Eldredge 2000, 2002, 2011). A version of this tree was published in an interview that he did for New Scientist in 2003 (Walker 2003), and the ideas were subsequently presented in interviews for several other media, including the New York Times (Wertheim 2004) and the Fibreculture Journal (Barnet 2004).

The original bush-like phylogenetic tree of 36 cornets (from New Scientist).
It is based on 17 characters, although there are only 14 synapomorphies shown.

His purpose was to show that cornets do not fit a traditional phylogeny well: they show a very punctuated history, with bursts of rapid radiation where features appear in many lineages. He attributed this topology to the distinct nature of evolution of cultural objects, where innovations developed in one lineage can immediately be transferred to other lineages, and even transferred to earlier parts of those lineages (see my previous post: Time inconsistency in evolutionary networks). That is, the bifurcating cladistic model of evolution does not apply — the tree looks more like a bush.

Eldredge also noted that this "further implies, as a practical matter, that most of the algorithms developed to reconstruct biological history are inappropriate for the reconstruction of material cultural systems" (quoted in Barnet 2004). Indeed, the presence of lateral transfer of ideas means that a reticulating diagram would be a preferable mode of presentation for the history of cornets. So, his final version of that history became a network.

The final network of 39 cornets (from Tëmkin & Eldredge 2007).
Note that the branch lengths represent time, not character change.

This network first appeared in 2007 in Current Anthropology, in a paper co-authored with Ilya Tëmkin (who apparently did the data analysis). Since then, it has appeared in various other places, including Eldredge (2009, p.302), Kelly (2010, p.51) and Eldredge (2011, p.372).

Sadly, I do not think that the method used to produce the network would be repeatable by anyone else:
"This phylogeny was generated by combining several analytical methods. First, three independent phylogenetic analyses were performed on the same data set. The neighbour-joining and maximum parsimony trees were computed in PAUP*, and the reticulate network (based on the dissimilarity matrix generated in PAUP*) was computed in T-Rex. The reticulate branches generated by T-Rex were subsequently plotted onto the neighbour-joining tree. The reticulate branches suggesting relationships that were not corroborated by the presence of at least a single character in any of the shortest trees in the maximum parsimony analysis were eliminated."
The data are available, if anyone wants to have a go at devising a different approach.

The paper by Tëmkin and Eldredge (2007) also contains an evolutionary analysis of the musical instrument known as a Baltic psaltery (a plucked stringed instrument similar to a zither). The phylogeny was presented as a tree in that paper, but the relationships have also been conceptualized as a network in a later analysis (see Veloz et al. 2012).

Note: There is a follow-up post here: Trees and networks of written manuscripts.


Barnet B (2004) Material cultural evolution: an interview with Niles Eldredge. Fibreculture Journal Issue 3: FCJ-017.

Eldredge N (2000) Biological and material cultural evolution: are there any true parallels? In: Tonneau F, Thompson NS (eds) Perspectives in Ethology: Evolution, Culture, and Behavior, pp. 113-153. Kluwer Academic, New York.

Eldredge N (2002) An overview of piston-valved cornet history. Historic Brass Society Journal 14: 337-390.

Eldredge N (2009) Material cultural macroevolution. In: Prentiss AM, Kuijt I, Chatters JC (eds) Macroevolution in Human Prehistory: Evolutionary Theory and Processual Archaeology, pp. 297-316. Springer, New York.

Eldredge N (2011) Paleontology and cornets: thoughts on material cultural evolution. Evolution: Education and Outreach 4: 364-373.

Kelly K (2010) What Technology Wants. Viking Press, New York.

Tëmkin I, Eldredge N (2007) Phylogenetics and material cultural evolution. Current Anthropology 48: 146-153.

Veloz T, Tëmkin I, Gabora L (2012) A conceptual network-based approach to inferring the cultural evolutionary history of the Baltic psaltery. In: Miyake N, Peebles D, Cooper RP (eds) Proceedings of CogSci 2012: 34th Annual Meeting of the Cognitive Science Society, pp. 2487-2492. Cognitive Science Society, Austin TX.

Walker G (26 July 2003) The Collector. New Scientist 179: 38-41.

Wertheim M (9 March 2004) Scientist at work — Niles Eldredge: Bursts of cornets and evolution bring harmony to night and day. New York Times.

Wednesday, January 23, 2013

Some possible questions for EDA in phylogenetics

Exploratory data analysis (EDA) is an important part of biological data analysis. It is a form of a priori analysis, where the data are investigated before any substantial hypothesis-testing analysis occurs.

EDA seeks to investigate the patterns that exist in a dataset without too many constraints on what those patterns might be. It should reveal any patterns and their relative strengths, in a quick and easily employed manner, preferably using easily digested pictures. In phylogenetics, there are now many types of networks available for exploring the extent to which data are tree-like, and the nature and location of any non-tree-like patterns. These are ideal for EDA.

It is important to recognize that EDA is not a phylogenetic analysis, in the sense that it is not intended to reveal evolutionary history. It is an exploration of the data intended to reveal the major patterns and their relationships. This information will help a phylogeneticist make decisions about how best to analyze their data and, indeed, even whether a phylogenetic analysis of the data is worthwhile. In this sense it is unfortunate that the sorts of networks that I discuss here are often called "phylogenetic networks", since they do not perform phylogenetic analysis.

EDA questions

EDA should not proceed in a haphazard manner, as the more one looks at a dataset the more patterns one is likely to see. In particular, there is always the potential danger of detecting spurious patterns, which may seem to be biologically important, especially if they confirm some of the pre-conceived ideas of the phylogeneticist. Therefore, it is important to have a clear goal when employing EDA, which might involve asking a set of explicit questions when viewing the data.

These questions should not constrain the exploration of the data, as there are many possible questions, directed towards many different goals. Instead, they should be intended to discover possible constraints on subsequent analyses. Indeed, they may even reveal testable hypotheses that would otherwise go unrecognized. (You should not, of course, both discover and test hypotheses using the same dataset!)

As an example, I discuss some questions that might be important for EDA in phylogenetics. These are not the only possible questions, of course, but they seem to be among the more commonly seen ones in the literature. That is, they focus on the contemporary approach to phylogenetics as a tree-building exercise. These represent a basic set of questions that should be answered before a tree-building analysis could begin. Similar questions could be asked, for example, about reticulation processes in evolutionary history.

These questions can be asked about the whole data or about subsets of the data (eg. genomes, genes, loci), which will indicate whether the variation is within-gene or between-gene.

(1) Are the data tree-like?

(2) If not, then is this because they are:
   (a) uninformative about relationships (a bush)
   (b) weakly tree-like (a tree obscured by vines)
   (c) characterized by several strong incompatible relationships (a structured network)
   (d) confused about relationships (an unstructured network — a spiderweb)

(3) If the data are tree-like, then are the trees of data subsets:
   (a) congruent
   (b) incongruent

The important issues that these questions are intended to address include:
   (i) are there strong patterns in the data that might answer the experimental question? (finding suitable data for a specific phylogenetic question is not necessarily easy);
   (ii) what are the patterns? (different patterns are likely to have different biological causes);
   (iii) is any data incompatibility principally between genes or within genes? (these patterns are likely to be caused by different biological processes);
   (iv) are the incompatibilities the result of reticulation processes (eg. recombination, hybridization, lateral gene transfer) or not (eg. incomplete lineage sorting, gene duplication-loss)?

These questions can be addressed using programs such as SplitsTree, which implements a range of network analyses based on unrooted splits graphs. Probably the most commonly encountered of these methods is NeighborNet. However, it is not currently possible to directly extract answers to the above questions using this program. For example, we cannot simultaneously display the networks for different subsets of the data, to directly compare them. Nevertheless, I can illustrate the idea with an example.

An example

The data are from O'Donnell et al. (2000). There are data for six partial gene sequences (334-1336 nt per gene) from each of 27 samples of Fusarium graminearum fungi.

We can start the EDA with a NeighborNet analysis of the entire dataset:

This is rather tree-like, with one major reticulation, involving sample NRRL_28721. Note that we have not imposed a root on the data, because it is often the root location that is the most ambiguous part of the data. We can confirm that this sample is the only one causing the non-tree-like behaviour by temporarily deleting it:

Indeed, the data are now very tree-like. Thus, the complete dataset seems to fit category 2c from the list above. Given a root location, this might then be expected to be a good estimate of the phylogeny.

We can further investigate the incompatible data pattern by looking at each of the six gene subsets individually, in order to locate the possible cause of the reticulation involving NRRL_28721. It turns out that there are several different patterns revealed by the subsets.

The Ligase2 and Ligase3 genes produce poorly resolved bushes (category 2a), with many of the samples being identical:

The Ligase1 and Permease1 genes produce bushes with a netted centre (a combination of categories 2a and 2d):

The Permease2 gene apparently shows a reticulation, but a look at the original sequence alignment shows that this involves an incompatibility between 1 character and 2 other characters; and so it is likely to be of little interest:

In my experience, this is an important consideration when using EDA in phylogenetics. A pattern that looks large in a network may be relatively trivial in the original data. This can happen whenever there is little information in the dataset, so that even the smallest pattern looks large.

The Trichothene gene also produces a net-centered bush, but it is the only gene that shows the reticulation involving NRRL_28721. It is therefore a gene of particular interest for the EDA:

What this EDA shows is that most of the individual genes contain little information on their own, but when combined they show a strong tree-like pattern. That is, the genes complement each other in a phylogenetic analysis. Moreover, the reticulation involving NRRL_28721 appears in only one gene. Indeed, investigation of the alignment suggests a recombination event, with a cross-over in this particular gene. The other genes share one or the other of the two patterns associated with this cross-over.

A final tree-building analysis would therefore be expected to be productive, except that NRRL_28721 will not have a stable position when forced into a tree. On the other hand, a network would be a more appropriate display of the data, as it could show the recombination event associated with NRRL_28721.

Some desiderata for EDA in phylogenetics

Given what I have said above, it would be convenient if there were quantitative measures to identify the different types of network structure, which could then be displayed on each graph as part of the EDA protocol. For example, we need to distinguish these graph forms from each other:
    Few edges (ie. many taxa are identical)
    Star tree (ie. unresolved relationships)
    Resolved tree (ie. clear relationships)
    Structured network (ie. several clear conflicting relationships)
    Unstructured network (ie. unclear relationships)

Unsurprisingly, there is no single measure that could distinguish these from each other. We will therefore need several measures, to be used together.

At the moment, two measures of the degree of reticulation have been proposed, the δ-score (Holland et al. 2002) and the Q-residual (Gray et al. 2010). The main difference is in the normalization constant. Both of them are currently reported by SplitsTree, as an option. Basically, they produce small values for trees and bushes, larger values for structured networks, and the largest values for spider webs. For example, deleting NRRL_28721 from the above dataset reduces the δ-score from 0.1341 to 0.1083 (the scores can range from 0 to 1).
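For readers who want to see the arithmetic, the following is a minimal sketch of the δ-score as I understand the definition of Holland et al. (2002): for every quartet of taxa, the three possible pairings give three distance sums, and the quartet's δ compares the gap between the two largest sums with the gap between the largest and the smallest. It is not the SplitsTree implementation, and the function name and the two toy matrices are my own.

```python
from itertools import combinations


def delta_score(dist):
    """Mean quartet delta for a symmetric distance matrix (list of lists).

    For each quartet {x, y, u, v} the three pairings of the four taxa
    give three distance sums. With these sorted as d1 >= d2 >= d3, the
    quartet's delta is (d1 - d2) / (d1 - d3), taken as 0 when d1 == d3.
    """
    n = len(dist)
    if n < 4:
        raise ValueError("at least four taxa are needed to form a quartet")
    deltas = []
    for x, y, u, v in combinations(range(n), 4):
        d1, d2, d3 = sorted(
            (dist[x][y] + dist[u][v],
             dist[x][u] + dist[y][v],
             dist[x][v] + dist[y][u]),
            reverse=True,
        )
        deltas.append(0.0 if d1 == d3 else (d1 - d2) / (d1 - d3))
    return sum(deltas) / len(deltas)


# Toy distances that fit the tree ((A,B),(C,D)) exactly.
tree = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
# Toy distances from a square "box", with no tree-like signal at all.
box = [[0, 1, 1, 2], [1, 0, 2, 1], [1, 2, 0, 1], [2, 1, 1, 0]]
print(delta_score(tree))  # 0.0
print(delta_score(box))   # 1.0
```

The two toy matrices sit at the extremes of the 0 to 1 range; real datasets, such as the one above, fall in between.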

Thus, these measures can potentially be used to distinguish among the last three graph types in the list, but not among the first three. Wichman et al. (2011) expressed an apparent preference for the δ-score over the Q-residual, based on an evaluation of a linguistic dataset. Unfortunately, these scores have no statistical interpretation. As noted by Gray et al.:
"We have implemented and tested a number of schemes for assessing the significance of delta score and Q-residual values, including non-parametric and parametric bootstrapping. Unfortunately, and curiously, none have proven to be sufficiently powerful and robust."
It is possible to suggest other measures that might also be useful for quantifying the different types of graph structure. For example, for a NeighborNet graph, which is closely related to an unrooted Neighbor-Joining (NJ) tree, we could consider the following quantitative measures to characterize each type of structure, relative to a fully resolved tree:

Few edges
    how many splits there are relative to the number of samples
Star tree
    how many internal splits there are relative to the minimum number necessary for a fully
    resolved tree
Structured network
    how many of the NJ tree splits form parallel sets in the NeighborNet of the same data
    (as opposed to single edges)
Unstructured network
    how many extra splits there are in the NeighborNet relative to the NJ tree
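
Three of the measures listed above are simple counts, and could be sketched roughly as follows. This assumes that the splits of the NeighborNet and of the NJ tree have already been exported, with each split listed once and represented by one of its sides. The function and key names are hypothetical, and the "parallel sets" measure is not attempted here.

```python
def split_summary(n_taxa, network_splits, tree_splits):
    """Crude descriptive counts for a splits network versus a tree.

    Each split is represented by one side, as a frozenset of taxon labels.
    A fully resolved unrooted tree on n taxa has n trivial splits and
    n - 3 internal (non-trivial) splits.
    """
    if n_taxa < 4:
        raise ValueError("internal splits need at least four taxa")

    def internal(splits):
        # A split is internal if neither side is a single taxon.
        return [s for s in splits if 1 < len(s) < n_taxa - 1]

    net_internal = internal(network_splits)
    tree_internal = internal(tree_splits)
    return {
        "splits_per_taxon": len(network_splits) / n_taxa,
        "tree_resolution": len(tree_internal) / (n_taxa - 3),
        "extra_network_splits": len(net_internal) - len(tree_internal),
    }


# Hypothetical five-taxon example: a resolved tree plus one conflicting split.
trivial = [frozenset(t) for t in "ABCDE"]
tree_splits = trivial + [frozenset("AB"), frozenset("DE")]
network_splits = tree_splits + [frozenset("BC")]
print(split_summary(5, network_splits, tree_splits))
```

Whether ratios like these are the most useful normalizations is exactly the sort of thing that needs discussion.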

I am sure that others can easily be devised, as well. I encourage people to have a think about this, so that some consensus might be reached about desiderata for EDA in phylogenetics. The computational people cannot develop algorithms until the biologists tell them what is needed.

Monday, January 21, 2013

EDA or post-optimality analysis of phylogenetic data?

These days, phylogeneticists usually build trees to express the evolutionary history of their samples. As part of this procedure, they also show an interest in the "quality" of their trees. This is a very vaguely defined concept, probably because it has something to do with accuracy, or correctness, which is something we can know almost nothing about. So, instead, we resort to a whole swag of other concepts, such as resolution, robustness, sensitivity and stability, which are related to precision rather than to accuracy.

We implement these "precision" ideas in many ways, including: (i) analytical procedures, such as interior-branch tests, likelihood-ratio tests, clade significance, and the incongruence length difference test; (ii) statistical procedures, such as the ubiquitous non-parametric bootstrap and posterior probabilities, the jackknife, topology-dependent permutation, and clade credibility; and (iii) non-statistical procedures, such as the decay index, clade stability, data decisiveness, and spectral signals.

Most of these are forms of what is called post-optimality analysis — the tree is first calculated and then we evaluate it. In the current issue of Bioinformatics there is a paper by Saad Sheikh, Tamer Kahveci, Sanjay Ranka and Gordon Burleigh (Stability analysis of phylogenetic trees. Bioinformatics 2013 29(2): 166-174) that provides yet another take on the same theme:
"We define measures that assess the stability of trees, subtrees and individual taxa with respect to changes in the input sequences. Our measures consider changes at the finest granularity in the input data (i.e. individual nucleotides)."
Basically, the idea is to see how much the input would need to be changed in order to cause a change in the tree topology. For example, the authors quantify the minimum edit distance required to create a specified Robinson-Foulds tree distance from the optimal tree, although any similar distances could be used instead. Their basic purpose was to develop a method that could be effective for very large datasets, which most of the alternatives cannot.
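
The tree-distance half of that idea is easy to sketch. The Robinson-Foulds distance between two trees on the same taxa is the number of splits found in one tree but not in the other. The code below assumes that each tree is supplied as a list of its internal splits, each given by one side; the names are my own, and this is not the code of Sheikh et al.

```python
def canonical(split, taxa):
    """Represent a split by the side that excludes a fixed reference taxon,
    so that the same split always gets the same representation."""
    side = frozenset(split)
    ref = min(taxa)
    return frozenset(taxa) - side if ref in side else side


def rf_distance(splits1, splits2, taxa):
    """Robinson-Foulds distance: the number of splits present in exactly
    one of the two trees."""
    s1 = {canonical(s, taxa) for s in splits1}
    s2 = {canonical(s, taxa) for s in splits2}
    return len(s1 ^ s2)


# Hypothetical five-taxon trees that differ in one internal branch.
taxa = {"A", "B", "C", "D", "E"}
tree1 = [{"A", "B"}, {"D", "E"}]
tree2 = [{"A", "C"}, {"D", "E"}]
print(rf_distance(tree1, tree2, taxa))  # 2
```

The stability measures then ask how many nucleotide edits are needed before the optimal tree moves by a given distance of this kind.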

This approach raises the question of whether a post-optimality analysis is the best approach in the first place. This type of approach assumes a tree as the basic structure, and fails to consider alternative structures that might be more appropriate for the data.

Exploratory data analysis (EDA), if performed effectively, can achieve the same result (an assessment of "stability") while at the same time revealing whether a tree is actually the best structure. It does this by evaluating the dataset directly, a priori, rather than evaluating the data relative to a tree, a posteriori. Evaluating the tree in terms of the data is not the same thing as evaluating the data independently of any tree.

Of the methods listed above, the only one that evaluates the data in a tree-independent manner is the use of spectral signals. Another approach is, of course, to use a data-display network, which provides a very convenient picture of the data, and will thus reveal whether a tree-building analysis is a good idea or not.

An example

To explore the essential difference between EDA and post-optimality analysis, we can look at one of the example datasets used by Sheikh et al. to illustrate their method.

This dataset involves sequences of 169 species of mammals, published by Meredith et al. (Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 2011 334(6055): 521-524). There are 35,603 aligned nucleotides, concatenated from 26 gene fragments. The sampling includes nearly all mammalian families, plus five vertebrate outgroups.

Meredith et al. note about their data:
"Phylogenetic relations from maximum likelihood (ML) and Bayesian methods are well resolved across the mammalian tree. More than 90% of the nodes have bootstrap support of ≥ 90% and Bayesian posterior probabilities of ≥ 0.95. Amino acid and DNA ML trees are in agreement for 163 out of 168 internal nodes."
Not surprisingly, Sheikh et al. reach a similar enthusiastic conclusion:
"The Mammals dataset is highly stable. There is not a single move (R = 1) possible for an edit distance of up to 530 nucleotides. Even if we place an extremely high limit of E = 1000, the biggest move possible is RF = 5. Thus, the stability measures provide an explicit guarantee that there is no move possible for E = 500 and any values of R within 1 SPR distance. This also demonstrates the power of building phylogenies from large densely sampled datasets."
However, this enthusiasm contradicts some well-known previous results. For example, Meredith et al. also note:
"Several nodes that remain difficult to resolve (e.g., placental root) have variable support between studies of rare genomic changes, as well as genome-scale data sets, which suggest that diversification was not fully bifurcating or occurred in such rapid succession that phylogenetic signal tracking true species relations may not be recoverable with current methods."
A simple EDA makes the situation clear, which is not done by either the bootstrap / posterior-probability approach of Meredith et al. or the edit-distance / tree-distance approach of Sheikh et al. If we stick to the simple parsimony approach of the latter (rather than the model-based approach of the former), then we can analyze the dataset with Hamming distances and a NeighborNet graph.
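
For anyone unfamiliar with it, the Hamming distance between two aligned sequences is simply the number of positions at which they differ. A minimal sketch of building the pairwise matrix that a NeighborNet program takes as input might look like this; the sequences and names are invented, and the real dataset has 35,603 positions, not six.

```python
def hamming(a, b):
    """Number of alignment positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(alignment):
    """Return (names, matrix) of pairwise Hamming distances for a dict
    mapping taxon name to aligned sequence."""
    names = sorted(alignment)
    matrix = [[hamming(alignment[p], alignment[q]) for q in names]
              for p in names]
    return names, matrix


# Hypothetical toy alignment (not the mammal data).
alignment = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "TCGTTA"}
names, matrix = distance_matrix(alignment)
print(names)   # ['t1', 't2', 't3']
print(matrix)  # [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
```

This sketch counts every mismatch, including gaps and ambiguity codes, which a real analysis would need to handle more carefully.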

First, the root of the mammals is not clear. The published tree places the root on the branch leading to monotremes, but in the network the outgroup involves a reticulation. This is caused by an ambiguous relationship between Echinops telfairi (Tenrecidae) and (i) the {outgroup + monotremes} and (ii) the Afrotheria. In the tree the Afrotheria is united by a "strongly supported node".

Second, the root of the placentals is very unclear. Most of the major groups of placentals form clusters, but the relationships among these clusters are very obscure. The data are bush-like within the placentals, rather than tree-like, both at the level of the four major groups (named in the graph) and within each of those groups. In the published tree, some of these subgroups are well-supported, but others involve disagreement between the DNA and amino-acid trees, while others have < 90% bootstrap support.

It is not immediately obvious that a tree-building analysis is going to be of much use for this dataset. There is certainly some "power of building phylogenies from large densely sampled datasets", but this does not automatically mean that those phylogenies will be tree-like. Evolution involves a more diverse process than that, and post-optimality analyses based on a single model may be very misleading about that diversity.

Wednesday, January 16, 2013

Datasets for validating algorithms for evolutionary networks

Steven Kelk has previously raised the issue about Validating methods for constructing evolutionary phylogenetic networks: there are currently not many options for validating the biological relevance of methods for constructing evolutionary phylogenetic networks. These are phylogenetic networks intended to represent evolutionary history, such as HGT networks, hybridization networks, and recombination networks.

Thus, we need a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This could then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks.

This issue was discussed at some length at the Workshop: The Future of Phylogenetic Networks. It was suggested by Leo van Iersel that a practical starting point would be to use this blog as a link to suitable datasets. As people become aware of such datasets, a blog post would be published with the details, and the dataset would be linked from one of the blog Pages.

This page now exists (Datasets), and can be accessed at the top right of each blog page. Everyone is encouraged to contribute to this "database", which you can do by sending details about potential datasets to me by email.

In another post, What should a database of datasets look like?, I have noted that there have been four suggested approaches to acquiring datasets for evaluating algorithms (in order of increasing reality):
  1. simulate datasets under one or more data-generation models
  2. create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
  3. use datasets where the postulated reticulation events have been independently confirmed
  4. experimentally create taxa with a known evolutionary history.
It seems unnecessary to store datasets of type (1), since they can be created to order by computer programs. Datasets of type (2) are rare, but would be suitable for the database.

Datasets of type (4) currently exist for tree-like evolutionary histories but not yet, as far as I know, for reticulated histories. I have added the known (and available) ones to the database.

Datasets of type (3) are likely to form the bulk of the database, and I have started this part of the database with some example datasets involving hybridization.

For the latter datasets, it is important to note the potential problem of the degree to which the postulated reticulation events have been independently confirmed. I suspect that only weak evidence has been applied to far too many datasets. This is particularly true for those involving horizontal gene transfer (HGT), where mere incongruence between genes is presented as the sole "evidence". More than this is required (see Than C, Ruths D, Innan H, Nakhleh L. 2007. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535.).

Monday, January 14, 2013

The mysterious rankings in Forbes' Celebrity 100

Every year since 1999 Forbes magazine has produced a list called the Celebrity 100, which purports "to list the 100 most powerful celebrities of the year" within the USA. The list is based on entertainment-related earnings plus media visibility (exposure in print, television, radio, and online). The current list was released in May 2012 (see The World's Most Powerful Celebrities).

The 2012 list generated plenty of negative comments around the web, typified by this one from the Huffington Post: "Looking at the list and the various artists' rankings in the five categories used to determine their placement, there's a sense that the actual positions are more arbitrary than usual. For example, Jennifer Lopez is #1 despite not ranking inside the top 10 in any of the categories (and is as low as #30 in money, #22 in press rank and #19 in social), while both Lady Gaga [Stefani Germanotta] and Justin Bieber earned top 10 placements in all of them except the Money Rank, the top tier of which is dominated by multimedia moguls like Oprah Winfrey".

Forbes' Celebrity 100, top three.

Forbes lists the five categories (designed to measure "earnings and fame") as follows, along with the data sources they used:
  • Money Rank — talk to industry insiders to come up with an estimate of earnings
  • TV / Radio Rank — use Lexis / Nexis to find out how many times each star was mentioned on television and on the radio
  • Press Rank — print media mentions come from Factiva
  • Web Rank — measured using Google blogs
  • Social Rank — count of Twitter followers and Facebook fans.
Each celebrity is ranked within each of these categories separately, and then: "all of the data is processed through an algorithm that creates our power ranking."

The following criticism appears among the comments on the Forbes blog page: "I’m not sure what your algorithm is based on, but it’s clearly not the rankings you provided, given that Beiber and GaGa tie or outrank J-Lo in every category, and Oprah beats her in all but one category and made more than three times as much $. Seems odd ..." Indeed, both Rihanna Fenty and Beyoncé Knowles also out-ranked Lopez in four of the five categories, and yet Beyoncé was ranked only 16th and Rihanna 4th.

In reply to the criticisms, Dorothy Pomerantz (the Forbes editor in charge of the list), noted: "The thing that doesn't show up on the celebrity profiles is magazine covers, and it counts for a lot ... We comb through dozens of magazines from the past 12 months to count how many times each star appeared on covers." This comment generated the response: "to leave magazine covers out when it supposingly 'counts for a lot' is crazy, and nonetheless it still seems weird Lopez could come out on top."

To quantify whether the rankings are indeed 'crazy' or not, we can use a network as a useful means of exploratory data analysis. As usual, I have used the Manhattan distance and a neighbor-net network, but this time I have applied them to the rankings for each of the five celebrity-status categories, as listed by Forbes. If the final ranking in the overall Celebrity 100 list really is based on a pooling of the rankings for the five individual categories, then there should be a simple pattern in the resulting network, with similarly ranked celebrities appearing near each other in the graph. That is, people who are closely connected in the network are similar to each other based on their category rankings, and those who are further apart are progressively more different from each other.
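
As a small sketch of the distance calculation, the Manhattan distance between two celebrities is the sum of the absolute differences between their ranks across the five categories. The three rank profiles below are made up for illustration; they are not the Forbes data.

```python
from itertools import combinations


def manhattan(u, v):
    """Sum of absolute differences between two equal-length rank profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must cover the same categories")
    return sum(abs(a - b) for a, b in zip(u, v))


def pairwise(profiles):
    """Manhattan distance for every pair of named rank profiles."""
    return {(p, q): manhattan(profiles[p], profiles[q])
            for p, q in combinations(sorted(profiles), 2)}


# Hypothetical ranks in (money, TV/radio, press, web, social).
profiles = {
    "X": (1, 2, 3, 4, 5),
    "Y": (2, 2, 4, 4, 6),
    "Z": (50, 40, 60, 45, 55),
}
print(pairwise(profiles))
# {('X', 'Y'): 3, ('X', 'Z'): 235, ('Y', 'Z'): 232}
```

Here X and Y would sit close together in a network and Z far from both, which is the pattern we would expect for similarly ranked celebrities if the overall ranking simply pooled the five categories.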

The neighbor-net graph of the people in the Forbes Celebrity 100,
labelled according to their ranking.

However, this is blatantly not the case. To highlight this fact, I have coloured the top ten celebrities in red and the bottom ten in purple on the network. While it is true that the red numbers are down at the bottom of the graph and the purple ones are at the top, thus forming a pattern of sorts, the different colours are clearly not clustered together (the red form two separate groups rather than one, and so do the purple). So, it is not just the Lopez ranking that is screwy — a lot of the ranks appear to be fairly arbitrary.

If we look at some specific examples, the people at rank 16 (Beyoncé Knowles) and rank 24 (Adele Adkins) are clustered in the network with those people with ranks 1-8, while the people with ranks 9 and 10 are far away from 1-8. This means that, given their performance on the individual criteria, Beyoncé and Adele should be in the top ten on the Celebrity 100 list, and yet Forbes has them ranked as 16 and 24. Why? Surely their magazine-cover performance cannot have affected them that severely.

Conversely, neither Britney Spears (ranked 6) nor Tom Cruise (ranked 9) made it into the top ten in any category, and yet they are both ranked in the overall top ten, whereas everyone else ranked inside the top 16 made the top ten in at least one category.

As other examples, we can note that rank 48 is near ranks 9 and 10 in the network while rank 49 is near ranks 98 and 99; and rank 83 is next to rank 22 (near the bottom of the graph) while 26 is next to 81 (near the top)!

The Forbes so-called 'algorithm' must be a sight to behold, because it produces an outcome that is not quite random but nevertheless does a very creditable imitation of it. I think that we can safely say that there is more than 'magazine covers' involved in these ranking discrepancies.

A much more reasonable top dozen, based on the graph, would be the people ranked 1–8, 11, 15, 16 and 24. These are, in alphabetical order:
  • Adele Adkins
  • Justin Bieber
  • Rihanna Fenty
  • Stefani Germanotta
  • LeBron James
  • Kim Kardashian
  • Beyoncé Knowles
  • Jennifer Lopez
  • Katy Perry
  • Britney Spears
  • Taylor Swift
  • Oprah Winfrey
The exact rank order of these people would, presumably, be determined by the number of magazine covers on which they have appeared.

There are some other things that we can learn from an analysis of the Celebrity 100 list, but they have nothing to do with networks, so I will not cover them here. (I have now covered them in this later blog post: Non-randomness in Forbes' Celebrity 100 ranking.)

Wednesday, January 9, 2013

We should present bayesian phylogenetic analyses using networks

Bayesian methods differ from other forms of probabilistic analysis in that they are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. That is, Bayesian analysis is not about identifying the most likely outcome, it is about quantifying the relative likelihood of all possible outcomes. In this sense, it is quite distinct from other probabilistic methods, such as those based on estimating the optimal outcome under criteria such as maximum likelihood, maximum parsimony or minimum evolution. As far as likelihood is concerned, Bayesian analysis deals with (maximum) integrated likelihood rather than (maximum relative) likelihood (Steel & Penny 2000).

In phylogenetic analysis this creates a potentially confusing situation, as the result of most Bayesian analyses is presented as a single tree, rather than showing the probability distribution of all trees. Certainly, some of the information from the probability distribution is used in the tree – usually the posterior probabilities that are attached to each of the tree branches – but this is a poor visual summary of the available information. A better approach would be to use a network to display the actual probability distribution.


Bayesian phylogeny programs such as MrBayes produce a file (or files) with a copy of every tree that was sampled by the MCMC run (or runs). These files can then be manipulated (eg. using the "sumt" command in MrBayes) to exclude the first trees as part of the burn-in; and a smaller file is then produced with a copy of the unique trees found, along with their frequency of occurrence in the sample (eg. in the "trprobs" file from MrBayes). It is this sample of trees that is summarized to produce the so-called MAP tree (maximum a posteriori probability tree) and its associated branch posterior probabilities.

Ideally, we would take this file and produce a consensus network rather than a consensus tree. The tree produced by MrBayes is built from the best-supported branches of the set of trees sampled, but only a set of compatible branches can be included in the consensus tree (see the original work of Margush and McMorris 1981). Any well-supported but incompatible branches will not be shown, and it is the absence of these branches that causes the phylogenetic tree to deviate from the standard Bayesian philosophy of presenting a probability distribution. A consensus network solves this problem because it is specifically designed to present a specified percentage of the incompatible branches as well (Holland et al. 2004).

The idea of using a consensus network rather than a consensus tree was first suggested by Holland et al. (2005), although it has rarely been used in practice (eg. Gibb & Penny 2010). Indeed, consensus networks could also be used to present the results of bootstrap analyses, or a set of equally parsimonious trees (Holland et al. 2006).

It is important to note that a consensus network is unrooted, and therefore it is solely a mathematical summary of the data, and cannot be treated as an evolutionary diagram. The MAP tree is, strictly speaking, a form of consensus tree, and so it is technically also solely a mathematical summary of the data. It is, however, conventional to treat it as an evolutionary diagram.

An example

The example data are from Schnittger et al. (2012), being sequence data for the beta-tubulin gene of 17 samples of the genus Babesia. Details of the Bayesian analysis using MrBayes are provided here, which yielded 100,000 trees in the final MCMC sample. A nexus file with the sequence data, the 95% credible set of trees, and the associated bipartitions can be found here. The rooted MAP tree produced by MrBayes is shown in the first diagram, with the posterior probabilities (PP) labelled on each branch.

Bayesian MAP tree.

There are several ways that a consensus network could be presented. The standard way would be to choose some percentage of the MCMC trees and to show the network of those trees. For example, the MAP tree is simply a consensus that includes all of the bipartitions that occur in at least 50% of the trees, plus all of the other bipartitions that are compatible with those bipartitions.
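The difference between the two kinds of summary can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the six taxa and the bipartition frequencies are invented, each bipartition is represented by one of its two sides, and the function names are hypothetical rather than part of MrBayes or SplitsTree.

```python
def compatible(s1, s2, taxa):
    """Two bipartitions are compatible if at least one of the four
    intersections of their sides is empty."""
    sides1 = (s1, taxa - s1)
    sides2 = (s2, taxa - s2)
    return any(not (x & y) for x in sides1 for y in sides2)

def threshold_splits(freqs, threshold):
    """Network-style summary: keep every bipartition at or above the
    threshold, whether or not it conflicts with the others."""
    return {s for s, f in freqs.items() if f >= threshold}

def greedy_compatible(freqs, taxa):
    """Tree-style summary: add bipartitions in order of decreasing
    frequency, skipping any that conflict with those already chosen."""
    chosen = []
    for s, _ in sorted(freqs.items(), key=lambda kv: -kv[1]):
        if all(compatible(s, t, taxa) for t in chosen):
            chosen.append(s)
    return chosen

taxa = frozenset("ABCDEF")
AB, ABC, CD, EF = (frozenset(x) for x in ("AB", "ABC", "CD", "EF"))
# Hypothetical frequencies of each bipartition among the sampled trees.
freqs = {AB: 0.95, ABC: 0.55, CD: 0.40, EF: 0.10}

tree_splits = greedy_compatible(freqs, taxa)    # drops CD, which conflicts with ABC
network_20 = threshold_splits(freqs, 0.20)      # keeps CD alongside ABC
```

The bipartition CD has 40% support in this toy sample but never appears in the tree-style summary, because it conflicts with the better-supported ABC; the 20% threshold set retains both, which is what a consensus network would display as a box.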

As an alternative, here is the unrooted consensus network with all of the bipartitions that occur in at least 5% of the trees. It is important to note that the branch lengths in the following consensus networks represent the branch support not the estimated number of substitutions (as in the MAP tree). For example, the relationships among the Babesia microti sequences have very short branches in the MAP tree (ie. few substitutions) but long branches in the consensus networks (ie. good support).

Consensus network set at 5%.

This network makes visually clear where the major incompatibilities are among the MCMC trees, which the MAP consensus tree does not (unless one checks the PP values). The major set of boxes in the network involves the branches with PP values of c. 0.4. Note that this network also approximates the 95% credible set of trees, as it excludes those bipartitions that occur in <5% of the trees.

Unfortunately, this network is visually rather cluttered, with a set of multi-dimensional boxes, and so maybe a simpler network would suffice. Instead, here is the consensus network with all of the bipartitions that occur in at least 20% of the trees. This network still emphasizes the main area of incompatibility among the trees, while losing the less-important incompatibilities on the other branches.

Consensus network set at 20%.

Another approach to simplifying the consensus network is to present a set of what are called weakly compatible bipartitions, rather than choosing some percentage of the bipartitions. This is shown in the next network. Note that visually it is a compromise between the above two networks, and may thus be preferable, as it makes clear the range of branches involved in the incompatibilities without needing to use multi-dimensional boxes to do so (ie. the graph is planar rather than 3-dimensional).

Consensus network of weakly compatible bipartitions.

Finally, we can emphasize the relationship between the network and bipartition compatibility by plotting the consensus network consisting of a set of compatible bipartitions. This forms a tree that has the same topology as the MAP tree (since it is constructed in exactly the same manner), but here the branch lengths represent the branch support rather than the estimated number of substitutions.

Consensus network of compatible bipartitions.

Current practical problems

The simplest way to implement the use of consensus networks to display the set of Bayesian trees would be to use SplitsTree to produce the consensus network directly from the (MrBayes) nexus-format MCMC treefile. However, this treefile will usually contain tens of thousands of trees, which is pushing the limit of SplitsTree. Alternatively, we could read in the smaller nexus-format "trprobs" file, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree does not currently read tree weights in nexus treefiles. So, Holland et al. (2005, 2006) produced some Python scripts to create the required nexus files, which can then be input to SplitsTree (or SpectroNet).

Alternatively, MrBayes also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than the trees, and so this can also be used instead of the treefile. There are two practical problems with this approach. First, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. Second, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions.
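The bookkeeping for the second problem can be sketched as follows. I am assuming that each bipartition has already been copied out of the partition table as a dot/star pattern (one character per taxon, with "*" marking one side) together with its relative frequency; the taxon names, patterns, frequencies and function names are all made up for the example. The sketch stops at the lists of bipartitions per threshold and does not attempt to write the Splits block itself.

```python
def pattern_to_taxa(pattern, taxa):
    """Convert a dot/star pattern into the set of taxa marked with '*'."""
    if len(pattern) != len(taxa):
        raise ValueError("pattern length must match the number of taxa")
    return frozenset(name for name, mark in zip(taxa, pattern) if mark == "*")

def splits_by_threshold(rows, taxa, thresholds):
    """For each threshold, list the (taxon set, frequency) pairs that
    would go into the corresponding Splits block."""
    result = {}
    for cutoff in thresholds:
        result[cutoff] = [
            (pattern_to_taxa(pattern, taxa), freq)
            for pattern, freq in rows
            if freq >= cutoff
        ]
    return result

# Hypothetical taxa and partition-table rows (pattern, relative frequency).
taxa = ["t1", "t2", "t3", "t4", "t5"]
rows = [("**...", 1.00), ("..**.", 0.42), ("...**", 0.38), (".**..", 0.07)]

series = splits_by_threshold(rows, taxa, thresholds=(0.05, 0.20))
```

Each entry in `series` then corresponds to one hand-built Splits block, for example one for the 5% network and one for the 20% network.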

I used the latter approach to analyze the example data. [This is explained in more detail in a later post: How to construct a consensus network from the output of a bayesian tree analysis.]

Thanks to Bernie Cohen for prompting me to think about networks and Bayesian analysis, and to Barbara Holland for her help with getting the calculations done.


Gibb GC, Penny D (2010) Two aspects along the continuum of pigeon evolution: a South-Pacific radiation and the relationship of pigeons within Neoaves. Molecular Phylogenetics and Evolution 56: 698-706.

Holland BR, Delsuc F, Moulton V (2005) Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54: 66-76.

Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Molecular Biology and Evolution 21: 1459-1461.

Holland BR, Jermiin LS, Moulton V (2006) Improved consensus network techniques for genome-scale phylogeny. Molecular Biology and Evolution 23: 848-855.

Margush T, McMorris FR (1981) Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.

Schnittger L, Rodriguez AE, Florin-Christensen M, Morrison DA (2012) Babesia: a world emerging. Infection, Genetics and Evolution 12: 1788-1809.

Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17: 839-850.

Monday, January 7, 2013

Is there good and bad fast-food?

Since the Christmas feast days are now over, this blog post continues the series on the nutritional characteristics of modern fast-food, which started with the Network analysis of McDonald's fast-food.

Men's Health magazine has produced a list of what it considers to be The 10 Worst Fast Food Meals in the USA. They chose one meal (usually a combination of several menu items) from each of ten different fast-food chains, which they considered to be extreme meals based on their nutritional characteristics. To counter-balance this list, they also chose another meal combination from each chain that they considered to be much "better for you".

For each of these 20 meals the magazine provided data on four of the nutritional characteristics: Calories, Fat, Saturated fat, and Sodium (salt). I have analyzed these data in the same manner as before: I standardized the data by expressing them as a percent of the officially recommended daily value based on a 2,000 calorie diet, then calculated a NeighborNet network based on Manhattan distances.
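The standardization step can be sketched in Python as below. Both the daily values and the two meals are assumptions for illustration: the daily values are commonly quoted US reference figures for a 2,000 calorie diet, not necessarily the ones used for the analysis, and the meals are invented.

```python
# Assumed daily reference values for a 2,000 calorie diet (illustrative only).
DAILY_VALUE = {
    "calories": 2000.0,
    "fat_g": 65.0,
    "sat_fat_g": 20.0,
    "sodium_mg": 2400.0,
}

def percent_daily_value(meal):
    """Express each nutritional characteristic as a percent of its daily value."""
    return {key: 100.0 * meal[key] / DAILY_VALUE[key] for key in DAILY_VALUE}

# Two hypothetical meals, not taken from the Men's Health list.
meals = {
    "meal_x": {"calories": 1000, "fat_g": 65, "sat_fat_g": 10, "sodium_mg": 1200},
    "meal_y": {"calories": 500, "fat_g": 13, "sat_fat_g": 5, "sodium_mg": 600},
}

standardized = {name: percent_daily_value(meal) for name, meal in meals.items()}

# Manhattan distance between the two meals on the standardized scale.
distance = sum(
    abs(standardized["meal_x"][key] - standardized["meal_y"][key])
    for key in DAILY_VALUE
)
```

Standardizing first puts the four characteristics on a common scale, so that sodium (measured in milligrams) does not swamp the other three when the distances are calculated.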

The resulting network is shown in the figure. I have coloured the ten allegedly "better for you" meals alternately in green or blue, with all of the "worse for you" meals in black. Meals that are closely connected in the network are similar to each other based on their nutritional characteristics, and those that are further apart are progressively more different from each other.

Clearly, the "worst food" meals differ greatly from each other in their nutritional characteristics, while the other ten meals do not. In other words, there is a single clear concept of what is "good for you" but many different ideas about what is "bad for you" (or many ways in which the food can be unhealthy).

Furthermore, the "worst food" meals vary in their relationship to the better meals, with Long John Silver's Fish Combo Basket being rather similar to the better items, and both KFC's Half Spicy Crispy Chicken Meal and Burger King's Large Triple Whopper being at the extreme far end of the graph. Indeed, the "worst food" meals form a gradient of increasingly extreme nutritional characteristics: the calories, fat and saturated fat all increase from bottom to top in the network, and sodium increases from left to right.

The sodium change applies in the better meals, as well, with Quizno's Roadhouse Steak Sammies having more salt than the other nine meals in that group.

So, it seems to me that the fast-food chains are having a harder time creating unhealthy fish meals than they are creating unhealthy chicken and beef meals. However, this may just be lack of effort on their part, because the Tuna Melts with Cheetos meal is certainly pretty extreme.

Anyway, you now know which meals to target should you wish to send yourself into an early grave.

Wednesday, January 2, 2013

False analogies between anthropology and biology

There has been much talk over the past few decades about the extent to which the various disciplines within anthropology (in the broad sense) can use, or benefit from, methodological techniques developed in other disciplines, notably biology (see Mace et al. 2005; Forster & Renfrew 2006; Lipo et al. 2006). This has been particularly true for historical studies of languages (ie. linguistics), past cultures (ie. archaeology) and physical type (ie. physical / biological anthropology). The use of, for example, phylogenetic methods seems to be relatively unproblematic in the latter case (studies of the origin and development of humans as a species; Holliday 2003), although this field is concerned as much with population genetics as it is with species phylogenies. (Note that I am leaving cultural anthropology out of the discussion, as it seems to be less concerned with historical studies.)

However, the use of phylogenetic methods in archaeology and linguistics is based on an analogy between human cultural evolution and biological evolution. This analogy assumes that the underlying processes of historical change in anthropology and biology are similar enough that the analytical methods can be combined. (Note that I am using the word anthropology in the broadest sense, to include linguistics and archaeology.) So, both anthropology and biology apparently involve an evolutionary process, in which the study objects form groups that change via modification of their intrinsic attributes, the attributes being transformed through time from ancestral to derived states (often called "innovations" in anthropology). That is, it is the groups of objects that change through time (variational evolution) rather than the objects themselves changing (transformational evolution). Thus, if one group acquires a new (derived or advanced) character state while the rest do not (i.e. they retain the ancestral or primitive state) then this group forms a separate historical lineage that diverges from the other populations, and maintains its own historical tendencies and fate. A search for derived character states that are shared among the groups allows us to reconstruct the evolutionary history.

However, this apparent similarity is basically a metaphor, because human culture is not a collection of biological objects. In Popperian terms, biology is part of the "world that consists of physical bodies" while culture and linguistics are part of the "world of the products of the human mind". Therefore, if we are drawing an analogy between anthropological studies and biological studies, and using this analogy to justify the use of certain analytical techniques, then we need to understand the analogy thoroughly. Here, I argue that in some important ways the currently used analogy is wrong from the biological perspective, and that this has important consequences for anthropological research.


The analogy between anthropology and biology has recently focused on the possible relationship between anthropological entities and genes (eg. Mace & Holden 2005; Tëmkin & Eldredge 2007; Croft 2008; Pagel 2009; Steele et al. 2010; Howe & Windram 2011). However, this seems to be a false analogy, as there is no observable equivalent to a gene in the anthropological world (other than inside any biological organisms being studied). Memes, for example, are not observable objects in the way that genes are. So, the analogy between real replicators in biology (genes) and theoretical replicators in anthropology is inappropriate.

However, biology recognizes a distinction between genotype, which is the collection of genes and other associated material in an organism, and phenotype, which is the product of interactions between genes and also between genes and their environment. The DNA, RNA and proteins in an organism are usually taken to represent the genotype, whereas the cells, tissues and organs constitute the phenotype of an individual. To quote Richard Lewontin (in the Stanford Encyclopedia of Philosophy): "the actual correspondence between genotype and phenotype is a many–many relation in which any given genotype corresponds to many different phenotypes and there are different genotypes corresponding to a given phenotype."

The better analogy between anthropology and biology is thus with the phenotype, not the genotype. Genetic material stores information that allows it to replicate itself, either exactly or with modification, and this is the basis of the distinction between living and non-living objects. Nothing in archaeology or linguistics, for example, possesses these properties, and to form an analogy between anthropological entities and genes is thus potentially misleading. In particular, genetic material is based on standardized fundamental units (the nucleotides and amino acids), which have no simple counterpart in anthropology.

An analogy between anthropological entities and phenotypes is much more reasonable, however. Phenotypic entities, such as cells and organs, seem to have much more in common conceptually with anthropological entities, such as phonemes and words in linguistics and stemmatology. Most importantly, it is the phenotype that takes part in evolutionary processes, not the genotype alone (genes are just part of the "replicator story", as DNA on its own does nothing except denature slowly), and so it is actually the more useful comparison. Indeed, up until the 1990s phenotypes were the basic unit of phylogenetics in biology, and it is only since then that biologists have switched wholesale to genotypes for constructing phylogenies. Anthropologists cannot make this switch, and need to remain "phenotype phylogeneticists" instead.

The important point to note is that evolutionary anthropology is a study of historical relationships rather than specifically "genetic" ones. That is, while cultural transmission is qualitatively different from genetic transmission, that does not invalidate a study of history. Genes are passed directly to offspring whereas culture involves behaviour that is transmitted by social learning; for example, manuscripts are copied by hand, languages are learned by imitating parents, and musical instruments are deliberately designed by professionals. Biological transmission is thus different from anthropological transmission, but both types of transmission produce a history.

Phenotypes have historical relationships just as genotypes do, as is now recognized by the resurgence of interest in evolutionary developmental biology (also known as evo-devo). No analogy with genetics is necessary for evolutionary studies of anthropology. Moreover, not all genetic relationships are necessarily evolutionary (much of population genetics, for example, can be conducted without an evolutionary framework), although it is likely that they will all have a strong evolutionary component. (Note that in anthropology vertical phylogenetic descent is sometimes confusingly referred to as the "genetic relationship", perhaps as a result of Noam Chomsky's work, and phylogenies are sometimes referred to as "classifications".)


Since phenotypes evolve, they can be an appropriate unit of study in phylogenetics, and can therefore be an appropriate analogue for the study of cultural histories. The distinction between genotype and phenotype as the appropriate analogy is not a trivial one. In particular, the change of perspective seems to make clearer a number of issues that have been raised concerning the application of phylogenetic methods in anthropology.


First, it is often difficult to work out the homologies between phenotypic entities from divergent groups, just as it is for anthropological entities. If phylogenetics is a search for shared derived character states, then we need to be comparing the same character states in different groups (ie. comparing like with like based on common ancestry). However, shared derived character states are not conveniently labeled as such on the objects themselves. We thus need to infer homology before we can infer phylogeny (or at least do this simultaneously), and this is often more difficult for phenotypes than for genotypes.

Phenotypic homology sometimes causes confusion even among evolutionary biologists. The basic issue is often which features should be seen as different states of the same character. As a cultural example, Tëmkin & Eldredge (2007) discuss the problem of the valves in a cornet, as "the Périnet valve did not derive from the Stölzel valve but rather was an alternative design solution" (alternative designs are quite common for manufactured objects). Thus, neither can be considered to be the ancestral state of a single character (valve type), even though the Stölzel valve predated the Périnet. Most biologists would solve this "problem" by having two separate characters, so that each valve type is either present or absent, thus effectively having a combined total of four character states. This allows a cornet to have either all Stölzel or all Périnet valves, or a combination of both (which a few instruments do have; Eldredge 2002). A cornet that has neither type of valve is called a post horn, this being the instrument from which the cornet was originally derived.
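The two-character coding can be written out explicitly. This small Python sketch simply enumerates the four combinations of two presence/absence characters; the character names and descriptions are my own labels for the cases described above.

```python
from itertools import product

# Two independent presence/absence characters (1 = present, 0 = absent).
CHARACTERS = ("stolzel_valve", "perinet_valve")

def describe(states):
    """Describe the instrument implied by one combination of states."""
    stolzel, perinet = states
    if stolzel and perinet:
        return "cornet with both valve types"
    if stolzel:
        return "cornet with Stölzel valves only"
    if perinet:
        return "cornet with Périnet valves only"
    return "post horn (neither valve type)"

# All four combined character states.
coding = {
    states: describe(states)
    for states in product((0, 1), repeat=len(CHARACTERS))
}
```

Coded this way, neither valve type has to be treated as the ancestor of the other: the post horn (0, 0) is the starting point, and each valve type is gained independently.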

The search for an objective method of determining phenotypic homology has been a long one (Rieppel 2007), and is not by any means resolved; perhaps the most interesting discussion of an objective procedure is that of Jardine (1967). In particular, homoplasy (convergence / parallelism / reversal) is often a phenotypic phenomenon, as the genotype of the organisms concerned is almost always different in some way. That is, phenotypic homoplasy is usually the result of mistaken homology assessment, whereas genotypic homoplasy usually results from the fact that there are so few units of comparison (eg. four nucleotides). It has been suggested that homoplasy may be even more common in anthropology than in biology (Tëmkin & Eldredge 2007). Indeed, in culture it can be difficult even to decide on the units of comparison (eg. phonemes? syllables? words?), which is quite characteristic of phenotypic studies, and the "taxa" often need to be constructed for analysis (eg. tools, customs, etc).

Furthermore, it is likely to be inappropriate to use an analogy with molecular sequence alignment when discussing cultural and linguistic homologies (Covington 1996; Kondrak 2003; Pagel 2009). Computerized algorithms are usually used to align molecular data and thus make decisions about character-state homology, mostly based on overall similarity. However, homology of phenotypic characteristics requires careful comparative studies to determine what are called topological relations (or connectivity) among the character states, often based on ontogenetic development (Rieppel 2007); this is called "special similarity". It might be difficult to use ontogeny as an analogue for cultural development, since ontogeny refers to the sequential expression of genes, but topological relationships have obvious analogues in linguistics; for example, words consist of both primary structure (phonemes) and secondary structure (morphemes) (List 2012).


Second, it is likely that there will be a greater degree of reticulate evolution in archaeological and linguistic studies. This conclusion follows from the differences in barriers to horizontal flow of information — there are both weak and strong barriers in biology but only weak ones in anthropology.

In biology there are both pre-zygotic and post-zygotic barriers to gene flow, which refer to those acting to prevent the formation of a zygote and those acting after zygote formation, respectively. It is the latter that are most effective in creating reproductive isolation between taxa. Pre-zygotic mechanisms, such as geographical isolation (different locations), ecological isolation (different habitats), temporal isolation (different times), mechanical isolation (different physical structures) and ethological isolation (different behaviours), have obvious analogues in anthropological studies, but these barriers are often not completely effective, such as when species that were previously spatially separated encounter each other for the first time. Post-zygotic mechanisms, such as cross-incompatibility (inability of gametes to fuse), hybrid inviability (failure of zygotes to survive), hybrid sterility (failure of zygotes to reproduce) and hybrid breakdown (failure of second generation hybrids to survive), are strictly genetic mechanisms and they have no obvious analogue in anthropological studies. They are usually very effective barriers to gene flow, and indeed are the principal basis of the biological species concept, for example.

The important point to note is that the post-zygotic barriers are directly under genetic control whereas the pre-zygotic barriers are only indirectly genetically controlled (eg. habitat selection might be genetically determined, and if their habitats are different then two species will be reproductively isolated). This means that the post-zygotic barriers are much stronger. It also means that they are not available in the analogy between anthropology and phenotype.

Weak barriers mean that archaeological and linguistic aggregations are likely to form fuzzy clusters rather than clearly defined groups, just as they do for human races (Fuzzy clusters). Fuzzy clusters are not likely to form clear-cut evolutionary lineages, at least as far as vertical descent is concerned (Eldredge 2011).

Thus, because anthropological studies involve only weak barriers to the horizontal flow of information, reticulate evolution is predicted to be more prevalent than it is in biology. That is, the horizontal component of evolution may even be as large as the vertical one (and possibly more important), because there are none of the strong genetic ("post-zygotic") barriers to flow. Indeed, the use of trees as a model for archaeological and linguistic studies has been questioned repeatedly in recent years, on various grounds (eg. Southworth 1964; Hoenigswald 1990; Moore 1994; Dewar 1995; Ben Hamed & Wang 2006; Tëmkin & Eldredge 2007), usually in favor of reticulation models. Moreover, the earliest representations of historical relationships were networks rather than trees (Gallet), even in biology (Buffon, Duchesne), and since then many alternative reticulation metaphors have been developed (Metaphors). This suggests that the focus on trees has been a distraction from the more obvious model of a network in anthropology.

Networks and trees

One point of confusion here seems to be that trees have been treated as representations of temporal relationships while networks have been treated as representations of spatial relationships. Indeed, this seems to be at the heart of the apparent differences of opinion about the two models — the tree advocates are emphasizing time whereas the network advocates are emphasizing space. The practical problem here is that there are currently no quantitative methods for combining the two. Tree-building algorithms in biology do not allow for reticulation, and the common network algorithms (such as neighbor-net, median-joining, reduced median) solely show static relationships, without any sense that the inferred nodes represent ancestors or the edges connecting the nodes represent evolutionary change. In these commonly used algorithms, the nodes are there solely to support the network structure, and the edges solely express the degree of character difference between the nodes.

For phylogenetic trees there is a rationale for treating the tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared derived character states (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference). However, no such rationale exists for most of the current network methods. The network still represents a mathematical summary of the data, but there is no logic for direct inference about biology. It is almost certain that the mathematical summary represents real biological patterns, but there is no necessity that those patterns are evolutionary ones.

The increasing appearance of neighbor-net networks in the linguistic and archaeological literature (eg. Ben Hamed 2005; Bryant et al. 2005; Bowern 2010; Gray et al. 2010; Heggarty et al. 2010; Dediu & Levinson 2012), for example, is thus based on trying to infer temporal patterns from the network display of spatial patterns, even though there is no explicit rationale for being able to do this — the networks may represent history and they may not. Clearly, what we need are quantitative methods that allow the direct inference of both vertical and horizontal evolutionary patterns — that is, we need phylogenetic networks rather than phylogenetic trees. Moreover, these networks need to be based on models of phenotypic variation not genotypic variation (eg. Lewis 2001). Nakhleh et al. (2005), Warnow et al. (2006) and Erdem et al. (2006) are among the few to have tackled this issue in anthropology.

Note that none of the above discussion is meant to contrast a tree model with a network model in a mutually exclusive way. Mathematically, trees form a subset of networks. Therefore, we do not need to choose between the two as the most appropriate model: we can always choose a network model, and the resulting network will be more or less tree-like depending on the data. So, it is not necessary to decide whether anthropological data are more or less tree-like than biological data (Collard et al. 2006), nor should it be necessary to decide whether horizontal transmission invalidates cultural phylogenetic trees (O'Brien et al. 2002; Greenhill et al. 2009; Currie et al. 2010b). We should simply incorporate any reticulations into the phylogeny, rather than deciding that they are too small to be worth including.
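The point that trees form a subset of networks can be made concrete with a small sketch. It assumes (my assumption, not anything from the cited papers) that a phylogeny is supplied as a list of directed parent-to-child edges forming a rooted acyclic graph. A reticulation is then any node with more than one parent, and a tree is simply a network with zero reticulations. The function names and the toy edge lists are invented for illustration.

```python
from collections import defaultdict


def count_reticulations(edges):
    """Count nodes that have more than one parent.

    `edges` is a list of (parent, child) pairs, assumed to describe
    a rooted, directed, acyclic network.
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sum(1 for p in parents.values() if len(p) > 1)


def is_tree(edges):
    """A rooted network with no reticulation nodes is a tree."""
    return count_reticulations(edges) == 0


# Toy example 1: a strictly branching history for A, B and C.
tree_edges = [("root", "x"), ("root", "C"), ("x", "A"), ("x", "B")]

# Toy example 2: the same kind of history, but H has two parents (x and y).
network_edges = [
    ("root", "x"), ("root", "y"),
    ("x", "A"), ("x", "H"),
    ("y", "H"), ("y", "C"),
]

print(count_reticulations(tree_edges), is_tree(tree_edges))
print(count_reticulations(network_edges), is_tree(network_edges))
```

Seen this way, "how tree-like are the data?" is a question about how many reticulations the inferred network needs, not a choice that has to be made before the analysis starts.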

In this sense, many of the recent anthropological papers that are based solely on a tree model seem to be misguided, no matter how sophisticated the mathematics of their analyses may be (Gray & Atkinson 2003; Gray et al. 2009; Currie et al. 2010a; Dunn et al. 2011; Gray et al. 2011; Bouckaert et al. 2012). For example, if a dataset is admittedly affected by horizontal transfer, it is unlikely that any tree-building algorithm will correctly construct the tree-like pattern of vertical descent. Thus, even if our model for evolutionary history is "a tree obscured by vines", we will still find it difficult to reconstruct the tree unless we explicitly move the vines out of the way first. It is for this reason, for example, that in linguistics many studies are based on the Swadesh list of words, which is clearly (and intentionally) biased towards words that have been inherited vertically, with little or no horizontal transfer (eg. Bouckaert et al. note: "the cognate data we use excludes known cases of borrowing"). Under these circumstances, it is hardly surprising that authors so often find their phylogenies to be tree-like, since they are deliberately ignoring the vines! Networks are likely to reveal both the tree and the vines (eg. otherwise hidden lexical borrowing; Nelson-Sathi et al. 2011).

Finally, it is worth mentioning the network methods that have been developed for within-species (ie. population) data, particularly mtDNA sequences. These include those methods related to median networks (eg. median-joining, reduced median), but also include those related to one-step networks (eg. statistical parsimony, minimum-spanning). In many anthropological situations, it is likely that these will be more useful than methods related to phylogenetic trees (see the examples in Barbrook et al. 1998; Forster et al. 1998; Forster & Toth 2003; Spencer et al. 2004; Lipo 2006). Bouckaert et al. (2012) take this analogy even further, by using a phylogeny-based epidemiological model of population spread.

Time consistency

The third consequence of rejecting the genotype analogy is that time consistency is no longer required. Organisms store the information (that is vertically and horizontally transmitted) in genes that they carry with them, which restricts reticulation to occurring only between contemporaries. However, while cultural artefacts clearly display their information, they do not transmit it themselves, and it must instead be interpreted by humans. Furthermore, language and culture store their "information" externally, either in the minds of people or in permanent or semi-permanent records (either written or pictorial).

Thus, in anthropology the information available for horizontal transmission can come from the distant past, as well as from the present; the only direction that cultural information cannot flow is from the future to the past. In this sense, extinction seems to be much rarer in archaeology and linguistics than in biology, because information can be stored indefinitely, rather than disappearing along with the possessing species. I have illustrated time inconsistency twice before, with respect to both computers and computer languages, and Tëmkin & Eldredge (2007) illustrate it with musical instruments.
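The difference between the two timing rules can be sketched in a few lines. This is a deliberately simplified version: it assumes that every node already has a known date (larger numbers are more recent) and that each horizontal transfer is a (donor, recipient) pair. Formal definitions of time consistency in the network literature are more subtle, since they ask whether any valid set of dates exists at all. All names and dates below are invented.

```python
def inconsistent_transfers(node_time, transfers, allow_past_donors=False):
    """Return the transfers that break the chosen timing rule.

    node_time: dict mapping node name -> date (larger = more recent).
    transfers: list of (donor, recipient) horizontal edges.
    allow_past_donors=False: donor and recipient must be contemporaries
        (the biological-style rule).
    allow_past_donors=True: the donor may be older than the recipient,
        but never younger (the cultural-style rule).
    """
    bad = []
    for donor, recipient in transfers:
        t_donor, t_recipient = node_time[donor], node_time[recipient]
        if allow_past_donors:
            ok = t_donor <= t_recipient
        else:
            ok = t_donor == t_recipient
        if not ok:
            bad.append((donor, recipient))
    return bad


# Invented dates for three hypothetical cultural entities.
dates = {"old_record": 100, "modern_a": 2000, "modern_b": 2000}
transfers = [
    ("modern_b", "modern_a"),    # between contemporaries
    ("old_record", "modern_a"),  # borrowed from the distant past
    ("modern_a", "old_record"),  # from the future to the past
]

print(inconsistent_transfers(dates, transfers))
print(inconsistent_transfers(dates, transfers, allow_past_donors=True))
```

Under the strict rule, both transfers involving the old record are flagged; under the relaxed rule, only the transfer running from the future to the past is rejected.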

Part of the issue here is also that archaeological objects are often not contemporaneous, whereas most biological studies are based on data from contemporary organisms (Lipo 2006). This means that in archaeological phylogenetics the study objects appear at internal nodes in the phylogeny as well as at the tips (the data are diachronic), whereas in biology they occur only at the tips (the internal nodes are hypothetical ancestors). In this case, a better analogy for archaeology may be the incorporation of full stratigraphic information from fossils into phylogenetic histories (eg. Sumrall 2005; Tëmkin & Eldredge 2007; Fisher 2008).

Historical anthropology is often concerned with "origins" and putting dates on those origins (Gray et al. 2011), and therefore the study interest is where the analytical uncertainty is greatest, since this is the place where there are fewest data. This is quite different to much of the use of phylogenetic techniques in biology, where the relationships of contemporary organisms are the primary interest. Of particular concern are estimates of rates of divergence, for which there appear to be few mathematical models in archaeology. Small changes in rates can have large effects on estimates of origins and their dates, as can changes of rates along lineages.
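To see why rate estimates matter so much, here is a toy calculation. It borrows the simplest possible assumption from biology, a strict clock in which two lineages accumulate differences at the same constant rate, so that the age of their split is the pairwise distance divided by twice the rate. Whether any such clock is appropriate for archaeological data is exactly the open question; the distance and the rates below are invented purely for illustration.

```python
def divergence_time(distance, rate):
    """Age of a split under a strict clock.

    distance: pairwise difference between two entities
        (changes per character).
    rate: changes per character per year along a single lineage.
    The factor of 2 reflects change accumulating along both lineages.
    """
    return distance / (2.0 * rate)


# One invented pairwise distance, dated under three candidate rates.
distance = 0.30
for rate in (1.0e-4, 1.5e-4, 2.0e-4):
    years = divergence_time(distance, rate)
    print(f"rate = {rate:.1e}  ->  about {years:.0f} years")
```

Because the date is inversely proportional to the rate, halving the assumed rate doubles the estimated age of the origin, and that is before allowing the rate to vary along lineages.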

Disconnection of phenotype and phylogeny

The fourth consequence is that there is often a lack of association between phylogeny and phenotype. There are examples in the literature of phenotypic changes not being directly associated with the phylogeny. Losos (2011) discusses a number of these within biology, and Tëmkin & Eldredge (2007) discuss a couple of cultural examples. In these cases, it is not possible to reconstruct the evolutionary history from phenotypic data, nor indeed to infer the phenotypes from an hypothesis of evolutionary history, and so phylogenetics does not aid the study of contemporary patterns.

This is particularly relevant when attempting to reconstruct ancestral phenotypes. Because of the difference between cultural transmission (copied from person to person) and biological transmission (genes are passed directly), there is no necessary reason to assume that ancestral states can be reconstructed from a knowledge of phylogenetic history (see the Evolving Thoughts blog). This also applies when trying to reconstruct characteristics from an independent phylogeny, such as reconstructing a cultural history from a linguistic phylogeny (eg. Walker et al. 2012).

Furthermore, it is possible that archaeological and linguistic concepts (eg. cultural artefacts and languages, respectively) do not form integrated wholes, in the way that biological organisms must. That is, anthropological characters (or groups of characters) can often change independently of each other, and this will create a set of independent phylogenetic histories, so that there is no coherent "entity" with a single history. This situation is likely to be worse than the possibly analogous situation with independent gene histories in biology (Tëmkin & Eldredge 2007).

In addition, cultural evolution may occur faster than biological evolution (Perreault 2012), which makes reconstruction of ancient events more difficult. We might also question whether different cultural artefacts and languages each share a single common ancestor — that is, they are potentially polyphyletic rather than monophyletic.

Process analogies

Finally, we can consider possible analogies of anthropological processes with horizontal genotypic processes, such as introgression, hybridization, recombination, horizontal gene transfer (HGT), and genome fusion. These analogies are sometimes invoked in the linguistic and archaeological literature, but this is not necessarily appropriate given the overall analogy with phenotype rather than genotype.

Introgression is usually treated as a process of admixture, where genetic information from one group moves to another via sexual reproduction. Here, an analogy might be appropriate for anthropology, it being the closest analogy to what anthropologists have called "diffusion". However, it is worth noting that biological admixture initially involves the transfer of an entire copy of the genome, which might be unlikely for cultural phenomena. Hybridization, on the other hand, involves the creation of a new evolutionary lineage, separate from the parental ones but containing one or more copies of the genome of each of those parents. Creole languages might be an example where this analogy is appropriate, since the parental languages are usually clearly identifiable; but otherwise hybridization seems to be a poor analogy, even though it is commonly invoked in the literature.

Recombination also involves sexual reproduction, but usually refers to the mixing of genes before reproduction occurs, so that the offspring do not have a complete set of genes from any one grandparent. This analogy frequently appears in the literature, often as a synonym for the same phenomena that other people call hybridization, but I suspect that introgression would be a better analogy for the topics included. Examples analogous to recombination might be a single manufacturer "providing all permutations and combinations to the marketplace" of their products (eg. Courtois' cornets in the late 1850s; Eldredge 2002), or where "a scribe used more than one copy of a text when making his or her own" (called contamination; Howe & Windram 2011).

HGT refers to non-sexual transfer of genetic material, often small amounts rather than whole genomes. Clearly, word borrowing would be a prime example where this analogy might be appropriate. Genome fusion refers to the non-sexual transfer of whole genomes, and thus has a similar outcome to hybridization, but between distantly related organisms instead.


Conclusion

We need to drop the idea that there is an analogy between anthropological entities and biological genotypes, and recognize that the better analogy is with phenotypes. The analogy with genotypes is not a productive one, and may even be a positively misleading form of "gene envy". If we accept the qualitative analogy with phenotype, then we can also accept the quantitative consequences of this analogy, which include the idea that trees are much more likely to be inadequate models for cultural history than they apparently are in biology.

The mere fact that one can interpret certain cultural phenomena as showing features analogous to those in biology does not mean that the alleged analogy is of any practical use. We need to understand the analogies more thoroughly, in order to decide whether adopting the analogies is the best thing to do. Analogies are only useful tools for research if they direct that research into productive areas, or provide interpretive insights that would otherwise be unavailable. Otherwise, analogy is merely a topic of conversation.

The main advantage of the phylogenetic analogy is that it focuses attention on the important role of unique "accidents" in determining evolutionary history. The main disadvantage seems to be that the processes involved with these accidents are quite different in biology and anthropology, so that the focus is not always fruitful.


References

Barbrook AC, Howe CJ, Blake N, Robinson P (1998) The phylogeny of The Canterbury Tales. Nature 394: 839.

Ben Hamed M (2005) Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China's demic history. Proceedings of the Royal Society of London series B 272: 1015-1022.

Ben Hamed M, Wang F (2006) Stuck in the forest: trees, networks and Chinese dialects. Diachronica 23: 29-60.

Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, Drummond AJ, Gray RD, Suchard MA, Atkinson QD (2012) Mapping the origins and expansion of the Indo-European language family. Science 337: 957-960.

Bowern C (2010) Historical linguistics in Australia: trees, networks and their implications. Philosophical Transactions of the Royal Society of London series B 365: 3845-3854.

Bryant D, Filimon F, Gray RD (2005) Untangling our past: languages, trees, splits and networks. In: Mace et al. (eds), pp. 67-83.

Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.

Covington MA (1996) An algorithm to align words for historical comparison. Computational Linguistics 22: 481-496.

Croft W (2008) Evolutionary linguistics. Annual Review of Anthropology 37: 219-234.

Currie TE, Greenhill SJ, Gray RD, Hasegawa T, Mace R (2010a) The rise and fall of political complexity in island SE Asia and the Pacific. Nature 467: 801-804.

Currie TE, Greenhill SJ, Mace R (2010b) Is horizontal transmission really a problem for phylogenetic comparative methods? A simulation study using continuous cultural traits. Philosophical Transactions of the Royal Society of London series B 365: 3903-3912.

Dediu D, Levinson SC (2012) Abstract profiles of structural stability point to universal tendencies, family-specific factors, and ancient connections between languages. PLoS ONE 7: e45198.

Dewar RE (1995) Of nets and trees: untangling the reticulate and dendritic in Madagascar prehistory. World Archaeology 26: 301-318.

Dunn M, Greenhill SJ, Levinson SC, Gray RD (2011) Evolved structure of language shows lineage-specific trends in word-order "universals". Nature 473: 79-82.

Eldredge N (2002) A brief history of piston-valved cornets. Historic Brass Society Journal 14: 337-390.

Eldredge N (2011) Paleontology and cornets: thoughts on material cultural evolution. Evolution: Education and Outreach 4: 364-373.

Erdem E, Lifschitz V, Ringe D (2006) Temporal phylogenetic networks and logic programming. Theory and Practice of Logic Programming 6: 539-558.

Fisher DC (2008) Stratocladistics: integrating temporal data and character data in phylogenetic inference. Annual Review of Ecology, Evolution and Systematics 39: 365-385.

Forster P, Renfrew C (eds) (2006) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute of Archaeological Research, Cambridge.

Forster P, Toth A (2003) Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Proceedings of the National Academy of Sciences of the USA 100: 9079-9084.

Forster P, Toth A, Bandelt H-J (1998) Evolutionary network analysis of word lists: visualising the relationships between Alpine Romance languages. Journal of Quantitative Linguistics 5: 174-187.

Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray RD, Atkinson QD, Greenhill SJ (2011) Language evolution and human history: what a difference a date makes. Philosophical Transactions of the Royal Society of London series B 366: 1090-1100.

Gray RD, Bryant D, Greenhill SJ (2010) On the shape and fabric of human history. Philosophical Transactions of the Royal Society of London series B 365: 3923-3933.

Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323: 479-483.

Greenhill SJ, Currie TE, Gray RD (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London series B 276: 2299-2306.

Heggarty P, Maguire W, McMahon A (2010) Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.

Hoenigswald HM (1990) Does language grow on trees? Ancestry, descent, regularity. Proceedings of the American Philosophical Society 134: 10-18.

Holliday TW (2003) Species concepts, reticulation, and human evolution [with discussion]. Current Anthropology 44: 653-673.

Howe CJ, Windram HF (2011) Phylomemetics — evolutionary analysis beyond the gene. PLoS Biology 9: e1001069.

Jardine N (1967) The concept of homology in biology. British Journal for the Philosophy of Science 18: 125-139.

Kondrak G (2003) Phonetic alignment and similarity. Computers and the Humanities 37: 273-291.

Lewis PO (2001) A likelihood approach to inferring phylogeny from discrete morphological characters. Systematic Biology 50: 913-925.

Lipo CP (2006) The resolution of cultural phylogenies using graphs. In: Lipo et al. (eds), pp. 89-107.

Lipo CP, O’Brien MJ, Collard M, Shennan SJ (eds) (2006) Mapping our Ancestors: Phylogenetic Approaches in Anthropology and Prehistory. AldineTransaction, New Brunswick NJ.

List J-M (2012) Improving phonetic alignment by handling secondary sequence structures. In: Hinrichs E, Jäger G (eds) Computational Approaches to the Study of Dialectal and Typological Variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012.

Losos J (2011) Seeing the forest for the trees: the limitations of phylogenies in comparative biology. American Naturalist 177: 709-727.

Mace R, Holden CJ (2005) A phylogenetic approach to cultural evolution. Trends in Ecology and Evolution 20: 116-121.

Mace R, Holden CJ, Shennan SJ (eds) (2005) The Evolution of Cultural Diversity: a Phylogenetic Approach. UCL Press, London.

Moore JH (1994) Putting anthropology back together again: the ethnogenetic critique of cladistic theory. American Anthropologist 96: 925-948.

Nakhleh L, Ringe DJ, Warnow T (2005) Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420.

Nelson-Sathi S, List J-M, Geisler H, Fangerau H, Gray RD, Martin W, Dagan T (2011) Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society of London series B 278: 1794-1803.

O’Brien MJ, Lyman RL, Darwent JA (2002) Cladistics and archaeological phylogeny. In: Martínez G, Lanata JL (eds) Perspectivas Integradoras entre Arqueología y Evolución. Teoría, Métodos y Casos de Aplicación. INCUAPA–UNC, Olavarría, Argentina, pp. 175-186.

Pagel M (2009) Human language as a culturally transmitted replicator. Nature Reviews Genetics 10: 405-415.

Perreault C (2012) The pace of cultural evolution. PLoS ONE 7: e45150.

Rieppel O (2007) Homology: a philosophical and biological perspective. In: Henke W, Tattersall I (eds) Handbook of Paleoanthropology: Vol I: Principles, Methods and Approaches. Springer-Verlag, Berlin, pp. 217-240.

Southworth FC (1964) Family-tree diagrams. Language 40: 557-565.

Spencer M, Wachtel K, Howe CJ (2004) Representing multiple pathways of textual flow in the Greek manuscripts of the Letter of James using reduced median networks. Computers and the Humanities 38: 1-14.

Steele J, Jordan P, Cochrane E (2010) Evolutionary approaches to cultural and linguistic diversity. Philosophical Transactions of the Royal Society of London series B 365: 3781-3785.

Sumrall CD (2005) Fossils in phylogenetic reconstruction. In: Encyclopedia of Life Sciences.

Tëmkin I, Eldredge N (2007) Phylogenetics and material cultural evolution. Current Anthropology 48: 146-153.

Walker RS, Wichmann S, Mailund T, Atkisson CJ (2012) Cultural phylogenetics of the Tupi language family in lowland South America. PLoS ONE 7: e35025.

Warnow T, Evans SN, Ringe DA, Nakhleh L (2006) A stochastic model of language evolution that incorporates homoplasy and borrowing. In: Forster & Renfrew (eds), pp. 75-87.