Showing posts with label Splits graph. Show all posts

Wednesday, July 10, 2013

Networks and human inter-population variation


I have noted before that there are many situations in which the model of a phylogenetic tree is likely to be inappropriate for analysis of genetic data. The most obvious of these involves the study of intra-population variation (e.g. Why do we still use trees for the dog genealogy?). The within-population genealogy of sexually reproducing species, in particular, is not likely to be tree-like, even at large spatial scales. The iconic species for the study of intra-specific evolutionary history is Homo sapiens, and this is also the species where that history is least likely to be tree-like (e.g. Why do we still use trees for the Neandertal genealogy?). Clearly, a phylogenetic network is called for.

Pemberton et al. (2013, Population structure in a comprehensive genomic data set on human microsatellite variation. G3: Genes, Genomes, Genetics 3: 891-907) provide an interesting dataset of global human autosomal microsatellite variation, based on merging eight previously published datasets. Microsatellites are a bit retro in this day and age, but that does not make them any less useful for the study of genetic variation.


The biggest issue is getting a large enough sample of loci for detailed study. Different researchers collect data on different microsatellites, and so combining datasets is not straightforward. Nevertheless, Pemberton et al. managed to come up with 5,795 individuals from 267 worldwide populations with genotypes at 645 loci. After removing one member of every intra-population pair of first- or second-degree relatives, and then reducing the size of the over-represented Gujarati sample, they added data for 84 chimpanzees. This yielded a dataset of 5,519 individuals from 255 populations sampled at 246 shared loci.

These data were processed as follows:
Using Microsat, we evaluated population-level pairwise allele-sharing distance (one minus the proportion of shared alleles), using all 246 loci ... We constructed a greedy-consensus neighbor-joining tree using the Neighbor and Consensus programs in the Phylip package from 1000 bootstrap resamples across loci.
Note that the original inter-population distances were not calculated — the tree was constructed by combining the branches with the highest bootstrap support.
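The allele-sharing calculation itself is simple enough to sketch. Below is a minimal Python illustration of a population-level allele-sharing distance and the locus bootstrap; this is not the Microsat program, and the exact averaging scheme Microsat uses may differ (here the distance is one minus the mean proportion of shared alleles over all cross-population pairs of individuals). All function names and data are hypothetical.

```python
import random

def shared(g1, g2):
    """Proportion of alleles shared by two diploid genotypes at one
    locus: 1.0, 0.5 or 0.0 (matched allele copies divided by 2)."""
    a, b = sorted(g1), sorted(g2)
    if a == b:
        return 1.0
    if a[0] in b or a[1] in b:
        return 0.5
    return 0.0

def allele_sharing_distance(pop1, pop2, loci):
    """One minus the proportion of shared alleles, averaged across the
    chosen loci and over all cross-population pairs of individuals.
    (An assumption -- Microsat's exact averaging may differ.)"""
    total, npairs = 0.0, 0
    for ind1 in pop1:
        for ind2 in pop2:
            total += sum(shared(ind1[l], ind2[l]) for l in loci) / len(loci)
            npairs += 1
    return 1.0 - total / npairs

def bootstrap_distances(pop1, pop2, n_loci, reps=1000, seed=1):
    """Resample loci with replacement, recomputing the distance each time."""
    rng = random.Random(seed)
    return [allele_sharing_distance(
                pop1, pop2, [rng.randrange(n_loci) for _ in range(n_loci)])
            for _ in range(reps)]
```

Looping this over all pairs of the 255 populations would give the set of bootstrap distance matrices from which the consensus tree is assembled.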

This tree (reproduced above) does not show a great deal of support for many of the branches, and the authors discuss only seven of them. However, the presentation of a tree does not give much of a visual indication of the poor support for the genealogy, even if the different branch thicknesses do indicate the bootstrap values.

grey = chimpanzee, orange = Africa, yellow = Middle East, blue = Europe,
red = Central/South Asia, purple = America, pink = East Asia, green = Oceania

So, I calculated a NeighborNet network from the distance data, by averaging the 1000 distance matrices from the bootstrap analysis. This is the network analogue of the neighbor-joining tree, as shown above. Note that I have used the same colour coding as for the tree (thus making it look like a very colourful hummingbird), and the branch lengths represent support.
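Averaging the bootstrap matrices is just an element-wise mean. A sketch with made-up stand-in matrices (the real ones are 255 × 255 population distances):

```python
import numpy as np

# Stand-ins for the 1000 bootstrap distance matrices (here 4 x 4 for
# brevity): each symmetric with a zero diagonal, values invented.
rng = np.random.default_rng(0)
matrices = []
for _ in range(1000):
    a = rng.random((4, 4))
    m = (a + a.T) / 2.0          # symmetrize
    np.fill_diagonal(m, 0.0)     # zero self-distances
    matrices.append(m)

# The element-wise mean across replicates is the single distance
# matrix handed to the NeighborNet algorithm.
mean_d = np.stack(matrices).mean(axis=0)
```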

There is clearly a degree of large-scale geographical clustering of the genotypes, and this corresponds to the larger bootstrap values in the tree. So, the main message from the tree and the network is the same, including the rooting of the human genealogy within the African "group". However, this message is visually much clearer in the network than in the circular version of the tree. Moreover, there is little distinction between the Middle Eastern (yellow) and European (blue) genotypes, and the network makes this more obvious than does the tree.

Wednesday, May 22, 2013

Are phylogenetic trees useful for domesticated organisms?


When looking at the population genetics literature I have noticed that many papers still present very traditional phylogenetic analyses, particularly in what can broadly be called agricultural studies. For instance, genetic distances might be calculated between the samples and a "tree of genetic relationships" presented based on UPGMA clustering.

The problem with this sort of approach to genotype data analysis is that it forces the data into an ultrametric tree, which has long been shown to be inappropriate as a model for evolutionary relationships. Furthermore, there is no indication of the robustness of this tree, nor even whether a tree model is appropriate in the first place.

As a specific example, we can look at the microsatellite data presented by Carimi et al. (2010) for various Sicilian grape cultivars. For grape varieties, where hybridization among cultivars has been the historical norm, an ultrametric tree seems singularly inappropriate.

Wine grapes have been grown on Sicily for more than 2,000 years, and at least 120 grape-vine cultivar names are known in the literature. The authors sampled 82 of the cultivars from the Institute of Plant Genetics (Palermo) germplasm collection, with 1-5 clones sampled per cultivar. They assessed six polymorphic microsatellite loci, producing diploid (co-dominant) data. Only 70 distinct genotypes were detected, which were then subjected to data analysis.

The authors used the "Simple Matching coefficient for co-dominant and multiallelic data" to estimate the genetic distances between samples. Unfortunately, this has been shown to have odd properties for diploid microsatellite data (Kosman and Leonard 2005). Therefore, in my analysis I have instead used the simple metric of Kosman and Leonard (2005), in which the distance between two genotypes is one minus the proportion of shared alleles at each locus (averaged across loci). This was calculated using the mmod R package (Winter 2012).
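As a sketch of that metric (my reading of Kosman and Leonard's diploid dissimilarity; consult the paper or the mmod source before relying on it), with hypothetical allele sizes:

```python
def shared_prop(g1, g2):
    """Proportion of alleles shared by two diploid genotypes at a locus."""
    a, b = sorted(g1), sorted(g2)
    if a == b:
        return 1.0
    if a[0] in b or a[1] in b:
        return 0.5
    return 0.0

def kosman_leonard(ind1, ind2):
    """Dissimilarity between two individuals: one minus the proportion
    of shared alleles at each locus, averaged across the loci."""
    return sum(1.0 - shared_prop(l1, l2)
               for l1, l2 in zip(ind1, ind2)) / len(ind1)

# Hypothetical genotypes at two microsatellite loci (allele sizes):
ind_a = [(120, 124), (201, 201)]
ind_b = [(120, 124), (201, 205)]
```

Computing this for every pair of the 70 genotypes gives the distance matrix that SplitsTree turns into a NeighborNet.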

The authors then used the "UPGMA (Unweighted Pair-Group Method with Arithmetical Averages)" clustering method to produce their ultrametric tree from the distance data. This is the agglomerative hierarchical clustering method most commonly encountered in the literature. Instead, I used a NeighborNet network, calculated using the SplitsTree program, to evaluate whether the data are tree-like.
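To see why UPGMA forces the data into an ultrametric shape, one can inspect the cophenetic distances the tree implies: for every triple of samples, the two largest pairwise distances come out equal, whatever the input looked like. A small sketch with an invented distance matrix (not the Carimi et al. data), using SciPy's average-linkage clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# An invented 4-sample distance matrix (symmetric, zero diagonal).
D = np.array([[0.0, 0.2, 0.6, 0.7],
              [0.2, 0.0, 0.5, 0.7],
              [0.6, 0.5, 0.0, 0.3],
              [0.7, 0.7, 0.3, 0.0]])

# UPGMA is average-linkage agglomerative clustering.
tree = linkage(squareform(D), method="average")

# Cophenetic distances implied by the UPGMA tree: these are
# ultrametric, i.e. the original distances have been distorted
# so that every triple has its two largest distances equal.
coph = squareform(cophenet(tree))
```

Here the four input values 0.5-0.7 are all replaced by a single cophenetic distance (their average), which is exactly the distortion the post is complaining about.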

The resulting network is shown in the first graph. Cultivars that are closely connected in the network are similar to each other based on their microsatellite profiles, and those that are further apart are progressively more different from each other.


The network shows that there is very little hierarchical structure to the grape-vine microsatellite data. The data do not clearly distinguish "six main groups", as interpreted by the original authors based on their tree (which is shown below). [Note that one of the authors' groups (cluster E) is more heterogeneous than the others, and to be comparable should be divided into either two or three groups.]


Note that the network emphasizes two things: (1) there are no clear groupings of the grape cultivars, and (2) the data are rather "noisy", as microsatellite data often are (e.g. Leroy et al. 2009), with many incompatible signals.

As far as the phylogenetic history is concerned, there is no evidence of "several origins for Sicilian grape-vine germplasm", as interpreted by the authors. Instead, there seems to have been continuous mixing of the genotypes, probably including cultivars from elsewhere in Italy, and even further afield around the Mediterranean. This type of complex genetic history seems to be quite common in domesticated organisms, and a tree-based analysis is therefore unlikely to be appropriate for studying them; see, for example, Decker et al. (2009) for cows, Leroy et al. (2009) for horses, and Kijas et al. (2012) for sheep.

References

Carimi F, Mercati F, Abbate L, Sunseri F (2010) Microsatellite analyses for evaluation of genetic diversity among Sicilian grapevine cultivars. Genetic Resources and Crop Evolution 57: 703–719.

Decker JE, Pires JC, Conant GC, McKay SD, Heaton MP, Chen K, Cooper A, Vilkki J, Seabury CM, Caetano AR, Johnson GS, Brenneman RA, Hanotte O, Eggert LS, Wiener P, Kim J-J, Kim KS, Sonstegard TS, Van Tassell CP, Neibergs HL, McEwan JC, Brauning R, Coutinho LL, Babar ME, Wilson GA, McClure MC, Rolf MM, Kim J, Schnabel RD, Taylor JF (2009) Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the USA 106: 18644-18649.

Kijas JW, Lenstra JA, Hayes B, Boitard S, Porto Neto LR, San Cristobal M, Servin B, McCulloch R, Whan V, Gietzen K, Paiva S, Barendse W, Ciani E, Raadsma H, McEwan J, Dalrymple B, other members of the International Sheep Genomics Consortium (2012) Genome-wide analysis of the world's sheep breeds reveals high levels of historic mixture and strong recent selection. PLoS Biology 10: e1001258.

Kosman E, Leonard KJ (2005) Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular Ecology 14: 415–424.

Leroy G, Callède L, Verrier E, Mériaux JC, Ricard A, Danchin-Burge C, Rognon X (2009) Genetic diversity of a large set of horse breeds raised in France assessed by microsatellite polymorphism. Genetics Selection Evolution 41: 5.

Winter DJ (2012) mmod: an R library for the calculation of population differentiation statistics. Molecular Ecology Resources 12: 1158–1160.

Wednesday, April 24, 2013

Cloudograms and data-display networks


I have previously noted that splits graphs are a logical way to present the results of Bayesian analyses (We should present bayesian phylogenetic analyses using networks). Bayesian analyses are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. Thus, the result of a Bayesian phylogenetic analysis should not be presented as a single tree (the so-called MAP tree, or maximum a posteriori probability tree), but should instead show the probability distribution of all of the sampled trees. This can easily be done with a consensus network, as illustrated by example in the previous blog post.

An interesting alternative way of visualizing the probability distribution of trees is what has been called a Cloudogram, an idea introduced by Remco R. Bouckaert (2010, DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 26: 1372-1373). This diagram superimposes the set of all trees arising from an analysis. Dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement. This idea can be best illustrated by a few published examples.

The first cloudogram is from Figure 4 of Chaves JA, Smith TB (2011) Evolutionary patterns of diversification in the Andean hummingbird genus Adelomyia. Molecular Phylogenetics and Evolution 60: 207-218.

In this case the MAP tree has been superimposed on the cloudogram.

Species-tree with the highest posterior probability (PP > 80) superimposed upon
a cloudogram of the entire posterior distribution of species-trees recovered in BEAST.
Areas where the majority of trees agree in topology and branch length are shown as
darker areas (well-supported clades), while areas with little agreement appear as webs.

The next one is from Figure 2 of Pabijan M, Crottini A, Reckwell D, Irisarri I, Hauswaldt JS, Vences M (2012) A multigene species tree for Western Mediterranean painted frogs (Discoglossus). Molecular Phylogenetics and Evolution 64: 690-696.

Posterior density of 2700 species trees (‘‘cloudogram’’) representing the entire posterior distribution
of species trees (270,000 trees post-burnin) from the BEAST analysis based on seven nuclear loci and
4 mitochondrial gene fragments. The species tree with the highest posterior probability is nested within
the set; values indicate posterior probabilities associated with this consensus tree. Areas where many
species trees agree on topology and/or branch lengths are densely colored.

The next one is from Figure 1 of Lerner HR, Meyer M, James HF, Hofreiter M, Fleischer RC (2011) Multilocus resolution of phylogeny and timescale in the extant adaptive radiation of Hawaiian honeycreepers. Current Biology 21: 1838-1844.

In this case the data are more tree-like than the previous two examples.

Cloudogram showing all trees resulting from a Bayesian analysis of whole
mitogenomes (19,601 trees; 14,449 bps). Variation in timing of divergences is
shown as variation (i.e., fuzziness) along the x axis. Darker branches represent a
greater proportion of corresponding trees. All nodes have support values >0.99.

The final one is from Figure 2 of McCormack JE, Faircloth BC, Crawford NG, Gowaty PA, Brumfield RT, Glenn TC (2012) Ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis. Genome Research 22: 746-754.

This analysis involves bootstraps rather than Bayesian samples, showing that the same principle applies.

Evolutionary history of placental mammals resolved from conflicting
gene histories. Widespread consensus among 1000 species-tree bootstrap
replicates of the same 183-locus data set. STEAC trees are depicted because
the branch lengths allow for better visualization of branching patterns, but
STAR results supported the same topology. Cones emanating from terminal
tips of species trees (red arrows) indicate disagreement among bootstrap
replicates.

It would be nice to illustrate this further by direct comparison with a splits graph of the same dataset that I used in the previous blog post. Unfortunately, the computer program available (DensiTree) has the same practical limitation as the SplitsTree program (as mentioned in the previous post) — it does not read the MrBayes ".trprobs" file because it ignores the tree weights. This means that one has to enter the entire treefile (with thousands of trees), and I have not yet done that. Moreover, the program relies very much on having branch lengths for each tree — the output is really quite odd without them, with the taxa appearing in a series of steps rather than connected by straight branches. My previous analysis did not use branch lengths, as they are not needed for the consensus network, in which edge lengths represent support rather than character evolution.

Wednesday, April 10, 2013

Highlighting splits in a splits graph


A splits graph is interpreted in terms of splits, or bipartitions, which divide the graph into two non-overlapping parts. If one wishes to refer to particular splits in a graph then one needs a way of highlighting those splits.

This can be done in a number of ways, some of them derived from conventions originating for the presentation of rooted phylogenetic trees. These include highlighting the taxa in one of the partitions, which is analogous to highlighting a clade in a rooted phylogenetic tree. Alternatively, we could colour the edges associated with each of the two partitions, as shown in this previous blog post (How to interpret splits graphs); however, this works only for a single split at a time.

Alternatively, it is also possible to label the edges of the splits themselves, as shown in this previous blog post (Representing evolutionary scenarios using splits graphs). Dabert et al. (Dabert M, Witalinski W, Kazmierski A, Olszanowski Z, Dabert J (2010) Molecular phylogeny of acariform mites (Acari, Arachnida): strong conflict between phylogenetic signal and long-branch attraction artifacts. Molecular Phylogenetics and Evolution 56: 222-241) present another possibility, which is to colour only the edges that separate the two partitions of each split, as shown in the figure.


This works very well visually. However, there is still the matter of actually labelling the coloured edges. Unfortunately, Dabert et al. chose to do this using terminology that is more appropriate for a rooted phylogenetic tree than an unrooted data-display network. That is, they refer to "clades", which can be recognized only in a rooted graph. Their diagram is clearly labelled with a root taxon, even though the graph itself is unrooted. The implication here is that interpreting the unrooted graph as a rooted network is straightforward, but it is not. It would be better to use the standard terminology, which refers to "splits" or "partitions", rather than to "clades".

Wednesday, April 3, 2013

Representing evolutionary scenarios using splits graphs


Splits graphs are basically data-display networks, since their intended purpose is to graphically display the patterns of variation in a dataset. These patterns may relate to evolutionary history, or they may not.

A couple of weeks ago I discussed a paper by Myles et al. concerning the genetics of grape cultivars, and this paper provides an interesting example where the patterns of genetic variation seem to be strongly phylogenetic in nature (Myles S, Boyko AR, Owens CL, Brown PJ, Grassi F, Aradhya MK, Prins B, Reynolds A, Chia JM, Ware D, Bustamante CD, Buckler ES. 2011. Genetic structure and domestication history of the grape. Proceedings of the National Academy of Sciences of the USA 108: 3530-3535).

Myles et al. note that: "Archaeological evidence suggests that grape domestication took place in the South Caucasus between the Caspian and Black Seas and that cultivated vinifera then spread south to the western side of the Fertile Crescent, the Jordan Valley, and Egypt by 5,000 y ago." They provide an explicit historical scenario of the evolutionary history of cultivated grapes (Vitis vinifera):
  1. There are two species involved (V.sylvestris, V.vinifera), both distributed along the eastern and northern part of the Mediterranean basin;
  2. V.vinifera was domesticated from V.sylvestris in the eastern part of the distribution;
  3. V.vinifera then spread geographically from east to west;
  4. This spread was followed by introgression of V.sylvestris into V.vinifera in the western part of their joint distribution.
Myles et al. generated genotype data from a custom microarray, which assayed 5,387 SNPs genotyped in 570 V.vinifera samples and 59 V.sylvestris accessions from the US Department of Agriculture (USDA) germplasm collection. Average population-pairwise Fst estimates were then calculated from all 5,387 SNPs weighted by allele frequency, based on species and geographical region.
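Myles et al. do not spell out their Fst estimator beyond the allele-frequency weighting, so as an illustration here is one standard pairwise estimator for SNP data (Hudson's, in the ratio-of-averages form recommended by Bhatia et al. 2013); treat the choice of estimator as my assumption, not theirs:

```python
def hudson_fst(p1, p2, n1, n2):
    """Pairwise Fst from per-SNP allele frequencies, Hudson's estimator
    in ratio-of-averages form.  p1, p2: frequency of one allele at each
    SNP in the two populations; n1, n2: numbers of sampled chromosomes."""
    num = den = 0.0
    for a1, a2 in zip(p1, p2):
        # Numerator: squared frequency difference, corrected for
        # finite sample size in each population.
        num += ((a1 - a2) ** 2
                - a1 * (1 - a1) / (n1 - 1)
                - a2 * (1 - a2) / (n2 - 1))
        # Denominator: between-population heterozygosity.
        den += a1 * (1 - a2) + a2 * (1 - a1)
    return num / den
```

Summing numerator and denominator across all 5,387 SNPs before dividing (rather than averaging per-SNP ratios) keeps the estimator well behaved for rare alleles.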

I constructed a NeighborNet splits graph from these Fst data, as shown in the graph. According to Myles et al., the geographic regions are defined as follows: "east" includes locations east of Istanbul, Turkey; "west" includes locations west of Slovenia, including Austria; and "central" refers to locations between them.


Each of the splits (bipartitions) in the graph represents one of the four steps in the hypothesized scenario, as labelled in the figure. Thus, there is apparently phylogenetic signal remaining from all of these proposed historical events that can be detected in the genetic distances. As the authors note: "Our analyses of relatedness between vinifera and sylvestris populations are consistent with the archaeological data".

Note, however, that one cannot infer the scenario from the splits graph, because the data analysis is not intended for direct evolutionary inference. The graph is undirected, and there are therefore several possible scenarios that could be derived from the graph. For example, the graph shown is also compatible with the domestication of V.vinifera from V.sylvestris in the western part of the distribution.

Thus, a splits graph can be used to suggest scenarios (ie. hypothesis generation) and it can be used to test scenarios (hypothesis testing), but the latter is a weak test because there will always be several phylogenetic scenarios with which it is compatible.

Monday, February 4, 2013

Network analysis of Genesis 1:3


This idea was stolen blatantly from the Laboratory Exercises in Evolution at the Biology Department, University of Virginia (Janis Antonovics, Joanna Vondrasek, Doug Taylor), where it is set as a class exercise for learning phylogenetic analysis. In turn, these people credit a similar idea to Barbrook et al. (1998. The phylogeny of the Canterbury Tales. Nature 394: 839), although the originators of the idea appear to be Robinson and O'Hara (1996. Cladistic analysis of an Old Norse manuscript tradition. Research in Humanities Computing 4: 115-137). It is an exercise in stemmatology, which can be a lot more tricky than you might think.

Stemmatology is the discipline that attempts to reconstruct the transmission history of a written text on the basis of relationships between the various extant versions (eg. manuscripts or printings). These relationships can be revealed using phylogenetic networks, which is the approach that I present here. A network is more appropriate than a phylogenetic tree, for reasons that will become obvious — the evolution of books is not a simple thing.

Genesis

The original text of the Christian Bible was written mostly in Hebrew and Aramaic for the Old Testament, and in Greek for the New Testament. It was later translated into Latin, which was then standardized as the "Vulgate", and this was then almost the only version used in churches for the best part of a millennium. Texts in Old English usually consisted of the Gospels or the Psalms alone.

This situation was challenged in the late 14th century, when the first Middle English translations of the whole Bible appeared. There was active resistance to this by the formal Church, and so the idea of an English translation was dropped until the mid 16th century, when the Reformation inspired attempts to translate the books into Modern English as part of a new Protestant religion. These moves were sanctioned by the government, with first the Great Bible (1539) and then the King James Version (1611). Various revisions of the latter have appeared, especially since the late 19th century. These days, there is a veritable cottage industry producing new versions of the Bible for various purposes, usually based on the original texts rather than on earlier translations, with various translation principles being employed (eg. Formal Equivalence, Dynamic Equivalence, Closest Natural Equivalence, etc).

You can consult the various versions of the English-language Bible at one or more of several online sites:
The data used below were all obtained from these sites. These sites suggest that the most famous English-language versions of the Bible are: the Geneva Bible (1560), as used throughout the Reformation, and by William Shakespeare as well as by the "Pilgrim Fathers" in America; and the King James Version (1611), which was the standard English text for a quarter of a millennium. The most widespread current Bible is apparently the New International Version, which has been updated several times since its first appearance in 1973.

Stemmatology

The text that I use is the third sentence of the Bible — Genesis 1:3. (The biblical text was first numbered in the Geneva Bible of 1560.) Here is a dated listing of that sentence in all of the early English translations, plus most of the revisions up to the mid-20th century, and a sample of the many recent versions:

1382 Wycliffe Bible  And God seide, Be maad li3t; and maad is li3t.
1395 Later Wycliffe  And God seide, li3t be maad; and li3t was maad.
1530 Tyndale Bible  Then God sayd: let there be lyghte and there was lyghte.
1535 Coverdale Bible  Than God sayd: let there be light: & there was lyght.
1537 Matthew Bible  And God sayde: let there be light, and there was light.
1539 Great Bible  And God sayde: let there be made lyght, and there was light made.
1560 Geneva Bible  Then God saide, Let there be light: And there was light.
1568 Bishop's Bible  And God sayde, let there be light: and there was light.
1609 Douay-Rheims Bible  And God said: Be light made And light was made.
1611 King James Version  And God said, Let there be light: and there was light.
1750 Challoner Revision  And God said: Be light made. And light was made.
1769 Blayney Revision  And God said, Let there be light: and there was light.
1833 Webster's Bible  And God said, Let there be light: and there was light.
1862 Young's Literal Translation  and God saith, 'Let light be;' and light is.
1885 English Revised Version  And God said, Let there be light: and there was light.
1890 Darby Bible  And God said, Let there be light. And there was light.
1901 American Standard Version  And God said, Let there be light: and there was light.
1950 Knox Bible  Then God said, Let there be light; and the light began.
1952 Revised Standard Version  And God said, "Let there be light"; and there was light.
1971 New American Standard Bible  Then God said, "Let there be light"; and there was light.
1973 New International Version  And God said, "Let there be light," and there was light.
1976 Good News Bible  Then God commanded, "Let there be light" — and light appeared.
1982 New King James Version  Then God said, "Let there be light"; and there was light.
1995 God's Word Translation  Then God said, "Let there be light!" So there was light.
1996 New Living Version  Then God said, "Let there be light," and there was light.
2011 Common English Bible  God said, "Let there be light." And so light appeared.

The first thing we need to do is align the text of these 26 versions, including both words and punctuation. This allows us to directly compare each of the elements of the sentence, comparing like with like as far as their features are concerned.

This is not as easy as it sounds. In this alignment I have separated words when they seem to have a different intent — for example, "was made" is not equivalent to "appeared". I can see endless arguments about the alignment of any text; and, indeed, disagreements about the intent of the original text are what have led to so many different versions of the Bible being created in English.


This alignment then needs to be coded as a set of characters, which define the hypothesized homology between the various elements of the text. In this case I ended up with 50 additive binary characters for analysis. In general, I used Young's Literal Translation to determine the ancestral state for each character, as this translation was an explicit attempt to emulate the Hebrew original. A nexus-formatted version of the dataset is available here.

Various network methods could be used to summarize the character data. First, I have used a NeighborNet based on hamming distances, as I usually do (see my earlier analyses). As you can see from the graph, there are no simple tree-like relationships among these texts, which calls into question any simplistic attempt at stemmatology. (Note that in two cases there are multiple texts that have identical sentences, and thus they appear at the same location in the graph.)
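The distance underlying the NeighborNet is easy to reproduce. A sketch with invented character states (the real matrix has 50 additive binary characters for 26 texts); note that NeighborNet implementations often normalize the count by the number of characters:

```python
# Invented 0/1 character states for three of the texts, purely for
# illustration -- these are NOT the coded states from the real dataset.
texts = {
    "KJV":      [1, 0, 1, 1, 0],
    "Geneva":   [1, 0, 1, 0, 0],
    "Wycliffe": [0, 1, 0, 0, 1],
}

def hamming(a, b):
    """Number of characters at which two coded texts differ."""
    return sum(x != y for x, y in zip(a, b))

# The pairwise distance matrix of the kind NeighborNet starts from.
dist = {(s, t): hamming(texts[s], texts[t])
        for s in texts for t in texts}
```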


It is worth pointing out here that Barbrook et al. (1998) produced a bush-like graph from their data for the Canterbury Tales, but only after deleting 14 of their 58 manuscripts, "as they were likely to have been copied from more than one exemplar, either by deliberate conflation of readings or by changing the exemplar during the course of copying." A similar explanation is likely to apply for some of the texts for Genesis 1:3, although many of them were translated directly from the original Hebrew rather than from later translations (eg. the Latin "Vulgate").

Nevertheless, there is a general separation of the older Genesis texts on the right of the graph and the more recent texts on the left. This might be easier to assess if we simplify the graph.

As a simpler summary of the same relationships, I have used a Reduced Median Network, based on r = 2 (the program default). Note that the time order is reversed in this graph, with the older texts on the left and the more recent texts on the right. The only major discrepancy between the two graphs is the relative placement of the Bishop's Bible. (Also, I have not labelled the two cases where there are several texts that have identical sentences.)


Historically, we would expect the Tyndale Bible, Coverdale Bible, Matthew Bible and Great Bible texts to be closely related, but the Great Bible seems not to fit this expectation. Similarly, we would expect a similarity between the Geneva Bible and the Bishop's Bible, which is also not reflected in the study sentence; nor is the acknowledged debt of the King James Version to the Tyndale Bible.

However, the fact that the Wycliffe Bible and Later Wycliffe are written in Middle English rather than Modern English is clear from their distant relationship to the other texts; and the close historical relationship of the Challoner Revision and the Douay-Rheims Bible is also clear.

Several texts show isolated relationships. The Knox Bible, for example, is unique among the modern texts in being taken from the Latin rather than the original Hebrew, while the Common English Bible is unusual in trying to balance two translation principles (Dynamic Equivalence and Formal Equivalence) rather than using only one.

On the other hand, the New International Version is clearly a very traditional version of the text, given its relationships as shown in the two graphs, which perhaps explains its popularity.

The close association of the Good News Bible with Young's Literal Translation is interesting, given that the former is an (often criticized) free paraphrase of the original Hebrew text while the latter is a literal translation of that same text — you can't get more different translation principles.

Conclusion

The lack of any simple tree-like relationship among these biblical texts makes any attempt to study their phylogeny difficult. My own look at the business of stemmatology suggests that the results here are quite typical of any study of written texts. Part of the problem seems to be that ideas developed in one historical lineage can be transferred to other lineages, and even transferred to earlier parts of those lineages (see my previous post: Time inconsistency in evolutionary networks). So, even though there is a general historical trend through time, that trend is not consistent enough for a tree-based historical analysis to be effective.

Note: there is a later blog post on this same topic — Trees and networks of written manuscripts.

Wednesday, January 23, 2013

Some possible questions for EDA in phylogenetics


Exploratory data analysis (EDA) is an important part of biological data analysis. It is a form of a priori analysis, where the data are investigated before any substantial hypothesis-testing analysis occurs.

EDA seeks to investigate the patterns that exist in a dataset without too many constraints on what those patterns might be. It should reveal any patterns and their relative strengths, in a quick and easily employed manner, preferably using easily digested pictures. In phylogenetics, there are now many types of networks available for exploring the extent to which data are tree-like, and the nature and location of any non-tree-like patterns. These are ideal for EDA.

It is important to recognize that EDA is not a phylogenetic analysis, in the sense that it is not intended to reveal evolutionary history. It is an exploration of the data intended to reveal the major patterns and their relationships. This information will help a phylogeneticist make decisions about how best to analyze their data and, indeed, even whether a phylogenetic analysis of the data is worthwhile. In this sense it is unfortunate that the sorts of networks that I discuss here are often called "phylogenetic networks", since they do not perform phylogenetic analysis.

EDA questions

EDA should not proceed in a haphazard manner, as the more one looks at a dataset the more patterns one is likely to see. In particular, there is always the potential danger of detecting spurious patterns, which may seem to be biologically important, especially if they confirm some of the pre-conceived ideas of the phylogeneticist. Therefore, it is important to have a clear goal when employing EDA, which might involve asking a set of explicit questions when viewing the data.

These questions should not constrain the exploration of the data, as there are many possible questions, directed towards many different goals. Instead, they should be intended to discover possible constraints on subsequent analyses. Indeed, they may even reveal testable hypotheses that would otherwise go unrecognized. (You should not, of course, both discover and test hypotheses using the same dataset!)

As an example, I discuss some questions that might be important for EDA in phylogenetics. These are not the only possible questions, of course, but they seem to be among the more commonly seen ones in the literature. That is, they focus on the contemporary approach to phylogenetics as a tree-building exercise. These represent a basic set of questions that should be answered before a tree-building analysis could begin. Similar questions could be asked, for example, about reticulation processes in evolutionary history.

These questions can be asked about the whole data or about subsets of the data (eg. genomes, genes, loci), which will indicate whether the variation is within-gene or between-gene.

(1) Are the data tree-like?

(2) If not, then is this because they are:
   (a) uninformative about relationships (a bush)
   (b) weakly tree-like (a tree obscured by vines)
   (c) composed of several strong but incompatible relationships (a structured network)
   (d) confused about relationships (an unstructured network — a spiderweb)

(3) If the data are tree-like, then are the trees of data subsets:
   (a) congruent
   (b) incongruent

The important issues that these questions are intended to address include:
   (i) are there strong patterns in the data that might answer the experimental question? (finding suitable data for a specific phylogenetic question is not necessarily easy);
   (ii) what are the patterns? (different patterns are likely to have different biological causes);
   (iii) is any data incompatibility principally between genes or within genes? (these patterns are likely to be caused by different biological processes);
   (iv) are the incompatibilities the result of reticulation processes (eg. recombination, hybridization, lateral gene transfer) or not (eg. incomplete lineage sorting, gene duplication-loss)?

These questions can be addressed using programs such as SplitsTree, which implements a range of network analyses based on unrooted splits graphs. Probably the most commonly encountered of these methods is NeighborNet. However, it is not currently possible to directly extract answers to the above questions using this program. For example, we cannot simultaneously display the networks for different subsets of the data, to directly compare them. Nevertheless, I can illustrate the idea with an example.

An example

The data are from O'Donnell et al. (2000). There are data for six partial gene sequences (334-1336 nt per gene) from each of 27 samples of Fusarium graminearum fungi.

We can start the EDA with a NeighborNet analysis of the entire dataset:


This is rather tree-like, with one major reticulation, involving sample NRRL_28721. Note that we have not imposed a root on the data, because it is often the root location that is the most ambiguous part of the data. We can confirm that this sample is the only one causing the non-tree-like behaviour by temporarily deleting it:


Indeed, the data are now very tree-like. Thus, the complete dataset seems to fit category 2c from the list above. Given a root location, this might then be expected to be a good estimate of the phylogeny.

We can further investigate the incompatible data pattern by looking at each of the six gene subsets individually, in order to locate the possible cause of the reticulation involving NRRL_28721. It turns out that there are several different patterns revealed by the subsets.

The Ligase2 and Ligase3 genes produce poorly resolved bushes (category 2a), with many of the samples being identical:


The Ligase1 and Permease1 genes produce bushes with a netted centre (a combination of categories 2a and 2d):


The Permease2 gene apparently shows a reticulation, but a look at the original sequence alignment shows that this involves an incompatibility between 1 character and 2 other characters; and so it is likely to be of little interest:


In my experience, this is an important consideration when using EDA in phylogenetics. A pattern that looks large in a network may be relatively trivial in the original data. This can happen whenever there is little information in the dataset, so that even the smallest pattern looks large.

The Trichothene gene also produces a net-centered bush, but it is the only gene that shows the reticulation involving NRRL_28721. It is therefore a gene of particular interest for the EDA:


What this EDA shows is that most of the individual genes contain little information on their own, but when combined they show a strong tree-like pattern. That is, the genes complement each other in a phylogenetic analysis. Moreover, the reticulation involving NRRL_28721 appears in only one gene. Indeed, investigation of the alignment suggests a recombination event, with a cross-over in this particular gene. The other genes share one or the other of the two patterns associated with this cross-over.

A final tree-building analysis would therefore be expected to be productive, except that NRRL_28721 will not have a stable position when forced into a tree. On the other hand, a network would be a more appropriate display of the data, as it could show the recombination event associated with NRRL_28721.

Some desiderata for EDA in phylogenetics

Given what I have said above, it would be convenient if there were quantitative measures to identify the different types of network structure, which could then be displayed on each graph as part of the EDA protocol. For example, we need to distinguish these graph forms from each other:
    Few edges (ie. many taxa are identical)
    Star tree (ie. unresolved relationships)
    Resolved tree (ie. clear relationships)
    Structured network (ie. several clear conflicting relationships)
    Unstructured network (ie. unclear relationships)

Unsurprisingly, there is no single measure that could distinguish these from each other. We will therefore need several measures, to be used together.

At the moment, two measures of the degree of reticulation have been proposed, the δ-score (Holland et al. 2002) and the Q-residual (Gray et al. 2010). The main difference is in the normalization constant. Both of them are currently reported by SplitsTree, as an option. Basically, they produce small values for trees and bushes, larger values for structured networks, and the largest values for spider webs. For example, deleting NRRL_28721 from the above dataset reduces the δ-score from 0.1341 to 0.1083 (the scores can range from 0 to 1).
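For readers who want to see what is behind these numbers, the quartet-based definition of the δ-score is simple enough to sketch in a few lines of Python. This is my own illustrative implementation of the Holland et al. formula, not the SplitsTree code:

```python
from itertools import combinations

def delta_score(d):
    """Mean quartet delta score for a symmetric distance matrix d
    (indexable as d[i][j]). For each quartet, the three pairwise
    sums are ordered m1 >= m2 >= m3, and delta = (m1-m2)/(m1-m3);
    0 = perfectly tree-like, values near 1 = maximally conflicting."""
    n = len(d)
    deltas = []
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]], reverse=True)
        m1, m2, m3 = sums
        deltas.append(0.0 if m1 == m3 else (m1 - m2) / (m1 - m3))
    return sum(deltas) / len(deltas)

# A perfectly additive (tree-like) 4-taxon matrix gives a delta of 0,
# because the two largest quartet sums are equal:
tree_like = [[0, 3, 6, 7],
             [3, 0, 7, 8],
             [6, 7, 0, 3],
             [7, 8, 3, 0]]
print(delta_score(tree_like))  # 0.0
```

A per-taxon δ-score, as in the original delta plots, would simply average over only those quartets containing the taxon of interest.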

Thus, these measures can potentially be used to distinguish among the last three graph types in the list, but not among the first three. Wichmann et al. (2011) expressed an apparent preference for the δ-score over the Q-residual, based on an evaluation of a linguistic dataset. Unfortunately, these scores have no statistical interpretation. As noted by Gray et al.:
We have implemented and tested a number of schemes for assessing the significance of delta score and Q-residual values, including non-parametric and parametric bootstrapping. Unfortunately, and curiously, none have proven to be sufficiently powerful and robust.
It is possible to suggest other measures that might also be useful for quantifying the different types of graph structure. For example, for a NeighborNet graph, which is closely related to an unrooted Neighbor-Joining (NJ) tree, we could consider the following quantitative measures to characterize each type of structure, relative to a fully resolved tree:

Few edges
    how many splits there are relative to the number of samples
Star tree
    how many internal splits there are relative to the minimum number necessary for a fully
    resolved tree
Structured network
    how many of the NJ tree splits form parallel sets in the NeighborNet of the same data
    (as opposed to single edges)
Unstructured network
    how many extra splits there are in the NeighborNet relative to the NJ tree
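To make these suggestions concrete, here is a rough Python sketch of how some of them might be computed from a collection of splits. The function and its exact ratios are hypothetical, intended only to illustrate the kind of quantities involved, not a worked-out proposal:

```python
def split_measures(splits, taxa, tree_splits=None):
    """Illustrative versions of the measures suggested above.
    `splits` is a collection of frozensets (one side of each split);
    `tree_splits`, if given, are the splits of an NJ tree on the
    same taxa, for the 'extra splits' comparison."""
    n = len(taxa)
    # internal (nontrivial) splits have more than one taxon on each side
    nontrivial = [s for s in splits if 1 < len(s) < n - 1]
    measures = {
        # few edges: splits per taxon (low when many taxa are identical)
        'splits_per_taxon': len(splits) / n,
        # star tree: internal splits relative to the n-3 needed
        # for a fully resolved unrooted tree
        'resolution': len(nontrivial) / (n - 3),
    }
    if tree_splits is not None:
        # unstructured network: extra splits beyond those in the NJ tree
        measures['extra_splits'] = len(set(splits) - set(tree_splits))
    return measures
```

For a star tree on five taxa, for example, there are only the five trivial splits, so the resolution measure is 0 while the splits-per-taxon measure is 1.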

I am sure that others can easily be devised, as well. I encourage people to have a think about this, so that some consensus might be reached about desiderata for EDA in phylogenetics. The computational people cannot develop algorithms until the biologists tell them what is needed.

Monday, January 21, 2013

EDA or post-optimality analysis of phylogenetic data?


These days, phylogeneticists usually build trees to express the evolutionary history of their samples. As part of this procedure, they also show an interest in the "quality" of their trees. This is a very vaguely defined concept, probably because it has something to do with accuracy, or correctness, which is something we can know almost nothing about. So, instead, we resort to a whole swag of other concepts, such as resolution, robustness, sensitivity and stability, which are related to precision rather than to accuracy.

We implement these "precision" ideas in many ways, including: (i) analytical procedures, such as interior-branch tests, likelihood-ratio tests, clade significance, and the incongruence length difference test; (ii) statistical procedures, such as the ubiquitous non-parametric bootstrap and posterior probabilities, the jackknife, topology-dependent permutation, and clade credibility; and (iii) non-statistical procedures, such as the decay index, clade stability, data decisiveness, and spectral signals.

Most of these are forms of what is called post-optimality analysis — the tree is first calculated and then we evaluate it. In the current issue of Bioinformatics there is a paper by Saad Sheikh, Tamer Kahveci, Sanjay Ranka and Gordon Burleigh (Stability analysis of phylogenetic trees. Bioinformatics 2013 29(2): 166-174) that provides yet another take on the same theme:
We define measures that assess the stability of trees, subtrees and individual taxa with respect to changes in the input sequences. Our measures consider changes at the finest granularity in the input data (i.e. individual nucleotides).
Basically, the idea is to see how much the input would need to be changed in order to cause a change in the tree topology. For example, the authors quantify the minimum edit distance required to create a specified Robinson-Foulds tree distance from the optimal tree, although any similar distances could be used instead. Their basic purpose was to develop a method that could be effective for very large datasets, which most of the alternatives cannot.
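The Robinson-Foulds distance itself is straightforward once each tree is reduced to its set of bipartitions; here is a minimal Python sketch (my own, assuming the splits of each tree are supplied as frozensets of taxon names — not the authors' implementation):

```python
def canonical(split, taxa):
    """Represent each bipartition by the side NOT containing the
    alphabetically first taxon, so the two sides of a split
    compare equal regardless of which side was supplied."""
    side = frozenset(split)
    return side if sorted(taxa)[0] not in side else frozenset(taxa) - side

def robinson_foulds(splits_a, splits_b, taxa):
    """Robinson-Foulds distance: the number of bipartitions found
    in one tree but not the other (some authors halve this value)."""
    a = {canonical(s, taxa) for s in splits_a}
    b = {canonical(s, taxa) for s in splits_b}
    return len(a ^ b)
```

For example, on the taxa A-E, two trees that disagree only about whether A groups with B or with C differ in exactly one split each, giving a distance of 2.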

What this approach leaves unasked is the question of whether a post-optimality analysis is the best approach in the first place. It assumes a tree as the basic structure, and fails to consider alternative structures that might be more appropriate for the data.

Exploratory data analysis (EDA), if performed effectively, can achieve the same result (an assessment of "stability") while at the same time revealing whether a tree is actually the best structure. It does this by evaluating the dataset directly, a priori, rather than evaluating the data relative to a tree, a posteriori. Evaluating the tree in terms of the data is not the same thing as evaluating the data independently of any tree.

Of the methods listed above, the only one that evaluates the data in a tree-independent manner is the use of spectral signals. Another approach is, of course, to use a data-display network, which provides a very convenient picture of the data, and will thus reveal whether a tree-building analysis is a good idea or not.

An example

To explore the essential difference between EDA and post-optimality analysis, we can look at one of the example datasets used by Sheikh et al. to illustrate their method.

This dataset involves sequences of 169 species of mammals, published by Meredith et al. (Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 2011 334(6055): 521-524). There are 35,603 aligned nucleotides, concatenated from 26 gene fragments. The sampling includes nearly all mammalian families, plus five vertebrate outgroups.

Meredith et al. note about their data:
Phylogenetic relations from maximum likelihood (ML) and Bayesian methods are well resolved across the mammalian tree. More than 90% of the nodes have bootstrap support of ≥ 90% and Bayesian posterior probabilities of ≥ 0.95. Amino acid and DNA ML trees are in agreement for 163 out of 168 internal nodes.
Not surprisingly, Sheikh et al. reach a similar enthusiastic conclusion:
The Mammals dataset is highly stable. There is not a single move (R = 1) possible for an edit distance of up to 530 nucleotides. Even if we place an extremely high limit of E = 1000, the biggest move possible is RF = 5. Thus, the stability measures provide an explicit guarantee that there is no move possible for E = 500 and any values of R within 1 SPR distance. This also demonstrates the power of building phylogenies from large densely sampled datasets.
However, this enthusiasm contradicts some well-known previous results. For example, Meredith et al. also note:
Several nodes that remain difficult to resolve (e.g., placental root) have variable support between studies of rare genomic changes, as well as genome-scale data sets, which suggest that diversification was not fully bifurcating or occurred in such rapid succession that phylogenetic signal tracking true species relations may not be recoverable with current methods.
A simple EDA analysis makes the situation clear, which is not done by either the bootstrap / posterior-probability approach of Meredith et al. or the edit-distance / tree-distance approach of Sheikh et al. If we stick to the simple parsimony approach of the latter (rather than the model-based approach of the former), then we can analyze the dataset with hamming distances and a NeighborNet graph.


First, the root of the mammals is not clear. The published tree places the root on the branch leading to monotremes, but in the network the outgroup involves a reticulation. This is caused by an ambiguous relationship between Echinops telfairi (Tenrecidae) and (i) the {outgroup + monotremes} and (ii) the Afrotheria. In the tree the Afrotheria is united by a "strongly supported node".

Second, the root of the placentals is very unclear. Most of the major groups of placentals form clusters, but the relationships among these clusters are very obscure. The data are bush-like within the placentals, rather than tree-like, both at the level of the four major groups (named in the graph) and within each of those groups. In the published tree, some of these subgroups are well-supported, but others involve disagreement between the DNA and amino-acid trees, while others have < 90% bootstrap support.

It is not immediately obvious that a tree-building analysis is going to be of much use for this dataset. There is certainly some "power of building phylogenies from large densely sampled datasets", but this does not automatically mean that those phylogenies will be tree-like. Evolution involves a more diverse process than that, and post-optimality analyses based on a single model may be very misleading about that diversity.

Monday, January 14, 2013

The mysterious rankings in Forbes' Celebrity 100


Every year since 1999 Forbes magazine has produced a list called the Celebrity 100, which purports "to list the 100 most powerful celebrities of the year" within the USA. The list is based on entertainment-related earnings plus media visibility (exposure in print, television, radio, and online). The current list was released in May 2012 (see The World's Most Powerful Celebrities).

The 2012 list generated plenty of negative comments around the web, typified by this one from the Huffington Post: "Looking at the list and the various artists' rankings in the five categories used to determine their placement, there's a sense that the actual positions are more arbitrary than usual. For example, Jennifer Lopez is #1 despite not ranking inside the top 10 in any of the categories (and is as low as #30 in money, #22 in press rank and #19 in social), while both Lady Gaga [Stefani Germanotta] and Justin Bieber earned top 10 placements in all of them except the Money Rank, the top tier of which is dominated by multimedia moguls like Oprah Winfrey".

Forbes' Celebrity 100, top three.

Forbes lists the five categories (designed to measure "earnings and fame") as follows, along with the data sources they used:
  • Money Rank — talk to industry insiders to come up with an estimate of earnings
  • TV / Radio Rank — use Lexis / Nexis to find out how many times each star was mentioned on television and on the radio
  • Press Rank — print media mentions come from Factiva
  • Web Rank — measured using Google blogs
  • Social Rank — count of Twitter followers and Facebook fans.
Each celebrity is ranked within each of these categories separately, and then: "all of the data is processed through an algorithm that creates our power ranking."

The following criticism appears among the comments on the Forbes blog page: "I’m not sure what your algorithm is based on, but it’s clearly not the rankings you provided, given that Beiber and GaGa tie or outrank J-Lo in every category, and Oprah beats her in all but one category and made more than three times as much $. Seems odd ..." Indeed, both Rihanna Fenty and Beyoncé Knowles also out-ranked Lopez in four of the five categories, and yet Beyoncé was ranked only 16th and Rihanna 4th.

In reply to the criticisms, Dorothy Pomerantz (the Forbes editor in charge of the list), noted: "The thing that doesn't show up on the celebrity profiles is magazine covers, and it counts for a lot ... We comb through dozens of magazines from the past 12 months to count how many times each star appeared on covers." This comment generated the response: "to leave magazine covers out when it supposingly 'counts for a lot' is crazy, and nonetheless it still seems weird Lopez could come out on top."

To quantify whether the rankings are indeed 'crazy' or not, we can use a network as a useful means of exploratory data analysis. As usual, I have used the manhattan distance and a neighbor-net network, but this time I have applied them to the rankings for each of the five celebrity-status categories, as listed by Forbes. If the final ranking in the overall Celebrity 100 list really is based on a pooling of the rankings for the five individual categories, then there should be a simple pattern in the resulting network, with similarly ranked celebrities appearing near each other in the graph. That is, people who are closely connected in the network are similar to each other based on their category rankings, and those who are further apart are progressively more different from each other.

The neighbor-net graph of the people in the Forbes Celebrity 100,
labelled according to their ranking.

However, this is blatantly not the case. To highlight this fact, I have coloured the top ten celebrities in red and the bottom ten in purple on the network. While it is true that the red numbers are down at the bottom of the graph and the purple ones are at the top, thus forming a pattern of sorts, the different colours are clearly not clustered together (the red form two separate groups rather than one, and so do the purple). So, it is not just the Lopez ranking that is screwy — a lot of the ranks appear to be fairly arbitrary.

If we look at some specific examples, the people at rank 16 (Beyoncé Knowles) and rank 24 (Adele Adkins) are clustered in the network with those people with ranks 1-8, while the people with ranks 9 and 10 are far away from 1-8. This means that, given their performance on the individual criteria, Beyoncé and Adele should be in the top ten on the Celebrity 100 list, and yet Forbes has them ranked as 16 and 24. Why? Surely their magazine-cover performance cannot have affected them that severely.

Conversely, neither Britney Spears (ranked 6) nor Tom Cruise (ranked 9) made it into the top ten in any category, and yet they are both ranked in there, whereas everyone else ranked inside the top 16 made the top ten in at least one category.

As other examples, we can note that rank 48 is near ranks 9 and 10 in the network while rank 49 is near ranks 98 and 99; and rank 83 is next to rank 22 (near the bottom of the graph) while 26 is next to 81 (near the top)!

The Forbes so-called 'algorithm' must be a sight to behold, because it produces an outcome that is not quite random but nevertheless does a very creditable imitation of it. I think that we can safely say that there is more than 'magazine covers' involved in these ranking discrepancies.

A much more reasonable top dozen, based on the graph, would be the people ranked 1–8, 11, 15, 16 and 24. These are, in alphabetical order:
  • Adele Adkins
  • Justin Bieber
  • Rihanna Fenty
  • Stefani Germanotta
  • LeBron James
  • Kim Kardashian
  • Beyoncé Knowles
  • Jennifer Lopez
  • Katy Perry
  • Britney Spears
  • Taylor Swift
  • Oprah Winfrey
The exact rank order of these people would, presumably, be determined by the number of magazine covers on which they have appeared.

There are some other things that we can learn from an analysis of the Celebrity 100 list, but they have nothing to do with networks, so I will not cover them here. (I have now covered them in this later blog post: Non-randomness in Forbes' Celebrity 100 ranking.)

Wednesday, January 9, 2013

We should present bayesian phylogenetic analyses using networks


Bayesian methods differ from other forms of probabilistic analysis in that they are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. That is, Bayesian analysis is not about identifying the most likely outcome, it is about quantifying the relative likelihood of all possible outcomes. In this sense, it is quite distinct from other probabilistic methods, such as those based on estimating the optimal outcome under criteria such as maximum likelihood, maximum parsimony or minimum evolution. As far as likelihood is concerned, Bayesian analysis deals with (maximum) integrated likelihood rather than (maximum relative) likelihood (Steel & Penny 2000).

In phylogenetic analysis this creates a potentially confusing situation, as the result of most Bayesian analyses is presented as a single tree, rather than showing the probability distribution of all trees. Certainly, some of the information from the probability distribution is used in the tree – usually the posterior probabilities that are attached to each of the tree branches – but this is a poor visual summary of the available information. A better approach would be to use a network to display the actual probability distribution.

Theory

Bayesian phylogeny programs such as MrBayes produce a file (or files) with a copy of every tree that was sampled by the MCMC run (or runs). These files can then be manipulated (eg. using the "sumt" command in MrBayes) to exclude the first trees as part of the burn-in; and a smaller file is then produced with a copy of the unique trees found, along with their frequency of occurrence in the sample (eg. in the "trprobs" file from MrBayes). It is this sample of trees that is summarized to produce the so-called MAP tree (maximum a posteriori probability tree) and its associated branch posterior probabilities.

Ideally, we would take this file and produce a consensus network rather than a consensus tree. The tree produced by MrBayes is built from the best-supported branches of the set of trees sampled, but only a set of compatible branches can be included in the consensus tree (see the original work of Margush and McMorris 1981). Any well-supported but incompatible branches will not be shown, and it is the absence of these branches that causes the phylogenetic tree to deviate from the standard Bayesian philosophy of presenting a probability distribution. A consensus network solves this problem because it is specifically designed to present a specified percentage of the incompatible branches as well (Holland et al. 2004).
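The underlying computation is just a weighted tally of bipartitions. Here is a minimal Python sketch, assuming the posterior sample has already been reduced to unique trees with weights (as in the MrBayes "trprobs" file), with each tree represented by its set of bipartitions:

```python
from collections import Counter

def consensus_splits(trees, threshold=0.05):
    """Tally bipartition frequencies across a weighted posterior
    sample. `trees` is a list of (splits, weight) pairs, where
    `splits` is a set of frozensets and the weights sum to 1.
    Returns the splits whose posterior frequency reaches
    `threshold`. At 0.5 (plus compatibility filtering) this
    reproduces the usual consensus tree; lower thresholds yield
    the splits displayed in a consensus network."""
    freq = Counter()
    for splits, weight in trees:
        for s in splits:
            freq[s] += weight
    return {s: f for s, f in freq.items() if f >= threshold}
```

The set of splits returned at, say, a 5% threshold is exactly what the 5% consensus network below displays, with each split drawn as a band of parallel edges.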

The idea of using a consensus network rather than a consensus tree was first suggested by Holland et al. (2005), although it has rarely been used in practice (eg. Gibb & Penny 2010). Indeed, consensus networks could also be used to present the results of bootstrap analyses, or a set of equally parsimonious trees (Holland et al. 2006).

It is important to note that a consensus network is unrooted, and therefore it is solely a mathematical summary of the data, and cannot be treated as an evolutionary diagram. The MAP tree is, strictly speaking, a form of consensus tree, and so it is technically also solely a mathematical summary of the data. It is, however, conventional to treat it as an evolutionary diagram.

An example

The example data are from Schnittger et al. (2012), being sequence data for the beta-tubulin gene of 17 samples of the genus Babesia. Details of the Bayesian analysis using MrBayes are provided here, which yielded 100,000 trees in the final MCMC sample. A nexus file with the sequence data, the 95% credible set of trees, and the associated bipartitions can be found here. The rooted MAP tree produced by MrBayes is shown in the first diagram, with the posterior probabilities (PP) labelled on each branch.

Bayesian MAP tree.

There are several ways that a consensus network could be presented. The standard way would be to choose some percentage of the MCMC trees and to show the network of those trees. For example, the MAP tree is simply a consensus that includes all of the bipartitions that occur in at least 50% of the trees, plus all of the other bipartitions that are compatible with those bipartitions.

As an alternative, here is the unrooted consensus network with all of the bipartitions that occur in at least 5% of the trees. It is important to note that the branch lengths in the following consensus networks represent the branch support not the estimated number of substitutions (as in the MAP tree). For example, the relationships among the Babesia microti sequences have very short branches in the MAP tree (ie. few substitutions) but long branches in the consensus networks (ie. good support).

Consensus network set at 5%.

This network makes visually clear where the major incompatibilities are among the MCMC trees, which the MAP consensus tree does not (unless one checks the PP values). The major set of boxes in the network involves the branches with PP values of c. 0.4. Note that this network also approximates the 95% credible set of trees, as it excludes those bipartitions that occur in <5% of the trees.

Unfortunately, this network is visually rather cluttered, with a set of multi-dimensional boxes, and so maybe a simpler network would suffice. Instead, here is the consensus network with all of the bipartitions that occur in at least 20% of the trees. This network still emphasizes the main area of incompatibility among the trees, while losing the less-important incompatibilities on the other branches.

Consensus network set at 20%.

Another approach to simplifying the consensus network is to present a set of what are called weakly compatible bipartitions, rather than choosing some percentage of the bipartitions. This is shown in the next network. Note that visually it is a compromise between the above two networks, and may thus be preferable, as it makes clear the range of branches involved in the incompatibilities without needing to use multi-dimensional boxes to do so (ie. the graph is planar rather than 3-dimensional).

Consensus network of weakly compatible bipartitions.

Finally, we can emphasize the relationship between the network and bipartition compatibility by plotting the consensus network consisting of a set of compatible bipartitions. This forms a tree that has the same topology as the MAP tree (since it is constructed in exactly the same manner), but here the branch lengths represent the branch support rather than the estimated number of substitutions.

Consensus network of compatible bipartitions.

Current practical problems

The simplest way to implement the use of consensus networks to display the set of Bayesian trees would be to use SplitsTree to produce the consensus network directly from the (MrBayes) nexus-format MCMC treefile. However, this treefile will usually contain tens of thousands of trees, which is pushing the limit of SplitsTree. Alternatively, we could read in the smaller nexus-format "trprobs" file, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree does not currently read tree weights in nexus treefiles. So, Holland et al. (2005, 2006) produced some Python scripts to create the required nexus files, which can then be input to SplitsTree (or SpectroNet).

Alternatively, MrBayes also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than the trees, and so this can also be used instead of the treefile. There are two practical problems with this approach. First, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. Second, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions.
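As a rough guide to what such a manually constructed file involves, here is a Python sketch that writes a minimal Splits block of the kind described (the exact FORMAT options that SplitsTree expects should be checked against its manual — this is an illustration, not a validated exporter):

```python
def splits_block(taxa, splits):
    """Write a minimal nexus Splits block. `splits` is a list of
    (taxon_subset, weight) pairs, where the weight is the split's
    relative frequency from the partition table; taxa are numbered
    1..n in the order of the Taxa block."""
    index = {t: i + 1 for i, t in enumerate(taxa)}
    lines = ['BEGIN Splits;',
             f'DIMENSIONS ntax={len(taxa)} nsplits={len(splits)};',
             'FORMAT labels=no weights=yes;',
             'MATRIX']
    for subset, weight in splits:
        # each split is listed as its weight followed by the taxon
        # numbers on one side of the bipartition
        members = ' '.join(str(index[t])
                           for t in sorted(subset, key=index.get))
        lines.append(f'\t{weight}\t{members},')
    lines.append(';')
    lines.append('END;')
    return '\n'.join(lines)
```

A series of such blocks, each truncated at a different cumulative bipartition frequency, gives the different percentage networks shown above.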

I used the latter approach to analyze the example data. [This is explained in more detail in a later post: How to construct a consensus network from the output of a bayesian tree analysis.]

Thanks to Bernie Cohen for prompting me to think about networks and Bayesian analysis, and to Barbara Holland for her help with getting the calculations done.

References

Gibb GC, Penny D (2010) Two aspects along the continuum of pigeon evolution: a South-Pacific radiation and the relationship of pigeons within Neoaves. Molecular Phylogenetics and Evolution 56: 698-706.

Holland BR, Delsuc F, Moulton V (2005) Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54:66-76.

Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Molecular Biology and Evolution 21: 1459-1461.

Holland BR, Jermiin LS, Moulton V (2006) Improved consensus network techniques for genome-scale phylogeny. Molecular Biology and Evolution 23: 848-855.

Margush T, McMorris FR (1981) Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.

Schnittger L, Rodriguez AE, Florin-Christensen M, Morrison DA (2012) Babesia: a world emerging. Infection, Genetics and Evolution 12: 1788-1809.

Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17: 839-850.

Monday, January 7, 2013

Is there good and bad fast-food?


Since the Christmas feast days are now over, this blog post continues the series on the nutritional characteristics of modern fast-food, which started with the Network analysis of McDonald's fast-food.

Men's Health magazine has produced a list of what it considers to be The 10 Worst Fast Food Meals in the USA. They chose one meal (usually a combination of several menu items) from each of ten different fast-food chains, which they considered to be extreme meals based on their nutritional characteristics. To counter-balance this list, they also chose another meal combination from each chain that they considered to be much "better for you".

For each of these 20 meals the magazine provided data on four of the nutritional characteristics: Calories, Fat, Saturated fat, and Sodium (salt). I have analyzed these data in the same manner as before: I standardized the data by expressing them as a percent of the officially recommended daily value based on a 2,000 calorie diet, then calculated a NeighborNet network based on manhattan distances.
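The standardization step can be sketched in code. This is a minimal illustration, assuming the usual FDA daily values for a 2,000 calorie diet; the example meal and its values are hypothetical, not the magazine's figures.

```python
# Assumed recommended daily values for a 2,000 calorie diet:
# calories (kcal), total fat (g), saturated fat (g), sodium (mg)
DAILY_VALUES = {"calories": 2000, "fat": 65, "sat_fat": 20, "sodium": 2400}

def standardize(meal):
    """Express each nutrient as a percent of its recommended daily value,
    putting all four characteristics on the same scale."""
    return {k: round(100.0 * meal[k] / DAILY_VALUES[k], 1) for k in DAILY_VALUES}

# a hypothetical "worst food" meal (raw amounts, before standardization)
meal = {"calories": 1200, "fat": 80, "sat_fat": 30, "sodium": 2000}
standardize(meal)  # e.g. calories -> 60.0 (% of daily value)
```

Once every meal is on this common scale, the pairwise manhattan distances can be calculated and passed to a NeighborNet program such as SplitsTree.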

The resulting network is shown in the figure. I have coloured the ten allegedly "better for you" meals alternately in green or blue, with all of the "worse for you" meals in black. Meals that are closely connected in the network are similar to each other based on their nutritional characteristics, and those that are further apart are progressively more different from each other.


Clearly, the "worst food" meals differ greatly from each other in their nutritional characteristics, while the other ten meals do not. In other words, there is a single clear concept of what is "good for you", but many different ideas about what is "bad for you" (or many ways in which the food can be unhealthy).

Furthermore, the "worst food" meals vary in their relationship to the better meals, with Long John Silver's Fish Combo Basket being rather similar to the better items, and both KFC's Half Spicy Crispy Chicken Meal and Burger King's Large Triple Whopper being at the extreme far end of the graph. Indeed, the "worst food" meals form a gradient of increasingly extreme nutritional characteristics: the calories, fat and saturated fat all increase from bottom to top in the network, and sodium increases from left to right.

The sodium gradient applies to the better meals as well, with Quizno's Roadhouse Steak Sammies having more salt than the other nine meals in that group.

So, it seems to me that the fast-food chains are having a harder time creating unhealthy fish meals than they are creating unhealthy chicken and beef meals. However, this may just be lack of effort on their part, because the Tuna Melts with Cheetos meal is certainly pretty extreme.

Anyway, you now know which meals to target should you wish to send yourself into an early grave.

Wednesday, December 26, 2012

Network analysis of McDonald's fast-food


Season's greetings: or as we say here in Sweden: God jul och gott nytt år! In many cultures, over-dosing on food is traditional at this time of year, so it is appropriate to have a food-related blog post this week. [Note there is a follow-up blog post called: Is there good and bad fast-food?]

In 1954 a multi-mixer salesman decided to check out the fast-food operation of a couple of brothers in California, named McDonald, and then offered to form a partnership with them. Nearly 60 years later, there are approximately 34,000 fast-food stores with this name worldwide, although there are very few in Africa, and I don't think they have any in Antarctica. This virus-like growth has been accompanied by negative comments on the nutritional quality of the food. Indeed, the international Slow Food organization, which cares very much about the traditional quality of food, was first formed to contest the opening of a McDonald's store near the historic Spanish Steps in Rome.

In a blog post amusingly entitled Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Edwin Chen looked at the nutritional content of the food provided by this culinary mega-chain. Another way to look at these data is to use a phylogenetic network as a means of exploratory data analysis, which is what I provide here.

The data

The data are taken from the official document McDonald's USA Nutrition Facts for Popular Menu Items, dated 7 August 2012. The stated purpose of the document is this: "We provide a nutrition analysis of our menu items to help you balance your McDonald's meal with other foods you eat. Our goal is to provide you with the information you need to make sensible decisions about balance, variety and moderation in your diet." I presume that the data are accurate, at least on average for each menu item.

The data consist of measurements of 12 dietary characteristics for 82 of the menu food items available in the USA (excluding the drinks), not all of which are necessarily available in other countries. The data for each characteristic are listed as "% Daily Value" based on a 2,000 calorie diet, which thus standardizes the data to the same scale across all of the variables, thereby making them directly comparable. The characteristics are:
  • Calories
  • Total Fat
  • Saturated Fat
  • Carbohydrates
  • Cholesterol
  • Dietary Fiber
  • Protein
  • Sodium
  • Vitamin A
  • Vitamin C
  • Calcium
  • Iron
Sugar values are also listed in the original document, but there is no FDA recommended daily allowance available for sugar, and so these cannot be included in my analysis. The FDA argument is that we get enough calories out of the fat and protein that we eat, and so we don't actually need any extra sugar in our diet.

The analysis

I have analyzed these data using the manhattan distance and a neighbor-net network. The result is shown in the figure. Menu items that are closely connected in the network are similar to each other based on their dietary characteristics, and those that are further apart are progressively more different from each other.
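The distance step can be sketched as follows: items already expressed as "% Daily Value" are compared by the manhattan distance (the sum of absolute differences across the characteristics), and the matrix is written out in roughly the NEXUS layout that SplitsTree reads before computing the NeighborNet. The item names and values here are hypothetical illustrations, not figures from the McDonald's document.

```python
# hypothetical % Daily Values: calories, total fat, saturated fat, sodium
items = {
    "BigBreakfast": [55, 62, 80, 66],
    "PremiumSalad": [18, 15, 23, 38],
    "FriesMedium":  [19, 29, 17, 11],
}

def manhattan(a, b):
    """Sum of absolute differences across all of the characteristics."""
    return sum(abs(x - y) for x, y in zip(a, b))

names = list(items)
matrix = [[manhattan(items[a], items[b]) for b in names] for a in names]

def to_nexus(names, matrix):
    """Format the distance matrix as a NEXUS file (taxa + distances blocks),
    the approximate input layout for SplitsTree."""
    out = ["#NEXUS", "BEGIN taxa;", f"DIMENSIONS ntax={len(names)};",
           "TAXLABELS " + " ".join(names) + ";", "END;",
           "BEGIN distances;", f"DIMENSIONS ntax={len(names)};",
           "FORMAT triangle=both diagonal labels;", "MATRIX"]
    for name, row in zip(names, matrix):
        out.append(name + " " + " ".join(str(d) for d in row))
    return "\n".join(out + [";", "END;"])
```

SplitsTree then computes the NeighborNet splits from this matrix; the network drawn in the figure is its graphical summary.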


The network is basically a blob, with three side branches. The blob (with menu items labelled in black) represents what we could call "typical" McDonald's products, while the side-branches (with the labels in various colours) represent more extreme products, with either more or less of some of the measured characteristics. Basically, the menu items are increasingly "bad for you" towards the right, and better for you at the very bottom.

I have numbered parts of the side-branches as 1–7 in the figure, and coloured the food names, to indicate various groups that are worth discussing. These groups can be described as follows:
  1. The dark red menu items to the right of (1) include the two Big Breakfast with Hotcakes, each of which provides 55% of your daily calorie needs (while most of the menu items to the left provide <40% each), and 40% of your iron requirements.
  2. The menu items to the right of (2) (in red or dark red) include all four of the Big Breakfasts, which each provide nearly 200% of the recommended daily intake of cholesterol (most of the other items provide <90%), >60% of the total fat requirement, and >65% of your sodium needs (i.e. salt).
  3. The items to the right of (3) (in orange, red or dark red) each provide >35% of your calorie needs and >60% of the total fat requirement, while those to the left provide less.
  4. The purple items below (4) include all of the menu items with added egg, and so these have the next highest cholesterol levels after the four Big Breakfasts, at 80-95% of your daily requirements (the other menu items have <50%).
  5. The four menu items to the left of (5) (in light green) are unique in containing >130% of your vitamin C needs, while the other items each provide <35%.
  6. The items to the left of (6) (in blue or light green) include all of the fruit-based items and potato-based items (fries, hash browns) plus the chicken nuggets and bites. These are distinguished by having relatively low values of several characteristics, including those that are bad for you (total fat, saturated fat, cholesterol, and sodium) and those that are good for you (protein and calcium).
  7. The green items below (7) include all nine of the Premium Salads (but not the Side Salad!). They each provide 160% of your vitamin A requirements (the other items each contain <20%), 25-35% of the vitamin C (most of the others contain <15%), and >15% of your dietary fiber.
The conclusions

This means that, for a healthy diet, you should steer well clear of the four "Big Breakfast" menu items, as they are very extreme even by McDonald's standards — you will need to take part in a great deal of strenuous exercise, to burn off their calories, cholesterol, fat and salt. The "Premium Salads", on the other hand, are extreme by having much less of these "bad for you" characteristics than is usual for McDonald's.

Unfortunately, while the fruit concoctions (in light green and light blue) look good in the network, they also have more sugar (20-30 g each) than any of the other menu items (except the Cinnamon Melts), and therefore not everyone is a fan of them (e.g. Mark Bittman; see also the MNN blog). Incidentally, it is also worth pointing out that fast-food in the USA is often much saltier than it is elsewhere in the world (see the article in Health magazine).

Finally, you might also like to compare the network locations of the menu items to William Harris' 10 Most Popular McDonald's Menu Items of All Time (not all of which have been included in my analysis):
  1. French Fries
  2. Big Mac
  3. Snack Wrap
  4. Happy Meal
  5. Egg McMuffin
  6. Apple Dippers and Baked Apple Pie
  7. Chicken McNuggets and Chicken Select Strips
  8. Premium Salads
  9. Double Cheeseburger
  10. McGriddles Breakfast Sandwich

Wednesday, December 19, 2012

Networks can outperform PCA ordinations in phylogenetic analysis


In a previous blog post I noted that there can be Distortions and artifacts in Principal Components Analysis of genome data, in which the Principal Components Analysis (PCA) ordination axes are not independent of each other. This happens when the samples form one or more monotonic gradients, in which case the higher-dimensional axes of the ordination plot are curved and twisted relative to the lower ones. The so-called Arch Effect is the best-known example of this problem. The danger is that the ordination diagram will be interpreted as showing biological patterns, when these patterns are artifacts that are not present in the original data.

This mathematical artifact can be dealt with by: (i) mathematically correcting the distortion; (ii) using a Non-metric Multidimensional Scaling (NMDS) ordination instead; or (iii) using a phylogenetic network to display the multidimensional data, rather than using an ordination. Here, I illustrate this third possibility.

Network analysis of a 1-dimensional gradient

In the previous blog post I used the following small dataset as an example of a genetic dataset with a 1-dimensional pattern of relationships among the taxa. Here the taxa form a series of overlapping groups, defined by the pattern of Cs.

The diagonal dataset.
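The construction can be sketched in code. This is my reconstruction, consistent with the properties described in this post (each taxon carries exactly five Cs, and each internal column groups a run of consecutive taxa); the exact sequences in the original file may differ.

```python
def diagonal_dataset(n, window=5):
    """Build n sequences of length n + window - 1, where taxon i (0-based)
    has Cs at sites i .. i+window-1 and As everywhere else, so the Cs form
    a diagonal band of overlapping taxon groups."""
    length = n + window - 1
    return ["A" * i + "C" * window + "A" * (length - window - i)
            for i in range(n)]

seqs = diagonal_dataset(10)
# every taxon carries exactly five Cs, and e.g. the sixth column groups
# taxa 2-6 (1-based) against the rest
```

Each internal column of this alignment defines one bipartition of the taxa, which is what the network analyses below are being asked to display.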

In the previous post I also illustrated the PCA ordination of this dataset, which shows the classic Arch Effect, rather than the expected linear arrangement of samples in the scatter plot. Below is a table showing whether analyses using two different types of data-display network can effectively display all of the taxon groups for datasets of this type but with varying numbers of taxa. All analyses were performed using the SplitsTree program.


As you can see, the NeighborNet analysis successfully displays all of the relevant bipartitions in all of the datasets. On the other hand, the related Neighbor-Joining tree only succeeds up to a taxon size of six. Six taxa can be represented by a simple linear tree, which can display three non-trivial bipartitions plus six trivial bipartitions, which is all that is needed when each taxon consists of five Cs. (Note that, for six taxa or fewer, the NeighborNet and the Median network are identical to this tree as well.)

The next two figures show the NeighborNet graphs for taxon sizes of 10 and 20, respectively, as examples.

NeighborNet of the diagonal dataset for n=10.

NeighborNet for the diagonal dataset for n=20.

The table also shows that the Median network analysis is only successful for small numbers of taxa. For greater numbers of taxa, some of the bipartitions are not displayed, as listed in the table. For example, for a taxon size of 10, the split {2,3,4,5,6}{1,7,8,9,10} is not displayed, as illustrated in the next figure.

Median network for the diagonal dataset for n=10.

For even larger numbers of taxa, the Median network superimposes some of the taxa, even though they have distinct sequences, and thus it cannot display all of the bipartitions, by definition. For example, for a taxon size of 20, taxon_9 and taxon_10 are superimposed, as shown in the next figure.

Median network for the diagonal dataset for n=20.

Network analysis of a 2-dimensional gradient

In the previous blog post I used the following dataset as an example of a genetic dataset with a 2-dimensional pattern of relationships among the taxa. Here the taxa form two series of overlapping groups, one defined by the pattern of Cs and the other by the pattern of Ts.

The grid dataset.

In the previous post I also illustrated the PCA ordination of this dataset, which shows a complicated 3-dimensional version of the 2-dimensional pattern. On the other hand, the Median network perfectly reproduces the original 2-dimensional grid arrangement of the taxa, as shown next, so that the network diagram is exactly as expected.

Median network for the grid dataset.

Sadly, the NeighborNet analysis does not fare quite so well, as it produces a much more complicated arrangement of the taxa. This arrangement does approximately represent the original grid-like pattern contained in the data, but it is presented in an unnecessarily convoluted way.

NeighborNet for the grid dataset.

Conclusion

It appears that data-display networks can be used to display multidimensional information, in the same manner as an ordination analysis. These networks try to display all of the relevant bipartitions in the data, rather than simply displaying a scatter plot of the samples.
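The bipartitions that such networks try to display can be read straight off an alignment: each binary column splits the taxa into the set sharing one state versus the rest. A minimal sketch (my own illustration, not code from SplitsTree):

```python
def column_splits(seqs):
    """Return the set of bipartitions implied by the binary columns of an
    alignment, each stored canonically as a frozenset of one side's taxa
    (the smaller side, so complements are counted once)."""
    splits = set()
    for col in range(len(seqs[0])):
        states = {}
        for taxon, seq in enumerate(seqs):
            states.setdefault(seq[col], set()).add(taxon)
        if len(states) == 2:                    # one bipartition per column
            side = min(states.values(), key=len)
            splits.add(frozenset(side))
    return splits

# e.g. for four taxa, a column reading C,C,A,A yields the split {0,1} | {2,3}
```

A data-display network succeeds, in the sense used here, when every split returned by this kind of extraction appears as a band of parallel edges in the graph.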

However, there are considerable differences in the ability of the various network analyses to display the patterns inherent in the original data. In particular, NeighborNet seems to be very effective for displaying 1-dimensional patterns, while Median networks are more effective for 2-dimensional patterns.

Finally, it is worth noting that David Bryant has already thought of comparing NeighborNet with a Multi-dimensional Scaling (MDS) ordination, using an empirical dataset (2006, Radiation and network breaking in Polynesian linguistics. In: Forster P, Renfrew C (eds) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute Press, University of Cambridge, pp 111-118). He found that they approximately agreed with each other.