The Genealogical World of Phylogenetic Networks: December 2015

Friday, December 25, 2015

Fast food studies

Season's greetings.

For your Christmas reading, this blog usually provides a seasonally appropriate post on fast food, including to date posts about: nutrition (McDonald's fast-food), geography (Fast-food maps) and diet (Fast food and diet). This year, we will update some of the geographical and diet information about the effects of fast food on people worldwide.

First, there seems to be a general perception that access to fast food is continuing to increase in the modern world, even though that increase started more than half a century ago. Such a perception is easy to verify in the USA, as shown in a previous post (Fast-food maps). However, this trend also appears globally.

For example, in 2013 the Guardian newspaper produced a dataset (called the McMap of the World) illustrating the recent growth in the number of McDonald's restaurants worldwide (McDonald's 34,492 restaurants: where are they?). This first graph shows the relative number of restaurants in 2007 and 2012, with each dot representing a single country. Almost all of the 116 countries showed an increase during the five years (ie. their dots are above the pink line). The only country with a major decrease in McDonald's restaurants was Greece (to the right of the pink line), due, no doubt, to its ongoing financial problems. The country with the largest number of restaurants is, of course, the USA, with Japan a clear second.

There is also a perception that fast-food restaurants compete for customers against other types of restaurants, so that suburbs can have one or the other but not both. This can be checked using the data in the Food Environment Atlas 2014 (produced by the USDA Economic Research Service), which show the number of both fast-food and full-service restaurants per 1,000 people in 2012 for each of the more than 3,100 counties in the USA. This is illustrated in the next graph, where each dot represents a single county.

For most counties, full-service restaurants actually out-number fast-food restaurants, per capita. There are even a few counties that have no fast-food places at all, but also a few with no full-service restaurants. There are even a few with neither restaurant type, notably in AK (2 counties), KY and ND. Interestingly, 3 out of the 4 counties with the largest density of full-service restaurants are in CO (including one not shown because it is off the top of the graph).

Nevertheless, you can't go far in the USA without encountering a fast-food place. As shown in a previous post (Fast-food maps), Subway has the largest number of establishments, not McDonald's. The Flowing Data blog has recently compiled a couple of maps showing the dominance of Subway in the sandwich business (Where Subway dominates its sandwich place competition, basically everywhere). This map shows Subway's dominance: each dot is an area with a 10-mile radius, colored by the brand of the nearest sandwich chain.

Unfortunately, studying the effects of geography on access to fast food is not as simple as it might seem. Large-scale patterns such as those shown above are only part of the picture, because access to fast food is usually assumed to be determined at a very local scale — how far from you is the nearest fast-food place, and how easy is it to get there?

There have been many studies over the years, based on different methods and with different study criteria. These have been summarized (from different perspectives) by SE Fleischhacker, KR Evenson, DA Rodriguez, AS Ammerman (2010. A systematic review of fast food access studies. Obesity Reviews 12: e460) and LK Fraser, KL Edwards, J Cade, GP Clarke (2010. The geography of fast food outlets: a review. International Journal of Environmental Research and Public Health 7: 2290-2308).

Their conclusions from their worldwide literature reviews include:

most studies indicated fast-food restaurants were more prevalent in low-income areas compared with middle- to higher-income areas (ie. there is a positive association between availability of fast-food outlets and increasing socio-economic deprivation);
most studies found that fast food restaurants were more prevalent in areas with higher concentrations of ethnic minority groups;
those studies that included overweight or obesity data (usually measured as the body mass index) showed conflicting results between obesity / overweight and fast-food outlet availability — most studies found that higher body mass index was associated with living in areas with increased exposure to fast food, but the remaining studies did not find any such association;
there is some evidence that fast food availability is associated with lower fruit and vegetable intake.

In a previous post (Fast food and diet) I illustrated the association between fast food and obesity in the USA. Here, I use the data from the Guardian article mentioned above (McMap of the World) to do the same thing at a global scale. This next graph shows the relationship between the per capita density of McDonald's restaurants and overweight / obesity for those countries for which there are data available (each dot represents a single country).

These patterns have continued over the five years since the reviews appeared, with published studies both for (eg. J Currie, S DellaVigna, E Moretti, V Pathania. 2010. The effect of fast food restaurants on obesity and weight gain. American Economic Journal: Economic Policy 2: 32-63) and against (AS Richardson, J Boone-Heinonen, BM Popkin, P Gordon-Larsen. 2011. Neighborhood fast food restaurants and fast food consumption: a national study. BMC Public Health 11: 543) an association between fast food availability and health. To them has been added the issue of Type II diabetes and fast-food consumption (see DH Bodicoat et al. 2014. Is the number of fast-food outlets in the neighbourhood related to screen-detected type 2 diabetes mellitus and associated risk factors? Public Health Nutrition 18: 1698-1705).

Moving on, people have also considered how the role of restaurants might define the identity of American cities. For example, Zachary Paul Neal has considered whether US cities can be classified on the basis of the local prevalence of specific types of restaurants (2006. Culinary deserts, gastronomic oases: a classification of US cities. Urban Studies 43: 1-21). He counted the numbers of several different types of restaurants in 243 of the most populous cities in the USA, and ended up classifying them into four distinct city types: Urbane oases (where one finds an abundance of restaurants of all sorts), McCulture oases (which have larger than normal concentrations of "highly standardised eating places designed for mass consumption"), Urbane deserts and McCulture deserts (both of which have fewer restaurants than their respective oasis counterpart).

Unfortunately, this sort of classification approach is self-fulfilling, because any mathematical grouping algorithm will form groups, by definition, even if there are no groups in the data. I have shown this a number of times in this blog (eg. Network analysis of scotch whiskies; Single-malt scotch whiskies — a network). These culinary data are thus crying out for a network analysis, and I would normally present one at this point in the blog post. However, I do not have a copy of Neal's dataset.

So, instead, I will finish by analyzing some data on the salt content of fast food (E Dunford et al. 2012. The variability of reported salt levels in fast foods across six countries: opportunities for salt reduction. Canadian Medical Association Journal 184: 1023-1028).

The authors collected data on the salt content of products served by six fast food chains that operate in Australia, Canada, France, New Zealand, the United Kingdom and the United States — Burger King, Domino’s Pizza, Kentucky Fried Chicken, McDonald’s, Pizza Hut and Subway. The product categories included: savoury breakfast items, burgers, chicken products, french fries, pizza, salads, and sandwiches. Data were collated for all of the products provided by all of the companies that fitted into these categories (137-523 products per country). Mean salt contents and their ranges were calculated, and compared within and between countries and companies.

We can use a phylogenetic network to visualize these data. As usual, I have used the Manhattan distance and a neighbor-net network. The result is shown in the next figure. Countries that are closely connected in the network are similar to each other based on their fast-food salt content, and those that are further apart are progressively more different from each other.
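As a minimal sketch of the distance step only: assuming each country is summarized by a vector of mean salt values (one per product category), the Manhattan distance between two countries is the sum of the absolute differences across categories. The numbers below are invented placeholders, not the values from Dunford et al., and the function names are hypothetical. The neighbor-net step itself is not shown; it would be run on the resulting distance matrix in dedicated network software.

```python
from itertools import combinations

# Hypothetical mean salt content (g per 100 g) for three product categories.
# These numbers are illustrative placeholders, NOT the published data.
salt = {
    "UK": [1.0, 1.2, 0.6],
    "USA": [1.5, 1.4, 1.6],
    "Canada": [1.4, 1.5, 1.5],
}


def manhattan(a, b):
    """Sum of absolute differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(profiles):
    """Symmetric pairwise Manhattan distances, keyed by (name, name)."""
    names = sorted(profiles)
    dist = {(n, n): 0.0 for n in names}
    for p, q in combinations(names, 2):
        d = manhattan(profiles[p], profiles[q])
        dist[(p, q)] = dist[(q, p)] = d
    return dist


distances = distance_matrix(salt)
```

A matrix of this kind is the usual input for distance-based network methods; which categories go into the profile is a choice made by the analyst.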

You will note that the North American countries are on one side of the network, with the highest salt content, while the European countries are on the other, with the lowest salt content (on average 85% of the American salt content). This difference was reflected even between the same products in different countries. For example, McDonald's Chicken McNuggets contained 0.6 g of salt per 100 g in the UK but 1.6 g of salt per 100 g in the USA. As the authors note: "the marked differences in salt content of very similar products suggest that technical reasons are not a primary explanation."

Thursday, December 17, 2015

Is the Ring of Life a network?

Ten years ago, Rivera and Lake decided to emphasize the series of genome fusions that seem to have been involved in the origin of the major phylogenetic groups by calling it the Ring of Life rather than the Tree of Life:

Maria C. Rivera and James A. Lake (2004) The Ring of Life provides evidence for a genome fusion origin of eukaryotes. Nature 431: 182-185.

This terminology has been repeated in a number of subsequent papers, including:

James McInerney, Davide Pisani and Mary J. O'Connell (2015) The Ring of Life hypothesis for eukaryote origins is supported by multiple kinds of data. Philosophical Transactions of the Royal Society of London B 370: 20140323.

However, life is not that simple, and it has more recently become accepted that a set of inter-connected rings is involved in the metaphor, rather than the simple ring originally presented. Thus we now have the plural Rings of Life, instead.

James A. Lake and Janet S. Sinsheimer (2013) The deep roots of the Rings of Life. Genome Biology and Evolution 5: 2440-2448.

James A. Lake, Joseph Larsen, Brooke Sarna, Rafael R. de la Haba, Yiyi Pu, HyunMin Koo, Jun Zhao and Janet S. Sinsheimer (2016) Rings reconcile genotypic and phenotypic evolution within the Proteobacteria. Genome Biology and Evolution (in press).

I think that the rest of us would still call each of these diagrams a network. Indeed, most of the metaphors that have been used over the years can also be called a network (see Metaphors for evolutionary relationships).

Wednesday, December 9, 2015

Lexicostatistics: the predecessor of phylogenetic analyses in historical linguistics

Phylogenetic approaches are now extremely common in historical linguistics. Probabilistic models that treat lexical change as a birth-death process of cognate sets evolving along a phylogenetic tree (Pagel 2009) are especially popular (Lee and Hasegawa 2011, Kitchen et al. 2009, Bowern and Atkinson 2012), but splits networks are also frequently used (Ben Hamed 2005, Heggarty et al. 2010).

However, the standard procedure to produce a family tree or network with phylogenetic software in linguistics goes back to the method of lexicostatistics, which was developed in the 1950s by Morris Swadesh (1909-1967) in a series of papers (Swadesh 1950, 1952, 1955). Lexicostatistics was discarded by the linguistic community not long after it was proposed (Hoijer 1956, Bergsland and Vogt 1962). Since then, lexicostatistics has been considered a methodus non gratus in classical circles of historical linguistics, and using it openly may drastically lower one's perceived credibility in certain parts of the community.

To avoid the conflicts, most linguists practicing modern phylogenetic approaches emphasize the fundamental differences between early lexicostatistics and modern phylogenetics. These differences, however, apply only to the way the data is analysed. The basic assumptions underlying the selection and preparation of data have not changed since the 1950s, and it is important to keep this in mind, especially when searching for appropriate phylogenetic models to analyse the data.

The Theory of Basic Vocabulary

Swadesh's basic idea was that in the lexicon of every human language there are words that are culturally neutral and functionally universal; and he used the term "basic vocabulary" to refer to these words. Cultural neutrality here means that the meanings expressed by the words are used independently across different cultures. Functional universality means that the meanings are expressed by all human languages, independent of the time and place where they are spoken. The idea is that these meanings are so important for the functioning of a language as a tool of communication that every language needs to express them.

Cultural neutrality and functional universality guarantee two important aspects of basic words: their stability and their resistance to borrowing. Stability means that words expressing a basic concept are less likely to change their meaning or to be replaced by another word. An argument for this claim is the functional importance of the words — if the words are important for the functioning of a language, it would not make much sense to change them too quickly. Humans are good at changing the meanings of words, as we can see from daily conversations in the media, where new words tend to pop up seemingly on a daily basis, and old words often drastically change their meanings. But changing words that express basic meanings like "head", "stone", "foot", or "mountain" too often might give rise to confusion in communication. As a result, one can assume that words change at a different pace, depending on the meaning they express, and this is one of the core claims of lexicostatistics.

Resistance to borrowing also follows from stability, since the replacement of words expressing basic meanings may again have an impact on our daily communication, and we may thus assume that speakers avoid borrowing these words too quickly. Cultural neutrality of concepts is another important factor in guaranteeing resistance to borrowing. Words expressing concepts which play an important cultural role may easily be transferred from one language to another along with the culture. Thus, although it seems likely that every language has a word for "god" or "spirit" and the like (so the concept is to a certain degree functionally universal), the lack of cultural independence makes words expressing religious terms very likely candidates for borrowing, and it is probably no coincidence that words expressing religion and belief rank first in the scale of borrowability (Tadmor 2009: 232).

Lexical Replacement, Data Preparation, and Divergence Time Estimation

Swadesh had further ideas regarding the importance of basic vocabulary. He assumed that the process of lexical replacement follows universal rates as far as the basic vocabulary is concerned, and that this would allow us to date the divergence of languages, provided we are able to identify the shared cognates. In lexical replacement, a word w₁ expressing a given meaning x in a language is replaced by a word w₂ which then expresses the meaning x, while w₁ either shifts to express another meaning, or completely disappears from the language. For example, the older form thou in English was replaced by the plural form you, which now also expresses the singular. In order to search for cognates and determine the time when two languages diverged, Swadesh proposed a straightforward procedure, consisting of very concrete steps (compare Dyen et al. 1992):

Compile a list of basic concepts (concepts that you think are culturally neutral and functionally universal; see here for a comparative collection of different lists that have been proposed and used in the past).
Translate these concepts into the different languages you want to analyse.
Search for cognates between the languages in each meaning slot; if words in two languages are not cognate for a given meaning, then this points to former processes of lexical replacement in at least one of the languages since their divergence.
Count the number of shared cognates, and use some mathematics to calculate the divergence time (which has been independently calibrated using some test cases of known divergence times).
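The last two steps can be sketched as follows. This is a minimal illustration, assuming the classic glottochronological formula t = ln(c) / (2 ln(r)), where c is the proportion of shared cognates and r is the assumed proportion of basic vocabulary retained per millennium. The value r = 0.86 is a figure commonly quoted for the 100-item list and is an assumption here, as are the toy cognate IDs and the function names.

```python
import math


def shared_cognate_proportion(ids_a, ids_b):
    """Proportion of meaning slots in which two languages share a cognate ID.

    Slots with no word in either language (None) are skipped.
    """
    pairs = [(a, b) for a, b in zip(ids_a, ids_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no comparable meaning slots")
    return sum(a == b for a, b in pairs) / len(pairs)


def divergence_time(c, r=0.86):
    """Millennia since divergence: t = ln(c) / (2 * ln(r)).

    r is an ASSUMED retention rate per millennium, not a value from the post.
    """
    if not 0 < c <= 1:
        raise ValueError("c must be in (0, 1]")
    return math.log(c) / (2 * math.log(r))


# Toy cognate IDs for five concepts in two languages (hypothetical).
lang_a = [1, 2, 3, 4, 5]
lang_b = [1, 2, 6, 4, 7]

c = shared_cognate_proportion(lang_a, lang_b)  # 3 of the 5 slots match
t = divergence_time(c)
```

The factor 2 reflects that both languages have been losing vocabulary independently since they split, which is exactly the constant-rate assumption that later critics objected to.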

As an example of such a wordlist with cognate judgments, compare the table in the first figure, where I have entered just a few basic concepts from Swadesh's standard concept list and translated them into four languages. Cognacy is assigned with the help of IDs in the column to the right of each language column, and is further highlighted with different colors.

Classical cognate coding in lexicostatistics

Phylogenetic Approaches in Historical Linguistics

Modern phylogenetic approaches in historical linguistics basically follow the same workflow that Swadesh proposed for lexicostatistics, the only difference being the last step of the procedure. Instead of Swadesh's formula, which compared lexical replacement with radioactive decay and was at its core based on aggregated distances, character-based methods are used to infer phylogenetic trees. Characters are retrieved from the data by extracting each cognate set from a lexicostatistical wordlist and annotating the presence or absence of each cognate set in each language.

Thus, while Swadesh's lexicostatistical data model would state that the words for "hand" in German and English were cognate, and also in Italian and French, but not in Germanic and Romance, the binary presence-absence coding states that the cognate set formed by words like English hand and German Hand is not present in Romance languages, and that the cognate set formed by words like Italian mano and French main is absent in Germanic languages. This is illustrated in the table below, where the same IDs and colors are used to mark the cognate sets as in the table shown above.

Presence-absence cognate coding for modern phylogenetic analyses
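The recoding step can be sketched in a few lines. This is an illustrative sketch rather than the procedure of any particular software package; the cognate IDs are hypothetical and follow the "hand" example described above, and the function name is made up for this example.

```python
# Wordlist: concept -> {language: cognate set ID}. The IDs are hypothetical.
wordlist = {
    "hand": {"German": 1, "English": 1, "Italian": 2, "French": 2},
    "foot": {"German": 3, "English": 3, "Italian": 3, "French": 3},
}


def to_presence_absence(wordlist):
    """Turn meaning-slot cognate IDs into one binary character per cognate set.

    Note: a language with no entry for a concept is coded 0 here; real
    analyses usually code such cases as missing data instead.
    """
    languages = sorted({lang for slot in wordlist.values() for lang in slot})
    matrix = {lang: [] for lang in languages}
    characters = []  # (concept, cognate ID) label for each column
    for concept in sorted(wordlist):
        slot = wordlist[concept]
        for cog_id in sorted(set(slot.values())):
            characters.append((concept, cog_id))
            for lang in languages:
                matrix[lang].append(1 if slot.get(lang) == cog_id else 0)
    return characters, matrix


characters, matrix = to_presence_absence(wordlist)
for lang in sorted(matrix):
    print(lang, matrix[lang])
```

Each meaning slot with k different cognate sets becomes k binary columns, so the two "hand" sets are treated as two separate characters even though, in the original wordlist, they are alternatives within a single slot.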

The new way of cognate coding along with the use of phylogenetic software methods has brought, without doubt, many improvements compared to Swadesh's idea of dating divergence times by counting percentages of shared cognates. A couple of problems, however, remain, and one should not forget them when applying computational methods to originally lexicostatistic datasets.

First, we could ask whether the main assumptions of functional universality and cultural neutrality really hold. It seems to be true that words can be remarkably stable throughout the history of a language family. It is, however, also true that the most stable words are not necessarily the same across all language families. Ever since Swadesh established the idea of basic vocabulary, scholars have tried to improve the list of basic vocabulary items. Swadesh himself started from a list of 215 concepts (Swadesh 1950), which he then reduced to 200 concepts (1952) and then later to 100 concepts (1955). Other scholars went further, like Dolgopolsky (1964 [1986]) who reduced the list to 16 concepts. The Concepticon is a resource that links many of the concept lists that have been proposed in the past. When comparing these lists, which all represent what some scholars would label "basic vocabulary items", it becomes obvious that the number of items that all scholars agree upon shrinks drastically as more lists are considered, while the number of concepts that have been claimed to be basic increases.
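The effect described here is that of intersecting versus uniting sets, and it is easy to reproduce. The three short lists below are invented for illustration and are not taken from the Concepticon or from any published concept list; the function name is likewise hypothetical.

```python
# Hypothetical concept lists (invented for illustration only).
lists = {
    "list_A": {"hand", "foot", "stone", "water", "sun"},
    "list_B": {"hand", "foot", "water", "louse", "name"},
    "list_C": {"hand", "water", "tooth", "tongue", "sun"},
}


def overlap_history(concept_lists):
    """Track agreement as lists are added one by one (in name order).

    Returns a list of (list name, number of concepts shared by all lists
    so far, number of concepts claimed by at least one list so far).
    """
    agreed, claimed, history = None, set(), []
    for name in sorted(concept_lists):
        concepts = concept_lists[name]
        agreed = set(concepts) if agreed is None else agreed & concepts
        claimed |= concepts
        history.append((name, len(agreed), len(claimed)))
    return history


history = overlap_history(lists)
for name, n_agreed, n_claimed in history:
    print(name, n_agreed, n_claimed)
```

With every list added, the intersection can only stay the same or shrink, while the union can only stay the same or grow.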

An even greater problem than the question of universality and neutrality of basic vocabulary, however, is the underlying model of cognacy in combination with the proposed process of change. Swadesh's model of cognacy controls for meaning. While this model of cognacy is consistent with Swadesh's idea of lexical replacement as a basic process of lexical change, it is by no means consistent with birth-death models of cognate gain and cognate loss if they are created from lexicostatistical data. In biology, birth-death models are usually used to model the evolution of homologous gene families distributed across whole genomes. If we use the traditional view according to which words can be cognate regardless of meaning, the analogy holds, and birth-death processes seem to be adequate in order to analyze datasets that are based on these root cognates (Starostin 1989) or etymological cognates (Starostin 2013). But if we control for meaning in the cognate judgments, we do not necessarily capture processes of gain and loss in our data. Instead, we capture processes in which links between word forms and concepts are shifted, and we investigate these shifts through the very narrow "windows" of pre-defined slots of basic concepts, as I have tried to depict in the following graphic.

Looking at lexical replacement through the small windows of basic vocabulary

Conclusion

As David has mentioned before: We do not necessarily need realistic models in phylogenetic research to infer meaningful processes. The same can probably be said about the discrepancy between our lexicostatistical datasets (Swadesh's heritage, which we keep using for practical reasons) and the birth-death models we now use to analyse the data. Nevertheless, I cannot avoid an uncomfortable feeling when thinking that an algorithm is modeling gain and loss of characters in a dataset that was not produced for this purpose. In order to model the traditional lexicostatistical data consistently, we would either (i) need explicit multistate-models in which concepts are a character and the forms represent the states (Ringe et al. 2002, Ben Hamed and Wang 2006), or (ii) we should directly turn to "root-cognate" methods. These methods have been discussed for some time now (Starostin 1989, Holm 2000), but there is only one recent approach by Michael et al. (forthcoming) in which this is consistently tested.

References

Bergsland, K. and H. Vogt (1962): On the validity of glottochronology. Curr. Anthropol. 3.2. 115-153.
Bowern, C. and Q. Atkinson (2012): Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88. 817-845.
Dolgopolsky, A. (1964): Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2. 53-63.
Dyen, I., J. Kruskal, and P. Black (1992): An Indoeuropean classification. A lexicostatistical experiment. T. Am. Philos. Soc. 82.5. iii-132.
Ben Hamed, M. and F. Wang (2006): Stuck in the forest: Trees, networks and Chinese dialects. Diachronica 23. 29-60.
Hoijer, H. (1956): Lexicostatistics. A critique. Language 32.1. 49-60.
Holm, H. (2000): Genealogy of the main Indo-European branches applying the separation base method. J. Quant. Linguist. 7.2. 73-95.
Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews. Genetics 10. 405-415.
Ringe, D., T. Warnow, and A. Taylor (2002): Indo-European and computational cladistics. T. Philol. Soc. 100.1. 59-129.
Starostin, S. (1989): Sravnitel'no-istoričeskoe jazykoznanie i leksikostatistika [Comparative-historical linguistics and lexicostatistics]. In: Kullanda, S., J. Longinov, A. Militarev, E. Nosenko, and V. Shnirel'man (eds.): Materialy k diskussijam na konferencii [Materials for the discussion on the conference]. 1. Institut Vostokovedenija: Moscow. 3-39.
Starostin, G. (2013): Lexicostatistics as a basis for language classification. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 125-146.
Swadesh, M. (1950): Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
Swadesh, M. (1952): Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philos. Soc. 96.4. 452-463.
Swadesh, M. (1955): Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.
Tadmor, U. (2009): Loanwords in the world’s languages. Findings and results. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world's languages. de Gruyter: Berlin and New York. 55-75.

Monday, December 7, 2015

Recent book reviews

In one of the earliest blog posts (Reviews of recent books) I provided links to some book reviews. Recently, a few have appeared for Dan Gusfield's book: ReCombinatorics: the algorithmics of ancestral recombination graphs and explicit phylogenetic networks (2014. The MIT Press, Cambridge, MA).

In addition to the three endorsements that appear as part of the publisher's blurb, a number of independent book reviews have appeared since its publication:

(2014) Computing Reviews Review#143064.

Michael Sanderson (2015) Quarterly Review of Biology 90: 344-345.

Luay Nakhleh (2015) SIAM Reviews 57: 638-642.

From the mathematical point of view, the reviews make it clear that this book is necessary because networks are very much part of the fringe of the computational sciences. Indeed, the challenge is to convince mathematicians that interesting mathematical problems exist in the study of networks. In this sense, the main limitation of the book is its focus on the parsimony criterion for optimization, rather than statistical approaches to inference, which play such a large part in phylogenetic analyses.

From the biological point of view, the principal issue seems to be the reliance of the book on the infinite sites model, which does not currently have wide applicability in phylogenetics (it is used mostly in population studies, such as haplotype inference and association mapping).

The ultimate goal for both computational and biological scientists is working out how to include recombination in the framework of other types of phylogenetic networks. A basic assumption of many phylogenetic analyses is that there has been no recombination. This is because recombination can destroy much of the evidence left by historically preceding processes, so that neither genotype nor phenotype data can reveal patterns and processes that pre-date the recombination events. In this sense, recombination becomes the reticulation process, rather than processes like hybridization or introgression.