Showing posts with label Anthropology. Show all posts

Wednesday, July 10, 2013

Networks and human inter-population variation


I have noted before that there are many situations in which the model of a phylogenetic tree is likely to be inappropriate for analysis of genetic data. The most obvious of these involves the study of intra-population variation (e.g. Why do we still use trees for the dog genealogy?). The within-population genealogy of sexually reproducing species, in particular, is not likely to be tree-like, even at large spatial scales. The iconic species for the study of intra-specific evolutionary history is Homo sapiens, and this is also the species where that history is least likely to be tree-like (e.g. Why do we still use trees for the Neandertal genealogy?). Clearly, a phylogenetic network is called for.

Pemberton et al. (2013, Population structure in a comprehensive genomic data set on human microsatellite variation. Genes Genomes Genetics 3: 891-907) provide an interesting dataset of global human autosomal microsatellite variation, based on merging eight previously published datasets. Microsatellites are a bit retro in this day and age, but that does not make them any less useful for the study of genetic variation.


The biggest issue is getting a large enough sample of loci for detailed study. Different researchers collect data on different microsatellites, and so combining datasets is not straightforward. Nevertheless, Pemberton et al. managed to come up with 5,795 individuals from 267 worldwide populations with genotypes at 645 loci. After removing one member of every intra-population first-degree and second-degree relative pair, and reducing the size of the over-represented Gujarati sample, they added data for 84 chimpanzees. This yielded a dataset of 5,519 individuals from 255 populations sampled at 246 shared loci.

These data were processed as follows:
Using Microsat, we evaluated population-level pairwise allele-sharing distance (one minus the proportion of shared alleles), using all 246 loci ... We constructed a greedy-consensus neighbor-joining tree using the Neighbor and Consensus programs in the Phylip package from 1000 bootstrap resamples across loci.
Note that the original inter-population distances were not calculated — the tree was constructed by combining the branches with the highest bootstrap support.

This tree (reproduced above) does not show a great deal of support for many of the branches, and the authors discuss only seven of them. However, the presentation of a tree does not give much of a visual indication of the poor support for the genealogy, even if the different branch thicknesses do indicate the bootstrap values.

grey = chimpanzee, orange = Africa, yellow = Middle East, blue = Europe,
red = Central/South Asia, purple = America, pink = East Asia, green = Oceania

So, I calculated a NeighborNet network from the distance data, by averaging the 1000 distance matrices from the bootstrap analysis. This is the network analogue of the neighbor-joining tree, as shown above. Note that I have used the same colour coding as for the tree (thus making it look like a very colourful hummingbird), and the branch lengths represent support.
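As a rough illustration of the two steps involved here (an allele-sharing distance, followed by an average over bootstrap resamples of the loci), the following Python sketch may help. It assumes one common population-level definition of the proportion of shared alleles, namely the sum over alleles of the smaller of the two population frequencies, averaged over loci; this may not be exactly what Microsat computes. The function names and the toy allele frequencies are hypothetical, and are not taken from the Pemberton et al. data.

```python
import random

def allele_sharing_distance(freqs_a, freqs_b, loci):
    """Return one minus the mean proportion of shared alleles over `loci`.

    freqs_a and freqs_b are lists with one dict per locus, mapping
    allele -> frequency within that population (assumed definition).
    """
    shared = 0.0
    for locus in loci:
        fa, fb = freqs_a[locus], freqs_b[locus]
        alleles = set(fa) | set(fb)
        shared += sum(min(fa.get(a, 0.0), fb.get(a, 0.0)) for a in alleles)
    return 1.0 - shared / len(loci)

def distance_matrix(pops, loci):
    """Pairwise distances for all populations, keyed by (name, name)."""
    names = sorted(pops)
    return {(p, q): allele_sharing_distance(pops[p], pops[q], loci)
            for p in names for q in names}

def mean_bootstrap_matrix(pops, n_loci, n_boot=1000, seed=0):
    """Average the distance matrices from bootstrap resamples across loci."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_boot):
        # Resample loci with replacement, keeping the number of loci fixed.
        sample = [rng.randrange(n_loci) for _ in range(n_loci)]
        for pair, d in distance_matrix(pops, sample).items():
            totals[pair] = totals.get(pair, 0.0) + d
    return {pair: total / n_boot for pair, total in totals.items()}

# Hypothetical toy data: three populations, two loci.
toy = {
    "A": [{"x": 0.5, "y": 0.5}, {"u": 1.0}],
    "B": [{"x": 0.5, "y": 0.5}, {"u": 0.5, "v": 0.5}],
    "C": [{"z": 1.0}, {"v": 1.0}],
}
full = distance_matrix(toy, [0, 1])
averaged = mean_bootstrap_matrix(toy, n_loci=2, n_boot=200)
print(full[("A", "B")], averaged[("A", "B")])
```

The averaged matrix would then be handed to a NeighborNet implementation (such as the one in SplitsTree), which is not shown here.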

There is clearly a degree of large-scale geographical clustering of the genotypes, and this corresponds to the larger bootstrap values in the tree. So, the main message from the tree and the network is the same, including the rooting of the human genealogy within the African "group". However, this message is visually much clearer in the network than in the circular version of the tree. Moreover, there is little distinction between the Middle Eastern (yellow) and European (blue) genotypes, and the network makes this more obvious than does the tree.

Monday, June 3, 2013

Plotting evolutionary divergence and time


Evolutionary studies have always been based in some way on the phenotype / genotype distance between organisms, usually taken as representing divergence or convergence through evolutionary time. In phylogenetics, this aspect of evolution is usually secondary to the discovery of pathways of historical descent, which are based on patterns of character distribution among organisms rather than on their degree of character divergence.

Nevertheless, evolutionary biologists have long wanted a diagram that links the phenotype / genotype distance between organisms with evolutionary time. One could thereby visualize the presumed historical patterns of phenotype / genotype divergence (and convergence). This should produce a diagram that looks something like a phylogenetic tree (or a coalescent tree), except that one axis would explicitly represent time and the other would explicitly represent phenotype / genotype distance.

Mike Keesey, at the consistently interesting blog A Three-Pound Monkey Brain, has attempted to produce such a thing. The particular example shown here is based on All Known Great Ape Individuals (Messinian to Present).


The figure is intended to include all known hominid fossil individuals (Hominidae), with the horizontal axis representing a distance matrix based on craniodental characters — varying from orangutan-like skull and teeth on the left to human-like on the right. Since the vertical axis represents fossil age, the diagram shows the morphological divergence of human-like individuals (including Neandertals) from earlier forms.

The diagram itself is somewhat of a work in progress, and Mike notes several limitations of the data (eg. absence of fossil gorillas and Pliocene stem-orangutans, lack of postcranial characters, the problematic assignment of names) as well as the analysis (eg. there's a random element to the plotting). Moreover, summarizing morphological divergence in only one axis is a major simplification, and so the diagram should not be over-interpreted.

Nevertheless, this looks very promising for those people who are interested in visualizing the process of evolutionary divergence. This could be especially useful if data are available for genetic distances rather than phenotypic ones. With the increasing availability of genome data for hominids, for example, the temporal relationships among Humans, Neandertals and Denisovans should be much easier to visualize (see Why do we still use trees for the Neandertal genealogy?). If nothing else, ancestral polymorphism would be clearly displayed!

Wednesday, March 27, 2013

The Music Genome Project is no such thing


The Music Genome Project is a database in which 1 million pieces of music (currently) have been coded for 450 distinct musical characteristics. The main use of the database at the moment is to provide the data from which predictions can be made about which other pieces of music might appeal to listeners of any nominated musical set; this is implemented in the Pandora Radio product. This seems like a valuable idea.

However, the use of the word "genome" is an analogy, in which the set of musical characteristics is seen as creating a sort of genetic fingerprint for a song. According to one of the originators, Nolan Gasser:
The basic idea ... was to see if we could approach music from almost a scientific perspective; that's why it's called the Music Genome Project, named not accidentally after the Human Genome Project.
     I've always taken that metaphor very seriously: biologists have come to understand the human species by identifying all the individual genes in our genome; it's then how each individual gene is manifest or expressed that makes us who we are as individuals — as well as defines how we're related to others: most closely to those in our family, but also indirectly to people who share our same physical attributes or capabilities in sports, and so forth.
     That orientation was paramount to my thinking in designing the Music Genome Project.
There seems to be a major misunderstanding here, since the mere idea of atomizing something does not make the atoms genes. After all, the idea behind the Project is basically one of taking music apart and evaluating it by its acoustic elements.

The first problem is that the study of musical attributes is clearly a study of phenotype not genotype, as Gasser alludes in the quote above — there are no hereditary units in music. Unfortunately, phenotype and genotype are frequently confused in the social sciences, with serious consequences when the wrong analogy is used (see the blog post False analogies between anthropology and biology). As noted by LessWrong user jmmcd:
I think the Music Genome project is misleadingly-named. A genome is generative: there is a mapping from a genome to an organism. There is no reverse mapping. In the case of music, there is a reverse mapping from a piece of music to these 400 odd features, but there's no forward mapping ... Knowledge of a phenotype is not constructive, because there are many ways of constructing that phenotype; a genotype is unique, and is thus constructive.
Equally importantly for the Music Genome Project, the musical attributes themselves cannot easily be related to genes as a metaphor — they are simply observed features of the music. The attributes cover musical ideas such as genre, type of instruments, type of vocals, tempo, etc. Most of these attributes are objective and observable (e.g. vocal duets, acoustic guitar solo, percussion, triple meter style, etc), although there are some that are more nuanced (e.g. driving shuffle feel, wildly complex rhythm, epic buildup / breakdown, etc) and thus involve expert subjective judgment. The attributes are coded on a 10-point scale for the "amount" of each attribute.

Given the quantitative nature of the attributes, the only possible analogy with genetics is that of gene expression, not the genome itself (as Gasser also alludes in the quote above). This is a very different metaphor, at least to a biologist. The power of a metaphor is that if it is a good one then it can give you insights that you might not otherwise have; the danger is that a false metaphor will probably lead you up the garden path. In this case, the genome analogy does seem to lead people astray, because they think that Pandora is picking "related" music in a genealogical sense (a "family resemblance") when it is doing no such thing. After all, trying to construct a phylogeny from gene expression data is not something that biologists have attempted successfully.

Thus, if the Music Genome Project did live up to its name then it would be a very valuable thing for musical anthropologists, because then it would be possible to reconstruct a phylogeny of music. Indeed, such a thing has been proposed for popular music: The Music Phylogeny Project. Furthermore, such phylogenies have already been constructed: A Phylogenetic Tree of Musical Style. In the latter case, the author notes: "Needless to say, the tree is not automatically produced by the raw data itself, but by my own interpretation of the data", which gives you some idea of the technical problems involved.

Finally, I will note that what I have said above applies to the other projects based on a supposed analogy with the Human Genome Project. These include the Book Genome Project and the Game Genome Project. Indeed, the blurb for the Book Genome Project makes it sound even more wildly inappropriate:
The genomic analogy is imperfect but useful nevertheless: we defined the three elements of Language, Story, and Character as the literary equivalent of DNA and RNA classifications. Each gene category contains its own subset of measurements specific to its branch of the book genome structure ... Each individual book produces 32,162 genomic measurements.
As noted by commentator CypherGames below, these projects would all be more accurately called Phenome Projects.

Later note: There is also a subsequent post on The Genome Cellar.

Wednesday, February 20, 2013

Network of ancient Thai bronze Buddha images


This blog post continues the theme from the previous post (Trees and networks of written manuscripts), in which I noted that anthropological data are very likely to involve horizontal flows of phylogenetic information as well as vertical ones. My own analyses of anthropological datasets that are available online seem to confirm this suggestion. The simplest way to illustrate this point is to take a dataset and analyze it using a network method. If the network method produces a tree-like diagram then we can safely conclude that vertical descent has had a larger influence on the transmission of the cultural information than has horizontal transfer.


The dataset that I will use here is provided by Marwick (2012). It involves photographic images of 42 cast metal Buddha statues from the Alexander B. Griswold collection of the sacred sculpture of Thailand (Walters Art Gallery, Baltimore, USA). The statues cover seven widely recognized chronological Thai culture-historical groups.

The morphological features of the statues' heads were coded as 17 binary characters, representing the face of the Buddha image; and these data are included in Marwick (2012). Statues CN65 and CN66 had identical codings for the features used.

Originally, Marwick (2012) analyzed these data by first summarizing the characters for each of the seven culture-historical groups. The phylogenetic analysis was then performed with these seven groups as the taxa. The exhaustive-search parsimony analysis produced three maximum-parsimony trees, and the bootstrap consensus tree was not well-supported, as shown in the figure.

This result suggests that the data may not be particularly tree-like. To assess this, I have performed a network analysis using the Hamming distance and a NeighborNet graph, as shown in the next figure. The seven culture-historical groups have been colour-coded as follows (in chronological order):
Dvaravati = light green
Khmer = dark green
Thirteenth_Century = dull blue
Sukhothai = bright blue
Early_Ayutthaya = purple-brown
Lan_Na = pink
Late_Ayutthaya = red
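For readers unfamiliar with the Hamming distance used for this network, it is simply the number of characters at which two codings differ. The following Python sketch shows the calculation for binary codings; the statue labels and the 17-character strings are hypothetical, and are not Marwick's actual data.

```python
def hamming_distance(a, b):
    """Number of characters at which two equal-length codings differ."""
    if len(a) != len(b):
        raise ValueError("codings must have the same number of characters")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical codings of 17 binary characters (not Marwick's actual data).
codings = {
    "statue_1": "10110010011010010",
    "statue_2": "10110010011010010",  # identical coding to statue_1
    "statue_3": "00111010001010110",
}

# Pairwise distance matrix, keyed by (label, label).
matrix = {(p, q): hamming_distance(codings[p], codings[q])
          for p in codings for q in codings}
print(matrix[("statue_1", "statue_2")], matrix[("statue_1", "statue_3")])
```

Two statues with identical codings, as reported for CN65 and CN66, have a distance of zero and so cannot be separated in the network.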


Clearly, the network is not very tree-like, and so we can infer that there has been a considerable influence of horizontal flow of phylogenetic information, as well as the vertical flow through time. There are, however, distinct temporal patterns in the network, which we can infer are probably phylogenetic patterns.

The samples from the earliest three periods (Dvaravati, Khmer, Thirteenth_Century) are at the right-hand end of the network, while the samples from the next period (Sukhothai) are at the bottom-left. This implies that a large stylistic change occurred between the Thirteenth_Century and the Sukhothai periods. Furthermore, the Khmer period style is rather distinct from that of the immediately preceding period (Dvaravati) and the immediately following one (Thirteenth_Century), which are themselves not distinct. That is, there was no stylistic change between the first two periods, but there was a small change to the next period, and then a large change to the following period.

The samples from the latest two periods (Lan_Na, Late_Ayutthaya) are collected mainly in two locations, at the bottom of the graph and at the top-left. This indicates that, although there are two distinct styles, they do not correlate with the two culture-historical periods. So, the pattern here is not a strictly phylogenetic one, and we need to look for some other explanation.

The samples from the Early_Ayutthaya period are scattered throughout the top and left of the network, suggesting that this is an intermediate style between that of the immediately previous Sukhothai period and the earliest three periods, rather than being an innovative style leading to the succeeding Lan_Na period.

Importantly, these interpretations of the phylogenetic patterns do not accord with those from the tree-building analysis, where the possible patterns of horizontal flow of information are not made explicit.

Reference

Marwick B (2012) A cladistic evaluation of ancient Thai bronze Buddha images: six tests for a phylogenetic signal in the Griswold Collection. In: Bonatz D, Reinecke A, Tjoa-Bonatz ML (editors) Connecting Empires. National University of Singapore Press, pp. 159-176.

Monday, February 18, 2013

Trees and networks of written manuscripts


It is often suggested by anthropologists that their studies, including archaeology and linguistics, are very likely to involve horizontal flows of phylogenetic information as well as vertical ones (see the earlier posts False analogies between anthropology and biology and Time inconsistency in evolutionary networks). For example, in linguistics the horizontal flow is referred to as "diffusion", while in stemmatology it is called "contamination".

The simplest way to illustrate this is to take a dataset and analyze it using both a tree-building method and a network method. Only if the network method produces a tree-like diagram can we then safely conclude that vertical descent has had a larger influence on the transmission of the cultural information than has horizontal transfer.
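One quantitative complement to inspecting the diagram by eye, which is not used in the studies discussed here, is the four-point condition: a set of distances fits a tree exactly only if, for every four taxa, the two largest of the three pairwise sums are equal. The Python sketch below measures the departure from that condition; the function names and the two small distance matrices are hypothetical illustrations.

```python
from itertools import combinations

def four_point_gap(d, quartet):
    """Gap between the two largest pairwise sums for four taxa (0 on a tree)."""
    a, b, c, e = quartet
    sums = sorted([d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]])
    return sums[2] - sums[1]

def max_four_point_gap(d):
    """Largest gap over all quartets; 0 means the distances are additive."""
    return max(four_point_gap(d, q) for q in combinations(range(len(d)), 4))

# Hypothetical distances on the tree ((A,B),(C,D)) with all edges of length 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Hypothetical distances around a four-cycle A-B-C-D-A (a simple reticulation).
cycle_like = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
print(max_four_point_gap(tree_like), max_four_point_gap(cycle_like))
```

A gap of zero is expected only for perfectly tree-like distances; real datasets will rarely achieve this, so the size of the gap relative to the distances themselves is what matters.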

A few weeks ago I reported on a case, involving the historical development of the musical instrument called a cornet, where the author first used a tree to analyze the historical data and then later settled on a network, which turned out to be rather non-treelike (Cornets: from a tree to a network). Here, I point out another example, this time involving written text.

Stemmatology is the discipline that attempts to reconstruct the transmission history of a printed text on the basis of relationships between the various extant versions (eg. manuscripts or printings). In this case, the analysis concerns the Greek manuscripts for the New Testament, in particular the Letter of James.

The stemmatological study used a database listing the variants of the 761 characters in 165 Greek manuscripts of the Letter of James. Of these, 60 characters are constant, 266 are variable but parsimony-uninformative, and 435 are variable and parsimony-informative. The objective of the study was to trace the history of copying of one manuscript to another.
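The three classes of character mentioned here can be sketched in Python as follows, using the usual definition that a character is parsimony-informative when at least two of its states each occur in at least two of the taxa (here, manuscripts). The function name, the treatment of missing readings as None, and the example readings are assumptions for illustration only.

```python
from collections import Counter

def classify_character(states):
    """Classify one character from its states across all manuscripts."""
    counts = Counter(s for s in states if s is not None)  # ignore missing data
    if len(counts) <= 1:
        return "constant"
    # States that are shared by at least two manuscripts.
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "informative" if repeated >= 2 else "uninformative"

# Hypothetical readings of four characters in four manuscripts.
examples = {
    "char_1": ["a", "a", "a", "a"],
    "char_2": ["a", "a", "a", "b"],
    "char_3": ["a", "a", "b", "b"],
    "char_4": ["a", "b", "c", None],
}
classes = {name: classify_character(states) for name, states in examples.items()}
print(classes)
```

Note that the three classes reported for the Letter of James do sum to the total number of characters (60 + 266 + 435 = 761).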

To construct a phylogenetic tree from the dataset, Spencer et al. (2002) performed a parsimony analysis, and then summarized this with an Adams-2 consensus tree of the resulting 10,000 maximum-parsimony trees. This tree is shown in the first figure.


However, this approach does not explicitly display the inferred contamination among the manuscripts, which would require a phylogenetic network rather than a tree. So, Spencer et al. (2004) produced a reduced median network, instead, based on 82 selected manuscripts and 301 binary characters. This is shown in the second figure.


Clearly, parts of the manuscript history are not very tree-like, notably the part at the inferred root of the network. Spencer et al. note that this network topology:
is consistent with the ideas that most variants arose early in the history of the Greek New Testament, that early manuscripts were often influenced by both oral and written traditions, and that later copies introduced fewer variants.
Under these circumstances, a tree cannot be an appropriate representation of the anthropological data, because horizontal transfer of information has had a large effect during at least part of the phylogenetic history.

References

Spencer M, Wachtel K, Howe CJ (2002) The Greek vorlage of the Syra Harclensis: a comparative study on method in exploring textual genealogy. TC: a Journal of Biblical Textual Criticism 7: 3.

Spencer M, Wachtel K, Howe CJ (2004) Representing multiple pathways of textual flow in the Greek manuscripts of the Letter of James using reduced median networks. Computers and the Humanities 38: 1–14.

Wednesday, January 2, 2013

False analogies between anthropology and biology


There has been much talk over the past few decades about the extent to which the various disciplines within anthropology (in the broad sense) can use, or benefit from, methodological techniques developed in other disciplines, notably biology (see Mace et al. 2005; Forster & Renfrew 2006; Lipo et al. 2006). This has been particularly true for historical studies of languages (ie. linguistics), past cultures (ie. archaeology) and physical type (ie. physical / biological anthropology). The use of, for example, phylogenetic methods seems to be relatively unproblematic in the latter case (studies of the origin and development of humans as a species; Holliday 2003), although this field is concerned as much with population genetics as it is with species phylogenies. (Note that I am leaving cultural anthropology out of the discussion, as it seems to be less concerned with historical studies.)

However, the use of phylogenetic methods in archaeology and linguistics is based on an analogy between human cultural evolution and biological evolution. This analogy assumes that the underlying processes of historical change in anthropology and biology are similar enough that the analytical methods can be combined. (Note that I am using the word anthropology in the broadest sense, to include linguistics and archaeology.) So, both anthropology and biology apparently involve an evolutionary process, in which the study objects form groups that change via modification of their intrinsic attributes, the attributes being transformed through time from ancestral to derived states (often called "innovations" in anthropology). That is, it is the groups of objects that change through time (variational evolution) rather than the objects themselves changing (transformational evolution). Thus, if one group acquires a new (derived or advanced) character state while the rest do not (i.e. they retain the ancestral or primitive state) then this group forms a separate historical lineage that diverges from the other populations, and maintains its own historical tendencies and fate. A search for derived character states that are shared among the groups allows us to reconstruct the evolutionary history.

However, this apparent similarity is basically a metaphor, because human culture is not a collection of biological objects. In Popperian terms, biology is part of the "world that consists of physical bodies" while culture and linguistics are part of the "world of the products of the human mind". Therefore, if we are drawing an analogy between anthropological studies and biological studies, and using this analogy to justify the use of certain analytical techniques, then we need to understand the analogy thoroughly. Here, I argue that in some important ways the currently used analogy is wrong from the biological perspective, and that this has important consequences for anthropological research.

Analogies

The analogy between anthropology and biology has recently focused on the possible relationship between anthropological entities and genes (eg. Mace & Holden 2005; Tëmkin & Eldredge 2007; Croft 2008; Pagel 2009; Steele et al. 2010; Howe & Windram 2011). However, this seems to be a false analogy, as there is no observable equivalent to a gene in the anthropological world (other than inside any biological organisms being studied). Memes, for example, are not observable objects in the way that genes are. So, the analogy between real replicators in biology (genes) and theoretical replicators in anthropology is inappropriate.

However, biology recognizes a distinction between genotype, which is the collection of genes and other associated material in an organism, and phenotype, which is the product of interactions between genes and also between genes and their environment. The DNA, RNA and proteins in an organism are usually taken to represent the genotype, whereas the cells, tissues and organs constitute the phenotype of an individual. To quote Richard Lewontin (in the Stanford Encyclopedia of Philosophy): "the actual correspondence between genotype and phenotype is a many–many relation in which any given genotype corresponds to many different phenotypes and there are different genotypes corresponding to a given phenotype."

The better analogy between anthropology and biology is thus with the phenotype, not the genotype. Genetic material stores information that allows it to replicate itself, either exactly or with modification, and this is the basis of the distinction between living and non-living objects. Nothing in archaeology or linguistics, for example, possesses these properties, and to form an analogy between anthropological entities and genes is thus potentially misleading. In particular, genetic material is based on standardized fundamental units (the nucleotides and amino acids), which have no simple counterpart in anthropology.

An analogy between anthropological entities and phenotypes is much more reasonable, however. Phenotypic entities, such as cells and organs, seem to have much more in common conceptually with anthropological entities, such as phonemes and words in linguistics and stemmatology. Most importantly, it is the phenotype that takes part in evolutionary processes, not the genotype alone (genes are just part of the "replicator story", as DNA on its own does nothing except denature slowly), and so it is actually the more useful comparison. Indeed, up until the 1990s phenotypes were the basic unit of phylogenetics in biology, and it is only since then that biologists have switched wholesale to genotypes for constructing phylogenies. Anthropologists cannot make this switch, and need to remain "phenotype phylogeneticists" instead.

The important point to note is that evolutionary anthropology is a study of historical relationships rather than specifically "genetic" ones. That is, while cultural transmission is qualitatively different from genetic transmission, that does not invalidate a study of history. Genes are passed directly to offspring whereas culture involves behaviour that is transmitted by social learning; for example, manuscripts are copied by hand, languages are learned by imitating parents, and musical instruments are deliberately designed by professionals. Biological transmission is thus different from anthropological transmission, but both types of transmission produce a history.

Phenotypes have historical relationships just as genotypes do, as is now recognized by the resurgence of interest in evolutionary developmental biology (also known as evo-devo). No analogy with genetics is necessary for evolutionary studies of anthropology. Moreover, not all genetic relationships are necessarily evolutionary (much of population genetics, for example, can be conducted without an evolutionary framework), although it is likely that they will all have a strong evolutionary component. (Note that in anthropology vertical phylogenetic descent is sometimes confusingly referred to as the "genetic relationship", perhaps as a result of Noam Chomsky's work, and phylogenies are sometimes referred to as "classifications".)

Consequences

Since phenotypes evolve, they can be an appropriate unit of study in phylogenetics, and can therefore be an appropriate analogue for the study of cultural histories. The distinction between genotype and phenotype as the appropriate analogy is not a trivial one. In particular, the change of perspective seems to make clearer a number of issues that have been raised concerning the application of phylogenetic methods in anthropology.

Homology

First, it is often difficult to work out the homologies between phenotypic entities from divergent groups, just as it is for anthropological entities. If phylogenetics is a search for shared derived character states, then we need to be comparing the same character states in different groups (ie. comparing like with like based on common ancestry). However, shared derived character states are not conveniently labeled as such on the objects themselves. We thus need to infer homology before we can infer phylogeny (or at least do this simultaneously), and this is often more difficult for phenotypes than for genotypes.

Phenotypic homology sometimes causes confusion even among evolutionary biologists. The basic issue is often which features should be seen as different states of the same character. As a cultural example, Tëmkin & Eldredge (2007) discuss the problem of the valves in a cornet, as "the Périnet valve did not derive from the Stölzel valve but rather was an alternative design solution" (alternative designs are quite common for manufactured objects). Thus, neither can be considered to be the ancestral state of a single character (valve type), even though the Stölzel valve predated the Périnet. Most biologists would solve this "problem" by having two separate characters, so that each valve type is either present or absent, thus effectively having a combined total of four character states. This allows a cornet to have either all Stölzel or all Périnet valves, or a combination of both (which a few instruments do have; Eldredge 2002). A cornet that has neither type of valve is called a post horn, this being the instrument from which the cornet was originally derived.
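The coding described here, with two presence / absence characters instead of one multi-state "valve type" character, can be written out as a small Python sketch. The function name and the wording of the descriptions are hypothetical; the four combinations themselves follow from the text above.

```python
from itertools import product

def describe(has_stoelzel, has_perinet):
    """Describe the instrument implied by two presence/absence characters."""
    if has_stoelzel and has_perinet:
        return "cornet with both valve types"
    if has_stoelzel:
        return "cornet with Stölzel valves only"
    if has_perinet:
        return "cornet with Périnet valves only"
    return "post horn (neither valve type)"

# Two binary characters give four combined states.
combined_states = {(s, p): describe(s, p) for s, p in product((0, 1), repeat=2)}
for coding, label in combined_states.items():
    print(coding, label)
```

With this coding neither valve type has to be treated as the ancestor of the other, since the ancestral condition is simply the absence of both.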

The search for an objective method of determining phenotypic homology has been a long one (Rieppel 2007), and is not by any means resolved; perhaps the most interesting discussion of an objective procedure is that of Jardine (1967). In particular, homoplasy (convergence / parallelism / reversal) is often a phenotypic phenomenon, as the genotype of the organisms concerned is almost always different in some way. That is, phenotypic homoplasy is usually the result of mistaken homology assessment, whereas genotypic homoplasy usually results from the fact that there are so few units of comparison (eg. four nucleotides). It has been suggested that homoplasy may be even more common in anthropology than in biology (Tëmkin & Eldredge 2007). Indeed, in culture it can be difficult even to decide on the units of comparison (eg. phonemes? syllables? words?), which is quite characteristic of phenotypic studies, and the "taxa" often need to be constructed for analysis (eg. tools, customs, etc).

Furthermore, it is likely to be inappropriate to use an analogy with molecular sequence alignment when discussing cultural and linguistic homologies (Covington 1996; Kondrak 2003; Pagel 2009). Computerized algorithms are usually used to align molecular data and thus make decisions about character-state homology, mostly based on overall similarity. However, homology of phenotypic characteristics requires careful comparative studies to determine what are called topological relations (or connectivity) among the character states, often based on ontogenetic development (Rieppel 2007); this is called "special similarity". It might be difficult to use ontogeny as an analogue for cultural development, since ontogeny refers to the sequential expression of genes, but topological relationships have obvious analogues in linguistics; for example, words consist of both primary structure (phonemes) and secondary structure (morphemes) (List 2012).

Reticulation

Second, it is likely that there will be a greater degree of reticulate evolution in archaeological and linguistic studies. This conclusion follows from the differences in barriers to horizontal flow of information — there are both weak and strong barriers in biology but only weak ones in anthropology.

In biology there are both pre-zygotic and post-zygotic barriers to gene flow, which refer to those acting to prevent the formation of a zygote and those acting after zygote formation, respectively. It is the latter that are most effective in creating reproductive isolation between taxa. Pre-zygotic mechanisms, such as geographical isolation (different locations), ecological isolation (different habitats), temporal isolation (different times), mechanical isolation (different physical structures) and ethological isolation (different behaviours), have obvious analogues in anthropological studies, but these barriers are often not completely effective, such as when species that were previously spatially separated encounter each other for the first time. Post-zygotic mechanisms, such as cross-incompatibility (inability of gametes to fuse), hybrid inviability (failure of zygotes to survive), hybrid sterility (failure of zygotes to reproduce) and hybrid breakdown (failure of second generation hybrids to survive), are strictly genetic mechanisms and they have no obvious analogue in anthropological studies. They are usually very effective barriers to gene flow, and indeed are the principal basis of the biological species concept, for example.

The important point to note is that the post-zygotic barriers are directly under genetic control whereas the pre-zygotic barriers are only indirectly genetically controlled (eg. habitat selection might be genetically determined, and if their habitats are different then two species will be reproductively isolated). This means that the post-zygotic barriers are much stronger. It also means that they are not available in the analogy between anthropology and phenotype.

Weak barriers mean that archaeological and linguistic aggregations are likely to form fuzzy clusters rather than clearly defined groups, just as they do for human races (Fuzzy clusters). Fuzzy clusters are not likely to form clear-cut evolutionary lineages, at least as far as vertical descent is concerned (Eldredge 2011).

Thus, because anthropological studies involve only weak barriers to the horizontal flow of information, reticulate evolution is predicted to be more prevalent than it is in biology. That is, the horizontal component of evolution may even be as large as the vertical one (and possibly more important), because there are none of the strong genetic ("post-zygotic") barriers to flow. Indeed, the use of trees as a model for archaeological and linguistic studies has been questioned repeatedly in recent years, on various grounds (eg. Southworth 1964; Hoenigswald 1990; Moore 1994; Dewar 1995; Ben Hamed & Wang 2006; Tëmkin & Eldredge 2007), usually in favor of reticulation models. Moreover, the earliest representations of historical relationships were networks rather than trees (Gallet), even in biology (Buffon, Duchesne), and since then many alternative reticulation metaphors have been developed (Metaphors). This suggests that the focus on trees has been a distraction from the more obvious model of a network in anthropology.

Networks and trees

One point of confusion here seems to be that trees have been treated as representations of temporal relationships while networks have been treated as representations of spatial relationships. Indeed, this seems to be at the heart of the apparent differences of opinion about the two models — the tree advocates are emphasizing time whereas the network advocates are emphasizing space. The practical problem here is that there are currently no quantitative methods for combining the two. Tree-building algorithms in biology do not allow for reticulation, and the common network algorithms (such as neighbor-net, median-joining, reduced median) solely show static relationships, without any sense that the inferred nodes represent ancestors or the edges connecting the nodes represent evolutionary change. In these commonly used algorithms, the nodes are there solely to support the network structure, and the edges solely express the degree of character difference between the nodes.

For phylogenetic trees there is a rationale for treating the tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared derived character states (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference). However, no such rationale exists for most of the current network methods. The network still represents a mathematical summary of the data, but there is no logic for direct inference about biology. It is almost certain that the mathematical summary represents real biological patterns, but there is no necessity that those patterns are evolutionary ones.

The increasing appearance of neighbor-net networks in the linguistic and archaeological literature (eg. Ben Hamed 2005; Bryant et al. 2005; Bowern 2010; Gray et al. 2010; Heggarty et al. 2010; Dediu & Levinson 2012), for example, is thus based on trying to infer temporal patterns from the network display of spatial patterns, even though there is no explicit rationale for being able to do this — the networks may represent history and they may not. Clearly, what we need are quantitative methods that allow the direct inference of both vertical and horizontal evolutionary patterns — that is, we need phylogenetic networks rather than phylogenetic trees. Moreover, these networks need to be based on models of phenotypic variation not genotypic variation (eg. Lewis 2001). Nakhleh et al. (2005), Warnow et al. (2006) and Erdem et al. (2006) are among the few to have tackled this issue in anthropology.

Note that none of the above discussion is meant to contrast a tree model with a network model in a mutually exclusive way. Mathematically, trees form a subset of networks. Therefore, we do not need to choose between the two as the most appropriate model — we can always choose a network model, and the resulting network will be more or less tree-like depending on the data. So, it is not necessary to decide whether anthropological data are more or less tree-like than biological data (Collard et al. 2006), nor should it be necessary to decide whether horizontal transmission invalidates cultural phylogenetic trees (O'Brien et al. 2002; Greenhill et al. 2009; Currie et al. 2010b) — we should simply incorporate any reticulations into the phylogeny rather than decide that they are too small to be worth including.
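The point that trees are a subset of networks can be made concrete with a small sketch. Here I assume that a rooted phylogenetic network is stored as a list of directed (parent, child) edges, and I count the reticulations as the number of "extra" parents across all nodes. The function name and the toy taxa are hypothetical, chosen only for illustration.

```python
from collections import Counter

def reticulation_number(edges):
    """Count reticulations in a rooted network given as (parent, child) edges.

    Every node with more than one parent contributes (number of parents - 1).
    A tree is simply the special case where the result is zero.
    """
    parents_per_node = Counter(child for _parent, child in edges)
    return sum(n - 1 for n in parents_per_node.values() if n > 1)

# A toy tree: every node other than the root has exactly one parent.
tree = [("root", "x"), ("root", "y"),
        ("x", "A"), ("x", "B"),
        ("y", "C"), ("y", "D")]

# A toy network: node "h" has two parents (x and y), so there is one reticulation.
network = [("root", "x"), ("root", "y"),
           ("x", "A"), ("x", "h"),
           ("y", "h"), ("y", "C"),
           ("h", "B")]

print(reticulation_number(tree))
print(reticulation_number(network))
```

Under this representation nothing special needs to be done to accommodate a tree; it is a network whose reticulation number happens to be zero.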

In this sense, many of the recent anthropological papers that are based solely on a tree model seem to be misguided, no matter how sophisticated the mathematics of their analyses may be (Gray & Atkinson 2003; Gray et al. 2009; Currie et al. 2010a; Dunn et al. 2011; Gray et al. 2011; Bouckaert et al. 2012). For example, if a dataset is admittedly affected by horizontal transfer, it is unlikely that any tree-building algorithm will correctly construct the tree-like pattern of vertical descent. Thus, even if our model for evolutionary history is "a tree obscured by vines", we will still find it difficult to reconstruct the tree unless we explicitly move the vines out of the way first. It is for this reason, for example, that in linguistics many studies are based on the Swadesh list of words, which is clearly (and intentionally) biased towards words that have been inherited vertically, with little or no horizontal transfer (eg. Bouckaert et al. note: "the cognate data we use excludes known cases of borrowing"). Under these circumstances, it is hardly surprising that authors so often find their phylogenies to be tree-like, since they are deliberately ignoring the vines! Networks are likely to reveal both the tree and the vines (eg. otherwise hidden lexical borrowing; Nelson-Sathi et al. 2011).

Finally, it is worth mentioning the network methods that have been developed for within-species (ie. population) data, particularly mtDNA sequences. These include those methods related to median networks (eg. median-joining, reduced median), but also include those related to one-step networks (eg. statistical parsimony, minimum-spanning). In many anthropological situations, it is likely that these will be more useful than methods related to phylogenetic trees (see the examples in Barbrook et al. 1998; Forster et al. 1998; Forster & Toth 2003; Spencer et al. 2004; Lipo 2006). Bouckaert et al. (2012) take this analogy even further, by using a phylogeny-based epidemiological model of population spread.

Time consistency

The third consequence of rejecting the genotype analogy is that time consistency is no longer required. Organisms store the information (that is vertically and horizontally transmitted) in genes that they carry with them, which restricts reticulation to occurring only between contemporaries. However, while cultural artefacts clearly display their information, they do not transmit it themselves, and it must instead be interpreted by humans. Furthermore, language and culture store their "information" externally, either in the minds of people or in permanent or semi-permanent records (either written or pictorial).

Thus, in anthropology the information available for horizontal transmission can come from the distant past, as well as from the present — the only direction that cultural information cannot flow is from the future to the past. In this sense, extinction seems to be much rarer in archaeology and linguistics than in biology, because information can be stored indefinitely, rather than disappearing along with the possessing species. I have illustrated time inconsistency twice before, with respect to both computers and computer languages, and Tëmkin & Eldredge (2007) illustrate it with musical instruments.
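The difference between the two situations can be sketched as a pair of simple checks. This is only my own illustration, under the assumption that each donor lineage is summarized by the interval during which it existed; the function names and the dates are hypothetical.

```python
def allowed_between_contemporaries(donor_span, transfer_time):
    """Biological-style constraint: the donor must exist at the time of transfer."""
    first, last = donor_span
    return first <= transfer_time <= last

def allowed_from_stored_record(donor_span, transfer_time):
    """Cultural-style constraint: the donor need only have originated by then.

    Stored information can be copied long after the donor itself has gone,
    but it still cannot flow from the future to the past.
    """
    first, _last = donor_span
    return first <= transfer_time

donor = (1850, 1900)  # hypothetical span during which a design was in production

for when in (1880, 1950, 1800):
    print(when,
          allowed_between_contemporaries(donor, when),
          allowed_from_stored_record(donor, when))
```

A transfer in 1950 fails the first check but passes the second, which is the sense in which a cultural network need not be time consistent.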

Part of the issue here is also that archaeological objects are often not contemporaneous, whereas most biological studies are based on data from contemporary organisms (Lipo 2006). This means that in archaeological phylogenetics the study objects appear at internal nodes in the phylogeny as well as at the tips (the data are diachronic), whereas in biology they occur only at the tips (the internal nodes are hypothetical ancestors). In this case, it may be better to consider an archaeological analogy with the incorporation into the phylogenetic histories of full stratigraphic information from fossils (eg. Sumrall 2005; Tëmkin & Eldredge 2007; Fisher 2008).

Historical anthropology is often concerned with "origins" and putting dates on those origins (Gray et al. 2011), and therefore the study interest is where the analytical uncertainty is greatest, since this is the place where there are fewest data. This is quite different to much of the use of phylogenetic techniques in biology, where the relationships of contemporary organisms are the primary interest. Of particular concern are estimates of rates of divergence, for which there appear to be few mathematical models in archaeology. Small changes in rates can have large effects on estimates of origins and their dates, as can changes of rates along lineages.

Disconnection of phenotype and phylogeny

The fourth consequence is that there is often a lack of association between phylogeny and phenotype. There are examples in the literature of phenotypic changes not being directly associated with the phylogeny. Losos (2011) discusses a number of these within biology, and Tëmkin & Eldredge (2007) discuss a couple of cultural examples. In these cases, it is not possible to reconstruct the evolutionary history from phenotypic data, nor indeed to infer the phenotypes from an hypothesis of evolutionary history, and so phylogenetics does not aid the study of contemporary patterns.

This is particularly relevant when attempting to reconstruct ancestral phenotypes. Because of the difference between cultural transmission (copied from person to person) and biological transmission (genes are passed directly), there is no necessary reason to assume that ancestral states can be reconstructed from a knowledge of phylogenetic history (see the Evolving Thoughts blog). This also applies when trying to reconstruct characteristics from an independent phylogeny, such as reconstructing a cultural history from a linguistic phylogeny (eg. Walker et al. 2012).

Furthermore, it is possible that archaeological and linguistic concepts (eg. cultural artefacts and languages, respectively) do not form integrated wholes, in the way that biological organisms must. That is, anthropological characters (or groups of characters) can often change independently of each other, and this will create a set of independent phylogenetic histories, so that there is no coherent "entity" with a single history. This situation is likely to be worse than the possibly analogous situation with independent gene histories in biology (Tëmkin & Eldredge 2007).

In addition, cultural evolution may occur faster than biological evolution (Perreault 2012), which makes reconstruction of ancient events more difficult. We might also question whether different cultural artefacts and languages each share a single common ancestor — that is, they are potentially polyphyletic rather than monophyletic.

Process analogies

Finally, we can consider possible analogies of anthropological processes with horizontal genotypic processes, such as introgression, hybridization, recombination, horizontal gene transfer (HGT), and genome fusion. These analogies are sometimes invoked in the linguistic and archaeological literature, but this is not necessarily appropriate given the overall analogy with phenotype rather than genotype.

Introgression is usually treated as a process of admixture, where genetic information from one group moves to another via sexual reproduction. Here, an analogy might be appropriate for anthropology, it being the closest analogy to what anthropologists have called "diffusion". However, it is worth noting that biological admixture initially involves the move of an entire copy of the genome, which might be unlikely for cultural phenomena. Hybridization, on the other hand, involves the creation of a new evolutionary lineage, separate from the parental ones but containing one or more copies of the genome of each of those parents. Creole languages might be an example where this analogy is appropriate, since the parental languages are usually clearly identifiable; but otherwise hybridization seems to be a poor analogy, even though it is commonly invoked in the literature.

Recombination also involves sexual reproduction, but usually refers to the mixing of genes before reproduction occurs, so that the offspring do not have a complete set of genes from any one grandparent. This analogy frequently appears in the literature, often as a synonym for the same phenomena that other people call hybridization, but I suspect that introgression would be a better analogy for the topics included. Examples analogous to recombination might be a single manufacturer "providing all permutations and combinations to the marketplace" of their products (eg. Courtois' cornets in the late 1850s; Eldredge 2002), or where "a scribe used more than one copy of a text when making his or her own" (called contamination; Howe & Windram 2011).

HGT refers to non-sexual transfer of genetic material, often small amounts rather than whole genomes. Clearly, word borrowing would be a prime example where this analogy might be appropriate. Genome fusion refers to the non-sexual transfer of whole genomes, and thus has a similar outcome to hybridization, but between distantly related organisms instead.

Conclusion

We need to drop the idea that there is an analogy between anthropological entities and biological genotypes, and recognize that the better analogy is with phenotypes. The analogy with genotypes is not a productive one, and may even be a positively misleading form of "gene envy". If we accept the qualitative analogy with phenotype, then we can also accept the quantitative consequences of this analogy, which include the idea that trees are much more likely to be inadequate models for cultural history than they apparently are in biology.

The mere fact that one can interpret certain cultural phenomena as showing features analogous to those in biology does not mean that the alleged analogy is of any practical use. We need to understand the analogies more thoroughly, in order to decide whether adopting the analogies is the best thing to do. Analogies are only useful tools for research if they direct that research into productive areas, or provide interpretive insights that would otherwise be unavailable. Otherwise, analogy is merely a topic of conversation.

The main advantage of the phylogenetic analogy is that it focuses attention on the important role of unique "accidents" in determining evolutionary history. The main disadvantage seems to be that the processes involved with these accidents are quite different in biology and anthropology, so that the focus is not always fruitful.

References

Barbrook AC, Howe CJ, Blake N, Robinson P (1998) The phylogeny of The Canterbury Tales. Nature 394: 839.

Ben Hamed M (2005) Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China's demic history. Proceedings of the Royal Society of London series B 272: 1015-1022.

Ben Hamed M, Wang F (2006) Stuck in the forest: trees, networks and Chinese dialects. Diachronica 23: 29-60.

Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, Drummond AJ, Gray RD, Suchard MA, Atkinson QD (2012) Mapping the origins and expansion of the Indo-European language family. Science 337: 957-960.

Bowern C (2010) Historical linguistics in Australia: trees, networks and their implications. Philosophical Transactions of the Royal Society of London series B 365: 3845-3854.

Bryant D, Filimon F, Gray RD (2005) Untangling our past: languages, trees, splits and networks. In: Mace et al. (eds), pp. 67-83.

Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.

Covington MA (1996) An algorithm to align words for historical comparison. Computational Linguistics 22: 481-496.

Croft W (2008) Evolutionary linguistics. Annual Review of Anthropology 37: 219-234.

Currie TE, Greenhill SJ, Gray RD, Hasegawa T, Mace R (2010a) The rise and fall of political complexity in island SE Asia and the Pacific. Nature 467: 801-804.

Currie TE, Greenhill SJ, Mace R (2010b) Is horizontal transmission really a problem for phylogenetic comparative methods? A simulation study using continuous cultural traits. Philosophical Transactions of the Royal Society of London series B 365: 3903-3912.

Dediu D, Levinson SC (2012) Abstract profiles of structural stability point to universal tendencies, family-specific factors, and ancient connections between languages. PLoS ONE 7: e45198.

Dewar RE (1995) Of nets and trees: untangling the reticulate and dendritic in Madagascar prehistory. World Archaeology 26: 301-318.

Dunn M, Greenhill SJ, Levinson SC, Gray RD (2011) Evolved structure of language shows lineage-specific trends in word-order "universals". Nature 473: 79-82.

Eldredge N (2002) A brief history of piston-valved cornets. Historic Brass Society Journal 14: 337-390.

Eldredge N (2011) Paleontology and cornets: thoughts on material cultural evolution. Evolution: Education and Outreach 4: 364-373.

Erdem E, Lifschitz V, Ringe D (2006) Temporal phylogenetic networks and logic programming. Theory and Practice of Logic Programming 6: 539-558.

Fisher DC (2008) Stratocladistics: integrating temporal data and character data in phylogenetic inference. Annual Review of Ecology, Evolution and Systematics 39: 365-385.

Forster P, Renfrew C (eds) (2006) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute of Archaeological Research, Cambridge.

Forster P, Toth A (2003) Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Proceedings of the National Academy of Sciences of the USA 100: 9079-9084.

Forster P, Toth A, Bandelt H-J (1998) Evolutionary network analysis of word lists: visualising the relationships between Alpine Romance languages. Journal of Quantitative Linguistics 5: 174-187.

Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Gray RD, Atkinson QD, Greenhill SJ (2011) Language evolution and human history: what a difference a date makes. Philosophical Transactions of the Royal Society of London series B 366: 1090-1100.

Gray RD, Bryant D, Greenhill SJ (2010) On the shape and fabric of human history. Philosophical Transactions of the Royal Society of London series B 365: 3923-3933.

Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323: 479-483.

Greenhill SJ, Currie TE, Gray RD (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London series B 276: 2299-2306.

Heggarty P, Maguire W, McMahon A (2010) Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.

Hoenigswald HM (1990) Does language grow on trees? Ancestry, descent, regularity. Proceedings of the American Philosophical Society 134: 10-18.

Holliday TW (2003) Species concepts, reticulation, and human evolution [with discussion]. Current Anthropology 44: 653-673.

Howe CJ, Windram HF (2011) Phylomemetics — evolutionary analysis beyond the gene. PLoS Biology 9: e1001069.

Jardine N (1967) The concept of homology in biology. British Journal for the Philosophy of Science 18: 125-139.

Kondrak G (2003) Phonetic alignment and similarity. Computers and the Humanities 37: 273-291.

Lewis PO (2001) A likelihood approach to inferring phylogeny from discrete morphological characters. Systematic Biology 50: 913-925.

Lipo CP (2006) The resolution of cultural phylogenies using graphs. In: Lipo et al. (eds), pp. 89-107.

Lipo CP, O’Brien MJ, Collard M, Shennan SJ (eds) (2006) Mapping our Ancestors: Phylogenetic Approaches in Anthropology and Prehistory. AldineTransaction, New Brunswick NJ.

List J-M (2012) Improving phonetic alignment by handling secondary sequence structures. In: Hinrichs E, Jäger G (eds) Computational Approaches to the Study of Dialectal and Typological Variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012.

Losos J (2011) Seeing the forest for the trees: the limitations of phylogenies in comparative biology. American Naturalist 177: 709-727.

Mace R, Holden CJ (2005) A phylogenetic approach to cultural evolution. Trends in Ecology and Evolution 20: 116-121.

Mace R, Holden CJ, Shennan SJ (eds) (2005) The Evolution of Cultural Diversity: a Phylogenetic Approach. UCL Press, London.

Moore JH (1994) Putting anthropology back together again: the ethnogenetic critique of cladistic theory. American Anthropologist 96: 925-948.

Nakhleh L, Ringe DJ, Warnow T (2005) Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420.

Nelson-Sathi S, List J-M, Geisler H, Fangerau H, Gray RD, Martin W, Dagan T (2011) Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society of London series B 278: 1794-1803.

O’Brien MJ, Lyman RL, Darwent JA (2002) Cladistics and archaeological phylogeny. In: Martínez G, Lanata JL (eds) Perspectivas Integradoras entre Arqueología y Evolución. Teoría, Métodos y Casos de Aplicación. INCUAPA–UNC, Olavarría, Argentina, pp. 175-186.

Pagel M (2009) Human language as a culturally transmitted replicator. Nature Reviews Genetics 10: 405-415.

Perreault C (2012) The pace of cultural evolution. PLoS ONE 7: e45150.

Rieppel O (2007) Homology: a philosophical and biological perspective. In: Henke W, Tattersall I (eds) Handbook of Paleoanthropology: Vol I: Principles, Methods and Approaches. Springer-Verlag, Berlin, pp 217-240.

Southworth FC (1964) Family-tree diagrams. Language 40: 557-565.

Spencer M, Wachtel K, Howe CJ (2004) Representing multiple pathways of textual flow in the Greek manuscripts of the Letter of James using reduced median networks. Computers and the Humanities 38: 1-14.

Steele J, Jordan P, Cochrane E (2010) Evolutionary approaches to cultural and linguistic diversity. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.

Sumrall CD (2005) Fossils in phylogenetic reconstruction. In: Encyclopedia of Life Sciences.

Tëmkin I, Eldredge N (2007) Phylogenetics and material cultural evolution. Current Anthropology 48: 146-153.

Walker RS, Wichmann S, Mailund T, Atkisson CJ (2012) Cultural phylogenetics of the Tupi language family in lowland South America. PLoS ONE 7: e35025.

Warnow T, Evans SN, Ringe DA, Nakhleh L (2006) A stochastic model of language evolution that incorporates homoplasy and borrowing. In: Forster & Renfrew (eds), pp. 75-87.

Wednesday, September 12, 2012

Admixture graphs – evolutionary networks for population biology


Current methods for evolutionary networks include: (i) combining trees, clusters or triplets into what is usually called a hybridization network (but could also be a horizontal gene transfer network, HGT), and (ii) decomposing ordered character data into what is called a recombination network (or ancestral recombination graph). Much work on these two approaches has been carried out recently within the bioinformatics community, and this is continuing.

However, the biology community has sometimes taken a different approach. Notably, work has concentrated on constructing models for detecting reticulation events in various types of molecular data, such as comparative genome analysis for HGT, or quantifying inter-population gene flow (eg. due to migration). A network is then manually constructed by adding reticulation branches to a phylogenetic tree of the organisms concerned. Indeed, in many cases the network diagram is not presented explicitly in the publications, but is merely implied from a list of the sources and sinks of the gene flows detected.

The network model for this latter type is thus essentially "a tree obscured by vines", although the network can actually become rather complicated. The basic idea has a long history (Lathrop 1982), although it has only recently become popular. In this blog post I highlight one line of recent work that takes this approach, which involves admixture graphs in population genetics.

Introduction

Historically, population genetics has concentrated on estimating various population parameters from quantitative models of gene history, notably rates of population expansion/contraction, rates of migration, timing of divergence, and presence/absence of bottlenecks. This is rarely done in any graphical way, relying instead on summary statistics. Alternatively, graphical methods such as principal components analysis and agglomerative clustering have been used to summarize the genetic data, and from this summary various scenarios can be deduced post hoc about possible population history (e.g. Skoglund and Jakobsson 2011; Hodoglugil and Mahley 2012).

However, more recently, explicit models of historical gene flow between populations have been developed, usually within the context of generalizing a phylogenetic tree. A tree can be used to represent historical relationships in the absence of significant amounts of gene flow, but not otherwise. So, the general approach has been to use a tree as the null model (representing absence of gene flow), and then to test how many reticulation events are needed to significantly improve the fit of the data to an increasingly complex network. The resulting diagram is called an admixture graph, which thus models both population divergence and gene flow. The reticulations represent the different proportions of genetic mixing between pairs of populations.

A model of population separation and admixture, from Reich et al. (2011) p. 522.
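The role of the mixing proportions in an admixture graph can be illustrated with a minimal sketch. Here I assume that an admixed population's expected allele frequency is simply the weighted average of the frequencies in its source populations, ignoring genetic drift along the branches. The data structure, the population names and the numbers are all hypothetical.

```python
def expected_frequency(pop, freqs, admixture):
    """Expected allele frequency of `pop`, ignoring drift (a simplifying assumption).

    freqs     : allele frequency for each unadmixed population
    admixture : for each admixed population, a list of (source, proportion) pairs
    """
    if pop not in admixture:
        return freqs[pop]
    sources = admixture[pop]
    total = sum(weight for _source, weight in sources)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("mixing proportions for %s must sum to 1" % pop)
    return sum(weight * expected_frequency(source, freqs, admixture)
               for source, weight in sources)

# Hypothetical example: C receives 30% of its ancestry from A and 70% from B.
freqs = {"A": 0.2, "B": 0.6}
admixture = {"C": [("A", 0.3), ("B", 0.7)]}

print(expected_frequency("C", freqs, admixture))
```

The programs discussed below fit such proportions to genome-wide data rather than taking them as given, and they also model drift, so this shows only what a reticulation edge means, not how it is estimated.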

Methods

There are several computer programs that quantify population structure in the presence of admixture between populations, such as the models used in the older programs Structure, BAPS and TESS (see François and Durand 2010), as well as in more recent programs like Admixture (Alexander et al. 2009). However, the most recent programs have been developed specifically to deal with network analysis of genome-wide single nucleotide polymorphism (SNP) data. The populations studied will usually be within a single species, but this need not be so.

The TreeMix program (Pickrell and Pritchard 2012) is described by the authors as follows: "Our goal is to provide a statistical framework for inferring population networks that is motivated by an explicit population genetic model, but sufficiently abstract to be computationally feasible for genome-wide data from many populations (say, 10-100) ... Our approach to this problem is to first build a maximum likelihood tree of populations. We then identify populations that are poor fits to the tree model, and model migration events involving these populations." This process proceeds as for the standard tree-based approach except that the likelihood model also includes migration weights: "Estimation involves two major steps. First, for a given graph topology, we need to find the maximum likelihood branch lengths and migration weights. Second, we need to search the space of possible graphs. [For] a given graph topology, we iterate between optimizing the branch lengths and weights ... [Then,] to search the space of possible graphs, we take a hill-climbing approach."

This method has been used by, for example, Pickrell et al. (2012).

Inferred dog breed admixture graph, from Pickrell and Pritchard (2012).

The AdmixTools program (Patterson et al. 2012), as claimed by the authors, "has some similarities to the TreeMix method but differs in that TreeMix allows users to automatically explore the space of possible models and find the one that best fits the data (while our method does not), while our method provides a rigorous test for whether a proposed model fits the data (while TreeMix does not)." The explicit testing of the fit of data and model is "based on studying patterns of allele frequency correlations across populations. The 3-population test is a formal test of admixture and can provide clear evidence of admixture, even if the gene flow events occurred hundreds of generations ago. The 4-population test ... is also a formal test for admixture, which can not only provide evidence for admixture but also provide some information about the directionality of the gene flow. The F4 ratio estimation allows inference of the mixing proportions of an admixture event".
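The allele-frequency correlations behind the 3-population and 4-population tests can be sketched as follows. This is a naive plug-in calculation of the statistics usually written f3(C; A, B) and f4(A, B; C, D), averaged over loci; it is my own simplification and not the AdmixTools implementation, which as far as I know also corrects for finite sample sizes and assesses significance by resampling. The function names and the toy frequencies are hypothetical.

```python
def f3(c, a, b):
    """Naive f3(C; A, B): mean over loci of (c - a) * (c - b).

    A clearly negative value is evidence that C is a mixture of
    populations related to A and B.
    """
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)

def f4(a, b, c, d):
    """Naive f4(A, B; C, D): mean over loci of (a - b) * (c - d).

    Expected to be zero if A and B are separated from C and D on an
    unadmixed tree.
    """
    return sum((ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)) / len(a)

# Hypothetical allele frequencies at four loci for two source populations.
pop_a = [0.1, 0.8, 0.3, 0.6]
pop_b = [0.7, 0.2, 0.9, 0.1]
# A population built as an equal mixture of A and B (no drift).
pop_c = [0.5 * x + 0.5 * y for x, y in zip(pop_a, pop_b)]

print(f3(pop_c, pop_a, pop_b))  # negative: C looks admixed
print(f3(pop_a, pop_b, pop_c))  # positive: A does not
```

With an exact mixture, (c - a)(c - b) is non-positive at every locus, which is why the sign of the statistic is informative about admixture.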

This approach has been used by Reich et al. (2009, 2011, 2012).

Distinct streams of gene flow from Asia into America, from Reich et al. (2012) p. 372.

These methods have not yet been subjected to any critical evaluation independently of their developers, although various blog authors have been actively investigating them (e.g. these posts by Dienekes Pontikos: 1, 2, 3). The general approach, of adding reticulations to an initial tree, is reminiscent of that taken by the T-Rex program to produce reticulograms, which has been subject to criticisms (Gauthier and Lapointe 2002, 2007; Huson et al. 2011), some of which may apply to the admixture methods as well.

References

Alexander D.H., Novembre J., Lange K. (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19: 1655-1664.

François O., Durand E. (2010) Spatially explicit Bayesian clustering models in population genetics. Molecular Ecology Resources 10: 773-784.

Gauthier O., Lapointe F.-J. (2002) A comparison of alternative methods for detecting reticulation events in phylogenetic analysis. In: Jajuga K., Sokolowski A., Bock H.-H. (eds) Classification, Clustering, and Data Analysis: Recent Advances and Applications, pp. 341-347. Springer, Berlin.

Gauthier O., Lapointe F.-J. (2007) Hybrids and phylogenetics revisited: a statistical test of hybridization using quartets. Systematic Botany 32: 8-15.

Hodoglugil U., Mahley R.W. (2012) Turkish population structure and genetic ancestry reveal relatedness among Eurasian populations. Annals of Human Genetics 76: 128-141.

Huson D.H., Rupp R., Scornavacca C. (2011) Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge.

Lathrop G.M. (1982) Evolutionary trees and admixture: phylogenetic inference when some populations are hybridized. Annals of Human Genetics 46: 245-255.

Patterson N.J., Moorjani P., Luo Y., Mallick S., Rohland N., Zhan Y., Genschoreck T., Webster T., Reich D. (2012) Ancient admixture in human history. Genetics (in press).

Pickrell J.K., Patterson N., Barbieri C., Berthold F., Gerlach L., Lipson M., Loh P.-R., Güldemann T., Kure B., Mpoloka S.W., Nakagawa H., Naumann C., Mountain J.L., Bustamante C.D., Berger B., Henn B.M., Stoneking M., Reich D., Pakendorf B. (2012) The genetic prehistory of southern Africa. Unpublished ms.

Pickrell J.K., Pritchard J.K. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. Unpublished ms.

Reich D., Patterson N., Campbell D., Tandon A., Mazieres S., Ray N., Parra M.V., Rojas W., Duque C., Mesa N., García L.F., Triana O., Blair S., Maestre A., Dib J.C., Bravi C.M., Bailliet G., Corach D., Hünemeier T., Bortolini M.C., Salzano F.M., Petzl-Erler M.L., Acuña-Alonzo V., Aguilar-Salinas C., Canizales-Quinteros S., Tusié-Luna T., Riba L., Rodríguez-Cruz M., Lopez-Alarcón M., Coral-Vazquez R., Canto-Cetina T., Silva-Zolezzi I., Fernandez-Lopez J.C., Contreras A.V., Jimenez-Sanchez G., Gómez-Vázquez M.J., Molina J., Carracedo A., Salas A., Gallo C., Poletti G., Witonsky D.B., Alkorta-Aranburu G., Sukernik R.I., Osipova L., Fedorova S.A., Vasquez R., Villena M., Moreau C., Barrantes R., Pauls D., Excoffier L., Bedoya G., Rothhammer F., Dugoujon J.M., Larrouy G., Klitz W., Labuda D., Kidd J., Kidd K., Di Rienzo A., Freimer N.B., Price A.L., Ruiz-Linares A. (2012) Reconstructing Native American population history. Nature 488: 370-374.

Reich D., Patterson N., Kircher M., Delfin F., Nandineni M.R., Pugach I., Ko A.M., Ko Y.-C., Jinam T.A., Phipps M.E., Saitou N., Wollstein A., Kayser M., Pääbo S., Stoneking M. (2011) Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania. American Journal of Human Genetics 89: 516-528.

Reich D., Thangaraj K., Patterson N., Price A.L., Singh L. (2009) Reconstructing Indian population history. Nature 461: 489-494.

Skoglund P., Jakobsson M. (2011) Archaic human ancestry in East Asia. Proceedings of the National Academy of Sciences of the USA 108: 18301-18306.

Monday, September 3, 2012

The ultimate phylogenetic network?


One concern with the current move from phylogenetic trees to phylogenetic networks is the increased complexity of a reticulating network versus a dichotomous tree. People fundamentally have trouble with interlinked and overlapping structures, and a network is more complex than a tree, just as a tree is more complex than a chain (see this previous blog post).

However, if we restrict ourselves to a two-dimensional representation, then there is a limit to how complex a network can be and yet still be interpretable. The network shown here, published by the anthropologist Franz Weidenreich, comes close to that limit.

Pedigree of the Hominidae, from Weidenreich (1947) p. 201.

This is usually referred to as a "trellis" or "lattice", for obvious reasons. It first appeared in Weidenreich (1946) and then again in Weidenreich (1947); and it has recently been re-published several times (e.g. by Brace 1981; Templeton 2007; Caspari 2008). It is "an attempt to present graphically the relation between the different hominid forms in time and space", expressing Weidenreich's idea that evolution is "transformation, in close connection with inter-breeding".

The labelled circles refer to named fossil species of the Hominidae. According to Weidenreich, the vertical lines represent different stages of human evolution through time, the horizontal lines represent the morphological differentiation between different geographical regions, and the diagonal lines represent patterns of gene flow ("crossing") between the populations. Thus, the trellis emphasizes continuity of descent (and ancestry) through time within geographical regions (vertically), while also emphasizing gene flow between the regional lineages (horizontally and diagonally). In particular, note that in the figure the horizontal and diagonal lines are just as important as the vertical lines — this is not a tree obscured by vines!

Weidenreich viewed humans as being a single polytypic species throughout the Middle and Late Pleistocene, with nearly continuous gene flow during that time. This gene flow was seen as an integral part of the evolution of modern humans, dispersing genes throughout the species, so that any one recent human is likely to have had Pleistocene ancestors from different parts of the planet. This has been called a "polycentric model" of human evolution, also known as the "multi-regional model".

From Howells (1959).

However, racial thinking (as discussed in this previous blog post) has led to tree-like models of human evolution, and so Weidenreich's network model of inter-connected groups was either ignored or mis-interpreted (see Brace 1981; Caspari 2003; Templeton 2007). In particular, the trellis was repeatedly re-drawn as a tree, usually referred to as a candelabrum. This mis-representation started with the work of William White Howells (e.g. Howells 1959), as shown above, which then became the source for most subsequent discussions of the multi-regional model, rather than Weidenreich's original. (Actually, Howells' mis-interpretation of Weidenreich's multi-regional model dates way back to 1942; see Hawks & Wolpoff 2003.)

Interestingly, the trellis metaphor has been revived as a model for recent human evolution, notably by Alan Templeton, as shown in the final two figures.

The trellis model of recent human evolution, from Templeton (1999) p. 636.

A model of recent human evolution, from Templeton (2007) p. 1517.

References

Brace C.L. (1981) Tales of the phylogenetic woods: the evolution and significance of evolutionary trees. American Journal of Physical Anthropology 56: 411-429.

Caspari R. (2003) From types to populations: a century of race, physical anthropology and the American Anthropological Association. American Anthropologist 105: 63-74.

Caspari R. (2008) "Out of Africa" hypothesis. In: Moore J.H. (ed.) Encyclopedia of Race and Racism, Volume 2 G–R, pp. 391-397. Macmillan Reference, Detroit.

Hawks J., Wolpoff M.H. (2003) Sixty years of modern human origins in the American Anthropological Association. American Anthropologist 105: 87-98.

Howells W.W. (1959) Mankind in the Making: the Story of Human Evolution. Doubleday, Garden City, NY.

Templeton A.R. (1999) Human races: a genetic and evolutionary perspective. American Anthropologist 100: 632-650.

Templeton A.R. (2007) Genetics and recent human evolution. Evolution 61: 1507-1519.

Weidenreich F. (1946) Apes, Giants, and Man. University of Chicago Press, Chicago.

Weidenreich F. (1947) Facts and speculations concerning the origin of Homo sapiens. American Anthropologist 49: 187-203.

Wednesday, August 29, 2012

Human races, networks and fuzzy clusters


Evolutionary networks have recently become a hot topic of discussion. However, although networks have rather a long history in some parts of biology (see this previous post), historically it is phylogenetic trees that have dominated in biology, rather than phylogenetic networks. Interestingly, during the first half of the 20th century one research area where networks were to be found somewhat more commonly was anthropology.

Humans have long been considered to have a reticulate evolutionary history, both genetically and culturally (Moore 1994), and anthropologists have, on occasion, therefore used networks as one of their representations of that intra-species history (Brace 1981). This does not mean that trees have not dominated in anthropology also (Caspari 2003), as elsewhere; and the consequences of reticulation for anthropological studies form an ongoing debate (Holliday 2003; Arnold 2009). Interestingly, modern anthropologists are still coming to terms with the genetics of reticulation (see Jolly 2009), having previously been distracted by the Evolutionary Synthesis as well as by fossils (Hawks and Wolpoff 2003).

Some Anthropological Trees and Networks

We can start this brief survey with a tree from Arthur Keith (1915). There is no indication of reticulation at this early stage of the century, and thus the genealogy seems to owe much in principle to Ernst Haeckel's (1868) tree from the previous century.

Keith (1915) Figure 187. Genealogical tree of man's ancestry.

Carleton Stevens Coon (1939) had a polyphyletic view of human origins but also believed in a degree of reticulation, as shown in the next diagram. Note, however, that the most common European race, Mediterranean, does not take part in the reticulation events.

Coon (1939) Figure 30. Schematic representation of White racial history.

Earnest Albert Hooton (1931) was an even stronger believer in the reticulate nature of human microevolution. He commented that the following figure: "represents my idea of the various ways in which human blood streams have intermingled to form the principal races. It is not a family tree, but a sort of arterial trunk with offshoots and connecting vessels."

Hooton (1931) Figure 58. The blood streams of human races.

He modified this figure for the revised edition of his book (Hooton 1946), making it even more complex.

Hooton (1946) Figure 68. Blood streams of human races.

Elsewhere in the same book, Hooton (1946) produced this next diagram, which expresses a more phylogenetic idea. Indeed, it comes very close to the modern idea of a tree obscured by vines.

Hooton (1946) page 413.

Finally, we can consider a modern anthropological network, based on polymorphic genetic markers. This one is from Campbell and Tishkoff (2010), in which they note: "decreasing intensity of color represents the concomitant loss of genetic diversity as populations migrated in an eastward direction from Africa. Solid horizontal lines indicate gene-flow between ancestral human populations and the dashed horizontal line indicates recent gene-flow between Asian and Australian/Melanesian populations."

Campbell and Tishkoff (2010) Figure 2. The Recent African Origin model of modern humans and population substructure in Africa.

Discussion

This whole approach to the analysis of human history presupposes that races exist as more-or-less distinct lineages, which is an idea that not all anthropologists support. Genomically, humans seem to form what might be called fuzzy clusters, rather than discrete groups with sharp boundaries (Novembre et al. 2008; Lao et al. 2008). Inter-breeding is predominantly within the clusters, due to geographical and social isolation, with relatively little inter-breeding between the clusters. This creates a situation where gene-based distinctions between "races" seem to be obvious to casual observers but where more detailed analysis reveals considerable complexity. This results from the evolutionary history being a network, not a tree.

So, this raises a point that anthropologists have been struggling with for some time, and which all network biologists need to address at some stage: Are distinct evolutionary lineages worth recognizing when there is extensive reticulation in a network? From the analysis point of view, the recognition of races is a model, and all models are wrong (because they are simplifications of the real world). However, some models are more useful than others. So, the question can be re-phrased as: Is the recognition of distinct evolutionary lineages a worthwhile model for interpreting a reticulated network? After all, the lineages may not form nested phylogenetic clusters, which is historically the basic criterion for recognizing them.

Domesticated organisms provide other classic examples of genealogical reticulation. We recognize dog breeds, for instance, and we even have an official register of breeds at the Fédération Cynologique Internationale. However, dog breeds form fuzzy clusters rather than discrete groups, with many individual dogs being cross-breeds. In spite of this, a model of fuzzy clusters formed by a reticulate evolutionary history is still considered to be useful by dog breeders and owners. A similar thing can be said about the breeds of horses, cats and cows; and, indeed, also for almost all human-associated species (see Arnold 2009).

In the non-domesticated part of biodiversity, systematists recognize subspecies, which often refer to morphologically distinguishable populations occupying geographically separated areas, but which are not otherwise genetically isolated. These subspecies can also form fuzzy clusters as a result of a reticulate evolutionary history, especially for plants. Once again, this is apparently a useful model, although there is no universal criterion for how much morphological difference it takes to delimit a subspecies.

I have noted before (see this blog post) that using a tree model for the evolutionary history of dog breeds is inappropriate, because of the reticulate inter-breeding. However, the question here goes further than this, and asks about what should be the units of analysis in the first place. If it is the dog breeds, then we are effectively excluding cross-bred dogs from the evolutionary history, unless they themselves form a new breed that is subsequently recognized.

This issue has profound consequences for our view of possible human races. Most of the networks shown above use races as the units of analysis. Modern evolutionary diagrams of human ancestry, on the other hand, are more likely to be based on genetic data from individual people (as shown in the last figure), which does not pre-suppose the existence of races. Races (if they exist) are then an outcome of the analysis, rather than an input. This distinction has been of particular importance for anthropology, where for most of the past century it has been assumed that discrete races exist and can be fitted into a non-reticulating phylogenetic tree (Caspari 2003; Arnold 2009). Even the very language of naming races creates a supposition that those races are "real", and so care is needed.

Historically, studies of race and human evolution have been inextricably linked. One problem with the current discussions about race is the confusion over whether races are sociological constructions or biological ones (Tattersall and DeSalle 2011; Krimsky and Sloan 2011). My point here is that, either way, they are a model of fuzzy clusters formed by a reticulate evolutionary history, at best, rather than being discrete groups. They have clearly been misused in sociology (racism), but are they a useful model in biology (racialism)?

References

Arnold M.L. (2009) Reticulate Evolution and Humans: Origins and Ecology. Oxford University Press, New York.

Brace C.L. (1981) Tales of the phylogenetic woods: the evolution and significance of evolutionary trees. American Journal of Physical Anthropology 56: 411-429.

Campbell M.C., Tishkoff S.A. (2010) The evolution of human genetic and phenotypic variation in Africa. Current Biology 20: R166-R173.

Caspari R. (2003) From types to populations: a century of race, physical anthropology and the American Anthropological Association. American Anthropologist 105: 63-74.

Coon C.S. (1939) The Races of Europe. Macmillan, New York.

Haeckel E. (1868) Natürliche Schöpfungsgeschichte. G. Reimer, Berlin.

Hawks J., Wolpoff M.H. (2003) Sixty years of modern human origins in the American Anthropological Association. American Anthropologist 105: 87-98.

Holliday T.W. (2003) Species concepts, reticulation, and human evolution [with discussion]. Current Anthropology 44: 653-673.

Hooton E.A. (1931) Up From the Ape. Macmillan, New York.

Hooton E.A. (1946) Up From the Ape, 2nd edition. Macmillan, New York.

Jolly C.J. (2009) Mixed signals: reticulation in human and primate evolution. Evolutionary Anthropology 18: 275-281.

Keith A. (1915) The Antiquity of Man. Williams & Norgate, London.

Krimsky S., Sloan K. (editors) (2011) Race and the Genetic Revolution: Science, Myth, and Culture. Columbia University Press, New York.

Lao O., Lu T.T., Nothnagel M., Junge O., Freitag-Wolf S., Caliebe A., Balascakova M., Bertranpetit J., Bindoff L.A., Comas D., Holmlund G., Kouvatsi A., Macek M., Mollet I., Parson W., Palo J., Ploski R., Sajantila A., Tagliabracci A., Gether U., Werge T., Rivadeneira F., Hofman A., Uitterlinden A.G., Gieger C., Wichmann H.-E., Rüther A., Schreiber S., Becker C., Nürnberg P., Nelson M.R., Krawczak M., Kayser M. (2008) Correlation between genetic and geographic structure in Europe. Current Biology 18: 1241-1248.

Moore J.H. (1994) Putting anthropology back together again: the ethnogenetic critique of cladistic theory. American Anthropologist 96: 925-948.

Novembre J., Johnson T., Bryc K., Kutalik Z., Boyko A.R., Auton A., Indap A., King K.S., Bergmann S., Nelson M.R., Stephens M., Bustamante C.D. (2008) Genes mirror geography within Europe. Nature 456: 98-101.

Tattersall I., DeSalle R. (2011) Race? Debunking a Scientific Myth. Texas A&M University Press, College Station, TX.