The Genealogical World of Phylogenetic Networks: Ordination

Showing posts with label Ordination. Show all posts

Tuesday, May 3, 2016

Continued misuse of PCA in genomics studies

A few years ago I discussed some well-known methodological artifacts that can arise with the use of Principal Components Analysis (PCA) ordinations, and noted that these problems seem to be widespread in genomics studies (Distortions and artifacts in Principal Components Analysis analysis of genome data). This problem involves a spurious second axis in the output graph that is a merely curvilinear function of the first axis (rather than being an indication of important new information).

[Note: if you would like a description of PCA, try this blog post by Lior Pachter: What is principal component analysis?]

This distortion problem has been long known in research fields such as ecology, where it is referred to as the Arch Effect (or the Horseshoe Effect, or the Guttman Effect). It has previously been pointed out as a problem for genomic data when they form a clinal geographical pattern, although clearly the problem can involve much more than just geographical patterns.

The issue I previously raised was that the problem was being ignored by practitioners, which can lead to serious mis-interpretation of the data analysis. Here I note that this issue continues, apparently unabated.

For example, the following paper recently appeared:

Benjamin Vernot, et al. (2016) Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science 352: 235-239.

This paper contains the following pair of PCA ordinations, illustrating genomic variation among a sample of 159 geographically diverse humans. In both cases, the second axis (vertically) is clearly nothing more than a curved function of the first axis (horizontally).

The simplest interpretation of these diagrams is that there is a 1-dimensional spatial pattern (ie. a geographic gradient) that is being distorted into 2 dimensions. For example in Figure B from left to right, the geographic gradient proceeds from East to West to South.

Gil McVean (2009. A genealogical interpretation of principal components analysis. PLoS Genetics 5: e1000686) identifies a few other limitations of PCA, including distortions produced by greatly unequal sample sizes among groups (such as populations).

Lest you think that all PCA diagrams are faulty, I should point out that when there are two or more patterns then PCA can work quite well — it is only when there is a single pattern that a 2-dimensional diagram will be distorted. Consider this diagram, from Pille Hallast, et al. (2016) Great ape Y chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal. Genome Research 26: 427-439:

There are four labelled groups here, and the first PCA axis separates PTV from the other three groups, while the second axis separates PTE from the other three, without distortion. [Any separation of PTS from PTT is presumably on the third axis, which is not shown.]

Finally, the paper by Vernot et al. (with which I started) does also contain a diagram that is more interesting for this blog. It is a manually constructed network illustrating the multiple inter-breeding events that the authors infer between Neandertals, Denisovans and various human geographical groups (as named in the first figure).

For an explanation of the inter-breeding events, see Nature News for 17 Feb 2016:

Evidence mounts for interbreeding bonanza in ancient human species.

Wednesday, February 19, 2014

Multivariate data displays are not always necessary

Over the past two years I have published a number of posts in which I have used a data-display network as a multivariate data summary, comparable to an ordination (eg. PCA) or a cluster analysis (eg. UPGMA). This is a form of exploratory data analysis.

Here, I wish to point out that a multivariate data summary is not always necessary, even when the data are multivariate in form.

As an example, I will use the official census data on retail book sales in the USA. The monthly data are provided by the United States Census Bureau for the years 1992-2013 at:
http://www.census.gov/retail/mrts/www/data/excel/mrtssales92-present.xls.
The data include census code 4512, which covers "Book Stores, General", "Specialty Book Stores" and "College Book Stores". The data notes say: "Estimates are shown in millions of dollars, and are based on data from the Monthly Retail Trade Survey, Annual Retail Trade Survey, and administrative records." I downloaded the data on 17 February 2014.

These data are multivariate. For example, if each year is taken as a sample object, then there are data for 12 variables for each sample (one for each month). Any multivariate data analysis can therefore be applied to this dataset.

In the usual manner, I have used the manhattan distance and a neighbor-net network. Years that are closely connected in the network are similar to each other based on the 12 monthly sales figures, and those that are further apart are progressively more different from each other.

However, all that the data show is a gradient clockwise from the top. That is, sales rose from 1992, reached a peak in 2007, and then declined again. That is, the data form a simple time series, and all that is actually needed is to plot them that way.

So, this same pattern could be displayed more simply by graphing the yearly averages, as shown in the next graph. A network is complete over-kill in this case. I presume that the recent decrease in retail book sales has something to do with the rise of e-book sales.

Finally, we could also plot the monthly sales, while we are at it. The peaks in late summer and at Christmas as very distinct. Presumably people are buying books to read in summer, and to give away at Christmas.

Finally, note that not all time series can be plotted in a simple manner. If the time patterns are complex, then a multivariate analysis, such as a network, will probably be of some use as a data display.

Wednesday, January 15, 2014

Pacific rock art - ordinations and networks

I have written previously about the use of phylogenetic networks as multivariate data displays instead of traditional techniques such as ordination (Networks can outperform PCA ordinations in phylogenetic analysis), and I have provided many empirical examples where I have used networks as heuristics to explore multivariate data. Here, I present a direct comparison between an ordination and a network.

Meredith Wilson (2004. Rethinking regional analyses of western Pacific rock-art. Records of the Australian Museum, Supplement 29: 173-186) has provided an interesting example of a difficult multivartiate dataset. She collated data (published and unpublished) concerning rock-art motifs (ie. engravings and paintings) for various locations in the western Pacific Ocean, including Papua - New Guinea, Solomon Islands, Fiji, Tonga, Micronesia and New Caledonia.

There were data for 614 figurative and non-figurative motifs, from 103 rock-art sites. However, this dataset is problematic to analyze because it contains a high proportion of unique motifs. This means that there is very little information about the art relationships among the sites, and the data summary is therefore uninformative. Wilson solved this problem by performing a series of analyses in which either (i) the motifs were aggregated into classes, or (ii) the sites were aggregated into geographical regions. Both strategies seemed to produce informative data summaries. In all cases, the data summaries were provided by ordinations, although some of these analyses used Multi-Dimensional Scaling while others used Correspondence Analysis.

For one of her analyses (the one discussed here) she kept the 614 individual motifs but aggregated the sites into 10 regions, and then analyzed the data matrix using Correspondence Analysis. The data consisted of counts of the motifs in each of the regions, and the multivariate distances among the regions were measured using the chi-square metric.

Wilson presented two ordination diagrams, with different subsets of the data, as shown above. Two diagrams were necessary because several of the regions were super-imposed in the initial analysis, and so a second analysis was performed with some of the regions omitted, to explore the patterns among the super-imposed regions. As with all ordinations diagrams, it is the proximity of the regions in the graph that expresses their multivariate similarity — nearby regions in the graph have similar rock-art motifs.

For comparison, I have used a NeighborNet network, based on the same chi-square metric distances, as shown below. As usual, regions that are linked by short connections in the network are similar based on their rock-art motifs.

This diagram suffers from the same problem as do the ordination diagrams — it is difficult to see the relationships among the regions. In this case it is because the network structure occupies only a small part of the diagram, with very long terminal edges, indicating that most of the motifs are unique to particular regions. This issue can be dealt with by instead drawing the diagram with uninformative edge lengths (ie. they are all the same length), as shown in the next figure.

The relationships among the regions are now clear in the network. These relationships are the same as shown in the two ordination diagrams, except for Bougainville. The network associates the Bougainville rock art with that of East New Britain, whereas the ordination analyses associate it with the art of Northwest Guadalcanal and Morobe. I have been unable to explain this discrepancy.

This dataset is a difficult one to deal with, because the data matrix is sparse. This necessitates multiple ordination analyses, or a network with uninformative branch lengths. The network may be the simpler data summary in this case, because it requires only one analysis and graph, rather than the two separate analyses needed for the ordination. However, the analysis highlights the fact that any dataset that presents difficulties for one multivariate summary technique is likely to be difficult for all of them.

Wednesday, December 19, 2012

Networks can outperform PCA ordinations in phylogenetic analysis

In a previous blog post I noted that there are potentially Distortions and artifacts in Principal Components Analysis analysis of genome data, in which the Principal Components Analysis (PCA) ordination axes are not independent of each other. This happens when the samples form one or more monotonic gradients, in which case higher-dimensional axes of the ordination plot are curved and twisted relative to the lower axes. The so-called Arch Effect is the best-known example of this problem. The danger is that the ordination diagrams will be interpreted as showing biological patterns when these artifactual patterns are not in the original datasets.

This mathematical artifact can be dealt with by: (i) mathematically correcting the distortion; (ii) using a Non-metric Multidimensional Scaling (NMDS) ordination instead; or (iii) using a phylogenetic network to display the multidimensional data, rather than using an ordination. Here, I illustrate this third possibility.

Network analysis of a 1-dimensional gradient

In the previous blog post I used the following small dataset as an example of a genetic dataset with a 1-dimensional pattern of relationships among the taxa. Here the taxa form a series of overlapping groups, defined by the pattern of Cs.

The diagonal dataset.

In the previous post I also illustrated the PCA ordination of this dataset, which shows the classic Arch Effect, rather than the expected linear arrangement of samples in the scatter plot. Below is a table showing whether analyses using two different types of data-display network can effectively display all of the taxon groups for datasets of this type but with varying numbers of taxa. All analyses were performed using the SplitsTree program.

As you can see, the NeighborNet analysis successfully displays all of the relevant bipartitions in all of the datasets. On the other hand, the related Neighbor-Joining tree only succeeds up to a taxon size of six. Six taxa can be represented by a simple linear tree, which can display three non-trivial bipartitions plus six trivial partitions, which is all that is needed when each taxon consists of five Cs. (Note that the NeighborNet and the Median network are identical to this tree, as well, for six taxa or less.)

The next two figures show the NeighborNet graphs for taxon sizes of 10 and 20, respectively, as examples.

NeighborNet of the diagonal dataset for n=10.

NeighborNet for the diagonal dataset for n=20.

The table also shows that the Median network analysis is only successful for small numbers of taxa. For greater numbers of taxa, some of the bipartitions are not displayed, as listed in the table. For example, for a taxon size of 10, the split {2,3,4,5,6}{1,7,8,9,10} is not displayed, as illustrated in the next figure.

Median network for the diagonal dataset for n=10.

For even larger numbers of taxa, the Median network superimposes some of the taxa, even though they have distinct sequences, and thus it cannot display all of the bipartitions, by definition. For example, for a taxon size of 20, taxon_9 and taxon_10 are superimposed, as shown in the next figure.

Median network for the diagonal dataset for n=20.

Network analysis of a 2-dimensional gradient

In the previous blog post I used the following dataset as an example of a genetic dataset with a 2-dimensional pattern of relationships among the taxa. Here the taxa form two series of overlapping groups, one defined by the pattern of Cs and the other by the pattern of Ts.

The grid dataset.

In the previous post I also illustrated the PCA ordination of this dataset, which shows a complicated 3-dimensional version of the 2-dimensional pattern. On the other hand, the Median network perfectly reproduces the original 2-dimensional grid arrangement of the taxa, as shown next, so that the network diagram is exactly as expected.

Median network for the grid dataset.

Sadly, the NeighborNet analysis does not fare quite so well, as it produces a much more complicated arrangement of the taxa. This arrangement does approximately represent the original grid-like pattern contained in the data, but it is presented in an unnecessarily convoluted way.

NeighborNet for the grid dataset.

Conclusion

It appears that data-display networks can be used to display multidimensional information, in the same manner as an ordination analysis. These networks try to display all of the relevant bipartitions in the data, rather than simply displaying a scatter plot of the samples.

However, there are considerable differences in the ability of the various network analyses to display the patterns inherent in the original data. In particular, NeighborNet seems to be very effective for displaying 1-dimensional patterns, while Median networks are more effective for 2-dimensional patterns.

Finally, it is worth noting that David Bryant has already thought of comparing NeighborNet with a Multi-dimensional Scaling (MDS) ordination, using an empirical dataset (2006, Radiation and network breaking in Polynesian linguistics. In: Forster P, Renfrew C (eds) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute Press, University of Cambridge, pp 111-118). He found that they approximately agreed with each other.

Wednesday, December 12, 2012

Distortions and artifacts in Principal Components Analysis for analysis of genome data

It is perfectly clear from examples in the genomics literature that there is a problem occurring with the use of Principal Components Analysis (PCA) ordinations, and that it is apparently going undetected by users. This problem involves a spurious second axis in the output graph that is a curvilinear function of the first axis. The danger is that the ordination diagrams will be interpreted as showing biological patterns (see François et al. 2010; Arenas et al. 2013), when these patterns are not in the original datasets. I illustrate this problem here using some very simple datasets, and point to some examples of it in the literature. I then suggest some possible solutions, including the use of phylogenetic networks.

Introduction

Genomic data are multivariate, in the sense that we have multiple characters (the nucleotide sequences) for multiple organisms. "Ordination" is the term used when we arrange the multivariate data in a rational order, which is then usually displayed as a scatter plot, with the points representing the organisms. The spatial relationships of the points on the graph then indicate how similar the organisms are to each other (based on their genomes) — close points are more similar to each other than are distant points, as shown by the example in the first figure.

Salmela et al. (2011), Figure 2d.
Each dot represents the genome of one Swedish person,
coloured according to their geographical location within northern Sweden.

There are many mathematical techniques for ordination, although they fall into two main groups: (i) eigenanalysis methods, and (ii) non-metric multidimensional scaling methods. These methods have a long history in some parts of biology, for example in ecology (Gauch 1982; Jongman et al. 1987; Kent & Coker 1992; Leps & Smilauer; Legendre & Legendre 2012), although they are virtually unknown in other parts. (Note: the eigenanalysis ordinations are sometimes confusingly called metric multidimensional scaling ordinations.)

The eigenanalysis method called Principal Components Analysis (PCA) was introduced by Patterson et al. (2006) to the study of genomic data, having previously been introduced into genetics by ~~Menozzi et al. (1978)~~Cavalli-Sforza and Edwards (1964), and it has become quite popular in the literature since then. The basic idea is to summarize the data by calculating the correlation coefficient between each pair of organisms, and then to display as much as possible of the common pattern of variation along the first ordered axis of the graph, and as much of the remainder along the second axis, and so on. The different axes are expected to show independent patterns of genetic relationship among the organisms, with the first axis showing the most important pattern, the second axis the second-most important pattern, and so on.

However, ecologists have long known that the first two components of a PCA (and related eigenanalysis techniques such as correspondence analysis) often show an arching pattern, which is a mathematical artifact of the eigenanalysis rather than a real biological pattern in the original data (Gauch et al. 1977). This is most apparent when the sampling units cover a long ecological gradient and those samples at each end of the gradient have few species in common (Minchin 1987; see also this blog post). This is called the Arch Effect (or the Horseshoe Effect, or the Guttman Effect), in which the second axis is curved and twisted relative to the first axis, and thus the axes are not independent of each other.

The problem illustrated

Basically, the distortion involves the data pattern being displayed using one more axis than is needed for that data pattern. That is, a 1-dimensional pattern is displayed using 2 dimensions, instead, and a 2-dimensional pattern is displayed using 3 dimensions, etc. I will first illustrate this with a simple 1-dimensional pattern.

This first dataset consists of 20 nucleotide sequences. Each sequence consists of only two nucleotides (19 A and 5 C), and the pattern of Cs forms a simple gradient (ie. there is only one pattern in the data). This gradient can be displayed using only one ordination axis, which would arrange the taxa in order from Taxon_1 to Taxon_20. However, note that the patterns at the two ends of the gradient have nothing in common as far as the pattern of Cs is concerned.

The gradient dataset.

The next figure is the Principal Components Analysis (PCA) ordination diagram of these data. Note that the 1-dimensional pattern is displayed using 2 dimensions, arranging the 20 taxa in an arch rather than in a straight line. (Note that the open part of the arch can face either up or down; in this case it happens to face up.) So, it appears in the scatter plot that Taxon_1 and Taxon_20 are rather similar to each other (ie. they are near each other), which is not true. Furthermore, the dots are not equally spaced, which they should be given the pattern in the original data. The second dimension, along with the unequal spacing of the points, is a mathematical artifact of the eignenanalysis rather than a pattern in the data. It is caused by the unimodal distribution of the pattern of Cs along the gradient.

PCA ordination of the gradient data.
Not all of the taxa are labelled.

The second dataset consists of 24 nucleotide sequences. Each sequence consists of three nucleotides (8 A, 5 C and 4 T), and the pattern of Cs forms one gradient while the pattern of Ts forms a second gradient (orthogonal to the first one). These two independent patterns can be displayed using two ordination axes, which would arrange the taxa in a 6x4 rectangular grid. (If this is not clear, look at the bottom two figures of this post.)

The grid dataset.

This next figure is the PCA ordination diagram of these data. Note that the 2-dimensional pattern is displayed using 3 dimensions, only two of which are actually shown in the diagram. (It is standard to show only the first two axes, on the assumption that these contain most of the important information.) These axes arrange the 24 taxa in an arch in the first two dimensions, rather than in a straight line, and then separate pairs of taxa along the third axis. So, in two dimensions pairs of taxa appear to sit on top of one another, indicating a similarity that is not present in the original dataset. Once again, the use of the third dimension is an artifact of the eignenanalysis rather than a pattern in the data. It is caused by the unimodal distribution of the patterns of Cs and Ts along the two gradients.

PCA ordination of the grid data.

Clearly, when the samples form gradients then eigenanalysis is not an appropriate ordination technique, although it is very useful under many other circumstances. This problem with gradients has previously been pointed out for genomic data when they form a clinal geographical pattern (Novembre & Stephens 2008; Reich et al. 2008), but clearly the problem involves much more than just geographical patterns. Also, the problem is being ignored by users — most of the examples below were published after the paper by Novembre & Stephens.

Literature examples

The Arch Effect has appeared repeatedly in the genomics literature (often produced using the SmartPCA, Plink or Eigenstrat computer programs), and some examples are illustrated here. Note that the axes are not always in the same orientation, but the arch should be found on axis 2 relative to axis 1, and it can be oriented either up or down.

These first two examples come from the very paper that first introduced PCA to genomics studies.

Patterson et al. (2006), Figure 5.

Patterson et al. (2006), Figure 8.

These next two examples are from studies of genomic data from Scandinavia.

Salmela et al. (2011), Figure 2c.

Skoglund et al. (2012), Figure 1c.
Note that axis 1 is shown vertically.

The next two examples also come from worldwide genome studies. Note that the second of these refers to the ordination as metric multidimensional scaling (MDS).

Stoneking & Krause (2011), Box 1a.

Jakobsson et al. (2008), Figure 1d.

The next two examples also come from a worldwide genome study, but in this case the relationship of the Arch Effect to the geographical sampling is made explicit.

Wang et al. (2012), Figure 1b.

Wang et al. (2012), Figure 5b.

The next example illustrates rotation of the ordination axes, so that the arch is now aligned along the two axes.

Novembre and Ramachandran (2011), Figure 4e.

Finally, the Gene Expression blog has an example, where the ordination is called a metric multidimensional scaling (MDS) ordination. This is based on ~3,000 individuals and 250,000 SNPs.

Possible solutions

There are three possible solutions to the Arch Effect problem in eigenanalysis ordinations.

First, one can try to mathematically correct the distortion. This approach was actively pursued in ecology via what is known as detrending and rescaling (Hill & Gauch 1980), in which the axes are forced to be independent of each other after the eigenanalysis. This approach works, in the sense that the Arch Effect disappears, but it has the drawback that all relationships between the axes are removed, including those that require more than one dimension to be displayed (Palmer 1993). So, while this method is still commonly encountered in the ecological literature, it is not uniformly recommended.

This approach is illustrated in the next figure, which is a Detrended Correspondence Analysis (DCA, with detrending by second-order polynomials) ordination analysis of the second of my simple datasets above. Note that the original 2-dimensional grid arrangement of the taxa is perfectly preserved, so that the ordination diagram is exactly as expected.

DCA ordination of the grid data.

The second, and most obvious, solution to the problem is not to use an eigenanalysis ordination in the first place. That is, not all ordination techniques produce the same result. In this case we could use a Non-metric Multidimensional Scaling (NMDS) ordination, instead. The latter is a very different type of technique, mathematically, which explicitly tries to arrange the points in the ordination plot so that their relative distances reflect as closely as possible the pairwise "distances" in the original dataset (here, a distance is the inverse of a similarity). This technique is known to avoid the Arch Effect (Minchin 1987).

This is illustrated in the next figure, which is a NMDS ordination analysis of the second dataset. Note that the original 2-dimensional grid arrangement of the taxa is almost perfectly preserved. The main disadvantages of NMDS are the computational time required, the difficulty of finding the optimal solution for large datasets (eigenanalysis is significantly faster; Zheng et al. 2012), and the fact that the number of ordination axes needs to be specified beforehand.

NMDS ordination of the grid data.
Not all of the taxa are labelled.

The third solution to the problem is to use a phylogenetic network to display the multidimensional data, rather than using an ordination. Conceptually, this is similar to the NMDS type of ordination, but it uses a line graph to display the results rather than using a scatter plot. I have illustrated this in the next blog post (Networks can outperform PCA ordinations in phylogenetic analysis), where I report that this solution works quite well, too.

Conclusion

Every time you see an arch in a PCA plot, you might very well be looking at a mathematical artifact. You should bear this in mind when you interpret the ordination pattern, especially if you wish to draw biological conclusions from the diagram. You will, of course, see plenty of PCA ordinations without any evidence of an arch, and these are probably safe to interpret naively.

PCA can be useful when we expect genomes to be monotonically related to one another, but this is likely to be rather rare in nature. It is much more likely that genomes will have nonlinear relationships, such as unimodal patterns, with a peak at an intermediate part of any gradients that are present. Under these circumstances it produces artifacts, which may be mistakenly interpreted as representing real biological patterns. Under these circumstances, an alternative method of multivariate analysis should be used.

More information about ordination techniques can be found at the Ordination Methods web page.

Note: updated Friday December 21 2012.

References

Arenas M, François O, Currat M, Ray N, Excoffier L (2013) Influence of admixture and paleolithic range contractions on current European diversity gradients. Molecular Biology and Evolution 30: 57-61.

Cavalli-Sforza L, Edwards AWF (1964) Analysis of human evolution. Proceedings of the 11th International Congress of Genetics, The Hague, 1963. Genetics Today 3: 923-933.

François O, Currat M, Ray N, Han E, Excoffier L, Novembre J (2010) Principal component analysis under population genetic models of range expansion and admixture. Molecular Biology and Evolution 27: 1257-1268.

Hill MO, Gauch HG (1980) Detrended Correspondence Analysis: an improved ordination technique. Vegetatio 42: 47-58.

Gauch HG (1982) Multivariate Analysis in Community Structure. Cambridge University Press, Cambridge.

Gauch HG, Whittaker RH, Wentworth TR (1977) A comparative study of reciprocal averaging and other ordination techniques. Journal of Ecology 65: 157-174.

Jongman RHG, ter Braak CJF, van Tongeren OFR (1987) Data Analysis in Community and Landscape Ecology. Pudoc, Wageningen.

Kent M, Coker P (1992) Vegetation Description and Analysis: a Practical Approach. Belhaven Press, London.

Legendre P, Legendre L (2012) Numerical Ecology, 3rd edition. Elsevier, Amsterdam.

Leps J, Smilauer P (2003) Multivariate Analysis of Ecological Data using Canoco. Cambridge University Press, Cambridge.

Menozzi P, Piazza A, Cavalli-Sforza L (1978) Synthetic maps of human gene frequencies in Europeans. Science 201: 786-792.

Minchin PR (1987) An evaluation of relative robustness of techniques for ecological ordinations. Vegetatio 69: 89-107.

Novembre J, Ramachandran S (2011) Perspectives on human population structure at the cusp of the sequencing era. Annual Review of Genomics and Human Genetics 12: 245-74.

Novembre J, Stephens M (2008) Interpreting principal component analyses of spatial population genetic variation. Nature Genetics 40: 646-649.

Palmer MW (1993) Putting things in even better order: the advantages of Canonical Correspondence Analysis. Ecology 74: 2215-2230.

Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genetics 2: e190.

Reich D, Price AL, Patterson N (2008) Principal component analysis of genetic data. Nature Genetics 40: 491-492.

Salmela E, Lappalainen T, Liu J, Sistonen P, Andersen PM, Schreiber S, Savontaus ML, Czene K, Lahermo P, Hall P, Kere J (2011) Swedish population substructure revealed by genome-wide single nucleotide polymorphism data. PLoS One 6: e16747.

Skoglund P, Malmström H, Raghavan M, Storå J, Hall P, Willerslev E, Gilbert MT, Götherström A, Jakobsson M (2012) Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe. Science 336: 466-469.

Stoneking M, Krause J (2011) Learning about human population history from ancient and modern genomes. Nature Reviews Genetics 12: 603-614.

Wang C, Zöllner S, Rosenberg NA (2012) A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genetics 8: e1002886.

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012) A high-performance computing toolset for relatedness and Principal Component Analysis of SNP data. Bioinformatics 28: 3326-3328.