The Genealogical World of Phylogenetic Networks: Continued misuse of PCA in genomics studies

Tuesday, May 3, 2016

A few years ago I discussed some well-known methodological artifacts that can arise with the use of Principal Components Analysis (PCA) ordinations, and noted that these problems seem to be widespread in genomics studies (Distortions and artifacts in Principal Components Analysis analysis of genome data). The problem involves a spurious second axis in the output graph that is merely a curvilinear function of the first axis, rather than an indication of important new information.

[Note: if you would like a description of PCA, try this blog post by Lior Pachter: What is principal component analysis?]

This distortion problem has long been known in research fields such as ecology, where it is referred to as the Arch Effect (or the Horseshoe Effect, or the Guttman Effect). It has previously been pointed out as a problem for genomic data when they form a clinal geographical pattern, although the problem can clearly arise from any single dominant gradient, not just a geographical one.

The issue I previously raised was that the problem was being ignored by practitioners, which can lead to serious misinterpretation of the data analysis. Here I note that this issue continues, apparently unabated.

For example, the following paper recently appeared:

Benjamin Vernot, et al. (2016) Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science 352: 235-239.

This paper contains the following pair of PCA ordinations, illustrating genomic variation among a sample of 159 geographically diverse humans. In both cases, the second (vertical) axis is clearly nothing more than a curved function of the first (horizontal) axis.

The simplest interpretation of these diagrams is that there is a 1-dimensional spatial pattern (i.e. a geographic gradient) that is being distorted into 2 dimensions. For example, in Figure B, from left to right, the geographic gradient proceeds from East to West to South.
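To make the arch concrete, here is a minimal Python sketch of how a single gradient can produce a curved second axis. It is a toy model, not the data or method of Vernot et al.: the samples sit at evenly spaced positions along one hypothetical gradient, and each feature is assumed to respond to position with a Gaussian curve (the classic ecological set-up). The function names, the feature count and the response width are all my own assumptions. The sketch runs an ordinary PCA and then checks how well a quadratic function of axis 1 predicts axis 2.

```python
import numpy as np

def simulate_gradient(n_samples=100, n_features=300, width=0.3):
    """Toy data: samples along ONE gradient, Gaussian feature responses (an assumption)."""
    x = np.linspace(0.0, 1.0, n_samples)             # sample positions on the gradient
    centers = np.linspace(-1.0, 2.0, n_features)     # feature optima, extending past both ends
    data = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))
    return x, data

def pca_scores(data, n_components=2):
    """Sample scores on the leading principal components (column-centred SVD)."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def quadratic_r2(pc1, pc2):
    """Share of the variance in pc2 explained by a quadratic function of pc1."""
    fitted = np.polyval(np.polyfit(pc1, pc2, deg=2), pc1)
    ss_res = np.sum((pc2 - fitted) ** 2)
    ss_tot = np.sum((pc2 - pc2.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

x, data = simulate_gradient()
scores = pca_scores(data)
pc1, pc2 = scores[:, 0], scores[:, 1]

gradient_corr = abs(np.corrcoef(x, pc1)[0, 1])   # does axis 1 track the gradient?
linear_corr = abs(np.corrcoef(pc1, pc2)[0, 1])   # zero by construction in any PCA
arch_r2 = quadratic_r2(pc1, pc2)                 # is axis 2 a curve in axis 1?

print(f"|corr(gradient, PC1)|  = {gradient_corr:.3f}")
print(f"|corr(PC1, PC2)|       = {linear_corr:.3f}")
print(f"quadratic R^2, PC2~PC1 = {arch_r2:.3f}")
```

The point of the two correlation figures is that the axes are always linearly uncorrelated, so a lack of linear correlation says nothing about independence. If the quadratic fit explains most of the second axis, that axis is a candidate for being an artifact of the first rather than a separate pattern.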

Gil McVean (2009. A genealogical interpretation of principal components analysis. PLoS Genetics 5: e1000686) identifies a few other limitations of PCA, including distortions produced by greatly unequal sample sizes among groups (such as populations).

Lest you think that all PCA diagrams are faulty, I should point out that when there are two or more independent patterns in the data then PCA can work quite well. It is only when there is a single pattern that a 2-dimensional diagram will be distorted. Consider this diagram, from Pille Hallast, et al. (2016) Great ape Y chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal. Genome Research 26: 427-439:

There are four labelled groups here, and the first PCA axis separates PTV from the other three groups, while the second axis separates PTE from the other three, without distortion. [Any separation of PTS from PTT is presumably on the third axis, which is not shown.]
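For contrast, here is a similar hedged sketch for the case of two independent patterns. Again it is a toy model of my own, not the great ape data: the samples lie on a hypothetical 2-dimensional grid (two gradients of somewhat different length), with the same assumed Gaussian feature responses. Here one would expect the second axis to follow the second gradient, so that a quadratic function of the first axis explains almost none of it.

```python
import numpy as np

def response_features(coords, extents, n_centers=15, width=0.3):
    """Gaussian feature responses over a 2-D sampling area (toy assumption)."""
    axes = [np.linspace(-1.0, ext + 1.0, n_centers) for ext in extents]
    cx, cy = np.meshgrid(axes[0], axes[1], indexing="ij")
    centers = np.column_stack([cx.ravel(), cy.ravel()])
    sq_dist = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def pca_scores(data, n_components=2):
    """Sample scores on the leading principal components (column-centred SVD)."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def quadratic_r2(pc1, pc2):
    """Share of the variance in pc2 explained by a quadratic function of pc1."""
    fitted = np.polyval(np.polyfit(pc1, pc2, deg=2), pc1)
    ss_res = np.sum((pc2 - fitted) ** 2)
    ss_tot = np.sum((pc2 - pc2.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Two independent gradients: a 12 x 12 grid of samples, the second gradient
# being slightly shorter so that the two axes have distinct variances.
extents = (1.0, 0.8)
gx, gy = np.meshgrid(np.linspace(0.0, extents[0], 12),
                     np.linspace(0.0, extents[1], 12), indexing="ij")
coords = np.column_stack([gx.ravel(), gy.ravel()])

scores = pca_scores(response_features(coords, extents))
pc1, pc2 = scores[:, 0], scores[:, 1]

corr_first = abs(np.corrcoef(coords[:, 0], pc1)[0, 1])    # axis 1 vs first gradient
corr_second = abs(np.corrcoef(coords[:, 1], pc2)[0, 1])   # axis 2 vs second gradient
arch_r2 = quadratic_r2(pc1, pc2)

print(f"|corr(gradient 1, PC1)| = {corr_first:.3f}")
print(f"|corr(gradient 2, PC2)| = {corr_second:.3f}")
print(f"quadratic R^2, PC2~PC1  = {arch_r2:.3f}")
```

A low quadratic R^2 alongside a strong match between the second axis and the second gradient is what a genuine second pattern would look like in this toy setting. Note that this check is only meaningful for reasonably continuous data; with just three tight clusters, a quadratic curve can pass through all of them regardless.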

Finally, the paper by Vernot et al. (with which I started) does also contain a diagram that is more interesting for this blog. It is a manually constructed network illustrating the multiple inter-breeding events that the authors infer between Neandertals, Denisovans and various human geographical groups (as named in the first figure).