Wednesday, September 24, 2014

Splits and neighborhoods in splits graphs

I have written before about How to interpret splits graphs. However, it is worth emphasizing a few points, so that people don't keep mis-interpreting splits graphs.

A splits graph can potentially represent two main types of pattern. First, like a clustering analysis, it represents groups in the data that are in some way similar. Each group is represented by an explicit split in the graph (see Recognizing groups in splits graphs). The clusters may be hierarchically arranged (each group nested within another group), and they may overlap, so that objects can simultaneously be a member of more than one group. If the clusters do not overlap then the graph will be a tree.
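The difference between nested and overlapping clusters can be made concrete with a small sketch. It assumes that "overlap" here means the usual pairwise incompatibility of splits: two splits are compatible when at least one of the four intersections of their sides is empty. The taxon names and the function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def compatible(side_a, side_b, taxa):
    """Return True if two splits are compatible.

    Each split is given by one of its two sides; the other side is the
    complement within `taxa`. The splits are compatible when at least one
    of the four side-by-side intersections is empty.
    """
    a1, b1 = frozenset(side_a), frozenset(side_b)
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def is_tree_like(splits, taxa):
    """A set of splits fits on a tree when every pair is compatible."""
    return all(compatible(s, t, taxa) for s, t in combinations(splits, 2))


taxa = frozenset("ABCDE")                  # hypothetical objects
nested = [{"A", "B"}, {"A", "B", "C"}]     # one group inside another
overlapping = [{"A", "B"}, {"B", "C"}]     # B belongs to both groups

print(is_tree_like(nested, taxa))       # True
print(is_tree_like(overlapping, taxa))  # False
```

The nested groups can be drawn as a tree, whereas the overlapping pair needs a box (a pair of parallel edge sets) in the splits graph.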

Second, like an ordination analysis, a splits graph can summarize the multi-dimensional neighborhoods of the different objects. That is, the relative distance between the points on the graph summarizes the relationships among the objects: closer objects, as measured along the edges of the graph, are more similar.

These two patterns often appear in the same splits graph. Unfortunately, many published papers mis-interpret neighborhoods as splits. If there is an explicit split representing a cluster of interest, then the data can be said to support that possible cluster. However, if no such split exists, then the graph is agnostic with respect to that cluster — there might be no support for it in the data, or the split might be left out of the graph because other splits out-weigh it. So, graph objects occupying a particular neighborhood might not be well-supported by the original data, contrary to the interpretation sometimes seen in the literature.

This can be illustrated with a specific example, taken from: Sicoli MA, Holton G (2014) Linguistic phylogenies support back-migration from Beringia to Asia. PLOS One 9: e91722.

The splits graph is a consensus network, summarizing all of the splits with at least 10% support in 3000 Bayesian MCMC trees. The authors note that the dashed line represents a "primary division" between the groups, and that the differently colored objects represent "clear groupings".
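As a rough sketch of how the splits in a consensus network are selected, the following counts how often each split occurs across a sample of trees and keeps those at or above a threshold. It is not the authors' procedure or any particular program's implementation; each tree is simply assumed to be given as a collection of splits, and the toy sample of 20 trees is invented for illustration.

```python
from collections import Counter


def canonical(side, taxa):
    """Name a split by the side that does not contain a fixed reference taxon,
    so that A|BCDE and BCDE|A are counted as the same split."""
    side = frozenset(side)
    reference = min(taxa)
    return taxa - side if reference in side else side


def consensus_splits(trees, taxa, threshold=0.10):
    """Return {split: frequency} for splits found in at least `threshold`
    of the trees. Each tree is an iterable of split sides."""
    counts = Counter()
    for tree in trees:
        # A set ensures each split is counted at most once per tree.
        counts.update({canonical(s, taxa) for s in tree})
    n = len(trees)
    return {s: c / n for s, c in counts.items() if c / n >= threshold}


taxa = frozenset("ABCDE")                    # hypothetical objects
tree_1 = [{"A", "B"}, {"D", "E"}]            # ((A,B),C,(D,E))
tree_2 = [{"A", "B"}, {"C", "D"}]            # ((A,B),E,(C,D))
tree_3 = [{"B", "C"}, {"D", "E"}]            # ((B,C),A,(D,E))
trees = [tree_1] * 15 + [tree_2] * 4 + [tree_3] * 1

kept = consensus_splits(trees, taxa, threshold=0.10)
for split, freq in sorted(kept.items(), key=lambda kv: -kv[1]):
    print(sorted(split), round(freq, 2))
```

In this toy sample the BC split occurs in only 5% of the trees and is dropped, while the CD and DE splits are both kept even though they contradict each other. That is exactly why a consensus network can show a supported split alongside a larger contradictory one.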

However, the dashed line is supported only by a small split, which has a larger contradictory split (that puts the North PCA group with the Plains-Apachean group). This split thus cannot be said to be well supported. Furthermore, the South Alaska grouping is not supported by any split shown in the graph (there are, however, two splits that combine uniquely to support it). That is, the South Alaska grouping represents a neighborhood rather than a supported cluster. Finally, the Alaska-Canada-1 grouping is also not supported by an uncontradicted split (i.e. the tcb, taa and tau samples could as easily be part of the West Alaska grouping). All of the other identified groups are supported by unique and uncontradicted splits.

So, there are three types of pattern in this splits graph with respect to the groups of interest to the authors: uncontradicted splits, contradicted splits, and neighborhoods, representing good support, medium support and agnosticism, respectively. It is important to recognize these three possibilities, and to interpret them correctly with respect to "support" for any conclusions.
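The three cases can be written down as a simple check on a group of interest, given the splits displayed in the graph. This is only a sketch under simplifying assumptions: splits are given as one side each, edge weights are ignored (so a small contradicting split counts the same as a large one), and all of the names are hypothetical.

```python
def compatible(side_a, side_b, taxa):
    """Two splits are compatible if at least one of the four
    intersections of their sides is empty."""
    a1, b1 = frozenset(side_a), frozenset(side_b)
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def classify_group(group, graph_splits, taxa):
    """Classify a group as an uncontradicted split, a contradicted split,
    or a neighborhood, relative to the splits shown in the graph."""
    group = frozenset(group)
    sides = [frozenset(s) for s in graph_splits]

    def is_group_split(side):
        # The split may be listed by either of its two sides.
        return side == group or taxa - side == group

    if not any(is_group_split(s) for s in sides):
        return "neighborhood"          # graph is agnostic about the group
    others = [s for s in sides if not is_group_split(s)]
    if any(not compatible(group, s, taxa) for s in others):
        return "contradicted split"    # medium support
    return "uncontradicted split"      # good support


taxa = frozenset("ABCDEF")                          # hypothetical objects
graph_splits = [{"A", "B"}, {"C", "D"}, {"D", "E"}]

print(classify_group({"A", "B"}, graph_splits, taxa))  # uncontradicted split
print(classify_group({"C", "D"}, graph_splits, taxa))  # contradicted split
print(classify_group({"E", "F"}, graph_splits, taxa))  # neighborhood
```

A fuller version would compare the weights of the competing splits, since a split contradicted by a larger one deserves less confidence than one contradicted by a smaller one.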

As an aside, I will point out that in the other splits graph in the same paper (a NeighborNet): the dashed line is not supported by any split, two of the colored groupings are not supported by any split, and two of the others have only a small contradicted split. Thus, the "primary division" and the "clear groupings" mostly represent neighborhoods, and are thus only dubiously supported.
