The Genealogical World of Phylogenetic Networks: edge lengths


Wednesday, September 24, 2014

Splits and neighborhoods in splits graphs

I have written before about How to interpret splits graphs. However, it is worth emphasizing a few points, so that people don't keep Mis-interpreting splits graphs.

A splits graph can potentially represent two main types of pattern. First, like a clustering analysis, it represents groups in the data that are in some way similar. Each group is represented by an explicit split in the graph (see Recognizing groups in splits graphs). The clusters may be hierarchically arranged (each group nested within another group), and they may overlap, so that objects can simultaneously be a member of more than one group. If the clusters do not overlap then the graph will be a tree.

Second, like an ordination analysis, a splits graph can summarize the multi-dimensional neighborhoods of the different objects. That is, the relative distance between the points on the graph summarizes the relationships among the objects: objects that are closer together, as measured along the edges of the graph, are more similar.

These two patterns often appear in the same splits graph. Unfortunately, many published papers mis-interpret neighborhoods as splits. If there is an explicit split representing a cluster of interest, then the data can be said to support that possible cluster. However, if no such split exists, then the graph is agnostic with respect to that cluster — there might be no support for it in the data, or the split might be left out of the graph because other splits out-weigh it. So, graph objects occupying a particular neighborhood might not be well-supported by the original data, contrary to the interpretation sometimes seen in the literature.

This can be illustrated with a specific example, taken from: Sicoli MA, Holton G (2014) Linguistic phylogenies support back-migration from Beringia to Asia. PLOS One 9: e91722.

The splits graph is a consensus network, summarizing all of the splits with at least 10% support in 3000 MCMC Bayesian trees. The authors note that the dashed line represents a "primary division" between the groups, and that the differently colored objects represent "clear groupings".

However, the dashed line is supported only by a small split, which is contradicted by a larger split (one that puts the North PCA group with the Plains-Apachean group). This split thus cannot be said to be well supported. Furthermore, the South Alaska grouping is not supported by any single split shown in the graph (there are, however, two splits that combine uniquely to support it). That is, the South Alaska grouping represents a neighborhood rather than a supported cluster. Finally, the Alaska-Canada-1 grouping is also not supported by an uncontradicted split (i.e. the tcb, taa and tau samples could as easily be part of the West Alaska grouping). All of the other identified groups are supported by unique and uncontradicted splits.

So, there are three types of pattern in this splits graph with respect to the groups of interest to the authors: uncontradicted splits, contradicted splits, and neighborhoods, representing good support, medium support and agnosticism, respectively. It is important to recognize these three possibilities, and to interpret them correctly with respect to "support" for any conclusions.
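The three cases can be told apart mechanically once the splits are written down. The following is a minimal sketch, not the procedure used for the paper's figure: the function names `compatible` and `classify_group` and the six-taxon toy data are hypothetical. It relies only on the standard definition that two splits are compatible when at least one of the four intersections of their sides is empty.

```python
def compatible(a, b, taxa):
    """Two splits A|A' and B|B' are compatible if at least one of the
    four intersections A&B, A&B', A'&B, A'&B' is empty."""
    a, b = frozenset(a), frozenset(b)
    ca, cb = taxa - a, taxa - b
    return any(not (x & y) for x in (a, ca) for y in (b, cb))


def classify_group(group, splits, taxa):
    """Label a group of interest by how the displayed splits treat it."""
    group = frozenset(group)
    # A split supports the group if one of its two sides is exactly the group.
    matches = [s for s in splits if frozenset(s) in (group, taxa - group)]
    if not matches:
        return "neighborhood"          # graph is agnostic about this group
    rivals = [s for s in splits if not compatible(s, matches[0], taxa)]
    return "contradicted split" if rivals else "uncontradicted split"


# Toy data (invented for illustration): each split is given by one side.
taxa = frozenset("abcdef")
splits = [{"a", "b"}, {"b", "c"}, {"e", "f"}]

for g in ({"e", "f"}, {"a", "b"}, {"c", "d"}):
    print(sorted(g), "->", classify_group(g, splits, taxa))
```

Here {e, f} has its own split with no rival, {a, b} has a split that conflicts with {b, c}, and {c, d} has no split at all, so it can only ever be a neighborhood.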

As an aside, I will point out that in the other splits graph in the same paper (a NeighborNet): the dashed line is not supported by any split, two of the colored groupings are not supported by any split, and two of the others have only a small contradicted split. Thus, the "primary division" and the "clear groupings" mostly represent neighborhoods, and are thus only dubiously supported.

Wednesday, August 22, 2012

How to interpret splits graphs

Splits graphs are produced by distance-based network methods such as NeighborNet and Split Decomposition, by character-based methods such as Median Networks and Parsimony Splits, and by tree-based methods such as Consensus Networks and SuperNetworks. They are all interpreted in the same way, which is discussed here.

An essential point to understand is that splits graphs are separation networks. That is, the edges in the graph represent separation between two clusters of nodes in the network; or, they split the graph in two. Formally, each edge represents a bipartition (or split) of the taxa based on one or more characteristics.

If there is no conflict in the data then each bipartition is represented by a single edge, and if there are contradictory patterns then each bipartition is represented by a set of parallel edges. The edge lengths represent the relative amount of support in the whole dataset for each of the splits.

Example

As a simple example, I will use some data about opinion polls prior to a few Australian elections. There are data for nine election years: 1972, 1974, 1975, 1977, 1980, 1983, 1984, 1987, 1990. The data are for the actual winning margin as a result of the election, as well as data for various opinion polls predicting the outcome prior to the election: (i) McNair Survey, (ii) Roy Morgan Research, (iii) Saulwick Poll, and (iv) Other = pooling of Australian National Opinion Polls (data for 6 years), Spectrum (3 years), Newspoll (2 years), Levita (1 year).

I have calculated the Euclidean Distances between the results for the different opinion polls. So, the original data have been reduced to a set of distances between pairs of opinion polls; and it is these distances that are to be displayed by the network.
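As a rough sketch of this step, the distance between two sources is the square root of the summed squared differences across election years. The numbers below are invented for illustration (three made-up years and two made-up polls, not the real nine-year data), and the sketch assumes every source has a value for every year, which may not hold for the pooled "Other" category.

```python
import math
from itertools import combinations

# Hypothetical winning margins (percentage points), one value per election year.
margins = {
    "Actual": [2.0, -1.0, 4.0],
    "PollA":  [3.0,  1.0, 4.0],
    "PollB":  [2.0, -1.0, 1.0],
}


def pairwise_euclidean(table):
    """Euclidean distance between every pair of rows in the table."""
    return {
        (a, b): math.dist(table[a], table[b])
        for a, b in combinations(table, 2)
    }


distances = pairwise_euclidean(margins)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

The resulting dictionary of pairwise distances is the kind of input that a distance-based method such as NeighborNet works from.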

This is a simple dataset, and so the analyses based on Split Decomposition and NeighborNet turn out to be identical. The resulting network looks like this:

In this case, the network manages to represent all of the distances perfectly. That is, the Fit=100%. This is an improvement over trying to represent the data as a tree. For example, the Neighbor Joining tree for these data has Fit=92%, so that 8% of the information cannot be represented in the tree.

The network has five informative splits (bipartitions), each represented by a different set of parallel edges. The remaining five splits are simply shown as the single edges leading to each of the five sources of data. The informative bipartitions are (in order of decreasing support):

1. {Actual, Morgan, Other} | {McNair, Saulwick}
2. {Actual, Morgan, Saulwick} | {McNair, Other}
3. {Actual, Morgan} | {McNair, Other, Saulwick}
4. {Actual, Other} | {McNair, Morgan, Saulwick}
5. {Actual, McNair, Other} | {Morgan, Saulwick}

These bipartitions are each highlighted below, with red representing one partition and blue the other. The weight of each split is also shown, which represents the amount of support there is in the data. This also determines the relative lengths of the edges (greater weight = more support = longer edge).

We can now start to reach some conclusions about the relative success of the opinion polls. For example, note that the three best-supported partitions (bipartitions 1, 2 and 3) associate Actual (the election result) with Morgan (the outcome predicted by Roy Morgan Research). We can thus conclude that this opinion poll has most in common with the election outcome, and thus that it was the most "successful" of the four opinion polls (over the elections from 1972 to 1990).

As noted, the edge lengths in the network represent the relative amount of support in the whole dataset for each of the splits. In this example, because Fit=100% the edge lengths along the shortest paths sum to exactly the original Euclidean Distances in the dataset, which will not always be so for other datasets. For example, the shortest distance from Actual to each of the four opinion polls is the sum of these edge lengths:

Morgan 2.89 = 1.5833+0.4861+0.1875+0.6319
Other 3.89 = 1.5833+0.4931+0.6493+1.1632
McNair 4.25 = 1.5833+0.4931+0.6493+0.4861+0.8125+0.2257
Saulwick 4.88 = 1.5833+0.4861+0.1875+0.4931+0.8125+1.3125
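These sums can be reproduced by adding up the weights of every split that separates the two sources. The sketch below is an illustration only: the pairing of each weight with a particular split is inferred from the four sums listed above rather than read from the figure, and the names `SPLITS` and `split_distance` are hypothetical.

```python
# Each split is given by one of its two sides, together with its weight.
SPLITS = [
    # trivial splits (single edges leading to each source)
    ({"Actual"}, 1.5833),
    ({"Morgan"}, 0.6319),
    ({"Other"}, 1.1632),
    ({"McNair"}, 0.2257),
    ({"Saulwick"}, 1.3125),
    # informative splits (sets of parallel edges)
    ({"McNair", "Saulwick"}, 0.8125),
    ({"McNair", "Other"}, 0.6493),
    ({"Actual", "Morgan"}, 0.4931),
    ({"Actual", "Other"}, 0.4861),
    ({"Morgan", "Saulwick"}, 0.1875),
]


def split_distance(x, y, splits=SPLITS):
    """Sum the weights of the splits that put x and y on opposite sides."""
    return sum(w for side, w in splits if (x in side) != (y in side))


for poll in ("Morgan", "Other", "McNair", "Saulwick"):
    print(poll, split_distance("Actual", poll))
```

Because a shortest path crosses each separating split exactly once, this sum does not depend on the order in which the edges are taken.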

The calculation of the shortest distance from Actual to Saulwick is highlighted in this figure:

Note that there are several shortest paths from Actual to Saulwick — we can take the edges in any order we like so long as we cross each split only once. To go from Actual to Saulwick we have to cross four of the five informative splits, plus two of the other five splits.

Also worth noting is that the pathlengths in the Neighbor Joining tree do not sum to the Euclidean Distances. This is because the Fit<100%. For example, the pathlength from Actual to Saulwick is 4.74 = 1.8733+0.5278+0.6840+1.6539, so that 4.88–4.74 = 0.14 of the distance has been left out.

The pathlengths can also be used to evaluate the relative success of the opinion polls. That is, the network pathlength distance from Actual to Morgan is the shortest, which we can interpret as indicating that Roy Morgan Research was the most "successful" of the four opinion polls. That is, its predictions were the "least different" from the actual election results, across all of the elections.

Finally, there are features of the data that cannot be displayed in the network. The network is a summary only, and not all of the information can be summarized in a line graph! Perhaps the most notable missing information is that the McNair Survey was the only opinion poll to predict any of the election results exactly correctly, which it managed to do twice (in 1974 and 1983).

If you are interested in learning more about splits graphs, then you can check out the Primer of Phylogenetic Networks web page.

Tuesday, May 8, 2012

A fundamental limitation of hybridization networks? (2)

This is a follow-up to an earlier post, which showed an example of two phylogenetic trees and three rooted phylogenetic networks. You can see them again in the figure below.

Each of the networks N1, N2 and N3 displays the two trees T1 and T2 (and no other trees). Thus, it is impossible to decide which of the three networks is correct. The question was asked whether this is a fundamental limitation of rooted phylogenetic networks (a.k.a. hybridization networks).

In my opinion, the answer is "no".

Let's first draw the networks such that each reticulation is an instantaneous event between two coexisting taxa. To do so, networks N2 and N3 need an additional taxon x, which could be an extinct taxon or just a taxon that has not been sampled.

I've specified a length for each edge of each network and have given corresponding edge lengths to the trees. The values of the edge lengths in the networks have been chosen rather arbitrarily, and are not important for the discussion below.

What is important is that, when you take the edge lengths into account, it is easy to decide which of the three networks should be chosen. N1 should be chosen if the roots of T1 and T2 have the same age, N2 should be chosen if the root of T1 is older and N3 if the root of T2 is older. The reason is the following. In network N1, the roots of T1 and T2 both coincide with the root of the network. This contrasts with network N2, where the root of T2 is a proper descendant of the root of T1 and with network N3, in which the root of T1 is a proper descendant of the root of T2.
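The decision rule can be written down in a few lines. This is only a sketch under stated assumptions: edge lengths measure time, all leaves are contemporaneous (so any root-to-leaf path gives the root age), and the tree encoding, the example edge lengths, and the names `root_age` and `choose_network` are all hypothetical rather than taken from the figure.

```python
def root_age(tree):
    """Age of the root: the length of a path from the root down to a leaf.
    A tree is either a leaf name (str) or a list of (child, edge_length)
    pairs. Assumes an ultrametric tree, so any path gives the same answer."""
    if isinstance(tree, str):
        return 0.0
    child, length = tree[0]
    return length + root_age(child)


def choose_network(t1, t2, tol=1e-9):
    """Pick N1, N2 or N3 by comparing the root ages of T1 and T2."""
    age1, age2 = root_age(t1), root_age(t2)
    if abs(age1 - age2) <= tol:
        return "N1"                      # both roots coincide with the network root
    return "N2" if age1 > age2 else "N3"


# Invented three-taxon trees with made-up edge lengths.
T1 = [([("a", 1.0), ("b", 1.0)], 2.0), ("c", 3.0)]   # root age 3.0
T2 = [("a", 2.0), ([("b", 1.0), ("c", 1.0)], 1.0)]   # root age 2.0

print(choose_network(T1, T2))
```

With these made-up lengths the root of T1 is older, so the rule selects N2.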

We can conclude that the above example shows an important challenge but not a fundamental limitation of rooted phylogenetic networks. When taking edge lengths into account, it is indeed possible to uniquely reconstruct the network (at least in this case).