Monday, December 17, 2018

Using phylogenetic networks to prove new results about trees

By Steven Kelk and Simone Linz.

Many readers of this blog will be aware that phylogenetic tree space is often traversed using topological modification moves such as SPR (Subtree Prune and Regraft) and TBR (Tree Bisection and Reconnection). In a nutshell these moves allow us to step from one phylogenetic tree to another, with a view to finding “good” phylogenetic trees. A natural question which arises is this: what is the minimum number of SPR or TBR moves required to turn one tree into another? Being able to compute these values – known as the SPR distance and the TBR distance respectively – gives us some feeling about how long it will take, deterministically or stochastically, to move from one part of tree space to another. Unfortunately, it is NP-hard (i.e. formally intractable) to compute SPR or TBR distances.

Game over? No because NP-hardness is never a reason to give up! In 2001 Allen and Steel [1] showed the following. Suppose you have two trees, T1 and T2, both on n taxa, and they have TBR distance k. After applying common subtree and common chain reduction rules, you obtain two trees – also with TBR distance k – which have at most 28k taxa. The striking thing here is that n vanishes from the analysis. Hence, if k is small, then so too are the reduced trees, and computing the TBR distance of these two reduced trees becomes less time-consuming. This process is known as kernelization.

This is where the networks come in. In a recent pre-print [2] we have shown that the situation is actually even better than Allen and Steel calculated: the reduced trees will have at most 15k-9 taxa (and in fact, for the subtree and chain reduction rules, this is the best you can do). Perhaps somewhat counter-intuitively, the argument leverages the phylogenetic network literature. While it is quite common to use distances between trees to help construct networks, it is less common to use networks as an analytical instrument to prove new results about trees. We thus consider our new result as a somewhat striking example of the relevance of phylogenetic networks.

The high-level idea is as follows. A recent publication [3] proved that if two trees T1 and T2 have TBR distance k, then a simplest unrooted phylogenetic network that embeds both T1 and T2 will have reticulation number k. The reticulation number of an unrooted phylogenetic network is basically equal to the number of edges you have to delete to turn it into an unrooted phylogenetic tree. Due to this equivalence the problem of computing the TBR distance of two trees can be transformed into that of constructing an unrooted phylogenetic network. See the figure below. The blue tree and the green tree (which have TBR distance 2) can be obtained from the unrooted phylogenetic network on the left (which has reticulation number 2), by cutting at the blue and green breakpoints in the network (respectively).

Why is this important? We know that, after collapsing common pendant subtrees, unrooted phylogenetic networks can be obtained by “decorating” a given backbone topology with taxa. This backbone topology (known as a generator) has roughly 3k edges where chains of taxa can be added to the network, where k is the TBR distance. The critical fact is that, if you add more than 9 taxa to one of these edges, then – however you extract the two embedded trees from the network – you will obtain two trees with a common chain (of length 4 or more, which is the threshold at which common chains are reduced). So, under the assumption that all common chains have been reduced, you can add at most 9 taxa to each edge.

The figure above allows us to obtain some intuition about this. In the network, there are two breakpoints interrupting the sequence of taxa {1,2,3,4,5,6,7,8,9}, one for the green tree and one for the blue tree. The interaction of these two breakpoints induces three (not necessarily maximal) chains that are common to both trees: on taxa sets {1,2,3,4}, {5,6,7} and {8,9} respectively. Under the common chain reduction rule, chain {1,2,3,4} would have already been collapsed (since the reduction rule collapses common chains of length 4 or more) – so in fact a situation in which two breakpoints are placed along the sequence {1,2,3,4,5,6,7,8,9}, as shown in the figure, cannot happen if we assume that the reduction rules had already been applied to exhaustion. You could try to fix this by shifting the blue and green breakpoint one place anticlockwise, to give common chains {1,2,3}, {4,5,6} and {7,8,9}. Sometimes such shifts will work, sometimes they will not. However, if there had been 10 taxa here, rather than 9, you can never avoid creating a common chain of size 4 or more, no matter where you place the two breakpoints. This is simply because, however you partition the set {1,2,3,4,5,6,7,8,9,10} into three contiguous intervals, at least one of the intervals will have size 4 (or more). This common chain should then, by assumption, already have been collapsed.

This limit of 9 taxa already puts an upper bound of (roughly) 3k * 9 = 27k on the number of taxa in the reduced trees. To get to the improved bound of 15k-9, we observe that there is a limit to the number of edges that can carry 9 taxa. (Specifically: only those edges that carry two breakpoints can carry 9 taxa and the number of those edges is limited to k). The remaining edges can be decorated with at most 6, or 3, taxa, and after a bit of counting magic we obtain our result.

Interestingly, the phylogenetic network perspective does not only help us to obtain this improved upper bound, it also plays a crucial role in helping us to prove that you can’t, in the worst case, obtain a bound better than 15k-9 (even if, as well as collapsing common subtrees and chains, you also try to decompose the trees around common splits).

Looking forward, it is natural to ask: is this a one-off success? Or might it be possible to use a similar “backwards network perspective” in other unexpected places to help improve best-known results about trees?

REFERENCES

[1] Allen, B. L., & Steel, M. (2001). Subtree transfer operations and their induced metrics on evolutionary trees. Annals of combinatorics, 5(1), 1-15.

[2] Kelk, S., & Linz, S. (2018). A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees. arXiv preprint arXiv:1811.06892.

[3] Van Iersel, L., Kelk, S., Stamoulis, G., Stougie, L., & Boes, O. (2018). On unrooted and root-uncertain variants of several well-known phylogenetic network problems. Algorithmica, 80(11), 2993-3022.