*By Steven Kelk and Simone Linz.*

Many readers of this blog will be aware that phylogenetic
tree space is often traversed using topological modification moves such as SPR
(Subtree Prune and Regraft) and TBR (Tree Bisection and Reconnection). In a
nutshell these moves allow us to step from one phylogenetic tree to another,
with a view to finding “good” phylogenetic trees. A natural question which
arises is this: what is the minimum number of SPR or TBR moves required to turn
one tree into another? Being able to compute these values – known as the SPR
distance and the TBR distance respectively – gives us some feeling about how
long it will take, deterministically or stochastically, to move from one part
of tree space to another. Unfortunately, it is NP-hard (i.e. formally
intractable) to compute SPR or TBR distances.

Game over? No because NP-hardness is never a reason to give
up! In 2001 Allen and Steel [1] showed the following. Suppose you have two
trees, T1 and T2, both on

*n*taxa, and they have TBR distance*k*. After applying*common subtree*and*common chain*reduction rules, you obtain two trees – also with TBR distance*k*– which have at most 28*k*taxa. The striking thing here is that*n*vanishes from the analysis. Hence, if*k*is small, then so too are the reduced trees, and computing the TBR distance of these two reduced trees becomes less time-consuming. This process is known as*kernelization*.
This is where the networks come in. In a recent pre-print [2]
we have shown that the situation is actually even better than Allen and Steel
calculated: the reduced trees will have at most 15

*k*-9 taxa (and in fact, for the subtree and chain reduction rules, this is the best you can do). Perhaps somewhat counter-intuitively, the argument leverages the phylogenetic network literature. While it is quite common to use distances between trees to help construct networks, it is less common to use networks as an analytical instrument to prove new results about trees. We thus consider our new result as a somewhat striking example of the relevance of phylogenetic networks.
The high-level idea is as follows. A recent publication [3] proved
that if two trees T1 and T2 have TBR distance

*k*, then a simplest unrooted phylogenetic network that embeds both T1 and T2 will have reticulation number*k.*The reticulation number of an unrooted phylogenetic network is basically equal to the number of edges you have to delete to turn it into an unrooted phylogenetic tree. Due to this equivalence the problem of computing the TBR distance of two trees can be transformed into that of constructing an unrooted phylogenetic network. See the figure below. The blue tree and the green tree (which have TBR distance 2) can be obtained from the unrooted phylogenetic network on the left (which has reticulation number 2), by cutting at the blue and green*breakpoints*in the network (respectively).
Why is this important? We know that, after collapsing common
pendant subtrees, unrooted phylogenetic networks can be obtained by
“decorating” a given backbone topology with taxa. This backbone topology (known
as a

*generator*) has roughly 3*k*edges where chains of taxa can be added to the network, where*k*is the TBR distance. The critical fact is that, if you add__more__than 9 taxa to one of these edges, then – however you extract the two embedded trees from the network – you will obtain two trees with a common chain (of length 4 or more, which is the threshold at which common chains are reduced). So, under the assumption that all common chains have been reduced, you can add at most 9 taxa to each edge.
The figure above allows us to obtain some intuition about
this. In the network, there are two breakpoints interrupting the sequence of
taxa {1,2,3,4,5,6,7,8,9}, one for the green tree and one for the blue tree. The
interaction of these two breakpoints induces three (not necessarily maximal) chains
that are common to both trees: on taxa sets {1,2,3,4}, {5,6,7} and {8,9}
respectively. Under the common chain reduction rule, chain {1,2,3,4} would have
already been collapsed (since the reduction rule collapses common chains of
length 4 or more) – so in fact a situation in which two breakpoints are placed
along the sequence {1,2,3,4,5,6,7,8,9}, as shown in the figure, cannot happen
if we assume that the reduction rules had already been applied to exhaustion. You
could try to fix this by shifting the blue and green breakpoint one place
anticlockwise, to give common chains {1,2,3}, {4,5,6} and {7,8,9}. Sometimes such
shifts will work, sometimes they will not. However, if there had been 10 taxa
here, rather than 9, you can

*avoid creating a common chain of size 4 or more, no matter where you place the two breakpoints. This is simply because, however you partition the set {1,2,3,4,5,6,7,8,9,10} into three contiguous intervals, at least one of the intervals will have size 4 (or more). This common chain should then, by assumption, already have been collapsed.*__never__
This limit of 9 taxa already puts an upper bound of (roughly)
3

*k** 9 = 27*k*on the number of taxa in the reduced trees. To get to the improved bound of 15*k*-9, we observe that there is a limit to the number of edges that can carry 9 taxa. (Specifically: only those edges that carry two breakpoints can carry 9 taxa and the number of those edges is limited to*k*). The remaining edges can be decorated with at most 6, or 3, taxa, and after a bit of counting magic we obtain our result.
Interestingly, the phylogenetic network perspective does not
only help us to obtain this improved upper bound, it also plays a crucial role
in helping us to prove that you can’t, in the worst case, obtain a bound better
than 15

*k*-9 (even if, as well as collapsing common subtrees and chains, you also try to decompose the trees around common splits).
Looking forward, it is natural to ask: is this a one-off
success? Or might it be possible to use a similar “backwards network
perspective” in other unexpected places to help improve best-known results
about

*trees*?**REFERENCES**

[1] Allen, B. L., & Steel, M. (2001). Subtree transfer
operations and their induced metrics on evolutionary trees.

*Annals of combinatorics*, 5(1), 1-15.
[2] Kelk, S., & Linz, S. (2018). A tight kernel for
computing the tree bisection and reconnection distance between two phylogenetic
trees.

*arXiv preprint*arXiv:1811.06892.
[3] Van Iersel, L., Kelk, S., Stamoulis, G., Stougie, L.,
& Boes, O. (2018). On unrooted and root-uncertain variants of several
well-known phylogenetic network problems.

*Algorithmica*, 80(11), 2993-3022.
## No comments:

## Post a Comment