By Steven Kelk and Simone Linz.
Many readers of this blog will be aware that phylogenetic
tree space is often traversed using topological modification moves such as SPR
(Subtree Prune and Regraft) and TBR (Tree Bisection and Reconnection). In a
nutshell these moves allow us to step from one phylogenetic tree to another,
with a view to finding “good” phylogenetic trees. A natural question which
arises is this: what is the minimum number of SPR or TBR moves required to turn
one tree into another? Being able to compute these values – known as the SPR
distance and the TBR distance respectively – gives us some feeling about how
long it will take, deterministically or stochastically, to move from one part
of tree space to another. Unfortunately, it is NP-hard (i.e. formally
intractable) to compute SPR or TBR distances.
Game over? No because NP-hardness is never a reason to give
up! In 2001 Allen and Steel [1] showed the following. Suppose you have two
trees, T1 and T2, both on n taxa, and
they have TBR distance k. After
applying common subtree and common chain reduction rules, you obtain
two trees – also with TBR distance k
– which have at most 28k taxa. The
striking thing here is that n
vanishes from the analysis. Hence, if k
is small, then so too are the reduced trees, and computing the TBR distance of
these two reduced trees becomes less time-consuming. This process is known as kernelization.
This is where the networks come in. In a recent pre-print [2]
we have shown that the situation is actually even better than Allen and Steel
calculated: the reduced trees will have at most 15k-9 taxa (and in fact, for the subtree and chain reduction rules,
this is the best you can do). Perhaps somewhat counter-intuitively, the
argument leverages the phylogenetic network literature. While it is quite
common to use distances between trees to help construct networks, it is less
common to use networks as an analytical instrument to prove new results about
trees. We thus consider our new result as a somewhat striking example of the
relevance of phylogenetic networks.
The high-level idea is as follows. A recent publication [3] proved
that if two trees T1 and T2 have TBR distance k, then a simplest unrooted phylogenetic network that embeds both
T1 and T2 will have reticulation number k.
The reticulation number of an unrooted phylogenetic network is basically
equal to the number of edges you have to delete to turn it into an unrooted
phylogenetic tree. Due to this equivalence the problem of computing the TBR
distance of two trees can be transformed into that of constructing an unrooted
phylogenetic network. See the figure below. The blue tree and the green tree (which
have TBR distance 2) can be obtained from the unrooted phylogenetic network on
the left (which has reticulation number 2), by cutting at the blue and green breakpoints in the network (respectively).
Why is this important? We know that, after collapsing common
pendant subtrees, unrooted phylogenetic networks can be obtained by
“decorating” a given backbone topology with taxa. This backbone topology (known
as a generator) has roughly 3k edges where chains of taxa can be
added to the network, where k is the
TBR distance. The critical fact is that, if you add more than 9 taxa to
one of these edges, then – however you extract the two embedded trees from the
network – you will obtain two trees with a common chain (of length 4 or more,
which is the threshold at which common chains are reduced). So, under the
assumption that all common chains have been reduced, you can add at most 9 taxa
to each edge.
The figure above allows us to obtain some intuition about
this. In the network, there are two breakpoints interrupting the sequence of
taxa {1,2,3,4,5,6,7,8,9}, one for the green tree and one for the blue tree. The
interaction of these two breakpoints induces three (not necessarily maximal) chains
that are common to both trees: on taxa sets {1,2,3,4}, {5,6,7} and {8,9}
respectively. Under the common chain reduction rule, chain {1,2,3,4} would have
already been collapsed (since the reduction rule collapses common chains of
length 4 or more) – so in fact a situation in which two breakpoints are placed
along the sequence {1,2,3,4,5,6,7,8,9}, as shown in the figure, cannot happen
if we assume that the reduction rules had already been applied to exhaustion. You
could try to fix this by shifting the blue and green breakpoint one place
anticlockwise, to give common chains {1,2,3}, {4,5,6} and {7,8,9}. Sometimes such
shifts will work, sometimes they will not. However, if there had been 10 taxa
here, rather than 9, you can never
avoid creating a common chain of size 4 or more, no matter where you place the
two breakpoints. This is simply because, however you partition the set
{1,2,3,4,5,6,7,8,9,10} into three contiguous intervals, at least one of the
intervals will have size 4 (or more). This common chain should then, by
assumption, already have been collapsed.
This limit of 9 taxa already puts an upper bound of (roughly)
3k * 9 = 27k on the number of taxa in the reduced trees. To get to the
improved bound of 15k-9, we observe
that there is a limit to the number of edges that can carry 9 taxa. (Specifically:
only those edges that carry two breakpoints can carry 9 taxa and the number of
those edges is limited to k). The
remaining edges can be decorated with at most 6, or 3, taxa, and after a bit of
counting magic we obtain our result.
Interestingly, the phylogenetic network perspective does not
only help us to obtain this improved upper bound, it also plays a crucial role
in helping us to prove that you can’t, in the worst case, obtain a bound better
than 15k-9 (even if, as well as
collapsing common subtrees and chains, you also try to decompose the trees
around common splits).
Looking forward, it is natural to ask: is this a one-off
success? Or might it be possible to use a similar “backwards network
perspective” in other unexpected places to help improve best-known results
about trees?
REFERENCES
[1] Allen, B. L., & Steel, M. (2001). Subtree transfer
operations and their induced metrics on evolutionary trees. Annals of combinatorics, 5(1), 1-15.
[2] Kelk, S., & Linz, S. (2018). A tight kernel for
computing the tree bisection and reconnection distance between two phylogenetic
trees. arXiv preprint
arXiv:1811.06892.
[3] Van Iersel, L., Kelk, S., Stamoulis, G., Stougie, L.,
& Boes, O. (2018). On unrooted and root-uncertain variants of several
well-known phylogenetic network problems. Algorithmica,
80(11), 2993-3022.
No comments:
Post a Comment