The Genealogical World of Phylogenetic Networks: An open question about computational complexity

Thursday, October 11, 2012

An open question about computational complexity

This is a guest blog post by:

Jesper Jansson

Laboratory of Mathematical Bioinformatics, Kyoto University, Japan

Here is an open problem for people interested in computational complexity issues related to phylogenetic networks.

In a recent paper we introduced a parameter called the "minimum spread" that measures a kind of structural complexity of phylogenetic networks:
T. Asano, J. Jansson, K. Sadakane, R. Uehara, G. Valiente (2012) Faster computation of the Robinson-Foulds distance between phylogenetic networks. Information Sciences 197: 77-90.

The definition is as follows:

The "minimum spread" of a rooted phylogenetic network N is the smallest integer x such that the leaves of N can be relabeled by distinct positive integers in a way that at every node u in N, the set of all leaf descendants of u forms at most x consecutive intervals.

For example, any phylogenetic tree has minimum spread 1 because if we do a depth-first traversal of the tree and number the leaves in the order that they are discovered, then at each node, the set of leaf descendants corresponds to a single consecutive interval. This property was used in, for example, Day's algorithm from 1985 for comparing phylogenetic trees and constructing a strict consensus tree.

Similarly, any level-k phylogenetic network has minimum spread at most k+1 (see our paper for the proof). Moreover, any "leaf-outerplanar network" has minimum spread 1, where a "leaf-outerplanar network" is a network that admits a non-crossing layout in the plane with the root (if any) and all leaves lying on the outer face. Today's existing software typically outputs such networks. So, for certain classes of phylogenetic networks, we automatically get a nice upper bound on the minimum spread.

Having a small minimum spread means that the phylogenetic network is "tree-like" in the sense that its cluster collection has a space-efficient representation. But are compact representations of the clusters in a network useful?

Well, they can be employed to compare phylogenetic networks quickly, for example when using the Robinson-Foulds distance to measure the dissimilarity between two phylogenetic networks. There may be other applications, too. On the other hand, if a phylogenetic network is "chaotic" and non-tree-like then the minimum spread will not be a helpful parameter when looking for a compact encoding of its branching information.

At this point in time, not much is known about how to compute the minimum spread efficiently. As an example, consider the class of level-k networks for any fixed k > 1. According to Lemma 6 in our paper, we can always find a leaf relabeling function that yields spread at most k+1 in linear time, but that might not be the minimum possible for some particular level-k network.

As observed by Sylvain Guillemot and Philippe Gambette (independently of each other), a related result for the k-Consecutive Ones Problem implies that computing the minimum spread of an arbitrary phylogenetic network is NP-hard in the general case, although we can expect it to be easier when restricted to special cases:
P. W. Goldberg, M. C. Golumbic, H. Kaplan, R. Shamir (1995) Four strikes against physical mapping of DNA. Journal of Computational Biology 2: 139-152.

In summary, the following is still open: