Tuesday, November 5, 2013

Using constraints to get a handle on the space of phylogenetic networks?

The following two problems will be familiar to researchers working on evolutionary phylogenetic networks.

1) The severe computational intractability associated with globally optimizing most objective functions over the space of phylogenetic networks.

2) The fact that within the space of potential solutions, there are typically very many that an end-user biologist will want to exclude from consideration, for context-specific biological reasons that the software does not know about. This hidden information often only becomes available at the end of the analysis. It is not unusual to receive comments such as: "Thanks for the networks but they can't be good, because experimentalists strongly believe that taxon X is a hybrid of taxa Y and Z, and we also think that taxon group C should be monophyletic ... this is not visible in your networks."

In a recent opinion piece added to the Arxiv ("Fighting network space: it is time for an SQL-type language to filter phylogenetic networks") myself, David and Simone Linz pose the question of whether it might be possible to address both these questions at the same time, using constraint-based modelling. The core of the idea is that, via some kind of comparatively easy-to-use modelling language (e.g. something with an SQL flavour), the end-user biologist should be able to specify characteristics that all candidate solutions must (or must not) have.

The win-win scenario would be that this (a) tempers the intractability of the search problem, by cutting out large swathes of irrelevant networks in the vast search space and (b) invites biologists to incorporate their context-specific knowledge "upstream", reducing the risk that the networks generated by the software are mis-interpreted. In the context of phylogenetic trees, the idea is not new: in 1986 Constantinescu and Sankoff showed that the use of a constrained tree indeed reduces the search space remarkably.

It seems a natural idea to do this for networks, but the question of course is how feasible all this is. Constraint-based pruning of intractable search spaces is seductive but technically challenging for all kinds of reasons. Depending on the constraints used it might help a lot or a little, it is certainly no silver bullet. We might nevertheless hope that in many cases end-user biologists have so much implicit knowledge that the search space is massively shrunk. The question of the modelling language is also tricky because we need to decide upon a set of network constraints that biologists want and need: the dominant topological feature of trees, the clade, is no longer sufficient to describe (or constrain) the topologically richer space of phylogenetic networks. Furthermore, the constraints themselves should not become a new source of intractability.

In the opinion piece we make a few basic suggestions for atomic network constraints and how they might be combined via an SQL-style language. This, of course, is only the starting point for what we hope will be a wider discussion.

We're very keen to hear your thoughts about this!

No comments:

Post a Comment