Monday, November 3, 2014

On partitioning incongruent data into congruent blocks

In a recent article (by myself, Leo van Iersel, Nela Lekić and Simone Linz) we stumbled upon the following problem which appears to touch upon some interesting biological issues.

A rooted triplet xy|z is a rooted binary tree in which x and y have a common parent p, p is a child of the root, and z is the other child of the root. A rooted phylogenetic tree T displays (informally: agrees with) xy|z if the common ancestor of x and y is a strict descendant of the common ancestor of x and z (or y and z). See the figure below: the tree on the right displays triplet xy|z.

Suppose we are given a set of rooted triplets S on a set X of taxa. Suppose we have reason to believe that the set of triplets S have been obtained from different sources (e.g. genes), where the genes have different evolutionary histories due to reticulate phenomena. This means that, for a given subset of 3 taxa {x,y,z} from X, S will contain zero, one, two or three of the possible triplets {xy|z, xz|y, yz|x}.

Crucially, suppose we do not know which gene generated each triplet in S. This might sound artificial, but if some of the rooted triplets have been generated from phenotypic data, or have been obtained from inherently complex data (such as metagenomic data), then the genomic origins of the triplets might not be readily available.

Under such circumstances it is tempting to obtain a lower bound on the number of incongruent gene topologies by answering the following question. What is the minimum number of blocks that we can partition the triplets into, such that the triplets in each block are compatible with a tree (i.e. can all be displayed by the same tree)? It's easy to see that the worst case is when all 3(n 3) possible triplet topologies are present in S, where n is the number of elements in X. Let tau(n) denote this worst case.

We computed tau(n) exactly for small n. For n equal to 3 or 4, tau(n) is 3. For 5 <= n < = 12, tau(n) is 4. For 13 <= n <= 20 we only have an upper bound: tau(n) <= 5.

We know that tau(n) never stops growing: you can always find a larger value of n to force tau to rise even further. However, tau(n) seems to grow extremely slowly as a function of n. We have reason to believe that tau(n) grows far more slowly than O(log n).

The bottom line is that, for any n that might be encountered in a real-world experiment, it could well be that tau(n) <= C where C is a small constant. Suppose, for the sake of argument, that C is 10. This would mean that even the most incredibly diverse and incongruent real-world dataset could be 'explained' by (at most) 10 underlying tree topologies, even when - in truth - there are hundreds or even thousands of different tree signals within the dataset. In other words, tau(n) is potentially a massive underestimation of underlying incongruence.

The problem here is that rooted triplets only carried limited information about the underlying trees that generated them. Specifically, a large amount of information is lost if we do not know which gene tree generated which triplet.

One possible response to this problem is to explicitly incorporate topology. Suppose, for example, that a set of triplets S can be partitioned into 2 trees, but any rooted phylogenetic network displaying these two tree topologies (where each tree is on the full set of taxa X) must have a huge amount of reticulation, due to the fact that the 2 trees are highly incongruent to each other. On the other hand, suppose there exists a partition of S into 4 trees such that the trees are relatively congruent and thus can be displayed by a rooted phylogenetic network with a much smaller amount of reticulation. Depending on where we lay the emphasis we might prefer one solution over the other. Indeed, at one extreme we can completely ignore the number of trees in the underlying partition, and seek only to pack the rooted triplets into a rooted phylogenetic network with as little reticulate activity as possible: a well-studied problem. At the other extreme we are back with the original problem, i.e. of minimizing the number of trees needed to cover/generate the triplets in S, irrespective of how incongruent they are relative to each other.

As far as we know hybrid approaches — constructing rooted phylogenetic networks from triplets, but insisting that the triplets originate from a constrained number of trees within that network — have not yet received any significant research attention and could be interesting to investigate.

To conclude, rooted triplets are mathematically and experimentally seductive due to the hope that on one hand they can be experimentally inferred with high accuracy (due to their small size) but on the other hand still carry enough evolutionary signal to be puzzled together into large, accurate phylogenetic trees (or even networks). However, they carry limited information and this means that one must be careful when using them for hypothesis generation. Explicitly incorporating tree or network topology into triplet-based methods is potentially a way of strengthening their inference power.

No comments:

Post a Comment