Wednesday, May 14, 2014

Non-model distances in phylogenetics

Bioinformatics is sometimes divorced from biology. This happens when the ideas involved are solely computational ones, and are not biologically motivated. For example, Otu et al. (2003), when proposing "a new sequence distance measure for phylogenetic tree construction" point out that "it is worth noting that our distance measures do not use any evolutionary model". They seem to see this a a Good Thing whereas biologists might not.

Pairwise distances are actually a good case in point, because there are many ways to measure the similarity / distance between any two objects, and most of them have nothing whatever to do with biology. One method that was popular about 10 years ago was using computerized compression.

This is based on the notion of complexity — all objects are complex to one degree or another, but the information they contain can often be reduced (or compressed) to a smaller amount without losing anything essential. That is, the original information can be exactly reproduced from the compressed version. This notion is applied to computer files, for example, in the idea of zipping a file — depending on the file's content it may be possible to zip the file to much a smaller size for storage or transmission.

This idea is straightforward to apply to similarity (Cilibrasi & Vitányi 2005). One simply has to compress the two objects separately as well as compress the combination of the two objects (ie. their union or concatenation). If the two objects are identical then the compression of their union will be the same size as the compression of each object individually (= a distance of 0), because all of the information in one of the objects is redundant. If they have nothing in common then the compression of their union will be the sum of the sizes of the compression of each object individually (= a distance of 1).

This idea has been applied in a number of areas where similarity / distance is used to group objects together (ie. clustering or classification), including computer viruses (Wehner 2008), music (Cilibrasi et al. 2004), languages (Benedetto et al. 2002a), and genomics (Chen et al. 2000, Li et al. 2001, Otu et al. 2003).

This approach has not always been received with equanimity. For example, the paper by Benedetto et al. (2002a) has received considerable commentary (Khmelev & Teahan 2003a; Benedetto et al. 2003; Khmelev & Teahan 2003b), (Goodman 2002; Benedetto et al. 2002b), (Wang 2009).

In biology, this idea has been mostly ignored. Chen et al. (2000) and Li et al. (2001) based their idea of information on Kolmogorov complexity, which is "the shortest program on a universal computer that outputs one of the objects when the input is the other object". This complexity cannot be measured directly, and so it is approximated by using a file compression program (eg. gzip). Clearly, the approximation can only be as good as the compression program and its algorithms.

Otu et al. (2003) based their idea of information on Lempel & Ziv complexity, which is "related to the number of steps required by a production process that builds either of the two objects". They claim an exact solution to the measurement of this complexity for gene sequences, and propose several slightly different distances based on this. Furthermore, the claim to:
show that the proposed distance measures, which are based on the relative complexity between sequences imply the evolutionary distance between organisms ... As the proposed approach does not depend on multiple alignments we test the validity of the approach in two ways: we use simulated data to show that the proposed distance measures can reasonably be represented by a tree. We also show the superiority of the proposed method on existing techniques using this simulated data. Secondly, we look at how well the results generated by the proposed method agree with existing phylogenies.
The use of generic compression similarity simply demonstrates that phylogeny does produce similarity to some extent. However, phylogeny is based on what we might call a form of "special similarity", which is similarity solely due to the possession of shared derived character states. Other components of similarity, such as the similarity produced by parallelisms, convergences and reversals, do not contribute to an estimate of a phylogeny. Indeed, they will often be positively misleading. A general concept of similarity is inadequate for reconstructing a phylogeny except under the simplest circumstances. This was empirically demonstrated, for example, in liguistics by the results of the Computer-Assisted Stemmatology Challenge, in which the compression method CompLearn came dead last in the primary ranking of the different phylogenetic methods (but did okay in the secondary ranking).

In short, only some aspects of complexity are relevant for phylogenetics, and only some of the information contained in a genome is relevant for measuring phylogenetic similarity. The focus on information as a whole ignores the "bio" in bioinformatics.


Benedetto D, Caglioti E, Loreto V (2002a) Language trees and zipping. Physical Review Letters 88: 048702. [arXiv:cond-mat/0108530, 2001]

Benedetto D, Caglioti E, Loreto V (2002b) On J. Goodman’s comment to "Language trees and zipping". arXiv:cond-mat/0203275

Benedetto D, Caglioti E, Loreto V (2003) A reply to the comment by Khmelev and Teahan. Physical Review Letters 90: 089804.

Chen X, Kwong S, Li M (2000) A compression algorithm for DNA sequences and its applications in genome comparison. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB), pp. 107-117. ACM Press, Tokyo.

Cilibrasi R, Vitányi PMB (2005) Clustering by compression. IEEE Transactions on Information Theory 51: 1523-1545.

Cilibrasi R, Vitányi P, de Wolf R (2004) Algorithmic clustering of music based on string compression. Computer Music Journal 28(4): 49-67.

Goodman J (2002) Extended comment on "Language trees and zipping". arXiv:cond-mat/0202383

Khmelev DV, Teahan WJ (2003a) Comment on "Language trees and zipping". Physical Review Letters 90: 089803.

Khmelev DV, Teahan WJ (2003b) Comment on the reply of Benedetto et al. arXiv:cond-mat/0303261

Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H (2001) An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17: 149-154.

Otu HH, Sayood K (2003) A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19: 2122-2130.

Wang X-L (2009) Comment on "Language trees and zipping". arXiv:0903.3669

Wehner S (2008) Analyzing worms and network traffic using compression. Journal of Computer Security 15: 303-320. (arXiv:cs/0504045v1, 2007)

No comments:

Post a Comment