The Genealogical World of Phylogenetic Networks: The need for a new sequence alignment database

Multiple sequence alignment software have not yet met their primary aim for evolutionary biologists: maximizing homology of characters. The proliferation of alignment methods have diverse optimization functions, along with assorted heuristics to search for the optimum alignment; and these methods produce detectably different multiple sequence alignments in almost all realistic cases (see The need for a new sequence alignment program). This leaves the phylogeneticists wondering what to do. In response, the majority of phylogeneticists use manual alignment or re-alignment at some stage in their procedures.

If our goal is to develop an automated procedure for homology assessment (see Multiple sequence alignment), then we need some means of evaluating the relative success of different alignment methods.

There are four suggestions for benchmarking strategies for sequence alignment (Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C 2014. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods in Molecular Biology 1079: 59-73):

Benchmarks based on simulated evolution of biological sequences, to create examples with known homology.
Benchmarks based on consistency among several alignment techniques.
Benchmarks based on the three-dimensional structure of the products encoded by sequence data.
Benchmarks based on knowledge of, or assumption about, the phylogeny of the aligned biological sequences.

These authors list a number of pros and cons for each strategy. For our purposes here we nee to consider the cons, which I discuss here (not all of these are covered by the authors).

Cons

1.
Simulation-based approaches adopt a probabilistic model of sequence evolution to describe nucleotide substitution, deletion, and insertion rates, while keeping track of “true” relationships of homology between individual residue positions (see Do biologists over-interpret computer simulations?).
(a) The simulation and analysis methods are not independent. All observations drawn from simulated data depend on the assumptions and simplifications of the model used to generate the data. This means that the results are biased towards those analysis methods that most closely match the assumptions of the simulation model.
(b) Simulations cannot straightforwardly, if at all, account for all evolutionary forces. This means that the simulations are not realistic, and their relevance for the behaviour of real datasets is unknown. The biggest failing in this regard is that, at some stage in the simulation, insertions and deletions are assumed to occur at random along the sequence (IID), and nothing could be further from the truth. Sequence variation occurs as a result of tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, and insertions; and there are strong spatial constraints on variation such as codons and stem-loops. Current simulation methods fall well short of modeling these patterns of sequence variation.

2.
The key idea behind consistency-based benchmarks is that different good aligners should tend to agree on a common alignment (namely, the correct one) whereas poor aligners might make different kinds of mistakes, thus resulting in inconsistent alignments.
(a) Two wrongs don't make a right. That is, consistent methods may be collectively biased. Moreover, consistency is not independent of the set of methods used (some may be consistent with each other and not with others).
(b) Consistency scores are a feature of several methods, which means that the benchmark is not independent.

3. Structural benchmarks most commonly employ the superposition of known protein/RNA structures as an independent means of alignment, to which alignments derived from sequence analysis can then be compared (see Edgar RC 2010. Quality measures for protein alignment benchmarks. Nucleic Acids Research 38: 2145-2153). The best known of these include: BAliBASE, OXBench, PREFAB, SABmark, IRMBase, and BRAliBase.
(a) Datasets are limited to structurally conserved regions, and may not be relevant for other alignment objectives.
(b) Deriving the structure-based alignments is problematic. For example, there is inconsistency amongst different stuctural superpositions.

4. Given a reference tree, the more accurate is the tree resulting from a given alignment, then the more accurate the underlying alignment is assumed to be (see Dessimoz C, Gil M 2010. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biology 11: R37).
(a) False inversion of a proposition: Accurate alignments yield accurate trees, therefore accurate trees must be based on accurate alignments.
(b) Alignment is often involved in constructing the reference tree. If not, the tree may be trivial in terms of taxon relationships.

Discussion

This evaluation leaves us in the invidious position of not yet having any benchmarking method that is relevant to homology assessment for multiple sequence alignments. This conclusion is at variance with other previous assessments (eg. Aniba MR, Poch O, Thompson JD 2010. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Research 38: 7353-7363).

We need to consider what such a method might look like, and how we might go about constructing it. If biologists can't give the bioinformaticians a concrete goal for homology alignment then they can expect nothing in return.

It seems clear that we need to follow the idea behind option 3, but base the alignments on homology rather than structure. I once made a start with compiling some suitable datasets (see Morrison DA 2009. A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149); but this was a very minor effort.

As I see it, we need alignments that are explicitly annotated with the reasons for considering the columns to be homologous. One suggestion would be to have relatively short alignments with annotations for "known" features, such as tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions, or stem-loops. These all create sequence variation, and they provide evidence of the homology relations among the sequences. Presumably the alignments would vary in length and number of sequences, and in the complexity of the patterns.

Perhaps the biggest practical problem will be how to deal with alignments where the homology criteria conflict with each other. That is, there are different types of criteria used to recognize homology — ie. similarity, structure, ontogeny, congruence (see Morrison DA 2015. Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26) — and they do not necessarily agree with each other.

This would allow us to come up with a set of requirements to specify various categories of the database, based on each of the above features. We would then try to accumulate as many example datasets for each category as we can. The database will presumably have protein-coding sequences in one section and RNA-coding, introns, etc in another. This dichotomy is simplistic, but I feel that it needs to be that way in order to be of practical use. Within each of those two sections we would have subsets of varying degrees of difficulty (eg. different degrees of average sequence similarity, or distinct taxon subsets in the same alignment, or orphan sequences).

This organisational approach is similar to that originally adopted for BAliBase, but it was dropped by most of the databases developed subsequently. I believe that it is the best approach for our purposes.

There are also experimentally created datasets where the alignment is known because all of the ancestors were sequenced as well. These would be useful; but their limitation is that the sequence variation was generated more or less at random, and so it does not match normal evolutionary processes. These alignments are more likely to match the IID assumption of the current automated alignment methods.

There is one further issue with this approach. Bioinformaticians often state that a few carefully prepared datasets is of little practical use to them (as opposed to being of use to phylogeneticists). What they need is a large number of datasets, the more the better. This is because they are interested in the percent success of their algorithms, and this cannot be assessed with small sample sizes. So, each alignment probably does not need to have too many taxa or too much sequence length — it is the number of alignments that is important, not their individual sizes. This could be achieved by sub-dividing larger datasets.

Wednesday, March 11, 2015

The need for a new sequence alignment database

No comments:

Post a Comment