Multiple sequence alignment programs have not yet met their primary aim for evolutionary biologists: maximizing homology of characters. If our goal is to develop an automated procedure for homology assessment, then we need someone to produce a program that explicitly implements this aim.
Alignment is just as much a part of phylogenetics as is tree or network building. It is the procedure that expresses the homology relationships among the characters, rather than the historical relationships among the taxa. Therefore, we need a computer program that accurately expresses homology relationships, as well as one that accurately expresses the historical relationships. We have some programs for the latter but currently nothing for the former.
Unfortunately, homology is a rather nebulous concept. It has to do with inheriting characters from a shared ancestor, which is not something that we can directly observe. Therefore we have to infer it. Somehow.
Systematists have developed criteria for making decisions about potential homologies in an objective and (hopefully) repeatable manner, and these are directly applicable to nucleotide sequences, which these days are the most common form of data used in phylogenetics. These criteria are:
- Compositional = apparent likeness or resemblance between sequences (% similarity)
- Topographical = apparent likeness or resemblance in structure between sequences (secondary and tertiary structure of proteins or RNA)
- Functional = functional relationship to other characters in the same sequence (annotated function of the sequence in protein or RNA)
- Ontogenetic = variation arising from the same molecular mechanism between sequences (inferred molecular mechanism creating the sequence variation — tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions)
- Congruence = agreement with other postulated homologies elsewhere in the same sequences (synapomorphy)
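Of these criteria, compositional similarity is the easiest to quantify. As a minimal sketch, the function below computes percent identity for two already-aligned sequences. The name `percent_identity` and the choice to skip any column containing a gap are assumptions made for illustration; other definitions use a different denominator and will give different values.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of gap-free columns in which two aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap in either row are not counted
        compared += 1
        if a == b:
            matches += 1
    # With no comparable columns there is nothing to score, so report 0.0
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Five gap-free columns, four of which match
    print(percent_identity("ACGT-A", "ACCTTA"))
```

A high score here says only that the two rows look alike; it says nothing about structure, function, mechanism or congruence.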
Traditionally, characters have been first proposed as homologous using the criteria of similarity and conjunction (together called primary homology), and then tested with the criterion of congruence (secondary homology).
It is important to note that these criteria do not always agree with each other in their inferences of homology. Changes that occur during evolutionary history can weaken the connection between the criteria so that, for example, nucleotide homology inferred from structural similarity is no longer the same as nucleotide homology inferred from compositional similarity. It is for this reason that compositional similarity of the sequences is insufficient to establish gene orthology. The same limitation applies to nucleotides.
Current computer programs
It is clear that these criteria have been incorporated singly into current computerized procedures for producing multiple sequence alignments, but rarely in combination. For example, compositional similarity is the criterion used by the most popular computer programs, such as CLUSTAL, MAFFT and MUSCLE. Topographical similarity is being invoked whenever structure-based alignments are produced, such as for RNA-coding sequences (e.g. PicXAA-R; PMFastR), or when nucleotide sequences are translated to amino acids before alignment (e.g. PROMALS). Functional similarity is used for specialist studies of conserved motifs and binding sites, for instance. Ontogenetic similarity of nucleotide sequences is based on inferring the possible molecular processes that cause the observed sequence variation; the program PRANK uses this criterion by distinguishing between insertions and deletions.
Congruence as a criterion involves the observation of repeated patterns of synapomorphy in a phylogeny. Among alignment algorithms, both Direct Optimization (e.g. POY; MSAM; BeeTLe) and Statistical Alignment (e.g. BAli-Phy; StatAlign) try to simultaneously produce a multiple alignment and a phylogenetic tree, thus optimizing the criterion of congruence.
The fact that essentially none of the current crop of programs applies more than one criterion is, I contend, the principal reason why so many phylogeneticists adjust their alignments manually. Personal judgment may not be perfect, but at least it can be consciously based on homology as a general character concept. Since the different criteria may conflict with each other, at the moment only human judgment is available to compare them and thus make a final decision.
To make the homology criteria fully operational, we need to compare their inferences by evaluating the comparative evidence. That is, since the different criteria may conflict with each other, we need an automated way to compare them and evaluate their relative probabilities for any alignment column. What we need is a computerized procedure that includes all of the known criteria for homology assessment. Sadly, there are currently no mathematical models for doing this.
I suspect that there are two reasons for the failure of such a program to appear by now. First, biologists have not been clear about homology as a concept, and have not been able to express it in a form that computationalists could use to develop an algorithm. That is, we have criteria but they are not really operational criteria in a computational sense. Second, it will not be easy, because there is no obvious algorithm for inferring inheritance of characters. That is, we cannot easily separate homology from analogy.
Another proposal is to have an interactive alignment editor. This editor would have the ability to show the conflicting hypotheses of homology (e.g. where the homology suggested by structural pairing in a stem conflicts with homology suggested by tandem repeats), and then to annotate each column in the final alignment with the reason for the researcher having chosen to align those particular nucleotides. For example, one could press a button and see the RNA stem pairs in different colours (irrespective of whether the stem nucleotides are aligned), or press again and see the tandem repeats and inversions in different colours (once again, irrespective of how the nucleotides are aligned). One could also choose to see the annotations for the columns (summarized, using some coded schema), or simply look at the unadorned alignment itself.
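To make the idea of column annotations concrete, here is a rough sketch of how an editor might store the reason behind each aligned column and print a one-line coded summary. Everything in it is hypothetical: the class names `Criterion` and `AnnotatedAlignment`, the single-letter codes, and the half-open column ranges are illustrative choices of mine, not the format of PhyDE or any other existing editor.

```python
from dataclasses import dataclass, field
from enum import Enum


class Criterion(Enum):
    """Hypothetical one-letter codes for the homology criteria listed earlier."""
    COMPOSITIONAL = "C"
    TOPOGRAPHICAL = "T"
    FUNCTIONAL = "F"
    ONTOGENETIC = "O"
    CONGRUENCE = "G"
    UNANNOTATED = "."


@dataclass
class AnnotatedAlignment:
    rows: dict[str, str]  # taxon name -> aligned sequence
    reasons: dict[int, Criterion] = field(default_factory=dict)
    notes: dict[int, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if len({len(seq) for seq in self.rows.values()}) > 1:
            raise ValueError("all aligned rows must have the same length")

    @property
    def n_columns(self) -> int:
        return len(next(iter(self.rows.values()), ""))

    def annotate(self, start: int, end: int, criterion: Criterion, note: str = "") -> None:
        """Record why columns start..end-1 (0-based, end exclusive) were aligned."""
        if not 0 <= start <= end <= self.n_columns:
            raise IndexError("column range lies outside the alignment")
        for col in range(start, end):
            self.reasons[col] = criterion
            if note:
                self.notes[col] = note

    def summary(self) -> str:
        """One code letter per column, '.' where no reason has been recorded."""
        return "".join(
            self.reasons.get(col, Criterion.UNANNOTATED).value
            for col in range(self.n_columns)
        )


if __name__ == "__main__":
    aln = AnnotatedAlignment({"taxon1": "ACGGTTAC", "taxon2": "ACG-TTAC"})
    aln.annotate(0, 3, Criterion.TOPOGRAPHICAL, "stem pairing")
    aln.annotate(4, 8, Criterion.ONTOGENETIC, "tandem repeat")
    for name, seq in aln.rows.items():
        print(f"{name:8s}{seq}")
    print(f"{'reason':8s}{aln.summary()}")
```

Keeping the annotations in a structure separate from the sequences means the unadorned alignment can still be displayed or exported on its own.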
This seems to me to be an achievable goal in the short term, and the PhyDE editor already does some of it. Such an editor would also serve as a necessary step on the way to working out how to automate as much of the process as possible. The ultimate goal for some people may be total automation (i.e. a black box), but I see no way to achieve that in the immediate term. Besides, I suspect that phylogeneticists will always want some judgemental control over the process, which would be best achieved with a semi-automated interactive editor. That is, we might ask the program to work out what the alternative alignments are for any specified subsequence (in an automated manner), and then we evaluate their relative merits for ourselves.
Note that I am treating the alignment as a set of hypotheses independent of their phylogenetic analysis. Subsequences can still be tentatively aligned even if the researcher intends to mask those subsequences out of any subsequent tree-building analysis. Also, subsets of the taxa might be aligned confidently while other subsets are left unaligned. With current editors, this involves having a separate alignment file for each subset, which is very cumbersome, as well as error-prone.
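One way to keep a tentative alignment separate from its later analysis is to store a column mask beside the alignment and apply it only when exporting for tree building. The sketch below assumes the alignment is a plain mapping from taxon name to aligned sequence and that masked columns are given as 0-based indices; `apply_mask` is a hypothetical helper, not part of any existing package.

```python
def apply_mask(alignment: dict[str, str], masked: set[int]) -> dict[str, str]:
    """Return a copy of the alignment with the masked columns left out.

    The input is not modified, so the tentatively aligned columns remain
    available in the editor even though the analysis never sees them.
    """
    return {
        taxon: "".join(ch for col, ch in enumerate(seq) if col not in masked)
        for taxon, seq in alignment.items()
    }


if __name__ == "__main__":
    full = {"taxon1": "ACGTTA", "taxon2": "AC-TTA", "taxon3": "ACGATA"}
    # Columns 2 and 3 are aligned only tentatively, so exclude them from export
    for name, seq in apply_mask(full, {2, 3}).items():
        print(name, seq)
```

The same approach could be extended to subsets of taxa by recording, for each block of columns, which rows are considered aligned, although that is not shown here.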