Tuesday, May 10, 2016

The early history of sequence alignment

The historical development of the concept that we now call a "sequence alignment" seems to have rarely been considered in the biological literature. Apparently, the idea took some time to develop.

To a bioinformatician, the history of sequence alignment starts in 1970, with the presentation of the dynamic programming algorithm of Needleman and Wunsch (1970). However, protein sequencing started fully 20 years earlier than this (see García-Sancho 2010), and by the end of the 1950s comparisons of amino-acid sequences among related organisms were beginning to appear. Even so, as noted by Eck (1961): "data on amino acid sequences can be sorted, tabulated and arranged in a great variety of ways ... Any such manipulation will produce some sort of pattern." Thus, a multiple sequence alignment was seen as only one of many possible data presentations, and not necessarily the most obvious one unless intended for an evolutionary analysis.

For example, most of these early comparative studies focussed on the structure (and thus function) of the proteins rather than on their evolution, and so they tended to present juxtapositions consisting of ungapped fragments of the sequences (eg. Brown et al. 1955; Tuppy and Dus 1958; Anfinsen 1959), particularly the active regions. Other studies were directed towards finding a solution to the problem of the genetic code (ie. how nucleotides code for amino acids), and their presentation of sequence alignments was similarly non-evolutionary (eg. Gamow et al. 1956; Tsugita and Fraenkel-Conrat 1960).

Nevertheless, the early work on molecular evolution did reveal that different protein molecules are homologous, including what are now called paralogs (eg. Itano 1957; Ingram 1961). With the sequencing of the proteins, it soon occurred to several people independently that the relative positions in the amino acid sequences are homologous as well (see Morgan 1998). This is an important distinction, because the latter refers to the 1:1 matching of the parts (amino acids) of a complex whole (the protein molecule), which is the usual empirical procedure for determining homology (Ghiselin 2016). However, most sequences were still presented unaligned (eg. Ingram 1961), until the work of Margoliash (1963) and Pauling and Zuckerkandl (1963), who can thus be seen as the pioneers of the modern form of sequence alignment.

The major problem with sequencing proteins in the 1960s was that it was still a slow and tedious procedure, so data were rather scarce; the first major compilation of aligned sequences did not appear until 1965 (Dayhoff et al. 1965). Strasser (2010) provides interesting coverage of the early uses of multiple amino-acid sequence alignments, including the development of one-letter codes for each of the amino acids in order to make the alignments more readable. García-Sancho (2010) and Suárez-Díaz (2014) discuss the subsequent development of experimental methods for the sequencing of RNA in the mid-1960s and then DNA in the mid-1970s, which greatly increased the need for an automated sequence alignment method. [García-Sancho (2012) provides a much more detailed discussion.]

Most importantly, a number of the early molecular sequence alignments were constructed by hand, based explicitly on an evaluation of the likely biological mechanisms that had produced the sequence variation. That is, the alignments made clear the originating molecular mechanisms. For example, Pauling and Zuckerkandl (1963) provided a pairwise alignment of two reconstructed ancestral amino-acid sequences of haemoglobin, along with a discussion of the substitutions and insertions / deletions.

Twenty years later, in what appears to be the first published study of intraspecific variation using DNA sequences, Kreitman (1983) took this idea further, and provided a very carefully considered multiple alignment based on explicit recognition of tandem repeats and RNA stem structures within the study gene. This was very much in line with traditional approaches to the assessment of homologies prior to phylogenetic tree building, for example when using morphological or anatomical characters.

However, immediately after this, practical computerized procedures were developed by Hogeweg and Hesper (1984), based on dynamic programming for pairwise sequence alignment (solely maximizing similarity, as explicitly noted in the title of the Needleman and Wunsch paper) and based on the progressive alignment strategy for multiple alignment. Then the Clustal computer program was released in 1988, which implemented these procedures in a usable manner for personal computers (see Chenna et al. 2003); and the history of studies in molecular evolution was thereby changed forever.
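As a rough illustration of what "dynamic programming for pairwise sequence alignment" means here, the following Python sketch fills a similarity table and traces back one best global alignment. It is written in the spirit of Needleman and Wunsch (1970), not as a reproduction of their exact formulation or of the Hogeweg and Hesper (1984) procedure. The function name `global_alignment` and the scoring values (match +1, mismatch -1, a fixed gap cost of -1) are assumptions chosen only for illustration.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best global alignment.

    Scoring values are illustrative assumptions, not those of any paper.
    """
    n, m = len(a), len(b)

    # score[i][j] holds the best similarity of a[:i] aligned with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against nothing but gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against nothing but gaps

    def pair(i, j):
        # Similarity contributed by aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))
```

Note that the sketch maximizes similarity only; it knows nothing about tandem repeats, stem structures or any other biological mechanism, which is exactly the contrast with the hand-made alignments described above.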

This brief history emphasizes one simple point about the relationship between homology and phylogeny — the apparent primary interest in the latter rather than the former, despite the fact that they are simply two views of the same dataset (phylogeny refers to the relationship among the rows of a multiple sequence alignment, while homology refers to the relationship among the columns). The first automated or semi-automated tree-building algorithm (the user could manually intervene at each step) was developed by Eck and Dayhoff (1966), followed by the first fully automated procedure presented by Fitch and Margoliash (1967). This was nearly 20 years before equivalent ideas were developed for homology assessment.


Christian B. Anfinsen (1959) The Molecular Basis of Evolution. Wiley, New York.

H. Brown, Frederick Sanger, Ruth Kitai (1955) The structure of pig and sheep insulins. Biochemical Journal 60: 556-565.

Ramu Chenna, Hideaki Sugawara, Tadashi Koike, Rodrigo Lopez, Toby J. Gibson, Desmond G. Higgins, Julie D. Thompson (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research 31: 3497-3500.

Margaret O. Dayhoff, Richard V. Eck, Marie A. Chang, Minnie R. Sochard (1965) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring MD.

Richard V. Eck (1961) Non-randomness in amino-acid "alleles". Nature 191: 1284-1285.

Richard V. Eck, Margaret O. Dayhoff (1966) Atlas of Protein Sequence and Structure, second edition. National Biomedical Research Foundation, Silver Spring MD.

Walter M. Fitch, Emanuel Margoliash (1967) Construction of phylogenetic trees. Science 155: 279-284.

George Gamow, Alexander Rich, Martynas Yčas (1956) The problem of information transfer from the nucleic acids to proteins. Advances in Biological and Medical Physics 4: 23-68.

Miguel García-Sancho (2010) A new insight into Sanger’s development of sequencing: from proteins to DNA, 1943–1977. Journal of the History of Biology 43: 265-323.

Miguel García-Sancho (2012) Biology, Computing and the History of Molecular Sequencing: From Proteins to DNA, 1945–2000. Palgrave Macmillan, Basingstoke UK.

Michael T. Ghiselin (2016) Homology, convergence and parallelism. Philosophical Transactions of the Royal Society, Series B 371: 20150035.

Paulien Hogeweg, Ben Hesper (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. Journal of Molecular Evolution 20: 175-186.

Vernon M. Ingram (1961) Gene evolution and the hæmoglobins. Nature 189: 704-708.

Harvey A. Itano (1957) The human hemoglobins: their properties and genetic control. Advances in Protein Chemistry 12: 215-268.

Martin Kreitman (1983) Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304: 412-417.

Emanuel Margoliash (1963) Primary structure and evolution of cytochrome c. Proceedings of the National Academy of Sciences of the USA 50: 672-679.

Gregory J. Morgan (1998) Emile Zuckerkandl, Linus Pauling, and the molecular evolutionary clock, 1959–1965. Journal of the History of Biology 31: 155-178.

Saul B. Needleman, Christian D. Wunsch (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Linus Pauling, Emile Zuckerkandl (1963) Chemical paleogenetics: molecular "restoration studies" of extinct forms of life. Acta Chemica Scandinavica 17: S9-S16.

Bruno J. Strasser (2010) Collecting, comparing, and computing sequences: the making of Margaret O. Dayhoff's Atlas of Protein Sequence and Structure, 1954–1965. Journal of the History of Biology 43: 623-660.

Edna Suárez-Díaz (2014) The long and winding road of molecular data in phylogenetic analysis. Journal of the History of Biology 47: 443-478.

Akira Tsugita, Heinz Fraenkel-Conrat (1960) The amino acid composition and C-terminal sequence of a chemically evoked mutant of TMV. Proceedings of the National Academy of Sciences of the USA 46: 636-642.
