Tuesday, May 31, 2016

Darwin's coral and seaweed metaphors

When challenging previous ideas about biological organization, Charles Darwin insisted upon both the origin of new biological forms and the extinction of some of the old forms. He used a multi-stemmed bush as his published metaphoric icon for these processes (in 1859), but we have always referred to it as a tree.

However, as noted in an earlier blog post (Charles Darwin's unpublished tree sketches), Darwin's first tree-like diagram (dated 1837-1838) was actually a drawing of a coral, accompanied by the text:
The tree of life should perhaps be called the coral of life, [with the] base of [the] branches dead; so that [the] passages cannot be seen
Darwin's specimen 1143, labelled Corallina officinalis.

As a geologist, Darwin had studied corals extensively in the Pacific and Indian Oceans, on the Beagle voyage (1831-1836). In May 1837 he read a paper before the Geological Society of London about his ideas for the development of reefs. This was then published in their journal:
Darwin, C.R. (1837) On certain areas of elevation and subsidence in the Pacific and Indian oceans, as deduced from the study of coral formations. Proceedings of the Geological Society of London 2: 552-554.
He subsequently published his book on the development of coral reefs in 1842 (this was his first monograph):
The Structure and Distribution of Coral Reefs. Being the first part of the geology of the voyage of the Beagle, under the command of Capt. Fitzroy, R.N. during the years 1832 to 1836. Smith, Elder and Co., London.
After this first use of a coral image, Darwin also tried a different marine metaphor:
a tree not [a] good simile — endless piece of sea weed dividing
He seems to have done nothing further with this particular idea.

What is most interesting for us is that the coral metaphor is not a strictly divergent model of evolutionary history. After all, there are many types of coral that form anastomoses. Indeed, there are also corals that do not even form a branching pattern. The neat divergent tree metaphor does not match the world of corals.

This point has been made at length by:
Horst Bredekamp (2003) Darwins Korallen: Frühe Evolutionsmodelle und die Tradition der Naturgeschichte. Verlag Klaus Wagenbach.
Horst Bredekamp (2005) Darwins Korallen: Die frühen Evolutionsdiagramme und die Tradition der Naturgeschichte. Verlag Klaus Wagenbach, second edition.
This book deals with the "aesthetic and political dimension of the coral", which the author (a philosopher and art historian) sees as "a model of anarchic evolution" that opposes the hierarchical metaphor of a tree.

Biologically, Darwin should have stuck to his original idea! However, there is no doubt that the Biblical association of the tree image was more likely to capture his readers' imaginations.

Tuesday, May 24, 2016

Y chromosome and mitochondrial DNA phylogenies — networks?

If one combines a Y chromosome genealogy, which usually shows the paternal ancestry, with a mitochondrial genealogy, which usually shows the maternal ancestry, it is likely that the resulting phylogeny will be reticulate. After all, if a sexually reproducing group of organisms is monophyletic then there is, in theory, a common ancestral pair of organisms, although in practice it is likely to be a small group of inter-breeding organisms. That being so, the ancestry of each descendant individual must consist of a pair of intersecting trees, one maternal and one paternal.

This can be illustrated using this recent paper:
Pille Hallast, Pierpaolo Maisano Delser, Chiara Batini, Daniel Zadik, Mariano Rocchi, Werner Schempp, Chris Tyler-Smith, and Mark A. Jobling (2016) Great ape Y chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal. Genome Research 26: 427-439.
The authors sequenced autosomal DNA, as well as the Y chromosome (MSY) and the mitochondrial DNA (mtDNA), for each of 19 great ape males (orangutans, gorillas, chimpanzees, bonobos, and humans), and added this to the data for 24 published genomes. For the 19 individuals:
we carried out principal component analysis (PCA) of autosomal SNP variation (∼10,000–48,000 variable sites, depending on species) ... 17 of our 19 sequenced individuals lie within known subspecies clusters ... Two of the sequenced chimpanzees lie mid-way between clusters in the PCA, suggesting recent inter-subspecies hybridization in their ancestry (Tommy: Pan troglodytes verus / Pan troglodytes troglodytes hybrid; EB176JC: Pan troglodytes verus / Pan troglodytes ellioti hybrid).
This seems quite clear in their ordination diagram, as shown here.

Furthermore, for the other two datasets (43 males) the authors note:
PHYLIP v3.69 was used to create maximum parsimony phylogenetic trees for both MSY and mtDNA. Three independent trees were constructed with DNAPARS using randomization of input order with different seeds, each 10 times. Output trees of these runs were used to build a consensus tree with the consense program included in the PHYLIP package. Intraspecific MSY trees were rooted using the ancestral sequence generated and described in the Supplemental Text [basically, the allele matching the outgroup]. Intraspecific mtDNA trees were rooted using the Human Revised Cambridge Reference Sequence.
The two resulting trees for the 19 chimpanzees are shown here, with the MSY tree on the left and the mtDNA tree on the right.

One of the hybrid individuals identified in the autosomal analysis was labelled EB176JC, and he is clearly shown in a different place in each of the two trees: he is shown as having a Pan troglodytes verus (PTV) father and a Pan troglodytes ellioti (PTE) mother. Consequently, he will be placed at a reticulation node in any attempt to combine the two trees.

More oddly, the other individual, named Tommy, does not show this pattern at all. In the two trees he is shown as having both a Pan troglodytes troglodytes (PTT) father and mother, rather than one of them being identified as Pan troglodytes verus (PTV), as expected from the autosomes. The authors do not even note this apparently contradictory situation, let alone suggest an explanation. Clearly, however, no reticulation node will be needed in a combined phylogeny.
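The reasoning in the last two paragraphs can be sketched in a few lines of Python. This is only an illustration and not part of the Hallast et al. analysis: the function name and the dictionary layout are invented here, and the clade labels are simply those read off the two trees as described above (using the individual label quoted from the paper).

```python
# Illustrative sketch only: the data layout and the function name are
# assumptions, not anything used by Hallast et al. (2016).

def needs_reticulation(lineages):
    """Return the individuals whose paternal (MSY) and maternal (mtDNA)
    trees place them in different subspecies clades."""
    return sorted(
        name
        for name, (msy_clade, mtdna_clade) in lineages.items()
        if msy_clade != mtdna_clade
    )

# Clade placements as described in the text: (MSY tree, mtDNA tree).
chimps = {
    "EB176JC": ("PTV", "PTE"),
    "Tommy": ("PTT", "PTT"),
}

print(needs_reticulation(chimps))
```

Only the first individual is flagged. Tommy's hybrid ancestry, which the autosomes suggest, is invisible to a comparison based solely on the two uniparental markers.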

Tuesday, May 17, 2016

Machine learning, the Go-game, and language evolution

I am not a hard-core science fiction fan. I have not even watched the new Star Wars movie yet. But I am quite interested in all kinds of issues involving artificial intelligence, duels between humans and machines, and also the ethical implications as they are discussed, for example, in the old Blade Runner movie. It is therefore no wonder that my interest was caught by the recent Go-game human-machine challenge.

Silver et al. (2016) reported on a new Go program, called AlphaGo, that defeated other Go programs with a winning rate of 99.8%, and then also defeated the European Go champion, Fan Hui, by 5 games to 0. They proudly report in their paper (p. 488):
This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go — a feat that was previously believed to be at least a decade away.
The secret of the success of the new Go program seems to lie in a smart workflow by which the neural networks of the program were trained. As a result, the program could afford to calculate "thousands of times fewer positions than Deep Blue did in its chess match against Kasparov" (Silver et al. 2016: 489).

I should say that I was never really interested in the Go-game before. My father played it once in a while when I was a child, but I never understood what one actually needs to do. From the media articles reporting this fight between man and machine, however, I learned that the Go-game was apparently considered to be much more challenging than chess, due to the much larger number of possible positions and moves, and that nobody expected the time to be ripe already for machines to beat humans at this task.

When reading the article and reflecting on it, I wondered how complicated the task of finding homologous words in linguistic datasets might be compared to the Go-game. I know quite a few colleagues who consider this task impossible to model; and I know that they have not only good reasons, but also a lot of experience in language comparison, so they would not say this without having given it some serious thought. But if it is impossible for computer programs to compete with humans in language comparison, does this mean that the Go-game is a less challenging task?

On the other hand, I also know quite a few colleagues who consider automatic data-driven approaches in historical linguistics to be generally superior to the classical manual workflow of the comparative method (Meillet 1925). In fact, the algorithms for cognate detection that I developed during my PhD (List 2014) are often criticized as lacking a stochastic or machine-learning component, since they are based on a rather explicit attempt to model how historical linguists compare languages.

Among many classically oriented linguists there is a strong mistrust regarding all kinds of automated approaches in historical linguistics, while among many computationally oriented linguists and linguistically oriented computer scientists there is a strong belief that enough data will sooner or later solve the problems, and that all explicit frameworks with hard-coded parameters are inferior to data-driven frameworks. While classical linguists usually emphasize that the processes are just too complex to be modeled with the simple approaches used by computational linguists, the computational camp usually emphasizes the importance of letting "the data decide", or that "the data is robust enough to find signal even with simple models".

Given the success of AlphaGo, one could argue that the computational camp might be right, and that it will be just a matter of time until language comparison, which is currently done manually, is carried out in a fully automated manner. Our current situation in historical linguistics is somewhat similar to the situation in evolutionary biology during the 1960s and 1970s, when quantitative scholars prophesied (incorrectly, so far) that most classical taxonomists would soon be replaced by computers (Hull 1988: 121f).

However, since we are scientists, we should be really careful with any kind of orthodoxy, and I consider as problematic both the blind trust in machine learning techniques and the blind trust in the superiority of human experts over quantitative analyses. The problem with human experts is that they are necessarily less consistent and efficient than machines when it comes to tasks like counting and repeating. Given the increasing amount of digitally available data in historical linguistics, we simply lack the human resources to pursue classical research without trying to automate at least parts of it.

The problem with computational approaches, and especially machine-learning techniques, however, is that they only provide us with the result of an analysis, not with an explanation that would tell us why the result was preferred over alternative possibilities. Apparently, Go players now have this problem with AlphaGo, since in many cases they do not know why the program made a certain move; they only know that it turned out to be successful. This black-box aspect of many computational approaches does not necessarily constitute a problem in practical applications: the users of an application for automatic speech recognition won't care how it recognizes speech, as long as it understands their demands and acts accordingly. In science, however, it is not just the results that matter, but the explanation.

This is especially important in the historical sciences, where we investigate what happened in the past, and we constantly revise our knowledge about the past events by adjusting our theories and our interpretation of the evidence. If a machine tells me that two words in different languages are homologous, it is not the statement which is interesting but the explanation. Without the explanation, the statement itself is worthless. Since we are dealing with statements about the past, we can never really prove any statement that has been made. But what we can do is investigate explanations and compare the evolution of explanations in the past, thereby selecting those explanations that we prefer, perhaps because they are more probable, more general, or less complicated. A black-box method for word homology prediction would only make sense if we could evaluate the prediction — but if we could evaluate the prediction, we would not need the black-box method any more.

This does not mean that black-box methods are generally useless. A well-trained homology prediction machine could still speed up the process of data annotation, or assist linguists by providing them with initial hints regarding remotely related language families. But as long as black-box methods remain black boxes, they won't be able to replace the only ones who could still interpret what they produce.

  • Hull, D. (1988): Science as a Process - An Evolutionary Account of the Social and Conceptual Development of Science. The University of Chicago Press: Chicago.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Meillet, A. (1954 [1925]): La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.
  • Silver, D., A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016): Mastering the game of Go with deep neural networks and tree search. Nature 529.7587. 484-489.

Tuesday, May 10, 2016

The early history of sequence alignment

The historical development of the concept that we now call a "sequence alignment" is something that seems to have rarely been considered in the biological literature. Apparently, the idea took some time to develop.

To a bioinformatician, the history of sequence alignment starts in 1970, with the presentation of the dynamic programming algorithm of Needleman and Wunsch (1970). However, protein sequencing started fully 20 years earlier than this (see García-Sancho 2010); and by the end of the 1950s comparisons of amino-acid sequences among related organisms were beginning to appear. Even so, as noted by Eck (1961): "data on amino acid sequences can be sorted, tabulated and arranged in a great variety of ways ... Any such manipulation will produce some sort of pattern." Thus, a multiple sequence alignment was seen as only one of many possible data presentations, and not necessarily the most obvious one unless intended for an evolutionary analysis.

For example, most of these early comparative studies focussed on the structure (and thus function) of the proteins rather than on their evolution, and so they tended to present juxtapositions consisting of ungapped fragments of the sequences (eg. Brown et al. 1955; Tuppy and Dus 1958; Anfinsen 1959), particularly the active regions. Other studies were directed towards finding a solution to the problem of the genetic code (ie. how nucleotides code for amino acids), and their presentation of sequence alignments was similarly non-evolutionary (eg. Gamow et al. 1956; Tsugita and Fraenkel-Conrat 1960).

Nevertheless, the early work on molecular evolution did reveal that different protein molecules are homologous, including what are now called paralogs (eg. Itano 1957; Ingram 1961). With the sequencing of the proteins, it soon occurred to several people independently that the relative positions in the amino acid sequences are homologous as well (see Morgan 1998). This is an important distinction, because the latter refers to the 1:1 matching of the parts (amino acids) of a complex whole (the protein molecule), which is the usual empirical procedure for determining homology (Ghiselin 2016). However, most sequences were still presented unaligned (eg. Ingram 1961), until the work of Margoliash (1963) and Pauling and Zuckerkandl (1963), who can thus be seen as the pioneers of the modern form of sequence alignment.

The major problem with sequencing proteins in the 1960s was that it was still a slow and tedious procedure, so that data were rather scarce — the first major compilation of aligned sequences did not appear until 1965 (Dayhoff et al. 1965). Strasser (2010) provides an interesting coverage of the early uses of multiple amino-acid sequence alignments, including the development of one-letter codes for each of the amino acids in order to make the alignments more readable. García-Sancho (2010) and Suárez-Díaz (2014) discuss the subsequent development of experimental methods for the sequencing of RNA in the mid-1960s and then DNA in the mid-1970s, which greatly increased the need for an automated sequence alignment method. [García-Sancho (2012) provides a much more detailed discussion.]

Most importantly, a number of the early molecular sequence alignments were constructed by hand explicitly based on evaluation of the likely biological mechanisms that had produced the sequence variation. That is, the alignments made clear the originating molecular mechanisms. For example, Pauling and Zuckerkandl (1963) provided a pairwise alignment of two reconstructed ancestral amino-acid sequences of haemoglobin, along with a discussion of the substitutions and insertions / deletions.

Twenty years later, in what appears to be the first published study of intraspecific variation using DNA sequences, Kreitman (1983) took this idea further, and provided a very carefully considered multiple alignment based on explicit recognition of tandem repeats and RNA stem structures within the study gene. This was very much in line with traditional approaches to the assessment of homologies prior to phylogenetic tree building, for example when using morphological or anatomical characters.

However, immediately after this, practical computerized procedures were developed by Hogeweg and Hesper (1984), based on dynamic programming for pairwise sequence alignment (solely maximizing similarity, as explicitly noted in the title of the Needleman and Wunsch paper) and based on the progressive alignment strategy for multiple alignment. Then the Clustal computer program was released in 1988, which implemented these procedures in a usable manner for personal computers (see Chenna et al. 2003); and the history of studies in molecular evolution was thereby changed forever.
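For readers who have not seen it, the dynamic-programming idea behind pairwise alignment can be sketched as follows. This is the usual textbook form of global alignment rather than a transcription of the 1970 paper or of the Hogeweg and Hesper procedure, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). The scoring parameters are
    illustrative assumptions, not values from any cited paper.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Note that nothing in this procedure refers to tandem repeats, stem structures, or any other biological mechanism. It maximizes a similarity score and nothing else, which is precisely the contrast with the hand-made alignments described above.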

This brief history emphasizes one simple point about the relationship between homology and phylogeny — the apparent primary interest in the latter rather than the former, despite the fact that they are simply two views of the same dataset (phylogeny refers to the relationship among the rows of a multiple sequence alignment, while homology refers to the relationship among the columns). The first automated or semi-automated tree-building algorithm (the user could manually intervene at each step) was developed by Eck and Dayhoff (1966), followed by the first fully automated procedure presented by Fitch and Margoliash (1967). This was nearly 20 years before equivalent ideas were developed for homology assessment.


Christian B. Anfinsen (1959) The Molecular Basis of Evolution. Wiley, New York.

H. Brown, Frederick Sanger, Ruth Kitai (1955) The structure of pig and sheep insulins. Biochemical Journal 60: 556-565.

Ramu Chenna, Hideaki Sugawara, Tadashi Koike, Rodrigo Lopez, Toby J. Gibson, Desmond G. Higgins, Julie D. Thompson (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research 31: 3497-3500.

Margaret O. Dayhoff, Richard V. Eck, Marie A. Chang, Minnie R. Sochard (1965) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring MD.

Richard V. Eck (1961) Non-randomness in amino-acid "alleles". Nature 191: 1284-1285.

Richard V. Eck, Margaret O. Dayhoff (1966) Atlas of Protein Sequence and Structure, second edition. National Biomedical Research Foundation, Silver Spring MD.

Walter M. Fitch, Emanuel Margoliash (1967) Construction of phylogenetic trees. Science 155: 279-284.

George Gamow, Alexander Rich, Martynas Yčas (1956) The problem of information transfer from the nucleic acids to proteins. Advances in Biological and Medical Physics 4: 23-68.

Miguel García-Sancho (2010) A new insight into Sanger’s development of sequencing: from proteins to DNA, 1943–1977. Journal of the History of Biology 43: 265-323.

Miguel García-Sancho (2012) Biology, Computing and the History of Molecular Sequencing: From Proteins to DNA, 1945–2000. Palgrave Macmillan, Basingstoke UK.

Michael T. Ghiselin (2016) Homology, convergence and parallelism. Philosophical Transactions of the Royal Society, Series B 371: 20150035.

Paulien Hogeweg, Ben Hesper (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. Journal of Molecular Evolution 20: 175-186.

Vernon M. Ingram (1961) Gene evolution and the hæmoglobins. Nature 189: 704-708.

Harvey A. Itano (1957) The human hemoglobins: their properties and genetic control. Advances in Protein Chemistry 12: 215-268.

Martin Kreitman (1983) Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304: 412-417.

Emanuel Margoliash (1963) Primary structure and evolution of cytochrome c. Proceedings of the National Academy of Sciences of the USA 50: 672-679.

Gregory J. Morgan (1998) Emile Zuckerkandl, Linus Pauling, and the molecular evolutionary clock, 1959–1965. Journal of the History of Biology 31: 155-178.

Saul B. Needleman, Christian D. Wunsch (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Linus Pauling, Emile Zuckerkandl (1963) Chemical paleogenetics: molecular "restoration studies" of extinct forms of life. Acta Chemica Scandinavica 17: S9-S16.

Bruno J. Strasser (2010) Collecting, comparing, and computing sequences: the making of Margaret O. Dayhoff's Atlas of Protein Sequence and Structure, 1954–1965. Journal of the History of Biology 43: 623-660.

Edna Suárez-Díaz (2014) The long and winding road of molecular data in phylogenetic analysis. Journal of the History of Biology 47: 443-478.

Akira Tsugita, Heinz Fraenkel-Conrat (1960) The amino acid composition and c-terminal sequence of a chemically evoked mutant of TMV. Proceedings of the National Academy of Sciences of the USA 46: 636-642.

Hans Tuppy, K. Dus (1958) Eine Untersuchung über Cytochrom-c aus Hefe. Monatshefte für Chemie 89: 407-417.

Tuesday, May 3, 2016

Continued misuse of PCA in genomics studies

A few years ago I discussed some well-known methodological artifacts that can arise with the use of Principal Components Analysis (PCA) ordinations, and noted that these problems seem to be widespread in genomics studies (Distortions and artifacts in Principal Components Analysis analysis of genome data). This problem involves a spurious second axis in the output graph that is merely a curvilinear function of the first axis (rather than being an indication of important new information).

[Note: if you would like a description of PCA, try this blog post by Lior Pachter: What is principal component analysis?]

This distortion problem has been long known in research fields such as ecology, where it is referred to as the Arch Effect (or the Horseshoe Effect, or the Guttman Effect). It has previously been pointed out as a problem for genomic data when they form a clinal geographical pattern, although clearly the problem can involve much more than just geographical patterns.

The issue I previously raised was that the problem was being ignored by practitioners, which can lead to serious mis-interpretation of the data analysis. Here I note that this issue continues, apparently unabated.

For example, the following paper recently appeared:
Benjamin Vernot, et al. (2016) Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science 352: 235-239.
This paper contains the following pair of PCA ordinations, illustrating genomic variation among a sample of 159 geographically diverse humans. In both cases, the second axis (vertically) is clearly nothing more than a curved function of the first axis (horizontally).

The simplest interpretation of these diagrams is that there is a 1-dimensional spatial pattern (ie. a geographic gradient) that is being distorted into 2 dimensions. For example, in Figure B, from left to right, the geographic gradient proceeds from East to West to South.
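The distortion is easy to reproduce with made-up data. The following Python sketch builds a purely 1-dimensional gradient and runs a bare-bones PCA on it. Everything here is an assumption for illustration: the toy data (20 populations along a line, with each 0/1 marker changing state between two neighbouring populations) have nothing to do with the Vernot et al. dataset, and the eigenvectors are found by simple power iteration rather than by a statistics library.

```python
import math

def cline_data(n):
    """Toy 0/1 data: marker j changes state between populations j and j+1."""
    return [[1.0 if i > j else 0.0 for j in range(n - 1)] for i in range(n)]

def centred_gram(X):
    """Sample-by-sample cross-product matrix after centring each marker."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[v - m for v, m in zip(row, means)] for row in X]
    return [[sum(a * b for a, b in zip(r, s)) for s in Xc] for r in Xc]

def leading_axis(G, iters=500):
    """Largest eigenvalue and its eigenvector, by plain power iteration."""
    v = [math.sin(i + 1.0) for i in range(len(G))]  # arbitrary start vector
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in G]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(x * sum(g * y for g, y in zip(row, v)) for x, row in zip(v, G))
    return lam, v

def first_two_axes(X):
    """Sample scores on the first two principal axes."""
    G = centred_gram(X)
    lam1, v1 = leading_axis(G)
    # Deflation: subtract the first axis, then repeat to find the second.
    G2 = [[g - lam1 * a * b for g, b in zip(row, v1)] for row, a in zip(G, v1)]
    lam2, v2 = leading_axis(G2)
    return ([math.sqrt(lam1) * x for x in v1],
            [math.sqrt(lam2) * x for x in v2])

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

pc1, pc2 = first_two_axes(cline_data(20))
print(abs(correlation(pc1, list(range(20)))))        # axis 1 vs. position
print(abs(correlation(pc2, [x * x for x in pc1])))   # axis 2 vs. axis 1 squared
```

Both printed values should be very close to 1. The first axis recovers the order of the populations along the gradient, while the second axis is, for these toy data, essentially a quadratic function of the first. Plotting pc2 against pc1 therefore gives an arch, even though the data contain only one pattern.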

Gil McVean (2009. A genealogical interpretation of principal components analysis. PLoS Genetics 5: e1000686) identifies a few other limitations of PCA, including distortions produced by greatly unequal sample sizes among groups (such as populations).

Lest you think that all PCA diagrams are faulty, I should point out that when there are two or more patterns then PCA can work quite well — it is only when there is a single pattern that a 2-dimensional diagram will be distorted. Consider this diagram, from Pille Hallast, et al. (2016) Great ape Y chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal. Genome Research 26: 427-439:

There are four labelled groups here, and the first PCA axis separates PTV from the other three groups, while the second axis separates PTE from the other three, without distortion. [Any separation of PTS from PTT is presumably on the third axis, which is not shown.]

Finally, the paper by Vernot et al. (with which I started) does also contain a diagram that is more interesting for this blog. It is a manually constructed network illustrating the multiple inter-breeding events that the authors infer between Neandertals, Denisovans and various human geographical groups (as named in the first figure).

For an explanation of the inter-breeding events, see Nature News for 17 Feb 2016:
Evidence mounts for interbreeding bonanza in ancient human species.