Friday, July 31, 2015

Singapore, Day 5

Today was a mixed bag of talks.

Louxin Zhang started with a couple of proofs about what he called "stable" networks; and Stefan Grünewald developed his thoughts on quartet algorithms for splits graphs. At the other extreme, Nadine Ziemert talked entirely biology, introducing the audience to the problem of trying to study the evolution of secondary metabolites. In between, Eric Tannier tried to use horizontal gene transfer to date the nodes of networks, assuming that HGT requires a temporally consistent network. François-Joseph Lapointe produced the only really statistical talk of the week, trying to produce p-values for patterns on sequence similarity networks.

Daniel Huson popped in for the last day, and presented us with some ideas for the future development of both SplitsTree (unrooted networks) and Dendroscope (rooted networks). Apparently, the need is for SplitsTree to handle larger sets of trees, while for Dendroscope it is to produce networks from pairs of input trees. He also noted that there are still more networks being produced using median joining rather than neighbor-net, due to the amount of work being done on human mitochondrial sequences.

An interest was expressed in continuing the series of meetings on phylogenetic networks (Leiden 2012, Leiden 2014) — I first met most of the people working on networks in phylogenetics in Uppsala in 2004 (Phylogenetic Combinatorics and Applications).

Today we also celebrated Dan Gusfield's 2^6 birthday, with a strawberry cream cake.

So, all in all, a very successful meeting.

After the sessions finished, I went down to the Gardens By The Bay to look at the Supertree Grove. As you can see, a "super" tree is by any definition actually a network.

Thursday, July 30, 2015

Singapore, Day 4

There was more heavy maths today.

Charles Semple started by counting trees within specified types of network. In the process, he provided the first mathematical proof of the week (he actually provided two). He also raised the issue of what, exactly, is a phylogenetic network — we have had many mathematical restrictions placed on networks this week, and it is not always clear how any of them might relate to biological concepts.

Leo van Iersel tried constructing super-networks from incomplete sub-networks, sticking to algorithms rather than proofs. Yufeng Wu and Zhi-Zhong Chen later tried the same strategy for their networks, as did Lusheng Wang for pedigree comparison (he was the only person other than myself to even mention pedigrees).

Mike Steel considered under what circumstances a network can be viewed as a "tree with reticulations" rather than a non-tree network (i.e. one in which not every vertex is part of the same underlying tree); this led him to the interesting observation that whether a dataset can be represented by a tree can depend on the taxon sampling. He also looked at when a set of non-tree distances can appear to be tree-like, which is the sort of question that only a mathematician would ask.

Most of the audience interjections this week have come from Sagi Snir, and the rest of the speakers got to return the favor this afternoon, when he spoke about trying to reconstruct trees subject to large amounts of horizontal gene transfer. In the process, he also tried to "sketch" a mathematical proof, which turned into a full-sized painting, before moving on to his algorithm.

Wednesday, July 29, 2015

Singapore, Day 3

It was hard going this morning for the biologists, as there were three main computational talks. First, Vince Moulton further developed some of his ideas about split networks, including median networks, quasi-median networks and neighbor-nets, and what sorts of trees they might contain. Then, Céline Scornavacca expanded on her ideas for calculating the "hybrid number". Finally, Jens Lagergren outlined his work on fitting gene trees to known species trees and networks; this has come a long way in recent years.

We had the afternoon off, although many people took the opportunity to pretend that they were still in their offices at home. Myself, I sat by the pool waiting for the temperature to cool (this was the hottest day so far this week), and then went to the Singapore Botanic Gardens, where I circumnavigated the Evolution Garden, the National Orchid Garden, and the Rainforest Walk. I then briefly perused Orchard Road (one of the most ridiculous shopping meccas you will ever see) and the Raffles Hotel (an even more ridiculous hang-over from British imperialism), before returning to the poolside. It's a tough life.

Tuesday, July 28, 2015

Singapore, Day 2

The computational people were very patient today, as the three major talks focussed on biology, with only the shorter talks being computational.

In particular, Eric Bapteste and James McInerney were determined to tackle the true complexity of phylogenetics, rather than trying to see genealogical history as being a tree with reticulations. They have recently been championing sequence similarity networks as tools for exploring phylogenetic history, and Eric discussed them in relation to prokaryote evolution while James looked at gene families. Strictly speaking, SSNs are not phylogenetic networks, because they do not involve the inference of unobserved nodes connected to observed (labelled) nodes by inferred edges, but instead connect observed (labelled) nodes via observed edges. This does not mean that they have no role to play in phylogenetics, as the speakers made amply clear.

My own talk had little to do directly with empirical networks, but instead tried to look at an overview of the field, presenting some of my own ideas about where networks are heading, and what role they might play as phylogenetic tools. Not everyone was convinced.

Also of personal interest to me, Philippe Gambette unveiled the new, much more ambitious, version of the Who is Who in Phylogenetic Networks database. I claim no other role in this than encouraging Philippe to be as ambitious as possible. I think that people will be genuinely impressed by what can now be done to explore the people, literature and software associated with phylogenetic networks.

PDF copies of the talks have now started to appear on the workshop web page, which will give you a bit more idea of what our speakers have tried to say.

We also started the discussion about how to engender more effective development of computational tools for phylogenetic networks. Topics covered included the need for more gold-standard datasets that can be used to test new methods — to date, the ones available on this blog have been compiled by me alone, but in order to expand this other people will need to contribute. Also, improved communication and collaboration between biologists and computationalists would be very helpful, and several suggestions were canvassed, but no real way forward was found. One interesting point was made that many of the practical applications of networks were not likely to attract the professional interest of most computationalists — indeed, to date, most phylogenetics programs have been written by biologists rather than computational people.

In the evening, I had a very enjoyable dinner at the Singapore Seafood Republic, in the company of Louxin Zhang (our organizer) and some very nice Chinese visitors, all of whom politely failed to comment on my inability to use chopsticks. Pictures of dinner were taken, and may appear on Facebook, if I am not careful. We finished just in time to watch the Sentosa Crane Dance, which you should all check out.

Monday, July 27, 2015

Singapore, Day 1

I'm not sure how these reports are going to go, as I did not bring a laptop with me. Also, Blogger is not happy with me logging in from another country. However, I have managed to get a decent-sized keyboard on the screen of my iPad Mini, so I can at least type somewhat normally. I will not, however, write about every talk (and my apologies to those speakers who do not get mentioned).

Singapore is as expected — hot and humid; except when one is indoors, and even 24 °C surprisingly feels cold. I have washed most of my shirts once already, to remove the perspiration.

Most people seem to have arrived; indeed, many have already been here for a few days. Myself, I spent Sunday afternoon touring Chinatown, and Little India. The food market at the latter location was unbelievably hot, although the locals did not seem to realize this.

We have now dealt with the first day of talks. No mercy was shown to the uninitiated: we began with the heavy network material straight away.

This took the form of Dan Gusfield explaining to us in no uncertain terms that Integer Linear Programming can be used to solve many computational problems that are too hard for Dynamic Programming, using Ancestral Recombination Graphs as his example. When asked about possible connections to actual biology, he patiently explained that this was another matter entirely. Kathi Huber later said the same thing when asked about the loss of biological information resulting from unrooting a rooted network. At the time, she was trying to "bridge the gap" between rooted and unrooted networks, and unrooting them is a surprisingly effective way to achieve this.

Luay Nakhleh's talk was my favorite of the day. He is one of the few people in this business who can successfully talk computations to a mathematician and biology to a biologist — most of the rest of us fail at one or the other (or both). Sadly, he pointed out that under the coalescent model any gene tree fits inside any species tree (or network), simply by having the gene coalescences occur after the species root is reached. He also noted that we can't distinguish among reticulation processes on a network, which took away one third of my talk!

We finished with Jesper Jansson decomposing networks into triangles, which is a neat change from the usual decomposition into triplets, clusters or trees. Along the way, he concluded that we need to keep using a lot of different measures for network-to-network distances, because none of the current ones are good under all conditions. That is another major difference between trees and networks.

Wednesday, July 22, 2015

Phylogenetic Network Workshop, Singapore

Next week there will be a gathering in Singapore, for a Phylogenetic Network Workshop. This is being hosted by the Institute for Mathematical Sciences, at the National University of Singapore.

The workshop has been organized under the guidance of Louxin Zhang. The program and abstracts can be found here. It runs for the whole week, 27 – 31 July 2015.

The workshop is actually the final part of a much larger, 2-month programme, called Networks in Biological Sciences (1 June – 31 July 2015). This programme is focused on mathematics for network models in biology, including complex networks and systems biology. Network modeling is extremely challenging, and so it offers outstanding opportunities for mathematicians and statisticians. The phylogenetics workshop will focus on the mathematics needed to develop fast and robust computer programs for inferring evolutionary network models from biological sequence data.

The participants are principally from the computational sciences, of course, including many who have attended the previous network workshops in Leiden, in the Netherlands, in October 2012 and July 2014. There are, however, a few biologists to round out the field, including myself.

Singapore is hot and humid for most of the year, and July is no exception. So, I am expecting the unacclimatized participants to spend most of their time indoors, avoiding the daily thunderstorms.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds.

Monday, July 20, 2015

The Tree of Architecture

The following diagrams are taken from the book A History of Architecture on the Comparative Method for the Student, Craftsman, and Amateur. This book is considered to be "a canonical text that has played a formative role in the education of generations of architects" because it really does "cram everything into a single volume". The first edition of the book appeared in 1896, with the 20th edition appearing in 1996.

The first picture is from the 5th edition (1905), and the second one is from the 16th edition (1954).

As noted in the first figure, these trees purport to show the "evolution" of the various architectural styles. However, they do no such thing.

At the base of the tree trunk is a set of individual architectural styles that apparently led nowhere, while at the crown of the tree several styles are repeated. Each of the latter styles exists on two side-branches from the main trunk, each pair connected by vertical tendrils. So, this is a network, at least. However, the meaning of this network is not immediately obvious. Indeed, even a short perusal of the diagram should lead you to the idea that the meaning is contained more in cultural bias than in the actual history of architecture.

The history of the book itself is somewhat complex. The first edition was written by the father and son team of Banister Fletcher & Banister F. Fletcher. Subsequent editions were revised by Banister F. Fletcher (the son), with the 6th edition (1921) being rewritten by Fletcher and his first wife (who got no credit, even though the father's name was then dropped). After Fletcher's death in 1953, the 17th edition (1961) was revised by R.A. Cordingley, the 18th (1975) by James Palmes, the 19th (1987) by John Musgrove, and the 20th (1996) by Dan Cruickshank. The tone and arrangement of the book were changed with each edition.

The tree has been analyzed in detail by Gülsüm Baydar Nalbantoglu (1998. Toward postcolonial openings: rereading Sir Banister Fletcher's "History of Architecture". Assemblage 35: 6-17). She notes the following:
Until the fourth edition of 1901, A History of Architecture had been a relatively modest survey of European styles. The fourth edition, however, appeared with an important difference: this time the book was divided into two sections, "The Historical Styles", which covered all the material from earlier editions, and "The Non-Historical Styles", which included Indian, Chinese, Japanese, Central American, and Saracenic architecture. 
The "Tree of Architecture" has a very solid upright trunk that is inscribed with the names of European styles and that branches out to hold various cultural / geographical locations. The nonhistorical styles, which unlike others remain undated, are supported by the "Western" trunk of the tree with no room to grow beyond the seventh-century mark. European architecture is the visible support for nonhistorical styles. Nonhistorical styles, grouped together, are decorative additions, they supplement the proper history of architecture that is based on the logic of construction. 
In the posthumously published seventeenth edition of 1961, the two parts were renamed "Ancient Architecture and the Western Succession" and "Architecture in the East", respectively. The nineteenth edition of 1987, on the other hand, consisted of seven parts based on chronology and geographical location. Cultures outside of Europe included "The Architecture of the Pre-Colonial Cultures outside Europe" and "The Architecture of the Colonial and Post-Colonial Periods outside Europe".

That is, "architecture" for the Banisters was defined as being about a building's construction, not its decoration. European cultures focused on construction, and they developed their styles through time. Other cultures focused on decoration, and were therefore not a proper part of architecture, and had no historical development. This is what the tree attempts to show.

This cultural bigotry was corrected in the final few editions of the book (after the Fletchers were no longer involved), where all architectural styles were considered more-or-less equal.

Wednesday, July 15, 2015

What is "science phylogeny"?

Some years ago I came across this paper in the arXiv:
David Chavalarias and Jean-Philippe Cointet (2010) The reconstruction of science phylogeny. arXiv:0904.3154v3
I was intrigued by what they could possibly mean by "science phylogeny". The abstract contains this information:
We are facing a real challenge when coping with the continuous acceleration of scientific production and the increasingly changing nature of science. In this article, we extend the classical framework of co-word analysis to the study of scientific landscape evolution. Capitalizing on formerly introduced science mapping methods with overlapping clustering, we propose methods to reconstruct phylogenetic networks from successive science maps, and give insight into the various dynamics of scientific domains ... These results suggest that there exist regular patterns in the “life cycle” of scientific fields. The reconstruction of science phylogeny should improve our global understanding of science evolution and pave the way toward the development of innovative tools for our daily interactions with its productions. Over the long run, these methods should lead quantitative epistemology up to the point to corroborate or falsify theoretical models of science evolution based on large-scale phylogeny reconstruction from databases of scientific literature.
The only actual description of phylogenetic methods is this:
The core question is: How can we reconstruct science dynamics through automated bottom-up analysis of scientific publications? ... The reconstruction of these inheritance patterns will be very useful to get a global overview of the activity and evolution of large scientific domains. Moreover, contrary to what is often encountered in biology, we should expect some hybridization events between fields of research, which requires switching from phylogenetic trees to phylogenetic networks. Reconstructing the phylogenetic network of science consists in answering this simple question: given a scientific field CT' at period T' and a period T prior to T', from which fields at T does CT' derives its conceptual legacy? To achieve inter-temporal matching between fields, we have to find for each field at T the field or union of fields from which it inherits.
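The matching question in that passage can be made concrete with a small sketch. This is not the authors' method, which the quoted text does not specify. It simply assumes that each field is a set of terms, and that a later field inherits from every earlier field whose Jaccard overlap with it reaches a threshold. The names `jaccard`, `find_parents`, the threshold of 0.2 and the example fields are all hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap between two sets of terms (0 = disjoint, 1 = identical)."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def find_parents(later_field, earlier_fields, threshold=0.2):
    """Return the names of earlier fields that overlap enough with a later field.

    The threshold is an arbitrary illustrative choice, not a published value.
    """
    return sorted(
        name
        for name, terms in earlier_fields.items()
        if jaccard(terms, later_field) >= threshold
    )


# Hypothetical fields at period T, each described by a set of terms.
fields_at_t = {
    "field_1": {"gene", "tree", "species"},
    "field_2": {"network", "graph", "hybrid"},
    "field_3": {"protein", "fold"},
}

# Hypothetical field at the later period T'.
field_at_t_prime = {"tree", "species", "network", "hybrid"}

parents = find_parents(field_at_t_prime, fields_at_t)
print(parents)
```

A later field with more than one parent corresponds to the "hybridization events" mentioned in the quote, which is why the result has to be drawn as a network rather than as a tree.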
When the authors formally published their work, the literature had changed, and the reference to phylogenetic networks had been replaced:
David Chavalarias, Jean-Philippe Cointet (2013) Phylomemetic patterns in science evolution — the rise and fall of scientific fields. PLOS One 8: e54847.
The abstract contains this information:
We introduce an automated method for the bottom-up reconstruction of the cognitive evolution of science, based on big-data issued from digital libraries, and modeled as lineage relationships between scientific fields. We refer to these dynamic structures as phylomemetic networks or phylomemies, by analogy with biological evolution; and we show that they exhibit strong regularities, with clearly identifiable phylomemetic patterns.
The explanation of phylomemetics is this:
[The] evolution of science, featuring innovations, cross-fertilization and selection, is suggestive of an analogy with the evolution of living organisms. We propose an adaptation of the concept of the phylogenetic tree, and combine it with the Richard Dawkins intuition of meme, to refer to phylomemetic networks (or phylomemy), which describes the complex dynamic structure of transformation of relations between terms. The concept of "phylomemetic network" is used by analogy to biological phylogenetic trees, which account for evolutionary relationships between genes. We do not make any assumption concerning the type of dynamics underlying the evolution and diffusion of terms. As such, contrarily to previous works in line with the memetics theory [9], which have already coined the term, we do not claim that cultural entities (memes) evolve following the same laws of selection as biological replicators (genes) do.
The term "phylomemetics" was coined by:
Christopher J. Howe and Heather F. Windram (2011) Phylomemetics — evolutionary analysis beyond the gene. PLoS Biology 9: e1001069.
However, you should note that Chavalarias & Cointet explicitly distance themselves from Howe & Windram's claim that cultural entities (memes) evolve following the same laws of selection as biological replicators (genes) do. They also insist upon a network representation rather than Howe & Windram's use of a tree.

The resulting networks are rather odd looking things, with multiple roots occurring at different times. There is one network for each of the selected fields of science (defined by their use of specific terminology). This is the one for the term "Gap junctions":

Monday, July 13, 2015

The films of Jacques Tati

When I was young, my siblings and I used to go to the movies regularly with my father. One of the movies I remember well, even after more than 40 years, was Jacques Tati's final cinema release, Trafic.

Tati made only five cinema features, plus several short movies, and one final made-for-TV movie (made in Sweden in 1974):
  • Jour de Fête (1949)
  • Les Vacances de Monsieur Hulot (1953)
  • Mon Oncle (1958)
  • Playtime (1967)
  • Trafic (1971)
All of them were originally in French, and were released internationally with subtitles. However, Tati came from the world of mime, and so his movies had very little dialog anyway, relying instead on the "moving picture" aspect of film making.

In spite of his small output, Tati managed to have a large impact on world cinema. His movies won several awards, notably at the Cannes Film Festival and the Venice Film Festival, and Mon Oncle won the Academy Award for Best Foreign-Language Film (and Les Vacances de Monsieur Hulot received a nomination for Best Screenplay). His movies regularly appear in "Top 50" and "Top 100" lists. Many people have acknowledged his influence, and Rowan Atkinson's character Mr Bean is basically an updated English-language version of Tati's character M. Hulot (even to the extent of making Mr Bean's Holiday). There is even a small homage to Mon Oncle near the beginning of The Blues Brothers.

Not unexpectedly, then, Tati's movies continue to attract the attention of the critics. At the aggregator site Rotten Tomatoes, his movies have 100% positive reviews, except for Mon Oncle which unexpectedly has only 92%. There are a total of 89 critics listed as having written reviews about at least one of Tati's movies, although only 28 of them have reviewed more than one of the films. Indeed, only four critics have provided individual reviews of all five movies (and a few others have reviewed them collectively).

The fact that so few of the movie reviewers compare Tati's movies can be used to illustrate the dangers of assessing things in isolation. If we average the reviewers' scores for the movies, then we get this (standardized to a scale of 0-1):
[Table: No. reviewers and Average score for each film (Jour de Fête, Les Vacances de Monsieur Hulot, Mon Oncle, Playtime, Trafic); the values are not preserved here.]
Clearly, Playtime is the favorite, with Trafic trailing the field (although still with a good score).

On the other hand, this pattern is not quite repeated when we consider only the pairwise preferences of those critics who scored at least two of the films.

Sometimes, the overall pattern is repeated. For example, you will note that Les Vacances de Monsieur Hulot scored higher than the other movies except for Playtime. Of the reviewers who also scored Les Vacances, 5 preferred Playtime, 4 scored them as equal, and 1 preferred Les Vacances, so that Playtime is clearly preferred. Similarly, 4 critics preferred Les Vacances to Trafic, 3 scored them as equal, and no-one preferred Trafic.

However, overall Les Vacances de Monsieur Hulot scored higher than Mon Oncle, which also reflects the latter's 92% "fresh" rating noted above, but this pattern is not repeated for the pairwise comparisons. If we look at the 10 reviewers who scored both of these movies, then 4 preferred Les Vacances to Mon Oncle, 3 scored them as equal, and 3 preferred Mon Oncle, thus showing little preference for one film over the other.

So, direct comparisons can be more important than independent assessments. Some of you will recognize this as an example of Simpson's Paradox.
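The difference between the two kinds of comparison can be sketched in a few lines of Python. The scores below are invented purely for illustration (they are not the Rotten Tomatoes data), and the function names are hypothetical. The point is only that an average over all critics and a head-to-head tally over shared critics can disagree.

```python
def average_scores(scores):
    """Average each film's score over every critic who scored it."""
    collected = {}
    for ratings in scores.values():
        for film, value in ratings.items():
            collected.setdefault(film, []).append(value)
    return {film: sum(values) / len(values) for film, values in collected.items()}


def pairwise_preferences(scores, film_a, film_b):
    """Count critics who scored both films: (prefer film_a, tie, prefer film_b)."""
    prefer_a = ties = prefer_b = 0
    for ratings in scores.values():
        if film_a not in ratings or film_b not in ratings:
            continue  # this critic cannot compare the two films directly
        if ratings[film_a] > ratings[film_b]:
            prefer_a += 1
        elif ratings[film_a] < ratings[film_b]:
            prefer_b += 1
        else:
            ties += 1
    return prefer_a, ties, prefer_b


# Invented scores on a 0-1 scale; critics 1-3 scored only one film each.
invented_scores = {
    "critic_1": {"film_x": 1.0},
    "critic_2": {"film_x": 1.0},
    "critic_3": {"film_y": 0.6},
    "critic_4": {"film_x": 0.8, "film_y": 0.8},
    "critic_5": {"film_x": 0.7, "film_y": 0.9},
    "critic_6": {"film_x": 0.9, "film_y": 0.7},
}

averages = average_scores(invented_scores)
head_to_head = pairwise_preferences(invented_scores, "film_x", "film_y")
print(averages)
print(head_to_head)
```

In this invented example film_x has the higher average, but the three critics who scored both films split evenly, because the average is driven by critics who saw only one of the two.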

As usual, we can use a phylogenetic network as a form of exploratory data analysis, to compare all five movies in a single diagram. I first used the Gower similarity to calculate the similarity of the five movies based on those 20 reviewers who scored more than one movie. This was followed by a Neighbor-net analysis to display the between-film similarities as a phylogenetic network. So, films that are closely connected in the network are similar to each other based on their scores, and those that are further apart are progressively more different from each other.
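For readers who want to see what the first step involves, here is a minimal sketch of a Gower-style similarity for scores with missing values. It assumes that every score is already on the 0-1 scale, so the range used for normalization is fixed at 1, and that reviewers who did not score both films are simply skipped for that pair. The function names and the example scores are hypothetical, and the Neighbor-net step itself is not shown.

```python
from itertools import combinations


def gower_similarity(scores, film_a, film_b, score_range=1.0):
    """Mean of 1 - |difference| / range over reviewers who scored both films.

    Returns None when no reviewer scored both films.
    """
    shared = [
        (ratings[film_a], ratings[film_b])
        for ratings in scores.values()
        if film_a in ratings and film_b in ratings
    ]
    if not shared:
        return None
    per_reviewer = [1.0 - abs(a - b) / score_range for a, b in shared]
    return sum(per_reviewer) / len(per_reviewer)


def gower_distances(scores, films):
    """Pairwise distances (1 - similarity) for every pair of films."""
    distances = {}
    for film_a, film_b in combinations(films, 2):
        similarity = gower_similarity(scores, film_a, film_b)
        distances[(film_a, film_b)] = None if similarity is None else 1.0 - similarity
    return distances


# Invented scores on a 0-1 scale; not the data used for the network.
example_scores = {
    "reviewer_1": {"film_x": 1.0, "film_y": 0.5},
    "reviewer_2": {"film_x": 0.5, "film_y": 0.5, "film_z": 1.0},
    "reviewer_3": {"film_z": 0.75},
}

example_distances = gower_distances(example_scores, ["film_x", "film_y", "film_z"])
print(example_distances)
```

A distance matrix of this kind is the sort of input that a Neighbor-net program takes.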

Clearly, the movies are not really very different from each other in score, and there is little preference for one over another for these 20 critics. This contrasts with the scores from all 89 critics.

The "audience score" at Rotten Tomatoes differs somewhat from the critics' scores. They score Playtime (90%) and Mon Oncle (89%) at the top, followed by Les Vacances de M. Hulot (86%) and Jour de Fête (85%), and finally Trafic (77%). In spite of this, I still have a soft spot for Trafic, although Mon Oncle is my personal favourite.

Wednesday, July 8, 2015

Productive and unproductive analogies between biology and linguistics

Genotypes or phenotypes?

In a blogpost from 2013, David investigated some of the popular analogies between anthropology (including linguistics) and biology. He rejected those analogies that compare the genotype with anthropological entities (like the common "words = genes" analogy). Instead, he proposed to draw the analogy between anthropological entities and the phenotype. I generally agree that we should be very careful about the analogies we draw between different disciplines, and I share the scepticism regarding those naive approaches in which genes are compared with words or sounds are compared with nucleotide bases. I am, however, sceptical whether the alternative analogy between phenotypes and anthropological entities offers a general solution for the study of language evolution.

Productive and unproductive analogies

My scepticism results from a general uncertainty about the transfer of models and methodologies among scientific disciplines. I am deeply convinced that such a transfer is useful and that it can be fruitful, but we seem to lack a proper understanding of how to carry out such a transfer. Apart from this general uncertainty as to how to do it properly, I think that for linguistics the analogy between phenotypes and linguistic entities is too broad to be successfully applied.

Instead of drawing general analogies between biology and linguistics, it would be more useful to carry out a fine-grained analysis of productive analogies between the two disciplines. By productive, I mean that the analogies should lead to an interdisciplinary transfer of models and methods that increases the insights about the entities in the discipline that imports them. If this is not the case for a given analogy, this does not mean that the analogy is wrong or false, but rather that it is simply unproductive, since an analogy is just a similarity between entities from different domains, and what we define as being "similar" crucially depends on our perspective. With enough fantasy, we can draw analogies between all kinds of objects, and we never really know the degree to which we construct rather than detect, as I have tried to illustrate in the graphic below.

Constructed or detected similarities?

Local productive analogies: alignment analyses

A productive analogy does not necessarily have to be global, offering a full-fledged account of shared similarities, as in the analogies which compare, for example, languages with organisms (Schleicher 1848) or languages with species (Mufwene 2001), but also the analogy between phenotypes and anthropological entities proposed by David. It is likewise possible to find very useful local analogies, which only hold to a certain extent, but offer enough insights to get started.

Consider, for example, the problem of sequence alignment in biology and linguistics. It is clear that both biologists and linguists carry out alignment analyses of some of the entities they are dealing with in their disciplines. We use alignment analyses in biology and linguistics, since both disciplines have to deal with entities that are best modeled as sequences, be it sequences of DNA, RNA, or amino acids in biology, or sequences of sounds in linguistics. In both cases, we are dealing with entities in which a limited number of symbols is linearly ordered, and an alignment analysis is a very intuitive and fruitful way to show which of the symbols in two different sequences correspond.

In this very general point, the analogy between words as sequences of sounds and genes as sequences of nucleic acids holds, and it seems straightforward to think of transferring models and methods between the disciplines (in this case from biology to linguistics, since automatic sequence alignment has a longer tradition in biology).

In the details, however, we will detect differences between biological and linguistic sequences, with the main differences lying in the alphabets (the collections of symbols) from which our sequences are drawn (discussed in more detail in List 2014: 61-75):
  • Biological alphabets are universal, that is, they are basically the same for all living creatures, while the alphabets of languages are specific for each and every language or dialect.
  • Biological alphabets are limited and small regarding the number of symbols, while linguistic alphabets are widely varying and can be very large in size.
  • Biological alphabets are stable over time, with sequences changing by the replacement of symbols with other symbols drawn from the same pool of symbols, while linguistic alphabets are mutable: not only can they acquire new sounds or lose existing ones, but also the sounds themselves can change.

How similar are words and genes in the end?

What are the consequences of these differences in the word-gene analogy? Can we still profit from the long tradition of automatic alignment methods when dealing with phonetic alignment (the alignment of sound sequences, like words or morphemes) in linguistics? Yes, we can! But within limits!

Linguists can profit from the general frameworks for sequence alignment developed in biology, but we need to make sure that we adapt them according to our linguistic needs. For alignment methods, this means, for example, that we can use the traditional frameworks of dynamic programming for pairwise alignment, which were developed back in the seventies (Needleman and Wunsch 1970, Smith and Waterman 1981). We can also use some of the frameworks for multiple sequence alignment, which were developed a bit later, starting from the end of the eighties, be it progressive (Feng and Doolittle 1987, Thompson et al. 1994, Notredame et al. 1998), iterative (Barton and Sternberg 1987, Edgar 2004), or probabilistic (Do et al. 2005). But we can only import the overall frameworks, not their details.
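To make the distinction between framework and details concrete, here is a minimal sketch of the dynamic-programming framework for global pairwise alignment, applied to sound sequences. The scoring function (+1 for identical sounds, -1 otherwise), the gap penalty of -1, and the simplified transcriptions are all illustrative assumptions, not a linguistically adequate model.

```python
GAP = "-"


def simple_score(sound_a, sound_b):
    """Naive scorer: identical sounds match, everything else mismatches."""
    return 1 if sound_a == sound_b else -1


def needleman_wunsch(seq_a, seq_b, score=simple_score, gap_penalty=-1):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, rows):
        table[i][0] = i * gap_penalty
    for j in range(1, cols):
        table[0][j] = j * gap_penalty
    # Fill the table with the best score for each pair of prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                table[i - 1][j] + gap_penalty,
                table[i][j - 1] + gap_penalty,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                table[i][j] == table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap_penalty:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(GAP)
            i -= 1
        else:
            aligned_a.append(GAP)
            aligned_b.append(seq_b[j - 1])
            j -= 1
    aligned_a.reverse()
    aligned_b.reverse()
    return table[-1][-1], aligned_a, aligned_b


# Simplified, hypothetical transcriptions of German "Tochter" and English "daughter".
word_a = ["t", "ɔ", "x", "t", "ə", "r"]
word_b = ["d", "ɔ", "t", "ə", "r"]

alignment_score, row_a, row_b = needleman_wunsch(word_a, word_b)
print(alignment_score)
print(" ".join(row_a))
print(" ".join(row_b))
```

The recursion and the traceback are the framework that can be imported unchanged. The `score` argument is where the details live: a scorer adapted to sound sequences would be passed in here in place of `simple_score`.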

All algorithms for phonetic alignment that are supposed to be applicable to a wide range of data (and not serve as a mere proof of concept that handles but a limited range of test datasets) need to address the specific characteristics of sound sequences. Apart from the differences in alphabet size and the mutable character of sound systems mentioned above, these differences also include the important role that context plays in sound change (List 2014: 26-33), the problem of secondary sequence structures (List 2012a), the problem of metathesis (List 2012: 51f), but also the problem of unalignable parts resulting from cases of partial and oblique homology in language evolution (see my recent blog post on this issue).

Concluding remarks

Drawing analogies between the research objects of different disciplines is not a bad idea, and it can be very inspiring, as multiple cases in the history of science show. When transferring models and methods from one discipline to another, however, we need to make sure that the analogies we use are productive, adding value to our research and understanding. We should never expect that analogies hold in all details. Instead we need to be aware of their specific limits, and we need to be willing to adapt those models and methods we transfer to the needs of the target discipline. Only then can we make sure that the analogies we use are really productive in the end.


  • Barton, G. J. and M. J. E. Sternberg (1987). “A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons”. J. Mol. Biol. 198.2, 327–337.
  • Do, C. B., M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou (2005). “ProbCons. Probabilistic consistency-based multiple sequence alignment”. Genome Res. 15, 330–340.
  • Edgar, R. C. (2004). “MUSCLE. Multiple sequence alignment with high accuracy and high throughput”. Nucleic Acids Res. 32.5, 1792–1797.
  • Feng, D. F. and R. F. Doolittle (1987). “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”. J. Mol. Evol. 25.4, 351–360.
  • List, J.-M. (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.  
  • List, J.-M. (2012a). “Improving phonetic alignment by handling secondary sequence structures”. In: Hinrichs, E. and Jäger, G.: Computational approaches to the study of dialectal and typological variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012.
  • List, J.-M. (2012b). “Multiple sequence alignment in historical linguistics. A sound class based approach”. In: Proceedings of ConSOLE XIX. “The 19th Conference of the Student Organization of Linguistics in Europe” (Groningen, 01/05–01/08/2011). Ed. by E. Boone, K. Linke, and M. Schulpen, 241–260.
  • Mufwene, S. S. (2001). The ecology of language evolution. Cambridge: Cambridge University Press.
  • Needleman, S. B. and C. D. Wunsch (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. J. Mol. Biol. 48, 443–453.
  • Notredame, C., L. Holm, and D. G. Higgins (1998). “COFFEE. An objective function for multiple sequence alignment”. Bioinformatics 14.5, 407–422.
  • Schleicher, A. (1848). Zur vergleichenden Sprachengeschichte [On comparative language history]. Bonn: König.
  • Smith, T. F. and M. S. Waterman (1981). “Identification of common molecular subsequences”. J. Mol. Biol. 147.1, 195–197.
  • Thompson, J. D., D. G. Higgins, and T. J. Gibson (1994). “CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”. Nucleic Acids Res. 22.22, 4673–4680.

Monday, July 6, 2015

Rivers of Life, instead of trees

In an earlier blog post, I discussed some of the evocative Metaphors for evolutionary relationships, particularly reticulating ones.

In that post I listed the concept of a "braided river", and mentioned a 1994 paper by John Moore as my earliest source for the image. However, the metaphor actually goes back more than 100 years earlier. It occurs as the central metaphor in this quite remarkable book on comparative religion:
Forlong, J.G.R. (1883) Rivers of Life: or Sources and Streams of the Faiths of Man in All Lands, Showing the Evolution of Faiths from the Rudest Symbolisms to the Latest Spiritual Developments. 2 vols. Bernard Quaritch: London.
James George Roche Forlong was a Scottish engineer serving in the British army that occupied India during the 19th century. He apparently had a life-long interest in comparative religion, and his book arose from his personal experience of non-Christian religions (facilitated by his knowledge of several languages). The book involves a serious re-interpretation of the evolutionary history of world religions, as a series of six inter-connecting rivers running from ancient times into the modern world, each river representing a different type of worship.

The illustrative chart that accompanies the book can be viewed here. A low-resolution copy is shown below.

Wednesday, July 1, 2015

Networks of admixture or introgression

There are several processes that create reticulate phylogenetic topologies, including hybridization, introgression (or admixture) and horizontal gene transfer (HGT). Biologically, introgression operates via the same mechanism as does hybridization (ie. during sexual reproduction), but it results in only a small amount of genetic material entering the recipient genome, making an admixed genome that is similar to the end result of HGT.

In practice, the construction of phylogenetic networks in situations where introgression or HGT has occurred has differed somewhat from the approach used for hybridization. Hybridization has usually been tackled by merging incongruent tree topologies, based on the idea that the different topologies represent the phylogenetic history of the different genomes of the hybrid taxon. Introgression and HGT have usually been tackled by adding reticulation edges to a phylogenetic tree, on the basis that the tree represents the phylogenetic history of the main part of the genome.

So, the study of introgression (and HGT) involves (a) constructing a phylogenetic tree from some genomic sample, and (b) detecting the introgressed (or HGT) parts of the genome. This is potentially a problematic procedure, because how do we construct a phylogenetic tree from data that already contain non-tree components? Apparently, the expectation is that a single tree will be supported by the majority of the data, and the remainder will represent the introgressed (or HGT) pathway(s), plus whatever other components have created the observed genomic variability (such as incomplete lineage sorting, gene duplication-loss, and stochastic mutations).

Recently, there have been quite a few studies published that have adopted a specific protocol for this procedure, usually under the rubric of admixture. Most of these have involved the study of ancient human DNA, but there have also been studies of contemporary humans, as well as ancient non-humans. An example of the latter is shown in the next two figures, which represent parts (a) and (b), respectively. They are taken from this study of the relatives of horses: Hákon Jónsson, et alia (2014) Speciation with gene flow in equids despite extensive chromosomal plasticity. Proceedings of the National Academy of Sciences of the USA 111: 18655-18660.

The phylogenetic tree (step a) was constructed using "maximum likelihood inference and 20,374 protein-coding genes ... based on a relaxed molecular clock." So, only stochastic mutations were accounted for when constructing the tree, and not incomplete lineage sorting or gene duplication-loss.

The detection of introgression (step b) used "the D statistics approach, which tests for an excess of shared polymorphisms between one of two closely related lineages (E1 or E2) and a third lineage (E3)". The reticulations representing the detected gene flow were then added to the tree manually.

The D-statistic is also known as the ABBA-BABA test (see: Patterson NJ et alia. 2012. Ancient admixture in human history. Genetics 192: 1065-1093). It operates as follows for sets of four taxa, applied to character data.

Let the species tree be (((E1,E2),E3),O), where E1–E3 are the three taxa being compared, and O is the outgroup.

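There are three possible allele trees for each binary character (ie. single nucleotide polymorphism) in which states are shared pairwise. Writing the states in the order E1, E2, E3, O, with A as the ancestral state and B as the derived state, these correspond to the patterns BBAA, ABBA and BABA.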

In the first tree, E3 shares the ancestral character state with the outgroup, which is expected to be the most common pattern in the absence of gene flow. E1 and E2 share the ancestral state with the outgroup in the second and third trees, respectively.

The admixture test compares the ABBA tree to the BABA tree. The expectation is that if there has been no introgression then the data support for these two trees should be equal. That is, under the null hypothesis that there is no gene flow between the species (and the underlying species tree is correct), the difference in the expected number of occurrences of the ABBA and BABA patterns should be zero. Deviation from this expectation is statistically evaluated using a jackknife procedure.
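The calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions: each site is a tuple of alleles in the order (E1, E2, E3, O), the outgroup allele is taken as ancestral, and the jackknife uses equal-sized contiguous blocks without weighting, which is a simplification of what published implementations do. The function names (`count_patterns`, `d_statistic`, `block_jackknife`) are hypothetical and do not come from any particular software package.

```python
import math
from typing import List, Sequence, Tuple

Site = Tuple[str, str, str, str]  # alleles in the order (E1, E2, E3, O)


def count_patterns(sites: Sequence[Site]) -> Tuple[int, int]:
    """Count ABBA and BABA sites, treating the outgroup allele as ancestral."""
    abba = baba = 0
    for e1, e2, e3, o in sites:
        if len({e1, e2, e3, o}) != 2:
            continue  # skip sites that are not biallelic
        if e1 == o and e2 == e3 and e2 != o:
            abba += 1  # E2 and E3 share the derived state
        elif e2 == o and e1 == e3 and e1 != o:
            baba += 1  # E1 and E3 share the derived state
    return abba, baba


def d_statistic(abba: int, baba: int) -> float:
    """D = (nABBA - nBABA) / (nABBA + nBABA); 0.0 if neither pattern occurs."""
    total = abba + baba
    return (abba - baba) / total if total else 0.0


def block_jackknife(sites: Sequence[Site], n_blocks: int = 10) -> Tuple[float, float]:
    """Return (D, standard error) from a delete-one-block jackknife.

    Assumption: unweighted jackknife over contiguous, near-equal blocks."""
    sites = list(sites)
    if n_blocks < 2 or len(sites) < n_blocks:
        raise ValueError("need at least 2 blocks and one site per block")
    bounds = [len(sites) * k // n_blocks for k in range(n_blocks + 1)]
    counts: List[Tuple[int, int]] = [
        count_patterns(sites[bounds[k]:bounds[k + 1]]) for k in range(n_blocks)
    ]
    tot_abba = sum(c[0] for c in counts)
    tot_baba = sum(c[1] for c in counts)
    d_all = d_statistic(tot_abba, tot_baba)
    # Recompute D with each block left out in turn.
    loo = [d_statistic(tot_abba - a, tot_baba - b) for a, b in counts]
    mean = sum(loo) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((d - mean) ** 2 for d in loo)
    return d_all, math.sqrt(var)
```

With this sign convention, a positive D indicates an excess of ABBA sites (E2 and E3 sharing derived states) and a negative D an excess of BABA sites (E1 and E3). Dividing D by its jackknife standard error gives a Z-score that can be compared with the null expectation of zero.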

When there are more than three ingroup taxa, they are tested in groups of three (plus the outgroup). No correction for multiple hypothesis testing seems ever to be applied. Recently, the test has been extended to five taxa (Pease JB, Hahn MW. 2015. Detection and polarization of introgression in a five-taxon phylogeny. Systematic Biology 64: 651-662).

Note that this test assumes that:
  • the "excess of shared polymorphisms" arises solely from gene flow, with or without incomplete lineage sorting, rather than from any other tree-like processes such as gene duplication-loss or ancestral population structure
  • there are no other sources of co-ordinated polymorphisms, such as character-state reversals due to adaptation / selection
  • any gene flow that does exist is due to introgression, rather than to hybridization or HGT.
How realistic these assumptions are is not immediately obvious.