The Genealogical World of Phylogenetic Networks: Alignment

Monday, July 23, 2018

Sequence alignment is still an open computational problem

I recently submitted an invited manuscript about multiple sequence alignment to a bioinformatics journal, but it did not fare well with the reviewers (ominously, there were more than the usual two, and it took a couple of months to get the reviews). The bioinformatics referees simply rejected the notion that a multiple alignment is an object in its own right, which is the basic premise of the manuscript.

To explain this: if we think of the normal tabular arrangement of a multiple sequence alignment, then the historical relationships among the rows (the taxa) are drawn as a phylogeny, while the historical relationships among the columns represent the homologies among the characters. There is no necessary primary importance of the phylogeny relationships over the homology relationships. However, phylogenies are much more prominent in the literature; and, indeed, sequence alignment is often seen as nothing more than a pesky step on the way to getting a phylogeny.

However, if we accept this notion, that homology relationships are both important and interesting in their own right, then multiple sequence alignment is certainly still an open computational problem, because most automated sequence alignments currently do not represent homology relationships. Instead, they represent sequence similarity of various sorts, and thus they only represent homology to the extent that similarity reflects history. In fact, similarity = homology + analogy, and the latter is not trivial.

I have previously written about the topic of alignment-as-homology for the biological audience:

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
Morrison DA, Morgan MJ, Kelchner SA (2015) Molecular homology and multiple sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.

This new manuscript is intended to be the equivalent for the bioinformatics audience, explaining why homology ≠ similarity, and therefore why the current alignment algorithms are inadequate.

Rather than let it languish, and since it is likely to be the last single-author paper that I ever write, I tried to add it to the bioRxiv repository, for everyone to read. Sadly, their reviewers decided that it is insufficiently original, but is merely a summary of existing information. So, I guess that they are not impressed by the novel ideas, either.

I also tried the arXiv, which may seem to be more appropriate, given the audience, but they no longer recognize my user account, which means that the manuscripts I have there now exist in limbo. The world is apparently against my manuscript!

[ Note: This issue has now been resolved, and the manuscript can be accessed as arXiv:1808.07717 ]

So, I am linking the paper here, instead:

Multiple Sequence Alignment is not a Solved Problem

Please have a look; and if you think it is worth it, then please spread the word. Moreover, if you are computationally inclined, then feel free to be inspired to tackle the problem described therein.

PS. I also once wrote a brief blog post about this:

The need for a new sequence alignment program

Tuesday, December 19, 2017

The art of doing science: alignments in historical linguistics

In the past two years, during which I have been writing for this blog, I have often tried to emphasize the importance of alignments in historical linguistics — alignment involves explicit decisions about which characters / states are cognate (and can thus be aligned in a data table). I have also often mentioned that explicit alignments are still rarely used in the field.

To some degree, this situation is frustrating, since it seems so obvious that scholars align data in their head, for example, whenever they write etymological dictionaries and label parts of a word as irregular, not fulfilling their expectations when assuming regular sound change (in the sense in which I have described it before). It is also obvious that linguists have been trying to use alignments before (even before biologists, as I tried to show in this earlier post), but for some reason, they never became systematized.

As an example of the complexity of alignment analyses in historical linguistics, consider the following figure, which depicts both an early version of an alignment (following Dixon and Kroeber 1919), and a "modern" version of the same data. For the latter, I used the EDICTOR (http://edictor.digling.org), a software tool that I have been developing during recent years, and which helps linguists to edit alignments in a consistent way (List 2017). The old version on the left has been modified in such a way that it becomes clearer what kind of information the authors tried to convey (for the original, see my older post), while the EDICTOR version contains some markup that is important for linguistics, which I will discuss in more detail below.

Figure 1: Alignments from Dixon and Kroeber (1919) in two flavors

If we carefully inspect the first alignment, it becomes evident that the scholars did not align the data sound by sound, but rather morpheme by morpheme. Morphemes are those parts in words that are supposed to bear a clear-cut meaning, even when taken in isolation, or when abstracting from multiple words. The plural-ending -s in English, for example, is a morpheme that has the function to indicate the plural (compare horse vs. horses, etc.). In order to save space, the authors used abbreviations for the language group names and the names for the languages themselves.

The authors have further tried to save space by listing identical words only once, but putting two entries, separated by a comma, in the column that I have labelled "varieties". If you further compare the entries for NW (=North-Western Maidu) and NE/S (=North-Eastern Maidu and Southern Maidu), you can see that the first entry has been swapped: the tsi’ in tsi’-bi in NW is obviously better compared with the tsi in NE/S bi-tsi rather than comparing bi in NW with tsi in NE/S. This could be a typographical error, of course, but I think it is more likely that the authors did not quite know how to handle swapped instances in their alignment.

In the EDICTOR representation of the alignment, I have tried to align the sounds in addition to aligning the morphemes. My approach here is rather crude. In order to show which sounds most likely share a common origin, I extracted all homologous morphemes, aligned them in such a way that they occur in the same column, and then stripped off the remaining sounds by putting a check-mark in the IGNORE column at the bottom of the EDICTOR representation. When these sound correspondences are analyzed further with software such as the LingPy library (List et al. 2017), all sounds that occur in the IGNORE column will be ignored. Correspondences will then only be calculated for the core part of this alignment, namely the two columns that are left over, in the center of the alignment.
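The role of the IGNORE column can be sketched in a few lines of plain Python. This is only an illustration and not the way EDICTOR or LingPy implement it: the function name `count_correspondences`, the variety labels and the segments are all invented, and the alignment is a toy one.

```python
from collections import Counter
from itertools import combinations


def count_correspondences(alignment, ignore):
    """Count pairwise sound correspondences, column by column.

    alignment: dict mapping a variety name to its aligned row, given as a
               list of segments of equal length ("-" marks a gap).
    ignore:    set of column indices that carry the IGNORE check-mark.
    """
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same number of columns")
    counts = Counter()
    rows = sorted(alignment.items())  # fixed order of varieties
    for col in range(lengths.pop()):
        if col in ignore:
            continue  # flagged columns contribute nothing
        for (lang_a, row_a), (lang_b, row_b) in combinations(rows, 2):
            counts[(lang_a, row_a[col], lang_b, row_b[col])] += 1
    return counts


# Invented toy alignment: columns 2 and 3 hold non-homologous material.
toy_alignment = {
    "A": ["t", "a", "k", "i"],
    "B": ["d", "a", "-", "-"],
    "C": ["t", "o", "k", "-"],
}
toy_counts = count_correspondences(toy_alignment, ignore={2, 3})
```

Only the first two columns contribute here, so the gaps that were introduced purely to make room for the unalignable material never show up as "correspondences".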

In many cases, this treatment of sound correspondences and homologous words in alignments is sufficient, and also justified. If we want to compare the homologous (cognate) parts across words in different languages, we can't align the words entirely. Consider, for example, the German verb gehen [geːən] and its English counterpart go [gɔu]. German regularly adds the infinitive ending -en to each verb, but English long ago dropped all endings on verbs apart from the -s in the third person singular (compare go vs. goes). Comparing the whole of the verbs would force us to insert gaps for the verb ending in German, which would not be linguistically meaningful, as those endings have not been "gapped" in English, but were lost in a morphological process that affected all English verb endings.

There are, however, also cases that are more complicated to model, especially when dealing with instances of partial cognacy (or partial homology). Compare, for example, the following alignment for words for bark (of a tree) in several dialects of the Bai language, a Sino-Tibetan language spoken in China, whose affiliation with other Sino-Tibetan languages is still unclear (data taken from Wang 2006).

Figure 2: Alignment for words for "bark" in Bai dialects

In this example, the superscript numbers represent tones, and they are placed at the end of each syllable. Each syllable in these languages usually also represents a morpheme in the sense mentioned above. This means that each of the words is a compound of two original meanings. Comparison with other words in the languages reveals that most dialects, apart from Mazhelong, express bark as tree-skin, which is a very well-known expression that we can find in many languages of the world. If we want to analyze those words in alignments, we could follow the same strategy as shown above, and just decide for one core part of the words (probably the skin part) and ignore the rest. However, for our calculations of sound correspondences, we would lose important information, as the tree part is also cognate in most instances and therefore rather interesting. But ignoring only the unalignable part of the first syllable in Mazhelong would also not be satisfying, since we would again have gaps for this word in the tree part in Mazhelong which do not result from sound change.

The only consistent solution to handle these cases is to split the words into their morphemes, and then to align all sets of homologous morphemes separately. This can also be done in the EDICTOR tool (but it requires more effort from the scholar and the algorithms). An example is shown in Figure 3, where you can see how the tool breaks the linear order in the representation of the words as we find them in the languages, in order to cluster them into sets of homologous "word-parts".

Figure 3: Alignments of partial cognates in the Bai dialects
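The regrouping step described above, which breaks the linear order of the words and collects homologous morphemes into sets, might look roughly like this in plain Python. The function name `group_by_cognate_set`, the variety labels, the segments and the cognate IDs are all invented for illustration; they are not the Bai data from the figures, and EDICTOR's internal representation may well differ.

```python
from collections import defaultdict


def group_by_cognate_set(words):
    """Regroup morphemes by their partial-cognate ID.

    words: dict mapping a variety name to its word, given as a list of
           (segments, cognate_id) pairs in the order they are spoken.
    Returns a dict: cognate_id -> {variety: segments}.
    Simplification: if a word contains the same cognate ID twice,
    the later morpheme overwrites the earlier one.
    """
    sets = defaultdict(dict)
    for variety, morphemes in words.items():
        for segments, cognate_id in morphemes:
            sets[cognate_id][variety] = segments
    return dict(sets)


# Invented toy compounds: V3 uses a different first element (ID 3).
toy_words = {
    "V1": [(["t", "a"], 1), (["p", "i"], 2)],
    "V2": [(["d", "a"], 1), (["b", "e"], 2)],
    "V3": [(["k", "u"], 3), (["p", "e"], 2)],
}
toy_sets = group_by_cognate_set(toy_words)
```

Each of the resulting sets can then be aligned on its own, so that a variety which lacks a given morpheme is simply absent from that set instead of receiving gaps that do not result from sound change.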

But if we only look at the tree part of those alignments, namely the third cognate set from the left, with the ID 8, we can see a further complication, as the gaps introduced in some of the words look a little bit unsatisfying. The reason is that the j in Enqi and Tuolo may just as well be treated as a part of the initial of the syllable, and we could re-write it as dj in one segment instead of using two. In this way, we might capture the correspondence much more properly, as it is well known that those affricate initials in the other dialects ([ts, tʂ, dʐ, dʑ]) often correspond to [dj]. We could thus rewrite the alignment as shown in the next figure, and simply decide that in this situation (and similar ones in our data), we treat the d and the j as just one main sound (namely the initial of the syllables).

Figure 4: Revised alignment of "tree" in the sample
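Treating d and j as one sound is, in practice, a small re-segmentation step applied before alignment. A minimal sketch in plain Python, under the assumption that words are stored as lists of segments; the function name `merge_segments` and the example syllable are invented, and the sketch merges the cluster wherever it occurs rather than only in syllable-initial position:

```python
def merge_segments(segments, clusters):
    """Merge adjacent segments that together form one of the given clusters.

    segments: list of sound segments, e.g. ["d", "j", "a"].
    clusters: set of 2-tuples that should count as a single segment.
    """
    merged = []
    i = 0
    while i < len(segments):
        pair = tuple(segments[i:i + 2])
        if pair in clusters:
            merged.append("".join(pair))  # e.g. "d" + "j" -> "dj"
            i += 2
        else:
            merged.append(segments[i])
            i += 1
    return merged


# Invented syllable, for illustration only.
resegmented = merge_segments(["d", "j", "a"], {("d", "j")})
```

After this step the merged initial occupies a single column, so it can line up directly with the affricate initials of the other dialects.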

Summary and conclusions

Before I start boring those of the readers of this blog who are not linguists, and not particularly interested in details of sound change or language change, let me just quickly summarize what I wanted to illustrate with these examples. I think that the reason why linguists never really formalized alignments as a tool of analysis is that there are so many ways to come up with possible alignments of words, which may all be reasonable for any given analysis. In light of this multitude of possibilities for analysis, not to speak of historical linguistics as a discipline that often prides itself on being based on hard manual labor that would be impossible to achieve by machines, I can in part understand why linguists were reluctant to use alignments more often in their research.

Judging from my discussions with colleagues, there are still many misunderstandings regarding the purpose and the power of alignment analyses in historical linguistics. Scholars often think that alignments directly reflect sound change. But how could they, given that we do not have any ancestral words in our sample? Alignments are a tool for analysis, and they can help to identify sound change processes or to reconstruct proto-forms in unattested ancestral languages; but they are by no means the true reflection of what happened and how things changed. They are the starting point, not the end point of the analysis. Furthermore, given that there are many different ways in which we can analyze how languages changed over time, there are also many different ways in which we can analyze language data with the help of alignments. Often, when comparing different alignment analyses for the same languages, there is no simple right and wrong, just a different emphasis on the initial analysis and its purpose.

As David wrote in an email to me:

"An alignment represents the historical events that have occurred. The alignment is thus a static representation of a dynamic set of processes. This is ultimately what causes all of the representational problems, because there is no necessary and sufficient way to achieve this."

This also nicely explains why alignments in biology as well, with respect to the goal of representing homology, "may be more art than science" (Morrison 2015), and I admit that I find it a bit comforting that biology has similar problems, when it comes to the question of how to interpret an alignment analysis. However, in contrast to linguists, who have never really given alignments a chance, biologists not only use alignments frequently, but also try to improve them.

If I am allowed to have an early New Year wish for the upcoming year, I hope that along with the tools that facilitate the labor of creating alignments for language data, we will also have a more vivid discussion about alignments, their shortcomings, and potential improvements in our field.

References

Dixon, R. and A. Kroeber (1919) Linguistic families of California. University of California Press: Berkeley.
List, J.-M. (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations, pp. 9-12.
List, J.-M. et al. (2017) LingPy: A Python library for historical linguistics. Version 2.6. Max Planck Institute for the Science of Human History: Jena.
Morrison, D. (2015) Molecular homology and multiple-sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.
Wang, W.-Y. (2006) Yǔyán, yǔyīn yǔ jìshù 語言,語音與技術 [Language, phonology and technology]. Xiānggǎng Chéngshì Dàxué: Shànghǎi 上海.

Tuesday, November 28, 2017

“Man gave names to all those animals”: goats and sheep

This is a joint post by Guido Grimm, Johann-Mattis List, and Cormac Anderson.

This is the second of a pair of posts dealing with the names of domesticated animals. In the first part, we looked at the peculiar differences in the names we use for cats and dogs, two of humanity’s most beloved domesticated predators. In this, the second part (and with some help from Cormac Anderson, a fellow linguist from the Max Planck Institute for the Science of Human History), we’ll look at two widely cultivated and early-domesticated herbivores: goats and sheep.

Similar origins, but not the same

Both goats and sheep are domesticated animals that have an explicitly economic use; and, in both cases, genetic and archaeological evidence points to the Near East as the place of domestication (Naderi et al. 2007). The main difference between the two is the natural distribution of goats (providing nourishment and leather) and sheep (providing the same plus wool). This distribution is also reflected in the phonetic (dis)similarities of the terms used in our sample of languages (Figures 1 and 2).

Capra aegagrus, the species from which the domestic goat derives, is native to the Fertile Crescent and Iran. Other species of the genus, similar to the goat in appearance, are restricted to fairly inaccessible areas of the mountains of western Eurasia (see Figure 3, taken from Driscoll et al. 2009). On the other hand, Ovis aries, the sheep and its non-domesticated sister species, are found in hilly and mountainous areas throughout the temperate and boreal zone of the Northern Hemisphere. Whenever humans migrated into mountainous areas, there was the likelihood of finding a beast that:

Had wool on his back and hooves on his feet,
Eating grass on a mountainside so steep
[Bob Dylan: Man Gave Names to all those animals].

Goats

Goats were actively propagated by humans into every corner of the world, because they can thrive even in quite inhospitable areas. Reflecting this, differences in the terms for "goat" generally follow the main subgroups of the Indo-European language family (Figure 1), in contrast to "cat", "dog", and "sheep". From the language data, it seems that for the most part each major language expansion, as reflected in the subgroups of Indo-European languages, brought its own term for "goat", and that it was rarely modified too much or borrowed from other speech communities.

There is one exception to this, however. The terms in the Italic and Celtic languages look as though they are related, coming from the same Proto-Indo-European root, *kapr-, although the initial /g/ in the Celtic languages is not regular. In Irish and Scottish Gaelic, the words for "sheep" also come from the same root. In other cases, roots that are attested in one or other language have more restricted meanings in some other language; for example, the Indo-Iranic words for goat are cognate with the English buck, used to designate a male goat (or sometimes the male of other hooved animals, such as deer).

The German word Ziege sticks out from the Germanic form gait- (but note the Austro-Bavarian Goaß, and the alternative term Geiß, particularly in southern German dialects). The origin of the German term is not (yet) known, but it is clear that it was already present in the Old High German period (8th century CE), although it was not until Luther's translation of the Bible, in which he used the word, that the word became the norm and successively replaced the older forms in other varieties of German (Pfeifer 1993: s. v. "Ziege").

Figure 1: Phonetic comparison of words for "goat"

Sheep

The terms for sheep, however, are often phonetically very different even in related languages. The overall pattern seems to be more similar to that of the words for dog – the animal used to herd sheep and protect them from wolves. An interesting parallel is the phonetic similarity between the Danish and Swedish forms får (a word not known in other Germanic languages) and the Indic languages. This similarity is a pure coincidence, as the Scandinavian forms go back to a form fahaz- (Kroonen 2013: 122), which can be further related to Latin pecus "cattle" (ibid.) and is reflected in Italian [pɛːkora] in our sample.

This example clearly shows the limitations of pure phonetic comparisons when searching for historical signal in linguistics. Latin c (pronounced as [k]) is usually reflected as an h in Germanic languages, reflecting a frequent and regular sound change. The sound [h] itself can be easily lost, and the [z] became an [r] in many Scandinavian words. The fact that Italian, Danish, and Swedish have cognate terms for "sheep", however, does not mean that their common ancestors used the same term. It is much more likely that speakers in both communities came up with similar ways to name their most important herded animals. It is possible, for example, that this term generically meant "livestock", and that the sheep was the most prototypical representative at a certain time in both ancestral societies.

Furthermore, we see substantial phonetic variation in the Romance languages surrounding the Mediterranean, where both sheep and goats have probably been cultivated since the dawn of human civilization. Each language uses a different word for sheep, with only the Western Romance languages being visibly similar to ovis, their ancestral word in Latin, while Italian and French show new terms.

Figure 2: Phonetic comparison of words for "sheep"

More interesting aspects

The wild sheep, found in hilly and mountainous areas across western Eurasia, was probably hunted for its wool long before mouflons (a subspecies of the wild sheep) were domesticated and kept as livestock. The word for "sheep" in Indo-European, which we can safely reconstruct, was *h₂owis, possibly pronounced as [xovis], and still reflected in Spanish, Portuguese, Romanian, Russian, and Polish. It survives in many more languages as a specific term with a different meaning, addressing the milk-bearing / birthing female sheep. These include English ewe, Faroese ær (which comes in more than a dozen combinations; Faroes literally means: “sheep islands”), French brebis (important to know when you want sheep-milk based cheese), and German Aue (extremely rare nowadays, having been replaced by Mutterschaf "mother-sheep"). In other languages it has been lost completely.

What is interesting in this context is that while the phonetic similarity of the terms for "sheep" resembles the pattern we observe for "dog", the history of the words is quite different. While the words for "dog" just continued in different language lineages, and thus developed independently in different groups without being replaced by other terms, the words for "sheep" show much more frequent replacement patterns. This also contrasts with the terms for "goat", which are all of much more recent origin in the different subgroups of Indo-European, and have remained rather similar after they were first introduced.

The reasons for these different patterns of animal terms are manifold, and a single explanation may never capture them all. One general clue with some explanatory power, however, may be how and by whom the animals were used. Humans, in particular nomadic societies, have relied on goats to colonize or survive in inhospitable environments, even into historic times. For instance, goats were introduced to South Africa by European settlers to effectively eat up the thicket growing in the interior of the Eastern Cape Province. Once the thicket was gone, the fields were then used for herding cattle and sheep.

Figure 3: Map from Driscoll et al. (2009)

There are other interesting aspects of the plot.

For example, as mentioned before, in Chinese the goat is called the "mountain sheep/goat" and the sheep the "soft sheep/goat". While it is straightforward to assume that yáng, the term for "sheep/goat", originally only denoted one of the two organisms, either the sheep or the goat, it is difficult to say which came first. The term yáng itself is very old, as can also be seen from the Chinese character used, which serves as one of the base radicals of the writing system, depicting an animal with horns: 羊. The sheep seems to have arrived in China rather early (Dodson et al. 2014), predating the invention of writing, while the arrival of the goat was also rather ancient (Wei et al. 2014) (and might also have happened more than once). Whether sheep arrived before goats in China, or vice versa, could probably be tested by haplotyping feral and locally bred populations while recording the local names and establishing the similarity of words for goat and sheep.

While the similar names for goat and sheep may be surprising at first sight (given that the animals do not look all that similar), the similarity is reflected in quite a few of the world's languages, as can be seen from the Database of Cross-Linguistic Colexifications (List et al. 2014) where both terms form a cluster.

Source Code and Data

We have uploaded source code and data to Zenodo, where you can download them and carry out the tests yourself (DOI: 10.5281/zenodo.1066534). Many thanks go to Gerhard Jäger (Eberhard-Karls University Tübingen), who provided us with the pairwise language distances computed for his 2015 paper on "Support for linguistic macrofamilies from weighted sequence alignment" (DOI: 10.1073/pnas.1500331112).

Final remark

As in the case of cats and dogs, we have reported here merely preliminary impressions, through which we hope to encourage potential readers to delve into the puzzling world of naming those animals that were instrumental for the development of human societies. If you know more about these topics than we have reported here, please get in touch with us; we will be glad to learn more.

References

Dodson, J., E. Dodson, R. Banati, X. Li, P. Atahan, S. Hu, R. Middleton, X. Zhou, and S. Nan (2014) Oldest directly dated remains of sheep in China. Sci Rep 4: 7170.
Driscoll, C., D. Macdonald, and S. O’Brien (2009) From wild animals to domestic pets, an evolutionary view of domestication. Proceedings of the National Academy of Sciences 106 Suppl 1: 9971-9978.
Jäger, G. (2015) Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences 112.41: 12752–12757.
Kroonen, G. (2013) Etymological dictionary of Proto-Germanic. Brill: Leiden and Boston.
List, J.-M., T. Mayer, A. Terhalle, and M. Urban (eds) (2014) CLICS: Database of Cross-Linguistic Colexifications. Forschungszentrum Deutscher Sprachatlas: Marburg.
Naderi, S., H. Rezaei, P. Taberlet, S. Zundel, S. Rafat, H. Naghash, et al. (2007) Large-scale mitochondrial DNA analysis of the domestic goat reveals six haplogroups with high diversity. PLoS One 2.10: e1012.
Pfeifer, W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.
Wei, C., J. Lu, L. Xu, G. Liu, Z. Wang, F. Zhao, L. Zhang, X. Han, L. Du, and C. Liu (2014) Genetic structure of Chinese indigenous goats and the special geographical structure in the Southwest China as a geographic barrier driving the fragmentation of a large population. PLoS One 9.4: e94435.

Tuesday, March 28, 2017

Why we need alignments in historical linguistics

Alignments have been discussed quite a few times in this blog. They are so extremely common in molecular biology that I doubt that there are any debates about their usefulness, apart from certain attempts to improve the modelling, especially in cases of non-colinear patterns (Kehr et al. 2014), or to speed up computation (Mathur and Adlakha 2016). In linguistics, on the other hand, alignments are rarely used, although initial attempts to arrange homologous words in a matrix go back to the early 20th century, as you can see from this example taken from Dixon and Kroeber (1919: 61):

Early alignment from Dixon and Kroeber (1919)

This example is rather difficult to read for those not familiar with the annotation. The authors group homologous words across different indigenous languages from California. The group labels of the languages under investigation are given in abbreviated form at the very left of the matrix, and the actual varieties are listed in the next column. What follows is the actual alignment, along with comments in the last column. Regarding the alignments, the authors note on page 55:

A number of sets of cognates have been taken from their numbered place in this list and put at the end to allow of their being printed in columnar form, with a view to bringing out parallelisms that otherwise might fail to impress without detailed analysis and discussion. (Dixon and Kroeber 1919: 55)

In my opinion, this expresses nicely why alignments should be used more often in linguistics — due to the problem that our "alphabets" (the sound systems of languages) are undergoing constant change (see this earlier post for details regarding this claim), we need to infer both the scoring function between different sounds across different languages, and the alignment at the same time. If we look at the similarities the authors spotted, it should become obvious what I mean.
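To make the notion of a "scoring function" concrete, here is a minimal sketch of a standard global pairwise alignment (Needleman-Wunsch style) in plain Python, with the scoring function passed in as a parameter. It covers only half of the problem described above: the scorer is simply assumed to be given, whereas in linguistics it would have to be inferred together with the alignment. The names `align` and `identity_score`, the gap penalty and the toy words are all invented for illustration.

```python
def identity_score(x, y):
    """Naive scorer: reward identical sounds, penalize everything else."""
    return 1 if x == y else -1


def align(seq_a, seq_b, score, gap=-1):
    """Globally align two segment lists; return (row_a, row_b, total)."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,  # gap in seq_b
                dp[i][j - 1] + gap,  # gap in seq_a
            )
    # Trace back from the bottom-right corner, preferring matches.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], dp[n][m]


# Invented toy words; the identity scorer pairs the two d's.
row_a, row_b, total = align(list("tada"), list("da"), identity_score)
```

With the identity scorer, the shorter word is lined up with the identical-looking second half of the longer one. A scorer that instead rewards a regular correspondence between different sounds (for example, t in one language and d in the other) could be passed in the same way, and may prefer a different alignment; this is why the scoring function matters so much when the sound systems themselves keep changing.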

I am not yet sure how to interpret the data exactly, but if I am not mistaken, the authors claim that each of the columns contains homologous material. So, they find a similarity between kaha in the first row (the language is Northern Wintun, according to the key to abbreviations in the book), and tu in the last row (Monterey Costanoan). The last column shows suffixes, which I think the authors exclude from their analysis, but I could not find additional information confirming this in their book.

The comment column illustrates another problem of representation, namely that the authors do not know how to handle cases of metathesis (or transpositions) consistently. The transposition of the parts of words is a process that is quite frequent in language evolution. It is very frequent in compounds consisting of modifier and modified, such as milk coffee in English, where milk modifies the coffee, while French, for example, puts the modifier after the main noun, expressing this as café au lait.

Nowadays, we can handle these cases consistently in linguistics, both in our data annotation and in the alignments, and we can even search for the structures automatically (see List et al. 2016). One hundred years ago, when Dixon and Kroeber worked out their comparison of the languages in California, they were pioneers who tried to increase the transparency of our discipline, and it is clear that their solutions are not completely satisfying from today's perspective.

It is extremely surprising for me that, despite these early attempts to make our homology judgments in linguistics more transparent, the practice of phonetic alignments is still rarely used by historical linguists. Indeed, the majority of them even think that it is a waste of time, or only useful for the purpose of teaching.

I was reminded of this when I looked at a recent proposal by Bengtson (2017, see also this blog for details) for deep genetic connections between Basque and North Caucasian languages. Note that the Basque language is traditionally considered as an isolate, i.e. a language whose nearest relatives we cannot find among the languages in the world. Many linguists have attempted to solve this puzzle by proposing various hypotheses (see Forni 2013 for an example of attempting to link Basque with Indo-European). Bengtson proposes various types of evidence, which I cannot really judge, as I do not know the languages under comparison, but finally, he also shows a list with potential homologs between Basque and North Caucasian varieties, which you find below.

Potential homologs between Basque and North Caucasian languages (Bengtson 2017)

If you are not a trained historical linguist, and thus do not know what to do with this table, be assured that many historical linguists will feel similarly. As a rough explanation: the concepts are supposed to be very, very stable, being drawn from Sergey Yakhontov's list of 35 ultra-stable concepts, and I think that all words in one row are supposed to be etymologically related; that is, they should be potential homologs across all of the languages. If word forms are preceded by the asterisk symbol (*), this means that they are reconstructed, i.e. not reflected in written sources. But that is all I can tell you for the moment. Where I should start the comparison between the words remains a mystery for me, as I do not know which parts are supposed to be similar. Alignments would help us to see immediately where the author thinks that the historical similarities can be found; that is, we would see which parts of the words are supposed to be homologous.

At this point in the post, I originally planned to provide you with an alignment of Bengtson's table, in order to illustrate the benefits of alignment in linguistics. Unfortunately, I had to admit to myself that I cannot do this, as I simply do not know where to align the words (apart from some rare trivial cases in the table).

I really hope that this will change in the future. Too often, our hypotheses in linguistics suffer from insufficient transparency with regard to the "proofs" and the evidence. I agree that it is very difficult to come up with good alignments in linguistics, especially if one considers cases of metathesis, unrelated parts, and general uncertainty. However, instead of giving in to the problem, we should follow the pioneering work of Dixon and Kroeber, and try to improve the way we present our data to both our colleagues and a broader public.

Theories such as the link between Basque and the North Caucasian languages are usually highly disputed in historical linguistics, and I do not know of any long range proposal that has gained broad acceptance during the last 50 years. Yet, maybe this is not because the proposals are not valid, but simply because those who are proposing these theories have failed to present their findings in a transparent and testable way.

References

Bengtson, J. (2017) The Euskaro-Caucasian Hypothesis. Current model. PDF.
Dixon, R. and A. Kroeber (1919) Linguistic families of California. University of California Press: Berkeley.
Forni, G. (2013) Evidence for Basque as an Indo-European language. The Journal of Indo-European Studies 41.1 & 2: 1-142.
Kehr, B., K. Trappe, M. Holtgrewe, and K. Reinert (2014) Genome alignment with graph data structures: a comparison. BMC Bioinformatics 15.1: 99.
List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers). Association of Computational Linguistics, pp. 599-605.
Mathur, R. and N. Adlakha (2016) A graph theoretic model for prediction of reticulation events and phylogenetic networks for DNA sequences. Egyptian Journal of Basic and Applied Sciences 3.3: 263-271.

Tuesday, June 21, 2016

Alignments and phylogenetic reconstruction in linguistics and biology

In a very interesting article from 2009 (Morrison 2009), David discusses the question of why phylogeneticists would "ignore computerized sequence alignment". This article was really interesting to me for two reasons: First, the article provides some interesting statistics regarding the degree to which biologists manually adjust the alignments that were automatically produced by software. Second, the article points to the seemingly strange situation in biology in which tree-building is considered to be a task that can be entirely carried out by machines, while the majority of scholars would not trust their final sequence alignments to a computer (Morrison 2009: 150).

This situation finds a direct analogue in historical linguistics. Phylogenetic reconstruction is gaining more and more ground, with many scholars applying (mostly Bayesian) phylogenetic tools to analyze their data (Indo-European: Bouckaert et al. 2012, Tupí-Guaraní (South America): Michael et al. 2015, Japonic: Lee and Hasegawa 2011, Pama-Nyungan (Australian): Bowern and Atkinson 2012, Semitic: Kitchen et al. 2009, Bantu: Grollemund et al. 2015, etc.). Fully automated workflows involving automatic sequence comparison are also practiced (Holman et al. 2011, Jäger 2015, Wheeler and Whiteley 2015), but many linguists remain sceptical regarding their results.

One major difference between biology and linguistics is the selection of comparanda. Biological methods usually derive phylogenetic trees from multiply aligned sequences. Linguistic methods derive trees from sets of homologous (cognate) words (cognate sets) distributed across languages whose evolution is modeled as a process of word-gain and word loss (similar to gene-family gain-loss-studies in biology). While biologists fiddle with their alignments, linguists fiddle with their cognate sets. Cognate identification is exclusively done manually at the moment, and scholars use all kinds of information about word relations that they can get, be it etymological dictionaries, which have been published for more than 200 years, or the intuition of the expert who is annotating the data for cognacy.
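The gain-and-loss coding described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name presence_absence, the language labels "A", "B", "C" and the cognate set identifiers are hypothetical toy values, not data from any of the studies cited here.

```python
from typing import Dict, List, Set


def presence_absence(cognate_sets: Dict[str, Set[str]],
                     languages: List[str]) -> Dict[str, List[int]]:
    """Code each language as a binary row: 1 if it has a word in the cognate set, else 0."""
    set_ids = sorted(cognate_sets)  # fixed column order
    return {
        lang: [int(lang in cognate_sets[set_id]) for set_id in set_ids]
        for lang in languages
    }


# Hypothetical toy data: three languages, three cognate sets.
toy_sets = {
    "set1": {"A", "B", "C"},  # retained everywhere
    "set2": {"A", "B"},       # lost (or never gained) in C
    "set3": {"C"},            # found only in C
}

matrix = presence_absence(toy_sets, ["A", "B", "C"])
for language, row in matrix.items():
    print(language, row)
```

Each column of such a matrix is one cognate set, and it is these columns (not aligned sound segments) that the tree-building software typically treats as characters.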

Identification of cognate sets in linguistics is essentially a task of sequence comparison (List 2014), and algorithmic as well as manual procedures involve the multiple and the pairwise alignment of words (even if it is done only implicitly by human experts). Compared to biology, sequence comparison in historical linguistics is complicated by two factors:

alphabets (phoneme systems) in linguistics are themselves mutable (Geisler and List 2013), so that when aligning two words we need to find both a mapping between the two alphabets, translating one alphabet into the other, plus a scoring function by which we can score the alignment,
regular sound change (the process by which the phoneme system is changed) and sporadic sound change (the process by which a sound is sporadically assimilated, lost, or added) are not the only processes that contribute to change of words in the lexicon, and morphological change (by which whole blocks of meaningful parts of a word are re-arranged, exchanged, lost, or added) yields patterns that are essentially unalignable.
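To make the first point concrete, here is a minimal sketch of pairwise alignment by dynamic programming (in the style of Needleman and Wunsch) in which the scoring function is passed in as a parameter. It is an illustration under stated assumptions, not a method used in any of the cited work: the names make_scorer and align, the scores of +1 and -1, and the gap penalty are all arbitrary, and the correspondence between [o] and [uː] is assumed purely for the sake of the example.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # None marks a gap


def make_scorer(correspondences: Iterable[Tuple[str, str]]) -> Callable[[str, str], float]:
    """Build a scoring function from a list of assumed sound correspondences."""
    table = {frozenset(pair) for pair in correspondences}

    def score(x: str, y: str) -> float:
        # Identical sounds and listed correspondences count as matches.
        return 1.0 if x == y or frozenset((x, y)) in table else -1.0

    return score


def align(a: Sequence[str], b: Sequence[str],
          score: Callable[[str, str], float],
          gap: float = -1.0) -> Tuple[float, List[Pair]]:
    """Global alignment of two segment sequences; returns (score, aligned pairs)."""
    n, m = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


# Words are lists of sound segments, not strings of letters.
scorer = make_scorer([("o", "uː")])  # assumed correspondence, for illustration only
total, pairs = align(["s", "o", "l", "e"], ["s", "uː", "l"], scorer)
print(total, pairs)
```

The algorithm itself is the same as in biology; all of the linguistic knowledge sits in the scoring function, which has to be worked out separately for each pair of languages. The second point, morphological change, is not something that a different scoring function can repair.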

The problem of finding the correct mapping between two alphabets in linguistics is further exacerbated by language contact: If languages exchange words on a large scale, then this may have a huge impact on the system of the languages, and it may even introduce new sounds to a language that were not there before (thanks to English, German now has the sound [dʒ], as in journalist or job). If borrowing is frequent enough, it may get close to impossible to judge, from comparing the words alone, whether two words in different languages have been transferred directly (vertically) from an ancestral language, or laterally.

As a result, it is probably understandable why linguists often refuse to carry out full alignments of the words in their data. An alignment itself does not necessarily tell us much, compared to all of those processes that an expert infers when comparing language data, which are not alignable.

As an example, let us consider the word for "sun" in six Indo-European languages. Since "sun" is a very basic concept, probably fundamental for all human cultures, experts assume that this word was present as *séh₂u̯el- in Indo-European (an asterisk indicates that the word is not reflected in written sources), and that it was retained as Russian солнце [sɔnʦə], Polish słońce [swɔnʲʦɛ], French soleil [sɔlɛj], Italian sole [sole], German Sonne [sɔnə], and Swedish sol [suːl] (Wodtko et al. 2008). An obvious alignment, reflecting the surface similarity between all of these words, would be the following one (taken from List 2014: 135):

Alignment based on sequence similarity.

This alignment, however, is by no means correct. Russian [sɔnʦə] and Polish [swɔnʲʦɛ], for example, share a common suffix, which is reflected as [nʦə] in Russian and as [nʲʦɛ] in Polish, and which was innovated in the common ancestor of Russian and Polish, but is not present in any of the four other languages. So the [n] in German [sɔnə] is essentially not homologous with the [n] in Russian or the [nʲ] in Polish. The same applies to the [ɛj] in French [sɔlɛj], which reflects a diminutive suffix in Latin sol-iculus "small sun", the regular ancestor form of French soleil. Furthermore, the [w] in the Polish word regularly corresponds to the [l] in French, Italian, and Swedish, but it reflects a swap (metathesis) in the order of the vowel and the consonant in Polish: [sɔl] became [slɔ], which became [swɔ].

Taking all (and more) of this into account, we need to modify our alignment to account more closely for the processes that experts have inferred from intensive language comparison, as shown in the next figure below (taken from List 2014: 135). In this alignment, the swap in Polish is reflected by the white font of the sounds involved, and gray-shaded columns are supposed to reflect the oldest layer of homology.

Historically informed alignment.

However, even this alignment is essentially misleading. The Indo-European word for "sun" supposedly had a complex paradigm in which the word's stem alternated between the nominative (and accusative) case and the other cases (oblique cases). So, nominative and accusative used the stem *sóh₂u̯el-, while the other cases used the stem *sh₂én-. The Russian, Polish, French, Italian, and Swedish forms go back to the former, while the German form goes back to the latter, since it is further assumed (or it can be assumed) that the alternation was still preserved in the ancestor of Swedish and German.

This means, however, that our alignment above shrinks to an alignment in which only the first letter, the s, is still reflected in all languages! The following graphic (taken from List 2016) illustrates the processes that led to the current situation for four of our six languages:

Morphological processes of lexical change.

What does this example tell us? On the one hand, it gives some explanation for why linguists do not really want to align words (although the first alignments go back to the early 20th century, cf. Dixon and Kroeber 1919). It also explains why classical linguists have a very sceptical attitude towards the computerization of word comparisons, based on the (partially justified) assumption that computers could not handle the complex patterns that are so characteristic of language change.

On the other hand, comparing the situation with biology as reported in Morrison (2009), we can find an interesting parallel between the two disciplines: both linguists and biologists do not really trust machines for comparing their sequences (albeit at different levels of analysis), but they do not seem to have many problems in trusting machines to reconstruct their trees.

However, especially this last point, the fact that we trust machines to grow our trees, while we distrust them to prepare the seeds, should ring an alarm bell. First, we seem to lack clear guidelines (at least in linguistics) regarding the way the manual adjustment (of alignments in biology and cognate sets in linguistics) should be carried out, which has a clear impact on repeatability. Second, if we have processes in both fields that yield essentially unalignable patterns, such as duplications and other molecular processes in biology (Morrison 2009: 156), and morphological processes in linguistics, how can we assume that a phylogenetic tree analysis can sufficiently cope with them, even if we manually adjust everything?

References

Bouckaert, R., P. Lemey, M. Dunn, S. Greenhill, A. Alekseyenko, A. Drummond, R. Gray, M. Suchard, and Q. Atkinson (2012): Mapping the origins and expansion of the Indo-European language family. Science 337.6097. 957-960.
Bowern, C. and Q. Atkinson (2012): Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88. 817-845.
Dixon, R. and A. Kroeber (1919): Linguistic families of California. University of California Press: Berkeley.
Geisler, H. and J.-M. List (2013): Do languages grow on trees? The tree metaphor in the history of linguistics. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 111-124.
Grollemund, R., S. Branford, K. Bostoen, A. Meade, C. Venditti, and M. Pagel (2015): Bantu expansion shows that habitat alters the route and pace of human dispersals. Proceedings of the National Academy of Sciences 112.43. 13296–13301.
Holman, E., C. Brown, S. Wichmann, A. Müller, V. Velupillai, H. Hammarström, S. Sauppe, H. Jung, D. Bakker, P. Brown, O. Belyaev, M. Urban, R. Mailhammer, J.-M. List, and D. Egorov (2011): Automated dating of the world’s language families based on lexical similarity. Curr. Anthropol. 52.6. 841-875.
Jäger, G. (2015): Support for linguistic macrofamilies from weighted alignment. Proceedings of the National Academy of Sciences 112.41. 12752–12757.
Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
List, J.-M. (2016): Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1. DOI: 10.1093/jole/lzw006.
Michael, L., N. Chousou-Polydouri, K. Bartolomei, E. Donnelly, V. Wauters, S. Meira, and Z. O’Hagan (2015): A Bayesian phylogenetic classification of Tupí-Guaraní. LIAMES 15.2. 193-221.
Morrison, D. (2009): Why would phylogeneticists ignore computerized sequence alignment? Syst. Biol. 58.1. 150-158.
Wheeler, W. and P. Whiteley (2015): Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2. 113-125.
Wodtko, D., B. Irslinger, and C. Schneider (2008): Nomina im Indogermanischen Lexikon [Nouns in the Indo-European lexicon]. Winter: Heidelberg.

Tuesday, May 10, 2016

The early history of sequence alignment

The historical development of the concept that we now call a "sequence alignment" is something that seems to have rarely been considered in the biological literature. Apparently, the idea took some time to develop.

To a bioinformatician, the history of sequence alignment starts in 1970, with the presentation of the dynamic programming algorithm of Needleman and Wunsch (1970). However, protein sequencing started fully 20 years earlier than this (see García-Sancho 2010); and by the end of the 1950s comparisons of amino-acid sequences among related organisms were beginning to appear. However, as noted by Eck (1961): "data on amino acid sequences can be sorted, tabulated and arranged in a great variety of ways ... Any such manipulation will produce some sort of pattern." Thus, a multiple sequence alignment was seen as only one of many possible data presentations, and not necessarily the most obvious one unless intended for an evolutionary analysis.

For example, most of these early comparative studies focussed on the structure (and thus function) of the proteins rather than on their evolution, and so they tended to present juxtapositions consisting of ungapped fragments of the sequences (eg. Brown et al. 1955; Tuppy and Dus 1958; Anfinsen 1959), particularly the active regions. Other studies were directed towards finding a solution to the problem of the genetic code (ie. how nucleotides code for amino acids), and their presentation of sequence alignments was similarly non-evolutionary (eg. Gamow et al. 1956; Tsugita and Fraenkel-Conrat 1960).

Nevertheless, the early work on molecular evolution did reveal that different protein molecules are homologous, including what are now called paralogs (eg. Itano 1957; Ingram 1961). With the sequencing of the proteins, it soon occurred to several people independently that the relative positions in the amino acid sequences are homologous as well (see Morgan 1998). This is an important distinction, because the latter refers to the 1:1 matching of the parts (amino acids) of a complex whole (the protein molecule), which is the usual empirical procedure for determining homology (Ghiselin 2016). However, most sequences were still presented unaligned (eg. Ingram 1961), until the work of Margoliash (1963) and Pauling and Zuckerkandl (1963), who can thus be seen as the pioneers of the modern form of sequence alignment.

The major problem with sequencing proteins in the 1960s was that it was still a slow and tedious procedure, so that data were rather scarce — the first major compilation of aligned sequences did not appear until 1965 (Dayhoff et al. 1965). Strasser (2010) provides an interesting coverage of the early uses of multiple amino-acid sequence alignments, including the development of one-letter codes for each of the amino acids in order to make the alignments more readable. García-Sancho (2010) and Suárez-Díaz (2014) discuss the subsequent development of experimental methods for the sequencing of RNA in the mid-1960s and then DNA in the mid-1970s, which greatly increased the need for an automated sequence alignment method. [García-Sancho (2012) provides a much more detailed discussion.]

Most importantly, a number of the early molecular sequence alignments were constructed by hand explicitly based on evaluation of the likely biological mechanisms that had produced the sequence variation. That is, the alignments made clear the originating molecular mechanisms. For example, Pauling and Zuckerkandl (1963) provided a pairwise alignment of two reconstructed ancestral amino-acid sequences of haemoglobin, along with a discussion of the substitutions and insertions / deletions.

Twenty years later, in what appears to be the first published study of intraspecific variation using DNA sequences, Kreitman (1983) took this idea further, and provided a very carefully considered multiple alignment based on explicit recognition of tandem repeats and RNA stem structures within the study gene. This was very much in line with traditional approaches to the assessment of homologies prior to phylogenetic tree building, for example when using morphological or anatomical characters.

However, immediately after this, practical computerized procedures were developed by Hogeweg and Hesper (1984), based on dynamic programming for pairwise sequence alignment (solely maximizing similarity, as explicitly noted in the title of the Needleman and Wunsch paper) and based on the progressive alignment strategy for multiple alignment. Then the Clustal computer program was released in 1988, which implemented these procedures in a usable manner for personal computers (see Chenna et al. 2003); and the history of studies in molecular evolution was thereby changed forever.

This brief history emphasizes one simple point about the relationship between homology and phylogeny — the apparent primary interest in the latter rather than the former, despite the fact that they are simply two views of the same dataset (phylogeny refers to the relationship among the rows of a multiple sequence alignment, while homology refers to the relationship among the columns). The first automated or semi-automated tree-building algorithm (the user could manually intervene at each step) was developed by Eck and Dayhoff (1966), followed by the first fully automated procedure presented by Fitch and Margoliash (1967). This was nearly 20 years before equivalent ideas were developed for homology assessment.

References

Christian B. Anfinsen (1959) The Molecular Basis of Evolution. Wiley, New York.

H. Brown, Frederick Sanger, Ruth Kitai (1955) The structure of pig and sheep insulins. Biochemical Journal 60: 556-565.

Ramu Chenna, Hideaki Sugawara, Tadashi Koike, Rodrigo Lopez, Toby J. Gibson, Desmond G. Higgins, Julie D. Thompson (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research 31: 3497-3500.

Margaret O. Dayhoff, Richard V. Eck, Marie A. Chang, Minnie R. Sochard (1965) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring MD.

Richard V. Eck (1961) Non-randomness in amino-acid "alleles". Nature 191: 1284-1285.

Richard V. Eck, Margaret O. Dayhoff (1966) Atlas of Protein Sequence and Structure, second edition. National Biomedical Research Foundation, Silver Spring MD.

Walter M. Fitch, Emanuel Margoliash (1967) Construction of phylogenetic trees. Science 155: 279-284.

George Gamow, Alexander Rich, Martynas Yčas (1956) The problem of information transfer from the nucleic acids to proteins. Advances in Biological and Medical Physics 4: 23-68.

Miguel García-Sancho (2010) A new insight into Sanger’s development of sequencing: from proteins to DNA, 1943–1977. Journal of the History of Biology 43: 265-323.

Miguel García-Sancho (2012) Biology, Computing and the History of Molecular Sequencing: From Proteins to DNA, 1945–2000. Palgrave Macmillan, Basingstoke UK.

Michael T. Ghiselin (2016) Homology, convergence and parallelism. Philosophical Transactions of the Royal Society, Series B 371: 20150035.

Paulien Hogeweg, Ben Hesper (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. Journal of Molecular Evolution 20: 175-186.

Vernon M. Ingram (1961) Gene evolution and the hæmoglobins. Nature 189: 704-708.

Harvey A. Itano (1957) The human hemoglobins: their properties and genetic control. Advances in Protein Chemistry 12: 215-268.

Martin Kreitman (1983) Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304: 412-417.

Emanuel Margoliash (1963) Primary structure and evolution of cytochrome c. Proceedings of the National Academy of Sciences of the USA 50: 672-679.

Gregory J. Morgan (1998) Emile Zuckerkandl, Linus Pauling, and the molecular evolutionary clock, 1959–1965. Journal of the History of Biology 31: 155-178.

Saul B. Needleman, Christian D. Wunsch (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Linus Pauling, Emile Zuckerkandl (1963) Chemical paleogenetics: molecular "restoration studies" of extinct forms of life. Acta Chemica Scandinavica 17: S9-S16.

Bruno J. Strasser (2010) Collecting, comparing, and computing sequences: the making of Margaret O. Dayhoff's Atlas of Protein Sequence and Structure, 1954–1965. Journal of the History of Biology 43: 623-660.

Edna Suárez-Díaz (2014) The long and winding road of molecular data in phylogenetic analysis. Journal of the History of Biology 47: 443–478.

Akira Tsugita, Heinz Fraenkel-Conrat (1960) The amino acid composition and C-terminal sequence of a chemically evoked mutant of TMV. Proceedings of the National Academy of Sciences of the USA 46: 636-642.

Hans Tuppy, K. Dus (1958) Eine Untersuchung über Cytochrom-c aus Hefe. Monatshefte für Chemie 89: 407-417.

Monday, September 14, 2015

Multiple sequence alignment

Following a previous post on Multiple sequence alignment, celebration of the 20th anniversary of my first publication in the alignment field continues, with a new publication:

Morrison DA, Morgan MJ, Kelchner SA (2015) Molecular homology and multiple sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.

This paper places sequence alignment within the larger picture of detecting homologies in molecular data, emphasizing the hierarchical nature of homologies. Surprisingly, this relationship has not been emphasized before. It also points out why nucleotide alignments are a unique form of homology assessment, even within this framework. Indeed, the only genotypic data are nucleotides, since everything else is an expression of the nucleotide sequences, rather than being inherited.

The article is Open Access.

Monday, March 23, 2015

Phylogenetic network of pairwise alignment methods

Phylogenetic networks can be used to illustrate the history of any set of objects or concepts, provided that this history is a divergent one (ie. the history is not simply the transformation of objects through time).

Since I have recently been writing about sequence alignments, it is worthwhile to show an example of applying a network to sequence alignment programs. This comes from the paper by Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13: 238.

The authors discuss programs that map reads from a sample genome onto a reference sequence. They note: "the relationship between many existing alignment methods is qualitatively illustrated in the figure."

Their legend reads:

The applications / corresponding computational restrictions shown are: (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.

The reticulation refers to their new program, which "maps reads using coarse alignment methods developed during WGA [whole genome alignment] studies, while speeding up these methods by using the advanced data structures employed in many NGS [next generation sequencing] mapping studies."

Wednesday, March 18, 2015

The need for a new sequence alignment program

Multiple sequence alignment software has not yet met its primary aim for evolutionary biologists: maximizing homology of characters. If our goal is to develop an automated procedure for homology assessment, then we need someone to produce a program that explicitly implements this aim.

Alignment is just as much a part of phylogenetics as is tree or network building. It is the procedure that expresses the homology relationships among the characters, rather than the historical relationships among the taxa. Therefore, we need a computer program that accurately expresses homology relationships, as well as one that accurately expresses the historical relationships. We have some programs for the latter but currently nothing for the former.

Unfortunately, homology is a rather nebulous concept. It has to do with inheriting characters from a shared ancestor, which is not something that we can directly observe. Therefore we have to infer it. Somehow.

Homology criteria

Systematists have developed criteria for making decisions about potential homologies in an objective and (hopefully) repeatable manner, and these are directly applicable to nucleotide sequences, which these days are the most common form of data used in phylogenetics. These criteria are:

• Similarity

Compositional = apparent likeness or resemblance between sequences (% similarity)
Topographical = apparent likeness or resemblance between sequences (second- and third-order structure of proteins or RNA)
Functional = functional relationship to other characters in the same sequence (annotated function of the sequence in protein or RNA)
Ontogenetic = variation arising from the same molecular mechanism between sequences (inferred molecular mechanism creating the sequence variation — tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions)

• Conjunction = possible within-genome copies of the same sequence (i.e. paralogy)

• Congruence = agreement with other postulated homologies elsewhere in the same sequences (synapomorphy).

Traditionally, characters have been first proposed as homologous using the criteria of similarity and conjunction (together called primary homology), and then tested with the criterion of congruence (secondary homology).

It is important to note that these criteria do not necessarily always agree with each other in their inferences of homology. Changes that occur during evolutionary history can weaken the connection between these criteria so that, for example, nucleotide homology inferred from structural similarity is no longer the same as nucleotide homology inferred from compositional similarity. It is for this reason that compositional similarity of the sequences is insufficient to establish gene orthology, for example. The same limitation applies to nucleotides.
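For reference, the compositional criterion (% similarity) is the only one of these that reduces to simple counting. A minimal sketch, assuming two already-aligned rows of equal length and "-" as the gap symbol; the function name percent_identity is hypothetical, and skipping gapped columns is only one of several conventions in use:

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of identical residues among columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # gapped columns are skipped (an assumed convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTTA"))  # 4 matches in 5 compared columns: 80.0
```

Note that the number says nothing about why two residues are alike, which is exactly the limitation described above.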

Current computer programs

It is clear that these criteria have been incorporated singly into current computerized procedures for producing multiple sequence alignments, but rarely in combination. For example, compositional similarity is the criterion used by the most popular computer programs, such as CLUSTAL, MAFFT and Muscle. Topographical similarity is being invoked whenever structure-based alignments are produced, such as for RNA-coding sequences (eg. PicXAA-R; PMFastR), or when nucleotide sequences are translated to amino acids before alignment (eg. PROMALS). Functional similarity is used for specialist studies of conserved motifs and binding sites, for instance. Ontogenetic similarity of nucleotide sequences is based on inferring the possible molecular processes that cause the observed sequence variation; the program Prank uses this criterion by distinguishing between insertions and deletions.

Congruence as a criterion involves the observation of repeated patterns of synapomorphy in a phylogeny. Among alignment algorithms, both Direct Optimization (e.g. POY; MSAM; BeeTLe) and Statistical Alignment (e.g. BAli-Phy; StatAlign) try to simultaneously produce a multiple alignment and a phylogenetic tree, thus optimizing the criterion of congruence.

The fact that essentially none of the current crop of programs applies more than one criterion is, I contend, the principal reason why so many phylogeneticists adjust their alignments manually. Personal judgment may not be perfect, but at least it can be consciously based on homology as a general character concept. Since the different criteria may conflict with each other, at the moment only human judgment is available to compare them and thus make a final decision.

Required program

To make the homology criteria fully operational, we need to compare their inferences by evaluating the comparative evidence. That is, since the different criteria may conflict with each other, we need an automated way to compare them and evaluate their relative probabilities for any alignment column. What we need is a computerized procedure that will include all of the known criteria for homology assessment. Sadly, there are currently no mathematical models for doing this.

I suspect that there are two reasons for the failure of such a program to appear by now. First, biologists have not been clear about homology as a concept, and have not been able to express it in a form that computationalists could use to develop an algorithm. That is, we have criteria but they are not really operational criteria in a computational sense. Second, it will not be easy, because there is no obvious algorithm for inferring inheritance of characters. That is, we cannot easily separate homology from analogy.

Interactive editor

Another proposal is to have an interactive alignment editor. This editor would have the ability to show the conflicting hypotheses of homology (eg. where the homology suggested by structural pairing in a stem conflicts with homology suggested by tandem repeats), and then to annotate each column in the final alignment with the reason for the researcher having chosen to align those particular nucleotides. For example, one could press a button and see the RNA stem pairs in different colours (irrespective of whether the stem nucleotides are aligned), or press again and see the tandem repeats and inversions in different colours (once again, irrespective of how the nucleotides are aligned). One could also choose to see the annotations for the columns (summarized, using some coded schema), or simply look at the unadorned alignment itself.
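The column annotations could be stored very simply. The following is a hypothetical sketch of the data structure only, not a description of PhyDE or any existing editor; the class name AnnotatedAlignment and the one-letter codes ("S" for stem pairing, "R" for tandem repeat) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AnnotatedAlignment:
    """An alignment plus, for each annotated column, a code for the criterion used."""
    rows: Dict[str, str]                                   # taxon name -> gapped sequence
    reasons: Dict[int, str] = field(default_factory=dict)  # column index -> one-letter code

    @property
    def width(self) -> int:
        return len(next(iter(self.rows.values()))) if self.rows else 0

    def annotate(self, column: int, code: str) -> None:
        """Record why the residues in this column were aligned."""
        if not 0 <= column < self.width:
            raise IndexError(f"column {column} is outside the alignment")
        self.reasons[column] = code

    def summary(self) -> str:
        """One character per column; '.' marks columns without an annotation."""
        return "".join(self.reasons.get(c, ".") for c in range(self.width))


alignment = AnnotatedAlignment({"taxon1": "AC-GT", "taxon2": "ACTGT"})
alignment.annotate(2, "R")  # hypothetical code: aligned as part of a tandem repeat
for name, sequence in alignment.rows.items():
    print(f"{name:8s}{sequence}")
print(f"{'reason':8s}{alignment.summary()}")
```

Printing the summary line beneath the sequences gives the coded view of the annotations; omitting it gives the unadorned alignment.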

This seems to me to be an achievable goal in the short-term; and the PhyDE editor already does some of it. Such an editor would also serve as a necessary step on the way to working out how to automate as much of the process as possible. The ultimate goal for some people may be total automation (ie. a black box), but I see no way to achieve that in the immediate term. Besides, I suspect that phylogeneticists will always want some judgemental control over the process, which would be best achieved with a semi-automated interactive editor. That is, we might ask the program to work out what the alternative alignments are for any specified subsequence (in an automated manner), and then we evaluate their relative merits for ourselves.

Note that I am treating the alignment as a set of hypotheses independent of their phylogenetic analysis. Subsequences can still be tentatively aligned even if the researcher intends masking those subsequences out of any subsequent tree-building analysis. Also, subsets of the taxa might be aligned confidently while other subsets are left unaligned. With current editors, this involves having a separate alignment file for each subset, which is very cumbersome, as well as error-prone.

Wednesday, March 11, 2015

The need for a new sequence alignment database

Multiple sequence alignment software has not yet met its primary aim for evolutionary biologists: maximizing the homology of characters. The many alignment methods have diverse optimization functions, along with assorted heuristics to search for the optimum alignment; and these methods produce detectably different multiple sequence alignments in almost all realistic cases (see The need for a new sequence alignment program). This leaves phylogeneticists wondering what to do. In response, the majority of phylogeneticists use manual alignment or re-alignment at some stage in their procedures.

If our goal is to develop an automated procedure for homology assessment (see Multiple sequence alignment), then we need some means of evaluating the relative success of different alignment methods.

There are four suggestions for benchmarking strategies for sequence alignment (Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C 2014. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods in Molecular Biology 1079: 59-73):

1. Benchmarks based on simulated evolution of biological sequences, to create examples with known homology.
2. Benchmarks based on consistency among several alignment techniques.
3. Benchmarks based on the three-dimensional structure of the products encoded by sequence data.
4. Benchmarks based on knowledge of, or assumption about, the phylogeny of the aligned biological sequences.

These authors list a number of pros and cons for each strategy. For our purposes we need to consider the cons, which I discuss below (not all of these are covered by the authors).

Cons

1. Simulation-based approaches adopt a probabilistic model of sequence evolution to describe nucleotide substitution, deletion, and insertion rates, while keeping track of “true” relationships of homology between individual residue positions (see Do biologists over-interpret computer simulations?).
(a) The simulation and analysis methods are not independent. All observations drawn from simulated data depend on the assumptions and simplifications of the model used to generate the data. This means that the results are biased towards those analysis methods that most closely match the assumptions of the simulation model.
(b) Simulations cannot straightforwardly, if at all, account for all evolutionary forces. This means that the simulations are not realistic, and their relevance for the behaviour of real datasets is unknown. The biggest failing in this regard is that, at some stage in the simulation, insertions and deletions are assumed to occur at random along the sequence (IID), and nothing could be further from the truth. Sequence variation occurs as a result of tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, and insertions; and there are strong spatial constraints on variation such as codons and stem-loops. Current simulation methods fall well short of modeling these patterns of sequence variation.

2. The key idea behind consistency-based benchmarks is that different good aligners should tend to agree on a common alignment (namely, the correct one) whereas poor aligners might make different kinds of mistakes, thus resulting in inconsistent alignments.
(a) Two wrongs don't make a right. That is, consistent methods may be collectively biased. Moreover, consistency is not independent of the set of methods used (some may be consistent with each other and not with others).
(b) Consistency scores are a feature of several methods, which means that the benchmark is not independent.

3. Structural benchmarks most commonly employ the superposition of known protein/RNA structures as an independent means of alignment, to which alignments derived from sequence analysis can then be compared (see Edgar RC 2010. Quality measures for protein alignment benchmarks. Nucleic Acids Research 38: 2145-2153). The best known of these include: BAliBASE, OXBench, PREFAB, SABmark, IRMBase, and BRAliBase.
(a) Datasets are limited to structurally conserved regions, and may not be relevant for other alignment objectives.
(b) Deriving the structure-based alignments is problematic. For example, there is inconsistency amongst different structural superpositions.

4. Given a reference tree, the more accurate the tree resulting from a given alignment, the more accurate the underlying alignment is assumed to be (see Dessimoz C, Gil M 2010. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biology 11: R37).
(a) This is a false inversion of a proposition: accurate alignments yield accurate trees, but it does not follow that accurate trees must be based on accurate alignments.
(b) Alignment is often involved in constructing the reference tree. If not, the tree may be trivial in terms of taxon relationships.
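
To make the objection in 1(b) concrete, here is a minimal Python sketch of how a simulation can keep track of "true" homology. Every residue carries a label recording its origin, and each event is placed at a uniformly random position, which is exactly the simplification criticised above. The function names, the event types and the equal event probabilities are assumptions made for illustration, and are not taken from any particular simulation program.

```python
import random

BASES = "ACGT"


def evolve(ancestor, lineage, n_events, rng):
    """Apply n_events random events to a copy of the ancestral sequence.

    Each residue is (origin, base). Residues inherited from the ancestor keep
    their origin label; inserted residues get a label unique to this lineage.
    """
    seq = list(ancestor)
    inserted = 0
    for _ in range(n_events):
        event = rng.choice(["substitution", "insertion", "deletion"])
        if event == "insertion":
            pos = rng.randrange(len(seq) + 1)   # uniform position: the IID assumption
            seq.insert(pos, ((lineage, inserted), rng.choice(BASES)))
            inserted += 1
        elif seq:
            pos = rng.randrange(len(seq))       # uniform position again
            if event == "deletion":
                del seq[pos]
            else:
                origin, old = seq[pos]
                seq[pos] = (origin, rng.choice([b for b in BASES if b != old]))
    return seq


def true_homologies(x, y):
    """Pairs of positions (i, j) that descend from the same ancestral residue."""
    where = {origin: j for j, (origin, _) in enumerate(y)}
    return [(i, where[origin]) for i, (origin, _) in enumerate(x) if origin in where]


rng = random.Random(1)
ancestor = [(("root", k), rng.choice(BASES)) for k in range(12)]
left = evolve(ancestor, "left", 4, rng)
right = evolve(ancestor, "right", 4, rng)
pairs = true_homologies(left, right)

print("".join(base for _, base in left))
print("".join(base for _, base in right))
print(pairs)
```

Nothing in this sketch produces tandem repeats, inversions, or variation constrained by codons or stem-loops, so an aligner that itself assumes independent random events will be favoured by data generated this way.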

Discussion

This evaluation leaves us in the invidious position of not yet having any benchmarking method that is relevant to homology assessment for multiple sequence alignments. This conclusion is at variance with other previous assessments (eg. Aniba MR, Poch O, Thompson JD 2010. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Research 38: 7353-7363).

We need to consider what such a method might look like, and how we might go about constructing it. If biologists can't give the bioinformaticians a concrete goal for homology alignment then they can expect nothing in return.

It seems clear that we need to follow the idea behind option 3, but base the alignments on homology rather than structure. I once made a start with compiling some suitable datasets (see Morrison DA 2009. A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149); but this was a very minor effort.
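
Whatever the reference alignments are based on, candidate alignments would presumably be scored against them in the way the structural benchmarks do it, by asking how many of the reference's homology statements are recovered. The following Python is a minimal sketch of one such measure (often called a sum-of-pairs score); the function names and the dictionary representation of an alignment are assumptions made for illustration.

```python
def aligned_pairs(alignment):
    """Set of pairwise homology statements implied by an alignment.

    alignment maps taxon name -> gapped row (all rows the same length).
    A statement is ((taxon_a, pos_a), (taxon_b, pos_b)), where the positions
    index the ungapped sequences, so the result is independent of gap columns.
    """
    taxa = sorted(alignment)
    length = len(alignment[taxa[0]])
    if any(len(alignment[t]) != length for t in taxa):
        raise ValueError("rows differ in length")
    position = {t: 0 for t in taxa}
    pairs = set()
    for col in range(length):
        present = []
        for t in taxa:
            if alignment[t][col] != "-":
                present.append((t, position[t]))
                position[t] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def sp_score(test, reference):
    """Fraction of the reference's homology statements found in the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(test) & ref) / len(ref)


reference = {"s1": "ATAT", "s2": "A--T"}
candidate = {"s1": "ATAT", "s2": "AT--"}
print(sp_score(candidate, reference))
```

A score of this kind is only as meaningful as the reference itself, which is why the basis of the reference alignments (homology rather than structure) matters more than the scoring function.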

As I see it, we need alignments that are explicitly annotated with the reasons for considering the columns to be homologous. One suggestion would be to have relatively short alignments with annotations for "known" features, such as tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions, or stem-loops. These all create sequence variation, and they provide evidence of the homology relations among the sequences. Presumably the alignments would vary in length and number of sequences, and in the complexity of the patterns.
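
As a rough sketch of what such an annotated alignment might look like as a data record, the following Python attaches a feature and a homology criterion to ranges of columns. The class names, the fields and the vocabulary of features and criteria are all hypothetical; no existing database format is implied.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnAnnotation:
    start: int        # first column covered (0-based, inclusive)
    end: int          # last column covered (inclusive)
    feature: str      # eg. "tandem repeat", "stem-loop", "inversion"
    criterion: str    # eg. "similarity", "structure", "ontogeny", "congruence"
    note: str = ""    # free-text reason given by the researcher


@dataclass
class AnnotatedAlignment:
    rows: dict                                    # taxon name -> gapped sequence
    annotations: list = field(default_factory=list)

    @property
    def length(self):
        return len(next(iter(self.rows.values())))

    def annotate(self, start, end, feature, criterion, note=""):
        if not 0 <= start <= end < self.length:
            raise ValueError("column range outside the alignment")
        self.annotations.append(ColumnAnnotation(start, end, feature, criterion, note))

    def reasons_for(self, column):
        """All annotations that cover the given column."""
        return [a for a in self.annotations if a.start <= column <= a.end]

    def unannotated_columns(self):
        """Columns for which no reason for homology has been recorded."""
        return [c for c in range(self.length) if not self.reasons_for(c)]


aln = AnnotatedAlignment({"taxon1": "GGATATCC", "taxon2": "GGAT--CC"})
aln.annotate(2, 5, "tandem repeat", "similarity", "AT repeated in taxon1")
print(aln.reasons_for(4))
print(aln.unannotated_columns())
```

Because a column can carry more than one annotation, the same record could also hold competing reasons for the same columns.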

Perhaps the biggest practical problem will be how to deal with alignments where the homology criteria conflict with each other. That is, there are different types of criteria used to recognize homology — ie. similarity, structure, ontogeny, congruence (see Morrison DA 2015. Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26) — and they do not necessarily agree with each other.

This would allow us to come up with a set of requirements to specify various categories of the database, based on each of the above features. We would then try to accumulate as many example datasets for each category as we can. The database will presumably have protein-coding sequences in one section and RNA-coding, introns, etc in another. This dichotomy is simplistic, but I feel that it needs to be that way in order to be of practical use. Within each of those two sections we would have subsets of varying degrees of difficulty (eg. different degrees of average sequence similarity, or distinct taxon subsets in the same alignment, or orphan sequences).

This organisational approach is similar to that originally adopted for BAliBASE, but it was dropped by most of the databases developed subsequently. I believe that it is the best approach for our purposes.

There are also experimentally created datasets where the alignment is known because all of the ancestors were sequenced as well. These would be useful; but their limitation is that the sequence variation was generated more or less at random, and so it does not match normal evolutionary processes. These alignments are more likely to match the IID assumption of the current automated alignment methods.

There is one further issue with this approach. Bioinformaticians often state that a few carefully prepared datasets are of little practical use to them (as opposed to being of use to phylogeneticists). What they need is a large number of datasets, the more the better. This is because they are interested in the percent success of their algorithms, and this cannot be assessed with small sample sizes. So, each alignment probably does not need to have too many taxa or too much sequence length; it is the number of alignments that is important, not their individual sizes. This could be achieved by sub-dividing larger datasets.
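
One simple way to sub-divide a larger curated alignment is to cut it into groups of taxa and windows of columns, as in the following Python sketch. The function name and the fixed group and window sizes are assumptions made for illustration; in practice the cuts would more sensibly follow the annotated features rather than arbitrary column boundaries.

```python
def subdivide(alignment, taxa_per_subset, columns_per_subset):
    """Yield smaller alignments cut from a larger one.

    alignment maps taxon name -> gapped row (all rows the same length).
    Columns that contain only gaps within a subset are dropped, which leaves
    the homology statements among the remaining residues unchanged.
    """
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    for t0 in range(0, len(taxa), taxa_per_subset):
        group = taxa[t0:t0 + taxa_per_subset]
        if len(group) < 2:
            continue  # a single sequence is not an alignment
        for c0 in range(0, length, columns_per_subset):
            window = {t: alignment[t][c0:c0 + columns_per_subset] for t in group}
            width = len(window[group[0]])
            keep = [c for c in range(width)
                    if any(window[t][c] != "-" for t in group)]
            yield {t: "".join(window[t][c] for c in keep) for t in group}


big = {
    "t1": "ACGT--AC",
    "t2": "ACGTTTAC",
    "t3": "A-GT--AC",
    "t4": "A-GTTTAC",
}
parts = list(subdivide(big, taxa_per_subset=2, columns_per_subset=4))
for part in parts:
    print(part)
```

Each piece inherits its homology statements from the parent alignment, so the pieces are not independent samples, and that would need to be kept in mind when reporting percent success.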

Wednesday, March 4, 2015

Multiple sequence alignment

I started actively working on phylogenetic networks more than 10 years ago, when I gave a talk at the Phylogenetic Combinatorics and Applications meeting in Uppsala in July 2004.

However, before I started working on networks I had for several years been working on multiple sequence alignment methodology, and I still do. This work is also of direct relevance to network construction, of course, since faulty alignments will generate conflicting signals that can confound the biological signals that alone should appear in the network.

This year marks the 20th anniversary of my first publication in the alignment field (see the list appended below). To celebrate this I have some review / commentary articles planned. The first of these has now appeared online, and I would like to draw it to your attention:

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

This paper relates current sequence alignment procedures to homology assessments as they are practiced for other data. Most algorithms can be seen as implementing only one of the several criteria that are used to identify homologies, which is inadequate. Suggestions are made for improving this situation.

Note: the second of these papers has now also appeared.

There will also be a couple of upcoming blog posts canvassing a few issues that I see as important for the future development of alignment methods.

Previous Publications

Theory

Ellis J, Morrison DA (1995) Effects of sequence alignment on the phylogeny of Sarcocystis deduced from 18S rDNA sequences. Parasitology Research 81: 696-699.

Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441. [This has been the most cited of these publications, surprising me by still getting cited about once per month]

Morrison DA (2006) Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19: 479-539.

Morrison DA (2009) A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149. [This was actually accepted for publication in 2007]

Morrison DA (2009) Why would phylogeneticists ignore computerized sequence alignment? Systematic Biology 58: 150-158.

Morrison DA (2010) [Book review of] ‘Sequence Alignment: Methods, Models, Concepts, and Strategies’. Systematic Biology 59: 363-365.

Empirical examples

Mugridge NB, Morrison DA, Johnson AM, Luton K, Dubey JP, Votypka J, Tenter AM (1999) Phylogenetic relationships of the genus Frenkelia: a review of its history and new knowledge gained from comparison of large subunit ribosomal RNA gene sequences. International Journal for Parasitology 29: 957-972.

Mugridge NB, Morrison DA, Heckeroth AR, Johnson AM, Tenter AM (1999) Phylogenetic analysis based on full-length large subunit ribosomal RNA gene sequence comparison reveals that Neospora caninum is more closely related to Hammondia heydorni than to Toxoplasma gondii. International Journal for Parasitology 29: 1545-1556.

Mugridge NB, Morrison DA, Jäkel T, Heckeroth AR, Tenter AM, Johnson AM (2000) Effects of sequence alignment and structural domains of ribosomal DNA on phylogeny reconstruction for the protozoan family Sarcocystidae. Molecular Biology and Evolution 17: 1842-1853.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) Subset partitioning of the ribosomal DNA small subunit and its effects on the phylogeny of the Anopheles punctulatus group. Insect Molecular Biology 9: 515-520.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) A phylogenetic study of the Anopheles punctulatus group of malaria vectors comparing rDNA sequence alignments derived from the mitochondrial and nuclear small ribosomal subunits. Molecular Phylogenetics and Evolution 17: 430-436.