The Genealogical World of Phylogenetic Networks: Productive and unproductive analogies between biology and linguistics

Genotypes or phenotypes?

In a blog post from 2013, David investigated some of the popular analogies between anthropology (including linguistics) and biology. He rejected those analogies that compare the genotype with anthropological entities (like the common "words = genes" analogy). Instead, he proposed to draw the analogy between anthropological entities and the phenotype. I generally agree that we should be very careful about the analogies we draw between different disciplines, and I share the scepticism regarding those naive approaches in which genes are compared with words, or sounds are compared with nucleotide bases. I am, however, sceptical as to whether the alternative analogy between phenotypes and anthropological entities offers a general solution for the study of language evolution.

Productive and unproductive analogies

My scepticism results from a general uncertainty about the transfer of models and methodologies among scientific disciplines. I am deeply convinced that such a transfer is useful and that it can be fruitful, but we seem to lack a proper understanding of how to carry out such a transfer. Apart from this general uncertainty as to how to do it properly, I think that for linguistics the analogy between phenotypes and linguistic entities is too broad to be successfully applied.

Instead of drawing general analogies between biology and linguistics, it would be more useful to carry out a fine-grained analysis of productive analogies between the two disciplines. By productive, I mean that the analogies should lead to an interdisciplinary transfer of models and methods that yields new insights into the entities of the discipline that imports them. If this is not the case for a given analogy, this does not mean that the analogy is wrong or false, but rather that it is simply unproductive. An analogy is just a similarity between entities from different domains, and what we define as being "similar" crucially depends on our perspective. With enough imagination, we can draw analogies between all kinds of objects, and we never really know the degree to which we construct rather than detect, as I have tried to illustrate in the graphic below.

Constructed or detected similarities?

Local productive analogies: alignment analyses

A productive analogy does not necessarily have to be global, offering a full-fledged account of shared similarities, as do the analogies that compare languages with organisms (Schleicher 1848) or languages with species (Mufwene 2001), or the analogy between phenotypes and anthropological entities proposed by David. It is equally possible to find very useful local analogies, which hold only to a certain extent but offer enough insight to get started.

Consider, for example, the problem of sequence alignment in biology and linguistics. It is clear that both biologists and linguists carry out alignment analyses of some of the entities they are dealing with in their disciplines. We use alignment analyses in biology and linguistics, since both disciplines have to deal with entities that are best modeled as sequences, be it sequences of DNA, RNA, or amino acids in biology, or sequences of sounds in linguistics. In both cases, we are dealing with entities in which a limited number of symbols is linearly ordered, and an alignment analysis is a very intuitive and fruitful way to show which of the symbols in two different sequences correspond.
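To make the idea of "corresponding symbols" concrete, here is a minimal Python sketch of how an alignment can be represented in both disciplines. It assumes that the sequences have already been aligned by hand and padded with a gap symbol. The DNA fragments are invented, and the sound sequences for English "daughter" and German "Tochter" are rough, simplified transcriptions chosen purely for illustration.

```python
GAP = "-"  # symbol marking a position with no counterpart in the other sequence


def columns(row_a, row_b):
    """Pair up the symbols of two already-aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return list(zip(row_a, row_b))


def identical_columns(cols):
    """Count the columns in which both rows show the same non-gap symbol."""
    return sum(1 for a, b in cols if a == b and a != GAP)


# Invented DNA fragments, aligned by hand (illustration only).
dna = columns(list("GATTACA"), list("GAT-ACA"))

# Simplified sound sequences, aligned by hand (illustration only):
# English "daughter" versus German "Tochter".
words = columns(
    ["d", "ɔː", GAP, "t", "ə"],
    ["t", "ɔ", "x", "t", "ɐ"],
)

for a, b in words:
    print(a, b, sep="\t")
```

The same data structure serves both cases. What differs is how informative identity is: in the toy word pair only one column is identical, although most columns would count as corresponding sounds for a historical linguist.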

At this very general level, the analogy between words as sequences of sounds and genes as sequences of nucleotides holds, and it seems straightforward to think of transferring models and methods between the disciplines (in this case from biology to linguistics, since automatic sequence alignment has a longer tradition in biology).

In the details, however, we will detect differences between biological and linguistic sequences, with the main differences lying in the alphabets (the collections of symbols) from which our sequences are drawn (discussed in more detail in List 2014: 61-75):

Biological alphabets are universal, that is, they are basically the same for all living creatures, while the alphabets of languages are specific for each and every language or dialect.
Biological alphabets are limited and small regarding the number of symbols, while linguistic alphabets are widely varying and can be very large in size.
Biological alphabets are stable over time, with sequences changing by the replacement of symbols with other symbols drawn from the same pool of symbols, while linguistic alphabets are mutable: not only can they acquire new sounds or lose existing ones, but also the sounds themselves can change.
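The first two of these differences can be illustrated with a small Python sketch. It simply collects the set of symbols that occur in a handful of sequences. All data here are toy examples: the DNA fragments and the two "varieties" with their segmented words are hypothetical and only serve to show the contrast.

```python
DNA_ALPHABET = frozenset("ACGT")  # the same four bases for every organism


def alphabet(sequences):
    """Collect the set of symbols that occur in a collection of sequences."""
    return frozenset(symbol for seq in sequences for symbol in seq)


# Invented DNA fragments for two hypothetical organisms.
organism_a = ["GATTACA", "CCGT"]
organism_b = ["TTAGGC", "ACGT"]

# Invented, pre-segmented words for two hypothetical language varieties.
# Each word is a list, because one sound may be written with several characters.
variety_a = [["t", "ɔ", "x", "t", "ɐ"], ["h", "a", "n", "t"]]
variety_b = [["d", "ɔː", "t", "ə"], ["h", "æ", "n", "d"]]

# Both organisms draw on one and the same universal alphabet ...
same_dna = alphabet(organism_a) == alphabet(organism_b) == DNA_ALPHABET

# ... while the two varieties share only part of their sound inventories.
shared_sounds = alphabet(variety_a) & alphabet(variety_b)
only_in_a = alphabet(variety_a) - alphabet(variety_b)
only_in_b = alphabet(variety_b) - alphabet(variety_a)

print(same_dna, sorted(shared_sounds), sorted(only_in_a), sorted(only_in_b))
```

A scoring scheme that is defined once for a fixed alphabet, as is common in biology, therefore cannot be reused unchanged for sound sequences, since every new language may bring symbols that the scheme has never seen.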

How similar are words and genes in the end?

What are the consequences of these differences in the word-gene analogy? Can we still profit from the long tradition of automatic alignment methods when dealing with phonetic alignment (the alignment of sound sequences, like words or morphemes) in linguistics? Yes, we can! But within limits!

Linguists can profit from the general frameworks for sequence alignment developed in biology, but we need to make sure that we adapt them according to our linguistic needs. For alignment methods, this means, for example, that we can use the traditional frameworks of dynamic programming for pairwise alignment, which were developed in the seventies and early eighties (Needleman and Wunsch 1970, Smith and Waterman 1981). We can also use some of the frameworks for multiple sequence alignment, which were developed a bit later, starting from the end of the eighties, be it progressive (Feng and Doolittle 1987, Thompson et al. 1994, Notredame et al. 1998), iterative (Barton and Sternberg 1987, Edgar 2004), or probabilistic (Do et al. 2005). But we can only import the overall frameworks, not their details.
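The distinction between framework and details can be sketched in a few lines of Python. The following is a minimal, textbook-style version of global pairwise alignment by dynamic programming in the spirit of Needleman and Wunsch (1970), not a reproduction of any published implementation. The function name, the linear gap penalty, and the simple identity scorer are my own assumptions for illustration. The point is that the scoring function is passed in as a parameter, so the framework stays the same while the details can be exchanged.

```python
def needleman_wunsch(seq_a, seq_b, score, gap=-1):
    """Globally align two sequences; return both gapped rows and the score.

    `score(a, b)` rates a pair of symbols, `gap` is a linear gap penalty.
    Scores are assumed to be integers, so the equality tests in the
    traceback are exact.
    """
    n, m = len(seq_a), len(seq_b)
    # matrix[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    matrix = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = i * gap
    for j in range(1, m + 1):
        matrix[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] + gap,  # symbol of seq_a faces a gap
                matrix[i][j - 1] + gap,  # symbol of seq_b faces a gap
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and matrix[i][j]
                == matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], matrix[n][m]


def identity(a, b):
    """Simplest possible scorer: reward identical symbols, punish the rest."""
    return 1 if a == b else -1


# Invented DNA fragments (illustration only).
dna_a, dna_b, dna_score = needleman_wunsch(list("GATTACA"), list("GATACA"), identity)
print("".join(dna_a), "".join(dna_b), dna_score, sep="\n")

# Simplified sound sequences for "daughter" and "Tochter" (illustration only).
word_a, word_b, word_score = needleman_wunsch(
    ["d", "ɔː", "t", "ə"], ["t", "ɔ", "x", "t", "ɐ"], identity
)
```

Nothing in the recursion itself refers to DNA or to sounds. With an identity scorer, however, the toy word pair has only a single matching symbol to guide the alignment, which is exactly where the linguistic adaptation has to start.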

All algorithms for phonetic alignment that are supposed to be applicable to a wide range of data (and not serve as a mere proof of concept that handles only a limited range of test datasets) need to address the specific characteristics of sound sequences. Apart from the differences in alphabet size and the mutable character of sound systems mentioned above, these characteristics also include the important role that context plays in sound change (List 2014: 26-33), the problem of secondary sequence structures (List 2012a), the problem of metathesis (List 2012: 51f), and the problem of unalignable parts resulting from cases of partial and oblique homology in language evolution (see my recent blog post on this issue).
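One way to picture such an adaptation is a scoring function that does not depend on symbol identity alone. The Python sketch below is loosely inspired by the idea of grouping sounds into classes, but it is not the scoring scheme of any published method: the class table, the class labels, and the score values are all hypothetical and deliberately tiny.

```python
# Hypothetical mini table mapping sounds to coarse classes (illustration only).
SOUND_CLASSES = {
    "t": "T", "d": "T",                      # dental stops
    "k": "K", "g": "K", "x": "K",            # velar sounds
    "ɔ": "V", "ɔː": "V", "ə": "V", "ɐ": "V", "a": "V", "æ": "V",  # vowels
}


def sound_class_score(a, b, classes=SOUND_CLASSES):
    """Score a pair of sounds: identical > same class > everything else.

    Sounds missing from the table are never treated as class mates,
    so an unknown symbol only matches itself.
    """
    if a == b:
        return 2
    class_a, class_b = classes.get(a), classes.get(b)
    if class_a is not None and class_a == class_b:
        return 1
    return -1


# Column by column score of a hand-made alignment of simplified
# "daughter" and "Tochter" (gap columns are skipped here).
pairs = [("d", "t"), ("ɔː", "ɔ"), ("t", "t"), ("ə", "ɐ")]
scores = [sound_class_score(a, b) for a, b in pairs]
print(scores)
```

Because the table is just a dictionary, it can be extended when a new language brings new sounds, which is one simple way of coping with alphabets that are neither universal nor stable. Context, secondary structures, and metathesis would each require further changes that this sketch does not attempt.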

Concluding remarks

Drawing analogies between the research objects of different disciplines is not a bad idea, and it can be very inspiring, as multiple cases in the history of science show. When transferring models and methods from one discipline to another, however, we need to make sure that the analogies we use are productive, adding value to our research and understanding. We should never expect that analogies hold in all details. Instead, we need to be aware of their specific limits, and we need to be willing to adapt the models and methods we transfer to the needs of the target discipline. Only then can we make sure that the analogies we use are really productive in the end.

References

Barton, G. J. and M. J. E. Sternberg (1987). “A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons”. J. Mol. Biol. 198.2, 327–337.
Do, C. B., M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou (2005). “ProbCons. Probabilistic consistency-based multiple sequence alignment”. Genome Res. 15, 330–340.
Edgar, R. C. (2004). “MUSCLE. Multiple sequence alignment with high accuracy and high throughput”. Nucleic Acids Res. 32.5, 1792–1797.
Feng, D. F. and R. F. Doolittle (1987). “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”. J. Mol. Evol. 25.4, 351–360.
List, J.-M. (2012a). “Improving phonetic alignment by handling secondary sequence structures”. In: Hinrichs, E. and Jäger, G.: Computational approaches to the study of dialectal and typological variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012.
List, J.-M. (2012b). “Multiple sequence alignment in historical linguistics. A sound class based approach”. In: Proceedings of ConSOLE XIX. “The 19th Conference of the Student Organization of Linguistics in Europe” (Groningen, 01/05–01/08/2011). Ed. by E. Boone, K. Linke, and M. Schulpen, 241–260.
List, J.-M. (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.
Mufwene, S. S. (2001). The ecology of language evolution. Cambridge: Cambridge University Press.
Needleman, S. B. and C. D. Wunsch (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. J. Mol. Biol. 48, 443–453.
Notredame, C., L. Holm, and D. G. Higgins (1998). “COFFEE. An objective function for multiple sequence alignment”. Bioinformatics 14.5, 407–422.
Schleicher, A. (1848). Zur vergleichenden Sprachengeschichte [On comparative language history]. Bonn: König.
Smith, T. F. and M. S. Waterman (1981). “Identification of common molecular subsequences”. J. Mol. Biol. 147.1, 195–197.
Thompson, J. D., D. G. Higgins, and T. J. Gibson (1994). “CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”. Nucleic Acids Res. 22.22, 4673–4680.