Tuesday, July 26, 2016

Can biologists learn from linguists?

Of course they can. Biologists who know nothing about linguistics can learn a lot about linguistics from linguists, including the most nerdy, the most boring, and the most interesting things.

However, it is obvious that the question in the title of this post implies a different object of learning, and a more precise title would have been "Can biologists learn about evolution from linguists?" As a linguist, I would of course also provide an affirmative answer, but I doubt that most biologists would agree. At the moment, we have a situation in which the majority of interdisciplinary papers state that linguists can learn from biologists. The opposite claim, that biologists can learn from linguists, is rarely found.

Biology to linguistics

Many analogies between biology and linguistics have been noticed so far, and new ones are regularly being proposed. Looking at the analogies that have been made, we find that most of them have never really been followed up. Languages, for example, have been compared with organisms (Schleicher 1848: 16f), species (Pagel 2009), microbes (Nelson-Sathi et al. 2011, List et al. 2014), mutualist symbionts (van Driem 2004), and populations (Mufwene 2001). Words have been compared with cells (Schleicher 1863: 23f), amino-acids (Zwick 1978), codons (Enguix and Jimenez-Lopez 2012, Jakobson 1973) and genes (Pagel 2009). Sounds (phonemes) have been compared with nucleic bases (Hruschka et al. 2015, Enguix and Jimenez-Lopez 2012) and atoms (Zwick 1978). Only a small number of these analogies have received broader attention, many have been rejected quickly after they were first proposed, and only recently has an explicit transfer of methods and models been initiated (Atkinson and Gray 2005).

The tenor of most recent studies, especially those published during the past one to two decades, is often that we have finally, and surprisingly, realized that language evolution is largely the same as biological evolution (for a recent account in this direction, see Pagel 2016). As a result, it is claimed that we can easily use biological methods to study language evolution. Indeed, so the story goes, we need to use them, since linguistics is in a poor state with no methods of its own, and linguists have never quantified what they know about the history of their languages. Then, finally, with these new methods developed in biology, we see light at the end of the tunnel, and we can draw nice trees of our languages and see how they evolved into their current shape.

I am completely in favour of increasing the objectivity of historical linguistics, making it a more data-driven and more transparent discipline. I also advocate interdisciplinary transfer of methods and models, and there are quite a few things we can actually learn from biologists in linguistics. What I do not like is this tone, which suggests that biology is the discipline that saved linguistics, waking it up from its 200-year-long sleep in the ivory tower. At the same time, I also do not like the horror scenarios in traditional linguistics, which state that quantitative approaches would deprive our discipline of all its wit (see the figure below as a not too serious attempt to visualize these two perspectives). In this context, it is quite interesting to look back in history and to recapitulate what actually happened.

The biological storm of bits and bytes: Will it destroy the ivory tower of historical linguistics
or ultimately help it to shine with a new gloss?

The discipline of historical linguistics is about 200 years old, starting with the legendary scholarly work of people like Rasmus Rask (Rask 1818), Jakob Grimm (Grimm 1822), and Franz Bopp (Bopp 1816). Using family trees to model language history goes back to the 17th century, pre-dating the first networks in biology by one century (see David's overview in Morrison 2016). The first explicit alignments showing homologous sounds across words occur at least as early as the beginning of the 20th century (Dixon and Kroeber 1919), cladistic frameworks date back to the second half of the 19th century (Brugmann 1886), and even algorithms for tree reconstruction based on distance data date back to the 1960s (Dyen's comment in Hymes 1960).

The discipline of historical linguistics can look back on a remarkable history of excellent scholarship. Thanks to this scholarship, we have gained invaluable insights, not only into the history of the world's languages, but also into the mechanisms that trigger linguistic diversity. It is undeniable that methods from evolutionary biology have given us some fresh insights during the past 20 years, but their actual influence is often exaggerated. On the one hand, our experience (since the quantitative turn in historical linguistics) shows that in most cases we cannot use biological methods to analyze our data directly. Instead, we need to carefully adapt them to our needs in order to get the best out of them (as I have tried to show in more detail in List 2014).

On the other hand, there is no example during the past 20 years that I know of where the modern biological methods have really revolutionized our insights into language history. They have undeniably shifted our attention towards data and quantification. They have exposed weak spots in our argumentation, and they have forced us to restate questions that we had forgotten to ask. But no new language family has been detected, no deeper genealogies between existing languages have been proposed, and no deeper insights into human prehistory have been achieved by the use of biological methods alone. Historical linguistics has profited from evolutionary biology, not as a small oasis in the desert that was given water and seeds by the lords of bits and bytes, but as a discipline in which scholars learned to make active and critical use of interdisciplinary approaches.

Linguistics to biology

This brings us back to the question of the title. Can biology learn from linguistics? It has undoubtedly done so in the past. Tree-drawing in biology, for example, was popularized by Ernst Haeckel, who was himself influenced by the linguist August Schleicher (Sutrop 2012: 300). In the early days of genetics, a multitude of metaphors were borrowed from linguistics to describe biological phenomena with words like "alphabet", "word" (Gamov 1954), or "translation" (Crick 1959).

While not all biologists have been in favor of this tendency (see, for example, Shanon 1978), and the borrowing of terms does not necessarily imply methodological transfer, we also find examples for the explicit transfer of methods and theories from the linguistic to the biological domain. As an example, consider the theory of formal grammar (Chomsky 1959) which still plays a very important role in addressing certain problems in bioinformatics (Searls 1997), like RNA folding and protein structure analysis. Biological textbooks on sequence comparison still tend to include a chapter on formal grammars and their application in biology (Durbin et al. 1998).
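To give a flavour of how a grammar can describe RNA folding, here is a minimal sketch in Python. It is not taken from Searls (1997) or any other cited work. It assumes a toy grammar of the form S → x S | x S y S | ε, where x and y are complementary bases, and simply counts the largest number of nested base pairs that this grammar can derive for a sequence. The function name `max_base_pairs` and the restriction to Watson-Crick pairs are assumptions made for illustration only.

```python
from functools import lru_cache

# Watson-Crick pairs only; wobble pairs such as G-U are deliberately left out
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq):
    """Largest number of nested base pairs derivable for seq under the toy grammar."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best score for the substring seq[i..j], both ends inclusive
        if i >= j:
            return 0
        # rule S -> x S: leave seq[i] unpaired
        score = best(i + 1, j)
        # rule S -> x S y S: pair seq[i] with a complementary seq[k]
        for k in range(i + 1, j + 1):
            if (seq[i], seq[k]) in PAIRS:
                score = max(score, 1 + best(i + 1, k - 1) + best(k + 1, j))
        return score

    return best(0, len(seq) - 1)


print(max_base_pairs("GGGAAACCC"))  # 3: three nested G-C pairs around an unpaired loop
```

The point of the sketch is only that nested pairings, which a finite-state model cannot capture, fall out naturally from context-free rules.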

Biology could also profit from linguistic insights in the future, and this becomes a bit clearer when we recall what Schleicher mentioned 150 years ago (and what has obviously been forgotten since then):
Observing how new forms descend from old ones can be done more straightforwardly and in a larger scale in linguistics than in biology. For once, the linguists have an advantage over the natural scientists. (Schleicher 1863: 18, my translation)
The advantage of linguistics, which Schleicher points out, is the availability of very concrete, very detailed, very valuable data. This data allows us to see evolutionary forces at a level of detail of which biologists can only dream. Written sources allow us to trace the history of whole language families like Romance (and to some extent also Chinese dialects) from their ancestral speech varieties down to today. Language change is fast enough to allow us to investigate it in action. Recent topics in biology, like the importance of invoking a system perspective in evolution, have long been debated and discussed in linguistics (Tynjanow and Jakobson 1928), since they are so much easier to detect.

In the past, when I worked intensively on the implementation of the Minimal Lateral Network method (Dagan and Martin 2007, Dagan et al. 2008) on linguistic data (List et al. 2014, List 2015), I stumbled upon numerous examples showing the limits of tree topology as a predictor for lateral transfer events. Given that the same necessarily also holds for lateral gene transfer, I asked myself whether these false positives and false negatives in the analyses simply would not matter due to the large amount of data in biology, or whether they were ignored due to the lack of good data for algorithmic evaluation. Later, when I read David's post on Tardigrades and phylogenetic networks, where he pointed to two analyses of the same data that explained them once with lateral gene transfer (Boothby et al. 2015) and once with errors in the data (Koutsovoulos et al. 2015), I became aware of the strong advantage of my linguistic data: I could test it against written records, tracing the history of words through centuries, and thus spot errors immediately when looking up a data point.

The detail of our data in linguistics is both a blessing and a curse. It enables us to write detailed word histories without ever having heard of tree reconciliation methods. On the other hand, it tempts us to get lost in details, forgetting about the bigger picture, and the bigger questions that we could ask if this data were properly digitized and formalized. In this regard, historical linguistics still needs to learn from biology, as we have failed to turn historical linguistics into a modern, data-driven discipline. With more and more detailed data becoming available, however, the day will come when Schleicher is proven right, and when biologists can learn from linguists about evolution.

  • Atkinson, Q. and R. Gray (2005): Curious Parallels and Curious Connections: Phylogenetic Thinking in Biology and Historical Linguistics. Syst. Biol. 54.4. 513-526.
  • Boothby, T., J. Tenlen, F. Smith, J. Wang, K. Patanella, E. Osborne Nishimura, S. Tintori, Q. Li, C. Jones, M. Yandell, D. Messina, J. Glasscock, and B. Goldstein (2015): Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences 112.52. 15976-15981.
  • Bopp, F. (1816): Über das Conjugationssystem der Sanskritsprache in Vergleichung mit jenem der griechischen, lateinischen, persischen und germanischen Sprache. Nebst Episoden des Ramajan und Mahabharas in genauen metrischen Uebersetzungen aus dem Originaltexte und einigen Abschnitten aus den Veda’s. Andreäische Buchhandlung: Frankfurt am Main.
  • Brugmann, K. (1886): Einleitung und Lautlehre: Vergleichende Laut-, Stammbildungs- und Flexionslehre der Indogermanischen Sprachen [Introduction and Phonetics. Comparative Studies of Sound Systems, Stem Formations, and Inflexion Systems of Indo-European Languages]. Grundriß der vergleichenden Grammatik der indogermanischen Sprachen [Foundations of the comparative grammar of the Indo-European languages], vol. 1. Walter de Gruyter, Berlin, Leipzig.
  • Chomsky, N. (1959): On certain formal properties of grammars. Information and Control 2. 137-167.
  • Crick, F. (1959): The present position of the coding problem. The Brookhaven Symposia in Biology 12. 35-39.
  • Dagan, T. and W. Martin (2007): Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proceedings of the National Academy of Sciences 104.3. 870-875.
  • Dagan, T., Y. Artzy-Randrup, and W. Martin (2008): Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proceedings of the National Academy of Sciences 105.29. 10039-10044.
  • Dixon, R. and A. Kroeber (1919): Linguistic families of California. University of California Press: Berkeley.
  • van Driem, G. (2004): Language as organism: A brief introduction to the Leiden theory of language evolution. In: Lin, Y.-c., F.-m. Hsu, C.-c. Lee, J.-S. Sun, H.-f. Yang, and D.-a. Ho (eds.): Studies on Sino-Tibetan Languages. Academia Sinica: Taipei. 1-9.
  • Durbin, R., S. Eddy, A. Krogh, and G. Mitchinson (2002): Biological sequence analysis. Probabilistic models of proteins and nucleic acids. Cambridge University Press: Cambridge.
  • Enguix, G. and M. Jimenez-Lopez (2012): Natural language and the genetic code: From the semiotic analogy to biolinguistics. In: Proceedings of the 10th World Congress of the International Association for Semiotic Studies (IASS/AIS). 771-780.
  • Gamov, G. (1954): Possible relation between deoxyribonucleic acid and protein structures. Nature 173. 318.
  • Grimm, J. (1822): Deutsche Grammatik. Dieterichsche Buchhandlung: Göttingen.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1. 1-9.
  • Hymes, D. (1960): Lexicostatistics so far. Curr. Anthropol. 1.1. 3-44.
  • Jakobson, R. (1973): Six lectures on sound and meaning. MIT Press: Cambridge and London.
  • Koutsovoulos, G., S. Kumar, D. Laetsch, L. Stevens, J. Daub, C. Conlon, H. Maroon, F. Thomas, A. Aboobaker, and M. Blaxter (2015): The genome of the tardigrade Hypsibius dujardini. bioRxiv.
  • List, J.-M., S. Nelson-Sathi, H. Geisler, and W. Martin (2014): Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36.2. 141-150.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • List, J.-M. (2015): Network perspectives on Chinese dialect history. Bull. Chin. Linguist. 8. 42-67.
  • Morrison, D.A. (2016): Genealogies: Pedigrees and phylogenies are reticulating networks not just divergent trees. Evol. Biol. in press.
  • Mufwene, S. (2001): The ecology of language evolution. Cambridge University Press: Cambridge.
  • Nelson-Sathi, S., J.-M. List, H. Geisler, H. Fangerau, R. Gray, W. Martin, and T. Dagan (2011): Networks uncover hidden lexical borrowing in Indo-European language evolution. Proc. R. Soc. London, Ser. B 278.1713. 1794-1803.
  • Pagel, M. (2009): Human language as a culturally transmitted replicator. Nature Reviews. Genetics 10. 405-415.
  • Pagel, M. (2016): Darwinian perspectives on the evolution of human languages. Psychonomic Bulletin & Review. 1-7.
  • Rask, R. (1818): Undersögelse om det gamle Nordiske eller Islandske sprogs oprindelse [Investigation of the origin of the Old Norse or Icelandic language]. Gyldendalske Boghandlings Forlag: Copenhagen.
  • Schleicher, A. (1848): Zur vergleichenden Sprachengeschichte. König: Bonn.
  • Schleicher, A. (1863): Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Hermann Böhlau: Weimar.
  • Searls, D. (1997): Linguistic approaches to biological sequences. Comput. Appl. Biosci. 13.4. 333-344.
  • Shanon, B. (1978): The genetic code and human language. Synthese 39.3. 401-415.
  • Sutrop, U. (2012): Estonian traces in the Tree of Life concept and in the language family tree theory. Journal of Estonian and Finno-Ugric Linguistics 3. 297-326.
  • Tynjanow, J. and R. Jakobson (1991): Probleme der Literatur- und Sprachforschung. In: Viehoff, R. (ed.): Alternative Traditionen.10. Vieweg: Braunschweig. 67-69.
  • Zwick, M. (1978): Some analogies of hierarchical order in biology and linguistics. In: Klir, G. (ed.): Applied General Systems Research: Recent Developments & Trends. Plenum Press: New York. 521-529.

Tuesday, July 19, 2016

A tree with a cycle

You can see other views of the cycle here:
   He went to war

Tuesday, July 12, 2016

Coal — trees and networks of knowledge

The Tree of Knowledge is a well-known concept, and the tree can indeed be used to arrange information. One possible use is to describe the relationships of derivative products (i.e. the chemical derivatives of other substances). Indeed, these can be viewed as having a "phylogeny", since the processing follows a time sequence.

The U.S. Geological Survey (in the U.S. Department of the Interior) has provided one such example in Geological Survey Circular 1143 Coal — a Complex Natural Resource. The centerfold of that publication shows:
Coal byproducts in tree form showing basic chemicals as branches and derivative substances as twigs and leaves. [Modified from an undated public domain illustration provided by the Virginia Surface Mining and Reclamation Association.]

However, a tree is a simplification of a network, and the network can thus show more information. In this case, the same information has previously been illustrated using a reticulating network, not a tree.

In the 7th edition (1924) of Joseph Meyer's Große Conversations-lexikon für gebildete Stände (first edition 1840-1855) there is a Steinkohle: Stammbaum der Steinkohlenerzeugnisse [Coal: family tree of coal products]:

This has three reticulations, showing coal products produced as a result of combining two different processing routes. This is thus a hybridization network.
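What makes a diagram a network rather than a tree can be stated mechanically: at least one node has more than one parent. The following Python sketch counts such nodes in a list of (parent, child) edges. The edge list is a made-up toy example, not a transcription of the Meyer figure, and the function name `reticulation_nodes` is likewise just an illustrative choice.

```python
from collections import defaultdict


def reticulation_nodes(edges):
    """Return the nodes that have more than one distinct parent."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


# Hypothetical toy network of processing routes (labels are placeholders)
edges = [
    ("coal", "coke"),
    ("coal", "gas"),
    ("coal", "tar"),
    ("tar", "product_x"),
    ("gas", "product_x"),  # a second route into the same product: a reticulation
    ("tar", "product_y"),
]

print(reticulation_nodes(edges))  # ['product_x']
```

In a tree the returned list is empty; each entry in the list corresponds to one reticulation of the kind described above.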

Thanks to the Trees of Knowledge page (by Paul Michel) of the "Encyclopedias as Indicators of Change in the Social Importance of Knowledge, Education and Information" web site, for pointing out this unexpected use of trees of knowledge.

Tuesday, July 5, 2016

Hybridization in the world of duplication-transfer-loss

It seems to me that the study of reticulate evolutionary histories currently boils down to two options:
(1) reconstructing a species "tree" from multiple gene trees using a coalescent model that includes hybridization (either homoploid or polyploid);
(2) reconciling multiple gene trees with a known [sic] species tree using a model that includes gene duplication, loss and transfer (as well as speciation) - a DTL model.

This often leads me to wonder where hybridization fits into option (2) and where gene transfer fits into option (1). They must fit somewhere. For example, Jacox et al. (2016. ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics 32: 2056-2058) describe their DTL as:
comprehensive as it includes the following evolutionary events: speciation, speciation-loss (speciation followed by a loss of one gene copy), gene duplication, gene loss, gene transfer and transfer-loss (gene transfer with loss of the original gene) between two sampled species, and gene transfer and transfer-loss from/to an unsampled species (i.e. a species that is not represented in the dataset) to/from a sampled one.

Since the model is "comprehensive", hybridization must be included. The only parts of the model that include reticulate histories are gene transfer and transfer-loss, so this is where hybridization must be. Possibly, polyploid hybridization is included in "gene transfer" (an increase in the number of gene copies), and homoploid hybridization is included in "transfer-loss" (maintaining the same number of genes).
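The tentative reading in the previous paragraph can be written down as a small lookup, sketched here in Python. This is not how ecceTERA represents its events. The `Event` class, the `copy_change` values and the function `hybridization_reading` are assumptions that merely encode the guess made above: a reticulate event that adds a gene copy would look like polyploid hybridization, and one that leaves the copy number unchanged would look like homoploid hybridization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    reticulate: bool   # does material move between contemporaneous lineages?
    copy_change: int   # net change in gene copies, as read in the text above


# Only the events discussed in the paragraph above; values are assumptions
EVENTS = [
    Event("duplication", False, +1),
    Event("loss", False, -1),
    Event("transfer", True, +1),
    Event("transfer-loss", True, 0),
]


def hybridization_reading(event):
    """Map a reticulate event to the hybridization type it might hide."""
    if not event.reticulate:
        return None
    return "polyploid" if event.copy_change > 0 else "homoploid"


readings = {e.name: hybridization_reading(e) for e in EVENTS}
print(readings)
```

The lookup also makes the underlying problem visible: the mapping uses nothing but the change in copy number, so it cannot say whether a given transfer was sexual or asexual.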

This seems to be a simple example of the idea that different types of reticulation events cannot be distinguished from each other. Genomic material moves from one place to another in contemporaneous organisms, either sexually (introgression, hybridization) or asexually (lateral gene transfer). There is nothing intrinsic about gene trees to tell us which mechanism is involved in any given reticulation, other than the relative positions of the donor and recipient in the "species tree" and the possibility of time inconsistency.

This leads to the question of why horizontal gene movement is called "transfer" in one model (2) and "hybridization" in the other (1).