Monday, November 9, 2015

Capturing phylogenetic algorithms for linguistics

A little over a week ago I was at a workshop "Capturing phylogenetic algorithms for linguistics" at the Lorentz Centre in Leiden (NL). This is, as some of you will recall, the venue that hosted two earlier workshops on phylogenetic networks in 2012 and 2014.

I had been invited to participate and to give a talk, and I chose to discuss the possible relevance of phylogenetic networks (as opposed to phylogenetic trees) for linguistics. (My talk is here). This turned out to be a good choice because, although phylogenetic trees are now a firmly established part of contemporary linguistics, networks are much less prominent. Data-display networks (which visualize incongruence in a data-set, but do not model the genealogical process that gave rise to it) have found their way into some linguistic publications, and a number of the presentations earlier in the week showed various flavours of split networks. However, the idea of constructing "evolutionary" phylogenetic networks - e.g. modeling linguistic analogues of horizontal gene transfer - has not yet gained much traction in the field. In many senses this is not surprising, since tools for constructing evolutionary phylogenetic networks in biology are not yet widely used, either. As in biology, much of the reticence concerning these tools stems from uncertainty about whether models for reticulate evolution are sufficiently mature to be used 'out of the box'.

As far as this blog is concerned the relevant word in linguistics is 'borrowing'. My layman's interpretation of this is that it denotes the process whereby words or terms are transferred horizontally from one language to another. (Mattis, feel free to correct me...) There were many discussions of how this process can confound the inference of concept and language trees, but other than it being a problem there was not a lot said about how to deal with it methodologically (or model it). One of the issues, I think, is that linguists are nervous about the interface between micro and macro levels of evolution and at what scale of (language) evolution horizontal events could and should be modeled. To cite a biological analogue: if you study populations at the most microscopic level evolution is usually reticulate (because of e.g. meiotic recombination) but at the macro level large parts of mammalian evolution are uncontroversially tree-like. In this sense whether reticulate events are modeled depends on the event itself and the scale of the phylogenetic model concerned.

Are there analogues of population-genetic phenomena in linguistics, and are they foundations for phenomena observed at the macro level? Is there a risk of over-stating the parallels with biology? One participant told me that, while she felt that there was definitely scope for incorporating analogies of species and gene trees within linguistics - and many of the participants immediately recognized these concepts - comparisons quickly break down at more micro levels of evolution.

I'm not the right person to comment on this of course, or to answer these questions, but in any case it's clear that linguistics has plenty of scope for continuing the horizontal/vertical discussions that have already been with us for many years in biology...

Last, but not least: it was a very enjoyable workshop and I'm grateful to the organizers for inviting me!

1 comment:

  1. The reluctance to go from the macro level to the micro level in computational applications in historical linguistics has always confused me. Traditionally, linguists have long been saying that "chaque mot a son histoire", as biologists now say about genes. Mosaic history is all you find when searching an etymological dictionary of any language. But when it comes to the algorithms which are nowadays so popular, all these complex aspects of language history are quickly forgotten and the macro level rules, although we often know that it is just an oversimplification.

    One problem with modeling the microlevel in automatic applications is, however, that it may well expose how unrealistic the models are. In contrast to biologists, who have the attitude that they can improve unrealistic models, especially classical linguists may get really dismissive about models which simplify too much.

    I myself like this aspect of the microlevel precisely for that reason: it exposes problems in the general models. If our ML tree only recovers 70% of the character states at the root correctly (these are the numbers we usually get when looking really into the data), this shows that the binary ML models may just be too simple for our purpose.

    Well, it's difficult, and it will take some time before we see the first large-scale attempts to reconstruct language trees from word trees...