Monday, August 27, 2018

Regular cognates: A new term for homology relations in linguistics

The identification of homologous words between genealogically related languages is one of the crucial tasks in historical linguistics. In contrast to biology where, especially at the level of genetic sequences, we find a rather rich terminology contrasting different types of homology among genes and gene sequences, linguistic terminology is still not very precise. Most scholars seem to be content if they can claim that they have identified words that are cognate, which means that they are homologous but have not been borrowed throughout their history.

On various occasions in the past, I have tried to work on a more precise terminology for linguistic frameworks (see for example List 2014 and List 2016, or this earlier blogpost on homology in linguistics). In this context, I have often tried to emphasize that we need to be specifically more careful with the problem of partial cognacy in linguistics, since many words across related languages are not fully homologous, but show homology only in specific parts (List et al. 2016).

Thanks to an increase in accurately annotated linguistic data, resulting specifically from my very productive collaboration with Nathan W. Hill (SOAS, London) on the Burmish languages (see Hill and List 2017), my view has now again changed a bit, and I thought it would be useful to share it here.

Cognacy and homology

The starting point for my earlier proposals to refine the notion of cognacy in linguistics was the rather refined distinction between orthologs, paralogs, and xenologs in molecular biology (Fitch 2000). To account for the distinction between directly inherited (orthologs), duplicated (paralogs), and laterally transferred genes (xenologs), I proposed the terms direct cognates, indirect cognates (inspired by the term oblique cognates by Trask 2000), and indirectly etymologically related words or morphemes (word parts).

While the first and last terms are more or less straightforward with respect to linguistic processes, the notion of indirect cognates turned out to be insufficient, given that it is not clear which processes lead to indirect cognacy. Originally, I thought of morphological processes, that is, processes of word formation, by which a word is slightly modified to account for a slightly derived meaning (usually involving processes like suffixation or compounding). My idea was that words that have "experienced" these processes would behave similarly to genes that have been duplicated in biological evolution, and that it would be sufficient to just assign them to a common sub-class of cognates.

However, the research with Nathan W. Hill recently revealed that these terms are insufficient to capture the processes underlying lexical change in historical linguistics.

In order to understand this idea, it is useful to get back to the biological terms and have a closer look at how they distinguish the underlying processes. As far as I understand it, a directly inherited gene sequence may differ from its ancestral sequence due to processes of random mutation, by which the original gene sequence becomes modified throughout its history. In cases of paralogy, the original gene sequence is duplicated and both copies are subsequently inherited. The copies may, during this process, become more different from each other than would be expected when assuming direct inheritance and random mutation. Similarly, in cases of lateral transfer of genetic material, the changes may again be different from the ones introduced by "normal" random mutation.

If we adopt the view of "normal change", as it is employed in the biological processes, we find a counterpart in the process of sound change in linguistics. As I have mentioned earlier, sound change is a systemic process by which certain sounds in certain environments change regularly across all words in the lexicon of a given language. This process is definitely not comparable with random mutation in sequence evolution, since the process involves a class of "letters" in the sound system of a language that are systematically turned into another sound. However, regarding the crucial role that sound change plays in language evolution, it seems that it is in some sense comparable with random mutation resulting in orthologous genes. Sound change is somewhat the baseline of what happens if languages change, and we have the means to identify its traces by searching for regular sound correspondence patterns across related languages (see my earlier blogpost on this matter).
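The systemic character of sound change can be illustrated with a minimal sketch. The proto-forms, the concept labels, and the rule (word-initial p becomes f) are invented for illustration and are not taken from any real language; the point is only that a single rule is applied to every word in the lexicon wherever its environment matches, rather than hitting individual positions at random.

```python
from typing import Callable, Dict, List

Word = List[str]  # a word represented as a list of sound segments


def apply_sound_change(
    word: Word,
    source: str,
    target: str,
    environment: Callable[[Word, int], bool],
) -> Word:
    """Replace `source` by `target` at every position where the environment holds."""
    return [
        target if segment == source and environment(word, index) else segment
        for index, segment in enumerate(word)
    ]


def word_initial(word: Word, index: int) -> bool:
    """Environment: the sound is the first segment of the word."""
    return index == 0


# Hypothetical proto-lexicon (invented forms, illustration only).
proto_lexicon: Dict[str, Word] = {
    "FOOT": ["p", "a", "t"],
    "FATHER": ["p", "a", "p", "a"],
    "TREE": ["t", "a", "p"],
}

# The same rule is applied to the whole lexicon, not to single words.
daughter_lexicon: Dict[str, Word] = {
    concept: apply_sound_change(word, "p", "f", word_initial)
    for concept, word in proto_lexicon.items()
}

for concept, word in daughter_lexicon.items():
    print(concept, "".join(proto_lexicon[concept]), ">", "".join(word))
```

In this toy example, every word-initial p changes, while the p in other positions stays untouched, which is the kind of regular trace that sound correspondence patterns pick up.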

Sound change is thus the default, which can be handled with some confidence, while other processes are harder to handle: word formation, semantic change, or the notorious process of analogical leveling, by which complex paradigms are transformed to reduce complexity, but by which new complexities can also emerge (compare the German irregular plural Morgen-de "mornings", which is built on the template of Abend-e "evenings"). This is also the reason why Gévaudan (2007) does not include it among the major processes of lexical change. If we take sound change as the default process of language change and as our key evidence for homologous word relations, however, this means that we can no longer make the distinction between direct and indirect cognates following my earlier proposal, since indirect cognates do not necessarily reflect instances of irregular sound change.

This is in fact easy to illustrate. If we follow the former definition of indirect cognacy, the comparison of German Handschuh "glove" (lit. hand-shoe) with English hand would reflect indirect cognacy, since the German word is a compound of Hand "hand" and Schuh "shoe", and thus a derived word form. The morpheme Hand in this example, however, is phonetically identical with German Hand, and the sound correspondences between the English word and the first element of the German compound are still regular by all means. In fact, only a small number of word formation processes in language evolution also affect the pronunciation of the base forms.

This means, in turn, that any distinction of cognate word forms (and word parts, i.e., morphemes) into direct and indirect ones that is based on the absence or presence of morphological (= word formation) processes does not tell us much about the degree to which the sound change affecting these word forms was regular. We could state that direct cognates should always reflect regular sound change, since any irregularity would have to be accounted for by alternative explanations (e.g., shortening of a given word due to frequent use, assimilation of sounds serving the ease of pronunciation, etc.).

I wonder whether this would be useful for the initial idea behind the concept of direct cognacy. If we find direct cognates, that is, words that we assume were used by a couple of languages without further modification, apart from regular sound change and potentially sporadic sound changes, it seems still useful to assume that these reflect vertical language history better than cognate sets with residues that were exposed to various morphological processes. Thus, when coding direct cognacy in linguistic datasets, sporadic sound change (if it can be illustrated properly) should not serve as an argument against direct cognacy.

The only way around this problem seems to be to establish a further shade of cognacy, which describes the relations among words and morphemes that have been only affected by sound change, in contrast to words whose history reflects various morphological derivations that impact directly on pronunciation, or processes of irregular sound change due to analogical leveling or assimilation. While I first thought that the biological term ortholog would be useful to describe these specific word relations in linguistics, I realized later that, judging from the Ancient Greek meaning of ortholog (ortho "straight, direct" + logos "relation"), the fact that differences are due to regular sound change is not that neatly reflected.

For now, I think that it should be sufficient to use the term regular cognates for those words or word parts for which we can demonstrate that their change followed the regular "laws" of sound change. Regular cognates are thus defined as words or word parts that have been affected only by sound change during their history. This notion deliberately excludes differences in meaning, frequency of use, or whether the word forms are only reflected in compounds or derived word forms. In fact, for some cases, we could even propose that only parts of a word form that no longer bear any meaning of their own (e.g., the first two sounds of a word form) are regular cognates, as long as we can propose good arguments for the regularity of the correspondences.
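To make the idea of demonstrating regularity a bit more concrete, here is a minimal sketch of how one could flag candidate regular cognates in already aligned word pairs from two languages. Everything in it is an assumption made for illustration: the alignments are invented, the function names are hypothetical, the threshold of two supporting cognate sets is arbitrary, and phonetic environments are ignored. It is not the procedure implemented in EDICTOR or any other existing tool.

```python
from collections import Counter
from typing import Dict, List, Tuple

# An alignment is a list of (sound in language 1, sound in language 2) pairs.
Alignment = List[Tuple[str, str]]

# Invented toy alignments (illustration only, not real language data).
ALIGNMENTS: Dict[str, Alignment] = {
    "ONE": [("t", "ts"), ("a", "a"), ("n", "n")],
    "TWO": [("t", "ts"), ("a", "a")],
    "THREE": [("n", "n"), ("a", "a"), ("t", "ts")],
    "FOUR": [("t", "d"), ("a", "a"), ("n", "n")],  # t : d is attested only here
}


def correspondence_counts(alignments: Dict[str, Alignment]) -> Counter:
    """Count in how many cognate sets each sound correspondence occurs."""
    counts: Counter = Counter()
    for pairs in alignments.values():
        counts.update(set(pairs))  # one vote per cognate set
    return counts


def regular_sets(
    alignments: Dict[str, Alignment], min_support: int = 2
) -> Dict[str, bool]:
    """Mark a cognate set as regular if all its correspondences recur elsewhere.

    `min_support` is an arbitrary illustrative threshold (assumption).
    """
    counts = correspondence_counts(alignments)
    return {
        concept: all(counts[pair] >= min_support for pair in pairs)
        for concept, pairs in alignments.items()
    }


for concept, is_regular in regular_sets(ALIGNMENTS).items():
    print(concept, "regular" if is_regular else "contains unsupported correspondence")
```

In this toy data, the set for "FOUR" is flagged because its correspondence t : d is found nowhere else, while the remaining sets consist only of recurring correspondences. In real analyses, the decision of course requires much more evidence and careful judgment.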

Note that our tools for alignment analyses in historical linguistics already account for this property. The EDICTOR (List 2017), a web-based tool for editing, analyzing, and publishing etymological dictionaries, allows users to exclude those parts from an alignment that are assumed to be irregular, as can be seen in the following illustrative alignment of Proto-Germanic *bakanan "to bake". Scholars who want to be explicit about what parts of an alignment they consider to be regular can use this annotation framework to provide more refined analyses.

EDICTOR alignment of regular cognates for Proto-Germanic *bakanan "to bake"

A crucial consequence of using only regularity in the sound correspondences as the criterion to distinguish regular from irregular cognates is that regular cognacy may also be found to hold for borrowings, since borrowings can, as well, be shown to be regular, especially when the language contact between languages was intensive. Identifying regular cognates is furthermore the first and most important step of the classical comparative method (Weiss 2015) for historical language comparison, since (unless we have written evidence for the true relations between languages) regular cognates (as proven by readily aligned cognate sets) are the fundament upon which we build all our hypotheses regarding the external history of languages.

Fitch, W. (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5: 227-231.
Hill, N. and J.-M. List (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.
List, J.-M. (2014) Sequence Comparison in Historical Linguistics. Düsseldorf University Press: Düsseldorf.
List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2: 119-136.
List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers). Association of Computational Linguistics, pp. 599-605.

List, J.-M. (2017) A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations, pp. 9-12.
Trask, R. (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh University Press: Edinburgh.
Weiss, M. (2015) The comparative method. In: Bowern, C. and N. Evans (eds.) The Routledge Handbook of Historical Linguistics. Routledge: New York, pp. 127-145.

Wednesday, August 22, 2018

Distinguishability in Phylogenetic Networks, report

We have now completed the workshop, as you can tell from the previous post with some photos. Here is a brief report on what seem to me to be some of the more useful points covered.

We had 10 formal presentations, but we also focused on group discussions for several hours each day. It may be the latter that were the most productive. However, I will briefly summarize the talks first.

I spent my time in the opening talk emphasizing the different viewpoints of network computations, which focus on the patterns that can be detected in the data, and the network users, who are generally more interested in the processes that create those patterns (or are, indeed, absent from the patterns but present in the phylogenetic history, anyway). This highlights the two essential points of the workshop title: that both the patterns and the processes are much harder to untangle for networks than for trees.

Céline Scornavacca then bravely tried to tackle the combined problem, anyway, by trying to produce networks from analyzing the patterns in terms of their processes. The issues immediately become obvious, but she seems to be determined to proceed, regardless. Later in the week, Luay Nakhleh reduced the issue simply to vertical processes (including incomplete lineage sorting but not gene duplication-loss) versus horizontal processes. This creates a tractable problem for parsimony and likelihood, but the current challenge remains the limited number of taxa.

Vincent Moulton, Cécile Ané and Charles Semple dodged the issue by focusing on computations. Charles took on the challenge of trying to create a network version of Neighbor-Joining, which would address the issues of computational speed and taxon sampling, and Vince tackled super-networks, and the conditions required for building networks from a collection of smaller (i.e. incomplete) trees. Both topics remain open questions. Cécile, on the other hand, discussed network models for trait evolution, which is important for the use of phylogenetic comparative methods when using networks.

On the user side, the presentations focused on examples, and the issues encountered when dealing with them. James Whitfield and Axel Janke talked about biology (mostly phylogenomics), while Johann-Mattis List talked about linguistics, and Tiago Tresoldi talked about stemmatology. In some ways, historical linguistics seems to be the odd one out, since many of the issues dealt with are somewhat removed from those in the other fields. However, in biology there are actually two options for producing networks: directly from the data or via "gene trees" (trees derived from non-recombining blocks of sequences). For the humanities, much of the current discussion is about the nature of the data, and how to code it for quantitative analysis.

This brings us to the discussions. While some time was spent on trying to establish whether biologists think that there is a difference between lateral gene transfer and horizontal gene transfer, or between incomplete lineage sorting, ancestral polymorphism and deep coalescence, some productive interchanges also occurred. Here is a coverage of four of the most important ones.

There was general agreement that there are several barriers to widespread adoption of network analyses in phylogenetics. This includes the development of suitable methods (in the face of indistinguishability), but also includes an understanding of what methods are currently available, what data are required to apply those methods, what taxon sampling is required to benefit from the methods, and how to use the programs that implement those methods.

One popular suggestion was therefore to produce some sort of "cookbook", to address the complexity of producing networks, given that there are many methods and programs. From the users' point of view this would illustrate what network analyses can do, in terms of finding reticulation patterns in the data; and from the computational point of view it would outline what needs to be done to get the programs to work. The consensus idea was to choose two suitable datasets (yet to be determined), and then have each program author provide analyses of them (including any scripts that are needed).

Following on from this latter point, it was agreed that the programs need easy user interfaces, if they are to become more widely used. Here, the word "widely" includes casual users from outside of phylogenetics, who use phylogenies as only one of many tools in their work. So, users will include those who need nothing more than a "point and click" control panel (which may be >90% of potential users) to those who would benefit from scripting control of the analyses. The interface needs both a front end, to specify the particular analysis, and a back end, to allow exploration of the output.

Another long-discussed issue was how to popularize networks, which is clearly a major topic. A phylogenetic tree is nothing more than one of the possible networks for any given dataset, and yet the focus is often on trees rather than networks.

To this end, it was noted that the current Wikipedia entry is inadequate, especially compared to the corresponding entry for phylogenetic trees. Not only is this entry out of date, it is in a number of ways misleading. In particular, there needs to be a discussion of the fact that, if a network is a "tree with reticulations", then ignoring the reticulations can result in the wrong tree, and the branch lengths may be severely under-estimated. There are challenges to getting Wikipedia entries changed, especially the wholesale re-writing of an entry, but this will be necessary.

Finally, it was noted that Philippe Gambette's Who is Who in Phylogenetic Networks website is extremely useful but is still poorly known, even within the phylogenetic networks community. We had a long discussion about how to enhance this site, to make it a more general-purpose repository of information about phylogenetic networks. This included a more inclusive database, more comprehensive tagging of keywords, enhanced descriptions of those keywords, and ways to keep the database up to date.

Steven Kelk has the notes from the final session, which was a review of what we achieved during the workshop, and which contains the To Do list. Both he and Philippe have the notes about modifications for the Who is Who in Phylogenetic Networks website, which is likely to be the first outcome-project tackled.

Thank you to everybody who participated in the workshop. It seemed to be very productive, with a number of concrete outcomes that will be interesting to review at the next workshop.

Friday, August 17, 2018

Distinguishability in Phylogenetic Networks, photos

Evidence that we were in the Netherlands.

Evidence that we did some work.

Left to right: Steven Kelk, David Morrison, Mike Steel, Philippe Gambette (obscured), Tiago Tresoldi, Claudia Solis-Lemus, Fabio Pardi, Simone Linz, Mark Jones.

Left to right: David Morrison, Cécile Ané, Philippe Gambette (obscured), Katharina Huber, Leen Stougie, Remie Janssen, Yukihiro Murakami, Mattis List, Gereon Kaiping and Charles Semple.

Left to right: David Morrison (obscured), Axel Janke, Steven Kelk, Charles Semple, Claudia Solis-Lemus, Mark Jones (obscured), Fabio Pardi, Leo van Iersel, Simone Linz and Vincent Moulton.

Céline Scornavacca lectures Cécile Ané.

Axel Janke and Leo van Iersel contemplate methods for inferring hybridization.

Philippe Gambette and Guido Grimm.

Mozes Blom and Jim Whitfield.

Mike Steel and Luay Nakhleh.

Luay delivers his Final Message, to Mozes Blom, Cécile Ané, Katharina Huber and Charles Semple.

Monday, August 13, 2018

Workshop: Distinguishability in Phylogenetic Networks

This week we are back in Leiden (in the Netherlands), for the third workshop sponsored by the Lorentz Center. The first workshop, in October 2012, is discussed in this blog post: Workshop: The Future of Phylogenetic Networks. The second one, in July 2014, is discussed here: Workshop: Touching the Data.

I say "we" because all of the blog authors are attending, making up nearly one-quarter of the participants. As before, it has been organized by Steven Kelk, Leo van Iersel, and David Morrison, this time along with Céline Scornavacca. The program and abstracts can be found here. It runs for the whole week 13 August – 17 August 2018.

The workshop is similar to the previous one, in that it is intended to be a small and well-focused event. The basic aim, as before, is to get biologists and computational people to sit down in a small group and actually talk about real phylogenetic issues. The main issue at hand this time can be called "indistinguishability", which significantly complicates the reconstruction, analysis and interpretation of phylogenetic networks. This includes the problems of: (i) distinguishing horizontal from vertical descent, (ii) distinguishing among reticulate processes, (iii) distinguishing reticulate evolution from incomplete lineage sorting, and (iv) distinguishing among network topologies.

You will note that we seem to be in the Netherlands in World Cup years. This is quite safe this year, but last time the workshop was in July, and my wife and I traveled through Germany at the end of the campaign, which was an "interesting" experience.

I am hoping to add some blog posts based on what happens at the workshop, either as it proceeds or at the end.

Monday, August 6, 2018

Trivial data, but not so trivial graphs

One may expect that perfectly compatible, trivial data will lead to perfect trees that are trivial to interpret. And this may really be the case when phylogenetics is restricted to contemporary taxa and molecular data. Adding to various earlier posts that deal with data patterns and their representation in inference graphs (e.g. Networks can outperform PCA..., Stacking neighbour-nets..., Clades, cladograms, cladistics ... and networks ...), I will show in this post what we get when we deal with very trivial, straightforward to interpret, data.

Two trivial scenarios: a linear and a dichotomous evolutionary sequence

The virtual data matrix for our experiment comprises seven taxa (OTUs) from different time scales and six binary (Dollo) characters. There are two historical scenarios that are supported by patterns in the data (see the first figure).

The linear scenario has a mother taxon that evolves by acquiring a unique, persistent trait, and is replaced by its daughter taxon through time. In contrast, the dichotomous scenario has two subsequent events of cladogenesis: the all-ancestor A splits into two taxa (B, E), each defined by a unique change in a binary character passed on to their descendants. B and E then underwent a second cladogenetic event, giving rise to C+D and F+G.

The resultant data matrices have different properties. In the case of the linear evolution, all changes lead to synapomorphies sensu Hennig (characters #1–#5) along with one terminal autapomorphy of the latest member of the lineage, G (character #6).

In the case of the dichotomous evolution, we have two synapomorphies supporting the BCD and EFG clades (characters #1, #4), respectively, and four autapomorphies (each one for C, D, F and G, the youngest set of taxa).

The following figure shows the character-based splits (taxon bipartitions) for the linear evolution scenario:
(Trivial splits, one taxon separated from all others, in blue)

Reconstructing the (true) evolutionary pathway is trivial based on this perfect split pattern, especially if we know that A is the oldest taxon and G the youngest.

It's equally straightforward for our second scenario, with perfectly dichotomous evolution:

Character 1 and character 4 define taxon cliques comprising B,C,D and E,F,G. The remaining characters indicate that C,D and F,G derive from B and E, respectively.
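For readers who want to reproduce the split patterns, the two matrices can be written down and their splits listed with a few lines of code. This is only a sketch: the matrices are my reconstruction from the description above, and for the dichotomous scenario the assignment of the four autapomorphies to characters #2, #3, #5 and #6 is an assumption (only characters #1 and #4 are fixed by the text). For binary characters, the non-trivial splits are exactly the ones that separate at least two taxa from at least two others.

```python
from typing import Dict, FrozenSet, List, Tuple

Matrix = Dict[str, str]  # taxon -> string of binary character states
Split = Tuple[FrozenSet[str], FrozenSet[str]]  # (state 0 side, state 1 side)

# Linear scenario: each taxon keeps all earlier traits and adds one new trait.
LINEAR: Matrix = {
    "A": "000000", "B": "100000", "C": "110000", "D": "111000",
    "E": "111100", "F": "111110", "G": "111111",
}

# Dichotomous scenario: A splits into B and E, which give rise to C+D and F+G.
# Which autapomorphy belongs to which terminal taxon is an assumption.
DICHOTOMOUS: Matrix = {
    "A": "000000", "B": "100000", "C": "110000", "D": "101000",
    "E": "000100", "F": "000110", "G": "000101",
}


def splits(matrix: Matrix) -> List[Split]:
    """Return one taxon bipartition per character."""
    n_characters = len(next(iter(matrix.values())))
    result = []
    for i in range(n_characters):
        derived = frozenset(t for t, states in matrix.items() if states[i] == "1")
        ancestral = frozenset(matrix) - derived
        result.append((ancestral, derived))
    return result


def is_trivial(split: Split) -> bool:
    """A split is trivial if it separates at most one taxon from the rest."""
    return min(len(split[0]), len(split[1])) <= 1


for name, matrix in (("linear", LINEAR), ("dichotomous", DICHOTOMOUS)):
    print(name)
    for number, (ancestral, derived) in enumerate(splits(matrix), start=1):
        kind = "trivial" if is_trivial((ancestral, derived)) else "non-trivial"
        print(f"  character {number}: {sorted(ancestral)} | {sorted(derived)} ({kind})")
```

Under these assumptions, the linear matrix yields four non-trivial splits and the dichotomous matrix two.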

Explicit inferences

As stated above, the data properties for both scenarios are different. The matrices have a different number of parsimony-informative characters (4 for linear, 2 for dichotomous). Accordingly, the reconstructed optimal trees (here using the maximum parsimony, least-squares, and maximum likelihood criteria), are better resolved / more correct for the linear than for the dichotomous evolution.

MPT = most-parsimonious tree; ML = maximum likelihood. *Corrected for ascertainment bias.

Using all of the variable characters, NJ and ML are generally more decisive and produce higher support for the right branches. But for the dichotomous evolution scenario, they also show ghost-clades ("para-clades" as they include close relatives sharing a recent common origin, but do not represent monophyletic groups sensu Hennig) with low support. The corresponding MPT has no ghost-clades, but it also provides no clues to how B,C,D and E,F,G are related to each other.

Beyond this, and as can be seen in many real-world examples, there is no fundamental difference between character-based inferences such as maximum parsimony (MP) or maximum likelihood (ML) and distance-based inferences (NJ) fulfilling (here) the least-squares criterion (sometimes still called "phenetic" inferences in contrast to the "phylogenetic" parsimony, Bayesian inference and maximum likelihood).

The differences diminish further when we look at the phylograms instead of the cladograms, as shown next.

Another observation we can make is that for the linear-evolution scenario (four synapomorphies), the ascertainment bias correction under ML has little effect, but it is crucial for the dichotomous evolution (two synapomorphies) to get sensible branch lengths.

Parsimony provides the most conservative (and least decisive) results for the dichotomous-evolution scenario, also because of the way I applied it: PAUP* allows optimizing trees with hard polytomies when using the default branch-and-bound search (for tree inference as well as bootstrapping), whereas the NJ / BioNJ algorithm and the ML implementation in RAxML will always produce fully dichotomized trees, including zero-length or near-zero-length branches. This explains the difference in the support values of preferred and alternative splits.

(Non-filtered) Bootstrap support consensus networks for the linear evolution scenario. Same scale for all graphs, trivial splits (dashed lines) collapsed.
(Non-filtered) Bootstrap support consensus networks for the dichotomous evolution scenario.

Trees are not wrong, but they miss the point

None of the graphs above show anything strongly erroneous, but they also don't fully capture the evolutionary pathways — that is, the actual ancestor-descendant relationships. This is because our taxon set includes ancestral forms, which, in traditional trees, have to be placed as sisters to part or all of their descendants. Networks provide a quick solution to this limitation.

Median-joining networks inferred with NETWORK for both scenarios, with the inferred (and real) character changes annotated along edges.

Neighbour-nets inferred with SplitsTree 4.13.1 for both scenarios, based on the mean (Hamming) pairwise distances.

The two (perfectly tree-like) graphs, one parsimony-based, the other distance-based, look identical, and place all of the taxa exactly where they should be: the ancestors on the nodes ("medians"), and their (latest) descendants at the tips. But note that in the case of the Neighbour-net this is a visual illusion / approximation: in fact, the ancestors are actually connected by zero-length edges to the node they appear to be sitting on.
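Why the distance-based graph can place ancestors on nodes can be checked directly on the pairwise distances. The sketch below does not compute a Neighbour-net; it only tests two properties of the Hamming distances: whether they fit a tree exactly (the four-point condition), and which taxa lie exactly on the path between two other taxa, so that no edge length of their own is needed. The matrices are again my reconstruction from the description (with the autapomorphies of the dichotomous scenario assigned by assumption), and raw difference counts are used instead of mean distances, which only rescales all values by the number of characters.

```python
from itertools import combinations
from typing import Dict, Set

Matrix = Dict[str, str]  # taxon -> string of binary character states

LINEAR: Matrix = {
    "A": "000000", "B": "100000", "C": "110000", "D": "111000",
    "E": "111100", "F": "111110", "G": "111111",
}
# Assignment of autapomorphies to characters 2, 3, 5, 6 is an assumption.
DICHOTOMOUS: Matrix = {
    "A": "000000", "B": "100000", "C": "110000", "D": "101000",
    "E": "000100", "F": "000110", "G": "000101",
}


def hamming(matrix: Matrix, x: str, y: str) -> int:
    """Number of characters in which taxa x and y differ."""
    return sum(a != b for a, b in zip(matrix[x], matrix[y]))


def is_additive(matrix: Matrix) -> bool:
    """Four-point condition: for every quartet, the two largest sums are equal."""
    for a, b, c, d in combinations(matrix, 4):
        sums = sorted([
            hamming(matrix, a, b) + hamming(matrix, c, d),
            hamming(matrix, a, c) + hamming(matrix, b, d),
            hamming(matrix, a, d) + hamming(matrix, b, c),
        ])
        if sums[1] != sums[2]:
            return False
    return True


def internal_taxa(matrix: Matrix) -> Set[str]:
    """Taxa lying exactly on the path between two other taxa."""
    result = set()
    for x in matrix:
        others = [t for t in matrix if t != x]
        for y, z in combinations(others, 2):
            if hamming(matrix, y, x) + hamming(matrix, x, z) == hamming(matrix, y, z):
                result.add(x)
                break
    return result


for name, matrix in (("linear", LINEAR), ("dichotomous", DICHOTOMOUS)):
    print(name, "additive:", is_additive(matrix),
          "on paths between others:", sorted(internal_taxa(matrix)))
```

With these matrices, both distance sets fit a tree exactly. In the dichotomous scenario A, B and E lie on paths between other taxa, while in the linear scenario this holds for B to F, with A and G at the two ends of the chain.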

Given that both scenarios used here produce trivial, straightforward to interpret, data patterns (see the first figures), the failure of the traditional tree inferences to get it completely right can be a bit unsettling. Trees including primitive-old and derived-new forms are common in the (palaeontological) literature, and typically show many branches lacking high support (note that only ML produced a bootstrap support >90 for a true-tree branch, and only for the linear evolution scenario). To address evolution over time, networks should hence be standard applications, rather than the exception. Cladograms should be long gone, as they show very little beyond the most trivial.

If we want trees (and many of us want trees!), we need tree inferences that can optimize an older taxon on an internal branch or node, to accommodate potentially ancestral forms.

Related blog posts

In Clades, cladograms, cladistics, and why networks are inevitable, I argue that we cannot get around networks when we aim to study taxa from different time scales using their morphologies.

Digging deeper: Population dynamics and individual-based fossil phylogenies raises the question of what we deal with when we use individual fossils (i.e. long-dead individuals) as OTUs in our phylogenetic inferences.

Monophyletic groups in networks by David gives an introduction into (fringe) terminology. What to do when dealing with more than a single most-recent common ancestor and past reticulation?

Networks and most recent common ancestors by David discusses the concepts of conservative MRCAs (most recent common ancestors), fuzzy MRCAs and (alternative) LCA — lowest (last) common ancestors in the face of reticulation.

In Stacking neighbour-nets: ancestors and descendants, I outline how one may (and why one should) stack Neighbour-nets to analyse the evolutionary history of a group including (mostly) fossil representatives.

The first Darwinian evolutionary tree[s] show features one rarely finds in a modern-day phylogenetic tree: ancestral and descendant forms, ancestral taxa addressed as species and not higher taxa, and gradual transition between forms (post by David).

Tree metaphors and mathematical trees by David, which introduces János Podani's concept about "branching silhouettes" and how to depict an actual evolutionary tree.

Where have all the ancestors gone? discusses the common notion that we don't have to deal with ancestor-descendant problems in phylogenetics at all, because the scarcity of the (terrestrial) fossil record ensures that we only find extinct side (sister) lineages.