Tuesday, June 28, 2016

Rate variation and gene tree discordance

Several years ago, I published a post about variation in nucleotide substitution rates along lineages: Is rate variation among lineages actually due to reticulation?

In that post I suggested that a reticulate evolutionary history would likely be modelled as apparent rate variation if the data were forced into a tree model. That is, in a tree model, the sudden influx of new genomic material due to the reticulation event could only be modelled as a sudden change in substitution rate. Therefore, a lot of what tree-based phylogeneticists see as rate variation might actually be reticulation.

I have often wondered why this topic of pseudo-rate variation has not been investigated. It turns out that it now has been, to some extent. Fábio K. Mendes and Matthew W. Hahn (2016. Gene tree discordance causes apparent substitution rate variation. Systematic Biology 65: 711-721) have at least confirmed the idea in terms of gene tree discordance.

They note:
"Substitution rates are known to be variable among genes, chromosomes, species, and lineages due to multifarious biological processes. Here, we consider another source of substitution rate variation due to a technical bias associated with gene tree discordance. Discordance has been found to be rampant in genome-wide data sets, often due to incomplete lineage sorting (ILS). This apparent substitution rate variation is caused when substitutions that occur on discordant gene trees are analyzed in the context of a single, fixed species tree. Such substitutions have to be resolved by proposing multiple substitutions on the species tree."
All of this is true, and the authors demonstrate this using simulations. They show that the artificially increased level of apparent rate variation becomes more obvious with increasing levels of ILS, and on trees with larger numbers of taxa.
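
The mechanism in the quoted passage can be made concrete with a small parsimony count. The following is a minimal sketch, not the authors' simulation: it uses Fitch parsimony on a made-up four-taxon example (the taxon names, trees, and site pattern are all assumptions for illustration) to show that a single substitution on a discordant gene tree needs two substitutions when it is forced onto the species tree.

```python
def fitch_changes(tree, states):
    """Return the minimum number of state changes (Fitch parsimony) that a
    site pattern requires on a rooted binary tree given as nested tuples."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no change needed
            return shared
        changes += 1                       # children disagree: count one change
        return left | right

    visit(tree)
    return changes


# Hypothetical four-taxon example (taxon names are placeholders).
species_tree = ((("A", "B"), "C"), "D")
gene_tree = ((("A", "C"), "B"), "D")       # discordant topology, e.g. through ILS

# A site with one substitution on the gene-tree branch leading to (A, C).
site = {"A": "T", "B": "G", "C": "T", "D": "G"}

on_gene_tree = fitch_changes(gene_tree, site)
on_species_tree = fitch_changes(species_tree, site)
print(on_gene_tree, on_species_tree)       # prints "1 2"
```

The extra change on the species tree is exactly the kind of apparent additional substitution that would then be read as a higher rate on some branch.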

Now, all that has to be done is to demonstrate the same thing when gene tree discordance is due to reticulation rather than incomplete lineage sorting.

Tuesday, June 21, 2016

Alignments and phylogenetic reconstruction in linguistics and biology

In an article from 2009 (Morrison 2009), David discusses the question of why phylogeneticists would "ignore computerized sequence alignment". This article interested me for two reasons. First, it provides statistics on the degree to which biologists manually adjust the alignments that were automatically produced by software. Second, it points to the seemingly strange situation in biology in which tree-building is considered to be a task that can be carried out entirely by machines, while the majority of scholars would not trust their final sequence alignments to a computer (Morrison 2009: 150).

This situation has a direct analogue in historical linguistics. Phylogenetic reconstruction is gaining more and more ground, with many scholars applying (mostly Bayesian) phylogenetic tools to analyze their data (Indo-European: Bouckaert et al. 2012, Tupí-Guaraní (South America): Michael et al. 2015, Japonic: Lee and Hasegawa 2011, Pama-Nyungan (Australian): Bowern and Atkinson 2012, Semitic: Kitchen et al. 2009, Bantu: Grollemund et al. 2015, etc.). Fully automated workflows involving automatic sequence comparison are also practiced (Holman et al. 2011, Jäger 2015, Wheeler and Whiteley 2015), but many linguists remain sceptical regarding their results.

One major difference between biology and linguistics is the selection of comparanda. Biological methods usually derive phylogenetic trees from multiply aligned sequences. Linguistic methods derive trees from sets of homologous (cognate) words (cognate sets) distributed across languages, whose evolution is modeled as a process of word gain and word loss (similar to gene-family gain-loss studies in biology). While biologists fiddle with their alignments, linguists fiddle with their cognate sets. At the moment, cognate identification is done exclusively by hand, and scholars use all the information about word relations that they can get, be it etymological dictionaries, which have been published for more than 200 years, or the intuition of the expert who is annotating the data for cognacy.
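
To make the gain-loss view concrete, here is a minimal sketch of how cognate judgements are commonly turned into a binary presence/absence matrix, one column per cognate set. The data structure, the function name `presence_absence`, and the cognate-set labels are assumptions for illustration; the toy judgements are not taken from any of the cited studies.

```python
# Toy cognate judgements: concept -> {language: cognate-set label}.
# The labels ("sun-1", "hand-2", ...) are hypothetical identifiers.
cognates = {
    "sun":  {"German": "sun-1", "Swedish": "sun-1", "Italian": "sun-1"},
    "hand": {"German": "hand-1", "Swedish": "hand-1", "Italian": "hand-2"},
}


def presence_absence(cognates):
    """Encode each cognate set as one binary character per language."""
    languages = sorted({lang for words in cognates.values() for lang in words})
    cognate_sets = sorted(
        {label for words in cognates.values() for label in words.values()}
    )
    matrix = {
        lang: [
            # 1 if the language has a word belonging to this cognate set
            int(any(words.get(lang) == label for words in cognates.values()))
            for label in cognate_sets
        ]
        for lang in languages
    }
    return cognate_sets, matrix


columns, matrix = presence_absence(cognates)
print(columns)
for language, row in matrix.items():
    print(language, row)
```

Each column can then be treated like a gene family that is gained once and possibly lost along the tree, which is why the quality of the cognate judgements matters so much for the final result.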

Identification of cognate sets in linguistics is essentially a task of sequence comparison (List 2014), and algorithmic as well as manual procedures involve the multiple and the pairwise alignment of words (even if it is done only implicitly by human experts). Compared to biology, sequence comparison in historical linguistics is complicated by two factors:
  • alphabets (phoneme systems) in linguistics are themselves mutable (Geisler and List 2013), so that when aligning two words we need to find both a mapping between the two alphabets, translating one alphabet into the other, plus a scoring function by which we can score the alignment,
  • regular sound change (the process by which the phoneme system is changed) and sporadic sound change (the process by which a sound is sporadically assimilated, lost, or added) are not the only processes that contribute to change of words in the lexicon, and morphological change (by which whole blocks of meaningful parts of a word are re-arranged, exchanged, lost, or added) yields patterns that are essentially unalignable.
The problem of finding the correct mapping between two alphabets in linguistics is further exacerbated by language contact: if languages exchange words on a large scale, then this may have a huge impact on the sound systems of the languages, and it may even introduce new sounds to a language that were not there before (thanks to English, German now has the sound [dʒ], as in journalist or job). If borrowing is frequent enough, it may become close to impossible to judge, from comparing the words alone, whether two words in different languages have been transferred directly (vertically) from an ancestral language, or laterally.
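
The role of the alphabet mapping and the scoring function can be sketched with a standard global alignment (Needleman-Wunsch) of two words written as lists of sounds. This is a toy illustration only: the function names, the gap penalty, and the small set of "corresponding" sound pairs are assumptions made up for the example, not an established scoring scheme.

```python
def align(seq_a, seq_b, score, gap=-1):
    """Globally align two sequences of sound segments (Needleman-Wunsch)."""
    n, m = len(seq_a), len(seq_b)
    # table[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and table[i][j]
                == table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return table[n][m], row_a[::-1], row_b[::-1]


# Hypothetical mapping between the two sound systems (an assumption).
CORRESPONDING = {("o", "ɔ"), ("e", "ɛ")}


def correspondence_score(a, b):
    """Score identical or corresponding sounds as matches, all else as mismatches."""
    return 1 if a == b or (a, b) in CORRESPONDING or (b, a) in CORRESPONDING else -1


italian = ["s", "o", "l", "e"]
french = ["s", "ɔ", "l", "ɛ", "j"]
total, row_italian, row_french = align(italian, french, correspondence_score)
print(total)
print(" ".join(row_italian))
print(" ".join(row_french))
```

With a scoring function that knows nothing about the mapping (identity only), the same words score much worse, which is the point made above: the alignment is only as good as the assumed correspondences between the two alphabets.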

As a result, it is understandable that linguists often refuse to carry out full alignments of the words in their data. An alignment itself does not necessarily tell us much, compared to all of those processes that an expert infers when comparing language data, which are not alignable.

As an example, let us consider the word for "sun" in six Indo-European languages. Since "sun" is a very basic concept, probably fundamental for all human cultures, experts assume that this word was present as *séh₂u̯el- in Indo-European (an asterisk indicates that the word is not reflected in written sources), and that it was retained as Russian солнце [sɔnʦə], Polish słońce [swɔnʲʦɛ], French soleil [sɔlɛj], Italian sole [sole], German Sonne [sɔnə], and Swedish sol [suːl] (Wodtko et al. 2008). An obvious alignment, reflecting the surface similarity between all of these words, would be the following one (taken from List 2014: 135):

Alignment based on sequence similarity.

This alignment, however, is by no means correct. Russian [sɔnʦə] and Polish [swɔnʲʦɛ], for example, share a common suffix, which is reflected as [nʦə] in Russian and as [nʲʦɛ] in Polish, and which was innovated in the common ancestor of Russian and Polish, but is not present in any of the four other languages. So the [n] in German [sɔnə] is essentially not homologous with the [n] in Russian or the [nʲ] in Polish. The same applies to the [ɛj] in French [sɔlɛj], which reflects a diminutive suffix in Latin sol-iculus "small sun", the regular ancestor form of French soleil. Furthermore, the [w] in the Polish word regularly corresponds to the [l] in French, Italian, and Swedish, but it reflects a swap (metathesis) in the order of the vowel and the consonant in Polish: [sɔl] became [slɔ], which became [swɔ].

Taking all of this (and more) into account, we need to modify our alignment to account more closely for the processes that experts have inferred from intensive language comparison, as shown in the next figure (taken from List 2014: 135). In this alignment, the swap in Polish is reflected by the white font of the sounds involved, and gray-shaded columns are supposed to reflect the oldest layer of homology.

Historically informed alignment.

However, even this alignment is essentially misleading. The Indo-European word for "sun" supposedly had a complex paradigm in which the word's stem alternated between the nominative and accusative cases and the other (oblique) cases. So, nominative and accusative used the stem *sóh₂u̯el-, while the oblique cases used the stem *sh₂én-. The Russian, Polish, French, Italian, and Swedish forms go back to the former, while the German form goes back to the latter, since it is further assumed that the alternation was still preserved in the ancestor of Swedish and German.

This means, however, that our alignment above shrinks to an alignment in which only the first letter, the s, is still reflected in all languages! The following graphic (taken from List 2016) illustrates the processes that led to the current situation for four of our six languages:

Morphological processes of lexical change.

What does this example tell us? On the one hand, it gives some explanation for why linguists do not really want to align words (although the first alignments go back to the early 20th century, cf. Dixon and Kroeber 1919). It also explains why classical linguists have a very sceptical attitude towards the computerization of word comparisons, based on the (partially justified) assumption that computers could not handle the complex patterns that are so characteristic of language change.

On the other hand, comparing the situation with biology as reported in Morrison (2009), we can find an interesting parallel between the two disciplines: neither linguists nor biologists really trust machines to compare their sequences (albeit at different levels of analysis), but both seem to have few problems in trusting machines to reconstruct their trees.

However, especially this last point, the fact that we trust machines to grow our trees, while we distrust them to prepare the seeds, should ring an alarm bell. First, we seem to lack clear guidelines (at least in linguistics) regarding the way the manual adjustment (of alignments in biology and cognate sets in linguistics) should be carried out, which has a clear impact on repeatability. Second, if we have processes in both fields that yield essentially unalignable patterns, such as duplications and other molecular processes in biology (Morrison 2009: 156), and morphological processes in linguistics, how can we assume that a phylogenetic tree analysis can sufficiently cope with them, even if we manually adjust everything?

  • Bouckaert, R., P. Lemey, M. Dunn, S. Greenhill, A. Alekseyenko, A. Drummond, R. Gray, M. Suchard, and Q. Atkinson (2012): Mapping the origins and expansion of the Indo-European language family. Science 337.6097. 957-960.
  • Bowern, C. and Q. Atkinson (2012): Computational phylogenetics of the internal structure of Pama-Nyungan. Language 88. 817-845.
  • Dixon, R. and A. Kroeber (1919): Linguistic families of California. University of California Press: Berkeley.
  • Geisler, H. and J.-M. List (2013): Do languages grow on trees? The tree metaphor in the history of linguistics. In: Fangerau, H., H. Geisler, T. Halling, and W. Martin (eds.): Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 111-124.
  • Grollemund, R., S. Branford, K. Bostoen, A. Meade, C. Venditti, and M. Pagel (2015): Bantu expansion shows that habitat alters the route and pace of human dispersals. Proceedings of the National Academy of Sciences 112.43. 13296-13301.
  • Holman, E., C. Brown, S. Wichmann, A. Müller, V. Velupillai, H. Hammarström, S. Sauppe, H. Jung, D. Bakker, P. Brown, O. Belyaev, M. Urban, R. Mailhammer, J.-M. List, and D. Egorov (2011): Automated dating of the world’s language families based on lexical similarity. Curr. Anthropol. 52.6. 841-875.
  • Jäger, G. (2015): Support for linguistic macrofamilies from weighted alignment. Proceedings of the National Academy of Sciences 112.41. 12752-12757.
  • Kitchen, A., C. Ehret, S. Assefa, and C. Mulligan (2009): Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc. R. Soc. London, Ser. B 276.1668. 2703-2710.
  • Lee, S. and T. Hasegawa (2011): Bayesian phylogenetic analysis supports an agricultural origin of Japonic languages. Proc. R. Soc. London, Ser. B 278.1725. 3662-3669.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • List, J.-M. (2016): Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1. DOI: 10.1093/jole/lzw006.
  • Michael, L., N. Chousou-Polydouri, K. Bartolomei, E. Donnelly, V. Wauters, S. Meira, and Z. O’Hagan (2015): A Bayesian phylogenetic classification of Tupí-Guaraní. LIAMES 15.2. 193-221.
  • Morrison, D. (2009): Why would phylogeneticists ignore computerized sequence alignment? Syst. Biol. 58.1. 150-158.
  • Wheeler, W. and P. Whiteley (2015): Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2. 113-125.
  • Wodtko, D., B. Irslinger, and C. Schneider (2008): Nomina im Indogermanischen Lexikon [Nouns in the Indo-European lexicon]. Winter: Heidelberg.

Tuesday, June 14, 2016

Grape genealogies are networks, not trees

I have noted before that the genealogies for all domesticated organisms are networks not trees, and specifically they are hybridization networks. That is, in sexually reproducing species, every offspring is the hybrid of two parents. If we include both parents in the pedigree, plus all of their relatives, then this will form a complex network every time inbreeding occurs.
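
As a minimal sketch of this idea, a pedigree can be stored as a mapping from each individual to its two parents, and the inbreeding events that create reticulation can be found by checking whether the two parents share any ancestry. The pedigree below and the names in it are placeholders invented for the example, not real cultivars.

```python
# Hypothetical pedigree: offspring -> (parent, parent).
pedigree = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),   # offspring of two full siblings
}


def ancestors(individual, pedigree):
    """Collect every known ancestor of an individual."""
    found = set()
    stack = list(pedigree.get(individual, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(pedigree.get(current, ()))
    return found


def inbred_individuals(pedigree):
    """Map each individual whose parents are related to the shared ancestors.

    A parent counts as related to the other parent if they have a common
    ancestor, or if one of them is an ancestor of the other.
    """
    result = {}
    for child, (first, second) in pedigree.items():
        shared = (ancestors(first, pedigree) | {first}) & (
            ancestors(second, pedigree) | {second}
        )
        if shared:
            result[child] = shared
    return result


inbred = inbred_individuals(pedigree)
print(inbred)
```

In this toy pedigree only E is flagged, because its parents C and D both descend from A and B; every such individual closes a loop in the genealogy that a tree cannot show without drawing someone twice.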

I have previously illustrated this phenomenon using genealogies of grape cultivars:
     Are phylogenetic trees useful for domesticated organisms?
     First-degree relationships and partly directed networks

Reconstructing grape genealogies is often a tricky business. This was originally done using phenotypic characters and historical records, of course, but these days we use DNA from whatever cultivars are available for sampling. Perhaps the biggest problem is that many of the cultivars are no longer known (there have been at least 10,000 of them recorded at some time in history), so that the genealogies are full of question marks representing unknown (unsampled) parents.

The practical consequence of this is that the time direction of the genealogy will be ambiguous whenever there is a missing parent. Estimates of identity-by-descent (IBD) are calculated based on linkage analysis for all pairwise comparisons of samples, and complex crossing schemes can generate IBD values that are indistinguishable from sibling relationships. So, in these cases we cannot distinguish parent-offspring relationships from sibling relationships.

A simple example is shown in the most detailed current book on grape cultivars:
Jancis Robinson, Julia Harding, José Vouillamoz (2012) Wine Grapes: a Complete Guide to 1,368 Vine Varieties, including their Origins and Flavours. Allen Lane / Ecco.
This example involves the grand-parentage of the Shiraz grape, usually called Syrah in the effete monarchies of the Old World. The authors present three possible scenarios, as shown here.

There are five sampled cultivars and two inferred unknowns, arranged in an unrooted network. Because the unknowns are inferred to be parents, the network can be rooted in any of three different places, as shown by the three Options illustrated.

The authors (or, more specifically, the third author, who is the one responsible for the genealogies) are in favour of Option A. This means that Mondeuse Noire and Viognier are Syrah's half-siblings rather than either being a grandparent.

This small genealogy is a tree, but when we move to larger genealogies the network nature of the cultivars should become obvious.

However, the authors resort to a standard subterfuge to hide this fact. This strategy is to show cultivars multiple times in the genealogies, to avoid drawing reticulate relationships. I have illustrated this approach a couple of times before in this blog:

     Reducing networks to trees
     Thoroughbred horses and reticulate pedigrees

In the following genealogy of the Pinot cultivar, the authors note: "For the sake of clarity, Trebbiano Toscano and Folle Blanche appear twice in the diagram."

Trees reign supreme as simplifications of networks!

Wednesday, June 8, 2016

Why do so few biologists look at their phylogenetic data?

Most data analyses involve processing the data using some model. For example, standard parametric statistical tests assume a normal distribution for the "error" term, as well as equal variances and linear relationships between the variables. If these model assumptions do not hold, then any inferences from the tests may be incorrect.

It is possible to look at any dataset in a model-free manner, although this does not necessarily lead to any strong inferences. Looking at data is usually called exploratory data analysis. This is often done using graphs of various types.

Exactly the same principle applies to phylogenetics. A phylogenetic tree is an inference from the data via a given model. The inference is a reconstructed genealogical history assuming a divergent tree. In this context, different models will often (usually?) give different inferences.

Therefore, most phylogeneticists never actually see their data. What they see, instead, is the data as processed through some model. That is, they see inferences from the model, not the original data. Models are important, but the data should be even more important, for a scientist.

It is thus interesting that so many phylogeneticists skip the step of looking at their data, and proceed immediately to the model-based inference. So many of the disagreements throughout the literature end up being about the models and not the data. There are very strong opinions about which models should be used, with less attention being paid to whether the data contain sufficient information to answer the original scientific question in the first place.

A specific example of this was discussed in some earlier blog posts:
Conflicting placental roots: network or tree?
Why are there conflicting placental roots?
In this example there are three possible genealogical patterns, each of which has been reported to receive strong support from model-based tree inference of nucleotide sequences. However, when looking at the sequence data themselves, in a model-free manner using data-display networks, any one dataset shows all three possible patterns. So, any inference of a single tree is coming from the model not from the data. That is, the data do not distinguish between the three genealogies, but the models do discriminate amongst them.
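
A very simple way to look at such data without a substitution model is to count, for four taxa, how many alignment columns support each of the three possible groupings. The sketch below does this for a made-up alignment; the taxon names, sequences, and the function name `split_support` are assumptions for illustration, and it is a much cruder summary than a data-display network.

```python
from collections import Counter


def split_support(seqs, taxa):
    """Count alignment columns supporting each of the three splits of four taxa."""
    w, x, y, z = taxa
    counts = Counter()
    for p, q, r, s in zip(*(seqs[t] for t in taxa)):
        if len({p, q, r, s}) != 2:
            continue                       # constant or multi-state column
        if p == q and r == s:
            counts[f"{w}{x}|{y}{z}"] += 1
        elif p == r and q == s:
            counts[f"{w}{y}|{x}{z}"] += 1
        elif p == s and q == r:
            counts[f"{w}{z}|{x}{y}"] += 1
        # columns with a 3:1 pattern support no split and are skipped
    return counts


# Made-up aligned sequences for four placeholder taxa.
seqs = {
    "W": "ACATGCA",
    "X": "ACGCATA",
    "Y": "GTATATA",
    "Z": "GTGCGCA",
}

support = split_support(seqs, ["W", "X", "Y", "Z"])
for split, n_sites in sorted(support.items()):
    print(split, n_sites)
```

In this toy alignment all three splits get the same number of supporting columns, so any single tree preferred by an analysis would owe more to the model than to the data.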

It is worth mentioning here that a haplotype network is not a genealogy. Instead, it is a summary of a population dataset, which may contain some phylogenetic patterns or it may not. So, a haplotype network is closer to exploratory data analysis than it is to model-based inference. This point is clearly made by Jessica W. Leigh and David Bryant (2015. PopART: full-feature software for haplotype network construction. Methods in Ecology and Evolution 6: 1110-1116):
"The haplotype networks do provide, however, a concise and accessible representation of the data themselves, one aspect which is often lost in methods heavily dependent on model-based inference."
Looking at the data before you start processing them can be a very good idea. After all, you may be able to avoid unlikely inferences.