Showing posts with label Philosophy. Show all posts

Monday, February 4, 2019

Should we bother about character independence?


The comments of David Marjanović on one of my last posts (Please stop using cladograms!) kept me musing about an old question of mine: Why should we be concerned about whether characters in a matrix are independent or not?

When I started to get into phylogenetics (I taught myself by reading and just doing it and never had a course in phylogenetics at university), I learned that the most important thing for a phylogenetic matrix is:
All characters are independent of each other.
In other words: the mutation (change) in one character doesn't affect the mutation (change) in any other character.

I could never wrap my head around this. After all, the characters are all part of the same organism and must therefore function together, so how can they possibly be biologically independent? Even the fact that everything is part of the same universe means that everything is functionally dependent to one extent or another — when a butterfly sneezes the polar bears tremble, as they poetically say.

However, what is meant is that characters must be independent enough for practical mathematical purposes. This is a fundamental assumption of most mathematical analyses, in order to make them tractable. Trying to account for the dependencies is far too difficult, mathematically.

However, it is still worthwhile thinking about whether these "practical purposes" are likely to be realistic for phylogenetics. Consider this:
  1. Traditional phylogenetics mostly uses morphological traits, some of which must have been evolutionarily beneficial and evolved as a consequence of the same adaptive process.
  2. Working at the tips of the tree of life, our data were from the nuclear-encoded 35S rDNA, the cistron encoding the 18S rRNA (small subunit), 5.8S rRNA, and 25S rRNA (large subunit, erroneously called 26S in some of the phylogenetic literature), which is known for compensatory mutations (e.g. strands of the 5.8S rRNA have to fit to the 5' end of the 25S rRNA; here's a link for those interested in RNA structure).
To investigate point 1, let's look at a dolphin (image source) and a bat (image source).


Without sequencing their entire genomes and establishing the function of each gene (and knocking out one or another gene during development), we cannot assess how genetically independent the traits are that make a dolphin a near-perfect swimmer, and a bat the only actively flying mammal. But obviously, a lot of their traits are adapted to this single function of movement. The practical consequence is that instead of a plethora of distinguishing characters, we can only score two fully independent ones: "can swim" versus "can fly".
(And then eliminate these two, because another rule in phylogenetics is that we should only include characters that are not under positive selection. The commonly implemented models all assume that evolution is neutral. This is why Charles Darwin has two parts to The Origin, one discussing historical dependence of characters and one discussing natural selection.)

As for point 2, everyone who has worked with ITS, the internal transcribed spacers of the 35S rDNA, can easily see that some mutational patterns always come in pairs or some other series. We can correct for linked mutations during inference by using the assumed secondary structures as a functional corrective. This is rarely done, because even without this correction you still get trees (or networks) that make sense.

Linked mutations and evolutionary trends within the LP3 of the 5' ITS2 in species of Acer section Acer (see my Ph.D. thesis, open access; figure from Grimm et al., Plant Syst. Evol., 2007). This (non-coding) length-polymorphic region (found in all angiosperms in various modifications) comprises an upstream CT- and partly linked (complementary) downstream GA-motif.


A very simple example

Let's take a group of very simple, made-up organisms differing in two trait complexes (note that it may be a collection of genes that trigger the difference): form and colour.




In total, "evolution" came up with 15 different combinations ("species"), five of which are extinct, two of which are primitive in the sense that they still occur today, but have also been found as fossils.

We all know that morphologies have a high level of homoplasy. Homoplastic traits mean that groups will not accurately reflect the true tree. Having as many forms (9) as colors (9), we have no clue as to which trait is more conservative, and hence could better reflect the true tree.

The 15 species form nine potentially monophyletic genera.

The alternative nine potentially monophyletic genera.

The promise of phylogenetics is that we can infer the true tree based on the scored characters. We could follow the strict independence rule, and score them as two multi-state characters, leading us to the following "tree" — this has been parsimony-optimized and unweighted, as in most studies using morphological data, with the sample of MPTs summarized using a strict consensus cladogram.

The strict consensus tree of 355602 equally parsimonious trees with 17 steps, a CI of 0.94 and RI of 0.88: a pitchfork (an extreme case, but pitchfork-like subtrees are very common in palaeontological phylogenetic literature).
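The CI quoted in this caption can be checked by hand. The following is a minimal sketch, assuming the usual ensemble definition of the consistency index (minimum conceivable number of steps divided by the observed tree length) and assuming that an unordered character with k observed states needs at least k - 1 changes. The function name is made up for illustration.

```python
def consistency_index(states_per_character, observed_steps):
    """Ensemble CI = minimum conceivable steps / observed steps."""
    # An unordered character with k states needs at least k - 1 changes.
    min_steps = sum(k - 1 for k in states_per_character)
    return min_steps / observed_steps

# Two multistate characters (form, colour) with nine states each;
# the most parsimonious trees have 17 steps.
ci = consistency_index([9, 9], 17)
print(round(ci, 2))
```

With 16 unavoidable changes out of 17, nearly every step is forced by the coding itself, which is why a high CI here goes hand in hand with an unresolved tree.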

Alternatively, we could score the features as a series of binary characters such as:
  • Is the center depressed?
  • Is it horizontally or vertically elongated?
  • Is it round or pointed?
  • Do we have few (<= 6) or many tips (>= 8; "?" for all round species)?
  • Is it reddish? Or greenish? Or bluish? (Example: purple doughnut would be 1 - 0 - 0, the turquoise five-star 0 - 1 - 1)
  • Has it a dark or light shade (relatively speaking: green taken as darker than turquoise)?
These characters are not particularly independent. Certain evolutionary steps make it impossible to go back or evolve something in parallel / convergently. For example, the Roundish group never evolved pointed tips, and the Pointish organisms can vary their outline, but not smooth it. The characters are also not overly compatible (e.g. shading splits each basic coloring into two subsets), so we wouldn't expect a very resolved tree or one that matches the true tree exactly:

Adams consensus tree of 80 MPTs with 19 steps, CI = 0.57, RI = 0.79, naming follows the principles of cladistic classification (only subtrees in a rooted tree may be named; not to be confused with phylogenetic classification fide Hennig)

However, it doesn't look like a very bad evolutionary hypothesis. In fact, the inferred clades only miss one monophyletic group (I can tell, because I invented this group to illustrate that 'cladistics' is a subset of 'phylogenetics'): Fivestar reflects the morph of the common ancestor of all stars, resolved as part of a monophyletic grade "basal" to the (reciprocally monophyletic) polygons:

Evolution as it happened. Note, each dichotomy is accompanied by one or two exclusive subsequent mutations (synapomorphies at the time). Unknown ancestors (not found in the fossil records) are dimmed. Green: valid names following Hennig's phylogenetic classification; orange: only valid for the most recent time frame (Purpleoval is indistinguishable from the ancestor of all non-olive Roundish, Fivestar from the ancestor of all stars, and the ancestor of all polygons was a blue pentagon).

Of course, I would always show all of the topological alternatives in the optimized tree sample. Here is the strict consensus network of all of the MPTs:

Strict consensus network of all 80 MPTs, the network analogue to the commonly seen strict consensus cladograms.

In contrast to the consensus trees, we see the equally optimal alternatives, and can even make a call as to which trait to give a higher weight (evolution-wise). For instance, although only 12 MPTs have a Pentagon clade, 40 have an Octagon clade, which would fit with the hypothesis of reciprocal monophyly. The shading-based alternative seen in other MPTs (light vs. dark polygons) can be argued to be less likely, noting how scattered this feature is across the entire graph (this is what TNT's iterative weighting does, except that it starts from one of the alternative trees).



And here's the distance network, probably (like with real-world data) the least-biased depiction of the differentiation pattern:

All labelled taxa are monophyletic (as defined by the true tree). Note how some neighborhoods reflect monophyly while others would result in paraphyletic groups.

Take-home message

Now, you could rightfully point out that this is totally hypothetical and, having generated the group, I made sure that the analysis works out — actually, I didn't, and I was quite surprised at how well the binary matrix, which just scores everything that differs between the species, resolves aspects of the true tree. However, just compare the above graphs with trees published in (paleo)phylogenetic studies, and the real-world data we dealt with here on the Genealogical World of Phylogenetic Networks.

You might also point out that this is just like using stepmatrices (forcing a topology by suitably coding complex characters), a practice that is likewise discouraged (but see Joe Felsenstein's 2004 book, Inferring Phylogenies). I would respond that scoring complex traits filtered by evolution as a single multi-state character severely underestimates the information content. An example from my own research: in the King Ferns (Osmundaceae), the subsequent modification of the sclerenchyma ring along the leaf traces is fully compatible with the molecular tree, so why should I be forced to reduce the surely interdependent (and traceable in the fossil record) aspects of this evolutionarily filtered trait complex to a single, multi-state (and unweighted) character?

Coding of a single complex trait (Bomfleur et al., PeerJ, 2017, fig. 7), the structure of the sclerenchyma ring in Osmundaceae leaf traces, as five binary characters that reflect the ontogenetic sequence seen in Osmundaceae rhizomes (arrows), a case where ontogeny mirrors phylogeny (Bomfleur et al., BMC Evol. Biol., 2015; cf. Additional file 1, fig. S1-1).


If we have character complexes that we can score, then we should not bother ourselves with drawing a (often very subjective) line between biologically dependent and independent characters. We should just score as much as we can see, and then explore the signal in the resulting matrix (see our many blog posts on the latter topic).

Exploratory data analysis benefits from few-state characters. This is because characters with many states (nine in the above example, which is something also found in the actual literature) that do not inform any taxon bipartitions, lead only to quite useless pitchfork-trees.
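Why many-state characters can be uninformative is easy to see in a toy case. The sketch below assumes the standard definition of a parsimony-informative unordered character (at least two states, each present in at least two taxa). The six-taxon columns are hypothetical and deliberately extreme; they are not the matrix used in this post.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(state for state in column if state != "?")
    return sum(1 for n in counts.values() if n >= 2) >= 2

# Hypothetical toy data: six taxa scored in two different ways.
multistate_form = ["a", "b", "c", "d", "e", "f"]  # every taxon has its own state
binary_pointed = ["0", "0", "0", "1", "1", "1"]   # round vs. pointed

print(is_parsimony_informative(multistate_form))  # False
print(is_parsimony_informative(binary_pointed))   # True
```

The multistate column separates every taxon from every other one but groups nothing, so it cannot support any taxon bipartition. The binary column proposes one split of the taxa.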

Scoring what we see in as much detail as possible may, of course, get some things wrong. We may face one or another paraphyletic (or even polyphyletic) clade and monophyletic grade; inferring trees/networks and establishing branch support with more than a single optimality criterion is advisable, as is character mapping. At least it gets us a data-based hypothesis to discuss and to investigate further; or several hypotheses, when using consensus networks or distance-based splits graphs instead of consensus trees.


Monday, November 19, 2018

The curiously converted logic of phylogenetics


Phylogenetic analysis involves describing patterns, not studying processes. That is, we cannot conduct a manipulative experiment to study evolutionary history. All we can do is collect naturally occurring data, and then try to detect relevant patterns in it. Thus, in a descriptive study we investigate processes by examining the patterns they produce, not by manipulating the processes themselves, which is what we would do in an experimental study.

Obviously, one of the limitations of this procedure is that the patterns we need may not be in the data we have at hand. It is this limitation that leads some scientists to claim that descriptive studies are not part of science. However, this is not the majority view. [See Mattis' later post, on Patterns, processes, abduction, and consilience]


Equally importantly, there is a logical limitation to descriptive studies, as well, which I have rarely seen mentioned. In the world of logic, propositions cannot be converted; and yet converting propositions is exactly what is done by all descriptive analyses. [The four terms used in logic are defined at the bottom of this post.]

Our initial logic works from process to pattern (if p, then q), but we interpret it the other way around, that a specified pattern must be created by a particular process (if q, then p). Thus:
  • we expect this specific process to produce that particular pattern
  • therefore, when we see that particular pattern we can infer this specific process.
The problem here is the second statement, which is the logical converse of the first statement (the proposition). The inference is illogical, because other processes might also create the same pattern, in which case our inference can be wrong.

The Monty Python comedy team had a go at this in their Logician skit on "The Holy Grail" album (but not in the movie of the same name). Their example concerned a 1950s-60s singer called Alma Cogan, who died in 1966. Their inference was:
  • all of Alma Cogan is dead
  • therefore, all dead people are Alma Cogan.
This is illogical, because there is more to being dead than simply being Alma Cogan — logical propositions can be only partially converted.

The same logical fallacy has also been pointed out in the application of statistics to ecology. Stuart Hurlbert (1990. Spatial distribution of the Montane Unicorn. Oikos 58: 257-271) assessed the use of the Poisson probability distribution as evidence for random spatial distributions of organisms. The inference is:
  • for a Poisson distribution, the variance equals the mean
  • therefore, if the variance equals the mean we can infer a Poisson distribution.
His paper points out many real datasets where the variance equals the mean but the data do not fit a Poisson distribution. He concluded: "Each population showed a different pattern of aggregation and none corresponded to a Poisson distribution. The variance:mean ratio is useless as a measure of departure from randomness, though it is widely recommended as such."
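A tiny counterexample makes the point concrete. It is made up for illustration and is not taken from Hurlbert's paper: a distribution in which a sample contains either 0 or 2 individuals, each with probability 1/2, has a variance equal to its mean, yet it cannot be Poisson because a count of exactly 1 never occurs.

```python
from fractions import Fraction
from math import exp

# A made-up distribution: counts of 0 or 2, each with probability 1/2.
pmf = {0: Fraction(1, 2), 2: Fraction(1, 2)}

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Poisson probability of observing exactly one individual for the same mean.
poisson_p1 = exp(-float(mean)) * float(mean)

print(mean, variance)             # both equal 1
print(pmf.get(1, 0), poisson_p1)  # 0 versus roughly 0.37
```

The variance:mean ratio is exactly 1 here, so a test based on that ratio alone would not flag any departure from a Poisson distribution.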

These are simply examples of a general problem: we cannot convert a proposition and expect to be right all of the time, or even most of the time. The issue applies to all phylogenetic analyses, whether they involve the assessment of homology, or the construction of trees and networks. In each case we are inferring particular evolutionary processes from the observation of particular patterns in our data. For example, our model of the process of speciation implies a tree model of evolution, and therefore every time we get a "well-supported tree" we treat it as the true phylogeny. This will not work if other processes are occurring, such as hybridization.

I will finish with one specific example from network analysis. The D-statistic is used in the so-called ABBA-BABA test for detecting introgression among taxa (see Networks of admixture or introgression). The logic works from process to pattern (introgression would create a particular gene-tree pattern), but we interpret it the other way around — we see the specified gene pattern and we thereby infer the presence of introgression.
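For readers who have not met the test, here is a minimal sketch of the pattern being counted. It assumes the commonly used form of the statistic, D = (nABBA - nBABA) / (nABBA + nBABA), for taxa ordered as (P1, P2, P3, outgroup), with the outgroup allele treated as ancestral. The function name and the five alignment columns are hypothetical.

```python
def d_statistic(p1, p2, p3, outgroup):
    """D from four aligned sequences of equal length, or None if undefined."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # keep only biallelic sites
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)

# Made-up alignment columns, for illustration only.
p1 = "AAGTC"
p2 = "ATGCA"
p3 = "ATGCC"
og = "AAGTA"
d = d_statistic(p1, p2, p3, og)
print(d)
```

The function only summarizes a pattern (an excess of one site configuration over the other). Reading a non-zero value as introgression is the step from pattern back to process that this post is concerned with.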

This issue of illogic is definitely a limitation of phylogenetic analysis.



The terms of logical analysis:
Proposition: if p, then q
Inverse: if not p, then not q
Converse: if q, then p
Contrapositive: if not q, then not p
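Which of these four forms can safely stand in for one another can be checked by brute force. The sketch below assumes ordinary material implication ("if a, then b" is false only when a is true and b is false) and enumerates all truth values of p and q.

```python
from itertools import product

def implies(a, b):
    """Material implication: 'if a, then b'."""
    return (not a) or b

rows = []
for p, q in product([True, False], repeat=2):
    rows.append({
        "proposition": implies(p, q),
        "inverse": implies(not p, not q),
        "converse": implies(q, p),
        "contrapositive": implies(not q, not p),
    })

# Does each form have the same truth value as the proposition in every row?
same_as_proposition = {
    name: all(row[name] == row["proposition"] for row in rows)
    for name in ("inverse", "converse", "contrapositive")
}
print(same_as_proposition)
```

Only the contrapositive agrees with the proposition in every case. The converse (and the inverse) can be false when the proposition is true, which is the gap that pattern-to-process inference has to bridge.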

Wednesday, November 18, 2015

Are realistic mathematical models necessary?


In a comment on last week's post (Capturing phylogenetic algorithms for linguistics), Mattis noted that linguists are often concerned about how "realistic" the models used for mathematical analyses are. This is something that biologists sometimes allude to as well, and not only in phylogenetics.

Here, I wish to argue that model realism is often unnecessary. Instead, what is necessary is only that the model provides a suitable summary of the data, which can be used for successful scientific prediction. Realism can be important for explanation in science, but even here it is not necessarily essential.

The fifth section of this post is based on some data analyses that I carried out a few years ago but never published.

Isaac Newton

Isaac Newton is one of the top handful of most-famous scientists. Among other achievements, he developed a quantitative model for describing the relative motions of the planets. As part of this model he needed to include the mass of each planet. He did this by assuming that each mass is concentrated at an infinitesimal point at the centre of mass. Clearly, the planets do not have zero volume, and thus this aspect of the model is completely unrealistic. However, the model functions quite well for both description of planetary motion and prediction of future motion. (It gets Mercury's motion slightly wrong, which is one of the improvements that Einstein's model of General Relativity provides.)

Newton's success came from neither wanting nor needing realism. Modeling the true distribution of mass throughout each planetary volume would be very difficult, since it is not uniformly distributed, and we still don't have the data anyway; and it is thus fortunate that it is unnecessary.

Other admonitions

The importance of Newton's reliance on the simplest model was also recognized by his best-known successor, Albert Einstein:
Everything should be as simple as it can be, but not simpler.
This idea is usually traced back to William of Ockham:
1. Plurality must never be posited without necessity.
2. It is futile to do with more things that which can be done with fewer.
However, like all things in science, it actually goes back to Aristotle:
We may assume the superiority, all things being equal, of the demonstration that derives from fewer postulates or hypotheses.

Sophisticated models model details

Realism in models makes the models more sophisticated, rather than keeping them simple. However, more complex models often end up modelling the details of individual datasets rather than improving the general fit of the model to a range of datasets.

In an earlier post (Is rate variation among lineages actually due to reticulation?) I also commented on this:
There is a fundamental limitation to trying to make any one model more sophisticated: the more complex model will probably fit the data better but it might be fitting details rather than the main picture.
The example I used was modelling the shape of starfish, all of which have a five-pointed star shape but which vary considerably in the details of that shape. If I am modelling starfish in general, then I don't need to concern myself about the details of their differences.

Another example is identifying pine trees. I usually can do this from quite a distance away, because pine needles are very different from most tree leaves, which makes a pine forest look quite distinctive. I don't need to identify to species each and every tree in the forest in order to recognize it as a pine forest.

Simpler phylogenetic models

This is relevant to phylogenetics whenever I am interested in estimating a species tree or network. Do I need to have a sophisticated model that models each and every gene tree, or can I use a much simpler model? In the latter case I would model the general pattern of the species relationships, rather than modelling the details of each gene tree. The former would be more realistic, however.

In that previous post (Is rate variation among lineages actually due to reticulation?) I noted:
If I wish to estimate a species tree from a set of gene trees, do I need a complex model that deals with all of the evolutionary nuances of the individual gene trees, or a simpler model that ignores the details and instead estimates what the trees have in common? ... adding things like rate variation among lineages (and also rate variation along genes) will usually produce "better fitting" models. However, this is fit to the data, and the fit between data and model is not the important issue, because this increases precision but does not necessarily increase accuracy.
So, it is usually assumed ipso facto that the best-fitting model (ie. the best one for description) will also be the best model for both prediction and explanation. However, this does not necessarily follow; and the scientific objectives of description, prediction and explanation may be best fulfilled by models with different degrees of realism.

In this sense, our mathematical models may be over-fitting the details of the gene phylogenies, and in the process sacrificing our ability to detect the general picture with regard to the species phylogenies.

Empirical examples

In phylogenetics, about 15 years ago it was pointed out that simpler and obviously unrealistic models can yield more accurate answers than do more complex models. Examples were provided by Yang (1997), Posada & Crandall (2001) and Steinbachs et al. (2001). That is, the best-fitting model does not necessarily lead to the correct phylogenetic tree (Gaut & Lewis 1995; Ren et al. 2005).

This situation is related to the fact that gene trees do not necessarily match species phylogenies. These days, this is frequently attributed to things like incomplete lineage sorting, horizontal gene transfer, etc. However, it is also related to models over-fitting the data. We may (or may not) accurately estimate each individual gene tree, but that does not mean that the details of these trees will give us the species tree. Basically, estimation in a phylogenetic context is not a straightforward statistical exercise, because each tree has its own parameter space and a different probability function (Yang et al. 1995).

One way to investigate this is to analyze data where the species tree is known. We could estimate the phylogeny using each of a range of mathematical models, and thus see the extent to which simpler models do better than more complex ones, by comparing the estimates to the topology of the true tree.

I used six DNA-sequence datasets, as described in this blog's Datasets page. Each one has a known tree-like phylogenetic history:
Datasets where the history is known experimentally:
Sanson — 1 full gene, 16 sequences
Hillis — 3 partial genes, 9 sequences
Cunningham — 2 genes + 2 partial genes, 12 sequences
Cunningham2 — 2 partial genes, 12 sequences
Datasets where the history is known from retrospective observation:
Leitner — 2 partial genes, 13 sequences
Lemey — 2 partial genes, ~16 sequences
For each dataset I carried out a branch-and-bound maximum-likelihood tree search, using the PAUP* program, for each of the 56 commonly used nucleotide-substitution models. I used the ModelTest program to evaluate which model "best fits" each dataset. The models, along with their number of free parameters (ie. those that can be estimated), are:


For the Sanson, Hillis and Lemey datasets it made no difference which model I used, as in each case all models produced the same tree. For the Sanson dataset this was always the correct tree. For the Hillis dataset it was not the correct tree for any gene. For the Lemey dataset it was the correct tree for one gene but not the other.

The results for the other three datasets are shown below. In each case the lines represent different genes (plus their concatenation), the horizontal axis is the number of free parameters in the models, and the vertical axis is the Robinson-Foulds distance from the true tree (for models with the same number of parameters the data are averages). The crosses mark the "best-fitting" model for each line.

Cunningham:

Cunningham2:

Leitner:
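The Robinson-Foulds distance on the vertical axis of these plots counts the bipartitions found in one tree but not in the other. The following is a minimal sketch, assuming trees written as nested tuples of taxon names and compared as unrooted trees. The two five-taxon trees are hypothetical, and whether the plotted values were raw or rescaled counts is not stated here, so the sketch returns the raw count.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found

def splits(tree):
    """Non-trivial bipartitions of the tree, treated as unrooted."""
    all_leaves, found = clades(tree)
    reference = min(all_leaves)
    result = set()
    for clade in found:
        # Store each split by the side that lacks the reference taxon.
        side = clade if reference not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result

def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Hypothetical trees, for illustration only.
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(true_tree, estimate))
```

A distance of 0 means the estimated topology matches the true tree exactly; each misplaced grouping adds to the count.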

For all three datasets, for both individual genes and for the concatenated data, there is almost always at least one model with fewer free parameters that produces an estimated tree that is closer to the true phylogenetic tree. Furthermore, the concatenated data do not produce estimates that are closer to the true tree than are those of the individual genes.

Conclusion

The relationship between precision and accuracy is a thorny one in practice, but it is directly relevant to whether we need complex, and thus more realistic, models.

References

Gaut BS, Lewis PO (1995) Success of maximum likelihood phylogeny inference in the four-taxon case. Molecular Biology & Evolution 12: 152-162.

Posada D, Crandall KA (2001) Simple (wrong) models for complex trees: a case from Retroviridae. Molecular Biology & Evolution 18: 271-275.

Ren F, Tanaka H, Yang Z (2005) An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Systematic Biology 54: 808-818.

Steinbachs JE, Schizas NV, Ballard JWO (2001) Efficiencies of genes and accuracy of tree-building methods in recovering a known Drosophila genealogy. Pacific Symposium on Biocomputing 6: 606-617.

Yang Z (1997) How often do wrong models produce better phylogenies? Molecular Biology & Evolution 14: 105-108.

Yang Z, Goldman N, Friday AE (1995) Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Systematic Biology 44: 384-399.

Wednesday, November 4, 2015

Conflicting avian roots


A couple of years ago, I noted that genomic datasets have not helped resolve the phylogeny at the root of the placentals, because each new genomic analysis produces a different phylogenetic tree (Conflicting placental roots: network or tree?). It appears that the results depend more on the analysis model used than on the data obtained (Why are there conflicting placental roots?), and it is thus likely that the early phylogenetic history of the mammals was not tree-like at all.

Recently, a similar situation has arisen for the early history of the birds. In the past year, three genomic analyses have appeared involving the phylogenetics of modern birds (principally the Neoaves):
Erich D. Jarvis et alia (2014) Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346: 1320-1331.
Alexander Suh, Linnéa Smeds, Hans Ellegren (2015) The dynamics of incomplete lineage sorting across the ancient adaptive radiation of Neoavian birds. PLoS Biology 13: e1002224.
Richard O. Prum, Jacob S. Berv, Alex Dornburg, Daniel J. Field, Jeffrey P. Townsend, Emily Moriarty Lemmon, Alan R. Lemmon (2015) A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature 526: 569-573.
The first analysis used concatenated gene sequences from 50 bird genomes (including the outgroups), and the second one used 2,118 retrotransposon markers in those same genomes. The third analysis used 259 gene trees from 200 genomes. The second analysis incorporated incomplete lineage sorting (ILS) into the main analysis model, while the other two addressed ILS in secondary analyses. None of the analyses explicitly included the possibility of gene flow, although the second analysis considered the possibility of hybridization for one clade.


These three studies can be directly compared at the taxonomic level of family. I have used a SuperNetwork (estimated using SplitsTree 4) to display this comparison. The tree-like areas of the network are where the three analyses agree on the tree-based relationships, and the reticulated areas are where there is disagreement about the inferred tree.

The network shows that some of the major bird groups do have tree-like relationships in all three analyses (shown in red, green and blue). However, the relationships between these groups, and between them and the other bird families, are very inconsistent between the analyses. In particular, the basal relationships are a mess (the outgroup is shown in purple), with none of the three analyses agreeing with any other one.

Thus, the claims that any of these analyses provide a "highly supported" phylogeny or "resolve the early branches in the tree of life of birds" seem to be rather naive. ILS is likely to have been important in the early history of birds, as this is usually considered to have involved a rapid adaptive radiation. However, I think that models involving gene flow need to be examined as well, if progress is to be made in unravelling the bird phylogeny.

This analysis was inspired by a similar one by Alexander Suh, which appeared on Twitter.

Wednesday, October 28, 2015

Arguments against the use of networks?


The usual argument in favour of using phylogenetic networks is the obvious one that they can account for gene flow during phylogenetic history, as well as vertical inheritance. The usual argument against their use, if there is one, is that vertical inheritance is of primary importance, and thus a tree is "adequate" under many circumstances; or the use of a tree is simply an unquestioned assumption (ie. phylogenetics = trees).

However, Walter Salzburger, Greg B. Ewing and Arndt von Haeseler (2011. The performance of phylogenetic algorithms in estimating haplotype genealogies with migration. Molecular Ecology 20: 1952-1963) have presented a different argument. They point out that a collection of trees can contain more information than can a single network that combines them. This occurs when reticulations represent ambiguity rather than gene flow, as they will in a population or haplotype network (see How do we interpret a rooted haplotype network?).


Their argument is this:
We note that out of a set of different haplotype genealogies, no single genealogy offers a better description of the ‘truth’ than any other one does without considering external data such as the underlying DNA sequences (this is the same when dealing with a set of different MP trees with the same score). The question raised is how are we better off with a group of haplotype genealogies vs. a network that may not be tree-like. The existence of many haplotype genealogies is simply another way of representing ambiguity in the data.
However, the important difference between a network and a set of trees is the lack of independence of Fitch length labellings [ie. the Hadamard distance between nodes]. We illustrate this in Fig. 2. We have the same initial tree with the same tip sequences, but the Fitch branch lengths and internal sequences are different. In the top figure we see that haplotype E connects to D, while haplotype A and B form a cherry also connecting to D. But an alternative is that haplotype E connects to C. This has the effect of changing the topology throughout the tree. So by making some choice in one part of the Fitch tree, it can have topological consequences elsewhere in the tree. In the network case, each ambiguity is represented independently of each other.
It is difficult to represent the same information in a [single] graph compared to a set of trees.

Using this argument, the authors focussed entirely on trees in their simulation study comparing phylogenetic methods: "Here, we are considering the case where the true signal is tree-like and that reticulations represent reconstruction ambiguity." They then confirmed the consequent, by demonstrating that under these circumstances network methods produce false-positive reticulations. Tree-based methods cannot produce reticulations, and so there can be no false positives.

Apart from the impracticality of dealing with potentially large numbers of trees, the main downside of a collection of trees is that we cannot easily compare those trees, which we can instantly do when they are represented by a single network (ie. the trees differ where there are reticulations in the network). Salzburger et al. indirectly refer to this when they note that a "problem is the evaluation of the reliability of connections in haplotype genealogies." They suggest mapping the consistency index for each mutation responsible for each connection in each tree, which seems to be a rather cumbersome alternative to the use of network reticulations to represent unreliability. (NB. A consensus tree is the third way to represent a set of trees, and this seems rarely to be used for haplotypes.)

Interestingly, the authors' results showed that the Phylip program DNAPARS consistently did better than the program PAUP* at recovering the simulated trees. The main difference between these two programs is that PAUP* does a better job of finding the set of maximum-parsimony (MP) trees. The results therefore suggest that the authors' trees were usually not MP trees, so that PAUP* was simply wasting its time looking harder for them.

Wednesday, July 8, 2015

Productive and unproductive analogies between biology and linguistics


Genotypes or phenotypes?

In a blogpost from 2013, David investigated some of the popular analogies between anthropology (including linguistics) and biology. He rejected those analogies that compare the genotype with anthropological entities (like the common "words = genes" analogy). Instead, he proposed to draw the analogy between anthropological entities and the phenotype. I generally agree that we should be very careful about the analogies we draw between different disciplines, and I share the scepticism regarding those naive approaches in which genes are compared with words or sounds are compared with nucleotide bases. I am, however, sceptical whether the alternative analogy between phenotypes and anthropological entities offers a general solution for the study of language evolution.

Productive and unproductive analogies

My scepticism results from a general uncertainty about the transfer of models and methodologies among scientific disciplines. I am deeply convinced that such a transfer is useful and that it can be fruitful, but we seem to lack a proper understanding of how to carry out such a transfer. Apart from this general uncertainty as to how to do it properly, I think that for linguistics the analogy between phenotypes and linguistic entities is too broad to be successfully applied.

Instead of drawing general analogies between biology and linguistics, it would be more useful to carry out a fine-grained analysis of productive analogies between the two disciplines. By productive, I mean that the analogies should lead to an interdisciplinary transfer of models and methods that increases the insights about the entities in the discipline that imports them. If this is not the case for a given analogy, this does not mean that the analogy is wrong or false, but rather that it is simply unproductive, since an analogy is just a similarity between entities from different domains, and what we define as being "similar" crucially depends on our perspective. With enough fantasy, we can draw analogies between all kinds of objects, and we never really know the degree to which we construct rather than detect, as I have tried to illustrate in the graphic below.

Constructed or detected similarities?

Local productive analogies: alignment analyses

A productive analogy does not necessarily have to be global, offering a full-fledged account of shared similarities, as in the analogies which compare, for example, languages with organisms (Schleicher 1848) or languages with species (Mufwene 2001), but also the analogy between phenotypes and anthropological entities proposed by David. It is likewise possible to find very useful local analogies, which only hold to a certain extent, but offer enough insights to get started.

Consider, for example, the problem of sequence alignment in biology and linguistics. It is clear that both biologists and linguists carry out alignment analyses of some of the entities they are dealing with in their disciplines. We use alignment analyses in biology and linguistics, since both disciplines have to deal with entities that are best modeled as sequences, be it sequences of DNA, RNA, or amino acids in biology, or sequences of sounds in linguistics. In both cases, we are dealing with entities in which a limited number of symbols is linearly ordered, and an alignment analysis is a very intuitive and fruitful way to show which of the symbols in two different sequences correspond.

In this very general point, the analogy between words as sequences of sounds and genes as sequences of nucleic acids holds, and it seems straightforward to think of transferring models and methods between the disciplines (in this case from biology to linguistics, since automatic sequence alignment has a longer tradition in biology).

In the details, however, we will detect differences between biological and linguistic sequences, with the main differences lying in the alphabets (the collections of symbols) from which our sequences are drawn (discussed in more detail in List 2014: 61-75):
  • Biological alphabets are universal, that is, they are basically the same for all living creatures, while the alphabets of languages are specific for each and every language or dialect.
  • Biological alphabets are limited and small regarding the number of symbols, while linguistic alphabets are widely varying and can be very large in size.
  • Biological alphabets are stable over time, with sequences changing by the replacement of symbols with other symbols drawn from the same pool of symbols, while linguistic alphabets are mutable: not only can they acquire new sounds or lose existing ones, but also the sounds themselves can change.

How similar are words and genes in the end?

What are the consequences of these differences in the word-gene analogy? Can we still profit from the long tradition of automatic alignment methods when dealing with phonetic alignment (the alignment of sound sequences, like words or morphemes) in linguistics? Yes, we can! But within limits!

Linguists can profit from the general frameworks for sequence alignment developed in biology, but we need to make sure that we adapt them according to our linguistic needs. For alignment methods, this means, for example, that we can use the traditional frameworks of dynamic programming for pairwise alignment, which were developed in the seventies and early eighties (Needleman and Wunsch 1970, Smith and Waterman 1981). We can also use some of the frameworks for multiple sequence alignment, which were developed a bit later, starting from the end of the eighties, be it progressive (Feng and Doolittle 1987, Thompson et al. 1994, Notredame et al. 1998), iterative (Barton and Sternberg 1987, Edgar 2004), or probabilistic (Do et al. 2005). But we can only import the overall frameworks, not their details.
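To make the distinction between framework and details concrete, here is a minimal sketch of global pairwise alignment by dynamic programming, in the spirit of Needleman and Wunsch. The dynamic-programming part is the framework that can be imported as it stands, while the scoring function is the detail that has to be adapted. The names `needleman_wunsch`, `identity_score`, `class_score` and the tiny `TOY_CLASSES` table are all hypothetical, and the scores are arbitrary illustrative values, not those of any published phonetic alignment method.

```python
def identity_score(a, b):
    """Biology-style default: reward identical symbols, penalise all others."""
    return 1 if a == b else -1


# Hypothetical toy grouping of a few sounds into classes (illustration only).
TOY_CLASSES = {"t": "T", "d": "T", "k": "K", "g": "K", "x": "K"}


def class_score(a, b):
    """Linguistic variant: sounds of the same toy class count as similar."""
    if a == b:
        return 2
    if TOY_CLASSES.get(a, a) == TOY_CLASSES.get(b, b):
        return 1
    return -1


def needleman_wunsch(seq_a, seq_b, score_fn=identity_score, gap=-1):
    """Globally align two symbol sequences.

    Returns (score, aligned_a, aligned_b), with "-" marking gaps.
    """
    n, m = len(seq_a), len(seq_b)
    # table[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score_fn(seq_a[i - 1], seq_b[j - 1]),
                table[i - 1][j] + gap,   # gap in seq_b
                table[i][j - 1] + gap,   # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == (
            table[i - 1][j - 1] + score_fn(seq_a[i - 1], seq_b[j - 1])
        ):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return table[n][m], aligned_a[::-1], aligned_b[::-1]


# Words are passed as lists of sound segments rather than strings of letters.
score, top, bottom = needleman_wunsch(list("abc"), list("ac"))
```

Swapping `identity_score` for `class_score` changes nothing in the algorithm itself, which is the sense in which only the overall framework carries over from biology. A realistic phonetic scorer would also have to deal with context, language-specific sound inventories and the other issues discussed in this post.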

All algorithms for phonetic alignment that are supposed to be applicable to a wide range of data (and not serve as a mere proof of concept that handles but a limited range of test datasets) need to address the specific characteristics of sound sequences. Apart from the differences in alphabet size and the mutable character of sound systems mentioned above, these differences also include the important role that context plays in sound change (List 2014: 26-33), the problem of secondary sequence structures (List 2012a), the problem of metathesis (List 2012: 51f), and the problem of unalignable parts resulting from cases of partial and oblique homology in language evolution (see my recent blog post on this issue).

Concluding remarks

Drawing analogies between the research objects of different disciplines is not a bad idea, and it can be very inspiring, as multiple cases in the history of science show. When transferring models and methods from one discipline to another, however, we need to make sure that the analogies we use are productive, adding value to our research and understanding. We should never expect that analogies hold in all details. Instead we need to be aware about their specific limits, and we need to be willing to adapt those models and methods we transfer to the needs of the target discipline. Only then can we make sure that the analogies we use are really productive in the end.

References

  • Barton, G. J. and M. J. E. Sternberg (1987). “A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons”. J. Mol. Biol. 198.2, 327–337.
  • Do, C. B., M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou (2005). “ProbCons. Probabilistic consistency-based multiple sequence alignment”. Genome Res. 15, 330–340.
  • Edgar, R. C. (2004). “MUSCLE. Multiple sequence alignment with high accuracy and high throughput”. Nucleic Acids Res. 32.5, 1792–1797.
  • Feng, D. F. and R. F. Doolittle (1987). “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”. J. Mol. Evol. 25.4, 351–360.
  • List, J.-M. (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.  
  • List, J.-M. (2012a). "Improving phonetic alignment by handling secondary sequence structures". In: Hinrichs, E. and Jäger, G.: Computational approaches to the study of dialectal and typological variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012. 
  • List, J.-M. (2012b). “Multiple sequence alignment in historical linguistics. A sound class based approach”. In: Proceedings of ConSOLE XIX. “The 19th Conference of the Student Organization of Linguistics in Europe” (Groningen, 01/05–01/08/2011). Ed. by E. Boone, K. Linke, and M. Schulpen, 241–260.
  • Mufwene, S. S. (2001). The ecology of language evolution. Cambridge: Cambridge University Press.
  • Needleman, S. B. and C. D. Wunsch (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. J. Mol. Biol. 48, 443–453.
  • Notredame, C., L. Holm, and D. G. Higgins (1998). “COFFEE. An objective function for multiple sequence alignment”. Bioinformatics 14.5, 407–422.
  • Schleicher, A. (1848). Zur vergleichenden Sprachengeschichte [On comparative language history]. Bonn: König.
  • Smith, T. F. and M. S. Waterman (1981). “Identification of common molecular subsequences”. J. Mol. Biol. 147.1, 195–197.
  • Thompson, J. D., D. G. Higgins, and T. J. Gibson (1994). “CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”. Nucleic Acids Res. 22.22, 4673–4680.

Monday, June 15, 2015

Do Shakespeare's plays have a phylogeny?


I have noted several times in this blog that it is not just biological organisms that can be considered to have a phylogenetic history. Many human artifacts also do, provided that their history results from diversification from a common ancestor. For example, there are blog posts about the following topics:
All of these can be considered to have a phylogenetic history of shared common ancestors. For instance, manuscript copies do share ancestors — the source manuscripts that have been copied.


However, while all human artifacts have a history, not everything has a phylogenetic history. There can be transformational history, for example, where concepts simply change through time without diversifying. This can be represented by a timeline rather than a phylogeny, as discussed in these blog posts:

There are also situations where artifacts simply cluster together, based on their similarity. This can be represented as a tree-like diagram or a network, but such a tree/network is not a phylogeny, because the clustering does not necessarily have anything to do with common ancestry. Examples discussed in this blog include:

The problem with this latter situation is that we can always mathematically measure the similarity between concepts or objects, and therefore we can always cluster them based on this similarity, even if the clusters have little meaning. I have previously discussed this issue in this blog, noting that if the similarity measure used does not model evolutionary patterns then it cannot be expected to produce a phylogeny (Non-model distances in phylogenetics).

Another case in point is the work of William Shakespeare. Can the plays, for example, be considered to have a phylogeny? Each play certainly has a phylogeny on its own, because the Shakespearean author is well known for having taken the ideas for the plays from previous sources. So, each play has a phylogeny (a reticulate history) based on the historical connections among its sources. However, the plays as a group do not have a phylogeny (not unless they have been plagiarized from each other, anyway). Does Othello really share a common ancestor with King Lear? It certainly has similarities, if only on the basis that it is one of the Tragedies (along with Macbeth, etc). But they are not phylogenetic similarities, and there is no common ancestral Shakespearean play.

As shown by the picture above, this point is not always appreciated. The alleged phylogeny is taken from a press release from the Lawrence Berkeley National Laboratory. The textual similarities among the plays are based on what are called "feature frequency profiles", which have nothing to do with evolutionary history. So, while the data analysis may or may not be helpful for identifying the author(s) of the so-called Shakespearean plays, it is not much help for constructing a phylogeny.
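To see why such a profile measures similarity rather than descent, here is a minimal sketch of the general idea: count short character strings in each text, turn the counts into frequencies, and compare the two frequency profiles. This is not the Berkeley group's code. The function names, the string length of three letters and the use of the Jensen-Shannon divergence as the distance are all assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def profile(text, n=3):
    """Relative frequencies of all letter n-grams in a text."""
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles."""
    divergence = 0.0
    for gram in set(p) | set(q):
        p_i, q_i = p.get(gram, 0.0), q.get(gram, 0.0)
        mean = (p_i + q_i) / 2
        if p_i:
            divergence += 0.5 * p_i * log2(p_i / mean)
        if q_i:
            divergence += 0.5 * q_i * log2(q_i / mean)
    return divergence


# Any two texts yield a number, whether or not they share any history.
same = js_divergence(profile("to be or not to be"), profile("to be or not to be"))
different = js_divergence(profile("aaaa", n=2), profile("bbbb", n=2))
```

Nothing in this calculation refers to copying, sources or ancestry, so a distance is returned for any pair of texts at all. A tree built from such distances is therefore a clustering of similar texts, not a phylogeny.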

The data analysis is discussed in more detail by:

Wednesday, June 10, 2015

The diasystematic structure of languages and its impact on language evolution


What is a language?

It is not easy to define exactly what a language is. We find one reason for this in the daily use of the word “language” in non-linguistic contexts. What we call a language does not depend on purely linguistic criteria. The criteria we normally use are social and cultural.

If we were to define languages with the help of linguistic criteria, we would use the degree to which speakers understand each other; and in most cases, we could draw some line around areas of what linguists would call “mutual intelligibility” (similar to the criterion of “interbreedability” in biology). But mutual intelligibility does not usually serve as the criterion by which we define languages in everyday situations. For example, we tend to say that the people from Shanghai, Beijing, and Meixian (all cities in China) all speak “Chinese”. On the other hand, we think that people from Scandinavia speak “Norwegian”, “Swedish”, and “Danish”, although these three are no more different from one another than are the former three.


The table above (taken from List 2014: 11f, with adaptations) gives phonetic transcriptions of translations of the sentence “The North Wind and the Sun were disputing which was the stronger” in three Chinese “dialects” (Beijing Chinese, which is also called Mandarin or Standard Chinese, spoken in Beijing and all over the country as a second language; Shanghainese, spoken in Shanghai; and Hakka Chinese, spoken in Meixian), and three Scandinavian “languages” (Norwegian, Swedish, and Danish). In the table, I have put all words that have the same meaning in one column (ie. I have aligned them semantically). Furthermore, I have highlighted the words which share a common etymological origin (call them “homologs” or “cognates”) with a gray background. In red, I have added a more or less literal translation of the respective column.

As the phonetic transcriptions of the sentences show, the Chinese varieties differ to a similar, if not even greater, degree as the Scandinavian ones. And we find this variation both in the way the meaning of the sentence is expressed by the choice of words, and in the degree of etymological similarity between the words. Note, further, that none of the three Chinese dialects is mutually intelligible with any other of the dialects, while we know from famous TV series like Broen/Bron that Danish and Swedish people can often understand each other quite well (with some effort); and Norwegians and Swedes are mutually intelligible most of the time. Nevertheless, we address the latter three speech traditions as the three languages “Norwegian”, “Swedish”, and “Danish”, while we say that the speech of the people in Shanghai, Beijing, and Meixian are merely specific variants of one and the same “Chinese” language.

Languages as Diasystems

One could say that what we are facing here is just a cultural problem, not a linguistic one. So we could say that there are two different ways of distinguishing languages from dialects. One would be the linguistic one, which uses mutual intelligibility as a unique criterion to tell languages from dialects. The other one would be the cultural definition of languages as, say, “dialects with an army” (a definition usually attributed to Max Weinreich, the father of the dialectologist Uriel Weinreich).

But this is, unfortunately, only part of the real story, since the cultural definition of the boundaries of a language has a direct impact on the way languages evolve. In societies such as China, for example, a very large proportion of all speakers is bilingual. Apart from their home dialect, speakers are also able to speak Standard Chinese (also called Mandarin Chinese), and they use it to talk to people from different regions, or to read and to write. So, from a purely linguistic viewpoint, it is not necessarily useful to break up the Chinese dialects into distinct languages, since these dialects are located within a larger speech society that is united by a common language for written and interdialectal communication.

In order to describe this complex structure of our modern languages, linguists have proposed the model of the “diasystem”, which is very common in the discipline of sociolinguistics. This model goes back to the aforementioned dialectologist Uriel Weinreich (1926–1967) who originally thought of some linguistic construct which would make it possible to describe different dialects in a uniform way (Weinreich 1954). According to the modern form of the model, a language is a complex aggregate of different linguistic systems, “which coexist and mutually influence each other” (Coseriu 1973: 40, my translation from the German).

An important aspect for determining a linguistic diasystem is the presence of a “Dachsprache” (“roof language”). This is a linguistic variety that serves as a standard for interdialectal communication (Goossens 1973: 11). The different linguistic varieties (dialects, but also sociolects) that are connected by such a standard constitute the “variety space” of a language (Oesterreicher 2001). I have tried to illustrate this in the following figure (taken from List 2014: 13).


As you can see from the figure, there are different “dimensions” according to which the varieties of a language can differ. The figure shows three of them. First, there are “diatopic varieties” which point to the division of a language into different dialects (varying regarding the place where they are spoken).

Second, there are “diastratic varieties”, pointing to different social layers in which the varieties are used. Compare, for example, the language of a football player with that of a politician, which are similar in their tendency to say nothing in many words (especially after hard defeats or before unpopular decisions are announced to the public), but which differ a lot regarding their choice of words. Third, there are “diaphasic varieties”, which are varieties depending on the situation in which people speak. Compare, for example, the way our politician speaks when giving a speech to the public with the way he or she speaks when discussing big politics behind closed doors.

But these three dimensions of language variation are not all that a diasystem of a language has to offer! We can further identify different speech habits when looking at the medium that is used to produce language; and there are significant differences in many respects when writing or reading something, or when speaking and listening. This dimension is commonly called “diamesic” (varying according to the “medium”).

Last, but not least, we should also note that we do not necessarily speak and understand the language from only one time. Think of modern German kids in school who are forced to read Goethe's Faust, bitterly lamenting the old-fashioned style of the language, but think also about different generations of speakers living in the same speech society. This last dimension of language variety is usually called the “diachronic dimension”. The following image tries to summarize the different dimensions in which the diasystem of a language can vary.


Diasystematic aspects of language change

Given all of these fancy terms starting with “dia” and ending in “ic”, one may think that they are a mere play with thoughts developed by a bunch of linguist geeks who are interested in sociology. Why can't we just forget about all these different kinds of “variation” and keep on modeling our languages as bags of words? Applying computational methods from biology will be much easier, and as long as we use networks once in a while, we are not completely giving in to the dark side of the Force, which knows only trees. Unfortunately, this is not possible, since the diasystematic structure actually has an impact on the way in which languages change!

As an example from practice, let me tell you how I tried to buy cigarettes when I was in China for the first time. At the time, I had just started to learn Mandarin Chinese, and was really suffering from the difficulty of the language. But I had searched my dictionary several times, and looked up all the important words I needed to tell the man at the kiosk which cigarettes I wanted to have. My choice was “Marlboro”, since it was the only brand I recognized.

Although having only a complete beginner's knowledge of Chinese, I knew, as a linguist, that the language is peculiar in one specific respect: it has a very, very restricted structure of possible syllables. So one can't say “Saint Petersburg” in Chinese, since syllables in Chinese are not allowed to end in a “t” (as in “Saint”), an “s” (as in the syllable “ters”), or a “g” (as in the syllable “burg”). Instead, Chinese speakers will say Shèngbǐdébǎo. I also knew that there is no sound for “r”, and that this sound is often rendered by using an “l” instead.

So, based on this background knowledge, I “translated” the pronunciation of the word “Marlboro” into what I thought by then was perfectly understandable Mandarin, and told the man at the shop that I wanted to have a packet of mābóluō cigarettes. Unfortunately, he didn't understand at all what I wanted, and only when I pointed with my finger to the packets of Marlboro cigarettes did he finally understand, and say, “Ah, wànbǎolù!”.

So, I learned that “Marlboro” in Mandarin Chinese is called wànbǎolù, not mābóluō, written 万宝路, literally meaning 10 000-treasure-road, which can be translated as “road of 10 000 treasures”. (Good brand name, actually, especially for cigarettes.) It was only some months later that I understood why my prediction for the Mandarin Chinese pronunciation of “Marlboro” failed so dramatically, when I heard people from Hong Kong pronouncing the word wànbǎolù 万宝路 in Cantonese, the Chinese dialect they speak in Hong Kong. There, wànbǎolù 万宝路 becomes something like [maːn²²-pow³⁵-low³²] (the numbers are tone marks), which sounds very, very similar to the mābóluō I had falsely predicted for Mandarin Chinese.


In the image above, I have tried to depict the process by which “Marlboro” becomes the “road of 10 000 treasures”. What we are dealing with here is a complex pattern of change: both phrases, Mandarin Chinese wànbǎolù and Cantonese [maːn²²-pow³⁵-low³²], are homologous. This applies to their three parts (10 000 + treasure + road), since the phrase itself was presumably not present in earlier stages of Chinese. In the ancestor language of Cantonese and Mandarin Chinese, a variety we usually call “Middle Chinese” (spoken around 600 AD), the phrase “road of 10 000 treasures” would have sounded approximately like [mjon³-paw²-lu³]. In Mandarin Chinese, the pronunciation changed greatly, while it changed only slightly in Cantonese.

When Marlboro entered China, it was probably only sold in Hong Kong in the beginning. So, in order to trigger the interest of Hong Kong consumers, the marketing strategists did a good job in choosing a translation that sounded very similar to the original product name while at the same time having a nice and promising meaning. They would use Chinese characters to write down the product name. When Marlboro, or the “road of 10 000 treasures”, then entered the rest of China, people would read the phrase, but pronounce it in their own way: reading the Chinese characters in Mandarin Chinese just yields wànbǎolù, and not mābóluō.

The transfer of the word from one dialect to another was thus made via the diamesic dimension, via the writing system, not via the spoken language. And this is the way that many, many words (also very basic terms) are exchanged between the Chinese dialect varieties — via their “roof language”, which is the common writing system. And since this change doesn't involve the direct borrowing of a spoken word, it is barely perceivable, since it leaves no direct traces in the pronunciation of the words. While normal borrowings in other languages usually sound outlandish, borrowings in Chinese dialects which make their way from one variety to another via the writing system just sound like any other possible word in the recipient dialect.

Summary

In the same way in which languages may change via the interaction between their written and spoken varieties, the interaction between varieties from the other dimensions may also trigger change. Words originating in one social layer may be transferred to other layers; dialect words of one dialect may become popular and henceforth be used in all dialects; and even those varieties of our languages which are only accessible via stories or books may be revived, at least in part, and find a new steady place in our regular speech, up to the moment where we again cease to use them. The diasystematic structure of languages plays a crucial role in their development. Due to the diasystematic character of languages, language change involves complex network-like structures within one and the same (dia)system. If we really aim to depict language evolution in all its complexity, then it is definitely not a good thing to ignore the diasystematic aspect of languages.

References

Coseriu, E. (1973) Probleme der strukturellen Semantik. Vorlesung gehalten im Wintersemester 1965/66 an der Universität Tübingen. Tübingen: Narr.

Goossens, J. (1973) “Sprache”. In: Niederdeutsch. Sprache und Literatur. Eine Einführung. Vol. 1. Ed. by J. Goossens. Neumünster: Karl Wachholtz.

List, J.-M. (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Oesterreicher, W. (2001) “Historizität, Sprachvariation, Sprachverschiedenheit, Sprachwandel”. In: Language Typology and Language Universals. An International Handbook. Ed. by M. Haspelmath. Berlin and New York: Walter de Gruyter, 1554–1595.

Weinreich, U. (1954) “Is a structural dialectology possible?” Word 10.2/3, 388–400.

Wednesday, June 3, 2015

"Basal" and "crown" are dirty words in phylogenetics


There are at least two misleading expressions that one very commonly encounters in the professional phylogenetics literature: "basal branch of the tree", and "derived species".

The first expression is used to refer to an unbranched lineage arising near the common ancestor, when compared to a more-branched lineage. For example, in the first diagram below we might say that taxon A is on a "basal branch", whereas taxon B is not. The taxa associated with taxon B are then referred to as the "crown" of the tree. But, how can one lineage be more basal than another? After all, both lineages connect to the "base" of the tree at the same point. To claim that one is basal and the other not is like saying that one brother is more basal than another in a family tree just because he has fewer children!


The second expression refers to a species that has more "derived" characters than another. For example, in the diagram we might say that taxon B is more derived than taxon A. Characters change from ancestral to derived through time (eg. scaly skin covering is ancestral while fur is derived, because the latter arose later in time). However, this does not make any species more derived. It is the characters that are derived not the species — each species has a combination of ancestral characters and derived ones (including humans).

These issues seem to arise from the tree iconography. Some people seem to conceptualize this as a pine tree rather than a bush (as drawn by Charles Darwin in the Origin). A pine tree, indeed, does have basal branches and a crown. Here is an example from a sign in my local botanical garden, which tries to explain plant phylogenetic relationships to the general public. This tree does, indeed, have basal branches and a distinct crown.


This issue seems to have started with Ernst Haeckel in the late 1800s. Haeckel's first phylogenies (see Who published the first phylogenetic tree?) were drawn as multi-branched bushes, rather similar to the diagram that Darwin himself had published. However, Haeckel then veered away from this approach when explicitly discussing the evolution of humans. Here, he drew a tree with a distinct central trunk and much smaller side-branches (presumably modeled on an oak tree, rather than a bush). This image emphasizes one particular lineage at the expense of the others, because there is one taxon obviously sitting at the crown of the tree while the others are relegated to side-branches.

E. Haeckel (1874) Anthropogenie oder Entwickelungsgeschichte
des Menschen.
Engelmann, Leipzig.

This approach to drawing a phylogeny can be used to put any chosen organism at the crown of the tree, not just human beings, as illustrated by the following diagram from James Scott (which looks like it is actually modeled on a pine tree). This is a fundamental characteristic of a phylogeny — it can be drawn so that any part of the diagram is at the crown. However, to be accurate it should always be drawn so that no one lineage is emphasized over any other one — there should be no taxa sitting at the crown.

J.A. Scott (1986) The Butterflies of North America:
a Natural History and Field Guide.
Stanford University Press, Stanford.

Distorted images occur in several ways in modern evolutionary biology. This topic has received considerable attention in the literature, and there are a number of very readable expositions of various parts of it. Here is a brief list.

Gregory T.R. (2008) Understanding evolutionary trees. Evolution: Education and Outreach 1: 121-137.

O'Hara R.J. (1992) Telling the tree: narrative representation and the study of evolutionary history. Biology and Philosophy 7: 135-160.

Crisp M.D., Cook L.G. (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.

Krell F.-T., Cranston P.S. (2004) Which side of a tree is more basal? Systematic Entomology 29: 279-281.

Omland K.E., Cook L.G., Crisp M.D. (2008) Tree thinking for all biology: the problem with reading phylogenies as ladders of progress. BioEssays 30: 854-867.

Sandvik H. (2009) Anthropocentrisms in cladograms. Biology and Philosophy 24: 425-440.

Wednesday, May 13, 2015

Homology and cognacy: fundamental historical relations between words


This is a guest blog post, following on from his previous post, by:

Johann-Mattis List

Centre de Recherches Linguistiques sur l'Asie Orientale, Paris, France

Introduction

All languages constantly change. Words are lost when speakers cease to use them, new words are gained when new concepts evolve, and even the pronunciation of the words changes slightly over time. Slight modifications that can barely be noticed during a person's lifetime sum up to great changes in the system of a language over centuries. When the speakers of a language diverge, their speech keeps on changing independently in the two communities, and at a certain point of time the independent changes are so great that they can no longer communicate with each other — what was one language has become two.

Demonstrating that two languages once were one is one of the major tasks of historical linguistics. If no written documents of the ancestral language exist, one has to rely on specific techniques for linguistic reconstruction (see the examples in this previous post). These techniques require us to first identify those words in the descendant languages that presumably go back to a common word form in the ancestral language. In identifying these words, we infer historical relations between them. The most fundamental historical relation between words is the relation of common descent. However, similarly to evolutionary biology, where homology can be further subdivided into the more specific relations of orthology, paralogy, and xenology, more specific fundamental historical relations between words can be defined for historical linguistics, depending on the underlying evolutionary scenario.

Homology and Cognacy in Linguistics and Biology

In evolutionary biology there is a rather rich terminological framework describing fundamental historical relations between genes and morphological characters. Discussions regarding the epistemological and ontological aspects of these relations are still ongoing (see the overview in Koonin 2005, but also this recent post by David). Linguists, in contrast, have rarely addressed these questions directly. Rather, they have assumed that the fundamental historical relations between words are more or less self-evident, with only a few counter-examples, which were largely ignored in the literature (Arapov and Xerc 1974; Holzer 1996; Katičić 1966). As a result, our traditional terminology to describe the fundamental historical relations between words is very imprecise and often leads to confusion, especially when it comes to computational applications that are based on software originally developed for applications in evolutionary biology.

As an example, consider the fundamental concept of homology in evolutionary biology. According to Koonin (2005: 311), it "designates a relationship of common descent between any entities, without further specification of the evolutionary scenario". The terms orthology, paralogy, and xenology are used to address more specific relations. Orthology refers to "genes related via speciation" (Koonin 2005: 311); that is, genes related via direct descent. Paralogy refers to "genes related via duplication" (ibid.); that is, genes related via indirect descent. Xenology, a notion which was introduced by Gray and Fitch (1983), refers to genes "whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters" (Fitch 2000: 229); i.e. to genes related via descent involving lateral transfer.
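The distinctions quoted above can be made concrete with a small sketch. The following Python snippet is only an illustration, assuming that the history linking two genes can be reduced to two pieces of information: the kind of event at which they diverged, and whether a lateral transfer occurred since their common ancestor. The names `Event` and `classify_relation` are hypothetical and are not taken from Koonin (2005) or Fitch (2000).

```python
from enum import Enum


class Event(Enum):
    """Kinds of divergence event (hypothetical helper for this sketch)."""
    SPECIATION = "speciation"
    DUPLICATION = "duplication"


def classify_relation(divergence_event=None, lateral_transfer=False):
    """Name the relation between two genes that share a common ancestor.

    Assumption: the whole history is summarized by the divergence event
    and a flag for lateral transfer; real cases can be far messier.
    """
    if lateral_transfer:
        # Descent involving an interspecies (horizontal) transfer.
        return "xenology"
    if divergence_event is Event.SPECIATION:
        # Genes related via speciation (direct descent).
        return "orthology"
    if divergence_event is Event.DUPLICATION:
        # Genes related via duplication (indirect descent).
        return "paralogy"
    # Common descent without further specification of the scenario.
    return "homology"
```

Calling `classify_relation()` with no arguments returns the general term "homology", which mirrors the point that homology makes no commitment to a particular evolutionary scenario.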

In historical linguistics, the only relation that is explicitly defined is cognacy (also called cognation). Cognacy usually refers to words related via “descent from a common ancestor” (Trask 2000: 63), and it is strictly distinguished from descent involving lateral transfer (borrowing). The term cognacy itself, however, covers both direct and indirect descent. Hence, traditionally, German Zahn 'tooth' is cognate with English tooth, and German selig 'blessed' with English silly, and German Geburt 'birth' with English birth, although the historical processes that shaped the present appearance of these three word pairs are quite different. Apart from the sound shape, Zahn and tooth have regularly developed from Proto-Germanic *tanθ-; selig and silly both go back to Proto-Germanic *sæli- 'happy', but the meaning of the English word has changed greatly; Geburt and birth stem from Proto-Germanic *ga-burdi-, but the English word has lost the prefix as a result of specific morphological processes during the development of the English language (all examples follow Kluge and Seebold 2002, with modifications for the pronunciation of Proto-Germanic). Thus, of the three examples of cognate words given, only the first would qualify as having evolved by direct inheritance, while the inheritance of the latter two could be labelled as indirect, involving processes which are largely language-specific and irregular, such as meaning shift and morpheme loss. Trask (2000: 234) suggests the term oblique cognacy to label these cases of indirect inheritance, but this term seems to be rarely used in historical linguistics; and at least in the mainstream literature of historical linguistics I could not find even a single instance where the term was employed (apart from the passage by Trask).


In the table above (with modifications taken from List 2014: 39), I have tried to contrast the terminology used in evolutionary biology and historical linguistics by comparing the degree to which they reflect fundamental historical relations between words or genes. Here, common descent is treated as a basic relation which can be further subdivided into relations of direct common descent, indirect common descent, and common descent involving lateral transfer. As one can easily see, historical linguistics lacks proper terms for at least half of the relations, offering no exact counterparts for homology, orthology, and xenology in evolutionary biology.

Cognacy in historical linguistics is often deemed to be identical with homology in evolutionary biology, but this is only true if one ignores common descent involving lateral transfer. One may argue that the notion of xenology is not unknown to linguists, since the borrowing of words is a very common phenomenon in language history. However, the specific relation which is termed xenology in biology has no direct counterpart in historical linguistics: the term borrowing refers to a distinct process, not a relation resulting from the process. There is no common term in historical linguistics which addresses the specific relation between such words as German kurz 'short' and English short. These words are not cognate, since the German word has been borrowed from Latin cŭrtus 'mutilated' (Kluge and Seebold 2002). They share, however, a common history, since Latin cŭrtus and English short both (may) go back to Proto-Indo-European *(s)ker- 'cut off' (Vaan 2008: 158). The specific history behind these relations is illustrated in the following figure.


A specific advantage of the biological notion of homology as a basic relation covering any kind of historical relatedness, compared to the linguistic notion of cognacy as a basic relation covering direct and indirect common descent, is that the former is much more realistic regarding the epistemological limits of historical research. Up to a certain point, it can be fairly reliably demonstrated that the basic entities in the respective disciplines (words, genes, or morphological characters) share a common history. Demonstrating that more detailed relations hold, however, is often much harder. The strict notion of cognacy has forced linguists to set goals for their discipline which may often be far too ambitious to achieve. We need to adjust our terminology accordingly and bring our goals into balance with the epistemological limits of our discipline. In order to do so, I have proposed refining our current terminology in historical linguistics to the schema shown in the table below (with modifications taken from List 2014: 44):


Fifty Shades of Cognacy

In a recent blog post, David pointed to the relative character of homology in evolutionary biology by emphasizing that it "only applies locally, to any one level of the hierarchy of character generalization". His example was bat wings compared to bird wings, which are homologous when compared as forelimbs but analogous when compared as wings. We can find similar examples in historical linguistics.

If we consider words for 'to give' in the four Romance languages Portuguese, Spanish, Provencal and French, then we can state that Portuguese dar and Spanish dar are homologous, as are Provencal douna and French donner. The former pair go back to the Latin word dare 'to give', and the latter pair go back to the Latin word donare 'to gift (give as a present)'. In those times when Latin was commonly spoken, dare and donare were clearly separated words denoting clearly separated concepts and being used in clearly separated contexts. The verb donare itself was derived from Latin donum 'present, gift'. Similarly to English, where nouns can easily be used as verbs, Latin had specific morphological processes for deriving verbs from nouns. In contrast to English, however, these processes required that the form of the noun be modified (compare English gift vs. to gift with Latin donum vs. donare).

What the ancient Romans (who spoke Latin as their native tongue) were not aware of is that Latin donum 'gift' and Latin dare 'to give' themselves go back to a common word form. This was no longer evident in Latin, but it was in Proto-Indo-European, the ancestor of the Latin language. Thus, Latin dare goes back to Proto-Indo-European *deh3- 'to give', and Latin donum goes back to Proto-Indo-European *deh3-no- 'that which is given (the gift)' (Meiser 1999; what is written as *h3 in this context was probably pronounced as [x] or [h]). The word form *deh3-no- is a regular derivation from *deh3-, so at the Indo-European level both forms are homologous, since one is derived from the other. That means, in turn, that Latin dare and donum are also homologs, since they are the residual forms of the two homologous words in Proto-Indo-European. And since Latin donare is a regular derivation of donum, this means, again, that Latin dare and donare are also homologous, as are the words in the four descendant languages, Portuguese dar, Spanish dar, Provencal douna, and French donner. Depending on the time depth we apply, we will arrive at different homology decisions. I have tried to depict the complex history of the words in the following figure:


Judging from the treatment in linguistic databases, many scholars do not regard these different "shades of homology" as a real problem. In most cases, scholars use a "lumping approach" and label as cognates all words that go back to a common root, no matter how far that root goes back in time (compare, for example, the cognate labeling for reflexes of Proto-Indo-European *deh3- in the IELex).
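The dependence of homology decisions on time depth can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any database such as IELex works: the `PARENT` table simply restates the derivations given in the text, and counting derivation steps (`max_steps`) is an assumed, crude stand-in for time depth. The function names are hypothetical.

```python
# Immediate source of each word form, as described in the text.
PARENT = {
    "Portuguese dar": "Latin dare",
    "Spanish dar": "Latin dare",
    "Provencal douna": "Latin donare",
    "French donner": "Latin donare",
    "Latin donare": "Latin donum",
    "Latin donum": "PIE *deh3-no-",
    "PIE *deh3-no-": "PIE *deh3-",
    "Latin dare": "PIE *deh3-",
}


def ancestors(word, max_steps=None):
    """Return the word plus the forms reached by going back at most
    max_steps derivation steps (None means: go back as far as known)."""
    chain = [word]
    while chain[-1] in PARENT and (
        max_steps is None or len(chain) - 1 < max_steps
    ):
        chain.append(PARENT[chain[-1]])
    return chain


def share_ancestor(a, b, max_steps=None):
    """True if the two words meet in a common form within max_steps."""
    return bool(set(ancestors(a, max_steps)) & set(ancestors(b, max_steps)))


# Shallow view: one step back reaches only the Latin verbs.
shallow = share_ancestor("Portuguese dar", "French donner", max_steps=1)
# Lumping view: trace every word back as far as the table allows.
deep = share_ancestor("Portuguese dar", "French donner")
```

With one step, Portuguese dar and French donner do not meet (`shallow` is `False`), while the unrestricted "lumping" view traces both back to *deh3- (`deep` is `True`). Portuguese dar and Spanish dar meet in Latin dare under either setting.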

Importantly, however, this labeling practice may be contrary to the models that are used to analyze the data afterwards. All computational analyses model language evolution as a process of word gain and word loss. The words for the analyses are sampled from an initial set of concepts (such as 'give', 'hand', 'foot', 'stone', etc.) which are translated into the languages under investigation. If we did not know about the deeper history of Latin dare and donare, we would assume a regular process of language evolution here: at some point, the speakers of Gallo-Romance would cease to use the word dare to express the meaning 'to give' and use the word donare instead, while the speakers of Ibero-Romance would keep on using the word dare. This well-known process of lexical replacement (illustrated in the graphic below), which may provide strong phylogenetic signals, is lost in the current encoding practice where all four words are treated as homologs. Our current practice of cognate coding masks vital processes of language change.
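A small sketch may help to show what is lost. The Python snippet below turns the four words for 'give' into presence/absence characters of the kind used by gain-loss models, once with the two Latin verbs kept apart and once with everything lumped under the Proto-Indo-European root. The coding dictionaries and the helper names `binary_matrix` and `is_informative` are assumptions made for illustration; actual databases and software use their own formats.

```python
LANGUAGES = ["Portuguese", "Spanish", "Provencal", "French"]

# Cognate class assigned to the word for 'give' in each language.
SPLIT = {  # classes defined at the level of the Latin verbs
    "Portuguese": "dare",
    "Spanish": "dare",
    "Provencal": "donare",
    "French": "donare",
}
LUMPED = {lang: "*deh3-" for lang in LANGUAGES}  # one class for the deep root


def binary_matrix(coding):
    """One presence/absence row per cognate class, in LANGUAGES order."""
    classes = sorted(set(coding.values()))
    return {c: [int(coding[lang] == c) for lang in LANGUAGES] for c in classes}


def is_informative(row):
    """A binary character can only separate languages if it varies."""
    return 0 < sum(row) < len(row)


split_matrix = binary_matrix(SPLIT)
lumped_matrix = binary_matrix(LUMPED)
```

Under the split coding, `split_matrix` is `{"dare": [1, 1, 0, 0], "donare": [0, 0, 1, 1]}`, so both characters separate Ibero-Romance from Gallo-Romance. Under the lumped coding, `lumped_matrix` is `{"*deh3-": [1, 1, 1, 1]}`, a constant character for which `is_informative` returns `False`: the replacement of dare by donare has left no trace in the data.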


Outlook

Historical linguistics needs a more serious analysis of the fundamental processes of language change and the fundamental historical relations resulting from these processes. In the last two decades a large arsenal of quantitative methods has been introduced in historical linguistics. The majority of these methods come from evolutionary biology. While we have quickly learned to adapt and apply these methods to address questions of language classification and language evolution, we have forgotten to ask whether the processes these methods are supposed to model actually coincide with the fundamental processes of language evolution. Rather than adopting only the methods from evolutionary biology, we should also consider adopting the habit of having deeper discussions regarding the very basics of our methodology.

References

Arapov MV, Xerc MM (1974) Математические методы в исторической лингвистике [Mathematical methods in historical linguistics]. Moscow: Nauka. German translation: Arapov MV, Cherc MM (1983) Mathematische Methoden in der historischen Linguistik. Translated by R. Köhler and P. Schmidt. Bochum: Brockmeyer.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5, 227-231.

Gray GS, Fitch WM (1983) Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Molecular Biology and Evolution 1.1, 57-66.

Holzer G (1996) Das Erschließen unbelegter Sprachen. Zu den theoretischen Grundlagen der genetischen Linguistik. Frankfurt am Main: Lang.

Katičić R (1966) Modellbegriffe in der vergleichenden Sprachwissenschaft. Kratylos 11, 49-67.

Kluge F, Seebold E (2002) Etymologisches Wörterbuch der deutschen Sprache. 24th ed. Berlin: de Gruyter.

List J-M (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Meiser G (1999) Historische Laut- und Formenlehre der lateinischen Sprache. Darmstadt: Wissenschaftliche Buchgesellschaft.

Trask RL (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh: Edinburgh University Press.

Vaan M (2008) Etymological Dictionary of Latin and the Other Italic Languages. Leiden and Boston: Brill.