The Genealogical World of Phylogenetic Networks: February 2016

Wednesday, February 24, 2016

Unacknowledged re-use of intellectual property

I am publishing this post because both the author and publisher involved have stopped responding to my emails.

Some of you may have noticed the recent publication of the following book:

Dan Graur (2016) Molecular and Genome Evolution. Sinauer Associates.

Chapter 6 is of interest to the readers of this blog, being entitled "Reticulate evolution and phylogenetic networks". Unfortunately, as originally published, not all of the figures in that chapter acknowledge the source of the illustration. Of personal interest to me, Figure 6.4 [which can be viewed here] is a direct copy of the first part of my Primer of Phylogenetic Networks. Needless to say, Graur's figure prominently claims to be the copyright of the publisher rather than myself.

Neither the author nor the publisher has provided a satisfactory explanation, and both have made it clear that nothing will be done to rectify this except in some possible future edition of the book.

So, in the meantime, could you all please note that the idea is mine, not Graur's or Sinauer's. This annoys me, not just because of the laziness that led to this situation, but because the figures are an original conception of mine to explain Median Networks to non-experts, and I was very happy to have developed them. Having someone else trying to take the credit takes the shine off that.

Tuesday, February 16, 2016

Through a glass darkly

In an earlier blogpost I mentioned the now largely abandoned discipline of lexicostatistics. Initiated by Morris Swadesh (1909-1967; Swadesh 1950, 1952, 1955), it was in vogue in the 1950s but was abandoned in the 1960s, and has since often been labeled a failed theory that was explicitly proven to be wrong.

The crucial idea of Swadesh was to investigate lexical change from the perspective of the meaning of words. This contrasts with the perspective that takes similar (cognate) word forms in different languages as a starting point and asks to what degree they differ in their meanings. Swadesh's perspective instead starts from a set of meanings and investigates which word forms express them. It is called an onomasiological perspective (which "names" are assigned to concepts?), while the other is called a semasiological perspective (which "meanings" can words have?).

From a semasiological perspective, we would start from a set of related words and investigate their meanings. In this way, we could compare English head with German Hauptstadt "capital city" or English cup with German Kopf "head". Through such an analysis, we would learn that there was a semantic shift from the German word Haupt, which originally meant "head", to a more abstract meaning that is now probably best translated as "capital" or "main", and only occurs in compounds, such as Hauptstadt "capital city", Hauptursache "main reason", etc.

From an onomasiological perspective, we would start from a set of meanings and investigate which words are used to express them in different languages:

No.	Items	German	English	Dutch	Russian
1	hand	Hand	hand	hand	ruka
2	arm	Arm	arm	arm	ruka
3	mainly	hauptsächlich	mainly	hoofdzakelijk	glavny
4	head	Kopf, (Haupt)	head	hoofd, kop	golova
5	cup	Tasse	cup	kop	stakan
...	...	...	...	...	...

When looking at specific meanings in this way, one can find interesting patterns within one and the same language whenever a language uses the same or similar words to express what are different concepts in other languages. Russian thus uses the same word for "hand" and "arm", Dutch shows the same word for "head" and "cup", and Russian, Dutch, and German have similar forms for "mainly" and "head". These patterns can be historically interpreted by reconstructing patterns of semantic shift. In the case of English cup, German Kopf, and Dutch kop, for example, the original meaning of the words was "vessel" or "cup". Later on, the word changed its meaning and came to denote "head" in German. The transition is still reflected in Dutch, where the word can denote both meanings.
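The within-language patterns in the table can be found mechanically. The following is a minimal sketch, assuming the table is stored as a small Python dictionary; the data structure and the function name `colexifications` are illustrative choices made here, not part of any published analysis. It only detects exact reuse of a word form, so merely similar forms (such as Dutch hoofdzakelijk and hoofd) are not reported.

```python
from collections import defaultdict

# Rows of the table above: concept -> language -> list of word forms.
TABLE = {
    "hand":   {"German": ["Hand"], "English": ["hand"],
               "Dutch": ["hand"], "Russian": ["ruka"]},
    "arm":    {"German": ["Arm"], "English": ["arm"],
               "Dutch": ["arm"], "Russian": ["ruka"]},
    "mainly": {"German": ["hauptsächlich"], "English": ["mainly"],
               "Dutch": ["hoofdzakelijk"], "Russian": ["glavny"]},
    "head":   {"German": ["Kopf", "Haupt"], "English": ["head"],
               "Dutch": ["hoofd", "kop"], "Russian": ["golova"]},
    "cup":    {"German": ["Tasse"], "English": ["cup"],
               "Dutch": ["kop"], "Russian": ["stakan"]},
}


def colexifications(table):
    """Return {(language, form): sorted concepts} for every word form
    that expresses two or more concepts within the same language."""
    concepts_by_form = defaultdict(set)
    for concept, languages in table.items():
        for language, forms in languages.items():
            for form in forms:
                concepts_by_form[(language, form.lower())].add(concept)
    return {key: sorted(concepts)
            for key, concepts in concepts_by_form.items()
            if len(concepts) > 1}


result = colexifications(TABLE)
for (language, form), concepts in sorted(result.items()):
    print(f"{language}: '{form}' expresses {', '.join(concepts)}")
```

On this small table the sketch reports Russian ruka ("arm", "hand") and Dutch kop ("cup", "head"), which are the two exact matches mentioned in the text.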

We can model this situation by assuming that every word in a language has a certain reference potential (Schwarz 1996: 175; Allwood 2003; List 2014: 21f, 36). This means that every word has the potential to denote different things in the world, due to the concept it denotes primarily. In List (2014: 21), I have tried to depict this as follows:

Reference Potential of the Linguistic Sign

In this visualization, a word form refers to a meaning, and the meaning itself has the potential to denote various things in the world, but with different probabilities. A word that primarily means "head", for example, may likewise be used to denote the "first person", as in the "head of a group", and a word that primarily means "melon" may also be used to denote a "head", due to the similarity in form. We can investigate the reference potential of words by simply looking at different translations in dictionaries. As an example (from List 2014: 36), when looking at our three words English cup, Dutch kop, and German Kopf, we find the following rough arrangement with respect to the reference potential of the word (the thickness of the arrows indicating differences in denotation probability):

Reference Potential of Words Across Languages

Why do I mention all of this? First, I wanted to show that lexical change, no matter which perspective we take, is a very complex phenomenon. In a simplifying model, we could think of a lexicon as a bipartite network consisting of nodes that represent word forms in a language and nodes that represent meanings, and weighted links between word forms and meanings denoting the frequency with which a word is used to denote a given meaning. In such a network representation, lexical change could be modelled as the re-arrangement of the edges between word forms and meanings. If a word form loses all its edges, this word is lost from the language, but we could also think of new words entering the language, whether borrowed or created from within the language itself. Such a model would be very simplistic, ignoring aspects like word compounding, by which new words are created from existing ones. But it would be much more realistic than the idea that lexical change is just about the gain and loss of words, as assumed in the quasi-standard model of lexical change in phylogenetic reconstruction.
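To make the bipartite model concrete, here is a minimal sketch in Python. The class name `Lexicon`, its methods, and all of the weights are assumptions invented for illustration; the stages only loosely mimic the shift from "cup" to "head" described above and are not measured frequencies.

```python
from collections import defaultdict


class Lexicon:
    """Bipartite network: word forms on one side, meanings on the other."""

    def __init__(self):
        # weights[form][meaning] = how often the form is used for the meaning
        self.weights = defaultdict(dict)

    def set_link(self, form, meaning, weight):
        """Create, re-weight, or (with weight 0) remove an edge."""
        if weight > 0:
            self.weights[form][meaning] = weight
            return
        self.weights[form].pop(meaning, None)
        if not self.weights[form]:
            # A form that has lost all its edges is lost from the language.
            del self.weights[form]

    def forms(self):
        """All word forms currently in the language."""
        return set(self.weights)

    def forms_for(self, meaning):
        """Onomasiological view: which forms express this meaning?"""
        return {f for f, ms in self.weights.items() if meaning in ms}

    def meanings_of(self, form):
        """Semasiological view: which meanings can this form have?"""
        return dict(self.weights.get(form, {}))


# Hypothetical stages of change; the weights are made up.
lex = Lexicon()
lex.set_link("kop", "cup", 1.0)    # the form only means "cup"
lex.set_link("kop", "head", 0.3)   # a new edge to "head" appears
lex.set_link("kop", "head", 0.9)   # "head" becomes the main meaning
lex.set_link("kop", "cup", 0.0)    # the edge to "cup" is dropped
lex.set_link("tasse", "cup", 1.0)  # a new form enters to express "cup"

print(lex.forms_for("cup"))    # forms that now express "cup"
print(lex.meanings_of("kop"))  # meanings the older form still has
```

A gain-and-loss model would register only that "cup" is expressed by a different form at the end. The sketch keeps the intermediate states in which one form carries both meanings.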

This brings us to my second point. When Swadesh introduced lexicostatistics, and his very specific onomasiological perspective on lexical change, he established a model of lexical change that would deliberately ignore all interesting processes underlying the phenomenon. Since then, we have been looking through a glass darkly. This is like a crime inspector having no other means but watching potential suspects through the windows of their apartments, noticing changes, like the differently coloured words in state A and state B in the Figure below, but never knowing what was really going on inside those flats (state C).

Through a Glass Darkly: The lexicostatistic perspective on lexical change (A, B), and what is really going on (C).

Yet, if we are honest with ourselves, the problem of looking through a glass darkly does not pertain to the lexicostatistic perspective alone, but effectively applies to all of our research on language change. It is just the size and the number of windows that we survey, and the cleanliness of the glass, that may make a little difference.

References

Allwood, J. (2003) Meaning potentials and context: Some consequences for the analysis of variation in meaning. In: Cuyckens, H., R. Dirven, and J. Taylor (eds.): Cognitive approaches to lexical semantics. Mouton de Gruyter: Berlin and New York. 29-65.
List, J.-M. (2014) Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
Schwarz, M. (1996) Einführung in die kognitive Linguistik. Francke: Basel and Tübingen.
Swadesh, M. (1950) Salish internal relationships. Int. J. Am. Linguist. 16.4. 157-167.
Swadesh, M. (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proc. Am. Philos. Soc. 96.4. 452-463.
Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. Int. J. Am. Linguist. 21.2. 121-137.

Monday, February 8, 2016

The network of woodpeckers, etc

The world continues on its merry way, searching for fragments of the Tree of Life. That is, research papers continue to be published that give no credence to the possibility of reticulate evolutionary history, especially in zoology.

A recent case in point is this one:

Matthew J. Dufort (2016) An augmented supermatrix phylogeny of the avian family Picidae reveals uncertainty deep in the family tree. Molecular Phylogenetics and Evolution 94: 313–326.

The author constructed a supermatrix for 26 loci for 78 taxa of the bird family containing woodpeckers, piculets, and wrynecks. The author used an array of phylogenetic techniques, including the construction of maximum-likelihood "gene" trees, several different maximum-likelihood species trees, plus time trees. All of these methods pre-suppose that the evolutionary history of the species was strictly tree-like.

We can use an exploratory data analysis to evaluate how probable this fundamental assumption is. I constructed a SuperNetwork based on the 26 gene trees produced by the author, using the SplitsTree program, as shown here. The network is basically tree-like, with one major exception.

I have labeled only one taxon, which seems to be the culprit for the major non-tree-likeness. This species appears in 10 of the gene trees, being part of the Leiopicus group (as expected) in 7 of the rooted trees. It is unexpectedly related to the Picus group in two of the other rooted trees, and is close to Dryocopus in the remaining tree.

This network EDA does not, of course, imply the existence of reticulate evolution. It does, however, highlight a pattern of incongruence that requires explanation, if the history of these birds is to be fully elucidated. Reticulate evolution remains one of the possible explanations, pending further investigation.
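The kind of incongruence involved can be illustrated with the standard split-compatibility test: two splits of the same taxon set fit on a single tree only if at least one of the four intersections of their sides is empty. The following is a minimal sketch of that test only, not of the SuperNetwork method used by SplitsTree. The taxon labels are placeholders (with "X" standing for the labeled species), and all trees are assumed to share one taxon set, which real gene trees often do not.

```python
def compatible(side_a, side_b, taxa):
    """Return True if the splits side_a | rest and side_b | rest
    can both be displayed by a single tree on the taxon set."""
    a, c = frozenset(side_a), frozenset(side_b)
    b, d = taxa - a, taxa - c
    # Compatible when at least one of the four intersections is empty.
    return any(not (x & y) for x in (a, b) for y in (c, d))


# Placeholder taxa; "X" stands for the species that moves between groups.
TAXA = frozenset({"X", "Leiopicus", "Picus", "Dryocopus", "outgroup"})

with_leiopicus = {"X", "Leiopicus"}          # placement seen in most trees
with_picus = {"X", "Picus"}                  # placement seen in a few trees
wider_clade = {"X", "Leiopicus", "Picus"}    # nests the first grouping

print(compatible(with_leiopicus, with_picus, TAXA))   # conflicting placements
print(compatible(with_leiopicus, wider_clade, TAXA))  # nested groupings
```

The two placements of "X" are incompatible, so no single tree can show both, and a network that displays both has to contain a box-like structure. The nested pair is compatible and stays tree-like.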

For an example of a network view of bird evolution, see:

Antonio Hernandez-Lopez, Didier Raoult, Pierre Pontarotti (2013) Birds perching on bushes: networks to visualize conflicting phylogenetic signals during early avian radiation. Comptes Rendus Palevol 12: 333-337.

Monday, February 1, 2016

Tardigrades and phylogenetic networks

In this blog we have always championed the use of Exploratory Data Analysis prior to phylogenetic analyses. This approach explores the characteristics of the data before making formal inferences about possible evolutionary scenarios. One of the reasons for doing this is the possibility of data errors. That is, we need to distinguish between estimation errors deriving from our experimental procedures and real biological scenarios, because both of these will result in complex patterns in our data.

One possible classification of the potential causes of complex data patterns in phylogenetics is this:

Estimation errors
(i) incorrect data
— inadequate data-collection protocol
— poor laboratory / museum / herbarium technique
— lack of quality control after data collection
— misadventure
(ii) inappropriate sampling
— distant outgroup
— rapid evolutionary rates
— short internal branches
(iii) model mis-specification
— wrong assessment of primary homology
— wrong substitution model
— different optimality criteria

Biological complexity
(iv) analogy
— parallelism
— convergence
— reversal
(v) homology
— deep coalescence
— duplication–loss
— hybridization
— introgression
— recombination
— horizontal gene transfer
— genome fusion

The scientific literature has a number of prime examples where people have asserted a case of biological complexity that has subsequently been questioned, and attributed to estimation errors instead.

For example, many of you will have noted the recent attention given to the release of various genome sequences from the Tardigrades, a group of microscopic animals often alleged to be the world's most resistant to environmental conditions. Two rival papers have appeared:

Thomas C. Boothby et al. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences of the USA 112: 15976–15981.

Georgios Koutsovoulos et al. (2015) The genome of the tardigrade Hypsibius dujardini. BioRxiv preprint 33464. [Now published as: Georgios Koutsovoulos et al. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences of the USA]

The former paper attributes their observed phylogenetic complexity to horizontal gene transfer (group v in the list above) while the latter attributes it to sequencing errors (group i). This situation is discussed in more detail elsewhere on the web, for example:

Rival scientists cast doubt upon recent discovery about invincible animals

How did these indestructible pond critters get their genes?

This difference in possible cause (of complexity) matters particularly for the use of phylogenetic networks, because both estimation errors and biological complexity will appear as reticulation patterns in any network. This is particularly important for the assertion of evolutionary scenarios such as horizontal gene transfer, because usually the only evidence for any such gene flow is the complexity of the phylogenetic network — that is, there is no independent experimental evidence, and we are relying entirely on the phylogenetic pattern analysis. Estimation errors must thus be eliminated prior to the phylogenetic analysis, if we are to produce a high quality network.

The current situation potentially has unfortunate consequences. For example, there are continual comments that horizontal gene flow is rare, particularly from zoologists, even though there is a large amount of evidence to the contrary. Situations like the current one can only add fuel to this argument, if strong claims of gene flow turn out to be erroneous. There is no quantitative basis for an assertion that gene flow is rare in zoology — those who have looked for reticulate evolution in animals have found it, and those who haven't haven't.

In the end, data-display networks are useful for displaying incongruent data patterns, but the source of the incongruence needs to be identified before these networks are turned into evolutionary networks (either explicitly drawn or verbally implied).