Tuesday, December 5, 2017

The Synoptic Gospels problem: preparing a phylogenetic approach

This is the second part of my series on phylogenetics and a specific case of textual criticism, the Biblical one. The first part appeared as Another test case for phylogenetics and textual criticism: the Bible, and covered the background to the textual problem — that post should be read first. Here, I provide a preliminary genealogical analysis of some specific data related to the problem.

The synoptic gospels and phylogenetics: how to code data?

Just like in the cases of general stemmatics and historical linguistics, our immediate problem for a phylogenetic approach to Biblical criticism is one of data. Upon investigation, the field proves itself desperately in need of an open access mentality — a great deal of work would be needed to turn the few aggregated data I could find into datasets that could feed the most basic analysis tools.

No open dataset proved either adequate or correct enough. They are mostly quotations or subjective developments of the scientific sources, available only in printed editions and in software for Biblical studies, sometimes at exorbitant prices, and frequently with licenses that explicitly prohibit extracting and reusing the data. This forced me to postpone an analysis of families of manuscripts, as unfortunately there is no complete free edition of the Novum Testamentum Graece (the reference work in the field, usually referred to as Nestle-Aland after its main editors).

However, I could explore the problem of the synoptic gospels in a way and with a dataset closer to the ones of the 19th century analyses, by sitting with a printed Bible and compiling my own synopsis of episodes. My work in this field ends with this second post, but it seems like a good approach to the development of a phylogenetic investigation, to start by reproducing the old analyses with new tools.

After some bibliographic review and inspection of the solutions presented to the problem, my understanding is that there would be three fundamental ways of coding for features of these texts.

The first and simplest is to compile a list of episodes, themes, and topics found in each gospel (a proper “synopsis”), without considering semantic differences or relative positions, coding for a truth table indicating whether each “event” (i.e. “character”) is found. For example, the imprisonment of John the Baptist is mentioned in the three synoptic gospels (Matthew 4,12; Mark 1,14; Luke 3,18-20) and would be coded as “present” in all of them, even though in Luke the relative order is different (it is narrated before the baptism of Jesus, in a flashforward). On the other hand, the priests conspiring against Jesus is only narrated in two gospels (Mark 11,18; Luke 19,47-48), and the “character” of the meek inheriting the Earth is only found in one of them (Matthew 5,5), as shown in the table below.

Episode              | Matthew | Mark  | Luke
Imprisonment of John | 4,12    | 1,14  | 3,18-20
Priests conspiring   | -       | 11,18 | 19,47-48
Meek inheritance     | 5,5     | -     | -

This kind of census approach is what most descriptive statistics on the synoptic relationship consider when demonstrating how much the gospels have in common, including the graph reproduced in the first part of this series. As with statistics on the genetic material shared between species, such as humans and other apes, caution is needed to understand what is actually meant: the percentages usually reported refer to episode coincidence (in a loose analogy, like the presence of a protein), not text coincidence (like the sequences of genetic bases). This is why these analyses should consider both “episode homology” and “episode analogy”. All gospels as we have them evolved from initial versions, and lacking an episode favored by the public or by the clergy (which denounced other gospels, now lost, as “uninspired”) could have acted as an evolutionary pressure to incorporate that episode.
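
The census coding described above can be sketched as a small presence/absence table. This is only a minimal illustration built from the three episodes mentioned in the text; the dictionary layout and the helper name `shared_fraction` are assumptions made for the example, not the format of any published synopsis.

```python
# Presence/absence ("truth table") coding of the three episodes named above.
# 1 = episode narrated in that gospel, 0 = not narrated.
GOSPELS = ("Matthew", "Mark", "Luke")

EPISODES = {
    "Imprisonment of John": {"Matthew": 1, "Mark": 1, "Luke": 1},
    "Priests conspiring":   {"Matthew": 0, "Mark": 1, "Luke": 1},
    "Meek inheritance":     {"Matthew": 1, "Mark": 0, "Luke": 0},
}


def shared_fraction(a: str, b: str) -> float:
    """Fraction of the episodes found in gospel `a` that also occur in `b`."""
    in_a = [row for row in EPISODES.values() if row[a]]
    if not in_a:
        return 0.0
    return sum(row[b] for row in in_a) / len(in_a)


for a in GOSPELS:
    for b in GOSPELS:
        if a != b:
            print(f"{a} episodes also in {b}: {shared_fraction(a, b):.2f}")
```

Note that the figure is asymmetric (the share of Mark found in Luke is not the share of Luke found in Mark), and that it counts episodes, not words.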

A deeper level of coding would be to map the text of episodes and events into “semantic” characters, ignoring textual differences (like synonyms) but coding for differences in intended meaning. For example, the event of Jesus being tested in the wilderness, while narrated in all three gospels (Matthew 4,1-2; Mark 1,12-13; Luke 4,1-2), is really only equivalent in Matthew and Luke, where he is tempted by "the Devil", while in Mark he is tempted by "Satan", which is a figure closer to the Hebrew meaning of "enemy, adversary; accuser". Likewise, while Matthew and Luke both narrate Jesus’ most famous sermon, they are semantically different: the setting is a mountain in the first and a plain in the second.

Episode                  | Matthew      | Mark     | Luke
Tested in the wilderness | by the Devil | by Satan | by the Devil

This kind of mapping is harder, due to the expertise required to subjectively distinguish meaning, as in the case of the mountain / plain, which scholars in Biblical hermeneutics seem to agree to be more than merely a change of setting for narration. The difficulty is aggravated by the eventual need to quantify the semantic shifts (how far is "the Devil" from "Satan (the adversary)", especially when the episode is missing from the non-synoptic gospel of John?). These three states ("null", "Devil", and "Satan") should not be considered equally different, especially when the texts of the three synoptic gospels are clearly related. Luckily, while not necessarily in a systematic way for phylogenetic purposes, this kind of coding has already been conducted by many Biblical scholars, and we might thus appropriate it in the future.
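
One way to express that the three states are not equally different is a cost (step) matrix between states. The sketch below is only an illustration: the numeric costs are invented placeholders, and the coding of John as "null" simply follows the remark above that the episode is missing there.

```python
# Hypothetical, hand-picked costs between the states of the "tempter"
# character. The numbers are placeholders, not values from any study.
COST = {
    ("null", "Devil"): 1.0,
    ("null", "Satan"): 1.0,
    ("Devil", "Satan"): 0.25,  # assumed to be a smaller semantic shift
}

# State observed in each gospel, as described in the text.
CODING = {"Matthew": "Devil", "Mark": "Satan", "Luke": "Devil", "John": "null"}


def state_distance(x: str, y: str) -> float:
    """Symmetric cost of changing state `x` into state `y`."""
    if x == y:
        return 0.0
    return COST[(x, y)] if (x, y) in COST else COST[(y, x)]


def gospel_distance(a: str, b: str) -> float:
    """Distance between two gospels for this single character."""
    return state_distance(CODING[a], CODING[b])


print(gospel_distance("Matthew", "Luke"))
print(gospel_distance("Matthew", "Mark"))
print(gospel_distance("Mark", "John"))
```

Choosing the actual costs is exactly the part that requires expertise in Biblical hermeneutics; the code only shows where such judgments would enter.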

The third way of coding, partly solving the difficulties of the second solution, listed above, would be to compare the Greek text for each event, using some distance metric. For strings, there is the common Levenshtein distance, or, in a blatant self-promotion, my own sequence similarity algorithm. For linguistic texts, there are dozens of possible Natural Language Processing solutions, but usually with no model for Koine Greek (apart from purely statistical ones that can overfit, because in general they are actually trained on the text of the gospels, in the first place).

Episode | Matthew | Mark | Luke
(opening verses) | Βίβλος γενέσεως Ἰησοῦ... (1,1) | Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ... (1,1) | Ἐπειδήπερ πολλοὶ ἐπεχείρησαν... (1,1-4)
Birth of Jesus | Τοῦ δὲ Ἰησοῦ χριστοῦ ἡ γένεσις... (1,18-25) | - | Ἐγένετο δὲ ἐν ταῖς ἡμέραις ἐκείναις... (2,1-7)
Healing of possessed | - | καὶ εὐθὺς ἦν ἐν τῇ συναγωγῇ αὐτῶν ἄνθρωπος ἐν πνεύματι... (1,23-28) | καὶ ἐν τῇ συναγωγῇ ἦν ἄνθρωπος ἔχων... (4,33-37)
Parable of tares | Ἄλλην παραβολὴν παρέθηκεν αὐτοῖς λέγων... (13,24-30) | - | -

By comparing all pairs of texts for all characters, we could build a matrix of pairwise distances, similar to what David frequently does in the EDA analyses posted to this blog. Most synoptic lists have already mapped each event to its text (sometimes in discontinuous blocks), and a copy of the reconstructed Greek original is at hand (Holmes 2010, used for the table immediately above), so it should not be too hard to perform such a study.
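
As a rough sketch of this third coding, the following computes a normalized Levenshtein distance between the truncated opening snippets quoted above and arranges the results as a pairwise matrix. It is only an illustration: the snippets are the shortened strings from the table, not full passages, and the function names are my own choices for the example.

```python
import unicodedata

# Truncated opening snippets as quoted in the table above (not full verses).
OPENINGS = {
    "Matthew": "Βίβλος γενέσεως Ἰησοῦ",
    "Mark": "Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ",
    "Luke": "Ἐπειδήπερ πολλοὶ ἐπεχείρησαν",
}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    a = unicodedata.normalize("NFC", a)  # unify equivalent accent encodings
    b = unicodedata.normalize("NFC", b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def distance_matrix(texts: dict) -> dict:
    """Nested dict with the normalized distance for every pair of labels."""
    return {a: {b: normalized_distance(texts[a], texts[b]) for b in texts}
            for a in texts}


for name, row in distance_matrix(OPENINGS).items():
    print(name, {other: round(d, 2) for other, d in row.items()})
```

A character-level distance like this is sensitive to accents, spelling, and inflection, so a real study would probably need to decide on normalization and tokenization first.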

A simple Splits Graph analysis

For the purpose of this post, I decided to proceed with the first of these three possible solutions, listing whether an event is found in each Gospel or not, ignoring semantic and textual differences. I modified the synopsis by Garmus (1982), itself apparently modified from some Nestle-Aland edition. This produced a final list of 364 characters and their presence in each of the four gospels — I decided to include the non-synoptic John to test where the analyses would place it.
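
For readers who want to try something similar, here is a minimal sketch of how a presence/absence matrix of this kind can be turned into pairwise distances and written out as a NEXUS file for network software. The eight characters below are invented placeholders and not my 364-character dataset, and the exact NEXUS options a given program expects may differ from what is written here.

```python
from itertools import combinations

# Invented placeholder data, NOT the real dataset: one string of 0/1
# states per gospel, one position per character (episode).
TOY_MATRIX = {
    "Matthew": "11011010",
    "Mark":    "11010000",
    "Luke":    "11110110",
    "John":    "10000001",
}


def hamming(a: str, b: str) -> float:
    """Proportion of characters whose states differ between two taxa."""
    if len(a) != len(b):
        raise ValueError("character strings must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def to_nexus(matrix: dict) -> str:
    """Serialise the 0/1 matrix as NEXUS taxa and characters blocks."""
    taxa = list(matrix)
    nchar = len(matrix[taxa[0]])
    width = max(len(t) for t in taxa)
    lines = [
        "#NEXUS",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={len(taxa)};",
        "  TAXLABELS " + " ".join(taxa) + ";",
        "END;",
        "BEGIN characters;",
        f"  DIMENSIONS nchar={nchar};",
        '  FORMAT datatype=standard symbols="01" missing=?;',
        "  MATRIX",
    ]
    lines += [f"  {t.ljust(width)} {matrix[t]}" for t in taxa]
    lines += ["  ;", "END;"]
    return "\n".join(lines)


for a, b in combinations(TOY_MATRIX, 2):
    print(f"{a:8} {b:8} {hamming(TOY_MATRIX[a], TOY_MATRIX[b]):.3f}")
print(to_nexus(TOY_MATRIX))
```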

As expected, the data are to a large extent arbitrary and subjective. Garmus has obvious limitations in the way of dealing with events narrated out of the expected chronological sequence (i.e., flashbacks and flashforwards, as in the case of the beheading of John in relation to his actions), as well as with theological excursuses. None of these limits, however, seem to impact the general shape of a network or tree generated from these data, at most strengthening more feeble signals.

Splits graph, modified from the one generated with SplitsTree (Huson & Bryant 2006)

As also expected, the graph supports what is by now a general consensus. Mark is likely to be the gospel closest to a hypothetical root (in this case, nearest to the mid-point). John is the most distinct of the four gospels, being closer to Mark than to the Matthew-Luke group (due to the “core” events narrated and the fewer innovations in Mark). Considering edge lengths, Luke seems to be the most innovative taxon of the synoptic gospel neighborhood / group. Such a network could never demonstrate the existence of "Q" (see the first post) as a stand-alone and actual document, but this tentative analysis does support the hypothesis that Matthew and Luke share a common development, overall supporting Marcan priority.

While probably obvious, it is important to remember that phylogenetic methods are tools that imply the existence of users. They should be an additional instrument for investigation, possibly promoting collaboration between serious Biblical critics and experts in phylogenetic methods. Let’s consider two examples of the need for such expertise.

First, there is much historical, textual, and theological evidence supporting the hypothesis that the gospel of Mark originally ended with what is now Mark 16,8, with the twelve following verses being later additions (something common to many Greek texts, including the Odyssey). These supposed additions are known only to those who delve into Biblical scholarship. If we mark them as missing in our data, as we should at least test, the distance between Mark and all other gospels increases considerably for such an apparently minor change. This holds even for the unrelated Gospel of John, and is especially visible in the edge length between Luke and Mark.
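
The kind of test described in this first example can be sketched as a simple re-coding step: mark the characters belonging to the supposed additions as absent for Mark and compare the distances before and after. The data and the choice of positions below are invented for illustration only and say nothing about the real effect.

```python
# Toy sensitivity test. The strings and the positions treated as
# "longer ending" characters are hypothetical placeholders.
MARK = "110111"
LUKE = "111111"
LONGER_ENDING = [4, 5]  # assumed indices of characters found only in Mark 16,9-20


def hamming(a: str, b: str) -> float:
    """Proportion of characters whose states differ."""
    if len(a) != len(b):
        raise ValueError("character strings must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def recode_absent(states: str, positions) -> str:
    """Return a copy of `states` with the given positions set to '0' (absent)."""
    chars = list(states)
    for p in positions:
        chars[p] = "0"
    return "".join(chars)


before = hamming(MARK, LUKE)
after = hamming(recode_absent(MARK, LONGER_ENDING), LUKE)
print(f"Mark-Luke distance before: {before:.3f}, after: {after:.3f}")
```

With real data, the same comparison would be run for every pair of gospels and the resulting network redrawn.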

Second, if conducting the third and especially the second type of coding that I described above, a researcher should have at least a basic knowledge of the language they are dealing with. Adapting the explanation of Smith (2017), Matthew and Mark might seem to use the same vocabulary for the “parable of the harvest” when read in English translation, but there is a concealed change of meaning (whose theological importance and implication I'm not debating here), as the single English word “seed” tends to be used in translation of two different Greek words: in Matthew, “sperma” (the kernels of grain, in a more agricultural sense) and, in Mark, “sporos” (which carries a connotation of generative matter to be released).


My dataset is available in preliminary state (for example, labels are in Portuguese) here.

In conclusion, phylogenetics still has much to offer to the field of textual criticism, and this should include Biblical criticism, especially if we are able to support analyses of textual development from trees on manuscripts. I hope this pair of posts will motivate Biblical scholars to collaborate. If so, please write to me.


Garmus, Ludovico (ed.) (1982) Bíblia sagrada. Petrópolis: Editora Vozes. [reprint 2001]

Goodacre, Mark (2001) The Synoptic Problem: a Way Through the Maze. New York: T & T Clark International. (available on Archive.org)

Holmes, Michael W. (ed.) (2010) SBL Greek New Testament. Atlanta, GA: Society of Biblical Literature.

Huson, Daniel H.; Bryant, David (2006) Application of Phylogenetic Networks in Evolutionary Studies, Mol. Biol. Evol., 23(2):254-267. [SplitsTree.org]

Smith, Mahlon H (2017) A Synoptic Gospels Primer. http://virtualreligion.net/primer/

Tuesday, November 28, 2017

“Man gave names to all those animals”: goats and sheep

This is a joint post by Guido Grimm, Johann-Mattis List, and Cormac Anderson.

This is the second of a pair of posts dealing with the names of domesticated animals. In the first part, we looked at the peculiar differences in the names we use for cats and dogs, two of humanity’s most beloved domesticated predators. In this, the second part (and with some help from Cormac Anderson, a fellow linguist from the Max Planck Institute for the Science of Human History), we’ll look at two widely cultivated and early-domesticated herbivores: goats and sheep.

Similar origins, but not the same

Both goats and sheep are domesticated animals that have an explicitly economic use; and, in both cases, genetic and archaeological evidence points to the Near East as the place of domestication (Naderi et al. 2007). The main difference between the two is the natural distribution of goats (providing nourishment and leather) and sheep (providing the same plus wool). This distribution is also reflected in the phonetic (dis)similarities of the terms used in our sample of languages (Figures 1 and 2).

Capra aegagrus, the species from which the domestic goat derives, is native to the Fertile Crescent and Iran. Other species of the genus, similar to the goat in appearance, are restricted to fairly inaccessible areas of the mountains of western Eurasia (see Figure 3, taken from Driscoll et al. 2009). On the other hand, Ovis aries, the sheep and its non-domesticated sister species, are found in hilly and mountainous areas throughout the temperate and boreal zone of the Northern Hemisphere. Whenever humans migrated into mountainous areas, there was the likelihood of finding a beast that:
Had wool on his back and hooves on his feet,
Eating grass on a mountainside so steep
[Bob Dylan: Man Gave Names to all those animals].

Goats were actively propagated by humans into every corner of the world, because they can thrive even in quite inhospitable areas. Reflecting this, differences in the terms for "goat" generally follow the main subgroups of the Indo-European language family (Figure 1), in contrast to "cat", "dog", and "sheep". From the language data, it seems that for the most part each major language expansion, as reflected in the subgroups of Indo-European languages, brought its own term for "goat", and that it was rarely modified too much or borrowed from other speech communities.

There is one exception to this, however. The terms in the Italic and Celtic languages look as though they are related, coming from the same Proto-Indo-European root, *kapr-, although the initial /g/ in the Celtic languages is not regular. In Irish and Scottish Gaelic, the words for "sheep" also come from the same root. In other cases, roots that are attested in one or other language have more restricted meanings in some other language; for example, the Indo-Iranic words for goat are cognate with the English buck, used to designate a male goat (or sometimes the male of other hooved animals, such as deer).

The German word Ziege sticks out from the Germanic form gait- (but note the Austro-Bavarian Goaß, and the alternative term Geiß, particularly in southern German dialects). The origin of the German term is not (yet) known, but it is clear that it was already present in the Old High German period (8th century CE). It was not until Luther used the word in his translation of the Bible, however, that it became the norm and successively replaced the older forms in other varieties of German (Pfeifer 1993: s. v. "Ziege").

Figure 1: Phonetic comparison of words for "goat"


The terms for sheep, however, are often phonetically very different even in related languages. The overall pattern seems to be more similar to that of the words for dog – the animal used to herd sheep and protect them from wolves. An interesting parallel is the phonetic similarity between the Danish and Swedish forms får (a word not known in other Germanic languages) and the forms in the Indic languages. This similarity is a pure coincidence, as the Scandinavian forms go back to a form fahaz- (Kroonen 2013: 122), which can be further related to Latin pecus "cattle" (ibid.) and is reflected in Italian [pɛːkora] in our sample.

This example clearly shows the limitations of pure phonetic comparisons when searching for historical signal in linguistics. Latin c (pronounced as [k]) is usually reflected as an h in Germanic languages, reflecting a frequent and regular sound change. The sound [h] itself can be easily lost, and the [z] became a [r] in many Scandinavian words. The fact that both Italian and Danish plus Swedish have cognate terms for "sheep", however, does not mean that their common ancestors used the same term. It is much more likely that speakers in both communities came up with similar ways to name their most important herded animals. It is possible, for example, that this term generically meant "livestock", and that the sheep was the most prototypical representative at a certain time in both ancestral societies.
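
The idea of regular sound change can be illustrated with a toy rule cascade. The sketch below applies only the two changes described in the paragraph above (loss of h, then z becoming r) to the reconstructed form, using plain string replacement; it is a deliberately crude assumption for illustration and not a model of the actual history of the Scandinavian words.

```python
# Toy ordered rewrite rules, taken only from the description above:
# first [h] is lost, then [z] becomes [r].
RULES = [("h", ""), ("z", "r")]


def apply_rules(form: str, rules=RULES) -> str:
    """Apply each (old, new) replacement in order to the whole form."""
    for old, new in rules:
        form = form.replace(old, new)
    return form


print(apply_rules("fahaz"))  # prints "faar"
```

Vowel changes are not modeled here, so the output is only a rough approximation of the attested får. The point is that forms connected by regular rules can look quite different on the surface, which is why surface similarity alone is a weak guide to cognacy.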

Furthermore, we see substantial phonetic variation in the Romance languages surrounding the Mediterranean, where both sheep and goats have probably been cultivated since the dawn of human civilization. Each language uses a different word for sheep, with only the Western Romance languages being visibly similar to ovis, their ancestral word in Latin, while Italian and French show new terms.

Figure 2: Phonetic comparison of words for "sheep"

More interesting aspects

The wild sheep, found in hilly and mountainous areas across western Eurasia, was probably hunted for its wool long before mouflons (a subspecies of the wild sheep) were domesticated and kept as livestock. The word for "sheep" in Indo-European, which we can safely reconstruct, was *h₂ówis, possibly pronounced as [xovis], and is still reflected in Spanish, Portuguese, Romanian, Russian, and Polish. It survives in many more languages as a specific term with a different meaning, addressing the milk-bearing / birthing female sheep. These include English ewe, Faroese ær (which comes in more than a dozen combinations; Faroes literally means “sheep islands”), French brebis (important to know when you want sheep-milk based cheese), and German Aue (extremely rare nowadays, having been replaced by Mutterschaf "mother-sheep"). In other languages it has been lost completely.

What is interesting in this context is that while the phonetic similarity of the terms for "sheep" resembles the pattern we observe for "dog", the history of the words is quite different. While the words for "dog" just continued in different language lineages, and thus developed independently in different groups without being replaced by other terms, the words for "sheep" show much more frequent replacement patterns. This also contrasts with the terms for "goat", which are all of much more recent origin in the different subgroups of Indo-European, and have remained rather similar after they were first introduced.

The reasons for these different patterns of animal terms are manifold, and a single explanation may never capture them all. One general clue with some explanatory power, however, may be how and by whom the animals were used. Humans, in particular nomadic societies, have relied on goats to colonize or survive in unfavourable environments, even into historic times. For instance, goats were introduced to South Africa by European settlers to effectively eat up the thicket growing in the interior of the Eastern Cape Province. Once the thicket was gone, the fields were then used for herding cattle and sheep.

Figure 3: Map from Driscoll et al. (2009)

There are other interesting aspects of the plot.

For example, as mentioned before, in Chinese the word for goat means "mountain sheep/goat" and the word for sheep means "soft sheep/goat". While it is straightforward to assume that yáng, the term for "sheep/goat", originally only denoted one of the two organisms, either the sheep or the goat, it is difficult to say which came first. The term yáng itself is very old, as can also be seen from the Chinese character used, which serves as one of the base radicals of the writing system, depicting an animal with horns: 羊. The sheep seems to have arrived in China rather early (Dodson et al. 2014), predating the invention of writing, while the arrival of the goat was also rather ancient (Wei et al. 2014) (and might also have happened more than once). Whether sheep arrived before goats in China, or vice versa, could probably be tested by haplotyping feral and locally bred populations while recording the local names and establishing the similarity of words for goat and sheep.

While the similar names for goat and sheep may be surprising at first sight (given that the animals do not look all that similar), the similarity is reflected in quite a few of the world's languages, as can be seen from the Database of Cross-Linguistic Colexifications (List et al. 2014) where both terms form a cluster.

Source Code and Data

We have uploaded source code and data to Zenodo, where you can download them and carry out the tests yourself (DOI: 10.5281/zenodo.1066534). Many thanks go to Gerhard Jäger (Eberhard-Karls University Tübingen), who provided us with the pairwise language distances computed for his 2015 paper on "Support for linguistic macrofamilies from weighted sequence alignment" (DOI: 10.1073/pnas.1500331112).

Final remark

As in the case of cats and dogs, we have reported here merely preliminary impressions, through which we hope to encourage potential readers to delve into the puzzling world of naming those animals that were instrumental for the development of human societies. In case you know more about these topics than we have reported here, please get in touch with us; we will be glad to learn more.

  • Dodson, J., E. Dodson, R. Banati, X. Li, P. Atahan, S. Hu, R. Middleton, X. Zhou, and S. Nan (2014) Oldest directly dated remains of sheep in China. Sci Rep 4: 7170.
  • Driscoll, C., D. Macdonald, and S. O’Brien (2009) From wild animals to domestic pets, an evolutionary view of domestication. Proceedings of the National Academy of Sciences 106 Suppl 1: 9971-9978.
  • Jäger, G. (2015) Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences 112.41: 12752–12757.
  • Kroonen, G. (2013) Etymological dictionary of Proto-Germanic. Brill: Leiden and Boston.
  • List, J.-M., T. Mayer, A. Terhalle, and M. Urban (eds) (2014) CLICS: Database of Cross-Linguistic Colexifications. Forschungszentrum Deutscher Sprachatlas: Marburg.
  • Naderi, S., H. Rezaei, P. Taberlet, S. Zundel, S. Rafat, H. Naghash, et al. (2007) Large-scale mitochondrial DNA analysis of the domestic goat reveals six haplogroups with high diversity. PLoS One 2.10. e1012.
  • Pfeifer, W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.
  • Wei, C., J. Lu, L. Xu, G. Liu, Z. Wang, F. Zhao, L. Zhang, X. Han, L. Du, and C. Liu (2014) Genetic structure of Chinese indigenous goats and the special geographical structure in the Southwest China as a geographic barrier driving the fragmentation of a large population. PLoS One 9.4: e94435.

Tuesday, November 21, 2017

Another test case for phylogenetics and textual criticism: the Bible

This is a two-part blog post. Here, I will introduce a particular stemmatological problem, along with the studies of it to date; and in a subsequent post I will discuss possible phylogenetic analyses that might be applied.


This year marks the celebration of 500 years since Martin Luther famously proposed his 95 religious theses, thus presaging the Protestant Reformation of the Western Christian Church. In line with this, it is worth discussing a subfield of textual criticism and stemmatics deeply influenced by the Reformation: Biblical criticism. While the importance of written texts to Christianity begins at least in the 2nd century, the theological doctrine of sola scriptura (“by scripture alone”, regarding the infallible and final authority in all matters), along with translation work and individual study of the Bible, paved the way, sometimes unwillingly, to scientific approaches to Biblical criticism equivalent to those applied to secular literature.

The seminal figure in textual criticism of the New Testament was Hermann Reimarus (1694-1768), apparently the first to apply the methodology of literary texts to religious ones. As in the case of literary criticism, it is hardly a coincidence that Biblical criticism developed in the same cultural framework that would support and promote the idea of biological evolution and the tools for establishing genealogical trees and networks. This is especially so when considering the secularization of that society, in which proving the human origin and transmission of sacred texts was deemed an important act of civic freedom. Along with this came the parallel radicalization of some religious positions, such as the denouncement of scientific studies of religious texts as heresy (a position nowadays rejected by most Christian denominations, which affirm the imperative of serious research on the sacred texts).

A concrete problem: the synoptic gospels

The most important problem in the textual criticism of the New Testament is the “synoptic gospels” problem, involving the three Gospels of Mark, Matthew, and Luke. These gospels have strikingly similar narratives that relate many of the same stories, with similar or identical wording. Like the other canonical gospel, John, these texts were composed around the last quarter of the first century by literate Greek-speaking Christians, only becoming canonical at least a century after their composition.

The synoptic gospels differ from similar sources, such as the non-canonical Gospel of Thomas, in being biographies with a clear religious motivation, and not just a collection of sayings. When compared to the Gospel of John, the three synoptic gospels are distinct in apparently being written by and for a Jewish community that was not on the verge of breaking from the Jewish synagogue, also favoring short and simple sentences.

However, the most important proof of their genealogical relationship is the text itself. The table below shows the reconstructed Greek original of each gospel for the episode of Jesus’ recruitment of a tax collector (an episode missing from the non-synoptic Gospel of John). The text in blue is the material shared by any two of the gospels, and the text in red is common to all three of them. [This is adapted from Smith (2017); on Wikipedia there is a further example, referring to the episode of the cleansing of a leper, see https://en.wikipedia.org/wiki/Synoptic_Gospels#Example.]

Matthew 9,9: Καὶ παράγων ὁ Ἰησοῦς ἐκεῖθεν εἶδεν ἄνθρωπον καθήμενον ἐπὶ τὸ τελώνιον, Μαθθαῖον λεγόμενον, καὶ λέγει αὐτῷ· Ἀκολούθει μοι. καὶ ἀναστὰς ἠκολούθησεν αὐτῷ.

Mark 2,13-14: Καὶ ἐξῆλθεν πάλιν παρὰ τὴν θάλασσαν· καὶ πᾶς ὁ ὄχλος ἤρχετο πρὸς αὐτόν, καὶ ἐδίδασκεν αὐτούς. καὶ παράγων εἶδεν Λευὶν τὸν τοῦ Ἁλφαίου καθήμενον ἐπὶ τὸ τελώνιον, καὶ λέγει αὐτῷ· Ἀκολούθει μοι. καὶ ἀναστὰς ἠκολούθησεν αὐτῷ.

Luke 5,27-28: Καὶ μετὰ ταῦτα ἐξῆλθεν καὶ ἐθεάσατο τελώνην ὀνόματι Λευὶν καθήμενον ἐπὶ τὸ τελώνιον, καὶ εἶπεν αὐτῷ· Ἀκολούθει μοι. καὶ καταλιπὼν πάντα ἀναστὰς ἠκολούθει αὐτῷ.
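
The overlap in these three passages can also be approximated programmatically. The sketch below treats each passage as a set of lower-cased word forms and reports which forms occur in all three gospels and which occur in exactly two; it is a simplification that ignores word order and inflection, and the helper names are my own.

```python
import unicodedata
from itertools import combinations

# The three passages quoted above.
PASSAGES = {
    "Matthew": "Καὶ παράγων ὁ Ἰησοῦς ἐκεῖθεν εἶδεν ἄνθρωπον καθήμενον ἐπὶ τὸ τελώνιον, Μαθθαῖον λεγόμενον, καὶ λέγει αὐτῷ· Ἀκολούθει μοι. καὶ ἀναστὰς ἠκολούθησεν αὐτῷ.",
    "Mark": "Καὶ ἐξῆλθεν πάλιν παρὰ τὴν θάλασσαν· καὶ πᾶς ὁ ὄχλος ἤρχετο πρὸς αὐτόν, καὶ ἐδίδασκεν αὐτούς. καὶ παράγων εἶδεν Λευὶν τὸν τοῦ Ἁλφαίου καθήμενον ἐπὶ τὸ τελώνιον, καὶ λέγει αὐτῷ· Ἀκολούθει μοι. καὶ ἀναστὰς ἠκολούθησεν αὐτῷ.",
    "Luke": "Καὶ μετὰ ταῦτα ἐξῆλθεν καὶ ἐθεάσατο τελώνην ὀνόματι Λευὶν καθήμενον ἐπὶ τὸ τελώνιον, καὶ εἶπεν αὐτῷ· Ἀκολούθει μοι. καὶ καταλιπὼν πάντα ἀναστὰς ἠκολούθει αὐτῷ.",
}


def normalize_token(word: str) -> str:
    """NFC-normalize, strip surrounding punctuation, and lower-case a word."""
    word = unicodedata.normalize("NFC", word).strip(",.;\u00b7")
    return word.lower()


def tokens(text: str) -> set:
    """Set of normalized word forms in a passage."""
    return {normalize_token(w) for w in text.split()} - {""}


TOKEN_SETS = {gospel: tokens(text) for gospel, text in PASSAGES.items()}

# Word forms common to all three passages.
ALL_THREE = set.intersection(*TOKEN_SETS.values())

# Word forms shared by exactly two passages.
PAIR_ONLY = {
    (a, b): (TOKEN_SETS[a] & TOKEN_SETS[b]) - ALL_THREE
    for a, b in combinations(TOKEN_SETS, 2)
}

print("All three:", sorted(ALL_THREE))
for pair, shared in PAIR_ONLY.items():
    print(pair, sorted(shared))
```

Because it compares exact word forms, this sketch treats ἠκολούθησεν and ἠκολούθει as different items; a lemmatized comparison would count them as the same verb.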

The relationships between the gospels, such as the so-called “triple tradition”, are summarized in the graph below, from the Wikipedia article on the synoptic gospels. Mark, the shortest text, has almost no unique material (only 3%, in part superfluous adjectives and Aramaic translations) and is almost entirely (94%) reproduced in Matthew. Matthew and Luke have their share of unique material (20% and 35%, respectively), which suggests independence, except for a "double tradition" of common material of about a quarter of the contents of each one, including notable passages such as the “Sermon on the Mount”. The parallelisms of these two gospels are found not only in their contents, but also in their arrangement, with most episodes described in the same order and, in case of displacements, with blocks of episodes moved together while preserving their internal order.

Previous studies

Such similarities were already noted in the first centuries of Christianity. This raises typical genealogical questions regarding topics such as priority (which gospel was written first) and dependence (which gospel was used as a source).

As for the first question, due to textual and theological evidence, a well-established majority of commentators favors the hypothesis of Marcan priority, that is, that the gospel of Mark is the oldest and that both Matthew and Luke used it as a source. As for the second question, a major point of dispute is the double tradition of Matthew and Luke, which can only be properly explained in terms either of descent or of a common ancestor. The two leading hypotheses are the one of a lost gospel (referred to as “Q”, after the German Quelle [“source”]), and the one by Austin Farrer, according to whom Matthew used Mark as its source and Luke then used both of them. But these are not the only hypotheses that have been proposed, as shown in the next set of diagrams (also from the Wikipedia article above).

[Diagrams of four proposed solutions: Augustinian Theory, Q Hypothesis, Farrer Theory, Jerusalem School Hypothesis]

The first fully developed theory was actually proposed by Augustine of Hippo back in the 5th century, which is essentially the one by Farrer, but with Matthew in place of Mark (i.e., supporting a Matthean priority). Given Augustine’s authority as a “Father of the Church”, his view was not disputed until the late 18th century, when Johann Jakob Griesbach published a synopsis of the three gospels and developed a new hypothesis, swapping Mark and Luke in the dominant explanation. Griesbach’s scientific approach led to the first application to Biblical problems of textual criticism, then in development in the German towns of Jena and Leipzig where he lived.

In 1838, Christian Weisse proposed the “Q” Hypothesis, mentioned above, asserting that Matthew and Luke were produced independently, both using Mark plus a lost source. This source was described as a lost collection of sayings of Jesus, along with feeble indirect evidence of its existence. This hypothesis was further developed by Burnett Streeter in 1924, with the proposal of “proto-versions” of both Mark and Luke — the wording of the canonical versions we have today would then be the product of later revisions, influenced by all of the texts.

During the past fifty years, due to advances in textual criticism and new manuscript analyses, the independence of Luke in relation to Matthew has been questioned, with diminishing support for the Q Hypothesis. Farrer’s hypothesis now holds a leading position, along with alternative trees such as the one by the Jerusalem School, according to which a lost Greek anthology “A” (postulated as the translation of a collection of sayings either in Hebrew or in Aramaic) was directly or indirectly used by all gospels, including John.
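
The competing hypotheses described in this section can be written down as small dependency graphs, which makes their differences easy to query. The following is a minimal sketch based only on the descriptions above; the dictionary encoding and function names are assumptions for illustration, and the Jerusalem School proposal is left out because its full structure is not spelled out here.

```python
# Each hypothesis maps a document to the list of documents it used as sources.
HYPOTHESES = {
    "Augustinian": {"Matthew": [], "Mark": ["Matthew"], "Luke": ["Matthew", "Mark"]},
    "Griesbach": {"Matthew": [], "Luke": ["Matthew"], "Mark": ["Matthew", "Luke"]},
    "Q": {"Mark": [], "Q": [], "Matthew": ["Mark", "Q"], "Luke": ["Mark", "Q"]},
    "Farrer": {"Mark": [], "Matthew": ["Mark"], "Luke": ["Mark", "Matthew"]},
}


def earliest(name: str) -> set:
    """Documents that have no written source under the given hypothesis."""
    return {doc for doc, sources in HYPOTHESES[name].items() if not sources}


def depends_on(name: str, doc: str, source: str) -> bool:
    """True if `doc` draws on `source`, directly or through another document."""
    graph = HYPOTHESES[name]
    stack, seen = list(graph[doc]), set()
    while stack:
        current = stack.pop()
        if current == source:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return False


for name in HYPOTHESES:
    print(name, "| earliest:", sorted(earliest(name)),
          "| Luke depends on Matthew:", depends_on(name, "Luke", "Matthew"))
```

Under this encoding, the Q Hypothesis is the only one of the four in which Luke does not depend on Matthew, which is exactly the point questioned in the more recent literature.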


Considering the analogies between literary and genetic texts that we have already discussed on this blog, it is clear that this topic should be an interesting anecdote to share around phylogenetic water-coolers. The four texts can be divided into two “families” of gospels, the synoptic (taxa: Matthew, Mark, and Luke) and non-synoptic (taxon: John). Their similarities suggest a distant common ancestor, probably oral traditions, as reported by Christian writers of the first and second centuries such as Papias.

The relationship between the taxa of the first family, however, is far from clear, as their relative dates cannot be determined with confidence. We might be faced with processes that, by analogy with biology, can be explained as gene pool recombinations and horizontal gene transfers – even though the most likely explanation is the one of direct descent, possibly from unknown taxa.

In literary terms, we must also consider features such as Matthew clearly being written by someone highly familiar with aspects of Jewish law, possibly asserting the Jewish component of the preaching while perceiving a universal tendency for the new faith. We must also consider the fact that Mark provides no ancestral lineage for Jesus, while Matthew traces him from a line of kings and Luke from a line of commoners — clearly stating the theological point of view of each gospel. Other aspects are worth consideration, such as the fact that what we today identify as the Gospel of Luke is likely to have been the first part of a once single document that included what is now the book of the “Acts of the Apostles”.

While I must admit that my research has been limited to some googling of keywords, it is curious that a topic that has attracted so much attention for millennia, from serious academic scholarship to conspiracy theories, and from impressionistic reviews to advanced statistical modeling, does not seem to have been covered by phylogenetic analyses so far. Given the range of data and literature, it should actually look like a prime candidate for such application, even from an outsider's point of view. This viewpoint is in fact discussed in a review by Christian P. Robert of a book called The Synoptic Problem and Statistics by Andris Abakuks:
The book by Abakuks goes […] through several modelling directions, from logistic regression using variable length Markov chains [to predict agreement between two of the three texts by regressing on earlier agreement] to hidden Markov models [representing, e.g., Matthew’s use of Mark], to various independence tests on contingency tables, sometimes bringing into the model an extra source denoted by Q. Including some R code for hidden Markov models. Once again, from my outsider viewpoint, this fragmented approach to the problem sounds problematic and inconclusive. And rather verbose in extensive discussions of descriptive statistics. Not that I was expecting a sudden Monty Python-like ray of light and booming voice to disclose the truth! Or that I crave for more p-values (some may be found hiding within the book). But I still wonder about the phylogeny… Especially since phylogenies are used in text authentication as pointed out to me by Robin Ryder for Chauncer’s [sic] Canterbury Tales.
We can certainly list among the reasons for such omission the diffidence of the textual community towards phylogenetic methods, especially when performed by people from outside the field; but the potential reception problems for texts of enormous religious significance cannot be ruled out. However, one of the reasons might be far more trivial: the fact that, just as in the case of historical linguistics, we don’t have digital structured databases of the trove of data about this topic. Most of the literature is not even properly digital, at best with scanned PDFs. Furthermore, the data are usually far from perfect for such usage, as in the case of the synopsis by Smith (2017), which looks more like a typed table than a true database.


In a future post, I will explore the problems of the synoptic gospels from a phylogenetic point of view, also releasing a minimal dataset. Until then, those interested in the topic can find a lot of discussion on a mailing list devoted to the scholarly study of the synoptic gospels, Synoptic-L.


Abakuks, Andris (2014) The Synoptic Problem and Statistics. London: Chapman and Hall / CRC.

Goodacre, Mark (2001) The Synoptic Problem: a Way Through the Maze. New York: T & T Clark International. (available on Archive.org)

Robert, Christian P (2015) The synoptic problem and statistics [book review]. https://xianblog.wordpress.com/2015/03/20/the-synoptic-problem-and-statistics-book-review/

Orchard, Bernard; Longstaff, Thomas RW (1979) J.J. Griesbach: Synoptic and Text - Critical Studies 1776-1976. Cambridge: Cambridge University Press.

Smith, Mahlon H (2017) A Synoptic Gospels Primer. http://virtualreligion.net/primer/

Tuesday, November 14, 2017

Power laws and cryptocurrencies

The Power Law is used to describe phenomena where large occurrences are rare but small ones are quite common. For example, there are few billionaires while most people make only a modest income; there are few large cities but many small towns; there are few very frequent words but many rare words.

Mathematically, Power Laws are of interest because of what is known as "scale invariance", as well as the fact that there is no well-defined average value. Furthermore, Power Laws are considered to be universal — you can read about this in Wikipedia. One of the more obvious places that we might expect to find them is in the exchange rates of currencies (their "worth") — there will be a few of great worth (the "major currencies") and lots of lesser worth.
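
The two mathematical properties mentioned above can be written out briefly. The symbols below (the density p, the exponent α, the lower cut-off x_min and the constant C) are notation assumed here for illustration and do not come from the sources cited in this post. The sketch shows that rescaling x only multiplies the density by a constant, and that the average is finite only when the exponent is larger than 2.

```latex
% Continuous power law with exponent \alpha > 1 and lower cut-off x_{\min} > 0
p(x) = C\,x^{-\alpha}, \qquad x \ge x_{\min}, \qquad
C = (\alpha - 1)\,x_{\min}^{\alpha - 1}

% Scale invariance: rescaling x by a factor c leaves the shape unchanged
p(cx) = C\,(cx)^{-\alpha} = c^{-\alpha}\,p(x)

% The mean exists only for \alpha > 2; for 1 < \alpha \le 2 the integral diverges
\mathbb{E}[x] = \int_{x_{\min}}^{\infty} x\,C\,x^{-\alpha}\,dx
             = \frac{\alpha - 1}{\alpha - 2}\,x_{\min}
             \qquad (\alpha > 2)
```

So the statement that there is no well-defined average applies to power laws with a small exponent, where rare but very large values dominate the sum.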

For example, I recently read the headline: Bitcoin isn't "too expensive", says BTCC boss Bobby Lee. He was defending the price of the digital currency Bitcoin, which has increased in value more than 600 percent this year, claiming that this is not evidence of a financial bubble, but instead is evidence that the currency is proving its utility in the digital world. Obviously, I cannot let this claim pass without turning a quantitative eye upon it.


Bitcoin is the original cryptocurrency, established in 2009, just after the financial crash of that time. It is a digital currency, which by design has no central bank or regulatory authority supporting it. The coins don’t exist in a tangible form, but instead exist solely in a digital "wallet". Nevertheless, they can still be exchanged and used in transactions, just as with any fiat currency.

Bitcoin is based on a technology now referred to as the blockchain, which seriously has the potential to redefine future economic and legal transactions. Indeed, it is the blockchain idea that has proven to be of interest to financial and legal institutions, not the currency itself (which is just an example of using the blockchain). Blockchain is a distributed digital database, where every transaction is broadcast over the net and stored publicly, making it immutable as well as transparent. Compared to traditional financial and legal systems, this provides increased security, higher efficiency, greater error resistance, and reduced transaction costs. You can read about it in The ultimate 3500-word guide in plain English to understand Blockchain.

Bitcoin was launched for around $US0.005 (ie. half a cent). It was pretty much ignored for 4 years, but it has increased greatly in popularity over the past 4 years. Its exchange rate first exploded to a peak in late 2013, followed by a slow decline of nearly 90% (associated with the collapse of the Mt Gox digital currency exchange). It has achieved near-manic popularity in the past year, as shown in the first graph.

From CoinGecko
Bitcoin exchange rate with the US dollar

So, we now have headlines like this: Bitcoin just surged over $4000 and is near biggest financial crash in 400 years. The reference is to what is known as Tulip mania, in the Netherlands in 1636-1637, where the tulip bulb prices quickly went from 1 guilder to 60, exploded to 1,000 or more, and then crashed. This is the context within which Bobby Lee made his claim (quoted above) that the current Bitcoin price is not too high.

The important point for our purposes here is that Bitcoin has spawned a host of imitators. So, there are now, or have been, more than 1,000 cryptocurrencies in existence. Many of them are intended as genuine digital currencies, each one addressing one or more of the perceived limitations of the original Bitcoin (such as its inability to scale up to a large number of transactions, or to process transactions faster). Indeed, we may see Bitcoin as a proof of concept and/or pilot study for digital currencies.

Most of the so-called altcoins, however, are not intended as general-use currencies at all. Instead, they form a totally new mode of fundraising for start-up companies, which now sell custom cryptocurrencies in order to raise investment. That is, instead of issuing shares as an IPO (initial public offer) they have an ICO (initial coin offer), thus bypassing the traditional venture capital processes. There is a whole new world of digital finance emerging (see Cryptocurrency mania fuels hype and fear at venture firms).


In order to assess the comparative price of Bitcoin to the altcoins, I need the exchange rate of the current crop of cryptocurrencies. I took the CoinGecko rates at 14:25 UTC on 11 November 2017 (they change by the minute!). There were 735 coins listed, of which I took the top 100 exchange rates in US dollars. I then ignored the data for the Bit20 coin, which is actually related to an index fund, and thus has a price that is unrelated to the other currencies.

The next graph shows the currencies listed in the rank order of their value. This should illustrate a special case of the Power Law that is known as Zipf's Law, which refers to the "size" of each event relative to its rank order of size. The standard way to evaluate the Zipf pattern is to plot the data with both axes of the graph converted to logarithms, under which circumstances the data should form a straight line.
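
For readers who want to try the log-log check themselves, here is a minimal sketch in Python. It assumes that the straight line is fitted by ordinary least squares on base-10 logarithms, which the description above does not specify, and it uses made-up prices rather than the CoinGecko values. The function name `fit_zipf` is purely illustrative.

```python
import math

def fit_zipf(values):
    """Least-squares fit of log10(value) against log10(rank).

    Returns (slope, intercept, r_squared). Under Zipf's Law the slope is
    negative and r_squared is close to 1.
    """
    ranked = sorted(values, reverse=True)          # rank 1 = largest value
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(v) for v in ranked]           # requires all values > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Made-up prices that follow value = 5000 / rank**1.5 exactly (not real data).
toy_prices = [5000.0 / rank ** 1.5 for rank in range(1, 100)]
slope, intercept, r_squared = fit_zipf(toy_prices)
```

With real exchange rates, a coin whose log-price lies well above the fitted line would be a candidate for being over-priced relative to the others, and one well below the line a candidate for being under-priced.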

As you can see, the exchange rates do fit Zipf's Law very well. In particular, Bitcoin, which is the #1 ranked coin, is not over-priced relative to the other coins. Note that this does not address the question as to whether all of the coins are over-priced or not. That would be a separate question, about the intrinsic value of cryptocurrencies.

Note that the top 25 ranked coins do not fit the Power Law as well as do the remaining 75 coins. So, we might also look at these top coins separately. This is shown in the next graph.

These 25 coins also fit Zipf's Law very well, but the power exponent is clearly smaller than for the remaining coins. In this case, Bitcoin fits the Power Law even better than before. Like it or not, relative to the other coins, Bitcoin is, indeed, not "too expensive".

Very few of the coins appear to be over-priced (ie. far above the line), but a few of them might be considered under-priced (ie. far below the line). In particular, the #4 ranked coin is the SegWit2x [Futures]. This coin represents a controversial suggestion to split off from Bitcoin. It has not received a great deal of support from the Bitcoin community, and the proposed split was officially suspended only a few days ago. Whether it will go ahead eventually is unclear. The #5 ranked coin is Dash, which is often touted as a currency much more like cash, in the sense that the users can remain almost completely anonymous (which is actually a bit tricky with Bitcoin).

In the world of currency exchange, the big three pieces of information about each currency are (i) the Price of each coin, (ii) the Market Capitalization, which is the total coin supply multiplied by the coin price, and (iii) the Liquidity, which refers to how easy it is to buy and sell coins without causing a change in their price (it is used to measure the market share, market maturity and market acceptance). We could summarize this information for each coin by using a phylogenetic network.

So, I took the information as supplied by CoinGecko (see above) in US dollars, and log-transformed the numbers (economic worth is usually considered to be log-normally distributed). I then calculated the Manhattan distances pairwise between the currencies, and plotted this using a NeighborNet graph, as shown in the final figure. The 10 top-price currencies have their full name shown, while the remainder are labeled with their exchange abbreviation. As usual, coins that have similar financial characteristics are near each other in the network; and the further apart the coins are in the network then the more different are their characteristics.
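
The distance step can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used for the figure: base-10 logarithms are assumed, the three toy coins and their numbers are invented, and every value must be strictly positive (how coins with zero Market Capitalization were handled is not described above). The function name `manhattan_log_distances` is hypothetical.

```python
import math
from itertools import combinations

def manhattan_log_distances(coins):
    """Pairwise Manhattan distances between log10-transformed features.

    coins maps a coin name to (price, market_cap, liquidity), all > 0.
    Returns a dict keyed by alphabetically ordered name pairs.
    """
    logged = {}
    for name, features in coins.items():
        if any(value <= 0 for value in features):
            raise ValueError(f"{name}: log transform needs positive values")
        logged[name] = [math.log10(value) for value in features]

    distances = {}
    for a, b in combinations(sorted(logged), 2):
        # Manhattan distance: sum of absolute differences per feature
        distances[(a, b)] = sum(abs(x - y) for x, y in zip(logged[a], logged[b]))
    return distances

# Invented values for three placeholder coins (not CoinGecko data).
toy_coins = {
    "AAA": (10.0, 1000.0, 100.0),
    "BBB": (100.0, 1000.0, 10.0),
    "CCC": (10.0, 1000.0, 100.0),
}
toy_distances = manhattan_log_distances(toy_coins)
```

The resulting distance matrix is the kind of input that a NeighborNet implementation, for example the one in SplitsTree, takes to draw the network.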

There are basically four neighborhoods in the graph, representing four different types of coins. Those coins at the top-right of the network all have a high Price, Capitalization and Liquidity. These are the coins that currently dominate the market. Moving leftwards from there in the graph, the Price, Capitalization and Liquidity all decrease, so that the coins in the middle of the network have low values of all three criteria. The coins at the top-left of the network have a relatively high Price but still have a low Capitalization and Liquidity. Those coins isolated at the bottom of the network currently have no Market Capitalization at all, even though they are available for trading and thus have a Price (this includes the SegWit2x Futures).


So, should you invest your hard-earned savings in cryptocurrencies? Plenty of people are doing so. For example, Coinbase, the largest cryptocurrency exchange in the USA, reportedly now has 12 million customers.

The general consensus seems to be "yes" to investment only if you like a bit of a gamble, because you may win big, but otherwise the answer is currently "no". The attributes that currently make cryptocurrencies such a speculative investment, such as their big price swings, their volatility and unpredictability, and their potentially lucrative payoffs, actually make them pretty useless as currencies. If you are looking for a long-term investment, then you probably need to find an altcoin that is either useful as a transaction medium, or provides an innovative application of the blockchain technology.

Tuesday, November 7, 2017

PhyloNetworks: a package for phylogenetic networks

Recently, another computer package was released that is of relevance to this blog. This is described in a forthcoming paper:
Claudia Solís-Lemus, Paul Bastide, Cécile Ané (2017) PhyloNetworks: a package for phylogenetic networks. Molecular Biology and Evolution (in press) 12: 3292-3298.
The authors describe the package this way:
PhyloNetworks is a Julia package for the inference, manipulation, visualization and use of phylogenetic networks in an interactive environment. Inference of phylogenetic networks is done with maximum pseudolikelihood from gene trees or multi-locus sequences (SNaQ), with possible bootstrap analysis. PhyloNetworks is the first software providing tools to summarize a set of networks (from a bootstrap or posterior sample) with measures of tree edge support, hybrid edge support, and hybrid node support. Networks can be used for phylogenetic comparative analysis of continuous traits, to estimate ancestral states or do a phylogenetic regression.

The SNaQ analysis is described in a previous paper:
Solís-Lemus C, Ané C (2016) Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLOS Genetics 12: e1005896.
The phylogenetic model used incorporates: mutations (as usual), incomplete lineage sorting of alleles in ancestral populations (using the coalescent), and horizontal inheritance of genes (ie. reticulations in the network). The likelihood is decomposed into quartets, which makes the likelihood calculations relatively fast, and also allows the analyses to be scaled up to many species and many genes.
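
To get a feel for why a quartet decomposition scales well, the sketch below simply enumerates the four-taxon subsets of a small taxon set, together with the three possible unrooted groupings of each subset. It is written in Python rather than Julia, it is not the SNaQ algorithm or any part of the PhyloNetworks API, and the taxon names and the function name `quartets` are placeholders.

```python
from itertools import combinations
from math import comb

def quartets(taxa):
    """Yield every 4-taxon subset with its three possible unrooted splits."""
    for a, b, c, d in combinations(sorted(taxa), 4):
        splits = [
            ((a, b), (c, d)),
            ((a, c), (b, d)),
            ((a, d), (b, c)),
        ]
        yield (a, b, c, d), splits

taxa = ["A", "B", "C", "D", "E", "F"]   # placeholder taxon names
all_quartets = list(quartets(taxa))
n_quartets = len(all_quartets)          # equals comb(len(taxa), 4)
```

The number of quartets grows only polynomially with the number of taxa (n choose 4), whereas the number of possible full topologies grows far faster, which is one reason why quartet-based calculations remain feasible for larger datasets.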

The PhyloNetworks software is open source, and is available with documentation at:
Have fun learning to use the Julia system, which I had never even heard of before investigating this new package!

Note: In spite of the similarity in name, this new package has nothing to do with Luay Nakhleh's PhyloNet package, nor with the Phylogenetic Networks blog.

Tuesday, October 31, 2017

"Man gave names to all those animals": cats and dogs

This is a joint post by Guido Grimm and Johann-Mattis List.

As specialists, we rarely dare to dive into cross-disciplinary research. However, in a small series of posts, we will now try to open a door between linguistics, phylogenetics, biogeography, and molecular genetics (with its various subdisciplines), using the curious cases of domestic animals, such as cat, dog, goat, and sheep, and what these are called in various Eurasian languages, with a special focus on Indo-European languages.

Today's post will introduce the little dataset that we have created, and discuss the findings for the names of cats and dogs. A follow-up post will be devoted to goats and sheep.

Domesticated animals and their names

Various types of archaeological and biological research revolve around the domestication of animals — GoogleScholar gives tens of thousands of hits for search items such as "cat domestication"; and we have several blog posts about the need for networks to illustrate the genealogy of domestication. However, linguistic literature on these topics is rather sparse, often related to specific language families, such as domesticated animals in the Indo-European proto-society (Anthony and Ringe 2015).

Nevertheless, many studies mention the potential value of linguistic evidence as some specific kind of indirect evidence, which should be considered when carrying out research on domestication (see, for example, Kraft et al. 2014). Furthermore, the public interest in domestic animals such as cat, dog, goat and sheep, is reflected by the number of languages in which Wikipedia articles are available: the domestic dog (219 entries), our most trusted companion animal, narrowly beats the cat (211 entries), our least-productive domestic animal but, according to cliché, an obligatory accessory for e.g. literates, thinkers, and little old ladies (entry counts include extinct ones like Gothic). Sheep are available for 166 languages, and goats for 142.

One doesn't have to travel far to recognize substantial differences between the four animal names. For example, when Guido moved to Sweden, the most confusing thing was "Fåret Shaun", which he knew as "Shaun, das Schaf" in German, or "Shaun, the sheep" in English. [As an aside, Shaun's name is a pun in English, but not in German or Swedish.] While Swedish and German / English differ greatly in the pronunciation of the words they use to denote "sheep", the Swedish words for "cat" (Swedish katt, German Katze), "dog" (hund vs. Hund), and "goat" (get vs. Geiß) are essentially the same (using Guido's dialect of German). They also are basically the same for many other essential items, such as "house" (hus vs. Haus), and "hand" (hand vs. Hand).

Since Guido moved to France, he has been watching "Shaun le mouton"; and Hund ("dog") has become chien. He now needs to look for chèvre ("goat") when choosing his cheeses; but his cats are called chats, which is similar in writing (and linguistic history) but phonetically rather different, as the word is pronounced as [ʃa] (sha).

When Mattis visited China, he had few problems memorizing the word for "cat", as the Chinese word māo is quite similar to the sound which cats are alleged to make in many languages (see the list on Wikipedia for cross-linguistic similarities of onomatopoeia). The words for "sheep" and "goat", on the other hand, were surprisingly the same, the former being called miányáng, which roughly translates as "soft sheep/goat", while the latter is called shānyáng which translates to "mountain sheep/goat".

Differences in animal naming

We were intrigued by these differences and similarities of animal names across different languages. So, we decided to investigate this further, by comparing pronunciation differences for "dog", "cat", "goat", and "sheep" across a larger sample of languages. For this purpose, we selected 28 different languages, and searched for the translations as they are given in the different Wikipedia articles. We then manually added the pronunciations, based on different sources, such as Wiktionary, our own knowledge of some of the languages, or specialized sources listing translations and transcriptions (Key and Comrie 2016; Huang et al. 1992).

We then used the overall pronunciation distances for all languages as proposed by Jäger (2015), who applied sophisticated alignment algorithms to a sample of 40 historically stable words per language for a large sample of North Eurasian languages (taken from the ASJP database). Since our sample contains languages which have never been shown to be historically related, the networks which we inferred from these distances should not be interpreted as true phylogenies, but rather as an aid for visualizing overall similarities among them.

To compare the pronunciation differences of our small datasets of animal names, we used the LingPy software (List and Forkel 2016, http://lingpy.org) to cluster the data into preliminary sets of phonetically similar words. As we lack the data to carry out deep inference of truly historical similarities, for this purpose we used the Sound-Class-Based Phonetic Alignment Algorithm (for details, see List et al. 2017). This algorithm compares words for shallow phonetic similarity with some degree of historical information. As a result, the inferred clusters do not (as we will see below) reflect true instances of cognacy (homology), but rather serve as a proxy for similarity of pronunciation.
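
The general idea of comparing words through sound classes can be sketched in a few lines of Python. This is a toy illustration only: the class table below is a hypothetical, heavily simplified one, not the sound-class model implemented in LingPy, the comparison is a plain normalized edit distance rather than LingPy's alignment scoring, and the spellings are used as rough stand-ins for pronunciations.

```python
# Hypothetical, simplified sound classes for illustration only;
# LingPy's SCA model uses a richer class inventory and scoring scheme.
SOUND_CLASSES = {
    **dict.fromkeys("aeiouy", "V"),   # vowels
    **dict.fromkeys("kgcq", "K"),     # velar-like stops
    **dict.fromkeys("td", "T"),       # dental stops
    **dict.fromkeys("pbfv", "P"),     # labials
    **dict.fromkeys("sz", "S"),       # sibilants
    **dict.fromkeys("rl", "R"),       # liquids
    "m": "M", "n": "N", "h": "H", "w": "J", "j": "J",
}

def to_classes(word):
    """Replace each character by its sound class ('?' if unknown)."""
    return "".join(SOUND_CLASSES.get(ch, "?") for ch in word.lower())

def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def class_distance(word1, word2):
    """Edit distance on sound-class strings, scaled to the range 0..1."""
    c1, c2 = to_classes(word1), to_classes(word2)
    return edit_distance(c1, c2) / max(len(c1), len(c2), 1)

# Spellings taken from this post: Swedish katt, Basque katu, Spanish gato, Swedish hund.
d_katu_gato = class_distance("katu", "gato")
d_katt_katu = class_distance("katt", "katu")
d_katt_hund = class_distance("katt", "hund")
```

In this toy table k and g fall into the same class, so katu and gato become identical class strings, while katt and hund differ in half of their positions. Grouping words whose distance falls below some threshold then gives clusters of phonetically similar words, without any claim that the grouped words are true cognates.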

Cats and Dogs

It is commonly assumed that the dog (Canis lupus familiaris, literally the 'domestic wolf-dog') was the first animal domesticated by humans, although it has not yet been settled exactly when and where. Multiple domestication events are quite likely, with respect to the (grey) wolves' (Canis lupus) natural behaviour (i.e. living in small family groups with complex social structure) and being originally distributed across Eurasia, although genetic studies have led to inconclusive results (compare the contradicting results in Frantz et al. 2016 and Botigue et al. 2017). Its trainability and pack-loyalty make the wolf an excellent hunting companion, and wolf packs migrate naturally over long distances, which perfectly fits early (pre-cultivation) human societies of hunters and gatherers. Accordingly, ages of up to 30,000 BC have been proposed for the dog's domestication (Botigue et al. 2017).

In contrast, the cat, Felis silvestris (literally the 'forest cat'), is a solitary, very elusive animal. It was domesticated much later, and most likely in the Near East (Driscoll et al. 2009). In contrast to other domestic animals, it has no direct use (other than luxury), and trains its owners rather than being trained (e.g., there are no police cats, and very rarely circus cats). But the cat decimates rodents and other small mammals, as well as birds. Thus, the domestication of cats likely followed the cultivation of wheat, and is possibly instrumental for building up fixed settlements and agricultural societies (Driscoll et al. 2009). Thus, George R.R. Martin's fictional character Haviland Tuf may be right when judging all human societies throughout the universe by how they treat cats: "civilized" people cherish them, "barbaric" societies don't!

Figure 1: Terms for cat in our sample

Thus, the hypothesis is that the dog was probably with us from the dawn of our civilization, while the cat opportunistically followed human settlements because these provide a surplus of food (and ultimately shelter). This idea is well reflected by the literal and phonetic variation of the words for "cat" (Figure 1) and "dog" (Figure 2). Cats are called by essentially the same names in all western Eurasian languages (be they Indo-European or not), but the word for dog can be phonetically very different in even closely related languages.

As you can see in the plot, the name for "cat" (English) is effectively similar across all of the Indo-European languages of western Eurasia in our sample, while the name for "dog" sounds quite different. Given that similar names for "cat" can be found in languages of northern Africa (Pfeifer 1993: s. v. "Katze"), this provides additional evidence for the Near-East domestication of the cat; and we can assume that the word traveled to Europe along with its carriers. On the other hand, the differences in the names for "dog" across all Indo-European languages in our sample reflect language change, rather than different naming practices. With the exception of Indic, Greek, and the Slavic languages, which coined new terms (cf. Derksen 2008: 431, and the cognate sets in IELex), the dog terms in Romance (with exception of Spanish), Germanic (with exception of English), Baltic and Armenian all evolved from the same root.

Figure 2: Terms for dog in our sample

With respect to the genetics of the dog (origin unclear) and the cat (origin in the Near East), plus the migration history of European people, the most likely hypothesis, which is also supported by Indo-European linguists, assumes that the dog was already with the humans before the Indo-European languages formed, following their migrations. Given the importance of the term, people may have avoided replacing it with a new term. This is also reflected in the cross-linguistic stability of the concept "dog", usually listed as one of the most stable concepts which are rarely replaced by neologisms ("dog" ranks at place 16 of Starostin's 2007 stability scale; "cat" is not even included).

With linguistic methods for language comparison, we can show that these words share a common origin, but stability does not imply that the pronunciation of the words is not affected. It is difficult to say how fast pronunciation evolves in general, but assuming that greater phonetic differences indicate a greater amount of elapsed time is a useful proxy. Since many Indo-European languages arrived in Europe by migration waves from the steppes of Central Asia, it is little surprise that each of these waves brought its modified variant of the original term for "dog" in Proto-Indo-European to Europe. Given the importance of the term for the daily lives of the people, speakers of one language variety would also not necessarily feel obliged to borrow the terms from neighboring language communities.

In Hebrew (not included in Figure 1), the word for cat is חתול khatúl. The Celtic Irish term is cat, and even the Basques, with their entirely unrelated language, have the word katu, probably a borrowing from the surrounding Romance languages (cf. Spanish gato). When the Germanic tribes (BC) and Slavs (AD) arrived on horseback, accompanied by their *hunda- (Kroonen 2013: 256), or their *pesə (Derksen 2008: 431), they settled down, started farming, and then took up the *kattōn- and the *kotə from the locals. This is interesting, because we have to assume (based on genetics and modern distribution of the wild subspecies of Felis silvestris) that there were always wild cats in the European woods. Either the word for them was lost in surviving languages, or the hunters and gatherers living in Europe never bothered to name a small furry animal that – at best – could be just glimpsed.

Notably, the South Asian Indo-European languages and the East Asian Sino-Tibetan languages have their own terms for cats (Figure 1), but the word is globally quite invariable in stark contrast to the terms for "dog".

Where does this lead?

Our graphs at this point indicate many curiosities. Nevertheless, by mapping words associated with animals (or plants), crucial for the history of human civilisation, we may tap into a completely new data set to discuss different scenarios erected by archaeologists and historians regarding domestication and beyond. While linguists, archaeologists, and geneticists have been working a lot on these questions on their own, examples for a rigorous collaboration, involving larger datasets and common research questions, are – to our current knowledge from sifting the literature – still rather rare. Furthermore, most linguistic accounts are anecdotal. They provide valuable insights, but these insights are not amenable to empirical investigations, as they are only reflected in prose. As a result, recent articles concentrating on archaeogenetic studies often ignore linguistic evidence completely. Given the uncertainty about the origin of domesticated animals and plants, despite advanced methods and techniques in archaeology and genetics, it seems that this strategy of simply putting linguistic evidence to one side deserves some re-evaluation.

It seems to be about time to pursue these questions in data-driven frameworks. When doing so, however, we need to be careful in the way we treat linguistic data as evidence. What we need is a thorough understanding of the processes underlying "naming" in language evolution. We constantly modify our lexicon, be it (i) by no longer using certain words, (ii) by using certain previously unfashionable words more frequently, (iii) by coining new words, or (iv) by borrowing words from our linguistic neighbors. So far, we still barely understand under which conditions societies will tend to keep a certain word against pressure from linguistic neighbors who use a different term, or when they will prefer to coin their own new words for newly introduced techniques, animals, or plants, instead of taking the words along with the technology.

Linguists can say a few things about this; and etymological dictionaries, some of which we also consulted for this study, offer a wealth of information for some terms. However, without formalizing our linguistic knowledge, providing standardization efforts (compare the Tsammalex or the Concepticon projects) and improvement of algorithms for automatic sequence comparison, linguists will have a hard time keeping pace with quickly evolving disciplines like archaeogenetics and archaeology.

  • Anthony, D. and D. Ringe (2015) The Indo-European homeland from linguistic and archaeological perspectives. Annual Review of Linguistics 1: 199-219.
  • Botigue, L., S. Song, A. Scheu, S. Gopalan, A. Pendleton, M. Oetjens, A. Taravella, T. Seregely, A. Zeeb-Lanz, R. Arbogast, D. Bobo, K. Daly, M. Unterlander, J. Burger, J. Kidd, and K. Veeramah (2017) Ancient European dog genomes reveal continuity since the Early Neolithic. Nature Communications 8: 16082.
  • Derksen, R. (2008) Etymological dictionary of the Slavic inherited lexicon. Brill: Leiden and Boston.
  • Driscoll, C., D. Macdonald, and S. O’Brien (2009) From wild animals to domestic pets, an evolutionary view of domestication. Proceedings of the National Academy of Sciences 106 Suppl 1: 9971-9978.
  • Frantz, L.A., V.E. Mullin, M. Pionnier-Capitan, O. Lebrasseur, M. Ollivier, A. Perri, A. Linderholm, V. Mattiangeli, M.D. Teasdale, E.A. Dimopoulos, A. Tresset, M. Duffraisse, F. McCormick, L. Bartosiewicz, E. Gal, É.A. Nyerges, M.V. Sablin, S. Bréhard, M. Mashkour, A. Bălăşescu, B. Gillet, S. Hughes, O. Chassaing, C. Hitte, J.-D. Vigne, K. Dobney, C. Hänni, D.G. Bradley, G. Larson (2016) Genomic and archaeological evidence suggest a dual origin of domestic dogs. Science 352: 1228-1231.
  • Huáng Bùfán 黃布凡 (1992) Zàngmiǎn yǔzú yǔyán cíhuì [A Tibeto-Burman lexicon]. Zhōngyāng Mínzú Dàxué 中央民族大学 [Central Institute of Minorities]: Běijīng 北京.
  • Jäger, G. (2015) Support for linguistic macrofamilies from weighted alignment. Proceedings of the National Academy of Sciences 112: 12752-12757.
  • Key, M. and B. Comrie (2016) The intercontinental dictionary series. Max Planck Institute for Evolutionary Anthropology: Leipzig.
  • Kraft, K., C. Brown, G. Nabhan, E. Luedeling, J. Luna Ruiz, G. Coppens d’Eeckenbrugge, R. Hijmans, and P. Gepts (2014) Multiple lines of evidence for the origin of domesticated chili pepper, Capsicum annuum, in Mexico. Proceedings of the National Academy of Sciences of the United States of America 111: 6165-6170.
  • Kroonen, G. (2013) Etymological dictionary of Proto-Germanic. Brill: Leiden and Boston.
  • List, J.-M. and R. Forkel (2016) LingPy. A Python library for historical linguistics. Max Planck Institute for the Science of Human History: Jena.
  • List, J.-M., S. Greenhill, and R. Gray (2017) The potential of automatic word comparison for historical linguistics. PLOS ONE 12: 1-18.
  • Pfeifer, W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.
  • Starostin, S. (2007) Opredelenije ustojčivosti bazisnoj leksiki [Determining the stability of basic words]. In: S. A. Starostin: Trudy po jazykoznaniju [S. A. Starostin: Works on linguistics]. Languages of Slavic Cultures: Moscow. 580-590.

Final Remark

Given that we had little time to review all of the literature on domestication in these disciplines, we may well have missed important aspects, and we may well have even failed to be original in our claims. We would like to encourage potential readers of this blog to provide us with additional hints and productive criticism. In case you know more about these topics than we have reported here, please get in touch with us — we will be glad to learn more.

Tuesday, October 24, 2017

Let's distinguish between Hennig and Cladistics

There are theoretically an infinite number of ways to mathematically analyze any set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. In this sense, the philosophy of phylogenetic analysis needs to show that there is a strong basis for treating any particular mathematical analysis as having biological relevance. This is a point that I have discussed before: Is there a philosophy of phylogenetic networks?

Willi Hennig clearly has some role to play here. However, his ideas are often treated as being solely related to one particular form of phylogenetic analysis — cladistics. In this post I will point out that his work has a much greater relevance than that — he provides a crucial logical step that applies to all phylogenetic inference.

The steps of phylogenetic inference are shown in the first figure, which is taken from my earlier post. The first step is a mathematical inference from character data to tree/network; the second step is a logical inference that the mathematical summary resulting from the first step has some biological relevance; and the third step is a practical inference that the biological summary applies to whole organisms as well as to their characters.

The logic of phylogeny reconstruction


Hennig's concept of "shared innovations" (which he called synapomorphies) is the only thing that allows us to use mathematical phylogenetics in the pursuit of genealogical history. Without this concept, the mathematics could just produce something like the arithmetic mean, a mathematical concept that need not correspond to any real object (unlike the mode, which always does, or the median, which usually does). The idea of shared innovations is what leads us to believe that the mathematical summary (whether tree or network) might actually also be a close approximation to the real thing. This is a separate concept from cladistics, which is simply a mathematical algorithm based on a particular optimality criterion (parsimony), just like maximum likelihood or Bayesian approaches. So, shared innovations underlie the use of parsimony, likelihood and distance methods alike: Willi Hennig (and, before him, Karl Brugmann in linguistics) is relevant no matter what algorithm we use.

Mathematical analyses

If they are to represent genealogical history, then all trees and networks in phylogenetics will be directed acyclic graphs (DAGs), mathematically. There are many ways to produce a DAG, some of which have had varying degrees of popularity in phylogenetics, and some of which have not been used at all.

To produce an acyclic line graph (in which nodes are connected by edges), we can start with character data or distance data. We can then use various optimality criteria to choose among the many graphs that could apply to the data, such as parsimony (usually associated with cladistics) and likelihood (either as maximum likelihood or integrated likelihood). We can also ensure that the graph is directed (i.e. the edges have arrows), by choosing a root location, either directly as part of the analysis or a posteriori by specifying an outgroup.
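As a minimal sketch of the last point, the following Python snippet roots an unrooted tree a posteriori by placing a root on the branch leading to an outgroup, and then directs every edge away from that root. The taxon names, the edge list and the function name root_with_outgroup are invented for illustration; they do not come from any phylogenetics package.

```python
from collections import defaultdict, deque

def root_with_outgroup(edges, outgroup, root_label="root"):
    """Direct an unrooted tree by placing a root on the outgroup's branch."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    if len(adjacency[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    (neighbour,) = adjacency[outgroup]

    # Split the outgroup branch by inserting a new root node on it.
    adjacency[outgroup].remove(neighbour)
    adjacency[neighbour].remove(outgroup)
    for node in (outgroup, neighbour):
        adjacency[root_label].add(node)
        adjacency[node].add(root_label)

    # Breadth-first walk from the root: each edge points from parent to child.
    directed = []
    seen = {root_label}
    queue = deque([root_label])
    while queue:
        parent = queue.popleft()
        for child in sorted(adjacency[parent]):
            if child not in seen:
                seen.add(child)
                directed.append((parent, child))
                queue.append(child)
    return directed

# Hypothetical unrooted tree: leaves A, B, C, D, Out; internal nodes x, y, z.
unrooted = [("x", "A"), ("x", "B"), ("x", "y"), ("y", "C"),
            ("y", "z"), ("z", "D"), ("z", "Out")]

directed_edges = root_with_outgroup(unrooted, "Out")
for parent, child in directed_edges:
    print(f"{parent} -> {child}")
```

Because every node other than the root ends up with exactly one incoming edge, the result is a directed acyclic graph. Whether that graph means anything biologically is a separate question.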

All of these approaches are mathematically valid, as are a number of others. They all provide a mathematical summary of the data. This is step one of the phylogenetic inference, as illustrated above.

But what of step two? Biologists need a summary of the data that has biological relevance, as well, not just mathematical relevance. This has long been a thorn in the side of biologists — just because they can perform a particular mathematical calculation does not automatically mean that the calculation is relevant to their biological goal.

Consider the simplest mathematics of all — calculating the central location of a set of data. There are many ways to do this, mathematically — indeed, there are technically an infinite number of ways. These include the mode, the median, the arithmetic mean, the geometric mean, and the harmonic mean. All of these are mathematically valid, but do any of them produce a central location that describes biology?

The mode does, because it is the most common observation in the dataset. The median usually does, because it is the "middle" observation in the dataset. But what of the various means? There is no necessary reason for them to describe biology, although they are perfectly valid mathematics.

For instance, the modal number of children in modern families is 2, meaning that more families have this number than any other number of children. The median number is also 2, meaning that half of the families have 2 or fewer children and half of the families have 2 or more. So, these mathematical summaries are also descriptions of real families. But the means are not. For example, the arithmetic mean number of children is 2.2, which does not describe any real family. If you ever find a family with 2.2 children, then you should probably call the police, to investigate!
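The same point can be checked numerically. The sketch below uses a made-up list of family sizes (not real census data), chosen so that the arithmetic mean is 2.2 as in the example above, and tests which summaries coincide with an observed family.

```python
from statistics import mean, median, mode

# Hypothetical numbers of children in ten families (invented for illustration).
children = [0, 1, 2, 2, 2, 2, 2, 3, 4, 4]

summaries = {
    "mode": mode(children),
    "median": median(children),
    "arithmetic mean": mean(children),
}

observed = set(children)
for name, value in summaries.items():
    status = "is" if value in observed else "is not"
    print(f"{name} = {value:.1f}, which {status} an observed family size")
```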

Mathematically valid data summaries have a lot of relevance, but they do not necessarily describe biological concepts. I can use the mean number of children per local family to estimate the number of schools that I might need in that area, but I cannot use it to describe the families themselves. This is a classic case of "horses for courses".


So, in phylogenetics we need some piece of logic that says that we can expect our DAG (a mathematical concept) to be a representation of a genealogy (a biological concept). Our genealogical estimate may still be wrong (and indeed it probably will be, in some way!), but that is a separate issue. The DAG needs to be a reasonable representation, not a correct one. Correctness needs to be a result of our data, not our mathematics.

This is where Willi Hennig comes in. Hennig's ideas, and the ideas derived from them, are illustrated in the second figure.

Hennig explicitly noted that characters have a genealogical polarity, with ancestral states being modified into derived states through evolutionary time. Furthermore, he noted that it is only the derived states that are of relevance to studying evolutionary history: the sharing of derived character states reveals evolutionary history, but shared ancestral states tell us nothing.

We have done two things with these Hennigian ideas. Some people have been interested in classification, for which the concept of monophyly is relevant, and others have been interested in reconstructing the genealogies, rather than simply interpreting them.


Reconstructing a tree-like phylogenetic history is conceptually straightforward, although it took a long time for someone (Hennig 1966) to explain the most appropriate approach. Interestingly, the study of historical linguistics has developed the same methodology (Platnick and Cameron 1977; Atkinson and Gray 2005), thus independently arriving at exactly the same solution to what is, in effect, exactly the same problem. From this point of view, the logical inference itself is uncontroversial; and its generic nature means that it can be used for any objects with characteristics that can be identified and measured, and that follow a history of descent with modification. I will, however, discuss this in terms of biology — you can make the leap to other objects yourself.

The objective is to infer the ancestors of the contemporary organisms, and the ancestors of those ancestors, etc., all the way back to the most recent common ancestor of the group of organisms being studied. Ancestors can be inferred because the organisms share unique characteristics (shared innovations, or shared derived character states). That is, they have features that they hold in common and that are not possessed by any other organisms. The simplest explanation for this observation is that the features are shared because they were inherited from an ancestor. The ancestor acquired a set of heritable (i.e. genetically controlled) characteristics, and passed those characteristics on to its offspring. We observe the offspring, note their shared characteristics, and thus infer the existence of the unobserved ancestor(s). If we collect a number of such observations, what we often find is that they form a set of nested groupings of the organisms.
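The nested pattern can be illustrated with a small sketch. The character matrix below is invented, and it assumes that the polarity of each character is already known (0 for the ancestral state, 1 for the derived state). The function names are hypothetical helpers, not part of any library.

```python
from itertools import combinations

# Hypothetical character matrix: 0 = ancestral state, 1 = derived state.
# Polarity is assumed to be known in advance (for example, from an outgroup).
matrix = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
    "D": [1, 0, 1, 0],
    "E": [0, 0, 0, 1],
}

def shared_innovation_groups(matrix):
    """Return the groups of taxa defined by a derived state held by two or more taxa."""
    taxa = sorted(matrix)
    n_chars = len(matrix[taxa[0]])
    groups = set()
    for i in range(n_chars):
        derived = frozenset(t for t in taxa if matrix[t][i] == 1)
        if len(derived) >= 2:  # a derived state in a single taxon is not shared
            groups.add(derived)
    return groups

def is_nested(groups):
    """True if every pair of groups is either nested or disjoint."""
    return all(a <= b or b <= a or a.isdisjoint(b)
               for a, b in combinations(groups, 2))

groups = shared_innovation_groups(matrix)
for group in sorted(groups, key=len, reverse=True):
    print(sorted(group))
print("nested:", is_nested(groups))
```

In this toy case the shared derived states define the groups A+B+C+D, A+B and C+D, which nest without conflict. Real data usually contain some conflicting characters, and resolving that conflict is where the optimality criteria mentioned earlier come in.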

Hennig, in particular, was interested in the interpretation of phylogenetic trees, rather than their reconstruction. He did this interpretation in terms of monophyletic groups (also called clades), each of which consists of an ancestor and all of its descendants. These are natural groups in terms of their evolutionary history, whereas other types of groups (eg. paraphyletic, polyphyletic) are not. So, a phylogenetic tree consists of a set of nested clades, which are the groups that are represented and given names in formal taxonomic schemes.
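The definition of a clade is simple enough to test mechanically. The following sketch assumes a rooted tree stored as a dictionary that maps each internal node to its children; the tree, the node names and the helper functions are all invented for illustration.

```python
# Hypothetical rooted tree: each internal node maps to a list of its children.
tree = {
    "root": ["Out", "n1"],
    "n1": ["n2", "n3"],
    "n2": ["A", "B"],
    "n3": ["C", "D"],
}

def leaves_below(tree, node):
    """Return the set of leaves descended from a node (a leaf returns itself)."""
    children = tree.get(node, [])
    if not children:
        return frozenset([node])
    return frozenset().union(*(leaves_below(tree, child) for child in children))

def is_monophyletic(tree, root, group):
    """True if some node's descendant leaves are exactly the given group."""
    target = frozenset(group)
    stack = [root]
    while stack:
        node = stack.pop()
        if leaves_below(tree, node) == target:
            return True
        stack.extend(tree.get(node, []))
    return False

print(is_monophyletic(tree, "root", {"A", "B"}))       # an ancestor and all of its descendants
print(is_monophyletic(tree, "root", {"A", "B", "C"}))  # leaves out D, so not a clade
print(is_monophyletic(tree, "root", {"A", "C"}))       # members drawn from two separate clades
```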

For phylogenetic trees, there is thus a rationale for treating a tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared innovations (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

Mis-interpretations of Hennig

What I have said above has led to various mis-interpretations of Hennig's role in phylogenetics.

First, he did not propose any specific method for producing a phylogenetic tree (or network). He was concerned about the logic of the diagram, not how to get it in the first place. He distinguished shared derived character states, or shared innovations (he called them synapomorphies), from shared ancestral states (symplesiomorphies), and noted that only the former are relevant for phylogenies. So, distance methods will also work in phylogenetics provided the distances are based on homologous apomorphic features. If they are not so based, then they are simply mathematical constructions, which may or may not represent anything to do with phylogeny. Distances estimated from plesiomorphic features can be used to construct a tree, obviously, but there is no reason to expect that tree to represent a phylogeny.
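A toy example may make the difference concrete. In the invented matrix below (0 = ancestral, 1 = derived, polarity assumed known), taxa A and B share one derived state, while A has also accumulated four unique changes of its own. Counting all matching states pairs B with C, on the strength of shared ancestral states alone; counting only shared derived states pairs A with B. The function names are hypothetical, and similarity scores are used here in place of distances purely for brevity.

```python
from itertools import combinations

# Hypothetical matrix: character 0 is derived in A and B; characters 1-4 are
# derived only in A. C retains the ancestral state everywhere.
matrix = {
    "A": [1, 1, 1, 1, 1],
    "B": [1, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0],
}

def overall_similarity(x, y):
    """Count every matching state, ancestral or derived."""
    return sum(a == b for a, b in zip(x, y))

def shared_derived(x, y):
    """Count only the characters in which both taxa show the derived state."""
    return sum(a == 1 and b == 1 for a, b in zip(x, y))

def closest_pair(matrix, score):
    """Return the pair of taxa with the highest score."""
    return max(combinations(sorted(matrix), 2),
               key=lambda pair: score(matrix[pair[0]], matrix[pair[1]]))

print("overall similarity groups:", closest_pair(matrix, overall_similarity))
print("shared innovations group: ", closest_pair(matrix, shared_derived))
```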

Second, parsimony analysis was developed independently of Hennig, by people such as Farris, Nelson and Platnick. This came to be called cladistics, intended by Ernst Mayr to be a derogatory term for the new form of analysis. The fact that the Willi Hennig Society is associated exclusively with cladistics has nothing to do with Hennig himself, or with the logic of his approach to phylogenetics. You need to clearly distinguish between Hennig and Cladistics!

Third, Hennig was more interested in classification than he was in phylogeny reconstruction. This seems to cause confusion for gene jockeys and linguists, in particular, who often associate phylogenetics solely with classification (see, for example, Felsenstein 2004, chapter 10). Sure, Hennig was primarily interested in the interpretation of phylogenies, rather than their construction. However, that was simply a personal point of view. The logic of his work transcends his own personal interests. Without him, no genealogical reconstruction makes logical sense, in genetics or linguistics. Mathematical methods for summarizing data were developed independently in genetics and linguistics, just as they were in other areas of biology and also in stemmatology. However, without the concept of shared innovations, these methods remain mathematical summaries, not estimates of genealogies.

Finally, Hennig's work was not original, being naturally a synthesis of much previous work. In biology, the work of Walter Zimmermann is frequently noted (eg. Donoghue & Kadereit 1992), and in linguistics the work of Karl Brugmann is obviously important (see Mattis' post Arguments from authority, and the Cladistic Ghost, in historical linguistics). Sometimes, wheels have to be re-invented many times before the general populace comes to realize just how important they are.


Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Donoghue MJ, Kadereit JW (1992) Walter Zimmermann and the growth of phylogenetic theory. Systematic Biology 41: 74-85.

Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Sunderland MA.

Hennig W (1966) Phylogenetic Systematics. University of Illinois Press, Urbana IL. [Translated by DD Davis and R Zangerl from W. Hennig 1950. Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.]

Platnick NI, Cameron HD (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380-385.