Showing posts with label Stemmatology. Show all posts

Tuesday, December 5, 2017

The Synoptic Gospels problem: preparing a phylogenetic approach


This is the second part of my series on phylogenetics and a specific case of textual criticism, the Biblical one. The first part appeared as Another test case for phylogenetics and textual criticism: the Bible, and covered the background to the textual problem — that post should be read first. Here, I provide a preliminary genealogical analysis of some specific data related to the problem.


The synoptic gospels and phylogenetics: how to code data?

Just like in the cases of general stemmatics and historical linguistics, our immediate problem for a phylogenetic approach to Biblical criticism is one of data. Upon investigation, the field proves itself desperately in need of an open access mentality — a great deal of work would be needed to turn the few aggregated data I could find into datasets that could feed the most basic analysis tools.

No open dataset proved either adequate or correct enough. They are mostly quotations or subjective developments of the scientific sources, available only in printed editions and in software for Biblical studies, sometimes at exorbitant prices, and frequently with licenses that explicitly prohibit extracting and reusing the data. This forced me to postpone an analysis of families of manuscripts, as unfortunately there is no complete free edition of the Novum Testamentum Graece (the reference work in the field, usually referred to as Nestle-Aland after its main editors).

However, I could explore the problem of the synoptic gospels in a way and with a dataset closer to the ones of the 19th century analyses, by sitting with a printed Bible and compiling my own synopsis of episodes. My work in this field ends with this second post, but it seems like a good approach to the development of a phylogenetic investigation, to start by reproducing the old analyses with new tools.

After some bibliographic review and inspection of the solutions presented to the problem, my understanding is that there would be three fundamental ways of coding for features of these texts.

(i)
The first and simplest is to compile a list of episodes, themes, and topics found in each gospel (a proper “synopsis”), without considering semantic differences or relative positions, coding for a truth table indicating whether each “event” (i.e. “character”) is found. For example, the imprisonment of John the Baptist is mentioned in the three synoptic gospels (Matthew 4,12; Mark 1,14; Luke 3,18-20) and would be coded as “present” in all of them, even though in Luke the relative order is different (it is narrated before the baptism of Jesus, in a flashforward). On the other hand, the priests conspiring against Jesus is only narrated in two gospels (Mark 11,18; Luke 19,47-48), and the “character” of the meek inheriting the Earth is only found in one of them (Matthew 5,5), as shown in the table below.

Character/Event | Matthew | Mark | Luke
Imprisonment of John | present | present | present
Priests conspiring | absent | present | present
Meek inheritance | present | absent | absent

This kind of census approach is what most descriptive statistics on the synoptic relationship consider when demonstrating how much there is in common among the gospels, including the graph reproduced back in the first part of this post. As in the case of the statistics of genetic material shared between species, like humans and other apes, caution is needed to understand what is actually meant — the percentages usually reported refer to episode coincidence (in a loose analogy, like the presence of a protein), not text coincidence (like the sequences of genetic bases). This is the reason why these analyses should equally consider “episode homology” and “episode analogy” — one must remember that all gospels as we have them evolved from initial versions, and to be missing an episode favored by the public or the clergy, which denounced other gospels now lost as “uninspired”, could have resulted in pressure to incorporate such episode.
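As a minimal sketch of this first kind of coding, the three example events can be stored as a presence/absence table and used to compute the sort of "episode coincidence" percentages mentioned above. The function name `shared_fraction` and the data layout are my own assumptions for illustration, not the format of the dataset used later in this post.

```python
# Toy presence/absence ("truth table") coding of the three example events.
# 1 = the episode is narrated in that gospel, 0 = it is not.
GOSPELS = ["Matthew", "Mark", "Luke"]

CHARACTERS = {
    "Imprisonment of John": {"Matthew": 1, "Mark": 1, "Luke": 1},
    "Priests conspiring":   {"Matthew": 0, "Mark": 1, "Luke": 1},
    "Meek inheritance":     {"Matthew": 1, "Mark": 0, "Luke": 0},
}


def shared_fraction(a, b, characters):
    """Fraction of the episodes found in gospel `a` that also occur in gospel `b`."""
    in_a = [name for name, row in characters.items() if row[a]]
    if not in_a:
        return 0.0
    in_both = [name for name in in_a if characters[name][b]]
    return len(in_both) / len(in_a)


for a in GOSPELS:
    for b in GOSPELS:
        if a != b:
            share = shared_fraction(a, b, CHARACTERS)
            print(f"{a} episodes also in {b}: {share:.2f}")
```

Note that the measure is asymmetric (the share of Mark found in Luke is not the share of Luke found in Mark), which is one reason the reported percentages need careful reading.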

(ii)
A deeper level of coding would be to map the text of episodes and events into “semantic” characters, ignoring textual differences (like synonyms) but coding for differences in intended meaning. For example, the event of Jesus being tested in the wilderness, while narrated in all three gospels (Matthew 4,1-2; Mark 1,12-13; Luke 4,1-2), is really only equivalent in Matthew and Luke, where he is tempted by "the Devil", while in Mark he is tempted by "Satan", which is a figure closer to the Hebrew meaning of "enemy, adversary; accuser". Likewise, while Matthew and Luke both narrate Jesus’ most famous sermon, they are semantically different: the setting is a mountain in the first and a plain in the second.

Character/Event | Matthew | Mark | Luke
Temptation | by the Devil | by Satan | by the Devil
Sermon | mountain | - | plain

This kind of mapping is harder, due to the expertise required to subjectively distinguish meaning, as in the case of the mountain / plain, which scholars in Biblical hermeneutics seem to agree to be more than merely a change of setting for narration. The difficulty is aggravated by the eventual need to quantify the semantic shifts (how far is "the Devil" from "Satan (the adversary)", especially when the episode is missing from the non-synoptic gospel of John?). These three states ("null", "Devil", and "Satan") should not be considered equally different, especially when the texts of the three synoptic gospels are clearly related. Luckily, while not necessarily in a systematic way for phylogenetic purposes, this kind of coding has already been conducted by many Biblical scholars, and we might thus appropriate it in the future.
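One way to express that the three states should not be equally different is a step (cost) matrix over the states of a multistate character. The sketch below is only an illustration: the numeric costs are hypothetical values I picked to show the idea, not estimates from any source, and coding John as "null" simply follows the remark above that the episode is missing there.

```python
# Multistate coding of the "Temptation" character, with John coded as "null".
TEMPTATION = {
    "Matthew": "Devil",
    "Mark": "Satan",
    "Luke": "Devil",
    "John": "null",
}

# Hypothetical, hand-picked costs: a shift between the two named tempters is
# treated as cheaper than gaining or losing the whole episode.
COST = {
    frozenset(["Devil", "Satan"]): 1.0,
    frozenset(["null", "Devil"]): 3.0,
    frozenset(["null", "Satan"]): 3.0,
}


def state_cost(x, y):
    """Symmetric cost of changing state `x` into state `y`."""
    if x == y:
        return 0.0
    return COST[frozenset([x, y])]


def character_distance(a, b, coding):
    """Cost between two gospels for a single multistate character."""
    return state_cost(coding[a], coding[b])


print(character_distance("Matthew", "Luke", TEMPTATION))
print(character_distance("Matthew", "Mark", TEMPTATION))
print(character_distance("Mark", "John", TEMPTATION))
```

Choosing the actual costs is exactly the subjective step that requires hermeneutic expertise; the code only makes that choice explicit and repeatable.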

(iii)
The third way of coding, partly solving the difficulties of the second solution, would be to compare the Greek text for each event, using some distance metric. For strings, there is the common Levenshtein distance, or, in a blatant self-promotion, my own sequence similarity algorithm. For linguistic texts, there are dozens of possible Natural Language Processing solutions, but usually with no model for Koine Greek (apart from purely statistical ones that can overfit, because in general they are actually trained on the text of the gospels, in the first place).

Character/Event | Matthew | Mark | Luke
Prologue | Βίβλος γενέσεως Ἰησοῦ... (1,1) | Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ... (1,1) | Ἐπειδήπερ πολλοὶ ἐπεχείρησαν... (1,1-4)
Birth of Jesus | Τοῦ δὲ Ἰησοῦ χριστοῦ ἡ γένεσις... (1,18-25) | - | Ἐγένετο δὲ ἐν ταῖς ἡμέραις ἐκείναις... (2,1-7)
Healing of possessed | - | καὶ εὐθὺς ἦν ἐν τῇ συναγωγῇ αὐτῶν ἄνθρωπος ἐν πνεύματι... (1,23-28) | καὶ ἐν τῇ συναγωγῇ ἦν ἄνθρωπος ἔχων... (4,33-37)
Parable of tares | Ἄλλην παραβολὴν παρέθηκεν αὐτοῖς λέγων... (13,24-30) | - | -

By comparing all distance pairs for all characters, we could build a matrix of pairwise distances, similarly to what David frequently does in the EDA analyses posted to this blog. Considering that most synoptic lists have already mapped each event to their texts (sometimes in discontinuous blocks), with a copy of the reconstructed Greek original, from Holmes (2010) in the table immediately above, it should not be too hard to perform such a study.
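To make the third kind of coding concrete, here is a minimal sketch of a normalized Levenshtein distance applied to one character. It uses only the truncated incipits quoted for the "Healing of possessed" episode, so the resulting number is purely illustrative; the helper names (`nfc`, `normalized_distance`, `pairwise`) are my own, and a real study would use the full pericope text and probably a token-based or language-aware metric.

```python
import unicodedata
from itertools import combinations


def nfc(text):
    """Normalize to NFC so that accented Greek letters compare consistently."""
    return unicodedata.normalize("NFC", text)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    a, b = nfc(a), nfc(b)
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    a, b = nfc(a), nfc(b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def pairwise(texts):
    """Distances for every unordered pair of taxa that have a text."""
    return {
        (x, y): normalized_distance(texts[x], texts[y])
        for x, y in combinations(sorted(texts), 2)
    }


# Truncated incipits only (Mark 1,23-28 and Luke 4,33-37); not the full passages.
HEALING = {
    "Mark": "καὶ εὐθὺς ἦν ἐν τῇ συναγωγῇ αὐτῶν ἄνθρωπος ἐν πνεύματι",
    "Luke": "καὶ ἐν τῇ συναγωγῇ ἦν ἄνθρωπος ἔχων",
}

for pair, dist in pairwise(HEALING).items():
    print(pair, round(dist, 3))
```

Averaging such per-character distances over all events (with some rule for events missing from one gospel) would give the pairwise matrix mentioned above.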

A simple Splits Graph analysis

For the purpose of this post, I decided to proceed with the first of these three possible solutions, listing whether an event is found in each Gospel or not, ignoring semantic and textual differences. I modified the synopsis by Garmus (1982), itself apparently modified from some Nestle-Aland edition. This produced a final list of 364 characters and their presence in each of the four gospels — I decided to include the non-synoptic John to test where the analyses would place it.
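For readers who want to try something similar, the sketch below writes a presence/absence matrix as a NEXUS file with a standard-datatype CHARACTERS block, which is a common input format for phylogenetic network software. It is an assumption on my part that this layout matches what a given SplitsTree version expects, and the matrix holds only the three example events from earlier in the post, with John left as missing ("?") because those examples do not state his coding; it is not the 364-character dataset.

```python
TAXA = ["Matthew", "Mark", "Luke", "John"]

# Three example characters; John is coded "?" (missing) as a placeholder.
MATRIX = {
    "Imprisonment of John": {"Matthew": "1", "Mark": "1", "Luke": "1", "John": "?"},
    "Priests conspiring":   {"Matthew": "0", "Mark": "1", "Luke": "1", "John": "?"},
    "Meek inheritance":     {"Matthew": "1", "Mark": "0", "Luke": "0", "John": "?"},
}


def to_nexus(matrix, taxa):
    """Serialize a binary character matrix as a minimal NEXUS document."""
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"    DIMENSIONS NTAX={len(taxa)};",
        "    TAXLABELS " + " ".join(taxa) + ";",
        "END;",
        "BEGIN CHARACTERS;",
        f"    DIMENSIONS NCHAR={len(matrix)};",
        '    FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "    MATRIX",
    ]
    for taxon in taxa:
        # One row per taxon: its state for every character, in matrix order.
        states = "".join(row[taxon] for row in matrix.values())
        lines.append(f"    {taxon:<10}{states}")
    lines += ["    ;", "END;"]
    return "\n".join(lines)


print(to_nexus(MATRIX, TAXA))
```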

As expected, the data are to a large extent arbitrary and subjective. Garmus has obvious limitations in the way of dealing with events narrated out of the expected chronological sequence (i.e., flashbacks and flashforwards, as in the case of the beheading of John in relation to his actions), as well as with theological excursuses. None of these limits, however, seem to impact the general shape of a network or tree generated from these data, at most strengthening more feeble signals.

Splits tree, modified from the one generated by Huson & Bryant (2006)

As also expected, the graph supports what is by now a general consensus. Mark is likely to be the gospel closest to a hypothetical root (in this case, nearest to the mid-point). John is the most distinct of the four gospels, being closer to Mark than to the Matthew-Luke group (due to the “core” events narrated and the fewer innovations in Mark). Considering edge lengths, Luke seems to be the most innovative taxon of the synoptic gospel neighborhood / group. Such a network could never demonstrate the existence of "Q" (see the first post) as a stand-alone and actual document, but this tentative analysis does support the hypothesis that Matthew and Luke share a common development, overall supporting Marcan priority.

While probably obvious, it is important to remember that phylogenetic methods are tools that imply the existence of users: they should be an additional instrument for investigation, possibly promoting the collaboration of serious Biblical critics and experts in phylogenetic methods. Let’s consider two examples of the need for such expertise.

First, there is much historical, textual, and theological evidence supporting the hypothesis that the gospel of Mark originally ended with what is now Mark 16,8, with the twelve following verses being later additions (something common to many Greek texts, including the Odyssey). These supposed additions are only known to those who delve into Biblical scholarship. If they are marked as missing in our data, as we should at least test, the distance between Mark and all the other gospels increases considerably for such an apparently minor change; this includes the distance to the unrelated Gospel of John and, especially, the edge length between Luke and Mark.
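The effect of recoding is easy to test. The sketch below uses entirely hypothetical five-character strings (they are not rows from my dataset) and a simple proportion-of-differences distance that skips characters coded as missing in either taxon; it only shows the mechanism by which marking shared episodes as missing inflates the distance.

```python
def distance(a, b, missing="?"):
    """Proportion of differing characters among those coded in both taxa."""
    compared = [(x, y) for x, y in zip(a, b) if missing not in (x, y)]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


# Hypothetical toy rows: characters 3 and 4 stand for episodes that occur
# only in the disputed verses after Mark 16,8.
LUKE = "11111"
MARK_LONG = "11110"   # long ending coded as present
MARK_SHORT = "11??0"  # long ending coded as missing

print(round(distance(MARK_LONG, LUKE), 3))
print(round(distance(MARK_SHORT, LUKE), 3))
```

In this toy case the single real disagreement weighs more once two agreeing characters are removed from the comparison, so the distance grows even though no new difference was introduced.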

Second, if conducting the third and especially the second type of coding that I described above, a researcher should have at least a basic knowledge of the language they are dealing with. Adapting the explanation of Smith (2017), Matthew and Mark might seem to use the same vocabulary for the “parable of the harvest” when read in English translation, but there is a concealed change of meaning (whose theological importance and implication I'm not debating here), as the single English word “seed” tends to be used in translation of two different Greek words: in Matthew, “sperma” (the kernels of grain, in a more agricultural sense) and, in Mark, “sporos” (which carries a connotation of generative matter to be released).

Conclusion

My dataset is available in preliminary state (for example, labels are in Portuguese) here.

In conclusion, phylogenetics still has much to offer to the field of textual criticism, and this should include Biblical criticism, especially if we are able to support analyses of textual development based on trees / networks of manuscripts. I hope this pair of blog posts will motivate Biblical scholars to collaborate. If so, please write to me.

References

Garmus, Ludovico (ed.) (1982) Bíblia sagrada. Petrópolis: Editora Vozes. [reprint 2001]

Goodacre, Mark (2001) The Synoptic Problem: a Way Through the Maze. New York: T & T Clark International. (available on Archive.org)

Holmes, Michael W. (ed.) (2010) SBL Greek New Testament. Atlanta, GA: Society of Biblical Literature.

Huson, Daniel H.; Bryant, David (2006) Application of Phylogenetic Networks in Evolutionary Studies, Mol. Biol. Evol., 23(2):254-267. [SplitsTree.org]

Smith, Mahlon H (2017) A Synoptic Gospels Primer. http://virtualreligion.net/primer/

Tuesday, November 21, 2017

Another test case for phylogenetics and textual criticism: the Bible


This is a two-part blog post. Here, I will introduce a particular stemmatological problem, along with the studies of it to date; and in a subsequent post I will discuss possible phylogenetic analyses that might be applied (see The Synoptic Gospels problem: preparing a phylogenetic approach).

Introduction

This year marks the celebration of 500 years since Martin Luther famously proposed his 95 religious theses, thus presaging the Protestant Reformation of the Western Christian Church. In line with this, it is worth discussing a subfield of textual criticism and stemmatics deeply influenced by the Reformation: Biblical criticism. While the importance of written texts to Christianity begins at least in the 2nd century, the theological doctrine of sola scriptura (“by scripture alone”, regarding the infallible and final authority in all matters), along with translation work and individual study of the Bible, paved the way, sometimes unwillingly, for scientific approaches to Biblical criticism equivalent to those applied to secular literature.

The seminal figure in textual criticism of the New Testament was Hermann Reimarus (1694-1768), apparently the first to apply the methodology of literary texts to religious ones. As in the case of literary criticism, it is hardly a coincidence that Biblical criticism developed in the same cultural framework that would support and promote the idea of biological evolution and the tools for establishing genealogical trees and networks. This is especially so when considering the secularization of that society, in which proving the human origin and transmission of sacred texts was deemed an important act of civic freedom. Along with this came the parallel radicalization of some religious positions, such as the denouncement of scientific studies of religious texts as heresy (a stance nowadays rejected by most Christian denominations, which affirm the imperative of serious research on the sacred texts).


A concrete problem: the synoptic gospels

The most important problem in the textual criticism of the New Testament is that of the “synoptic gospels”, involving the three Gospels of Mark, Matthew, and Luke. These gospels have strikingly similar narratives that relate many of the same stories, with similar or identical wording. Like the other canonical gospel, John, these texts were composed around the last quarter of the first century by literate Greek-speaking Christians, only becoming canonical at least a century after their composition.

The synoptic gospels differ from similar sources, such as the non-canonical Gospel of Thomas, in being biographies with a clear religious motivation, and not just a collection of sayings. When compared to the Gospel of John, the three synoptic gospels are distinct in apparently being written by and for a Jewish community that was not on the verge of breaking from the Jewish synagogue, also favoring short and simple sentences.

However, the most important proof of their genealogical relationship is the text itself. The table below shows the reconstructed Greek original of each gospel for the episode of Jesus’ recruitment of a tax collector (an episode missing from the non-synoptic Gospel of John). The text in blue is the material shared by any two of the gospels, and the text in red is common to all three of them. [This is adapted from Smith (2017); on Wikipedia there is a further example, referring to the episode of the cleansing of a leper, see https://en.wikipedia.org/wiki/Synoptic_Gospels#Example.]


Matthew 9,9: Καὶ παράγων ὁ Ἰησοῦς ἐκεῖθεν εἶδεν ἄνθρωπον καθήμενον ἐπὶ τὸ τελώνιον, Μαθθαῖον λεγόμενον, καὶ λέγει αὐτῷ· Ἀκολούθει μοι. καὶ ἀναστὰς ἠκολούθησεν αὐτῷ.

Mark 2,13-14: Καὶ ἐξῆλθεν πάλιν παρὰ τὴν θάλασσαν· καὶ πᾶς ὁ ὄχλος ἤρχετο πρὸς αὐτόν, καὶ ἐδίδασκεν αὐτούς. καὶ παράγων εἶδεν Λευὶν τὸν τοῦ Ἁλφαίου καθήμενον ἐπὶ τὸ τελώνιον, καὶ λέγει αὐτῷ· Ἀκολούθει μοι. καὶ ἀναστὰς ἠκολούθησεν αὐτῷ.

Luke 5,27-28: Καὶ μετὰ ταῦτα ἐξῆλθεν καὶ ἐθεάσατο τελώνην ὀνόματι Λευὶν καθήμενον ἐπὶ τὸ τελώνιον, καὶ εἶπεν αὐτῷ· Ἀκολούθει μοι. καὶ καταλιπὼν πάντα ἀναστὰς ἠκολούθει αὐτῷ.

The relationships between the gospels, such as the so-called “triple tradition”, are summarized by the graph below, from the Wikipedia article on the synoptic gospels. Mark, the shortest text, has almost no unique material (only 3%, in part superfluous adjectives and Aramaic translations) and is almost entirely (94%) reproduced in Matthew. Matthew and Luke have their share of unique material (20% and 35%, respectively), which suggests independence, except for a "double tradition" of common material of about a quarter of the contents of each one, including notable passages such as the “Sermon on the Mount”. The parallelisms of these two gospels are found not only in their contents, but also in their arrangement, with most episodes described in the same order and, in case of displacements, with blocks of episodes moved together while preserving their internal order.


Previous studies

Such similarities were already noted in the first centuries of Christianity. This raises typical genealogical questions regarding topics such as priority (which gospel was written first) and dependence (which gospel was used as a source).

As for the first question, due to textual and theological evidence, a well-established majority of commentators favors the hypothesis of Marcan priority, that is, that the gospel of Mark is the oldest and that both Matthew and Luke used it as a source. As for the second question, a major point of dispute is the double tradition of Matthew and Luke, which can only be properly explained in terms either of descent or of a common ancestor. The two leading hypotheses are the one of a lost gospel (referred to as “Q”, after the German Quelle [“source”]), and the one by Austin Farrer, according to whom Matthew used Mark as its source and Luke then used both of them. But these are not the only hypotheses that have been proposed, as shown in the next set of diagrams (also from the Wikipedia article above).

[Diagrams: Augustinian Theory; Q Hypothesis; Farrer Theory; Jerusalem School Hypothesis]

The first fully developed theory was actually proposed by Augustine of Hippo back in the 5th century, which is essentially the one by Farrer, but with Matthew in place of Mark (i.e., supporting a Matthean priority). Given Augustine’s authority as a “Father of the Church”, his view was not disputed until the late 18th century, when Johann Jakob Griesbach published a synopsis of the three gospels and developed a new hypothesis, swapping Mark and Luke in the dominant explanation. Griesbach’s scientific approach led to the first application to Biblical problems of textual criticism, then in development in the German towns of Jena and Leipzig where he lived.

In 1838, Christian Weisse proposed the “Q” Hypothesis, mentioned above, asserting that Matthew and Luke were produced independently, both using Mark plus a lost source. This source was described as a lost collection of sayings of Jesus, although the evidence for its existence is feeble and indirect. This hypothesis was further developed by Burnett Streeter in 1924, with the proposal of “proto-versions” of both Mark and Luke; the wording of the canonical versions we have today would then be the product of later revisions, influenced by all of the texts.

During the past fifty years, due to advances in textual criticism and new manuscript analyses, the independence of Luke in relation to Matthew has been questioned, with diminishing support for the Q Hypothesis. Farrer’s hypothesis is now a leading position, alongside alternative trees such as the one by the Jerusalem School, according to which a lost Greek anthology “A” (postulated as the translation of a collection of sayings in either Hebrew or Aramaic) was directly or indirectly used by all gospels, including John.
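The hypotheses described in this section can be written down as small directed graphs in which each document lists its direct sources. This is a sketch under my own reading of the descriptions above (the Jerusalem School tree is left out because the text does not give its full shape), and the function names are hypothetical.

```python
# Each hypothesis maps a document to the documents it used as direct sources.
HYPOTHESES = {
    "Augustinian": {"Matthew": [], "Mark": ["Matthew"], "Luke": ["Matthew", "Mark"]},
    "Griesbach": {"Matthew": [], "Luke": ["Matthew"], "Mark": ["Matthew", "Luke"]},
    "Q": {"Mark": [], "Q": [], "Matthew": ["Mark", "Q"], "Luke": ["Mark", "Q"]},
    "Farrer": {"Mark": [], "Matthew": ["Mark"], "Luke": ["Mark", "Matthew"]},
}

CANONICAL = {"Matthew", "Mark", "Luke"}


def ancestors(doc, sources):
    """All direct and indirect sources of `doc` under one hypothesis."""
    found = set()
    stack = list(sources[doc])
    while stack:
        source = stack.pop()
        if source not in found:
            found.add(source)
            stack.extend(sources[source])
    return found


def priority(sources):
    """Canonical gospels that have no source, i.e. the ones written first."""
    return sorted(doc for doc in CANONICAL if not sources[doc])


def double_tradition(sources):
    """How a hypothesis accounts for the material shared by Matthew and Luke."""
    matthew = ancestors("Matthew", sources)
    luke = ancestors("Luke", sources)
    if "Matthew" in luke or "Luke" in matthew:
        return "descent"
    if (matthew & luke) - {"Mark"}:
        return "common ancestor"
    return "unexplained"


for name, sources in HYPOTHESES.items():
    print(name, priority(sources), double_tradition(sources))
```

Written this way, the two genealogical questions raised above (priority and dependence) become simple queries on each graph.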

Phylogenetics

Considering the analogies between literary and genetic texts that we have already discussed on this blog, it is clear that this topic should be an interesting anecdote to share around phylogenetic water-coolers. The four texts can be divided into two “families” of gospels, the synoptic (taxa: Matthew, Mark, and Luke) and non-synoptic (taxon: John). Their similarities suggest a distant common ancestor, probably oral traditions, as reported by Christian writers of the first and second centuries such as Papias.

The relationship between the taxa of the first family, however, is far from clear, as their relative dates cannot be determined with confidence. We might be faced with processes that, by analogy with biology, can be explained as gene pool recombinations and horizontal gene transfers – even though the most likely explanation is the one of direct descent, possibly from unknown taxa.

In literary terms, we must also consider features such as Matthew clearly being written by someone highly familiar with aspects of Jewish law, possibly asserting the Jewish component of the preaching while perceiving a universal tendency for the new faith. We must also consider the fact that Mark provides no ancestral lineage for Jesus, while Matthew traces him from a line of kings and Luke from a line of commoners — clearly stating the theological point of view of each gospel. Other aspects are worth consideration, such as the idea that what we today identify as the Gospel of Luke is likely to have been the first part of a once single document that included what is now the book of the “Acts of the Apostles”.

While I must admit that my research has been limited to some googling of keywords, it is curious that a topic that has attracted so much attention for millennia, from serious academic scholarship to conspiracy theories, and from impressionistic reviews to advanced statistical modeling, does not seem to have been covered by phylogenetic analyses, so far. Given the range of data and literature, it should actually look like a prime candidate for such application, even from an outsider point of view. This viewpoint is in fact discussed in a review by Christian P. Robert of a book called The Synoptic Problem and Statistics by Andris Abakuks:
The book by Abakuks goes […] through several modelling directions, from logistic regression using variable length Markov chains [to predict agreement between two of the three texts by regressing on earlier agreement] to hidden Markov models [representing, e.g., Matthew’s use of Mark], to various independence tests on contingency tables, sometimes bringing into the model an extra source denoted by Q. Including some R code for hidden Markov models. Once again, from my outsider viewpoint, this fragmented approach to the problem sounds problematic and inconclusive. And rather verbose in extensive discussions of descriptive statistics. Not that I was expecting a sudden Monty Python-like ray of light and booming voice to disclose the truth! Or that I crave for more p-values (some may be found hiding within the book). But I still wonder about the phylogeny… Especially since phylogenies are used in text authentication as pointed out to me by Robin Ryder for Chauncer’s [sic] Canterbury Tales.
We can certainly list among the reasons for such omission the diffidence of the textual community towards phylogenetic methods, especially when performed by people from outside the field; but the potential reception problems for texts of enormous religious significance cannot be ruled out. However, one of the reasons might be far more trivial: the fact that, just as in the case of historical linguistics, we don’t have digital structured databases of the trove of data about this topic. Most of the literature is not even properly digital, at best with scanned PDFs. Furthermore, the data are usually far from perfect for such usage, as in the case of the synopsis by Smith (2017), which looks more like a typed table than a true database.

Next

In a future post, I will explore the problems of the synoptic gospels from a phylogenetic point of view, also releasing a minimal dataset (see The Synoptic Gospels problem: preparing a phylogenetic approach). Until then, those interested in the topic can find a lot of discussion on a mailing list devoted to the scholarly study of the synoptic gospels, Synoptic-L.

References

Abakuks, Andris (2014) The Synoptic Problem and Statistics. London: Chapman and Hall / CRC.

Goodacre, Mark (2001) The Synoptic Problem: a Way Through the Maze. New York: T & T Clark International. (available on Archive.org)

Robert, Christian P (2015) The synoptic problem and statistics [book review]. https://xianblog.wordpress.com/2015/03/20/the-synoptic-problem-and-statistics-book-review/

Orchard, Bernard; Longstaff, Thomas RW (1979) J.J. Griesbach: Synoptic and Text - Critical Studies 1776-1976. Cambridge: Cambridge University Press.

Smith, Mahlon H (2017) A Synoptic Gospels Primer. http://virtualreligion.net/primer/

Tuesday, July 11, 2017

The curious case of the word “stemma” — from circles to trees


Each word has its own history, according to a maxim attributed to Jules Gilliéron that makes some historical linguists tremble. One with a curious history is the word stemma (plural stemmata), which we stumble upon when investigating the development of phylogenetic trees.

David has been exploring this question for some time, showing how the origin ultimately lies in the alternative to the hierarchical model of the Aristotelian "scale" offered by the practice (and the metaphors) of genealogies and pedigrees. While dealing with possible influences on 19th-century biology, I have explored a different field, stemmatics ("textual criticism"), which shares with genealogical practices both the tree model and, obviously, the word stemma. As stemmatics is one of the first scientific approaches to the idea, and considering that the now widespread tree is likely a calque of German Baum, itself a calque of stemma, it is worth writing a bit about the history of this word.

Stemma / stemmata

Dictionaries (as well as my queries in Google Books, which would probably fail to impress a reviewer) agree that not even in Romance languages does this Latin word display an uninterrupted tradition from the time of Caesar. It only entered languages such as English and Italian, with the meaning of "genealogical tree; pedigree; nobility", from the mid 17th century on. This date supports the theory that family pedigrees were not commonplace before the 17th century (when they became a true fashion, as in Strein, 1559), despite being drawn since the Middle Ages and always discussed — as in royal disputes or in the case of the genealogies of Jesus found in the Gospels (likely drawn to confirm the messianic claims with Jewish criteria, but assimilated to the European mindset). In short, modern stemmata are mostly a product of Neoclassical fashion, and their popularity was influenced by the same descriptions of Roman pedigrees where the word was learned.

Speaking of pedigrees, this Latin word is of Greek descent, a loanword of στέμμα [stémma], meaning "wreath, crown". This sense was already a development of an original "that which surrounds; circle": in Homer's Iliad, for example, we still find an occurrence in the first sense of "circle" (of warriors, cf. XIII.736); but elsewhere the word refers to a laurel-wreath wound around a staff, mostly in the plural and in relation to the laurel god Apollo (cf. Iliad, I.14, I.28, and I.373). The development is due to the custom of awarding crowning wreaths, with στέμμα deriving from the verb στέφειν [stéphein] ("to encircle, to crown, to wreathe, to tie around") by the addition of the morpheme -μα [-ma], used to form nouns denoting the result of an action, as in the analogous case of γράφω (gráphō, "write") and γράμμα (grámma, "that which is written"). Our word ultimately derives from the Proto-Indo-European root *stebh- "post, stem; to place firmly on; to fasten", related to English "(to) step" and "staff".


Theodosius offers a laurel wreath to the victor;
on the base of the obelisk in the Hippodrome (Istanbul)
[source: Wikipedia]

The "wreath, garland, chaplet" meaning is attested in Ancient Greek literature of all times and genres, such as in tragedy (cf. Euripides, Andromache, 894), comedy (cf. Aristophanes, Wealth, 39), philosophy (cf. Plato, Republic, 617c), and historiography (cf. Thucydides, Peloponnesian War, IV.133). At least one metaphoric usage is attested, in the sense of "web/tangle of life" (cf. Euripides, Orestes, 12), and various inscriptions indicate an additional meaning of "guild" (such as in one "guild of huntsmen" epigraph quoted by Liddell & Scott, 1940). The genealogical meaning is only found in later Greek authors like Plutarch (1st century CE), suggesting that it was imported from Latin.

The Roman meaning developed from the custom of decorating the portraits of one’s ancestors, sometimes in elaborate full-wall genealogies, with laurel wreaths indicating both excellence and nobility (as "noble" pretty much meant "descending from gods"). Domestic cults were central to Roman religion, and this practice seems to have become so widespread in Imperial times that it turned into a banality, with the laurel decoration being decried as a symbol of vanity by poets and philosophers alike. The custom – and the usage of stemma for "genealogical tree" – is mentioned twice by Seneca in essays of utmost importance for Roman Stoicism. In Ad Lucilium Epistulae Morales, XLIV.1, he says:
Si quid est aliud in philosophia boni, hoc est, quod stemma non inspicit. Omnes, si ad originem primam revocantur, a dis sunt. [Philosophy also has this advantage: it does not look at your genealogical tree. Everyone, if we look at their remotest origin, descends from the gods].
A similar reference, with a more detailed description of the practice, is found in his De Beneficiis, III.28:
We all spring from the same source, have the same origin; no man is nobler than another except in so far as the nature of one man is more upright and more capable of good actions. Those who display ancestral busts in their halls [qui imagines in atrio exponunt], and place in the entrance of their houses the names of their family, arranged in a long row and entwined in the multiple ramifications of a genealogical tree [ac multis stemmatum illigata flexuris] – are these not notable rather than noble? Heaven is the one parent of us all, whether from his earliest origin each one arrives at his present degree by an illustrious or obscure line of ancestors. You must not be duped by those who, in making a review of their ancestors, wherever they find an illustrious name lacking, foist in the name of a god. [adapted from the translation of Basore, 1935]


A golden laurel wreath, probably originating from Cyprus, 4th-3rd century BC
[source: Wikipedia]

In matters of phylogenetics, to prove that something existed is usually not enough, as we should try to demonstrate its influence and descent. Both are clear in the case of Seneca: his moral essays were read and copied without interruption in the early years of Christianity, proliferated during the Carolingian Renaissance, and were among the most published works of secular Western literature for centuries. The Wikipedia article on the second essay is well referenced on the matter:
Three translations were made into English during the sixteenth and early seventeenth century. The first translation at all into English was made in 1569 by Nicolas Haward, of books one to three, while the first full translation into English was made in 1578 by Arthur Golding, and the second in 1614 by Thomas Lodge. Roger L'Estrange made a relevant work in 1678, he had been making efforts on Seneca's works since at least 1639. A partial Latin publication of books 1 to 3, being edited by M. Charpentier & F. Lemaistre, was made circa 1860, books 1 to 3 were translated into French by de Wailly, and a translation into English was made by JW. Basore circa 1928-1935.
The new meaning of the word is confirmed by many other authors popular in Medieval times, and especially after the Renaissance, such as Suetonius (cf. Nero, 37; Galba, 2) and Statius (cf. Silvae, 3). Pliny the Elder's Naturalis Historia, an obligatory reading for all Western scholars from the Renaissance to at least the 19th century, is another important source. When exposing the history of Roman art and discussing the honor attached to portraits, Pliny mentions that "in ancient times" people had much care for faithful likeness, when "portraits modeled in wax were arranged, each in its separate niche, to be always in readiness to accompany the funeral processions of the family [... while] the pedigree [stemmata] of the individual was traced in lines upon each of these coloured portraits" (XXXV.6, adapted from Bostock, 1855).

The last important source to note is the eighth Satire of Juvenal, on the paradoxes of the Roman aristocracy, where the word stemma, as usual in the plural, is used to open the poem:
Stemmata quid faciunt? quid prodest, Pontice, longo / sanguine censeri, pictos ostendere uultus / maiorum et stantis in curribus Aemilianos / et Curios iam dimidios umeroque minorem / Coruinum et Galbam auriculis nasoque carentem [Genealogies, what are they worth? What is in it for you, Ponticus, in being judged by ancient bloodline, in flaunting the portraits of your ancestors, the Aemilians standing on chariots, only half of the Curii, a Corvinus devoid of shoulders, and a Galba missing ears and nose?]
Sources suggest that the new meaning was well established by the reign of Hadrian (2nd century CE), including the derivative meanings of "high value" (cf. Martial, Satyra, VIII.6) and "antique", as in Prudentius (cf. Liber Cathemerinon, VII.81), a Christian author much read in Medieval times. As already mentioned, the word even found its way back into Greek with the new semantic shift, such as in Plutarch, one of the most popular Greek authors since the Renaissance. In his Numa, 1, we find:
ἔστι δὲ καὶ περὶ τῶν Νομᾶ τοῦ βασιλέως χρόνων, καθ᾽ οὓς γέγονε, νεανικὴ διαφορά, καίπερ ἐξ ἀρχῆς εἰς τοῦτον κατάγεσθαι τῶν στεμμάτων ἀκριβῶς δοκούντων ["There is likewise a vigorous dispute about the time at which King Numa lived, although from the beginning down to him the genealogies seem to be made out accurately"; Perrin, 1914].
It is somewhat ironic that the accusations of futility and uselessness leveled at genealogical trees probably contributed to the Medieval and Renaissance restoration of such practices. Informed about the Roman tradition, and equipped with examples from nobility and religion, people turned genealogy and its trees into a fashion. This helped to lay the ground for the acceptance of the tree model when new scientific endeavors required a better way to describe things, like dog breeds and strawberry varieties, especially when non-ascending genealogies (who descends from whom, instead of who are the ancestors of whom) were already common, and when the concept of the "tree of life" gained a new popularity.


Neptune's genealogy as per Boccaccio.
Paris: Louis Hornken, 1511. [source]

References
  • Aristophanes (1938) Wealth. The Complete Greek Drama, vol. 2. Eugene O'Neill, Jr. New York: Random House.
  • στέμμα in Autenrieth, Georg (1891) A Homeric Dictionary for Schools and Colleges. New York: Harper and Brothers.
  • στέμμα in Bailly, Anatole (1935) Le Grand Bailly: Dictionnaire grec-français. Paris: Hachette.
  • Euripides (forthcoming) Euripides, with an English translation by David Kovacs. Cambridge MA: Harvard University Press.
  • Euripides (1938) The Complete Greek Drama, edited by Whitney J. Oates and Eugene O'Neill, Jr. in two volumes. New York: Random House.
  • stemma in Lewis, Charlton T.; Short, Charles (1879) A Latin Dictionary. Founded on Andrews' edition of Freund's Latin dictionary. Revised, enlarged, and in great part rewritten by Charlton T. Lewis and Charles Short. Oxford: Clarendon Press.
  • στέμμα in Liddell & Scott (1940) A Greek–English Lexicon. Oxford: Clarendon Press.
  • στέμμα in Liddell & Scott (1889) An Intermediate Greek–English Lexicon. New York: Harper & Brothers.
  • Omero (1990) Iliade. Traduzione di Rosa Calzecchi Onesti. Torino: Giulio Einaudi editore.
  • Plato (1903) Platonis Opera, ed. John Burnet. Oxford: Oxford University Press.
  • Pliny the Elder (1855) The Natural History. John Bostock, H.T. Riley. London. Taylor and Francis.
  • Plutarch (1914) Plutarch's Lives. With an English translation by Bernadotte Perrin. Cambridge MA: Harvard University Press; London: William Heinemann.
  • Seneca (1917-1925) Ad Lucilium Epistulae Morales, volume 1-3. Richard M. Gummere. Cambridge MA: Harvard University Press; London: William Heinemann.
  • Seneca, Lucius Annaeus (1928-1935) Moral Essays. Translated by John W. Basore. The Loeb Classical Library. London: W. Heinemann. 3 vols.: Volume III.
  • Statius, P. Papinius (1928) Statius, Vol I. John Henry Mozley. London: William Heinemann; New York: G.P. Putnam's Sons.
  • Strein, Richardus (1559) Gentium et familiarum Romanorum stemmata. Paris[?]: Henr. Stephanus.
  • Suetonius (1889) The Lives of the Twelve Caesars; An English Translation, Augmented with the Biographies of Contemporary Statesmen, Orators, Poets, and Other Associates. Publishing editor: J. Eugene Reed. Translated by Alexander Thomson. Philadelphia: Gebbie & Co.
  • Thucydides (1942) Historiae in two volumes. Oxford: Oxford University Press.

Tuesday, May 23, 2017

A test case for phylogenetic methods and stemmatics: the Divine Comedy


In a previous post I gave an outline of stemmatics, and briefly touched on the adoption and advantages of phylogenetic methods for textual criticism (On stemmatics and phylogenetic methods). Here I present the results of an empirical investigation I have been conducting, in which such methods are used to study some philological dilemmas of a cornerstone work in textual criticism, Dante Alighieri's Divine Comedy. I am reproducing parts of the text and the results of a paper still under review; the NEXUS file for this research is available on GitHub.


Before describing the analysis, I discuss the work and its tradition, as well as some of the open questions concerning its textual criticism. This should not only allow the main audience of this blog to understand (and perhaps question) my work, but it is also a way to familiarize you with the kind of research conducted in stemmatics. After all, the first step is the recensio, a deep review of all information that can be gathered about a work.

The Divine Comedy

The Divine Comedy is an Italian medieval poem, and one of the most successful and influential medieval works. It is written in a rigid structure that, when compared to other works, guaranteed it a certain resistance to copy errors, as most changes would be immediately evident. The poem is composed of three canticas (Inferno, Purgatory, and Paradise) totaling 100 cantos; the first cantos were written in 1306-07, and the work was completed not long before the death of the author in 1321. Written mostly during Dante's exile from his home city, Florence (Tuscany), it was, like many works of the time, published as the author wrote it, and not only upon completion. In fact, it is even possible, while not proven, that the author changed some cantos and published revisions, thus being himself the source of unresolvable differences.

No original manuscript has survived, but scholarship has traced the development of the tradition from copies and historical research. The poem is one of the most copied works of the Middle Ages, with more than 600 known complete copies, besides another 200 partial and fragmentary witnesses. For comparison, there are around 80 copies of Chaucer's Canterbury Tales, which is itself a successful work by medieval standards.

Commercial enterprises soon developed to meet the market demand created by its success. In terms of geographical diffusion, quantitative data suggests that, before the Black Death that ravaged the city of Florence in 1348, scribal activity was more intense in Tuscany than in Northern Italy, where the author had died. Among the hypotheses for its textual evolution, the results of my investigation support the widespread hypothesis that Dante published his work with Florentine orthography in Northern Italy. That is, the first copies adopted Northern orthographic standards, which would then revert to Tuscan customs, with occasional misinterpretations, when the work found its way back to Florence. These essentials of the transmission must be considered when curating a critical edition, as the less numerous Northern manuscripts, albeit with an adapted orthography, can in general be assumed to be closer to the archetype (if there ever was one to speak of) than the Florentine ones.

The tradition is characterized by intentional contamination, as the work soon became a focus of politics and grammar prescriptivism. Errors and contamination have been demonstrated even in the earliest securely dated manuscript, the Landiano of 1336 (cf. Shaw, 2011), and can also be identified in the first commentaries dating from the 1320s (such as in the one by Jacopo Alighieri, the author's son).

Critical studies

Here are some details about previous studies. I have included considerable stemmatic information, but I also include a biological analogy to help non-experts make sense of it.

The first critical editions date from the 19th century, but a stemmatic approach was advanced only at the end of that century, by Michele Barbi. Facing the problem of applying Lachmann's method to a long text with a massive tradition, in 1891 Barbi proposed his list of around 400 loci (samples of the text), inviting scholars to contribute the readings in the manuscripts they had access to. His project, which was intended to establish a complete genealogy without the need for a full collatio, had disappointing results, with only a handful of responses. Mario Casella would later (1921) conduct the first formal stemmatic study of the poem, grouping some older manuscripts into two families, α and β, with an unequal number of witnesses but equal value for the emendatio. His two families are not rooted at a higher level, but he observed that they share errors, supporting the hypothesis of a common ancestor, likely copied by a Northern scribe.

Casella's stemma, reproduced from Shaw (2011).

Forty years later, Giorgio Petrocchi proposed to overcome the large stemma by employing only witnesses dating from before the editorial activity of Giovanni Boccaccio, as his alterations and influence were considered to be too pervasive. Petrocchi defended a cut-off date of 1355 as being necessary for a stemmatic approach, which would otherwise have been impossible, given the level of contamination of later copies. The restriction in the number of witnesses was contrasted with his expansion of the collatio to the entire text, criticizing Barbi's loci as subjective selections for which there was no proof of sufficiency.

Making use of analogies with biology, we may say that Barbi proposed to establish a tree from a reduced number of "proteins" for all possible "taxa". Casella considered this to be impracticable and, selecting a few representative "fossils", built a tree from a large number of phenotypic characteristics. Finally, Petrocchi produced a network while considering the entire "genome" for all "fossils" dated from before an event that, while well-supported in theory (we could compare its effects to a profound climate change), was nonetheless arbitrary.

Petrocchi's stemma, reproduced from Shaw (2011).

Questions about Petrocchi's methodology and assumptions were soon raised, particularly regarding the proclaimed influence of Boccaccio, which had been asserted without quantitative proof either that his editions were as influential as claimed or that all later witnesses were superfluous for stemmatics. Later research focused on questioning his stemma. The points raised include the absence of consensus about the relationship between the Ash and Ham manuscripts, the supposedly weak demonstration of the polytomy of Mad, Rb, and Urb (the "Northern manuscripts"), and the dating of Gv (likely copied fifty to a hundred years after Petrocchi's assumption). Evidence was presented that Co, a key manuscript in his stemma, could not be an ancestor of Lau (its copyist was still active in the 15th century), and that Ga contained disjunctive errors not found in its supposed descendants. To abuse the biological analogy once more, the dating of his "fossils" was in some cases plainly wrong.

Federico Sanguineti presented an alternative stemma in 2001, arguing that a rigorous application of stemmatics would evidence errors made by Petrocchi. To that end, he decided to resurrect Barbi's loci and trace the first complete genealogy, without arbitrary and a priori decisions about the usefulness of the textual witnesses. Sanguineti defended the suggestion that, after this proper recensio, a small number of manuscripts (which he eventually set to seven) would be sufficient for emendation. His stemma, described as "optimistic in its elegance and minimalism" (Shaw 2011), resulted in a critical edition that heavily relied on a single manuscript, Urb, the only witness of his β family (as Rb was displaced from the proximity it had in Petrocchi's stemma, and Mad was excluded from the analysis). Keeping with the biological analogy, Sanguineti proposed building a tree from an extremely reduced number of "proteins", but for all "taxa". In the end, however, the reduced number of "proteins" was considered only for seven "taxa", selected mostly due to their age.

Sanguineti's stemma, reproduced from Shaw (2011).

The edition of Sanguineti was attacked by critics, who confronted the limited number of manuscripts used in the emendatio, the position of Rb, the high value attributed to LauSC, and the unparalleled importance of Urb, all resulting in an unexpected Northern coloring to the language of a Florentine writer. Regarding his methodology, reviewers pointed out that stemmatic principles had not been followed strictly, as the elimination was not restricted to descripti, but extended to branches that were considered to be too contaminated.

The digital edition of Prue Shaw (2011) was developed as a project for phylogenetic testing of Sanguineti's assumptions. Her edition includes complete manuscript transcriptions, and the transcriptions include all of the layers of revision of each manuscript (original readings and corrections by later hands), and are complemented by high-quality reproductions of the manuscripts. After testing the validity of Sanguineti's method and stemma, Shaw concluded that his claims do not "stand up to close scrutiny", and that the entire edition is compromised, because Rb "is shown unequivocally to be a collaterale of Urb, and not a member of α as [Sanguineti] maintains".

Applying phylogenetic methods

With the goal of following and, to a large part, replicating Shaw (2011), I have analyzed signals of phylogenetic proximity for validating stemmatic hypotheses, produced both a computer-generated and a computer-assisted phylogeny (equivalent to a stemma), and evaluated the performance of such phylogenies with methods of ancestral state reconstruction.

I wanted to investigate the phylogenetic proximity of witnesses and the statistical support for the published stemmas. After experiments with rooted graphs, I made a decision to use NeighborNets, in which splits are indicative of observed divergences and edge lengths are proportional to the observed differences. These unrooted split networks were preferable because they facilitated visual investigation, and also provided results for the subsequent steps. These involved exploring the topology and evaluating potential contaminations, guiding the elimination of taxa whose data would be redundant for establishing prior hypotheses of genealogical relationships. Analyses were conducted using all manuscript layers and critical editions, both with and without bootstrapping, thus obtaining results supported in terms of inferred trees as well as character data.

NeighborNet of the manuscripts and revisions from my data, generated with SplitsTree
(Huson & Bryant 2006)
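NeighborNet works from a matrix of pairwise distances between taxa. As a minimal sketch of how such a matrix could be derived from coded readings, the Python below computes the proportion of differing characters between each pair of witnesses. The function name, the `?` symbol for missing data, and the toy character strings are assumptions for illustration: the sigla are borrowed from the post, but the values are invented and are not taken from my NEXUS file, and SplitsTree computes its own distances from the data it is given.

```python
from itertools import combinations

def pairwise_distances(readings, missing="?"):
    """Proportion of differing characters for every pair of witnesses.

    `readings` maps a witness name to a string with one coded state
    per character; positions where either witness is missing are skipped.
    """
    dist = {}
    for a, b in combinations(sorted(readings), 2):
        compared = differing = 0
        for x, y in zip(readings[a], readings[b]):
            if x == missing or y == missing:
                continue
            compared += 1
            differing += x != y
        dist[(a, b)] = differing / compared if compared else 0.0
    return dist

# Invented toy data (hypothetical states, not the real collation).
readings = {
    "Rb": "0011?",
    "Urb": "00110",
    "Triv": "11010",
}
distances = pairwise_distances(readings)
for pair, value in distances.items():
    print(pair, value)
```

In this toy case Rb and Urb end up at distance zero, which is the kind of signal that would place two witnesses close together in a split network.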

The analysis confirmed most of the conclusions of Shaw (2011) — there are no doubts about the proximity and distinctiveness of Ash and Ham, with Sanguineti's hypothesis (in which they are collaterals) better supported than Petrocchi's hypothesis (in which the first is an ancestor of the second). The proximity of Mart and Triv was confirmed; but the position of the ancestors postulated by Petrocchi and Sanguineti should be questioned in face of the signals they share with LauSC, perhaps because of contamination. The most important finding, in line with Shaw and in contrast with the fundamental assumption of Sanguineti, is the clear demonstration of the relationship between Rb and Urb.

The relationship analyses allowed the generation of trees for further evaluation. Despite the goal of a full Bayesian tree-inference, I discarded this option because, without a careful and demanding selection of priors, it would yield flawed results. As such, I made the decision to build trees using both stochastic inference and user design (i.e. manually). This postponed more complex topology analyses for future research, but generated the structures needed by the subsequent investigation steps; both trees are included in the datafile.

The second tree (shown below), allowing polytomies and manually constructed by myself, tries to combine the findings of Petrocchi and Sanguineti by resolving their differences with the support of the relationship analyses. Using Petrocchi's edition as a gold standard, and considering only single hypothesis reconstructions, parsimonious ancestral state reconstructions agree with 9,016 characters (79.9%). When considering multiple hypotheses, instead, reconstructions agree with 10,226 characters (90.7%). Cases of disagreement were manually analyzed and, as expected, most resulted from readings supported by the tradition but refuted by Petrocchi on exegetic grounds.

My proposed tree for the manuscripts selected by Sanguineti,
generated with PhyD3 (Kreft et al., 2017).
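The difference between the "single hypothesis" and "multiple hypotheses" counts can be illustrated with a small sketch of parsimonious ancestral state reconstruction. The Python below uses the bottom-up pass of Fitch parsimony on a rooted binary tree and compares the reconstructed root states with a reference text. It is an assumption-laden illustration rather than the procedure used for the figures above: the tree, the witness names A to D, the character strings, and the reference are all invented, and the sketch does not handle the polytomies that my own tree allows.

```python
def fitch_root_states(tree, states):
    """Bottom-up Fitch pass on a rooted binary tree.

    `tree` is a nested tuple of leaf names; `states` maps each leaf to
    its state for one character. Returns the set of most parsimonious
    states at the root.
    """
    if isinstance(tree, str):  # a leaf carries its observed state
        return {states[tree]}
    left, right = (fitch_root_states(child, states) for child in tree)
    common = left & right
    return common if common else left | right

def agreement(tree, matrix, reference):
    """Count characters whose root reconstruction matches the reference.

    Returns (single, multiple): `single` counts characters where the
    root has exactly one state and it equals the reference; `multiple`
    counts characters where the reference is among the root states.
    """
    single = multiple = 0
    for i, ref_state in enumerate(reference):
        column = {name: seq[i] for name, seq in matrix.items()}
        root = fitch_root_states(tree, column)
        if ref_state in root:
            multiple += 1
            if len(root) == 1:
                single += 1
    return single, multiple

# Invented example: four witnesses, three characters.
tree = (("A", "B"), ("C", "D"))
matrix = {"A": "001", "B": "001", "C": "011", "D": "010"}
reference = "000"  # hypothetical "gold standard" readings

single, multiple = agreement(tree, matrix, reference)
print(single, multiple)
```

Here the first character is reconstructed unambiguously and agrees with the reference, the second is ambiguous but includes the reference reading, and the third is reconstructed unambiguously against it.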

This tree suggests that, in general, Petrocchi's network is better supported than the tree by Sanguineti, as phylogenetic principles lead us to expect: the first was built considering statistical properties and using all of the available data, while the second relied on many intuitions and assumptions never really tested. In particular, it supports the findings of Shaw and, as such, allows us to indicate the critical edition of Petrocchi as the best one. Even more important, however, it is further evidence of the usefulness of phylogenetic methods, when appropriately used, in stemmatics.

References

Alagherii, Dantis (2001) Comedìa. Edited by Federico Sanguineti. Firenze: Edizioni del Galluzzo.

Alighieri, Dante (1994) La Commedia Secondo L’antica Vulgata: Introduzione. Edited by Giorgio Petrocchi. Opere di Dante Alighieri v. 1. Firenze: Le Lettere.

Huson, Daniel H.; Bryant, David (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254–267.

Inglese, Giorgio (2007) Inferno, Revisione del testo e commento. Roma: Carocci.

Kreft, Lukasz; Botzki, Alexander; Coppens, Frederik; Vandepoele, Klaas; Van Bel, Michiel (2017) PhyD3: a Phylogenetic Tree Viewer with Extended PhyloXML Support for Functional Genomics Data Visualization. BioRxiv. Doi: 10.1101/107276.

Leonardi, Anna M.C. (1991) Introduzione. In: La Divina Commedia, by Dante Alighieri. Milano: Arnoldo Mondadori Editore.

Shaw, Prue (2011) Commedia: a Digital Edition. Birmingham: Scholarly Digital Editions.

Trovato, Paolo (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. https://www.youtube.com/watch?v=BfKUOAR9PXA. Date of access: March 19, 2017.

Wednesday, May 3, 2017

On stemmatics and phylogenetic methods

No se publica un libro sin alguna divergencia entre cada uno de los ejemplares. Los escribas prestan juramento secreto de omitir, de interpolar, de variar. [No book is published without some divergence between each of the copies. Scribes take a secret oath to omit, to interpolate, to change.] (Jorge Luis Borges, La lotería en Babilonia, in Ficciones, 1962)
This is the first in a series of posts on stemmatics, a field just as much in love with trees and networks as are phylogenetics and historical linguistics. Being an introduction, I explain what the field does, present the most important jargon, and offer a list of references that, while suitable for the audience of this blog, is denser than what one might expect for a blog post.

Thank you to Mattis and David for inviting me to write!

Textual criticism

Textual criticism (or, less precisely, "philology") is a discipline concerned with the investigation of the history of literary, legal, and religious texts for explaining how differences among the copies of a text (its "witnesses") arose, and with the production of "critical editions", either scholarly curated versions of a text that aim to reconstruct the lost original or corrected versions of an existing copy.

The problem of divergence between copies of text, with the accumulation of involuntary and deliberate errors, as well as the need for a systematic study of such differences, is as old as writing itself. For example, our current editions for the epic poems of Homer descend from Ancient philological attempts to restore an uncontaminated original (see the first two figures). These include the edition of Pisistratus (VI century BCE, which determined what was to be sung at the Panathenaic Games), and the so-called VMK (Viermännerkommentar, "commentary of the four men") of the Alexandrian School (I-II century BCE), which is generally assumed to be the root of the witnesses that we have.

Van der Valk's reconstruction of the sources for Venetus A, one of the most
important manuscripts of Homer's Iliad (source: Wikipedia).

Erbse's reconstruction of the sources for Venetus A, one of the most important
manuscripts of the Iliad (source: Wikipedia).

Before stemmatics, an edition could be based on a "good copy" (a version considered to be less contaminated or more faithful than others), on a "majority reading" (in which the most attested variant would be chosen), or on a principle of "eclecticism" (with each best reading individually selected by the editor's judgment). Each new version, as expected, contributed even more to the confusion, particularly when changes were deliberate.

Among the texts with long and complex traditions, objects of countless and sometimes bloody disputations on the "correct" readings, are the Bible and codes of laws, for which it was not uncommon to have a different version in each city, with predictable consequences. For example, the first published textual tree, as already covered in this blog (The first Darwinian evolutionary tree), was authored by Carl Johan Schlyter in 1827 in a study precisely on the multiple and conflicting copies of Swedish law.

As such, it is no surprise that objective approaches were soon developed (Homer's VMK edition being one of the first examples), culminating with the development of stemmatics, with its study of the genealogical relationship between witnesses, and its representation of such relationships by means of trees.

Stemmatics

As a scientific approach to textual criticism, stemmatics established itself from the beginning of the 19th century as an alternative to emendations based on the opinions and wishes of editors, possibly inspiring both Charles Darwin and August Schleicher (for a general discussion on the development and significance of this method, see Timpanaro 2005). However, more than a "source", we should consider it a branch equally stemming from the "cultural framework" (Macé and Baret 2006: 91) that also gave us Darwinism and historical linguistics.

As was true for these latter disciplines, stemmatics was at first opposed, because of the revolution it brought to its field, along with its genealogical trees. However, just as in these sister disciplines, the results of the new mindset introduced by the explanation of evolution with trees could not be ignored, and this approach is so central to textual criticism that the latter can be divided into periods before and after the work of Karl Lachmann, the "father" of stemmatics, in particular the publication of his edition of Lucretius' De rerum natura (1850). In his commentaries, besides demonstrating the number of lines per page in the lost manuscript at the root of the tradition, Lachmann was even able to demonstrate the kind of script used to write it (Lachmann 1850).

The choice of this work, given the importance of Lucretius in the development of the scientific mindset (and, as we should remember when dealing with cultural evolution, of Darwin's theories), is unlikely to be a coincidence, but this is a matter for a different blog post.

Trees

Genealogical trees are so central to the stemmatic method that the field itself is actually named after them. The main goal of an editor is to produce a stemma codicum ("family tree of manuscripts"), or simply stemma, a tree-like structure that supports the textual emendation and represents the "tradition" (the witnesses' genealogy), in analogy with the family trees of Roman families that figured in many texts reviewed by 19th century philologists. Stemma, in fact, is a Greek word meaning garland or wreath, that was incorporated in Imperial Latin to designate a family tree (and, figuratively, nobility itself), as family trees were drawn with a stemma at their top.

In short, stemmatics begins with a recensio, which is an investigation of all total and partial copies of a work. This review is followed by a collatio, a systematic scrutiny of the manuscripts' contents, when readings are aligned and compared. The results of this alignment are used to produce the stemma, following the principle that "community of errors implies community of origin". By analyzing the stemma and the errors, editors finally proceed to the emendatio, which is a reconstruction that explains the known variants, and is intended to represent the "archetype" (a lost witness at the root of the ramification, assumed to be closer to the original than any other copy).
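The principle that "community of errors implies community of origin" can be sketched in a few lines of Python. The example below assumes that the collatio has already been reduced to a table of readings per locus, and that the editor has decided which reading at each locus is not an error. Both are strong assumptions, since deciding what counts as an error is exactly where expert judgment enters; the witness names X, Y, Z and the readings are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def shared_errors(collation, accepted):
    """Count, for each pair of witnesses, the loci where both carry the
    same reading and that reading differs from the accepted one."""
    counts = Counter()
    for locus, readings in collation.items():
        for a, b in combinations(sorted(readings), 2):
            if readings[a] == readings[b] != accepted[locus]:
                counts[(a, b)] += 1
    return counts

# Invented collation: locus -> reading of each (hypothetical) witness.
collation = {
    1: {"X": "amor", "Y": "amor", "Z": "amore"},
    2: {"X": "cielo", "Y": "celo", "Z": "celo"},
    3: {"X": "stelle", "Y": "stelle", "Z": "stelle"},
    4: {"X": "luce", "Y": "voce", "Z": "voce"},
}
# Readings the editor assumes to be correct at each locus.
accepted = {1: "amore", 2: "cielo", 3: "stelle", 4: "luce"}

counts = shared_errors(collation, accepted)
print(counts.most_common())
```

Pairs with many shared errors are candidates for a common ancestor, although a real analysis would also have to separate monogenetic from polygenetic errors before drawing any edge of the stemma.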

A stemma is conventionally drawn top-to-bottom, with vertical placements roughly indicating the date of the manuscript (the higher, the older). Solid edges ("arrows") indicate descent, while dashed ones imply contamination (scribes using more than one source). Witnesses are usually labeled with abbreviated names or Latin letters, when the manuscript is available, or with Greek letters, when it is missing (with α usually reserved for the archetype and ω for the original). Below is a reproduction of Petrocchi's partial stemma for the tradition of Dante Alighieri's Divine Comedy, which I will cover in a future post. Note that the genealogy is actually a reticulating network rather than a simple tree.

Petrocchi's partial stemma for the Divine Comedy, presented in the
introduction to his critical edition (1965).

The example stemma offered by Maas (1958), adapted below, is still useful to demonstrate the principles of stemmatics. In this example, for a textual emendation manuscript H should be eliminated (as it descends from F), as well as I and J (copies of G). Manuscript C shows a contamination from its collateral D, something which should be considered when weighing errors. Sub-archetypes β and γ are to be inferred from the available witnesses of their branches, and their readings will have the same weight as K, the only member of the third family branching from the archetype (even though it is a recent manuscript), in establishing the "lesson" of α. Errors might be presumed in α itself, or even in the original ω, and in both cases a corrected "lesson" might be offered by the editor on the basis of internal and external evidence.

Exemplary stemma adapted from Maas (1958).
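Two of the steps described for this example, the elimination of copies that descend from a preserved witness and the equal weighting of the three branches under α, can be sketched in Python. Only the relations stated in the text are taken from the example (H from F; I and J from G; β, γ and K from α). The placement of F and G under γ, the function names, and the toy readings are assumptions made so that the code can run, and the tie rule is a simplification of what an editor would actually do.

```python
from collections import Counter

# Parent of each node in a partial, partly hypothetical stemma.
parent = {
    "beta": "alpha", "gamma": "alpha", "K": "alpha",  # stated in the text
    "F": "gamma", "G": "gamma",                       # assumed placement
    "H": "F", "I": "G", "J": "G",                     # stated in the text
}
extant = {"F", "G", "H", "I", "J", "K"}  # preserved witnesses in this sketch

def has_extant_ancestor(witness, parent, extant):
    """True if some ancestor of `witness` is itself a preserved witness."""
    node = parent.get(witness)
    while node is not None:
        if node in extant:
            return True
        node = parent.get(node)
    return False

def eliminate_descripti(parent, extant):
    """Keep only witnesses that do not descend from a preserved witness."""
    return sorted(w for w in extant if not has_extant_ancestor(w, parent, extant))

def archetype_reading(branch_readings):
    """Reading supported by most branches, or None when the top two tie."""
    ranked = Counter(branch_readings.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no majority: left to the editor's judgment
    return ranked[0][0]

kept = eliminate_descripti(parent, extant)
print(kept)
print(archetype_reading({"beta": "x", "gamma": "y", "K": "x"}))
```

The vote treats K as a full branch on its own, which mirrors the point made above that a recent manuscript can weigh as much as a whole family when it is the only representative of its branch.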

Adoption and practice

Stemmatics has been criticized and confronted since Lachmann's time. It requires very specialized knowledge, for example in distinguishing between monogenetic and polygenetic errors, i.e. those that arose once and those that emerged independently more than once (and that, as such, are not disjunctive). A number of its suppositions are routinely called into question, such as the idea that each copy always derives from a single source (accepting contamination, at most), that each copy has at least the same number of errors as its source, and, fundamentally, that traditions have one and only one archetype.

Many measures tend to be adopted to reduce the editorial effort. These include eliminating manuscripts considered to be descripti (i.e. proved to descend from a preserved witness, in theory sharing all the errors of their sources), and only performing the collatio on a set of critical passages (loci critici). While a complete stemma and a full collatio are desirable, such compromises might be unavoidable for long texts with ample traditions. For example, in the case of Dante Alighieri's Divine Comedy, after considering the time employed by scholars such as Petrocchi, Sanguineti, and Shaw for their editions, Trovato (2016) estimated the duration of a full stemmatic approach at 400 man-years.

An alternative to stemmatic methods and suppositions, which also reduces the editorial effort, is found in scholars who follow the work of Joseph Bédier, who successfully challenged the limits of stemmatics by adopting a renewed version of the method of the "good copy" for his editions of medieval texts. The Bédierian method does not refute a scientific approach or methods such as the recensio, the collatio, or even the production of a stemma, but these are used to support the editor's judgment in selecting and curating a bon manuscrit, a good version of the text to be corrected only where errors can be proved beyond reasonable doubt. In short, trees (and networks) have been central to textual criticism even when stemmatics itself, as a method, is being challenged.

Considering the editorial effort and the analogies with linguistics and biology, it is no surprise that digital workflows have been proposed, along with the development of computer resources and phylogenetic methods. Ideas for new approaches were explored by Froger (1969), and formal phylogenetic methods were attempted by Platnick and Cameron (1977). Recently, the number of editions supported by formal phylogenetic methods and software has increased (see, for example, Barbrook et al. 1998; Stolz 2003; and Lantin, Baret and Macé 2004), also in the face of scientific evaluations of performance (Roos and Heikkilä 2009).

Besides advances in speed and replicability, the new technologies are allowing us to expand the goals of the discipline, moving from electronic editing to computational philology. In fact, while the field has for centuries been defined by the production of critical editions, digital approaches have been shown to support a reduction in the importance of "authorial intention", allowing researchers to focus on the reception of texts by the public, in line with developments of literary theory (Jauss 1982), and with the goals established by the "New Philology" (Cerquiglini 1989). Manuscripts with readings that differ from a supposed original, traditionally described as "corrupted", are changing from copies that were meant to be discarded into data points that collaborate in an investigation of human history that is assisted by quantitative data and methods.

References

Barbrook A.C., Howe C.J., Blake N., Robinson P. (1998) The phylogeny of the Canterbury Tales. Nature 394 (6696): 839.

Cerquiglini B. (1989) Éloge de la variante: histoire critique de la philologie. Aux Travaux. Paris: Éditions du Seuil.

Froger J. (1969) La critique des textes et son automatization. Bulletin De L’Association Guillaume Budé 1(1): 125–129.

Jauss H.-R. (1982) Toward an Aesthetic of Reception. Minneapolis: University of Minnesota Press.

Lachmann C. (1850) De Rerum Natura. Commentarius. Berolini: Impensis Georgii Reimeri.

Lantin A.-C., Baret P.V., Macé C. (2004) Phylogenetic analysis of Gregory of Nazianzus’ Homily 27. 7èmes Journées Internationales d’Analyse statistique des Données Textuelles, pp. 700-707.

Maas P. (1958). Textual Criticism. Translated by Barbara Flower. Oxford: Oxford University Press.

Macé C.; Baret P.V. (2006) Why phylogenetic methods work: the theory of evolution and textual criticism. Linguistica Computazionale. The Evolution of Texts: Confronting Stemmatological and Genetical Methods 24: 89–108.

Platnick N.I., Cameron H.D. (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380–385.

Roos T., Heikkilä T. (2009) Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets. Literary and Linguistic Computing fqp002.

Stolz M. (2003) New philology and new phylogeny: aspects of a critical electronic edition of Wolfram’s Parzival. Literary and Linguistic Computing 18(2): 139–150.

Timpanaro S. (2005) The Genesis of Lachmann's Method. Translated and edited by G. W. Most. Chicago: University of Chicago Press.

Trovato P. (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. See Youtube; date of access: March 19, 2017.

Tuesday, April 18, 2017

Multimedia phylogeny?


Evolutionary concepts have often been transferred to other fields of study, or derived independently in them, especially in anthropology in the broadest sense, covering all cultural products of the human mind. This includes phylogenetic studies of languages, texts, tales, artifacts, and so on — you will find many examples of such studies in this blog. One of the more recent applications has been to what is sometimes called multimedia phylogeny — the research field that "studies the problem of discovering phylogenetic dependencies in digital media".

I have noted before that phylogenetics in the biological sense is an analogy when applied to other fields, because only in biology is genetic information physically transferred between generations — in the other fields, cultural information transfer is all in the minds of the people, not in their genes (see False analogies between anthropology and biology). This analogy often becomes problematic when applied to other fields, because the practical application of bioinformatics techniques separates the informatics from the bio, and the mathematical analyses focus on trying to implement the informatics without any biological justification.


A recent paper that discusses the application of bioinformatics to multimedia phylogeny exemplifies the potential problems:
Guilherme D Marmerola, Marina A Oikawa, Zanoni Dias, Siome Goldenstein, Anderson Rocha (2016) On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS One 11(12): e0167822.
The authors described their background information thus:
Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works, are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance.
However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework.
So, their solution to the separation of bio from informatics is to try a range of techniques, none of which are based on any particular model of how phylogenetic changes might occur in text documents. All of these methods involve distance-based tree-building.

The essential problem, as I see it, is that without a model of change there is no reliable way to separate phylogenetic information from any other type of information. For example, similarity can arise from many sources, only some of which provide information about phylogenetic history; phylogenetic similarity is a form of "special similarity". In biology, other sources of similarity are usually lumped together as chance similarities, such as convergence, parallelism, etc. Without this basic separation of phylogenetic and chance similarity, it does not matter how many distance measures you use, or how many tree-building methods you employ: if you can't separate phylogeny from chance then you are wasting your time constructing a hypothetical evolutionary history.

The authors' only saving grace is their claim that: "In text phylogeny, unlike stemmatology [the analysis of hand-written rather than digital texts], the fundamental aim is to find the relationships among near-duplicate text documents through the analysis of their transformations over time." The expectation, then, is that the phylogenetic similarity of the texts will be high, which will thus reduce the possibility of chance similarities. Sadly, it will also reduce the probability that the similarities will contain any phylogenetic information at all — this is the classic short-branches-are-hard-to-reconstruct problem in phylogenetics.

For digital texts, the authors employ three distance measures: edit distance, normalized compression distance, and cosine similarity. None of these is model-based in any phylogenetic sense (although the first one is used in alignment programs such as Clustal); I have discussed this in the post on Non-model distances in phylogenetics. Their tree-building methods include parsimony, support vector machines (a machine-learning form of classification), and random forests (a decision-tree form of classification). Once again, none of these is model-based in terms of textual changes.
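To make the three measures concrete, here is a minimal Python sketch of each. It is only an illustration of the general techniques, not the implementation used in the paper: the choice of zlib as the compressor and of whitespace-separated word counts as the vector representation are my own assumptions.

```python
import math
import zlib
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ncd(a: str, b: str) -> float:
    """Normalized compression distance, with zlib as an assumed
    stand-in for whatever compressor one prefers."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between word-count vectors
    (whitespace tokenization is an assumption)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Note that none of the three functions knows anything about how texts actually change during copying; they simply score how alike two strings are, which is the point made above.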

A final issue is the insistence on trees as the model of a phylogeny. In stemmatology, for example, a network is a more obvious phylogenetic model, because hand-written texts can be copied from multiple sources. Indeed, this distinction plays an important role in the first application of phylogenetics to stemmatology (see the post on An outline history of phylogenetic trees and networks). Perhaps this is not an issue for "near-duplicate text documents", but it does seem like an unnecessary restriction. Moreover, one of the empirical examples used in the paper actually has a network history, which therefore does not match the authors' reconstructed tree.
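The difference between a tree and a network can be stated very simply in terms of data: in a tree every copy has exactly one source, whereas in a network at least one copy has two or more. The following Python sketch checks for this, assuming the history is given as a list of (source, copy) pairs. The function name and the labels A to D are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def reticulation_nodes(edges):
    """Return the documents that have more than one source,
    given a list of (source, copy) pairs."""
    parents = defaultdict(set)
    for source, copy in edges:
        parents[copy].add(source)
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


# Hypothetical copying history: D was compiled from both B and C.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(reticulation_nodes(edges))  # ['D']
```

An empty result means the history can be drawn as a tree; any document in the result is one that a tree-only method is forced to misrepresent.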

Tuesday, August 9, 2016

Network of Linné's "Philosophia Botanica" editions


Carl von Linné's book Philosophia Botanica (1751) was arranged as a series of botanical aphorisms, which he had expanded over the 15 years since he first developed them. During those years, he settled on binomial nomenclature as his preferred naming system, and he presents this in Philosophia Botanica, so that the book has considerable historical interest for biologists.

Recently, János Podani and András Szilágyi (History and Philosophy of the Life Sciences, in press) have pointed out a basic inconsistency in this book, relating to Linné's calculation of how many possible plant genera there could be, given the morphological features he used to distinguish among them.


Linné did not do a good job with this calculation, as these authors show. Indeed, the correct calculation is far more complex than Linné realized, but even given his simplifications his arithmetic is faulty. There are basic inconsistencies among the aphorisms, where the numbers do not "add up" when some of the aphorisms are compared. In essence, 31 plant parts are defined in one aphorism but this becomes "n=38" in a later aphorism; and then 4n² is claimed to be "5736" rather than 5776.
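The arithmetic is easy to check. The short Python sketch below simply evaluates the expression 4n² for the two values of n that appear in the aphorisms; the function name is my own, and nothing here attempts the more complex correct calculation.

```python
def four_n_squared(n: int) -> int:
    """Evaluate the expression 4 * n**2 from the aphorisms."""
    return 4 * n ** 2


print(four_n_squared(38))  # 5776, not the printed 5736
print(four_n_squared(31))  # 3844, the value consistent with 31 plant parts
```

So the printed figure of 5736 is wrong for n=38 by 40, and wrong again if n is taken to be the 31 parts actually defined.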

This then raises the issue of how this error was treated in subsequent editions of Philosophia Botanica. Podani and Szilágyi trace the error through 14 subsequent editions, showing that the various editors of those editions dealt with the issue in different ways. The history of these editions can be represented as a phylogenetic diagram, which the authors also provide.


This history turns out to be a network, because some of the later editions were compiled from several earlier editions. The network is rooted at the bottom, and each network edge is implicitly directed away from the root. The book editions are named using their place and time of publication.

Note that one particular "solution" to the arithmetical issue arises independently in three separate editions of the book. That is, the three editions on the network's right independently correct the 4n² problem but do not correct the 31=38 problem.

Also, note that no editions since 1787 actually correct both errors (i.e. they show both n=31 and 4n²=3844). Recent editions are reprints of the original erroneous version.