Wednesday, December 4, 2013

The phylogenetics of Little Red Riding Hood

A couple of weeks ago we received an unexpected influx of visitors to this blog, directed here by an article at the NBC News site. This article cited one of our blog posts (Network analysis of Genesis 1:3) as an example of the use of phylogenetic analysis in stemmatology (the discipline that attempts to reconstruct the transmission history of a written text). The NBC article itself is about a recently published paper that applies these same techniques to an oral tradition instead: the tale of Little Red Riding Hood. This paper has generated much interest on the internet, being reported in many blog posts, on many news sites, and in many tweets. After all, the young lady in red has been known for centuries throughout the Old World.

Needless to say, I had a look at this paper (Jamshid J. Tehrani. 2013. The phylogeny of Little Red Riding Hood. PLoS One 8: e78871). The author collated data on various characteristics of 58 versions of several folk tales, such as plot elements and physical features of the participants. These tales included Little Red Riding Hood (known as Aarne-Thompson-Uther tale ATU 333), which has long been recorded in European oral traditions, along with variants from other regions, including Africa and East Asia (where it is known as The Tiger Grandmother). They also included another widespread international folk tale, The Wolf and the Kids (ATU 123), which has been popular throughout Europe and the Middle East. As the author notes: "since folk tales are mainly transmitted via oral rather than written means, reconstructing their history and development across cultures has proven to be a complex challenge."

He produced phylogenetic trees from both parsimony and Bayesian analyses, along with a neighbor-net network. He concluded: "The results demonstrate that ... it is possible to identify ATU 333 and ATU 123 as distinct international types. They further suggest that most of the African tales can be classified as variants of ATU 123, while the East Asian tales probably evolved by blending together elements of both ATU 333 and ATU 123." His network is reproduced here.

There is one major problem with this analysis: all three graphs are unrooted, and you can't determine a history from an unrooted graph. A phylogeny needs a root, in order to determine the time direction of history. Without time, you can't distinguish an ancestor from a descendant — the one becomes the other if the time direction is reversed. Unfortunately, the author makes no reference to a root, at all.

So, his recognition of three main "clusters" in his graphs is unproblematic (ATU 333; East Asian; and ATU 123 + African), although the relationship of these clusters to the "India" sample is not clear (as shown in the network). On the other hand, his conclusions about the relationships among these three groups are not actually justified in the paper itself.
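To make the point concrete, here is a minimal Python sketch of why three clusters joined in an unrooted graph do not, by themselves, tell a history. It uses only the three cluster names from the discussion above; the function names are my own illustrative choices, and no real character data or branch lengths are involved. It simply enumerates what happens when the root is placed on each of the three branches in turn.

```python
# The three clusters named in the text, joined at one internal node of an
# unrooted tree. No branch lengths or character data are used here.
CLUSTERS = ("ATU 333", "East Asian", "ATU 123 + African")


def root_on_branch(clusters, rooted_branch):
    """Root a three-cluster unrooted tree on the branch to `rooted_branch`.

    Returns (first_diverging, sister_pair): the cluster on the rooted branch
    splits off first, and the remaining two become sister groups.
    """
    if rooted_branch not in clusters:
        raise ValueError(f"unknown cluster: {rooted_branch!r}")
    sisters = tuple(c for c in clusters if c != rooted_branch)
    return rooted_branch, sisters


def describe(first_diverging, sisters):
    """Render one rooted history as a readable sentence."""
    return (f"{first_diverging} diverges first; "
            f"{sisters[0]} and {sisters[1]} are sisters")


if __name__ == "__main__":
    for branch in CLUSTERS:
        print(describe(*root_on_branch(CLUSTERS, branch)))
```

The same unrooted arrangement yields three different histories, one per branch, and each makes a different claim about which group is oldest. The sketch ignores the further possibility of a root inside one of the clusters.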

Rooting the trees

So, the thing to do is put a root on each of the graphs. We cannot do this for the network, but we can root the two trees, and we can take the nearest tree to the network and root that, instead.

There are several recognized ways to root a tree in phylogenetics (Huelsenbeck et al. 2002; Boykin et al. 2010):
  1. a character transformation series (i.e. non-reversible substitution models)
  2. an outgroup
  3. mid-point rooting
  4. assume clock-like character replacement (e.g. the molecular clock).
The first one implies that we know the order in which at least some of the characters changed through time, which is not true for these folk tales. The second one requires us to know the next most closely related folk tale, which we cannot decide in this case. The third one is always possible, for any tree; and the fourth one is possible if a likelihood model has been used to model character changes. So, in this case, we can apply both options 3 and 4.
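For readers unfamiliar with option 3, midpoint rooting places the root halfway along the longest path between any two tips. The following is a rough Python sketch of that idea, not the PAUP* or SplitsTree implementation. The tip labels (A, B, C), the internal node (X) and the branch lengths are invented for illustration, and the function names are hypothetical.

```python
def build_adjacency(edges):
    """Turn (node, node, branch_length) triples into an undirected adjacency map."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def farthest(adj, start):
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nbr, length in adj[node].items():
            if nbr != parent:
                stack.append((nbr, node, dist + length, path + [nbr]))
    return best


def midpoint_root(adj):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (u, v, offset): the root sits on edge u-v, `offset` away from u.
    """
    any_node = next(iter(adj))
    a, _, _ = farthest(adj, any_node)       # one end of the longest path
    _, longest, path = farthest(adj, a)     # the other end, with the path
    half = longest / 2.0
    walked = 0.0
    for u, v in zip(path, path[1:]):
        length = adj[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length
    raise ValueError("tree has no edges")


# Invented example: three tips attached to one internal node X.
tree = build_adjacency([("A", "X", 1.0), ("B", "X", 2.0), ("C", "X", 5.0)])
print(midpoint_root(tree))
```

In this toy tree the longest path runs from C to B (length 7), so the root falls on the branch leading to C, 3.5 units from the tip. Note that the method effectively assumes roughly equal rates of change along the two most divergent lineages, so a single long branch can drag the root towards itself.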

I therefore did the following:
  • For the parsimony analysis, I imported the author's consensus tree into PAUP* (the program he used to produce it), calculated the branch lengths with ACCTRAN optimization, and found the midpoint root.
  • For the Bayesian analysis, I re-ran the MrBayes analysis exactly as described by the author, except that I added a relaxed clock (with the independent gamma rates model for the variation of the clock rate across lineages).
  • For the phylogenetic network, the neighbor-net is basically the network equivalent of a neighbor-joining tree, and so I calculated this in SplitsTree (the program the author used), and found the midpoint root.
  • Also, the strict clock version of a neighbor-joining tree is a UPGMA tree, which I calculated using SplitsTree.
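As a rough illustration of the last item, here is a textbook-style UPGMA sketch in Python. It is not SplitsTree's code, and the three taxa and their distances are made up. The function name and the nested-tuple output format are my own assumptions. The point is only that UPGMA joins the closest pair, averages distances weighted by cluster size, and places every node at half the joining distance, so that all tips end up equidistant from the root.

```python
def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left, right, height), where height is the
    distance from that node down to its tips (the strict-clock assumption).
    """
    # Active clusters: id -> (subtree, number of tips in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        dist = {pair: d for pair, d in dist.items()
                if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2.0), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Invented distances: A and B are close, C is far from both.
names = ["A", "B", "C"]
matrix = [[0.0, 2.0, 6.0],
          [2.0, 0.0, 6.0],
          [6.0, 6.0, 0.0]]
print(upgma(names, matrix))
```

Because the root is simply the last join, UPGMA always returns a rooted tree, but the root is only as trustworthy as the assumption of a constant rate of change.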
The complete trees can be seen elsewhere (ParsimonyMidpoint; BayesRelaxed; NJmidpoint; UPGMA), but the figure below shows the relevant parts of the four rooted trees. As you can see, the first three analyses agree on the root location (shown at the left of each graph), with only the UPGMA tree suggesting an alternative.

Having the East Asian samples as the sister to the other tales does not match what would be expected for the historical scenario suggested by the original author from his unrooted graphs — that the East Asian tales "evolved by blending together elements of both ATU 333 and ATU 123".

Instead, this placement exactly matches an alternative theory that the author explicitly rejects: "One intriguing possibility raised in the literature on this topic ... is that the East Asian tales represent a sister lineage that diverged from ATU 333 and ATU 123 before they evolved into two distinct groups. Thus, ... the East Asian tradition represents a crucial 'missing link' between ATU 333 and ATU 123 that has retained features from their original archetype ... Although it is tempting to interpret the results of the analyses in this light, there are several problems with this theory."

The UPGMA root, on the other hand, would be consistent with the blending theory for the origin of the East Asian tales. However, this tree actually presents the African tales as distinct from ATU 123, rather than being a subset of it.

Anyway, the bottom line is that you shouldn't present scenarios without a time direction. History goes from the past towards the present, and you therefore need to know which part of your graph is the oldest part. A family tree isn't a tree unless it has a root.


Boykin LM, Kubatko LS, Lowrey TK (2010) Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae). Molecular Phylogenetics & Evolution 54: 687-700.

Huelsenbeck J, Bollback J, Levine A (2002) Inferring the root of a phylogenetic tree. Systematic Biology 51: 32-43.


  1. If you take another look you'll see that there IS an analysis using rooted trees in the paper (an ancestral state reconstruction), and on the basis of that and other analyses the East Asian origin hypothesis is rejected, for reasons described at some length in the paper.

    1. Thanks for your comment, Jamie. However, I do not see that in the paper you explicitly provide a root for your diagrams, although you do indeed discuss the data as though there is a root. The implied root appears to be where the three groups join, which I thought would be worth testing quantitatively. The data analyses do not place the root at the place that you imply, which weakens the discussion somewhat. It would be interesting to provide an outgroup root, but I presume that this is unrealistic.

  2. Thanks for this post. About ways to root phylogenetic trees, I have proposed another method: see A Cosmic Hunt in the Berber sky: a phylogenetic reconstruction of Palaeolithic mythology. Les Cahiers de l'AARS, 15, 2013: 93-106. What do you think of that? (Additionally, there are many peer-reviewed papers about the use of phylogenetic tools to study myths and folktales here:, including a critical review of J. Tehrani.)

    1. I have read the paper that you refer to, and I have now written a blog post about the rooting method that you used. The root location that you decide on seems rather unlikely.