The Genealogical World of Phylogenetic Networks: Directed graph

Showing posts with label Directed graph. Show all posts

Thursday, February 27, 2014

Roots and the phylogenetics of mythology

A few weeks ago I discussed the phylogenetic analysis of the tale of Little Red Riding Hood (The phylogenetics of Little Red Riding Hood). In that case, I pointed out that historical reconstructions require a rooted tree, and I discussed various possible methods for rooting the unrooted trees produced by the data analyses.

This is not the only time that phylogenetics has been applied to myths or tales. For example, d'Huy (2013a) has studied the prehistoric Polyphemus tale belonging to the European and North Amerindian areas, and d'Huy (2013b) has studied the mythological motif of the Cosmic Hunt linked to the Big Dipper constellation (typical for northern and central Eurasia and for the Americas but unknown on other continents). In the first case a binary matrix of 98 characteristics for 44 versions of the tale was used, and in the latter 93 characteristics for 47 versions. Both of these studies have rooted trees.

In the latter case, a novel method of rooting the tree was used. The unrooted tree was successively rooted with each of the likely versions of the tale as outgroup. In each case the ancestral tale (the protomyth) was reconstructed and the ancestral states of the tale's characteristics (called mythemes) were determined. The author then "selected the version that holds the majority of the wide shared mythemes (>50%) as the better root."

Unfortunately, this produced an unexpected root, as shown in the tree below. The colors in the tree refer to various geographical groupings of the tale versions.

So, I re-analyzed the data using the rooting methods that I previously applied to the Red Riding Hood analysis:

For the bayesian analysis, I used MrBayes (2 runs, 4 chains, 1,000,000 generations, sampling frequency 1000, 25% burnin) with a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
For the neighbor-joining tree I used the BioNJ algorithm in PAUP*, and found the midpoint root.
For the parsimony analysis, I used a 200-replicate parsimony-ratchet search via PAUP*, calculated the branch lengths of the majority-rule consensus tree with ACCTRAN optimization, and found the midpoint root.

These three alternative roots are also shown on the tree. They seem more likely than the published root.

Geographically, the root chosen by the author's method is within the red group (tales from Asia), based on the idea that "arguments in favour of localization of protypical Cosmic Hunt in Asia seem persuasive (Berezkin 2005)." Unfortunately, this a priori argument seems to have excluded any testing of the possibility that more than one version is the sister to the remaining tales — that is, only single outgroups were considered.

On the other hand, all three of the alternative roots group the tales into two major clades. For the bayesian-clock root the two clades have distinct animal motifs, a herbivore and a carnivore, respectively. These clades do not correspond to any of the three variants recognized by Berezkin (2005).

The bayesian-clock root puts the red-colored (Asia) versions of the tale into one of the two major clades, as it also does with the orange group (Africa), which makes this root more consistent with the geographical groupings — that is, all of the geographical groups are in only one of the two major clades, except for the purple group (American coast-plateau / British Columbia). Both the Parsimony and NJ roots do the same thing, but as well as the purple group they also split the pink group (northeastern America) between the two major clades, which reduces their geographical consistency compared to the bayesian-clock root.

The bayesian-clock root does not support the suggestion that the Cosmic Hunt myth originated in Asia. Indeed, the bayesian tree does not support any particular geographical location. Furthermore, the polyphyly of the purple group presents an intriguing aspect of the tale's history.

References

Yuri Berezkin (2005) The cosmic hunt: variants of a Siberian—North-American myth. Folklore 31: 79-100.

Julien d'Huy (2013a) Polyphemus (Aa. Th. 1137): a phylogenetic reconstruction of a prehistoric tale. Nouvelle Mythologie Comparée 1: 1-21.

Julien d'Huy (2013b) A cosmic hunt in the Berber sky: a phylogenetic reconstruction of a Palaeolithic mythology. Les Cahiers de l’AARS 16: 93-106.

Wednesday, December 4, 2013

The phylogenetics of Little Red Riding Hood

A couple of weeks ago we received an unexpected influx of visitors to this blog, being directed here by at article at the NBC News site. This article cited one of our blog posts (Network analysis of Genesis 1:3) as an example of the use of phylogenetic analysis in stemmatology (the discipline that attempts to reconstruct the transmission history of a written text). The NBC article itself is about a recently published paper that applies these same techniques to an oral tradition instead — the tale of Little Red Riding Hood. This paper has generated much interest on the internet, being reported in many blog posts, on many news sites, and in many twitter tweets. After all, the young lady in red has been known for centuries throughout the Old World.

Needless to say, I had a look at this paper (Jamshid J. Tehrani. 2013. The phylogeny of Little Red Riding Hood. PLoS One 8: e78871). The author collated data on various characteristics of 58 versions of several folk tales, such as plot elements and physical features of the participants. These tales included Little Red Riding Hood (known as Aarne-Uther-Thompson tale ATU 333), which has long been recorded in European oral traditions, along with variants from other regions, including Africa and East Asia (where it is known as The Tiger Grandmother), as well as another widespread international folk tale The Wolf and the Kids (ATU 123), which has been popular throughout Europe and the Middle East. As the author notes: "since folk tales are mainly transmitted via oral rather than written means, reconstructing their history and development across cultures has proven to be a complex challenge."

He produced phylogenetic trees from both parsimony and bayesian analyses, along with a neighbor-net network. He concluded: "The results demonstrate that ... it is possible to identify ATU 333 and ATU 123 as distinct international types. They further suggest that most of the African tales can be classified as variants of ATU 123, while the East Asian tales probably evolved by blending together elements of both ATU 333 and ATU 123." His network is reproduced here.

There is one major problem with this analysis: all three graphs are unrooted, and you can't determine a history from an unrooted graph. A phylogeny needs a root, in order to determine the time direction of history. Without time, you can't distinguish an ancestor from a descendant — the one becomes the other if the time direction is reversed. Unfortunately, the author makes no reference to a root, at all.

So, his recognition of three main "clusters" in his graphs is unproblematic (ATU 333; East Asian; and ATU 123 + African) although the relationship of these clusters to the "India" sample is not clear (as shown in the network). On the other hand, his conclusions about the relationships among these three groups is not actually justified in the paper itself.

Rooting the trees

So, the thing to do is put a root on each of the graphs. We cannot do this for the network, but we can root the two trees, and we can take the nearest tree to the network and root that, instead.

There are several recognized ways to root a tree in phylogenetics (Huelsenbeck et al. 2002; Boykin et al. 2010):

a character transformation series (i.e. non-reversible substitution models)
an outgroup
mid-point rooting
assume clock-like character replacement (e.g. the molecular clock).

The first one implies that we know the order in which at least some of the characters changed through time, which is not true for these folk tales. The second one requires us to know the next most closely related folk tale, which we cannot decide in this case. The third one is always possible, for any tree; and the fourth one is possible if a likelihood model has been used to model character changes. So, in this case, we can apply both of options 3 and 4.

I therefore did the following:

For the parsimony analysis, I imported the author's consensus tree into PAUP* (the program he used to produce it), calculated the branch lengths with ACCTRAN optimization, and found the midpoint root.
For the bayesian analysis, I re-ran the MrBayes analysis exactly as described by the author, except that I added a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
For the phylogenetic network, the neighbor-net is basically the network equivalent of a neighbor-joining tree, and so I calculated this in SplitsTree (the program the author used), and found the midpoint root.
Also, the strict clock version of a neighbor-joining tree is a UPGMA tree, which I calculated using SplitsTree.

The complete trees can be seen elsewhere (ParsimonyMidpoint; BayesRelaxed; NJmidpoint; UPGMA), but the figure below shows the relevant parts of the four rooted trees. As you can see, the first three analyses agree on the root location (shown at the left of each graph), with only the UPGMA tree suggesting an alternative.

Having the East Asian samples as the sister to the other tales does not match what would be expected for the historical scenario suggested by the original author from his unrooted graphs — that the East Asian tales "evolved by blending together elements of both ATU 333 and ATU 123".

Instead, this placement exactly matches an alternative theory that the author explicitly rejects: "One intriguing possibility raised in the literature on this topic ... is that the East Asian tales represent a sister lineage that diverged from ATU 333 and ATU 123 before they evolved into two distinct groups. Thus, ... the East Asian tradition represents a crucial 'missing link' between ATU 333 and ATU 123 that has retained features from their original archetype ... Although it is tempting to interpret the results of the analyses in this light, there are several problems with this theory."

The UPGMA root, on the other hand, would be consistent with the blending theory for the origin of the East Asian tales. However, this tree actually presents the African tales as distinct from ATU 123, rather than being a subset of it.

Anyway, the bottom line is that you shouldn't present scenarios without a time direction. History goes from the past towards the present, and you therefore need to know which part of your graph is the oldest part. A family tree isn't a tree unless it has a root.

References

Boykin LM, Kubatko LS, Lowrey TK (2010) Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae). Molecular Phylogenetics & Evolution 54: 687-700.

Huelsenbeck J, Bollback J, Levine A (2002) Inferring the root of a phylogenetic tree. Systematic Biology 51: 32-43.

Wednesday, November 13, 2013

Monophyletic groups in networks

I have noted before that taxonomic groups that are represented in any tree-like parts of a phylogeny can be considered to be monophyletic, but those that consist of hybrids cannot, unless we hypothesize a single hybrid origin for each group (How should we treat hybrids in a taxonomic scheme?). This issue arises from the concept that monophyletic groups must share an exclusive Most Recent Common Ancestor (MRCA), and this concept is not straightforward for a network compared to a tree.

This topic has been tackled mathematically a couple of times (see Huson and Rupp 2008; Fischer and Huson 2010), resulting in the recognition that for a network there are three main types of MRCA: conservative MRCA (or stable MRCA), Lowest Common Ancestor (or minimal common ancestor), and Fuzzy MRCA (see Networks and most recent common ancestors). These have definitions based on the Least Lower Bound and Greatest Lower Bound of mathematical lattices.

Unfortunately, there has been very little discussion of the topic in the biological literature. However, recently Wheeler (2013) has made a start. There is no reference to the mathematical work on MRCAs, but he considers what to do about the concepts of monophyly, paraphyly and polyphyly with respect to networks.

Basically, he suggests three new types of phyletic group: periphyletic, epiphyletic, and anaphyletic. He provides algorithmic definitions of these groups, relating them to the previous algorithmic definitions of monophyly, paraphyly and polyphyly. These new types concern groups that are monophyletic on a tree, but have additional gains or losses of members from network edges — that is, they lie somewhere between monophyletic and paraphyletic.

For example, an epiphyletic group would be one that is otherwise monophyletic but also contains one or more hybrids that have one of their parents from outside the group, while a periphyletic group would be monophyletic but has contributed as a parent to at least one hybrid outside the group. An anaphyletic group would have done both of these things. For clarification, Wheeler provides the following empirical example, based on Indo-European languages (where English is recognized as a "hybrid" of Germanic and Romance languages).

Reproduced from Wheeler (2013).

In terms of MRCA, it seems to me that all three new group types use the Lowest Common Ancestor model, which is the shared ancestor that is furthest from the root along any path (ie. the LCA is not an ancestor of any other common ancestor of the taxa concerned). However, this is only clear when we consider hybrids, in which the two (or more) parents contribute equally to the hybrid offspring. When dealing with introgression or horizontal gene transfer, where the parentage is unequal, then we approach the Fuzzy MRCA model, in which only a specified proportion of the paths (representing some proportion of the genomes) needs to be accommodated by the MRCA, thus keeping the MRCA close to the main collection of descendants.

What is not yet clear is whether we would want to recognize any of these new group types in a taxonomic scheme. I guess that this is something that the PhyloCode will have to think about, since it is based strictly on clades (although they are allowed to overlap).

References

Fischer J, Huson DH (2010) New common ancestor problems in trees and directed acyclic graphs. Information Processing Letters 110: 331–335.

Huson DH, Rupp R (2008) Summarizing multiple gene trees using cluster networks. Lecture Notes in Bioinformatics 5251: 296–305.

~~Wheeler WC (2013) Phyletic groups on networks. Cladistics (online early).~~
Wheeler WC (2014) Phyletic groups on networks. Cladistics 40: 447-451.

Wednesday, March 13, 2013

Topological restrictions: some comments

In a recent post (Different topological restrictions of rooted phylogenetic networks. Which make biological sense?), Leo and Steven synthesized a lot of the issues that have been raised in recent years, both in this blog and in the literature, about the underlying models for rooted phylogenetic networks. This is an excellent summary (and explanation) of the current situation; please read it if you have not already done so.

The key issue for biologists is: what are we trying to model biologically? In one sense the answer is trivial: "evolutionary history". However, this answer is uninformative, in the sense that it is too broad to be practical. We do not know enough about evolutionary history for it to be obvious which model(s) we should use, and we presumably never will know (we were not there to study it, and it happens too slowly to study it now).

Nevertheless, we need to make decisions about models, and in many cases quite detailed decisions. Therefore, we need public discussion about the various possible models that have been suggested, as well as suggestions for new models. Unfortunately, the impetus for this has almost always come form the computational side rather than the biological side, if only from practical necessity — a mathematician cannot proceed without a model. (A biologist can't either, but they seem to be much more comfortable with vague word models rather than quantitative mathematical ones.)

So, I will provide an initial response to Leo and Steven's points, as a staring point for discussion, in the order in which they raised them. Hopefully, someone else will have something to say, as well.

Rooted

If there is assumed to be a single origin of life, and evolution is assumed to be acyclic, then there will always be at least one root in every evolutionary network. Indeed, if we are studying a monophyletic group of organisms then there will always be precisely one root. But if we are studying a polyphyletic group then technically there are multiple roots, in the sense that we cannot reconstruct the most recent common ancestor of the group. In practice, we have ignored this issue, and simply reconstructed the ancestor anyway, as best we can, since that is what the current tree-building algorithms do. In this sense, the only difference between an evolutionary tree and an evolutionary network is the complication caused by horizontal gene transfer, in which we might consider a species' genome to be polyphyletic, as well.

So, a single artificial root will probably suffice — see Steven's earlier comment: "the multiple-root situation can probably be reduced to the single-root situation by having some kind of high-degree (i.e. unrefined) artificial root which is the parent of the real roots."

Acyclic

Acyclicity is an obvious criterion for evolution, because a descendant cannot be its own ancestor. So, in a network we should treat this as "sacred" — an evolutionary network in its final form cannot show any directed cycles.

This does not, I guess, preclude algorithms that do not themselves require acyclicity, as biology always violates mathematical assumptions (eg. normality, homoscedasticity, etc). However, I think that we should require the output to be acyclic.

This is not necessarily an issue for reconciliation between gene trees and species trees, as raised in the original blog post. In this scenario the resulting phylogeny is a tree, and therefore cannot show cycles, by definition.

Time-consistent

Time consistency is a stronger form of acyclicity, in the sense that the acyclic parts of a network are also restricted by time's arrow. It is another obvious criterion for evolution, because a descendant cannot reticulate with its predecessors.

However, this requirement is confounded by incomplete taxon sampling, as noted by Leo and Steven. To deal with this, we can allow "ghost" lineages representing hypothesized missing taxa. This does, nevertheless, raise the issue of where we draw the line. Any two trees can be reconciled by allowing enough ghost lineages, just as is done by algorithms for the gene duplication-loss problem, where minimizing the numbers of unobserved duplications and losses is the optimality criterion. Similarly, how many ghost lineages should we allow in a reticulating network? Perhaps we should be minimizing them, too.

There is also the more fundamental issue of whether time-consistency should actually be a requirement at all. It has been pointed out a number of times that this is not a requirement for anthropology, for example, where transfer of non-genetic information through time is not only possible but quite common. This has been discussed in earlier blog posts.

So, in a general sense we cannot make time consistency a mandatory requirement of evolutionary networks. However, we could do so for strictly biological networks, where information transfer occurs with gene flow, which must follow time's arrow.

Degree-restricted

Nodes with indegree 1 and outdegree 1 are usually considered to be unnecessary, because they do not represent anything more than would an edge (or arc). The only times they have been seriously suggested are when placing known fossils onto a tree, and the fossil is claimed to represent an ancestor of one of the nodes. However, this is considered to be inappropriate, because we can never be sure that any specified fossil is actually an ancestor, as opposed to being a sister to one of the ancestors (see the blog post at Evolving Thoughts).

Indegree >2 has also been considered to be problematic in phylogenetics. If a sexually reproducing organism has two parents, then theoretically we cannot observe indegree>2. In practice, however, evolutionary events may occur over a short enough time-scale that we cannot distinguish a series of individual indegree-2 events, so that for practical purposes a network might end up showing indegree >2. This is analogous to "soft polytomies" in trees, where outdegree >2 represents uncertainty.

Tree-child, Tree-sibling, Reticulation-visible

These network restrictions all refer to so-called "invisible" nodes (nodes from which all paths end up in reticulations). All nodes in a tree are visible, and so there is no analogous situation in tree-building from which we might derive a response (as I have done several times above).

The basic issue with invisible nodes is restricting their occurrence. Theoretically, we could add an infinite number of invisible nodes to a network, but few, if any, of them would be based on any inference from the data. So, tree-child networks ban them entirely, whereas tree-sibling and reticulation-visible networks allow them under particular circumstances.

It is difficult for me to see exactly how I should interpret invisible nodes biologically. How can I reconstruct their features, for example? Invisible nodes tend to appear when combining multiple trees into a network, and so they are not usually constructed directly from character data. Biologically, they might exist, but there seems to be little reason to consider them as well-supported inferences, as we do for visible nodes.

In this sense, they have much more of the feel of mathematical artifacts than do visible nodes.

Galled trees, Galled networks, Level-k

These are all artificial restrictions on networks that do nothing more than limit the complexity (and thus make the algorithms more tractable). There seems to be no good biologicals basis for using any of these criteria as restrictions (as opposed to using them as network descriptors).

There is, however, a good logical basis for their use — simpler networks are to be preferred over more complex ones, at least as working hypotheses for evolutionary history. The problem with this attitude seems to be that sometimes simpler networks look less biologically realistic than do more complex ones (eg. level-1 networks might spread the reticulations across the network whereas level-3 networks can concentrate them in one place).

Regular and normal

These always seem to me to be of mathematical interest for characterizing networks, rather than being of any biological interest. But maybe that is just my ignorance.

Directed Acyclic Graph or tree-with-edges-added?

To me , this is one of the BIG questions. Until biologists come to grips with this, I do not see how the computational people are going to make the helpful contributions that they obviously desire (or, at least, the ones I know). I have thought (and read) about this a lot, and I have come to the (perhaps unfortunate) conclusion that the answer is: "both".

From the mathematical point of view, the problem is simply that certain network topologies will not be considered when we add reticulation nodes to a tree. So, we automatically exclude those topologies if our algorithm starts with a tree. From the biological point of view, the issue is whether we consider evolutionary history to be approximately tree-like or whether it is fundamentally reticulated.

My reading of the literature is that those people dealing with hybridization in eukaryotes are leaning towards the "tree obscured by vines" model, whereas those dealing with horizontal gene transfer in prokaryotes prefer the "anastomosing plexus" model. Perhaps it is too soon to tell whether this dichotomy will continue; but, certainly, discussion about the hybridization issue goes back to the early 1980s, and since then the opinion about the fundamentally tree-like nature of history has been repeatedly expressed. Moreover, those people using reduced median networks and median joining networks for mitochondrial data seem to be interested in simplifying their initially complex network as much as possible, and then interpreting the resulting network in terms of a few conflicting parsimony trees; I interpret this as a preference for trees over networks.

Of course, preferring a reticulated tree model may just be historical inertia, whose importance in the daily lives of scientists should not be underestimated.

Sunday, March 3, 2013

Different topological restrictions of rooted phylogenetic networks. Which make biological sense?

Those readers active in the field of evolutionary phylogenetic networks will know that there are many different definitions of “rooted phylogenetic network” in circulation. While certain features are almost universal (e.g. rooted, acyclicity), many are not. Why do these differences arise? There are multiple answers to this. Some arise because of differing opinions on what biologically realistic is, and (relatedly) what the correct balance is between biological detail and mathematical abstraction. Others arise because they make optimization problems on networks easier to solve. This should not automatically be viewed as mathematics prescribing reality to biology, but rather as the observation that if evolution looks like this then certain optimization problems can be solved well. Finally, some differences are superficial; they do not lead to any intrinsic differences in the model or its underlying mathematical structure. Of course, “superficial” is highly context dependent, as some of the examples below will show.

Here we list some well-known and less-well known properties that have surfaced in definitions of rooted phylogenetic network. We will take as given that evolution is directed i.e. that the arcs in the network have an explicit orientation (representing the flow of time). Below most properties we show a figure of a network violating it.

The question to our readers, particularly those from the biological side of the spectrum: which of the following properties make sense? And which other restrictions would make more sense biologically?

Rooted

A root is a node with indegree 0, meaning that it does not have any ancestors. If evolution is assumed to be acyclic (see below) then there will always be at least one root. Most articles writing about rooted phylogenetic networks assume a single root (which is necessarily an ancestor of every node in the network). Some time ago on this blog David raised the question of whether it would not sometimes be better to allow multiple roots. This is an interesting point both from the perspective of interpretation (what does it mean?) and its impact on algorithmic efficiency.

Acyclic

Most models assume acyclicity: it is not possible to walk along the arcs of the network (respecting their orientation) such that you end up back where you started. The most intuitive argument for this is the passing of time: if arcs represent forward motion through evolutionary time, how can you end up back at an earlier point in time? Recently someone pointed out to me that in the reconciliation literature – where one shows how to reconcile a given gene tree with a given species tree – acyclicity is actually not sacred at all. The reason for this is that, without the acyclicity constraint, the problem becomes computationally tractable. See this recent RECOMB 2013 article [3] for an example of this.

Time-consistent

This is an interesting property. It was introduced to prevent reticulation events between non-contemporaneous taxa. That is, to avoid absurd situations such as an organism hybridizing with its ancestor. The most common mathematical articulation of time-consistency is this: it should be possible to put a “time-stamp” on each node of the network such that (a) time always moves strictly forward along tree-edges , and (b) all the nodes involved in a reticulation event have the same time-stamp. The figure here shows a network that is not time-consistent. For many contextualisations of “reticulation event” time-consistency seems to make a lot of sense. But there is a catch. A network might fail to be time-consistent only for the rather artificial reason that we failed to sample a taxon that was, in fact, part of the network (incomplete taxon sampling). Given the reality of incomplete data, demanding time-consistency might be too prohibitive. However, as with all restrictions in this blog, it is perhaps useful as a selection criterion for preferring one network over another.

Figure 1: A network that is not time-consistent. The red and blue node are the two parents of a reticulation node but cannot have coexisted in time.

Degree-restricted

The indegree of a node is the number of parents of it, and the outdegree of a node is the number of children of it. In a bifurcating, rooted phylogenetic tree all nodes have indegree-1 (except the root: indegree-0) and outdegree-2. Polytomies have outdegree-3 or higher. In a rooted phylogenetic network we also have reticulation nodes i.e. nodes with indegree-2 or higher. Articles differ in the degree-restrictions they place on nodes; there is an entire zoo of different permutations possible. Many articles agree that nodes with indegree-1 and outdegree-1 should not be allowed, because they are phylogenetically uninformative (and indeed such nodes are also rarely encountered in the phylogenetic tree literature). But what about polytomies? And what about reticulation nodes: should they be permitted to have indegree-3 or higher, and if so how should such “reticulate polytomies” be interpreted? From a parsimony perspective it is usual to argue that reticulate polytomies should be more “expensive” than indegree-2 reticulations, because a single reticulation polytomy can explain far more discordance than a single indegree-2 reticulation. Interestingly, some optimization problems are not really affected at all by degree-restrictions on phylogenetic networks (e.g. the hybridization number problem) while others are highly sensitive to degree-restrictions (e.g. the small parsimony problem).

Tree-child
Every node has at least one non-reticulate child. This restriction makes sure that there are no "invisible nodes", i.e. nodes from which all paths end up in reticulations. As a result, this restriction makes many computational problems more tractable and mathematical reasoning easier. For example, consider the basic Tree Containment problem, i.e. given a phylogenetic network and a phylogenetic tree, decide if the network displays the tree, see [8]. This problem was shown to be tractable for tree-child networks, while it is NP-hard for most other classes of networks (time-consistent, tree-sibling, regular). This seems important because if it is already hard to tell if a given tree is in a given network, then it seems a daunting task to try to build networks of that class from any kind of data.

It should be noted that there is of course no guarantee that "real" evolutionary histories are tree-child. In fact, simulation studies show that only under very low recombination rates one can expect tree-child networks [1][2].

Figure 2: A network that is not tree-child. The red node does not have a non-reticulate child.

Tree-sibling
Every reticulate node has at least one non-reticulate sibling. Originally introduced by Cardona, Llabrés, Rosselló and Valiente who write "Biologically, this condition means that for each of the reticulation events, at least one of the species involved in it has also some descendant through mutation" and showed that networks can efficiently be compared (with a polynomial-time computable metric) if they are both tree-sibling and time-consistent.

An advantage of the class of tree-sibling networks is that it is much larger than the class of tree-child networks. If a reticulation has no non-reticulate siblings, then its parents have no non-reticulate children. Hence, every tree-child network is tree-sibling. However, it can easily be seen that there are many tree-sibling networks that are not tree-child. Computationally, the tree-sibling restriction does not seem to help as much as the tree-child restriction. Again, there is no guarantee that the "real" network is tree-sibling, but it might be more likely, see [1][2].

Figure 3: A network that is not tree-sibling (the red node does not have a non-reticulate sibling).

Reticulation-visible

For every reticulation node there exists a leaf such that all paths from the root to the leaf pass through that reticulation node. This is another attempt to weaken the tree-child restriction, thus obtaining a larger class of networks for which many computational problems are still tractable. It is incomparable with tree-sibling (a reticulation-visible network does not have to be tree-sibling and vice versa) but this class clearly contains the class of tree-child networks, and in fact is much larger. It does not forbid all invisible nodes, but just invisible reticulation nodes. It has been shown in the book by Huson, Rupp and Scornavacca that the so-called Cluster Containment problem becomes tractable for reticulation-visible networks. If the same is the case for the Tree Containment problem is still an open question.

Figure 4: A network that is not reticulation-visible. The red reticulation node is not "visible".

Galled trees

All reticulation cycles are disjoint. Introduced by Gusfield, Eddhu and Langley [6] although studied before under different names. Makes computational problems much easier but seems biologically unrealistic. On the other hand, galled trees could have a future in the context of data-display networks, by using the galls to show where the reticulate activity is in the network, rather than claiming that each gall represents exactly one reticulation event.

Galled networks

Each arc leaving a reticulation node is a cut-arc. Introduced by Huson and Klöpper [7] (who gave a different but equivalent definition) as a generalization of galled trees, giving a fast algorithm for constructing galled networks from clusters. Hasn't been studied much since. (Be aware that there is also an article using "galled network" as an alternative term for galled tree.)

Figure 5: A network that is not a galled network. The red arc leaves a reticulation node but is not a cut-arc.

Level-k (not a restriction, but a measurement)

Also a generalization of galled trees (which are basically level-1 networks). Every network is a level-k network for some k. Hence, “level-k” should not really be viewed as a topological restriction, but rather as a measure of how “tangled” (intensely concentrated) the islands of reticulation are in the network. The higher k is, the more tangled the network is; level-0 networks are simply trees. For more information, see a previous blog. Other proposed measurements of tangledness include k-nested, r-reticulation and depth-k.

Regular and normal

If you see a phylogenetic network as a representation of a set of clusters, then it makes sense to consider regular networks (introduced by Baroni, Semple and Steel). A network is regular if it is the so-called "cover digraph" (Hesse diagram) of its set of clusters. Hence, for each set of clusters, there exists a unique regular network with precisely that set of (hardwired) clusters. Normal networks have the additional requirement of being tree-child, thus forming a very restricted class of networks. For example, the network in Figure 1 is not regular and hence also not normal.

DAG or tree-with-edges-added?

This is a rather subtle one. If one views a phylogenetic network as a tree with edges added, then this leads to a different space of networks than the “a phylogenetic network is essentially a directed acyclic graph” definition encountered in other articles. The point being that if you insist that a network has to be “grown” from some tree starting point (by adding edges in a certain way), certain topologies cannot be reached which can be reached if the we do not anchor it in this way. The following figure shows an example.

Figure 6: A network that cannot be obtained by adding edges to a tree (for common edge-adding rules). (Figure corrected on May 5, 2014.)

It can be shown that any tree-sibling network can be constructed by adding edges to a tree, but the network in Figure 3 shows that the converse is not true (the shown network can be constructed by adding edges to a tree, but is not tree-sibling).

This restriction touches on the fundamental question whether there exists something like a species tree, and if it might be possible to reconstruct this species tree before starting network analysis.

A related but stronger (?) restriction was recently used by Wu [9]. In his RECOMB 2013 article, he writes “(R1) For a network N, when only one of the incoming edges of each reticulation node is kept and the other is deleted, we always derive a tree T'.”

Conclusion

We see from this list that there are already some quite different topological properties and restrictions in circulation for rooted phylogenetic networks. To biologists these discussions might appear to be a strange side-show to keep computer scientists in work. But it runs deeper than that, because it touches on three fundamental issues. Firstly, what are we trying to model exactly? Secondly, the importance of understanding the networks that your favourite software for constructing rooted phylogenetic networks will not build, however biologically relevant, due to the fact that they are a priori excluded from its search space. Finally, since the total number of networks is huge, it could be inevitable to focus on certain restricted classes of networks when one wants to search through network-space efficiently.

Note: There is a follow-up post Topological restrictions: some comments.

REFERENCES

[1] Miguel Arenas, Mateus Patricio, David Posada and Gabriel Valiente. Characterization of Phylogenetic Networks with NetTest. In BMCB, Vol. 11:268, 2010.

[2] Miguel Arenas, Gabriel Valiente and David Posada, Characterization of Reticulate Networks Based on the Coalescent with Recombination, Mol Biol Evol (2008) 25 (12):2517-2520.

[3] Mukul S. Bansal, Eric J. Alm, Manolis Kellis, Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss, RECOMB 2013.

[4] Gabriel Cardona, Merce Llabres, Francesc Rossello, Gabriel Valiente, The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete, arXiv:0902.4640 [q-bio.PE], 2009.

[5] Gabriel Cardona, Mercè Llabrés, Francesc Rosselló and Gabriel Valiente. A Distance Metric for a Class of Tree-Sibling Phylogenetic Networks. In BIO, Vol. 24(13):1481-1488, 2008.

[6] Dan Gusfield, Satish Eddhu and Charles Langley. Efficient reconstruction of phylogenetic networks with constrained recombination. In CSB03, Pages 363-374, 2003.

[7] Daniel H. Huson and Tobias Klöpper. Beyond Galled Trees - Decomposition and Computation of Galled Networks. In RECOMB 2007, Vol. 4453:211-225 of LNCS.

[8] Leo van Iersel, Charles Semple and Mike Steel, Locating a Tree in a Phylogenetic Network, Information Processing Letters, 110 (23), pp. 1037-1043 (2010).

[9] Yufeng Wu. An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees. RECOMB 2013.

Monday, January 28, 2013

Cornets: from a tree to a network

I have commented before on the fact that, even in situations where people believe that evolutionary history is reticulate, phylogenies are still presented as a tree rather than as a network (see Why do we still use trees for the Neandertal genealogy? and Why do we still use trees for the dog genealogy?); and I shall presumably comment on this again in the future.

However, this blog post is about someone who stopped using a tree and started using a network: Niles Eldredge. More than 10 years ago (Eldredge 2000) he addressed the idea of the evolutionary history of the musical instrument known as a cornet (a soprano brasswind instrument similar to a trumpet), of which he had been a collector and historian for many decades (see Eldredge 2002).

Originally, he presented a phylogenetic tree of selected cornets, covering the range of known phenotypic variation, based on several diagnosable morphological features (described by Eldredge 2000, 2002, 2011). A version of this tree was published in an interview that he did for New Scientist in 2003 (Walker 2003), and the ideas were subsequently presented in interviews for several other media, including the New York Times (Wertheim 2004) and the Fibreculture Journal (Barnet 2004).

The original bush-like phylogenetic tree of 36 cornets (from New Scientist).
It is based on 17 characters, although there are only 14 synapomorphies shown.

His purpose was to show that cornets do not fit a traditional phylogeny well: they show a very punctuated history, with bursts of rapid radiation where features appear in many lineages. He attributed this topology to the distinct nature of evolution of cultural objects, where innovations developed in one lineage can immediately be transferred to other lineages, and even transferred to earlier parts of those lineages (see my previous post: Time inconsistency in evolutionary networks). That is, the bifurcating cladistic model of evolution does not apply — the tree looks more like a bush.

Eldredge also noted that this "further implies, as a practical matter, that most of the algorithms developed to reconstruct biological history are inappropriate for the reconstruction of material cultural systems" (quoted in Barnet 2004). Indeed, the presence of lateral transfer of ideas means that a reticulating diagram would be a preferable mode of presentation for the history of cornets. So, his final version of that history became a network.

The final network of 39 cornets (from Tëmkin & Eldredge 2007).
Note that the branch lengths represent time, not character change.

This network first appeared in 2007 in Current Anthropology, in a paper co-authored with Ilya Tëmkin (who apparently did the data analysis). Since then, it has appeared in various other places, including Eldredge (2009, p.302), Kelly (2010, p.51) and Eldredge (2011, p.372).

Sadly, I do not think that the method used to produce the network would be repeatable by anyone else:

"This phylogeny was generated by combining several analytical methods. First, three independent phylogenetic analyses were performed on the same data set. The neighbour-joining and maximum parsimony trees were computed in PAUP∗, and the reticulate network (based on the dissimilarity matrix generated in PAUP∗) was computed in T-Rex. The reticulate branches generated by T-Rex were subsequently plotted onto the neighbour-joining tree. The reticulate branches suggesting relationships that were not corroborated by the presence of at least a single character in any of the shortest trees in the maximum parsimony analysis were eliminated."

The data are available, if anyone wants to have a go at devising a different approach.

The paper by Tëmkin and Eldredge (2007) also contains an evolutionary analysis of the musical instrument known as a Baltic psaltery (a plucked stringed instrument similar to a zither). The phylogeny was presented as a tree in that paper, but the relationships have also been conceptualized as a network in a later analysis (see Veloz et al. 2012).

Note There is a follow-up post here: Trees and networks of written manuscripts.

References

Barnet B (2004) Material cultural evolution: an interview with Niles Eldredge. Fibreculture Journal Issue 3: FCJ-017.

Eldredge N (2000) Biological and material cultural evolution: are there any true parallels? In: Tonneau F., Thompson NS (eds) Perspectives in Ethology: Evolution, Culture, and Behavior, pp. 113-153. Kluwer Academic, New York.

Eldredge N (2002) An overview of piston-valved cornet history. Historic Brass Society Journal 14: 337-390.

Eldredge N (2009) Material cultural macroevolution. In: Prentiss AM, Kuijt I, Chatters JC (eds) Macroevolution in Human Prehistory: Evolutionary Theory and Processual Archaeology, pp. 297-316. Springer, New York.

Eldredge N (2011) Paleontology and cornets: thoughts on material cultural evolution. Evolution: Education and Outreach 4: 364-373.

Kelly K (2010) What Technology Wants. Viking Press, New York.

Tëmkin I, Eldredge N (2007) Phylogenetics and material cultural evolution. Current Anthropology 48: 146-153.

Veloz T, Tëmkin I, Gabora L (2012) A conceptual network-based approach to inferring the cultural evolutionary history of the Baltic psaltery. In: Miyake N, Peebles D, Cooper RP (eds) Proceedings of CogSci 2012: 34th Annual Meeting of the Cognitive Science Society, pp. 2487-2492. Cognitive Science Society, Austin TX.

Walker G (26 July 2003) The Collector. New Scientist 179: 38-41.

Wertheim M (9 March 2004) Scientist at work — Niles Eldredge: Bursts of cornets and evolution bring harmony to night and day. New York Times.