Friday, November 30, 2012

Description, explanation and prediction in phylogenetics

My recent post on the relationship between phylogenetic trees and networks (Are phylogenetic networks as scientific as trees?) has generated some comment, particularly with regard to the way in which description, explanation and prediction apply to phylogenetics.

By way of explanation, I have included here a specific example each of description, explanation and prediction using phylogenetic trees. They all come from my studies of one particular taxonomic group.


The phylum Apicomplexa (sometimes also known as Sporozoa) forms a large and diverse group of unicellular protists with a wide environmental distribution. They are obligate intracellular parasites, being the only large taxonomic group whose members are entirely parasitic. The phylum is traditionally considered to contain four clearly defined groups: the Coccidians, the Gregarines, the Haemosporidians and the Piroplasmids. The phylogenetic tree shown here (from Morrison 2009) is based on complete 18S rDNA sequences.

This tree is, in one sense, nothing more than a mathematical summary of some of the patterns in the aligned nucleotide data. However, if we accept the idea that this data summary represents the evolutionary history of the organisms (ie. the data summary represents the gene history and the gene history represents the organismal history), then the tree is also a quantitative description of that history.

In this particular example, however, the description is likely to be wrong, in at least some details. For example, it seems improbable that the Haemosporidians (Plasmodium and Hepatocystis) are derived from within the Gregarines. This placement is more likely to be the result of long-branch attraction, so that the data summary is in error (as the consequence of a mathematical artefact), which leads to an inaccurate description of the evolutionary history.


Cryptosporidium causes cryptosporidiosis in mammals. It has traditionally been classified with the Coccidians (see Ellis et al. 1998), a placement first suggested in 1907, based on features of the life-cycle, the macro- and microgamonts, and the oocysts (see Beĭer 2000). However, drugs that help treat coccidial infections (such as coccidiosis, toxoplasmosis, neosporosis and sarcocystosis in vertebrates) do not work on Cryptosporidium, an observation that has long puzzled parasitologists.

The earliest phylogenetic analyses of 18S rDNA from Apicomplexans called this taxonomic placement into question (Johnson et al. 1990), and this was repeatedly confirmed by later analyses (eg. Morrison & Ellis 1997). However, these analyses did not include representatives of all of the Apicomplexan groups (ie. they sampled only Coccidians, Haemosporidians and Piroplasmids), and the first analyses to also include the Gregarines (which infect invertebrates) indicated a sister-group relationship (Carreno et al. 1999). This phylogenetic placement of Cryptosporidium as sister to the Gregarines is the currently accepted one (Barta & Thompson 2006, Leander 2007, Morrison 2009).

Thus, the currently accepted phylogeny explains why the anti-coccidial drugs do not work on Cryptosporidium — it is not a Coccidian. The traditional taxonomy does not provide any such explanation.


Taxon sampling has been almost entirely opportunistic within the Apicomplexa, as it almost always is in parasitology. Opportunities for sampling arise principally from studies of medical diseases (eg. malaria, cryptosporidiosis and toxoplasmosis) and of veterinary diseases (eg. coccidiosis, neosporosis and babesiosis). This can create practical problems (eg. in epidemiology), such as when dealing with parasites that have a two-host life cycle but where only one of the hosts is known.

Sarcocystis is part of the Coccidia, causing sarcocystosis in vertebrates. It has a two-host (or indirect) life cycle: the definitive host (in which sexual reproduction occurs) is usually a carnivore, while the intermediate host (where asexual reproduction occurs) is usually a herbivore. Sometimes, parasites have been collected only in the intermediate host, and thus we need to predict the definitive host species, in order to direct the search for it. (Importantly, targeted searches use fewer experimental animals.) This prediction can be done using a phylogeny, as the prediction then comes from known hosts for the other parasite species within the same clade (monophyletic group).

The 18S rDNA phylogeny shown here is for part of Sarcocystis (it is taken from Morrison et al. 2004), and it also shows the known host species for each parasite species. This phylogeny can be used to predict that the most likely definitive host for Sarcocystis species V would be the same as the host for the other species in the monophyletic group labelled A, which would thus be a canid. Similarly, the predicted definitive host for Sarcocystis sinensis would be the same as the host for the other species in the monophyletic group labelled B, which is thus probably humans but possibly a felid.
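
As a rough illustration of the logic of this kind of prediction, here is a minimal Python sketch. The tree, the tip names (sp1, spX and so on) and the host labels are invented placeholders, not the data from Morrison et al. (2004), and the function name predict_host is hypothetical. It simply finds the smallest clade that contains the target parasite together with at least one species whose definitive host is known, and tallies those known hosts.

```python
from collections import Counter

# Hypothetical toy tree as nested tuples; all names and hosts are invented.
TREE = ((("sp1", "sp2"), "spX"), ("sp3", "sp4"))
KNOWN_HOSTS = {"sp1": "canid", "sp2": "canid", "sp3": "felid", "sp4": "felid"}


def tips(node):
    """Return the list of tip names below a node."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]


def predict_host(tree, target, known_hosts):
    """Tally the known hosts in the smallest clade that contains `target`
    and at least one other tip with a known host (None if there is none)."""
    def search(node):
        if isinstance(node, str):
            return None
        members = tips(node)
        if target not in members:
            return None
        # Prefer a smaller clade nested inside this one, if it is informative.
        for child in node:
            found = search(child)
            if found is not None:
                return found
        tally = Counter(known_hosts[t] for t in members
                        if t != target and t in known_hosts)
        return tally if tally else None

    return search(tree)


tally = predict_host(TREE, "spX", KNOWN_HOSTS)
print(tally.most_common())
```

In this toy case the tally for spX contains only canid hosts, which mirrors the reasoning used for group A above; a mixed tally would correspond to the less certain prediction for group B.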

In three cases this form of prediction of the definitive host of Sarcocystis species was tested by subsequent experimental infection studies (Dahlgren & Gjerde 2010; Gjerde & Dahlgren 2010), and the predictions were all confirmed to be correct.


Barta JR, Thompson RCA (2006) What is Cryptosporidium? Reappraising its biology and phylogenetic affinities. Trends in Parasitology 22: 463-468.

Beĭer TV (2000) [Article in Russian, with English abstract.] [Further comment on the coccidian nature of cryptosporidia (Sporozoa: Apicomplexa)]. Parazitologiia 34: 183-195.

Carreno RA, Martin DS, Barta JR (1999) Cryptosporidium is more closely related to the Gregarines than to Coccidia as shown by phylogenetic analysis of Apicomplexan parasites inferred using small-subunit ribosomal RNA gene sequences. Parasitology Research 85: 899-904.

Dahlgren SS, Gjerde B (2010) The red fox (Vulpes vulpes) and the arctic fox (Vulpes lagopus) are definitive hosts of Sarcocystis alces and Sarcocystis hjorti from moose (Alces alces). Parasitology 137: 1547-1557.

Ellis JT, Morrison DA, Jeffries AC (1998) The phylum Apicomplexa: an update on the molecular phylogeny. In GH Coombs, K Vickerman, MA Sleigh, A Warren (eds) Evolutionary Relationships Among Protozoa (Kluwer, Dordrecht) pp. 255-274.

Gjerde B, Dahlgren SS (2010) Corvid birds (Corvidae) act as definitive hosts for Sarcocystis ovalis in moose (Alces alces). Parasitology Research 107: 1445-1453.

Johnson AM, Fielke R, Lumb R, Baverstock PR (1990) Phylogenetic relationships of Cryptosporidium determined by ribosomal RNA sequence comparison. International Journal for Parasitology 20: 141-147.

Leander BS (2007) Marine Gregarines: evolutionary prelude to the Apicomplexan radiation? Trends in Parasitology 24: 60-67.

Morrison DA (2009) Evolution of the Apicomplexa: where are we now? Trends in Parasitology 25: 375-382.

Morrison DA, Bornstein S, Thebo P, Wernery U, Kinne J, Mattsson JG (2004) The current status of the small subunit rRNA phylogeny of the Coccidia (Sporozoa). International Journal for Parasitology 34: 501-514.

Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441.

Wednesday, November 28, 2012

Are phylogenetic networks as scientific as trees?

Description, explanation, prediction

Science can be characterized as involving: (i) description, (ii) explanation, and (iii) prediction. As scientists, we need objective and repeatable methods for all three of these. For example, we have devised quantitative methods of description involving standardized units of measurement, often involving machines to perform the actual measuring. We also have modeling procedures that allow us to explicitly incorporate explanatory ideas, as well as to make predictions; and we have philosophical methods for assessing whether inferences are justified or not.

Philosophers of science tend to have focussed on the role of explanation (ii) in science, often to the exclusion of description (i) and prediction (iii), but practicing scientists frequently spend more time on (i) than on either (ii) or (iii), especially in biology. Moreover, physical scientists frequently combine all three simultaneously, using mathematical equations not only to describe the observed data but also to explain it (via the components that are included in the underlying mathematical model) and to predict as-yet unobserved phenomena (by arithmetical extrapolation).

It seems to me that one of the things that makes the study of evolution a science (rather than being a study of natural history) is our recent attempts to reconstruct evolutionary history in an objective and repeatable manner (rather than producing untestable historical scenarios). These phylogenetic analyses have usually been based on a tree model, although the adequacy of this model has recently been questioned.

However, one issue that I have not seen addressed in the literature is the effect on the description / explanation / prediction triumvirate if phylogenetics moves from a tree model to a network model.

[Added note: see the next blog post for a further explanation and examples of Description, explanation and prediction in phylogenetics.]

Trees and networks

Using a phylogenetic tree to describe biodiversity is uncomplicated — the tree describes the historical relationships among the taxa. Furthermore, using the tree for explanation is also uncomplicated — many of the intrinsic characteristics of organisms are the result of inheritance from their ancestors, and therefore characteristics that are shared among taxa can be explained as resulting from shared common ancestors. Furthermore, using the tree for prediction simply involves the reverse logic — shared ancestry predicts the existence of shared characteristics, which may not yet have been observed.

This is actually a point that Darwin makes when introducing the tree metaphor in his book (1859). He points out that many previously unexplained facets of biology become explainable if one adopts the concept of a phylogenetic tree (for example, so-called natural classifications, or the obvious relationships among languages).

In this context, note the potential importance of the distinction between pattern reconstruction and process explanation. For example, (i) can be done from the perspective of simply displaying patterns, but this is likely to preclude (ii) and (iii). Description may thus be best done from the perspective of displaying patterns that are related solely to particular processes. Jonathan Losos (2011, Seeing the forest for the trees: the limitations of phylogenies in comparative biology. American Naturalist 177: 709-727), for example, has noted that "phylogenies are much more informative about pattern than they are about process."

Nevertheless, replacing the tree model with a network model is not necessarily straightforward, because the studied history now involves both horizontal and vertical descent. If we conceive of a network as being a set of inter-connected trees, then the tree components represent the vertical ancestor-to-offspring history while the reticulations (connecting the trees) represent the horizontal components of the history.

In this view, using a phylogenetic network to describe biodiversity is the same as for a tree — the network describes the historical relationships among the taxa, with a clear indication of the pathways of the vertical and horizontal components of that history.

Unfortunately, the same cannot necessarily be said for explanation. Without an indication of exactly which characteristics are involved in the reticulations, we cannot have an unambiguous explanation. Characteristics that are shared among taxa may be explained by either shared ancestors (a vertical explanation) or by reticulation (a horizontal explanation). A network topology alone will not necessarily provide an unambiguous explanation, whereas a tree topology can do so.

A more extreme problem arises for prediction. When predicting the existence of shared characteristics, should the prediction be based on shared vertical ancestry or shared horizontal history, or both? Since we are predicting the unknown, how can we decide on the appropriate prediction framework? With a tree there is no such choice to be made, and thus no ambiguity.

If reticulation occurs, then we can "explain" almost any set of observations by postulating a suitable reticulation event; and we could "predict" almost any future event in the same way. So, it seems that network models are not practical for explanation and prediction in quite the same way as are tree models alone. The extra complexity available for network description potentially becomes ambiguity when used for explanation or prediction.

This issue manifests itself in a number of ways. For instance, mathematical algorithms would need to be based on optimization criteria that have some biological relevance in terms of explanation, not just description. For example, minimizing the number of reticulations when constructing a network involves descriptive parsimony: we describe the data using a tree model plus the minimum possible number of reticulations. However, this does not involve ontological parsimony, in the sense that we are not thereby postulating that evolution proceeds in such a parsimonious manner. Descriptive parsimony does not necessarily provide a phylogenetic network that is best as an explanatory framework, nor as a predictive tool. The same can be said about maximum-parsimony trees, of course, but they are rarely used these days.

Moreover, phylogenetic networks may not even provide a concise description of reticulate evolution. For example, if two gene trees differ by just one so-called Rooted Subtree Prune and Regraft (rSPR) move then we can represent them by a network with one reticulation node (the two trees that are embedded in the network are simply the two gene trees). However, if the trees differ by two or more rSPR moves then a large number of reticulations may be needed in order to embed the two trees. So, a network can be a simple description of two conflicting trees, or it can also be much more complex than those two trees.
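
To make the one-reticulation case concrete, here is a small Python sketch, assuming a hypothetical three-taxon network (taxa a, b and c; the node names and helper functions are invented for illustration). Keeping one parent edge at a time for the reticulation node h yields the two embedded trees, ((a,b),c) and (a,(b,c)), which differ by a single rSPR move of b.

```python
from itertools import product

# Hypothetical toy network: node -> list of children.
# "h" has two parents ("x" and "y"), so it is the single reticulation node.
NETWORK = {
    "root": ["x", "y"],
    "x": ["a", "h"],
    "y": ["c", "h"],
    "h": ["b"],
}


def parents_of(network):
    """Invert the child lists into a node -> list-of-parents mapping."""
    parents = {}
    for node, children in network.items():
        for child in children:
            parents.setdefault(child, []).append(node)
    return parents


def reticulation_nodes(network):
    """Nodes with more than one parent."""
    return sorted(n for n, ps in parents_of(network).items() if len(ps) > 1)


def leaves_below(node, network, dropped):
    """Leaf set under `node`, ignoring the (parent, child) edges in `dropped`."""
    if node not in network:
        return frozenset([node])
    result = frozenset()
    for child in network[node]:
        if (node, child) not in dropped:
            result |= leaves_below(child, network, dropped)
    return result


def displayed_trees(network, root="root"):
    """Yield the non-trivial clusters of each tree obtained by keeping
    exactly one parent edge per reticulation node."""
    parents = parents_of(network)
    retics = reticulation_nodes(network)
    all_leaves = leaves_below(root, network, set())
    for kept in product(*(parents[r] for r in retics)):
        dropped = {(p, r) for r, keep in zip(retics, kept)
                   for p in parents[r] if p != keep}
        clusters = set()
        for node in network:
            leaf_set = leaves_below(node, network, dropped)
            if 1 < len(leaf_set) < len(all_leaves):
                clusters.add(leaf_set)
        yield clusters


for clusters in displayed_trees(NETWORK):
    print([sorted(c) for c in clusters])
```

With k reticulation nodes this enumeration visits up to 2^k parent choices, which is one way to see how quickly a network can become a more complex object than the trees it is meant to summarize.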

What I have said so far refers to evolutionary networks, which are intended to explicitly reflect evolutionary history. It is worth pointing out that data-display networks, on the other hand, are intended to provide description but not explanation or prediction. That is, they display the observed data without necessarily providing any explanation for the patterns displayed or necessarily allowing explicit predictions. Nevertheless, they are intended to provide insights that might contribute to explanations, and therefore predictions. They play a valuable role in exploring data to find the best description and to identify possible explanations.

Monday, November 26, 2012

Molluscs on Monday

This week, for Monday we have a phylogenetic tree constructed from the organisms whose relationships are represented. This one uses their shells to depict the evolutionary relationships between the eight classes of molluscs: Bivalvia (bivalves), Cephalopoda (octopuses, squids, etc), Chaetodermomorpha (caudofoveates), Gastropoda (snails, slugs), Monoplacophora, Neomeniomorpha (solenogasters), Polyplacophora (chitons), Scaphopoda (tusk shells).

The picture is by Richard Edwards, who took the photo at the Oxford University Museum of Natural History.

Wednesday, November 21, 2012

Phylogenetic position of turtles: a network view

The evolutionary history of turtles has been difficult to determine. Historically, turtles were thought to be early diverging reptiles (called anapsids), but recent morphological studies have allied turtles with lizards and snakes (squamates) plus tuataras (together, the lepidosaurs). These relationships are indicated at the top left and top right of the first figure, respectively.

Four hypotheses about the evolutionary relationships of turtles.
The figure is adapted from Hedges (2012).

However, most molecular studies support neither of these hypotheses, as shown in the bottom two parts of the figure. To quote from Parham et al. (2012):

  • Recently, several molecular data sets have recovered support for a novel turtle-crocodilian clade [bottom right of the figure] (Hedges and Poling 1999; Mannen and Li 1999; Cao et al. 2000; Shedlock et al. 2007) or a novel turtle-bird clade (Cotton and Page 2002). However, support for these topologies over an alternative where turtles are the sister taxon to a monophyletic Archosauria [birds + crocodiles; bottom left of the figure] is often weak (Cao et al. 2000; Iwabe et al., 2005; Katsu et al. 2009). The majority of recent molecular analyses support a monophyletic Archosauria (Iwabe et al. 2005; Hugall et al. 2007; Alfaro et al. 2009; Katsu et al. 2009; Lyson et al. 2012).

A number of research groups have recently tackled this phylogenetic problem using genome-wide datasets for various representatives of the taxonomic groups, including Chiari et al. (2012), Crawford et al. (2012), and Tzika et al. (2011). Sadly, they come to a diversity of conclusions; and here I use a phylogenetic network to explore why this might be so.

The sequence alignment used by Crawford et al. (2012) is freely available in the Dryad database, and so it provides a good starting point. The objective here is Exploratory Data Analysis (EDA), to investigate the characteristics of the data before the data are used to formally test the above four hypotheses about turtle relationships. For this I have performed a NeighborNet analysis using the SplitsTree program.

NeighborNet analysis of the aligned sequence data
provided by Crawford et al. (2012).

The NeighborNet displays 99.3% of the data, and so almost all of the data patterns are shown in the splits graph. I have numbered the nine best-supported splits in the data, and shown their location in the graph as well as their relative weights. [The weights represent the relative amount of data supporting each split — a greater weight means more support.]

Note that splits 1-6 & 9 are consistent with the hypothesis in the bottom left of the first figure, and none of the other hypotheses are supported by these seven splits. So, these seven splits appear in the tree produced by Crawford et al. (if Human is the outgroup root), and thus they represent the phylogenetic signals detected by the authors.

However, there are two other well-supported splits (7 & 8) that contradict this tree, and thus they create complexity that is not recognized by the authors. Note that split 7 contradicts splits 4, 5 & 6, and that split 8 contradicts splits 4 & 9. These two splits thus represent data that refute the hypothesis of relationships favoured by the authors, as well as contradicting all three of the other hypotheses. Of course, splits 7 & 8 do not appear in the tree because there is at least one stronger split that contradicts them (eg. split 4).
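
The sense in which one split contradicts another can be checked mechanically: two splits of the same taxon set are compatible (that is, they can appear in the same tree) only if at least one of the four pairwise intersections of their sides is empty. A minimal Python sketch follows; the taxon labels and the function name are hypothetical placeholders, not the taxa or the numbered splits in the graph above.

```python
def compatible(side_a, side_b, taxa):
    """Return True if two splits of `taxa` can coexist in a single tree.

    Each split is given by one of its sides; the other side is the complement.
    The splits are compatible if at least one of the four intersections
    between a side of the first and a side of the second is empty.
    """
    taxa = frozenset(taxa)
    a1 = frozenset(side_a)
    a2 = taxa - a1
    b1 = frozenset(side_b)
    b2 = taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


# Invented placeholder taxa, purely for illustration.
TAXA = {"t1", "t2", "t3", "t4", "t5"}

print(compatible({"t1", "t2"}, {"t1", "t2", "t3"}, TAXA))  # nested sides
print(compatible({"t1", "t2"}, {"t2", "t3"}, TAXA))        # overlapping sides
```

The first pair is nested and so can sit in the same tree, while the second pair overlaps on every side and cannot; a splits graph displays such conflicting pairs as sets of parallel edges rather than discarding the weaker one.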

Of particular note, the complexity created by split 7 involves the relationship between the turtles and the tuatara, while split 8 involves the relationship between the turtles and the crocodilians. This emphasizes just why there are so many different hypotheses about turtle relationships — many contradictory relationships are supported by at least some of the data! This calls into question the strong conclusions reached by the authors from these data.

In this context, it is worth emphasizing that split 9, which supports the Archosaurs, is rather small. This contradicts the quote above from Parham et al. (2012), which indicates that molecular data usually support this group more strongly than alternative phylogenetic arrangements.

However, it is the tuatara relationship that is one of the keys to understanding the complexity of turtle relationships. It is therefore unfortunate that there are no other available datasets to test this relationship further. Those studies with genomic data available do not include the tuatara; and those genomic studies that do include the tuatara apparently do not have their aligned molecular data freely available online (and sometimes both issues apply).

One potentially interesting genome study from the former group is that of Chiari et al. (2012). Their sequence alignment is also freely available in the Dryad database, and so it is possible to perform an EDA here, as well. As above, I have performed a NeighborNet analysis using the SplitsTree program.

NeighborNet analysis of the aligned sequence data
provided by Chiari et al. (2012).

The NeighborNet displays 99.0% of the data, and so almost all of the data patterns are shown in the splits graph. I have shown the same nine splits where they are supported by the data. Note that (i) splits 5 & 7 are missing because the tuatara is absent from the dataset, and (ii) split 8 involves crocodile and painted turtle, both of which are also absent from the data.

Once again, split 9 (supporting the Archosaurs) is very small, thus confirming this part of the results of Crawford et al. However, in this case there is a more strongly supported split, labelled X, that contradicts split 9. This second dataset is therefore consistent with the hypothesis in the bottom-right of the first figure; and this is reflected in the rooted tree produced by Chiari et al. Split X does exist in the NeighborNet of the data of Crawford et al., but it has a weight of 0.00003, and so it is almost impossible to detect visually in the graph.

So, these two datasets apparently support two different hypotheses of turtle relationships. However, both datasets also provide incompatible data patterns within themselves, as discovered by the EDAs, and so they do not necessarily provide strong support for any one hypothesis of turtle relationships. It seems that we need more data for the tuatara, so that it can be incorporated into datasets such as that of Chiari et al.

I will finish by noting that the genome study of Tzika et al. (2011) provides no resolution of this situation. Their Figure 4a matches the result of Chiari et al. (turtle + crocodile) and their Figure 4c matches the result of Crawford et al. (birds + crocodile)! The multi-gene study of Shen et al. (2011), on the other hand, supports the Archosaurs (birds + crocodiles).


Chiari Y., Cahais V., Galtier N., Delsuc F. (2012) Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (Archosauria). BMC Biology 10: 65.

Crawford N.G., Faircloth B.C., McCormack J.E., Brumfield R.T., Winker K., Glenn T.C. (2012) More than 1000 ultraconserved elements provide evidence that turtles are the sister group of archosaurs. Biology Letters 8: 783-786.

Hedges S.B. (2012) Amniote phylogeny and the position of turtles. BMC Biology 10: 64.

Parham J.F., et al. (2012) Best practices for justifying fossil calibrations. Systematic Biology 61: 346-359.

Shen X.-X., Liang D., Wen J.-Z., Zhang P. (2011) Multiple genome alignments facilitate development of NPCL markers: a case study of tetrapod phylogeny focusing on the position of turtles. Molecular Biology and Evolution 28: 3237-3252.

Tzika A.C., Helaers R., Schramm G., Milinkovitch M.C. (2011) Reptilian-transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles. EvoDevo 2: 19.

Monday, November 19, 2012

The Future of Phylogenetic Networks: Presentations

This note is just to make sure that everyone knows that PDF files of most of the presentations from the workshop The Future of Phylogenetic Networks are now online at the Lorentz Center web page.

Those of you who were there can check up on things you may have missed, and those of you who were not there can see what it was all about. Some of the presentations have been modified slightly (to remove unpublished or confidential data, for example).

Note especially that Leo van Iersel's final summary of the Workshop is an excellent digest of what was said.

Wednesday, November 14, 2012

Family trees, pedigrees and hybridization networks

A family tree is technically called a pedigree. This is because it is not really a tree. Branches do not fuse in a tree, whereas in a pedigree every individual is the fusion of two genealogical branches. That is, in sexually reproducing species, every offspring is the hybrid of two parents. A family tree is only a tree if you trace one pair of ancestors through their descendants while ignoring the spouses.

So, a pedigree is a network not a tree, and specifically it is a hybridization network. This can be seen most clearly when there is a considerable level of inbreeding going on. Under these circumstances, both spouses are likely to be offspring of the same ancestors in the not-too-distant past, and so they will both be connected by the network branches. We are all of us connected in the human pedigree network, of course, but for most of us our (shared) common ancestor is a long way back in the past.

A high degree of inbreeding is common in many human cultures, but it is particularly prevalent among royalty, even in cultures with relatively little inbreeding among the common populace. I will illustrate this phenomenon with what is often considered to be the most extreme example recorded: the inbreeding that led to the demise of the Spanish branch of the Habsburg dynasty in 1700 (other branches of the House of Austria continued until 1780).

The Spanish branch of the Habsburgs were kings of Spain from 1516 to 1700. Under Habsburg rule, Spain reached the peak of its power in Europe (covering Spain, the Netherlands and parts of Italy), and the world-wide Spanish Empire reached its greatest extent. The last king of this dynasty was Charles II, who was the product of such serious inbreeding that he was disfigured, physically disabled and mentally retarded (see Alvarez et al. 2009 for a full description). The fact that he had no children led to the War of the Spanish Succession, although this was mostly precipitated by the reaction of the reigning French king, Louis XIV.


The basic issue here is that the Spanish Habsburgs tried to keep power by literally "keeping it in the family". During the last three-quarters of their time, from 1551 to 1700, no outsider married into the Spanish royal family. Indeed, if one looks at the six kings from 1497 (when Philip the Fair married Joanna I of Castile and Aragon, and thus became Philip I), then we note that there were 11 marriages, most of which were among blood relatives: two uncle-niece marriages, one double first cousin marriage, one first cousin marriage, two first cousins once removed marriages, one second cousin marriage, and two third cousin marriages. (See Wikipedia for an explanation of these relationship terms.) This gave Charles II an inbreeding coefficient of 0.254 (calculated by Alvarez et al. 2009); for comparison, the offspring of a brother-sister union would have a value of 0.250, as would the offspring of a parent-child union. Philip III (Charles II's grandfather) also reached a high level: 0.218. Both of these people were the offspring of uncle-niece marriages.
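
For readers who want to see where values such as 0.250 come from, here is a minimal Python sketch of the standard recursive kinship calculation, in which the inbreeding coefficient of an individual is the kinship coefficient of its two parents. The pedigree below is a hypothetical toy example with invented names (it is not the Habsburg pedigree, and it is not the procedure of Alvarez et al. 2009), and founders are assumed to be unrelated and non-inbred.

```python
from functools import lru_cache

# Hypothetical toy pedigree: child -> (father, mother).
# Anyone who is not a key is treated as an unrelated, non-inbred founder.
PEDIGREE = {
    "son": ("dad", "mum"),
    "daughter": ("dad", "mum"),
    "niece": ("outsider", "daughter"),
    "sib_child": ("son", "daughter"),   # brother-sister union
    "pc_child": ("dad", "daughter"),    # parent-child union
    "un_child": ("son", "niece"),       # uncle-niece union
}


def depth(ind):
    """Number of generations below the founders (founders have depth 0)."""
    if ind not in PEDIGREE:
        return 0
    return 1 + max(depth(parent) for parent in PEDIGREE[ind])


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that an allele drawn from a and one drawn from b
    are identical by descent."""
    if a == b:
        if a not in PEDIGREE:
            return 0.5
        father, mother = PEDIGREE[a]
        return 0.5 * (1.0 + kinship(father, mother))
    # Recurse through the deeper individual, which cannot be an ancestor
    # of the other one.
    if depth(a) < depth(b):
        a, b = b, a
    if a not in PEDIGREE:
        return 0.0  # two distinct founders are assumed unrelated
    father, mother = PEDIGREE[a]
    return 0.5 * (kinship(father, b) + kinship(mother, b))


def inbreeding(ind):
    """Inbreeding coefficient: the kinship of the individual's parents."""
    if ind not in PEDIGREE:
        return 0.0
    father, mother = PEDIGREE[ind]
    return kinship(father, mother)


for child in ("sib_child", "pc_child", "un_child"):
    print(child, inbreeding(child))
```

A single uncle-niece union among otherwise unrelated people gives 0.125 in this sketch, so a value of 0.254 can only arise when the parents are themselves connected through many additional shared ancestors.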

This first diagram (linked from Wikipedia) shows the pedigree of Charles II, the final member of the dynasty. It illustrates the above points in the usual manner for a family tree. It shows only the royal lineage, as there were many other offspring, and indeed other marriages (Philip II married four times, Philip IV twice, and Charles II also twice). However, none of the male offspring were alive at the time of the death of Charles II, and nor were most of the females. Another of the consequences of the inbreeding was a poor survival rate among the children.

My point with this blog post is that the family tree can also be drawn as a network, as shown in the second diagram (which is also called a "path diagram" by geneticists). This illustrates the same pedigree as above, but with a few additions (at the left) to illustrate the lineage to Don Carlos (crown prince Charles), another highly inbred male (coefficient 0.211), being the offspring of double first cousins. This form of the diagram makes the connection between a family tree and a hybridization network clear — they are both ways of drawing a pedigree.

Basically, the two diagrams illustrate the same point: the Habsburgs defeated their own purpose, because they ultimately lost power by refusing to share it with anyone else. Biology is about biodiversity, and conserving biodiversity applies within your own family just as much as anywhere else.

There are several follow-up posts on this topic, about other famous people:
Charles Darwin's family pedigree network
Toulouse-Lautrec: family trees and networks
Albert Einstein's consanguineous marriage

Further reading

Alvarez G., Ceballos F.C., Quinteiro C. (2009) The role of inbreeding in the extinction of a European royal dynasty. PLoS ONE 4: e5147.

If you know little about the pros and cons of inbreeding, then this blog post will enlighten you:
Why inbreeding really isn’t as bad as you think it is.

Monday, November 12, 2012

An early tree of languages

In response to my recent blog post on Relationship trees drawn like real trees, what is probably the earliest genealogical tree of languages has been pointed out to me by Johann-Mattis List. This is an obvious omission from my earlier post, which suggested that the earliest such tree was published in 1853; and so I have reproduced it here.

Genealogical Tree of Dead and Living Languages,
by Félix Gallet (c. 1800).

There seems to be little information available about the origin of this figure. It is undated, but is apparently from the period 1795-1800, which would also make it the first "tree drawn as a real tree", slightly predating Augustin Augier's tree from 1801. It is a single engraved broadside sheet, rather than a figure in a book, and only two examples seem to be known, one in the Bibliothèque Nationale de France (see the auction catalogue #1441 by Maggs Bros) and the other recently acquired by Princeton University Library (see the blog post by Julie Mellby).

Félix Gallet's use of the title "Arbre Généalogique" [genealogical tree] makes the phylogenetic context of the figure clear. The tree representation may be a response to William Jones' suggestion of a historical relationship among Indo-European languages, notably in his 1786 book The Sanskrit Language, in which he suggested the possible historical affinity of Sanskrit and Persian with Greek and Latin: "they came from a common source, which perhaps no longer exists." This is usually considered to be the first addition of a historical component to the traditional spatial (geographical) one of comparative linguistics.

What is more important to us, in this blog, is that the tree is clearly a network, as several of the languages are shown as hybrid developments of other languages (eg. Etruscan, French, Greek, Latin). Many of the suggested relationships are no longer accepted, of course. For example, Swedish is derived from German (as shown in the tree for Flemish and Dutch) not Runic, and English is now considered to be a classic example of a hybrid language (originally Germanic but now with extensive Romance influence via French; see, for example, the blog post by Seth Long).

Indeed, Sylvain Auroux (1990, pp 213-238 in Leibniz, Humboldt, and the Origins of Comparativism ed. Tullio De Mauro and Lia Formigari) has noted that Gallet's tree is an intermediary between modern genealogical ideas about language history and the preceding interest in spatial comparisons of languages, along with the Biblical-scholar tradition that all European languages derive from Hebrew (via the story of the Tower of Babel): "The location of the branches of the tree is no longer totally geographical, nor is it yet a pure depiction of chronology and the similarities among languages." Indeed, Auroux has a rather poor opinion of the tree in general: "the tree is of astonishingly poor quality given the period in which it [was] executed."

Of equal interest to us is the parallel historical pattern of network-thinking in biology and linguistics. The first two depictions of genealogical history in biology (in 1755 by Buffon and in 1766 by Duchesne) both displayed hybridization as an important component of history, just as Gallet did for linguistics in 1800. In both disciplines, this early lead was later side-tracked by non-reticulating tree iconography in the mid-1800s, by Charles Darwin in biology in 1859 and by August Schleicher in linguistics in 1853-1861. Finally, in both cases there is now a burgeoning interest in returning to network representations of genealogy, particularly if the analyses can be formalized mathematically. Thus, there are more parallels between the two fields than is suggested simply from a study of tree thinking alone (for which, see Platnick and Cameron 1977 Systematic Zoology 26: 380-385; Atkinson & Gray 2005 Systematic Biology 54: 513-526; Pagel 2009 Nature Reviews Genetics 10: 405-415).

Wednesday, November 7, 2012

Explanation of the many names for types of phylogenetic networks

Two types of phylogenetic network are commonly recognized, although there can be gradations between the two extremes. These go by many different names, which inevitably leads to some confusion on the part of users.

Some of the names are listed here, along with an explanation of what the terminology is intended to convey. The terms are arranged in pairs, indicating the two different types of network. The "network" part of the name is assumed in each case unless indicated otherwise.

      Type 1          Type 2
  1.  Affinity        Genealogical
  2.  Data-display    Reticulogeny
  3.  Implicit        Explicit
  4.  Undirected      Directed
  5.  Unrooted        Rooted
  6.  Splits graph    Augmented tree, Reconciliation, Recombination, Hybridization

1.  This reflects the biologists' perspective, describing the different purposes for which networks have been used. Affinity networks display overall similarity relationships among the organisms, whereas genealogical networks display only historical relationships of ancestry.

2.  This reflects the assumptions used for the data analysis. Data-display networks are interpreted solely as visualizations of the patterns of variation in the data, while the reticulogenies are based on some inferences about those data patterns (such as their possible cause). Some network types, such as Reduced Median Networks and Median-Joining Networks, are based on algorithms that make partial inferences from the data. Data-display networks have mainly been used as affinity networks and reticulogenies as genealogical networks.

3.  This reflects the computational perspective, describing the goal of the algorithm used to analyze the data. Explicit networks are intended to provide a phylogeny in the traditional sense used for phylogenetic trees, displaying both vertical and horizontal patterns of descent with modification. Implicit networks provide information that can be used to explore phylogenetic patterns in a dataset without any direct interpretation as necessarily showing a phylogeny. Implicit networks have mainly been used as data-display networks and explicit networks as reticulogenies.

4.  This reflects the mathematical interpretation of networks as graphs, that is, as sets of nodes connected by edges. In a directed graph the edges have a direction, usually indicated by an arrow, in which case the edges are more correctly referred to as arcs. Undirected graphs do not have directed edges.

5.  This reflects the tree-thinking view of phylogenetic networks, in which directed graphs are called rooted trees and undirected graphs are called unrooted trees. Rooted networks are usually treated as explicit networks and are thus used as genealogical networks, although there is no reason why they could not be used simply as a convenient form of data display.

6.  This reflects the modelling approach to network analysis based on mathematical structures. Splits graphs model phylogenetic patterns as bipartitions of the data, and build the network from those partitions (the result will be a tree if there are no incompatible bipartitions). Augmented trees are essentially trees with a few added reticulation edges / arcs, while reconciliation networks are based on reconciling the differences between trees. Recombination networks are based on analyzing data patterns in terms of a simple model of genetic cross-over, while hybridization networks model the data in terms of patterns in conflicting trees.
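The remark in item 6 that a splits graph reduces to a tree when there are no incompatible bipartitions can be made concrete with a small sketch. It uses the standard pairwise test, in which two splits A|B and C|D of the same taxon set are compatible if at least one of the four intersections (A with C, A with D, B with C, B with D) is empty. The function names and the four-taxon example are hypothetical, chosen only for illustration; this is not the code of any particular network program.

```python
from itertools import combinations


def make_split(side_a, taxa):
    """Return a split (bipartition) of `taxa` as a pair of frozensets."""
    side_a = frozenset(side_a)
    taxa = frozenset(taxa)
    # Both sides of a split must be non-empty, so side_a must be a
    # non-empty proper subset of the full taxon set.
    if not side_a or not side_a < taxa:
        raise ValueError("side_a must be a non-empty proper subset of taxa")
    return (side_a, taxa - side_a)


def compatible(s1, s2):
    """True if at least one of the four pairwise intersections is empty."""
    return any(not (x & y) for x in s1 for y in s2)


def incompatible_pairs(splits):
    """List every pair of splits that cannot be shown on the same tree."""
    return [(s1, s2) for s1, s2 in combinations(splits, 2)
            if not compatible(s1, s2)]


def is_tree_like(splits):
    """A set of splits fits a single tree only if all pairs are compatible."""
    return not incompatible_pairs(splits)


# Hypothetical example with four taxa.
taxa = {"a", "b", "c", "d"}
ab_cd = make_split({"a", "b"}, taxa)   # ab | cd
ac_bd = make_split({"a", "c"}, taxa)   # ac | bd
a_bcd = make_split({"a"}, taxa)        # a | bcd

print(is_tree_like([ab_cd, a_bcd]))          # these two splits fit one tree
print(is_tree_like([ab_cd, ac_bd, a_bcd]))   # ab|cd conflicts with ac|bd
```

Each incompatible pair is what a splits graph would draw as a set of parallel edges (a box) rather than as a single branch, which is why the presence of any such pair turns the tree into a network.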

So, there are reasons why so many different terms have appeared in the literature. Unfortunately, they are not always used consistently with the meaning that was originally intended.

Monday, November 5, 2012

Relationship trees drawn like real trees

Charles Darwin (1859) introduced the "Tree of Life" as a simile, which has since become very popular as a metaphor for phylogenetic relationships, especially among the general public. Darwin seems to have named his simile after its biblical namesake, and in doing so he "mobilized one of the oldest and richest traditions of imagery available to him. To play consciously on religious tree imagery was no new trick ... but still it helped Darwin to seize the imagination of his readers" (Hellström 2011).

However, this simile was quite independent of Darwin's diagrams, because he always referred to his theory as "descent with modification" (see Penny 2011). Darwin referred to the Tree of Life at the end of the chapter containing his bush-like phylogenetic figure (see this image), and later he referred to relationships as being "somewhat like the branches of a tree", but neither of these was a direct reference to any diagram.

Moreover, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). It was explicitly contrasted with the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil). Genesis tells us that Adam and Eve were exiled from the Garden of Eden after eating a fruit from the Tree of Knowledge of Good and Evil, to prevent them from also eating from the Tree of Eternal Life (as humans, they apparently were not allowed to have both eternal life and moral knowledge).

This distinction between different trees is important historically, because prior to Darwin the biblical tree imagery had already been co-opted to refer to the arbor scientiae (Tree of Knowledge), rather than the lignum scientiae. That is, knowledge could be arranged like the branches of a tree; and indeed, that metaphor has come down to us today when referring to the different "branches" of human knowledge (e.g. branches of science). For example, Joachim of Fiore used the tree as a metaphor for historical relationships in his Liber Figurarum (1202) (Hestmark 2000); and in his book Arbor Scientiae (1295) Ramón Llull used it to illustrate the growth and inter-relationships of knowledge (Gontier 2011, Kutschera 2011).

This imagery has not escaped biologists, of course. The first person to suggest a systematic arrangement of all organisms in the image of a tree is reported to be Peter Simon Pallas, in his Elenchus Zoophytorum (1776) (Ragan 2009); and since that time biological relationships have often been depicted literally using a tree. Prior to Darwin, however, none of this imagery had anything to do with evolutionary relationships. Indeed, in the time between the early evolutionary work of Jean-Baptiste Lamarck and that of Charles Darwin (50 years later), several people drew trees without expressing any belief in evolution. Some of these pre-Darwinian "relationships drawn as trees" are illustrated here, showing just how broad were the purposes for which they were used. (Many other metaphors were also used for the same purpose during the same time, of course.)

Trees in Biology

Augustin Augier (1801) is usually credited with producing the first such tree (Stevens 1994, Archibald 2009, Ragan 2009, Gontier 2011, Tassy 2011). It depicts the natural relationships of all of the plant groups known at the time, based on several parts of the flower. The taxonomic groups label the nodes, with genera labelling the leaves. As noted by Stevens (1994): "Families on different branches of the tree, but in a similar position, showed the 'relationship of analogy', while the 'relationship of proximity' occurred between different families on the same branch." The tree thus illustrates increasing structural perfection from bottom to top, on which Augier based his taxonomic classification.

Analogy and proximity relationships of the plant kingdom,
from Augier (1801). The analogy relationships are indicated by stars.

It is perhaps worth noting here that this appears to be the first diagram of relationships published after those of Buffon (see this blog post) and Duchesne (see this post), and it is thus the first one depicting non-reticulating relationships as well as the first one not representing genealogical history. Indeed, Augier noted that although it is "like a genealogical tree" he accepted the pattern as coming from the Creator rather than genealogy. Augier states that he developed the tree idea after first trying to organize the families of plants according to a scale of perfection (a Scala Naturae, see this blog post), but failing.

Some years later, Nicolas Charles Seringe (1815) produced a tree that represented, instead, the characters of a dichotomous identification key (Stevens 1994). This referred solely to the known Swiss species of Salix (willows). The branch labels indicate the two character states being compared at each step in the key, starting at the base, with the species labels finally appearing on the leaves. Identification keys are no longer drawn like this, but it is an interesting visual device.

Identification key to the species of Swiss willows, from Seringe (1815).

Carl Edward von Eichwald (1829) published a tree of animal life that is often assumed to be a depiction of the tree suggested by Pallas in 1776, as mentioned above (Ragan 2009). Only a zoologist could illustrate a leafless bunch of asparagus spears suspended in an aquatic wasteland, and treat it as a tree! The Roman numerals label the primary animal types. As noted by Ragan (2009), Eichwald considered that: "the first type arose from abundant 'globules of primitive mucus', followed by the others in temporal succession, each a branch off from, and elevated in relation to, the previous type".

A tree of animal life, from Eichwald (1829), p. 41.

Edward Hitchcock (1840) produced a paleontological chart of the plant and animal kingdoms, which for the first time incorporated fossil time into the illustration of relationships (Archibald 2009, Ragan 2009, Gontier 2011). Actually, not much is shown about relationships in the diagram, since few of the branches are connected other than at the base, but the extinction of fossil groups in different strata is clearly indicated, and the branch widths indicate the relative number of species at the different geological times. Interestingly, Hitchcock withdrew this diagram from the later editions of his book, immediately after Darwin published his similar bush-like figure in 1859. He did so in opposition to the use of such a diagram to depict evolution, "arguing that evolution could not be the mechanism for change that he saw in the fossil record" (Archibald 2009).

The fossil history of plants and animals,
from the 8th (1852) edition of Hitchcock (1840).

Heinrich Georg Bronn (1858) published another tree of animals based on the fossil record, albeit this time a theoretical one (Archibald 2009, Ragan 2009, Gontier 2011, Tassy 2011). The letters depict the various sequences of increasing organizational perfection of the animal groups through fossil time. As noted by Archibald (2009): "Bronn seems to have been most concerned with addressing the idea that although there was a trend toward perfection, less perfect forms kept branching even after more perfected forms had appeared". Later, Bronn was responsible for the first translation of Darwin's book into German (with his own commentary and a chapter of his own criticisms!).

Tree-shaped image of the animal system,
from Bronn (1858), p. 481.

Trees in Linguistics

Finally, it is worth pointing out that the situation was somewhat different within the study of linguistics. The analysis of biological and linguistic relationships has much in common (Atkinson and Gray 2005), and similar techniques have been developed at similar times in both disciplines but quite independently of each other. In particular, phylogenetic trees have been developed both for the study of the historical development of languages and for biological relationships.

However, one way in which the development of trees in linguistics differed from that in biology is that some explicitly genealogical tree diagrams appeared before 1859. Here are two examples (see Gontier 2011).

[Update: see the subsequent blog post An early tree of languages.]

Priestly (1975) notes that it was apparently František Ladislav Čelakovský who drew the first genealogical diagram in linguistics, depicting a history of the Slavic languages, which was published posthumously in 1853. This may thus count as the first phylogenetic tree in the modern sense of the word (i.e. it is interpreted exactly as would be a modern phylogenetic tree).

A history of the Slavic languages, from Čelakovský (1853), p 3.

However, it is August Schleicher who is usually credited with popularizing the use of phylogenetic trees in historical linguistics, starting with a short note in 1853 concerning the historical development of the Indo-Germanic language family. He published a more extensive account in 1861 (before he had read Bronn's translation of Darwin's book), and then in 1863 clearly linked his own work with Darwin's evolutionary ideas (Gontier 2011).

The development of the Indo-Germanic language family,
from Schleicher (1853), p. 787.

In biology, similar "relationships drawn as trees" representing genealogy were not published until 1866, by Ernst Haeckel (see this previous blog post).


References

Archibald J.D. (2009) Edward Hitchcock’s pre-Darwinian (1840) "Tree of Life". Journal of the History of Biology 42: 561-592.

Atkinson Q.D., Gray R.D. (2005) Curious parallels and curious connections: phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Darwin C. (1859) On the Origin of Species by Means of Natural Selection. Murray, London.

Gontier N. (2011) Depicting the Tree of Life: the philosophical and historical roots of evolutionary tree diagrams. Evolution, Education and Outreach 4: 515-538.

Hellström N.P. (2011) The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Hestmark G. (2000) Temptations of the tree. Nature 408: 911.

Kutschera U. (2011) From the scala naturae to the symbiogenetic and dynamic tree of life. Biology Direct 6: 33.

Penny D. (2011) Darwin’s theory of descent with modification, versus the biblical Tree of Life. PLoS Biology 9: e1001096.

Priestly T.M.S. (1975) Schleicher, Čelakovský, and the family-tree diagram: a puzzle in the history of linguistics. Historiographia Linguistica 2: 299-333.

Ragan M. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Stevens P.F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia University Press, New York.

Tassy P. (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Sources of the figures

Augier A. (1801) Essai d'une Nouvelle Classification des Végétaux. Bruyset Aîné, Lyon.

Bronn H.G. (1858) Untersuchungen über die Entwickelungs-Gesetze der Organischen Welt. E. Schweizerbart'sche, Stuttgart.

Čelakovský F. (1853) Čtení o Srovnávací Mluvnici Slovanské na Universitě Pražské. F. Řivnáče, Prague.

Eichwald C.E. von (1829) Zoologia Specialis quam Expositis Animalibus. Josephus Zawadzki, Vilnae.

Hitchcock E. (1840) Elementary Geology. Adams, Amherst.

Schleicher A. (1853) Die ersten Spaltungen des Indogermanischen Urvolkes. Allgemeine Monatsschrift für Wissenschaft und Literatur 1853: 786-787.

Seringe N.C. (1815) Essai d'une Monographie des Saules de la Suisse. Maurhofer and Dellenbach, Berne.