The Genealogical World of Phylogenetic Networks: February 2013

Wednesday, February 27, 2013

How should we treat hybrids in a taxonomic scheme?

There are many plant genera that have long been considered to involve extensive hybridization, as indeed there are also for vertebrates (eg. Lepus, Salmo). These genera form complexes of taxa that apparently intergrade due to individuals that are phenotypically and genotypically intermediate. One such plant genus is Nicotiana (tobacco).

In 1954 Thomas Harper Goodspeed published a summary of his lifetime's work on this genus, consolidating his data concerning morphology, cytology, genetics, distribution and taxonomy. While this work was not universally well received, especially the taxonomy (eg. DeWolf 1957), it does contain one interesting diagram, which is an evolutionary network summarizing his ideas on the phylogeny of the genus.

Since the relationships are complex, so is the diagram. However, basically, Part A summarizes the relationships among some of the species, while Part B summarizes the rest along with "centroids" representing the species from Part A. The different orbital rings represent hypothetical ancestors of the various groups, with the groups having different numbers of chromosomes.

As Knapp et al. (2004) have noted: "Amphidiploid (allotetraploid) hybridization is common in the genus and appears to be unrelated to human intervention; both the tobaccos of commerce as well as several other species and species groups are amphidiploid hybrids." An amphidiploid is an interspecific hybrid having a complete diploid chromosome set from each parent. It is this hybridization that creates the complexity among the tobacco taxa, and its nature is summarized by Chase et al. (2003).

Goodspeed's hybridization network is a fascinating early attempt to represent evolutionary complexity, irrespective of what one thinks about his methodology, which is never satisfactorily explained. The existence of subjectively determined hypothetical ancestors is probably the most worrisome aspect.

However, apart from drawing attention to the existence of the network, I wish to raise a point about how to turn this complexity into a classification. Goodspeed (1947) had a go at this, as did Goodspeed (1954). However, more importantly Knapp et al. (2004) made a more recent attempt, and they had this to say:

We retain Goodspeed's sectional classification as the basic framework for this new classification, understanding that different ways of treating hybrids in phylogenetic classifications have been proposed.

In this sectional classification we have placed all amphidiploid species in sections separate from their progenitor taxa because they represent fusion of two distinct genomes, and based on studies such as Kenton & al. (1993) we know that these genomic interactions create new traits and permit movement into new habitats.

The classification is based on a phylogenetic tree onto which the authors have plotted reticulations representing hybridization (see below). Those taxonomic sections that are represented in the tree-like parts of the phylogeny can thereby be considered to be monophyletic, but those sections consisting of hybrids cannot.

Summary of the classification of Nicotiana, showing diploid species and their sectional classification.
Sections of allotetraploid origin are indicated by dashed lines from both of their parental lineages.

My question is this: Is this really the best way of treating hybrids taxonomically? It looks a bit like an attempt to sweep all of the problems together into separate piles, and then simply labelling them "problem piles".

Mind you, in this particular example all of the hybrids in any one section are hypothesized to have had the same lineages as parents, so there is some coherence to each section. However, unless we hypothesize a single hybrid origin for each section the component species cannot be considered to have a unique common ancestor, which all of the non-hybrid sections do have. In this sense the sections are not directly comparable.

References

Chase M.W., Knapp S., Cox A.V., Clarkson J.J., Butsko Y., Joseph J., Savolainen V., Parokonny A.S. (2003) Molecular systematics, GISH and the origin of hybrid taxa in Nicotiana (Solanaceae). Annals of Botany 92: 107-127.

DeWolf G.P. (1957) Review of Goodspeed (1954). The Southwestern Naturalist 2: 177-179.

Goodspeed T.H. (1947) On the evolution of the genus Nicotiana. Proceedings of the National Academy of Sciences of the USA 33: 158-171.

Goodspeed T.H. (1954) The genus Nicotiana: origins, relationships and evolution of its species in the light of their distribution, morphology and cytogenetics. Chronica Botanica 16: 1-536.

Knapp S., Chase M.W., Clarkson J.J. (2004) Nomenclatural changes and a new sectional classification in Nicotiana (Solanaceae). Taxon 53: 73-82.

Monday, February 25, 2013

One year of network blogging

Today is the first anniversary of starting this blog, and this is post number 120. So, a big thank you to all of our visitors over the past year. We hope that the next year will be as productive as this past one has been.

We have summarized here some of the accumulated data, in order to document at least some of the productivity.

As of this morning, there have been 29,316 pageviews, for a median of 70 per day, but with a range of 3-667 pageviews. The daily pattern for the year is shown in the first graph.

Line graph of pageviews through time, up to today.

The largest value (Day 224) is off the graph.
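As a rough check on the figures quoted above, the total pageview count implies a daily mean that can be compared with the reported median. The sketch below is only illustrative: it assumes the blog started on 25 February 2012 (inferred from "first anniversary") and that the count runs up to, but does not include, today. Under those assumptions the mean is about 80 pageviews per day, somewhat above the median of 70, which is what one would expect when a few large peaks pull the mean upwards.

```python
from datetime import date

# Figures taken from the text above.
total_pageviews = 29_316
reported_median = 70

# Assumption: the blog started exactly one year before this post.
first_day = date(2012, 2, 25)
today = date(2013, 2, 25)

# 2012 was a leap year, so the interval spans 366 days.
days_online = (today - first_day).days
mean_per_day = total_pageviews / days_online

print(f"days online: {days_online}")
print(f"mean per day: {mean_per_day:.1f} (reported median: {reported_median})")
```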

The erratic nature of the daily variation is apparently all too typical of blogs, and there appears to be no good explanation for it. So, we might take this as a good example of the stochastic nature of the web. Nevertheless, there are general patterns detectable. For example, the steady rise from one third of the way through the year is very gratifying, although the slight dip right at the end is less so. The recent mean pageview data are:

October – November: 90
December: 130
Christmas – New Year: 90
January – mid February: 130
late February: 90

Some of the sharp peaks in the graph were due to various identifiable events, including the email announcing the existence of the blog, the addition of the blog to the Systematic Biology homepage, the mention of the blog in some posts at the Scientopia blog, and the mention of some of the posts in the monthly Carnival of Evolution blog roundup.

The biggest peak (which goes off the graph) was due to hosting an edition of the Carnival of Evolution, which generated an extra 2,000 pageviews. There were also unexpected Twitter announcements for particular posts, including the fourth Tattoo post (which got picked up when it happened to go out on April Fool's Day) and the one on Scotch Whiskies, which is apparently a topic of widespread interest.

There are also other general patterns in the data, the most obvious one being the day of the week, as shown in the second graph. The posts have usually been on Mondays and Wednesdays, and these two days have had the greatest mean number of pageviews (84 and 90, respectively). The other weekdays have had somewhat fewer (Tuesday 82, Thursday 75, Friday 65), and the weekend fewer still (Saturday 50, Sunday 63).

Boxplot of the daily pageviews, up to last Friday.

The largest value has been excluded.

There were also a few instances of what appear to be "rogue" visits during late December and early January. These involved an almost instantaneous addition of c.100 pageviews, without obvious explanation, which presumably came from bots examining the blog. They occurred once the blog reached 100 posts, which may not be coincidental.

The posts themselves have varied greatly in popularity, as shown in the next graph. It is actually a bit tricky to assign pageviews to particular posts, because visits to the blog's homepage are not attributed by the counter to any specific post. Since the current two posts are the ones that appear on the homepage, these posts are under-counted until they move off the homepage (after which they can be accessed only by a direct visit to their own pages, and thus always get counted). On average, 33% of the blog's pageviews are to the homepage, rather than to a specific post page, and so there is considerable under-counting.

Scatterplot of post pageviews through time, up to today; the line is the median.

Note the log scale, and that the values are under-counted (see the text).

The fact that 33% of the blog's pageviews are to the homepage means that one-third of the visitors are reading the blog as the posts are posted, while two-thirds are visiting via web searches and external links. So, we do have a regular readership, as well as having itinerant visitors.

It is good to note that the most popular posts were scattered throughout the year. Keeping in mind the under-counting, the top collection of posts (with counted pageviews) have been:

Post 73: Carnival of Evolution Number 52 (1,559 pageviews)
Post 42: Charles Darwin's unpublished tree sketches (1,302 pageviews)
Post 19: Tattoo Monday IV (737 pageviews)
Post 49: Evolutionary trees: old wine in new bottles? (687 pageviews)
Post 10: Why do we still use trees for the dog genealogy? (666 pageviews)
Post 58: Who published the first phylogenetic tree? (606 pageviews)
Post 98: Faux phylogenies (600 pageviews)
Post 26: Steven Jay Gould was wrong (429 pageviews)
Post 67: Metaphors for evolutionary relationships (420 pageviews)
Post 17: Tattoo Monday III (415 pageviews)
Post 29: Network analysis of scotch whiskies (414 pageviews)
Post 2: The first phylogenetic network (1755) (403 pageviews)
Post 35: Tattoo Monday V (394 pageviews)

This blog has two possible uses: (i) providing an outlet for commentaries and ideas by professionals; and (ii) advertising phylogenetic networks to a wider audience. It has turned out that the latter posts have appeared mostly on Mondays and the former mostly on Wednesdays. Furthermore, it seems reasonable for the former posts to have fewer pageviews, since the expected audience is much smaller (or "more select", as we prefer to see it).

There have been five main types of posts:

(i) Discussions of methodology
These are the mainstay of the blog for those who are professionally interested in phylogenetic networks. A wide range of topics have been discussed, and there is plenty more that can be said.

If anyone wants to contribute to this part of the blog, then we welcome guest bloggers. This is a good forum to try out all of your half-baked ideas, in order to get some feedback, as well as to raise issues that have not yet received any discussion in the literature. If nothing else, it is a good place to be dogmatic without interference from a referee!

As a blogger, you are very likely to get feedback from people, even if they do not leave comments on the blog itself. Professionals do not yet seem to be very used to writing blog comments, but they will send you an email.

(ii) Explanations
There are all sorts of things that seem obvious to professionals but which are obscure to non-experts. These posts are designed to redress this situation, so that there is somewhere on the web for people to go when they want to find out. They seem to have been rather popular posts.

(iii) Data analyses
The exploratory data analyses (EDA) are intended to illustrate the usefulness of networks as data summaries (as opposed to their use for strictly evolutionary analyses). In particular, choosing datasets outside science advertises the potential uses of scientific data analysis to a wider public. Networks provide a valuable way of visualizing a table of numbers -- so, any time you see such a table you should be tempted to find out whether a network will help people to picture what it says. Most of the analyses have proved quite popular in terms of pageviews, but there has been little feedback about whether the public understands any of it.

(iv) Historical commentaries
These have usually been among the most popular posts with visitors. They simply involve bits of information that have accumulated through time, and the blog seems to be a good place to put them. They often involve phylogenetic trees, rather than networks, but that is only because trees have been used more often and thus have more history. Mind you, you have to have a good title in order to attract the public's attention!

(v) Miscellaneous
These are uncategorizable posts, which just consist of things that relate in some way to phylogenetic analysis, however peripherally. There are almost no other phylogenetics blogs on the web, and so there is no other obvious outlet for this information. The most popular of these posts have been the ones compiling the various pictures of phylogenetic tattoos that are lying around the web -- these are the most common Google search hits to the blog, along with the first compilation of Darwin's unpublished tree sketches.

Along with these posts, we have also started compiling a list of datasets that will be useful for evaluating network algorithms. Such datasets, where biologists seem to have an independently validated idea about the phylogenetic pattern, are hard to come by, and so it is worthwhile to make them available at a centralized location. A blog page is as good as anywhere else for this purpose, and the number of visits to this page is quite steady. Contributions of datasets are always welcome.

Finally, the audience for the blog has been, not unexpectedly, firmly in the USA. Based on the number of pageviews, the data are:

United States: 37.4%
United Kingdom: 6.6%
Germany: 5.3%
Russia: 4.7%
Canada: 4.0%
France: 2.7%
Australia: 2.3%
New Zealand: 1.7%
Netherlands: 1.6%
Sweden: 1.5%

You will note that this list is dominated by English-speaking countries. The blog does have a link to Google Translate to help other people, but it is clear that the audience is made up almost entirely of those people who are comfortable with English (or Australian, at any rate).

Wednesday, February 20, 2013

Network of ancient Thai bronze Buddha images

This blog post continues the theme from the previous post (Trees and networks of written manuscripts), in which I noted that anthropological data are very likely to involve horizontal flows of phylogenetic information as well as vertical ones. My own analyses of anthropological datasets that are available online seem to confirm this suggestion. The simplest way to illustrate this point is to take a dataset and analyze it using a network method. If the network method produces a tree-like diagram then we can safely conclude that vertical descent has had a larger influence on the transmission of the cultural information than has horizontal transfer.

The dataset that I will use here is provided by Marwick (2012). It involves photographic images of 42 cast metal Buddha statues from the Alexander B. Griswold collection of the sacred sculpture of Thailand (Walters Art Gallery, Baltimore, USA). The statues cover seven widely recognized chronological Thai culture-historical groups.

The morphological features of the statues' heads were coded as 17 binary characters, representing the face of the Buddha image; and these data are included in Marwick (2012). Statues CN65 and CN66 had identical codings for the features used.

Originally, Marwick (2012) analyzed these data by first summarizing the characters for each of the seven culture-historical groups. The phylogenetic analysis was then performed with these seven groups as the taxa. The exhaustive-search parsimony analysis produced three maximum-parsimony trees, and the bootstrap consensus tree was not well-supported, as shown in the figure.

This result suggests that the data may not be particularly tree-like. To assess this, I have performed a network analysis using the Hamming distance and a NeighborNet graph, as shown in the next figure. The seven culture-historical groups have been colour-coded as follows (in chronological order):

Dvaravati: light green
Khmer: dark green
Thirteenth_Century: dull blue
Sukhothai: bright blue
Early_Ayutthaya: purple-brown
Lan_Na: pink
Late_Ayutthaya: red
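For readers unfamiliar with the distance step mentioned above, here is a minimal sketch of it. The Hamming distance between two statues is simply the number of binary characters on which their codings differ. The statue names and the 17-character codings below are hypothetical (they are not taken from Marwick 2012), and the NeighborNet step, which would take the resulting matrix as input, is not shown.

```python
from itertools import combinations

def hamming(a, b):
    """Count the positions at which two equal-length codings differ."""
    if len(a) != len(b):
        raise ValueError("codings must have the same number of characters")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(codings):
    """Build a symmetric table of pairwise Hamming distances."""
    names = list(codings)
    dist = {p: {q: 0 for q in names} for p in names}
    for p, q in combinations(names, 2):
        dist[p][q] = dist[q][p] = hamming(codings[p], codings[q])
    return dist

# Hypothetical codings of 17 binary characters each (illustration only).
codings = {
    "statue_A": "10110010011010011",
    "statue_B": "10110010011010011",  # identical to statue_A
    "statue_C": "00110110011000011",  # differs from statue_A at 3 characters
}

dist = distance_matrix(codings)
for name, row in dist.items():
    print(name, row)
```

Two statues with identical codings, as reported for CN65 and CN66, have a distance of zero and so cannot be separated in the network.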


Clearly, the network is not very tree-like, and so we can infer that there has been a considerable influence of horizontal flow of phylogenetic information, as well as the vertical flow through time. There are, however, distinct temporal patterns in the network, which we can infer are probably phylogenetic patterns.

The samples from the earliest three periods (Dvaravati, Khmer, Thirteenth_Century) are at the right-hand end of the network, while the samples from the next period (Sukhothai) are at the bottom-left. This implies that a large stylistic change occurred between the Thirteenth_Century and the Sukhothai periods. Furthermore, the Khmer period style is rather distinct from that of the immediately preceding period (Dvaravati) and the immediately following one (Thirteenth_Century), which are themselves not distinct. That is, there was no stylistic change between the first two periods, but there was a small change to the next period, and then a large change to the following period.

The samples from the latest two periods (Lan_Na, Late_Ayutthaya) are collected mainly in two locations, at the bottom of the graph and at the top-left. This indicates that, although there are two distinct styles, they do not correlate with the two culture-historical periods. So, the pattern here is not a strictly phylogenetic one, and we need to look for some other explanation.

The samples from the Early_Ayutthaya period are scattered throughout the top and left of the network, suggesting that this is an intermediate style between that of the immediately previous Sukhothai period and the earliest three periods, rather than being an innovative style leading to the succeeding Lan_Na period.

Importantly, these interpretations of the phylogenetic patterns do not accord with those from the tree-building analysis, where the possible patterns of horizontal flow of information are not made explicit.

Reference

Marwick B (2012) A cladistic evaluation of ancient Thai bronze Buddha images: six tests for a phylogenetic signal in the Griswold Collection. In: Bonatz D, Reinecke A, Tjoa-Bonatz ML (editors) Connecting Empires. National University of Singapore Press, pp. 159-176.

Monday, February 18, 2013

Trees and networks of written manuscripts

It is often suggested by anthropologists that their studies, including archaeology and linguistics, are very likely to involve horizontal flows of phylogenetic information as well as vertical ones (see the earlier posts False analogies between anthropology and biology and Time inconsistency in evolutionary networks). For example, in linguistics the horizontal flow is referred to as "diffusion", while in stemmatology it is called "contamination".

The simplest way to illustrate this is to take a dataset and analyze it using both a tree-building method and a network method. Only if the network method produces a tree-like diagram can we then safely conclude that vertical descent has had a larger influence on the transmission of the cultural information than has horizontal transfer.
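One common way of making "tree-like" concrete, which is not necessarily what was done for the analyses discussed in these posts, is the four-point condition. For any four objects there are three ways to pair them up, and on a tree the two largest of the three summed pair distances are equal. The sketch below scores each quartet in the style of a delta plot, giving 0 when the quartet fits a tree exactly and values up to 1 when it does not. The function names and the two toy distance matrices are hypothetical.

```python
from itertools import combinations

def quartet_delta(d, i, j, k, l):
    """Score one quartet: 0.0 if it fits a tree exactly, up to 1.0 if not."""
    sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
    smallest, middle, largest = sums
    if largest == smallest:
        return 0.0  # all three sums equal: a star, which is still tree-like
    return (largest - middle) / (largest - smallest)

def mean_delta(d):
    """Average the quartet scores over every set of four objects."""
    if len(d) < 4:
        raise ValueError("need at least four objects")
    scores = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    return sum(scores) / len(scores)

# Toy distances read off a tree ((0,1),(2,3)) with an internal edge of length 2.
tree_like = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Toy distances around a square 0-1-2-3-0, which no tree can reproduce.
box_like = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(mean_delta(tree_like))
print(mean_delta(box_like))
```

A low average score is consistent with vertical descent dominating the data; a high score suggests that a network will show conflicting signal that a tree would hide.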

A few weeks ago I reported on a case, involving the historical development of the musical instrument called a cornet, where the author first used a tree to analyze the historical data and then later settled on a network, which turned out to be rather non-treelike (Cornets: from a tree to a network). Here, I point out another example, this time involving written text.

Stemmatology is the discipline that attempts to reconstruct the transmission history of a printed text on the basis of relationships between the various extant versions (eg. manuscripts or printings). In this case, the analysis concerns the Greek manuscripts for the New Testament, in particular the Letter of James.

The stemmatological study used a database listing the variants of the 761 characters in 165 Greek manuscripts of the Letter of James. Of these, 60 characters are constant, 266 are variable but parsimony-uninformative, and 435 are variable and parsimony-informative. The objective of the study was to trace the history of copying of one manuscript to another.
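The three categories of character used above can be sketched as follows, using the usual parsimony definitions: a character is constant if every manuscript has the same reading, and parsimony-informative if at least two different readings each occur in at least two manuscripts. The toy matrix is hypothetical, and the sketch ignores missing data, which a real analysis would need to handle.

```python
from collections import Counter

def classify_character(states):
    """Classify one column of a character matrix for a parsimony analysis."""
    counts = Counter(states)
    if len(counts) == 1:
        return "constant"
    # Informative: at least two states that are each shared by two or more taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    if shared_states >= 2:
        return "parsimony-informative"
    return "parsimony-uninformative"

# Hypothetical toy matrix: 4 manuscripts (rows) by 3 variant sites (columns).
matrix = ["AAA", "AAB", "ABB", "ACA"]
columns = list(zip(*matrix))
labels = [classify_character(col) for col in columns]
tally = Counter(labels)
print(labels)

# The three counts reported in the text add up to the stated total.
reported = {"constant": 60, "parsimony-uninformative": 266, "parsimony-informative": 435}
print(sum(reported.values()))
```

Only the parsimony-informative characters can favour one tree over another, which is why they are counted separately.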

To construct a phylogenetic tree from the dataset, Spencer et al. (2002) performed a parsimony analysis, and then summarized this with an Adams-2 consensus tree of the resulting 10,000 maximum-parsimony trees. This tree is shown in the first figure.

However, this approach does not explicitly display the inferred contamination among the manuscripts, which would require a phylogenetic network rather than a tree. So, Spencer et al. (2004) produced a reduced median network, instead, based on 82 selected manuscripts and 301 binary characters. This is shown in the second figure.

Clearly, parts of the manuscript history are not very tree-like, notably the part at the inferred root of the network. Spencer et al. note that this network topology:

is consistent with the ideas that most variants arose early in the history of the Greek New Testament, that early manuscripts were often influenced by both oral and written traditions, and that later copies introduced fewer variants.

Under these circumstances, a tree cannot be an appropriate representation of the anthropological data, because horizontal transfer of information has had a large effect during at least part of the phylogenetic history.

References

Spencer M, Wachtel K, Howe CJ (2002) The Greek vorlage of the Syra Harclensis: a comparative study on method in exploring textual genealogy. TC: a Journal of Biblical Textual Criticism 7: 3.

Spencer M, Wachtel K, Howe CJ (2004) Representing multiple pathways of textual flow in the Greek manuscripts of the Letter of James using reduced median networks. Computers and the Humanities 38: 1–14.

Wednesday, February 13, 2013

Pasta have no phylogeny (so don't try to give them one)

If you feed arbitrary data into a phylogenetic analysis then you will always get something out again, but it will in all probability be meaningless. For example, non-living objects do not have a phylogenetic history, at least not in the same way as living objects. Cultural objects certainly do have a history, and many aspects of that history may be similar to evolutionary patterns (see False analogies between anthropology and biology), but we cannot take it for granted that all historical patterns can be treated as analogous to those induced by evolution.

Even data that can be placed in a hierarchy do not necessarily represent a phylogeny — a phylogeny may well be a tree but not every tree is necessarily a phylogeny. Even having an evolutionary history does not mean that there is a phylogenetic history — for example, if the evolution is transformational then the history will be a chain rather than a tree or network.

Olivier Rieppel (2010) mentioned pasta as an example of this important distinction, because the features of pasta contain almost no phylogenetic information at all. Pasta has a history, sure, but it is not a phylogenetic history, in the Darwinian sense of variational evolution. Nor, incidentally, does pasta have a transformational history, either. The different types of pasta were not derived by a historical process of descent with modification, but are instead simply different expressions of a small set of basic ideas about what shapes you can make out of noodles (which are themselves little more than durum wheat flour mixed with water). The key feature of phylogenetic datasets is congruence among different character sets, and this is what we detect as phylogenetic signal, but there is no such congruence among the characteristics of pasta. (For a detailed analysis, with figures, see the interesting book by George Legendre 2011.)

In spite of this, pasta actually is used by many institutions (particularly in the USA) as an example to teach school and/or undergraduate students about phylogenetic analysis. I won't list them all here, but the simple Internet search that I just did quickly produced at least six of them. Here is an example datasheet from one of them, to illustrate the idea:

There are, unfortunately, many other examples of inanimate objects being used as introductory examples for phylogenetic analysis, even when those objects have no obvious phylogenetic history, including: Paper-clips (discussed by Petroski 1992); Nuts & bolts (discussed by Nickels & Nelson 2005); and Biscuits (discussed by Madden 2011).

This violates all common sense in phylogenetics. Indeed, it is actually anti-phylogenetics because it promulgates the idea that there is nothing special about evolutionary history, as distinct from any other sort of history. As noted by Erin Naegle (2009):

Biological organisms have descended from common ancestors. This is not true of manmade objects such as hardware or pasta. While constructing trees of objects may be motivating for students, such exercises are removed from evolutionary theory. Using inanimate objects may give students the impression that all trees are equally correct, since there is no inherently correct way to place objects on a tree.

Part of the problem with almost all of these class exercises seems to be confusion between classification and phylogeny, since these seem frequently to be taught as part of the same exercise. As noted by Nickels & Nelson (2005):

Perhaps the most common — but ultimately self-defeating — approach in teaching about biological classification uses the arrangement of manufactured objects (hardware, furniture, whatever) in an attempt to illustrate the principles of biological classification. This approach assumes that classifying manufactured objects is fundamentally similar to classifying biological organisms. Unfortunately, this assumption is wrong in important ways ... simply put, taxonomic classifications of organisms are fundamentally different from the classifications of other things. And this distinction is the key point that students need to grasp.

All objects can be classified, and many objects can be classified using a hierarchical scheme, which can then be represented as a tree; but this does not make that tree a phylogeny. Classifications can be derived from any set of data, but they are particularly suitable for datasets with an intrinsic hierarchical pattern. Since the phylogenetic patterns in many groups of organisms are tree-like, they can be conveniently represented in a hierarchical classification. However, this logic cannot be inverted — just because we have a hierarchical classification does not mean that it came from a phylogenetic pattern.

I have always been acutely aware of this potential problem when I have used phylogenetic networks to analyze datasets where there is unlikely to be an evolutionary cause to the multivariate patterns, such as the Eurovision Song Contest, the FIFA World Cup, Scotch whiskies, Bordeaux wine, fast food, or lists of celebrities (see the Analyses page of this blog). In these cases I have explicitly emphasized that the analysis is intended as an Exploratory Data Analysis (EDA) not a phylogenetic analysis. This distinction is an important one in phylogenetics — any patterns detected by the EDA may, indeed, result from a phylogenetic history, but equally they may not do so. In this sense it is unfortunate that the output is still called a phylogenetic network.

I am not the first person to point out the problem of using inanimate objects for phylogenetics (Nickels & Nelson 2005; Naegle 2009; Meisel 2010). If anything, manufactured goods may provide a suitable example of horizontal transfer (Meisel 2010), but this seems a bit advanced for an introductory class of students.

Are there, then, any good examples that could be used to provide students with a simple and easy introduction to phylogenetic analysis? All that is required is that the objects actually have a phylogenetic history, and that a dataset for the objects can be collected by the students in a straightforward but entertaining manner.

As one example, Nelson & Nickels (2000) suggest using humans as the exemplar, and there is a web page pursuing this idea at the Evolution and the Nature of Science Institutes. Alternatively, one could use the example of the fictional Caminalcules (Gendron 2000), which is discussed both here and here. Other examples are limited solely by your own imagination.

References

Gendron RP (2000) The classification & evolution of Caminalcules. American Biology Teacher 62: 570-576.

Legendre GL (2011) Pasta by Design. Thames & Hudson, London.

Madden D (2011) DNA to Darwin: Introductory Activities, Teacher's Guide. NCBE, University of Reading.

Meisel RP (2010) Teaching tree-thinking to undergraduate biology students. Evolution: Education and Outreach 3: 621-628.

Naegle E (2009) Patterns of Thinking about Phylogenetic Trees: A Study of Student Learning and the Potential of Tree Thinking to Improve Comprehension of Biological Concepts. Doctor of Arts thesis, Idaho State University.

Nelson CE, Nickels MK (2000) Using humans as a central example in teaching undergraduate biology labs. In: Karcher SJ (editor) Tested Studies for Laboratory Teaching, Volume 22. Proceedings of the 22nd Workshop / Conference of the Association for Biology Laboratory Education (ABLE), pp 332-365.

Nickels MK, Nelson CE (2005) Beware of nuts and bolts: putting evolution into the teaching of biological classification. American Biology Teacher 67: 283-289.

Petroski H (1992) The evolution of artifacts. American Scientist 80: 416-420.

Rieppel O (2010) The series, the network, and the tree: changing metaphors of order in nature. Biology and Philosophy 25: 475-496.

Monday, February 11, 2013

An early phylogenetic classification (1884)

The history of phylogenetics has concentrated on those people who made (ostensibly) original contributions to theory, with very little mention of those who synthesized empirical information. A prominent member of the latter group was the marine zoologist William Abbott Herdman (1858–1924), who was Derby Professor of Natural History at University College, Liverpool (1881–1919) and then Professor of Oceanography (1919–1920).

At the Fourth Ordinary Meeting (December 1, 1884) of the 74th Session (1884–1885) of the Literary and Philosophical Society of Liverpool he presented a paper entitled "A phylogenetic arrangement of animals". The Society duly published this the next year (Herdman W.A. 1885. A phylogenetic arrangement of animals. Proceedings of the Literary and Philosophical Society of Liverpool 39: 65-85). This publication contained an explicitly phylogenetic tree and an accompanying classification reflecting that tree (which the author calls a "natural classification").

At the same time, Herdman published a book (Herdman W.A. 1885. A Phylogenetic Classification of Animals (For the Use of Students). Macmillan, London; Adam Holden, Liverpool) with this prefatory note:

The accompanying Genealogical Table was drawn up in May, 1884, mainly from various partial schemes of classification which I have been in the habit of using in my lectures for several years; and a brief description was read in the following December before the Literary and Philosophical Society of Liverpool. While preparing this paper for publication it occurred to me that in an extended form it might prove serviceable to students of Biology: hence its issue in the present condition.

This book is available at the Open Library, and the above illustration (which Herdman calls a "table") is taken from it. It appears to be the same as the one in the journal publication; and the text of the book (76 pp) is, indeed, an expanded version of that in the paper. (The book apparently was published first.)

The tree has much in common with modern phylogenetic trees, in that Herdman emphasizes single common ancestors, and all contemporary taxa are on side-branches, but it differs in having a strong central axis, and in trying to represent "relative advancement" (ie. Herdman recognizes grades as well as clades). For example, in the Explanatory notes accompanying the diagram, he explains:

The lowest organisms are placed at the foot of the Table, the highest at the top. The line, straight or zig-zag, traced from the very base upwards to any name indicates the probable course of the evolution of the group of animals to which the name belongs. If a line stretches upwards it shows an advance in structure; if it is nearly horizontal it means that little or no upward evolution has taken place; if it slopes downwards, that indicates degeneration or degradation. The proportional lengths and angles of the various lines are meant to represent roughly the amount and the nature of the evolution which has taken place.

In no case has the line representing the evolution of one group been allowed to pass through another group. All existing animals are represented as being at the ends of lines or branches.

It is scarcely necessary to point out that horizontal lines could not be drawn across this table in such a way as to divide it into sections representing the Fauna of the various geological periods. In order to show that, a very different table would require to be constructed in which distance along a line stretching upwards from the base would indicate merely the age of the group and not evolution or advance in organisation as in the present table.

What is of most interest here is the sole reticulation shown in the tree, which involves the Gregarinida (now part of the phylum Apicomplexa). Herdman's description of this is (pp 7-8):

The Gregarinida, like all parasitic organisms, are difficult to place, as there is always a probability that they have been considerably modified, or even degraded from the ancestral type, in consequence of their habits. They are placed in the table at the end of a long branch springing from the main stem of the Protozoa, close to the highest Monera, and extending outward and upward so as to reach a point a little above the level of Amoeba, but far from the axis. The length of the line shows the considerable amount of differentiation attained by the group and its somewhat isolated position, while its point of origin indicates the relationship which probably exists with the Monera, There is a similarity with the life-history of Myxastrum, and the ancestors of the Gregarinida may have diverged from the other Protozoa at a point close to this form, or one of the other allied Monera. On the other hand, it is possible that the Gregarinida may have degenerated from one of the higher Protozoa — from some form above Amoeba — or even from still higher animals. The dotted line in the table, stretching downwards from the base of the Metazoa, may serve to recall the possibility that the Gregarinida are a much degraded offshoot from some group of Gastrea-like organisms.

So, in this case the reticulation represents ambiguity regarding the evolutionary history. This is unusual for its time; and indeed this may be the first occasion on which a phylogenetic reticulation was used to represent anything other than hybridization. (It is not the first time that anyone presented conflicting evolutionary histories, however: Is this the first network from conflicting datasets?)

Finally, Herdman explicitly states his intention to provide a synthesis of contemporary empirical knowledge. This is also notable, given that many of his contemporaries were still doubtful about the feasibility of deriving an empirically based phylogenetic tree. Herdman notes in his Preface:

It is obvious that a classification such as this can only be in a limited sense original. It must of necessity agree in many respects with older schemes, amongst which the well-known diagrams of Professor Haeckel, published first in 1866, are the most notable of those in a tree-like form.

In working out the details of the table many books have been consulted, and I have tried to incorporate the views of the latest authorities so far as they commended themselves to my judgment. I may expressly mention the extensive use that has been made of various books and papers by Huxley, Ray Lankester, Moseley, Haeckel, Claus, and others; and particularly of that invaluable work, Balfour's Treatise on Comparative Embryology.

Herdman himself was interested in marine zoology, oceanography and geology. He is perhaps best known for his role in the Liverpool Marine Biology Committee (1885–1919). The Committee established a small biological station on Puffin Island, off the north coast of Anglesey, in 1887; and in 1892 a new, larger and better equipped station was opened at Port Erin, near the southern tip of the Isle of Man. (See S.J.H. 1920. The Liverpool Marine Biology Committee. Nature 104: 677.)

Thanks to John S. Wilkins for first drawing this book to my attention.

Wednesday, February 6, 2013

Is there a philosophy of phylogenetic networks?

In some previous blog posts I have discussed the role of phylogenetic networks in science (Are phylogenetic networks as scientific as trees?), particularly in terms of Description, explanation and prediction in phylogenetics. In this post I will look at the philosophy of phylogenetic networks, in terms of whether there is a strong basis for treating the mathematical analyses as having biological relevance.

This is an important point, because there are theoretically an infinite number of ways to mathematically analyze a set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. For example, there is a big difference between a mathematical summary of a set of numbers and any biological interpretation of that summary. The mode, for instance, is a neat mathematical measure of the central location of a biological dataset that also nominates one of the biological objects represented by that dataset, while the mean is an estimate of the central location that rarely describes any biological object at all. So, a mode describes biology directly while a mean does not necessarily do so.
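
As a small illustration of this distinction, here is a Python sketch using made-up counts (the numbers and the variable names are assumptions for illustration, not real data). It shows that the mode is always one of the observed values, whereas the mean need not be.

```python
from statistics import mean, mode

# Hypothetical counts (say, petals on eight flowers); invented for illustration.
petal_counts = [5, 5, 5, 6, 4, 5, 8, 5]

central_mode = mode(petal_counts)   # the most frequent observed value
central_mean = mean(petal_counts)   # the arithmetic average

# The mode names an object that exists in the dataset; the mean may not.
mode_is_observed = central_mode in petal_counts
mean_is_observed = central_mean in petal_counts
```

With these numbers the mode is 5, which several flowers actually have, while the mean is 5.375, which no flower can have.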

Given that there seem to be two quite different uses for phylogenetic networks, there are likely to be two different philosophical bases. The first of these is easier to deal with than the second one.

Data-display networks

Data-display networks are usually unrooted, and are intended to display the major patterns of character variation in a dataset. There is no necessary implication that any of these patterns are due to the evolutionary history of the organisms concerned, although it is very likely that many of the patterns will reflect that history, either directly or indirectly. I have therefore repeatedly emphasized the role of these networks in Exploratory Data Analysis (EDA).

This means that the obvious philosophical basis for data-display networks is the same as for EDA. There is a strong mathematical basis for EDA and this is considered to have direct relevance to biological studies. EDA has been explored in a number of works, both in general (eg. Tukey 1977; Hartwig & Dearing 1979; Tufte 1983, 1997; Ellison 2001; Behrens & Yu 2003; Young et al. 2006) and also within phylogenetics (eg. Bandelt 2005; Wägele & Mayer 2007; Morrison 2010). These can be consulted for further information.

The mathematical basis of EDA is to summarize the main characteristics of a dataset in an easily digested form, usually with graphs, without using an explicit statistical model or having formulated an a priori hypothesis. EDA is thus promoted as a counterpoint to confirmatory data analysis (ie. statistical hypothesis testing). The mathematics is not rigid, although various tools have been developed over more than a century. EDA is as relevant to biology as it is to all subjects where data are collected and analysed.

Evolutionary networks

Evolutionary networks, on the other hand, are rooted networks intended to elucidate phylogenetic history. Unlike phylogenetic trees, evolutionary networks explicitly allow for reticulation events (horizontal evolution) as well as descent from parent to offspring (vertical evolution). They are therefore usually seen as a logical generalization of phylogenetic trees.
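
To make the distinction concrete, a rooted network can be stored as a map from each node to its parents. The following Python sketch is only one possible representation, and the node names (including the hybrid "H") are hypothetical. A node with more than one parent is a reticulation; a network without any such node is simply a tree, assuming a single root.

```python
# A rooted network stored as a mapping from each node to its parent(s).
# The taxa and the hybrid "H" are hypothetical, chosen only for illustration.
parents = {
    "root": [],
    "x": ["root"],
    "y": ["root"],
    "A": ["x"],
    "H": ["x", "y"],   # two parents: a reticulation (eg. a hybrid)
    "B": ["y"],
}

def reticulation_nodes(parents):
    """Return the nodes that have more than one parent."""
    return sorted(node for node, ps in parents.items() if len(ps) > 1)

def is_tree(parents):
    """With a single root, a network with no reticulation nodes is a tree."""
    return not reticulation_nodes(parents)
```

Here `reticulation_nodes(parents)` picks out "H" as the only node showing horizontal as well as vertical descent.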

So, the obvious philosophical basis for evolutionary networks is the same as for phylogenetic trees. However, this inference is not as clear as we might like it to be. For phylogenetic trees there is a rationale for treating the mathematical tree diagram as a representation of evolutionary history; but it is harder to apply the same rationale to evolutionary networks.

The three logical steps to inference using phylogenetic trees are outlined in the figure.

First, we start with some genotypic data, which we transform into a mathematical summary (a directed acyclic graph, or DAG) via some quantitative model. Each of these models has an explicit mathematical and/or philosophical basis; for example, maximum likelihood has a well-established mathematical foundation, as does Bayesian analysis. However, there is no necessary biological foundation to these quantitative models, and they are simply convenient mathematical summaries, just like the mean. (Indeed, under a normal error model the mean is the maximum-likelihood estimate of the central location of a set of numbers.)
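
The remark about the mean can be checked numerically. The Python sketch below assumes a normal model with a fixed standard deviation and uses invented measurements; it searches a grid of candidate values for the one that maximizes the likelihood. The function name and the grid are illustrative choices, not part of any phylogenetic program.

```python
import math

def normal_log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of `data` under a normal model with mean `mu`."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )

# Invented measurements, for illustration only.
data = [2.0, 3.0, 7.0, 8.0]
sample_mean = sum(data) / len(data)

# Evaluate the likelihood on a grid of candidate values: 0.00, 0.01, ..., 10.00.
grid = [i / 100 for i in range(0, 1001)]
best_mu = max(grid, key=lambda mu: normal_log_likelihood(mu, data))
```

The grid value with the highest likelihood coincides with the sample mean, which is a purely mathematical result that says nothing about biology.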

The second step is to provide a biological basis for further inference. This is the importance of Willi Hennig in the history of phylogenetics — he provided the logical inference that a divergent mathematical tree can be treated as a representation of the gene or character history, because the tree-like patterns are formed from a nested series of shared derived character states (synapomorphies). That is, the mathematical summary can be logically inferred to represent a biological concept, the character history.
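
The nesting condition can be stated operationally. In the Python sketch below, each character is reduced to the set of taxa that share its derived state, and a collection of characters fits a diverging tree only if every pair of such sets is either nested or disjoint. The taxa and characters are hypothetical, and the check assumes that the ancestral state of every character is already known.

```python
from itertools import combinations

def is_nested_hierarchy(characters):
    """True if every pair of derived-state sets is nested or disjoint."""
    for a, b in combinations(characters.values(), 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Each hypothetical character lists the taxa sharing its derived state.
tree_like = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"D", "E"},
}

# c4 overlaps c2 without being nested in it, as might happen after hybridization.
conflicting = dict(tree_like, c4={"B", "C"})
```

The first set of characters passes the check and the second fails it, because taxon B would need to belong to two groups that are not nested within one another.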

In the third step we infer that a set of gene and/or character histories will, when combined in some way, also represent the organismal history. That is, we infer that gene histories represent organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

So, there is a philosophy to the use of trees for phylogenetic inference, involving three steps (mathematical, logical, practical). There may be mis-estimation of the evolutionary history in practice, of course, perhaps through mis-estimation of the trees or non-representative gene samples, but we cannot expect any method to be perfect. We simply accept that the method we have is the best one we can find, and that it provides a logical basis for inference.

The question is: how do we apply this philosophy to evolutionary networks?

It is sometimes argued that a network is a set of overlapping (partly incompatible) trees. For example, each genetic locus might show a tree-like evolutionary history, but this history might not be the same as any other locus in the same organism. If we adopt this viewpoint then we could consider it unproblematic to use the same philosophy as for trees. That is, at step 1 we produce a set of trees, at step 2 we infer these to represent a set of gene histories, and at step 3 we combine the histories. The only important difference would thus be at step 3, where we combine the genotypic trees in a way that allows for reticulation in the organismal history, rather than insisting that the organismal history be strictly tree-like.
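
The "set of trees" view can be sketched in a few lines of Python. Assuming a network stored as a map from each node to its parents (the node names are hypothetical), each tree contained in the network is obtained by keeping exactly one parent for every reticulation node. This is a simplified illustration rather than a published algorithm, and it leaves nodes with a single child in place.

```python
from itertools import product

# Hypothetical rooted network: node -> list of parents. "H" is a hybrid.
network = {
    "root": [],
    "x": ["root"],
    "y": ["root"],
    "A": ["x"],
    "H": ["x", "y"],
    "B": ["y"],
}

def displayed_trees(network):
    """Yield one parent map per way of resolving every reticulation."""
    nodes = list(network)
    # The root has no parent, so it gets the single placeholder choice None.
    choices = [network[n] or [None] for n in nodes]
    for combo in product(*choices):
        yield dict(zip(nodes, combo))

trees = list(displayed_trees(network))
```

With one hybrid having two parents there are two trees, one placing H with A and one placing it with B, and each locus might follow either of them.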

This is an issue that was debated back in the 1980s, when cladists first tried to come to grips with reticulations in a cladogram (eg. Bremer & Wanntorp 1979; Funk 1981, 1985; Humphries 1983; Nelson 1983; Wagner 1983; Wanntorp 1983). It has resurfaced occasionally since then (eg. Skála & Zrzavy 1994; Brower et al. 1996; Lienau & DeSalle 2009), with the consensus apparently being that for reticulating phylogenies this argument is acceptable.

However, it has also been argued that an evolutionary network is not simply a collection of trees. It is often contended, especially by those people dealing with prokaryotes (eg. Doolittle 1999, 2009; Bapteste et al. 2009, 2012), that there is no underlying tree-like structure in much of organismal history — biological history is an anastomosing plexus, instead. If we adopt this viewpoint then we cannot apply the three-step logic as outlined above. We still need to deal with the three steps (biological data to mathematical DAG, DAG to character evolution, characters to organismal evolution), but the DAG will have reticulations rather than being a diverging tree. So, we cannot apply Hennigian logic at step 2, because in a reticulated DAG the characters do not form a nested series of shared derived character states.

So, where are we to get our philosophy under these circumstances? How do we justify the inference that the mathematical summary represents evolutionary history? I have not yet seen this issue discussed in the literature.

References

Bandelt H-J (2005) Exploring reticulate patterns in DNA sequence data. In: Bakker FT, Chatrou LW, Gravendeel B, Pelser PB, eds. Plant Species-Level Systematics: New Perspectives on Pattern and Process. Koeltz, Königstein, pp 245-269.

Bapteste E, Lopez P, Bouchard F, Baquero F, McInerney JO, Burian RM (2012) Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proceedings of the National Academy of Sciences of the USA 109: 18266-18272.

Bapteste E, O'Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, Franklin-Hall L, Lapointe FJ, Dupré J, Dagan T, Boucher Y, Martin W (2009) Prokaryotic evolution and the tree of life are two different things. Biology Direct 4: 34.

Behrens JT, Yu CH (2003) Exploratory data analysis. In: Schinka JA, Velicer WF, eds. Handbook of Psychology, Vol. 2: Research Methods in Psychology. John Wiley & Sons, Hoboken, pp 33-64.

Bremer K, Wanntorp H-E (1979) Hierarchy and reticulation in systematics. Systematic Zoology 28: 624-627.

Brower AVZ, DeSalle R, Vogler AP (1996) Gene trees, species trees, and systematics: a cladistic perspective. Annual Review of Ecology and Systematics 27: 423-450.

Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284: 2124-2128.

Doolittle WF (2009) The practice of classification and the theory of evolution, and what the demise of Charles Darwin's tree of life hypothesis means for both of them. Philosophical Transactions of the Royal Society of London B Biological Sciences 364: 2221-2228.

Ellison AM (2001) Exploratory data analysis and graphic display. In: Scheiner SM, Gurevitch J, eds. Design and Analysis of Ecological Experiments, 2nd ed. Oxford University Press, Oxford, pp 37-62.

Funk VA (1981) Special concerns in estimating plant phylogenies. In: Funk VA, Brooks DR, eds. Advances in Cladistics: Proceedings of the First Meeting of the Willi Hennig Society. New York Botanical Garden Press, New York, pp 73-86.

Funk VA (1985) Phylogenetic patterns and hybridization. Annals of the Missouri Botanical Garden 72: 681-715.

Hartwig F, Dearing BE (1979) Exploratory Data Analysis. Sage, Newbury Park.

Humphries CJ (1983) Primary data in hybrid analysis. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 89-103.

Lienau EK, DeSalle R (2009) Evidence, content and collaboration and the tree of life. Acta Biotheoretica 57: 187-199.

Morrison DA (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044-1057.

Nelson GJ (1983) Reticulation in cladograms. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 105-111.

Skála Z, Zrzavy J (1994) Phylogenetic reticulations and cladistics: discussion of methodological concepts. Cladistics 10: 305-313.

Tufte ER (1983) The Visual Display of Quantitative Information. Graphics Press, Cheshire.

Tufte ER (1997) Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire.

Tukey JW (1977) Exploratory Data Analysis. Addison-Wesley, Reading.

Wägele JW, Mayer C (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.

Wagner WH (1983) Reticulistics: The recognition of hybrids and their role in cladistics and classification. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 63-79.

Wanntorp H-E (1983) Reticulated cladograms and the identification of hybrid taxa. In: Platnick NI, Funk VA, eds. Advances in Cladistics: Proceedings of the Second Meeting of the Willi Hennig Society. Columbia Uni. Press, New York, pp 81-88.

Young FW, Valero-Mora PM, Friendly M (2006) Visual Statistics: Seeing Data with Dynamic Interactive Graphics. Wiley, Hoboken.

Monday, February 4, 2013

Network analysis of Genesis 1:3

This idea was stolen blatantly from the Laboratory Exercises in Evolution at the Biology Department, University of Virginia (Janis Antonovics, Joanna Vondrasek, Doug Taylor), where it is set as a class exercise for learning phylogenetic analysis. In turn, these people credit a similar idea to Barbrook et al. (1998. The phylogeny of the Canterbury Tales. Nature 394: 839), although the originators of the idea appear to be Robinson and O'Hara (1996. Cladistic analysis of an Old Norse manuscript tradition. Research in Humanities Computing 4: 115-137). It is an exercise in stemmatology, which can be a lot more tricky than you might think.

Stemmatology is the discipline that attempts to reconstruct the transmission history of a written text on the basis of relationships between the various extant versions (eg. manuscripts or printings). These relationships can be revealed using phylogenetic networks, which is the approach that I present here. A network is more appropriate than a phylogenetic tree, for reasons that will become obvious — the evolution of books is not a simple thing.

Genesis

The original text of the Christian Bible was written mostly in Hebrew and Aramaic for the Old Testament, and in Greek for the New Testament. It was later translated into Latin, which was then standardized as the "Vulgate", and this was then almost the only version used in churches for the best part of a millennium. The only texts in Old English consisted usually of either the Gospels or the Psalms only.

This situation was challenged in the late 14th century, when the first Middle English translations of the whole Bible appeared. There was active resistance to this by the formal Church, and so the idea of an English translation was dropped until the mid 16th century, when the Reformation inspired attempts to translate the books into Modern English as part of a new Protestant religion. These moves were sanctioned by the government, with first the Great Bible (1539) and then the King James Version (1611). Various revisions of the latter have appeared, especially since the late 19th century. These days, there is a veritable cottage industry producing new versions of the Bible for various purposes, usually based on the original texts rather than on earlier translations, with various translation principles being employed (eg. Formal Equivalence, Dynamic Equivalence, Closest Natural Equivalence, etc).

You can consult the various versions of the English-language Bible at one or more of several online sites:

The data used below were all obtained from these sites. These sites suggest that the most famous English-language versions of the Bible are: the Geneva Bible (1560), as used throughout the Reformation, and by William Shakespeare as well as by the "Pilgrim Fathers" in America; and the King James Version (1611), which was the standard English text for a quarter of a millennium. The most widespread current Bible is apparently the New International Version, which has been updated several times since its first appearance in 1973.

Stemmatology

The text that I use is the third sentence of the Bible — Genesis 1:3. (The biblical text was first numbered in the Geneva Bible of 1560.) Here is a dated listing of that sentence in all of the early English translations, plus most of the revisions up to the mid-20th century, and a sample of the many recent versions:

1382 Wycliffe Bible  And God seide, Be maad li3t; and maad is li3t.
1395 Later Wycliffe  And God seide, li3t be maad; and li3t was maad.
1530 Tyndale Bible  Then God sayd: let there be lyghte and there was lyghte.
1535 Coverdale Bible  Than God sayd: let there be light: & there was lyght.
1537 Matthew Bible  And God sayde: let there be light, and there was light.
1539 Great Bible  And God sayde: let there be made lyght, and there was light made.
1560 Geneva Bible Then God saide, Let there be light: And there was light.
1568 Bishop's Bible And God sayde, let there be light: and there was light.
1609 Douay-Rheims Bible And God said: Be light made And light was made.
1611 King James Version And God said, Let there be light: and there was light.
1750 Challoner Revision And God said: Be light made. And light was made.
1769 Blayney Revision And God said, Let there be light: and there was light.
1833 Webster's Bible And God said, Let there be light: and there was light.
1862 Young's Literal Translation and God saith, 'Let light be;' and light is.
1885 English Revised Version And God said, Let there be light: and there was light.
1890 Darby Bible And God said, Let there be light. And there was light.
1901 American Standard Version And God said, Let there be light: and there was light.
1950 Knox Bible Then God said, Let there be light; and the light began.
1952 Revised Standard Version And God said, "Let there be light"; and there was light.
1971 New American Standard Bible Then God said, "Let there be light"; and there was light.
1973 New International Version And God said, "Let there be light," and there was light.
1976 Good News Bible Then God commanded, "Let there be light" — and light appeared.
1982 New King James Version Then God said, "Let there be light"; and there was light.
1995 God's Word Translation Then God said, "Let there be light!" So there was light.
1996 New Living Version Then God said, "Let there be light," and there was light.
2011 Common English Bible God said, "Let there be light." And so light appeared.

The first thing we need to do is align the text of these 26 versions, including both words and punctuation. This allows us to directly compare each of the elements of the sentence, comparing like with like as far as their features are concerned.
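
As a rough sketch of the first part of this step, the Python code below splits two of the sentences listed above into words and punctuation marks, and then compares them position by position. Folding everything to lower case is an assumption made here for simplicity, and a real alignment also needs gaps wherever the versions differ in length.

```python
import re

def tokenize(sentence):
    """Split a sentence into lower-cased words and separate punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

# King James Version (1611) and Darby Bible (1890), as listed above.
kjv = tokenize("And God said, Let there be light: and there was light.")
darby = tokenize("And God said, Let there be light. And there was light.")

# Positions where the two token lists differ. These two happen to have the
# same length, so no gaps are needed for this particular pair.
differences = [(i, a, b) for i, (a, b) in enumerate(zip(kjv, darby)) if a != b]
```

Once case is ignored, these two versions differ only in the punctuation mark after the first "light".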

This is not as easy as it sounds. In this alignment I have separated words when they seem to have a different intent; for example, "was made" is not equivalent to "appeared". I can see endless arguments about the alignment of any text; and, indeed, disagreements about the intent of the original text are what have led to so many different versions of the Bible being created in English.

This alignment then needs to be coded as a set of characters, which define the hypothesized homology between the various elements of the text. In this case I ended up with 50 additive binary characters for analysis. In general, I used Young's Literal Translation to determine the ancestral state for each character, as this translation was an explicit attempt to emulate the Hebrew original. A nexus-formatted version of the dataset is available here.
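
To show the general idea of such coding, here is a simplified Python sketch. It is not the coding scheme actually used for the dataset: the three columns (opening word, speech verb, closing verb) are taken from four of the sentences listed above, case is ignored, and each variant that differs from the reference text simply becomes one 0/1 character.

```python
# Three aligned columns for four versions, simplified from the listing above.
aligned = {
    "Young 1862": ["and",  "saith", "is"],
    "KJV 1611":   ["and",  "said",  "was"],
    "Knox 1950":  ["then", "said",  "began"],
    "NIV 1973":   ["and",  "said",  "was"],
}
reference = "Young 1862"   # treated here as showing the ancestral state

def binary_code(aligned, reference):
    """One 0/1 character per non-reference variant observed in each column."""
    names = list(aligned)
    matrix = {name: [] for name in names}
    for col in range(len(aligned[reference])):
        ancestral = aligned[reference][col]
        variants = sorted({aligned[n][col] for n in names} - {ancestral})
        for variant in variants:
            for name in names:
                matrix[name].append(1 if aligned[name][col] == variant else 0)
    return matrix

matrix = binary_code(aligned, reference)
```

The reference text is coded as all zeros, and the King James Version and New International Version receive identical codes because they agree in all three columns.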

Various network methods could be used to summarize the character data. First, I have used a NeighborNet based on Hamming distances, as I usually do (see my earlier analyses). As you can see from the graph, there are no simple tree-like relationships among these texts, which calls into question any simplistic attempt at stemmatology. (Note that in two cases there are multiple texts that have identical sentences, and thus they appear at the same location in the graph.)
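
For readers unfamiliar with it, the Hamming distance between two texts is just the number of characters at which their codes differ. The Python sketch below computes it for three hypothetical character vectors (these are not rows from the real dataset); the resulting pairwise distances are the kind of input that a distance-based network method works from.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v))

# Hypothetical binary character vectors for three texts (not the real dataset).
texts = {
    "text_1": [0, 0, 0, 0, 1],
    "text_2": [0, 1, 0, 1, 1],
    "text_3": [1, 1, 1, 0, 0],
}

# Distance for every unordered pair of texts.
distances = {
    (a, b): hamming(texts[a], texts[b]) for a, b in combinations(texts, 2)
}
```

Two texts with identical sentences have a distance of zero, which is why they fall on the same point in the graph.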

It is worth pointing out here that Barbrook et al. (1998) produced a bush-like graph from their data for the Canterbury Tales, but only after deleting 14 of their 58 manuscripts, "as they were likely to have been copied from more than one exemplar, either by deliberate conflation of readings or by changing the exemplar during the course of copying." A similar explanation is likely to apply for some of the texts for Genesis 1:3, although many of them were translated directly from the original Hebrew rather than from later translations (eg. the Latin "Vulgate").

Nevertheless, there is a general separation of the older Genesis texts on the right of the graph and the more recent texts on the left. This might be easier to assess if we simplify the graph.

As a simpler summary of the same relationships, I have used a Reduced Median Network, based on r = 2 (the program default). Note that the time order is reversed in this graph, with the older texts on the left and the more recent texts on the right. The only major discrepancy between the two graphs is the relative placement of the Bishop's Bible. (Also, I have not labelled the two cases where there are several texts that have identical sentences.)

Historically, we would expect the Tyndale Bible, Coverdale Bible, Matthew Bible and Great Bible texts to be closely related, but the Great Bible seems not to fit this expectation. Similarly, we would expect a similarity between the Geneva Bible and the Bishop's Bible, which is also not reflected in the study sentence; nor is the acknowledged debt of the King James Version to the Tyndale Bible.

However, the fact that the Wycliffe Bible and Later Wycliffe are written in Middle English rather than Modern English is clear from their distant relationship to the other texts; and the close historical relationship of the Challoner Revision and the Douay-Rheims Bible is also clear.

Several texts show isolated relationships. The Knox Bible, for example, is unique among the modern texts in being taken from the Latin rather than the original Hebrew, while the Common English Bible is unusual in trying to balance two translation principles (Dynamic Equivalence and Formal Equivalence) rather than using only one.

On the other hand, the New International Version is clearly a very traditional version of the text, given its relationships as shown in the two graphs, which perhaps explains its popularity.

The close association of the Good News Bible with Young's Literal Translation is interesting, given that the former is an (often criticized) free paraphrase of the original Hebrew text while the latter is a literal translation of that same text — you can't get more different translation principles.

Conclusion

The lack of any simple tree-like relationship among these biblical texts makes any attempt to study their phylogeny difficult. My own look at the business of stemmatology suggests that the results here are quite typical of any study of written texts. Part of the problem seems to be that ideas developed in one historical lineage can be transferred to other lineages, and even transferred to earlier parts of those lineages (see my previous post: Time inconsistency in evolutionary networks). So, even though there is a general historical trend through time, that trend is not consistent enough for a tree-based historical analysis to be effective.

Note: there is a later blog post on this same topic — Trees and networks of written manuscripts.