Monday, September 30, 2013

Affinity networks updated

Last year I published a post on Networks of affinity rather than genealogy, in which I listed the publications from 1750-1900 that I know contain affinity networks. These are non-directional networks showing affinity among taxa, rather than showing genealogical relationships among the taxa.

Affinity refers to a natural (rather than artificial) overall group resemblance, usually quantified (in modern terminology) by some sort of weighted similarity of characters. Patterns of affinity may, indeed, result from evolutionary relationships, but affinity is a much broader concept than genealogy — in particular, affinity relationships are usually multi-directional rather than nested. This distinction between affinity and genealogy runs throughout the history of depictions of biological relationships, and continues to this day.

The importance that I see in these historical networks is that they match closely the modern idea of unrooted data-display networks. In this post I update my list of affinity networks, and add some more notes about their origins.

Maps versus networks

The important point here is that affinity is usually imagined as being multi-faceted, so that any diagram of affinities shows multiple connections among the taxa, and relationships between groups are very definitely reticulating. However, one point that I did not sufficiently emphasize in the previous blog post is that biologists have used all sorts of other representations of reticulate relationships: some are intended to be solid figures with faces, others are interlocking circles or radiating hexagons or nested ovals, and some have been explicitly referred to as maps. There are also what are referred to as quinarian classifications, in which taxa are arranged in groups of five that show multiple relationships.

Most of these diagrams could be converted to a network graphical representation. However, a network diagram should strictly have relationships indicated solely by connecting lines, rather than by overlapping circles, etc. That is, we can distinguish between networks in the strict sense and what are usually called "maps" in the broad sense. The latter name comes from Carl von Linné (see the quotation below). So, both networks and maps show reticulate relationships, but networks are still connected by lines even when any enclosing structure of groups is deleted.

I have used this strict definition of networks to update my list of affinity networks. The new list differs from the one in the previous post mainly by deleting two publications (which are better considered as maps, not networks) and adding two more networks.

Affinity networks

Here is a list of the publications from 1750-1900 containing affinity networks that I know about. I have indicated my source, and I have also linked to an online copy of the diagram. (I have extracted some of the figures from Gallica, Google Books or the Biodiversity Heritage Library, where they are otherwise unavailable online.)
  • 1774 Johann Philipp Rühling "Ordines Naturales Plantarum Commentatio Botanica" [Ragan Fig. 7, Stevens Fig. 12, Barsanti Fig. 26, Pietsch Fig. 18]
  • 1777 Johann Hermann "Tabula affinitatum animalium" [Barsanti Fig. 31] republished in 1783 [Ragan Fig. 8]
  • 1802 August Johann Georg Carl Batsch "Tabula Affinitatum Regni Vegetabilis" [Ragan Fig. 9, Barsanti Fig. 43, Pietsch Fig. 19]
  • 1825 Adrien-Henri-Laurent de Jussieu "Sur le groupe des Rutacées." Mémoires du Muséum d'Histoire Naturelle 12: 384-542 [Ragan Fig. 13, Stevens Fig. 10, Pietsch Fig. 30]
  • 1826 Leopold Joseph Franz Johann Fitzinger "Neue Classification der Reptilien" [Gaffney Fig. 2]
  • 1841 Eduard Fenzl "Darstellung und Erläuterung vier minder bekannter, ihrer Stellung im natürlichen Systeme nach bisher zweifelhaft gebliebener, Pflanzen-Gattungen." Denkschriften der Königlich-Baierischen Botanischen Gesellschaft zu Regensburg 3: 153-270 [Stevens, figure]
  • 1843 Adrien-Henri-Laurent de Jussieu "Monographie de la famille des Malpighiacées." Archives du Muséum d'Histoire Naturelle 3: 5-151 [Stevens, figure]
  • 1844 Henri Milne-Edwards "Considérations sur quelques principes relatifs à la classification naturelle des animaux et plus particulièrement sur la distribution méthodique des mammifères." Annales des Sciences Naturelles, Sér 3 Zoologique 1: 65-99 [Barsanti Fig. 59, figure]
  • 1872 Alexander Andrejewitch von Bunge "Die gattung Acantholimon Boiss." Mémoires du Academie Imperiale des Sciences de St Pétersbourg Série 7 18(2): 1-72 [Stevens Fig. 19]
  • 1888 Ferdinand Albin Pax "Monographische übersicht über die arten der gattung Primula." Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241 [Stevens, there are 15 figures]
  • 1889 Julien Vesque "Epharmosis, sive Materiae ad Instruendam Anatomiam Systematis Naturalis. Pars Prima. Folia Capparearum" [Stevens, figure]
  • 1889 Julien Vesque "Epharmosis, sive Materiae ad Instruendam Anatomiam Systematis Naturalis. Pars Secunda. Genitalia Foliaque Garciniearum et Calophyllearum" [Stevens, figure]
  • 1890 Franz Georg Philipp Buchenau "Monographia Juncacearum." Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 12: 1-495 [Stevens, figure]
  • 1893 Georg Klebs "Flagellatenstudien, Theil II." Zeitschrift für Wissenschaftliche Zoologie 55: 353-445 [Ragan Fig. 26]
  • 1895 Olga Tchouproff "Quelques notes sur l'anatomie systématique des Acanthacées." Bulletin de l'Herbier Boissier 3: 550-560 [Stevens, figure]
  • 1896 Nicolai Ivanovich Kusnezov "Subgenus Eugentiana Kusnez. generis Gentiana Tournef." Acta Horti Petropolitani 15: 1-507 [Ragan, Stevens Fig. 18, Pietsch Fig. 91]
  • 1898 Émile Constant Perrot "Anatomie comparée des Gentianacées." Annales des Sciences Naturelles: Botanique, Série 8, 7: 105-292 [Stevens, figure]

Some historical notes

Other early authors described biological relationships as being like a reticulating network, without any necessarily genealogical interpretation, even though they apparently did not themselves produce network diagrams. (It seems that only 5 of the first 11 historical references to network relationships provided a diagram.) Here are some known examples (with source):
  • 1750 Vitaliano Donati "Della Storia Naturale Marina dell' Adriatico" [Ragan]
  • 1792 Giuseppe Olivi "Zoologia Adriatica" [Ragan]
  • 1802 Gottfried Reinhold Treviranus "Biologie, oder Philosophie der Lebenden Natur für Naturforscher und Aerzte, Band 1" [Ragan]
  • 1824 Johann Heinrich Friedrich Link "Elementa Philosophiae Botanicae" [Stevens]
  • 1828 Georges Cuvier "Histoire Naturelle des Poissons, Tome Premier" [Ragan]
  • 1836 Constantin Rafinesque "Flora Telluriana" [Stevens]
Note that it was apparently Vitaliano Donati who first suggested that biological relationships are like a network, although he did not provide an explicit diagram to illustrate this idea:
In addition, the links of the chain are joined in such a way with the links of another chain, that the natural progressions should have to be compared more to a net than to a chain, that net being, so to speak, woven with various threads which show, between them, changing communications, connections, and unions. [Translated from the Italian by Ragan 2009]
This was followed immediately by a very similar idea from Carl Linnaeus (1751 "Philosophia Botanica"), who also did not provide a diagram:
Aphorism 77: All plants show affinities on either side, like territories in a geographical map. [Translated from the Latin]
The first figure of relationships drawn as a map was apparently Linnaeus' own map of plant families [see Barsanti Fig. 37, Pietsch Fig. 16], published by two of his former students, J.C. Fabricius and P.D. Giseke, in a posthumous collection of his lectures (Giseke 1792 "Caroli a Linné Praelectiones in Ordines Naturales Plantarum").

Modern phylogenetics has used a tree as the preferred means of depicting relationships, rather than a network or map. The tree metaphor seems to have first come from Peter Simon Pallas (1766 "Elenchus Zoophytorum"), who explicitly acknowledged the earlier ideas:
As Donati has already judiciously observed, the works of Nature are not connected in series in a Scale, but cohere in a Net. On the other hand, the whole system of organic bodies may be well represented by the likeness of a tree that immediately from the root divides both the simplest plants and animals, [which remain] variously contiguous as they advance up the trunk, Animals and Vegetables. [Translated from the Latin by Ragan 2009]
Note that the tree metaphor was explicitly intended by Pallas to be a simplification of the previously proposed network metaphor.

Actually, reticulating diagrams dominated over trees in the literature until the publication of Charles Robert Darwin's major work (1859 "On the Origin of Species"). Darwin had two effects that are important for the discussion of metaphors.

First, he replaced the idea of an inherent order with a less ordered view of biodiversity as resulting from the contingencies of natural selection. This meant that metaphors allowing for multiple relationships among taxa (previously required to express the observed complexity of biodiversity) were no longer needed, and hence neither was the documented preference for reticulating diagrams (networks, maps, circles, cones, etc). Darwin focused attention solely on genealogical relationships, to the exclusion of all others.

Second, Darwin championed the tree as the appropriate metaphor. This was possible because descent with modification can easily be expressed in a tree, provided that we focus (as he did) on vertical genealogical relationships (ancestor–descendant) rather than horizontal ones. One of his most famous quotes is:
The affinities of all the beings of the same class have sometimes been represented by a great tree ... The green and budding twigs may represent existing species; and those produced during each former year may represent the long succession of extinct species.
Darwin knew about horizontal evolutionary events like hybridization, but he did not really integrate them into his metaphor. Darwin did not use the word "network" but he did use the word "web" with regard to affinity:
We can clearly see how it is that all living and extinct forms can be grouped together in one great system; and how the several members of each class are connected together by the most complex and radiating lines of affinities. We shall never, probably, disentangle the inextricable web of affinities between the members of any one class.


Barsanti G. (1992) La Scala, la Mappa, l'Albero: Immagini e Classificazioni della Natura fra Sei e Ottocento. Sansoni Editore, Firenze.

Gaffney E.S. (1984) Historical analysis of theories of chelonian relationship. Systematic Zoology 33: 283-301.

Pietsch T.W. (2012) Trees of Life: a Visual History of Evolution. Johns Hopkins Uni. Press, Baltimore.

Ragan M. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Stevens P.F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York.

Wednesday, September 25, 2013

How do we interpret a rooted haplotype network?

A splits graph is an unrooted phylogenetic network (see How to interpret splits graphs). It can be produced by any of several algorithms, including distance-based methods such as NeighborNet and Split Decomposition, character-based methods such as Median Networks and Parsimony Splits, and tree-based methods such as Consensus Networks and SuperNetworks.

Such graphs can also be produced by methods that conceptually modify Median Networks, such as Reduced Median Networks and Median-Joining Networks. These two methods are popular in population genetics, especially as related to Homo sapiens, where they are used as haplotype networks (or 1-step networks); and it is their use as haplotype networks that I wish to discuss here.

Haplotype networks represent the relationships among the different haploid genotypes observed in the dataset (ie. identical sequences are pooled into a single terminal). They are usually drawn unrooted, which is quite sensible for within-species data, where the root location is often unknown. However, there are occasions when a root is provided, and authors then interpret the splits graph as a directed network. This is directly analogous to starting with an unrooted phylogenetic tree and adding a root (usually via an outgroup), so that the rooted tree can be interpreted as a genealogical history. In moving from an unrooted to a rooted tree, each branch acquires a direction (away from the root), and the internal nodes become hypothetical ancestors.
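The pooling step described above (identical sequences collapsed into a single terminal) can be sketched in a few lines. This is only an illustration under simple assumptions: the sequences are already aligned, the sample names and sequences are invented, and the function name pool_haplotypes is hypothetical rather than taken from any particular software package.

```python
from collections import defaultdict


def pool_haplotypes(samples):
    """Group sample names by identical sequence; each group is one haplotype."""
    groups = defaultdict(list)
    for name, seq in samples.items():
        groups[seq.upper()].append(name)
    return dict(groups)


# Hypothetical aligned sequences, invented purely for illustration.
samples = {
    "ind1": "ACGTAC",
    "ind2": "ACGTAC",
    "ind3": "ACGTTC",
    "ind4": "ACGTAC",
    "ind5": "GCGTTC",
}

haplotypes = pool_haplotypes(samples)
for seq, names in haplotypes.items():
    # Each distinct sequence becomes one terminal; the count is its frequency.
    print(seq, len(names), names)
```

Each key of the result corresponds to one node of the haplotype network, and the length of its list is the frequency that is often drawn as the node size.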

However, this is problematic for all types of unrooted network. In the case of splits graphs, each edge acquires an unambiguous direction, as for a tree, but not every internal node can necessarily be interpreted as a hypothetical ancestor. How, then, do we interpret the rooted haplotype network?

An example

Let's look at a specific example, taken from the recent paper by Witas et al. (2013).

Figure 4 from this paper shows a haplotype network of four mtDNA HVR1 (hypervariable region 1 of the control region) samples from Ancient Mesopotamia (the middle Euphrates valley between 2500 BC and 500 AD), compared to contemporary samples from five different geographical regions. It shows that the ancient samples fit neatly into modern genetic variation from southern and eastern Asia, rather than from eastern Europe.

However, note that a root is also explicitly indicated. I explain below where this root comes from, but first let's concentrate on what happens if we treat the network as rooted.

This is a Median-Joining Network, and thus it is a splits graph. As such, the root provides unambiguous directions for all of the branches, based on the principle that the network must be a directed acyclic graph with only one root. This is shown by the arrows in the modified figure. Furthermore, all of the internal nodes can be interpreted as hypothetical ancestors, except for the two reticulations in the graph, labelled A and B.

These reticulations are created by contradictory patterns involving the characters labelled 16276, 16185 and 16311. In a rooted splits graph, reticulations represent uncertainty about the order of character changes, rather than representing reticulate evolution (eg. recombination, hybridization, etc). In this case, we cannot determine whether character 16311 changes before or after the changes in characters 16185 and 16276.

So, it is important to recognize that a rooted splits graph does not explicitly represent a phylogeny, because reticulations in the graph represent uncertainty not genealogy.
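The rooting procedure can be illustrated with a small sketch. It assumes that the network is given as a list of undirected edges, and that no edge joins two nodes at the same distance from the root (which holds for bipartite graphs such as median networks). Every edge is then pointed away from the root, and any node with more than one incoming edge is reported as a reticulation. The function name root_network and the toy graph are hypothetical; the graph is not the network of Witas et al.

```python
from collections import deque


def root_network(edges, root):
    """Orient each undirected edge away from the root using BFS distances."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    # Breadth-first search gives the number of edges from the root to each node.
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    directed = []
    for u, v in edges:
        if dist[u] == dist[v]:
            # Violates the stated assumption, so no direction can be assigned.
            raise ValueError(f"edge {u}-{v} cannot be oriented by distance")
        directed.append((u, v) if dist[u] < dist[v] else (v, u))

    # A reticulation is any node with two or more incoming edges.
    parents = {}
    for parent, child in directed:
        parents.setdefault(child, []).append(parent)
    reticulations = sorted(n for n, p in parents.items() if len(p) > 1)
    return directed, reticulations


# Toy network with one square (x, a, b, c), invented for illustration.
edges = [("root", "x"), ("x", "a"), ("x", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
directed, reticulations = root_network(edges, "root")
print(directed)
print(reticulations)
```

In this toy case node c has two parents (a and b), which corresponds to not knowing which of the two character changes around the square occurred first.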

The simplest interpretation of this type of rooted splits graph is usually that the network represents a set of most-parsimonious trees, rather than a single parsimony tree. The different trees can be obtained by resolving the reticulations (ie. by deciding what order the character changes occur in). This relationship between the rooted haplotype network and a parsimony tree is shown by the following example from Jansen et al. (2002).

This is a network of 93 mtDNA control-region haplotypes from horses. It is also a Median-Joining Network, although the data were pre-processed using a Reduced Median Network. Node A6 is the root, based on equid outgroups. The solid lines indicate one of the most-parsimonious trees contained within the network: for every reticulation, one particular order of the character changes has been selected by the authors in order to postulate this particular tree. The non-chosen parts of the network are indicated by dotted lines.
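Resolving the reticulations can also be sketched as code. Assuming a rooted network supplied as (parent, child) edges, each candidate tree is obtained by keeping exactly one incoming edge for every node. The function name resolve_reticulations and the toy edge list are hypothetical. Note also that this simple enumeration does not itself check parsimony, and that unsampled intermediate nodes left without descendants would normally be pruned afterwards.

```python
from itertools import product


def resolve_reticulations(directed_edges):
    """Yield every tree obtained by keeping one parent per non-root node."""
    parents = {}
    for parent, child in directed_edges:
        parents.setdefault(child, []).append(parent)
    children = sorted(parents)
    # Nodes with a single parent contribute one option; reticulations several.
    for choice in product(*(parents[c] for c in children)):
        yield sorted(zip(choice, children))


# Toy rooted network with one reticulation at node c (invented example).
directed_edges = [
    ("root", "x"), ("x", "a"), ("x", "b"),
    ("a", "c"), ("b", "c"), ("c", "d"),
]

trees = list(resolve_reticulations(directed_edges))
for tree in trees:
    print(tree)
```

With k reticulations that each have two parents, this yields 2^k candidate trees, so the single tree drawn with solid lines in a published figure is one choice among these.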

Explanation of the human mtDNA root

mtDNA is usually treated as a non-recombining locus, and so it should evolve along a tree. A rooted global tree has therefore been produced for humans, based on parsimony analysis of the mtDNA genome (Torroni et al. 2000; van Oven and Kayser 2009). Groups and subgroups of this tree have been labelled as haplogroups, such as haplogroup M shown in the top figure, and sub-haplogroups, such as M4b, M49 and M61. These are (monophyletic) clades in the mtDNA tree that have been highlighted for convenience. Parsimony analysis has been used to reconstruct the ancestral sequences in the tree (Behar et al. 2012), and these ancestral sequences can be used to assign new sequences to their appropriate place in the rooted tree (Blanco et al. 2011).

The basic limitation of this approach is that the haplogroups and sub-haplogroups are based on a non-unique parsimony tree. There are many equally parsimonious trees for the dataset, any one of which could have been chosen to define the haplogroups. In spite of this, the predefined haplogroups are treated by many people as designating specific mitochondrial lineages, rather than merely being groups of convenience.


Behar DM, van Oven M, Rosset S, Metspalu M, Loogväli EL, Silva NM, Kivisild T, Torroni A, Villems R (2012) A "Copernican" reassessment of the human mitochondrial DNA tree from its root. American Journal of Human Genetics 90: 675-684.

Blanco R, Mayordomo E, Montoya J, Ruiz-Pesini E (2011) Rebooting the human mitochondrial phylogeny: an automated and scalable methodology with expert knowledge. BMC Bioinformatics 12: 174.

Jansen T, Forster P, Levine MA, Oelke H, Hurles M, Renfrew C, Weber J, Olek K (2002) Mitochondrial DNA and the origins of the domestic horse. Proc Natl Acad Sci USA 99: 10905-10910.

Torroni A, Achilli A, Macaulay V, Richards M, Bandelt H-J (2000) Harvesting the fruit of the human mtDNA tree. Trends in Genetics 22: 339-345.

van Oven M, Kayser M (2009) Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human Mutation 30: E386-E394.

Witas HW, Tomczyk J, Jędrychowska-Dańska K, Chaubey G, Płoszaj T (2013) mtDNA from the Early Bronze Age to the Roman period suggests a genetic link between the Indian subcontinent and Mesopotamian cradle of civilization. PLoS One 8(9): e73682.

Monday, September 23, 2013

The first paper on HGT in plants (1971)

The first descriptions of what we now call horizontal gene transfer (HGT) started appearing in the middle of last century (Freeman 1951; Lederberg et al. 1951), although what appears to be the first account was mistakenly attributed to sexual recombination (Lederberg & Tatum 1946). Shortly thereafter, experimental work was published concerning mechanisms for the transfer of genetic material between micro-organisms via what we now call transduction (Zinder & Lederberg 1952; Stocker et al. 1953).

The possibility was soon considered that the asexual transfer of genetic units may be of more general occurrence (Ravin 1955). However, it was not really until molecular sequencing became available in the 1980s that biologists started presenting anecdotal evidence for gene transfer among eukaryotes (Shilo & Weinberg 1981; Singh et al. 1981; Busslinger et al. 1982; Hyldig-Nielson et al. 1982; Engels 1983), although Benveniste & Todaro (1974) may have been the first to do so. Unfortunately, most of these suggestions turned out to be spurious, once more evidence accumulated (Smith et al. 1992; Syvanen 1994).

So, it was not until this century that HGT among eukaryotes started to be taken seriously (Bergthorsson et al. 2003, 2004; Won and Renner 2003), although in the first case the evidence presented is rather doubtful. Since the advent of the genome sequencing era, HGT has become an important discussion point for eukaryote evolution (see reviews by Bock 2010; Boto 2010; Renner & Bellot 2012).

All of this work seems to have ignored a speculative, but very prescient, paper published by Frits Went in 1971, based on morphological and anatomical data rather than on gene sequences (ie. phenotypic rather than genotypic evidence). This seems to be a classic example of molecular research being disconnected from the literature on whole-organism biology. Note that, of course, all of the early papers about HGT in bacteria were also based on phenotypic data, although in those cases it was experimental rather than descriptive data.

Went's "suggested mechanism for parallel development [is] by transfer of chromosome fragments carrying groups of genes of proven adaptive competence". Here are some extracts from the Discussion, showing that he was explicitly considering HGT among plants:
The development of parallel morphological characters suggests that in each case closely similar sequences of cell division and cell differentiation occur, which thus lead to similar forms. This in turn suggests that similar sets of genes are involved. It is known in a number of cases (except when too many chromosomal translocations have occurred such as in Drosophila) that genes involved in development of individual organs are frequently located together in chromosome segments. This almost forces us to assume that these parallel forms are due to the presence of similar chromosome segments. According to this view, the similarities did not arise by identical series of mutations in all plants with parallel forms (which would have resulted in whole series of intermediate forms), but by a one-time transfer of the same chromosome segment.
The transfer of a particular chromosome segment between different families has to be non-sexual of course. Non-sexual transfer of genetic material has now been demonstrated in a number of cases. Transduction in bacteria is a prime example. There is also the transfer of viruses, which are either RNA or DNA, and which can occur between completely unrelated families (tobacco mosaic can infect more than a dozen different families).
In consequence I suggest that 1) particular chromosome segments, containing gene sequences for the development of specific forms, exist in certain geographical areas. And, 2) these chromosome segments can be transmitted from one plant to another. This can occur sexually within one genus ... or a-sexually between genera or families ...
Is it possible to accept the existence of gene-group transfer between families? The morphological, anatomical, and biochemical examples presented speak for it. Transduction provides a basis. Interfamiliar virus transfer is possible. Is there perhaps an insect vector for this interfamiliar chromosome transfer? Or does it occur during fertilisation, when anyway at least 2 complete nuclei are transferred, and when perhaps part of a third nucleus could move as well?
Went was principally a plant physiologist, although he also worked in plant ecology, and was apparently an avid anti-reductionist, who was concerned about the increasing dominance of genetics in biology. Possibly, he would not have been impressed by the molecular revolution that led ultimately to the widespread study of HGT in plants. He preferred (Went 1974) "presently neglected fields which may not find their solution in DNA or RNA. Excessive preoccupation with this subject presently so popular has impoverished biology as a whole."


Benveniste RE, Todaro GJ (1974) Evolution of C-type viral genes: inheritance of exogenously acquired viral genes. Nature 252: 456-459.

Bergthorsson U, Adams KL, Thomason B, Palmer JD (2003) Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424: 197–201.

Bergthorsson U, Richardson AO, Young GJ, Goertzen LR, Palmer JD (2004) Massive horizontal transfer of mitochondrial genes from diverse land plant donors to the basal angiosperm Amborella. Proceedings of the National Academy of Sciences of the USA 101: 17747-17752.

Bock R (2010) The give-and-take of DNA: horizontal gene transfer in plants. Trends in Plant Science 15: 11-22.

Boto L (2010) Horizontal gene transfer in evolution: facts and challenges. Proceedings of the Royal Society B: Biological Sciences 277: 819-827.

Busslinger M, Rusconi S, Birnstiel ML (1982) An unusual evolutionary behaviour of a sea urchin histone gene cluster. EMBO Journal 1: 27-33.

Engels WR (1983) The P family of transposable elements in Drosophila. Annual Review of Genetics 17: 315-344.

Freeman VJ (1951) Studies on the virulence of bacteriophage-infected strains of Corynebacterium diphtheriae. Journal of Bacteriology 61: 675-688.

Hyldig-Nielson JJ, Jensen EØ, Paludan K, Wiburg O, Garrett R, Jørgensen P, Marcker KA (1982) The primary structures of two leghemoglobin genes from soybean. Nucleic Acids Research 10: 689-701.

Lederberg J, Lederberg EM, Zinder ND, Lively ER (1951) Recombination analysis of bacterial heredity. Cold Spring Harbor Symposium on Quantitative Biology 16: 413-443.

Lederberg J, Tatum EL (1946) Gene recombination in Escherichia coli. Nature 158: 558.

Ravin AW (1955) Infection by viruses and genes. American Scientist 43: 468-478.

Renner SS, Bellot S (2012) Horizontal gene transfer in eukaryotes: fungi-to-plant and plant-to-plant transfers of organellar DNA. Advances in Photosynthesis and Respiration 35: 223-235.

Shilo BZ, Weinberg RA (1981) DNA sequences homologous to vertebrate oncogenes are conserved in Drosophila melanogaster. Proceedings of the National Academy of Sciences of the USA 78: 6789-6792.

Singh L, Purdom IF, Jones KW (1981) Conserved sex chromosome-associated nucleotide sequences in eukaryotes. Cold Spring Harbor Symposium on Quantitative Biology 45: 805-813.

Smith MW, Feng D-F, Doolittle RF (1992) Evolution by acquisition: the case for horizontal gene transfers. Trends in Biochemical Science 17: 489-493.

Stocker BAD, Zinder ND, Lederberg J (1953) Transduction of flagellar characters in Salmonella. Journal of General Microbiology 9: 410-433.

Syvanen M (1994) Horizontal gene transfer: evidence and possible consequences. Annual Review of Genetics 28: 237-261.

Went FW (1974) Reflections and speculations. Annual Review of Plant Physiology 25: 1-26.

Won H, Renner SS (2003) Horizontal gene transfer from flowering plants to Gnetum. Proceedings of the National Academy of Sciences of the USA 100: 10824-10829.

Zinder ND, Lederberg J (1952) Genetic exchange in Salmonella. Journal of Bacteriology 64: 679-699.

Wednesday, September 18, 2013

Checking data errors with phylogenetic networks

Data-display networks can be used for a number of purposes, for example: Exploratory data analysis, Displaying data patterns, Displaying data conflicts, Summarizing analysis results, and Testing phylogenetic hypotheses. One of the more important, but currently under-valued, purposes is detecting data errors.

For instance, networks can help you detect data-sampling errors or outliers (eg. wrong specimen identification, diseased specimens), as well as data-collection errors (eg. extracting the wrong DNA, amplifying the wrong gene, sequencing artifacts) and data-processing errors (eg. data entry mistakes, incorrect alignment). These types of errors will likely show up as reticulations in a network, especially a splits graph.
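One simple way to see where such reticulations come from is the four-gamete test: two binary characters that show all four combinations of states cannot both fit the same tree without extra changes, and in a median network such a pair produces a cycle. The sketch below is only an illustration of that idea, not the method used in any of the publications listed here. The alignment is invented, and the function name incompatible_site_pairs is hypothetical. A flagged pair may reflect homoplasy or recombination as well as an error, so it is a prompt for inspection rather than proof of a mistake.

```python
from itertools import combinations


def incompatible_site_pairs(sequences):
    """Return pairs of alignment columns that show all four state combinations."""
    rows = list(sequences.values())
    length = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(length)]
    # Only columns with exactly two states are treated as binary characters.
    binary = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    pairs = []
    for i, j in combinations(binary, 2):
        if len(set(zip(columns[i], columns[j]))) == 4:
            pairs.append((i, j))
    return pairs


# Hypothetical alignment: columns 0 and 1 conflict, column 2 fits either.
alignment = {
    "s1": "AAA",
    "s2": "GAA",
    "s3": "AGA",
    "s4": "GGT",
}

result = incompatible_site_pairs(alignment)
print(result)
```

Column indices here are zero-based positions in the alignment. Sites that appear in many incompatible pairs are natural candidates to check against the original sequence traces.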

Perhaps the most powerful use of such networks is in conjunction with a database of gold-standard or benchmark sequences. Comparison of all new sequences with the database would allow for a systematic quality check, because the network structure of the database is already known, and any deviation from this structure highlights potential problems ("identifying idiosyncrasies that cannot be attributed to natural evolutionary processes") or indicates novel sequence variation. Much of this process can be effectively automated by computer scripts.

To date, the champion of this use of networks has been Hans-Jürgen Bandelt, who has presented a number of interesting practical examples over the past dozen years. Below, I have included an annotated list of some of the more interesting publications in this area.

Bandelt H-J, Lahermo P, Richards M, Macaulay V (2001) Detecting errors in mtDNA data by phylogenetic analysis. International Journal of Legal Medicine 115: 64-69. — The first to suggest phylogenetic analysis as a component of data-quality checking, although networks are not explicitly mentioned

Bandelt H-J, Quintana-Murci L, Salas A, Macaulay V (2002) The fingerprint of phantom mutations in mitochondrial DNA data. American Journal of Human Genetics 71: 1150-1160. — The first to explicitly suggest using networks, and then use median and quasi-median networks to detect errors in published human mtDNA control-region datasets

Bandelt HJ, Kivisild T (2006) Quality assessment of DNA sequence data: autopsy of a mis-sequenced mtDNA population sample. Annals of Human Genetics 70: 314-326. — Use quasi-median networks to detect errors in a published human mtDNA control-region dataset

Bandelt HJ, Dür A (2007) Translating DNA data tables into quasi-median networks for parsimony analysis and error detection. Molecular Phylogenetics and Evolution 42: 256-271. — Discuss the use of quasi-median networks for error detection, and re-visit the analysis of Bandelt and Kivisild (2006)

Parson W, Dür A (2007) EMPOP — A forensic mtDNA database. Forensic Science International: Genetics 1: 88-92. — Use quasi-median networks to detect mtDNA errors in forensic data by comparison with a benchmark database

Kong Q-P, Salas A, Sun C, Fuku N, Tanaka M, Zhong L, Wang C-Y, Yao Y-G, Bandelt H-J (2008) Distilling artificial recombinants from large sets of complete mtDNA genomes. PLOS One 3: e3016. — Use median networks to detect possible artificial recombinant sequences in molecular databases (ie. chimeric sequences resulting from laboratory-induced errors)

Bandelt H-J, Yao Y-G, Bravi CM, Salas A, Kivisild T (2009) Median network analysis of defectively sequenced entire mitochondrial genomes from early and contemporary disease studies. Journal of Human Genetics 54: 174-181. — Use median networks to detect possible errors in human mtDNA genomes intended to find sequence mutations associated with particular diseases

Monday, September 16, 2013

A network of New Zealand's livestock regions

If I was to burrow down from Sweden through the centre of the Earth and keep going, I would come up off the east coast of New Zealand. In spite of being as far apart as you can get on this planet, these two countries have one thing in common — most people don't know quite where they are (Sweden is usually thought to be somewhere in the Alps, and New Zealand apparently exists only as a figment of Tolkien's imagination).

New Zealand is full of New Zealanders, of course (about 4.5 million of them), but according to government statistics it is also full of dairy cattle, beef cattle, sheep, deer, pigs, goats, horses, ostriches & emus, alpacas & llamas, and miscellaneous other farm animals. Indeed, there appear to be as many dairy cattle as people, as many beef cattle as people, and ten times as many sheep as people. This is ridiculous — even Australia has only five times as many sheep as people!

Where are they keeping all of these animals? To find out, I have used a network to explore the official statistics from the recently released New Zealand 2012 Agricultural Census tables, broken down by geographical region. I have restricted the data to dairy cattle, beef cattle, sheep, deer, pigs, goats and horses, as the data are a bit sporadic for the rarer animals; and they are also sporadic for the Chatham Islands, which I thus excluded. Note that the official bureaucratic terms for missing data are "Confidential" and "Suppressed" — the mind boggles at the idea that the number of pigs in or near Auckland, for example, is a government secret. Anyway, I have done my best to impute the missing data (based on the reported totals for each island and on the 2007 census).

The map shows you that the geographical regions vary dramatically in size (compare Nelson and Canterbury, for example), and so I standardized the animal counts for each region using the reported area (to get density, as animals per square kilometre). I then used a double square-root transformation, which is a traditional technique in zoology for standardizing extreme differences in animal abundance.

I then calculated the similarity of the regions using the Manhattan distance. A Neighbor-net analysis was then used to display the between-region similarities as a phylogenetic network. So, regions that are closely connected in the network are similar to each other based on their livestock abundances, and those that are further apart are progressively more different from each other.
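For readers unfamiliar with it, the Manhattan distance between two regions is simply the sum of the absolute differences across the livestock types. Here is a minimal sketch, using hypothetical transformed densities rather than the census values. The Neighbor-net step itself is not shown; it would take the resulting distance matrix as input.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical transformed densities, one value per livestock type.
profiles = {
    "Region A": [3.0, 2.0, 0.0],
    "Region B": [1.0, 2.0, 1.0],
    "Region C": [3.0, 1.0, 0.0],
}

# Pairwise distance matrix, stored as a dictionary keyed by region pairs.
names = sorted(profiles)
distances = {
    (p, q): manhattan(profiles[p], profiles[q])
    for p in names
    for q in names
}

for p in names:
    print(p, [distances[(p, q)] for q in names])
```

In this toy example Region A is closest to Region C (distance 1.0), so those two would be placed near each other in the network.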

The main outcome will surprise no New Zealander — both the North Island (shown in red) and the South Island (in blue) have distinct groups of regions. Indeed, the North Island has two-thirds of the dairy cattle, beef cattle, goats and horses, and the South Island has two-thirds of the deer and pigs, while they split the sheep equally. This means that the North Island has about 22.5 million livestock and the South Island about 20.5 million.

Nelson, West Coast, Marlborough and Tasman, all from the South Island, are the regions with the lowest density of livestock, and hence are strongly associated in the network. The West Coast has the lowest density of all of the livestock types except dairy cattle and deer, which is why it stands out in the network. (It also has the lowest density of people.)

Much of the network is highly correlated with sheep density, not unexpectedly, with increasing density from top to bottom in the graph. For example, Northland and Bay of Plenty, from the North Island, are relatively devoid of sheep, just like Marlborough and Tasman, although there are still 30-40 of them per square kilometre in all four regions. Gisborne, Hawke's Bay, Manawatu-Wanganui and Wellington are the areas with the greatest sheep density on the North Island, along with Canterbury, Southland and Otago from the South Island.

Other positions in the network are often based on single livestock species. For example, Waikato and Taranaki are far and away the most popular areas for dairy cattle; and Waikato and Auckland are the best places to go to see goats (although, at one per square kilometre, you won't see too many). Gisborne and Hawke's Bay are the places to go beef-cattle watching. Deer are particularly popular in the Canterbury, Otago and Southland regions. No-one will be surprised that horses are densest in the region with the biggest city, Auckland, which has one-third of New Zealand's population and 1 horse for every 225 people. (This doesn't even remotely challenge Sweden's national average of 1 horse for every 25 people.)

Apart from the quality of its lamb, New Zealand is world-famous for its vinous products (and its phylogeneticists, although these things are not necessarily causally related). It is probably the grape-growing in Marlborough that is excluding the livestock, since it has two-thirds of the 35,000 hectares of New Zealand's wine-grape area, according to the Agricultural Census. The West Coast is the only region listed as officially having 0 hectares of grapes, but Bay of Plenty, Taranaki, Nelson and Southland are all listed as Confidential (they have less than 55 hectares between them).

Wednesday, September 11, 2013

Public availability of phylogenetic data

I have previously noted the frequent failure of phylogeneticists to make their data publicly available (Releasing phylogenetic data). Recently, a paper appeared in PLoS Biology providing some quantitative data regarding this issue:
Drew B.T., Gazis R., Cabezas P., Swithers K.S., Deng J., Rodriguez R., Katz L.A., Crandall K.A., Hibbett D.S., Soltis D.E. (2013) Lost branches on the Tree of Life. PLoS Biology 11(9): e1001636.
While constructing a super-tree of life, Drew et al. noted that of their 7,500 papers (appearing in 2000–2012) the published data (e.g. alignment and tree) had been deposited in a public repository in only one-sixth of the cases, and were available on request from the original authors for a further one-sixth, leaving two-thirds of the data unavailable.

Not unexpectedly, they suggest that the journals publishing these papers might play a role in addressing this issue:
Our findings indicate that while some journals (e.g., Evolution, Nature, PLOS Biology, Systematic Biology) currently require nucleotide sequence alignments, associated tree files, and other relevant data to be deposited in public repositories, most journals do not have these requirements.
Notable among the absent journals are high-profile phylogenetic ones such as Molecular Biology and Evolution and Molecular Phylogenetics and Evolution.

Sadly, the role of journals has been presented in a rather poor light by some bloggers. For example, Roli Roberts notes:
And it's clear that journals are indeed spectacularly well-placed to police and incentivise the deposition, tracking, accessibility, and permanence of data associated with the papers that they publish. At the point of acceptance we have the authors over a barrel, and are in a great position to mandate deposition of all data for every paper.
This attitude has been criticized by other bloggers. For example, Rod Page notes:
In my opinion, as soon as you start demanding people do something you've lost the argument, and you're relying on power ("you don't get to publish with us unless you do x"). This is also lazy. I have argued that this is the wrong approach: when building shared resources carrots are better than sticks ... So, my challenge to the phylogenetics community is to stop resorting to bullying people, and ask instead how you could make it a no brainer for people to share their trees.
However, I ask a different question:
Why are phylogeneticists so reluctant to present their actual data in the first place?
After all, without data science is merely opinion, and you don't need to be a scientist to have an opinion. (Even theoretical science ultimately concerns itself with data, so data really is the essence of science.) One does not have to be sceptical about a dataset in order to think that it should be publicly and freely available.

So, why is telling phylogeneticists to act like scientists "resorting to bullying people"? Why do we have to "inspire [people] to contribute" by offering them carrots? It seems to me that we have lost the argument that phylogenetics is science if the phylogeneticists won't behave like scientists.

Note that the alignment is the key thing in phylogenetics, not the derived tree. In one sense, a tree just makes a figure out of a table. So, given the published description of the tree-building method, it should be straightforward to reproduce the tree from the alignment. Indeed, if the tree cannot be reproduced from the alignment then there is serious cause for concern.

In this sense, databases like TreeBASE might be missing the point somewhat. Where does one put the alignment if one is not interested in also storing the tree? Where does one put a network, if that is what you have instead of a tree? One could use Dryad, but they are now insisting on payment for storing scientific data — for those of us without financial support this is no longer a realistic option.

Problems with data availability are not unique to phylogenetics, of course. Dani Zamir has recently noted:
In crop genetics and breeding research, phenotypic data are collected for each plant genotype, often in multiple locations and field conditions, in search of the genomic regions that confer improved traits. But what is happening to all of these phenotypic data? Currently, virtually none of the data generated from the hundreds of phenotypic studies conducted each year are being made publically available as raw data; thus there is little we can learn from past experience when making decisions about how to breed better crops for the future.
Nevertheless, in biology, there are databases for many things, such as gene sequences (Genbank), protein structures (PDB), and gene ontology (GO), and these are all used to one extent or another. Perhaps the most direct parallel to the problems with phylogenetic datasets is that of ecological datasets, as recently discussed in a PLoS ONE article:
Morris B.D., White E.P. (2013) The EcoData Retriever: improving access to existing ecological data. PLoS ONE 8(6): e65848.
It is interesting to ponder why this is such a problem in the biological sciences when it is apparently not so in the physical sciences. There are databases in astronomy, and databases of chemical properties in chemistry, for example, but otherwise it is generally the ability to get the same data by repeating the experiment that is the important thing in the physical sciences. In most cases a database would be not only redundant but also self-defeating (storing the data would imply that the data are not repeatable!).

So, this appears to be yet another by-product of dealing with biodiversity — data are incredibly variable in many areas of biology, and so it is necessary to store them for posterity because they are unique.

Monday, September 9, 2013

Phylogenetic networks 1900-1990

In earlier blog posts I have pointed out that the first phylogenetic network explicitly representing a genealogy was published in 1755 (The first phylogenetic network, 1755) and the second in 1766 (The second phylogenetic network, 1766), but the third one that I know of did not appear until 1888 (Networks of genealogy). Up until 1900 there were, however, many networks published that represented affinity rather than genealogy (Networks of affinity rather than genealogy).

In this post I consider the subsequent history of phylogenetic networks, as far as I have been able to determine it, up until 1990. Networks remained relatively rare up to that time; and indeed even the name "phylogenetic network" usually referred to an unrooted tree rather than to a reticulating network (Who first used the term "phylogenetic network"?). From 1990 onwards networks have become quite common, and many scores of them have now been published.

Below, I present all of the networks that I am aware of from 1900-1990. I doubt very much that this includes all of the published networks. Indeed, I do not know even what proportion of them are presented here. However, I do believe that this is a representative selection of the uses of phylogenetic networks between 1900 and 1990.

Note: Last updated 16 November 2014.


This was an era in which trees dominated phylogenetic thinking, presumably in response to Charles Darwin's 1859 book (Who published the first phylogenetic tree?). Reticulation was talked about by a number of authors when discussing affinity, notably in botany (Stevens P.F. 1994. The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York), but it was rarely illustrated, especially empirically. Indeed, the most popular time for affinity networks was up to the mid-late 1800s, at which time genealogical trees took over.

Most of the networks shown below are rooted, and thus represent genealogy, but a few unrooted affinity networks still appeared. However, in general few people used phylogenies to display their results, even when discussing hybridization or horizontal gene transfer. The people investigating these phenomena appeared to not be thinking in terms of phylogenetics, but instead were investigating mechanisms among a small group of species. The phylogenetic context that is so prevalent in biology today was rare before 1990.

There is an obvious peak during the 1950s, and there is an interesting gap after 1970 when cladistics rose to prominence, with its focus on dichotomous trees (Who first used the term "phylogenetic network"?). Nevertheless, the existence of such a diverse collection of networks shows that biologists were still able to "think outside of the box" when they felt it was necessary.

The networks

Mereschkowsky C. (1910) Theorie der zwei Plasmaarten als Grundlage der Symbiogenese, einer neuen Lehre von der Entstehung der Organismen. Biologisches Centralblatt 30: 278–303, 321–347, 353–367.
Available from the Biodiversity Heritage Library.

Small J. (1919) The origin and development of the Compositæ. Chapter XIII: General conclusions. New Phytologist 18: 201–234.
Available from Wiley.

Danser B.H. (1924) Über einige Aussaatversuche mit Rumex-bastarden. Genetica 6: 145-220.
Available from Springer.

Anderson E. (1931) Internal factors influencing discontinuity between species. American Naturalist 65: 144-148.
Available from JStor.

Milne M.J., Milne L.J. (1939) Evolutionary trends in caddis worm case construction. Annals of the Entomological Society of America 32: 533-542.
Available from the Core Historical Literature of Agriculture.

Taylor H. (1945) Cyto-taxonomy and phylogeny of the Oleaceae. Brittonia 5: 337-367.
Available from JStor.

Clausen J. (1951) Stages in the Evolution of Plant Species. Cornell University Press, Ithaca NY.
Figure copy available in JStor.

Grant V. (1953) The role of hybridization in the evolution of the leafy-stemmed gilias. Evolution 7: 51-64.
Available from JStor.

Goodspeed T.H. (1954) The genus Nicotiana: origins, relationships and evolution of its species in the light of their distribution, morphology and cytogenetics. Chronica Botanica 16: 1-536.
The figure is taken from Chase et al. (2003) Annals of Botany 92, available from Oxford.

Lewis H., Lewis M.R.E. (1955) The genus Clarkia. University of California Publications in Botany 20: 241-392.
The figure is taken from Alston & Turner (1963) Biochemical Systematics, available from the Biodiversity Heritage Library.

Turner B.L. (1956) A cytotaxonomic study of the genus Hymenopappus (Compositae). Rhodora 58: 163-186; 208-242; 250-269; 295-308.
Available from the Biodiversity Heritage Library.

Lysenko O., Sneath P.H.A. (1959) The use of models in bacterial classification. Journal of General Microbiology 20: 284-290.
Available from the Society for General Microbiology.

Goodwin T.W. (1963) Comparative biochemistry of carotenoids. In: S. Ochoa (ed.) Proceedings of the Fifth International Congress of Biochemistry, Moscow 10–16 Aug 1961, Vol. III. Pergamon Press, Oxford.
The figure is taken from Alston & Turner (1963) Biochemical Systematics, available from the Biodiversity Heritage Library.

Lowe C.H., Wright J.W., Cole C.J., Bezy R.L. (1970) Chromosomes and evolution of the species groups of Cnemidophorus (Reptilia: Teiidae). Systematic Zoology 19: 128-141.
Available from JStor.

Mikelsaar R. (1987) A view of early cellular evolution. Journal of Molecular Evolution 25: 168-183.
Available from Springer.

Wednesday, September 4, 2013

Mis-interpreting phylogenetic trees

I have noted before that biologists have used various metaphors or models for phylogenetic relationships, including a chain, a tree, and a network. I, and other people, have also noted that interpreting the relationships shown by these structures is not always easy for novices, and sometimes even for experts (see Ambiguity in phylogenies).

A chain is 1-dimensional, and so interpreting its relationships is usually straightforward. However, a tree is not a simple linear concept, as it consists of a set of inter-linked chains. It is clear from the literature that a tree is a structure where many people find it easy to interpret relationships incorrectly. Here, I illustrate this with an example.

The example is taken from this paper:
L. Luca Cavalli-Sforza (1997) Genes, peoples, and languages. Proceedings of the National Academy of Sciences of the U.S.A. 94: 7719-7724.
The first illustration is Figure 1 from that paper.

In the text, the author also notes this about the figure:
The most important difference is in the position of Europe, which with neighbor joining branches out first after the splitting of Africans and non-Africans and with maximum likelihood [sic!] is the last but one.
This interpretation is incorrect, because it is the position of Oceania that differs between the two trees (not Europe), as shown in the illustration below.

In the original figure, tree (a) is rooted while tree (b) is unrooted. In order to directly compare them, we need to root the tree in (b), as shown in the first row of the illustration. Note that I have re-ordered the areas in tree (a), but I have not changed the relationships as shown by the tree. One of the most common mis-interpretations of trees is to think that the linear order (top to bottom) has some meaning (see Ambiguity in phylogenies), but it does not.

Then, in order to identify the difference between the pair of rooted trees, we simply delete each of the areas in turn, which is shown in rows 2 to 5 of my figure. Only the deletion of Oceania makes the two rooted trees identical (row 3). The deletion of Europe (row 2) does not do this, and so the position of Europe should not be identified as a key difference between the two trees.
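The delete-one-at-a-time comparison can be sketched in a few lines of code. This is only an illustration under stated assumptions: the two rooted trees below are hypothetical (with placeholder leaves A, B, C, D and X), and are not the trees from the Cavalli-Sforza figure. Each tree is written as nested tuples, and the comparison ignores the order in which the branches are drawn.

```python
def prune(tree, leaf):
    """Remove one leaf, collapsing any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    kept = [p for p in (prune(child, leaf) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Order-independent form, so that drawing order has no effect."""
    if tree is None or isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)

def leaves(tree):
    """Set of leaf labels in the tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def reconciling_leaves(t1, t2):
    """Leaves whose deletion makes the two rooted trees identical."""
    return sorted(
        leaf for leaf in leaves(t1)
        if canonical(prune(t1, leaf)) == canonical(prune(t2, leaf))
    )

# Hypothetical rooted trees that differ only in the position of X.
tree_1 = ("A", ("X", ("B", ("C", "D"))))
tree_2 = ("A", ("B", ("C", ("X", "D"))))

print(reconciling_leaves(tree_1, tree_2))
```

If exactly one leaf is returned, then the position of that leaf is the difference between the two trees, which is the logic used in rows 2 to 5 of my figure.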

The literature is replete with this sort of simple interpretational mistake concerning trees. The concern for those of us involved is:
What will it be like for people to interpret networks, which are sets of inter-linked trees?

Monday, September 2, 2013

Another journal cover

I am told that it never rains but it pours. Meteorologically this seems unlikely, but in other systems it may be true. Recently, I noted that in the past couple of months both Trends in Genetics and Transactions on Computational Biology and Bioinformatics have featured phylogenetic networks on their front covers, highlighting articles in their respective issues.

Now, I can note that Volume 190 Number 1 of Molecular and Biochemical Parasitology also illustrates a phylogenetic network. Hopefully, the breadth of subject matter covered by these three journals reflects an increasing recognition of the importance of networks in phylogenetics.

The illustration is from the paper by Eva Tydén, Annie Engström, David Morrison and Johan Höglund: Sequencing of the β-tubulin genes in the ascarid nematodes Parascaris equorum and Ascaridia galli, on pages 38-43.