The Genealogical World of Phylogenetic Networks: May 2012

Wednesday, May 30, 2012

Networks of genealogy

Phylogenetic trees, indicating evolutionary relationships among organisms, are usually considered to have derived, to one extent or another, from the concept of a family tree or pedigree (also called genealogy, généalogie, stammbaum, stamboom or levensboom, depending on your language), which depicts the lineal descent from some single specified human ancestral couple. However, these diagrams are only trees if one sticks to lineal descent, historically usually the patrilineal lineage. As soon as one includes the (many) matrilineal lines, the diagram becomes a network, with inter-connections between different family trees. [See the later post Family trees, pedigrees and hybridization networks.]

This seems to have been the idea of both Buffon (1755) and Duchesne (1766) when they created the first two published phylogenetic networks. They each observed hybridization histories (in dogs and strawberries, respectively), and they drew diagrams showing both the male and female parents (presumably without knowing which was which). Thus, they apparently saw themselves as doing nothing unusual, and neither did their contemporaries. They were simply drawing pedigrees in which both the matrilineal and patrilineal relationships were shown.

It is only with historical hindsight that we can see that they did something conceptually new. The "usual" family genealogies showed (and still do show) predominantly the male lineages (which keep the family name, and thus are easier to trace), with the female lineages apparently appearing from nowhere at the point where they marry into the male lineage. What both Buffon and Duchesne did was show that both the male and female lines came from within the same group of organisms. (In modern parlance, this was both biologically more informative and politically more correct.)

However, this new idea was restricted to within-species relationships (dog breeds and strawberry cultivars), thus emphasizing the close conceptual relationship to a pedigree of individuals. The idea that reticulating evolutionary relationships might occur between species (possibly members of different higher taxonomic groups) was apparently not something that either Buffon or Duchesne considered.

Lamarck (1809) appears to have been the first to consider supra-species evolution, and he used a non-reticulating tree for his diagram rather than a reticulating network. His published trees differ from our modern diagrams in having contemporary higher-level taxonomic groups at both the internal and external nodes, so that each tree represents a transformation series among the groups (based on morphoclines). Thus, his trees were based on the idea of descent with modification but do not match our modern trees. (Basically, Lamarck did not believe in extinction, and thought that the disappearance of species was due to their transformation into new species.) The animal tree of Barbançois (1816) and the pre-Darwinian ornithology trees of Strickland (1841) and Wallace (1856) are direct descendants of this type of tree (see O'Hara 1988, Tassy 2011).

Darwin (1859) is usually credited as being the originator of modern phylogenetic trees, with contemporary taxa at the leaves and ancestors at the internal nodes. (Darwin firmly believed in extinction, so that the internal tree nodes represented the extinct ancestors of modern species.) However, he never published an empirical tree (although one appears in his notes and another in his letters); and it was left to Hilgendorf (1863), Mivart (1865), Gaudry (1866) and Haeckel (1866) to provide the first published ones. [This later blog post discusses the history of these first trees.]

The point of this long dissertation is that from 1750, when Vitaliano Donati first suggested that biological relationships might be represented as a network, almost all of the relationship diagrams were either (i) networks showing non-genealogical affinity, (ii) trees showing non-genealogical affinity (eg. those of Agassiz, Augier, Bronn, Hitchcock), or (iii) trees showing genealogy (eg. those of Gaudry, Haeckel, Hilgendorf, Lamarck, Mivart) (see Stevens 1994, Ragan 2009, Tassy 2011). Networks showing genealogy at the species level or above were notably absent.

This situation appears to have lasted until 1888. In that year Ferdinand Pax published a revision of the plant genus Primula. In it he provided 14 diagrams illustrating the relationships among the species within each section of the genus, plus one diagram illustrating the relationships among the sections. Almost all of these diagrams are networks rather than trees, as they show complex reticulating relationships. These are affinity networks as they do not represent genealogies, with three exceptions.

Page 186, Section Vernales.

The figures on pages 186, 202 and 233 all explicitly indicate hybridization relationships among the species. The hybrid species are indicated by a cross (rather than a circle), and they are connected to their nominated parents by dashed lines (as used by both Buffon and Duchesne). Some of the suggested relationships are quite complex, with two hybrid species deriving from the same parents.

Page 202, Section Farinosae.

This thus seems to be the first publication of a hybridization network involving species rather than sub-specific relationships. Quite why it took more than 130 years from Buffon's innovation to take this next step is not clear. Perhaps it had something to do with the contemporary focus on genealogical trees of animals, particularly vertebrates, which traditionally have been considered to show little evidence of inter-specific hybridization (unlike plants, where there is considerable evidence).

Page 233, Section Auricula.

It seems to me to be important to emphasize the historical distinction between affinity and genealogy networks. This distinction continues in phylogenetic networks today, with rooted networks explicitly representing genealogy and unrooted networks representing similarity of a more general sort. Almost all of the published phylogenetic networks of the past 20 years have been unrooted, and therefore logically cannot represent historical relationships. (The root indicates the time direction of the genealogy, and time goes in only one direction, so that we have Time's Arrow not Time's Boomerang.) These unrooted networks can be (and usually are) a valuable tool for helping to understand phylogenetic history, but they do not represent evolution directly.

For examples of genealogy networks published later, see Phylogenetic networks 1900-1990.

References

Barbançois, Charles-Hélion de (1816) Observations sur la filiation des animaux, depuis le polype jusqu'au singe. Journal de Physique, de Chimie, d'histoire Naturelle et des Arts 82: 444-448.

Buffon, Georges-Louis Leclerc, comte de (1755) Histoire Naturelle Générale et Particulière, tome V. Imprimerie Royale, Paris.

Darwin, Charles Robert (1859) On the Origin of Species. John Murray, London.

Donati, Vitaliano (1750) Della Storia Naturale Marina dell' Adriatico. Francesco Storti, Venezia.

Duchesne, Antoine Nicolas (1766) Histoire Naturelle des Fraisiers. Didot le jeune & C.J. Panckoucke, Paris.

Gaudry, Albert (1866) Considérations Générales sur les Animaux Fossiles de Pikermi. F. Savy, Paris.

Haeckel, Ernst Heinrich (1866) Generelle Morphologie der Organismen. Reimer, Berlin.

Hilgendorf, Franz Martin (1863) Beiträge zur Kenntniß des Süßwasserkalkes von Steinheim. Unpublished PhD Dissertation. Philosophische Fakultät, Universität Tübingen, 42 pp.**

Lamarck, Jean-Baptiste de Monet, chevalier de (1809) Philosophie Zoologique. Dentu et l'Auteur, Paris.

Mivart, St George Jackson (1865) Contributions towards a more complete knowledge of the axial skeleton in the Primates. Proceedings of the Zoological Society of London 33: 545-592.

O’Hara, Robert J. (1988) Diagrammatic classifications of birds, 1819-1901: views of the natural system in 19th-century British ornithology. In: H. Ouellet (ed.) Acta XIX Congressus Internationalis Ornithologici, pp. 2746-2759. National Museum of Natural Sciences, Ottawa.

Pax, Ferdinand Albin (1888) Monographische übersicht über die arten der gattung Primula. Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241.

Ragan, Mark (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Stevens, Peter F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York.

Strickland, Hugh Edwin (1841) On the true method of discovering the natural system in zoology and botany. Annals and Magazine of Natural History 6: 184-194.

Tassy, Pascal (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Wallace, Alfred Russel (1856) Attempts at a natural arrangement of birds. Annals and Magazine of Natural History 18: 193-216.

** See this later post for information about this thesis.
~~[This was read as a paper before the Royal Prussian Academy of Sciences on July 19 1866, and was then published separately as:~~
~~Hilgendorf F. (1866) Planorbis multiformis im Steinheimer Süßwasserkalk. Ein Beispiel von Gestaltveränderung im Laufe der Zeit. Buchhandlung von W. Weber, Berlin, 36 pp.~~
~~The paper then appeared as a regular part of the Academy's journal:~~
~~Hilgendorf F. (1867) Über Planorbis multiformis im Steinheimer Süsswasserkalk. Monatsberichte der Königliche Preussischen Akademie der Wissenschaften zu Berlin 1866: 474-504.~~
~~So, you can take your pick as to the formal publication date of the tree.]~~

Sunday, May 27, 2012

Tattoo Monday V

This week we have a few of the recently available phylogenetic tree tattoos for you, following on from the previous posts: Tattoo Monday, Tattoo Monday II, Tattoo Monday III, and Tattoo Monday IV. We even have the first tattooed feet. The tree on the middle right is somewhat of a worry, if you care about phylogenetic accuracy. The circular designs re-appear in Tattoo Monday VII.

While it seems true that the probability of observing a new phylogenetic-tree tattoo decreases with time, this is apparently an exponential decay, so that the probability approaches zero rather slowly. It might, of course, be a Pareto (power law) distribution instead, which never reaches zero, thus keeping Carl Zimmer employed producing endless supplements to his book (Science Ink: Tattoos of the Science Obsessed).

Tuesday, May 22, 2012

Networks of affinity rather than genealogy

To date, published phylogenetic networks have been of two distinct types: (i) non-directional networks showing affinity among taxa, and (ii) directed networks showing genealogical relationships among taxa. In previous posts I have discussed what are apparently the first two networks of the latter type (1755 and 1766), which illustrate hybridization relationships among domesticated breeds / cultivars.

Today, I wish to point out that the vast majority of the networks published from 1750-1900 were actually of the first type. Indeed, the third explicitly genealogical network appears to date from 1888, whereas more than a dozen networks of the affinity type are known.

There is little point in trying to formally define what biologists have meant by "affinity", since it seems to vary greatly. However, it has usually referred to some sort of attraction or connection between taxa (and their characteristics), sometimes described as similar to the laws regulating the combinations of elements that form compounds in chemistry. It has usually been the basis of some sort of "natural system" of classification (as opposed to some artificial grouping of organisms, eg. utilitarian): "an arrangement that groups organisms of the same kind more closely to each other than either of them is to any organism of another kind" (Cuvier 1817). In modern terminology, affinity originally included evaluation of both homology and analogy, and therefore affinity was a more general relationship than evolutionary history, and it does not necessarily correlate well with it. Genealogy also appears to be a much looser form of connection between taxa than was originally imagined by some of the seekers of affinity connections.

The important point here, however, is that affinity was usually imagined as being multi-faceted, so that any diagram of affinities showed multiple connections among the taxa: relationships between groups were very definitively reticulating, and it was considered impossible to form a linear series because emphasizing a relationship in one direction necessarily entailed simultaneously breaking relationships in another (de Jussieu 1843; see Stevens 1994 p. 98). Hence, the diagrams were networks rather than trees. (Trees have a long history as a metaphor for arranging human ideas, but not for representing affinity; Gontier 2011.) Indeed, network diagrams may out-number tree illustrations from 1750-1900 (Stevens 1994, Ragan 2009, Tassy 2011), if one excludes dichotomous keys for the identification of taxa. These network diagrams lacked any directionality, as affinities between taxa were considered to be symmetrical (unlike genealogical relationships). Notably, when interpreting their diagrams the authors either fail to mention genealogy, or they implicitly or explicitly exclude it.

The importance that I see in these historical networks is that they match closely the modern idea of unrooted data-display networks. They are not a form of exploratory data analysis, of course, because they were intended to express the author's ideas about biological relationships rather than to reveal previously unquantified patterns in the data; but they certainly are not rooted evolutionary networks. Thus, the dichotomy that I see in contemporary usage of the expression "phylogenetic network" apparently has a long history, and it needs to be recognized. (There are also all sorts of other historical representations of reticulate relationships: some were intended to be solid figures with faces, others were interlocking circles or radiating hexagons or nested ovals, and some were explicitly referred to as maps. Most of these could be converted to a network representation.)

Affinity Networks

[Note: the following list has been updated in a later blog post: Affinity networks updated]

Here is a list of the publications from 1750-1900 containing affinity networks that I know about. I have indicated my source, and I have also linked to an online copy of the diagram. (I have extracted some of the figures from Gallica, Google Books or the Biodiversity Heritage Library, where they are otherwise unavailable.)

1774 Johann Philipp Rühling "Ordines Naturales Plantarum Commentatio Botanica" [Ragan Fig. 7, Stevens Fig. 12]
1783 Johann Hermann "Tabula affinitatum animalium" [Ragan Fig. 8]
1802 August Johann Georg Carl Batsch "Tabula Affinitatum Regni Vegetabilis" [Ragan Fig. 9]
1825 Adrien-Henri-Laurent de Jussieu "Sur le groupe des Rutacées." Mémoires du Muséum d'Histoire Naturelle 12: 384-542 [Ragan Fig. 13, Stevens Fig. 10]
1826 Augustin-Pyramus de Candolle "Mémoires sur la Famille des Légumineuses. Quatrième Mémoire. Division de la Famille des Légumineuses" [Stevens Fig. 14]
1841 Eduard Fenzl "Darstellung und Erläuterung vier minder bekannter, ihrer Stellung im natürlichen Systeme nach bisher zweifelhaft gebliebener, Pflanzen-Gattungen." Denkschriften der Königlich-Baierischen Botanischen Gesellschaft zu Regensburg 3: 153-270 [Stevens, figure]
1843 Adrien-Henri-Laurent de Jussieu "Monographie de la famille des Malpighiacées." Archives du Muséum d'Histoire Naturelle 3: 5-151 [Stevens, figure]
1872 Alexander Andrejewitch von Bunge "Die gattung Acantholimon Boiss." Mémoires de l'Académie Impériale des Sciences de St Pétersbourg Série 7 18(2): 1-72 [Stevens Fig. 19]
1873 George Bentham "Notes on the classification, history, and geographical distribution of Compositae." Botanical Journal of the Linnean Society 13: 335-577 [Stevens Fig. 20]
1888 Ferdinand Albin Pax "Monographische übersicht über die arten der gattung Primula." Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241 [Stevens, there are 15 figures]
1889 Julien Vesque "Epharmosis, sive Materiae ad Instruendam Anatomiam Systematis Naturalis. Pars Prima. Folia Capparearum" [Stevens, figure]
1889 Julien Vesque "Epharmosis, sive Materiae ad Instruendam Anatomiam Systematis Naturalis. Pars Secunda. Genitalia Foliaque Garciniearum et Calophyllearum" [Stevens, figure]
1890 Franz Georg Philipp Buchenau "Monographia Juncacearum." Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 12: 1-495 [Stevens, figure]
1893 Georg Klebs "Flagellatenstudien, Theil II." Zeitschrift für Wissenschaftliche Zoologie 55: 353-445 [Ragan Fig. 26]
1895 Olga Tchouproff "Quelques notes sur l'anatomie systématique des Acanthacées." Bulletin de l'Herbier Boissier 3: 550-560 [Stevens, figure]
1896 Nicolai Ivanovich Kusnezov "Subgenus Eugentiana Kusnez. generis Gentiana Tournef." Acta Horti Petropolitani 15: 1-507 [Ragan, Stevens Fig. 18]
1898 Émile Constant Perrot "Anatomie comparée des Gentianacées." Annales des Sciences Naturelles: Botanique, Série 8, 7: 105-292 [Stevens, figure]

It seems important to note that these affinity networks continued to be produced after 1859, when phylogenetic trees in the modern context were introduced by Charles Darwin (ie. internal nodes represent ancestors and the leaves represent existing taxa). Clearly, affinity and genealogical inheritance did not become synonymous concepts until much later.

Also, other early authors described biological relationships as being like a reticulating network, without any necessarily genealogical interpretation, even though they apparently did not themselves produce network diagrams. (It seems that only 5 of the first 11 historical references to network relationships provided a diagram.) Here are some known examples (with source):

1750 Vitaliano Donati "Della Storia Naturale Marina dell' Adriatico" [Ragan]
1792 Giuseppe Olivi "Zoologia Adriatica" [Ragan]
1802 Gottfried Reinhold Treviranus "Biologie, oder Philosophie der Lebenden Natur für Naturforscher und Aerzte, Band 1" [Ragan]
1824 Johann Heinrich Friedrich Link "Elementa Philosophiae Botanicae" [Stevens]
1828 Georges Cuvier "Histoire Naturelle des Poissons, Tome Premier" [Ragan]
1836 Constantin Rafinesque "Flora Telluriana" [Stevens]

Note that it was apparently Vitaliano Donati who, 5 years before Buffon drew his dog genealogy, first suggested that biological relationships are like a network, although he did not provide an explicit diagram to illustrate this idea.

References

Cuvier G. (1817) Le Règne Animal Distribué d'Après son Organisation, Tome Premier. Deterville, Paris.

de Jussieu A.-H.-L. (1843) Cours Elémentaire d'Histoire Naturelle. Fortin Masson et Langlois & Leclerc, Paris.

Gontier N. (2011) Depicting the Tree of Life: the philosophical and historical roots of evolutionary tree diagrams. Evolution: Education and Outreach 4: 515-538.

Ragan M. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Stevens P.F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York.

Tassy P. (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Sunday, May 20, 2012

Network analysis of Bordeaux wine critics II

Last week I provided a brief network analysis of the evaluation behaviour of a few well-known wine commentators, by comparing the scores that they gave to some red Médoc wines from the 2004 vintage. Today I will expand this analysis by doing a similar comparison for the 2005 vintage wines.

The point of doing this is that the critics regularly claim that a particular year has produced "the vintage of the century", which they did for Bordeaux in 2000, 2005 and 2009. Leaving aside the obvious hyperbole, the 2005 vintage was apparently a very different affair (Wine Spectator magazine gave it a score of 98/100) from the 2004 vintage (Wine Spectator score = 89). (Note that these scores can vary only from 50-100, not 0-100, so the scores are actually 96% versus 78%, respectively.) This means that we can compare a very good vintage with an ordinary one, to see if it makes any difference to the way the evaluators behave.
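
The percentages quoted in parentheses follow from re-expressing a score as a fraction of the usable 50-100 range. A minimal sketch of that arithmetic (the function name rescale is only illustrative):

```python
def rescale(score, low=50.0, high=100.0):
    """Express a score on a low..high scale as a percentage of the usable range."""
    return 100.0 * (score - low) / (high - low)

print(rescale(98))  # 2005 vintage: 96.0
print(rescale(89))  # 2004 vintage: 78.0
```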

As last time, the data to be analyzed are taken from the bordOverview (Bolomey Wijnimport) website: http://www.bordoverview.com/?year=2005. They consist of quality ratings of 141 wines by 6 expert commentator groups (associated with wine magazines or newsletters), these scores being awarded in October 2008:
United States of America
  Wine Advocate (Robert Parker)
  Wine Spectator (James Suckling)
United Kingdom
  Jancis Robinson
  Decanter (Steven Spurrier, James Lawther, Michel Bettane)
France
  Tast (Michel Bettane, Thierry Desseauve)
  La Revue du Vin de France (Bernard Burtschy)

The scores have been converted to a 0-100 scale, and all wines were evaluated by all commentators. The manhattan distance measure was calculated between each pair of commentators, and the result displayed as a NeighborNet network. People who are closely connected in the network are similar to each other based on their scoring patterns, and those who are further apart are progressively more different from each other.
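
For readers unfamiliar with the manhattan distance, it is simply the sum of the absolute differences between two commentators' scores, taken wine by wine. The following is a rough sketch using invented scores for three wines (these are not the bordOverview data); the resulting pairwise distances are the kind of input that would then be passed to a NeighborNet implementation.

```python
from itertools import combinations

# Hypothetical scores (0-100 scale) for three wines; not the real bordOverview data.
scores = {
    "Parker":   [92, 88, 95],
    "Suckling": [90, 89, 93],
    "Robinson": [85, 90, 88],
}

def manhattan(a, b):
    """Sum of absolute score differences over the same wines."""
    if len(a) != len(b):
        raise ValueError("both commentators must have scored the same wines")
    return sum(abs(x - y) for x, y in zip(a, b))

# One distance per unordered pair of commentators.
distances = {
    (p, q): manhattan(scores[p], scores[q])
    for p, q in combinations(scores, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

With these invented numbers, Parker and Suckling are closest (distance 5) and Parker and Robinson are furthest apart (distance 16).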

Note, first, that there is considerable conflict among the critics. As noted last time, they certainly don't agree with each other uniformly, presumably because wine tasting is an art not a science.

Second, the relationship between the commentators is quite different compared to last time. Importantly, this time there is no suggestion of a cultural divide between the French- and English-speaking commentators. Perhaps cultural preferences play a larger role only when the wine is of average quality (2004) whereas high-quality wine (2005) has more international appreciation.

Finally, there are more long terminal edges than last time, indicating that more of each opinion is unique to that particular commentator. This suggests that personal preferences play a larger role when evaluating good wine (2005) compared to average wine (2004). Apparently it is not true that "good wine is good wine and everyone can agree on it".

Once again, we can also produce a directed graph rather than an undirected one, by nominating Robert Parker as the outgroup to the others, and producing a rooted reticulation network. In this diagram the blue lines indicate "hybridization" histories.

In this analysis, the Decanter (Spurrier et alia) and Wine Spectator (Suckling) opinions are hybrid ones, as they also were last time. (The Revue du Vin de France opinion is not a hybrid this time.) However, only one of the hybrid sources is the same as last time (Parker's as a source of Suckling's). Interestingly, the Decanter opinion, from the UK, is simply a mixture of the two French opinions. The presence of Michel Bettane on two of the tasting panels may have something to do with this. Also, the Wine Spectator is a hybrid of UK and US opinions.

As before, everyone's opinion seems to be someone else's, modified in response to someone else again; and this is not related to whether the wine is "good" or "average" since it happened for both 2004 and 2005.

Wednesday, May 16, 2012

GPWG Poaceae dataset

In a previous post, Steven mentioned that one of the datasets from the Grass Phylogeny Working Group has played an unexpectedly prominent role in evaluation of hybridization network algorithms.

These algorithms work by trying to construct a network from a set of rooted trees with overlapping sets of taxa; and the GPWG dataset provides six such trees, one from each of six different molecular loci. This dataset seems to have been introduced into the network literature by Bordewich et al. (2007), although it had previously been used for evaluations of supertree methods (Salamin et al. 2002; Schmidt 2003).

The data used consist of DNA sequences of three nuclear loci and three chloroplast genes. The original publication also has data provided for morphology and restriction sites, but these have not been used for the network analyses. One reason for interest in this dataset is the possibility of reticulation signals between the nuclear and chloroplast data sources. There are 66 taxa, although nearly half of them are composites formed from data for several different species in the same genus, and only a few of the taxa have data for all six datasets (the number of taxa varies from 19-65 per dataset). The data available are summarized in Table 7.1 from Schmidt (2003).

An important point about these data is that in the original GPWG publication the six gene trees were strict consensus trees from maximum-parsimony analyses, and so they have quite a number of polychotomies. These polychotomies were intended by the authors [personal communication] to express uncertainty about the topologies of the trees.

However, this uncertainty is not shown in the trees that have been used for network evaluation. According to Bordewich et al., the trees that they (and everyone else) used were reconstructed using the fastDNAmL program (ie. maximum-likelihood), and were supplied by Heiko Schmidt (see Schmidt 2003, p.74). As expected, there are no polychotomies in these ML trees and no indication of uncertain topology; and, of course, the tree topologies are somewhat different from the parsimony trees.

An important consequence is that there is more incompatibility among the dichotomous maximum-likelihood trees than there is among the polychotomous maximum-parsimony trees. That is, many of the ML incompatibilities are related to uncertainties in the MP trees. Unfortunately, most of the network algorithms that have been evaluated using these data require strictly dichotomous trees.
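
The effect of resolving polychotomies can be illustrated with a toy example. In the sketch below, a rooted tree is represented by its set of non-trivial clusters (the sets of taxa below each internal branch), and two trees are treated as compatible when every pair of clusters is either nested or disjoint. The taxa a-d and the three trees are invented for illustration; they are not taken from the GPWG data.

```python
from itertools import product

def clusters_compatible(a, b):
    """Two clusters fit on one tree if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a

def trees_compatible(t1, t2):
    """Two cluster sets are compatible if every pair of clusters is compatible."""
    return all(clusters_compatible(a, b) for a, b in product(t1, t2))

# Hypothetical rooted trees on taxa a, b, c, d, written as non-trivial clusters.
resolved_1 = {frozenset("ab"), frozenset("abc")}   # (((a,b),c),d)
resolved_2 = {frozenset("bc"), frozenset("abc")}   # ((a,(b,c)),d)
consensus_1 = {frozenset("abc")}                   # ((a,b,c),d), a polychotomy

print(trees_compatible(resolved_1, resolved_2))    # False: {a,b} conflicts with {b,c}
print(trees_compatible(consensus_1, resolved_2))   # True: the polychotomy makes no claim
```

The conflict between the two resolved trees disappears as soon as one of them is replaced by its unresolved version, which is the sense in which forcing a dichotomous topology can manufacture incompatibility.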

Also, the root seems to create problems for these data. The GPWG trees are all rooted with this topology:
(Flagellaria,((Elegia,Baloskion),(Joinvillea,((Streptochaeta,Anomochloa),(Pharus,(ingroup))))))
However, the position of this 7-taxon outgroup relative to the rest of the taxa varies among the gene trees. That is, the connection between the outgroup and the ingroup differs between the gene trees. So, some of the incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes.
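
As a quick check on the rooting topology given above, the parenthetical string can be read mechanically. This sketch walks through the string, records each label together with its nesting depth, and confirms that there are seven outgroup taxa ahead of the "ingroup" placeholder. It is a bare-bones reader for this one string, not a general Newick parser.

```python
import re

topology = ("(Flagellaria,((Elegia,Baloskion),(Joinvillea,"
            "((Streptochaeta,Anomochloa),(Pharus,(ingroup))))))")

def leaf_depths(newick):
    """Return (label, nesting depth) for each leaf label in a parenthetical tree string."""
    depths = []
    depth = 0
    for token in re.findall(r"[(),]|[^(),;\s]+", newick):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif token != ",":
            depths.append((token, depth))
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return depths

leaves = leaf_depths(topology)
outgroup = [name for name, _ in leaves if name != "ingroup"]
print(len(outgroup), outgroup)   # 7 outgroup taxa
for name, depth in leaves:
    print("  " * depth + name)   # indentation shows the nesting
```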

Some of the ML datasets available have trees with the same set of ingroup / outgroup relationships as the GPWG trees, for example those datasets available with the CASS algorithm. However, some of the ML trees presented in the literature seem to be rooted in quite a different place, and this place differs between the gene trees. For example, in the data presented with the HybridInterleave program, which are provided as 15 pairs of subtrees rather than as six complete trees, not only are the gene trees apparently rooted in different places, but the different subsets of the same gene tree are also sometimes rooted in different places.

It seems to me that there are two consequences arising from these points: (i) it is unnecessarily hard to construct a network from the ML data (because not all of the data signals relate to reticulation), and (ii) the resulting networks (as published) look rather unrealistic to a biologist (there are far too many reticulation nodes). Perhaps this isn't the most realistic dataset to be using for the evaluation of network algorithms.

Another commonly used dataset is the Ranunculus data from Lockhart et al. (2001). In this dataset much of the incompatibility signal also seems to be associated with an uncertain position for the root (see Morrison 2011, Fig. 4.7). In this case there are two gene trees (one nuclear and one chloroplast) that have similar unrooted topologies but have different outgroup-derived root locations. Dealing with root uncertainty may thus be one of the biggest confounding problems when trying to identify reticulation events.

The original GPWG data are available at:
http://www.eeob.iastate.edu/faculty/profiles/ClarkL/GPWG-2001-Appendices.pdf

The nexus data matrix is available at:
http://www.umsl.edu/services/kellogg/gpwg/matrix.html
[In this dataset, 0=A, 1=C, 2=G, 3=T]
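
The numeric coding can be turned back into nucleotide letters with a simple lookup. This is only a sketch: the example row is invented, and passing other symbols (such as "?" or "-") through unchanged is an assumption about how missing data and gaps might be written, not something checked against the matrix itself.

```python
CODE_TO_BASE = {"0": "A", "1": "C", "2": "G", "3": "T"}

def decode(row):
    """Translate digit codes to bases; any other symbol is left unchanged (assumption)."""
    return "".join(CODE_TO_BASE.get(ch, ch) for ch in row)

print(decode("0123?-30"))  # ACGT?-TA
```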

A nexus treefile with the original six GPWG (consensus parsimony) trees is available at:
http://acacia.atspace.eu/data/GPWG.tre

A dendroscope treefile with the six ML trees is available at:
http://sites.google.com/site/cassalgorithm/data-sets

References

Bordewich M., Linz S., St. John K., Semple C. (2007) A reduction algorithm for computing the hybridization number of two trees. Evolutionary Bioinformatics 3: 86-98.

Grass Phylogeny Working Group (2001) Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden 88: 373-457.

Lockhart P., McLenachan P.A., Havell D., Glenny D., Huson D., Jensen U. (2001) Phylogeny, radiation, and transoceanic dispersal of New Zealand alpine buttercups: molecular evidence under split decomposition. Annals of the Missouri Botanical Garden 88: 458-477.

Morrison D.A. (2011) Introduction to Phylogenetic Networks. RJR Productions, Uppsala.

Salamin N., Hodkinson T.R., Savolainen V. (2002) Building supertrees: an empirical assessment using the grass family (Poaceae). Systematic Biology 51: 136-150.

Schmidt H.A. (2003) Phylogenetic Trees From Large Datasets. PhD thesis, Heinrich Heine University, Düsseldorf.

Wu Y. (2010) Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics 26: i140-i148.

Sunday, May 13, 2012

Network analysis of Bordeaux wine critics

Last week I provided a brief network analysis of the characteristics of single-malt Scotch whiskies. Today I wish to consider fermented rather than distilled products, and to move on to an analysis of the commentators rather than the actual products. I will do this by looking at the red wines from the Médoc region (the so-called "Left Bank" of Bordeaux, France) from the 2004 vintage. There are people who think that these are the best wines in the world, but as an Australian I find this attitude naïve.

The data to be analyzed are taken from the bordOverview (Bolomey Wijnimport) website: http://www.bordoverview.com/?year=2004. They consist of quality ratings of 27 wines by 7 expert commentator groups (associated with wine magazines or newsletters), these scores being awarded in September 2007 (when the wines were evaluated for "en primeur" sale):
United States of America
  Wine Advocate (Robert Parker)
  Wine Spectator (James Suckling)
United Kingdom
  Jancis Robinson
  Decanter (Steven Spurrier, James Lawther, Serena Sutcliffe)
France
  Tast (Michel Bettane, Thierry Desseauve)
  La Revue du Vin de France (Bernard Burtschy)
  Le Point (Jacques Dupont)

The scores have been converted to a 0-100 scale (many wine commentators use idiosyncratic scoring schemes), and all wines were evaluated by all commentators. (Many more wines were evaluated, but only 27 were evaluated by all of the commentators.) The manhattan distance measure was calculated between each pair of commentators, and the result displayed as a NeighborNet network. People who are closely connected in the network are similar to each other based on their scoring patterns, and those who are further apart are progressively more different from each other.
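
The restriction to wines scored by every commentator is a complete-case filter. A small sketch of that step follows, with invented wines and scores; None stands in for a wine that a commentator did not rate.

```python
# Hypothetical ratings; None marks a wine that a commentator did not score.
ratings = {
    "Wine A": {"Parker": 91, "Suckling": 90, "Robinson": 85},
    "Wine B": {"Parker": 88, "Suckling": None, "Robinson": 87},
    "Wine C": {"Parker": 94, "Suckling": 93, "Robinson": 89},
}

def complete_cases(ratings):
    """Keep only the wines that every commentator scored."""
    critics = set()
    for row in ratings.values():
        critics.update(row)
    return {
        wine: row
        for wine, row in ratings.items()
        if all(row.get(c) is not None for c in critics)
    }

kept = complete_cases(ratings)
print(sorted(kept))  # ['Wine A', 'Wine C']
```

Only the wines that survive this filter can be compared across all of the commentators, which is why the 2004 analysis is limited to 27 wines.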

Note, first, that there is considerable conflict among the critics — they certainly don't agree with each other uniformly. Wine tasting is an art not a science.

Second, note that the biggest split (bipartition) separates the English- and French-speaking commentators from each other, with the exception of Jancis Robinson, who is apparently a Francophile. The French clearly do not have the same view of their wine as do some other people.

Finally, there are some long terminal edges, indicating that a lot of each opinion is unique to that commentator, except for James Suckling, who apparently has fewer opinions of his own. As I noted, wine tasting is an art not a science.

This analysis was an exploratory one, looking for shared data patterns. It seems worthwhile, however, to pursue the origins of these differing expert opinions. That is, we need a directed graph rather than an undirected one. If we nominate Robert Parker as the outgroup to the others, as he obviously is to anyone who knows about him, then we can produce a rooted reticulation network using the method of Huson et al. (2005, Lecture Notes in Bioinformatics 3500: 233–249). In this diagram the blue lines indicate "hybridization" histories.


There appear to be three hybrid results, so that there are four "pure" wine evaluations and three "hybrid" ones. First, the Decanter opinion (Spurrier_Lawther_Sutcliffe) is a combination of the opinions of the Wine Advocate (Parker) and Le Point (Dupont) + Robinson. Second, the Wine Spectator scores (Suckling) are a hybrid of those of the Wine Advocate and Tast (Bettane_Desseauve). Third, the Revue du Vin de France outcome (Burtschy) is a hybrid of Robinson's and Tast's. Note that each of these hybrid opinions is a cross-cultural (French-English) mixture! The world of wine truly is international.

I think that this adequately sums up the world of wine criticism. Everyone's opinion is mostly someone else's, modified in response to someone else again.

Tuesday, May 8, 2012

A fundamental limitation of hybridization networks? (2)

This is a follow-up to an earlier post, which showed an example of two phylogenetic trees and three rooted phylogenetic networks. You can see them again in the figure below.

Each of the networks N1, N2 and N3 displays the two trees T1 and T2 (and no other trees). Thus, it is impossible to decide which of the three networks is correct. The question was asked whether this is a fundamental limitation of rooted phylogenetic networks (a.k.a. hybridization networks).

In my opinion, the answer is "no".

Let's first draw the networks such that each reticulation is an instantaneous event between two coexisting taxa. To do so, networks N2 and N3 need an additional taxon x, which could be an extinct taxon or just a taxon that has not been sampled.

I've specified a length for each edge of each network and have given corresponding edge lengths to the trees. The values of the edge lengths in the networks have been chosen rather arbitrarily, and are not important for the discussion below.

What is important is that, when you take the edge lengths into account, it is easy to decide which of the three networks should be chosen. N1 should be chosen if the roots of T1 and T2 have the same age, N2 should be chosen if the root of T1 is older and N3 if the root of T2 is older. The reason is the following. In network N1, the roots of T1 and T2 both coincide with the root of the network. This contrasts with network N2, where the root of T2 is a proper descendant of the root of T1 and with network N3, in which the root of T1 is a proper descendant of the root of T2.
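The decision rule can be written down in a few lines. This is only a sketch under my own assumptions: each tree is a nested list of (child, edge length) pairs, all leaves are contemporaneous, and the example edge lengths are invented rather than taken from the figure.

```python
def root_age(node):
    """Age of a node: longest summed edge length down to a leaf.

    A leaf is a string; an internal node is a list of
    (child, edge_length) pairs. Assumes contemporaneous leaves.
    """
    if isinstance(node, str):
        return 0.0
    return max(length + root_age(child) for child, length in node)

def choose_network(age_t1, age_t2, tol=1e-9):
    """Pick N1, N2 or N3 from the root ages of T1 and T2."""
    if abs(age_t1 - age_t2) <= tol:
        return "N1"  # both tree roots coincide with the network root
    if age_t1 > age_t2:
        return "N2"  # root of T2 is a proper descendant of the root of T1
    return "N3"      # root of T1 is a proper descendant of the root of T2

# Hypothetical trees on taxa a, b, c (edge lengths are illustrative only).
t1 = [("a", 3.0), ([("b", 1.0), ("c", 1.0)], 2.0)]
t2 = [([("a", 1.0), ("b", 1.0)], 1.0), ("c", 2.0)]

print(choose_network(root_age(t1), root_age(t2)))
```

With real data the root ages would be estimates, so the tolerance `tol` would need to reflect their uncertainty rather than floating-point error.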

We can conclude that the above example shows an important challenge but not a fundamental limitation of rooted phylogenetic networks. When taking edge lengths into account, it is indeed possible to uniquely reconstruct the network (at least in this case).

Sunday, May 6, 2012

Network analysis of scotch whiskies

When using non-scientific data to illustrate mathematical methods, I am told that the most common sources are baseball averages, movie grosses and political polls. I do not wish to resort to such clichéd examples, at least not yet. So, I am going to continue my series (which started with the Eurovision Song Contest) by analyzing a well-known dataset by François-Joseph Lapointe and Pierre Legendre (1994) A classification of pure malt scotch whiskies. Applied Statistics 43: 237-257. These same authors have also re-used these data: Pierre Legendre and François-Joseph Lapointe (2004) Assessing congruence among distance matrices: single-malt scotch whiskies revisited. Australian and New Zealand Journal of Statistics 46: 615-629.

The data consist of measurements of 68 characteristics (nose, color, body, palate, finish) for 109 single-malt scotch whiskies. The original authors analyzed these data using a similarity matrix and a tree. From this they produced a classification of the whiskies. They concluded that there is, indeed, a weak but detectable relationship between their classification and the geographical location of the various distilleries.

I have re-analyzed these data using a weighted Bray-Curtis similarity and a NeighborNet network. The Bray-Curtis similarity ignores "negative matches", as discussed in the previous post, so that only shared characteristics generate similarity (not shared lack of a characteristic). The weights were used to give the five types of characteristic equal influence (the 68 characteristics are not equally distributed among the five types).
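For readers who want to see the arithmetic, here is a minimal Python sketch of one way to compute such a similarity. It assumes presence/absence (0/1) characteristics and weights each characteristic by the reciprocal of the number of characteristics of its type. That weighting scheme and the toy profiles are my assumptions for illustration, not necessarily the exact ones used for the analysis.

```python
from collections import Counter

def type_weights(types):
    """Weight each characteristic by 1 / (size of its type), so that
    every type contributes the same total weight."""
    counts = Counter(types)
    return [1.0 / counts[t] for t in types]

def bray_curtis_similarity(x, y, weights):
    """Weighted Bray-Curtis similarity: 1 - sum(w|x-y|) / sum(w(x+y)).

    Characteristics absent from both profiles add nothing to either
    sum, so "negative matches" are ignored.
    """
    num = sum(w * abs(a - b) for a, b, w in zip(x, y, weights))
    den = sum(w * (a + b) for a, b, w in zip(x, y, weights))
    if den == 0:
        raise ValueError("both profiles are empty; similarity is undefined")
    return 1.0 - num / den

# Toy example: four "nose" and two "body" characteristics (hypothetical).
types = ["nose", "nose", "nose", "nose", "body", "body"]
w = type_weights(types)
whisky_1 = [1, 1, 0, 0, 1, 0]
whisky_2 = [1, 0, 0, 0, 1, 1]

print(round(bray_curtis_similarity(whisky_1, whisky_2, w), 4))
```

Adding a characteristic that both whiskies lack leaves the value unchanged, which is the property described above.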

Whiskies that are closely connected in the network are similar to each other based on the 68 characteristics, and those that are further apart are progressively more different from each other. I have added colors to the network along geographical lines within Scotland: light purple = lowland, red = east, yellow = midland, light green = north, dark green = west, light blue = islands, dark purple = Islay, black = Speyside. These groups are the same as those used by Lapointe and Legendre.


There is very little to say about this diagram, except that it is not very tree-like, thus calling into question any classification scheme, and there is very little evidence of geographical patterns. There are 21 whiskies at the left of the diagram that share the biggest split, none of which come from the lowlands, midlands or Islay, but that is it.

I have been reliably told, by people with extensive experience of the matter, that each and every Scotch single malt is unique, and that therefore personal preference for one over another is entirely justified. I now have the feeling that this may actually be true.

Note: There is a follow-up post (Single-malt scotch whiskies — a network) as well as one on bourbon (The bourbon family forest?).

Wednesday, May 2, 2012

What should a database of datasets look like?

In the previous post, Steven made the very good point that we need a "database" of datasets that can be used to evaluate algorithms for phylogenetic networks. In biological terms, we currently lack a "gold standard" with which to compare the results of our data analyses. This is an important point, to which it is worth adding a few biological notes.

Validation (or evaluation) is a common analytical problem, not just in biology, and it has been addressed in many different circumstances. For example, another area within which I work is multiple DNA sequence alignment. Between 1999 and 2005 several different databases of empirical alignments were developed: BAliBase, OXBench, PREFAB, SABmark, and BRAliBase. These were created independently of each other, and were ostensibly designed for somewhat different purposes. Evaluations of computer algorithms since that time tend to have used several of these databases as their gold standard; and it is quite obvious that success using one of the databases does not imply success with any of the others.

This background has led me to the conclusion that a database needs a structured set of data, with the structure addressing all of the different biological issues that are likely to be important. For example, phylogenetic networks are used to analyze datasets that contain "reticulation" events such as hybridization & introgression, lateral gene transfer, recombination, and genome fusion, but such events can be confounded by other events such as deep coalescence and hidden paralogy. So, a truly valuable database would have datasets that encompass not only all of these possibilities but their combinations as well. Chris Whidden's comment on Steven's post discusses the same important issue, but from the point of view of the mathematical requirements for finding the optimal network.

Such a database is a very ambitious goal. More to the point, I doubt that such a database could be created without widespread collaboration, as both Steven and Chris have emphasized.

BAliBase (referred to above) attempted to have a structured set of multiple alignments, but it is interesting to note that this structure was mostly ignored by subsequent users of the database. The users simply pooled all the different groups of alignments together and came up with an "average" success for the alignment algorithms, rather than discovering (as they would have done; see Morrison 2006) that different algorithms have different degrees of success depending on the particular characteristics of the dataset being analyzed. We should not make the same mistake when evaluating network algorithms.
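The pooling problem is easy to demonstrate with a toy calculation. In this Python sketch the accuracy scores, algorithm names and group names are all invented for illustration; they are not results from BAliBase or from any real benchmark.

```python
from statistics import mean

# Hypothetical accuracy scores per dataset group (NOT real results).
results = {
    "algorithm_A": {"group_1": [0.9, 0.8], "group_2": [0.4, 0.5]},
    "algorithm_B": {"group_1": [0.6, 0.7], "group_2": [0.7, 0.6]},
}

def pooled_mean(by_group):
    """Average over all datasets, ignoring the group structure."""
    return mean([s for scores in by_group.values() for s in scores])

def group_means(by_group):
    """Average within each group, keeping the structure."""
    return {g: mean(scores) for g, scores in by_group.items()}

for name, by_group in results.items():
    print(name, round(pooled_mean(by_group), 2),
          {g: round(m, 2) for g, m in group_means(by_group).items()})
```

In this made-up case the two algorithms have the same pooled average, yet one is clearly better on the first group and worse on the second. Only the per-group summary shows that.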

I think that there have been four suggested approaches to acquiring datasets for evaluating tree/network algorithms (in order of increasing reality):

(1) simulation under one or more data-generation models
(2) create mixed datasets from "pure" datasets, or create artificial mosaic taxa from real datasets
(3) use datasets where the postulated events have been independently confirmed
(4) experimentally create taxa with a known evolutionary history.

Option (1) has been used by many workers to evaluate tree-building algorithms, and the models have been readily adapted for phylogenetic networks (eg. Bandelt et al. 2000; Morin 2007; Woolley et al. 2008). Indeed, this has been the most common strategy for evaluating network algorithms, although there seems to be little consensus so far on what data-generation model(s) to use. The basic limitations here are that (a) simulations only show the success of the algorithms relative to how well they fit the model used to simulate the data, and (b) the relationship between the simulation model and the "real world" is unknown.

Option (2) has rarely been used for networks (eg. Vriesendorp 2007). The basic idea is to create "known" reticulations by combining parts of pre-existing datasets that lack reticulation signals. One can either combine whole datasets that contain mutually incompatible signals, or one can create individual taxa that have parts of their data taken from different reticulation-free datasets. This is a promising approach to "experimental" phylogenetics, although lack of prior experience means that we do not yet know how to use this strategy most effectively.
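The mosaic-taxon idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions of my own: two aligned sequences of equal length, a single cut point, and invented taxon names and sequences. It is not the procedure used by Vriesendorp (2007).

```python
def make_mosaic(seq_a, seq_b, cut):
    """Build an artificial "hybrid" sequence: positions before `cut`
    come from seq_a, the remaining positions come from seq_b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not 0 <= cut <= len(seq_a):
        raise ValueError("cut point lies outside the alignment")
    return seq_a[:cut] + seq_b[cut:]

# Hypothetical reticulation-free alignment (toy sequences).
alignment = {
    "taxon_1": "AAAAAAAA",
    "taxon_2": "CCCCCCCC",
}

# Add a taxon whose "known" history is a mixture of the two parents.
alignment["mosaic"] = make_mosaic(alignment["taxon_1"],
                                  alignment["taxon_2"], cut=3)
print(alignment["mosaic"])
```

Because the parents and the cut point are chosen by the experimenter, the reticulation is known in advance, and a network method can be scored on whether it recovers it.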

Option (3) is an obvious approach to collating data (McDade 1990), and has been used for evaluating tree-building algorithms (eg. McDade 1992; Leitner et al. 1996; Lemey et al. 2005). This has been used, for example, for fast-evolving organisms such as viruses, where the transmission history can sometimes be independently checked. Also, hybrids can often be experimentally verified; and Vriesendorp (2007) lists several such datasets for plants. The problem here is the degree to which the postulated reticulation events have been independently confirmed. A network reticulation may look like good evidence in favor of a "suspected" hybrid, for example, but it is not really independent evidence of anything in particular. I suspect that this weak sort of reasoning has been applied to far too many datasets used for the evaluation of network algorithms, where unsuitable datasets have been employed.

Option (4) has occasionally been used for evaluating tree-building algorithms (eg. Hillis et al. 1992; Cunningham et al. 1998; Sanson et al. 2002) but not, as far as I know, network algorithms. The idea is to experimentally manipulate some biological organisms in the laboratory to create a known evolutionary history, against which subsequent data analyses can be compared. Realistically, this restricts the datasets to viruses and phages, as these can be manipulated within a reasonable timeframe.

We need to think about which of these options we wish to adopt. Perhaps all of them?

Note: The suggested database now exists: Datasets for validating algorithms for evolutionary networks

References

Bandelt H.-J., Macaulay V., Richards M. (2000) Median networks: speedy construction and greedy reduction, one simulation, and two case studies from human mtDNA. Molecular Phylogenetics & Evolution 16: 8–28.

Cunningham C.W., Zhu H., Hillis D.M. (1998) Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies. Evolution 52: 978–987.

Hillis D.M., Bull J.J., White M.E., Badgett M.R., Molineux I.J. (1992) Experimental phylogenetics: generation of a known phylogeny. Science 255: 589–592.

Leitner T., Escanilla D., Franzén C., Uhlén M., Albert J. (1996) Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proceedings of the National Academy of Sciences of the USA 93: 10864–10869.

Lemey P., Derdelinckx I., Rambaut A., Van Laethem K., Dumont S., Vermeulen S., Van Wijngaerden E., Vandamme A.-M. (2005) Molecular footprint of drug-selective pressure in a Human Immunodeficiency Virus transmission chain. Journal of Virology 79: 11981–11989.

McDade L.A. (1990) Hybrids and phylogenetic systematics I. Patterns of character expression in hybrids and their implications for cladistic analysis. Evolution 44: 1685–1700.

McDade L.A. (1992) Hybrids and phylogenetic systematics II. The impact of hybrids on cladistic analysis. Evolution 46: 1329–1346.

Morin M.M. (2007) Phylogenetic Networks: Simulation, Characterization, and Reconstruction. PhD Thesis, University of New Mexico.

Morrison D.A. (2006) Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19: 479–539.

Sanson G.F.O., Kawashita S.Y., Brunstein A., Briones M.R.S. (2002) Experimental phylogeny of neutrally evolving DNA sequences generated by a bifurcate series of nested polymerase chain reactions. Molecular Biology & Evolution 19: 170–178.

Vriesendorp B. (2007) Phylogenetworks: Exploring Reticulate Evolution and its Consequences for Phylogenetic Reconstruction. PhD Thesis, Wageningen University.

Woolley S.M., Posada D., Crandall K.A. (2008) A comparison of phylogenetic network methods using computer simulation. PLoS One 3: e1913.