The Genealogical World of Phylogenetic Networks: April 2012

Saturday, April 28, 2012

Validating methods for constructing evolutionary phylogenetic networks

Many researchers working on constructing evolutionary (i.e. explicit, as opposed to implicit/data-display) phylogenetic networks encounter the problem that, at present, there are not many options for validating the biological relevance of their methods. In other words, how does a researcher verify whether the network produced by his/her latest algorithm is a biologically plausible approximation of reality? This is of critical importance because, unlike implicit/data-display networks, evolutionary phylogenetic networks seek to produce an explicit hypothesis of what actually happened.

Ideally there should be a repository of biological datasets where there is some level of consensus amongst biologists as to the character, extent and location of reticulate evolutionary events. This can then be used as a framework for validating the output of algorithms for constructing evolutionary phylogenetic networks. Unfortunately, as far as I am aware there are very few such “reference” datasets in circulation – if any. There seem to be multiple reasons for this. Within biology reticulate evolution is still a comparatively new topic which actually encompasses an entire range of evolutionary time-scales and phenomena. I can fully appreciate that trying to get a grip on even a tiny part of this world is an immensely complex task for biologists! This is probably why biological validation of algorithmic methods, if it happens at all, still requires collaborating biologists to perform a labour-intensive and highly case-specific analysis. It will be a massive challenge to move beyond such ad-hoc models of validation.

On the algorithmic side there are also plenty of issues. Input-side and output-side limitations to existing software are well-known. Expressed deliberately sharply: it is not often that one encounters a biologist who has two fully-refined, unambiguously rooted gene trees on the same set of taxa who wants to develop a reticulation-minimal solution and who does not mind if ancestors can hybridize with descendants. Faced with such limitations computer scientists inevitably resort to simulations or try to analyse the same dataset that the last group of computer scientists used, which is (sigh…) probably the Grass Phylogeny Working Group's Poaceae dataset. Simulations tend to use a variety of plausible-sounding techniques (e.g. random rSPR moves to simulate HGT, or – at the population-genomic level – techniques for simulating recombination) but how closely do these simulations really approximate reality?

My concern is that, at the moment, biologists and computer-scientists are locked in an unhealthy embrace, both expecting the other group to come up with “real” networks. This could be dangerous. I’ve seen biologists adjust their hypotheses based on the output of evolutionary phylogenetic network software. But those computer programs often lack any form of biological validation: not because algorithm designers are bad people aiming to mislead but because the apparently intractable character of the associated optimization problems forces computer scientists to make all kinds of restrictions and assumptions which are not necessarily compatible with the concerns of biologists. In any case: it’s clearly not helpful if hypotheses derived this way find their way back into the literature with an “approved by biologists” seal of approval.

How, then, to transform this embrace into something more virtuous? One possibility could be a structured collaboration between groups in the phylogenetic network community to produce and disseminate at least a small number of rigorously validated reference datasets which can serve as benchmarks. Is this realistic?

Very curious to hear what you think!

Note: The suggested database now exists: Datasets for validating algorithms for evolutionary networks

Stephen Jay Gould was wrong

As always at the beginning of the week, this blog presents something in a lighter vein. However, this week we depart from the restricted world of phylogenetic networks and delve into the deeper waters of evolutionary processes.

In 1980 Stephen Jay Gould published a chapter in a book about junk food (Phyletic size decrease in Hershey bars. Pages 178-179 in: C.J. Rubin, D. Rollert, J. Farago, R. Stark, J. Etra, eds. Junk Food. Dial Press/James Wade, New York), in which he tried to convince his readers that Cope's Rule of phyletic size increase applies to biological organisms but not to manufactured objects. He did this by analyzing the evolutionary history of Hershey bars, a chocolate confection well known to most Americans (but not to all that many others, at least in 1980).

I thought then, and I still think now, that Gould was wrong. I can think of several manufactured objects that show a size increase during their evolutionary history. Eventually, I decided that I could stand it no longer, and in 2000 I wrote about this in the Australian Systematic Botany Society Newsletter. I chose to write about the Evolutionary History of Mazda Motor Cars, because this manufactured object is not edible and is not well known to most Americans. I have linked to a PDF copy [1.6 MB] of the paper, because I figure that most of you have never heard of the ASBS Newsletter, and have therefore never read the article. You should.

Wednesday, April 25, 2012

Networks and bootstraps as tree-support criteria

It has been pointed out several times in the literature (eg. Wägele & Mayer 2007; Wägele et al. 2009; Morrison 2010) that network analyses and, for example, bootstrap analyses of trees do not necessarily show the same amount of "support" for a tree. This occurs because branch support values can be independent of character support.

Consequently, many apparently "well-supported" trees published in the literature are often not well-supported by the original data at all. That is, incongruences in the data are ignored by all tree-building algorithms, by definition. Indeed, this problem may be almost universal in the literature, because very few papers provide any evidence that the tree-likeness of the data has been evaluated by the authors.

Since this point seems to be poorly understood by most workers, it is worth re-iterating here with an example. The three references cited above provide other examples where bootstrap analyses and network analyses yield very different conclusions about the support for phylogenetic trees.

The basic distinction between networks and bootstrapped trees is this: use of a data-display network, such as a splits graph, evaluates the character (or distance) data independently of any tree, whereas a bootstrap analysis evaluates the data solely in terms of a tree. For example, a bootstrap analysis records the trees at each iteration (or replicate) rather than recording the bootstrapped character set itself, and many different character sets can produce the same tree. Therefore, a bootstrap analysis does not directly assess the character support for a tree. Neither does a posterior probability from a Bayesian analysis.

The importance of this distinction for phylogenetics is that a tree analysis forces the data into a tree irrespective of how well the data fit that tree. All that is required is that the tree be the optimal one based on a particular criterion (parsimony, likelihood, etc), while the degree of fit of the data and tree is effectively treated as immaterial to the analysis. This is true at each bootstrap iteration, as well, so that all we learn from a bootstrap analysis is which tree branches are the best supported — we do not learn anything directly about the support of the data for a tree in the first place.

Literally, bootstrap values represent "branch support" rather than "tree support"; and a similar thing can be said for Bayesian posterior probabilities. [This issue is discussed further in this later blog post: How networks differ from bootstrapped trees.]

This can be illustrated with a simple empirical example. The data are taken from my Primer of Phylogenetic Networks. The original data are 1,687 aligned nucleotide positions of two genes from five species of the plant genus Viburnum. However, only 43 of the characters vary among these five species. It is expected a priori that V. prunifolium is a hybrid between V. rufidulum and V. lentago, so that a single well-supported tree is not necessarily likely.

Median network.

The Median network for the data is shown in the first figure, with the branches labelled by the characters that "support" them. Other types of splits graphs have the same topology as this one (eg. NeighborNet based on uncorrected distances), since the characters are all binary and are never more than pairwise incompatible. This means that all of the character data are displayed in the graph. The netted region in the graph is created by four characters (3, 32, 41, 42) that are incompatible with nine others. Thus, there is no unambiguously supported branch (other than the terminal ones), let alone support for a single tree.
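The idea of pairwise incompatibility can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name and the toy characters are hypothetical and are not the Viburnum data. Two binary characters can sit on the same tree only if fewer than four of the possible state combinations (00, 01, 10, 11) occur among the taxa.

```python
def compatible(char_a, char_b):
    """Return True if two binary characters pass the four-gamete test.

    Each character is a sequence of 0/1 states, one per taxon, listed in
    the same taxon order. The two characters can fit the same tree only
    if fewer than four of the combinations 00, 01, 10, 11 are observed.
    """
    if len(char_a) != len(char_b):
        raise ValueError("characters must cover the same taxa")
    combos = set(zip(char_a, char_b))
    return len(combos) < 4


# Hypothetical characters for five taxa (not the real data).
c1 = [0, 0, 1, 1, 1]  # split {t1,t2} | {t3,t4,t5}
c2 = [0, 0, 0, 1, 1]  # split {t1,t2,t3} | {t4,t5}, nested within c1
c3 = [0, 1, 1, 0, 1]  # cuts across c1

print(compatible(c1, c2))  # True
print(compatible(c1, c3))  # False
```

A pair that fails this test is what produces a box (a netted region) in a splits graph rather than a single branch.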

Neighbor-Joining tree, with NJ (above) and Parsimony (below) bootstrap values.

Nevertheless, both Neighbor-Joining (based on uncorrected distances) and Parsimony analyses of the data produce a tree that is well-supported by bootstrap analyses, as shown in the second figure. In particular, note that there is strong support in both analyses (based on 100,000 bootstrap replicates) for the branch uniting V. prunifolium and V. rufidulum, even though the data indicate that this arrangement is supported by 3 characters and contradicted by 2 other characters.

Bayesian tree, with posterior probabilities (above) and Maximum-likelihood bootstrap values (below).

Both the Maximum-Likelihood and the Bayesian analyses deal with the situation in a somewhat different manner, as shown in the third figure. Based on a GTR+G+I model (and 5,000 sampled or re-sampled trees), they correctly recognize the relative lack of data support for uniting V. prunifolium and V. rufidulum (the character support is 3/5=60%). However, they both greatly over-estimate the character support for the branch involving V. lantanoides and V. nudum, which is supported by 5 characters and contradicted by 3 other characters (5/8=62.5% support). The extra number of characters (8 versus 5) apparently makes a big difference to the evaluation of branch support.
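The character-support arithmetic used here is simple enough to write down explicitly. The sketch below is my own shorthand (the function name is hypothetical); it takes the counts of supporting and contradicting characters quoted in the text and returns the proportion that agree with the branch.

```python
def character_support(supporting, contradicting):
    """Fraction of the characters bearing on a branch that agree with it."""
    total = supporting + contradicting
    if total == 0:
        raise ValueError("no characters bear on this branch")
    return supporting / total


# Counts taken from the text above.
prunifolium_rufidulum = character_support(3, 2)
lantanoides_nudum = character_support(5, 3)

print(f"{prunifolium_rufidulum:.3f}")  # 0.600
print(f"{lantanoides_nudum:.3f}")      # 0.625
```

The two branches have nearly the same proportional character support, which is why the very different branch-support values are worth noting.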

Thus, there is no reason to expect branch support values of any ilk to represent character support for that branch; and there is no simple relationship between the two things. The mere fact that character data can repeatedly be shoe-horned into the same tree does not mean that the data offer much support for that tree!

If you want an evaluation of the tree-likeness of the original data, you need to use either a data-display network or some other non-tree evaluation method. Only then can we directly assess the tree support.

References

Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology & Evolution 27: 1044-1057.

Wägele J.W., Letsch H., Klussmann-Kolb A., Mayer C., Misof B., Wägele H. (2009) Phylogenetic support values are not necessarily informative: the case of the Serialia hypothesis (a mollusk phylogeny). Frontiers in Zoology 6: 12.

Wägele J.W., Mayer C. (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.

Monday, April 23, 2012

Network road sign

There are many images of phylogenetic trees on the internet, suitable for use when an icon or symbol is required. However, there are very few for phylogenetic networks. So, this is my second contribution to the genre. (My first one was the Favicon used in the URL address for this blog.) The road sign is based on one developed for trees by Colin Purrington, which is widely available on the internet.

Note: There is a later post with some more images: Network poster images.

Wednesday, April 18, 2012

An explanation of graph types

Biologists sometimes are not clear about the distinction between directed and undirected graphs in relation to whether they are cyclic or acyclic. So, to help clarify matters, I have included a figure here that places examples of the various graphs into their respective categories.

There are four combinations of characteristics, shown in the figure as a 2x2 table.

In all of the graphs there are four (unlabelled) leaves, but the number of internal nodes and edges varies depending on whether there are cycles (4 nodes, 4 edges) or not (2 nodes, 1 edge; or 4 nodes, 4 edges).

The important point for biologists to note is that any evolutionary diagram must involve a directed acyclic graph (DAG). An undirected graph cannot represent history, because the direction of that history is not shown (and history is defined in terms of a past relative to the present). A directed cyclic graph cannot represent a realistic history, because at one of the nodes in the cycle an inferred ancestor is also its own descendant (or one of the inferred descendants is also its own ancestor).
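For readers who like to see such things spelled out, here is a minimal Python sketch of how one might check that a directed graph contains no directed cycle. It uses Kahn's algorithm, a standard approach from graph theory; the edge lists and node names are hypothetical examples of mine, not taken from the figure.

```python
from collections import deque


def is_acyclic(edges):
    """Return True if the directed graph has no directed cycle.

    `edges` is a list of (parent, child) pairs. Kahn's algorithm
    repeatedly removes nodes that have no remaining parents; if every
    node can be removed this way, there is no directed cycle.
    """
    nodes = {n for edge in edges for n in edge}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)


# Hypothetical network: root r, and a hybrid h with parents a and b.
network = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
           ("a", "x"), ("b", "y"), ("h", "z")]
# Adding h -> r would make the root a descendant of its own descendant.
impossible = network + [("h", "r")]

print(is_acyclic(network))     # True
print(is_acyclic(impossible))  # False
```

Note that the first graph contains a cycle if the edge directions are ignored, but it is still acyclic as a directed graph.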

Note that an undirected cyclic graph can be turned into either a directed cyclic graph or a directed acyclic graph. In a phylogenetic analysis, the goal is to produce a directed acyclic graph.

The main practical distinction between a "data-display network" and an "evolutionary network" is that the former is usually undirected and the latter always directed. The usual conceptual difference between a phylogenetic tree and an equivalent phylogenetic network (= evolutionary network) is that the latter has a reticulation node while the former does not.
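In graph terms, a reticulation node is simply a node with more than one incoming edge. A small sketch, assuming the graph is stored as a list of (parent, child) pairs (the function name and both example graphs are hypothetical):

```python
from collections import Counter


def reticulation_nodes(edges):
    """Return, in sorted order, the nodes that have more than one parent."""
    indegree = Counter(child for _parent, child in edges)
    return sorted(node for node, count in indegree.items() if count > 1)


# A rooted tree: every non-root node has exactly one parent.
tree = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "y"),
        ("b", "z"), ("b", "w")]
# A network on the same leaves, where h has two parents.
network = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
           ("b", "h"), ("b", "w"), ("h", "z")]

print(reticulation_nodes(tree))     # []
print(reticulation_nodes(network))  # ['h']
```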

There seems to be no consistency in the literature about what to call a cycle in the various graphs. I have made two suggestions here (loop and circuit). But, what should one call the reticulated part of a DAG?

Sunday, April 15, 2012

Book covers

Here is something for you to ponder. Of the three book covers that I know about that illustrate phylogenetic networks, the two by computational scientists have stylized "trees" with straight edges, such as one sees in graphs, while the one from a biologist has curved branches, such as one sees in nature. Who says that you can't judge a book by its cover?

Tuesday, April 10, 2012

The second phylogenetic network (1766)

In my previous post on the origin of phylogenetic networks, I considered the idea that the earliest published one was the genealogical network of races of dogs produced in 1755 by Georges-Louis Leclerc, comte de Buffon. If we accept that idea, then the next such genealogy published was the genealogical network of species and cultivars of strawberries ("Généalogie des Fraisiers") produced in 1766 by Antoine Nicolas Duchesne (1747-1827).

Like Buffon, Duchesne was a remarkable man in many ways. From a scientific perspective he was a child prodigy. In 1764, while 16 years old, he published his first professional book: "Manuel de Botanique, contenant les Propriétés des Plantes utiles" (Didot le jeune & C.J. Panckoucke, Paris). This manual described the nutritional, medical, artistic, and ornamental uses of the plants cultivated around Paris, and also provided standardized common names. The work is particularly important for botanical systematists, because it contains the first publication of Bernard de Jussieu's "natural system" of plant classification, the only one that appeared during de Jussieu's lifetime (the next version was published by Antoine Laurent de Jussieu in 1789) (see Stevens 1994).

Duchesne's father was Superintendent of the King's Buildings, giving his son access to the various royal gardens. Bernard de Jussieu, Sub-demonstrator of Plants at the Jardin du Roi (at Versailles), was his mentor, so that Duchesne grew up in a distinctly botanical environment, probably knowing as a teenager more first hand about the natural history of cultivated plants than most people acquire in a lifetime.

Unfortunately, as a scientist Duchesne did not live up to his early potential. He continued to work and publish sporadically within the field of horticulture, but did nothing particularly distinguished. He inherited his father's post, but this was abolished after the French Revolution, so that he lost touch with the nobility. Perhaps he was also influenced by Bernard de Jussieu's almost completely self-effacing nature, and thus did not seek recognition — his first book on strawberries (see below), for example, was officially written by "M. Duchesne fils" ("Mr Duchesne's son"). Today, we can appreciate his attention to scientific detail with the recent publication of two books of his immaculate illustrations (Staudt 2003, Paris 2007; see also Paris 2000), which were prepared for his publications but were not published — Duchesne could not afford to pay for their inclusion in his printed works.

Strawberries

Duchesne first presented his work on strawberries in 1764, at the age of 17. This consisted of the first-hand observation (in 1763-64) of the origin of a new cultivar, capable of persisting from its own seeds. This is a remarkable story in its own right (see Lee 1964, 1966), but its importance for us is that Duchesne became convinced of the non-fixity of varieties and species, and then contemplated the idea that all strawberries came from a single progenitor: "The formation of this new race of strawberry plants must render the hypothesis that all descend originally from a single one more than probable" (pp. 133-134 of the book cited below).

Duchesne was encouraged to continue his studies, by Carl von Linné among others (Hylander 1945). This culminated in 1766 in the publication of the work of interest to us here: "Histoire naturelle des fraisiers, contenant Les vues d'Économie réunies à la Botanique; et suivie de Remarques Particulières sur plusieurs points qui ont rapport à l'Histoire naturelle générale" (Didot le jeune & C.J. Panckoucke, Paris), available for viewing online at Google Books.

The illustration shown here faces page 228, and summarizes Duchesne's textual description on pages 219-228. It is, as noted by its title, explicitly a diagrammatic genealogy of strawberries; and on pages 223-224 relationships between strawberries are described as being like those between different branches of the same "house" (i.e. the genealogy of a human family). Like Buffon's diagram the root is at the top, and reticulating relationships are clearly distinguished from bifurcating ones, although wiggly lines are used rather than dashed ones. Equally importantly, Duchesne distinguishes varieties (in dashed boxes) from species (in solid boxes). The new cultivar that he had observed is indicated as "La Race nouvelle".

Duchesne's evolutionary intent is made clear in his text: "I consider the alpine, Fraisier des mois I [Fragaria vesca semperflorens, the everbearing strawberry] as the father of all the others. It is also at the head of the tree. The common wood strawberry, Fraisier de bois II [F. vesca sylvestris, the wood strawberry] which differs almost solely in its slower rate of growth, is immediately below, as if produced by it."

For each of the ten species and nine varieties of strawberry he tries to trace the history of its European introduction, cultivation and distribution, using this information to indicate what he thought to be the oldest and the newest species, thus suggesting which kinds might have descended from which others. The order in which he discusses each species follows this system. Thus, the diploid F. vesca kinds are followed by the hexaploid F. moschata and finally by the octoploid American kinds (F. virginiana and F. chiloensis). As with Buffon's work, this arrangement is almost entirely supported by modern genetic studies (see Hummer et al. 2011), thus emphasizing the extraordinary intuitive insight involved in this early work.

Duchesne's thoughts on evolution in general are contained in the Appendix constituting the "Remarques Particulières", notably pages 11-21. He was particularly concerned about the possible distinction between species ("espèce") and variety ("race"). "It is certain today that, if all species are stable, there are also races whose distinctions are constant, although belonging to the same species. The Versailles strawberry that I saw born, and which became the head of a race, puts that fact beyond doubt. Cultivation and other accidental causes do not produce new species, but changes in certain individuals do occur that are perpetuated in their posterity, constituting new races." Like Buffon's work, Duchesne's evolutionary ideas seem to have aroused no especial long-lasting interest among his colleagues, although his book received an honourable commendation when presented to the Académie des sciences (not bad for a 19-year-old).

Interestingly, Duchesne seems to have also produced a more detailed version of his genealogy. Lee (1966) reproduces several figures labelled as being "Illustrations pour l'Histoire naturelle des fraisiers by Duchesne, courtesy Mme. G. Duprat, Bibliothèque Centrale du Muséum National d'Histoire Naturelle, Paris", of which Figure 5.12 is a much more complicated network than the one published. (This network may actually have been prepared for the book discussed below, rather than for the one discussed above.)

Duchesne further continued his work on strawberries. He contributed the unillustrated article on "Fraisier" to the "Encyclopédie Méthodique. Botanique" edited by Jean Baptiste Lamarck, published on pages 527-540 of part 2 of volume 2 (1788; C.J. Panckoucke, Paris). This has 25 named (and numbered) taxa of strawberries, and Duchesne also rearranged his classification system. At the same time (or perhaps earlier, in 1770 or 1771) he published a separate book that appears to be a fuller 46-page version of the encyclopaedia article: "Essai sur l'histoire naturelle des fraisiers". I have not seen this work, but a contemporary review (Journal de médecine, chirurgie, pharmacie, etc. [sic!] 1788, 74:373-375) describes it as starting with a quotation from Linné's encouraging letter to the young Duchesne, and then saying a lot more about each taxon, with "vingt-sept espèces ou plutôt variétés". There is no mention of a network diagram. Duchesne has also been noted as publishing at least two later works on strawberries, neither of which I have seen: "Sur le fraisier de Versailles" (Journ. Hist. Nat. de Lamarck II pp. 343-347; 1792); and "Fraisier" du Cours d'Agriculture de Deterville (VI. p. 129-189; 1809).

Anyway, Duchesne's own interpretation of his diagram as a hybridization network is abundantly clear. Moreover, it is a network of species as well as races although, like Buffon, the author stops short of suggesting that species themselves can hybridize.

References

Hummer K.E., Bassil N., Njuguna W. (2011) Fragaria. In: C. Kole (ed.) Wild Crop Relatives: Genomic and Breeding Resources: Temperate Fruits. Springer-Verlag, Berlin.

Hylander N. (1945) Linné, Duchesne och smultronen. Svenska Linné-Sallskapets Årsskrift 28: 17-40.

Lee V. (1964) Antoine Nicolas Duchesne — first strawberry hybridist. American Horticultural Magazine 43: 80-88.

Lee D.V. (1966) Duchesne and his work. In: G.M. Darrow (ed.) The Strawberry: History, Breeding and Physiology. Holt, Rinehart & Winston.

Paris H.S. (2000) Paintings (1769-1774) by A.N. Duchesne and the history of Cucurbita pepo. Annals of Botany 85: 815-830.

Paris H.S. (2007) Les dessins d'Antoine Nicolas Duchesne pour son histoire naturelle des courges. Publications scientifiques du Muséum national d'histoire naturelle, Paris.

Staudt G. (2003) Les dessins d'Antoine Nicolas Duchesne pour son histoire naturelle des fraisiers. Publications scientifiques du Muséum national d'histoire naturelle, Paris.

Stevens P.F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York.

Tuesday, April 3, 2012

Eurovision Song Contest 2006: a network analysis

Data-display networks can be used for displaying affinities between any group of objects, especially those for which a distance matrix can be calculated. Their use is essentially as Exploratory Data Analysis (EDA), creating a visual display of the patterns of relationship (or affinity), without strong assumptions about the cause of the affinities.

Philippe Gambette has, tucked away on a web page linked from his blog, a brief tree-based analysis of the scores for the finalists in the Eurovision Song Contest from 2006. It seems to me that these data illustrate some interesting things about the possible uses of networks in EDA, and so I will expand on the analysis here.

For those of you who know little of this odd musical institution (held every year for a bit longer than I've been alive), I will provide a brief description. All countries in Europe (very broadly defined, being any country that is a member of the European Broadcasting Union) are entitled to submit a song for consideration, although some countries have never participated (and others have rarely contributed). The submitted songs are performed, and are voted upon by all of the countries participating that year.

There were 24 countries represented in the Final in 2006: Armenia, Bosnia & Herzegovina, Croatia, Denmark, Finland, France, Germany, Greece, Ireland, Israel, Latvia, Lithuania, the Former Yugoslav Republic of Macedonia, Malta, Moldova, Norway, Romania, the Russian Federation, Spain, Sweden, Switzerland, Turkey, the Ukraine and the United Kingdom. The entries from Albania, Andorra, Belarus, Belgium, Bulgaria, Cyprus, Estonia, Iceland, Monaco, the Netherlands, Poland, Portugal and Slovenia did not make it into the final, but these 13 countries retained their voting rights. Serbia & Montenegro withdrew from the contest but also retained voting rights, resulting in 38 voting countries.

Red = did not qualify for the final; green = qualified for the final; orange = non-participating in 2006; grey = never participating.

Voting was carried out in 2006 by having each country assign scores of 1, 2, 3, 4, 5, 6, 7, 8, 10 and 12 to ten of the 24 contestants in the Final. The winner was the one who accumulated the greatest total. As it turned out, these totals ranged from 1 to 292 per song that year.

Network analysis

Philippe Gambette suggests the Manhattan distance as a suitable measure of the relationships in voting patterns among the 38 voting countries. I will discuss this choice below, but for the moment let's proceed with this choice. I have visualized the distance matrix with a NeighborNet analysis. On this network I have superimposed some broad geographical groups as different colours. Countries that are closely connected in the network are similar to each other based on their voting patterns, and those that are further apart are progressively more different from each other.
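As a concrete illustration of the Manhattan distance between two voting patterns, here is a minimal Python sketch. The song labels and scores are invented for the example (they are not the 2006 data), and the function name is my own. A song that a country did not vote for is given a score of zero.

```python
def manhattan(votes_a, votes_b, songs):
    """Sum of absolute score differences over all songs.

    `votes_a` and `votes_b` map song -> points awarded; a song that a
    country did not vote for is treated as a score of 0.
    """
    return sum(abs(votes_a.get(s, 0) - votes_b.get(s, 0)) for s in songs)


# Hypothetical mini-contest with five songs and three top scores each.
songs = ["A", "B", "C", "D", "E"]
country_1 = {"A": 12, "B": 10, "C": 8}
country_2 = {"A": 12, "C": 10, "D": 8}
country_3 = {"C": 12, "D": 10, "E": 8}

print(manhattan(country_1, country_2, songs))  # 20
print(manhattan(country_1, country_3, songs))  # 44
```

Countries 1 and 2 share two of their three choices and so end up closer together than countries 1 and 3, which share only one.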

NeighborNet based on the Manhattan distance.

I am sure that you can recognize the geographical groupings. Red represents the countries from Northern Europe and around the Baltic; green is for the countries from Eastern Europe; orange is for the countries of Western Europe; blue is for the countries from Southern Europe and the Middle East; and purple is for the countries from the former Soviet Union, in Far Eastern Europe. Clearly, given these definitions, some of the larger countries could be in more than one group (e.g. France, Poland); and I could, of course, have chosen any way of grouping the countries that I liked.

However, I feel that it is noteworthy that the voting follows geographical lines, with few exceptions. Obviously, it is not geography per se that is reflected in the voting, but sociological influences associated with the geographical distributions of the cultural, political and language groups (see Clerides & Stengos 2006; Raykoff & Tobin 2007; Ginsburgh & Noury 2008; Spierdijk & Vellekoop 2009).

This network is EDA and thus only a data summary, and so we should look at the patterns in more detail, to see how they are created.

The Western European countries (as defined in my groupings) gave their big votes to Turkey, Finland, Bosnia & Herzegovina, Armenia and Greece (in that order). The Eastern European countries gave their big votes to Bosnia & Herzegovina, Russia, Finland, Greece, Macedonia and Croatia. Three of the countries are held in common between these two lists, with Finland highest.

The Southern Europeans gave their big votes to Romania, Russia, Finland and the Ukraine. The former Soviet countries gave their votes to Russia, Bosnia & Herzegovina, the Ukraine, Sweden and Lithuania. Here, two countries are in common between these two lists, with Russia highest.

The Northern European countries gave their big votes to Finland, Lithuania, Russia and Sweden, in that order.

Note that the Finns (who won) appear on four of these five lists, while Russia (who came second) also appears on four of the five. However, Finland does not appear on the former Soviets' list, and Russia does not appear on the Western European list.

This also makes it clear why the Finnish vote does not follow geographical lines: Denmark, Estonia, Greece, Iceland, Norway, Poland, Sweden and the United Kingdom, for example, all gave their 12-point vote to Finland, whereas Finland (who couldn't vote for themselves) gave their main votes to Russia and then Bosnia & Herzegovina. It is thus important to note that the winner's voting pattern will rarely be "geographical", because the competition regulations over-rule this possibility.

Andorra (an outlier in 2006) gave their big points to Spain (which no-one else did) but the rest of their voting exactly followed the Northern European countries. The only other unique vote was the 12 points that Malta gave to the Swiss song.

Armenia straddles two geographical groups in the network because they gave their big points to Russia and the Ukraine, but most of their other points went to Eastern European countries. Monaco has a long terminal edge because they gave many votes to Ireland and Latvia, which few other countries did.

Austria, Georgia, the Czech Republic and Hungary all announced that they would not be participating in 2006, and Italy had not taken part in the Contest since 1997 (some of these countries are orange on the map above). Other qualified countries had never previously participated (and some still have not) (grey on the map). Based on the EDA analysis, however, we might be able to predict what their voting pattern would be.

For example, we might predict that the Czech Republic, Hungary and Slovakia would vote with the Eastern European countries, while Azerbaijan and Georgia would vote with their former Soviet compatriots. Austria, Liechtenstein and Luxembourg are likely to have voted with the Western Europeans, while Italy and San Marino (and the Vatican City!) would presumably be more influenced by the Southern European choices.

Alternative analysis

As noted above, an interesting question for EDA is the choice of distance measure. In particular, we need to consider whether double-zeroes (negative matches) should count as representing a similar voting pattern between two countries. Given that each country can vote for only 10 songs out of the 24 available, there will be between 4 (if there are no votes in common) and 14 (if they vote for the same ten songs) countries that are not voted on by any given pair of countries. Do these "double absent" votes count as a similar voting pattern or not? The above analysis assumes that they do, but this may not be a reasonable assumption, because the joint absence of at least 4/24 votes is a requirement of the rules rather than a natural choice of the voters. Once again, the competition regulations over-rule the possibility of geographical patterns — we simply do not find out how each of the countries would have voted beyond 10/24 finalists (or 10/23 when the voter is also a competitor).
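The bounds of 4 and 14 double-zeroes can be checked with a few lines of Python. This is just a counting sketch; the function name and the use of integers 0 to 23 as song labels are my own assumptions.

```python
def double_zeros(voted_a, voted_b, n_songs=24):
    """Number of songs that neither of two countries voted for."""
    return n_songs - len(set(voted_a) | set(voted_b))


# Two countries voting for exactly the same ten songs.
same = range(10)
# Two countries with no songs in common.
first_ten, next_ten = range(0, 10), range(10, 20)

print(double_zeros(same, same))           # 14
print(double_zeros(first_ten, next_ten))  # 4
```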

This is a clear case where absence of evidence is being treated as evidence of absence. This confounding of two sources of "experimental" variation seems to be a factor in most of the previous analyses of the Eurovision voting patterns that I have been able to locate, including those based on clustering (e.g. Yair 1995; Fenn et al. 2006; Dekker 2008), ordination (e.g. Yair 1995; Yair & Maman 1996; Doosje & Haslam 2005; Dekker 2008), multiple linear regression (e.g. Bruine de Bruin 2005; Haan et al. 2005; Ginsburgh & Noury 2008; Spierdijk & Vellekoop 2009), z-scores (e.g. Gatherer 2004), decision trees (Ochoa et al. 2009) and Monte Carlo simulations (e.g. Gatherer 2006). In all cases, lack of a vote is treated as a score of zero rather than as "not applicable", so that absence of evidence (not applicable) is being treated as evidence of absence (a zero score). The only exception I found is the use of a censored regression model, such as the rank-ordered logit or tobit used by Clerides & Stengos (2006), where "not applicable" is treated as a censored observation.

An alternative distance measure that ignores all double-zeroes is the Steinhaus dissimilarity (also known as the Bray-Curtis dissimilarity), which is basically the Manhattan distance standardized by the total of the scores for each pair of countries. Using this distance measure means that at least one of each pair of countries must have voted for a song before it counts as a measurement of the voting patterns. This does not deal with the issue of comparing a "not applicable" result with a vote, but does deal with the issue of comparing two "not applicables" with each other.
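To make the two measures concrete, here is a minimal sketch of both calculations. The function names and the two score vectors are hypothetical illustrations (they are not contest data, and this is not necessarily how the networks in this post were computed). The Steinhaus value is taken here as the sum of absolute differences divided by the sum of all scores for the pair.

```python
def manhattan(x, y):
    """Sum of absolute score differences across all songs."""
    if len(x) != len(y):
        raise ValueError("score vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def steinhaus(x, y):
    """Manhattan distance divided by the pair's total awarded points."""
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("undefined when neither country awards any points")
    return manhattan(x, y) / total


# Hypothetical points given by two countries to five songs.
country_a = [12, 10, 0, 0, 8]
country_b = [0, 10, 12, 0, 8]

print(manhattan(country_a, country_b))  # 24
print(steinhaus(country_a, country_b))  # 0.4
```

The fourth song, which neither country scored, adds nothing to the numerator or the denominator of the Steinhaus value, so it plays no part in the comparison.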

NeighborNet based on the Steinhaus distance.

Using this distance instead does, indeed, have a significant effect on the observed patterns of voting relationships shown by the NeighborNet analysis. Much of the geographical pattern is still evident in the network, but it is fragmented into subgroups. Note, in particular, that Poland (with its ambiguous geographical placement) changes its affinities but France does not; and most importantly, Finland now joins its geographical neighbours, because the disruptive effects of the rules have been eliminated from the analysis.

It is likely that the non-geographical patterns are related to some of the other factors that have been implicated in influencing the Eurovision scores, such as performance order (Bruine de Bruin 2005; Haan et al. 2005; Clerides & Stengos 2006), host country (Bruine de Bruin 2005; Clerides & Stengos 2006), song language (Clerides & Stengos 2006) and possibly even song quality (Clerides & Stengos 2006; Ginsburgh & Noury 2008).

The network analysis is thus an effective EDA tool because it has revealed confounded experimental patterns that have not been addressed by the previous statistical analyses.

It is important to note that the difference between the Manhattan and Steinhaus distances is that the former considers patterns based on which countries are voted for as well as which ones are not, whereas the latter considers only the positive voting patterns. Since these distances produce different EDA data summaries, we can conclude that the strong "geographical" pattern shown by the Manhattan distance is, to a large extent, controlled by which countries are avoided in the voting rather than which are chosen to receive scores. I have long suspected that European politics works in precisely this same manner!

References

Bruine de Bruin W. (2005) Save the last dance for me: unwanted serial position effects in jury evaluations. Acta Psychologica 118: 245-260.

Clerides S., Stengos T. (2006) Love thy neighbor, love thy kin: Voting biases in the Eurovision Song Contest. Department of Economics, University of Cyprus, Discussion Paper 2006-01.

Dekker A. (2008) The Eurovision Song Contest as a ‘friendship’ network. Connections 28(1): 59-72.

Doosje B., Haslam S.A. (2005) What have they done for us lately? The dynamics of reciprocity in intergroup contexts. Journal of Applied Social Psychology 35: 508-535.

Fenn D., Suleman O., Efstathiou J., Johnson N.F. (2006) How does Europe make its mind up? Connections, cliques, and compatibility between countries in the Eurovision Song Contest. Physica A: Statistical Mechanics and its Applications 360: 576-598.

Gatherer D. (2004) Birth of a meme: the origin and evolution of collusive voting patterns in the Eurovision Song Contest. Journal of Memetics - Evolutionary Models of Information Transmission 8: 4.

Gatherer D. (2006) Comparison of Eurovision Song Contest simulation with actual results reveals shifting patterns of collusive voting alliances. Journal of Artificial Societies and Social Simulation 9: 2.

Ginsburgh V., Noury A.G. (2008) The Eurovision Song Contest: Is voting political or cultural? European Journal of Political Economy 24: 41-52.

Haan M., Dijkstra G., Dijkstra P. (2005) Expert judgment versus public opinion: Evidence from the Eurovision Song Contest. Journal of Cultural Economics 29: 59-78.

Ochoa A., Muñoz-Zavala A.E., Hernández-Aguirre A. (2009) A hybrid system approach to determine the ranking of a debutant country in Eurovision. Journal of Computers 4: 713-720.

Raykoff I., Tobin R.D. (eds) (2007) A Song for Europe: Popular Music and Politics in the Eurovision Song Contest. Ashgate, Farnham UK.

Spierdijk L., Vellekoop M. (2009) The structure of bias in peer voting systems: Lessons from the Eurovision Song Contest. Empirical Economics 36: 403-425.

Yair G. (1995) 'Unite Unite Europe': The political and cultural structures of Europe as reflected in the Eurovision Song Contest. Social Networks 17(2): 147-161.

Yair G., Maman D. (1996) The persistent structure of hegemony in the Eurovision Song Contest. Acta Sociologica 39: 309-325.

Sunday, April 1, 2012

Tattoo Monday IV

This week we have some phylogenetic tree tattoos for the biologist rather than the graph theoreticist. Here, we have designs inspired by the works of Ernst Haeckel and Charles Darwin, plus a literal interpretation of the term "molecular tree". (Another version of this tree appears in Tattoo Monday VI.)

These are the last of the tattoos that I have for you — there are only so many exhibitionists in the world of phylogenetics. That is, as N approaches infinity the probability of N+1 approaches zero, where N is the size of the set of known phylogenetic-tree tattoos.

I have been unable to locate any phylogenetic networks imprinted on the bodies of young persons, at least among those bodies publicly displayed on the internet. Perhaps networkers prefer to put their designs on t-shirts?

If you feel the need to see more science tattoos, then you will enjoy this book:

Carl Zimmer (2011) Science Ink: Tattoos of the Science Obsessed. Sterling, New York.

See also the previous posts: Tattoo Monday, Tattoo Monday II, Tattoo Monday III.