Wednesday, March 27, 2013

The Music Genome Project is no such thing


The Music Genome Project is a database in which 1 million pieces of music (currently) have been coded for 450 distinct musical characteristics. The main use of the database at the moment is to provide the data from which predictions can be made about which other pieces of music might appeal to listeners of any nominated musical set; this is implemented in the Pandora Radio product. This seems like a valuable idea.

However, the use of the word "genome" is an analogy, in which the set of musical characteristics is seen as creating a sort of genetic fingerprint for a song. According to one of the originators, Nolan Gasser:
The basic idea ... was to see if we could approach music from almost a scientific perspective; that's why it's called the Music Genome Project, named not accidentally after the Human Genome Project.
     I've always taken that metaphor very seriously: biologists have come to understand the human species by identifying all the individual genes in our genome; it's then how each individual gene is manifest or expressed that makes us who we are as individuals — as well as defines how we're related to others: most closely to those in our family, but also indirectly to people who share our same physical attributes or capabilities in sports, and so forth.
     That orientation was paramount to my thinking in designing the Music Genome Project.
There seems to be a major misunderstanding here, since the mere idea of atomizing something does not make the atoms genes. After all, the idea behind the Project is basically one of taking music apart and evaluating it by its acoustic elements.

The first problem is that the study of musical attributes is clearly a study of phenotype not genotype, as Gasser alludes in the quote above — there are no hereditary units in music. Unfortunately, phenotype and genotype are frequently confused in the social sciences, with serious consequences when the wrong analogy is used (see the blog post False analogies between anthropology and biology). As noted by LessWrong user jmmcd:
I think the Music Genome project is misleadingly-named. A genome is generative: there is a mapping from a genome to an organism. There is no reverse mapping. In the case of music, there is a reverse mapping from a piece of music to these 400 odd features, but there's no forward mapping ... Knowledge of a phenotype is not constructive, because there are many ways of constructing that phenotype; a genotype is unique, and is thus constructive.
Equally importantly for the Music Genome Project, the musical attributes themselves cannot easily be related to genes as a metaphor — they are simply observed features of the music. The attributes cover musical ideas such as genre, type of instruments, type of vocals, tempo, etc. Most of these attributes are objective and observable (e.g. vocal duets, acoustic guitar solo, percussion, triple meter style, etc), although there are some that are more nuanced (e.g. driving shuffle feel, wildly complex rhythm, epic buildup / breakdown, etc) and thus involve expert subjective judgment. The attributes are coded on a 10-point scale for the "amount" of each attribute.

Given the quantitative nature of the attributes, the only possible analogy with genetics is that of gene expression, not the genome itself (as Gasser also alludes in the quote above). This is a very different metaphor, at least to a biologist. The power of a metaphor is that if it is a good one then it can give you insights that you might not otherwise have; the danger is that a false metaphor will probably lead you up the garden path. In this case, the genome analogy does seem to lead people astray, because they think that Pandora is picking "related" music in a genealogical sense (a "family resemblance") when it is doing no such thing. After all, trying to construct a phylogeny from gene expression data is not something that biologists have attempted successfully.

Thus, if the Music Genome Project did live up to its name then it would be a very valuable thing for musical anthropologists, because then it would be possible to reconstruct a phylogeny of music. Indeed, such a thing has been proposed for popular music: The Music Phylogeny Project. Furthermore, such phylogenies have already been constructed: A Phylogenetic Tree of Musical Style. In the latter case, the author notes: "Needless to say, the tree is not automatically produced by the raw data itself, but by my own interpretation of the data", which gives you some idea of the technical problems involved.

Finally, I will note that what I have said above applies to the other projects based on a supposed analogy with the Human Genome Project. These include the Book Genome Project and the Game Genome Project. Indeed, the blurb for the Book Genome Project makes it sound even more wildly inappropriate:
The genomic analogy is imperfect but useful nevertheless: we defined the three elements of Language, Story, and Character as the literary equivalent of DNA and RNA classifications. Each gene category contains its own subset of measurements specific to its branch of the book genome structure ... Each individual book produces 32,162 genomic measurements.
As noted by commentator CypherGames below, these projects would all be more accurately called Phenome Projects.

Later note: There is also a subsequent post on the The Genome Cellar.

Monday, March 25, 2013

Network analysis of New York neighborhoods


Trying to quantify the characteristics of a neighborhood is a tricky business. Part of the problem is trying to define the nebulous idea of "livability" with respect to a geographical area, and part is due to the impracticality of collecting most of the data that might allow us to quantify the various aspects of life in that area.

Nevertheless, in New York magazine Nate Silver had a go at this in 2010, by trying to identify The Most Livable Neighborhoods in New York. He tried this because:
there is a wealth of information to study. The Bloomberg administration gathers reams of data about almost every element of life in the city — from potholes to infant-mortality rates— as do New York University's Furman Center and the U.S. Census Bureau. Sites like Yelp provide a reasonably objective perspective on the popularity of neighborhood bars and restaurants. StreetEasy.com and Zillow.com publish the costs of apartment space per square foot. Ethnic diversity is now broken down in much finer gradients than black and white ... Our goal was to take advantage of this wealth of data and apply a little bit of science to the question. If there was anything that could plausibly affect one's quality of life in a particular neighborhood, we tried to incorporate it.
New York thus provides a unique opportunity to try quantifying the nedbulous, and I think that it is worth looking at these data in more detail.

The data

The data were compiled into twelve broad categories, representing different characteristics about the various New York neighborhoods:
 Affordability / Housing Cost (as measured on a price-per-square-foot basis, for both renters and buyers), Housing Quality (historic districts, code violations, cockroaches), Transit and Proximity (commute times to lower Manhattan and midtown, the density of subway coverage), Safety (as measured by violent- and nonviolent-crime rates), Public Schools (test scores and parent satisfaction), Shopping & Services (the number of neighborhood amenities, especially supermarkets), Food & Restaurants (judged by density and quality of options), Nightlife (ditto), Creative Capital (arts venues as well as the number of residents engaged in the arts), Diversity (in terms of both race and income), Green Space (park and waterfront access, street trees), and Health & Environment (noise, air quality, overall cleanliness).
The data were gathered from the stated sources, and are presented in the original magazine article for 50 of the 60 neighborhoods that were assessed. The data for all of the characteristics were then summed for each neighborhood, based on a particular weighting scheme for the 12 categories. This provided "a quantitative index of the 50 most satisfying places to live."

The sum total of the scores is not actually very different among the neighborhoods (score 73–78 / 100), and therefore the choice between them on that basis is (as the author admits) "splitting hairs". More particularly, neighborhoods with very different characteristics can end up with the same total score — they simply get that total by combining the category scores in very different ways (ie. the neighborhoods have different strengths and weaknesses).

So, this is a rather limited approach to assessing the data. Surely we can get more out of the data than this? What would be more useful is a picture showing which neighborhoods are similar to each other based on the way the scores are distributed across the different categories. This will tell us which neighborhoods have the same characteristics, and which are different from each other. This avoids splitting hairs, because it uses all of the data simultaneously, rather than summarizing the data down to a single number for each neighborhood.

The analysis

A phylogenetic network is ideal for doing this sort of thing, as I have emphasized many times in this blog, and so I have constructed one. As my analysis of choice, I have used the manhattan distance (appropriately enough!) combined with a NeighborNet network. Neighborhoods that are closely connected in the network are similar to each other based on the various characteristics, and those that are further apart are progressively more different from each other.

Click to enlarge.

I have color-coded the neighborhoods based on their borough, using roughly the same colors as in the map shown above.

I have also placed an asterisk next to the top five neighborhoods based on their total scores. Two of these neighborhoods are near each other in the graph, with two a bit further away, and one is quite distant from the others. This indicates that, even though they have very similar total scores, these neighborhoods are actually quite different.

In general, the network shows a trend from Manhattan (at the right-hand end of the graph) to Queens and the Bronx (at the left-hand end), via Brooklyn (stretching through the middle). This seems to neatly summarize the overall impression of the relationships among the areas of New York, at least as it is usually presented to outsiders. So, I think that the network analysis has been a successful one, in the sense that it provides a useful picture of the relationships between the neighborhoods.

Going deeper, many of the detailed patterns in the network graph are fairly obvious. For example, (at the right-hand end of the graph) the association of the southern Manhattan neighborhoods of Soho, Central Greenwich Village, Tribeca, Battery Park City, and the Financial District should surprise no-one. Similarly, (at the top of the graph) the linking of Manhattan's Inwood with the nearby Bronx neighborhoods of Belmont, Bedford Park and Riverdale is not unexpected. Furthermore, (at the left-hand end of the graph) the connection of Astoria, Woodside, Jackson Heights, and Flushing (in north-western Queens) with Cobble Hill, Boerum Hill, and Bay Ridge (in western Brooklyn) is hardly surprising, even though the two borough areas are geographically separated.

Other patterns are less obvious, and thus more intriguing, such as the apparent similarity of Chinatown (southern Manhattan), Central Harlem (northern Manhattan), Co-op City (the Bronx), and West Brighton and New Dorp (both Staten Island) (at the bottom-left of the graph). This bears looking into, should you be looking for somewhere to live in New York. Perhaps the oddest juxtaposition is that of Chelsea (midtown Manhattan) with Corona Park (Queens) and Washington Heights (northern Manhattan).

Another possible use of the graph is that it makes suggestions for areas that might be suitable as alternatives to any neighborhood that is out of reach on the Affordability / Housing Cost criterion. That is, we might consider areas that are similar based on the other criteria and yet differ in Affordability. For example, Park Slope (northern Brooklyn) differs dramatically in Affordability from the Nolita & Little Italy neighborhood (lower Manhattan), and yet the only other characteristic they differ greatly on is Shopping & Services. Williamsburg, Greenpoint, and Carroll Gardens & Gowanus are indicated in the network as other neighborhoods worth considering.

It seems unlikely, however, that anyone looking for a substitute for the Upper East Side of Manhattan (one of the most expensive neighborhoods in the USA) is going to look at Sheepshead Bay, as suggested by the network — the two neighborhoods differ dramatically in Transit Proximity, since Sheepshead Bay is way down on the Atlantic coastline. Nor are those looking for a replacement for the Upper West Side going to consider Brooklyn's Prospect Heights — these two differ more than somewhat in Housing Quality, for example. So, good though it is, the suggestions made by the network graph are not perfect!

Postscript

There is one other ranking scheme that I know of, at the StreetAdvisor Best Neighborhoods web page [on that page, click on Neighborhoods]. It is described as follows:
Our rankings begin with reviews written by locals. Each review contains certain scoring elements that tell us how good, or how bad a place is. We then combine all the scores and apply a 'fairness' factor that takes into account things such as volume of reviews, age of reviews and the type of person writing a review. We then apply a rank so we can compare and sort locations.
You will find many of the rankings odd, to say the least. For example, it seems doubtful that Country Club (the Bronx) is the "3rd best neighborhood in New York City" (after Carnegie Hill and Gramercy Park).

Not the least of the oddities is that the Upper East Side (7.6) scores much less than neighboring Carnegie Hill (9.4) and Lenox Hill (8.1). Indeed, it scores worse than parts of Brooklyn (Carroll Gardens, Clinton Hill, Brooklyn Heights, Park Slope, Bay Ridge), Queens (Glendale, Richmond Hill, Forest Park), the Bronx (Country Club, Schuylerville) and Staten Island (Huguenot).

You can access the individual neighborhoods within the boroughs at these web pages:
Manhattan
Brooklyn
Queens
Bronx

Wednesday, March 20, 2013

First-degree relationships and partly directed networks


I have noted before that a pedigree is a network not a tree, and specifically it is a hybridization network (Family trees, pedigrees and hybridization networks). That is, in sexually reproducing species, every offspring is the hybrid of two parents. If we include both parents in the pedigree, plus all of their relatives, then this will form a complex network every time inbreeding occurs.

This situation can be generalized to groups of closely related individuals, such as cultivated plants and domesticated animals, where human-mediated inbreeding has resulted in the formation of new breeds and cultivars with limited genetic diversity. In the extreme case, the network will consist of first-degree relationships, where the branches connect parent-offspring relationships or sibling relationships.

An example of this is provided by the work on the genetics of grape cultivars by Myles et al. (Myles S, Boyko AR, Owens CL, Brown PJ, Grassi F, Aradhya MK, Prins B, Reynolds A, Chia JM, Ware D, Bustamante CD, Buckler ES. 2011. Genetic structure and domestication history of the grape. Proceedings of the National Academy of Sciences of the USA 108: 3530-3535).


The genotype data were generated from a custom microarray, which assayed 5,387 SNPs genotyped in 583 unique Vitis vinifera samples from the US Department of Agriculture (USDA) germplasm collection. Estimates of identity-by-descent (IBD) were calculated based on linkage analysis for all pairwise comparisons of samples. These IBD values were calibrated based on known pedigree relationships (ie. confirmed parent-offspring relationships), and this was used to differentiate between parent-offspring and other pedigree relationships. For each cultivar that was related to at least two other cultivars by an estimated parent-offspring relationship, the proportion of SNPs consistent with Mendelian inheritance was used to determine the two parents.

The authors found that 75% of the grape cultivars were related to at least one other cultivar by a first-degree relationship. The first figure (above) shows the frequency histogram of these first-degree relationships, along with the resulting complex pedigree structure, which can be visualized as a set of undirected networks. This set is dominated by a single network with 58% of the cultivars, each related to at least one other cultivar by a first-degree realtionship.

Fig. 3. Network of first-degree relationships among common grape cultivars.
Solid edges represent likely parent-offspring relationships. Dotted edges represent sibling
relationships or equivalent. Arrows point from parents to offspring for the inferred triplets.

The authors inferred that about half of the first-degree relationships were likely to be parent-offspring, with the other half being labeled "sibling or equivalent" (because complex crossing schemes can generate IBD values that are indistinguishable from sibling relationships). By evaluating Mendelian inconsistencies, they assigned parentage for 83 triplets of cultivars. The second figure shows a directed hybridization network of some well-known grape cultivars that includes several resolved triplets.

Note that the hybridization network is only partly directed — quite a few of the edges do not have a uniquely identified direction, based on the SNP data. This is an issue that I have not seen directly addressed in the literature. Practitioners tend to treat phylogenetic networks (and trees) as either directed or undirected, rather than a mixture of both, as this characteristic is determined by the presence or absence of a root node. However, in the grape case there is no root identifiable based on the cultivar SNP data. (There is a scenario for the origin of modern grape cultivars from Vitis sylvestris around the eastern Mediterranean, but even this is complicated by hypothesized later gene flow between V. sylvestris and V. vinifera.)

Perhaps the possibility of partly directed phylogenetic networks needs more consideration.

Monday, March 18, 2013

What is "Haeckel's Tree of Life"?


The "Tree of LIfe" is an expression that you will find all over the web, usually referring to little more than a phylogenetic tree with only a few species in it, and certainly not all of Life, nor even the major groups of LIfe. More specifically, however, it seems commonly to refer to any tree that has Homo sapiens in it.


One tree that has intrigued me is found on Wikipedia's page called Tree of life (biology). It is labelled as "Haeckel's Stambaum der Primaten (1860s)", but in the text it is referred to as "the first sketch of the famous Haeckel's Tree of Life in the 1870s which shows 'Pithecanthropus alalus' as the ancestor of Homo sapiens."

The original JPEG file of the tree, dated 18 February 2009, has a compromise between these two contradictory statements: "The first sketch of the famous Heackel's Tree of Life which shows 'Pithecanthropus alalus' as the ancestor of Homo sapiens. Date: 1860s." No source is given for the picture.

Ernst Haeckel was the most famous popularizer of phylogenetic trees in the 19th century (he called them Stammbaum, literally "stem tree"). However, the illustration itself is not in the style of Haeckel's trees from the 1860s, which were drawn as realistic trees (see Who published the first phylogenetic tree?), nor is it in the style of his most famous tree from the 1870s, which is drawn as an oak tree (see Evolutionary trees: old wine in new bottles?). So, I decided to trace the history of this tree.

Haeckel published a slightly modified version of the sketch, with a different title, in:
Ernst Haeckel (1899)
Ueber Unsere Gegenwärtige Kenntniss vom Ursprung des Menschen.
[About Our Current Knowledge of Human Origins]
Emil Straws, Bonn.

It is worth noting that all of the names along the central axis are hypothetical, except for Homo sapiens. Pithecanthropus alalus, however, came to be associated with what is colloquially called Java Man, now named Homo erectus.

As far as I can determine, the hand-drawn version of the illustration (ie. the one in Wkipedia) first appeared in:
Herbert Wendt (1954)
Ich Suchte Adam: Roman einer Wissenschaft.
[In Search of Adam: a Science Novel]
Zweite, erweiterte Auflage. [Second, enlarged edition]
Grote Verlag, Hamm.
It did not appear in the first edition of the book, published the year before. It appears as Figure 22 (page 310):
Abb. 22:
Haeckels klassisch gewordener Menschenstammbaum, eigenhändig als erste Skizze entworfen – ein historisches Dokument. [Haeckel's classic people pedigree, designed by hand as the first sketch – a historical document.]
Note that, contrary to the claim, this is not the first pedigree from Haeckel, nor is it even the first primate pedigree from him. The source of the document is noted on page 581 as:
Verzeichnis der Textabbildungen
Abb. 22: Haeckels Stammbaum des Menschen. (Prof. Dr. Heberer, Göttingen)
Gerhard Heberer was a German anthropologist and phylogeneticist, who studied Haeckel's work closely. He apparently passed a copy of the illustration to Herbert Wendt when the latter was expanding his book with many more illustrations. This book was a best-seller from the start, going through five German editions, before re-appearing in 1961 as Ich Suchte Adam: Die Entdeckung des Menschen [The Discovery of Humans], whence it went through another seven editions. It was translated into English, appearing in 1955 as I Looked for Adam, in 1956 as In Search of Adam: The Story of Man's Quest for the Truth About His Earliest Ancestors, and in 1972 as From Ape to Adam: Search for the Evolution of Man. It was also translated into several other languages (including Swedish in 1955 as I Urmänniskornas Spår: Förhistoriens Forskaräventyr).

The hand-drawn tree has appeared in print at least twice since its first appearance in Wendt's book. The most important of these is in:
Thomas Junker and Uwe Hoßfeld (2001)
Die Entdeckung der Evolution: Eine revolutionäre Theorie und ihre Geschichte.
[The Discovery of Evolution: a Revolutionary Theory and its History]
Wissenschaftliche Buchgesellschaft, Darmstadt.
The picture appears as Figure 18 on page 125:
Abb. 18: Handschriftlicher Entwurf des Stammbaums der Primaten von Ernst Haeckel (1895). (Bildmaterial im Nachlass Heberer; im Besitz von Uwe Hoßfeld).
[Hand-drawn sketch of the family tree of primates by Ernst Haeckel (1895). (Artwork in the estate of Heberer; in the possession of Uwe Hoßfeld.]
Uwe Hoßfeld has told me: "I got the whole archive material from the Heberer family in 1990 and found the photo in his diaries." This explains the later history of the sketch, although not how Gerhard Heberer acquired the photo in the first place.

The date 1895 makes much more sense than do the 1860s and 1870s dates (as given in Wikipedia), especially given the publication date of the printed version.

The other appearance of the hand-drawn tree is in this book:
Winfried Henke and Ian Tattersall (editors) (2007)
Handbook of Paleoanthropology, Volume 1.
Springer‐Verlag, Berlin.
The tree appears as Figure 1.4 on page 16 (in the chapter by Winfried Henke 'Historical overview of paleoanthropological research'), and is labelled: "First pedigree designed by Ernst Haeckel". As noted above, it is not the first pedigree from Haeckel, nor the first primate pedigree from him. Winfried Henke has told me that he got the sketch from Wendt's book.

As a final point, Haeckel's hand-drawn trees usually seem to match the published versions rather more closely than the one above does. For example, here is the hand-drawn original of his famous oak tree. Perhaps, the more stick-like tree was not treated as being a "real" picture.


Thanks to Winfried Henke and Uwe Hoßfeld for their email correspondence regarding my quest.

Wednesday, March 13, 2013

Topological restrictions: some comments


In a recent post (Different topological restrictions of rooted phylogenetic networks. Which make biological sense?), Leo and Steven synthesized a lot of the issues that have been raised in recent years, both in this blog and in the literature, about the underlying models for rooted phylogenetic networks. This is an excellent summary (and explanation) of the current situation; please read it if you have not already done so.

The key issue for biologists is: what are we trying to model biologically? In one sense the answer is trivial: "evolutionary history". However, this answer is uninformative, in the sense that it is too broad to be practical. We do not know enough about evolutionary history for it to be obvious which model(s) we should use, and we presumably never will know (we were not there to study it, and it happens too slowly to study it now).

Nevertheless, we need to make decisions about models, and in many cases quite detailed decisions. Therefore, we need public discussion about the various possible models that have been suggested, as well as suggestions for new models. Unfortunately, the impetus for this has almost always come form the computational side rather than the biological side, if only from practical necessity — a mathematician cannot proceed without a model. (A biologist can't either, but they seem to be much more comfortable with vague word models rather than quantitative mathematical ones.)

So, I will provide an initial response to Leo and Steven's points, as a staring point for discussion, in the order in which they raised them. Hopefully, someone else will have something to say, as well.

Rooted

If there is assumed to be a single origin of life, and evolution is assumed to be acyclic, then there will always be at least one root in every evolutionary network. Indeed, if we are studying a monophyletic group of organisms then there will always be precisely one root. But if we are studying a polyphyletic group then technically there are multiple roots, in the sense that we cannot reconstruct the most recent common ancestor of the group. In practice, we have ignored this issue, and simply reconstructed the ancestor anyway, as best we can, since that is what the current tree-building algorithms do. In this sense, the only difference between an evolutionary tree and an evolutionary network is the complication caused by horizontal gene transfer, in which we might consider a species' genome to be polyphyletic, as well.

So, a single artificial root will probably suffice — see Steven's earlier comment: "the multiple-root situation can probably be reduced to the single-root situation by having some kind of high-degree (i.e. unrefined) artificial root which is the parent of the real roots."

Acyclic

Acyclicity is an obvious criterion for evolution, because a descendant cannot be its own ancestor. So, in a network we should treat this as "sacred" — an evolutionary network in its final form cannot show any directed cycles.

This does not, I guess, preclude algorithms that do not themselves require acyclicity, as biology always violates mathematical assumptions (eg. normality, homoscedasticity, etc). However, I think that we should require the output to be acyclic.

This is not necessarily an issue for reconciliation between gene trees and species trees, as raised in the original blog post. In this scenario the resulting phylogeny is a tree, and therefore cannot show cycles, by definition.

Time-consistent

Time consistency is a stronger form of acyclicity, in the sense that the acyclic parts of a network are also restricted by time's arrow. It is another obvious criterion for evolution, because a descendant cannot reticulate with its predecessors.

However, this requirement is confounded by incomplete taxon sampling, as noted by Leo and Steven. To deal with this, we can allow "ghost" lineages representing hypothesized missing taxa. This does, nevertheless, raise the issue of where we draw the line. Any two trees can be reconciled by allowing enough ghost lineages, just as is done by algorithms for the gene duplication-loss problem, where minimizing the numbers of unobserved duplications and losses is the optimality criterion. Similarly, how many ghost lineages should we allow in a reticulating network? Perhaps we should be minimizing them, too.

There is also the more fundamental issue of whether time-consistency should actually be a requirement at all. It has been pointed out a number of times that this is not a requirement for anthropology, for example, where transfer of non-genetic information through time is not only possible but quite common. This has been discussed in earlier blog posts.

So, in a general sense we cannot make time consistency a mandatory requirement of evolutionary networks. However, we could do so for strictly biological networks, where information transfer occurs with gene flow, which must follow time's arrow.

Degree-restricted

Nodes with indegree 1 and outdegree 1 are usually considered to be unnecessary, because they do not represent anything more than would an edge (or arc). The only times they have been seriously suggested are when placing known fossils onto a tree, and the fossil is claimed to represent an ancestor of one of the nodes. However, this is considered to be inappropriate, because we can never be sure that any specified fossil is actually an ancestor, as opposed to being a sister to one of the ancestors (see the blog post at Evolving Thoughts).

Indegree >2 has also been considered to be problematic in phylogenetics. If a sexually reproducing organism has two parents, then theoretically we cannot observe indegree>2. In practice, however, evolutionary events may occur over a short enough time-scale that we cannot distinguish a series of individual indegree-2 events, so that for practical purposes a network might end up showing indegree >2. This is analogous to "soft polytomies" in trees, where outdegree >2 represents uncertainty.

Tree-child, Tree-sibling, Reticulation-visible

These network restrictions all refer to so-called "invisible" nodes (nodes from which all paths end up in reticulations). All nodes in a tree are visible, and so there is no analogous situation in tree-building from which we might derive a response (as I have done several times above).

The basic issue with invisible nodes is restricting their occurrence. Theoretically, we could add an infinite number of invisible nodes to a network, but few, if any, of them would be based on any inference from the data. So, tree-child networks ban them entirely, whereas tree-sibling and reticulation-visible networks allow them under particular circumstances.

It is difficult for me to see exactly how I should interpret invisible nodes biologically. How can I reconstruct their features, for example? Invisible nodes tend to appear when combining multiple trees into a network, and so they are not usually constructed directly from character data. Biologically, they might exist, but there seems to be little reason to consider them as well-supported inferences, as we do for visible nodes.

In this sense, they have much more of the feel of mathematical artifacts than do visible nodes.

Galled trees, Galled networks, Level-k

These are all artificial restrictions on networks that do nothing more than limit the complexity (and thus make the algorithms more tractable). There seems to be no good biologicals basis for using any of these criteria as restrictions (as opposed to using them as network descriptors).

There is, however, a good logical basis for their use — simpler networks are to be preferred over more complex ones, at least as working hypotheses for evolutionary history. The problem with this attitude seems to be that sometimes simpler networks look less biologically realistic than do more complex ones (eg. level-1 networks might spread the reticulations across the network whereas level-3 networks can concentrate them in one place).

Regular and normal

These always seem to me to be of mathematical interest for characterizing networks, rather than being of any biological interest. But maybe that is just my ignorance.

Directed Acyclic Graph or tree-with-edges-added?

To me , this is one of the BIG questions. Until biologists come to grips with this, I do not see how the computational people are going to make the helpful contributions that they obviously desire (or, at least, the ones I know). I have thought (and read) about this a lot, and I have come to the (perhaps unfortunate) conclusion that the answer is: "both".

From the mathematical point of view, the problem is simply that certain network topologies will not be considered when we add reticulation nodes to a tree. So, we automatically exclude those topologies if our algorithm starts with a tree. From the biological point of view, the issue is whether we consider evolutionary history to be approximately tree-like or whether it is fundamentally reticulated.

My reading of the literature is that those people dealing with hybridization in eukaryotes are leaning towards the "tree obscured by vines" model, whereas those dealing with horizontal gene transfer in prokaryotes prefer the "anastomosing plexus" model. Perhaps it is too soon to tell whether this dichotomy will continue; but, certainly, discussion about the hybridization issue goes back to the early 1980s, and since then the opinion about the fundamentally tree-like nature of history has been repeatedly expressed. Moreover, those people using reduced median networks and median joining networks for mitochondrial data seem to be interested in simplifying their initially complex network as much as possible, and then interpreting the resulting network in terms of a few conflicting parsimony trees; I interpret this as a preference for trees over networks.

Of course, preferring a reticulated tree model may just be historical inertia, whose importance in the daily lives of scientists should not be underestimated.

Monday, March 11, 2013

Tattoo Monday VIII – the March of Progress


Transformational evolution is easier to grasp than variational evolution, and so it is hardly surprising that evolution tattoos should include the Descent of Man (or March of Progress) image. We have six examples here.



The image originated as shown in the bottom-right tattoo, which is based on the frontispiece to Thomas Henry Huxley's book about primate anatomy (1863. Evidence as to Man's Place in Nature. Williams & Norgate, London), as reproduced in the next picture.


A century later, this image was expanded and updated in a book by the anthropologist Francis C. Howell (1965. Early Man. Time-Life International, New York), as shown in the next picture. (The full-size picture, with labels, can be viewed here.)


In spite of it's disconnection with modern phylogenetic ideas (see my blog post on Linear versus branching evolution), it is probably the single most famous image in evolutionary biology. However, as a tattoo image it cannot compete in popularity with Charles Darwin's "I think" tree (as shown in the blog posts Tattoo Monday III, Tattoo Monday V and Tattoo Monday VI).

Thursday, March 7, 2013

Network analysis of Michelin starred restaurants


The world is full of snobbery, and the food & wine business has more of it than almost anywhere else. Probably the most snobbish part of the restaurant business is the 3-star rating system used in the annual Michelin tourist guides. It seems to me that this is worth trying to explore in some detail.

Michelin stars

Originally (starting in 1926), the Michelin ratings were intended (as noted by Anthony Capella) "to tell haute-bourgeoisie French motorists where to find Parisian-style fine dining"; and that philosophy is still prevalent today. So, if that's not what you're looking for, then Michelin cannot necessarily help you. For example, Paul Levy has observed: "The third Michelin star seems always to have been awarded for surroundings and service as well as food"; and restauranteur Mat Follas (winner of the BBC TV programme MasterChef in 2009) comments: "There's a perceived level of service and over-formality that comes with Michelin, and it's not something that I'd aspire to."

Furthermore, the Michelin guides are frequently accused of a distinct francophile bias. In particular, Anthony Capella notes: "From the type of glassware to the number of amuse-gueules, there is a Michelin way of doing things that often seems to stifle rather than celebrate regional idiosyncrasy." In other words, the inspectors "don't understand" other cooking styles. Nevertheless, as Andy Hayler observes: "Despite its eccentricities, for French food Michelin is generally reliable in its assessments at the top of the restaurant tree".

So, in this analysis I will restrict myself to restaurants that are actually in France (not Monaco!), for which Michelin released the 2013 ratings on February 18. This is appropriate, because in 2010 "Gastronomic Meal of the French" was added to UNESCO's Representative List of the Intangible Cultural Heritage of Humanity, which means that we are supposed to actively safeguard it (along with "Viennese Coffee House Culture").

The stars themselves, according to Michelin, represent (since 1936):
  • One star indicates a very good restaurant in its category, offering cuisine prepared to a consistently high standard. A good place to stop on your journey.
  • Two stars denote excellent cuisine, skillfully and carefully crafted dishes of outstanding quality. Worth a detour.
  • Three stars reward exceptional cuisine where diners eat extremely well, often superbly. Distinctive dishes are precisely executed, using superlative ingredients. Worth a special journey.
The first post-WWII year in which Michelin re-instated their 3-star ratings was 1951, which is thus a convenient starting point for my analysis of haute cuisine dining.


I have compiled a dataset of the number of stars every year since 1951, for every restaurant in France that received 3 stars at least once during that time. Unfortunately, Michelin does not maintain any restaurant lists (you have to search through the individual guides), and so I have compiled the dataset from Marc Vilbois' Art et Gastronomie, supplemented by Andy Hayler's 3 star Restaurant Guide, plus extensive searches through the French edition of Wikipedia and recent Michelin press releases. [The data are apparently also available in: Jean-François Mesplède. 1998. Trois Étoiles au Michelin : une Histoire de la Haute Gastronomie Française et Européenne.]

The analysis

There are 59 restaurants on the list, 26 of which currently still have a 3-star rating; 13 currently have 2 stars, and 6 have 1 star. There are 14 restaurants that currently have no stars, most of which have closed permanently. (It is important to re-emphasize that the 3-star restaurant in Monaco is not included, which reduces the number of restaurants by one, compared to what you will see in the newspapers.)

To start the analysis, we can simply look at the number of three-star restaurants throughout the study period. They come and go, opening and closing, and gaining and losing stars, but there is still an inexorable increase through time.


Next, I have used the manhattan distance and a NeighborNet network to produce the following two network graphs, which summarize the patterns of variation in the stars. Objects that are closely connected in the network are similar to each other based on similar times at which they were at their 3-star peak, and those that are further apart are progressively more different from each other.

As far as the years are concerned, we would expect there to be a gradient from 1951 to 2013, since restaurants come and go continuously. Indeed, this is exactly what the first network shows.


The break from 1962 to 1963 reflects the arrival of "Maison Lameloise" and "L'Oasis la Napoule", the latter of which lost its stars in 1987. The clustering of the years 1972-1999 is associated with the simultaneous arrival on the scene of "La Côte St Jacques", "Au Crocodile", "Hôtel de l'Espérance" and "Le Vivarois", the latter of which disappeared again in 1999. The gap between 1986 and 1987 reflects the arrival of "L'Arpège", "La Vague d'Or" and "l'Hôtel George V".

The break from 2000 to 2001 reflects the move of chef Alain Ducasse from "l'Hôtel du Parc" to "Hôtel Plaza Athénée", taking his three stars with him, combined with the opening of "La Ferme de Mon Père", which immediately acquired three stars. The gap between 2006 and 2007 reflects the closure of both "La Ferme de Mon Père" and "Jamin", the sale of both "Le Buerehiesel" and "Hôtel de l'Espérance", and the resurgence of "Maison Pic".

The pattern among the restaurants themselves is more complex, however, as shown in the second network. There are four groups that might be recognized, based on the timing of their star patterns, and this is clearly reflected in the network (as shown in different colours).

Restaurants with a red asterisk have three stars in 2013.

The small blue group had their Michelin-starred heyday during the first one-third to two-thirds of the time period. The only exception is "Le Café de Paris", which had three stars in the early 1950s but is not placed in this group by the network.

The larger pink group had Michelin stars for all or almost all of the time, although none had three stars for the whole time. The longest was "l'Auberge Du Pont De Collonges" (made famous by chef Paul Bocuse), which had three stars for 49 / 63 years, followed by "l'Auberge de l'Ill" (47 years), "La Maison Troisgros" (46) and "Restaurant de la Tour d'Argent" (44 years, but no longer has three stars).

This group forms two subgroups plus a few outliers. The subgroup from "L'Auberge Du Père Bise" to "Restaurant Lasserre" had their 3-star peak before 1995, whereas the subgroup from "Maison Pic" to "Lucas Carton" had three stars into the 2000s as well. "Hôtel Plaza Athénée" and "Hôtel George V" share the characteristic of having their 3-star period only in the 2000s. "La Petite Auberge" and "Restaurant Charles Barrier" are distinct in having their 3-star period in the 1960s and 1970s, respectively. "La Cote d'Or" is distinct because it had three stars at two very different times, 1951-1963 with chef Alexandre Dumaine, and 1991-2013 with Bernard Loiseau (and then Patrick Bertron).

The large black group had Michelin stars during the final one-third to one-half of the time, except for "Le Café de Paris". They almost all still have three stars in 2013, or they have closed. The subgroup from "Le Petit Nice" to "L'Arpège" had three stars for much longer than did the other members of this group. "Le Café de Paris" had three stars for only a few years (1951-1955) and no stars at all at any other time, just like "l'Hôtel du Parc" (1996-2000) and "La Ferme de Mon Père" (2001-2006).

The green group had Michelin stars during the final two-thirds of the time, but usually peaking before the 2000s. Few still have three stars. "Le Buerehiesel" is distinct in receiving its stars later and peaking later than the others in this group.

In conclusion, this network neatly arranges the restaurants based on when they had stars, and particularly three stars, with the exception of "Le Café de Paris".

Postscript

As you might imagine, three Michelin stars does not come cheap from the customer's point of view. Even one star increases the meal's price significantly, no matter what it tastes like. Running a 3-star restaurant appears to require about two staff members for every three customers, and someone has to pay these people — so, the bill you get at the end of the meal is guaranteed to take your breath away. Notwithstanding this, a list of any sort provides a challenge to certain types of people.

There appear to be only two people currently advertising on the web that they have eaten in all of the 3-star restaurants in the world that were available to them at the time: Andy Hayler claims to have achieved this on three separate occasions, first in 2004 and then with catch-ups in 2008 and 2010, while the anonymous 3starbackpacker claims to have done it between November 2006 and October 2007. Hayler has reviews of all of his visits (along with a score out of 10), but the Backpacker has only a couple of his reviews. He does, however, use a 13-point scale to rate his visits, as shown in the histogram below. He is particularly critical of the variability among the restaurants, and recommends only 33 of the 73 restaurants as being worth a re-visit (ie. the other 40 do not merit a 3-star rating).

A     **
A/A-  ****
A-/A  *****
A-    *********
A-/B+ ****
B+/A- ********
B+    *******
B+/B  ******
B/B+  ****
B     ********
B/B-  ******
B-/B  *
B-    *********

There is also a story of another person who tried to eat in all of the 3-star restaurants but gave up. Hayler notes that: "Keeping pace with the three-star crowd is not easy ... You can't eat this stuff all the time. You need a break." A New York childrens-wear manufacturer named Leonard S. Bernstein describes how to survive eating in eight 3-star restaurants in eight days, in his classic book The Official Guide to Wine Snobbery. He mentions 17 of the 21 French 3-star restaurants from 1982, two of which no longer exist and only seven of which still have three stars today.

Sunday, March 3, 2013

Different topological restrictions of rooted phylogenetic networks. Which make biological sense?


Those readers active in the field of evolutionary phylogenetic networks will know that there are many different definitions ofrooted phylogenetic networkin circulation. While certain features are almost universal (e.g. rooted, acyclicity), many are not. Why do these differences arise? There are multiple answers to this. Some arise because of differing opinions on what biologically realistic is, and (relatedly) what the correct balance is between biological detail and mathematical abstraction. Others arise because they make optimization problems on networks easier to solve. This should not automatically be viewed as mathematics prescribing reality to biology, but rather as the observation that if evolution looks like this then certain optimization problems can be solved well. Finally, some differences are superficial; they do not lead to any intrinsic differences in the model or its underlying mathematical structure. Of course,superficialis highly context dependent, as some of the examples below will show.

Here we list some well-known and less-well known properties that have surfaced in definitions of rooted phylogenetic network. We will take as given that evolution is directed i.e. that the arcs in the network have an explicit orientation (representing the flow of time). Below most properties we show a figure of a network violating it.

The question to our readers, particularly those from the biological side of the spectrum: which of the following properties make sense? And which other restrictions would make more sense biologically?

Rooted
A root is a node with indegree 0, meaning that it does not have any ancestors. If evolution is assumed to be acyclic (see below) then there will always be at least one root. Most articles writing about rooted phylogenetic networks assume a single root (which is necessarily an ancestor of every node in the network). Some time ago on this blog David raised the question of whether it would not sometimes be better to allow multiple roots. This is an interesting point both from the perspective of interpretation (what does it mean?) and its impact on algorithmic efficiency.

Acyclic
Most models assume acyclicity: it is not possible to walk along the arcs of the network (respecting their orientation) such that you end up back where you started. The most intuitive argument for this is the passing of time: if arcs represent forward motion through evolutionary time, how can you end up back at an earlier point in time? Recently someone pointed out to me that in the reconciliation literaturewhere one shows how to reconcile a given gene tree with a given species treeacyclicity is actually not sacred at all. The reason for this is that, without the acyclicity constraint, the problem becomes computationally tractable. See this recent RECOMB 2013 article [3] for an example of this.

Time-consistent
This is an interesting property. It was introduced to prevent reticulation events between non-contemporaneous taxa. That is, to avoid absurd situations such as an organism hybridizing with its ancestor. The most common mathematical articulation of time-consistency is this: it should be possible to put atime-stampon each node of the network such that (a) time always moves strictly forward along tree-edges , and (b) all the nodes involved in a reticulation event have the same time-stamp. The figure here shows a network that is not time-consistent. For many contextualisations ofreticulation eventtime-consistency seems to make a lot of sense. But there is a catch. A network might fail to be time-consistent only for the rather artificial reason that we failed to sample a taxon that was, in fact, part of the network (incomplete taxon sampling). Given the reality of incomplete data, demanding time-consistency might be too prohibitive. However, as with all restrictions in this blog, it is perhaps useful as a selection criterion for preferring one network over another.

Figure 1: A network that is not time-consistent. The red and blue node are the two parents of a reticulation node but cannot have coexisted in time.


Degree-restricted
The indegree of a node is the number of parents of it, and the outdegree of a node is the number of children of it. In a bifurcating, rooted phylogenetic tree all nodes have indegree-1 (except the root: indegree-0) and outdegree-2. Polytomies have outdegree-3 or higher.  In a rooted phylogenetic network we also have reticulation nodes i.e. nodes with indegree-2 or higher. Articles differ in the degree-restrictions they place on nodes; there is an entire zoo of different permutations possible. Many articles agree that nodes with indegree-1 and outdegree-1 should not be allowed, because they are phylogenetically uninformative (and indeed such nodes are also rarely encountered in the phylogenetic tree literature). But what about polytomies? And what about reticulation nodes: should they be permitted to have indegree-3 or higher, and if so how should such “reticulate polytomies” be interpreted? From a parsimony perspective it is usual to argue that reticulate polytomies should be more “expensive” than indegree-2 reticulations, because a single reticulation polytomy can explain far more discordance than a single indegree-2 reticulation. Interestingly, some optimization problems are not really affected at all by degree-restrictions on phylogenetic networks (e.g. the hybridization number problem) while others are highly sensitive to degree-restrictions (e.g. the small parsimony problem).

Tree-child
Every node has at least one non-reticulate child. This restriction makes sure that there are no "invisible nodes", i.e. nodes from which all paths end up in reticulations. As a result, this restriction makes many computational problems more tractable and mathematical reasoning easier. For example, consider the basic Tree Containment problem, i.e. given a phylogenetic network and a phylogenetic tree, decide if the network displays the tree, see [8]. This problem was shown to be tractable for tree-child networks, while it is NP-hard for most other classes of networks (time-consistent, tree-sibling, regular). This seems important because if it is already hard to tell if a given tree is in a given network, then it seems  a daunting task to try to build networks of that class from any kind of data.

It should be noted that there is of course no guarantee that "real" evolutionary histories are tree-child. In fact, simulation studies show that only under very low recombination rates one can expect tree-child networks [1][2].

Figure 2: A network that is not tree-child. The red node does not have a non-reticulate child.

Tree-sibling
Every reticulate node has at least one non-reticulate sibling. Originally introduced by Cardona, Llabrés, Rosselló and Valiente who write "Biologically, this condition means that for each of the reticulation events, at least one of the species involved in it has also some descendant through mutation" and showed that networks can efficiently be compared (with a polynomial-time computable metric) if they are both tree-sibling and time-consistent.

An advantage of the class of tree-sibling networks is that it is much larger than the class of tree-child networks. If a reticulation has no non-reticulate siblings, then its parents have no non-reticulate children. Hence, every tree-child network is tree-sibling. However, it can easily be seen that there are many tree-sibling networks that are not tree-child. Computationally, the tree-sibling restriction does not seem to help as much as the tree-child restriction. Again, there is no guarantee that the "real" network is tree-sibling, but it might be more likely, see [1][2].

Figure 3: A network that is not tree-sibling (the red node does not have a non-reticulate sibling).

Reticulation-visible
For every reticulation node there exists a leaf such that all paths from the root to the leaf pass through that reticulation node. This is another attempt to weaken the tree-child restriction, thus obtaining a larger class of networks for which many computational problems are still tractable. It is incomparable with tree-sibling (a reticulation-visible network does not have to be tree-sibling and vice versa) but this class clearly contains the class of tree-child networks, and in fact is much larger. It does not forbid all invisible nodes, but just invisible reticulation nodes. It has been shown in the book by Huson, Rupp and Scornavacca that the so-called Cluster Containment problem becomes tractable for reticulation-visible networks. If the same is the case for the Tree Containment problem is still an open question.

Figure 4: A network that is not reticulation-visible. The red reticulation node is not "visible".

Galled trees
All reticulation cycles are disjoint. Introduced by  Gusfield, Eddhu and Langley [6] although studied before under different names. Makes computational problems much easier but seems biologically unrealistic. On the other hand, galled trees could have a future in the context of data-display networks, by using the galls to show where the reticulate activity is in the network, rather than claiming that each gall represents exactly one reticulation event.

Galled networks
Each arc leaving a reticulation node is a cut-arc. Introduced by Huson and Klöpper [7] (who gave a different but equivalent definition) as a generalization of galled trees, giving a fast algorithm for constructing galled networks from clusters. Hasn't been studied much since. (Be aware that there is also an article using "galled network" as an alternative term for galled tree.)

Figure 5: A network that is not a galled network. The red arc leaves a reticulation node but is not a cut-arc.

Level-k (not a restriction, but a measurement)
Also a generalization of galled trees (which are basically level-1 networks). Every network is a level-k network for some k. Hence, “level-k” should not really be viewed as a topological restriction, but rather as a measure of how “tangled” (intensely concentrated) the islands of reticulation are in the network. The higher k is, the more tangled the network is; level-0 networks are simply trees. For more information, see a previous blog. Other proposed measurements of tangledness include k-nested, r-reticulation and depth-k.

Regular and normal
If you see a phylogenetic network as a representation of a set of clusters, then it makes sense to consider regular networks (introduced by Baroni, Semple and Steel). A network is regular if it is the so-called "cover digraph" (Hesse diagram) of its set of clusters. Hence, for each set of clusters, there exists a unique regular network with precisely that set of (hardwired) clusters. Normal networks have the additional requirement of being tree-child, thus forming a very restricted class of networks. For example, the network in Figure 1 is not regular and hence also not normal.

DAG or tree-with-edges-added?
This is a rather subtle one. If one views a phylogenetic network as a tree with edges added, then this leads to a different space of networks than the “a phylogenetic network is essentially a directed acyclic graph” definition encountered in other articles. The point being that if you insist that a network has to be “grown” from some tree starting point (by adding edges in a certain way), certain topologies cannot be reached which can be reached if the we do not anchor it in this way. The following figure shows an example.

Figure 6: A network that cannot be obtained by adding edges to a tree (for common edge-adding rules). (Figure corrected on May 5, 2014.)

It can be shown that any tree-sibling network can be constructed by adding edges to a tree, but the network in Figure 3 shows that the converse is not true (the shown network can be constructed by adding edges to a tree, but is not tree-sibling). 

This restriction touches on the fundamental question whether there exists something like a species tree, and if it might be possible to reconstruct this species tree before starting network analysis.


A related but stronger (?) restriction was recently used by Wu [9]. In his RECOMB 2013 article, he writes “(R1) For a network N, when only one of the incoming edges of each reticulation node is kept and the other is deleted, we always derive a tree T'.”

Conclusion
We see from this list that there are already some quite different topological properties and restrictions in circulation for rooted phylogenetic networks. To biologists these discussions might appear to be a strange side-show to keep computer scientists in work. But it runs deeper than that, because it touches on three fundamental issues. Firstly, what are we trying to model exactly? Secondly, the importance of understanding the networks that your favourite software for constructing rooted phylogenetic networks will not build, however biologically relevant, due to the fact that they are a priori excluded from its search space. Finally, since the total number of networks is huge, it could be inevitable to focus on certain restricted classes of networks when one wants to search through network-space efficiently.

Note: There is a follow-up post Topological restrictions: some comments.

REFERENCES

[1] Miguel Arenas, Mateus Patricio, David Posada and Gabriel Valiente. Characterization of Phylogenetic Networks with NetTest. In BMCB, Vol. 11:268, 2010. 

[2] Miguel Arenas, Gabriel Valiente and David Posada, Characterization of Reticulate Networks Based on the Coalescent with Recombination, Mol Biol Evol (2008) 25 (12):2517-2520.

[3] Mukul S. Bansal, Eric J. Alm, Manolis Kellis, Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss, RECOMB 2013.

[4] Gabriel Cardona, Merce Llabres, Francesc Rossello, Gabriel Valiente, The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete, arXiv:0902.4640 [q-bio.PE], 2009.

[5] Gabriel Cardona, Mercè Llabrés, Francesc Rosselló and Gabriel Valiente. A Distance Metric for a Class of Tree-Sibling Phylogenetic Networks. In BIO, Vol. 24(13):1481-1488, 2008.

[6] Dan Gusfield, Satish Eddhu and Charles Langley. Efficient reconstruction of phylogenetic networks with constrained recombination. In CSB03, Pages 363-374, 2003.

[7] Daniel H. Huson and Tobias Klöpper. Beyond Galled Trees - Decomposition and Computation of Galled Networks. In RECOMB 2007, Vol. 4453:211-225 of LNCS.

[8] Leo van Iersel, Charles Semple and Mike Steel, Locating a Tree in a Phylogenetic Network, Information Processing Letters, 110 (23), pp. 1037-1043 (2010).

[9] Yufeng Wu. An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees. RECOMB 2013.