Wednesday, August 31, 2016

Network thinking in phylogeography?

This blog has, of course, long championed the importance of network models in phylogenetics. Slowly, very slowly, the rest of the world is catching up.

Apparently, the world of phylogeography has now woken up:
Scott V. Edwards, Sally Potter, C. Jonathan Schmitt, Jason G. Bragg and Craig Moritz (2016) Reticulation, divergence, and the phylogeography–phylogenetics continuum. Proceedings of the National Academy of Sciences of the USA 113: 8025-8032.
Phylogeography was conceived as some sort of connection between population biology and phylogenetics. It has always seemed odd that the tree model has been used in phylogeography at all, because there is no a priori reason to expect within-species phylogenetic patterns to be tree-like. Indeed, inter-breeding seems to suggest quite the opposite. Nevertheless, phylogeographic studies are full of trees.

But apparently no more. To quote the authors:
As phylogeography moves into the era of next-generation sequencing, the specter of reticulation at several levels — within loci and genomes in the form of recombination and across populations and species in the form of introgression — has raised its head with a prominence even greater than glimpsed during the nuclear gene PCR era ... We discuss a variety of forces generating reticulate patterns in phylogeography, including introgression, contact zones, and the potential selection-driven outliers on next-generation molecular markers. We emphasize the continued need for demographic models incorporating reticulation at the level of genomes and populations ...
That phylogeography sits centrally in this process-oriented space emphasizes the importance of understanding interactions between reticulation (gene flow / introgression and recombination), drift, and protracted isolation. This combination of processes sets phylogeography apart from traditional population genetics and phylogenetics.
Scanning entire genomes of closely related organisms has unleashed a level of heterogeneity of signals that was largely of theoretical interest in the PCR era. This genomic heterogeneity is profoundly influencing our basic concepts of phylogeography and phylogenetics, and indeed our views of speciation processes. It is now routine to encounter a diversity of gene trees across the genome that is often as large as the number of loci surveyed.
The new genome-scale analyses are causing evolutionary biologists to reevaluate the very nature of species, which, in some cases, appear to maintain phenotypic distinctiveness despite extensive gene flow across most of the genome, and to recognize introgression as an important source of adaptive traits in a variety of study systems.
The role of horizontal gene flow in speciation and phylogeography, particularly for animal taxa, has long been championed by Michael L. Arnold (see the references). However, the authors ignore this literature, and claim that this is a recent insight, instead. They also mention only in passing the extensive genomics literature on human introgression, where it is called "admixture". Indeed, they mention only a data-analysis technique, rather than the biological insights that have arisen. It is still disappointing just how little information-connection there is between different fields of biology.

Finally, the authors manage to mention the word "network" only three times in the whole paper. Their key word is "reticulation", instead, in the sense that a phylogeny is a tree with reticulation, rather than any other form of network. So, they are still only one step away from tree-thinking, and at least one step from true network-thinking.

In the context of trees versus networks, the authors mention so-called "species tree" methods based on the multispecies coalescent, which try to account for incomplete lineage sorting in genome studies (see also Edwards et al. 2016). Unfortunately, these have recently been shown to be inconsistent in the presence of gene flow (Solís-Lemus et al. 2016), thus emphasizing the need for proper network methods.


Arnold ML (1997) Natural Hybridization and Evolution. Oxford University Press.

Arnold ML (2006) Evolution Through Genetic Exchange. Oxford University Press.

Arnold ML (2009) Reticulate Evolution and Humans – Origins and Ecology. Oxford University Press.

Arnold ML (2016) Divergence With Genetic Exchange. Oxford University Press.

Edwards SV, Xi Z, Janke A, Faircloth BC, McCormack JE, Glenn TC, Zhong B, Wu S, Lemmon EM, Lemmon AR, Leaché AD, Liu L, Davis CC (2016) Implementing and testing the multispecies coalescent model: a valuable paradigm for phylogenomics. Molecular Phylogenetics & Evolution 94: 447-462.

Solís-Lemus C, Yang M, Ané C (2016) Inconsistency of species tree methods under gene flow. Systematic Biology 65: 843–851.

Thursday, August 25, 2016

More on analogies between biological and linguistic evolution

Analogies between biological and linguistic evolution have been discussed before on this blog. Last month, I asked whether biologists could learn from linguists; a bit earlier, I proposed to distinguish fruitful from unfruitful analogies; and David has written a very long and interesting blog post on false analogies between anthropology and biology.

In contrast to the discussion of similarities in many published articles, most of these posts were rather sceptical and reserved, emphasizing the importance of being extremely careful when using analogies to justify methodological transfer across disciplines. Despite this general scepticism, which I have expressed myself, I am still convinced that methodological transfer can be fruitful when methods are carefully adapted to the needs of the target discipline, and we know that this has been done in both directions between biology and linguistics.

Apart from the problem of adapting methods from other disciplines, one important question is how to identify fruitful analogies in the first place. As a visiting post-doc in the bioinformatics research group Adaptation, Integration, Reticulation and Evolution, led by Eric Bapteste and Philippe Lopez (UPMC Paris), I have discussed this question a lot during the past one and a half years.

We came up with the idea that it might be useful to restrict the range of potential analogies one might draw between biology and linguistics by concentrating on analogies between processes. Taking processes, rather than research objects, as a starting point comes closer to general approaches to analogy, which usually claim that the core of analogy is similarities of functions (Gentner 1983). By applying this principle to compare aspects of linguistic and biological evolution, we were able to identify some potentially fruitful analogies that could lead to novel approaches, not only in linguistics but potentially also in biology.

Among these are specific processes of divergence (like incomplete lineage sorting in biology, which is very similar to dialect chain dissolution in linguistics), specific introgressive processes (like protein assembly, which shows some striking similarities with word formation), and specific systemic processes (like constructive neutral evolution in biology, providing an explanation for convergent evolution in languages resulting from common descent, also called drift or Sapir's drift). On the other hand, we also found that many processes are most likely to be unique to one of the disciplines, including such processes as sound change in linguistics and natural selection in biology.

These reflections have been summarized in a paper titled "Unity and disunity in evolutionary sciences" which was published at the beginning of this week (List et al. 2016, PDF here). I will not go into further detail of the specific new analogies we proposed, but instead recommend those who are interested in the issue to read our paper (and potentially discuss the issue of analogies further with us).

Since the identification of potentially fruitful, new analogies between biology and linguistics is just a starting point for a closer investigation of the suitability of methodological transfer in practice, I am quite optimistic that I will follow up on the new analogies mentioned above in more detailed future blog posts.

  • Gentner D. Structure-mapping: A theoretical framework for analogy. Cogn Sci. 1983; 7: 155–70.
  • List, JM, JS Pathmanathan, P Lopez and E Bapteste. Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct. 2016; 11.39.

Tuesday, August 16, 2016

Networks of music history

Networks are currently popular in studies of music. However, they tend to be unrooted similarity networks, showing some form of alleged commonality among artists or their music, as shown in the first graph. This example displays phenotypic similarity among the named artists, although how the "similarity" is measured is not always clear (the post on The Music Genome Project is no such thing briefly discusses this).

[Note: For an alternative approach, Glenn McDonald's Every Noise at Once has a two-dimensional scatter-plot of 1,491 music genres.]

Of more interest to us is the use of a network to study the historical development of music genres, for which we need a rooted network. Clearly, music history will be reticulate rather than tree-like, given the obvious transfers of musical modes between and within cultures, and even the possible resurrection of earlier styles at a later time and even place. A similar argument applies to musical instruments, of course (see Cornets: from a tree to a network; Guitars and networks).

Music networks appear in a previous post, on Reconstructing ancestors in a splits network. That post discusses the paper by J. Miguel Díaz-Báñez, Giovanna Farigu, Francisco Gómez, David Rappaport & Godfried T. Toussaint (2004) El Compás flamenco: a phylogenetic analysis. Proceedings of BRIDGES Conference: Mathematical Connections in Art, Music and Science, pp. 61-70.

The authors provide an analysis of the hand-clapping patterns of the flamenco music of Andalucia, in southern Spain. There are four recognized patterns, plus the fandango pattern, and the authors use two different distance measures to assess their rhythmic similarities. They produce unrooted phylogenetic networks based on each of these distances, using NeighborNet, one of which is shown in the second graph.

The authors ignore the fact that "it is well established that the fountain of flamenco music is the fandango", which would make the fandango the outgroup for rooting if we did wish to treat the networks as rooted. Instead, they try to "reconstruct the 'ancestral' rhythms corresponding to the nodes" by using mid-point rooting. This is a tricky business for networks, because there are multiple paths through the graph, and so the mid-point is not necessarily unique.

A similar NeighborNet analysis had previously been provided by Godfried Toussaint (2003) Classification and phylogenetic analysis of African ternary rhythm timelines. Proceedings of BRIDGES Conference: Mathematical Connections in Art, Music and Science, pp. 25-36. This involved an analysis of the 12/8 time bell rhythms in African and Afro-American music. The distances were based on "measures of rhythmic oddity and off-beatness" (this is briefly discussed in Hunting for rhythm’s DNA).

Very few people seem to be interested in producing rooted phylogenetic diagrams directly, except when their model is a tree rather than a network. Perhaps the most ambitious of these is by Victor Grauer (2011) Sounding the Depths: Tradition and the Voices of History. This is available as a paperback or for Kindle. The audio-visual examples are available as a blog page, as are the figures.

His tree is shown in the next graph, including the characters on which it is based. Note that group B3. "Social Unison" is associated with a historical bottleneck, so that the prior history appears to be uncertain.

Finally, not everyone agrees about the importance of the obvious reticulation patterns in music history, notably Sylvie Le Bomin, Guillaume Lecointre, Evelyne Heyer (2016) The evolution of musical diversity: the key role of vertical transmission. PLoS One 11: e0151570. These authors study the music of groups of farmer and hunter-gatherer Bantu and Ubanguian speakers from Gabon, in western Africa. Their music characters are from three groups: repertoire (set of pieces including circumstance and social or symbolic implicit information), performative (polyphonic process, form, instruments and vocal techniques), and intrinsic (metrics, rhythm and melodic).

The authors present a rooted phylogenetic tree, but there is also a "filtered" NeighborNet tucked away in an appendix. It seems to contradict any claim for the data being particularly tree-like.

Finally, to return to where I started, you could take a look at Musicmap, which allegedly covers The Genealogy and History of Popular Music Genres from Origin till Present (1870-2016). To quote from the info:
Musicmap attempts to provide the ultimate genealogy of popular music genres, including their relations and history. It is the result of more than seven years of research with over 200 listed sources and cross examination of many other visual genealogies. Its aim is to focus on the delicate balance between comprehensibility, accuracy and accessibility.

You need to zoom in a long way to appreciate the complexity of the network, covering 230 music genres. There is nominally a timeline from top to bottom (starting in 1870), although the network connections are not strictly time-consistent. As the (mostly Belgian) creators (led by Kwinten Crauwels) note:
The ideal genealogy is not only complete and correct, but also easy to understand despite its complexity. This is a utopian balance that can never be achieved but only approached. By choosing the right amount of genres, determining forms of hierarchy and analogy and ordering everything in a logical but authentic manner, a satisfactory balance can be obtained ... Musicmap is a platform in search for the perfect balance of popular music genres to provide a powerful tool for educational means or a complementary framework in the field of music metadata and automatic taxonomy.

Tuesday, August 9, 2016

Network of Linné's "Philosophia Botanica" editions

Carl von Linné's book Philosophia Botanica (1751) was arranged as a series of botanical aphorisms, expanded over the previous 15 years from when he first developed them. During those years, he settled on binomial nomenclature as his preferred naming system, and he presents this in Philosophia Botanica, so that the book has considerable historical interest for biologists.

Recently, János Podani and András Szilágyi (History and Philosophy of the Life Sciences, in press) have pointed out a basic inconsistency in this book, relating to Linné's calculation of how many possible plant genera there could be, given the morphological features he used to distinguish among them.

Linné did not do a good job with this calculation, as these authors show. Indeed, the correct calculation is far more complex than Linné realized, but even given his simplifications his arithmetic is faulty. There are basic inconsistencies among the aphorisms, where the numbers do not "add up" when some of the aphorisms are compared. In essence, 31 plant parts are defined in one aphorism but this becomes "n=38" in a later aphorism; and then 4n² is claimed to be "5736" rather than 5776.
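The arithmetic is easy to check. Here is a minimal sketch in Python, using only the numbers quoted above (n = 31 or n = 38, and the expression 4n²); the function name is my own invention.

```python
def four_n_squared(n: int) -> int:
    """Return 4 * n^2, the quantity discussed in the aphorisms."""
    return 4 * n * n

# n = 38 is the value stated in the later aphorism; n = 31 is the number
# of plant parts actually defined in the earlier one.
value_for_38 = four_n_squared(38)
value_for_31 = four_n_squared(31)
print(value_for_38, value_for_31)
```

Neither choice of n yields the printed figure of 5736.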

This then raises the issue of how this error was treated in subsequent editions of Philosophia Botanica. Podani and Szilágyi trace the error through 14 subsequent editions, showing that the various editors of those editions dealt with the issue in different ways. The history of these editions can be represented as a phylogenetic diagram, which the authors also provide.

This history turns out to be a network, because some of the later editions were compiled from several earlier editions. The network is rooted at the bottom, and each network edge is implicitly directed away from the root. The book editions are named using their place and time of publication.
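What makes such a history a network rather than a tree can be sketched in a few lines of Python. The edition labels below are placeholders of my own, not the real editions; the only point is that an edition compiled from several earlier editions has more than one parent.

```python
# Hypothetical edition graph: each key is an edition, and each value lists
# the earlier editions it was compiled from. The labels are placeholders,
# not the editions traced by Podani and Szilagyi.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # compiled from two earlier editions
    "D": ["C"],
}

def reticulations(parent_map):
    """Return the nodes with more than one parent, sorted by name."""
    return sorted(node for node, ps in parent_map.items() if len(ps) > 1)

def is_tree(parent_map):
    """A rooted history can be a tree only if no node has several parents."""
    return not reticulations(parent_map)

print(reticulations(parents), is_tree(parents))
```

In this toy example, edition "C" is the single reticulation node, so the history is not a tree.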

Note that one particular "solution" to the arithmetical issue arises independently in three separate editions of the book. That is, the three editions on the network's right independently correct the 4n² problem but do not correct the 31=38 problem.

Also, note that no editions since 1787 actually correct both errors (i.e. they show both n=31 and 4n² = 3844). Recent editions are reprints of the original erroneous version.

Tuesday, August 2, 2016

A century of French wine vintages

It has been quite some time since I have produced a network-based exploratory data analysis (EDA) of some multivariate dataset, so it could be time to do so again.

In the wine industry, it is common to provide quality scores for the different vintages from particular wine-producing regions. These so-called vintage charts are intended to tell us how the harvest quality has varied from vintage to vintage. They are often disparaged, because they simplify the complexities of each harvest (where there can be considerable spatial variation) down into a single number. They also make little sense if a single number is applied to a very large area, which often occurs in practice.

Nevertheless, they can be an interesting and informative guide to the general features of each vintage, especially if they cover a long period of time.

My interest in this concept comes from the fact that I have recently started a blog about wine: The Wine Gourd. In the interests of doing something different to every other wine blogger, this blog delves into the world of wine data, instead of the usual reviews of recently released wines. The intention is to ferret out some of the interesting stuff, and to bring it out into the light, for everyone to see. Hopefully, this will be both interesting and informative.

French wine vintages

The Cavus Vinifera web site has produced vintage charts for several of the wine-producing regions of France, from the year 1900 to the present. This is very unusual, as most vintage charts cover a much shorter period of time. This circumstance thus provides the opportunity to compare these French regions over the past century, to investigate to what extent vintage variation is correlated among these areas.

Each vintage from 1900-2014 has been rated on a scale of 0-20. The regions and wines covered by the entire time span are:
   Région de Bordeaux (rouge)
   Région de Bordeaux (blanc)
   Région de Bordeaux (liquoreux)
   Région de la Bourgogne (rouge)
   Région de la Bourgogne (blanc)
   Région du Rhône (Nord)
   Région du Rhône (Sud)
   Région du Loire (rouge)
   Région de la Champagne
   Région du Beaujolais

As usual, we can use a phylogenetic network to visualize these data, with the network being used as a form of exploratory data analysis. I first used the Manhattan distance to calculate the dissimilarity of the different years and regions, based on the quality scores. This was followed by a neighbor-net analysis to display the between-region and the between-year similarities as two phylogenetic networks.
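For those unfamiliar with it, the Manhattan distance is simply the sum of the absolute differences between two sets of scores. Here is a minimal sketch in Python; the scores are invented for illustration and are not taken from the vintage charts, and the neighbor-net step itself is not shown.

```python
from itertools import combinations

# Hypothetical scores (0-20) for three regions over four vintages;
# these numbers are invented for illustration only.
scores = {
    "Bordeaux (rouge)": [12, 15, 8, 18],
    "Bourgogne (rouge)": [10, 14, 9, 16],
    "Champagne": [5, 17, 12, 11],
}

def manhattan(a, b):
    """Sum of absolute differences between two equal-length score lists."""
    if len(a) != len(b):
        raise ValueError("score lists must cover the same vintages")
    return sum(abs(x - y) for x, y in zip(a, b))

# Pairwise distances between regions; transposing the table (one list per
# year) would give the between-year distances in the same way.
distances = {
    (r1, r2): manhattan(scores[r1], scores[r2])
    for r1, r2 in combinations(scores, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

The resulting distance matrix is what would then be passed to a neighbor-net program.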

The network for the ten regions is shown in the first graph. Regions that are closely connected in the network are similar to each other based on the variation in their vintage quality scores through time, and those that are further apart are progressively more different from each other.

Not unexpectedly, the different wines from the same regions form neighborhoods: the three wine types from Bordeaux (in south-western France); the three wines from Burgundy and Beaujolais (along the Saône River in eastern France); and the two wines from the Rhône River (in the south-east). However, unexpectedly, the Loire wine, from western France, is associated with the Rhône wines, while the Champagne region, in northern France, is somewhat isolated.

The network for the 115 years is shown in the second graph. In this case, years that are closely connected in the network are similar to each other based on the vintage quality scores averaged across all of the regions, and those that are further apart are progressively more different from each other.

Here, the years form a gradient from the poorest-quality years, at the top, to the best-quality vintages at the bottom. Only four of the vintages are labeled, but the vintages at the top of the network include 1902, 1910, 1913, 1930, 1931, and 1968. The vintages at the bottom of the graph include: 1929, 1945 and 1947, followed by 1928, 1949, 1989 and 1990, and then 1906, 1953, 1959, 1961 and 2005.

Note that the 1930s were generally not a good time for wine-making in France, nor were the 1910s (although 1906 was an early-century exception). The 1940s and 1950s, on the other hand, were generally good times for wine production.

The 1910 vintage stands out as particularly poor, with none of the regions scoring more than 10 out of 20 for their grape harvest, and both Burgundy wines scoring 0. This contrasts with the best years, where no region scored less than 16 out of 20.

Needless to say, the years stacked in the middle of the graph were variable, with some regions having a good time in a particular year and some having a bad time in that same year. This is the normal state of affairs.