The Genealogical World of Phylogenetic Networks: July 2013

Wednesday, July 31, 2013

Trends in Genetics: The Future of Phylogenetic Networks

A couple of weeks ago I reported on those journal covers that I know illustrate phylogenetic networks. I am happy to report that networks have now also made it onto the cover of Volume 29 Issue 8 of Trends in Genetics. The cover illustration combines the traditional tree metaphor for phylogenetics with the new metaphor of a network.

The cover story is the review article by Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James McInerney, David Morrison, Luay Nakhleh, Mike Steel, Leen Stougie and James Whitfield: Networks: expanding evolutionary thinking, on pages 439-441.

The article is one of the tangible outcomes of the workshop last October, at the Lorentz Center in The Netherlands: The Future of Phylogenetic Networks. The workshop participants agreed that we should be active in promoting the use of networks for evolutionary analyses, and this article, written by a group of biologists and computational biologists, seeks to do just that.

There will be further outcomes of the workshop, including follow-up meetings at the same venue.

Monday, July 29, 2013

A network analysis of London's theatres in 1965

In the English-speaking world there is a long-standing rumour that the pinnacle of live theatre involves performances in the theatres near Broadway, in New York, and those in the so-called West End of London. This seems unlikely, on the whole, and a number of other locations justifiably claim at least equality. Nevertheless, there are certainly a lot of theatres in both locations, and there have been for a long time. So, there must have been some good performances over the centuries, and plenty of good performers.

What we are interested in here, though, is the physical characteristics of the theatres in London in the mid 1960s, not the activities therein. The data to be analyzed come from this book:

Who's Who in the Theatre: a Biographical Record of the Contemporary Stage. Fourteenth (and Jubilee) edition, 1967. Edited by Freda Gaye. Isaac Pitman & Sons, London.

This book is a motley collection, published periodically from 1912 (1st edition) to 1981 (17th edition, in 2 volumes). It was originally compiled by John Parker, but was then edited by various other people from 1961 onwards. It is, to quote Noel Coward, "not only a valuable reference book but a real treasury for the stage enthusiast."

Most of the book is precisely what the title suggests, a series of biographies of people associated with the theatre, mostly in Britain. They seem to be arbitrarily chosen — for example, Jonathan Miller is there but not Alan Bennett, Peter Cook or Dudley Moore; Spike Milligan and Harry Secombe are there but not Michael Bentine or Peter Sellers; Norman Wisdom is there but not Eric Morecambe or Ernie Wise.

The rest of the book is an assortment of anything and everything that took the editor's fancy. For our purposes, the interesting information is on a foldout page opposite page 1554. It is entitled "Working Dimensions of London Theatres, 1965". It is in the section labelled "Opening of Existing London and Suburban Theatres", from which the theatre opening dates quoted below are taken.

In case you are wondering why this information is included in the book at all, I will quote Malcolm Pride: "The Table ... is immensely valuable to the designer, both in preparing a new production, and in transferring an existing show from one theatre to another."

The data

The book's data table refers to the dimensions of the performance spaces, rather than to the external size of the buildings. It lists data for the following characteristics:

Width of proscenium opening
Height of proscenium opening
Depth from proscenium wing to back wall
Distance between side walls
Distance between fly rails and girders
Height from stage to grid
Depth from under fly platform to stage
Depth under stage
Height to take cloths up out of sight
Approximate seating capacity

The names I have used for the theatres are exactly as listed in the book (not their current names).

One of the listed theatres could not be included in my analysis: the Mermaid Theatre (which is no longer in use) had an open stage. Also, I added two theatres from the previous edition of the book (the data from the 1961 edition can be seen online here): the Streatham Hill Theatre (now closed), and the Royalty Theatre (now called the Peacock Theatre). For the curious, there are some current West End theatres that were not included in the book, because they were not used for live theatre at the time: the Dominion, Lyceum, New London, Playhouse, and Prince Edward theatres.

For the network analysis, I normalized the data within each characteristic, and I then calculated the similarity of the theatres using the Manhattan distance. A Neighbor-net analysis was then used to display the between-theatre similarities as a phylogenetic network. So, theatres that are closely connected in the network are similar to each other based on their characteristics, and those that are further apart are progressively more different from each other.
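As a rough illustration of the first two steps (this is not the code used for the analysis), the sketch below rescales each characteristic and then computes Manhattan distances between theatres. The text does not say how the data were normalized, so min-max scaling to the range 0 to 1 is an assumption here, and the three "theatres" and their measurements are made up. The Neighbor-net step itself is not reproduced.

```python
def normalize_columns(rows):
    """Rescale each column (characteristic) to the range 0..1.

    Min-max scaling is an assumption; the post only says the data were
    normalized within each characteristic.
    """
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no information, so map it to zeros.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_columns)]


def manhattan(a, b):
    """Sum of absolute differences across all characteristics."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(rows):
    """Pairwise Manhattan distances between the normalized rows."""
    scaled = normalize_columns(rows)
    return [[manhattan(p, q) for q in scaled] for p in scaled]


# Hypothetical measurements: (proscenium width, proscenium height, seats)
theatres = {
    "Large": (50.0, 30.0, 2500),
    "Medium": (35.0, 25.0, 1100),
    "Small": (20.0, 20.0, 400),
}
names = list(theatres)
dist = distance_matrix(list(theatres.values()))
```

Without the rescaling, the seating capacity (in the thousands) would swamp the stage measurements, which is the reason for normalizing before summing the differences. The resulting matrix is the kind of input that a Neighbor-net program expects.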

The analysis

As anticipated, the network basically shows the differences in size between the theatre performance spaces, with the largest theatres at the top-right and the smallest at the bottom-left.

The London Coliseum was built as a variety theatre in 1904, although it now houses the English National Opera Company, who recently restored it. It was specifically planned to be the biggest theatre in London, which it still is. The biggest theatre at the time of building was the Theatre Royal, Drury Lane (built in 1812), which is still the second largest.

Drury Lane has a history as long as your arm, especially if you include the three previous theatres built on the same site (dating back to 1663). The current building was remodelled in 1922; and only a couple of months ago it was refurbished. It is currently owned by Really Useful Theatres (which is wholly owned by Andrew Lloyd Webber), which also operates the London Palladium, Her Majesty's Theatre, the Cambridge Theatre, New London Theatre (opened 1973, and thus not in the network), and the Adelphi Theatre. However, for people of my generation Drury Lane is most famous for being the scene of the comedy album Monty Python Live at Drury Lane. This show was originally intended to be 6 weeks at the much smaller Comedy Theatre (now the Harold Pinter Theatre) but ended up being 4 weeks at Drury Lane in February-March 1974, which turned it into what Eric Idle called "a total rock 'n' roll audience" for the group's first foray into London's West End.

The Royal Opera House, Covent Garden had two previous theatres built on the same site (dating back to 1732), but the current building dates from 1858. Since then, the auditorium has been reconstructed, increasing the audience size, and the building itself enlarged; and it has also been recently refurbished, funded by a lottery (an idea first used in 1957 to fund the building of the Sydney Opera House).

The Streatham Hill Theatre opened in 1929 and closed for theatrical use in 1967. It is apparently now used as a bingo hall. It is one of only six theatres in the network not actually in the West End, and it apparently suffered from being a large theatre built too far away from "the action". It held a bigger audience (2,600) than the Coliseum (2,500), Drury Lane (2,283) and Covent Garden (2,128) theatres, although its stage size was smaller than these other venues. Indeed, the London Palladium (2,338) and Golder's Green Hippodrome (2,200) also had large audiences but smaller stage sizes. The Golder's Green theatre is another non-West End theatre, opening in 1913 and closing as a live theatre in 1968. It is now a Christian centre.

As an aside, and by way of contrast, the three biggest Broadway theatres in the 1960s were:
the Metropolitan Opera (audience 3,800) and the State Theatre (2,729), both in the Lincoln Center Plaza, followed by City Center (2,935), where Monty Python performed in April-May 1976 (and recorded the album Monty Python Live! at City Center), and where the "pop concert" atmosphere was apparently even more marked than during their Drury Lane season.

To return to the London theatres, the Saville Theatre was a medium-sized (audience 1,067) theatre in the West End (built 1931) that has closed (in 1970). It is now the Odeon Covent Garden, a four-screen cinema complex. The Scala Theatre (1,141), on the other hand, was demolished in 1969, having existed in the West End since 1905 (although the site had hosted entertainment since 1772). Its most recent claim to fame was as the concert venue used for the Beatles' film A Hard Day's Night. While the Scala was replaced by an office building, the smaller Westminster Theatre (audience 618) was a West End theatre actually replaced by another theatre (in 2012), the St James Theatre. The Westminster apparently originally opened just the day before the Saville Theatre.

The Sadler's Wells Theatre is second only to Drury Lane as the oldest continuous entertainment location in London, dating from 1683, although the first of its four theatres was erected in 1733. The building that appears in the network (the third theatre) was built in 1931 and demolished in 1996. The New Sadler's Wells Theatre (and also the Lilian Baylis Studio Theatre) was completed in 1998.

At the small end of the network, several of the theatres had audiences of less than 500, including the Duchess Theatre (491), Ambassadors' Theatre (453), Royal Court Theatre (439), Fortune Theatre (438), New Arts Theatre (339), and May Fair Theatre (310). The Westminster Theatre, Criterion Theatre (607), and New Lyric Theatre, Hammersmith (750) had larger audiences, but they are grouped with these theatres in the network because of their small stage sizes. The smallest, the May Fair Theatre, is now the screening room of the May Fair Hotel conference venue.

The small Ambassadors' Theatre is famous as the original venue for Agatha Christie's play The Mousetrap, which apparently was deliberately placed in a very small theatre in order to enjoy a longish run (royalties being based on the number of performances) — Christie is said to have expected 8 months and the producer 14 months. At the time the book was compiled (end of 1965) it had completed 5,439 performances (having opened in November 1952), but it is now in excess of 25,000. This is five times as many as if the play had been put on at Drury Lane, instead. It moved to the St. Martin's Theatre (audience 550, an increase of 20%) in 1974.

Much more information about these theatres can be found at The Music Hall and Theatre History Site, which is a glorious treasure chest of photos and information about the theatres of Britain, maintained by Matthew Lloyd. I first came across the "Working Dimensions of London Theatres" data at this site, although the site itself refers to the thirteenth edition of the book.

Wednesday, July 24, 2013

A rant about the term "evolutionary network"

Mostly, I just rant to myself, and so I have generally avoided doing so in this blog. But this time I intend making an exception.

The expression "evolutionary network" has become completely meaningless in science, and this is a pity. This has happened because it has been applied to so many unrelated concepts that we can no longer work out what anyone means when they use it, without reading the rest of their text to work out the context.

Networks are, of course, ubiquitous in areas as diverse as the social sciences, biology, computer science, physics and economics, and consequently there is an extensive literature on the subject. This means that the term "evolutionary network" has a different meaning in various assorted areas of intellectual activity, such as neural networks, systems biology and quality measurement, as well as the usage in phylogenetics. What is annoying me, however, is that biologists use the term in oodles of different ways, as well.

Partly, this issue arises because of the use by computer scientists of known biological processes as models for developing computer algorithms, which are then named after the process that provided the inspiration (e.g. so-called genetic algorithms). Partly, the problem comes from claiming that a particular process (or something analogous to it) does actually occur in some particular field of study, and therefore using the relevant name (e.g. so-called evolutionary computing). But the problem in biology is that everyone claims that they are studying evolution, and therefore whatever they do can be called "evolutionary".

The essential point in biology is, naturally, that most patterns are the product of one or more evolutionary processes, to one degree or another. That does not, however, justify calling all patterns and processes "evolutionary". For example, observed similarity (of genes, genomes, organisms, species, etc) may or may not have a large evolutionary component — similarity may be the result of either proximal processes (which may be ecological, rather than strongly evolutionary) or ultimate processes (which are very likely to be evolutionary).

This was one of the strongest arguments for the distinction that has been made between phenetics (based on overall similarity) and phylogenetics (based on genealogy). A phenogram (expressing observed similarity) and a phylogram (expressing inferred genealogy) may be two very different things for any given group of objects. There seems to be no real justification for the merging of these two ideas; and yet this seems to be occurring increasingly.

The latest salvo that blurs the distinction between similarity and genealogy has been fired by Halary et al. (2013. EGN: a wizard for construction of gene and genome similarity networks. BMC Evolutionary Biology 13: 146), who have this to say:

Here, we introduce a simple but powerful software program, EGN (for Evolutionary Gene and genome Network), for the reconstruction of similarity networks from large molecular datasets.

To explain this, in an earlier paper Alvarez-Ponce and colleagues (2013. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proceedings of the National Academy of Sciences of the USA 110: E1594–E1603) developed the idea of a gene similarity network, the name of which tells you exactly what it is. It is a non-phylogenetic network in which the edges directly connect observed genes based on their similarity; that is, it extends the classical concept of gene families. The authors present various reasons to justify their claim that "gene similarity networks have the potential to explore deeper relationships than phylogenetic trees".
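To make the idea concrete, here is a minimal sketch of a similarity network: genes are nodes, and an edge joins two genes whenever their pairwise similarity reaches a threshold. The gene names, similarity scores and threshold are all invented for illustration, and this is not a description of how the published method or the EGN program computes or filters similarities.

```python
def similarity_network(similarities, threshold):
    """Build adjacency sets linking genes whose similarity >= threshold."""
    graph = {}
    for (a, b), score in similarities.items():
        # Register both genes as nodes even if no edge is added.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group genes that are linked directly or through intermediates."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical pairwise similarity scores (0 = unrelated, 1 = identical).
scores = {
    ("geneA", "geneB"): 0.92,
    ("geneB", "geneC"): 0.81,
    ("geneA", "geneC"): 0.40,
    ("geneC", "geneD"): 0.10,
}
network = similarity_network(scores, threshold=0.8)
families = connected_components(network)
```

In this toy example geneA and geneC are not directly connected, yet they land in the same component because both resemble geneB. Nothing in the construction refers to ancestry: the edges record observed similarity only.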

The follow-up paper by Halary et al. (the one under discussion here) describes a computer program that automates the production of these gene similarity networks. But why have they called the program "Evolutionary Gene Network" rather than some version of "Gene Similarity Network"? This name is not only blatantly misleading but downright confusing. The network produced can be used to explore evolutionary history, sure, but it does not represent anything directly evolutionary. The evolutionary interpretation is in the mind of the beholder, not in the network algorithm.

I encourage everyone to be careful when naming their programs. A program name can mislead naive users if the name is disconnected from the program's purpose. Even the program SplitsTree mostly produces networks these days, and very rarely trees!

The term "evolutionary network" in biology, at least, could be usefully restricted to those networks representing evolutionary history directly (e.g. Thiergart et al. 2012. An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin. Genome Biology & Evolution 4: 466-485).

Monday, July 22, 2013

The earliest tree / network of languages (1671)

Urmas Sutrop (2012), who seems to have dug deeper into linguistic history than most other researchers, has noted that: "The first language family trees I managed to track down date from the 17th century. To my knowledge, the very first language family tree was published by the Estonian-Swedish scholar Georg Stiernhielm."

Actually, Stiernhielm's "tree" is a hybridization network, thus making it also the first known phylogenetic network, of any type.

Georg Stiernhielm (1598-1672) was a civil servant, linguist and poet. He is best known as "the father of Swedish poetry" (he didn't write many poems, but their language form was very influential), but here we are interested in his linguistic work. In particular we are interested in his 1671 edition of Wulfila's "Gothic Bible": D.N. Jesu Christi SS. Evangelia ab Ulfila Gothorum in Moesia Episcopo Circa Annum à Nato Christo CCCLX. Ex Græco Gothicé translata, nunc cum Parallelis Versionibus, Sveo-Gothicâ, Norrænâ, seu Islandicâ, & vulgatâ Latinâ edita (published by Nicolai Wankif, Stockholmiæ). A copy is available from Google Books.

The Gothic Bible or Wulfila Bible is the Christian Bible as translated by Bishop Wulfila into the Gothic language spoken by the Eastern Germanic, or Gothic, tribes in c.350 AD. Wulfila invented the Gothic alphabet, composed of Greek letters and runic signs improvised by himself, so that he could do this, and it is thus considered to be the first text written in a Germanic language.

Stiernhielm's edition sets out four texts in parallel (i.e. four columns per double page): Gothic, Icelandic, Swedish (called Suedo-Gothic), and finally "vulgar Latin". The transliteration of the Gothic text is in Latin font, the Icelandic and Swedish translations appear in the so-called "Gothic" letters, and the Latin translation is, naturally, in Latin font.

What is important to us, however, is that Stiernhielm took the opportunity to present a 48-page preface: De Linguarum Origine Præfatio [Preface on the Origin of Language], in which he discussed his ideas about the origins of languages. The diagram shown above (from page xxxvi) is apparently intended to illustrate the idea that three Germanic dialects [Svevica, Mechlenbergia, Brabantica] could gradually merge into one new dialect [Lingua Nova], which would be different from the earlier ones but would still be a Germanic dialect [ipsa Germanica]. This is thus explicitly a hybridization network.

However, Stiernhielm went much further than this. As Umberto Eco (1995) has described, there has long been the idea (dating back at least to the Christian Bible, and the story of the Garden of Eden) that there once existed a language which perfectly and unambiguously expressed the essence of all possible things and concepts, and that the jumble of modern languages is a confused corruption of this "perfect language" (this is the story of the Tower of Babel). Many European philosophers have speculated about a solution for this modern confusion, either by trying to retrieve the language spoken in the Garden of Eden, or by thinking of a "Language of Reason" that would possess the perfection of the lost speech of Eden.

The languages that have been proposed as this "perfect language" include, in time order:

Hebrew, Gaelic, Tuscan, Dutch, German, Swedish, English, and French.

Stiernhielm, in his Preface, was responsible for the suggestion of Swedish.

Stiernhielm's argument was that Old Swedish (Suedo-Gothic) came closest to the "primaeval language" because Old Swedish was a Japhethian language. In the Bible, Japheth had not been present under the Tower of Babel, and therefore was not involved in the subsequent confusion of languages. Stiernhielm argued that the language of Japheth and his descendants ought thus to be a continuation of the language spoken in the Garden of Eden. He concluded that all of the Gothic dialects arose from this stock (he illustrates this with family trees), and he considered Old Swedish to be the most archaic Japhethian language.

This patriotic conclusion was not at all out of place in 17th century Sweden. The Swedish empire was then at its height, covering most of northern Europe. Indeed, shortly after Stiernhielm, Olof Rudbeck (a professor at Uppsala University) wrote a four-volume work (Atlantica sive Manheim) supporting the idea that Swedish was the original language of Adam, and also identifying Sweden as Atlantis, the cradle of civilization, from which civilization spread to the rest of the world. He did some useful things, too, including founding what later became Linnaeus' botanical garden.

Thanks to Johann-Mattis List for alerting me to Sutrop's paper, and thus leading me to Stiernhielm's work.

References

Eco U. (1995) The Search for the Perfect Language. Wiley-Blackwell.

Sutrop U. (2012) Estonian traces in the Tree of Life concept and in the language family tree theory. Journal of Estonian and Finno-Ugric Linguistics 3: 297-326.

Wednesday, July 17, 2013

Networks and journal covers

If you've ever looked at the cover illustrations of phylogenetics journals, either biological or computational, you will have noticed that there are quite a few phylogenetic trees. Sometimes these trees show ancestral polymorphism, but mostly they are uncomplicated dichotomous structures. There are also often various types of biological networks on these covers, such as gene networks and ecological networks. However, there are almost never phylogenetic networks, irrespective of the journal contents.

So, it is with pleasure that we note that Volume 10 Issue 1 of the IEEE/ACM Transactions on Computational Biology and Bioinformatics illustrates not one but two phylogenetic networks.

The illustration is from the paper by Stefan Grunewald, Andreas Spillner, Sarah Bastkowski, Anja Bogershausen and Vincent Moulton: SuperQ: computing supernetworks from quartets, on pages 151-160.

These are unrooted data-display networks, of course. If we look for evolutionary networks, instead, then we need to go to Volume 23 Issue 5 of Trends in Ecology and Evolution (May 2008). Actually, it is difficult to believe that this was ever intended to be an evolutionary network, because the phylogenetic relationships shown are rather bizarre.

Monday, July 15, 2013

Pierre Trémaux, the unknown phylogeneticist

Pierre Trémaux is a name that most of you will not have heard of, and yet he was a remarkable man. Or, rather, he wrote one remarkable book that has been almost completely ignored by history.

Trémaux (1818-1895) has been described as "a French architect, orientalist and photographer", but in his late 40s he also became a theoretical biologist, and it is this latter context that is of interest here. His best-known book is Origine et Transformations de l'Homme et des Autres Êtres (1865, L. Hachette, Paris). The ambitious nature of this book is indicated by its subtitle: Indiquant la transformation des êtres organisés, la formation des espèces, les conditions qui produisent les types, l'instinct et les facultés intellectuelles, la base des sciences naturelles, historiques, politiques, etc. [Indicating the transformation of organized beings; the formation of species; the conditions that produce the types, the instinct and the intellectual faculties; the base of natural sciences, history, politics, etc.]

In this book the author did four noteworthy and original things regarding phylogenetics:

(1) he drew the first "proper" post-Darwinian phylogenetic tree (i.e. showing speciation and extinction, with the internal branches as ancestors and the leaves as extant organisms), which connects all of the historical branches back to a single origin (which Darwin's diagram does not show)
(2) he discussed the idea that speciation is concentrated at certain times in the geological record (the boundaries between "ages") and that there is effectively evolutionary stasis at other times, which presages Eldredge and Gould's theory of punctuated equilibrium by more than a century
(3) he presented the idea that species form in geographically isolated populations, thus beating Moritz Wagner to the idea of allopatric speciation by 3 years
(4) he applied Darwinian ideas to the evolution of Homo sapiens, 6 years before Darwin explicitly did so himself. (Thomas Henry Huxley contributed to the topic in 1863, but his book is best known for its infamous frontispiece illustrating transformational evolution among apes.)

Points (1) and (2) are illustrated in the only diagram in Trémaux's book, as reproduced here.

Trémaux's phylogenetic tree (from Google Books)

The text translates as:

Origin and Transformations of Beings

Synoptic figure of the transformation of species, by P. Trémaux

Species of different branches of the animal kingdom, arising from the same origin (the primordial cell or utricle), subdivide age by age; some of the species become extinct in every age, the others continue to grow and to divide and diverge more and more in their characters.

As far as point (1) is concerned, it is important to note that Darwin's 'tree' diagram is not connected to his description of the Tree of Life. The diagram is used to describe his vision of divergence, and descent with modification; at no stage does he refer to it as a "tree". His dry-as-dust description of the diagram and his poetic evocation of the biblical Tree thus have nothing to do with each other. It is merely a modern fancy to suggest that Darwin drew what we would now call a phylogenetic tree.

Therefore, Trémaux seems to have been the first to publish an illustration of the concept of a phylogenetic tree, in the form in which we know it today. St George Mivart, on the other hand, seems to have been the first to publish an empirical phylogenetic tree, in the same year (1865; see Who published the first phylogenetic tree?).

Darwin's diagram is not even connected at the base, as must be a genealogical tree. This presumably reflects his idea, as he put it in Notebook C in 1838: "The bottom of the tree of life is utterly rotten & obliterated in the course of [the] ages." His predecessors also had doubts about the base of the tree (notably Louis Agassiz and Edward Hitchcock), so that when they drew their diagrams of the fossil history of organisms they also were not connected to a single origin. Trémaux had no such doubts, and explicitly indicated a single origin and connected all of his lineages to it.

Ironically, Trémaux actually produced (quite independently) a finished version of an idea that Darwin sketched in one of his notebooks. This is on a page numbered 184 (probably from the early 1850s; see Charles Darwin's unpublished tree sketches). As shown below, it also considers the relationship between genealogical trees and geological history. Darwin owned a copy of Trémaux's book but apparently saw nothing of worth in it (Wilkins & Nelson 2008), in spite of the obvious relationship to his own idea!

Darwin's sketch

Circular trees apparently did not re-appear until the work of Engler (1881), where the concentric circles represented different morphological features, instead of geological time, thus showing phenotypic divergence from the common ancestor (along with the genealogy represented by the tree itself).

Conclusion

It is of some concern that Trémaux's book is so poorly known and the author himself almost unheard of. He does not appear in any of the standard histories of biology (see Wilkins & Nelson 2008), and his entry in the English-language Wikipedia has only two lines (he fares better in the French version).

In his 1874 book (Origine des Espèces et de l'Homme, avec les Causes de Fixité et de Transformation, et Principe Universel du Mouvement et de la Vie ou Loi des Transmissions de Force) Trémaux makes it clear that the [French] Académie des Sciences had rejected his work. Thereafter, few people seem to have read any of it, relying instead on the correspondence of Karl Marx (who thought the book was great) and Friedrich Engels (who thought that it was rubbish) to pass judgement. Wilkins & Nelson (2008) have tried to redress the problem to some extent.

The rejection by the Académie is likely to have been because they did not accept the application of Darwinian evolution to humans. Indeed, it has been noted (see Hull 1988) that French biologists did not embrace Darwin's ideas in general. More particularly, Darwin himself realized that most of the problems engendered by his work would surround the idea of human evolution. Indeed, it was one major area where he and Alfred Russel Wallace (who independently developed the idea of natural selection) disagreed, as Wallace refused to apply the idea to humans.

Wilkins & Nelson (2008) wisely suggest that much of the confusion also comes from translating Trémaux's use of the French word "sol" as "soil" rather than as "habitat", thus leading to the conclusion that Trémaux was claiming that it is the nature of the soil alone that affects evolution. This was certainly done by Stephen Jay Gould (1997, 1999), who commented that: "I have never read a more absurd or more poorly documented thesis." Trémaux and his book deserve a better epitaph than that, because Trémaux certainly meant much more than most of the historical commentators have credited him with.

References

Engler A. (1881) Über die morphologischen Verhältnisse und die geographische Verbreitung der Gattung Rhus, wie der mit ihr verwandten, lebenden und ausgestorbenen Anacardiaceae. Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 1: 365-426.

Gould S.J. (1997) Redrafting the Tree of Life. Proceedings of the American Philosophical Society 141: 30-54.

Gould S.J. (1999) A Darwinian gentleman at Marx's funeral. Natural History 108(7): 32-41. [Reprinted as "The Darwinian gentleman at Marx's funeral" in the book I Have Landed (2002).]

Hull D.L. (1988) Science as a Process: An Evolutionary Account of the Social and Conceptual Development of Science. University of Chicago Press, Chicago.

Wilkins J.S., Nelson G.J. (2008) Trémaux on species: A theory of allopatric speciation (and punctuated equilibrium) before Wagner. History and Philosophy of the Life Sciences 30: 179-206.

Wednesday, July 10, 2013

Networks and human inter-population variation

I have noted before that there are many situations in which the model of a phylogenetic tree is likely to be inappropriate for analysis of genetic data. The most obvious of these involves the study of intra-population variation (e.g. Why do we still use trees for the dog genealogy?). The within-population genealogy of sexually reproducing species, in particular, is not likely to be tree-like, even at large spatial scales. The iconic species for the study of intra-specific evolutionary history is Homo sapiens, and this is also the species where that history is least likely to be tree-like (e.g. Why do we still use trees for the Neandertal genealogy?). Clearly, a phylogenetic network is called for.

Pemberton et al. (2013, Population structure in a comprehensive genomic data set on human microsatellite variation. Genes Genomes Genetics 3: 891-907) provide an interesting dataset of global human autosomal microsatellite variation, based on merging eight previously published datasets. Microsatellites are a bit retro in this day and age, but that does not make them any less useful for the study of genetic variation.

The biggest issue is getting a large enough sample of loci for detailed study. Different researchers collect data on different microsatellites, and so combining datasets is not straightforward. Nevertheless, Pemberton et al. managed to come up with 5,795 individuals from 267 worldwide populations with genotypes at 645 loci. After filtering a member of every intra-population first-degree and second-degree relative pair, and then reducing the size of the over-represented Gujarati sample, they then added data for 84 chimpanzees. This yielded a dataset of 5,519 individuals from 255 populations sampled at 246 shared loci.

These data were processed as follows:

Using Microsat, we evaluated population-level pairwise allele-sharing distance (one minus the proportion of shared alleles), using all 246 loci ... We constructed a greedy-consensus neighbor-joining tree using the Neighbor and Consensus programs in the Phylip package from 1000 bootstrap resamples across loci.

Note that the original inter-population distances were not calculated — the tree was constructed by combining the branches with the highest bootstrap support.

This tree (reproduced above) does not show a great deal of support for many of the branches, and the authors discuss only seven of them. However, the presentation of a tree does not give much of a visual indication of the poor support for the genealogy, even if the different branch thicknesses do indicate the bootstrap values.

grey = chimpanzee, orange = Africa, yellow = Middle East, blue = Europe,
red = Central/South Asia, purple = America, pink = East Asia, green = Oceania

So, I calculated a NeighborNet network from the distance data, by averaging the 1000 distance matrices from the bootstrap analysis. This is the network analogue of the neighbor-joining tree, as shown above. Note that I have used the same colour coding as for the tree (thus making it look like a very colourful hummingbird), and the branch lengths represent support.
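
As a rough illustration of the averaging step, here is a minimal Python sketch. The function name and the toy matrices are hypothetical, and this is not the code used for the analysis; it simply takes the element-wise mean of a set of equally sized distance matrices, which could then be passed to a NeighborNet program.

```python
def average_distance_matrices(matrices):
    """Return the element-wise mean of a list of equally sized square matrices."""
    if not matrices:
        raise ValueError("need at least one matrix")
    n = len(matrices[0])
    for m in matrices:
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must be square and the same size")
    count = len(matrices)
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(n)]
        for i in range(n)
    ]


# Toy example: two made-up bootstrap replicates for three populations.
rep1 = [[0.0, 0.2, 0.6], [0.2, 0.0, 0.5], [0.6, 0.5, 0.0]]
rep2 = [[0.0, 0.4, 0.8], [0.4, 0.0, 0.7], [0.8, 0.7, 0.0]]
mean_matrix = average_distance_matrices([rep1, rep2])
```

With 1000 bootstrap matrices the call is the same; only the length of the input list changes.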

There is clearly a degree of large-scale geographical clustering of the genotypes, and this corresponds to the larger bootstrap values in the tree. So, the main message from the tree and the network is the same, including the rooting of the human genealogy within the African "group". However, this message is visually much clearer in the network than in the circular version of the tree. Moreover, there is little distinction between the Middle Eastern (yellow) and European (blue) genotypes, and the network makes this more obvious than does the tree.

Monday, July 8, 2013

Why people feel older than they are

As always at the beginning of the week, this blog presents something in a lighter vein. However, this week we depart from phylogenetic networks entirely, and delve into the general life of people, instead.

The passage of time is a curious thing, which varies not only with the speed of the observer but also with the age of the observer. Albert Einstein has written about the former phenomenon, and I once wrote a tongue-in-cheek article about the latter one, which I present here.

It turns out, according to my analysis, that your perception of time varies in a precisely quantifiable way depending on your age. The only times that you feel as young as you actually are are at ages 0 and 73 years; in between, you feel older than you are.

This article appeared in 1991 in the Australian Biologist 4: 187-190, a journal published by the Australian Institute of Biology. I specifically wrote about biologists, but the analysis applies to all humans. Sadly, this journal has no web page, and little has been heard about it since volume 17 (2004).

Since printed copies of the journal are held by only a few libraries in Australia, presumably no-one has read this article since 1991. Nevertheless, you should read it, and so I have linked to a PDF copy [1.8 MB] of the paper:
Why biologists feel older than they are

Wednesday, July 3, 2013

Archiving of bioinformatics software

Some months ago I wrote a blog post about what is perceived to be the rather poor quality of many computer programs in bioinformatics (Poor bioinformatics?), noting that many bioinformaticians aren't taking seriously the need to properly engineer software, with full documentation and standard programming development and versioning.

An obvious follow-up to that post is to consider the archiving of bioinformatics software. If programs are written well, then they should be permanently archived for future reference. A number of bloggers have commented on what is perceived to be the poor current state of affairs here, as well, and I thought that I might draw your attention to a few of the posts.

In many ways, this issue is the computational equivalent of storing biological data, about which I have also written recently (Releasing phylogenetic data). My comments about this were:

There is a difference between storing / releasing the original data (eg. raw DNA sequences) and the data as analyzed (eg. aligned sequences)
There are sustainable and accessible archiving facilities for raw data that are almost universally used (eg. GenBank)
Many people do not release the processed data as analyzed (some of them will if directly asked to do so)
Many of the people who do release their analyzed data do so on the homepage of one of the authors, which is better than nothing but is rarely sustainable
There are sustainable and accessible archiving facilities for processed data, such as TreeBASE and Dryad.

Analogous comments can be made about the archiving of bioinformatics software.

The first question to ask is this: what proportion of the bioinformatics software referred to in publications is actually stored in sustainable and accessible archives? A corollary to this question is: what archive facilities are being used? Casey Bergman, at the I Wish You'd Made Me Angry Earlier blog, has attempted to answer both of these questions (Where Do Bioinformaticians Host Their Code?).

In answer to the first question, Casey notes:

of the many thousands of articles published in the field of bioinformatics, as of Dec 31 2012 just under 700 papers (n=676) have easily discoverable code linked to a major repository in their abstract.

While many papers may have the code URL in the Methods or Results sections but not the Abstract, this does suggest that repository archiving is not the mode actually employed by bioinformaticians. Instead, they are archiving (if at all) on personal or institutional homepages.

Sadly, the reported rate of decay of URLs ("Error 404: Page not found") indicates that this is rarely a sustainable approach to archiving (eg. see the Google+ comment by Dave Lunt). The relevance of the similar situation with the TreeBase / Dryad type of repository has not gone unnoticed, for example by Hilmar Lapp. These repositories require and enforce standards of data and software archiving, as well as providing persistence.

The answer to the second question, about which repositories, seems to be (see also the data provided by MRR in the comments to Casey's blog post):

SourceForge has been vastly predominant
Google Code has a large number of projects, but many of them have never made it to publication
GitHub has had a rapid recent growth rate, and therefore appears to be becoming the preferred repository.

Other repositories, such as BitBucket, seem to be much less used. Users on other forums (eg. Biostar: Where would you host your open source code repository today?) seem to concur with the choice of GitHub, mainly because of the tools available (it is user-oriented rather than project-oriented).

This leads to the issue of how permanent the archiving is at the major repositories. It turns out that there is a major difference in policies, as noted by Casey Bergman:

SourceForge has a very draconian policy when it come to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) GitHub are too permissive in terms of allowing projects to be deleted.

In a follow-up post (On the Preservation of Published Bioinformatics Code on GitHub), Casey expands on this theme:

A clear trend emerging in the bioinformatics community is to use GitHub as the primary repository of bioinformatics code in published papers. While I am a big fan of Github and I support its widespread adoption, I have concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released, and this can only be done by SourceForge itself, deleting a repository on GitHub takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.

This is an important issue, as exemplified by Christopher Hogue in the comments section of that blog post:

In my case SourceForge preserved the SLRI toolkit my group made in Toronto. As the intellectual property underlying the code was sold to Thompson-Reuters in 2007, my host institution and the dealmakers pressured me to delete the repository. SourceForge policy kept it on the site ... [However,] the aftermath of all this is that, of everything my group did under the guise of open source, only about 30% is preserved and online, and the rest is buried in an intellectual property shoebox at Thompson-Reuters. Host institutions have a lot of power of ownership over your intellectual property. If you win the right to post work into open-source, the GitHub delete policy means that your host institution can over-ride this, and require you to take your code out of circulation. GitHub is great, but for the sake of preservation, SourceForge has the right policy, protecting your decision to go open source from later manipulations by your host institution when it becomes "valuable".

Casey Bergman's response to this issue has been to create the Bioinformatics Archive on GitHub. This is based on the idea used by the journal Computers & Geosciences, in which the journal editor forks the GitHub code into a journal "organization" for all accepted papers — this creates a permanent repository, which is necessary because deleting a private GitHub repository will delete all forks of the repository but deleting a public repository will not do so. So, Casey has been personally forking the code for all publications that come to hand (currently 147 repositories) into the Bioinformatics Archive, thus creating a public repository for all of the relevant GitHub code.

However, this is clearly a stopgap measure. Dave Lunt, at the EvoPhylo blog, has listed three desiderata for a more permanent solution to the issue (How can we ensure the persistence of analysis software?):

A publisher driven version of the Bioinformatics Archive; journals should have a policy for the hosting of published code in a sustainable and accessible archive in a standardized manner
Redundancy to ensure persistence in the worst case scenario; archive persistence is the key requirement, and this can only happen in public repositories, with the published URL and/or DOI pointing to a public copy of the code
The community to initiate actual action; authors need to pressure the publishers to adopt a Dryad-like strategy, in which a large group of ecology and evolutionary biology journals agreed to require the use of a public database for storing the biological data associated with their publications.

At a minimum, a persistent public repository is a snapshot of the code at the time of publication, just as a sequence alignment is a snapshot of the processed data at the time of its publication. This does not preclude further work on the code, and further publications based on the newly modified code, just as new sequence alignments can be created by adding newly acquired sequences. Open-source code can still be newly forked, and there can be user-contributed updates and public issue tracking. Multiple snapshots of code related to different publications through time are not necessarily an issue, but they will need to be handled in some sensible manner.

The main reason for requiring the public archiving of code is to deal with the all-too-common situation when code is no longer being maintained (the scholarship ran out, the grant ended, the author retired, etc). For example, Jamie Cuticchia & Gregg Silk (2004, Bioinformatics needs a software archive, Nature 429: 241) mention the loss of part of the code when the multi-million dollar Genome Database lost funding in 1998. These two authors seem to be the first to have proposed a Bioinformatics Software Archive, "in which an archival copy of bioinformatics software would be maintained in a secure central repository supported by public funding." Personal and institutional homepages are too ephemeral (suffering what is known as URL decay) and too prone to politics to be considered acceptable for the storage of data and software in high-quality science.

Monday, July 1, 2013

Networks of the "Sight & Sound" film polls

There are at least three things wrong with "best of" lists: (i) there is rarely any clear idea of what "best" is supposed to mean; (ii) the list is of arbitrary length (eg. Top-10 only); and (iii) the ranking does not reflect the differences in the original scores.

A good example of all three problems is the "Greatest Films Poll" produced every decade by Sight & Sound magazine, which lists the Top 10 films as voted by selected film critics. You will find dozens of web sites that reproduce these lists (which started in 1952), with arguments and counter-arguments about the films that have appeared on the various lists. Much of the commentary on the latest poll is summarized here.

As far as point (i) is concerned, since the lists are compiled from the voting of film commentators, rather than film makers or the general viewing public, this clearly defines "best" as having something to do with the things that appeal most to critics. For example, Scott Tobias notes that: "Since its inception, the poll has championed films that have dramatically altered the [film] landscape", rather than films that the general public enjoys. We cannot change this bias, and for the sake of the argument, we will accept the critics' point of view in this blog post (ie. we are going to look at films that are challenging, or that changed the ways things are done).

However, we can easily address the other two problems in a quantitative way. For point (ii) we do not need to truncate the list at an arbitrary number like 10, and for point (iii) we can use the original votes because (to one extent or another) they are available on the web. In this post, I use a network to investigate the patterns in the data from the votes cast in all seven of the Sight & Sound polls to date.

The poll

Sight & Sound magazine solicits a list of their "top 10 films" from each of a number of critics. [Note that all each critic does in order to vote is write a list of 10 films, and send it in.] For example, in 2012 the magazine solicited lists from 1000 commentators and received 846 lists in reply. However, the numbers of lists available for the polls were only 119-145 in 1982-2002 and 31-35 in 1952-1972. The number of times each film appears in any of the critics' lists is summed, and the total is used to produce the Top-10 list for that particular poll (ie. the total equals the number of critics who listed the film in their Top 10). However, the magazine's Top-10 list rarely contains precisely 10 films, due to ties in the voting.

It is important to emphasize how these data can be interpreted, because most of the media reports get it badly wrong. To say, as many of the media did, that "In the 2012 poll the critics voted Vertigo the best movie of all time" is wrong, because the vast majority of the critics (77%) said that this film doesn't even belong in the top 10. And yet it is listed as the no. 1 film, because more critics (23%) put it on their list than did so for any other film. Similarly, 91% of the critics said The Searchers should not be in the top 10, and yet it is ranked no. 7. So, the rank order of the films is simply that — a rank order; it does not tell you how many critics think highly of each film.
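
The counting scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ballots and film names are invented, the lists are shortened to two films each, and the function name is hypothetical. It shows how a film can rank first while most critics leave it off their list.

```python
from collections import Counter


def tally_ballots(ballots):
    """Count how many ballots list each film, and the share of critics that is."""
    counts = Counter()
    for ballot in ballots:
        counts.update(set(ballot))  # a critic counts once per film
    n = len(ballots)
    # Sort by votes (descending), then by title, so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(film, votes, votes / n) for film, votes in ordered]


# Five made-up ballots, shortened to two films each.
ballots = [
    ["Film A", "Film B"],
    ["Film A", "Film C"],
    ["Film D", "Film E"],
    ["Film F", "Film G"],
    ["Film H", "Film I"],
]
ranking = tally_ballots(ballots)
top_film, top_votes, top_share = ranking[0]
print(f"{top_film}: {top_votes} ballots, {top_share:.0%} of critics")
```

Here the top-ranked film appears on only two of the five ballots, so three of the five critics did not list it at all, which is the same situation as for Vertigo in 2012.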

The data

I tried to compile all of the available data for the Critics' Poll (not the separate Directors' Poll), and I ended up using the following sources:

1952, 1962, 1972: http://alumnus.caltech.edu/~ejohnson/
1982: http://www.oocities.org/the7thart/1982.html
1992: http://www.cinemacom.com/sight-and-sound.html
2002: http://www.cinemacom.com/2002-sight-sound.html
2012: http://explore.bfi.org.uk/sightandsoundpolls/2012/critics/

The 1972 data are somewhat incomplete, and the 1982 data are somewhat doubtful, but I could not find multiple copies of these lists to cross-check them.

Most films get very few votes in any one year. For instance, in 2012 the 5th placed film was listed by only 11% of the critics. Clearly, rank order means little below that point, as the order of the films then involves splitting hairs. Part of the problem here is that most of the film directors have the critics' votes spread across several of their films, so that each of their films receives only a few votes even though the director actually accumulates many votes for their whole body of work. This does not seem to happen for the top-ranked films, where one film is "chosen" as the outstanding one, and almost all of that director's votes go to that single film.

Moreover, most films do not appear regularly in the polls. For example, of the 94 films that appeared at least once in the top 30 across all of the polls, only 4 appeared in all 7 polls, and 50 of the films appeared in the top 30 only once.

The analysis

In order to make the data comparable between polls, I have had to break the dataset up into two overlapping subsets, because of the paucity of data available in the polls from the early years:

data for all seven polls, which consist of the vote scores only for movies that appeared in the top c. 30 ranking (a total of 94 films) — the number of films varied from 27-33 between polls depending on tied votes, except 1972 for which I could find data only for the top 23 films;
data for the 1982, 1992, 2002 and 2012 polls only, which consist of the vote scores for movies that appeared in the top c. 130 ranking (a total of 232 films) — the number varied from 125-130 between polls depending on tied votes.

In the latter case, 53 of the 232 films appeared in all 4 years, while 95 films appeared in one poll only.

For the technical details of the analysis, I normalized the data within each poll, and I then calculated the similarity of the poll results using the Steinhaus dissimilarity (which ignores "negative matches", as discussed in a previous blog post). A Neighbor-net analysis was then used to display the between-film similarities as a phylogenetic network. So, films that are closely connected in the network are similar to each other based on their poll results, and those that are further apart are progressively more different from each other.
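
For readers who want to see the arithmetic, here is a minimal Python sketch of the Steinhaus dissimilarity, written in its usual form as one minus twice the sum of the pairwise minima divided by the sum of all values. The normalization shown (dividing vote counts by the number of lists in a poll) is an assumption made for illustration, as are the function names and the example profiles; the actual analysis may have normalized differently.

```python
def to_proportions(vote_counts, n_lists):
    """Assumed normalization: votes divided by the number of lists in each poll."""
    return [votes / lists for votes, lists in zip(vote_counts, n_lists)]


def steinhaus_dissimilarity(x, y):
    """Steinhaus dissimilarity between two non-negative profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("profiles must be non-negative")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("undefined when both profiles are all zeros")
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / total


# Made-up profiles for two films across four polls (proportion of critics).
film_a = [0.0, 0.0, 0.2, 0.4]
film_b = [0.0, 0.0, 0.4, 0.2]
d_ab = steinhaus_dissimilarity(film_a, film_b)

# Polls in which neither film scored ("negative matches") add nothing to
# either sum, so appending more of them leaves the value unchanged.
d_ab_padded = steinhaus_dissimilarity(film_a + [0.0, 0.0], film_b + [0.0, 0.0])
```

Two films with identical profiles get a dissimilarity of 0, and two films that never appear in the same poll get a dissimilarity of 1.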

Comparison of the polls

When comparing all 7 polls, the important point turns out to be that only 4 / 94 films appeared in all 7 polls, while 50 / 94 films appeared in only one of the polls. This means that there is very little consistency between the polls.

The network illustrates this by arranging the years in an anti-clockwise circle. So, the 1952 poll shares a lot of films with the 1962 poll, which shares films with the 1972 poll, and so on around the circle. Thus, the 1952 poll shares little with the 2012 poll. Therefore, we can conclude that film preference changes through time for the critics. This is partly because the critics change (ie. they come and go, as critics), and also because the films available change, with new films appearing constantly. This can be investigated further by looking at networks of the films themselves, which I do next.

The seven polls for 1952-2012

The relationships among the films, as shown in the network, are strongly determined by which polls they appeared in, rather than by where they were ranked in those polls. That is, the locations of the films in the network are determined most by their presence / absence in each of the seven polls.

Clearly, how many polls a film can appear in is determined by when the film was made, as well as when it achieved critical appreciation. There are 17 films that have appeared in the Top 30 list for every poll after they first got onto the list.

Four films made it into all 7 polls:
   Citizen Kane
   La Règle du Jeu (The Rules of the Game)
   Battleship Potemkin
   Passion de Jeanne d'Arc (The Passion of Joan of Arc)
Two films missed the first poll only:
   Sunrise: A Song of Two Humans
   L'Avventura
Four films missed the first two polls only:
   Vertigo
   The Searchers
   2001: A Space Odyssey
   8½
Two films missed the first three polls:
   Seven Samurai
   Singin' in the Rain
One film missed the first four polls:
   À Bout de Souffle (Breathless)
Four films missed the first five polls:
   Rashomon
   The Godfather
   The Godfather Part II
   Au Hasard Balthazar
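
The rule used to build this list (a film stays in the Top 30 for every poll after its first appearance) can be expressed as a short check. The Python sketch below is only an illustration, with a hypothetical function name; the appearance years are taken from the descriptions in this post.

```python
POLLS = [1952, 1962, 1972, 1982, 1992, 2002, 2012]


def stayed_once_listed(appearances):
    """True if a film appears in every poll from its first appearance onward."""
    years = sorted(appearances)
    if not years:
        return False
    first = POLLS.index(years[0])  # raises ValueError for a non-poll year
    return set(POLLS[first:]) <= set(years)


films = {
    "Citizen Kane": [1952, 1962, 1972, 1982, 1992, 2002, 2012],
    "Vertigo": [1972, 1982, 1992, 2002, 2012],
    "The General": [1962, 1972, 1982, 1992, 2002],
    "Les Enfants du Paradis": [1952, 1982],
}
persistent = sorted(name for name, years in films.items() if stayed_once_listed(years))
```

Of these four examples, only Citizen Kane and Vertigo pass the check; The General fails because it dropped out of the last poll, and Les Enfants du Paradis fails because of its gaps.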

Other films have come and gone from the Top 30, which determines their position in the network. For example, The General did not make the first or last polls but did make all of the ones in between; and Ugetsu Monogatari missed the first one and the final two. Others have come and gone sporadically, notably Greed, City Lights, Bicycle Thieves, Intolerance, and Wild Strawberries (Smultronstället). The Gold Rush and La Grande Illusion (The Grand Illusion) made only the first three lists, and Zéro de Conduite, The Childhood of Maxim Gorki, Monsieur Verdoux, and Earth made the first two lists only — these films have clearly fallen from favour.

Les Enfants du Paradis (Children of Paradise) has a strange position in the network because it appeared only in the 1952 and 1982 Top 30 lists. Most oddly, Tokyo Story, L'Atalante, and Pather Panchali made the 1962, 1992, 2002 and 2012 lists (except Pather Panchali in 2012) but none of these appeared in the 1972 or 1982 Top 30 lists. Perhaps this relates to the poor quality of the data that I have for those two polls.

Anyway, there is little evidence here of consistency regarding how appealing any particular film is to the critics. I won't go so far as to say there are fads through time, but clearly the idea of "best films" is not a particularly constant thing across decades.

The four polls for 1982-2012

The four-poll dataset includes many more films for which there are data available. This dataset shows much stronger clustering of the films, because there are far fewer possible patterns of relationship across only 4 polls instead of 7. I have labelled only the Top 10 films from the most recent poll (2012), since that is the current "best" list.

Of the 232 films, 53 appeared in the Top 130 list for all four polls, 13 films missed only the first poll, and 8 missed only the last poll. The remaining films have appeared sporadically, with 95 of them appearing only once.

The top 15 films stand out from the others based on their average score across the four polls, and they form the cluster in the bottom-right corner of the network. That is, the network suggests that we should have a Top-15 list (not a Top-10 list). In order of average critic scores, the films are:

Citizen Kane (Orson Welles, 1941)
La Règle du Jeu (Jean Renoir, 1939)
Vertigo (Alfred Hitchcock, 1958)
Tokyo Story (Yasujirô Ozu, 1953)
2001: A Space Odyssey (Stanley Kubrick, 1968)
Battleship Potemkin (Sergei Eisenstein, 1925)
The Searchers (John Ford, 1956)
Sunrise: A Song of Two Humans (F W Murnau, 1927)
8½ (Federico Fellini, 1963)
Singin' in the Rain (Stanley Donen & Gene Kelly, 1952)
Seven Samurai (Akira Kurosawa, 1954)
Passion de Jeanne d'Arc (Carl Theodor Dreyer, 1928)
L'Atalante (Jean Vigo, 1934)
L'Avventura (Michelangelo Antonioni, 1960)
The General (Buster Keaton & Clyde Bruckman, 1927)

Note that this list, and its order, is based on all four polls. Consequently, it reflects each film's assessment over 40 years, rather than merely its current popularity. Note, also, that the top 2 films (Citizen Kane, and La Règle du Jeu) also appeared in all 7 polls (see above), indicating that they have been consistently appreciated by the critics for 70 years, with only Battleship Potemkin and Passion de Jeanne d'Arc as competitors. Of the directors involved in these four films, probably only Orson Welles is well-known to the general public, possibly because of the age of the films (only four of the Top-15 have been made in my lifetime!).

In the list there are 6 American films, 3 French, 2 Japanese, 2 Italian, 1 Russian, and 1 American-British collaboration. Of the 15 films, 7 were nominated for at least one Academy Award, but only 4 of them won one. Very few of them were particularly popular with the public when they were first released.

As noted above, no director appears more than once in this list, as one of their films has been singled out by the critics. These are not necessarily their best films, but are more likely to be their most radical film in terms that affected other filmmakers and the critics. The big losers in this sense are those directors who have many films in the list, none of which scores well on its own. For example, Jean-Luc Godard has 8 nominated films, as does Luis Buñuel, while Robert Bresson has 7 films, Howard Hawks, Satyajit Ray, and Charles Chaplin all have 6, and Ingmar Bergman has 5, as also does the team of Michael Powell & Emeric Pressburger.

Finally, the graph below illustrates the changing nature of the critics' choices. It shows the fate of 9 of the top 10 films from the list over the four polls. The top five films in the most recent poll (2012) have all changed consistently in score across the four polls, with two decreasing (Citizen Kane, La Règle du Jeu) and three increasing (Vertigo, Sunrise: A Song of Two Humans, and 2001: A Space Odyssey). The most obvious connection between Citizen Kane and Vertigo (the former and current No. 1, respectively) is that Bernard Herrmann wrote the music for both movies, and his music has been considered to play an important dramatic role in both cases.

Conclusion

Those of you who are interested in the top-rated movies might like to look at Bill Georgaris' list from February 2013:

They Shoot Pictures, Don’t They? 1,000 Greatest Films

This was compiled from all available lists, not just those from the Sight & Sound polls, although the results are dominated by the Sight & Sound 2012 rankings (31.5% of the 3,194 votes).