The Genealogical World of Phylogenetic Networks: August 2014

Wednesday, August 27, 2014

Fritz Müller and the first phylogenetic tree

I have previously noted that the first empirical phylogenetic tree apparently was published by St George Jackson Mivart in late 1865, a full 6 years after Charles Darwin released On the Origin of Species (Who published the first phylogenetic tree?). Mivart was not necessarily the first to start producing such a tree, but he got into print first. For example, Franz Martin Hilgendorf wrote a PhD thesis in 1863 for which he produced a hand-drawn tree, but he did not actually include the tree itself in the thesis (The dilemma of evolutionary networks and Darwinian trees). Also, Ernst Heinrich Philipp August Haeckel claimed to have started work on his series of phylogenetic trees in 1864, but the resulting book, Generelle Morphologie der Organismen, was not published until 1866 (Who published the first phylogenetic tree?).

Another actor in this series of events was Fritz Müller, who can also be considered to have published a tree first, in 1864, albeit a very small one.

Johann Friedrich Theodor Müller (1822–1897)

Müller was born in Germany, but in the 1850s he emigrated to southern Brazil with his brother and their wives. As a naturalist in the Atlantic forest, he studied the insects, crustaceans and plants, and he is chiefly remembered today as the describer of what we now call Müllerian mimicry (the phenotypic resemblance between two or more unpalatable species).

Heinrich Bronn's German translation of the Origin appeared in 1860, and Müller read it and agreed with its central thesis (as did Hilgendorf and Haeckel). Indeed, in 1864 he published a book discussing some of the empirical evidence that he adduced with regard to the Crustacea:

Für Darwin
Verlag von Wilhelm Engelmann, Leipzig.

The book has 91 pages and 67 figures, and the Foreword is dated 7th September 1863. Several copies are available in Google Books (here, here, here).

In this book Müller described the development of Crustacea, illustrating that crustaceans and their larvae could be affected by adaptations and natural selection at any growth stage. He discussed in detail how living forms diverged from ancestral ones, based on his study of aerial respiration, larval morphology, sexual dimorphism, and polymorphism.

Darwin read the book, and began a life-long correspondence with Müller (ultimately some 60 letters having been exchanged between them). Subsequently, Darwin commissioned an English translation of the book, and in 1869 published it with John Murray on commission (ie. taking the risk himself). Darwin printed 1000 copies but it apparently was not a great success:

Facts and Arguments for Darwin
Translated from the German by W.S. Dallas
John Murray, London.

The book has 144 pages and 67 figures, and the Translator's Preface is dated 15th February 1869. A copy is available in the Biodiversity Heritage Library (here).

The following quotes are from this English translation [Note that Müller's unnecessarily convoluted sentences exist in the original German — this writing style is one reason why the book is not as well known as the works of Darwin and Wallace]:

It is not the purpose of the following pages to discuss once more the arguments deduced for and against Darwin's theory of the origin of species, or to weigh them one against the other. Their object is simply to indicate a few facts favourable to this theory ...

When I had read Charles Darwin's book 'On the Origin of Species,' it seemed to me that there was one mode, and that perhaps the most certain, of testing the correctness of the views developed in it, namely, to attempt to apply them as specially as possible to some particular group of animals ...

When I thus began to study our Crustacea more closely from this new stand-point of the Darwinian theory,—when I attempted to bring their arrangements into the form of a genealogical tree, and to form some idea of the probable structure of their ancestors,—I speedily saw (as indeed I expected) that it would require years of preliminary work before the essential problem could be seriously handled ...

But although the satisfactory completion of the "Genealogical tree of the Crustacea" appeared to be an undertaking for which the strength and life of an individual would hardly suffice, even under more favourable circumstances than could be presented by a distant island, far removed from the great market of scientific life, far from libraries and museums—nevertheless its practicability became daily less doubtful in my eyes, and fresh observations daily made me more favourably inclined towards the Darwinian theory.

In determining to state the arguments which I derived from the consideration of our Crustacea in favour of Darwin's views, and which (together with more general considerations and observations in other departments), essentially aided in making the correctness of those views seem more and more palpable to me, I am chiefly influenced by an expression of Darwin's: "Whoever," says he ('Origin of Species' p. 482), "is led to believe that species are mutable, will do a good service by conscientiously expressing his conviction."

So, for the reason stated, Müller did not produce a complete phylogeny in the book. However, of particular interest to us is the figure on page 6 of the original German edition (page 9 of the translation). It turns out to be a pair of three-taxon statements concerning species of Melita (amphipods), as shown in the figure above (original) and below (translation). Müller has this to say:

[There are five] species of Melita ... in which the second pair of feet bears upon one side a small hand of the usual structure, and on the other an enormous clasp-forceps. This want of symmetry is something so unusual among the Amphipoda, and the structure of the clasp-forceps differs so much from what is seen elsewhere in this order, and agrees so closely in the five species, that one must unhesitatingly regard them as having sprung from common ancestors belonging to them alone among known species.

This is as clear a statement of synapomorphy, and its relationship to constructing a phylogeny, as you could get; and so we could credit Müller with having produced an empirical phylogenetic tree (the one on the left in the figures).

Equally interestingly, Müller then goes on to consider a potentially contradictory character: the secondary flagellum of the anterior antennae, which is missing in one species. This would produce a different three-taxon statement (shown on the right in the figures). He resolves the issue by suggesting that the flagellum might be similar to the situation in other species, where it is "reduced to a scarcely perceptible rudiment—nay, that it is sometimes present in youth and disappears at maturity". This is a clear example of the character conflict that arises when trying to construct an empirical phylogeny; and it was also encountered by Mivart in his studies of primate skeletons (Is this the first network from conflicting datasets?).

Conclusion

Müller did not publish a complete phylogeny, but instead discussed how to produce one, and illustrated the practicality (and necessity) of doing so. In the process, he produced a simple three-taxon statement (which is not even numbered as a figure). Nevertheless, this cladogram is technically the first in print, pre-dating Mivart by a year. Darwin was right to recognize its importance, although he seemed to take a while to bring it to the attention of the English-speaking public. Furthermore, Müller was apparently the first to encounter the empirical difficulty of how to deal with conflicting data, which would produce different phylogenetic trees. This is an issue that is just as important today as it was then.

Monday, August 25, 2014

The evolution of statistical phylogenetics

For those of you who do not understand the notation:

Homo apriorius ponders the probability of a specified hypothesis, while Homo pragamiticus is interested in the probability of observing particular data. Homo frequentistus wishes to estimate the probability of observing the data given the specified hypothesis, whereas Homo sapiens is interested in the joint probability of both the data and the hypothesis. Homo bayesianis estimates the probability of the specified hypothesis given the observed data.
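Written out in symbols, with H for the hypothesis and D for the data (labels chosen here for illustration, which may differ from those in the cartoon), the five quantities are roughly as follows. The last line adds Bayes' theorem, which ties the others together.

```latex
\begin{align*}
\text{Homo apriorius:}    &\quad P(H) \\
\text{Homo pragamiticus:} &\quad P(D) \\
\text{Homo frequentistus:}&\quad P(D \mid H) \\
\text{Homo sapiens:}      &\quad P(D, H) = P(D \mid H)\, P(H) \\
\text{Homo bayesianis:}   &\quad P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
\end{align*}
```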

Wednesday, August 20, 2014

The role of biblical genealogies in phylogenetics

Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.

However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in the attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden. [Note: most children are not shown]
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.

If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). If we include the rest of the paternal lines of descent, then family genealogies can be represented as a tree; and if we also include some or all of the maternal lineages, then they can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, first cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the complex genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)
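The difference between a tree and a network can be checked mechanically. The following is a minimal sketch in Python, assuming a small set of parent-to-child links taken from Genesis (Bethuel as father of Rebekah and Laban, Laban as father of Leah and Rachel); the list names and the function `has_reticulation` are made up for illustration. It reports whether the links close a loop once direction is ignored, which is what stops a genealogy from being drawn as a tree.

```python
# Parent -> child links named in Genesis. Which links to include is the
# illustrative choice being made here; this is not the data behind any figure.
FATHER_LINKS = [
    ("Bethuel", "Rebekah"), ("Bethuel", "Laban"),
    ("Laban", "Leah"), ("Laban", "Rachel"),
    ("Isaac", "Jacob"), ("Jacob", "Joseph"),
]
MOTHER_LINKS = [("Rebekah", "Jacob"), ("Rachel", "Joseph")]


def has_reticulation(edges):
    """Return True if the links contain a cycle when direction is ignored."""
    parent = {}  # union-find forest over the people named in the links

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:  # a and b were already connected, so this link closes a loop
            return True
        parent[ra] = rb
    return False


print(has_reticulation(FATHER_LINKS))                 # fathers only
print(has_reticulation(FATHER_LINKS + MOTHER_LINKS))  # fathers and mothers
```

With fathers only, the links form two separate trees. Adding the two mothers connects Joseph to Bethuel along two different routes, through Jacob and through Rachel, and it is this loop that makes the genealogy a network.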

This idea of genealogical chains, trees and networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See the Note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience (Hellström 2011), but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. Along with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this discussion had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).

References

Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie Royale.

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Hellström N.P. 2011. The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Pax F.A. 1888. Monographische übersicht über die arten der gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10:75-241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

Monday, August 18, 2014

Bioinformaticians' nightmares

These illustrations are from Alper Uzun's Biocomicals web site.

Bioinformaticians' dream

Bioinformaticians' reality

Wednesday, August 13, 2014

Phylogenomics: the effect of data quality on sampling bias

Sampling bias refers to a statistical sample that has been collected in such a way that some members of the intended statistical population are less likely to be included than are others. The resulting biased sample does not necessarily represent the population (which it should), because the population members were not all equally likely to have been selected for the sample.

This affects scientific work because all scientific questions are about the population not the sample (ie. we infer from the sample to the population), and we can only answer these questions if the samples we have collected truly represent the populations we are interested in. That is, our results could be due to the method of sampling but erroneously be attributed to the phenomenon under study instead. Bias needs to be accounted for, but it cannot be assessed by looking at the sampled data alone. Bias can only be addressed via the sampling protocol itself.

In genome sequencing, sampling bias is often referred to as ascertainment bias, but clearly it is simply an example of the more general phenomenon. This is potentially a big problem for next generation sequencing (NGS) because there are multiple steps at which sampling is performed during genome sequencing and assembly. These include the initial collection of sequence reads, assembling sequence reads into contigs, and the assembly of these into orthologous loci across genomes. (NB: For NGS technologies, sequence reads are of short lengths, generally <500 bp and often <100 bp.)

The potential for sampling (ascertainment) bias has long been recognized for the detection of SNPs. This bias occurs because SNPs are often developed using only a small group of samples from which to choose the polymorphic markers. The resulting collection of markers samples only a small fraction of the diversity that exists in the population, and this mis-estimates phylogenetic relationships.

However, it is entirely possible that any attempt to collect high-quality NGS data actually results in poor quality sampling; that is, we end up with high-quality but biased genome sequences. Genome sequencing is all about the volume of data collected, and yet data volume cannot be used to address bias (it can only be used to address stochastic variation). It would be ironic if phylogenomics turns out to have poorer data than traditional sequence-based phylogenetics, but biased genomic data are unlikely to be any more use than non-genome sequences.
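The point about data volume can be illustrated with a toy simulation. This is a sketch under invented assumptions: the population, the seed, and the rule that larger values are more likely to be sampled are all hypothetical, and none of it models a real sequencing protocol. It compares the error of a simple random sample with that of a biased sample as the sample size grows.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values with a true mean near 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def random_sample(n):
    """Every member of the population is equally likely to be chosen."""
    return rng.sample(population, n)


def biased_sample(n):
    """Members with larger values are more likely to be chosen (weight = value)."""
    weights = [max(x, 0.0) for x in population]
    return rng.choices(population, weights=weights, k=n)


for n in (100, 1_000, 10_000):
    err_random = statistics.fmean(random_sample(n)) - true_mean
    err_biased = statistics.fmean(biased_sample(n)) - true_mean
    print(f"n={n:>6}  random error={err_random:+.2f}  biased error={err_biased:+.2f}")
```

In this toy setting the error of the random sample shrinks as n grows. The error of the biased sample instead settles near a fixed offset, because collecting more data from the same skewed protocol only reduces the stochastic part of the error.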

The basic issue is that attempts to get high-quality genome data usually involve leaving out data, either because the initial sequencing protocol never collects the data in the first place, or because the subsequent assembly protocols delete the data as being below a specified quality threshold. If these data are left out in a non-random manner, which is very likely, then sampling bias inevitably results. Unfortunately, both the sequencing and bioinformatic protocols are usually beyond the control of the phylogeneticist, and sampling bias can thus go undetected.

Two recent papers highlight common NGS steps that potentially result in biased samples.

First, Robert Ekblom, Linnéa Smeds and Hans Ellegren (2014. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics 15: 467) discuss genome coverage bias, using mtDNA as an example. They note:

It is known that the PCR step involved in sequencing-by-synthesis methods introduces coverage bias related to GC content, possibly due to the formation of secondary structures of single stranded DNA. Such GC dependent bias is seen on a wide variety of scales ranging from individual nucleotides to complete sequencing reads and even large (up to 100 kb) genomic regions. Systematic bias could also be introduced during the DNA fragmentation step or caused by DNA isolation efficacy, local DNA structure, variation in sequence quality and mapability of sequence reads.

In addition to variation in coverage, there may be sequence dependent variation in nucleotide specific error rates. Such systematic patterns of sequencing errors can also have consequences for downstream applications as errors may be taken for low frequency SNPs, even when sequencing coverage is high. GC rich regions and sites close to the ends of sequence reads typically show elevated errors rates and it has also been shown that certain sequence patterns, especially inverted repeats and "GGC" motifs are associated with an elevated rate of Illumina sequencing errors. Such sequence specific miscalls probably arise due to specific inhibition of polymerase binding. Homopolymer runs cause problems for technologies utilising a terminator free chemistry (such as Roche 454 and Ion Torrent), and specific error profiles exist for other sequencing technologies as well.

Sequencing coverage showed up to six-fold variation across the complete mtDNA and this variation was highly repeatable in sequencing of multiple individuals of the same species. Moreover, coverage in orthologous regions was correlated between the two species and was negatively correlated with GC content. We also found a negative correlation between the site-specific sequencing error rate and coverage, with certain sequence motifs "CCNGCC" being particularly prone to high rates of error and low coverage.
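The kind of relationship described in that last quote can be sketched in a few lines of Python. The sequence and the coverage values below are made up for illustration, and the function names are hypothetical; this is not the analysis from the paper, only the general shape of a GC-versus-coverage check.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def windows(seq, size):
    """Split seq into consecutive non-overlapping windows of the given size."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up 40-base sequence and made-up mean read depth for each 10-base window.
sequence = "ATATATTAAT" "ATGCATATGC" "GCGCATGCGC" "GGCCGCGGCC"
coverage = [120, 95, 60, 30]

gc = [gc_fraction(w) for w in windows(sequence, 10)]
r = pearson(gc, coverage)
print(gc, round(r, 2))
```

A strongly negative r for real data would correspond to the pattern reported in the quote, where GC-rich windows receive fewer reads.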

The second paper is by Frederic Bertels, Olin K. Silander, Mikhail Pachkov, Paul B. Rainey and Erik van Nimwegen (2014. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Molecular Biology and Evolution 31: 1077-1088). They discuss the situation where raw short-sequence reads from each DNA sample are directly mapped to the genome sequence of a single reference genome. They note:

There are reasons to suspect that such reference-mapping-based phylogeny reconstruction methods might introduce systematic errors. First, multiple alignments are traditionally constructed progressively, that is, starting by aligning the most closely related pairs and iteratively aligning these subalignments. Aligning all sequences instead to a single reference is likely to introduce biases. For example, reads with more SNPs are less likely to successfully and unambiguously align to the reference sequence, as is common in alignments of more distantly related taxa. This mapping asymmetry between strains that are closely and distantly related to the reference sequence may affect the inferred phylogeny, and this has indeed been observed. Second, as maximum likelihood methods explicitly estimate branch lengths, including only alignment columns that contain SNPs and excluding (typically many) columns that are nonpolymorphic, may also affect the topology of the inferred phylogeny. This effect has been described before for morphological traits and is one reason long-branch attraction can be alleviated with maximum likelihood methods when nonpolymorphic sites are included in the alignment.

We identify parameter regimes where the combination of single-taxon reference mapping and SNP extraction generally leads to severe errors in phylogeny reconstruction. These simulations also show that even when including nonpolymorphic sites in an alignment, the effect of mapping to a single reference can lead to systematic errors. In particular, we find that when some taxa are diverged by more than 5-10% from the reference, the distance to the reference is systematically underestimated. This can generate incorrect tree topologies, especially when other branches in the tree are short.
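Why the distance to the reference gets underestimated can be seen in a deliberately crude model, which is not the simulation used by Bertels et al. Assume (hypothetically) that every read is 100 bases long, that the number of differences from the reference in a read is binomially distributed, and that a read maps only if it has at most 5 differences. The divergence seen among the reads that do map is then:

```python
from math import comb


def apparent_divergence(true_div, read_len=100, max_mismatches=5):
    """Mean per-site divergence among reads that still map to the reference.

    Toy model: the number of differences in a read is Binomial(read_len, true_div),
    and a read maps only if it has at most max_mismatches differences.
    Returns (apparent divergence, fraction of reads that map).
    """
    pmf = [comb(read_len, k) * true_div**k * (1 - true_div) ** (read_len - k)
           for k in range(max_mismatches + 1)]
    mapped = sum(pmf)  # probability that a read maps at all
    mean_diffs = sum(k * p for k, p in enumerate(pmf)) / mapped
    return mean_diffs / read_len, mapped


for d in (0.01, 0.05, 0.10):
    est, frac = apparent_divergence(d)
    print(f"true={d:.2f}  apparent={est:.3f}  reads mapped={frac:.1%}")
```

Under these assumptions the apparent divergence can never exceed max_mismatches / read_len, so it is close to the truth for a taxon that is 1% diverged but well short of it for one that is 10% diverged. The most divergent reads are exactly the ones that are lost.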

These issues are part of the current "gee-whizz" phase of phylogenomics, in which over-optimism prevails over realism, and volume of data is seen as the important thing. David Roy Smith (2014. Last-gen nostalgia: a lighthearted rant and reflection on genome sequencing culture. Frontiers in Genetics 5: 146) has recently commented on this:

The promises of NGS have, at least for me, not lived up to their hype and often resulted in disappointment, frustration, and a loss of perspective.

I was taught to approach research with specific hypotheses and questions in mind. In the good ol' Sanger days it was questions that drove me toward the sequencing data. But now it’s the NGS data that drive my questions ... I'm trapped in a cycle where hypothesis testing is a postscript to senseless sequencing.

As we move toward a world with infinite amounts of nucleotide sequence information, beyond bench-top sequencers and hundred-dollar genomes, let’s take a moment to remember a simpler time, when staring at a string of nucleotides on a screen was special, worthy of celebration, and something to give us pause. When too much data were the least of our worries, and too little was what kept us creative. When the goal was not to amass but to understand genetic data.

Monday, August 11, 2014

The science of advantage gambling in casino blackjack

In many games of chance the odds of winning or losing remain constant during play, such as in the street coin-game Two-Up and for the casino Roulette wheel. At the other extreme, the odds of winning are sometimes determined by the players to a much greater extent, such as in the card game Poker. This is why poker is such a popular form of gambling — all players are under the delusion that the advantage lies with them alone.

In between these extremes, there are games of chance where the odds of winning vary depending on the circumstances. If a player can identify these circumstances, then they can increase their wagers when the circumstances are favorable and decrease them when they are unfavorable, thus maximizing their chances of making a profit. This is called Advantage Gambling, and it is amenable to formal mathematical analysis. These analyses have kept a number of mathematicians gainfully employed over the centuries.

Some well-known examples of advantage gambling are the use of Arbitrage Bets in sports betting, and of Card Counting in card games. This blog post is about the latter, especially as applied to the casino card-game of Blackjack. [There are also many similar games played both inside and outside casinos, such as Twenty-one, Vingt-et-un, Spanish 21, Pontoon, etc.]

In blackjack the player is betting their card hand against that of the dealer (not any other player). The basic idea is to be dealt a hand of cards whose face values sum to a final score that is higher than that of the dealer's hand without exceeding a sum of 21. There are many variants throughout the world, although they tend to be minor variations on a single basic theme (as described by Wikipedia). In general, the dealer follows a strict set of rules specifying how many cards they can be dealt, while the player has a free choice regarding their own hand.

Clearly, the composition of the cards being dealt must change throughout a series of hands being dealt, because the deck of cards (or more usually several decks) gradually becomes exhausted. If the cards have been shuffled so that the random order of the cards is very even then there will be little change in composition through time, but if the random order is clustered (as it can be by random chance) then the composition of the cards remaining to be dealt may favor either the dealer or the player.

This favoritism happens because the dealer has to follow a fixed strategy, and certain cards favor that strategy. In particular, the dealer must always be dealt another card when their hand sums to a total in the range 12-16 (and sometimes 17). If the card dealt is a 10, J, Q or K (all of which have a value of ten) then the dealer's sum will exceed 21, and the player will win. Thus, if there is a high proportion of these cards remaining in the deck then the dealer is at a disadvantage relative to the player, who can choose not to take the extra card. On the other hand, if there is a high proportion of low cards remaining (especially 4s, 5s and 6s) then the dealer will not be disadvantaged.

In general play, the casino dealer will have an advantage of 0.5-1%, depending on the precise rules of play and how many decks of cards are in use simultaneously. So, in the long term the casino will make a profit, which is why they are in the gambling business in the first place. However, they make a smaller profit from blackjack than from any of their other games (for example, in roulette the casino's advantage is usually 5.3% in the USA and 2.7% in Europe), and this means that for blackjack the advantage gambler doesn't have to move the advantage very far for it to be in their favor instead of the casino's.
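The roulette figures quoted above can be reproduced with a short calculation. This sketch assumes the standard wheel layouts (38 pockets in the USA, 37 in Europe) and a single-number bet paying 35 to 1, none of which is stated in the text above; the function name `house_edge` is made up for illustration.

```python
from fractions import Fraction


def house_edge(pockets, payout=35):
    """Casino advantage on a one-unit single-number bet paying payout-to-1."""
    p_win = Fraction(1, pockets)
    # Player's average gain per unit bet: win the payout, or lose the stake.
    expected_return = p_win * payout - (1 - p_win) * 1
    return -expected_return


american = house_edge(38)  # pockets 1-36 plus 0 and 00
european = house_edge(37)  # pockets 1-36 plus a single 0
print(f"{float(american):.1%}  {float(european):.1%}")
```

The exact values are 1/19 and 1/37, which round to the 5.3% and 2.7% given above.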

There is a Basic Strategy in blackjack, which stipulates what the gambler should do when their hand has any specified total against that of the dealer's — that is, whether they should Stand, Hit, Double Down, or Split. This was first explained by Roger Baldwin, Wilbert E. Cantey, Herbert Maisel and James P. McDermott in 1956 (Optimum strategy in blackjack. Journal of the American Statistical Association 51: 429-439); and Wikipedia provides a simple exposition. For the gambler, this strategy will lose the least amount of money to the casino in the long term (ie. lose only the 0.5-1% referred to above), as determined by mathematical analysis.

The advantage gambler wants to change these odds. The most common advantage play for blackjack is card counting, and it can change the advantage to be up to 2% in the gambler's favor. The essential idea is to keep a running track of whether the remaining undealt cards are biased towards small values (2, 3, 4, 5, 6) or large values (10, J, Q, K, A). To do this, a pre-specified value is added to the running total for each of the small cards that have already been dealt (and therefore can't still be in the deck), and a pre-specified value is subtracted for each of the large cards. The value of the running count will then indicate how much the advantage is in favor of the gambler. The gambler can then bet according to the size of their advantage.
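As a minimal sketch of the bookkeeping, here is a running count in Python. The point values are those commonly published for the Hi-Lo system (+1 for 2 to 6, 0 for 7 to 9, -1 for ten-valued cards and aces); choosing Hi-Lo is an assumption made for illustration, since the paragraph above does not fix the values and other systems use different ones.

```python
# Hi-Lo point values: an illustrative choice, not the only possible one.
HI_LO = {r: +1 for r in ("2", "3", "4", "5", "6")}
HI_LO.update({r: 0 for r in ("7", "8", "9")})
HI_LO.update({r: -1 for r in ("10", "J", "Q", "K", "A")})


def running_count(dealt, tags=HI_LO):
    """Sum the point value of every card seen so far."""
    return sum(tags[card] for card in dealt)


seen = ["5", "K", "2", "6", "9", "3", "A", "4"]  # hypothetical cards already dealt
print(running_count(seen))
```

A positive count means that more small cards than large ones have been dealt, so the undealt cards are richer in tens and aces. Under Hi-Lo a complete deck counts to zero, which makes it easy to check that no card has been missed.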

There is nothing unique about this: "anyone who aspires to play Bridge, Stud Poker, Rummy, Gin, Pinochle, or Go Fish knows that you must keep track of the played cards" (Norman Wattenberger. 2009. Modern Blackjack: an Illustrated Guide to Blackjack Advantage Play). It requires no especial mathematical ability, although you do have to pay attention, and not forget what your count currently is (this is far simpler than playing bridge, where to play well you need to keep precise information about the remaining cards). Blackjack has apparently increased in popularity over the past 40 years because it is one of the few casino games that can consistently be won using expert play (maybe also video poker). However, the casinos will, not unexpectedly, try to stop you from winning via card counting.

The idea of counting cards in blackjack has been around since at least the 1950s, but the first popular text on the subject was Edward O. Thorp's book Beat the Dealer: a Winning Strategy for the Game of Twenty-One (1962). Since then, oodles of card counting systems have been devised, which differ in how many points are to be added to or subtracted from the running total for each card that is dealt. They range from relatively easy to implement to unnecessarily difficult.

We can look at the relationship between the different counting systems using a phylogenetic network. The data for 24 of these systems are available at Norman Wattenberger's Card Counting page (see also Popular Card Counting Strategies). The above graph is a NeighborNet (based on the Manhattan distance) of these data. Those counting systems that are near each other in the network have a similar assignment of points to cards, while systems further apart are progressively more different from each other. The network shows a simple trend of increasing complexity of the systems from the top-right to the bottom-left. [Note that some of the systems use the same points, and thus appear at the same place in the network, but these do differ in other ways.]
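To show what the Manhattan distance measures here, the following sketch compares three systems using their commonly published point values. These values are typed in for illustration and are not the 24-system dataset behind the network; treating 10, J, Q and K as a single column is also a simplifying assumption.

```python
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "A"]  # "10" stands for 10, J, Q, K

# Point values per rank, in RANKS order (illustrative, as commonly published).
SYSTEMS = {
    "Hi-Lo":    [1, 1, 1, 1, 1, 0, 0, 0, -1, -1],
    "K-O":      [1, 1, 1, 1, 1, 1, 0, 0, -1, -1],
    "Hi-Opt I": [0, 1, 1, 1, 1, 0, 0, 0, -1, 0],
}


def manhattan(a, b):
    """Sum of absolute differences between two point assignments."""
    return sum(abs(x - y) for x, y in zip(a, b))


names = list(SYSTEMS)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, manhattan(SYSTEMS[first], SYSTEMS[second]))
```

Two systems that assign identical points have a distance of zero, which is why they would sit at the same place in the network. A full matrix of such pairwise distances is the kind of input from which a NeighborNet is built.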

This trend correlates quite well with the perceived ease of use of the systems, with the hardest ones to use being highlighted in red in the network and the medium ones in blue. The hardest ones do seem to be the most successful at predicting good betting situations. However, the consensus seems to be that the most complex systems are not that much better than some of the simpler ones — these are slightly less powerful but far easier to use. That is, the differences in difficulty are much greater than are the differences in performance, and so the complex ones are rarely recommended these days.

The powerful but simple systems include KISS III, K-O, REKO and Red Seven. Indeed, K-O appears to be becoming one of the most popular card counting systems. However, the older Hi-Lo is probably the most used counting strategy in existence.

Other games

Actually, consistently winning at blackjack is now old hat. What is far more interesting is trying to be an advantage gambler at games like lotto and the lotteries. Advantage gambling at lotto turns out sometimes to be an investment strategy rather than a gamble. For example, there have been times when the prize money has actually been greater than the cost of the betting tickets required to cover all of the needed number combinations (see The International Lotto Fund) and other times when the prize distribution has made each ticket worth more than it costs (see Massachusetts' Cash WinFall). My favorite, though, is trying to work out how to use advantage gambling for scratch lotteries, the gambling that usually has the worst chance of winning (see this article about Joan Ginther, who has clearly tried).
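The "cover every combination" arithmetic can be sketched as follows. The game parameters (6 numbers drawn from 40, a ticket price of 1, and a jackpot of 5,000,000) are hypothetical, and the calculation deliberately ignores shared jackpots, taxes, smaller prizes and the logistics of actually buying the tickets, all of which matter in practice.

```python
from math import comb

def cover_all_combinations(pool, picks, ticket_price, jackpot):
    """Return (number of tickets, total cost, jackpot minus cost) for buying every combination."""
    n_tickets = comb(pool, picks)       # number of distinct number combinations
    cost = n_tickets * ticket_price
    return n_tickets, cost, jackpot - cost

# Hypothetical game: choose 6 numbers from 40, tickets cost 1 each.
n_tickets, cost, margin = cover_all_combinations(40, 6, 1, 5_000_000)
print(n_tickets, cost, margin)
```

A positive margin is the situation described above, where the prize money exceeds the cost of covering all of the combinations.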

Wednesday, August 6, 2014

Tree Alignment Graphs and data-display networks

Data-display networks are a means of visualizing complex patterns in multivariate data. One particular use is for displaying the patterns in a set of trees. For example, Consensus Networks and SuperNetworks are splits graphs that display the patterns common to some specified subset of a collection of trees (e.g. a set of equally optimal trees, or a set of trees sampled by a Bayesian or bootstrap analysis). Alternatively, Parsimony Networks try to simultaneously display all of the trees in a collection of most-parsimonious trees for a single dataset.

Another display method for multiple trees is what has been called a Cloudogram (see the post Cloudograms and data-display networks). These superimpose the set of all trees arising from an analysis, so that dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement.

Yet another method for combining trees into a graph while retaining all of the original information from the source trees is the Tree Alignment Graph (TAG), an idea introduced by Stephen A. Smith, Joseph W. Brown and Cody E. Hinchliff (2013. Analyzing and synthesizing phylogenies using tree alignment graphs. PLoS Computational Biology 9: e1003223).

The authors note:

These methods address the problem of identifying common nodes and edges across sets of phylogenetic trees and constructing a data structure that efficiently contains this information while retaining original source information ... Mapping trees into a TAG exploits the fact that rooted phylogenetic trees are in fact a specific type of graph: they are directed, acyclic, and require that each node has, at most, one parent. By relaxing these requirements, we can combine multiple trees into a common graph, while minimizing changes to the semantic interpretations of nodes and edges in the trees. Because they contain nodes and edges directly analogous to those from their source trees, TAGs have the desirable quality of retaining the full identifiability of the original source trees they contain. Additionally, because they are not restricted to the bifurcating model of evolution, TAGs may represent conflict among source trees as reticulations in the graph.

The basic principle is illustrated in the first figure (above). Internal nodes represent collections of terminal nodes, and arcs (directed edges) represent their relationships. Nodes and arcs are added to the growing TAG, each of which represents one relationship shown in one of the original trees. TAG A in the figure shows the result of combining the black, blue and orange trees, while TAG B shows the result of then adding the gray and green trees to TAG A (the arcs are colour-coded). The resulting TAG is thus a database of all of the original information, which can then be queried in any way to provide summaries of the data. In particular, standard network summaries can be used, such as node degree, which will highlight parts of the TAG with interesting characteristics.
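The following is a much-simplified sketch of this idea, not the authors' implementation. It assumes that all of the source trees have the same set of tips (the published method also deals with partially overlapping taxon sets), that trees are written as nested tuples, and that a node is identified by the set of tips below it. All of the function names and the three toy trees are hypothetical.

```python
from collections import defaultdict

def tips(tree):
    """Return the frozenset of terminal labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))

def add_tree(tag, tree, source):
    """Add every parent-to-child arc of one rooted tree to the graph, labelled by its source."""
    if isinstance(tree, str):
        return
    parent = tips(tree)
    for child in tree:
        tag[(parent, tips(child))].add(source)   # arcs remember which trees contain them
        add_tree(tag, child, source)

def parents_of(tag, node):
    """Return the distinct parent nodes of a node; more than one indicates conflict."""
    return {parent for (parent, child) in tag if child == node}

tag = defaultdict(set)   # maps (parent, child) arcs to the set of source trees
add_tree(tag, ((("A", "B"), "C"), "D"), "tree1")
add_tree(tag, ((("A", "C"), "B"), "D"), "tree2")
add_tree(tag, ((("A", "B"), "C"), "D"), "tree3")

print(len(parents_of(tag, frozenset("A"))))      # A sits under {A,B} and under {A,C}
print(sorted(tag[(frozenset("ABC"), frozenset("AB"))]))
```

Because each arc keeps the labels of the trees that contributed it, the source trees remain identifiable, and a node with more than one parent marks a place where the trees disagree.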

The authors provide two empirical examples of applications. The one shown here involves 100 bootstrap trees for 640 species representing the majority of known lineages from the Angiosperm Tree of Life dataset (chloroplast, mitochondrial, and ribosomal data). The TAG is shown lightly in the background. Superimposed on this, the nodes are coloured to represent the effective number of parent nodes, and their size represents node bootstrap support. Highly supported nodes with a low number of effective parents (large blue nodes) are frequently recovered and confidently placed in the source trees, while highly supported nodes with a high number of effective parents (large and pink or orange) are frequently resolved in the source trees but their placement varies among bootstrap replicates. So, the three largest problem areas as illustrated in the TAG correspond to the Malpighiales, Lamiales and Ericales.

For comparison, a NeighborNet analysis of the same data is shown in the blog post When is there support for a large phylogeny? This simply shows an unresolved blob.

Monday, August 4, 2014

A network of cheese rind microorganisms?

Cheese making is about 8,000 years old, and there are now about 1,000 distinct types of cheese throughout the world. As with most ancient crafts, the art of making cheese is to get the microbes to do most of the work for you.

To this end, there has been much interest in the microbial communities that occur in cheese rinds (the bit around the outside). Different communities are expected to be associated with different styles of cheese, since the production process can be quite different. This is shown in the first figure, which emphasizes that much of the difference between cheeses is due to different maturation procedures.

From Wolfe et al. (2014).

Recently, Wolfe BE, Button JE, Santarelli M, and Dutton R (2014. Cheese rind communities provide tractable systems for in situ and in vitro studies of microbial diversity. Cell 158: 422-433) had a look at the dominant genera of bacteria and microfungi in the rind communities of 137 different types of cheese. They don't actually tell us much about which cheeses these were, merely claiming:

We attempted to evenly sample across rind type (24 bloomy rind cheeses, 52 washed rind cheeses, and 61 natural rind cheeses) and geographic regions (87 European cheeses across 9 countries; 50 American cheeses across 13 states from the West Coast to the east Coast). We also attempted to sample across different milk types (77 cow milk, 34 goat milk, 21 sheep milk, and 5 mixed milk) and milk treatments (99 raw milk, 38 pasteurized).

Based on sequencing the bacterial 16S and fungal ITS loci, the authors identified 14 bacterial and 10 fungal genera (moulds and yeasts) that occurred with an average abundance of >1%, as shown in the next figure.

The 137 rind samples with their bacterial (middle row) and fungal (bottom row) genera indicated by different colours. The order of the samples was determined by UPGMA clustering (top row).
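For readers unfamiliar with UPGMA, here is a minimal sketch of the clustering step. The four abundance profiles are hypothetical, and the use of the Bray-Curtis dissimilarity is an assumption made for illustration (the distance measure used for the figure is not stated here). The function names are also hypothetical.

```python
from itertools import combinations

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

def upgma(profiles, distance):
    """Repeatedly merge the two clusters with the smallest average pairwise distance."""
    clusters = [(name,) for name in profiles]
    merges = []

    def average(c1, c2):
        # UPGMA: unweighted mean over all pairs of members from the two clusters
        total = sum(distance(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, c1, c2 = min(((average(x, y), x, y) for x, y in combinations(clusters, 2)),
                        key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges

# Hypothetical percentage abundances of three genera in four rind samples.
profiles = {
    "rind1": [60, 30, 10],
    "rind2": [40, 40, 20],
    "rind3": [10, 20, 70],
    "rind4": [10, 10, 80],
}

merges = upgma(profiles, bray_curtis)
for c1, c2, d in merges:
    print(c1, c2, round(d, 3))
```

The order in which the samples are merged is what determines their left-to-right order in a dendrogram of this kind.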

The authors also used shotgun metagenomic sequencing to identify a range of genes in the microorganisms. They present a phylogeny of one particular gene (shown in the next figure) that shows a close relationship between some of the cheese microbes and marine bacteria:

The widespread distribution and high abundance of marine-associated gamma-Proteobacteria, enriched in both washed and bloomy rind cheeses, was an unexpected finding in our survey of taxonomic diversity ... One possible source of these marine microbes is the sea salt used in cheese production.

[Note: the other cheese rind bacterium shown in the phylogeny, Brevibacterium linens, is the one responsible for the unbelievable smell of washed-rind cheeses such as Epoisses, Münster and Limburger. It is also responsible for personal-hygiene issues such as foot odour. You can imagine how it first got into cheese making!]

However, Ropars J, Cruaud C, Lacoste S, and Dupont J (2012. A taxonomic and ecological overview of cheese fungi. International Journal of Food Microbiology 155: 199-210), in a related study, have pointed out the usual problem with microbial phylogenies: gene trees are frequently incongruent. So, the gene phylogeny shown above is not likely to be the species phylogeny. It would thus be of great interest to investigate the full microbial network, rather than looking at a single tree.