Monday, August 5, 2013

Ngrams and phylogenetics

Google Books is an archive of published books. For example, it is a very good place to search for scanned copies of books that are in the public domain, which can then be downloaded as PDFs. It also has copies of modern in-print books, which can be searched but not downloaded. The Ngram Viewer is a part of Google Books. It plots a graph showing how often a given expression occurs in books through time, from 1800 to 2008.

So, I thought that it might be interesting to search for a few expressions of relevance to readers of this blog. I will let the graphs speak for themselves.

Just to be clear about the scale of the vertical axis, I quote from the instructions:
What the y-axis shows is this: of all the bigrams [two-word expressions] contained in our sample of books written in English and published in the United States, what percentage of them are "phylogenetic network" or "phylogenetic tree"?
It is worth noting that Google Books contains some journal volumes, and so its definition of "book" is rather vague. Also, the dating of some of the books can best be described as bizarre.
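To make the quoted y-axis definition concrete, here is a minimal sketch of the same calculation on a toy sentence. It is not Google's actual pipeline; the function name bigram_percentages and the toy text are made up for illustration, and the tokenisation (lower-casing and splitting on whitespace) is an assumption.

```python
from collections import Counter


def bigram_percentages(text, targets):
    """Percentage of all bigrams in `text` accounted for by each target bigram.

    Hypothetical helper: tokenisation is a plain lower-case whitespace split,
    which is an assumption and much cruder than a real corpus pipeline.
    """
    words = text.lower().split()
    # Count every pair of adjacent words.
    bigrams = Counter(zip(words, words[1:]))
    total = sum(bigrams.values())
    percentages = {}
    for target in targets:
        key = tuple(target.lower().split())
        percentages[target] = 100.0 * bigrams[key] / total if total else 0.0
    return percentages


# Toy "corpus" (invented): 14 words, hence 13 bigrams.
toy_text = (
    "the phylogenetic tree and the phylogenetic network differ "
    "but the phylogenetic tree is older"
)
result = bigram_percentages(
    toy_text, ["phylogenetic tree", "phylogenetic network"]
)
print(result)
```

In a real corpus the denominator is enormous, which is why the percentages on the vertical axis of the graphs are so tiny.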

We can expand the "phylogenetic network" search, to get more detail. [Note: we could also scale the graph above, as discussed in the Comment below by Joachim Dagg.] We could also compare this graph to Philippe Gambette's publication graph for Who is Who in Phylogenetic Networks, which doesn't show the same explosive growth after 2000. Mind you, before 2000 there weren't many books that could mention networks.
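The scaling idea mentioned in the note can be sketched as follows: multiply the rarer series by a constant so that its peak matches the peak of the more common one. This is only an illustration; the function name scale_to_match and the numbers are hypothetical, not values read from the Ngram Viewer.

```python
def scale_to_match(rare, common):
    """Return (factor, scaled) so that the peak of `rare` equals the peak of `common`.

    Hypothetical helper: matching the two maxima is one possible choice of
    factor; any other constant would preserve the curve's shape equally well.
    """
    peak = max(rare)
    if peak == 0:
        raise ValueError("cannot scale a series that is zero everywhere")
    factor = max(common) / peak
    return factor, [value * factor for value in rare]


# Invented percentages for three time points, NOT real Ngram data.
tree_series = [0.5, 1.0, 2.0]
network_series = [0.0, 0.01, 0.02]

factor, scaled_network = scale_to_match(network_series, tree_series)
print(factor, scaled_network)
```

Multiplying by a constant changes only the height of the curve, not its shape, so the two expressions can be compared on one graph as long as the factor is stated.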

We could also try alternative "tree" searches. We might then ask: why does "evolutionary tree" die off as an expression after 2000?


  1. You can also multiply the bigram "phylogenetic network" by a factor to bring both curves onto the same scale. Looks like this:

    1. Thanks for pointing that out, Joachim. It looks much nicer. David