Monday, June 11, 2012

Wordle, TreeCloud and SplitsNetworkCloud

There are several ways to analyze word frequency and usage in a block of text, and to display the result as a diagram. Here, I have applied three of them to my one and only published book (after deleting extraneous text such as the references and glossary), to find out what my writing style is like. The results are not as embarrassing as I feared.


The Wordle analysis uses word size in the diagram to represent word frequency in the text.


Reassuringly, most of the words refer to the topics of the book rather than reflecting my writing style. "Data" is one of the most used words, but this actually comes from expressions like "data-display network". Sadly, "also", "although", "however", "important", "might", "much", "necessarily", "particular", "rather" and "way" seem to get a bit of a workout in the book. The only author who makes it onto the list is "Huson", not unexpectedly.
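For readers who want to try the size-by-frequency idea on their own text, here is a minimal Python sketch. It is not Wordle's actual algorithm: the function name `word_sizes`, the tiny stop-word list and the point sizes are all illustrative assumptions. Splitting hyphenated words into their parts is also an assumption, chosen because it reproduces the effect noted above, where "data" gets counted from "data-display".

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def word_sizes(text, min_pt=10, max_pt=60, top_n=50):
    """Map the most frequent non-stop words to font sizes in points."""
    # Hyphens are not matched, so "data-display" counts as "data" + "display".
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]
    lowest = top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts tie
    # Linear scaling: least frequent word gets min_pt, most frequent gets max_pt.
    return {w: min_pt + (max_pt - min_pt) * (c - lowest) / span for w, c in top}

sizes = word_sizes("data-display network network network data tree tree")
```

A linear scale is the simplest choice; a logarithmic or square-root scale would stop a handful of very common words from dwarfing the rest.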


The TreeCloud output helps make some of the word patterns clearer, since it uses clusters on an unrooted tree to represent words that occur near each other in the text, thus introducing word context into the analysis. Far fewer words are displayed, thus focussing on topics rather than writing style.


Proximity presumably explains why the words "network" and "networks" are at opposite ends of the tree: they are used in quite different contexts in the book. This is also why both "data" and "data-display" occur in the tree, since "data patterns" is a commonly used expression. Also, the expression "shown in the figure" arises from the large number of illustrations, thus explaining the (perhaps unexpected) appearance of the words "shown" and "figure".
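The idea of turning "words that occur near each other" into distances suitable for tree building can be sketched as follows. This is only an illustration under stated assumptions: the function name `cooccurrence_distances`, the sliding-window size and the Jaccard-style formula are my choices for the example, and TreeCloud's own distance formulas and tree-building step may well differ.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_distances(tokens, vocabulary, window=10):
    """Jaccard-style distance between words, based on shared sliding windows."""
    vocab = set(vocabulary)
    windows_with = Counter()       # number of windows containing each word
    windows_with_pair = Counter()  # number of windows containing both words
    for start in range(max(len(tokens) - window + 1, 1)):
        present = sorted(vocab.intersection(tokens[start:start + window]))
        windows_with.update(present)
        windows_with_pair.update(combinations(present, 2))
    distances = {}
    for a, b in combinations(sorted(vocab), 2):
        both = windows_with_pair[(a, b)]
        either = windows_with[a] + windows_with[b] - both
        # 0.0 = always together, 1.0 = never in the same window.
        distances[(a, b)] = 1.0 - both / either if either else 1.0
    return distances

tokens = "data patterns are shown in the figure and data patterns recur".split()
dist = cooccurrence_distances(tokens, ["data", "patterns", "figure"], window=2)
```

In this toy text, "data" and "patterns" end up closer to each other than either is to "figure". A distance matrix of this kind is what a tree-building method would then take as input.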


The SplitsNetworkCloud analysis generalizes the TreeCloud output to a data-display network. This makes even clearer some of the complexity of the word associations. The number of words has been reduced, to make the diagram less complex than it would otherwise be. Also, word colour refers to the relative placement in the book, with red at the beginning and blue at the end.


Notably, "network" is not specifically associated with any particular word except "reticulations", whereas "networks" appears preferentially in the expressions "evolutionary networks" and "data-display networks". It is perhaps noteworthy that I use the expression "number of reticulations" rather than "reticulation number", thus revealing my non-mathematical background.
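The colouring by placement in the book, described above, can be illustrated with a short sketch. I am assuming here that a word's placement is summarized by the mean position of its occurrences and that the colour is blended linearly from red to blue; the function name `position_colour` is hypothetical, and the tool that produced the diagram may compute this differently.

```python
def position_colour(tokens, word):
    """Return an (r, g, b) colour: red for words used early, blue for late."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        return None  # the word does not occur in the text
    last = max(len(tokens) - 1, 1)
    # Mean relative position: 0.0 = start of the text, 1.0 = end.
    rel = sum(positions) / len(positions) / last
    return (round(255 * (1 - rel)), 0, round(255 * rel))

tokens = ["tree", "split", "split", "split", "network"]
early = position_colour(tokens, "tree")     # pure red
late = position_colour(tokens, "network")   # pure blue
```

A word spread evenly through the book would come out purple under this scheme, which is indistinguishable from a word used only in the middle; that is a limitation of summarizing placement by a single mean.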

Thanks to Philippe Gambette for producing the SplitsNetworkCloud.
