Wednesday, June 4, 2014

Phylogenetic networks as multivariate data displays

Over the past two years or so of blogging, I have presented a number of empirical examples in which I have used splits graphs as general multivariate data summaries, rather than using them for the analysis of what we might call strictly phylogenetic data. I have listed these analyses at the bottom of this post.

There have been two reasons for doing these analyses. First, I wish to emphasize that unrooted networks are a form of data display, not evolutionary diagrams. That is, they do not display evolutionary history in the way that rooted phylogenetic trees, for example, are intended to. Unrooted networks can be a valuable tool for exploring phylogenetic data, but they do not display a phylogeny. They are a form of exploratory data analysis.

Second, these networks form part of a much larger class of methods for the analysis of multivariate data. Indeed, I believe that they are a very valuable part of this class. One way to illustrate this has been to analyze a whole series of datasets that have little to do with phylogenetic analysis; that is, the data are not necessarily even related to a historical trend. These examples show that the methods work as general-purpose data summaries, not only as tools for phylogenetics.

I have now formalized this point of view in a peer-reviewed publication:
Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(4): in press (Online Early). doi:10.1002/widm.1130
Exploratory data analysis (EDA) involves both graphical displays and numerical summaries of data, intended to evaluate the characteristics of the data as well as providing a form of data mining. For multivariate data, the best-known visual summaries include discriminant analysis, ordination and clustering, particularly metric ordinations such as Principal Components Analysis. However, these techniques have limiting mathematical assumptions that are not always realistic. Recently, network techniques have been developed in the biological field of phylogenetics that address some of these limitations. They are now widely used in biology under the name phylogenetic networks, but they are actually of general applicability to any multivariate dataset. Phylogenetic networks are fast and relatively easy to calculate, which makes them ideal as a tool for EDA. This review provides an overview of the field, with particular reference to the use of what are called splits graphs. There are several types of splits graph, which summarize the multivariate data in different ways. Example analyses are presented based on the neighbor-net graph, which seems to be the most generally useful of the available algorithms. This should encourage the more widespread use of these networks whenever a summary of a multivariate dataset is required.
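
As a rough illustration of how a non-phylogenetic dataset might be prepared for a neighbor-net analysis, here is a minimal Python sketch. It standardizes the variables, computes pairwise Euclidean distances between the objects, and writes the result as a NEXUS-style distance matrix. The choice of Euclidean distance, the function names, the toy values and the exact layout of the distances block are all assumptions made for this example; they are not taken from the paper, and the file layout should be checked against the documentation of whichever network program you use.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance so no variable dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # a constant column becomes all zeros
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def distance_matrix(rows):
    """Pairwise Euclidean distances between objects (one object per row)."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def to_nexus(labels, dist):
    """Format a square distance matrix as NEXUS-style text (assumed layout)."""
    n = len(labels)
    lines = [
        "#NEXUS",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={n};",
        "  TAXLABELS " + " ".join(labels) + ";",
        "END;",
        "BEGIN distances;",
        f"  DIMENSIONS ntax={n};",
        "  FORMAT triangle=both diagonal labels;",
        "  MATRIX",
    ]
    for label, row in zip(labels, dist):
        lines.append("    " + label + " " + " ".join(f"{d:.4f}" for d in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up toy data: four objects scored on three variables.
    labels = ["obj_A", "obj_B", "obj_C", "obj_D"]
    data = [
        [7.0, 2.0, 5.0],
        [6.5, 2.5, 4.0],
        [2.0, 8.0, 6.0],
        [3.0, 7.5, 1.0],
    ]
    print(to_nexus(labels, distance_matrix(standardize(data))))
```

The printed text can be saved to a file and opened in a program that builds splits graphs from distances. Standardizing first is only one reasonable option; for data that are counts, ranks or presence/absence scores, a different distance measure may suit the data better.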

If you don't have subscription access to the journal, you can contact me for a PDF copy.

Blog posts with multivariate data summaries:

Datasets involving temporal patterns

Network analysis of Genesis 1:3
Network of ancient Thai bronze Buddha images
Language history and language weirdness
Pacific rock art - ordinations and networks
The network history of the Carnival of Evolution
The rise and fall of "David"

Datasets with no phylogenetic pattern

Eurovision Song Contest 2006: a network analysis
Network analysis of scotch whiskies
Network analysis of Bordeaux wine critics
Network analysis of Bordeaux wine critics II
A network analysis of Médoc wines
Eurovision Song Contest 2012: a network analysis
Phylogenetic network of the FIFA World Cup
Astrocladistics: a network analysis
Network analysis of McDonald's fast-food
Is there good and bad fast-food?
The mysterious rankings in Forbes' Celebrity 100
Network analysis of Michelin starred restaurants
Network analysis of New York neighborhoods
A network analysis of Simon and Garfunkel
Network analysis of Manhattan apartment buildings
A network analysis of the Bundesliga
Networks of the "Sight & Sound" film polls
A network analysis of London's theatres in 1965
The acoustics of the Sydney Opera House
A network of New Zealand's livestock regions
A network analysis of airplane disasters
World ice hockey champions — a network
Fast-food maps — a network analysis
Single-malt scotch whiskies — a network
Which cars are good, really?
The Netherlands is more than just tulips and sea-dykes
Automated natural language processing
Cancer rates and diagnosis

Theoretical considerations

Distortions and artifacts in Principal Components Analysis analysis of genome data
Networks can outperform PCA ordinations in phylogenetic analysis
Multivariate data displays are not always necessary
