Monday, July 9, 2018

Using splits graphs for multivariate data analysis


Data containing multiple measurements for each of a set of objects are usually too complex to be viewed easily in their raw form. Therefore, methods have been developed to summarize the data as something simpler. This is called multivariate data analysis.

One of the issues that needs to be addressed is that a data summary is designed to lose information. The goal is to somehow keep the most important information in the summary. Clearly, the simpler the summary, the more information we are likely to lose.

This post is a simplistic introduction to why splits graphs, which were originally developed to summarize multivariate phylogenetic data, are usually very good data summaries. It compares the ability of maps, indexes and networks to summarize data.

Maps

A map is a 2-dimensional drawing of some piece of 4-dimensional space-time. For example, the map shown here represents the southern part of Scandinavia.

A map is quite successful as a data summary. It reduces the 4-dimensional world down to 2+ dimensions: latitude and longitude are represented accurately; we use symbols or colors/shading to represent altitude; and we choose one specific time (thus eliminating that dimension). We can therefore reconstruct much of the 3-dimensional world from looking at a map (i.e. much of the original information is retained in the summary).


In our example, we can see even from a glance at the map that Denmark is as flat as a pancake, Norway is very hilly, and Sweden is somewhere in between. We can also see that Uppsala and Oslo are at the same latitude, and that the simplest way to get from Uppsala to Trondheim is likely to be via Östersund rather than Oslo.

Indexes

An index is a linear ordering of numbers measuring some calculated characteristic of a set of objects. It condenses a series of measurements for each object down to a single number. The index shown here refers to the hotels in Östersund (which we might stay at on our way from Uppsala to Trondheim), and indicates the overall quality score from a well-known online booking site. The index summarizes a set of features of the hotels that might be of interest to potential guests.

Hotel                                Quality score
Hotell Emma                          8.9
Clarion Hotell Grand                 8.7
Hotell Stortorget                    8.6
Quality Hotell Frösö Park            8.6
Hotell Jämteborg                     8.3
Best Western Hotell Ett              8.1
Best Western Hotell Gamla Teatern    8.0
Hotell Älgen                         7.9
Hotell Zäta                          7.8

Unfortunately, an index is rarely very successful as a data summary. It reduces multi-dimensional data down to only 1 dimension. Therefore, we cannot tell which dimensions contribute to each value of the index — the same value could arise in many different ways. We therefore cannot reconstruct any of the original dimensions — what goes into the summary cannot come back out (as it can for a map).



Feature            Hotell Stortorget    Quality Hotel Frösö Park
Staff              8.9                  8.7
Location           9.4                  8.9
Cleanliness        9.1                  8.3
Comfort            8.5                  8.2
Facilities         7.7                  8.8
Breakfast          8.5                  8.5
Free WiFi          9.1                  8.7
Value for money    8.3                  8.9

In our example, two of the hotels have exactly the same index score, but this does not mean that they are alike in their individual quality features, as shown above. For instance, there are notable differences between them in Location and Value for money, and even larger differences in Cleanliness and Facilities. This information is lost in the calculation of the quality index.
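The loss can be made concrete with a short Python sketch that uses the feature scores of the two hotels tabulated above. The booking site's actual formula is not given here, so the unweighted mean in `mean_index` is only an assumption (it approximates, but does not exactly reproduce, the published 8.6), and the function names are purely illustrative.

```python
# Feature scores copied from the two-hotel table, in the order it lists them.
FEATURES = ["Staff", "Location", "Cleanliness", "Comfort",
            "Facilities", "Breakfast", "Free WiFi", "Value for money"]

stortorget = [8.9, 9.4, 9.1, 8.5, 7.7, 8.5, 9.1, 8.3]
froso_park = [8.7, 8.9, 8.3, 8.2, 8.8, 8.5, 8.7, 8.9]


def mean_index(scores):
    """Collapse a feature profile to a single number.

    ASSUMPTION: an unweighted mean; the site's real formula is unknown.
    """
    return sum(scores) / len(scores)


def feature_gaps(a, b):
    """Signed per-feature differences (a minus b), rounded to one decimal."""
    return {name: round(x - y, 1) for name, x, y in zip(FEATURES, a, b)}


gaps = feature_gaps(stortorget, froso_park)
# The feature on which the two hotels differ most, ignoring the sign.
largest = max(gaps, key=lambda name: abs(gaps[name]))

print("index gap:", abs(mean_index(stortorget) - mean_index(froso_park)))
print("feature gaps:", gaps)
print("largest gap:", largest)
```

Under this assumption the two single-number summaries differ by less than 0.1, while the individual features differ by as much as 1.1 (Facilities). Nothing in the index alone would let us recover those gaps.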

Networks

A splits graph (a type of phylogenetic network) is a 2-dimensional drawing of some multi-dimensional set of data, such as might be used to calculate an index. The network shown here is based on the same data used to calculate the quality index above.

A network reduces multi-dimensional data down to 2+ dimensions. Each object is represented as a point. The spatial relationship of the points (their neighborhood) has meaning, and so do the inter-connecting lines (they represent groups supported by the data). Such a network is therefore much more successful as a summary than is an index. Like a map, it will be very successful for 3-dimensional data, with potentially reduced success as the number of dimensions increases; the rate of information loss will depend on how strongly the dimensions are correlated.


In our example, the main pattern in the network shows the relative quality of the hotels, as measured by the index, descending from top to bottom (so that all of the information from the index is in the network). However, the graph also emphasizes the difference between the two hotels with identical index scores. Indeed, it shows us that the Quality Hotell Frösö Park is probably more similar to the Clarion Hotell Grand than to the Hotell Stortorget.
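The similarity that the network displays can be thought of in terms of pairwise distances between the objects. Distance-based splits-graph methods (NeighborNet is one widely used example) start from such a matrix. This post does not say which distance or method produced its network, so the Manhattan distance and the helper names in this Python sketch are assumptions. Only the two hotels whose feature scores are tabulated earlier are included.

```python
from itertools import combinations

# Feature profiles from the two-hotel table earlier in the post, in the order
# Staff, Location, Cleanliness, Comfort, Facilities, Breakfast, Free WiFi,
# Value for money.
profiles = {
    "Hotell Stortorget":        [8.9, 9.4, 9.1, 8.5, 7.7, 8.5, 9.1, 8.3],
    "Quality Hotel Frösö Park": [8.7, 8.9, 8.3, 8.2, 8.8, 8.5, 8.7, 8.9],
}


def manhattan(a, b):
    """Sum of absolute per-feature differences (an ASSUMED choice of distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(profiles):
    """Pairwise distances, keyed by the unordered pair of object names."""
    return {
        frozenset((p, q)): manhattan(profiles[p], profiles[q])
        for p, q in combinations(profiles, 2)
    }


distances = distance_matrix(profiles)
for pair, d in distances.items():
    print(" vs ".join(sorted(pair)), round(d, 1))
```

Given the profiles of all nine hotels, the same function would yield the 36 pairwise distances that a network program could take as input. Unlike the index, those distances remain sensitive to which features differ between two hotels.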

Alternatives

There are other forms of multivariate data analysis that are often used instead of networks. Two common ones are: an ordination, which reduces multi-dimensional data down to 2 dimensions only; and a cluster tree, which reduces to 1 dimension only. These are therefore often less successful as data summaries. Indeed, a network is very much like a combination of an ordination and a cluster tree, with the best features of both methods and fewer of their limitations.

Further reading

How to interpret splits graphs

Primer of Phylogenetic Networks

Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312.
