Monday, July 16, 2018

A network of World happiness

This is a joint post by Guido Grimm and David Morrison.

You may never have heard of it, but there is a World Happiness Report. This is sponsored by The Sustainable Development Solutions Network (SDSN) and The Global Happiness Council (GHC). Reports were produced in 2012, 2013, 2015 and 2017, but here we are going to look at the World Happiness Report 2018.

To quote the Report:
The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2018 ranks 156 countries by their happiness levels, and 117 countries by the happiness of their immigrants.
The rankings use data that come from the Gallup World Poll (GWP). The rankings are based on answers to the main life evaluation question asked in the poll. This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples, for the years 2015-2017.
The Report is very comprehensive in its discussion of methodology, and its limitations. It is also very ambitious in its conclusions. The main focus of the 2018 Report is comparing the happiness of immigrants with their local counterparts. Interestingly, they found no important differences between these two groups.

More importantly for this blog, the raw data are provided in an Appendix, so that anyone can look at what is going on. We have decided to do just that.

The Report's happiness index

Below is the first little bit of Figure 2.2 (extracted from the report), which "shows the average ladder score (the average answer to the Cantril ladder question, asking people to evaluate the quality of their current lives on a scale of 0 to 10) for each country, averaged over the years 2015-2017." As you can see, the people who claim that they are happiest are those in the Nordic countries (Finland plus the Scandinavian countries: Norway, Denmark, Iceland and Sweden). These are the people whom the world's cultural cliché sees as sitting for half the year in the gloom! Apparently, you have all got it wrong.

As we have noted before, an index can often do a poor job of summarizing data, because it reduces complex data down to just one dimension. The Happiness Report tries to alleviate this limitation by adding information about some of the other variables that correlate with the Happiness score, using colors:
Each of these bars is divided into seven segments, showing our research efforts to find possible sources for the ladder levels. The first six sub-bars show how much each of the six key variables is calculated to contribute to that country’s ladder score, relative to that in a hypothetical country called Dystopia, so named because it has values equal to the world’s lowest national averages for 2015-2017 for each of the six key variables
However, we can do much better than this, by using all of these variables in a phylogenetic network. The key variables are (color-coded from left to right in the figure above):
  1. Gross Domestic Product (GDP) per capita is in terms of Purchasing Power Parity (PPP)
  2. Social support [the national average of the binary responses to the Gallup World Poll]
  3. The time series of healthy life expectancy at birth
  4. Freedom to make life choices [the national average of binary responses to the GWP question]
  5. Generosity [the residual of regressing the national average of GWP responses]
  6. Perceptions of corruption [the average of binary answers to two GWP questions]
For the network, we simply put all of these variables into the analysis, along with the Happiness score.

[Technical details of our analysis: Qatar was deleted because it has too many missing values; each data variable was then standardized to zero mean and unit variance; the Gower similarity, which ignores missing values, was calculated and then converted to a distance; the distances were then displayed as a neighbor-net splits graph.]
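The first steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code we ran: the function names and the toy values are hypothetical, the Gower distance is taken as the mean range-scaled absolute difference over the variables observed in both countries (that is, one minus the Gower similarity for numeric data), and the neighbor-net step itself is left to dedicated software.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance, ignoring NaNs."""
    X = np.asarray(X, dtype=float)
    return (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)

def gower_distance(X):
    """Pairwise Gower distance for numeric data.

    For each pair of rows, only the columns observed in both rows are used,
    which is how missing values are ignored.
    """
    X = np.asarray(X, dtype=float)
    col_range = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if not both.any():
                # No shared variables: the distance is undefined.
                D[i, j] = D[j, i] = np.nan
                continue
            diff = np.abs(X[i, both] - X[j, both]) / col_range[both]
            D[i, j] = D[j, i] = diff.mean()
    return D

# Hypothetical values for three countries and three variables
# (NOT taken from the Report); one value is missing.
X = np.array([
    [7.6, 1.3, 0.95],
    [5.2, np.nan, 0.80],
    [3.1, 0.4, 0.55],
])

D = gower_distance(standardize(X))
print(np.round(D, 3))
```

Because this form of the Gower distance rescales each variable by its range, the prior standardization does not change the result here; it is kept only to mirror the steps as listed. The resulting matrix `D` is what would be handed to a neighbor-net program.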

A network analysis

The resulting network is shown next. Each point represents a country, with the name codes following the ISO-3166-1 standard. The spatial relationship of the points contains the summary information — points near each other in the network are similar to each other based on the data variables, and the further apart they are then the less similar they are. The points are color-coded based on major geographic regions (asterisks highlight single states that don't group with the rest of their geographical region). We have added some annotations for the major network groups, indicating which geographical regions are included — these groups are the major happiness groupings.

In this blog post we do not want to risk over-interpreting the data, as explained in the final paragraphs below. However, it is obvious that there are distinct patterns in the network. Happiness and its correlates are not randomly distributed on this planet but, not unexpectedly, relate to the local socio-political situation.

Starting at the bottom-left, we have a geographically heterogeneous cluster of very well-off countries, either welfare states (as in northern Europe), capitalist democracies (eg. the USA, Singapore, Hong Kong), or oil-rich monarchies with high levels of public spending (as in the Middle East). Moving clockwise, the next cluster has much of the rest of the western and central European countries, along with the financially well-off parts of South America and Asia. The next cluster has many of the remaining eastern European countries, plus the nearest parts of Asia, where government spending on welfare is still apparent. Clearly, national wealth plays a large part in happiness, in spite of the well-known adage to the contrary.

This is followed, at the top-middle of the network, by a broad neighborhood (not a distinct cluster), where government spending on welfare is much less apparent, at least to an outsider. The countries here come from Europe, Asia, and Central plus South America (including, at the moment, Greece). Happiness and its correlates are reported to be much lower here.

To make this situation clearer, here is a version of the network with some of the happiness scores annotated: values are provided for the countries in the top and bottom 10% of the happiness scores, and for the 10 most populous countries in the world.

On the opposite side of the network, happiness is also apparently lower, but with a different set of correlations among the variables. There is a two-part cluster of geographically heterogeneous countries at the bottom-middle, plus a neighborhood at the bottom-right. The latter includes China and India, the two most populous countries (with one-third of our people), while Indonesia (4th) and Brazil (6th) are in the neighborhood at the top of the network.

Finally, the cluster at the right consists mostly of African countries, plus Pakistan (the 5th most-populous country). In this cluster, happiness is reported to be at its lowest observed level. Much of the world's monetary aid is spent in Africa, of course, to try to improve the situation, although there is clearly a long way to go. Not unexpectedly, most of the world's migrants come from the right-hand part of the network, which is one of the main focuses of the Happiness Report.

Final comments

It is interesting to note that the Bhutan (code BTN) government reportedly aims to increase the Gross National Happiness rather than the GDP (see Gross national happiness in Bhutan: the big idea from a tiny state that could change the world). The network shows that their 2015-2017 happiness is quite different to that of their geographical neighbors. However, it also suggests that they still have a long way to go.

We should finish the discussion with a general point about surveys, such as the Gallup Poll on which the Happiness Report is based. Respondents are not always completely honest when answering survey questions, which is why pre-election polls sometimes get it wrong — people are most serious when faced with an actual decision, rather than a question. All of the results here need to be interpreted in this light — they may not be far wrong, but they are unlikely to be completely right.

Apart from anything else, there can be cultural differences in the way in which the answers to the Gallup World Poll questions are treated. Does "happiness" really mean the same thing across all cultures? We know that "beauty" does not, and "freedom" does not; so why not "happiness"? After all, things like reported happiness are likely to be confounded with other feelings such as national pride. This issue could presumably be addressed by looking at other answers from the Gallup Poll.

Monday, July 9, 2018

Using splits graphs for multivariate data analysis

Data containing multiple measurements for each of a set of objects are usually too complex to be viewed easily in their raw form. Therefore, methods have been developed to summarize the data down to something simpler. This is called multivariate data analysis.

One of the issues that needs to be addressed is that a data summary is designed to lose information. The goal is to somehow keep the most important information in the summary. Clearly, the simpler the summary, the more information we are likely to lose.

This post is a simplistic introduction to why splits graphs, which were originally developed to summarize multivariate phylogenetic data, are usually very good data summaries. It compares the ability of maps, indexes and networks to summarize data.


A map is a 2-dimensional drawing of some piece of 4-dimensional space-time. For example, the map shown here represents the southern part of Scandinavia.

A map is quite successful as a data summary. It reduces the 4-dimensional world down to 2+ dimensions — latitude and longitude are represented accurately; we use symbols or colors/shading to represent altitude; and we choose one specific time (thus eliminating that dimension). We can therefore reconstruct much of the 3-dimensional world from looking at a map (ie. much of the original information is retained in the summary).

In our example, we can see even from a glance at the map that Denmark is as flat as a pancake, Norway is very hilly, and Sweden is somewhere in between. We can also see that Uppsala and Oslo are at the same latitude, and that the simplest way to get from Uppsala to Trondheim is likely to be via Östersund rather than Oslo.


An index is a linear ordering of numbers measuring some calculated characteristic of a set of objects. It condenses a series of measurements for each object down to a single number. The index shown here refers to the hotels in Östersund (which we might stay at on our way from Uppsala to Trondheim), and indicates the overall quality score from a well-known online booking site. The index summarizes a set of features of the hotels that might be of interest to potential guests.

Hotell Emma
Clarion Hotell Grand
Hotell Stortorget
Quality Hotell Frösö Park
Hotell Jämteborg
Best Western Hotell Ett
Best Western Hotell Gamla Teatern
Hotell Älgen
Hotell Zäta

Unfortunately, an index is rarely very successful as a data summary. It reduces multi-dimensional data down to only 1 dimension. Therefore, we cannot tell which dimensions contribute to each value of the index — the same value could arise in many different ways. We therefore cannot reconstruct any of the original dimensions — what goes into the summary cannot come back out (as it can for a map).

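[Figure: feature-by-feature quality scores (e.g. Free WiFi, Value for money) for two hotels with the same index score, one of them Quality Hotel Frösö Park.]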

In our example, two of the hotels have exactly the same index score, but this does not necessarily mean that the two hotels are the same as regards the quality features, as shown above. For instance, there are notable differences between them in Location and Value for Money, and even larger differences in Cleanliness and Facilities. This information is lost in the calculation of the quality index.
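The point can be made with a tiny numerical sketch. The feature scores below are hypothetical (they are not the booking site's values), and the index is assumed to be a simple unweighted mean, which may not be how the site actually calculates it.

```python
# Hypothetical feature scores for two hotels (NOT the real booking-site data).
features = ["Cleanliness", "Facilities", "Location", "Value for money"]
hotel_a = [9.0, 8.0, 7.0, 8.0]
hotel_b = [7.0, 8.5, 9.0, 7.5]

def quality_index(scores):
    """Assumed index: the unweighted mean of the feature scores."""
    return sum(scores) / len(scores)

index_a = quality_index(hotel_a)
index_b = quality_index(hotel_b)
print("Index A:", index_a, "Index B:", index_b)

# The per-feature differences are what the single index value hides.
differences = {f: a - b for f, a, b in zip(features, hotel_a, hotel_b)}
for feature, diff in differences.items():
    print(f"{feature}: A - B = {diff:+.1f}")
```

Both hotels end up with the same index value, although they differ on every feature; knowing only the index, there is no way to get those differences back.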


A splits graph (a type of phylogenetic network) is a 2-dimensional drawing of some multi-dimensional set of data, such as might be used to calculate an index. The network shown here is based on the same data used to calculate the quality index above.

A network reduces multi-dimensional data down to 2+ dimensions. Each object is represented as a point — the spatial relationship of the points (their neighborhood) has meaning; and the inter-connecting lines have meaning (they are groups supported by the data). Such a network is therefore much more successful as a summary than is an index. Like a map, it will be very successful for 3-dimensional data, with potentially reduced success as the number of dimensions increases — the rate of information loss will depend on how well-correlated are the dimensions.

In our example, the main pattern in the network shows the relative quality of the hotels, as measured by the index, descending from top to bottom (so that all of the information from the index is in the network). However, the graph also emphasizes the difference between the two hotels with identical index scores. Indeed, it shows us that the Quality Hotell Frösö Park is probably more similar to the Clarion Hotell Grand than to the Hotell Stortorget.


There are other forms of multivariate data analysis that are often used instead of networks. Two common ones are: an ordination, which reduces multi-dimensional data down to 2 dimensions only; and a cluster tree, which reduces to 1 dimension only. These are therefore often less successful as data summaries. Indeed, a network is very much like a combination of an ordination and a cluster tree, with the best features of both methods and fewer of their limitations.

Further reading

How to interpret splits graph

Primer of Phylogenetic Networks

Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312.

Monday, July 2, 2018

Reticulation at its best — an example from the oaks

One particular case where networks turn out to be a versatile tool is the study of low-level evolutionary patterns. This is especially so when we leave the comfort zone of well-sorted molecular markers, and use more than a single individual per species. Our recently published data set on (mostly Mediterranean) oaks provides a nice example of this.

Why so few people study oaks at the intra-generic level

Oaks are notoriously difficult to study because they don't bother too much about species boundaries (which can be more or less obvious) and – at one point – decided to not sort their plastids at all (and full plastomes, as I once saw for myself first-hand, won't help). Hence, all reasonable phylogenetic reconstructions of oak evolution have been based on genetic data from the nucleome. However, this imposes a new problem — the sequenced nuclear gene regions allow the recognition of the major lineages (which recently have been formalized), but the closer one comes to the species level the more difficult it is to resolve anything at all.

Even the famous ITS region, which includes the weakly constrained internal transcribed spacer ITS1 and the structurally quite constrained ITS2, and has been frequently advocated as a plant barcode, turns out to be a two-edged sword. Relationships between the major intra-generic lineages are relatively clear, and the ITS is pretty divergent down to the species level, but at the individual level one faces an intra-genomic divergence that often outmatches inter-species differentiation.

In some groups, like the most speciose and most widespread white oaks (sect. Quercus), identical ITS variants exist from individuals / species separated today by thousands of kilometers of ocean or icy wasteland. One possible explanation is that oaks have very large population sizes, and they are wind-pollinated, so that they have a high capacity to permanently homogenize their genepools. Plastids, on the other hand, are only transmitted via the large fruit, the acorns, and the main animal vectors for distributing acorns, the jaybirds, are sedentary birds. Their backup vector, the squirrels, are known for hoarding a lot of acorns in a single place, but not for migrating globally (unless we assist them).

Nonetheless, we readily notice that the intra-individual differentiation patterns appear not to be entirely random, and so in our study we moved to another nuclear multi-copy spacer known to be more variable than the ITS1 and ITS2 (hence, largely ignored by molecular phylogeneticists) — the 5S intergenic spacer (5S IGS). It didn't help too much for solving the white oak puzzle (in western Eurasia), but did give us new insights into the two other western Eurasian sections: Ilex and Cerris.

The 'host-associate' framework

A cloned 5S-IGS (or ITS) sequence is not a good OTU, because we are usually not interested in a clone phylogeny (a mere sequence genealogy), but in the phylogenetic relationships between the individuals or species carrying the cloned sequence variants: the nuclear spacer population. Even networks struggle with such data, and my colleague Markus Göker came up with the idea to treat this in the form of hosts, the individuals, and associates, the cloned sequences found in the individual (Göker & Grimm, BMC Evol. Biol. 2008 — open access). There are several options to transfer the primary clone (associate) data into individual (host) data.

Options that we tested for transferring associate data into host data.
CM = character matrix, DM = distance matrix. The independent CMhosts used were morphological matrices. ENT = entropy, FRQ = frequency, CON = strict consensus, MOD = modal consensus, and SIZ = sample size are character transformations implemented in Markus' g2cef; PBC and MIN are distance transformations implemented in pbc (these and other little helper programmes can be found here).

Using three cloned (ITS) datasets, we found that for these data the "Phylogenetic Bray-Curtis" (PBC — see the next figure) distance transformation outperforms the other tested options.

Computation of the "Phylogenetic Bray-Curtis" distance. It's a modification of the Bray-Curtis dissimilarity using the minimum distance for each covered row/column instead of absence/presence. H1/H2 = hosts with different sets of associates (A1–A6).

Incidental but interesting insights

Whenever I come into contact with such data I advise the use of the PBC distance transformation as the basis for the main individual-level network, but also to run the MIN distance transformation: MIN will just calculate the minimum inter-clone distance between the clone samples of two individuals, and use this as the inter-individual distance.

Neighbour-net using the MIN transformation

The MIN network (above) is quite bushy for these data, because we naturally have many 5S-IGS variants shared among individuals of the same species, and occasionally also shared by individuals of different species. Nonetheless, it visualizes some basic differentiation patterns in the clone sample: compare e.g. the coherent cluster 3, the crenata-suber lineage (the 'Cork Oaks'), where all individuals share a pair of very similar to identical 5S-IGS clones, with the divergent cluster 4, the 'Vallonea' oaks, where all individuals have different sets of clones but unique 5S-IGS variants separating them from all other Cerris oaks (long proximal edge bundle).

Furthermore, we have potential F1 hybrids (morphologically intermediate) in our sample, and such hybrids, e.g. tj08, should have very low (to zero) MIN distances with members of their parental lineages.

However, the PBC network (below) is as beautiful as it gets — I really love this transformation, as it always comes up with something usable and interpretable.

Neighbor-net based on PBC-transformed inter-individual distances. See Simeone et al. (PeerJ PrePrints 2018 — open access pre-print) for a discussion.

However, this network was a last-minute addition, because a happy little "accident" happened along the way, and the networks we were working with and looking at while drafting the paper were not PBC networks, as I thought.

It happened this way. Also implemented in Markus' little helper program are AVG, the average inter-clone distance, and MAX, the maximum inter-clone distance. AVG and MAX don't result in a proper distance matrix, because the diagonal will be the average or maximum distance between the clones of a single individual, and not all-zero as it should be (for MIN it's always zero). [We discussed a few options to modify AVG and MAX to ensure a zero diagonal, but couldn't devise something that makes sense.]
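The difference between the three transformations can be sketched as follows. This is not Markus' helper program, just a minimal illustration: the clone distances, the host labels and the function name are hypothetical, and the within-individual AVG here simply averages the whole clone-by-clone block (including each clone compared with itself), which is an assumption.

```python
import numpy as np

# Hypothetical pairwise distances between four cloned sequences
# (NOT real oak data).
clone_dist = np.array([
    [0.00, 0.02, 0.10, 0.12],
    [0.02, 0.00, 0.08, 0.10],
    [0.10, 0.08, 0.00, 0.04],
    [0.12, 0.10, 0.04, 0.00],
])

# Which clones (row/column indices) were sampled from which individual.
hosts = {"H1": [0, 1], "H2": [2, 3]}

REDUCERS = {"MIN": np.min, "AVG": np.mean, "MAX": np.max}

def host_matrix(clone_dist, hosts, how):
    """Collapse a clone-by-clone matrix into an individual-by-individual one."""
    names = list(hosts)
    out = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            # All distances between the clones of individual a and those of b.
            block = clone_dist[np.ix_(hosts[a], hosts[b])]
            out[i, j] = REDUCERS[how](block)
    return names, out

for how in REDUCERS:
    names, matrix = host_matrix(clone_dist, hosts, how)
    print(how, "diagonal:", np.diag(matrix), "H1-H2:", matrix[0, 1])
```

In this toy example the MIN matrix has an all-zero diagonal, whereas the AVG and MAX matrices carry the within-individual divergence on the diagonal, which is exactly why they are not proper distance matrices.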

However, the SplitsTree program is not bothered by a non-zero diagonal, so the AVG- and MAX-transformed distance matrices will still produce a Neighbor-net. So, what I assumed were PBC networks were in fact AVG networks.

Neighbor-net based on AVG-transformed inter-individual distances.

It took me quite a long time to recognize this "error" because, in contrast to the AVG (and MAX) networks I looked at when we did the 2008 paper, the one for the oaks made a lot of sense. Notably, the suspected F1 hybrids were perfectly resolved, spanning the corresponding boxes, and the species aggregates (clusters) did make sense regarding the general geographic setting, the history of the region under study, and their morphology.

Same graph as above, highlighting known or potential F1 hybrids spanning the corresponding boxes.

For these data (with a minimum of four clones available per individual, individuals covering all species, and including the entire range of the section in western Eurasia), the AVG network better shows the potential F1 hybrids (or introgrades) than the (more methodologically sophisticated) PBC network. However, the latter makes more sense regarding speciation processes and the history of the group (because the distance is a "phylogenetic" version of the well-known Bray-Curtis distance).

A "cactus-oak" fusion graph depicting nuclear and plastid differentiation (and evolution) in Quercus Group Cerris.

Take-home message

First, it's always good to delegate work you can do by heart to somebody new to it! This forces its propagation, which is important. More importantly, though, one has one's preferences and established analysis pipelines, and they may have become restricted in scope. I mainly used the -a (AVG), -i (MIN) and -x (MAX transformation) options in the little helper program to quickly summarize some of the primary differentiation data: for example, individuals have identical clones (MIN = 0), intra-individual divergence may or may not exceed inter-individual divergence (MAX intra-individual > MIN inter-individual), and individuals may have strongly divergent clones (high MAX). AVG was computed and tabulated but never cherished by me. I always looked at the MIN-transformed networks, since this provides a valid distance matrix, but then ignored them. But I never again tried to infer a Neighbor-net based on AVG or MAX transformations after our 2008 paper.

Second, Neighbor-nets are so quick to infer that there is no resource- or logic-related reason to not just run whatever distance one has on hand or can easily establish. Maybe even a biologically less sound distance will reveal some interesting aspect (there are a lot of biological arguments that can be put forward for dismissing AVG distances in favour of PBC distances).

Paper (pre-print) and open data
Simeone MC, Cardoni S, Piredda R, Imperatori F, Avishai M, Grimm GW, Denk T. 2018. Comparative systematics and phylogeography of Quercus Section Cerris in western Eurasia: inferences from plastid and nuclear DNA variation. PeerJ Preprints 6: e26995v1.
Primary data and analysis files are included in the Online Supplementary Archive: Simeone et al., PeerJ Preprints, doi: 10.7287/peerj.preprints.26995v1/supp-4. (See Readme.txt included in the top folder of the archive.)