Monday, September 16, 2019

A network of happiness, by ranks

This is a joint post by David Morrison and Guido Grimm

Over a year ago, we showed a network relating to the World Happiness Report 2018 based on the variables used for explaining why people in some countries report themselves to be happier than in other countries. A new WHR report is out for 2019, warranting a new network.

The 2019 Report describes itself as:
a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.
For our purposes, we will simply focus on the happiness scores themselves. So, this time we will base our analysis on the country rankings for the four measures of subjective well-being:
  • Cantril Ladder life-evaluation question in the Gallup World Poll — asks the survey respondents to place the status of their lives on a “ladder” scale ranging from 0 to 10, where 0 means the worst possible life and 10 the best possible life
  • Ladder standard deviation — provides a measure of happiness inequality across the country
  • Positive affect — comprises the average frequency of happiness, laughter and enjoyment on the previous day to the survey (scaled from 0 to 1)
  • Negative affect — comprises the average frequency of worry, sadness and anger on the previous day to the survey (scaled from 0 to 1)
As expected, not a lot has changed between 2018 and 2019. The first graph shows the comparison of the Cantril Ladder scores (the principal happiness measure) for those 153 countries that appear in both reports. Each point represents one country, with the color coding indicating the geographical area (as listed in the network below).

Only three countries (as labeled) show large differences, with Malaysia becoming less happy, and two small African countries improving. As also expected, the European countries (green) tend to be at the top, and the African countries (grey) dominate the bottom scores.

Finland is still ranked #1, with even happier people than in 2018's report. New in the top-10 of the happiest countries is Austria (last year's #12), which took the place of Australia (now #11). At the other end, South Sudan went down from 3.3 to 2.9 — this is not really a good start for the youngest state in the world. New to the lowest-ranking ten are Botswana (−0.1, down two places) and Afghanistan (−0.4, down nine places).

A network analysis

The four measures of subjective well-being do not necessarily agree with each other, since they measure different things. To get an overview of all four happiness variables simultaneously, we can use a phylogenetic network as a form of exploratory data analysis. [Technical details of our analysis: Qatar was deleted because it has too many missing values. The data used were the simple rankings of the countries for each of the four variables. The Manhattan distance was then calculated; the distances have been displayed as a neighbor-net splits graph.]
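As a minimal sketch of the distance step only (the neighbor-net display itself is not reproduced here), the Manhattan distance between two countries is the sum of the absolute differences of their ranks across the four variables. The country names and rank values below are hypothetical placeholders, not the WHR data.

```python
# Hypothetical ranks (NOT the WHR values) on the four variables:
# ladder, ladder SD, positive affect, negative affect.
ranks = {
    "A": [1, 4, 40, 10],
    "B": [2, 10, 25, 12],
    "C": [150, 140, 100, 130],
}

def manhattan(u, v):
    """Sum of absolute rank differences across all variables."""
    return sum(abs(a - b) for a, b in zip(u, v))

names = sorted(ranks)
# Full pairwise distance table, keyed by (country, country).
dist = {(p, q): manhattan(ranks[p], ranks[q]) for p in names for q in names}

for p in names:
    print(p, [dist[(p, q)] for q in names])
```

Here A and B end up close together (distance 24), while C is far from both, which is the kind of pattern the network then displays spatially.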

In the network (shown below), the spatial relationship of the points contains the summary information — points near each other in the network are similar to each other based on the data variables, and the further apart they are then the less similar they are. The points are color-coded based on major geographic regions; and the size of the points represents the Cantril Ladder score. We have added some annotations for the major network groups, indicating which geographical regions are included — these groups are the major happiness groupings.

The rank-based network for 2019 looks quite different from the 2018 network, which was based on the explanatory variables. Let us have a short look at the clusters, as annotated in the graph.

Cluster 1: The happiest. This includes the welfare states of north-western and central Europe (score > 6.7), as well as Australia, Canada and New Zealand (~7.3), Taiwan (the 25th happiest country in the world, 6.4) and Singapore (#34 with 6.3). For both the positive and negative measures of happiness, the countries typically rank in the top 50, with Czechia ranking lowest regarding positive affects (#74), while the people in Taiwan (#1) and Singapore (#2) apparently suffer the fewest negative affects.

Cluster 2: Quite happy. This includes countries like France, which with a score of 6.6 is the happiest of the group, plus countries along the southern shore of the Baltic Sea, as well as Japan and Hong Kong, but also quite different countries from western Asia such as Kyrgyzstan and Turkmenistan, and Vietnam, the least happy (5.1) of the group. Common to all of them is that they rank in the top third for the standard deviation of the Cantril ladder scores, i.e. their people are equally happy across each country. Towards the right of the group, bridging to Cluster 3, we have countries that rank in the bottom third of positive affects. Potential causes are the high levels of perceived corruption, or the lack of social support and generosity, as in the case of Turkmenistan (#147 in social support, #153 in generosity).

Cluster 3: Not so happy — an Old World group of the lower half (Cantril scores between 5.2, Algeria, and 3.4, Rwanda) that are either doing a bit better than other, equally (un)happy countries regarding positive affects (Myanmar, Madagascar, Rwanda) or negative affects (e.g. Georgia, Ukraine), and are in the top-half when it comes to the SD.

Cluster 4: Generally unhappy — this collects most of the countries of the 2018 Sub-Saharan cluster with Cantril scores ≤ 5, including three of the (still) unhappiest countries in the world: war-ridden Syria, the Central African Republic, and South Sudan, which rank in the bottom half of all happiness rankings. When it comes to explanations, the ranking table is of little use: Chad, for example, ranks 2nd regarding perceived corruption, and the Central African Republic, generally regarded as a failed state, ranks 16th, and 14th regarding freedom — i.e. it seems to have similar values here to the happiest bunch (Cluster 1).

Cluster 5: Pretty unhappy — this includes Asian and African countries that are not much happier than those of Cluster 4 but which rank high when only looking at positive affects. The reasons may include low levels of perceived corruption but also generosity, at least in the case of Bhutan (#25, #13) and South Africa (#24/#1), the latter being the most generous country in the world (something Guido agrees with based on personal experience).

Cluster 6: Partially unhappy — is a very heterogeneous cluster, with Cantril scores ranging from 7.2 for Costa Rica (#12), a score close to the Top-10 of Cluster 1, to 4.7 for Somalia (#112). Effectively, it collects all states whose ranking patterns do not fit any of the other clusters. For example, the U.S. (6.9, #19) and U.A.E. (6.8, #21) plot close to each other in the network because both rank between 35 and 70 on the other three variables, i.e. lower than the countries of Cluster 1 with not much higher Cantril scores. Mexico, by the way (6.6, #23), performs similarly to the U.S. but ranks much higher regarding positive affects. The latter seems to be a general trend within the other states of the New World in this cluster.

Cluster 7: Really not happy — also covers a wide range, from a Cantril score of 6.0 (Kuwait, #51 in the world) to 3.2 (Afghanistan, #154). It includes the remainder of the Sub-Saharan countries, most of the countries in the Arab world, and the unhappy countries within and outside the EU (Portugal, Greece, Serbia, Bosnia & Herzegovina). These are countries that usually rank in the lower half or bottom third regarding all four included variables.

Cluster 8: Increasingly unhappy — these countries bridge between Clusters 1 and 7, starting (upper left in the graph) with Russia (#68, top 10 regarding negative affects) and ending with the Democratic Republic of Congo (#127, Congo Kinshasa in the WHR dataset, ranking like a Cluster 7 country). In between are pretty happy countries such as Israel (#13) and unhappy EU members (Bulgaria, #97). The reason Israel is not in Cluster 1 is its very low ranking regarding positive affects (#104) and its not-too-high placement when it comes to negative affects (#69), but in contrast to the U.S. it ranks high when it comes to the SD of the Cantril scores — that is, the USA has a great diversity regarding happiness, from billionaires to the very poor, whereas the peoples of most countries are more equally happy. Other very high-ranking countries regarding the latter are Bulgaria, the least-happy country of the EU, and Mongolia.

Monday, September 9, 2019

Lifestyle habits in the states of the USA

People throughout the western world are constantly being reminded that modern lifestyles have many unhealthy aspects. This is particularly true of the United States of America, where obesity (degree of over-weight) is now officially considered to be a medical epidemic. That is, it is a disease, but it is not caused by some organism, such as a bacterium or virus, but is instead a lifestyle disease — it can be cured and prevented only by changing the person's lifestyle.

The Centers for Disease Control and Prevention (CDC), in the USA, publish a range of data collected in their surveys — Nutrition, Physical Activity, and Obesity: Data, Trends and Maps. Their current data include information up to 2017.

These data are presented separately for each state. The data collection includes:
  • Obesity — % of adults who are obese, as defined by the Body Mass Index (>30 is obese)
  • Lack of exercise — % of adults reporting no physical leisure activity; % of adolescents watching 3 or more hours of television each school day
  • Unhealthy eating — % of adults eating less than one fruit per day; % of adolescents drinking soda / pop at least once per day.
The CDC show maps and graphs for these data variables separately, but there is no overall picture of the data collection as a whole. This would be interesting, because it would show us which states have the biggest general problem, in the sense that they fare badly on all or most of the lifestyle measurements. So, let's use a network to produce such a picture.

For our purposes here, I have looked at the three sets of data for adults only. The network will thus show states that have lots of obese adults who get little exercise and do not eat many fruits and vegetables.

As usual for this blog, the network analysis is a form of exploratory data analysis. The data are the percentages of people in each state that fit into the three lifestyle characteristics defined above (obese, no exercise, unhealthy eating). For the network analysis, I calculated the similarity of the states using the Manhattan distance; and a Neighbor-net analysis was then used to display the between-state similarities.

Network of the lifestyle habits in the various US states

The resulting network is shown in the graph. States that are closely connected in the network are similar to each other based on their adult lifestyles, and those states that are further apart are progressively more different from each other. In this case, the main pattern is a gradient from the healthiest states at the top of the network to the most unhealthy at the bottom.

Note that there are seven states separated from the rest at the bottom of the network. These states have far more people with unhealthy lifestyles than do the other US states. In other words, the lifestyle epidemic is at its worst here.

In the top-middle of the network there is a partial separation of states at the left from those at the right (there is no such separation elsewhere in the network). The states at the left are those that have relatively low obesity levels but still fare worse on the other two criteria (exercise and eating). For example, New York and New Jersey have the same sorts of eating and exercise habits as Pennsylvania and Maryland but their obesity levels are lower.

It is clear that the network relates closely to the standard five geographical regions of the USA, as shown by the network colors. The healthiest states are mostly from the Northeast (red), except for Delaware, while the unhealthiest states are from the Southeast (orange), with Florida, Virginia and North Carolina doing much better than the others. The Midwest states are scattered along the middle-right of the network, indicating a middling status. The Southwest states are mostly at the middle-left of the network.

The biggest exception to these regional clusterings is the state of Oklahoma. This is in the bottom (unhealthiest) network group, far from the other Southwest states. This pattern occurs across all three characteristics; for example, Oklahoma has the second-lowest intake of fruit (nearly half the adults don't eat fruit), second only to Mississippi.

These data have also been analyzed by Consumer Protect, who offer some further commentary.


This analysis highlights those seven US states that have quantitatively the worst lifestyles in the country, and where the lifestyle obesity epidemic is thus at its worst.

These poor lifestyles have a dramatic impact on longevity — people cannot expect to live very long if they live an unhealthy lifestyle. The key concept here is the difference between life expectancy (how long people live, on average) and healthy life expectancy (how long people remain actively healthy, on average). This topic is discussed by The US Burden of Disease Collaborators (2018. The state of US health, 1990-2016. Journal of the American Medical Association 319: 1444-1472).

In that paper, the data for the USA show that, for most states, healthy life expectancy is c. 11 years less than the total life expectancy, on average. This big difference is due to unhealthy lifestyles, which eventually catch up with you. As a simple example, the seven states at the bottom of the network are ranked 44-51 in terms of healthy longevity, at least 2.5 years shorter than the national average. (Note: Tennessee is ranked 45th.)

You can see why the CDC is concerned, and why there is considered to be an epidemic.


Some of the seven states highlighted here have other lifestyle problems, as well. For example, if you consult Places in America with the highest STD rates, you will find that they are listed as five of the top ten: 2: Mississippi, 3: Louisiana, 6: Alabama, 9: Arkansas, 10: Oklahoma, 31: Kentucky, and 50: West Virginia.

Monday, September 2, 2019

Losing information in phylogenetic consensus

Any summary loses information, by definition. That is, a summary is used to extract the "main" information from a larger set of information. Exactly how "main" is defined and detected varies from case to case, and some summary methods work better for certain purposes than for others.

A thought experiment that I used to play with my experimental-design students was to imagine that they were all given the same scientific publication, and were asked to provide an abstract of it. Our obvious expectation is that there would be a lot of similarity among those abstracts, which would represent the "important points" from the original — that is, those points of most interest to the majority of the students. However, there would also be differences among the abstracts, as each student would find different points that they think should also be included in the summary. In one sense, the worst abstract would be the one that has the least in common with the other abstracts, since it would be summarizing things that are of less general interest.

The same concept applies to mathematical summaries (aka "averages"), such as the mean, median and mode, which reduce the central location of a dataset to a single number. It also applies to summaries of the variation in a dataset, such as the variance and inter-quartile range. (Note that a confidence interval or standard error is an indication of the precision of the estimate of the central location, not a summary of the dataset variation — this is a point that seems to confuse many people.)
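The distinction in the parenthetical note can be checked numerically. The sketch below uses simulated data (normally distributed values with a mean of 50 and a standard deviation of 10, both arbitrary assumptions) to compare a small and a large sample: the spread summaries (SD, inter-quartile range) describe the data and stay roughly the same, while the standard error describes the precision of the mean and shrinks as the sample grows.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def summarize(sample):
    """Central location, spread of the data, and precision of the mean."""
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "sd": sd,                        # summary of dataset variation
        "iqr": q3 - q1,                  # summary of dataset variation
        "se": sd / len(sample) ** 0.5,   # precision of the estimated mean
    }

# Simulated (hypothetical) datasets drawn from the same distribution.
small = [random.gauss(50, 10) for _ in range(25)]
large = [random.gauss(50, 10) for _ in range(2500)]

for label, sample in (("n=25", small), ("n=2500", large)):
    print(label, {k: round(v, 2) for k, v in summarize(sample).items()})
```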

So, it is easy to summarize data and thereby lose important information. For example, if my dataset has two exactly opposing time patterns, then the data average will appear to remain constant through time. I might thus conclude from the average that "nothing is happening" through time when, in fact, two things are happening. I will never find out about my mistake by simply looking at the data summary — I also need to look at the original data patterns.
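A tiny numerical sketch of this situation, using two made-up series that are exact mirror images of each other:

```python
times = list(range(6))

# Two hypothetical groups with exactly opposing time patterns.
rising = [2 * t for t in times]         # steady increase
falling = [10 - 2 * t for t in times]   # mirror-image decrease

# The average of the two groups at each time point.
average = [(r + f) / 2 for r, f in zip(rising, falling)]

print(rising)
print(falling)
print(average)  # every entry is 5.0
```

The average is flat at 5.0 for every time point, even though both underlying series change at every step.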

So, what has this got to do with phylogenetics? Well, a phylogenetic tree is a summary of a dataset, and that summary is, by definition, missing some of the patterns in the data. These patterns might be of interest to me, if I knew about them.

Even worse, phylogenetic data analyses often produce multiple phylogenetic trees, all of which are mathematically equal as summaries of the data. What are we then to do?

One thing that people often do is to compute a Consensus Tree (e.g. the majority consensus), which is a summary of the summaries — that is, it is a tree that summarizes the other trees. It would hardly be surprising if that consensus tree is an inadequate summary of the original data. In spite of this, how often do you see published papers that contain any evaluation of their consensus tree as a summary of the original data?

This issue has recently been addressed in a paper uploaded to the BioRxiv:
Anti-consensus: detecting trees that have an evolutionary signal that is lost in consensus
Daniel H. Huson, Benjamin Albrecht, Sascha Patz, Mike Steel
Not unexpectedly, given the background of the authors, they explore this issue in the context of phylogenetic networks. As they note:
A consensus tree, such as the majority consensus, is based on the set of all splits that are present in more than 50% of the input trees. A consensus network is obtained by lowering the threshold and considering all splits that are contained in 10% of the trees, say, and then computing the corresponding splits network. By construction and in practice, a consensus network usually shows the majority tree, extended by a number of rectangles that represent local rearrangements around internal nodes of the consensus tree. This may lead to the false conclusion that the input trees do not differ in a significant way because "even a phylogenetic network" does not display any large discrepancies.
That is, sometimes authors do attempt to evaluate their consensus tree, by looking at a network. However, even the network may turn out to be inadequate, because a phylogenetic tree is a much more complex summary than is a simple mathematical average. This is sad, of course.
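The thresholding described in the quotation can be sketched in a few lines. This is only an illustration of split counting, not the authors' software and not the anti-consensus method itself. The ten input trees are hypothetical, each tree is represented simply by its set of non-trivial splits, and the threshold is assumed to mean "strictly more than" the given fraction of trees.

```python
from collections import Counter

def split(*side):
    """One side of a bipartition (by convention, the side without taxon 'a')."""
    return frozenset(side)

# Ten hypothetical input trees on taxa a-e, each given by its non-trivial splits.
trees = (
    [{split("d", "e"), split("c", "d", "e")}] * 7   # 7 trees agree
    + [{split("d", "e"), split("b", "c")}] * 2      # 2 trees differ locally
    + [{split("c", "d"), split("b", "e")}] * 1      # 1 tree differs strongly
)

def splits_above(trees, threshold):
    """Splits present in more than `threshold` (a fraction) of the trees."""
    counts = Counter(s for tree in trees for s in tree)
    return {s for s, n in counts.items() if n / len(trees) > threshold}

majority = splits_above(trees, 0.5)   # splits of the majority consensus tree
network = splits_above(trees, 0.1)    # splits of a 10% consensus network

print(sorted(sorted(s) for s in majority))
print(sorted(sorted(s) for s in network))
```

In this toy example the consensus network adds only one split (b, c) to the majority tree, and the single strongly different tree contributes nothing to either summary, which is the kind of lost signal the quotation warns about.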

So, the new suggestion by the authors is:
To harness the full potential of a phylogenetic network, we introduce the new concept of an anti-consensus network that aims at representing the largest interesting discrepancies found in a set of trees.
This should reveal multiple large patterns, if they exist in the original dataset. Phylogenetic analyses keep moving forward, fortunately.