The Genealogical World of Phylogenetic Networks: September 2019

Monday, September 30, 2019

Typology of semantic change (Open problems in computational diversity linguistics 8)

With this month's problem we are leaving the realm of modeling, which has been the basic aspect underlying the last three problems, discussed in June, July, and August, and enter the realm of typology, or general linguistics. The last three problems that I will discuss, in this and two follow-up posts, deal with the basic problem of making use of or collecting data that allows us to establish typologies, that is, to identify cross-linguistic tendencies for specific phenomena, such as semantic change (this post), sound change (October), or semantic promiscuity (November).

Cross-linguistic tendencies are here understood as tendencies that occur across all languages independently of their specific phylogenetic affiliation, the place where they are spoken, or the time when they are spoken. Obviously, the uniformitarian requirement of independence of place and time is an idealization. As we know well, the capacity for language itself developed, potentially gradually, with the evolution of modern humans, and as a result, it does not make sense to assume that the tendencies of semantic change or sound change were the same through time. This has, in fact, been shown in recent research that illustrated that there may be a certain relationship between our diet and the speech sounds that we speak in our languages (Blasi et al. 2019).

Nevertheless, in the same way in which we simplify models in physics, as long as they yield good approximations of the phenomena we want to study, we can also assume a certain uniformity for language change. To guarantee this, we may have to restrict the time frame of language development that we want to discuss (e.g. the last 2,000 years), or the aspects of language we want to investigate (e.g. a certain selection of concepts that we know must have been expressed 5,000 years ago).

For the specific case of a semantic change, the problem of establishing a typology of the phenomenon can thus be stated as follows:

Assuming a certain pre-selection of concepts that we assume were readily expressed in a given time frame, establish a general typology that informs about the universal tendencies by which a word expressing one concept changes its meaning, to later express another concept in the same language.

In theory, we can further relax the conditions of universality and add the restrictions on time and place later, after having aggregated the data. Maybe this would even be the best idea for a practical investigation; but given that the time frames in which we have attested data for semantic changes are rather limited, I do not believe that it would make much of a change.

Why it is hard to establish a typology of semantic change

There are three reasons why it is hard to establish a typology of semantic change. First, there is the problem of acquiring the data needed to establish the typology. Second, there is the problem of handling the data efficiently. Third, there is the problem of interpreting the data in order to identify cross-linguistic, universal tendencies.

The problem of data acquisition results from the fact that we lack data on observed processes of semantic change. Since there are only a few languages with a continuous tradition of written records spanning 500 years or more, we will never be able to derive any universal tendencies from those languages alone, even if it may be a good idea to start from languages like Latin and its Romance descendants, as has been shown by Blank (1997).

Accepting the fact that processes attested only for Romance languages are never enough to fill the huge semantic space covered by the world's languages, the only alternative would be using inferred processes of semantic change — that is, processes that have been reconstructed and proposed in the literature. While it is straightforward to show that the meanings of cognate words in different languages can vary quite drastically, it is much more difficult to infer the direction underlying the change. Handling the direction, however, is important for any typology of semantic change, since the data from observed changes suggests that there are specific directional tendencies. Thus, when confronted with cognates such as selig "holy" in German and silly in English, it is much less obvious whether the change happened from "holy" to "silly" or from "silly" to "holy", or even from an unknown ancient concept to both "holy" and "silly".

As a result, we can conclude that any collection of data on semantic change needs to make crystal-clear upon which types of evidence the inference of semantic change processes is based. Citing only the literature on different language families is definitely not enough. Because of the second problem, this also applies to the handling of data on semantic shifts. Here, we face the general problem of elicitation of meanings. Elicitation refers to the process in fieldwork where scholars use a questionnaire to ask their informants how certain meanings are expressed. The problem here is that linguists have never tried to standardize which meanings they actually elicit. What they use, instead, are elicitation glosses, which they think are common enough to allow linguists to understand to what meaning they refer. As a result, it is extremely difficult to search in field work notes, and even in wordlists or dictionaries, for specific meanings, since every linguist is using their own style, often without further explanations.

Our Concepticon project (List et al. 2019, https://concepticon.clld.org) can be seen as a first attempt to handle elicitation glosses consistently. What we do is to link those elicitation glosses that we find in questionnaires, dictionaries, and fieldwork notes to so-called concept sets, each of which reflects a given concept that is assigned a unique identifier and a short definition. It would go too far to dive deeper into the problem of concept handling. Interested readers can have a look at a previous blog post I wrote on the topic (List 2018). In any case, any typology of semantic change will need to find a way to address the problem of handling elicitation glosses in the literature, in one way or another.

As a last problem, when having assembled data that show semantic change processes across a sufficiently large sample of languages and concepts, there is the problem of analyzing the data themselves. While it seems obvious to identify cross-linguistic tendencies by looking for examples that occur in different language families and different parts of the world, it is not always easy to distinguish between the four major reasons for similarities among languages, namely: (1) coincidence, (2) universal tendencies, (3) inheritance, and (4) contact (List 2019). The only way to avoid being forced to make use of potentially unreliable statistics, to squeeze out the juice of small datasets, is to work on a sufficiently large coverage of data from as many language families and locations as possible. But given that there are no automated ways to infer directed semantic change processes across linguistic datasets, it is unlikely that a collection of data acquired from the literature alone will reach the critical mass needed for such an endeavor.

Traditional approaches

Apart from the above-mentioned work by Blank (1997), which is, unfortunately, rarely mentioned in the literature (potentially because it is written in German), there is an often-cited paper by Wilkins (1996), and preliminary work on directionality (Urban 2011). However, the attempt that addresses the problem most closely is the Database of Semantic Shifts (Zalizniak et al. 2012), which, according to the most recent information on the website, was established in 2002 and has been continuously updated since then.

The basic idea, as far as I understand the principle of the database, is to collect semantic shifts attested in the literature, and to note the type of evidence, as well as the direction, where it is known. The resource is unique: nobody else has tried to establish a collection of semantic shifts attested in the literature, and it is therefore incredibly valuable. However, it also shows what problems we face when trying to establish a typology of semantic shifts.

Apart from the typical technical problems found in many projects shared on the web (missing download access to all data underlying the website, missing deposit of versions on public repositories, missing versioning), the greatest problem of the project is that no apparent attempt was undertaken to standardize the elicitation glosses. This became specifically obvious when we tried to link an older version of the database, which is now no longer available, to our Concepticon project. In the end, I selected some 870 concepts from the database, which were supported by more datapoints, but had to ignore more than 1500 remaining elicitation glosses, since it was not possible to infer in reasonable time what the underlying concepts denote, not to speak of obvious cases where the same concept was denoted by slightly different elicitation glosses. As far as I can tell, this has not changed much with the most recent update of the database, which was published some time earlier this year.

Apart from the aforementioned problem of missing standardization of elicitation glosses, the database does not seem to annotate which type of evidence has been used to establish a given semantic shift. An even more important problem, which is typical of almost all attempts to establish databases of change in the field of diversity linguistics, is that the database only shows what has changed, while nothing can be found on what has stayed the same. A true typology of change, however, must show what has not changed along with showing what has changed. As a result, any attempt to pick proposed changes from the literature alone will fail to offer a true typology, that is, a collection of universal tendencies.
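To illustrate why retention matters, here is a minimal sketch in Python. It assumes a hypothetical list of observations, each pairing the meaning of an ancestral word with the meaning of its descendant; the concept labels and counts are invented for illustration and are not taken from any database. Tendencies can only be expressed as rates if the cases without change are counted as well.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (meaning of the ancestral word,
# meaning of its descendant). Identical pairs record retention.
# All labels and counts are invented placeholders.
OBSERVATIONS = [
    ("MOON", "MOON"), ("MOON", "MOON"), ("MOON", "MOON"), ("MOON", "MONTH"),
    ("SKIN", "SKIN"), ("SKIN", "BARK"), ("SKIN", "BARK"), ("SKIN", "LEATHER"),
]


def transition_rates(observations):
    """Return, for each source concept, the share of each outcome."""
    counts = defaultdict(Counter)
    for source, target in observations:
        counts[source][target] += 1
    rates = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        rates[source] = {t: n / total for t, n in targets.items()}
    return rates


rates = transition_rates(OBSERVATIONS)
for source in sorted(rates):
    print(source, rates[source])
```

If the retention pairs were dropped from the input, every source concept would appear to change in all cases, and the two concepts could no longer be compared for stability.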

To be fair: the Database of Semantic Shifts is by no means claiming to do this. What it offers is a collection of semantic change phenomena discussed in the linguistic literature. This itself is an extremely valuable, and extremely tedious, enterprise. While I wish that the authors would open their data, version it, standardize the elicitation glosses, and also host it on stable public archives, to avoid what happened in the past (that people quote versions of the data which no longer exist), and to open the data for quantitative analyses, I deeply appreciate the attempt to approach the problem of semantic change from an empirical, data-driven perspective. To address the problem of establishing a typology of semantic shift, however, I think that we need to start thinking beyond collecting what has been stated in the literature.

Computational approaches

As a first computational approach that comes in some way close to a typology of semantic shifts, there is the Database of Cross-Linguistic Colexifications (List et al. 2018), which was originally launched in 2014, and received a major update in 2018 (see List et al. 2018b for details). This CLICS database, which I have mentioned several times in the past, does not show diachronic data, i.e. data on semantic change phenomena, but instead lists automatically detectable polysemies and homophonies (also called colexifications).

While the approach taken by the Database of Semantic Shifts is bottom-up in some sense, as the authors start from the literature and add those concepts that are discussed there, CLICS is top-down, as it starts from a list of concepts (reflected as standardized Concepticon concept sets) and then checks which languages express more than one concept by one and the same word form.
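The top-down idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual CLICS workflow: the wordlist, the language names, and the word forms are invented placeholders, and real data would require normalized forms and standardized concept identifiers.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy wordlist: (language, concept, word form).
# Languages and forms are invented placeholders, not real data.
WORDLIST = [
    ("lang_a", "ARM", "pano"),
    ("lang_a", "HAND", "pano"),
    ("lang_a", "TREE", "kelu"),
    ("lang_b", "ARM", "tiso"),
    ("lang_b", "HAND", "tiso"),
    ("lang_b", "TREE", "muda"),
    ("lang_b", "WOOD", "muda"),
    ("lang_c", "TREE", "rehi"),
    ("lang_c", "WOOD", "rehi"),
    ("lang_c", "HAND", "soka"),
]


def colexifications(wordlist):
    """Map each concept pair to the languages that use one form for both."""
    # Collect all concepts expressed by the same form in the same language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    # Every pair of concepts sharing a form counts as one colexification.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)
    return {pair: sorted(langs) for pair, langs in languages_by_pair.items()}


result = colexifications(WORDLIST)
for pair, langs in sorted(result.items()):
    print(pair, len(langs), langs)
```

Note that the output is purely synchronic: the pair ARM/HAND is reported as colexified in two languages, but nothing in the data says which meaning came first.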

The advantages of top-down approaches are that much more data can be processed, and that one can easily derive a balanced sample in which the same concepts are compared for as many languages as possible. The disadvantage is that such a database will ignore certain concepts a priori, if they do not occur in the data.

Since CLICS lists synchronic patterns without further interpreting them, the database is potentially interesting for those who want to work on semantic change, but it does not help solve the problem of establishing a typology of semantic change itself. In order to achieve this, one would have to go through all attested polysemies in the database and investigate them, searching for potential hints on directions.

A potential way to infer directions for semantic shifts is presented by Dellert (2016), who applies causal inference techniques on polysemy networks to address this task. The problem, as far as I understand the techniques, is that the currently available polysemy databases barely offer enough information needed for these kinds of analyses. Furthermore, it would also be important to see how well the method actually performs in comparison to what we think we already know about the major patterns of semantic change.

Initial ideas for improvement

There does not seem to be a practical way to address our problem by means of computational solutions alone. What we need, instead, is a computer-assisted strategy that starts from the base of a thorough investigation of the criteria that scholars use to infer directions of semantic change from linguistic data. Once these criteria are settled, more or less, one would need to think of ways to operationalize them, in order to allow scholars to work with concrete etymological data, ideally comprising standardized word-lists for different language families, and to annotate them as closely as possible.

Ideally, scholars would propose larger etymological datasets in which they reconstruct whole language families, proposing semantic reconstructions for proto-forms. These would already contain the proposed directions of semantic change, and they would also automatically show where change does not happen. Since we currently lack automated workflows that fully account for this level of detail, one could start by applying methods for cognate detection across semantic slots (cross-semantic cognate detection), which would yield valuable data on semantic change processes, without providing directions, and then add the directional information based on the principles that scholars use in their reconstruction methodology.

Outlook

Given the recent advances in detection of sound correspondence patterns, sequence comparison, and etymological annotation in the field of computational historical linguistics, it seems perfectly feasible to work on detailed etymological datasets of the languages of the world, in which all information required to derive a typology of semantic change is transparently available. The problem is, however, that it would still take a lot of time to actually analyze and annotate these data, and to find enough scholars who would agree to carry out linguistic reconstruction in a similar way, using transparent tools rather than convenient shortcuts.

References

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen: Niemeyer.

Blasi, Damián E. and Steven Moran and Scott R. Moisik and Paul Widmer and Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

List, Johann-Mattis and Simon Greenhill and Cormac Anderson and Thomas Mayer and Tiago Tresoldi and Robert Forkel (2018) CLICS: Database of Cross-Linguistic Colexifications. Version 2.0. Max Planck Institute for the Science of Human History. Jena: http://clics.clld.org/.

List, Johann-Mattis and Simon Greenhill and Christoph Rzymski and Nathanael Schweikhard and Robert Forkel (2019) Concepticon. A resource for the linking of concept lists (Version 2.1.0). Max Planck Institute for the Science of Human History. Jena: https://concepticon.clld.org/.

Dellert, Johannes and Buch, Armin (2016) Using computational criteria to extract large Swadesh Lists for lexicostatistics. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2018) Towards a history of concept list compilation in historical linguistics. History and Philosophy of the Language Sciences 5.10: 1-14.

List, Johann-Mattis (2019) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

Urban, Matthias (2011) Asymmetries in overt marking and directionality in semantic change. Journal of Historical Linguistics 1.1: 3-47.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, pp. 264-304.

Zalizniak, Anna A. and Bulakh, M. and Ganenkov, Dimitrij and Gruntov, Ilya and Maisak, Timur and Russo, Maxim (2012) The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics 50.3: 633-669.

Monday, September 23, 2019

Where are we, 60 years after Hennig?

Phylogenetic analysis is common in the modern study of evolutionary biology, and yet it often seems to be a poorly understood tool. Indeed, it seems to often be seen as nothing more than a tool, and one for which one does not need much expertise.

For example, we do not need to spend much time on Twitter to realize that many evolutionary biologists do not understand even the most basic things about the difference between taxa and characters. Taxa are often referred to as "primitive", particularly by people studying the so-called Origin of Life. However, taxa themselves cannot be either primitive or derived; instead, they are composed of mixtures of primitive and derived characters — they have derived characters relative to their ancestors and primitive ones compared to their descendants.

The logical relationship between common ancestors and monophyletic / paraphyletic groups is also apparently unknown to many evolutionary biologists. There is endless debate about whether the Last Universal Common Ancestor was a Bacterium or an Archaean when, of course, it cannot be either. That is, we sample contemporary organisms for analysis, which come from particular taxonomic groupings, and from these data we infer hypothetical ancestors. However, those ancestors cannot be part of the same taxonomic group as their descendants unless that taxonomic group is monophyletic.

This is all basic stuff, first expounded in the 1950s by Willi Hennig. So, why do so many people apparently still not know any of this 60 years later? I suspect that somewhere along the line the molecular geneticists got the idea that Hennig was part of Parsimony Analysis, and since they adopted Likelihood Analysis, instead, he is thus irrelevant.

However, Hennigian Logic underlies all phylogenetic analyses, of whatever mathematical ilk. All such analyses are based on the search for unique shared derived characters, which is the only basis on which we can objectively produce a rooted phylogenetic tree or network.

In the molecular world, many analysis techniques are based on analyzing the similarity of the taxa. However, similarity is only relevant if it is based on shared derived characters — if it is based on shared primitive characters then it cannot reliably detect phylogenetic history. This was Hennig's basic insight, and it is as true today as it was 60 years ago.

The confusing thing here is that most similarity among taxa will be based on both primitive and derived characters. This means that some of the analysis output reflects phylogenetic history and some does not. The further we go back in evolutionary time, the more likely it is that similarity reflects shared primitive characters rather than shared derived characters. This simple limitation seems to be poorly understood by evolutionary biologists.

Perhaps it would be a good idea if university courses in molecular evolutionary biology actually taught phylogenetics as a topic of its own, rather than as an incidental tool for studying evolution. After all, there is more to getting a scientific answer than feeding data into a computer program.

Obviously, I may be wrong in painting my picture with such a broad brush. If so, then it must be that the people I have described seem to have gathered on Twitter, like birds of a feather.

And yet, I see the same thing in the literature, as well. Consider this recent paper:

A polyploid admixed origin of beer yeasts derived from European and Asian wine populations. Justin C. Fay, Ping Liu, Giang T. Ong, Maitreya J. Dunham, Gareth A. Cromie, Eric W. Jeffery, Catherine L. Ludlow, Aimée M. Dudley. 2019. PLoS Biology 17(3): e3000147.

This seems to be quite an interesting study of a reticulate evolutionary history involving budding yeasts, from which the authors conclude that:

The four beer populations are most closely related to the Europe/wine population. However, the admixture graph also showed strong support for two episodes of gene flow into the beer lineages resulting in 40% to 42% admixture with the Asia/sake population.

However, they then undo all of their good work with this sentence:

The inferred admixture graph grouped the four beer populations together, with the lager and two ale populations being derived from the lineage leading to the Beer/baking population.

Nonsense! Neither lineage derives from the other, but instead they both derive from a common ancestor. This is like saying that I derive from the lineage leading to my younger brother, when in fact we both derive from the same parents. I doubt that the authors believe the latter idea, so why do they apparently believe the former?

That is a little test that you can all use when writing about phylogenetics. If your words don't make sense for a family history, then they don't make sense for phylogenetics either.

Monday, September 16, 2019

A network of happiness, by ranks

This is a joint post by David Morrison and Guido Grimm

Over a year ago, we showed a network relating to the World Happiness Report 2018 based on the variables used for explaining why people in some countries report themselves to be happier than in other countries. A new WHR report is out for 2019, warranting a new network.

The 2019 Report describes itself as:

a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.

For our purposes, we will simply focus on the happiness scores themselves. So, this time we will base our analysis on the country rankings for the four measures of subjective well-being:

Cantril Ladder life-evaluation question in the Gallup World Poll — asks the survey respondents to place the status of their lives on a “ladder” scale ranging from 0 to 10, where 0 means the worst possible life and 10 the best possible life
Ladder standard deviation — provides a measure of happiness inequality across the country
Positive affect — comprises the average frequency of happiness, laughter and enjoyment on the previous day to the survey (scaled from 0 to 1)
Negative affect — comprises the average frequency of worry, sadness and anger on the previous day to the survey (scaled from 0 to 1)

As expected, not a lot has changed between 2018 and 2019. The first graph shows the comparison of the Cantril Ladder scores (the principal happiness measure) for those 153 countries that appear in both reports. Each point represents one country, with the color coding indicating the geographical area (as listed in the network below).

Only three countries (as labeled) show large differences, with Malaysia becoming less happy, and two small African countries improving. As also expected, the European countries (green) tend to be at the top, and the African countries (grey) dominate the bottom scores.

Finland is still ranked #1, with even happier people than in 2018's report. New in the top-10 of the happiest countries is Austria (last year's #12), which took the place of Australia (now #11). At the other end, South Sudan went down from 3.3 to 2.9 — this is not really a good start for the youngest state in the world. New to the lowest-ranking ten are Botswana (−0.1, down two places) and Afghanistan (−0.4, down 9).

A network analysis

The four measures of subjective well-being do not necessarily agree with each other, since they measure different things. To get an overview of all four happiness variables simultaneously, we can use a phylogenetic network as a form of exploratory data analysis. [Technical details of our analysis: Qatar was deleted because it has too many missing values. The data used were the simple rankings of the countries for each of the four variables. The Manhattan distance was then calculated; the distances have been displayed as a neighbor-net splits graph.]
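The distance step can be sketched as follows in Python. The country names and rank values below are hypothetical placeholders rather than figures from the report, and the neighbor-net step itself is not shown; the sketch only produces the pairwise distance matrix that such an analysis takes as input.

```python
# Hypothetical rankings on the four measures, in the order
# (ladder, ladder SD, positive affect, negative affect).
# The numbers are placeholders, not values from the report.
RANKS = {
    "country_a": [1, 4, 41, 10],
    "country_b": [2, 13, 24, 26],
    "country_c": [150, 140, 120, 145],
}


def manhattan(u, v):
    """Sum of absolute rank differences across all variables."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(ranks):
    """Return sorted names and the symmetric matrix of pairwise distances."""
    names = sorted(ranks)
    matrix = [[manhattan(ranks[a], ranks[b]) for b in names] for a in names]
    return names, matrix


names, matrix = distance_matrix(RANKS)
for name, row in zip(names, matrix):
    print(name, row)
```

With ranks as input, a distance of 43 between two countries means that their positions differ by 43 places in total across the four variables.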

In the network (shown below), the spatial relationship of the points contains the summary information — points near each other in the network are similar to each other based on the data variables, and the further apart they are then the less similar they are. The points are color-coded based on major geographic regions; and the size of the points represents the Cantril Ladder score. We have added some annotations for the major network groups, indicating which geographical regions are included — these groups are the major happiness groupings.

The rank-based network for 2019 looks quite different from the one based on the explanatory variables for 2018. Let us have a short look at the clusters, as annotated in the graph.

Cluster 1: The happiest — this includes the welfare states of north-western and central Europe (score > 6.7), as well as Australia, Canada and New Zealand (~7.3), Taiwan (the 25th happiest country in the world, 6.4) and Singapore (#34 with 6.3). For both the positive and negative measures of happiness, the countries rank typically in the top 50, with Czechia ranking lowest regarding positive affects (#74), while the people in Singapore (#1) and Taiwan apparently suffer the fewest negative affects (#2).

Cluster 2: Quite happy — includes countries like France, with 6.6 making it the happiest one of the group, plus countries along the southern shore of the Baltic Sea, as well as Japan, Hong Kong, but also quite different countries from western Asia such as Kyrgyzstan and Turkmenistan, and Vietnam, the least happy (5.1) of the group. Common to all of them is that they rank in the top third of the standard deviation of the Cantril ladder scores, i.e. their people are equally happy across each country. Towards the right of the group, bridging to Cluster 3, we have countries that rank in the bottom third of positive affects. Potential causes are the high levels of perceived corruption, or the lack of social support and generosity, as in the case of Turkmenistan (#147 in social support, #153 in generosity).

Cluster 3: Not so happy — an Old World group of the lower half (Cantril scores between 5.2, Algeria, and 3.4, Rwanda) that are either doing a bit better than other, equally (un)happy countries regarding positive affects (Myanmar, Madagascar, Rwanda) or negative affects (e.g. Georgia, Ukraine), and are in the top-half when it comes to the SD.

Cluster 4: Generally unhappy — this collects most of the countries of the Sub-Saharan cluster of 2018 with Cantril scores ≤ 5, including three of the (still) unhappiest countries in the world: war-ridden Syria, the Central African Republic, and South Sudan, which rank in the bottom-half of all happiness rankings. When it comes to explanations, the ranking table is of little use: Chad, for example, ranks 2nd regarding perceived corruption, and the Central African Republic, generally regarded as a failed state, ranks 16th, and 14th regarding freedom — i.e. it seems to have values here similar to those of the happiest bunch (Cluster 1).

Cluster 5: Pretty unhappy — this includes Asian and African countries that are not much happier than those of Cluster 4 but which rank high when only looking at positive affects. The reasons may include low levels of perceived corruption but also generosity, at least in the case of Bhutan (#25, #13) and South Africa (#24/#1), the latter being the most generous country in the world (something Guido agrees with based on personal experience).

Cluster 6: Partially unhappy — is a very heterogeneous cluster, when we look at the Cantril scores ranging from 7.2 for Costa Rica (#12), a score close to the Top-10 of Cluster 1, to 4.7 for Somalia (#112). Effectively, it collects all states that don't fit ranking-pattern-wise in any of the other clusters. For example, the U.S. (6.9, #19) and U.A.E. (6.8, #21) plot close to each other in the network because both rank between 35 and 70 on the other three variables, i.e. lower than the countries of Cluster 1 with not much higher Cantril scores. Mexico, by the way (6.6, #23), performs similarly to the U.S. but ranks much higher regarding positive affects. The latter seems to be a general trend within the other states of the New World in this cluster.

Cluster 7: Really not happy — also covers a wide range, from a Cantril score of 6.0 (Kuwait, #51 in the world) to 3.2 (Afghanistan, #154). It includes the remainder of the Sub-Saharan countries, most of the countries in the Arab world, and the unhappy countries within and outside the EU (Portugal, Greece, Serbia, Bosnia & Herzegovina). These are countries that usually rank in the lower half or bottom third regarding all four included variables.

Cluster 8: Increasingly unhappy — these countries bridge between Clusters 1 and 7, starting (upper left in the graph) with Russia (#68, top 10 regarding negative affects) and ending with Democratic Republic of Congo (#127, Congo Kinshasa in WHR dataset, ranking like a Cluster 7 country). In between are pretty happy countries such as Israel (#13) and unhappy EU members (Bulgaria, #97). The reason Israel is not in Cluster 1 is its very low ranking regarding positive affects (#104) and its not-too-high placement when it comes to negative affects (#69), but in contrast to the U.S. it ranks high when it comes to the SD of the Cantril scores — that is, the USA has a great diversity regarding happiness, from billionaires to the very poor, whereas the peoples of most countries are more equally happy. Other very high-ranking countries regarding the latter are Bulgaria, the least-happy country of the EU, and Mongolia.

Monday, September 9, 2019

Lifestyle habits in the states of the USA

People throughout the western world are constantly being reminded that modern lifestyles have many unhealthy aspects. This is particularly true of the United States of America, where obesity (degree of over-weight) is now officially considered to be a medical epidemic. That is, it is a disease, but it is not caused by some organism, such as a bacterium or virus, but is instead a lifestyle disease — it can be cured and prevented only by changing the person's lifestyle.

The Centers for Disease Control and Prevention (CDC), in the USA, publish a range of data collected in their surveys — Nutrition, Physical Activity, and Obesity: Data, Trends and Maps. Their current data include information up to 2017.

These data are presented separately for each state. The data collection includes:

Obesity — % of adults who are obese, as defined by the Body Mass Index (>30 is obese)
Lack of exercise — % of adults reporting no physical leisure activity; % of adolescents watching 3 or more hours of television each school day
Unhealthy eating — % of adults eating less than one fruit per day; % of adolescents drinking soda / pop at least once per day.

The CDC show maps and graphs for these data variables separately, but there is no overall picture of the data collection as a whole. This would be interesting, because it would show us which states have the biggest general problem, in the sense that they fare badly on all or most of the lifestyle measurements. So, let's use a network to produce such a picture.

For our purposes here, I have looked at the three sets of data for adults only. The network will thus show states that have lots of obese adults who get little exercise and do not eat many fruits and vegetables.

As usual for this blog, the network analysis is a form of exploratory data analysis. The data are the percentages of people in each state that fit into the three lifestyle characteristics defined above (obese, no exercise, unhealthy eating). For the network analysis, I calculated the similarity of the states using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-state similarities.
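As a rough illustration of the distance step only, here is a minimal Python sketch. The state names and percentages are made up for the example (they are not the CDC values), and the function name is my own. The Neighbor-net step is not shown; it would be run in dedicated network software on a distance matrix of this kind.

```python
from itertools import combinations

# Hypothetical percentages per state: (obese, no exercise, low fruit intake).
# These numbers are invented for illustration; they are NOT the CDC data.
states = {
    "State A": (25.0, 20.0, 35.0),
    "State B": (27.0, 22.0, 36.0),
    "State C": (38.0, 32.0, 48.0),
}


def manhattan(a, b):
    """Manhattan distance: the sum of absolute differences over all variables."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Pairwise distances between all states (the upper triangle of the matrix).
distances = {
    (s, t): manhattan(states[s], states[t])
    for s, t in combinations(sorted(states), 2)
}

for (s, t), d in distances.items():
    print(f"{s} vs {t}: {d:.1f}")
```

In this toy example, States A and B differ by only a few percentage points on each variable, so their distance is small, while State C is far from both. In a network, A and B would therefore sit close together and C would be well separated.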

Network of the lifestyle habits in the various US states

The resulting network is shown in the graph. States that are closely connected in the network are similar to each other based on their adult lifestyles, and those states that are further apart are progressively more different from each other. In this case, the main pattern is a gradient from the healthiest states at the top of the network to the most unhealthy at the bottom.

Note that there are seven states separated from the rest at the bottom of the network. These states have far more people with unhealthy lifestyles than do the other US states. In other words, the lifestyle epidemic is at its worst here.

In the top-middle of the network there is a partial separation of states at the left from those at the right (there is no such separation elsewhere in the network). The states at the left are those that have relatively low obesity levels but still fare worse on the other two criteria (exercise and eating). For example, New York and New Jersey have the same sorts of eating and exercise habits as Pennsylvania and Maryland but their obesity levels are lower.

It is clear that the network relates closely to the standard five geographical regions of the USA, as shown by the network colors. The healthiest states are mostly from the Northeast (red), except for Delaware, while the unhealthiest states are from the Southeast (orange), with Florida, Virginia and North Carolina doing much better than the others. The Midwest states are scattered along the middle-right of the network, indicating a middling status. The Southwest states are mostly at the middle-left of the network.

The biggest exception to these regional clusterings is the state of Oklahoma. This is in the bottom (unhealthiest) network group, far from the other Southwest states. This pattern occurs across all three characteristics; for example, Oklahoma has the second-lowest intake of fruit (nearly half the adults don't eat fruit), second only to Mississippi.

These data have also been analyzed by Consumer Protect, who offer some further commentary.

Conclusions

This analysis highlights those seven US states that have quantitatively the worst lifestyles in the country, and where the lifestyle obesity epidemic is thus at its worst.

These poor lifestyles have a dramatic impact on longevity: people cannot expect to live very long if they live an unhealthy lifestyle. The key concept here is the difference between life expectancy (how long people live, on average) and healthy life expectancy (how long people remain actively healthy, on average). This topic is discussed by The US Burden of Disease Collaborators (2018. The state of US health, 1990-2016. Journal of the American Medical Association 319: 1444-1472).

In that paper, the data for the USA show that, for most states, healthy life expectancy is c. 11 years less than the total life expectancy, on average. This big difference is due to unhealthy lifestyles, which eventually catch up with you. As a simple example, the seven states at the bottom of the network are ranked 44-51 in terms of healthy longevity, at least 2.5 years shorter than the national average. (Note: Tennessee is ranked 45th.)

You can see why the CDC is concerned, and why there is considered to be an epidemic.

Postscript

Some of the seven states highlighted here have other lifestyle problems, as well. For example, if you consult Places in America with the highest STD rates, you will find that five of them are listed in the top ten: 2: Mississippi, 3: Louisiana, 6: Alabama, 9: Arkansas, 10: Oklahoma, 31: Kentucky, and 50: West Virginia.

Monday, September 2, 2019

Losing information in phylogenetic consensus

Any summary loses information, by definition. That is, a summary is used to extract the "main" information from a larger set of information. Exactly how "main" is defined and detected varies from case to case, and some summary methods work better for certain purposes than for others.

A thought experiment that I used to play with my experimental-design students was to imagine that they were all given the same scientific publication, and were asked to provide an abstract of it. Our obvious expectation is that there would be a lot of similarity among those abstracts, which would represent the "important points" from the original — that is, those points of most interest to the majority of the students. However, there would also be differences among the abstracts, as each student would find different points that they think should also be included in the summary. In one sense, the worst abstract would be the one that has the least in common with the other abstracts, since it would be summarizing things that are of less general interest.

The same concept applies to mathematical summaries (aka "averages"), such as the mean, median and mode, which reduce the central location of a dataset to a single number. It also applies to summaries of the variation in a dataset, such as the variance and inter-quartile range. (Note that a confidence interval or standard error is an indication of the precision of the estimate of the central location, not a summary of the dataset variation — this is a point that seems to confuse many people.)
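The distinction in that parenthetical note can be made concrete with a small Python sketch. The numbers are invented, and for simplicity I use the population standard deviation (dividing by n) rather than the sample version. The dataset is replicated four times, so its variation stays the same, while the precision of the estimated mean improves.

```python
from math import sqrt
from statistics import pstdev


def spread_and_precision(values):
    """Return (standard deviation, standard error of the mean).

    Uses the population SD for simplicity; this is an illustrative choice.
    """
    sd = pstdev(values)
    se = sd / sqrt(len(values))
    return sd, se


small = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
large = small * 4                   # same values, four times as many observations

sd_small, se_small = spread_and_precision(small)
sd_large, se_large = spread_and_precision(large)

print(f"n={len(small):2d}  SD={sd_small:.3f}  SE={se_small:.3f}")
print(f"n={len(large):2d}  SD={sd_large:.3f}  SE={se_large:.3f}")
```

The standard deviation is the same for both datasets, because the variation among the values has not changed, whereas the standard error halves when the sample size is quadrupled.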

So, it is easy to summarize data and thereby lose important information. For example, if my dataset has two exactly opposing time patterns, then the data average will appear to remain constant through time. I might thus conclude from the average that "nothing is happening" through time when, in fact, two things are happening. I will never find out about my mistake by simply looking at the data summary — I also need to look at the original data patterns.
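Here is a minimal Python sketch of that situation, using two made-up series: one rises steadily through time and the other falls at exactly the same rate.

```python
from statistics import mean

times = list(range(6))

# Two invented, exactly opposing time patterns.
rising = [10 + 2 * t for t in times]    # increases by 2 per time step
falling = [30 - 2 * t for t in times]   # decreases by 2 per time step

# The summary: the average of the two series at each time point.
average = [mean(pair) for pair in zip(rising, falling)]

print("rising: ", rising)
print("falling:", falling)
print("average:", average)
```

The average is the same at every time point, so the summary on its own suggests that nothing is changing, even though both of the underlying series change at every step.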

So, what has this got to do with phylogenetics? Well, a phylogenetic tree is a summary of a dataset, and that summary is, by definition, missing some of the patterns in the data. These patterns might be of interest to me, if I knew about them.

Even worse, phylogenetic data analyses often produce multiple phylogenetic trees, all of which are mathematically equal as summaries of the data. What are we then to do?

One thing that people often do is to compute a Consensus Tree (eg. the majority consensus), which is a summary of the summaries — that is, it is a tree that summarizes the other trees. It would hardly be surprising if that consensus tree is an inadequate summary of the original data. In spite of this, how often do you see published papers that contain any evaluation of their consensus tree as a summary of the original data?

This issue has recently been addressed in a paper uploaded to bioRxiv:

Anti-consensus: detecting trees that have an evolutionary signal that is lost in consensus
Daniel H. Huson, Benjamin Albrecht, Sascha Patz, Mike Steel

Not unexpectedly, given the background of the authors, they explore this issue in the context of phylogenetic networks. As they note:

A consensus tree, such as the majority consensus, is based on the set of all splits that are present in more than 50% of the input trees. A consensus network is obtained by lowering the threshold and considering all splits that are contained in 10% of the trees, say, and then computing the corresponding splits network. By construction and in practice, a consensus network usually shows the majority tree, extended by a number of rectangles that represent local rearrangements around internal nodes of the consensus tree. This may lead to the false conclusion that the input trees do not differ in a significant way because "even a phylogenetic network" does not display any large discrepancies.

That is, sometimes authors do attempt to evaluate their consensus tree, by looking at a network. However, even the network may turn out to be inadequate, because a phylogenetic tree is a much more complex summary than is a simple mathematical average. This is sad, of course.
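The thresholding idea in the quoted passage can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: each tree is represented as a set of its non-trivial splits, each split is named by its smaller side, and the function name and the example trees are invented. It does not compute the splits network itself, and it is not the anti-consensus method of the paper.

```python
from collections import Counter


def splits_above(trees, threshold):
    """Return the splits that occur in more than `threshold` of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {s for s, c in counts.items() if c / len(trees) > threshold}


# Four invented unrooted trees on taxa A-E, each given as its two
# non-trivial splits; a split is named by the smaller side of the bipartition.
trees = [
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("DE")},
]

majority = splits_above(trees, 0.5)   # splits for a majority consensus tree
network = splits_above(trees, 0.1)    # splits for a consensus network

print("majority:", sorted("".join(sorted(s)) for s in majority))
print("network: ", sorted("".join(sorted(s)) for s in network))
```

With the 50% threshold, only the splits shared by most of the trees survive; lowering the threshold to 10% also admits the two minority splits, which is what adds the extra rectangles to a consensus network.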

So, the new suggestion by the authors is:

To harness the full potential of a phylogenetic network, we introduce the new concept of an anti-consensus network that aims at representing the largest interesting discrepancies found in a set of trees.

This should reveal multiple large patterns, if they exist in the original dataset. Phylogenetic analyses keep moving forward, fortunately.