The Genealogical World of Phylogenetic Networks: October 2016

Tuesday, October 25, 2016

Sound change as systemic evolution

I have been discussing the peculiarities of sound change in linguistics in a range of blog posts in the past (see Alignments and Phylogenetic Reconstruction, Directional Processes in Language Change, Productive and Unproductive Analogies). My core message was that it is really difficult to find an analogy with biology, as sound change is not the simple mutation of one sound in a certain word, but the regular modification of all sounds of all words in the lexicon which occur in a specific contextual slot.

Scholars have tried to model this as concerted evolution (Hruschka et al. 2015). But the analogy with biology does not sound very convincing, as the change concerns the production of speech rather than its product. By this, I mean that sound change concerns the abstract system by which speakers produce the words of their language. Think of speakers in comic books who lose a tooth in some fight. Often, in order to show how their speech suffers from this loss, writers illustrate this by replacing certain "s" sounds in the speech of the victims with a "th" (in German, it would be an "f"). They do this in order to illustrate that with a lost tooth, it is "very difficult to thpeak". In the same way, writers imitate speech of people suffering from speech impediments like sigmatism (lisp). The loss of a tooth changes all "s"es in a person's language. Sound change, at least one type of sound change, is identical with this.

In a recent talk I gave with Nathan Hill at a conference in Poznań, we found a way to demonstrate this on actual language data. In this talk, we used data from eight Burmish languages (a language family spoken mainly in the South-West of China and in Myanmar), which we coded for partial cognates (as these languages contain many compounds). We aligned these cognate sets automatically, and then searched for recurring patterns in the alignments. One needs to keep in mind that our words in linguistics are extremely short, and we have no more than five sounds per alignment in our data, which translates to five sites in an alignment in biology.

While biology knows certain contextual patterns like hydrophilic stretches in alignments (as already demonstrated in the famous ClustalW software, compare Thompson et al. 1994), the context in which a sound occurs in language evolution is even more important. We can, for example, say that the beginning of a word or morpheme is usually the most stable part, where sounds change much more slowly than in the other parts (at the end of a word or of a syllable). We thus concentrated only on the first sound of each word and looked at the patterns of sounds we could find there.

Those patterns in our data usually look like this:

Cognate set	L1	L2	L3	L4	L5	L6	L7	L8
word 1	p	p	p	Ø	f	f	Ø	p
word 2	p	Ø	p	p	Ø	f	p	p
word 3	k	Ø	tɕ	k	s	k	Ø	k
word 4	Ø	k	tɕ	Ø	s	Ø	s	k
...	...	...	...	...	...	...	...	...

Note that the symbol "Ø" in this context denotes missing data, as we did not find a cognate word in the given language. As always, most of our data is patchy, and we have to deal with that. You can see that when looking only at the first sound in each alignment, we find quite a degree of variation; and if you look at all the data, you can see some things that seem to show structure, but the amount of complexity is still immense. You may see this from the following plot, showing only some 100 of the more than 300 patterns we created (coloured cells represent not necessarily the same sound, but one of ten different sound classes to which the more than 50 different sounds in our data belong):

Sound patterns (initial consonant) in the aligned cognate sets of the Burmish languages

Interestingly, however, most of the variation can be reduced quite efficiently with the help of network techniques. Since we are dealing with systemic evolution, it is straightforward to group our more than 300 alignments into groups that evolve in an identical manner. At least this is what our linguistic theory predicts, and what linguists have been studying for the last 200 years. When looking at the patterns I gave above, you can see that we can easily group the four rows into two groups:

Cognate set	L1	L2	L3	L4	L5	L6	L7	L8
word 1	p	p	p	Ø	f	f	Ø	p
word 2	p	Ø	p	p	Ø	f	p	p
-	-	-	-	-	-	-	-	-
word 3	k	Ø	tɕ	k	s	k	Ø	k
word 4	Ø	k	tɕ	Ø	s	Ø	s	k

Essentially, the two groups reflect only two patterns, if we disregard the gaps and merge them into one row each:

Cognate set	L1	L2	L3	L4	L5	L6	L7	L8
word 1 / word 2	p	p	p	p	f	f	p	p
-	-	-	-	-	-	-	-	-
word 3 / word 4	k	k	tɕ	k	s	k	s	k
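The merging step can be sketched in a few lines of Python. This is only an illustration using the four toy rows from the tables above; the function names `compatible` and `merge` are hypothetical and not taken from the software used in the study.

```python
GAP = "Ø"  # missing-data symbol used in the tables above

def compatible(a, b):
    """True if two patterns never show different sounds in the same language."""
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))

def merge(a, b):
    """Merge two compatible patterns, filling gaps in one row from the other."""
    if not compatible(a, b):
        raise ValueError("patterns conflict and cannot be merged")
    return [y if x == GAP else x for x, y in zip(a, b)]

word1 = ["p", "p", "p", "Ø", "f", "f", "Ø", "p"]
word2 = ["p", "Ø", "p", "p", "Ø", "f", "p", "p"]
word3 = ["k", "Ø", "tɕ", "k", "s", "k", "Ø", "k"]
word4 = ["Ø", "k", "tɕ", "Ø", "s", "Ø", "s", "k"]

merged_p = merge(word1, word2)  # the "word 1 / word 2" row
merged_k = merge(word3, word4)  # the "word 3 / word 4" row
print(merged_p)
print(merged_k)
```

Trying to merge word 1 with word 3 raises an error, because the two rows already disagree in the first language.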

What is important when grouping two alignments into one pattern is to make sure that they do not contain any conflicting positions. This can be checked in a rather straightforward manner by constructing a network from the data. In this network, the nodes are the alignment sites (word 1, word 2, etc. in our examples), and links are drawn between nodes if two sites are not in conflict with each other. If we use this criterion of compatibility on our data, we obtain the following network:


Compatibility network of the sites in our aligned cognate sets
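As a rough sketch of how such a compatibility network can be built, the following Python snippet links every pair of toy patterns that do not conflict. The dictionary-of-sets representation and the name `compatibility_network` are assumptions made for illustration, not the code behind the figure.

```python
from itertools import combinations

GAP = "Ø"  # missing data

patterns = {
    "word 1": ["p", "p", "p", "Ø", "f", "f", "Ø", "p"],
    "word 2": ["p", "Ø", "p", "p", "Ø", "f", "p", "p"],
    "word 3": ["k", "Ø", "tɕ", "k", "s", "k", "Ø", "k"],
    "word 4": ["Ø", "k", "tɕ", "Ø", "s", "Ø", "s", "k"],
}

def compatible(a, b):
    """Two sites are compatible if no language shows two different sounds."""
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))

def compatibility_network(patterns):
    """Return an undirected graph as a dict: node -> set of compatible nodes."""
    graph = {name: set() for name in patterns}
    for u, v in combinations(patterns, 2):
        if compatible(patterns[u], patterns[v]):
            graph[u].add(v)
            graph[v].add(u)
    return graph

network = compatibility_network(patterns)
for node, neighbours in network.items():
    print(node, "->", sorted(neighbours))
```

Because gaps match anything, compatibility is not transitive: two sites can both be compatible with a third yet conflict with each other, which is why the grouping has to be done on cliques rather than on connected components.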

In the network, I further coloured the nodes according to the overall similarity of sounds present in them. The legend gives capital letters for major sound classes, in order to facilitate seeing the structure.

This network itself, however, does not tell us how to group the data into classes that correspond to one identical process of systemic evolution, as we can still see many conflicts. In order to solve this, we need to carry out a specific partitioning analysis that cuts the network into an ideally minimal number of cliques. Why cliques? Because a clique will represent patterns in our data that do not show any conflicts in their sounds, and this is exactly what we want to see: those patterns that behave identically, without exceptions.

The problem of finding the minimal clique partition of a network is, unfortunately, a hard one (see Bhasker and Samad 1991), so we needed to use some approximate shortcuts. Nevertheless, with a very simple procedure of clique partitioning, we succeeded in reducing the 317 cognate sets that we selected for our study down to 35 groups that covered 74% of the data (234 cognate sets), with a minimal size of 2 alignments per group. The "manual" inspection by the Burmish expert in our team (that is Nathan Hill) showed that many of these patterns correspond to what experts assume was one single sound in the ancestral Proto-Burmish language.

To illustrate more closely what I mean by reducing patterns to unique groups, look at the following pattern, which shows different nasal sounds in the data:
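The exact shortcut used in the study is not spelled out here, so the following is only one possible greedy heuristic, written as a Python sketch. It assumes the network is given as a dictionary mapping each node to the set of its neighbours; the function name and the toy graph are hypothetical, and the result is not guaranteed to be a minimal partition.

```python
def greedy_clique_partition(graph):
    """Split the nodes of an undirected graph into cliques, greedily.

    graph: dict mapping each node to the set of its neighbours (no self-loops).
    """
    unassigned = set(graph)
    cliques = []
    while unassigned:
        # Seed with the unassigned node that has most unassigned neighbours.
        seed = max(sorted(unassigned), key=lambda n: len(graph[n] & unassigned))
        clique = {seed}
        # Candidates must be adjacent to every node already in the clique.
        candidates = graph[seed] & unassigned
        while candidates:
            nxt = max(sorted(candidates), key=lambda n: len(graph[n] & candidates))
            clique.add(nxt)
            candidates &= graph[nxt]
        cliques.append(clique)
        unassigned -= clique
    return cliques

# Toy graph: a triangle, a single edge, and an isolated node.
example = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
cliques = greedy_clique_partition(example)
covered = sum(len(c) for c in cliques if len(c) >= 2)
print(cliques, covered / len(example))
```

The last two lines mirror the coverage figure reported above: only cliques with at least two members count as groups, and isolated nodes remain as unique patterns.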

Nasal sounds in the Burmish data

And then at another pattern, showing s-sounds:

S-sounds in the Burmish data

I think (at least I hope) that the amount of regularity we find here is enough to demonstrate what is meant by the regularity of sound change in linguistics: sound change is in some sense just like losing a tooth, but for a complete population of speakers, not just one speaker, as the population starts to change all sounds occurring in a certain environment to some other sound.

Our results are not perfect: the 26% of unique patterns, for example, are something we will need to look into in more detail in the near future. A quick check showed that they may result from errors in the cognate annotation, but also from peculiarities in the data, and even simply from sounds that are rare in the languages under investigation.

We are currently looking into these issues, trying to refine our approach. I realized, for example, that the minimal clique coverage problem has been studied before by other researchers, and I found a rather large amount of Russian literature on the topic (see, for example, Bratceva and Čerenin 1994 and Ryzhkov 1975), but those approaches do not seem to have been thoroughly studied in the Western literature. We also know that at some point we need to relax our approach, allowing for some exceptions — we know that systemic sound change processes are easily overridden by language-specific factors, be it lateral transfer, or pragmatics in a larger sense (think of Bob Dylan, talking of "the words I never KNOWED" in order to make sure the word rhymes with "ROAD", or the form "wanna" as a shortcut for "want to").

Not all cases in which speakers changed the pronunciation of sounds have systemic reasons, and we are still far from actually understanding the systemic reasons that lead to the regular aspects of sound change. What we can show, however, is that sound change is really something peculiar in language evolution, with no real counterpart in biology. At least, I do not know of any case where a set of 300 alignments could be reduced to some 35 largely identical patterns. This shows, on the other hand, that the classical biological approaches that try to model each site of an alignment independently are definitely not what we need in order to model sound change realistically. The assumption of independence of sites in an alignment is already problematic in biology. In linguistics, at least in the cases illustrated above, it seems to be just as useless as tossing a coin to predict the weather in a desert: it is too much of an effort with very poor results to be expected.

References

Bhasker, J. and T. Samad (1991): The clique-partitioning problem. Computers & Mathematics with Applications 22.6. 1-11.
Bratceva, E. and V. Čerenin (1994): Otyskanie vsex naimen’šix pokrytij grafa klikami [Searching all minimal clique coverages of a graph]. Žurnal Vyčislitel’noj Matematiki i Matematičeskoj Fiziki [Journal of Computational Mathematics and Physics] 34.8-9. 1272-1292.
Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1. 1-9.
Ryzhkov, A. (1975): Partitioning a graph into the minimal number of complete subgraphs. Cybernetics 11.6. 939-943. Original article: Рыжков А. П., Разбиение графа на минимальное число полных подграфов .. 90-96. Kybernetika 1975. 6.
Thompson, J., D. Higgins, and T. Gibson (1994): CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22.22. 4673–4680.

Tuesday, October 18, 2016

The Genome Cellar is no such thing

In an earlier blog post, I noted that The Music Genome Project is no such thing. The use of the word "genome" in this context is an analogy, in which the musical characteristics are seen as producing a sort of genetic fingerprint. However, this is a false analogy, because the data used for the Music Genome Project are actually phenotypic, not genotypic. Indeed, music has no analog of a genotype.

In a similar vein, the data used for The Genome Cellar are phenotypic, not genotypic, and so this is also a false analogy.

The Genome Cellar is the database used by the Next Glass app. This app was released in November 2014, and a concurrent press release explained the concept:

Next Glass is the breakthrough app that uses science and machine learning software to provide accurate, personalized recommendations to consumers. Next Glass has analyzed tens of thousands of bottles of wine and beer with a mass spectrometer and stores the "DNA" of each product in its Genome Cellar™, which combines with users' Taste Profiles™ to provide product-specific recommendations.

So, the beer / wine data in the Genome Cellar are peaks in a mass spectrometer output. This is made clear in another press release:

Next Glass has developed the world’s first Genome Cellar, an extensive database that contains the chemical makeup – or "DNA" – of tens of thousands of wines and beers. By looking at each bottle on a molecular level, Next Glass defines a unique taste profile for every bottle by analyzing thousands of chemical elements.

This procedure will, indeed, provide a unique fingerprint for each alcoholic product, but it will be a phenotypic one, not a genotypic one. Genetics is often chemistry, but not all chemistry is genetics.

The idea of the Next Glass app is the same as that for the Music Genome Project — to use the fingerprint of currently liked products (music or wines / beers) to make recommendations for other products that might appeal to the customer. This approach can be expected to work for alcoholic beverages, because the subjective preferences will be based to some extent on the sensory components of the chemical makeup. If you document enough of the chemistry then you are bound to include a large proportion of the sensory part.

Anyway, you can see a short video about the laboratory here.

Finally, you might like to compare this approach with that of WineFriend, which tries to assess your taste in wine with multiple-choice questions, instead of complex chemistry. WineFriend:

uses a simple eight question taste survey that gives insights into a customer's thresholds for sweet, sour, bitterness and intensity of flavour. It then creates a profile which enables it to select wines that are tailored to the individual customer's tastes.

No mention of genomes here.

Tuesday, October 11, 2016

Changes in Playboy's women through 60 years

It has long been known that ideas about female attractiveness, and concern with body weight among young women, are closely related to exposure to mass media images (see the review by Spettigue & Henderson 2004). The print media are particularly involved in this issue, not least the so-called "men's magazines", such as Playboy. It therefore created a great deal of media interest when it was announced in October 2015 that Playboy would no longer feature nude centerfolds (known as Playmates).

Indeed, Playboy has often been claimed as a purveyor of the US society's image of the "ideal woman", although this is surely media exaggeration. Playboy, whether we love it or hate it, has simply portrayed females that the editors thought would sell magazines at the time. Nevertheless, the magazine's choice of models has been used in the professional medical and psychological literature as representative of a prevalent cultural idealization of an ultra-slender female body shape (eg. Garner et al. 1980; Wiseman et al. 1992; Szabo 1996; Spitzer et al. 1999; Katzmarzyk & Davis 2001; Pettijohn & Jungeberg 2004).

It therefore comes as no surprise that the magazine's database of model statistics was subjected to scrutiny in the online media after the 2015 announcement, particularly with regard to how things had changed during the magazine's 62 years (for an earlier analysis, see The girls next door: Life in the centerfold). Sadly, some of this recent analysis was quite poor (eg. Playboy's image of the ideal woman sure has changed). Here, I try to correct this by presenting a more thorough study of the available data.

The data I have used covers all of the Playmates of the Month that have appeared in the US edition of the magazine since its inception. This is contained in a searchable version of the pmstats.txt file that has been maintained by Jim Dean, Johnny Corvin and Doug Ewell, as currently available on Peggy Wilkins' website. This file is an updated compilation of the so-called "vital statistics" of the Playmates from December 1953 to February 2016, inclusive, as reported in Playboy, sometimes supplemented from other available sources.

Note, especially, that the data are basically self-reported by the Playmates. Some of the information has been questioned at various times, notably where it seems to contradict the associated photographic evidence. As a reputable scientist, I should probably have personally checked all of this evidence, but I have not done so (you can do so yourself, based on whatever photos you can find on the internet, or the book edited by Gretchen Edgren 2006). I have simply assumed that, at a minimum, the information presents whatever the Playmates thought was a desirable public image at the time of publication.

There are 753 records in the dataset, separately including twins and triplets appearing in the same magazine issue, as well as multiple appearances by the same woman in different issues. The data include: magazine issue month; Playmate name, birth date and birth location; height in inches and weight in pounds; breast, waist and hip dimensions in inches; and photographer name. From this information, for each Playmate I calculated their age at the time of publication, along with standard measurements for determining whether a body is healthy or not: Body Mass Index (BMI), for body size (ie. underweight, normal weight, overweight, obese), and Waist to Hip Ratio (WHR), for body curvaceousness.
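For readers who want to reproduce the two derived measurements, here is a minimal Python sketch. It converts the imperial units of the dataset to metric before applying the usual BMI definition (kilograms divided by metres squared); the function names and the example values are illustrative only and are not records from the file.

```python
LB_TO_KG = 0.45359237  # kilograms per pound
IN_TO_M = 0.0254       # metres per inch

def bmi(weight_lb, height_in):
    """Body Mass Index: weight in kg divided by squared height in metres."""
    return (weight_lb * LB_TO_KG) / (height_in * IN_TO_M) ** 2

def whr(waist_in, hip_in):
    """Waist to Hip Ratio; the units cancel, so inches can be used directly."""
    return waist_in / hip_in

# Illustrative values only: 5'6" and 115 lb, with a 24" waist and 36" hips.
print(round(bmi(115, 66), 1))
print(round(whr(24, 36), 2))
```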

Analysis

As is usual in this blog, the data can be summarized using a phylogenetic network as a form of exploratory data analysis (see How to interpret splits graphs).

I first range-standardized the data (so that all of the measurements are compared on the same scale), and log-transformed the BMI and WHR measurements (because otherwise these ratios will have non-linear relationships to the other variables). I then used the Manhattan distance to calculate the dissimilarity of the different publication years and birth locations, based on the Playmates' body dimensions. This was followed by a neighbor-net analysis to display the between-year and the between-location similarities as two phylogenetic networks.
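The preprocessing can be sketched as follows in plain Python. Two assumptions are made here: the log-transformation is applied before the range-standardization (the log of a standardized zero would be undefined), and the per-year values are invented toy numbers, not the real averages. The neighbor-net step itself is not shown, since it needs dedicated software.

```python
import math

def range_standardize(values):
    """Rescale a list of numbers linearly so that it spans 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Invented toy values per year: height (in), bust (in), BMI.
years = {
    "1960": [65.0, 36.0, 19.5],
    "1980": [66.0, 35.0, 18.0],
    "2000": [67.5, 34.0, 18.2],
}
LOG_COLUMNS = {2}  # index of the ratio-type variable (BMI) to log-transform

names = list(years)
columns = list(zip(*(years[n] for n in names)))
scaled = [
    range_standardize([math.log(v) for v in col] if i in LOG_COLUMNS else list(col))
    for i, col in enumerate(columns)
]
profiles = {n: [col[k] for col in scaled] for k, n in enumerate(names)}
distances = {(a, b): manhattan(profiles[a], profiles[b]) for a in names for b in names}
print(distances[("1960", "1980")], distances[("1960", "2000")])
```

The resulting distance matrix is the kind of input that a neighbor-net program expects.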

The network of relationships among the years is shown first. Years that are closely connected in the network are similar to each other based on the body dimensions of their Playmates, and those that are further apart are progressively more different from each other.

The network shows that there has been a strong and consistent change in Playmate age, size and shape through time. In the graph there is a simple gradient through time from top-right to bottom-left: the 1950s and 1960s are intermingled at the top, with the 1970s below them, the 1980s and 1990s below that, and the 2000s and 2010s intermingled at the bottom.

So, it will be worth looking at time graphs of the individual measurements. Let's start with age.

This does not show a particularly consistent trend, but the average age of the models does increase from 21 to 24 years from beginning to end of the time period.

The next graph shows that the reported height of the Playmates also increases across the 62 years, by 2.5" on average. There is almost no change in average weight across the decades (and so the graph is not shown).

However, far more notable is the relationship between height and weight, as expressed by the BMI, which is shown in the next graph. This does not show a linear trend at all, but a distinctly curved one. That is, the size of Playmates definitely changed through time, becoming thinner for the first 40 years, but then thickening up again for the next 20 years.

This trend has not been discussed in the professional literature, as far as I can determine, perhaps because previous assessments have been based only on a relatively short period of time, not the full 6 decades. Note that the bottom point of the curve occurs in c. 1997, and that by 2016 the BMI measurements had returned to the 1975 level (40 years earlier). I wonder whether they would return to the 1950s level in another 20 years?

More importantly, given that Playmates are to one degree or another reflecting a contemporary societal image of a desirable woman, we can note that 48% of these models are classified as being underweight. The lower limit of a healthy BMI is 18.5, as shown in the next graph, which also shows the boundaries between Mild thinness (17-18.5), Moderate thinness (16-17) and Severe thinness (<16).
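The thresholds just listed translate directly into a small classifier. The sketch below is illustrative only; the function names are hypothetical and the BMI values in the example are made up.

```python
def thinness_category(bmi):
    """Classify a BMI value using the underweight boundaries given above."""
    if bmi < 16:
        return "severe thinness"
    if bmi < 17:
        return "moderate thinness"
    if bmi < 18.5:
        return "mild thinness"
    return "not underweight"

def share_underweight(bmis):
    """Fraction of BMI values that fall below the healthy lower limit of 18.5."""
    return sum(b < 18.5 for b in bmis) / len(bmis)

sample = [15.5, 17.2, 18.5, 20.0]  # made-up values
print([thinness_category(b) for b in sample])
print(share_underweight(sample))
```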

Clearly, during the period 1975-1995 the vast majority of the models reported being underweight, while in the 1950s and 1960s very few of them did. This situation has improved recently, with roughly half being underweight during the past 20 years. Also, several of the reported body sizes are very unhealthy. However, perhaps the BMI values below 16 are unreliable, in the sense that such a person is not likely to be very photogenic.

We can now move on to the circumferences of the models. The next graph shows the time trend for the reported circumference at breast level. This shows the biggest and most consistent change of all, with a dramatic reduction in bustiness.

Indeed, chest sizes of >36" have hardly been reported since the start of 1990, and yet in the early years a buxom 36-24-36 figure was the most common claim by the Playmates. Interestingly, very few of the models have claimed a chest size of 33" (as opposed to 32" or 34"); is this some sort of superstition?

The other large and consistent change in circumference is for waist size, as shown in the next graph. This shows the opposite trend, with an increase in average reported size of 2" across the 60 years.

There was a slight but not consistent reduction in hip circumference through time (and so the graph is not shown). This means that the WHR, the measure of curvaceousness, changed greatly through time, as shown in the next graph. So, with the waists reportedly becoming larger, there was apparently a very large reduction in the curvaceousness of the models through time.

Note that the reduction in BMI was apparently achieved in spite of an increase in waist size — the BMI reduction seems to be related to the increase in average reported height without an increase in weight, and partly to the decrease in chest size.

When combined with the reduction in breast circumference, this means that the Playmates of the 21st century have been a very different shape from those of the mid 20th century. They were taller, with smaller breasts and larger waists, and thus had fewer curves.

We can end this discussion by considering where these Playmates were born. Most of them reported being born in the USA (83%). This means that we can consider how the various states compare in producing nude models. Obviously, more models are likely to come from the most populous states, and so we need to standardize the data by dividing by the population size of each state (as estimated for 2015 in Wikipedia), to yield the number of Playmates per million people in each state.
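The standardization is simple arithmetic, sketched here in Python with placeholder regions and invented counts and populations (not the real figures):

```python
def per_million(count, population):
    """Number of Playmates per million inhabitants."""
    return count * 1_000_000 / population

# Placeholder regions with invented numbers, for illustration only.
regions = {
    "region A": (2, 500_000),
    "region B": (40, 20_000_000),
}
rates = {name: per_million(n, pop) for name, (n, pop) in regions.items()}
print(rates)
```

A small population can produce a large rate from very few models, which is worth keeping in mind when reading the graph.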

Apparently, Hawaii and California are more likely than the other states to produce models who are prepared to take their clothes off in public, while Delaware and Vermont have not yet done so, at least as far as Playboy is concerned. The apparently large value for Washington DC represents only 2 models from a relatively small population.

We can also consider whether the dimensions of the models vary in any consistent way between the states. This can be done with a phylogenetic network, as discussed above. In the following network, states that are closely connected are similar to each other based on the body dimensions of their Playmates, and those that are further apart are progressively more different from each other.

There appear to be no consistent patterns here.

So, we can finish by considering the countries from which the remaining 17% of the models originated. Once again, the data are standardized, to yield the number of Playmates per million people in each country (or province, for Canada). The apparently large value for Malta represents one set of twins from a relatively small population.

There have been a relatively large number of models from Scandinavia (Norway, Denmark and Sweden). This presumably represents the number of females whose body shape matches the image required by the Playboy editors, as much as the willingness of Scandinavians to disrobe publicly. However, it is notable that the rate of models from Norway is double those for Denmark and Sweden.

References

Edgren G (ed.) (2006) The Playmate Book: Six Decades of Centerfolds. Taschen.

Garner DM, Garfinkel P, Schwartz D, Thompson M (1980) Cultural expectations of thinness in women. Psychological Reports 47: 484-491.

Katzmarzyk PT, Davis C (2001) Thinness and body shape of Playboy centerfolds from 1978 to 1998. International Journal of Obesity 25: 590-592.

Pettijohn TF, Jungeberg BJ (2004) Playboy Playmate curves: changes in facial and body feature preferences across social and economic conditions. Personality and Social Psychology Bulletin 30: 1186-1197.

Spettigue W, Henderson KA (2004) Eating disorders and the role of the media. Canadian Child and Adolescent Psychiatry Review 13: 16-19.

Spitzer BL, Henderson KA, Zivian MT (1999) Gender differences in population versus media body sizes: a comparison over four decades. Sex Roles 40: 545-565.

Szabo CP (1996) Playboy centrefolds and eating disorders - from male pleasure to female pathology. South African Medical Journal 86: 838-839.

Wiseman CV, Gray JJ, Mosimann JE, Ahrens AH (1992) Cultural expectations of thinness in women: an update. International Journal of Eating Disorders 11: 85-89.

Tuesday, October 4, 2016

The practical limits of networks?

Network techniques are becoming more widespread in biology and anthropology. However, the data in both of these disciplines can form very complicated patterns, indeed; and there must be practical limits to what one can do with a network analysis. This post discusses an example that covers both disciplines, and which may well exceed those limits.

The data come from:

Pugach I, Matveev R, Spitsyn V, Makarov S, Novgorodov I, Osakovsky V, Stoneking M, Pakendorf B (2016) The complex admixture history and recent southern origins of Siberian populations. Molecular Biology and Evolution 33: 1777-1795.

The authors note:

Siberia is an extensive geographical region of North Asia stretching from the Ural Mountains in the west to the Pacific Ocean in the east, and from the Arctic Ocean in the north to the Kazakh and Mongolian steppes in the south. This vast territory is inhabited by a relatively small number of indigenous peoples, with most populations numbering only in the hundreds or few thousands. These indigenous peoples speak a variety of languages belonging to the Turkic, Tungusic, Mongolic, Uralic, Yeniseic, Chukotko-Kamchatkan, and Aleut-Yupik-Inuit families, as well as a few isolates. There is also variation in traditional subsistence patterns ... This linguistic and cultural diversity suggests potentially different origins and historical trajectories of the Siberian peoples.

Previous studies of the genetic history of Siberian populations were hampered by the extensive admixture that appears to have taken place among these populations, because commonly used methods assume a tree-like population history and at most single admixture events.

This suggests the use of network techniques, instead of tree-based ones. However, under the circumstances described here it may be unwise to try to produce a phylogenetic network. The situation, as described, does not resemble a "tree with reticulations" but more of an "anastomosing plexus". The latter may be more confusing than helpful, when visualized as a network.

So, the authors do not mention the word "network" nor even "reticulation". Instead:

Here we analyze geogenetic maps and use other approaches to distinguish the effects of shared ancestry from prehistoric migrations and contact, and develop a new method based on the covariance of ancestry components, to investigate the potentially complex admixture history. We furthermore adapt a previously devised method of admixture dating for use with multiple events of gene flow, and apply these methods to whole-genome genotype data [genome-wide SNPs] from over 500 individuals belonging to 20 different Siberian ethnolinguistic groups [plus 9 reference populations].

The results of these analyses indicate that there have been multiple layers of admixture detectable in most of the Siberian populations, with considerable differences in the admixture histories of individual populations.

The admixture (or introgression) patterns among the populations are illustrated using a map. Each bar represents a population, with the colors denoting the different ethnolinguistic groups. Note that every population shows admixture.

The reconstructed migration relationships among the populations are also illustrated using a map. This time, the colors of the arrows represent the different ethnolinguistic groups.

I would not like to have to represent these patterns using a network, and make that network comprehensible. So, this dataset may exceed the practical limits of networks.