The Genealogical World of Phylogenetic Networks: March 2018

Monday, March 26, 2018

It's the system, stupid! More thoughts on sound change in language history

In various blog posts in the past I have tried to emphasize that sound change in linguistics is fundamentally different from the kind of change in phenotype / genotype that we encounter in biology. The most crucial difference is that sound sequences, i.e., our words or parts of the words we use when communicating, do not manifest as a physical substance but, as linguists say, "ephemerically", i.e. by the air flow that comes out of the mouth of a speaker and is perceived as an acoustic signal by the listener. This is in strong contrast to DNA sequences, for example, which are undeniably somewhere "out there". They can be sliced, investigated, and they preserve information for centuries if not millennia, as the recent boom in archaeogenetics illustrates.

Here, I explore the consequences of this difference in a bit more detail.

Language as an activity

Language, as Wilhelm von Humboldt (1767-1835) — the boring linguist who investigated languages from his armchair while his brother Alexander was traveling the world — put it, is an activity (energeia). If we utter sentences, we pursue this activity and produce sample output of the system hidden in our heads. Since the sound signal is only determined by the capacity of our mouth to produce certain sounds, and the capacity of our brain to parse the signals we hear, we find a much stronger variation in the different sounds available in the languages of the world than we find when comparing the alphabets underlying DNA or protein sequences.

Despite the large variation in the sound systems of the world's languages, it is clear that there are striking common tendencies. A language without vowels does not make much sense, as we would have problems pronouncing the words or perceiving them at longer distances. A language without consonants would also be problematic; and even artificial communication systems developed for long-distance communication, like the different kinds of yodeling practiced in different parts of the world, make use of consonants to allow for a clearer distinction between vowels (see the page about Yodeling on Wikipedia). But, between both extremes we find great variation in the languages of the world, and this does not seem to follow any specific pattern that could point to any kind of selective pressure, although scholars have repeatedly tried to demonstrate it (see Everett et al. 2015 and the follow-up by Roberts 2018).

What is also important here is that, not only is the number of the sounds we find in the sound system of a given language highly variable, but there is also variation in the rules by which sounds can be concatenated to form words (called the phonotactics of a language), along with the frequency of the sounds in the words of different languages. Some languages tolerate clusters of multiple consonants (compare Russian vzroslye or German Herbst), others refuse them (compare the Chinese name for Frankfurt: fǎlánkèfú), yet others allow words to end in voiced stops (compare English job in standard pronunciation), and some turn voiced stops into voiceless ones (compare the standard pronunciation of Job in German as jop).

Language as a system

Language is a system which essentially concatenates a fixed number of sounds to sequences, being only restricted by the encoding and decoding capacities of its users. This is the core reason why sound change is so different from change in biological characters. If we say that German d goes back to Proto-Germanic *θ (pronounced as th in path), this does not mean that there were a couple of mutations in a couple of words of the German language. Instead it means that the system which produced the words for Proto-Germanic changed the way in which the sound *θ was produced in the original system.

In some sense, we can think metaphorically of a typewriter, in which we replace a letter by another one. As a result, whenever we want to type a given word in the way we know it, we will type it with the new letter instead. But this analogy would be too restricted, as we can also add new letters to the typewriter, or remove existing ones. We can also split one letter key into two, as happens in the case of palatalization, which is a very common type of sound change during which sounds like [k] or [g] turn into sounds like [tʃ] and [dʒ] when being followed by front vowels (compare Italian cento "hundred", which was pronounced [kɛntum] in Latin and is now pronounced as [tʃɛnto]).
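To make the typewriter metaphor a bit more tangible, here is a minimal Python sketch of a conditioned sound change that is attached to the system and therefore hits every word the system produces. The function name `palatalize`, the simplified set of front vowels, and the toy lexicon in rough phonetic spelling are assumptions made for illustration only; the sketch models the palatalization step alone, not the other changes that separate Latin from Italian.

```python
import re

# Assumed, simplified set of front vowels that trigger the change.
FRONT_VOWELS = "ieɛ"

# The "split key": [k] and [g] get a second output before front vowels.
PALATALIZATION = {"k": "tʃ", "g": "dʒ"}

# Match k or g only when a front vowel follows (lookahead keeps the vowel).
_PATTERN = re.compile(f"([kg])(?=[{FRONT_VOWELS}])")


def palatalize(word: str) -> str:
    """Apply the palatalization rule at every matching position of a word."""
    return _PATTERN.sub(lambda m: PALATALIZATION[m.group(1)], word)


# Toy lexicon (illustrative forms, rough transcription).
lexicon = ["kɛntum", "kantare", "gɛntem"]

# The rule is applied to the whole lexicon, not to hand-picked words.
changed = [palatalize(word) for word in lexicon]
print(changed)
```

Note that `kantare` comes out unchanged: the rule belongs to the system, but it only fires where its condition (a following front vowel) is met.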

Sound change is not the same as mutation in biology

Since it is the sound system that changes during the process we call sound change, and not the words (which are just a reflection of the output of the system), we cannot equate sound change with mutations in biological sequences, since mutations do not recur across all sequences in a genome, replacing one DNA segment by another one, which may not even have existed before. The change in the system, as opposed to the sequences that the system produces, is the reason for the apparent regularity of sound change.

This culminates in Leonard Bloomfield's (1887-1949) famous (at least among old-school linguists) expression that 'phonemes [i. e., the minimal distinctive units of language] change' (Bloomfield 1933: 351). From the perspective of formal approaches to sequence comparison, we could restate this as: 'alphabets change'. Hruschka et al. (2015) have compared sound change with concerted evolution in biology. We can state the analogy in simpler terms: sound change reflects systemics in language history, and concerted evolution results from systemic changes in biological evolution. It's the system, stupid!

Given that sound systems change in language history, this means that the problem of character alignments (i.e. determining homology/cognacy) in linguistics cannot be directly solved with the same techniques that are used in biology, where the alphabets are assumed to be constant, and alignments are supposed to identify mutations alone. If we want to compare sequences in linguistics, where we have to compare sequences that were basically drawn from different alphabets, this means that we need to find out which sounds correspond to which sounds across different languages while at the same time trying to align them.

An artificial example for the systemic grounding of sound change

Let me provide a concrete artificial example, to illustrate the peculiarities of sound change. Imagine two people who originally spoke the same language, but then suffered from diseases or accidents that inhibited them from producing their speech in the way they did before. Let the first person suffer from a cold, which blocks the nose, and therefore turns all nasal sounds into their corresponding voiced stops, i.e., n becomes a d, ng becomes a g, and m becomes a b. Let the other person suffer from the loss of the front teeth, which makes it difficult to pronounce the sounds s and z correctly, so that they sound like a th (in its voiced and voiceless form, like in thing vs. that).

Artificial sound change resulting from a cold or the loss of the front teeth.

If we now let both persons pronounce the same words in their original language, they won't sound very similar anymore, as I have tried to depict in the following table (dh points to the th in words like father, as opposed to the voiceless th in words like thatch).

No.	Speaker Cold	Speaker Tooth
1	bass	math
2	buzic	mudhic
3	dose	nothe
4	doizy	noidhy
5	sig	thing
6	rizig	ridhing

By comparing the words systematically, however, bearing in mind that we need to find the best alignment and the mapping between the alphabets, we can retrieve a set of what linguists call sound correspondences. We can see that the s of speaker Cold corresponds to the th of speaker Tooth, z corresponds to dh, b to m, d to n, and g to ng. If you have figured out by now that my words were taken from the English language (with voiced s consistently spelled as z), it is even easy to come up with a reconstruction of the original words (mass, music [= muzik], nose, noisy [= noizy], etc.).
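The procedure can be sketched in a few lines of Python, using four of the word pairs from the table. The sketch rests on two loud assumptions: the digraphs th, dh and ng are treated as single sounds, and the two words of each pair have the same number of sounds, so that they can be aligned position by position. With real language data neither holds in general, and alignment and correspondence detection have to be solved together. All function names are hypothetical.

```python
from collections import Counter

# Spellings that stand for a single sound in this toy example (assumption).
DIGRAPHS = ("th", "dh", "ng")
NASALS = {"m", "n", "ng"}

# (speaker Cold, speaker Tooth), four pairs taken from the table.
PAIRS = [
    ("buzic", "mudhic"),
    ("dose", "nothe"),
    ("sig", "thing"),
    ("rizig", "ridhing"),
]


def tokenize(word):
    """Split a spelled word into sounds, keeping digraphs together."""
    tokens, i = [], 0
    while i < len(word):
        step = 2 if word[i:i + 2] in DIGRAPHS else 1
        tokens.append(word[i:i + step])
        i += step
    return tokens


def align(cold, tooth):
    """Position-by-position alignment; only valid for equal-length words."""
    a, b = tokenize(cold), tokenize(tooth)
    if len(a) != len(b):
        raise ValueError("words need a real alignment algorithm")
    return list(zip(a, b))


def correspondences(pairs):
    """Count how often each differing sound pair recurs across the words."""
    counts = Counter()
    for cold, tooth in pairs:
        counts.update((x, y) for x, y in align(cold, tooth) if x != y)
    return counts


def reconstruct(cold, tooth):
    """Take nasals from speaker Tooth and everything else from speaker Cold."""
    return "".join(y if y in NASALS else x for x, y in align(cold, tooth))


print(correspondences(PAIRS))
print([reconstruct(cold, tooth) for cold, tooth in PAIRS])
```

The reconstruction step encodes what we know about the two speakers: the cold destroyed the nasals, so they are taken from speaker Tooth, while the missing teeth destroyed s and z, so these are taken from speaker Cold.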

Reconstructing ancestral sounds in our artificial example with help of regular sound correspondences.

Summary

Systemic changes are difficult to handle in phylogenetic analyses. They leave specific traces in the evolving objects we investigate that are often difficult to interpret. While it has long been known to linguists that sound change is an inherently systemic phenomenon, it is still very difficult to communicate to non-linguists what this means, and why it is so difficult for us to compare languages by comparing their words. Although it may seem tempting to compare languages with simple sequence-alignment algorithms, treating the differences between words like differences in biological sequences resulting from mutations (see for example Wheeler and Whiteley 2015), this is basically an oversimplifying approach.

Simple models undeniably have their merits, especially when dealing with big datasets that are difficult to inspect manually — there is nothing to say against their use. But we should always keep in mind that we can, and should, do much better than this. Handling systemic changes remains a major challenge for phylogenetic approaches, no matter whether they use trees, networks, bushes, or forests.

Given the peculiarity of sound change in linguistic evolution, and how well the phenomena are understood in our discipline, it seems worthwhile to invest time in exploring ways to formalize and model the process. During the past two decades, linguists have taken a lot of inspiration from biology. The time will come when we need to pay something back. Providing models and analyses to deal with systemic processes like sound change might be a good start.

References

Bloomfield, L. (1973) Language. Allen & Unwin: London.

Everett, C., D. Blasi, and S. Roberts (2015) Climate, vocal folds, and tonal languages: connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5: 1322-1327.

Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015) Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1: 1-9.

Roberts, S. (2018) Robust, causal, and incremental approaches to investigating linguistic adaptation. Frontiers in Psychology 9: 166.

Wheeler, W. and P. Whiteley (2015) Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2: 113-125.

Monday, March 19, 2018

Comparing neighbour-nets and PCA graphs – the example of Mediterranean oaks

Distance matrices offer many avenues for exploring data. A common method is Principal Component Analysis (PCA). A much less common method is the use of Neighbour-nets. We have previously compared PCA and Neighbor-nets using theoretical data. In this post, I'll compare a PCA graph and the corresponding Neighbour-net using some empirical data.

Genetic differentiation in Mediterranean oaks

In the paper by Vitelli et al. (2017), we explored the phylogeographic structuring of a group of Mediterranean oak species. The species represented the westernmost populations of one of the main Eurasian oak lineages: the evergreen Quercus section Ilex ("Ilex oaks"; see Denk et al. 2017 for an up-to-date classification of oaks; see also this figshare-spread-sheet). It was a follow-up study to the one by Simeone et al. (2016).

We found that one species, the most widespread (Quercus ilex), carries plastids from quite different origins. The 2016 paper identified three main plastid haplotypes in the Ilex oaks: the unique (within the entire genus) "Euro-Med" haplotype; the "Cerris-Ilex" haplotype shared with western Eurasian members of (essentially deciduous) section Cerris, the sister clade of section Ilex (see Denk & Grimm 2010; confirmed by NGS SNP data, Hipp et al. 2015); and the "WAHEA" haplotype, an east-bound haplotype of section Ilex. Vitelli et al. aimed to characterise the range of these three main haplotypes throughout the four Ilex oak species found in the Mediterranean.

Figure 1 shows the two multivariate data analyses, along with a map of the sample locations.

Fig. 1 Phylogeographic structure of Quercus section Ilex around the Mediterranean (after Vitelli et al. 2017). a. PCA graph, and b. Neighbour-net based on the same inter-haplotype pairwise distance matrix. c. A map depicting the distribution of main haplotype groups labelled by Roman numerals: I haplotypes of the "WAHEA" lineage, II "Cerris-Ilex"-lineage, III–VI, subtypes of the "Euro-Med" lineage (cf. Simeone et al. 2016, fig. 1)

Regarding the overall diversification pattern, the PCA graph and the Neighbour-net show similar things. The "Euro-Med" lineage is the most diverse group, with four subgroups — two larger (and widespread) ones (haplotypes IV, V) and two rare ones (III, VI) only found in the Aegean region.

According to the PCA, haplotype III (colored olive) is intermediate between "Euro-Med" IV (blue) and the haplotype II (yellow), which represents another lineage of oak haplotypes, the Aegean/Northern Turkish "Cerris-Ilex" lineage. The same can be seen in the Neighbour-net.
The PCA further places haplotype VI (red) as equidistant to all of the other types, with IV and I (green; representing the oriental "WAHEA" lineage) being a bit closer. In the Neighbour-net, we can sum up the length of the connecting edge-bundles to find the same pattern. A difference between the two analyses is that VI is connected only with part of V (purple) by a pronounced edge bundle, but not connected to I (green). This is strikingly different from III, which shares an edge bundle with II and IV+V.

At this point in the analyses, we can exploit the fact that a Neighbour-net can act both as a distance-based 2-dimensional graph and as a meta-phylogenetic network (Fig. 2). Based on the PCA, which also is a 2-dimensional depiction of the differentiation, one may be tempted to interpret VI as a bridge between IV/V and I, not much different from how III bridges between II and IV (Fig. 1). On the other hand, the network (Figs 1, 2) informs us that VI is a likely relative of V, which in turn is a likely relative of IV; and the only connection between I and VI is their increasing distinctness to the other haplotypes of the "Euro-Med" lineage, III/IV/V.

Fig. 2 The main splits expressed in the neighbour-net. III may either be sister to II, or is part of a clade comprising IV and V.

Using the main split patterns in the Neighbour-net, we can infer the one phylogenetic hypothesis, a tree, that can accommodate them all (Fig. 3).

Fig. 3 The tree solution congruent with the major split patterns (Fig. 2).

I rejected the alternative sister relationship between II and III because this would imply a sister clade that only includes IV, V and VI but not III, which clashes with the affinity of III to IV and V (Fig. 2). Interpreting III as a sister of IV and V explains both its affinity to II (putative sister lineage to III–VI) and its affinity to IV and V.

We might accept that all three plastome lineages are reciprocally monophyletic (in a quite broad sense), meaning that each lineage evolved from a pool of closely related mother plants. If so, then the higher similarity between III ("Euro-Med") and II ("Cerris-Ilex") may represent a relative lack of derivation, whereas the dissimilarity between VI ("Euro-Med") and I ("WAHEA") to all other types can be due to a higher level of distinctness. And we can come up with a "cactus"-type metaphorical tree (Fig. 4) explaining the Neighbour-net (and PCA graph).

Fig. 4 A "cactus"-type tree metaphor for the evolution of oak plastomes (based on the results of Simeone et al. 2016, Vitelli et al. 2017, and – outside the focus group, i.e. Mediterranean oaks of Subgenus Cerris – some partly arcane, not yet published knowledge that I have access to)

We thus learn more from the Neighbor-net than from the PCA.

There's no reason to stop with a PCA

One empirical example is far from being conclusive, but it shows what the Neighbour-nets have to offer.

Trees are fine for proposing phylogenetic hypotheses, but we should always be aware of equally valid alternatives to the tree that we have optimized. And with increasing numbers of taxa, inferring optimal trees and assessing their alternatives require increasing effort, and checking. For many questions, PCA has been used as a quick alternative, including in large-sample genetic studies (see Continued misuse of PCA in genomics studies).

Neighbour-nets are just a natural step further towards a phylogeny, which come with very little extra effort and can use the same data basis: a matrix of pairwise distances. In the case of genetic data, which usually reflects at least the main aspects of the actual phylogeny (trivial or complex) behind it, the "true tree", they should be obligatory. They are much more than just a clustering approach (even though their algorithm is based on a cluster algorithm) or a bivariate analysis. Neighbour-nets are meta-phylogenetic networks that have the capacity to contain the one or many topologies explaining the data. They are as straightforward as PCA, when it comes to recognising "natural", coherent and equal, groups (in contrast to phylogenetic trees).

Postscript

I would have liked to add some more examples with non-genetic data, i.e. data sets where the distances are not the result of an explicit phylogenetic process. But this requires much more effort, since none of the PCA studies I browsed had documented the distance data/matrix used. However, I'm sure that inferring a Neighbour-net based on no-matter-what similarity data used for PCA can be a fruitful and revealing endeavour (and this is the reason why you find Neighbour-nets based on U.S. gun legislation, breast sizes, languages, cryptocurrencies, etc. on this blog, but few PCAs). So, try it out the next time you make a PCA, and share the results, e.g. by using our comment option or even a post as guest-blogger.

Don't miss these earlier posts with similar topic:

Also, this paper introduces Neighbor-nets to the wider audience of multivariate data analyses:

Morrison, D.A. (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296–312.

References

Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366.

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. 2017. An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Heidelberg, New York: Springer, p. 13–38. Free Pre-Print at bioRxiv [major change: Ponticae and Virentes accepted as additional sections in final version]

Hipp AL, Manos P, McVay JD, ... , Avishai M, Simeone MC. 2015 [abstract]. A phylogeny of the World's oaks. Botany 2015. Edmonton.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. 2016. Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4: e1897 [open access, comments/questions welcomed]

Vitelli M, Vessella F, Cardoni S, Pollegioni P, Denk T, Grimm GW, Simeone MC. 2017. Phylogeographic structuring of plastome diversity in Mediterranean oaks (Quercus Group Ilex, Fagaceae). Tree Genetics and Genomes 13:3.

Monday, March 12, 2018

Tattoo Monday XIV

Tattoos are quite common among modern women. So, for today's collection, here are some circular phylogenetic trees of various sizes and in various locations.

For anyone who wants to pursue the matter, there is a Reddit thread on the topic:

Ladies with tattoos - What are some negative (or positive) comments you've gotten from strangers because of your tattoos? Where are your tattoos and what are they of?

Monday, March 5, 2018

Visualizing U.S. gun laws

The Founding Fathers of the USA made the decision to explicitly insist on the right of all US citizens to bear arms, because they felt that to do otherwise could be the foundation of what we would now call a Police State. The right was granted in the well-known 2nd Amendment, along with the right to form a militia (to fend off the British, among others). This may have been a reasonable way to achieve freedom in the 1700s; and it was certainly the basis of the reputation of the Wild West in the 1800s.

However, increasingly during the 1900s, and especially now, in the 2000s, the practical consequences of this part of the US Constitution have come into question. Indeed, due to recent events in some states, this facet of the United States has come to world-wide attention, because its gun legislation is quite unique. However, speaking of a single U.S. gun legislation is an over-simplification, because there are substantial differences between the fifty states (and the District of Columbia). This blog post provides a practical look at the similarities and differences in these gun laws.

The 2nd Amendment of the U.S. Constitution, the Bill of Rights:
"A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed."

Gun legislation in the United States

Gun legislation is not a federal business, as one may think when following the news. The USA is a union of states, rather than a federation, with the states retaining all political rights that they have not delegated to the federal government (i.e. inter-state laws and inter-nation laws). This differs from almost all other countries, in which the federal government retains all political rights that it does not delegate to the states or counties.

In particular, the US state legislations are highly diverse regarding how to exercise the basic (constitutional) right to bear arms. Some states retain the original 1700s interpretation while others have made it rather hard to carry guns, either openly or concealed.

The web site GunsToCarry, for example, breaks the legislation down to five general points:

Does one need a permit?
Does one need a permit to purchase a gun?
Does one need to register an owned gun?
Is it allowed to carry the gun in the open?
Are there background checks when one privately sells or buys a gun?

For each state, each of these questions can be answered by 'Yes' or 'No', for both hand guns (pistols, revolvers, etc) and long guns (rifles) separately.

In addition, further restrictions/modifications are listed. For instance, states vary in their general policy on issuing a gun permit ("Unrestricted"; "Shall Issue"; "May Issue") and in how it is put into practice. Hawaii, to take one example, requires permits, and the general policy is "May Issue", meaning that the state may issue a permit or decide not to, on a case-by-case basis. In reality, the bars to getting a permit are so high in Hawaii that normally people don't get one. The other state where "May Issue" is exercised as "No Issue" is New Jersey. Another characteristic is that some states, such as California, do not allow private sales unless they are done via a licensed dealer or state law enforcement department.

This all leads to 17 characters that can contribute to differences between states. These can be illustrated in a simple network. The outcome is shown below, after some technical details about how to produce the picture.

Technical details

To provide a pictorial overview of these differences, we can use a particular type of network, called a phylogenetic network. We first calculate pairwise distances between the states, quantifying their differences, and then use a neighbor-net to create the picture as a single graph.

The five main questions provide ten binary (2-state) characters (No = 0, Yes = 1), but I chose an ordered ternary character (3-state) for open carry, to account for local variation (open carry allowed in general = 0, not state-wide = 1, not-at-all = 2). For the ordered ternary characters, the change from e.g. "Unrestricted" (0) to "May Issue" (2) counts as two differences. To even out the impact of binary and ternary characters, all binary characters have the weight two. Hence, a distance of 0 (between any given pair of states) means that the two states have the same legislation in all scored characters; and a distance of 2 would mean that two states differ completely in their legislation.
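The weighting scheme can be sketched in Python as follows. This is not the pipeline actually used for the post (the distances were computed from a NEXUS matrix), and it makes one loud assumption: the weighted differences are averaged over the number of characters, which is the simplest normalisation that reproduces the stated range from 0 (identical) to 2 (completely different). The state names, character vectors and the function name `legislation_distance` are hypothetical.

```python
from itertools import combinations


def legislation_distance(a, b, ternary_positions):
    """Mean weighted character difference between two states (range 0 to 2).

    Binary characters (0/1) carry weight two; ordered ternary characters
    (0/1/2) contribute the absolute difference of their states.
    """
    if len(a) != len(b):
        raise ValueError("both states need the same number of characters")
    total = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in ternary_positions:
            total += abs(x - y)    # ordered: 0 vs 2 counts as two differences
        else:
            total += 2 * (x != y)  # binary mismatch, weight two
    # Assumed normalisation: average over all scored characters.
    return total / len(a)


# Hypothetical mini-matrix: three binary characters plus one ternary (index 3).
STATES = {
    "State A": (0, 0, 0, 0),
    "State B": (1, 1, 1, 2),
    "State C": (0, 1, 0, 1),
}
TERNARY = {3}

# Pairwise distance matrix, the kind of input a neighbour-net is built from.
matrix = {
    (s, t): legislation_distance(STATES[s], STATES[t], TERNARY)
    for s, t in combinations(sorted(STATES), 2)
}
for pair, dist in matrix.items():
    print(pair, dist)
```

In this toy matrix, State A and State B differ in every character and therefore reach the maximum distance, while State C sits in between.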

I excluded one character (the maximum number of rounds allowed per magazine) that provides little discriminatory signal, since it can only be scored for the rather few states that have a magazine size restriction (either 10 or 15 rounds) for hand or long guns (or both).

Gun legislation in the states of the U.S.A.

The interpretation of the network is straightforward. States that are closely connected in the network are similar to each other based on their gun laws, and those that are further apart are progressively more different from each other. Find your own state, and you can immediately see which states are similar to it. (For more details, see: How to interpret splits graphs)

Figure 1 A neighbour-net visualizing the differences in gun legislation in the U.S.A. Blue stars indicate states where guns have to be registered.

The graph captures the differences in the state legislations well. States without gun control, i.e. no permits needed, no registration, free-to-carry, no limitation of magazine sizes, form one endpoint of the network (highlighted in red).

At the opposite end of the graph, highlighted in green, are those states requiring permits for having, buying or selling a gun, that don't endorse open carry, and limit the size of magazines to 10/15 rounds. This part of the network is spread out because each state shows a different combination of controls. The most restrictive states are the right-most ones (Hawaii, Connecticut, California and District of Columbia).

In between these two endpoints, come the states that exercise some control (e.g. on handguns only). These are generally more similar to the no-control states, in that they may require one or another permit, but otherwise have no or few restrictions.

You will note the position of both Texas and Florida (states that joined the Union in the 19th century and were part of the Confederacy) in the network — they are both down the end with the fewest gun controls. You will also note the position of the most densely populated states, which are mostly down the other end.

Finally, here is the same graph with two historical groups of states highlighted, representing two phases of the development of the modern USA. The nature of modern gun laws is not randomly distributed among these groups.

Figure 2 Same graph as in Fig. 1, showing the original Thirteen Colonies (1700s) and the states of the Confederacy (1800s).

Conclusion

Clearly, the United States provides a variety of gun legislation, from strong control to almost none. This inevitably leads to strongly opposing opinions among the public when it comes to guns, although this calls into question a basic constitutional right.

The network also provides a guide-graph for any tourists who might be concerned about U.S. gun legislation. They should visit states such as California if they wish to feel safer, or Alaska if they are searching for a little wild-west feeling.

More plots, links, etc can be found in the related long-read. It includes mapping results of recent and earlier tight presidential elections, population density, real GDP, and number of firearm-related deaths per 100,000 inhabitants; links for further reading and some thoughts on the issue.

Data

I have provided a fileset on Figshare, including the matrix used (annotated NEXUS, generated with and optimized for Mesquite; "simple" NEXUS with set-up details for PAUP*), the resultant distance matrices (raw, PHYLIP-formatted; analyzed, Splits-NEXUS-formatted), and the figures (for this post and the related long-read).