Monday, November 11, 2019

A new playground for networks and exploratory data analysis


[This is a post by Guido with some help from David]

There tend to be two types of studies of inheritance and evolution. First, there is the evolution of organisms, either of the phenotype (morphology, anatomy, cell ultrastructure, etc.) or of the genotype (chromosomes, nucleotides). The latter involves direct inheritance, and it is often treated as including all molecules, although it is the nucleotides (and chromosomes) that get inherited, not, for example, amino acids.

Second, there are studies of the evolution of behaviour, which have focused mainly on humans, of course, but can include all species. For humans, this includes socio-cultural phenomena, particularly language (written as well as spoken), but also cultural advancements such as social organization, tool use, agriculture, etc., which are inherited indirectly, by learning.

However, we rarely see studies that are multi-disciplinary in the sense of combining both physical and behavioural evolution. It is therefore very interesting to note the just-published preprint by:
Fernando Racimo, Martin Sikora, Hannes Schroeder, Carles Lalueza-Fox. 2019. Beyond broad strokes: sociocultural insights from the study of ancient genomes. arXiv.
These authors provide a review about the extent to which the analysis of ancient human genomes has provided new insights into socio-cultural evolution. This provides a platform for interesting future cross-disciplinary research.

The authors comment:
In this review, we summarize recent studies showcasing these types of insights, focusing on the methods used to infer sociocultural aspects of human behaviour. This work often involves working across disciplines that have, until recently, evolved in separation. We argue that multidisciplinary dialogue is crucial for a more integrated and richer reconstruction of human history, as it can yield extraordinary insights about past societies, reproductive behaviours and even lifestyle habits that would not have been possible to obtain otherwise.
Multi-disciplinary dialogue is a focal point here at the Genealogical World of Phylogenetic Networks. Since our blog embraces non-biological data, we have done a little brainstorming, to put forward some ideas based on Racimo et al.'s comments. The four figures contain some extra discussion, with some visual representations of the ideas.

Why it's important to correlate genetic, linguistic and socio-cultural data. The doodle shows a simple free expansion model of a founder population with three genotypes (yellow, green, blue), a shared language (L) and two major cultural innovations (white stars). Because of drift and stochastic intra-population processes (size represents the size of the actively reproducing populace), the first expansion (light gray arrows) led to 'tribes' that already show some variation. The smaller ones close to the founder population still spoke the same language, while the ones further away used variants (dialects) of L (L', still close to L; L'', more distinct). Because of bottlenecks, geographic distance and differing levels of inbreeding (the smaller a population and the farther away from the source, the more likely are changes in genotype frequency), each population has a different genotype composition. The second expansion (mid-gray arrows), mixing two sources, leads to a grandchild that evolved a new language M and lost the blue genotype. Because the cultural innovations are beneficial, we find them in the entire group. In extreme cases of genetic sorting and linguistic evolution, such shared cultural innovations may be the only evidence clearly linking all these populations.

Social-cultural character matrices

Correlating different sets of data and (cross-)exploring the signal in these data can be facilitated by creating suitable character matrices. In phylogenetics, we primarily use characters that (ideally) undergo neutral evolution, such as nucleotide sequences and their translations, amino-acid sequences. When using matrices scoring morphological traits, we relax the requirement of neutral evolution, but we are still scoring traits that are the product of biological evolution. However, we don't need to stop there: phylo-linguistics is an active field, even though languages involve different evolutionary constraints and processes than we meet in biology. Data-wise there are nonetheless many analogies, and phylogenetic methods seem to work fine.

So, why not also score socio-cultural traits in a character matrix? For instance, we can characterize cultures and populations by basic features including: the presence of agriculture, which crops were cultivated, which animals were domesticated, which technological advances were available, whether it was a stone-age, bronze-age, iron-age culture, etc. Linguistically, we could also develop matrices of local populations, with regional accents or dialects, etc.
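To make this concrete, here is a minimal sketch of what such a matrix could look like in Python. The group names, the traits and the scores are all invented for illustration; the coding (1 = present, 0 = absent, `None` = unknown) is just one possible convention, not an established standard.

```python
# Hypothetical traits and groups, for illustration only (not real data).
TRAITS = ["agriculture", "cereal_crops", "domestic_cattle", "metal_tools", "pottery"]

# One row per socio-cultural group: 1 = present, 0 = absent, None = unknown.
matrix = {
    "GroupA": [1, 1, 0, 0, 1],
    "GroupB": [1, 1, 1, 0, 1],
    "GroupC": [0, 0, 0, 0, None],
}


def format_matrix(matrix, traits):
    """Return one text row per group, writing '?' for unknown states."""
    rows = []
    for name, states in matrix.items():
        if len(states) != len(traits):
            raise ValueError(f"{name}: expected {len(traits)} states")
        coded = "".join("?" if s is None else str(s) for s in states)
        rows.append(f"{name} {coded}")
    return rows


for row in format_matrix(matrix, TRAITS):
    print(row)
```

Multi-state traits (which crops, which age) could be scored in the same way with more than two states per column.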

Creating such a matrix should, of course, be informed by available objective information. As in the case of morphological matrices or non-biological matrices in general, we should not be concerned about character independence. We don't need to infer a phylogenetic tree from these matrices, as their purpose is just to sum up all available characteristics of a socio-cultural group.

Second phase: stabilization of the differentiation pattern. While the close-by tribes are still in contact with the mother population, the most distant have lost contact. As a consequence, the gene pools of the L/L'-speaking communities will become more similar, and new innovations acquired by the founder population (black star) are readily propagated within its cultural sphere. Re-migration from the larger M-speaking tribe to the struggling L''-speakers (small population with high inbreeding levels) leads to the extinction of the blue genotype in the latter and increased 'borrowing' of M-words and concepts.

Distance calculations

Pairwise distance matrices are most versatile for comparing data across different data sets.

First, any character matrix can be quickly transformed into a distance matrix, and the right distance transformation can handle any sort of data: qualitative, categorical data as well as quantitative, continuous data.
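As a sketch of how one transformation can cope with mixed data, the snippet below computes a Gower-style distance: categorical characters contribute 0 (same state) or 1 (different state), continuous characters contribute their absolute difference scaled by the observed range, and characters unknown in either OTU are skipped. This is our choice for illustration, not the only suitable transformation; the groups and values are hypothetical.

```python
from itertools import combinations


def gower_distance(a, b, kinds, ranges):
    """Mean per-character difference between two OTUs.

    kinds[i] is "cat" (categorical) or "num" (continuous); ranges[i] is the
    observed range of a continuous character (None for categorical ones).
    Characters that are unknown (None) in either OTU are skipped.
    """
    total, compared = 0.0, 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue
        if kind == "cat":
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / rng if rng else 0.0
        compared += 1
    if compared == 0:
        raise ValueError("no characters scored in both OTUs")
    return total / compared


def distance_matrix(data, kinds):
    """Return a dict mapping (name1, name2) to the pairwise distance."""
    names = list(data)
    ranges = []
    for i, kind in enumerate(kinds):
        if kind == "num":
            vals = [row[i] for row in data.values() if row[i] is not None]
            ranges.append(max(vals) - min(vals) if vals else None)
        else:
            ranges.append(None)
    dist = {(n, n): 0.0 for n in names}
    for p, q in combinations(names, 2):
        d = gower_distance(data[p], data[q], kinds, ranges)
        dist[(p, q)] = dist[(q, p)] = d
    return dist


# Hypothetical characters: subsistence mode, agriculture (0/1), settlement size.
kinds = ["cat", "cat", "num"]
data = {
    "GroupA": ["farming", 1, 200],
    "GroupB": ["farming", 1, 1000],
    "GroupC": ["foraging", 0, 200],
}
dist = distance_matrix(data, kinds)
for (p, q), d in sorted(dist.items()):
    print(p, q, round(d, 3))
```

Because every character contributes a value between 0 and 1, no single character type dominates the resulting distance.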

Second, the signal in any distance matrix can be quickly visualized using Neighbor-nets. This blog has a long list of posts showing Neighbor-nets based on all sorts of sociological data that don't follow any strict pattern of evolution, and are heavily biased by socio-cultural constraints (e.g. bikability, breast sizes, German politics, gun legislation, happiness, professional poker, spare-time activities). We have even included celestial bodies.

Third, distance matrices can be tested for correlation as-is, without any prior inference, using simple statistics, such as the Pearson correlation coefficient. To give just one example from our own research: in Göker and Grimm (BMC Evol. Biol. 2008), the latter was used for testing the performance of character and distance transformations for cloned ITS data covering substantial intra-genomic diversity, by correlating the resulting individual-based distances with species-level morphological data matrices. (The internal transcribed spacers are multi-copy, nuclear-encoded, non-coding gene regions; in the simplest case each individual has two sets of copies, or arrays, one inherited from the father, the other from the mother, which may differ between but also within individuals.)
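A minimal sketch of such a correlation test is given below: each distance matrix is flattened so that every pair of OTUs is used once, and the Pearson coefficient is computed on the two resulting lists. The OTU names and distances are invented. Note that pairwise distances are not independent observations, so the coefficient itself is informative but a significance value would need a permutation approach (as in a Mantel test), which is not shown here.

```python
from itertools import combinations
from math import sqrt


def upper_triangle(dist, names):
    """Flatten a distance matrix, taking each pair of OTUs exactly once."""
    return [dist[(p, q)] for p, q in combinations(names, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    # Undefined (division by zero) if all distances in one matrix are equal.
    return sxy / sqrt(sxx * syy)


# Hypothetical distances between four OTUs (groups), for illustration only.
names = ["A", "B", "C", "D"]
genetic = {("A", "B"): 0.1, ("A", "C"): 0.4, ("A", "D"): 0.5,
           ("B", "C"): 0.3, ("B", "D"): 0.6, ("C", "D"): 0.2}
cultural = {("A", "B"): 0.2, ("A", "C"): 0.5, ("A", "D"): 0.4,
            ("B", "C"): 0.5, ("B", "D"): 0.7, ("C", "D"): 0.1}

r = pearson(upper_triangle(genetic, names), upper_triangle(cultural, names))
print(f"Pearson r between genetic and cultural distances: {r:.3f}")
```

The same two functions work for any pair of matrices (genetic, socio-cultural, linguistic, geographical), as long as both are defined for the same set of OTUs.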

In the context of Racimo et al.'s paper, one could construct a genetic, a socio-cultural, a linguistic and a geographical matrix, determine the pairwise distances between what in phylogenetics are called OTUs (the operational taxonomic units), and test how well these data (or parts of them) correlate. The OTUs would be local human groups sharing the same culture (and, if known, language).

Alternatively, one can just map the scored socio-cultural traits onto trees based on genetic data or linguistics.

A new culture with its own language (Λ), genotype (red) and innovations (ruby-red pentagon) migrates close to the settling area of the L-people. Because of raids, genotypes and innovations from the L-people get incorporated into the Λ-culture.

How to get the same set of OTUs

The Göker & Grimm paper mentioned above tested several options for character and distance transformations, because we faced a similar problem to what researchers will face when trying to correlate socio-cultural data with genetic profiles of our ancestors: a different set of leaves (the OTUs). We were interested in phylogenetic relationships between individuals using data representing the genetic heterogeneity within these individuals.

Genetic studies of human (ancient or modern) DNA use data from individuals, but socio-cultural and linguistic data can only be compiled at a (much) higher level: societies, or other groups of many individuals. In addition, these groups may also span a larger time frame. Since humans love to migrate, we are even more of a genetic mess than were the ITS data that we studied.

One potential alternative is to use the host-associate analysis framework of Göker & Grimm. Instead of using the individual genetic profiles (the associate data), one sums them across a socio-cultural unit (serving as host). The simplest method is to create a consensus of the data (in Göker & Grimm, we tested strict and modal consensuses). This produces sequences with a lot of ambiguity codes: genetic diversity within the population will be represented by intra-unit sequence polymorphism (IUSP). Standard distance and parsimony implementations do not deal with ambiguities, but maximum likelihood, as implemented in RAxML, does to some degree. A stopgap is the recoding of ambiguities as discrete states for phylogenetic analysis (tree and network inference), as done by Potts et al. (Syst. Biol. 2014) for 2ISPs ('twisps'), intra-individual site polymorphisms. It can't hurt to try out whether this works for IUSPs, too.
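The consensus step can be sketched as follows, using the standard IUPAC nucleotide ambiguity codes. The definitions here are one plain reading of 'strict' (every base observed at a site goes into the code) and 'modal' (only the most frequent base is kept, with ties coded as ambiguities); they may differ in detail from the implementations tested in Göker & Grimm. The sequences are hypothetical and assumed to be aligned, of equal length, upper-case and free of gaps.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}.items()}


def strict_consensus(seqs):
    """Code all bases observed at each site into a single IUPAC symbol."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*seqs))


def modal_consensus(seqs):
    """Keep the most frequent base per site; ties become ambiguity codes."""
    out = []
    for col in zip(*seqs):
        counts = Counter(col)
        top = max(counts.values())
        modal_bases = frozenset(b for b, c in counts.items() if c == top)
        out.append(IUPAC[modal_bases])
    return "".join(out)


# Hypothetical aligned sequences from four individuals of one group (host).
group = ["ACGT", "ACGA", "ATGA", "ACGA"]
print(strict_consensus(group))  # AYGW
print(modal_consensus(group))   # ACGA
```

In this toy example the strict consensus keeps the two polymorphic sites as ambiguity codes (the IUSPs), whereas the modal consensus hides them behind the majority base.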

Since humans (tribes, local groups) often differ in the frequency of certain genotypes, it would be straightforward to use these frequencies directly when setting up a host matrix. Instead of, for example, nucleotides or their ambiguity codes, the matrix would have the frequency of the different haplotypes. We can't infer trees from such a matrix (we need categorical data), but we can still calculate the distance matrix and infer a Neighbor-net.
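A frequency-based host matrix could be turned into distances as in the sketch below. It uses the ordinary Bray-Curtis dissimilarity on haplotype frequencies, which is our choice for illustration and not the 'phylogenetic' variant of Göker & Grimm (2008). The tribes and haplotype calls are invented, with haplotype names borrowed from the genotype colours in the figures.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Relative frequency of each haplotype among the sampled individuals."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return {h: c / n for h, c in counts.items()}


def bray_curtis(f, g):
    """Ordinary Bray-Curtis dissimilarity between two frequency profiles."""
    keys = set(f) | set(g)
    num = sum(abs(f.get(k, 0.0) - g.get(k, 0.0)) for k in keys)
    den = sum(f.get(k, 0.0) + g.get(k, 0.0) for k in keys)
    return num / den


# Hypothetical haplotype calls for the individuals sampled from three tribes.
tribe1 = haplotype_frequencies(["yellow", "yellow", "green", "blue"])
tribe2 = haplotype_frequencies(["yellow", "green", "green", "green"])
tribe3 = haplotype_frequencies(["red", "red", "red", "red"])

print(bray_curtis(tribe1, tribe2))  # 0.5: partly overlapping haplotype pools
print(bray_curtis(tribe1, tribe3))  # 1.0: no haplotype in common
```

The resulting pairwise values can be collected into a distance matrix and used as input for a Neighbor-net.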

The 'phylogenetic Bray-Curtis' (distance) transformation introduced in Göker & Grimm (2008) also keeps the information about within-host diversity when determining inter-host distances (see Reticulation at its best ...).


Transformations for genetic data from smaller to larger, more-inclusive units are implemented in the software package POFAD by Joly et al. (Methods in Ecology & Evolution, 2015). Their paper also provides a comparison of different methods, including the ones tested in Göker & Grimm (2008; also implemented in the tiny executables g2cef and pbc, compiled for any platform).

The process of assimilation. The Λ-people subdued the L-culture, with the consequence that all innovations are shared in their sphere of influence. As the invaders have a much smaller total population size, their language is largely lost, but the new common language L* still includes some Λ-elements (in a phylogenetic tree analysis, L* would be part of the L/M clade; using networks, L* would share edges with Λ, in contrast to L and M). The L''/M-speaking remote population is re-integrated. The invaders' genotype (red) becomes part of the L-people's gene pool. Re-migration (forced or not) introduces L-genotypes into the original Λ-population. Only by comparing all available data, ideally covering more than one time period, can we deduce that the M-speakers represent an early isolated subpopulation of the L-people that was not affected by the Λ-invasion. With only the genetic data at hand, one may identify the M-speakers as one source and the Λ-tribe as another source for the L*-people, and infer that all L/M and Λ-tribes share a common origin (since the yellow genotype is found in both the M- and the original Λ-population).

Conclusion

It therefore seems to us that there is enormous potential for multi-disciplinary work that truly combines organismal and socio-cultural evolution. We have provided a few practical suggestions here about how this might be done. We encourage you all to try some of these ideas, to see where they lead us all.