Monday, November 11, 2019

A new playground for networks and exploratory data analysis

[This is a post by Guido with some help from David]

There tend to be two types of studies of inheritance and evolution. First, there is evolution of organisms, either of the phenotype (morphology, anatomy, cell ultrastructure, etc) or genotype (chromosome, nucleotides). The latter involves direct inheritance, but it is often treated as including all molecules, although it is the nucleotides (and chromosomes) that get inherited, not amino acids, for example.

Second, there are studies of the evolution of behaviour, which have focused mainly on humans, of course, but can include all species. For humans, this includes socio-cultural phenomena, particularly language (written as well as spoken), but also cultural advancements such as social organization, tool use, agriculture, etc., which are inherited indirectly, by learning.

However, we rarely see studies that are multi-disciplinary in the sense of combining both physical and behavioural evolution. It is therefore very interesting to note the just-published preprint by:
Fernando Racimo, Martin Sikora, Hannes Schroeder, Carles Lalueza-Fox. 2019. Beyond broad strokes: sociocultural insights from the study of ancient genomes. arXiv.
These authors provide a review about the extent to which the analysis of ancient human genomes has provided new insights into socio-cultural evolution. This provides a platform for interesting future cross-disciplinary research.

The authors comment:
In this review, we summarize recent studies showcasing these types of insights, focusing on the methods used to infer sociocultural aspects of human behaviour. This work often involves working across disciplines that have, until recently, evolved in separation. We argue that multidisciplinary dialogue is crucial for a more integrated and richer reconstruction of human history, as it can yield extraordinary insights about past societies, reproductive behaviours and even lifestyle habits that would not have been possible to obtain otherwise.
Multi-disciplinary dialogue is a focal point here at the Genealogical World of Phylogenetic Networks. Since our blog embraces non-biological data, we have done a little brainstorming, to put forward some ideas based on Racimo et al.'s comments. The four figures contain some extra discussion, with some visual representations of the ideas.

Why it's important to correlate genetic, linguistic and socio-cultural data. The doodle shows a simple free expansion model of a founder population with three genotypes (yellow, green, blue), a shared language (L) and two major cultural innovations (white stars). Because of drift and stochastic intra-population processes (circle size represents the size of the actively reproducing populace), the first expansion (light gray arrows) leads to 'tribes' that already show some variation. The smaller ones close to the founder population still spoke the same language, while the ones further away used variants (dialects) of L (L', still close to L; L'', more distinct). Because of bottlenecks, geographic distance and differing levels of inbreeding (the smaller a population and the farther away from the source, the more likely are changes in genotype frequency), each population has a different genotype composition. The second expansion (mid-gray arrows), mixing two sources, leads to a grandchild that evolved a new language M and lost the blue genotype. Because the cultural innovations are beneficial, we find them in the entire group. In extreme cases of genetic sorting and linguistic evolution, such shared cultural innovations may be the only evidence clearly linking all these populations.

Social-cultural character matrices

Correlating different sets of data and (cross-)exploring the signal in these data can be facilitated by creating suitable character matrices. In phylogenetics, we primarily use characters that (ideally) evolve neutrally, such as nucleotide sequences and their translation products, amino-acid sequences. When using matrices scoring morphological traits, we relax the requirement of neutral evolution, but we are still scoring traits that are the product of biological evolution. However, we don't need to stop there: phylo-linguistics is an active field, even though languages involve different evolutionary constraints and processes than we meet in biology. Data-wise there are nonetheless many analogies, and phylogenetic methods seem to work fine.

So, why not also score socio-cultural traits in a character matrix? For instance, we can characterize cultures and populations by basic features including: the presence of agriculture, which crops were cultivated, which animals were domesticated, which technological advances were available, whether it was a stone-age, bronze-age, iron-age culture, etc. Linguistically, we could also develop matrices of local populations, with regional accents or dialects, etc.

Creating such a matrix should, of course, be informed by available objective information. As in the case of morphological matrices or non-biological matrices in general, we should not be concerned about character independence. We don't need to infer a phylogenetic tree from these matrices, as their purpose is just to sum up all available characteristics of a socio-cultural group.

Second phase: stabilization of differentiation pattern. While the close-by tribes are still in contact with the mother population, the most distant ones have lost contact. As a consequence, the gene pools of the L/L'-speaking communities will become more similar, and new innovations acquired by the founder population (black star) are readily propagated within its cultural sphere. Re-migration from the larger M-speaking tribe to the struggling L''-speakers (small population with high inbreeding levels) leads to the extinction of the blue genotype in the latter and increased 'borrowing' of M-words and concepts.

Distance calculations

Pairwise distance matrices are most versatile for comparing data across different data sets.

First, any character matrix can be quickly transformed into a distance matrix, and the right distance transformation can handle any sort of data: qualitative, categorical data as well as quantitative, continuous data.
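To make this concrete, here is a minimal Python sketch of one such transformation: a Gower-style mean character difference, which can handle categorical and continuous characters in the same matrix and simply skips missing scores. The three cultures, their characters and the helper names are invented for illustration; this is just one possible transformation, not the one used in any of the papers we cite.

```python
from itertools import combinations

def gower_distance(a, b, ranges):
    """Mean per-character difference between two OTUs.

    Categorical characters (strings) add 0 if equal and 1 otherwise;
    numeric characters add |a - b| / range; None marks a missing score.
    """
    total, compared = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip characters not scored in both OTUs
        if isinstance(x, str):
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r if r else 0.0
        compared += 1
    return total / compared if compared else float("nan")

def distance_matrix(matrix):
    """Turn {OTU: [character states]} into {(OTU, OTU): distance}."""
    names = sorted(matrix)
    n_chars = len(matrix[names[0]])
    ranges = []
    for i in range(n_chars):
        vals = [row[i] for row in matrix.values()
                if isinstance(row[i], (int, float))]
        ranges.append(max(vals) - min(vals) if vals else 0.0)
    dist = {(n, n): 0.0 for n in names}
    for p, q in combinations(names, 2):
        dist[(p, q)] = dist[(q, p)] = gower_distance(matrix[p], matrix[q], ranges)
    return dist

# Hypothetical socio-cultural matrix: agriculture, main crop,
# technology level, number of domesticated animal species.
cultures = {
    "A": ["yes", "wheat", "bronze", 12.0],
    "B": ["yes", "barley", "bronze", 8.0],
    "C": ["no", None, "stone", 2.0],
}
dist = distance_matrix(cultures)
for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(pair, round(dist[pair], 3))
```

The resulting pairwise distances can be written out as a distance matrix and used as input for a Neighbor-net.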

Second, the signal in any distance matrix can be quickly visualized using Neighbor-nets. This blog has a long list of posts showing Neighbor-nets based on all sorts of sociological data that don't follow any strict pattern of evolution, and are heavily biased by socio-cultural constraints (eg. bikability, breast sizes, German politics, gun legislation, happiness, professional poker, spare-time activities). We have even included celestial bodies.

Third, distance matrices can be tested for correlation as-is, without any prior inference, using simple statistics, such as the Pearson correlation coefficient. To give just one example from our own research: in Göker and Grimm (BMC Evol. Biol. 2008), the latter was used for testing the performance of character and distance transformations for cloned ITS data covering substantial intra-genomic diversity, by correlating the resulting individual-based distances with species-level morphological data matrices. (The internal transcribed spacers are multi-copy, nuclear-encoded, non-coding gene regions; in the simplest case each individual has two sets of copies, arrays, one inherited from the father, the other from the mother, which may differ between but also within the individual.)
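As a rough illustration of such a test, the sketch below computes the Pearson correlation between the upper triangles of two distance matrices for the same OTUs. Because pairwise distances are not independent of each other, we have added a simple permutation (Mantel-type) test as one common way to judge the coefficient; that part is our addition here and is not what was done in the 2008 paper. Both example matrices are made up.

```python
import math
import random

def upper_triangle(d):
    """Flatten the values above the diagonal of a square matrix."""
    n = len(d)
    return [d[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(d1, d2, permutations=999, seed=0):
    """Observed r plus a one-sided permutation p-value.

    The OTU labels of d1 are shuffled (rows and columns together),
    which keeps the internal structure of each matrix intact.
    """
    rng = random.Random(seed)
    n = len(d1)
    y = upper_triangle(d2)
    r_obs = pearson(upper_triangle(d1), y)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[d1[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), y) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Hypothetical genetic and linguistic distances for the same four OTUs.
genetic = [[0.0, 0.1, 0.4, 0.5],
           [0.1, 0.0, 0.3, 0.4],
           [0.4, 0.3, 0.0, 0.2],
           [0.5, 0.4, 0.2, 0.0]]
linguistic = [[0.0, 0.2, 0.7, 0.9],
              [0.2, 0.0, 0.6, 0.8],
              [0.7, 0.6, 0.0, 0.3],
              [0.9, 0.8, 0.3, 0.0]]
r, p = mantel(genetic, linguistic)
print(f"r = {r:.3f}, permutation p = {p:.3f}")
```

With only four OTUs there are just 24 possible label permutations, so the p-value is not very informative here; real data sets would need many more OTUs.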

In the context of Racimo et al.'s paper, one could construct a genetic, a socio-cultural, a linguistic and a geographical matrix, determine the pairwise distances between what in phylogenetics are called OTUs (the operational taxonomic units), and test how well these data (or parts of them) correlate. The OTUs would be local human groups sharing the same culture (and, if known, language).

Alternatively, one can just map the scored socio-cultural traits onto trees based on genetic data or linguistics.

A new culture with its own language (Λ), genotype (red) and innovations (ruby-red pentagon) migrates close to the settling area of the L-people. Because of raids, genotypes and innovations from the L-people get incorporated into the Λ-culture.

How to get the same set of OTUs

The Göker & Grimm paper mentioned above tested several options for character and distance transformations, because we faced a similar problem to what researchers will face when trying to correlate socio-cultural data with genetic profiles of our ancestors: a different set of leaves (the OTUs). We were interested in phylogenetic relationships between individuals using data representing the genetic heterogeneity within these individuals.

Genetic studies of human (ancient or modern) DNA use data from individuals, but socio-cultural and linguistic data can only be compiled at a (much) higher level: societies, or other groups of many individuals. In addition, these groups may also span a larger time frame. Since humans love to migrate, we are even more of a genetic mess than were the ITS data that we studied.

One potential alternative is to use the host-associate analysis framework of Göker & Grimm. Instead of using the individual genetic profiles (the associate data), one sums them across a socio-cultural unit (serving as host). The simplest method is to create a consensus of the data (in Göker & Grimm, we tested strict and modal consensuses). This produces sequences with a lot of ambiguity codes: genetic diversity within the population will be represented by intra-unit sequence polymorphisms (IUSPs). Standard distance and parsimony implementations do not deal with ambiguities, but maximum likelihood, as implemented in RAxML, does to some degree. A stopgap is the recoding of ambiguities as discrete states for phylogenetic analysis (tree and network inference), as done by Potts et al. (Syst. Biol. 2014) for 2ISPs ('twisps'), intra-individual site polymorphisms. It can't hurt to try out whether this works for IUSPs, too.
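The consensus step can be sketched in a few lines of Python. We assume here that a 'strict' consensus keeps every base observed at a site (coded as an IUPAC ambiguity), while a 'modal' consensus keeps only the most frequent base(s); the exact definitions used in the paper may differ in detail, and the example sequences are invented.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus(seqs, mode="strict"):
    """Collapse aligned sequences of one unit into a single sequence.

    mode="strict": all bases seen at a site go into the ambiguity code.
    mode="modal":  only the most frequent base(s) at a site are kept.
    Anything other than A, C, G, T (gaps, N, ...) is ignored.
    """
    out = []
    for column in zip(*seqs):
        counts = Counter(b for b in column if b in "ACGT")
        if not counts:
            out.append("N")  # nothing informative at this site
            continue
        if mode == "modal":
            top = max(counts.values())
            states = {b for b, c in counts.items() if c == top}
        else:
            states = set(counts)
        out.append(IUPAC[frozenset(states)])
    return "".join(out)

# Hypothetical aligned sequences from four individuals of one group.
village = ["ACGTA", "ACGTA", "ATGCA", "ACGCA"]
strict_seq = consensus(village, mode="strict")
modal_seq = consensus(village, mode="modal")
print(strict_seq, modal_seq)
```

In this toy example the second site is polymorphic but dominated by one base, so only the strict consensus shows an ambiguity there, whereas the evenly split fourth site is ambiguous in both.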

Since humans (tribes, local groups) often differ in the frequency of certain genotypes, it would be straightforward to use these frequencies directly when putting up a host matrix. Instead of, for example, nucleotides or their ambiguity codes, the matrix would have the frequency of the different haplotypes. We can't infer trees from such a matrix (we need categorical data), but we can still calculate the distance matrix and infer a Neighbor-net.
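A small sketch of this idea: tabulate haplotype frequencies per group and compute a distance directly from them. We use the plain Bray-Curtis dissimilarity as one possible choice (not a phylogenetically informed variant), and the groups and genotype labels simply echo the colours of our doodles; none of this is real data.

```python
from collections import Counter

def haplotype_frequencies(samples):
    """Turn {group: [haplotype of each individual]} into a frequency table.

    Returns the sorted haplotype labels and {group: [frequency per label]}.
    """
    haplotypes = sorted({h for hs in samples.values() for h in hs})
    table = {}
    for group, hs in samples.items():
        counts = Counter(hs)
        table[group] = [counts[h] / len(hs) for h in haplotypes]
    return haplotypes, table

def bray_curtis(p, q):
    """Plain Bray-Curtis dissimilarity between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))

# Hypothetical host matrix input: genotypes sampled in three groups.
samples = {
    "founder": ["yellow"] * 4 + ["green"] * 4 + ["blue"] * 2,
    "tribe_L": ["yellow"] * 5 + ["green"] * 5,
    "tribe_M": ["green"] * 10,
}
haplotypes, freqs = haplotype_frequencies(samples)
groups = sorted(freqs)
for i, g in enumerate(groups):
    for h in groups[i + 1:]:
        print(g, h, round(bray_curtis(freqs[g], freqs[h]), 3))
```

Because each row of frequencies sums to one, the dissimilarity here is simply half the summed absolute frequency differences between two groups.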

The 'phylogenetic Bray-Curtis' (distance) transformation introduced in Göker & Grimm (2008) also keeps the information about within-host diversity when determining inter-host distances (see Reticulation at its best ...)

Transformations for genetic data from smaller to larger, more-inclusive units are implemented in the software package POFAD by Joly et al. (Methods in Ecology & Evolution, 2015). Their paper also provides a comparison of different methods, including the ones tested in Göker & Grimm (2008), which are also implemented in the tiny executables g2cef and pbc, compiled for any platform.

The process of assimilation. The Λ-people subdued the L-culture with the consequence that all innovations are shared in their influence sphere. Having a much smaller total population size, the language of the invaders is largely lost but the new common language L* still includes some Λ-elements (in a phylogenetic tree analysis, L* would be part of the L/M clade; using networks, L* would share edges with Λ in contrast to L and M). The L''/M-speaking remote population is re-integrated. The invaders' genotype (red) becomes part of the L-people's gene pool. Re-migration (forced or not) introduces L-genotypes into the original Λ-population. Only by comparing all available data, ideally covering more than one time period, can we deduce that the M-speakers represent an early isolated subpopulation of the L-people that was not affected by the Λ-invasion. With only the genetic data at hand, one may identify the M-speakers as one source and the Λ-tribe as another source for the L*-people, and infer that all L/M and Λ-tribes share a common origin (since the yellow genotype is found in both the M- and the original Λ-population).


It therefore seems to us that there is enormous potential for multi-disciplinary work that truly combines organismal and socio-cultural evolution. We have provided a few practical suggestions here about how this might be done. We encourage you all to try some of these ideas, to see where it leads us all.

Monday, November 4, 2019

Why the emperor has no clothes on – a thicket of trees

A critical question in phylogenetics, and this applies to both the detection and inference of reticulation, is: How much trust do we put in the inferred tree? A phylogenetic tree is just the simplest of all possible phylogenetic networks. Let's assume that there was some phylogenetic reticulation in the past (lineage mixing and crossing), then, in the best-case scenario, our inferred tree shows one of the intertwining pathways but misses the tangles, the crossroads.

An example of simple reticulate evolution: pink is the product of very recent lineage crossing between an early diverged (and otherwise lost) member of the blue lineage and the more recently diverged, hence genetically more coherent, red lineage. Bold lines show the tree we would likely infer in such a situation.

In the worst case, summarizing data with substantially different signals will give us branching artifacts such as:
  • terminal branches that are too long,
  • too long internal branches with conspicuously low support (ie. BS << 100, PP < 1.0),
  • artificial branches representing the least-conflicting solution for the conflicting data,
  • low branch support in general.
See eg. the bear data we used as a real-world example for our Intertwining trees and networks paper (Schliep et al. 2017, open access).

Three possible trees for bears: (a), Y-chromosome, paternal, and (c), nuclear-encoded autosomal introns, biparentally inherited, are congruent but disagree with the maternal genealogy (b), based on the mitochondrial genes. When fusing all three data sets, we get a (poorly) supported sister relationship for Sloth and Sun bears (red clade), not supported by any of the three fused data sets – a branching artifact.

Topological incongruence between gene trees and parental genealogies (as above) is commonly taken as evidence for reticulation. If one gene provides high support for taxon A as sister to B, and another gene has high support for B as sister to C, then B is likely the product of reticulation (eg. hybridization).

One simple possibility to put together a phylogenetic network is to summarize all of the trees in the form of a Consensus network, as shown next. (Technically this is a splits graph, it becomes a phylogenetic network as soon as we determine a root, which, here, would be at the edge leading to the Giant Panda.)

A strict Consensus network of the paternal, biparental, and maternal bear genealogies.
The numbers show the non-parametric bootstrap support for each (competing) split.

In this case, low support for a branch in a combined tree (the values on top) can result from strong conflict. For instance, the brownish splits, which are poorly supported using the combined data (BS = 21, 29), receive near unambiguous support from the mitochondrial genes, but are largely or entirely rejected by the Y-chromosome and nc-intron data. In the combined tree, this deep conflict is resolved by introducing the artificial red clade, with similarly low support: the signal in the data is ambiguous and the support is split between equally possible alternatives.

We know lineage crossing took place in bears (the mitochondrial and Y-chromosome trees are very much in conflict). However, does the above mean that the earliest bear-ish creatures hybridized, too? Note that the conflict is associated with a short-branched part of the graph, where apparently little evolution happened. Fast ancient radiations usually come with incomplete lineage sorting and diffuse signals. The only data set producing longer roots, but with notably lower support, are the biparentally inherited introns.

We are closing in on our own tail and have to ask again: Is this low support in the autosomal intron tree due to internal conflict, (sets of) introns preferring different topologies, supporting an ancient mixing hypothesis, or just reflecting lack of resolution? Check out the original paper by Kutschera et al. (2014, Mol. Biol. Evol. 31: 2004–2017), and make up your own mind.

On to the angiosperms

In my last post, I exemplified what Walker et al. (PeerJ, 2019) found in their angiosperm study: when we look at a plastome tree we are not looking at a summary of all gene trees but instead at a topology forced by very few of the genes in the chloroplast genome, such as the matK. We also have seen that one misplaced sequence (outgroup Podocarpus-matK) doesn't affect the combined analysis at all; it didn't even reduce the ingroup-outgroup split support. Also, I noted that the low-supported part of the combined tree goes hand in hand with lack of decisive signal from the matK.

It's time to take a look at what the other genes in this example data set come up with.

The eight gene trees. Terminal subtrees collapsed. Scales fit to size, scale bar = 0.1 expected substitutions per site. Upper left, matK tree which is very similar to the combined tree using all gene regions (cp = chloroplast, mt = mitochondrial, nc = nuclear genes). Note the low performance of the mt genes.

One thing is obvious: for most genes (except the nuclear-encoded rRNA genes) including the outgroup taxa adds little ingroup information of use; they are just too distant from any of the ingroup taxa. Outgroup rooting is tricky for angiosperms. Outgroup taxa will always be attracted to the ingroup taxon that is the least similar to any other part of the ingroup: Amborella in this case.

Generally, all of the ANA-grade water plants are genetically distinct and topologically isolated; any outgroup-inferred root must be placed in this part of the tree (all other living seed plants are very distant relatives of angiosperms looking back at, at least, ~250 million years of independent evolution, see eg. Age of Angiosperms... and What is an angiosperm pt. 2). The relatively conserved plastid rbcL and mitochondrial matR prefer an Amborella-Nymphaeales clade as sister to all other angiosperms, while the more divergent atpB (plastid), nad5 (mitochondrial) and 25S (nuclear) prefer the Amborella-root; this is a direct indication of ingroup-outgroup long-branch attraction. Any other placement of the outgroup subtree within the ingroup would necessarily decrease the likelihood of the tree (but note the position of the root in the 18S tree, lower-left, the tree based on the most-conserved, evolution-constrained gene in our sample; see also All solved a decade ago..., fig. 4A).

We can look at these trees with the strict consensus network, using uninformed edge lengths, that is, the network counterpart to the strict consensus cladograms still common in plant phylogenetic literature.

This is a nice piece of computer-art, but is scientifically quite useless (the boxiness and general graph structure is, however, reminiscent of strict consensus networks of most-parsimonious tree samples inferred for extinct animals, one example, and plants).

We can add some discriminatory information by counting how often each split occurs in the set of gene trees.

Same set of trees, different way of summarizing it. Note how the main clades emerge: one or two genes may have misplaced one or another OTU but the others get it right.
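The counting itself is simple enough to sketch. In the minimal Python example below, trees are written as nested tuples rather than parsed from Newick files, every internal branch is turned into a split (bipartition) of the taxon set, and we then count in how many trees each split occurs. The three five-taxon trees and all function names are made up for illustration; the sketch only counts splits and does not draw the network.

```python
from collections import Counter

def leaves(node):
    """Set of taxon names below a node (a name or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def splits(tree):
    """Non-trivial splits of a tree, each as an unordered pair of taxon sets."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found

def split_counts(trees):
    """Count in how many of the trees each split occurs."""
    counts = Counter()
    for tree in trees:
        counts.update(splits(tree))
    return counts

# Three hypothetical gene trees: two agree, one moves taxon B next to C.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), "C"), ("D", "E")),
    (("A", ("B", "C")), ("D", "E")),
]
counts = split_counts(gene_trees)
for split, n in counts.most_common():
    sides = sorted("".join(sorted(side)) for side in split)
    print(" | ".join(sides), n)
```

Splits found in all trees form the undisputed backbone, while competing splits with lower counts (here the two alternative placements of B) are the ones that create boxes in the consensus network.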

Alternatively, we can average the actual tree branch lengths to inform the edge length of the consensus network.

The light green, sand-colored, light brown and dark olive (clockwise) splits are likely branching artifacts. The light blue split is the one that supports the ANA-grade when the (combined) tree is rooted with the very distant outgroups.

A pretty little thicket of trees. Some agreement is found towards the leaves, but even here we have conflict among the gene trees. In some trees, there are long branches grouping non-related OTUs, obvious tree inference artifacts. The general rule is that the deeper we go (ie. the farther back in time), the messier it gets. Adding to this is that, irrespective of which gene is used, some OTUs are much closer to the hypothetical common ancestor (of Mesangiosperms, ie. all but ANA grade) than others – in the eudicots, the least-evolved taxa are Platanus (very old tree genus) and Euptelea (the basalmost Ranunculales); in the Magnoliales, the only angiosperm clade that lacks synapomorphies, it's Magnolia and Liriodendron (again, very old and primitive tree genera). Darwin's Abominable Mystery, the sudden appearance and quick dominance of angiosperms, resulted in an abominable chaos of gene trees and signals. How can they possibly converge to a single tree with amply high support along most branches?

The combined tree from the first post.
When compared to the bears, the answer may well be: because there has been very little to no reticulation between these lineages. Our thicket may not be a forest of trees but just a poorly trimmed, wildly overgrown bush. The genes share the same history, but when being analyzed one-by-one, each of their trees gets some aspects right, and some others (severely) wrong. Misplacing one OTU (e.g. the light green, dark olive, sand-colored and dark yellow splits in the averaged Consensus network) may have further topological effects; it didn't matter for the matK gene, because we misplaced only one very alien OTU in a data set that otherwise is hardly affected by adding or removing OTUs.

I argue here that, if there had been substantial reticulation messing up the signal of contemporary lineages and reflecting decoupled histories (like in the case of bears), we would expect at least some (artificial) branching patterns with low support in the combined tree, as well. This would also be the case looking at the gene-tree consensus networks, not only in the deepest parts but also closer to the leaves.

We will explore this alternative hypothesis in the next (and final) post of this mini-series.