The Genealogical World of Phylogenetic Networks: trait mapping

Monday, January 6, 2020

Why we may want to map trait evolution on networks, pt. 1 – Introduction

One of the more interesting aspects of studying evolution is to trace the evolution of the traits possessed by the organisms, whether those traits are physical or not (such as languages). That is, we usually infer phylogenetic trees and networks to see how things evolve, including both the organisms and their characteristics. However, this can easily lead to circular reasoning, as I will discuss here.

Background

A phylogenetic tree may be enough to work out who is sister to whom. However, when thinking about evolution itself, we actually want to find out who comes from whom, instead. This may be the reason why Charles Darwin did not title his 'abstract' A Natural Order of Species but used instead The Origin of Species.

The tricky bit is this: in order to find the origin, we first need to establish ancestor-descendant relationships, so that we can then see how things like fossils fit in (ie. whether they represent extinct sister lineages or precursors of the modern-day taxa). When taxon B is a derivation of A (ie. B evolved from A), the character suite of A is not only primitive but also the original set. Now, let's assume that we have a third fossil taxon C, which is clearly related to A and B. As evolutionary biologists, we cannot be content merely with inferring sister relationships between A, B, and C, but instead we need to decide whether or not C is also descended from A.

Ironically (from a modern cladistic viewpoint, focusing on establishing sister-clade relationships), Willi Hennig provided us with some tools for doing this:

We all know that apomorphies are derived traits, either unique (aut-) or uniquely shared by a group with inclusive common origin (syn-). Aut- and syn-apomorphies define (Hennig's) monophyletic groups. According to Hennig, a synapomorphy is a necessary criterion for recognizing a monophyletic group, and also a sufficient one (although the latter easily falls prey to circular reasoning). More importantly, they tell us that the ancestor(s) of the group and its (potentially lost) sister lineages lacked this trait!
Sym-plesiomorphies are traits that are primitive within a certain lineage. They define paraphyletic groups, which are groups of exclusive common origin. Following Hennig, they need to be discarded for systematics. From an evolutionary viewpoint, however, symplesiomorphies have double information content: (a) they provide us with traits that, at some point back in time, were synapomorphies; and (b) any member of a lineage not carrying the symplesiomorphic trait, shows a derived one.

Farris' cladistics is still the basis for systematics, and is widely applied in phylo-paleontology. The initial flaw of this approach is to assume that we can use morphological traits to infer a tree (with parsimony), and then map the same traits onto the inferred tree, allowing us to qualify the traits towards Hennig's objectives. How can this not be circular reasoning? We are mapping the traits onto a tree derived from those traits in the first place, so that the tree-building and mapping are not independent.
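
To make the mapping step concrete, here is a minimal Python sketch of Fitch's small-parsimony pass, which counts the minimum number of state changes a single character needs on a given rooted binary tree. The two toy trees, the taxon labels A to D and the character states are hypothetical and are not taken from the Wilf et al. matrix; the sketch only illustrates that the number of steps (and hence what looks derived or primitive) depends on the tree the character is mapped onto.

```python
def fitch_steps(tree, states):
    """Fitch bottom-up pass on a rooted binary tree given as nested tuples.

    Returns (candidate_state_set, minimum_number_of_changes) for the subtree.
    Leaves are taxon names (str); `states` maps each taxon to its state.
    """
    if isinstance(tree, str):              # leaf: its observed state, no change yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:                             # children agree: no extra change needed
        return shared, left_steps + right_steps
    # children disagree: keep both options and count one change
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical binary character scored for four hypothetical taxa.
states = {"A": 0, "B": 1, "C": 1, "D": 0}

tree_1 = (("A", "D"), ("B", "C"))   # hypothetical topology 1
tree_2 = (("A", "B"), ("C", "D"))   # hypothetical topology 2

print(fitch_steps(tree_1, states)[1])   # 1 change
print(fitch_steps(tree_2, states)[1])   # 2 changes
```

On the first toy tree the character changes once and looks like a clean shared derived trait; on the second it needs two changes and looks homoplastic. If the tree was built from the same characters, the mapping will tend to favour whichever reading helped build that tree.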

A simple (?) real-world example

For the purpose of this exercise, we will take the minute (seven character) matrix of Wilf et al. (2019) from this previous blog post (characters 5 and 7 corrected, and missing Fagaceae added; see also Denk et al. 2019, Science, 10.1126/science.aaz2189).

Wilf et al. found an Eocene fossil in South America, and argued that it must be a member of the modern genus Castanopsis, based on a parsimony DNA-scaffold approach (without actually using a DNA partition). Being a member of a modern genus, the fossil should have some aut-/synapomorphies or at least symplesiomorphies or homoiologies characterizing its sublineage of the Fagaceae, the paraphyletic Castanoideae.

Based on the morphology, we can infer this tree:

Fig. 1 – Adams consensus tree of 3 most-parsimonious trees (11 steps, CI = 0.84, RI = 0.88), traits are mapped using Mesquite's default parsimony model. Castanopsis rothwellii is the Eocene fossil found by Wilf et al.

Two characters qualify as near-synapomorphies (effectively there is only one: hemispheric indehiscent cupules) that define a crown-clade including Lithocarpus, Notholithocarpus, Castanopsis (as part of intrageneric variation) and Quercus. Most other putatively derived traits within the Fagaceae subclades are symplesiomorphies; two are potential homoiologies, one defining the Castanea-Chrysolepis clade. [Note the staircase-like tree topology, a common feature of parsimony trees dealing with extinct lineages.] The fossil's character suite is relatively derived: its states for characters 6 (shared only with some Castanopsis) and 7 (reversal as in Quercus) could be interpreted as indicating an extinct side lineage of the (paraphyletic) Castanoideae.

This is not a bad analysis for seven characters, but it is likely to be quite wrong.

Fagaceae still exist today, and their DNA can be sampled. Below is a maximum-likelihood tree, based on a 2012 NCBI GenBank oligogene data harvest I did for a talk in Bordeaux; the alignment is 19,242 basepairs long, has 2,985 distinct alignment patterns and a gappiness of 35.8%. Each genus and major intra-generic lineage is represented by a strict consensus sequence based on all available data (checked for mislabeled or pseudogene accessions). [Oaks started to radiate > 50 Ma, Grímsson et al. 2015, Hipp et al. 2019; beeches about the same time, Denk et al. 2009, Renner et al. 2016.]

Fig. 2 – An ML tree based on strict genus/intrageneric consensus sequences (see also Oh & Manos, 2008, fig. 4, based only on data from the CRABS CLAW gene, CRC; fig. 5 in the same paper shows a combined CRC + ITS tree)

According to this analysis, Chrysolepis and Castanea are not sisters; Castanea, but not Lithocarpus, is a close relative of the oaks. The (monophyletic) Trigonobalanoideae should form a clade (Fig. 2) not a grade (Fig. 1).

The analysis is not circular anymore, when we infer a tree based on data that is, as far as we know, independent of the data we want to map onto the tree. With the invention of stochastic mapping methods, we also avoid the possible limitations of parsimony when it comes to character mapping — morphological evolution is often not parsimonious, at least for the traits we can observe back in time or study in detail today.

Fig. 3 – ML trait mapping on the tree in Fig. 2 (ie. considering molecular branch lengths). Note that the reconstruction of character states for the all-ancestor is ambiguous, due to the extreme genetic distance between Fagus and the remainder of the Fagaceae. The situation in the scored fossils (Wilf et al. 2019, Denk et al. 2019) is shown for comparison.

For the ML mapping above, I scored intra-generic variations as additional states (ML ancestral-state reconstruction as implemented in the Mesquite program needs defined tips) and applied Mesquite's default model — this is essentially Lewis' Mk model for multi-state standard characters: one substitution category for any possible mutation. We can now compare the two mappings.
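
As a rough illustration of what "one substitution category for any possible mutation" implies, the following Python sketch computes the transition probabilities of an equal-rates Mk model along a single branch. The parameterization is an assumption made for the example: `rate` is taken to be the instantaneous rate of change between any two distinct states, and the numbers are arbitrary. Mesquite's own implementation and scaling may differ.

```python
import math


def mk_probs(k, rate, t):
    """Return (p_same, p_diff) for an equal-rates Mk model.

    k    : number of character states
    rate : assumed instantaneous rate between any two distinct states
    t    : branch length, in the time units that `rate` is expressed in

    p_same is the probability of ending in the starting state;
    p_diff is the probability of ending in one particular other state.
    """
    decay = math.exp(-k * rate * t)
    p_same = 1.0 / k + (k - 1) / k * decay
    p_diff = 1.0 / k - decay / k
    return p_same, p_diff


# Arbitrary example: a three-state character on branches of increasing length.
for t in (0.0, 0.1, 1.0, 10.0):
    p_same, p_diff = mk_probs(k=3, rate=1.0, t=t)
    print(f"t={t:5.1f}  same state={p_same:.3f}  one given other state={p_diff:.3f}")
```

On very long branches both probabilities approach 1/k, so the tip states say little about the state at the other end of the branch. Under this sketch, that is one way to read why a reconstruction across a very long branch can come out ambiguous.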

What our morphology-based tree recognized as derived was actually partly primitive. The near-synapomorphy (hemispheric indehiscent cupules) is in fact a symplesiomorphy of all Castanoideae + Quercus. Traits shared by Castanea-Castanopsis (pro parte, ie. some species show the ancestral, others the derived state) and Quercus are primitive, while those unique to (or part of intra-generic variation) one or several Castanoideae are derived.

Note that the alleged crown-group but old fossil Castanopsis rothwellii would fit at the base of the (core Fagaceae) tree (zero conflict) as well as close to its leaves (at least one conflicting character). Six of the seven traits can be pinpointed for the core Fagaceae ancestor. According to the reconstruction, it had three styles, scaly cupule appendages, hemispheric indehiscent cupules (vs. valvate in C. rothwellii), one flower per cupule, no valve dehiscence ("partial" in C. rothwellii), and inflorescences were unisexual and mixed (Wilf et al. state the Eocene fossils were unisexual, although the difference can only be assessed when investigating all inflorescences on a tree, see Denk et al.'s comment). The reconstruction is ambiguous regarding whether female flowers were clustered or solitary.

However, there is one implicit assumption held in common by all of the methods, including DNA-scaffolding, probabilistic and stochastic character mapping, total evidence dating, evolutionary placement algorithm (EPA) as implemented in the RAxML program, etc. That is: the inferred molecular tree is the true tree. This is the second fundamental flaw of cladistic approaches to evolution, as I will show in Part 2.

Data information

The morphological data used here are based on an emended version of the Wilf et al. matrix provided by my former colleague and co-author Thomas Denk (see also Denk et al. 2019, table 1); together with the molecular data matrix used here, it can be accessed via figshare.

References

Denk T, Grimm GW. (2009) The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83–100.

Denk T, Hill RS, Simeone MC, Cannon C, Dettmann ME, Manos PS. (2019) Comment on “Eocene Fagaceae from Patagonia and Gondwanan legacy in Asian rainforests”. Science 366: eaaz2189.

Grímsson F, Zetter R, Grimm GW, Krarup Pedersen G, Pedersen AK, Denk T. (2015) Fagaceae pollen from the early Cenozoic of West Greenland: revisiting Engler's and Chaney's Arcto-Tertiary hypotheses. Plant Systematics and Evolution 301: 809–832.

Hipp AL, Manos PS, Hahn M, et al. (2019) Genomic landscape of the global oak phylogeny. New Phytologist doi:10.1111/nph.16162.

Oh S-H, Manos PS. (2008) Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57: 434–451.

Renner SS, Grimm GW, Kapli P, Denk T. (2016) Species relationships and divergence times in beeches: New insights from the inclusion of 53 young and old fossils in a birth-death clock model. Philosophical Transactions of the Royal Society B doi:10.1098/rstb.2015.0135.

Wilf P, Nixon KC, Gandolfo MA, Cúneo NR. (2019) Eocene Fagaceae from Patagonia and Gondwanan legacy in Asian rainforests. Science 364: eaaw5139.

For more literature, see the post:
Ockham's Razor applied but not used: can we make a DNA-scaffolding with seven characters?

Monday, July 8, 2019

Character cliques and networks – mapping haplotypes of manual alphabets

[This post is the second part of our miniseries on the origin and evolution of sign language manual alphabets]

One aspect of exploratory data analysis (EDA) is for us to try to understand how our data relate to our inference(s). This is especially important when the signal from our data is increasingly complex. Sign language manual alphabets are such a case.

In our first post about sign language manual alphabets, I introduced the principal networks that we used to classify sign languages. Here, I'll describe our character mapping procedure and why we did it as part of our EDA framework, in order to establish scenarios for the origin and evolution of sign languages.

Characters and mapping

We encoded each hand-shape used to signify a certain concept, such as the letters included in the standard Latin alphabet "a", "b", "c", ..., "x", "y", "z", as a binary sequence – the presence or absence of a certain COGID (we will explain and discuss this in a later post). These binary sequences can be seen as an analogy of the genetic code, as a sort of 'linguistic haplotype', and their evolution can be mapped onto a network based on the entire dataset.

For instance, our matrix has three binaries (haplotypes) for the concept [g] in the oldest set of sign languages (pre-1840), two of which can be found in the earliest alphabets in our dataset: those of Yebra 1593 and Bonet 1620. Russian 1835, the oldest Cyrillic alphabet, uses a somewhat different hand-shape for its counterpart of the Latin "g", the Cyrillic "г".

For the concept [g], we thus have three taxon cliques, each defined by a distinct binary/haplotype: the 'Yebra haplotype', the 'Bonet haplotype', and the 'Cyrillic haplotype'.
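
The encoding and the resulting taxon cliques can be sketched in a few lines of Python. Everything in the example is hypothetical: the COGID labels ("g1", "g2", "g3"), the two extra alphabets ("Alphabet X", "Alphabet Y") and the assignments are invented for illustration and do not reproduce our matrix.

```python
from collections import defaultdict

# Hypothetical COGID assignments for the concept [g]; labels are illustrative only.
cogids_g = {
    "Yebra": {"g1"},
    "Bonet": {"g2"},
    "Russian 1835": {"g3"},
    "Alphabet X": {"g1"},   # hypothetical alphabet using the Yebra hand-shape
    "Alphabet Y": {"g2"},   # hypothetical alphabet using the Bonet hand-shape
}


def encode(cogid_sets):
    """Turn COGID sets into binary presence/absence sequences (haplotypes)."""
    all_ids = sorted(set().union(*cogid_sets.values()))
    return {
        taxon: tuple(int(cogid in ids) for cogid in all_ids)
        for taxon, ids in cogid_sets.items()
    }


def taxon_cliques(haplotypes):
    """Group taxa that share exactly the same binary sequence."""
    groups = defaultdict(list)
    for taxon, haplotype in haplotypes.items():
        groups[haplotype].append(taxon)
    return dict(groups)


haplotypes = encode(cogids_g)
for haplotype, taxa in taxon_cliques(haplotypes).items():
    print(haplotype, taxa)
```

Each distinct binary sequence defines one clique; mapping then amounts to checking whether the members of a clique sit together in a neighborhood of the network.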

By mapping these haplotypes on the network, as shown in the next figure, we can see that there is a small edge bundle reflecting the basic split between the Yebra and Bonet haplotypes.

Hand-shape drawings are taken from the original manuscripts.

We can also see that the Russian haplotype either evolved from the Yebra haplotype kept in the older Austrian-origin Group (ie. is an adaptation of the Yebra haplotype), or is a genuinely new invention; note the similarity of the Russian hand-shape to the letter г.

We repeated this procedure for all 26 concepts of the standard Latin alphabet, to get an idea of how often the encoded linguistic haplotypes fit with the overall pattern visualized in the inferred Neighbor-nets (ie. the neighborhoods as defined by edge bundles). This is shown in the next figure.

The arrows indicate inferred evolutionary processes (replacement or invention).

Using this network mapping (which, in principle, uses the logic of parsimony/median networks), we can make direct inferences about the general mode of evolution.

For instance, even though Russian 1835 uses a different set of hand-shapes (ie. is defined by partly unique haplotypes), the hand-shapes for the concepts [p] and [z] are exclusively shared with the Austrian-origin Group. The biological equivalent would be: the 'Austrian haplotypes' are a uniquely shared derived feature reflecting a putative common origin of the Austrian and Russian lineages, ie. a potential linguistic synapomorphy. We can also see that all haplotypes shared by Russian and all ([a][c][f][r][u][y]) or part ([b][e][i][k][n][o][x]) of the French-origin Group, an alternative source that may have inspired this early Cyrillic alphabet, lack this quality.

We can also make inferences about:

which hand-shape is the original one (O);
lineage-specific / diagnostic hand-shapes, eg. At. = Austrian, Da. = Danish (using two letter abbreviations);
which hand-shapes are shared but apparently derived, eg. At.-Fr. are hand-shapes / haplotypes shared by members of the Austrian- and French-origin groups not found in the Yebra or Bonet alphabets — C stands for cosmopolitan, non-original handshapes common in various lineages, including British-origin Group, and D represents derived but rare hand-shapes without any clear lineage-affiliation; and
which hand-shapes are alphabet-unique (ie. represent a linguistic autapomorphy).

In addition, we can explore certain details, including patterns (character-based taxon cliques) that are at odds with the overall reconstruction. The latter are to be expected, because the graph is planar (2-dimensional) but the processes that shaped sign alphabets are likely to be multi-dimensional. For instance, our networks failed to resolve the affinity of the contemporary Norwegian Sign Language, the reason for which can be seen in the following character map.

Note the position of Norwegian 1955, which is still part of the Austrian-origin Group (like older manual alphabets used in the late 19th century in Norway). However, it is already influenced by international standardization: eg. concepts [k], [p], and [z] use(d) French hand-shapes. Hence, Norwegian 1955 shares quite a high number of lineage-diagnostic hand-shapes with Danish 1967 and the Icelandic Sign Language. These, and others, were further replaced in its contemporary counterpart (Norwegian SL) by hand-shapes borrowed from various lineages (eg. [c], [f] from the nearly extinct Austrian-origin Group, [p] from the Russian Group, [k] same as in the Spanish Group), as well as by unique hand-shapes, including hand-shapes evolved from earlier forms or those that have been genuinely invented.

Why we map character evolution along networks

In many cases, we have only one set of data from which to infer the graph(s) on which we base our conclusions. We cannot test to which degree our data (the way we scored the differentiation patterns) and inferences are systematically biased. Thus, we want to explore which aspects of our inference are supported by character splits, and establish taxon cliques and evolutionary pathways for the characters (scored traits). Lacking an independent source of data, the latter would involve circular reasoning, ie. mapping the traits along a tree derived from those same traits.

By inferring a tree, we crystallize one pattern dimension out of the data, although more often than not this will be a compromise among multidimensional signals. A network, such as a Neighbor-net, has two dimensions, and hence our mapping can consider two alternatives at the same time; this enables us to make a choice, if we have to. Another practical advantage of a Neighbor-net is that it is quick to infer, so that we can easily reduce the data set and use a more focused graph for the map.

In cases where 2-dimensional graphs don't suffice, there are still Consensus networks, which would allow mapping character evolution based on a sample of many alternative trees.

We could even eliminate the circular reasoning while maintaining a relatively stable inference framework. Deleting a character or several characters (or recoding them: see eg. Should we try to infer trees on tree-unlikely matrices?) can easily lead to a new tree topology, although it has less effect on the structure of a Neighbor-net. When we need to worry about circular reasoning for mapping a certain concept, or two concepts that may have interacted, we just base our Neighbor-net on a distance matrix calculated from a reduced character matrix, and then map only those concepts not considered for the inference.
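
A minimal Python sketch of the reduced-matrix step follows. It assumes a simple mean character (Hamming) distance on the binary matrix, which is a choice made for the example and not necessarily the distance used in our study; the taxa T1 to T3, the matrix and the column-to-concept assignment are hypothetical. The resulting distances would then be passed to a Neighbor-net implementation, which is not shown here.

```python
def reduced_distances(matrix, concept_of_column, held_out):
    """Pairwise mean character distances, ignoring columns of held-out concepts.

    matrix            : dict mapping taxon -> list of 0/1 values (one per column)
    concept_of_column : list giving the concept each column belongs to
    held_out          : set of concepts to exclude from the inference
    """
    keep = [i for i, concept in enumerate(concept_of_column)
            if concept not in held_out]
    taxa = list(matrix)
    distances = {}
    for i, a in enumerate(taxa):
        for b in taxa[i + 1:]:
            differing = sum(matrix[a][j] != matrix[b][j] for j in keep)
            distances[(a, b)] = differing / len(keep)
    return distances


# Hypothetical binary matrix: two columns for concept "a", two for concept "g".
matrix = {
    "T1": [1, 0, 1, 0],
    "T2": [1, 0, 0, 1],
    "T3": [0, 1, 0, 1],
}
concept_of_column = ["a", "a", "g", "g"]

print(reduced_distances(matrix, concept_of_column, held_out=set()))   # all columns
print(reduced_distances(matrix, concept_of_column, held_out={"g"}))   # [g] held out
```

With [g] held out, the network is inferred without any information from that concept, so mapping the [g] haplotypes onto it afterwards is no longer circular.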

Other posts in this miniseries

Stacking networks based on sign language manual alphabets – introduction and principal networks used in our study