The Genealogical World of Phylogenetic Networks: February 2019

Monday, February 25, 2019

Automatic morpheme segmentation (Open problems in computational diversity linguistics 1)

The first task on my list of 10 open problems in computational diversity linguistics deals with morphemes, that is, the minimal meaning-bearing parts in a language. A morpheme can be a word, but it does not have to be a word, since words may consist of more than one morpheme, and — depending on the language in question — may do so almost by default.

Examples of morphemes in English include clear-cut cases of compounding, where two words are joined to form a new word. Often, this is not even readily reflected in spelling, and, as a result, speakers may at times think that a word like "primary school" is not a single word, although it is easy to determine from its semantics that the word is indeed pointing to one uniform concept. Other examples include grammatical markers, such as the ending -s for most English plurals, or to mark the third person singular of verbs. When confronted with a word form like walks, linguists will analyze this word as consisting of two morphemes, illustrating it by adding a dash as a boundary marker: walk-s.

The problem

The task of automatic morpheme segmentation is thus a pretty straightforward one: given a list of words, potentially along with additional information, such as their meaning, or their frequency in the given language, try to identify all morpheme boundaries, and mark this by adding dash symbols where a boundary has been identified.

One may ask why automatic identification of morphemes should be a problem — and some people commenting on my presentation of the 10 open problems last month did ask this. The problem is not unrecognized in the field of Natural Language Processing, and solutions have been discussed from the 1950s onwards (Harris 1955, Benden 2005, Bordag 2008, Hammarström 2006, see also the overview by Goldsmith 2017).

Roughly speaking, all approaches build on statistics about n-grams, i.e., recurring symbol sequences of arbitrary length. Assuming that n-grams representing meaning-building units should be distributed more frequently across the lexicon of a language, they assemble these statistics from the data, trying to infer the ones which "matter". With Morfessor (Creutz and Lagus 2005), there is also a popular family of algorithms available in the form of a very stable and easy-to-use Python library (Virpioja et al. 2013). Applying and testing methods for automatic morpheme segmentation is thus very straightforward nowadays.
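To make the idea of lexicon-wide n-gram statistics concrete, here is a minimal Python sketch. It is not how Morfessor or any of the cited methods work internally; it only counts, for every substring of at least two letters, in how many words of a list it occurs. The function name, the minimum length, and the choice of word list (the ten German test words used later in this post, without dashes) are assumptions made for illustration.

```python
from collections import Counter

# Toy lexicon: the ten German test words of this post, written without dashes.
WORDS = [
    "hand", "handschuh", "hantel", "hunger", "laufen",
    "gehen", "liegen", "schlafen", "kinderarzt", "grundschule",
]


def ngram_word_counts(words, min_len=2):
    """Count in how many words each substring of at least min_len letters occurs."""
    counts = Counter()
    for word in words:
        # Collect each substring once per word, so that a count reflects the
        # number of words containing it, not the number of occurrences.
        substrings = {
            word[i:j]
            for i in range(len(word))
            for j in range(i + min_len, len(word) + 1)
        }
        counts.update(substrings)
    return counts


if __name__ == "__main__":
    counts = ngram_word_counts(WORDS)
    for ngram in ("en", "han", "hand", "sch"):
        print(ngram, counts[ngram])
```

On this toy list, en occurs in four words and sch in three, but han also occurs in three words although it is not a morpheme in any of them. Frequency alone therefore cannot decide which n-grams "matter", least of all on so little data.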

The issue with all of these approaches and ideas is that they require a very large amount of data for training, while our actual datasets are small and sparse by nature. As a result, all currently available algorithms fail spectacularly when it comes to determining the morphemes in datasets of fewer than 1,000 words.

Interestingly, even when trained on large datasets, the algorithms still commit surprising errors, as can easily be seen when testing the online demo of the Morfessor software for German (https://asr.aalto.fi/morfessordemo/). When testing words like auftürmen "pile up", for example, the algorithm yields the segmentation auf-türme-n, which is probably understandable from the fact that the word Türme "towers" is quite frequent in the German lexicon, thus confusing the algorithm; but for a German speaker, who knows that verbs end in -en in their infinitive, it is clear that auftürmen can only be segmented as auf-türm-en.

If I understand the information on the website correctly, the Morfessor algorithm offered online was trained with more than 1 million different word forms in German. Given that in our linguistic approaches we usually have 1,000 words, if not fewer, per language at our disposal, it is clear that the algorithms won't provide help in finding the morphemes in our data.

To illustrate this, I ran a small test on the Morfessor software, using two datasets for training: one big dataset with about 50,000 words from Baayen et al. (1995), and one smaller dataset of about 600 words, which I used as a cognate detection benchmark when writing my dissertation (List 2014). I trained the Morfessor software on each of these two datasets and then applied the trained models to segment a list of 10 German words (see the GitHub Gist here).

The results for the two models (small data and big data) as well as the segmentations proposed by the online application (online) are given in the table below (with my own judgments on morphemes given in the column word).

Number	Word	Small data	Big data	Online
1	hand	hand	hand	hand
2	hand-schuh	hand-sch-uh	hand-schuh	hand-schuh
3	hantel	h-a-n-t-el	hant-el	han-tel
4	hunger	h-u-n-g-er	hunger	hunger
5	lauf-en	l-a-u-f-en	laufen	lauf-en
6	geh-en	gehen	gehen	gehen
7	lieg-en	l-i-e-g-en	liegen	liegen
8	schlaf-en	sch-lafen	schlafen	schlaf-en
9	kind-er-arzt	kind-er-a-r-z-t	kind-er-arzt	kinder-arzt
10	grund-schule	g-rund-sch-u-l-e	grund-schule	grundschule

What can be seen clearly from the table, where all forms deviating from my analysis are marked in red font, is that none of the models does a convincing job in segmenting my ten test words. More importantly, however, we can clearly see that the algorithm's problems increase drastically when dealing with small training data. The segmentations proposed in the Small data column are clearly the worst, splitting words in a seemingly random fashion into letters.
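The comparison in the table can also be expressed in numbers. The following Python sketch copies the table rows and counts, for each model, how many words are segmented exactly as in the Word column, and how many of the proposed boundaries coincide with the boundaries given there. The function names and the choice of these two simple measures are assumptions made for illustration; they are not part of the original test.

```python
# Rows copied from the table: (Word, Small data, Big data, Online).
ROWS = [
    ("hand", "hand", "hand", "hand"),
    ("hand-schuh", "hand-sch-uh", "hand-schuh", "hand-schuh"),
    ("hantel", "h-a-n-t-el", "hant-el", "han-tel"),
    ("hunger", "h-u-n-g-er", "hunger", "hunger"),
    ("lauf-en", "l-a-u-f-en", "laufen", "lauf-en"),
    ("geh-en", "gehen", "gehen", "gehen"),
    ("lieg-en", "l-i-e-g-en", "liegen", "liegen"),
    ("schlaf-en", "sch-lafen", "schlafen", "schlaf-en"),
    ("kind-er-arzt", "kind-er-a-r-z-t", "kind-er-arzt", "kinder-arzt"),
    ("grund-schule", "g-rund-sch-u-l-e", "grund-schule", "grundschule"),
]
MODELS = ("Small data", "Big data", "Online")


def boundaries(segmented):
    """Return the letter offsets at which a dash was placed in a word."""
    positions, offset = set(), 0
    for morph in segmented.split("-")[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def evaluate(rows):
    """Compare each model column with the manual analysis in the first column."""
    scores = {}
    for col, name in enumerate(MODELS, start=1):
        scores[name] = {
            # Words segmented exactly as in the manual analysis.
            "exact": sum(1 for row in rows if row[col] == row[0]),
            # Proposed boundaries that also occur in the manual analysis.
            "hits": sum(len(boundaries(row[0]) & boundaries(row[col])) for row in rows),
            # All boundaries proposed by the model.
            "proposed": sum(len(boundaries(row[col])) for row in rows),
            # All boundaries in the manual analysis.
            "gold": sum(len(boundaries(row[0])) for row in rows),
        }
    return scores


if __name__ == "__main__":
    for name, score in evaluate(ROWS).items():
        print(name, score)
```

Counted this way, the small-data model segments only one of the ten words exactly as in the manual analysis, while the big-data model and the online application each get five right. The small-data model proposes 29 boundaries, of which only 6 agree with the 8 boundaries of the manual analysis, which reflects the impression of words being split almost at random.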

What is interesting in this context is that trained linguists would rarely fail at this task, even when all they were given is the small data list for training. That they do not fail is shown by the numerous studies where linguistic fieldworkers have investigated hitherto under-investigated languages, and quickly figured out how the morphology works.

Why is it so difficult to find morpheme boundaries?

What makes the detection of morpheme boundaries so difficult, also for humans, is that they are inherently ambiguous. A final -s can mark the plural in German, especially on borrowings, as in Job-s, but it can likewise mark a short variant of es "it", where the vowel is deleted, as in ist's "it's", and in many other cases it marks nothing at all, but is instead part of a larger morpheme, as in Haus "house". Whether or not a certain substring of sounds in a language can function as a morpheme depends on the meaning of the word, not on the substring itself. We can — once more — see one of the great differences between sequences in biology and sequences in linguistics here: linguistic sequences derive their "function" (i.e., their meaning) from the context in which they are used, not from their structure alone.

If speakers are no longer able to clearly understand the morphological structure of a given word, they may even start to change it, in order to make it more "transparent" in its denotation. Examples for this are the numerous cases of folk etymology, where speakers re-interpret the morphemes in a word, with English ham-burger as a prominent example, since the word originally seems to derive from the city Hamburg, which has nothing to do with ham.

How do humans find morphemes?

The reason why human linguists can find morphemes in sparse data relatively easily, while machines cannot, is still not entirely clear to me (beyond the general observation that humans are good at pattern recognition and machines are not). However, I do have some basic ideas about why humans largely outperform machines when it comes to morpheme segmentation; and I think that future approaches that try to take these ideas into account might drastically improve the performance of automatic morpheme segmentation methods.

As a first point, given the importance of meaning in order to determine morphemic structure, it seems almost absurd to me to try to identify morphemes in a given language corpus based on a pure analysis of the sequences, without taking their meaning into account. If we are confronted with two words like Spanish hermano "brother" and hermana "sister", it is clear — if we know what they mean — that the -o vs. -a most likely denotes a distinction of gender. While the machines compare potential similarities inside the words independent of semantics, humans will always start from those pairs where they think that they could expect to find interesting alternations. As long as the meanings are supplied, a human linguist — even when not familiar with a given language — can easily propose a more or less convincing segmentation of a list of only 500 words.
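A minimal Python sketch may illustrate how meaning can guide the comparison. It groups words by a shared base concept and only then looks for a common stem and differing endings. Only hermano and hermana come from the text above; the other entries, the concept labels, and the function name are assumptions made for illustration, and comparing common prefixes is just the simplest possible way of finding the alternation.

```python
from collections import defaultdict
from os.path import commonprefix

# Toy glossed word list: (form, base concept, gender).
# Only hermano/hermana are taken from the text; the rest is illustrative.
WORDS = [
    ("hermano", "SIBLING", "male"),
    ("hermana", "SIBLING", "female"),
    ("hijo", "CHILD", "male"),
    ("hija", "CHILD", "female"),
    ("mano", "HAND", None),
]


def segment_by_meaning(words):
    """Segment only those words that have a partner with the same base concept."""
    by_concept = defaultdict(list)
    for form, concept, _gender in words:
        by_concept[concept].append(form)

    segmented = {}
    for forms in by_concept.values():
        # Words without a semantic partner are left unsegmented.
        stem = commonprefix(forms) if len(forms) > 1 else ""
        for form in forms:
            ending = form[len(stem):]
            if stem and ending:
                segmented[form] = f"{stem}-{ending}"
            else:
                segmented[form] = form
    return segmented


if __name__ == "__main__":
    for form, analysis in segment_by_meaning(WORDS).items():
        print(form, analysis)
```

The sketch proposes herman-o and herman-a, but leaves mano "hand" untouched, although it also ends in -o: without a semantically related partner, there is no alternation that would justify a boundary.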

A second point that is disregarded in current automatic approaches is the fact that morphological structures vary drastically among languages. In Chinese and many South-East Asian languages, for example, it is almost a rule that every syllable represents one morpheme (with minimal exceptions being attested and discussed in the literature). Since syllables are in turn easy to find in these languages, because words can often only end in a small number of specific sounds, an algorithm to detect morphemes in those languages would not need any n-gram statistics, but just a theory of syllable structure. Instead of global strategies, we may rather have to opt for local strategies of morpheme segmentation, in which we identify different types of languages for which a given algorithm seems suitable.
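How little is needed in such a case can be shown with a toy example in Python. The sketch assumes a drastically simplified syllable template, an optional single consonant, one or more vowels, and an optional final nasal, which is not a full account of the phonology of Mandarin or any other language. The sound classes, the function name, and the two example words (Mandarin diannao "computer" and huoche "train", written without tones) are assumptions made for illustration.

```python
# Toy sound classes for a (C)V+(N) syllable template. These are simplifying
# assumptions, not a complete description of any language.
VOWELS = {"a", "e", "i", "o", "u"}
CODAS = {"n", "ŋ"}


def syllabify(sounds):
    """Split a list of sounds into syllables following the toy template."""
    syllables, i, n = [], 0, len(sounds)
    while i < n:
        start = i
        # Optional onset: a single consonant.
        if sounds[i] not in VOWELS:
            i += 1
        # Obligatory nucleus: one or more vowels.
        if i >= n or sounds[i] not in VOWELS:
            raise ValueError(f"no vowel found after position {start}")
        while i < n and sounds[i] in VOWELS:
            i += 1
        # Optional coda: a permitted final sound, unless a vowel follows,
        # in which case it is treated as the onset of the next syllable.
        if i < n and sounds[i] in CODAS and (i + 1 == n or sounds[i + 1] not in VOWELS):
            i += 1
        syllables.append("".join(sounds[start:i]))
    return syllables


if __name__ == "__main__":
    print("-".join(syllabify(list("diannao"))))
    print("-".join(syllabify(["h", "u", "o", "ch", "e"])))
```

Under these assumptions, the two words come out as dian-nao and huo-che, without any frequency statistics being involved. The template cannot decide where to split sequences of vowels that belong to different syllables, so a realistic version would need a richer theory of the syllable.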

This brings us to a third point. A peculiarity of linguistic sequences in spoken languages is that they are built by specific phonotactic rules that govern their overall structure. Whether or not a language tolerates more than three consonants in the beginning of a word depends on its phonotactics, the set of rules by which the inventory of sounds is combined to form morphemes and words. Phonotactics itself can also give hints on morpheme boundaries, since it may prohibit combinations of sounds within morphemes which can occur when morphemes are joined to form words. German Ur-instinkt "basic instinct", for example, is pronounced with a glottal stop after the Ur-, which can only occur in the beginning of German words and morphemes, thus marking the word clearly as a compound (otherwise the word could be parsed as Urin-stinkt "urine smells").
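The glottal stop example translates directly into a rule that a program can apply to phonetic transcriptions. The following Python sketch places a boundary before every glottal stop that is not at the beginning of the word. The function name and the rough transcription of Urinstinkt are assumptions made for illustration, and the rule obviously only finds boundaries in front of morphemes that begin with a glottal stop.

```python
def split_at_glottal_stops(transcription, marker="ʔ"):
    """Insert a dash before every non-initial glottal stop in a transcription."""
    parts, current = [], ""
    for sound in transcription:
        # A glottal stop inside the word signals the start of a new morpheme.
        if sound == marker and current:
            parts.append(current)
            current = ""
        current += sound
    if current:
        parts.append(current)
    return "-".join(parts)


if __name__ == "__main__":
    # Rough, simplified transcription of German Urinstinkt (an assumption).
    print(split_at_glottal_stops("ʔuːɐʔɪnstɪŋkt"))
```

For a method working on spelling alone, this evidence is invisible, since the glottal stop is not written in German orthography.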

A fourth point that is also generally disregarded in current approaches to automatic morpheme segmentation is that of cross-linguistic evidence. In many cases, the speakers of a given language may themselves no longer be aware of the original morphological segmentation of some of their words, while the comparison with closely related languages can still reveal it. If we have a potentially multi-morphemic word in one language, for example, and only one of the two potential morphemes reflected as a normal word in the other language, this is clear evidence that the potentially multi-morphemic word does, indeed, consist of multiple morphemes.

Suggestions

Linguists regularly use multiple types of evidence when trying to understand the morphological composition of the words in a given language. If we want to advance the field of automatic morpheme segmentation, it seems to me indispensable that we give up the idea of detecting the morphology of a language just by looking at the distribution of letters across word forms. Instead, we should make use of semantic, phonotactic, and comparative information. We should further give up the idea of designing universal morpheme segmentation algorithms, but rather study which approach works best on which morphological type. How these aspects can be combined in a unified framework, however, is still not entirely clear to me; and this is also the reason why I list automatic morpheme segmentation as the first of my ten open problems in computational diversity linguistics.

Even more important than the strategies for the solutions of the problem, however, is that we start to work on extensive datasets for testing and training of new algorithms that seek to identify morpheme boundaries on sparse data. As of now, no such datasets exist. Approaches like Morfessor were designed to identify morpheme boundaries in written languages; they barely work with phonetic transcriptions. But if we had the datasets for testing and training available, be it only some 20 or 40 languages from different language families, manually annotated by experts, segmented both with respect to the phonetics and to the morphemes, this would allow us to investigate both existing and new approaches much more profoundly, and I expect it could give a real boost to our discipline and greatly help us to develop advanced solutions for the problem.

References

Baayen, R. H. and Piepenbrock, R. and Gulikers, L. (eds.) (1995) The CELEX Lexical Database. Version 2. Philadelphia.

Benden, Christoph (2005) Automated detection of morphemes using distributional measurements. In: Claus Weihs and Wolfgang Gaul (eds.): Classification -- the Ubiquitous Challenge. Berlin and Heidelberg: Springer, pp. 490-497.

Bordag, Stefan (2008) Unsupervised and knowledge-free morpheme segmentation and analysis. In: Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras and Diana Santos (eds.): Advances in Multilingual and Multimodal Information Retrieval. Berlin and Heidelberg: Springer, pp. 881-891.

Creutz, M. and Lagus, K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report. Helsinki University of Technology.

Goldsmith, John A. and Lee, Jackson L. and Xanthos, Aris (2017) Computational learning of morphology. Annual Review of Linguistics 3.1: 85-106.

Hammarström, Harald (2006) A Naive Theory of Affixation and an Algorithm for Extraction. In: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006 pp. 79-88.

Harris, Zellig S. (1955) From phoneme to morpheme. Language 31.2: 190-222.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Virpioja, Sami, Smit, Peter, Grönroos, Stig-Arne and Kurimo, Mikko (2013) Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Helsinki: Aalto University.

Monday, February 18, 2019

Can we depict the evolution of highly conserved genes, such as the ribosomal RNA genes?

Median networks have been designed to put within-species haplotypes into an explicit evolutionary framework. They are exclusively parsimony-based, but differ from traditional trees by treating operational taxonomic units (OTUs) as both potential tips and ancestors. Ancestors are placed at internal nodes ('medians'). The latter makes them interesting for hypotheses about sequence evolution; but, like all parsimony-based methods, they suffer from high levels of homoplasy, which is a common feature of genetic data sets.

Can we use median networks to better understand evolution far above the species level?

In order to test this, I generated a median network using data on the nuclear-encoded 5.8S rDNA of Fagales. This is a flowering plant (angiosperm) order, which includes well-known trees such as oaks, beeches, chestnuts, walnuts, alder, birch and hazel, but also the enigmatic 'false beech' (Nothofagus s.l., the traditional four subgenera have been elevated to genera by Heenan & Smissen 2013), a Gondwanan element that (for some time) has intrigued biogeographers.

Why I have always loved nrDNA

When I was a young (phylo-)geneticist, my boss, a geneticist who sequenced genes such as the rRNA genes before PCR made it easy, pointed me to the works of Mark Hershkovitz, Louise Lewis, and Edith Zimmer about the evolution of the nuclear-encoded ribosomal RNA genes (nrDNA) in angiosperms. Long pre-dating the era of big data and self-evident, trivial phylogenies (i.e., data sets allowing for the inference of a fully resolved, unambiguously supported tree), Hershkovitz and co-workers sought to extract as much information as possible from the best-known gene region available back then (mid-late 90s): the internal transcribed spacers (ITS1, ITS2) of the 35S rDNA, the cistron encoding the genes for the 18S, 5.8S and 25S (or 28S, but not "26S") nuclear ribosomal RNA.

Hershkovitz MA, Lewis LA. 1996. Deep-level diagnostic value of the rDNA-ITS region. Molecular Biology and Evolution 13:1276–1295.
Hershkovitz MA, Zimmer EA. 1996. Conservation patterns in angiosperm rDNA ITS2 sequences. Nucleic Acids Research 24:2857–2867.
Hershkovitz MA, Zimmer EA, Hahn WJ. 1999. Ribosomal DNA sequences and angiosperm systematics. In: Hollingsworth PM, Bateman RM, and Gornall RJ, eds. Molecular Systematics and Plant Evolution. London: Taylor & Francis, pp. 268–326.

The ITS1 and ITS2 are highly divergent, non-coding but transcribed intergenic spacers within the structurally and sequentially much more conserved nrDNA, which distinguishes them from nearly all other non-coding regions. More often than not, their sequences are impossible to align across high-ranking taxa such as families or orders. The brilliance of Hershkovitz et al.'s work was to just go a level up by identifying shared general sequence patterns, and to put them in an evolutionary context.

Bird's-eye view of the ITS region (consensed for sequence groups) in Fagales including sequences of the two outgroups used in Li et al. 2004 (zoom in and try to figure out where they are). The position of the ITS(1) cleavage site is indicated, a highly conserved, AT-dominated sequence motif within the ITS1. The "Nothofagus deletion" (Manos 1997), the gray area seen in some of the topmost variants in the 5.8S rDNA, is a sequencing/editing artifact (newer sequences all have a complete 5.8S rDNA). Most of these data are more than 15 years old (see references provided at the end of the post) and may include more data artifacts, especially in the length-polymorphic portions. Nonetheless, part of the data were included in the dating studies of Sauquet et al. (2012) and Xing et al. (2014) to compensate for the lack of resolution of the also included plastid regions towards the tips of the Fagales tree (intrafamily and -generic relationships).

Accordingly, in my (open access) Ph.D. thesis you'll find not a few figures depicting the potential evolution of sequence patterns in the ITS1 and ITS2 of maples and the beech trees.

I could probably write a book taking up where Hershkovitz et al. stopped, but this would be: a) very subjective, and b) too complex and marginal for the 21st century. Very few people would read it. We have grown accustomed to simple graphs as metaphors of evolution and, thanks to big data, we have become reluctant to discuss the results ex machina. Also, I would have needed a score of students to pursue all the avenues that I glimpsed into; e.g. the following pic:

Evolution of the 5'-end of the ITS1 in basal eudicots (looking at divergences that happened, at least, 100 myrs ago).

The other way around

If the more conserved sequence patterns within the ITS1 and ITS2 can be informative about evolution at a much higher level (which they are), the next question is: what can we learn from the sequence patterns in the highly conserved portions of the rDNA linked with the ITS1 and ITS2? Historic-genetically, the ITS1 is fundamentally different from the ITS2. The former, ITS1, is an intergenic spacer, which has no secondary structure (although you can find reconstructions in the literature) as it is split into two parts right after transcription (the ITS1 cleavage site is quite conserved, and a main topic in the papers by Hershkovitz and Zimmer). The latter, ITS2, has been evolutionarily derived from the first variable portion of the large ribosomal subunit (LSU), the 25S (28S) rDNA. In primitive organisms, there is hence no 5.8S rDNA and ITS2.

This geno-evolutionary history is also the reason for the structural linkage between the 5.8S rRNA and the 5' end of the 25S (28S) rRNA. Here's a zoom-in on the part that we are interested in.

For better orientation, I have named some of the extremely conserved secondary structure elements of the (mature) 5.8S rRNA. Note that the "Gingerbread Man" structure is very conserved in angiosperm sequences although it only contains three very short stems. The "Pimple" and the "Needle" are so-called hairpins — a strictly complementary stem part is capped by a short, non-complementary tip ('semi-loop'): a 3- and 4-nt long motif, respectively, in Arabidopsis and all Fagales (in some species of Lithocarpus, the tropical 'stone nut' and relative of oaks, the "Needle" has two extra nucleotides).

5.8S rDNA in Fagales

I chose the Fagales because I have worked on them a lot, they are a pretty small group, and except for one "asterisk branch" their inter-family relationships are resolved.

Basic signal in Li et al. (2004)'s matrix. Inter-family relationships are, data-wise, fairly trivial, hence, the tree-like Neighbor-net. Only the placement of the Myricaceae with respect to Juglandaceae (now incl. Rhoipteleaceae) and Betulaceae + allies is not unambiguously resolved (see this post)

Oaks have received a lot of attention from population geneticists, like other widespread species or species complexes. Those studies, using Median networks and related methods such as Statistical Parsimony, revealed very complex genetic diversity patterns. On the other hand, the Fagales lineage has been fairly neglected by plant phylogeneticists, although it comprises many of the dominant, ecologically and economically most important trees of the Northern Hemisphere (and the enigmatic Gondwanan Nothofagaceae). The early studies found evidence for deep nuclear-plastid incongruences, but only in recent years have the first (non-comprehensive) complete plastome phylogenies and dated all-Fagales trees surfaced (which do contain one or another common error and misinterpretation of results).

For one family, the southern hemispheric, tropical-subtropical Casuarinaceae, we have no (reliable) ITS data at all; also missing is one of the genera of the Juglandaceae: Engelhardia (s.str.; most data in gene banks labelled as Engelhardia is from Alfaropsis; cf. Manchester 1987 and Manos et al. 2007, but see Zhang et al. 2013).

In total, we find 17 variable sites at and above the genus level in the 5.8S rDNA of Fagales. There are three in the core parts, structurally linked to the 5' 25S rRNA, two in the 'Gingerbread Man', three in the 5' and 3' trails, and the rest are in the 'Needle'.

Unique mutations and mutational trends (arrows) in the 5.8S rDNA in Fagales. Circles highlight the basepairs differing from the reference (Arabidopsis 5.8S rRNA). Blue, mutations found within more than one major lineage, pink, lineage-conserved (diagnostic) mutations; red, mutations restricted to a single genus; green, genetic (syn)apomorphies of the 5.8S rDNA of Fagales. Be = Betulaceae; Ju = Juglandaceae; My = Myricaceae; No = Nothofagaceae; Fagaceae include Fagus (Fa, the beech) and the remainder ("Quercaceae": Qu), which are genetically substantially distinct from Fagus.

Many mutations are genus-coherent; increased intrageneric variation is found in the 5'-tail and the part encoding the 4(6)-nt long 'semi-loop' sequence of the "Needle" (pos. 120–142 in the rRNA of Arabidopsis thaliana):

A (near-)full Median network for the tip of the 'Needle'. In a few Lithocarpus (a "Quercaceae" genus) the sequence is 6-nt-long, which would result in an elongated hairpin (paired basepairs are underlined). The ATTC is a genetic symplesiomorphy.

Exceptions are Fagus and Quercus, which can show substantial intragenomic ITS divergence, Lithocarpus (the most divergent genus, ITS-wise), and Nothofagus s.l. (between the former subgenera, now genera). In these cases, the intra-(sub)generic variation includes the putatively ancestral nucleotide and/or the nucleotide shared with other genera of the family; e.g., at pos. 123, all Fagales have a C, Fagus can have either C or T (= Y), and Quercus can show any of the four nucleotides (= N).
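The codes Y and N used here are part of the standard IUPAC nucleotide ambiguity codes. As a small illustration, the following Python sketch turns the set of nucleotides observed at a site within a genus into the corresponding code. The function name is an assumption made for illustration; the code table itself is the standard one.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of nucleotides.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_code(nucleotides):
    """Return the IUPAC code for the nucleotides observed at one site."""
    observed = frozenset(n.upper() for n in nucleotides)
    if observed not in IUPAC:
        raise ValueError(f"not a set of nucleotides: {sorted(observed)}")
    return IUPAC[observed]


if __name__ == "__main__":
    print(consensus_code("C"))     # a single shared nucleotide
    print(consensus_code("CT"))    # either C or T, as described for Fagus
    print(consensus_code("ACGT"))  # any nucleotide, as described for Quercus
```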

A Median-network for the 5.8S rDNA

Ambiguities can be detrimental for resolution in standard parsimony implementations. The NETWORK program, for instance, warns that a code of "N" may render the result less reliable, and this applies also to the other ambiguity codes. If we include the intra-generic polymorphisms as ambiguity codes, NETWORK runs for quite a long time: too many solutions are equally parsimonious (for this experiment I used genus-consensus data, being interested in the deep splits).

But when we resolve the intra-generic polymorphisms prior to analysis by treating them as satellite types, i.e., assuming the family-shared nucleotide represents the ancestral state within the corresponding lineage, we quickly get the following result:
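The resolution step can be pictured as a simple rule applied site by site. The Python sketch below replaces an ambiguity code in a genus consensus by the family-shared nucleotide whenever that nucleotide is one of the states covered by the code, and leaves the site unchanged otherwise. This is only a reading of the procedure described above, not the actual preprocessing used for the analysis; the function names and the short example sequences are hypothetical.

```python
# Standard IUPAC ambiguity codes and the nucleotides they stand for.
EXPAND = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_site(code, family_shared):
    """Resolve one site to the family-shared nucleotide if the code allows it."""
    if family_shared in EXPAND[code]:
        return family_shared
    # The family-shared nucleotide is not among the observed states:
    # keep the genus-level state as it is.
    return code


def resolve_sequence(genus_consensus, family_reference):
    """Resolve all sites of a genus consensus against a family-level reference."""
    if len(genus_consensus) != len(family_reference):
        raise ValueError("sequences must be aligned and of equal length")
    return "".join(
        resolve_site(code, shared)
        for code, shared in zip(genus_consensus, family_reference)
    )


if __name__ == "__main__":
    # Hypothetical four-site example, not taken from the actual 5.8S data.
    print(resolve_sequence("AYNG", "ACCG"))
```

In the hypothetical example, both the Y and the N are resolved to the family-shared C, so that the genus is represented by a single, unambiguous type in the network analysis.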

Edges colored to trace the same mutational step. Bubbles indicate the position of the (basic) 5.8S rDNA genotypes for the genera in each family-level lineage.

This is still not too trivial a graph, but it:

provides a framework on which we can develop our evolutionary scenario;
visualizes how mutational patterns may be linked;
tells us directly how derived (genetically) and unique (isolated) the genera are.

Since the 5.8S rDNA is part of a multi-copy (potentially multi-loci, Ribeiro et al. 2011) gene region, uniqueness gives us an idea about how reduced a lineage is. Bottlenecks will eliminate intra-lineage diversity and unique mutational patterns are more likely to accumulate in a species-poor lineage with small population sizes.

But since it is a vital gene region subject to strong sequential and structural constraints, evolution is not neutral: the graph has little tree-likeness. However, the graph looks like the graphs that one expects for fast ancient radiations.

There are more interesting details. For instance, we have no mutation separating consistently the earliest diverging lineages (given the currently accepted root), the Nothofagaceae and the Fagaceae (s.l.) and the remainder of the order (called "higher hamamelids" in classic systematic literature). We also see that the 5.8S rDNA shows the Fagaceae should be monotypic: Fagus is more different from its siblings, the 'Quercaceae', than it is from the first-diverging Nothofagaceae or the common ancestor of the "higher hamamelids". Fagaceae s.str. and 'Quercaceae' are without a doubt sister lineages but this also applies to Betulaceae and Ticodendraceae (differing only by three point mutations), with the Betulaceae being just one point mutation away from its more distant sibling (phylogenetically speaking), the Juglandaceae. Furthermore, for Ticodendron-Betulaceae we can postulate a sequentially unique common ancestor, but we can't do the same for Fagus-'Quercaceae'.

Either the 5.8S rDNA evolved much faster in Fagus than in most other lineages, or Fagus split away from its sisters prior to the radiation of the "higher hamamelids" and shortly after their respective ancestors became isolated. This second scenario fits nicely with recent fossil findings tracing the Fagus lineage back to the late Cretaceous (at least 80 Ma; Grímsson et al. 2016, supplement includes a discussion of all-Fagales dating attempts).

Reconstruction of ancestral genepools

Using the split patterns in the network to extract an evolutionary tree could be hazardous, since we are looking at strongly interconnected mutational patterns filtered by selective pressure (maintaining a functional structure) in a gene region that evolves very slowly: some sites can or did accumulate mutations (the 'Needle' and the trails), others can't and did not (the remainder of the 5.8S rDNA) in the Fagales lineage. At least mutations were not fixed over a long evolutionary time: the data includes at least as many variable sites where within a single genus, species or genome, the shared, family-typical nucleotide (or even shared with Arabidopsis, a quite distant relative of Fagales) is occasionally replaced.

But since we know the phylogeny of the Fagales, we can, based on the Median(-joining) network(s), infer the evolution of the 5.8S rDNA (i.e. the rDNA gene pool) over time:

Results of the Median-joining analysis mapped on the currently accepted Fagales tree. Clade-characteristic mutations are highlighted by corresponding colors; black, homoplastic mutations that occurred independently in two lineages; gray, in more than two.

Regarding the 'asterisk branch', the 5.8S rDNA provides few extra clues, unless we want to re-include a third hypothesis: that the Myricaceae are sister to Juglandaceae + Betulaceae and allies. This would be the most fitting explanation for the 5.8S rDNA diversity. It also would explain why they can be either sister to Betulaceae and allies or Juglandaceae. Ancestors, or slower evolving sisters diverging shortly before a radiation, will do such a thing.

In this context, one should point out that unequivocal fossils representing various modern genera of all families are known from the early Paleogene; many pop up in early Eocene (~ 50 Ma) intramontane basins of northwestern North America. The oldest modern genus and a possible living fossil is the first diverging Juglandaceae: Rhoiptelea. Its pollen can be found from the Maastrichtian onwards in North America and elsewhere, and a fossil showing the unique Rhoiptelea-flower and fitting pollen can be found in the late Turonian-Santonian (~90 Ma) of Bohemia (Heřmanová et al. 2011; the authors, however, decided to name it Budvaricarpus and tone down the striking resemblance to modern-day Rhoiptelea).

Of course, since we use network-based approaches, we can conceptualize the 5.8S rDNA sequence patterns and inferred evolution as a subsequent breaking up and sorting of once-shared gene pools:

A 'coral' tree metaphor for the evolution of the 5.8S rDNA in Fagales (using an alternative, one-node-shifted root).

I chose an alternative root because it is the one that makes most sense regarding the fossil-morphological, palaeoclimatological/-vegetation and highly conserved genetic patterns (thinking of the 18S rDNA). The labels are, of course, a gross simplification — it is likely that the all-ancestor was a tropical-subtropical plant as well (the genetically most unique and potentially earliest isolated genera of the 'Quercaceae' are exclusively tropical-subtropical) and Myricaceae, Betulaceae and Juglandoideae can today be found deep into the temperate zone, some even thriving in boreal and polar climates. But posts can afford to trigger discussion.

The vertical axis reflects not only the derivedness of the 5.8S rDNA, but also the potential sequence of divergences back in time. The horizontal axis represents the taxonomic-geographic breadth over time (very roughly, tapering means higher diversity/greater range in the past than today) and, towards the tips, the genetic within-lineage diversity seen in the ITS1 and ITS2 (in Myricaceae, it would be close to a point, if it were not for one species: Myrica gale, the bog myrtle or sweetgale, beloved in Scotland and Scandinavia – see this Dane's video for how to use it).

Just a curious experiment?

Now, to most readers this post may just be a strange example with little general relevance for phylogenetics. But consider the following.

When we infer deeper phylogenetic relationships, we usually rely on sequence differentiation in coding-gene regions. Like the rRNA genes, the tRNA genes need to fulfill secondary (and tertiary) structural constraints to maintain their vital functions. All other genes code for proteins, which also need to fulfill structural constraints (secondary, tertiary and quaternary structures). Their essential functions rely on keeping a specific amino-acid sequence, which is translated from DNA sequences.
We do this inference under the assumption that molecular evolution is neutral, which, as can be seen in the case of the 5.8S rDNA, is apparently not the case. Mutations that would negatively affect the function of the DNA-transcripts are strongly selected against.

Many of our trees make sense nonetheless, but we should keep a wary eye on all of those branches that draw their support from only one or two gene regions (a common issue of oligo-gene trees like the one by Li et al. 2004), or very few mutations, especially when we are producing an ultrametric tree. How sensible can a divergence age estimate be when the data behind it are four mutations in the monotypic lineage and zero in its more diverse sister clade?

Cited literature and further reading (with comments).

ITS studies (some mixed with further data and results that were ignored by all-Fagales dating studies that included the data)

Acosta MC, Premoli AC. 2010. Evidence of chloroplast capture in South American Nothofagus (subgenus Nothofagus, Nothofagaceae). Molecular Phylogenetics and Evolution 54:235–242. See also Premoli AC, Mathiasen P, Acosta MC, Ramos VA. 2012. Phylogeographically concordant chloroplast DNA divergence in sympatric Nothofagus s.s. How deep can it be? New Phytologist 193:261–275. — Just two brilliant papers that only leave one question open: is this different in the Australasian genera of the Nothofagaceae?
Cannon CH, Manos PS. 2003. Phylogeography of the Southeast Asian stone oaks (Lithocarpus). Journal of Biogeography 30:211–226. — A very well-done paper that still doesn't need to fear comparison with more recent biogeographic papers on Fagales genera with access to more elaborate inference methods, while using much poorer data samples.
Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59:351–366. — Since this is mine, I should not give myself an assessment. Just some info: it was the sloppiest draft we ever submitted, and it passed the review process rather smoothly. But it used 600+ new ITS and 900+ new 5S-IGS sequences, and although it provided a comprehensive ITS tree (new and all data stored in gene banks), the conclusions relied mostly on networks based on inter-clonal and inter-individual distances and ML bootstrap pseudoreplicate samples. I'm pretty sure it's still hard to find a similar paper.
Denk T, Grimm G, Stögerer K, Langer M, Hemleben V. 2002. The evolutionary history of Fagus in western Eurasia: Evidence from genes, morphology and the fossil record. Plant Systematics and Evolution 232:213–236. — My first phylogenetic paper (using only about 100 ITS sequences) and one of my most-cited papers; published only because the editor ignored the opinions of two reviewers.
Denk T, Grimm GW, Hemleben V. 2005. Patterns of molecular and morphological differentiation in Fagus: implications for phylogeny. American Journal of Botany 92:1006–1016. — the follow-up paper, including all beech species.
Forest F, Bruneau A. 2000. Phylogenetic analysis, organization, and molecular evolution of the non-transcribed spacer of 5S ribosomal RNA genes in Corylus (Betulaceae). International Journal of Plant Sciences 161:793–806. — Likely the reason for the 2005 study by Forest et al., a great paper (especially when compared to other phylogenetic papers published in the same journal back then and much later). The 5S-IGS has rarely been studied because it is difficult to handle (usually one needs to clone because of intraindividual length-polymorphism). But it provides an unsurpassed resolution at the intrageneric level, matched only in recent years by the accumulation of NGS SNP data.
Forest F, Savolainen V, Chase MW, Lupia R, Bruneau A, Crane PR. 2005. Teasing apart molecular- versus fossil-based error estimates when dating phylogenetic trees: a case study in the birch family (Betulaceae). Systematic Botany 30:118–133. — A pivotal, still valid study using ITS and 5S-IGS data, even though the divergence age estimates are probably much too old (an aspect demonstrating the quality of the study; back then, molecular age estimates were usually much too young). Forest and Bruneau published several other papers of equal quality on other plant groups, and I suspect there is an interesting publication story given the author list and the dissemination platform.
Grimm GW, Denk T, Hemleben V. 2007. Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5:291–309. — A crazy experiment, but one that, years later, would bring me my first paper in Systematic Biology [PDF] (10-times higher impact factor) because it was the only piece of science providing a way-out for a young researcher in South Africa.
Manos PS. 1997. Systematics of Nothofagus (Nothofagaceae) based on rDNA spacer sequences (ITS): taxonomic congruence with morphology and plastid sequences. American Journal of Botany 84:1137–1155. — A typical study for the time, maybe not ground-breaking but opening an interesting path and still the basis for molecular systematics of Nothofagaceae (getting such data in the late 90s was not easy). Interestingly, no one in Australia or New Zealand ever took up the thread (but see Knapp et al. 2005); the only properly studied genus (then a subgenus) of Nothofagaceae is Nothofagus s.str. (Acosta & Premoli 2010; Premoli et al. 2012).
Manos PS, Doyle JJ, Nixon KC. 1999. Phylogeny, biogeography, and processes of molecular differentiation in Quercus subgenus Quercus (Fagaceae). Molecular Phylogenetics and Evolution 12:333–349. [PDF] — The counterpart to the above for oaks, it took nearly two decades to assemble more data on American oaks than used for this study.
Manos PS, Stone DE. 2001. Evolution, phylogeny, and systematics of the Juglandaceae. Annals of the Missouri Botanical Garden 88:231–269. — An exemplary paper for two reasons (and despite the fact that it just shows cladograms): 1) it combined morphological and chemotaxonomic data with ITS and plastid data (rbcL-atpB and trnL-trnF intergenic spacer); 2) pretty much got the still accepted tree. Also proof-of-point that, even 20 years ago, studies in low-impact journals were not rarely better than those in high-fly ones. (Note the number of pages; decent research needs space!)
Manos PS, Zhou ZK, Cannon CH. 2001. Systematics of Fagaceae: Phylogenetic tests of reproductive trait evolution. International Journal of Plant Sciences 162:1361–1379. — For years to come the basis for Fagaceae systematics.
Muir G, Fleming CC, Schlötterer C. 2001. Three divergent rDNA clusters predate the species divergence in Quercus petraea (Matt.) Liebl. and Quercus robur L. Molecular Biology and Evolution 18:112–119. — Only about two species, but setting the scene: ITS evolution in Fagales (and probably any other wind-pollinated tree) can be very complex at the very basic level.
Ribeiro T, Loureiro J, Santos C, Morais-Cecílio L. 2011. Evolution of rDNA FISH patterns in the Fagaceae. Tree Genetics and Genomes 7:1113–1122. — A must-read for everyone using ITS data in Fagales.

Phylogenetic studies at and above family level

Li R-Q, Chen Z-D, Lu A-M, Soltis DE, Soltis PS, Manos PS. 2004. Phylogenetic relationships in Fagales based on DNA sequences from three genomes. International Journal of Plant Sciences 165:311-324. — This very traditional paper is (still) the basis for Fagales systematics, see: All solved a decade ago: the asterisk branch in the Fagales phylogeny.

Betulaceae: see Forest et al. (2005) and Grimm & Renner (2013, following section).

Yang Z, Wang G, Ma Q, Ma Q, Liang L, Zha T. 2019. The complete chloroplast genomes of three Betulaceae species: implications for molecular phylogeny and historical biogeography. PeerJ 7:e6320. — Provides a first complete-plastome-based tree and a plastome gene map. Otherwise ("implications") an example for failure of the peer review process (There's no need to do what you can't and Peer review transparency reveals scientific provincialism)

Casuarinaceae: see 'Phylogeny' section on Stevens' Angiosperm Phylogeny Website (never bothered myself with them, since they lack ITS data).
Fagaceae: see Manos et al. (2001), tree in Denk & Grimm (2010)

Oh S-H, Manos PS. 2008. Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57:434–451. — The molecular basis for Fagaceae systematics.
Manos PS, Cannon CH, Oh S-H. 2008. Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55:181–190. — The only paper providing a tangible plastid-informed phylogeny.

Juglandaceae:

Manos PS, Soltis PS, Soltis DE, Manchester SR, Oh S-H, Bell CD, Dilcher DL, Stone DS. 2007. Phylogeny of extant and fossil Juglandaceae inferred from the integration of molecular and morphological data sets. Systematic Biology 56:412–430. — I would have used a different set of analyses but the paper (and used data) provides the basis for Juglandaceae phylogenetics and systematics (see Manos & Stone 2001)

Nothofagaceae: Manos (1997), Knapp et al. (2005, following section).

Fagales dating studies (naturally including phylogenies)

Grimm GW, Renner SS. 2013. Harvesting GenBank for a Betulaceae supermatrix, and a new chronogram for the family. Botanical Journal of the Linnean Society 172:465–477. [PDF] — a little experiment we made and submitted to a respectable but low-impact journal because the results were not really ground-shaking. Exemplifies how I think one should harvest gene banks for dating studies (check out the supplement files), hence, providing a striking contrast to the much more ambitious papers by Xiang et al. (2014) and Xing et al. (2014). In that aspect, possibly a must-read for reviewers and editors of large-scale, harvest papers.
Knapp M, Stöckler K, Havell D, Delsuc F, Sebastiani F, Lockhart PJ. 2005. Relaxed molecular clock provides evidence for long-distance dispersal of Nothofagus (Southern Beech). PLoS Biology 3:e14. — A very interesting paper, because it rejects two of the scenarios later tested by Sauquet et al. (2012) and found to produce strange estimates; also, it provides some new sequences of higher quality, none of which was included for the 2012 paper. The author list is quite interesting, too: the last author (GoogleScholar) was the only botanist who challenged tree-thinking from the very start and embraced splits graphs as an alternative to trees. The fourth author wrote a classic paper that everyone working with big data should have read: Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361–375.
Sauquet H, Ho SY, Gandolfo MA, Jordan GJ, Wilf P, Cantrill DJ, Bayly MJ, Bromham L, Brown GK, Carpenter RJ, Lee DM, Murphy DJ, Sniderman JM, Udovicic F. 2012. Testing the impact of calibration on molecular divergence times using a fossil-rich group: the case of Nothofagus (Fagales). Systematic Biology 61:289–313 — in principle, an interesting idea; unfortunately, the instability of dating estimates observed may be mostly due to data artifacts. The authors use unrepresentative, old data (which is puzzling, since the understudied Nothofagaceae grow in Australia, New Zealand and the French New Caledonia, and the authors are from France, Australia and New Zealand) including not a few editing/sequencing artifacts, insufficient sampling and internal signal conflict by combination of low-divergent plastid genes and introns with high-divergent ITS data. The main test compares apples (Nothofagaceae) with pears (the rest of Fagales as sister clade); for details see this draft [PDF], which I put together for applications (the data documentation of Sauquet et al. is exemplary, hence, it was very easy to look into the data basis).
Xiang X-G, Wang W, Li R-Q, Lin L, Liu Y, Zhou Z-K, Li Z-Y, Chen Z-D. 2014. Large-scale phylogenetic analyses reveal fagalean diversification promoted by the interplay of diaspores and environments in the Paleogene. Perspectives in Plant Ecology, Evolution and Systematics 16:101–110 — an ambitious experiment, with even more data-related problems than the study of Sauquet et al. While Sauquet et al. used placeholder sequences for each included genus (and dropped some because their data inflicted too much topological ambiguity), Xiang et al. blindly harvested all data of commonly sequenced plastid "barcodes" (rbcL, matK, trnL/LF region, rbcL-atpB spacer) to infer a species-level tree. Outdated, invalid taxa were not corrected for; the used gene sample can show little to no variation below the genus level (which makes dating, and barcoding, impossible). Furthermore, plastid diversification is partly or fully decoupled from speciation processes in the four genera that have been studied using more than a single individual per species (Nothofagus s.str., Fagus, Quercus, Ostryopsis).
Xing Y, Onstein RE, Carter RJ, Stadler T, Linder HP. 2014. Fossils and large molecular phylogeny show that the evolution of species richness, generic diversity, and turnover rates are disconnected. Evolution 68:2821–2832 — very similar to the Xiang et al. approach but even more flawed (poor control over used data, poor selection of markers, several problems with the dating approach, which is the basis for estimating the crucial turnover rates). Xiang et al. and Xing et al. show what happens when large-scale meta-analyses are conducted by researchers with no idea about the studied organisms.
Zhang J-B, Li R-Q, Xiang X-G, Manchester SR, Lin L, Wang W, Wen J, Chen Z-D. 2013. Integrated fossil and molecular data reveal the biogeographic diversification of the eastern Asian-eastern North American disjunct hickory genus (Carya Nutt.). PLoS ONE 8:e70449. — Focuses on one genus but includes data from all Juglandaceae and gives a typical example for plant biogeographic studies using dated trees (the fourth author is the expert on the fossil record of Juglandaceae, so there are few data issues). It's open access and quite short: give it a read and then try to figure out what the point of the paper is (I looked at the provided data matrix, too, and found quite interesting genetic patterns that completely escaped the authors; it is never wrong to look over your alignment when this is still possible).

Other cited literature

Grímsson F, Grimm GW, Zetter R, Denk T. 2016. Cretaceous and Paleogene Fagaceae from North America and Greenland: evidence for a Late Cretaceous split between Fagus and the remaining Fagaceae. Acta Palaeobotanica 56:247–305.
Heenan PB, Smissen RD. 2013. Revised circumscription of Nothofagus and recognition of the segregate genera Fuscospora, Lophozonia, and Trisyngyne (Nothofagaceae). Phytotaxa 146:1–31.
Heřmanová Z, Kvaček J, Friis EM. 2011. Budvaricarpus serialis Knobloch & Mai, an unusual new member of the Normapolles complex from the Late Cretaceous of the Czech Republic. International Journal of Plant Sciences 172:285–293.
Manchester SR. 1987. The fossil history of the Juglandaceae. St. Louis: Missouri Botanical Garden. [book-like paper]

Monday, February 11, 2019

A network analysis of basic leisure-time activities

Social scientists like to compile information about what human beings do with their time, day and night. Some of that time is called "work time", where we often have little control, and the rest is "leisure time", during which we have at least some control over the time we spend on each activity. This blog post looks at how much time people in different countries allocate to some of their different leisure-time activities.

The data are taken from the American Association of Wine Economists' Facebook page: Leisure Time Spent in OECD Countries. The five leisure-time activities included in the dataset are:

Eating & drinking
TV & radio
Sports
Shopping
Sleeping

The hours for these five activities turn out to account for about half of the 24-hour day (46-56%, depending on the country). The data cover 24 of the 36 OECD countries*, plus 3 others (China, India and South Africa). The interest here is to explore the similarities between the people of different countries, in terms of how they allocate their leisure time (on average).

Since these are multivariate data, one of the simplest ways to get an overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, I first normalized the data within each of the five activities, and then calculated the similarity of the countries using the Manhattan distance. A Neighbor-net analysis was then used to display the between-country similarities.
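The two preparation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which normalization was used, so the sketch assumes a simple range (min-max) scaling within each activity, and the three "countries" and their minutes are made-up placeholders, not the AAWE data. The Neighbor-net step itself is not reproduced here; the resulting distances would be handed to a splits-graph program for that.

```python
from itertools import combinations

# Hypothetical minutes per day (NOT the AAWE figures), in the order:
# eating & drinking, TV & radio, sports, shopping, sleeping.
minutes = {
    "Country A": [130, 120, 20, 30, 510],
    "Country B": [60, 150, 15, 25, 530],
    "Country C": [95, 100, 40, 20, 500],
}


def normalize_columns(table):
    """Rescale each activity (column) to the range 0..1 across countries.

    Assumption: range scaling; the original analysis may have used another method.
    """
    names = list(table)
    columns = list(zip(*(table[name] for name in names)))
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        # A constant column carries no information, so it is set to zero.
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    rows = zip(*scaled_columns)
    return {name: list(row) for name, row in zip(names, rows)}


def manhattan(a, b):
    """Manhattan distance: sum of absolute differences over all activities."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(table):
    """Pairwise Manhattan distances between countries after normalization."""
    scaled = normalize_columns(table)
    return {(p, q): manhattan(scaled[p], scaled[q]) for p, q in combinations(scaled, 2)}


distances = distance_matrix(minutes)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
```

Normalizing first matters because Sleeping is measured in hundreds of minutes and Sports in tens; without it, the Manhattan distance would be dominated by the activity with the largest raw numbers.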

The resulting network is shown in the first figure. Countries that are closely connected in the network are similar to each other based on the relative times allocated to the leisure-time activities, and those countries that are further apart are progressively more different from each other.

Clearly, there is considerable diversity between the countries. Moreover, there is very little in the way of consistent patterns in the network — it is basically a single "starburst" pattern. So, we may first conclude that the people of the different countries basically all go their own way, when it comes to allocating their leisure time.

Some of the network associations may result from historical or cultural similarities, such as the closeness of Japan and South Korea in the network. However, this clearly does not apply in other cases — for example, Spain and Portugal are not near each other, and neither are Australia and New Zealand, nor are Denmark, Norway and Sweden. Cultural generalizations seem therefore not to be supported by the data.

India and South Africa both stand out from the rest of the network, indicating that their people behave differently to all of the other countries (on average). Notably, both countries have very short times allocated to Sports and to Shopping. India also has rather short TV/radio time and a long Sleeping time, while South Africa has the longest Sleeping time of all of the countries (45 min longer than the country average!).

The USA has relatively short Eating/drinking time, a long Sleeping time, and the longest TV/radio time of all. That is, Americans spend less time on eating & drinking than most other people, and use the time gained for watching TV and sleeping, instead.

Of the other countries, France has the longest time spent on Eating/drinking, followed by Denmark and Italy, and then Japan and South Korea. Canada and the United Kingdom, on the other hand, actually have the shortest Eating/drinking times of all of the countries. Spain has a relatively short Eating/drinking time and the longest time of all allocated to Sports (nearly double the country average!). This may be a healthier way to behave than the American one.

A related topic that we could look at is gender differences in time allocation, and how this may differ between countries. The data for this are taken from another American Association of Wine Economists' Facebook page: Time per Day Spent Eating and Drinking, by Country and Gender.

So, the country data are for the averages for Eating/drinking only, with separate observations for males and females. These two averages are plotted against each other in the second figure, where each point represents a single country. I have labeled the three top countries and the five bottom countries.

Obviously, there is a close correlation between the males and females within any one country, so that most of the time variation is between countries (93%). If couples and families usually eat together, then this result is to be expected. It is the children who are likely to have more independent eating habits!

However, there are 14 countries where the average male time somewhat exceeds that for females, and only 7 where the female average time exceeds that for the males, with the remaining 6 being approximately equal (as represented by the pink line). Interestingly, the 2 biggest deviations from equality are where females spend more time on Eating/drinking than do the males (Japan and the Netherlands). You may make of this what you will.

* The 12 missing OECD countries are:
Chile, Czechia, Greece, Hungary, Iceland, Israel, Latvia, Lithuania, Luxembourg, Slovakia, Switzerland and Turkey.

Monday, February 4, 2019

Should we bother about character independence?

The comments of David Marjanović on one of my last posts (Please stop using cladograms!) kept me musing about an old question of mine: Why should we be concerned about whether characters in a matrix are independent or not?

When I started to get into phylogenetics (I taught myself by reading and just doing it and never had a course in phylogenetics at university), I learned that the most important thing for a phylogenetic matrix is:

All characters are independent of each other.

In other words: the mutation (change) in one character doesn't affect the mutation (change) in any other character.

I could never wrap my head around this. After all, the characters are all part of the same organism and must therefore function together, so how can they possibly be biologically independent? Even the fact that everything is part of the same universe means that everything is functionally dependent to one extent or another — when a butterfly sneezes the polar bears tremble, as they poetically say.

However, what is meant is that characters must be independent enough for practical mathematical purposes. This is a fundamental assumption of most mathematical analyses, in order to make them tractable. Trying to account for the dependencies is far too difficult, mathematically.

However, it is still worthwhile thinking about whether these "practical purposes" are likely to be realistic for phylogenetics. Consider this:

1. Traditional phylogenetics mostly uses morphological traits, some of which must have been evolutionarily beneficial and evolved as a consequence of the same reason (adaptive process).
2. Working at the tips of the tree of life, our data were from the nuclear-encoded 35S rDNA, the cistron encoding for 18S rRNA (small subunit), 5.8S rRNA, and 25S rRNA (large subunit, erroneously called 26S in some of the phylogenetic literature), which is known for compensatory mutations (e.g. strands of the 5.8S rRNA have to fit to the 5' end of the 25S rRNA; here's a link for those interested in RNA structure).

To investigate point 1, let's look at a dolphin (image source) and a bat (image source).

Without sequencing their entire genomes and establishing the function of each gene (and kicking out one or another gene during development), we cannot assess how independent (genetically) the traits are that make a dolphin a near-perfect swimmer, and a bat the only actively flying mammal. But obviously, a lot of their traits are adapted to this single function of movement. The practical consequence is that instead of a plethora of distinguishing characters, we only can score two fully independent ones: "can swim" versus "can fly".
(And then eliminate these two, because another rule in phylogenetics is that we should only include characters that are not under positive selection. The commonly implemented models all assume that evolution is neutral. This is why Charles Darwin has two parts to The Origin, one discussing historical dependence of characters and one discussing natural selection.)

As for point 2, everyone who has worked with ITS, the internal transcribed spacers of the 35S rDNA, can easily see that some mutational patterns always come in pairs or some other series. We can correct for linked mutations during inference by using the assumed secondary structures as a functional corrective. This is rarely done, because even without this correction you still get trees (or networks) that make sense.

Linked mutations and evolutionary trends within the LP3 of the 5' ITS2 in species of Acer section Acer (see my Ph.D. thesis, open access; figure from Grimm et al., Plant Syst. Evol., 2007). This (non-coding) length-polymorphic region (found in all angiosperms in various modifications) comprises an upstream CT- and partly linked (complementary) downstream GA-motif.

A very simple example

Let's take a group of very simple, made-up organisms differing in two trait complexes (note that it may be a collection of genes that trigger the difference): form and colour.

In total, "evolution" came up with 15 different combinations ("species"), five of which are extinct, two of which are primitive in the sense that they still occur today, but have also been found as fossils.

We all know that morphologies have a high level of homoplasy. Homoplastic traits mean that groups will not accurately reflect the true tree. Having as many forms (9) as colors (9), we have no clue as to which trait is more conservative, and hence could better reflect the true tree.

The 15 species form nine potentially monophyletic genera.

The alternative nine potentially monophyletic genera.

The promise of phylogenetics is that we can infer the true tree based on the scored characters. We could follow the strict independence rule, and score them as two multi-state characters, leading us to the following "tree" — this has been parsimony-optimized and unweighted, as in most studies using morphological data, with the sample of MPTs summarized using a strict consensus cladogram.

The strict consensus tree of 355602 equally parsimonious trees with 17 steps, a CI of 0.94 and RI of 0.88: a pitchfork (an extreme case, but pitchfork-like subtrees are very common in palaeontological phylogenetic literature).
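The consistency index in this caption can be checked by hand. As a minimal sketch, assuming the usual ensemble definition (minimum conceivable number of steps divided by the observed number of steps) and assuming that a character with k observed states needs at least k - 1 changes on any tree, the two nine-state characters give:

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble consistency index: minimum conceivable steps / observed steps."""
    if observed_steps < min_steps:
        raise ValueError("observed steps cannot be fewer than the minimum")
    return min_steps / observed_steps


# Assumption: a character with k observed states needs at least k - 1 changes.
state_counts = [9, 9]  # nine forms and nine colours, as in the example
min_steps = sum(k - 1 for k in state_counts)  # 8 + 8 = 16

ci = consistency_index(min_steps, observed_steps=17)
print(round(ci, 2))
```

So 16 of the 17 steps are unavoidable, which rounds to the reported CI of 0.94: the two multi-state characters are almost free of homoplasy, and yet they resolve nothing.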

Alternatively, we could score the features as a series of binary characters such as:

Is the center depressed?
Is it horizontally or vertically elongated?
Is it round or pointed?
Do we have few (<= 6) or many tips (>= 8; "?" for all round species)?
Is it reddish? Or greenish? Or bluish? (Example: purple doughnut would be 1 - 0 - 0, the turquoise five-star 0 - 1 - 1)
Has it a dark or light shade (relatively speaking: green taken as darker than turquoise)?
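To make the binary scoring concrete, here is a small Python sketch. Only the three colour codes (1 - 0 - 0 and 0 - 1 - 1) and the "?" for round species come from the list above; the choice of the two extra characters, the 0/1 polarity, and the helper name `mean_character_distance` are assumptions made for illustration, not the matrix actually used for the analyses.

```python
MISSING = "?"

# Character order (the 0/1 polarity is an assumption made for this sketch):
# pointed?, many tips?, reddish?, greenish?, bluish?
matrix = {
    "purple doughnut": ["0", "?", "1", "0", "0"],
    "turquoise five-star": ["1", "0", "0", "1", "1"],
}


def mean_character_distance(a, b):
    """Share of differing characters among those scored in both taxa."""
    comparable = [(x, y) for x, y in zip(a, b) if MISSING not in (x, y)]
    if not comparable:
        return None  # nothing to compare
    return sum(x != y for x, y in comparable) / len(comparable)


d = mean_character_distance(matrix["purple doughnut"], matrix["turquoise five-star"])
print(d)
```

The tip-number character is skipped for this pair because it is inapplicable to the round species, so the distance rests on the four characters that can be compared in both taxa.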

These characters are not particularly independent. Certain evolutionary steps make it impossible to go back or evolve something in parallel / convergently. For example, the Roundish group never evolved pointed tips, and the Pointish organisms can vary their outline, but not smooth it. The characters are also not overly compatible (e.g. shading splits each basic coloring into two subsets), so we wouldn't expect a very resolved tree or one that matches the true tree exactly:

Adams consensus tree of 80 MPTs with 19 steps, CI = 0.57, RI = 0.79, naming follows the principles of cladistic classification (only subtrees in a rooted tree may be named; not to be confused with phylogenetic classification fide Hennig)

However, it doesn't look like a very bad evolutionary hypothesis. In fact, the inferred clades only miss one monophyletic group (I can tell, because I invented this group to illustrate that 'cladistics' is a subset of 'phylogenetics'): Fivestar reflects the morph of the common ancestor of all stars, resolved as part of a monophyletic grade "basal" to the (reciprocally monophyletic) polygons:

Evolution as it happened. Note, each dichotomy is accompanied by one or two exclusive subsequent mutations (synapomorphies at the time). Unknown ancestors (not found in the fossil records) are dimmed. Green: valid names following Hennig's phylogenetic classification; orange: only valid for the most recent time frame (Purpleoval is indistinguishable from the ancestor of all non-olive Roundish, Fivestar from the ancestor of all stars, and the ancestor of all polygons was a blue pentagon).

Of course, I would always show all of the topological alternatives in the optimized tree sample. Here is the strict consensus network of all of the MPTs:

Strict consensus network of all 80 MPTs, the network analogue to the commonly seen strict consensus cladograms.

In contrast to the consensus trees, we see the equally optimal alternatives, and can even make a call as to which trait to give a higher weight (evolution-wise). For instance, although only 12 MPTs have a Pentagon clade, 40 have an Octagon clade, which would fit with the hypothesis of reciprocal monophyly. The shading-based alternative seen in other MPTs (light vs. dark polygons) can be argued to be less likely, noting how scattered this feature is across the entire graph (this is what TNT's iterative weighting does, except that it starts from one of the alternative trees).
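Counting how many trees in the sample contain a given clade is the bookkeeping behind statements like "12 MPTs have a Pentagon clade". A minimal sketch in Python, assuming each rooted tree has already been reduced to its set of clades (each clade a set of tip names); the four toy trees over the taxa A to D are hypothetical and have nothing to do with the 80 MPTs of the example:

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees of the sample each clade occurs."""
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))  # each clade counted once per tree
    return counts


# Hypothetical sample: every tree is given as the set of its non-trivial clades.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("CD"), frozenset("ACD")},
    {frozenset("AB"), frozenset("CD")},
]

counts = clade_counts(trees)
for clade, n in counts.most_common():
    print(sorted(clade), f"{n}/{len(trees)}")
```

A strict consensus tree keeps only the clades found in every tree of the sample, whereas a consensus network can also display the conflicting alternatives, with these counts indicating how often each one occurs.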

And here's the distance network, probably (like with real-world data) the least-biased depiction of the differentiation pattern:

All labelled taxa are monophyletic (as defined by the true tree). Note how some neighborhoods reflect monophyly while others would result in paraphyletic groups.

Take-home message

Now, you could rightfully point out that this is totally hypothetical and, having generated the group, I made sure that the analysis works out — actually, I didn't, and I was quite surprised at how well the binary matrix, which just scores everything that differs between the species, resolves aspects of the true tree. However, just compare the above graphs with trees published in (paleo)phylogenetic studies, and the real-world data we dealt with here on the Genealogical World of Phylogenetic Networks.

You might also point out that this is just like using stepmatrices — forcing a topology by suitably coding complex characters. Likewise, this thought must be discouraged (but see Joe Felsenstein's 2004 book, Inferring Phylogenies). I would respond that scoring complex traits filtered by evolution as a single multi-state character severely underestimates the information content. An example from my own research: in the King Ferns (Osmundaceae), the subsequent modification of the sclerenchyma ring along the leaf traces is fully compatible with the molecular tree, so why should I be forced to reduce the surely interdependent (and traceable in the fossil record) aspects of this evolutionary filtered trait complex to a single, multi-state (and unweighted) character?

Coding of a single complex trait (Bomfleur et al., PeerJ, 2017, fig. 7), the structure of the sclerenchyma ring in Osmundaceae leaf traces, as five binary characters that reflect the ontogenetic sequence seen in Osmundaceae rhizomes (arrows), a case where ontogeny mirrors phylogeny (Bomfleur et al., BMC Evol. Biol., 2015; cf. Additional file 1, fig. S1-1).

If we have character complexes that we can score, then we should not bother ourselves with drawing a (often very subjective) line between biologically dependent and independent characters. We should just score as much as we can see, and then explore the signal in the resulting matrix (see our many blog posts on the latter topic).

Exploratory data analysis benefits from few-state characters. This is because characters with many states (nine in the above example, which is something also found in the actual literature) that do not inform any taxon bipartitions lead only to quite useless pitchfork trees.

Scoring what we see in as much detail as possible may, of course, get some things wrong. We may face one or another paraphyletic (or even polyphyletic) clade and monophyletic grade, so inferring trees/networks and establishing branch support with more than a single optimality criterion is advisable, as is character mapping. At least it gets us a data-based hypothesis to discuss and to investigate further; or several hypotheses, when using consensus networks or distance-based splits graphs instead of consensus trees.