Monday, March 18, 2019

Which US cities are best for walking, biking and public transport?

In the modern world, there is a lot of discussion about the environmental damage caused by cars and trucks, not least due to their involvement in global climate change. The pro-active parts of this discussion revolve around banning cars, so that parts of cities and towns can return to pedestrian areas (eg. Life in the Spanish city that banned cars; The automotive liberation of Paris), and encouraging alternative modes of transport, particularly bicycles (eg. Copenhagenize your city: the case for urban cycling; Britain wants cycle-friendly cities).

In particular, some cities throughout the world are taking active steps to improve the "walkability" of their centers, including Addis Ababa, Auckland, Denver, Hanoi, London, Manchester and San Francisco (What would a truly walkable city look like?), and the "cyclability" of their inner suburbs, including Calgary, Copenhagen, Eindhoven, Lidzbark, Purmerend, San Sebastian, Utrecht and Vancouver (Top 10 pieces of cycling infrastructure: which country does it right?). On the other hand, there are some cities that have not yet tried to do much about cycling, including Beijing, Cairo, Delhi, Hong Kong, Moscow, Mumbai, Nairobi, Orlando, São Paulo and Sydney (Top 10 worst cities for cycling).

The USA is not usually considered to be at the forefront of this movement, having long ago wedded itself to the cult of the private motor car. However, this does not mean that US cities are all the same in terms of non-car transportation. For example, the Walk Score site, which is part of the Redfin real estate organization, provides a ranking of all US cities and neighborhoods with a population of 200,000 or more, in terms of how friendly they are for walking, biking and transit.

The ranks are based on a score out of 100 for each location, using various methodologies:
  • Walk Score analyzes hundreds of walking routes to nearby amenities; points are awarded based on the distance to amenities in each category.
  • Bike Score is calculated by measuring bike infrastructure (lanes, trails, etc), hills, destinations and road connectivity, and the number of bike commuters.
  • Transit Score assigns a "usefulness" value to nearby transit routes based on their frequency, type of route (rail, bus, etc), and distance to the nearest stop on the route.
Our interest here is in combining these three pieces of information into a single picture, showing which cities are generally good, at the moment.

Not unexpectedly, the Walk Score and Transit Score are highly correlated (86% shared rankings), while the Bike Score is not as highly correlated with either of these (49% and 42%, respectively). This means that the same cities tend to be good for the first two criteria. The three best cities for the Walk Score are New York, Jersey City and San Francisco, while the top two for the Transit Score are New York and San Francisco. On the other hand, for the Bike Score the top two are Minneapolis and Portland — it would be difficult to imagine either New York or San Francisco as being good for biking!

If we define a "good" score as being >70, then only San Francisco has a score for all three criteria >70, although Boston comes close. On the other hand, Pittsburgh and Washington D.C. have the most consistent scores across the board, because they have uniformly middle-rank scores.

Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, we calculated the similarity of the cities using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-city similarities.
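As a minimal sketch of the distance step only (the Neighbor-net itself needs dedicated software), the Manhattan distance between two cities is the sum of the absolute differences of their three scores. The city names and score values below are hypothetical placeholders, not Walk Score data, and the function names are chosen just for this illustration.

```python
from itertools import combinations

# Hypothetical placeholder scores (walk, bike, transit); NOT real Walk Score data.
scores = {
    "City A": (88, 70, 84),
    "City B": (86, 72, 80),
    "City C": (40, 45, 30),
}


def manhattan(u, v):
    """Sum of absolute differences across the three scores."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(table):
    """Return {(city1, city2): distance} for every unordered pair of cities."""
    return {
        (p, q): manhattan(table[p], table[q])
        for p, q in combinations(sorted(table), 2)
    }


distances = distance_matrix(scores)
for pair, d in distances.items():
    print(pair, d)
```

Cities with similar scores on all three criteria end up with a small distance, and these pairwise distances are what a network method would then display.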

The resulting network of the 98 cities with complete data is shown in the figure. Cities that are closely connected in the network are similar to each other based on how good they are for walking, biking and transit, and those cities that are further apart are progressively more different from each other. The color-coding for the cities is from Megaregions of the United States.

The network generally shows decreasing walking / transit scores from top to bottom, and decreasing biking scores from right to left. We have labeled only the top group of 29 cities, which are distinctly "better" than the remaining 69, plus four unusual cities (at the middle-left).

Note that, as expected, New York, San Francisco and Boston stand out at the top of the network. Note, also, that Minneapolis and Portland are separated in the network from the other cities, because of their high Bike Scores — all of the other cities in the top group have much lower biking scores. Newark, in particular, has a low biking score. New Orleans is at the bottom-left of this group because it has a low Transit Score but not Walk Score.

For the four unusual cities, separated at the left of the bottom group: Dallas has a low Transit Score, and Atlanta, Cincinnati and San Diego all have a low Bike Score.

The city at the very bottom-left of the network, which has the lowest score on all three criteria, is Arlington TX. Along the same lines, there is an online graph of The 10 most dangerous states for cyclists, showing Florida way out in front.

Finally, you should be warned about potential problems with rankings like these, based on only a few selected criteria. For example, the real estate site StreetEasy recently tried to compile a list of the 10 Healthiest Neighborhoods in New York City, and ended up listing the Brooklyn industrial area of Red Hook as number 1, which engendered a couple of negative comments, such as:
I guess the fact that the majority of Red Hook’s parkland has been closed for many years due to lead contamination, or the fact that we have one of the highest asthma rates in the city, was overlooked for this study.
Caveat emptor!

Monday, March 11, 2019

Tattoo Monday XVII

Here are seven more tattoos in our compilation of evolutionary tree tattoos from around the internet. For more examples of the circular design for a phylogenetic tree, in a variety of body locations, see Tattoo Monday V, Tattoo Monday VII, Tattoo Monday X and Tattoo Monday XI.

At the bottom of this post is an unusual linearized version of this same type of tree.

Monday, March 4, 2019

Has homoiology been neglected in phylogenetics?

In a recently published pre-print on PaleorXiv, Roland Sookias makes a point for distinguishing between parallelism, ie. shared inherited traits that can be found in some but not all of the offspring of a common ancestor, and convergences in a strict sense, involving similar traits that are not homologous. The former is also known as homoiology, a term Sookias attributes to Ludwig Plate.

As a geneticist working mostly at the tips of the Tree of Plant Life, I'm quite familiar with the (pre-Hennigian) concept: we much more often than not lack Hennig's 'synapomorphies', ie. shared, derived traits exclusive to an evolutionary lineage. But we have many highly diagnostic character suites including 'shared apomorphies' (I think that the angiosperm phylogeneticist Jim Doyle coined the term) that collect the same species or higher taxa, eg. groups of taxa that also form highly supported clades in molecular trees, but are not exclusive. In every plant group you can additionally observe that certain traits are exclusive to some members of one lineage, because the lineage has the genetic-physiological prerequisites to express these traits, while their sister lineages or distant relatives lack this potential. Epigenetics deals with tendencies to express a trait in response to the environment without even changing the genetic code.

If you look close enough, you can find such patterns even at the molecular level.

Molecular evolution of the 5' half of the ITS1 in beeches. Each sequence motif is assigned a state (Ax, Bx etc; x = 0 represents the ancestral state, x > 0 are derived states) and evolution involves usually the gain ("+") or loss ("-") of sequence motifs including some potential genetic homoiologies (see here for context and references).

However, it has apparently been ignored by my fellow paleontologists: Sookias wants to discuss "the neglected concept of homoiology ... in the context of palaeontological phylogenetic methods". Paleontological phylogenetic methods are, of course, tree inferences, and the idea is that recognition of homoiologies can be a means of establishing node support or to "help to choose between equally parsimonious or likely trees". He provides an R function "to calculate two measures for a given tree and matrix: (a) the potential support for clades based on potential homoiologies; and (b) the fit of the tree to all states given the concept of homoiology".

Sookias provides a nice and concise introduction to the problem with some examples, and makes the connection to linguistics (see also Mattis' and my post on the Chinese dialects continuum: How languages lose body parts); so, give the short paper a read. Like all paleontological literature it is strongly influenced by cladistic views, such as that life is monophyletic, and it revolves around the central theme of how to get better supported trees.

My inner geneticist has a principal problem with such a goal, because there has (to my knowledge) not been a single morphology-based tree that was fully congruent to a molecular tree with sufficient taxon and gene sampling, which applies also to the real-world data example that Sookias chose (as we will see).

My inner paleontologist also knows that there are highly diagnostic morphs in the fossil record, but diagnostic character suites and morphs reflect as many paraphyla as monophyla. He also knows that the fossil record, provided you find the right fossil from the right time, may alter your perspective on ancestral and derived character states.

An inferred tree (see this post). Given the inferred tree (quasi-dated tree), we would assume that star shapes are primitive (a symplesiomorphy) within the Pointish lineage, and possibly 10-tipped stars; and conclude that the Tenstars are paraphyletic. Greenish is clearly ancestral (a Pointish symplesiomorphy), and bluish derived (a Polygonia synapomorphy).
If we have the full picture, we can confirm star shapes are symplesiomorphic within the Pointish (the first common ancestor being a five-pointed colorless star). However, all greenish stars form a monophylum not a paraphylum.
Having ten tips is a synapomorphy of the monophyletic Tenstars.

So, why should we aim to get more resolved, better supported, morphology-based trees? Any such tree will inevitably include wrong branches!

I argue that, instead, we should just explore the signal in our data matrices using networks. Any potential tree is included in a network. But networks are more comprehensive because they provide not only a single tree but alternative, competing trees. By visualizing the alternatives, we can discern between mere convergence (random similarity), homoiology (parallelism, convergence related to descent), symplesiomorphy (shared, lineage-consistent primitive traits) and synapomorphy (lineage-unique and consistent shared derived traits), which can be very tricky with just a tree. Thus, we can try to evaluate which evolutionary scenario best explains all our data.


The basic problem when using morphological and such-like data sets to infer phylogenies is that most of the scored characters are, to some degree, incompatible with the true tree, ie. the actual evolutionary pathways.

Let's take a hypothetical evolution (no reticulations), in which the x-axis represents the morphological diversification and the y-axis time.

As in real-world data, sister taxa (eg. Species A and B) may have different levels of morphological derivation compared to their common ancestor(s). This leads us to this unrooted true tree in which the branch lengths are proportional to the real (above) amount of change.

Unrooted representation of the above evolution.
All commonly used tree inferences infer unrooted trees.

The only characters providing a taxon bipartition that is fully compatible with the true tree are Hennig's 'synapomorphies':

Clade A–D shares a unique, derived trait.
The character split is fully compatible with the true tree.
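The notion of a character split being "compatible" with a tree can be made concrete with the standard pairwise test: two bipartitions of the same taxon set are compatible if at least one of the four intersections of their sides is empty. The sketch below is only an illustration; the topology ((((A,B),(C,D)),(E,F)),(G,H)) plus a single outgroup is an assumption made for this example (the figures of the post are not reproduced here), and the function names are hypothetical.

```python
def is_compatible(split_a, split_b, taxa):
    """Two bipartitions are compatible if one of the four side intersections is empty."""
    a1, b1 = set(split_a), set(split_b)
    a2, b2 = set(taxa) - a1, set(taxa) - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def conflicting_splits(character_split, tree_splits, taxa):
    """List the tree splits that the character split contradicts."""
    return [s for s in tree_splits if not is_compatible(character_split, s, taxa)]


# Assumed toy topology: ((((A,B),(C,D)),(E,F)),(G,H)) with one outgroup taxon "Out".
taxa = set("ABCDEFGH") | {"Out"}
tree_splits = [set("AB"), set("CD"), set("ABCD"), set("EF"), set("ABCDEF"), set("GH")]

synapomorphy = set("ABCD")  # state shared exclusively by clade A-D
homoiology = set("ABF")     # state shared by A, B and F only

no_conflict = conflicting_splits(synapomorphy, tree_splits, taxa)
partial_conflict = conflicting_splits(homoiology, tree_splits, taxa)
print(no_conflict)       # []
print(partial_conflict)  # the two splits separating A-D and E+F from the rest
```

Under these assumptions, the synapomorphy conflicts with none of the six internal splits, whereas a state found only in A, B and F conflicts with two of them and agrees with the other four.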

Next come Hennig's 'symplesiomorphies' (Sookias' R-script discards them):

Blue is the ancestral state within the ingroup, lost/modified in Species A.
The character split is compatible with the true tree except for A.
In phylogenetic inference, symplesiomorphies will usually stabilize the topology
as there will be enough other characters supporting A as sister of B and Clade A–D(–F).

Homoiologies / parallelisms can be partly compatible:

Blue is a homoiology found in 50% of the species composing Clade A–F.
The character split supports the sister relationship of A and B (compatible aspect)
but joins them with F (incompatible aspect).
A, B and F belong to the same monophylum/clade (semi-compatible aspect).
As long as homoiologies are confined to otherwise
coherent (or flat) subtrees, they will contribute to the overall decision capacity of the data.

Note that without a molecular backbone tree, it may be impossible to distinguish homoiologies from symplesiomorphies – whether a trait will be resolved as either the one or the other in a tree depends solely on its frequency and distribution across the subtree, and the situation in outgroups.

Purple is the plesiomorphy of the ingroup, blue the homoiology
found in members of Clade A–F, evolved twice
Considering the phylogenetic root-tip distances in the true tree, it makes sense that blue is the plesiomorphy of the ingroup retained in the shorter branching members, and purple a homoiology found in the most derived sublineages (again, evolved twice).
Both scenarios require three steps, but probabilistic character mapping methods would prefer the second scenario as they assume the longer the internal branches, the higher the likelihood for a change. To dismiss symplesiomorphies, Sookias' script infers the ancestral state of the MRCA of a clade and only considers states as homoiologies that differ from the inferred ancestral state (the cut-off value can be modified to "less stringently exclude potential symplesiomorphies as homoiologies").
Doyle's 'shared apomorphies' are locally compatible:

Blue is a shared apomorphy of the GH lineage, convergently evolved in the
outgroup (see original tree above: the GH lineage is a strongly derived
ingroup lineage evolving into the direction of the outgroup
in contrast to the remainder of the ingroup).
The example above also illustrates how shared apomorphies may trigger branching artifacts such as ingroup-outgroup long-branch attraction. Imagine that GH is not the first diverging branch of the ingroup but instead a strongly derived sublineage nested within Clade A–F, and that we lack the short-branching sister-group but have a large outgroup sample. Any ingroup-outgroup shared apomorphies will then draw GH towards the outgroup-defined ingroup's root and be detrimental for inferring the true tree.

Convergence in a strict sense, ie. superficial or random similarity, is incompatible with the true tree:

Blue is a randomly distributed derived state found in all longer-branched taxa.

A tree-incompatible signal is, naturally, best handled using a network and not by forcing it into a single tree. Unless, of course, we have a sensible molecular tree and can go for total evidence approaches assuming the molecular tree reflects the true tree.

PS: Also, in molecular data the true tree incompatible characters may outnumber the compatible ones, but there we have many more characters and (usually but not always) a lot that are not filtered by negative or positive selection. Our stochastic molecular models are for sure never accurate enough to model molecular evolution for our sequences, but apparently precise enough for most applications. Even before next generation sequencing and big data, molecular phylogenies outshone morphological phylogenies (something that paleontologists cannot afford to ignore any more), not because the data are much better (to infer evolution) but because the patterns and processes are much less complex.

Sookias' data example, crocodiles and relatives

The supplement of Sookias' paper includes a morphological character matrix for crocodilians and the resulting molecular tree for the group. Here's Sookias' fig. 3, using these data to make his point for how to select the better-fitting tree using homoiology recognition:

Now, the unsolved problem is: if we don't have a molecular tree, how can we possibly know 0 is a homoiology and not a symplesiomorphy, 1 not a reversal (scenario B) or likely convergence (scenario C), hence, B should be preferred over C (the legend has a little typo, cf. Sookias 2019, p. 3, l. 34)?

The matrix provided as the example is not the best one to make this point. Sookias' script, when stringently eliminating potential symplesiomorphies, identifies, using the molecular tree as input, one potential homoiology for the Crocodylinae, five for their larger clade (including Gavialis and Tomistoma), and one for the alligators' larger clade in a matrix with 117 characters. Less than 10% can hardly be a game-changer.

What the morpho-data shows

Furthermore, the morphological matrix will give us a single most-parsimonious tree (MPT, using PAUP*'s Branch-and-Bound algorithm), not two or more equally parsimonious alternatives that we need to weigh against each other.

The single most-parsimonious tree that can be inferred from the morpho-matrix (236 steps, CI = 0.64, RI = 0.84). Red branches are conflicting with the topology of the molecular (truer?) tree (green brackets).

Some of the red branches are supported by pseudo-synapomorphies, which, against the background of the molecular tree, are potential homoiologies for the comprising clade; however, they are interpreted as symplesiomorphies by Sookias' script (provided the molecular branch-lengths are sufficient, they might be recognized when using a probabilistic framework to infer the ancestral states).

Not a good example for Sookias' goal, but the matrix shows the limitations of trees when it comes to morphological differentiation. Here's the distance-based, 2-dimensional network for the morphological data:

A Neighbor-net based on Sookias' morphological matrix.
The arrow indicates the position of the assumed root.

The signal from the morphological matrix is quite tree-like, and the structure of the left part of the network matches that of the single MPT (and the molecular tree). On the right-hand side, we find more complexity than we would expect from the single MPT. The data signal is not trivial regarding the position of the root as inferred by Bernissartia; nor is the placement of Gavialis and Tomistoma (pink edge bundles), two genera producing a very prominent box-like structure. Although cladists call it a "phenetic" approach, the distance-based network is nonetheless straightforward regarding the identification of monophyletic groups (green) and potential monophyletic groups (yellow); the latter always include the particular alternative seen in the single MPT as well and, in the case of the pink box, also the molecular alternative. The light green monophylum is a necessary consequence of the prior knowledge about the position of the root, and the likely monophyly of Alligator and its relatives (the tree-like subgraph with long internal branches = lots of uniquely shared traits, including potential synapomorphies).

Potential synapomorphies that can be inferred from the morpho-matrix alone by mapping the states onto the network. Red, homoiologies reconstructed as synapomorphies ('pseudo-synapomorphies') and (except for one) excluded as potential symplesiomorphies by Sookias' test run of his script (strict and relaxed cut-off).

The network provides more information than can be extracted from the MPT: one Crocodylus is significantly closer to Osteolaemus (the neighborhood defined by the light blue edge bundle, see Sookias' fig. 3A). Crocodylus, however, is likely monophyletic, being generally very similar; and the third genus, Mecistops, is closely linked to (all of) them (neighborhood defined by the dark blue edge). An inclusive common origin (including the third genus, Mecistops) is – just based on morphology and without using a "phylogenetic" tree inference – beyond question, even though we lack syn- or shared apomorphies (short corresponding edge bundle): Mecistops is obviously closely related to Crocodylus, and Osteolaemus is related to part of the latter, so it's not a bad hypothesis that all three are descendants of the same common ancestor, and that Tomistoma (and Gavialis) branched off the lineage before the Crocodylinae radiated. The only alternative explanation would be that the Crocodylinae show the primitive morphs of the entire lineage, and that the position of Tomistoma and Gavialis is affected by long-branch (-edge) attraction (however, if that is the case then we should have found a Tomistoma-Gavialis clade in the MPT, since parsimony will always get it wrong in the Felsenstein zone).

The main flaw

But, any morphology-based alternative using this data matrix is not fully compatible with the molecular tree, which places Mecistops and Osteolaemus as sister to Crocodylus. Here's the consensus network based on 10,000 bootstrap pseudoreplicate BioNJ trees inferred from the morpho-matrix, highlighting the support for splits compatible with the molecular tree (green) and their competing, partly incongruent (red edge bundles) alternatives (I do the information transfer manually, but those with R-scripting skills can use the functions in the phangorn library; Schliep et al., MEE, 2017; see also David's post):

NJ-Bootstrap (BS) consensus network based on 10,000 pseudoreplicates.
Edges/splits corresponding to clades in the molecular tree
(see Sookias' fig. 3 above) in green, those conflicting the molecular tree in red.
Edge values show BS support (edge-lengths are proportional to NJ-BS support),
while asterisks indicate the branches seen in the MPT.
Obviously, there is some signal in the morpho-matrix compatible with the molecular clades (these can be synapomorphies, symplesiomorphies, homoiologies or shared apomorphies) clashing with the signal of pseudo-synapomorphies etc. supporting the topological alternatives seen in the morpho-based MPT.
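For readers unfamiliar with bootstrapping a character matrix, here is a minimal sketch of the resampling step. It is not the BioNJ/consensus-network analysis shown above: instead of inferring a tree per pseudoreplicate, it simply counts how often one taxon is the strictly closest neighbor of another. The toy matrix, the taxon labels and the function names are all hypothetical.

```python
import random

# Hypothetical toy matrix: taxon -> binary character states (NOT Sookias' data).
matrix = {
    "T1": "00110",
    "T2": "00111",
    "T3": "11000",
    "T4": "11001",
}


def hamming(a, b):
    """Number of characters in which two taxa differ."""
    return sum(x != y for x, y in zip(a, b))


def pseudoreplicate(mat, rng):
    """Resample character columns with replacement, keeping the matrix width."""
    n_chars = len(next(iter(mat.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(states[c] for c in cols) for taxon, states in mat.items()}


def nearest_neighbour_support(mat, taxon, partner, n_reps=1000, seed=1):
    """Fraction of pseudoreplicates in which `partner` is strictly closest to `taxon`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        rep = pseudoreplicate(mat, rng)
        dists = {t: hamming(rep[taxon], s) for t, s in rep.items() if t != taxon}
        best = min(dists.values())
        closest = [t for t, d in dists.items() if d == best]
        hits += closest == [partner]
    return hits / n_reps


support = nearest_neighbour_support(matrix, "T1", "T2")
print(f"T1-T2 support: {support:.2f}")
```

In a real analysis each pseudoreplicate would feed a tree inference, and the support of a split would be the fraction of replicate trees that contain it.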

Assuming the molecular tree is correct, the above reconstruction means that Osteolaemus is morphologically more derived, and hence placed as sister, while Mecistops and Crocodylus retain more primitive character states, and hence lack discriminatory derived traits: a sort of local ingroup-outgroup long-branch attraction (or 'short-branch culling').

What differentiates the Crocodylinae? Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies); red, shared apomorphies (convergence). The Mecistops-Crocodylus pseudo-monophylum is mostly supported by traits shared between Osteolaemus and distant siblings (taxa of the larger alligator clade) and/or the outgroup.

We can also hypothesize that the initial radiation was fast, because the Mecistops-Osteolaemus ancestor did not accumulate a single, unique, discriminating character trait.

Excess of shared derived, pseudo-synapomorphic traits is the reason Tomistoma is not resolved as sister of Gavialis in the MPT — the molecular Gavialis-Tomistoma clade is represented by a morphological grade.

A 'splits rose' showing the basic splits. Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies of the larger clade including Crocodylinae); pink, pseudo-synapomorphies (deep homoiologies or symplesiomorphies of the larger Crocodylinae clade); orange, shared ancestral (plesiomorph) or derived traits (convergent). 

And the homoiologies identified using the molecular tree as input cannot put things right. They are just partly compatible with unproblematic splits, ie. the larger clade including Alligator (character #7), the larger clade including Crocodylinae (#1, #18, #73, #74, #117) or exclusive to the Crocodylinae (#66).

Character mapping of the molecular-inferred homoiologies. The lush green splits represent the molecular splits.

However, if we are ignorant of the molecular tree, we would have to assume that Mecistops is the sister to Crocodylus, and that some of their shared traits not found in Osteolaemus are shared apomorphies (if occurring outside the clade and in the sister clade) or even synapomorphies (if exclusive for Mecistops + Crocodylus), while only those shared by Osteolaemus and C. porosus (#66) can be homoiologies. We also would have no reason to challenge the Gavialis-Tomistoma grade, until we infer networks.

Map of all potential synapomorphies (bold), symplesiomorphies (italics) and homoiologies (plain font) using the morphology-based Neighbor-net as basis. Red, pseudo-synapomorphies: splits seen in the MPT (with or without an alternative in the Neighbor-net) but rejected by the molecular tree.

This is the main flaw of Sookias' idea. To identify homoiologies, we need the same prerequisite as for any of Hennig's concepts: we need to know the true tree. If we use the inferred tree based on the same data that we want to weight (here: use homoiologies for decision making or means of node support), then we propagate first-level errors and apply circular reasoning. The red-marked pseudo-synapomorphies in the network above are one example; vice versa, all actual (molecular-wise) synapomorphies supporting the molecular Gavialis-Tomistoma clade (dark purple split) would be reconstructed as homoiologies or symplesiomorphies based on the morpho-based single MPT (or morpho-based NJ tree, or probabilistic tree).

And if we have an independent molecular tree, it will make the decision on the fly: putative synapomorphies are the traits that are fully compatible, symplesiomorphies, homoiologies and shared apomorphies are decreasingly compatible, and random convergences are incompatible with the molecular tree.

It is not homoiology but tree-incompatible signal that is neglected in phylogenetics

Sookias points out: "In inference of phylogeny by parsimony, an occurrence of a character state in a part of a tree separated from it by another state is considered simply a homoplasy, and a tree where the occurrences are nearer or further from one another is not more or less parsimonious ... a tree where the 15 occurrences are nearer or further from one another is not more or less parsimonious". In principle, this is true, but has little consequence in application.

We, usually without realizing it, make frequent use of the discriminating power of potential homoiologies. See the example above, but also when, eg., placing fossils in a molecular framework or doing post-inference character weighting. In these cases, homoiologies (and symplesiomorphies) will stabilize the inference and increase support. For better and worse:
  • Better, because homoiologies will ensure that the fossil is placed in the right molecular-based subtree, and can compensate for the lack of synapomorphies. Imagine an extinct fossil sibling lineage showing only homoiologies shared by Osteolaemus and C. porosus. Using tree-based optimization (eg. RAxML's 'evolutionary placement algorithm'), it would be placed close to the Crocodylinae ancestor, likely next to Osteolaemus. Using a Neighbor-net, it would be placed between Osteolaemus and C. porosus. Either way, the homoiologies would ensure it is nested within the Crocodylinae.
  • Post-inference character weighting, as implemented in eg. TNT, will downweight inferred convergences (ie. higher homoplasy, more stochastically distributed across the tree) more than putative homoiologies (ie. less homoplastic since confined to a single subtree). This can be better or worse. How do we avoid what happened for the crocodiles, where homoiologies are not recognized as such but instead support (somewhat) misleading clades (ie. act as synapomorphies)? Clades are commonly interpreted as a sufficient criterion to determine monophyly; however, they are not even a necessary one: taxa can be part of a monophyletic group despite not forming an inclusive subtree (ie. clade in a rooted tree) such as the genus Caiman or Gavialis-Tomistoma.
Hence, we should discourage any form of data-self-dependent or post-analysis weighting and instead just explore the signal in our data, using networks.

One thing is also obvious from the crocodile example: if we have enough signal in the morphological data, then we may get one or another thing wrong and, in some cases, may not be able to decide between one or another alternative. However, overall, the morphological differentiation pretty well captures what the genes provide us as the best approximation of the true tree. Even when the matrix includes very few potential synapomorphies and clear homoiologies but a lot of shared apomorphies, most of which were convergently evolved in parts of both major clades.

At least, this will be so when we analyze the data using networks and not just trees (compare the single MPT to the networks).

Using the alternative evolutionary scenarios provided by the networks, we can then look back into our data (see the maps above), to see what may be a homoiology, a symplesiomorphy (very useful for deciding between evolutionary scenarios, as well) or a synapomorphy. The phangorn library (used for Sookias' script) now has network functionality and allows transferring information between trees and networks. An R-affine person may be able to extract lists of potential (partly competing) synapomorphies, symplesiomorphies, and homoiologies directly from the network showing all possible or the most likely trees.

And then use this information to eg. place fossils in a phylogenetic context, or reconstruct evolutionary trends in extinct groups of organisms — reconstruction of evolutionary trends in extant organisms should always rely on morphological data analyzed in a molecular-phylogenetic framework.


A NEXUS-version of Sookias' test matrix (slightly annotated for Mesquite, simple version for PAUP*), tree- and distance matrix files have been added to my figshare collection of morphological matrices.  

Monday, February 25, 2019

Automatic morpheme segmentation (Open problems in computational diversity linguistics 1)

The first task on my list of 10 open problems in computational diversity linguistics deals with morphemes, that is, the minimal meaning-bearing parts in a language. A morpheme can be a word, but it does not have to be a word, since words may consist of more than one morpheme, and (depending on the language in question) may do so almost by default.

Examples of morphemes in English include clear-cut cases of compounding, where two words are joined to form a new word. Often, this is not even readily reflected in spelling, and, as a result, speakers may at times think that a word like "primary school" is not a single word, although it is easy to determine from its semantics that the word is indeed pointing to one uniform concept. Other examples include grammatical markers, such as the ending -s for most English plurals, or to mark the third person singular of verbs. When confronted with a word form like walks, linguists will analyze this word as consisting of two morphemes, illustrating it by adding a dash as a boundary marker: walk-s.

The problem

The task of automatic morpheme segmentation is thus a pretty straightforward one: given a list of words, potentially along with additional information, such as their meaning, or their frequency in the given language, try to identify all morpheme boundaries, and mark this by adding dash symbols where a boundary has been identified.

One may ask why automatic identification of morphemes should be a problem —  and some people commenting on my presentation of the 10 open problems last month did ask this. The problem is not unrecognized in the field of Natural Language Processing, and solutions have been discussed from the 1950s onwards (Harris 1955, Benden 2005, Bordag 2008, Hammarström 2006, see also the overview by Goldsmith 2017).

Roughly speaking, all approaches build on statistics about n-grams, i.e., recurring symbol sequences of arbitrary length. Assuming that n-grams representing meaning-building units should be distributed more frequently across the lexicon of a language, they assemble these statistics from the data, trying to infer the ones which "matter". With Morfessor (Creutz and Lagus 2005), there is also a popular family of algorithms available in the form of a very stable and easy-to-use Python library (Virpioja et al. 2013). Applying and testing methods for automatic morpheme segmentation is thus very straightforward nowadays.
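To make the n-gram idea concrete, here is a minimal sketch of a letter successor count in the spirit of Harris (1955): for every prefix of a word, count how many distinct letters follow that prefix in the lexicon, and propose a boundary where the count peaks. The toy word list, the threshold and the function names are assumptions made for illustration; this is not how Morfessor works internally.

```python
from collections import defaultdict

def successor_variety(lexicon):
    """Map each proper prefix to the set of letters that follow it in the lexicon."""
    followers = defaultdict(set)
    for word in lexicon:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return followers

def segment(word, followers, threshold=2):
    """Put a dash after every prefix that is followed by at least
    `threshold` distinct letters (threshold is an assumed toy value)."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if len(followers.get(word[:i], ())) >= threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return "-".join(pieces)

# Hypothetical toy lexicon, chosen so that the stems recur with several endings.
toy_lexicon = ["walks", "walked", "walking", "talks", "talked", "talking"]
followers = successor_variety(toy_lexicon)
print(segment("walks", followers))    # walk-s
print(segment("talking", followers))  # talk-ing
```

With only a handful of word forms, most prefixes have a single attested continuation and no boundary is proposed at all, which is one way to see why such counts depend on large lexicons.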

The issue with all of these approaches and ideas is that they require a very large amount of data for training, while our actual datasets are small and sparse, by nature. As a result, all currently available algorithms fail when it comes to determining the morphemes in datasets of fewer than 1,000 words.

Interestingly, even when trained on large datasets, the algorithms still commit surprising errors, as can easily be seen when testing the online demo of the Morfessor software for German. When testing words like auftürmen "pile up", for example, the algorithm yields the segmentation auf-türme-n, which is probably understandable from the fact that the word Türme "towers" is quite frequent in the German lexicon, thus confusing the algorithm; but for a German speaker, who knows that verbs end in -en in their infinitive, it is clear that auftürmen can only be segmented as auf-türm-en.

If I understand the information on the website correctly, the Morfessor algorithm offered online was trained with more than 1 million different word forms in German. Given that in our linguistic approaches we usually have 1,000 words per language at our disposal, if not fewer, it is clear that the algorithms won't provide help in finding the morphemes in our data.

To illustrate this, I ran a small test with the Morfessor software, using two datasets for training: one big dataset with about 50,000 words from Baayen et al. (1995), and one smaller dataset of about 600 words, which I used as a cognate detection benchmark when writing my dissertation (List 2014). I trained the Morfessor software on each of these two datasets, and then applied the trained models to segment a list of 10 German words (see the GitHub Gist here).

The results for the two models (small data and big data) as well as the segmentations proposed by the online application (online) are given in the table below (with my own judgments on morphemes given in the column word).

Number  Word          Small data        Big data      Online
1       hand          hand              hand          hand
2       hand-schuh    hand-sch-uh       hand-schuh    hand-schuh
3       hantel        h-a-n-t-el        hant-el       han-tel
4       hunger        h-u-n-g-er        hunger        hunger
5       lauf-en       l-a-u-f-en        laufen        lauf-en
6       geh-en        gehen             gehen         gehen
7       lieg-en       l-i-e-g-en        liegen        liegen
8       schlaf-en     sch-lafen         schlafen      schlaf-en
9       kind-er-arzt  kind-er-a-r-z-t   kind-er-arzt  kinder-arzt
10      grund-schule  g-rund-sch-u-l-e  grund-schule  grundschule

What can be seen clearly from the table, where all forms deviating from my analysis are marked in red font, is that none of the models does a convincing job in segmenting my ten test words. More importantly, however, we can clearly see that the algorithm's problems increase drastically when dealing with small training data: the segmentations proposed in the Small data column are clearly the worst, splitting words in a seemingly random fashion into letters.
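The comparison in the table can also be expressed as a simple count. The following sketch tallies, for each model, how many of the ten proposed segmentations agree exactly with the manual analysis in the Word column. The data are copied from the table; the variable and function names are assumptions made for illustration, and exact match is only one of several possible scoring schemes.

```python
# Each row: manual analysis, then the Small data, Big data and Online proposals.
ROWS = [
    ("hand", "hand", "hand", "hand"),
    ("hand-schuh", "hand-sch-uh", "hand-schuh", "hand-schuh"),
    ("hantel", "h-a-n-t-el", "hant-el", "han-tel"),
    ("hunger", "h-u-n-g-er", "hunger", "hunger"),
    ("lauf-en", "l-a-u-f-en", "laufen", "lauf-en"),
    ("geh-en", "gehen", "gehen", "gehen"),
    ("lieg-en", "l-i-e-g-en", "liegen", "liegen"),
    ("schlaf-en", "sch-lafen", "schlafen", "schlaf-en"),
    ("kind-er-arzt", "kind-er-a-r-z-t", "kind-er-arzt", "kinder-arzt"),
    ("grund-schule", "g-rund-sch-u-l-e", "grund-schule", "grundschule"),
]
MODELS = ("Small data", "Big data", "Online")

def exact_matches(rows):
    """Count, per model, the words whose segmentation equals the manual one."""
    counts = dict.fromkeys(MODELS, 0)
    for gold, *predictions in rows:
        for model, predicted in zip(MODELS, predictions):
            if predicted == gold:
                counts[model] += 1
    return counts

print(exact_matches(ROWS))  # {'Small data': 1, 'Big data': 5, 'Online': 5}
```

By this crude measure the model trained on small data gets one word out of ten right, while the other two get five each, and no model reaches the manual analysis on more than half of the words.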

What is interesting in this context is that trained linguists would rarely fail at this task, even when all they were given is the small data list for training. That they do not fail is shown by the numerous studies where linguistic fieldworkers have investigated so far under-investigated languages, and quickly figured out how the morphology works.

Why is it so difficult to find morpheme boundaries?

What makes the detection of morpheme boundaries so difficult, also for humans, is that they are inherently ambiguous. A final -s can mark the plural in German, especially on borrowings, as in Job-s, but it can likewise mark a short variant of es "it", where the vowel is deleted, as in ist's "it's", and in many other cases, it can just mark nothing, but instead be part of a larger morpheme, like Haus "house". Whether or not a certain substring of sounds in a language can function as a morpheme depends on the meaning of the word, not on the substring itself. We can — once more — see one of the great differences between sequences in biology and sequences in linguistics here: linguistic sequences derive their "function" (ie. their meaning) from the context in which they are used, not from their structure alone. 

If speakers are no longer able to clearly understand the morphological structure of a given word, they may even start to change it, in order to make it more "transparent" in its denotation. Examples for this are the numerous cases of folk etymology, where speakers re-interpret the morphemes in a word, with English ham-burger as a prominent example, since the word originally seems to derive from the city Hamburg, which has nothing to do with ham. 

How do humans find morphemes?

The reason why human linguists can relatively easily find morphemes in sparse data, while machines cannot, is still not entirely clear to me (ie. humans are good at pattern recognition and machines are not). However, I do have some basic ideas about why humans largely outperform machines when it comes to morpheme segmentation; and I think that future approaches that try to take these ideas into account might drastically improve the performance of automatic morpheme segmentation methods.

As a first point, given the importance of meaning in order to determine morphemic structure, it seems almost absurd to me to try to identify morphemes in a given language corpus based on a pure analysis of the sequences, without taking their meaning into account.  If we are confronted with two words like Spanish hermano "brother" and hermana "sister", it is clear — if we know what they mean — that the -o vs. -a most likely denotes a distinction of gender. While the machines compare potential similarities inside the words independent of semantics, humans will always start from those pairs where they think that they could expect to find interesting alternations. As long as the meanings are supplied, a human linguist — even when not familiar with a given language — can easily propose a more or less convincing segmentation of a list of only 500 words.
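The hermano / hermana case can be sketched in a few lines. The example below assumes that somebody has already decided, from the meanings, which two words are worth comparing; it then splits both words where their shared beginning ends. The function names are hypothetical, and real alternations are of course not limited to suffixes.

```python
def shared_prefix(word_a, word_b):
    """Return the longest common beginning of two words."""
    length = 0
    for char_a, char_b in zip(word_a, word_b):
        if char_a != char_b:
            break
        length += 1
    return word_a[:length]

def split_pair(word_a, word_b):
    """Segment two semantically related words at the end of their shared stem.
    A word is left unsegmented if the stem is empty or covers the whole word."""
    stem = shared_prefix(word_a, word_b)

    def mark(word):
        rest = word[len(stem):]
        return f"{stem}-{rest}" if stem and rest else word

    return mark(word_a), mark(word_b)

# The pair is chosen because of the meanings "brother" and "sister",
# not because of any statistics over the lexicon.
print(split_pair("hermano", "hermana"))  # ('herman-o', 'herman-a')
```

The point of the sketch is the input rather than the algorithm: the comparison becomes trivial once the meanings tell us which words to compare.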

A second point that is disregarded in current automatic approaches is the fact that morphological structures vary drastically among languages. In Chinese and many South-East Asian languages, for example, it is almost a rule that every syllable represents one morpheme (with minimal exceptions being attested and discussed in the literature). Since syllables are, in turn, easy to find in these languages, because words can often only end in a limited number of sounds, an algorithm to detect morphemes in those languages would not need any n-gram statistics, but just a theory of syllable structures. Instead of global strategies, we may rather have to opt for local strategies of morpheme segmentation, in which we identify different types of languages for which a given algorithm seems suitable.
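As a rough illustration of such a local strategy, the sketch below segments words purely by an assumed syllable template: optional initial consonants, one or more vowels, and an optional final n or ng that is not followed by a vowel. The template and the romanized example words are simplifying assumptions made for illustration, not a phonotactic model of any particular language.

```python
import re

# Assumed toy template: consonants* + vowels+ + optional final n/ng.
# The lookahead keeps an n or ng in front of a vowel as the next onset.
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+(?:(?:ng|n)(?![aeiou]))?")

def segment_by_syllable(word):
    """Treat every syllable as one morpheme; return the word unchanged
    if the template does not cover it completely."""
    syllables = SYLLABLE.findall(word)
    if "".join(syllables) != word:
        return word
    return "-".join(syllables)

print(segment_by_syllable("pengyou"))  # peng-you
print(segment_by_syllable("dianhua"))  # dian-hua
```

No frequency counts are involved here; the only knowledge the procedure needs is which sounds may close a syllable.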

This brings us to a third point. A peculiarity of linguistic sequences in spoken languages is that they are built by specific phonotactic rules that govern their overall structure. Whether or not a language tolerates more than three consonants at the beginning of a word depends on its phonotactics, its set of rules by which the inventory of sounds is combined to form morphemes and words. Phonotactics itself can also give hints on morpheme boundaries, since it may prohibit combinations of sounds within morphemes which can occur when morphemes are joined to form words. German Ur-instinkt "basic instinct", for example, is pronounced with a glottal stop after the Ur-, which can only occur at the beginning of German words and morphemes, thus marking the word clearly as a compound (otherwise the word could be parsed as Urin-stinkt "urine smells").

A fourth point that is also generally disregarded in current approaches to automatic morpheme segmentation is that of cross-linguistic evidence. In many cases, the speakers of a given language may themselves no longer be aware of the original morphological segmentation of some of their words, while the comparison with closely related languages can still reveal it. If we have a potentially multi-morphemic word in one language, for example, and only one of the two potential morphemes reflected as a normal word in the other language, this is clear evidence that the potentially multi-morphemic word does, indeed, consist of multiple morphemes.


Linguists regularly use multiple types of evidence when trying to understand the morphological composition of the words in a given language. If we want to advance the field of automatic morpheme segmentation, it seems to me indispensable that we give up the idea of detecting the morphology of a language just by looking at the distribution of letters across word forms. Instead, we should make use of semantic, phonotactic, and comparative information. We should further give up the idea of designing universal morpheme segmentation algorithms, but rather study which approach works best on which morphological type. How these aspects can be combined in a unified framework, however, is still not entirely clear to me; and this is also the reason why I list automatic morpheme segmentation as the first of my ten open problems in computational diversity linguistics.

Even more important than the strategies for the solutions of the problem, however, is that we start to work on extensive datasets for testing and training of new algorithms that seek to identify morpheme boundaries on sparse data. As of now, no such datasets exist. Approaches like Morfessor were designed to identify morpheme boundaries in written languages; they barely work with phonetic transcriptions. But if we had the datasets for testing and training available, be it only some 20 or 40 languages from different language families, manually annotated by experts, segmented both with respect to the phonetics and to the morphemes, this would allow us to investigate both existing and new approaches much more profoundly, and I expect it could give a real boost to our discipline and greatly help us to develop advanced solutions for the problem.


Baayen, R. H. and Piepenbrock, R. and Gulikers, L. (eds.) (1995) The CELEX Lexical Database. Version 2. Philadelphia.

Benden, Christoph (2005) Automated detection of morphemes using distributional measurements. In: Claus Weihs and Wolfgang Gaul (eds.): Classification -- the Ubiquitous Challenge. Berlin and Heidelberg: Springer, pp 490-497.

Bordag, Stefan (2008) Unsupervised and knowledge-free morpheme segmentation and analysis. In: Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras and Diana Santos (eds.): Advances in Multilingual and Multimodal Information Retrieval. Berlin and Heidelberg: Springer, pp 881-891.

Creutz, M. and Lagus, K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report. Helsinki University of Technology.

Goldsmith, John A. and Lee, Jackson L. and Xanthos, Aris (2017) Computational learning of morphology. Annual Review of Linguistics 3.1: 85-106.

Hammarström, Harald (2006) A Naive Theory of Affixation and an Algorithm for Extraction. In: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006 pp. 79-88.

Harris, Zellig S. (1955) From phoneme to morpheme. Language 31.2: 190-222.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Virpioja, Sami, Smit, Peter, Grönroos, Stig-Arne and Kurimo, Mikko (2013) Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Helsinki: Aalto University.

Monday, February 18, 2019

Can we depict the evolution of highly conserved genes, such as the ribosomal RNA genes?

Median networks have been designed to put within-species haplotypes into an explicit evolutionary framework. They are exclusively parsimony-based, but differ from traditional trees by treating operational taxonomic units (OTUs) as both potential tips and ancestors. Ancestors are placed at internal nodes ('medians'). The latter makes them interesting for hypotheses about sequence evolution; but, like all parsimony-based methods, they suffer from high levels of homoplasy, which is a common feature of genetic data sets.
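The 'median' at the heart of these networks can be illustrated with a minimal sketch: for three aligned sequences, the median takes at every site the state found in at least two of them. This shows only that basic operation, under the assumption of equally long, already aligned sequences; it is not the full procedure implemented in programs such as NETWORK, and the function name and toy sequences are made up for illustration.

```python
from collections import Counter

def median_sequence(seq_a, seq_b, seq_c):
    """Site-wise majority of three aligned sequences.
    Sites where all three sequences differ are returned as '?'."""
    if not (len(seq_a) == len(seq_b) == len(seq_c)):
        raise ValueError("sequences must be aligned to the same length")
    median = []
    for column in zip(seq_a, seq_b, seq_c):
        state, count = Counter(column).most_common(1)[0]
        median.append(state if count >= 2 else "?")
    return "".join(median)

# The median may be an unsampled type ...
print(median_sequence("ACA", "AGG", "TCG"))  # ACG
# ... or coincide with one of the sampled sequences, which then sits at an internal node.
print(median_sequence("AAT", "ACT", "CCT"))  # ACT
```

The second call shows why a sampled sequence can end up as an ancestor in such a graph rather than as a tip.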

Can we use median networks to better understand evolution far above the species level?

In order to test this, I generated a median network using data on the nuclear-encoded 5.8S rDNA of Fagales. This is a flowering plant (angiosperm) order, which includes well-known trees such as oaks, beeches, chestnuts, walnuts, alder, birch and hazel, but also the enigmatic 'false beech' (Nothofagus s.l., the traditional four subgenera have been elevated to genera by Heenan & Smissen 2013), a Gondwanan element that (for some time) has intrigued biogeographers.

Why I have always loved nrDNA

When I was a young (phylo-)geneticist, my boss, a geneticist who sequenced genes such as the rRNA genes before PCR made it easy, pointed me to the works of Mark Hershkovitz, Louise Lewis, and Edith Zimmer about evolution of the nuclear-encoded ribosomal RNA genes (nrDNA) in angiosperms. Long pre-dating the era of big data and self-evident, trivial phylogenies (ie. data sets allowing for the inference of a fully resolved, unambiguously supported tree), Hershkovitz and co-workers sought to extract as much information as possible from the best-known gene region available back then (mid-late 90s): the internal transcribed spacers (ITS1, ITS2) of the 35S rDNA, the cistron encoding the genes for the 18S, 5.8S and 25S (or 28S, but not "26S") nuclear ribosomal RNA.
  • Hershkovitz MA, Lewis LA. 1996. Deep-level diagnostic value of the rDNA-ITS region. Molecular Biology and Evolution 13:1276–1295.
  • Hershkovitz MA, Zimmer EA. 1996. Conservation patterns in angiosperm rDNA ITS2 sequences. Nucleic Acids Research 24:2857–2867.
  • Hershkovitz MA, Zimmer EA, Hahn WJ. 1999. Ribosomal DNA sequences and angiosperm systematics. In: Hollingsworth PM, Bateman RM, and Gornall RJ, eds. Molecular Systematics and Plant Evolution. London: Taylor & Francis, pp. 268–326.
The ITS1 and ITS2 are highly divergent, non-coding but transcribed intergenic spacers within the structurally and sequentially much more conserved nrDNA, which distinguishes them from nearly all other non-coding regions. More often than not, their sequences are impossible to align across high-ranking taxa such as families or orders. The brilliance of Hershkovitz et al.'s work was to just go a level up by identifying shared general sequence patterns, and to put them in an evolutionary context.

Bird's-eye view of the ITS region (consensed for sequence groups) in Fagales including sequences of the two outgroups used in Li et al. 2004 (zoom-in and try to figure out where they are). The position of the ITS(1) cleavage site is indicated, a highly conserved, AT-dominated sequence motif within the ITS1. The "Nothofagus deletion" (Manos 1997), gray area seen in some of the topmost variants in the 5.8S rDNA, is a sequencing/editing artifact (newer sequences all have a complete 5.8S rDNA). Most of these data are more than 15 years old (see references provided at the end of the post) and may include more data artifacts, especially in the length-polymorphic portions. Nonetheless, part of the data were included in the dating studies of Sauquet et al. (2012) and Xing et al. (2014) to compensate for the lack of resolution of the also included plastid regions towards the tips of the Fagales tree (intrafamily and -generic relationships).

Accordingly, in my (open access) Ph.D. thesis you'll find not a few figures depicting the potential evolution of sequence patterns in the ITS1 and ITS2 of maples and the beech trees.

I could probably write a book taking up where Hershkovitz et al. stopped, but this would be: a) very subjective, and b) too complex and marginal for the 21st century. Very few people would read it. We have grown accustomed to simple graphs as metaphors of evolution and, thanks to big data, we have become reluctant to discuss the results ex machina. Also, I would have needed a score of students to pursue all the avenues that I glimpsed into; e.g. the following pic:

Evolution of the 5'-end of the ITS1 in basal eudicots (looking at divergences that happened, at least, 100 myrs ago).

The other way around

If the more conserved sequence patterns within the ITS1 and ITS2 can be informative about evolution at a much higher level (which they are), the next question is: what can we learn from the sequence patterns in the highly-conserved portions of the rDNA linked with the ITS1 and ITS2? Historic-genetically, the ITS1 is fundamentally different from the ITS2. The former, ITS1, is an intergenic spacer, which has no secondary structure (although you can find reconstructions in literature) as it is split into two parts right after transcription (the ITS1 cleavage site is quite conserved, and a main topic in the papers by Hershkovitz and Zimmer). The latter, ITS2, has been evolutionarily derived from the first variable portion of the large ribosomal subunit (LSU), the 25S (28S) rDNA. In primitive organisms, there is hence no 5.8S rDNA and ITS2.

This geno-evolutionary history is also the reason for the structural linkage between the 5.8S rRNA and the 5' end of the 25S (28S) rRNA. Here's a zoom-in on the part that we are interested in.

For better orientation, I have named some of the extremely conserved secondary structure elements of the (mature) 5.8S rRNA. Note that the "Gingerbread Man" structure is very conserved in angiosperm sequences although it only contains three very short stems. The "Pimple" and the "Needle" are so-called hairpins — a strictly complementary stem part is capped by a short, non-complementary tip ('semi-loop'): a 3- and 4-nt long motif, respectively, in Arabidopsis and all Fagales (in some species of Lithocarpus, the tropical 'stone nut' and relative of oaks, the "Needle" has two extra nucleotides).

5.8S rDNA in Fagales

I chose the Fagales because I have worked on them a lot, they are a pretty small group, and except for one "asterisk branch" their inter-family relationships are solved.

Basic signal in Li et al. (2004)'s matrix. Inter-family relationships are, data-wise, fairly trivial, hence, the tree-like Neighbor-net. Only the placement of the Myricaceae with respect to Juglandaceae (now incl. Rhoipteleaceae) and Betulaceae + allies is not unambiguously resolved (see this post)

Oaks have received a lot of attention from population geneticists, like other widespread species or species complexes. Those studies, using Median networks and related methods such as Statistical Parsimony, revealed very complex genetic diversity patterns. On the other hand, the Fagales lineage has been fairly neglected by plant phylogeneticists, although it comprises many of the dominant, ecologically and economically most important trees of the Northern Hemisphere (and the enigmatic Gondwanan Nothofagaceae). The early studies found evidence for deep nuclear-plastid incongruences, but only in recent years have the first (non-comprehensive) complete plastome phylogenies and dated all-Fagales trees surfaced (which do contain one or another common error and misinterpretation of results).

For one family, the southern hemispheric, tropical-subtropical Casuarinaceae, we have no (reliable) ITS data at all; also missing is one of the genera of the Juglandaceae: Engelhardia (s.str.; most data in gene banks labelled as Engelhardia is from Alfaropsis; cf. Manchester 1987 and Manos et al. 2007, but see Zhang et al. 2013).

In total, we find 17 variable sites at and above the genus level in the 5.8S rDNA of Fagales. There are three in the core parts, structurally linked to the 5' 25S rRNA, two in the 'Gingerbread Man', three in the 5' and 3' tails, and the rest are in the 'Needle'.

Unique mutations and mutational trends (arrows) in the 5.8S rDNA in Fagales. Circles highlight the basepairs differing from the reference (Arabidopsis 5.8S rRNA). Blue, mutations found within more than one major lineage, pink, lineage-conserved (diagnostic) mutations; red, mutations restricted to a single genus; green, genetic (syn)apomorphies of the 5.8S rDNA of Fagales. Be = Betulaceae; Ju = Juglandaceae; My = Myricaceae; No = Nothofagaceae; Fagaceae include Fagus (Fa, the beech) and the remainder ("Quercaceae": Qu), which are genetically substantially distinct from Fagus.

Many mutations are genus-coherent; increased intrageneric variation is found in the 5'-tail and the part encoding the 4(6)-nt long 'semi-loop' sequence of the "Needle" (pos. 120–142 in the rRNA of Arabidopsis thaliana):

A (near-)full Median network for the tip of the 'Needle'. In a few Lithocarpus (a "Quercaceae" genus) the sequence is 6-nt-long, which would result in an elongated hairpin (paired basepairs are underlined). The ATTC is a genetic symplesiomorphy.

Exceptions are Fagus and Quercus, which can show substantial intragenomic ITS divergence, Lithocarpus (the most divergent genus, ITS-wise), and Nothofagus s.l. (between the former subgenera, now genera). In these cases, the intra-(sub)generic variation includes the putatively ancestral nucleotide and/or nucleotide shared with other genera of the family; eg. at pos. 123, all Fagales have a C, Fagus can have either C or T (= Y), and Quercus can show any of the four nucleotides (= N).
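The ambiguity codes used here are the standard IUPAC nucleotide codes. As a small sketch of how such codes relate to the observed variation, the following lines derive the code from a set of observed nucleotides and replace a polymorphism by the family-shared nucleotide whenever the code allows for it. The function names are assumptions made for illustration, and the resolution rule is only a plain reading of the position 123 example.

```python
# Standard IUPAC nucleotide codes and the nucleotides they stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(nts): code for code, nts in IUPAC_CODES.items()}

def consensus_code(nucleotides):
    """Return the IUPAC code covering all nucleotides observed at a site."""
    return CODE_FOR[frozenset(nucleotides)]

def resolve(code, family_nucleotide):
    """Replace an ambiguity code by the family-shared nucleotide,
    but only if that nucleotide is among the states the code stands for."""
    if family_nucleotide in IUPAC_CODES[code]:
        return family_nucleotide
    return code

# Position 123: the family-shared state is C, Fagus shows C or T, Quercus all four.
print(consensus_code("CT"), consensus_code("ACGT"))  # Y N
print(resolve("Y", "C"), resolve("N", "C"))          # C C
```

A code that does not include the family-shared nucleotide, such as R (A or G) against a shared C, is left untouched by this rule.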

A Median-network for the 5.8S rDNA

Ambiguities can be detrimental for resolution in standard parsimony implementations. The NETWORK program, for instance, warns that a code of "N" may render the result less reliable, and this applies also to the other ambiguity codes. If we include the intra-generic polymorphisms as ambiguity codes, NETWORK runs for quite a long time: too many solutions are equally parsimonious (for this experiment I used genus-consensus data, being interested in the deep splits).

But when we resolve the intra-generic polymorphisms prior to analysis by treating them as satellite types, ie. assuming the family-shared nucleotide represents the ancestral state within the according lineage, we quickly get the following result:

Edges colored to trace the same mutational step. Bubbles indicate the position of the (basic) 5.8S rDNA genotypes for the genera in each family-level lineage.

This is still not too trivial a graph, but it:
  • provides a framework on which we can develop our evolutionary scenario;
  • visualizes how mutational patterns may be linked;
  • tells us directly how derived (genetically) and unique (isolated) the genera are.
Since the 5.8S rDNA is part of a multi-copy (potentially multi-loci, Ribeiro et al. 2011) gene region, uniqueness gives us an idea about how reduced a lineage is. Bottlenecks will eliminate intra-lineage diversity and unique mutational patterns are more likely to accumulate in a species-poor lineage with small population sizes.

But since it is a vital gene region subject to strong sequential and structural constraints, evolution is not neutral: the graph has little tree-likeness. However, it resembles the graphs that one expects for fast ancient radiations.

There are more interesting details. For instance, we have no mutation separating consistently the earliest diverging lineages (given the currently accepted root), the Nothofagaceae and the Fagaceae (s.l.) and the remainder of the order (called "higher hamamelids" in classic systematic literature). We also see that the 5.8S rDNA shows the Fagaceae should be monotypic: Fagus is more different from its siblings, the 'Quercaceae', than it is from the first-diverging Nothofagaceae or the common ancestor of the "higher hamamelids". Fagaceae s.str. and 'Quercaceae' are without a doubt sister lineages but this also applies to Betulaceae and Ticodendraceae (differing only by three point mutations), with the Betulaceae being just one point mutation away from its more distant sibling (phylogenetically speaking), the Juglandaceae. Furthermore, for Ticodendron-Betulaceae we can postulate a sequentially unique common ancestor, but we can't do the same for Fagus-'Quercaceae'.

Either the 5.8S rDNA evolved much faster in Fagus than in most other lineages, or Fagus split away from its sisters prior to the radiation of the "higher hamamelids" and shortly after their respective ancestors isolated. This second scenario fits nicely with recent fossil findings tracing the Fagus lineage back to the late Cretaceous (at least 80 Ma; Grímsson et al. 2016, supplement includes a digression on all-Fagales dating attempts).

Reconstruction of ancestral genepools

Using the split patterns in the network to extract an evolutionary tree could be hazardous, since we are looking at strongly interconnected mutational patterns filtered by selective pressure (maintaining a functional structure) in a gene region that evolves very slowly: some sites can or did accumulate mutations (the 'Needle' and the tails), others can't and did not (the remainder of the 5.8S rDNA) in the Fagales lineage. At least, mutations were not fixed over a long evolutionary time: the data include at least as many variable sites where, within a single genus, species or genome, the shared, family-typical nucleotide (or even one shared with Arabidopsis, a quite distant relative of Fagales) is occasionally replaced.

But since we know the phylogeny of the Fagales, we can, based on the Median(-joining) network(s), infer the evolution of the 5.8S rDNA (i.e. the rDNA gene pool) over time:

Results of the Median-joining analysis mapped on the currently accepted Fagales tree. Clade-characteristic mutations are highlighted by according colors; black, homoplastic mutations that occurred independently in two lineages, gray, in more than two.

Regarding the 'asterisk branch', the 5.8S rDNA provides few extra clues, unless we want to re-include a third hypothesis: that the Myricaceae are sister to Juglandaceae + Betulaceae and allies. This would be the most fitting explanation for the 5.8S rDNA diversity. It also would explain why they can be either sister to Betulaceae and allies or Juglandaceae. Ancestors, or slower evolving sisters diverging shortly before a radiation, will do such a thing.

In this context, one should point out that unequivocal fossils representing various modern genera of all families are known from the early Paleogene; many pop up in early Eocene (~ 50 Ma) intramontane basins of northwestern North America. The oldest modern genus and a possible living fossil is the first diverging Juglandaceae: Rhoiptelea. Its pollen can be found from the Maastrichtian onwards in North America and elsewhere, and a fossil showing the unique Rhoiptelea-flower and fitting pollen can be found in the late Turonian-Santonian (~90 Ma) of Bohemia (Heřmanová et al. 2011; the authors, however, decided to name it Budvaricarpus and tone down the striking resemblance to modern-day Rhoiptelea).

Of course, since we use network-based approaches, we can conceptualize the 5.8S rDNA sequence patterns and inferred evolution as a subsequent breaking up and sorting of once-shared gene pools:

A 'coral' tree metaphor for the evolution of the 5.8S rDNA in Fagales (using an alternative, one-node-shifted root).

I chose an alternative root because it is the one that makes most sense regarding the fossil-morphological, palaeoclimatological/-vegetation and highly conserved genetic patterns (thinking of the 18S rDNA). The labels are, of course, a gross simplification: it is likely that the all-ancestor was a tropical-subtropical plant as well (the genetically most unique and potentially earliest isolated genera of the 'Quercaceae' are exclusively tropical-subtropical) and Myricaceae, Betulaceae and Juglandoideae can today be found deep into the temperate zone, some even thriving in boreal and polar climates. But posts can afford to trigger discussion.

The vertical axis reflects not only the derivedness of the 5.8S rDNA, but also the potential sequence of divergences back in time. The horizontal axis represents the taxonomic-geographic breadth over time (very roughly, tapering means higher diversity/greater range in the past than today) and towards the tips the genetic within-lineage diversity seen in the ITS1 and ITS2 (in Myricaceae, it would be close to a point, if it would not be for one species: Myrica gale, the bog myrtle or sweetgale, beloved in Scotland and Scandinavia – see this Dane's video for how to use it).

Just a curious experiment?

Now, to most readers this post may just be a strange example with little general relevance for phylogenetics. But consider the following.
  1. When we infer deeper phylogenetic relationships, we usually rely on sequence differentiation in coding-gene regions. Like the rRNA genes, the tRNA genes need to fulfill secondary (and tertiary) structural constraints to maintain their vital functions. All other genes code for proteins, which also need to fulfill structural constraints (secondary, tertiary and quaternary structures). Their essential functions rely on keeping a specific amino-acid sequence, which is translated from DNA sequences.
  2. We do this inference under the assumption that molecular evolution is neutral, which, as can be seen in the case of the 5.8S rDNA, is apparently not the case. Mutations that would negatively affect the function of the DNA-transcripts are strongly selected against.
Many of our trees make sense nonetheless, but we should keep a wary eye on all of those branches that draw their support from only one or two gene regions (a common issue of oligo-gene trees like the one by Li et al. 2004), or very few mutations, especially when we are producing an ultrametric tree. How sensible can a divergence age estimate be when the data behind it are four mutations in the monotypic lineage and zero in its more diverse sister clade?

Cited literature and further reading (with comments).

ITS studies (some mixed with further data and results that were ignored by all-Fagales dating studies that included the data)
  • Acosta MC, Premoli AC. 2010. Evidence of chloroplast capture in South American Nothofagus (subgenus Nothofagus, Nothofagaceae). Molecular Phylogenetics and Evolution 54:235–242. See also Premoli AC, Mathiasen P, Acosta MC, Ramos VA. 2012. Phylogeographically concordant chloroplast DNA divergence in sympatric Nothofagus s.s. How deep can it be? New Phytologist 193:261–275. — Just two brilliant papers that only leave one question open: is this different in the Australasian genera of the Nothofagaceae?
  • Cannon CH, Manos PS. 2003. Phylogeography of the Southeast Asian stone oaks (Lithocarpus). Journal of Biogeography 30:211–226. — A very well-done paper that still doesn't need to fear comparison with more recent biogeographic papers on Fagales genera with access to more elaborate inference methods, while using much poorer data samples.
  • Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59:351–366. — Since this is mine, I should not give myself an assessment. Just some info: it was the sloppiest draft we ever submitted, and it passed the review process rather smoothly. But it used 600+ new ITS and 900+ new 5S-IGS sequences, and although it provided a comprehensive ITS tree (new and all data stored in gene banks), the conclusions relied mostly on networks based on inter-clonal and inter-individual distances and ML bootstrap pseudoreplicate samples. I'm pretty sure it's still hard to find a similar paper.
  • Denk T, Grimm G, Stögerer K, Langer M, Hemleben V. 2002. The evolutionary history of Fagus in western Eurasia: Evidence from genes, morphology and the fossil record. Plant Systematics and Evolution 232:213–236. — My first phylogenetic paper (using only about 100 ITS sequences) and one of my most-cited papers; published only because the editor ignored the opinions of two reviewers.
  • Denk T, Grimm GW, Hemleben V. 2005. Patterns of molecular and morphological differentiation in Fagus: implications for phylogeny. American Journal of Botany 92:1006–1016. — the follow-up paper, including all beech species.
  • Forest F, Bruneau A. 2000. Phylogenetic analysis, organization, and molecular evolution of the non-transcribed spacer of 5S ribosomal RNA genes in Corylus (Betulaceae). International Journal of Plant Sciences 161:793–806. — Likely the reason for the 2005 study by Forest et al., a great paper (especially when compared to other phylogenetic papers published in the same journal back then and much later). The reason why the 5S-IGS has rarely been studied is that it is difficult to handle (usually one needs to clone because of intraindividual length-polymorphism). But it provides an unsurpassed resolution at the intrageneric level that has only been matched in recent years by the accumulation of NGS SNP data.
  • Forest F, Savolainen V, Chase MW, Lupia R, Bruneau A, Crane PR. 2005. Teasing apart molecular- versus fossil-based error estimates when dating phylogenetic trees: a case study in the birch family (Betulaceae). Systematic Botany 30:118–133. — A pivotal, still valid study using ITS and 5S-IGS data, even though the divergence age estimates are probably much too old (an aspect demonstrating the quality of the study: back then, molecular age estimates were usually much too young). Forest and Bruneau published several other papers of equal quality on other plant groups, and I suspect there is an interesting publication story given the author list and the dissemination platform.
  • Grimm GW, Denk T, Hemleben V. 2007. Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5:291–309. — A crazy experiment, but one that, years later, would bring me my first paper in Systematic Biology [PDF] (10-times higher impact factor) because it was the only piece of science providing a way out for a young researcher in South Africa.
  • Manos PS. 1997. Systematics of Nothofagus (Nothofagaceae) based on rDNA spacer sequences (ITS): taxonomic congruence with morphology and plastid sequences. American Journal of Botany 84:1137–1155. — A typical study for the time, maybe not ground-breaking but opening an interesting path and still the basis for molecular systematics of Nothofagaceae (getting such data in the late 90s was not easy). Interestingly, no-one in Australia or New Zealand ever took the thread up (but see Knapp et al. 2005); the only properly studied genus (then a subgenus) of Nothofagaceae is Nothofagus s.str. (Acosta & Premoli 2010; Premoli et al. 2012).
  • Manos PS, Doyle JJ, Nixon KC. 1999. Phylogeny, biogeography, and processes of molecular differentiation in Quercus subgenus Quercus (Fagaceae). Molecular Phylogenetics and Evolution 12:333–349. [PDF] — The counterpart to the above for oaks; it took nearly two decades to assemble more data on American oaks than used for this study.
  • Manos PS, Stone DE. 2001. Evolution, phylogeny, and systematics of the Juglandaceae. Annals of the Missouri Botanical Garden 88:231–269. — An exemplary paper for two reasons (and despite the fact that it just shows cladograms): 1) it combined morphological and chemotaxonomic data with ITS and plastid data (rbcL-atpB and trnL-trnF intergenic spacer); 2) it pretty much got the still accepted tree. Also proof that, even 20 years ago, studies in low-impact journals were not infrequently better than those in high-flying ones. (Note the number of pages; decent research needs space!)
  • Manos PS, Zhou ZK, Cannon CH. 2001. Systematics of Fagaceae: Phylogenetic tests of reproductive trait evolution. International Journal of Plant Sciences 162:1361–1379. — For years to come the basis for Fagaceae systematics.
  • Muir G, Fleming CC, Schlötterer C. 2001. Three divergent rDNA clusters predate the species divergence in Quercus petraea (Matt.) Liebl. and Quercus robur L. Molecular Biology and Evolution 18:112–119. — Only about two species, but setting the scene: ITS evolution in Fagales (and probably any other wind-pollinated tree) can be very complex at the very basic level.
  • Ribeiro T, Loureiro J, Santos C, Morais-Cecílio L. 2011. Evolution of rDNA FISH patterns in the Fagaceae. Tree Genetics and Genomes 7:1113–1122. — A must-read for everyone using ITS data in Fagales.
Phylogenetic studies at and above family level
Betulaceae: see Forest et al. (2005) and Grimm & Renner (2013, following section).
Casuarinaceae: see 'Phylogeny' section on Stevens' Angiosperm Phylogeny Website (never bothered myself with them, since they lack ITS data).
Fagaceae: see Manos et al. (2001), tree in Denk & Grimm (2010)
  • Oh S-H, Manos PS. 2008. Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57:434–451. — The molecular basis for Fagaceae systematics.
  • Manos PS, Cannon CH, Oh S-H. 2008. Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55:181–190. — The only paper providing a tangible plastid-informed phylogeny.
  • Manos PS, Soltis PS, Soltis DE, Manchester SR, Oh S-H, Bell CD, Dilcher DL, Stone DE. 2007. Phylogeny of extant and fossil Juglandaceae inferred from the integration of molecular and morphological data sets. Systematic Biology 56:412–430. — I would have used a different set of analyses, but the paper (and the data used) provides the basis for Juglandaceae phylogenetics and systematics (see Manos & Stone 2001).
Nothofagaceae: Manos (1997), Knapp et al. (2005, following section).
Fagales dating studies (naturally including phylogenies)
  • Grimm GW, Renner SS. 2013. Harvesting GenBank for a Betulaceae supermatrix, and a new chronogram for the family. Botanical Journal of the Linnean Society 172:465–477. [PDF] — a little experiment we made and submitted to a respectable but low-impact journal because the results were not really ground-shaking. Exemplifies how I think one should harvest gene banks for dating studies (check out the supplement files), hence providing a striking contrast to the much more ambitious papers by Xiang et al. (2014) and Xing et al. (2014). In that aspect, possibly a must-read for reviewers and editors of large-scale harvest papers.
  • Knapp M, Stöckler K, Havell D, Delsuc F, Sebastiani F, Lockhart PJ. 2005. Relaxed molecular clock provides evidence for long-distance dispersal of Nothofagus (Southern Beech). PLoS Biology 3:e14. — A very interesting paper, because it rejects two of the scenarios later tested by Sauquet et al. (2012) and found to produce strange estimates; also, it provides some new sequences of higher quality, none of which was included in the 2012 paper. The author list is quite interesting, too: the last author (GoogleScholar) was the only botanist who challenged tree-thinking from the very start and embraced splits graphs as an alternative to trees. The fourth author wrote a classic paper that everyone working with big data should have read: Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361–375.
  • Sauquet H, Ho SY, Gandolfo MA, Jordan GJ, Wilf P, Cantrill DJ, Bayly MJ, Bromham L, Brown GK, Carpenter RJ, Lee DM, Murphy DJ, Sniderman JM, Udovicic F. 2012. Testing the impact of calibration on molecular divergence times using a fossil-rich group: the case of Nothofagus (Fagales). Systematic Biology 61:289–313 — in principle, an interesting idea; unfortunately, the instability of dating estimates observed may be mostly due to data artifacts. The authors use unrepresentative, old data (which is puzzling, since the understudied Nothofagaceae grow in Australia, New Zealand and French New Caledonia, and the authors are from France, Australia and New Zealand) including not a few editing/sequencing artifacts, insufficient sampling and internal signal conflict caused by combining low-divergent plastid genes and introns with high-divergent ITS data. The main test compares apples (Nothofagaceae) with pears (the rest of Fagales as sister clade); for details see this draft [PDF], which I put together for applications (the data documentation of Sauquet et al. is exemplary, hence it was very easy to look into the data basis).
  • Xiang X-G, Wang W, Li R-Q, Lin L, Liu Y, Zhou Z-K, Li Z-Y, Chen Z-D. 2014. Large-scale phylogenetic analyses reveal fagalean diversification promoted by the interplay of diaspores and environments in the Paleogene. Perspectives in Plant Ecology, Evolution and Systematics 16:101–110 — an ambitious experiment, with even more data-related problems than the study of Sauquet et al. While Sauquet et al. used placeholder sequences for each included genus (and dropped some because their data inflicted too much topological ambiguity), Xiang et al. blindly harvested all data of commonly sequenced plastid "barcodes" (rbcL, matK, trnL/LF region, rbcL-atpB spacer) to infer a species-level tree. Outdated, invalid taxa were not corrected for; the gene sample used can show little to no variation below the genus level (which makes dating, and barcoding, impossible). Furthermore, plastid diversification is partly or fully decoupled from speciation processes in the four genera that have been studied using more than a single individual per species (Nothofagus s.str., Fagus, Quercus, Ostryopsis).
  • Xing Y, Onstein RE, Carter RJ, Stadler T, Linder HP. 2014. Fossils and large molecular phylogeny show that the evolution of species richness, generic diversity, and turnover rates are disconnected. Evolution 68:2821–2832 — very similar to the Xiang et al. approach but even more flawed (poor control over the data used, poor selection of markers, several problems with the dating approach, which is the basis for estimating the crucial turnover rates). Xiang et al. and Xing et al. show what happens when large-scale meta-analyses are conducted by researchers with no idea about the studied organisms.
  • Zhang J-B, Li R-Q, Xiang X-G, Manchester SR, Lin L, Wang W, Wen J, Chen Z-D. 2013. Integrated fossil and molecular data reveal the biogeographic diversification of the eastern Asian-eastern North American disjunct hickory genus (Carya Nutt.). PLoS ONE 8:e70449. — Focuses on one genus but includes data from all Juglandaceae and gives a typical example for plant biogeographic studies using dated trees (the fourth author is the expert on the fossil record of Juglandaceae, so there are few data issues). It's open access and quite short; give it a read and then try to figure out what the point of the paper is (I looked at the provided data matrix, too, and found quite interesting genetic patterns that completely escaped the authors; it is never wrong to look over your alignment when this is still possible).
Other cited literature
  • Grímsson F, Grimm GW, Zetter R, Denk T. 2016. Cretaceous and Paleogene Fagaceae from North America and Greenland: evidence for a Late Cretaceous split between Fagus and the remaining Fagaceae. Acta Palaeobotanica 56:247–305.
  • Heenan PB, Smissen RD. 2013. Revised circumscription of Nothofagus and recognition of the segregate genera Fuscospora, Lophozonia, and Trisyngyne (Nothofagaceae). Phytotaxa 146:1–31.
  • Heřmanová Z, Kvaček J, Friis EM. 2011. Budvaricarpus serialis Knobloch & Mai, an unusual new member of the Normapolles complex from the Late Cretaceous of the Czech Republic. International Journal of Plant Sciences 172:285–293.
  • Manchester SR. 1987. The fossil history of the Juglandaceae. St. Louis: Missouri Botanical Garden. [book-like paper]