Showing posts with label EDA. Show all posts

Monday, March 18, 2019

Which US cities are best for walking, biking and public transport?


In the modern world, there is a lot of discussion about the environmental damage caused by cars and trucks, not least due to their involvement in global climate change. The pro-active parts of this discussion revolve around banning cars, so that parts of cities and towns can return to pedestrian areas (eg. Life in the Spanish city that banned cars; The automotive liberation of Paris), and encouraging alternative modes of transport, particularly bicycles (eg. Copenhagenize your city: the case for urban cycling; Britain wants cycle-friendly cities).

In particular, some cities throughout the world are taking active steps to improve the "walkability" of their centers, including Addis Ababa, Auckland, Denver, Hanoi, London, Manchester and San Francisco (What would a truly walkable city look like?), and the "cyclability" of their inner suburbs, including Calgary, Copenhagen, Eindhoven, Lidzbark, Purmerend, San Sebastian, Utrecht and Vancouver (Top 10 pieces of cycling infrastructure: which country does it right?). On the other hand, there are some cities that have not yet tried to do much about cycling, including Beijing, Cairo, Delhi, Hong Kong, Moscow, Mumbai, Nairobi, Orlando, São Paulo and Sydney (Top 10 worst cities for cycling).


The USA is not usually considered to be at the forefront of this movement, having long ago wedded itself to the cult of the private motor car. However, this does not mean that US cities are all the same in terms of non-car transportation. For example, the Walk Score site, which is part of the Redfin real estate organization, provides a ranking of all US cities and neighborhoods with a population of 200,000 or more, in terms of how friendly they are for: walking, biking and transit.

The ranks are based on a score out of 100 for each location, using various methodologies:
— Walk Score analyzes hundreds of walking routes to nearby amenities; points are awarded based on the distance to amenities in each category.
— Bike Score is calculated by measuring bike infrastructure (lanes, trails, etc), hills, destinations and road connectivity, and the number of bike commuters.
— Transit Score assigns a "usefulness" value to nearby transit routes based on their frequency, type of route (rail, bus, etc), and distance to the nearest stop on the route.
Our interest here is in combining these three pieces of information into a single picture, showing which cities are generally good, at the moment.

Not unexpectedly, the Walk Score and Transit Score are highly correlated (86% shared rankings), while the Bike Score is not as highly correlated with either of these (49% and 42%, respectively). This means that the same cities tend to be good for the first two criteria. The three best cities for the Walk Score are New York, Jersey City and San Francisco, while the top two for the Transit Score are New York and San Francisco. On the other hand, for the Bike Score the top two are Minneapolis and Portland — it would be difficult to imagine either New York or San Francisco as being good for biking!

If we define a "good" score as being >70, then only San Francisco has a score for all three criteria >70, although Boston comes close. On the other hand, Pittsburgh and Washington D.C. have the most consistent scores across the board, because they have uniformly middle-rank scores.
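The two criteria used above (all three scores above 70, and consistency across the three scores) can be written down compactly. The following is only a minimal sketch: the city names and score values are placeholders invented for illustration, not the Walk Score data, and measuring "consistency" as the range between a city's highest and lowest score is an assumption on my part.

```python
# Placeholder (walk, bike, transit) scores; NOT the real Walk Score values.
scores = {
    "City A": (88, 72, 80),
    "City B": (82, 69, 73),
    "City C": (62, 60, 61),
    "City D": (45, 70, 30),
}

GOOD = 70  # threshold for a "good" score, as defined in the text


def good_on_all(scores, threshold=GOOD):
    """Return the cities whose three scores all exceed the threshold."""
    return [city for city, vals in scores.items()
            if all(v > threshold for v in vals)]


def score_range(vals):
    """Spread between best and worst score; small means consistent."""
    return max(vals) - min(vals)


def most_consistent(scores):
    """Return the city with the smallest spread across its three scores."""
    return min(scores, key=lambda city: score_range(scores[city]))


print(good_on_all(scores))
print(most_consistent(scores))
```

Note that a consistent city is not necessarily a good one: in this toy table the most consistent city has uniformly middling scores.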

Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, we calculated the similarity of the cities using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-city similarities.
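For readers who want to see what the distance step involves, here is a minimal sketch of a Manhattan distance matrix in plain Python. The city names and scores are invented placeholders, and the function names are hypothetical; the Neighbor-net itself is not computed here, since that needs dedicated network software that takes such a distance matrix as input.

```python
from itertools import combinations


def manhattan(u, v):
    """Sum of absolute differences between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(table):
    """Build a symmetric dict-of-dicts distance matrix from {name: vector}."""
    names = list(table)
    dist = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = manhattan(table[n], table[m])
        dist[n][m] = dist[m][n] = d
    return dist


# Placeholder (walk, bike, transit) scores, invented for illustration only.
cities = {
    "City A": (88, 72, 80),
    "City B": (82, 69, 73),
    "City C": (45, 70, 30),
}

dist = distance_matrix(cities)
for name in cities:
    print(name, [dist[name][other] for other in cities])
```

Because all three scores are on the same 0-100 scale, no rescaling is applied before summing the differences in this sketch.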

The resulting network of the 98 cities with complete data is shown in the figure. Cities that are closely connected in the network are similar to each other based on how good they are for walking, biking and transit, and those cities that are further apart are progressively more different from each other. The color-coding for the cities is from Megaregions of the United States.


The network generally shows decreasing walking / transit scores from top to bottom, and decreasing biking scores from right to left. We have labeled only the top group of 29 cities, which are distinctly "better" than the remaining 69, plus four unusual cities (at the middle-left).

Note that, as expected, New York, San Francisco and Boston stand out at the top of the network. Note, also, that Minneapolis and Portland are separated in the network from the other cities, because of their high Bike Scores — all of the other cities in the top group have much lower biking scores. Newark, in particular, has a low biking score. New Orleans is at the bottom-left of this group because it has a low Transit Score but not Walk Score.

For the four unusual cities, separated at the left of the bottom group: Dallas has a low Transit Score, and Atlanta, Cincinnati and San Diego all have a low Bike Score.

The city at the very bottom-left of the network, which has the lowest score on all three criteria, is Arlington TX. Along the same lines, there is an online graph of The 10 most dangerous states for cyclists, showing Florida way out in front.

Finally, you should be warned about potential problems with rankings like these, based on only a few selected criteria. For example, the real estate site StreetEasy recently tried to compile a list of the 10 Healthiest Neighborhoods in New York City, and ended up listing the Brooklyn industrial area of Red Hook as number 1, which engendered a couple of negative comments, such as:
I guess the fact that the majority of Red Hook’s parkland has been closed for many years due to lead contamination, or the fact that we have one of the highest asthma rates in the city, was overlooked for this study.
Caveat emptor!

Monday, March 4, 2019

Has homoiology been neglected in phylogenetics?


In a recently published pre-print on PaleorXiv, Roland Sookias makes a point for distinguishing between parallelism, ie. shared inherited traits that can be found in some but not all of the offspring of a common ancestor, and convergences in a strict sense, involving similar traits that are not homologous. The former is also known as homoiology, a term Sookias attributes to Ludwig Plate.

As a geneticist working mostly at the tips of the Tree of Plant Life, I'm quite familiar with the (pre-Hennigian) concept: we much more often than not lack Hennig's 'synapomorphies', ie. shared, derived traits exclusive to an evolutionary lineage. But we have many highly diagnostic character suites including 'shared apomorphies' (I think that the angiosperm phylogeneticist Jim Doyle coined the term) that collect the same species or higher taxa, eg. groups of taxa that also form highly supported clades in molecular trees, but are not exclusive. In every plant group you can additionally observe that certain traits are exclusive to some members of one lineage, because the lineage has the genetic-physiological prerequisites to express these traits, while their sister lineages or distant relatives lack this potential. Epigenetics deals with tendencies to express a trait in response to the environment without even changing the genetic code.

If you look close enough, you can find such patterns even at the molecular level.

Molecular evolution of the 5' half of the ITS1 in beeches. Each sequence motif is assigned a state (Ax, Bx etc; x = 0 represents the ancestral state, x > 0 are derived states) and evolution involves usually the gain ("+") or loss ("-") of sequence motifs including some potential genetic homoiologies (see here for context and references).

However, it has apparently been ignored by my fellow paleontologists: Sookias wants to discuss "the neglected concept of homoiology ... in the context of palaeontological phylogenetic methods". Paleontological phylogenetic methods are, of course, tree inferences, and the idea is that recognition of homoiologies can be a means of establishing node support or to "help to choose between equally parsimonious or likely trees". He provides an R function "to calculate two measures for a given tree and matrix: (a) the potential support for clades based on potential homoiologies; and (b) the fit of the tree to all states given the concept of homoiology".

Sookias provides a nice and concise introduction to the problem with some examples, and makes the connection to linguistics (see also Mattis' and my post on the Chinese dialects continuum: How languages lose body parts); so, give the short paper a read. Like all paleontological literature it is strongly influenced by cladistic views, such as that life is monophyletic, and it revolves around the central theme of how to get better supported trees.

My inner geneticist has a principal problem with such a goal, because there has (to my knowledge) not been a single morphology-based tree that was fully congruent to a molecular tree with sufficient taxon and gene sampling, which applies also to the real-world data example that Sookias chose (as we will see).

My inner paleontologist also knows that there are highly diagnostic morphs in the fossil record, but diagnostic character suites and morphs reflect as many paraphyla as monophyla. He also knows that the fossil record, provided you find the right fossil from the right time, may alter your perspective on ancestral and derived character states.

An inferred tree (see this post). Given the inferred tree (quasi-dated tree), we would assume that star shapes are primitive (a symplesiomorphy) within the Pointish lineage, and possibly 10-tipped stars; and conclude that the Tenstars are paraphyletic. Greenish is clearly ancestral (a Pointish symplesiomorphy), and bluish derived (a Polygonia synapomorphy).
If we have the full picture, we can confirm star shapes are symplesiomorphic within the Pointish (the first common ancestor being a five-pointed colorless star). However, all greenish stars form a monophylum not a paraphylum.
Having ten tips is a synapomorphy of the monophyletic Tenstars.

So, why should we aim to get more resolved, better supported, morphology-based trees? Any such tree will inevitably include wrong branches!

I argue that, instead, we should just explore the signal in our data matrices using networks. Any potential tree is included in a network. But networks are more comprehensive because they provide not only a single tree but alternative, competing trees. By visualizing the alternatives, we can discern between mere convergence (random similarity), homoiology (parallelism, convergence related to descent), symplesiomorphy (shared, lineage-consistent primitive traits) and synapomorphy (lineage-unique and consistent shared derived traits), which can be very tricky with just a tree. Thus, we can try to evaluate which evolutionary scenario best explains all our data.

Compatibility

The basic problem when using morphological and such-like data sets to infer phylogenies is that most of the scored characters are, to some degree, incompatible with the true tree, ie. the actual evolutionary pathways.

Let's take a hypothetical evolution (no reticulations), in which the x-axis represents the morphological diversification and the y-axis time.


As in real-world data, sister taxa (eg. Species A and B) may have different levels of morphological derivation compared to their common ancestor(s). This leads us to this unrooted true tree in which the branch lengths are proportional to the real (above) amount of change.

Unrooted representation of the above evolution.
All commonly used tree inferences infer unrooted trees.

The only characters providing a taxon bipartition that is fully compatible with the true tree are Hennig's 'synapomorphies':

Clade A–D shares a unique, derived trait.
The character split is fully compatible with the true tree.

Next come Hennig's 'symplesiomorphies' (Sookias' R-script discards them):

Blue is the ancestral state within the ingroup, lost/modified in Species A.
The character split is compatible with the true tree except for A.
In phylogenetic inference, symplesiomorphies will usually stabilize the topology
as there will be enough other characters supporting A as sister of B and Clade A–D(–F).

Homoiologies / parallelisms can be partly compatible:

Blue is a homoiology found in 50% of the species composing Clade A–F.
The character split supports the sister relationship of A and B (compatible aspect)
but joins them with F (incompatible aspect).
A, B and F belong to the same monophylum/clade (semi-compatible aspect).
As long as homoiologies are confined to otherwise
coherent (or flat) subtrees, they will contribute to the overall decision capacity of the data.

Note that without a molecular backbone tree, it may be impossible to distinguish homoiologies from symplesiomorphies – whether a trait will be resolved as either the one or the other in a tree depends solely on its frequency and distribution across the subtree, and the situation in outgroups.

Purple is the plesiomorphy of the ingroup, blue the homoiology
found in members of Clade A–F, evolved twice
Considering the phylogenetic root-tip distances in the true tree, it makes sense that blue is the plesiomorphy of the ingroup retained in the shorter branching members, and purple a homoiology found in the most derived sublineages (again, evolved twice).
Both scenarios require three steps, but probabilistic character mapping methods would prefer the second scenario as they assume the longer the internal branches, the higher the likelihood for a change. To dismiss symplesiomorphies, Sookias' script infers the ancestral state of the MRCA of a clade and only considers states as homoiologies that differ from the inferred ancestral state (the cut-off value can be modified to "less stringently exclude potential symplesiomorphies as homoiologies").
 
Doyle's 'shared apomorphies' are locally compatible:

Blue is a shared apomorphy of the GH lineage, convergently evolved in the
outgroup (see original tree above: the GH lineage is a strongly derived
ingroup lineage evolving into the direction of the outgroup
in contrast to the remainder of the ingroup).
The example above also illustrates how shared apomorphies may trigger branching artifacts such as ingroup-outgroup long-branch attraction. Imagine that GH is not the first diverging branch of the ingroup but instead a strongly derived sublineage nested within Clade A–F, and that we lack the short-branching sister-group but have a large outgroup sample. Any ingroup-outgroup shared apomorphies will then draw GH towards the outgroup-defined ingroup's root, and will be detrimental to inferring the true tree.

Convergence in a strict sense, ie. superficial or random similarity, is incompatible with the true tree:

Blue is a randomly distributed derived state found in all longer-branched taxa.

A tree-incompatible signal is, naturally, best handled using a network and not by forcing it into a single tree. Unless, of course, we have a sensible molecular tree and can go for total evidence approaches assuming the molecular tree reflects the true tree.
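The notion of compatibility used in this section can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the tree topology below, ((((A,B),(C,D)),(E,F)),(G,H)) plus a single outgroup taxon "Out", is my guess at a topology consistent with the text (the exact placement of C–F is not given), and the function names are hypothetical. Two splits are compatible when at least one of the four intersections between their sides is empty; a character split is then checked against every split of the tree.

```python
TAXA = frozenset("ABCDEFGH") | {"Out"}

# Assumed true-tree splits (one side of each bipartition is listed).
TREE_SPLITS = [
    frozenset("AB"),
    frozenset("CD"),
    frozenset("EF"),
    frozenset("ABCD"),
    frozenset("ABCDEF"),
    frozenset("GH"),
]


def compatible(side_a, side_b, taxa=TAXA):
    """Two splits are compatible if at least one of the four pairwise
    intersections of their sides is empty."""
    a, b = frozenset(side_a), frozenset(side_b)
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


def conflicts(character_split, tree_splits=TREE_SPLITS):
    """Return the tree splits that the character split contradicts."""
    return [s for s in tree_splits if not compatible(character_split, s)]


synapomorphy = frozenset("ABCD")  # state shared by all of Clade A-D only
homoiology = frozenset("ABF")     # state shared by A, B and F

print(conflicts(synapomorphy))    # no conflicting tree split
print(conflicts(homoiology))      # clashes with the E+F and A-D splits
```

Under these assumptions, the synapomorphy conflicts with nothing, while the homoiology conflicts with two of the six internal splits but remains compatible with the A+B and A–F splits, which is the "partly compatible" behavior described above.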

PS: Also, in molecular data the characters incompatible with the true tree may outnumber the compatible ones, but there we have many more characters and (usually but not always) a lot that are not filtered by negative or positive selection. Our stochastic molecular models are for sure never accurate enough to model molecular evolution for our sequences, but apparently precise enough for most applications. Even before next generation sequencing and big data, molecular phylogenies outshone morphological phylogenies, something that paleontologists cannot afford to ignore any more; this is not because the data are much better (to infer evolution) but because the patterns and processes are much less complex.

Sookias' data example, crocodiles and relatives

The supplement of Sookias' paper includes a morphological character matrix for crocodilians and the resulting molecular tree for the group. Here's Sookias' fig. 3, using these data to make his point for how to select the better-fitting tree using homoiology recognition:


Now, the unsolved problem is: if we don't have a molecular tree, how can we possibly know 0 is a homoiology and not a symplesiomorphy, 1 not a reversal (scenario B) or likely convergence (scenario C), hence, B should be preferred over C (the legend has a little typo, cf. Sookias 2019, p. 3, l. 34)?

The matrix provided as the example is not the best one to make this point. Sookias' script, when stringently eliminating potential symplesiomorphies, identifies, using the molecular tree as input, one potential homoiology for the Crocodylinae, five for their larger clade (including Gavialis and Tomistoma), and one for the alligators' larger clade in a matrix with 117 characters. Less than 10% can hardly be a game-changer.

What the morpho-data shows

Furthermore, the morphological matrix will give us a single most-parsimonious tree (MPT, using PAUP*'s Branch-and-Bound algorithm), not two or more equally parsimonious alternatives that we need to weigh against each other.

The single most-parsimonious tree that can be inferred from the morpho-matrix (236 steps, CI = 0.64, RI = 0.84). Red branches are conflicting with the topology of the molecular (truer?) tree (green brackets).
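For readers unfamiliar with the two indices quoted in the caption, here is a minimal sketch of the standard ensemble definitions: CI = m/s and RI = (g − s)/(g − m), where s is the number of steps observed on the tree, m the minimum conceivable number of steps, and g the maximum conceivable number of steps for the matrix. The numbers in the example are toy values chosen for easy arithmetic, not values from the crocodile matrix.

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble CI: minimum conceivable steps divided by steps on the tree."""
    if observed_steps <= 0:
        raise ValueError("observed_steps must be positive")
    return min_steps / observed_steps


def retention_index(min_steps, max_steps, observed_steps):
    """Ensemble RI: (g - s) / (g - m), with g = max, m = min, s = observed."""
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Toy values for illustration only: m = 10, g = 30, s = 15.
m, g, s = 10, 30, 15
print(round(consistency_index(m, s), 2))   # 0.67
print(round(retention_index(m, g, s), 2))  # 0.75
```

Both indices equal 1 when the tree requires no homoplasy at all (s = m); values below 1 indicate that some characters change more often on the tree than strictly necessary.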

Some of the red branches are supported by pseudo-synapomorphies, which, on the background of the molecular tree, are potential homoiologies for the comprising clade, however, interpreted as symplesiomorphies by Sookias' script (provided the molecular branch-lengths are sufficient, they might be recognized when using a probabilistic framework to infer the ancestral states).

Not a good example for Sookias' goal, but the matrix shows the limitations of trees when it comes to morphological differentiation. Here's the distance-based, 2-dimensional network for the morphological data:

A Neighbor-net based on Sookias' morphological matrix.
The arrow indicates the position of the assumed root.

The signal from the morphological matrix is quite tree-like, and the structure of the left part of the network is synonymous with that of the single MPT (and the molecular tree). On the right-hand side, we find more complexity than we would expect from the single MPT. The data signal is not trivial regarding the position of the root as inferred by Bernissartia; nor is the placement of Gavialis and Tomistoma (pink edge bundles), two genera producing a very prominent box-like structure. Called by cladists a "phenetic" approach, the distance-based network is nonetheless straightforward regarding the identification of monophyletic groups (green) and potential monophyletic groups (yellow) (the latter always include the particular alternative seen in the single MPT as well, in case of the pink box, also the molecular alternative). The light green monophylum is a necessary consequence of the prior knowledge about the position of the root, and the likely monophyly of Alligator and its relatives (the tree-like subgraph with long internal branches = lots of uniquely shared traits, including potential synapomorphies).

Potential synapomorphies that can be inferred from the morpho-matrix alone by mapping the states onto the network. Red, homoiologies reconstructed as synapomorphies ('pseudo-synapomorphies') and (except for one) excluded as potential symplesiomorphies by Sookias' test run of his script (strict and relaxed cut-off).

The network provides more information than can be extracted from the MPT: one Crocodylus is significantly closer to Osteolaemus (the neighborhood defined by the light blue edge bundle, see Sookias' fig. 3A). Crocodylus, however, is likely monophyletic, being generally very similar; and the third genus, Mecistops, is closely linked to (all of) them (neighborhood defined by the dark blue edge). An inclusive common origin (including the third genus, Mecistops) is – just based on morphology and without using a "phylogenetic" tree inference – beyond question, even though we lack syn- or shared apomorphies (short corresponding edge bundle): Mecistops is obviously closely related to Crocodylus, and Osteolaemus is related to part of the latter, so it's not a bad hypothesis that all three are descendants of the same common ancestor, and that Tomistoma (and Gavialis) branched off the lineage before the Crocodylinae radiated. The only alternative explanation would be that the Crocodylinae show the primitive morphs of the entire lineage, and that the position of Tomistoma and Gavialis is affected by long-branch (-edge) attraction (however, if that is the case then we should have found a Tomistoma-Gavialis clade in the MPT, because parsimony will always get it wrong in the Felsenstein zone).

The main flaw

But any morphology-based alternative using this data matrix is not fully compatible with the molecular tree, which places Mecistops and Osteolaemus as sister to Crocodylus. Here's the consensus network based on 10,000 bootstrap pseudoreplicate BioNJ trees inferred from the morpho-matrix, highlighting the support for splits compatible with the molecular tree (green) and their competing, partly incongruent (red edge bundles) alternatives (I do the information transfer manually, but those with R-scripting skills can use the functions in the phangorn library; Schliep et al., MEE, 2017; see also David's post):

NJ-Bootstrap (BS) consensus network based on 10,000 pseudoreplicates.
Edges/splits corresponding to clades in the molecular tree
(see Sookias' fig. 3 above) in green, those conflicting the molecular tree in red.
Edge values show BS support (edge-lengths are proportional to NJ-BS support),
while asterisks indicate the branches seen in the MPT.
Obviously, there is some signal in the morpho-matrix compatible with the molecular clades (this can be synapomorphies, symplesiomorphies, homoiologies or shared apomorphies) clashing with the signal of pseudo-synapomorphies etc. supporting the topological alternatives seen in the morpho-based MPT.
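The resampling step behind such bootstrap support values can be sketched as follows. This is only an illustration under explicit assumptions: the three-taxon matrix is a toy example, the function names are hypothetical, and the mean character difference is my choice of distance (the post does not say which distance was fed to BioNJ). The tree inference for each pseudoreplicate and the counting of splits across trees are deliberately left out.

```python
import random


def bootstrap_matrix(matrix, rng):
    """Resample characters (columns) with replacement; taxa (rows) are kept."""
    n_chars = len(next(iter(matrix.values())))
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: [states[j] for j in picks]
            for taxon, states in matrix.items()}


def mean_character_distance(u, v):
    """Proportion of characters scored differently, skipping missing ('?')."""
    pairs = [(a, b) for a, b in zip(u, v) if a != "?" and b != "?"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def pseudoreplicate_distances(matrix, n_reps, seed=1):
    """Yield one pairwise-distance dict per bootstrap pseudoreplicate."""
    rng = random.Random(seed)
    taxa = sorted(matrix)
    for _ in range(n_reps):
        rep = bootstrap_matrix(matrix, rng)
        yield {(s, t): mean_character_distance(rep[s], rep[t])
               for i, s in enumerate(taxa) for t in taxa[i + 1:]}


# Toy matrix, invented for illustration (5 binary characters, '?' = missing).
toy = {
    "Taxon1": list("00110"),
    "Taxon2": list("0011?"),
    "Taxon3": list("11001"),
}

reps = list(pseudoreplicate_distances(toy, n_reps=100))
print(len(reps))
```

Each of the distance dictionaries would then be passed to a tree-building method, and the frequency with which a split appears among the resulting trees is its bootstrap support.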

Assuming the molecular tree is correct, the above reconstruction means that Osteolaemus is morphologically more derived, and hence placed as sister, while Mecistops and Crocodylus retain more primitive character states, and hence lack discriminatory derived traits; this is a sort of local ingroup-outgroup long-branch attraction (or 'short-branch culling').

What differentiates the Crocodylinae? Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies); red, shared apomorphies (convergence). The Mecistops-Crocodylus pseudo-monophylum is mostly supported by traits shared between Osteolaemus and distant siblings (taxa of the larger alligator clade) and/or the outgroup.

We can also hypothesize that the initial radiation was fast, because the Mecistops-Osteolaemus ancestor did not accumulate a single, unique, discriminating character trait.

Excess of shared derived, pseudo-synapomorphic traits is the reason Tomistoma is not resolved as sister of Gavialis in the MPT — the molecular Gavialis-Tomistoma clade is represented by a morphological grade.

A 'splits rose' showing the basic splits. Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies of the larger clade including Crocodylinae); pink, pseudo-synapomorphies (deep homoiologies or symplesiomorphies of the larger Crocodylinae clade); orange, shared ancestral (plesiomorph) or derived traits (convergent). 

And the homoiologies identified using the molecular tree as input cannot put things right. They are just partly compatible with unproblematic splits, ie. the larger clade including Alligator (character #7), the larger clade including Crocodylinae (#1, #18, #73, #74, #117) or exclusive to the Crocodylinae (#66).

Character mapping of the molecular-inferred homoiologies. The lush green splits represent the molecular splits.

However, if we are ignorant of the molecular tree, we would have to assume that Mecistops is the sister to Crocodylus, and that some of their shared traits not found in Osteolaemus are shared apomorphies (if occurring outside the clade and in the sister clade) or even synapomorphies (if exclusive for Mecistops + Crocodylus), while only those shared by Osteolaemus and C. porosus (#66) can be homoiologies. We also would have no reason to challenge the Gavialis-Tomistoma grade, until we infer networks.

Map of all potential synapomorphies (bold), symplesiomorphies (italics) and homoiologies (plain font) using the morphology-based Neighbor-net as basis. Red, pseudo-synapomorphies: splits seen in the MPT (with or without an alternative in the Neighbor-net) but rejected by the molecular tree.

This is the main flaw of Sookias' idea. To identify homoiologies, we need the same prerequisite as for any of Hennig's concepts: we need to know the true tree. If we use the inferred tree based on the same data that we want to weight (here: use homoiologies for decision making or as a means of node support), then we propagate first-level errors and apply circular reasoning. The red-marked pseudo-synapomorphies in the network above are an example; vice versa, all actual (molecular-wise) synapomorphies supporting the molecular Gavialis-Tomistoma clade (dark purple split) would be reconstructed as homoiologies or symplesiomorphies based on the morpho-based single MPT (or morpho-based NJ tree, or probabilistic tree).

And if we have an independent molecular tree, it will make the decision on the fly: putative synapomorphies are the traits that are fully compatible, symplesiomorphies, homoiologies and shared apomorphies are decreasingly compatible, and random convergences are incompatible with the molecular tree.

It is not homoiology but tree-incompatible signal that is neglected in phylogenetics

Sookias points out: "In inference of phylogeny by parsimony, an occurrence of a character state in a part of a tree separated from it by another state is considered simply a homoplasy, and a tree where the occurrences are nearer or further from one another is not more or less parsimonious ... a tree where the 15 occurrences are nearer or further from one another is not more or less parsimonious". In principle, this is true, but has little consequence in application.

We, usually without realizing it, make frequent use of the discriminating power of potential homoiologies. See the example above, but also when, eg., placing fossils in a molecular framework or doing post-inference character weighting. In these cases, homoiologies (and symplesiomorphies) will stabilize the inference and increase support. For better and worse:
  • Better, because homoiologies will ensure that the fossil is placed in the right molecular-based subtree, and can compensate for the lack of synapomorphies. Imagine an extinct fossil sibling lineage showing only homoiologies shared by Osteolaemus and C. porosus. Using tree-based optimization (eg. RAxML's 'evolutionary placement algorithm'), it would be placed close to the Crocodylinae ancestor, likely next to Osteolaemus. Using a Neighbor-net, it would be placed between Osteolaemus and C. porosus. Either way, the homoiologies would ensure it is nested within the Crocodylinae.
  • Post-inference character weighting, as implemented in eg. TNT, will downweight inferred convergences (ie. higher homoplasy, more stochastically distributed across the tree) more than putative homoiologies (ie. less homoplastic since confined to a single subtree). This can be better or worse. How do we avoid what happened for the crocodiles, where homoiologies are not recognized as such but instead support (somewhat) misleading clades (ie. act as synapomorphies)? Clades are commonly interpreted as a sufficient criterion to determine monophyly; however, they are not even a necessary one: taxa can be part of a monophyletic group despite not forming an inclusive subtree (ie. clade in a rooted tree) such as the genus Caiman or Gavialis-Tomistoma.
Hence, we should discourage any form of data-self-dependent or post-analysis weighting and instead just explore the signal in our data, using networks.

One thing is also obvious from the crocodile example: if we have enough signal in the morphological data, then we may get one or another thing wrong and, in some cases, may not be able to decide between one or another alternative. However, overall, the morphological differentiation pretty well captures what the genes provide us as the best approximation of the true tree. This holds even when the matrix includes very few potential synapomorphies and clear homoiologies but a lot of shared apomorphies, most of which were convergently evolved in parts of both major clades.

At least, this will be so when we analyze the data using networks and not just trees (compare the single MPT to the networks).

Using the alternative evolutionary scenarios provided by the networks, we can then look back into our data (see the maps above), to see what may be a homoiology, a symplesiomorphy (very useful for deciding between evolutionary scenarios, as well) or a synapomorphy. The phangorn library (used for Sookias' script) now has network functionality and allows transferring information between trees and networks. An R-affine person may be able to extract lists of potential (partly competing) synapomorphies, symplesiomorphies, and homoiologies directly from the network showing all possible or the most likely trees.

And then use this information to eg. place fossils in a phylogenetic context, or reconstruct evolutionary trends in extinct groups of organisms — reconstruction of evolutionary trends in extant organisms should always rely on morphological data analyzed in a molecular-phylogenetic framework.

Data

A NEXUS-version of Sookias' test matrix (slightly annotated for Mesquite, simple version for PAUP*), tree- and distance matrix files have been added to my figshare collection of morphological matrices.  


Monday, February 11, 2019

A network analysis of basic leisure-time activities


Social scientists like to compile information about what human beings do with their time, day and night. Some of that time is called "work time", where we often have little control, and the rest is "leisure time", during which we have at least some control over the time we spend on each activity. This blog post looks at how much time people in different countries allocate to some of their different leisure-time activities.


The data are taken from the American Association of Wine Economists' Facebook page: Leisure Time Spent in OECD Countries. The five leisure-time activities included in the dataset are:
  • Eating & drinking
  • TV & radio
  • Sports
  • Shopping
  • Sleeping
The hours for these five activities turn out to account for about half of the 24-hour day (46-56%, depending on the country). The data cover 24 of the 36 OECD countries*, plus 3 others (China, India and South Africa). The interest here is to explore the similarities between the people of different countries, in terms of how they allocate their leisure time (on average).

Since these are multivariate data, one of the simplest ways to get an overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, I first normalized the data within each of the five activities, and then calculated the similarity of the countries using the Manhattan distance. A Neighbor-net analysis was then used to display the between-country similarities.
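The reason for normalizing first is that the activities are on very different scales (sleeping takes hours, sports only minutes), so the raw Manhattan distance would be dominated by the longest activities. Here is a minimal sketch of that two-step procedure. The country names and minute values are invented placeholders, and min-max rescaling to 0–1 is an assumption: the text above says only that the data were normalized within each activity, not how.

```python
def min_max_normalize(table):
    """Rescale each column (activity) to the range 0..1 across all rows."""
    names = list(table)
    n_cols = len(next(iter(table.values())))
    columns = [[table[n][j] for n in names] for j in range(n_cols)]
    scaled = {n: [] for n in names}
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        for n, value in zip(names, col):
            # A constant column carries no information; map it to 0.
            scaled[n].append((value - lo) / span if span else 0.0)
    return scaled


def manhattan(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Placeholder minutes per day (eating, TV, sports, shopping, sleeping);
# invented for illustration, NOT the OECD figures.
minutes = {
    "Country A": (130, 100, 20, 30, 510),
    "Country B": (60, 150, 20, 20, 530),
    "Country C": (95, 125, 40, 25, 500),
}

scaled = min_max_normalize(minutes)
print(manhattan(minutes["Country A"], minutes["Country B"]))  # raw minutes
print(manhattan(scaled["Country A"], scaled["Country B"]))    # rescaled
```

After rescaling, each activity can contribute at most 1 to the distance between two countries, so no single activity dominates the comparison.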

The resulting network is shown in the first figure. Countries that are closely connected in the network are similar to each other based on the relative times allocated to the leisure-time activities, and those countries that are further apart are progressively more different from each other.


Clearly, there is considerable diversity between the countries. Moreover, there is very little in the way of consistent patterns in the network — it is basically a single "starburst" pattern. So, we may first conclude that the people of the different countries basically all go their own way, when it comes to allocating their leisure time.

Some of the network associations may result from historical or cultural similarities, such as the closeness of Japan and South Korea in the network. However, this clearly does not apply in other cases — for example, Spain and Portugal are not near each other, and neither are Australia and New Zealand, nor are Denmark, Norway and Sweden. Cultural generalizations seem therefore not to be supported by the data.

India and South Africa both stand out from the rest of the network, indicating that their people behave differently to all of the other countries (on average). Notably, both countries have very short times allocated to Sports and to Shopping. India also has rather short TV/radio time and a long Sleeping time, while South Africa has the longest Sleeping time of all of the countries (45 min longer than the country average!).

The USA has relatively short Eating/drinking time, a long Sleeping time, and the longest TV/radio time of all. That is, Americans spend less time on eating & drinking than most other people, and use the time gained for watching TV and sleeping, instead.

Of the other countries, France has the longest time spent on Eating/drinking, followed by Denmark and Italy, and then Japan and South Korea. Canada and the United Kingdom, on the other hand, actually have the shortest Eating/drinking times of all of the countries. Spain has a relatively short Eating/drinking time and the longest time of all allocated to Sports (nearly double the country average!). This may be a healthier way to behave than the American one.


A related topic that we could look at is gender differences in time allocation, and how this may differ between countries. The data for this are taken from another American Association of Wine Economists' Facebook page: Time per Day Spent Eating and Drinking, by Country and Gender.

So, the country data are for the averages for Eating/drinking only, with separate observations for males and females. These two averages are plotted against each other in the second figure, where each point represents a single country. I have labeled the three top countries and the five bottom countries.


Obviously, there is a close correlation between the males and females within any one country, so that most of the time variation is between countries (93%). If couples and families usually eat together, then this result is to be expected. It is the children who are likely to have more independent eating habits!

However, there are 14 countries where the average male time somewhat exceeds that for females, and only 7 where the female average time exceeds that for the males, with the remaining 6 being approximately equal (as represented by the pink line). Interestingly, the 2 biggest deviations from equality are where females spend more time on Eating/drinking than do the males (Japan and the Netherlands). You may make of this what you will.



* The 12 missing OECD countries are:
Chile, Czechia, Greece, Hungary, Iceland, Israel, Latvia, Lithuania, Luxembourg, Slovakia, Switzerland and Turkey.

Monday, February 4, 2019

Should we bother about character independence?


The comments of David Marjanović to one of my last posts (Please stop using cladograms!) kept me musing about an old question of mine: Why should we be concerned about whether characters in a matrix are independent or not?

When I started to get into phylogenetics (I taught myself by reading and just doing it and never had a course in phylogenetics at university), I learned that the most important thing for a phylogenetic matrix is:
All characters are independent of each other.
In other words: the mutation (change) in one character doesn't affect the mutation (change) in any other character.

I could never wrap my head around this. After all, the characters are all part of the same organism and must therefore function together, so how can they possibly be biologically independent? Even the fact that everything is part of the same universe means that everything is functionally dependent to one extent or another — when a butterfly sneezes the polar bears tremble, as they poetically say.

However, what is meant is that characters must be independent enough for practical mathematical purposes. This is a fundamental assumption of most mathematical analyses, in order to make them tractable. Trying to account for the dependencies is far too difficult, mathematically.

However, it is still worthwhile thinking about whether these "practical purposes" are likely to be realistic for phylogenetics. Consider this:
  1. Traditional phylogenetics mostly uses morphological traits, some of which must have been evolutionarily beneficial and evolved as a consequence of the same adaptive process.
  2. Working at the tips of the tree of life, our data were from the nuclear-encoded 35S rDNA, the cistron encoding the 18S rRNA (small subunit), 5.8S rRNA, and 25S rRNA (large subunit, erroneously called 26S in some of the phylogenetic literature), which is known for compensatory mutations (eg. strands of the 5.8S rRNA have to fit to the 5' end of the 25S rRNA; here's a link for those interested in RNA structure).
To investigate point 1, let's look at a dolphin (image source) and a bat (image source).


Without sequencing their entire genomes and establishing the function of each gene (and knocking out one or another gene during development), we cannot assess how independent (genetically) the traits are that make a dolphin a near-perfect swimmer, and a bat the only actively flying mammal. But obviously, a lot of their traits are adapted to this single function of movement. The practical consequence is that instead of a plethora of distinguishing characters, we can only score two fully independent ones: "can swim" versus "can fly".
(And then eliminate these two, because another rule in phylogenetics is that we should only include characters that are not under positive selection. The commonly implemented models all assume that evolution is neutral. This is why Charles Darwin has two parts to The Origin, one discussing historical dependence of characters and one discussing natural selection.)

As for point 2, everyone who has worked with ITS, the internal transcribed spacers of the 35S rDNA, can easily see that some mutational patterns always come in pairs or some other series. We can correct for linked mutations during inference by using the assumed secondary structures as a functional corrective. This is rarely done, because even without this correction you still get trees (or networks) that make sense.

Linked mutations and evolutionary trends within the LP3 of the 5' ITS2 in species of Acer section Acer (see my Ph.D. thesis, open access; figure from Grimm et al., Plant Syst. Evol., 2007). This (non-coding) length-polymorphic region (found in all angiosperms in various modifications) comprises an upstream CT- and partly linked (complementary) downstream GA-motif.


A very simple example

Let's take a group of very simple, made-up organisms differing in two trait complexes (note that it may be a collection of genes that trigger the difference): form and colour.




In total, "evolution" came up with 15 different combinations ("species"), five of which are extinct, two of which are primitive in the sense that they still occur today, but have also been found as fossils.

We all know that morphologies have a high level of homoplasy. Homoplastic traits mean that groups will not accurately reflect the true tree. Having as many forms (9) as colors (9), we have no clue as to which trait is more conservative, and hence could better reflect the true tree.

The 15 species form nine potentially monophyletic genera.

The alternative nine potentially monophyletic genera.

The promise of phylogenetics is that we can infer the true tree based on the scored characters. We could follow the strict independence rule, and score them as two multi-state characters, leading us to the following "tree" — this has been parsimony-optimized and unweighted, as in most studies using morphological data, with the sample of MPTs summarized using a strict consensus cladogram.

The strict consensus tree of 355602 equally parsimonious trees with 17 steps, a CI of 0.94 and RI of 0.88: a pitchfork (an extreme case, but pitchfork-like subtrees are very common in palaeontological phylogenetic literature).
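The reported CI can be checked with a short calculation. The sketch below assumes the usual ensemble consistency index (minimum possible number of steps divided by the observed number of steps) and that both multi-state characters have nine states, as stated above; the function name is illustrative only.

```python
def consistency_index(states_per_character, observed_steps):
    """Ensemble CI: minimum possible steps divided by observed steps.

    A character with k states needs at least k - 1 changes on any tree.
    """
    min_steps = sum(k - 1 for k in states_per_character)
    return min_steps / observed_steps


# Two characters (form, colour) with nine states each; MPTs have 17 steps
ci = consistency_index([9, 9], 17)
print(round(ci, 2))  # 0.94
```

So the two nine-state characters require at least 16 changes, and the most parsimonious trees need only one extra step, which is why the CI is high even though the consensus tree is an unresolved pitchfork.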

Alternatively, we could score the features as a series of binary characters such as:
  • Is the center depressed?
  • Is it horizontally or vertically elongated?
  • Is it round or pointed?
  • Do we have few (<= 6) or many tips (>= 8; "?" for all round species)?
  • Is it reddish? Or greenish? Or bluish? (Example: purple doughnut would be 1 - 0 - 0, the turquoise five-star 0 - 1 - 1)
  • Has it a dark or light shade (relatively speaking: green taken as darker than turquoise)?
These characters are not particularly independent. Certain evolutionary steps make it impossible to go back or evolve something in parallel / convergently. For example, the Roundish group never evolved pointed tips, and the Pointish organisms can vary their outline, but not smooth it. The characters are also not overly compatible (e.g. shading splits each basic coloring into two subsets), so we wouldn't expect a very resolved tree or one that matches the true tree exactly:

Adams consensus tree of 80 MPTs with 19 steps, CI = 0.57, RI = 0.79, naming follows the principles of cladistic classification (only subtrees in a rooted tree may be named; not to be confused with phylogenetic classification fide Hennig)

However, it doesn't look like a very bad evolutionary hypothesis. In fact, the inferred clades only miss one monophyletic group (I can tell, because I invented this group to illustrate that 'cladistics' is a subset of 'phylogenetics'): Fivestar reflects the morph of the common ancestor of all stars, resolved as part of a monophyletic grade "basal" to the (reciprocally monophyletic) polygons:

Evolution as it happened. Note, each dichotomy is accompanied by one or two exclusive subsequent mutations (synapomorphies at the time). Unknown ancestors (not found in the fossil records) are dimmed. Green: valid names following Hennig's phylogenetic classification; orange: only valid for the most recent time frame (Purpleoval is indistinguishable from the ancestor of all non-olive Roundish, Fivestar from the ancestor of all stars, and the ancestor of all polygons was a blue pentagon).

Of course, I would always show all of the topological alternatives in the optimized tree sample. Here is the strict consensus network of all of the MPTs:

Strict consensus network of all 80 MPTs, the network analogue to the commonly seen strict consensus cladograms.

In contrast to the consensus trees, we see the equally optimal alternatives, and can even make a call as to which trait to give a higher weight (evolution-wise). For instance, although only 12 MPTs have a Pentagon clade, 40 have an Octagon clade, which would fit with the hypothesis of reciprocal monophyly. The shading-based alternative seen in other MPTs (light vs. dark polygons) can be argued to be less likely, noting how scattered this feature is across the entire graph (this is what TNT's iterative weighting does, except that it starts from one of the alternative trees).



And here's the distance network, probably (like with real-world data) the least-biased depiction of the differentiation pattern:

All labelled taxa are monophyletic (as defined by the true tree). Note how some neighborhoods reflect monophyly while others would result in paraphyletic groups.

Take-home message

Now, you could rightfully point out that this is totally hypothetical and, having generated the group, I made sure that the analysis works out — actually, I didn't, and I was quite surprised at how well the binary matrix, which just scores everything that differs between the species, resolves aspects of the true tree. However, just compare the above graphs with trees published in (paleo)phylogenetic studies, and the real-world data we dealt with here on the Genealogical World of Phylogenetic Networks.

You might also point out that this is just like using stepmatrices, ie. forcing a topology by suitably coding complex characters. Likewise, this practice is usually discouraged (but see Joe Felsenstein's 2004 book, Inferring Phylogenies). I would respond that scoring complex traits filtered by evolution as a single multi-state character severely underestimates the information content. An example from my own research: in the King Ferns (Osmundaceae), the subsequent modification of the sclerenchyma ring along the leaf traces is fully compatible with the molecular tree, so why should I be forced to reduce the surely interdependent (and traceable in the fossil record) aspects of this evolutionarily filtered trait complex to a single, multi-state (and unweighted) character?

Coding of a single complex trait (Bomfleur et al., PeerJ, 2017, fig. 7), the structure of the sclerenchyma ring in Osmundaceae leaf traces, as five binary characters that reflect the ontogenetic sequence seen in Osmundaceae rhizomes (arrows), a case where ontogeny mirrors phylogeny (Bomfleur et al., BMC Evol. Biol., 2015; cf. Additional file 1, fig. S1-1).


If we have character complexes that we can score, then we should not bother ourselves with drawing a (often very subjective) line between biologically dependent and independent characters. We should just score as much as we can see, and then explore the signal in the resulting matrix (see our many blog posts on the latter topic).

Exploratory data analysis benefits from few-state characters. This is because characters with many states (nine in the above example, which is something also found in the actual literature) that do not inform any taxon bipartitions lead only to quite useless pitchfork-trees.

Scoring what we see in as much detail as possible may, of course, get some things wrong. We may face one or another paraphyletic (or even polyphyletic) clade and monophyletic grade, so inferring trees/networks and establishing branch support with more than a single optimality criterion is advisable, as is character mapping. At least it gets us a data-based hypothesis to discuss and to investigate further; or several hypotheses, when using consensus networks or distance-based splits graphs instead of consensus trees.


Monday, January 14, 2019

Phylogenetic ambiguity: data gaps, indifference and internal conflict

A tweet by my favourite journal (not only because they insist that authors make their data available) pointed me to their most viewed paper of 2018, with a nice title (for a network-fan):
Genus-level phylogeny of cephalopods using molecular markers: current status and problematic areas, by Sanchez et al. (PeerJ, 6:e4331).
"Problematic areas" are exactly my cup of tea. However, the graphical representation of these falls a bit short. The authors show three maximum-likelihood phylograms, one for the Cephalopoda with support annotated at some branches (their Fig. 1), and one each for two of the constituent lineages, the Decabrachia (their Fig. 2) and the Octobrachia (Fig. 3, reproduced below, because we will take a look at the data behind it).

Original: "Figure 3: Maximum-likelihood tree of the Octobrachia under the GTR + Gamma model with the morphological character set mapped onto the tree. Taxa highlighted in red represents discrepancy to previously published studies."

Unfortunately, we don't know the actual support for each of the branches — there is a legend in the lower right, but no signatures etc. associated with it. You will find some information throughout the text, of course. For example:
The use of concatenated sequences of all markers (Fig. 2) resulted in a resolved topology for monophyly of the Octobrachia (BS = 58%), and strong support for monophyly of the Decabrachia (BS = 98%), with both clades strongly supported by the Bayesian approach with PP = 0.78 and 0.75 respectively
The latter is quite strange, as PP are expected (methodologically) to be ≥ BS.
Although monophyly was demonstrated for several families contained within both superorders, the relationships of the families contained within Octobrachia were better supported than those in Decabrachia (Fig. 2). Of the 37 nodes in the Octobrachia portion of the general tree containing all taxa, the majority were resolved above the 50% level (31 nodes with BS > 50%); but only 28 out of 80 nodes in the Decabrachia were resolved at BS >50%, most of which were located at family level.
BS = 51 could be lack of signal (all other alternatives BS ~ 0) or conflict (one alternative has a BS = 49).

What we can infer directly from the alignment

Let's have a look at the first three gene regions in the matrix provided, using Mesquite's bird-view option.


We can see from the alignment that the first gene (left; mitochondrial 12S rDNA) splits the taxon set (the taxon order seems to be arbitrary) into two (three if we include those with no data) main groups with substantially divergent 12S rDNAs. However, in the second, much more homogeneous gene, no such differentiation is obvious, with the exception of two accessions that remain very different from the rest. This is quite puzzling, because the second gene is the (close-by) mitochondrial 16S rDNA.

Without going into details, the 12S rDNA unambiguously supports (and enforces) an Octopodia core clade defined by a 12S rDNA entirely different from that of other taxa, and comprising five of this order's families, in which Amphioctopus and Octopus make up a subclade with strongly deviating 16S rDNA.

With respect to the tree, we also have to assume that the 12S rDNA of the Octopodia core clade is derived, strongly evolved, whereas it remained largely unmodified (ie. is primitive) in the other, earlier diverged (according to the tree) lineages. However, some of these lineages have equally long terminal branches: there has been more evolution going on in other genes.



The third gene, the nuclear-encoded gene for the 18S rRNA (18S nrDNA), shows another, quite typical pattern: large stretches with very little variation, hence devoid of differentiating signal that would allow the tree algorithm to make a decision (and letting Bayes get lost in the treespace, resulting in PP < 1.0).


For half of the taxa, no information is available, but this hardly matters because even genera with strongly different mt 12S rDNA have nearly the same 18S nrDNA. There is a little hiccup in the second part in one accession (a gap in Cirrothauma with a small, off-alignment strand in between), but this could just be a sequencing artefact. Being limited to a single taxon, it has no topological effect (we need at least four to make a call); it will only increase the length of the terminal branch.

The remainder of the matrix mirrors the situation in the first three partitions. For example, in the well-sampled (only six taxa missing) mt coI gene, Callistoctopus is visibly distinct from all other genera, while most general variation is concentrated at the 3rd codon position. All other mt-genes, accounting for 58% of the matrix' characters, are covered for only four of the taxa (the sister taxon used to root, Vampyroteuthis, and three of the core Octopodia) and, hence, can only support a single split within this group and be used to test for its alternatives.



What networks could have shown

The matrix provided for the shown tree (made available via figshare) has 40 taxa and 16104 characters, quick to run these days. Here's the tree with branch support annotated along branches.

ML phylogram inferred from Sanchez et al.'s matrix, taxa ordered as in the original fig. 3. Members of the same taxon (order, superfamily, family, as annotated in Sanchez et al.'s fig. 3) colored accordingly. Values at branches indicate ML-BS support using a single partition for the entire data ("unpart.") or using the gene-wise partition scheme provided in the figshare submission ("part.").

Even though I ran an unpartitioned analysis, my tree is very similar to the original tree, with a near identical topology except for Ameloctopus being moved one node up and placed as sister to Hapalochlaena (ML, unpartitioned BS = 52 vs. 46[!] for the alternative seen in Sanchez et al.'s fig. 3). I never understood the fuss about model and partition testing, when we usually work with data where any model will inevitably be suboptimal (see alignments). As a geneticist, I also believe data partitions should be informed by function, not computer programmes (eg. one for 1st and 2nd codon position, another for the 3rd codon position, and one for the rDNAs).

We have unambiguously supported branches (BS ~ 100), and others, the "problematic areas" (BS << 100). Ambiguity in support values for branches of a tree can have two reasons:
  1. Lack of signal: the data are indifferent regarding the placements of certain taxa and/or subtrees (PP < 1.0 are indicative of lack of signal).
  2. Conflicting signal, parts of the data (data partitions) prefer one topological alternative, others a (partly) conflicting one (keep in mind that even in the presence of substantial signal conflict, PP ~ 1).
Short branches with low (BS) support point to the former, long branches with low (BS) support are a direct indication of the latter. Two apparent sources of conflict would be that the data include gene regions from the biparentally inherited nucleome and the (usually maternally inherited, not sure how this is in squids) mitochondriome and combine protein-coding genes (amino-acids coded by codons) with rRNA genes (directly encoding a certain secondary, tertiary structure).

In our tree here, we notice a general correlation between the branch lengths and the support: the shorter the branch, the lower the support. There are a few exceptions, eg. the Octopodida core clade, triggered by the unique, strongly diverged sequences of the 12S rDNA, has a long root branch with comparatively low support (ML-BS = 63; it collapses when using the authors' partitioning scheme, which treats each gene region as an individual partition).

Full BS Consensus network based on 450 ML pseudoreplicates (result of the unpartitioned analysis). Edge lengths are proportional to the BS support (frequency of the splits in the BS tree sample), trivial splits not collapsed. Arrow points to the root (cf. Sanchez et al.'s fig. 1).

The BS Consensus network shows us that some of the "problematic areas", ie. branches with ambiguous support, are not really problematic (alternatives have no to very little support), but others are, including the 12S rDNA-based Octopodida core clade and, connected to this, the division of the Megaleledonidae, as annotated in Sanchez et al.'s fig. 3, into two clades (not discussed in the paper). A clade including all Megaleledonidae has a BSunpart./part. = 34/55 and competes with the 12S rDNA split (BS = 63/37) and the placement of Cistopus as sister to the Octopodida core clade (BS = 52/34). It doesn't conflict with the alternative topology placing Cistopus as sister to all of them (BS = 38/50). The reason for this is, of course, that by using a different partition for the highly divergent mt-12S rDNA, we allow RAxML to estimate high probabilities for all mutations, effectively down-weighting each mutation in this gene compared to those in other, more conservatively structured gene regions, which seem to prefer alternative splits.

Vice versa, the poorly supported sister relationship (BS = 45/21) of Bathypolypus with the Enteroctopodidae (light green) + part of the Argonautoidea (pink) stands unopposed; alternative splits have BSunpart. < 10. In the partitioned analysis, however, there is an equally poorly supported alternative sticking out a bit: Bathypolypus as sister to the (all-including) Megaleledonidae clade (BSpart. = 23).

While we see little effect on the tree topology, partitioning affects some of the support values. A nice example is the structure of the Megaleledonidae s.str. subtree. The root is unambiguously supported, as is the sister relationship of Graneledone and Bentheledone. The remaining branches have ambiguous support.


Here, the partitioning scheme is a game changer. Unpartitioned, the favored alternative is an Adelieledone-Pareledone-Megaleledone (APM) grade "basal" to Graneledone and Bentheledone (BS = 68/49); using the authors' partitioning scheme, the data favor an APM clade sister to the latter two (quite a difference, since we often equate clades with monophyly and grades with paraphyly).

It doesn't matter whether a clade has a BS support of 30, 50 or 70. We need to know whether the remaining 70%, 50%, or 30% of bootstrap replicates show random alternatives or the same alternative(s). When a tree has ambiguously supported branches, BS Consensus networks should be obligatory.
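The question of what the remaining replicates show can be answered by tallying the splits in the bootstrap tree sample. The following is a minimal sketch, not the workflow used for the figures here: it assumes that each replicate tree has already been reduced to a list of taxon bipartitions (in practice these would be read from Newick files with a tree library), and the five taxa and ten replicates are made up.

```python
from collections import Counter


def canonical(split, taxa):
    """Represent a bipartition by the side that lacks the first taxon (sorted order)."""
    side = frozenset(split)
    ref = sorted(taxa)[0]
    return frozenset(taxa) - side if ref in side else side


def split_frequencies(replicates, taxa):
    """Percentage of replicate trees in which each split occurs."""
    counts = Counter()
    for splits in replicates:
        for s in {canonical(s, taxa) for s in splits}:
            counts[s] += 1
    return {s: 100.0 * c / len(replicates) for s, c in counts.items()}


def compatible(a, b, taxa):
    """Two splits fit on one tree if at least one of the four intersections is empty."""
    everything = frozenset(taxa)
    a2, b2 = everything - a, everything - b
    return not (a & b) or not (a & b2) or not (a2 & b) or not (a2 & b2)


def competitors(focal, freqs, taxa):
    """Splits that conflict with the focal split, most frequent first."""
    focal = canonical(focal, taxa)
    hits = [(f, s) for s, f in freqs.items() if not compatible(focal, s, taxa)]
    return sorted(hits, key=lambda pair: -pair[0])


# Hypothetical sample: 6 replicates group A+B, 4 replicates group A+C
taxa = ["A", "B", "C", "D", "E"]
replicates = [[{"A", "B"}, {"A", "B", "C"}]] * 6 + [[{"A", "C"}, {"A", "B", "C"}]] * 4

freqs = split_frequencies(replicates, taxa)
rivals = competitors({"A", "B"}, freqs, taxa)
for freq, split in rivals:
    print(freq, sorted(split))
```

In this toy sample the A+B clade has BS = 60 and a single competing alternative with BS = 40, which is conflict rather than lack of signal; had the remaining 40% been spread over many different splits, the same BS = 60 would point to indifferent data instead.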

Instead of reading sentences like this:
Benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) together with the Megaleledonidae (possessing a single row of suckers) formed a well-supported monophyletic group (BS = 72%, PP = 0.61).
we should read this:
A clade including all benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) and the Megaleledonidae (possessing a single row of suckers) received ambiguous support (BS = 72%, PP = 0.61), but potential alternatives received no support at all. The combination of a relatively high BS but low PP points towards a faint, but consistent signal in the available data.
And include the Consensus networks at least in the supplement.

When we aim to map morphological traits (which is a nice touch of Sanchez et al.'s paper), why not consider the topological alternatives we see there?

Running single-gene trees is never wrong, too. But, in the case of these data, that would be the topic of another post, using a different type of network: a super-network.

Final note. This post is not intended to criticize Sanchez et al.'s paper (my squid-expertise ends with having seen them in aquaria). My impression is they put a lot of effort into getting the matrix together. Having been forced to harvest molecular data myself in the past, I know how important and tedious this work is. Instead, this post stresses and shows, using an easy-to-access example that raised a lot of interest (attracted many views), that we often have to work with suboptimal data not providing trivial results in the form of fully resolved trees. This is a situation in which easy-to-generate networks offer a lot. No peer reviewer should, in such a case, be content with seeing just a tree (although they, in my experience, always are).

Monday, December 10, 2018

Please stop using cladograms!


I really like the journal PeerJ, not only because it is open access and publishes the peer review process, but also because it's one of the few that adhere to strict policies when it comes to data documentation. In my last (on my own) 2-piece post (part 1, part 2), I showed what networks could have offered for historical and more recent studies in Cladistics, the journal of the Willi Hennig Society. In this one, I'll illustrate why paleontology in general needs to stop using cladograms.

An example

In a recent article, Atterholt et al. (PeerJ 6: e5910, 2018) describe and discuss "the most complete enantiornithine from North America and a phylogenetic analysis of the Avisauridae". I'm not a paleozoologist and "stuff of legend", but their first 17 figures seem to make a good point about the beauty of the fossil and its relevance; and it is interesting to read about it. This makes me envy paleozoologists a bit — the reason I exchanged chemistry for paleontology was my childhood love for the thunder lizards; I specialized in zoology not botany for graduate biology courses, and I fell in love with social insects, especially bees; but then more general circumstances pushed me into plant phylogenetics.

The result of Atterholt et al.'s phylogenetic analysis is presented in their figure 18, as shown here.

Figure 18 of Atterholt et al. (2018): "A cladogram depicting the hypothetical phylogenetic position of Mirarce eatoni." [the beautiful fossil is highlighted in bold font]
This looks very familiar — graphs like this can be seen in many paleontological studies, not only those in Cladistics. However, this is a phylogeneticist's "nightmare" (but a cladist's "dream").

First, phylogenetic trees, especially those that were weighted post-analysis several times to get a more or less resolved tree, should be depicted as phylograms — trees with branch lengths. Phylogenetic hypotheses are not only about clades, and what is sister to what, but about the amount of (inferred) evolutionary change between the hypothetical ancestors, the internal nodes, and their descendants, the labelled tips. For example, we may want to know how long is the root of the clade (Avisauridae, Avisaurus s.l.) comprising the focus taxon compared to the lengths of the terminal branches within the clade. Prominent roots and short terminals are a good sign for monophyly (inclusive common origin), or at least a fossil well placed, whereas short roots and long terminals are not.

The above tree as phylogram (using PAUP*'s AccTran optimization). The beauty of cladistic classification is that the new specimen could have just been described as another species of Avisaurus (but read the authors' discussion).

In this example, we seem to be on the safe side, although one may question the general taxonomic concept for extinct birds. Are the differences enough to erect a new genus for every specimen? This is hard to decide based on this matrix.

Second, a tree without branch support is just a naked line graph, telling us nothing about the quality (strengths and weaknesses) of the backing data. Neontologists are not allowed to publish naked trees. In molecular phylogenetics, we are not uncommonly asked by reviewers to drop all branches (internodes) below an arbitrary threshold: a bootstrap (BS) support value < 70 and posterior probability (PP) < 0.95. In paleontology, it has become widely accepted to not show support values at all. The reason is simple: the branch support is always low, because of data gaps and homoplasy. This is a problem the authors are well aware of:
The modified matrix consists of 43 taxa (26 enantiornithines, 10 ornithuromorphs) scored across 252 morphological characters [the provided matrix lists 253], which we analyzed using TNT (Goloboff, Farris & Nixon, 2008a). Early avian evolution is extremely homoplastic (O’Connor, Chiappe & Bell, 2011; Xu, 2018) thus we utilized implied weighting (without implied weights Pygostylia was resolved as a polytomy due to the placement of Mystiornis) (Goloboff et al., 2008b); we explored k values from one to 25 (see Supplemental Information) and found that the tree stabilized at k values higher than 12. In the presented analysis we conducted a heuristic search using tree-bisection reconnection retaining the single shortest tree from every 1,000 replications with a k-value of 13. This produced six most parsimonious trees with a score of 25.1. These trees differed only in the relative placement of five enantiornithines closely related to the Avisauridae, forming a polytomy with this clade in the strict consensus tree (Consistency Index = 0.453; Retention Index = 0.650; Fig. 18).
I've seen much worse CI and RI values in the paleophylogenetic literature (some of them are plotted in this post). For a phylogenetic inference, homoplasy equals internally incompatible signals: many characters show different, partly or fully conflicting, taxon bipartitions; or, in other words, they prefer different trees. The signal in the matrix is thus not tree-like; it doesn't fit a single tree. That's why we have to choose one using TNT's iterated reweighting procedures. (Note: an alternative "phenetic" Neighbor-joining tree has a computation time < 1s, and produces the same tree for the Ornithuromorpha and the root-proximal, 'basal' part of the tree, except that Jeholornis is moved two nodes up; but it shuffles a lot in the Longirostravis–Avisauridae clade.)

Another point is that the more homoplasy we have, the higher the rate of change (here: visible anatomical mutation) must have been. And the higher the rate of change, the higher the statistical inconsistency of parsimony.

In short, paleontologists (Atterholt et al. just follow the standard in paleophylogenetic publications) use data with tree-unlike signal to infer trees (see also David's last post on illogicality in phylogenetics) under a possibly invalid optimality criterion, which are then used to downweight characters (eliminate noise due to homoplasy) to infer less noisy, "better" trees.

The basic signal

We can't change the data, but we can explore and show its signal. And the basic signal from the unfiltered matrix is best visualized using a Neighbor-net splits graph.

Neighbor-net based on mean pairwise taxon distances. Thick edges correspond to branches in the published tree.

Some differentiation patterns that explain the clades in the tree can be traced, but it becomes difficult in the group that is of most interest: the (inferred) clade(s) comprising the newly described fossil. In the Neighbor-net, the new fossil is placed close to one other member of the Avisauridae, but not to all of them. The matrix is not optimal for the task at hand.

The data properties

The matrix is a multistate matrix with up to six states in the definition line (although only five are used, as state "5" is not present). The taxa have variable gappiness (i.e. the proportion of completely undetermined cells), between 2% (extant birds: Anas and Gallus) and 94% (Intiornis, an Avisauridae); the median is 56%, and the average close to it (54%). The "hypothetically" placed fossil Mirarce eatoni (in the matrix it is under its old designation: "Kaiparowits") lacks a bit more of the scored characters (61%). That may strike one as a lot, but note that the matrix has 253 characters! However, we may well ask: if I want to place a fossil for which I can score 99 characters, why bother to include another ~150 that tell me nothing about its affinity? (Note: paleobotanists struggle hard even to get such numbers, we usually have at best 50 characters.)
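As a minimal sketch of how such gappiness values can be tallied, the following Python snippet counts the proportion of undetermined cells per taxon. The toy matrix, the taxon names and the choice to treat only "?" as missing are assumptions for illustration; it is not the published matrix.

```python
from statistics import mean, median
from typing import Dict

# Assumption: "?" marks a completely undetermined cell.
MISSING = "?"


def gappiness(row: str) -> float:
    """Proportion of cells in one taxon's row that are undetermined."""
    return row.count(MISSING) / len(row)


# Hypothetical toy matrix (taxon -> one character state per column).
toy_matrix: Dict[str, str] = {
    "TaxonA": "0110210?",
    "TaxonB": "01??2???",
    "TaxonC": "????2?1?",
}

per_taxon = {name: gappiness(row) for name, row in toy_matrix.items()}

for name, value in per_taxon.items():
    print(f"{name}: {value:.1%} missing")
print(f"median: {median(per_taxon.values()):.1%}")
print(f"mean:   {mean(per_taxon.values()):.1%}")
```

Applied to the real matrix, the same bookkeeping gives the numbers quoted above: a taxon missing 61% of 253 characters is scored for roughly 253 × 0.39 ≈ 99 of them.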

Its closest putative relatives, the Avisaurus s.l., lack 90% of the characters, leaving us with at most 25 characters supporting the relevant clade (assuming that the 10% are all found in Mirarce as well). Coverage is not much better in the next-closest relatives (phylogenetically speaking).

Data coverage in the phylogenetic neighborhood of Mirarce eatoni

The missing data percentage may have misled the Neighbor-net a bit, because we will have fed it with unrepresentative or highly ambiguous pairwise distances. In the network, the focus fossil comes close to Neuquenornis, the only other Avisauridae with some data coverage. Looking at the heat map below, we see that missing data is indeed a problem in this matrix: we have zero distances between several pairs that show different distances to the better-covered taxa.

The distance matrix drawn as a heat map: green = similar, red = dissimilar (values range between 0 and 0.8). Red arrows: taxa with too many (and ambiguous) zero pairwise distances.
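To see how such ambiguous zero distances arise, here is a minimal Python sketch of a mean pairwise character distance that only compares characters scored in both taxa. The exact distance definition used for the figures is not given here, so this formula, the toy rows and their names are assumptions for illustration.

```python
from typing import Optional

# Assumption: "?" marks a completely undetermined cell.
MISSING = "?"


def mean_pairwise_distance(a: str, b: str) -> Optional[float]:
    """Share of differing states among characters scored in BOTH taxa.

    Returns None when the two taxa share no scored character.
    """
    shared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


# Hypothetical toy rows: two well-covered and two poorly covered taxa.
well_1 = "0102110201"
well_2 = "0112010211"
poor_1 = "01?2??????"
poor_2 = "0???0?????"

print(mean_pairwise_distance(well_1, well_2))  # based on 10 shared characters
print(mean_pairwise_distance(poor_1, poor_2))  # based on 1 shared character
print(mean_pairwise_distance(poor_1, well_1))
print(mean_pairwise_distance(poor_2, well_1))
```

In this toy case the two poorly covered taxa have a distance of zero to each other, although it rests on a single shared character, while they differ in their distances to the well-covered taxon. That is the pattern flagged by the red arrows in the heat map.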

The closest relative of Mirarce is, indeed, Avisaurus/Gettya gloriae, but the latter has zero distances to various other poorly covered taxa from the phylogenetic neighborhood, in contrast to the much better-covered Mirarce. Neighbor-nets are very good at getting the obvious out of a morphological matrix, but they don't perform miracles. However, why should we include poorly known taxa at all during phylogenetic inference? Wouldn't it be better to infer a backbone tree (or network showing the alternative hypotheses) based on a less gappy matrix, and then find the optimal position of the poorly known taxa within that tree (network)?

Estimating the actual character support

Some characters cover just 10–20% of the taxa, whereas others are scored for most of them; overall, more than half of the characters are missing for more than half of the taxa. Using TNT's iterative weight-to-fit option means that we infer a tree, ideally one fitting the well-covered data (taxon- and character-wise), and then downweight all conflicting characters elsewhere to fit this tree. We then end up with a tree where we have no idea about actual character support. Since the matrix is a Swiss cheese, we can only re-affirm the first-inferred tree.
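To give a feeling for how strongly homoplastic characters are downweighted, here is a small Python sketch. It assumes the commonly described form of Goloboff's implied-weighting cost, h / (h + k), where h is the number of extra steps a character needs on a tree and k is the concavity constant (13 in the quoted methods). The function name and the example step counts are hypothetical, and this is not a re-run of the authors' TNT analysis.

```python
def implied_weight_cost(extra_steps: int, k: float) -> float:
    """Cost of one character under implied weighting (assumed form h / (h + k))."""
    if extra_steps < 0 or k <= 0:
        raise ValueError("extra_steps must be >= 0 and k must be > 0")
    return extra_steps / (extra_steps + k)


K = 13  # concavity constant mentioned in the quoted methods section

# Hypothetical numbers of extra (homoplastic) steps for single characters.
costs = {h: implied_weight_cost(h, K) for h in (0, 1, 2, 5, 10)}

for h, cost in costs.items():
    print(f"extra steps = {h:2d} -> cost = {cost:.3f}")
```

Under this assumption, each additional step adds less to the total score than the previous one, so a character that already conflicts with the tree has ever less influence on which tree is preferred.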

Let's check the raw character support, using non-parametric bootstrapping and maximum likelihood as the optimality criterion (corrected for ascertainment bias, as implemented in RAxML).

ML-BS Consensus Network (using Lewis' 2-parameter Mk+G model). Edge lengths are proportional to the BS support values of taxon bipartitions (= phylogenetic splits, internodes, branches in phylogenetic trees). Only splits are shown that occurred in at least 10% of 900 BS pseudoreplicates (number of necessary BS replicates determined by the Extended Majority Rule Bootstrap criterion), trivial splits collapsed. Thick edges correspond with branches in Atterholt et al.'s iterative parsimony tree; coloring as before.

The ML bootstrap Consensus Network bears not a few similarities to the distance-based Neighbor-net. The characters do not support the Avisauridae subtree, as depicted in the published TNT tree, but there are faint signals associating some of them with each other, despite the missing data. Keep in mind: a BS support of 20 for one alternative and < 10 for all others means (ideally) one fifth of the characters support the split, and the rest have no (coherent) information. Some sister pairs have quite high support (for this kind of data set), and Gettya gloriae is resolved as sister of Mirarce (unambiguously, with a BS support = 67). But the matrix hardly has the capacity to resolve deeper relationships within the group of interest, the Enantiornithes: the polytomy with the next relatives seen in the tree and the corresponding clade dissolve. This confirms what we saw in the Neighbor-Net (despite missing data distortion).
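The bookkeeping behind such a Support Consensus Network can be sketched in a few lines of Python: count how often each taxon bipartition occurs across the bootstrap pseudoreplicate trees and keep those above a threshold (10% in the figure above). The representation of trees as lists of non-trivial splits, the function names and the toy replicates are assumptions for illustration; drawing the network itself is left to dedicated software.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Set

Split = FrozenSet[str]


def canonical_split(side: Iterable[str], all_taxa: Iterable[str]) -> Split:
    """Represent a bipartition by the side that lacks the reference taxon."""
    taxa = frozenset(all_taxa)
    part = frozenset(side)
    reference = min(taxa)  # alphabetically first taxon, an arbitrary choice
    return taxa - part if reference in part else part


def split_support(
    replicates: List[List[Set[str]]],
    all_taxa: Iterable[str],
    min_percent: float = 10.0,
) -> Dict[Split, float]:
    """Percentage of replicate trees that contain each non-trivial split."""
    taxa = list(all_taxa)
    counts: Counter = Counter()
    for tree_splits in replicates:
        # Count every split once per tree, whichever side was listed.
        counts.update({canonical_split(s, taxa) for s in tree_splits})
    n = len(replicates)
    support = {s: 100.0 * c / n for s, c in counts.items()}
    return {s: v for s, v in support.items() if v >= min_percent}


# Hypothetical toy example: 10 pseudoreplicate trees for five taxa.
taxa = ["A", "B", "C", "D", "E"]
replicates = (
    [[{"A", "B"}, {"D", "E"}]] * 6
    + [[{"A", "C"}, {"D", "E"}]] * 3
    + [[{"C", "D", "E"}, {"C", "D"}]] * 1
)

support = split_support(replicates, taxa)
for split, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(split), f"{value:.0f}")
```

Here the competing splits A+B and A+C both survive the 10% filter, which is exactly what a consensus network shows and a single tree hides.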

The matrix and the tree show something that could have been deduced directly from the distance matrix: the poorly known Gettya (Avisaurus) gloriae is (literally) the closest relative of the enigmatic new genus / species Mirarce (morphological distance of 0.08 compared to 0.1–0.64 for all other taxa). But is this overall similarity enough to conclude Avisaurus, Gettya and Mirarce are a monophyletic group within the Avisauridae?

What the authors (and all paleontologists doing phylogenetics) should have done

(I would have skipped all trees, naturally, but peer reviewers and most readers probably need to see them.)

  • Trimmed the matrix to include only those characters preserved in the fossil of interest, in order to minimize missing data artefacts during inference.
  • Shown the Neighbor-net to visualize the primary signal situation, including and excluding poorly covered taxa. From the Neighbor-net it is already obvious that the fossil is an Enantiornithes, so any subsequent optimization / inference could have focussed on this group alone.
  • Then inferred a backbone tree excluding poorly covered taxa, and shown the resulting phylogram. In case one needs to test the Enantiornithes root (the Neighbor-net gives us two alternatives for the Enantiornithes root: Pengornis + Eopengornis or Protopteryx + Iberomesornis), there is no point in including the poorly covered Enantiornithes or the worst-covered taxa outside this clade.
  • Then optimized the position of the poorly covered taxa in the backbone tree. I recommend using RAxML's evolutionary placement algorithm (EPA) for this, but you can also do this in a parsimony framework if you wish. (EPA can also be used to test outgroup roots: here, one would search the branch at which all non-Enantiornithes fit best.)
  • Shown the resulting phylogram including all taxa — that is, read in the topology to the analysis, and then re-optimize branch lengths.
  • Shown a Support Consensus Network to illustrate the support for the branches in the preferred tree and their competing alternatives. (There may be one or more, as there are many options to estimate branch support.) How sure can we be about relationships within the Avisauridae and their relationships to other Enantiornithes?
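The first and third steps of this list (trimming the matrix to the characters preserved in the focal fossil, and picking well-covered taxa for a backbone tree) can be sketched in Python as follows. The toy matrix, the taxon names, the function names and the 50% coverage cut-off are assumptions for illustration; the tree inference and placement steps themselves need dedicated phylogenetic software.

```python
from typing import Dict, List

# Assumption: "?" marks a completely undetermined cell.
MISSING = "?"


def trim_to_focal(matrix: Dict[str, str], focal: str) -> Dict[str, str]:
    """Keep only the characters (columns) that are scored in the focal taxon."""
    keep = [i for i, cell in enumerate(matrix[focal]) if cell != MISSING]
    return {name: "".join(row[i] for i in keep) for name, row in matrix.items()}


def backbone_taxa(matrix: Dict[str, str], max_missing: float = 0.5) -> List[str]:
    """Taxa whose proportion of missing cells does not exceed max_missing."""
    return [
        name
        for name, row in matrix.items()
        if row.count(MISSING) / len(row) <= max_missing
    ]


# Hypothetical toy matrix, not the published data.
toy_matrix = {
    "Focal": "01??2?10",
    "Well": "01102010",
    "Poor": "??1?????",
}

trimmed = trim_to_focal(toy_matrix, "Focal")
backbone = backbone_taxa(trimmed)

print(trimmed)
print(backbone)
```

In this toy case the poorly covered taxon ends up with no scored character at all once the matrix is trimmed, so it can tell us nothing about the focal fossil and is left out of the backbone.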



Postscriptum. For those who are curious about what the ML tree would look like, here it is:


I have no idea about birds, but from a methodological point of view this is an equally valid hypothesis for the data set (if not more so, because it is unforced). It also demonstrates the data's limitations: note the relatively long branches with very low support making up the backbone of the Enantiornithes clade. This is typical for matrices lacking coherent discriminatory signal and/or struggling with internal conflict.