Showing posts with label EDA. Show all posts

Monday, December 30, 2019

National differences in amount of paid versus unpaid work


Countries differ in many cultural ways. An important one of those ways concerns how time is managed. There are 24 hours in every day, and 7 days in every week, and the time that people spend on each of the different activities can be averaged across each year. When combined across the whole population, these averages usually differ between countries, and this is what we mean when we recognize national behaviors. There are, however, many similarities among countries that share strong cultural ties.

The Organisation for Economic Co-operation and Development (OECD) has collected data on this matter among its member countries, as it also has for many other cultural and economic characteristics. In each of the 30 member countries, the OECD conducts regular "time-use surveys, based on nationally representative samples of between 4,000 and 20,000 people." The aggregated results are available online, including data for three other countries, for comparison (China, India, South Africa).


Four main categories of time use are reported by the surveys:
  • Paid Work or Study, which includes paid work time, time in school or classes, travel to and from work / study, research / homework, and job search.
  • Unpaid Work, which includes child care, care for other household members, care for non household members, routine housework, shopping, volunteering, and travel related to household activities.
  • Personal Care, which includes sleeping, eating & drinking, medical services, and travel related to personal care.
  • Leisure Time, which includes sports, participating / attending events, visiting or entertaining friends, and TV or radio at home.
Of particular interest is that the data are aggregated separately for males and females. I will look at the gender data in a future blog post, while here I will look only at the pooled data for each country.

National differences

In order to look at the current differences between the 33 countries (30 OECD, 3 non-OECD), I have performed this blog's usual exploratory data analysis. The available data are multivariate, since there are five measured variables for each country: total paid work, total unpaid work, total personal care time, and leisure time (each measured in average number of minutes per day), plus Other (to make a total of 1,440 minutes per day). One of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network. For this network analysis, I calculated the dissimilarity of the countries using the Manhattan distance; a Neighbor-net analysis was then used to display the between-country similarities.
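As a minimal sketch of the distance step, the Manhattan distance between two countries is just the sum of the absolute differences across the five time-use variables. The country names and minutes below are invented for illustration (they are not the OECD figures), and the helper names are hypothetical; the Neighbor-net itself would then be computed from such a matrix in dedicated network software.

```python
from itertools import combinations

# Hypothetical minutes per day, in the order:
# (paid work, unpaid work, personal care, leisure, other).
# Invented numbers, not OECD data; each row sums to 1,440.
time_use = {
    "Country A": (300, 200, 650, 270, 20),
    "Country B": (280, 210, 660, 275, 15),
    "Country C": (360, 250, 600, 200, 30),
}

def manhattan(a, b):
    """Sum of absolute differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of variables")
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(profiles):
    """Return {(name1, name2): distance} for every unordered pair of names."""
    return {
        (p, q): manhattan(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }

distances = distance_matrix(time_use)
for pair, d in distances.items():
    print(pair, d)
```

Because every profile sums to the same 1,440 minutes, a large distance always means that time has been shifted from one activity to another, not that one country has "more time" than another.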

The resulting network is shown in the first graph. Countries that are closely connected in the network are similar to each other based on their average time management, and those countries that are further apart are progressively more different from each other.

National differences in amount of paid versus unpaid work

The expected cultural similarities of the countries are, in most cases, reflected in the network. For example, the country that we might expect to be the most different to the other 32 is Mexico (it is the only country from Latin America), and it is also the most isolated one in the network. It is characterized by having the greatest amount of average work time per day, particularly unpaid work, and the least leisure time.

Furthermore, the three Asian countries are clustered together: Japan, Korea, and China. They have similar high amounts of paid work, but much less unpaid work than the Mexicans.

On the other hand, it is not clear why India is shown as very similar to some of the European countries, given its very different culture. However, differences do appear in the gender patterns discussed below.

In other cases, there are occasional countries that are not where we might anticipate them to be in the network, given other known historical and cultural similarities, particularly language. For example, Sweden is not near the other Nordic countries (Denmark, Norway, Finland), as the Swedes report many more minutes of paid work per day, and correspondingly less time on each of the other activities. Portugal is not near Spain, Italy, and France, as the Portuguese also report more minutes of paid work, and specifically less leisure time. On the other hand, Australia is not near Canada, the USA, New Zealand, and the UK, because the Australians report fewer minutes of paid work but correspondingly more unpaid work time. The people of Latvia and Lithuania also report many more minutes of paid work per day than do those of Poland and Estonia.

Other differences

Lest you get the impression that historical and cultural ties dominate the time-management data, we can look at one part of the data in detail.

As noted above, the Personal Care data includes separate information for sleeping versus eating & drinking. In the next graph I have plotted these two variables against each other (in average minutes per day), for all 33 countries.

Time spent sleeping versus eating & drinking for 33 countries

As you can see, there is no correlation whatsoever between these two variables. That is, extra eating and drinking time does not come out of the time allocated for sleeping, or vice versa.
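For readers who want to check such a claim numerically, here is a minimal sketch of the Pearson correlation coefficient. The sleeping and eating values are invented (chosen so that the two variables are exactly uncorrelated) and are not the OECD data; the function name is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical average minutes per day for four countries (invented values).
sleeping = [500, 510, 520, 530]
eating_drinking = [90, 120, 120, 90]

r = pearson(sleeping, eating_drinking)
print(f"r = {r:.3f}")
```

A value of r near zero, as here, is what "no correlation" means in practice; values near +1 or -1 would indicate that one activity rises or falls in step with the other.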

Moreover, you will note that the denizens of the three Asian countries do not behave anything like each other, particularly as the Chinese sleep longer than everyone except the South Africans. Nor do the Swedes behave much like the Danes, in terms of eating and drinking.

Finally, the Mexicans report that they do not spend much time eating or sleeping, which follows from the work data discussed above. Instead, it is the Mediterranean peoples who like to spend their time eating and drinking. On the other hand, the Americans (and Canadians) certainly behave like they live on fast food, spending less time on eating and drinking than anyone else. They do, however, like their 8.5 hours of sleep per day, whereas most other populations seem to think that they can do without that extra half hour.

Monday, November 11, 2019

A new playground for networks and exploratory data analysis


[This is a post by Guido with some help from David]

There tend to be two types of studies of inheritance and evolution. First, there is evolution of organisms, either of the phenotype (morphology, anatomy, cell ultrastructure, etc) or genotype (chromosome, nucleotides). The latter involves direct inheritance, but it is often treated as including all molecules, although it is the nucleotides (and chromosomes) that get inherited, not amino acids, for example.

Second, there are studies of the evolution of behaviour, which have focused mainly on humans, of course, but can include all species. For humans, this includes socio-cultural phenomena, particularly language (written as well as spoken), but also cultural advancements such as social organization, tool use, agriculture, etc., which are inherited indirectly, by learning.

However, we rarely see studies that are multi-disciplinary in the sense of combining both physical and behavioural evolution. It is therefore very interesting to note the just-published preprint by:
Fernando Racimo, Martin Sikora, Hannes Schroeder, Carles Lalueza-Fox. 2019. Beyond broad strokes: sociocultural insights from the study of ancient genomes. arXiv.
These authors provide a review about the extent to which the analysis of ancient human genomes has provided new insights into socio-cultural evolution. This provides a platform for interesting future cross-disciplinary research.

The authors comment:
In this review, we summarize recent studies showcasing these types of insights, focusing on the methods used to infer sociocultural aspects of human behaviour. This work often involves working across disciplines that have, until recently, evolved in separation. We argue that multidisciplinary dialogue is crucial for a more integrated and richer reconstruction of human history, as it can yield extraordinary insights about past societies, reproductive behaviours and even lifestyle habits that would not have been possible to obtain otherwise.
Multi-disciplinary dialogue is a focal point here at the Genealogical World of Phylogenetic Networks. Since our blog embraces non-biological data, we have done a little brainstorming, to put forward some ideas based on Racimo et al.'s comments. The four figures contain some extra discussion, with some visual representations of the ideas.

Why it's important to correlate genetic, linguistic and socio-cultural data. The doodle shows a simple free expansion model of a founder population with three genotypes (yellow, green, blue), a shared language (L) and two major cultural innovations (white stars). Because of drift and stochastic intra-population processes (size represents the size of the actively reproducing populace), the first expansion (light gray arrows) led to 'tribes' that already show some variation. The smaller ones close to the founder population still spoke the same language, while the ones further away used variants (dialects) of L (L', still close to L; L'', more distinct). Because of bottlenecks, geographic distance and differing levels of inbreeding (the smaller a population and the farther away from the source, the more likely are changes in genotype frequency), each population has a different genotype composition. The second expansion (mid-gray arrows), mixing two sources, leads to a grandchild that evolved a new language M and lost the blue genotype. Because the cultural innovations are beneficial, we find them in the entire group. In extreme cases of genetic sorting and linguistic evolution, such shared cultural innovations may be the only evidence clearly linking all these populations.

Social-cultural character matrices

Correlating different sets of data and (cross-)exploring the signal in these data can be facilitated by creating suitable character matrices. In phylogenetics, we primarily use characters that are (ideally) subject to neutral evolution, such as nucleotide sequences and their transcripts, amino-acid sequences. When using matrices scoring morphological traits, we relax the requirement of neutral evolution, but we are still scoring traits that are the product of biological evolution. However, we don't need to stop there: phylo-linguistics is an active field, even though languages involve different evolutionary constraints and processes than we meet in biology. Data-wise there are nonetheless many analogies, and phylogenetic methods seem to work fine.

So, why not also score socio-cultural traits in a character matrix? For instance, we can characterize cultures and populations by basic features including: the presence of agriculture, which crops were cultivated, which animals were domesticated, which technological advances were available, whether it was a stone-age, bronze-age, iron-age culture, etc. Linguistically, we could also develop matrices of local populations, with regional accents or dialects, etc.

Creating such a matrix should, of course, be informed by available objective information. As in the case of morphological matrices or non-biological matrices in general, we should not be concerned about character independence. We don't need to infer a phylogenetic tree from these matrices, as their purpose is just to sum up all available characteristics of a socio-cultural group.

Second phase: stabilization of differentiation pattern. While the close-by tribes are still in contact with the mother population, the most distant have lost contact. As a consequence, the gene pools of the L/L'-speaking communities will become more similar, and new innovations acquired by the founder population (black star) are readily propagated within its cultural sphere. Re-migration from the larger M-speaking tribe to the struggling L''-speakers (small population with high inbreeding levels) leads to the extinction of the blue genotype in the latter and increased 'borrowing' of M-words and concepts.

Distance calculations

Pairwise distance matrices are most versatile for comparing data across different data sets.

First, any character matrix can be quickly transformed into a distance matrix, and the right distance transformation can handle any sort of data: qualitative, categorical data as well as quantitative, continuous data.

Second, the signal in any distance matrix can be quickly visualized using Neighbor-nets. This blog has a long list of posts showing Neighbor-nets based on all sorts of sociological data that don't follow any strict pattern of evolution, and are heavily biased by socio-cultural constraints (eg. bikability, breast sizes, German politics, gun legislation, happiness, professional poker, spare-time activities). We have even included celestial bodies.

Third, distance matrices can be tested for correlation as-is, without any prior inference, using simple statistics, such as the Pearson correlation coefficient. To give just one example from our own research: in Göker and Grimm (BMC Evol. Biol. 2008), the latter was used for testing the performance of character and distance transformations for cloned ITS data covering substantial intra-genomic diversity, by correlating the resulting individual-based distances with species-level morphological data matrices. (The internal transcribed spacers are multi-copy, nuclear-encoded, non-coding gene regions; in the simplest case each individual has two sets of copies, or arrays, one inherited from the father and the other from the mother, which may differ between but also within individuals.)
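To make the first point concrete, here is a minimal sketch of one possible transformation for a mixed character matrix. It is a Gower-style distance (categorical characters count as match or mismatch, numeric characters are scaled by their observed range, and missing entries are skipped); this is our illustrative choice, not a method prescribed by Racimo et al. or used in Göker & Grimm. The cultures, characters and function names are all invented.

```python
from itertools import combinations

# Hypothetical socio-cultural character matrix (invented data):
# (agriculture present?, metal age, number of domesticated species).
KINDS = ("categorical", "categorical", "numeric")
cultures = {
    "P": ("yes", "bronze", 4),
    "Q": ("yes", "iron", 6),
    "R": ("no", "stone", 0),
}

def numeric_ranges(matrix, kinds):
    """Observed range (max - min) of each numeric character; None otherwise."""
    ranges = []
    for i, kind in enumerate(kinds):
        if kind != "numeric":
            ranges.append(None)
            continue
        values = [row[i] for row in matrix.values() if row[i] is not None]
        ranges.append(max(values) - min(values) if values else 0)
    return ranges

def gower_distance(a, b, kinds, ranges):
    """Mean per-character dissimilarity, skipping characters missing in either unit."""
    total, used = 0.0, 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # unscored character: ignore for this pair
        if kind == "categorical":
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / rng if rng else 0.0
        used += 1
    if used == 0:
        raise ValueError("no shared scored characters")
    return total / used

ranges = numeric_ranges(cultures, KINDS)
dist = {
    (p, q): gower_distance(cultures[p], cultures[q], KINDS, ranges)
    for p, q in combinations(sorted(cultures), 2)
}
for pair, d in dist.items():
    print(pair, round(d, 3))
```

The resulting pairwise distances lie between 0 and 1, so matrices built from quite different kinds of characters end up on a comparable scale.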

In the context of Racimo et al.'s paper, one could construct a genetic, a socio-cultural, a linguistic and a geographical matrix, determine the pairwise distances between what in phylogenetics are called OTUs (operational taxonomic units), and test how well these data (or parts of them) correlate. The OTUs would be local human groups sharing the same culture (and, if known, language).
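A rough sketch of such a test is given below: the Pearson correlation is computed over the above-diagonal entries of two distance matrices for the same OTUs. Because pairwise distances are not independent of each other, we have added a simple permutation (Mantel-type) test for the p-value; that addition is our assumption about how one might proceed, not something taken from the papers cited. The two matrices are invented, and all function names are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def upper_triangle(matrix, order):
    """Above-diagonal entries of a symmetric matrix, with OTUs taken in `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(m1, m2, permutations=999, seed=0):
    """Matrix correlation plus a one-sided permutation p-value."""
    identity = list(range(len(m1)))
    fixed = upper_triangle(m1, identity)
    observed = pearson(fixed, upper_triangle(m2, identity))
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the OTUs of the second matrix
        if pearson(fixed, upper_triangle(m2, perm)) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)

# Hypothetical symmetric distance matrices for the same four OTUs (invented).
genetic = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 2], [5, 4, 2, 0]]
cultural = [[0, 2, 6, 7], [2, 0, 5, 6], [6, 5, 0, 3], [7, 6, 3, 0]]

r, p = mantel(genetic, cultural)
print(f"matrix correlation r = {r:.3f}, permutation p = {p:.3f}")
```

With only four OTUs there are just 24 possible relabelings, so the p-value cannot become very small; real data sets with more OTUs give the test more resolution.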

Alternatively, one can just map the scored socio-cultural traits onto trees based on genetic data or linguistics.

A new culture with its own language (Λ), genotype (red) and innovations (ruby-red pentagon) migrates close to the settling area of the L-people. Because of raids, genotypes and innovations from the L-people get incorporated into the Λ-culture.

How to get the same set of OTUs

The Göker & Grimm paper mentioned above tested several options for character and distance transformations, because we faced a similar problem to what researchers will face when trying to correlate socio-cultural data with genetic profiles of our ancestors: a different set of leaves (the OTUs). We were interested in phylogenetic relationships between individuals using data representing the genetic heterogeneity within these individuals.

Genetic studies of human (ancient or modern) DNA use data from individuals, but socio-cultural and linguistic data can only be compiled at a (much) higher level: societies, or other groups of many individuals. In addition, these groups may also span a larger time frame. Since humans love to migrate, we are even more of a genetic mess than were the ITS data that we studied.

One potential alternative is to use the host-associate analysis framework of Göker & Grimm. Instead of using the individual genetic profiles (the associate data), one sums them across a socio-cultural unit (serving as host). The simplest method is to create a consensus of the data (in Göker & Grimm, we tested strict and modal consensuses). This produces sequences with a lot of ambiguity codes; genetic diversity within the population will be represented by intra-unit sequence polymorphism (IUSP). Standard distance and parsimony implementations do not deal with ambiguities, but maximum likelihood, as implemented in RAxML, does to some degree. A stopgap is the recoding of ambiguities as discrete states for phylogenetic analysis (tree and network inference), as done by Potts et al. (Syst. Biol. 2014) for 2ISPs ('twisps'), intra-individual site polymorphisms. It can't hurt to try out whether this works for IUSPs, too.
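To illustrate how a consensus turns within-unit diversity into ambiguity codes, here is a minimal sketch using the standard IUPAC nucleotide codes. The two consensus definitions used here (recording every state observed at a site, versus keeping only the most frequent state or states) are our simplifying assumptions and may differ in detail from the strict and modal consensuses tested in Göker & Grimm. The sequences and function names are invented.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def _columns(seqs):
    """Yield alignment columns after checking that all sequences are equally long."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return zip(*seqs)

def all_states_consensus(seqs):
    """Encode every base observed at a site (assumed definition, see lead-in)."""
    return "".join(IUPAC[frozenset(col)] for col in _columns(seqs))

def modal_consensus(seqs):
    """Encode only the most frequent base(s) at each site; ties stay ambiguous."""
    out = []
    for col in _columns(seqs):
        counts = Counter(col)
        top = max(counts.values())
        out.append(IUPAC[frozenset(b for b, c in counts.items() if c == top)])
    return "".join(out)

# Hypothetical aligned sequences from three individuals of one socio-cultural unit.
unit = ["ACGT", "ACGA", "ATGA"]
print(all_states_consensus(unit))  # AYGW
print(modal_consensus(unit))       # ACGA
```

The first consensus keeps all the within-unit polymorphism (Y for C/T, W for A/T), whereas the second discards minority variants, which is the trade-off one faces when summarizing individuals into units.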

Since humans (tribes, local groups) often differ in the frequency of certain genotypes, it would be straightforward to use these frequencies directly when putting up a host matrix. Instead of, for example, nucleotides or their ambiguity codes, the matrix would have the frequency of the different haplotypes. We can't infer trees from such a matrix (we need categorical data), but we can still calculate the distance matrix and infer a Neighbor-net.

The 'phylogenetic Bray-Curtis' (distance) transformation introduced in Göker & Grimm (2008) also keeps the information about within-host diversity when determining inter-host distances (see Reticulation at its best ...)


Transformations for genetic data from smaller to larger, more-inclusive units are implemented in the software package POFAD by Joly et al. (Methods in Ecology & Evolution, 2015). Their paper also provides a comparison of different methods, including the ones tested in Göker & Grimm (2008, also implemented in the tiny executables g2cef and pbc, compiled for any platform).

The process of assimilation. The Λ-people subdued the L-culture, with the consequence that all innovations are shared in their influence sphere. Having a much smaller total population size, the language of the invaders is largely lost but the new common language L* still includes some Λ-elements (in a phylogenetic tree analysis, L* would be part of the L/M clade; using networks, L* would share edges with Λ in contrast to L and M). The L''/M-speaking remote population is re-integrated. The invaders' genotype (red) becomes part of the L-people's gene pool. Re-migration (forced or not) introduces L-genotypes into the original Λ-population. Only by comparing all available data, ideally covering more than one time period, can we deduce that the M-speakers represent an early isolated subpopulation of the L-people that was not affected by the Λ-invasion. With only the genetic data at hand, one may identify the M-speakers as one source and the Λ-tribe as another source for the L*-people, and infer that all L/M and Λ-tribes share a common origin (since the yellow genotype is found in both the M- and the original Λ-population).

Conclusion

It therefore seems to us that there is enormous potential for multi-disciplinary work that truly combines organismal and socio-cultural evolution. We have provided a few practical suggestions here about how this might be done. We encourage you all to try some of these ideas, to see where they lead us all.

Monday, August 12, 2019

Public transit trips in the USA


Public transport, or mass transit, has long been a politically charged issue, throughout the world. However, the modern world now recognizes that it is an effective way to deal with mass movements of people in a manner that respects the use of non-renewable resources.

After all, the only way to continue with autonomous transportation is to get rid of fossil fuels. However, electric cars will not be of much use until we work out where we are going to get all of the needed extra electricity, in a manner that is environmentally friendly. There is not much point in simply moving the burning of fossil fuels from the vehicle (ie. gasoline) to a power station that also burns fossil fuels (eg. coal). There is also a limit to how many rivers there are left to dam for hydroelectric power; and nuclear reactors have gone out of fashion (fortunately). There is also, of course, the matter of how we are going to recycle the used (lithium-ion) batteries from the cars, which is apparently a tougher proposition than recycling the electric motors themselves.


So, until we sort this out, mass transit is a viable option for most conurbations. In this context, a conurbation (or a metropolitan area) is a contiguous area within which large numbers of people move regularly, especially traveling to and from their workplace each weekday. A conurbation often involves multiple cities and towns, as defined by political administrations or contiguous urban development — many people live in one urban area but work in another.

So, naturally, governments collect data on these matters. One such data collection is the U.S. Department of Transportation's National Transit Database. The data consist of "sums of annual ridership (in terms of unlinked passenger trips), as reported by transit agencies to the Federal Transit Administration." Data for three separate modes of transit are included: bus, rail, and paratransit. The data currently cover the years 2002–2018, inclusive.

To look at the data for the 42 U.S. conurbations included, for the year 2018, I have performed this blog's usual exploratory data analysis. I first calculated the transit rate per person, by dividing the annual number of trips for each of the three modes by the conurbation population size. Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network. For this network analysis, I calculated the dissimilarity of the conurbations using the Manhattan distance. A Neighbor-net analysis was then used to display the between-area similarities.
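The per-person standardization is simple arithmetic, sketched below. The two areas, their populations and their trip counts are invented for illustration and are not taken from the National Transit Database; the field and function names are hypothetical.

```python
MODES = ("bus", "rail", "paratransit")

# Hypothetical annual unlinked passenger trips and population (invented values).
conurbations = {
    "Area A": {"population": 2_000_000, "bus": 60_000_000,
               "rail": 40_000_000, "paratransit": 1_000_000},
    "Area B": {"population": 500_000, "bus": 5_000_000,
               "rail": 0, "paratransit": 250_000},
}

def trips_per_person(record):
    """Annual trips per resident for each mode, in the order given by MODES."""
    population = record["population"]
    if population <= 0:
        raise ValueError("population must be positive")
    return tuple(record[mode] / population for mode in MODES)

rates = {name: trips_per_person(rec) for name, rec in conurbations.items()}
for name, rate in rates.items():
    print(name, rate)
```

Dividing by population is what allows a small area with heavy transit use to sit near a large one in the network; with raw trip counts the ordering would mostly reflect city size.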

The resulting network is shown in the graph. Conurbations that are closely connected in the network are similar to each other based on the trip rates, and those areas that are further apart are progressively more different from each other. In this case, there is a simple gradient from the busiest mass transit systems at the top of the network to the least busy at the bottom.


The network shows us that the New York – Newark transit-commuting area (which covers part of three states) is far and away the busiest in the USA. The subway system dominates this mass transit, of course, as it is justifiably world famous, although not always for the best of reasons as far as commuters are concerned.

The San Francisco – Oakland area is in clear second place. Here, bus transit slightly exceeds rail transit. Then follow Washington DC and Boston, both of which also cover parts of three states. In Boston trains out-do buses 2:1, while in Washington it is closer to 1.5:1.

Next comes a group of four conurbations: Chicago, Philadelphia, Portland and Seattle. Two of these cover part of Washington state, but in quite different ways: in Seattle the buses dominate the system 5:1 but in Portland it is only 1.5:1. Chicago and Philadelphia share buses and trains pretty equally.

At the bottom of the network there are two large groups of conurbations, one of which does slightly better than the other at mass transit use. The least-used system is that of San Juan, in Puerto Rico, perhaps not unexpectedly. Of the contiguous U.S. states, Indianapolis (IN) has the least used system, followed by Memphis (TN–MS–AR).

Moving on, we could also look at changes in the total number of transit trips (irrespective of mode) during the period for which data are available: 2002–2018. A network is of little help here. So, it is simplest just to plot the data, as shown in the next graph.


For most of the metropolitan areas there is little in the way of consistent change through time. However, there are some areas that show high correlations between the number of trips and time. These are the areas that have shown the most consistent increase in the number of transit trips from 2002–2018:
  • Chicago (IL–IN)
  • Tampa – St Petersburg (FL)
  • Baltimore (MD)
  • Denver – Aurora (CO)
  • San Francisco – Oakland (CA)
  • Memphis (TN–MS–AR)
  • San Diego (CA)
  • Cleveland (OH)
  • Providence (RI–MA)
  • Orlando (FL)
  • Indianapolis (IN)
  • New York – Newark (NY–NJ–CT)
  • Portland (OR–WA)
  • Minneapolis – St Paul (MN–WI)
Sadly, there are also areas that have shown a consistent decrease in the number of transit trips through time (2002–2018):
  • Kansas City (MO–KS)
  • Columbus (OH)
  • Riverside – San Bernardino (CA)
Presumably these are the areas where the local politicians should be looking into how to address this long-term issue.

Declining transit numbers is a topic discussed around the web; for example: Transit ridership down in most American cities. This article has a graph neatly showing the change in transit numbers from 2017 to 2018. It shows marked decreases, particularly for bus trips, while the few increases almost all involved rail travel. Is this a short-term effect, or the start of a general long-term decline?

Monday, June 17, 2019

Ockham's Razor applied, but not used: can we do DNA-scaffolding with seven characters?


One of the most interesting research areas in organismal science is the cross-road between palaeontology and neontology, which puts together a picture marrying the fossil record with molecular-based phylogenies. Unfortunately, when it comes to plant (palaeo-)phylogenetics, some people adhere to outdated analysis frameworks (sometimes with little data).

How to place a fossil?

The fossil record is crucial for neontology as it can provide age constraints (minimum ages when doing node dating) and inform us about the past distribution of a lineage. This, especially in the case of plants that can't run away from unfortunate habitat changes, can be much different than today.

The main question in this context is whether a fossil represents the stem, ie. a precursor or extinct ancient sister lineage, or the crown group, ie. a modern-day taxon (primarily modern-day genus). For instance, the oldest crown fossil gives the best-possible minimum age for the stem (root) age of a modern lineage, whereas a stem fossil can give (at best) only a rough estimate for the crown age of the next-larger taxon/clade when doing the common node dating of molecular trees (note that fossilized birth-death dating can make use of both).

There are two commonly accepted criteria to identify a crown-group fossil:
  1. The apomorphy-based criterion argues that if a fossil shows a uniquely derived character (ie. an aut- or synapomorphy sensu Hennig) or character suite diagnostic for a modern-day genus, it represents a crown-group fossil.
  2. The phylogeny-based criterion aims to place the fossil in a phylogenetic framework; the position of the fossil in the genus- or species-level tree (most commonly done) or network (rarely done but producing much less biased or flawed results) then informs what it is.
(We will focus on members of modern-day genera, since it becomes trickier for higher-level taxa, see eg. my posts thinking about What is an angiosperm? [part1][part2][why I pondered about it].)

There are three basic options to place a fossil using a phylogenetic tree.
  1. Putting up a morphological matrix, then inferring the tree. This is a classic approach but, due to the nature of most morphological data sets, it leads to a partly wrong tree, as we demonstrated in some posts here on the Genealogical World of Phylogenetic Networks (hence, such analysis should always be done in a network-based exploratory data analysis framework).
  2. Putting up a mixed molecular-morphological matrix, then inferring a "total evidence" tree. This includes sophisticated approaches that use the molecular data to implement weights on the morphological traits and/or consider the age of the fossils (so-called total evidence dating approaches). This works reasonably well with animal data, provided the matrix includes a lot of morphological traits reflecting aspects of the (molecular-based) phylogeny. It doesn't work too well for plants because we usually have far fewer scorable traits, most of which evolved convergently or in parallel. Non-trivial plant fossils love to act as rogues during phylogenetic inference.
  3. Optimising the position of a fossil in a molecular-based tree, eg. using the so-called "DNA scaffold approach" (usually using parsimony as optimality criterion) or the evolutionary placement algorithm implemented in RAxML (using maximum likelihood). A special form of this approach is to first map the traits on a (dated) molecular tree, and then find the position where a fossil would fit best.

Why (standard) phylogenetic tree-based approaches are tricky

Below is a simple example, including three fossils of different age (and often, place) with different character suites.


Even though none of the derived traits (blue and red "1") is a synapomorphy (fide Hennig), we can assign the youngest fossil X to the lineage of genus 1A just based on its unique derived ('apomorphic') character suite. It's likely a crown-group fossil of Clade 1, and may inform a minimum age for the most-recent common ancestor (MRCA) of the two modern-day genera of Clade 1.
Apomorphy-wise, fossils Y and Z cannot be unambiguously placed. The red trait appears to be independently obtained in both clades, and the blue trait may have been
To discern between the options, we'd be well-advised to do character mapping in a probabilistic framework, which requires a tree with independently defined branch-lengths.

Just by using parsimony-based DNA-scaffolding, fossil X would be confirmed as a crown-group fossil and member of genus 1A (being identical to it and different from all others) and fossil Z would end up as a stem-group fossil. Fossil Y, however, would be placed as sister to genus 2C (again, identical to each other and different from all others). Using Y in node dating would then lead to a much too old crown-group age for Clade 2. In reality, what researchers do with such a seemingly too old fossil is not to use it by the book, as MRCA of Genus 2B and 2C, but to inform the MRCA of eg. genera 2A, 2B, and 2C, assuming that the fossil's age and trait set indicate the 2C morphology is primitive within the clade, or that Y is an extinct sister lineage and the shared derived trait a convergence (parallelism).

Four characters, three homoplastic and one invariant, are surely not enough for DNA-scaffolding, but adding more and more characters has a catch. This is easy to do for the modern-day taxa, for which we also have molecular data, but the preservation of fossils limits how many traits can be added; any trait not preserved in the fossil is effectively useless when placing it (including not-preserved traits in a total evidence approach may, nonetheless, help the analysis). Which brings us to the real-world example just published in Science:

Wilf P, Nixon KC, Gandolfo MA, Cúneo RA (2019). Eocene Fagaceae from Patagonia and Gondwanan legacy in Asian rainforests. Science 364, 972. Full-text article at Science website.

Why one should not place a fossil using DNA-scaffolding with seven characters

Wilf et al. show (another) spectacularly preserved fossil from the Eocene of Patagonia. Personally, I think that just publishing and shortly describing such a beautiful fossil should be enough to get into the leading biological journals.

But Wilf et al. wanted (needed?) more and came up with the following "phylogenetic analysis" to argue that their fossil is a crown-group Castanoideae, a representative of the modern-day firmly Southeast Asian tropical-subtropical genus Castanopsis, and evidence for a "southern route to Asia hypothesis" (via Antarctica and Australia, both well-studied but devoid so far of any Fagaceae presence; despite the fact that the modern-day climate allows cultivating them as eg. source for commercially used wood).


Wilf et al.'s Fig. 3 and Table 1 suggest to me that the paper was not critically reviewed by anyone familiar with the molecular genetics of Fagaceae or phylogenetic methods in general; perhaps this is not needed, since the first author is well-merited and the second author a world-leading expert in botanical palaeo-cladistics. However, parsimony-based DNA-scaffolding can be tricky, even with a larger set of characters (see eg. the post on Juglandaceae using a well-done matrix), and using seven is therefore quite bold. Notably, of the seven characters, one is parsimony-uninformative and four are variable within at least one of the included OTUs.

Side note: The tree used as a backbone is outdated and not comprehensive. Plastid and nuclear-molecular data indicate that the castanoids Lithocarpus (mostly tropical SE Asia) and Chrysolepis (temperate N. America) may be sisters. However, the morphologically quite similar Notholithocarpus is not related to either of these, but is instead a close relative of the ubiquitous oaks, genus Quercus (not included in Wilf et al.'s backbone tree), especially subgenus Quercus. Furthermore, the (today Eurasian) castanoid sister pair Castanea (temperate)-Castanopsis (tropical-subtropical) have stronger affinities to the (today and in the past) Eurasian oaks of subgenus Cerris. The Fagaceae also include three distinct monotypic relict genera, the "trigonobalanoids" Formanodendron and Trigonobalanus, SE Asia, and Colombobalanus from Colombia, South America. Using a more up-to-date instead of a 2-decade-old molecular hypothesis would have been a fair request during review, as would compiling a new molecular matrix to infer a tree used as backbone (currently gene banks include > 238,000 nucleotide DNA accessions including complete plastomes). This would have also enabled the authors to map their traits using a probabilistic framework, which can protect to some degree against homoplastic bias but requires a backbone tree with defined branch-lengths.

There are many more problems with the paper and its conclusions, but this critique would be content- not network-related. Let's just look at the data and see why Wilf et al. would have been better off not showing any phylogenetic analysis at all (and the impact-driven editors and well-meaning reviewers should have advised against it). Or a network.

Clades with little character support

The scaffolding placed the Eocene fossil in a clade with both representatives of Castanopsis, from which it differs by 0–2 and 1–4 traits, respectively. Phylogeny-based, the fossil is a stem- or crown-Castanopsis.
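Ranges such as "0–2" arise because some OTUs are scored as polymorphic, so the number of differences depends on which state is picked. The following Python sketch shows one way to compute such a range; the function name and all scorings are hypothetical, invented only so that the output resembles the kind of range quoted above, and they are not the published matrix.

```python
def difference_range(otu_a, otu_b):
    """Return (min, max) number of differing characters between two OTUs.

    Each character is a set of observed states; a set with several states
    is a polymorphism, and an empty set means missing data (skipped).
    """
    min_diff = 0
    max_diff = 0
    for a, b in zip(otu_a, otu_b):
        if not a or not b:
            continue  # character not scored in one of the OTUs
        if a.isdisjoint(b):
            min_diff += 1  # no shared state: must differ
        if len(a | b) > 1:
            max_diff += 1  # some choice of states makes them differ
    return min_diff, max_diff

# Hypothetical seven-character scorings (NOT Wilf et al.'s data).
fossil = [{0}, {0}, {1}, {0}, {0}, {1}, {0}]
otu_one = [{0}, {0, 1}, {1}, {0}, {0, 1}, {1}, {0}]
otu_two = [{1}, {0, 1}, {1}, {0}, {0, 1}, {0, 1}, {0}]

print(difference_range(fossil, otu_one))
print(difference_range(fossil, otu_two))
```

With only seven characters, a single polymorphic scoring shifts the range noticeably, which is part of why so small a matrix gives little grip on the fossil's position.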

However, the fossil has a character suite that differs in just a single trait (#6: valve dehiscence) from the (genetically very distant) sister taxon of all other Fagaceae, Fagus (the beech), used here as the outgroup to root the Castanoideae subtree. As far as apomorphies are concerned, the data are inconclusive as to whether the fossil represents a stem-Castanoideae (or extinct Fagaceae lineage) or a Castanopsis: this critical, potentially diagnostic derived trait, partial valve dehiscence, is only shared by the fossil and some but not all modern-day Castanopsis. This particular trait is not mentioned elsewhere in the text, although it is the reason the fossil is placed next to Castanopsis and not the outgroup Fagus in the "phylogenetic analysis".

In the following figure, I have mapped (with parsimony) the putative character mutations on the tree used by Wilf et al.

Black font: shared by Fagus (outgroup) and "Castanoideae". Green font: potential uniquely derived traits. Blue font: traits reconstructed as having evolved in parallel/convergently. Red branches, clades in the used backbone tree that are at odds with currently available molecular data (the N. American relict Notholithocarpus should be sister to the Eurasian Castanea-Castanopsis).

This hardly presents a strong case of crown-group assignation. Except for partial dehiscence, even the modern-day Castanopsis have few discriminating derived traits; they are living fossils with a primitive ('plesiomorphic') character suite. Intriguingly, they are also genetically less derived than other Castanoideae and the oaks (see eg. the ITS tree in Denk & Grimm 2010).

The actual differentiation pattern

The best way to depict what the character set provides as information for placing the fossil is, of course, the Neighbor-net, as shown next.

Neighbor-net based on Wilf et al.'s seven scored morphological traits used to place the fossil. Green: the current molecular-based phylogenetic synopsis — based mostly on Oh & Manos 2008; Manos et al. 2008; Denk & Grimm 2010. I had the opportunity to get familiar with all of the then-available genetic data when harvesting all Fagaceae data from gene banks in 2012 for a talk in Bordeaux. One complication in getting an all-Fagaceae-tree is that plastids, geographically constrained, and nuclear regions tell partly different stories.


Castanopsis, including the fossil, is morphologically paraphyletic (see also our other posts dealing with paraphyla represented as clades in trees). Note also the long edge-bundle separating the temperate Chrysolepis and chestnuts (Castanea) from their respective cold-intolerant sister genera (Lithocarpus viz Castanopsis): derived traits have been accumulated in parallel within the "Castanoideae". The scored aspects of Fagaceae morphology are very flexible and ~50 million years is a long time, possibly leading to partial valve indehiscence (or losing it) without being part of the same generic lineage. The puzzling differentiation, and the profoundly primitive appearance of the fossil (shared with modern-day Castanopsis), may in fact be the reason the authors: (i) did not optimize / discuss very similar, coeval fossils from the Northern Hemisphere interpreted (and cited) as extinct genera (eg. Crepet & Nixon 1989), (ii) left out the two Fagaceae genera today occurring in South America, (iii) opted for classic parsimony and a partly outdated molecular hypothesis, and (iv) just showed a naked cladogram without branch support values as the result of their "phylogenetic analysis" (Please stop using cladograms!).

Based on the scored characters, the position of the fossil in the graph, and on the background of a more up-to-date molecular-based phylogenetic synopsis (the green tree in the figure above), the most parsimonious interpretation (and probably, the most likely) is that the fossil may indeed be a stem-Castanoideae, a representative of the lineage from which the Laurasian oaks evolved at least 55 million yrs ago (oldest Quercus fossil was found in SE Asia), or even represent a morphologically primitive, extinct (South) American lineage of the Fagaceae. Regarding the "southern route", Ockham's Razor would favor the view that the Patagonian fossils are just a South American extension of the widespread Eocene Laurasian Fagaceae / Castanoideae, since very similar fossils and castaneoid pollen are found in equally old and older sites in North America, Greenland (papers cited by Wilf et al.) and Eurasia but not Australia, New Zealand or Antarctica.

A final note: when you have so few characters to compare, you should use OTUs that are not completely ambiguous in every potentially discriminating character, as scored for the "C. fissa group" — the "Castanopsis group" has a single unambiguously defined, potentially derived trait. Using artificial bulk taxa is generally a bad idea when mapping trait evolution onto a molecular backbone tree. Instead, you should compile a representative placeholder taxa set, with as many taxa as you need (or are feasible) to represent all character combinations seen in the modern species/genera.

Other cited references, with comments
Crepet WL, Nixon KC (1989) Earliest megafossil evidence of Fagaceae: phylogenetic and biogeographic implications. American Journal of Botany 76: 842–855. – introducing a Castanopsis-like infructescence interpreted to represent an extinct genus but very similar to the new Patagonian fossil in its preserved features; and co-occurring with castaneoid pollen (not reported so far for Patagonia) and foliage.
 
Denk T, Grimm GW (2010) The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366. – includes an all-"Quercaceae" ITS-tree (fig. 3) and -network (fig. 4) using data of ~ 1000 ITS accessions; the nuclear-encoded ITS is so far the only comprehensively sampled gene region that gets the genera and main intra-generic lineages apart (recently confirmed and refined by NGS phylogenomic data), something wide-sampled plastid barcodes struggle with. Analysed with up-to-date methods and avoiding long-branch interference by excluding the only partially alignable Fagus, Castanopsis dissolves into a grade in the all-accessions tree and Quercus is deeply nested within the Castanoideae (as already seen in the 2001 tree used by Wilf et al. as backbone). The species-level PBC neighbor-net prefers a circular arrangement in which Notholithocarpus remains a putative sister of substantially divergent and diversified Quercus, followed by Castanea-Castanopsis, and Lithocarpus, while Chrysolepis is recognized as unique.

Oh S-H, Manos PS (2008) Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57: 434–451. – Probably still the best Fagaceae tree, and surely not a bad basis for probabilistic mapping of morphological traits in the family.

Manos PS, Cannon CH, Oh S-H (2008) Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55: 181–190. – the tree failed to resolve the monophyly of the largest genus, the oaks, but depicts well the data reality when combining ITS with plastid data and, hence, provides a good trade-off guide tree.

Monday, April 22, 2019

The 2nd Amendment does more than keep King George away


A year ago, in the aftermath of the Florida shooting, I used a neighbor-net as a way to visualize U.S. gun legislation (see the first graph here). In this post, we will use this network to explore some other aspects of American society.

A network illustrating the diversity in U.S. gun legislation. Blue stars – states with a gun registry.

The network picture emphasizes those states where guns are regulated to some extent (in green), but this means that the states at the bottom-left have little or no regulation of gun ownership. Note, first, that the U.S. gun lobby argues that the absence of any gun control is covered by the 2nd Amendment to the U.S. Constitution, which covers the right of citizens to form a "well regulated militia", an amendment installed to protect the freedom of the new republic from the former British sovereign (ie. to "keep King George away").

This claim ignores the fact that "well regulated" implies regulation of some sort, while the network emphasizes its absence in many cases. Besides, the risk of being re-conquered by Her Majesty's Royal Army is quite low these days, with or without Brexit. More to the point, the world itself has changed quite a bit since the 1700s, while the Constitution has had only a few Amendments added and subtracted.

If we start our use of the neighbor-net to look at the data, then we can see that there is at least one obvious consequence of unregulated gun ownership. For example, the next plot shows the number of gun-related deaths (in 2016) super-imposed on the gun-regulation network.

The total number of firearm-related deaths in 2016 (includes accidents and suicides).
Data from worldlifeexpectancy.com; this and more plots can be found here:
Visualising U.S. gun legislation, and mapping politics, economics, and population

There seems to be a good correlation between unregulated gun ownership and the probability of getting shot or shooting yourself — the number of shootings is greatest in the lower-left of the network, where gun ownership is essentially unregulated (see the Gun Violence Archive for current numbers).

Arming every citizen may have helped to fend off King George's Redcoats, but in the long run, a substantial number of Americans (c. 275,000 per year; when compared with Canada's rate) would still be alive if the Colonies had become HRM's dominion like Australia or Canada. Both Canadians and Australians own a lot of firearms per capita (see the Small Arms Survey for up-to-date estimates), but while Canada long had Europe-style legislation (and low casualty frequencies), Australia implemented it more recently, leading to a massive drop in firearm-related deaths (see above).

As a side note, arming every male citizen to secure freedom from a feudal lord was probably a Swiss invention (see the Swiss Federal Charter of 1291, the Bundesbrief). Switzerland has a compulsory general draft of young males, and after this service they take their Sturmgewehr back home for the yearly training exercise, and to be prepared to fend off invaders (until 2007, including the ammunition). They have a ~4-times lower rate of firearm-related deaths (2.8 in 2015 according to GunPolicy.org; nearly all of them males); the only EU country approaching the lowest U.S. values is Finland, and there it is nearly exclusively accidents and suicides.

Other factors

It is important to keep in mind that the United States is a true federation of states, with each state having a substantial amount of autonomy, which is not found in any other country with a federal organization. Hence, many other aspects differ between states, not just the substantial differences in gun legislation.

For example, economics differ greatly between the states, and this also shows a reasonable correlation with gun regulation, as seen in this next version of the network. Note that Gross Domestic Product (GDP) is a monetary measure of the market value of all the goods and services produced annually — rich places have high GDP and poor places have lower GDP.

Real gross domestic product per capita mapped on the gun-legislation-based network.
Red, below global U.S. value; green above global U.S. value.
Data source: U.S. Bureau of Economic Analysis.

So, the economically poorer the state, the less likely there is to be gun regulation.

Modern developments include allowing women into the armed forces, and granting them the right to vote. For example, the 19th Amendment to the US Constitution granted women the right to vote, which was passed by Congress June 4, 1919, and ratified on August 18, 1920. This first map shows the situation for the European Union, some parts of which lagged behind the U.S.

Implementation of general right to vote within the countries of the EU (source: Süddeutsche Zeitung).
In the case of Germany and France, the reason was a lost war leading to the (re)establishment of new republics.

Women make up about 50% of the populace and (usually) more than 50% of the electorate (having a generally higher life expectancy), but they are still typically under-represented in parliaments (here are a few examples). The United States is, sadly, a good example of this imbalance. This next map shows that the women in 13 states currently have no same-sex representation in the U.S. Congress.

Female representation in the current U.S. Congress.
The green part of each pie chart indicates the proportion of women representatives.

This leads to the obvious question for this blog post: how does the absence of female representatives (and senators) relate to the absence of gun regulation? So, let's map the above collection of pie charts onto the gun legislation network.

Female representation in the U.S. Congress after 2018 mid-term elections
(includes Senate and House of Representatives).
The c. 700,000 inhabitants of DC, District of Columbia, have no representation in
Congress at all, but send a non-voting delegate to the House.

There is a general trend: those states with little or no gun regulation (bottom left) have less female representation than those with (some) gun regulation. Perhaps someone took the 2nd Amendment a bit too literally (the right of every man to carry a gun), and this keeps not only King George away from the country but also women away from Congress?

Exceptions from the generalization (starting with 75% going down to 33%) are sparsely populated states with only a few members of Congress: New Hampshire (NH, 75%; 2 representatives in addition to the two U.S. senators representing each state), Maine (ME, 2 reps.), West Virginia (WV; 3 reps), Alaska (AK; 1 rep.), New Mexico (NM; 3 reps), and Nevada (NV; 4 reps). All of these states have one thing in common: a substantial proportion of the state is wilderness.

At the other end, some states with relatively high levels of gun regulation, like Maryland (MD; 8 reps), Rhode Island (RI; 2 reps), New Jersey (NJ; 12 reps) and Colorado (CO; 7 reps), lack women in Congress (0–15%, ie. one representative or none). This may relate to these states being very densely populated (MD, RI, NJ), and, irrespective of outside threats, no-one wants their close neighbors running around with guns. Colorado is particular in this sense, because with Denver it includes a major population center (the nucleus of the emerging Front Range megaregion), and it enforced much stricter gun regulation than found elsewhere in the state.

A map showing Colorado's congressional districts, for the 113th Congress.
Data from the defunct digital version of the U.S. National Atlas.

Do more women in parliament save American lives?

According to a recent Gallup poll, Americans have the highest regard for nurses, a profession mostly occupied by women and lowest regard for Members of Congress, a profession mostly occupied by men. Hence, it would make sense to explore the data the other way around. We will explore this in a later post.

Monday, April 15, 2019

Tournament success is not poker success


Let us suppose for a moment that we wish to list the world's best professional poker players. This might be of some interest, because poker is partly a game of luck (the cards are dealt at random) and partly a game of skill (players choose how to play their cards). Indeed, put simply, the idea is to convince your opponents that you have a weak hand when they have a strong one (so that they will bet against you) and a strong hand when they have a weak one (so that they will fold).


One well-known way to assess poker success is to look at tournament winnings. Indeed, Nathan Williams recently did this for The Top 50 Best Poker Players of All Time by simply listing the 50 greatest money earners from The Hendon Mob database. This database accumulates data on the lifetime money winnings for all of those participants who have ever cashed in a live poker tournament.

However, this approach does not work. In fact, there are at least five reasons why this is not appropriate:
  1. Inflation continues unabated. After all, $1 now is not worth as much as $1 was 30 years ago. In fact, something that cost $1 in 1990 would cost a bit more than $2 now (ie. the money has been devalued to 50%). So, the value of current winnings cannot be compared to those of the past.
  2. There are more tournaments now than there have ever been. So, there are more opportunities to play them now, and to thereby potentially accumulate more money for the same tournament success rate.
  3. The tournament fields are now generally bigger. This means that the average prize money for each tournament is now much greater than before (since the money is provided by the participants themselves). In particular, the top prizes now provide more money than whole tournaments did 20 years ago.
  4. Some of the best players play online rather than live. Obviously, this is a bit more difficult these days, due to the banning of online poker in the USA, but it is still a significant source of poker income for many people.
  5. Some of the best players do not play many tournaments; instead, they play cash games. Indeed, if you want to make a living playing poker, you may be better off playing for cash rather than for prize money, as tournament success is much more of a lottery.
The first three reasons all mean that we would have to adjust the tournament winnings, if we wish to have a meaningful assessment of lifetime earnings. As one example of the need to do this, we can look at point no. 3 in a simple way. The first graph shows the current top-100 money earners from The Hendon Mob. For each player, it shows how much of their total earnings came from their biggest single tournament cash.

Note that for the majority of players, a large part of their lifetime winnings came from a single tournament — the median percentage is 18.4% (range 3.8–97.7%). Indeed, for some of the players it is >50%, and for a few it is almost all of their money. Bigger fields mean more money per tournament, and thus bigger cashes when you do well. Note, incidentally, that this graph does contain the top 17 biggest cashes in history (to date).
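For readers who want to repeat this calculation on their own data, here is a minimal Python sketch of the percentage shown in the graph and its median. The player names and dollar amounts are made up for illustration; they are not figures from The Hendon Mob.

```python
from statistics import median

def biggest_cash_share(total_earnings, biggest_cash):
    """Percentage of lifetime tournament earnings from the single biggest cash."""
    return 100.0 * biggest_cash / total_earnings

# Hypothetical players: (lifetime earnings, biggest single cash) in dollars.
players = {
    "Player A": (10_000_000, 1_500_000),
    "Player B": (4_000_000, 3_000_000),
    "Player C": (8_000_000, 2_000_000),
}

shares = {name: biggest_cash_share(total, biggest)
          for name, (total, biggest) in players.items()}
median_share = median(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1f}% of earnings from one tournament")
print(f"Median share: {median_share:.1f}%")
```

A player like the hypothetical "Player B" would rank highly on lifetime winnings almost entirely because of one result, which is exactly the distortion discussed here.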

An alternative approach

So, in order to evaluate players, we actually need a list of criteria that is independent of money won. That is, we need a list of the poker skills of each player. There are several different skills involved in playing poker, and presumably some people are good at some of them, and other people are good at some of the others. A comparison of relative skills is what we need.

This approach was actually tried by Barry Greenstein back in c. 2005. What he did was try to rate a group of 33 of the poker players that he had played against in cash games. He rated these players by style of play, based on ten playing criteria (each scored on a 1–10 scale):
  • Aggressiveness
  • Looseness
  • Short-handed play
  • Limit poker
  • No-limit poker
  • Tournaments
  • Side games
  • Steam control
  • Against weak players
  • Against strong players
Given the time at which this analysis was done (2005), the modern crop of young players are obviously not included, and a few of those people included are no longer playing. However, it is worthwhile looking at the data to see just what can be done with this approach.

Greenstein himself notes: "I don’t think you can add up the ratings in the skill categories to get an accurate comparison of players." He is right; but first let's do it anyway. So, the next graph shows the total score (out of 100) for each player. (Click on the figure to see it at full size.)


The problem here is that we are comparing apples with oranges. That is, the rank ordering of the sum does not make much sense, because it does not group players with similar playing strengths. The rank order would make sense when comparing each feature one at a time, but not for the total. For example, ranking by total winnings does make sense, because we have only one criterion: money (although it is not a useful criterion). This is the basic weakness of having a single rank order.

As one example of how the "overall score" misses important points, note that Eric Seidel and John Juanda have the same total. However, Seidel exceeds Juanda on Steam control, while Juanda exceeds Seidel on Looseness; these are actually two rather different players.

A better way to look at the data is to use a network, as we often do in this blog. The final graph is a NeighborNet (based on the manhattan distance) of Greenstein's data. Each point represents one of the 33 people. Those people that are near each other in the network have a similar set of scores, while people further apart are progressively more different from each other as poker players.


As you can see, there is no simple trend from "best" to "worst", but instead a complex set of relationships, just as we would expect. However, the network does show an overall trend of decreasing total score from top to bottom (compare this to the previous graph).

Note, first, that Eric Seidel and John Juanda are on opposite sides of the network (Juanda left, Seidel right). This illustrates how much better the network is as a display of the data, compared to simply summing the scores (as in the previous graph). The network accurately shows the differences in the relative playing styles.
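The contrast between equal totals and different profiles is easy to sketch in Python. The two score lists below are hypothetical (they are not Greenstein's actual ratings); the Manhattan distance itself is simply the sum of absolute differences across the ten criteria, which is the kind of pairwise distance a NeighborNet takes as input.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equally long score lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(scores):
    """Pairwise Manhattan distances for a dict of name -> list of scores."""
    names = list(scores)
    return {(p, q): manhattan(scores[p], scores[q]) for p in names for q in names}

# Hypothetical ratings on ten criteria (1-10 scale), NOT Greenstein's data.
scores = {
    "Player X": [7, 5, 8, 8, 8, 9, 6, 9, 8, 8],
    "Player Y": [7, 8, 8, 8, 8, 9, 6, 6, 8, 8],
}

totals = {name: sum(values) for name, values in scores.items()}
distances = distance_matrix(scores)

print(totals)
print(distances[("Player X", "Player Y")])
```

The two hypothetical players have identical totals but a non-zero distance, so a ranking by sum treats them as equal while the distance matrix (and hence the network) keeps them apart.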

There are some players who are actually gathered together in the network, indicating that they have similar scores across all 10 criteria. For example, Barry Greenstein, Eric Seidel and Howard Lederer rarely differ by more than 1 point on any of the criteria; according to Greenstein, these people have very similar playing styles.

Alternatively, Phil Hellmuth and T.J. Cloutier have scores that differ from the other players: both have low scores on Side games and Steam control. Gus Hansen is near these two because all three have high scores for Against weak players. Similarly, the legendary Stu Ungar and Patrik Antonius both have high Aggressiveness and Looseness.

There is one final point worth mentioning. As Michel Bettane once said (The absurdity and flattery of scores):
It doesn't take a genius to appreciate the absurdity of giving a number score to a work of art or, worse still, an artist. Salvador Dalí had huge fun scoring great artists (including himself) on the basis of design, color, and composition — but that says far more for his sense of provocation and irony than it does for the principle itself.
Is poker an art, a science or a sport? If it is either of the first two, then scoring players may actually be a Bad Idea.

Monday, March 18, 2019

Which US cities are best for walking, biking and public transport?


In the modern world, there is a lot of discussion about the environmental damage caused by cars and trucks, not least due to their involvement in global climate change. The pro-active parts of this discussion revolve around banning cars, so that parts of cities and towns can return to pedestrian areas (eg. Life in the Spanish city that banned cars; The automotive liberation of Paris), and encouraging alternative modes of transport, particularly bicycles (eg. Copenhagenize your city: the case for urban cycling; Britain wants cycle-friendly cities).

In particular, some cities throughout the world are taking active steps to improve the "walkability" of their centers, including Addis Ababa, Auckland, Denver, Hanoi, London, Manchester and San Francisco (What would a truly walkable city look like?), and the "cyclability" of their inner suburbs, including Calgary, Copenhagen, Eindhoven, Lidzbark, Purmerend, San Sebastian, Utrecht and Vancouver (Top 10 pieces of cycling infrastructure: which country does it right?). On the other hand, there are some cities that have not yet tried to do much about cycling, including Beijing, Cairo, Delhi, Hong Kong, Moscow, Mumbai, Nairobi, Orlando, São Paulo and Sydney (Top 10 worst cities for cycling).


The USA is not usually considered to be at the forefront of this movement, having long ago wedded itself to the cult of the private motor car. However, this does not mean that US cities are all the same in terms of non-car transportation. For example, the Walk Score site, which is part of the Redfin real estate organization, provides a ranking of all US cities and neighborhoods with a population of 200,000 or more, in terms of how friendly they are for: walking, biking and transit.

The ranks are based on a score out of 100 for each location, using various methodologies:
  • Walk Score analyzes hundreds of walking routes to nearby amenities; points are awarded based on the distance to amenities in each category.
  • Bike Score is calculated by measuring bike infrastructure (lanes, trails, etc), hills, destinations and road connectivity, and the number of bike commuters.
  • Transit Score assigns a "usefulness" value to nearby transit routes based on their frequency, type of route (rail, bus, etc), and distance to the nearest stop on the route.
Our interest here is in combining these three pieces of information into a single picture, showing which cities are generally good, at the moment.

Not unexpectedly, the Walk Score and Transit Score are highly correlated (86% shared rankings), while the Bike Score is not as highly correlated with either of these (49% and 42%, respectively). This means that the same cities tend to be good for the first two criteria. The three best cities for the Walk Score are New York, Jersey City and San Francisco, while the top two for the Transit Score are New York and San Francisco. On the other hand, for the Bike Score the top two are Minneapolis and Portland — it would be difficult to imagine either New York or San Francisco as being good for biking!
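For those who want to try this kind of comparison themselves, here is a small pure-Python sketch of a Spearman rank correlation, which is one common way to ask whether two scores rank cities similarly. How the percentages quoted above were actually computed is not stated here, so treat this as an assumption about the general approach; the scores below are invented and are not Walk Score data.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented scores for four cities, for illustration only.
walk = [88, 60, 45, 70]
transit = [84, 50, 40, 65]
bike = [70, 81, 50, 55]

print("walk vs transit:", spearman(walk, transit))
print("walk vs bike:", spearman(walk, bike))
```

Squaring the coefficient gives the proportion of rank variation shared by two scores, which may be what a "shared rankings" percentage expresses, although that reading is a guess.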

If we define a "good" score as being >70, then only San Francisco has a score for all three criteria >70, although Boston comes close. On the other hand, Pittsburgh and Washington D.C. have the most consistent scores across the board, because they have uniformly middle-rank scores.

Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network, as a tool for exploratory data analysis. For this network analysis, we calculated the similarity of the cities using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-city similarities.

The resulting network of the 98 cities with complete data is shown in the figure. Cities that are closely connected in the network are similar to each other based on how good they are for walking, biking and transit, and those cities that are further apart are progressively more different from each other. The color-coding for the cities is from Megaregions of the United States.


The network generally shows decreasing walking / transit scores from top to bottom, and decreasing biking scores from right to left. We have labeled only the top group of 29 cities, which are distinctly "better" than the remaining 69, plus four unusual cities (at the middle-left).

Note that, as expected, New York, San Francisco and Boston stand out at the top of the network. Note, also, that Minneapolis and Portland are separated in the network from the other cities, because of their high Bike Scores — all of the other cities in the top group have much lower biking scores. Newark, in particular, has a low biking score. New Orleans is at the bottom-left of this group because it has a low Transit Score but not Walk Score.

For the four unusual cities, separated at the left of the bottom group: Dallas has a low Transit Score, and Atlanta, Cincinnati and San Diego all have a low Bike Score.

The city at the very bottom-left of the network, which has the lowest score on all three criteria, is Arlington TX. Along the same lines, there is an online graph of The 10 most dangerous states for cyclists, showing Florida way out in front.

Finally, you should be warned about potential problems with rankings like these, based on only a few selected criteria. For example, the real estate site StreetEasy recently tried to compile a list of the 10 Healthiest Neighborhoods in New York City, and ended up listing the Brooklyn industrial area of Red Hook as number 1, which engendered a couple of negative comments, such as:
I guess the fact that the majority of Red Hook’s parkland has been closed for many years due to lead contamination, or the fact that we have one of the highest asthma rates in the city, was overlooked for this study.
Caveat emptor!

Monday, March 4, 2019

Has homoiology been neglected in phylogenetics?


In a recently published pre-print on PaleorXiv, Roland Sookias makes a point for distinguishing between parallelism, ie. shared inherited traits that can be found in some but not all of the offspring of a common ancestor, and convergences in a strict sense, involving similar traits that are not homologous. The former is also known as homoiology, a term Sookias attributes to Ludwig Plate.

As a geneticist working mostly at the tips of the Tree of Plant Life, I'm quite familiar with the (pre-Hennigian) concept: we much more often than not lack Hennig's 'synapomorphies', ie. shared, derived traits exclusive to an evolutionary lineage. But we have many highly diagnostic character suites including 'shared apomorphies' (I think that the angiosperm phylogeneticist Jim Doyle coined the term) that collect the same species or higher taxa, eg. groups of taxa that also form highly supported clades in molecular trees, but are not exclusive. In every plant group you can additionally observe that certain traits are exclusive to some members of one lineage, because the lineage has the genetic-physiological prerequisites to express these traits, while their sister lineages or distant relatives lack this potential. Epigenetics deals with tendencies to express a trait in response to the environment without even changing the genetic code.

If you look close enough, you can find such patterns even at the molecular level.

Molecular evolution of the 5' half of the ITS1 in beeches. Each sequence motif is assigned a state (Ax, Bx etc; x = 0 represents the ancestral state, x > 0 are derived states) and evolution involves usually the gain ("+") or loss ("-") of sequence motifs including some potential genetic homoiologies (see here for context and references).

However, it has apparently been ignored by my fellow paleontologists: Sookias wants to discuss "the neglected concept of homoiology ... in the context of palaeontological phylogenetic methods". Paleontological phylogenetic methods are, of course, tree inferences, and the idea is that recognition of homoiologies can be a means of establishing node support or to "help to choose between equally parsimonious or likely trees". He provides an R function "to calculate two measures for a given tree and matrix: (a) the potential support for clades based on potential homoiologies; and (b) the fit of the tree to all states given the concept of homoiology".

Sookias provides a nice and concise introduction to the problem with some examples, and makes the connection to linguistics (see also Mattis' and my post on the Chinese dialects continuum: How languages lose body parts); so, give the short paper a read. Like all paleontological literature it is strongly influenced by cladistic views, such as that life is monophyletic, and it revolves around the central theme of how to get better supported trees.

My inner geneticist has a principal problem with such a goal, because there has (to my knowledge) not been a single morphology-based tree that was fully congruent to a molecular tree with sufficient taxon and gene sampling, which applies also to the real-world data example that Sookias chose (as we will see).

My inner paleontologist also knows that there are highly diagnostic morphs in the fossil record, but diagnostic character suites and morphs reflect as many paraphyla as monophyla. He also knows that the fossil record, provided you find the right fossil from the right time, may alter your perspective on ancestral and derived character states.

An inferred tree (see this post). Given the inferred tree (quasi-dated tree), we would assume that star shapes are primitive (a symplesiomorphy) within the Pointish lineage, and possibly 10-tipped stars; and conclude that the Tenstars are paraphyletic. Greenish is clearly ancestral (a Pointish symplesiomorphy), and bluish derived (a Polygonia synapomorphy).
If we have the full picture, we can confirm star shapes are symplesiomorphic within the Pointish (the first common ancestor being a five-pointed colorless star). However, all greenish stars form a monophylum not a paraphylum.
Having ten tips is a synapomorphy of the monophyletic Tenstars.

So, why should we aim to get more resolved, better supported, morphology-based trees? Any such tree will inevitably include wrong branches!

I argue that, instead, we should just explore the signal in our data matrices using networks. Any potential tree is included in a network. But networks are more comprehensive because they provide not only a single tree but alternative, competing trees. By visualizing the alternatives, we can discern between mere convergence (random similarity), homoiology (parallelism, convergence related to descent), symplesiomorphy (shared, lineage-consistent primitive traits) and synapomorphy (lineage-unique and consistent shared derived traits), which can be very tricky with just a tree. Thus, we can try to evaluate which evolutionary scenario best explains all our data.

Compatibility

The basic problem when using morphological and such-like data sets to infer phylogenies is that most of the scored characters are, to some degree, incompatible with the true tree, ie. the actual evolutionary pathways.

Let's take a hypothetical evolution (no reticulations), in which the x-axis represents the morphological diversification and the y-axis time.


As in real-world data, sister taxa (eg. Species A and B) may have different levels of morphological derivation compared to their common ancestor(s). This leads us to this unrooted true tree in which the branch lengths are proportional to the real (above) amount of change.

Unrooted representation of the above evolution.
All commonly used tree inferences infer unrooted trees.

The only characters providing a taxon bipartition that is fully compatible with the true tree are Hennig's 'synapomorphies':

Clade A–D shares a unique, derived trait.
The character split is fully compatible with the true tree.

Next come Hennig's 'symplesiomorphies' (Sookias' R-script discards them):

Blue is the ancestral state within the ingroup, lost/modified in Species A.
The character split is compatible with the true tree except for A.
In phylogenetic inference, symplesiomorphies will usually stabilize the topology
as there will be enough other characters supporting A as sister of B and Clade A–D(–F).

Homoiologies / parallelisms can be partly compatible:

Blue is a homoiology found in 50% of the species composing Clade A–F.
The character split supports the sister relationship of A and B (compatible aspect)
but joins them with F (incompatible aspect).
A, B and F belong to the same monophylum/clade (semi-compatible aspect).
As long as homoiologies are confined to otherwise
coherent (or flat) subtrees, they will contribute to the overall decision capacity of the data.

Note that without a molecular backbone tree, it may be impossible to distinguish homoiologies from symplesiomorphies – whether a trait will be resolved as either the one or the other in a tree depends solely on its frequency and distribution across the subtree, and the situation in outgroups.

Purple is the plesiomorphy of the ingroup, blue the homoiology
found in members of Clade A–F, evolved twice
Considering the phylogenetic root-tip distances in the true tree, it makes sense that blue is the plesiomorphy of the ingroup retained in the shorter branching members, and purple a homoiology found in the most derived sublineages (again, evolved twice).
Both scenarios require three steps, but probabilistic character mapping methods would prefer the second scenario as they assume the longer the internal branches, the higher the likelihood for a change. To dismiss symplesiomorphies, Sookias' script infers the ancestral state of the MRCA of a clade and only considers states as homoiologies that differ from the inferred ancestral state (the cut-off value can be modified to "less stringently exclude potential symplesiomorphies as homoiologies").
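Step counts such as the ones compared above come from parsimony optimization. The following is a minimal Python sketch of Fitch's algorithm for a single unordered character; it is not Sookias' R script. The tree (nested tuples), the taxon names and the states are hypothetical and do not reproduce the figure.

```python
def fitch_steps(tree, states):
    """Count the minimum number of state changes (Fitch parsimony).

    tree   -- nested 2-tuples with taxon names (str) at the tips
    states -- dict mapping each taxon name to its character state
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):          # tip: use its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        steps += 1                         # children disagree: count one change
        return left | right

    state_set(tree)
    return steps


# Hypothetical ingroup tree and two invented characters (assumptions).
example_tree = (((("A", "B"), ("C", "D")), ("E", "F")), ("G", "H"))
blue_in_abf = {t: ("blue" if t in "ABF" else "purple") for t in "ABCDEFGH"}
blue_in_abcd = {t: ("blue" if t in "ABCD" else "purple") for t in "ABCDEFGH"}

print(fitch_steps(example_tree, blue_in_abf))
print(fitch_steps(example_tree, blue_in_abcd))
```

On this toy tree, the character shared by A, B and F needs two steps, while the one confined to A, B, C and D needs a single step. Parsimony only counts the changes; it does not say which of several equally short scenarios is the true one.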
 
Doyle's 'shared apomorphies' are locally compatible:

Blue is a shared apomorphy of the GH lineage, convergently evolved in the
outgroup (see original tree above: the GH lineage is a strongly derived
ingroup lineage evolving into the direction of the outgroup
in contrast to the remainder of the ingroup).
The example above also illustrates how shared apomorphies may trigger branching artifacts such as ingroup-outgroup long-branch attraction. Imagine that GH is not the first diverging branch of the ingroup but instead a strongly derived sublineage nested within Clade A–F, and that we lack the short-branching sister-group but have a large outgroup sample. Any ingroup-outgroup shared apomorphies will then draw GH towards the outgroup-defined ingroup's root and be detrimental to inferring the true tree.

Convergence in a strict sense, ie. superficial or random similarity, is incompatible with the true tree:

Blue is a randomly distributed derived state found in all longer-branched taxa.

A tree-incompatible signal is, naturally, best handled using a network and not by forcing it into a single tree. Unless, of course, we have a sensible molecular tree and can go for total evidence approaches assuming the molecular tree reflects the true tree.
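The degrees of compatibility described in this section can be checked mechanically. Here is a minimal Python sketch, assuming a hypothetical ten-taxon true tree loosely modelled on the example (ingroup A to H plus two outgroup taxa, here called O1 and O2); all names are illustrative. Two splits are compatible when at least one of the four intersections of their sides is empty.

```python
# Hypothetical taxon set: ingroup A-H plus two outgroup taxa (assumption).
TAXA = frozenset("ABCDEFGH") | {"O1", "O2"}

# Clades of an assumed true tree ((((A,B),(C,D)),(E,F)),(G,H)) plus outgroup.
TRUE_CLADES = [
    frozenset("AB"), frozenset("CD"), frozenset("ABCD"),
    frozenset("EF"), frozenset("ABCDEF"), frozenset("GH"),
    frozenset("ABCDEFGH"),
]


def compatible(side1, side2, taxa=TAXA):
    """Two splits are compatible if one of the four intersections is empty."""
    a, b = frozenset(side1), frozenset(side2)
    return any(not part for part in (a & b, a - b, b - a, taxa - (a | b)))


def conflicts(character_side, clades=TRUE_CLADES):
    """Return the clades of the tree that the character split contradicts."""
    return [sorted(c) for c in clades if not compatible(character_side, c)]


print(conflicts("ABCD"))     # taxa sharing a state unique to Clade A-D
print(conflicts("BCDEFGH"))  # ancestral ingroup state, lost in Species A
print(conflicts("ABF"))      # state found only in A, B and F
```

Under these assumptions, the synapomorphy of Clade A–D conflicts with nothing, the symplesiomorphy lost in A conflicts only with the clades that contain A, and the homoiology shared by A, B and F conflicts with the A–D and E–F clades while agreeing with the A+B pair and with Clade A–F.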

PS: In molecular data, too, the characters incompatible with the true tree may outnumber the compatible ones, but there we have many more characters and (usually but not always) a lot that are not filtered by negative or positive selection. Our stochastic molecular models are for sure never accurate enough to model molecular evolution for our sequences, but apparently precise enough for most applications. Even before next generation sequencing and big data, molecular phylogenies outshone morphological phylogenies, something that paleontologists cannot afford to ignore any more; not because the data are much better (to infer evolution) but because the patterns and processes are much less complex.

Sookias' data example, crocodiles and relatives

The supplement of Sookias' paper includes a morphological character matrix for crocodilians and the resulting molecular tree for the group. Here's Sookias' fig. 3, using these data to make his point for how to select the better-fitting tree using homoiology recognition:


Now, the unsolved problem is: if we don't have a molecular tree, how can we possibly know 0 is a homoiology and not a symplesiomorphy, 1 not a reversal (scenario B) or likely convergence (scenario C), hence, B should be preferred over C (the legend has a little typo, cf. Sookias 2019, p. 3, l. 34)?

The matrix provided as the example is not the best one to make this point. Sookias' script, when stringently eliminating potential symplesiomorphies, identifies, using the molecular tree as input, one potential homoiology for the Crocodylinae, five for their larger clade (including Gavialis and Tomistoma), and one for the alligators' larger clade in a matrix with 117 characters. Seven of 117 characters (about 6%) can hardly be a game-changer.

What the morpho-data shows

Furthermore, the morphological matrix will give us a single most-parsimonious tree (MPT, using PAUP*'s Branch-and-Bound algorithm), not two or more equally parsimonious alternatives that we need to weigh against each other.

The single most-parsimonious tree that can be inferred from the morpho-matrix (236 steps, CI = 0.64, RI = 0.84). Red branches are conflicting with the topology of the molecular (truer?) tree (green brackets).

Some of the red branches are supported by pseudo-synapomorphies. Against the background of the molecular tree, these are potential homoiologies of the clade comprising them, but Sookias' script interprets them as symplesiomorphies (provided the molecular branch lengths are sufficient, they might be recognized when using a probabilistic framework to infer the ancestral states).

Not a good example for Sookias' goal, but the matrix shows the limitations of trees when it comes to morphological differentiation. Here's the distance-based, 2-dimensional network for the morphological data:

A Neighbor-net based on Sookias' morphological matrix.
The arrow indicates the position of the assumed root.
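A Neighbor-net is computed from a pairwise distance matrix. As a rough illustration of that input, here is a minimal Python sketch that derives mean character distances from a morphological matrix. It assumes that distance is the proportion of differing states among characters scored in both taxa, and that "?" and "-" mark unscored cells; the distance actually used for the network above is not stated, and the toy matrix is invented.

```python
from itertools import combinations

MISSING = {"?", "-"}  # symbols treated as unscored (assumption)


def mean_character_distance(row1, row2):
    """Proportion of differing states among characters scored in both taxa."""
    scored = [(x, y) for x, y in zip(row1, row2)
              if x not in MISSING and y not in MISSING]
    if not scored:
        return float("nan")   # no character scored in both taxa
    return sum(x != y for x, y in scored) / len(scored)


def distance_matrix(matrix):
    """Return {(taxon1, taxon2): distance} for all pairs of taxa."""
    return {(s, t): mean_character_distance(matrix[s], matrix[t])
            for s, t in combinations(sorted(matrix), 2)}


# Toy matrix with invented states, for illustration only.
toy = {
    "taxon_1": "00101",
    "taxon_2": "0011?",
    "taxon_3": "11010",
}

for pair, dist in distance_matrix(toy).items():
    print(pair, dist)
```

The resulting table could be exported to any program that builds Neighbor-nets from distances; the network construction itself is not sketched here.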

The signal from the morphological matrix is quite tree-like, and the structure of the left part of the network matches that of the single MPT (and the molecular tree). On the right-hand side, we find more complexity than we would expect from the single MPT. The data signal is not trivial regarding the position of the root as inferred by Bernissartia; nor is the placement of Gavialis and Tomistoma (pink edge bundles), two genera producing a very prominent box-like structure. Although cladists call it a "phenetic" approach, the distance-based network is nonetheless straightforward regarding the identification of monophyletic groups (green) and potential monophyletic groups (yellow) (the latter always include the particular alternative seen in the single MPT as well and, in the case of the pink box, also the molecular alternative). The light green monophylum is a necessary consequence of the prior knowledge about the position of the root, and the likely monophyly of Alligator and its relatives (the tree-like subgraph with long internal branches = lots of uniquely shared traits, including potential synapomorphies).

Potential synapomorphies that can be inferred from the morpho-matrix alone by mapping the states onto the network. Red, homoiologies reconstructed as synapomorphies ('pseudo-synapomorphies') and (except for one) excluded as potential symplesiomorphies by Sookias' test run of his script (strict and relaxed cut-off).

The network provides more information than can be extracted from the MPT: one Crocodylus is significantly closer to Osteolaemus (the neighborhood defined by the light blue edge bundle, see Sookias' fig. 3A). Crocodylus, however, is likely monophyletic, being generally very similar; and the third genus, Mecistops, is closely linked to (all of) them (neighborhood defined by the dark blue edge). An inclusive common origin (including the third genus, Mecistops) is – just based on morphology and without using a "phylogenetic" tree inference – beyond question, even though we lack syn- or shared apomorphies (short corresponding edge bundle): Mecistops is obviously closely related to Crocodylus, and Osteolaemus is related to part of the latter, so it's not a bad hypothesis that all three are descendants of the same common ancestor, and that Tomistoma (and Gavialis) branched off the lineage before the Crocodylinae radiated. The only alternative explanation would be that the Crocodylinae show the primitive morphs of the entire lineage, and that the position of Tomistoma and Gavialis is affected by long-branch (-edge) attraction (however, if that is the case then we should have found a Tomistoma-Gavialis clade in the MPT, since parsimony will always get it wrong in the Felsenstein zone).

The main flaw

But any morphology-based alternative using this data matrix is not fully compatible with the molecular tree, which places Mecistops and Osteolaemus as sister to Crocodylus. Here's the consensus network based on 10,000 bootstrap pseudoreplicate BioNJ trees inferred from the morpho-matrix, highlighting the support for splits compatible with the molecular tree (green) and their competing, partly incongruent (red edge bundles) alternatives (I do the information transfer manually, but those with R-scripting skills can use the functions in the phangorn library; Schliep et al., MEE, 2017; see also David's post):

NJ-Bootstrap (BS) consensus network based on 10,000 pseudoreplicates.
Edges/splits corresponding to clades in the molecular tree
(see Sookias' fig. 3 above) in green, those conflicting the molecular tree in red.
Edge values show BS support (edge-lengths are proportional to NJ-BS support),
while asterisks indicate the branches seen in the MPT.
Obviously, there is some signal in the morpho-matrix compatible with the molecular clades (this can be synapomorphies, symplesiomorphies, homoiologies or shared apomorphies) clashing with the signal of pseudo-synapomorphies etc. supporting the topological alternatives seen in the morpho-based MPT.
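The support values in such a consensus network are split frequencies across the bootstrap trees. The following minimal Python sketch shows only that counting step; it does not use or replicate phangorn or any other package, the tree inference for each pseudoreplicate (BioNJ in my analysis) is omitted, and the three toy trees are invented.

```python
from collections import Counter


def tips(node):
    """Return the set of tip names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips(child) for child in node))


def splits(tree):
    """Return the non-trivial splits of a tree, each stored as the side
    that does not contain the alphabetically first taxon."""
    all_tips = tips(tree)
    anchor = min(all_tips)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = tips(node)
        if anchor in side:                 # normalize: keep the other side
            side = all_tips - side
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def split_support(trees):
    """Percentage of trees in which each split occurs."""
    counts = Counter(s for tree in trees for s in splits(tree))
    return {s: 100 * n / len(trees) for s, n in counts.items()}


# Three invented 'bootstrap' trees on five hypothetical taxa.
toy_trees = [
    ((("a", "b"), "c"), ("d", "e")),
    ((("a", "b"), "c"), ("d", "e")),
    ((("a", "c"), "b"), ("d", "e")),
]

for split, support in split_support(toy_trees).items():
    print(sorted(split), round(support))
```

In the toy example, the split separating d and e from the rest is found in all three trees, while two mutually incompatible splits occur in two trees and one tree, respectively. A consensus network keeps such competing splits side by side, whereas a consensus tree would have to drop at least one of them.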

Assuming the molecular tree is correct, the above reconstruction means that Osteolaemus is morphologically more derived, and hence placed as sister, while Mecistops and Crocodylus retain more primitive character states, and hence lack discriminatory derived traits: a sort of local ingroup-outgroup long-branch attraction (or 'short-branch culling').

What differentiates the Crocodylinae? Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies); red, shared apomorphies (convergence). The Mecistops-Crocodylus pseudo-monophylum is mostly supported by traits shared between Osteolaemus and distant siblings (taxa of the larger alligator clade) and/or the outgroup.

We can also hypothesize that the initial radiation was fast, because the Mecistops-Osteolaemus ancestor did not accumulate a single, unique, discriminating character trait.

Excess of shared derived, pseudo-synapomorphic traits is the reason Tomistoma is not resolved as sister of Gavialis in the MPT — the molecular Gavialis-Tomistoma clade is represented by a morphological grade.

A 'splits rose' showing the basic splits. Black, aut- or synapomorphies; blue, potential homoiologies (or symplesiomorphies of the larger clade including Crocodylinae); pink, pseudo-synapomorphies (deep homoiologies or symplesiomorphies of the larger Crocodylinae clade); orange, shared ancestral (plesiomorph) or derived traits (convergent). 

And the homoiologies identified using the molecular tree as input cannot put things right. They are just partly compatible with unproblematic splits, ie. the larger clade including Alligator (character #7), the larger clade including Crocodylinae (#1, #18, #73, #74, #117) or exclusive to the Crocodylinae (#66).

Character mapping of the molecular-inferred homoiologies. The lush green splits represent the molecular splits.

However, if we are ignorant of the molecular tree, we would have to assume that Mecistops is the sister to Crocodylus, and that some of their shared traits not found in Osteolaemus are shared apomorphies (if occurring outside the clade and in the sister clade) or even synapomorphies (if exclusive for Mecistops + Crocodylus), while only those shared by Osteolaemus and C. porosus (#66) can be homoiologies. We also would have no reason to challenge the Gavialis-Tomistoma grade, until we infer networks.

Map of all potential synapomorphies (bold), symplesiomorphies (italics) and homoiologies (plain font) using the morphology-based Neighbor-net as basis. Red, pseudo-synapomorphies: splits seen in the MPT (with or without an alternative in the Neighbor-net) but rejected by the molecular tree.

This is the main flaw of Sookias' idea. To identify homoiologies, we need the same prerequisite as for any of Hennig's concepts: we need to know the true tree. If we use the inferred tree based on the same data that we want to weight (here: use homoiologies for decision making or as a means of node support), then we propagate first-level errors, ie. we apply circular reasoning. The red-marked pseudo-synapomorphies in the network above are one example; vice versa, all actual (molecular-wise) synapomorphies supporting the molecular Gavialis-Tomistoma clade (dark purple split) would be reconstructed as homoiologies or symplesiomorphies based on the morpho-based single MPT (or morpho-based NJ tree, or probabilistic tree).

And if we have an independent molecular tree, it will make the decision on the fly: putative synapomorphies are the traits that are fully compatible, symplesiomorphies, homoiologies and shared apomorphies are decreasingly compatible, and random convergences are incompatible with the molecular tree.

It is not homoiology but tree-incompatible signal that is neglected in phylogenetics

Sookias points out: "In inference of phylogeny by parsimony, an occurrence of a character state in a part of a tree separated from it by another state is considered simply a homoplasy, and a tree where the occurrences are nearer or further from one another is not more or less parsimonious". In principle, this is true, but has little consequence in application.

We, usually without realizing it, make frequent use of the discriminating power of potential homoiologies. See the example above, but also when, eg., placing fossils in a molecular framework or doing post-inference character weighting. In these cases, homoiologies (and symplesiomorphies) will stabilize the inference and increase support. For better and worse:
  • Better, because homoiologies will ensure that the fossil is placed in the right molecular-based subtree, and can compensate for the lack of synapomorphies. Imagine an extinct fossil sibling lineage showing only homoiologies shared by Osteolaemus and C. porosus. Using tree-based optimization (eg. RAxML's 'evolutionary placement algorithm'), it would be placed close to the Crocodylinae ancestor, likely next to Osteolaemus. Using a Neighbor-net, it would be placed between Osteolaemus and C. porosus. Either way, the homoiologies would ensure it is nested within the Crocodylinae.
  • Post-inference character weighting, as implemented in eg. TNT, will downweight inferred convergences (ie. higher homoplasy, more stochastically distributed across the tree) more than putative homoiologies (ie. less homoplastic since confined to a single subtree). This can be better or worse. How do we avoid what happened for the crocodiles, where homoiologies are not recognized as such but support (somewhat) misleading clades (ie. act as synapomorphies)? Clades are commonly interpreted as a sufficient criterion to determine monophyly; however, they are not even a necessary one: taxa can be part of a monophyletic group despite not forming an inclusive subtree (ie. clade in a rooted tree), such as the genus Caiman or Gavialis-Tomistoma.
Hence, we should discourage any form of data-self-dependent or post-analysis weighting and instead just explore the signal in our data using networks.

One thing is also obvious from the crocodile example: if we have enough signal in the morphological data, then we may get one thing or another wrong and, in some cases, may not be able to decide between alternatives. However, overall, the morphological differentiation captures pretty well what the genes provide us as the best approximation of the true tree. This holds even when the matrix includes very few potential synapomorphies and clear homoiologies but a lot of shared apomorphies, most of which were convergently evolved in parts of both major clades.

At least, this will be so when we analyze the data using networks and not just trees (compare the single MPT to the networks).

Using the alternative evolutionary scenarios provided by the networks, we can then look back into our data (see the maps above), to see what may be a homoiology, a symplesiomorphy (very useful for deciding between evolutionary scenarios, as well) or a synapomorphy. The phangorn library (used for Sookias' script) now has network functionality and allows transferring information between trees and networks. An R-affine person may be able to extract lists of potential (partly competing) synapomorphies, symplesiomorphies, and homoiologies directly from the network showing all possible or the most likely trees.

And then use this information to eg. place fossils in a phylogenetic context, or reconstruct evolutionary trends in extinct groups of organisms — reconstruction of evolutionary trends in extant organisms should always rely on morphological data analyzed in a molecular-phylogenetic framework.

Data

A NEXUS-version of Sookias' test matrix (slightly annotated for Mesquite, simple version for PAUP*), tree- and distance matrix files have been added to my figshare collection of morphological matrices.