Monday, August 19, 2019

Phylogenetics of chain letters?


The general public and the general media often have no idea what biologists mean by the word "evolution". The word has two possible meanings, and they usually pick the wrong one. Niles Eldredge tried to clarify the situation by referring to them as:
  • transformational evolution — the change in a group of objects resulting from a change in each object (often attributed to Lamarck)
  • variational evolution - the change in a group of objects resulting from a change in the proportion of different types of objects (usually attributed to Darwin).
Charles Darwin changed biology by pointing out that changes in species occur via the latter mechanism, not the former, which had been the predominant previous idea. Sadly, 160 years later, the idea of transformational evolution still seems to prevail in the minds of the general public and the people writing for them.


So, it was with some trepidation that I looked at an article in Scientific American called Chain letters and evolutionary histories (by Charles H. Bennett, Ming Li and Bin Ma. June 2003, pp. 76-81). It was subtitled: "A study of chain letters shows how to infer the family tree of anything that evolves over time, from biological genomes to languages to plagiarized schoolwork."

The "taxa" in their study consist of 33 different chain letters, collected during the period 1980–1995 (8 other letters were excluded), covering the diversity of chain letters as they existed before internet spam became widespread. These letters can be viewed on the Chain Letters Home Page.

The main issue with this study is that there are no clearly defined characters from which the phylogeny could be constructed. The authors therefore resort to creating a pairwise distance matrix among the taxa, using a compression-based method that I have criticized before (Non-model distances in phylogenetics). I have also discussed previous examples where this approach has been used, notably Phylogenetics of computer viruses? and Multimedia phylogeny?

The essential problem, as I see it, is that without a model of character change there is no reliable way to separate phylogenetic information from any other type of information. That is, phylogenetic similarity is a special type of similarity. It is based on the idea of shared derived character states, as these are the only things that are informative about a phylogeny.

Compression, on the other hand, is a general sort of similarity, based on the idea of information complexity. This presumably will contain some useful phylogenetic information, but it will also contain a lot of irrelevance — for example, shared ancestral character states, which are uninformative at best and positively misleading at worst.
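
To make the idea of a compression-based distance concrete, here is a minimal Python sketch. It is not the authors' procedure: the compressor (zlib), the normalized compression distance formula, and the function names are all assumptions chosen for illustration. The point is only that two texts sharing a lot of content compress better together than apart, whatever the reason for the sharing.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (zlib is a stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share most of their content;
    values near 1 suggest they share very little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(texts):
    """Pairwise distances among a list of texts (the 'taxa')."""
    n = len(texts)
    return [
        [0.0 if i == j else ncd(texts[i], texts[j]) for j in range(n)]
        for i in range(n)
    ]
```

Note that nothing in this calculation distinguishes shared derived text from shared ancestral text, and that the value can differ slightly depending on the order in which the two texts are concatenated.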

So, the authors can easily produce an unrooted tree from their similarity matrix, which they then proceed to root at one of the letters that they collected early on in their study. This tree is shown here.


However, whether this diagram represents a phylogeny is unknown.

Nevertheless, that does not stop us using an unrooted phylogenetic network as a form of exploratory data analysis, as we have done so often in this blog. This is not intended to produce a rooted evolutionary history, but instead merely to summarize the multivariate information in a comprehensible (and informative) manner. This might indicate whether we are likely to be able to reconstruct the phylogeny. In this case, I have used a NeighborNet to display the similarity matrix, as shown next.

Phylogenetic network of chain letters

It is easy to see that the relationships among the letters are not particularly tree-like. Moreover, the long terminal edges emphasize that much of the complexity information is not shared among the letters, while the shared information is distinctly net-like. So, a simple "phylogenetic tree" (as shown above) is not likely to be representative of the actual evolutionary history.

However, there are actually a few reasonably well-defined groups among the taxa: one at the top, one at the right, and several at the bottom of the network. There are also letters of uncertain affinity, such as L2, L23, L13 and L31. These may reflect phylogenetic history, even though that history is hard to untangle.

Finally, it is worth noting that the history of chain letters, dating back to the 1800s, is discussed in detail by Daniel W. VanArsdale at his Chain Letter Evolution web pages.

Monday, August 12, 2019

Public transit trips in the USA


Public transport, or mass transit, has long been a politically charged issue throughout the world. However, the modern world now recognizes that it is an effective way to deal with mass movements of people in a manner that respects the use of non-renewable resources.

After all, the only way to continue with autonomous transportation is to get rid of fossil fuels. However, electric cars will not be of much use until we work out where we are going to get all of the needed extra electricity, in a manner that is environmentally friendly. There is not much point in simply moving the burning of fossil fuels from the vehicle (i.e. gasoline) to a power station that also burns fossil fuels (e.g. coal). There is also a limit to how many rivers there are left to dam for hydroelectric power; and nuclear reactors have gone out of fashion (fortunately). There is also, of course, the matter of how we are going to recycle the used (lithium-ion) batteries from the cars, which is apparently a tougher proposition than recycling the electric motors themselves.


So, until we sort this out, mass transit is a viable option for most conurbations. In this context, a conurbation (or a metropolitan area) is a contiguous area within which large numbers of people move regularly, especially traveling to and from their workplace each weekday. A conurbation often involves multiple cities and towns, as defined by political administrations or contiguous urban development — many people live in one urban area but work in another.

So, naturally, governments collect data on these matters. One such data collection is the U.S. Department of Transportation's National Transit Database. The data consist of "sums of annual ridership (in terms of unlinked passenger trips), as reported by transit agencies to the Federal Transit Administration." Data for three separate modes of transit are included: bus, rail, and paratransit. The data currently cover the years 2002–2018, inclusive.

To look at the data for the 42 U.S. conurbations included, for the year 2018, I have performed this blog's usual exploratory data analysis. I first calculated the transit rate per person, by dividing the annual number of trips for each of the three modes by the conurbation population size. Since these are multivariate data, one of the simplest ways to get a pictorial overview of the data patterns is to use a phylogenetic network. For this network analysis, I calculated the pairwise dissimilarity of the conurbations using the Manhattan distance. A Neighbor-net analysis was then used to display the between-area similarities.
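
The two calculation steps described above (trips per person, then Manhattan distance) can be sketched in a few lines of Python. The area names and all of the numbers below are invented purely for illustration and are not taken from the National Transit Database; the function names are also my own. The resulting distance matrix is the kind of input that a Neighbor-net analysis takes.

```python
MODES = ("bus", "rail", "paratransit")

# Hypothetical annual trips and population sizes; NOT real transit data.
AREAS = {
    "Area A": {"trips": {"bus": 100, "rail": 300, "paratransit": 10}, "population": 100},
    "Area B": {"trips": {"bus": 50, "rail": 0, "paratransit": 5}, "population": 50},
    "Area C": {"trips": {"bus": 40, "rail": 20, "paratransit": 2}, "population": 200},
}


def trips_per_person(trips, population):
    """Annual trips per resident, one rate per transit mode."""
    return {mode: trips[mode] / population for mode in MODES}


def manhattan(a, b):
    """Manhattan distance: the sum of absolute differences across the modes."""
    return sum(abs(a[mode] - b[mode]) for mode in MODES)


def distance_matrix(areas):
    """Pairwise Manhattan distances among areas, based on per-person rates."""
    rates = {
        name: trips_per_person(d["trips"], d["population"])
        for name, d in areas.items()
    }
    names = list(rates)
    return names, [[manhattan(rates[p], rates[q]) for q in names] for p in names]


if __name__ == "__main__":
    names, matrix = distance_matrix(AREAS)
    for name, row in zip(names, matrix):
        print(name, [round(d, 3) for d in row])
```

Because the rates are not rescaled in this sketch, the mode with the largest per-person values contributes most to the distance.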

The resulting network is shown in the graph. Conurbations that are closely connected in the network are similar to each other based on the trip rates, and those areas that are further apart are progressively more different from each other. In this case, there is a simple gradient from the busiest mass transit systems at the top of the network to the least busy at the bottom.


The network shows us that the New York – Newark transit-commuting area (which covers part of three states) is far and away the busiest in the USA. The subway system dominates this mass transit, of course, as it is justifiably world famous, although not always for the best of reasons as far as commuters are concerned.

The San Francisco – Oakland area is in clear second place. Here, bus transit slightly exceeds rail transit. Then follow Washington DC and Boston, both of which also cover parts of three states. In Boston trains out-do buses 2:1, while in Washington it is closer to 1.5:1.

Next comes a group of four conurbations: Chicago, Philadelphia, Portland and Seattle. Two of these cover part of Washington state, but in quite different ways: in Seattle the buses dominate the system 5:1 but in Portland it is only 1.5:1. Chicago and Philadelphia share buses and trains pretty equally.

At the bottom of the network there are two large groups of conurbations, one of which does slightly better than the other at mass transit use. The least-used system is that of San Juan, in Puerto Rico, perhaps not unexpectedly. Within the contiguous U.S. states, Indianapolis (IN) has the least-used system, followed by Memphis (TN–MS–AR).

Moving on, we could also look at changes in the total number of transit trips (irrespective of mode) during the period for which data are available: 2002–2018. A network is of little help here. So, it is simplest just to plot the data, as shown in the next graph.


For most of the metropolitan areas there is little in the way of consistent change through time. However, there are some areas that show high correlations between the number of trips and time. These are the areas that have shown the most consistent increase in the number of transit trips from 2002 to 2018:
  • Chicago (IL–IN)
  • Tampa – St Petersburg (FL)
  • Baltimore (MD)
  • Denver – Aurora (CO)
  • San Francisco – Oakland (CA)
  • Memphis (TN–MS–AR)
  • San Diego (CA)
  • Cleveland (OH)
  • Providence (RI–MA)
  • Orlando (FL)
  • Indianapolis (IN)
  • New York – Newark (NY–NJ–CT)
  • Portland (OR–WA)
  • Minneapolis – St Paul (MN–WI)
Sadly, there are also areas that have shown a consistent decrease in the number of transit trips through time (2002–2018):
  • Kansas City (MO–KS)
  • Columbus (OH)
  • Riverside – San Bernardino (CA)
Presumably these are the areas where the local politicians should be looking into how to address this long-term issue.
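
For readers who want to repeat this kind of screening, here is a minimal Python sketch of correlating annual trip totals with the year. The cut-off of 0.8 and the function names are assumptions made for the example, since no particular threshold is stated above; any real analysis would need to choose and justify its own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation at all, so no trend to report
    return sxy / sqrt(sxx * syy)


def classify_trend(years, trips, threshold=0.8):
    """Label a series by how strongly trips correlate with time.

    The threshold is a hypothetical cut-off, not one taken from the post.
    """
    r = pearson_r(years, trips)
    if r >= threshold:
        return "consistent increase"
    if r <= -threshold:
        return "consistent decrease"
    return "no consistent change"
```

A high correlation says only that the direction of change is consistent from year to year; it says nothing about how large the change is.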

Declining transit numbers is a topic discussed around the web; for example: Transit ridership down in most American cities. This article has a graph neatly showing the change in transit numbers from 2017 to 2018. It shows marked decreases, particularly for bus trips, while the few increases almost all involved rail travel. Is this a short-term effect, or the start of a general long-term decline?

Monday, August 5, 2019

Tattoo Monday XIX


Here are two more (large) Charles Darwin tree tattoos, based on his best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday V, Tattoo Monday VI, Tattoo Monday IX, Tattoo Monday XII, and Tattoo Monday XVIII.