## Tuesday, October 24, 2017

### Let's distinguish between Hennig and Cladistics

There are theoretically an infinite number of ways to mathematically analyze any set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. In this sense, the philosophy of phylogenetic analysis needs to show that there is a strong basis for treating any particular mathematical analysis as having biological relevance. This is a point that I have discussed before: Is there a philosophy of phylogenetic networks?

Willi Hennig clearly has some role to play here. However, his ideas are often treated as being solely related to one particular form of phylogenetic analysis — cladistics. In this post I will point out that his work has a much greater relevance than that — he provides a crucial logical step that applies to all phylogenetic inference.

The steps of phylogenetic inference are shown in the first figure, which is taken from my earlier post. The first step is a mathematical inference from character data to tree/network; the second step is a logical inference that the mathematical summary resulting from the first step has some biological relevance; and the third step is a practical inference that the biological summary applies to whole organisms as well as to their characters.

Summary

Hennig's concept of "shared innovations" (which he called synapomorphies) is the only thing that allows us to use mathematical phylogenetics in the pursuit of genealogical history. Without this concept, the mathematics could just produce something like the arithmetic mean, a mathematical concept with no necessary connection to real objects (unlike the mode, which will always be a real observation, or the median, which usually is). The idea of shared innovations is what leads us to believe that the mathematical summary (whether tree or network) might actually also be a close approximation to the real thing. This is a separate concept from cladistics, which is simply a mathematical algorithm based on a particular optimality criterion (parsimony), just like maximum likelihood or Bayesian approaches. So, shared innovations underlie the use of parsimony, likelihood and distance methods alike: Willi Hennig (and, before him, Karl Brugmann in linguistics) is relevant no matter what algorithm we use.

Mathematical analyses

If they are to represent genealogical history, then all trees and networks in phylogenetics will be directed acyclic graphs (DAGs), mathematically. There are many ways to produce a DAG, some of which have had varying degrees of popularity in phylogenetics, and some of which have not been used at all.

To produce an acyclic line graph (in which nodes are connected by edges), we can start with character data or distance data. We can then use various optimality criteria to choose among the many graphs that could apply to the data, such as parsimony (usually associated with cladistics) and likelihood (either as maximum likelihood or integrated likelihood). We can also ensure that the graph is directed (ie. the edges have arrows), by choosing a root location, either directly as part of the analysis or a posteriori by specifying an outgroup.

All of these approaches are mathematically valid, as are a number of others. They all provide a mathematical summary of the data. This is step one of the phylogenetic inference, as illustrated above.
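As a minimal sketch of the a posteriori rooting step, the following Python turns an unrooted tree into a directed one by placing the root on the edge that leads to an outgroup leaf. The function name `root_with_outgroup`, the `"root"` label and the toy taxa are my own illustrative assumptions, and the input is assumed to be a connected graph without cycles.

```python
from collections import deque


def root_with_outgroup(edges, outgroup, root_label="root"):
    """Orient an unrooted tree by placing the root on the outgroup's edge.

    Assumes `edges` describes a connected, acyclic graph, that `outgroup`
    is one of its leaves, and that `root_label` is not already a node name.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if len(neighbours[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf")
    (attachment,) = neighbours[outgroup]

    # The new root splits the edge between the outgroup and everything else.
    directed = [(root_label, outgroup), (root_label, attachment)]
    visited = {outgroup, attachment}
    queue = deque([attachment])
    while queue:
        parent = queue.popleft()
        # Every unvisited neighbour lies further from the root, so it is a child.
        for child in sorted(neighbours[parent] - visited):
            directed.append((parent, child))
            visited.add(child)
            queue.append(child)
    return directed


# Hypothetical unrooted tree: leaves A, B, C and Out; internal nodes x and y.
unrooted = [("x", "A"), ("x", "B"), ("y", "x"), ("y", "C"), ("y", "Out")]
rooted = root_with_outgroup(unrooted, "Out")
for parent, child in rooted:
    print(f"{parent} -> {child}")
```

The result is a valid directed acyclic graph whichever leaf is named as the outgroup. Whether the arrows point the way history actually ran is a biological question that the code cannot answer.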

But what of step two? Biologists need a summary of the data that has biological relevance, as well, not just mathematical relevance. This has long been a thorn in the side of biologists — just because they can perform a particular mathematical calculation does not automatically mean that the calculation is relevant to their biological goal.

Consider the simplest mathematics of all — calculating the central location of a set of data. There are many ways to do this, mathematically — indeed, there are technically an infinite number of ways. These include the mode, the median, the arithmetic mean, the geometric mean, and the harmonic mean. All of these are mathematically valid, but do any of them produce a central location that describes biology?

The mode does, because it is the most common observation in the dataset. The median usually does, because it is the "middle" observation in the dataset. But what of the various means? There is no necessary reason for them to describe biology, although they are perfectly valid mathematics.

For instance, the modal number of children in modern families is 2, meaning that more families have this number than any other number of children. The median number is also 2, meaning that half of the families have 2 or fewer children and half of the families have 2 or more. So, these mathematical summaries are also descriptions of real families. But the means are not. For example, the arithmetic mean number of children is 2.2, which does not describe any real family. If you ever find a family with 2.2 children, then you should probably call the police, to investigate!
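The same point can be checked with a few lines of Python. The family sizes below are invented for illustration (they are not survey data), chosen so that the summaries match the numbers quoted above.

```python
from statistics import mean, median, mode

# Hypothetical numbers of children in ten families (illustrative only).
children = [0, 1, 1, 2, 2, 2, 2, 3, 3, 6]

summaries = {
    "mode": mode(children),
    "median": median(children),
    "arithmetic mean": mean(children),
}

for name, value in summaries.items():
    # A summary "describes a real family" only if some family has that value.
    observed = value in children
    print(f"{name}: {value} (matches an observed family: {observed})")
```

The mode and the median both coincide with an observed family size here, while the arithmetic mean does not. With an even number of observations the median is the average of the two middle values, which is why it only usually corresponds to a real observation.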

Mathematically valid data summaries have a lot of relevance, but they do not necessarily describe biological concepts. I can use the mean number of children per local family to estimate the number of schools that I might need in that area, but I cannot use it to describe the families themselves. This is a classic case of "horses for courses".

Hennig

So, in phylogenetics we need some piece of logic that says that we can expect our DAG (a mathematical concept) to be a representation of a genealogy (a biological concept). Our genealogical estimate may still be wrong (and indeed it probably will be, in some way!), but that is a separate issue. The DAG needs to be a reasonable representation, not a correct one. Correctness needs to be a result of our data, not our mathematics.

This is where Willi Hennig comes in. Hennig's ideas, and the ideas derived from them, are illustrated in the second figure.

Hennig explicitly noted that characters have a genealogical polarity, with ancestral states being modified into derived states through evolutionary time. Furthermore, he noted that it is only the derived states that are of relevance to studying evolutionary history: the sharing of derived character states reveals evolutionary history, but shared ancestral states tell us nothing.

We have done two things with these Hennigian ideas. Some people have been interested in classification, for which the concept of monophyly is relevant, and others have been interested in reconstructing the genealogies, rather than simply interpreting them.

Phylogenetics

Reconstructing a tree-like phylogenetic history is conceptually straightforward, although it took a long time for someone (Hennig 1966) to explain the most appropriate approach. Interestingly, the study of historical linguistics has developed the same methodology (Platnick and Cameron 1977; Atkinson and Gray 2005), thus independently arriving at exactly the same solution to what is, in effect, exactly the same problem. From this point of view, the logical inference itself is uncontroversial; and its generic nature means that it can be used for any objects with characteristics that can be identified and measured, and that follow a history of descent with modification. I will, however, discuss this in terms of biology — you can make the leap to other objects yourself.

The objective is to infer the ancestors of the contemporary organisms, and the ancestors of those ancestors, etc., all the way back to the most recent common ancestor of the group of organisms being studied. Ancestors can be inferred because the organisms share unique characteristics (shared innovations, or shared derived character states). That is, they have features that they hold in common and that are not possessed by any other organisms. The simplest explanation for this observation is that the features are shared because they were inherited from an ancestor. The ancestor acquired a set of heritable (i.e. genetically controlled) characteristics, and passed those characteristics on to its offspring. We observe the offspring, note their shared characteristics, and thus infer the existence of the unobserved ancestor(s). If we collect a number of such observations, what we often find is that they form a set of nested groupings of the organisms.
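To make the idea of nested groupings concrete, here is a small Python sketch. It assumes a character matrix that has already been polarized (1 = derived state, 0 = ancestral state). The taxa, the characters and the function names `innovation_groups` and `is_nested` are hypothetical.

```python
from itertools import combinations

# Hypothetical polarized character matrix: 1 = derived state, 0 = ancestral state.
matrix = {
    "A": [1, 1, 0],
    "B": [1, 1, 0],
    "C": [0, 1, 0],
    "D": [0, 0, 1],
    "E": [0, 0, 1],
}


def innovation_groups(matrix):
    """Return the groups of two or more taxa that share a derived state."""
    n_chars = len(next(iter(matrix.values())))
    groups = set()
    for i in range(n_chars):
        group = frozenset(taxon for taxon, row in matrix.items() if row[i] == 1)
        if len(group) >= 2:
            groups.add(group)
    return groups


def is_nested(groups):
    """True if every pair of groups is either disjoint or one inside the other."""
    return all(
        a <= b or b <= a or a.isdisjoint(b)
        for a, b in combinations(groups, 2)
    )


groups = innovation_groups(matrix)
print(sorted(sorted(g) for g in groups))
print("nested:", is_nested(groups))
```

When the groups are nested like this, they can be drawn directly as a tree. A character whose derived state is shared by, say, B and D would break the nesting. Such conflict among characters is where the choice of mathematical method starts to matter.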

Hennig, in particular, was interested in the interpretation of phylogenetic trees, rather than their reconstruction. He did this interpretation in terms of monophyletic groups (also called clades), each of which consists of an ancestor and all of its descendants. These are natural groups in terms of their evolutionary history, whereas other types of groups (eg. paraphyletic, polyphyletic) are not. So, a phylogenetic tree consists of a set of nested clades, which are the groups that are represented and given names in formal taxonomic schemes.
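As an illustrative sketch, a monophyly check can be written in a few lines of Python. The rooted tree, its node labels and the function names below are assumptions made for the example, and a group is represented simply by its set of leaves.

```python
# Hypothetical rooted tree, stored as parent -> list of children.
tree = {
    "root": ["n1", "n2"],
    "n1": ["n3", "C"],
    "n3": ["A", "B"],
    "n2": ["D", "E"],
}


def leaves_below(tree, node):
    """Return the set of leaves descended from `node` (a leaf returns itself)."""
    children = tree.get(node, [])
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= leaves_below(tree, child)
    return result


def is_monophyletic(tree, group):
    """True if some node has exactly `group` as its set of descendant leaves."""
    group = set(group)
    nodes = set(tree) | {child for kids in tree.values() for child in kids}
    return any(leaves_below(tree, node) == group for node in nodes)


print(is_monophyletic(tree, {"A", "B"}))       # a clade: all leaves below n3
print(is_monophyletic(tree, {"A", "B", "C"}))  # a clade: all leaves below n1
print(is_monophyletic(tree, {"B", "C"}))       # not a clade: A is left out
```

The check only separates clades from everything else. Telling paraphyletic groups from polyphyletic ones would need extra information about the ancestors involved, which this sketch does not attempt.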

For phylogenetic trees, there is thus a rationale for treating a tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared innovations (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

Mis-interpretations of Hennig

What I have said above has led to various mis-interpretations of Hennig's role in phylogenetics.

First, he did not propose any specific method for producing a phylogenetic tree (or network). He was concerned about the logic of the diagram, not how to get it in the first place. He distinguished shared derived character states, or shared innovations (he called them synapomorphies), from shared ancestral states (symplesiomorphies), and noted that only the former are relevant for phylogenies. So, distance methods will also work in phylogenetics provided the distances are based on homologous apomorphic features. If they are not so based, then they are simply mathematical constructions, which may or may not represent anything to do with phylogeny. Distances estimated from plesiomorphic features can be used to construct a tree, obviously, but there is no reason to expect that tree to represent a phylogeny.
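A toy Python example may help to show why the distinction matters for distance-style calculations. The data are invented, the history is assumed rather than inferred, and the two similarity measures are deliberately simplistic stand-ins for real distance methods.

```python
from itertools import combinations

# Hypothetical polarized characters: 1 = derived state, 0 = ancestral state.
# Assumed history: A and B are sister taxa, and C is their more distant relative.
# Character 0 is an innovation shared by A and B; characters 1 to 4 changed in A only.
states = {
    "A": [1, 1, 1, 1, 1],
    "B": [1, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0],
}


def overall_similarity(x, y):
    """Count all matching states, whether ancestral or derived."""
    return sum(a == b for a, b in zip(x, y))


def shared_derived(x, y):
    """Count only the matches in which both taxa carry the derived state."""
    return sum(a == 1 and b == 1 for a, b in zip(x, y))


pairs = list(combinations(states, 2))
closest_overall = max(
    pairs, key=lambda p: overall_similarity(states[p[0]], states[p[1]])
)
closest_derived = max(
    pairs, key=lambda p: shared_derived(states[p[0]], states[p[1]])
)

print("closest pair by overall similarity:", closest_overall)
print("closest pair by shared derived states:", closest_derived)
```

Overall similarity pairs B with C, because they share four retained ancestral states. Counting only shared derived states recovers the assumed sister pair A and B. Both calculations are equally valid arithmetic, but only the second has a reason to track genealogy.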

Second, parsimony analysis was developed independently of Hennig, by people such as Farris, Nelson and Platnick. This came to be called cladistics, intended by Ernst Mayr to be a derogatory term for the new form of analysis. The fact that the Willi Hennig Society is associated exclusively with cladistics has nothing to do with Hennig himself, or with the logic of his approach to phylogenetics. You need to clearly distinguish between Hennig and Cladistics!

Third, Hennig was more interested in classification than he was in phylogeny reconstruction. This seems to cause confusion for gene jockeys and linguists, in particular, who often associate phylogenetics solely with classification (see, for example, Felsenstein 2004, chapter 10). Sure, Hennig was primarily interested in the interpretation of phylogenies, rather than their construction. However, that was simply a personal point of view. The logic of his work transcends his own personal interests. Without him, no genealogical reconstruction makes logical sense, in genetics or linguistics. Mathematical methods for summarizing data were developed independently in genetics and linguistics, just as they were in other areas of biology and also in stemmatology. However, without the concept of shared innovations, these methods remain mathematical summaries, not estimates of genealogies.

Finally, Hennig's work was not original, being naturally a synthesis of much previous work. In biology, the work of Walter Zimmermann is frequently noted (eg. Donoghue & Kadereit 1992), and in linguistics the work of Karl Brugmann is obviously important (see Mattis' post Arguments from authority, and the Cladistic Ghost, in historical linguistics). Sometimes, wheels have to be re-invented many times before the general populace comes to realize just how important they are.

References

Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Donoghue MJ, Kadereit W (1992) Walter Zimmermann and the growth of phylogenetic theory. Systematic Biology 41: 74-85.

Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Sunderland MA.

Hennig W (1966) Phylogenetic Systematics. University of Illinois Press, Urbana IL. [Translated by DD Davis and R Zangerl from W. Hennig 1950. Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.]

Platnick NI, Cameron HD (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380-385.

#### 1 comment:

1. "Second, parsimony analysis was developed independently of Hennig, by people such as Farris, Nelson and Platnick." Farris, maybe -- but cladistics sensu Nelson has no connection to Wagner Parsimony (Nelson, Gareth G. 1979. Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson’s Familles des Plantes (1763–1764). Systematic Zoology, 28: 1–21.). This paper may tell you a great deal about cladistics and how it should have been understood.