The Genealogical World of Phylogenetic Networks: Are realistic mathematical models necessary?

In a comment on last week's post (Capturing phylogenetic algorithms for linguistics), Mattis noted that linguists are often concerned about how "realistic" are the models used for mathematical analyses. This is something that biologists sometimes also allude to, as well, not only in phylogenetics.

Here, I wish to argue that model realism is often unnecessary. Instead, what is necessary is only that the model provides a suitable summary of the data, which can be used for successful scientific prediction. Realism can be important for explanation in science, but even here it is not necessarily essential.

The fifth section of this post is based on some data analyses that I carried out a few years ago but never published.

Isaac Newton

Isaac Newton is one of the top handful of most-famous scientists. Among other achievements, he developed a quantitative model for describing the relative motions of the planets. As part of this model he needed to include the mass of each planet. He did this by assuming that each mass is concentrated at an infinitesimal point at the centre of mass. Clearly, the planets do not have zero volume, and thus this aspect of the model is completely unrealistic. However, the model functions quite well for both description of planetary motion and prediction of future motion. (It gets Mercury's motion slightly wrong, which is one of the improvements that Einstein's model of Special Relativity provides).

Newton's success came from neither wanting nor needing realism. Modeling the true distribution of mass throughout each planetary volume would be very difficult, since it is not uniformly distributed, and we still don't have the data anyway; and it is thus fortunate that it is unnecessary.

Other admonitions

The importance of Newton's reliance on the simplest model was also recognized by his best-known successor, Albert Einstein:

Everything should be as simple as it can be, but not simpler.

This idea is usually traced back to William of Ockham:

1. Plurality must never be posited without necessity.
2. It is futile to do with more things that which can be done with fewer.

However, like all things in science, it actually goes back to Aristotle:

We may assume the superiority, all things being equal, of the demonstration that derives from fewer postulates or hypotheses.

Sophisticated models model details

Realism in models makes the models more sophisticated, rather than keeping them simple. However, more complex models often end up modelling the details of individual datasets rather than improving the general fit of the model to a range of datasets.

In an earlier post (Is rate variation among lineages actually due to reticulation?) I also commented on this:

There is a fundamental limitation to trying to make any one model more sophisticated: the more complex model will probably fit the data better but it might be fitting details rather than the main picture.

The example I used was modelling the shape of starfish, all of which have a five-pointed star shape but which vary considerably in the details of that shape. If I am modelling starfish in general, then I don't need to concern myself about the details of their differences.

Another example is identifying pine trees. I usually can do this from quite a distance away, because pine needles are very different from most tree leaves, which makes a pine forest look quite distinctive. I don't need to identify to species each and every tree in the forest in order to recognize it as a pine forest.

Simpler phylogenetic models

This is relevant to phylogenetics whenever I am interested in estimating a species tree or network. Do I need to have a sophisticated model that models each and every gene tree, or can I use a much simpler model? In the latter case I would model the general pattern of the species relationships, rather than modelling the details of each gene tree. The former would be more realistic, however.

In that previous post (Is rate variation among lineages actually due to reticulation?) I noted:

If I wish to estimate a species tree from a set of gene trees, do I need a complex model that deals with all of the evolutionary nuances of the individual gene trees, or a simpler model that ignores the details and instead estimates what the trees have in common? ... adding things like rate variation among lineages (and also rate variation along genes) will usually produce "better fitting" models. However, this is fit to the data, and the fit between data and model is not the important issue, because this increases precision but does not necessarily increase accuracy.

So, it is usually assumed ipso facto that the best-fitting model (ie. the best one for description) will also be the best model for both prediction and explanation. However, this does not necessarily follow; and the scientific objectives of description, prediction and explanation may be best fulfilled by models with different degrees of realism.

In this sense, our mathematical models may be over-fitting the details of the gene phylogenies, and in the process sacrificing our ability to detect the general picture with regard to the species phylogenies.

Empirical examples

In phylogenetics, about 15 years ago it was pointed out that simpler and obviously unrealistic models can yield more accurate answers than do more complex models. Examples were provided by Yang (1997), Posada & Crandall (2001) and Steinbachs et al. (2001). That is, the best-fitting model does not necessarily lead to the correct phylogenetic tree (Gaut & Lewis 1995; Ren et al. 2005).

This situation is related to the fact that gene trees do not necessarily match species phylogenies. These days, this is frequently attributed to things like incomplete lineage sorting, horizontal gene transfer, etc. However, it is also related to models over-fitting the data. We may (or may not) accurately estimate each individual gene tree, but that does not mean that the details of these trees will give us the species tree. Basically, estimation in a phylogenetic context is not a straightforward statistical exercise, because each tree has its own parameter space and a different probability function (Yang et al. 1995).

One way to investigate this is to analyze data where the species tree is known. We could estimate the phylogeny using each of a range of mathematical models, and thus see the extent to which simpler models do better than more complex ones, by comparing the estimates to the topology of the true tree.

I used six DNA-sequence datasets, as described in this blog's Datasets page. Each one has a known tree-like phylogenetic history:

Datasets where the history is known experimentally:
Sanson — 1 full gene, 16 sequences
Hillis — 3 partial genes, 9 sequences
Cunningham — 2 genes + 2 partial genes, 12 sequences
Cunningham2 — 2 partial genes, 12 sequences
Datasets where the history is known from retrospective observation:
Leitner — 2 partial genes, 13 sequences
Lemey — 2 partial genes, ~16 sequences

For each dataset I carried out a branch-and-bound maximum-likelihood tree search, using the PAUP* program, for each of the 56 commonly used nucleotide-substitution models. I used the ModelTest program to evaluate which model "best fits" each dataset. The models along with their number of free parameters (ie. those that can be estimated) is:

For the Sanson, Hillis and Lemey datasets it made no difference which model I used, as in each case all models produced the same tree. For the Sanson dataset this was always the correct tree. For the Hillis dataset it was not the correct tree for any gene. For the Lemey dataset it was the correct tree for one gene but not the other.

The results for the other three datasets are shown below. In each case the lines represent different genes (plus their concatenation), the horizontal axis is the number of free parameters in the models, and the vertical axis is the Robinson-Foulds distance from the true tree (for models with the same number of parameters the data are averages). The crosses mark the "best-fitting" model for each line.

Cunningham:

Cunninham2

Leitner

For all three datasets, for both individual genes and for the concatenated data, there is almost always at least one model with fewer free parameters that produces an estimated tree that is closer to the true phylogenetic tree. Furthermore, the concatenated data do not produce estimates that are closer to the true tree than are those of the individual genes.

Conclusion

The relationship between precision and accuracy is a thorny one in practice, but it is directly relevant to the whether we need / use complex models, and thus more realistic ones.

References

Gaut BS, Lewis PO (1995) Success of maximum likelihood phylogeny inference in the four-taxon case. Molecular Biology & Evolution 12: 152-162.

Posada D, Crandall KA (2001) Simple (wrong) models for complex trees: a case from Retroviridae. Molecular Biology & Evolution 18: 271-275.

Ren F, Tanaka H, Yang Z (2005) An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Systematic Biology 54: 808-818.

Steinbachs JE, Schizas NV, Ballard JWO (2001) Efficiencies of genes and accuracy of tree-building methods in recovering a known Drosophila genealogy. Pacific Symposium on Biocomputing 6: 606-617.

Yang Z (1997) How often do wrong models produce better phylogenies? Molecular Biology & Evolution 14: 105-108.

Yang Z, Goldman N, Friday AE (1995) Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Systematic Biology 44: 384-399.

3 comments:

Johann-Mattis ListNovember 18, 2015 at 10:41 PM
Maybe I'm too much of a linguist (still), but when you say that "what is necessary is only that the model provides a suitable summary of the data, which can be used for successful scientific prediction", this is something I would be careful to subscribe to. Take google.translate, for example, and assume they're getting even better than now, offering almost perfect translations for the major languages (English, Spanish, Russian, etc.), but since their models are stochastic, all decisions that the black box engine makes are not transparent. They may be wrong or not appropriate internally, but they work and translate the things just right. Now, what has been achieved is making an intelligent machine that can translate. But the major question, as to what is behind the curtains, what is the basic structure of languages, etc. these machines do not make any contribution. But shouldn't science try to go beyond the black box and tell us what is going on in there? So even if google created the perfect translation engine that works on large amounts of training data: as long as they couldn't tell us why the translations are good, I would not call this a scientific contribution, since, as a linguist, I want to know what is really going on with language, as a biologist might want to know what is really going on with evolution.

Having simplifying models or unrealistic models is OK and serves a purpose since we can never really model reality. But in computational applications there is the tendency to just be content with the models as they are, without either questioning what they actually imply about language change (which is often not realistic), nor how this could be enhanced, nor why the simple models work well (if they do, but often they don't).

So, as long as we reflect about the nature of our models and as long as we try to be transparent, I am fine with all simplifications, but if it results in an "as long as we have big data, we don't need to think about our models"-attitude, as one finds often these days (in linguistics and other fields, especially translation), I accept the procedure but I wouldn't call it scientific.
Johann-Mattis ListNovember 19, 2015 at 4:50 PM
I like your separation into the three parts. Linguists often do not distinguish this clearly, and maybe this would spare us a lot of confusions...

Wednesday, November 18, 2015

Are realistic mathematical models necessary?

3 comments: