Computer simulations are an important part of phylogenetics, not least because people use them to evaluate analytical methods, such as alignment strategies or network- and tree-building algorithms.

For this reason, biologists often seem to expect that there is some close connection between simulation "experiments" and the performance of data-analysis methods in phylogenetics, and yet the experimental results often have little to say about the methods' performance with empirical data.

There are two reasons for the disconnection between simulations and reality, the first of which is tolerably well known. This is that simulations are based on a mathematical model, and the world isn't (in spite of the well-known comment from James Jeans that "God is a mathematician"). Models are simplifications of the world with certain specified characteristics and assumptions. Perhaps the most egregious assumption is that variation associated with the model involves independent and identically distributed (IID) random variables. For example, simulation studies of molecular sequences make the IID assumption, by generating substitutions and indels at random in the simulated sequences (called stochastic modeling). This IID assumption is rarely true, and therefore simulated sequences deviate strongly from real sequences, where variation occurs distinctly non-randomly and non-independently, both in space and time.
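As a minimal sketch of what the IID assumption looks like in practice, the following Python fragment gives every site of a sequence the same substitution probability, independently of its neighbours and of its history. The names `evolve_iid` and `p_sub` are mine, chosen for illustration; this is not the model of any particular simulation package, and it ignores indels entirely.

```python
import random

BASES = "ACGT"


def evolve_iid(seq, p_sub, rng):
    """Return a copy of seq in which each site is substituted
    independently, with the same probability p_sub at every site."""
    out = []
    for base in seq:
        if rng.random() < p_sub:
            # Replace with one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    ancestor = "".join(rng.choice(BASES) for _ in range(60))
    print(ancestor)
    print(evolve_iid(ancestor, 0.1, rng))
```

Nothing in this function knows about codon position, neighbouring sites, or what happened earlier on the tree, which is exactly the simplification at issue.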

The second problem with simulations seems to be less well understood. This is that they are not intended to tell you anything about which data-analysis method is best. Instead, whatever analysis method matches the simulation model most closely will almost always do best, irrespective of any characteristics of the model.

To take a statistical example, consider assessing the *t*-test versus the Mann-Whitney test. This is the simplest form of statistical analysis, comparing two groups of data. If we simulate the data using a normal probability distribution, then we know a priori that the *t*-test will do best, because its assumptions perfectly match the model. What the simulation will tell us is how well the *t*-test does under perfect conditions; and indeed we find that its success is 100%. Furthermore, the Mann-Whitney test scores about 95%, which is pretty good. But we know a priori that it will do worse than the *t*-test; what we want to know is how much worse. All of this tells us nothing about which test we should use. It only tells us which method most closely matches the simulation model, and how close it gets to perfection. If we change the simulation model to one where we do not know a priori which analysis method is closest (e.g. a lognormal distribution), then the simulation will tell us which it is.
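A simulation of this kind might be sketched as follows, using only the Python standard library. Everything here is an assumption of mine rather than a description of any published study: the group size of 20, the 5% significance level, the shift of one unit between groups, and the use of rejection rate as the measure of performance (which is not necessarily the measure behind the percentages quoted above). The Mann-Whitney test uses the large-sample normal approximation without a tie correction, which is reasonable only for continuous data.

```python
import math
import random
import statistics

N = 20          # observations per group (an arbitrary choice)
T_CRIT = 2.024  # two-sided 5% critical value of Student's t, 2*N - 2 = 38 df
Z_CRIT = 1.960  # two-sided 5% critical value of the standard normal


def t_rejects(x, y):
    """Pooled-variance two-sample t-test at the 5% level."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * statistics.variance(x)
              + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        pooled * (1 / n1 + 1 / n2))
    return abs(t) > T_CRIT


def mann_whitney_rejects(x, y):
    """Mann-Whitney U test at the 5% level, normal approximation."""
    n1, n2 = len(x), len(y)
    u = sum(1 for a in x for b in y if a > b)  # pairs in which x exceeds y
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return abs(u - mean_u) / sd_u > Z_CRIT


def normal(rng):
    return rng.gauss(0.0, 1.0)


def lognormal(rng):
    return rng.lognormvariate(0.0, 1.0)


def rejection_rates(draw, shift, reps=2000, seed=1):
    """Fraction of simulated data sets in which each test rejects the null.
    Group y is group x's distribution moved up by `shift`."""
    rng = random.Random(seed)
    t_hits = mw_hits = 0
    for _ in range(reps):
        x = [draw(rng) for _ in range(N)]
        y = [draw(rng) + shift for _ in range(N)]
        t_hits += t_rejects(x, y)
        mw_hits += mann_whitney_rejects(x, y)
    return t_hits / reps, mw_hits / reps


if __name__ == "__main__":
    for name, draw in (("normal", normal), ("lognormal", lognormal)):
        for shift in (0.0, 1.0):
            t_rate, mw_rate = rejection_rates(draw, shift)
            print(f"{name:9s} shift={shift}: t={t_rate:.3f}  MW={mw_rate:.3f}")
```

With `shift=0.0` the null hypothesis is true, so the rates estimate the false-positive rate; with a non-zero shift they estimate power. Note that `T_CRIT` is tied to `N`, so changing the group size means looking up a new critical value.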

This is what mathematicians intended simulations for: to compare methods relative to the models for which they were designed, and to deviations from those models. So, simulations evaluate models as much as methods. They will mainly tell you which model assumptions are important for your chosen analysis method. To continue the example, non-normality matters for the *t*-test when the null hypothesis being tested is true, but not when it is false. Instead, inequality of variances matters for the *t*-test when the null hypothesis is false. This is easily demonstrated using simulations, as it also is for the Mann-Whitney test. But does it tell you whether to use *t*-tests or Mann-Whitney tests?
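One way to probe an assumption like equality of variances is to hold the method fixed and vary only the simulation model. The sketch below does this for the pooled-variance *t*-test; the function names, the group size of 20, and the parameters `shift` and `sd_ratio` are all illustrative assumptions of mine, and it deliberately covers only the equal-sample-size case.

```python
import math
import random
import statistics

N = 20          # observations per group (an arbitrary choice)
T_CRIT = 2.024  # two-sided 5% critical value of Student's t, 2*N - 2 = 38 df


def t_statistic(x, y):
    """Pooled-variance t statistic for two groups of equal size."""
    pooled = (statistics.variance(x) + statistics.variance(y)) / 2
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        pooled * 2 / len(x))


def t_rejection_rate(shift=0.0, sd_ratio=1.0, reps=2000, seed=1):
    """Fraction of simulated data sets in which the t-test rejects.
    Group x is N(0, 1); group y is N(shift, sd_ratio**2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]
        y = [rng.gauss(shift, sd_ratio) for _ in range(N)]
        hits += abs(t_statistic(x, y)) > T_CRIT
    return hits / reps


if __name__ == "__main__":
    for shift in (0.0, 1.0):        # null hypothesis true, then false
        for sd_ratio in (1.0, 3.0):  # equal, then unequal variances
            rate = t_rejection_rate(shift, sd_ratio)
            print(f"shift={shift} sd_ratio={sd_ratio}: rejects {rate:.3f}")
```

What comes out is a table of how the test responds to each departure from its model, which is a statement about assumptions, not a recommendation about which test to use on real data.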

This is not a criticism of simulations as such, because mathematicians are interested in the behaviour of their methods, such as their consistency, efficiency, power, and robustness. Simulations help with all of these things. Instead it is a criticism of the way simulations are used (or interpreted) by biologists. Biologists want to know about "accuracy" and about which method to use. Simulations were never intended for this.

To take a first phylogenetic example, people simulate sequence data under likelihood models, and then note that maximum likelihood tree-building does better than parsimony. Maximum likelihood matches the model better than parsimony, so we know a priori that it will do better. What we learn is how well maximum likelihood does under perfect conditions (it is some way short of 100%) and how well parsimony does relative to maximum likelihood.

As a second example, we might simulate sequence-alignment data with the gaps in multiples of three nucleotides. We then discover that an alignment method that puts gaps in multiples of three does better than ones that allow any size of gap. So what? We know a priori which method matches the model. What we don't know is how well it does (it is not 100%), and how close to it the other methods will get. But this is all we learn. We learn nothing about which method we should use.

So, it seems to me that biologists often over-interpret computer simulations. They are tempted to read more into the results than is there, rather than seeing them for what they are, which is simply an exploration of one set of models versus other models within the specified simulation framework. The results have little to say about the data-analysis methods' performance with empirical data in phylogenetics.
