Wednesday, May 29, 2013

Should phylogenetic modelling proceed from simple to complex or vice versa?


In statistical model testing, models can be tested by starting with the simplest model and progressively adding model complexity until the desired level of model fit is achieved. Alternatively, one can start with the most complex model and progressively delete unnecessary components while maintaining the desired level of model fit. The first approach is constructive, in the sense that the model is constructed piece by piece (stepwise addition), while the second approach is reductive, in the sense that the full model is pared down to its simplest form (stepwise deletion).

This distinction in approaches to modelling is relevant to the difference between using trees and networks as phylogenetic models.

At the moment, the most common approach to phylogenetic analysis is the constructive one. One starts with the simplest model, a bifurcating tree, and assesses the degree to which it fits the data. If the fit is poor, as it often is with multi-gene data, especially if the gene data are concatenated, then complexity is added. For example, one might include incomplete lineage sorting (ILS) in the model, which allows the different genes to fit different trees, while still maintaining the need for a single dichotomous species tree. Alternatively, one might consider gene duplication-loss as a possible addition to the model, which is another major source of incompatibility between multi-gene data and a single species tree. Only if these additional complexities also fail to attain the desired degree of fit does one consider adding components of reticulate evolution to the model, such as hybridization or horizontal gene transfer (HGT).

The reductive (or simplification) approach, however, proceeds the other way. A general network model is used as the starting point. The various components of this model would include a dichotomous tree as a special case, along with ILS, duplication-loss, hybridization, and HGT as individual components. These special cases are evaluated simultaneously, and each one is dropped if it is contributing nothing worthwhile to the model fit. The final model consists of the simplest combination of components that still maintains the specified fit of data and model; this may indeed be a simple tree.
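The two procedures can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names, the component labels, the percentage "fit" scores and the target of 90 are all invented for the example, and the additive scoring is a stand-in for a real measure of model fit, not a phylogenetic method.

```python
from typing import Callable, FrozenSet, Sequence

Fit = Callable[[FrozenSet[str]], int]

# Hypothetical, made-up contributions to model fit (percentage points).
TOY_CONTRIBUTION = {
    "tree": 50,
    "ILS": 20,
    "duplication_loss": 20,
    "hybridization": 45,
}
# The order in which the constructive approach considers components.
COMPONENTS = ["tree", "ILS", "duplication_loss", "hybridization"]
TARGET = 90  # hypothetical "desired level of fit"


def toy_fit(model: FrozenSet[str]) -> int:
    """Toy fit: sum of the invented contributions, capped at 100."""
    return min(100, sum(TOY_CONTRIBUTION[c] for c in model))


def stepwise_addition(order: Sequence[str], fit: Fit, target: int) -> FrozenSet[str]:
    """Constructive: add components in a fixed order until the target is met."""
    model: FrozenSet[str] = frozenset()
    for component in order:
        if fit(model) >= target:
            break  # later components are never evaluated
        model = model | {component}
    return model


def stepwise_deletion(components: Sequence[str], fit: Fit, target: int) -> FrozenSet[str]:
    """Reductive: start from the full model and drop whatever is redundant."""
    model = frozenset(components)
    changed = True
    while changed:
        changed = False
        for component in sorted(model):
            reduced = model - {component}
            if fit(reduced) >= target:  # fit is maintained without it
                model = reduced
                changed = True
                break
    return model


constructive = stepwise_addition(COMPONENTS, toy_fit, TARGET)
reductive = stepwise_deletion(COMPONENTS, toy_fit, TARGET)
print("constructive:", sorted(constructive))
print("reductive:   ", sorted(reductive))
```

With these invented numbers, the constructive run stops at tree + ILS + duplication-loss and never looks at hybridization, whereas the reductive run ends with tree + hybridization. Both models reach the target, but they tell different biological stories.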

The main advantage of the latter approach is that all of the components of the model are evaluated simultaneously, so that their potential interactions can be quantitatively assessed. Components are dropped from the model only if they contribute nothing to the model, either independently or in synergy with the other components. That is, they are dropped only if they can be shown to be redundant.
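A small sketch may make the point about synergy more concrete. Everything here is hypothetical: the scores are invented, and the assumption that hybridization improves the fit only when ILS is also in the model is made purely for illustration, not as a biological claim.

```python
from typing import FrozenSet

FULL = frozenset({"tree", "ILS", "hybridization"})


def toy_fit(model: FrozenSet[str]) -> int:
    """Invented fit score with one interaction term."""
    score = 0
    if "tree" in model:
        score += 60
    if "ILS" in model:
        score += 10
    # Assumed synergy: hybridization helps only in combination with ILS.
    if "ILS" in model and "hybridization" in model:
        score += 25
    return score


def standalone_gain(component: str, base: FrozenSet[str]) -> int:
    """Improvement from adding one component to a smaller base model."""
    return toy_fit(base | {component}) - toy_fit(base)


def drop_one_loss(component: str, full: FrozenSet[str]) -> int:
    """Loss of fit when one component is removed from the full model."""
    return toy_fit(full) - toy_fit(full - {component})


print(standalone_gain("hybridization", frozenset({"tree"})))
print(drop_one_loss("hybridization", FULL))
```

In this toy setting, adding hybridization to the tree alone changes nothing, so a piecewise evaluation would judge it worthless. Removing it from the full model costs 25 points, which is what the simultaneous evaluation detects.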

This does not happen with the constructive approach to modelling. Here, the components are evaluated in some specified order, and components that are later in the order will not be evaluated unless the earlier components prove to be inadequate. These later components are thus potentially excluded from statistical consideration. This means that their possible contribution to biological explanation may never be quantitatively assessed.

So, in practice, evolutionary reticulation is considered to be a "last resort" in current phylogenetic analyses. It is considered as a possible biological explanation only if all else has already failed.

This philosophy seems to be as much a historical artifact as anything else. The first phylogenetic diagrams (by Buffon and Duchesne) were networks not trees, but they were replaced a century later by the tree model suggested by Darwin; and the tree has retained its primacy since that time. This leads naturally to the constructive approach to modelling, which is so prevalent in the current literature.

However, the constructive approach to modelling has no inherent statistical superiority. Indeed, statisticians seem to consider forward and backward selection of model components to be essentially equivalent, although they may lead to different models for any given dataset. The most commonly cited advantage of the constructive approach is that it is likely to avoid the problems that can arise from having too many components in the model.

Nevertheless, the reductive approach has the distinct advantage of simultaneously evaluating all possible special cases of a network, and thus does not exclude any possible biological explanation that might apply to the observed data. This may provide more biological insight than does the constructive approach to phylogenetic modelling.
