Saturday, 28 November 2015

Thinking about the bootstrap - is it doing what we think it is?

The bootstrap comes in two forms: non-parametric and parametric. The non-parametric form is the easier to understand, and it is closely related to cross-validation and jackknifing. Bootstrap methods were pioneered by Brad Efron and collaborators, and the book An Introduction to the Bootstrap by Efron and Tibshirani is a very readable account.

Non-parametric bootstrapping should be used in cases where the data cannot be summarised by a well-defined parameter that is normally distributed. This applies to phylogenetic analysis, where the sequences and the variable sites influence the result in a very non-linear manner. Felsenstein applied the bootstrap to phylogenetic analysis in 1985.

The general process for the bootstrap is that, given a set of data, you resample that data with replacement to generate a new dataset of the same size as the original. If the original dataset had 200 elements, the resampled dataset also has 200 elements, with some of the original elements appearing more than once and others not at all. This is the first resample. For the jackknife, by contrast, the resampling is the exclusion of a subset of data from the complete set (in crystallography this is analogous to the Free-R factor, which is calculated from reflections held out of the refinement, as opposed to the R-factor). In the bootstrap, you resample many times. As the number of bootstrap resamples goes to infinity you will eventually draw every possible resample of the original data, but the statistics calculated from the resamples usually converge with far fewer replicates. You should test for convergence (nobody ever does).
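
As a minimal sketch of the resampling and the convergence check described above, the snippet below bootstraps the mean of a made-up dataset of 200 values and prints the bootstrap standard error for increasing numbers of replicates. The data, the seed, the replicate counts and the function names are all assumptions chosen for illustration, not taken from any particular package.

```python
import random
import statistics


def bootstrap_resample(data, rng):
    """Draw len(data) elements from data, with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_standard_error(data, statistic, n_replicates, rng):
    """Standard deviation of a statistic across bootstrap replicates."""
    values = [statistic(bootstrap_resample(data, rng)) for _ in range(n_replicates)]
    return statistics.stdev(values)


rng = random.Random(2015)

# Hypothetical dataset: 200 draws from a skewed (exponential) distribution.
data = [rng.expovariate(1.0) for _ in range(200)]

# Convergence check: the estimate should settle down as replicates increase.
for n_replicates in (25, 100, 400, 1600):
    se = bootstrap_standard_error(data, statistics.mean, n_replicates, rng)
    print(f"{n_replicates:5d} replicates: bootstrap SE of the mean = {se:.4f}")
```

If the printed values are still drifting at the largest replicate count, that suggests more replicates are needed before the statistic can be trusted.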

For phylogenetics, the aligned sequences form a matrix in which each row is a sequence and each column is an aligned position. It is the columns of this matrix that are resampled. In most of Efron's examples it is the rows that are resampled, and this could be done in phylogenetics, but with some added complexity in renaming the sequences that are duplicated. Resampling the columns might conflict with the basic assumptions of the bootstrap.
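
Column resampling can be sketched as follows. This is only an illustration under simple assumptions: the alignment is held as a dictionary of equal-length strings, and the taxon names, sequences and the function name `bootstrap_columns` are invented for the example.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps sequence name -> aligned sequence (all the same length).
    The result has the same names and the same number of columns.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # The same column picks are applied to every row, so each resampled
    # column is an intact column of the original alignment.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


# Hypothetical toy alignment of three taxa and ten sites.
toy = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTTCGTAC",
    "taxon_C": "ACGAACGTTC",
}

replicate = bootstrap_columns(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Each replicate would then be passed to the tree-building method, and the support for a grouping is the fraction of replicate trees that contain it.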

These fundamental assumptions are that the sites are independent and that they are identically distributed. For sequence alignments, each of these assumptions might hold on its own with slight exceptions, but the two together most likely do not.


  1. Some sites will not be independent, and we know that there are correlated mutations, but these are perhaps few enough that they do not affect the results of the bootstrap.
  2. Each of the positions should have the same probability of being A, C, T or G, but because of the different rates of change at different codon positions, sites will definitely not all have the same number of changes, and over long periods saturation becomes an issue. There is some correlation, but Efron and Tibshirani show how to deal with correlation by resampling each group of correlated variables as a single set.
  3. The differences in mutation at different codon positions mean that the assumptions of identically distributed and independent sites are most likely not true. We should bootstrap using codons and not single sites. (This is actually possible with the block-bootstrap option of PHYLIP's seqboot, where you can set the length of the sub-sequence to resample.)
  4. There is a direct contradiction between models if the variation in rate across codon positions is represented by a gamma distribution while the bootstrap is carried out by resampling single sites, because you lose the correlation between sites implicit in the gamma distribution model. This must have a negative effect on the bootstrap tree.
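
Resampling whole codons rather than single sites, as suggested in the list above, could look something like this. It is a sketch that assumes the alignment is in reading frame and that its length is a multiple of the block size; the function name `bootstrap_blocks` and the toy sequences are invented for the example, and it is not meant to reproduce the exact behaviour of any particular program's option.

```python
import random


def bootstrap_blocks(alignment, block_size, rng):
    """Resample non-overlapping blocks of columns with replacement.

    With block_size=3 and an in-frame coding alignment, each block is a codon,
    so the three codon positions stay together in every replicate.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    if n_sites % block_size != 0:
        raise ValueError("alignment length must be a multiple of block_size")
    n_blocks = n_sites // block_size
    # Pick block start positions; the same picks are used for every sequence.
    starts = [rng.randrange(n_blocks) * block_size for _ in range(n_blocks)]
    return {
        name: "".join(seq[s:s + block_size] for s in starts)
        for name, seq in alignment.items()
    }


# Hypothetical in-frame coding alignment: four codons, three taxa.
toy_coding = {
    "taxon_A": "ATGGCTAAACGT",
    "taxon_B": "ATGGCCAAACGA",
    "taxon_C": "ATGGCTAAGCGT",
}

codon_replicate = bootstrap_blocks(toy_coding, 3, random.Random(7))
for name, seq in codon_replicate.items():
    print(name, seq)
```

Setting the block size to 1 gives back the ordinary single-site bootstrap, which makes it easy to compare the support values from the two schemes on the same alignment.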

More recently, methods based on likelihoods and local bootstrap methods have been developed to deal with large trees, where the calculation of the non-parametric bootstrap becomes prohibitive. All of these methods depend on assuming that the alignment contains enough information to define a simple parameter that can then be bootstrapped. These are all examples of a parametric bootstrap. They have been reported to give results that are highly correlated with the non-parametric bootstrap, with levels of correlation over 95%, but this might conceal very large local fluctuations.
