The Accidental Statistician: Thinking about the bootstrap

Wednesday, 13 December 2017

Thinking about the bootstrap

The bootstrap normally resamples experimental units, but in phylogenetics you resample the VARIABLES: the sites.
How should we treat sites?
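As a rough illustration of what resampling the sites means, here is a minimal sketch that draws alignment columns with replacement. The function name `bootstrap_sites`, the dict-of-strings alignment format and the toy sequences are all made up for the example; they are not taken from any particular package.

```python
import random

def bootstrap_sites(alignment, rng=None):
    """Resample alignment columns (sites) with replacement.

    `alignment` maps a taxon name to its aligned sequence (hypothetical
    format chosen for this sketch). Returns a replicate of the same shape.
    """
    rng = rng or random.Random()
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # Draw n_sites column indices with replacement. The same indices are
    # used for every taxon, so each resampled column stays intact.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

# Made-up toy alignment, for illustration only.
toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "ATGTAC"}
replicate = bootstrap_sites(toy, random.Random(1))
```

Each replicate would then be passed to the tree-building step, and the support for a clade is the fraction of replicate trees that contain it.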

Remove totally variant sites?
Remove sites where a row is missing?
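One way to try these options before bootstrapping is a small column filter. This is only a sketch: the note does not say exactly which sites should go, so constant sites, sites where every sequence differs, and sites with missing data are each exposed as a separate flag. The name `filter_sites` and the choice of `-` and `?` as missing-data characters are assumptions.

```python
def filter_sites(alignment, drop_invariant=False, drop_all_distinct=False,
                 drop_missing=False, missing_chars="-?"):
    """Return a copy of the alignment with selected columns removed.

    `alignment` maps taxon name -> aligned sequence (hypothetical format).
    If drop_missing is False, a gap character simply counts as one more
    state when deciding whether a column is invariant or all-distinct.
    """
    names = list(alignment)
    kept = []
    for col in zip(*(alignment[n] for n in names)):
        if drop_missing and any(c in missing_chars for c in col):
            continue  # at least one row has no data at this site
        states = set(col)
        if drop_invariant and len(states) == 1:
            continue  # every row has the same state
        if drop_all_distinct and len(states) == len(col):
            continue  # every row has a different state
        kept.append(col)
    # Transpose the kept columns back into one string per taxon.
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}

# Made-up toy alignment, for illustration only.
toy = {"A": "AC-GT", "B": "ACTGA", "C": "ACTCC"}
filtered = filter_sites(toy, drop_invariant=True, drop_missing=True)
```

Running the bootstrap on the filtered and unfiltered alignments side by side would show how much each choice moves the support values.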

You cannot say that the parametric and non-parametric bootstraps are the same thing. Their values are correlated but not directly comparable.

Run FastTree with H5N8, then H5, then N8.
Use the parametric and non-parametric bootstraps.
Use the CONSEL measures as well.

Having more than 100 bootstrap replicates makes NO difference to the bootstrap values. Empirically, they converge quickly.
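For a sense of scale, the resampling noise in a single support value can be estimated by treating the number of replicates that contain a clade as binomial. This is a back-of-the-envelope sketch under that assumption; it covers only the Monte Carlo error from using a finite number of replicates, not the sampling error in the alignment itself. The function name is made up.

```python
import math

def support_standard_error(p, n_replicates):
    """Binomial standard error of a bootstrap proportion.

    p is the support the clade would get with infinitely many replicates;
    n_replicates is the number of bootstrap replicates actually run.
    """
    return math.sqrt(p * (1.0 - p) / n_replicates)

for b in (100, 1000, 10000):
    print(b, round(support_standard_error(0.5, b), 4))
```

With 100 replicates the worst case (p = 0.5) gives a standard error of 0.05, and it shrinks with the square root of the number of replicates.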

This is far below the theoretical numbers needed, but Efron says that this is usual.
It suggests that sites are linked, so there is less independent variability than it appears.
Need to experiment with conserved sites.
Need to experiment with the substitution models to look at sensitivity and also gamma.

There is a lack of independence between sites in the evolutionary models but this is IGNORED in the bootstrap calculations. You should bootstrap codons and not individual bases.
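A codon-level bootstrap can be sketched by resampling blocks of three columns instead of single columns. The sketch assumes the alignment is in frame, starts at the first codon position and has a length divisible by three. `bootstrap_codons` and the toy sequences are made up for the example.

```python
import random

def bootstrap_codons(alignment, rng=None):
    """Resample whole codons (blocks of 3 columns) with replacement.

    `alignment` maps taxon name -> aligned, in-frame coding sequence
    (hypothetical format chosen for this sketch).
    """
    rng = rng or random.Random()
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    if n_sites % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    n_codons = n_sites // 3
    # Pick codon indices with replacement; the three bases of a codon
    # always travel together, so within-codon dependence is preserved.
    picks = [rng.randrange(n_codons) for _ in range(n_codons)]
    return {name: "".join(seq[3 * k:3 * k + 3] for k in picks)
            for name, seq in alignment.items()}

# Made-up toy coding alignment, for illustration only.
coding = {"A": "ATGGCCAAA", "B": "ATGGCTAAA", "C": "ATGTCCAAG"}
codon_replicate = bootstrap_codons(coding, random.Random(7))
```

Comparing support values from this and from the single-base bootstrap on the same alignment would show how much the dependence within codons matters in practice.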

Need to create synthetic data where the true tree is known. This can be used to test:

Effects of sampling by censoring the data.
Evaluate modeltest.
Check trees from bad evolutionary models against the best models (probably the same!!!)
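A minimal way to get synthetic data with a known tree is to evolve a random root sequence down a fixed tree. The sketch below uses the Jukes-Cantor (JC69) model purely because it is the simplest one to write down; the tree shape, branch lengths, sequence length and all names are made up, and a real test would want richer models and a dedicated simulator.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under Jukes-Cantor (JC69).

    branch_length is in expected substitutions per site.
    """
    # JC69 probability that a site ends the branch in a different base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(node, seq, rng, leaves=None):
    """Walk the tree from the root and collect the sequences at the tips.

    A leaf is a taxon name (str); an internal node is a list of
    (child, branch_length) pairs. Hypothetical format for this sketch.
    """
    if leaves is None:
        leaves = {}
    if isinstance(node, str):
        leaves[node] = seq
        return leaves
    for child, length in node:
        simulate(child, evolve(seq, length, rng), rng, leaves)
    return leaves

# Known "true" tree ((A,B),(C,D)) with made-up branch lengths.
true_tree = [([("A", 0.1), ("B", 0.1)], 0.05),
             ([("C", 0.1), ("D", 0.1)], 0.05)]
rng = random.Random(42)
root = "".join(rng.choice(BASES) for _ in range(300))
alignment = simulate(true_tree, root, rng)
```

The simulated alignment can then be censored, run through model selection, or analysed under a deliberately wrong model, and the resulting trees compared with `true_tree`.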
