Monday 21 September 2015

Is my phylogenetic analysis right? Or why some editors are too stupid for words

I had an editor decide they were going to try to school me on statistics. I had used an approximate method for a phylogenetic analysis, and they objected. Where are the bootstraps, they say, and why use FastTree and not something else (unspecified, I might add)? Apparently I needed to do a more robust and proper phylogenetic analysis.

Now I have two trees, each with about 1500 sequences, for two genes in the same organism. I created an approximate tree for each gene and the two trees AGREED. They show the same seven sub-trees that are the key result of the whole paper. That means that from TWO INDEPENDENT SAMPLES I get the SAME POINT ESTIMATE AGREEING SEVEN TIMES. So naive me thinks that is PROOF BEYOND REASONABLE DOUBT that the two trees are correct. You simply cannot get the SAME wrong answers in two independent datasets seven times, and if I consider the ordering of all of the other sequences, they agree in many hundreds of positions.
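For anyone who wants to make that agreement concrete, here is a minimal sketch of how one might list the sub-trees two trees have in common. It assumes both trees are rooted, that the sequences in each carry matching labels (for example the isolate they came from), and that the trees are held as nested tuples. The function names and the toy trees are hypothetical illustrations, not my data or any particular package's API.

```python
def clades(tree):
    """Collect every clade in a nested-tuple tree as a frozenset of leaf names."""
    found = set()

    def leaves(node):
        # A leaf is a plain string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def shared_clades(tree_a, tree_b):
    """Return the clades present in both trees, ignoring the trivial root clade."""
    clades_a = clades(tree_a)
    everything = max(clades_a, key=len)  # the root clade holds all leaves
    return (clades_a & clades(tree_b)) - {everything}


# Toy trees for two genes sampled from the same six (hypothetical) isolates.
gene_a = ((("s1", "s2"), "s3"), (("s4", "s5"), "s6"))
gene_b = ((("s2", "s1"), "s3"), ("s4", ("s5", "s6")))

shared = shared_clades(gene_a, gene_b)
for clade in sorted(shared, key=lambda c: (len(c), sorted(c))):
    print(sorted(clade))
```

In this toy case three of the four non-trivial sub-trees in the first tree also turn up in the second. With real trees the same comparison would be run on trees parsed from Newick files.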

Now I could do a bootstrap, but this is just a resampling test to check whether the algorithm is working properly. It tells you ABSOLUTELY NOTHING about whether you have the correct, biologically sound tree. (Check out Page and Holmes, Molecular Evolution: A Phylogenetic Approach, for a discussion of the problems of verifying trees, and of the idea that if trees from other genes agree they might be right.)
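To show what a bootstrap replicate actually is: the standard phylogenetic bootstrap resamples the columns of the alignment with replacement and rebuilds the tree from each pseudo-alignment. The sketch below covers only the resampling step. The function name, the seed and the toy alignment are assumptions for illustration, and the tree building itself is left to whatever programme you use.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: alignment columns drawn with replacement.

    alignment maps sequence names to aligned sequences of equal length.
    """
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as there are columns; repeats are allowed.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


rng = random.Random(1)  # fixed seed only so the sketch is repeatable
alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "ATGTAC"}
replicate = bootstrap_alignment(alignment, rng)
for name, seq in replicate.items():
    print(name, seq)
```

Every replicate is built from the same columns as the original alignment, so all it can measure is how consistently those columns support a grouping.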

Then again, how many bootstraps should I do with 1500 sequences? It is a resampling test, so lots is a good answer. Looking at a guide from Stata, for a dataset of 448 they carried out 4000 bootstraps (they tried 40,000 as well and it gave about the same result). At that ratio (about nine replicates per observation), 1500 sequences means well over 13,000 bootstraps, so call it 15,000 and likely even more. So exactly what programme can I run on a fairly normal commercial PC that will run 15,000 bootstraps on 1,500 sequences in a reasonable amount of time? Even when I do that, what programme will then be able to assemble and count these 15,000 trees to prepare a bootstrap tree? And even if I did, would it tell me anything more than the two independent trees have already told me about my tree being correct?
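The counting step, at least, is simple in principle. A rough sketch follows, assuming each replicate tree has already been reduced to its set of clades (each clade a frozenset of sequence names). The function name and the toy data are hypothetical, and this says nothing about how long it takes to build the replicate trees in the first place.

```python
from collections import Counter


def clade_support(reference_clades, replicate_clade_sets):
    """Fraction of replicate trees that contain each clade of the reference tree."""
    counts = Counter()
    for clade_set in replicate_clade_sets:
        # Only clades that appear in the reference tree are tallied.
        counts.update(reference_clades & clade_set)
    total = len(replicate_clade_sets)
    return {clade: counts[clade] / total for clade in reference_clades}


ab = frozenset({"a", "b"})
bc = frozenset({"b", "c"})
abc = frozenset({"a", "b", "c"})

reference = {ab, abc}
replicates = [{ab, abc}, {bc, abc}, {ab, abc}, {ab, abc}]

support = clade_support(reference, replicates)
for clade, value in sorted(support.items(), key=lambda item: len(item[0])):
    print(sorted(clade), value)
```

Here the (a, b) grouping appears in three of the four replicates and the (a, b, c) grouping in all four. The hard part with 15,000 trees of 1,500 sequences is producing the replicates, not tallying them.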

Yes, the editor is an idiot. Yes, they have very little idea what they are talking about. Yes, they are too stupid for words.

They are so caught up with the technicality of bootstrap and maximum likelihood, confidence intervals and prior probability that they forgot what the ultimate arbiter of a good tree is: Does the biology make sense?