The Accidental Statistician: My angry replies to Editor 1's rejection

Dear Editor 1,

I am rather struggling to understand the referees' comments, especially those of Referee 2, who seems to have NO IDEA what the paper was about. I do not care at all about the epidemiology of the recent Korean flu outbreak, and that is not in the title or abstract.

Referee 1 is also wrong. I used MEGA to construct a tree of just the H5N8 data, which I know is not original. The new data are the MAFFT trees containing 15000 HA sequences and 15000 NA sequences that NEITHER referee even bothered to look at. That is the result. I could not care less about the H5N8 tree, as it is wrong because it does not take reassortment into account. The point is to show that MEGA gives a nice tree that looks all good but is actually missing key elements.

Put simply, the paper SHOWS ABSOLUTELY that when you collect flu sequences from a supposedly single serotype like H5N8, which you would suppose arose from evolution once, it is actually a mix of different events that have created this serotype multiple times by combining H5 hemagglutinins with N8 neuraminidases from other serotypes. This tells you that to construct trees you need to include intermediates, including non-H5N8 sequences; otherwise you cannot reconstruct the tree properly, as you are missing reassortments and ancestors.

This is a VERY BIG DEAL, as if you don’t do it your trees will be wrong (as ALL current H5N8 trees are). This is definitely novel and definitely never discussed before. I can certainly increase the level of detail for the method, but any credible bioinformatician would be able to reproduce the trees, as they are produced by the default methods of Katoh in MAFFT, which does not allow most of the options Referee 2 suggests. To be honest, all the nucleotide substitution models score pretty much the same in AIC and BIC, but MEGA and MAFFT do not use all the same models in the same way, so you need to use methods that they have in common. The MAFFT method has been cited more times than any other.

I got the tool and got the results. I do not optimise the trees because that is not the point. I do not care about testing parameter space or algorithms. The point is that the H5N8 sequences form many clades spread widely over the tree, and that these multiple clades are consistent across two different genes. This cannot happen by chance, so the trees are CORRECT regardless of parameter space and settings.

If the reviewer opened the tree files he might see that they contain 15000 sequences, which I am sure are not sets of data produced by ANYONE EVER before. I would like to see them bootstrap, as 1000 bootstraps would be statistically wrong; you need to carry out at least N, where N is the number of sequences. If they have access to the world’s largest supercomputer then they may do the bootstrap. MAFFT does not allow it anyway. But as I said, they have not bothered to even look at that data at all.

Thanks Andy

There are a few problems with this response, and temper got the better of me. There are not 15,000 sequences; there are 4008 HA and 1840 NA sequences. That is still a big problem, and creating a phylogenetic analysis with bootstraps at that size is outside the scope of most programs (PhyML won't do it, for example), but it is possible, if rather pointless. To carry out a complete non-parametric bootstrap you would need nCr(8015, 4008) bootstraps, the number of distinct ways of resampling 4008 items with replacement. This is a very large number, large enough not to be calculable in the entire history of the universe. However, bootstrap values converge to this ideal value quickly, but to know they have converged you would need to build multiple bootstrap trees and check that the values are converging. As far as I know nobody does this, and certainly not for trees with 4008 taxa.
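To give a feel for the size of that number, here is a rough Python sketch. It assumes the count of distinct resamples of n items drawn with replacement is nCr(2n - 1, n), which is where the 8015 comes from when n = 4008. The function names are my own for illustration and are not part of MEGA, MAFFT or PhyML.

```python
import math


def distinct_bootstrap_samples(n: int) -> int:
    """Number of distinct resamples of n items drawn with replacement.

    Assumption: order within a resample does not matter, so this is the
    number of multisets of size n from n items, nCr(2n - 1, n).
    """
    return math.comb(2 * n - 1, n)


def decimal_digits(value: int) -> int:
    """Number of decimal digits in a positive integer."""
    return math.floor(math.log10(value)) + 1


if __name__ == "__main__":
    # 4008 HA and 1840 NA sequences, as counted in the paragraph above.
    for label, n in (("HA", 4008), ("NA", 1840)):
        total = distinct_bootstrap_samples(n)
        print(f"{label}: n = {n}, distinct resamples have "
              f"{decimal_digits(total)} decimal digits")
```

Even the digit count runs into the thousands, which is why nobody enumerates every resample and why a few hundred or a few thousand replicates are used as an approximation instead.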

Sunday, 4 October 2015