The Accidental Statistician: My more considered response to the referees

Referee 1

While the sequences are in the public domain, nobody has carried out the phylogenetic analysis of the H5 hemagglutinin and N8 neuraminidase sequences. Nobody would have been able to carry out the analysis previously, as the data has only just become available. Whenever someone publishes a phylogenetic analysis it only ever contains a subset of the available data. This is a complete data set of all the sequences from NCBI. If the GISAID data were included then this would be all of the available public data. GISAID data was not included because it cannot be searched based on hemagglutinin and neuraminidase numbers alone. I am confused as to why work using data from a public dataset cannot be original. If this were true, no papers based on the human genome project would be valid after the initial publication.

I used MEGA as I wanted the tree for the H5N8 sequences and the H5 and N8 trees to be comparable in how they were produced. I have used other methods to calculate the H5N8 trees, and I have referenced my own work and that of others who have used BEAST and coalescent methods. The tree from MEGA is the same as the trees from those studies, which is good because this is the control data for the paper. The control shows that you get a tree, but that it does not show any of the structure of the 7 lineages you find from the H5 and N8 trees. There are no bootstraps on the trees because they are not calculated when you create a tree with FastTree, as it is an approximate and not an exact method. You use it for large numbers of sequences. A bootstrap tree would not be valid without at least N bootstraps, where N is the number of sequences (and I think it actually requires N-squared to be properly correct and sample all the possible bootstraps).
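For readers unfamiliar with what a bootstrap replicate is, here is a minimal sketch of the resampling step only, assuming an alignment held as a plain dictionary of equal-length strings. The function name `bootstrap_alignment` and the toy sequences are hypothetical, and this is not how MEGA or FastTree implement it internally. Each replicate would then be passed to the tree-building program, which is where the cost comes from with large numbers of sequences.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so the resampling is reproducible.
    Columns (sites) are sampled with replacement; the same sampled columns
    are used for every sequence so the replicate stays aligned.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


# Toy data, purely illustrative (not sequences from the paper).
toy = {"seq1": "ACGTAC", "seq2": "ACGAAC", "seq3": "ATGTAC"}
replicate = bootstrap_alignment(toy, random.Random(1))
```

Every column in a replicate is a column of the original alignment, so the replicate has the same number of sequences and sites as the input.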

There is an H5 classification, but this would add extra complication to the study that would distract from the message about recombination. The main classification has been mentioned, as this is the Guangdong H5, which is now becoming the globally predominant form (as also proposed in the reference by Verhagen et al.).

Referee 2

The paper has nothing to do with an epidemiological analysis of the recent Korean outbreak, as I have already published in this journal on that subject! I have changed the introduction to make my aim clearer and to remove any possible idea that this might be epidemiological in intent.

The two points I am supposedly addressing are both wrong, and so I have stated explicitly the three possibilities the work actually covers. As I show in the paper, the referee’s question 1 cannot be answered with the trees produced in past studies or with my tree of the H5N8 sequences. As for question 2 regarding the Korean outbreak, that is well established by the cited work of Kang and Jeong, for which the referee kindly gives me the PMIDs (they are already in the paper).

As with the other referee, the objection seems to be that you cannot be doing a novel sequence analysis unless you have a new sequence that is not already in the NCBI. I find this a ridiculous assertion. If we cannot ever use publicly available data to carry out research, why do we make it public? I have produced a novel dataset for both hemagglutinin and neuraminidase genes for all H5-containing serotypes and N8-containing serotypes. Nobody has done this before because it makes no sense except in the context of this paper. Even if they had, nobody has done it including the sequences from the most recent outbreak, because they have only just been added to the database, and I know beyond any doubt that nobody has recently done the H5 and N8 analysis for all serotypes.

Regarding the suggestions for original research, population studies will be difficult if you fail to account for the re-assortment events that are a focus of the paper. I am certainly interested in clocks and in variability between hosts and even locations. There are also interesting implications about population sizes from the coalescent trees in my previous paper, but that is for another paper, as it requires deep theory and a reading of Kimura and Sewall Wright.

They are right in saying that the tools are largely used out of the box and that parameter space was not explored, because that is not my aim. I am interested in biology and not methods; this is a biology paper and not a methods paper. In selecting the substitution model, both AIC and BIC favored the GTR+I model, although the differences between models are nearly negligible. Alignment was on the nucleotides and so was tree building. Doing anything else would make no sense, as the distances and variants between sequences are very small.
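For anyone who wants to see what the AIC and BIC comparison amounts to, here is a minimal sketch using the standard definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n the number of alignment sites. The function names, the log-likelihoods, and the parameter counts below are hypothetical placeholders chosen only to show the arithmetic; they are not the values from the paper or the output of any particular program.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return num_params * math.log(num_sites) - 2 * log_likelihood


def rank_models(fits, num_sites):
    """Rank models by BIC.

    fits: dict mapping model name -> (log-likelihood, number of free parameters).
    Returns a list of (name, AIC, BIC) tuples, best BIC first.
    """
    rows = [
        (name, aic(lnl, k), bic(lnl, k, num_sites))
        for name, (lnl, k) in fits.items()
    ]
    return sorted(rows, key=lambda row: row[2])


# Hypothetical numbers for illustration only.
example_fits = {
    "model_A": (-12010.0, 9),
    "model_B": (-12012.0, 8),
}
for name, a, b in rank_models(example_fits, num_sites=1700):
    print(f"{name}: AIC={a:.1f} BIC={b:.1f}")
```

When the scores of the candidate models sit this close together, the choice between them makes little practical difference to the tree, which is the point made above.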

There was no check for intra-segment recombination, but that is a very rare event, although it is hypothesized in the 1918 pandemic strain (I doubt that result, given the limited sampling). I have included the exact commands I used to calculate the trees. A condensed phylogenetic tree is what you get when you collapse the nodes based on bootstrap values. Identical sequences cannot be distinguished and so have low bootstrap values, so it makes no sense to represent branches between them. That is why the figure legend says there is a cut-off of 60%. The results detail each of the re-arrangement events, stating the likely re-assortment partners that produced each of the different events, mostly in the USA. The details of the Korean outbreak are irrelevant to the objective of the paper unless they showed the presence of another re-assortment event, which they do not.
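The idea of condensing a tree at a bootstrap cut-off can be sketched in a few lines. This is only an illustration of the concept under simple assumptions: the `Node` class, `condense`, and `to_newick` are hypothetical names, the toy tree is invented, and this is not the code MEGA itself uses. Any internal branch whose support falls below the cut-off is dissolved, so its children attach directly to the parent as a polytomy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None          # leaf label; None for internal nodes
    support: Optional[float] = None     # bootstrap % on the branch above this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def condense(node, cutoff=60.0):
    """Return a copy of the tree with weakly supported internal branches collapsed."""
    if node.is_leaf():
        return Node(name=node.name)
    new_children = []
    for child in node.children:
        condensed = condense(child, cutoff)
        weak = (
            not condensed.is_leaf()
            and condensed.support is not None
            and condensed.support < cutoff
        )
        if weak:
            # Dissolve the weak branch: its children join the parent directly.
            new_children.extend(condensed.children)
        else:
            new_children.append(condensed)
    return Node(name=node.name, support=node.support, children=new_children)


def to_newick(node):
    """Write the topology as a Newick-style string with support labels."""
    if node.is_leaf():
        return node.name
    inner = ",".join(to_newick(child) for child in node.children)
    label = "" if node.support is None else str(int(node.support))
    return f"({inner}){label}"


# Toy tree: one well supported clade (95%) and one poorly supported clade (40%).
toy_tree = Node(children=[
    Node(support=95, children=[Node("A"), Node("B")]),
    Node(support=40, children=[Node("C"), Node("D")]),
])
print(to_newick(condense(toy_tree, cutoff=60.0)))
```

In the toy tree the clade with 40% support disappears and its two leaves hang directly off the root, while the clade with 95% support is kept.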

It is not possible to add quantitatively to the results when it is ancestral serotypes that produce the re-assortment. I could assign probabilities, but the data is too sparse.

The figures are only for review and I have higher resolution vector files of the original trees that will be submitted with the final version.

My harsher response to referee 2 is because he is neither an expert nor able to read. A condensed tree is the same as an ML tree with a cut-off at a certain bootstrap percentage, in this case 60%. This referee has clearly never used MEGA, and it seems highly unlikely that he has used consense either. His focus is only on AIC and BIC in modeltest, which is a tiny step of minor relevance to anyone except the author of that method.

Sunday, 4 October 2015
