Sunday, 4 October 2015

Statistics Don't Tell the Whole Story

cock had his say on

Match stats

Provided by Opta
5 (1)
Scrums won (lost)
8 (0)
7 (1)
Line-outs won (lost)
11 (2)
Pens conceded
77 (7)
Rucks won (lost)
81 (3)
Kicks from hand
101 (15)
Tackles made (missed)
116 (18)
Metres made
Line breaks

Last piece of the story - my appeal letter

Dear Editor, 
I would like to submit the article “The multiple origins of the H5N8 avian influenza sub-type”. This paper describes the multiple reassortment events that have produced the H5N8 avian influenza sub-type. The current phylogenetic tree for the H5N8 sequences is calculated and this does not exhibit any unusual features. There seem to be clades in the Far East and North America.
What this does phylogenetic tree does not show is that the appartently homogeneous North American clade is actually made up of multiple distinct re-assortment events. These are usually just single isolated sequences, and this suggests that while the sub-type is present other H5 and N8 containing sub-types dominate and this is a rare reassortment that rapidly becomes extinct. 
By carrying out a phylogenetic analysis of all the H5 segments from all H5 containing subtypes it is clear that the H5 is evolving in different subtypes between reassortment events that create H5N8 and the same is true for the N8 segments. This is the novel aspect of the research and shows that there are many re-assortment events that need to be accounted for but which are undetectable in the H5N8 sequences only trees. By not taking these re-assortments the the subsequent sequence evolution in other H5 or N8 containing sub-types into account we cannot correctly calculate the H5N8 phylogeny. Both the H5N8 phylogenies have been calculated using the same method (maximum likelihood) and the same evolutionary model (GTR with gamma correction). However the phylogenies for H5 and N8 were computed using FastTree rather than in Mega or PhyML because there are over 4000 H5 taxa and so they cannot be computed with bootstraps. However despite the lack of bootstrap values both the H5 and N8 phylogenies exhibit the same patterns of isolated re-assortment events, showing that the trees are a good representation of the true phylogenetic relationship.
The usual failure to propagate of H5N8 re-assortments is very important in the context of the current outbreak because this is the first time that the H5N8 subtype has persisted over such a long period of time. This has allowed the virus to spread globally through bird migration spreading the pathogenic Chinese H5 variant.

Thank you for your consideration.

Editor 1's response to the revisions.

From Editor 1:  

Andrew incorrectly interprets the rejection based on the idea that the sequences are in the public domain and therefore not interesting to reevaluate. This is not the case. I have many papers based on such assembled data sets. The point the reviewers have made though, is that the manuscript is not clear on what advances are made based on these analyses.  
Reviewer:Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies –and if so, was this due to using different datasets or different analyses from those studies? 
 Response: I am confused as to why using data from a public dataset cannot be original. If this were true no papers based on the human genome project would be valid after the initial publication. Andrew is missing the point. It is not that he used publicly available data, it is that he has not articulated what new insights he has gained from combining and re-analyzing these data.
 The reviewers also had issues with some of the methods. Andrew excuses these because they are ‘comparable in how they were produced’ in previous studies (a poor criterion for method choice!). He excuses the lack of bootstrap values because he used the FastTree approach, which is a poor approach to phylogeny estimation and an even poorer excuse to not get bootstrap values. To provide a tree without bootstrap values or posterior probabilities is simply poor science. It is providing a point estimate with not confidence interval or variance estimate. There are fast bootstrap approaches, even in a maximum likelihood framework and Bayesian approaches that should have been explored.  
Clearly the reviewers were confused because of the writing of the paper. Andrew tries to clarify a few things in the response letter and edits to the manuscript, but this only confirms the fact that the reviewers were confused by the previous version. This revised draft is still rushed and only addresses a few of the concerns (not many of the methodological concerns). So I stand behind the rejection decision. That said, I would be happy to have Andrew resubmit a new version that addresses the concerns in a more robust way. But if he is unwilling to put the time in to do a proper phylogenetic analysis and articulate in the manuscript what and how he is doing such analyses, then I don’t want to waste reviewers’ time with it again. 

Only one reviewer was confused his poor PhD student. The first author states that he understood it clearly and even his PhD student says that it is written well! I chose the methods not to compare to other studies as Editor 1 state but for internal consistency. That is most definitely good science and a fair test. It is actually fundamental to experimental design. As a statistician I am well aware of point estimates and confidence intervals, but I am also aware of the limits of bootstraps and bootstrap sampling.

So I worded a very strong reply to the Editor in Chief and asked for an appeal. Suspecting that it would be passed to Editor 2 who I have had previous dealings before where I told the editor in chief I never wanted Editor 2 to touch any paper I submitted to the journal again, because he is a pedant. Foolish me I forgot to remind the Editor in Chief about this.

I greatly like the Editor in Chief. He has to deal I suspect with a large number of irate authors but I think he could have less if he had editors who took their job seriously and not personally.

My more considered response to the referees

Referee 1

While the sequences are in the public domain nobody has carried out the phylogenetic analysis of the H5 hemagglutinin and N8 neuraminidase sequences. Nobody would have been able to carry out the analysis previously as the data has only just become available, When-ever someone publishes a phylogenetic analysis it only ever contains a subset of the available data. This is a complete data set for all the sequence from NCBI. If the GISAID data was included then this would be all of the available public data. GISAID data was not included because it cannot be searched just based on hemagglutinin and neuraminidase numbers. I am confused as to why using data from a public dataset cannot be original. If this were true no papers based on the human genome project would be valid after the initial publication.  
I used Mega as I wanted to construct trees for the H5N8 sequences and the H5 and N8 trees that are comparable in how they were produced. I have used other methods to calculate the H5N8 trees and I have referenced my own work and others that have used BEAST and coalescent methods. The tree from Mega is the same as from those studies, which is good because this is the control data for the paper. The control shows you get a tree but that it does not show any of the structure of the 7 lineages you find from the H5 and N8 trees. There are no bootstraps on the trees because they are not calculated when you create a tree with FastTree as it is an approximate and not exact method. You use it for large numbers of sequences. A bootstrap tree would not be valid without at least N bootstraps where N is the number of sequences (and I think it actually requires N-squared to be properly correct and sample all the possible bootstraps) 
There is a H5 classification but this would add extra complication to the study that would distract from the message about recombination. The main classification has been mentioned as this is the Guangdong H5 which is now becoming the globally predominant form (as also propsed in the reference by Verhagen et al.)

Referee 2 
It has nothing to do with an epidemiological analysis of the recent Korean outbreak as I have already published in this journal on that subject! I have changed the introduction to make my aim clearer and remove any possible idea that this might be epidemiological in intent.
The two points I am supposedly addressing are both wrong and so I have stated explicitly the three possibilities the work actually covers. As I show in the paper the referee’s question 1 cannot be answered with the trees produced in past studies or my tree of the H5N8 sequences. As for question 2 regarding the Korean outbreak that is well established by the cited work of Kang and Jeong and which the referee kindly gives me the PMID ids for (they are already in the paper).  
Like the other referee the objection seems to be that you cannot be doing a novel sequence analysis unless you have a new sequence that is not already in the NCBI. I find this a ridiculous assertion. If we cannot ever use publicly available data to carry out research why do we make it public? I have produced a novel dataset for both hemagglutinin and neuraminidase genes for all H5 containing serotypes and N8 containing serotypes. Nobody has done this before because it makes no sense except in the context of this paper. Even if they had nobody has done it to include the sequences from the most recent outbreak because they have only just been included in the database and I know beyond any doubt that nobody has recently done the H5 and N8 analysis for all serotypes.  
Regarding the suggestions for original research population studies will be difficult if you fail to account for the re-assortment events that are a focus of the paper. I am certainly interested in clocks and variability between hosts and even locations. There are also interesting implications from the coalescent trees from my previous paper about population sizes but that is for another paper as that requires deep theory and a reading of Kimura and Sewall-Wright.  
They are right in saying that the tools are largely to be used out of the box and that parameter space was not explored, because that is not my aim. I am interested in biology and not methods, this is a biology and not a methods paper. To select the best substitution model both AIC and BIC were best for the GTR+I model, the difference are actually nearly negligible between models. Alignment is on the nucleotides and so was tree building. Doing anything else would make no sense as the distances and variants between sequences are very small. 
There was no check for intra-segment recombination but that is a very rare even although it is hypothesized in the 1918 pandemic strain (although I doubt that result with limited sampling). I have included the exact commands I used to calculate the trees. A condensed phylogenetic tree is when you collapse the nodes based on bootstrap values. Identical sequences cannot be distinguished and so have low bootstrap values and so it makes no sense to represent branches between them. That is why the figure legend says there is a cut-off of 60%. The results detail each of the re-arrangement events stating the likely re-assortment partners that produced each of the different events mostly in the USA. The details of the Korean outbreak are irrelevant to the objective of the paper unless they showed the presence of another re-assortment event, which they do not. 
It is not possible to quantitatively add to the results when they are ancestral serotypes that produce the re-assortment. I could assign probabilities but the data is too sparse.  
The figures are only for review and I have higher resolution vector files of the original trees that will be submitted with the final version. 
My harsher response to referee 2 is because he is neither an expert nor able to read. A condensed tree is the same as an ML with a cut-off of a certain bootstrap percentage, in this case 60%. This referee has clearly never used Mega and it seems highly unlikely that he has used consense either. His focus is only on AIC and BIC in modeltest which is a tiny step of minor relevance to anyone except the author of that method.

My angry replies to Editor 1's rejection

Dear Editor 1, 
 I am rather struggling to understand the referees comments especially those of Referee 2 who seems to have NO IDEA what the paper was about. I do not care at all about the epidemiology of the recent Korean flu outbreak and that is not in the title or abstract. 
Referee 1 is also wrong I used mega to construct a tree of just the H5N8 data which I know is not original the new data are the MAFFT trees containing 15000 HA sequences and 15000 NA sequences that NEITHER referee even bothered to look at. That is the result. I do not really care less about the H5N8 tree as it is wrong because it does not take into account reassortment. The point is to show Mega gives a nice tree that looks all good but is actually missing key elements. 
Put simply the paper SHOWS ABSOLUTELY that when you collect flu sequences from a supposed single serotype like H5N8 that you would suppose arises from evolution once it is actually a mix of different events that have created this serotype multiple times by combining H5 hemagglutinins with N8 neuraminidases from other different serotypes. This tells you that to construct trees you need to include intermediates that includes non-H5N8 sequences, otherwise you cannot reconstruct the tree properly as you are missing reassortments and ancestors. 
This is a VERY BIG DEAL as if you don’t do it your trees will be wrong (as ALL current H5N8 trees are). This is definitely novel and definitely never discussed before. I can certainly increase the level of detail for the method but any credible bioinformatician would be able to reproduce the trees as they are produced by the default methods of Katoh in MAFFT which does not allow most of the options Referee 2 suggest. To be honest all the nucleotide substitution models score pretty much the same in AIC and BIC but Mega and MAFFT do not use all the same models in the same way and so you need to use methods that are in common. MAFFT method has been cited more times than any other. 
I got the tool and got the results. I do not optimise the trees because that is not the point. I do not care about testing parameter space or algorithms. The point is that they form many clades spread widely over the tree and that these multiple clades are consistent across two different genes – This cannot happen by chance so the trees are CORRECT regardless of parameter space and settings.  
If the reviewer opened the tree files he might see that they contain 15000 sequences which I am sure are not sets of data produced by ANYONE EVER before. I would like to see them bootstrap as 1000 bootstraps would be statistically wrong to do you need to carry out at least N where N is the number of sequences. If they have access to the world’s largest super-computer then they may do the bootstrap. MAFFT does not allow it anyway. But as I said they have not bothered to even look at that data at all.  
Thanks Andy

There are a few problems with this response and temper got the better of me. There are not 15,000 sequences there are 4008 HA and 1840 NA sequences. So it is a big problem that is outside the scope of most programs to create a phylogenetic analysis with bootstraps (phyML won't do it for example), but it is possible, if rather pointless. To carry out a non-parametric bootstrap need nCr(8015, 4009) bootstraps - this is  very large number. Large enough not to be calculable in the entire history of the universe. However bootstraps converge to this ideal value quickly but to know it has converged you would need to do multiple bootstrap trees and check that the values are converging. As far as I know nobody does this and certainly not for trees with 4008 taxons.

Telling the story properly. The original rejection letter.

To put my irate posts in context you need to have the full story. Here is the original rejection letter and referees comments.

Thank you for your submission to Journal B. I am writing to inform you that your manuscript, "The multiple origins of the H5N8 avian influenza sub-type" (#2015:07:5858:0:0:REVIEW), has been rejected for publication.The comments supplied by the reviewers on this revision are pasted below. My comments are as follows:Editor's commentsYour paper has now been reviewed by two experts in the field. They both found the general topic of avian influenza of interest, but both also agree that your study seems to provide little advance with respect to previous research. The study adds no new data and apparently no new interpretation of results. Both reviewers found your method descriptions lacking in detail as well as the discussion of your results. As reviewer 2 points out, it would be impossible to reproduce your results given the lack of detail in the manuscript on the methods employed. Likewise, reviewer 1 identified discussions of methods (e.g., bootstrapping) for which there were no available results. Given these significant concerns with the manuscript and the lack of clear articulation of the novel contribution your analysis makes beyond other studies of avian influenza, I am forced to recommend rejection of your paper.Editor 1

Reviewer CommentsReviewer 1 Basic reportingNo CommentsExperimental designNo CommentsValidity of the findingsNo CommentsComments for the authorMajor strengths and weaknesses:This study is a successful attempt in using the datasets from website to analysis the multiple orgins of the H5N8 avian influenza subtype. The study has further phylogenetic analyses of all of the H5 hemagglutinin and N8 neuraminidase sequences to show that each of the H5N8 outbreaks has resulted from a different reassortment event and that there have been at least 7 distinct origins of the viral sub-type since it was first characterised in a turkey in Ireland in 1983.
However, I also feel the phylogenetic results and discussion contained mostly results from the analyses, but very little discussion. Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies –and if so, was this due to using different datasets or different analyses from those studies? From your MS, you used MEGA v6.06 to construct the ML tree, could you use other phylogenetic software to build ML tree or BI tree, to verify each other? I suggested that you could mark the clades in figure 1 in order to make the relationship more distinct. In addition, from figure 3 to figure 10, I couldn’t find the bootstrap values in your trees, so please add the values into tree figures., and you can also add some description in this section. The classification of H5, maybe you can refer to recent research (e.g. Donis et al, 2015).
Minor issues:- Line 36: change to “shown”- L100: should be “turkeys”Reviewer 2 Basic reportingThe article is written in English using clear and unambiguous text. I think the article introduction and background fit into the broader field of knowledge. I think the figures need some work regarding clarity. For instance, phylogenetic trees show clades that are too busy and could be collapsed for more clarity without losing information.Experimental designI am not sure whether this manuscript conforms to "original primary research". While the manuscript aims to provide an epidemiological analysis of highly pathogenic avian influenza (HPAI) from a recent outbreak in Korea, it does not add much in comparison to other published articles.The two main objectives of this manuscript are 1) to test whether there's phylogenetic evidence for reassortment, and 2) whether there are multiple origins to the Korean outbreak. A cursory review of literature shows that both aspects have already been studied. Some examples are PMID: 24856098; PMID: 25625281; PMID: 25192767. Also, while I am not opposed to manuscripts that use data from GenBank only (that's why it's there to begin with), I think that in this case this has a negative impact since the data are not novel and the analysis provide little new information.In my opinion, the author should re organize the article to focus on something more original. Some examples come to mind: using molecular clocks to uncover temporal patterns in this outbreak, looking at past population dynamics to infer whether this serotype is becoming more or less diverse over time, or perform a phytogeographic analysis to provide insights into its origin and spread.Validity of the findingsMethods
I think this manuscript uses a very rich dataset but the analysis seem precipitous, exploratory and "out-of-the-box". My overall impression is that there is little exploration of the parameter space.
In Methods, line 52, what was the criterion by which the nucleotide substitution model was selected? Was it Akaike Information Criterion, Bayesian Information Criterion or else?
I think the methods section needs more detail. Statements such as “The sequences were aligned using MAFFT” are insufficient to allow full reproducibility. What parameter values were used and why? What version is the program? Did you fit a codon model to the coding sequences? Did you check for intra-gene recombination? Also, how many sequences were used? It says “all” but how many? Was the alignment done on the coding sequences or in the nucleotides? What parameter values were used? Finally, the phylogenetic reconstruction step is not clear as to how it was performed. Are the trees rooted? Were outgroups used? What is a “condensed phylogenetic tree”? Is it some type of consensus? If so, is it a majority rule consensus?
Results and Discussion
The description of the results is very qualitative, e.g., “the current outbreak shows some structure”. It would be more objective to quantify the degree of structure by using distance metrics or population genetic estimators of structure.Comments for the author
no comments
An important point is that Referee 2 is the PhD student of Editor 1

Saturday, 3 October 2015

To be fair I should report my incorrect and somewhat angry response to the rejection of the appeal

Really I give in.

I will do fasttree using seqboot for bootsrap but this is from the manual

00 fast-global bootstraps took just 20 hours, and the resulting support values were strongly correlated with the traditional bootstrap (r=0.975).”

So it is NOT comparable to the bootstrap it just correlates to the bootstrap. So again the “expert” is actually unaware of how it works.

Both trees are created with ML methods using a GTR+G evolutionary model – using different implementations sure. I can do both in fasttree but it would show what exactly?

The figures are illegible – you could try zooming as they are vector images but that asks too much of a someone with a PhD

There is one reference missing to my own work.

As they say “ whatever”.

This is going to give me a two for as I will resubmit it next week with everything done and a new title and excluding both of those editors.



On this I am definitely wrong. You can bootstrap normally using seqboot and consense. This correlation is only true if you specify an initial tree.

However, there are over 1000 variable sites. For the ideal (non-parametric) bootstrap estimate you need nCr (2n-1,n) permutations = 10 x E 600 calculations (Efron and Tibshirani). Sampling even 1000 bootstraps is insignificant. For 100 sites it is 5 x E 59. It is likely that bootstrap estimates are far from convergent, but that would need testing for each tree by increasing the bootstrap number until the percentages remain fixed.