Thursday 31 December 2015

Does peer review work?

As an outsider to the viral phylogenetics community who just does some work for people in it, I get a pretty hard time from referees. If you are not part of the in-crowd they always find a reason why you have done something wrong. Some of my favourite examples: the cladograms are wrong (then use the data I deposited to draw a phylogram if you want, which shows the same thing in more confusing detail). One referee insists you have to carry out a ModelTest before building the tree (that would be the editor who wrote ModelTest, and has managed to get it to 14,000 citations), whereas the next referee says ModelTest is a worthless circular argument and pointless to carry out. The funniest was probably that the figures are not clear to read; well, they are vector-graphic PDFs, not bitmaps like all the ones you publish yourself, so you could zoom them if you were not being obstructive and obfuscating.

The list of reasons goes on and on. The real question for an editor or referee considering rejection is whether the science is right, not whether it could be presented better. Wrong figures or writing are grounds for corrections, not for rejection. I have edited 120 papers, and my default as an editor is to believe that the scientist carried out the work with honesty and integrity, and that if I were to do the same analysis I would get the same results. I believe in 100% transparency. That means: no anonymous referees, possibly post-publication peer review (the F1000 model), and publishing all referees' comments even for rejected papers.

All of this would be amusing rather than frustrating, except that I then see someone publish the same idea a few months later and I lose any credit for it. For these reasons everything I ever do goes straight into the PeerJ repository, so that I can always say: that is a nice peer-reviewed publication you have there, but it is a shame I put mine in the repository before you even submitted yours, and isn't it strange that you were probably the anonymous reviewer who made sure my paper was rejected while you got your very similar one published. In my other interest, as a lawyer, they have a name for this practice. They call it fraud.

The politics of H5N8

I used to think that protein crystallography was a field full of back-stabbing political bastards, but they have nothing on the depths of spite and pathetic stupidity present in the viral phylogenetics community. As Sayre's law states, academic politics is so bitter because the stakes are so low, and no stakes are lower than those for H5N8.

H5N8 is a subtype of influenza that nobody cared about or studied until a big outbreak in 2014. This outbreak was important for two reasons.


  1. It showed that all the papers claiming wild birds cannot spread highly pathogenic flu virus were wrong (sorry, Gaidet et al.)
  2. It allowed the Guangdong Goose H5 hemagglutinin to spread to North America.
That is it. No more interest. It actually reassorted to H5N2 pretty quickly in the US, as H5N8 does not occur often: it is not a preferred packaging of the virus. H5N1 is much more frequent, and H5N2 also seems to be a better alternative.


Saturday 19 December 2015

H5N8 in Taiwan - Poor methods and not the best peer review.

I was critical of the sharing of data from the Taiwanese outbreak but there are a few more problems I have with the paper that reports the analysis of the data. So the paper says in the methods section that:

Phylogenetic analysis, as described previously (Lee et al., 2014a and Lee et al., 2014b), was performed using these full genome sequences and closely related sequences from GenBank, GISAID and the publicly available government website (http://ai.gov.tw/index.php?id=720), which gave the sequences of the 16 H5 viruses isolated by the Council of Agriculture (COA), Taiwan during the recent outbreaks.
Now let's look at those two papers by Lee from 2014 with the methods in them. The first one is a letter and so does not even have a methods section; the methods are only mentioned in the figure legend.
Phylogenetic tree of hemagglutinin (HA) genes of influenza A(H5N8) viruses, South Korea, 2014. Triangles indicate viruses characterized in this study. Other viruses detected in South Korea are indicated in boldface. Subtypes are indicated in parentheses. A total of 72 HA gene sequences were ≥1,600 nt. Multiple sequence alignment was performed by using ClustalW (www.ebi.ac.kr/Tolls/clustalw2). The tree was constructed by using the neighbor-joining method with the Kimura 2-parameter model and MEGA version 5.2 (www.megasoftware.net/) with 1,000 bootstrap replicates. H5, hemagglutinin 5; Gs/Gd, Goose/Guangdong; LPAI, low pathogenic avian influenza; HPAI, highly pathogenic avian influenza. Scale bar indicates nucleotide substitutions per site. 
This uses NJ-tree construction in Mega - and Mega 6.06 was already available.

The second paper does have a methods section which says:
Molecular clock analysis. For the HA and NA genes, the genetic distance from the common ancestral node of the lineage to each viral isolate was measured from the ML tree and plotted against the sample collection dates. Linear regression was used to indicate the rate of accumulation of mutations over time. A more detailed evolutionary time scale for each virus gene phylogeny, with confidence limits, was obtained using relaxed molecular clocks under uncorrelated lognormal (UCLD) and exponential (UCED) rate distributions, implemented in a Bayesian Markov chain Monte Carlo (BMCMC) statistical framework (27), using BEAST, version 1.8 (28). The SRD06 nucleotide substitution model (29) and Bayesian Skyride demographic model (30) were used. Multiple runs were performed for each data set, giving a total of 6 × 10^7 states (with 1 × 10^7 states discarded as burn-in) that were summarized to compute statistical estimates of the parameters. Convergence of the BMCMC analysis was assessed in Tracer, version 1.6 (A. Rambaut, M. Suchard, and A. J. Drummond, 2013 [http://tree.bio.ed.ac.uk/software/tracer/]).
So this analysis was carried out with BEAST in a Bayesian framework. So which of these totally different methods was used in the current paper? It has to be the BEAST analysis, because of the way the trees appear. But this also raises questions, as they talk about the Bayesian Skyride model. I think they mean the Bayesian Skyline model and are confused by this paper. Anyway, you should not be using the Skyline unless you are interested in hypotheses about viral demographics and phylodynamics. What do they mean by "multiple runs"? This shows a naivety in using BEAST. So while they might get the right results, they could have got them faster and more easily using a simpler coalescent model.

What is of larger concern is that in both the Taiwanese outbreak paper and the second Lee paper the referees did not notice the error of pointing to two contradictory methodologies, or the incorrect use of the Bayesian Skyline. So much for the peer review process improving science.

Friday 18 December 2015

Lessons for data collection from crystallography for viral data collection

I remember what a huge impact Denzo had on the processing of experimental data. Denzo allows you to model errors in the collection, such as slippage and X-ray damage. When we take samples of sequences from a population there are experimental errors that we need to consider. The sorts of questions that we need to ask are:


  1. Is there a difference between the population of the virus in the host and that in the amplified sample?
  2. If the virus is a mixture of subtypes do the experimental methods favour one subtype over another?
  3. Does the sequencing technique favour AT or GC rich regions?
  4. Can we distinguish between sequencing error and point mutations?
Recent work on improving the collection and monitoring of wild-bird avian influenza has shown that birds can be infected with multiple subtypes. In these cases how do we know which segments match with which other segments? How do they mix and produce mature virus?
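On question 4, here is a minimal sketch of the kind of crude check I have in mind (Python with numpy; the read counts and the 1% error rate are made-up numbers, and this is a threshold illustration, not a real variant caller):

    import numpy as np

    # Toy read pileup at a single site: counts of A, C, G, T (hypothetical numbers)
    counts = np.array([9600, 350, 40, 10])
    error_rate = 0.01  # assumed per-base sequencing error rate

    freqs = counts / counts.sum()
    # A minor variant well above the expected error rate is more plausibly a real
    # point mutation than a sequencing artefact; variants near the error rate are not
    candidate = (freqs > 3 * error_rate) & (freqs < 0.5)
    print(freqs.round(4))   # [0.96  0.035 0.004 0.001]
    print(candidate)        # [False  True False False]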

Wednesday 16 December 2015

A new low in data sharing: The Taiwanese Outbreak of H5N8

I am interested in the spread of H5N8 avian influenza, and after seeing the spread of cases in Taiwan on the OIE website I have been eagerly waiting for sequence data to emerge so I could carry out some detailed evolutionary analysis. I have waited and waited and waited, and today my RSS feed from Google Scholar told me a paper is available showing the relationship to the Korean outbreak. The authors had been hoarding the data in order to publish it rather than making it more widely available and being pipped to the publication, as we live by the law of publish or perish.

Now it is published, the data must be available for everyone, as it no longer matters for priority of publication. I checked the sources: GenBank, GISAID (the world's first password-access-only resource for sharing publicly funded biological sequence data) and a Taiwanese government database - http://ai.gov.tw/index.php?id=720  It is so easy to find a database that is only in Chinese script with no English translation. That is the perfect way to share data.

Anyway, we should all know more languages, and they have a good reason to publish in their native language; Google can translate it anyway. So there is a page with the sequences, and here is the link: http://ai.gov.tw/index.php?id=704 There is just one slight catch. These are image files. In order to extract the sequences you are going to have to retype them all from the images - 1,600 characters for each entry! That is not data sharing. That is not a public database. That is not good practice or good science. That is obstruction, pure and simple.

Japan made its data available almost immediately, as well as issuing local warnings to farmers about the risk of wild birds spreading the new H5N8 variant. The result was a relatively small number of cases in domestic birds. The lack of transparency from the Taiwanese laboratories, however, contributed to the deaths of 3.2 million birds, including nearly 60% of the domestic goose population. This was an avoidable disaster that has cost farmers millions, and its scale could have been reduced by better sharing of information.

Sunday 13 December 2015

Reproducibility in molecular dynamics

I once asked David Osguthorpe for some advice when I was a PhD student. I was using Discover to carry out molecular dynamics simulations on a small peptide. He told me that it was unreliable in the version that Biosym were selling and that you got different answers every time. He also told me that my loop simulations were wrong and that I must have forgotten to cap the peptide ends to get the peptide to fold into any loop form.

A few ideas and lessons came from this:


  1. I ignored my peptide simulation results and so they were never published. At the time it was the largest and longest peptide simulation ANYONE had done (this was 1994).
  2. The structure of the peptide bound to an enzyme became available and it was in the conformation that I had observed in the calculations! This structure was later retracted as it was based on a poor enzyme crystal structure (I corrected the structure in the PDB).
  3. I now know to run some simulations with the random seed fixed, to check the reproducibility of the simulations. This separates computer/coding variation from simulated variation (see the sketch after this list).
  4. I have shown that despite the variability the simulations are following some sort of physical reality, in that they obey the Arrhenius equation. This is for an ensemble, an average over multiple simulations.
  5. Now I run all simulations a number of times, and it worries me how irreproducible they are when the seeds are not fixed. This seriously undermines the reproducibility of the field and supports my proposed second doctoral supervisor's reasons for not taking that position on. His comment was that it is garbage in and garbage out.
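To make point 3 concrete, here is a minimal sketch of the seed-fixing check (a toy one-dimensional Langevin walk in Python with numpy, not a real MD engine; the parameters are arbitrary):

    import numpy as np

    def langevin_walk(n_steps, seed=None):
        # Toy overdamped Langevin dynamics in a harmonic well: just enough
        # stochastic integration to illustrate seed-controlled reproducibility
        rng = np.random.default_rng(seed)
        x, dt = 0.0, 0.01
        traj = np.empty(n_steps)
        for i in range(n_steps):
            x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal()
            traj[i] = x
        return traj

    # Fixed seed: bit-for-bit identical trajectories (any difference is coding/platform variation)
    assert np.array_equal(langevin_walk(1000, seed=42), langevin_walk(1000, seed=42))

    # Free seeds: individual trajectories differ; only the ensemble statistics are reproducible
    ensemble = np.array([langevin_walk(1000, seed=s) for s in range(100)])
    print(ensemble[:, -1].mean(), ensemble[:, -1].std())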


Wednesday 9 December 2015

More thinking about the bootstrap - systematic bias.

The reasoning for sampling the columns and not the rows of the data matrix, as you would in most bootstrap settings, is that because they produce a tree the rows are not considered independent, but the columns are.

I am not sure I quite follow this logic, because a split between two groups of sequences might be based on a pattern of changes rather than a single column, and so the tree topology depends on multiple columns that are definitely not independent. So doesn't the tree structure mean that both the rows AND the columns are not independent?

Anyway, more serious than this is that the bootstrap is resampling that tries to capture the real population, and so it assumes that the sequences forming the alignment you are using to generate the tree are a random sample. This is sadly almost NEVER true. We sample the low-hanging fruit. We sample organisms that we like, that we can culture, that are model systems; we do not sample randomly and without bias. The collection of sequences and genomes introduces a systematic error into everything we do.

The bootstrap cannot deal with this. If there are aspects of the population that are not sampled at all, then no number of bootstraps can model them. This is a Black Swan problem. Your bootstrap values will only tell you that this data, with this model, using this method, has strong support for being reproducible. It says nothing about it being a biological truth.
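A two-line demonstration of the point (Python with numpy; the labels are obviously hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = np.array(["cultured_model_organism"] * 200)  # a biased sample of the real population
    boot = rng.choice(sample, size=(1000, 200), replace=True)  # 1000 bootstrap resamples
    # No amount of resampling ever produces the taxon that was never collected
    print(np.any(boot == "unculturable_taxon"))  # False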

Friday 4 December 2015

Reassortment in influenza

Why is there any sort of argument over the amount of reassortment in Influenza? Put simply for each subtype there are many lineages and each lineage can have many subtypes as shown in this image.



Saturday 28 November 2015

Thinking about the bootstrap - is it doing what we think it is?

The bootstrap comes in two forms: non-parametric and parametric. Non-parametric is the easiest to understand, and it is closely related to cross-validation and jack-knifing. Bootstrap methods were pioneered by Brad Efron and collaborators, and the book An Introduction to the Bootstrap by Efron and Tibshirani is a very readable account.

Non-parametric bootstrapping should be used in cases where the data cannot be summarised by a well-defined parameter that is normally distributed. This applies to phylogenetic analysis, where the actual effects of the sequences and the variable sites behave in a very non-linear manner. Felsenstein applied the bootstrap to phylogenetic analysis in 1985.

The general process for the bootstrap is that if you have a set of data, you resample that data with replacement to generate a new set of data with the same properties as the original dataset. For the jackknife, the resampling is instead the exclusion of a subset of data from the complete set (in crystallography this is analogous to using the free R-factor and not the R-factor for refinement). If the original dataset had 200 elements, the resampled dataset has 200 elements. This is the first resampling. In the bootstrap, you resample many times. As the number of bootstrap resamples goes to infinity you will eventually generate every possible resampling of the original data, but the calculated statistics of the resampling usually converge with far smaller numbers of bootstraps. You should test for convergence (nobody ever does).
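A minimal sketch of the resampling and of the convergence check that nobody does (Python with numpy; the data and the statistic are arbitrary stand-ins):

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.exponential(size=200)  # the original 200-element dataset

    def bootstrap_means(data, n_boot):
        # Each resample draws len(data) elements WITH replacement from the original
        idx = rng.integers(0, len(data), size=(n_boot, len(data)))
        return data[idx].mean(axis=1)

    # Crude convergence test: the bootstrap standard error should stabilise
    for n_boot in (100, 1000, 10000):
        print(n_boot, bootstrap_means(data, n_boot).std())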

For phylogenetics, the aligned sets of sequences produce a matrix where the rows are the sequences and the columns are the aligned positions. The columns of this matrix are then resampled. In most of Efron's examples it is the rows that are resampled, and this could be done in phylogenetics too, with some added complexity in renaming the sequences that are duplicated. Resampling the columns might have some issues regarding the basic assumptions of the bootstrap.

These fundamental assumptions are that the sites are independent and that they are identically distributed. Each of these assumptions might hold, with slight exceptions, in the case of sequence alignments, but the two together most likely do not.


  1. Some sites will not be independent, and we know that there are correlated mutations, but these are perhaps a small enough number that they do not affect the results of the bootstrap.
  2. Each of the positions should have the same probability of being A, C, T or G, but because of the different rates of change at different codon positions they will definitely not have the same number of changes at all sites, and over long periods saturation becomes an issue. There is some correlation, but Efron and Tibshirani show how to deal with correlation by resampling each set of correlated variables as a single unit.
  3. The differences in mutation rate at different codon positions mean that the assumptions of independence and identical distribution are most likely not true. We should bootstrap using codons and not single sites. (This is actually possible using seqboot, where you can set the length of sub-sequence to resample; see the sketch after this list.)
  4. There is a direct contradiction of models if you bootstrap by resampling single sites while fitting a model where rate variation across codon positions is represented by a gamma distribution, because the resampling loses the correlation between sites implicit in the gamma-distribution model. This must have a negative effect on the bootstrap tree.
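Here is a sketch of the difference between single-site and codon (block) column resampling on a toy alignment matrix (Python with numpy; rows are taxa, columns are sites, and the alignment is random filler; the block version is essentially what the block-size option in seqboot does):

    import numpy as np

    rng = np.random.default_rng(0)
    aln = rng.choice(list("ACGT"), size=(10, 300))  # toy alignment: 10 taxa x 300 sites

    def bootstrap_sites(aln):
        # Standard phylogenetic bootstrap: resample single columns with replacement
        cols = rng.integers(0, aln.shape[1], size=aln.shape[1])
        return aln[:, cols]

    def bootstrap_codons(aln, k=3):
        # Block bootstrap: resample k-column blocks so within-codon correlation survives
        n_blocks = aln.shape[1] // k
        blocks = rng.integers(0, n_blocks, size=n_blocks)
        idx = (blocks[:, None] * k + np.arange(k)).ravel()
        return aln[:, idx]

    print(bootstrap_sites(aln).shape, bootstrap_codons(aln).shape)  # both (10, 300)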

More recently, methods based on likelihoods and local bootstrap methods have been developed to deal with large trees, where the calculation of the non-parametric bootstrap becomes prohibitive. All of these methods depend on assuming that the alignment contains enough information to define a simple parameter that can then be bootstrapped; they are all examples of a parametric bootstrap. They have been reported to give results that are highly correlated with the non-parametric bootstrap, with correlations over 95%, but this might conceal very large local fluctuations.

Sunday 15 November 2015

Good grief: Dark Matter and Dinosaurs it is all physics-babble

I have just figured out where all that invisible dark matter we need to make the universe work is. It is stuck between theoretical physicists' ears when they write pseudo-science books based on no evidence whatsoever. Everyone complains about the quackery of "alternative medicine", but theoretical physics takes junk science to a new level. The lips are moving but the brain isn't working. Dark Matter and Dinosaurs, the Cosmic Anthropological Principle, Many Worlds ...

https://uk.finance.yahoo.com/news/leading-harvard-physicist-radical-theory-160000092.html


Update 1/2/2016

Seems that John Gribbin got the same message from the book. We have definitely gone from Dark Matter to doesn't matter.

https://johngribbinscience.wordpress.com/2016/02/01/doomed/

Sunday 18 October 2015

Taking on phylogenetics: I'm afraid what you are suggesting is rather stupid

After my rants and tirades against editors and referees I decided that I would follow Ewan Birney and make my blog a bit more systematic. So this will be my lab notebook, so that I have proof of what I was thinking and when.

So after reading a few textbooks my general opinion of some of the leading lights of phylogenetics is that they need to read a few more books. There is a whole lot of bad science out there and a complete failure to appreciate how evolution works.

Editor 1 is my favourite editor, as he has already cost me a PLoS One paper and nearly another one in PeerJ. Anyway, he suggested that FastTree is an inadequate method and that I needed bootstraps. If he knew anything about FastTree he would know that for >100 sequences its authors suggest not using bootstraps but an estimator based on the likelihoods instead, which has a >90% correlation with the bootstrap values. The alternative program, PhyML, says exactly the same: for trees of >100 sequences, compute the branch support using a Bayesian method.

You could create a bootstrap version, either by running seqboot to create a bootstrap dataset for FastTree or by setting the bootstrap option in PhyML, but why would you? You would have to be stupid to do this when you get practically the same result from the internal estimator in minutes, while the bootstrap calculations will take days. For example, my H5 tree with >4000 sequences cannot be run on PhyML. Removing the identical sequences reduces this to fewer than 3000 sequences, BUT this will take 176 DAYS on the PhyML server. Running FastTree takes about half an hour.

The alignment file "H5_HA_4-10-15-2811_DNA_muscle.phylip" was uploaded, it contains 2811 taxons of 1926 characters.

The estimated computation time is : 176 day(s) 3 hour(s) 25 minute(s)
See FAQ for details.
Thank you for your request.

It is therefore very clear that Editor 1 is stupid, unaware of current techniques and biased towards outdated methods inappropriate for the type of analysis I wish to carry out.
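For the record, the two routes look something like this (a sketch in Python; it assumes FastTree and PhyML are installed and on the PATH, and the file names are hypothetical):

    import subprocess

    # Minutes: approximate ML tree (GTR + gamma) with the internal local support values
    with open("H5_HA.fasta") as aln, open("H5_HA.nwk", "w") as tree:
        subprocess.run(["FastTree", "-nt", "-gtr", "-gamma"], stdin=aln, stdout=tree, check=True)

    # Days: the same data with a traditional 100-replicate bootstrap in PhyML
    # subprocess.run(["phyml", "--input", "H5_HA.phy", "--datatype", "nt", "--bootstrap", "100"], check=True)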

Sunday 4 October 2015

Statistics Don't Tell the Whole Story


Match stats (England v Australia, provided by Opta)

                          England     Australia
Possession                49%         51%
Territory                 47%         53%
Scrums won (lost)         5 (1)       8 (0)
Line-outs won (lost)      7 (1)       11 (2)
Pens conceded             9           5
Rucks won (lost)          77 (7)      81 (3)
Kicks from hand           25          27
Tackles made (missed)     101 (15)    116 (18)
Metres made               391         254
Offloads                  10          9
Line breaks               7           5

Last piece of the story - my appeal letter

Dear Editor, 
I would like to submit the article “The multiple origins of the H5N8 avian influenza sub-type”. This paper describes the multiple reassortment events that have produced the H5N8 avian influenza sub-type. The current phylogenetic tree for the H5N8 sequences is calculated and this does not exhibit any unusual features. There seem to be clades in the Far East and North America.
What this phylogenetic tree does not show is that the apparently homogeneous North American clade is actually made up of multiple distinct re-assortment events. These are usually just single isolated sequences, and this suggests that while the sub-type is present, other H5- and N8-containing sub-types dominate, and this is a rare reassortment that rapidly becomes extinct. 
By carrying out a phylogenetic analysis of all the H5 segments from all H5-containing subtypes, it is clear that the H5 is evolving in different subtypes between the reassortment events that create H5N8, and the same is true for the N8 segments. This is the novel aspect of the research, and it shows that there are many re-assortment events that need to be accounted for but which are undetectable in trees of the H5N8 sequences alone. By not taking these re-assortments and the subsequent sequence evolution in other H5- or N8-containing sub-types into account, we cannot correctly calculate the H5N8 phylogeny. Both the H5N8 phylogenies have been calculated using the same method (maximum likelihood) and the same evolutionary model (GTR with gamma correction). However, the phylogenies for H5 and N8 were computed using FastTree rather than in Mega or PhyML, because there are over 4000 H5 taxa and so they cannot be computed with bootstraps. Despite the lack of bootstrap values, both the H5 and N8 phylogenies exhibit the same patterns of isolated re-assortment events, showing that the trees are a good representation of the true phylogenetic relationship.
The usual failure of H5N8 re-assortments to propagate is very important in the context of the current outbreak, because this is the first time that the H5N8 subtype has persisted over such a long period. This has allowed the virus to spread globally through bird migration, spreading the pathogenic Chinese H5 variant.

Thank you for your consideration.

Editor 1's response to the revisions.

From Editor 1:  

Andrew incorrectly interprets the rejection based on the idea that the sequences are in the public domain and therefore not interesting to reevaluate. This is not the case. I have many papers based on such assembled data sets. The point the reviewers have made though, is that the manuscript is not clear on what advances are made based on these analyses.  
Reviewer: Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies - and if so, was this due to using different datasets or different analyses from those studies? 
 Response: I am confused as to why using data from a public dataset cannot be original. If this were true no papers based on the human genome project would be valid after the initial publication. Andrew is missing the point. It is not that he used publicly available data, it is that he has not articulated what new insights he has gained from combining and re-analyzing these data.
 The reviewers also had issues with some of the methods. Andrew excuses these because they are ‘comparable in how they were produced’ in previous studies (a poor criterion for method choice!). He excuses the lack of bootstrap values because he used the FastTree approach, which is a poor approach to phylogeny estimation and an even poorer excuse to not get bootstrap values. To provide a tree without bootstrap values or posterior probabilities is simply poor science. It is providing a point estimate with no confidence interval or variance estimate. There are fast bootstrap approaches, even in a maximum likelihood framework, and Bayesian approaches that should have been explored.  
Clearly the reviewers were confused because of the writing of the paper. Andrew tries to clarify a few things in the response letter and edits to the manuscript, but this only confirms the fact that the reviewers were confused by the previous version. This revised draft is still rushed and only addresses a few of the concerns (not many of the methodological concerns). So I stand behind the rejection decision. That said, I would be happy to have Andrew resubmit a new version that addresses the concerns in a more robust way. But if he is unwilling to put the time in to do a proper phylogenetic analysis and articulate in the manuscript what and how he is doing such analyses, then I don’t want to waste reviewers’ time with it again. 



Only one reviewer was confused: his own PhD student. The first referee states that he understood it clearly, and even the PhD student says that it is well written! I chose the methods not to compare to other studies, as Editor 1 states, but for internal consistency. That is most definitely good science and a fair test; it is actually fundamental to experimental design. As a statistician I am well aware of point estimates and confidence intervals, but I am also aware of the limits of bootstraps and bootstrap sampling.

So I worded a very strong reply to the Editor in Chief and asked for an appeal, suspecting that it would be passed to Editor 2, with whom I have had previous dealings: I had told the Editor in Chief that I never wanted Editor 2 to touch any paper I submitted to the journal again, because he is a pedant. Foolish me, I forgot to remind the Editor in Chief about this.

I greatly like the Editor in Chief. He has to deal, I suspect, with a large number of irate authors, but I think he could have fewer if he had editors who took their job seriously and not personally.

My more considered response to the referees

Referee 1

While the sequences are in the public domain, nobody has carried out the phylogenetic analysis of the H5 hemagglutinin and N8 neuraminidase sequences. Nobody could have carried out the analysis previously, as the data has only just become available. Whenever someone publishes a phylogenetic analysis it only ever contains a subset of the available data. This is a complete data set of all the sequences from NCBI. If the GISAID data were included then this would be all of the available public data; GISAID data was not included because it cannot be searched just on hemagglutinin and neuraminidase numbers. I am confused as to why using data from a public dataset cannot be original. If this were true, no papers based on the human genome project would be valid after the initial publication.  
I used Mega because I wanted to construct trees for the H5N8 sequences and the H5 and N8 trees that are comparable in how they were produced. I have used other methods to calculate the H5N8 trees, and I have referenced my own work and that of others using BEAST and coalescent methods. The tree from Mega is the same as in those studies, which is good, because this is the control data for the paper. The control shows you get a tree, but that it does not show any of the structure of the 7 lineages you find from the H5 and N8 trees. There are no bootstraps on the trees because they are not calculated when you create a tree with FastTree, as it is an approximate and not an exact method; you use it for large numbers of sequences. A bootstrap tree would not be valid without at least N bootstraps, where N is the number of sequences (and I think it actually requires N-squared to be properly correct and sample all the possible bootstraps). 
There is an H5 classification, but this would add extra complication to the study and distract from the message about recombination. The main classification has been mentioned, as this is the Guangdong H5 which is now becoming the globally predominant form (as also proposed in the reference by Verhagen et al.).

Referee 2 
It has nothing to do with an epidemiological analysis of the recent Korean outbreak, as I have already published in this journal on that subject! I have changed the introduction to make my aim clearer and to remove any possible idea that this might be epidemiological in intent.
The two points I am supposedly addressing are both wrong, and so I have stated explicitly the three possibilities the work actually covers. As I show in the paper, the referee's question 1 cannot be answered with the trees produced in past studies or with my tree of the H5N8 sequences. As for question 2, regarding the Korean outbreak, that is well established by the cited work of Kang and Jeong, for which the referee kindly gives me the PMIDs (they are already in the paper).  
Like the other referee, the objection seems to be that you cannot be doing a novel sequence analysis unless you have a new sequence that is not already in the NCBI. I find this a ridiculous assertion. If we cannot ever use publicly available data to carry out research, why do we make it public? I have produced a novel dataset of both hemagglutinin and neuraminidase genes for all H5-containing serotypes and all N8-containing serotypes. Nobody has done this before, because it makes no sense except in the context of this paper. Even if they had, nobody has done it including the sequences from the most recent outbreak, because these have only just been added to the database, and I know beyond any doubt that nobody has recently done the H5 and N8 analysis for all serotypes.  
Regarding the suggestions for original research: population studies will be difficult if you fail to account for the re-assortment events that are the focus of this paper. I am certainly interested in clocks and in variability between hosts and even locations. There are also interesting implications from the coalescent trees in my previous paper about population sizes, but that is for another paper, as it requires deep theory and a reading of Kimura and Sewall Wright.  
They are right in saying that the tools are largely used out of the box and that parameter space was not explored, because that is not my aim. I am interested in biology and not methods; this is a biology paper, not a methods paper. For substitution model selection, both AIC and BIC favoured the GTR+I model, and the differences between models are actually nearly negligible. Alignment was on the nucleotides, and so was tree building. Doing anything else would make no sense, as the distances and variants between sequences are very small. 
There was no check for intra-segment recombination, but that is a very rare event, although it is hypothesized in the 1918 pandemic strain (I doubt that result, given the limited sampling). I have included the exact commands I used to calculate the trees. A condensed phylogenetic tree is one where you collapse the nodes based on bootstrap values. Identical sequences cannot be distinguished and so have low bootstrap values, so it makes no sense to represent branches between them; that is why the figure legend says there is a cut-off of 60%. The results detail each of the re-arrangement events, stating the likely re-assortment partners that produced each of the different events, mostly in the USA. The details of the Korean outbreak are irrelevant to the objective of the paper unless they showed the presence of another re-assortment event, which they do not. 
It is not possible to quantitatively add to the results when it is ancestral serotypes that produce the re-assortment. I could assign probabilities, but the data is too sparse.  
The figures are only for review and I have higher resolution vector files of the original trees that will be submitted with the final version. 
My harsher response to referee 2 is because he is neither an expert nor able to read. A condensed tree is the same as an ML tree with a cut-off at a certain bootstrap percentage, in this case 60%. This referee has clearly never used Mega, and it seems highly unlikely that he has used consense either. His focus is only on AIC and BIC in ModelTest, which is a tiny step of minor relevance to anyone except the author of that method.

My angry replies to Editor 1's rejection

Dear Editor 1, 
 I am rather struggling to understand the referees' comments, especially those of Referee 2, who seems to have NO IDEA what the paper was about. I do not care at all about the epidemiology of the recent Korean flu outbreak, and that is not in the title or abstract. 
Referee 1 is also wrong. I used Mega to construct a tree of just the H5N8 data, which I know is not original; the new data are the MAFFT trees containing 15000 HA sequences and 15000 NA sequences that NEITHER referee even bothered to look at. That is the result. I could not care less about the H5N8 tree, as it is wrong because it does not take into account reassortment. The point is to show that Mega gives a nice tree that looks all good but is actually missing key elements. 
Put simply, the paper SHOWS ABSOLUTELY that when you collect flu sequences from a supposed single serotype like H5N8, which you would suppose arose from evolution once, it is actually a mix of different events that have created this serotype multiple times by combining H5 hemagglutinins with N8 neuraminidases from other serotypes. This tells you that to construct trees you need to include intermediates, including non-H5N8 sequences; otherwise you cannot reconstruct the tree properly, as you are missing reassortments and ancestors. 
This is a VERY BIG DEAL, as if you don't do it your trees will be wrong (as ALL current H5N8 trees are). This is definitely novel and definitely never discussed before. I can certainly increase the level of detail for the methods, but any credible bioinformatician would be able to reproduce the trees, as they are produced by the default methods of Katoh in MAFFT, which does not allow most of the options Referee 2 suggests. To be honest, all the nucleotide substitution models score pretty much the same in AIC and BIC, but Mega and MAFFT do not use all the same models in the same way, and so you need to use methods that are in common. The MAFFT method has been cited more times than any other. 
I got the tool and got the results. I do not optimise the trees, because that is not the point. I do not care about testing parameter space or algorithms. The point is that the sequences form many clades spread widely over the tree and that these multiple clades are consistent across two different genes. This cannot happen by chance, so the trees are CORRECT regardless of parameter space and settings.  
If the reviewer had opened the tree files he might have seen that they contain 15000 sequences, which I am sure are not sets of data produced by ANYONE EVER before. I would like to see them bootstrap those: 1000 bootstraps would be statistically wrong; you need to carry out at least N, where N is the number of sequences. If they have access to the world's largest super-computer then they may do the bootstrap. MAFFT does not allow it anyway. But as I said, they have not bothered to even look at that data at all.  
Thanks Andy



There are a few problems with this response, and temper got the better of me. There are not 15,000 sequences; there are 4008 HA and 1840 NA sequences. So it is a big problem that is outside the scope of most programs to create a phylogenetic analysis with bootstraps (PhyML won't do it, for example), but it is possible, if rather pointless. To carry out the ideal non-parametric bootstrap you would need nCr(8015, 4008) resamples - a very large number, large enough not to be calculable in the entire history of the universe. Bootstraps do converge towards this ideal value quickly, but to know an analysis has converged you would need to do multiple bootstrap trees and check that the values are converging. As far as I know nobody does this, and certainly not for trees with 4008 taxa.

Telling the story properly. The original rejection letter.

To put my irate posts in context you need to have the full story. Here is the original rejection letter and referees comments.

Thank you for your submission to Journal B. I am writing to inform you that your manuscript, "The multiple origins of the H5N8 avian influenza sub-type" (#2015:07:5858:0:0:REVIEW), has been rejected for publication. The comments supplied by the reviewers on this revision are pasted below. My comments are as follows:

Editor's comments

Your paper has now been reviewed by two experts in the field. They both found the general topic of avian influenza of interest, but both also agree that your study seems to provide little advance with respect to previous research. The study adds no new data and apparently no new interpretation of results. Both reviewers found your method descriptions lacking in detail, as well as the discussion of your results. As reviewer 2 points out, it would be impossible to reproduce your results given the lack of detail in the manuscript on the methods employed. Likewise, reviewer 1 identified discussions of methods (e.g., bootstrapping) for which there were no available results. Given these significant concerns with the manuscript and the lack of clear articulation of the novel contribution your analysis makes beyond other studies of avian influenza, I am forced to recommend rejection of your paper.

Editor 1

Reviewer Comments

Reviewer 1

Basic reporting: No Comments
Experimental design: No Comments
Validity of the findings: No Comments

Comments for the author

Major strengths and weaknesses: This study is a successful attempt at using the datasets from websites to analyse the multiple origins of the H5N8 avian influenza subtype. The study has further phylogenetic analyses of all of the H5 hemagglutinin and N8 neuraminidase sequences to show that each of the H5N8 outbreaks has resulted from a different reassortment event and that there have been at least 7 distinct origins of the viral sub-type since it was first characterised in a turkey in Ireland in 1983.
However, I also feel the phylogenetic results and discussion contained mostly results from the analyses, but very little discussion. Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies - and if so, was this due to using different datasets or different analyses from those studies? From your MS, you used MEGA v6.06 to construct the ML tree; could you use other phylogenetic software to build an ML tree or BI tree, to verify each other? I suggest that you could mark the clades in figure 1 in order to make the relationship more distinct. In addition, from figure 3 to figure 10, I couldn't find the bootstrap values in your trees, so please add the values into the tree figures, and you can also add some description in this section. For the classification of H5, maybe you can refer to recent research (e.g. Donis et al, 2015).
Minor issues:
- Line 36: change to "shown"
- L100: should be "turkeys"

Reviewer 2

Basic reporting

The article is written in English using clear and unambiguous text. I think the article introduction and background fit into the broader field of knowledge. I think the figures need some work regarding clarity. For instance, phylogenetic trees show clades that are too busy and could be collapsed for more clarity without losing information.

Experimental design

I am not sure whether this manuscript conforms to "original primary research". While the manuscript aims to provide an epidemiological analysis of highly pathogenic avian influenza (HPAI) from a recent outbreak in Korea, it does not add much in comparison to other published articles. The two main objectives of this manuscript are 1) to test whether there's phylogenetic evidence for reassortment, and 2) whether there are multiple origins to the Korean outbreak. A cursory review of the literature shows that both aspects have already been studied. Some examples are PMID: 24856098; PMID: 25625281; PMID: 25192767. Also, while I am not opposed to manuscripts that use data from GenBank only (that's why it's there to begin with), I think that in this case this has a negative impact since the data are not novel and the analyses provide little new information. In my opinion, the author should reorganize the article to focus on something more original. Some examples come to mind: using molecular clocks to uncover temporal patterns in this outbreak, looking at past population dynamics to infer whether this serotype is becoming more or less diverse over time, or performing a phylogeographic analysis to provide insights into its origin and spread.

Validity of the findings

Methods
I think this manuscript uses a very rich dataset, but the analyses seem precipitous, exploratory and "out-of-the-box". My overall impression is that there is little exploration of the parameter space.
In Methods, line 52, what was the criterion by which the nucleotide substitution model was selected? Was it Akaike Information Criterion, Bayesian Information Criterion or else?
I think the methods section needs more detail. Statements such as “The sequences were aligned using MAFFT” are insufficient to allow full reproducibility. What parameter values were used and why? What version is the program? Did you fit a codon model to the coding sequences? Did you check for intra-gene recombination? Also, how many sequences were used? It says “all” but how many? Was the alignment done on the coding sequences or in the nucleotides? What parameter values were used? Finally, the phylogenetic reconstruction step is not clear as to how it was performed. Are the trees rooted? Were outgroups used? What is a “condensed phylogenetic tree”? Is it some type of consensus? If so, is it a majority rule consensus?
Results and Discussion
The description of the results is very qualitative, e.g., "the current outbreak shows some structure". It would be more objective to quantify the degree of structure by using distance metrics or population genetic estimators of structure.

Comments for the author
no comments
An important point is that Referee 2 is the PhD student of Editor 1.

Saturday 3 October 2015

To be fair I should report my incorrect and somewhat angry response to the rejection of the appeal

Really I give in.

I will do FastTree using seqboot for the bootstrap, but this is from the manual:

00 fast-global bootstraps took just 20 hours, and the resulting support values were strongly correlated with the traditional bootstrap (r=0.975).”

So it is NOT comparable to the bootstrap; it just correlates with the bootstrap. So again the "expert" is actually unaware of how it works.

Both trees are created with ML methods using a GTR+G evolutionary model - using different implementations, sure. I can do both in FastTree, but what exactly would that show?

The figures are illegible - you could try zooming, as they are vector images, but that asks too much of someone with a PhD.

There is one reference missing to my own work.

As they say, "whatever".

This is going to give me a two-fer, as I will resubmit it next week with everything done, a new title, and excluding both of those editors.

Thanks

Andy

On this I am definitely wrong. You can bootstrap normally using seqboot and consense. The quoted correlation is only true if you specify an initial tree.

However, there are over 1000 variable sites. For the ideal (non-parametric) bootstrap estimate you need nCr(2n-1, n) resamples ≈ 10^600 calculations (Efron and Tibshirani). Sampling even 1000 bootstraps is insignificant. For 100 sites it is about 5 × 10^58. It is likely that bootstrap estimates are far from convergent, but that would need testing for each tree by increasing the bootstrap number until the percentages remain fixed.
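The arithmetic is easy to check exactly in Python:

    from math import comb

    for n in (100, 1000, 4008):  # the site and taxon counts discussed in these posts
        ideal = comb(2 * n - 1, n)  # distinct bootstrap resamples of n items (Efron and Tibshirani)
        print(n, "-> ~10^%d" % (len(str(ideal)) - 1))
    # 100  -> ~10^58
    # 1000 -> ~10^600
    # 4008 -> ~10^2410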

Editor 2's comments

I love the process of Peer Review, where you get judged by a group of unaccountable and usually anonymous "experts" who have all the power and to whom you have no avenue of response. So this is my response.

After an Appeal from the author I was asked to adjudicate on the decision from Editor 1. I was able to review the prior reviews, as well as the author’s Appeal document and a revised version of their manuscript.

I have read the paper carefully and I agree with Editor 1 that the paper doesn’t meet the quality criteria we expect from a paper at Journal B. The paper has clear methodological issues that need to be addressed. In addition, the paper is sloppily prepared and the figures are poor.

Specific comments:

1. The argument that FastTree can’t be used for bootstrap is absurd. If FastTree is good enough for point estimates then it is good enough for bootstrap. The bootstrap simply assesses the amount of variation expected under the given inference method. It can be done with any tree reconstruction method, however flawed that method may be. If you don’t trust FastTree then you should use a different inference method. (Note that the FastTree paper explicitly states that FastTree can be used for bootstrap.)

2. The methods are strange. You analyze two separate data sets (H5N8 sequences vs. H5 sequences/N8 sequences) with entirely different methods. I’m not sure these are even comparable. Importantly, this is never properly explained or motivated.

3. The figures are largely impenetrable. It is standard procedure in phylogenetic analyses to collapse and/or color branches, as well as label groups of sequences rather than individual sequences, so that the key features of the tree are clearly visible. Even though this paper's conclusions hinge entirely on the shape of the phylogenetic trees that were obtained, you haven't put much effort at all into producing proper, high-quality tree figures. Moreover, you labeled each tip with the full strain name, which makes them largely illegible (in particular when the font is small, e.g. Fig. 2).

4. The paper is sloppily written. To give just one example, in the revision, you added the sentence "These trees are in good agreement with the much more detailed and rigorous coalescent analysis carried out previously.” Importantly, there is no reference. I have no idea which previous analysis you refer to. Similar issues permeate the paper. This paper simply isn't written in such a way that readers can understand what exactly was done and why.

Most importantly, it is the author’s responsibility to prepare a compelling, well-prepared manuscript. The work you have done may very well be (mostly) technically correct, but it is not presented in a way that it makes a useful contribution to the field.

So let me see:

1) You do not bootstrap in FastTree - you bootstrap with seqboot and consense from Phylip, as FastTree has no bootstrap function built in. So if we are playing pedantry, I am right and Editor 2 is wrong. Why do I want to know the variation under an approximate method? It was created for large sequence trees, for speed not accuracy. The key to the paper is identifying whether the H5N8 hemagglutinin and neuraminidase genes are part of a single cluster or multiple clusters. The bootstrap tells you NOTHING about this, because it just tells you the variability of the method, nothing about the biology and the results. The same multiple clusters are identified in both the neuraminidase and hemagglutinin trees. These are independent samples, as opposed to the covariate sites tested in a bootstrap, and so if they agree it is very unlikely the trees are wrong. Bootstraps add nothing other than a sop to the phylogenetics community, who cannot live without them.

2) I explained the methods - they both use ML with GTR + gamma + I, just in different programs: FastTree for the big alignment and Mega for the small alignments. Editor 2 had actually told me to use FastTree in a previous paper, where these H5 and N8 trees were needed as supplementary data. In that case it was to confirm that my Bayesian tree did not contain recombinations, and it was in doing that that I found the result reported here. So I am quite stunned that he has forgotten that this is the method he actually suggested.

3) The figures are vector graphics - is it beyond the wit of an editor to zoom a figure? Even if they are in need of editing, THIS IS NEVER A REASON FOR REJECTING A PAPER - for revisions yes, but rejection, you have to be joking. Each tip is labelled in that way because the BIOLOGY is what matters, i.e. the location and date of the sequences. Were I to shorten the labels to database identifiers, it would make interpreting the trees impossible in terms of biology, and it would also be impossible to see that the trees show a coherent pattern in terms of date and location. The problem is we get so carried away with algorithms that we no longer look at the data. If a cluster is all from New Jersey in 1989 then that is a reasonable cluster, in good agreement with geography and chronology.

4) I object to the word sloppily. It is not a term that should ever enter an editor's response. You may say the paper is missing a reference at point A and point B, but "sloppily" is pure hyperbole and not worthy of a good editor. It is not a word I would ever use. Still less is it justification for more than major revisions.

I have been an editor for 4 years and have edited 120 papers, and I would never write such a response or make such a decision without proper justification, especially given the final part about the work being mostly technically correct. Not being presented in a way that makes a useful contribution is NOT a CRITERION for rejection; at most it is grounds for major revision, especially at initial submission. We do not judge on significance, we judge on whether it is SCIENCE.

Friday 2 October 2015

Bayes and Fisher the two explorers and the Tiger.

This is an update of the joke about the two explorers and the Tiger.

Bayes and Fisher are walking through a forest. Suddenly, they see a tiger in the distance, running towards them. They turn and start running away. Bayes stops, takes some running shoes from his bag, and starts putting them on. "What are you doing?" says Fisher. "Do you think you will run faster than the tiger with those?" "I don't have to run faster than the tiger," Bayes says. "I just have to run faster than you."

This was inspired by a referee commenting that I had not proved my hypothesis that H5N8 influenza is spread by wild birds and not through domestic poultry. My opinion was that I didn't have to prove my hypothesis (you never can prove a hypothesis); I just had to show that my hypothesis was much more likely than the referee's alternative, which I did, although the paper was rejected as usual. In the end the referee was shown to be wrong, and there is a whole string of papers showing I was right.