Saturday, 28 November 2015

Thinking about the bootstrap - is it doing what we think it is?

The bootstrap comes in two forms non-parametric and parametric. Non-parametric is the easiest to understand and it is closely related to cross-validation and jack-knifing. Bootstrap methods were pioneered by Brad Efron and collaborators and the book Introduction to the Bootstrap by Efron and Tibshirani is a very readable account.

Non-parametric bootstrapping should be used  in cases where the data cannot be summarised by a well-defined parameter that is normally distributed. This applies to phylogenetic analysis where the actual effects of the sequences and the variable sites behave in a very non-linear manner. Felsenstein applied the bootstrap to phylogenetic analysis in 1984.

The process in general for the bootstrap is that if you have a set of data then you resample that data without replacement to generate a new set of data with the same properties as the original dataset. For the jackknife, the resampling is the exclusion of a subset of data from the complete set (in crystallography this is analogous to using the Free-R factor and not the R-factor for refinement).  If the original dataset had 200 elements the resampled dataset has 200 elements. This is the first resampling. In the bootstrap, you resample many times. As the number of bootstrap resamples goes to infinity then you will get all the possible permutations of the original data but there is usually convergence of the calculated statistics of the resampling with smaller numbers of  bootstraps. You should test to see about convergence (nobody ever does).

For phylogenetics, the aligned sets of sequences produce a matrix where the rows are each of the sequences and the columns are the aligned positions. The columns of this matrix are then re-sampled. In most of Efron's examples, it is the rows that are resampled and this could be done in phylogenetics but with some added complexity in renaming the sequences that are duplicated. Resampling the columns might have some issues regarding the basic assumptions of the bootstrap.

These fundamental assumptions are that the sites are independent and that they are identically distributed. Now each of these assumptions might hold with a slight exception in the case of sequence alignments but the two together most likely do not.

  1. Some sites will not be independent and we know that there are correlated mutations, but these are perhaps a small enough number that they do not affect the results of the bootstrap. 
  2. Each of the positions should have the same probability of being A, C, T or G but because of the different rates of change at different codon positions they definitely will not have the same number of changes at all sites and over long periods saturation becomes an issue. There is some correlation but Efron and Tibshirani show how to deal with correlation by resampling each of the correlated variables as a single set.
  3. The differences in mutation at different codon position mean that most likely the assumptions about identically distributed and independent are not true. We should bootstrap using codons and not single sites. (This is actually a possibility using consense where you can set any length of sub-sequence to resample).
  4. There is a direct contradiction of models if you bootstrap a model where the codon position is represented by a gamma distribution and the bootstrap sampling is carried out by resampling single sites because you lose the correlation between sites implicit in the gamma distribution model. This must have a negative effect on the bootstrap tree.

More recently methods based on likelihoods and local bootstrap methods have been developed to deal with large trees where the calculation of the non-parametric bootstrap becomes prohibitive. All of these methods depend on assuming that the alignment contains enough information to define a simple parameter that can then be bootstrapped. These are all examples of a parametric bootstrap. They have been reported to give results that are highly correlated to the non-parametric bootstrap with levels of correlation over 95%, but this might conceal very large local fluctuations.

Sunday, 15 November 2015

Good grief: Dark Matter and Dinosaurs it is all physics-babble

I have just figured out where all that invisible dark matter we need to make the universe work is. It is stuck between theoretical physicists ears when they write pseudo-science books based on no evidence what so ever. Everyone complains about quackery of "alternative medicine" but theoretical physics takes junk science to a new level. The lips are moving but the brain isn't working.  Dark Matter and Dinosaurs, the Cosmic Anthropological Principle, Many Worlds ...

Sunday, 18 October 2015

Taking on phylogenetics: I'm afraid what you are suggesting is rather stupid

After my rants and tirades against Editors and referees I decided that I would follow Ewan Birney and make my blog a bit more systematic. So this will be my lab notebook so I have proof of what I was thinking when.

So after reading a few textbooks my general opinion of some of the leading lights of phylogenetics is that they need to read a few more books. There is a whole lot of bad science out there and a complete failure to appreciate how evolution works.

Editor 1 is my favourite editor as he has cost me a PLoS One paper already and nearly another one in PeerJ. Anyway he suggested that FastTree is an inadequate method and I needed bootstraps. If he knew anything about FastTree he would know that for >100 sequences they suggest not using bootstraps but an estimator of the likelihoods instead which has a > 90% correlation to the bootstrap values. There is an alternative program which is PhyML and that says exactly the same. For trees of > 100 sequences compute the branch support using a Bayesian method.

You could create a bootstrap version by either running seqboot to create a bootstrap dataset for FastTree or by setting the bootstrap option in PhyML, but why would you? You would have to be stupid to do this when you get practically the same result from the internal estimator in minutes and the bootstrap calculations will take days. For example my H5 tree with > 4000 sequences cannot be run on PhyML. Removing the identical sequences reduces this to less that 3000 sequences BUT this will take 176 DAYS on the PhyML server. Running FastTree takes about half an hour.

The alignment file "H5_HA_4-10-15-2811_DNA_muscle.phylip" was uploaded, it contains 2811 taxons of 1926 characters.

The estimated computation time is : 176 day(s) 3 hour(s) 25 minute(s)
See FAQ for details.
Thank you for your request.

It is therefore very clear that Editor 1 is stupid, unaware of current techniques and biased towards outdated methods inappropriate for the type of analysis I wish to carry out.

Sunday, 4 October 2015

Statistics Don't Tell the Whole Story

cock had his say on

Match stats

Provided by Opta
5 (1)
Scrums won (lost)
8 (0)
7 (1)
Line-outs won (lost)
11 (2)
Pens conceded
77 (7)
Rucks won (lost)
81 (3)
Kicks from hand
101 (15)
Tackles made (missed)
116 (18)
Metres made
Line breaks

Last piece of the story - my appeal letter

Dear Editor, 
I would like to submit the article “The multiple origins of the H5N8 avian influenza sub-type”. This paper describes the multiple reassortment events that have produced the H5N8 avian influenza sub-type. The current phylogenetic tree for the H5N8 sequences is calculated and this does not exhibit any unusual features. There seem to be clades in the Far East and North America.
What this does phylogenetic tree does not show is that the appartently homogeneous North American clade is actually made up of multiple distinct re-assortment events. These are usually just single isolated sequences, and this suggests that while the sub-type is present other H5 and N8 containing sub-types dominate and this is a rare reassortment that rapidly becomes extinct. 
By carrying out a phylogenetic analysis of all the H5 segments from all H5 containing subtypes it is clear that the H5 is evolving in different subtypes between reassortment events that create H5N8 and the same is true for the N8 segments. This is the novel aspect of the research and shows that there are many re-assortment events that need to be accounted for but which are undetectable in the H5N8 sequences only trees. By not taking these re-assortments the the subsequent sequence evolution in other H5 or N8 containing sub-types into account we cannot correctly calculate the H5N8 phylogeny. Both the H5N8 phylogenies have been calculated using the same method (maximum likelihood) and the same evolutionary model (GTR with gamma correction). However the phylogenies for H5 and N8 were computed using FastTree rather than in Mega or PhyML because there are over 4000 H5 taxa and so they cannot be computed with bootstraps. However despite the lack of bootstrap values both the H5 and N8 phylogenies exhibit the same patterns of isolated re-assortment events, showing that the trees are a good representation of the true phylogenetic relationship.
The usual failure to propagate of H5N8 re-assortments is very important in the context of the current outbreak because this is the first time that the H5N8 subtype has persisted over such a long period of time. This has allowed the virus to spread globally through bird migration spreading the pathogenic Chinese H5 variant.

Thank you for your consideration.

Editor 1's response to the revisions.

From Editor 1:  

Andrew incorrectly interprets the rejection based on the idea that the sequences are in the public domain and therefore not interesting to reevaluate. This is not the case. I have many papers based on such assembled data sets. The point the reviewers have made though, is that the manuscript is not clear on what advances are made based on these analyses.  
Reviewer:Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies –and if so, was this due to using different datasets or different analyses from those studies? 
 Response: I am confused as to why using data from a public dataset cannot be original. If this were true no papers based on the human genome project would be valid after the initial publication. Andrew is missing the point. It is not that he used publicly available data, it is that he has not articulated what new insights he has gained from combining and re-analyzing these data.
 The reviewers also had issues with some of the methods. Andrew excuses these because they are ‘comparable in how they were produced’ in previous studies (a poor criterion for method choice!). He excuses the lack of bootstrap values because he used the FastTree approach, which is a poor approach to phylogeny estimation and an even poorer excuse to not get bootstrap values. To provide a tree without bootstrap values or posterior probabilities is simply poor science. It is providing a point estimate with not confidence interval or variance estimate. There are fast bootstrap approaches, even in a maximum likelihood framework and Bayesian approaches that should have been explored.  
Clearly the reviewers were confused because of the writing of the paper. Andrew tries to clarify a few things in the response letter and edits to the manuscript, but this only confirms the fact that the reviewers were confused by the previous version. This revised draft is still rushed and only addresses a few of the concerns (not many of the methodological concerns). So I stand behind the rejection decision. That said, I would be happy to have Andrew resubmit a new version that addresses the concerns in a more robust way. But if he is unwilling to put the time in to do a proper phylogenetic analysis and articulate in the manuscript what and how he is doing such analyses, then I don’t want to waste reviewers’ time with it again. 

Only one reviewer was confused his poor PhD student. The first author states that he understood it clearly and even his PhD student says that it is written well! I chose the methods not to compare to other studies as Editor 1 state but for internal consistency. That is most definitely good science and a fair test. It is actually fundamental to experimental design. As a statistician I am well aware of point estimates and confidence intervals, but I am also aware of the limits of bootstraps and bootstrap sampling.

So I worded a very strong reply to the Editor in Chief and asked for an appeal. Suspecting that it would be passed to Editor 2 who I have had previous dealings before where I told the editor in chief I never wanted Editor 2 to touch any paper I submitted to the journal again, because he is a pedant. Foolish me I forgot to remind the Editor in Chief about this.

I greatly like the Editor in Chief. He has to deal I suspect with a large number of irate authors but I think he could have less if he had editors who took their job seriously and not personally.

My more considered response to the referees

Referee 1

While the sequences are in the public domain nobody has carried out the phylogenetic analysis of the H5 hemagglutinin and N8 neuraminidase sequences. Nobody would have been able to carry out the analysis previously as the data has only just become available, When-ever someone publishes a phylogenetic analysis it only ever contains a subset of the available data. This is a complete data set for all the sequence from NCBI. If the GISAID data was included then this would be all of the available public data. GISAID data was not included because it cannot be searched just based on hemagglutinin and neuraminidase numbers. I am confused as to why using data from a public dataset cannot be original. If this were true no papers based on the human genome project would be valid after the initial publication.  
I used Mega as I wanted to construct trees for the H5N8 sequences and the H5 and N8 trees that are comparable in how they were produced. I have used other methods to calculate the H5N8 trees and I have referenced my own work and others that have used BEAST and coalescent methods. The tree from Mega is the same as from those studies, which is good because this is the control data for the paper. The control shows you get a tree but that it does not show any of the structure of the 7 lineages you find from the H5 and N8 trees. There are no bootstraps on the trees because they are not calculated when you create a tree with FastTree as it is an approximate and not exact method. You use it for large numbers of sequences. A bootstrap tree would not be valid without at least N bootstraps where N is the number of sequences (and I think it actually requires N-squared to be properly correct and sample all the possible bootstraps) 
There is a H5 classification but this would add extra complication to the study that would distract from the message about recombination. The main classification has been mentioned as this is the Guangdong H5 which is now becoming the globally predominant form (as also propsed in the reference by Verhagen et al.)

Referee 2 
It has nothing to do with an epidemiological analysis of the recent Korean outbreak as I have already published in this journal on that subject! I have changed the introduction to make my aim clearer and remove any possible idea that this might be epidemiological in intent.
The two points I am supposedly addressing are both wrong and so I have stated explicitly the three possibilities the work actually covers. As I show in the paper the referee’s question 1 cannot be answered with the trees produced in past studies or my tree of the H5N8 sequences. As for question 2 regarding the Korean outbreak that is well established by the cited work of Kang and Jeong and which the referee kindly gives me the PMID ids for (they are already in the paper).  
Like the other referee the objection seems to be that you cannot be doing a novel sequence analysis unless you have a new sequence that is not already in the NCBI. I find this a ridiculous assertion. If we cannot ever use publicly available data to carry out research why do we make it public? I have produced a novel dataset for both hemagglutinin and neuraminidase genes for all H5 containing serotypes and N8 containing serotypes. Nobody has done this before because it makes no sense except in the context of this paper. Even if they had nobody has done it to include the sequences from the most recent outbreak because they have only just been included in the database and I know beyond any doubt that nobody has recently done the H5 and N8 analysis for all serotypes.  
Regarding the suggestions for original research population studies will be difficult if you fail to account for the re-assortment events that are a focus of the paper. I am certainly interested in clocks and variability between hosts and even locations. There are also interesting implications from the coalescent trees from my previous paper about population sizes but that is for another paper as that requires deep theory and a reading of Kimura and Sewall-Wright.  
They are right in saying that the tools are largely to be used out of the box and that parameter space was not explored, because that is not my aim. I am interested in biology and not methods, this is a biology and not a methods paper. To select the best substitution model both AIC and BIC were best for the GTR+I model, the difference are actually nearly negligible between models. Alignment is on the nucleotides and so was tree building. Doing anything else would make no sense as the distances and variants between sequences are very small. 
There was no check for intra-segment recombination but that is a very rare even although it is hypothesized in the 1918 pandemic strain (although I doubt that result with limited sampling). I have included the exact commands I used to calculate the trees. A condensed phylogenetic tree is when you collapse the nodes based on bootstrap values. Identical sequences cannot be distinguished and so have low bootstrap values and so it makes no sense to represent branches between them. That is why the figure legend says there is a cut-off of 60%. The results detail each of the re-arrangement events stating the likely re-assortment partners that produced each of the different events mostly in the USA. The details of the Korean outbreak are irrelevant to the objective of the paper unless they showed the presence of another re-assortment event, which they do not. 
It is not possible to quantitatively add to the results when they are ancestral serotypes that produce the re-assortment. I could assign probabilities but the data is too sparse.  
The figures are only for review and I have higher resolution vector files of the original trees that will be submitted with the final version. 
My harsher response to referee 2 is because he is neither an expert nor able to read. A condensed tree is the same as an ML with a cut-off of a certain bootstrap percentage, in this case 60%. This referee has clearly never used Mega and it seems highly unlikely that he has used consense either. His focus is only on AIC and BIC in modeltest which is a tiny step of minor relevance to anyone except the author of that method.