Sunday, 18 October 2015

Taking on phylogenetics: I'm afraid what you are suggesting is rather stupid

After my rants and tirades against editors and referees I decided that I would follow Ewan Birney's example and make my blog a bit more systematic. So this will be my lab notebook, so that I have a record of what I was thinking and when.

So after reading a few textbooks my general opinion of some of the leading lights of phylogenetics is that they need to read a few more books. There is a whole lot of bad science out there and a complete failure to appreciate how evolution works.

Editor 1 is my favourite editor, as he has already cost me a PLoS One paper and nearly another one in PeerJ. Anyway, he suggested that FastTree is an inadequate method and that I needed bootstraps. If he knew anything about FastTree he would know that for trees of more than 100 sequences its authors suggest not using bootstraps but the program's internal support estimator (Shimodaira-Hasegawa-like local supports), which correlates at better than 90% with bootstrap values. The alternative program, PhyML, says exactly the same: for trees of more than 100 sequences, compute the branch support using a Bayesian method instead.

You could create a bootstrapped version, either by running seqboot to generate a bootstrap dataset for FastTree or by setting the bootstrap option in PhyML, but why would you? You would have to be stupid to do this when you get practically the same result from the internal estimator in minutes, while the bootstrap calculations take days. For example, my H5 tree with more than 4,000 sequences cannot be run on PhyML at all. Removing the identical sequences reduces this to fewer than 3,000 sequences, BUT it would still take 176 DAYS on the PhyML server. Running FastTree takes about half an hour.
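For the record, a bootstrap dataset is nothing mysterious: seqboot simply resamples the alignment columns with replacement. A minimal Python sketch of that resampling step (the toy alignment and taxon names here are made up purely for illustration):

```python
import random

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, which is what
    PHYLIP's seqboot does to build one bootstrap replicate."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

# Toy alignment: 3 taxa, 8 sites (purely illustrative).
aln = {
    "taxon_A": "ACGTACGT",
    "taxon_B": "ACGTACGA",
    "taxon_C": "ACGAACGT",
}
replicate = bootstrap_columns(aln, random.Random(42))
```

The expensive part is not this resampling but the fact that every replicate then needs a full tree inference, which is why 1,000 bootstraps cost roughly 1,000 times the single-tree runtime.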

The alignment file "H5_HA_4-10-15-2811_DNA_muscle.phylip" was uploaded, it contains 2811 taxons of 1926 characters.

The estimated computation time is : 176 day(s) 3 hour(s) 25 minute(s)
See FAQ for details.
Thank you for your request.

It is therefore very clear that Editor 1 is stupid, unaware of current techniques and biased towards outdated methods inappropriate for the type of analysis I wish to carry out.

Sunday, 4 October 2015

Statistics Don't Tell the Whole Story


Match stats

Provided by Opta

    Scrums won (lost):       5 (1)    /  8 (0)
    Line-outs won (lost):    7 (1)    /  11 (2)
    Pens conceded:           …
    Rucks won (lost):        77 (7)   /  81 (3)
    Kicks from hand:         …
    Tackles made (missed):   101 (15) /  116 (18)
    Metres made:             …
    Line breaks:             …

Last piece of the story - my appeal letter

Dear Editor, 
I would like to submit the article “The multiple origins of the H5N8 avian influenza sub-type”. This paper describes the multiple reassortment events that have produced the H5N8 avian influenza sub-type. The current phylogenetic tree for the H5N8 sequences is calculated and this does not exhibit any unusual features. There seem to be clades in the Far East and North America.
What this phylogenetic tree does not show is that the apparently homogeneous North American clade is actually made up of multiple distinct re-assortment events. These are usually just single isolated sequences, which suggests that while the sub-type is present, other H5- and N8-containing sub-types dominate, and this is a rare reassortment that rapidly becomes extinct.
By carrying out a phylogenetic analysis of all the H5 segments from all H5-containing subtypes, it is clear that the H5 is evolving in different subtypes between the reassortment events that create H5N8, and the same is true for the N8 segments. This is the novel aspect of the research: it shows that there are many re-assortment events that need to be accounted for but which are undetectable in trees built from the H5N8 sequences alone. By not taking these re-assortments and the subsequent sequence evolution in other H5- or N8-containing sub-types into account, we cannot correctly calculate the H5N8 phylogeny. Both the H5N8 and the segment phylogenies have been calculated using the same method (maximum likelihood) and the same evolutionary model (GTR with gamma correction). However, the phylogenies for H5 and N8 were computed using FastTree rather than Mega or PhyML, because there are over 4,000 H5 taxa and so they cannot be computed with bootstraps. Despite the lack of bootstrap values, both the H5 and N8 phylogenies exhibit the same patterns of isolated re-assortment events, showing that the trees are a good representation of the true phylogenetic relationships.
The usual failure of H5N8 re-assortments to propagate is very important in the context of the current outbreak, because this is the first time that the H5N8 subtype has persisted over such a long period of time. This persistence has allowed the virus to spread globally through bird migration, carrying the pathogenic Chinese H5 variant with it.

Thank you for your consideration.

Editor 1's response to the revisions.

From Editor 1:  

Andrew incorrectly interprets the rejection based on the idea that the sequences are in the public domain and therefore not interesting to reevaluate. This is not the case. I have many papers based on such assembled data sets. The point the reviewers have made though, is that the manuscript is not clear on what advances are made based on these analyses.  
Reviewer: Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies, and if so, was this due to using different datasets or different analyses from those studies?
 Response: I am confused as to why using data from a public dataset cannot be original. If this were true no papers based on the human genome project would be valid after the initial publication. Andrew is missing the point. It is not that he used publicly available data, it is that he has not articulated what new insights he has gained from combining and re-analyzing these data.
The reviewers also had issues with some of the methods. Andrew excuses these because they are ‘comparable in how they were produced’ in previous studies (a poor criterion for method choice!). He excuses the lack of bootstrap values because he used the FastTree approach, which is a poor approach to phylogeny estimation and an even poorer excuse not to get bootstrap values. To provide a tree without bootstrap values or posterior probabilities is simply poor science. It is providing a point estimate with no confidence interval or variance estimate. There are fast bootstrap approaches, even in a maximum likelihood framework, and Bayesian approaches that should have been explored.
Clearly the reviewers were confused because of the writing of the paper. Andrew tries to clarify a few things in the response letter and edits to the manuscript, but this only confirms the fact that the reviewers were confused by the previous version. This revised draft is still rushed and only addresses a few of the concerns (not many of the methodological concerns). So I stand behind the rejection decision. That said, I would be happy to have Andrew resubmit a new version that addresses the concerns in a more robust way. But if he is unwilling to put the time in to do a proper phylogenetic analysis and articulate in the manuscript what and how he is doing such analyses, then I don’t want to waste reviewers’ time with it again. 

Only one reviewer was confused, and that was his own poor PhD student. The first referee states that he understood it clearly, and even the PhD student says that it is written well! I chose the methods not to compare with other studies, as Editor 1 states, but for internal consistency. That is most definitely good science and a fair test; it is actually fundamental to experimental design. As a statistician I am well aware of point estimates and confidence intervals, but I am also aware of the limits of bootstraps and bootstrap sampling.

So I worded a very strong reply to the Editor in Chief and asked for an appeal. I suspected that it would be passed to Editor 2, with whom I have had previous dealings: I once told the Editor in Chief that I never wanted Editor 2 to touch any paper I submitted to the journal again, because he is a pedant. Foolishly, I forgot to remind the Editor in Chief about this.

I greatly like the Editor in Chief. He has to deal, I suspect, with a large number of irate authors, but I think he would have fewer if he had editors who took their job seriously and not personally.

My more considered response to the referees

Referee 1

While the sequences are in the public domain, nobody has carried out a phylogenetic analysis of the H5 hemagglutinin and N8 neuraminidase sequences. Nobody could have carried out the analysis previously, as the data has only just become available. Whenever someone publishes a phylogenetic analysis it only ever contains a subset of the available data; this is a complete dataset of all the sequences from NCBI. If the GISAID data were included then this would be all of the available public data, but GISAID data was not included because it cannot be searched just on hemagglutinin and neuraminidase numbers. I am confused as to why using data from a public dataset cannot be original. If this were true, no papers based on the human genome project would be valid after the initial publication.
I used Mega because I wanted to construct trees for the H5N8 sequences and for the H5 and N8 segments that are comparable in how they were produced. I have used other methods to calculate the H5N8 trees, and I have referenced my own work and that of others using BEAST and coalescent methods. The tree from Mega is the same as in those studies, which is good, because this is the control data for the paper. The control shows that you get a tree, but that it does not show any of the structure of the 7 lineages you find in the H5 and N8 trees. There are no bootstraps on the trees because they are not calculated when you create a tree with FastTree, which is an approximate rather than exact method that you use for large numbers of sequences. A bootstrap tree would not be valid without at least N bootstraps, where N is the number of sequences (and I think it actually requires of the order of N-squared to be properly correct and sample all the possible bootstraps).
There is an H5 classification, but this would add extra complication to the study and distract from the message about recombination. The main classification has been mentioned, as this is the Guangdong H5 which is now becoming the globally predominant form (as also proposed in the reference by Verhagen et al.).

Referee 2 
It has nothing to do with an epidemiological analysis of the recent Korean outbreak as I have already published in this journal on that subject! I have changed the introduction to make my aim clearer and remove any possible idea that this might be epidemiological in intent.
The two points I am supposedly addressing are both wrong, so I have stated explicitly the three possibilities the work actually covers. As I show in the paper, the referee's question 1 cannot be answered with the trees produced in past studies or with my tree of the H5N8 sequences. As for question 2, regarding the Korean outbreak, that is well established by the cited work of Kang and Jeong, for which the referee kindly gives me the PMIDs (they are already in the paper).
Like the other referee, the objection seems to be that you cannot be doing a novel sequence analysis unless you have a new sequence that is not already in NCBI. I find this a ridiculous assertion. If we cannot ever use publicly available data to carry out research, why do we make it public? I have produced a novel dataset of both hemagglutinin and neuraminidase genes for all H5-containing serotypes and all N8-containing serotypes. Nobody has done this before, because it makes no sense except in the context of this paper. Even if they had, nobody has done it including the sequences from the most recent outbreak, because they have only just been added to the database, and I know beyond any doubt that nobody has recently done the H5 and N8 analysis for all serotypes.
Regarding the suggestions for original research: population studies will be difficult if you fail to account for the re-assortment events that are the focus of the paper. I am certainly interested in clocks and in variability between hosts and even locations. There are also interesting implications about population sizes from the coalescent trees in my previous paper, but that is for another paper, as it requires deep theory and a reading of Kimura and Sewall Wright.
They are right in saying that the tools are largely used out of the box and that parameter space was not explored, because that is not my aim. I am interested in biology and not methods; this is a biology paper, not a methods paper. For selecting the substitution model, both AIC and BIC favoured the GTR+I model, and the differences between models are actually nearly negligible. Alignment was on the nucleotides and so was tree building. Doing anything else would make no sense, as the distances and variants between sequences are very small.
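For anyone unfamiliar with the model-selection step, AIC and BIC are simple penalised log-likelihood scores, and lower is better. A quick sketch with hypothetical log-likelihood values (the numbers and parameter counts below are illustrative, not from the actual analysis):

```python
import math

def aic(log_l, k):
    # Akaike Information Criterion: 2k - 2 ln L
    return 2 * k - 2 * log_l

def bic(log_l, k, n):
    # Bayesian Information Criterion: k ln n - 2 ln L
    return k * math.log(n) - 2 * log_l

# Hypothetical log-likelihoods for two substitution models on an
# alignment of n = 1926 sites; (log-likelihood, free parameters).
n = 1926
models = {"JC69": (-52000.0, 1), "GTR+I": (-51950.0, 10)}
scores = {name: (aic(ll, k), bic(ll, k, n))
          for name, (ll, k) in models.items()}
```

With these made-up numbers both criteria pick GTR+I, which is the situation described above: the extra parameters are cheap relative to the likelihood gain, and the differences between rich models are small.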
There was no check for intra-segment recombination, but that is a very rare event, although it is hypothesized for the 1918 pandemic strain (a result I doubt, given the limited sampling). I have included the exact commands I used to calculate the trees. A condensed phylogenetic tree is one where you collapse nodes based on bootstrap values: identical sequences cannot be distinguished and so have low bootstrap values, so it makes no sense to represent branches between them. That is why the figure legend says there is a cut-off of 60%. The results detail each of the re-assortment events, stating the likely re-assortment partners that produced each of the different events, mostly in the USA. The details of the Korean outbreak are irrelevant to the objective of the paper unless they showed the presence of another re-assortment event, which they do not.
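To make "condensed tree" concrete: it is just the collapsing of poorly supported internal nodes into polytomies. A small sketch of that operation on a toy dict-based tree (not any real tree library; the structure and supports are invented for illustration):

```python
def condense(node, cutoff=60):
    """Collapse internal nodes whose bootstrap support is below
    `cutoff` into their parent, producing polytomies (the operation
    behind a 'condensed' tree display)."""
    if isinstance(node, str):          # leaf: a taxon name
        return node
    new_children = []
    for child in node["children"]:
        child = condense(child, cutoff)
        if isinstance(child, dict) and child["support"] < cutoff:
            # Dissolve the weakly supported node into its parent.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    return {"support": node.get("support", 100), "children": new_children}

# Toy tree ((A,B)95,(C,D)40): the 40% node falls below the 60% cutoff.
tree = {"support": 100, "children": [
    {"support": 95, "children": ["A", "B"]},
    {"support": 40, "children": ["C", "D"]},
]}
flat = condense(tree)
```

After condensing, C and D hang directly off the root as a polytomy while the well-supported (A,B) clade survives, which is exactly what the 60% cut-off in the figure legend does.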
It is not possible to quantitatively add to the results when it is ancestral serotypes that produce the re-assortments. I could assign probabilities, but the data is too sparse.
The figures are only for review and I have higher resolution vector files of the original trees that will be submitted with the final version. 
My harsher response to referee 2 is because he is neither an expert nor able to read. A condensed tree is the same as an ML tree with a cut-off at a certain bootstrap percentage, in this case 60%. This referee has clearly never used Mega, and it seems highly unlikely that he has used consense either. His focus is only on AIC and BIC in modeltest, which is a tiny step of minor relevance to anyone except the author of that method.

My angry replies to Editor 1's rejection

Dear Editor 1, 
I am rather struggling to understand the referees' comments, especially those of Referee 2, who seems to have NO IDEA what the paper was about. I do not care at all about the epidemiology of the recent Korean flu outbreak, and that is not in the title or abstract.
Referee 1 is also wrong. I used Mega to construct a tree of just the H5N8 data, which I know is not original; the new data are the MAFFT trees containing 15000 HA sequences and 15000 NA sequences that NEITHER referee even bothered to look at. That is the result. I could not care less about the H5N8 tree, as it is wrong because it does not take reassortment into account. The point is to show that Mega gives a nice tree that looks all good but is actually missing key elements.
Put simply, the paper SHOWS ABSOLUTELY that when you collect flu sequences from a supposed single serotype like H5N8, which you would suppose arose from evolution once, it is actually a mix of different events that have created this serotype multiple times by combining H5 hemagglutinins with N8 neuraminidases from other serotypes. This tells you that to construct trees you need to include intermediates, including non-H5N8 sequences; otherwise you cannot reconstruct the tree properly, as you are missing reassortments and ancestors.
This is a VERY BIG DEAL, because if you don't do it your trees will be wrong (as ALL current H5N8 trees are). This is definitely novel and definitely never discussed before. I can certainly increase the level of detail for the method, but any credible bioinformatician would be able to reproduce the trees, as they are produced by the default methods of Katoh's MAFFT, which does not allow most of the options Referee 2 suggests. To be honest, all the nucleotide substitution models score pretty much the same on AIC and BIC, but Mega and MAFFT do not use all the same models in the same way, so you need to use methods that they have in common. The MAFFT method has been cited more times than any other.
I got the tool and got the results. I do not optimise the trees, because that is not the point. I do not care about testing parameter space or algorithms. The point is that the sequences form many clades spread widely over the tree, and that these multiple clades are consistent across two different genes. This cannot happen by chance, so the trees are CORRECT regardless of parameter space and settings.
If the reviewer had opened the tree files he might have seen that they contain 15000 sequences, which I am sure are not sets of data produced by ANYONE EVER before. I would like to see them bootstrap those: 1000 bootstraps would be statistically wrong; you need to carry out at least N, where N is the number of sequences. If they have access to the world's largest super-computer then they may do the bootstrap. MAFFT does not allow it anyway. But as I said, they have not bothered to even look at that data at all.
Thanks Andy

There are a few problems with this response, and temper got the better of me. There are not 15,000 sequences; there are 4008 HA and 1840 NA sequences. So it is a big problem that is outside the scope of most programs to create a phylogenetic analysis with bootstraps (PhyML won't do it, for example), but it is possible, if rather pointless. To carry out the full non-parametric bootstrap over 4008 taxa you would need C(8015, 4008) bootstrap samples, which is a very large number: large enough not to be calculable in the entire history of the universe. Bootstrap estimates do converge towards this ideal value quickly, but to know an estimate has converged you would need to run multiple bootstrap trees and check that the values are converging. As far as I know nobody does this, and certainly not for trees with 4008 taxa.
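That combinatorial claim is easy to check with Python's exact integer arithmetic (the operations-per-second figure below is an absurdly generous made-up assumption, purely to make the point):

```python
import math

# Distinct non-parametric bootstrap resamples of n items: C(2n-1, n).
n = 4008
resamples = math.comb(2 * n - 1, n)   # C(8015, 4008)
digits = len(str(resamples))          # roughly 2,400 digits

# Grant a fantasy machine 10**30 resamples per second running for
# 10**18 seconds (vastly longer than the age of the universe):
covered = 10**30 * 10**18
fraction = covered / resamples        # underflows to 0.0
```

Even that fantasy machine covers a fraction of the resample space so small it underflows double precision, which is why everyone settles for a few hundred or thousand replicates and hopes the percentages have stabilised.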

Telling the story properly. The original rejection letter.

To put my irate posts in context you need to have the full story. Here is the original rejection letter and referees comments.

Thank you for your submission to Journal B. I am writing to inform you that your manuscript, "The multiple origins of the H5N8 avian influenza sub-type" (#2015:07:5858:0:0:REVIEW), has been rejected for publication. The comments supplied by the reviewers on this revision are pasted below. My comments are as follows:

Editor's comments

Your paper has now been reviewed by two experts in the field. They both found the general topic of avian influenza of interest, but both also agree that your study seems to provide little advance with respect to previous research. The study adds no new data and apparently no new interpretation of results. Both reviewers found your method descriptions lacking in detail, as well as the discussion of your results. As reviewer 2 points out, it would be impossible to reproduce your results given the lack of detail in the manuscript on the methods employed. Likewise, reviewer 1 identified discussions of methods (e.g., bootstrapping) for which there were no available results. Given these significant concerns with the manuscript and the lack of clear articulation of the novel contribution your analysis makes beyond other studies of avian influenza, I am forced to recommend rejection of your paper.

Editor 1

Reviewer Comments

Reviewer 1

Basic reporting: No comments
Experimental design: No comments
Validity of the findings: No comments

Comments for the author

Major strengths and weaknesses: This study is a successful attempt to use public datasets to analyse the multiple origins of the H5N8 avian influenza subtype. The study provides further phylogenetic analyses of all of the H5 hemagglutinin and N8 neuraminidase sequences to show that each of the H5N8 outbreaks has resulted from a different reassortment event and that there have been at least 7 distinct origins of the viral sub-type since it was first characterised in a turkey in Ireland in 1983.

However, I also feel the phylogenetic results and discussion contained mostly results from the analyses, but very little discussion. Given that almost all of the data for these analyses came from published studies, it was not clear if there were differences between these results and those studies, and if so, was this due to using different datasets or different analyses from those studies? From your MS, you used MEGA v6.06 to construct the ML tree; could you use other phylogenetic software to build an ML tree or BI tree, to verify each other? I suggest that you mark the clades in figure 1 in order to make the relationships more distinct. In addition, from figure 3 to figure 10 I couldn't find the bootstrap values in your trees, so please add the values to the tree figures; you could also add some description in this section. For the classification of H5, maybe you can refer to recent research (e.g. Donis et al, 2015).

Minor issues:
- Line 36: change to "shown"
- L100: should be "turkeys"

Reviewer 2

Basic reporting

The article is written in English using clear and unambiguous text. I think the article introduction and background fit into the broader field of knowledge. I think the figures need some work regarding clarity. For instance, the phylogenetic trees show clades that are too busy and could be collapsed for more clarity without losing information.

Experimental design

I am not sure whether this manuscript conforms to "original primary research". While the manuscript aims to provide an epidemiological analysis of highly pathogenic avian influenza (HPAI) from a recent outbreak in Korea, it does not add much in comparison to other published articles. The two main objectives of this manuscript are 1) to test whether there's phylogenetic evidence for reassortment, and 2) whether there are multiple origins to the Korean outbreak. A cursory review of the literature shows that both aspects have already been studied. Some examples are PMID: 24856098; PMID: 25625281; PMID: 25192767. Also, while I am not opposed to manuscripts that use data from GenBank only (that's why it's there to begin with), I think that in this case this has a negative impact, since the data are not novel and the analyses provide little new information. In my opinion, the author should reorganize the article to focus on something more original. Some examples come to mind: using molecular clocks to uncover temporal patterns in this outbreak, looking at past population dynamics to infer whether this serotype is becoming more or less diverse over time, or performing a phylogeographic analysis to provide insights into its origin and spread.

Validity of the findings

Methods

I think this manuscript uses a very rich dataset but the analyses seem precipitous, exploratory and "out-of-the-box". My overall impression is that there is little exploration of the parameter space.

In Methods, line 52, what was the criterion by which the nucleotide substitution model was selected? Was it the Akaike Information Criterion, the Bayesian Information Criterion or something else?

I think the methods section needs more detail. Statements such as "The sequences were aligned using MAFFT" are insufficient to allow full reproducibility. What parameter values were used and why? What version is the program? Did you fit a codon model to the coding sequences? Did you check for intra-gene recombination? Also, how many sequences were used? It says "all" but how many? Was the alignment done on the coding sequences or on the nucleotides? Finally, the phylogenetic reconstruction step is not clear as to how it was performed. Are the trees rooted? Were outgroups used? What is a "condensed phylogenetic tree"? Is it some type of consensus? If so, is it a majority rule consensus?

Results and Discussion

The description of the results is very qualitative, e.g., "the current outbreak shows some structure". It would be more objective to quantify the degree of structure by using distance metrics or population genetic estimators of structure.

Comments for the author

No comments
An important point is that Referee 2 is the PhD student of Editor 1.

Saturday, 3 October 2015

To be fair I should report my incorrect and somewhat angry response to the rejection of the appeal

Really I give in.

I will do FastTree using seqboot for the bootstrap, but this is from the manual:

“…00 fast-global bootstraps took just 20 hours, and the resulting support values were strongly correlated with the traditional bootstrap (r=0.975).”

So it is NOT equivalent to the bootstrap; it just correlates with the bootstrap. So again the “expert” is actually unaware of how it works.

Both trees are created with ML methods using a GTR+G evolutionary model, albeit with different implementations. I can do both in FastTree, but what exactly would that show?

The figures are illegible? You could try zooming, as they are vector images, but apparently that asks too much of someone with a PhD.

There is one reference missing to my own work.

As they say, “whatever”.

This is going to give me a two-fer, as I will resubmit it next week with everything done, a new title, and excluding both of those editors.



On this I am definitely wrong. You can bootstrap normally using seqboot and consense. The correlation quoted above only applies when you specify an initial tree.

However, there are over 1000 variable sites. For the ideal (non-parametric) bootstrap estimate you need C(2n-1, n) resamples, which for n = 1000 is about 10^600 calculations (Efron and Tibshirani). Sampling even 1000 bootstraps is an insignificant fraction of this. For 100 sites it is about 5 x 10^58. It is likely that bootstrap estimates are far from convergent, but that would need testing for each tree by increasing the bootstrap number until the percentages remain fixed.
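These counts can be checked directly, since Python's integers are exact (a quick sketch, nothing more):

```python
import math

def distinct_resamples(n):
    # Number of distinct bootstrap resamples of n sites: C(2n-1, n).
    return math.comb(2 * n - 1, n)

big = distinct_resamples(1000)    # about 1.0 x 10**600
small = distinct_resamples(100)   # about 4.5 x 10**58
```

So even a million replicates samples a vanishing fraction of the possible resamples; the practical question is only whether the support percentages have stabilised as the replicate count grows.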

Editor 2's comments

I love the process of Peer Review where you get judged by a group of unaccountable and usually anonymous "experts" who have all the power and to whom you have no avenue of response. So this is my response.

After an Appeal from the author I was asked to adjudicate on the decision from Editor 1. I was able to review the prior reviews, as well as the author’s Appeal document and a revised version of their manuscript.

I have read the paper carefully and I agree with Editor 1 that the paper doesn’t meet the quality criteria we expect from a paper at Journal B. The paper has clear methodological issues that need to be addressed. In addition, the paper is sloppily prepared and the figures are poor.

Specific comments:

1. The argument that FastTree can’t be used for bootstrap is absurd. If FastTree is good enough for point estimates then it is good enough for bootstrap. The bootstrap simply assesses the amount of variation expected under the given inference method. It can be done with any tree reconstruction method, however flawed that method may be. If you don’t trust FastTree then you should use a different inference method. (Note that the FastTree paper explicitly states that FastTree can be used for bootstrap.)

2. The methods are strange. You analyze two separate data sets (H5N8 sequences vs. H5 sequences/N8 sequences) with entirely different methods. I’m not sure these are even comparable. Importantly, this is never properly explained or motivated.

3. The figures are largely impenetrable. It is standard procedure in phylogenetic analyses to collapse and/or color branches, as well as to label groups of sequences rather than individual sequences, so that the key features of the tree are clearly visible. Even though this paper's conclusions hinge entirely on the shape of the phylogenetic trees that were obtained, you haven't put much effort at all into producing proper, high-quality tree figures. Moreover, you labeled each tip with the full strain name, which makes them largely illegible (in particular when the font is small, e.g. Fig. 2).

4. The paper is sloppily written. To give just one example, in the revision, you added the sentence "These trees are in good agreement with the much more detailed and rigorous coalescent analysis carried out previously.” Importantly, there is no reference. I have no idea which previous analysis you refer to. Similar issues permeate the paper. This paper simply isn't written in such a way that readers can understand what exactly was done and why.

Most importantly, it is the author’s responsibility to prepare a compelling, well-prepared manuscript. The work you have done may very well be (mostly) technically correct, but it is not presented in a way that it makes a useful contribution to the field.

So let me see;

1) You do not bootstrap in FastTree; you bootstrap with seqboot and consense from Phylip, as FastTree has no bootstrap function built in. So if we are playing pedantry, I am right and Editor 2 is wrong. Why do I want to know the variation under an approximate method? It was created for large sequence trees for speed, not accuracy. The key to the paper is identifying whether the H5N8 hemagglutinin and neuraminidase genes form a single cluster or multiple clusters. The bootstrap tells you NOTHING about this, because it only tells you about the variability of the method, not about the biology or the results. The same multiple clusters are identified in both the neuraminidase and hemagglutinin trees. These are independent samples, as opposed to the covarying sites resampled in a bootstrap, so if they agree it is very unlikely the trees are wrong. Bootstraps add nothing other than a sop to the phylogenetics community, who cannot live without them.

2) I explained the methods: both use ML with GTR + gamma + I, just in different programs, FastTree for the big alignments and Mega for the small ones. Editor 2 had actually told me to use FastTree in a previous paper, where these H5 and N8 trees were needed as supplementary data. In that case it was to confirm that my Bayesian tree did not contain recombinations, and it was in doing that that I found the result reported here. So I am quite stunned that he has forgotten that this is the method he himself suggested.

3) The figures are vector graphics. Is it beyond the wit of an editor to zoom a figure? Even if they are in need of editing, THIS IS NEVER A REASON FOR REJECTING A PAPER. For revisions, yes, but rejection? You have to be joking. Each tip is labelled in that way because the BIOLOGY is what matters, i.e. the location and date of the sequences. If I shortened the labels to database identifiers, interpreting the trees in terms of biology would be impossible, and it would also be impossible to see that the trees show a coherent pattern in terms of date and location. The problem is we get so carried away with algorithms that we no longer look at the data. If a cluster is all from New Jersey in 1989 then that is a reasonable cluster in good agreement with geography and chronology.

4) I object to the word sloppily. It is not a term that should ever enter an editor's response. You may say the paper is missing a reference at point A and point B, but "sloppily" is pure hyperbole and not worthy of a good editor. It is not a word I would ever use. Still less is it justification for anything more than major revisions.

I have been an editor for 4 years and have edited 120 papers, and I would never write such a response or make such a decision without proper justification, especially given the final part about the work being mostly technically correct. Not being presented in a way that makes a useful contribution is NOT a CRITERION for rejection; at most it justifies major revisions, especially at initial submission. We do not judge on significance; we judge on whether it is SCIENCE.

Friday, 2 October 2015

Bayes and Fisher, the two explorers and the Tiger.

This is an update of the joke about the two explorers and the Tiger.

Bayes and Fisher are walking through a forest. Suddenly, they see a tiger in the distance, running towards them. They turn and start running away. Bayes stops, takes some running shoes from his bag, and starts putting them on. “What are you doing?” says Fisher. “Do you think you will run faster than the tiger with those?” “I don’t have to run faster than the tiger,” Bayes says. “I just have to run faster than you.”

This was inspired by a referee commenting that I had not proved my hypothesis about H5N8 influenza being spread by wild birds rather than through domestic poultry. My opinion was that I didn't have to prove my hypothesis (you never can prove a hypothesis); I just had to show that my hypothesis was much more likely than the referee's alternative, which I did, although the paper was rejected as usual. In the end the referee was shown to be wrong, and there is now a whole string of papers showing I was right.

Stop trying to put biology in a quantifiable box!

This brings me to my favourite, and sadly spot on, rant about bioinformatics. I am a bioinformatician and have counted myself as one for nearly 20 years. I am not a big name, but I do small things well and carefully and I think a lot around them. I am not that interested in the big projects and the latest bandwagons (the microbiome comes to mind).

There is too much politics, there are too many vested interests and there are too many inflexible people. Biologists seem to have a paranoia about detailed measurement. They seem to feel inadequate compared to physics, which is far more quantifiable. The thing about biology is that it ISN'T quantifiable, so stop trying to make it so. There always has to be some hand waving. There is not a single equation in Darwin, and that is the only theoretical framework biology has. All the great population biologists got the maths wrong in some way because nature is non-linear. Even statistics cannot always help: the statement that the average dog has fewer than four legs demonstrates the point. We need to know about history, despite Popper's misgivings.
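The dog example is easy to make concrete. A toy Python sketch, with entirely made-up numbers, shows why the mean can describe no individual at all:

```python
# Hypothetical population: most dogs have 4 legs, a few have 3 after
# injury, and none have 5 -- so the mean must fall below 4.
legs = [4] * 97 + [3] * 3  # made-up counts for 100 dogs

mean_legs = sum(legs) / len(legs)
print(mean_legs)  # 3.97: the "average" dog has fewer than four legs,
                  # yet no dog in the population has 3.97 legs
```

The mean is below the value held by 97% of the population; the mode (4) describes the data far better here, which is the hand-waving point being made.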

So be happy with the hand waving, and publish and be damned. A lot will be wrong, but some will be right, and evolution will apply the filter. Trying to reason and rationalise and plan will give you the wrong results and waste your time. Get your imperfect ideas out there. Let there be survival of the fittest. You cannot beat evolution.

Apparently the Editor is not the only stupid one; it is the general opinion of everyone in phylogenetics, so I have to be wrong ...

I appealed the very stupid editor's comments on the bootstrap and, guess what, another methods man agrees with him, so it gets rejected again. So here is a little response.


I really don't care about the methods, as they are all heuristic approximations with more incorrect assumptions than you can shake a stick at. The trees are right; putting bootstraps on them does not make them any more right. I am telling a story about biology, not about maths.

Now I can do a story about maths.

Bootstraps were invented by Efron, who wrote a nice book about them with Tibshirani that maybe some of the editors and "experts" in phylogenetics might like to read. We all know that the bootstrap is a resampling method in which you resample a data set with replacement. We do this in order to construct confidence intervals for complicated statistics by simulation rather than by an analytical solution, which is often too complex or does not exist.
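For anyone who has only ever met the bootstrap through a phylogenetics menu option, here is a minimal percentile-bootstrap sketch in plain Python. The data values are toy numbers invented for illustration, not anything from my papers:

```python
import random

random.seed(0)  # reproducible toy example

def bootstrap_ci(data, stat, n_boot=10000, alpha=0.05):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    boots = []
    for _ in range(n_boot):
        # Resample the data WITH replacement, same size as the original
        resample = [random.choice(data) for _ in data]
        boots.append(stat(resample))
    boots.sort()
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot)]
    return lo, hi

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9, 2.2, 3.0]  # toy data
lo, hi = bootstrap_ci(data, lambda xs: sum(xs) / len(xs))
print(lo, hi)  # a simulated 95% interval for the mean
```

Note the key assumption baked into `random.choice`: every resampled value is drawn independently. That is exactly the property the next two points turn on.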

Now there are two key points:

1) If your sample is biased, your bootstrap will still be biased and your confidence interval will still be wrong.

2) Resampling produces independent and identically distributed (i.i.d.) draws: you lose all correlation between variables when you resample.
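Point 2 is easy to demonstrate numerically. A toy sketch with simulated data: build a strongly serially correlated series, take a naive bootstrap resample of it, and watch the correlation vanish:

```python
import random

random.seed(1)  # reproducible toy example

def lag1_autocorr(xs):
    """Lag-1 autocorrelation: how strongly each value depends on the previous one."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in xs)
    return num / den

# Simulated series with strong serial correlation (AR(1), coefficient 0.9)
series, prev = [], 0.0
for _ in range(2000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    series.append(prev)

# Naive bootstrap resample: i.i.d. draws with replacement
boot = [random.choice(series) for _ in series]

print(lag1_autocorr(series))  # high, around 0.9
print(lag1_autocorr(boot))    # near zero: the correlation structure is gone
```

The resample has the same marginal distribution as the original, but the dependence between neighbouring values is destroyed, which is the whole complaint.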

Extending these two points to phylogenetics:

In phylogenetics they carry out convenience samples: the data set is whatever sequences someone happens to have collected, and they have no idea whether it is a representative or in any way good sample. If I tried to get convenience-sampling-based research published in almost any field (except phylogenetics), my work would be rejected by most statisticians as wrong. So we suspect the samples are biased, and if that is true then the bootstrap is not going to tell us much about the bias. In fact Efron and Tibshirani discuss this very problem on p138 of their book, where they say bias estimation is an interesting but very tricky problem.

If you are using a technique where those correlations define your output, like, say, tree building in phylogenetics, then bootstrapping is a fairly stupid thing to do. Why would I want to create bootstraps that lose the correlated properties of my data? To do it properly you can read chapter 9 of Efron and Tibshirani, which says the bootstrap should be taken over the covariate vectors, not over the data itself. As far as I can tell, none of the bootstrap implementations in phylogenetics do this.
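For concreteness, here is a minimal Python sketch of what a seqboot-style bootstrap replicate actually does, using a toy alignment with made-up taxon names and sequences. It resamples alignment columns with replacement: each column stays intact across taxa, but the sites themselves are treated as i.i.d. draws, so any correlation between neighbouring sites is discarded:

```python
import random

random.seed(2)  # reproducible toy example

def bootstrap_alignment(alignment):
    """One seqboot-style bootstrap replicate: resample alignment COLUMNS
    (sites) with replacement. Covariation between taxa within a column
    survives; dependence between neighbouring sites does not."""
    n_sites = len(next(iter(alignment.values())))
    cols = [random.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in cols)
            for taxon, seq in alignment.items()}

# Toy alignment (hypothetical sequences)
aln = {"taxonA": "ACGTACGT",
       "taxonB": "ACGTTCGT",
       "taxonC": "GCGTACGA"}

boot = bootstrap_alignment(aln)
print(boot)  # same taxa, same length, columns shuffled and repeated
```

Every column of the replicate is a column of the original, which is why within-column signal is preserved, and why the sites-are-independent assumption is doing all the work.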

This is a rejection of a paper where I am talking about an interesting bit of biology, not methods. The interesting bit is that not all H5N8 influenza virus comes from the same ancestor. This means that viral subtypes are spontaneously recreated multiple times by reassortment of the viral neuraminidases and hemagglutinins. It also means that when you create a tree of a subtype it might not be homologous, as was found recently in a paper published in Science about Dengue, which argues that the whole serotype argument for Dengue does not hold water. You see, that is interesting biology, but you miss it while you get tied up in your bootstraps.

As time goes on I am ever more convinced that I am right, that the biology I discovered is important and significant, and that the method pedants, like the grammar pedants, are defending an empty palace. They defend beautiful methods that have no relationship to reality.