The Accidental Statistician: Constructing Trees based on a single influenza subtype is not a good idea as it introduces sampling bias (amended and toned down)

Version 2 of the paper about H5N8 has been rejected, even though it is still right and what it says, while not desperately controversial, is important. https://peerj.com/preprints/1489/

The paper says that building trees from all the H5N8 sequences, or all the sequences of any other single subtype, is not a good idea, because this is a biased sample that misses the reassortment events that give alternative subtypes. An H5N8 sequence can sit next to an H5N1 sequence in the true tree, and H5N8 can then appear again at another place in the H5 tree.
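As a toy illustration of that point (the tree and tip names below are made up for this sketch and are not taken from the paper or its supplementary trees), a small H5 tree can be written as nested tuples and checked to see whether the H5N8 tips form a single clade. In this invented example they do not, because sequences of other subtypes sit between them.

```python
# Toy H5 tree as nested tuples. Tip names are hypothetical, written "label|subtype".
toy_h5_tree = (
    (("A|H5N8", "B|H5N1"), "C|H5N1"),
    (("D|H5N2", "E|H5N8"), "F|H5N1"),
)

def tips(node):
    """Return the names of all tips below a node."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(tips(child))
    return names

def subtype(tip):
    """Extract the subtype from a 'label|subtype' tip name."""
    return tip.split("|")[1]

def is_monophyletic(tree, wanted):
    """True if some clade holds every tip of subtype `wanted` and nothing else."""
    target = {t for t in tips(tree) if subtype(t) == wanted}

    def search(node):
        if set(tips(node)) == target:
            return True
        if isinstance(node, str):
            return False
        return any(search(child) for child in node)

    return search(tree)

print(is_monophyletic(toy_h5_tree, "H5N8"))  # False: H5N8 appears in two separate places
```

A real analysis would read the full H5 tree from a tree file with a phylogenetics library; the nested tuples here only stand in for that.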

I gave a very clear tree to show this is absolutely true and even posted it on this blog and repost it again here.

There is no doubt. Anything other than complete sampling of ALL of the H5 sequences will not give you the correct hemagglutinin tree. I have done this complete sampling in BOTH versions of the paper.

I put the full trees in the supplementary materials because they are large: the hemagglutinin tree contains over 4000 sequences and this is not easy to deal with. I just cut out the clades with H5N8 to make them easier to understand and to focus on them. For some unknown reason the referees fail to grasp this, and one even commented that my method and sampling were wrong because I showed a tree calculated just from the H5N8 sequences.

This comment from a referee just drives me crazy. I am lost for words as to how deliberately obstructive this person is.

In this paper the author is attempting to explain the evolutionary and reassortment history of H5N8 influenza A virus. However, the dataset design ignores what is already known about the emergence and reassortment history of these multiple virus lineages. In particular, the H5-HA of the recent North American high path H5N8 virus is derived from the Goose Guangdong HPAI H5N1 lineage circulating since 1996. This reassortment history has been well studied and published. The author wants to determine if H5N8 has been circulating cryptically in avian hosts or if emerges repeatedly through reassortment. But this has been shown - the highly pathogenic H5N8 virus emerged through reassortment (see Lee et al, 2014 EID for example). In fact, this has been show for every avian virus subtype in the MANY MANY publications investigating the reassortment history of avian influenza A virus in both wild and domestic populations.

The paper is poorly referenced and has not included important citations relevant to the study presented. I believe this has lead to incorrect understanding of influenza A ecology and evolution by the author and subsequently a poorly designed dataset to shed light on the questions he is attempting to address. The figures are completely inappropriate and not in line with the standards of phylogenetic studies or influenza research. It is unfortunate that the author has decided to show cladograms instead of phylograms. Branch lengths in a cladogram are meaningless. However, long branchs are indicative of poor sampling and missing data. This would be obvious from phylograms, but they are conveniently obscured in cladograms. The most informative analysis was of all available H5-HA and N8-NA phylogenies available from the supporting material link. By highlighting only H5N8 viruses in these trees it is evident that the other datasets presented in the main text of the study are poorly sampled.

Experimental design

As stated above, this is a poorly designed investigation. While I admire the effort to understand influenza ecology and evolution, the work presented here ignores much of what is already known about this lineage and influenza A virus in general. The assumptions of the analyses conducted are not appropriate. The analysis conducted by this author assumes a direct lineage connecting all H5N8 viruses that have been sampled (Figure 1-4). This is not true and that is evident from the supporting material presented by the author. The HA-H5 lineage has associated with multiple different virus genotypes and only a handful of lineages have emerged as highly pathogenic. The dataset design does not address the questions posed and ecological or evolutionary inference is questionable.

Validity of the findings

The inferences made from Figure 1-4 are dubious. The author acknowledges this in the manuscript when he states “These trees show that the apparently simple H5N8 phylogenetic trees for the two envelope segments (figures 1-4) are actually more complex and that multiple reassortment events have occurred resulting in the creation of novel H5N8 subtype lineages. These events cannot be seen in the structure of the H5N8 only trees but they need to be taken into account if the phylogenetic trees are going to be calculated correctly, especially if coalescent methods are going to be used.” This is an appropriate warning. I wish the author had heard it! This is evident and known to the influenza field. Regardless, at this point in the paper the author suggest that the reader ignore all previous results. Figure 5-13 are sections from the supporting material. The author attempts to determine source of HA and NA virus subtypes. The author has determined his reading is better than a probabilistic approach to assess reassortment history. However, this reading is in absence of informative branch lengths or assessment of sampling. Any inference presented here is either dubious, in contrast to other studies (not cited) or meaningless.

Comments for the author

I can't endorse publication of this manuscript. It does not serve the influenza field, nor does it add to the current body of knowledge. The quality of the research is not up to standards in the field. I believe that this manuscript should be rejected.

This person is trying to use my own findings to say why my findings are wrong. I heard my warning; that is why I wrote it. That is in fact why I wrote the paper, because all of that extensive literature that I did not cite, and whose absence provoked the referee's snide "MANY MANY" comment, is nonsense carried out by someone who needs to read about statistics and sampling. The referee agrees completely with what I am trying to do and with the results that I find and have in the supplementary material, but argues that I am saying the exact opposite of the entire argument of the paper in order to reject it. This is a classic example of creating a straw man.

The entire point of figures 1-4 is to show that they are wrong, and thus that the prevailing dogma that always analyses influenza strains like this is wrong. The experimental design is exactly correct. First you do what is done by everyone; this is the control. Then you do something new, the tree of ALL of the H5 and N8 sequences, to show what should be done. That is why there are figures 5-13, which show how sampling has to be done.
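The contrast between the control and the full tree can be sketched with another toy example. This is an assumption-laden illustration, not the method used in the paper: it prunes an invented full tree down to its H5N8 tips rather than recalculating a tree from H5N8 sequences alone, and the tip names are hypothetical. The point it makes is the same, though: once the other subtypes are dropped, the H5N8 tips look like simple neighbours and the intervening lineages are no longer visible.

```python
def prune(node, keep):
    """Drop tips for which keep(tip) is False, collapsing single-child nodes."""
    if isinstance(node, str):
        return node if keep(node) else None
    kids = [k for k in (prune(child, keep) for child in node) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)

# Hypothetical full H5 tree; tip names are "label|subtype" and are made up.
full_tree = (
    (("A|H5N8", "B|H5N1"), "C|H5N1"),
    (("D|H5N2", "E|H5N8"), "F|H5N1"),
)

# The "control": keep only the H5N8 tips, as a single-subtype dataset would.
h5n8_only = prune(full_tree, lambda tip: tip.endswith("|H5N8"))

print(h5n8_only)  # ('A|H5N8', 'E|H5N8')
```

In the full toy tree A and E are separated by H5N1 and H5N2 sequences; in the pruned version they appear as a simple pair, which is the information loss that a single-subtype sample builds in.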

I could think that this referee is sufficiently confused not to be able to understand, but I think that they do understand and this is just malevolence, they want to block publication.

How can I be sampling incorrectly when I include every known sequence, all of them, none excluded?

If that is not a valid sample then there have never been any valid samples in H5 influenza research. To know for sure who it is, I will wait for a couple of months and see who tries to publish the view that sampling is wrong if we focus on a single influenza subtype. I expect to see it in something like Emerging Infectious Diseases or PLoS Pathogens, with a fairly big name as the submitting author.

Finally, the last lines are NOT permitted in a referee's comments. The instructions to reviewers do not allow that sort of response in the comments to the author. That is for the editor to decide. It is not constructive or useful.

This is someone with an axe to grind who is annoyed that their work has not been cited. Boo hoo to you. It is appalling behaviour for a so-called professional scientist. What the paper says is still true and it will still be published. Whoever you are, as you chose to remain anonymous (for good reason), you will eventually be exposed for the dishonest person that you are.

The Accidental Statistician

Sunday, 31 January 2016