The Accidental Statistician: Second referee's comments - Do lineage and subtype have meaning any more?

The second referee is a bit pedantic, which is perhaps important. There are good arguments for very strict use of terms, but this is a minor correction. I am a bit hand-waving and inexact, so he has some points, but still, why anonymous? What are you afraid of?

Basic reporting

The manuscript suffers from sub-standard writing. There's a typo in the text ("creating phylogenetic trees oh H5N8" on line 112), as well as grammar mistakes (line 182, line 229). Some very unusual language is also employed throughout the manuscript, such as references to H5N8 trees on line 63 (unclear whether trees are of the HA segment, NA segment or the inadvisable concatenation of the two), hemagglutinin and neuraminidase subunits on lines 69 and 70 (they are segments, subunits are what proteins have), sequence degeneracy on line 93 (the opposite of saturation is low diversity, not degeneracy), information content of HA and NA trees on line 121 (two trees are always sufficient to infer reassortment), consistency of trees as strong evidence of phylogenetic analysis validity on line 131 (tree consistency indicates that the segments have a similar history and says nothing about the validity of the analysis), envelope segments on line 133 (to my knowledge only Retroviruses possess a surface protein called envelope) and reassortment in Flaviviruses on line 231 (Flaviviruses cannot reassort because their genomes are on a single RNA strand). The conclusion is rather short, the second paragraph of which is basically the same thing repeated over and over again.

Experimental design

The reporting of trees is extremely unhelpful. All trees are shown as cladograms and thus only indicate the topology of the tree. Dalby writes that this was done for clarity on line 83 but it achieves the opposite effect. Branch lengths allow everyone to see how much evolution has occurred on each branch and thus how robust some of the inferences are, especially in light of reporting on how much evolutionary change has occurred in trees on line 107 without supporting evidence. It is never made clear whether the trees have been rooted or not, and without branch lengths it is impossible to tell whether they are. Although not a major flaw of the study, nor a problem unique to this manuscript, the use of a parameter-rich GTR+I+G nucleotide substitution model is questionable. Model testing, as it is done today, is based on a circular argument (the tree with a given model has the highest likelihood, therefore the model is used to reconstruct the tree) and ignores identifiability problems when it comes to the combination of Gamma-distributed rate heterogeneity AND invariant sites. Gamma-distributed rate heterogeneity takes care of slowly (or non-) evolving sites, so the addition of invariant site estimation combines two models that are explaining the same variation.
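The overlap between the Gamma and invariant-site components can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the shape values (0.2 and 2.0) and the "nearly invariant" threshold (1% of the mean rate) are hypothetical choices, not values from the manuscript, and the function names are made up for this example. It computes how much of a mean-1 Gamma rate distribution sits close to zero.

```python
import math

def regularized_lower_gamma(a, x, terms=200):
    """Series expansion of P(a, x), the regularized lower incomplete gamma function."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (a + n + 1)
    return total * math.exp(-x + a * math.log(x)) / math.gamma(a + 1)

def fraction_nearly_invariant(alpha, threshold=0.01):
    """Fraction of sites whose relative rate is below `threshold`
    when rates follow a Gamma distribution with shape alpha and mean 1."""
    # A mean-1 Gamma has shape alpha and rate alpha, so CDF(r) = P(alpha, alpha * r).
    return regularized_lower_gamma(alpha, alpha * threshold)

# Hypothetical shape values, chosen only to contrast strong and weak rate heterogeneity.
strong_heterogeneity = fraction_nearly_invariant(0.2)
weak_heterogeneity = fraction_nearly_invariant(2.0)
print(f"alpha=0.2: {strong_heterogeneity:.3f} of sites evolve at <1% of the mean rate")
print(f"alpha=2.0: {weak_heterogeneity:.5f} of sites evolve at <1% of the mean rate")
```

With a small shape parameter a large share of sites is already treated as effectively unchanging, which is the same signal a separate invariant-sites proportion would try to absorb. This does not prove non-identifiability; it only shows why the two parameters can trade off against each other.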

Validity of the findings

In the manuscript Dalby describes the rise of an avian influenza A virus subtype H5N8, which has recently caused a sustained outbreak in Korea. The author finds that the combination of H5 and N8 segments in avian influenza A viruses has arisen multiple times independently rather than circulated cryptically in birds as a single genomic lineage. I have no problems with the overall findings - I think the divergence between the H5s and the N8s that have ended up reassorting together is sufficient to infer numerous origins of the subtype. What I disagree with are the details surrounding each independent origin of the subtype. Some very bold claims are made in the absence of any clear evidence that would be available to the reader, for example that the origin of the Californian quail H5N8 subtype is unambiguous when it is actually quite the opposite, given the phylogenetic position of the sequence or that the Thailand 2012 H5N8 neuraminidase clusters with H3N8 neuraminidases when it does nothing of the sort.

Comments for the author

I think this manuscript could easily be improved by:

1. Showing maximum likelihood trees with clear rooting and actual branch lengths.

2. The direction and context of each reassortment should be explicitly tested using an appropriate model - e.g. BEAST with discrete traits of location, host and subtype (as appropriate) - to support the various proposed hypotheses for the origins of subtype H5N8.

3. Clean up the language - use the correct terms agreed upon in the literature.

4. Show full trees of all HA and NA sequences indicating where H5N8 viruses are.

I would strongly advise the author to implement these suggestions before attempting to submit this manuscript elsewhere.

So from the comments to the author:

1) Is trivial and ok. Actually, with branch lengths the trees are a whole lot harder to read, and the key argument of the paper is about reassortment, which depends on clades and not branch lengths, but this is a minor point. This is cosmetic and not grounds for more than revision.
2) This is not going to happen. There are 4007 sequences, so this would take large amounts of computer time and give you nothing new or significant in identifying which clades H5N8 can be found in. Putting in subtypes and locations would actually be over-fitting the data to the model, a very bad statistical error, because you leave no variables to test your model against. This would be an example of Bode's Law: put all the empirical data into the model and you have no free variables left.
3) Agreed, but again these are minor changes.
4) They are in the supplementary materials and always were, but referees don't look. Figures 5-13 are parts of this complete tree. Version 3 will just have the full H5 and N8 trees and go to F1000. There will be no anonymous referees and it will be published first.
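The argument in point 1, that reassortment is read from clades rather than branch lengths, can be sketched with toy trees. Everything here is hypothetical: the strain names, the branch lengths and the nested-list tree format are invented for illustration and are not the trees from the paper. An internal node is a list of (child, branch_length) pairs and a leaf is a strain name.

```python
def leaves(node):
    """Return the set of leaf names under a node, ignoring branch lengths."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child, _length in node))

def clades(node):
    """Return every clade in the tree as a frozenset of leaf names."""
    if isinstance(node, str):
        return set()
    found = {leaves(node)}
    for child, _length in node:
        found |= clades(child)
    return found

# Hypothetical HA trees: same topology, very different branch lengths.
ha_short = [([("strain_A", 0.01), ("strain_B", 0.02)], 0.01), ("strain_C", 0.30)]
ha_long = [([("strain_A", 0.40), ("strain_B", 0.50)], 0.20), ("strain_C", 0.90)]

# Hypothetical NA tree: strain_A now groups with strain_C instead of strain_B.
na_tree = [([("strain_A", 0.02), ("strain_C", 0.03)], 0.01), ("strain_B", 0.25)]

# Branch lengths do not change which clades exist.
same_clades = clades(ha_short) == clades(ha_long)

# Clades present in one segment's tree but not the other flag candidate reassortment.
incongruent = clades(ha_short) ^ clades(na_tree)
print(same_clades)
print(sorted(sorted(clade) for clade in incongruent))
```

In this toy case the two HA trees give identical clade sets, while the HA and NA trees disagree about which strain is the sister of strain_A, and that disagreement is what a reassortment argument rests on. Real data would also need support values before such a conflict is taken seriously.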

Regarding the point on the California quail sequence: it is ambiguous if you think that the H5N8 trees are telling you anything, but the point of the paper is that they aren't. So it is completely unambiguous that this does not contain the H5 from Goose Guangdong, and it is in NO WAY connected to the H5N8 sequences from Korea, regardless of what the location and chronology suggest (that is why doing what is suggested in comment 2 is a very bad idea).

The point about Quang Ninh is partially true: it is part of an amorphous clade that includes H10N8 isolated at the same but also mixed types. The ancestral sequence to this clade is most definitely an H3N8 from Vietnam, and H3N8 or H6N8 are the sources for almost all of the N8 sequences.

Flaviviruses do not reassort, as they are not segmented, but they definitely undergo recombination, which is equivalent. It is an analogy and not a homology, but sometimes metaphors are not clear. Again, this is easily removed. The point of the analogy is the wider consideration that lineage has no meaning if there are multiple subtypes with the same lineage and subtypes with multiple lineages. What does the word lineage mean? How are we going to define it other than in some arbitrary way based on distances in a phylogenetic tree?


Sunday, 31 January 2016
