The Accidental Statistician: November 2017

I think it unlikely that I will be submitting to Virus Gene again in a hurry. We had written a few papers that we knew would be unpopular and sent them to a meh-level journal where we expected an easier ride through peer review. The first hint that this wasn't going to be the case was the editor assigned, who happened to have collaborated with a group led by Cattoli, a direct competitor in H9 phylogenetics whom I had insulted previously.

Anyway, they are now in PeerJ and public so that nobody else runs off and starts using USEARCH in flu phylogenetics and claiming priority.

https://peerj.com/preprints/3166/
https://peerj.com/preprints/3396/

I just wanted to put the referees' comments on the first rejected paper here, because they are laughable and, in the context of the referees' comments on the second paper, they are probably wrong or at least not consistent. I have put my responses here because with a straight reject I get no chance to respond to the editor, who is not going to be on my Christmas card list.

Reviewer #1: General comments

The paper presents the method of classification of H9 lineages using clustering and compares the results with classification based on other methods.

The paper would gain if some practical aspects were added. The title suggests the method is fast, so an approximate time of analysis would be useful, especially that cluster analysis after each run is required and repeated clustering if necessary is suggested.

Specific comments:

Introduction

Explain HMM and SVM abbreviations.

Hidden Markov Models and Support Vector Machines

Materials and methods

Please add the information on the chain length in the BEAST analysis.

2 million

Results

Lines 37-43: "USEARCH identified 19 clusters …" - does it refer to H9 HA? It should be indicated in the text to avoid confusion with subtype identification described above.

19 H9 clusters

Lines 43-44: "The subtype originated in Wisconsin in 1966 and this clade continues to be in circulation" Do the Authors mean that H9N2 subtype was first detected in Wisconsin in 1966?

H9 is first detected in 1966 as part of H9N2

Lines 37-39 (2nd page of results): The sentence "The phylogenetic trees…" is confusing, as only fig. 4 shows tree for clade 12 and it was not divided into subclades.

Easily changed

Lines 50-51 (2nd page of results): Were there 3 or 4 subclades of 14 clade identified?

Easily checked

Discussion

First sentence "The clustering of the influenza viral hemagglutinins using USEARCH proved that clustering can correctly identify the viral subtypes from the sequence data" - the subtype identification was partially correct, as it did not detect H15, and H7 was split into two clusters, so this statement should be revised. It would be interesting to mention with which subtype the H15 sequences were clustered.

I can show that H15 separates out at slightly lower identity. H7 forms two adjacent groups, so it is correctly identified. The method gets 14 out of 15 clusters, which is 93% accuracy, so it works. 93% is more accurate than is typical for clustering algorithms.

Lines 27-30 (2nd page of discussion): "…small sub-clades of four or less sequence were merged for phylogenetic analysis…" Please explain it in Results.

You cannot make a tree of fewer than 3 sequences.

Supplementary Figure 3: There are branches labeled with subclade number and some with individual sequence. Please explain it. It is also associated with the comment above.

That would be because labelling a cluster containing one sequence with a cluster name would be stupid. As these clusters were grouped for tree generation, it would be misleading to use the cluster number, but I can edit them to have both.

Table 3 - missing data in the 5th line

No, that does not exist. It is unsupported data in the LABEL paper that is not public and cannot be verified. This was data given to Justin Bahl but not available to anyone else.

Reviewer #2: The automated detection and assignment of IAV genetic data to known lineages and the identification of sequences that don't "fit" existing descriptions is a challenge that requires creative solutions. The authors present a manuscript that proposes a solution to this question and tests it on an extensive H9 IAV dataset.

Though I find the general question intriguing there are a number of issues. The two major items are: a) as a study on the evolutionary dynamics of H9 IAV, this is missing appropriate context, and the results are not adequately presented or discussed; and b) as a tool to identify known and unknown HA, it generates results that appear to be no different to BLASTn, it isn't packaged so that others may use it in a pipeline/web interface/package, and the generated "clusters" aren't linked to any known biological properties. I elaborate on a few of these issues below.

1) This is not a novel algorithm: USEARCH has been in use for over 7 years and it has been previously used in IAV diagnostics. Consequently, I would expect the authors to present a novel implementation of the algorithm (e.g., a downloadable package/pipeline, or an interactive interface on the web) or a detailed analysis and discussion of the evolutionary history of the virus in question. Unfortunately, the authors do not achieve either.

This reviewer is lying. You may search for IAV and USEARCH in Google and you will find NOTHING except the two papers I mentioned, both of which are more recent. It was first used by Webster in 2015, and for a different approach. It is mostly used for analyzing metagenomics projects. It cannot be packaged because, as the paper shows, you have to make decisions about the clustering. It is not just automatic: you have to analyse the appropriate identity and the resulting clustering.

2) The introduction is not adequately structured - after reading, I was left confused as to why dividing the H9 subtype into different genetic clades was necessary, i.e. there is no justification provided for the study. The discussion of clades and lineages is particularly convoluted and given the presented information, it is not clear what the authors are trying to achieve (i.e., they move from identifying subtypes, to identifying clades, to lineages, to reassortment, and all possible combinations). Additionally, there are entire sections of the introduction that consist entirely of unsupported statements (lines 39-48 on alignments and tree inference: lines 52-60 on lineage evolution). This section needs to be revised to provide appropriate context and justification for the study.

The reviewer is obviously completely oblivious as to why you want to carry out lineage analysis in influenza. As such they are not competent to review the paper. As the WHO actually has a working party to create these nomenclatures for H5, this argument is ridiculous.

3) There are many figures associated with BEAST analyses. The goals of these analyses are not introduced, and the trees are not presented or described in any meaningful detail. Further, and more concerning, the presented trees appear to be wrong, i.e. the tip dates are not in accordance with the temporal scale.

That would be because the editor had the number of figures reduced. The BEAST analysis is not particularly important other than to show the consistency of the clustering. If the reviewer had bothered to read the paper, they would see that one of the trees does not use tip dates and is a maximum likelihood tree, so dates WILL NOT be consistent with the temporal scale if there is variation in mutation rate along one of the branches. This is actually an interesting point, as BEAST FAILS completely to generate a reasonable tree with tip dates for that cluster of data. It produces a network with cycles over a wide range of initial parameters.

4) One of the major results (lines 6-16 in the results) is that the USEARCH algorithm can identify the correct subtype of a sequence, most of the time. How does this compare to BLASTn? And, failing to classify a subtype (line 16) is problematic. The authors should consider what the goal of the analysis is, and then present it along with results from similar algorithms, e.g., with the same data, is BLASTn able to identify subtypes?

I am intrigued by how the reviewer thinks that BLASTn works. To do the same task I would need to identify prototypes of each cluster and then use BLASTn to find the rest of the cluster. I would then need to apply some sort of cut-off in order to identify when BLASTn was finding members of other clusters and not the current cluster. In short, this is nonsense. They perform different functions, as USEARCH identifies the clusters and not just related sequences. USEARCH produces the results in about 1 minute. Just to set up the BLAST searches would take 10 times longer than this, and to analyse their results and do the correct partitioning would take hundreds of times longer. The title of the USEARCH paper is actually “Searching and clustering orders of magnitude FASTER THAN BLAST”
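To make the difference between searching and clustering concrete, here is a minimal sketch of a greedy, centroid-based pass at a fixed identity threshold. It is a toy and not USEARCH's actual implementation: the `identity` function assumes equal-length, pre-aligned sequences, and the names `identity`, `greedy_cluster` and the example sequences are all made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence in the first cluster whose centroid it matches at
    >= threshold; otherwise start a new cluster with it as the centroid.
    Returns a list of (centroid, members) pairs."""
    clusters = []
    for s in seqs:
        for centroid, members in clusters:
            if identity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            # No existing centroid was close enough: s becomes a new centroid.
            clusters.append((s, [s]))
    return clusters


# Hypothetical toy sequences, not influenza data.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "AAAAACCCCC"]
for threshold in (0.9, 0.5):
    sizes = [len(members) for _, members in greedy_cluster(seqs, threshold)]
    print(threshold, sizes)
```

The point of the sketch is that the cut-off and the prototypes come out of one pass over the data, and that changing the threshold changes the clusters, which is why the identity still has to be chosen and checked by a person.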

5) I do not understand the significance of USEARCH identifying 19 clusters (line 37); and these data are not linked in anyway to a larger more comprehensive description of the evolutionary dynamics of H9 IAV. The authors should refine their hypothesis, and discuss the results - specifically, if a cluster is identified, what does it mean? What is the significance of the previously unidentified clusters? How closely does this align with phylogenetic methods (and the discussed LABEL)?

Um, really, this is now getting to be a bad joke. The paper compares to LABEL, a method based on totally subjective cluster names created by influenza researchers. The entire discussion carries out exactly what this referee is suggesting in this paragraph. Do they need glasses? Are they suffering from a reading problem? Do they have a brain injury? USEARCH produces some of the clusters from LABEL, faster, more efficiently and correctly. It is completely objective and based on mathematical criteria. There is no bias dependent on convenience sampling because it uses all the data, not just the data a particular lab collects at a particular time. This is a MAJOR step forward in trying to sort out the mess that is influenza nomenclature, and it shows that most existing attempts are biased, partial and use rules that are not appropriate, such as the need for clades to be homogeneous in subtype, e.g. only H9N2 and not other H9-containing subtypes. The hypothesis is that existing nomenclatures are bad: arbitrary, subjective and not based on mathematical rigour. We have proved this in this paper and in two more analyzing H7 and the internal influenza genes. All show exactly the same point: sound maths, rigorous systematic approaches and excellent biological agreement.

Minor comment:

1) Using my laptop, I aligned all non-redundant H9 HAs (n=5888) in ~2 minutes, and inferred a maximum likelihood phylogeny in ~6 minutes. The argument that phylogenetic methods are slow, particularly given modern tree inference algorithms and implementations on HPCs (e.g. Cipres: http://www.phylo.org) is not accurate. Additionally, alignment issues - particularly within subtypes is a trivial issue.

Yippy for you, referee 2. Now put them into clusters. Just edit that tree with 5888 sequences and see how long it takes. Meanwhile USEARCH will have done it after 1 minute, and it will be mathematically correct and not depend on how you cut the trees. Alignments of large numbers of sequences are unreliable. Regardless of this referee stating that this is unsupported, it is actually supported by a very large literature and best summed up in the work of Robert Edgar, who wrote Muscle and who says DO NOT DO LARGE ALIGNMENTS WITHOUT USING MY USEARCH PROGRAM FIRST. But then it is unlikely that referee 2 actually RTFM for the alignment program. I am sure they ran it without bootstrap, and it could not have used tip dates as only BEAST does this.

2) There are a number of methods, e.g, neighbor joining and UPGMA, that use agglomerative clustering methods.

Yes, there are. Well done, referee 2, for being a genius and knowing that actually all of phylogenetics is related to clustering. This is the one and only correct statement that they make. All nomenclature and lineage methods depend on agglomerative methods, but this is a divisive clustering method, which is much less susceptible to convenience sampling. USEARCH is the fastest and best clustering method you can use, and it is divisive and not agglomerative.

My comment is that I have NEVER encountered a more partial, incompetent and ignorant referee than referee two. I think that they protest too much because they have too much invested in current methods such as LABEL, which this paper shows to be at best poor and at worst completely wrong.

I really was not particularly interested in viral phylogenetics for most of my research career, but I started my research in the area when I joined the Institute of Animal Health. I have a background in synthetic organic chemistry and protein crystallography, so I am well aware of how vicious and petty academic politics can be (ask me someday about the MRC skiing anecdote, or the Ribosome Nobel Prize story, or maybe the Rubisco saga, or you can ask me about Nicolaou and Taxol).

But I can honestly say that I have never met a more political, corrupt and inept field than influenza phylogenetics. It is staggering how bad it is, and that might be what makes it interesting. I come to it from a statistics perspective as I spent five years on my Damascene transformation in the statistics department at Oxford. I am very interested in bad science, people doing science badly in order to get grants and power but who really have no idea what they are doing. I was inspired by the work of John Ziman and the growth of the field of reproducibility (scientists like to suggest that this only applies to social sciences although psychology is acknowledging it has a problem too and biology in general definitely has a reproducibility problem). In viral phylogenetics there is a lot of bad science.

First I want to set out the main problems I have:

Most biologists in the field have no idea what they are talking about from a theoretical perspective. They don't get the maths, they do not understand the assumptions and they ignore any results that do not fit with their expectations instead of asking themselves why it happened.
Sampling is terrible. It is all a convenience sample and this cannot avoid being a biased sample. Why are we focused on China when we collect almost nothing in Africa and the last swine flu pandemic came from Mexico?
People hoard data and do not collaborate. There are 3 main databases which all have data in different formats with different annotations that you might or might not want. They are designed to make it difficult for researchers to use data from all three.
Peer review is intensely political and there are cliques which cite each other's papers excessively and which block publication of other groups. There are some government laboratories that have > 50 cites on a scientific note paper that says ... well not very much. Everyone is playing the citation game to keep their government laboratory funding.
There is a lack of communication between the analysts and the laboratory scientists. For example, how many analysts know that the virus was often passaged through hen's eggs for amplification before sequencing? The result was that the sequences mutate to have chicken-specific variations, so the sequences in the original sample are not the same as those submitted to the database. This is exactly the same as the cowpox vaccine for smallpox (it is NOT cowpox, Jenner got lucky) and also the problem of cell culture, where the cell cultures are no longer the genotypes originally collected.

Why do I think these are problems?

I told a referee that phylogenetics is just a clustering method based on a metric. They assured me that this is not always true. For example, parsimony and maximum likelihood are not distance-based methods, says the referee. Except fundamentally they are. To calculate a parsimony you need an alignment. To get a multiple alignment you use a progressive alignment based on a guide tree, which will often use UPGMA, a distance-based method. Distance is fundamental to progressive alignment. Parsimony itself depends on the smallest number of changes, which is a distance. The scoring models you use as evolutionary models for measuring changes and calculating likelihoods are metrics for finding distances. Probability is a measure in the measure-theory sense, and comparing likelihoods is again comparing distances. You could use information theory, but the difference in information is again a metric and a distance. Whatever way you try and cut it, phylogenetics depends on distance and a metric and groups based on those metrics. It is clustering, it does use metrics, and because it uses metrics it is not completely objective: it is subjective and metric dependent. People who do phylogenetics would do well to read Kaufman and Rousseeuw to understand why clustering needs care and why metrics are very interesting (I have a story about one of the authors as well which makes me reluctant to suggest reading his book, but it is foundational).
The BOOTSTRAP - I don't know where to start. Nobody who is a biologist has bothered to do some simple experiments to check what it does and what it means with real data. For example, to know how many bootstraps you need, run it on your data with 50, 100, 150 and 200 replicates and see if it has converged, that is, whether you get the same numbers each time. From my experience 100 is more than enough. All of these referees and experimentalists using 500 or 1000 or 10,000 are wasting their time. If you read Efron's book he says 100 is empirically often enough although, in theory, the number should be something like the square of the number of sequences. If they had read Efron's book they might grasp the issues. That means going beyond his simplified paper saying what the bootstrap is and definitely going beyond taking Felsenstein's word for it. There is some fantastic work on this by Susan Holmes, who worked with Efron. This is really great stuff but under-read and poorly understood. So much so that she has moved on to other things.
The need to make everything quantitative. Biology is NOT always quantitative. If I see a clade in a tree and it is monophyletic to a geographical location, I believe that tree. I do not need to put a number on it. I could work out by permutation test how likely it is to get a clade that is monophyletic to a location, but given the sampling is convenience sampling that is probably not going to be meaningful.
Creating trees for distinct species should make sense. We know, as I mentioned before, that influenza undergoes rapid change to obtain host-specific mutations when it is introduced to a new host, such as when it is passaged in chicken eggs. We know this experimentally for ferrets as well. Why would a host-specific tree be a bad idea? A referee and an editor thought so in my H9N2 work until a paper taking the same approach was published in Nature, and then what I had done was OK and they allowed my appeal, after stalling publication for 18 months and two journals with the same editor in both.
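The bootstrap convergence check mentioned above (run 50, 100, 150 and 200 replicates and see whether the numbers settle) can be sketched in a few lines. This is only an illustration under loud assumptions: the three-sequence alignment is invented, and "support" here is just the fraction of replicates in which sequence A is closer to B than to C by Hamming distance, standing in for the support of a real clade from a real tree-building program.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bootstrap_support(aln, n_reps, seed=0):
    """Resample alignment columns with replacement n_reps times and return the
    fraction of replicates in which 'A' is strictly closer to 'B' than to 'C'."""
    rng = random.Random(seed)
    n_cols = len(aln["A"])
    hits = 0
    for _ in range(n_reps):
        # One bootstrap replicate: n_cols column indices drawn with replacement.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        rep = {name: [seq[i] for i in cols] for name, seq in aln.items()}
        if hamming(rep["A"], rep["B"]) < hamming(rep["A"], rep["C"]):
            hits += 1
    return hits / n_reps


# Hypothetical toy alignment; B differs from A at 2 sites, C at 4 sites.
aln = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGAACGTACCTACGT",
    "C": "ACTTACGAACGAACGTTCGT",
}
for n in (50, 100, 150, 200):
    print(n, round(bootstrap_support(aln, n, seed=1), 3))
```

If the printed support stops moving by more than a few percent as the replicate count goes up, adding more replicates is not telling you anything new for that data set. With real data the same loop would wrap whatever tree-building step you actually use.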

Saturday, 18 November 2017

The Virus Gene Papers

The state of Influenza Phylogenetics