The Accidental Statistician: The state of Influenza Phylogenetics

I was not particularly interested in viral phylogenetics for most of my research career; I only started working in the area when I joined the Institute of Animal Health. My background is in synthetic organic chemistry and protein crystallography, so I am well aware of how vicious and petty academic politics can be (ask me someday about the MRC skiing anecdote, the Ribosome Nobel Prize story, the Rubisco saga, or Nicolaou and Taxol).

But I can honestly say that I have never met a more political, corrupt and inept field than influenza phylogenetics. It is staggering how bad it is, and that might be what makes it interesting. I come to it from a statistics perspective, as I spent five years on my Damascene transformation in the statistics department at Oxford. I am very interested in bad science: people doing science badly in order to get grants and power, but who really have no idea what they are doing. I was inspired by the work of John Ziman and by the growth of the field of reproducibility. Scientists like to suggest that the reproducibility problem only applies to the social sciences, but psychology is acknowledging that it has a problem too, and biology in general definitely has one. In viral phylogenetics there is a lot of bad science.

First I want to set out the main problems I have:

Most biologists in the field have no idea what they are talking about from a theoretical perspective. They don't get the maths, they do not understand the assumptions, and they ignore any results that do not fit their expectations instead of asking themselves why those results happened.
Sampling is terrible. It is all convenience sampling, and a convenience sample cannot avoid being biased. Why are we focused on China when we collect almost nothing in Africa and the last swine flu pandemic came from Mexico?
People hoard data and do not collaborate. There are three main databases, all of which hold data in different formats with different annotations that you might or might not want. They seem designed to make it difficult for researchers to use data from all three.
Peer review is intensely political. There are cliques which cite each other's papers excessively and which block publication by other groups. There are some government laboratories that have more than 50 citations on a scientific note that says ... well, not very much. Everyone is playing the citation game to keep their government laboratory funding.
There is a lack of communication between the analysts and the laboratory scientists. For example, how many analysts know that the virus was often passaged through hen's eggs for amplification before sequencing? The result is that the virus acquires chicken-specific mutations, so the sequences submitted to the database are not the same as those in the original sample. This is exactly the same as the cowpox vaccine for smallpox (it is NOT cowpox; Jenner got lucky), and the same as the problem with cell culture, where the cultured cells no longer have the genotypes originally collected.

Why do I think these are problems?

I told a referee that phylogenetics is just a clustering method based on a metric. They assured me that this is not always true: parsimony and maximum likelihood, says the referee, are not distance-based methods. Except fundamentally they are. To calculate a parsimony score you need an alignment. To get a multiple alignment you typically use progressive alignment based on a guide tree, which is often built with UPGMA, a distance-based method. Distance is fundamental to progressive alignment. Parsimony itself depends on the smallest number of changes, which is a distance. The scoring models you use as evolutionary models for measuring changes and calculating likelihoods are ways of measuring distances. Probability is a measure in the measure-theory sense, and comparing likelihoods is comparing sizes on that scale, which again behaves like a distance. You could use information theory instead, but a difference in information is again a distance. Whichever way you cut it, phylogenetics depends on distances and on groups formed from those distances. It is clustering, it does use metrics, and because it uses metrics it is not completely objective: it is subjective and metric dependent. People who do phylogenetics would do well to read Kaufman and Rousseeuw to understand why clustering needs care and why metrics are very interesting (I have a story about one of the authors as well, which makes me reluctant to suggest reading his book, but it is foundational).
The BOOTSTRAP: I don't know where to start. No biologist seems to have bothered to do some simple experiments to check what it does and what it means with real data. For example, to find out how many bootstrap replicates you need, run it on your data with 50, 100, 150 and 200 replicates and see if it has converged, meaning you get the same support values each time. In my experience 100 is more than enough. All of these referees and experimentalists using 500 or 1000 or 10,000 are wasting their time. If you read Efron's book, he says 100 is empirically often enough although, in theory, the number should be something like the square of the number of sequences. If they had read Efron's book they might grasp the issues. That means going beyond his simplified paper saying what the bootstrap is, and definitely going beyond taking Felsenstein's word for it. There is some fantastic work on this by Susan Holmes, who worked with Efron. This is really great stuff, but it is under-read and poorly understood, so much so that she has moved on to other things.
The need to make everything quantitative. Biology is NOT always quantitative. If I see a clade in a tree and it is monophyletic for a geographical location, I believe that tree. I do not need to put a number on it. I could work out by permutation test how likely it is to get a clade that is monophyletic for a location, but given that the sampling is convenience sampling, that number is probably not going to be meaningful.
Creating trees for distinct host species should make sense. We know, as I mentioned before, that influenza changes rapidly to acquire host-specific mutations when it is introduced to a new host, such as when it is passaged in chicken eggs. We know this experimentally for ferrets as well. Why would a host-specific tree be a bad idea? A referee and an editor thought it was in my H9N2 work, until a paper taking the same approach was published in Nature. Then what I had done was OK, and they allowed my appeal after stalling publication for 18 months across two journals with the same editor at both.
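
The bootstrap convergence check I described above (run with 50, 100, 150 and 200 replicates and see whether the numbers settle) is easy to try. Here is a minimal sketch, not anybody's published method. Everything in it is an assumption for illustration: the toy alignment is simulated, the tree is a plain UPGMA tree on p-distances, and the function names (toy_alignment, upgma_clades, bootstrap_support) are made up. With real data you would swap in your own alignment and tree-building method.

```python
import random
from itertools import combinations


def toy_alignment(rng, n_sites=120):
    """Simulate a small, purely illustrative alignment of 5 sequences in 2 groups."""
    def mutate(seq, rate):
        return [rng.choice("ACGT") if rng.random() < rate else s for s in seq]

    base = [rng.choice("ACGT") for _ in range(n_sites)]
    g1, g2 = mutate(base, 0.03), mutate(base, 0.03)
    return (["".join(mutate(g1, 0.10)) for _ in range(2)]
            + ["".join(mutate(g2, 0.10)) for _ in range(3)])


def p_distance(a, b, cols):
    """Proportion of differing sites over the given column indices."""
    return sum(a[c] != b[c] for c in cols) / len(cols)


def upgma_clades(seqs, cols):
    """Build a UPGMA tree on p-distances; return its clades as frozensets of indices."""
    d = {frozenset((i, j)): p_distance(seqs[i], seqs[j], cols)
         for i, j in combinations(range(len(seqs)), 2)}

    def avg(x, y):
        # UPGMA cluster distance: mean of all leaf-to-leaf distances.
        return sum(d[frozenset((i, j))] for i in x for j in y) / (len(x) * len(y))

    clusters = [frozenset([i]) for i in range(len(seqs))]
    clades = set()
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
        clades.add(x | y)
    return clades


def bootstrap_support(seqs, clade, n_reps, rng):
    """Fraction of column-resampled replicates whose UPGMA tree contains the clade."""
    n_sites = len(seqs[0])
    hits = 0
    for _ in range(n_reps):
        # Resample alignment columns with replacement (the phylogenetic bootstrap).
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        hits += clade in upgma_clades(seqs, cols)
    return hits / n_reps


if __name__ == "__main__":
    aln = toy_alignment(random.Random(2017))
    clade = frozenset({0, 1})  # hypothetical clade of interest
    for n_reps in (50, 100, 150, 200):
        support = bootstrap_support(aln, clade, n_reps, random.Random(n_reps))
        print(f"{n_reps:4d} replicates: support = {support:.2f}")
```

If the support values stop moving by the time you reach 100 or 150 replicates, adding thousands more buys you nothing. If they are still jumping around, that tells you something about your data as well.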


Saturday, 18 November 2017
