Friday, 2 October 2015

Bayes and Fisher, the two explorers and the Tiger.

This is an update of the joke about the two explorers and the Tiger.

Bayes and Fisher are walking through a forest. Suddenly, they see a tiger in the distance, running towards them. They turn and start running away. Bayes stops, takes some running shoes from his bag, and starts putting them on.

“What are you doing?” says Fisher. “Do you think you will run faster than the tiger with those?”

“I don’t have to run faster than the tiger,” Bayes says. “I just have to run faster than you.”

This was inspired by a referee commenting that I had not proved my hypothesis that H5N8 influenza was being spread by wild birds rather than through domestic poultry. My view was that I didn't have to prove my hypothesis (you can never prove a hypothesis); I just had to show that it was much more likely than the referee's alternative, which I did, although the paper was rejected as usual. In the end the referee was shown to be wrong, and there is now a whole string of papers showing I was right.

Stop trying to put biology in a quantifiable box!

This brings me to my favourite, and sadly spot-on, rant about bioinformatics. I am a bioinformatician and have been one for nearly 20 years. I am not a big name, but I do small things well and carefully and I think a lot around them. I am not that interested in the big projects and the latest bandwagons (the microbiome comes to mind).

There is too much politics, there are too many vested interests, and there are too many inflexible people. Biologists seem to have a paranoia about detailed measurement. They seem to feel inadequate compared to physics, which is much more quantifiable. The thing about biology is that it ISN'T quantifiable, so stop trying to make it so. There always has to be some hand-waving. There is not a single equation in Darwin, and that is the only theoretical framework biology has. All the great population biologists got the maths wrong in some way, because nature is non-linear. Even statistics cannot always help: the fact that the average dog has fewer than four legs demonstrates the point. We need to know about history, despite Popper's misgivings.

So be happy with the hand-waving, and publish and be damned. A lot will be wrong, but some will be right, and evolution will apply the filter. Trying to reason and rationalise and plan will give you the wrong results and waste your time. Get your imperfect ideas out there. Let there be survival of the fittest. You cannot beat evolution.

Apparently the editor is not the only stupid one; it is the general opinion of everyone in phylogenetics, so I must be wrong ...

I appealed the very stupid editor's comments on the bootstrap and, guess what, another methods man agreed with him, so it got rejected again. So here is a little response.

IT IS ABOUT THE BIOLOGY.

I really don't care about the methods, as they are all heuristic approximations with more incorrect assumptions than you can shake a stick at. The trees are right; putting bootstraps on them does not make them any more right. I am telling a story about biology, not about maths.

Now I can tell a story about maths.

Bootstraps were invented by Efron, who wrote a nice book about them with Tibshirani that some of the editors and "experts" in phylogenetics might like to read. We all know that the bootstrap is a resampling method where you resample a data set with replacement. We do this in order to construct confidence intervals for complicated functions by simulation, rather than by an analytical solution, which is often too complex or does not exist.
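Here is a minimal sketch of the idea in Python (the data and the statistic are made up purely for illustration; any awkward statistic would do):

```python
import numpy as np

rng = np.random.default_rng(42)

# Some observed data whose median we want a confidence interval for.
# The median has no simple analytical confidence interval, which is
# exactly the situation the bootstrap was invented for.
data = rng.exponential(scale=2.0, size=100)

n_boot = 10_000
medians = np.empty(n_boot)
for i in range(n_boot):
    # Resample the data WITH replacement, same size as the original sample.
    resample = rng.choice(data, size=data.size, replace=True)
    medians[i] = np.median(resample)

# Percentile method: the 2.5% and 97.5% quantiles of the bootstrap
# distribution give an approximate 95% confidence interval.
lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"median = {np.median(data):.3f}, 95% CI ~ ({lo:.3f}, {hi:.3f})")
```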

Now there are two key points:

1) If your sample is biased, your bootstrap will still be biased and your confidence interval will still be wrong.

2) Resampling creates an independent and identically distributed (i.i.d.) sample: you lose all correlation between variables when you resample (demonstrated in the sketch below).
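A quick demonstration of point 2, on toy data made up just to show the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated variables.
x = rng.normal(size=1_000)
y = x + rng.normal(scale=0.3, size=1_000)
print("original correlation:        ", np.corrcoef(x, y)[0, 1])  # ~0.96

# Resample each variable on its own, as a naive bootstrap might.
# The pairing between x and y is destroyed, so the correlation vanishes.
xb = rng.choice(x, size=x.size, replace=True)
yb = rng.choice(y, size=y.size, replace=True)
print("after independent resampling:", np.corrcoef(xb, yb)[0, 1])  # ~0
```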

Extending these two points to phylogenetics:

In phylogenetics they carry out convenience sampling, i.e. the sample is the set of sequences that someone happens to have collected, but they have no idea if it is a representative or good sample of any kind. If I tried to get convenience-sampling-based research published in almost any field (except phylogenetics), my work would be rejected by most statisticians as wrong. So we suspect that the samples are biased, and if this is true then the bootstrap is not going to tell us much about that bias. In fact Efron and Tibshirani discuss this very problem on p138 of their book, where they say bias estimation is an interesting but very tricky problem.

And if you are using a technique where those correlations define your output, like, say, tree building in phylogenetics, bootstrapping naively is a fairly stupid thing to do. Why would I want to create bootstraps which lose the correlated properties of my data? To do it properly you can read chapter 9 of Efron and Tibshirani, which says that the bootstrap should be applied to the covariate vectors, not to the individual data values. As far as I can tell, the bootstrap implementations in phylogenetics don't do this.
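For what chapter 9 means in practice, here is a minimal sketch of the pairs bootstrap: you resample whole covariate vectors (rows), so each observation stays intact and the correlation structure survives. The toy regression-style data is my own invention for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each row is one observation: a covariate vector (x, y) kept together.
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])
print("original correlation:       ", np.corrcoef(data[:, 0], data[:, 1])[0, 1])

# Pairs bootstrap: resample row indices, never splitting a vector apart.
idx = rng.integers(0, data.shape[0], size=data.shape[0])
boot = data[idx]
print("pairs-bootstrap correlation:", np.corrcoef(boot[:, 0], boot[:, 1])[0, 1])
```

Unlike the independent resampling shown earlier, the correlation here is preserved, because the unit being resampled is the whole vector.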

This is a rejection of a paper where I am talking about the interesting bit of biology, not methods. The interesting bit is that not all H5N8 influenza virus is from the same ancestor. This means that viral subtypes are spontaneously recreated multiple times by reassortment of the viral neuraminidases and hemagglutinins. It also means that when you create a tree of a subtype it might not be homologous, as was found recently in a paper published in Science about dengue, which argues that the whole serotype argument for dengue does not hold water. You see, that is interesting biology, but you miss it while you get tied up in your bootstraps.

As time goes on I am even more convinced that I am right, that the biology I discovered is important and significant, and that the method pedants, like the grammar pedants, are defending an empty palace. They are defending beautiful methods that actually have no relationship to reality.

Monday, 21 September 2015

Is my phylogenetic analysis right? Or why some editors are too stupid for words

I had an editor decide he was going to try and school me on statistics. I had used an approximate method to do a phylogenetic analysis, which he objected to. Where are the bootstraps, he says; why use FastTree and not something else (unspecified, I might add)? I needed to do a more robust and proper phylogenetic analysis.

Now, I have two trees, each with about 1,500 sequences. These are for two genes in the same organism. I created an approximate tree for both and the two trees AGREED. They show the same seven sub-trees that are the key result of the whole paper. That means that from TWO INDEPENDENT SAMPLES I get the SAME POINT ESTIMATE AGREEING SEVEN TIMES. So, naive me, I think that is PROOF BEYOND REASONABLE DOUBT that the two trees are correct. You simply cannot get the SAME wrong answer in two independent datasets seven times, and if I consider the ordering of all of the other sequences they agree in many hundreds of positions.
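To see why chance agreement is absurd, here is a back-of-the-envelope calculation using the standard result that the number of unrooted binary tree topologies on n taxa is (2n-5)!! (the code and variable names are mine, just to make the point):

```python
from math import lgamma, log

def log10_unrooted_trees(n: int) -> float:
    """log10 of (2n-5)!!, the number of unrooted binary tree topologies on n taxa."""
    # Using the identity (2n-5)!! = (2n-4)! / (2**(n-2) * (n-2)!).
    return (lgamma(2 * n - 3) - (n - 2) * log(2) - lgamma(n - 1)) / log(10)

for n in (4, 10, 50, 1500):
    print(f"{n:>5} taxa: about 10^{log10_unrooted_trees(n):.0f} possible trees")
```

With 1,500 sequences the number of possible topologies is so astronomically large that two independent genes recovering the same seven sub-trees by accident is not a serious worry.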

Now, I could do a bootstrap, but this is just a permutation test to check if the algorithm is working properly; it tells you ABSOLUTELY NOTHING about whether you have the correct, biologically sound tree (check out Page and Holmes, Molecular Evolution: A Phylogenetic Approach, for a discussion of the problems of verifying trees, and for the point that if trees from different genes agree they might be right).

Then again, how many bootstraps should I do with 1,500 sequences? It is a permutation test, so lots is a good answer. Looking at a guide from Stata, for a dataset of 448 observations they carried out 4,000 bootstraps (they tried 40,000 as well and it gave about the same result). So for 1,500 sequences I need at least 15,000 bootstraps, and likely even more. So exactly what program can I run on a fairly normal commercial PC that will run 15,000 bootstraps on 1,500 sequences in a reasonable amount of time? Even if I could, what program would then be able to assemble and count those 15,000 trees to prepare a bootstrap tree? And even if I did all that, would it tell me anything more than the two independent trees have already told me about my tree being correct?

Yes, the editor is an idiot. Yes, they have very little idea what they are talking about. Yes, they are too stupid for words.

They are so caught up with the technicalities of bootstraps and maximum likelihood, confidence intervals and prior probabilities that they forget what the ultimate arbiter of a good tree is: does the biology make sense?

Monday, 27 April 2015

Quick to Block

I generally find those who are swift to block aren't worth talking to. In that I include James Delingpole, Guido Fawkes and Damian Thompson. They are the fingers-in-their-ears debaters who like their own voices and their own opinions much more than anybody else's.

If you read the book Emotional Vampires you will recognise them clearly. You will also know that they will never recognise this about themselves and that arguing with them is a waste of breath. They are always right, always perfect; they never make a mistake. I have worked around people like this, where you walk on egg-shells so as not to say the wrong thing or do the wrong thing, and it was the most miserable experience of my life.

It is always disappointing interacting with those you admire.

I have to remind myself that I should not overly admire people, as they often prove to be as flawed as everyone else. I believe strongly in rights for sex workers, and I was disappointed to learn that Tina Fey does a lot of anti-sex-worker "comedy". Today I was being generous and giving her the benefit of the doubt, thinking aloud: well, she may just be delivering the material; she might not be the writer, and if you are on a show you can just be there to deliver the lines and not think too much about the content.

So anyway, Dr Brooke Magnanti had made the allegation, and there was a link to the Saturday Night Live routine. I sent her a tweet suggesting that maybe Fey is just a performer, to which she responded strongly. I have seen her interactions on Twitter before, and I would say she often responds pretty strongly, often with justification, and my tweet certainly annoyed her. So I carefully wrote another saying it is not an excuse, but implying that nobody is perfect, and that if she was a writer as well she had no excuses at all.

So Brooke Magnanti's response was:

Dude. I get you want to argue this but fuck off, she makes jokes about dead bodies of people like me. Begone.

Along, of course, with a block. So I am very disappointed with Tina Fey, but I am also disappointed with Brooke Magnanti, and this is a bigger disappointment to me personally because she was the person I admired that I was alluding to in the title.

Now, with more time to do my research and get an informed opinion rather than just living off tweets, I know that Fey was the chief writer for SNL when they did the French Hooker sketches, and that this is just one of a long line of offences (http://titsandsass.com/category/tina-fey-hates-sex-workers/).

Stoya's article is great as usual. So Fey is bad and didn't deserve the benefit of the doubt; it is a sort of "whatever", as I am not that big a fan. I just wanted to be sure she was not being done an injustice, as I have jumped on too many bandwagons on Twitter, and I was trying to be balanced and thoughtful today. I make plenty of mistakes, and sending Magnanti my other tweet was one of them. Fey is definitely a slut shamer and sex worker hater.

Friday, 10 April 2015

False Discovery Rate vs p-values

The problem with p-values is that they are just based on an aside made by Fisher, who said that any data more than two standard deviations away from the mean would be unusual. So the cult of the p-value was born.

If you look at a Bayesian analysis of a rare disease (rare meaning 1 case in 10,000) where you have a test that is correct 99% of the time for true positives and also has a false positive rate of 1% (the p-value), then you will still have a large number of cases where you identify the disease where it isn't actually present.

So, for example, in a population of 1,000,000 you may have 100 cases, of which your very good test finds 99 and misses only one. But it also flags 9,999 cases that are not actually real (1% of the 999,900 healthy people). Your false cases vastly outnumber your real cases. So if you are diagnosed, your probability of actually having the disease is 99/10,098, or less than 1%, and that is with a p-value of 0.01!
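Here is the arithmetic in one place, using exactly the numbers from the example above:

```python
# Bayesian screening arithmetic for the rare-disease example.
population = 1_000_000
prevalence = 1 / 10_000          # "rare": 1 case in 10,000
sensitivity = 0.99               # fraction of real cases the test finds
false_positive_rate = 0.01       # fraction of healthy people wrongly flagged

cases = population * prevalence                          # 100 real cases
true_pos = sensitivity * cases                           # 99 found
false_pos = false_positive_rate * (population - cases)   # 9,999 false alarms

posterior = true_pos / (true_pos + false_pos)
print(f"P(disease | positive test) = {posterior:.4f}")   # ~0.0098, under 1%
```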

The same thing happens if I do multiple tests. If I set a p-value threshold of 0.05, which is quite typical, there is a 1-in-20 chance of seeing a result when it is not there. So if I do 20 tests, on average one of them will show significance by chance alone. This is easily corrected by Bonferroni's method, amongst others.
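A small simulation of this effect (pure noise in, pure noise out; scipy's two-sample t-test is used purely for convenience):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_tests, alpha = 20, 0.05

# Twenty t-tests comparing two groups of pure noise: no real effect anywhere.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

print("significant at 0.05:         ", np.sum(pvals < alpha))            # ~1 on average
print("significant after Bonferroni:", np.sum(pvals < alpha / n_tests))  # usually 0
```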

In genetic analysis you get complex problems with tens of thousands of variables that are all tested simultaneously. You also have massively under-powered studies, because your sample numbers are small and so the sample size << the number of variables. So you will always be doing many more tests than are justified by the resulting data, and you will almost always be over-fitting the model.

This is why the false discovery rate is so important in genetic analysis, but really, when you look deeply at this, it still means that even with careful use of FDR most genetic results from big-data experiments will turn out to be wrong.
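For the record, here is a minimal sketch of the Benjamini-Hochberg procedure, the standard way of controlling the false discovery rate (the example p-values are made up to mimic a genetics-scale screen):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of discoveries at FDR level q (Benjamini-Hochberg step-up)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * q; everything up to it is kept.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        keep[order[: k + 1]] = True
    return keep

# 9,990 null tests plus 10 genuine signals with tiny p-values.
rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=9_990), rng.uniform(0, 1e-5, size=10)])
print("discoveries:", benjamini_hochberg(pvals).sum())   # close to 10
```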

Here are the professionals talking about false discovery rates.
Selective Inference and False Discovery Rate I
Selective Inference and False Discovery Rate II
Estimating Local False Discovery Rate in Differential Expression
Interpreting p and q values in Genetic Analysis