The Accidental Statistician: April 2015

Monday, 27 April 2015

Quick to Block

I generally find those who are swift to block aren't worth talking to. In that I include James Delingpole, Guido Fawkes and Damian Thompson. They are the fingers-in-their-ears debaters who like their own voice and their own opinions much more than anybody else's.

If you read the book Emotional Vampires you will recognise them clearly. You will also know that they will never recognise this about themselves and that arguing with them is a waste of breath. They are always right, always perfect, they never make a mistake. I have worked around people like this where you walk on egg-shells not to say the wrong thing or do the wrong thing and it was the most miserable experience of my life.

It is always disappointing interacting with those you admire.

I have to remind myself that I should not overly admire people as they often prove to be as flawed as everyone else. So I believe strongly in rights for sex workers and I was disappointed to learn that Tina Fey does a lot of anti-sex worker "comedy". Today I was being generous and giving her the benefit of the doubt and so I was thinking aloud, well she may just be delivering the material, she might not be the writer and if you have a show you can just be there to deliver the lines and not think too much about the content.

So anyway, Dr Brooke Magnanti had made the allegation and there was a link to the Saturday Night Live routine. I had sent her a tweet suggesting that maybe Fey is just a performer, to which she responded strongly. I have seen her interactions on Twitter before and I would say she often responds pretty strongly. Often this is with justification, and my tweet certainly annoyed her. So I carefully wrote another saying it is not an excuse, but implying that nobody is perfect, and that if Fey was a writer as well she had no excuses at all.

So Brooke Magnanti's response was:

@ardalby Dude. I get you want to argue this but fuck off, she makes jokes about dead bodies of people like me. Begone.

Along, of course, with a block. So I am very disappointed with Tina Fey. However, I am also disappointed with Brooke Magnanti, and this is a bigger disappointment to me personally because she was the person I admired that I was alluding to in the title.

Now, with more time to do my research and get an informed opinion rather than just living off tweets, I know that Fey was the chief writer for SNL when they did the French Hooker sketches, and that is just one of a long line of offences (http://titsandsass.com/category/tina-fey-hates-sex-workers/).

Stoya's article is great as usual. So Fey is bad and didn't deserve the benefit of the doubt. For me it is a sort of "whatever", as I am not that big a fan. I just wanted to be sure she was not being done an injustice, as I have jumped on too many bandwagons on Twitter and I was being balanced and thoughtful today. I make plenty of mistakes, and sending Magnanti my other tweet was one. Fey is definitely a slut shamer and sex worker hater.

Friday, 10 April 2015

False Discovery Rate vs p-values

The problem with p-values is that they are based on little more than an aside made by Fisher, who said that any data more than 2 standard deviations away from the mean would be unusual. So the cult of the p-value was born.

If you look at a Bayesian analysis of a rare disease (rare meaning around 1 case in 10,000 or fewer) where you have a test that detects 99% of true cases and has a false positive rate of 1% (the equivalent of a p-value threshold of 0.01), then you will still have a large number of cases where you identify the disease when it isn't actually present.

So, for example, in a population of 1,000,000 you may have 100 cases, of which your very good test finds 99 and only misses one. But it also flags 9,999 of the 999,900 healthy people as positive. Your false cases vastly outnumber your real cases. So if you are diagnosed, your probability of having the disease is 99/10098, or just under 1%, and that is with a p-value of 0.01!
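The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in this post: the function name `screening_outcome` and its parameter names are my own, and it assumes the test's sensitivity and false positive rate apply uniformly across the population.

```python
def screening_outcome(population, cases, sensitivity, false_positive_rate):
    """Return (true positives, false positives, P(disease | positive test))."""
    healthy = population - cases
    true_positives = cases * sensitivity
    false_positives = healthy * false_positive_rate
    # Bayes' rule in counting form: real positives over all positives.
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


# Numbers from the post: 1,000,000 people, 100 real cases,
# 99% sensitivity and a 1% false positive rate.
tp, fp, ppv = screening_outcome(1_000_000, 100, 0.99, 0.01)
print(f"true positives:  {tp:.0f}")
print(f"false positives: {fp:.0f}")
print(f"P(disease | positive test) = {ppv:.4f}")
```

With these inputs the probability comes out at roughly 0.0098, which is 99 real cases out of 10,098 positive results.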

The same thing happens if I do multiple tests. If I set a significance threshold of 0.05, which is quite typical, there is a 1/20 chance of seeing a result when it is not there. So if I do 20 tests where there is no real effect, on average 1 of them will show significance. This is easily corrected by Bonferroni's method, amongst others.
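As a rough illustration of the multiple testing problem, the sketch below works out the expected number of false positives and the chance of at least one, before and after a Bonferroni correction. It assumes the 20 tests are independent and that no real effect exists in any of them; the function and variable names are mine.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


alpha, n_tests = 0.05, 20

# On average alpha * n_tests null tests will look "significant".
expected_false_positives = alpha * n_tests
fwer_uncorrected = familywise_error_rate(alpha, n_tests)

# Bonferroni: divide the threshold by the number of tests.
bonferroni_alpha = alpha / n_tests
fwer_bonferroni = familywise_error_rate(bonferroni_alpha, n_tests)

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one), uncorrected: {fwer_uncorrected:.3f}")
print(f"Bonferroni threshold per test: {bonferroni_alpha:.4f}")
print(f"P(at least one), Bonferroni:  {fwer_bonferroni:.3f}")
```

The uncorrected chance of at least one false positive is well over a half, while the Bonferroni threshold brings it back under 0.05. The price is a much stricter threshold for each individual test.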

In genetic analysis you get complex problems with tens of thousands of variables that are all tested simultaneously. You also have massively under-powered studies because your sample numbers are small, so that sample size << number of variables. So you will always be doing many more tests than are justified by the resulting data and you will almost always be over-fitting the model.

This is why the false discovery rate is so important in genetic analysis. But really, when you look deeply at this, even with careful use of FDR most genetic results from big data experiments will still turn out to be wrong.
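This post does not name a particular FDR method, so as an assumption on my part here is a minimal sketch of the Benjamini-Hochberg step-up procedure, which is one common way of controlling the false discovery rate. The p-values are made up purely for illustration, and the function name and the target rate `q` are my own choices.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Illustrative (invented) p-values from 10 simultaneous tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh = benjamini_hochberg(p_values, q=0.05)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]

print("Benjamini-Hochberg rejections:", sum(bh))
print("Bonferroni rejections:        ", sum(bonferroni))
```

On this toy input Benjamini-Hochberg keeps two results where Bonferroni keeps one, which shows the trade: FDR control accepts that a fraction of the reported discoveries will be false in exchange for more power.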

Here are the professionals talking about false discovery rates.
Selective Inference and False Discovery Rate I
Selective Inference and False Discovery Rate II
Estimating Local False Discovery Rate in Differential Expression
Interpreting p and q values in Genetic Analysis