Friday, 10 April 2015

False Discovery Rate vs p-values

The problem with p-values is that the conventional threshold is based on little more than an aside made by Fisher, who said that any data more than 2 standard deviations away from the mean would be unusual. So the cult of the p-value was born.

Look at a Bayesian analysis of a rare disease (rare here meaning about 1 case in 10,000). Suppose you have a test that correctly identifies 99% of true cases and has a false positive rate of 1% (the analogue of a p-value threshold of 0.01). You will still have a large number of cases where you identify the disease when it isn't actually present.

So for example, in a population of 1,000,000 you may have 100 cases, of which your very good test finds 99 and misses only one. But it also flags 9,999 of the 999,900 healthy people as cases that are not actually real. Your false cases vastly outnumber your real cases. So if you are diagnosed, your probability of having the disease is 99/10,098, or just under 1%, and that is with a p-value of 0.01!
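The arithmetic above can be checked with a few lines of Python. This is only a sketch of the worked example: the function name and its parameters are my own, and the counts are the ones used in the paragraph above.

```python
def diagnosis_odds(diseased, healthy, sensitivity, false_positive_rate):
    """Return (true positives, false positives, P(disease | positive test))."""
    true_pos = diseased * sensitivity            # real cases the test finds
    false_pos = healthy * false_positive_rate    # healthy people flagged anyway
    ppv = true_pos / (true_pos + false_pos)      # chance a positive is real
    return true_pos, false_pos, ppv


# Population of 1,000,000 with 100 real cases, as in the example above.
true_pos, false_pos, ppv = diagnosis_odds(
    diseased=100,
    healthy=1_000_000 - 100,
    sensitivity=0.99,
    false_positive_rate=0.01,
)

print(f"true positives:  {true_pos:.0f}")
print(f"false positives: {false_pos:.0f}")
print(f"P(disease | positive) = {ppv:.4f}")
```

Changing `diseased` shows how quickly the picture improves as the disease becomes more common, which is the whole point: the same test gives very different answers depending on the prior.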

The same thing happens if I do multiple tests. If I set a significance threshold of 0.05, which is quite typical, each test has a 1/20 chance of showing a result when nothing is there. So if I do 20 tests where there is no real effect, on average 1 of them will show significance. This is easily corrected by Bonferroni's method, which divides the threshold by the number of tests, amongst others.
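Here is a rough sketch of those numbers. It assumes the 20 tests are independent and that there is no real effect in any of them; the variable names are mine.

```python
alpha = 0.05   # per-test significance threshold
m = 20         # number of tests

# Expected number of false positives when every null hypothesis is true.
expected_false = m * alpha

# Chance of at least one false positive across all m independent tests.
fwer = 1 - (1 - alpha) ** m

# Bonferroni: test each hypothesis at alpha / m instead of alpha.
bonferroni_alpha = alpha / m
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** m

print(f"expected false positives:   {expected_false:.2f}")
print(f"P(at least one), no fix:    {fwer:.3f}")
print(f"Bonferroni per-test cutoff: {bonferroni_alpha:.4f}")
print(f"P(at least one), corrected: {fwer_bonferroni:.3f}")
```

Without correction the chance of at least one spurious hit is roughly 64%. With Bonferroni it drops back to just under 5%, at the cost of a much stricter cutoff for every individual test.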

In genetic analysis you get complex problems with tens of thousands of variables that are all tested simultaneously. You also have massively under-powered studies because your sample numbers are small, so sample size << number of variables. So you will always be doing many more tests than the data can justify, and you will almost always be over-fitting the model.

This is why the false discovery rate is so important in genetic analysis. But when you look deeply at it, even with careful use of FDR, most genetic results from big data experiments will still turn out to be wrong.
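To make FDR control concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one common way of controlling the false discovery rate. Choosing it is my assumption, as nothing above names a particular method. The p-values are made up purely for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Made-up p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh = benjamini_hochberg(p_values, q=0.05)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
naive = [p <= 0.05 for p in p_values]

print("uncorrected hits:", sum(naive))
print("Bonferroni hits: ", sum(bonferroni))
print("BH (FDR) hits:   ", sum(bh))
```

On this toy data the uncorrected threshold calls five results significant, Bonferroni keeps one and Benjamini-Hochberg keeps two. FDR sits between the two extremes: it accepts that some fraction of the discoveries will be false in exchange for finding more of the real ones.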

Here are the professionals talking about false discovery rates.
Selective Inference and False Discovery Rate I
Selective Inference and False Discovery Rate II
Estimating Local False Discovery Rate in Differential Expression
Interpreting p and q values in Genetic Analysis
