Friday, 10 April 2015

False Discovery Rate vs p-values

The problem with p-values is that they are based on little more than an aside made by Fisher, who said that any data more than 2 standard deviations away from the mean would be unusual. So the cult of the p-value was born.

If you look at a Bayesian analysis of a rare disease (rare here meaning about 1 case in 10,000) where you have a test that is correct 99% of the time for true positives and also has a false positive rate of 1% (the equivalent of a p-value of 0.01), then you will still have a large number of cases where you identify the disease when it isn't actually present.

So for example, in a population of 1,000,000 you may have 100 cases, of which your very good test finds 99 and misses only one. But it also flags 9,999 of the 999,900 healthy people as cases that are not actually real. Your false cases vastly outnumber your real cases. So if you are diagnosed, your probability of having the disease is 99/10098, or just under 1%, and that is with a p-value of 0.01!
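The arithmetic can be checked in a few lines of Python. This is only a sketch of the example's numbers; the variable names are my own, and the 1 in 10,000 prevalence is the figure implied by 100 cases in 1,000,000 people.

```python
# Worked check of the screening example: 1,000,000 people, 100 true cases.
population = 1_000_000
cases = 100                  # 1 case per 10,000 people
sensitivity = 0.99           # fraction of real cases the test finds
false_positive_rate = 0.01   # fraction of healthy people flagged anyway

healthy = population - cases
true_positives = round(cases * sensitivity)
false_positives = round(healthy * false_positive_rate)

# Probability of actually having the disease given a positive test.
ppv = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives}")
print(f"false positives: {false_positives}")
print(f"P(disease | positive) = {ppv:.4f} ({ppv:.2%})")
```

The ratio 99/10098 comes out at roughly 0.98%, so a positive result from this "99% accurate" test still leaves you about 99% likely to be healthy.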

The same thing happens if I do multiple tests. If I set a p-value threshold of 0.05, which is quite typical, there is a 1/20 chance of seeing a result when it is not there. So if I do 20 tests, on average 1 of them will show significance by chance alone. This is easily corrected by Bonferroni's method, amongst others.
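A minimal sketch of that calculation follows. It assumes the 20 tests are independent and that there is no real effect in any of them; the variable names are illustrative only.

```python
alpha = 0.05   # per-test significance threshold
n_tests = 20   # number of tests carried out

# Average number of "significant" results when nothing real is there.
expected_false_positives = alpha * n_tests

# Chance of at least one false positive across all the tests.
family_wise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni: divide the threshold by the number of tests.
bonferroni_alpha = alpha / n_tests
family_wise_error_corrected = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"chance of at least one:   {family_wise_error:.3f}")
print(f"Bonferroni threshold:     {bonferroni_alpha:.4f}")
print(f"chance after correction:  {family_wise_error_corrected:.3f}")
```

Under these assumptions the chance of at least one false positive is about 64% before correction and falls back below 5% once each test has to pass the stricter 0.0025 threshold.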

In genetic analysis you get complex problems with tens of thousands of variables that are all tested simultaneously. You also have massively under-powered studies because your sample numbers are small, so sample size << number of variables. So you will always be doing many more tests than are justified by the data and you will almost always be over-fitting the model.

This is why the false discovery rate is so important in genetic analysis. But when you look deeply at this, even with careful use of FDR most genetic results from big data experiments will still turn out to be wrong.
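To make the idea concrete, here is a sketch of one widely used FDR procedure, the Benjamini-Hochberg step-up method. The post does not say which FDR method is meant, so the choice of procedure is my assumption, and the function name and the p-values are invented purely for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is declared a discovery."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below k * q / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank

    # Everything ranked at or below that k is a discovery.
    discoveries = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            discoveries[idx] = True
    return discoveries


# Made-up p-values from ten hypothetical tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

raw_hits = sum(p <= 0.05 for p in p_values)
bonferroni_hits = sum(p <= 0.05 / len(p_values) for p in p_values)
bh_hits = sum(benjamini_hochberg(p_values, q=0.05))

print(raw_hits, bonferroni_hits, bh_hits)
```

With these invented numbers the uncorrected threshold calls five results significant, Bonferroni keeps one and Benjamini-Hochberg keeps two, which shows where FDR control sits between the two extremes.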

Here are the professionals talking about false discovery rates.
Selective Inference and False Discovery Rate I
Selective Inference and False Discovery Rate II
Estimating Local False Discovery Rate in Differential Expression
Interpreting p and q values in Genetic Analysis

Friday, 20 March 2015

Science free Science

When I started protein modelling we studied one protein at a time and tried to work out in detail how it functioned and what it did and each lab was a world leader in that protein. Then along came high-throughput and structural genomics.

In some ways this was good as we had missed all the connections, but it also meant that a lot of specialist expertise was lost and detailed studies that did not have the high-throughput angle were not funded. Losing that expertise was bad: we lost expertise on transcription factors, and individual protein family databases disappeared, to be replaced by the EBI and NCBI monoliths.

Then once these early big data adopters, bioinformaticians as we called them then, found that they were still missing the connections that high-throughput was supposed to give them, the new buzz became systems biology. Really this was just dusting off the work of Bertalanffy and others who had tried before but were data poor and so couldn't find any solutions. I was at a meeting about Systems Biology at Exeter and there was one glum group: the biological phenotype researchers. They had been doing systems biology all the time and now they were going to be bulldozed by the bioinformatics revolution as the big data people with no local expertise demolished their field and took all the funding. The perfect analogy is the local stall (Mom and Pop store for US readers) when Walmart arrives. So another field got concreted over by the bioinformatics juggernaut. The real sense of systems biology was lost and it failed because what was needed was a paradigm shift in the way of thinking, and big data is rooted in reductionism and computability.

Now they want to get big data from health services because despite all of this analysis and all the work of the last 2 decades we still have no idea what we are doing and we haven't cured cancer. So now the bioinformaticians are moving into health-care data. Now they are going to concrete over the public health specialists and epidemiologists as all the big money is going to be directed into these monolithic projects that will yet again fail to find a cure for cancer or understand how living things work.

So I feel sad for the loss of all those disciplines and all those experts and I have to include myself among the bioinformaticians. At least I can say I was never a very pushy or successful one. I admire a lot of the people driving the steam-roller. They are good and honest scientists, most of them (there are always exceptions more interested in themselves than science). But I wish they would just stop the Big Data juggernaut for a quiet coffee break and have a think about what we can do not to wipe these fields out, but to learn from them. The real article that said we had all gone mad was Chris Anderson's The End of Theory in Wired. Now I also like Peter Norvig and his work and ideas, but the way he is quoted there goes too far. Big data is good and important and I am glad we live in the Google world where I no longer have to memorise endless tedious facts. I am glad we have machines that can deal with massive amounts of data, using whatever your favourite result finding algorithm is, be it SVMs, neural nets, solitons or whatever else gets you excited.

But a former student of mine, who was doing a post-doc at a very prestigious US University after completing a PhD at Cambridge, asked one of the leading bioinformaticians (he has many hundreds of publications) about the biological meaning of the patterns found, about how the data had been collected and what the limitations were, and he shrugged his shoulders and said he had no idea. That is what we are missing and that is why Big Data fails, because it has no context and no big picture. If I show you a picture of Madonna from the 1990s Erotica Tour and ask you to say what you see, I will say Madonna in her silly pointy bra outfit. Another generation will not recognise her: some might say Lady Gaga, some might say a singer, some might say a woman in underwear. Knowledge out of context does not work and Big Data is leading us to Science free Science. It will give us answers like 42 is the answer to Life, the Universe and Everything, but we won't know why.

So let's stop concreting over the epidemiologists, the systemists, the experts on a single protein and the public health specialists, and let's start listening to them rather than the white noise of high throughput data.

Monday, 2 March 2015

Amazon Vine and helpful votes.

I am a reviewer for Amazon Vine and I also review outside the Vine program. Since joining Vine I have seen my helpful review percentage fall and I have dropped out of the Top 1000 reviewers. So I thought I would check if there is a statistical difference between the helpful vote percentage for Vine items and non-Vine items. So I did a simple chi-squared test.

The Chi-square statistic, P value and statement of significance appear beneath the table. Each cell shows the observed count, the expected count in parentheses and the cell's contribution to the chi-square statistic in square brackets.

                        positive               negative               Marginal Row Totals
vine                    176 (195.58) [1.96]    52 (32.42) [11.82]     228
non-vine                777 (757.42) [0.51]    106 (125.58) [3.05]    883
Marginal Column Totals  953                    158                    1111 (Grand Total)

The Chi-square statistic is 17.3343. The P value is 3.1E-05. This result is significant at p < 0.05.
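For anyone who wants to reproduce the test, here is a minimal sketch using only the Python standard library. It assumes a plain Pearson chi-squared test with no continuity correction, which is what the statistic above matches; the variable names are my own.

```python
import math

# Observed helpful (positive) and unhelpful (negative) votes.
observed = {
    "vine": {"positive": 176, "negative": 52},
    "non-vine": {"positive": 777, "negative": 106},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {
    col: sum(observed[row][col] for row in observed)
    for col in ("positive", "negative")
}
grand_total = sum(row_totals.values())

# Sum (observed - expected)^2 / expected over the four cells.
chi_square = 0.0
for row, cells in observed.items():
    for col, count in cells.items():
        expected = row_totals[row] * col_totals[col] / grand_total
        chi_square += (count - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom, for which the upper tail
# probability of the chi-squared distribution is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.4f}, p = {p_value:.1e}")
```

The closed form for the p-value only holds for 1 degree of freedom, so a larger table would need a proper chi-squared distribution function from a statistics library.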

That is a very significant difference. There is quite a lot of bias, because I have many more non-Vine reviews that have been there for much longer than the Vine reviews and my most helpful review is a non-Vine review, but this does still support what a lot of Vine members say: that some people systematically go around clicking on the unhelpful review button for any Vine review.

Friday, 20 February 2015

Why I will never trust Science again.

On the 7th of December I submitted a paper about the spread of H5N8 bird flu via bird migration to Science. A pdf version of the file can be found here. Then I waited and went off for my Christmas vacation. That is why I did not see the final reply from Science until the beginning of January.


Biomedical Science
University of Westminster
Westminster None W1W 6UW

Dear Dr. Dalby

Manuscript number: aaa3940

Thank you for submitting your manuscript "The European and Japanese outbreaks of H5N* derive from a single source population that has been dispersed along the long distance bird migratory flyways. " to Science. Because your manuscript was not given a high priority rating during the initial screening process, we have decided not to proceed to in-depth review. The overall view is that the scope and focus of your paper make it more appropriate for a more specialized journal. We are therefore notifying you so that you can seek publication elsewhere.

We now receive many more interesting papers than we can publish. We therefore send for in-depth review only those papers most likely to be ultimately published in Science. Papers are selected on the basis of discipline, novelty, and general significance, in addition to the usual criteria for publication in specialized journals. Therefore, our decision is not necessarily a reflection of the quality of your research but rather of our stringent space limitations.


Caroline Ash, Ph.D.
Senior Editor

That was fine but the timing was a bit unfortunate and so delayed the paper being sent out to another more specific journal. I was happy with the paper but it was borderline in significance and Science has a lot more important manuscripts to publish.

So I sent it to Emerging Infectious Diseases (EID) on the 9th of January in a modified form, with some typos removed and a switch of emphasis to the epidemiology as that is what they need. The new manuscript for EID I sent is here. Again the paper was rejected, on the 30th of January, because it does not really fit with EID, which wants manuscripts about diseases that affect human health, and in this case it looks like it will only be an avian disease.

I was reading the Science weekly e-mail and saw that they were going to publish a paper on the spread of H5N8 by migratory birds in their Insights column. It is available here. This was even reported by the BBC. So it seems it was a more important story than I had thought.

Reading this I was rather angry that this was published and that my paper had been rejected as it comes to the same conclusions and so I wrote a short and quite angry e-mail to the editor of Science complaining about ethics and precedence. This was the reply which is in the name of Caroline Ash.

Dear Dr Dalby
Thank you for your message. I understand your concern that we should publish an item on the same topic as yours shortly after having  rejected your report. However, I should clarify. The Verhagen piece is published in the Insights section of Science and is therefore intended as commentary without data. Your paper was submitted as a formal research report with data that would normally be subject to peer review.
We receive many excellent papers, but we are limited in the number and subject areas we can pursue in each section of the journal and find ourselves rejecting the majority. Although we decided against in-depth review of your paper we enjoyed reading it and unless you have submitted elsewhere would encourage you to try our new journal Science Advances:
I hope this information is of some help and I am sorry your experience at Science was disappointing.
Kind regards
Caroline Ash
Caroline Ash
Senior Editor, Science;
ASI Science International, 82-88 Hills Road, Cambridge, UK, CB2 1LQ
+44 1223 326500;

So anything we read in the Insights section of Science should be taken with a pinch of salt, because it does not contain data and it is only a commentary. Anyway, I went through the Science paper with a fine-toothed comb and found a few errors that raise concerns. First, there is a different lineage in circulation in North America and, more seriously, the reference cited to support the figure, and in fact the main conclusion about migration, was wrong. So I submitted a letter to Science pointing out these faults.

So I expected them to take this seriously as the error in the reference fundamentally undermines the paper and there is no alternative reference that collects the data that supports the figure and conclusion other than my own paper which is in PeerJ preprints. So I was fairly astounded by their reply.
Dear Dr. Dalby,

Manuscript number: aaa8769

Thank you for sending a Letter to Science. We have read your contribution but will not be able to publish it.  We invite you to leave an online comment instead.  To leave a comment, go and find the published paper to which your comment refers.  Then click Leave a Comment to submit.  Online comments should be no more than 300 words.  Excerpts from comments are occasionally published in the print Letters section of Science.

Note that we will post a correction to the reference you mention.

Please do not reply to this email, as it will not be read by Science. Unfortunately, the volume of submissions precludes specific discussions about individual submitted letters.


Jennifer Sills

I had asked for Caroline Ash to act as Editor on the letter submission as she had been the one who responded to my earlier e-mail and had been the signatory on the original rejection. I would say that my experience with Science goes beyond disappointing.  I would say that at this moment I am extremely angry with their behaviour. I would encourage anyone reading this to only support open access publication and transparency in peer review.

Sunday, 18 January 2015

Books I have read

When Bacon wrote his dictum about books there were so few that you could not be overpowered by their number. Now more than ever we are overwhelmed by writing and his dictum has become much more significant.

Now we have to distinguish the mundane from the profound, we have to distinguish the books that have an impact on the reader from the general background noise.  Not only are some books more significant, but they also take on a life of their own, moulding the experiences and beliefs of the reader.  Each book has its own time, it has order and it has age.

I keep a list of all of the books that I have read so that I can try and unpick the influences that they have on me. Now I have realised that just knowing what I have read is not enough. I need to know when I read it and in what order. For the last three years I have kept an ordered list, and pushed my reading to 50 books a year.

For example, I read Brave New World in my 30s and for me it was a profound book because it struck a chord with my age, my experiences and the world in which I lived, but I doubt that the experiences of anyone else would put it into the same context. Reflecting on it, I think it is a book that is more likely to resonate with older readers with a wider range of experiences, and I think that I would have appreciated it less if I had read it as a teenager.  The same is true of Borges's Labyrinths. Now for me it is an amazing book, but I do not think I would have grasped its many different layers and themes if I had read it in my teens or 20s. Reading it later I find that it has so many hidden ideas that make it a greater work of thought than many works of philosophy.

Saturday, 10 January 2015

The Downton Delusion

Downton Abbey seems to be a national obsession but I can't quite understand it myself. Why do we enjoy watching a generation of injustice, inequality and wasted opportunities and celebrate them as if they were the "Good Old Days"? There was nothing good about them but they seem to be the golden age to the British.

My grandma died this week. She was 95, which is a good age, and she had lived through a lot of change. She got her first passport to come to my wedding in Spain when she was 81. She was an impressive and strong woman but I have to think what she might have been. Her parents were servants to the aristocrats. Her dad was the chauffeur and I think that her mother was the upstairs maid. She was born to Downton parents, those that lived at their masters' wishes. My grandma went to school at Wyggeston Girls Grammar School where she was even a prize winner. I do not know when she left school and with what qualifications, but this was a time when women were still not encouraged to continue their education. She had her teenage years in the depression and her early twenties were the war years. What saddens me most is what she could have been if she had been born in 1979 and not 1919.

That is what we are celebrating with Downton, the inequality that wasted potential like my Grandma's. The wealthy had their great houses and the aristocracy had their protected lives, because some ancestor had done some favour to some monarch. But why, because my ancestors were successful, should I expect to be successful as well? They had their privilege and success built in. This was first weakened by Lloyd George's "People's Budget", but mostly the end of the inequality and injustice was a consequence of the settlement after the Second World War. The returning soldiers and the women who had fought and worked at home wanted a different world and so the Downton Age passed. There were still pockets left but the 60s and 70s took care of them. I still remember the sense of deference amongst the farmers to the old aristocrats, who had once been their landlords.

Now we have this return to idolising the gilded Downton Age when we are in the middle of another depression caused largely by the same families who caused the last one. The aristocrats who we thought had been vanquished just became the much more diffuse "Establishment". Why do we idealise a time that was so unfair, so unjust and so unequal? We love Agatha Christie, where Miss Marple and Hercule Poirot come from the same Downton world with trips on the Orient Express, or cruises down the Nile. We fantasize about a life that did not exist for most people and forget that it was the nineties and noughties when we really "Never had it so good". Now we are entering another age of inequality and instead of fighting against it we are embracing it. It seems that we want to be back in Downton, knowing our place and doffing our caps again.

Continuing the theme on Evolutionary Biology

Yesterday I was tumbling ideas around in my head but there is something important I missed and that is Monod - Chance and Necessity. Although philosophers have attacked it, it does contain a kernel of truth and the beginnings of an important theme. That is that evolution is random but selection makes necessary choices. So we have some developments which will happen over and over again such as eyes, long necks and photosynthetic systems and then we have accidents that are not required by the environment.

This theme is developed by Ian Stewart and Jack Cohen in all of their books and also by Murray Gell-Mann in The Quark and the Jaguar, where he talks about the amount of information needed to describe a system. Things that have to happen and that are homogeneous across a set require less description than the unique properties that are heterogeneous.

Where we lack theory is in describing the unique heterogeneous events. We struggle with heterogeneous entropy, or with any such systems, because there are no statistical descriptions of the unique. You cannot average them. This is where we need to build our theories, on the edge of maths.


Monod - Chance and Necessity
Ho - The Rainbow and the Worm
Gell-Mann - The Quark and the Jaguar
Stewart and Cohen - Figments of Reality
Stewart and Cohen - The Collapse of Chaos
Stewart, Pratchett and Cohen - The Science of Discworld I-IV