Friday, 20 March 2015

Science free Science

When I started protein modelling, we studied one protein at a time and tried to work out in detail how it functioned and what it did, and each lab was a world leader in its protein. Then along came high-throughput and structural genomics.

In some ways this was good, as we had been missing all the connections, but it also meant that a lot of specialist expertise was lost and detailed studies that did not have the high-throughput angle were not funded. Losing that expertise was bad: we lost expertise on transcription factors, and individual protein family databases disappeared, to be replaced by the EBI and NCBI monoliths.

Then, once these early big data adopters (bioinformaticians, as we called them then) found that they were still missing the connections that high-throughput was supposed to give them, the new buzz became systems biology. Really this was just dusting off the work of Bertalanffy and others who had tried before but were data poor and so couldn't find any solutions. I was at a meeting about systems biology at Exeter and there was one glum group: the biological phenotype researchers. They had been doing systems biology all along, and now they were going to be bulldozed by the bioinformatics revolution as the big data people, with no local expertise, demolished their field and took all the funding. The perfect analogy is the local stall (the Mom and Pop store for US readers) when Walmart arrives. So another field got concreted over by the bioinformatics juggernaut. The real sense of systems biology was lost, and it failed because what was needed was a paradigm shift in the way of thinking, whereas big data is rooted in reductionism and computability.

Now they want to get big data from health services because, despite all of this analysis and all the work of the last two decades, we still have no idea what we are doing and we haven't cured cancer. So now the bioinformaticians are moving into health-care data. They are going to concrete over the public health specialists and epidemiologists, as all the big money is going to be directed into these monolithic projects that will yet again fail to find a cure for cancer or to explain how living things work.

So I feel sad for the loss of all those disciplines and all those experts, and I have to include myself among the bioinformaticians. At least I can say I was never a very pushy or successful one. I admire a lot of the people driving the steam-roller. They are good and honest scientists, most of them (there are always exceptions who are more interested in themselves than in science). But I wish they would just stop the Big Data juggernaut for a quiet coffee break and have a think about what we can do, not to wipe these fields out, but to learn from them. The real article that said we had all gone mad was Chris Anderson's The End of Theory in Wired. I like Peter Norvig and his work and ideas, but in quoting him the article goes too far. Big data is good and important, and I am glad we live in the Google world where I no longer have to memorise endless tedious facts. I am glad we have machines that can deal with massive amounts of data, using whatever your favourite result-finding algorithm is, be it SVMs, neural nets, solitons or whatever else gets you excited.

But a former student of mine, who was doing a post-doc at a very prestigious US university after completing a PhD at Cambridge, asked one of the leading bioinformaticians (he has many hundreds of publications) about the biological meaning of the patterns found, how the data had been collected and what the limitations were. He shrugged his shoulders and said he had no idea. That is what we are missing, and that is why Big Data fails: it has no context and no big picture. If I show you a picture of Madonna from the 1990 Blond Ambition Tour and ask you what you see, I would say Madonna in her silly pointy bra outfit. Another generation will not recognise her: some might say Lady Gaga, some might say a singer, some might say a woman in underwear. Knowledge out of context does not work, and Big Data is leading us to Science free Science. It will give us answers like 42 being the answer to Life, the Universe and Everything, but we won't know why.

So let's stop concreting over the epidemiologists, the systemists, the experts on a single protein and the public health specialists, and let's start listening to them rather than to the white noise of high-throughput data.

Monday, 2 March 2015

Amazon Vine and helpful votes.

I am a reviewer for Amazon Vine and I also review outside the Vine program. Since joining Vine I have seen my helpful review percentage fall and I have dropped out of the Top 1000 reviewers. So I thought I would check whether there is a statistically significant difference in helpful review percentage between Vine items and non-Vine items, using a simple chi-squared test on the counts of positive (helpful) and negative (unhelpful) votes.

The Chi-square statistic, P value and statement of significance appear beneath the table. Each cell shows the observed count, then the expected count in round brackets, then that cell's contribution to the Chi-square statistic in square brackets.
 
                        positive                negative                Marginal Row Totals
vine                    176 (195.58) [1.96]     52 (32.42) [11.82]      228
non-vine                777 (757.42) [0.51]     106 (125.58) [3.05]     883
Marginal Column Totals  953                     158                     1111 (Grand Total)

The Chi-square statistic is 17.3343. The P value is 3.1E-05. This result is significant at p < 0.05.

That is a very significant difference. There is quite a lot of bias, because I have many more non-Vine reviews, they have been there for much longer than the Vine reviews, and my most helpful review is a non-Vine review. But this does still support what a lot of Vine members say: that some people systematically go around clicking the unhelpful button on any Vine review.
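
For anyone who wants to check the numbers, here is a minimal sketch in Python that recomputes the test from the four vote counts in the table. It assumes a plain Pearson chi-squared test without Yates' continuity correction, since that is what reproduces the statistic quoted above. The function name chi_squared_2x2 and the variable names are mine, purely for illustration, and the shortcut used for the P value only holds for a 2x2 table (one degree of freedom).

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test for a 2x2 table of counts.

    Assumption: no continuity correction is applied.
    Returns (statistic, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared upper tail probability
    # is erfc(sqrt(x / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: vine, non-vine. Columns: positive votes, negative votes.
votes = [[176, 52], [777, 106]]
statistic, p_value = chi_squared_2x2(votes)
print(f"chi-squared = {statistic:.4f}, p = {p_value:.1e}")
```

If you have SciPy installed, scipy.stats.chi2_contingency(votes, correction=False) should give the same answer. By default it applies the continuity correction to 2x2 tables, which gives a somewhat smaller statistic.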