Saturday, 6 February 2016

Secondary Structure Prediction - do neural networks add anything?

I began in bioinformatics when secondary structure prediction was considered an interesting problem. Now it is considered "solved".

All of the early methods used sliding windows (from Scheraga) and statistical propensities of the amino acids (Chou and Fasman, GOR, etc.). The big step forward was realising that predictions based on alignments, which account for positional variation, were better than predictions from a single sequence. With the massive growth of the databases this has only got better.

Supposedly neural networks also helped improve predictions until we could do no better. The problem is: are the NNs actually detecting any patterns that the straight linear statistics were not? They probably do help to find the amphipathic helices that the window methods struggle with, as these are a periodic pattern, but they cannot deal with beta sheets, as these are non-local and extend beyond the windows. I had a student who constructed a neural network for prediction without any hidden layers and got the same prediction accuracy. This suggests that the neural networks are contributing nothing to the predictions.
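To make the comparison concrete, here is a minimal sketch (not the student's actual code) of the two architectures over a sliding window. With no hidden layer the "network" collapses to a softmax over a linear score, i.e. additive positional propensities as in GOR; the window length and class labels are illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 13      # a typical sliding-window length (an assumption here)
N_CLASSES = 3    # helix, sheet, coil

def one_hot_window(window_seq):
    """Flat one-hot encoding of a window of residues."""
    x = np.zeros(WINDOW * 20)
    for pos, aa in enumerate(window_seq):
        x[pos * 20 + AA_INDEX[aa]] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_no_hidden(x, W, b):
    """No hidden layer: each weight is just an additive positional
    propensity, so this is equivalent to a linear statistical method."""
    return softmax(W @ x + b)

def predict_one_hidden(x, W1, b1, W2, b2):
    """One hidden layer: can in principle capture interactions such as
    the periodic i/i+3/i+4 pattern of an amphipathic helix."""
    h = np.tanh(W1 @ x + b1)
    return softmax(W2 @ h + b2)
```

If training both models gives the same accuracy, the hidden layer is learning nothing beyond the linear propensities - which is exactly what the student's result suggested.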

We know that the codon distribution is optimised so that mutations have a minimal effect on the resulting proteins (Baldi and Brunak, and Andreas Wagner).

We need to repeat the student's experiment and check whether the NNs actually make any difference. We also need to change the amino acid coding scheme to give realistic distances between amino acids rather than Hamming distances.
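The coding-scheme point can be sketched as follows. With one-hot input every pair of distinct amino acids is equally far apart (Hamming distance 2), whereas a substitution-matrix embedding gives graded distances. The scores below are a few BLOSUM62 entries used purely for illustration; a real scheme would use the full 20x20 matrix.

```python
# A handful of symmetric BLOSUM62 log-odds scores (illustrative subset).
SCORES = {("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
          ("L", "I"): 2, ("L", "D"): -4}

def score(a, b):
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def one_hot_sq_distance(a, b):
    """Squared Euclidean distance between one-hot vectors: always 0 or 2."""
    return 0 if a == b else 2

def blosum_sq_distance(a, b):
    """A standard way to turn similarity scores into a squared distance:
    d2(a, b) = s(a, a) + s(b, b) - 2*s(a, b)."""
    return score(a, a) + score(b, b) - 2 * score(a, b)
```

The conservative substitution L to I comes out at 4 while the radical L to D comes out at 18, yet one-hot coding treats both as identical changes - the distortion the NN input scheme would inherit.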

Networks in proteins

A long time ago there was some interest in networks in proteins, and a few articles were published about non-bonded networks. These were interesting, but they just looked at spheres of contacts around atoms to find a power law. The problem is that this power law arises from the cubic relationship between the number of atoms/residues and size - a protein approximates a sphere. What would have been more interesting is to analyse the hydrogen bonding networks.


Friday, 5 February 2016

The open data bun fight: why experimentalists and analysts need to collaborate more.

I can understand why the people who collect the data want to protect their hard work and their grants against research parasites. The problem is that people like me, who have spent their careers working with methods and thinking about the statistics, are probably in a better position to carry out the analysis.

I have seen too many badly designed experiments that have wasted research funds because the experimentalists did not talk to the statisticians or the analysts. They end up with an experiment with so many flaws that it is practically useless.

Collaboration would be good for everyone. Yes, it does mean one or two more names on the paper, but the analysts don't want to be the submitting author so long as their contribution is recognised. Where an analyst/bioinformatician/statistician will get annoyed is if someone treats them like a technician and not an equal party and collaborator. Analysts contribute skills equal to the experimentalists'; they are just different. Neither is more or less important, and I have done both (I was an X-ray crystallographer solving protein structures as well as a molecular modeller using protein structures).

Thursday, 4 February 2016

The network of H5N8 publications.

Before I became obsessed with influenza phylogenetics I was interested in networks. I observed something interesting in the citations of flu papers.

My paper with Munir about H5N8 having a single origin has 5 cites. The Verhagen et al. Science paper - "How a virus travels the world" - has 17 cites, but amazingly the paper with the riveting title "Novel Eurasian highly pathogenic avian influenza A H5 viruses in wild birds, Washington, USA, 2014" has 24 cites.

Looking at Google Scholar for all of the H5N8 citations you have:

Novel reassortant influenza A (H5N8) viruses, South Korea, 2014 EID 69.
So my new research questions are:
How can there be this amount of publication on 120 viral sequences?
How can there be this many citations in less than 18 months?
How can peer review not reject so many papers which cover virtually identical ground?
Why does my paper languish behind all of the others?

This looks to me like classic clique behaviour, and also the rich getting richer. It will be interesting to construct the network of citations and return citations.
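A minimal sketch of that proposed analysis: treat citations as a directed graph and count reciprocated ("return") citations, one simple signature of clique behaviour. The paper names here are invented placeholders, not real citation data.

```python
# Directed citation graph: paper -> set of papers it cites (placeholders).
citations = {
    "paper_A": {"paper_B", "paper_C"},
    "paper_B": {"paper_A"},            # A and B cite each other
    "paper_C": {"paper_B"},
}

def reciprocal_pairs(cites):
    """Unordered pairs of papers that cite each other."""
    pairs = set()
    for src, targets in cites.items():
        for dst in targets:
            if src in cites.get(dst, set()):
                pairs.add(frozenset((src, dst)))
    return pairs
```

Run over the real H5N8 literature, a high density of such reciprocal pairs relative to the field as a whole would be evidence for the clique.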


Failing to share influenza data is not something new

I had complained about the Taiwanese H5N8 data being a pain to get hold of, and I also tangentially mentioned how slow the US was in making its H5N8 sequences available because they wanted to publish first. I have just blogged about GISAID, which I find does not help the problem and in some ways just puts another hurdle in the way of getting the data, but this news story from 2006 is shocking.

The WHO was/is keeping flu data secret, with access to the data restricted to 15 labs. I can see why people in the viral community sometimes throw their hands in the air and go independent like Nathan Wolfe. If you have to fight the WHO to get the data then you need to find another source of data. It should be made clear, as Wolfe does in his books, that eventually we will face a viral catastrophe.

Which virus and when is not clear, but hoarding data and blocking access does not just threaten good science, it threatens lives. The Taiwan H5N8 response was affected by this lack of open sharing and it had a large economic cost. It is vital that the WHO is completely open with its data because a future viral pandemic is a possible existential threat.

The GISAID story

Just reading the Wikipedia article on GISAID is enough to start the alarm bells ringing that this is not all it is supposed to be. Here is the Max-Planck statement on funding.

Why I dislike it is that, for a database that is supposedly open and for sharing, it is surprisingly difficult to actually gain access. It is password-protected and you have to apply for an account. The researchers who publish data there don't do so to enable the world to access it - you would send it to the EBI or NCBI to do that. They deposit there to restrict access and attention. It uses a different structure to the NCBI database, with different segments; it is harder to search and produces search results that are harder to download. It is a nightmare with little reason to exist.

Regarding its initial supposed rationale of preventing unscrupulous big pharma from exploiting influenza vaccine markets, it is interesting that it was inspired not by scientists but by a business investor. Like the SNP database, this is a way of spiking competitors' intellectual property by making it public domain and so no longer an interesting discovery.

This is an interesting news story from last year about an accusation of an academic collaborating with big pharma to smuggle influenza virus and about corruption.

Here is that same researcher talking about GISAID when it first had legal issues in 2006 (at that time I couldn't have cared less about influenza or virus work).

That same researcher was also caught out in a review article where her co-author had self-plagiarised - a paper written around the same time as the 2006 GISAID relaunch.

Wednesday, 3 February 2016

Why do I keep editing/toning down my blog?

Mostly because I do actually believe science works by cooperation and openness. There is always the nuclear option - the kill switch. But throwing your toys around most of the time is not useful. I do hate corruption and there is a lot of it. My dad was on a research funding committee and he was always troubled by the decisions as to who got funded and who got nothing. But what he couldn't stand were the scientists and their incessant lobbying and politics.

Mostly I moderated it because of this good advice from Stephan Lewandowsky and Dorothy Bishop. It is important to be part of the solution, not part of the problem.