Saturday, 6 February 2016

Secondary Structure Prediction - Do neural networks add anything?

I began in bioinformatics when secondary structure prediction was considered an interesting problem. Now it is considered "solved".

All of the early methods used sliding windows (from Scheraga) and statistical propensities of the amino acids (Chou and Fasman, GOR etc.). The big step forward was realising that predictions from alignments, which account for positional variation, were better than predictions from a single sequence. With the massive growth of the databases this has only got better.

Supposedly neural networks also helped improve predictions until we could do no better. The problem is whether the NNs are actually detecting any patterns that the straight linear statistics were not. They will probably help to find the amphipathic helices that the window methods struggle with, as this is a periodic pattern, but they will not be able to deal with beta sheets, as these are non-local and exist outside of the windows. I had a student who constructed a neural network for prediction without any hidden layers and got the same prediction accuracy. This suggests that the hidden layers of the neural networks are contributing nothing to the predictions.
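To make the idea of a network without hidden layers concrete, here is a minimal sketch in Python. It is not the student's code. The window size, the one-hot coding with an extra padding column, the three-state H/E/C labels and the function names are all my assumptions for illustration. With no hidden layer the network is just softmax regression on the window, i.e. a linear statistical model.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC"  # helix, strand, coil (assumed three-state labelling)


def encode_windows(sequence, half_width=6):
    """One-hot encode a sliding window centred on each residue."""
    n_letters = len(AMINO_ACIDS) + 1  # extra column marks padding past the ends
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    width = 2 * half_width + 1
    features = np.zeros((len(sequence), width * n_letters))
    for centre in range(len(sequence)):
        for offset in range(-half_width, half_width + 1):
            pos = centre + offset
            if 0 <= pos < len(sequence):
                letter = index[sequence[pos]]
            else:
                letter = n_letters - 1
            features[centre, (offset + half_width) * n_letters + letter] = 1.0
    return features


def train_no_hidden_layer(features, labels, epochs=200, rate=0.5):
    """Softmax regression: inputs connect straight to the three output units."""
    targets = np.eye(len(STATES))[labels]
    weights = np.zeros((features.shape[1], len(STATES)))
    bias = np.zeros(len(STATES))
    for _ in range(epochs):
        scores = features @ weights + bias
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - targets) / len(labels)  # cross-entropy gradient
        weights -= rate * features.T @ grad
        bias -= rate * grad.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the index into STATES with the highest score for each residue."""
    return (features @ weights + bias).argmax(axis=1)
```

To repeat the experiment you would train this and a network with a hidden layer on the same windows and compare the accuracy on held-out proteins.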

We know that the codon distribution is optimised to make sure that mutations have a minimal effect on the resulting proteins (Baldi and Brunak; Andreas Wagner).

We need to repeat the student's experiment to check if the NNs actually make any difference. We also need to change the amino acid coding scheme to give realistic distances between amino acids and not Hamming distances.
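The problem with the usual coding scheme can be shown in a few lines. This is only a sketch. I use the Kyte-Doolittle hydropathy scale as one example of a property that gives more realistic distances. That choice is my assumption, and any physico-chemical scale or substitution matrix could be used instead.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(aa):
    """Standard 20-bit coding with a single bit set per amino acid."""
    vec = np.zeros(len(AMINO_ACIDS))
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec


def hamming(a, b):
    """Number of positions at which the two one-hot codes differ."""
    return int(np.sum(one_hot(a) != one_hot(b)))


def hydropathy_distance(a, b):
    """Distance between two amino acids on the hydropathy scale."""
    return abs(HYDROPATHY[a] - HYDROPATHY[b])
```

With the one-hot coding every pair of different amino acids is the same distance apart, so isoleucine is as far from leucine as it is from aspartate. On the hydropathy scale isoleucine and leucine are close together and isoleucine and aspartate are far apart.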

Networks in proteins

A long time ago there was some interest in networks in proteins and a few articles were published about the non-bonded networks. These were interesting but they just looked at spheres of contacts around atoms to find a power law. The problem is that this power law arises from the cube relationship between number of atoms/residues and size - a protein approximates to a sphere. What would have been more interesting is to analyse the hydrogen bonding networks.
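The geometric origin of the power law can be checked with a toy model. In this sketch atoms are replaced by points on a uniform cubic lattice. That is my assumption and a crude stand-in for the roughly constant packing density inside a protein. Counting the points inside a sphere of contacts and fitting a line on a log-log plot gives an exponent close to 3 without any network structure at all.

```python
import numpy as np


def atoms_within(radius, half_size=20):
    """Count lattice points within `radius` of the central point."""
    axis = np.arange(-half_size, half_size + 1)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return int(np.count_nonzero(x**2 + y**2 + z**2 <= radius**2))


def scaling_exponent(radii):
    """Slope of log(count) against log(radius) from a least-squares fit."""
    counts = [atoms_within(r) for r in radii]
    slope, _intercept = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope


exponent = scaling_exponent([4, 8, 16])
```

The radii must stay smaller than `half_size`, otherwise the sphere runs off the edge of the lattice in the same way that a contact sphere runs off the surface of a real protein.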

Friday, 5 February 2016

The open data bun fight, why experimentalists and analysts need to collaborate more.

I can understand why the people who collect the data want to protect their hard work and their grants against research parasites. The problem is that people like me who have spent their careers working with methods and thinking about the statistics are probably in a better position to carry out the analysis.

I have seen too many badly designed experiments that have wasted research funds because the experimentalists did not talk to the statisticians or the analysts. They end up with a badly designed experiment with so many flaws that it is practically useless.

Collaboration would be good for everyone. Yes it does mean one or two more names on the paper but the analysts don't want to be the submitting author so long as their contribution is recognised. Where an analyst/bioinformatician/statistician will get annoyed is if someone treats them like a technician and not an equal party and collaborator. Analysts contribute equal skills to experimentalists, they are just different. Neither is more or less important and I have done both (I was an x-ray crystallographer solving protein structures as well as a molecular modeller using protein structures).

Thursday, 4 February 2016

The network of H5N8 publications.

Before I became obsessed with influenza phylogenetics I was interested in networks. I observed something interesting in the citations of flu papers.

My paper with Munir about H5N8 having a single origin has 5 cites. The Science paper by Verhagen et al., "How a virus travels the world", has 17 cites, but amazingly the paper with the riveting title "Novel Eurasian highly pathogenic avian influenza A H5 viruses in wild birds Washington USA 2014" has 24 cites.

Looking at Google Scholar for all of the H5N8 citations you have:

Novel reassortant influenza A (H5N8) viruses, South Korea, 2014 EID 69.

So my new research questions are:
How can there be this amount of publication on 120 viral sequences?
How can there be these amounts of citations in less than 18 months?
How can peer review not reject so many papers which cover virtually identical ground?
Why does my paper languish behind all of the others?

This to me looks like classic clique behaviour and also the rich get richer. It will be interesting to construct the network of citations and return citations.
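As a rough sketch of how the citation network could be built, the code below takes a list of (citing, cited) pairs between research groups and finds the return citations and the number of groups citing each group. The group labels in the example are hypothetical placeholders, not real data, and I have not collected the real citation list yet.

```python
from collections import defaultdict


def build_citation_network(citations):
    """Map each citing group to the set of groups that it cites."""
    cites = defaultdict(set)
    for citing, cited in citations:
        cites[citing].add(cited)
    return cites


def cited_by_counts(cites):
    """Number of distinct groups citing each group (the in-degree)."""
    counts = defaultdict(int)
    for cited_set in cites.values():
        for cited in cited_set:
            counts[cited] += 1
    return dict(counts)


def return_citations(cites):
    """Pairs of groups that cite each other."""
    pairs = set()
    for a, targets in cites.items():
        for b in targets:
            if a != b and a in cites.get(b, set()):
                pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Hypothetical example data: group_A and group_B cite each other.
example = [
    ("group_A", "group_B"),
    ("group_B", "group_A"),
    ("group_C", "group_A"),
    ("group_C", "group_B"),
]
network = build_citation_network(example)
```

A clique would show up as a set of groups that are all joined by return citations, and the rich get richer effect would show up as a very skewed distribution of the cited-by counts.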

Failing to share influenza data is not something new

I had complained about the Taiwanese H5N8 data being a pain to get hold of and I also tangentially mentioned how slow the US was in making its H5N8 sequences available as they wanted to publish first. I have just blogged about GISAID which I find does not help the problem and in some ways is there to put another hurdle into getting the data, but this news story from 2006 is shocking.

The WHO was/is keeping flu data secret with access to the data being restricted to 15 labs. I can see why people in the viral community sometimes throw their hands in the air and go independent like Nathan Wolfe. If you have to fight WHO to get the data then you need to find another source of data. It should be made clear as Wolfe does in his books that eventually we will face a viral catastrophe.

What virus and when is not clear, but hoarding data and blocking access does not just threaten good science, it threatens lives. The Taiwan H5N8 response was affected by this lack of open sharing and it had a large economic cost. It is vital that the WHO is completely open with its data because a future viral pandemic is a possible existential threat.

The GISAID story

Just reading the Wikipedia article on GISAID is enough to start the alarm bells ringing that this is not all it is supposed to be. Here is the Max-Planck statement on funding.

Why I dislike it is that for a database that is supposedly open and for sharing it is surprisingly difficult to actually gain access to it. It has password access and you have to apply for an account. The researchers who publish data there don't do so to enable the world to access it - they would send it to the EBI or NCBI to do that. They deposit there to actually restrict access and attention. It has a different structure to the NCBI database with different segments, it is harder to search and its search results are harder to download. It is a nightmare with little reason to exist.

Regarding its supposed initial rationale of preventing unscrupulous big pharma from exploiting influenza vaccine markets, it is interesting that it was inspired not by scientists but by a business investor. Like the SNP database this is a way of spiking the intellectual property of competitors by making it public domain and so no longer an interesting discovery.

This is an interesting news story from last year about an accusation of an academic collaborating with big pharma to smuggle influenza virus and about corruption.

Here is that same researcher talking about GISAID when it first had legal issues in 2006 (at that time I couldn't have cared less about influenza or virus work).

That same researcher was also caught out in a review article where her co-author had self-plagiarised, a paper written at about the same time as the 2006 GISAID relaunch.

Wednesday, 3 February 2016

Why do I keep editing/toning down my blog?

Mostly because I do actually believe science works by cooperation and openness. There is always the nuclear option - the kill switch. But throwing your toys around most of the time is not useful. I do hate corruption and there is a lot of it. My dad was on a research funding committee and he was always troubled by the decisions as to who got funded and who got nothing. But what he couldn't stand were the scientists and their incessant lobbying and politics.

Mostly I moderated it because of this good advice from Stephan Lewandowsky and Dorothy Bishop. It is important to be part of the solution not part of the problem.

More of the H5N8 story - should data be open?

In December 2014 I wanted to include the North American analysis of the origins of H5N8 when I was writing the paper about the European and Asian outbreaks of H5N8 having a single source. I had heard of the US H5N8 sequences but they were not yet public, so I contacted the US influenza researchers and sent them a draft of the single origin paper that I had submitted to Science. I have already blogged about what happened to the Science paper.

From: Andrew Dalby
Sent: Wednesday, December 17, 2014 4:54 PM
To: APHIS-NVSL Concerns
Subject: H5N8 sequences

Dear Sir/Madam,

I recently read the report about H5N8 in a gyrfalcon in Whatcom County. I see that you have identified it as being related to the Eurasian outbreaks of H5N8. I recently submitted a paper to Science showing that the Japanese and European outbreaks come from a common source and that this has diverged from the Korean sequences during the migration to summer breeding grounds before being spread along the long range migratory pathways. 

I have attached the submitted paper. 

Would it be possible for you to let me know as soon as the nucleotide sequences are available as I would like to add them to my analysis and also add the American migratory pathways to my maps. I have already updated the map to include the Washington case.



Dr Andrew Dalby
University of Westminster

I received the following response.

From: Mia Kim - APHIS
Date: Friday, 19 December 2014 14:31
To: Andrew Dalby
Subject: FW: H5N8 sequences (UK)

Dear Dr. Dalby, 
Thank you for sharing your draft - indeed the H5N8 appears to be a hearty virus. Current information suggests the introduction into North America may represent a separate event from introductions in Europe. The sequences should be available in Genbank by next week and will keep you advised.

All the best,
Mia Kim Torchetti, DVM MS PhD
Avian Viruses Section Head
Diagnostic Virology Laboratory
National Veterinary Services Laboratories
Ames, Iowa, 50010
515-337-7590 (phone)
515-337-7348 (fax)

My final e-mail was to Mia

Friday, 19 December 2014 at 14:52 
Dear Mia, 

Thanks for the e-mail. Looking at the past US cases of H5N8 they are very different in the internal gene sequences to the Eurasian ones. It seems that there was a convergence to a H5N8 serotype via different segment rearrangements. It will be interesting to add the new sequence to the analysis.
Best wishes

There were no more emails and I waited a considerable time, more than a few weeks, for the release of the North American sequences. When they were released there was a large deletion in the original H5 gene, which looked like a sequencing error, and so I never performed the analysis.

Then they produced the paper about the H5N8 phylogenetic analysis in May 2015. That is less than one month after I got the single origin paper published in PeerJ. An anonymous referee delayed the PeerJ paper by asking me to carry out a complete analysis of all of the H5 and N8 genes to show that there was not a reassortment that produced the different outbreaks. The editor Claus Wilke correctly insisted that this was done despite me arguing that this was not common practice. I now think that it should be common practice, but it still isn't. Claus did give invaluable advice by pointing out that this is easy to do using FastTree, which has helped my subsequent research. This actually directed me to thinking about reassortment and the fact that analysis of single subtypes might be flawed.

What concerns me is whether the national laboratories should be releasing the data before they have had time to analyse it, so that "research parasites" like myself can use it. I was e-mailing in the hope of collaboration, as the analysis was already done and was trivial for me to do again. I already blogged about the Taiwanese outbreak of H5N8 and the difficulties in getting the sequence data in a timely manner. I realise that national laboratories need the publications to justify their funding and so need to protect their data, but on the other hand are they then providing a data service?

I am not sure that the national laboratories do not do more harm than good by hoarding and not openly sharing data. I think that they should be co-authors and collaborators with those who carry out the analysis, as their biological work is often the most significant part of the process in terms of time and funds. But I do not think that they should try to do all of the steps of analysis when this is not their function or area of expertise. We are wasting skills and time and not making constructive use of the data.

Tuesday, 2 February 2016

A paper that carries out the H5N8 North American analysis using the same methodology as my rejected paper.

There have been just so many H5N8 papers that I did not pay much attention to this one about H5N8 in North America, but it should have been included as a reference in the reassortment and sporadic outbreaks paper that keeps getting rejected.

This shows that the H5N8 in North America is in multiple clades by doing the exact analysis that referee 1 of version 2 said was flawed. This shows that the comment on a flawed method cannot be true because a paper has been published using the exact method in dispute. I then went on to show why this is incomplete and misses reassortment events but the referee chose to ignore this. Not mentioning this paper was a mistake. However the referee's arguments about wrong methodology are an even bigger mistake. Except I do not think they were a mistake, I think that they were dishonest.

Why subtype might be meaningless for influenza phylogenetics

The different clades in the H5 tree are so intermingled that it is often impossible to say that a clade belongs to a single subtype. There is significant mixing and intermingling of subtypes within the complete H5 tree. Constructing trees based on a single H5 containing subtype such as H5N1 or H5N8 will introduce sampling bias.

An example is presented below.

This clade contains H5N2, H5N1, H5N8, H5N5, H5N9 and mixed H5 containing subtypes. H5N8 and H5N9 subtypes occur randomly within the tree. The likelihood ratios are high for the nodes/splits which indicates that this is a reliable reconstruction of the evolutionary tree and that this seemingly random mixing of subtype is not an artefact of tree construction.
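The mixing of subtypes within a clade like this can be tabulated from the leaf labels. The sketch below assumes that each label ends with the subtype in brackets, as in the usual influenza strain naming. The labels in the example are hypothetical placeholders and not the sequences from my tree.

```python
import re
from collections import Counter

# Subtype in brackets at the end of the strain name, e.g. "(H5N8)".
SUBTYPE = re.compile(r"\((H\d+N\d+)\)\s*$")


def subtype_of(label):
    """Extract the subtype from a leaf label, or flag it as mixed/unknown."""
    match = SUBTYPE.search(label)
    return match.group(1) if match else "mixed/unknown"


def subtype_counts(leaf_labels):
    """Count how many leaves of a clade belong to each subtype."""
    return Counter(subtype_of(label) for label in leaf_labels)


def is_single_subtype(leaf_labels):
    """True only if every leaf in the clade has the same subtype."""
    return len(subtype_counts(leaf_labels)) == 1


# Hypothetical leaf labels for one clade.
clade = [
    "A/duck/place1/1/2006(H5N2)",
    "A/teal/place2/7/2010(H5N1)",
    "A/duck/place3/3/2012(H5N8)",
    "A/gull/place4/9/2013(H5N2)",
    "A/duck/place5/2/2014(mixed)",
]
counts = subtype_counts(clade)
```

Running this over every well supported clade in the complete H5 tree would show how few of them contain only a single subtype.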

This unambiguously shows that to correctly estimate the evolutionary history of H5 you need to sample across all subtypes and that even using geographical or chronological criteria will not produce an unbiased sample. In this case they are all sequences from the Americas but they are found from Guatemala to Alaska and British Columbia and from 2005-2014 which is a very wide range of times and locations.

The conclusion from this result is that we can no longer accept that subtype trees of influenza represent an unbiased sample of lineages or evolution and that all papers that have been published that take this approach for sequence selection in phylogenetic analysis have to be questioned. If analysis is over short time spans such as a single influenza season then these trees are likely to be unbiased because the sampling will be from a specific sub-clade, although these analyses will have limited value for making inferences about viral evolution.

However many previously generated trees and a large part of the existing influenza literature are likely to be flawed because of these sampling issues and these papers need to be revisited urgently with a more complete analysis of the data.