The Accidental Statistician: 2016

Saturday 24 September 2016

Dawkins' and Pinker's gene

There are two big problems with Dawkins' and Pinker's interpretation of the gene as a heritable "element" that affects phenotype. First, it is most often not an element but a system of non-local and interacting elements. Second, and most importantly, it will not follow Mendelian genetics. It will not be segregating and discrete. There will be a myriad of variations depending on how the system responds to the environment it finds itself in.

Mendel was lucky with the characteristics he chose to examine, and when you do have genes that segregate then you do have a gene as described by the molecular biologist and geneticist, of the type that Pinker rails against. They are just different alleles corresponding to an expressed region of DNA or possibly their cis regulatory regions. They are not the nefarious and indeterminate objects defined by Pinker and Dawkins. If we are to take their views seriously then we have to go back to before the modern synthesis and try again.

Now that we know that most of the genome can be defined as loosely functional, even if just in terms of spacing between coding regions, then perhaps we do need to look at what the term gene means.

Thursday 22 September 2016

I finally understand what Dawkins means by gene

I was reading a short article by Steven Pinker in the book "This Idea Must Die". There Pinker was talking about how molecular biologists have a different, very restrictive view of what a gene is. They only consider the protein encoding region of the DNA as the gene.

That was a Eureka moment as I finally understand what Dawkins was trying to say. He shares exactly the same view as Pinker. To him a gene is a heritable element that produces a phenotype. This is a much older view of the gene than the view I was brought up with. It predates knowing anything about DNA at all.

To the atomistic and DNA based molecular biologists and geneticists this means the sections of DNA that produce the protein that is responsible for the phenotype. That piece of DNA when expressed causes the phenotype. This is why the molecular biologists got such a shock when they found that there were only 20-30 thousand protein expressing segments, genes in their words, in the human genome. This looks the same as Dawkins' gene but it is completely different. Dawkins, because he knows very little about genetics and molecular biology, is living in the world view from before the modern synthesis which linked DNA to genes. Pinker shares the same anti-reductionist perspective, even though both would consider themselves materialist and reductionist scientists.

Their view of the gene would include all of the regulatory elements, both local and non-local in the genome. It would also include all of the mechanisms for regulation and post-translational modification, for localisation and for every other modifier that affects the process of taking that section of DNA or those multiple sections of DNA to produce a phenotype. In Dawkins' view there are no multi-gene effects to produce a phenotype because the genetic atom is actually that complete system that relates DNA and phenotype.

That is what makes me so strongly critical of Dawkins' work because he has no appreciation of the system at the molecular level. I work with proteins and how they fold and I even dislike DNA. I see the disconnect between the DNA code that can be mutated and the proteins that it produces. There is a huge non-linearity in their connection. The effects of mutations are almost impossible to predict. But if you take Dawkins' and Pinker's way of specifying a gene just as a heritable element then their writing makes a lot more sense.

It makes more sense but they still ignore the fundamental problem with this view, which is that these "atomic entities", these genes, are not atomic. They are overlapping, intertwined, non-local and non-linear systems that cannot be approximated by some atomic genetic theory. In each cell type the network of connections between regulatory elements and expressed regions is different, and that is not even considering spatial effects.

In their world each cell type would have its own set of genes, because each has its own phenotype and its own system of expression. Even each of my tissues would become a collective organism and an animal would become a collective of collectives. It is this decision to ignore the relationships between the parts and to impose an artificial genetic atomism on these heritable elements that makes it unrealistic as a view. Playing with my sons' Lego makes it clear. I have all those bricks which are the genes of the model. But unless I put those bits together in the right way I never get my car or my space ship. If you don't think at the systems level you can never understand biology. Atomism and reductionism are doomed to failure.

Saturday 3 September 2016

Big Government: should Amazon and Starbucks pay more tax?

Yesterday I posted about beggar my neighbour and why the Ireland/Apple tax case matters for democracy and stability. Today Amazon and Starbucks are the focus of attention. These are two more in a very long list that will also include Google, Vodafone, Microsoft and many others who use their global clout to minimise regulation and taxation.

What was amusing was the posts on social media by neo-cons claiming that the companies are justified because governments waste money, and so they should keep avoiding the tax.

What do governments spend their money on? A lot of it is social security and a lot of that is pensions (much more than unemployment in the UK). So shall we cut pensions because Amazon and Starbucks don't pay up? Should those Daily Mail reading baby boomers who support the neo-con illusion get shafted by their own stupidity? Should we allow them to poke themselves in the eye? Sounds good to me but maybe not.

What else does the government pay for? Healthcare is a big spend as well. We could allow Amazon and Starbucks to use their tax avoided cash to invest in sponsored hospitals and to reproduce the philanthropy of Carnegie or Rockefeller. Look at Oracle and the billions of Larry Ellison as an example: he used all that cash to build - the most expensive racing yacht in history. So maybe expecting billionaires to give away their money is not such a good idea (I know Bill Gates has done amazing things and George Lucas and Warren Buffett as well but they do not run countries).

The government also spends money on defence. From an evidence based view this is often a waste of money and the social media post is correct. Britain is building two stupid carriers to fight the types of war that no longer exist against enemies that are no longer there. We are about to renew nuclear weapons that nobody will ever use and that are also a waste of time. Oddly enough I suspect that the person who made the media post would say that this is NOT a waste of government money as the neo-cons are easily deceived by Eisenhower's military industrial complex that sells what nobody needs at an extortionate price.

Then there is education. We could all do with a lot less of that so that we can all be as stupid as the Daily Mail readers and the neo-con social media enthusiasts. That keeps people from questioning. Yes you need to train an elite to run your business and keep globalisation going but an ignorant population is good for business.

What about the infrastructure paid for by taxes? The roads etc. Well there is lots of mismanagement of funds there, but it is caused by the neo-con push to privatise all services and to have the market find the best price. Just ask Halliburton how this works for them in the US and ask any local government how it has worked out in the UK. Higher price, poorer service and don't mention PPI.

So yes Amazon, Starbucks etc. should be paying taxes and while sometimes government does waste money it is a lot better than the alternative.

Friday 2 September 2016

Beggar my neighbour: Apple's Tax Problem in Ireland

There is a well known economic rule called "beggar your neighbour". It is important in behavioural economics when you consider the model of the repeated Prisoner's dilemma. In that case beggar my neighbour represents the defection strategy. There is also a connection to companies seeking countries with the minimum regulation/taxation. This is when the companies are defecting.

Companies have a duty to shareholders which, in the short term and when you do not expect there to be a repeat of the circumstances, means that defection is the preferred strategy, and politicians often think the same way. This is sub-optimal capitalism. It is sub-optimal because in reality we have longer term interactions and repeat business which are undermined by defection. Axelrod has shown that the best strategy, as proposed by Rapoport, is Tit-for-tat. You respond to defection by defecting and then you go back to a position of trust. Trust is the essential feature that makes the system optimal. You have to maximise the trust to reduce the costs of regulation and defection.
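The difference between a one-off encounter and repeat business can be sketched with a small simulation of the repeated Prisoner's dilemma. This is only an illustration: the payoff values (5, 3, 1, 0) are the conventional textbook numbers and are an assumption here, and the function names are made up for the example.

```python
# Assumed, conventional payoffs: temptation 5, reward 3, punishment 1, sucker 0.
# Keys are (my move, their move); values are (my payoff, their payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, other_history):
    """Cooperate on the first move, then copy the other player's last move."""
    return other_history[-1] if other_history else "C"


def always_defect(own_history, other_history):
    """The 'beggar my neighbour' strategy: defect every time."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a repeated game and return the total scores (a, b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, tit_for_tat))      # (30, 30)
    print(play(tit_for_tat, always_defect))    # (9, 14)
    print(play(always_defect, always_defect))  # (10, 10)
```

With these assumed payoffs the defector wins the single pairing against Tit-for-tat (14 against 9), but two Tit-for-tat players who keep trusting each other both end up far ahead of two defectors (30 each against 10 each).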

At a national level a defecting country is one that offers a lower level of regulation and taxation compared to all the other countries as business will move to that country and not pay taxes where they actually are active. This is how the Swiss canton of Zug has become the European HQ of many multi-nationals. Zug has a tax rate of 5% which is very attractive to global business. Given the size of the canton this minimal amount of tax from a large number of corporations raises more than enough for the infra-structure and services that the canton has to pay for. In fact they should be making a considerable profit. Levels of taxation elsewhere have to be larger because nations are expensive and levels of tax are set to avoid a deficit. This is why Zug is beggaring its neighbours and why Ireland with its Apple tax deal was also beggaring its neighbours by removing tax revenues from other countries where Apple was doing business. Apple was unlucky enough to be the first company that was brought to court but it will not be the last and Google and Amazon are two more big names that stand out.

Ireland with a much smaller economy can survive without all the tax that is owed but this deficit is pushed onto all the other EU nations. That is why this sort of tax deal is illegal and why the UK deal with Vodafone also needs to be investigated. Governments do this because they want to keep the jobs in their countries, but if I did this as a small business or as an individual, even if I only failed to do my duty properly in the expectation of a future benefit, I would be in jail for up to 10 years and face an unlimited fine under UK law. In fact the recent report about the possible dropping of the investigation of BHS in return for Sir Philip Green paying a large sum to the pension fund is also bordering on illegal under the Bribery Act 2010. I begin to see why an Italian mafia judge called the UK the most corrupt country in Europe.

Ireland has been caught cheating but both Ireland and Apple are going to seek to contest the judgement. The only way that you can prevent beggar my neighbour is if we go beyond short term interests or if we promote supra-national agreements. We call these trade deals and although TTIP is a dirty word at the minute there are many others that we rely on every day. The WTO is the largest agreement to make sure that nations do not deliberately cause economic hardship for each other. But the most successful is the European Union and that is why Ireland and Apple in the end have to lose if we are to have any faith in nations and democracies and if we want to live in a world which has not been taken over by corporations.

So why does not being able to beggar your neighbour matter? History shows that wars are usually about resources and as a response to internal economic challenges. Harming another country's economy has political and social consequences and not just economic ones. If we want a more peaceful and equal world we have to get beyond Brexit and Beggar my Neighbour and start understanding the long term benefits of working together.

Monday 20 June 2016

Brexiteers

I have just received the EU Myth Buster from the Brexiteers. Firstly it still has the already proven wrong 350 million a week claim, as well as the nonsense about expansion to new members including Turkey. This is all scare mongering and the gall of the Brexiteers to claim it is Remain that are using economic fear to win the argument is just nonsense.

On the back are the great and the good making their arguments. I wanted to go through them one at a time.

1) Sir Richard Dearlove - former chief of MI6.

"Brexit would bring about two potentially important security gains: the ability to dump the European Convention on Human Rights ... and more importantly, greater control over immigration from the European Union."

Sorry but this makes me very worried when a former head of the security service thinks that not having human rights is a good thing. This is the person who was in charge during 9/11 and then the dodgy dossier fiasco. Regarding his second statement, is the EU a source of terror threats to the UK? The 7/7 attacks were home grown and we do control immigration from outside the EU. He is also on record as stating that the media have exaggerated the threat from Islamic terrorists. So maybe he thinks the French farmers have been radicalised and are a threat?

2) Tim Martin - chairman of Wetherspoons.

"The EU places tariffs on goods from outside the EU, which is bad for British shoppers and the developing world. And the EU forces us to charge VAT on goods, pushing up bills for working families."

By his argument, leaving the EU and becoming one of those countries outside it is going to be good because we will now be subject to the tariffs he is talking about. The levels of taxation are set by governments and they are to make sure that the finances of the country are in good order. The UK sets its own levels of VAT. These had to be increased because of the banking crisis, not the EU. When we face tariffs because we are outside of the EU, tax levels are likely to rise even more.

3) Nigel Lawson - former Chancellor of the Exchequer.

"As Chancellor, I became increasingly aware that, in economic terms, membership of the EU did us more harm than good. Outside the EU, we would prosper, we would be free and we would stand tall."

Sorry Nigel but this is pure rhetoric without a single shred of evidence or valid argument. He was last Chancellor in 1989 before the Berlin wall had fallen. The world has changed a bit since then and certainly the economy has changed beyond all recognition. Free and tall are not economic arguments, they are useful for boxers and cage fighters but they do not secure wealth for nations.

4) John Longworth - Director of the British Chambers of Commerce.

"The EU interferes with UK firms and stacks the rules in favour of a select number of big businesses. It we Vote Leave jobs will be safer. We can have faster growth and greater prosperity in the future."

The EU spends a very large amount of its business spending on what are called SMEs and not the very large firms. Big business and big interests are very effective at lobbying the EU. My dad was a lobbyist for the Cattle Breeders Association and the UK farming industry which is hardly a big business. He would have never dreamed of voting out. The EU actually is more effective than the UK in dealing with the excesses of global business such as Microsoft and Google. These are companies that individual nations do not take on. In fact the UK government is noticeable in its reticence to get involved in taxing many of these businesses (Vodafone is a high profile example) which get special tax deals that are certainly not given to small UK firms. Exactly the opposite to his claims is closer to reality.

5) Gisela Stuart - Labour MP.

"The rights we have won for British workers came from our Parliament, not the EU. The EU is run in the interests of the big corporations who spend billions lobbying to make it work for them."

I think that billions is an exaggeration. Even big business cannot spend that much on a Parliament that has so little legislative authority. I seem to remember the Conservatives opposing EU legislation precisely because it did protect workers' rights and asking for opt-out clauses. I seem to remember the social charter causing uproar from Mrs Thatcher, which does not agree with the MP's claims. It was only finally accepted in 1998 by a Labour government. I was also discussing this issue and where Deutsche Bank would move to if there was Brexit, and the wife of one of their employees said not back to Frankfurt because they are only allowed to work a 37 hour week in Germany because of the EU legislation (that has not been adopted by the UK) whereas her husband has a 67 hour week here.

It is amazing that in 5 quotes there is not a single good reason to Brexit. There is nothing there that is not fabricated, untrue or pure fallacy. If that is the best they have got in their arguments then the Remain camp have very little to worry about.

Wednesday 13 April 2016

Thoughts on Heritability

Using Waddington's idea of canalisation, heritability is related to the width of the canalisation - this is the maximum variation between a pure inbred low phenotype and a pure inbred high phenotype. This is the variance in that characteristic which is attributable to the genes.

This is going to be very hard to measure unless you do it for a genuine population. If you have a non-representative sample then that will have its own variability and the variation in that characteristic will depend on the state of all the different end points you are starting from in a normal mating population carrying all of its history and variation. Think of dogs: the dog population includes all the breeds but the variation within breeds is very low. The same is true of horses, but with humans you have Usain Bolt and me to compare for our 100 metres performance.

The width of the canals is important as it shows the plasticity in response to environment. How far does it need to be pushed to get someone outside the normal bounds? The genes make the landscape and they do not determine the outcome except in pure bred lines where the canals are very narrow.

High heritability could actually signify large canals and low determinism because we are very far from a pure bred line and everyone is close to the middle of a large canal. But it could also indicate low variation and a multi-modal population - mixing of species, comparing apples and oranges. Conversely low heritability measures would mean that there is low variation with very narrow canals and it is very strongly inherited.

The concept of heritability implies that you can have a pure bred line for a specific phenotypic characteristic and so that characteristic must be inheritable. So my reason for questioning the paper rejecting Haidt and heritability is this: why would his classifications be capable of being bred for? Can I breed someone who believes in justice over everything? Can I breed someone who finds disgust in the unclean their biggest driving force? Can any of these higher human constructs be embedded into genes or are they more accurately modelled at the meme/cultural evolutionary level? This is what I meant by there being strong determinism - that you could breed for it and that the genes would determine the effect.

My opinion is that they are at the cultural evolutionary level and only the very simplest of behaviours is canalised at the genetic level. This would be probably things like higher level reasoning, personal identity, desire to reproduce, desire for satisfaction etc. The advantage of the cultural/meme level is that it is NOT Darwinian. It develops during lifetimes and is directly passed on to the next generation. It is Lamarckian and not wasteful random exploration. It builds on what we already have and is rapidly modified. It is much faster than genetic evolution.

Sunday 20 March 2016

Spectator Review of Not in Your Genes

This is the Spectator review by a post-doc in psychology.

Now there are some good references in the beginning but sadly the complaints and understanding also decline as it goes on. In particular it descends into an ad hominem attack against the author and not his work. If you are convinced of your argument why go after James and his media appearances?

The first few references seem fine about birth order and the 10,000 hours effects. Then it goes downhill. So what if 80% of genes are expressed in the brain - this tells you nothing more than that the 20% that are not cannot have any effect on the brain. Why does this mean anything? Lots are housekeeping genes that just keep cells alive.

The ample evidence is a meta analysis of twin studies. The problem with twin studies, if the author had bothered to check, is that epidemiologists now suspect that they might be flawed as the twins have the same exposures and circumstances. I have two children who are not twins but they share lots of views and personality traits because of their upbringing, not because of their genes. It is very hard unless you separate twins to actually get good evidence for genetic as opposed to environmental effects and even then epigenetic effects might be larger than genetic effects (these follow Lamarck and not Darwin - they are "Just So Story" modifications that are directly heritable).

I will need to look at GCTA and think about how to weight that evidence.

Then there is a look at GWAS, where again the review author just talks about a review chapter. I have seen the people who lead some of the major GWAS studies of disease speak and their conclusions are that the effects of genes are SO SMALL THAT THEY CANNOT BE STATISTICALLY DETECTED. The effect sizes are tiny in diabetes, heart disease and lots of less complex phenomena than learning and personality. This is because it is gene interactions that produce the effects and not single genes. This talk was from the same lecturer who cast doubt on the twin studies. The review author might like to read Prof. David Clayton's work.

Then for basic statistics he cites Wikipedia. While I like Wikipedia this is all credibility lost. Power calculations are a tautology. You need to know effect size and population standard deviation in order to calculate the population size that you need to use to detect the effect. This is knowing the answer before you ask the question. The last reference, from the Journal of Irreproducible Results (Nature), is from the News and Views section, an unrefereed section where the great and the good get to spout garbage on a weekly basis.
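The circularity is easy to see if you write a power calculation down. The sketch below uses the standard normal-approximation formula for comparing two group means; the particular numbers, the function name and the default alpha and power are assumptions for illustration, not anything taken from the review.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect)^2.
    Note that both `effect` and `sd` have to be supplied before any data exist.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


if __name__ == "__main__":
    # Assumed effect sizes, in units of the standard deviation.
    print(sample_size_per_group(effect=0.5, sd=1.0))  # 63
    print(sample_size_per_group(effect=0.1, sd=1.0))  # 1570
```

Guess an effect five times smaller and the required sample is twenty-five times larger, so the answer you get out is driven almost entirely by the effect size you assumed going in.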

So apart from the possible evidence from GCTA there is actually as little to support the reviewer's assertions of genetic links as there is for James' assertion that there are no genetic links. The final result is a 0-0 draw.

Life imitates art: Captain America - Civil War

I have been reading Twitter and the battle lines drawn between the "genetics determine personality, ability and psychology" camp and the opposing "genetics tell you nothing" camp. This is not the first time this battle has been fought. Last time we didn't even know very much about genes and we called it Eugenics. That was the Middle Class Victorian English man's attempt to protect his position in the world from the reality that he was nothing special. The second world war and the holocaust finished Eugenics as an acceptable position.

This new battle is much more dangerous especially in the current political climate and with current technology. There is every chance that we are going to up the scales to an existential level of threat and that will prove a Pyrrhic victory for whoever is left.

I am not a fan of Oliver James. I find his popular psychology grating and annoying and many of his arguments facile and unsupported by evidence. His most recent book has been attacked by almost all reviewers and a lot of people I follow on social media. It is titled "Not in Your Genes". The basic point is that nurture and not nature is responsible for all of your personality traits and abilities. You are not where you are because of genetics it is just because of up-bringing.

I have on my shelf a book written by Richard Lewontin, Steven Rose and Leon Kamin with a very similar title "Not in Our Genes" that makes very similar arguments. One of the criticisms of Oliver James is his lack of knowledge of genetics. It is impossible to doubt the genetic credentials of Lewontin. Lewontin et al's book was an attack on the genetic determinism and sociobiology of Dawkins and E.O. Wilson. Now I can seriously dispute the genetics ability of both. Dawkins has no idea about population genetics and molecular biology. As for Wilson, his problem is that he studied the social insects and derives his rules for human social interaction from insect social interaction. While society is therefore clearly something that evolution has to discover (it is a universal in the language of Stewart and Cohen), he is mistaking analogy for homology. We do not have a common social history with the insects. These are two distinct evolutionary solutions to the same problem, and because of this knowledge of one CANNOT be directly extrapolated to the other.

James' view that genetic determinism is not a powerful driver is therefore not unsupported by people who do know their genetics, as the critics might claim. Stephen Jay Gould was another campaigner against the last vestiges of eugenics. Dan Graur is also particularly hostile in social media to psychologists claiming genetic effects or molecular biologists claiming function for all of the genome.

To understand it better we need to go back to the Victorian man who started this all. We need to think of Galton. As well as his work on genetics Galton is famous in statistics for discovering regression to the mean. The problem is that he and many other people failed to understand what it really means. It means that traits are not pure. It means that if someone is exceptional in something, then their progeny are likely to be closer to the mean. This is why so many properties in biology are normally distributed. The normal distribution is the error distribution for a complex system with many confounding variables. That sums up most of the properties of life, intelligence, success, height etc. If our genetic determinants were pure and with strong selection we would see divergence between members of the population until we would have speciation. We might have evolved the Morlocks and Eloi of H.G. Wells' imagination. But we didn't. The strongest evidence that the genetic determinists are wrong is the diversity of humanity that still remains a single cohesive species.
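Regression to the mean can be shown with a toy simulation. This is a sketch under loud assumptions: the trait is standardised to mean 0 and standard deviation 1, and the parent-child correlation of 0.5 is an arbitrary value picked for illustration, not an estimate for any real trait.

```python
import random
from statistics import mean


def simulate(n=100_000, r=0.5, cutoff=2.0, seed=1):
    """Return (mean of exceptional parents, mean of their children).

    Assumed model: parent ~ N(0, 1) and
    child = r * parent + sqrt(1 - r^2) * noise, so the child is also N(0, 1)
    and has correlation r with the parent.
    """
    rng = random.Random(seed)
    parents, children = [], []
    for _ in range(n):
        parent = rng.gauss(0, 1)
        child = r * parent + (1 - r**2) ** 0.5 * rng.gauss(0, 1)
        if parent > cutoff:  # keep only the "exceptional" parents
            parents.append(parent)
            children.append(child)
    return mean(parents), mean(children)


if __name__ == "__main__":
    parent_mean, child_mean = simulate()
    print(f"exceptional parents: {parent_mean:.2f}")
    print(f"their children:      {child_mean:.2f}")
```

The children of the exceptional parents are still above average, but on average they sit well back towards the population mean. Only with a correlation of 1, which is to say a "pure" trait, would they match their parents.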

If the determinists were right that genetics plays a major determining role then we could look at large genetic differences in humans and find differences. The largest difference is that between the genders where there is a Chromosome of differences and we could say that women and men (which the media seem to treat as distinct species) would have different personalities and levels of intelligence. This was an argument for a long period of time as Middle Class Victorian man tried to maintain himself at the top of the social tree. Now we know that if there are differences it is that women are more intelligent and have a better balanced personality than men but how much of this is social and experience and how much is genetics?

Next we could argue that populations that have been separated for long periods of time will have evolved different psychological and intellectual properties under genetic determinism. This leads us to the question of race as these are distinct genetic populations. Do we really want to go there and start asking those questions? Have we learnt nothing from two World Wars, endless genocides and the examples of peaceful migrations and settlement? There are no fundamental differences. Around the edges there are some but these are accidents of history and not universals of evolution.

This brings me back to the title. Science faces this civil war between the determinists and those who believe that nothing is set in stone. That is between the frightened Technologists like Tony Stark and those who have lived in a world where fascism was in the open and not cloaked in making X or Y nation great. So I am definitely on the side of the Captain and against rigid solutionism. This is why it feels to me like the Civil War. I am going to be on the opposite side of the argument to a lot of scientists who I respect and who I would have been allied to in the past.

Humanity works because of diversity and inclusiveness, for all of the bad that happens, what we do well out-weighs this and we must not let the cynics and the spreaders of fear win. We do not want to live under Ultron or Skynet.

Nature AND nurture play a part and the interaction between nature and nurture is the most difficult part to unravel. My intuition tells me that nurture is still the more significant of the two and that genetics at the level of personality and psychology plays a fairly small part. Nature and genetic evolution are very slow movers and slow to change but humanity has changed to be unrecognisable just in my life-time. This difference in time frames for me is the ultimate evidence that most of the arguments for genetic determinism will turn out to be wrong.

Saturday 6 February 2016

Secondary Structure Prediction - Do neural networks add anything?

I began in bioinformatics when secondary structure prediction was considered an interesting problem. Now it is considered as "solved".

All of the early methods used sliding windows (from Scheraga) and statistical propensities of the amino acids (Chou and Fasman, GOR etc.). The big step forward was realising that predictions from alignments, which account for positional variation, were better than predictions from only a single sequence. With the massive growth of the databases this has only got better.

Supposedly Neural Networks also helped improve predictions until we could do no better. The question is whether the NNs are actually detecting any patterns that the straight linear statistics were not. They will probably help to find the amphipathic helices that the window methods will struggle with, as this is a periodic pattern, but they will not be able to deal with beta sheets as these are non-local and exist outside of the windows. I had a student who constructed a neural network for prediction without any hidden layers and got the same prediction accuracy. This suggests that the neural networks are contributing nothing to the predictions.

We know that the codon distribution is optimised to make sure that mutations have a minimum effect on the resulting proteins (Baldi and Brunak and Andreas Wagner).

We need to repeat the student's experiment and to check if the NNs actually make any difference. We also need to change the amino acid coding scheme to give realistic distances between amino acids and not Hamming distances.
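The point about the coding scheme can be made concrete. With the usual one-hot input every pair of amino acids is exactly the same distance apart, so the network is told nothing about which residues are alike. The sketch below compares that with a property-based encoding. The Kyte-Doolittle hydropathy scale is used purely as a stand-in for "realistic distances"; that choice is an assumption for the example, and a serious scheme would use several properties or a substitution matrix.

```python
from itertools import combinations
from math import dist

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Kyte-Doolittle hydropathy values, used here only as an illustrative property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(aa):
    """Standard 20-dimensional one-hot vector for a residue."""
    return [1.0 if aa == other else 0.0 for other in AMINO_ACIDS]


def hydropathy(aa):
    """One-dimensional property encoding (an assumed, deliberately simple choice)."""
    return [KYTE_DOOLITTLE[aa]]


def pair_distances(encode):
    """Euclidean distance between every pair of residues under an encoding."""
    return {a + b: dist(encode(a), encode(b))
            for a, b in combinations(AMINO_ACIDS, 2)}


if __name__ == "__main__":
    flat = pair_distances(one_hot)
    graded = pair_distances(hydropathy)
    # One-hot: isoleucine is as far from leucine as it is from arginine.
    print(flat["IL"], flat["RI"])
    # Property-based: the conservative I/L pair is close, R/I is far apart.
    print(graded["IL"], graded["RI"])
```

Under one-hot coding all 190 pairs have the same distance, whereas the property-based coding puts isoleucine next to leucine and a long way from arginine, which is closer to what a mutation actually does to a protein.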

Networks in proteins

A long time ago there was some interest in networks in proteins and a few articles were published about the non-bonded networks. These were interesting but they just looked at spheres of contacts around atoms to find a power law. The problem is that this power law arises from the cube relationship between number of atoms/residues and size - a protein approximates to a sphere. What would have been more interesting is to analyse the hydrogen bonding networks.
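The geometric origin of such a power law can be illustrated with a toy model. The sketch assumes random points at uniform density inside a unit sphere as a crude stand-in for atoms in a globular protein, and it counts points around the centre only, to avoid surface effects. No protein data or real network is involved, and the function names are made up for the example.

```python
import random
from math import log


def random_point_in_unit_ball(rng):
    """Uniform point in the unit sphere by rejection sampling from a cube."""
    while True:
        x, y, z = (rng.uniform(-1, 1) for _ in range(3))
        if x * x + y * y + z * z <= 1.0:
            return x, y, z


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


def count_scaling(n_points=20_000, seed=0):
    """Exponent of the count-versus-radius relationship for uniform points."""
    rng = random.Random(seed)
    squared_radii = []
    for _ in range(n_points):
        x, y, z = random_point_in_unit_ball(rng)
        squared_radii.append(x * x + y * y + z * z)
    radii = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    counts = [sum(1 for d in squared_radii if d <= r * r) for r in radii]
    return loglog_slope(radii, counts)


if __name__ == "__main__":
    print(f"fitted exponent: {count_scaling():.2f}")  # close to 3 for uniform density
```

An exponent near 3 appears with no network structure at all, just from volume growing as the cube of the radius, which is why a power law in contact counts on its own says little about the protein.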

Friday 5 February 2016

The open data bun fight, why experimentalists and analysts need to collaborate more.

I can understand why the people who collect the data want to protect their hard work and their grants against research parasites. The problem is that people like me who have spent their careers working with methods and thinking about the statistics are probably in a better position to carry out the analysis.

I have seen too many badly designed experiments that have wasted research funds because the experimentalists did not talk to the statisticians or the analysts. They end up with a badly designed experiment with so many flaws that it is practically useless.

Collaboration would be good for everyone. Yes, it does mean one or two more names on the paper, but the analysts don't want to be the submitting author so long as their contribution is recognised. Where an analyst, bioinformatician or statistician will get annoyed is if someone treats them like a technician and not an equal party and collaborator. Analysts contribute equal skills to experimentalists, they are just different. Neither is more or less important and I have done both (I was an x-ray crystallographer solving protein structures as well as a molecular modeller using protein structures).

Thursday 4 February 2016

The network of H5N8 publications.

Before I became obsessed with influenza phylogenetics I was interested in networks. I observed something interesting in the citations of flu papers.

My paper with Munir about H5N8 having a single origin has 5 cites. The Verhagen et al. Science paper, "How a virus travels the world", has 17 cites, but amazingly the paper with the riveting title "Novel Eurasian highly pathogenic avian influenza A H5 viruses in wild birds, Washington, USA, 2014" has 24 cites.

Looking at Google Scholar for all of the H5N8 citations you have:

Novel reassortant influenza A (H5N8) viruses, South Korea, 2014 EID 69.

Characterization of three H5N5 and one H5N8 highly pathogenic avian influenza viruses in China Veterinary Microbiology 60.

Novel reassortant influenza A (H5N8) viruses in domestic ducks, eastern China EID 44.

Highly pathogenic avian influenza virus (H5N8) in domestic poultry and its relationship with migratory birds in South Korea during 2014 Veterinary Microbiology 51

Highly pathogenic avian influenza A (H5N8) virus from waterfowl, South Korea, 2014 EID 25

Outbreaks of avian influenza A (H5N2),(H5N8), and (H5N1) among birds—United States, December 2014–January 2015 MMWR (CDC) 24

Pathobiological features of a novel, highly pathogenic avian influenza A (H5N8) virus Emerging Microbes and Infections 27

A novel highly pathogenic H5N8 avian influenza virus isolated from a wild duck in China Influenza and Other Respiratory Viruses 22

Novel reassortant influenza A (H5N8) viruses among inoculated domestic and wild ducks, South Korea, 2014 EID 23

Comparing introduction to Europe of highly pathogenic avian influenza viruses A (H5N8) in 2014 and A (H5N1) in 2005 Euro-surveillance 17

Novel Eurasian highly pathogenic avian influenza A H5 viruses in wild birds, Washington, USA, 2014 EID 25

Reassortant highly pathogenic influenza A H5N2 virus containing gene segments related to Eurasian H5N8 in British Columbia, Canada, 2014 Scientific Reports 17

Full-genome sequence of influenza A (H5N8) virus in poultry linked to sequences of strains from Asia, the Netherlands, 2014 EID 14

Intercontinental spread of Asian-origin H5N8 to North America through Beringia by migratory birds Journal of Virology 14

Genetic characterization of highly pathogenic avian influenza (H5N8) virus from domestic ducks, England, November 2014 EID 11

Influenza A (H5N8) virus similar to strain in Korea causing highly pathogenic avian influenza in Germany EID 8

Pathologic changes in wild birds infected with highly pathogenic avian influenza A (H5N8) viruses, South Korea, 2014 EID 7

Genetic diversity of highly pathogenic H5N8 avian influenza viruses at a single overwintering site of migratory birds in Japan, 2014/15 Euro-surveillance 7

Characterization of an H5N8 influenza A virus isolated from chickens during an outbreak of severe avian influenza in Japan in April 2014 Archives of Virology (closed access) 6

The European and Japanese outbreaks of H5N8 derive from a single source population providing evidence for the dispersal along the long distance bird … PeerJ 5

So my new research questions are:

How can there be this amount of publication on 120 viral sequences?

How can there be these amounts of citations in less than 18 months?

How can peer review not reject so many papers which cover virtually identical ground?

Why does my paper languish behind all of the others?

This to me looks like classic clique behaviour and also the rich get richer. It will be interesting to construct the network of citations and return citations.
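A minimal sketch of how that network could be built follows. The data here are entirely hypothetical: the groups A to D and their citation pairs are invented for illustration, and the real edges would have to be collected from Google Scholar or a similar source. In-degree gives the "rich get richer" picture and reciprocal pairs are the return citations.

```python
# Hypothetical (citing group, cited group) pairs, invented for illustration.
CITATIONS = [
    ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"),
    ("C", "A"), ("D", "A"), ("D", "B"),
]


def build_graph(edges):
    """Build a directed graph as {citing: set of cited}."""
    graph = {}
    for citing, cited in edges:
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set())  # make sure cited-only nodes exist
    return graph


def in_degree(graph):
    """Count how many distinct groups cite each group."""
    counts = {node: 0 for node in graph}
    for cited_set in graph.values():
        for cited in cited_set:
            counts[cited] += 1
    return counts


def reciprocal_pairs(graph):
    """Return sorted pairs of groups that cite each other."""
    pairs = set()
    for a, cited_set in graph.items():
        for b in cited_set:
            if a != b and a in graph[b]:
                pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


if __name__ == "__main__":
    g = build_graph(CITATIONS)
    print(in_degree(g))
    print(reciprocal_pairs(g))
```

In this toy data group D cites the others but is never cited back, which is the pattern I would look for in the real network.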

Failing to share influenza data is not something new

I had complained about the Taiwanese H5N8 data being a pain to get hold of and I also tangentially mentioned how slow the US was in making its H5N8 sequences available, as they wanted to publish first. I have just blogged about GISAID, which I find does not help the problem and in some ways puts another hurdle in the way of getting the data. But this news story from 2006 is shocking.

The WHO was/is keeping flu data secret with access to the data being restricted to 15 labs. I can see why people in the viral community sometimes throw their hands in the air and go independent like Nathan Wolfe. If you have to fight WHO to get the data then you need to find another source of data. It should be made clear as Wolfe does in his books that eventually we will face a viral catastrophe.

What virus and when is not clear, but hoarding data and blocking access does not just threaten good science, it threatens lives. The Taiwan H5N8 response was affected by this lack of open sharing and it had a large economic cost. It is vital that the WHO is completely open with its data because a future viral pandemic is a possible existential threat.

The GISAID story

Just reading the Wikipedia article on GISAID is enough to start the alarm bells ringing that this is not all it is supposed to be. Here is the Max-Planck statement on funding.

Why I dislike it is that for a database that is supposedly open and for sharing it is surprisingly difficult to actually gain access to it. It has password access and you have to apply for an account. The researchers who publish data there don't do so to enable the world to access it - they would send it to the EBI or NCBI to do that. They deposit there to actually restrict access and attention. It has a different structure to the NCBI database with different segments, it is harder to search and its search results are harder to download. It is a nightmare with little reason to exist.

Regarding its initial supposed reasoning to prevent unscrupulous big pharma from exploiting influenza vaccine markets it is interesting that it was inspired not by scientists but by a business investor. Like the SNP database this is a way of spiking intellectual property of competitors by making it public domain and so no longer an interesting discovery.

This is an interesting news story from last year about an accusation of an academic collaborating with big pharma to smuggle influenza virus and about corruption.

Here is that same researcher talking about GISAID when it first had legal issues in 2006 (at that time I couldn't have cared less about influenza or virus work).

That same researcher was also caught out in a review article where her co-author had self plagiarised, a paper written about the same time as the 2006 GISAID relaunch.

Wednesday 3 February 2016

Why do I keep editing/toning down my blog?

Mostly because I do actually believe science works by cooperation and openness. There is always the nuclear option - the kill switch. But throwing your toys around most of the time is not useful. I do hate corruption and there is a lot of it. My dad was on a research funding committee and he was always troubled by the decisions as to who got funded and who got nothing. But what he couldn't stand were the scientists and their incessant lobbying and politics.

Mostly I moderated it because of this good advice from Stephan Lewandowsky and Dorothy Bishop. It is important to be part of the solution, not part of the problem.

More of the H5N8 story - should data be open?

In December 2014 I wanted to include the North American analysis of the origins of H5N8 when I was writing the paper about the European and Asian outbreaks of H5N8 having a single source. I had heard of the US H5N8 sequences but they were not yet public, and so I contacted the US influenza researchers, sending them a draft of the single origin paper that I had submitted to Science. I have already blogged about what happened to the Science paper.

From: Andrew Dalby [mailto:A.Dalby@westminster.ac.uk]
Sent: Wednesday, December 17, 2014 4:54 PM
To: APHIS-NVSL Concerns
Subject: H5N8 sequences

Dear Sir/Madam,

I recently read the report about H5N8 in a gyrfalcon in Whatcom County. I see that you have identified it as being related to the Eurasian outbreaks of H5N8. I recently submitted a paper to Science showing that the Japanese and European outbreaks come from a common source and that this has diverged from the Korean sequences during the migration to summer breeding grounds before being spread along the long range migratory pathways.

I have attached the submitted paper.

Would it be possible for you to let me know as soon as the nucleotide sequences are available as I would like to add them to my analysis and also add the American migratory pathways to my maps. I have already updated the map to include the Washington case.

Thanks

Andy

Dr Andrew Dalby

University of Westminster

I received the following response.

From: , Mia Kim - APHIS <mia.kim.torchetti@aphis.usda.gov>
Date: Friday, 19 December 2014 14:31
To: Andrew Dalby <A.Dalby@westminster.ac.uk>
Subject: FW: H5N8 sequences (UK)

Dear Dr. Dalby,

Thank you for sharing your draft - indeed the H5N8 appears to be a hearty virus. Current information suggests the introduction into North America may represent a separate event from introductions in Europe. The sequences should be available in Genbank by next week and will keep you advised.

All the best,

Mia Kim Torchetti, DVM MS PhD
Avian Viruses Section Head
Diagnostic Virology Laboratory
National Veterinary Services Laboratories
Ames, Iowa, 50010
515-337-7590 (phone)
515-337-7348 (fax)

Mia.Kim.Torchetti@aphis.usda.gov

My final e-mail was to Mia

Friday, 19 December 2014 at 14:52

Dear Mia,

Thanks for the e-mail. Looking at the past US cases of H5N8 they are very different in the internal gene sequences to the Eurasian ones. It seems that there was a convergence to a H5N8 serotype via different segment rearrangements. It will be interesting to add the new sequence to the analysis.
Best wishes

Andy

There were no more emails and I waited a considerable time, more than a few weeks, for the release of the North American sequences. When they were released there was a large deletion in the original H5 gene, which looked like a sequencing error, and so I never performed the analysis.

Then they produced the paper about the H5N8 phylogenetic analysis in May 2015. That is less than one month after I got the single origin paper published in PeerJ. An anonymous referee delayed the PeerJ paper by asking me to carry out a complete analysis of all of the H5 and N8 genes to show that there was not a reassortment that produced the different outbreaks. The editor Claus Wilke correctly insisted that this was done despite my arguing that it was not common practice. I now think that it should be common practice, but it still isn't. Claus did give invaluable advice by pointing out that this is easy to do using FastTree, which has helped my subsequent research. This actually directed me to thinking about reassortment and the fact that analysis of single subtypes might be flawed.

What concerns me is whether the national laboratories should be releasing their data before they have had time to analyse it, so that "research parasites" like myself can use it. I was e-mailing in the hope of collaboration, as the analysis was already done and was trivial for me to do again. I have already blogged about the Taiwanese outbreak of H5N8 and the difficulties in getting the sequence data in a timely manner. I realise that national laboratories need publications to justify their funding and so need to protect their data, but on the other hand are they then providing a data service?

I am not sure that the national laboratories are not doing more harm than good by hoarding and not openly sharing data. I think that they should be co-authors and collaborators with those who carry out the analysis, as their biological work is often the most significant part of the process in terms of time and funds. But I do not think that they should try to do all of the steps of analysis when this is not their function or area of expertise. We are wasting skills and time and not making constructive use of the data.

Tuesday 2 February 2016

A paper that carries out the H5N8 North American analysis using the same methodology as my rejected paper.

There have been just so many H5N8 papers that I did not pay much attention to this one about H5N8 in North America, but it should have been included as a reference to the reassortment and sporadic outbreaks paper that keeps getting rejected.

This shows that the H5N8 in North America is in multiple clades by doing the exact analysis that referee 1 of version 2 said was flawed. The comment about a flawed method therefore cannot be true, because a paper has been published using the exact method in dispute. I then went on to show why this is incomplete and misses reassortment events, but the referee chose to ignore this. Not mentioning this paper was a mistake. However, the referee's arguments about wrong methodology are an even bigger mistake. Except I do not think they were a mistake; I think that they were dishonesty.

Why subtype might be meaningless for influenza phylogenetics

The different clades in the H5 tree are so intermingled that it is often impossible to say that a clade belongs to a single subtype. There is significant mixing and intermingling of subtypes within the complete H5 tree. Constructing trees based on a single H5 containing subtype such as H5N1 or H5N8 will introduce sampling bias.

An example is presented below.

This clade contains H5N2, H5N1, H5N8, H5N5, H5N9 and mixed H5 containing subtypes. H5N8 and H5N9 subtypes occur randomly within the tree. The likelihood ratios are high for the nodes/splits which indicates that this is a reliable reconstruction of the evolutionary tree and that this seemingly random mixing of subtype is not an artefact of tree construction.

This unambiguously shows that to correctly estimate the evolutionary history of H5 you need to sample across all subtypes and that even using geographical or chronological criteria will not produce an unbiased sample. In this case they are all sequences from the Americas but they are found from Guatemala to Alaska and British Columbia and from 2005-2014 which is a very wide range of times and locations.

The conclusion from this result is that we can no longer accept that subtype trees of influenza represent an unbiased sample of lineages or evolution, and that all papers that have been published that take this approach for sequence selection in phylogenetic analysis have to be questioned. If analysis is over short time spans such as a single influenza season then these trees are likely to be unbiased because the sampling will be from a specific sub-clade, although these analyses will have limited value for making inferences about viral evolution.

However, many previously generated trees and a large part of the existing influenza literature are likely to be flawed because of these sampling issues, and these papers need to be revisited urgently with a more complete analysis of the data.
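The check behind this argument can be sketched in a few lines. This is a toy illustration, not the analysis from the paper: the trees are hypothetical nested tuples with each tip labelled only by its subtype, and the function names are my own. A subtype is only safe to analyse on its own if all of its tips form a single clade in the complete H5 tree.

```python
def leaves(tree):
    """Return the subtype labels of all tips below a node."""
    if isinstance(tree, str):
        return [tree]
    tips = []
    for child in tree:
        tips.extend(leaves(child))
    return tips


def clades(tree):
    """Yield every subtree, including the tips and the root."""
    yield tree
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, subtype):
    """True if all tips of `subtype` form one clade with no other subtype."""
    total = leaves(tree).count(subtype)
    if total == 0:
        return False
    for clade in clades(tree):
        tips = leaves(clade)
        if len(tips) == total and all(tip == subtype for tip in tips):
            return True
    return False


# Hypothetical toy trees, invented for illustration.
MIXED_TREE = ((("H5N1", "H5N8"), "H5N2"), (("H5N8", "H5N9"), ("H5N1", "H5N1")))
CLEAN_TREE = (("H5N8", "H5N8"), ("H5N1", "H5N2"))

if __name__ == "__main__":
    print(is_monophyletic(MIXED_TREE, "H5N8"))
    print(is_monophyletic(CLEAN_TREE, "H5N8"))
```

In the mixed toy tree H5N8 appears in two separate places, so a tree built only from the H5N8 sequences would join lineages that are not each other's closest relatives.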

Sunday 31 January 2016

Second referee's comments - Do lineage and subtype have meaning any more?

The second referee is a bit pedantic which is perhaps important. There are good arguments for very strict use of terms but this is a minor correction. I am a bit hand waving and inexact and so he has some points, but still why anonymous? What are you afraid of?

Basic reporting

The manuscript suffers from sub-standard writing. There's a typo in the text ("creating phylogenetic trees oh H5N8" on line 112), as well as grammar mistakes (line 182, line 229). Some very unusual language is also employed throughout the manuscript, such as references to H5N8 trees on line 63(unclear whether trees are of the HA segment, NA segment or the inadvisable concatenation of the two), hemagglutinin and neuraminidase subunits on line 69 and 70 (they are segments, subunits are what proteins have), sequence degeneracy on line 93 (the opposite of saturation is low diversity, not degeneracy), information content of HA and NA trees on line 121 (two trees are always sufficient to infer reassortment), consistency of trees as strong evidence of phylogenetic analysis validity on line 131 (tree consistency indicates that the segments have a similar history and says nothing about the validity of the analysis), envelope segments on line 133 (to my knowledge only Retroviruses possess a surface protein called envelope) and reassortment in Flaviviruses on line 231 (Flaviviruses cannot reassort because their genomes are on a single RNA strand). The conclusion is rather short, the second paragraph of which is basically the same thing repeated over and over again.

Experimental design

The reporting of trees is extremely unhelpful. All trees are shown as cladograms and thus only indicate the topology of the tree. Dalby writes that this was done for clarity on line 83 but it achieves the opposite effect. Branch lengths allow everyone to see how much evolution has occurred on each branch and thus how robust some of the inferences are, especially in light of reporting on how much evolutionary change has occurred in trees on line 107 without supporting evidence. It is never made clear whether the trees have been rooted or not and without branch lengths it is impossible to tell whether they are. Although not a major of flaw of the study, nor a problem unique to this manuscript, the use of a parameter rich GTR+I+G nucleotide substitution model is questionable. Model testing, as it is done today, is based on a circular argument (the tree with a given model has the highest likelihood, therefore the model is used to reconstruct the tree) and ignores identifiability problems when it comes to the combination of Gamma-distributed rate heterogeneity AND invariant sites. Gamma-distributed rate heterogeneity takes care of slowly (or non-) evolving sites, so the addition of invariant site estimation combines two models that are explaining the same variation.

Validity of the findings

In the manuscript Dalby describes the rise of an avian influenza A virus subtype H5N8, which has recently caused a sustained outbreak in Korea. The author finds that the combination of H5 and N8 segments in avian influenza A viruses has arisen multiple times independently rather than circulated cryptically in birds as a single genomic lineage. I have no problems with the overall findings - I think the divergence between the H5s and the N8s that have ended up reassorting together is sufficient to infer numerous origins of the subtype. What I disagree with are the details surrounding each independent origin of the subtype. Some very bold claims are made in the absence of any clear evidence that would be available to the reader, for example that the origin of the Californian quail H5N8 subtype is unambiguous when it is actually quite the opposite, given the phylogenetic position of the sequence or that the Thailand 2012 H5N8 neuraminidase clusters with H3N8 neuraminidases when it does nothing of the sort.

Comments for the author

I think this manuscript could easily be improved by:

1. Showing maximum likelihood trees with clear rooting and actual branch lengths.

2. The direction and context of each reassortment should be explicitly tested using an appropriate model - e.g. BEAST with discrete traits of location, host and subtype (as appropriate) - to support the various proposed hypotheses for the origins of subtype H5N8.

3. Clean up the language - use the correct terms agreed upon in the literature.

4. Show full trees of all HA and NA sequences indicating where H5N8 viruses are.

I would strongly advise the author to implement these suggestions before attempting to submit this manuscript elsewhere.

So from the comments to the author:

1) Is trivial and ok. Actually, with branch lengths the trees are a whole lot harder to read, and the key argument of the paper is about reassortment, which depends on clades and not branch lengths, but this is a minor point. This is cosmetic and not grounds for more than revision.
2) This is not going to happen. There are 4007 sequences, so this would take large amounts of computer time and give you nothing new or significant in identifying which clades H5N8 can be found in. Putting in subtypes and locations would actually be over-fitting of the data to the model and a very bad statistical error, because you leave no variables to test your model against. This would be an example of Bode's Law. Put all the empirical data into the model and you have no free variables left.
3) Agreed, but again these are minor changes.
4) They are in the supplementary materials and always were - but referees don't look. Figures 5-13 are parts of this complete tree. Version 3 will just have the full H5 and N8 trees and go to F1000. There will be no anonymous referees and it will be published first.

Regarding the point on the California quail sequence. It is ambiguous if you think that the H5N8 trees are telling you anything, but the point of the paper is that they aren't. So it is completely unambiguous that this does not contain the H5 from Goose Guangdong and it is in NO WAY connected to the H5N8 sequences from Korea regardless of what the location and chronology suggest (that is why doing what is suggested in comment 2 is a very bad idea).

The point about Quang Ninh is partially true it is part of an amorphous clade that includes H10N8 isolated at the same but also mixed types. The ancestral sequence to this clade is most definitely an H3N8 from Vietnam and H3N8 or H6N8 are the sources for almost all of the N8 sequences.

Flaviviruses do not reassort as they are not segmented but they definitely undergo recombination which is equivalent. It is an analogy and not homology but sometimes metaphors are not clear. Again this is easily removed. The point of the analogy is the wider consideration that lineage has no meaning if there are multiple subtypes with the same lineage and subtypes with multiple lineages. What does the word lineage mean? How are we going to define it other than in some arbitrary way based on distances in a phylogenetic tree?

Constructing Trees based on a single influenza subtype is not a good idea as it introduces sampling bias (amended and toned down)

Version 2 of the paper about H5N8 has been rejected, regardless of it still being right. What it says is not desperately controversial, but it is important. https://peerj.com/preprints/1489/

It is saying that doing trees by finding all the H5N8 sequences or all the sequences of any other subtype is not a good idea as this is a biased sample that misses out reassortment events that give alternative subtypes. An H5N8 sequence can be next to an H5N1 sequence in the true tree and then the H5N8 can appear again in another place in the H5 tree.

I gave a very clear tree to show this is absolutely true and even posted it on this blog and repost it again here.

There is no doubt. Doing anything other than complete sampling of ALL of the H5 trees will not give you the correct sampling for the hemagglutinin tree. I have done this in BOTH versions of the paper.

I put them in the supplementary materials because they are large - the hemagglutinin tree contains over 4000 sequences and this is not easy to deal with. I just cut out the clades with H5N8 to make it easier to understand and to focus on them. For some unknown reason the referees fail to grasp this and one even commented that my method and sampling was wrong because I showed a tree calculated just from the H5N8 sequences.

This comment from a referee just drives me crazy. I am lost for words as to how deliberately obstructive this person is.

In this paper the author is attempting to explain the evolutionary and reassortment history of H5N8 influenza A virus. However, the dataset design ignores what is already known about the emergence and reassortment history of these multiple virus lineages. In particular, the H5-HA of the recent North American high path H5N8 virus is derived from the Goose Guangdong HPAI H5N1 lineage circulating since 1996. This reassortment history has been well studied and published. The author wants to determine if H5N8 has been circulating cryptically in avian hosts or if emerges repeatedly through reassortment. But this has been shown - the highly pathogenic H5N8 virus emerged through reassortment (see Lee et al, 2014 EID for example). In fact, this has been show for every avian virus subtype in the MANY MANY publications investigating the reassortment history of avian influenza A virus in both wild and domestic populations.

The paper is poorly referenced and has not included important citations relevant to the study presented. I believe this has lead to incorrect understanding of influenza A ecology and evolution by the author and subsequently a poorly designed dataset to shed light on the questions he is attempting to address. The figures are completely inappropriate and not in line with the standards of phylogenetic studies or influenza research. It is unfortunate that the author has decided to show cladograms instead of phylograms. Branch lengths in a cladogram are meaningless. However, long branchs are indicative of poor sampling and missing data. This would be obvious from phylograms, but they are conveniently obscured in cladograms. The most informative analysis was of all available H5-HA and N8-NA phylogenies available from the supporting material link. By highlighting only H5N8 viruses in these trees it is evident that the other datasets presented in the main text of the study are poorly sampled.

Experimental design

As stated above, this is a poorly designed investigation. While I admire the effort to understand influenza ecology and evolution, the work presented here ignores much of what is already known about this lineage and influenza A virus in general. The assumptions of the analyses conducted are not appropriate. The analysis conducted by this author assumes a direct lineage connecting all H5N8 viruses that have been sampled (Figure 1-4). This is not true and that is evident from the supporting material presented by the author. The HA-H5 lineage has associated with multiple different virus genotypes and only a handful of lineages have emerged as highly pathogenic. The dataset design does not address the questions posed and ecological or evolutionary inference is questionable.

Validity of the findings

The inferences made from Figure 1-4 are dubious. The author acknowledges this in the manuscript when he states “These trees show that the apparently simple H5N8 phylogenetic trees for the two envelope segments (figures 1-4) are actually more complex and that multiple reassortment events have occurred resulting in the creation of novel H5N8 subtype lineages. These events cannot be seen in the structure of the H5N8 only trees but they need to be taken into account if the phylogenetic trees are going to be calculated correctly, especially if coalescent methods are going to be used.” This is an appropriate warning. I wish the author had heard it! This is evident and known to the influenza field. Regardless, at this point in the paper the author suggest that the reader ignore all previous results. Figure 5-13 are sections from the supporting material. The author attempts to determine source of HA and NA virus subtypes. The author has determined his reading is better than a probabilistic approach to assess reassortment history. However, this reading is in absence of informative branch lengths or assessment of sampling. Any inference presented here is either dubious, in contrast to other studies (not cited) or meaningless.

Comments for the author

I can't endorse publication of this manuscript. It does not serve the influenza field, nor does it add to the current body of knowledge. The quality of the research is not up to standards in the field. I believe that this manuscript should be rejected.

This person is trying to use my own findings to say why my findings are wrong. I heard my warning, that is why I wrote it. That is in fact why I wrote the paper because all of that extensive literature that I did not cite and that annoyed the referee with his MANY MANY snide comment, is nonsense carried out by someone who needs to read about statistics and sampling. The referee agrees completely with what I am trying to do, with the results that I find and have in the supplementary material but argues that I am saying the exact opposite of the entire argument of the paper in order to reject it. This is a classic example of creating a straw-man.

The entire point of figures 1-4 is to show that they are wrong and thus that the prevailing dogma that always does analysis of influenza strains like this is wrong. The experimental design is exactly correct. First you do what is done by everyone; this is the control. Then you do something new - the tree of ALL of the H5 and N8 sequences - to show what should be done. That is why there are figures 5-13 that show how sampling has to be done.

I could think that this referee is sufficiently confused not to be able to understand, but I think that they do understand and this is just malevolence, they want to block publication.

How can I be sampling incorrectly when I include every known sequence, all of them, none excluded?

If that is not a valid sample then there are no valid samples in H5 influenza research ever. To know who it is for sure I will wait for a couple of months and see who tries to publish the view that sampling is wrong if we focus on a single influenza subtype. I expect to see it in something like Emerging Infectious Diseases or PLoS Pathogens and a fairly big name to be submitting author.

Finally, the last lines are NOT permitted in a referee's comments. The instructions to reviewers do not allow you to put that sort of response in the comments to the author. That is for the editor to decide. It is not constructive or useful.

This is someone with an axe to grind who is annoyed that their work has not been cited. Boo hoo to you. It is appalling behaviour for a so called professional scientist. What the paper says is still true, it will still be published and whoever you are as you chose to remain anonymous (for good reason) you will eventually be exposed for the dishonest person that you are.

Tuesday 5 January 2016

From the past: The EID paper referees' comments from 30/1/2015. For the future: monitoring birds in Russia.

Reviewer: 1

Comments to the Author

In their manuscript, ‘The European and Japanese outbreaks of H5N8 derive from a single source population that has been dispersed along the long distance migratory bird migratory flyways’, Dalby and Iqbal use Bayesian coalsescent methods to infer ancestral of previously detected H5N8 subtype influenza A viruses and estimate time since divergence for isolates derived from recent Eurasian outbreaks.  Furthermore, the authors provide generalized and anecdotal informal on bird migration patterns in Eurasia to gain inference on possible origins and dispersal patterns.

First, I would like to applaud the authors for investigating a topic of great interest and importance to both human and animal population health.  Few studies combine genetic data with information on bird migration which can be a useful methodology for understanding the global dispersal of avian pathogens. I personally believe that such studies are relatively rare on account of the difficulty of combining disparate types of data and in obtaining relevant information for wild birds at locations along migratory pathways.  Regrettably, this is where I think this particular investigation falls short.

Aside from a few potential (minor?) issues (see specific comments below) I do not have any problems with genetic analyses included in this manuscript and the conclusions drawn therefrom per se; however, I do think that the authors ‘epidemiological data’ on migratory birds falls short of supporting conclusions.  For example, in the abstract alone, I would argue that the following claims are not sufficiently supported by empirical data: ‘traced to a single source population, which has been spread by migratory birds’, we can show when and where the outbreak originated’, ‘This population was located in the Siberian summer breeding grounds of long-range migratory birds’.  Because recent outbreaks of H5N8 influenza A viruses also have occurred in poultry throughout Eurasia, I feel that one could use the same genetic analyses conducted by the authors and provide similar generalized/anecdotal information regarding Eurasian poultry trade patterns to reach the conclusion that this virus has been dispersed through bird trade.  That is, the authors fail to provide any convincing evidence to demonstrate that wild birds have been solely responsible for viral dispersal as implied.

Unfortunately, I feel as though I cannot be more supportive of the authors’ current submission to Emerging Infectious Diseases at this time.  Given what I believe to be the critical flaw in the manuscript as written, I might suggest that the authors shorten their submission by focusing on ancestry of H5N8 viruses and time of divergence.  By formulating a short communication (i.e. Dispatch) on this more focused topic, I think that a few speculative sentences could be included in the discussion in support of the authors’ thesis, that wild birds are dispersing H5N8 viruses throughout Eurasia.  In hopes that the authors will pursue a revision in the future, I’ve appended numerous specific comments below which I hope will ultimately prove useful towards this end.

Specific comments:

Introduction

Lines 28-29: By ‘cases’, do you mean ‘detections’?  There have certainly been undetected cases, no?

Line 43: By ‘the virus’, which specific strain are you referring to?  Different strains of HP H5N8 probably have differences in pathogenicity and host adaptation.

Lines 46-47: (‘Ducks and…’) I’m not sure that this is supported as written.  If you are referring to the laboratory study conducted by Kang et al., specify (e.g. ‘in a laboratory challenge study…’) and provide citation.

Line 51: Have antibodies specific to this strain been demonstrated?  If so, I presume this was through experimental challenge?  Clarify.

General comment: Considerable text is included in the introduction presenting information from very distantly related viruses of the same serotype (H5N8).  Would the introduction be better focused on the evolution of the reassortant HP H5N8 viruses currently causing poultry outbreaks?

Materials and methods

Line 62: I don’t believe that ‘flu resource’ is correct nomenclature.  Change throughout.

Lines 74-75: By ‘different reassortment events’ do you mean ‘different ancestral lineages’?

Lines 79-80: Please justify why this model and molecular clock assumption.

Lines 82-83: How long were runs?  Burn-in period?

Lines 88-92: Methodology here is insufficient for evaluating contributions of wild birds in viral dispersal.

Results and discussion

Lines 96-97: Is support (i.e. posterior probability values) presented on trees?  If not, I cannot assess support for the topology presented.

Line 101: It is a bit confusing as to results for which dataset you are referring here.  Some clarification for the reader in this section would be helpful.

Line 102: Add ‘gene segment’ after ‘hemagglutinin’.

Line 115: ‘in Korea in Korea’;

Lines 199-121: This scenario is plausible but is not demonstrated by results.

Lines 123-124: Couldn’t one argue that wide dispersal is reflective of the extent of poultry trade in Eurasia?

Lines 126-128: Please provide genus/species for bird species.

Line 128: By ‘carrying the virus’, do you mean ‘infected with H5N8 viruses’?

Lines 131-132: Please provide genus/species for bird species.

Lines 133-134: Prevalence for these species has been reported as being much more variable than implied here.  Also, I’m not sure birds ‘are funneled’; rather, birds ‘congregate’.

Lines 138: Add ‘of Eurasia’ after ‘flyways’ to reflect recent detection in North America.

Lines 140-143: This anecdotal information provides weak support for your thesis.

Line 145: Is this referring to H5N1?

Lines 150-153: Speculative statements here need appropriate caveats.

Lines 157-159: I tend to disagree with this statement.  Estimated evolutionary divergence does provide information on host, location, or ‘events’.

Lines 167-169: (‘However the small…’) This statement is not supported.

Line 175: Define ‘vigilance’.  By ‘avian flu’ do you mean ‘HP H5N8’?

Line 185: Define ‘vigilance’.  Are you referring to surveillance?

Lines 185-187: How will this help if infections are asymptomatic in wild birds?  If the thesis put forward re. potential for transmission to humans in the prior paragraph holds true, might your recommendation here actually increase the probability for bird to human transmission?

Reviewer: 2

Comments to the Author

The manuscript by Dalby and Iqbal reports a phylogenetic study performed on sequences of influenza A virus (IAV), H5N8 subtype. The analysis is timely and has the potential to provide significant information to the understanding of the H5N8 IAV outbreak. The manuscript, unfortunately, suffers from a lack of clarity in both the methodology and the presentation of the results. Below is a list of both minor and major issues that the authors may want to address:

- The article has been submitted as a “full research manuscript” but is very short and could have rather been formatted as a “dispatch”.

- The use of “seroptype” is not appropriate; the use of “subtype” for the HA and NA gene segments is widely used and I recommend the author to do so in their manuscript to avoid confusion for the readers.

- line 28: “was identified in a wild bird”. More information on the bird species is needed as well as the context in which it has been found (active/passive surveillance, outbreaks in poultry in the same area, with other H5 virus subtypes, etc.).

- line 47: “mallards are often asymptomatic but can still be carriers of the virus”: What virus: H5 in general ? HP/LP ? Please be more specific.

- line 57: “this study identify the source of the outbreaks”. Define “source” and the specific objectives of the study. It is somewhat vague as the phylogenetic analyses provide information on the relatedness and evolutionary history of isolated viruses, but not on the exact identification of the donor of the virus circulating in Europe in November/December. Indeed, the lack of sequence available between the spring/summer and the winter of 2014 precludes conclusion on the exact origin of the virus.

- line 62: “all of the available H5N8 sequences were downloaded”. Provide details on the number of downloaded sequences per segment, whether if they were nucleotide or protein sequences, etc. Also, why not including H5 sequences of non-H5N8 virus subtypes ? This would have given more power to the analyses and strengthened the conclusions regarding the global evolutionary history and origin of the H5N8 virus. Since IAV gene segments have an – almost – independent evolution, it would have been more appropriate to not restrict the data to segment that belong to the H5N8 subtype only, in particular for the internal gene segments. I understand that it may not has been the initial objective of the study but I believe that including only H5N8 sequences gives biased results and a limited understanding of the global evolutionary history of the virus genes (reassortment events, etc.).

- line 69 “no editing at the 3' end”. Why that ? Why not trimming the sequences to the stop codon ?

- lines 71-75: I'm confused by this part. Based on the results it looks like it had affected the estimates of the TMRCA. Sequence selection procedure overall needs more clarity.

- lines 77-80: Why use a strict molecular clock and a nucleotide-based substitution model while it has been shown that more appropriate models exits for IAV, in particular for Bayesian analyses performed with BEAST ? (Shapiro et al. Mol Biol Evol. 2006;23:7–9; Bahl et al. Virology 2009;390:289-297; etc.). Where tips dates coded by year only, or years/month/day ? Was that the case for all sequences ? How did you dealt with missing information ?

- lines 82-84: Where several runs combined ? What were the chain lengths and sampling frequency ?

- line 96: “produces a consistent gene tree”: How that is consistent ? Also, there is a major information missing on the phylogenetic trees: the posterior probabilities. It is overall difficult to evaluate the methodology but since no posterior are indicated on the trees, it also makes the interpretation of the results somewhat complicated.

- line 103 and throughout the manuscript: “outbreaks diverged between 1.58 and 5.53 months ago”. I suggest to use exact dates (e.g. 2014.XX) rather than some time in the past, given that the time reference used is not clearly stated (December 2014 ?).

- line 119: “this result indicate”. This statement requires additional support. The fact that two events occurred at the same time does not supports that the two events were related... I agree that it is likely that these particular two events were linked, but it would be welcomed to have a more comprehensive discussion rather than a statement that is not fully supported by the data and analyses.

- line 124: “that migratory birds”. This is a very general statement and it is very unlikely that migratory birds, in general, were the source. Please be more specific (species involved or not, etc.). Overall, the manuscript critically lacks strong arguments and discussion of the ornithological aspects.

- line 133: “Mallard... high prevalence”. Prevalence of infected ducks strongly depends on the time of the year and geographic location. This again is a general statement that does not provide strong support to the conclusion. I suggest the author to have a look at recent publications on the ecology and epidemiology of IAV in wild ducks to strengthen the discussion (e.g. Latorre-Margalef et al. 2014. Proc B.281:20140098)

- lines 134-138: How would you explain this absence the virus spread in other migratory flyways ? Southwards ?

- lines 141-143: Not all bird species migrate at the same time, even for closely related ones (e.g. ducks). Again, although this information is interesting and valuable, concluding that the two events were related needs more support or at least to be better discussed.

- lines 149-153. Is there any molecular evidence of this change in virulence and selection ?

- lines 155-159: “the evolutionary events responsible for the European and Japanese cases must have occurred in migratory birds”. This statement is very vague. What evolutionary events ? In which migratory birds species and populations ? Where ? How ? etc.

- Line 169: Based on which segment ? If changes in population sizes were investigated then the results should be presented. Also, why not using a Bayesian Skyline priors to investigate such changes instead of exponential population growth ?

- line 178: “the longer outbreaks in wild birds and poultry persist...”. Not sure we can use “outbreak” for wild birds if they are asymptomatic. Is it expected that the H5N8 virus will persist in wild birds ? Are there other evidence of poultry-origin virus spillover to wild birds with subsequent long-time maintenance ?

- line 186: “bird watchers”. Bird watchers I doubt, especially if birds are asymptomatic. There is certainly a need for better surveillance and for the implication of trained ornithologists and veterinarian but this is somewhat different to bird watchers.

- Figures 1-8: The trees are poorly formatted (very difficult to read). The red clade is missing on Fig. 1. As they all seem to provide the same information I would suggest to place them together in an online supplementary file, and select one tree (e.g. HA) for detailed presentation in the manuscript. Slight editing (colors, simplified taxa names, etc.) could also help to read the figures.

- Figure legend: double check figure numbers: looks like there is a Fig. 9 missing.

- Figure 10. The map is not informative as presented. It shows very general migratory flyways and could have rather focused on the migratory routes of the species that are implicated in the spread of the H5N8 IAV subtype. Birdlife international and other websites provide maps that could be much more informative that the one presented in Figure 10: http://www.birdlife.org/datazone/species/factsheet/22680317

Reviewer: 3

Comments to the Author

In this manuscript, authors describe the recent H5N8 outbreaks of Europe and Japan might be a single source population, which has been spread by migratory birds by combining genetic methods and epidemiological data. It seems to be probable theory to explain current H5N8 outbreaks in European countries and Japan at the same time by similar viruses. Although the topic is of interest, the manuscript will need revising.

Comments

1. Line 36-39: Reference is inadequate. The contents of this sentence were derived from reference 17 rather than reference 4.

2. Line 44-45: The H5N8 virus of Ireland was genetically quite different from recent Asian H5N8 viruses, so “the original H5N8” is improper expression. These were just same serotype of viruses.

3. Line 45-47: The authors have to identify a quotation.

4. Line 115: Delete the duplicated word (in Korea)

5. Line 149-152: The authors have to identify a quotation.

6. Figure Legends and Figures: All figure legends are same except for each gene name. However, there were different scale in figures between NA gene and the other genes. Please correct the scale in figures or rewrite figure legends.

The bits highlighted in red are a fairly standard reviewer's tool. First you give the praise. This is really good work, but ...

Then the but kills off the paper. The but here is that the data could fit with the movement of poultry between farms. Except they cannot. For that to happen, British, Dutch and German farmers would all have had to obtain eggs from Korea. These would then have had to infect a limited number of farms, and the virus would have had to spread from the farms to the local wild bird populations in small enough numbers not to be detected by a screen but in large enough numbers to give sporadic cases. This is absolute nonsense. A simple likelihood model says that wild bird transmission is much more likely than these multiple domestic bird transmissions. This referee has another agenda: blocking the paper until their own paper is ready for the same journal, or using this idea to write a paper for Science. EID went on to publish three more papers about the H5N8 outbreak and its spread. This extra data included cases from Russia, which I had predicted must be there but which were unavailable at the time.
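To make the likelihood argument concrete, here is a minimal sketch of that kind of comparison. It is not the analysis from the paper. Every probability below is a made-up placeholder chosen only to show the shape of the calculation, and the function name and the assumption that countries are independent are mine. The point is that the poultry trade explanation needs two unlikely steps (an import from the outbreak region, then a spillover into local wild birds) to happen independently in every affected country, while the wild bird explanation needs only one step per country.

```python
from math import prod


def joint_likelihood(step_probabilities, n_countries):
    """Probability that the same chain of steps occurs in every country,
    treating steps and countries as independent (a simplifying assumption)."""
    per_country = prod(step_probabilities)
    return per_country ** n_countries


# Placeholder probabilities for illustration only, NOT estimates from data.
wild_bird_steps = [0.3]             # infected migrants arrive along a flyway
poultry_trade_steps = [0.05, 0.1]   # import from Korea, then spillover to wild birds

n_countries = 3  # Britain, the Netherlands and Germany

l_wild = joint_likelihood(wild_bird_steps, n_countries)
l_trade = joint_likelihood(poultry_trade_steps, n_countries)
likelihood_ratio = l_wild / l_trade

print(f"L(wild birds)    = {l_wild:.3g}")
print(f"L(poultry trade) = {l_trade:.3g}")
print(f"likelihood ratio = {likelihood_ratio:.3g}")
```

Because the per-country probabilities are raised to the power of the number of countries, any explanation that needs an extra independent step in each country is penalised very quickly as more countries report cases. Different placeholder values will change the size of the ratio, so the numbers here should not be read as a result.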

The summer breeding grounds in Russia are essential for monitoring bird influenza epidemics. We need a well-funded international effort to get the best possible sampling of the virus so that we can predict future outbreaks.