The issue of scientific reproducibility has come to the fore in the past several years, driven by noteworthy failures to replicate critical findings in several much-publicized reports, coupled with a series of scandals calling into question the role of journals and granting agencies in maintaining quality and oversight.
In a special Nature online collection, the journal assembled articles and perspectives from 2011 to the present dealing with this issue of research reproducibility in science and medicine. These articles were supplemented with current editorial comment.
Seeing these broad-spectrum concerns pulled together in one place makes it difficult not to be pessimistic about the current state of research investigations across the board. The saving grace, however, is that these same reports show that many people recognize there is a problem – people who are trying to make changes and who are in a position to be effective.
According to the reports presented in the collection, the problems in research accountability and reproducibility have grown to an alarming extent. By one estimate, irreproducibility costs biomedical research some $28 billion in wasted spending per year (Nature. 2015 Jun 9. doi: 10.1038/nature.2015.17711).
A litany of concerns
In 2012, scientists at Amgen (Thousand Oaks, Calif.) reported that, even when cooperating closely with the original investigators, they were able to reproduce only 6 of 53 studies considered to be benchmarks of cancer research (Nature. 2016 Feb 4. doi: 10.1038/nature.2016.19269).
Scientists at Bayer HealthCare reported in Nature Reviews Drug Discovery that they could successfully reproduce results in only a quarter of 67 so-called seminal studies (2011 Sep. doi: 10.1038/nrd3439-c1).
According to a 2013 report in The Economist, Dr. John Ioannidis, an expert in the field of scientific reproducibility, argued that in his field, “epidemiology, you might expect one in ten hypotheses to be true. In exploratory disciplines like genomics, which rely on combing through vast troves of data about genes and proteins for interesting relationships, you might expect just one in a thousand to prove correct.”
This increasing litany of irreproducibility has raised alarm in the scientific community and has led to a search for answers, as so many preclinical studies form the precursor data for eventual human trials.
Despite the concerns raised, human clinical trials seem to be less at risk for irreproducibility, according to an editorial by Dr. Francis S. Collins, director, and Dr. Lawrence A. Tabak, principal deputy director, of the U.S. National Institutes of Health, “because they are already governed by various regulations that stipulate rigorous design and independent oversight – including randomization, blinding, power estimates, pre-registration of outcome measures in standardized, public databases such as ClinicalTrials.gov, and oversight by institutional review boards and data safety monitoring boards. Furthermore, the clinical trials community has taken important steps toward adopting standard reporting elements” (Nature. 2014 Jan. doi: 10.1038/505612a).
The paucity of P
Today, a P value of .05 or less is all too often considered the sine qua non of scientific proof. “Most statisticians consider this appalling, as the P value was never intended to be used as a strong indicator of certainty as it too often is today. Most scientists would look at [a] P value of .01 and say that there was just a 1% chance of [the] result being a false alarm. But they would be wrong.” The 2014 report goes on to state that, according to one widely used statistical calculation, a P value of .01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of .05 raises that chance of a false alarm to at least 29% (Nature. 2014 Feb. doi: 10.1038/506150a).
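The report does not show the arithmetic, but the 11% and 29% floors match the widely cited −e·p·ln(p) bound on the Bayes factor when the prior odds of a true effect are even; the minimal Python sketch below reproduces them under that assumption (the choice of bound and the 50-50 prior are illustrative, not spelled out in the article).

import math

def false_alarm_probability(p, prior_true=0.5):
    # The -e * p * ln(p) bound (valid for p < 1/e) is the smallest Bayes factor
    # in favor of the null that a given P value allows, i.e., the most the data
    # can possibly favor a real effect.
    bayes_factor_bound = -math.e * p * math.log(p)
    prior_odds_null = (1 - prior_true) / prior_true
    posterior_odds_null = bayes_factor_bound * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)

for p in (0.01, 0.05):
    print(f"P = {p:.2f} -> false-alarm probability of at least {false_alarm_probability(p):.0%}")
# P = 0.01 -> false-alarm probability of at least 11%
# P = 0.05 -> false-alarm probability of at least 29%

Because these are best-case floors, a prior probability of a true effect below 50% pushes the false-alarm chance higher still, which is why the figures are given as “at least.”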
Beyond this assessment problem, P values may allow for considerable researcher bias, conscious and unconscious, even to the extent of encouraging “P-hacking”: one of the few statistical terms to ever make it into the Urban Dictionary. “P-hacking is trying multiple things until you get the desired result” – even unconsciously, according to one researcher quoted.
In addition, “unless statistical power is very high (and much higher than in most experiments), the P value should be interpreted tentatively at best” (Nat Methods. 2015 Feb 26. doi: 10.1038/nmeth.3288).
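The same arithmetic illustrates why power matters as much as the significance threshold: combining the share of tested hypotheses that are actually true with a study’s power and an alpha of .05 gives the fraction of “significant” findings that are false alarms. The sketch below is a back-of-the-envelope illustration; the priors and power levels are chosen for the example and are not taken from the cited paper.

def share_of_false_positives(prior_true, power, alpha=0.05):
    # Of all results that cross the significance threshold, the fraction
    # arising from null effects rather than real ones.
    true_hits = prior_true * power           # real effects correctly detected
    false_hits = (1 - prior_true) * alpha    # null effects that clear P < alpha anyway
    return false_hits / (true_hits + false_hits)

# If one in ten tested hypotheses is true, as suggested above for epidemiology:
print(f"{share_of_false_positives(prior_true=0.10, power=0.80):.0%}")  # ~36% false at 80% power
print(f"{share_of_false_positives(prior_true=0.10, power=0.20):.0%}")  # ~69% false at 20% power

With a low prior probability of a true effect, even a well-powered study yields a sizable share of false positives, and an underpowered one yields mostly noise.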
So bad is the problem that “misuse of the P value – a common test for judging the strength of scientific evidence – is contributing to the number of research findings that cannot be reproduced,” the American Statistical Association warned in a statement released in March 2016, adding that the P value cannot be used to determine whether a hypothesis is true or even whether results are important (Nature. 2016 Mar 7. doi: 10.1038/nature.2016.19503).
And none of this even remotely addresses those instances where researchers report findings that “trend towards significance” when they can’t even meet the magical P threshold.
A muddling of mice (and more)
Fundamental to biological research is the vast array of preliminary animal studies that must be performed before clinical testing can begin.
Animal-based research has been under intense scrutiny due to a variety of flaws and omissions that have been found to be all too common. For example, in a report in PLoS Biology, Dr. Ulrich Dirnagl of the Charité Medical University in Berlin reviewed 100 reports published between 2000 and 2013, which included 522 experiments using rodents to test cancer and stroke treatments. Around two-thirds of the experiments did not state whether any animals had been dropped from the final analysis, and of the roughly 30% that did report dropping rodents, only 14 reports explained why (2016 Jan 4. doi: 10.1371/journal.pbio.1002331). Similarly, Dr. John Ioannidis and his colleagues assessed a random sample of 268 biomedical papers listed in PubMed and published between 2000 and 2014 and found that only one contained sufficient detail to replicate the work (Nature. 2016 Jan 5. doi: 10.1038/nature.2015.19101).
A multitude of genetic and environmental factors has also been found to influence the results of animal research. For example, the gut microbiome (which affects many aspects of mouse health and metabolism) varies widely in the same species of mice fed different diets or obtained from different vendors, and physiology and behavior can differ with circadian rhythms and even cage design (Nature. 2016 Feb 16. doi: 10.1038/530254a).
But things are looking brighter. By the beginning of 2016, more than 600 journals had signed up for the voluntary ARRIVE (Animals in Research: Reporting of In Vivo Experiments) guidelines designed to improve the reporting of animal experiments. The guidelines include a checklist of elements to be included in any reporting of animal research, including animal strain, sex, and adverse events (Nature. 2016 Feb 1. doi: 10.1038/nature.2016.19274).
Problems have also been reported in the use of cell lines and antibodies in biomedical research. For example, a report in Nature indicated that too many biomedical researchers are lax in checking for impostor cell lines when they perform their research (Nature. 2015 Oct 12. doi: 10.1038/nature.2015.18544). And recent studies have shown that improper or misused antibodies are a significant source of false findings and irreproducibility in the modern literature (Nature. 2015 May 19. doi: 10.1038/521274a).
Reviewer, view thyself
The 2013 report in The Economist also discussed some of the failures of the peer-reviewed scientific literature, usually considered the final gateway of quality control, to provide appropriate review and correction of research errors. It cited a damning test of lower-tier research publications by Dr. John Bohannon, a biologist at Harvard, who submitted a pseudonymous paper on the effects of a chemical derived from lichen cells to 304 journals describing themselves as using peer review. The paper was concocted wholesale, with manifold and obvious errors in study design, analysis, and interpretation of results, according to Dr. Bohannon. This fictitious paper from a fictitious researcher based at a fictitious university was accepted for publication by an alarming 147 of the journals.
The problem is not new. In 1998, Dr. Fiona Godlee, editor of the British Medical Journal, sent an article with eight deliberate mistakes in study design, analysis, and interpretation to more than 200 of the journal’s regular reviewers. None of the reviewers found all the mistakes, and on average they spotted fewer than two. Another BMJ study showed that experience did not improve reviewer quality; quite the opposite. Over the 14-year period assessed, the ratings that editors at leading journals gave to 1,500 referees showed a slow but steady decline.
Such studies prompted a profound reassessment by the journals, in part pushed by some major granting agencies, including the National Institutes of Health.
Not taking grants for granted
The National Institutes of Health is advancing efforts to strengthen scientific rigor and reproducibility in the projects it funds.
“As part of an increasing drive to boost the reliability of research, the NIH will require applicants to explain the scientific premise behind their proposals and defend the quality of their experimental designs. They must also account for biological variables (for example, by including both male and female mice in planned studies) and describe how they will authenticate experimental materials such as cell lines and antibodies.”
Whether current efforts by scientists, societies, granting organizations, and journals can lead to authentic reform and a vast and relatively quick improvement in reproducibility of scientific results is still an open question. In discussing a 2015 report on the subject by the biomedical research community in the United Kingdom, neurophysiologist Dr. Dorothy Bishop had this to say: “I feel quite upbeat about it. ... Now that we’re aware of it, we have all sorts of ideas about how to deal with it. These are doable things. I feel that the mood is one of making science a much better thing. It might lead to slightly slower science. That could be better” (Nature. 2015 Oct 29. doi: 10.1038/nature.2015.18684).
In the recent Nature editorial “Repetitive flaws,” comments are offered regarding the new NIH guidelines that require grant proposals to account for biological variables and describe how experimental materials will be authenticated (2016 Jan 21. doi: 10.1038/529256a). These requirements are intended to improve the quality and reproducibility of research, amid the many concerns about scientific reproducibility raised in the past few years. As the editorial states, the NIH guidelines “can help to make researchers aspire to the values that produced them” and can “inspire researchers to uphold their identity and integrity.”
To investigators who strive to report only their best results after exhaustive and sincere confirmation, these guidelines will not seem threatening. Providing the experimental details of one’s work is helpful in many ways (the work can be reproduced with new lab personnel or after a lapse of time, and the resulting records are excellent documentation when it comes time to write another grant), and I have personally been frustrated when my laboratory cannot duplicate the published work of others. However, questions remain: who will pay for reproducing the work of others, and how will the sacrifice of additional animals or subjects be justified? Many laboratories are already financially strapped by current funding challenges, and time is equally precious. In addition, junior researchers face tenure and promotion timelines that create pressure to publish in order to establish independence and credibility, while established investigators must document continued productivity to maintain funding.
The quality of peer review of research publications has also been challenged recently, adding to concern over the veracity of published research. Many journals now require statistical review prior to acceptance, which further lengthens the time to publication. In addition, the reviewers who generously perform peer review typically do so on their own, uncompensated time.
Despite these hurdles and questions, those who perform valuable and needed research to improve the lives and care of our patients must continue to strive to produce the highest level of evidence.
Dr. Jennifer S. Lawton is a professor of surgery in the division of cardiothoracic surgery at Washington University in St. Louis. She is also an associate medical editor for Thoracic Surgery News.