Science has been in a “replication crisis” for a decade. Have we learned anything?
March 26, 2022 | News
Much ink has been spilled over the “replication crisis” in the last decade and a half, including here at Vox. Researchers have discovered, over and over, that lots of findings in fields like psychology, sociology, medicine, and economics don’t hold up when other researchers try to replicate them.
This conversation was fueled in part by John Ioannidis’s 2005 article “Why Most Published Research Findings Are False” and by the controversy around a 2011 paper that used then-standard statistical methods to find that people have precognition. But since then, many researchers have explored the replication crisis from different angles. Why are research findings so often unreliable? Is the problem just that we test for “statistical significance” (roughly, how likely it is that results at least this strong would occur by chance alone if there were no real effect) in a nuance-free way? Is it that null results (that is, when a study finds no detectable effects) are ignored while positive ones make it into journals?
A recent write-up by Alvaro de Menard, a participant in the Defense Advanced Research Projects Agency’s (DARPA) replication markets project (more on this below), makes the case for a more depressing view: The processes that lead to unreliable research findings are routine, well understood, predictable, and in principle pretty easy to avoid. And yet, he argues, we’re still not improving the quality and rigor of social science research.
While other researchers I spoke with pushed back on parts of Menard’s pessimistic take, they do agree on something: a decade of talking about the replication crisis hasn’t translated into a scientific process that’s much less vulnerable to it. Bad science is still frequently published, including in top journals — and that needs to change.
Most papers fail to replicate for totally predictable reasons
Let’s take a step back and explain what people mean when they refer to the “replication crisis” in scientific research.
When research papers are published, they describe their methodology, so other researchers can copy it (or vary it) and build on the original research. When another research team tries to conduct a study based on the original to see if they find the same result, that’s an attempted replication. (Often the focus is not just on doing the exact same thing, but approaching the same question with a larger sample and preregistered design.) If they find the same result, that’s a successful replication, and evidence that the original researchers were on to something. But when the attempted replication finds different or no results, that often suggests that the original research finding was spurious.
In an attempt to test just how rigorous scientific research is, some researchers have undertaken the task of replicating research that’s been published in a whole range of fields. And as more and more of those attempted replications have come back, the results have been striking: a large share of published studies cannot be replicated.
One 2015 attempt to reproduce 100 psychology studies was able to replicate only 39 of them. A big international effort in 2018 to reproduce prominent studies found that 14 of the 28 replicated, and an attempt to replicate studies from top journals Nature and Science found that 13 of the 21 results looked at could be reproduced.
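To put those three efforts on a common scale, here is a minimal sketch that turns the counts quoted above into replication rates. The short labels are made up for this example, and the pooled figure is only an illustrative assumption: the three projects sampled different kinds of studies, so adding them together is a rough summary rather than a finding from any of them.

```python
# Replication counts as quoted in the text: (replicated, attempted).
# The short labels are made up for this example.
efforts = {
    "2015 psychology project": (39, 100),
    "2018 international effort": (14, 28),
    "Nature and Science studies": (13, 21),
}


def replication_rate(replicated: int, attempted: int) -> float:
    """Share of attempted replications that succeeded."""
    return replicated / attempted


rates = {name: replication_rate(r, n) for name, (r, n) in efforts.items()}

# Pooled rate: an illustrative assumption, since the three efforts
# sampled different kinds of studies and are not strictly comparable.
total_replicated = sum(r for r, _ in efforts.values())
total_attempted = sum(n for _, n in efforts.values())
pooled_rate = replication_rate(total_replicated, total_attempted)

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
print(f"pooled: {pooled_rate:.0%}")
```

Run as written, this works out to roughly 39, 50, and 62 percent for the three efforts, and about 44 percent if the 149 attempted replications are lumped together.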
The replication crisis has led a few researchers to ask: Is there a way to guess if a paper will replicate? A growing body of research has found that guessing which papers will hold up and which won’t is often just a matter of looking at the same simple, straightforward factors.
Menard argues that the problem is not so complicated. “Predicting replication is easy,” he said. “There’s no need for a deep dive into the statistical methodology or a rigorous examination of the data, no need to scrutinize esoteric theories for subtle errors — these papers have obvious, surface-level problems.”
A 2018 study published in Nature had scientists place bets on which of a pool of social science studies would replicate. They found that the predictions by scientists in this betting market were highly accurate at estimating which papers would replicate.
“These results suggest something systematic about papers that fail to replicate,” study co-author Anna Dreber argued after the study was released.
Additional research has established that you don’t even need to poll experts in a field to guess which of its studies will hold up to scrutiny. A study published in August had participants read psychology papers and predict whether they would replicate. “Laypeople without a professional background in the social sciences are able to predict the replicability of social-science studies with above-chance accuracy,” the study concluded, “on the basis of nothing more than simple verbal study descriptions.”
The laypeople were not as accurate in their predictions as the scientists in the Nature study, but the fact they were still able to predict many failed replications suggests that many of those papers have flaws that even a layperson can notice.
Bad science can still be published in prestigious journals and be widely cited
Publication of a peer-reviewed paper is not the final step of the scientific process. After a paper is published, other researchers might cite it, spreading any misconceptions or errors in the original paper. But research has established that scientists have good instincts for whether a paper will replicate or not. So, do scientists avoid citing papers that are unlikely to replicate?
A 2020 study by Yang Yang, Wu Youyou, and Brian Uzzi at Northwestern University found that there is actually no correlation at all between whether a study will replicate and how often it is cited. “Failed papers circulate through the literature as quickly as replicating papers,” they argue.
Looking at a sample of studies from 2009 to 2017 that have since been subject to attempted replications, the researchers find that studies have about the same number of citations regardless of whether they replicated.
If scientists are pretty good at predicting whether a paper replicates, how can it be the case that they are as likely to cite a bad paper as a good one? Menard theorizes that many scientists don’t thoroughly check, or even read, papers once published, expecting that if they’re peer-reviewed, they’re fine. Bad papers make it through a peer-review process that is not adequate to catch them, and once they’re published, they are not penalized for being bad papers.
The debate over whether we’re making any progress
Here at Vox, we’ve written about how the replication crisis can guide us to do better science. And yet blatantly shoddy work is still being published in peer-reviewed journals despite errors that a layperson can see.
In many cases, journals effectively aren’t held accountable for bad papers. Many, like The Lancet, have retained their prestige even after a long string of embarrassing public incidents in which they published research that turned out to be fraudulent or nonsensical. (After a study on Covid-19 and hydroxychloroquine was retracted this spring amid questions about its data source, The Lancet said it would change its data-sharing practices.)
Even outright frauds often take a very long time to be repudiated, with some universities and journals dragging their feet and declining to investigate widespread misconduct.
That’s discouraging and infuriating. It suggests that the replication crisis isn’t one specific methodological reevaluation, but a symptom of a scientific system that needs rethinking on many levels. We can’t just teach scientists how to write better papers. We also need to change the fact that those better papers aren’t cited more often than bad papers; that bad papers are almost never retracted even when their errors are visible to lay readers; and that there are no consequences for bad research.
In some ways, the culture of academia actively selects for bad research. Pressure to publish lots of papers favors those who can put them together quickly — and one way to be quick is to be willing to cut corners. “Over time, the most successful people will be those who can best exploit the system,” Paul Smaldino, a cognitive science professor at the University of California Merced, told my colleague Brian Resnick.
So we have a system whose incentives keep pushing bad research even as we understand more about what makes for good research.
Researchers working on the replication crisis are more divided, though, on the question of whether the last decade of work on the replication crisis has left us better equipped to fight these problems — or left us in the same place where we started.
“The future is bright,” concludes Altmejd and Dreber’s 2019 paper about how to predict replications. “There will be rapid accumulation of more replication data, more outlets for publishing replications, new statistical techniques, and—most importantly—enthusiasm for improving replicability among funding agencies, scientists, and journals. An exciting replicability ‘upgrade’ in science, while perhaps overdue, is taking place.”
Menard, by contrast, argues that this optimism has not been borne out: our improved understanding of the replication crisis has not led to more published papers that actually replicate. The project that he’s a part of, a DARPA-run effort at the Defense Department to design a better model for predicting which papers replicate, has not seen papers grow any more likely to replicate over time.
“I frequently encounter the notion that after the replication crisis hit there was some sort of great improvement in the social sciences, that people wouldn’t even dream of publishing studies based on 23 undergraduates any more … In reality there has been no discernible improvement,” he writes.
Researchers who are more optimistic point to other metrics of progress. It’s true that papers that fail replication are still extremely common, and that the peer-review process hasn’t improved in a way that catches these errors. But other elements of the error-correction process are getting better.
“Journals now retract about 1,500 articles annually — a nearly 40-fold increase over 2000, and a dramatic change even if you account for the roughly doubling or tripling of papers published per year,” Ivan Oransky at Retraction Watch argues. “Journals have improved,” reporting more details on retracted papers and improving their process for retractions.
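Oransky’s point about accounting for growth in publishing can be made concrete with a little arithmetic. The sketch below uses only the round numbers in his quote (about 1,500 retractions a year, a roughly 40-fold rise, and a doubling or tripling of papers published) and should be read as a back-of-the-envelope estimate under those assumptions, not as Retraction Watch’s own calculation. The function and variable names are made up for this example.

```python
def per_paper_fold_increase(retraction_fold: float, publication_fold: float) -> float:
    """Growth in retractions per published paper, given the growth in
    total retractions and the growth in total papers published."""
    return retraction_fold / publication_fold


# Round figures taken from the quote; all are approximate.
RETRACTIONS_NOW = 1500   # "about 1,500 articles annually"
RETRACTION_FOLD = 40     # "a nearly 40-fold increase over 2000"

# Implied annual retractions around 2000 (an inference, not a quoted figure).
implied_2000_retractions = RETRACTIONS_NOW / RETRACTION_FOLD

# Papers published per year roughly doubled or tripled over the same period.
low = per_paper_fold_increase(RETRACTION_FOLD, 3)   # if publishing tripled
high = per_paper_fold_increase(RETRACTION_FOLD, 2)  # if publishing doubled

print(f"implied retractions per year around 2000: {implied_2000_retractions:.1f}")
print(f"per-paper retraction rate rose roughly {low:.1f}x to {high:.1f}x")
```

On these assumptions, the chance that any given paper gets retracted rose somewhere between about 13-fold and 20-fold, which is still a dramatic change even though it is smaller than the headline 40-fold figure.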
Other changes in common scientific practices seem to be helping too. For example, preregistrations — announcing how you’ll conduct your analysis before you do the study — lead to more null results being published.
“I don’t think the influence [of public conversations about the replication crisis on scientific practice] has been zero,” statistician Andrew Gelman at Columbia University told me. “This crisis has influenced my own research practices, and I assume it’s influenced many others as well. And it’s my general impression that journals such as Psychological Science and PNAS don’t publish as much junk as they used to.”
There’s some reassurance in that. But until those improvements translate to a higher percentage of papers replicating and a difference in citations for good papers versus bad papers, it’s a small victory. And it’s a small victory that has been hard-won. After tons of resources spent demonstrating the scope of the problem, fighting for more retractions, teaching better statistical methods, and trying to drag fraud into the open, papers still don’t replicate as much as researchers would hope, and bad papers are still widely cited — suggesting a big part of the problem still hasn’t been touched.
We need a more sophisticated understanding of the replication crisis, not as a moment of realization after which we were able to move forward with higher standards, but as an ongoing rot in the scientific process that a decade of work hasn’t quite fixed.
Our scientific institutions are valuable, as are the tools they’ve built to help us understand the world. There’s no cause for hopelessness here, even if some frustration is thoroughly justified. Science needs saving, sure — but science is very much worth saving.