The term reproducible research, in its current sense, was coined about 1990 by the geophysicist Jon Claerbout. Thus Jon Claerbout & Martin Karrenbach, "Electronic Documents Give Reproducible Research a New Meaning", Society of Exploration Geophysicists 1992 [emphasis added, here and throughout]:
A revolution in education and technology transfer follows from the marriage of word processing and software command scripts. In this marriage an author attaches to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs. This provides a concrete definition of reproducibility in computationally oriented research. Experience at the Stanford Exploration Project shows that preparing such electronic documents is little effort beyond our customary report writing; mainly, we need to file everything in a systematic way. […]
The principal goal of scientific publications is to teach new concepts, show the resulting implications of those concepts in an illustration, and provide enough detail to make the work reproducible. In real life, reproducibility is haphazard and variable. Because of this, we rarely see a seismology PhD thesis being redone at a later date by another person. In an electronic document, readers, students, and customers can readily verify results and adapt them to new circumstances without laboriously recreating the author's environment.
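Claerbout's "pushbutton" idea can be sketched in miniature. Everything below (the data, the smoothing parameter, the function names) is a hypothetical stand-in, not anything from the Stanford Exploration Project: the point is just that a published number is tied to a script that recomputes it from the raw data and declared parameters, plus a "name tag" identifying exactly which inputs produced it.

```python
# A minimal sketch of Claerbout-style reproducibility, with hypothetical
# data and parameters: re-running this block must regenerate the same
# published number, and the provenance tag records which inputs made it.

import hashlib
import statistics

def build_result(raw_data, smoothing_window):
    """Recompute the 'published' summary from raw data and parameters."""
    # Moving-average smoothing, then report the peak smoothed value.
    smoothed = [
        statistics.mean(raw_data[i:i + smoothing_window])
        for i in range(len(raw_data) - smoothing_window + 1)
    ]
    return round(max(smoothed), 4)

def provenance_tag(raw_data, params):
    """A 'name tag' identifying exactly which inputs produced the figure."""
    payload = repr((raw_data, sorted(params.items()))).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

# The 'pushbutton': everything needed to recalculate is filed together.
RAW = [0.1, 0.4, 0.35, 0.8, 0.75, 0.3, 0.2]
PARAMS = {"smoothing_window": 3}
result = build_result(RAW, **PARAMS)
tag = provenance_tag(RAW, PARAMS)
print(result, tag)
```

The "systematic filing" Claerbout mentions is what makes this work: data, parameters, and code live together, so anyone can push the button.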
I organized a session on "Reproducible Research" at the Berlin 6 Open Access Conference in 2008, and Victoria Stodden organized a session entitled "The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer" at AAAS 2011 (LLOG coverage here).
Because research in Claerbout's lab mainly involved analysis of seismological recordings collected and published by the USGS, the question of re-doing an experiment by collecting new data didn't ordinarily arise; the closest thing would be what that paper calls "adapting results to new circumstances". And much the same situation obtains in other areas where the goal is to model or explore large shared datasets, as is the case in most modern research in computational linguistics.
But in many other fields, it's natural to wonder whether an experiment would work if someone else tried to follow a similar recipe from start to finish. So at some point between 1990 and 2006, people in this tradition began using terms in the word family replication / replicable / replicability to refer to the (traditional) process of completely re-running an experiment, with all the effects of new researchers, new equipment, new subjects or other raw materials, etc. Thus Roger Peng et al., "Reproducible Epidemiologic Research", American Journal of Epidemiology 2006:
The replication of important findings by multiple independent investigators is fundamental to the accumulation of scientific evidence. Researchers in the biologic and physical sciences expect results to be replicated by independent data, analytical methods, laboratories, and instruments. Epidemiologic studies are commonly used to quantify small health effects of important, but subtle, risk factors, and replication is of critical importance where results can inform substantial policy decisions. However, because of the time, expense, and opportunism of many current epidemiologic studies, it is often impossible to fully replicate their findings. An attainable minimum standard is “reproducibility,” which calls for data sets and software to be made available for verifying published findings and conducting alternative analyses. The authors outline a standard for reproducibility and evaluate the reproducibility of current epidemiologic research. They also propose methods for reproducible research and implement them by use of a case study in air pollution and health.
For another example of the same terminological tradition, see "Replication, psychology, and Big Science", Simply Statistics 2012:
A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data. It is now recognized that a critical component of the scientific process is that data analyses can be reproduced. This point has been driven home particularly for personalized medicine applications, where irreproducible results can lead to delays in evaluating new procedures that affect patients’ health.
But just because a study is reproducible does not mean that it is replicable. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. The difference with reproducibility is that to achieve replicability, you have to perform the experiment and collect the data again. This of course introduces all sorts of new potential sources of error in your experiment (new scientists, new materials, new lab, new thinking, different settings on the machines, etc.)
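Peng's distinction can be illustrated with a toy sketch, in which all data and numbers are hypothetical: reproduction re-runs the archived analysis on the archived data and must match exactly, while replication collects new data (here simulated) and asks only whether the conclusion recurs.

```python
# A toy illustration of the Peng-style distinction (hypothetical numbers):
# reproducibility = same data + same code -> identical result;
# replicability = new data + same procedure -> same conclusion.

import random
import statistics

def analysis(sample):
    """The 'published' analysis: sample mean, reported to 4 decimals."""
    return round(statistics.mean(sample), 4)

# The original archived data set and the number the 'paper' reported.
original_data = [0.8, 1.2, 0.5, 1.9, 0.7, 1.1]
published_estimate = analysis(original_data)

# Reproduction: same data, same code, identical number expected.
assert analysis(original_data) == published_estimate

# Replication: a fresh data collection (simulated), same procedure.
rng = random.Random(42)
new_data = [rng.gauss(1.0, 0.5) for _ in range(6)]
replication_estimate = analysis(new_data)

# We expect the *conclusion* (here, mean > 0) to recur, not the exact value.
print(published_estimate, replication_estimate)
```

The new random draw stands in for "new scientists, new materials, new lab": the exact estimate shifts, and only the qualitative finding is expected to survive.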
And there's a substantial and growing literature on (computational and social) methods for achieving reproducibility in Claerbout's sense, and replicability in Peng's sense. An important recent survey of the movement is Victoria Stodden et al., Eds., Reproducible Research, Taylor & Francis 2014:
Science moves forward when discoveries are replicated and reproduced. In general, the more frequently a given relationship is observed by independent scientists, the more trust we have that such a relationship truly exists in nature. Replication, the practice of independently implementing scientific experiments to validate specific findings, is the cornerstone of discovering scientific truth. Related to replication is reproducibility, which is the calculation of quantitative scientific results by independent scientist using the original datasets and methods. Reproducibility can be thought of as a different standard of validity because it forgoes independent data collection and uses the methods and data collected by the original investigator. Reproducibility has become an important issue for more recent research due to advances in technology and the rapid spread of computational methods across the research landscape.
Clear enough, right?
But more recently, some researchers have started using the same terms with the reference more or less switched. I think that this confusion originates with Chris Drummond, "Replicability is not Reproducibility: Nor is it Good Science", ICML 2009:
At various machine learning conferences, at various times, there have been discussions arising from the inability to replicate the experimental results published in a paper. There seems to be a wide spread view that we need to do something to address this problem, as it is essential to the advancement of our field. The most compelling argument would seem to be that reproducibility of experimental results is the hallmark of science. Therefore, given that most of us regard machine learning as a scientific discipline, being able to replicate experiments is paramount. I want to challenge this view by separating the notion of reproducibility, a generally desirable property, from replicability, its poor cousin. I claim there are important differences between the two. Reproducibility requires changes; replicability avoids them. Although reproducibility is desirable, I contend that the impoverished version, replicability, is one not worth having.
And Drummond's confusion has been picked up by a few others — e.g. Arturo Casadevall & Ferric Fang, "Reproducible Science", Infection and Immunity 2010:
Although many biological scientists intuitively believe that the reproducibility of an experiment means that it can be replicated, Drummond makes a distinction between these two terms. Drummond argues that reproducibility requires changes, whereas replicability avoids them. In other words, reproducibility refers to a phenomenon that can be predicted to recur even when experimental conditions may vary to some degree. On the other hand, replicability describes the ability to obtain an identical result when an experiment is performed under precisely identical conditions.
Or Thilo Mende, "Replication of Defect Prediction Studies: Problems, Pitfalls and Recommendations", PROMISE 2010:
In the early days, […] most prediction models were based on proprietary data, thus preventing independent replication. With the rise of the PROMISE repository, this situation has changed. This repository collects publicly available data sets, the majority of them for the task of defect prediction. Currently, there are more than 100 such data sets inside the PROMISE repository, and many more are made available elsewhere.
This trend is very beneficial, as it enables researchers to independently verify or refute previous results. Drummond argues that replication — the repetition of an experiment without any changes — is not worthwhile. He favors reproducing experiments with changes, since only this adds new insights. While we agree that the pure replication of experiments on the same data sets should not lead to new results, we argue that replicability is nevertheless important: When applying previously published procedures to new data sets, or new procedures to well-known data sets, researchers should be able to validate their implementations using the originally published results.
This confusion seems to have led some researchers to reject the whole distinction — e.g. Brian Nosek, "An Open, Large-Scale, Collaborative Effort to Estimate the Reproducibility of Psychological Science", Perspectives on Psychological Science 2012:
Some distinguish between “reproducibility” and “replicability” by treating the former as a narrower case of the latter (e.g., computational sciences) or vice versa (e.g., biological sciences). We ignore the distinction.
(As the citations in this post suggest, and as a little poking around in Google Scholar will confirm, Nosek's notion that this is a difference between Computer Science and Biology is false. As far as I can tell, it's a difference between people influenced by Drummond's provocative but deeply confused article, and everybody else in a dozen different fields — though maybe there has been some independent invention of related confusions as well.)
It seems to me that
- Under whatever names, the distinction between replicability and reproducibility is worth preserving (and indeed extending — see below);
- Since the technical term "reproducible research" has been in use since 1990, and the technical distinction between reproducible and replicable at least since 2006, we should reject Drummond's 2009 attempt to re-coin the technical terms reproducible and replicable in senses nearly opposite to those used in the definitions by Claerbout, Peng, and others.
Why preserve the distinction in an extended or elaborated form? Because there are many variations on the theme, all of them sometimes worthwhile:

- We might re-apply the original computational analysis to the original data, perhaps to check unreported aspects of the method or the results;
- we might re-implement the model or algorithm and apply it to the original data, to test a new program or to check for coding or algorithmic errors;
- we might apply the original computational analysis to new data, meant to test exactly the same hypotheses;
- we might apply a different model or algorithm to the original data, as an independent test of the original hypothesis, or in support of a different one;
- we might apply the original computational analysis to new data focused on an analogous but systematically different set of questions;
- and so on.

Perhaps the most common and most valuable variation is benchmarking alternative models or algorithms with respect to the same quantitative evaluative metric on the same training and testing material. All of these varied responses to a publication are consistent with Jon Claerbout's original vision, in my opinion, though they go far beyond simply "[attaching] to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs".
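The benchmarking variation might be sketched as follows, where both "models" and all the data are hypothetical stand-ins: two alternative predictors are scored with the same metric on the same held-out test material, so the resulting numbers are directly comparable.

```python
# A minimal benchmarking sketch (hypothetical models and data): alternative
# predictors share one train/test split and one evaluation metric.

def mean_predictor(train):
    """Baseline: always predict the training mean."""
    m = sum(train) / len(train)
    return lambda _x: m

def last_value_predictor(train):
    """Alternative: always predict the last training value."""
    last = train[-1]
    return lambda _x: last

def mse(model, test_inputs, test_targets):
    """The shared metric: mean squared error on the held-out set."""
    return sum(
        (model(x) - y) ** 2 for x, y in zip(test_inputs, test_targets)
    ) / len(test_targets)

# Shared split: both models see exactly the same training and test material.
train = [1.0, 2.0, 3.0, 4.0]
test_x = [0, 1, 2]
test_y = [2.0, 3.0, 4.0]

scores = {
    "mean": mse(mean_predictor(train), test_x, test_y),
    "last": mse(last_value_predictor(train), test_x, test_y),
}
print(scores)
```

Holding the split and the metric fixed is what makes the comparison meaningful; change either one and the scores are no longer commensurable.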
For many reasons, I think that Drummond is profoundly wrong on the substance. But even if you believe his assertion that "Although X is desirable, I contend that the impoverished version, Y, is one not worth having", you should reject his attempt to swap the reference of the terms X and Y, substituting reproducibility for what many others have been calling replicability, and vice versa.
Some previous LLOG posts on related topics: "Reproducible research" (11/14/2008); "Reproducible Science at AAAS 2011" (2/18/2011); "Literate programming and reproducible research" (2/22/2014); and "Reliability" (2/28/2015).