Replicability vs. reproducibility — or is it the other way around?
The term reproducible research, in its current sense, was coined about 1990 by the geophysicist Jon Claerbout. Thus Jon Claerbout & Martin Karrenbach, "Electronic Documents Give Reproducible Research a New Meaning", Society of Exploration Geophysicists 1992 [emphasis added, here and throughout]:
A revolution in education and technology transfer follows from the marriage of word processing and software command scripts. In this marriage an author attaches to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs. This provides a concrete definition of reproducibility in computationally oriented research. Experience at the Stanford Exploration Project shows that preparing such electronic documents is little effort beyond our customary report writing; mainly, we need to file everything in a systematic way. […]
The principal goal of scientific publications is to teach new concepts, show the resulting implications of those concepts in an illustration, and provide enough detail to make the work reproducible. In real life, reproducibility is haphazard and variable. Because of this, we rarely see a seismology PhD thesis being redone at a later date by another person. In an electronic document, readers, students, and customers can readily verify results and adapt them to new circumstances without laboriously recreating the author's environment.
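In modern terms, the "pushbutton" is usually just a small script per figure, one that reads only archived data and parameters and writes only the figure file. Here is a minimal sketch in Python; the filenames, parameters, and analysis are invented for illustration, not taken from the Stanford Exploration Project's actual system.

    # Sketch of a "recomputable figure": one script per figure, reading only
    # archived inputs and writing only the figure file. All names are hypothetical.
    import json
    import numpy as np
    import matplotlib.pyplot as plt

    def make_figure_3(data_path="data/traces.csv",
                      params_path="params/fig3.json",
                      out_path="figures/fig3.png"):
        with open(params_path) as f:
            params = json.load(f)          # analysis parameters archived with the paper
        t, amplitude = np.loadtxt(data_path, delimiter=",", unpack=True)
        window = params["window"]
        smoothed = np.convolve(amplitude, np.ones(window) / window, mode="same")
        plt.figure()
        plt.plot(t, amplitude, alpha=0.3, label="raw")
        plt.plot(t, smoothed, label=f"smoothed (window={window})")
        plt.xlabel("time (s)")
        plt.legend()
        plt.savefig(out_path, dpi=200)

    if __name__ == "__main__":
        make_figure_3()   # the caption's "pushbutton": re-running this regenerates the figure

The figure is then a pure function of archived data, archived parameters, and versioned code, which is what lets readers "verify results and adapt them to new circumstances without laboriously recreating the author's environment".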
I organized a session on "Reproducible Research" at the Berlin 6 Open Access Conference in 2008; and Victoria Stodden organized a session entitled "The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer" at AAAS 2011 (LLOG coverage here).
Because research in Claerbout's lab mainly involved analysis of seismological recordings collected and published by the USGS, the idea of re-doing an experiment by collecting new data didn't ordinarily arise — the closest thing would be what that paper calls "adapting results to new circumstances". And much the same situation obtains in other areas where the goal is to model or explore large shared datasets, as is the case in most modern research in computational linguistics.
But in many other fields, it's natural to wonder whether an experiment would work if someone else tried to follow a similar recipe from start to finish. So at some point between 1990 and 2006, people in this tradition began using terms in the word family replication / replicable / replicability to refer to the (traditional) process of completely re-running an experiment, with all the effects of new researchers, new equipment, new subjects or other raw materials, etc. Thus Roger Peng et al., "Reproducible Epidemiologic Research", American Journal of Epidemiology 2006:
The replication of important findings by multiple independent investigators is fundamental to the accumulation of scientific evidence. Researchers in the biologic and physical sciences expect results to be replicated by independent data, analytical methods, laboratories, and instruments. Epidemiologic studies are commonly used to quantify small health effects of important, but subtle, risk factors, and replication is of critical importance where results can inform substantial policy decisions. However, because of the time, expense, and opportunism of many current epidemiologic studies, it is often impossible to fully replicate their findings. An attainable minimum standard is “reproducibility,” which calls for data sets and software to be made available for verifying published findings and conducting alternative analyses. The authors outline a standard for reproducibility and evaluate the reproducibility of current epidemiologic research. They also propose methods for reproducible research and implement them by use of a case study in air pollution and health.
For another example of the same terminological tradition, see "Replication, psychology, and Big Science", Simply Statistics 2012:
A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data. It is now recognized that a critical component of the scientific process is that data analyses can be reproduced. This point has been driven home particularly for personalized medicine applications, where irreproducible results can lead to delays in evaluating new procedures that affect patients’ health.
But just because a study is reproducible does not mean that it is replicable. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. The difference with reproducibility is that to achieve replicability, you have to perform the experiment and collect the data again. This of course introduces all sorts of new potential sources of error in your experiment (new scientists, new materials, new lab, new thinking, different settings on the machines, etc.)
And there's a substantial and growing literature on (computational and social) methods for achieving reproducibility in Claerbout's sense, and replicability in Peng's sense. An important recent survey of the movement is Victoria Stodden et al., Eds., Implementing Reproducible Research, Taylor & Francis 2014:
Science moves forward when discoveries are replicated and reproduced. In general, the more frequently a given relationship is observed by independent scientists, the more trust we have that such a relationship truly exists in nature. Replication, the practice of independently implementing scientific experiments to validate specific findings, is the cornerstone of discovering scientific truth. Related to replication is reproducibility, which is the calculation of quantitative scientific results by independent scientists using the original datasets and methods. Reproducibility can be thought of as a different standard of validity because it forgoes independent data collection and uses the methods and data collected by the original investigator. Reproducibility has become an important issue for more recent research due to advances in technology and the rapid spread of computational methods across the research landscape.
Clear enough, right?
But more recently, some researchers have started using the same terms with the reference more or less switched. I think that this confusion originates with Chris Drummond, "Replicability is not Reproducibility: Nor is it Good Science", ICML 2009:
At various machine learning conferences, at various times, there have been discussions arising from the inability to replicate the experimental results published in a paper. There seems to be a wide spread view that we need to do something to address this problem, as it is essential to the advancement of our field. The most compelling argument would seem to be that reproducibility of experimental results is the hallmark of science. Therefore, given that most of us regard machine learning as a scientific discipline, being able to replicate experiments is paramount. I want to challenge this view by separating the notion of reproducibility, a generally desirable property, from replicability, its poor cousin. I claim there are important differences between the two. Reproducibility requires changes; replicability avoids them. Although reproducibility is desirable, I contend that the impoverished version, replicability, is one not worth having.
And Drummond's confusion has been picked up by a few others — e.g. Arturo Casadevall & Ferric Fang, "Reproducible Science", Infection and Immunity 2010:
Although many biological scientists intuitively believe that the reproducibility of an experiment means that it can be replicated, Drummond makes a distinction between these two terms. Drummond argues that reproducibility requires changes, whereas replicability avoids them. In other words, reproducibility refers to a phenomenon that can be predicted to recur even when experimental conditions may vary to some degree. On the other hand, replicability describes the ability to obtain an identical result when an experiment is performed under precisely identical conditions.
Or Thilo Mende, "Replication of Defect Prediction Studies: Problems, Pitfalls and Recommendations", PROMISE 2010:
In the early days, […] most prediction models were based on proprietary data, thus preventing independent replication. With the rise of the PROMISE repository, this situation has changed. This repository collects publicly available data sets, the majority of them for the task of defect prediction. Currently, there are more than 100 such data sets inside the PROMISE repository, and many more are made available elsewhere.
This trend is very beneficial, as it enables researchers to independently verify or refute previous results. Drummond argues that replication — the repetition of an experiment without any changes — is not worthwhile. He favors reproducing experiments with changes, since only this adds new insights. While we agree that the pure replication of experiments on the same data sets should not lead to new results, we argue that replicability is nevertheless important: When applying previously published procedures to new data sets, or new procedures to well-known data sets, researchers should be able to validate their implementations using the originally published results.
This confusion seems to have led some researchers to reject the whole distinction — e.g. Brian Nosek, "An Open, Large-Scale, Collaborative Effort to Estimate the Reproducibility of Psychological Science", Perspectives on Psychological Science 2012:
Some distinguish between “reproducibility” and “replicability” by treating the former as a narrower case of the latter (e.g., computational sciences) or vice versa (e.g., biological sciences). We ignore the distinction.
(As the citations in this post suggest, and as a little poking around in Google Scholar will confirm, Nosek's notion that this is a difference between Computer Science and Biology is false. As far as I can tell, it's a difference between people influenced by Drummond's provocative but deeply confused article, and everybody else in a dozen different fields — though maybe there has been some independent invention of related confusions as well.)
It seems to me that
- Under whatever names, the distinction between replicability and reproducibility is worth preserving (and indeed extending — see below);
- Since the technical term "reproducible research" has been in use since 1990, and the technical distinction between reproducible and replicable at least since 2006, we should reject Drummond's 2009 attempt to re-coin the technical terms reproducible and replicable in senses nearly opposite to those used in the definitions by Claerbout, Peng, and others.
Why preserve the distinction in an extended or elaborated form? Because there are many variations on the theme, all of them sometimes worthwhile. We might re-apply the original computational analysis to the original data, perhaps to check unreported aspects of the method or the results; we might re-implement the model or algorithm and apply it to the original data, to test a new program or to check for coding or algorithmic errors; we might apply the original computational analysis to new data, meant to test exactly the same hypotheses; we might apply a different model or algorithm to the original data, as an independent test of the original hypothesis, or in support of a different one; we might apply the original computational analysis to new data focused on an analogous but systematically different set of questions; and so on. Perhaps the most common and most valuable variation is benchmarking alternative models or algorithms with respect to the same quantitative evaluative metric on the same training and testing material. All of these varied responses to a publication are consistent with Jon Claerbout's original vision, in my opinion, though they go far beyond simply "[attaching] to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs".
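The benchmarking case in particular is simple enough to sketch in code. In the toy example below, the dataset, the two models, and the metric are stand-ins chosen only to make the sketch self-contained and runnable; the point is just that both systems are scored with the same metric on the same fixed split.

    # Toy benchmark: two alternative models, one shared train/test split, one metric.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_digits(return_X_y=True)
    # Fixing random_state makes the split itself shareable and re-runnable.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    for name, model in [("logistic regression", LogisticRegression(max_iter=2000)),
                        ("random forest", RandomForestClassifier(random_state=0))]:
        model.fit(X_train, y_train)
        score = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: accuracy = {score:.3f}")

Any reported difference is then a difference between the systems, not between evaluation procedures.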
For many reasons, I think that Drummond is profoundly wrong on the substance. But even if you believe his assertion that "Although X is desirable, I contend that the impoverished version, Y, is one not worth having", you should reject his attempt to swap the reference of the terms X and Y, substituting reproducibility for what many others have been calling replicability, and vice versa.
Some previous LLOG posts on related topics: "Reproducible research" (11/14/2008); "Reproducible Science at AAAS 2011" (2/18/2011); "Literate programming and reproducible research" (2/22/2014); and "Reliability" (2/28/2015).
Brett said,
October 31, 2015 @ 8:51 am
"Reproducible" was standard research lingo long, long before 1990. The Journal of Irreproducible Results was founded back in 1955, and the name was a play on already standard terminology.
[(myl) Applications to science of morphemes related to Latin replicare and re+producere go back much further than that. The OED gives e.g.
1883 Fortn. Rev. Aug. 275 The results of scientific discoveries..are, as a rule, reproducible at will.
1917 Bull. Agric. Exper. Station Nebraska No. 160. 39 All tests were replicated ten times each year, except the unselected seed, which was replicated 30 times.
And I strongly suspect that similar quotations go back to the 17th century if not before.
But the distinction between re-doing an experiment, as opposed to re-running a given analysis of a given dataset, didn't really become important until data became digital and people began using computers for analysis. (The concept was always available, and was occasionally put into practice, but things really changed when scientific practice was broadly digitized in the 1980s…)]
Schnoerkelman said,
October 31, 2015 @ 9:40 am
Reproduce (regenerate) the result, replicate (copy) the experiment is how I've thought about them.
+1 on Journal of Irreproducible Results :-)
D.O. said,
October 31, 2015 @ 9:57 am
I think there are at least three, not two, distinct things under the reproduce/replicate rubric. The thing that requires publishing data and code (Claerbout's reproducibility) is one; the attempt to redo an experiment as precisely as it was originally done, in order to achieve the same result, is another; and an attempt to change the experimental setting, varying what are claimed to be irrelevant parameters while retaining the important ones, is yet another.
[(myl) There's a rich space of relevant concepts. As I wrote above:
We might re-apply the original computational analysis to the original data, perhaps to check unreported aspects of the method or the results; we might re-implement the model or algorithm and apply it to the original data, to test a new program or to check for coding or algorithmic errors; we might apply the original computational analysis to new data, meant to test exactly the same hypotheses; we might apply a different model or algorithm to the original data, as an independent test of the original hypothesis, or in support of a different one; we might apply the original computational analysis to new data focused on an analogous but systematically different set of questions; and so on.
]
Jason Eisner said,
October 31, 2015 @ 11:34 pm
Thanks for calling attention to this and making a ruling. Maybe your post will now be highly ranked when people search the web to figure out which is which.
It's unfortunate that Drummond's usage is more intuitive, at least to me, since I think of a replica as being more exact than a reproduction. We have the stock phrase "replaced with an exact replica" (5 times more frequent on Google than "replaced with an exact reproduction"), whereas reproduction has given me children rather than clones. But this was not a good enough reason for Drummond to sow confusion when there was an established usage.
Drummond also has an ally in the Reproducibility Project, recently in the news. As noted by the Simply Statistics article that you cite, it should have been called the Replicability project under the terminology that you recommend, although they tried very hard to make the replications as faithful as possible by working with the original authors.
[(myl) You're right about the intuitions — presumably that's why/how Drummond got it backwards. And presumably the "Reproducible Research" term got started because (a) reproducible is somewhat commoner than replicable; and (b) in the context of Claerbout's seismology-modeling work in 1990, only one term was really needed. When Peng and others wanted to add a term for "re-doing the experiment(s)", that left replicable.
This obviously isn't the first case where technical terminology is confusing to outsiders — consider the whole "passive" business — or has different meanings in different fields — e.g. "recursive function theory" vs. "recursive filters". Though I don't know of any other cases where two groups of experts in essentially the same field use a pair of terms to make the same distinction in opposite ways.
But "making a ruling"? Ha. ]
J. W. Brewer said,
November 2, 2015 @ 8:38 pm
"Reproduction" obviously has a wider range of meanings than "replica," but I'm not sure I am persuaded that the key ordinary-language difference between "replica" and the relevant sense of "reproduction" is degree of exactness. I would suggest that the verb "replicate" may have different and perhaps narrower ordinary-language semantics (and less "ordinary," because it's a more high-falutin' word to start with) than the noun "replica" and that's where you would want to focus if you were intent on showing that the 1990 usage has the intuitions backwards.* Although any time technical jargon wants to label two sharply distinguished things by using pre-existing words that in ordinary language are near-synonyms, trouble is bound to arise regardless of which word is assigned which technical meaning.
*I have trouble imagining circumstances where it would be idiomatic to respond to the question "Is that a replica" by saying "No, it's a reproduction" (or even "Well, it's a reproduction but I'm not sure I'd say it's actually a replica") but I can imagine responding to the question "Were you able to replicate their methodology" by saying something like "Not exactly, but we reproduced it as best as we could."
Brett said,
November 3, 2015 @ 8:54 am
@J. W. Brewer: In a discussion of works of art, I would take "replica" to imply something that was the same size as the original, whereas a "reproduction" would not need to be. I don't know really how widespread this distinction is.
J. W. Brewer said,
November 3, 2015 @ 11:10 am
Brett: for things other than "fine art," e.g. automobiles and ships, it is pretty easy to find usages via googling that describe the same object of interest to hobbyists/collectors as both a "scale model" (so not the same size as the real thing) and an "exact replica." But the more I think about it the clearer it seems to me that "make a replica" is not a very good gloss for the verb "replicate" and likewise "the process or habit of making replicas" is not a very good gloss for the noun "replication," which I think strengthens my earlier notion that the ordinary-language semantics of "replica" are a bit of a red herring here.
BZ said,
November 3, 2015 @ 4:08 pm
In programming, reproducing and replicating a reported problem are just two words for the same concept. Either can mean the original user re-doing something to recreate the problem, or the person assigned to investigate the report independently arriving at the same problem by following the steps the original user described.
If the two words have been used interchangeably since at least the 19th century, artificially imposing a difference on them can and will lead to confusion about which one is which. Isn't this exactly what we are trying to get rid of by rejecting supposed rules such as the that/which distinction? Granted, technical jargon is different from everyday speech, but if a distinction is newly introduced between formerly interchangeable terms within the same jargon, you will run into similar issues. You may have to fall back on defining the terms you want to use in the paper, the way legal documents do.
And no, I don't have a better solution for you regarding this needed distinction.
Abraham Flaxman said,
November 4, 2015 @ 10:02 am
Thanks so much for putting this all together. These terms are a recurring source of confusion for me. I want to call your attention to another important paper in establishing the term "replication": King, Gary. 1995. “Replication, Replication,” PS: Political Science and Politics, 28: 443–499. Copy online here.
This is the quantitative political scientist's take, and it has been influential in how my colleagues and students are doing global health research as well.
Jeremy Leipzig said,
November 4, 2015 @ 2:54 pm
Good post. I am not a linguist, but you might be amused by a comic I had commissioned for this very topic.
see http://jermdemo.blogspot.com/2012/12/the-reproducible-research-guilt-trip.html
Ben Marwick said,
November 6, 2015 @ 7:41 am
This is a great post, and helps a lot to make sense of the various ways people are writing about reproducible research. There's yet another variant of 'reproducibility', contrasted with 'repeatability', here: http://www.nature.com/ngeo/journal/v7/n11/full/ngeo2283.html
The author writes:
"For example, repeatability and reproducibility are often conflated in the context of scientific computing. Repeatability means the ability to re-run the same code at a later time or on a different machine. Reproducibility means the ability to recreate the results, whether by re-running the same code, or by writing a new program."
Here's the distinction explained further:
"Repeatability without reproducibility — getting different results when re-running the code — can be a result of fragile code, combined with small changes in the hardware platform, the compiler or one of the ancillary tools. It is especially common in numerical simulations of chaotic phenomena, such as weather, where any change in how the code is compiled and optimized may lead to tiny rounding differences that rapidly multiply as the simulation proceeds. In meteorology and climate science, modellers handle this problem by using ensembles of runs to produce probabilistic results. Exact repeatability is extremely hard to maintain across platforms (see Box 1).
Reproducibility without repeatability — the confirmation of results using different code — is the computational equivalent of a replicated experiment, the bread-and-butter of doing science. Independently reproducing computational results is a creative process that can lead to the discovery of new approaches, and generates a stronger body of scientific evidence."
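How little it takes to break exact repeatability is easy to demonstrate with a toy computation. The snippet below is just the logistic map, not anyone's real simulation code; a perturbation at the level of double-precision rounding error eventually swamps the result:

    # Toy illustration: in a chaotic iteration, a rounding-sized perturbation grows
    # until the two "runs" of the same computation no longer agree at all.
    r = 3.9                      # logistic-map parameter in the chaotic regime
    x_a = 0.4
    x_b = 0.4 + 1e-15            # difference comparable to double-precision rounding error

    for step in range(1, 101):
        x_a = r * x_a * (1.0 - x_a)
        x_b = r * x_b * (1.0 - x_b)
        if step % 25 == 0:
            print(f"step {step:3d}: |x_a - x_b| = {abs(x_a - x_b):.3e}")
    # Within about a hundred iterations the difference is of order one.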
The author sees little value in 'reproducibility with repeatability' and claims that 'it is not obvious that making the code associated with a specific scientific publication available will lead to significant advances in reproducibility or to significant new insights'.
But I disagree, because using code from someone else's publication is the ideal starting point for generating new approaches. It's much less work to fork another researcher's code and modify it than to write the code from scratch. If research publications are routinely accompanied by code that others can reuse and extend, then the barriers to engaging with and extending that research are greatly lowered.
Another minor variant can be found in Stodden et al. (http://stodden.net/icerm_report.pdf), an earlier work than the 2014 book cited in the post. They split reproducible research five ways:
1. Reviewable Research. The descriptions of the research methods can be independently assessed and the results judged credible. (This includes both traditional peer review and community review, and does not necessarily imply reproducibility.)
2. Replicable Research. Tools are made available that would allow one to duplicate the results of the research, for example by running the authors’ code to produce the plots shown in the publication. (Here tools might be limited in scope, e.g., only essential data or executables, and might only be made available to referees or only upon request.)
3. Confirmable Research. The main conclusions of the research can be attained independently without the use of software provided by the author. (But using the complete description of algorithms and methodology provided in the publication and any supplementary materials.)
4. Auditable Research. Sufficient records (including data and software) have been archived so that the research can be defended later if necessary or differences between independent confirmations resolved. The archive might be private, as with traditional laboratory notebooks.
5. Open or Reproducible Research. Auditable research made openly available. This comprises well-documented and fully open code and data, publicly available, that would allow one to a) fully audit the computational procedure, b) replicate and also independently reproduce the results of the research, and c) extend the results or apply the method to new problems.
These are interesting distinctions, but in that list, 'confirmable' seems to have the meaning of 'replicable', as in 'independently implementing scientific experiments to validate specific findings'. That confusion may be why this list didn't have a lot of influence. Their 2014 explanation is a big improvement, and hopefully your post above will help to establish it as the definitive one!