Towards the end of January, there were three editorials in the New England Journal of Medicine with somewhat overlapping authors and somewhat conflicting messages. The first editorial was D.B. Taichman et al., "Sharing clinical trial data — a proposal from the International Committee of Medical Journal Editors", published online 1/20/2016:
The International Committee of Medical Journal Editors (ICMJE) believes that there is an ethical obligation to responsibly share data generated by interventional clinical trials because participants have put themselves at risk. In a growing consensus, many funders around the world — foundations, government agencies, and industry — now mandate data sharing.
The second editorial was by Dan L. Longo and Jeffrey Drazen — Drazen is the editor-in-chief of the journal, and Longo is a deputy editor. Though phrased somewhat diplomatically, it expressed a strikingly negative evaluation of the whole idea ("Data Sharing", NEJM 1/21/2016):
The aerial view of the concept of data sharing is beautiful. What could be better than having high-quality information carefully reexamined for the possibility that new nuggets of useful data are lying there, previously unseen? The potential for leveraging existing results for even more benefit pays appropriate increased tribute to the patients who put themselves at risk to generate the data. The moral imperative to honor their collective sacrifice is the trump card that takes this trick.
However, many of us who have actually conducted clinical research, managed clinical studies and data collection and analysis, and curated data sets have concerns about the details. The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters. Special problems arise if data are to be combined from independent studies and considered comparable. How heterogeneous were the study populations? Were the eligibility criteria the same? Can it be assumed that the differences in study populations, data collection and analysis, and treatments, both protocol-specified and unspecified, can be ignored?
A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”
That editorial created enough of a fuss that Jeffrey Drazen felt the need to walk it back a bit ("Data Sharing and the Journal", NEJM 1/25/2016):
We want to clarify, given recent concern about our policy, that the Journal is committed to data sharing in the setting of clinical trials. […] Journal policy will therefore follow that outlined in the ICMJE editorial and the IOM report: when appropriate systems are in place, we will require a commitment from authors to make available the data that underlie the reported results of their work within 6 months after we publish them.
In the process of formulating our policy, we spoke to clinical trialists around the world. Many were concerned that data sharing would require them to commit scarce resources with little direct benefit. Some of them spoke pejoratively in describing data scientists who analyze the data of others. To make data sharing successful, it is important to acknowledge and air those concerns. In our view, however, researchers who analyze data collected by others can substantially improve human health.
Those two semi-negative editorials evoked a response by Marcia McNutt, the editor-in-chief of Science ("#IAmAResearchParasite", Science 3/4/2016):
In the midst of steady progress in policies for data sharing, a recent editorial expressed a contrarian view.* The authors described the concern of some scientists about the rise of an underclass of “research parasites” who exploit data sets that are collected and curated by others. Even worse, these parasites might use such data to try to disprove the conclusions posited in the data's original source studies. The editorial raised the points of how anyone not involved in the original study could use the data without misrepresenting it, and the danger of perhaps arriving at erroneous conclusions. The editorial advised instead that data sharing be implemented by involving the authors of the original study as coauthors in follow-up research. The research community immediately took to Twitter under the hashtag #IAmAResearchParasite to voice opposition to the editorial.
One of the many tweets:
— Carl T. Bergstrom (@CT_Bergstrom) March 3, 2016
In my opinion, the moral debt owed to subjects is not nearly the most important reason for sharing data. Higher on the list are reducing error, fraud and (self-)deception; increasing the size of available datasets; reducing costs and lowering barriers to entry; and speeding up the virtuous cycle of science and technology.
Some relevant LLOG posts:
You might also be interested in "Reproducible Computational Experiments", 2/26/2015, my presentation to the workshop "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results" (report here), which was organized by the Committee on Applied and Theoretical Statistics (CATS) of the Board on Mathematical Sciences and Their Applications of the National Academy of Sciences.
I'd like to point out that in this respect, scholars in theology and in the humanities are a few millennia ahead of the scientists and engineers. Since the days of Pāṇini 2500 years ago, or the Masoretes a thousand years later, it's been assumed in many traditions that everyone should have access to the same textual datasets. Occasional deviations from these norms, such as the delay in the publication of the Dead Sea Scrolls, have resulted in vigorous controversy.
And let's also note that the NEJM editors' concerns are somewhat similar to the reasons that many 16th-century authorities opposed the widespread publication and individual interpretation of Bible translations.
So I'm happy to see the data-publication ethos spreading from speech and language technology to other areas of science and engineering.
[See also the report of the Committee on Strategies for Responsible Sharing of Clinical Trial Data, "Sharing clinical trial data: maximizing benefits, minimizing risk", NAS 1/14/2015]