Language Log

#IAmAResearchParasite

March 4, 2016 @ 9:07 am · Filed by Mark Liberman under Language and politics

Towards the end of January, there were three editorials in the New England Journal of Medicine with somewhat overlapping authors and somewhat conflicting messages. The first editorial was D.B. Taichman et al., "Sharing clinical trial data — a proposal from the International Committee of Medical Journal Editors", published online 1/20/2016:

The International Committee of Medical Journal Editors (ICMJE) believes that there is an ethical obligation to responsibly share data generated by interventional clinical trials because participants have put themselves at risk. In a growing consensus, many funders around the world — foundations, government agencies, and industry — now mandate data sharing.

The second editorial was by Dan L. Longo and Jeffrey Drazen — Drazen is the editor-in-chief of the journal, and Longo is a deputy editor. Though phrased somewhat diplomatically, it expressed a strikingly negative evaluation of the whole idea ("Data Sharing", NEJM 1/21/2016):

The aerial view of the concept of data sharing is beautiful. What could be better than having high-quality information carefully reexamined for the possibility that new nuggets of useful data are lying there, previously unseen? The potential for leveraging existing results for even more benefit pays appropriate increased tribute to the patients who put themselves at risk to generate the data. The moral imperative to honor their collective sacrifice is the trump card that takes this trick.

However, many of us who have actually conducted clinical research, managed clinical studies and data collection and analysis, and curated data sets have concerns about the details. The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters. Special problems arise if data are to be combined from independent studies and considered comparable. How heterogeneous were the study populations? Were the eligibility criteria the same? Can it be assumed that the differences in study populations, data collection and analysis, and treatments, both protocol-specified and unspecified, can be ignored?

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

This response created enough of a fuss that Jeffrey Drazen felt the need to walk it back a bit ("Data Sharing and the Journal", NEJM 1/25/2016):

We want to clarify, given recent concern about our policy, that the Journal is committed to data sharing in the setting of clinical trials. […] Journal policy will therefore follow that outlined in the ICMJE editorial and the IOM report: when appropriate systems are in place, we will require a commitment from authors to make available the data that underlie the reported results of their work within 6 months after we publish them.

In the process of formulating our policy, we spoke to clinical trialists around the world. Many were concerned that data sharing would require them to commit scarce resources with little direct benefit. Some of them spoke pejoratively in describing data scientists who analyze the data of others. To make data sharing successful, it is important to acknowledge and air those concerns. In our view, however, researchers who analyze data collected by others can substantially improve human health.

Those two semi-negative editorials evoked a response from Marcia McNutt, the editor-in-chief of Science ("#IAmAResearchParasite", Science 3/4/2016):

In the midst of steady progress in policies for data sharing, a recent editorial expressed a contrarian view.* The authors described the concern of some scientists about the rise of an underclass of “research parasites” who exploit data sets that are collected and curated by others. Even worse, these parasites might use such data to try to disprove the conclusions posited in the data's original source studies. The editorial raised the points of how anyone not involved in the original study could use the data without misrepresenting it, and the danger of perhaps arriving at erroneous conclusions. The editorial advised instead that data sharing be implemented by involving the authors of the original study as coauthors in follow-up research. The research community immediately took to Twitter under the hashtag #IAmAResearchParasite to voice opposition to the editorial.

One of the many tweets:

https://twitter.com/CT_Bergstrom/status/705469753612042240

In my opinion, the moral debt owed to subjects is not nearly the most important reason for sharing data. Higher on the list are reducing error, fraud and (self-)deception; increasing the size of available datasets; reducing costs and lowering barriers to entry; and speeding up the virtuous cycle of science and technology.

Some relevant LLOG posts:

"Reproducible research", 11/14/2008
"Reproducible science at AAAS 2011", 2/18/2011
"Big Inaccessible Data", 6/4/2012

You might also be interested in "Reproducible Computational Experiments", 2/26/2015, my presentation to a workshop "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results" (report here), which was organized by the Committee on Applied and Theoretical Statistics (CATS) of the Board on Mathematical Sciences and Their Applications of the National Academy of Sciences.

I'd like to point out that in this respect, scholars in theology and in the humanities are a few millennia ahead of the scientists and engineers. Since the days of Pāṇini 2500 years ago, or the Masoretes a thousand years later, it's been assumed in many traditions that everyone should have access to the same textual datasets. Occasional deviations from these norms, such as the delay in the publication of the Dead Sea Scrolls, have resulted in vigorous controversy.

And let's also note that the NEJM editors' concerns are somewhat similar to the reasons that many 16th-century authorities were opposed to the widespread publication and individual interpretation of Bible translations.

So I'm happy to see the data-publication ethos spreading from speech and language technology to other areas of science and engineering.

[See also the report of the Committee on Strategies for Responsible Sharing of Clinical Trial Data, "Sharing clinical trial data: maximizing benefits, minimizing risk", NAS 1/14/2015]


13 Comments

David L said,

March 4, 2016 @ 9:41 am

The sentiment that "this is my data and only I know what it really means" is not exactly a ringing endorsement of the quality of the data.

In any case, meta-analysis — statistical evaluation of combined datasets, done with the aim of getting stronger results — has a long history. See https://en.wikipedia.org/wiki/Meta-analysis

The Wikipedia article dates the concept (in modern form) to the 17th century and the term itself to 1976. Dealing with differences in the way different sets of data were collected is precisely what makes meta-analysis difficult and sometimes controversial. But if all the conditions of a study are made plain, along with the data itself, the problem is in general tractable.
J. W. Brewer said,

March 4, 2016 @ 11:28 am

I would think what you should legitimately be worried about (although I don't know how much of a near-term risk this is actually likely to be in medical research) is if a field ends up with too many scholars specializing in meta-analysis and not spending a decent percentage of their time doing actual data collection that could then be raw material for other people's meta-analyses. One problem some have claimed is systematic in linguistics for the last 50+ years (i.e. since the rise of Chomskyanism, although this is admittedly a high-level view that no doubt some would say is a caricature) is the loss of the sense that most scholars in the field ought to be expected to have spent meaningful time engaged in fieldwork in New Guinea or wherever doing data collection on underdescribed languages. Indeed, even the UPenn notion that you didn't have to go to New Guinea but maybe you at least ought to take the subway across town and rigorously study some sort of underanalyzed distinctive local variety of English had considerable trouble battling against the MIT notion that you could just stay in your own office introspecting about your own variety of English. And while I am personally fond of the sort of typological/universals work (not unlike meta-analysis?) that requires working with data from many more languages than any one scholar could possibly do primary fieldwork or data collection on, my sense is that that sort of work is done better by a scholar who has spent some time doing that sort of primary work, both because it avoids the "parasite" free-riding problem and also because it may provide greater insight into how to handle with appropriate care the data gathered by others.
Jon said,

March 4, 2016 @ 1:04 pm

Epidemiologists distinguish between meta-analysis and pooling. Meta-analysis is taking results from several published papers and using them to derive some sort of weighted mean result. Pooling is putting all of the raw data from several studies together, and re-analysing. Pooling is reckoned to be more statistically powerful. I have been on the sidelines of a large pooling study, and it was clearly a lot of work, just to make the data from different studies compatible.
Providing raw data alongside papers is clearly a good thing, and will allow more pooling of studies to be done. But don't underestimate the amount of work involved in providing all the relevant information in a usable form. It's not just providing tables of data, there is all of the background information about how the data was collected, and the decisions taken along the way, that may be different between studies. Comparing providing raw data to providing a bible text is simplistic.
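To make the distinction between meta-analysis and pooling concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the study names and numbers are invented, the meta-analysis is a simple fixed-effect, inverse-variance weighted mean of per-study means, and the pooled analysis just concatenates the raw observations. It also assumes the studies are directly comparable, which is exactly the hard part in practice.

```python
from statistics import fmean, variance

# Hypothetical raw measurements from three small studies (invented numbers).
studies = {
    "study_a": [1.0, 2.0, 3.0],
    "study_b": [2.0, 4.0, 6.0, 8.0],
    "study_c": [3.0, 3.0, 6.0],
}


def summarize(values):
    """Return (mean, squared standard error of the mean) for one data set."""
    return fmean(values), variance(values) / len(values)


def meta_analysis(summaries):
    """Fixed-effect estimate: inverse-variance weighted mean of study means.

    Uses only the published summaries, never the raw observations.
    """
    weights = [1.0 / se2 for _, se2 in summaries]
    total = sum(weights)
    estimate = sum(w * mean for w, (mean, _) in zip(weights, summaries)) / total
    return estimate, 1.0 / total  # estimate and its squared standard error


def pooled_analysis(raw_by_study):
    """Pooling: combine every raw observation, then analyse as one data set.

    Assumes the studies measured the same thing in the same way.
    """
    combined = [x for values in raw_by_study.values() for x in values]
    return summarize(combined)


meta_estimate, meta_se2 = meta_analysis([summarize(v) for v in studies.values()])
pooled_estimate, pooled_se2 = pooled_analysis(studies)

print(f"meta-analysis estimate: {meta_estimate:.3f}")
print(f"pooled estimate:        {pooled_estimate:.3f}")
```

The two estimates differ because the weighted mean favours the studies with the smallest standard errors, while pooling weights every observation equally. A real pooled analysis would usually also model study-level differences rather than ignore them.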
Rubrick said,

March 4, 2016 @ 3:23 pm

The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.

If those choices aren't scrupulously documented and made public along with the data itself, then IMO there's already something amiss. This sounds like an own-goal to me.
Jerry Friedman said,

March 4, 2016 @ 3:34 pm

If people can be bothered to evaluate research in ways other than counting publications or even citations, won't they be able to judge it according to labor, originality, value to the discipline, and whatever else? And might it not happen that some analyses of others' data, though they didn't take much experimental labor, are original and valuable? I suspect it might.
Jason said,

March 4, 2016 @ 4:25 pm

"…or even use the data to try to disprove what the original investigators had posited…"

Imagine that! They might, y'know, do science.
maidhc said,

March 4, 2016 @ 5:31 pm

What Jason said. Isn't the scientific method based on providing enough information for other people to replicate your experiments?
Pflaumbaum said,

March 4, 2016 @ 6:04 pm

That was the line that leapt out at me too. I assumed that I'd somehow misunderstood what he meant. But what would a generous interpretation be? The meaning seems transparent.
Chris Waigl said,

March 4, 2016 @ 9:25 pm

Also, what Rubrick said.

Publishing data and "boring" stuff like pre-processing steps in a reproducible format is hard, sure, and in many corners of science just not (yet) part of standard procedures. My seniors, who I'm right now trying to convince of the desirability of spending time, effort and thought on this, aren't being evaluated based on whether they do this or not. This is not in a field where therapy decisions are made on the basis of my work. Still, it seems like a no-brainer to me to build the platforms we need for reproducibility and collaboration.
leoboiko said,

March 4, 2016 @ 11:09 pm

Yeah, where do I publish my data? I'm thinking of just uploading it, together with the code, to my self-hosted website, and also to a Github mirror (with the boring pre-processing steps etc. in a good old README or HOWTO text file). But my personal website won't be around for long, and I don't like the idea of trusting science archival to a private, proprietary company like Github. My university has a system to archive dissertations (as PDF files); but there's no system to archive the data and scripts that were collected/developed for the dissertation. It's a bit as if I could publish a recording of the music I composed, but not the score. Except it's even worse, because of the whole reproducibility-crisis-in-science thing.

[(myl) There are an increasing number of discipline-specific repository and data archiving/distribution sites, many of which are documented in the Registry of Research Data Repositories. A few specific (and diverse) examples include ICPSR, UniProt, and the Linguistic Data Consortium (which I direct). Many university libraries have set up data repositories: for a small sample, here are links to information pages at the University of Minnesota, Georgia Tech, Rutgers. And many journals now offer some amount of data archiving as "supplementary materials" for published research papers.

It's easier to find a place to deposit material than to find a place that will effectively distribute it, and it's harder still to get guarantees of long-term availability. Thus GitHub is easy to use and convenient for deposit and access, at least for archive sizes that can easily be downloaded via the internet, but it's run by a private for-profit company that might decide next year that maintaining its free archiving and download service is too much trouble, or might go out of business, or be sold to a larger company that might make those decisions. What happened to SourceForge is a relevant lesson.

And there's no general set of mechanisms in place for funding long-term preservation and distribution. At the moment, university libraries and other university-based activities strike me as the most reliable way to ensure some sort of long-term access for most sorts of material.]
DCA said,

March 5, 2016 @ 12:48 am

McNutt comes from a field (geophysics) where datasets are unique and can take a lot of field work to obtain. There has been the same concern about putting in a huge amount of effort collecting the data and then being scooped by someone — though this has not turned out to happen much, if at all. Over the last 20 years there has been a growing acceptance of open data, and it is now expected, though still not always done. Mostly the data goes to a few community-wide centers; also the major journals all have electronic supplements, and these can be a place to put the data. But it is additional work.
James Wimberley said,

March 6, 2016 @ 6:39 am

I realize that access to data and compulsory publishing are not the same, but they are related. The argument goes like this:
1. To avoid publishing bias towards positive results, and to comply with the implied bargain with the subjects, all clinical trials should be published.
2. To maximise the scientific value of clinical trials, and to comply with the implied bargain with the subjects, the published reports of clinical trials should provide for access to the data for review and meta-analysis.

The common thread is that morally the data belong to the subjects not the researchers.
Yuval said,

March 8, 2016 @ 7:15 am

The polarity in "The editorial raised the points of how anyone not involved in the original study could use the data without misrepresenting it" threw me off for a few seconds.
