Big Inaccessible Data


John Markoff, "Troves of Personal Data, Forbidden to Researchers", NYT 5/21/2012:

When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.

(I wish this were generally true…)

At least that is how the system is supposed to work. But lately social scientists have come up against an exception that is, true to its name, huge.

It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud.

For those who don't care much about science, and oppose data publication on the basis of some combination of beliefs in corporate secrecy, personal privacy, and researchers' "sweat equity", here's a stronger argument: lack of broad access to representative data is also a recipe for bad engineering. Or rather, it's a recipe for slow to non-existent development of workable solutions to the technical problems of turning recorded data into useful information.

At the recent DataEDGE workshop in Berkeley, as well as at the recent LREC 2012 conference in Istanbul, I was unpleasantly surprised by the widespread lack of awareness of this (in my opinion evident) fact.

25 years of careful attention to making "big data" accessible to researchers is the reason that we now have practical solutions in various areas of speech and language technology, from "text analytics" to speech recognition to machine translation.  This effort, fostered especially by DARPA's Human Language Technology programs, lowered barriers to entry and created a research community in this area that was orders of magnitude larger than it otherwise would have been. And the ability to compare the performance of different algorithms on the same training and testing data has been a necessary condition for the gradual algorithmic progress of the past 25 years. [Full disclosure: the Linguistic Data Consortium, which I helped to found, also played a role in this process.]

Continuing with John Markoff's article:

The issue came to a boil last month at a scientific conference in Lyon, France, when three scientists from Google and the University of Cambridge declined to release data they had compiled for a paper on the popularity of YouTube videos in different countries.

The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience.

In February, Dr. Huberman had published a letter in the journal Nature warning that privately held data was threatening the very basis of scientific research. “If another set of data does not validate results obtained with private data,” he asked, “how do we know if it is because they are not universal or the authors made a mistake?”

This has been a standard problem for decades in the social, psychological, and biomedical sciences, where it's been common for large government-funded data collections to be closely held by the researchers who were funded to collect them.

As Markoff goes on to explain:

At leading social science journals, there are few clear guidelines on data sharing. “The American Journal of Sociology does not at present have a formal position on proprietary data,” its editor, Andrew Abbott, a sociologist at the University of Chicago, wrote in an e-mail. “Nor does it at present have formal policies enforcing the sharing of data.”

The problem is not limited to the social sciences. A recent review found that 44 of 50 leading scientific journals instructed their authors on sharing data but that fewer than 30 percent of the papers they published fully adhered to the instructions. A 2008 review of sharing requirements for genetics data found that 40 of 70 journals surveyed had policies, and that 17 of those were “weak.”

The data-sharing policy of the journal Science says, “All data necessary to understand, assess and extend the conclusions of the manuscript must be available to any reader of Science.” But in the case of a 2010 article based on data from cellphone patterns, a legal agreement with the data provider prevented the researchers from even disclosing the country of origin.

There are certainly many problems here, ranging from privacy concerns to intellectual property rights to the sheer practical difficulty of arranging and distributing very large data sets. But in my experience, all of these difficulties can be overcome when people are motivated to do so. And researchers often rely on these difficulties to bolster a conclusion that they prefer on much more personal grounds, namely the competitive advantage and protection against refutation that they derive from exclusive access to a large and hard-to-replicate body of data.

A common consequence of the view that "it can't be done", wherever this view comes from, is that data is collected in a way that in fact does guarantee that it can't be shared with other researchers — because the consent forms (or other agreements with people involved) fail to include any provision for such sharing, or even explicitly promise that it will not occur. For example, at LREC 2012 I was told that the Norwegian portion of the Nordic Dialect Corpus, based on a large geographic sample of spoken interviews, was collected based on agreements with the interviewees that will prevent the recordings being shared with anyone outside the university research group involved "until perhaps after all the participants are dead". It was asserted that this state of affairs is mandated by "our privacy laws", though other Norwegian researchers later expressed surprise at this point of view.


  1. Philip Resnik said,

    June 4, 2012 @ 1:55 pm

    This difficulty is all the more evident in healthcare, where HIPAA prevents disclosure of clinical records for responsibly controlled research purposes just as strictly as it would prevent disclosure of those same records to the National Enquirer. (HIPAA is the U.S. law that protects personal health information, also known as PHI.) The privacy issues are substantial, and HIPAA protections are clearly extremely important, but the law is also just as clearly an obstacle to progress as currently formulated, presumably because it just wasn't constructed with these sorts of secondary uses in mind.

    At a recent workshop on natural language processing and clinical decision support at the National Institutes of Health, this came out rather forcefully in the closing discussion. I'd say the consensus there was that, unlike the sorts of data Mark alludes to above, the difficulties *cannot* be overcome just by data collectors and data users being sufficiently motivated to do so. This is one for the policy makers, yet as far as I can tell, they don't yet understand the importance of the issues.

    [(myl) I agree that HIPAA poses special problems, but I will also note that the Alzheimer's Disease Neuroimaging Initiative (ADNI) has found a way to make a considerable amount of clinical data available for research use. Browse the ADNI website, or take a look at Neal Buchholtz's slides from the Berlin 9 Open Access Conference session on "Transforming Research through Open Online Access to Discovery Inputs and Outputs", from which I quote: "Goal is rapid public access of all raw and processed data".

    Neal made clear in his talk that the reason for this goal was precisely a belief that such access would lower barriers to entry, increase the size of the research community, permit rapid replication and extension, and generally make a qualitative difference in the rate of progress on certain key problems.

    I also agree that too many policy-makers — and too many scientists and engineers as well — are blind to the importance of these issues.]

  2. Jerry Friedman said,

    June 4, 2012 @ 2:07 pm

    For government-funded research, maybe there should be a policy for the data like the one at the NIH that the results of the research have to be made public. The petition recommended here by Eric Baković asked that that be extended to all research funded by the U.S. government. Should there be a petition like that for data?

  3. Michael Newman said,

    June 4, 2012 @ 2:18 pm

    One issue is getting better in my experience. The IRB at CUNY used to prefer to have human subjects data destroyed after a period of time, and they still worry about where researchers keep their audio recordings. Now, they don't seem to make the destroy request and at least make it easier to put a share checkbox in the consent form. I have a data set which I can't share with anyone that I collected between 1998 and 2002 because I didn't construct the consent form to make it available. Now, I can share all the data that I'm collecting in Barcelona.

    I got the impression from my last IRB training, though, that some people apparently want to treat internet data such as Facebook posts and even YouTube videos similarly to privately collected data. Is that correct?

  4. david said,

    June 4, 2012 @ 3:21 pm


    As an independent developer with an interest in MT and a track record producing open source lexicons, it is far beyond my means to purchase an annual commercial subscription for $24,000 in order to gain access to LDC research corpora.

    If you genuinely believe what you've written here, you should push your own organization to be much more open with the developer community: either making it possible for individuals to purchase non-commercial subscriptions at reasonable prices, or adopting a scaling licensing fee based on the annual revenues of the company doing the licensing.

    Realistically, Google, Facebook, and Yahoo have all released an incredible amount of open source software that is invaluable to anyone solving practical problems in search and large-scale data processing. So this is not a black-and-white issue….

    [(myl) As a not-for-profit entity, you would pay a membership fee of $2,400, not $24,000. And we offer "data scholarships" to those who want to work with LDC data and genuinely can't afford it — you should feel free to apply for one. But our founding agreement with the U.S. Government requires us to raise enough money through subscriptions and data sales to pay for IPR negotiations, data licensing fees, data publication and distribution costs, and so on. In particular, nearly all of the parallel-text and comparable-text data of interest to MT developers belongs (in the IPR sense) to someone else — publishers, broadcasters, and others — from whom we've had to license it for limited re-distribution. The IPR negotiations, in the case of each of dozens of sources, take weeks of time for an LDC employee; and in most cases, we need to pay subscription fees as well: last year, our newswire subscription bill was about $200K. In addition, preparing such material for publication takes many person-months of work on the part of LDC employees. If someone would endow us so as to pay those costs indefinitely into the future, we would gladly reduce all user fees to zero.

    The most important divide, in my opinion, is between data that's published (i.e. available in a standard way to everyone) and data that's proprietary (i.e. not available at all, or available only to friends and relations of the owner, or whatever).

    There's another important divide between data that's "open access", in the sense that there's no cost to acquire it and that anyone can re-distribute it, and data that isn't. Free re-distribution is not an option for material that belongs to publishers, broadcasters, and so on. (And not-for-profit entities like NPR and the United Nations are, in my experience, tighter with their stuff than commercial entities are.) Free re-distribution can also be problematic for data that has privacy or other issues, where it may be necessary to keep track of who gets it, and to make recipients sign a user agreement constraining what they can do with it.

    Still, it would be nice if research material could be given out to researchers for little or no cost. If you (or anyone else) can provide an alternative way to pay the salaries and benefits of the people who do the work to make it possible to create and maintain a catalog of speech and language datasets to distribute, we'd be happy to reduce or eliminate our subscription fees.]

  5. jaypatrick said,

    June 4, 2012 @ 6:01 pm

    When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.

    Are there any scientific fields or sub-fields where this is generally true?

    [(myl) I believe that there are some areas of geophysics (dependent on seismic data and on climate data) where it's close to true that everyone has access to the same data sets. Similarly, in some areas of genomics, proteomics, etc., the basic sequence data is (I think) mostly shared. In some areas of astronomy this may also be close to true. Thanks to 25 years of agitation by Brian MacWhinney, it's become normal for child language acquisition researchers to deposit their transcripts and recordings in the CHILDES repository. And so on.

    But your implication is correct — such situations are the exception rather than the rule.]

  6. Maria Wolters said,

    June 5, 2012 @ 5:16 am

    I agree with Philip that health is a particularly tricky area. This is why good de-identification algorithms are so important for making sure that any data that leaves the health service is properly anonymised. I am currently working on getting a de-identification programme set up for free text data in Scottish primary care. I am lucky to work with somebody who knows the ins and outs of the privacy regulations.

    One step in the right direction is the i2b2 challenge. For these challenges, large standardised and anonymised corpora are assembled which can then push research further.

    Regarding your point about the Norwegian research, often one only encounters those absurd restrictions when getting permission for data collection. It's perfectly possible that your Norwegian sources found the prospect of having to wait until the death of the interviewees strange, but researchers are very much at the mercy of individual Ethics committees here, who may or may not get the point of open data.

    As a community, we need to avoid being caught out by these considerations when planning research. While we're waiting for society and policy to change, maybe we should start talking amongst ourselves about the best ways of coping with the restrictions that are in place.

    [(myl) While regulations are often problematic, my experience is that the biggest problem is the misinterpretation of regulations. In the U.S., this can certainly happen within "Institutional Review Boards", which are often dominated by clinical researchers who develop rules of thumb (like the one about destroying recordings and transcripts after the "experiment" is over) that are nowhere mandated or even suggested by the regulations themselves. And around the world, most of the misinterpretation of regulations is done by researchers rather than by regulatory bodies.

    In addition to anonymization, another layer of protection can be imposed by user agreements. For research data sets to be useful, it's not necessary for them to be "put up on the internet" — there can be "open" contractual arrangements (in the sense of contracts that are available to all on terms publicly specified in advance) whereby researchers agree to various restrictions on their use and reproduction of the research material in question. These restrictions have long been normal in arrangements to protect intellectual property — similar restrictions can and should also be imposed in order to protect privacy where such protection is appropriate or legally required.]

  7. Howard Oakley said,

    June 6, 2012 @ 3:24 am

    I am not sure that I entirely agree with the premise. It is my understanding that the underlying tenet in science (in its broadest sense) is that any research that is published includes the information sufficient for others to replicate the study. Such replication is a fundamental part of the scientific process, as it allows others to validate or question your work.

    The problem arises in what is necessary for that replication. In small-sample experimental studies, sufficient detail of the experimental protocol usually suffices, as that enables another lab to go off and repeat the study.

    In healthcare, it is not the original data that are necessary, but again sufficient detail to repeat the study. However that can pose a problem with large and long multi-centre studies, where others are extremely unlikely to obtain the resources to try to replicate. Patient privacy is often brandished as an excuse to prevent access to original data, but in fact this is almost invariably spurious, as almost all studies perform (or should perform) analyses on anonymised data, so there is seldom any sound reason to deny access to derived datasets for analytical use.

    With large or specialised corpora, depriving others of access to those corpora is a clear barrier to replication, and should not be allowed by those controlling publication. I think that the same should apply to any work on large or expensive datasets.

    Re-analysis of data is a much more recent elaboration (after all, it is only in the recent past that large sample statistical methods have become so widely used). The past principle here has been that the publication (or supplementary material available on request, perhaps) should contain sufficient primary data to allow others to check the analytical methods used, and perhaps try alternatives. That is a generally more complex matter, and these days good studies in good journals will make some sort of data disclosure to allow that. With older experimental work, for instance, good labs were always happy to make the original lab records available on request. Current excuses for not doing similar seem somewhat feeble, given how much easier that has become.

