John Markoff, "Troves of Personal Data, Forbidden to Researchers", NYT 5/21/2012:
When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.
(I wish this were generally true…)
At least that is how the system is supposed to work. But lately social scientists have come up against an exception that is, true to its name, huge.
It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud.
For those who don't care much about science, and oppose data publication on the basis of some combination of beliefs in corporate secrecy, personal privacy, and researchers' "sweat equity", here's a stronger argument: lack of broad access to representative data is also a recipe for bad engineering. Or rather, it's a recipe for slow to non-existent development of workable solutions to the technical problems of turning recorded data into useful information.
At the recent DataEDGE workshop in Berkeley, as well as at the LREC 2012 conference in Istanbul, I was unpleasantly surprised by the widespread lack of awareness of this (in my opinion evident) fact.
25 years of careful attention to making "big data" accessible to researchers is the reason that we now have practical solutions in various areas of speech and language technology, from "text analytics" to speech recognition to machine translation. This effort, fostered especially by DARPA's Human Language Technology programs, lowered barriers to entry and created a research community in this area that was orders of magnitude larger than it otherwise would have been. And the ability to compare the performance of different algorithms on the same training and testing data has been a necessary condition for the gradual algorithmic progress of the past 25 years. [Full disclosure: the Linguistic Data Consortium, which I helped to found, also played a role in this process.]
Continuing with John Markoff's article:
The issue came to a boil last month at a scientific conference in Lyon, France, when three scientists from Google and the University of Cambridge declined to release data they had compiled for a paper on the popularity of YouTube videos in different countries.
The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience.
In February, Dr. Huberman had published a letter in the journal Nature warning that privately held data was threatening the very basis of scientific research. “If another set of data does not validate results obtained with private data,” he asked, “how do we know if it is because they are not universal or the authors made a mistake?”
This has been a standard problem for decades in the social, psychological, and biomedical sciences, where it's been common for large government-funded data collections to be closely held by the researchers who were funded to collect them.
As Markoff goes on to explain:
At leading social science journals, there are few clear guidelines on data sharing. “The American Journal of Sociology does not at present have a formal position on proprietary data,” its editor, Andrew Abbott, a sociologist at the University of Chicago, wrote in an e-mail. “Nor does it at present have formal policies enforcing the sharing of data.”
The problem is not limited to the social sciences. A recent review found that 44 of 50 leading scientific journals instructed their authors on sharing data but that fewer than 30 percent of the papers they published fully adhered to the instructions. A 2008 review of sharing requirements for genetics data found that 40 of 70 journals surveyed had policies, and that 17 of those were “weak.”
The data-sharing policy of the journal Science says, “All data necessary to understand, assess and extend the conclusions of the manuscript must be available to any reader of Science.” But in the case of a 2010 article based on data from cellphone patterns, a legal agreement with the data provider prevented the researchers from even disclosing the country of origin.
There are certainly many problems here, ranging from privacy concerns to intellectual property rights to the sheer practical difficulty of arranging and distributing very large data sets. But in my experience, all of these difficulties can be overcome when people are motivated to do so. And researchers often rely on these difficulties to bolster a conclusion that they prefer on much more personal grounds, namely the competitive advantage and protection against refutation that they derive from exclusive access to a large and hard-to-replicate body of data.
A common consequence of the view that "it can't be done", wherever this view comes from, is that data is collected in a way that in fact does guarantee that it can't be shared with other researchers, because the consent forms (or other agreements with the people involved) fail to include any provision for such sharing, or even explicitly promise that it will not occur. For example, at LREC 2012 I was told that the Norwegian portion of the Nordic Dialect Corpus, based on a large geographic sample of spoken interviews, was collected under agreements with the interviewees that will prevent the recordings from being shared with anyone outside the university research group involved "until perhaps after all the participants are dead". It was asserted that this state of affairs is mandated by "our privacy laws", though other Norwegian researchers later expressed surprise at this point of view.