John Markoff, "Troves of Personal Data, Forbidden to Researchers", NYT 5/21/2012:
When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.
(I wish this were generally true…)
At least that is how the system is supposed to work. But lately social scientists have come up against an exception that is, true to its name, huge.
It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud.
For those who don't care much about science, and oppose data publication on the basis of some combination of beliefs in corporate secrecy, personal privacy, and researchers' "sweat equity", here's a stronger argument: lack of broad access to representative data is also a recipe for bad engineering. Or rather, it's a recipe for slow to non-existent development of workable solutions to the technical problems of turning recorded data into useful information.
At the recent DataEDGE workshop in Berkeley, as well as at the recent LREC 2012 conference in Istanbul, I was unpleasantly surprised by the widespread lack of awareness of this (in my opinion evident) fact.