Today and tomorrow I'm participating in the Berlin 9 Open Access Conference, at the Howard Hughes Medical Institute in Bethesda, MD. This afternoon I'll be giving a talk in a session on "Transforming Research through Open Online Access to Discovery Inputs and Outputs". Here's my abstract:
All published scientific and technical research should in principle be reproducible. But in many areas of science and technology, it’s now possible to conduct genuinely “reproducible research”, in the sense of “reproducible computational experiments”. This requires three things: (1) the data sets that serve as input; (2) the programs needed to run the experiment; and (3) a comprehensible account of what the experiment does, why it matters, and what the results are.
A traditional publication provides only the third leg of this three-legged stool. However, in many areas—from geophysics and molecular biology to computer vision and computational linguistics—the data sets that researchers use are available in principle to anyone. And there is no technical difficulty in providing the second leg of the stool, namely the experiment’s code.
Reproducible research is more efficient research, because it lowers barriers to checking and extension (including by the original authors), encourages broader collaborations, and leads to deeper understanding. And reproducible research is better research, because it’s less prone to error, fraud, and nonsense.
But reproducible research is not necessarily Open Access research. We do need all three legs of the reproducible-research stool to be published, just as the single traditional leg, the explanation, always has been. But the social-policy question of who pays for this is an independent matter.
Thus the Open Access movement needs to think seriously about access to data and code as well as to traditional papers. And data and code cost more to referee, prepare, maintain, and distribute than research papers do. As a result, the potential benefits of Open Access are greater; but if the Open Access movement fails to find solutions, the reproducible research trend will undermine the value of mere access to research papers.
One of the key issues I plan to talk about, but didn't put in the abstract, is the question of how to publish and share data in the biomedical, social, and behavioral sciences in a way that's consistent with policies on privacy and confidentiality, and specifically with government-imposed (and ethically appropriate) regulations on "human subjects".
Dieter Stein, who organized this session, sent around an extract from "Jim Gray on e-Science: A Transformed Scientific Method", from Tony Hey, Stuart Tansley, and Kristin Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery. Here's the part of the text that's relevant to our session:
[T]he Internet can do more than just make available the full text of research papers. In principle, it can unify all the scientific data with all the literature to create a world in which the data and the literature interoperate with each other. You can be reading a paper by someone and then go off and look at their original data. You can even redo their analysis. Or you can be looking at some data and then go off and find out all the literature about this data. Such a capability will increase the “information velocity” of the sciences and will improve the scientific productivity of researchers. And I believe that this would be a very good development! [...]
I’ve talked about publishing literature, but if the answer is 42, what are the units? You put some data in a file up on the Internet, but this brings us back to the problem of files. The important record to show your work in context is called the data provenance. How did you get the number 42? Here is a thought experiment. You’ve done some science, and you want to publish it. How do you publish it so that others can read it and reproduce your results in a hundred years’ time? Mendel did this, and Darwin did this, but barely. We are now further behind than Mendel and Darwin in terms of techniques to do this. It’s a mess, and we’ve got to work on this problem. [...]
[W]e have traditionally had authors, publishers, curators, and consumers. In the new world, individual scientists now work in collaborations, and journals are turning into Web sites for data and other details of the experiments. Curators now look after large digital archives, and about the only thing the same is the individual scientist. It is really a pretty fundamental change in the way we do science.
One problem is that all projects end at a certain point and it is not clear what then happens to the data. There is data at all scales. There are anthropologists out collecting information and putting it into their notebooks. And then there are the particle physicists at the LHC. Most of the bytes are at the high end, but most of the datasets are at the low end. We are now beginning to see mashups where people take datasets from various places and glue them together to make a third dataset.
So in the same sense that we need archives for journal publications, we need archives for the data. So this is my last recommendation to the CSTB: foster digital data libraries. Frankly, the NSF Digital Library effort was all about metadata for libraries and not about actual digital libraries. We should build actual digital libraries both for data and for the literature.
I wanted to point out that almost everything about science is changing because of the impact of information technology. Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, “data-intensive” science paradigm is emerging. The goal is to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other. Lots of new tools are needed to make this happen.
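To make Gray's question ("if the answer is 42, what are the units?") a bit more concrete, here is a minimal Python sketch of what a provenance record for a single published number might look like. It ties the value to its units, to a plain-language description, and to fingerprints of the input data and the program, which are the three legs of the stool in my abstract. Everything in it is an assumption made for illustration: the names `ProvenanceRecord`, `make_record`, and `matches` are hypothetical, and using SHA-256 checksums is my choice, not something Gray or any particular archive prescribes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record tying one published value to its inputs."""
    value: float
    units: str
    description: str   # the traditional "paper" leg: what was done and why
    data_sha256: str   # fingerprint of the input data set
    code_sha256: str   # fingerprint of the program that produced the value

    def to_json(self) -> str:
        """Serialize the record so it can be published next to the paper."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def matches(self, data: bytes, code: bytes) -> bool:
        """Check that the given data and code are the ones recorded here."""
        return (sha256_hex(data) == self.data_sha256
                and sha256_hex(code) == self.code_sha256)


def make_record(value: float, units: str, description: str,
                data: bytes, code: bytes) -> ProvenanceRecord:
    """Build a record from a result plus the raw bytes of its data and code."""
    return ProvenanceRecord(
        value=value,
        units=units,
        description=description,
        data_sha256=sha256_hex(data),
        code_sha256=sha256_hex(code),
    )


if __name__ == "__main__":
    # Toy stand-ins for a real data file and a real analysis script.
    data = b"6,7\n"
    code = b"print(6 * 7)\n"
    record = make_record(42.0, "widgets", "Product of the two inputs.",
                         data, code)
    print(record.to_json())
    print(record.matches(data, code))          # same inputs as recorded
    print(record.matches(data + b"8\n", code)) # data changed since publication
```

A record like this only shows that someone is holding the same data and code that produced the number. It does nothing to keep those files available or runnable in a hundred years, which is the harder archival problem Gray is pointing at.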