Size Matters: Big Data, New Vistas in the Humanities and Social Sciences
Mark Liberman, Geoffrey Nunberg, Matthew Salganik
Vast archives of digital text, speech, and video, along with new analysis technology and inexpensive computation, are the modern equivalent of the 17th-century invention of the telescope and microscope. We can now observe social and linguistic patterns in space, time, and cultural context, on a scale many orders of magnitude greater than in the recent past, and in much greater detail than before. This transforms not just the study of speech, language, and communication but fields ranging from sociology and empirical economics to education, history, and medicine — with major implications for both scholarship and technology development.
We've got until tomorrow afternoon to figure out what we're going to talk about. Here are a few of my own current thoughts. If you're pressed for time, the slogan-sized version is "Big Data is not necessarily Big Science" and "Preserve Endangered Data".
1) The shifting spectrum of size. Or maybe this should be called "Towards the Data-Analysis Singularity". As a result of Moore's Law, along with whoever's law it is that expands accessible digital content, the whole spectrum of analytic scale is shifting rapidly. Yesterday's Borgesian Fantasy turns into today's Heroic Project; yesterday's Heroic Project turns into today's Breakfast Experiment™. Thus the first Bible concordance took thousands of monk-years to compile; today, any bright high school student with a laptop can do better in a few hours. In the 1960s, a million-word corpus was a big deal; today, … well, you get the idea. Projecting this trend into the future tells us that today's Heroic Projects, like creating the Google Ngram Viewer, will be tomorrow's undergraduate problem sets.
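Just to make the point concrete, here's roughly what the bright-high-school-student version looks like today: a keyword-in-context concordance in a couple of dozen lines of Python. (This is a minimal sketch, assuming a plain-text file such as a Project Gutenberg download; a serious concordance project would want smarter tokenization and indexing.)

    import re
    import sys
    from collections import defaultdict

    def build_concordance(path, window=5):
        """Map each word form to a list of (position, left context, right context)."""
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[A-Za-z']+", f.read().lower())
        conc = defaultdict(list)
        for i, w in enumerate(words):
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            conc[w].append((i, left, right))
        return conc

    if __name__ == "__main__":
        conc = build_concordance(sys.argv[1])
        # print the first ten hits for an (arbitrarily chosen) keyword
        for pos, left, right in conc.get("begat", [])[:10]:
            print(f"{pos:>8}  {left:>40} | begat | {right}")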
2) There's room for many a-more. Or maybe this should be "You ain't seen nothing yet". Most academic disciplines and sub-disciplines haven't really gotten on board this train yet. In my own field, phoneticians still mostly measure formant frequencies and voice-onset times by hand, even if they use computer programs rather than specialized electro-mechanical devices to do it. People who do large social surveys still mostly transcribe open-ended responses by hand and code them (also by hand) as if they were multiple-choice answers, ignoring the rest of the information in the recordings and transcripts. "Digital humanities" is still mostly a controversial gleam in a minority of humanists' eyes.
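For what it's worth, the tools needed to automate the hand measurements are mostly sitting there waiting to be used. Here's a sketch of what a machine-generated formant track looks like, using the parselmouth interface to Praat; the file name and the 10 ms sampling grid are illustrative assumptions, and a real study would add segmentation and sanity checks.

    import parselmouth  # Python interface to Praat (pip install praat-parselmouth)

    snd = parselmouth.Sound("vowel.wav")        # hypothetical recording
    formants = snd.to_formant_burg()            # Praat's standard Burg formant tracker
    t, step = 0.0, 0.01                         # sample the track every 10 ms
    while t < snd.duration:
        f1 = formants.get_value_at_time(1, t)   # F1 in Hz (NaN in unvoiced stretches)
        f2 = formants.get_value_at_time(2, t)   # F2 in Hz
        print(f"{t:5.2f}s  F1={f1:7.1f}  F2={f2:7.1f}")
        t += step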
3) It's good to be able to fail. Or maybe, "evolution needs variation and selection". The thing about Heroic Projects is that you can't do very many of them, and it's a big deal if they fail. As it gets easier to ask and answer a certain kind of empirical question, you can afford to ask more questions. As a result, more researchers with a wider range of goals and beliefs can explore a bigger space of more detailed hypotheses about a broader range of problems. This is a Good Thing, in my opinion, even if most of the explorations wind up in blind alleys. Thus the most important thing about Big Data in the humanities and social sciences, in my opinion, is that today's Big Data rapidly turns into tomorrow's No Big Deal.
4) Save Endangered Data! More and more of our lives are carried out digitally and preserved in the Shadow Universe of digital archives. But most human activity is still ephemeral; and much of the small fraction that is recorded is still in danger of vanishing into the entropic mists. Future generations will have reason to wish that we paid more attention to aspects of this problem.
I'll pick two culturally important examples at random: audiotape archives and court records.
Audiotape archives: Museums, libraries, county historical societies, radio station archives, and individual researchers' closets are full of millions of hours of audio tapes. These voices from the past will be of significant interest and value in the future — if they survive. Many are falling apart; others end up in landfills when storage space or money runs out. There are major efforts underway to digitize the world's books. We need a similar effort to digitize and preserve the world's tapes — and unlike the books, the tapes are unlikely to survive much longer unless something is done soon.
Court data: (Thanks to Jerry Goldman of oyez.org for background information.) If properly collected and archived, the activities of the American judicial system represent a massive collection of formalized social interactions, with great potential for social scientists interested in the activities of American courts and for computer scientists seeking a large, highly structured language corpus. Moreover, the hierarchical structure of the American judiciary represents an opportunity for technologists interested in modeling consequential interactions among institutions in a large system.
However, the American judiciary has been reluctant, at least in practice, to provide access to its data.
Most courts provide access via their websites to recent opinions, but few courts provide access to archival opinions. There is no consistency in the number of opinions available: some courts provide all opinions from 2005 forward, others provide only the last term's worth. There is no consistency in the means of delivering the data: some courts use RSS feeds, others offer only a list of links on a web page, and others require the user to fill out a search form to reach any data at all.
Only half of the federal circuit courts make recordings of oral arguments available, and an even smaller fraction of other courts make them available electronically. No court other than the Supreme Court appears to make an official transcript of oral arguments available electronically. Almost all of the written data is contained in PDFs, and what audio is available is of inconsistent quality and comes in a variety of formats.
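To give a sense of what "access" means even in the friendlier cases, here is a sketch of harvesting new opinions from a court that does publish an RSS feed. The feed URL is invented for illustration; for the courts that offer only a search form, there is no comparably simple recipe, which is part of the point.

    import feedparser   # pip install feedparser
    import requests

    FEED_URL = "https://example-court.gov/opinions.rss"   # hypothetical feed

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        if entry.link.lower().endswith(".pdf"):            # opinions delivered as PDFs
            resp = requests.get(entry.link, timeout=30)
            fname = entry.link.rsplit("/", 1)[-1]
            with open(fname, "wb") as f:
                f.write(resp.content)
            print(f"saved {fname}: {entry.title}")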
Details aside, a sociologist or political scientist today would find it very difficult at best to assemble a complete collection of briefs, oral arguments, and opinions in cases at various levels dealing with a given topic: it would be a lot of work, and there would still be many gaps. And in 20 or 30 years, the situation (with respect to cases now and in the past) may well be worse, because much of what is available now may well have vanished.
The scale of the problem is fairly large. State courts of last resort decide about 90,000 cases a year. Intermediate federal courts of appeals decide about 60,000 cases a year. It's not clear how (or even whether) digital versions of oral-argument recordings, collections of briefs, etc., are being preserved by the various courts. It seems possible that much of this material, although now almost invariably prepared in digital form, is not being digitally archived in any effective way.
[Jiahong Yuan and I helped Jerry Goldman in an NSF-funded project to rescue about 9,000 hours of U.S. Supreme Court oral arguments from analog tapes on the shelves of the National Archives, to transcribe them, and to make them available online at oyez.org. When complete, the whole collection will be available to researchers in corpus form.]