The Data Reformation

On Friday 9 May, from 4.15 to 5.45 pm, I'll be joining Philip Durkin (Principal Etymologist of the Oxford English Dictionary) and Sali Tagliamonte (a sociolinguist from the University of Toronto) in a panel discussion at the British Academy, "Language, Linguistics, and the Data Explosion".

We'll try to communicate the sheer exuberant wonder with which linguists contemplate today's vast and growing archives of digital text and speech. When we point our analysis algorithms at these oceans of data, we see amazing things.

Of course, it takes a lot of careful exploration to turn our wild surmise into sound scholarship, science, and engineering. And crucial to both surmise and science is the fact that all that digital data has not just been collected; it has also been published.

Or, at least, some of it has. And distributed, democratic investigation of that shared linguistic data has played a central and critical role in the research behind the linguistic technology, science, and scholarship that we have today. Over the past 25 years, we've learned just how much shared data and shared evaluation can accelerate research.

The U.S. Defense Department discovered all of this, somewhat by accident, during the late 1980s. As a result, U.S. "Human Language Technology" programs since then have been managed via the "Common Task" model, in which the research starts with a well-defined performance metric (typically administered by the National Institute of Standards and Technology) and a common dataset for training on the tasks of interest, with testing data held back for periodic evaluations. For a narrative account of the early history of this process, see my 2010 obituary for Fred Jelinek; and you can get a snapshot of some current activities from the web sites of NIST's Text Retrieval Conference and Multimodal Information Group.
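To make the Common Task pattern concrete, here's a minimal sketch in Python, assuming a speech-to-text task scored by word error rate. The function names and the metric choice are illustrative only, not NIST's actual evaluation tooling; the point is the division of roles, where everyone trains on the same published data while only the evaluator holds the references for the held-back test set.

```python
# Illustrative sketch of a "Common Task" evaluation (hypothetical names,
# not actual NIST software). Participants train on a shared dataset;
# the evaluator scores their outputs against held-back references
# using an agreed-upon metric -- here, word error rate.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def evaluate(system, held_out_pairs):
    """Average a system's error over held-back (input, reference) pairs.

    `system` is any callable mapping an input to a hypothesis string;
    only the evaluator ever sees the reference transcripts.
    """
    scores = [word_error_rate(ref, system(x)) for x, ref in held_out_pairs]
    return sum(scores) / len(scores)
```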

I suggest, half seriously, that European civilization made an analogous set of discoveries in the 16th century, again more or less by accident. The invention of the printing press, and its use to disseminate translations of the Bible into the languages of everyday life, transformed European society. Literacy, education, and scholarship spread to a much larger portion of the population, and improved in quality as well as quantity along the way.

And it's not just theology, classics, and linguistic technology that are improved by access to the raw materials of research. Fields from genomics to geophysics, from musicology to econometrics, are seeing the benefits of significant bodies of generally accessible digital data.

Still, the great majority of relevant material remains locked up, due to legitimate concerns about privacy and intellectual property, as well as less laudable interests in exclusive access to publicly funded data. There is a growing movement to overcome these barriers in ways that protect privacy and property while rewarding sharing rather than hoarding. Some outward signs of this intellectual trend can be seen in the Royal Society's "Science as an Open Enterprise" report, and the U.S. Office of Science and Technology Policy memo on "Increasing Access to the Results of Federally Funded Scientific Research".

We might call this process the "Data Reformation", since it emphasizes the spread of unmediated access to the primary material needed to discover truth. More familiar names for the trend are the "Open Data" and "Reproducible Research" movements. Under whatever name, this trend is making increasing amounts of digital data -- including speech and language data -- accessible to increasingly many researchers worldwide.