US Circuit Judge Denny Chin has ruled in favor of Google in its long-running copyright litigation with the Authors Guild over the scanning and digitization of books. Chin ruled that the Google Books project constitutes fair use because it is "highly transformative" and "provides significant public benefits." In explaining those public benefits, Chin cited the use of Google Books data for Ngram queries, and pointed to a research example that we've discussed several times on Language Log.
The benefits of the Library Project are many. First, Google Books provides a new and efficient way for readers and researchers to find books. […] Second, in addition to being an important reference tool, Google Books greatly promotes a type of research referred to as "data mining" or "text mining." (Br. of Digital Humanities and Law Scholars as Amici Curiae at 1 (Doc. No. 1052)). Google Books permits humanities scholars to analyze massive amounts of data — the literary record created by a collection of tens of millions of books. Researchers can examine word frequencies, syntactic patterns, and thematic markers to consider how literary style has changed over time. (Id. at 8-9; Clancy Decl. ¶ 15). Using Google Books, for example, researchers can track the frequency of references to the United States as a single entity ("the United States is") versus references to the United States in the plural ("the United States are") and how that usage has changed over time. (Id. at 7). The ability to determine how often different words or phrases appear in books at different times "can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology." Jean-Baptiste Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books, 331 Science 176, 176 (2011) (Clancy Decl. Ex. H).
The cited amicus brief, written by Matthew Jockers, Matthew Sag, and Jason Schultz (PDF here), provided Chin with the "United States is/are" example:
Google’s “Ngram” tool provides another example of a nonexpressive use enabled by mass digitization—this time easily visualized. Figure 1, below, is an Ngram-generated chart that compares the frequency with which authors of texts in the Google Book Search database refer to the United States as a single entity (“is”) as opposed to a collection of individual states (“are”). As the chart illustrates, it was only in the latter half of the Nineteenth Century that the conception of the United States as a single, indivisible entity was reflected in the way a majority of writers referred to the nation. This is a trend with obvious political and historical significance, of interest to a wide range of scholars and even to the public at large. But this type of comparison is meaningful only to the extent that it uses as raw data a digitized archive of significant size and scope.
It's heartening to see the "United States is/are" example serve such a central role in the decision. I first discussed how "the United States are" gave way to "the United States is" in a Language Log post back in 2005, "Life in these, uh, this United States," and followed it up in a Word Routes column in 2009, "The United States Is… Or Are?" Later that year, Mark Liberman posted on the topic here, here, here, and here.
But that was all before December 2010, when Google rolled out its Ngram Viewer, which allowed for the ready visualization of the trend, as given in the amicus brief. And when a new version of the Ngram Viewer was released in October 2012, I turned to the example yet again in an article for The Atlantic, as a way to show off some of the new features. Here is the query I included:
(If you missed it, just last month the Ngram Viewer was again freshened up with even more new features, including wildcard searching. See my Atlantic piece and the announcement on the Google Research blog.)
Finally, I was happy to receive an advance copy of the book Uncharted: Big Data as a Lens on Human Culture by Erez Aiden and Jean-Baptiste Michel, the brilliant young researchers who worked with Google to develop the Ngram Viewer and introduced it to the world in their paper for Science, "Quantitative Analysis of Culture Using Millions of Digitized Books." In their book (coming out next month), Aiden and Michel walk the reader through a series of enlightening examples of how the Ngram data can be used to analyze trends in language and culture (what they dubbed "culturomics" in the Science paper). The very first example in the book? You guessed it: the shift from plural to singular "United States." Now, that's an example with legs.