When I was a student at the end of the 1970's, I never dared imagine, even in my wildest dreams, that the scientific community would one day have the means of analyzing computerized corpuses of texts of several hundreds of billions of words.
I've contributed my voice to the chorus — Robert Lee Holtz in the Wall Street Journal ("New Google Database Puts Centuries of Cultural Trends in Reach of Linguists", WSJ 12/17/2010) quotes me this way:
"We can see patterns in space, time and cultural context, on a scale a million times greater than in the past," said Mark Liberman, a computational linguist at the University of Pennsylvania, who wasn't involved in the project. "Everywhere you focus these new instruments, you see interesting patterns."
And I meant every word of that. But there's a worm in the bouquet of roses.
Here's a larger sample of my email Q&A with Mr. Holtz:
Q: What's your assessment of this computational approach to the historical lexicon? They suggest that this massive data base offers the foundation for new forms of historical and linguistic scholarship. In your view, are they right in heralding a new era for cultural studies or is this hyperbole? Is this likely to have any impact on scholarly studies?
A: I'd put their work in context this way: 2010 is like 1610. The vast and growing archives of digital text and speech, along with new analysis techniques and inexpensive computation, are a modern equivalent of the 17th-century invention of the telescope and microscope. We can see patterns in space, time, and cultural context, on a scale a million times greater than in the past. Everywhere you focus these new instruments, you see interesting patterns. Look at the sky, and see the moons of Jupiter; look at a leaf, and see the structure of cells.
This paper is an example of the sort of thing that is becoming possible, and will soon be easy. Its main contribution is to create a historical corpus of texts that is many times larger than those used in the past. Its main limitation is to look only at the frequency of word-strings over time.
Q: The researchers clearly see this data base as a tool for more than the study of the evolution of language. They talk about tracking a wide range of social trends — fame, censorship, diet, gender, science and religion — and I wonder how you assess that claim. Can the study of changing language on this scale be a lens to study all that?
A: In principle, yes. Some interesting questions can be reduced to a matter of changes in the frequency of words and word-strings. Other questions still have answers implicit in a historical text collection, but may require other sorts of analysis.
For a small but amusing example of the kind of problem that requires more than "culturomic trajectories", take a look at Giles Thomas's post about the systematic OCR substitution of f for long-s:
Why is it that of four swearwords, the one starting with ‘F’ is incredibly popular from 1750 to 1820, then drops out of fashion for 140 years — only appearing again in the 1960s?
Your first thought might be to do with the replacement of robust 18th-century English — the language of Jack Aubrey — with pusillanimous lily-livered Victorian bowdlerism. But the answer is actually much simpler. Check out this set of uses of that f-word from between 1750 and 1755. In every case where it was used, the word was clearly meant to be “suck”. The problem is the old-fashioned “long S”. It’s a myth that our ancestors used “f” where we would use “s”. Instead, they used two different glyphs for the letter “s”. At the end of a word, they used a glyph that looked just like the one we use now, but at the start or in the middle of a word they used a letter that looked pretty much like an “f”, except without the horizontal stroke in the middle.
But to an OCR program like the one Google presumably used to scan their corpus, this “long S” is just an F. Which, um, sucks. Easy to make an afs of yourself…
And lift vs. list:
And my personal favorite, funk vs. sunk, showing the absolute peak of funkitude associated with the Siege of Yorktown:
(Coincidence? I think not.)
This s/f confusion, in itself, is a small glitch. Independent of any possible improvements in the back-end OCR programs at Google Books, it would be easy to use the techniques developed by David Yarowsky for word-sense disambiguation in order to clean up the s/f confusions in these texts.
It would be easy — if you had access to the underlying texts, not just to the time-functions of n-gram frequencies. In this case, I have little doubt that the folks at Google Books will take care of the problem. The issue is simple enough, and common enough, and embarrassing enough, to catch their attention and to be assigned an adequate allocation of effort.
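To give a rough sense of what such a cleanup could look like with the underlying texts in hand, here is a minimal sketch of a decision-list classifier in the spirit of Yarowsky's approach. Everything specific in it is an assumption made for illustration: the toy training sentences, the choice of features (words within two positions of the target), the smoothing constant `alpha`, and the function names. It is not Google's pipeline, and it is not a faithful reimplementation of any published system.

```python
import math
from collections import defaultdict

def context_features(tokens, i, window=2):
    """Words within `window` positions of tokens[i], excluding the target itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return {tokens[j] for j in range(lo, hi) if j != i}

def train_decision_list(examples, alpha=0.1):
    """examples: iterable of (tokens, index, label), with label 's' or 'f'.

    Each context word becomes a rule voting for the reading it co-occurs with
    more often. Rules are ranked by the absolute smoothed log-ratio of counts.
    """
    counts = defaultdict(lambda: {"s": 0, "f": 0})
    for tokens, i, label in examples:
        for feat in context_features(tokens, i):
            counts[feat][label] += 1
    rules = []
    for feat, c in counts.items():
        llr = math.log((c["s"] + alpha) / (c["f"] + alpha))
        if llr != 0:  # skip words that give no evidence either way
            rules.append((feat, "s" if llr > 0 else "f", abs(llr)))
    rules.sort(key=lambda r: r[2], reverse=True)
    return rules

def classify(rules, tokens, i, default="f"):
    """Apply the single strongest matching rule; otherwise keep the OCR reading."""
    feats = context_features(tokens, i)
    for feat, label, _ in rules:
        if feat in feats:
            return label
    return default

# Hypothetical hand-labeled examples: the OCR output is always "funk",
# and the label records whether the printed word was "sunk" or "funk".
TRAIN = [
    ("the ship was funk by the enemy".split(), 3, "s"),
    ("the vessel was funk in the harbour".split(), 3, "s"),
    ("he was in a funk about the exam".split(), 4, "f"),
    ("a blue funk seized the boy".split(), 2, "f"),
]

rules = train_decision_list(TRAIN)
print(classify(rules, "the frigate was funk off the coast".split(), 3))
print(classify(rules, "in a funk over nothing".split(), 2))
```

The point of the sketch is that every step needs the words around each occurrence. A table of yearly counts for "funk" and "sunk" gives the classifier nothing to work with.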
But there's an indefinitely long list of cases of genuine sense disambiguation, where the same letter-string in the text corpus — even when correctly OCR'ed — corresponds to several sharply distinct senses, or to a cline of shades of meaning. In each case, someone with access to the underlying corpus can use well-studied techniques to put each instance in its proper semantic place, in a way that is consistent with the opinions of human annotators to roughly the same extent that they are consistent with one another. As things stand, however, no one except the "Google Books Team" has the needed access. And we shouldn't expect the Google Books team to want to do this for every sense distinction of possible interest to any scholar.
The same thing is true for the large class of cases where a given letter-string in the text corpus has an interesting range of different functions. For example, consider the much-discussed issues about the role of the word-string "the United States". Does it get singular or plural agreement? Is it the subject of a sentence? Is it treated as an agent? Sometimes you can find reasonable n-gram proxies for such questions — but often you can't.
There are also a large number of cases where you'd like to group word-strings into categories: dates, organizations, minerals, place names, novelists, etc., and then treat these categories (rather than words or word strings) as units of analysis. Again, there are well-known techniques for inducing such categories in text collections — but to use these techniques, you need to have the text collection in hand, so that you can run your algorithms over it.
Many — maybe most — questions about historical texts are like these last few examples: relatively easy to answer if you have a corpus in hand, and not addressed very well (if at all) by a collection of "culturomic trajectories", defined as the year-by-year time-functions of common word sequences. In particular, nearly all questions about the history of the English language fall beyond the grasp of time-functions of n-gram frequencies. This is not to deny the interest and value of such time functions. It's just that they're not nearly enough.
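To make the term concrete, here is a minimal sketch of how such a time-function can be computed from a dated corpus, using the singular-versus-plural agreement question for "the United States" as the n-gram proxy. The four-sentence corpus, the whitespace tokenization, and the normalization (each count divided by the total number of n-grams of that length in that year) are assumptions for illustration, not a description of how the published dataset was built.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield successive n-token tuples from a list of tokens."""
    return zip(*(tokens[k:] for k in range(n)))

def trajectories(dated_texts, n):
    """dated_texts: iterable of (year, text).

    Returns {ngram: {year: relative frequency}}, where the relative frequency
    is the n-gram's count divided by all n-gram tokens seen in that year.
    """
    counts = defaultdict(Counter)  # year -> Counter of n-grams
    for year, text in dated_texts:
        counts[year].update(ngrams(text.lower().split(), n))
    result = defaultdict(dict)
    for year, c in counts.items():
        total = sum(c.values())
        for gram, k in c.items():
            result[gram][year] = k / total
    return result

# Hypothetical toy corpus, invented for this example.
CORPUS = [
    (1800, "the United States are free"),
    (1800, "the United States are large"),
    (1900, "the United States is free"),
    (1900, "the United States are strong"),
]

traj = trajectories(CORPUS, 3)
print(traj[("united", "states", "are")])
print(traj[("united", "states", "is")])
```

Once the table is built, the sentences are gone. Whether "the United States" was a subject or an agent in a given sentence can no longer be recovered, which is why the yearly frequencies alone leave most of these questions out of reach.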
What are the prospects here? The portion of the Google Books corpus before 1922 — which is more than enough for many historical studies — is not encumbered by copyright. However, it belongs to Google, and it is arguably unfair to ask a company to pay for a large database creation project and then to give it away. Nevertheless, this leaves the rest of us in an uncertain situation. Here's the end of my Q&A with the previously mentioned reporter:
Q: Lastly, as a practical matter, does the use of the Google book library pose any particular challenges to researchers due to copyright issues, the nature of the digitization process, and such? Broadly, what do you think of Google's ambitions to digitize the world's libraries?
A: Google is making an important contribution to the creation of the archives that will make new kinds of work possible. For that, the company deserves everyone's thanks.
But there is a potential problem. As it stands, outside scholarly access to this historical archive will be limited to tracking the frequency of words and word-strings (what they call in the trade "n-grams"). This is useful for addressing some questions, but most questions will require other kinds of processing, which are not possible without having the full underlying archive in your (digital) hands. For the material before 1922, there is no copyright issue. The only barrier is Google's competitive advantage.
This puts the rest of us in a difficult position. Given Google's large, well-run and successful effort to digitize these historical collections, for which the economic returns are fairly small on a per-book basis, it's unlikely that anyone else will duplicate their efforts in the visible future. So we're in the situation that would have existed if the Human Genome Project had been entirely private, rather than shared.
In this analogy, the access to "culturomic trajectories" to be made available at culturomics.org might correspond to information about the relative frequencies of nucleotide polymorphisms across individuals, without access to the underlying genomes.
This is not a wonderful analogy, for various reasons, though maybe it helps to make the point.
The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture". But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.
[Full disclosure — Mr. Holtz calls me "a computational linguist at the University of Pennsylvania, who wasn't involved in the project", and this is basically true. However, I discussed the project with some of the authors in a meeting in Cambridge a couple of years ago, as the project was getting underway, and I've corresponded and talked with them from time to time since.
A quantitative caveat — Mr. Holtz quotes me correctly as using the phrase "on a scale a million times greater than in the past". Although that's true as a statement about the scale of linguistic data in general, in this particular case the proximate point of comparison would be Mark Davies' Corpus of Historical American English, which is only about a thousand times smaller. Of course, if you go farther back into the past or forward into the future, the multiplier gets bigger.]
[Update #2: Mark Davies makes the case that his Corpus of Historical American English gives essentially the same results for many searches, is more reliable in other cases, and allows a much wider variety of search types, including useful genre classifications, collocates, part-of-speech searches, etc. I'm a big fan of Mark's work, and these are all good points. What I'd like to see is for Mark to have access to the full-text data behind the Google lists, so that he could give us the best of both worlds.]