In Science yesterday, there was an article called "Quantitative analysis of culture using millions of digitized books" [subscription required] by at least twelve authors (eleven individuals, plus "the Google Books team"), which reports on some exercises in quantitative research performed on what is by far the largest corpus ever assembled for humanities and social science research. Culled from the Google Books collection, it contains more than 5 million books published between 1800 and 2000 (at a rough estimate, 4 percent of all the books ever published), of which two-thirds are in English and the others are distributed among French, German, Spanish, Chinese, Russian, and Hebrew. (The English corpus alone contains some 360 billion words, dwarfing better-structured data collections like the corpora of historical and contemporary American English at BYU, which top out at a paltry 400 million words each.)
I have an article on the project appearing in today's Chronicle of Higher Education, which I'll link to here, and in later posts Ben or Mark will probably be addressing some of the particular studies, like the estimates of English vocabulary size, as well as the wider implications of the enterprise. For now, some highlights:
1. The team: The authors include some Google Books researchers (Jon Orwant, Peter Norvig, Matthew Gray and Dan Clancy), a group of people associated with Harvard bioscience programs (Jean-Baptiste Michel, Erez Lieberman Aiden, Aviva Aiden, Adrian Veres, and Martin Nowak), as well as Steve Pinker of Harvard, Joe Pickett of the American Heritage Dictionary, Dale Hoiberg of the Encyclopedia Britannica, and Yuan Kui Shen of the MIT AI lab. So it's dominated by scientists and engineers, and is framed in scientific (or -istic) terms: the enterprise is described, unwisely, I think, with the name "culturomics" (that's a long o, as in genome). That's apt to put some humanists off, but doesn't affect the implications of the paper one way or the other. I have more to say about this in the Chronicle article.
2. The research exercises take various forms. In one, the researchers computed the rates at which irregular English verbs became regular over the past two centuries. In another, very ingenious, they used quantitative methods to detect the suppression of the names of artists and intellectuals in books published in Nazi Germany, the Stalinist Soviet Union, and contemporary China. A third investigates the evolution of fame, as measured by the relative frequency of mentions of people’s names. They began with the 740,000 people with entries in Wikipedia and sorted them by birth date, picking the 50 most frequently mentioned names from each birth year (so that the 1882 cohort contained Felix Frankfurter and Virginia Woolf, and so on). Next they plotted the median frequency of mention for each cohort over time and looked for historical tendencies. It turns out that people become famous more quickly and reach a greater maximum fame today than they did 100 years ago, but that their fame dies out more rapidly, though it's left unclear what to make of those generalizations or what limits there are to equating fame with frequency of mention.
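For the computationally inclined, the cohort procedure just described can be sketched roughly as follows. This is a toy illustration in Python, not the authors' code: the function names, the placeholder people "A" through "D", and their frequency figures are all invented, and ranking people by their total frequency summed over all years is my assumption about what "most frequently mentioned" means.

```python
from statistics import median

def top_names_by_cohort(people, k):
    """Group people by birth year and keep the k most mentioned in each group.

    `people` is a list of (name, birth_year, {year: frequency}) tuples.
    Assumption: "most mentioned" = largest frequency summed over all years.
    """
    cohorts = {}
    for name, birth_year, freqs in people:
        cohorts.setdefault(birth_year, []).append((name, freqs))
    return {
        birth_year: sorted(members, key=lambda m: sum(m[1].values()), reverse=True)[:k]
        for birth_year, members in cohorts.items()
    }

def median_trajectory(members, years):
    """Median frequency of mention across a cohort, year by year.

    A person with no recorded mentions in a year counts as 0.0.
    """
    return {y: median(freqs.get(y, 0.0) for _, freqs in members) for y in years}

# Invented toy data: placeholder names and made-up frequencies.
people = [
    ("A", 1882, {1920: 1.0, 1950: 3.0}),
    ("B", 1882, {1920: 2.0, 1950: 5.0}),
    ("C", 1882, {1920: 0.5, 1950: 0.5}),
    ("D", 1900, {1950: 4.0}),
]

cohorts = top_names_by_cohort(people, k=2)   # the paper's k would be 50
trajectory_1882 = median_trajectory(cohorts[1882], [1920, 1950])
print(trajectory_1882)
```

Plotting one such trajectory per birth year is what would let you compare how quickly each cohort rises and fades.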
The paper also presents a number of n-gram trajectories, that is, graphs that show the relative frequency of words or n-grams (up to five words long) over the period 1800-2000. ("Relative frequency" here means the ratio of tokens of the expression in a given year to the total number of tokens in that year.) By way of example, they plot the changing fame of Galileo, Dickens, Freud, and Einstein; the frequency of "steak," "hamburger," "pizza" and "pasta"; and the changing frequency of "influenza" (it peaks, in the least surprising result of the study, in years of epidemics).
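To make the definition of relative frequency concrete, here is a minimal sketch in Python. The function names and the two tiny "years" of text are invented for illustration, and the denominator is simply the number of word tokens in each year, following the definition above; the real tables are of course computed over billions of words.

```python
def count_ngram(tokens, ngram):
    """Count occurrences of `ngram` (a tuple of words) in a list of tokens."""
    n = len(ngram)
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == ngram
    )

def relative_frequency(tokens_by_year, ngram):
    """Per year: tokens of the expression divided by all tokens in that year."""
    ngram = tuple(ngram)
    return {
        year: count_ngram(tokens, ngram) / len(tokens)
        for year, tokens in sorted(tokens_by_year.items())
        if tokens  # skip empty years to avoid dividing by zero
    }

# Invented toy corpus: one short "text" per year.
toy_corpus = {
    1917: "the war news".split(),
    1918: "influenza spread and influenza killed".split(),
}

unigram_freq = relative_frequency(toy_corpus, ["influenza"])
bigram_freq = relative_frequency(toy_corpus, ["influenza", "spread"])
print(unigram_freq)
print(bigram_freq)
```

A trajectory of the sort plotted in the paper is just this dictionary drawn as a line, with years on the horizontal axis.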
The big news is that Google has set up a site called the Google Books Ngram Viewer where the public can enter words or n-grams (up to five words long) for any period and corpus and see the resulting graph. They've also announced that the entire dataset of n-grams will be made available for download. Some reports have interpreted this as meaning that Google is making the entire corpus available. It isn't, alas, nor even the pre-1923 portion of the corpus that's in the public domain. One can hope…
At present, that's all you can do with this. You can't do many of the things that you can do with other corpora: you can’t ask for a list of the words that follow "traditional" for each decade from 1900 to 2000 in order of descending frequency, or restrict a search for "bronzino" to paragraphs that contain "fish" and don’t contain "painting," etc. And while Lieberman Aiden and Michel made an impressive effort to purge the subcorpus of the metadata errors that have plagued Google Books, you can't sort books by genre or topic. The researchers do plan to make available a more robust search interface for the corpus, though it's unlikely that users will be able to replicate a lot of the computationally heavy-duty exercises that the researchers report in the paper. But my sense is that even this limited functionality will be interesting and useful to a lot of humanists and historians, even if linguists won't be really happy until they have the whole data set to play with. Again, I have more on this in the Chronicle essay.
That's all for now… watch this space.
12/17: I was thinking here of the ordinary, technologically limited historian or English professor who logs into the Google Labs site to use the database. With a downloaded corpus, of course, it would be a different story. Jean-Baptiste and Erez wrote me to point out that
The only part of our paper that could not be done on a small cluster is the computation of the n-gram tables, which is the data that we provide. Thus, any user with the motivation and the computational skills could replicate our work….To be exact, absolutely all the analysis we do in this paper can be done on one laptop – not even a cluster. (the 1-3 grams in English fit easily onto a hard drive, and very little computing power is needed for the computation)
I think the interesting difference here is how one imagines these data being used — by technologically sophisticated people working in humanities labs or in subgroups within humanities departments or divisions, say, or by the ordinary humanist who is curious about some cultural or linguistic trend, but isn't about to take the time to write a routine to address it. Of course the hope here might be that the second sort of user — particularly the students — will move from the second category to the first; that's why I described the present system as a kind of "gateway drug" in my Chronicle article.