Language Log

Humanities research with the Google Books corpus

December 16, 2010 @ 9:03 pm · Filed by Geoff Nunberg under Computational linguistics

In Science yesterday, there was an article called "Quantitative analysis of culture using millions of digitized books" [subscription required] by at least twelve authors (eleven individuals, plus "the Google Books team"), which reports on some exercises in quantitative research performed on what is by far the largest corpus ever assembled for humanities and social science research. Culled from the Google Books collection, it contains more than 5 million books published between 1800 and 2000 (at a rough estimate, 4 percent of all the books ever published), of which two-thirds are in English and the others distributed among French, German, Spanish, Chinese, Russian, and Hebrew. (The English corpus alone contains some 360 billion words, dwarfing better-structured data collections like the corpora of historical and contemporary American English at BYU, which top out at a paltry 400 million words each.)

I have an article on the project appearing in today's Chronicle of Higher Education, which I'll link to here, and in later posts Ben or Mark will probably be addressing some of the particular studies, like the estimates of English vocabulary size, as well as the wider implications of the enterprise. For now, some highlights:

1. The team: The authors include some Google Books researchers (Jon Orwant, Peter Norvig, Matthew Gray and Dan Clancy), a group of people associated with Harvard bioscience programs (Jean-Baptiste Michel, Erez Lieberman Aiden, Aviva Aiden, Adrian Veres, and Martin Nowak), as well as Steve Pinker of Harvard and Joe Pickett of the American Heritage Dictionary, Dale Hoiberg of the Encyclopedia Britannica, and Yuan Kui Shen of the MIT AI lab. So it's dominated by scientists and engineers, and is framed in scientific (or -istic) terms: the enterprise is described, unwisely, I think, with the name "culturomics" (that's a long o, as in genome). That's apt to put some humanists off, but doesn't affect the implications of the paper one way or the other. I have more to say about this in the Chronicle article.

2. The research exercises take various forms. In one, the researchers computed the rates at which irregular English verbs became regular over the past two centuries. In another, very ingenious, they used quantitative methods to detect the suppression of the names of artists and intellectuals in books published in Nazi Germany, the Stalinist Soviet Union, and contemporary China. A third investigates the evolution of fame, as measured by the relative frequency of mentions of people’s names. They began with the 740,000 people with entries in Wikipedia and sorted them by birth date, picking the 50 most frequently mentioned names from each birth year (so that the 1882 cohort contained Felix Frankfurter and Virginia Woolf, and so on). Next they plotted the median frequency of mention for each cohort over time and looked for historical tendencies. It turns out that people become famous more quickly and reach a greater maximum fame today than they did 100 years ago, but that their fame dies out more rapidly, though it's left unclear what to make of those generalizations or what limits there are to equating fame with frequency of mention.
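The cohort procedure described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function name, the toy numbers, and the choice to rank people by their summed frequency over all years are assumptions made here.

```python
from statistics import median

def cohort_median_trajectory(freqs_by_person, top_n=50):
    """Median frequency of mention per year for the top_n names in a cohort.

    freqs_by_person maps a name to {year: relative frequency}.
    Ranking by summed frequency is an assumption of this sketch.
    """
    ranked = sorted(
        freqs_by_person,
        key=lambda name: sum(freqs_by_person[name].values()),
        reverse=True,
    )[:top_n]
    years = sorted({year for name in ranked for year in freqs_by_person[name]})
    # A name with no mentions in a given year counts as frequency 0.0.
    return {
        year: median(freqs_by_person[name].get(year, 0.0) for name in ranked)
        for year in years
    }

# Hypothetical cohort of three people born in the same year.
toy_cohort = {
    "A": {1920: 1.0, 1930: 3.0},
    "B": {1920: 2.0, 1930: 5.0},
    "C": {1920: 0.5, 1930: 0.1},
}
trajectory = cohort_median_trajectory(toy_cohort, top_n=2)
```

With `top_n=2` only the two most-mentioned names (B and A) enter the median, which is the same selection step as picking the 50 most frequently mentioned names per birth year.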

The paper also presents a number of n-gram trajectories, that is, graphs that show the relative frequency of words or n-grams (up to five words long) over the period 1800-2000. ("Relative frequency" here means the ratio of tokens of the expression in a given year to the total number of tokens in that year.) By way of example, they plot the changing fame of Galileo, Dickens, Freud, and Einstein; the frequency of "steak," "hamburger," "pizza" and "pasta"; and the changing frequency of "influenza" (it peaks, in the least surprising result of the study, in years of epidemics).
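For readers who want the definition of relative frequency spelled out, here is a minimal sketch for single words (1-grams). The function name and the two toy "years" of text are invented for illustration; the real corpus is of course not tokenized by a simple `split()`.

```python
from collections import Counter

def relative_frequency(tokens_by_year, expression):
    """Return {year: tokens of expression / total tokens in that year}."""
    result = {}
    for year, tokens in tokens_by_year.items():
        total = len(tokens)
        count = Counter(tokens)[expression]
        # Guard against an empty year rather than dividing by zero.
        result[year] = count / total if total else 0.0
    return result

# Hypothetical toy data: one short "book" per year.
toy = {
    1918: "the influenza epidemic spread and influenza closed schools".split(),
    1925: "the harvest was good and the schools were open".split(),
}
freqs = relative_frequency(toy, "influenza")
```

In the toy data "influenza" accounts for 2 of the 8 tokens in 1918 and none in 1925, so the trajectory peaks in the epidemic year.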

The big news is that Google has set up a site called the Google Books Ngram Viewer where the public can enter words or n-grams (to 5) for any period and corpus and see the resulting graph. They've also announced that the entire dataset of n-grams will be made available for download. Some reports have interpreted this as meaning that Google is making the entire corpus available. It isn't, alas, nor even the pre-1923 portion of the corpus that's in the public domain. One can hope…

At present, that's all you can do with this. You can't do many of the things that you can do with other corpora: you can’t ask for a list of the words that follow traditional for each decade from 1900 to 2000 in order of descending frequency, or restrict a search for bronzino to paragraphs that contain fish and don’t contain painting, etc. And while Lieberman Aiden and Michel made an impressive effort to purge the subcorpus of the metadata errors that have plagued Google Books, you can't sort books by genre or topic. The researchers do plan to make available a more robust search interface for the corpus, though it's unlikely that users will be able to replicate a lot of the computationally heavy-duty exercises that the researchers report in the paper. But my sense is that even this limited functionality will be interesting and useful to a lot of humanists and historians, even if linguists won't be really happy until they have the whole data set to play with. Again, I have more on this in the Chronicle essay.

That's all for now… watch this space.

12/17: I was thinking here of the ordinary, technologically limited historian or English professor who logs into the Google Labs site to use the database. With a downloaded corpus, of course, it would be a different story. Jean-Baptiste and Erez wrote me to point out that

The only part of our paper that could not be done on a small cluster is the computation of the n-gram tables, which is the data that we provide. Thus, any user with the motivation and the computational skills could replicate our work….To be exact, absolutely all the analysis we do in this paper can be done on one laptop – not even a cluster. (the 1-3 grams in English fit easily onto a hard drive, and very little computing power is needed for the computation)

I think the interesting difference here is how one imagines these data being used — by technologically sophisticated people working in humanities labs or in subgroups within humanities departments or divisions, say, or by the ordinary humanist who is curious about some cultural or linguistic trend, but isn't about to take the time to write a routine to address it. Of course the hope here might be that the second sort of user — particularly the students — will move from the second category to the first; that's why I described the present system as a kind of "gateway drug" in my Chronicle article.

58 Comments

Kylopod said,

December 16, 2010 @ 9:47 pm

Why don't they use Google News archive as well? It has an extensive collection of newspaper articles going back to the early 19th century. I've used it myself to find out about language usage in the past.

Neither Google Books nor Google News are perfect, however. My searches have frequently yielded books or articles outside the date range I specified.
Erik Zyman Carrasco said,

December 16, 2010 @ 9:52 pm

"the enterprise is described, unwisely, I think, with the name 'culturomics' […]"

Unwisely why?
John said,

December 16, 2010 @ 10:01 pm

The n-grams site is interesting, but hardly a consistently representative sample over time. For example, I ran a search for various food terms: pork chops, fried chicken, meat loaf, steak. Notice in the result how they all peak at the same place, in the early 1940s and are all on the rise again now.

Suspicious.

GN: One thing to bear in mind is that this isn't — and probably can't be — a truly "balanced" corpus, in the sense that the topic and genre distribution varies over time along with the composition of the library collections from which it was drawn. The 1980 corpus, for example, will have a lot more thrillers, self-help books, and how-to manuals than the corpus for 1900. Genre tagging would make it possible to restrict searches, so that you could look for the frequency of "dear reader" in novels published in Britain in the nineteenth century. But those metadata aren't very reliable for the Google Books collection as a whole, and automatic genre tagging is still problematic, even for someone with the whole corpus and the computational resources to do it. (Hinrich Schuetze, Brett Kessler, and I implemented a genre-tagging system some years ago for a 4500-article corpus on AIDS; it worked pretty well at that small scale.)
billb said,

December 16, 2010 @ 10:46 pm

Like John, I wonder what's going on with first, last, so, since, and because, but not but, around 1790.
Bill Benzon said,

December 16, 2010 @ 11:01 pm

John: I looked at your graph and the first thing I thought was that the early 1940s is WWII. And, without trying to reason it out, I could sorta' see how war and food rationing might have some effect on the use of food terms. Just what's going on from the late 1990s on . . . don't know.

So, as a crude test I substituted "jeep" for "pork chop" on the notion that the occurrence of "jeep" ought to be correlated with war. And here's what I got. Interesting. Notice that the "jeep" curve goes up in the 90s along with the "steak" curve, but then heads down after 2000.

Crude, but . . .
John Cowan said,

December 17, 2010 @ 1:00 am

Kylopod: Books were a good place to start, and they have lots of bang for the buck. As the research progresses, newspapers, manuscripts, maps, artwork, and Hathi text will be incorporated.

Lazar: The books were cherry-picked for correct metadata, presumably on the assumption that metadata errors are random, not systematic. They represent about a third of Google's total holdings of books.

Some basic corpus statistics:

Five million books from the beginning of printing to about 2000 in English, French, Spanish, German, Chinese, Russian and Hebrew, amounting to 500 billion running words.

Book publishing has surged from 60 million words a year in 1800 to 8 billion words a year in 2000.

The published corpus data includes all the 1-grams through 5-grams in the books, excluding those which appear less than 40 times. This amounts to about 2 billion n-grams. (Google is staging publication, so not everything is downloadable yet.) The data shows for each n-gram/year combination how many tokens, pages, and books it appears in.
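To make the shape of that data concrete, here is a small sketch that reads n-gram/year records and applies a minimum-count cutoff. The tab-separated column order (n-gram, year, tokens, pages, books) and the sample rows are assumptions for illustration; check the downloaded files for the actual layout.

```python
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout: ngram, year, tokens, pages, books.
SAMPLE = (
    "steak\t1917\t120\t100\t80\n"
    "steak\t1918\t150\t130\t90\n"
    "zyzzyva\t1918\t3\t3\t1\n"
)

def load_counts(handle):
    """Map each n-gram to {year: (tokens, pages, books)}."""
    counts = defaultdict(dict)
    for line in handle:
        ngram, year, tokens, pages, books = line.rstrip("\n").split("\t")
        counts[ngram][int(year)] = (int(tokens), int(pages), int(books))
    return counts

def drop_rare(counts, min_total=40):
    """Keep n-grams whose token count summed over all years is >= min_total."""
    return {
        ngram: years
        for ngram, years in counts.items()
        if sum(tokens for tokens, _, _ in years.values()) >= min_total
    }

kept = drop_rare(load_counts(io.StringIO(SAMPLE)))
```

Here the cutoff of 40 is applied to the total over all years, which is one reading of "appear less than 40 times"; the rare toy entry is dropped and "steak" survives.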

Some cherry-picked results:

The English lexicon doubled in size from half a million words in 1950 to a million words in 2000 (the raw tokens remain about the same, due to many more typos in 1950). Very little of this growth appears in standard comprehensive dictionaries.

The verbs burn, chide, smell, spell, spill, thrive have become mostly regular since 1800, more so in AmE than BrE. Light and wake were irregular in Middle English, became mostly regular by 1800, and are mostly irregular today. Snuck is the only verb in process of conversion from regular to irregular today.

Most famous people alive today are more famous than their predecessors, but they don't remain famous for as long. The same is true of years: 1883 remained prominent for many years afterwards, whereas 1950 lost prominence fast and is now only about twice as common as 1883.

The marks of censorship by the Nazi, Soviet, and Communist Chinese governments are visible: famous people who were unacceptable to the regimes tended to disappear from later-written books in German, Russian and Chinese, but not English. However, the Hollywood Ten disappeared from American books during the blacklist period.
carat said,

December 17, 2010 @ 3:38 am

@billb: the data does not seem to be reliable before about 1800. they only used the period 1800-2000 for the paper
a George said,

December 17, 2010 @ 3:43 am

@Geoff Nunberg: to be precise, you did not link to your article in the Chronicle of Higher Education, but rather to bibliographic information relating to it. I do not see how I can gain access without getting an account with them.
maidhc said,

December 17, 2010 @ 4:47 am

A minute data point: on my way home from work today I was listening to some old blues on the radio, and one of the songs mentioned "the days when a pork chop cost a nickel, but nobody had a nickel".
Lance said,

December 17, 2010 @ 5:31 am

I've been having way too much fun graphing different constructions against each other to see how they change in popularity. For instance, "Everyone has their" vs. "Everyone has his". Also single phrases of linguistic interest: NP VP; colorless green ideas sleep furiously; and so forth. I doubt the data's perfect (data is, data are), but that doesn't stop it from being interesting.
Paolo said,

December 17, 2010 @ 5:39 am

or restrict a search for bronzino to paragraphs that contain fish and don’t contain painting, etc.

In Italian, the fish is called brAnzino and the painter BrOnzino ,so there should be no need to differentiate with context.

GN: In le Marche there is: "Il pesce a taglio di cui potete servirvi per questo piatto di ultimo gusto, puo essere il tonno, l'ombrina, il dentice, o il ragno, chiamato impropriamente bronzino lungo le coste dell'Adriatico." ‪La scienza in cucina e l'arte di mangiare bene‬ By Pellegrino Artusi
Leonardo Boiko said,

December 17, 2010 @ 5:54 am

So, Trends for books?

I agree, there’s lots of weird-looking peaks and anomalies on 1500–1800. The system doesn’t seem to play well with things like þ or ſ either.

I like there’s Chinese and Spanish corpora; here’s hoping for more languages.
GeorgeW said,

December 17, 2010 @ 7:12 am

'War' got a barely discernible bump in the 1860s with big jumps in the 1920s and 1940s. Apparently American interest in the American Civil War was diluted by English writers elsewhere. Likewise for Vietnam. The Korean war didn't even register (maybe too soon after WWII).

If this is 'culturomics' I am not sure what it tells us about culture. (FWIW, I have trouble pronouncing the word with a full [o] sound).
mgh said,

December 17, 2010 @ 7:50 am

The public n-gram database is probably the least they could do to appear to attempt to comply with Science's policies intended to ensure the results can be held to a basic standard of reproducibility. I can see the case that the corpus is raw data which is almost never shared in any field; still, given the metadata errors in Google Books it is very troubling that they did not find a way to make the original corpus accessible at least to other researchers in their field after having them sign a lengthy restrictive transfer agreement. It is very hard to confirm the authors' claims without this basic resource, and goes against Science's own policies:

"Data and materials availability All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. After publication, all reasonable requests for materials must be fulfilled.
[…]
Large data sets with no appropriate approved repository must be housed as supporting online material at Science, or only when this is not possible, on an archived institutional Web site, provided a copy of the data is held in escrow at Science to ensure availability to readers.."
Rodger C said,

December 17, 2010 @ 8:11 am

@Erik Zyman Carrasco: Because it's pretentious, a Greco-Latin hybrid, and its pronunciation isn't transparent.
Margaret L said,

December 17, 2010 @ 8:58 am

Also, by no conceivable stretch are they data-mining the entire culture, the way that genomics data-mines the entire genome.
GeorgeW said,

December 17, 2010 @ 9:13 am

If anyone is interested, this is a link to a NYTimes article about the project.

http://www.nytimes.com/2010/12/17/books/17words.html?hp
kangol said,

December 17, 2010 @ 9:41 am

The members of this team are extremely poorly chosen. What the heck does Steven Pinker have to do with any of this, other than that he thinks he's a big star? The Harvard bioscience people, for my money, know nothing about linguistics, as clearly indicated by their Science paper on the past tense. Language Log ought to run a takedown of that piece of tripe; it commits category errors I wouldn't permit in a Ling 101 class.
Aaron B said,

December 17, 2010 @ 10:02 am

Kind of fascinating checking out taboo words and racial slurs, and how they stack up to more PC words.
Tadeusz said,

December 17, 2010 @ 10:03 am

@ John Cowan. What do you mean by "words"? Types, presumably? I think so because the ngram page returns different results with "war" and "wars". If it is so, it does not make any sense to compare types to what dictionaries include, which are lexemes. And it does not make any sense to say that the size of the lexicon doubled. It is not lexical items that are counted.
Martyn Cornell said,

December 17, 2010 @ 10:45 am

Whoa, this is fantastic! In my own tiny, tiny field, the history of beer styles, I have already been able to graph what I had previously only been able to assert through a sense of perusing old books and newspapers, that, for example, bitter beer only took off from the 1840s.
John Cowan said,

December 17, 2010 @ 11:06 am

mgh: As an ex-Googler, I am very confident that none of the paper's actual authors, or anybody else outside Google, ever got a look at any of the books. What Google can do with post-1922 books is tightly constrained by contract, and of course what the researchers couldn't get, they can't pass on. Fortunately, other scientists don't have this problem, because the law does not pretend that anyone owns Mother Nature — yet.

Tadeusz: Types, yes, excluding non-words such as numbers, typos, and terminal punctuation (non-terminal punctuation is not in the corpus). Lexicon was an ill-chosen word, I agree. However, most of the growth is surely not due to innovative uses of inflections, but to regular and irregular uses of derivational morphology, like the aridification and netiquette mentioned in the paper, plus novel borrowings and the occasional coinage.
Dan K said,

December 17, 2010 @ 11:12 am

Re: "it's unlikely that users will be able to replicate a lot of the computationally heavy-duty exercises that the researchers report in the paper," I suppose this is really more because the corpus is not publicly available? It doesn't seem like most of this should be out of reach for the average weekend hacker.
Ben Bolker said,

December 17, 2010 @ 11:52 am

Does anyone know if there's a way to download the numeric results of an N-gram search (rather than just looking at the pretty pictures)? Or, failing that, to make the y-axis logarithmic?
Ken Brown said,

December 17, 2010 @ 11:56 am

Martyn Cornell said: "… for example, bitter beer only took off from the 1840s."

The trouble is the beer is usually just called "bitter" which you can't detect by this method.

(& sometimes "bitter ale" of course – about as often as "bitter beer" by the look of things)
KCinDC said,

December 17, 2010 @ 12:08 pm

Leonardo Boiko is right about ſ. Someone on Twitter was claiming use of "fuck" peaked in the 1600s, but of course those were really "suck".
oliverio said,

December 17, 2010 @ 12:24 pm

I played a bit with it, and the search is sensitive not only to case but also to spelling (diacritics). I did a search in French for a myth that appeared in Language Log: the genius of the French language.
It appears that when there is a decline in uses of the phrase "génie de la langue française", there is an increase of uses of the phrase "décadence de la langue française".

I'm reading De la langue française : essai sur une clarté obscure by Henri Meschonnic . It appears that Voltaire is the initiator of the myth in the entry FRANC ou FRANQ; FRANCE, FRANÇOIS, FRANÇAIS of his Dictionnaire Philosophique.

décadence de la langue française vs génie de la langue française[ngrams.googlelabs.com]
David Clausen said,

December 17, 2010 @ 1:10 pm

They have made the datasets available here under a Creative Commons Attribution license:

http://ngrams.googlelabs.com/datasets

This looks like a great new resource for linguists. It would be nice to build some tools to allow regular expression search over the corpora. From a glance at the format, it shouldn't be too difficult.
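A regular-expression search of the kind suggested here might look something like the sketch below. It assumes the n-gram counts have already been loaded into a dictionary of the form {n-gram: {year: count}}; that structure, the function name, and the toy numbers are all hypothetical.

```python
import re

def regex_search(counts, pattern):
    """Return (ngram, total count) pairs whose n-gram fully matches pattern,
    sorted by descending total count."""
    rx = re.compile(pattern)
    hits = {
        ngram: sum(years.values())
        for ngram, years in counts.items()
        if rx.fullmatch(ngram)
    }
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)

# Hypothetical counts, not taken from the real dataset.
toy_counts = {
    "travelling": {1900: 50, 1950: 30},
    "traveling": {1900: 20, 1950: 70},
    "traveller": {1900: 10},
}
results = regex_search(toy_counts, r"travell?ing")
```

Using `fullmatch` keeps "traveller" out of the results; swapping in `search` would give substring matching instead, which may be closer to what some users want.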
Bill Benzon said,

December 17, 2010 @ 1:15 pm

I've been doing some more playing around. I've got a long-standing interest in Coleridge's "Kubla Khan," which introduced "Xanadu" into the modern English lexicon. So I searched the corpus for "Xanadu" from 1800-2008. No big deal. But . . .

Here's a post where I integrate that search with earlier work, which is based on a web search on "Xanadu" and on some historical data from the OED and the NYTimes archive (which goes back to 1851). The Books Ngram search picked up interesting stuff not in that other material, which was hardly representative — nor, I suppose, is the Books Ngram search. But it's something we didn't have.
Giles said,

December 17, 2010 @ 1:35 pm

Leonardo Boiko is right about ſ. Someone on Twitter was claiming use of "fuck" peaked in the 1600s, but of course those were really "suck".

I did a bit of research around this (posted at http://www.gilesthomas.com/?p=432); looks like there are lots of long-S problems, enough to make pretty much all of the data prior to 1820 dubious for many kinds of research.

Some nice examples (largeish list linked from the blog post above) are:

case vs cafe: http://ngrams.googlelabs.com/graph?content=cafe%2Ccase&year_start=1750&year_end=2000&corpus=0&smoothing=3

fame vs same: http://ngrams.googlelabs.com/graph?content=fame%2Csame&year_start=1750&year_end=2000&corpus=0&smoothing=3
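One rough way to see how such pairs arise is to generate, for a modern spelling, the form an OCR engine would produce if it read every long s as "f". The sketch below assumes a deliberately crude rule (every non-final "s" was printed as long s); the actual typographic rules were more complicated, and the function name and counts are invented for illustration.

```python
def long_s_misreading(word):
    """Return the spelling produced if each non-final 's' was printed as a
    long s and then OCR'd as 'f'. A crude approximation of the real rules."""
    if not word:
        return word
    return word[:-1].replace("s", "f") + word[-1]

def combined_count(counts, word):
    """Add the count of the likely OCR misreading to the count of the word."""
    variant = long_s_misreading(word)
    total = counts.get(word, 0)
    if variant != word:
        total += counts.get(variant, 0)
    return total

# Hypothetical counts for a single early year.
toy_year = {"case": 40, "cafe": 60, "same": 10, "fame": 90}
case_total = combined_count(toy_year, "case")
```

Merging counts this way would overcount wherever the "f" form is a real word in its own right (as "cafe" and "fame" are), so it is better suited to spotting suspicious pairs than to correcting the data.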
Nick said,

December 17, 2010 @ 1:57 pm

"computer" and "internet" both have interesting spikes in the early 1900s. (Don't graph them both at the same time as one will become flattened.) Interesting. I wonder what causes it.
Mr Punch said,

December 17, 2010 @ 2:12 pm

"Computer" had a meaning distinct from the current one – someone who does computations. No clue about "internet."
GeorgeW said,

December 17, 2010 @ 2:15 pm

'Iphone' began occurring in the 1830s but has an up and down history. 'Ipod' has an even earlier occurrence, 1800. Hmm.

Memo to Steve: You may not have rights to the names.
Clayton Burns said,

December 17, 2010 @ 2:17 pm

Geoffrey Nunberg:

Thanks for this. English at 360 billion words would allow sensitive searches for rare patterns. For example, ask people to compose a sentence with "can blurred" (you can't put anything between the words, except that you can make "can" negative, "can't"; you can't start the sentence with "can"; "can" is a modal). Create a sentence and tell me what the sentence type is.

Curiously, there is always a struggle.

"Why can't blurred images be used in court?"
"When can blurred images be used in court?"

Could you tell me how many sentences following this pattern there are in the 360 billion words of English? What is the trajectory?

A good advance would be tagged corpora. If you are perched in your little Quebec of a pulpit, then Quebec is a metaphoric global. How many would you expect to find in 100 billion words?

The issue of irregular verbs is compelling.

Especially since they are the base of vowel gradation in poetry ("After Apple-Picking" by Robert Frost for "e" gradation).

But since this extremely powerful set of sound symbolic patterns in poetry (and sometimes in fiction, as with "a" gradation in "The Road" by Cormac McCarthy–"…and returned again as trackless and as unremarked as the path of any nameless sisterworld in the ancient dark beyond") has never been decisively focused, the issue is not the data, but the ability to frame the interpretation.

Or the ability to shift perceptual and cognitive frames so as to see the data behind the data.

A limitation of another sea of data is that we should have had long ago the Internet merged analytical indexes for all non-fiction books. This would be a simple legal matter: If you were going to publish a non-fiction book, you would have to submit a high quality comprehensive analytical index to the Library of Congress before the book appeared in the bookstores and libraries. The indexes would be merged and linked back to the books. Now, if I want to find "zitterbewegung" in "The Road to Reality" index, I will see that it is not there. Indexing is informal. We could have made "refined data" gains with the Internet indexes, but it is as with tagged corpora: we just passed up the opportunity.
Kylopod said,

December 17, 2010 @ 2:28 pm

No clue about "internet."

I did a Google Books search of the word "internet" with the date range from 1895 to 1905. Many of the hits turned out to be modern articles from old journals, even when I specifically restricted the search to books. Other hits were completely different words (such as "income") somehow being mistaken for "internet" in the search, and here is an example of the word "infernet":

"The other ships to be finished next year are the first-class cruiser Chateau-renault, the third-class cruisers D'Estrees and Infernet, eight destroyers, a gunboat, the submarine boat Morse, seventeen first-class torpedo-boats, and six small torpedo-boats."

I think this is what it's a reference to:

http://en.wikipedia.org/wiki/Fort_de_l%27Infernet

I'd be surprised to find actual instances of "internet" in the early 1900s.
Ray Dillinger said,

December 17, 2010 @ 2:31 pm

At that time "computer" was a job title for a human, not the name of a device.

It seems at least a bit plausible that there was demand and / or employment for good computers (that is, people who did tedious or advanced mathematics for hire) around 1900 with an early wave of the industrial revolution.

"Internet" however is less plausible. If "internet" is also implicated it looks more like some kind of cataloging error possibly due to attributing the wrong century to dates expressed as '01, '02, etc.
Helmut said,

December 17, 2010 @ 2:36 pm

I don't know how linguists get anything done with full access to stuff like this. This is addictive.

@Nick

It looks like "internet" spikes around 1905 are in the English One Million corpus. Compare English to EOM, and then notice the contrast with American English, British English, and English Fiction from 1880 to 1920.
Helmut said,

December 17, 2010 @ 3:00 pm

Without quotes, internet looks a lot like internetwork or internetworking. Still only shows up in the English One Million corpus.
John said,

December 17, 2010 @ 4:24 pm

@GN: Part of my point was that it doesn't seem to be a very balanced corpus at all, unless one can come up with a good reason why several food terms rise and fall together in frequency over a period of decades. That strikes me as more an artifact of the data set than a sign of any change in the language.
John said,

December 17, 2010 @ 4:29 pm

The "About…" link at the bottom of Google's page actually mentions false positives for "internet" as well as medial "s".
Erin Jonaitis said,

December 17, 2010 @ 4:32 pm

I'm flummoxed about something. I've searched for a few inverted questions, of the form "isn't he," "doesn't he," etc., and most of these come back with zero hits. The only exceptions I've seen so far are "isn't it" and "can't he." Is this realistic? These phrases don't seem so uncommon to me that they shouldn't appear in any book from the last two hundred years. Am I missing something?
carat said,

December 17, 2010 @ 5:13 pm

@Erin: there seems to be something screwy going on with apostrophes. not sure how it is treating them.
The most interesting thing I've noticed so far is that "war" starts trending up in 1911 despite the ostensible suddenness of WW1 in 1914. Also, it starts trending up as late as 1935 for WW2, but "krieg" starts trending up earlier in German, in 1932.
http://ngrams.googlelabs.com/graph?content=war&year_start=1900&year_end=1950&corpus=0&smoothing=3

also the battle between -elling/-eling, -elled/-eled shows up very well and consistently
http://ngrams.googlelabs.com/graph?content=travelling,traveling,travelled,traveled&year_start=1800&year_end=2000&corpus=0&smoothing=3

connection/connexion have an interesting rivalry. they were at rough parity until connexion pulled ahead ~1830 before finally being overtaken for good.
http://ngrams.googlelabs.com/graph?content=connection,connexion&year_start=1800&year_end=2000&corpus=0&smoothing=3
Andrew West said,

December 17, 2010 @ 6:10 pm

The "medial s" issue mentioned on the About… page (whereby "long s" (ſ) is OCR'd as the letter "f") is very useful for determining the rules for the use of long s in different languages, and when long s went out of fashion (circa 1760 in Spain, circa 1780 in France and circa 1800 in England according to the Google corpus). I have appended some Google n-gram plots showing the change from long s to short s at the end of my essay on the rules for long s.
Bill Benzon said,

December 17, 2010 @ 8:46 pm

@ John: "That strikes me as more an artifact of the data set than a sign of any change in the language."

The data set may well be flawed, but I wouldn't be inclined to interpret the rise and fall of food terms as having to do with language change. I'd interpret it as change in what people are writing about.
Bruce said,

December 17, 2010 @ 10:29 pm

I did a graph of the following French phrases:
c'était
c'étaient
septante (Belgian and Swiss for 70)
nonante (Belgian and Swiss for 90)

I found that both of the regional terms exceeded the frequency of the basic phrases above them consistently from 1800 to 2000.
Doug M. said,

December 17, 2010 @ 11:08 pm

My first reaction to a few trial Ngram searches was the same as Ben's (see below) Why not a log axis? Many of my searches didn't show one or more of the Ngrams because they were too infrequent compared with the most frequent one.
#
Ben Bolker said,

December 17, 2010 @ 11:52 am

Does anyone know if there's a way to download the numeric results of an N-gram search (rather than just looking at the pretty pictures)? Or, failing that, to make the y-axis logarithmic?
#
shaftesbury said,

December 18, 2010 @ 1:03 am

I just posted a short piece on a confusion arising from the effects of the medial/long 's' in Google Labs Ngrams data: I had at first thought the Ngrams graph had revealed a sharp rise in using words relating to pleasure!
See "Google Labs Ngrams Show Mysterious Spike in Pleasure"
http://multitude.tv/content/view/471/60/
maidhc said,

December 18, 2010 @ 5:17 am

Something I've tried to find through Google's newspaper archive is when did the Winter Solstice become the "official first day of winter"? (In the US, at least) There appears to have been a war between calendar publishers and almanac publishers, which ended with the almanac publishers convincing most of the journalists in the US to use the solstice-based definition. The concept seems to have spread into Canada also.

Back in the 1920s the calendar publishers had the upper hand and Dec. 1 was the beginning of winter in the US.

It's what happened in between that's unclear. I really didn't have that much success with searching the newspaper archive.

This is a great tool. I've been looking at the stuff that's on every website today and it's very interesting. I have the nagging idea that I could use this somehow to research my problem, but it seems to me I'm trying to look for something that's too complex for the engine. Suggestions?
John said,

December 18, 2010 @ 8:28 am

@Bill Benzon: I don't disagree with you, but I also wonder whether it's just a change in what Google digitized. E.g., does their dataset have a bunch of cookbooks from the early 40s?

The problem is that the patterns could be the result of numerous things and without the source metadata, it's impossible to say.

@maidhc: One of my pet peeves! I'd love to hear about what you discover.
Mark Davies said,

December 18, 2010 @ 10:33 am

You might compare Google Books / Culturomics to the new NEH-funded Corpus of Historical American English (400 million words, 1810s-2000s; http://corpus.byu.edu/coha)

Along with accurate frequency of words and phrases by decade and year (like Google Books), COHA also allows for many types of searches that Google Books / Culturomics can't:

* changes in meaning (via collocates; "nearby words")
* changes in word forms (via wildcard searches)
* grammatical changes (because corpus is "tagged" for part of speech)
* show all words that are more common in one set of decades than in another
* integrate synonyms and customized word lists into queries
* etc etc etc

For a comparison of COHA and Google Books / Culturomics, see http://corpus.byu.edu/coha/compare-culturomics.asp.
Tadeusz said,

December 18, 2010 @ 11:28 am

I would not like to spoil the fun people seem to be having, but one million TYPES generated from several billion words is not as many as we are led to think. The list of types in the British National Corpus, 100 million words, has roughly 900,000 items. If we account for typos, etc., let us say there are 700,000 types. Now we are talking about BILLIONS of tokens in the Google corpus. All right, the ratio token/type is exponential, but even so one would expect a list of types of at least several million. Either there is something wrong in the calculations, or "word" refers to an undefined entity, not to types.
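
A back-of-envelope sketch of that expectation, under assumptions that are not in the comment above: vocabulary growth is modeled with Heaps' law, V = K * N^beta (a power law), with beta fixed at 0.5 as a rough ballpark for English text rather than a value measured on either corpus. K is calibrated on the cleaned-up BNC figure of 700,000 types per 100 million tokens, and the 360-billion-word size of the English corpus is taken from the post. The function and variable names are illustrative only.

```python
def heaps_types(tokens, k, beta):
    """Heaps' law estimate of the number of distinct types: V = k * tokens**beta."""
    return k * tokens ** beta

BETA = 0.5                               # assumption: ballpark exponent, not measured
BNC_TOKENS = 100_000_000                 # BNC size quoted in the comment
BNC_TYPES = 700_000                      # the comment's typo-adjusted type count
GOOGLE_ENGLISH_TOKENS = 360_000_000_000  # English corpus size quoted in the post

# Calibrate k so that the model reproduces the BNC figure exactly.
k = BNC_TYPES / BNC_TOKENS ** BETA

estimate = heaps_types(GOOGLE_ENGLISH_TOKENS, k, BETA)
print(f"Estimated types in the English corpus: {estimate:,.0f}")
```

Under these assumptions the estimate lands in the tens of millions of types, which is consistent with the "at least several million" expectation. The result is very sensitive to beta, so it should be read as an order-of-magnitude check and nothing more.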
Bill Benzon said,

December 18, 2010 @ 11:34 am

@John: Right, without metadata we don't know what's going on.

However . . . I did some more playing around.

First I pushed the start date back to 1900. When you do that you see that there's a peak for "steak" in around 1917-18, but not for the others. They just show a small and smooth rise through that period. That's WWI. (& things don't change much when you reduce the smoothing, even to zero.)

Second, we've got, not one, but five English language collections: English, American English, British English, English Fiction, and English One Million. British English gives the most 'interesting' result, with no particular action around either WWI or WWII. "Steak" drops from 1970 to 1980 and all rise starting in 1990, "steak" more dramatically than the others; and it also drops off starting in 2000. The nature of that peak, however, is affected by lowering the amount of smoothing. Just how many books are we dealing with at this point? Reading off the percentage at the left, we're down to ten-thousandths of a percent.

The English One Million collection gives much the same result as the English collection. Here's how that collection is characterized:

The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980). Books with low OCR quality were removed, and serials were removed.

So what's really going on? Who knows? But it does look like there's something going on in American books concerning steak at times of war.
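
On the point about smoothing changing the shape of a peak, here is a minimal sketch. It assumes that a smoothing value of n means each year is replaced by the average of itself and up to n years on either side (a centered moving average). The yearly values are made up for illustration and are not Ngram data, and `smooth` is a hypothetical helper, not the viewer's own code.

```python
def smooth(values, n):
    """Centered moving average: average each point with up to n neighbors per side."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - n)                 # window is truncated at the series edges
        hi = min(len(values), i + n + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out

raw = [1, 1, 1, 7, 1, 1, 1]   # invented series with a single-year spike

print(smooth(raw, 0))  # [1.0, 1.0, 1.0, 7.0, 1.0, 1.0, 1.0]
print(smooth(raw, 1))  # [1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0]
```

With no smoothing the spike stays a one-year event; with a smoothing of 1 it becomes a lower, three-year plateau. A short wartime bump could look quite different depending on the setting, which is one more reason to compare several smoothing values before reading much into a peak.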
Clayton Burns said,

December 18, 2010 @ 1:48 pm

[kangol said, December 17, 2010 @ 9:41 am
The Harvard bioscience people, for my money, know nothing about linguistics, as clearly indicated by their Science paper on the past tense. Language Log ought to run a takedown of that piece of tripe; it commits category errors I wouldn't permit in a Ling 101 class.]

kangol: Why not write your own review of the "Science paper on the past tense?" It would be valuable if the information cycle at Language Log were deeper and longer in certain aspects, although it is certainly a great blog.

The past is exceptionally interesting because if you master its grammar you will have no trouble with the present and the future. Also, in Daniel Schacter's work as featured in "On the Origin of Stories" (a beautiful Harvard UP book by Auckland's Brian Boyd), and in Matt Ridley's "Why the Mind Sees the Future in the Past Tense," in The Wall Street Journal, we read about the "constructive episodic simulation hypothesis," how the future recruits the same brain areas as the past.

Intuitively, that makes perfect sense. It is why the generic method of mixing in the past, present, and future in teaching the tense systems of English is inept. The more formal power you apply to the past, the more the rest will fall into place. Unfortunately, we have no advanced grammar workbook constructed on this principle.

The best idea is to separate out 60 verb elements of the past, teaching memory, recognition, and production. What would be the payoffs for imagining the future? A good psychology/linguistics lab project.
Ben Bolker said,

December 18, 2010 @ 8:28 pm

It's completely ridiculous, but I spent too much of today hacking some R code to extract numeric values from a PNG saved from the Google ngram charting application. I've posted it at http://www.math.mcmaster.ca/~bolker/R/misc/ (see charthack.R, chart.png, etc.) in case anyone wants to play with it.

[GN: Ridiculous or no, thanks very much for this. You can take the rest of the afternoon off. ]
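
For readers wondering why the log axis keeps coming up in this thread, here is a small sketch of how much vertical space a rare n-gram gets on a linear versus a logarithmic axis. The two frequencies and the axis floor are invented for illustration; they are not values read from the Ngram Viewer, and the helper names are hypothetical.

```python
import math

def linear_height(value, top):
    """Fraction of the plot height a value reaches on a linear axis from 0 to top."""
    return value / top

def log_height(value, top, floor):
    """Fraction of the plot height a value reaches on a log axis from floor to top."""
    span = math.log10(top) - math.log10(floor)
    return (math.log10(value) - math.log10(floor)) / span

common = 1e-3   # hypothetical relative frequency of the most frequent n-gram
rare = 1e-6     # hypothetical relative frequency of an infrequent n-gram
floor = 1e-7    # hypothetical lower limit chosen for the log axis

print(f"linear axis: {linear_height(rare, common):.4f} of plot height")
print(f"log axis:    {log_height(rare, common, floor):.4f} of plot height")
```

With these made-up numbers the rare n-gram sits at about a thousandth of the plot height on the linear axis, which is indistinguishable from zero, and at about a quarter of the height on the log axis.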
Mark F said,

December 19, 2010 @ 10:18 pm

What's up with the word "man"? Its use declined steadily throughout the 20th century, and has been rising in the 21st. Is it part of a trend represented by books like The Dangerous Book for Boys? Does it have to do with post 9/11 militarism? Any ideas?

Here's a link to a graph for "man":

http://ngrams.googlelabs.com/graph?content=man&year_start=1850&year_end=2008&corpus=0&smoothing=3
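
The link above encodes the whole query in its parameters. Here is a small sketch that pulls them apart and rebuilds the link for a different word. That the service accepts any word in `content` is an assumption based only on this one URL, and `with_content` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = ("http://ngrams.googlelabs.com/graph?content=man&year_start=1850"
       "&year_end=2008&corpus=0&smoothing=3")

parts = urlsplit(url)
# parse_qs returns a list per key; each key appears once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print(params)

def with_content(word):
    """Rebuild the same graph link with a different search term (assumed to be valid)."""
    new_query = urlencode({**params, "content": word})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, parts.fragment))

print(with_content("boy"))
```

Keeping the years, corpus and smoothing fixed while swapping only the word makes it easier to compare several graphs on the same footing.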
Mark F said,

December 19, 2010 @ 10:27 pm

OK, the same thing happens at the end for boy, girl, dog and horse, so I doubt my earlier hypotheses.
Ken Brown said,

December 20, 2010 @ 10:48 am

carat said: "connection/connexion have an interesting rivalry. they were at rough parity until connexion pulled ahead ~1830 before finally being overtaken for good"

In the early 19th century Methodist groups like the Countess of Huntingdon's Connexion were big news. It might be that they popularised that spelling.

Some people, including myself, use "connection" as the ordinary spelling, but "Connexion" as a historical or technical term in church politics. In "Independent" churches each congregation runs its own affairs; "connexional" churches are joined together in some way. Yes, we know they are the same word really. Just like "program" and "programme" – in the UK both spellings are current in different contexts.
David Conrad said,

December 30, 2010 @ 8:52 pm

So due to the long s Google has trouble telling the difference between fuck and suck? I'm pretty sure there's a joke in there, somewhere….
