A new chapter for Google Ngrams


When Google's Ngram Viewer was launched in December 2010, it encouraged everyone to become an amateur computational linguist, an amateur historical lexicographer, or a little of both. Today, the public interface that allows users to plumb the Google Books megacorpus has been relaunched, and the new version makes it even more enticing to researchers, both scholarly and nonscholarly. You can read all about it in my online piece for The Atlantic, as well as Jon Orwant's official introduction on the Google Research blog.

The big news for linguists and fellow travelers is the introduction of part-of-speech tagging. While Mark Davies at BYU had previously created his own POS-tagged version of Google Ngrams as part of his corpus collection, he only had access to the publicly available datasets of n-grams (up to 5-grams, with a threshold of 40 occurrences for inclusion) and thus wasn't able to parse the corpus in a systematic fashion. The Google team, on the other hand, was able to go back to the underlying data from the Google Books scanning project and do full-scale tagging and parsing, including identifying sentence boundaries. The specifics are laid out in the paper presented by Yuri Lin, Slav Petrov, et al. at the annual ACL meeting in July, "Syntactic Annotations for the Google Books Ngram Corpus."
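If you'd like to poke at the freshly tagged data yourself, the n-gram datasets are available for download as tab-separated files. Here's a minimal sketch of pulling the yearly counts for a single tagged n-gram; I'm assuming the new files follow a layout of n-gram, year, match count, and volume count per line, so check the dataset documentation before relying on it.

    import csv

    def yearly_counts(path, target):
        # Yield (year, match_count) pairs for one n-gram, e.g. "many_DET".
        # Assumes each line is: ngram TAB year TAB match_count TAB volume_count.
        with open(path, encoding="utf-8") as f:
            for ngram, year, match_count, volume_count in csv.reader(f, delimiter="\t"):
                if ngram == target:
                    yield int(year), int(match_count)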

As I note in the Atlantic piece, the smaller corpora that Mark Davies has compiled, such as COCA and COHA, still offer more flexibility in the search interface, such as the ability to search for lemmas or high-frequency collocations from particular time periods. Furthermore, the universal tagset of twelve parts of speech used by Google may disappoint corpus linguists who are more used to dealing with the intricacies of the CLAWS tagset (as used in the BYU corpora). But that coarser tagset (besides being more straightforward to a lay audience) also allows for cross-linguistic comparisons, encompassing the languages currently available via the Ngram Viewer: English, Spanish, French, German, Russian, Italian, Chinese, and Hebrew. I'll be interested to see how researchers take advantage of the POS tags and dependency relations for investigations in these different languages.
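For the record, the twelve universal categories are the ones proposed in Petrov et al.'s universal tagset work. Here's a rough sketch of how a fine-grained tagset collapses onto them, using a handful of Penn Treebank tags; the particular mapping below is my own reading of that work, not necessarily the table Google applied.

    # The twelve universal categories: NOUN, VERB, ADJ, ADV, PRON, DET,
    # ADP, NUM, CONJ, PRT, "." for punctuation, and X for everything else.
    PENN_TO_UNIVERSAL = {
        "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
        "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
        "JJ": "ADJ", "RB": "ADV",
        "DT": "DET", "IN": "ADP",
        "CD": "NUM", "CC": "CONJ",
        "RP": "PRT", "TO": "PRT",
        "PRP": "PRON", "FW": "X",
    }

    def to_universal(penn_tag):
        # Anything unmapped falls back to the catch-all category X.
        return PENN_TO_UNIVERSAL.get(penn_tag, "X")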

The other major advanced search feature to be introduced in the new version is what they're calling "Ngram Compositions," which allows the user to add, subtract, multiply, and divide n-gram counts. That's quite handy, and I give an example of its use in the Atlantic piece: you can construct such queries as (The United States is + The United States has)/The United States, (The United States are + The United States have)/The United States (graph here) to better answer the question of when The United States began to be construed as a grammatically singular entity. The ability to compare different subcorpora (e.g., British vs. American) is another welcome addition.
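If you want to build such composition queries programmatically, the Ngram Viewer encodes everything in the URL, as the graph links in the comments below show. Here's a minimal sketch using only the parameters visible in those links (corpus=15 is the value the English graph links below use; other parameter values are not assumed):

    from urllib.parse import urlencode

    def ngram_graph_url(content, year_start=1800, year_end=2008,
                        corpus=15, smoothing=3):
        # Builds a shareable graph link. Spaces in the content string become
        # "+", while the "+" addition operator gets percent-encoded as %2B.
        params = urlencode({
            "content": content,
            "year_start": year_start,
            "year_end": year_end,
            "corpus": corpus,
            "smoothing": smoothing,
        })
        return "http://books.google.com/ngrams/graph?" + params

    # The singular-vs-plural United States comparison from the Atlantic piece:
    print(ngram_graph_url(
        "(The United States is + The United States has)/The United States,"
        "(The United States are + The United States have)/The United States"))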

I'm also pleased to see that metadata improvements have been made, as faulty metadata (particularly faulty dating of Google Books volumes) has been a long-standing concern. And the growing size of the Ngrams corpus continues to boggle the mind: for English alone, there are now nearly half a trillion words (468,491,999,592 tokens, to be precise). The previous corpus data remains available for searching (the older corpora have the "2009" identifier), so any research based on the original version will still be replicable. Let's see what the culturomicists come up with this time.

(Thanks to Jon Orwant, lead engineer on the project, for letting me play with the new Ngram Viewer before its public release.)



13 Comments

  1. Rod Johnson said,

    October 18, 2012 @ 11:18 am

    Pretty stunning–but also really dependent on the quality of their parsing and tagging.

  2. Mark N. said,

    October 18, 2012 @ 12:04 pm

On the note at the end about COHA having some advantages: I think it does, but its relative underusage comes down to it just being less accessible. The COHA interface is not very friendly, indeed baffling at first look, and as a webapp it doesn't really behave the way I would expect. Even simple things like sharing results are surprisingly hard (if you've never used it before, see how long it takes you to figure out how). With Google Ngrams, by contrast, you just copy the URL from your browser's address bar to link to a specific results graph.

And the raw COHA data is not available for download, unlike the Google Ngrams data. Those are all basically interface/accessibility issues, but I think they make a huge difference in uptake/usage, especially among nonspecialists.

  3. leoboiko said,

    October 18, 2012 @ 2:37 pm

    This query might be a bit too apple, I think…

    [(bgz) Could be that attributive uses in the training data, e.g. apple tree and apple juice, encourage the ADJ tagging.]

  4. naddy said,

    October 18, 2012 @ 4:54 pm

    I'm kind of shocked. Can we now do reliable algorithmic part-of-speech parsing for English? Aren't many ridiculous results in machine translation down to failures in this area?

    German orthography capitalizes all nouns, which seems like a quaint but straightforward rule unless you actually try to apply it and reliably identify nouns. Generations of German school children can attest to the difficulty of this task, and that's a language with arguably more inflectional and morphological clues than English. I shudder at the thought of having to do this for English—distinguishing attributive nouns and adjectives seems especially troublesome.

Or is the underlying thought that the results will still be useful, even if there is a considerable error rate?

  5. Brett Reynolds said,

    October 18, 2012 @ 8:46 pm

I've already found problems with the POS tagging. In the Penn Treebank, many should be a determiner, but in the new ngrams data it's tagged as a determiner only about 0.03% of the time.
    http://books.google.com/ngrams/graph?content=many_DET+%2F+many&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=

    Most of the time it's an adjective. What's the point of having a category for determiner if you're just going to call them adjectives anyhow?
    http://books.google.com/ngrams/graph?content=many_ADJ+%2F+many&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=

    At least this is a determiner.

  6. Nathan Myers said,

    October 18, 2012 @ 9:56 pm

    I'm trying to figure out what to make of this one.

    http://books.google.com/ngrams/graph?content=is+what+it+is%2Cend+of+the+day%2Cbetter+or+worse%2Csaid+and+done&year_start=1800&year_end=2008&corpus=0&smoothing=3

    The unraveling starts at 1895, but what kept them tracking for most of a century?

  7. Jonathon said,

    October 18, 2012 @ 11:40 pm

    Mark Davies' version of the Google Ngrams corpus isn't really tagged for part of speech. He told us in class that it cheats by looking up the part of speech in COHA or COCA and guessing based on probabilities. I hope he updates it with the new tagged Ngrams data.

  8. Rod Johnson said,

    October 19, 2012 @ 8:14 am

    @Nathan: I'm never sure how to interpret the numbers, but my guess would be that the frequencies of all of them are so low that they're just down in the noise, with the exception of a couple burps here and there and the rise of "end of the day" in recent decades (since 1965, in fact, the date of this).

  9. Brett Reynolds said,

    October 19, 2012 @ 9:59 am

    Some more tagging oddities here:
    http://english-jack.blogspot.ca/2012/10/google-ngrams-20-and-pos-tagging.html

  10. Andy Averill said,

    October 19, 2012 @ 8:35 pm

    Keep in mind that there can be a lot of garbage in the search results from Google Books, mostly due to typos and OCR misreadings. This is especially a problem for older publications.

For example, consider the word "ibm", which ought to be pretty rare before 1924, the year International Business Machines was founded. But in fact there are over 900 results before 1900. I looked at a bunch of them and couldn't find any where the original text actually had those three letters together. Some of the false positives include:

    ibid
    thus
    1814
    lass
usan (the last 4 letters of Pausan; the Pa got lost because it ended up in the gutter of the book when it was scanned)
    … not to mention the occasional spot due to foxing.

  11. Keith M Ellis said,

    October 20, 2012 @ 3:05 am

This is one of those occasions that reinforces my recently-strong-but-still-increasing belief that professional, highly qualified statisticians should be an essential part of such endeavors and any related research.

Because, surely, there are good statistical tools that can be used to quantify the noise levels discussed in these recent comments and to account for them? I mean, it's not as if there's some qualitative difference between Google's data and any other data … there's always error, always noise. If there's a lot, we notice it and complain. But we should never interpret such data with the expectation that there's no error, because there always is.

  12. Jean-Baptiste Michel said,

    October 25, 2012 @ 9:53 am

    For those who want the data behind the Ngram plots, but don't want to download the full corpus to get it, there is a way. In this new version, the data behind the plots is hard-coded in the page returned by the Ngram Viewer. You can parse it out.

Here is a Python script and an exe file to do just that: http://www.culturomics.org/Resources/get-ngrams
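
    For the impatient, the approach boils down to something like the sketch below: fetch the graph page and pull the embedded array out of the HTML. The regex assumes the data appears as a JavaScript array of objects; inspect the page source and adjust the pattern as needed (the script linked above is the robust version).

        import json
        import re
        from urllib.request import urlopen

        def fetch_plot_data(graph_url):
            # Grab the first JSON-style array of objects embedded in the page.
            # The exact shape of the embedded data is an assumption here.
            html = urlopen(graph_url).read().decode("utf-8")
            match = re.search(r"\[\{.*?\}\]", html, re.DOTALL)
            if match is None:
                raise ValueError("no embedded data found; the page layout may differ")
            return json.loads(match.group(0))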

  13. Warsaw Will said,

    November 17, 2012 @ 8:18 am

I'm sorry to ask here, but I haven't been able to find the answer anywhere else. On the old Ngram Viewer you could save the resulting graph as an image for showing on a blog, for example, by the normal method of right-clicking. But unfortunately this ability seems to have been lost. Does anybody know if there's an easy way to save the graph with the new Ngram Viewer other than doing a PrintScreen and then editing it?
