Quotes with and without quotes

Chris is puzzled by these Google counts, for famous quotations with and without quotation marks flanking the search string:

Gone With The Wind
about 797,000 for "Frankly, my dear, I don't give a damn!"
about 163,000 for Frankly, my dear, I don't give a damn!

Taxi Driver
about 17,500,000 for "You talkin' to me?"
about 7,450,000 for You talkin' to me?

As he explains: "I discovered something weird. In some cases, the more restrictive, double-quoted query returned more hits than the unquoted query. A lot more."

Here's a plausible theory about what's going on. Google stores (and indeed has published) counts of common high-order n-grams. Famous quotations are likely to include common n-grams, for large-ish values of n, and the quotation marks cause the search algorithm to check the n-gram lists and make some use of the counts. This method should yield counts that are at least roughly accurate, depending on the details.

Without the counts, the basic approach (however modulated) is to look up the individual words, intersect the most highly ranked hits for each of them, and extrapolate in some semi-clever way to the total count that would be expected for the whole set. This method is bound to underestimate the counts for famous multi-word phrases, since such sequences are MUCH commoner than you would predict simply on the basis of their constituent unigram (or even bigram or trigram) counts.
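To make the underestimation concrete, here is a minimal sketch in Python. It is not a description of what Google actually does: the six-document corpus is made up, and the "extrapolation" is replaced by the crudest possible stand-in, namely treating each word's document frequency as independent and multiplying them out. The function names are hypothetical.

```python
# Toy corpus, invented for illustration only.
DOCS = [
    "frankly my dear i don't give a damn",
    "frankly my dear i don't give a damn",
    "my dear friend i give a gift",
    "frankly i don't know",
    "a damn shame",
    "give my dear aunt a call",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def contains_phrase(tokens, phrase):
    """True if the token list contains the phrase as a contiguous run."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def exact_phrase_count(docs, phrase):
    """Number of documents containing the whole phrase in sequence."""
    return sum(contains_phrase(tokenize(d), phrase) for d in docs)


def independence_estimate(docs, phrase):
    """Expected number of documents containing every word of the phrase,
    ASSUMING the words occur independently of one another."""
    total = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    estimate = float(total)
    for word in set(phrase):
        doc_freq = sum(word in s for s in token_sets)
        estimate *= doc_freq / total
    return estimate


if __name__ == "__main__":
    phrase = tokenize("frankly my dear i don't give a damn")
    print("exact phrase count:   ", exact_phrase_count(DOCS, phrase))
    print("independence estimate:", round(independence_estimate(DOCS, phrase), 3))
```

In this toy corpus the phrase occurs in two documents, while the word-by-word estimate comes out well below one. The size of the gap depends entirely on the invented data, but the direction is the point: words in a famous quotation co-occur far more often than their separate frequencies predict.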



  1. Atario said,

    December 24, 2009 @ 6:34 pm

    It's instructive to actually attempt to go to the end of the list of those results. For example, the "Frankly, my dear, I don't give a damn" search only goes out to 905 actual results (even including "omitted results"), and unquoted, 951, which makes more sense.

    It seems the count Google gives you at the top is not of much use.

  2. Robert Hutchinson said,

    December 25, 2009 @ 2:09 am

    For sufficiently popular phrases, going to the "end" of the results may not be as illustrative, as I'm pretty sure Google won't give you any more than 1,000 links.

  3. Rappaccini said,

    December 25, 2009 @ 2:35 am

    I'm very suspicious of the fact that almost any search returns between 800 and 1000 (which is indeed the limit) actual results. There must be some actual number of results corresponding roughly to Google's reported ghits, but why on Earth don't they return the full 1000 for every search that reports 1000+ ghits?

    On an aside, I approve of the pun in the title — mainly because it'll have prescriptivists in an uproar. I always defend my casual usage of "quote" (quotation) by citing the American Heritage Usage Panel.

  4. DusK said,

    December 25, 2009 @ 4:24 pm

    Due to the way some of Google's database systems work, they can't generate more than 1,000 results for any search. And that's before winnowing out duplicate pages, which accounts for why you often hit the end at 800 or 900.

  6. Bradley Wright said,

    December 26, 2009 @ 8:24 am

    The unquoted queries are also affected by what's known as "stop words": common words (like "my", "I", "a", "you", "to", "me", etc.) are dropped from the query, so they won't actually show up in the query graph. They are included in sequence when the query is quoted, so the quoted counts are more likely to be accurate.
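As a rough illustration of the stop-word point in the last comment, here is a small Python sketch. The stop list is just the handful of words that comment mentions, not any search engine's real list, and `strip_stop_words` is a hypothetical helper; whether Google drops these words in practice is the commenter's claim, not something the sketch verifies.

```python
# Stop list taken from the examples in the comment; real lists are an unknown here.
STOP_WORDS = {"my", "i", "a", "you", "to", "me"}


def strip_stop_words(query):
    """Lowercase the query, trim edge punctuation, and drop stop words."""
    words = [w.strip(",.!?") for w in query.lower().split()]
    return [w for w in words if w and w not in STOP_WORDS]


if __name__ == "__main__":
    for q in ("Frankly, my dear, I don't give a damn!", "You talkin' to me?"):
        print(q, "->", strip_stop_words(q))
```

Under these assumptions the Taxi Driver line shrinks to the single word "talkin'", so an unquoted search for it would be matching something quite different from the quotation.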

