Language Log

Climategate, Tiger, and Google hit counts: dropping the other shoe

December 7, 2009 @ 2:00 am · Filed by Geoff Nunberg under Language and politics, Language and the media

They're getting to be routine, Mark's virtuoso skewerings of those who Google widely but not well — in the post below, taking on James Delingpole's effort to demonstrate that the Climategate story is undercovered by the MSM by showing that the number of Google hits for the phrase is disproportionate to the news stories about it. If I have one reservation — which doesn't affect his conclusions — it's that Mark lets Google off too lightly when he says that its hit-count algorithm "might over-estimate the total number of pages for a term that has increased very rapidly in the recent past," and goes on to allow that "if we take the counts at face value, then apparently there are a lot of people generating a lot of pages about climategate." Might overestimate? Too kind. When Google reports hit count estimates over a few hundred, the results should never be taken at face value, or any value at all — they're not only too inaccurate for serious research, but demonstrably flaky.

One easy way to show this is by noting that adding search terms to a search string often increases the number of reported hits, an obvious impossibility. Here are the hit-count estimates (rounded to the nearest million) that Google gives for some strings involving items that Mark took as examples.

Search string	estimated hits (millions)
climategate	30
climategate news	35
climategate news global	164
"Tiger Woods"	13
"Tiger Woods" wife	19
"Tiger Woods" wife photos	23
"Tiger Woods" golf	119
"Tiger Woods" car	128
Afghanistan	28
Afghanistan Obama	88
Afghanistan Obama photos	104
Afghanistan Taliban	15
Afghanistan Taliban war	48
Afghanistan pictures war	46
Afghanistan war pictures	85
Afghanistan war pictures war	103
Afghanistan war pictures war pictures	106

Can anybody take these seriously? At these magnitudes, Google's hit-count estimates don't have face values.
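
For readers who want to run the same sanity check on their own numbers, here is a minimal Python sketch. It assumes that every search term must match (AND semantics), so a query that adds terms to another can never legitimately report more hits. The names `REPORTED`, `terms` and `impossible_pairs` are invented for illustration; the counts are simply the figures from the table above.

```python
import shlex

# Reported estimates in millions, copied from the table above.
REPORTED = {
    'climategate': 30,
    'climategate news': 35,
    'climategate news global': 164,
    '"Tiger Woods"': 13,
    '"Tiger Woods" wife': 19,
    '"Tiger Woods" wife photos': 23,
    '"Tiger Woods" golf': 119,
    '"Tiger Woods" car': 128,
    'Afghanistan': 28,
    'Afghanistan Obama': 88,
    'Afghanistan Obama photos': 104,
    'Afghanistan Taliban': 15,
    'Afghanistan Taliban war': 48,
    'Afghanistan pictures war': 46,
    'Afghanistan war pictures': 85,
    'Afghanistan war pictures war': 103,
    'Afghanistan war pictures war pictures': 106,
}


def terms(query):
    """Split a query into lowercase terms, keeping quoted phrases whole."""
    return frozenset(t.lower() for t in shlex.split(query))


def impossible_pairs(counts):
    """Return (broad, narrow) pairs where narrow strictly adds terms to broad
    yet reports MORE hits. Assumes AND semantics: every term must match."""
    pairs = []
    for broad, broad_hits in counts.items():
        for narrow, narrow_hits in counts.items():
            if terms(broad) < terms(narrow) and narrow_hits > broad_hits:
                pairs.append((broad, narrow))
    return pairs


for broad, narrow in impossible_pairs(REPORTED):
    print(f"{narrow} ({REPORTED[narrow]}M) > {broad} ({REPORTED[broad]}M)")
```

Queries that merely repeat or reorder the same terms are not compared here, since the strict-subset test only looks at queries that genuinely add a term.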

But I wouldn't conclude from this that Google hit counts are useless. We can glean something about relative magnitudes of search terms by looking at the number of pages Google actually returns for a query (as indicated on the final page of hits that turns up when you click the last number at the bottom of the first results page), provided the figure is under 700 or so. In these cases we can assume that Google has tried to return all the pages in its index that contain the search string. (A figure between 700 and 1000 might be an accurate count, but might also be Google's effort to return around 1000 pages for a term that appears on thousands or millions of web pages.) And we can estimate the relative frequencies even of very common terms if we add extraneous search terms to keep the total pages returned below 700. For example, here are some results for various search strings containing "Tiger Woods" and "Barry Bonds" along with a few topically irrelevant items that together restrict the search to a corpus-slice of fewer than 700 pages:

Add'l search terms	"Barry Bonds" ~	"Tiger Woods" ~	ratio
merciful redundant	304	545	1.8
cumin hammer	383	585	1.5
microwave okapi	128	183	1.4
precipice broom	172	274	1.6
pumpkin warbling	95	396	4.2
		average ratio	2.1

Since the ratio between the two terms falls within the same rough range for each corpus slice, it's a fair assumption that the ratio between pages containing the terms is going to fall in that range for the corpus as a whole.
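
The arithmetic behind the table is easy to reproduce. The following Python sketch recomputes the per-slice ratios and their average from the page counts listed above; the variable and function names are invented for illustration, and the minimum and maximum are printed only to show how wide the "rough range" is.

```python
from statistics import mean

# (extra search terms, pages with "Barry Bonds", pages with "Tiger Woods"),
# copied from the table above.
SLICES = [
    ("merciful redundant", 304, 545),
    ("cumin hammer", 383, 585),
    ("microwave okapi", 128, 183),
    ("precipice broom", 172, 274),
    ("pumpkin warbling", 95, 396),
]


def slice_ratios(rows):
    """Ratio of the second count to the first in each slice, to one decimal."""
    return [round(second / first, 1) for _, first, second in rows]


ratios = slice_ratios(SLICES)
average = round(mean(ratios), 1)

print("ratios: ", ratios)
print("average:", average)
print("range:  ", min(ratios), "to", max(ratios))
```

The same function works for any pair of terms, as long as each slice stays under the roughly 700-page ceiling described above.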

Let's see how this falls out now with climategate, Tiger Woods, and for comparison, Pirate Radio, another term that has been mentioned a lot over recent weeks:

Add'l search terms	climategate ~	"Tiger Woods" ~	ratio to climategate	"Pirate Radio" ~	ratio to climategate
hammer adagio	17	372	21.9	197	11.6
semantic respite	16	290	18.1	66	4.1
annuity caracas	14	162	11.6	40	2.9
merciful redundant	51	545	10.7	97	1.9
cleveland pistachio	24	548	22.8	95	4.0
		average ratio	17.0		4.9

A couple of things are worth noting here. First, like the searches on Tiger Woods and Barry Bonds, the ratios here are relatively consistent from one corpus-slice to the next, which suggests that taken together, they're representative of the corpus as a whole. (I could have done more searches and controlled more carefully to make sure the additional search terms were irrelevant to the names I'm interested in and hence unlikely to favor one over the others, but this is close enough to make the point.)

Second, the number of pages for Tiger Woods in the various corpus slices averages 17 times the number of pages for Climategate, which is around the middle of the range of variation in the ratios of Google queries for the two items over the period Mark looked at; at least, these ratios aren't wildly inconsistent. So I think this at least weakly confirms Mark's results (though I'm not sure if this is right, and I'll wait to get his take on all this).

Finally, let me try the same thing Mark did to determine the ratio of MSM coverage to interest — only now taking the relative number of web pages containing an item as a proxy for general interest in the topic, rather than the number of Google queries:

story	news stories	rel. web freq.	stories/web freq.	Ratio of undercoverage
Climategate	6,399	1	6399	1
Tiger Woods	53,216	20.0	2940	.46
Pirate Radio	696	7.4	94	.015

If you took Delingpole's thesis at face value, then, and substituted the correct ratio of web mentions of Climategate and Tiger Woods, you'd conclude that Climategate has actually been massively overcovered relative to public interest in the story, the same conclusion you'd reach if you did the calculations Mark's way. Actually, I can think of about ten reasons why this doesn't prove much one way or the other, some of which were mentioned by Mark or in the comments to his post and the rest of which I'll leave as an exercise for the reader. But it's hard to dissuade people from believing what's right in front of their noses, particularly if they don't scroll down.
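
The coverage calculation in the last table can be sketched in a few lines of Python. The sketch divides each story's news-story count by its relative web frequency and then normalizes to Climategate. The names `ROWS` and `coverage` are invented for illustration, and the figures it prints are recomputed from the first two numeric columns, so they are not guaranteed to match the printed columns digit for digit.

```python
# (story, news stories, relative web frequency), copied from the table above.
ROWS = [
    ("Climategate", 6399, 1.0),
    ("Tiger Woods", 53216, 20.0),
    ("Pirate Radio", 696, 7.4),
]


def coverage(rows, baseline="Climategate"):
    """Map each story to (stories per unit of web frequency,
    that figure relative to the baseline story)."""
    per_unit = {name: stories / freq for name, stories, freq in rows}
    base = per_unit[baseline]
    return {name: (value, value / base) for name, value in per_unit.items()}


for name, (per_unit, relative) in coverage(ROWS).items():
    print(f"{name:13s} {per_unit:8.0f} {relative:6.3f}")
```

A relative figure below 1 means the story drew fewer news stories per unit of web presence than Climategate did, which is the sense in which Climategate looks overcovered rather than undercovered on this measure.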

17 Comments

Philip TAYLOR said,

December 7, 2009 @ 4:18 am

As I reported elsewhere under a different heading a few weeks ago, Google's behaviour is quite different if one encloses the search term in quotation marks, even if the search term is but a single word. Taking the first three examples from this thread, here are the corresponding counts :

climategate 30.8 x 10^6
climategate news 35.1 x 10^6
climategate news global 18.5 x 10^6

"climategate" 2.5 x 10^6
"climategate" "news" 2.6 x 10^6
"climategate" "news" "global" 1.9 x 10^6

I would place far more credence on the second set than on the first, given Google's fuzzy matching for unquoted search strings.
Adam said,

December 7, 2009 @ 4:26 am

Also, google seems to OR search terms rather than ANDing. You can force ANDing by prepending "+" to the terms.

+climategate => 2.50 x 10^6
+climategate +news => 2.02 x 10^6
Rubrick said,

December 7, 2009 @ 4:31 am

You typed "MSN" for "MSM" in the opening paragraph, which could be rather misleading. I also don't know if "MSM" has reached household acronym status; prior to these posts I think I'd almost always heard and seen "mainstream media" in full.

As for the "impossible" Google hit ratios: Does Google actually anywhere state that results pages must contain the search terms? Conceptually, they're trying to answer the question "If someone's searching on 'Tiger Woods golf', what pages are most likely to be relevant to their needs?" Certainly if you do a search on a misspelled term and reject Google's "Did you mean…" suggestion, your results will still include pages with the term spelled (by Google's guess) correctly.

That said, folks I know who work at Google fully acknowledge that the estimated page counts are approximately nonsense. Making them more accurate would cost computing cycles to the benefit of few besides linguistic researchers.

[(myl) I suspect that both James Delingpole and the Linguistic Society of Great Britain will be unnerved to learn that he is a "linguistic researcher".]

[GN: Peter Norvig, Google's director of search quality, made this point a few years ago in a Wall Street Journal column by Carl Bialik called "Estimates for Web Search Results Are Often Wildly Off the Mark":

The bottom line, said Mr. Norvig, is that getting an accurate estimate isn't that important for most of Google's users, so the company hasn't invested much time and computing power. "It's only reporters and computational linguists who care if it's really precise," he said.

I didn't offer this as a criticism of Google. We would all like to see the search engines provide more accurate hit count estimations, but it obviously would be computationally expensive. So long as the counts are so unreliable, though, people shouldn't be taking them at face value to make points about what the public is interested in.]
Adam said,

December 7, 2009 @ 4:38 am

I have to retract my earlier comment about "+" forcing google to AND search terms. That's how I've been using it for years, but their help says:

– – – – –
Search exactly as is (+)
Google employs synonyms automatically, so that it finds pages that mention, for example, childcare for the query [ child care ] (with a space), or California history for the query [ ca history ]. But sometimes Google helps out a little too much and gives you a synonym when you don't really want it. By attaching a + immediately before a word (remember, don't add a space after the +), you are telling Google to match that word precisely as you typed it. Putting double quotes around a single word will do the same thing.
– – – – –
It ain't the things you know that hurt you; it's the things you know that ain't so.
kip said,

December 7, 2009 @ 10:06 am

Why would you expect [climategate news] to have fewer hits than [climategate]? The page doesn't necessarily have to mention both words to be a search hit. (I use brackets to avoid confusion with quotes, since quotes have a special meaning to Google.)

[GN: Generally, the search terms have to be present either in the page or the links pointing to it. One way or the other, the set of pages containing or linked to with "Tiger Woods" has to be larger than the set of those pages that also contain (or are pointed to by links containing) "car." Note that query substitution (e.g., replacing "car" by "vehicle") won't affect this point; in that case the string "'Tiger Woods' car" will return pages containing (or linked to by pages containing) both "Tiger Woods" and either "car" or "vehicle," again, a smaller set than you would get for just "Tiger Woods." There are a few exceptions, as Google explains. But even if Google always ignored the second search term, the set returned couldn't logically be bigger than that returned for the first term alone.]
Philip Spaelti said,

December 7, 2009 @ 11:18 am

kip: Why would you expect [climategate news] to have fewer hits than [climategate]? The page doesn't necessarily have to mention both words to be a search hit.

Is that what you as user would expect? Would you be happy if a search for [climategate news] returned pages that didn't contain the word "climategate"? It may be that Google does this, but I, for one, get annoyed when that happens.

Generally when doing searches with terms with fewer hits, adding terms certainly leads to fewer and fewer hits. So searching for [labial] gives me about 600,000 hits, and searching for [labial sonorant] gives about 30,000. So this kind of case seems to be behaving a lot more like a typical AND search.
kip said,

December 7, 2009 @ 11:54 am

@Philip Spaelti: I guess Google is smarter than me. Maybe I learned these search practices back in the wild west Altavista days. My expectation for a search for [foo bar] would be:

* Pages containing the exact phrase "foo bar" would have the highest ranking.
* Pages containing "foo" and "bar" (but not necessarily adjacent to one another or in that order) would be ranked next highest.
* Pages containing only "foo" or only "bar" would be returned, but with the lowest rank.

If I only wanted pages containing both foo and bar, I would have searched for [+foo +bar]. And if I wanted pages containing both words in that order, I would search for ["foo bar"].

I used "apple juice" as a test and it looks like you guys are right and I'm wrong. I got the following hit counts (in millions):

[apple]: 77.0
[juice]: 14.5
[apple juice]: 14.1
[+apple +juice]: 13.3
["apple juice"]: 0.559
kip said,

December 7, 2009 @ 12:00 pm

Maybe "orange juice" is a better example, since "Apple" is a huge corporation with a significant internet presence, but not "orange". Results are still the same: adding "juice" to "orange" eliminates a lot of results, as you guys expected.

[orange]: 49.0
[juice]: 14.5
[orange juice]: 14.3
[+orange +juice]: 13.2
["orange juice"]: 1.37

[GN: but adding terms can increase hit counts:
["orange juice" +fresh] 3.43
["orange juice" +store] 2.14
["orange juice" +recipe] 2.05]
tablogloid said,

December 7, 2009 @ 2:21 pm

Just googled "Tiger Woods grirlfriend" under Images. There were 1,650,000 hits.
Army1987 said,

December 7, 2009 @ 4:49 pm

Once upon a time, at the bottom of result pages Google used to claim "Search performed on [some large number] pages". But when you searched for "the", it would say it had found [some even larger number] of results.
CWV said,

December 7, 2009 @ 10:40 pm

Ah, we've finally gotten to the nut meat of the story. Barry Bonds is engaged in rampant pumpkin warbling, and the PC-addled Mainstream Media doesn't have the guts to tell you about it! I wonder if pumpkinwarblinggate.com is still available …
Garrett Wollman said,

December 8, 2009 @ 12:08 am

Has anyone in the corpus-linguistics community actually approached Google about making a service available that would actually perform a useful count?

[(myl) Yes, this has been discussed at various levels to various degrees at various times and places known to me; and no doubt there have been even more other times and places that I don't know about. It's not easy to define the target set of users and applications; or rather, both sets are diverse. Anyone with a credible web snapshot and a small cluster running hadoop can do whatever version of this they want for themselves, but of course some applications (like the current one) require an up-to-date snapshot, which is harder.]
Ginger Yellow said,

December 8, 2009 @ 10:51 am

I'm quite impressed that none of those search strings are even close to being Googlewhacks.
Michael Straight said,

December 8, 2009 @ 1:02 pm

Regarding your technique of doing a search with extraneous search terms like "pumpkin" and "warbler":

Is it possible that a phrase like "Tiger Woods" would be much more common on a general-interest website, while a phrase like "climategate" might be more common on pages that focus on politics or science? Meaning that you'd be much more likely to find "Tiger Woods" than "climategate" with pumpkin or any other given non-golf, non-climate terms? Which would cause an overestimate of the Tiger/climate ratio?

Just off the top of my head, I'd think you'd be much more likely to find the terms "cleveland" and "pistachio" on a page with articles about Tiger Woods than on a page with articles about climategate.
Joshua Zucker said,

December 8, 2009 @ 4:47 pm

I think dictionaries might be relevant here. There are a lot of websites with long lists of words, and since "tiger" "woods" "bonds" "pirate" and "radio" are all found in those dictionaries, their frequency might be greatly exaggerated by their occurrences on pages like those.

Skimming through the results returned for the query with all five of those words doesn't seem to support my hypothesis. Adding the non sequitur words [tiger woods bonds pirate radio annuity caracas] does seem to give a significant proportion of dictionary hits.
Aaron Davies said,

December 13, 2009 @ 4:43 pm

google's original algorithm of looking at the text of incoming links to determine what a page is about is still to some extent present in their current methodology. if you hit the "cache" link to see google's copy of a page, the words in your query will be highlighted; some may be noted in the header as having "only appeared in links pointing to this page".
Yury Gomon said,

October 4, 2010 @ 7:18 am

I checked the search strings mentioned in the article:
climategate 928,000 hits
climategate news 771,000 hits
climategate news global 169,000 hits
"Tiger Woods" 33,900,000 hits
"Tiger Woods" wife 3,980,000 hits
"Tiger Woods" wife photos 3,420,000 hits
"Tiger Woods" golf 22,600,000 hits
"Tiger Woods" car 4,290,000 hits
Afghanistan 121,000,000 hits
Afghanistan Obama 62,900,000 hits
Afghanistan Obama photos 49,300,000 hits

The numbers look logical. Google may have improved the calculation algorithm or something. I don't think there's any reason to disbelieve the numbers given by Google any longer.
