They're getting to be routine, Mark's virtuoso skewerings of those who Google widely but not well — in the post below, taking on James Delingpole's effort to demonstrate that the Climategate story is undercovered by the MSM by showing that the number of Google hits for the phrase is disproportionate to the news stories about it. If I have one reservation — which doesn't affect his conclusions — it's that Mark lets Google off too lightly when he says that its hit-count algorithm "might over-estimate the total number of pages for a term that has increased very rapidly in the recent past," and goes on to allow that "if we take the counts at face value, then apparently there are a lot of people generating a lot of pages about climategate." Might over-estimate? Too kind. When Google reports hit-count estimates over a few hundred, the results should never be taken at face value, or any value at all — they're not only too inaccurate for serious research, but demonstrably flaky.
One easy way to show this is by noting that adding search terms to a query often increases the number of reported hits, which is logically impossible: every page that matches the longer query also matches the shorter one, so each added term can only shrink the set of matching pages, never grow it. Here are the hit-count estimates (rounded to the nearest million) that Google gives for some strings involving items that Mark took as examples (a sketch that automates this sanity check appears below the table):
| Search string | Estimated hits (millions) |
| --- | --- |
| climategate news global | |
| "Tiger Woods" wife | |
| "Tiger Woods" wife photos | |
| "Tiger Woods" golf | |
| "Tiger Woods" car | |
| Afghanistan Obama photos | |
| Afghanistan Taliban war | |
| Afghanistan pictures war | |
| Afghanistan war pictures | |
| Afghanistan war pictures war | |
| Afghanistan war pictures war pictures | |
Can anybody take these seriously? At these magnitudes, Google's hit-count estimates don't have face values.
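That sanity check is easy to automate, at least in principle. Here's a minimal sketch: it assumes you can fetch a Google results page from a script at all, and that the estimate still appears in the page as "About N results" — both assumptions about undocumented, frequently changing behavior, and Google routinely blocks scripted queries. Treat it as an illustration of the monotonicity logic, not a working scraper.

```python
import re
import urllib.parse
import urllib.request

def estimated_hits(query):
    """Return Google's hit-count estimate for a query, or None.

    The URL format and the "About N results" phrasing are assumptions
    about Google's undocumented HTML; scripted requests may be blocked.
    """
    url = "https://www.google.com/search?q=" + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
    match = re.search(r"About ([\d,]+) results", html)  # assumed markup
    return int(match.group(1).replace(",", "")) if match else None

def check_monotonic(base, extra_terms):
    """Flag the impossible case where adding a term raises the estimate.

    A conjunctive query can only narrow the result set, so the estimate
    should never go up as terms are appended.
    """
    query, last = base, estimated_hits(base)
    for term in extra_terms:
        query += " " + term
        hits = estimated_hits(query)
        if last is not None and hits is not None and hits > last:
            print(f"impossible: {query!r} reports {hits:,} > {last:,}")
        last = hits

check_monotonic("Afghanistan war", ["pictures", "war", "pictures"])
```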
But I wouldn't conclude from this that Google hit counts are useless. We can glean something about the relative magnitudes of search terms by looking at the number of pages Google actually returns for a query (as shown on the final results page, which you reach by clicking the last page number at the bottom of the first results page) — provided the figure is under 700 or so. In those cases we can assume that Google has tried to return all the pages in its index that contain the search string. (A figure between 700 and 1000 might be an accurate count, but it might also be Google's effort to return around 1000 pages for a term that appears on thousands or millions of web pages.) And we can estimate the relative frequencies even of very common terms if we add extraneous search terms that keep the total pages returned under that 700-page level. For example, here are some results for various search strings containing "Tiger Woods" and "Barry Bonds", along with a few topically irrelevant items that together restrict the search to a corpus slice of fewer than 700 pages:
|Add'l search terms||"Barry Bonds" ~||"Tiger Woods"~||ratio|
Since the ratio between the two terms falls within the same rough range for each corpus slice, it's a fair assumption that the ratio between pages containing the terms is going to fall in that range for the corpus as a whole.
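In code, that consistency check is just arithmetic. A minimal sketch follows, with hypothetical page counts and hypothetical extraneous terms standing in for the table's entries — the point is the shape of the computation, not the particular figures:

```python
from statistics import mean

# Hypothetical per-slice counts: each slice is defined by extraneous
# search terms added to keep the pages actually returned under the
# ~700 ceiling. The terms and values below are placeholders, not the
# figures from the table above.
slices = {
    "fishing atlas":     {"Barry Bonds": 52, "Tiger Woods": 431},
    "casserole monsoon": {"Barry Bonds": 18, "Tiger Woods": 160},
    "ceiling Portugal":  {"Barry Bonds": 33, "Tiger Woods": 275},
}

ratios = [s["Tiger Woods"] / s["Barry Bonds"] for s in slices.values()]
print("per-slice ratios:", [round(r, 1) for r in ratios])
print(f"mean {mean(ratios):.1f}, range {min(ratios):.1f}-{max(ratios):.1f}")
# If the per-slice ratios cluster in a narrow band, projecting that
# ratio onto the whole corpus is a reasonable bet; if they scatter,
# the slices tell you nothing about the corpus.
```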
Let's see how this plays out with climategate, Tiger Woods, and, for comparison, Pirate Radio, another term that has been mentioned a lot in recent weeks:
| Add'l search terms | climategate ~ | "Tiger Woods" ~ | Ratio | "Pirate Radio" ~ | Ratio |
| --- | --- | --- | --- | --- | --- |
A couple of things are worth noting here. First, like the searches on Tiger Woods and Barry Bonds, the ratios here are relatively consistent from one corpus-slice to the next, which suggests that taken together, they're representative of the corpus as a whole. (I could have done more searches and controlled more carefully to make sure the additional search terms were irrelevant to the names I'm interested in and hence unlikely to favor one over the others, but this is close enough to make the point.)
Second, the number of pages for Tiger Woods in the various corpus slices averages 17 times the number of pages for Climategate, which is around the middle of the range of variation in the ratios of Google queries for the two items over the period Mark looked at — or at least the two sets of ratios aren't wildly inconsistent. So I think this at least weakly confirms Mark's results (though I'm not sure this is right, and I'll wait to get his take on all this).
Finally, let me try the same thing Mark did to determine the ratio of MSM coverage to interest — only now taking the relative number of web pages containing an item as a proxy for general interest in the topic, rather than the number of Google queries:
| Story | News stories | Rel. web freq. | Stories/web freq. | Ratio of undercoverage |
| --- | --- | --- | --- | --- |
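The arithmetic behind that table is simple enough to show directly. In this sketch, only the 17:1 web-frequency ratio between Tiger Woods and Climategate comes from the corpus-slice estimates above; the story counts and the Pirate Radio figures are hypothetical placeholders for illustrating the computation:

```python
# Relative web frequencies use Climategate = 1 and Tiger Woods = 17,
# per the corpus-slice estimates above; story counts and the Pirate
# Radio row are hypothetical.
topics = {
    # topic: (news stories, relative web frequency)
    "Tiger Woods": (12000, 17.0),
    "climategate": (2500, 1.0),
    "Pirate Radio": (600, 2.0),
}

# Coverage rate = stories per unit of web frequency; normalizing by the
# Tiger Woods rate gives relative over- or under-coverage.
baseline = topics["Tiger Woods"][0] / topics["Tiger Woods"][1]
for topic, (stories, freq) in topics.items():
    rate = stories / freq
    print(f"{topic}: {rate:.0f} stories per web-frequency unit, "
          f"{rate / baseline:.1f}x the Tiger Woods rate")
```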
If you took Delingpole's thesis at face value, then, and substituted the correct ratio of web mentions of Climategate and Tiger Woods, you'd conclude that Climategate has actually been massively overcovered relative to public interest in the story — the same conclusion you'd reach if you did the calculations Mark's way. Actually, I can think of about ten reasons why this doesn't prove much one way or the other, some of which were mentioned by Mark or in the comments to his post and the rest of which I'll leave as an exercise for the reader. But it's hard to dissuade people from believing what's right in front of their noses, particularly if they don't scroll down.