Language Log

496M hits for "language log"? Alas, no.

June 2, 2009 @ 9:58 am · Filed by Mark Liberman under Computational linguistics

You've probably heard about Microsoft's new search site bing. I don't know much about it yet, but I did observe a couple of things that may be of interest to those of us who try to use web-search counts as data.

First, and least important, bing search counts are now generally round numbers, rather than the implausibly exact totals that MSN search used to yield. Thus if you check the numbers in (say) this old post, you'll see that a search for "full", back in the fall of 2005, returned 367,645,836 hits from MSN search, compared to 1,890,000,000 from Google and 2,000,000,000 from Yahoo. Now you'll get 896,000,000 from bing, compared to 2,140,000,000 from Google and 6,270,000,000 from Yahoo.

But second, and more interesting, something odd has happened to the counts for quoted strings.

Comparing the counts for "full" to the counts for "half full of", now and in the fall of 2005, we get:

Engine and date	full	"half full of"	ratio (full / "half full of")
MSN 10/15/2005	367,645,836	63,841	5,758.8
bing 6/2/2009	896,000,000	286,000,000	3.1
Google 10/15/2005	1,890,000,000	248,000	7,621.0
Google 6/2/2009	2,140,000,000	459,000	4,662.3
Yahoo 10/15/2005	2,000,000,000	397,000	5,037.8
Yahoo 6/2/2009	6,270,000,000	858,000	7,307.7

Compared to today's Google and Yahoo counts, and the MSN count from 2005, today's bing count for the string "half full of" is more than three orders of magnitude too large. This results in weird outcomes like the discovery that there are 496,000,000 hits for "language log", compared to Google's estimate of 305,000 and Yahoo's estimate of 2,760,000.
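The ratios in the table, and the "three orders of magnitude" gap, can be rechecked with a few lines of Python. This is only a sketch: the counts are copied from the table above, and the names `counts`, `ratio`, and `gaps` are made up for illustration.

```python
import math

# Counts copied from the table above: (hits for full, hits for "half full of").
counts = {
    "MSN 10/15/2005": (367_645_836, 63_841),
    "bing 6/2/2009": (896_000_000, 286_000_000),
    "Google 10/15/2005": (1_890_000_000, 248_000),
    "Google 6/2/2009": (2_140_000_000, 459_000),
    "Yahoo 10/15/2005": (2_000_000_000, 397_000),
    "Yahoo 6/2/2009": (6_270_000_000, 858_000),
}


def ratio(engine):
    """Hits for the single word divided by hits for the quoted string."""
    word_hits, phrase_hits = counts[engine]
    return word_hits / phrase_hits


# How many orders of magnitude bing's ratio falls short of each comparison ratio.
gaps = {
    engine: math.log10(ratio(engine) / ratio("bing 6/2/2009"))
    for engine in ("MSN 10/15/2005", "Google 6/2/2009", "Yahoo 6/2/2009")
}

for engine in counts:
    print(f"{engine:18s} {ratio(engine):10.1f}")
for engine, gap in gaps.items():
    print(f"bing vs {engine}: {gap:.2f} orders of magnitude")
```

Measured this way, relative to the single-word count, each of the three comparisons comes out a little above three orders of magnitude.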

Except perhaps in cases where the total is very low, all such counts are estimates, based on a variety of indexed numbers (e.g. for single words and common n-grams) and extrapolations from relatively small samples (e.g. of word sequences in highly-ranked pages), all plugged into some sort of empirically-tuned formula. See this post for a discussion of some aspects of the situation as of a few years ago.

So it seems that somebody goofed in setting up bing's formula for estimating hit counts for strings.

(Of course, this might have happened some time ago, before the bing make-over — I haven't been checking MSN counts recently.)


6 Comments

Craig said,

June 2, 2009 @ 12:47 pm

Conveniently, one can change the number of the first result in the bing URL search parameters to jump ahead to a later page. Out of curiosity, I changed the above search for "language log" to start with result #1001 and was shown the actual maximum page (88) displaying results 871-877.

[(myl) The traditional story about this sort of result is that the algorithm indexes the top-ranked N pages for each word or common-enough n-gram — where N is a few tens of thousands — and then for multi-item sequences, searches sequentially in the intersection of the indexed pages, using some sort of extrapolation to predict what the total would be if a complete scan were done.

My first guess is that the formula used to do bing's extrapolation has a constant in it that's assuming counts in thousands of pages, but is applied to counts in pages. That still seems to leave a factor of 2 or so unexplained …]
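The "factor of 2 or so" can be illustrated with a rough calculation. The sketch below assumes, purely for illustration, that bing's true word-to-phrase ratio for "full" and "half full of" would match the ratio Google or Yahoo reported on the same day, and that the hypothesized units error is exactly a factor of 1000. The function name `residual_factor` is invented here.

```python
# bing counts from the post, 6/2/2009.
bing_full = 896_000_000
bing_phrase = 286_000_000

# Word-to-phrase ratios reported by the other engines on the same day.
reference_ratios = {
    "Google": 2_140_000_000 / 459_000,
    "Yahoo": 6_270_000_000 / 858_000,
}


def residual_factor(reference_ratio, unit_error=1000):
    """Inflation left over after removing an assumed thousands-vs-units error."""
    # Phrase count bing "should" report if its ratio matched the reference.
    expected_phrase = bing_full / reference_ratio
    # How much larger the reported phrase count actually is.
    inflation = bing_phrase / expected_phrase
    return inflation / unit_error


for engine, ref in reference_ratios.items():
    print(f"{engine}: residual factor {residual_factor(ref):.2f}")
```

Under those assumptions the leftover factor is roughly 1.5 against Google's ratio and roughly 2.3 against Yahoo's.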

Separate searches for "language" and "log" were ridiculously set to max out at the 1000th result, even though they supposedly had 279,000,000 results and 393,000,000 results respectively. Obviously, if "language log" actually generated 495,000,000 results, searches on the individual words should also total at least that number.

It would be interesting to see the algorithm they use to generate the faux but very consistent totals. One thing is clear: it has nothing to do with the actual number of available pages.

[(myl) "Nothing" might be too strong, but it's not much. ]

Oskar said,

June 2, 2009 @ 1:10 pm

Here's a very odd result: a search for language log without quotes gets 88,600,000 results, but if you search for "language log" with quotes, you get (as you noted) 320,000,000 results.

I can't for the life of me think of how that would happen. In the common language of searching, typing just two words generally means "find pages with those two words in them" (this isn't strictly true: Google also indexes words in the incoming links to a page, hence the phenomenon of Google-bombing, but that's the general idea), and if you search for two words with quotes around them it should mean "find pages with those two words in them AND make sure that they are next to each other and in the right order".

The second should be a subset of the first. These two searches make absolutely no sense.

[(myl) See this post for a discussion of the fact that Google searches for {X or X} often used to return fewer hits than searches for {X} — or this one for a discussion of the analogous effect with {X OR Y}.

As of now, Google returns 578M hits for {Obama}, 31.9M for {Sarkozy}, 495M for {Obama OR Sarkozy}, and 396M for {Sarkozy OR Obama}.

So whatever is happening, it's not constrained to be logically consistent. ]
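The consistency conditions at issue in this exchange can be written down as simple checks. This is a minimal sketch using only the counts quoted in this thread. The function names are invented for illustration, and the checks say nothing about how any search engine actually computes its estimates.

```python
def or_is_consistent(count_x, count_y, count_or):
    """Pages matching X OR Y should number at least as many as either term
    alone, and at most the sum of the two."""
    return max(count_x, count_y) <= count_or <= count_x + count_y


def phrase_is_consistent(count_unquoted, count_quoted):
    """Pages containing the exact phrase should be a subset of pages
    containing both words."""
    return count_quoted <= count_unquoted


# Google counts quoted above.
obama, sarkozy = 578_000_000, 31_900_000
print(or_is_consistent(obama, sarkozy, 495_000_000))  # {Obama OR Sarkozy}
print(or_is_consistent(obama, sarkozy, 396_000_000))  # {Sarkozy OR Obama}

# bing counts reported by Oskar for language log, unquoted and quoted.
print(phrase_is_consistent(88_600_000, 320_000_000))
```

All three checks fail on the reported numbers, which is the sense in which the counts are not logically consistent.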
Yuval said,

June 2, 2009 @ 1:54 pm

I'm hoping the 500 millionth hit would be "defriend".
Or maybe "n00b".

Evan said,

June 3, 2009 @ 12:46 am

If you search for "linux", the first 3 hits on bing suggest (besides "linux" itself) are all related to Microsoft products. Contrast this with Google suggest, for which there are no hits related to Microsoft products. (Google/bing suggest is the dropdown of frequent searches you get when you type some words in the search box.)

[(myl) It's not implausible that this reflects the actual statistics of search on MSN as compared to Google — note that the "suggest" continuations are based on analysis of query logs. ]

Having established that there's some clear fudging going on over at Bing headquarters, and given that there's no real incentive to return accurate search result numbers, you'd expect they'd err on the high side in order to seem more impressive than their rivals.

[(myl) But bing's claimed counts for single words are generally *lower* than the numbers at Google or (especially) Yahoo. What's out of whack is the numbers for strings of words.

So I don't see any evidence of fudging, just a mistake in a formula somewhere. ]

Evan said,

June 10, 2009 @ 3:03 am

I find it hard to believe that the user base for bing and Google would be so far out of sync. Most of Google's users are running IE on Windows, so you'd think that if Microsoft users made such searches they'd show up on Google's suggest as well. (Note that the rest of the suggest lists are very similar.) The best guess I have besides foul play is that bing's initial suggest data comes from internal bing beta testing, but it seems more logical to use old Live Search logs. But you're right, it's not impossible that bing suggest is overrun by avid Windows users researching the advantages of their OS over competing offerings.

As for the second point, I concede it.

Russell Cross said,

June 18, 2009 @ 3:24 pm

So now I have to check the correlation between "ghits" and "bhits" when I'm looking for usage data? What a bhummer!

[(myl) Please not to forget the "yhits". ]
