Language Log

Send a private message to

March 15, 2009 @ 7:03 am · Filed by Mark Liberman under Syntax

That's apparently the commonest 5-word sequence in English, barely beating out "property of their respective owners". At least, those are the commonest five-word sequences on the web.

Last week, in commenting on Geoff Pullum's "Familiar six-word phrase or saying" post, I observed that

For five-word phrases, a version of the question "What five-word phrase occurs most often on Google?" can definitively be answered by reference to the Web 1T 5-gram corpus, created by researchers at Google, which contains English n-gram counts from about one trillion words of web text.

Several readers asked me what the answer actually is. The answer turned out to be not entirely trivial to get, and it may not be as interesting as you'd expect. Or maybe it's more interesting, I don't know. Anyhow, I live to serve, if not always at internet speeds, and the details are below.

The Web 1T 5-Gram Corpus contains all the length-five sequences of tokens occurring more than 40 times, from a sample of the web containing 1,024,908,267,229 tokens. I'll pass over in silence various minor technical difficulties associated with the fact that the corpus is 54 GB of text, stored in compressed form on five DVDs, and that it contains 1,176,470,663 5-grams, divided in alphabetical order into 118 files. It's a tribute to modern technology that I was nevertheless able to get to an answer, using only a laptop computer and a few minutes spared now and then from other activities. (Of course, it's a much more significant tribute to modern technology that Thorsten Brants and Alex Franz were able to compile the corpus in the first place.)
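For readers who want to try something similar, here is a minimal sketch of one way to find the top counts without loading a billion 5-grams into memory. It is not necessarily the method used for this post. It assumes each line has the form "w1 w2 w3 w4 w5<TAB>count" and that the files are gzip-compressed and match the pattern "5gm-*.gz"; the directory argument, the pattern, and the function names are all hypothetical.

```python
import gzip
import heapq
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def read_counts(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (count, ngram) pairs from lines shaped like 'w1 ... w5<TAB>count'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, count = line.rsplit("\t", 1)
        yield int(count), ngram


def corpus_lines(directory: str, pattern: str = "5gm-*.gz") -> Iterator[str]:
    """Stream lines from every compressed file, one file at a time.

    The file-name pattern is an assumption, not taken from the post.
    """
    for path in sorted(Path(directory).glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            yield from handle


def top_k(lines: Iterable[str], k: int = 20) -> List[Tuple[int, str]]:
    """Keep only the k largest counts in memory while streaming."""
    return heapq.nlargest(k, read_counts(lines))


# Tiny in-memory sample using counts quoted in the post.
sample = [
    "Send a private message to\t26672131\n",
    "property of their respective owners\t25640531\n",
    "have not been able to\t1000387\n",
]
print(top_k(sample, k=2))
```

On the real corpus the call would look like top_k(corpus_lines("/path/to/5gms")), where the path is again a placeholder. Because heapq.nlargest holds only k items at a time, the alphabetical ordering of the files does not matter.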

The tokenization used in creating this corpus follows the tradition established by the Wall Street Journal portion of the Penn Treebank, which represents sentence boundaries as special tokens, and splits punctuation out as separate tokens as well.

As a result, the most-frequent 5-grams tend to be things like

– – – – –

which occurs 88,974,253 times.

If we ignore sentence-boundary markers and punctuation (other than the apostrophes in things like "what's hot" and "you're looking for"), then the commonest 5-gram in the corpus is

Send a private message to

which occurs 26,672,131 times. (Free tip: as far as amazon.com knows, there is no published book whose title includes the string "send a private message"…)
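The filtering step described above can be sketched roughly as follows. This is an illustrative guess at a workable rule, not the filter actually used: a token counts as a word if it is alphanumeric once apostrophes and hyphens are removed. The hyphen allowance and the function names are assumptions. Boundary markers are written here as "<S>" and "</S>", which is also an assumption about the corpus format.

```python
def is_word(token: str) -> bool:
    """True if the token is alphanumeric after dropping apostrophes and hyphens."""
    core = token.replace("'", "").replace("-", "")
    # An empty core (e.g. a lone apostrophe or hyphen) is rejected by isalnum().
    return core.isalnum()


def is_all_words(ngram: str) -> bool:
    """True if every space-separated token in the n-gram passes is_word."""
    return all(is_word(token) for token in ngram.split(" "))


candidates = [
    "– – – – –",
    "<S> Send a private message",
    "Send a private message to",
    "in TripAdvisor 's popularity index",
]
for ngram in candidates:
    print(is_all_words(ngram), ngram)
```

A rule like this keeps split-off clitics such as "'s" while dropping dashes and boundary markers. It does nothing about boilerplate, which is the harder problem.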

The rest of the top-20 commonest sequences, with their counts, are:

Use of this Web site 19678811
of this Web site constitutes 19703371
this Web site constitutes acceptance 19723554
Web site constitutes acceptance of 19724386
eBay User Agreement and Privacy 19807811
the eBay User Agreement and 19808132
acceptance of the eBay User 19850253
of the eBay User Agreement 19850627
constitutes acceptance of the eBay 19850700
Designated trademarks and brands are 20820815
User Agreement and Privacy Policy 20917050
trademarks and brands are the 20975334
and brands are the property 21113548
brands are the property of 21139112
site constitutes acceptance of the 21556427
this result in new window 24059811
Open this result in new 24059963
the property of their respective 24891265
are the property of their 24938581
property of their respective owners 25640531

See what I mean? This is not really telling us anything about the state of the English language, and what it's telling us about web culture is pretty limited. (Though because of the overlapping 5-grams, we can also conclude that "are the property of their respective owners" is the commonest 7-word sequence on the web. If we care.)
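The overlap argument can be made concrete with a small sketch. It chains 5-grams whose first four tokens match the last four tokens of the sequence built so far, and it reports the smallest component count as an upper bound on how often the full sequence can occur. The helper name merge is hypothetical, and the bound is only a bound: the 5-gram counts alone do not prove that the 7-word sequence occurs that often.

```python
from typing import List, Optional


def merge(left: List[str], right: List[str]) -> Optional[List[str]]:
    """Append right's last token if its other tokens match the end of left."""
    k = len(right) - 1
    if left[-k:] == right[:k]:
        return left + right[-1:]
    return None


# Counts as quoted in the post.
fivegrams = [
    ("are the property of their", 24938581),
    ("the property of their respective", 24891265),
    ("property of their respective owners", 25640531),
]

sequence = fivegrams[0][0].split()
for text, _ in fivegrams[1:]:
    merged = merge(sequence, text.split())
    if merged is None:
        raise ValueError(f"no four-token overlap with: {text}")
    sequence = merged

# The full sequence cannot occur more often than its rarest 5-gram.
upper_bound = min(count for _, count in fivegrams)
print(" ".join(sequence), len(sequence), upper_bound)
```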

Even looking down the list to strings that occur only about a million times, nearly everything is either boilerplate from commercial websites, or fragments of random lists:

have not been able to 1000387
Philippines Pitcairn Poland Portugal Puerto 1000489
Islands Faroe Islands Fiji Finland 1001073
Shopping help About this site 1001307
part of The New York 1001557
NM NY NC ND OH 1003534
is the responsibility of the 1003710
your web hosting at eNom 1005283
Get your web hosting at 1005294
MD ME MI MN MO 1005327
See more of the brand 1006237
d d d d d 1006434
great domain name at eNom 1006715
in TripAdvisor 's popularity index 1007567
a great domain name at 1007579
poster 's website AIM Address 1007641
Get a great domain name 1007806
Visit poster 's website AIM 1007875
browser does not support script 1010343
Ivoire Croatia Cuba Cyprus Czech 1011221
NY NC ND OH OK 1011482
State University of New York 1011879
ND OH OK OR PA 1012023

Obviously the 5-gram resource as a whole is not like this — the lower-frequency end of its n-gram lists is an invaluable aid to language modeling. But it's inevitable, if you think about it, for things like eBay boilerplate to dominate the high-frequency end of the higher-order n-grams in a web snapshot.

I'm happy that "Send a private message to" nosed out "property of their respective owners". (Another free tip — there's also no book entitled "Property of their respective owners". Nor any album or band …) But neither one of these phrases strikes me as a plausible candidate for the real "commonest five word sequence in English". And because it's harder to filter out commercial and legalistic boilerplate than to filter out punctuation, it's not going to be easy to get a useful answer from a web snapshot of this kind.

It would be more meaningful to take counts from a balanced corpus like the BNC — though at a mere 100 million words, that's 10,000 times smaller. This creates a different problem. Since a count of 1,000,000 in the 1T corpus translates to a count of 100 in the BNC, it's likely that the commonest real 5- and 6-word sequences will have low enough counts that the ordering is statistically unreliable.
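The scaling arithmetic can be checked in a few lines. The sketch uses the round figures from the paragraph above (one trillion and 100 million words). It also adds one assumption that is not in the post: treating counts as roughly Poisson-distributed, so that the relative sampling error of a count n is about 1/sqrt(n).

```python
import math

WEB_1T_WORDS = 1_000_000_000_000  # "about one trillion words"
BNC_WORDS = 100_000_000           # "a mere 100 million words"


def scaled_count(count: int, source_size: int, target_size: int) -> float:
    """Expected count in a smaller corpus with the same relative frequency."""
    return count * target_size / source_size


def relative_error(expected: float) -> float:
    """Rough relative sampling error under the Poisson assumption."""
    return 1.0 / math.sqrt(expected)


size_ratio = WEB_1T_WORDS // BNC_WORDS
bnc_expected = scaled_count(1_000_000, WEB_1T_WORDS, BNC_WORDS)
print(size_ratio, bnc_expected, relative_error(bnc_expected))
```

Under that assumption, an expected count of 100 carries an uncertainty of about 10%, so two sequences whose true frequencies differ by only a few percent could easily swap ranks.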

[I'm not sure why the 1T corpus has slightly different counts for overlapping 5-grams that must be pretty much coextensive on the web, e.g.

the property of their respective 24891265
are the property of their 24938581
property of their respective owners 25640531

The differences are fairly small (in this case, around 3%), but it seems surprising to see even that much divergence associated with such near-zero-entropy string continuations. Perhaps this does reflect real differences in the frequency counts in the sampled corpus, but perhaps there is a technical reason having to do with approximations in the counting algorithm. I'll see if I can find out. ]
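The "around 3%" figure can be reproduced from the three counts quoted in this note. The sketch below measures the gap between the largest and smallest count relative to the smallest. Using the smallest count as the baseline is a choice made here for illustration, not something stated in the post.

```python
counts = {
    "the property of their respective": 24891265,
    "are the property of their": 24938581,
    "property of their respective owners": 25640531,
}

low = min(counts.values())
high = max(counts.values())

# Relative gap between the most and least frequent of the three 5-grams.
spread = (high - low) / low
print(f"{spread:.1%}")
```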

[Meanwhile, for those interested in the earlier discussion of that all-too-common six-word phrase "before turning the gun on himself", Fev at Headsup: The Blog has something to add, in the form of a reference to Gaye Tuchman, "Objectivity as strategic ritual", American Journal of Sociology 77:660-679, 1972. Fev supports rootlesscosmo's suggestion that

recourse to boilerplate may actually help persuade editors and referees that the work meets professional standards; if a paper says "the animals were sacrificed" it's science, if it says "I killed the mice" it's macabre, though the event is the same.

by quoting Tuchman:

Attacked for a controversial presentation of 'facts,' newspapermen invoke their objectivity almost the way a Mediterranean peasant might wear a clove of garlic around his neck to ward off evil spirits.

For more on journalistic objectivity rituals, see e.g. "Ritual questions, ritual answers", 6/25/2005; "Down with journalists!", 6/27/2005; "Ritual interviews", 9/18/2005. Perhaps it's time for the anthropologists to devote more time to documenting the culture of journalists, who seem to be at least as endangered as Mediterranean peasants are. ]

14 Comments

mollymooly said,

March 15, 2009 @ 8:04 am

A simple "property of their respective owners" -"the property of their respective" search throws up about 1.4m matches: the first 100 are mostly "are property of", some "are the sole property of", etc.
joseph palmer said,

March 15, 2009 @ 8:23 am

It is a nice lesson in some of the limitations of corpus linguistics, especially when exclusively web-based.
Spectre-7 said,

March 15, 2009 @ 10:38 am

I don't think I'd be in any rush to use those strings as book titles or band names. When someone Googles my title, I would want it to return links related to my work. Rather that than two million links about nothing in particular.
ambrosen said,

March 15, 2009 @ 11:14 am

I came across "I don't know how to describe it, really" on the radio this morning and found it had 8800, which can't be too bad for a 7-gram. You get 142,000 without the "really" on the end, which is a 6-gram which I find a bit more exciting as a revelation about how people express themselves than the ones above.
Sili said,

March 15, 2009 @ 11:23 am

Perhaps these will be the next books by Joey Comeau.

Although I guess it would now be derivative to write an autobiography in the form of Ebay item descriptions. Still, there'd be a melancholy of sorts to seeing a life auctioned away piece by piece, I'm sure.
Brett said,

March 15, 2009 @ 12:18 pm

When you are dealing with a phrase like "send a private message to," which (I assume) occurs primarily on bulletin boards, I don't think one can evaluate how many times the phrase appears in meaningful fashion. The pages that contain such phrases are dynamically generated; the number of distinct URLs that can contain a given piece of text (whether it is post copy or, as in this case, boilerplate) is astronomical. Google has a particular way of indexing these pages, which frankly has little to recommend it; Google's ability to locate posts containing certain text is worthless on some bulletin board setups.
Luke Winikates said,

March 15, 2009 @ 1:42 pm

@Brett:
I was going to suggest the same thing. I'd bet that the exact phrase is used on facebook profiles, and probably on a few other similar sites with millions of personalized versions of the same page being produced programmatically from a single master template. Similar to the Ebay terms of service and other clearly legal strings of text, which must appear at every new account creation or every transaction confirmation page. Sort of brushes up against the semantics of what constitutes a web page and a unique instance of a sequence of words in the eyes of google's or anybody's web spiders.
John Cowan said,

March 15, 2009 @ 1:57 pm

So it seems a reasonably safe conjecture that "Trademarks and brands are the property of their respective owners" is the most common sentence on the Web, and therefore probably the most common sentence of written English as a whole. Who woulda thunk it [only 116 kghits]?
Dan S said,

March 15, 2009 @ 2:02 pm

Related (meaning: "I hope you agree this is sufficiently related, because it's so much fun") are these lists from Randall Munroe of XKCD.com

"…a list of phrases that (at the time of this posting) turn up no hits on Google:

* “ate a violin”
* “driver-side bidet”
* “unlike normal furries,” […]

"Here are some phrases that I had hoped were original when I typed them in but was disappointed:

* “full-body glissando”
* “passenger-side bidet”
* “underwater Linux” […]"

Source: http://blag.xkcd.com/2008/12/03/some-lists/
Nathan Myers said,

March 15, 2009 @ 2:37 pm

I thought "I can't tell if you're joking" would show more than 3,270 results. "Or not" seems to be unnecessarily commonly added (475). Variations include "I can't tell whether you're joking" (465), "Please Be Joking", and "God I Hope You're Joking". Generally I do too.
Anonymous said,

March 18, 2009 @ 6:10 am

"Browser does not support script" is mostly a false positive. There are sites that are designed not to play well with Google's (or any other search engines') spider. "Browser does not support script" is what the spider sees, but it's not really there.
Kragen Javier Sitaker said,

March 22, 2009 @ 4:04 am

In response to "if a paper says 'the animals were sacrificed' it's science, if it says 'I killed the mice' it's macabre, though the event is the same," I think the use of the term "sacrifice" is extremely inappropriate; although it's commonplace in papers describing live-animal experiments, to people without a background in those sciences, it sounds like the paper describes Satanism rather than science. The root of "sacrifice" is Latin "sacra", whose meaning is entirely religious rather than fatal ("thanatological"?), and the "sacred" aspect of "sacrifice" is much stronger in common usage (largely in the extended metaphorical sense of giving up something that matters to you for noble reasons) than the later, more specific meaning of killing things.
Merri said,

March 24, 2009 @ 11:20 am

> Dan S

There is an interesting paradox here.

Since you wrote there was no sentence like 'ate a violin' on the Web, there is.

BTW, I'm mildly surprised that nobody who wrote 'perpetrate a violin massacre' mistyped by inserting an undue blank.
Estel said,

April 7, 2009 @ 11:56 pm

I just stumbled across a list of the 18 most common 6-word sequences in the CANCODE corpus (5 million words of spoken English). It can be found on pages 16-17 of this article [PDF] on second-language acquisition. The top item is apparently "do you know what I mean".
