The "million word" hoax rolls along
Gullible reporters keep falling for a self-aggrandizing scam perpetrated by Paul J.J. Payack, who runs an outfit called Global Language Monitor. As regular Language Log readers know, Mr. Payack has been trumpeting the arrival of "the millionth word" in English for some time now. In fact, he's predicted that the English language would pass the million-word mark in 2006… and 2007… and 2008… and now 2009. As reported in the Christian Science Monitor and The Economist, the date that Payack has now set for the million-word milestone is April 29, 2009.
In a previous installment of the Payack saga, I wrote that the Million Word March was "a progression that he turns on and off based on his publicity needs." So I can't say I was terribly surprised to learn that April 29, 2009 just happens to be the publication date of the paperback edition of Payack's book, A Million Words and Counting: How Global English Is Rewriting The World. What a stupendous coincidence that Global Language Monitor's word-counting algorithm has timed itself to accord with Payack's publishing schedule!
A quick review for newcomers to the story. Payack's million-word claim first popped up on our radar in early 2006. In February of that year, Payack told The Times of London that "the one millionth word is likely to be formed this summer." Then in August 2006 he said it would happen that coming November. In early 2007 I observed that the Million Word March seemed to have gotten stalled, and speculated that it might have had something to do with the serious debunkage the claims had received from Jesse Sheidlower on Slate and our own Geoff Nunberg on NPR's "Fresh Air." As it turns out, the more likely reason for Payack's slowdown had to do with rolling out his book to cash in on the lexico-quackery.
Here's what I've been able to piece together about Payack's latest maneuvers. A Million Words and Counting was originally slated by his publisher, Citadel Press (an imprint of Kensington Books), to appear as a hardback in April 2008. The publisher page for this edition announced, "In 2007, the English language passed the million-word mark." The following month Payack put out a press release for the hardcover asserting that "English will adopt its millionth word in 2008." This same claim appears on the back flap of the hardcover edition locatable on Google Book Search. (I haven't actually seen a physical copy of this edition, copyright 2008, but I assume it exists somewhere.) Then the publisher announced that a trade paperback would be released on April 29, 2009. And of course the goalposts were moved yet again, with the announcement stating that the millionth word would be achieved not in '06, or '07, or '08, but '09. (Really! No kidding this time!)
All the while, Payack has continued to dress up his claims with pseudo-scientific talk of an "algorithm" that precisely calculates the size of the lexicon and predicts its future growth. In a press release dated June 30, 2008, Payack stated that "English will adopt its millionth word within a ten-day period centered upon April 29, 2009." The "ten-day period" presumably is intended to give the impression of a statistical margin of error produced by Payack's magic algorithm. A lovely idea, except that this algorithm has evidently failed in its prediction of passing the million-word milestone for three years running.
In my February 2007 post, "Whatever happened to the millionth word?", I checked up on past claims reported on the Global Language Monitor website for incremental growth in the size of the lexicon, using the Internet Archive Wayback Machine. Let's continue to monitor the Monitor:
11/16/03: 816,167
11/28/04: 823,481
3/30/05: 856,435
5/19/05: 866,349
11/3/05: 895,479
1/16/06: 985,955
1/26/06: 986,120
3/21/06: 988,968
10/16/06: 989,614
12/19/06: 991,207
12/31/06: 991,833
4/2/07: 991,833
5/16/07: 993,412
8/14/07: 994,638
10/23/07: 995,116
12/23/07: 995,116
2/13/08: 995,117
11/7/08: 997,752
1/1/09: 998,773
4/29/09: 1,000,000 (projected)
There are apparently no archived pages for the site after February 2008, and I wasn't bothering to check the site myself very much over the past year. But it's still quite easy to see the asymptotic approach to a million words, ever since the big jump at the end of 2005 in advance of Payack's first round of self-promotional puffery. Now at least we know what's holding things up: the media blitz planned for his next publication date. It's my fervent hope that in 2009 Payack's manipulation of his bogus figures will be immediately transparent to any journalist with the ability to Google. But given the track record of how easily the media has been duped for the past three years, I'm not holding my breath.
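For anyone who wants to put numbers on that slowdown, here is a minimal sketch in Python (mine, not GLM's; it uses only the archived figures transcribed above, which are only as reliable as the Wayback Machine captures) computing the implied rate of "new words" per day between consecutive snapshots:

```python
from datetime import date

# Snapshot dates and word counts transcribed from the archived GLM pages
# listed above (the projected 4/29/09 figure of 1,000,000 is omitted).
snapshots = [
    (date(2003, 11, 16), 816_167),
    (date(2004, 11, 28), 823_481),
    (date(2005, 3, 30),  856_435),
    (date(2005, 5, 19),  866_349),
    (date(2005, 11, 3),  895_479),
    (date(2006, 1, 16),  985_955),
    (date(2006, 1, 26),  986_120),
    (date(2006, 3, 21),  988_968),
    (date(2006, 10, 16), 989_614),
    (date(2006, 12, 19), 991_207),
    (date(2006, 12, 31), 991_833),
    (date(2007, 4, 2),   991_833),
    (date(2007, 5, 16),  993_412),
    (date(2007, 8, 14),  994_638),
    (date(2007, 10, 23), 995_116),
    (date(2007, 12, 23), 995_116),
    (date(2008, 2, 13),  995_117),
    (date(2008, 11, 7),  997_752),
    (date(2009, 1, 1),   998_773),
]

# Implied rate of "new words" per day between consecutive snapshots.
for (d1, n1), (d2, n2) in zip(snapshots, snapshots[1:]):
    days = (d2 - d1).days
    print(f"{d1} -> {d2}: {(n2 - n1) / days:+8.1f} words/day")
```

On these snapshots the implied rate surges past 1,200 "words" a day between November 2005 and January 2006, just before Payack's first round of publicity, and then slows to a crawl, including stretches where the count does not move at all, which is hard to square with any steady rate of coinage.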
[Update: I should note that the Economist piece linked above, by John Grimond, is rightly skeptical of the Payack poppycock, so not all journalists are so credulous. Daniel Franklin, executive editor of The Economist, continued poking fun at Payack's prediction on NPR's Morning Edition.]
Jair said,
January 3, 2009 @ 6:45 am
To be fair, it might be that he set the publishing date for his book based on his estimates of the millionth word arrival rather than the other way around.
Honestly, I think the guy's kind of amusing. There are all sorts of quacks out there – this sort of thing doesn't surprise me at all. I am amazed at all the apparent media coverage he's getting, however. I know very little about linguistics but it seems entirely obvious that putting an exact number on the number of words in the English language is about as hard as measuring the length of the coastline of Britain. You'd think that even a very ignorant journalist would recognize such a thing as ridiculous just based on common sense.
Martin Magnusson said,
January 3, 2009 @ 7:37 am
I agree with you that the media is easily duped.
Just look at how many recent top-selling memoirs are false: Angel at the Fence, A Million Little Pieces, Love and Consequences. I'm sure there are more that I am not aware of.
I think the New York Times article about Angel at the Fence says it all: "In media circles, there is a joke about facts that are too good to check."
His publisher must think that Mr. Payack's facts are too good to check.
Paul JJ Payack said,
January 3, 2009 @ 8:20 am
The fact that the methodology behind the Million Word March is a matter of public (and published) record — and has been for a number of years — has been completely ignored by Benjamin Zimmer, one of a handful of critics who invariably fail to acknowledge that there is a published methodology.
His objections have been repeatedly published without having ever made contact with me or my organization. He accompanies his criticism with charts such as you find in this posting (which if you trace the trajectory of March 2005 is, to Zimmer’s inconvenience, almost exactly aligns to our current estimate).
The current estimate is based upon an estimated rate of word creation that we checked and re-evaluated about 2 and-a half years ago and ascertained that the rate of growth estimated in March 2005 was correct.
To quote Dennis Baron, in his Web of Language blog:
"Paul Payack, professional word-counter and the founder of Global Language Monitor and yourdictionary.com, claims that someone coins an English word every 98 minutes, which seems pretty fast until we consider that during the word-coining frenzy of the 1590s, when the pace of life was slower, about 10,000 new words popped up every year. If Shakespeare and his contemporaries never slept, that comes to a neologism every 52 minutes.
With more than 326 million native speakers of English today, and only 2 million in 1600, today's neologism-per-person rate is only a fraction of what it was 400 years ago. Given our perception that the pace of life has increased dramatically since the Renaissance, this suggests that while there are in fact more words in English now than there used to be, we have a lot less time to coin them (neologism, a word, coined in France in the 1730s and borrowed by English in the 1770s, meaning ‘a new word’; Renaissance, a mid-19th century word meaning the European revival of arts and letters of the 14th – 16th centuries).
Payack’s words-per-minute assertion can’t be tested, because he uses a secret formula to count his words, but if he’s right, then in the time it took me to write this post, somewhere in the English-speaking world a new word was born, or two, if you count revisions. We know they’re out there. We just don’t know what they are."
Zimmer’s comment (and 'aha!' experience) upon my publishing schedule ignores the significant fact that A Million Words and Counting was actually published last year, and was written the year before that. The book, itself, assumes it is being read after the date of the Millionth Word has already passed. However, it is true that the paperback edition will follow a year after the hardcover, as is standard. Reprints will probably follow, so it is bound to be linked to some date that coincides with one of any number of our books, publications, studies, and the like.
Those who do talk to us and actually examine our methodology end up writing fair, balanced articles like that found in Smithsonian Magazine (http://www.smithsonianmag.com/arts-culture/million-word-march.html).
I would also note that our methodology has been tested by various government agencies and financial institutions, as well as by media the world over.
Our work has been incorporated into dozens of academic journals, studies and books. We are currently writing an article for one of the statistics journals.
Zimmer and other critics maintain that all these scholars, institutions and media have been 'duped' or 'snookered' or worse, in an ascending order of vehemence that makes one wonder about the actual motivation of his argument.
We welcome a direct conversation with Zimmer or anyone else. All our contact information is readily available on the site.
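A quick back-of-the-envelope check of the arithmetic in the Baron quote above (my calculation, using only the figures quoted there, not anything from Baron or Payack):

```python
# Back-of-the-envelope check of the coinage rates compared in the quote above.
MINUTES_PER_YEAR = 365.25 * 24 * 60

payack_rate = MINUTES_PER_YEAR / 98       # "a word every 98 minutes", in words/year
elizabethan_rate = 10_000                 # "about 10,000 new words" per year in the 1590s

print(f"one word every 98 minutes   = {payack_rate:,.0f} words/year")
print(f"10,000 words/year in 1590s  = one every {MINUTES_PER_YEAR / elizabethan_rate:.0f} minutes")

# Per-speaker rates: 326 million native speakers today vs. about 2 million in 1600.
print(f"today : {payack_rate / 326e6:.2e} coinages per speaker per year")
print(f"1600s : {elizabethan_rate / 2e6:.2e} coinages per speaker per year")
```

On those numbers a word every 98 minutes works out to roughly 5,400 coinages a year, and the 1590s per-speaker rate comes out around three hundred times today's, which is the comparison Baron is drawing.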
Dan T. said,
January 3, 2009 @ 8:37 am
I guess the language took a vacation from adding new words for the first few months of '07.
Another vacation, but it got called back into the office to add one emergency word.
Faldone said,
January 3, 2009 @ 9:11 am
This just in:
Lake Superior State University banishes millionth word. An anonymous source in the LSSU hierarchy stated, "We just know it's going to be over/misused so we are banishing it proactively."
Dr Benway said,
January 3, 2009 @ 10:05 am
You keep using that word, "algorithm." I do not think it means what you think it means.
Geoff Pullum said,
January 3, 2009 @ 10:49 am
Payack says in his comment: "the methodology behind the Million Word March is a matter of public (and published) record". But his website (see http://www.languagemonitor.com/pqi) says that the Global Language Monitor uses a "proprietary algorithm, the Predictive Quantities Indicator" as "the basis of our analytical engine". Now, he can't have it both ways: "a matter of public (and published) record" is one thing, and "proprietary" is quite another. Proprietary code is kept secret for commercial reasons. Fair enough. But you don't get to claim public disclosure in that case. All that I could find on Payack's website about the algorithm is that it "tracks the frequency of words and phrases in the global print and electronic media"; "a keyword base index is created" which includes "selected keywords, phrases, 'excluders' and 'penumbra' words", and then "'timestamps' and a 'media universe' are determined". He adds that "The PQI is a weighted Index, factoring in: Long-term trends, Short-term changes, Momentum, and Velocity. As such it can create 'signals' that can be used in a variety of applications. Outputs include: the raw PQI, a Directional Signal, or a Relative Ranking with 100 as the base." Alongside the basic PQI there is a "Political-sensitivity Quotient Index" and something called "TrendTopper software" for "analyzing words and phrases in commercial contexts". What all this vague jargon means, I have no idea. (For example: Predictive Quantities of what?) But whatever it all is — and I smell snake oil — what is published on Payack's website really cannot count as full public disclosure of a methodology. About that, he seems to be simply lying.
Geoff Nunberg said,
January 3, 2009 @ 2:48 pm
I guess it's easy to see Payack as amusing, as Jair suggests — kind of a linguistic Madoff, but one who visits insults only on the intelligence of the people who buy into his story, not their material well-being (apart from the ones who pay him to speak, who I suppose get what's coming to them).
But there's also an underlying creepy strain to the guy, which bubbles to the surface when he says that Ben criticizes him with a "vehemence that makes one wonder about the actual motivation of his argument." This is a breathtaking bit of chutzpah. If Payack's poppycock makes Ben indignant — as it does me, Geoff Pullum, and just about every other linguist or lexicographer who has looked at it — it's because Ben is a serious scholar of language who takes it personally when a self-promoting huckster gulls the press with lies and fabrications about language.
Actually, the interesting question here involves Payack's motivation. Not that there's anything out of the ordinary about desperately seeking attention or puffing up your vita with phony credentials. But would you trade your credit among serious people for a flurry of press attention?
As for Payack's assurance that he welcomes direct conversation with Zimmer or anyone else, I actually did have an email exchange with him after he wrote me following a Language Log post I did (called "Hackery, Quackery, Schlock") that described him as an opportunistic charlatan. The following should give you a sense of how informative the colloquy was:
I'll say.
rootlesscosmo said,
January 3, 2009 @ 5:38 pm
His objections have been repeatedly published without having ever made contact with me or my organization.
The objections have never made contact?
He accompanies his criticism with charts such as you find in this posting (which if you trace the trajectory of March 2005 is, to Zimmer’s inconvenience, almost exactly aligns to our current estimate).
"which is…aligns"?
Jangari said,
January 3, 2009 @ 6:51 pm
I'm having trouble even conceptualising how words are counted in this pointless scheme. Is cat counted as one word and cats counted as another? Is bank counted once, twice or three times? What about run (the verb), runs (the verb), ran, run (the noun), runs (the noun) and running? If they're just considered a single word, or possibly two, as I would contend, then how about less clear cases like be, am, are and were?
Counting words in a language was likened, earlier in this thread, to the futility of giving an exact measure of the length of the coastline of the UK. This is too kind; at least there you can be correct to within a margin. The number of words is subject to innumerable and undefinable variables. Just to take a few: How do you define English? Is Hinglish allowed in? What about individual speakers' idiosyncrasies, and how many people need to use a word before it becomes 'a word'? If I say Hooblah enough on this thread, will it show up in Payack's algorithms?
One last thing, Payack, you say you've had government agencies (Which ones? And why should they care?), financial institutions and "the media the world over" scrutinise your algorithm. How about giving a look to a linguist or a lexicographer?
Hocus pocus.
Ryan said,
January 3, 2009 @ 6:52 pm
"Our work has been incorporated into dozens of academic journals, studies and books. We are currently writing an article for one of the statistics journals."
So then name in which issues of which journals your work was published in.
Ryan said,
January 3, 2009 @ 6:59 pm
Oh god. Wow, it was a long day. Sorry on the redundancy.
Tim Silverman said,
January 3, 2009 @ 7:36 pm
@Jangari: not sure if you know this, but "the" length of a coastline was famously used by Benoit Mandelbrot as an example of the peculiar nature of a fractal curve, because by using shorter and shorter measuring sticks, tracing the outline of ever smaller bays, inlets, caverns and pockmarks down to the spaces between individual grains of sand, and further down into the roughness of the sand grains, one can get the measured length to increase without limit. So it's a somewhat analogous example of a quantity which sounds superficially objective but is actually highly dependent on an artificial choice of definition for a naturally ambiguous concept.
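To make that concrete, here is a minimal sketch of the idealized case Mandelbrot used, the Koch curve, in which every refinement replaces each straight segment with four segments one third as long (a toy illustration of the paradox, not anything taken from Mandelbrot's own papers):

```python
# The idealized coastline: a Koch curve. Each refinement replaces every
# straight segment with four segments one third as long, so a finer
# "measuring stick" always yields a longer measured coastline.
for level in range(9):
    ruler = (1 / 3) ** level      # length of the measuring stick
    sticks = 4 ** level           # number of sticks needed to trace the curve
    print(f"ruler = {ruler:.6f}   measured length = {ruler * sticks:.3f}")
```

After eight refinements the same idealized coastline already measures nearly ten times its original length, and the total never converges.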
Spectre-7 said,
January 3, 2009 @ 8:35 pm
And how would it deal with homonyms and heteronyms, for that matter? Could it correctly count the separate words if a dove dove off a cliff, or when I object to being the object of your desire? Can it separate what a minute deer does and what does do for a minute? There are myriad words that appear identical in the record, but they remain different words regardless of how we record them.
I consider myself quite moderate in these matters, but I have serious trouble believing that his software is so totally perfect, and I strongly doubt that anyone could perfect such a thing. At the very least, I refuse to consider any such claims prima facie until the public is given access to his methods, and will loudly object to such refuse being spread in our media.
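Jangari's and Spectre-7's questions can be made concrete with a toy example. The sketch below (plain Python, invented purely for illustration; there is no suggestion that GLM or anyone else counts this way) returns different totals for the same sentence depending on whether you count surface forms, lemmas, or word senses:

```python
# Three defensible ways to "count the words" in the same toy text.
text = "The cat runs. The cats ran. A dove dove off the bank by the river bank."
tokens = [t.lower().strip(".") for t in text.split()]

# 1. Distinct surface forms: "cat" and "cats" count separately.
surface_forms = set(tokens)

# 2. Distinct lemmas, via a hand-written (and obviously incomplete) map:
#    run/runs/ran collapse to one entry, cat/cats to another.
lemma_of = {"cats": "cat", "runs": "run", "ran": "run"}
lemmas = {lemma_of.get(t, t) for t in tokens}

# 3. Distinct senses: the bird "dove" vs. the past tense of "dive", the
#    riverbank vs. the financial "bank" -- nothing in the raw text tells
#    an algorithm how many of these it has just seen.
print("surface forms:", len(surface_forms))   # 11
print("lemmas:       ", len(lemmas))          # 9
```

Surface forms give 11, even a crude lemma map gives 9, and a sense-based count would be higher still; every one of those choices is defensible, and every one changes the "size of the language".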
Matthew Flaschen said,
January 3, 2009 @ 9:12 pm
Well, on a positive note, once he's published the book, he'll have to stop claiming the millionth word is just around the corner (because it's already here!).
Tim said,
January 3, 2009 @ 10:54 pm
Perhaps Mr. Payack would consider providing a list of the 998,773 words his "algorithm" counts as English, for the perusal of any interested parties? Surely this would not be proprietary?
Richard said,
January 4, 2009 @ 12:03 am
Y'all should bring in Ben Goldacre of Bad Science fame (http://www.badscience.net), if you haven't already. He deals with this kind of quackery all the time: MMR vaccinations cause autism, cell phone towers cause cancer and suicide, a formula that "predicts" celebrity. Payack's responses to questions are typical of cranks and quacks: attack the questioner, answer a different question, demand that the questioner answer their own question, or change the subject. Anyway, I think Ben would be amused, even though he's a medical doctor, not a linguist.
Marc Naimark said,
January 4, 2009 @ 8:57 am
If I say Hooblah enough on this thread, will it show up in Payack's algorithms?
I like that word! It's "defined" by Urban Dictionary as "Gossip or extreme amount of talk about something", but I think we can do better. How about "a great deal of controversy about a subject that doesn't warrant such a reaction"?
Grant Barrett said,
January 4, 2009 @ 12:19 pm
What's most interesting to me about Payack's claims is that in the uncorrected proof of his book, A Million Words and Counting (2008, Citadel Press, Kensington Publishing Corp, New York), he says in more than one place that English already has more than a million words.
For example, on page 2, in a gray box with some other factoids (a word I use with prejudice), it says on the first line, "Nontechnical English has over 1,000,000 words."
In a note at the bottom of page 4, he writes, "With the present word count standing around 990,000 words, we estimate that the Global Codex of Expository English should have reached the 1,000,000th word mark by the time you read these words."
And yet he persists with his media stunts.
As for the PQI, there is a gray box in the book titled, "How the PQI Works," but it has no more information than what is on the web site. No formulas, no explanations of software, no bibliography, no reference to computational linguistics or computational lexicography. Perhaps these things were included before the volume went to press.
The rest of the book is what we might call, inspired by Geoff Pullum's common lament, "a big bag o' facts" because it's a disjointed mish-mash of GLM-generated "top" lists and unsourced statistics and word histories. A précis might be, "Hey, look. It's English. Huh."
John Cowan said,
January 5, 2009 @ 8:27 pm
Hooblah
Does this represent a neutralization between /p/ and /b/ in this context? "Hoopla" is an existing word with the same meaning.
Jangari said,
January 5, 2009 @ 8:57 pm
Nope. Hooblah is a word that a friend of mine claims to have come up with as, at least originally, a complete nonce. The /b/ is certainly audible, and doesn't neutralise with /p/. Moreover, the 'oo' is phonemically /ʊ/, as in 'book'. So it should probably be pronounced [ˈhʊblʌ], as far as I'm concerned. It can mean whatever you like, to be perfectly honest.
That's now 5 times that Hooblah has been used in this thread (including this comment). I think we have our 998,774th word.
hjælmer said,
January 6, 2009 @ 7:03 am
"I would also note that our methodology has been tested by various government agencies and financial institutions, as well as by media the world over."
Ah, yes! The SEC, FEMA, Northern Rock, Kaupthing, Bear Stearns, Lehman Brothers, IndyMac, the Chicago Tribune…
Ginger Yellow said,
January 6, 2009 @ 12:55 pm
"I would also note that our methodology has been tested by various government agencies and financial institutions, as well as by media the world over. "
This is brilliant. I can just imagine the heated conversations at the very top of the Ministry of Linguistics:
CIVIL SERVANT: Sir, I have bad news.
JUNIOR MINISTER FOR WORDS: What is it, Smith?
CS: Well, we're running out of words.
JMFW: What are you talking about?
CS: We didn't want to believe it either, sir, but we've triple-checked it. We put a crack team of lexicographers on it, and however they ran the numbers the algorithm came back with the same answer. English will soon have over a million words!
JMFW: But our big word counter only has six digits!
CS: Exactly.
JMFW: Dear God. I'd better tell the prime minister right away.
Adrian said,
January 6, 2009 @ 2:33 pm
Short interview with the estimable Mr Payack on BBC Radio this afternoon:
http://www.bbc.co.uk/radio4/factual/wordofmouth.shtml
About two-thirds of the way through the show.
Faldone said,
January 7, 2009 @ 1:41 pm
I think we should have April 29, 2009 proclaimed National Coin A Word Day. With Language Log's highly paid team of lobbyists we should be able to get it through Congress and on the President's desk for a signature in no time at all. Surely innumerable people will pitch in with the coining blitz that will occur on that day, if only so they can claim to have coined the millionth word. As an added benefit, the employment opportunities at Lake Superior State University to facilitate the banishing of all the new words that will be coined on that day will make great strides in the fight against the global recession we are experiencing.
Steve said,
April 22, 2009 @ 7:14 am
It's moved again: according to a vacuous report on BBC Breakfast television this morning, it's now going to be sometime in June 2009.
Stan said,
May 12, 2009 @ 5:00 pm
The number of words in the English language is no more countable than the number of colours in the visible spectrum or the number of bacterial species in a bucket of soil. Systematized categories have their uses, but treating them as definitive is not in accordance with reality (imagine scare quotes around that word if you like).
I can't say I'm surprised by the credulity – or cynicism – of the many news organizations that reported this story uncritically, but anyone who is uncertain about its validity and takes the trouble to look around online will find no shortage of solid debunking.
Ben Zimmer's piece from February 2006 contains a quote, via Wordspy, from James A. H. Murray, the first editor of the Oxford English Dictionary. It's from the Introduction to the first volume of the OED; here is a longer excerpt:
"The Vocabulary of a widely-diffused and highly-cultivated living language is not a fixed quantity circumscribed by definite limits. That vast aggregate of words and phrases which constitutes the Vocabulary of English-speaking men presents, to the mind that endeavours to grasp it as a definite whole, the aspect of one of those nebulous masses familiar to the astronomer, in which a clear and unmistakable nucleus shades off on all sides, through zones of decreasing brightness, to a dim marginal film that seems to end nowhere, but to lose itself imperceptibly in the surrounding darkness. In its constitution it may be compared to one of those natural groups of the zoologist or botanist, wherein typical species forming the characteristic nucleus of the order, are linked on every side to other species, in which the typical character is less and less distinctly apparent, till it fades away in an outer fringe of aberrant forms, which merge imperceptibly in various surrounding orders, and whose own position is ambiguous and uncertain. For the convenience of classification, the naturalist may draw the line, which bounds a class or order, outside or inside of a particular form; but Nature has drawn it nowhere. So the English Vocabulary contains a nucleus or central mass of many thousand words whose 'Anglicity' is unquestioned; some of them only literary, some of them only colloquial, the great majority at once literary and colloquial,- they are the Common Words of the language. But they are linked on every side with other words which are less and less entitled to this appellation, and which pertain ever more and more distinctly to the domain of local dialect, of the slang and cant of 'sets' and classes, of the peculiar technicalities of trades and processes, of the scientific terminology common to all civilized nations, of the actual languages of other lands and peoples. And there is absolutely no defining line in any direction: the circle of the English language has a well-defined centre but no discernible circumference."
And here is Murray's simple diagram of that non-existent circumference. Its star shape is most appropriate, and delightfully unbounded.
James Desrosiers said,
June 10, 2009 @ 3:50 pm
If there is any recognition Mr. Payack deserves, it is the next nomination for the Darwin Awards.
Paul JJ Payack said,
December 23, 2010 @ 10:31 am
As the distinguished linguists above were writing these words, it turns out that Google was in the process of beginning its endeavor to scan the world's corpus of printed books.
In December 2010, they announced a Google/Harvard study of the data contained in the corpus. One result was an estimate of the current number of words in the English language.
Google's estimate of the current number of words in the English language: 1,022,000.
Global Language Monitor's estimate of the current number of words in the English language: 1,008,350.
The difference between the two analyses is thirteen thousandth of one percent.
Google's estimate of the number of new words added to English per year: ~8300.
Global Language Monitor's estimate of the number of new words added to English per year: ~5300.
Google's number is based on counting the number of words in the 15,000,000 books thus far entered into the ‘Google Corpus’.
Evidently, the folks at Google and Harvard think that it is possible to count words. Their analysis, undoubtedly, is based on a set of strict criteria, as well as the attendant mathematical models or algorithms.
(Question: in what academic realm is mathematical modeling considered 'pseudo-science'?)
GLM was founded as a Silicon Valley start-up; in this type of environment, one rarely ponders why something cannot be done, but rather how to make something happen that has never happened before.
As always GLM invites any of the above to sign a non-disclosure to discuss the bases of our mathematical models (as have numerous government agencies, scientists, technology companies, and scholars).
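A closing note on the arithmetic in the comment above: the two estimates quoted differ by 13,650 words, which is about 1.3 percent of the Google figure (thirteen thousandths of the whole, not thirteen thousandths of one percent). A one-line check:

```python
google, glm = 1_022_000, 1_008_350            # the two estimates quoted above
diff = google - glm
print(diff, f"= {100 * diff / google:.2f}% of the Google figure")   # 13650 = 1.34%
```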