Earlier today, Victor quoted Jerry Packard quoting C.C. Cheng to the effect that "the human lexicon has a de facto storage limit of 8,000 lexical items" ("Lexical limits", 12/5/2015). Victor is appropriately skeptical, and asks for "references to any studies that have been done on the limits to (or norms for) the human lexicon". In fact there's been a lot of quantitative research on this topic, going back at least 75 years, which supports Victor's skepticism, and demonstrates clearly that Cheng's estimate is low by such a large factor that I wonder whether his idea has somehow gotten mangled somewhere along the chain of quotation.
There are many ways to approach the problem of how many words there are in a language, and the question of how many words a given person knows. I take it that Cheng's claim is related to the second question, since it's clear from a glance at a random dictionary that 8,000 is not a remotely plausible limit for the number of lexical items in a language viewed as a social construct.
One approach to the question of how many words a given person "knows" is to choose a sample of items at random from a suitable word list, to test that person for their knowledge of each of the items in that sample, and then to extrapolate. Thus if you start with a list of 100,000 items, and test someone for knowledge of a random sample of 100, and the subject gives evidence of knowing 43 out of 100, then you can estimate that they know (in whatever sense you've tested them) about 43,000 of the items on the list. (And you'd get a tighter estimate if you tested them on a larger sample.)
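To make the extrapolation concrete, here's a minimal sketch in Python. The word list and the knows() test are placeholders for whatever master list and testing procedure you actually use, and the confidence interval uses the standard normal approximation for a sampled proportion:

```python
import math
import random

def estimate_vocabulary(word_list, knows, sample_size=100, seed=0):
    """Estimate how many items on word_list a subject knows, by testing
    a random sample and extrapolating the observed proportion.

    word_list   -- the master list of candidate lexical items
    knows       -- callable returning True if the subject knows an item
                   (a stand-in for an actual test of the subject)
    sample_size -- number of items to test
    """
    rng = random.Random(seed)
    sample = rng.sample(word_list, sample_size)
    k = sum(1 for item in sample if knows(item))
    p = k / sample_size

    # Normal-approximation 95% confidence interval for the proportion.
    # Its width shrinks as 1/sqrt(sample_size), which is why a larger
    # sample gives a tighter estimate.
    half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)

    n = len(word_list)
    return p * n, (max(0.0, p - half_width) * n, min(1.0, p + half_width) * n)
```

With 43 of 100 sampled items known against a 100,000-item list, this reproduces the 43,000 point estimate, with a 95% interval of roughly 33,300 to 52,700; testing 1,000 items instead would narrow the interval to about plus or minus 3,100.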
Of course, there are several questions you need to answer along the way to such a result.
To start with, what sorts of things do you put on the list? Once you have dog, do you treat dogs as a separate word? Presumably not. But how about doggy, dogging, dogged? If you have food as well as dog, is dog food (or dogfood) a separate word? Or hotdog, the food or the skiing style? What about idiomatic phrases like dog's breakfast?
Do you include proper nouns — place names, personal names, company names, product names, band names, names of books and movies and albums and composers and musicians and songs? What about proper nouns with internal white space? Is "New York" a word? Is "New York Times"? How about "Lady Gaga" and "Phnom Penh"? And what about acronyms and initialisms? There are 26^3 = 17,576 possible three-letter initialisms, and a sampling test suggests that a fairly large proportion of them are in use. So does someone get credit for knowing ABC and IBM and TSA and TNT and LED and WTF and …? And how do you deal with words with multiple senses? The answers to questions like these will expand or contract your wordlist by at least an order of magnitude, and can have a similar effect on your quantitative estimate of lexical knowledge.
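A sampling test of that sort is easy to set up. In this sketch the attested set is a tiny placeholder for whatever list of initialisms-in-use you can assemble from a corpus or an acronym dictionary:

```python
import itertools
import random
import string

# All 26^3 = 17,576 possible three-letter initialisms.
all_initialisms = [''.join(t) for t in
                   itertools.product(string.ascii_uppercase, repeat=3)]

# Placeholder: in practice, a large set harvested from a corpus.
attested = {"ABC", "IBM", "TSA", "TNT", "LED", "WTF"}

# Estimate the proportion in use from a random sample.
sample = random.sample(all_initialisms, 500)
hits = sum(1 for s in sample if s in attested)
print(f"{hits}/{len(sample)} sampled initialisms attested "
      f"({hits / len(sample):.1%})")
```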
And then, what counts as "knowing" a word? Is it enough to be able to distinguish words from non-words? Presumably not. Do you need to be able to use the word in a sentence? How should such candidate uses be graded? Or should you have to distinguish correct uses from incorrect ones? How subtle or difficult should the tested distinctions be? Again, the testing method can have a large effect on your estimate.
Attempts to define and implement this approach go back at least to R.H. Seashore & L.D. Eckerson, "The measurement of individual differences in general English vocabularies", Journal of Educational Psychology 31, 14-38, 1940. A somewhat more recent example is W.E. Nagy & R.C. Anderson, "How many words are there in printed school English?", Reading Research Quarterly 19, 304-330, 1984. They came up with plausible answers to all of the questions above, and concluded that average American high school graduates have a vocabulary of about 40,000 "word families". When I've given similar tests to Penn undergraduates, estimates are generally in the 60,000-70,000 range.
Another approach is to try to extrapolate type-token functions in an individual's speech or writing, to estimate how many types would appear if you continued tracking tokens forever. This is hard to do, for at least two reasons: it's not clear what function to extrapolate, and after a relatively short time, most of the "new words" in digital text are actually typographical errors. An excellent work about this is Harald Baayen's book, Word Frequency Distributions. But even without any fancy extrapolations, and even with relatively conservative ideas about what constitutes a "lexical item", we can determine by analyzing the texts of prolific writers in English that they used (and presumably knew) way more than 8,000 items.
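Heaps' (Herdan's) law, V = K * N^beta, is one common candidate for the function to extrapolate. Here's a minimal sketch that builds a type-token curve and fits that law in log-log space; Baayen's book explains at length why such simple fits can mislead, so treat the result as illustrative:

```python
import math
import re

def type_token_curve(text):
    """Return (tokens_seen, types_seen) pairs as the text is scanned."""
    seen, curve = set(), []
    for i, tok in enumerate(re.findall(r"[a-z]+", text.lower()), start=1):
        seen.add(tok)
        curve.append((i, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares fit of V = K * N**beta in log-log coordinates."""
    xs = [math.log(n) for n, v in curve]
    ys = [math.log(v) for n, v in curve]
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    K = math.exp(ybar - beta * xbar)
    return K, beta
```

Extrapolating the fitted curve to an arbitrarily large N then gives a (very rough) projection of the writer's type inventory.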
Packard's reference appears to be to Cheng, Chin-Chuan, "Quantification for understanding language cognition." Quantitative and Computational Studies on the Chinese Language (1998): 15-30. I don't have time to try to chase this reference down, but the idea that 8,000 is a sort of natural cognitive limit to the number of items in "the human lexicon" is so obviously and preposterously false that I wonder again what Prof. Cheng actually said, and what evidence if any he cited for it.
For more, here are a few earlier LLOG posts on the topic:
"986,120 words for snow job", 2/6/2006
"Word counts", 11/28/2006
"Vocabulary size and penis length", 12/8/2006
"Britain's scientists risk becoming hypocritical laughing-stocks, research suggests", 12/16/2006
"An apology to our readers", 12/28/2006
"Cultural specificity and universal values", 12/22/2006
"Vicky Pollard's Revenge", 1/2/2007
"Ask Language Log: Comparing the vocabularies of different languages", 3/31/2008
"Laden on word counting", 6/1/2010
"Lexical bling: Vocabulary display and social status", 11/20/2014
Update — here's a type-token plot from 16 books by Charles Dickens (15 novels plus American Notes).
I downcased everything, split hyphenated words, and counted only all-alphabetic tokens, which yields 37,504 distinct wordform types across the 16 books. Of course the 37,504 types constitute a smaller number of "word families" — thus forms like dog, dogged, and dogging are all counted as separate types.
The meaning and usage of dogged and dogging are far from 100% predictable from the meaning and usage of dog, so there's a question about where to draw the boundaries of dog's "word family". And of course most "word families", however extended, are not so fully represented. Thus we get plausible and plausibly, but nothing for implausible, plausibilities, etc. And we get cat and cats, but not catty, catlike, catting, cattish, etc. A typical factor in such lists is about 2.3 wordform types per "word family" on average, which would reduce 37,504 to 16,306. Even increasing this to a factor of 3 gives us 12,501.
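For concreteness, here's roughly what that counting procedure looks like in code, with the file paths as placeholders; exact decisions about punctuation and apostrophes will shift the totals somewhat:

```python
import string

def count_types(paths):
    """Count distinct all-alphabetic word types across plain-text files:
    downcase everything, split hyphenated words, keep all-letter tokens."""
    types = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read().lower().replace("-", " ")
        for tok in text.split():
            word = tok.strip(string.punctuation)  # shed attached punctuation
            if word.isalpha():                    # all-alphabetic tokens only
                types.add(word)
    return len(types)

# Applying the assumed family-size factors to the observed type count:
# 37504 / 2.3 is about 16,306 families, and 37504 / 3 is about 12,501.
```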
The fact that I'm not looking at "words" with internal hyphens or white space, etc., reduces the count. And most important, the curve has not completely flattened out — Dickens has not displayed his entire active vocabulary, much less the total set of words that he would recognize and understand in the speech or writing of others.