Lexical limits?


Earlier today, Victor quotes Jerry Packard quoting C.C. Cheng to the effect that "the human lexicon has a de facto storage limit of 8,000 lexical items" ("Lexical limits", 12/5/2015). Victor is appropriately skeptical, and asks for "references to any studies that have been done on the limits to (or norms for) the human lexicon". In fact there's been a lot of quantitative research on this topic, going back at least 75 years, which supports Victor's skepticism, and demonstrates clearly that Cheng's estimate is low by such a large factor that I wonder whether his idea has somehow gotten mangled at some point along the chain of quotation.

There are many ways to approach the problem of how many words there are in a language, and the question of how many words a given person knows. I take it that Cheng's claim is related to the second question, since it's clear from a glance at a random dictionary that 8,000 is not a remotely plausible limit for the number of lexical items in a language viewed as a social construct.

One approach to the question of how many words a given person "knows" is to choose a sample of items at random from a suitable word list, to test that person for their knowledge of each of the items in that sample, and then to extrapolate. Thus if you start with a list of 100,000 items, and test someone for knowledge of a random sample of 100, and the subject gives evidence of knowing 43 out of 100, then you can estimate that they know (in whatever sense you've tested them) about 43,000 of the items on the list. (And you'd get a tighter estimate if you tested them on a larger sample.)
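For concreteness, here is a minimal sketch of that extrapolation in Python, using the purely hypothetical numbers from the paragraph above (a 100,000-item list, a 100-item sample, 43 items known), with a rough binomial confidence interval added to show why a larger sample gives a tighter estimate:

```python
import math

def estimate_vocabulary(list_size, sample_size, n_known, z=1.96):
    """Extrapolate vocabulary size from performance on a random sample,
    with a normal-approximation binomial confidence interval."""
    p_hat = n_known / sample_size                      # proportion of sampled items known
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size)  # standard error of that proportion
    return (p_hat * list_size,                         # point estimate
            (p_hat - z * se) * list_size,              # lower bound
            (p_hat + z * se) * list_size)              # upper bound

# Hypothetical figures from the paragraph above: 100,000-item list, 100 sampled, 43 known.
point, low, high = estimate_vocabulary(100_000, 100, 43)
print(f"estimate ~{point:,.0f} items (95% CI roughly {low:,.0f} to {high:,.0f})")
```

With the same hit rate on a 1,000-item sample, the interval narrows from roughly ±9,700 to roughly ±3,100, which is the point of the parenthetical remark above.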

Of course, there are several questions you need to answer along the way to such a result.

To start with, what sorts of things do you put on the list? Once you have dog, do you treat dogs as a separate word? Presumably not. But how about doggy, dogging, dogged? If you have food as well as dog, is dog food (or dogfood) a separate word? Or hotdog, the food or the skiing style? What about idiomatic phrases like dog's breakfast?

Do you include proper nouns — place names, personal names, company names, product names, band names, names of books and movies and albums and composers and musicians and songs? What about proper nouns with internal white space? Is "New York" a word? Is "New York Times"? How about "Lady Gaga" and "Phnom Penh"? And what about acronyms and initialisms? There are 26^3 = 17,576 possible three-letter initialisms, and a sampling test suggests that a fairly large proportion of them are in use. So does someone get credit for knowing ABC and IBM and TSA and TNT and LED and WTF and … How do you deal with words with multiple senses? The answers to questions like these will expand or contract your wordlist by at least an order of magnitude, and can have a similar effect on your quantitative estimate of lexical knowledge.
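As an aside on the initialism question, here is a sketch of the sort of sampling test just mentioned. The set of "attested" initialisms is something you would have to supply yourself, from a corpus or an acronym list, and that choice is of course itself one of the decisions at issue:

```python
import itertools
import random
import string

# All 26^3 = 17,576 possible three-letter initialisms.
ALL_TRIPLES = [''.join(t) for t in itertools.product(string.ascii_uppercase, repeat=3)]

def estimate_initialisms_in_use(attested, n_sample=500, seed=0):
    """Sample possible initialisms at random, check them against a supplied
    set of attested ones, and extrapolate to the full space of 17,576."""
    rng = random.Random(seed)
    sample = rng.sample(ALL_TRIPLES, n_sample)
    hit_rate = sum(s in attested for s in sample) / n_sample
    return hit_rate * len(ALL_TRIPLES)

# Usage, with whatever attested set you assemble:
#   estimate_initialisms_in_use({"ABC", "IBM", "TSA", "TNT", "LED", "WTF", ...})
```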


And then, what counts as "knowing" a word? Is it enough to be able to distinguish words from non-words? Presumably not. Do you need to be able to use the word in a sentence? How should such candidate uses be graded? Or should you have to distinguish correct uses from incorrect ones? How subtle or difficult should the tested distinctions be? Again, the testing method can have a large effect on your estimate.

Attempts to define and implement this approach go back at least to R.H. Seashore and L.D. Eckerson, "The measurement of individual differences in general English vocabularies", Journal of Educational Psychology, 1940. A somewhat more recent example is W.E. Nagy & R.C. Anderson, "How many words are there in printed school English?", Reading Research Quarterly 19, 304-330, 1984. They came up with plausible answers to all of the questions above, and concluded that the average American high school graduate has a vocabulary of about 40,000 "word families". When I've given similar tests to Penn undergraduates, the estimates have generally come out in the 60,000-70,000 range.

Another approach is to try to extrapolate type-token functions in an individual's speech or writing, to estimate how many types would appear if you could keep tracking tokens forever. This is hard to do, for at least two reasons: it's not clear what functional form to extrapolate, and after a relatively short time most of the "new words" in digital text are actually typographical errors. An excellent work on this topic is Harald Baayen's book Word Frequency Distributions. But even without any fancy extrapolations, and even with relatively conservative ideas about what constitutes a "lexical item", we can determine by analyzing the texts of prolific writers in English that they used (and presumably knew) far more than 8,000 items.
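One common (and, as just noted, contestable) choice of function to extrapolate is a Heaps'-law power curve V(N) ≈ K N^β. Here is a minimal sketch of fitting one to an observed type-token curve, assuming some author's collected texts sit in a plain-text file; the file name below is a placeholder, and Baayen's book explains at length why such extrapolations deserve a healthy dose of skepticism:

```python
import re
import numpy as np
from scipy.optimize import curve_fit

def type_token_curve(path, step=10_000):
    """Track the number of distinct wordform types observed after every
    `step` tokens of running text (lowercased, alphabetic runs only)."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z]+", f.read().lower())
    seen, ns, vs = set(), [], []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            ns.append(i)
            vs.append(len(seen))
    return np.array(ns), np.array(vs)

def heaps(n, k, beta):
    return k * n ** beta      # Heaps' law: V(N) = K * N^beta

ns, vs = type_token_curve("collected_works.txt")   # placeholder path
(k, beta), _ = curve_fit(heaps, ns, vs, p0=(10.0, 0.5))
print(f"K = {k:.1f}, beta = {beta:.2f}")
print(f"extrapolated types at 10M tokens: {heaps(10_000_000, k, beta):,.0f}")
```

Note that the typo problem mentioned above bites here too: past a few million tokens, a nontrivial share of the apparently new types in digital editions are transcription or OCR errors, which inflates the fitted curve.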

Packard's reference appears to be to Cheng, Chin-Chuan, "Quantification for understanding language cognition." Quantitative and Computational Studies on the Chinese Language (1998): 15-30. I don't have time to try to chase this reference down, but the idea that 8,000 is a sort of natural cognitive limit to the number of items in "the human lexicon" is so obviously and preposterously false that I wonder again what Prof. Cheng actually said, and what evidence if any he cited for it.

For more, here are a few earlier LLOG posts on the topic:

"986,120 words for snow job", 2/6/2006
"Word counts", 11/28/2006
"Vocabulary size and penis length", 12/8/2006
"Britain's scientists risk becoming hypocritical laughing-stocks, research suggests", 12/16/2006
"An apology to our readers", 12/28/2006
"Cultural specificity and universal values", 12/22/2006
"Vicky Pollard's Revenge", 1/2/2007
"Ask Language Log: Comparing the vocabularies of different languages", 3/31/2008
"Laden on word counting", 6/1/2010
"Lexical bling: Vocabulary display and social status", 11/20/2014

Update — here's a type-token plot from 16 books by Charles Dickens (15 novels plus American Notes):

I downcased everything, split hyphenated words, and counted only all-alphabetic tokens. Of course the 37,504 types constitute a smaller number of "word families" — thus we have

504 dog
200 dogs
32 dogged
27 doggedly
5 dogging
3 doglike
2 doggedness

The meaning and usage of dogged and dogging are far from 100% predictable from the meaning and usage of dog, so there's a question about where to draw the boundaries of dog's "word family". And of course most "word families", however extended, are not so fully represented. Thus we have

4 plausible
1 plausibility

but nothing for implausible, plausibilities, etc. And we have

114 cat
37 cats

but not catty, catlike, catting, cattish, etc. A typical ratio in such lists is about 2.3 wordform types per "word family", which would reduce the 37,504 types to about 16,306 families. Even increasing the factor to 3 still gives 12,501.

The fact that I'm not looking at "words" with internal hyphens or white space, etc., reduces the count. And most important, the curve has not completely flattened out — Dickens has not displayed his entire active vocabulary, much less the total set of words that he would recognize and understand in the speech or writing of others.
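For the record, here is a sketch of the counting recipe described in this update. The file name is a placeholder for the 16 concatenated Dickens texts, the tokenization details are one plausible reading of "downcased, hyphens split, all-alphabetic tokens only", and the 2.3 and 3 family factors are just the rough ratios mentioned above:

```python
import string
from collections import Counter

def count_types(path):
    """Downcase, split hyphenated words, strip surrounding punctuation,
    and keep only all-alphabetic tokens; return a Counter of wordform types."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower().replace("-", " ")
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return Counter(tok for tok in stripped if tok.isalpha())

counts = count_types("dickens_16_books.txt")   # placeholder path
n_types = len(counts)
print(f"{n_types:,} wordform types")
print(f"~{n_types / 2.3:,.0f} 'word families' at 2.3 forms per family")
print(f"~{n_types / 3:,.0f} even at 3 forms per family")

# The dog family, as listed above:
for w in ("dog", "dogs", "dogged", "doggedly", "dogging", "doglike", "doggedness"):
    print(counts[w], w)
```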

10 Comments

  1. Bill S. said,

    December 5, 2015 @ 10:05 am

    I've run across claims like the one below, from http://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/2006-How-large-a-vocab.pdf

    "If 98% coverage of a text is needed for unassisted comprehension, then a 8,000 to 9,000 word-family vocabulary is needed for comprehension of written text and a vocabulary of 6,000 to 7,000 for spoken text."

    With "word families," the claim sounds more plausible, esp. if the rules about family definition are fairly inclusive. And it's a claim about sufficiency, not a claim about actuality.

    [(myl) This is indeed a plausible claim, though "written text" is a pretty diverse target, and there are certainly kinds of texts (compare biochemistry journal articles vs. theological treatises vs. hiphop lyrics) where a single list of that size seems unlikely to be adequate.

    But Cheng's alleged assertion is not about sufficiency for comprehension, it's about a "de facto storage limit" on "the human lexicon". And again, his estimate is so transparently false that I wonder whether the claim has gotten garbled in transmission.]

  2. Bob Ladd said,

    December 5, 2015 @ 2:26 pm

    Like Mark, I can't find Cheng's original paper, but what appears to be the factoid attributed by Victor Mair to Cheng via a quote from Jerry Packard is quoted in a 2011 paper by W. S.-Y. Wang as follows: "the number of distinct sinograms used in the various dynastic histories seems to have remained largely constant, hovering around 8000". This is obviously a very different claim from a suggestion that an individual brain can only hold 8000 words.

    However, another paper by Wang et al. 2004 suggests that the Big Issue here is the mismatch between the essentially infinite range of meanings we may want to convey and the clearly finite size of the vocabulary. (Wang and colleagues are interested in this from the point of view of language evolution and the problem of how speech communities converge on shared vocabularies.) In other words, the 8000 figure seems to be about Chinese writing, not language in general, but embedding this in the context of the more general infinite-use-of-finite-means question reminds us that the means really are finite. The question of limits on vocabulary size may be hard to answer, but it's a real question that may tell us something about the nature of language, both in the abstract and in the individual brain.

  3. The Other Mark said,

    December 5, 2015 @ 3:42 pm

    Also these counts are only what people do know or use.

    Since learning words is usually only a side-effect of other knowledge the counts are dramatically too low for what we could learn if we really put our hearts into it.

    I suspect the word counts for truly bi-lingual and tri-lingual people must be higher yet.

    I've yet to meet a person who says that they can no longer learn any new words, or that old words fall out as new ones replace them.

  4. Rubrick said,

    December 5, 2015 @ 3:44 pm

    I'm glad you brought up proper nouns. I've gotten the impression that proper nouns have been given somewhat short shrift in vocabulary counts (and perhaps in linguistics in general?). Any opinion on whether my impression is accurate? And what are the best estimates of the percentage of a typical person's vocabulary that does consist of proper nouns?

    [(myl) Your impression is accurate — in general, proper nouns are entirely excluded from most studies of vocabulary knowledge (though not generally from text-based word counts). I've never seen any estimates of "the percentage of a typical person's vocabulary that does consist of proper nouns", so I can't tell you the best one — but it's clear that people in the modern world know many thousands of proper nouns, and a crude guess would be that the number is of the same order as the number of common nouns that they know. It wouldn't be a shock (to me at least) to find that it's larger.]

    On a less relevant note, I'm curious whether you opted to maintain the distinction between acronyms and initialisms (which I think has gone the way of plural "data"?) because it still matters to you personally, or because you didn't feel like dealing with the comments if you just called them all "acronyms". :-)

    [(myl) The latter, definitely]

  5. Eric said,

    December 5, 2015 @ 4:03 pm

    @Rubrick: From what I've seen, most counting schemes explicitly choose to exclude proper nouns. This is definitely the case in second language vocabulary estimates. Here are a few of the justifications I've found:

    1) Individual knowledge of proper nouns is more idiosyncratic than that of other words. (I guess this is an argument that it's simply too difficult to count them).

    2) Older counting schemes drew words from dictionaries and the assumption was that proper nouns were included in rather unprincipled ways, e.g., why did the dictionary include well-known person/place X and not well-known person/place Y? (I find this to be less compelling when modern counting practices based on word frequency are used.)

    3) For second language learners, it was assumed that many proper nouns simply don't need to be learned since they are easy cognates. (This is obviously not true in all cases, e.g., English to Chinese.)

  6. Coby Lubliner said,

    December 5, 2015 @ 4:13 pm

    I wonder: is 8,000 ("eight thousand") a "lexical item"? (In German it's written as one "word", achttausend.) And if it is, how about 8,001? Then aren't all lexical representations of natural numbers — an infinite set — "lexical items", even if most of them are compounds formed from a finite set of primitive words?

  7. Michael Watts said,

    December 5, 2015 @ 4:57 pm

    As to the particular example, I don't think it would be controversial to say that dog and dogging (free of context) are two separate words. But I'm surprised that the issue driving that, which I'd expect disproportionately complicates efforts based on tokenizing text, isn't mentioned: dog [noun; canine animal] and dog [verb; haunt, pursue relentlessly] should be considered different words by basically any standard. And you could make a similar point about dogged [adjective; relentless] and dogged, a form of verb dog up above. Those two aren't even pronounced the same.

    For second language learners, it was assumed that many proper nouns simply don't need to be learned since they are easy cognates. (This is obviously not true in all cases, e.g., English to Chinese.)

    One of CC-CEDICT's great strengths is a robust listing of Chinese proper names, such as 迈克尔 乔丹 mai-ke-er qiao-dan Michael Jordan or 薛定谔 xue-ding-e Schrödinger.

  8. Michael Watts said,

    December 5, 2015 @ 5:06 pm

    Coby Lubliner: as I understand things, the spelling of a word isn't of much linguistic interest; "eight thousand" is not an independent lexical item regardless of how it's spelled because it is compositional, that is, its meaning can be easily determined just by knowing the meaning of the components (and the rule that assembles them). In contrast, "hot dog" or "dog's breakfast" are noncompositional — knowing the meanings of "hot", "dog", and "breakfast" won't help you to understand them.

  9. GH said,

    December 5, 2015 @ 9:20 pm

    There are a number of self-tests one can take online to estimate vocabulary size. This one seemed pretty convincing to me; the two-part testing method makes a lot of sense. Taking the test vividly demonstrates the problem of defining just what it means to "know" a word.

  10. Athanassios Protopapas said,

    December 6, 2015 @ 4:39 am

    It may not be so simple to estimate one's vocabulary, the most likely outcome being an underestimate, especially for those with the largest vocabularies:

    "Unfortunately, psychometric vocabulary measures are virtually guaranteed to fail to detect vocabulary growth in adults because they attempt to extrapolate vocabulary sizes from sets of test words that are biased toward frequent types (Heim, 1970; Raven, 1965; Wechsler, 1997). However, the distribution of word-types in language ensures both that adult vocabularies overwhelmingly (and increasingly) comprise low-frequency types, and that an individual’s knowledge of one randomly sampled low-frequency type is not predictive of his or her knowledge of any other randomly sampled low-frequency type. This makes the reliable estimation of vocabulary sizes from small samples mathematically impossible (Baayen, 2001)." (Ramscar et al., 2014, p. 9)

    Ramscar, M., Hendrix, P., Shaoul, C., Milin, P., & Baayen, H. (2014). The myth of cognitive decline: Non‐linear dynamics of lifelong learning. Topics in Cognitive Science, 6(1), 5-42.
