## Why estimating vocabulary size by counting words is (nearly) impossible

A few days ago, I expressed skepticism about a claim that "the human lexicon has a de facto storage limit of 8,000 lexical items", which was apparently derived from counting word types in various sorts of texts ("Lexical limits?", 12/5/2015). There are many difficult questions here about what we mean by "word", and what it means to be "in" the lexicon of an individual or a language — though I don't see how you could answer those questions so as to come up with a number as low as 8,000. But today I'd like to focus on some of the reasons that even after settling the "what is a word" questions, it's nearly hopeless to try to establish an upper bound by counting "word" types in text.

Zipf's Law says that if you order words by frequency, the frequency of the Nth word is proportional to $$1/N^s$$, where s is an exponent that runs roughly between 0.5 and 2 for various parts of the distribution. (See e.g. this plot for the various estimated exponents in different frequency-ranges of Wikipedia text.)
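To make the shape of that distribution concrete, here is a minimal Python sketch. It assumes a single exponent `s` across the whole rank range (which real text does not have, as noted above) and a hypothetical 10,000-type vocabulary; the function name and the thresholds are illustrative choices, not anything from the original analysis.

```python
def zipf_probabilities(vocab_size, s=1.0):
    """Return normalized probabilities proportional to 1/N**s for ranks 1..vocab_size."""
    weights = [1.0 / rank ** s for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical vocabulary of 10,000 types with a single exponent s = 1
probs = zipf_probabilities(10_000, s=1.0)

# Share of all tokens accounted for by the 100 most frequent types
top_100_share = sum(probs[:100])

# Number of types expected less often than once per 10,000 tokens
rare_types = sum(1 for p in probs if p < 1e-4)

print(f"top-100 share: {top_100_share:.2f}, rare types: {rare_types}")
```

Under these assumptions a small head of the list covers a large share of the tokens, while most of the types sit in the rare tail.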

The crucial point is that most words are rare, and the further down the frequency-ordered list you go, the rarer the words are. And an obvious consequence of this is that estimating vocabulary by counting word-types in someone's speech or writing is at best very hard to do.

This is partly because most words are very rare, and not-yet-encountered words are likely to be increasingly rare, and so it takes a very large sample to see them all or even most of them. As I observed in an earlier post, if we concatenate 16 of Charles Dickens' books and ask how much of his vocabulary he displayed as a given amount of his text has passed by, then after nearly 4 million words he's still showing us wordforms (i.e. distinct space-or-punctuation-separated letter-strings) that he hadn't used before:
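As a rough illustration of how a type/token growth curve of this kind can be computed, here is a small Python sketch. It assumes that a wordform is a lowercase run of ASCII letters, so an apostrophe counts as a separator, which is only one of many possible tokenization choices. The function name and the `step` parameter are hypothetical.

```python
import re

def type_growth_curve(text, step=1000):
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step` tokens."""
    # Assumption: a wordform is a lowercase letter-string; everything else separates
    tokens = re.findall(r"[a-z]+", text.lower())
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point at each step, and always at the final token
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve

sample = "It was the best of times, it was the worst of times"
curve = type_growth_curve(sample, step=4)
print(curve)
```

Run over a long concatenated text, the second element of each pair traces how many distinct wordforms have appeared by that point.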

But it's worse than that. You might hope to fit a function to data of this sort, and thereby estimate the vocabulary size as a parameter of the fit. But because of the nature of the long-tailed distribution of words, this is essentially impossible to do in a reliable way. Here's the same plot with two superimposed artificial samples, from hypothetical vocabularies of 100,000 and 200,000 words:

And for a given vocabulary size, different choices of Zipfian parameters will give quite different results:

Obviously you can establish a lower bound, starting with the number of "words" that you've actually seen (which makes the estimate of a limit at 8,000 especially puzzling).
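Artificial samples of the sort described above can be approximated with a short simulation. This sketch assumes a pure Zipfian vocabulary with one exponent `s`, which is a simplification, and uses a fixed random seed. The function name, the sample size of 100,000 tokens, and the alternative exponent 1.2 are illustrative assumptions rather than the settings behind the original plots.

```python
import random
from itertools import accumulate

def simulate_type_count(vocab_size, n_tokens, s=1.0, seed=0):
    """Draw n_tokens from a Zipfian vocabulary and count the distinct types seen."""
    rng = random.Random(seed)
    # Cumulative weights proportional to 1/rank**s
    cum_weights = list(accumulate(1.0 / rank ** s for rank in range(1, vocab_size + 1)))
    draws = rng.choices(range(vocab_size), cum_weights=cum_weights, k=n_tokens)
    return len(set(draws))

n_tokens = 100_000
types_100k = simulate_type_count(100_000, n_tokens)            # hypothetical 100,000-word vocabulary
types_200k = simulate_type_count(200_000, n_tokens)            # hypothetical 200,000-word vocabulary
types_steeper = simulate_type_count(100_000, n_tokens, s=1.2)  # same size, different exponent

print(types_100k, types_200k, types_steeper)
```

In each case the number of types observed is only a lower bound on the size of the vocabulary that generated the sample.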

And there is a simple and elegant model of how to use such samples to estimate the underlying population size, developed originally by Alan Turing during the WWII Enigma cryptanalysis effort — see I.J. Good, "The Population Frequencies of Species and the Estimation of Population Parameters", Biometrika 40(3-4) 237-264, December 1953, or these lecture notes. But this simple and elegant method, though it works reasonably well to estimate how much "belief" to withhold for previously-unseen species, is not nearly as effective at estimating how numerous the species for which we reserve that belief will turn out to be.
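The part of the Good-Turing approach that works well, the probability mass reserved for unseen types, is simple to state: it is estimated as the number of types seen exactly once divided by the total number of tokens. A minimal Python sketch follows; the function name and the toy sentence are illustrative only.

```python
from collections import Counter

def good_turing_unseen_mass(tokens):
    """Estimate the total probability of all unseen types as N1 / N."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    # N1: number of types observed exactly once
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_tokens if n_tokens else 0.0

tokens = "the cat sat on the mat and the dog sat too".split()
unseen_mass = good_turing_unseen_mass(tokens)
print(f"estimated unseen mass: {unseen_mass:.3f}")
```

Note that this returns a probability mass, not a count: it says nothing directly about how many distinct unseen types share that mass.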

For much more on this topic, see Harald Baayen's book, Word Frequency Distributions.

1. ### mendel said,

December 8, 2015 @ 4:52 pm

How would a thesaurus figure in the estimation of a writer's active vocabulary? And wouldn't one's vocabulary shift over a lifetime as well?

2. ### Chris C. said,

December 8, 2015 @ 7:06 pm

@mendel — I don't know that a habitual user of a thesaurus would leave behind a sufficiently large body of published work for analysis. Writing by such people tends to be not very good.

On the rare occasions I've resorted to one, it's because I've had trouble bringing le mot juste to mind, not because I needed to pull a long, fancy word out of my ass. (Which is why I prefer a thesaurus organized by the traditional conceptual method rather than dictionary-style.)

3. ### gaoxiaen said,

December 8, 2015 @ 8:21 pm

Why don't they use Moby Dick?

[(myl) Why do you ask? Here's what a type/token plot for Moby Dick looks like (plot not shown).]

4. ### John Busch said,

December 8, 2015 @ 10:35 pm

I've seen reports that an educated Chinese person knows about 8,000 characters, which could be where the original claim about an 8,000 word lexical limit came from.

5. ### flow said,

December 9, 2015 @ 6:22 am

It's funny you should come up with an argument based on Zipf's Law; since I've read the original post I've been wondering what effect that law might have on the predictive power of a dictionary sample.

If I understand the random-sample method correctly, what you do is you take a dictionary, select headwords by chance and somehow judge whether you 'know' each one. Let's assume a dictionary with 10,000 entries; then, if you are familiar with, say, 50 out of 100 samples, your vocabulary in that language should be around 5,000 (or 50% of 10,000).

The problem comes when you repeat the test with a dictionary of 20,000 or 100,000 entries. In a natural language, almost every entry in a big dictionary will be rare or even exceedingly rare, meaning chances are very low that a given word will appear in a selection of natural texts. This means that out of 100 randomly selected words you will likely know far fewer than 50 items; qualitatively, this is OK, since if our first estimate of 5,000 words in your mental lexicon was correct, then a lower percentage with the bigger dictionary may still result in 5,000 known words, that is, if the rate is lowered to 25% in the case of a corpus of 20,000 words or to 5% in the case of a corpus of 100,000 words.

My question boils down to whether this works out quantitatively as well. I have a hunch that this might not be the case: that with a big corpus the testing method presents too many overly rare words, which in turn may lower your chances of seeing an item you're familiar with more than proportionally.

[(myl) See "Dictionary-sampling estimates of vocabulary knowledge: No Zipf problems".]
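The random-sample arithmetic described in this comment can be sketched in a few lines of Python. Everything here is hypothetical: the 10,000-entry dictionary is synthetic, the `knows_word` callback stands in for the test-taker's judgment, and the function name is made up for illustration.

```python
import random

def estimate_vocabulary(dictionary, knows_word, sample_size=100, seed=0):
    """Scale the known fraction of a random headword sample up to the whole dictionary."""
    rng = random.Random(seed)
    sample = rng.sample(dictionary, sample_size)
    known_fraction = sum(1 for w in sample if knows_word(w)) / sample_size
    return known_fraction * len(dictionary)

# Hypothetical dictionary of 10,000 headwords, of which the speaker knows 5,000
dictionary = [f"word{i}" for i in range(10_000)]
known = set(dictionary[:5_000])

estimate = estimate_vocabulary(dictionary, lambda w: w in known)
print(f"estimated vocabulary: {estimate:.0f}")
```

Because headwords are drawn uniformly from the dictionary rather than from running text, the sample does not depend on how frequent each word is.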

6. ### Jonathan said,

December 9, 2015 @ 1:03 pm

Has anyone tried using the HyperLogLog algorithm on this? It was developed for a rather different purpose (using very small amounts of memory to estimate the number of unique elements in enormous sets), but I'd love to see how it performs. It's gotten quite a bit of buzz in the web analytics world, where knowing how many unique visitors your website gets is important.

[(myl) Wikipedia says that "The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly-distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set." (emphasis added.) So my guess is "No, it's a solution to a completely different problem".]

7. ### Gosse Bouma said,

December 10, 2015 @ 6:19 am

This reminds me of a huge crowdsourcing study done for Dutch a few years ago by Marc Brysbaert et al.

Unfortunately, the only publication I can find is in Dutch: http://crr.ugent.be/papers/Woordenkennis_van_Nederlanders_en_Vlamingen_anno_2013.pdf

Subjects were confronted with a word list, and had to mark the words they knew. To correct for 'bluffing', the list would also contain nonsense words. Based on the number of nonsense words you flag, a correction can be made for the number of bona fide words you recognized.

The authors used a source list of 53000 word lemmas, and used a computer program to construct plausible sounding nonsense words.

Based on results for 600000 participants (!) they estimate vocabulary sizes to peak around 38000, with interesting effects for age (you keep expanding) and between Netherlandic and Flemish participants.
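The comment does not say exactly how the bluffing correction was computed. One simple possibility, assumed here purely for illustration and not confirmed as the study's method, is to subtract the rate at which nonsense words were marked from the rate at which real words were marked, then scale by the size of the source list. The function name and the example counts are hypothetical.

```python
def corrected_vocabulary_estimate(words_marked, words_shown,
                                  nonwords_marked, nonwords_shown,
                                  source_list_size):
    """Assumed correction: (hit rate - false-alarm rate) scaled to the source list."""
    hit_rate = words_marked / words_shown
    false_alarm_rate = nonwords_marked / nonwords_shown
    # Clamp at zero so heavy bluffing cannot yield a negative estimate
    corrected = max(0.0, hit_rate - false_alarm_rate)
    return corrected * source_list_size

# Hypothetical participant: marks 60 of 80 real words and 2 of 20 nonsense words
estimate = corrected_vocabulary_estimate(60, 80, 2, 20, source_list_size=53_000)
print(f"corrected estimate: {estimate:.0f}")
```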

8. ### John Chew said,

December 13, 2015 @ 12:31 am

A journalist recently asked me how many more words a competitive Scrabble player of various levels might know than a typical native speaker of the English language. I think she thought it would be an easy question with an easy answer.