Dictionary-sampling estimates of vocabulary knowledge: No Zipf problems


Yesterday I explained why the long-tailed ("Zipf's Law") distribution of word frequencies makes it almost impossible to estimate vocabulary size by counting word types in samples of writing or speaking ("Why estimating vocabulary size by counting words is (nearly) impossible"). In a comment on that post, "flow" suggested that similar problems might afflict attempts to estimate vocabulary size by checking someone's knowledge of random samples from a dictionary.

But in fact this worry is groundless. There are many problems with the method — especially defining the list to sample from, and defining what counts as "knowing" an item in the sample — but the nature of word-frequency distributions is not one of them.

We could prove this as a theorem, but it will probably be clearer to most people if we run a simulation. So this R script does the following (a sketch of such a script is given just after the list):

(1) Assume a dictionary with NWords=100000 entries;
(2) Assume a Zipfian distribution such that the Nth commonest word has probability proportional to \(1/N^k\);
(3) Simulate a "known" subset by selecting those words that appear in a randomly-generated text of NSample=6000000 items;
(4) Pick NTest=500 dictionary items at random (with uniform probability across the NWords entries);
(5) Determine what fraction of those NTest items are "known" according to (3);
(6) Iterate (4)-(5) ntests=100 times;
(7) Compare the mean proportion "known" in the sample tests to the true underlying proportion of "known" words.
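
The script itself is not reproduced here, so the following is a minimal sketch of how steps (1)-(7) might be implemented. The Zipf exponent k is an assumption (its value isn't given above); it determines how much of the dictionary ends up "known" — with small exponents nearly every word shows up at least once in 6,000,000 tokens — but the comparison in step (7) comes out the same way regardless.

# KnowCheck.R (sketch) -- illustrates steps (1)-(7); the exponent k is assumed
NWords <- 100000        # (1) dictionary size
k <- 1.5                # (2) Zipf exponent (not given in the post)
NSample <- 6000000      # (3) length of the simulated "experience" text
NTest <- 500            # (4) dictionary items per test
ntests <- 100           # (6) number of tests

# (2) Zipfian probabilities: the Nth commonest word has probability ~ 1/N^k
zipf <- 1/(1:NWords)^k
zipf <- zipf/sum(zipf)

# (3) "known" words = types occurring at least once in a random text of NSample tokens
text <- sample(1:NWords, NSample, replace=TRUE, prob=zipf)
known <- rep(FALSE, NWords)
known[unique(text)] <- TRUE
cat(sprintf("%d-word vocabulary: %g of %d-word dictionary\n",
            sum(known), sum(known)/NWords, NWords))

# (4)-(6) repeatedly test NTest dictionary items sampled with uniform probability
hits <- replicate(ntests, sum(known[sample(1:NWords, NTest)]))

# (7) compare the sample-based estimate to the true proportion
cat(sprintf("Mean %g known out of %d tested in %d trials\n",
            mean(hits), NTest, ntests))
cat(sprintf("Mean Proportion known is %g, stdev %g\n",
            mean(hits/NTest), sd(hits/NTest)))

Whatever the exponent, the mean proportion "known" across the test samples tracks the true proportion reported in the first line of output.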

Here's the result of a few runs of the script:

> source("KnowCheck.R")
 32313-word vocabulary: 0.32313 of 100000-word dictionary
 Mean 163.7 known out of 500 tested in 100 trials
 Mean Proportion known is 0.3274, stdev 0.0219688
 > source("KnowCheck.R")
 32120-word vocabulary: 0.3212 of 100000-word dictionary
 Mean 160.7 known out of 500 tested in 100 trials
 Mean Proportion known is 0.3214, stdev 0.0187757
 > source("KnowCheck.R")
 32216-word vocabulary: 0.32216 of 100000-word dictionary
 Mean 161.7 known out of 500 tested in 100 trials
 Mean Proportion known is 0.3234, stdev 0.020238
 > source("KnowCheck.R")
 32402-word vocabulary: 0.32402 of 100000-word dictionary
 Mean 161.48 known out of 500 tested in 100 trials
 Mean Proportion known is 0.32296, stdev 0.021834

You can try changing the script parameters to convince yourself that the method continues to work.

9 Comments

  1. Jerry Friedman said,

    December 9, 2015 @ 1:27 pm

    The actual problem related to dictionary size is that too small a dictionary might not contain all the words the person knows. The extreme case is that 10,000-word dictionary flow imagined, with which one would clearly have trouble telling the difference between a 40,000-word and a 60,000-word vocabulary. I'm less sure of this one—is it correct to say that the bigger the dictionary, the more words the person will have to try to get a reasonable-sized sample of known words?

  2. flow said,

    December 9, 2015 @ 1:39 pm

    Great, thanks for answering this question at such length!

    Would you care to hint at the reason that this works? I can think of two candidates: (1) The sampling method works with any distribution because each random subset, if it is big enough, is likely to be representative of the whole; (2) The sampling method works because of some special property of the Zipf distribution.

    I'd put my money on #1.

    [(myl) If we want to know "what proportion of Xs have property P?", and we have a method to choose Xs at random with uniform probability across the set of all possible Xs, and a reliable method to test selections for property P, then testing a sample of Xs for P will give us an unbiased estimate for the proportion of Xs that are P in the overall population, regardless of how likely different Xs might be to come up in practice. (Because we didn't ask "what proportion of the Xs that come up in practice have property P?" Though we could also use a sampling approach to estimate this, as long as our sampling method reflects the distribution that we care about! Though note that the first version of the question offers a sort of answer to the question "How many words does so-and-so know?", while the second version doesn't do so, at least not directly.)]
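
    To make that concrete, here is a small illustration (a sketch, not from the original post): take two "known" subsets of the same size — 30,000 words, an arbitrary figure — one made up of the commonest words and one chosen at random, and estimate each by uniform sampling from the dictionary.

    # Two hypothetical "known" subsets of the same size: one frequency-biased
    # (the 30,000 commonest words, i.e. ranks 1-30000), one chosen at random.
    NWords <- 100000
    NTest <- 500
    ntests <- 100
    known_frequent <- c(rep(TRUE, 30000), rep(FALSE, 70000))
    known_random <- sample(known_frequent)   # same 30% "known", scattered at random

    # Uniform sampling from the dictionary estimates the proportion "known"
    # without caring how the subset was generated.
    estimate <- function(known)
      mean(replicate(ntests, mean(known[sample(NWords, NTest)])))
    estimate(known_frequent)   # both come out near the true value, 0.3
    estimate(known_random)

    Both estimates land near 0.3, which is flow's option (1): a big enough uniform sample is representative of the whole list, no matter how the "known" subset came to be.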

  3. D.O. said,

    December 9, 2015 @ 11:09 pm

    The method works because in this example "known" words are about 1/3 of all the words. Here's a very rough mathematical estimate. If you have a Bernoulli success probability p and take a sample of size n, the expected number of successes will be np and the standard deviation sqrt(np(1-p)). If p is small, 1-p ~ 1, and the relative error of the estimate will be about sqrt(np)/np = 1/sqrt(np). In other words, you have to monitor the number of successes. In Prof. Liberman's example they are around 160, which means the relative error of the estimate should be around 1/sqrt(160)~0.08, and the numerical estimates give 0.02/0.32~0.06, which fits, given that p is not small in this case and the full formula includes a factor of sqrt(1-p)~sqrt(2/3)~0.8. If, by contrast, there are 30k "known" words, the dictionary size is 300k, and you select only 100 test words, things will start to become dicey.

    [(myl) Yes, there's the usual problem with error bars in estimating small percentages (of success or failure). If the true population proportion is one in 500, then an estimate based on 500 Bernoulli trials will have an uncertainty that's larger than the estimate itself — we're just about as likely to get no hits as one hit, etc.:

    > dbinom(c(0,1,2,3,4),size=500,prob=(1/500))
    [1] 0.36751125 0.36824775 0.18412388 0.06125163 0.01525153

    But this problem depends only on the true proportion of "known" words in the set, not on the expected frequency distribution over the (known or unknown) words in usage; a uniform distribution of word frequencies there would pose the same issues as a heavy-tailed distribution.

    And although the uncertainty of the estimate may be large in proportional terms, the range of values for a resulting estimate of vocabulary size is not necessarily problematic in absolute terms, depending on our goals.]
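
    For concreteness, D.O.'s back-of-the-envelope arithmetic can be checked directly; the snippet below (not part of the original script) just plugs in p ~ 0.32 and n = 500 from the runs above.

    # Standard error of an estimated proportion from n Bernoulli trials with
    # success probability p is sqrt(p*(1-p)/n); dividing by p gives the relative error.
    n <- 500
    p <- 0.32
    sqrt(p*(1-p)/n)      # absolute standard error of the proportion, ~0.021
    sqrt((1-p)/(n*p))    # relative error, ~0.065 -- close to D.O.'s 0.02/0.32 ~ 0.06
    1/sqrt(n*p)          # the small-p approximation 1/sqrt(np), ~0.08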

  4. D.O. said,

    December 10, 2015 @ 1:56 am

    Prof. Liberman, I completely agree with what you've written, but want to add that the fat-tailed distribution tends to contribute to the existence of unabridged dictionaries filled with words that hardly anyone knows.

  5. Leonid Boytsov said,

    December 10, 2015 @ 2:45 am

    Nice demonstration! My two cents here:
    1) The fact that the dictionary is sampled via the Zipfian distribution seems to be irrelevant here (and confusing).

    [(myl) The idea is to choose a subset of "known words" which is oriented towards words that are likely to have been heard or read more frequently, rather than a subset chosen with equal probabilities across the dictionary. That was the whole point — if I hadn't done that, it would not have been a test of flow's proposal.]
    2) While the estimate is unbiased, we should worry about variance as well. In particular, if the known vocabulary is a very small subset or the number of trials is too small, the estimate may become inaccurate.
    [(myl) True enough, but not relevant to the original question.]

  6. Robot Therapist said,

    December 10, 2015 @ 4:57 am

    @D.O. "…unabridged dictionaries filled with words that hardly anyone knows."
    and which are flagged as having been originated by Shakespeare or Spenser.
    If it had been anyone else, we'd have called it an error, but because it was one of them, it's a new word.

  7. Jerry Friedman said,

    December 10, 2015 @ 10:51 am

    D.O.: Thanks for answering my question. I thought there was going to be a square root of the number of successes there, but I wasn't sure.

    The first time I saw the dictionary-sampling method suggested, the recommendation was to use the biggest dictionary available, but it seems that might be a bad idea, since it makes the test longer, and possibly more frustrating and unpleasant to the person being tested.

    A question I'm idly wondering about is: if a dictionary is smaller than one's vocabulary, say a dictionary for children, how many words in it is one likely not to know, as a function of the size of the dictionary and that of one's vocabulary? This might have something to do with long tails.

  8. Athanassios Protopapas said,

    December 11, 2015 @ 4:18 am

    So, you are saying that the following claim (from Ramscar et al., 2014) is incorrect:
    "…the distribution of word-types in language ensures both that adult vocabularies overwhelmingly (and increasingly) comprise low-frequency types, and that an individual’s knowledge of one randomly sampled low-frequency type is not predictive of his or her knowledge of any other randomly sampled low-frequency type. This makes the reliable estimation of vocabulary sizes from small samples mathematically impossible (Baayen, 2001)."

    Perhaps one's vocabulary is not really a random sample of the dictionary? Your step #3 seems to rest on somewhat shaky ground. Perhaps the "known" subset would be unlikely to include all words that have been encountered only once; and perhaps the words one encounters more than once are not randomly sampled but related to one's specific experience, which cannot be estimated by randomly sampling from the dictionary.

    [(myl) No, I entirely agree with the quote from Ramscar et al.

    My point was just that power-law (or other heavy-tailed) distributions of word frequencies don't prevent sampling methods from being used to estimate someone's percent "knowledge" of a given large word list. As I observed, there are two sticky problems independent of this: what should be on the list? and how to define and test "knowledge"?

    The point that Ramscar (and Baayen) made has to do mainly with the problem of delimiting the word list to be tested. Another expression of the same problems can be found in this quote from the Oxford Dictionaries site:

    How many words are there in the English language? There is no single sensible answer to this question. It's impossible to count the number of words in a language, because it's so hard to decide what actually counts as a word. Is dog one word, or two (a noun meaning 'a kind of animal', and a verb meaning 'to follow persistently')? If we count it as two, then do we count inflections separately too (e.g. dogs = plural noun, dogs = present tense of the verb)? Is dog-tired a word, or just two other words joined together? Is hot dog really two words, since it might also be written as hot-dog or even hotdog?

    It's also difficult to decide what counts as 'English'. What about medical and scientific terms? Latin words used in law, French words used in cooking, German words used in academic writing, Japanese words used in martial arts? Do you count Scots dialect? Teenage slang? Abbreviations?

    Or chemical names or drug names. Or, as I noted, idiomatic compounds and phrases, and names of companies and villages and bands and movies and actors and songs and writers and random people… The potential size of the full list is very large and not cleanly bounded — and it's likely that any individual's percent knowledge of the full list will be small, creating the confidence-interval problems discussed above.

    But this doesn't stop us from taking a given list — say the ~ 180,000 entries and subentries in the OED, or the 165,000 entries in the MW Collegiate, perhaps supplemented with some appropriate lists of proper nouns — and asking (relative to some way of operationalizing "knowledge") what proportion of that list some given person "knows". And this gives us some sort of lower bound, which is surely going to be much greater than 8,000 for most people.]

  9. Athanassios Protopapas said,

    December 11, 2015 @ 4:39 am

    Thank you for your reply. I was under the impression that their point was not based on the difficulty of defining what counts as a word but on the uncorrelated probability of knowing particular words of very low frequency. Your simulation seems to concern "estimation of vocabulary sizes from small samples," which, according to this quote, is supposed to be "mathematically impossible." It seems to me that the conflict arises because of your assumptions in step #3. Did I completely miss your (or their) point?

    [(myl) I think you may have misunderstood the relationship.

    My step (3)

    (3) Simulate a "known" subset by selecting those words that appear in a randomly-generated text of NSample=6000000 items;

    is just a method of simulating the process whereby a hypothetical person might learn some fraction of a large power-law-distributed word list. This was one stage in a program aimed at checking the accuracy of sampling methods for estimating percent knowledge of a word list, in a case where we can know the true answer. And the point was to show that there's no difficulty, intrinsic to that problem, created by the fact of power-law distributions in usage.

    Ramscar and Baayen et al. are concerned with a somewhat different question, namely how to deal with the essentially unbounded character of the things that might count as "words" in a large linguistic community that's geographically, socially, and culturally diverse. If we design a plausible "word" list, any random individual is likely to know several thousand "words" that aren't on it; and even if we somehow determine what those are, and add them to the list, a second random individual is likely to know several thousand more. Since new words are invented and borrowed all the time, it's not clear that this process would ever converge.

    But none of this prevents us from estimating a reasonably accurate answer to the question "What proportion of the items in (the specific finite) list L does person P know?" ]
