Several people have asked me about Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words", Nature Scientific Reports 12/10/2012. The abstract (emphasis added):
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature.
The paper is thought-provoking, and the conclusions definitely merit further exploration. But I feel that the paper as published is guilty of false advertising. As the emphasized material in the abstract indicates, the paper claims to be about the frequency distributions of words in the vocabulary of English and other natural languages. In fact, I'm afraid, it's actually about the frequency distributions of strings in Google's 2009 OCR of printed books — and this, alas, is not the same thing at all.
It's possible that the paper's conclusions also hold for the distributions of words in English and other languages, but it's far from clear that this is true. At a minimum, the paper's quantitative results clearly will not hold for anything that a linguist, lexicographer, or psychologist would want to call "words". Whether the qualitative results hold or not remains to be seen.
In the 20090715 version of the Google dataset that the paper relies on for its analysis of English, there are about 360 billion string tokens in total, representing 7.4 million distinct string types. Of these, 292 billion string tokens are all-alphabetic (81.2%), representing 6.1 million string types (83%). 66 billion string tokens are non-alphabetic (18.3%), representing 560 thousand string types (7.6%). And 1.8 billion string tokens are mixed alphabetic and non-alphabetic (0.5%), representing 692,713 types (9.4%). More exactly:
Why have I referred to "string tokens" and "string types" rather than to "words"? Well, here's a random draw of 10 from that set of 7,380,256 string types (preceded in each case by the count of occurrences of that string type in the collection):
As samples of English "words", these are not very persuasive. About half of them are typographical or OCR errors, and of those that are not, many are regularly-derived forms of other words ("cart's", "discouragements") or numeric strings ("1037C"). The fact that Google's collection of such strings exhibits certain statistical properties is interesting, but it's not clear that it's telling us anything much about the English language, rather than about typographical practices and the state of Google's OCR as of 2009.
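The three-way breakdown above (all-alphabetic, non-alphabetic, mixed) is easy to reproduce for any list of string types. Here's a minimal sketch; the classifier is my own, and the real details of Google's tokenization and Unicode handling are of course Google's:

```python
def classify(s):
    """Classify a string type as all-alphabetic, non-alphabetic, or mixed,
    mirroring the three-way breakdown of the Google dataset."""
    has_alpha = any(c.isalpha() for c in s)
    has_other = any(not c.isalpha() for c in s)
    if has_alpha and has_other:
        return "mixed"
    return "all-alphabetic" if has_alpha else "non-alphabetic"
```

So `classify("discouragements")` is all-alphabetic, `classify("1842")` is non-alphabetic, and both `classify("1037C")` and `classify("cart's")` come out mixed (the apostrophe counts as a non-alphabetic character here, which is one of several judgment calls any such classifier has to make).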
Though several of the key results in their paper deal with the full dataset — numbers, OCR errors, and all — the authors do recognize this problem, while (in my opinion) seriously underplaying it:
The word frequency distribution for the rarely used words constituting the “unlimited lexicon” obeys a distinct scaling law, suggesting that rare words belong to a distinct class. This “unlimited lexicon” is populated by highly technical words, new words, numbers, spelling variants of kernel words, and optical character recognition (OCR) errors.
In fact, the order in which they give these categories is rather the reverse of the truth: "highly technical words [and] new words" are radically less common in this dataset than "numbers, spelling variants of kernel words, and OCR errors". Or to put it another way, the majority of the "words" in this list are not words of English at all — not "highly technical words", not "new words", not any kind of words.
The authors propose to remedy the problem this way:
We introduce a pruning method to study the role of infrequent words on the allometric scaling properties of language. By studying progressively smaller sets of the kernel lexicon we can better understand the marginal utility of the core words.
They set their word-count threshold to successively larger powers of two, ending with a threshold of 2^15 = 32768 as defining the "kernel" or "core" lexicon of English. This reduces the set to 353 billion tokens representing 143 thousand types, of which 286 billion tokens (81%) representing 132 thousand types (92%) are all-alphabetic, 65 billion tokens (18.6%) representing 5 thousand types (4%) are non-alphabetic, and 1.3 billion tokens (0.4%) representing 5 thousand types (4%) are mixed.
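The pruning step itself is trivial to sketch. Assuming the dataset has been reduced to a dictionary mapping string types to counts (a representation I'm supplying for illustration, not the authors' code):

```python
def prune(hist, threshold=2**15):
    """Keep only string types occurring at least `threshold` times."""
    return {s: c for s, c in hist.items() if c >= threshold}

# The paper's procedure, roughly: examine progressively smaller
# "kernel" lexicons at successively doubled thresholds, e.g.
#   for k in range(16):
#       kernel = prune(hist, 2**k)
```

The final threshold of 2^15 is what yields the 143-thousand-type "core" lexicon described above.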
Do we now have a set representing the vocabulary of English? It's certainly better, as this random draw of 20 suggests:
But there are still some issues. We still have about 8% (by types) and 19% (by tokens) of numbers, punctuation, and so on. And about half of the higher-frequency string-types in the "kernel" lexicon are proper names — these are certainly part of the language, but it seems likely that their dynamics are quite different from those of the rest of the vocabulary.
And now that most of the egregious typos and OCR errors are out of the way, we need to consider the issue of variant capitalization and regular inflection. Here are the dataset's 22 diverse capitalizations of the inflected forms of the word copy, arranged in increasing order of overall frequency (with 9 frequent enough to make it into the "core" lexicon):
Here are the 23 variants of the word break, with 13 in the "core" lexicon:
Or the 14 forms of succeed, of which 7 make it into the "core" lexicon:
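Groupings like these can be computed mechanically by case-folding each string type and checking it against a list of inflected forms. A sketch, with hypothetical counts standing in for the real dataset's:

```python
from collections import defaultdict

def case_variant_groups(hist, forms):
    """Group the counts of every capitalization of the given
    (lower-cased) inflected forms of a word."""
    groups = defaultdict(dict)
    for s, c in hist.items():
        key = s.lower()
        if key in forms:
            groups[key][s] = c
    return dict(groups)

# Illustrative counts only, not the dataset's actual figures:
hist = {"copy": 9000, "Copy": 400, "COPY": 50,
        "copies": 3000, "Copies": 120}
groups = case_variant_groups(hist, {"copy", "copies", "copied", "copying"})
```

Each value in `groups` is then a mini-histogram of the capitalizations of one inflected form, which is essentially what the copy/break/succeed lists above display.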
Increasing the threshold successively prunes these lists, obviously. Monocasing everything (which the authors did not do) would reduce the number of all-alphabetic types to 4,472,529 in the "unlimited lexicon" (a reduction of 39.4% from the full type count of 7,380,256), and to 98,087 in the "core lexicon" of items that occur at least 32,768 times (a reduction of 31.3% from the full type count of 142,700). Limiting the list to all-lower-case alphabetic words (which will tend to decrease the number of proper nouns, and which the authors also did not do) reduces the "unlimited lexicon" to 2,642,277 items (down 64.2%), and the "core lexicon" to 67,816 (down 52.5%). But it's not clear which properties of these successively reduced distributions would be due to the nature of the English language, and which would be due to the interaction of typographical practice, OCR performance, and sampling effects.
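The two reductions just described (monocasing, and restricting to all-lower-case alphabetic strings) can be sketched as follows, again assuming a type-to-count dictionary of my own devising:

```python
from collections import defaultdict

def monocase(hist):
    """Fold all capitalizations of a string type into one
    lower-case type, summing their counts."""
    folded = defaultdict(int)
    for s, c in hist.items():
        folded[s.lower()] += c
    return dict(folded)

def lowercase_alpha(hist):
    """Keep only all-lower-case, all-alphabetic string types
    (which tends to screen out proper nouns, among other things)."""
    return {s: c for s, c in hist.items() if s.isalpha() and s.islower()}
```

Applied to the full dataset, the first operation is what takes the "unlimited lexicon" from 7,380,256 all-alphabetic types down to 4,472,529, and the second is what takes it down to 2,642,277.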
And even if we were to limit our attention to alphabetic words, to fold or screen out capital letters, and to restrict item frequencies, we'd still be left to disentangle the (sampled) distribution across time of the various inflected forms of words from the (sampled) distribution across time of distinct words. The same goes for the development (and its reflection in print) of derived and compound words, and of proper nouns, whose dynamics probably also differ from those of novel coinages or borrowings.
So to sum up: The authors' conclusions about the "unlimited lexicon" should be seen as conclusions about the sets of strings that result from typographical practices, Google's OCR performance as of 2009, and Google's tokenization algorithms. Even their conclusions about the "kernel" or "core" lexicon are heavily influenced, and perhaps dominated, by the distribution of proper names and by variant capitalization, as well as by the residue of the issues affecting the "unlimited" lexicon. Questions about the influence of inflectional and derivational variants also remain to be addressed.
Given these problems, the quantitative results cannot be trusted to tell us anything about the nature and growth of natural-language vocabulary. And the qualitative results need to be checked against a much more careful preparation of the underlying data.
It's too bad that the authors (who are all physicists or economists) didn't consult any computational linguists or others with experience in text analysis, and that Nature's reviewers apparently didn't include anyone in this category either.
Note: The underlying data is available here. For convenience, if you'd like to try some alternative models on various subsets or transformations of the "unlimited lexicon", here is a (90-MB) histogram of string-types with their counts from the files
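If you do download that histogram, a parser along these lines may be handy. I'm assuming one count-then-string pair per tab-separated line; adjust the split if the file's actual layout differs:

```python
def read_histogram(lines):
    """Parse lines of the (assumed) form '<count>\t<string-type>'
    into a dict mapping string types to counts."""
    hist = {}
    for line in lines:
        count, s = line.rstrip("\n").split("\t", 1)
        hist[s] = int(count)
    return hist

# e.g.:
#   with open("histogram.txt", encoding="utf-8") as f:
#       hist = read_histogram(f)
```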
According to the cited paper, the authors accessed the data on 14 January 2011, which means that this is the version they worked from for English. A newer and larger English version (Version 20120701) is now available — at some point I'll post about the properties of the new dataset…