Word String frequency distributions


Several people have asked me about Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words", Nature Scientific Reports 12/10/2012. The abstract (emphasis added):

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature.

The paper is thought-provoking, and the conclusions definitely merit further exploration. But I feel that the paper as published is guilty of false advertising. As the emphasized material in the abstract indicates, the paper claims to be about the frequency distributions of words in the vocabulary of English and other natural languages. In fact, I'm afraid, it's actually about the frequency distributions of strings in Google's 2009 OCR of printed books — and this, alas, is not the same thing at all.

It's possible that the paper's conclusions also hold for the distributions of words in English and other languages, but it's far from clear that this is true. At a minimum, the paper's quantitative results clearly will not hold for anything that a linguist, lexicographer, or psychologist would want to call "words". Whether the qualitative results hold or not remains to be seen.

In the 20090715 version of the Google dataset that the paper relies on for its analysis of English, there are about 360 billion string tokens in total, representing 7.4 million distinct string types. Of these, 292 billion string tokens are all-alphabetic (81.2%), representing 6.1 million string types (83%); 66 billion string tokens are non-alphabetic (18.3%), representing 560 thousand string types (7.6%); and 1.8 billion string tokens are mixed alphabetic and non-alphabetic (0.5%), representing 693 thousand string types (9.4%). More exactly:

                          Tokens      Types
All              359,675,008,445  7,380,256
Alphabetic       292,060,845,596  6,127,099
Non-alphabetic    65,799,990,649    560,444
Mixed              1,814,172,200    692,713
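
(In case you want to check or extend these tallies, the three-way split is just a matter of classifying each string type as containing only letters, no letters, or a mixture, and summing the counts. Here's a minimal Python sketch; the histogram file name is made up for illustration (the format is the two-column count-plus-string file linked in the note at the end of this post), and Python's isalpha() is only a stand-in for whatever definition of "alphabetic" the original tallies used, so exact totals may differ slightly.)

# Classify each string type as all-alphabetic, non-alphabetic, or mixed,
# and tally token and type counts per class.
# Assumes a whitespace-separated "count string" histogram file (hypothetical name).
from collections import defaultdict

def classify(s):
    letters = [c.isalpha() for c in s]
    if all(letters):
        return "Alphabetic"
    if not any(letters):
        return "Non-alphabetic"
    return "Mixed"

tokens = defaultdict(int)
types = defaultdict(int)
with open("eng-all-1gram-20090715-histogram.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) != 2:
            continue
        count, string = int(parts[0]), parts[1]
        cls = classify(string)
        tokens[cls] += count
        types[cls] += 1

for cls in ("Alphabetic", "Non-alphabetic", "Mixed"):
    print(f"{cls:15s}{tokens[cls]:>18,d} tokens{types[cls]:>12,d} types")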

Why have I referred to "string tokens" and "string types" rather than to "words"? Well, here's a random draw of 10 from that set of 7,380,256 string types (preceded in each case by the count of occurrences of that string type in the collection):

116856 discouragements
2485 NH4CI
42 attorniea
425 PEPPERED
51 prettyboys
191 iillll
506 Mecir
68 inkdrop
3926 LATTIMORE
18631 cart's

Here's another:

174 Pfii
51 Lnodicea
126 almofb
82 0egree
672 Garibaldina
47 excllence
5693 Eoosevelt
118 Dypwick
83 opinion19
65 VVouldst

And another:

55 txtx
218 suniving
91 fn_
48 Kultursysteme
54 notexpected
137 handsoap
46 tornarmene
9551 Rohault
48 Blrnstlngl
150 1037C

As samples of English "words", these are not very persuasive. About half of them are typographical or OCR errors, and of those that are not, many are regularly-derived forms of other words ("cart's", "discouragements") or alphanumeric strings ("1037C"). The fact that Google's collection of such strings exhibits certain statistical properties is interesting, but it's not clear that it's telling us anything much about the English language, rather than about typographical practices and the state of Google's OCR as of 2009.
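
(If you'd like to roll your own draws, they're just uniform random samples over the set of string types, not weighted by token count, with each type's total count printed alongside it. A quick sketch, using the same hypothetical histogram file as in the earlier snippet:)

# Print a uniform random sample of string types, with their total counts,
# drawn from the (hypothetical) two-column histogram file used above.
import random

entries = []
with open("eng-all-1gram-20090715-histogram.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) == 2:
            entries.append((int(parts[0]), parts[1]))

for count, string in random.sample(entries, 10):
    print(count, string)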

Though several of the key results in their paper deal with the full dataset — numbers, OCR errors, and all — the authors do recognize this problem, while (in my opinion) seriously underplaying it:

The word frequency distribution for the rarely used words constituting the “unlimited lexicon” obeys a distinct scaling law, suggesting that rare words belong to a distinct class. This “unlimited lexicon” is populated by highly technical words, new words, numbers, spelling variants of kernel words, and optical character recognition (OCR) errors.

In fact, the order in which they give these categories is rather the reverse of the truth: "highly technical words [and] new words" are radically less common in this dataset than "numbers, spelling variants of kernel words, and OCR errors". Or to put it another way, the majority of the "words" in this list are not words of English at all — not "highly technical words", not "new words", not any kind of words.

The authors propose to remedy the problem this way:

We introduce a pruning method to study the role of infrequent words on the allometric scaling properties of language. By studying progressively smaller sets of the kernel lexicon we can better understand the marginal utility of the core words.

They set their word-count threshold to successively larger powers of two, ending with a threshold of 2^15 = 32,768 as defining the "kernel" or "core" lexicon of English. This reduces the set to 353 billion tokens representing 143 thousand types, of which 286 billion tokens (81%) representing 132 thousand types (92%) are all-alphabetic, 65 billion tokens (18.5%) representing 5 thousand types (4%) are non-alphabetic, and 1.3 billion tokens (0.4%) representing 6 thousand types (4%) are mixed.

                                  Tokens     Types
All ("core")             353,131,491,190   142,700
Alphabetic ("core")      286,352,289,855   131,598
Non-alphabetic ("core")   65,465,833,242     5,360
Mixed ("core")             1,313,368,093     5,742

Do we now have a set representing the vocabulary of English? It's certainly better, as this random draw of 20 suggests:

103971 tardive
34879 Lichnowsky
40031 punctiliously
40543 recumbency
85620 Lupton
50627 GLP
156373 fertilize
54410 Niu
139924 ogre
38535 Burnett's
108385 chymotrypsin
70076 rigueur
680293 staunch
56995 56.6
467320 Habsburg
57726 populists
41164 occu
47133 Scapula
42483 Buhl
242641 Olmsted

But there are still some issues. We still have about 8% (by types) and 19% (by tokens) of numbers, punctuation, and so on. And about half of the higher-frequency string-types in the "kernel" lexicon are proper names — these are certainly part of the language, but it seems likely that their dynamics are quite different from those of the rest of the vocabulary.

And now that most of the egregious typos and OCR errors are out of the way, we need to consider the issue of variant capitalization and regular inflection. Here are the dataset's 22 diverse capitalizations of the inflected forms of the word copy, arranged in increasing order of overall frequency (with 9 frequent enough to make it into the "core" lexicon):

46 copyIng
56 COPiES
83 cOpy
99 CoPY
107 copY
144 COPYing
222 coPy
280 COPy
367 COpy
435 CoPy
484 CopY
4601 COPIED
14651 COPYING
50412 Copied
54338 COPIES
194846 COPY
374302 Copying
545386 Copies
1143116 Copy
2633786 copied
7316913 copies
13920809 copy

Here are the 23 variants of the word break, with 13 in the "core" lexicon:

51 breaKs
73 breaKing
82 BreaK
88 BReak
187 broKe
276 breakIng
356 broKen
512 breaK
14875 BROKE
21849 BREAKS
54462 BREAKING
74913 BROKEN
103631 BREAK
132482 Breaks
149392 Broke
560479 Breaking
629544 Break
740875 Broken
4564785 breaks
8538111 breaking
12330476 broke
19325762 break
19617163 broken

Or the 14 forms of succeed, of which 7 make it into the "core" lexicon:

46 SUCCeed
51 SUCceeded
4897 SUCCEEDING
6800 SUCCEEDS
7026 SUCCEEDED
12619 SUCCEED
22154 Succeeds
43856 Succeeded
67645 Succeed
88912 Succeeding
1556533 succeeds
3466786 succeeding
6919764 succeed
13166229 succeeded
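
(Lists like these are easy to generate: scan the histogram for every string type whose lower-cased form is one of the inflected forms you care about, and sort by count. A sketch, again using the hypothetical histogram file from the earlier snippets:)

# List all case variants of a given set of inflected forms, sorted by count.
FORMS = {"succeed", "succeeds", "succeeded", "succeeding"}

variants = []
with open("eng-all-1gram-20090715-histogram.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) == 2 and parts[1].lower() in FORMS:
            variants.append((int(parts[0]), parts[1]))

for count, string in sorted(variants):
    print(count, string)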

Increasing the threshold successively prunes these lists, obviously. Monocasing everything (which the authors did not do) would reduce the number of all-alphabetic types to 4,472,529 in the "unlimited lexicon" (a reduction of 39.4% from the full type count of 7,380,256), and to 98,087 in the "core lexicon" of items that occur at least 32,768 times (a reduction of 31.3% from the full type count of 142,700). Limiting the list to all-lower-case alphabetic words (which will tend to decrease the number of proper nouns, and which the authors also did not do) reduces the "unlimited lexicon" to 2,642,277 items (down 64.2%), and the "core lexicon" to 67,816 (down 52.5%). But it's not clear which properties of these successively reduced distributions would be due to the nature of the English language, and which would be due to the interaction of typographical practice, OCR performance, and sampling effects.
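
(Both reductions are easy to approximate: fold every all-alphabetic type to lower case and merge the counts, or keep only the types that are already all lower case. A sketch follows; note that the "core" figures will depend somewhat on whether the 32,768 threshold is applied before or after folding, so the numbers above may reflect a slightly different choice than the one made here.)

# Two reductions of the all-alphabetic type inventory:
# (1) monocasing: fold case and merge counts; (2) keep only all-lower-case types.
# Same hypothetical histogram file as in the earlier sketches; the threshold is
# applied to the merged counts, which may not match the post's exact procedure.
from collections import defaultdict

THRESHOLD = 2 ** 15
monocased = defaultdict(int)        # lower-cased form -> summed count
lowercase_only = defaultdict(int)   # all-lower-case form -> count

with open("eng-all-1gram-20090715-histogram.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) != 2:
            continue
        count, string = int(parts[0]), parts[1]
        if not string.isalpha():
            continue                # all-alphabetic types only
        monocased[string.lower()] += count
        if string.islower():
            lowercase_only[string] += count

print("monocased types, unlimited:  ", len(monocased))
print("monocased types, core:       ", sum(1 for c in monocased.values() if c >= THRESHOLD))
print("lower-case types, unlimited: ", len(lowercase_only))
print("lower-case types, core:      ", sum(1 for c in lowercase_only.values() if c >= THRESHOLD))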

And even if we were to limit our attention to alphabetic words, to fold or screen out capital letters, and to restrict item frequencies, we'd still have to disentangle the (sampled) distribution across time of the various inflected forms of a given word from the (sampled) distribution across time of distinct words. The same goes for the development (and its reflection in print) of derived and compound words, and of proper nouns, which probably also differs from that of novel coinages or borrowings.
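
(Folding inflected forms together is a separate and messier step, and there's no theory-neutral way to do it. As one crude illustration, not something I'd propose as a real solution, here's what NLTK's Porter stemmer does with the regular and irregular forms from the tables above: the regular forms of copy and of break each collapse to a single stem, while broke and broken are left alone.)

# A crude illustration of folding inflected forms together: Porter stemming via NLTK.
# Stems are equivalence-class labels (e.g. "copi"), not dictionary words, and
# irregular forms like "broke"/"broken" are not folded at all.
# Requires NLTK: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["copy", "copies", "copied", "copying",
          "break", "breaks", "breaking", "broke", "broken"]:
    print(f"{w:10s}-> {stemmer.stem(w)}")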

So to sum up: The authors' conclusions about the "unlimited lexicon" should be seen as conclusions about the sets of strings that result from typographical practices, Google's OCR performance as of 2009, and Google's tokenization algorithms. Even their conclusions about the "kernel" or "core" lexicon are heavily influenced, and perhaps dominated, by the distribution of proper names and by variant capitalization, as well as by the residue of the issues affecting the "unlimited" lexicon. Questions about the influence of inflectional and derivational variants also remain to be addressed.

Given these problems, the quantitative results cannot be trusted to tell us anything about the nature and growth of natural-language vocabulary. And the qualitative results need to be checked against a much more careful preparation of the underlying data.

It's too bad that the authors (who are all physicists or economists) didn't consult any computational linguists or others with experience in text analysis, and that Nature's reviewers apparently didn't include anyone in this category either.


Note: The underlying data is available here. For convenience, if you'd like to try some alternative models on various subsets or transformations of the "unlimited lexicon", here is a (90-MB) histogram of string-types with their counts from the files

googlebooks-eng-all-1gram-20090715-0.csv
googlebooks-eng-all-1gram-20090715-1.csv
googlebooks-eng-all-1gram-20090715-2.csv
googlebooks-eng-all-1gram-20090715-3.csv
googlebooks-eng-all-1gram-20090715-4.csv
googlebooks-eng-all-1gram-20090715-5.csv
googlebooks-eng-all-1gram-20090715-6.csv
googlebooks-eng-all-1gram-20090715-7.csv
googlebooks-eng-all-1gram-20090715-8.csv
googlebooks-eng-all-1gram-20090715-9.csv
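
(For the record, a histogram like this one can be rebuilt by summing the per-year match counts in those files. A minimal sketch, assuming the 20090715 one-gram layout of tab-separated fields: string, year, match count, page count, volume count. Check that against the dataset's own documentation before relying on it; the output is the two-column count-plus-string file assumed in the earlier snippets.)

# Sum per-year match counts from the 20090715 1-gram CSVs into a single histogram.
# Assumed row layout (verify against the dataset documentation):
#   ngram <TAB> year <TAB> match_count <TAB> page_count <TAB> volume_count
from collections import defaultdict
import glob

totals = defaultdict(int)
for path in sorted(glob.glob("googlebooks-eng-all-1gram-20090715-*.csv")):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 5:
                continue                      # skip malformed rows
            try:
                totals[fields[0]] += int(fields[2])
            except ValueError:
                continue
    # (These files are large, so this loop takes a while; no external libraries needed.)

with open("eng-all-1gram-20090715-histogram.txt", "w", encoding="utf-8") as out:
    for ngram, count in totals.items():
        out.write(f"{count} {ngram}\n")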

According to the cited paper, the authors accessed the data on 14 January 2011, which means that this is the version they worked from for English. A newer and larger English version (Version 20120701) is now available — at some point I'll post about the properties of the new dataset…



13 Comments

  1. Daniel Ezra Johnson said,

    February 3, 2013 @ 10:04 am

    It's too bad that the authors (who are all [non-linguists]) didn't consult any [type of] linguists or others with experience in [language] analysis, and that [science journal]'s reviewers apparently didn't include anyone in this category either.

    Too bad!

  2. michael ramscar said,

    February 3, 2013 @ 10:45 am

    "The kink in the slope preceding the entry into the unlimited lexicon is a likely consequence of the limits of human mental ability that force the individual to optimize the usage of frequently used words and forget specialized words that are seldom used."

    if i understand this correctly, the reason no-one uses "omnibus" and "charabanc" any more is because they have been thrown overboard to keep our otherwise overladen brains on an even keel.

    am i alone in having problems emptying my mental wastebasket?

    [(myl) The authors are not claiming that less-frequent words are forgotten, just that they fall into a different usage regime. That might be true, and it might also have a psychological explanation of the sort that they propose. But I'd like to explore the possibility that the effect has something to do with a very different sort of thing, namely the way that highly-skewed distributions of typos, OCR errors, and weird capitalizations are sampled across different underlying frequency ranges of actual words. For example, the "kink" might reflect a transition from a regime in which common typographical variants are relatively fully represented, to one in which they increasingly drop below the sampling radar, so to speak.]

  3. michael ramscar said,

    February 3, 2013 @ 12:22 pm

    @ myl understood. was just a weak attempt at humour: the few bits of psychology in the paper are as hokey as the linguistics, imho.

    another point in support of your analysis comes from the use of language models in ocr: won't the sparsity of the data used to train these models inevitably mean that very low frequency words will be far more likely to be mangled during ocr than high frequency words? if so, this will further bias the distribution of manglings in support of the authors' thesis.

    another thing that seems to make matters worse in this regard is that language models are usually trained on largely contemporary samples. this could further bias the data here, because it might affect the frequency with which words that have declined in frequency will be misidentified as their now more frequent competitors. for example, legend has it that "boredom" was coined by dickens in the mid 19th century, yet the google books corpus attests to numerous instances that predate "hard times" — until you eyeball the texts in which they occur, anyway, when it becomes clear that the google books ocr algorithm has a problem with "whoredom."

  4. Jon said,

    February 3, 2013 @ 1:32 pm

    I've often been puzzled by how bad Google Books and Internet Archive's OCR efforts are. There is so much background information that could be taken into account, such as that most pages of a book have a common layout, with consecutive page numbers, running heads, same typeface, margins, words from a dictionary, etc. Instead it seems as though each character (or inkspot) is interpreted separately, potentially being any character in any typeface, any size, upper or lower case.
    It's not as though Google was short of technical expertise or cash, and once a book has been scanned (the time-consuming part), the images could be run through improved software at any time.

    [(myl) I believe that the 2012 release of the Google ngram datasets has improved OCR as well as other enhancements. And the Internet Archive's OCR, though still often pretty dreadful, is much less dreadful than it was a few years ago.

    If I had to guess why relatively dumb OCR techniques have been used, compared to what might in principle be done at any given point in time, it would be that across the wide range of documents involved, it's hard to do something smarter whose assumptions don't fail spectacularly in some subset of cases. But then again, it might just be a question of doing the simplest thing first and hoping for the best…]

  5. D.O. said,

    February 3, 2013 @ 1:45 pm

    Even if all OCR and related problems are solved, it is not clear what exactly we learn from a study with a generalization like this. It is not a criticism! If there is an interesting fact, we can learn a lot by figuring out where it comes from.
    The rare words might be of 2 types: one from the general vocabulary, which almost any speaker may occasionally use, and another from specialized vocabulary, which is frequently used by a relatively small set of speakers. Here is one example. Commotion and tibia are not actually that rare, but their frequencies are around 2e-6 and the study shows a transition at about 1e-5, so their use seems legit.

    How do I know that their distributions are very different? Well, apart from somewhat vague personal experience, I can see that the time series for commotion is much smoother than that for tibia, strongly indicating that the latter comes from a much narrower set of sources. Actually, in 1948 tibia's frequency jumped ~2.5 times up from the 1947 number and then dropped back almost by the same factor. What happened in 1948?

    [(myl) There are many more than two types of wordforms involved here, typography and OCR mistakes aside — regular inflections of a given lexeme with widely varying frequencies; regular morphological derivations, standard and neologistic; compounds written solid; proper nouns; terms of art in more or less arcane disciplines; foreign words used in quotations or quasi-borrowed; and so on. Some of these categories probably have interestingly different synchronic and diachronic usage profiles; but there might also be interesting emergent properties of word- (or string-) distributions when they're all added up willy nilly.]

  6. Joe said,

    February 3, 2013 @ 1:50 pm

    I found this to be interesting:

    Received 16 October 2012 | Accepted 24 October 2012 | Published 10 December 2012

    Am I correct in understanding that it took 8 days for it to be accepted? Is that common for Nature?

    [(myl) Nature's "Scientific Reports" boasts of its rapid turnaround:

    Hosted on nature.com — the home of over 80 journals published by Nature Publishing Group and the destination for millions of scientists globally every month — Scientific Reports is open to all, publishing technically sound, original research papers of interest to specialists within their field, without barriers to access.

    Scientific Reports is committed to providing an efficient service for both authors and readers, and exists to facilitate the rapid peer review and publication of research. With the support of an external Editorial Board and a streamlined peer-review system, all papers are rapidly and fairly peer reviewed to ensure they are technically sound. An internal publishing team works with the board, and accepted authors, to ensure manuscripts are processed for publication as quickly as possible.

    However, this one seems to have been swifter than usual — according to the current statistics, the mean time to publication for papers published in November 2012 was 110 days. In this case, from 10/24 to 12/10 was 47 days.

    FWIW, I think this kind of velocity (as well as the open access) is a terrific idea. Frankly, the quality of the reviewing for language-related papers in Science and Nature has been notoriously poor anyhow. And in this case, I think the paper should have been published, but the authors should have been encouraged to tone down the level of generalization a bit, and to note that the application of their models to word-usage in natural language really remains to be evaluated.]

  7. peter said,

    February 3, 2013 @ 2:58 pm

    I am a computer scientist who has on several occasions been asked to review papers submitted to Nature, and have refused each such request. The required turn-around time for the review has always been extremely short – typically 48 or 72 hours (compared with the 6-12 weeks more common for CS journals, or the 4-6 weeks common for major refereed conferences). 72 hours is not sufficient time to assess or even reflect on new ideas, nor to check technical details, proofs or references properly. In addition, IMHE, anyone with any disciplinary competence will already have other tasks allocated to the next 72 hours. I thus fear Nature's typical reviewers are people with time on their hands, rather than disciplinary experts.

  8. J.W. Brewer said,

    February 3, 2013 @ 2:58 pm

    No instances at all in the dataset of either "copied" or "Copied," only "COPIED"? Despite the fact that, e.g., "copying" is much more common than "Copying," which in turn is much more common than "COPYING"?

    [(myl) Sharp eyes. Some kind of editing error on my part left out

    50412 Copied
    2633786 copied

    I think the list is complete now…]

  9. D.O. said,

    February 3, 2013 @ 3:19 pm

    By the way, the specialized vocabulary hypothesis can be evaluated by restricting the corpus to fiction. I don't think Google made the necessary data available, though…

  10. Daniel Ezra Johnson said,

    February 3, 2013 @ 3:33 pm

    The "subject terms" applied to this piece were:

    Applied physics [sic]
    Information theory and computation
    Statistical physics, thermodynamics and nonlinear dynamics [sic]
    Sustainability [sic]

    Do we suspect these were the authors' choices or the editors'?
    There are linguists with time on their hands who could have made reasonable commentary?
    I'd bet no linguist was even asked.

    Maybe as Mark says, this article was worth publishing, but we've all seen several high profile examples of historical/geographical linguistics published by e.g. biologists that every linguist has denounced.

    Sometimes they get to do so in a special issue of a different journal, but even then it's not clear that everybody wins.

    The general public believes nonsense about language. Journals like Nature seem to want to ensure that scientists in other fields do too.

    A linguist couldn't publish some research on genetics (maybe about words and sentences within DNA?), get it reviewed by linguists, and have it published in Science. Could we?

  11. benjamin börschinger said,

    February 3, 2013 @ 6:07 pm

    have I just missed it, or has nobody mentioned Joshua Goodman's comment on "Language Trees and Zipping" so far?
    http://arxiv.org/abs/cond-mat/0202383

  12. On Words, Strings, and Co-Occurrence Studies | IR Thoughts said,

    August 2, 2013 @ 8:38 am

    […] String Frequency Distributions, Mark Liberman blogs about the flaws involved when co-occurrence studies are reported without […]

  13. David Sarokin said,

    July 8, 2014 @ 9:32 am

    Hello all. I've been searching this forum for posts about the frequency of phrases rather than words, but to no avail. Ditto on the Web as a whole — there's a lot about letter frequency and word frequency, but nada on the most common phrases, or even the most common N-grams (not quite the same thing in my mind).

    I've written a brief article of my own identifying the most common phrases in the English language. I'd love to get some feedback on it, as to whether it seems a legitimate list or not (I'm an Ngram novice, just trying to make my way through unfamiliar sets of data).

    Anyway…the article link is below (I hope posting it is OK with this site's policies, but if not, please deactivate it). Any thoughts would be most appreciated:

    http://quezi.com/19775
    What is the Most Common Phrase in the English Language?
