Lexical limits

C. C. Cheng, emeritus professor of computational linguistics at the University of Illinois, estimates that the human lexicon has a de facto storage limit of 8,000 lexical items (referred to in n. 12 on p. 301 of Jerry Packard's The Morphology of Chinese: A Linguistic and Cognitive Approach [Cambridge University Press, 2000]).

I think that I am familiar with far more words than that.  For example, if I look through a medium-sized dictionary of about 60,000 words, I seem to recognize most of them.  On the other hand, there must be a great difference between active and passive vocabulary.

Janet Elder, Entryways into College Reading and Learning, chapter 3, states:

It is difficult to measure vocabulary size accurately. Total vocabulary size varies greatly from person to person, but people typically use about 5,000 words in their speech and about twice that many in their writing. A college-educated speaker of English could have a vocabulary as large as 80,000 words. Shakespeare, whose body of work is considered the greatest in English literature, used more than 33,000 words in his plays. This is an astonishing number, especially considering that he was writing 400 years ago.

The spelling bee champions, whom we've often featured on Language Log, have to be able to spell more than 8,000 words, I should think, because the humdingers they win on are usually extremely rare, very low-frequency items.  Maybe they only know how to spell many thousands of words, without much concern for what they mean or how to use them.

"Spelling bee champs " (6/1/14), with references to earlier posts on the subject.

The reigning champion of Francophone Scrabble cannot speak French, but he is able to recall and correctly spell a vast number of French words (without even knowing what most of them mean).

"Il ne parle pas français" (7/23/15)

C. C. Cheng, whom I cited at the beginning of this post, also holds that the human cognitive capacity for Chinese characters is around 3,000 to 3,500 characters.  If true, and combined with his claims about lexical limits mentioned above (around 8,000 items), that would average out to about 2.5 words for each character.  That seems reasonable to me and, coincidentally, is fairly close to the average length of a word in Mandarin, which is just slightly under two syllables.

I would be grateful for references to any studies that have been done on the limits to (or norms for) the human lexicon, whether due to cognitive capacity or brain size.



28 Comments

  1. Stephan Stiller said,

    December 5, 2015 @ 1:19 am

    The official HSK vocabulary list uses fewer than 3,000 characters (even when counting characters after converting its words to traditional characters), and the sets that are supposed to be taught explicitly in school number 3,500 or fewer in both mainland China and Hong Kong.

    I informally tested one native speaker of HK Cantonese, with primary and secondary schooling in HK "in Chinese" (i.e., the textbooks are in Mandarin, or possibly in the HK dialect of written Mandarin, while Cantonese is the language used for communication in class), on their character knowledge, and had another person with the same educational background self-report independently. In both cases I explored different ranges going down a frequency list (with scaled answers) and then interpolated; the result was that each of them knew around 4,200 characters, 4,300 at most. Both report that they did well on HK's secondary school tests in Chinese (they say they got an "A" in the HKALE, though I'm not sure whether they meant "Chinese Language and Culture AS" or "Chinese Literature A"). The two people I tested might not have known the same characters, but the fact that both numbers fell in the same 100-range is remarkable, given that my methodology differed between them. Note that the characters they "knew" included easy-to-learn pairs such as 忐/忑.

    And I know that "knowledge" is an elusive notion: do you include passive knowledge (and if so, what type of passive knowledge)? Do you count knowledge of a character used in the name of a famous athlete if the person can't detect whether it's miswritten? How do you count variant characters? And so on. One probably shouldn't count characters where the only 2-3 commonly used words/expressions they occur in in MSM (Modern Standard Mandarin) can be recognized correctly, but where the speaker can't assign any information at all to the character itself when it is presented in isolation.
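    For concreteness, here is a minimal sketch of the band-and-interpolate estimate (the per-band fractions below are invented for illustration, not my actual data):

    ```python
    # Estimate total characters known from test results on frequency bands.
    # Hypothetical results: (band start rank, band end rank, fraction known).
    bands = [
        (1, 1000, 0.99),
        (1001, 2000, 0.95),
        (2001, 3000, 0.85),
        (3001, 4000, 0.55),
        (4001, 5000, 0.20),
        (5001, 6000, 0.05),
    ]

    # Summing (band size x fraction known) approximates the area under the
    # knowledge curve, i.e. the total number of characters known.
    estimate = sum((end - start + 1) * frac for start, end, frac in bands)
    print(f"Estimated characters known: {estimate:.0f}")  # 3590 with these toy numbers
    ```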

    I learned that native speakers routinely overestimate their knowledge if you ask them to give a quick estimate based on just glancing at a list. They might also get very uncomfortable ;-) once they find that they can't answer your questions.

    Dictionaries may have 8,000 characters or more, but I really think that estimates above 4,500 for a native speaker's character knowledge are highly doubtful. In the linguistic literature I occasionally encounter lower numbers (I don't remember the sources right now), and Wenlin is quoting the old Wieger here:

    The authoritative Kangxi dictionary of 1716 A.D. contains forty thousand, which Dr. L. Wieger said “may be plainly divided as follows: 4000 characters in common use; 2000 proper names and doubles of limited use; 34,000 monstrosities of no practical use.”

    Clavis Sinica covers about 4,000 characters. Urs Bucher's "Vocabulary of Modern Chinese" (Tobun; 1986; ISBN 3908155029) supposedly covers all the vocabulary one normally needs. I don't have the book with me (so I can't give a precise quotation), but I believe the author states that he ended up with slightly fewer than 4,000 characters, and then added surnames to his book to bring his coverage up to a round 4,000.

    People seem to believe that the Taiwanese have better character knowledge than mainland Chinese. The character mergers that occur in traditional-to-simplified conversion wouldn't be enough to account for that, so if it is true, I'd really like to know why. Anecdotally, on both sides of the Strait it takes a bit over six years to learn the characters taught explicitly in school, with Taiwan taking somewhat longer. Also, structural input methods were in use in Taiwan before they were in mainland China, I believe.

    The standard reason for asking about the number of characters is to get a handle on the learning load. While I think it's a good and reasonable question to ask, the incremental learning load for each new character decreases once you exceed a certain threshold: the effort required to master the N most frequent characters simply isn't proportional to N.

  2. Stephan Stiller said,

    December 5, 2015 @ 1:42 am

    My own spreadsheet of English lexical items for which I took notes after my German secondary school education has 12,000-13,000 rows. For some lexical items all I have in there is a brief note on pronunciation (if stress or vowel quality were unclear from the spelling), but even so it means that a strict estimate of 8,000 is too low for English. But I haven't read C. C. Cheng's research; it's quite possible that he avoided counting items that are compositional (and note that compositionality is not a binary property but comes in degrees). I also remember going through one of those vocabulary books by Klett (with coverage of 3,000-3,500 items) during my middle school years. So I'm pretty sure that a meaningful estimate of my English vocabulary is in the range of 15-20K, and yet my friends tell me that it's very large.

    English is a language with a large vocabulary, so if someone says that most languages, unburdened by an old literature or by loanwords from many different languages, have vocabularies of around 8,000, that seems believable.

  3. Max Wheeler said,

    December 5, 2015 @ 4:03 am

    Why might anyone suppose there is a lexicon limit, as opposed to an average? Multilingual people can easily have lexicons of tens of thousands of items in each of their languages.

  4. Jon said,

    December 5, 2015 @ 4:04 am

    David Crystal, in one of his many books, suggested a means of estimating vocabulary: take a large dictionary, choose pages at random, count the number of words you know on each page and the total number on each page, calculate the fraction you know, and scale up based on the number of pages in the dictionary.
    What troubled me about this was deciding whether I 'knew' a word or not – there are so many shades of knowing, from certainty of meaning, through knowing 'that's a word that lawyers/gardeners/etc use', to a vague feeling that I've seen it before. I do cryptic crosswords, and sometimes I recognise that a certain combination of letters will fit, and I think it is a word, but I have to go to the dictionary (sometimes the OED) to check.
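    In code, the procedure might look something like the sketch below. The mechanics (random page sampling, a headword total taken from the dictionary's front matter) are my assumptions, and the toy 'reader' exists only so the script runs; the hard part, deciding what 'knowing' means, is exactly what it glosses over.

    ```python
    import random

    def estimate_vocabulary(num_pages, headword_total, sample_size, fraction_known_on_page):
        """Sample random dictionary pages, average the fraction of headwords
        the reader knows on each, and scale up to the whole dictionary."""
        pages = random.sample(range(num_pages), sample_size)
        avg_fraction = sum(fraction_known_on_page(p) for p in pages) / sample_size
        return avg_fraction * headword_total

    # A toy reader who "knows" 40% of the headwords on every page.
    print(round(estimate_vocabulary(
        num_pages=1500, headword_total=60000, sample_size=30,
        fraction_known_on_page=lambda page: 0.40,
    )))  # 24000 for this toy reader
    ```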

  5. leoboiko@namakajiri.net said,

    December 5, 2015 @ 6:26 am

    That's all very interesting. I wish I could find some estimates on morpheme counts.

  6. Daniel Lieberman said,

    December 5, 2015 @ 6:34 am

    Does this imply a hard limit on the number of different languages a person could learn, or at least, learn really well?

  7. Jason Cullen said,

    December 5, 2015 @ 6:57 am

    Even the distinction between 'active' and 'passive' vocabulary seems simplistic. I forget which linguist said it, or where, but he pulled a random popular novel off his shelf and found on the very first page a word he couldn't define with any confidence. And of course that one page would also include words that aren't common in his 'active' vocabulary. But, as a reader, he easily skipped past the word he couldn't define because his understanding, vague as it was, was enough to get the sentence as a whole and to read the first page with ease. I suspect there are at least three layers to our lexicon.

  8. Mark Liberman said,

    December 5, 2015 @ 8:15 am

    There's a large literature on this subject, going back nearly a century (and there are also quite a few LLOG posts about it). See "Lexical limits?", posted next on this blog, for discussion and references.

  9. Michael Watts said,

    December 5, 2015 @ 8:35 am

    C. C. Cheng, whom I cited at the beginning of this post, also holds that the human cognitive capacity for Chinese characters is around 3,000 to 3,500 characters. If true, and combined with his claims about lexical limits mentioned above (around 8,000 items), that would average out to about 2.5 words for each character. That seems reasonable to me and, coincidentally, is fairly close to the average length of a word in Mandarin, which is just slightly under two syllables.

    I make that a little over 0.5 words per character. But it's not clear to me that "characters a person knows in their own writing system" divided by "words that person knows in their own language" is really comparable to "number of characters necessary to write a word". English has fewer characters than Chinese by a factor of hundreds, but we definitely don't need our written words to be hundreds of characters long.

  10. Victor Mair said,

    December 5, 2015 @ 9:13 am

    @Michael Watts

    1. Did you mean 0.5 characters per word? That would be about right.
    [Update at 6 p.m. 12/5/15: 0.5 words per character would be about right, but it is not clear how you derived that ("I make that") from Cheng's figures (see comments below)].

    2. C. C. Cheng's estimates indicate that — on average — one commonly used character (in combination with other characters at one's command) yields about 2.5 words in a typical user's lexicon.

    3. You ignored the "coincidentally" in my sentence, which is there for a purpose.

    4. You added italics without calling attention to the emphasis.

    5. Your last sentence points to a painfully obvious difference between alphabets and morphosyllabaries.

  11. Keith said,

    December 5, 2015 @ 9:17 am

    As Stephan and Max have remarked, Cheng's assertion poses a particular problem when considering bilinguals or polyglots.

    If there were a ceiling of around 8000 "lexical items", and a lexical item is a "word" as Prof Mair's post suggests, then this would imply that acquisition of a second language would "eat away" at the first language's store of words…

    Or is it that, for example, an English speaker learning the French word "chien" simply attaches a second sound to the existing "lexical item" corresponding to the English sound "dog"?

  12. Max Wheeler said,

    December 5, 2015 @ 9:46 am

    My original question was more basic. Why would anyone take seriously a claim that human lexicons have a "de facto storage limit"? I mentioned multilinguals rather as a reductio ad absurdum of such a claim.

  13. Stephan Stiller said,

    December 5, 2015 @ 9:54 am

    @Keith
    You're raising a very interesting question in your last paragraph. I'm pretty sure that the answer is "no" for truly multilingual people. But as for your example: the notion of "dog" is, I guess, the same in most languages.

    And, about the rest: I totally agree that 8K as a hard limit seems strange.

    About my estimate of my own vocabulary size: I must be ignoring simple lexical items and very international and scientific vocabulary. Still – 40K word families (from Mark Liberman's next LL post)? As I have a personal interest in the learning load for achieving native proficiency in foreign languages, I guess the only way to figure things out is to go through a dictionary and count morphemes, lexical items, and word families, while trying to rank myself honestly on how well I know these items and on how important they are …

  14. Bob Ladd said,

    December 5, 2015 @ 12:12 pm

    What Max Wheeler said.

  15. Eric said,

    December 5, 2015 @ 1:19 pm

    Paul Nation (also referenced with a link in a comment on ML's related post) has been a leader in examining vocabulary size estimates for second language learners, and the literature on this within SLA vocabulary research more generally is rather large. A common target for second language English vocabulary is 8,000 (note this is 8,000 word families, not merely 'words', which are always much harder to pin down than we'd like). So, unless 'word' is defined in some very different fashion, 8,000 seems far too small for native speakers. I think this will hold for Chinese as well, though simply deciding what unit to count is fraught with difficulty. (For instance, it's often the case that English word counts exclude compound words. Depending on how we define compounds, this practice would potentially exclude most Chinese vocabulary.)

    I've been involved in some efforts to quantify Chinese second language vocabulary, but so far my feeling is that none of the Chinese corpora I have access to have been able to capture second language knowledge well enough to give anything at all reliable in the results (if you're interested, here's a poster from a recent conference: https://terpconnect.umd.edu/~epelzl/ECOLT2015_poster.pdf). I haven't really made an effort to quantify native speaker knowledge, so I don't have any numbers to offer there, but Chao et al. (1967) suggest an impressive (but highly questionable) 46,000. Here's the full reference: Chao, C., Chao, T., & Chang, F. F. K. (1967). How many words do Chinese know? Journal of the Chinese Language Teachers Association, 2(2), 44-59.

    A more recent published attempt to quantify second language Chinese vocabulary size is: Shen, H. H. (2009). Size and Strength: Written Vocabulary Acquisition among Advanced Learners. Shijie Hanyu Jiaoxue (Chinese Teaching in the World), 23(1), 74–85.

  16. Eric said,

    December 5, 2015 @ 1:39 pm

    I realized I didn't actually include any numbers for second language learners. I haven't calculated any because the relationship between test results and frequency is rather weak (r = .33 and r = .36), which means estimates would be wildly off the mark.

    Shen (2009) says her results suggest participants (after three years of college study) had passive knowledge of about 2,229 words (of the most frequent 8,500), but I wouldn't put much stock in the number, and I'm guessing Shen would also acknowledge it is a very uncertain estimate.

  17. JS said,

    December 5, 2015 @ 2:14 pm

    I can't see this page of Packard on Google Books so don't know where to look in Chin-Chuan Cheng.

    Now I only see Cheng's "Frequently Used Chinese Characters and Language Cognition," Studies in the Linguistic Sciences Volume 30 Number 1 (Spring 2000), pp. 107-118, in which the author begins by comparing Chinese character lists and concludes that "[t]he range of 4,000 to 8,000 morphemes then is proposed as the optimal number of linguistic symbols for human manipulation." (p. 107)

    Later, Cheng quotes Miller, George A., & Patricia M. Gildea, 1991, "How children learn words. The Emergence of Language Development and Evolution", ed. by William S-Y. Wang, pp. 150-58 (New York: W. H. Freeman): "in the United States high school graduates at age 17 normally have 80,000 words in their vocabulary." I haven't looked there.

    But what ended up in Packard is perhaps a conflation of this 8,000 and 80,000, in which case MYL's suspicion in the other thread would be correct.

  18. Victor Mair said,

    December 5, 2015 @ 2:22 pm

    I received a preliminary reply from Jerry Packard. He will probably post something about this later today.

  19. Victor Mair said,

    December 5, 2015 @ 2:57 pm

    From a colleague who is familiar with the work of C C Cheng and Jerry Packard:

    1. There is a strong tendency to over-estimate the number of characters one must know to read/write everyday Chinese and Japanese. Defining what it means to know a character is part of the problem, but it's pretty obvious that most folks get by with a mere fraction of all the characters in allegedly comprehensive dictionaries like Morohashi.

    2. I think that C.C. was guessing at a minimum number of lexical primitives that one could get by with, assuming that a lot of the other things included in "the lexicon" could be derived by morphological rules. As Liberman says, it's pretty obvious that educated people know many more than 8,000 words, provided you take "know" and "words" in the loose sense most people use.

  20. Michael Watts said,

    December 5, 2015 @ 3:09 pm

    Victor Mair: no, 0.5 characters per word would mean that the average word was half a character long. If the average word is two characters long, you've got 0.5 words per character, or two characters per word.

  21. Victor Mair said,

    December 5, 2015 @ 4:39 pm

    From Bob Ramsey:

    I was always skeptical myself. And now seeing this posting I was thinking along the lines you are about what it means to count “words”. Really, we Americans all think “words” are something specific and countable, when in fact they’re really mostly defined by how we write, and how we use spaces on the page. I remember having a flash about this when I was reading “Lolita” and Humbert Humbert says something like, “The words ‘for ever’ referred only to my own passion.” And I then thought, wow, “so ‘forever’ is really a plural concept! Huh. That’s news.” I guess I had even been thinking that ‘forever’ was a single, indivisible morpheme, when in fact it was just all about British spellings vs. American ones…

    When I was studying in Germany, some of my Shakespeare-loving German classmates had told me that English was much richer in its number of words compared to their impoverished language, which had only a fraction of the number. But I remember thinking: Of course! Their “words” run together without spaces a lot of what we English-speakers consider separate words.

    Culture at this very low level of word spacing can affect what we imagine to be something profound!

  22. Christian Weisgerber said,

    December 5, 2015 @ 6:13 pm

    @Keith:

    Or is it that, for example, an English speaker learning the French word "chien" simply attaches a second sound to the existing "lexical item" corresponding to the English sound "dog"?

    No, it can't work like that. Many (most?) vocabulary items don't have exact matches between languages. Instead you have semantic fields that have varying degrees of overlap. Here's an example going back and forth between English and German:
    "Screw" translates as Schraube, but Schraube can also translate as "bolt", "bolt" in turn as Bolzen, Bolzen as "pin", "pin" as Stift, Stift as "pen", … depending on what thingy exactly you are talking about.

  23. Victor Mair said,

    December 5, 2015 @ 6:20 pm

    @Michael Watts

    The waters (where certain TReacherous creatures lurk) have gotten muddied, so let's go back to the beginning and see if we can clear things up a bit.

    1. I referred to C. C. Cheng's claim about the human cognitive capacity for Chinese characters being around 3,000 to 3,500.

    2. I averaged that out to 3,250.

    3. I divided the figure of 8,000 lexical items (de facto storage limit) cited at the beginning of the post by the 3,250 characters, giving 2.5 for the number of words formable per character. I elaborated on that in a previous reply to you thus: "C. C. Cheng's estimates indicate that — on average — one commonly used character (in combination with other characters at one's command) yields about 2.5 words in a typical user's lexicon." Of course, the individual characters enter into lexical items at vastly different rates.

    4. You commented: "I make that a little over 0.5 words per character." I do not know what you were computing that gave you that result. That is to say, I do not know precisely what you were referring to in Cheng's figures by "that". Trying to make sense of what you wrote, I asked whether you meant 0.5 characters per word, but I really had no idea where you got that result from in Cheng's figures. 0.5 words per character would indeed be "about right", but that's not at all what Cheng's figures are telling us. He is talking about the word-forming capacity of the most common characters within the vocabulary of a typical individual, not the length (number of syllables) of words.

    5. It is well known, and I have stated this on numerous occasions on Language Log (including in this very post) and elsewhere, that the average length of a Chinese word is just under two characters / syllables (1.98, by one large survey). Put differently, on average one character equals roughly 0.5 word, but that result is not derivable from Cheng's figures. He was talking about the size of a typical person's vocabulary and the number of characters that an individual can keep in mind. He was not making claims about the lengths of Chinese words.
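    To put the two quantities side by side, using only figures already cited in this thread:

    ```python
    # Ratio of Cheng's two proposed limits:
    word_limit = 8000                # de facto lexical storage limit
    char_limit = (3000 + 3500) / 2   # character-capacity estimate, averaged to 3,250
    print(word_limit / char_limit)   # 2.46... words per character a person knows

    # A different quantity entirely: words per character of running text,
    # from the average word length of 1.98 characters cited above.
    avg_word_length = 1.98
    print(1 / avg_word_length)       # 0.505... words per character of text
    ```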

  24. Michael Watts said,

    December 5, 2015 @ 11:15 pm

    I was computing the words-per-character ratio that you compared to the ratio of Cheng limits. I italicized the bit I was talking about; the rest of the quote is context for it. I agree that the ratio of the hypothetical memory limit "8000 words" to the hypothetical memory limit "3000 to 3500 characters" works out to something in the neighborhood of 2.5 words per character, and that that means that a person looking at a character that they know will on average know about 2.5 words featuring that character (assuming everyone is right at the theoretical limit of knowledge, which on second thought sounds a little dicey); but I understand that calculation.

    However, I don't see how that can be described as "fairly close to the average length of a word in Mandarin, which is just slightly under two syllables". The most sense I can make of that statement is that the words-per-character figure for memory limits is being compared to the words-per-character figure for written Chinese. That way the numbers being compared at least have "the same" units. But those figures are very far apart; 2.5 words per character is essentially the opposite of 2 characters per word.

    I guess theoretically the sentence might mean "the ratio of the limit of vocabulary knowledge to the limit of Chinese character knowledge, 2.5 words per character, is numerically similar, though of different dimension, to the average length of a Chinese word, 2 characters", but it's totally nonsensical to compare those. It'd be kind of like saying that 2.5 words-per-character in memory limits is strikingly close to the number of Musketeers, 3.

  25. Victor Mair said,

    December 6, 2015 @ 12:15 am

    @Michael Watts

    Murkier and murkier.

  26. Victor Mair said,

    December 6, 2015 @ 12:15 am

    I had indicated that Jerry Packard would have something to offer with regard to our discussion. It turns out that Jerry dug up CC's original paper which, together with its abstract, is in Chinese. Fortunately, CC had personally written an English version of the abstract which, though he had never published it, Jerry had in his files. Jerry kindly keyboarded the English abstract, and I have copied that below. Jerry also sent along the pdf of the whole paper in Chinese, which I can forward to anyone who is really interested in it.

    As soon as I looked at the paper, I realized that I had already read it several times in the past and always thought that it was rather ingenious. However, because CC has written such a huge number of papers, I wasn't sure which one Jerry was talking about in the footnote from his book that I cited in the OP above.

    http://www.ntnu.edu.tw/tcsl/Teachers/zheng/zheng.htm

    Please note that CC's approach is highly quantitative and computational. He surveys a large number of classics, histories, novels, dictionaries, and so forth from the earliest stages of writing in China to the twentieth century. His approach is not impressionistic but is grounded in hard data. Naturally, since CC's focus is on Chinese, the texts that form the basis of his computations are primarily in that language, but I list here his findings for some well-known American and English works of literature that he examines to show that the range of stored lexical items is comparable in English.

    The last column gives the name of the work, and that is preceded starting from the beginning of each line by 1) the total number of words in the work, 2) the number of different words, and 3) the number of conceptual units ("cooled, cooler, coolest, cooling" would all fall under the conceptual unit of "cool", i.e., a lexeme) in the work. Because WordPress seems not to permit more than one space between words or numbers, I've separated the different items in a line by dashes to make them clearer:

    32,361 — 4,727 — 3,431 Call of the Wild

    74,038 — 7,427 — 5,316 Tom Sawyer

    87,044 — 8,630 — 6,046 Beauty and the Beast

    161,751 — 9,281 — 6,433 Dracula

    137,060 — 8,877 — 6,218 The American

    39,631 — 4,236 — 3,190 Aspern Papers

    80,493 — 8,976 — 6,377 Paradise Lost

    161,974 — 7,097 — 4,647 Emma

    120,735 — 6,288 — 4,199 Sense and Sensibility

    123,270 — 6,288 — 4,146 Pride and Prejudice

    84,128 — 5,741 — 3,934 Persuasion

    729,792 — 13,765 — 8,641 Austen's Books combined: Emma, Sense and Sensibility, Pride and Prejudice, Persuasion, Mansfield Park, Northanger Abbey
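    The three counts in each line are straightforward to reproduce in principle. Below is a minimal sketch; NLTK's WordNet lemmatizer stands in for whatever lemmatization CC actually used, which is an assumption on my part.

    ```python
    import re
    from nltk.stem import WordNetLemmatizer  # pip install nltk; needs the 'wordnet' data

    def corpus_counts(text):
        """Return (total words, distinct words, approximate conceptual units)."""
        tokens = re.findall(r"[a-z']+", text.lower())
        types = set(tokens)
        wnl = WordNetLemmatizer()

        def best_lemma(word):
            # Try several parts of speech and keep the shortest result, a crude
            # way to collapse "cooled", "cooler", "coolest", "cooling" onto "cool".
            return min((wnl.lemmatize(word, pos) for pos in "nvar"), key=len)

        lemmas = {best_lemma(w) for w in types}
        return len(tokens), len(types), len(lemmas)

    # Usage, with the text of a novel in a plain-text file:
    # total, distinct, units = corpus_counts(open("call_of_the_wild.txt").read())
    ```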

    Rereading Jerry's footnote against CC's original paper, it is clear that he did not misquote or misrepresent CC.

    Here is the complete bibliographical citation for CC's paper:

    Zhèng Jǐnquán 鄭錦全, "Yǔyán wénxué yǔ zīxùn 語言文學與資訊" ("Language, Literature, and Information"), in his collected papers entitled Cóng jìliàng lǐjiě yǔyán rènzhī 從計量理解語言認知 (Quantification for Understanding Language Cognition) (Urbana-Champaign: University of Illinois, Department of Linguistics), pp. 103-137, rpt. from Zōu Jiāyàn, Lí Bāngyáng, Chén Wěiguāng, and Wáng Shìyuán 鄒嘉彥、黎邦洋、陳偉光、王士元, ed., Hànyǔ jìliàng yǔ jìsuàn yánjiū 漢語計量與計算研究 (The Quantification of Sinitic and Computational Research) (Hong Kong: City University, 1998), pp. 15-30.

    And here is the English abstract:

    =====

    Quantification for Understanding Language Cognition

    Chin-Chuan Cheng

    Abstract

    In our view, the purpose of quantitative studies of language is for understanding of human cognition. This paper raises two cognition questions. We hope to understand human language by examining some quantitative aspects of language use. The first question is about the number of words a person can actively use. The second question inquires about the number of morphemes that can be held for processing in communication without overwhelming the short–term memory. With regard to the number of active vocabulary including morphemes and non-productive words that require memory, we can find answers from individual works of the past. The occurrences of Chinese characters in some Four Treasures of Chinese including the entire set of 25 Dynasty Histories showed that in spite of the existence of 30,000 to 56,000 characters, each book no matter how long it was, used only about 5,000 distinct characters. The maximum number was about 8,000. Each Chinese character represents a morpheme. However, in recent surveys of word frequency, the highest 8,000 words covered over 90% of modern texts. We also lemmatized the words in many English books by great authors and found that the number of words each author used did not exceed 8,000. We have therefore proposed a theory that defines the upper limit of human active control of linguistic symbols and words as 8,000. With regard to the use of linguistic symbols in communication, we examined the smallest topics each forming a meaningful unit as an information chunk in written discourse of news reportage. We found that the number of morphemes in such an information chunk was about 50. This number is the number of morphemes one can handle in short-term memory processing.
    =====
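    The coverage figure in the abstract (the highest 8,000 words covering over 90% of modern texts) is the kind of statistic that falls out directly from a frequency count. A minimal sketch, with the tokenization and the corpus file as my own placeholder assumptions:

    ```python
    import re
    from collections import Counter

    def coverage_of_top_n(text, n=8000):
        """Fraction of running text accounted for by the n most frequent words."""
        tokens = re.findall(r"\w+", text.lower())
        counts = Counter(tokens)
        covered = sum(c for _, c in counts.most_common(n))
        return covered / len(tokens)

    # e.g. coverage_of_top_n(open("modern_corpus.txt").read())  # Cheng reports >90% at n=8000
    ```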

    I just want to back up what CC has demonstrated by observing that, of the many premodern and modern Chinese works that I have read and translated — and these include many of the most famous literary and philosophical works in the language — the number of different characters used normally falls within a range of between about 1,000 and 5,000, with earlier texts tending to fall around the middle or bottom part of the range. The same is true even of the magnificent corpus of Tang poems — often considered the most glorious manifestation of the entire Chinese literary tradition — the vast bulk of which falls around the middle of the range indicated just above.

    Before electronic databases became common for finding characters and terms in Chinese texts, modern Sinologists relied on indices and concordances that were compiled during the 20th century. I once did a survey of scores of these reference works for the major works of Chinese literature and was amazed to find that they all contained between about 1,000 (or even fewer) and 5,000 different characters. That was before the first time I read CC's paper. So it was with a sense of déjà vu and welcome gratification that I realized CC had independently made the same discovery, but had done so in a much more systematic and rigorous way.

  27. Victor Mair said,

    December 6, 2015 @ 2:38 pm

    From a German colleague (specialist in Chinese historical phonology) — his comment is for both "Lexical Limits" posts:

    I think the second log discussion gets closer to reality. Actually, I know nothing about these issues, but I doubt there is an upper limit, considering all the multilingual people. If the limit is 8,000 vocabulary items, and a person masters 3 languages in addition to his native tongue, does that mean that his native vocabulary then drops to 2,000? That would make people like Jim Matisoff theoretically impossible.

    Personal, totally uneducated impression: take the difference between German and English and look, for example, at words that are formed in German from the basic verb ‘geben’ and a number of commonly used prefixes:

    German || English

    geben || = give
    abgeben || deliver, drop off
    angeben || brag / indicate
    aufgeben || = give up / mail
    ausgeben || spend / hand out
    begeben || occur
    beigeben || add to
    eingeben || infuse / inspire
    ergeben || result / surrender
    umgeben || surround
    übergeben || hand over, deliver / vomit
    vergeben || = forgive
    vorgeben || pretend
    zugeben || admit
    etc.

    Here we have in German one verb + a large but limited number of well-known prefixes/prepositions which are widely used in other word formations. In English you find at least 11 unique words (morphemes) instead. That is why Germans go through school and life, mastering their language perfectly, without ever having a need for a dictionary (I acquired my first Duden only after living here for a while and becoming unsure about some spellings).

    What struck me here is that English is impossible to master even for native speakers without a dictionary; with its 50,000 (?) words it appears incredibly bloated in comparison to German and most other languages, where you have a much more limited number of morphemes which can be combined for expressing nuances and subtleties for which English requires unique words that often need to be ascertained in a dictionary.

    One consequence of the German system is that speakers/writers often make up words ad hoc through morpheme combinations which (though correct idiomatic German) cannot be found even in the fattest Duden, to the consternation of foreign students.
    About counting words: German ‘angeben’ is one word for all I know, but needs to be represented by two rather different words in English: is angeben one or two words?

    The point is that in languages like German you need to remember a much lower number of morphemes than English to express every conceivable subtlety.

  28. Christian Weisgerber said,

    December 7, 2015 @ 11:12 am

    I think the "German colleague"'s examples actually undermine his argument. To me as a German speaker, the meanings of most of these verbs cannot be transparently derived from the prefix and the base verb geben. Instead they have to be learned as individual lexical items just like their English counterparts. In fact, the same word can have both an idiomatic and a compositional meaning, e.g., the last one on the list, zugeben, can mean "add to" (transparent from zu- and geben) or it can mean "admit", which is derivatively opaque. The German verb prefix system is very reminiscent of English phrasal verbs.

    It may very well be that a larger part of the commonly used German vocabulary is built from fewer morphemes than in English, but knowing the constituent morphemes all too often does not reveal the idiomatic meaning of the composite term.

    Germans themselves frequently notice and remark on how underspecified the compositional system is. A commonly cited example is the naming of types of schnitzel:
    * Schweineschnitzel — schnitzel made from pork
    * Kinderschnitzel — schnitzel for children (i.e., a smaller portion)
    * Zigeunerschnitzel — schnitzel in Gypsy fashion (i.e. with paprika)
    We know these as established terms, but there is nothing in the composition that would determine the exact meaning. A Kinderschnitzel might just as well be made by children or, if we didn't know to exclude cannibalism for cultural reasons, from children. Its specific meaning has to be learned as a lexical item.

    It is true that Germans' traditional idea of a dictionary is Duden Vol. 1, a pure spelling dictionary, and that most only encounter a dictionary with comprehensive definitions during advanced foreign language courses, if they ever do at all. But that is a cultural issue and I'd be very careful about drawing conclusions about the language itself from this. It's not that Germans don't run into unfamiliar words, it's just that they traditionally don't know to look them up. (And it's an interesting question whether this still holds true for young people who have grown up with the Internet.)
