Greg Laden has recently posted an entertaining screed, "Minifalsehood: We can't tell what a word is!?!?", 5/31/2010. I don't have time this morning for a serious discussion, but I can point to some relevant stuff here, here, here, here, here, here, …
Laden is cheerfully dismissive of the arguments raised in such discussions. But his scorn doesn't change the facts of the case, which are roughly:
1. Without a careful definition of what you mean by "word" and by "language X", questions like "how many words are there in language X" are pretty much meaningless, because different definitions will yield very different numbers.
2. The same thing applies, with the added issue of what you mean by "know", to the question of "how many words of language X does a specific person know?" Another layer of variation is added by generalizing the question to "how many words of language X does an average four-year-old or 18-year-old know?" There's an obvious answer, subject to the usual sampling-error problems, but the result is a bit like asking about average income — the mean value may not be very useful in telling you what you really want to know about the distribution.
3. Most sensible definitions for (1) and (2) above create serious practical difficulties for counting. That is, they define an answer, but the prescribed process for finding it is hard to carry out, and especially hard to automate in a way that produces an accurate result.
4. Extrapolating accurately from samples raises its own special problems here — for a discussion of some of these difficulties, see this set of lecture notes, or read Harald Baayen's book, Word Frequency Distributions.
5. Despite all these difficulties, researchers over the years have carefully defined what they mean by "word", "language", "know", etc., and then carried out the counts those definitions prescribe. Some classical references are M. Graves, "Vocabulary Learning and Instruction", Review of Research in Education, 13, 49-89, 1986; and W.E. Nagy & R.C. Anderson, "How many words are there in printed school English?", Reading Research Quarterly, 19, 304-330, 1984. [(added later) They've done this because the (suitably qualified) answers matter to various scientific and technological questions: How much word-learning, and of what kind, do children need to do in the course of learning a language? How many entries does a lexicon need to have in order to get X% of coverage on task Y? etc.]
6. Comparisons across languages are made more difficult by the fact that the most natural and sensible answers to questions like those in (1) tend to differ from language to language. Furthermore, a decision that has only a small effect on the results in language X may turn out to change things by an order of magnitude or more in language Y. Again, this doesn't make it impossible to answer the questions; it just widens, yet again, the range of sensible values that answers might have.
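Point (1) can be seen even in a toy example. The sketch below uses a made-up sentence and an invented lemma table; counting the same text under three different definitions of "word" gives three different numbers:

```python
import re

# Toy text and lemma table, invented purely for illustration.
text = "The dogs run. The dog runs, and running dogs ran."
lemma = {"dogs": "dog", "runs": "run", "running": "run", "ran": "run"}

tokens = re.findall(r"[a-z]+", text.lower())     # every occurrence counted
types = set(tokens)                              # distinct spellings only
lemma_types = {lemma.get(t, t) for t in tokens}  # inflections collapsed

print(len(tokens), len(types), len(lemma_types))  # prints: 10 8 4
```

Three defensible "word counts" for one ten-word sentence; the spread only gets wider as the definitional choices multiply (compounds, proper names, technical terms, and so on).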
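Point (2)'s average-income analogy is easy to make concrete. The vocabulary sizes below are invented; the point is only that a skewed distribution makes the mean a poor summary of what's typical:

```python
from statistics import mean, median

# Invented vocabulary-size figures for six hypothetical speakers;
# one heavy reader skews the distribution.
vocab_sizes = [4000, 4500, 5000, 5500, 6000, 30000]

print(mean(vocab_sizes))    # about 9167 -- pulled far above what is typical
print(median(vocab_sizes))  # 5250 -- much closer to the typical speaker
```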
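Point (4), the extrapolation problem, can be sketched by simulation. Assuming a Zipf-like frequency distribution (a toy assumption here, not Baayen's actual machinery), the number of distinct word types seen grows much more slowly than the number of tokens read, so a linear extrapolation from a small sample overshoots badly:

```python
import random

random.seed(0)
V = 50_000                                  # hypothetical vocabulary size
weights = [1 / r for r in range(1, V + 1)]  # Zipf-like word frequencies

def types_seen(n_tokens):
    """Count distinct word types in a random sample of n_tokens tokens."""
    sample = random.choices(range(V), weights=weights, k=n_tokens)
    return len(set(sample))

small, large = types_seen(1_000), types_seen(10_000)
print(small, large)  # ten times the tokens yields far fewer than ten times the types
```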
Laden is radically impatient with all this talk about how it all depends and it's hard to tell, but his impatience doesn't change the facts. Nor does it change the fact that there are plenty of attempts to answer such questions — one of the standard assignments in my LING 001 course asks students to use a dictionary-sampling method to estimate the size of their passive vocabulary in English. Of course, I also explain some of the reasons that the estimated numbers are not very meaningful.
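The dictionary-sampling assignment works roughly like the sketch below. Everything in it is invented — the dictionary size, the sample size, and the stand-in for the reader's self-judgments — but the arithmetic is the point: the fraction of sampled headwords you know, scaled up by the dictionary's headword count, estimates your passive vocabulary.

```python
import random

random.seed(1)

DICTIONARY_SIZE = 100_000  # headwords in some chosen dictionary (invented)
SAMPLE_SIZE = 200

# Stand-in for a real self-test: pretend the reader knows exactly
# these 40,000 headwords (again, invented for the sketch).
known_words = set(random.sample(range(DICTIONARY_SIZE), 40_000))

sample = random.sample(range(DICTIONARY_SIZE), SAMPLE_SIZE)
hits = sum(1 for w in sample if w in known_words)
estimate = DICTIONARY_SIZE * hits / SAMPLE_SIZE
print(f"knew {hits}/{SAMPLE_SIZE} sampled headwords; estimate ~ {estimate:.0f}")
```

Which dictionary you sample from, and how honestly "know" is judged, can easily swing the estimate by a factor of two or more — which is exactly why the number needs those caveats.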
Laden seems to be aware of these issues — for example, he found the Nagy and Anderson reference — but his goal in the cited post seems to be to make fun of people rather than to clarify the questions and answers. (He suggests, towards the start of his post, that he wants to evaluate claims about the rate of word learning by children — but I couldn't see any connection between this issue and the rest of his hyper-kinetic complaining about the difficulty of getting a simple answer to the word-counting question.)