Greg Laden has recently posted an entertaining screed, "Minifalsehood: We can't tell what a word is!?!?", 5/31/2010. I don't have time this morning for a serious discussion, but I can point to some relevant stuff here, here, here, here, here, here, …
Laden is cheerfully dismissive of the arguments raised in such discussions. But his scorn doesn't change the facts of the case, which are roughly:
1. Without a careful definition of what you mean by "word" and by "language X", questions like "how many words are there in language X" are pretty much meaningless, because different definitions will yield very different numbers.
2. The same thing applies, with the added issue of what you mean by "know", to the question of "how many words of language X does a specific person know?" Another layer of variation is added by generalizing the question to "how many words of language X does an average four-year-old or 18-year-old know?" There's an obvious answer, subject to the usual sampling-error problems, but the result is a bit like asking about average income — the mean value may not be very useful in telling you what you really want to know about the distribution.
3. Most sensible definitions for (1) and (2) above create serious practical difficulties for counting. That is, they define an answer, but the prescribed process for finding it is hard to carry out, and especially hard to automate in a way that produces an accurate result.
4. Extrapolating accurately from samples raises its own special problems here — for a discussion of some of these difficulties, see this set of lecture notes, or read Harald Baayen's book, Word Frequency Distributions.
5. Despite all these difficulties, researchers over the years have carefully defined what they mean by "word", "language", "know", etc., and then carried out the counts those definitions prescribe. Some classical references are M. Graves, "Vocabulary Learning and Instruction", Review of Research in Education, 13, 49-89, 1986; and W.E. Nagy & R.C. Anderson, "How many words are there in printed school English?", Reading Research Quarterly, 19, 304-330, 1984. [(added later) They've done this because the (suitably qualified) answers matter to various scientific and technological questions: How much word-learning, and of what kind, do children need to do in the course of learning a language? How many entries does a lexicon need to have in order to get X% of coverage on task Y? etc.]
6. Comparisons across languages are made more difficult by the fact that the most natural and sensible answers to questions like those in (1) tend to differ from language to language. Furthermore, a decision that has only a small effect on the results in language X may turn out to change things by an order of magnitude or more in language Y. Again, this doesn't make it impossible to answer the questions; it just widens, yet again, the range of sensible values that answers might have.
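Point (1) can be seen even in a toy example. The sketch below uses a made-up sentence and an invented lemma table; counting the same text under three different definitions of "word" gives three different numbers:

```python
import re

# Toy text and lemma table, invented purely for illustration.
text = "The dogs run. The dog runs, and running dogs ran."
lemma = {"dogs": "dog", "runs": "run", "running": "run", "ran": "run"}

tokens = re.findall(r"[a-z]+", text.lower())     # every occurrence counted
types = set(tokens)                              # distinct spellings only
lemma_types = {lemma.get(t, t) for t in tokens}  # inflections collapsed

print(len(tokens), len(types), len(lemma_types))  # prints: 10 8 4
```

Three defensible "word counts" for one ten-word sentence; the spread only gets wider as the definitional choices multiply (compounds, proper names, technical terms, and so on).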
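Point (2)'s average-income analogy is easy to make concrete. The vocabulary sizes below are invented; the point is only that a skewed distribution makes the mean a poor summary of what's typical:

```python
from statistics import mean, median

# Invented vocabulary-size figures for six hypothetical speakers;
# one heavy reader skews the distribution.
vocab_sizes = [4000, 4500, 5000, 5500, 6000, 30000]

print(mean(vocab_sizes))    # about 9167 -- pulled far above what is typical
print(median(vocab_sizes))  # 5250 -- much closer to the typical speaker
```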
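Point (4), the extrapolation problem, can be sketched by simulation. Assuming a Zipf-like frequency distribution (a toy assumption here, not Baayen's actual machinery), the number of distinct word types seen grows much more slowly than the number of tokens read, so a linear extrapolation from a small sample overshoots badly:

```python
import random

random.seed(0)
V = 50_000                                  # hypothetical vocabulary size
weights = [1 / r for r in range(1, V + 1)]  # Zipf-like word frequencies

def types_seen(n_tokens):
    """Count distinct word types in a random sample of n_tokens tokens."""
    sample = random.choices(range(V), weights=weights, k=n_tokens)
    return len(set(sample))

small, large = types_seen(1_000), types_seen(10_000)
print(small, large)  # ten times the tokens yields far fewer than ten times the types
```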
Laden is radically impatient with all this talk about how it all depends and it's hard to tell, but his impatience doesn't change the facts. Nor does it change the fact that there are plenty of attempts to answer such questions — one of the standard assignments in my LING 001 course asks students to use a dictionary-sampling method to estimate the size of their passive vocabulary in English. Of course, I also explain some of the reasons that the estimated numbers are not very meaningful.
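The dictionary-sampling assignment works roughly like the sketch below. Everything in it is invented — the dictionary size, the sample size, and the stand-in for the reader's self-judgments — but the arithmetic is the point: the fraction of sampled headwords you know, scaled up by the dictionary's headword count, estimates your passive vocabulary.

```python
import random

random.seed(1)

DICTIONARY_SIZE = 100_000  # headwords in some chosen dictionary (invented)
SAMPLE_SIZE = 200

# Stand-in for a real self-test: pretend the reader knows exactly
# these 40,000 headwords (again, invented for the sketch).
known_words = set(random.sample(range(DICTIONARY_SIZE), 40_000))

sample = random.sample(range(DICTIONARY_SIZE), SAMPLE_SIZE)
hits = sum(1 for w in sample if w in known_words)
estimate = DICTIONARY_SIZE * hits / SAMPLE_SIZE
print(f"knew {hits}/{SAMPLE_SIZE} sampled headwords; estimate ~ {estimate:.0f}")
```

Which dictionary you sample from, and how honestly "know" is judged, can easily swing the estimate by a factor of two or more — which is exactly why the number needs those caveats.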
Laden seems to be aware of these issues — for example, he found the Nagy and Anderson reference — but his goal in the cited post seems to be to make fun of people rather than to clarify the questions and answers. (He suggests, towards the start of his post, that he wants to evaluate claims about the rate of word learning by children — but I couldn't see any connection between this issue and the rest of his hyper-kinetic complaining about the difficulty of getting a simple answer to the word-counting question.)