Don't Try This at Home!
In a "Fresh Air" piece (audio, text) that aired today, I reprised a couple of the cases of quantitative quackery that Language Loggers have taken on, where someone counts up the words in a text to draw some utterly unjustified conclusions about its content or author. I mention the efforts to distill the essence of the Democrats' health care bills from the frequency of selected words, which I took up in a post a couple of months ago (it drew a number of useful comments that I borrowed liberally from in the "Fresh Air" piece).
These enumerations have become more fevered on all sides as the bills make their interminable way through Congress: Only seven instances of women! More than 3300 occurrences of shall, each a mandate that chips away at our freedom! On that last point, I note that, page-for-page, shall is more frequent in the Constitution than in the House healthcare bill, and conclude: "Critics of the bill are still free to insist that it opens a new fast lane on the road to serfdom. But that isn't something you can prove just by counting helping verbs."
Then there are the ubiquitous tallies of first-person pronouns aimed at demonstrating the egotism or arrogance of public figures. The targets have included John McCain, Hillary Clinton, Sarah Palin, and particularly Barack Obama, which Mark has painstakingly dispatched in a series of posts (links here) that show that Obama and the other politicians charged with egotism don't actually use first-person pronouns more than other politicians have. (The posts have gotten a gratifying amount of attention from bloggers and journalists: you could have the hopeful sense that LanguageLog — well, or Mark, to be more precise — is becoming a kind of unofficial linguistic equivalent of the CBO).
Mark lays these misperceptions of pronoun frequency to confirmation bias, which is certainly the obvious conclusion, though I speculate that another factor may play a part:
You can't help thinking there's a measure of projection here, as well. Will and Fish are neck and neck for the most immodest style in all of American prose, and it's not surprising that they'd read Obama's impenetrable self-possession as the sign of a bristling ego. When you're a narcissist, every doorknob becomes a mirror.
Not that there's likely to be any slackening of the craze for counting words. As I noted in the piece, the Internet turns everybody into a linguist, the same way it turns us all into medical diagnosticians and tracers of lost persons. On the other hand, the Internet is pretty good at mobilizing the critical spirit, too. Do you suppose the sheer volume of inappropriate or misguided arguments from word counts might motivate more people to wonder what it takes to do this well?
Mark P said,
November 18, 2009 @ 9:15 am
This tendency to count words and draw conclusions strikes me as a lot like the ancient-astronaut crowd, who look for hidden meaning in things like the length of a side of the Great Pyramid of Giza. If you really want to find something and you look hard enough for it, you can find something that will pass.
[(myl) On the other hand, the relative frequency of various words in a document really is a decent proxy for (some aspects of) its content. That's half the reason that Google searches work. On book-length documents, if you simply sort words by the ratio of their local frequency to their frequency across a large collection of books, you get top-10 lists like these:
(Those examples came from a collaboration between Harper-Collins and Bell Labs, more than 20 years ago.)
The problem with the word-spotters that Geoff skewers in his Fresh Air piece ("When you're a narcissist, every doorknob becomes a mirror" — ouch) is that they don't evaluate their evidence in a rational or responsible way. If you want to claim that person X uses word Y unusually often, simple rationality requires you to compare X's Y-usage to some other people's Y-usage in comparable situations. (An even more elementary point is that you should actually count things, not just make qualitative assertions about frequency.) And if you want to connect Y-usage to some personality characteristic, you should try to provide an argument beyond mere assertion.
As Geoff said in closing his essay, "counting words isn't very revealing if you aren't listening to them, too".]
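The two calculations described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the Harper-Collins/Bell Labs work: the function names, the crude regex tokenizer, and the add-one smoothing are all choices made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption of this sketch): lowercase runs of
    # letters and apostrophes, so "I'm" stays one token, distinct from "i".
    return re.findall(r"[a-z']+", text.lower())


def distinctive_words(document, reference, top_n=10, smoothing=1.0):
    """Sort the document's words by local frequency / reference frequency."""
    doc_counts = Counter(tokenize(document))
    ref_counts = Counter(tokenize(reference))
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    vocab = len(set(doc_counts) | set(ref_counts))

    def ratio(word):
        local = doc_counts[word] / doc_total
        # Add-one smoothing keeps words absent from the reference
        # collection from causing a division by zero.
        background = (ref_counts[word] + smoothing) / (ref_total + smoothing * vocab)
        return local / background

    return sorted(doc_counts, key=ratio, reverse=True)[:top_n]


def rate_per_thousand(text, targets):
    """Occurrences of any target word per 1,000 tokens of text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in targets)
    return 1000 * hits / len(tokens)
```

To check a claim that person X uses word Y unusually often, one would compute `rate_per_thousand` for X and for several other speakers in comparable situations and compare the rates, rather than reporting a raw count for X alone.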
Spell Me Jeff said,
November 18, 2009 @ 10:35 am
I am also worried about using pronoun counts as an index of unconscious narcissism or populism or what have you. Counts of this sort might make sense for spontaneous speech. But when the speech in question is a speech, a verbal artifact crafted and honed by a variety of voices applying diverse rhetorical theories, what can we really learn?
Even a comparative study would be inconclusive. Let's say we found more FPS pronouns in Nixon's speeches than Kennedy's. Does that say something about the presidents themselves, or is it more a reflection of the rhetorical philosophies of Pat Buchanan and Ted Sorensen?
Mark P said,
November 18, 2009 @ 11:27 am
MYL, the examples you cite make intuitive sense – food-related words in a cookbook. And your last point is what I was driving at: some of these word counters fail to take the first step, which is to demonstrate that the frequency of "I" or "me" in a document or speech really does indicate something about a person's ego irrespective of the context.
Craig Russell said,
November 18, 2009 @ 11:39 am
@myl
Amazon is starting to include information like this in their product descriptions of books. Of course, they can only do this for books that have "Search Inside!" page images available, but for books that have this, Amazon includes a list of "SIPs–Statistically Improbable Phrases," which is exactly what you describe (phrases that are common in this book but uncommon in the corpus) except that it's done by phrase rather than word.
They are also starting to include "Text Stats" (Indexes of readability, number of complex words, average number of syllables per word, words per sentence, etc, and comparisons of all these figures to the rest of the Amazon corpus) and a "Concordance" (list of the 100 most commonly occurring words in the book, when you exclude "common words such as 'of' and 'it.'").
(I notice that last sentence ended with five pieces of punctuation in a row. Did I need both periods?)
Anyway, here are some Amazon SIPs for different books I could think of off the top of my head:
The Communist Manifesto (Marx/Engels)–bourgeois property, modern industry
The Cambridge Grammar of the English Language (Huddleston/Pullum)–modal remoteness, catenative construction, backshifted preterite, subclausal coordination, predicative complement function, focusing modifiers, exclamative phrase, subclausal negation, prenuclear position, relativised element, double underlining marks, exhaustive conditional, 1st person inclusives, connective adjunct, recoverable anaphorically, fused relative, supplementary relatives, formal alternant, parenthesised element, compound determinatives, emphatic polarity, open interrogatives, conditional adjunct, backshifted report, evaluative adjuncts
The Da Vinci Code (Brown)–divine proportion, saint graal, hieros gamos, orb that ought, seeded womb, lettered dials, inlaid rose, lame saint, rosewood box, depository bank, sacred feminine, sweater pocket
Angels and Demons (Brown)–secret archives, camerlengo nodded, antimatter canister, missing cardinals, sixth brand, lingua pura, antimatter technology, lofty quest, earthly tomb, sacred test, media lights, mystic elements
I'm heartbroken at what books they don't have this for–I wanted to present the phrases for Sarah Palin's "Going Rogue" or either of Barack Obama's books, but alas, it was not to be.
Maria said,
November 18, 2009 @ 11:47 am
May I just point out how wonderful it is that "camerlengo nodded" is a SIP in a Dan Brown novel… That's just about the most delightful comment on Dan Brown's prose ever.
language hat said,
November 18, 2009 @ 11:48 am
I greatly enjoyed hearing Geoff's piece on "Fresh Air" and was pleased to hear him name-check Mark, not just because Mark deserves the props (do people still say "props"?), but because I learned that Liberman is pronounced LIBB-erman and not (as I had been thinking) LEE-berman. Now I'll never misspell it again!
[(myl) Yes, I'm one of the lax Libermans, as opposed to the tense ones like Joe and Phil. (That's a feeble joke based on the phonological features "tense" and "lax", which I just realized most readers will probably not understand, …)]
J. Goard said,
November 18, 2009 @ 12:27 pm
Wait, you mean it's not LIE-berman? Sounds like an obvious Limbaughism, if Joe wasn't often on his side.
Ken Brown said,
November 18, 2009 @ 12:30 pm
SIPs obviously make sense for indexing or cataloging books.
Last week (as an exercise to try to teach myself some Python) I wrote a little program to count words and phrases, and tested it on the text of the 1881/85 Revised Version of the AV Bible.
I was pleased to find that the most frequent group of five consecutive words really is "and it came to pass", just like they told us at Sunday School.
Whether that tells you anything about the personality of the translators, or the authors, or even the Lord, is another question entirely.
[(myl) As discussed here, it's not clear what algorithm is used to find Amazon's "statistically improbable phrases", but whatever it is, it's not just n-gram frequency.
More helpfully, if you're learning Python and using it on texts, you should look into NLTK.]
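Ken Brown's program isn't shown, but a count of that kind might look something like the sketch below. It uses only the Python standard library; the function name and the simple regex tokenizer are assumptions made for this example, and the window slides straight across sentence and verse boundaries.

```python
import re
from collections import Counter


def most_common_ngrams(text, n=5, top=3):
    """Return the `top` most frequent runs of n consecutive words."""
    words = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the word list yields every
    # window of n consecutive words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(top)
```

As the reply above notes, raw n-gram frequency of this sort is apparently not what lies behind Amazon's "statistically improbable phrases"; it only finds what is common, not what is distinctive.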
Forrest said,
November 18, 2009 @ 1:32 pm
It sounds like Amazon's SIPs must be some flavor of TF/IDF?
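For readers who haven't met the term, here is a minimal sketch of one common TF-IDF weighting (term frequency times the log of inverse document frequency). Whether Amazon uses anything like it is unknown; the function name, the tokenizer, and the choice of this particular variant are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tf_idf(docs):
    """docs maps a title to its text; returns per-title {word: score}."""
    tokenized = {
        title: re.findall(r"[a-z']+", text.lower())
        for title, text in docs.items()
    }
    # Document frequency: in how many texts does each word appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    n_docs = len(docs)
    scores = {}
    for title, words in tokenized.items():
        counts = Counter(words)
        total = len(words)
        scores[title] = {
            # A word found in every text gets log(1) = 0, so words
            # like "the" drop out without a stop list.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores
```

A word scores highly when it is frequent in one text and rare across the collection, which is the same intuition as the frequency-ratio lists mentioned earlier in the thread.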
mike said,
November 18, 2009 @ 1:34 pm
I've noticed that Prof Nunberg's entry here uses "I" 7 times, but uses "we" only once. Talk about a sure sign that the author isn't a team player! Everyone knows there is no "i" in "team".
Haha.
GN: Yeah, but it don't got a whole lot of w's, neither.
Simon Spero said,
November 18, 2009 @ 1:48 pm
Philip and Joe Lieberman are spelled thusly.
Liberman is the name of a little known libertarian Superhero.
Whilst doing fieldwork deep in the jungles south of Hanoi, mild mannered paleolinguistics student Kim Sandy uncovered the hidden burial mound of James D. McCawley. Whilst exploring the site, he discovered a mysterious piece of blotting paper concealed within a strange volume labeled "The Underground Linguist". Startled by a passing jungle rabbit, Kim accidentally dropped the paper onto his tongue. Shortly thereafter a startling transformation happened, and Liberman was born.
Now, whenever fallacious arguments raise their ugly heads; whenever pomposity threatens the foundations of language itself; whenever homicidal toothbrushes begin a rain of terror; whenever virtual conceptual necessities begin raping and devouring small children, Liberman is there. With the power of logic, Chinese food, and extreme silliness, Liberman will save the day.
Robert T McQuaid said,
November 18, 2009 @ 2:53 pm
I follow efforts by social services agencies to wrest control of children from their parents. I have found several documents produced by such agencies that discuss child care at length, sometimes over a hundred pages. Since these are too long to study every word, I do a quick check. Often the number of occurrences of the word "mother" is zero. The number of occurrences of the word "father" is likewise zero. I deem this to be adequate reason to dismiss the reports as not dealing seriously with the problems of child care.
Sili said,
November 18, 2009 @ 3:44 pm
My dear sir, this is LanguageLog.
That is to say, even I got it.
I'm not happy to hear you called a libertarian superhero, but perhaps I hang out with the wrong crew.
John Cowan said,
November 18, 2009 @ 4:54 pm
Clearly the Libertarians in question are the supporters of Liberman.
John Cowan said,
November 18, 2009 @ 4:54 pm
Just as the Feenomanists are the worshippers of Feenoman (not to be confused with the Feenomanologists, who merely study them both). (h/t Dan Dennett)
Nathan Myers said,
November 18, 2009 @ 9:55 pm
I don't think we need Amazon's help to know that Going Rogue's catch phrase will be "you betcha".
Nathan Myers said,
November 18, 2009 @ 10:03 pm
I'd like to know for future attribution if that brilliant line, "when you're a narcissist, every doorknob becomes a mirror", is original. Geoff? May we quote you saying, also, "to a narcissist, every doorknob is a mirror"?
[Thanks for this, Nathan. Yes, it's mine, or at least if I stole it I've conveniently repressed the fact. Adapt away.]
Garrett Wollman said,
November 18, 2009 @ 11:11 pm
GN: For citation purposes, what's the correct title of your piece? The in-text title is clearly "The I's Don't Have It", but the HTML title is "Counting Words". Which is correct?
"The I's Don't Have It," I guess. On radio you never have to tip your hand.
Philip TAYLOR said,
November 19, 2009 @ 7:35 am
Craig Russell writes : '[…] a "Concordance" (list of the 100 most commonly occurring words in the book, when you exclude "common words such as 'of' and 'it.'").'
and then asks : '(I notice that last sentence ended with five pieces of punctuation in a row. Did I need both periods?)'
I think not. I cannot see what the embedding of the earlier period gains, given that what you quote at that point is merely a fragment of a sentence (albeit the final fragment) rather than a full sentence which would then justify punctuating it as such.
Incidentally, I assume that the very odd punctuation of " 'it.' " is an exact transcription of the original, and is presumably to be blamed on Chicago's insane rule for requiring final punctuation to be placed within the quotation marks rather than at the logical place (i.e., just after the final closing quotation mark).
language hat said,
November 19, 2009 @ 10:35 am
Chicago's insane rule for requiring final punctuation to be placed within the quotation marks rather than at the logical place
This is not an "insane rule," nor is it peculiar to Chicago. It is the standard U.S. rule for punctuation. I'm sorry you don't care for it, but it's not good practice to pretend that whatever you don't like is limited to a few insane people.
Rob Chametzky said,
November 19, 2009 @ 11:23 am
mike said,
November 18, 2009 @ 1:34 pm
I've noticed that Prof Nunberg's entry here uses "I" 7 times, but uses "we" only once. Talk about a sure sign that the author isn't a team player! Everyone knows there is no "i" in "team".
Haha.
GN: Yeah, but it don't got a whole lot of w's, neither.
When fed this line, a youngish Michael Jordan replied (more or less): "Maybe not, but it does have an 'm' and an 'e' ".
W. Kiernan said,
November 19, 2009 @ 2:03 pm
The worst, when it comes to egotism, are those math teachers. I scientifically analyzed a series of lectures on elementary algebra, and wow! They become especially self-obsessed, so I have determined, when they get to the section on complex numbers.
Philip TAYLOR said,
November 20, 2009 @ 7:46 am
I'm sorry if you (Language Hat) are offended by my classification of the "final punctuation must be placed inside final quotation marks, if any" rule as "insane", but if you can offer any logical reason for such a rule, and/or explain how it can be reasonably regarded as a sane rule, then I am more than willing to stand corrected. I would add, in my defence, that this is not solely a British perspective — I know well-educated and highly literate Americans who are equally condemnatory of that rule.