If and some
Last night, I got back from England in time to be faced with a dilemma: the third presidential debate between Barack Obama and John McCain, starting at 9:00 p.m., conflicted with the fifth game of the National League championship series between the Philadelphia Phillies and the Los Angeles Dodgers, starting around 8:30.
Based on past performances, I expected the NLCS to be more exciting than the debate. And there's this nifty method for summarizing debates: for each participant P, rank the words that P uses at least 10 times according to the ratio of P's count to the opponent's count. And CNN publishes an instant debate transcript…
Still, I felt that I should pay at least some attention to what was going on at Hofstra University. So my solution involved a couple of radios, a TV with picture-in-picture, and several sites that were live-blogging one or the other event. In the end, the Phillies won the game 5-1, and will be going to the World Series. What about the debate?
Well, here's John McCain's word list:
Ratio | Word | McCain | Obama |
– | obama | 55 | 0 |
14 | business | 14 | 1 |
12 | wants | 12 | 1 |
11 | wealth | 11 | 1 |
7 | government | 14 | 2 |
5 | united | 10 | 2 |
5 | tell | 10 | 2 |
3.9 | america | 35 | 9 |
3.7 | whether | 11 | 3 |
3.6 | she | 18 | 5 |
3.3 | states | 10 | 3 |
3.3 | him | 10 | 3 |
3.2 | take | 13 | 4 |
3 | i'm | 18 | 6 |
3 | country | 15 | 5 |
2.8 | again | 11 | 4 |
2.6 | joe | 23 | 9 |
2.5 | taxes | 15 | 6 |
2.5 | spending | 15 | 6 |
2.5 | know | 20 | 8 |
And here's Barack Obama's:
Ratio | Word | Obama | McCain |
– | here | 14 | 0 |
– | economic | 12 | 0 |
18 | mccain | 36 | 2 |
12 | policies | 12 | 1 |
10.5 | important | 21 | 2 |
10 | support | 10 | 1 |
10 | crisis | 10 | 1 |
10 | college | 10 | 1 |
7.5 | policy | 15 | 2 |
5.4 | make | 38 | 7 |
5 | provide | 15 | 3 |
5 | doesn't | 10 | 2 |
5 | afford | 10 | 2 |
4.3 | example | 13 | 3 |
4 | think | 56 | 14 |
3.8 | last | 19 | 5 |
3.7 | work | 11 | 3 |
3.5 | insurance | 14 | 4 |
3.1 | if | 44 | 14 |
3.1 | some | 34 | 11 |
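For anyone who wants to try this at home, the ranking behind the two lists above can be sketched in a few lines of Python. This is only a minimal sketch, not the script actually used: the function names, the tokenizing regular expression, and the tie-breaking order are all assumptions, and the threshold is taken to be "at least 10" because both lists include words used exactly 10 times.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase, keeping apostrophes so that
    # forms like "i'm" and "doesn't" stay single tokens.
    return re.findall(r"[a-z']+", text.lower())

def distinctive_words(own_text, other_text, min_count=10, top=20):
    """Rank one speaker's frequent words by own count / opponent's count."""
    own = Counter(tokenize(own_text))
    other = Counter(tokenize(other_text))
    rows = []
    for word, n in own.items():
        if n < min_count:
            continue  # keep only words the speaker used at least min_count times
        m = other[word]
        # A word the opponent never used has no finite ratio (shown as a dash above).
        ratio = n / m if m else float("inf")
        rows.append((ratio, word, n, m))
    # Highest ratio first; ties broken by the speaker's own count, then alphabetically.
    rows.sort(key=lambda r: (-r[0], -r[2], r[1]))
    return rows[:top]
```

Calling `distinctive_words(mccain_text, obama_text)` and then swapping the arguments would give the two lists, where `mccain_text` and `obama_text` are hypothetical strings holding each speaker's turns from the transcript.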
In the abstract, "Obama business wants wealth" vs. "here economic McCain policies" seems like a plausible account of a debate between these two men. Alas, this misses (what most people took to be) the big stories of the evening.
Luckily, it's all on YouTube, including pre-digested thematic excerpts or collections, from the size of Joe the Plumber's health-care fine to Senator McCain's "I'm not Bush" line, the "health of the mother" exchange, the whole exasperation factor, and so on.
Still, I think my trivial word lists are not entirely without interest. In particular, I'm curious about something less down-to-earth than plumber: why did Barack Obama use the little words "if" and "some" more than three times as often as John McCain did?
Here's some data: the sentences using "if" from the CNN transcript for Obama and for McCain, and the "some" sentences for Obama and for McCain.
Here's some more data, a comparison of counts across all three debates for "if" (at least according to the CNN transcripts):
(if) | Obama | McCain |
Debate 1 | 20 | 19 |
Debate 2 | 37 | 22 |
Debate 3 | 44 | 14 |
And for "some":
(some) | Obama | McCain |
Debate 1 | 25 | 6 |
Debate 2 | 22 | 20 |
Debate 3 | 34 | 11 |
Given a set of observations like this, we could come to several different sorts of conclusions.
Maybe it's a meaningless, random statistical fluctuation. After all, there are lots of words, and people vary randomly in how often they choose different words on different occasions, and the way I've gone about this analysis is likely to turn up some differences that arise purely by chance.
Then again, maybe the difference (between individuals or across occasions) is real, but reflects a stylistic difference in the way messages are framed (e.g. "If we want to do X, we need Y" versus "In order to do X, we need Y"), rather than a difference in the underlying distribution of messages. If the difference is a stylistic one, it might be a stable feature of the different individuals involved, or it might reflect a more temporary priming effect, whether lexical or semantic or rhetorical.
Or perhaps the observation reflects a genuine difference in the kinds of ideas that the two candidates are presenting, or at least the spin they want to put on these ideas.
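One rough way to get a feel for the first possibility is a chi-square test on the two small tables above, asking only whether the Obama/McCain split for each word is about the same in all three debates. The sketch below uses nothing but the counts already given. It is an illustration under loud assumptions: it ignores how many words each man spoke in total, and because "if" and "some" were picked out precisely for looking extreme, the resulting p-values are more optimistic than they should be.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for row, r_tot in zip(table, row_totals):
        for observed, c_tot in zip(row, col_totals):
            # Expected count if the speaker split were the same in every debate.
            expected = r_tot * c_tot / grand
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    # With exactly 2 degrees of freedom (3 debates x 2 speakers),
    # the chi-square tail probability is exp(-x / 2).
    return math.exp(-stat / 2)

# (Obama, McCain) counts per debate, copied from the tables above.
IF_COUNTS = [(20, 19), (37, 22), (44, 14)]
SOME_COUNTS = [(25, 6), (22, 20), (34, 11)]

for name, table in [("if", IF_COUNTS), ("some", SOME_COUNTS)]:
    stat = chi_square(table)
    print(name, round(stat, 2), round(p_value_df2(stat), 3))
```

A small p-value here would say only that the split moved around from debate to debate, which fits a priming or occasion effect at least as well as a stable personal style.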
What do you think? Please try, as we professors say, to be specific and to give reasons for your answer.
ben said,
October 16, 2008 @ 8:55 am
It looks like you swapped (or forgot to swap) "McCain" and "Obama" in the table for Sen. Obama's word list.
[(myl) Oops, sorry. Fixed now.]
ben said,
October 16, 2008 @ 9:00 am
It's tough to determine implication from "if" given that a -> b is equivalent to a v !b.
Some code that counts possible logical forms from natural language sounds like a fun weekend project…
anonymous said,
October 16, 2008 @ 9:44 am
Not to be too unscientific, but "if" and "some" are conditional, hedging type words. If one were to jump to any conclusion based on this data bite, it might be that Obama likes to qualify his statements. It's actually something that McCain brought up during the debate. About half way through he says, "Well, you know, I admire so much Sen. Obama's eloquence. And you really have to pay attention to words. He said, we will *look at* offshore drilling. Did you get that? Look at." I'm a big fan of Obama, but it seems McCain may be on to something.
Mark P said,
October 16, 2008 @ 10:26 am
Some people familiar with Obama from prior to the presidential race have said that he is a reflective person, one who likes to think through issues before making decisions. The use of "if" seems consistent with that. The use of "some" would also seem to be consistent with that. But I'm afraid that is a subjective judgement on my part.
Betsy said,
October 16, 2008 @ 10:50 am
I think this is a stylistic issue. I think that McCain may be under the impression that qualifying statements make one look "weak", and chooses to avoid using words like "if" and "some" in favour of unqualified statements that seem more "assertive".
Whereas, Obama prefers to be accurate in what he says, and so uses qualifiers and conditionals where appropriate for the concepts he is expressing, regardless of the perception of "weakness", which I think is false anyway.
Another comparison in your word list that I think is interesting is the contrast between the uses of "know" and "think". McCain favours "know" while Obama favours "think". This seems to go along with this idea that I've heard before from some writers, that one is supposedly perceived to be stronger, even when the other is more accurate. A quick Google search turns up others who have pointed out the "stronger" quality of "know" vs. "think".
I freely admit to an intellectual bias here, and "think" doesn't bother me. And I see "know" where it's not appropriate as a sign of an ideologue. Of course, to be fair, it could just be a habit, along with "if" and "some" avoidance, that he picked up from listening to other ideologues that he hangs out with. :)
Chris said,
October 16, 2008 @ 1:40 pm
Or to put it in Rumsfeldian terms, Obama has a lot of known unknowns, while McCain has (or is willing to admit to) very few known unknowns (but for precisely that reason, is likely to have a truckload of unknown unknowns).
I think it's not just a linguistic game but a deep difference in their ways of thinking – being aware of the limitations of one's own knowledge vs. believing that certainty is a virtue even when you're wrong. (I should perhaps disclose that I'm an Obama supporter, and that I believe that everyone makes some mistakes, so lack of awareness of your own mistakes probably does not mean you just aren't making any.)
Robert F said,
October 16, 2008 @ 3:08 pm
>> It's tough to determine implication from "if" given that a -> b is equivalent to a v !b.
This is sort of off topic, but A v !B should be !A v B. You may examine the truth tables if you feel the need for verification.
However, "if" doesn't usually seem to represent logical implication, at least in the sense of classical logic.
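The truth-table check suggested above takes only a few lines. This is a minimal sketch; the helper name `implies` is just an illustrative choice, defined directly from the truth table of material implication.

```python
from itertools import product

def implies(a, b):
    # Material implication: false only when a is true and b is false.
    return not (a and not b)

ROWS = list(product([False, True], repeat=2))

# Print the full table: a, b, a -> b, !a v b, a v !b
for a, b in ROWS:
    print(a, b, implies(a, b), (not a) or b, a or (not b))

# Compare each candidate formula with implication on every row.
matches_not_a_or_b = all(implies(a, b) == ((not a) or b) for a, b in ROWS)
matches_a_or_not_b = all(implies(a, b) == (a or (not b)) for a, b in ROWS)
```

Only the first comparison holds on all four rows; `a v !b` is instead the table for b -> a.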
Mike Geis said,
October 16, 2008 @ 3:16 pm
After the first Mondale v. Reagan debate, I counted up the sentence lengths of the two men and found that Mondale came in at around 22-26 words per sentence. Reagan was all over the map, from just 2 or 3 words into the 50's. This I took to reflect the difference between an orderly mind and manner of expression and a disorderly mind and manner of speaking. My research on Reagan's press conferences in preparing my book on the language of politics had made clear that Reagan was usually pretty inarticulate when speaking extemporaneously. This didn't show up in his debates with Carter (recollection here, not research) because he had no administration to defend and could just recite his General Electric speeches. The Great Communicator he was not unless reading or reciting.
I wonder if there were any sentence level differences between McCain and Obama?
Forrest said,
October 16, 2008 @ 5:55 pm
A recent TED Talk on emergence has a quick "word bursts" analysis from three State of the Union addresses: 1860, 1935, and 1985. The results are along the lines of Mike Geis's comment.
In 1860, the most common words were: slaves, emancipation, slavery, rebellion, and Kansas.
In 1935, the most common words were: relief, depression, recovery, banks.
In 1985, the most common words were: that's, we're, there's, we've, it's.
To echo the talk, it's no surprise that slavery was a hot topic in 1860. The real story here isn't how priorities have changed; it's about how language is being used in a much … less specific manner.
Mark Liberman said,
October 16, 2008 @ 6:18 pm
@Forrest: Do you have a citation for this claim?
It seems quite unlikely to be true, on general grounds.
More specifically, the SOU address in 1860 was given by James Buchanan, and the commonest five words in fact were "the", "of", "to", "and", and "in".
The SOU address in 1935 was given by FDR, and the commonest five words were in fact "the", "of", "to", "and", and "in".
The SOU in 1985 was given by Ronald Reagan, and the commonest five words were "the", "and", "to", "of", and "in".
So there was a difference — a re-ordering among some of the common function words — but nothing at all like the claimed change in specificness.
And a quick search of the TED site (e.g. for "state of the union") didn't turn up anything relevant.
Leon said,
October 16, 2008 @ 6:26 pm
It's also interesting that McCain used "whether" more, since this also has a conditional meaning of sorts. It would be interesting to compare McCain's use of "whether" with Obama's use of "if".
To phrase it a little more neutrally (I'm not American), you could say that McCain is more of a "conviction" politician, and his appeals focus on character rather than policy exposition.
I think the "if" sentences might be evidence for this — Obama's sentences seem more like a discussion with the audiences, whereas McCain seems to be trying to "tell it like it is".
But "some" was frequently used in its number sense, rather than its restrictive/qualifying sense. For example, take this Obama sentence:
The implication here is not "there are some challenges which have to be dealt with, and some which don't". The "some" could reasonably be elided without much change in meaning. So not every instance of "some" indicates more qualified statements — I think it's more indicative of Obama's tone.
fred lapides said,
October 16, 2008 @ 6:37 pm
I am not a linguist, but I see nothing in this but speculation: perhaps, might be, etc.
More telling: facial expressions during the debate on a split screen.
Forrest said,
October 16, 2008 @ 10:26 pm
@Mark: Here's a citation, a TED Talk called "The City and the Web."
Philip Spaelti said,
October 17, 2008 @ 3:37 am
I did some further digging on @Forrest's claim.
First I checked Forrest's link. (The bit about the word bursts is around 12:50.) Clearly that talk seems to treat data as an opportunity for jokes. At any rate the info in the talk is not really any more informative than what Forrest already posted. (What is the source?)
Then I tried analyzing the speeches myself (following Mark's links). I don't know anything about "word burst" technology. I just ran them through some text analysis macros. My results are here.
To make a long story short, I think the grain of truth is that one difference between the Reagan speech and the older ones is that it contains abbreviated forms ("we're", etc.) where the older speeches don't. This has presumably as much to do with changes in writing style and archiving as anything else.
If the analysis is done with a quick macro (like mine) which eliminates the function words, but leaves the abbreviated forms in, the latter will show up in the word count. And these forms have high frequencies compared to content words, so they are fairly high in the list. At this point, this is now a story about poorly implemented analytic techniques, not about changes in speech styles.
Even so this story is based on some serious cherry picking and rearranging of the facts. For example "that's" appears in Reagan's speech exactly once. "Emancipation" does not appear in Buchanan's speech at all!
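The artifact described above is easy to reproduce. The sketch below is not the macro I used; the stoplist and the two sample sentences are invented for illustration. The point is only that a stoplist containing "we" and "are" but not "we're" lets the contracted forms through to compete with content words.

```python
import re
from collections import Counter

# Hypothetical stoplist: full function words only, no contracted forms.
STOPLIST = {"the", "of", "to", "and", "in", "we", "are", "that", "is", "it", "a"}

def content_word_counts(text):
    # Keep an internal apostrophe so "we're" and "it's" are single tokens.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(t for t in tokens if t not in STOPLIST)

# Invented samples: the same wording, written without and with contractions.
older = "We are a nation and it is the duty of the nation to act."
newer = ("We're a nation and it's the duty of the nation to act. "
         "We're ready and it's time.")

print(content_word_counts(older).most_common(3))
print(content_word_counts(newer).most_common(3))
```

In the second sample the contracted forms tie with the most frequent content word, purely because of how the text was written down and filtered.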
L Sewell said,
October 17, 2008 @ 3:49 am
The Commission on Presidential Debates has mendaciously organized debates to coincide with large sporting events in the past in order to reduce viewership numbers.
It's worth reading more about how this Commission manipulates the debating process:
http://www.amazon.com/No-Debate-Parties-Secretly-Presidential/dp/1583226303
Mark Liberman said,
October 17, 2008 @ 5:21 am
@Forrest: Thanks for the link.
Philip Spaelti: … the grain of truth is that one difference between the Reagan speech and the older ones is that it contains abbreviated forms ("we're", etc.) where the older speeches don't. This has presumably as much to do with changes in writing style and archiving as anything else. … Even so this story is based on some serious cherry picking and rearranging of the facts.
Yes, I think you're exactly right. But one addendum: the "word burst" idea is based on a comparison of local word frequencies to frequencies estimated for earlier time periods, using sophisticated HMM-like algorithms described originally in Jon Kleinberg, "Bursty and Hierarchical Structure in Streams", ACM SIGKDD 2002. I was confused by Forrest's phrase "the commonest words were …" — though this is a natural misinterpretation of Steven Johnson's TED talk — Kleinberg's algorithm is ranking words according to an estimate of how much their underlying frequency has changed, not what their underlying frequency is.
In this case, however, these sophisticated algorithms have apparently been used to discover a fact about the acceptability of orthographic contractions, which has nothing to do with any putative change in degree of generality of SOU addresses. The basis for Johnson's discussion is Kleinberg's list of "the 150 term bursts of highest weight in Presidential State of the Union Addresses, 1790-2002" (with some additional restrictions and parametric settings). If Johnson attributes "emancipation" to 1860 (I haven't checked) it's his fault, not Kleinberg's, since the original list gives 1862-1864 as the time period of this word's burst (and even that presumably involves some temporal smoothing).
What is probably the same material is presented in an essay that Johnson wrote for Discover Magazine in 2003, "Are computers better qualified than humans to grade student essay exams?". The interpretation that he gives there is not in terms of specificness, but rather folksiness:
JKG said,
October 17, 2008 @ 6:24 pm
you all should check out artist Luke Dubois' Hindsight Is Always 20/20 exhibit. he analyzes the most used words from every president's state of the union address and presents them in descending order as in an eye exam chart.
http://hindsightisalways2020.net/
Mark Liberman said,
October 17, 2008 @ 6:49 pm
JKG: he analyzes the most used words from every president's state of the union address…
As in the case that Forrest cited, whatever Dubois did, it *wasn't* simply listing the words in order of frequency, since that would result in *every* list being headed by "the".
His explanation says that "words that appear in the majority of speeches … are cancelled out". It's not clear whether this is meant literally (i.e. words that appear in more than 50% of the SOUs get a weight of 0, others get a weight of 1), or whether he used software that imposed some variant of the standard TF-IDF weighting used in information retrieval.
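To make the two readings concrete, here is a minimal sketch of each. It is not a reconstruction of whatever Dubois's software does: the three toy "speeches" are invented, the function names are hypothetical, and the TF-IDF formula shown (count times log of N over document frequency) is just one common textbook variant.

```python
import math
from collections import Counter

def doc_freq(docs):
    # Number of documents in which each word appears at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def majority_cutoff_scores(doc, docs):
    # Literal reading: weight 0 if the word is in more than half the documents, else 1.
    df = doc_freq(docs)
    return {w: (0 if df[w] > len(docs) / 2 else 1) * n
            for w, n in Counter(doc).items()}

def tf_idf_scores(doc, docs):
    # Graded reading: rarer words get larger weights; doc must be one of docs.
    df = doc_freq(docs)
    return {w: n * math.log(len(docs) / df[w])
            for w, n in Counter(doc).items()}

# Invented toy corpus standing in for three addresses.
DOCS = [
    "the union the slavery kansas war".split(),
    "the union relief recovery the war".split(),
    "the union the budget freedom".split(),
]

print(majority_cutoff_scores(DOCS[0], DOCS))
print(tf_idf_scores(DOCS[0], DOCS))
```

Both schemes zero out a word that appears in every document, which is why "the" never heads the list. They differ on a word like "war" in this toy corpus, which appears in two of the three documents: the hard cutoff discards it, while TF-IDF keeps it with a small weight.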
Forrest said,
October 17, 2008 @ 9:18 pm
@Mark: Thanks for clarifying what "word bursts" actually means. I assumed this was his way of describing a term extraction. It sounds a little closer to a "TF/IDF" analysis – words that are used more often in an address than in general speech, not simply words that are common in a state of the union.
Sorry for pulling everybody off track…
Kevin Iga said,
October 18, 2008 @ 3:13 pm
The words "if" and "some" can also be used when making very assertive and unqualified statements:
"Some people may like rap music, but quite frankly, some people are idiots."
"If any country wants to mess with us, we'll send them back in body bags."
"My opponent doesn't know if he's for it or against it."
"I love America not just some of the time, but all of the time."
"I know some people who are very upset with my opponent's proposal."
"I don't know if you're from the same planet as the rest of us are from, but that idea makes no sense."
"I want to know if my opponent will apologize for slandering my name."
"My opponent seems to think Americans don't care about the truth. Well, some Americans do: I do."
I think we have to look more carefully at the uses of "if" and "some" to determine if they have the kind of qualifying (reflective?) force suggested by Mark P, anonymous, and others above.
Anonymous Cowherd said,
October 20, 2008 @ 8:26 pm
@Forrest and later: "In 1860, the most common words were: slaves, emancipation, slavery, rebellion, and Kansas. In 1935, the most common words were: relief, depression, recovery, banks." This isn't a matter of misunderstanding, or cherry-picking data; somebody is just flat out lying. Buchanan in 1860 never used the word "emancipation" (nor any word containing the letter sequence "emanc"); Roosevelt in 1935 never used the word "banks" (nor any word containing the letter sequence "bank").