## Obama's favored (and disfavored) SOTU words

Lane asked "It would be great if someone had time to find some truly Obama signature phrases, doing the math properly. I'd be curious to know what words he actually does use unusually often."

I have two classes to prepare for today, and a student study break to get ready for (bread and cheese, fruits and nuts, chips and dips, cakes and candies etc., but mostly cleaning up the living room…). So I don't have time to work on the "truly signature phrases" problem — that's a hard problem to solve on the basis of a sample as small as a few years of SOTU messages, anyhow.  But there's one thing that I do have time for: calculating the words (or rather, the lexical tokens) that are characteristic of Obama's SOTU messages in contrast to the other post-war SOTUs, against the background of all SOTUs since 1790.

To do this, I used the "weighted log-odds-ratio, informative Dirichlet prior" algorithm described on pp. 387-8 of Monroe, Colaresi & Quinn, "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict", Political Analysis 2009. (Tip of the hat to Dan Jurafsky, who told me about this algorithm a couple of years ago.)

The basic idea here is that we have two "lexical histograms" (i.e. word-count lists), taken from two sources X and Y whose patterns of usage we want to contrast.  If we just compare naively estimated rates of usage, we're going to end up with a bunch of unreliable comparisons between small counts, say comparing a word that X uses once and Y doesn't use at all, or vice versa. We want to take account of the likely sampling error in our counts, discounting differences that are probably just an accident, and enhancing differences that are genuinely unexpected given the null hypothesis that both X and Y are making random selections from the same vocabulary.

There are many different ways of approaching this problem.  Monroe et al. survey several different methods, and the one that I've used is (in my opinion as well as Dan's) a nice balance between effectiveness and ease of application.

In my implementation of the algorithm, you prime the pump with three lexical histograms: source X, source Y, and some relevant background source Z. Then if you give the program a word, it determines a score (that "weighted log-odds ratio"), where positive values mean that the word is favored by source X, 0 means that the word is neutral between X and Y, and negative values mean that the word is favored by source Y.
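As a rough illustration of the scoring step, here is a minimal Python sketch of a weighted log-odds ratio with an informative Dirichlet prior, following the formulas in Monroe et al. It is not the program used for this post: the function name `weighted_log_odds`, the `prior_scale` parameter (how many pseudo-counts the background source Z contributes in total), and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def weighted_log_odds(word, x_counts, y_counts, z_counts, prior_scale=10.0):
    """Return a z-scored log-odds ratio for `word` in source X versus source Y.

    Positive values mean X favors the word, negative values mean Y does.
    """
    n_x = sum(x_counts.values())
    n_y = sum(y_counts.values())
    n_z = sum(z_counts.values())

    # Prior pseudo-count for this word: its background rate in Z times
    # prior_scale (an assumed tuning knob, not a value taken from the post).
    alpha_w = prior_scale * z_counts.get(word, 0) / n_z
    alpha_0 = prior_scale  # the pseudo-counts sum to prior_scale over Z's vocabulary
    if alpha_w == 0:
        raise ValueError(f"{word!r} does not occur in the background source Z")

    x_w = x_counts.get(word, 0)
    y_w = y_counts.get(word, 0)

    # Smoothed log-odds of the word within each source.
    log_odds_x = math.log((x_w + alpha_w) / (n_x + alpha_0 - x_w - alpha_w))
    log_odds_y = math.log((y_w + alpha_w) / (n_y + alpha_0 - y_w - alpha_w))
    delta = log_odds_x - log_odds_y

    # Approximate variance of the difference; small counts give a large variance,
    # which shrinks the score toward zero.
    variance = 1.0 / (x_w + alpha_w) + 1.0 / (y_w + alpha_w)
    return delta / math.sqrt(variance)


# Toy histograms (invented numbers, only to show the sign convention).
x = Counter({"jobs": 30, "the": 400, "peace": 2})
y = Counter({"jobs": 10, "the": 600, "peace": 40})
z = x + y  # here simply the pooled counts; the post uses a larger background

for w in ("jobs", "the", "peace"):
    print(w, round(weighted_log_odds(w, x, y, z), 3))
```

Dividing by the estimated standard deviation is what keeps a word used once by X and never by Y from outranking a word with a smaller relative difference but much larger counts.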

In this case, I took X as Barack Obama's five SOTU addresses so far (41,508 "words", as per my tokenization), Y as the SOTU addresses of all other presidents since WWII (Truman through George W. Bush), and Z as all SOTU addresses from 1790 to 2014 inclusive.  I then fed in all the words in SOTU addresses since Truman, Obama included, and sorted the results according to the weighted log-odds ratio. Here's the positive end of the list (i.e. tokens favored by Obama). Each line presents the word, its count (and rate per million words) in X, Y, and Z in turn, and the score:

WORD XCount (XPerMillion) YCount (YPerMillion) ZCount (ZPerMillion) SCORE

's 471 (11347.2) 1573 (4146.14) 2742 (1543.75) 17.010
jobs 151 (3637.85) 289 (761.751) 462 (260.106) 14.320
why 85 (2047.8) 75 (197.686) 248 (139.624) 13.631
businesses 70 (1686.42) 75 (197.686) 155 (87.2649) 12.106
that 912 (21971.7) 4629 (12201.2) 21793 (12269.5) 11.969
get 98 (2360.99) 171 (450.725) 362 (203.806) 11.797
i'm 61 (1469.6) 79 (208.23) 143 (80.509) 10.703
don't 56 (1349.14) 82 (216.137) 146 (82.198) 9.792
can't 44 (1060.04) 54 (142.334) 104 (58.552) 9.226
we'll 42 (1011.85) 61 (160.785) 105 (59.115) 8.528
like 83 (1999.61) 199 (524.528) 664 (373.832) 8.500
innovation 24 (578.202) 13 (34.2656) 45 (25.335) 7.982
republicans 27 (650.477) 24 (63.2596) 54 (30.402) 7.872
college 41 (987.761) 70 (184.507) 139 (78.257) 7.755
kids 28 (674.569) 29 (76.4387) 59 (33.217) 7.739
because 114 (2746.46) 399 (1051.69) 1015 (571.445) 7.594
what 128 (3083.74) 462 (1217.75) 1476 (830.988) 7.518
we've 64 (1541.87) 175 (461.268) 256 (144.128) 7.505
democrats 25 (602.294) 24 (63.2596) 51 (28.713) 7.450
we're 62 (1493.69) 169 (445.453) 237 (133.431) 7.431

Here's the other end of the list, i.e. the words that Obama has tended to use significantly less than other postwar presidents (according to this algorithm):

the 1840 (44328.8) 23017 (60668.6) 150561 (84765.8) -8.942
of 1022 (24621.8) 13939 (36740.7) 97314 (54787.7) -8.140
must 53 (1276.86) 1583 (4172.5) 3202 (1802.72) -7.307
in 651 (15683.7) 8433 (22227.8) 38390 (21613.6) -6.210
peace 8 (192.734) 670 (1766) 1888 (1062.94) -5.732
program 16 (385.468) 618 (1628.93) 740 (416.62) -5.293
federal 22 (530.018) 737 (1942.6) 1421 (800.023) -5.245
freedom 8 (192.734) 472 (1244.11) 718 (404.234) -4.961
which 18 (433.651) 1072 (2825.6) 12468 (7019.48) -4.725
economic 21 (505.927) 614 (1618.39) 891 (501.633) -4.694
billion 9 (216.826) 425 (1120.22) 466 (262.358) -4.641
nations 16 (385.468) 601 (1584.13) 1787 (1006.08) -4.594
world 82 (1975.52) 1369 (3608.43) 2409 (1356.27) -4.536
free 17 (409.56) 554 (1460.24) 1242 (699.246) -4.404
soviet 1 (24.0917) 273 (719.578) 276 (155.388) -4.277
national 17 (409.56) 566 (1491.87) 2124 (1195.81) -4.080
programs 16 (385.468) 440 (1159.76) 473 (266.299) -3.976
development 5 (120.459) 287 (756.479) 622 (350.186) -3.695
hope 7 (168.642) 324 (854.005) 831 (467.853) -3.663
be 175 (4216.05) 2499 (6586.91) 18787 (10577.1) -3.550

The whole list is here. [Update — I noticed a procedural error, now corrected, so the lists given above are slightly changed. That's what I get for doing experiments on my way out the door…]
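For anyone who wants to work with rows in the format shown above, here is a small Python sketch of a parser. The names `Row`, `ROW_PATTERN`, and `parse_row` are hypothetical, and the pattern assumes every row looks exactly like the ones printed in this post (single spaces, plain decimal numbers).

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    word: str
    x_count: int
    x_per_million: float
    y_count: int
    y_per_million: float
    z_count: int
    z_per_million: float
    score: float


# WORD XCount (XPerMillion) YCount (YPerMillion) ZCount (ZPerMillion) SCORE
ROW_PATTERN = re.compile(
    r"^(\S+) (\d+) \(([\d.]+)\) (\d+) \(([\d.]+)\) (\d+) \(([\d.]+)\) (-?[\d.]+)$"
)


def parse_row(line):
    """Split one line of the list into typed fields."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    word, xc, xpm, yc, ypm, zc, zpm, score = match.groups()
    return Row(word, int(xc), float(xpm), int(yc), float(ypm),
               int(zc), float(zpm), float(score))


row = parse_row("jobs 151 (3637.85) 289 (761.751) 462 (260.106) 14.320")
print(row.word, row.x_count, row.score)
```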

I don't have time to discuss the results further right now, but it's clear that we're looking at a mixture of  effects that are stylistic ('s vs. of, the ongoing decline of which and the, contractions vs. uncontracted forms, …), effects that are rhetorical (why vs. must, …),  and effects that are topical (jobs vs. peace,  …).

Update: the same method, this time comparing Obama (2009-2014 SOTUs) against Bush 43 (2001-2008 SOTUs). Most Obama-ish words:

's 471 (11347.2) 177 (4352.64) 2742 (1543.75) 10.110
that 912 (21971.7) 461 (11336.5) 21793 (12269.5) 9.032
jobs 151 (3637.85) 38 (934.465) 462 (260.106) 7.237
why 85 (2047.8) 8 (196.729) 248 (139.624) 6.459
but 241 (5806.11) 83 (2041.07) 5619 (3163.5) 6.345
what 128 (3083.74) 35 (860.691) 1476 (830.988) 5.841
get 98 (2360.99) 25 (614.779) 362 (203.806) 5.755
i'm 61 (1469.6) 8 (196.729) 143 (80.509) 5.333
don't 56 (1349.14) 5 (122.956) 146 (82.198) 5.256
businesses 70 (1686.42) 15 (368.868) 155 (87.2649) 5.230
it 382 (9203.05) 196 (4819.87) 15456 (8701.72) 5.031
can't 44 (1060.04) 3 (73.7735) 104 (58.552) 4.688
deficit 49 (1180.5) 7 (172.138) 254 (143.002) 4.627
how 62 (1493.69) 15 (368.868) 479 (269.677) 4.450
let 102 (2457.36) 42 (1032.83) 730 (410.99) 4.309
college 41 (987.761) 6 (147.547) 139 (78.257) 4.281
now 166 (3999.23) 79 (1942.7) 3125 (1759.37) 4.256
do 149 (3589.67) 72 (1770.56) 1938 (1091.09) 4.206
companies 33 (795.027) 3 (73.7735) 198 (111.474) 3.979
financial 33 (795.027) 2 (49.1823) 344 (193.672) 3.962

Most W-ish words:

must 53 (1276.86) 188 (4623.14) 3202 (1802.72) -6.830
and 1406 (33873) 1884 (46329.8) 60436 (34025.4) -6.804
iraq 16 (385.468) 95 (2336.16) 129 (72.627) -6.627
freedom 8 (192.734) 84 (2065.66) 718 (404.234) -6.335
terrorists 10 (240.917) 73 (1795.16) 95 (53.485) -5.959
terror 1 (24.0917) 55 (1352.51) 93 (52.359) -5.049
weapons 10 (240.917) 57 (1401.7) 239 (134.557) -4.978
iraqi 3 (72.2752) 49 (1204.97) 54 (30.402) -4.932
security 45 (1084.13) 110 (2705.03) 1020 (574.26) -4.599
yet 12 (289.101) 60 (1475.47) 940 (529.22) -4.390
social 9 (216.826) 47 (1155.79) 487 (274.181) -4.154
is 360 (8673.03) 517 (12713.6) 16860 (9492.17) -4.113
hope 7 (168.642) 46 (1131.19) 831 (467.853) -3.995
terrorist 3 (72.2752) 31 (762.326) 37 (20.831) -3.954
in 651 (15683.7) 856 (21050) 38390 (21613.6) -3.923
enemy 0 (0) 27 (663.962) 239 (134.557) -3.888
america 164 (3951.05) 245 (6024.84) 1658 (933.453) -3.857
liberty 1 (24.0917) 27 (663.962) 318 (179.034) -3.703
relief 4 (96.367) 30 (737.735) 371 (208.873) -3.495
peace 8 (192.734) 50 (1229.56) 1888 (1062.94) -3.484

The whole list is here.

1. ### GeorgeW said,

January 29, 2014 @ 8:38 am

I didn't see "incredibly" on the list. It is my sense that he generally uses this more than some norm. However, if true, this may be more in extemporaneous speech than in formal addresses carefully crafted, at least in part, by others.

[(myl) As far as I can tell, incredibly has never been used by any president in any SOTU message. I can't find it in any of Obama's weekly radio addresses either, at least not in the sample (127 weeks) that I have lying around. Can you point to some examples?]

2. ### Coby Lubliner said,

January 29, 2014 @ 11:25 am

Wouldn't there be an effect due to diachronically different ways of getting the text of the address? Wouldn't, for example, a direct transcript of the speech as delivered have different word frequencies, especially in relation to contractions and other colloquialisms, than an advance text?

[(myl) It's certainly often true that speeches as delivered are a bit more colloquial than the advance texts are. However, when I went back to the audio or video of SOTU addresses by FDR, Truman, Eisenhower, Nixon, Kennedy, etc., I expected to find some spoken contractions (for example) that were not in the written version, but none turned up. ]

3. ### James said,

January 29, 2014 @ 11:55 am

I think it's really striking that I'm, we'll, and we've have high scores, but the bare first-person pronouns don't. This suggests (to me, anyway) that it's stylistic — those contractions are a bit folksier than can't or let's — and not semantic. The very high score for 's might support that, assuming it's more contraction than possessive.

[(myl) It's clear that there's a strong trend in formal writing towards greater use of contractions. But there's also a trend towards use of 's in place of of — here's some data from COHA:

FWIW, in the 2014 SOTU address, I count 65 's contractions and 31 's possessives.]

4. ### cs said,

January 29, 2014 @ 12:49 pm

Now I'm trying to form a speech out of Obama's overused words, like:

Why don't republicans get college kids? Because jobs!! That's what democrats like.

5. ### Quicksand said,

January 29, 2014 @ 12:54 pm

James said (above):
I think it's really striking that I'm, we'll, and we've have high scores, but the bare first-person pronouns don't.

This clearly brings the whole analysis into question, because everybody* knows that Obama is self-absorbed, can't stop talking about himself, and uses the first-person pronouns more than any English speaking person in history.

* well, a certain significant subset of "everybody" that tends not to hang around here

6. ### D.O. said,

January 29, 2014 @ 2:06 pm

The average of Obama's scores is +1.46 (I took the whole list, of course) and the distribution is fairly bell-shaped. A positive average score probably means that President Obama continues (on average) the trends initiated by the postwar presidents relative to previous ones.

And some counts are clearly the mixture of different trends. Here's kids vs. children
kids 28 (674.569) 29 (76.4387) 50 (31.6843) 7.761
children 37 (891.394) 400 (1054.33) 575 (364.37) -0.842
child 21 (505.927) 146 (384.829) 199 (126.104) 1.023

Obama speaks way more about the youngsters than either his close or far-in-the-past predecessors, but he is also on a clear drift toward informality. And the latter trend is even overpowering the first one for the use of "children".

There are clearly some important words (like Afghanistan) which didn't make it into the list, presumably because earlier presidents never used them. There should be a way to account for them, though. I can't suggest one, given my very limited familiarity with the method used.

[(myl) Afghanistan is one of the cases rescued by my correction of an unfortunate procedural error — it should be (and now is):

afghanistan 19 (457.743) 68 (179.236) 87 (48.981) 3.282

And in the corrected list, the average is now 0.848, which is still in the direction you point, just not as far out.

Sorry!]

7. ### GeorgeW said,

January 29, 2014 @ 2:28 pm

myl: Sorry, I think it is "incredible" (not "incredibly") that I hear him use a lot. An example from the recent SOTU: ". . . and we don’t resent those who, by virtue of their efforts, achieve incredible success."

But, I could well be suffering a frequency illusion.

[(myl) Here's the tally for postwar SOTU speeches:

incredible 4 (96.367) 3 (7.90745) 13 (7.319) 3.013

That is, Obama used it 4 times, for a rate of 96 per million; the other postwar presidents (Truman through W) used this word at a rate of about 7.9 per million, and the rate across all SOTU addresses since 1790 (Obama's included) is about 7.3 per million. So he does use it a lot more frequently than others, but just not very frequently in absolute terms.]
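The per-million rates in these tallies are simple arithmetic, which a few lines of Python can check. This is a sketch rather than the code behind the post: `per_million` is a hypothetical helper, and the two token totals are the figures reported elsewhere on this page (41,508 tokens for Obama's addresses, 1,776,200 for all SOTU messages).

```python
def per_million(count, total_tokens):
    """Occurrences per million tokens."""
    return 1_000_000 * count / total_tokens


OBAMA_TOKENS = 41_508         # size of X, as reported in the post
ALL_SOTU_TOKENS = 1_776_200   # size of Z, as reported in the comments

# "incredible" in Obama's addresses, and "'s" across all SOTU messages
print(round(per_million(4, OBAMA_TOKENS), 3))
print(round(per_million(2742, ALL_SOTU_TOKENS), 3))
```

With these totals the two calls reproduce the 96.367 and 1543.745 figures quoted in the tallies.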

8. ### Tyler Schnoebelen said,

January 29, 2014 @ 3:04 pm

Here are Justice Kennedy's signature phrases from oral arguments:

http://idibon.com/justice-kennedy-speaking-patterns/

9. ### Rubrick said,

January 29, 2014 @ 5:45 pm

Kudos to @cs for a fine effort!

10. ### max hrvatin said,

January 29, 2014 @ 9:28 pm

A phrase search may also be interesting. Although not noted in the Obama SOTU speech, the catch phrase of the current president that stands out in my observations is: 'Let ME make one thing perfectly CLEAR'. I seem to remember Reagan using this phrase, but in Obama's case it is overused in comparison. A word search in the SOTU speech did not turn up high hit rates for 'clear' and 'perfectly'.

[(myl) The only instances of "perfectly clear" in the SOTU messages were in 1917 and 1918, from Woodrow Wilson:

(1917) If I have overlooked anything that ought to be done for the more effective conduct of the war, your own counsels will supply the omission. What I am perfectly clear about is that in the present session of the Congress our whole attention and energy should be concentrated on the vigorous, rapid, and successful prosecution of the great task of winning the war.

(1918) Let me say at once that I have no answer ready. The only thing that is perfectly clear to me is that it is not fair either to the public or to the owners of the railroads to leave the question unanswered and that it will presently become my duty to relinquish control of the roads, even before the expiration of the statutory period, unless there should appear some clear prospect in the meantime of a legislative solution. Their release would at least produce one element of a solution, namely certainty and a quick stimulation of private initiative.

]

11. ### Bloix said,

January 29, 2014 @ 10:16 pm

The SOTU isn't an ordinary speech. It is taken seriously as the administration's agenda for the coming year and as a ranking of the importance of each mentioned item. Advocates for the various cabinet departments and agencies jockey aggressively to get their proposals in and fight over the number of words devoted to them, and the final draft is the product of negotiation and compromise. There is no speech in which the President is less likely to deviate from the printed version, lest an alteration be viewed as a presidential decision to change the priorities embedded in the text.

PS – It was Nixon who liked to say, "let me make one thing perfectly clear." Obama likes to say, "let me be clear." But this is the sort of thing that would probably not show up in the tightly scripted speech that is the SOTU.

[(myl) Obama has used "Let me be clear" twice in SOTU addresses, once in 2009 and once in 2014. Before that, this phrase occurred only in 1988 (Reagan), 1989 (Bush 41, twice) and 1993 (Clinton).]

PPS- did earlier presidents use "incredible" to mean marvelous, amazing, stupendous? Or did they use it to mean obviously not true, ridiculous, not to be believed?

[(myl) The first use was by Millard Fillmore in 1853, and seems definitely to be of the "stupendous" variety, though still by figurative extension of the "unbelievable" sense:

The successive decennial returns of the census since the adoption of the Constitution have revealed a law of steady, progressive development, which may be stated in general terms as a duplication every quarter century. Carried forward from the point already reached for only a short period of time, as applicable to the existence of a nation, this law of progress, if unchecked, will bring us to almost incredible results. A large allowance for a diminished proportional effect of emigration would not very materially reduce the estimate, while the increased average duration of human life known to have already resulted from the scientific and hygienic improvements of the past fifty years will tend to keep up through the next fifty, or perhaps hundred, the same ratio of growth which has been thus revealed in our past progress; and to the influence of these causes may be added the influx of laboring masses from eastern Asia to the Pacific side of our possessions, together with the probable accession of the populations already existing in other parts of our hemisphere, which within the period in question will feel with yearly increasing force the natural attraction of so vast, powerful, and prosperous a confederation of self-governing republics and will seek the privilege of being admitted within its safe and happy bosom, transferring with themselves, by a peaceful and healthy process of incorporation, spacious regions of virgin and exuberant soil, which are destined to swarm with the fast growing and fast-spreading millions of our race.

I like the "virgin and exuberant soil" part — they don't make SOTUs like that anymore…]

12. ### Bloix said,

January 29, 2014 @ 10:26 pm

Answering my own question, I find that FDR used "incredible" in what I think of as the more modern, slangy meaning in his SOTU in 1943:

"And we must not forget that our achievements in production have been relatively no greater than those of the Russians and the British and the Chinese who have developed their own war industries under the incredible difficulties of battle conditions."

13. ### GeorgeW said,

January 30, 2014 @ 9:31 am

Obama's overuse of "incredible" may be the result of palling around with Silicon Valley titans. It is very Applesque. Tim Cook cannot speak without dropping a few "incredibles."

15. ### Lane said,

January 30, 2014 @ 1:26 pm

Thanks for tackling my question, Mark. My write-up, hat respectfully doffed, is here:

http://www.economist.com/blogs/democracyinamerica/2014/01/politics-and-linguistics

16. ### D.O. said,

January 30, 2014 @ 7:49 pm

Prof. Liberman, I have looked at the full list of entries (for the first experiment) and found to my surprise that ZCount divided by the sum of ZCount does not equal ZPerMillion/1e6. It seems that 64093 counts from ZCount have gone missing. Or am I missing something obvious?

[(myl) The only words in the list are those that occur in the sets denoted X and Y. There are some words in Z (all of the SOTU messages) that were not in X (Obama SOTU addresses) or Y (SOTU addresses from Truman to GWB), and apparently you've counted them.]

17. ### D.O. said,

January 31, 2014 @ 11:07 pm

I've just added up everything in the ZCount column and got 1712106, which makes the per-million figure for (for example) "'s" 2742/1712106*1e6 = 1601.5. Anyway, I don't think it detracts from the main point.

[(myl) Again, the only rows in the table are those that represent word types found in sets X (Obama's SOTUs) and Y (SOTUs from Truman to W), of which there are 14338. Across all the SOTU messages I found 29142 word types, with 1776200 tokens in total (as per my tokenization). Since e.g. 's occurred 2742 times in all the SOTU messages, the observed count per million across all the SOTU messages is thus

1000000*2742/1776200 = 1543.745

which is what I reported as 1543.75.

I've made some other choices that affect the totals, though I don't believe that they affect the overall outcome in a material way: I did not split contractions; I did split hyphenated words; I did fold case; I retained only tokens that contained at least one alphabetic character; I tried to eliminate all HTML markup but a few things slipped through; etc. Note also that I've retained just one SOTU message per year — in some years, there were two, one from the departing president and one from the new one.

But believe me, my code can count.]
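The tokenization choices listed in that last reply (fold case, split hyphenated words, keep contractions whole, keep only tokens with at least one alphabetic character) can be sketched in Python as follows. This is a guess at the behavior, not the actual tokenizer: `tokenize` and `TOKEN_PATTERN` are hypothetical names, the pattern only handles unaccented letters and digits, and splitting off 's as its own token is an assumption inferred from the lists above, where 's has its own row while i'm, don't and we'll stay whole.

```python
import re

# A run of letters/digits, optionally followed by apostrophe-plus-letters
# (so "don't" and "we'll" stay in one piece).
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase, split on hyphens, and return a list of lexical tokens."""
    # Fold case, normalize curly apostrophes, and break hyphenated words apart.
    text = text.lower().replace("\u2019", "'").replace("-", " ")
    tokens = []
    for tok in TOKEN_PATTERN.findall(text):
        if tok.endswith("'s"):
            # Assumption: 's is counted as its own token.
            tokens.extend([tok[:-2], "'s"])
        else:
            tokens.append(tok)
    # Keep only tokens containing at least one alphabetic character.
    return [t for t in tokens if any(c.isalpha() for c in t)]


print(tokenize("Obama's well-known plan: we don't quit in 2014!"))
```

Small differences in rules like these change the token totals, which is one reason independently computed per-million figures may not match exactly.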