SOTU ngrams


Below is a guest post by Yuval Pinter.

Reading Mark Liberman's analysis of Obama's SOTU addresses versus other presidents', my thirst remained unquenched. Word-counts are fun, sure, but the real fun comes in when looking at longer phrases – two (bigrams) or three (trigrams) words long.

After waiting for it to be breakfast time in Philadelphia, I engaged in an experiment (Legal has advised me against explicit use of MYL's trademark phrase) to analyze the 228 addresses (found here) and see what Obama's favorite (and least-favorite) phrases are.

Since I worked with raw data, I handled it a bit differently than previous analyses, just for the sake of getting results fast. To begin with, I did not weed out the non-orally-delivered addresses or any other "special" cases. Next, I used an unsophisticated tokenization algorithm where all apostrophes break words into tokens (so "Congress's" is split in two, as in Liberman's analysis, but the same goes for "i'm" and "he's"). Lastly, I used a comparison algorithm which takes into account only Obama's speeches, with all addresses (1790-2014) serving as "background": the KL measure, which purports to tell us how "informative" a phrase is in the Obama corpus relative to the background corpus.
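
A rough sketch of the tokenization and n-gram counting described above, in Python. The regular expression, the function names (`tokenize`, `ngrams`, `count_ngrams`), the blank-line paragraph splitting, and the choice not to let n-grams cross paragraph boundaries are all assumptions made for illustration; this is not the script that produced the tables in this post.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters/digits, optionally led by an
# apostrophe, so "that's" becomes ["that", "'s"] and "don't" becomes ["don", "'t"].
TOKEN_RE = re.compile(r"'?[a-z0-9]+")


def tokenize(paragraph):
    """Lowercase the text and split it so every apostrophe starts a new token."""
    return TOKEN_RE.findall(paragraph.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n):
    """Count n-grams per paragraph, with "PAR" marking each paragraph start.

    Assumptions: paragraphs are separated by blank lines, and n-grams
    never span two paragraphs.
    """
    counts = Counter()
    for paragraph in text.split("\n\n"):
        tokens = tokenize(paragraph)
        if tokens:
            counts.update(ngrams(["PAR"] + tokens, n))
    return counts


sample = "Of course we can.\n\nThat's why I'm here."
bigram_counts = count_ngrams(sample, 2)
print(bigram_counts.most_common(5))
```

Keeping the apostrophe attached to the suffix is what yields entries like "that 's" and "don 't" rather than bare "s" and "t" tokens.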

Let's get to it: here are Obama's most unexpectedly frequent bigrams:

bigram          KL-measure × 1000
that 's         3.284
it 's           2.463
let 's          2.022
don 't          1.545
i 'm            1.540
we will         1.408
's why          1.375
we 're          1.278
we 've          1.253
can 't          1.147
right now       1.092
clean energy    0.960
i will          0.946
if you          0.931
need to         0.925
we 'll          0.907
we can          0.902
is why          0.883
jobs and        0.848
's what         0.844
health care     0.842
tonight i       0.825
our economy     0.813
's not          0.736
middle class    0.696

We see many stylistic markers here, such as the contracted forms "'s", "'re" and "'ll", which would probably show up in any modern president's lingo (and which offer little support for either the egocentric-Obama or the collective-Obama hypothesis). Against that expected backdrop, the more content-driven bigrams stand out: "our economy", "middle class", "health care" and the number one issue on Obama's plate (at least according to Kullback and Leibler): "clean energy".

Obama's most unexpectedly infrequent bigrams: (for these, I still only took phrases which appeared somewhere in Obama's addresses)

bigram          KL-measure × 1000
of the          -2.388
to the          -0.941
in the          -0.896
for the         -0.529
and the         -0.494
by the          -0.446
it is           -0.397
PAR the         -0.392
united states   -0.389
the united      -0.388

And the rest is just as boring. We've seen "the" is on the decline, and it drags down all its associated bigrams with it.

Moving on. Favorite trigrams: ("PAR" marks the beginning of a paragraph)

trigram                     KL-measure × 1000
that 's why                 1.191
that 's what                0.750
that is why                 0.640
democrats and republicans   0.549
we need to                  0.526
it 's not                   0.495
this congress to            0.432
PAR that 's                 0.426
the american people         0.413
i will not                  0.406
so let 's                   0.405
tonight i 'm                0.399
we can 't                   0.391
states of america           0.369
it 's time                  0.353
across the country          0.336
's why i                    0.325
's why we                   0.324
over the last               0.319
over the next               0.313
we have to                  0.312
i took office               0.312
i know that                 0.310
's time to                  0.304
PAR of course               0.304

So the top three are explanation starters, but check out "democrats and republicans" creeping in right behind them as the leading content phrase, and a bipartisan one at that. And you may take what you will from number 25, beginning paragraphs with "of course".

Least favorite trigrams:

trigram                     KL-measure × 1000
the united states           -0.375
of the united               -0.134
of the country              -0.054
part of the                 -0.048
as well as                  -0.046
the people of               -0.044
of the people               -0.044
PAR it is                   -0.043
united states and           -0.040
of the government           -0.032
the secretary of            -0.030
it will be                  -0.029
the federal government      -0.029
and it is                   -0.026
and in the                  -0.026
at the same                 -0.026
of our citizens             -0.026
the number of               -0.025
of the last                 -0.024
the fact that               -0.023
of the union                -0.023
in order to                 -0.022
it is not                   -0.022
and to the                  -0.022
it is a                     -0.022

A bit more interesting than the lost bigram table. "the american people" made it to the top, but "the people of" is at the bottom, suggesting nothing but a stylistic anomaly (or shift) in denoting what is probably the group most often referred to in these addresses. How "the united states" and "states of america" got to opposite ends is beyond me, though. Much to look into, perhaps during some breakfast after next year's SOTU.


  1. Lee said,

    February 3, 2014 @ 3:36 pm

    Interesting stuff — thanks for doing this.

    Can you explain what the KL means for individual events? I am just learning information theory, and I understand the relative entropy (or KL divergence) to be a function of two distributions. I think of it as the tax we pay for encoding a sequence using the ideal encoding scheme for some sequence with a different distribution.

    In this context, encoding the sequence of words in Obama's SOTU using a scheme based on the actual distribution of words in that speech would result in the average word needing H bits (H = entropy). But encoding that same sequence using a scheme based on the distribution of words in prior presidents' speeches would result in the average word needing H+KL bits to encode. So KL should always be positive, since no scheme can be better than the ideal one based on his own words.

    I am guessing that KL for individual words is the difference in bits that are required to encode those individual words. So some common Obama words have positive KL because they are being encoded with more bits than is necessary and his rarer words are being encoded with fewer bits (which is too bad because they occur rarely). Is that right?

  2. David B said,

    February 3, 2014 @ 3:39 pm

    Complete guesswork, but the united states and states of america might could end up on opposite ends due to a lot of cases of these united states of america in the data.

  3. yoav said,

    February 3, 2014 @ 5:06 pm

    Interesting. Are all determiners on the decline, or just the definite ones? In other words, how are bigrams like "of a" doing?

    [(myl) I don't think that bigrams or trigrams are helpful in answering this particular question, which is basically about unigram frequencies. Here's the answer:

    The frequency of "a/an" has actually been rising a bit since 1950 or so, and was more or less stable up to that point.]

  4. Anton Sherwood said,

    February 3, 2014 @ 10:07 pm

    Why make the tables so narrow? Wider would be much easier to read, at least for me.

  5. D.O. said,

    February 3, 2014 @ 10:51 pm

    Earlier presidents didn't feel the need to append "of America" to "the United States", that's why (oops, Obama-talk is catching up with me), IMHO.

  6. D.O. said,

    February 4, 2014 @ 12:52 am

    We have a comparison here of Obama's SOTUs with previous SOTUs, but part of the change is, of course, because of Obama (or his speech writers) and part is because of general linguistic change. For example, Obama's favorite "why", at 2480 words per million for Obama against 140 wpm historically (based on Prof. Liberman's data), can be compared to 201 wpm for 2008 against about 126 wpm historically (based on Google n-grams; the latter figure is just eyeballed). In other words, Mr. Obama uses "why" very frequently compared both to other presidents in similar settings and to contemporary usage. Maybe the more interesting measure is something like
    f(Obama-SOTU)*f(AmEng last 200 years)/(f(all SOTU)*f(AmEng now))
    properly logarithmed and normalized to get rid of obscure words. I have a feeling though that Prof. Liberman mostly already dealt with the issue by comparing Obama's SOTUs with only the last 70 years.

  7. Yuval said,

    February 4, 2014 @ 1:13 am

    @Lee: your conclusion is correct. The formula I took for a single-event (e) KL-measure is P(e|obama) * Log(P(e|obama)/P(e|all)). So what basically happens is that as an ngram becomes more frequent in the Obama corpus relative to the general corpus, the log term grows (it's zero when the probabilities are the same, i.e. when Obama uses the phrase at exactly the expected rate). The left multiplicand is responsible for showing us what's interesting: amplifying the abnormalities of more frequent ngrams.
    Information-theoretically, this represents the total "damage" this ngram has made on representing the whole Obama corpus in an encoding befitting the general one: Obama-favorite phrases cost more because we didn't expect them to appear as much, and Obama-avoided phrases made unnecessary savings which should have been redistributed to other places in the encoding.

  8. mollymooly said,

    February 4, 2014 @ 4:59 am

    How "the united states" and "states of america" got to opposite ends is beyond me, though.

    Per D.O., contra David B; simple illustration with fake numbers:

    * Obama: 2 x "the United States of America" + 3 x "the United States [anything other than 'of America']"

    * others: 1 x "the United States of America" + 5 x "the United States [anything other than 'of America']"

    Obama wins 2-1 on "States of America", loses 6-5 on "the United States"
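
The single-phrase measure Yuval gives in comment 7 and mollymooly's fake numbers from comment 8 can be put together in a short Python sketch. Several things here are assumptions for illustration only: the logarithm base (the comment just says "Log"; base 2 is used here), a made-up corpus size of 1000 trigrams on each side, the use of the "others" counts as the background (the post itself uses all addresses, Obama's included), and the lumping of every other trigram into a single catch-all event called `REST`. The function name `kl_term` is also hypothetical.

```python
import math


def kl_term(p_fg, p_bg):
    """Single-phrase score: P(e|fg) * log2(P(e|fg) / P(e|bg)).

    Base 2 is an assumption; the sign of the score does not depend on it.
    """
    if p_fg == 0.0:
        return 0.0  # a phrase absent from the foreground contributes nothing
    return p_fg * math.log2(p_fg / p_bg)


# Hypothetical corpus size: assume each side contains 1000 trigrams.
TOTAL = 1000

# Trigram counts implied by the fake numbers: "the united states" occurs
# inside every "the United States of America" as well as on its own.
obama = {"states of america": 2, "the united states": 2 + 3}
others = {"states of america": 1, "the united states": 1 + 5}

# Lump all remaining trigrams into one event so each side sums to TOTAL.
obama["REST"] = TOTAL - sum(obama.values())
others["REST"] = TOTAL - sum(others.values())

scores = {g: kl_term(obama[g] / TOTAL, others[g] / TOTAL) for g in obama}
total_kl = sum(scores.values())

for gram, score in scores.items():
    print(f"{gram:20s} {1000 * score:+.3f}")
print(f"{'sum':20s} {1000 * total_kl:+.3f}")
```

Under these assumptions "states of america" gets a positive score and "the united states" a negative one, which is the opposite-ends pattern in the trigram tables. The individual terms can be negative, but their sum over the whole distribution is the KL divergence itself and so cannot drop below zero, which matches Lee's expectation in comment 1.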
