SOTU ngrams
Below is a guest post by Yuval Pinter.
Reading Mark Liberman's analysis of Obama's SOTU addresses versus other presidents', my thirst remained unquenched. Word-counts are fun, sure, but the real fun comes in when looking at longer phrases – two (bigrams) or three (trigrams) words long.
After waiting for it to be breakfast time in Philadelphia, I engaged in an experiment (Legal has advised me against explicit use of MYL's trademark phrase) to analyze the 228 addresses (found here) and see what Obama's favorite (and least-favorite) phrases are.
Since I worked with raw data, I handled it a bit differently than previous analyses just for the sake of getting results fast. To begin with, I did not weed out the non-orally-delivered addresses or any other "special" cases. Next, I used an unsophisticated tokenization algorithm where all apostrophes break words into tokens (so "Congress's" is split in two, as in Liberman's analysis, but same goes for "i'm" and "he's"). Lastly, I used a comparison algorithm which only takes into account Obama's speeches and all addresses (1790-2014) as "background": the KL measure, which purports to tell us how "informative" the phrase is in the Obama corpus relative to the background corpus.
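For readers who want to try something similar, here is a minimal Python sketch of the kind of tokenization described above. It is not the script used for this post: the regular expression, the lowercasing, the function names (`tokenize`, `ngrams`, `count_ngrams`) and the choice to prepend a `PAR` token to every paragraph are all assumptions, made so that the output resembles the tokens shown in the tables ("that 's", "don 't", "PAR the").

```python
import re
from collections import Counter

# Assumption: a token is a run of letters/digits, optionally led by an
# apostrophe, so "that's" -> "that", "'s" and "don't" -> "don", "'t".
# Curly apostrophes and quoted words are not handled specially here.
TOKEN_RE = re.compile(r"'?[a-z0-9]+")


def tokenize(paragraph):
    """Lowercase a paragraph and split it into tokens, led by a PAR marker."""
    return ["PAR"] + TOKEN_RE.findall(paragraph.lower())


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(paragraphs, n):
    """Count n-grams over a list of paragraphs, never crossing paragraph breaks."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(ngrams(tokenize(paragraph), n))
    return counts
```

Calling `count_ngrams(paragraphs, 2)` on the Obama addresses and again on all addresses would give the two sets of counts that the comparison needs.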
Let's get to it: here are Obama's most unexpectedly frequent bigrams:
bigram | KL-measure X 1000 |
that 's | 3.284 |
it 's | 2.463 |
let 's | 2.022 |
don 't | 1.545 |
i 'm | 1.540 |
we will | 1.408 |
's why | 1.375 |
we 're | 1.278 |
we 've | 1.253 |
can 't | 1.147 |
right now | 1.092 |
clean energy | 0.960 |
i will | 0.946 |
if you | 0.931 |
need to | 0.925 |
we 'll | 0.907 |
we can | 0.902 |
is why | 0.883 |
jobs and | 0.848 |
's what | 0.844 |
health care | 0.842 |
tonight i | 0.825 |
our economy | 0.813 |
's not | 0.736 |
middle class | 0.696 |
We see many stylistic markers here, such as the contracted forms "'s", "'re" and "'ll", which will probably re-appear in any modern president's lingo (with not much to support either the egocentric-Obama or collective-Obama hypotheses), but these expected bigrams greatly emphasize the magnitude of the more content-swayed ones: "our economy", "middle class", "health care" and the number one issue on Obama's plate (at least according to Kullback and Leibler): "clean energy".
Obama's most unexpectedly infrequent bigrams: (for these, I still only took phrases which appeared somewhere in Obama's addresses)
bigram | KL-measure X 1000 |
of the | -2.388 |
to the | -0.941 |
in the | -0.896 |
for the | -0.529 |
and the | -0.494 |
by the | -0.446 |
it is | -0.397 |
PAR the | -0.392 |
united states | -0.389 |
the united | -0.388 |
And the rest is just as boring. We've seen "the" is on the decline, and it drags down all its associated bigrams with it.
Moving on. Favorite trigrams: ("PAR" marks the beginning of a paragraph)
trigram | KL-measure X 1000 |
that 's why | 1.191 |
that 's what | 0.750 |
that is why | 0.640 |
democrats and republicans | 0.549 |
we need to | 0.526 |
it 's not | 0.495 |
this congress to | 0.432 |
PAR that 's | 0.426 |
the american people | 0.413 |
i will not | 0.406 |
so let 's | 0.405 |
tonight i 'm | 0.399 |
we can 't | 0.391 |
states of america | 0.369 |
it 's time | 0.353 |
across the country | 0.336 |
's why i | 0.325 |
's why we | 0.324 |
over the last | 0.319 |
over the next | 0.313 |
we have to | 0.312 |
i took office | 0.312 |
i know that | 0.310 |
's time to | 0.304 |
PAR of course | 0.304 |
So the top three are explanation starters, but check out "democrats and republicans" creeping in to a bipartisan content-lead. And you may take what you will from number 25, beginning paragraphs with "of course".
Least favorite trigrams:
trigram | KL-measure X 1000 |
the united states | -0.375 |
of the united | -0.134 |
of the country | -0.054 |
part of the | -0.048 |
as well as | -0.046 |
the people of | -0.044 |
of the people | -0.044 |
PAR it is | -0.043 |
united states and | -0.040 |
of the government | -0.032 |
the secretary of | -0.030 |
it will be | -0.029 |
the federal government | -0.029 |
and it is | -0.026 |
and in the | -0.026 |
at the same | -0.026 |
of our citizens | -0.026 |
the number of | -0.025 |
of the last | -0.024 |
the fact that | -0.023 |
of the union | -0.023 |
in order to | -0.022 |
it is not | -0.022 |
and to the | -0.022 |
it is a | -0.022 |
A bit more interesting than the lost bigram table. "the american people" made it to the top, but "the people of" are on the bottom, suggesting nothing but a stylistic anomaly (or shift) in denoting what is probably the group which is most referred to in these addresses. How "the united states" and "states of america" got to opposite ends is beyond me, though. Much to look into, perhaps during some breakfast after next year's SOTU.
Lee said,
February 3, 2014 @ 3:36 pm
Interesting stuff — thanks for doing this.
Can you explain what the KL means for individual events? I am just learning information theory, and I understand the relative entropy (or KL divergence) to be a function of two distributions. I think of it as the tax we pay for encoding a sequence using the ideal encoding scheme for some sequence with a different distribution.
In this context, encoding the sequence of words in Obama's SOTU using a scheme based on the actual distribution of words in that speech would result in the average word needing H bits (H = entropy). But encoding that same sequence using a scheme based on the distribution of words in prior presidents' speeches would result in the average word needing H+KL bits to encode. So KL should never be negative, since no scheme can be better than the ideal one based on his own words.
I am guessing that KL for individual words is the difference in bits that are required to encode those individual words. So some common Obama words have positive KL because they are being encoded with more bits than is necessary and his rarer words are being encoded with fewer bits (which is too bad because they occur rarely). Is that right?
David B said,
February 3, 2014 @ 3:39 pm
Complete guesswork, but the united states and states of america might could end up on opposite ends due to a lot of cases of these united states of america in the data.
yoav said,
February 3, 2014 @ 5:06 pm
Interesting. Are all determiners on the decline, or just the definite ones? In other words, how are bigrams like "of a" doing?
[(myl) I don't think that bigrams or trigrams are helpful in answering this particular question, which is basically about unigram frequencies. Here's the answer:
The frequency of "a/an" has actually been rising a bit since 1950 or so, and was more or less stable up to that point.]
Anton Sherwood said,
February 3, 2014 @ 10:07 pm
Why make the tables so narrow? Wider would be much easier to read, at least for me.
D.O. said,
February 3, 2014 @ 10:51 pm
Earlier presidents didn't feel the need to append "of America" to "the United States", that's why (oops, Obama-talk is catching up with me), IMHO.
D.O. said,
February 4, 2014 @ 12:52 am
We have a comparison here of Obama's SOTUs with previous SOTUs, but part of the change is, of course, because of Obama (or his speech writers) and part of the change is because of general linguistic change. For example, Obama's favorite "why", at 2480 words per million for Obama against 140 wpm historically (based on Prof. Liberman's data), can be compared to 201 wpm for 2008 against about 126 wpm historically (based on Google n-grams; the latter figure is just eyeballed). In other words, Mr. Obama uses "why" very frequently compared both to other presidents in similar settings and to contemporary usage. Maybe the more interesting measure is something like
f(Obama-SOTU)*f(AmEng last 200 years)/(f(all SOTU)*f(AmEng now))
properly logarithmed and normalized to get rid of obscure words. I have a feeling though that Prof. Liberman mostly already dealt with the issue by comparing Obama SOTUs with only last 70 years.
Yuval said,
February 4, 2014 @ 1:13 am
@Lee: your conclusion is correct. The formula I took for a single-event (e) KL-measure is P(e|obama) * Log(P(e|obama)/P(e|all)). So what basically happens is that as an ngram becomes more frequent in the Obama corpus relative to the general corpus, the log term grows (it's zero when the probabilities are the same, i.e. when Obama uses the phrase at exactly the expected rate). The left multiplicand is responsible for showing us what's interesting: it amplifies the abnormalities of the more frequent ngrams.
Information-theoretically, this represents the total "damage" this ngram has made on representing the whole Obama corpus in an encoding befitting the general one: Obama-favorite phrases cost more because we didn't expect them to appear as much, and Obama-avoided phrases made unnecessary savings which should have been redistributed to other places in the encoding.
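The single-event formula above can be sketched in a few lines of Python. This is an illustration and not the original script: the function name `kl_contributions` is made up, the log is taken in base 2 (the formula does not state a base), and the background counts are assumed to include the target corpus, so every ngram seen in the target has a non-zero background probability.

```python
import math
from collections import Counter


def kl_contributions(target_counts, background_counts):
    """Per-ngram terms P(e|target) * log2(P(e|target) / P(e|background)).

    Assumes background_counts covers every ngram in target_counts with a
    positive count (e.g. because the background includes the target corpus).
    """
    target_total = sum(target_counts.values())
    background_total = sum(background_counts.values())
    scores = {}
    for gram, count in target_counts.items():
        p = count / target_total                        # P(e | target)
        q = background_counts[gram] / background_total  # P(e | background)
        scores[gram] = p * math.log2(p / q)
    return scores


# Toy counts, invented purely for illustration.
target = Counter({"that 's": 3, "of the": 1})
background = Counter({"that 's": 4, "of the": 4})
scores = kl_contributions(target, background)
```

In this toy case "that 's" gets a positive score and "of the" a negative one. When the target's ngrams are all present in the background, the individual terms sum to the full KL divergence, which is never negative even though single terms can be.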
mollymooly said,
February 4, 2014 @ 4:59 am
Per D.O., contra David B; simple illustration with fake numbers:
* Obama: 2 x "the United States of America" + 3 x "the United States [anything other than 'of America']"
* others: 1 x "the United States of America" + 5 x "the United States [anything other than 'of America']"
Obama wins 2-1 on "States of America", loses 6-5 on "the United States"
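The fake numbers above can be pushed through the same single-event formula, P(e|obama) * Log(P(e|obama)/P(e|all)), to see the two trigrams land on opposite sides of zero. This is only a toy sketch: it pretends these two trigrams are the entire vocabulary, builds the "all" counts as Obama plus others, and uses log base 2, all of which are assumptions made for the illustration.

```python
import math

# Trigram counts implied by the fake numbers: every "the United States of
# America" contributes to both trigrams, every other "the United States"
# contributes only to the first.
obama = {"the united states": 2 + 3, "states of america": 2}
others = {"the united states": 1 + 5, "states of america": 1}


def toy_scores(target, rest):
    """Score each trigram of `target` against the pooled target + rest counts."""
    background = {gram: target[gram] + rest[gram] for gram in target}
    target_total = sum(target.values())
    background_total = sum(background.values())
    scores = {}
    for gram in target:
        p = target[gram] / target_total          # P(e | obama)
        q = background[gram] / background_total  # P(e | all)
        scores[gram] = p * math.log2(p / q)
    return scores


scores = toy_scores(obama, others)
```

Here `scores["the united states"]` comes out negative and `scores["states of america"]` positive, matching the opposite ends of the trigram tables.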