In preparation for Tuesday's State of the Union address, I thought I'd take a look at the language of these addresses over the years. Texts are available at UCSB's American Presidency Project; I downloaded their texts and removed irrelevant mark-up. (Or rather, I wrote scripts to do all of this automatically. I believe that the results are generally correct, but there are probably a few uncaught errors.)
There are lots of ways to approach this question. In today's post, I'll set the stage and look at a couple of simple word-frequency features, with more (and maybe more interesting) explorations to come later on.
When we see a change in the SOTU messages over the years — and there are plenty of changes to see — we need to consider several different sorts of causes. Maybe the English language itself has changed; maybe style or fashion has changed, at least for a certain sort of political language; maybe the themes or topics of the addresses have shifted, due to changes in the world or at least in the American political landscape; or maybe the individual styles of particular presidents (or their speechwriters) are what's at stake.
No doubt all of these things apply to different degrees in different cases, but we also need to keep in mind something explained at length in Gerhard Peters' "Research Notes" on State of the Union Addresses and Messages, which is that between 1801 and 1913, the SOTU was "a written (and often lengthy) report sent to Congress to coincide with a new Session of Congress". Here's a graph showing the time-periods involved and the consequences for message length:
We've looked at linguistic SOTU trends a couple of times in the past. For example, in "Real Trends in Word and Sentence Length", 10/31/2011, I used the SOTU texts as one source for evidence about how sentence lengths have been getting shorter over the past couple of centuries:
(In the plots above, the red lines track the address-by-address measurements as my scripts calculated them, while the blue lines are smoothed approximations produced by locally-weighted scatterplot smoothing in R.)
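For readers who want to see what that kind of smoothing does, here is a minimal pure-Python sketch of locally weighted linear regression. It is not the R routine used for the plots: the function name `lowess`, the `frac` parameter and its default are assumptions made for illustration, and the sketch omits the robustness iterations that full implementations usually include.

```python
def lowess(xs, ys, frac=0.5):
    """Smooth ys against xs with locally weighted linear fits.

    Illustrative sketch only: no robustness iterations, O(n^2) time.
    `frac` is the share of points used in each local fit (an assumption).
    """
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours per local fit
    smoothed = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the local bandwidth.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        # Tricube weights: nearby points count most, points at or beyond h count zero.
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        # Weighted least-squares slope of the local line through (mx, my).
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed
```

A larger `frac` gives a flatter curve and a smaller one follows the individual addresses more closely, so the choice affects how "gradual" a trend looks.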
There may be some indication of the switch from written reports to oral addresses in the early 20th century, but overall, the secular trend remains clear, and is not remarkably different from the pattern seen in the Inaugurals, all of which (I believe) were speeches delivered orally. And the trend towards shorter sentences is surely a culture-wide stylistic trend, which is mirrored in the SOTU and Inaugural texts. At least in part, the shortening of sentences is the reflex of a more paratactic style, with less clausal embedding, as discussed in "Inaugural Embedding", 9/9/2005, and "Presidential Parataxis", 1/24/2009.
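The post does not describe the scripts behind the sentence-length measurements, so the following is only a rough sketch of how mean sentence length might be computed. The function name and the naive splitting rule are assumptions; abbreviations such as "Mr." would be miscounted as sentence ends.

```python
import re

def mean_sentence_length(text):
    """Mean number of word tokens per sentence, using a naive splitter."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Count alphabetic tokens (with apostrophes) in each sentence.
    counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    counts = [c for c in counts if c > 0]
    return sum(counts) / len(counts) if counts else 0.0

# Two sentences of 3 and 7 words respectively.
example = mean_sentence_length("We are strong. The union endures, and it will prosper!")
```

A measure like this conflates shorter clauses with less embedding, which is why the embedding analyses cited above are a useful complement.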
These results are not surprising. It's a commonplace observation that English prose style has been moving, over the past couple of centuries, towards shorter and simpler sentences, and sometimes commonplace observations are actually true. But here's a trend whose explanation is less obvious:
The written SOTU reports apparently have a higher frequency of "the", but the written/oral distinction can't explain the whole thing. The average frequency of "the" in the most recent 10 SOTU addresses (2004-2013) was 47,458 per million words; in the first 10 addresses (1790-1799, all delivered as speeches to Congress) it was 93,201 per million words, almost double. And the decline during the 20th-century era of oral addresses seems to have been a gradual one.
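As a sketch of the measurement behind these figures, a per-million rate can be computed as below. The tokenization rule and the function name `per_million` are assumptions, not a description of the scripts actually used, and the sample sentence is invented for illustration; only the two rates in the final line come from the text above.

```python
import re

def per_million(text, word):
    """Occurrences of `word` per million word tokens in `text` (case-insensitive)."""
    # Assumption: a token is a run of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1_000_000 * tokens.count(word.lower()) / len(tokens)

# Invented sample: "the" is 2 of 7 tokens.
sample_rate = per_million("The state of the union is strong.", "the")

# Ratio of the two averages quoted above (first ten vs. most recent ten addresses).
ratio = 93_201 / 47_458  # roughly 1.96, hence "almost double"
```

Whether the ten-address averages are means of per-address rates or rates over the pooled text is not stated, and the two can differ when address lengths vary widely.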
Why is this? Maybe the style of speeches has been getting gradually less formal, and therefore gradually less like written style. Or maybe even formal styles have been changing. We can add to the plot the comparable data from COHA (by decade) and from the Google Books Ngram collection (by year):
COHA and the Google Books data pretty much agree, which is reassuring; and they both suggest a slight decline in the frequency of "the"; but the change that they show is very modest compared to the change in SOTU frequencies. So I feel that the explanation for the SOTU change remains to be found.
Here's an even more striking stylistic change:
In this case the proportional change is much greater. The frequency of "which" in the most recent 10 SOTU addresses (2004-2013) was 742 per million words; in the first 10 addresses it was 12,272 per million words, more than 16 times as high. And again, the changes seem to have been relatively gradual ones, with a decline from 1810 to 1850 or so, a rise for a few years around 1900, and then a long fall through the modern era.
Is this a response to grammar mavens' which-hunting? Or is it an underlying stylistic trend, with which-hunting merely a symptom? Or both?
If we add data from COHA and Google Books, we again find a trend in a similar direction but much weaker in size:
So again, there's something left to explain here.
What about examples of thematic differences? Here are two cases of semi-complementary concepts, with word frequencies as a (no doubt imperfect) proxy. For these cases, since the overall frequencies are lower, I've switched to averages by decade. First, nation vs. states:
Some of this change may be due to swapping "America" for "United States"; in any case, more investigation is needed.
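The averaging by decade used for these lower-frequency words can be sketched as follows. The function name, the input format (pairs of year and per-million rate) and the numbers in the usage line are all made up for illustration; they are not data from the plots.

```python
from collections import defaultdict

def decade_means(rates_by_year):
    """Average per-address rates within each decade (1790-1799 maps to 1790)."""
    buckets = defaultdict(list)
    for year, rate in rates_by_year:
        buckets[year // 10 * 10].append(rate)
    # Unweighted mean: every address counts equally, whatever its length.
    return {decade: sum(v) / len(v) for decade, v in sorted(buckets.items())}

# Hypothetical rates, not real SOTU measurements.
means = decade_means([(1790, 10.0), (1799, 20.0), (1801, 5.0)])
```

An unweighted mean gives a short address the same influence as a long one; pooling the text of each decade before counting would be the alternative design.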
For a second example, freedom vs. duty:
We might ask again to what extent these changes reflect broader trends in cultural emphasis (or at least word frequency as a perhaps-faulty proxy for it). And indeed, the Google Books Ngram frequency for duty/duties does fall during this period, and the frequency of freedom/freedoms does rise:
But again, if we plot the changes on the same scale as the SOTU frequencies, there is a large difference in the size of the effects:
A plausible interpretation of this last plot is that duty/duties returned to background rates in SOTU messages during the second half of the 20th century, while freedom/freedoms was at background rates up until that point, with the deviations from background rates representing the influence of the political rhetoric of the time.