During the course of the 20th century, the frequency of the English definite article "the" decreased gradually and radically. I first noticed this effect about a year ago, in a post about the history of State of the Union addresses ("SOTU evolution", 1/26/2014), where I observed, in reference to the graph on the right, that
The average frequency of "the" in the most recent 10 SOTU addresses (2004-2013) was 47,458 per million words; in the first 10 addresses (1790-1799, all delivered as speeches to Congress) it was 93,201 per million words, almost double the frequency. And the decline during the 20th-century era of oral addresses seems to have been a gradual one.
I speculated that
Maybe the style of speeches has been getting gradually less formal, and therefore gradually less like written style. Or maybe even formal styles have been changing.
And I noted that a corresponding effect can be seen in two other sources, the BYU Corpus of Historical American English (COHA) and the Google Books N-Gram viewer (GNG), though it is considerably smaller in magnitude:
COHA and the Google Books data pretty much agree, which is reassuring; and they both suggest a slight decline in the frequency of "the"; but the change that they show is very modest compared to the change in SOTU frequencies. So I feel that the explanation for the SOTU change remains to be found.
At that point, I turned my attention to other aspects of SOTU evolution. But a student paper recently reminded me of this issue.
In my undergraduate Introduction to Linguistics course, one of the assignments is a final project that asks the students to do some original analysis. Grading these reports (165 of them in the most recent batch) is always interesting and informative — this year, I learned about things like the rhetoric of professional wrestlers' "promos", or the differences between Dominican and Puerto Rican Spanish, or verbal auxiliaries and epenthetic vowels in Hindi film songs.
One of this year's projects, by He Chen, was "Analyzing the Differences in Journalism Style through the Percent Difference of Definite Articles and Indefinite Articles". Mr. Chen, a freshman in the Engineering School who gave me permission to use his name, created his own small historical corpus by selecting and downloading articles at random from the New York Times and Guardian indices, five articles per decade from 1860 to 2010. He then wrote programs to count articles, and to calculate confidence intervals for the difference in frequency between definite and indefinite articles. And despite the small size of his corpus, he found a statistically significant change in that difference, in the direction of decreasing definiteness. Overall, this was an impressive piece of work for a first-semester student in an introductory course.
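For readers who want to try this kind of count themselves, here is a minimal sketch in Python. It is not Mr. Chen's program, which I am not reproducing here; the function names, the crude regular-expression tokenizer, and the choice of a normal-approximation interval are all my own assumptions. Because both counts come from the same set of tokens, the sketch uses the multinomial variance for a difference of two proportions rather than treating them as independent samples.

```python
import math
import re

# Hypothetical word lists; a fuller analysis might treat other determiners too.
DEFINITE = {"the"}
INDEFINITE = {"a", "an"}


def article_counts(text):
    """Return (definite count, indefinite count, total tokens) for a text."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    n_def = sum(1 for t in tokens if t in DEFINITE)
    n_indef = sum(1 for t in tokens if t in INDEFINITE)
    return n_def, n_indef, len(tokens)


def difference_ci(n_def, n_indef, n_tokens, z=1.96):
    """Approximate 95% interval for (definite rate - indefinite rate).

    Uses the normal approximation with the multinomial variance
    (p1 + p2 - (p1 - p2)**2) / n, since both rates share the same tokens.
    """
    p_def = n_def / n_tokens
    p_indef = n_indef / n_tokens
    diff = p_def - p_indef
    variance = (p_def + p_indef - diff ** 2) / n_tokens
    half_width = z * math.sqrt(variance)
    return diff, diff - half_width, diff + half_width
```

Calling `article_counts` on the text of each decade's articles and passing the results to `difference_ci` gives one interval per decade; non-overlapping intervals for early and late decades would be one rough sign of a real change. With only five articles per decade the normal approximation is shaky, so treat the intervals as indicative.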
And it inspired me to look at the frequency of "a/an" as well as "the", in various larger historical collections where the counts are easy to do, starting with the SOTU addresses. Focusing on changes since 1900, and plotting the counts in individual addresses as well as a lowess fit:
Again, the frequency of "the" has decreased by about half; the frequency of "a/an" has increased by about a third (though of course the overall frequency of "a/an" is much lower).
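The arithmetic behind "about half" is simple enough to spell out. The sketch below uses the two per-million figures quoted from the earlier post (93,201 for 1790-1799 and 47,458 for 2004-2013); the helper names are mine, chosen for illustration.

```python
def per_million(count, total_words):
    """Convert a raw count into a rate per million words."""
    return 1_000_000 * count / total_words


def relative_change(old, new):
    """Proportional change from old to new (negative means a decline)."""
    return (new - old) / old


# Rates of "the" per million words, as quoted from the earlier SOTU post.
early = 93_201   # first 10 addresses, 1790-1799
recent = 47_458  # most recent 10 addresses, 2004-2013

print(f"{relative_change(early, recent):+.1%}")  # -49.1%
print(f"{early / recent:.2f}")                   # 1.96
```

So the earlier rate is a bit less than double the recent one, which is the same thing as saying that the rate fell by just under half.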
What if we look at the same frequencies in COHA, where there's enough data for us to use the raw values rather than the results of fitting a trend line?
Again, in proportional terms, on the same plot:
Here the effects are much smaller: "the" decreases in frequency by about 22% in relative terms, from 6.6% to 5.4%, while "a/an" increases in frequency by about 14%, from 2.4% to 2.7%. Still, these words are common enough for these changes to be stylistically as well as statistically significant.
What about in the Google Books ngram index? Here are the analogous plots:
And to add one last source of evidence to this morning's Breakfast Experiment™, here are the results for (U.S. presidents') inaugural addresses, from 1897 to 2013:
In this dataset, "the" decreases in frequency by about 35% in relative terms, from about 8.0% to about 5.2%, while "a/an" increases by about 39%, from about 1.7% to about 2.3%.
So in all four of the data sources considered so far, "the" consistently declines in frequency over the course of the 20th century, monotonically and by a relatively large proportion. The behavior of "a/an" is less consistent, and in any case the changes are not large enough to suggest a simple trading relation between definite and indefinite reference.
What's the explanation for these changes? That's the really interesting question — but I've run out of time this morning, and this post is already far too long. So more on that later.