For the past century or so, the commonest word in English has gradually been getting less common. Depending on data source and counting method, the frequency of the definite article THE has fallen substantially — in some cases at a rate as high as 50% per 100 years.
At every stage, writing that's less formal has fewer THEs, and speech generally has fewer still, so to some extent the decline of THE is part of a more general long-term trend towards greater informality. But THE is apparently getting rarer even in speech, so the change is more than just the (normal) shift of writing style towards the norms of speech.
There appear to be weaker trends in the same direction, at overall lower rates, in German, Italian, Spanish, and French.
I'll lay out some of the evidence for this phenomenon, mostly collected from earlier LLOG posts. And then I'll ask a few questions about what's really going on, and why and how it's happening. [Warning: long and rather wonky.]
Data from the Google Books ngram corpus shows a decline in the frequency of THE, mostly in the last third of the 20th century.
Comparing the first decade of the century with the last decade, we get:
(And the systematically lower frequency of THE in the "Fiction" dataset represents the influence of a generally less formal genre.)
The Corpus of Historical American English shows a similar effect, spread more evenly over the 20th century:
The Corpus of Contemporary American English shows a decline of nearly 8% over the 25 years from 1990 to 2015, which would be about 28% compounded for a century:
And COCA's rates by section (for the period 1990-2015) exhibit the genre/formality effect — the frequency of THE in the "Spoken" section is about 27% lower than the rate in the "Academic" section:
The COCA "Spoken" segments are relatively formal interview transcripts — in the Fisher corpus of conversational telephone speech, THE's overall frequency is only 2.47%, less than half the rate of the "Spoken" segment of COCA.
And if we break things out by age and sex, we see the pattern typical of a language change in progress. Younger people use THE at lower rates than older people, and in each age group, women use THE at lower rates than men:
The same numbers in tabular form:
|AGE <28||Age 28-40||Age >40|
Data from a (more recent) collection of Facebook posts from 75,000 volunteers shows a similar (but even more advanced) pattern, with teen women's posts dipping below 2%:
It's conceivable that these are stable life-cycle and gender effects, but I doubt it — in every case where I've seen a pattern like this, independent evidence has shown that the pattern reflects a change in progress.
American presidents' State of the Union addresses show a decline of about 50% over the past 115 years in the frequency of THE:
If we compare the SOTU data with COHA and Google over a longer span of time, we can see that the trends are in the same direction, although the SOTU addresses show an effect of greater magnitude:
The biomedical abstracts in the MEDLINE dataset show a steady decline of 26% over 40 years, from 6.48% in 1975 to 4.82% in 2014 — which would project to a decline of over 50% in 100 years, matching the SOTU rate:
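The per-century figures quoted for COCA and MEDLINE come from compounding an observed decline over a longer span. Here is a minimal sketch of that arithmetic, assuming the decline continues at a constant proportional rate (an assumption made for illustration, not a claim about the data); the function name `project_decline` is hypothetical.

```python
def project_decline(observed_decline, observed_years, target_years=100):
    """Project a proportional decline over a longer span.

    observed_decline: fraction lost over observed_years (0.08 for 8%).
    Assumes the same proportional rate holds for the whole target span.
    """
    retained = 1.0 - observed_decline
    # Compounding: the retained fraction is raised to the number of
    # observation periods that fit in the target span.
    return 1.0 - retained ** (target_years / observed_years)


# COCA: nearly 8% over the 25 years from 1990 to 2015.
coca_century = project_decline(0.08, 25)

# MEDLINE: 6.48% in 1975 down to 4.82% in 2014, treated as 40 years.
medline_observed = 1.0 - 4.82 / 6.48
medline_century = project_decline(medline_observed, 40)

print(f"COCA projected decline per century: {coca_century:.1%}")
print(f"MEDLINE observed decline: {medline_observed:.1%}")
print(f"MEDLINE projected decline per century: {medline_century:.1%}")
```

Because the decline compounds, four 8% steps give a bit less than the 32% that simple multiplication would suggest.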
The Google Books datasets in other languages seem to show flatter profiles for definite determiners over the course of the 20th century. Let's start with data for English, created by the same method that I used for German, Italian, Spanish, and French below, namely to ask the Google Books ngram interface for the sum of determiner forms with and without initial capitalization, with smoothing=3.
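To make the method concrete, here is a rough sketch of summing several forms and then smoothing. It assumes that smoothing=3 means averaging each year with up to three years on either side, and that the window simply shrinks at the ends of the series; the edge handling and the names `summed_series` and `smooth` are illustrative assumptions, not the ngram viewer's own code.

```python
def summed_series(per_form):
    """Add up yearly frequencies across forms (e.g. 'the' and 'The').

    per_form maps each form to a list of yearly values; all lists are
    assumed to cover the same years in the same order.
    """
    return [sum(values) for values in zip(*per_form.values())]


def smooth(values, smoothing=3):
    """Centered moving average over `smoothing` years on each side.

    Near the ends of the series the window shrinks to the years that
    are available (an assumption about edge handling).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy numbers, not real corpus data.
forms = {"the": [1.0, 2.0, 3.0], "The": [0.5, 0.5, 0.5]}
print(smooth(summed_series(forms), smoothing=1))
```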
(For the results reported above, I downloaded the various English-language 1gram datasets, 2012 edition, pulled the counts out myself, including variants like THE, and plotted the results without any smoothing — which is why the numbers are slightly different.)
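Pulling the counts out of the downloaded files might look roughly like the sketch below. It assumes tab-separated 1gram lines of the form `ngram<TAB>year<TAB>match_count<TAB>volume_count`, and a separate per-year table of total token counts supplied by the caller. The function names and the toy numbers are hypothetical, and this is not the script actually used for the plots.

```python
from collections import defaultdict


def yearly_counts(lines, target="the"):
    """Sum match counts per year over all case variants of `target`.

    Matching is on the lowercased ngram, so 'the', 'The' and 'THE' are
    pooled, while other entries (such as 'then', or tagged forms like
    'the_DET') are left out.
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if ngram.lower() == target:
            counts[int(year)] += int(match_count)
    return dict(counts)


def yearly_percent(counts, totals):
    """Turn raw counts into percentages of all tokens for each year."""
    return {
        year: 100.0 * counts[year] / totals[year]
        for year in counts
        if totals.get(year)
    }


# Toy lines and totals, not real corpus data.
sample = [
    "the\t1900\t60\t5\n",
    "The\t1900\t30\t5\n",
    "THE\t1900\t10\t3\n",
    "then\t1900\t7\t2\n",
    "the\t1901\t50\t4\n",
]
toy_counts = yearly_counts(sample)
toy_percent = yearly_percent(toy_counts, {1900: 2000, 1901: 1000})
print(toy_percent)
```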
Cherry-picking the maximum value (in 1916) and the minimum value (in 2000) doesn't change the numbers by a lot:
|SOURCE||MAX: 1916||MIN: 2000||Difference|
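Both comparisons used in this post, first decade against last decade and maximum against minimum, reduce to a relative change between two values. A small sketch, assuming the series is a mapping from year to percent frequency (the function names and the synthetic series are made up for illustration):

```python
def relative_change(start, end):
    """Percent change from start to end (negative means a decline)."""
    return 100.0 * (end - start) / start


def decade_mean(series, first_year):
    """Mean of the values for the ten years starting at first_year."""
    values = [series[y] for y in range(first_year, first_year + 10) if y in series]
    return sum(values) / len(values)


def first_vs_last_decade(series, first=1900, last=1990):
    """Relative change between the means of two decades."""
    return relative_change(decade_mean(series, first), decade_mean(series, last))


def max_vs_min(series):
    """Years of the maximum and minimum, and the change between them.

    This is the cherry-picked comparison: it ignores which of the two
    years comes first and simply measures maximum against minimum.
    """
    max_year = max(series, key=series.get)
    min_year = min(series, key=series.get)
    change = relative_change(series[max_year], series[min_year])
    return max_year, min_year, change


# Synthetic linear decline from 6.0% in 1900, for illustration only.
toy = {year: 6.0 - 0.01 * (year - 1900) for year in range(1900, 2000)}
print(first_vs_last_decade(toy))
print(max_vs_min(toy))
```

The maximum-to-minimum change is by construction at least as large in magnitude as any other pairwise comparison within the series, which is why it is the cherry-picked upper bound.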
German also shows a decline in the summed frequency of the various forms of the definite determiner (which unfortunately are homographs with pronominal forms):
Comparing the first and last decade gives us a change of -7.2%, notably smaller than the English dataset's -17.1%:
And comparing the mid-century maximum to the end-of-century minimum increases the difference, though still not to the level of the English dataset:
|SOURCE||MAX: 1959||MIN: 2000||Difference|
Among the Romance languages available through the Google Books ngram viewer, Italian shows the greatest change in definite article frequency over the course of the 20th century. (Though note that like the other non-English languages considered here, the definite articles overlap with pronoun uses…)
Comparing the first and last decade gives us a change of -8.1%:
And comparing the 1923 maximum to the 1985 minimum increases the difference a bit, though still not to the level of the English dataset:
|SOURCE||MAX: 1923||MIN: 1985||Difference|
Spanish shows even less change (though I should note again that there may be some confusion between el the definite article and él the pronoun, and the counts definitely conflate la, las, los the definite articles and la, las, los the object pronouns):
Cherry-picking the century's maximum and minimum values increases the difference only a little:
|SOURCE||MAX: 1900||MIN: 1956||Difference|
And French shows the least overall change (though again the counts conflate articles and object pronouns):
As with Spanish, the decline is mostly a feature of the first half of the century:
|SOURCE||MAX: 1918||MIN: 1953||Difference|
Putting all five languages on the same plot, and showing the changes as proportions relative to the century-wide mean, highlights the differences:
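Expressing each language's series as a proportion of its own century-wide mean puts languages with different baseline rates on a common scale. A minimal sketch, assuming each series maps years to percent frequencies (the function name and the toy values are illustrative only):

```python
def relative_to_mean(series):
    """Divide every yearly value by the mean over the whole series.

    A result of 1.0 means the year sits exactly at the century-wide
    mean; 0.9 means it is 10% below that mean.
    """
    mean = sum(series.values()) / len(series)
    return {year: value / mean for year, value in series.items()}


# Toy values, not real corpus data.
toy_series = {1900: 4.0, 1950: 6.0, 2000: 5.0}
normalized = relative_to_mean(toy_series)
print(normalized)
```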
English and German seem to show parallel declines in definite-determiner rates, at least in the second half of the 20th century. Other evidence for English yields higher rates of change, and provides additional evidence for change in the first half of the century.
Italian also shows a reasonably convincing pattern of decline.
The evidence for Spanish and French is more equivocal. There does seem to be a modest trend, though mostly in the first half of the century rather than the second half.
For all of the languages other than English, the patterns are surely obscured to some extent by the fact that the determiners involved are homographs with pronouns, though the pronouns are generally much less frequent.
So is there a general decay of European definiteness? Or a specifically Germanic trend? Does German show the same formality, age, and gender effects as English? What about Dutch, Swedish, Norwegian, etc.? What about other languages, related and unrelated, with roughly comparable determiner systems?
Why might English, German, Italian, Spanish, and French have been moving in the same direction, even if at different rates and perhaps in different time periods? Is there some kind of general dynamical law here, a sort of Jespersen's Cycle for determiners?
And in the case of English, we're in a position to ask where all those THEs are going. Among the possibilities that occur to me:
- Substitution of other determiners, such as this, that, these, those? Problem: many of these words are also declining in frequency, and any increases are too small to account for much of the change in THE.
- More use of 's possessives rather than of possessives: "The X's Y" rather than "The Y of the X". Problem: this is happening, but the construction is way too rare to account for much of what's going on with THE.
- Substitution of pronouns for definite descriptions?
- Substitution of other constructions for abstract nouns (e.g. "that's why" instead of "that's the reason")?
- Substitution of indefinites for definites?
- Increased general verbiage (not involving THE) for a given amount of informational content?
None of these seem empirically very promising to me. But as a start, it should be possible to characterize the relevant differences between formal writing at 6.5% THE and conversational transcripts at 2.5% THE. And there's probably research on this topic that I don't know about.
Update: Bob Ladd points out that in Italian, we can look just at "the main masculine forms il and i, which are never pronominal". And the result looks proportionately just like the full set, which increases my confidence that there's a real effect:
Some relevant past LLOG posts:
"SOTU evolution", 1/26/2014
"Decreasing definiteness", 1/8/2015
"Why definiteness is decreasing, part 1", 1/9/2015
"Why definiteness is decreasing, part 2", 1/10/2015
"Why definiteness is decreasing, part 3", 1/18/2015