The case of the disappearing determiners

For the past century or so, the commonest word in English has gradually been getting less common. Depending on data source and counting method, the frequency of the definite article THE has fallen substantially — in some cases at a rate as high as 50% per 100 years.

At every stage, writing that's less formal has fewer THEs, and speech generally has fewer still, so to some extent the decline of THE is part of a more general long-term trend towards greater informality. But THE is apparently getting rarer even in speech, so the change is more than just the (normal) shift of writing style towards the norms of speech.

There appear to be weaker trends in the same direction, at overall lower rates, in German, Italian, Spanish, and French.

I'll lay out some of the evidence for this phenomenon, mostly collected from earlier LLOG posts. And then I'll ask a few questions about what's really going on, and why and how it's happening. [Warning: long and rather wonky.]

Data from the Google Books ngram corpus shows a decline in the frequency of THE, mostly in the last third of the 20th century.

Comparing the first decade of the century with the last decade, we get:

SOURCE             1900-1910   1990-2000   Difference
English            6.39%       5.28%       -17.3%
American English   5.98%       4.99%       -16.5%
Fiction            4.97%       4.45%       -10.5%
British English    5.86%       5.32%       -9.3%

(And the systematically lower frequency of THE in the "Fiction" dataset represents the influence of a generally less formal genre.)

The Corpus of Historical American English shows a similar effect, spread more evenly over the 20th century:

SOURCE 1900-1909 1990-1999 Difference
COHA 6.53% 5.37% -17.8%

The Corpus of Contemporary American English shows a decline of nearly 8% over the 25 years from 1990 to 2015, which would be about 28% compounded for a century:

SOURCE 1990-1994 2010-2015 Difference Projection
COCA 5.62% 5.18% -7.9% -28%
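
The arithmetic behind the "Difference" and "Projection" columns can be sketched in a few lines. This is only an illustration, assuming that the projection simply compounds the observed relative change at a constant rate; the function names are made up for the example, and the inputs are the rounded percentages from the table, so the results can differ from the table in the last digit.

```python
def relative_change(start, end):
    """Relative change from start to end, as a fraction (negative means decline)."""
    return (end - start) / start


def project(change, span_years, target_years=100):
    """Compound a change observed over span_years out to target_years.

    Assumes a constant proportional rate of change, which is a strong assumption.
    """
    return (1 + change) ** (target_years / span_years) - 1


# COCA: 5.62% in 1990-1994, 5.18% in 2010-2015 (rounded values from the table)
coca = relative_change(5.62, 5.18)
century = project(coca, span_years=25)
print(f"COCA: {coca:+.1%} over 25 years, {century:+.1%} per century")
```

With these rounded inputs the decline comes out at a bit under 8% over 25 years and close to 28% per century.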

And COCA's rates by section (for the period 1990-2015) exhibit the genre/formality effect — the frequency of THE in the "Spoken" section is about 27% lower than the rate in the "Academic" section:

Spoken Fiction Newspaper Magazine Academic
4.62% 5.29% 5.35% 5.36% 6.30%

The COCA "Spoken" segments are relatively formal interview transcripts. In the Fisher corpus of conversational telephone speech, THE's overall frequency is only 2.47%, a little more than half the rate of the "Spoken" segment of COCA.

And if we break things out by age and sex, we see the pattern typical of a language change in progress. Younger people use THE at lower rates than older people, and in each age group, women use THE at lower rates than men:

The same numbers in tabular form:

         Age <28   Age 28-40   Age >40
MALE     2.53%     2.72%       2.97%
FEMALE   2.31%     2.49%       2.62%

Data from a (more recent) collection of Facebook posts from 75,000 volunteers shows a similar (but even more advanced) pattern, with teen women's posts dipping below 2%:

It's conceivable that these are stable life-cycle and gender effects, but I doubt it — in every case where I've seen a pattern like this, independent evidence has shown that the pattern reflects a change in progress.

American presidents' State of the Union addresses show a decline of about 50% over the past 115 years in the frequency of THE:

SOURCE 1900-1910 2005-2015 Difference
SOTU addresses 9.21% 4.67% -49.3%

If we compare the SOTU data with COHA and Google over a longer span of time, we can see that the trends are in the same direction, although the SOTU addresses show an effect of greater magnitude:

The biomedical abstracts in the MEDLINE dataset show a steady decline of 26% over 40 years, from 6.48% in 1975 to 4.82% in 2014 — which would project to a decline of over 50% in 100 years, matching the SOTU rate:

The Google Books datasets in other languages seem to show flatter profiles for definite determiners over the course of the 20th century. Let's start with data for English, created by the same method that I used for German, Italian, Spanish, and French below, namely to ask the Google Books ngram interface for the sum of determiner forms with and without initial capitalization, with smoothing=3.

(For the results reported above, I downloaded the various English-language 1gram datasets, 2012 edition, pulled the counts out myself, including variants like THE, and plotted the results without any smoothing — which is why the numbers are slightly different.)
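
As a rough sketch of the two steps involved (summing the spelling variants, then smoothing), something like the following would do. It assumes that "smoothing=3" means averaging each year with up to three years on either side, and that the years in the series are consecutive. The counts are invented for illustration and are not real Google Books data, and the function names are hypothetical.

```python
def summed_frequency(counts_by_form, totals):
    """Combine several spellings of one word into a per-year relative frequency."""
    return {
        year: sum(counts.get(year, 0) for counts in counts_by_form.values()) / total
        for year, total in totals.items()
    }


def smooth(series, k=3):
    """Centered moving average: each year is averaged with up to k neighbors
    on either side (fewer at the edges). Assumes consecutive years."""
    years = sorted(series)
    smoothed = {}
    for i, year in enumerate(years):
        window = years[max(0, i - k): i + k + 1]
        smoothed[year] = sum(series[y] for y in window) / len(window)
    return smoothed


# Invented counts, for illustration only (not real Google Books data).
counts = {
    "the": {1900: 60, 1901: 58, 1902: 59},
    "The": {1900: 4, 1901: 5, 1902: 3},
}
totals = {1900: 1000, 1901: 1000, 1902: 1000}

raw = summed_frequency(counts, totals)
print(smooth(raw, k=1))
```

Averaging already-normalized frequencies, as here, is not the same as pooling raw counts across the window, which may be one more source of small discrepancies between the two sets of numbers.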

SOURCE 1900-1910 1990-2000 Difference
ENGLISH 6.36% 5.27% -17.1%

Cherry-picking the maximum value (in 1916) and the minimum value (in 2000) doesn't change the numbers by a lot:

SOURCE MAX: 1916 MIN: 2000 Difference
ENGLISH 6.42% 5.23% -18.5%

German also shows a decline in the summed frequency of the various forms of the definite determiner (which unfortunately are homographs with pronominal forms):

Comparing the first and last decade gives us a decline of 7.2%, notably smaller than the English dataset's 17.1%:

SOURCE 1900-1910 1990-2000 Difference
GERMAN 9.61% 8.91% -7.2%

And comparing the mid-century maximum to the end-of-century minimum increases the difference, though still not to the level of the English dataset:

SOURCE MAX: 1959 MIN: 2000 Difference
GERMAN 10.05% 8.78% -12.7%

Among the Romance languages available through the Google Books ngram viewer, Italian shows the greatest change in definite article frequency over the course of the 20th century. (Though note that like the other non-English languages considered here, the definite articles overlap with pronoun uses…)

Comparing the first and last decade gives us a decline of 8.1%:

SOURCE 1900-1910 1990-2000 Difference
Italian 5.00% 4.59% -8.1%

And comparing the 1923 maximum to the 1985 minimum increases the difference a bit, though still not to the level of the English dataset:

SOURCE MAX: 1923 MIN: 1985 Difference
Italian 5.05% 4.55% -10.0%

Spanish shows even less change (though I should note again that there may be some confusion between el the definite article and él the pronoun, and the counts definitely conflate la, las, los the definite articles and la, las, los the object pronouns):

SOURCE 1900-1910 1990-2000 Difference
SPANISH 8.57% 8.34% -2.7%

Cherry-picking the century's maximum and minimum values increases the difference only a little:

SOURCE MAX: 1900 MIN: 1956 Difference
SPANISH 8.60% 8.23% -4.3%

And French shows the least overall change (though again the counts conflate articles and object pronouns):

SOURCE 1900-1910 1990-2000 Difference
FRENCH 6.25% 6.15% -1.6%

As with Spanish, the decline is mostly a feature of the first half of the century:

SOURCE MAX: 1918 MIN: 1953 Difference
FRENCH 6.39% 6.05% -5.4%

Putting all five languages on the same plot, and showing the changes as proportions relative to the century-wide mean, highlights the differences:
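
The normalization described here is simple enough to sketch. The series below is invented for illustration (three made-up data points, not the actual yearly values), and the function name is hypothetical.

```python
def relative_to_mean(series):
    """Express each value as a proportion of the mean over the whole series."""
    mean = sum(series.values()) / len(series)
    return {year: value / mean for year, value in series.items()}


# Invented percentages, for illustration only.
example = {1900: 6.4, 1950: 6.0, 2000: 5.2}
print(relative_to_mean(example))
```

Dividing by the mean puts languages with very different base rates (German at around 9-10%, Italian at around 5%) on a common scale, so that only the shape of the change is compared.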

English and German seem to show parallel declines in definite-determiner rates, at least in the second half of the 20th century. Other evidence for English yields higher rates of change, and provides additional evidence for change in the first half of the century.

Italian also shows a reasonably convincing pattern of decline.

The evidence for Spanish and French is more equivocal. There does seem to be a modest trend, though mostly in the first half of the century rather than the second half.

For all of the languages other than English, the patterns are surely obscured to some extent by the fact that the determiners involved are homographs with pronouns, though the pronouns are generally much less frequent.

So is there a general decay of European definiteness? Or a specifically Germanic trend? Does German show the same formality, age, and gender effects as English? What about Dutch, Swedish, Norwegian, etc.? What about other languages, related and unrelated, with roughly comparable determiner systems?

Why might English, German, Italian, Spanish, and French have been moving in the same direction, even if at different rates and perhaps in different time periods? Is there some kind of general dynamical law here, a sort of Jespersen's Cycle for determiners?

And in the case of English, we're in a position to ask where all those THEs are going. Among the possibilities that occur to me:

  • Substitution of other determiners, such as this, that, these, those? Problem: many of these words are also declining in frequency, and any increases are too small to account for much of the change in THE.
  • More use of 's possessives rather than of possessives: "The X's Y" rather than "The Y of the X". Problem: this is happening, but the construction is way too rare to account for much of what's going on with THE.
  • Substitution of pronouns for definite descriptions?
  • Substitution of other constructions for abstract nouns (e.g. "that's why" instead of "that's the reason")?
  • Substitution of indefinites for definites?
  • Increased general verbiage (not involving THE) for a given amount of informational content?

None of these seem empirically very promising to me, but as a start, it should be possible to characterize the relevant differences between formal writing at 6.5% THE and conversational transcripts at 2.5% THE. And there's probably research on this topic that I don't know about.

Update: Bob Ladd points out that in Italian, we can look just at "the main masculine forms il and i, which are never pronominal". And the result looks proportionately just like the full set, which increases my confidence that there's a real effect:

  1. Stephen Nightingale said,

    January 3, 2016 @ 8:44 am

    I blame Strunk and White's war on passive and objective constructions. Published in 1916 by Strunk, expanded with more prescriptiveness in 1959 by E.B. White. Though how this got into German I can't say …

  2. ngage92 said,

    January 3, 2016 @ 9:49 am

    Bravely fighting against this trend is my parents' generation who keep sticking "the" in front of words where it doesn't belong; e.g. "the gays," "the facebook" etc

  3. phw said,

    January 3, 2016 @ 10:17 am

    Hmm, interesting. What does that mean for Mr Zipf?

  4. Doug said,

    January 3, 2016 @ 10:28 am

    " Is there some kind of general dynamical law here, a sort of Jespersen's Cycle for determiners?"

    Oddly enough, I've read elsewhere that a common cycle for definite articles is the opposite — that over time they become more & more common, & evolve from a marker of definiteness into an essentially meaningless marker on all nouns — and then sometimes a new article evolves to mark definiteness.

    [(myl) If it were (starting to be) true that definite articles have become "an essentially meaningless marker", then their frequency might decline in favor of the new ways to "mark definiteness", whatever they are, right?]

  5. Coby Lubliner said,

    January 3, 2016 @ 10:48 am

    Is there any possibility of retrieving whole phrases or even clauses that differ only in having "the" in earlier sources but lacking it in later ones?

    [(myl) In principle, yes — but unless the effects are due to a smallish number of fixed phrases, I doubt that this would tell us much. The statistics of part-of-speech ngrams would be a better line of inquiry, I think — maybe starting by comparing written text with conversational transcripts.]

  6. Phillip said,

    January 3, 2016 @ 10:56 am

    One thing to remember about the definite article in German is that it forms contractions with various prepositions, both formal/standard (zu dem -> zum) and non-standard/informal (für das -> fürs). So you need to somehow include those forms in the count of definite articles. Of course, increasing frequency of the contracted form compared to the non-contracted form is in and of itself interesting.

    There's another "problem" though – the definite article can also work as a demonstrative, so the contracted form isn't always semantically equivalent to the full form. "Am Abend" is "in the evening", while "An dem Abend" is "on that evening".

    [(myl) Good points. Pending statistics from some grammatically-analyzed material, all of the non-English stuff has to be taken with several large grains of salt. But I included it because maybe there's enough chance of a signal there to make it worthwhile to investigate further…]

  7. Bruce Rusk said,

    January 3, 2016 @ 11:09 am

    Could some of the change in English be due to the definite article increasingly being used once at the head of a series of nouns rather than before each noun: "the butcher, baker, and candlestick maker" rather than "the butcher, the baker, and the candlestick maker"?

    [(myl) It's true that the frequency of the pattern "and the _NOUN_" seems to have decreased by about 0.03% absolute. But this is at best 1/40 of the change of about 1.2% absolute in the frequency of "the".

    It's possible — maybe even likely — that the effect is the sum of a lot of little things like this. But if so, what's the current that makes the little things mostly drift in the same direction?]

  8. Jerry Friedman said,

    January 3, 2016 @ 11:10 am

    I blame Time founder Henry Luce.

    Seriously, the anarthrous nominal premodifier may at least partly result from a belief that good style relies on "strong" nouns and verbs and avoids "weak" and "colorless" little function words. Could such a belief actually have a noticeable effect? The possibility doesn't seem very consistent with the greater use of "the" in more formal language, which presumably is more carefully edited for style.

    Other possibilities would be increased use of mass nouns relative to count nouns and increased use of bare plurals.

  9. Pflaumbaum said,

    January 3, 2016 @ 11:18 am

    Interesting to hear that it's declining somewhat in Romance languages too.

    In Romanian, as I understand it, the masculine (or arguably also neuter) article /-l/ was originally added to the end of words usually ending in /-u/. Eventually final /-u/ on indefinite nouns was lost, apart from when preceded by a vowel or [plosive + liquid] combination. Then the article was re-analysed as the whole /-ul/ on definite nouns.

    But today the /-l/ itself is dropped much of the time, restoring an /-u/ ending, like that which originally just marked the masculine (/'neuter'), but now marking definiteness. So with words like leu (lion) and socru (father-in-law), there is now syncretism between definite and indefinite forms.

    The article remains robust elsewhere (though sometimes prone to mishearing, at least for foreigners, in the feminine, as it usually depends on the distinction between unstressed /a/ and /ə/). But it will be interesting to see how the language deals with the masculine ambiguity – if it tolerates it, modifies the -Vu/-CRu indefinites somehow, or extends the syncretism by restoring the meaning of /-u/ as general masculine marker (perhaps for definiteness installing the alternative masculine article '-le' which currently attaches to nouns ending in /-e/).

  10. Mr Punch said,

    January 3, 2016 @ 11:57 am

    Is all fault of Moose and Squirrel

  11. Bob Ladd said,

    January 3, 2016 @ 12:04 pm

    For Italian, you could just look at the main masculine forms il and i, which are never pronominal, to see how closely they match the overall trends that you get when you include the ambiguous forms la, le, lo, l', gli. There's no such option in French, German or Spanish.

    [(myl) Good idea — should have thought of it! The results are (proportionately) similar:

    A decrease of 7.35% relative, from 2.09% in 1900-1910 to 1.94% in 1990-2000; or 9.37%, from a maximum of 2.09% in 1923 to a minimum of 1.92% in 1988.]

  12. Jerry Friedman said,

    January 3, 2016 @ 1:26 pm

    You could do something almost as good in Spanish, since the pronoun él is much less common than the pronouns la, los, and las, as you noted. I don't know how to embed this graph.

  13. Oriana said,

    January 3, 2016 @ 1:27 pm

    Stumbled on FiveThirtyEight's reddit ngram viewer and it looks like the same trend plays out on a smaller time-scale within reddit. The data covers 2007-10-15 to 2015-08-31. The mean frequency for the partial 2007 data is 4.72%, and drops to 4.06% for the year of 2015 which is a relative change of 13.9%.

    reddit n-gram for "the" :

    [(myl) Yes, I noticed that. Here's a version of the results presented in the same format as the earlier plots:

    The rate seems so extreme that I wondered whether something funny might be going on. Calculated from the raw data, the average proportion for comments from 2007 is 0.04720536 and for comments from 2015 is 0.04060947, for a decline of (0.04720536-0.04060947)/0.04720536 = 0.1397276 [= about -14%] in a mere 8 years. Projecting for a century would take us from 4-5% down to 0.04720536*((1-0.1397276)^(1/8))^100 = 0.007193 [a rate of about 0.7%] for THE, which is hard to believe…

    Of course, long-term projections of short-term rates of change run into some version of Stein's Law. And it looks like there's a more rapid fall in the beginning, followed by a more gradual and fading decline, maybe all caused by a stylistic shift from more old-fashioned prose to a purer form of internet language.

    So anyhow you're right, we should add the Reddit results to the data pile.]

  14. chris said,

    January 3, 2016 @ 2:04 pm

    The reader may suspect that individual pseudo-definite constructions like the one at the beginning of this sentence are giving way to plurals. At least, they sound a bit dated to me. Whether or not that actually reflects a shift in the unconscious assumption of a Platonic ideal reader vs. acknowledgment of multiple readers each distinct from the other, either that type of usage has declined or it hasn't, and I think it has.

    Also, "the President" vs. "President Obama" (or "the Queen of England" vs. "Queen Elizabeth" etc.); there is still space for the former when you're stating a general principle that can apply to any person holding that title, but for the deeds of a specific one, it is presently much more common to use the name and not just the title; I don't think it was always that way, although I could be wrong.

    P.S. ngage92 may be on to something too. Does the generational difference reflect a tendency for new concepts or entities to not use "the" where older ones would have? You can call someone on the telephone but you can't send them the email. You can look them up in the phone book but not on the Google. If older speakers are more likely to violate these idioms it may reflect that they would have been formed differently in their youth.

  15. Guy said,

    January 3, 2016 @ 2:30 pm

    On another thread I wrote "the collection of ivory involved the killing of the elephant", which I suppose could also have had a definite article before "ivory", except that would feel a bit odd to me since I was discussing "ivory" as a substance in general, not conceptualizing it as a specific piece of ivory thought of as an exemplar of generic instance of all pieces of ivory. It occurs to me that I'd be unlikely to say something like that in ordinary speech. "The collection" would probably have been "collecting" or simply collection, and "the elephant" would probably involve "elephants". So I suppose one hypothesis of where "the" is disappearing is that it is not in what we might think of as "prototypical" uses of "the" (specific references to concrete individuals previously identified in discourse) but that it is perhaps occurring primarily among non-referential or abstract noun phrases, and uses of noun phrases to identify entire classes. One preliminary test might be to use a reliable parser to count all the places where a determiner is obligatory, and see if there's a change in frequency there.

  16. DTI said,

    January 3, 2016 @ 3:35 pm

    When I was in college in the early 1980s a popular poster on campus was "Stop the bombing in El Salvador." Which always struck me as alienated and (non-Strunk-and-White) passive. "Stop THE bombing IN" just seems reedier and less powerful than "Stop bombing!" At least for citizens of the country doing the alleged bombing.

    But then also when I was in college, or perhaps soon after, there was a trend for upscale/hipster cafes and restaurants to add "the" to menu items, as in "$7 the cup" or "$55 the serving."

    During the Bush years, of course, ironic usage like "the internets" spiked.

    Hmm. The previous paragraphs notwithstanding, I find myself using "the" in most of my sentences. Mostly not in the unnecessary ways.

  17. James Wimberley said,

    January 3, 2016 @ 3:56 pm

    The systematic use of truncated scales on the vertical axis is legitimate here. But it can also be used to deceive, as with the infamous Fox News charts. Can the publishers of charting software be persuaded to offer the zigzag symbol for truncation?

  18. Sili said,

    January 3, 2016 @ 4:11 pm

    Why are the wars so obvious in the French and Italian (or sorta German) data?

  19. Bob Ladd said,

    January 3, 2016 @ 4:37 pm

    @Sili: Also in British English Google Books, but much less so in AmEng., esp. WW2.

    [(myl) My guess is that this is the effect of a difference in what sorts of things got published during those periods, rather than how people wrote during those periods.

    Statistical analysis of the (downloadable) dataset could probably confirm or refute this idea.]

  20. Matt Keefe said,

    January 3, 2016 @ 5:52 pm

    So the obvious assumption is that fewer NPs are turning up with the definite article in them (rather than NPs themselves being less common, or there being an increase in the word-length of phrases overall thus reducing the frequency of the). Is there any reason to doubt that?

    Assuming not, and following, more or less, the suggestion that it may well be a host of small changes, are we just using more nouns that typically occur without a definite article (indeed, any determiner)? I can think of a few changes that might prompt this.

    Greater use of mass nouns. Do we now talk about 'oil' or 'energy' more often relative to how often we talk about 'the oil industry'? (And do such conversations now more often include elements like 'renewable energy' which seem to me unlikely to turn up with the definite article very often.) Ditto, do we now talk about 'government', 'industry' and 'business' more often relative to our use of constructions like 'the government', 'the X industry' and so on.

    Loss of the article in common phrases. A quick look over a list of the most common nouns yields thing, point, problem and fact, all of which supply constructions of the kind 'The X is…', where the determiner is now frequently omitted ('Thing is…', 'Point is…' – all perfectly natural to me). A change like this wouldn't necessarily need to be motivated by anything in particular – shedding an unstressed monosyllable at the start of a sentence seems pretty natural.

    Greater preponderance of brand names: YouTube, Google, Facebook. And I wonder if this one might motivate coinages for new products and technology (even where not brand specific) of a type less likely to be frequently accompanied by the. Notably, internet does usually occur with it, but what about the kinds of terms we use in discussing communication (off the top of my head, cell phone, mobile phone, social media, email, broadband) and entertainment (streaming, on-demand downloads). Have these replaced common phrases like 'the phone', 'the television' and 'the radio' (and before that, 'the telegraph') which, despite being new coinages, were of a kind likely to occur in most contexts with a definite article? (Some do still occur with it, of course – 'Is the WifI down?'.)

    Some, like TV, seem to me to be used with determiners in fewer contexts than previously ('the television', 'the telly' or even 'the TV' would have been typical in speech when I was younger; my sense is very much that just 'TV' is typical now); likewise, a phrase like 'the radio' has presumably been replaced in many ways by phrases unlikely to employ the definite article: 'digital radio', 'DAB', named proprietary broadcast systems. Phrases like 'John's on the phone' are presumably less common since most people now have their own mobile phones and the use of messaging systems is more common.

    I wonder if there could be tipping points to any of these effects – the way we talk about certain kinds of concepts and the likelihood of a new invention being known by a ubiquitous brand name becoming just frequent enough to create a trend followed for other existing terms and other new coinages?

  21. Sam said,

    January 3, 2016 @ 6:52 pm

    Could it have something to do with decreasing sentence lengths over time? The extra definiteness provided by "the" might be especially necessary in long ornate sentences of the sort that were so popular in past centuries, and became so much less so in the course of the 20th.

    (Omitting links in hope of avoiding moderation purgatory, but on the historical drop in sentence length see e.g. the 2011 Log post on sentence lengths in inauguration speeches, or the chapter on "Style and Presentation in the 20th Century" in Gross et al., Communicating Science, Oxford, 2002.)

    I suppose the test of this would be to see whether longer sentences really do generally have a greater "the" density — if it's real, the effect ought to be observable synchronically as well as diachronically.

  22. christoll said,

    January 3, 2016 @ 7:04 pm

    Phillip's point above about German contractions (zum, fürs, am etc.) also applies to Spanish, Italian and of course French as well, although I doubt that the frequency of such contractions has been increasing in those languages as it has been in German.

    I wonder how much of the effect might be due to personal pronouns replacing the definite article, with "he's on the phone" being replaced by "he's on his phone", for example. Certainly in French it seems like there has been a trend toward greater "personalization" of the language in this way. I doubt that this would explain much of the effect in German, but maybe in German the increasing frequency of contractions and the replacement of the genitive by "vom" would more than make up the difference.

  23. Jamie Pennebaker said,

    January 3, 2016 @ 8:15 pm

    I love this post and comments. The word "the" typically is used in front of a concrete noun. Is the word "the" dropping or is the use of concrete nouns dropping? To get a very rough sense of this, I went to and entered:

    and, in a separate analysis:
    (the school)+(the book)+(the water)+(the day)+(the dog)

    Rate for nouns alone in 1900-1910 = .16
    1990-2000 = .11
    a drop of 31.2%

    Rate for the same nouns preceded by "the"
    in 1900-1910 = .028
    in 1990-2000 = .018
    a drop of 35.7%

    The drop for "the" alone went from 5.77 to 4.68, or 18.9%.

    I also checked to see if there has been a subtle move from singular concrete nouns to plural ones which might result in people just referring to a broader category that is inherently less specific. Alas, the same results: a drop in the same group of plural nouns by 35%.

    What about (a school)+(a book)+(a water)+(a day)+(a dog)? Same general pattern. Although the base rate of usage is lower (dropping from .0094 to .0068), the drop is 27.6% — just slightly lower than "the" drops.

    The problem may not be "the" usage but concrete noun usage. The patterns all look similar to Mark's first graph — although not as dramatic: a slight rise after 1900 and then a drop beginning around 1945, with a fairly steady decline between 1950 and 1990.

    Are we becoming more conceptual (words such as "ideas", "beliefs", etc are on the rise) in our books? What do these patterns mean from a psychological perspective? Are people paying attention to their worlds differently?

  24. Y said,

    January 3, 2016 @ 9:32 pm

    To get a better understanding of this, I picked three spoken SOTU speeches illustrating this, and picked a longish paragraph near the middle of each one, with a percentage of 'the' tokens similar to that in the speech as a whole:

    Wilson 1916 (8.2%, 9.5% for the whole)

    Immediate passage of the bill to regulate the expenditure of money in elections may seem to be less necessary than the immediate enactment of the other measures to which I refer, because at least two years will elapse before another election in which Federal offices are to be filled; but it would greatly relieve the public mind if this important matter were dealt with while the circumstances and the dangers to the public morals of the present method of obtaining and spending campaign funds stand clear under recent observation, and the methods of expenditure can be frankly studied in the light of present experience; and a delay would have the further very serious disadvantage of postponing action until another election was at hand and some special object connected with it might be thought to be in the mind of those who urged it. Action can be taken now with facts for guidance and without suspicion of partisan purpose.

    Eisenhower 1954 (7.6%, 6.8% for the whole)

    While we are moving toward lower levels of taxation we must thoroughly revise our whole tax system. The groundwork for this revision has already been laid by the Committee on Ways and Means of the House of Representatives, in close consultation with the Department of the Treasury. We should now remove the more glaring tax inequities, particularly on small taxpayers; reduce restraints on the growth of small business; and make other changes that will encourage initiative, enterprise and production. Twenty-five recommendations toward these ends will be contained in my budget message.

    Bush 2004 (4.3%, 5.4% for the whole)

    At the same time, we must ensure that older students and adults can gain the skills they need to find work now. Many of the fastest growing occupations require strong math and science preparation and training beyond the high school level. So tonight, I propose a series of measures called Jobs for the 21st Century. This program will provide extra help to middle and high school students who fall behind in reading and math, expand advanced placement programs in low-income schools, invite math and science professionals from the private sector to teach part-time in our high schools. I propose larger Pell grants for students who prepare for college with demanding courses in high school. I propose increasing our support for America's fine community colleges, so they can—I do so, so they can train workers for industries that are creating the most new jobs. By all these actions, we'll help more and more Americans to join in the growing prosperity of our country. Job training is important, and so is job creation. We must continue to pursue an aggressive, progrowth economic agenda.

    I don't see any obvious clues. Wilson's speech would need complete paraphrasing to reduce the percentage of its the's. Eisenhower's is biased by reference to named institutions. This is very common in some of the earlier written SOTUs, which read like technical reports rather than popular speeches.

  25. Felix said,

    January 3, 2016 @ 10:10 pm

    Sili said, "Why are the wars so obvious in the French and Italian (or sorta German) data?" and I noticed this too. French and Italian rose during WW I and WW II, French also rose during its Vietnam and Algerian wars, Spanish rose during WW I and its civil war, but German rose from WW I all the way into WW II, then declined. Or maybe I am not lining up the data well with the axis.

  26. Rubrick said,

    January 3, 2016 @ 10:10 pm

    Obviously the fault lies with the rise of cultural relativism. Kids these days are just way more wishy-washy than their forebears; they're not definite about anything, articles included. Where do I publish?

    But seriously. Of the various hypotheses listed in the comments, the branding one (and particularly the rise of national brands) appeals to me quite a bit. We used to go to the coffee shop, the grocery store, and the pharmacy; now we go to Starbucks, Safeway, and Walgreens.

    This seems insufficient to account for the whole trend; but on the other hand, as Mark confirmed in response to a comment on a previous post, proper nouns have been sorely neglected in linguistic analysis, and are usually omitted from word counts.

    A rise over time in the proportion of words which are brand names — or even proper nouns in general — would be a pretty good smoking gun. Proper nouns usually don't take a definite article.

  27. D.O. said,

    January 4, 2016 @ 1:40 am

    Having simple counting tools like Google n-grams does not allow for a lot of subtlety, but still, here's my suggestion. A rough search for "the *" shows that the most common word following "the" is "same". "Same" is almost always preceded by "the". Now, during the 20th century the link between "the" and "same" remained unchanged, but the frequency of "(the) same" dropped substantially. On the other hand, the words "first" and "other", which are also among the most frequent followers of "the", did not drop in frequency (or maybe just a little), but their linkage to "the" has weakened. So here's my suggestion. Take the first 100 words or so that are most often found after "the" and look at which cases show more of a drop in absolute use (the "same" pattern) vs. a weakening linkage (the "other" pattern). The first 100 words will probably cover about half of the overall occurrences of "the", but that is still a relatively manageable set of words to ponder individually. It is also probably a large enough set to spot some regularities if they are not too deep.
    And yes, though the first 10 most frequent followers of "the" (the info Google provides without jujitsu) behave sort of normally, there is no telling what will happen in the first hundred.
    Unfortunately, I cannot provide this service to the linguistic community because there is no way I can work even with 1-grams other than typing a few words in the ngram search field.
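    The decomposition proposed above can be sketched in a few lines of Python. This is only an illustration, not something anyone in the thread actually ran: the `Counts` class, the function names, and all the numbers are made up for the example, and real values would have to come from 1-gram and 2-gram frequency tables. The idea is that the frequency of "the w" equals the frequency of w times the share of w's occurrences that are preceded by "the", so the log change in "the w" splits into a frequency term plus a linkage term.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Counts:
    """Relative frequencies for one follower word in one period (hypothetical)."""
    word: float      # frequency of the word itself (1-gram)
    the_word: float  # frequency of the bigram "the <word>"


def decompose(before: Counts, after: Counts) -> Tuple[float, float]:
    """Split the log change in f("the w") into (frequency term, linkage term).

    f("the w") = f(w) * P(the | w), so the two terms add up to the
    log change of the bigram frequency.
    """
    freq_term = math.log(after.word / before.word)
    link_before = before.the_word / before.word
    link_after = after.the_word / after.word
    link_term = math.log(link_after / link_before)
    return freq_term, link_term


def classify(before: Counts, after: Counts) -> str:
    """Label a word by whichever term accounts for more of the change."""
    freq_term, link_term = decompose(before, after)
    return "same pattern" if abs(freq_term) >= abs(link_term) else "other pattern"


# Invented numbers: the word gets rarer but keeps its link to "the".
same_like = (Counts(word=100e-6, the_word=95e-6), Counts(word=70e-6, the_word=66.5e-6))
# Invented numbers: the word is about as common, but follows "the" less often.
other_like = (Counts(word=500e-6, the_word=200e-6), Counts(word=490e-6, the_word=147e-6))

print(classify(*same_like))   # same pattern
print(classify(*other_like))  # other pattern
```

    Running `classify` over the 100 most frequent followers and tallying the two labels would give a first rough answer to the question; the log scale is just one convenient choice, since it makes the two contributions additive.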

  28. v01ces said,

    January 4, 2016 @ 5:45 am

    Might it be that a lot of non-native English speakers started writing books in English? People whose first language does not have articles tend not to use them. This theory explains Reddit, at least.

  29. Yuval said,

    January 4, 2016 @ 7:34 am

    @Y: I think the samples you posted are actually quite illuminating, hinting at a construction-related reason that just slightly eluded myl (closest is his possessive hypothesis): non-definite compounds replace more elaborate constructions (or just definite compounds).

    Consider Wilson's "regulate the expenditure of money". I suspect Bush would just say "regulate money expenditure". Same goes for, at least, "the methods of expenditure". Some other compounds would very easily lose the definite today: "the immediate enactment", "the public mind", "the public morals". Without these five THEs the passage drops to 5.2%.

    On the other side, some Bush NPs might well have been definite in Wilson's speech: "older students and adults", "middle and high school students who fall behind", "students who prepare for college", "increasing our support" (perhaps as "the increase in our support"*), "workers for industries", "job creation" ("the creation of jobs"). These would bring the paragraph to 14 THEs, or 7.5%, and one could probably find enough others to get it to Wilson's 9%.

    *@myl: is there an increase in gerunds over the last century?
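    For anyone who wants to redo this kind of count on a passage, here is a minimal Python sketch. The function name `the_rate` and the tokenization rule (runs of letters, with internal apostrophes kept) are assumptions made for the example; a different tokenizer will give slightly different percentages, so this will not necessarily reproduce the figures quoted above.

```python
import re


def the_rate(text):
    """Return (number of THEs, number of word tokens, THE share) for a passage."""
    # Assumed tokenization: runs of letters, keeping internal apostrophes ("we'll").
    tokens = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)
    the_count = sum(1 for t in tokens if t.lower() == "the")
    share = the_count / len(tokens) if tokens else 0.0
    return the_count, len(tokens), share


# Two sentences from earlier in this thread, used as toy input.
old_style = "We used to go to the coffee shop, the grocery store, and the pharmacy."
new_style = "Now we go to Starbucks, Safeway, and Walgreens."

for passage in (old_style, new_style):
    count, total, share = the_rate(passage)
    print(f"{count} THEs in {total} words ({share:.1%})")
```

    With this tokenizer the first sentence has 3 THEs in 14 words and the second has none in 8. Note that deleting a THE from a passage lowers the word total as well as the THE count, which matters a little when the passage is short.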

  30. Jerry Friedman said,

    January 4, 2016 @ 11:38 am

    v01ces: Your suggestion might be involved in the trends at reddit and in academic writing, but not that in (the) State of the Union addresses.

    Anyway, I too have solved it, as seen in these ngram graphs:

    if I would have / if I had

    is located at / is at

    too _ADJ_ of / too _ADJ_ a (using "too _ADJ_ of" as a proxy for "too _ADJ_ of a", which you can't search for)

    Okay, I suppose those don't explain the SOTU trends either, but still…

  31. MikeA said,

    January 4, 2016 @ 12:49 pm

    Just sitting in my armchair, I have to wonder what regional differences versus the homogenization of dialect via popular entertainment have to do with it. I would love seeing a general trend in the-dropping reduce the number of times I hear a SoCal transplant refer to "the 101" or worse "the El Camino" here in NoCal. Or did the writing style of Dan Brown create a glut of surplus 'the's which made them more available for use in general speech?

  32. BZ said,

    January 4, 2016 @ 3:31 pm

    When the Twilight Zone remade the episode "The Eye of the Beholder" in 2003, it dropped the first "The" from the title. I don't know if that means anything, though.

  33. J. W. Brewer said,

    January 4, 2016 @ 4:22 pm

    Do we think the difference shown between BrEng and AmEng is meaningful? (My own anecdotal experience is that the way the google books corpus sorts texts into the "American" and "British" subcorpora is riddled with errors, but maybe on a large enough scale the data is clean enough for different rates of usage to be meaningful?)

    [(myl) Unfortunately, almost everything about the Google Book corpus is "riddled with errors".

    To the extent that the errors are genuinely random, it doesn't matter much in cases like this. But there might well be various sorts of bias introduced.]

    I can think of a few minimal pairs where BrEng often or typically uses a "the" where AmEng wouldn't ("on main street" v. "in the high street"; "watching TV" v. "watching the telly"), but I wouldn't be surprised if there are examples going the other way as well that are not springing to my mind, and certainly don't have a theory for the differences in the two pairs I noted other than, hey, fixed idioms are what they are. In some old LL thread about arthrousness in rock band names that I'm not going to take the time to find a link for, it may have been noted (quite possibly by me!) that British usage tends to be modestly but notably more arthrous than American (example, the '70's band known as merely "Sweet" in the US was standardly The Sweet in the UK), but again, the difficulty is finding a why for the difference and seeing if it generalizes and if so to what range of other situations.

  34. J. W. Brewer said,

    January 4, 2016 @ 4:29 pm

    PS: I realize the bigger story is that the rate of "the" usage has been notably dropping in both AmEng and BrEng (as well as other languages), but am still thinking that an understanding of why the slope of the decrease differs between Am and Br (if we think the data is clean enough that what the graph shows as to that comparison is meaningful) might help illuminate the causes of the broader trend.

  35. J. W. Brewer said,

    January 4, 2016 @ 4:39 pm

    One diachronic AmEng example that comes to mind: Today's Young People often say "go to prom" rather than "go to the prom." The anarthrous version of the phrase is still shown as the minority variant, albeit a significant one, in the google books n gram corpus for 2008 (most recent year available), but that of course will include lots of nostalgic uses by old people using the older formulation who have not switched over to the innovation in their own idiolects — the more important point is that a few decades ago when I was in high school "go to prom" was so rare as to be plausibly thought an error rather than a respectable variant of a stock phrase. If one could identify some other stock phrases like this that have experienced a loss-of-arthrousness over time, maybe one could figure out a pattern in when/why that change happens.

  36. Jack Grieve said,

    January 4, 2016 @ 5:28 pm

    Determiner frequency often increases with the use of more complex noun phrases, which is related in turn to increased informational density and formality. So I'd guess a rise in informality could be responsible. But I don't think you'd find the same thing in newspaper writing, for example, which I believe has become denser. "The" is also definitely on the increase in American Twitter, at least in 2014, presumably reflecting denser tweets with more complex NPs (most prepositions are also on the rise, whereas verbs and pronouns are falling). "The" is also regionally patterned, being most common, along with determiners in general, in the Northeastern United States.

  37. Jack Grieve said,

    January 4, 2016 @ 6:13 pm

    Also in response to a couple of the other comments:

    BrE is generally thought to drop "the" more often than American English, e.g. going to hospital (BrE) vs. going to the hospital (AmE).

    Pre-modifying nouns with nouns is definitely on the rise in newspapers at least. For example see this paper, although we didn't consider determiners:

    Douglas Biber, Jack Grieve and Gina Iberri-Shea. 2010. Noun phrase modification. In Günter Rohdenburg and Julia Schlüter (editors) One Language, Two Grammars? Differences between British and American English. Cambridge University Press.

    Also for regional patterns in determiners, as well as other parts-of-speech, both individually and in the aggregate see:

    Jack Grieve. 2014. A multidimensional analysis of regional variation in American English. In Tony Berber Sardinha and Marcia Veirano Pinto (editors) Multi-Dimensional Analysis, 25 years on: A Tribute to Douglas Biber. Amsterdam: John Benjamins.

  38. Levantine said,

    January 4, 2016 @ 8:17 pm

    Jack Grieve, isn't the omission of "the" in "going to hospital" actually a conservative feature rather than an instance of dropping as such? People used to say "at table", and there are other obsolete examples of this sort of construction that don't come to mind right now. As far as I know, it's still usual in all varieties of English to omit "the" when talking of going to school, church, or court.

  39. Ray said,

    January 4, 2016 @ 11:36 pm

    ngage92 wrote: Bravely fighting against this trend is my parents' generation who keep sticking "the" in front of words where it doesn't belong; e.g. "the gays," "the facebook" etc

    not to take away from your point, but originally, facebook was called "thefacebook"!

  40. Y said,

    January 5, 2016 @ 12:51 am

    Yuval, I think you have something there, but I think that some of your modifications of Bush's speech are not quite grammatical, or they have an unacceptable change of meaning. A good modern speech writer would not write "regulate money expenditure" either. It sounds too dry. I think you're right that the-less infinitives might be preferred to gerunds in modern writing.

    Most of Wilson's paragraph is a single long sentence. I agree that overall verbosity is somehow connected to these widespread determiners; I am not a good enough copy editor to cast any of Wilson's the-rich clauses into the-less language which maintains the force of the original.

  41. Jack Grieve said,

    January 5, 2016 @ 4:45 am

    I'm not sure which is the older construction, i.e. "to the hospital" or "to hospital", or if there is a trend in usage one way or another.

  42. Andrew (not the same one) said,

    January 5, 2016 @ 10:24 am

    Certainly in British English 'to the university' has given way to 'to university'. (It has always been 'to school' and 'to college', but 'university' used to be anomalous.)

  43. GeorgeW said,

    January 5, 2016 @ 10:53 am

    @Ray: It seems to me that there is a semantic difference between X and 'the X' (with X = various social groups). As an example, a white person could say "the blacks" or "blacks" where "the blacks" distinguishes the speaker from the group. A black person would be much less likely to say "the blacks." I think.

  44. Marc Sacks said,

    January 5, 2016 @ 10:59 am

    I hadn't been aware that definite-article use in English has been declining. Instead, I've been noticing (and bothered by) a trend in prefacing titles or professional descriptions with an article (e.g., "the economist George X" instead of "economist George X"). Is this in somebody's style sheet (or perhaps many style sheets), a new usage trend, or just something I've noticed over the last few years? Can you refer me to any postings on it?

  45. DWalker said,

    January 5, 2016 @ 11:17 am

    This is interesting, because linguistics. :-)

  46. Levantine said,

    January 5, 2016 @ 5:36 pm

    Marc Sacks, it's the other way round: it's the omission of the "the" in such constructions, and not its inclusion, that's a relatively recent trend. Jerry Friedman already referred to this phenomenon in his comment above, and the following links give more information:

    I can just about tolerate the use of false titles in journalese but personally hate it in all other contexts.

  47. Elessorn said,

    January 6, 2016 @ 6:09 am

    Aren't we being a bit premature? The trend securely indicates only a decrease in word frequency, and provides no direct insight on whether definiteness as a category of concept handling is changing or not. Even if we identified a trend towards one or several categories of construction that statistically produced fewer articles, what reason would we have for singling out definiteness as the cause of such a trend? Or indeed for pegging "the" as the motive reason rather than as a mere affected variable? This is not to say that such a thing as decreasing definiteness is impossible, only that we're zooming over several steps here in going from a frequency effect to a cognitive change.

    Think of historical change we can be sure about. Imagine we had an Ngram reader back in the Middle Ages: we would surely detect a marked decrease in discrete English accusative and dative case marking from 800-1200. But would we infer that English was losing nominative-accusativity? Or the distinction between direct and indirect objects?

    Or think of the future. It's not impossible to imagine that the already common use of the form we call present progressive for future events ("I'm graduating this year") could eventually usurp "will" in that function entirely. Yet would we say that a concept of futurity had been lost?

    I think we have to be careful to distinguish form and function. I do have an intuition that, say, "The research on definite article usage trends in English is fascinating" could stand to drop the initial "the" without much change in meaning. But I have a much stronger intuition that in "Have you read (the) research on definite article usage trends in English?" the presence or absence of the article unavoidably signals a different meaning for all groups of native English speakers, and an even stronger intuition that almost no native English speaker would say "Have you read research on definite article usage trends in English that I sent you yesterday?" (except perhaps in borderline cases of contrast emphasis). If these intuitions were proved wrong, if we observed contrasting patterns of article usage in a large number of examples of the same construction between regions, sexes, age groups, etc., then I would say that the conclusion is more than justified, but if in fact we observe that the same construction triggers article usage in all categories of native speaker, even if the frequency of the given construction varies widely between categories, then it seems to me that as a cognitive organizer, definiteness does not seem to be in decline.

  48. Jerry Friedman said,

    January 6, 2016 @ 10:41 am

    Elessorn: I do have an intuition that, say, "The research on definite article usage trends in English is fascinating" could stand to drop the initial "the" without much change in meaning.

    Interesting example. Fifty or a hundred years ago, that might have been "The research on the trends in the usage of the definite article in English is fascinating," though actually there are more graceful possibilities that don't have the sequence of attributives.

  49. DWalker said,

    January 7, 2016 @ 11:21 am

    This is interesting. Not to JUST make jokes, but:

    There's a missed opportunity for noun pileup here:

    "The research on the trends in the usage of the definite article in English is fascinating." should obviously be "English definite article trend research fascination spotted".

  50. The case of the disappearing determiners | The Proof Angel said,

    January 9, 2016 @ 4:52 am

    […] This is fascinating stuff. Over the last century or so, the commonest word in English has gradually been getting less common. […]

  51. Steve Rapaport said,

    January 9, 2016 @ 1:23 pm

    I'm with Chris above. Actual definiteness, involving the implicit assertions of existence, uniqueness, and previous reference in the discourse, is probably not declining at all. It is a useful shorthand.

    But the use of the definite article in the pseudo-definite, generalized sense is sounding more pretentious all the time (witness the pretentiousness of the current sentence.). Was the usage of the generalized 'the' previously common enough to account for the dramatic decline?

  52. Neil Gooderham said,

    January 11, 2016 @ 5:34 am

    I think you may have overlooked a factor here, in terms of the changing nature over time of the works that make up the source data for Google nGrams. Isn't it the case that as recently as 30 years ago (pre-Internet, anyway), most publications in the various English corpuses would have been written by, can we say, better-educated authors? The language used in literary works is very different from the mainstream, and its form has probably changed relatively slowly in comparison.

    In more recent times, publication has become much easier for anyone with something to say, so there is inevitably more variety in the forms of language used.

    [(myl) Actually this factor has been extensively discussed, here and elsewhere. But the trend under discussion here took place mostly in the pre-internet period of Google Books ngrams; and can be even more clearly seen in the balanced COHA collection, in State of the Union messages, in Medline abstracts, and so on.]

  53. Charlotte R. Mitchell said,

    January 11, 2016 @ 9:50 am

    When completing forms online, the number of characters is often quite limited and causes one to omit everything but the essential words and to abbreviate even those words. I do not use Twitter but know there is a limit there also. Perhaps these things cause us to think in fewer words elsewhere, too.

  54. Ed Vanderpump said,

    January 11, 2016 @ 11:04 am

    "In (the) light of" is the sort of thing I regret.

