## Correlated lexicometrical decay

This is a brief progress report on "The case of the disappearing determiners", which I've continued to poke at in my spare time.

As the red line in the plot below shows, the proportion of nouns immediately preceded by THE decreased over the course of the 20th century, from an average of 18.9% for books published in 1900-1910 to 13.5% for books published in 1990-2000.  The blue line shows that the proportion of adjective+noun sequences immediately preceded by THE was higher, overall, but followed a remarkably similar falling trajectory, from 29.1% in 1900-1910 to 21.2% in 1990-2000:

As the dotted green line shows, the proportion of nouns immediately preceded by an adjective remained stable in this source, at about 18.9% throughout the century. I thought that perhaps a falling percentage of adjective-modified nouns, coupled with the lower percentage of THE preceding non-modified nouns, might make a contribution to the overall decline in THE — but not so.
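For readers who want to check figures like "18.9% for books published in 1900-1910", here is a minimal sketch of the arithmetic, under stated assumptions: the yearly series are plain lists starting at a known year, the window is inclusive at both ends, and the average is taken over yearly ratios rather than over summed counts. The function name `window_mean_ratio` and the toy numbers are made up for illustration and are not taken from the ngram data.

```python
def window_mean_ratio(numer, denom, first_year, start, end):
    """Mean of the yearly ratios numer/denom for start <= year <= end.

    numer and denom are equal-length yearly series beginning at first_year.
    Averaging the yearly ratios (rather than dividing summed counts) is an
    assumption of this sketch, as is the inclusive window.
    """
    if len(numer) != len(denom):
        raise ValueError("series must have the same length")
    lo, hi = start - first_year, end - first_year + 1
    if lo < 0 or hi > len(numer) or lo >= hi:
        raise ValueError("window falls outside the series")
    ratios = [n / d for n, d in zip(numer[lo:hi], denom[lo:hi])]
    return sum(ratios) / len(ratios)


# Toy numbers for illustration only.
the_noun = [19.0, 18.0, 17.0, 16.0]      # hypothetical "THE + noun" values
all_noun = [100.0, 100.0, 100.0, 100.0]  # hypothetical "noun" values
early = window_mean_ratio(the_noun, all_noun, 1900, 1900, 1901)
late = window_mean_ratio(the_noun, all_noun, 1900, 1902, 1903)
print(f"{early:.1%} -> {late:.1%}")
```

With the real data, `first_year` would be 1900 and the two windows would be 1900-1910 and 1990-2000.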

I got the data by prompting the Google Books ngram viewer with the queries

The html source of the resulting pages contains a javascript vector of the numbers, which I extracted for further processing.
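The extraction step could be sketched roughly as follows. This is not the script actually used; it is a format-agnostic guess that simply looks for bracketed lists of numbers anywhere in the page source. The sample string, including the `ngram` and `timeseries` keys, is invented for the example and should not be read as the actual layout of the viewer's pages.

```python
import json
import re

# Matches a bracketed, comma-separated list of numeric literals.
NUM = r"-?\d[\d.eE+\-]*"
NUM_ARRAY = re.compile(r"\[\s*" + NUM + r"(?:\s*,\s*" + NUM + r")*\s*\]")


def extract_numeric_arrays(html):
    """Return every bracketed list of numbers found in html, in order."""
    arrays = []
    for match in NUM_ARRAY.finditer(html):
        try:
            arrays.append(json.loads(match.group(0)))
        except json.JSONDecodeError:
            # Skip look-alikes that are not valid numeric lists.
            continue
    return arrays


# Invented sample; the real page source may look quite different.
sample = 'var data = [{"ngram": "the _NOUN_", "timeseries": [0.01, 0.02, 3e-05]}];'
print(extract_numeric_arrays(sample))
```

Each returned list can then be treated as one yearly series for further processing.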

The similarity between the red and blue lines is so strong (a correlation of r=0.996 for the pair of 101-element vectors) that I wonder whether there might be a bug of some kind in the underlying data.
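The r values quoted here are ordinary Pearson correlations between two equal-length yearly series. A minimal pure-Python sketch of that computation follows; the function name `pearson_r` and the toy series are illustrative assumptions, not the code or data behind the reported figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Toy series for illustration only, not the ngram data.
a = [0.19, 0.18, 0.17, 0.15]
b = [0.29, 0.27, 0.26, 0.22]
print(round(pearson_r(a, b), 3))
```

For the comparison in the text, `xs` and `ys` would be the two 101-element ratio series for 1900 through 2000.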

A remarkably similar pattern is visible in the plot below. The blue line shows the proportion of instances of OF immediately followed by THE, which drops from about 27.8% in 1900-1910 to 24.0% in 1990-2000, while the proportion of all prepositions immediately followed by THE, shown in the red line, drops from 25.7% to 23.0%. The correlation of those two sequences is r=0.994, and their correlations with the falling lines in the previous plot are between 0.962 and 0.974.

And OF as a proportion of all prepositions, shown in the dotted green line, fell from 31.2% to 28.4%, but again in a strikingly similar pattern, correlating between 0.913 and 0.967 with all of the other falling lines:

I'm sure that these patterns mean something. But it seems a little weird that OF as a proportion of all prepositions should correlate r=0.953 with the proportion of instances of OF immediately followed by THE, and it seems weirder still that OF as a proportion of all prepositions should correlate r=0.913 with the proportion of adjective-noun sequences immediately preceded by THE.

So let's hope that what these patterns mean is that the secular decay of THE has somehow seeped into some but not all of the other counts, or that some other hidden cause is governing all of the correlated decays. The alternative hypothesis is that there's a problem with the way the underlying data was collected and processed, which would be annoying.

1. ### Doctor Science said,

January 9, 2016 @ 10:02 am

Alas, my spider-sense says it's a problem with collection/processing — that many "the"s and "of"s are being ignored, the way they might be for e.g. alphabetizing titles. It's probably time to check with Google.

You might cross-check by comparing Bible translations, using non-Google software. Not very many data points, of course, but they'd be *good* data points.

[(myl) The latest results from CSI Syntax (watch it next year on CBS…) show that COHA has pretty much the same trends, though the observed proportions are rather different, and the correlation is not as striking:

                         GB     COHA
(the N)/N 1900-1910     18.9%    28.1%
(the N)/N 1990-2000     13.5%    22.7%
(of the)/of 1900-1910   27.8%    25.4%
(of the)/of 1990-2000   24.0%    22.4%


But maybe the fact that the correlation is "less striking" is simply due to the decade-wise sampling — in this source the correlation between theNN/NN and ofthe/of is r=0.970. So I join you in worrying about some contamination of samples, but I don't think we can blame it on Google… ]

2. ### Yet Another John said,

January 9, 2016 @ 11:39 am

Could this phenomenon be related to the (even steeper) decline in popular band names beginning with "The" in the late 20th century?

You can really see this looking at the Billboard Year-End Hot 100 for various years. In 2015 (https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2015), there was only one group with "The" before its name (The Weeknd). Only one such group in 2014 as well. But look at the Hot 100 for, say, 1962 or earlier, and virtually any musical group (excluding solo artists and duos) has a "The" in front of their name ("The Platters," "The Four Lads," etc.)

It seems the transition period was in the mid-1960s, and the trend was already very striking by the time Paul McCartney had left The Beatles to start Wings. It is not only because modern rock band names are often not NPs ("Imagine Dragons"), since many 2015 acts have names which could very naturally have had the definite article, but for some reason they don't: "Calvin Harris and Disciples," "Maroon 5," "X Ambassadors."

3. ### Y said,

January 9, 2016 @ 1:51 pm

I wonder if determinerlessness (WOTY 2016?) follows the same pattern in spoken English as reflected, imperfectly, in plays.

4. ### D-AW said,

January 9, 2016 @ 2:38 pm

Good to make the curves case-insensitive, but interesting that "The" doesn't show the same kind of decline as "the". Another piece of puzzle?

[(myl) It's relevant and interesting that "The" behaves differently from "the", but the original comparison doesn't change materially when limited to lower-case "the":

And the two ratio sequences shown in red and blue continue to correlate r=0.996…]

5. ### Guy said,

January 9, 2016 @ 2:43 pm

Isn't this similar to the data we would expect to see if the decline of "the" is largely attributable to the use of more genitives in place of "of"-headed preposition phrases? Or am I making some basic reasoning error here?

[(myl) No, because "the _NOUN_" is so much more frequent than "of the _NOUN_". So the increasing frequency of -s-genitives compared with of-genitives is relevant, but far from the whole story, and not enough of the story to explain the curiously high correlations…]

6. ### Guy said,

January 9, 2016 @ 2:48 pm

@D-AW

I think that's also consistent with the genitive hypothesis. "The X of the Y" has one "The" and one "the", but "The Y's X" still has a "The", but no "the".

7. ### stephen hart said,

January 9, 2016 @ 3:09 pm

Technically, The Weeknd is an individual (Abel Makkonen Tesfaye), not a group. Unlike The The, which was a group sometimes and an individual (Matt Johnson) at others.

8. ### Jerry Friedman said,

January 9, 2016 @ 4:19 pm

Either I'm doing something wrong, or there's something wrong with the part-of-speech tagging at GB, as in this graph, which was supposed to show the most common nouns after "The writer".

However, a search for "the *_ADJ _NOUN_" worked fine.

It's also strange that the URL I copied out of the address bar isn't the one that I could read there, which had * instead of a list of adjectives.

9. ### Schroduck said,

January 9, 2016 @ 4:36 pm

There's definitely something odd about the data (look at the decline in adpositions vs determiners – virtually identical, right down to yearly spikes), and I wonder if maybe it comes from how Google digitizes things like page numbers and chapter titles. "Chapter" with a capital C has increased far more than "chapter" (and "Part" has proliferated even while "part" has declined), and numerals now make up 3.5% of all words (up from 2% a century ago). Incorrect inclusion of chapter titles might explain part of the decline – they often use a form of headlinese with fewer articles, and if for whatever reason Google is including more of these, they're basically bulking up the non-determiner word count of the books.

10. ### Jerry Friedman said,

January 9, 2016 @ 4:44 pm

Y: "Determinerlessness" is already doing better than my attempts to popularize "anarthrosity". "Anarthrousness" is actually in some dictionaries.

11. ### D-AW said,

January 9, 2016 @ 5:10 pm

@Guy – yes, that might be part of it. But it doesn't explain the consistent rise in "The" over the same period.

It's also possible upper-case "The" is just a red herring in the Google Books corpus, where book titles and chapter headings, for instance, get counted on every page where they appear.