This is a brief progress report on "The case of the disappearing determiners", which I've continued to poke at in my spare time.
As the red line in the plot below shows, the proportion of nouns immediately preceded by THE decreased over the course of the 20th century, from an average of 18.9% for books published in 1900-1910 to 13.5% for books published in 1990-2000. The blue line shows that the proportion of adjective+noun sequences immediately preceded by THE was higher, overall, but followed a remarkably similar falling trajectory, from 29.1% in 1900-1910 to 21.2% in 1990-2000:
As the dotted green line shows, the proportion of nouns immediately preceded by an adjective remained stable in this source, at about 18.9% throughout the century. I had thought that a falling percentage of adjective-modified nouns, coupled with the lower percentage of THE preceding non-modified nouns, might contribute to the overall decline in THE, but that turns out not to be the case.
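For concreteness, here is a minimal sketch of how a decade figure like the ones quoted above could be computed from yearly series. The function name, the choice to average the yearly ratios (rather than take a ratio of decade totals), the inclusive year ranges, and all of the numbers are assumptions for illustration; none of it is actual ngram viewer output.

```python
def decade_mean_share(numer, denom, first_year, last_year):
    """Average the yearly ratio numer[y] / denom[y] over an inclusive year range."""
    shares = [numer[y] / denom[y] for y in range(first_year, last_year + 1)]
    return sum(shares) / len(shares)


# Made-up yearly relative frequencies, NOT real Google Books values.
years = range(1900, 2001)
noun = {y: 0.20 for y in years}  # hypothetical frequency of all nouns
# Hypothetical frequency of THE + noun, with a share that declines linearly.
the_noun = {y: noun[y] * (0.19 - 0.0006 * (y - 1900)) for y in years}

early = decade_mean_share(the_noun, noun, 1900, 1910)
late = decade_mean_share(the_noun, noun, 1990, 2000)
print(f"1900-1910: {early:.1%}   1990-2000: {late:.1%}")
```

With real data, `numer` and `denom` would be the yearly values for the sequence of interest and for its base category.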
I got the data by prompting the Google Books ngram viewer with the queries
The similarity between the red and blue lines is so strong — a correlation of r=0.996 for the pair of 101-element vectors — that I wonder whether there might be a bug of some kind in the underlying data.
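As a sketch of what that number measures, the Pearson correlation of two yearly series can be computed as below. The two series here are synthetic stand-ins that I've invented to have 101 values each (1900 through 2000, inclusive); the names `red` and `blue` and every constant are assumptions, not the real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-ins for the two lines, one value per year 1900-2000.
# Both fall over the century and share a small wiggle; the numbers are invented.
years = list(range(1900, 2001))
wiggle = [0.004 * math.sin((y - 1900) / 7.0) for y in years]
red = [0.189 - 0.00054 * (y - 1900) + w for y, w in zip(years, wiggle)]
blue = [0.291 - 0.00079 * (y - 1900) + 1.5 * w for y, w in zip(years, wiggle)]

r = pearson_r(red, blue)
print(len(red), round(r, 3))
```

The same function applies unchanged to any other pair of yearly series, such as the preposition counts discussed next.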
This remarkably similar pattern continues to be visible in the plot below. The blue line shows the proportion of instances of OF immediately followed by THE, which drops from about 27.8% in 1900-1910 to 24.0% in 1990-2000, while the proportion of all prepositions immediately followed by THE, shown in the red line, drops from 25.7% to 23.0%. The correlation of those two sequences is r=0.994, and their correlations with the falling lines in the previous plot are between 0.962 and 0.974.
And OF as a proportion of all prepositions, shown in the dotted green line, fell from 31.2% to 28.4%, but again in a strikingly similar pattern, correlating between 0.913 and 0.967 with all of the other falling lines:
I'm sure that these patterns mean something. But it seems a little weird that OF as a proportion of all prepositions should correlate r=0.953 with the proportion of instances of OF immediately followed by THE, and it seems weirder that OF as a proportion of all prepositions should correlate r=0.913 with the proportion of adjective-noun sequences immediately preceded by THE.
So let's hope that what these patterns mean is that the secular decay of THE has somehow seeped into some but not all of the other counts, or that some other hidden cause is governing all of the correlated decays. The alternative hypothesis is that there's a problem with the way the underlying data was collected and processed, which would be annoying.