More on trends in the Google ngrams corpus
In "Lexico-cultural decay?", 10/9/2018, I called into question Jonathan Merritt's evidence for the view that "most of the central terms in the Christian vocabulary are rapidly declining". Merritt cites Kesebir & Kesebir 2012, who argue on the basis of Google ngram-viewer data that
Study 1 showed a decline in the use of general moral terms such as virtue, decency and conscience, throughout the twentieth century. In Study 2, we examined the appearance frequency of 50 virtue words (e.g. honesty, patience, compassion) and found a significant decline for 74% of them.
I explained several reasons why unigram frequencies for many ordinary words in the Google ngram dataset tend to show a decline over the 20th century, citing Pechenick et al. 2015 and giving some illustrative examples. It occurred to me this morning that there's a different way to illustrate one of the issues, namely the changing mix of types of books in Google's collection. At some point after 2000, that collection shifts fairly abruptly: the earlier material is based on scans of books from cooperating research libraries, while the later material is based on digital texts provided by publishers. This shift produces such a pronounced change in the frequency of nearly all words that the default ngram viewer stops in the year 2000.
But you can ask the viewer to give you data up to 2008 (as far as it's willing to go), and the results almost always show a pronounced change. So I tried it for the items underlying Merritt's argument.
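For anyone who wants to repeat the experiment, here is a minimal sketch of building a viewer link with the end year pushed out to 2008. The base address and the query parameter names (content, year_start, year_end, smoothing) are assumptions based on what the viewer's own URLs have looked like, and the helper ngram_viewer_url is hypothetical; check all of them against the live site before relying on this.

```python
from urllib.parse import urlencode

# Assumed base address of the ngram viewer's graph page.
NGRAM_GRAPH_URL = "https://books.google.com/ngrams/graph"


def ngram_viewer_url(words, year_start=1900, year_end=2008, smoothing=3):
    """Build a viewer URL for a list of words (hypothetical helper).

    The parameter names are assumptions taken from the viewer's URLs,
    not from any documented API.
    """
    params = {
        "content": ",".join(words),   # comma-separated search terms
        "year_start": year_start,
        "year_end": year_end,         # 2008 rather than the default 2000
        "smoothing": smoothing,
    }
    return NGRAM_GRAPH_URL + "?" + urlencode(params)


# Three of the general moral terms from Kesebir & Kesebir's Study 1.
url = ngram_viewer_url(["virtue", "decency", "conscience"])
print(url)
```

Opening the printed link in a browser should show the same plot as typing the words into the viewer and changing the end year by hand.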