Lexico-cultural decay?

Jonathan Merritt, "The Death of Sacred Speech", The Week 9/10/2018:

America boasts more Christians than any other country on planet Earth. But you wouldn't know it from listening to us.

According to Google Ngram Viewer data, a searchable database of millions of printed works stretching back 500 years, most of the central terms in the Christian vocabulary are rapidly declining. One 2012 study in the Journal of Positive Psychology, for example, analyzed 50 moral terms associated with Christianity and found that a whopping 74 percent were used less frequently over the course of the last century […]

"Whopping"? If the frequency of each word were following a random walk, we'd expect 50% of them to decline and 50% of them to increase. And to be confident that 74% is "whopping", or even meaningful, we'd need to do something that neither Merritt nor the cited paper does, namely verify that there's no overall bias in the data source for reasons other than changing "cultural salience", either towards decreasing frequency of certain types of words, or towards decreasing frequency of individual words in general. But in fact there's good reason to believe that both sorts of bias exist (see below).

The cited paper is Kesebir & Kesebir, "The cultural salience of moral character and virtue declined in twentieth century America", The Journal of Positive Psychology 2012. Their work is based on results from the Google Books ngram viewer. And the first problem is that the distribution of types of material included in that collection changes over time.

One aspect of this problem is amply documented (among other places) in Pechenick, Danforth, and Dodds, "Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution", PLoS One 2015. Kesebir & Kesebir don't discuss this issue, but Merritt cites a popular-press version of the Pechenick et al. paper, only to dismiss it:

Ngram data is complicated and susceptible to misinterpretation, of course. The overabundance of scientific literature in the database can skew findings, for example, and it is often difficult to account for all colloquial, synonymous terms that have arisen during the same period. But the data we have cannot be dismissed out of hand, and at the very least indicates that traditional sacred speech is dying in the English-speaking world.

"Indicates that traditional sacred speech is dying"? Geoff Nunberg pointed out in email that words for physical aspects of churches like pew, spire, apse, rectory — not exactly "sacred speech" — also generally declined in apparent frequency, by amounts comparable to the declines cited for "moral character and virtue words":

In fact, the Google Ngram viewer also shows declines for the first six words for general architectural features that occurred to me — stairway, foundation, roof, eaves, arch, cornice:

[The multiplications are to get words of different frequency ranges into the same graphical range.]

So is "traditional architectural speech" dying in America as well?

But there's a second problem. It's not just that the range of publications is shifting over the course of time; it's also that the overall total of "words" in the collection is increasing, so that there's a tendency for the relative frequency of individual words to decrease. I put "words" in scare quotes in that sentence because in fact the great majority of "word" types in the Google Books unigram corpus are actually not words at all, but rather OCR errors, typographical errors, number strings, and so on. See "The birth and death of typos", 3/17/2012, and "Word string frequency distributions", 2/3/2013. I don't know whether the proportion of pseudo-word tokens increases with publication date (as the proportion of pseudo-word types definitely does), but it might well, and if it does, it would be another source of a general bias towards decreasing frequency of ordinary, real words.

As one simple test of this general bias, I

  • had a program choose 10 words at random from the 34,691 distinct all-alphabetic words found in 16 Dickens novels;
  • ran those ten words through Google Books ngrams;
  • pulled the percentage values for the 20th century (look at the page source from a search if you want to see one way to get those);
  • took the average of the first half of the century and of the second half of the century; and compared them (here presented as frequency per million words):
Word     FirstHalf SecondHalf
sport     0.1244    0.1036
edges     0.1819    0.1601
wasteful  0.0329    0.0245
narrow    0.5750    0.4363
sewn      0.0121    0.0148
healths   0.0023    0.0007
toddling  0.0016    0.0008
row       0.2146    0.2337
idler     0.0067    0.0039
squibs    0.0021    0.0010

As you can see, a "whopping" 80% of these words declined in frequency over the course of the century!
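For readers who want to check the arithmetic, here is a minimal sketch in Python. It recomputes the share of declining words from the table above, and then asks how surprising "37 of 50 terms declined" (74 percent of 50) would be under two different baselines: a coin flip, and the decline rate seen in this small Dickens sample. The function name `upper_tail` and the variable names are mine, and the binomial model assumes that words rise and fall independently of one another, which is certainly not true of real vocabulary; a ten-word sample is also far too small to pin down a baseline. Treat it as an illustration of the logic, not as a proper test.

```python
from math import comb

# Half-century averages copied from the table above:
# word -> (first half of the 20th century, second half)
dickens_sample = {
    "sport": (0.1244, 0.1036),
    "edges": (0.1819, 0.1601),
    "wasteful": (0.0329, 0.0245),
    "narrow": (0.5750, 0.4363),
    "sewn": (0.0121, 0.0148),
    "healths": (0.0023, 0.0007),
    "toddling": (0.0016, 0.0008),
    "row": (0.2146, 0.2337),
    "idler": (0.0067, 0.0039),
    "squibs": (0.0021, 0.0010),
}

# Words whose second-half average is lower than their first-half average.
declined = [w for w, (first, second) in dickens_sample.items() if second < first]
baseline_share = len(declined) / len(dickens_sample)


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 74 percent of the 50 terms in the cited study is 37 terms.
n_terms, n_declined = 50, 37

# Assumption 1: each term is equally likely to rise or fall.
p_coin_flip = upper_tail(n_declined, n_terms, 0.5)
# Assumption 2: each term declines at the rate seen in the Dickens sample.
p_baseline = upper_tail(n_declined, n_terms, baseline_share)

print(f"share of sampled words that declined: {baseline_share:.0%}")
print(f"P(>= 37 of 50 decline | p = 0.5): {p_coin_flip:.5f}")
print(f"P(>= 37 of 50 decline | p = {baseline_share}): {p_baseline:.3f}")
```

Under the coin-flip assumption, 37 declines out of 50 would be very unlikely; under a baseline like the one in the Dickens sample, it is entirely unremarkable. Which baseline is appropriate is exactly the question that the 74 percent figure leaves unanswered.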

This is not the end of the story, obviously. Did 80% of random Dickens words decline in Google ngram frequency because of an increasing admixture of scientific and technical writing in the underlying collection? Because of increasing pollution by letter-strings that are not words at all? Because of overall changes in linguistic fashion? All of the above?

The one thing we can be sure of is that the premise of Merritt's argument is faulty. His neo-Whorfian next step is also problematic:

As the language of faith has declined in usage, it should not surprise us that our collective thinking and behaviors are less dependent on spirituality than they once were. In the same way, we speak far less of grace, mercy, patience, and compassion. While we may decry that our world is not as gracious, merciful, patient, or compassionate as it might be, we must also take responsibility for the way our use of language has contributed to it.

But more on this later.

Update — Discussing some joint research with a colleague, I used this work as one of many examples of why it's often dangerous to take "yes" for an answer when you've asked the world a question. As Richard Feynman said, "The first principle is that you must not fool yourself – and you are the easiest person to fool."

Update #2 — (hoisted from a response to a confused commenter):

For the word "pew"

Year           1900   2000
Count         16819  27767
Frequency      1.97   1.03

So the token count of "pew" increased by a factor of 27767/16819 = 1.65, while the token frequency (per million words) decreased by a factor of 1.97/1.03 = 1.91.

That's because in 1900 the collection has 7.52 billion "word" tokens (i.e. letter strings separated by punctuation or space), whereas for 2000 it has 26.882 billion "word" tokens.
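The same point as a small Python sketch: a word's raw count can rise while its relative frequency falls, as long as the collection grows faster than the count does. The two ratios use the "pew" figures quoted in the text above; the helper `per_million` and the round numbers in the second half are hypothetical, chosen only to make the mechanism easy to see.

```python
# Figures for "pew" as quoted in the text above.
count_1900, count_2000 = 16819, 27767
freq_1900, freq_2000 = 1.97, 1.03  # per million "word" tokens

count_growth = count_2000 / count_1900   # how much the raw count grew
freq_decline = freq_1900 / freq_2000     # how much the relative frequency shrank

print(f"count grew by a factor of {count_growth:.2f}")
print(f"frequency fell by a factor of {freq_decline:.2f}")


def per_million(count, total_tokens):
    """Relative frequency of a word, in occurrences per million tokens."""
    return count / total_tokens * 1_000_000


# Hypothetical round numbers (not from Google Books): the count doubles,
# but the collection quadruples, so the relative frequency is halved.
early = per_million(1_000, 1_000_000_000)
late = per_million(2_000, 4_000_000_000)
print(f"hypothetical frequency: {early:.2f} -> {late:.2f} per million")
```

In other words, a falling line in the ngram viewer tells you about a word's share of the collection, not about how many times the word was printed.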



20 Comments

  1. AG said,

    October 9, 2018 @ 6:09 am

    I find the idea that "compassion" et al. are technical terms connected with one religion to be almost too stupid to be offensive. But, no, it's offensive.

    [(myl) Indeed. But keep in mind that Kesebir & Kesebir argue for a decline in "moral character and virtue", not Christianity; the Christian focus is Merritt's contribution.]

  2. richardelguru said,

    October 9, 2018 @ 6:19 am

    AG
    …or indeed with any religion.

  3. J.W. Brewer said,

    October 9, 2018 @ 8:00 am

    It turns out to be pretty easy to guess some technical/"jargon" words or phrases meaningful only within fairly explicitly Christian discourse that increased in frequency in the second half of the 20th century. I went four for four on "dispensationalist," "hamartia," "prosperity gospel," and "slain in the spirit." Some hardy perennials like "consubstantial" and "filioque" were up and down over the course of the century but not notably lower in second half than first.

  4. Ursa Major said,

    October 9, 2018 @ 8:52 am

    "Grace" and "mercy" may be down, but "kindness" and "empathy" and "community" are all up. Personally, I'd rather talk about these moral concepts as innate human attributes rather than as condescending gifts from a paternalistic god of one particular religion.

  5. R Steinmetz said,

    October 9, 2018 @ 10:00 am

    While I can't add a pseudo-scientific reason, I have the impression that public religiousness has declined. One thing I have noticed in viewing mid-twentieth-century films is the frequent references to religion in films without a religious theme and the number of films with a religious theme.

    This can be explained by the change in movie audiences, especially after television became the dominant entertainment medium. But I think there is now a general reluctance for people to speak of religion in the public sphere.

    From personal experience: a longtime friend of mine passed away last year. We attended college together and he was best man at my wedding. Although we lived thousands of miles apart, we kept in regular touch, seeing each other a couple of times a year. When I attended his funeral I was stunned by the extent to which he and his family had been involved in formal church activities.

  6. Yerushalmi said,

    October 9, 2018 @ 10:49 am

    If there were a decline in Christian vocabulary in the United States, it would also manifest itself in fewer references to churches and their constituent parts. So the first graph, about physical aspects of church architecture, isn't in and of itself good evidence against the theory.

    [(myl) But the original paper is about "virtue words", and the recent popularization is about "sacred speech". Some theological approaches would see a focus on church architecture as uncorrelated with religious virtue, or even antithetical to it.]

  7. J.W. Brewer said,

    October 9, 2018 @ 10:57 am

    Church-architecture words are not universal even within a "universe" limited to Christianity. How likely it is that e.g. "apse" and "pew" will be words used to describe bits of a given randomly-chosen building where some particular subset of Christians worship will vary over space and time. I expect that those words are not used to describe the interior of e.g. many of the evangelical "megachurches" that have become more common in recent decades.

    There is data out there (with, I'm sure, debates about its quality) about long-term trends in religiosity, frequency of church attendance etc. It is also relevant to consider that the religiosity:secularism ratio at any given moment in time is not even throughout the U.S. but can vary dramatically by geography, social class, political affiliation, etc. The subsets of the population that generate a disproportionate amount of the texts that end up in the google books corpus may have experienced different trends than the broader population.

  8. cameron said,

    October 9, 2018 @ 11:13 am

    Well I, for one, go out of my way to use words like "utraquistic" and "triune" as often as I can.

    [(myl) My personal favorite is homoiousian, whose usage over time shows clear periodic oscillation, perhaps triggered by sunspots or the business cycle or periodical cicada emergence or something:

    (As well, of course, as a secular(ist) trend downward…)]

  9. rootlesscosmo said,

    October 9, 2018 @ 3:27 pm

    Nor should we forget the advice of Dave Barry, to the effect that we need some serious discussion of the verb "to whop."

    The battle between "homoiousian" and "homoousian"–"of like substance" vs. "of one substance," I think–features in William Gaddis' neglected novel "The Recognitions."

  10. David Morris said,

    October 9, 2018 @ 3:41 pm

    My understanding of Google Ngrams is that it traces the usage of words as a *percentage* of total verbiage. Therefore, as more words enter into usage, any given word's percentage of the total is (all else being equal) going to decline over time. A slight or even moderate rise in *absolute* usage is going to be cancelled by the sheer number of new words.

    [(myl) Yes, this is one of the points made in the original post. A related point is that (at least with respect to the unigram counts) the category of "word" includes ~99 OCR errors, typographical errors, number strings, etc. for every case that any dictionary would recognize as a word.]

  11. Chris C. said,

    October 9, 2018 @ 5:01 pm

    Good gravy! MYL has uncovered cyclical Arianism! There must be many papers about this!

  12. Jerry Friedman said,

    October 9, 2018 @ 5:18 pm

    Chris C.: Is that more of a deontic or an epistemic "must"?

  13. Mick O said,

    October 9, 2018 @ 5:40 pm

    I threw Mr. Merritt's previous 9 columns together and searched the roughly 8300+ words he used in those columns for "sacred speech." This does not include the linked piece, for obvious reasons.

    Grace: 12
    Mercy: 1
    Wisdom: 2
    Faith: 17
    Sacrifice: 0
    Honesty: 0
    Righteousness: 0
    Evil: 0

    This includes discussion of a stage production of Amazing Grace. If you remove "Amazing Grace" references, his use of the word "grace" drops to 0. He does have faith going for him though.

  14. eub said,

    October 9, 2018 @ 11:08 pm

    The Ngrams numbers are frequency of a word within the word sequence, so why would the number of distinct word tokens interfere? If all text is of the form "apse la la la la la la apse la la", and you then change some las to tras, that won't change apse's numbers.

    [(myl) No, the unigram results are the number of instances of a word for a given year, divided by the total number of "word" tokens in the collection for that year. And both the total number of distinct "words" and the total number of "word" tokens are increasing year by year in the Google Books collection.

    Thus for the word "pew"


    Year           1900   2000
    Count         16819  27767
    Frequency      1.97   1.03

    So the token count of "pew" increased by a factor of 27767/16819 = 1.65, while the token frequency decreased by a factor of 1.97/1.03 = 1.91.

    That's because in 1900 the collection has 7.52 billion "word" tokens (i.e. letter strings separated by punctuation or space), whereas for 2000 it has 26.882 billion "word" tokens.]

  15. D.O. said,

    October 11, 2018 @ 2:06 pm

    Here's my 2 cents.

    We need some measuring stick to figure out what's going on and as a first stab, I tried "God". The use of the word in Google ngram corpus dropped from about 1600ppm (because Prof. Liberman objects to calling them "words", I will use neutral "parts per million") in around year 1800 to about 250ppm in 1940 and then clawed back a little to about 350ppm in 2000. Ratio of God/(Jesus+Christ) remained more or less stable over 2 centuries at ~2 with a bit of a bump between 1930s and 1960s ("Jesus" was gaining on "Christ" slowly and steadily).

    Most of the words in Jonathan Merritt's piece are not particularly religion-bound, but faith and grace seem to be as good as any. Faith was steady between 1880 and about 1965 at about 0.27*God, then started to decline (relative to God) and is now at about 0.2, where it was back in the 1840s. Grace and mercy (and also wrath) are in secular decline.

    So there might be something to the notion that as far as religiosity goes, people take God as more of an abstract concept rather than an active force in the moral universe. Or maybe we just use some other words, like love and hate, for essentially the same concepts.

    [(myl) I think you might not be paying attention to the discussion. There are several reasons (evolving mix of sources, and growing set of pseudo-words) why Google ngram "frequencies" probably don't give a reliable estimate of changing frequencies of word usage over time, especially when the changes are relatively small. If we look instead at the COHA corpus, which tries to keep the source distribution as stable as it can, and also tries to avoid OCR crap, "GOD" gets this pattern over the 10 decades of the twentieth century (frequency per million words):

    270.96 287.56 306.92 284.79 270.40 313.30 343.94 329.04 266.45 303.34

    The mean of the first five decades is 284.1; the mean of the last five decades is 311.2.]

  16. D.O. said,

    October 11, 2018 @ 2:53 pm

    I did pay attention and that's why I think we need a measuring stick. I used God not because it is particularly stable or revealing, but to measure how much stuff in Google ngrams is about religion. All other measurements are relative to God and as such do not depend on OCR errors (I must add here that I limited my searches to American English).

    Merritt's thesis seems to be that we are lacking in moral foundation relative to previous decades because we do not think about morality in specifically religious terms (at least not as much). Proposed causality is plainly ridiculous. My interest was mostly in trying to figure out whether some traditionally religious concepts are on the decline even in religious speech. This seems to be plausible. But of course, the simplest explanation for the decline of a word use is the substitution of another word.

  17. Chris C. said,

    October 11, 2018 @ 5:23 pm

    @Jerry Friedman — The former. I have yet to see anything about this remarkable trend in my academia.edu feed.

  18. Jason M said,

    October 11, 2018 @ 10:51 pm

    In biomedical research, my line of work, we would set up some control values as normalizers for the noise. For example, to see if a given gene’s expression really changes in a noisy system, we would normalize to “housekeeping” genes whose expression is assumed to be constant under different conditions or timepoints. If expression of your gene of interest follows the trend of the housekeeping genes, then any change in expression in your measurements is likely simply due to noise.

    Ideally, you would also have positive control genes you know from previous work actually do change in the direction you hypothesize your gene of interest changes. If your positive control doesn’t work, you know the measuring system is too noisy to give real data.

    In this case, the hypothesis is that words associated with Christianity are in relative decline. We need a neutral housekeeping set of control words unlikely to have changed in real usage as normalizers. I chose a small, likely cohort of common nouns and ngrammed over the 20th Century (“house,life,time,day,world”). Except “world”, all showed substantial declines. Use of the word “day” declined by about 40%. Could one argue “Time” itself is in whopping decline? Clearly, in addition to society’s descent into amorality, we are apparently also far less aware of chronological dynamics! ;-)

    I am not sure about positive controls but came up with a quick set of technology-associated ones: “carriage,telegraph,cavalry,musket,hearth”. If you run these, they decrease more in the 60-80% range.

    Some Christiany, moral-y words like “mercy, grace, and faith” are all definitely down. In the 40-60% range. But are those just quaint words?

    Cursory conclusion: with more controls, one might eventually argue that some faith words like “grace, faith, mercy” may have decreased in use during the 20th Century, normalized to neutral terms. But one would need a lot of such faith terms, as those particular ones may just be quaint themselves, not the concepts indicated therewith. But, of course, apply a little wishful Whorfianism yadayada and correlation=causality arguing and you too can cement the perennial “society is in freefall decline from when I was a kid” lament…..

    [(myl) Nicely done.]

  19. ktschwarz said,

    October 12, 2018 @ 3:20 pm

    The Pechenick et al. paper is in PLOS ONE, not PNAS (the link is correct).

    [(myl) Fixed now.]

  20. Steve B said,

    October 15, 2018 @ 8:57 am

    If you look at non-christian spiritual/moral words ("karma", "dharma", "nirvana"), you see a large rise over the exact same period. (https://books.google.com/ngrams/graph?content=karma%2Cdharma%2Cnirvana%2Cpew&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Ckarma%3B%2Cc0%3B.t1%3B%2Cdharma%3B%2Cc0%3B.t1%3B%2Cnirvana%3B%2Cc0%3B.t1%3B%2Cpew%3B%2Cc0)
