Sharon Jayson, "What's on Americans' mind? Increasingly, 'me'", USA Today 7/10/2012:
An analysis of words and phrases in more than 750,000 American books published in the past 50 years finds an emphasis on "I" before "we" — showing growing attention to the individual over the group.
This is actually true as stated. If we take the counts from the "American English" unigram dataset in the Google Books ngram collection, and extract the year-by-year counts for the letter strings in question, the frequency of "I" has increased relative to the frequency of "we" over the period since 1960 — to the point where the ratio of frequencies is almost as high as it was in 1900:
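For anyone who wants to redo this calculation, here is a minimal sketch in Python. It assumes the unigram files are tab-separated with the ngram, the year, and the match count as the first three fields; the sample rows and the helper names `yearly_counts` and `ratio_by_year` are made up for illustration and are not real counts. Since both words are normalized by the same yearly total, the ratio of frequencies in a given year is just the ratio of raw counts.

```python
from collections import defaultdict

# Made-up sample rows in the assumed layout: ngram <TAB> year <TAB> match_count.
# These numbers are illustrative only, not real Google Books counts.
SAMPLE = """\
I\t1960\t1200
we\t1960\t800
I\t2008\t2700
we\t2008\t900
i\t2008\t50
"""

def yearly_counts(lines, targets):
    """Return {word: {year: count}} for exact, case-sensitive matches."""
    counts = {w: defaultdict(int) for w in targets}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if ngram in counts:
            counts[ngram][year] += match_count
    return counts

def ratio_by_year(counts, numerator, denominator):
    """Ratio of counts for each year in which both words occur."""
    num, den = counts[numerator], counts[denominator]
    return {y: num[y] / den[y] for y in sorted(num) if den.get(y)}

counts = yearly_counts(SAMPLE.splitlines(), ["I", "we"])
ratios = ratio_by_year(counts, "I", "we")
for year, r in ratios.items():
    print(year, round(r, 2))
```

Note that matching is exact, so the lower-case "i" row is ignored here; whether to include such variants is a choice that has to be made explicitly.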
Perhaps we ought to worry a bit about how often "i" is the roman numeral or the initial; but looking at the relative frequency of "me" vs. "us", or "myself" vs. "ourselves", shows a generally similar pattern over time (though the recent rise seems somewhat delayed, and is substantially smaller):
The changes from 1900 to 1960 are at least as striking as the changes from 1960 to the present. And if we compare "I" and its pronominal associates to "you" rather than to "we", we see a strikingly different pattern:
But let's push forward for now.
The USA Today article is based on work by Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS One 7/10/2012, which concludes that:
Individualistic words and phrases increased in use between 1960 and 2008, even when controlling for changes in communal words and phrases. Language in American books has become increasingly focused on the self and uniqueness in the decades since 1960.
If we look at the particular words cited in that quote, again broadening the view to the period 1890-2008, the claim is sort of true:
But again, the description seems to be missing some interesting things. Thus "self" started to increase rapidly in frequency around 1940, not 1960; and it peaked in 1996. And "uniqueness" increased steadily from 1890 onwards, with a rapid rise from 1945 to 1965, and a relatively flat trajectory since then.
What's really going on here? The numbers involved are very large, and the changes are relatively smooth and extend over relatively large ranges of both absolute and relative frequency, so that it's quite clear that these time series are not just noise. But what kinds of signal are really involved? Here are a few possibilities:
1. The mix of kinds of books published changes over time (e.g. more romance novels, fewer collections of sermons); different kinds of books use words differently; therefore the relative frequency of words changes.
2. The mix of kinds of books selected for the Google Books ngram collections changes over time; so the relative frequency of words changes, for similar reasons as in (1).
3. The distribution of concepts or conceptual frames changes over time, even in the same sorts of books.
4. The choice of words to express a given concept (in published books) changes over time, even in the same sorts of books.
I have no doubt that all of these things are contributing to the time series that we see. As an illustration of point (4), perhaps with a bit of a contribution from points (1) and (2), consider the history of "everyone" vs. "everybody". As far as I can tell, these two words are universally inter-substitutable: there's no context (metalanguage aside) where the choice makes other than a stylistic difference. But the frequency of "everyone" (in the Google Books American English dataset) has been increasing steadily since the start of the 20th century, with a pause from 1945 to 1975; the frequency of "everybody" has been relatively stable. As a result, the ratio of "everyone" to "everybody" in this sample has increased more than 20-fold, with "everyone" overtaking "everybody" at some point in 1934:
Unfortunately, we don't have the information about the Google Books datasets that would allow us to directly unravel these factors. We don't know what the actual books involved are; we don't know the broader contexts of the words; we don't have anything except the string counts by years. At some point in the (I hope not-too-distant) future, we'll have an open historical collection that will make it plausible to explore these questions.
And the Twenge et al. study uses these counts in a largely mysterious way, making it even harder to evaluate their claims. In an earlier post, I complained that
[The Twenge et al.] study doesn't, as far as I can tell, provide access to its data! This is completely inexcusable, in my opinion — everything is based on two 20-by-49 tables of numbers, which could trivially have been put in the (digital-only) "paper", or (better) made available on line as separate files. ("Textual Narcissism", 7/13/2012)
And in a later post, I observed that Twenge et al. don't even provide the basic details required to make it possible to replicate their work by getting the numbers again from the Google Books ngram corpus, because
… the Google Books data does not collapse over case, so that e.g. "solo", "Solo", and "SOLO" are all distinct items. I counted only the all-lower-case versions of the words; they don't say what they did. ("What does this graph mean?", 7/15/2012)
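To make the distinction concrete, here is a rough sketch of the two ways of counting. The rows are made up for illustration, and the function names `lower_only` and `case_independent` are my own labels, not anything from the Twenge et al. paper. The last step computes the lower-case share of all instances in each year; if that share stays roughly constant, the two series are close to proportional.

```python
from collections import defaultdict

# Made-up rows: (ngram, year, match_count). Not real Google Books counts.
ROWS = [
    ("solo", 1960, 40), ("Solo", 1960, 8), ("SOLO", 1960, 2),
    ("solo", 2008, 120), ("Solo", 2008, 24), ("SOLO", 2008, 6),
]

def lower_only(rows, word):
    """Yearly counts for the exact all-lower-case string."""
    out = defaultdict(int)
    for ngram, year, n in rows:
        if ngram == word:
            out[year] += n
    return dict(out)

def case_independent(rows, word):
    """Yearly counts summed over every capitalization of the word."""
    out = defaultdict(int)
    for ngram, year, n in rows:
        if ngram.lower() == word:
            out[year] += n
    return dict(out)

lower = lower_only(ROWS, "solo")
merged = case_independent(ROWS, "solo")
# Share of all instances that are lower-case; a roughly constant share
# means the two series are close to proportional.
share = {y: lower[y] / merged[y] for y in sorted(merged) if y in lower}
print(share)
```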
At the time, I thought that this last point didn't matter a great deal, since for most words, the lower-case and case-independent counts are close to being proportional, year by year, e.g.
But there are a few words in their lists that behave quite differently. If you look at the lists and think about it a bit, you'll see which words (at least some of them) these are.
Their 20 "communal" words: communal, community, commune, unity, communitarian, united, teamwork, team, collective, village, tribe, collectivization, group, collectivism, everyone, family, share, socialism, tribal, union
Their 20 "individualistic" words: independent, individual, individually, unique, uniqueness, self, independence, oneself, soloist, identity, personalized, solo, solitary, personalize, loner, standout, single, personal, sole, singularity
A large fraction of the instances of the word "united" are in proper nouns ("United States", "United Kingdom", "United Nations", "United Airlines", etc.), and are therefore capitalized. As a result, the time series of lower-case and case-independent frequency are very different:
Something similar is true for "union", and to a lesser extent for some other words as well. As a result, the case-independent aggregate counts show a much bigger difference between "communal" and "individualistic" words than the lower-case-only counts do:
And as a further result, the ratios of the aggregate counts behave somewhat differently in the post-WWII era:
(Note that their paper deals only with the changes over the period 1960-2008 — thus missing what might be the most interesting aspect of these numbers, namely the large difference between the behavior of these aggregate frequency ratios before WWII and after WWII.)
In my opinion, the relative frequency of proper nouns like "United States" and "United Kingdom" is not likely to tell us very much about whether "language in American books has become increasingly focused on the self and uniqueness", and so I suspect that the lower-case-only data is more relevant to the issues that Twenge et al. raise. But given this (in retrospect obvious) problem with their word lists, it's all the more unfortunate that they provide neither the table of data that they used, nor a recipe for calculating it from the original (publicly available) source. Instead, they give us only regression coefficients and significance levels.
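For what it's worth, a recipe of the kind I have in mind might look something like the sketch below. This is my guess at a reasonable procedure, not a reconstruction of what Twenge et al. actually did. It uses short subsets of the two word lists quoted above and made-up counts, and the `fold_case` switch is a hypothetical parameter that selects between lower-case-only and case-independent aggregation.

```python
from collections import defaultdict

# Subsets of the two word lists quoted above, kept short for illustration.
COMMUNAL = {"united", "union", "team"}
INDIVIDUALISTIC = {"unique", "self", "solo"}

# Made-up rows: (ngram, year, match_count). Not real Google Books counts.
ROWS = [
    ("united", 1960, 100), ("United", 1960, 900), ("team", 1960, 200),
    ("self", 1960, 150), ("Self", 1960, 10), ("unique", 1960, 140),
    ("united", 2008, 100), ("United", 2008, 700), ("team", 2008, 300),
    ("self", 2008, 400), ("Self", 2008, 20), ("unique", 2008, 180),
]

def aggregate(rows, words, fold_case):
    """Sum yearly counts over a word list, optionally collapsing case."""
    totals = defaultdict(int)
    for ngram, year, n in rows:
        key = ngram.lower() if fold_case else ngram
        if key in words:
            totals[year] += n
    return dict(totals)

def list_ratio(rows, fold_case):
    """Individualistic-to-communal ratio of aggregate counts, per year."""
    ind = aggregate(rows, INDIVIDUALISTIC, fold_case)
    com = aggregate(rows, COMMUNAL, fold_case)
    return {y: ind[y] / com[y] for y in sorted(ind) if com.get(y)}

lower_ratio = list_ratio(ROWS, fold_case=False)
folded_ratio = list_ratio(ROWS, fold_case=True)
for year in sorted(lower_ratio):
    print(year, round(lower_ratio[year], 3), round(folded_ratio[year], 3))
```

In this toy data the capitalized "United" rows dominate the communal total once case is collapsed, so the two ratios differ a good deal. Stating which option was used is the minimum needed for anyone else to get the same numbers.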
Anyhow, I continue to think that the numbers in the Google Books datasets are fascinating. I just wish I had a clue as to what they mean.
[Note: I'll provide a link to the data and code a bit later today — some clean-up is needed, and I've run out of time for this morning's Breakfast Experiment™.]
Update — the data is here.