Tyler Cowen, "I wonder if this is actually true", Marginal Revolution 7/12/2012.
Researchers who have scanned books published over the past 50 years report an increasing use of words and phrases that reflect an ethos of self-absorption and self-satisfaction.
"Language in American books has become increasingly focused on the self and uniqueness in the decades since 1960,” a research team led by San Diego State University psychologist Jean Twenge writes in the online journal PLoS One. “We believe these data provide further evidence that American culture has become increasingly focused on individualistic concerns.”
Their results are consistent with those of a 2011 study which found that lyrics of best-selling pop songs have grown increasingly narcissistic since 1980. Twenge’s study encompasses a longer period of time—1960 through 2008—and a much larger set of data.
That 2011 study was not very convincing — for details, see "Lyrical Narcissism?", 4/9/2011; "'Vampirical' hypotheses", 4/28/2011; "Pop-culture narcissism again", 4/30/2011; "Let me count the ways", 6/9/2011.
On the face of it, however, the new study (Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS One 7/10/2012) looks more plausible. But I thought that for this morning's Breakfast Experiment™ I'd take a closer look. And what I found diverges pretty seriously from the conclusions of the cited paper.
First, a serious complaint to the editors of PLoS One — the Twenge et al. study doesn't, as far as I can tell, provide access to its data! This is completely inexcusable, in my opinion — everything is based on two 20-by-49 tables of numbers, which could trivially have been put in the (digital-only) "paper", or (better) made available on line as separate files.
So for my replication, I had to go back to the sources. As long as I had to do that, I thought I'd do it a bit differently. First, I looked at changes from 1900 to 2010; second, I looked at the data decade-by-decade rather than year-by-year; and third, I present the initial results graphically rather than relying on regression parameters. The results left me skeptical about what would happen in a more careful overall check. Details follow, but for an up-front taste, here's a plot of the ratio of counts of "communal" words to "individualistic" words in the American English Google Books corpus, decade-by-decade since 1900:
If Twenge et al. are right, then the last five data points (the numbers from the 1960s to the present, plotted in red) should show a relative increase in the use of individualistic words. That is, the "communal"/"individualistic" ratio should be going down.
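For concreteness, here is a minimal Python sketch of the ratio being plotted. The counts are made-up toy numbers, not the Google Books figures, and the function name is hypothetical.

```python
# Toy per-decade sums of word counts (hypothetical values, NOT the real data).
communal = {1960: 1200, 1970: 1500, 1980: 1800}
individualistic = {1960: 1000, 1970: 1200, 1980: 1500}


def communal_to_individualistic_ratio(communal_counts, individual_counts):
    """Return {decade: communal / individualistic} for decades present in both."""
    return {
        decade: communal_counts[decade] / individual_counts[decade]
        for decade in sorted(communal_counts)
        if decade in individual_counts and individual_counts[decade] > 0
    }


ratios = communal_to_individualistic_ratio(communal, individualistic)
for decade, ratio in ratios.items():
    print(decade, round(ratio, 3))
```

Since both sums for a given decade would be divided by the same corpus size, the ratio does not depend on how (or whether) the counts are normalized.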
OK, now to the details.
Twenge et al. did two experiments, one on isolated word counts and one on counts of phrases. They selected their lists of words and phrases by asking "turkers":
We used a two-step process to create a sampling of individualistic words. One sample generated words characteristic of individualism, and another rated which were most representative of the concepts. We used the same method to generate communal words.
For both phases, we recruited participants through the online service MTurk, in which participants are paid small amounts to complete various tasks. MTurk samples are typically more diverse in age and ethnicity than college samples or even most other Internet samples, and the data generated meet psychometric standards. […]
In the generation phase, MTurk participants generated words characteristic of individualism and communalism. Participants were given the following instructions: “We are looking for examples of single words often used in American culture, now and in the past, that express either: A) individualism (defined as focusing on the self and the needs of the self) or B) communalism (defined as focusing on groups, the society, and/or social rules).” Participants were then asked to list five individualistic and five communal words. Eliminating duplicates and foreign words left a list of 105 individualistic words and 137 communal words. We took a conservative approach to similar words, eliminating only plurals (for example, keeping “group” but not “groups”) but retaining noun and adjective forms, as they may have slightly different meanings (for example, “tribe” and “tribal”).
A separate sample of 55 MTurk participants rated the individualistic words on a 1 to 7 scale (with 1 = “not at all Individualistic” and 7 = “very individualistic”). Fifty-one other participants rated the communal words on a 1 to 7 scale (with 1 = “not at all communal” and 7 = “very communal”). Demographic information was not collected on participants in the second phase.
The 20 top-rated individualistic words were: independent, individual, individually, unique, uniqueness, self, independence, oneself, soloist, identity, personalized, solo, solitary, personalize, loner, standout, single, personal, sole, and singularity. The 20 top-rated communal words were: communal, community, commune, unity, communitarian, united, teamwork, team, collective, village, tribe, collectivization, group, collectivism, everyone, family, share, socialism, tribal, and union.
I worry a bit that this method means that the study is not so much focused on underlying changes in American individualistic vs. communal thinking, as on changes in American usage of words chosen by the current generation of Americans as descriptive of individualistic vs. communal thinking. But whatever — for my replication, I used the same lists of 20 "individualistic" and 20 "communal" words.
I took decade-by-decade counts, for decades starting from 1900-1909 and ending with 2000-2009, from the BYU interface to the published Google Books "American English" ngram collections. Here are links to the resulting "individualistic" and "communal" tables. (Note that the Google Books datasets preserve case, so the count for e.g. "independent" does not include counts for "Independent" or "INDEPENDENT", etc. On a quick scan, I didn't see anything in the Twenge et al. paper about whether they did case-independent counts or not…)
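To illustrate what a case-independent count would involve, here is a small Python sketch that folds case variants together. The counts are invented for illustration, and nothing here is meant to describe what Twenge et al. actually did.

```python
from collections import defaultdict

# Hypothetical case-sensitive string counts for one decade (toy values).
raw_counts = {
    "independent": 900,
    "Independent": 250,
    "INDEPENDENT": 10,
    "unique": 400,
    "Unique": 30,
}


def fold_case(counts):
    """Sum the counts of all strings that differ only in capitalization."""
    folded = defaultdict(int)
    for token, n in counts.items():
        folded[token.lower()] += n
    return dict(folded)


folded = fold_case(raw_counts)
print(folded)
```

The counts used in this post are the case-sensitive ones, i.e. the lowercase strings only.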
These are word counts (or really, string counts) — and the overall number of words in the Google Books dataset increases over time, so we need to normalize them. I couldn't find decade-by-decade word counts for the Google Books "American English" dataset, and didn't feel like taking the time to calculate them from the year-by-year counts, so as a quick proxy I used the decade-by-decade counts for the string "The" from the same collection.
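The normalization step can be sketched in Python as follows. The numbers are toy values and the function name is hypothetical; the only idea taken from the text is dividing each decade's count by that decade's count of a reference string.

```python
def normalize(word_counts, reference_counts):
    """Divide each decade's count by the same decade's reference count."""
    return {
        decade: word_counts[decade] / reference_counts[decade]
        for decade in sorted(word_counts)
    }


# Toy values (hypothetical): raw counts rise, but the corpus grows faster.
word_counts = {1900: 50, 1910: 80}
the_counts = {1900: 1000, 1910: 2000}  # stand-in for the counts of "The"

relative = normalize(word_counts, the_counts)
print(relative)
```

In the toy numbers the raw count goes up while the normalized value goes down, which is why raw counts alone can't be compared across decades. The proxy is only as good as the assumption that the reference string's own relative frequency is roughly stable.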
For the plot below, I've summed the counts for the 20 "individualistic" and 20 "communal" words, decade by decade. The "individualistic" word counts are plotted as a red line with "I" data points; the "communal" words are the blue line with "C" data points. The x-axis gives the mid-points of decades, e.g. 1905 for 1900-1909, etc.
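As a rough Python sketch of the bookkeeping behind such a plot (summing over a word list and placing each decade at its mid-point), assuming a hypothetical table layout of word to decade-start to count, with invented numbers:

```python
def decade_midpoint(start_year):
    """Map a decade's first year to its mid-point, e.g. 1900 -> 1905."""
    return start_year + 5


def sum_over_words(table):
    """Sum the counts of every word in `table`, decade by decade."""
    totals = {}
    for per_decade in table.values():
        for start_year, n in per_decade.items():
            totals[start_year] = totals.get(start_year, 0) + n
    return totals


# Toy table (hypothetical values) for two of the "individualistic" words.
toy_table = {
    "unique": {1900: 10, 1910: 12},
    "solo": {1900: 3, 1910: 5},
}

points = sorted(
    (decade_midpoint(start), total)
    for start, total in sum_over_words(toy_table).items()
)
print(points)
```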
Here's the R script that made the plot. Note that this is equivalent to the first of the two analysis strategies described by Twenge et al:
We analyzed the data using two complementary approaches. First, we simply summed usage means together, with the idea that the natural frequency of the words is relevant for assessing cultural change. In these analyses, a word used more frequently has a proportionally larger influence. In a second set of analyses, we Z-scored each word before summing so each word carried an equal weight regardless of absolute frequency.
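The difference between the two strategies can be sketched in Python as below. This is an illustration with toy numbers, not the authors' code; in particular, using the population standard deviation for the Z-scores is an assumption, since the quoted passage doesn't say which was used.

```python
from statistics import mean, pstdev


def raw_sum(series_by_word):
    """Strategy 1: add frequencies directly, so frequent words dominate."""
    return [sum(column) for column in zip(*series_by_word.values())]


def zscore(series):
    """Standardize one word's series (population SD: an assumption)."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]  # fails if the series is constant


def zscored_sum(series_by_word):
    """Strategy 2: Z-score each word first, so every word weighs the same."""
    standardized = [zscore(s) for s in series_by_word.values()]
    return [sum(column) for column in zip(*standardized)]


# Toy frequencies over three time points (hypothetical values).
toy = {"self": [100, 200, 300], "loner": [1, 2, 3]}

raw_total = raw_sum(toy)
z_total = zscored_sum(toy)
print(raw_total)
print(z_total)
```

In the toy data "self" accounts for nearly all of the raw sum, while after Z-scoring the two words contribute identically.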
According to their summary, the two methods had similar results:
Individualistic words increased in use in American books between 1960 and 2008. The correlation between year and the sum of the 20 individualistic words was r(49) = .87, p<.001. The 20 individualistic words combined made up .096% of words in books published in 1960, and .115% of words in books published in 2008. With an SD of .0063, this is an increase of d = 3.02. We also analyzed the data by Z-scoring each word before summing them. The correlation between publication year and the sum of the Z-scores was r(49) = .86, p<.001 for the 20 individualistic words, very similar to the simple sum. […]
When both individualistic and communal words are included in a regression equation predicting year, only individualistic words are significant (Beta = .83, p<.001; for communal words, Beta = .05, ns). When the Z-scored sums of both the individualistic and communal words were included in a regression equation predicting year, individualistic words increased while communal words decreased (Beta = .84, p<.001; for communal words, Beta = −.15, p<.05). Thus when the common variance of being generated by a modern sample is partialled out, only individualistic words have increased since 1960.
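The effect size quoted above can be checked with a few lines of Python, assuming that d is simply the 1960-to-2008 difference divided by the quoted SD (the passage doesn't spell out the formula):

```python
# Figures as quoted from Twenge et al. (percent of all words).
share_1960 = 0.096
share_2008 = 0.115
sd = 0.0063

# Assumed definition: difference in percentages divided by the SD.
d = (share_2008 - share_1960) / sd
print(round(d, 2))
```

Under that assumption the arithmetic comes out at about 3.02, matching the reported value.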
I haven't tried the Z-score method, or any regression experiments, because my breakfast hour is over. But so far, this dataset doesn't seem to provide any sort of convincing case for the idea that "Language in American books has become increasingly focused on the self and uniqueness" (in contrast to focus on the group and shared characteristics or experiences), either over the last century or "in the decades since 1960". If anything, I suspect that you could use these numbers to build a case slightly in the opposite direction.
I don't know why this take on the subject looks so different from the results that Twenge et al. got. The next step would be to get a copy of their tables and look into things a bit more closely.
[I should note that there's also the question of why both their "individualistic" and "communal" words seem to be increasing in overall relative frequency. One possibility is that there's something wrong with my normalization-by-"The" approach — maybe the relative frequency of "The" has been gradually decreasing? Another possibility is that their word-list creation method, which relies on a sort of word-association test administered to 2012 turkers, tends to create a list of words with a high trendiness factor, which have been increasing in relative frequency over the past hundred years or so.
And the last possibility — which is the most interesting but the least likely — is that Americans have actually become increasingly obsessed, not with themselves, but with the whole self/group opposition.]
Update — I fetched the unigram counts for the Google Books American English ngram collection, and re-did the plot using those to normalize by decade. The results are not very different, except for maybe some issues with incomplete data in the last decade:
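Rolling year-by-year totals up into decade totals is straightforward. Here is a minimal Python sketch with invented numbers and a hypothetical input layout of year to total word count:

```python
from collections import defaultdict


def decade_totals(yearly_totals):
    """Sum year-by-year totals into decades keyed by first year (1900, 1910, ...)."""
    totals = defaultdict(int)
    for year, n in yearly_totals.items():
        totals[(year // 10) * 10] += n
    return dict(totals)


# Toy yearly totals (hypothetical values, NOT the real unigram counts).
yearly = {1900: 10, 1905: 20, 1909: 30, 1910: 5}

by_decade = decade_totals(yearly)
print(by_decade)
```

A decade with missing years simply gets a smaller total, so an incomplete final decade affects the numerator and the denominator alike only if both come from the same set of years.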
Looking at the decade-by-decade ratio of counts of "communal" words to counts of "individualistic" words, it's hard to see any serious recent trend. (Though the numbers are big enough that the increasing ratio over the past three decades (in favor of "communal" words) could probably be argued to be "statistically significant", in some sense of that so-often-meaningless phrase.) The ratio has been well above 1.0 since 1925 or so, for whatever that's worth: