I've generally been skeptical of claims about counts of first-person singular pronouns as an index of self-involvement, mainly on empirical grounds. In particular, the pundits who beat this drum mostly make assertions without any counts, much less comparisons of counts. For some of the Language Log coverage, with links to articles by George F. Will, Stanley Fish, and Peggy Noonan (among others), see "Fact-checking George F. Will" (6/7/2009); "Obama's Imperial 'I': spreading the meme" (6/8/2009); "Inaugural pronouns" (6/8/2009); "Another pack member heard from" (6/9/2009); "I again" (7/13/2009); "'I' is a camera" (7/18/2009).
And there are problems with the theory as well, as Jamie Pennebaker explains here.
But look at this impressive graph, from C. Nathan DeWall, Richard S. Pond, Jr., W. Keith Campbell, and Jean M. Twenge, "Tuning in to psychological change: Linguistic markers of psychological traits and emotions over time in popular U.S. song lyrics", Psychology of Aesthetics, Creativity, and the Arts, 3/21/2011:
Here we've got numbers galore — from the lyrics of Billboard's top 10 songs from each of 28 years, 88,621 total words — and comparison of numbers across time. There still might be some questions about the explanation, but at least we have a strong effect to explain, right?
DeWall et al. certainly want us to draw the obvious conclusion, as they explain in their abstract:
Linguistic analyses of the most popular songs from 1980–2007 demonstrated changes in word use that mirror psychological change. Over time, use of words related to self-focus and antisocial behavior increased, whereas words related to other-focus, social interactions, and positive emotion decreased. These findings offer novel evidence regarding the need to investigate how changes in the tangible artifacts of the sociocultural environment can provide a window into understanding cultural changes in psychological processes.
The trouble is, they also give the associated table of yearly numbers:
And eyeballing the numbers in that table, it's hard to see how they connect to the graph in their Figure 1. A plot of the yearly numbers agrees:
And calculating and plotting the mean values for the time-periods in their Figure 1 confirms it:
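The period-mean calculation is straightforward to reproduce; here's a minimal sketch, using invented yearly percentages (not the actual Table 1 values) and assuming, for illustration, that Figure 1's four time-periods are equal 7-year bins:

```python
import numpy as np

# Hypothetical yearly percentages of first-person-singular pronouns,
# standing in for the 28 yearly values (1980-2007) in DeWall et al.'s Table 1.
rng = np.random.default_rng(0)
years = np.arange(1980, 2008)
pct = 5.0 + 0.01 * (years - 1980) + rng.normal(0, 0.5, size=years.size)

# Split the 28 years into four equal 7-year bins (an assumption, for
# illustration) and compare each bin's mean to the grand mean.
period_means = pct.reshape(4, 7).mean(axis=1)
grand_mean = pct.mean()

for start, m in zip(years[::7], period_means):
    print(f"{start}-{start + 6}: mean = {m:.2f} (grand mean {grand_mean:.2f})")
```

With noisy year-to-year data like this, the bin means typically bounce around the grand mean rather than climbing steadily, which is the point of the comparison.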
So what is their Figure 1 actually plotting? Frankly, I'm not completely sure. It seems that their Figure 1 in fact plots numbers that are not derived directly from their Table 1, but rather from a regression analysis somewhat vaguely described as follows:
To provide a conservative test of our hypothesis, multiple regression analyses were conducted for each dependent variable, predicting word use from song year. Dummy variables for genre type (i.e., country, hip hop/r&b, pop, and rock) and changes in methodology (i.e., changes in ranking formula to account for digital downloads and streamed media) were entered as covariates.
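The kind of design they describe can be sketched as an ordinary least-squares fit with dummy covariates. Everything below is invented for illustration (the song-level data, the genre coding, and the choice of 1998 as the methodology-change cutoff are all assumptions, not the paper's actual setup):

```python
import numpy as np

# Sketch of the regression design DeWall et al. describe: predict a song's
# first-person-singular percentage from its year, with dummy covariates for
# genre and for the Billboard ranking-formula change. All data are invented.
rng = np.random.default_rng(1)
n = 280  # 10 songs x 28 years
year = np.repeat(np.arange(1980, 2008), 10).astype(float)
genre = rng.integers(0, 4, size=n)            # 0=country, 1=hip hop/r&b, 2=pop, 3=rock
method_change = (year >= 1998).astype(float)  # hypothetical cutoff for the formula change
y = 5 + 0.02 * (year - 1980) + 0.8 * (genre == 1) + rng.normal(0, 1, n)

# Design matrix: intercept, centered year, three genre dummies
# (country as the baseline category), and the methodology dummy.
X = np.column_stack([
    np.ones(n),
    year - year.mean(),
    (genre == 1).astype(float),
    (genre == 2).astype(float),
    (genre == 3).astype(float),
    method_change,
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("year coefficient:", round(beta[1], 4))
```

The year coefficient is then the "effect of time" net of genre mix and the chart-methodology change, which is what their conservative test is after.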
The results for first-person singular pronouns:
In other words, they apparently fitted a line (probably to the logit transform of the yearly first-person-singular-pronoun proportions), and plotted the fitted time-period proportions rather than the actual proportions. If that's true, then it seems to me that their Figure 1 is superfluous at best and misleading at worst.
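If that reading is right, the key property is that a line fitted in logit space is monotonic by construction, so the fitted period values will always march in one direction even when the actual period means don't. A sketch, again with invented yearly proportions and the assumed 7-year bins:

```python
import numpy as np

# The conjectured recipe behind Figure 1: fit a straight line to the logit
# of the yearly first-person-singular proportions, then report fitted rather
# than actual period values. Yearly proportions here are invented.
rng = np.random.default_rng(2)
years = np.arange(1980, 2008)
p = np.clip(0.05 + 0.0005 * (years - 1980) + rng.normal(0, 0.004, years.size),
            0.01, 0.99)

logit = np.log(p / (1 - p))
slope, intercept = np.polyfit(years, logit, 1)           # straight line in logit space
fitted = 1 / (1 + np.exp(-(intercept + slope * years)))  # back to proportions

# Fitted period values are monotonic in year by construction,
# unlike the actual period means of the noisy data.
fitted_period = fitted.reshape(4, 7).mean(axis=1)
actual_period = p.reshape(4, 7).mean(axis=1)
print("fitted :", np.round(fitted_period, 4))
print("actual :", np.round(actual_period, 4))
```

Plotting the fitted values as if they were observed period means is exactly the move that makes the trend look cleaner than the data warrant.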
Still, it looks like there's some change over time in the actual proportions, even if it's noisy and non-monotonic. What caused this? Is it showing us something about the overall expression of self-involvement in American culture over time?
We should start by getting specific about where they got their data:
To explore changes in word use in popular songs over time, we obtained song lyrics for the 10 most popular U.S. songs (according to the Billboard Hot 100 year-end chart) for each year between 1980 through 2007. We chose this time period because prior work has shown significant cultural changes over time in motivation, personality, and emotion (e.g., Twenge, 1997; Twenge & Campbell, 2001, 2008; Twenge & Foster, 2010). The top 10 songs were chosen because of the preponderance of Top 10 lists that identify popular cultural products (e.g., foods, presents), including songs.
The first thing that occurred to me was that the (noisy and non-monotonic) trends might (partly or mainly) reflect changes in the relative popularity of musical genres and even individual artists. DeWall et al. consider and reject this alternative, as follows:
The results for the genre variable (which they give as correlations rather than as logistic-regression coefficients):
This is not at all convincing, in my opinion. All four categories are highly diverse, in ways that change significantly over the time period of their study. The category of "Hip Hop/R&B", as the slash suggests, is especially diverse, and has also changed especially strongly over the period from 1980 to 2007. At one end, this category includes Michael Jackson, Diana Ross, Lionel Richie, Kool & the Gang, Prince, and Tina Turner (representing the Hip Hop/R&B genre in the top 10 of the Billboard Hot 100 for 1980-1984); at the other, 50 Cent, R. Kelly, Sean Paul, Jay-Z, Chingy, Ludacris, Usher, and P. Diddy (representing the same genre in the top 10 of the Billboard Hot 100 for 2000-2004). A variable that doesn't distinguish between these two lists is hardly controlling effectively for genre differences.
The idea of data-mining song lyrics for indications of cultural trends is a plausible and interesting one, in my opinion. But this particular study uses misleading graphics to exaggerate its findings, and does a remarkably tone-deaf job of controlling for genre changes.
Update — Cosma Shalizi writes:
Inspired by your post on DeWall et al., I typed in the confidence intervals as well as the means, and added (1) a horizontal line at the grand mean, (2) a smoothing spline (blue), and (3) a smoothing spline with points inversely weighted by standard deviation (purple). Code and graph attached.
They probably cite Colin Martindale approvingly, don't they?
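Shalizi's weighting idea, downweighting noisier years when smoothing, can be sketched as follows. This is a simple stand-in, not his attached code: a Gaussian kernel smoother with inverse-standard-deviation weights rather than a smoothing spline, and the yearly means and standard deviations are invented:

```python
import numpy as np

# Stand-in for an inverse-variance-weighted smoother: a Gaussian kernel
# smoother where each year's mean is additionally weighted by 1/sd, so
# years with wider confidence intervals pull the curve less. Data invented.
rng = np.random.default_rng(3)
years = np.arange(1980, 2008).astype(float)
means = 5 + rng.normal(0, 0.5, years.size)
sds = rng.uniform(0.2, 1.0, years.size)

def kernel_smooth(x, y, w, bandwidth=3.0):
    """Weighted Nadaraya-Watson smoother evaluated at the observation points."""
    out = np.empty_like(y)
    for i, xi in enumerate(x):
        k = np.exp(-0.5 * ((x - xi) / bandwidth) ** 2) * w
        out[i] = np.sum(k * y) / np.sum(k)
    return out

smoothed = kernel_smooth(years, means, 1.0 / sds)
print(np.round(smoothed[:5], 3))
```

Whatever the smoother, the design choice is the same: a year whose estimate comes with a large standard deviation should count for less than a tightly estimated one.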