It's a time series, from 1890 to 2008, of a certain socio-cultural index. The points in red are the year-by-year values; the blue line is a smoothed ("spline") version of the sequence.
If you had to summarize this plot to someone over the phone or in text-only form, how would you do it? I might say something like
It starts at about 0.95 through the 1890s and 1900s. Then around 1915 it starts rising, and keeps on going up to a peak of about 1.25, around 1950. Then it comes down a bit, and wobbles around until the present, in the range of 1.15 to 1.2.
OK, suppose we zoom in on the period from 1960 to 2008, plotted in red below. How would you describe that part?
I'd say something like
Well, it goes down a bit through the mid-1960s, then it rises a bit through the early 1970s, then it falls a bit again into the early 1980s, and then it gradually rises over the past 25 or 30 years. It ends up just about where it started.
So what is this mystery time series?
Ayn Rand, author of Atlas Shrugged, might have called it the "collectivism index", or the "herd index". Jean Twenge, author of Generation Me: Why Today's Young Americans Are More Confident, Assertive, Entitled–and More Miserable Than Ever Before, might call it the "cooperation index", or the "we over me index".
And Prof. Twenge and her co-authors, focusing on the section of this dataset from 1960 to 2008, conclude that
Individualistic words and phrases increased in use between 1960 and 2008, even when controlling for changes in communal words and phrases. Language in American books has become increasingly focused on the self and uniqueness in the decades since 1960.
The use of both individualistic words (Study 1) and phrases (Study 2) increased over time in a very large corpus of books in American English. This increase remained significant even when a sample of communal words and phrases also generated by a modern sample was controlled for statistically.
We interpret these changes in published language as reflecting broader cultural changes. That is, we believe these data provide further evidence that American culture has become increasingly focused on individualistic concerns since 1960. Using cross-temporal data to assess cultural change over time within one country is similar to using cross-cultural data to assess differences between cultures during the same historical period. Thus, America today is culturally distinct from America in 1960– at least in the realm of individualism.
I doubt that any of you, looking at the time series plots above, came to the conclusion that the "we over me index" has fallen markedly since 1960. It basically ended up where it started — and if anything, the trend over the past couple of decades seems to be anti-individualistic, as measured by this index.
And what is this index, really?
What it is, in simple fact, is the time series, from 1890 to 2008, of the ratio of aggregate counts of 20 "communal" words to the aggregate counts of 20 "individualistic" words, where the counts are taken from the American English data in the Google Books ngram collection, and the "communal" and "individualistic" word lists come from Jean M. Twenge et al., "Increases in Individualistic Words and Phrases in American Books, 1960–2008", PLoS One 7/10/2012.
Twenge et al.'s "communal" word list is
communal, community, commune, unity, communitarian, united, teamwork, team, collective, village, tribe, collectivization, group, collectivism, everyone, family, share, socialism, tribal, union
Their "individualistic" word list is
independent, individual, individually, unique, uniqueness, self, independence, oneself, soloist, identity, personalized, solo, solitary, personalize, loner, standout, single, personal, sole, singularity
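To make the definition concrete, here is a minimal sketch of how such a ratio could be computed from per-year word counts. It is not the code linked from this post: the function name `communal_ratio`, the nested-dictionary input format, and the toy numbers are all assumptions made for illustration. Only the two word lists come from Twenge et al.

```python
# Word lists as given by Twenge et al. (quoted above).
COMMUNAL = [
    "communal", "community", "commune", "unity", "communitarian",
    "united", "teamwork", "team", "collective", "village",
    "tribe", "collectivization", "group", "collectivism", "everyone",
    "family", "share", "socialism", "tribal", "union",
]
INDIVIDUALISTIC = [
    "independent", "individual", "individually", "unique", "uniqueness",
    "self", "independence", "oneself", "soloist", "identity",
    "personalized", "solo", "solitary", "personalize", "loner",
    "standout", "single", "personal", "sole", "singularity",
]


def communal_ratio(counts_by_year, communal=COMMUNAL,
                   individualistic=INDIVIDUALISTIC):
    """Return {year: summed communal count / summed individualistic count}.

    counts_by_year is assumed to look like {year: {word: count}};
    words missing for a year are treated as zero.
    """
    ratio = {}
    for year, counts in sorted(counts_by_year.items()):
        c = sum(counts.get(w, 0) for w in communal)
        i = sum(counts.get(w, 0) for w in individualistic)
        if i > 0:  # skip years with no individualistic tokens at all
            ratio[year] = c / i
    return ratio


# Invented toy counts, NOT real Google Books numbers.
toy = {
    1960: {"team": 30, "family": 60, "self": 50, "single": 25},
    2008: {"team": 45, "family": 45, "self": 60, "single": 15},
}
print(communal_ratio(toy))
```

In the toy data the individual words move in different directions while the ratio stays the same in both years, which is the kind of pattern an aggregate ratio can hide or reveal depending on how you slice it.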
Now, Twenge et al. came to their conclusion on the basis of a different sort of analysis of the underlying time series data in question:
Individualistic words increased in use in American books between 1960 and 2008. The correlation between year and the sum of the 20 individualistic words was r(49) = .87, p<.001. The 20 individualistic words combined made up .096% of words in books published in 1960, and .115% of words in books published in 2008. With an SD of .0063, this is an increase of d = 3.02. […]
When both individualistic and communal words are included in a regression equation predicting year, only individualistic words are significant (Beta = .83, p<.001; for communal words, Beta = .05, ns).
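The effect size in that passage can be checked from the three numbers quoted. The sketch below assumes that d here means the 1960-to-2008 difference divided by the quoted SD; the function name `standardized_change` is mine, not theirs.

```python
def standardized_change(start, end, sd):
    """Difference between two values, expressed in units of sd."""
    return (end - start) / sd


# Values quoted from Twenge et al.: percent of all words in 1960 and 2008,
# and the SD they report.
d = standardized_change(0.096, 0.115, 0.0063)
print(round(d, 2))  # 3.02
```

Note that this figure is computed from the individualistic series alone, so by itself it says nothing about the communal-to-individualistic ratio plotted above.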
They did a separate analysis based on looking at the unigram data in terms of z-scores, and they tabulated and analyzed a set of "communal" and "individualistic" phrases as well. It's clear that they were looking for more evidence in favor of Prof. Twenge's "me generation" theme. I'll leave it as an exercise for the reader to sort out the actual relationship between their data and this conclusion; but in my opinion, the prima facie case is very weak for the conclusion that (as the USA Today story put it) "An analysis of words and phrases in more than 750,000 American books published in the past 50 years finds an emphasis on 'I' before 'we' — showing growing attention to the individual over the group".
It's more interesting, in my opinion, to ask what's going on in the part of the time series that Twenge et al. ignored, namely the section from 1890 to 1960. It's possible, of course, that the obvious trend is an artefact, caused by a change in the Google Books mix of genres over time, or some other relatively uninteresting factor. But as I noted yesterday, replication in an independent (and better-controlled) historical dataset shows a very similar overall pattern (though with all of the ratios shifted upwards, presumably due to an overall difference in genre mix):
If you really believed in this cultural index, you might tell a story about it like this one:
As a result of the Russian revolution and related historical trends, American interest in "communal" topics increased sharply through the period 1917-1945. During the period 1945-1965, several factors — the start of the cold war, the McCarthy period in the 1950s, and the individualistic aspects of the counterculture — all combined to bring this trend to a halt and even to reverse it slightly. Since then, various socio-cultural forces have pushed the index up and down, without in the end changing it much.
Do I believe in this index enough to endorse such interpretations? No, I don't. Do I think that a more careful examination of such data might lead to something interesting along these lines? Yes, I do.
Here are a few more graphs that may help to clarify the nature of this particular data set. The time series of raw (summed) frequency counts (I've included the incomplete 2009 data, which was omitted in my plots above and also in the analysis by Twenge et al.):
The blue line with 'C' plotting points represents the "communal" words, while the red line with 'I' plotting points represents the "individualistic" words. Here are the summed counts normalized by the total yearly word counts, and scaled to equal frequency per million words:
My time series data can be found in three files: TIdata (the yearly counts of "individualistic" words from 1890 to 2009), TCdata (the yearly counts of "communal" words from 1890 to 2009), AmEngUnigrams (the yearly total words, pages, and books from the Google Books American English collection). Twenge et al. have not made their numbers available, as far as I can tell. Note that the Google Books data does not collapse over case, so that e.g. "solo", "Solo", and "SOLO" are all distinct items. I counted only the all-lower-case versions of the words; they don't say what they did. If you want the rest of the data, it's available on the Google Books ngram web site.
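For readers who want to redo the normalization, here is a rough sketch of the two steps described above: keeping only the all-lower-case form of a word, and scaling summed counts to frequency per million words. The function names, the dictionary inputs, and the toy numbers are assumptions for illustration; the sketch does not parse the three files listed above, whose layout is not described here.

```python
def lowercase_only_count(variant_counts, word):
    """Count only the exact all-lower-case form of `word`.

    variant_counts is assumed to map each case variant to its count,
    e.g. {"solo": 120, "Solo": 30, "SOLO": 5}; capitalized variants
    are deliberately ignored.
    """
    return variant_counts.get(word.lower(), 0)


def per_million(summed_counts, yearly_totals):
    """Scale {year: summed count} by {year: total words} to per-million."""
    return {
        year: 1_000_000 * count / yearly_totals[year]
        for year, count in summed_counts.items()
        if yearly_totals.get(year)  # skip years with no (or zero) total
    }


# Invented toy numbers, NOT real Google Books counts.
print(lowercase_only_count({"solo": 120, "Solo": 30, "SOLO": 5}, "solo"))
print(per_million({1960: 960}, {1960: 1_000_000}))
```

As a sanity check on units, a share of .096% of all words corresponds to 960 words per million.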
Code is here.
Update — Following up on Erez's comment, I agree that an extended exploration of this example (and related examples) would be worth doing. Some obvious ideas:
- Add time-series data for a large set of related words, perhaps created by extending the Twenge sets through thesaurus links or LSA distance or distributional similarity.
- Use the larger dataset to explore various approaches to the aggregation problem, e.g. via functional principal components analysis or other dimensionality-reduction techniques.
- Compare the time series of the (extended?) "communal"/"individualistic" word sets with those of randomly-selected words with generally similar frequency profiles.
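As a starting point for the last idea, here is one crude way to pick comparison words. It matches on overall mean frequency only, which is a much weaker criterion than a similar frequency profile over time. The function name `matched_baseline`, the tolerance parameter, and the toy frequencies are all hypothetical.

```python
import random


def matched_baseline(target_freq, candidate_freqs, k, tolerance=0.25, seed=0):
    """Randomly pick up to k words whose mean frequency is close to target_freq.

    candidate_freqs is assumed to map word -> mean frequency (any unit,
    as long as target_freq uses the same one). A word qualifies if its
    frequency lies within +/- tolerance (as a fraction) of target_freq.
    """
    lo = target_freq * (1 - tolerance)
    hi = target_freq * (1 + tolerance)
    pool = sorted(w for w, f in candidate_freqs.items() if lo <= f <= hi)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(pool, min(k, len(pool)))


# Invented toy frequencies, NOT real data.
candidates = {"table": 100, "river": 110, "the": 50000, "zygote": 1}
print(matched_baseline(100, candidates, k=2))
```

Repeating the ratio calculation on many such randomly drawn word sets would give a rough sense of how much a ratio like this wanders for words that have nothing to do with the communal/individualistic contrast.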
No doubt readers who know something about time-series analysis will have other and better ideas. One helpful thing would be to make available code for a better general interface to datasets of this kind. The Google ngrams web site is cute, but offers graphs rather than numbers. The data is easily available, but it's large enough that simple approaches to computing over it are rather slow. My first attempt to create a simple perl interface to a database back-end ran aground on scale issues — so I'll try again when I have a chance, and will be happy to share the results. But if someone has already done it, or gets it done before I get back to it, I'll be happy.