Jean M. Twenge, W. Keith Campbell and Brittany Gentile, "Male and Female Pronoun Use in U.S. Books Reflects Women’s Status, 1900–2008", Sex Roles published online 8/7/2012. The abstract:
The status of women in the United States varied considerably during the 20th century, with increases 1900–1945, decreases 1946–1967, and considerable increases after 1968. We examined whether changes in written language, especially the ratio of male to female pronouns, reflected these trends in status in the full text of nearly 1.2 million U.S. books 1900–2008 from the Google Books database. Male pronouns included he, him, his, himself and female pronouns included she, her, hers, and herself. Between 1900 and 1945, 3.5 male pronouns appeared for every female pronoun, increasing to 4.5 male pronouns during the postwar era of the 1950s and early 1960s. After 1968, the ratio dropped precipitously, reaching 2 male pronouns per female pronoun by the 2000s. From 1968 to 2008, the use of male pronouns decreased as female pronouns increased. The gender pronoun ratio was significantly correlated with indicators of U.S. women’s status such as educational attainment, labor force participation, and age at first marriage as well as women’s assertiveness, a personality trait linked to status. Books used relatively more female pronouns when women’s status was high and fewer when it was low. The results suggest that cultural products such as books mirror U.S. women’s status and changing trends in gender equality over the generations.
Here's their plot of the ratio between the frequencies of male (he, him, his, himself) and female (she, her, hers, and herself) third-person singular pronouns:
As is all too usual these days in scientific publications, alas, Twenge et al. don't give us the complete recipe for their work: for example, they don't tell us how they treated letter case (whether capitalized forms such as "He" were counted together with "he"). Modulo this uncertainty, here's my replication of the same ratio from the same source:
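To make the letter-case question concrete, here is a minimal Python sketch of how such a ratio could be computed from per-year word counts. It is not the code behind either plot. The function name `pronoun_ratio_by_year`, the `(word, year, count)` row layout, and the toy counts are all assumptions made up for illustration; only the two pronoun lists come from the abstract.

```python
from collections import defaultdict

# Pronoun sets as listed in the Twenge et al. abstract.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_ratio_by_year(rows, fold_case=True):
    """Return {year: male_count / female_count}.

    rows: iterable of (word, year, count) tuples (a hypothetical layout).
    fold_case: if True, "He" and "he" are counted together.
    """
    male = defaultdict(int)
    female = defaultdict(int)
    for word, year, count in rows:
        key = word.lower() if fold_case else word
        if key in MALE:
            male[year] += count
        elif key in FEMALE:
            female[year] += count
    # Skip years with no female pronouns to avoid dividing by zero.
    return {y: male[y] / female[y] for y in sorted(male) if female.get(y)}

# Invented toy counts, NOT real Google Books figures.
toy_rows = [
    ("he", 1950, 90), ("He", 1950, 30), ("his", 1950, 60),
    ("she", 1950, 25), ("She", 1950, 5), ("her", 1950, 20),
]
folded = pronoun_ratio_by_year(toy_rows, fold_case=True)
unfolded = pronoun_ratio_by_year(toy_rows, fold_case=False)
print(folded[1950], unfolded[1950])
```

With these invented counts the ratio is 3.6 when case is folded and about 3.33 when it is not, which is why the unstated choice matters for anyone trying to reproduce the exact numbers.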
As I observed in an earlier post, apparent historical trends of this kind in word frequencies or in ratios of word frequencies can have several qualitatively different sorts of explanations:
- The mix of kinds of books published changes over time (e.g. more romance novels, fewer collections of sermons); different kinds of books use words differently; therefore the relative frequency of words changes.
- The mix of kinds of books selected for the Google Books ngram collections changes over time; so the relative frequency of words changes, for reasons similar to those in the first case.
- The distribution of concepts or conceptual frames changes over time, even in the same sorts of books.
- The choice of words to express a given concept (in published books) changes over time, even in the same sorts of books.
David Brown ("Gender Pronouns in the News", Grammar Lab 8/12/2012) tries to address part of this uncertainty by calculating a similar ratio (of gendered pronouns) using data from an independent source, namely the Corpus of Historical American English. David looked only at he and she, rather than the full set of masculine and feminine third-person singular pronouns, but this turns out not to make much difference. Here's a plot showing he/she for both the Google Books American English ngrams and COHA:
As David notes, these time-series are "similar, though slightly different". But at least they show a qualitatively similar pattern of rises and falls — the difference is probably due to a different mix of types of material. The qualitative similarity of the time-series of ratios should help to reassure us that the overall rise and fall is not due to some artefactual change in the sampling of book genres in the Google Books American English ngram collection.
Another sort of replication would look at time-series of different words that relate to the same concepts — here's man+men versus woman+women, which shows an even bigger post-1960 fall, but a less clear pattern before that. This should help reassure us that the post-1960 fall probably does have something to do with changes in the frequency of references to the underlying concept:
As with the pronoun data, the change is driven more by a fall in male (or nominally sex-neutral?) references than by a rise in female ones:
And there appear to be blips in male references associated with the two world wars and with the Vietnam war, suggesting that one factor in the trend might be changes in the overall amount of writing about war.
Anyhow, I'm glad to see that Twenge et al. start their paper by showing a convincing graph. Another recent work by the same authors didn't do that — and for good reason, I argued, because the corresponding graphs tend to undermine the point that they wanted to make (see "Textual narcissism", 7/13/2012; "Textual narcissism, replication 2", 7/14/2012; "What does this graph mean?", 7/13/2012; "It's all about who?", 7/31/2012).
And as I pointed out in those posts, one problem with this type of research is that almost everywhere you look in these time series of word frequencies and ratios of word frequencies, you find striking patterns.
Thus since the late 1960s, American books have apparently seen a large increase in the frequency of second-person vs. first-person-singular pronouns. Does this reflect the rise of the "you generation", interested more in others than in themselves? Not according to the Conventional Wisdom, which might prefer to note that over the same time period, the ratio of first-person-singular to first-person-plural pronouns has increased (though not by as large a factor), supporting the received opinion that Kids Today are getting more and more self-obsessed:
And if we pick some more-or-less random conceptual comparison (say day vs. week or year vs. decade), we're also likely to see significant (and perhaps even strikingly large) historical trends:
It's often easy to think of possible explanations — in this case, perhaps our culture is increasingly interested in longer periods of time?
But there's a serious problem looming down this road.
There are lots of English words; an even larger number of sets of words arguably related to one another in some conceptual way; and a larger-still number of plausible comparisons of pairs of such sets of words. This generates a really, really big space of possible quantitative comparisons of time series; experiments in this space are really, really easy to do; and the numbers are large enough that almost all differences are statistically "significant".
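The last point can be illustrated with a small Python sketch. It uses a standard pooled two-proportion z-test, chosen here only as a convenient example and not as a test that anyone cited above is known to have used. The helper name `two_proportion_z` and all of the counts are invented for illustration. The same 1% relative difference in a word's frequency is tested once with samples of a hundred thousand tokens and once with samples of a billion.

```python
import math

def two_proportion_z(count_a, total_a, count_b, total_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: a word at 0.5% of tokens vs. 0.505% of tokens.
z_small, p_small = two_proportion_z(500, 100_000, 505, 100_000)
z_large, p_large = two_proportion_z(
    5_000_000, 1_000_000_000, 5_050_000, 1_000_000_000
)
print(f"small samples: z = {z_small:.2f}, p = {p_small:.3g}")
print(f"large samples: z = {z_large:.2f}, p = {p_large:.3g}")
```

Under these assumptions the small samples give a z-score well below 1, while the billion-token samples give a z-score above 15 for exactly the same relative difference.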
This doesn't mean that the results of such experiments are uninteresting or misleading. But it's just about the worst imaginable sort of situation from the point of view of publication bias. So be careful out there.