Andy Schwartz and others at the World Well-Being Project have worked with "Facebook posts from over 75,000 volunteers who also took the standard International Personality Item Pool (IPIP) personality test to measure the 'Big Five' personality traits", looking for linguistic features that correlate with the aspects of personality measured by that test.
Lyle Ungar talked about this work a few days ago (Andy was unfortunately out of town), for an audience of mostly first-year undergraduates. The venue was a weekly event, Dinners With Interesting People, held in the Quad, an undergraduate residence here at Penn.
This year, the DWIP talks (though still open to the public) are integrated into a Freshman Seminar called "The Landscape of Research and Innovation at Penn". The idea is to give the participants a general idea of what kinds of research go on around here, and how they might get involved. As part of the course, I've asked DWIP guests to provide a dataset that we can use as part of a course assignment in quantitative analysis. Since the students have widely varied backgrounds in mathematics, statistics, and programming, and since the quantitative analysis part of the course is only one of several aspects, the assignments start with an R script that does something interesting, with the assigned task being to modify the script to do something a bit different.
In this case, Andy was kind enough to give me a table indicating the number of posts and the token counts for each "word" in their Facebook dataset, for males and females of each age. Inspired by Jamie Pennebaker's The Secret Life of Pronouns, I decided to focus the quantitative analysis assignment on pronoun usage. The body of this post lays out some of the things that I've noticed in setting the assignment up.
I first looked at overall word counts by sex and age:
The bump at age=42 is due to default age assignment of 40 at the start of the study, so all age=42 data is thrown out in what follows.
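For readers who want to follow along, here is a minimal Python sketch of that filtering step, together with one plausible way of computing a per-group pronoun frequency. This is not the course's R script. The row layout `(word, sex, age, token_count)`, the made-up numbers, and the choice to normalize by each group's total token count are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (word, sex, age, token_count).
# The numbers are invented for illustration; they are not from the dataset.
ROWS = [
    ("i", "F", 20, 50), ("me", "F", 20, 10), ("the", "F", 20, 140),
    ("i", "M", 20, 30), ("my", "M", 20, 10), ("the", "M", 20, 160),
    ("i", "F", 42, 999), ("the", "F", 42, 1),  # default-age group, to be dropped
]

FPS_PRONOUNS = {"i", "me", "my", "mine"}
EXCLUDED_AGES = {42}  # the age carrying the default-assignment bump


def pronoun_frequency(rows, pronouns, excluded_ages=EXCLUDED_AGES):
    """Return {(sex, age): pronoun tokens / all tokens} for each kept group.

    Assumption: frequency is normalized by the group's total token count.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for word, sex, age, count in rows:
        if age in excluded_ages:
            continue  # throw out the suspect age entirely
        key = (sex, age)
        totals[key] += count
        if word.lower() in pronouns:
            hits[key] += count
    return {key: hits[key] / totals[key] for key in totals if totals[key] > 0}


freq = pronoun_frequency(ROWS, FPS_PRONOUNS)
for (sex, age), value in sorted(freq.items()):
    print(sex, age, round(value, 3))
```

Passing a different pronoun set (say, the second-person forms) reuses the same function without other changes.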
For the remaining ages, here are the frequencies by sex and age for (the sum of) "I", "me", "my", "mine", where frequencies for female writers are the red Fs, while male writers' frequencies are the blue Ms:
The overall frequency of first-person singular (FPS) pronoun usage decreases with age, and at every age, female writers use FPS pronouns more than male writers. The blip at age 14 is due to small sample size; the age=14 data should probably be removed as well, though I haven't done that in today's plots. The increasingly noisy data at more advanced ages is clearly also due to small sample sizes. Some smoothing would no doubt help, which again I haven't done for today's exercise.
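One simple form that such smoothing could take is to pool counts over a small window of neighboring ages, so that sparsely sampled ages borrow strength from their neighbors. The Python sketch below is only an illustration of that idea, not something applied to the plots in this post; the function name, the window size, and the numbers are hypothetical.

```python
def pooled_smooth(hits, totals, half_window=2):
    """Smooth per-age frequencies by pooling counts across nearby ages.

    hits and totals map age -> token count for one sex. For each age, the
    pronoun counts and total counts of all ages within half_window years
    are summed before dividing, so small groups are weighted less.
    """
    smoothed = {}
    for age in sorted(totals):
        window = [a for a in totals if abs(a - age) <= half_window]
        pooled_total = sum(totals[a] for a in window)
        if pooled_total > 0:
            pooled_hits = sum(hits.get(a, 0) for a in window)
            smoothed[age] = pooled_hits / pooled_total
    return smoothed


# Invented counts for a few sparsely sampled ages.
hits = {60: 5, 61: 0, 62: 10, 63: 1, 64: 4}
totals = {60: 100, 61: 20, 62: 100, 63: 30, 64: 50}

smoothed = pooled_smooth(hits, totals, half_window=1)
for age in sorted(smoothed):
    print(age, round(smoothed[age], 4))
```

Pooling counts rather than averaging the per-age ratios keeps an age with only a handful of tokens from swinging the curve.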
Here are the frequencies for "you", "your", "yours":
Here it seems that female and male usage declines, in sync and at comparable values, to the age of 30, and then rises, with females increasingly outstripping males.
Here are the frequencies for "we", "us", "our", "ours":
Here female and male Facebookers are just about the same, reaching a wee we-minimum in the late teens, and then rising steadily for the rest of the life cycle.
Here are the frequencies for "she", "her", "hers":
Female Facebookers refer to other females at rates that increase roughly to the age of 40. At all ages, male Facebookers refer to females at much lower rates.
Here are the frequencies for "he", "him", "his":
Ignoring the blip for the age 14 data, references to males by both male and female posters increase in frequency to about the age of 40, with males slightly ahead of females through age 30 or so.
Here are the frequencies for "it", "its":
Not much going on with "it", except for a slight male advantage in the late teens and early twenties, and the usual noise among the elderly due to small sample sizes.
Here are the frequencies for "they", "them", "their", "theirs":
We see steady growth in third-person plural frequency with increasing age, especially steeply during the period from about age 25 to age 45.
Reprising some of the same numbers in different combinations, we see that male writers show increasing frequency of references to other males with increasing age, at least to age 40 or so, but more-or-less steady low rates of pronominal reference to individual females at all ages:
The comparable plot for female writers shows a very different pattern — the rate of pronominal reference to both males and females also increases through age 40 or so, but at all ages, the rates for pronominal references to males and females are fairly close. Before age 30, female references are slightly more frequent, and after 40, male references are a bit commoner. But overall, the referential egalitarianism is in striking contrast to the pattern for male writers:
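The contrast between the two groups of writers can be boiled down to a single number per sex and age: the ratio of "she/her/hers" tokens to "he/him/his" tokens. The Python sketch below shows one way to compute it, assuming the same hypothetical `(word, sex, age, token_count)` row layout; the counts are invented and are not the dataset's values.

```python
from collections import defaultdict

SHE_FORMS = {"she", "her", "hers"}
HE_FORMS = {"he", "him", "his"}


def she_to_he_ratio(rows):
    """Return {(writer_sex, age): she-family tokens / he-family tokens}.

    Both counts come from the same group of writers, so the group's total
    token count cancels and is not needed. Groups with no he-family
    tokens are skipped to avoid dividing by zero.
    """
    she = defaultdict(int)
    he = defaultdict(int)
    for word, sex, age, count in rows:
        key = (sex, age)
        if word.lower() in SHE_FORMS:
            she[key] += count
        elif word.lower() in HE_FORMS:
            he[key] += count
    return {key: she[key] / he[key] for key in he if he[key] > 0}


# Invented counts for one age, for illustration only.
rows = [
    ("she", "F", 25, 12), ("her", "F", 25, 8),
    ("he", "F", 25, 16), ("him", "F", 25, 4),
    ("she", "M", 25, 5),
    ("he", "M", 25, 15), ("his", "M", 25, 5),
]

ratios = she_to_he_ratio(rows)
for (sex, age), value in sorted(ratios.items()):
    print(sex, age, value)
```

A ratio near 1 corresponds to the roughly even pattern described for female writers, and a ratio well below 1 to the pattern described for male writers.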
One note: It would be a boon to pronominology if the folks at Facebook would release similar data based on a larger sample of their ~1.3 billion users…