Andy Schwartz and others at the World Well-Being Project have worked with "Facebook posts from over 75,000 volunteers who also took the standard International Personality Item Pool (IPIP) personality test to measure the 'Big Five' personality traits", looking for linguistic features that correlate with the aspects of personality that the test measures.
Lyle Ungar talked about this work a few days ago (Andy was unfortunately out of town), for an audience of mostly first-year undergraduates. The venue was a weekly event, Dinners With Interesting People, held in the Quad, an undergraduate residence here at Penn.
This year, the DWIP talks (though still open to the public) are integrated into a Freshman Seminar called "The Landscape of Research and Innovation at Penn". The aim is to give the participants a general sense of what kinds of research go on around here, and how they might get involved. As part of the course, I've asked DWIP guests to provide a dataset that we can use in a course assignment in quantitative analysis. Since the students have widely varied backgrounds in mathematics, statistics, and programming, and since quantitative analysis is only one of several aspects of the course, each assignment starts with an R script that does something interesting, and the assigned task is to modify the script to do something a bit different.
In this case, Andy was kind enough to give me a table from their Facebook dataset giving the number of posts and the token counts for each "word", for males and females of each age. Inspired by Jamie Pennebaker's The Secret Life of Pronouns, I decided to focus the quantitative analysis assignment on pronoun usage. The body of this post lays out some of the things that I've noticed in setting the assignment up.