My latest column for the Boston Globe is about some fascinating new research presented by Tyler Schnoebelen at the recent NWAV 41 conference at Indiana University Bloomington. Schnoebelen's paper, co-authored with Jacob Eisenstein and David Bamman, is entitled "Gender, styles, and social networks in Twitter" (abstract, full paper, presentation).
I first got to know the paper's co-authors last year when I was putting together a piece for the New York Times Sunday Review, "Twitterology: A New Science?" At the time, Eisenstein was at Carnegie Mellon University, where he had collaborated with Brendan O'Connor, Noah A. Smith, and Eric P. Xing on using Twitter to analyze dialectal variation in the American English lexis (see "A Latent Variable Model for Geographic Lexical Variation"). Eisenstein has since moved on to Georgia Tech's School of Interactive Computing, but he continues to work with CMU's computational linguists, including Bamman, now a PhD student there. Bamman's own interest in using Twitter as a megacorpus for the study of language variation goes back to his work on the Lexicalist project (which he wrote up as a guest post here).
As I described in a Language Log post last year ("On the front lines of Twitter linguistics"), Eisenstein and Bamman teamed up for further Twitterological studies with Schnoebelen, a sociolinguist who studied with Penelope Eckert at Stanford University. (His recently completed dissertation on language and emotion is available here.) At last year's NWAV, Schnoebelen gave an early indication of the usefulness of Twitter for sociolinguistic studies in his paper, "Affective patterns using words and emoticons on Twitter."
The fruits of the collaboration can be seen in the new paper on Twitter and gender, which combines sophisticated data-mining techniques (on a corpus of 9,212,118 tweets by 14,464 authors) with thoughtful sociolinguistic analysis. Researchers of a computational bent have previously tried to divine the gender of authors in large corpora of online texts based on differences in lexical use; see, for instance, "Effects of Age and Gender on Blogging," a 2006 paper by Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James Pennebaker (discussed by Mark Liberman here). But Schnoebelen, Eisenstein and Bamman go beyond that kind of predictive study to delve into the complexities of gender identity in a medium like Twitter. They find that there are many ways to "perform" gender, and that this performance very often relates to the social network one builds:
In general, the gender composition of the social networks of the members of each cluster tracks the gender composition of the cluster itself. For example, women in the sports-related clusters have far more male friends than average—though they still have fewer male friends than the male members of the cluster. Rather than revealing essential categories, these styles reflect an interplay of authors, audiences, and topics.
I hope that this research will encourage other partnerships between computational linguists and sociolinguists interested in taking advantage of "big data" megacorpora for studies of language variation. Focusing strictly on the computational side or on the social side just won't cut it in the new era of digital scholarship.