The he's and she's of Twitter


My latest column for the Boston Globe is about some fascinating new research presented by Tyler Schnoebelen at the recent NWAV 41 conference at Indiana University Bloomington. Schnoebelen's paper, co-authored with Jacob Eisenstein and David Bamman, is entitled "Gender, styles, and social networks in Twitter" (abstract, full paper, presentation).

I first got to know the paper's co-authors last year when I was putting together a piece for the New York Times Sunday Review, "Twitterology: A New Science?" At the time, Eisenstein was at Carnegie Mellon University, where he had collaborated with Brendan O'Connor, Noah A. Smith, and Eric P. Xing on using Twitter to analyze dialectal variation in the American English lexis (see "A Latent Variable Model for Geographic Lexical Variation"). Eisenstein has since moved on to Georgia Tech's School of Interactive Computing, but he continues to work with CMU's computational linguists, including Bamman, now a PhD student there. Bamman's own interest in using Twitter as a megacorpus for the study of language variation goes back to his work on the Lexicalist project (which he wrote up as a guest post here).

As I described in a Language Log post last year ("On the front lines of Twitter linguistics"), Eisenstein and Bamman teamed up for further Twitterological studies with Schnoebelen, a sociolinguist who studied with Penelope Eckert at Stanford University. (His recently completed dissertation on language and emotion is available here.) At last year's NWAV, Schnoebelen gave an early indication of the usefulness of Twitter for sociolinguistic studies in his paper, "Affective patterns using words and emoticons on Twitter."

The fruits of the collaboration can be seen in the new paper on Twitter and gender, which combines sophisticated data-mining techniques (on a corpus of 9,212,118 tweets by 14,464 authors) with thoughtful sociolinguistic analysis. Researchers of a computational bent have previously tried to divine the gender of authors in large corpora of online texts based on differences in lexical use — see, for instance, "Effects of Age and Gender on Blogging," a 2006 paper by Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James Pennebaker (discussed by Mark Liberman here). But Schnoebelen, Eisenstein and Bamman go beyond that kind of predictive study to delve into the complexities of gender identity in a medium like Twitter. They find that there are many ways to "perform" gender, and that this performance very often relates to the social network one makes:

In general, the gender composition of the social networks of the members of each cluster tracks the gender composition of the cluster itself. For example, women in the sports-related clusters have far more male friends than average—though they still have fewer male friends than the male members of the cluster. Rather than revealing essential categories, these styles reflect an interplay of authors, audiences, and topics.

I hope that this research will encourage other partnerships between computational linguists and sociolinguists interested in taking advantage of "big data" megacorpora for studies of language variation. Focusing strictly on the computational side or on the social side just won't cut it in the new era of digital scholarship.


  1. K. said,

    November 6, 2012 @ 6:11 pm

    "The he's and she's of twitter"

    Pluralizing with apostrophe's on Language Log of all places?

    [(bgz) The do's and don'ts of pluralization are not so clear-cut when it comes to "words used as words." See "Preposterous Apostrophes II: Pluralization" on Gabe Doyle's Motivated Grammar blog. I prefer apostrophizing he's and she's (and no's and do's) because they just look wrong without apostrophes.]

  2. Adrian Morgan said,

    November 6, 2012 @ 7:48 pm

    One aspect of Twitter linguistics that I don't think I've seen discussed anywhere is how the 140 character format influences expectations re Grice's maxims.

    Here's what I'm thinking. Often, when condensing a reply to 140 characters, one of the first things to go is the clause establishing why you think the reply is relevant (e.g. "Speaking of linguistics…"). But not only is there no room to establish relevance; there is also a common understanding that nobody else has room to establish relevance either. This leads to a culture in which everyone accepts that apparent irrelevance is inevitable, and hence one in which the maxim of relevance is less of a barrier to communication than in other media.

    Any thoughts on this?

  3. Joe Green said,

    November 6, 2012 @ 9:49 pm

    The apparently dodgy apostrophes could simply have been avoided by using the (perhaps journalistically clichéd) "his and hers" instead.

  4. Katie Skeen said,

    November 7, 2012 @ 3:07 pm

    Hi Ben, I was surprised that you refer (in the BG column) to xoxo as "electronic shorthand." My grandmother (b. 1905), like many people, used xoxo all her life in the myriad letters and cards that she (hand)wrote. I think "shorthand" would be sufficient here.

    [(bgz) Point taken, but just because something is "electronic shorthand" doesn't mean it's only electronic. I'm well aware of the pre-electronic history of "xo…" — in fact, I've traced it back to a 1905 court case. Note, too, that OMG dates back to 1917.]

  5. David Morris said,

    November 8, 2012 @ 6:47 pm

    Do people *say* "/oh em gee/" or "Oh my God"? The free commuter paper on Sydney's trains has a section where people submit things they've overheard in public, and the submissions are full of people recorded as saying "OMG, …". I was thinking about this while walking to the station yesterday afternoon when, right behind me, someone said "Oh my God, …".

  6. Dan M. said,

    December 8, 2012 @ 8:42 pm

    As to "OMG", I've certainly heard people say "Zoh my god what the fuck barbecue!", which is a direct result of text "ZOMGWTFBBQ", and I'm fairly certain I've heard people self-censor "God" by saying "oh em gee".
