I have a piece in today's New York Times Sunday Review section, "Twitterology: A New Science?" In the limited space I had, I tried to give a taste of what research is currently out there using Twitter to build various types of linguistic corpora. Obviously, there's a lot more that could be said about these projects and other fascinating ones currently underway. Herewith a few notes.
- Fellow Language Logger David Beaver and research assistants Joey Frazee and Christopher Brown at the University of Texas were extremely generous with their time and energy when I asked for some insta-analysis of tweets from Libya after the news broke of Qaddafi's death. Since then, they've been connecting up their new analysis with the work they did on tweets from Libya earlier in the year. But I'll let David talk more about this research, and how it fits into the larger project, "Modeling Discourse and Social Dynamics in Authoritarian Regimes." [Update: David follows up here.]
- Twitter-based sentiment analysis first got some attention a couple of years ago when James Pennebaker, Roger Booth, Teal Pennebaker, and Chris Wilson created the website AnalyzeWords, which provides on-the-fly analysis of a person's Twitter feed by using the text analysis program Linguistic Inquiry and Word Count (LIWC). The work of Pennebaker and his colleagues with LIWC has been discussed on Language Log in the past (here, here, here, here, and here). LIWC is also explored in great detail in Pennebaker's book The Secret Life of Pronouns, which I reviewed for The New York Times Book Review. (The book opens with some discussion of Twitter and AnalyzeWords, but it goes on to consider a wide array of corpora analyzed with LIWC.)
- My look at dialectal variation on Twitter was based on work done by the Carnegie Mellon researchers Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. You can check out their EMNLP 2010 paper, "A Latent Variable Model for Geographic Lexical Variation," and the slides from their LSA 2011 presentation, "Statistical Exploration of Geographical Lexical Variation in Social Media." Eisenstein is headed to a teaching job at Georgia Tech's School of Interactive Computing, but Twitter-based studies at CMU are sure to continue. A new addition to Carnegie Mellon's stable of Twitterologists is David Bamman, whom Language Log readers may know from the Lexicalist project he undertook to create map visualizations of American English variation on Twitter. (See here for a guest post by Bamman about Lexicalist.)
- Eisenstein and Bamman are currently conducting research with Tyler Schnoebelen of Stanford University that looks at how gender plays a role in language variation on Twitter. But they're going well beyond simply analyzing which language forms are associated with women and which are associated with men. Using information on people's Twitter followers, they can also take into consideration the gender makeup of people's networks. Thus, a man with a predominantly female network may show different linguistic patterns compared to a man with a male or mixed network. Earlier today, at NWAV 40, Schnoebelen presented some of his research on one aspect of Twitter discourse, emoticons. The abstract of his paper includes this great line: "Emoticons with noses are historically older." It's true! Not only that, but emoticons with noses, like :-), show distinctly different patterns of distribution than the noseless kind, like :) . Noseless emoticons tend to be used by younger Twitter users and are associated with more informal discourse. Women use them more than men, too, but women use more of all types of emoticons. I'll be looking forward to the definitive study of emoticon nosedness. [Update: Slides from Schnoebelen's NWAV talk are here.]
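For readers curious how one might tally nosed versus noseless emoticons in a batch of tweets, here is a minimal Python sketch. It is not Schnoebelen's method: the regular expression, the function name count_nosedness, and the sample tweets are all assumptions made for illustration, and the pattern covers only a handful of common emoticons such as :-) and :).

```python
import re
from collections import Counter

# Hypothetical pattern: eyes (: ; =), an optional "-" nose, then a mouth.
# The lookbehind skips matches glued to a preceding word character,
# so the ":" in "http://" is never treated as a pair of eyes.
EMOTICON = re.compile(r"(?<!\w)[:;=](-?)[)(DPp]")


def count_nosedness(tweets):
    """Count emoticons with and without a nose across an iterable of strings."""
    counts = Counter()
    for text in tweets:
        for match in EMOTICON.finditer(text):
            # Group 1 holds "-" when a nose is present, "" otherwise.
            counts["nosed" if match.group(1) else "noseless"] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample tweets, for illustration only.
    sample = [
        "Great talk :-)",
        "lol :) :D",
        "no emoticons here",
        "see http://example.com ;-)",
    ]
    print(count_nosedness(sample))
```

A real study would need a much broader emoticon inventory, plus user metadata (age, gender, network) to say anything about who uses which kind; this sketch only shows the counting step.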