It's a strange fact about social scientists that hardly any of them, in recent years, have paid any analytic attention to language, which is the main medium of human social interaction. At schools of "communication", you'll generally find that neither the curriculum nor the faculty's research publications feature much if any analysis of speech and language. In other disciplines — sociology, social psychology, economics, history — you'll find even less of it. (The main systematic exception, Linguistic Anthropology, deserves a separate discussion — but the conclusion of such a discussion, I believe, would note a steep decline in empirical linguistic analysis. And of course I'm leaving out sociolinguistics, which is healthy enough but largely alienated from the rest of the social sciences.)
There are notable exceptions of several kinds, such as the work of Erving Goffman, Manny Schegloff, or Jamie Pennebaker. But such work emphasizes the paradox, since it shows that we can't blame the neglect on a lack of intellectual opportunity.
It's not only in the social sciences that linguistic anemia is evident, of course. Over the past generation, the amount of language-related teaching and research in "language departments" (including departments of English) has declined to an unprecedented level. It's common to find highly-ranked English departments where neither undergraduates nor graduate students are trained in any sort of linguistic analysis at all, except perhaps by accident (see this earlier post for a more specific discussion).
But climate change is coming, in my opinion. And in this case, the driving force is not carbon emissions, but digital technology.
To state the obvious: Traditional mass media are now nearly all digital; new media are documenting (and creating) social interactions at extraordinary scale and depth; more and more historical records are available in digital form. The digital shadow-universe is a more and more complete proxy for the real one. And in the areas that matter to the social sciences, much of the content of this digital universe exists in the form of digital text and speech.
A future social scientist who wants to use this proxy universe to learn about the real one had therefore better know how to analyze the form and meaning of large digital archives of text and speech. And future social scientists who choose not to do this will work under a significant competitive disadvantage. (Numerical data, video recordings, and various kinds of relationship graphs are of course important too, but without analysis of speech and text, their value is lower.)
The required tools include a good deal of computer science and statistics, but you also need to know what to program and what to model. As a result, the basic concepts and skills of speech and text analysis are an important part of the future social science tool kit.
There's an increasing amount of research along these lines, mostly by computer scientists and computational linguists, along with a few rogue social scientists like Jamie Pennebaker. We've blogged about quite a few examples over the years. But I suspect that most social scientists don't see most of this stuff, because it appears in conference proceedings and journals that they don't read.
All the same, change is sure to come. I predict that over the next 20 years or so, this work will go mainstream. (I know that 20 years in internet time is a millennium or two, but Academia is culturally conservative to a degree that would turn Pashtun village elders green with envy.)
One symptom (and cause) of corpus-based social science going mainstream is that individual pieces of research will increasingly break out into the old media (or go viral in new media). This happened a few days ago to Peter Sheridan Dodds and Christopher M. Danforth, whose paper "Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and Presidents" (Journal of Happiness Studies, published online 7/17/2009) was covered in the New York Times (Benedict Carey, "Does a Nation's Mood Lurk in Its Songs and Blogs?", 8/3/2009).
Here's the paper's abstract:
The importance of quantifying the nature and intensity of emotional states at the level of populations is evident: we would like to know how, when, and why individuals feel as they do if we wish, for example, to better construct public policy, build more successful organizations, and, from a scientific perspective, more fully understand economic and social phenomena. Here, by incorporating direct human assessment of words, we quantify happiness levels on a continuous scale for a diverse set of large-scale texts: song titles and lyrics, weblogs, and State of the Union addresses. Our method is transparent, improvable, capable of rapidly processing Web-scale texts, and moves beyond approaches based on coarse categorization. Among a number of observations, we find that the happiness of song lyrics trends downward from the 1960s to the mid 1990s while remaining stable within genres, and that the happiness of blogs has steadily increased from 2005 to 2009, exhibiting a striking rise and fall with blogger age and distance from the Earth’s equator.
Here's the figure showing the secular trend in song-lyric happiness:
Here's the figure showing the recent trend in emotional valence estimated from aspects of blog posts:
And finally, the effects of age, latitude, and day of the week (phase of the moon is not pictured):
Like most work of this type, the linguistic analysis involved is pretty simple — but it's still more than you'll now find in the collected works of the faculty of the communications schools that I've looked at.
And you could raise various questions about their methods and their conclusions, as always in science (though the work seems basically sound to me). But the nice thing about this kind of research is that all of their data is published — their paper gives the URLs that they got it from. (In fact, they doubtless undertook this study in large part because the basic data is easily available.) And they could easily publish their code as well (though the algorithms seem simple and easy to replicate).
So if you have an idea about how to qualify, modify or extend their findings, go to it!
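As a starting point, here is a minimal sketch of the general idea in the abstract: score a text by averaging human-assigned ratings of the words it contains, weighted by how often each word occurs. This is one reading of the abstract, not the authors' code. The `TOY_VALENCE` table, its 1-to-9 scale, its scores, and the function names are all invented for illustration; a real replication would substitute a published set of human word ratings and the tokenization choices described in the paper.

```python
import re
from collections import Counter

# Hypothetical ratings on a 1-9 scale, made up for this example only.
# They are NOT taken from the paper or from any published word list.
TOY_VALENCE = {"love": 8.5, "happy": 8.0, "rain": 5.0, "alone": 2.5, "cry": 2.0}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def mean_valence(text, valence=TOY_VALENCE):
    """Frequency-weighted average rating of the rated words in `text`.

    Words without a rating are ignored. Returns None when the text
    contains no rated words, so callers can tell "no evidence" apart
    from a neutral score.
    """
    counts = Counter(w for w in tokenize(text) if w in valence)
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(valence[w] * n for w, n in counts.items()) / total


if __name__ == "__main__":
    for line in ["Happy, happy rain!", "Alone I cry", "Nothing rated here"]:
        print(repr(line), mean_valence(line))
```

Returning `None` for texts with no rated words is a design choice in this sketch: with a small word list, many short texts will contain no rated words at all, and treating them as neutral would bias any trend estimated from them.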
[I'll note in passing that linguistics was left out of the publicity in this case: thus the NYT article quotes Prof. Pennebaker to the effect that “The new approach that these researchers are taking is part of movement that is really exciting, a cross-pollination of computer science, engineering and psychology. [...] And it’s going to change the social sciences; that to me is very clear.” From Jamie's mouth to God's ear; but let's recognize that this type of work will not reach its full potential unless the researchers involved also understand something about how speech and language work.]