Intensives over time
In their new book Sense and Sensitivity, Brady Clark and LL's own David Beaver identify and discuss a class of intensives. The items they name are (most) importantly, significantly, especially, really, truly, fucking, damn, well, and totally. Here's one of their examples:
MTV like totally gave us TWO episodes back to back. It was like so random. The more the merrier, but it's like waay too much for one recap.
I'm intrigued by the classification and independently interested in some of the words and phrases involved, so I went looking in a large weblog corpus I recently collected, to see if I could gain some new insights into where and why people use these things. This post describes a first experiment along these lines.
The weblog corpus consists of the whole of Eschaton, including all the comments, and the whole of Talking Points Memo, TPM Café, and TPM Election Central, including all the comments. (I heuristically filtered the spam, erring on the side of removing real comments.) In all, this amounts to 54,074 posts (many with comments) and 218,607,178 words. The posts and comments are tagged with the date and the author.
Except for the metadata I just mentioned, the texts are not annotated. However, we know a lot about what these weblogs are like, so we can make educated guesses about what was being discussed on a given date and also how it was being discussed. For example, it's clear what was mainly under discussion during October and November 2008: the election. And, unlike in 2004, the mood was upbeat.
I calculated monthly frequencies for the entire vocabulary of words with at least 150 tokens, and then started exploring these distributions, looking especially at words with highly correlated distributions over time. (Two technical notes. (1) The frequency of word W in month M is the number of tokens of W in M divided by the total number of tokens used in M; thanks to Mark Graham for pointing out the unclarity. (2) For the correlations, I used R's cor() function with method="pearson".)
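For readers who want to try this on their own corpus, here is a minimal Python sketch of the two technical notes above: relative monthly frequencies, followed by a Pearson correlation between two words' monthly series. It is not the code used for this post (that was done in R). The function names, the `(month, tokens)` input format, and the assumption that tokens are already lowercased are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def monthly_frequencies(docs, min_tokens=150):
    """Relative monthly frequencies for every sufficiently common word.

    docs: iterable of (month, tokens) pairs, e.g. ("2008-10", ["really", ...]).
    This input format is an assumption made for the sketch.
    Returns (months, freqs), where freqs[word] lists one value per month.
    """
    counts = defaultdict(Counter)  # month -> word counts for that month
    for month, tokens in docs:
        counts[month].update(tokens)

    months = sorted(counts)
    totals = {m: sum(counts[m].values()) for m in months}

    # Keep only words with at least min_tokens tokens in the whole corpus.
    overall = Counter()
    for m in months:
        overall.update(counts[m])
    vocab = [w for w, n in overall.items() if n >= min_tokens]

    # Frequency of W in M = tokens of W in M / all tokens in M.
    freqs = {
        w: [counts[m][w] / totals[m] if totals[m] else 0.0 for m in months]
        for w in vocab
    }
    return months, freqs


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

With real data, something like `pearson(freqs["really"], freqs["totally"])` would give the kind of figure reported in this post. Because each month's counts are divided by that month's total, a busy month does not by itself push every word's frequency up.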
Many of the word pairs with strong correlations are collocations of one kind or another. For example, barack and obama have a 97% correlation, but this is largely because of how reference works, and have and been have a 96% correlation due to the nature of the English auxiliary system.
For my purposes, the interesting correlations are the ones that don't trace to grammatical dependencies. Beaver and Clark's intensives are great for this: they have few interdependencies, so speakers are likely either to choose one over the other (to suit their communicative needs) or else to pile them up for additional intensity (really and truly amazing, totally fucking amazing).
It turns out that the intensives are generally pretty well correlated over time. For example, really and totally have a 94% correlation. The first is significantly more frequent than the second (206,434 tokens for really, 19,101 for totally), but their frequencies ebb and flow together. I think we can rule out the idea that this correlation is due to a grammatical dependency. Just 2% of the occurrences of totally are in the same sentence as really. (In contrast, 74% of the occurrences of barack are in the same sentence as obama.)
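The same-sentence check can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used for the figures above: it assumes the text has already been split into sentences, and the regular-expression tokenizer and the function name `cooccurrence_share` are hypothetical.

```python
import re


def cooccurrence_share(sentences, target, other):
    """Share of the tokens of `target` found in a sentence containing `other`.

    sentences: iterable of sentence strings (sentence splitting is assumed
    to have happened already). Tokenization here is a crude lowercase
    regex match, which is an assumption of this sketch.
    """
    total = 0     # all tokens of `target`
    together = 0  # tokens of `target` whose sentence also has `other`
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        n = tokens.count(target)
        total += n
        if other in tokens:
            together += n
    return together / total if total else 0.0
```

A low share for a pair like totally and really, next to a high share for a pair like barack and obama, is what separates a correlation over time from a plain collocation.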
Here's a visualization (click to enlarge and elongate):
Unfortunately, this figure puts the two lines too far apart to bring out the relationship clearly. This isn't my area of expertise, so I am not sure of the best alternative, but the following scheme seems to work well: I simply laid one distribution atop the other and removed the numbers on the y-axis:
Here, I think one can see the correlation well. (Please post a comment if you think this is misleading, or if you know of a better method.)
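One way to make the overlay explicit is to rescale both series before plotting. The Python sketch below shows two common options, min-max scaling and peak normalization (the latter is also suggested in the comments). It is not necessarily what was done for the figure above, and the function names are hypothetical.

```python
def min_max_scale(series):
    """Rescale a series so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A flat series has no spread to rescale.
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


def peak_normalize(series):
    """Divide a series by its maximum, so the highest point becomes 1."""
    peak = max(series)
    if peak == 0:
        return list(series)
    return [x / peak for x in series]
```

Both are linear rescalings with a positive factor, so neither changes the Pearson correlation between the two series. They only put a frequent word like really and a rarer one like totally on a comparable vertical scale.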
I've annotated the time series a bit, pointing out historical events. The upshot seems to be that these intensives are used more frequently around important events. Check out the general upward trend as the 2008 election grew nearer, with peaks in June, August, and October 2008.
Here is a sampling of other correlations, those involving intensives as well as some pairs meant to reassure us that around 90% is genuinely high:
Word pair | Correlation |
---|---|
truly ~ really | 92% |
truly ~ totally | 87% |
absolutely ~ really | 93% |
absolutely ~ totally | 84% |
somewhat ~ totally | 48% |
hardly ~ totally | 46% |
Update (2009-01-13): In the comments, Chris suggested adding a plot to bring out the linear correlation between really and totally. Here goes:
The R² value for the linear fit (red) is about 0.55.
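For anyone who wants to reproduce this kind of fit, here is a minimal Python sketch of an ordinary least-squares line and its R². The update does not say whether the fit used raw or log-transformed frequencies, so the sketch simply takes two lists of numbers; apply `math.log` to them first if you want the log-log version suggested in the comments. The function name `linear_fit_r2` is hypothetical.

```python
def linear_fit_r2(xs, ys):
    """Least-squares line y = intercept + slope * x, with its R-squared.

    Assumes xs is not constant and ys is not constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R-squared = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot
```

Here `xs` would be the monthly frequencies of really and `ys` those of totally, one value per month.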
Chris said,
January 11, 2009 @ 3:38 pm
It might be interesting to graph them against each other: "log(really)" on the x-axis and "log(totally)" on the y-axis.
Mark P said,
January 11, 2009 @ 9:58 pm
Or you could just peak normalize. That would reduce potential misimpressions due to scale differences.
Brian K said,
January 12, 2009 @ 10:25 am
I'm curious to know how you would describe the function of 'like' in the example you gave. (Mike like totally gave us…). Sorry, if that's a stupid question. I'm one of the amateurs who cause so much grief on this blog.
Arnold Zwicky said,
January 12, 2009 @ 10:53 am
To Brian K: the like in this example is an instance of "discourse particle" or "pragmatic particle" like, which serves a variety of functions. (There's a huge literature on it.)
[(myl) A list of links to LL posts on (various uses of) "like", as of May 2005, can be found here. Brian, that should be enough to get you started. The cited post also references Muffy Siegel's classic article on the subject of like the discourse particle, and you can find some of the subsequent literature by asking Google Scholar for the works that have cited her paper. ]
Brian K said,
January 12, 2009 @ 12:14 pm
Thanks very much.
Mark Graham said,
January 12, 2009 @ 4:20 pm
If I understand the data properly, it may be a little misleading. Have you accounted for the ebb and flow of words generally? For example, on the night of the election both "really" and "totally" would have spiked together, but so would any pair of words because so many more words were being typed that night.
Chris Potts said,
January 13, 2009 @ 12:09 pm
Chris!
Yes, definitely a worthwhile addition. I've added that as an update to the post. It brings out the nature and degree of correlation (though it hides the connection with historical events).
Chris Potts said,
January 13, 2009 @ 12:13 pm
Mark Graham!
Thanks for bringing this up. I should have been clearer in the post about how I calculated frequencies. Here goes: the frequency of word W in month M is the number of occurrences of W in M divided by the total number of words used in M. This relativization to specific dates goes a long way towards addressing your concern, I think. (I think I'll slip this into the post itself now.)
Chris Potts said,
January 13, 2009 @ 12:51 pm
Brian K!
No need to apologize! The dialogue between amateurs and professionals is part of what this weblog is all about. I think the "grief" you're perceiving traces to the unforgiving back-and-forth between writers and commentators in the comments. I assure you, we're just as (perhaps more) unforgiving with each other …
Alexandra said,
February 1, 2009 @ 6:12 pm
I realize this comment is three weeks late, so I'm probably talking to an empty room, but…
I'm curious about the troughs in these distributions, specifically 06-07 and 07-07. If intensives are used more frequently around important events, what causes them to be used less frequently? Hot July weather?