The Supreme Court hears oral arguments today in FCC v. Fox Television Stations, the case of the fleeting expletive. Bono got things going when he exclaimed "really, really fucking brilliant" at the 2003 Golden Globe Awards.[*] The FCC first judged such usage non-offensive, then backtracked in the face of pressure from the Parents Television Council. In this note, the FCC declares that
given the core meaning of the "F-Word," any use of that word or a variation, in any context, inherently has a sexual connotation
Language Loggers have commented on this and related topics before, and Arnold recently went meta on the Times coverage of the case. I recently spoke with Jess Bravin at the Wall Street Journal about the FCC's statement and the coming Supreme Court hearings. (His article with Amy Schatz appeared today, along with a cool wordle-like graphic on the results below.) During our conversation, Jess asked how a linguist might test the FCC's claim about the connotations of the F-word. Does it in fact have sexual connotations even when used as an intensive, as in Bono's "really, really fucking brilliant"?
Formal linguistic theories of meaning have, unfortunately, had relatively little to say about connotations. However, I think there is a chance here to apply methods from information extraction. Soon after talking with Jess, I undertook a small pilot experiment to try out an idea that bridges these two fields nicely. The pilot begins from this hypothesis:
Connotations hypothesis: A word's connotations are reflected in the words that it tends to co-occur with.
Political operatives know this hypothesis well. It is why they repeat the same phrases over and over again, seeking to instill particular words with new connotations. The same hypothesis is a core insight behind Latent Semantic Analysis, and George Lakoff discusses similar hypotheses under the rubric of framing. The most famous recent example is the systematic, large-scale effort to link Iraq and Al Qaeda in people's minds by mentioning them in the same breath again and again.
Using the connotations hypothesis, we can evaluate the FCC's claim. To do this, I gathered about 9.5 million words of blog posts from Eschaton (left-wing politics), Say Anything Blog (right-wing politics), and DListed (celebrity gossip). Posts at DListed tend to be about sex and sexuality in one way or another, so it's a good one to include. The full collection is designed to ensure that we don't see too many influences from particular domains or usage patterns. The main motivation for using these blogs, though, is that their authors are not shy with the F-word.
I then built a word-by-word matrix consisting of all the content words with counts above 150. Here is a snippet from it (the full matrix is 808 x 808):
The cells are filled with the number of times that the two words co-occurred in a blog post. The posts are mostly short (average length: 119 words), so it's safe to say that these associations are close.
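For readers who want to try something similar, here is a minimal sketch of how such a matrix might be built. It is not the code I used. The function name build_cooccurrence, the whitespace tokenization, and the stopwords parameter (standing in for whatever filter picks out content words) are all assumptions made for illustration. It also assumes that a pair of words is counted once per post in which both appear, which is only one way of reading "co-occurred in a blog post".

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(posts, min_count=150, stopwords=frozenset()):
    """Return (vocab, matrix), where matrix[i][j] is the number of posts
    that contain both vocab[i] and vocab[j].

    Assumptions (not from the original study): lowercase whitespace
    tokenization, and a hypothetical stopword set used to drop
    non-content words.
    """
    tokenized = [post.lower().split() for post in posts]

    # Overall token counts decide which words enter the vocabulary.
    totals = Counter(tok for toks in tokenized for tok in toks)
    vocab = sorted(
        w for w, c in totals.items() if c > min_count and w not in stopwords
    )
    index = {w: i for i, w in enumerate(vocab)}

    matrix = [[0] * len(vocab) for _ in vocab]
    for toks in tokenized:
        # Each vocabulary word counts at most once per post.
        present = sorted({index[t] for t in toks if t in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return vocab, matrix
```

In this sketch the diagonal is left at zero, so a word is never counted as co-occurring with itself. Whether that is the right choice depends on what you want the vectors to capture.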
We can now compare word distributions by comparing vectors of counts. The closer the vectors for w and w' are to each other, the more closely their distributions match. For this, I use a cosine measure. For a given target word w, we can compare w with all the other words in the matrix, then rank the results for closeness. Here, for example, are the nearest content words to a few target words (bold), ranked in decreasing order of closeness:
- speech: free, speak, political, speaking, calling, sort, terms, freedom, called, group
- movie: love, watching, family, girl, guys, mind, couple, head, hell, back
- video: people, thing, time, playing, part, telling, play, watch, statement, called
- obama: barack, obama's, campaign, mccain, candidate, election, presidential, john, race, senator
- mccain: john, obama, barack, campaign, obama's, presidential, candidate, election, senator, race
- america: great, terrorists, hope, protect, step, work, nation, lives, tonight, americans
- tax: taxes, income, spending, property, cut, rate, pay, measure, economic, lower
- sex: life, woman, women, happy, gay, young, thing, love, called, parents
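The ranking behind lists like these can be sketched in a few lines. Again, this is an illustration rather than the code I ran: the function names cosine and nearest are hypothetical, and the sketch assumes the matrix is a list of rows of counts with a parallel list of vocabulary words.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(target, vocab, matrix, k=10):
    """Return the k words whose rows are closest to the target word's row,
    in decreasing order of cosine similarity."""
    t = vocab.index(target)
    scored = [
        (cosine(matrix[t], matrix[i]), word)
        for i, word in enumerate(vocab)
        if i != t  # skip the target itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]
```

Because cosine normalizes for vector length, a very frequent word and a rarer one can still come out close, as long as their counts are spread across the other words in similar proportions.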
These results look pretty good. Thus, let's move to the punchline:
- fucking: time, imagine, here's, understand, things, wanted, wrong, hell, stop, stay
This looks chaotic to me. It certainly doesn't look sexual. The reason for that seems clear: contra the FCC's claims, the F-word is primarily a marker of emotional content. It is compatible with a wide range of emotions. In addition, it is extremely flexible syntactically.
Consider this a pilot study. I do not pretend that this is the best approach to (or the ideal data set for) characterizing connotations, but I think it might be a step in a useful direction. Perhaps other approaches would support the FCC's claim that the F-word is invariantly sexual, though I would now approach such claims very skeptically.
Much of the strength of this approach rests on how good the overall model of connotations is. We should get intuitive results back for the majority of words in the database. Since I do not presently know of a way to automatically evaluate the model, I'd like to enlist the help of the Language Log readers:
* Since Language Log is based in Philly, it is worth noting that second baseman Chase Utley got a big cheer with "World Champions" and an even bigger cheer with "World Fucking Champions".