There are two basic reasons for the increased interest in "text analytics" and "sentiment analysis": first, there's more and more data available to analyze; and second, the basic techniques are pretty easy.
This is not to deny the benefits of sophisticated statistical and text-processing methods. But algorithmic sophistication adds value to simple-minded baselines that are often pretty good to start with. In particular, simple "bag of words" techniques can be surprisingly effective. I'll illustrate this point with a simple Breakfast Experiment™.
Let's start by looking at the relationship between the words used in wine-tasting notes and the numerical scores representing the perceived quality of the wine. We're interested in tasting notes like these examples from the Beverage Tasting Institute:
Pure golden color. Golden raisin, honeydew, balloon, and flint aromas. A brisk entry leads to an off-dry medium-to-full body of honeyed peach, golden raisin, and rubber eraser flavors. Finishes with a petrol-like mineral, tangy apricot marmalade, and spice fade with pithy fruit tannins.
Hazy, pale golden color. Funky, old canned vegetable and lemon detergent aromas follow through to a bittersweet medium-bodied palate with wet hay, fruit stones, orange drink, and honey candy flavors. Finishes with a tannic citrus peel fade.
The associated scores run from 79 to 99, with the following distribution in my sample collection:
The simple-minded "bag of words" approach to analyzing such material starts by counting how often each word is associated with a given score. There are 21 possible scores (or at least 21 scores that are actually used in my sample), and 7,303 distinct "lexical tokens" (including wordforms, marks of punctuation, number strings, etc.), so this simple counting exercise gives us a matrix with 7,303 rows and 21 columns.
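The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the original analysis: the two toy reviews and the regex tokenizer are stand-ins for the real corpus and whatever tokenization was actually used.

```python
import re
from collections import defaultdict

# Toy stand-ins for the (text, score) pairs in the real sample
reviews = [
    ("Pure golden color. Golden raisin and flint aromas.", 88),
    ("Hazy, pale golden color. Funky aromas.", 81),
]

def tokenize(text):
    # Split into wordforms and marks of punctuation, lowercasing wordforms
    return [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]

# counts[word][score] plays the role of the words-by-scores matrix:
# one row per distinct token, one column per attested score
counts = defaultdict(lambda: defaultdict(int))
for text, score in reviews:
    for token in tokenize(text):
        counts[token][score] += 1

print(counts["golden"][88])
```

On the real sample, the same loop would yield the 7,303-by-21 matrix described above.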
One trivial thing to do with this "words by scores" matrix is to average the scores in each row. These averages are probably not very reliable for the 3,037 words that only occur once ("arresting", "finesseful", "cod", "meatballs", etc.); and maybe also not reliable for other words that are relatively rare in this sample. So let's limit the exercise to words that occur more than 10 times. In this subset, the 20 rows with the highest average score correspond to the wordforms:
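The row-averaging with a frequency cutoff is equally trivial to code. A sketch, with made-up words, made-up scores, and a toy threshold standing in for the "more than 10 times" cutoff:

```python
# word -> list of scores of the reviews it occurred in,
# one entry per occurrence (illustrative data, not the real sample)
occurrences = {
    "complex":   [92, 94, 91, 93],
    "thin":      [80, 82, 81, 79],
    "meatballs": [85],            # a one-off word: its average is unreliable
}

MIN_COUNT = 2  # stand-in for the "more than 10 times" cutoff

averages = {
    word: sum(scores) / len(scores)
    for word, scores in occurrences.items()
    if len(scores) >= MIN_COUNT
}

# Rank the surviving words by average score, best-tasting first
ranked = sorted(averages, key=averages.get, reverse=True)
print(ranked)
```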
The 20 worst-scoring wordforms are:
The question-mark scores negatively in these reviews because it's almost always associated with "but", as in
Bright orangey-brick hue. Aromatically reserved, with a subdued, earthy undertone. A rich entry leads to a rounded, moderately full-bodied palate with a wave of aggressive tannins. Somewhat mean. Try with mid-term cellaring, but?
The exclamation point, unsurprisingly, has a fairly high average score of 90.34.
There are some interesting oddities — for example, plurals generally seem to score a little bit better on average than the corresponding singulars do:
(Some go the other way, e.g. flavor 88.57, flavors 86.96; carrot 87.81, carrots 86.83; olive 88.68, olives 87.62. And I haven't done a systematic survey. But in most of the cases that I've looked at this morning, plurals seem to taste better than singulars.)
From the same data, we could calculate some other, equally simple tables of numbers, say the matrix of words by reviews, where the i,j cell holds the number of times that word i occurs in the text of wine-review j.
Given that there are 20,509 reviews in my sample, this is a matrix with 7,303 rows and 20,509 columns. (We wouldn't want to have to do all this counting ourselves, but luckily a few lines of computer code will do the counting and create this matrix in a fraction of a second.)
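Here's what those "few lines of computer code" might look like, sketched with numpy on three toy documents rather than the 20,509 real reviews:

```python
import re
import numpy as np

docs = [
    "golden raisin aromas , golden color",
    "funky aromas , citrus peel",
    "citrus peel , golden color",
]

tokenized = [re.findall(r"\w+|[^\w\s]", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
row = {w: i for i, w in enumerate(vocab)}

# M[i, j] = number of times word i occurs in document j
M = np.zeros((len(vocab), len(docs)), dtype=int)
for j, toks in enumerate(tokenized):
    for t in toks:
        M[row[t], j] += 1

print(M.shape)               # (distinct tokens, documents)
print(M[row["golden"]])      # counts of "golden" in each review
```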
Like the words-by-scores matrix, this words-by-reviews matrix also supports some really simple applications.
For example, we can approximate the similarity of two wine-reviews by taking the "inner product" of the corresponding columns in the matrix — giving us the counts in one column multiplied by the counts in the other column, all added up. With a bit of weighting thrown in to take account of how unevenly distributed words are across documents, and a normalization by vector length, this "cosine similarity" has been a basic technique in document retrieval and text mining for decades.
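The weighted inner product can be sketched as follows; the tiny hand-built matrix and the simple idf weighting scheme are illustrative choices, not a claim about the exact weighting used in practice:

```python
import numpy as np

# Toy words-by-reviews count matrix: 4 words x 3 reviews
M = np.array([
    [2, 0, 1],   # "golden"
    [1, 1, 1],   # ","  (occurs everywhere, so it gets downweighted)
    [0, 2, 0],   # "funky"
    [1, 0, 1],   # "color"
], dtype=float)

# idf weighting: words that are rare across documents count for more
n_docs = M.shape[1]
df = (M > 0).sum(axis=1)          # document frequency of each word
idf = np.log(n_docs / df)
W = M * idf[:, None]

# cosine similarity of two review columns: normalized inner product
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_01 = cosine(W[:, 0], W[:, 1])  # reviews 0 and 1: no weighted overlap
sim_02 = cosine(W[:, 0], W[:, 2])  # reviews 0 and 2 share "golden", "color"
print(sim_01, sim_02)
```

Note how the comma, despite occurring in every review, contributes nothing after weighting, which is exactly the point of the idf term.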
And again, this matrix of counts is the starting point for some slightly more sophisticated techniques. For example, we could use the association of words with wine-reviews to infer the overall contribution of each word to the vector of wine scores, either overall or for different kinds of grapes or in different price ranges, using various regression techniques.
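One simple version of that regression idea is ridge regression from the count matrix to the scores. This is a sketch under stated assumptions, not the author's actual procedure: the three reviews, the three-word vocabulary, and the regularization strength are all toy choices, and there's no intercept term, so the weights absorb the baseline score.

```python
import numpy as np

# X: reviews-by-words counts (the transpose of the matrix in the text);
# y: the score of each review. All values are illustrative.
X = np.array([
    [2, 0, 1],   # counts of ("complex", "thin", "golden") in review 1
    [0, 2, 1],
    [1, 1, 0],
], dtype=float)
y = np.array([94.0, 81.0, 88.0])

# Closed-form ridge solution: w = (X'X + lam*I)^{-1} X'y
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# w[i] estimates the per-occurrence score contribution of word i
print(dict(zip(["complex", "thin", "golden"], np.round(w, 2))))
```

Restricting X to reviews of one grape variety or price range, as the text suggests, would just mean selecting the relevant rows before solving.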
The technique known as "latent semantic analysis" starts with this same type of "term by document matrix" — a table whose rows are words and whose columns are documents, with the number in row i and column j representing the number of times that word i occurs in document j.
The basic metaphor is that a given word's distribution across documents is a sort of extensional representation of its meaning. But if there are a million documents, then this representation is a million-dimensional vector, which is not a very handy object. Even in our little collection of wine reviews, we've got more than 20,000 counts per word to keep track of. And you'd be right to suspect that the real amount of relevant per-word information in these distributions is a lot less than such large collections of numbers imply…
A mathematically simple technique called singular value decomposition (SVD) gives us a way to find a low-dimensional approximation for any matrix, including specifically such term-by-document matrices. We can use this technique to approximate the information in a word's distribution across documents with a projection onto an arbitrarily short vector, 2 or 20 or 200 numbers rather than 20,000 (or 20 billion). Obviously, the quality of the approximation improves with the number of numbers that we retain — but typically, the returns (in terms of, say, the effect on estimated distances) start to diminish rapidly.
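The truncation step is a one-liner once the SVD is computed. A sketch, with a random matrix standing in for a real term-by-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 30))   # toy "term by document" matrix

# Full (thin) SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 5                               # keep only the top-k components
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Each word's 30-number distribution is now summarized by k numbers
word_vectors = U[:, :k] * s[:k]

# Truncated SVD is the best rank-k approximation in Frobenius norm
err = np.linalg.norm(M - M_k) / np.linalg.norm(M)
print(M_k.shape, word_vectors.shape, round(err, 3))
```

Increasing k drives the relative error toward zero, with exactly the diminishing returns described above, since the discarded singular values shrink as you go down the list.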
Furthermore, at least the first few of these numbers (ranked by importance) are often humanly interpretable. And this approach can be a worthwhile way to find interpretable dimensions even if we start with a much smaller space.
On beyond averaging, regression, and eigenanalysis, there are many more elaborate and more sophisticated sorts of analysis we could (and should) do with datasets like this — but again, a simple-minded baseline starts us off in a pretty good place.
Of course, the way the words are put together in text really does matter. A text is more than a "bag of words". All the same, simple word-choice carries enough information to start us at a useful level for many tasks.
And there are almost-equally-simple methods that take account of the sequence and structures of words in texts. But that's a topic for another breakfast.
(For some prior art on this particular topic, see "…with just a hint of Naive Bayes in the nose", 2/23/2011.)