A few days ago ("Evaluative words for wines", 4/7/2012), I illustrated how a trivial method can help us uncover the contribution of individual words to the expression of opinion in text. For this morning's Breakfast Experiment™, I'll illustrate an equally trivial approach to learning how words fit together structurally, using the same small collection of 20,888 wine reviews.
These reviews come from the site of the Beverage Tasting Institute — two (literally) random examples:
Deep yellow golden color with an emerald cast. Toasty, yeasty lemon custard aromas follow through to a fruity tart medium bodied palate with candied lemon and lime peels, banana custard, and brioche flavors. Finishes with a tart and sweet citrus marmalade on toast fade. Very tasty and classically styled.
Brilliant garnet-ruby red hue. Earth, dried herb, tar and cherry aromas. A medium-bodied palate leads to a short, earthy finish with a moderate burst of fruit and tart acidity.
Our method is a simple implementation of an old idea, the "Distributional Hypothesis", which suggests that "a word is characterized by the company it keeps". For this simple test, we'll define a "word" as a textual token, with case distinctions eliminated, hyphenated compound words retained as units, and other marks of punctuation split off as separate tokens. Thus the two-sentence sequence
Brilliant garnet color. Cranberry-apple, plum and thyme aromas.
is turned into a space-separated sequence of tokens as follows:
brilliant garnet color . cranberry-apple , plum and thyme aromas .
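This tokenization is easy to sketch in Python — the regular expression below is my own reconstruction of the stated rules (lowercase everything, keep hyphenated compounds as units, split punctuation off), not the script actually used:

```python
import re

def tokenize(text):
    """Lowercase the text, keep hyphenated compounds as single tokens,
    and split other punctuation marks off as separate tokens."""
    # \w+(?:-\w+)* matches a word, optionally extended by -word parts;
    # [^\w\s] matches any single punctuation character as its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text.lower())

print(" ".join(tokenize("Brilliant garnet color. Cranberry-apple, plum and thyme aromas.")))
# brilliant garnet color . cranberry-apple , plum and thyme aromas .
```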
For each such token, we now define "the company it keeps" as the words immediately to the right and immediately to the left.
Every aspect of this implementation is questionable. By ignoring case, we've made a useful generalization, but also thrown away some valuable information. By using word-forms rather than lemmas — so that e.g. "aroma" and "aromas" are different words, with nothing more in common than "aroma" and "cherry" — we've done the opposite, retaining some useful information but ignoring a potentially valuable generalization. Similarly, by retaining punctuation, we've retained some valuable information, but failed to capture across-punctuation relationships. By retaining hyphenated words, we've failed to recognize the contribution of their parts. And finally, a one-word window on each side of a given word, however we define "word", is a pretty impoverished definition of "the company it keeps".
There are interesting options for simultaneously having and eating the various alternative ways of slicing these algorithmic cakes, but we need to finish by the end of the breakfast hour. So for now, let's press ahead with our simple-minded experiment.
By the definitions given above, there are 869,761 lexical tokens in our little collection, divided among 7,303 lexical types.
1217 of these lexical tokens are the word "plum", which occurs with 49 different right-hand neighbors, of which these are the 15 commonest:
464 plum ,
318 plum and
76 plum aromas
62 plum skin
54 plum pie
44 plum fruit
24 plum tart
21 plum flavors
18 plum custard
16 plum pudding
15 plum jam
11 plum fade
9 plum chutney
7 plum finish
7 plum with
The word-form "plum" occurs with 75 different left-hand neighbors, of which these are the 15 commonest:
448 , plum
176 and plum
119 . plum
69 of plum
36 black plum
34 ripe plum
33 yellow plum
26 baked plum
23 tangy plum
20 with plum
19 spicy plum
17 golden plum
15 tart plum
15 red plum
14 damson plum
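Tables like the two above come from simple neighbor counting. A sketch of how one might collect them (the function and its name are my own illustration, not the original script):

```python
from collections import Counter

def neighbor_counts(tokens, target):
    """Count the immediate right-hand and left-hand neighbors of `target`."""
    right = Counter(tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == target)
    left = Counter(tokens[i - 1] for i, t in enumerate(tokens) if t == target and i > 0)
    return right, left

tokens = "brilliant garnet color . cranberry-apple , plum and thyme aromas .".split()
right, left = neighbor_counts(tokens, "plum")
# In this tiny example, "plum" has one right neighbor ("and")
# and one left neighbor (",").
```

Run over the whole collection, `right.most_common(15)` and `left.most_common(15)` would reproduce lists of the kind shown above.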
It's going to be challenging to estimate the distributional properties of a word that only occurs once, and not a lot better for words that occur just a handful of times. There are clever ways past this estimation problem, but for now, let's just restrict our attention to the 1866 words that occur at least 10 times in this collection.
For each of these 1866 words, we now create an 1866-element vector representing how often this word occurred with each other word (within the same set of 1866) on the right; and we create a similar vector of pairing-counts with other words on the left. We divide each count by the frequency of the other word in the pair — thus given that "aromas" occurred a total of 18951 times, and "plum aromas" occurred 76 times, the pairing-count for "aromas" on the right of "plum" becomes 76/18951, or 0.0040103.
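In code, the normalization step might look like the following sketch — `context_vector` and its details are hypothetical, chosen to match the "plum aromas" arithmetic above (76/18951 ≈ 0.0040103) rather than taken from the original implementation:

```python
from collections import Counter

def context_vector(tokens, target, vocab):
    """Frequency-normalized right- and left-neighbor counts for `target`,
    restricted to neighbors in `vocab` (here, words occurring >= 10 times)."""
    freq = Counter(tokens)
    right, left = Counter(), Counter()
    for i, t in enumerate(tokens):
        if t != target:
            continue
        if i + 1 < len(tokens) and tokens[i + 1] in vocab:
            right[tokens[i + 1]] += 1
        if i > 0 and tokens[i - 1] in vocab:
            left[tokens[i - 1]] += 1
    # Divide each pairing count by the total frequency of the neighboring
    # word, e.g. count("plum aromas") / count("aromas").
    right_norm = {w: c / freq[w] for w, c in right.items()}
    left_norm = {w: c / freq[w] for w, c in left.items()}
    return right_norm, left_norm
```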
(Again, this form of normalization is questionable — there are other alternatives — but we press ahead heedlessly…)
This yields a sparse matrix with 1866 rows (words) and 2*1866 columns (right and left frequency-weighted neighbor counts). Now we want to use these frequency-weighted neighbor-counts to characterize a given word "by the company it keeps". One crude way to do that is to ask what other words a given word is most like. A crude way to measure the similarity of these crudely-represented word contexts is just to sum the products of corresponding elements, i.e. to take the inner product of the context-vectors.
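With the matrix in hand, the inner-product comparison is one line of linear algebra. A minimal NumPy sketch, assuming the matrix `M` has one row per word in the same order as the list `words` (`most_similar` is a hypothetical helper, not the script actually used):

```python
import numpy as np

def most_similar(M, words, w, k=15):
    """Return the k words whose context rows have the largest
    inner products with the row for word w (excluding w itself)."""
    i = words.index(w)
    sims = M @ M[i]            # inner product of every row with row i
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [words[j] for j in order if j != i][:k]
```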
For "plum", the 15 most similar context-vectors — the 15 rows in our matrix of words-by-contexts with the largest inner products with the row corresponding to "plum" — correspond to the words
"Cheesecloth"? Yes, in this arena, cheesecloth is a smell:
Lanolin, cheesecloth and grapefruit aromas.
Blackberry, cheesecloth and tar aromas.
Cherry, cheesecloth, sage and oak aromas.
Currant, tar, cheesecloth and oak aromas.
Lemon peel, cheesecloth and dried pear aromas.
Does every word's context come out most similar to a list of odor-words? No, the 15 words whose contexts are most similar by this metric to "garnet" are:
Given that this simple-minded approach yields some plausible results, we can hope that somewhat more sophisticated methods of the same type might enable us to learn a sort of stochastic semantic grammar.
There are some non-trivial problems that remain to be solved. Thus the ten words whose contexts are most similar to "color" (by our simple metric applied to this text collection) are
The only approximate synonyms for "color" in this text collection seem to be "hue", "cast", and "tint", which occupy the top three places in the list above. Without looking deeply into the details, I think the rest of the list is mostly just noise — thus "rim" is often modified by a color-word ("Deep ruby black color with a purple rim"; "Tawny reddish amber color with an orange rim"), but the rest of the words are mainly similar to "color" in their propensity to occur at the end of a sentence.
There are some obvious ideas for strengthening the signal and weakening the noise in calculations of this kind, but they'll have to wait for another breakfast hour.