TerMine is a system for recognizing multi-word terms. The algorithm was originally presented in Katerina Frantzi, Sophia Ananiadou, and Hideki Mima, "Automatic recognition of multi-word terms", International Journal of Digital Libraries 3(2): 117-132, 2000. You can try it out at the National Centre for Text Mining (NaCTeM) at the University of Manchester in the UK, which hosts a web demonstration that will analyze short (<2 MB) texts or URLs for you.

As you'll find if you try, the results are not always perfect, but I think that the algorithm is remarkably good at guessing multi-word terms from small amounts of text. For example, if I try it out on a page (~2000 words) of lecture notes about "Statistical estimation for Large Numbers of Rare Events", it comes up with a large number of sensible things like good-turing estimate, maximum likelihood, population frequency, belief tax, and negative binomial distribution — along with a few clunkers like cnew = cnew./token and some other fragments of Matlab code. (Maybe it was unfair to give it a sample that included such things…)

Jock McNaught recently reminded me of this service by trying it out on President Obama's inaugural address.

Jock wrote:

As I see there has been some discussion on the Language Log about Pres. Obama's inauguration speech, and elsewhere David Crystal's excellent analysis, I thought you might be interested to see what NaCTeM's TerMine tool made of his speech, see attached.

I've seen some word clouds, some single word analyses, but haven't come across any analysis of the compound words he used.

The output is ordered by descending C-value score, then within each score by ascending alphabetic order.

As the speech is a small sample, only the C-value scores >1 are really relevant to indicating the 'important' compounds. As you may recall, C-value isn't straight frequency of occurrence; it also takes into account nesting of smaller forms in larger forms, length of forms, etc.

I chose to focus here on Adj N combinations, and lowercasing and stemming were used in an attempt to improve results. Possibly of some discourse-level interest is that there were very few 3-grams of importance found and nothing at all by way of >3-grams, although I haven't checked to see if, statistically, the 3-grams found would represent a typical proportion for this size/type of text. The tool does not handle conjoined forms (i.e. of the "old men and women" variety).

Here's the list that Jock attached:

2.000000 common dangers
2.000000 health care
2.000000 new age
2.000000 new era
1.584962 few worldly possessions
1.584962 gross domestic product
1.584962 long rugged path
1.584962 many big plans
1.584962 stale political arguments

[Update: an amusing textual explication, offered by The Drunken Priest:

Frequent terms intended to conjure up the pioneer in all of us: few worldly possessions, long rugged path, distant mountains, far-off deserts, hard earth, difficult task, icy river, icy currents, hungry minds, starved bodies, rugged path, brave Americans.

Terms intended to unite us all into one common incontestable ant-hill: common dangers, common defense, common good, common humanity, common purpose, collective failure, many big plans, fellow citizens, greater cooperation, mutual interest, mutual respect, patchwork heritage…

Terms indicating the opposition is nothing more than a group of kiss kiss, hug hug, lip gloss girls: childish things, bad habits, stale political arguments, petty grievances, worn-out dogmas…

Terms demonstrating we’re not old: new age, new era, new foundation, new jobs, new life, new threats, new way, next generation, young nation….

]

1. ### mark said,

February 11, 2009 @ 9:08 am

Only ASCII input allowed in that tool! I'm disappointed.

[Hwæt, you were hoping to analyze Beowulf? The tool is tuned for English — there's a parser in it, for example — so the opportunities to stray licitly outside of ASCII are limited.]