TerMine is a system for recognizing multiword terms. The algorithm was originally presented in Katerina Frantzi, Sophia Ananiadou, and Hideki Mima, "Automatic recognition of multi-word terms", International Journal of Digital Libraries 3(2): 117-132, 2000. You can try it out at the National Centre for Text Mining (NaCTeM) at the University of Manchester in the UK, which hosts a web demonstration that will analyze short (<2 MB) texts or URLs for you.

As you'll find if you try, the results are not always perfect, but I think that the algorithm is remarkably good at guessing multi-word terms from small amounts of text. For example, if I try it out on a page (~2000 words) of lecture notes about "Statistical estimation for Large Numbers of Rare Events", it comes up with a large number of sensible things like good-turing estimate, maximum likelihood, population frequency, belief tax, and negative binomial distribution — along with a few clunkers like cnew = cnew./token and some other fragments of Matlab code. (Maybe it was unfair to give it a sample that included such things…)

Jock McNaught recently reminded me of this service by trying it out on President Obama's inaugural address.

Jock wrote:

As I see there has been some discussion on the Language Log about Pres. Obama's inauguration speech, and elsewhere David Crystal's excellent analysis, I thought you might be interested to see what NaCTeM's TerMine tool made of his speech (see attached).

I've seen some word clouds, some single word analyses, but haven't come across any analysis of the compound words he used.

The output is ordered by descending C-value score, then within each score by ascending alphabetic order.

As the speech is a small sample, only the C-value scores >1 are really relevant to indicating the 'important' compounds. As you may recall, C-value isn't straight frequency of occurrence; it also takes into account nesting of smaller forms in larger forms, length of forms, etc.

I chose to focus here on Adj N combinations, and lowercasing and stemming were used in an attempt to improve results. Possibly of some discourse-level interest is that there were very few 3-grams of importance found and nothing at all by way of >3-grams, although I haven't checked to see if, statistically, the 3-grams found would represent a typical proportion for this size/type of text. The tool does not handle conjoined forms (i.e. of the "old men and women" variety).

Here's the list that Jock attached:

Result of analysing President Obama's inauguration speech using NaCTeM's TerMine
tool.

The output is ordered by descending C-value, then within each score by ascending
alphabetic order.

2.000000 common dangers
2.000000 health care
2.000000 new age
2.000000 new era
1.584962 few worldly possessions
1.584962 gross domestic product
1.584962 long rugged path
1.584962 many big plans
1.584962 stale political arguments
1.000000 american people
1.000000 better history
1.000000 better life
1.000000 bitter swill
1.000000 brave americans
1.000000 childish things
1.000000 civil war
1.000000 clean waters
1.000000 collective failure
1.000000 common defense
1.000000 common good
1.000000 common humanity
1.000000 common purpose
1.000000 dark chapter
1.000000 darkest hours
1.000000 decent wage
1.000000 digital lines
1.000000 distant mountains
1.000000 earlier generations
1.000000 electric grids
1.000000 expedience sake
1.000000 fair play
1.000000 false promises
1.000000 far-off deserts
1.000000 far-reaching network
1.000000 fellow citizens
1.000000 former foes
1.000000 founding documents
1.000000 founding fathers
1.000000 free men
1.000000 full measure
1.000000 future generations
1.000000 future world
1.000000 god-given promise
1.000000 grandest capitals
1.000000 greater cooperation
1.000000 greater effort
1.000000 hard choices
1.000000 hard earth
1.000000 hard work
1.000000 hard-earned peace
1.000000 high office
1.000000 humble gratitude
1.000000 hungry minds
1.000000 icy currents
1.000000 icy river
1.000000 individual ambitions
1.000000 khe sahn
1.000000 last month
1.000000 last week
1.000000 last year
1.000000 local restaurant
1.000000 magnificent mall
1.000000 muslim world
1.000000 mutual interest
1.000000 mutual respect
1.000000 nagging fear
1.000000 narrow interests
1.000000 new foundation
1.000000 new jobs
1.000000 new life
1.000000 new threats
1.000000 new way
1.000000 next generation
1.000000 noble idea
1.000000 nuclear threat
1.000000 old friends
1.000000 old hatreds
1.000000 other peoples
1.000000 patchwork heritage
1.000000 petty grievances
1.000000 poor nations
1.000000 powerful nation
1.000000 president bush
1.000000 presidential oath
1.000000 prudent use
1.000000 quiet force
1.000000 relative plenty
1.000000 rightful place
1.000000 sacred oath
1.000000 short span
1.000000 small band
1.000000 small village
1.000000 starved bodies
1.000000 sturdy alliances
1.000000 surest route
1.000000 tempering qualities
1.000000 timeless words
1.000000 uncertain destiny
1.000000 united states
1.000000 unpleasant decisions
1.000000 vital trust
1.000000 warming planet
1.000000 watchful eye
1.000000 willing heart
1.000000 worn-out dogmas
1.000000 wrong side
1.000000 young nation
-0.000000 big plans
-0.000000 domestic product
-0.000000 political arguments
-0.000000 rugged path
-0.000000 worldly possessions
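
The scores in this list are consistent with the C-value formula as I understand it from the Frantzi, Ananiadou and Mima paper: the base-2 log of a term's length in words, times its frequency, where the frequency of a term nested inside longer candidates is first reduced by the average frequency of those longer candidates. Here is a minimal Python sketch of that formula and of the ordering Jock describes. It is not TerMine's code: the names `c_values` and `ranked` are made up for illustration, and the counts are assumptions backed out from the scores above rather than checked against the speech.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside the strictly longer `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """Score candidate terms.

    `freq` maps each multi-word candidate (a string) to its total frequency,
    including occurrences nested inside longer candidates.
    """
    words = {term: tuple(term.split()) for term in freq}
    scores = {}
    for a, words_a in words.items():
        # Longer candidates that contain `a` as a contiguous sub-sequence.
        longer = [b for b, words_b in words.items() if contains(words_b, words_a)]
        f = float(freq[a])
        if longer:
            # Discount by the average frequency of the containing candidates.
            f -= sum(freq[b] for b in longer) / len(longer)
        scores[a] = math.log2(len(words_a)) * f
    return scores


def ranked(scores):
    """Order by descending score, then ascending alphabetical order."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Assumed counts (hypothetical, inferred from the scores in the list).
counts = {
    "common dangers": 2,
    "few worldly possessions": 1,
    "worldly possessions": 1,
    "american people": 1,
}

for term, score in ranked(c_values(counts)):
    print(f"{score:.4f} {term}")
```

With these counts, "worldly possessions" comes out at zero because its only occurrence is inside "few worldly possessions", which is the pattern you see in the five entries at the bottom of the list.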



[Update: an amusing textual explication, offered by The Drunken Priest:

Frequent terms intended to conjure up the pioneer in all of us: few worldly possessions, long rugged path, distant mountains, far-off deserts, hard earth, difficult task, icy river, icy currents, hungry minds, starved bodies, rugged path, brave Americans.

Terms intended to unite us all into one common incontestable ant-hill: common dangers, common defense, common good, common humanity, common purpose, collective failure, many big plans, fellow citizens, greater cooperation, mutual interest, mutual respect, patchwork heritage…

Terms indicating the opposition is nothing more than a group of kiss kiss, hug hug, lip gloss girls: childish things, bad habits, stale political arguments, petty grievances, worn-out dogmas…

Terms demonstrating we’re not old: new age, new era, new foundation, new jobs, new life, new threats, new way, next generation, young nation….

]

mark said,

February 11, 2009 @ 9:08 am

Only ASCII input allowed in that tool! I'm disappointed.

[Hwæt, you were hoping to analyze Beowulf? The tool is tuned for English — there's a parser in it, for example — so the opportunities to stray licitly outside of ASCII are limited.]