Morphology technology
Different applications -- different needs
stemmers
collapse all forms of a word by pairing each with a “stem”
for (CL)IR
for (aspects of) language modeling
analysis/synthesis
pair wordform with lemma+affixes/features (or more)
perhaps with disambiguation in textual context
for spelling correction
for pronunciation calculation
for language modeling
for information extraction
for translation
All cases can be modeled as relations on strings
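A minimal sketch of the “relations on strings” view: a few hypothetical suffix rules pair a wordform with candidate (lemma, features) analyses. The rules and feature names below are illustrative only; real systems encode such relations as FSTs.

```python
# Morphology as a relation on strings: each rule relates a surface
# suffix to a lemma-final string plus morphological features.
# These toy English rules are assumptions for illustration.

RULES = [
    ("ies", "y", {"num": "pl"}),     # ponies -> pony
    ("s",   "",  {"num": "pl"}),     # cats   -> cat
    ("ed",  "",  {"tense": "past"}), # walked -> walk
    ("",    "",  {}),                # identity: the form is its own lemma
]

def analyze(wordform):
    """Return all (lemma, features) pairs licensed by the suffix rules."""
    out = []
    for suffix, repl, feats in RULES:
        if wordform.endswith(suffix):
            lemma = wordform[: len(wordform) - len(suffix)] + repl
            out.append((lemma, feats))
    return out

print(analyze("ponies"))
# includes ('pony', {'num': 'pl'}) among other candidate analyses
```

Note that the relation is one-to-many: “ponies” gets several candidate analyses, which is exactly why contextual disambiguation (tagging) is needed downstream.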

Hand-built engines
Well-understood FST technology
broad coverage, accurate, efficient
1-2 person-years per language
native speaker/linguist/programmer combination
application adaptation is also understood
More informal approaches
often less demanding applications such as stemmers
sui generis algorithms developed for single language
typically more human effort

Related technology
Probability distribution over alternative analyses
Can be listed for common forms
Approximated by a weighted transducer for rarer forms
Taggers
 contextual disambiguation
of the output of a morph analyzer
Several well-understood approaches
HMM
Brill
FST
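The HMM approach to contextual disambiguation can be sketched with a tiny Viterbi decoder; all probabilities below are made-up toy numbers, not estimates from data.

```python
# Minimal Viterbi sketch for HMM tagging: choose the most probable
# tag sequence given transition and emission probabilities.
# The two-tag model and all numbers are assumptions for illustration.

def viterbi(words, tags, start, trans, emit):
    V = [{t: start[t] * emit[t].get(words[0], 1e-6) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p] * trans[p][t])
            col[t] = V[-1][best_prev] * trans[best_prev][t] * emit[t].get(w, 1e-6)
            ptr[t] = best_prev
        V.append(col)
        back.append(ptr)
    last = max(tags, key=lambda t: V[-1][t])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.4, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.5}}
print(viterbi(["dogs", "bark"], tags, start, trans, emit))  # -> ['N', 'V']
```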

So what’s the problem?
Transducers and taggers cost a lot
1-3 person-years of highly-skilled work per language
or more, from a standing start
e.g. without electronic dictionaries and tagged texts
Each language takes a long time
difficult to parallelize the work
Only about 15 languages have been “done”
coverage is uneven for these
natural rate of addition is now slow
only large, wealthy languages are likely candidates

Obvious solution:
 use machine learning
Several approaches have been tried
Others look interesting
Some results are promising, but…
no overall solutions so far
different languages, training sets, tests if any
no comparisons
of different methods on one problem
of one method on different problems
of quality in different applications

A test bench
for morphology learning
For a dozen or so diverse languages
Text corpus (~1-10MW)
Tagged subcorpus (0.1-1MW)
for training and/or testing
Broad-coverage analyzer/synthesizer
generating data for (semi-)supervised learning
“oracle” for active learning
Tagger
generating approximately correct tagged data
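The scoring side of such a test bench might look like the loop below; `oracle` and `learned` are hypothetical stand-ins for the broad-coverage analyzer and a learned analyzer.

```python
# Sketch of a test-bench scoring loop: compare a learned analyzer
# against the broad-coverage analyzer used as an oracle.
# Both analyzers here are toy stand-ins, not real systems.

def accuracy(wordforms, oracle, learned):
    """Fraction of wordforms where the learner matches the oracle."""
    correct = sum(1 for w in wordforms if learned(w) == oracle(w))
    return correct / len(wordforms)

# Toy stand-ins: the oracle strips a final "s"; the learner only
# handles words it has memorized.
oracle = lambda w: w[:-1] if w.endswith("s") else w
memo = {"cats": "cat"}
learned = lambda w: memo.get(w, w)

print(accuracy(["cats", "dogs", "cat"], oracle, learned))  # -> 2/3
```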

Approaches I
Unsupervised learning
lots of text as input (100M words or so)
wordform similarity and context similarities
stem classes or full decomposition
Promising performance on small subtasks
Yarowsky (2000)
Overall results (so far) are inadequate
Brent (1999), Goldsmith (2000)
100M-word corpus is unrealistic for most languages where induction is needed
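The signature idea behind this line of work (in the spirit of Goldsmith's approach) can be sketched in a few lines: split each word at every point and keep stems that recur with multiple suffixes. This is a deliberately naive sketch, not the published algorithm.

```python
# Rough sketch of signature-based morphology induction: a "signature"
# is the set of suffixes a candidate stem appears with; stems that
# take more than one suffix suggest real morphological structure.

from collections import defaultdict

def signatures(words):
    stem_suffixes = defaultdict(set)
    for w in words:
        for i in range(1, len(w) + 1):
            stem_suffixes[w[:i]].add(w[i:])
    # Keep only stems observed with more than one suffix,
    # e.g. walk + {"", "s", "ed"}.
    return {stem: sufs for stem, sufs in stem_suffixes.items()
            if len(sufs) > 1}

words = ["walk", "walks", "walked", "talk", "talks"]
sigs = signatures(words)
print(sigs["walk"])  # -> {'', 's', 'ed'}
```

Real systems add an MDL or probabilistic criterion to choose among the many spurious splits this naive version keeps.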

Approaches II
Supervised learning
No attempts (to our knowledge) to apply general inductive methods to the full problem
Some very small experiments with neural nets
Why not?
problem is hard (Gold’s theorem)
fully supervised case seems unrealistic
though small tagged corpus is cheap
production can be parallelized

Approaches III
Semi-supervised learning
Oflazer and Nirenburg (2000)
Elicit/build/test
Elicit paradigms
Build simple FST rules using Brill learner
Test and “see what the problems are”
Initial results seem promising
Quantitative evaluation?
Role of human judgment?

Our approach
Set up a test bench to compare alternatives
for overall performance
for application suitability
Try new things
modifications of existing algorithms
new or untested algorithms
Get basic data on the problem
across languages and applications

An interesting idea
MAT learning of FSAs
Angluin 1987
Exact learning in polynomial time and state space
given “queries and counterexamples”
Becomes PAC learning given membership queries only
Generalized to “multiplicity automata” (Beimel et al. 1998)
Generalized to two-tape automata (Yokomori 2000)
In tests: transducer answers membership queries!
In real applications: humans would give the answers
Queries can often be run in parallel
might run nearly N times faster given N humans…
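A compact sketch of Angluin-style L* learning from membership queries, with a brute-force equivalence check standing in for the teacher. Here the "oracle" is just a Python predicate, playing the role the slides assign to a transducer (in tests) or a human (in applications); the target language and the counterexample-handling variant (adding counterexample suffixes to the distinguishing set) are assumptions for illustration.

```python
# L*-style learning of a DFA over {a, b} from membership queries.
# Target language (the membership oracle): even number of a's.

from itertools import product

ALPHABET = "ab"

def member(w):
    return w.count("a") % 2 == 0

def row(s, E):
    return tuple(member(s + e) for e in E)

def lstar(max_len=6):
    S, E = [""], [""]                     # prefixes, distinguishing suffixes
    while True:
        # Close the table: every one-letter extension of a prefix in S
        # must have a row already represented in S.
        changed = True
        while changed:
            changed = False
            rows = {row(s, E) for s in S}
            for s in list(S):
                for a in ALPHABET:
                    if row(s + a, E) not in rows:
                        S.append(s + a)
                        rows.add(row(s + a, E))
                        changed = True
        # Hypothesis DFA: states are the distinct rows.
        states = {}
        for s in S:
            states.setdefault(row(s, E), s)
        def accepts(w):
            cur = ""
            for a in w:
                cur = states[row(cur + a, E)]
            return member(cur)
        # Equivalence query by brute force over short strings.
        cex = None
        for n in range(max_len + 1):
            for tup in product(ALPHABET, repeat=n):
                w = "".join(tup)
                if accepts(w) != member(w):
                    cex = w
                    break
            if cex:
                break
        if cex is None:
            return accepts
        # Add all suffixes of the counterexample to E, then re-close.
        for i in range(len(cex) + 1):
            if cex[i:] not in E:
                E.append(cex[i:])

learned = lstar()
print(learned("abab"), learned("ab"))
```

For the morphology setting, `member` would be replaced by queries to a reference transducer or a human, and batches of membership queries could be farmed out in parallel, as the slide notes.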

Prospective
machine morphologists
Mark Liberman
Scott Weinstein
Robin Clark
David Embick
Steven Bird
Ian Ross
Paul Kingsbury
Na-Rae Han
[Michael Kearns]