Different applications -- different needs
stemmers
collapse all forms of a word by pairing each with a “stem”
for (CL)IR
for (aspects of) language modeling
analysis/synthesis
pair wordform with lemma+affixes/features (or more)
perhaps with disambiguation in textual context
for spelling correction
for pronunciation calculation
for language modeling
for information extraction
for translation
All cases can be modeled as relations on strings
Well-understood FST technology
broad coverage, accurate, efficient
1-2 person-years per language
native speaker/linguist/programmer combination
application adaptation is also understood
More informal approaches
often less demanding applications such as stemmers
sui generis algorithms developed for single language
typically more human effort
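The claim that all these cases can be modeled as relations on strings can be made concrete with a toy transducer. The sketch below hand-codes a tiny FST over a two-word lexicon; the states, arcs, and feature tags are invented for illustration, not drawn from any real analyzer:

```python
# Toy finite-state transducer mapping surface wordforms to
# lemma+features. Arcs: state -> {input_char: (output, next_state)}.
# Lexicon ("cat", "dog") and tags (+N+Sg, +N+Pl) are invented.
ARCS = {
    0: {"c": ("c", 1), "d": ("d", 10)},
    1: {"a": ("a", 2)},
    2: {"t": ("t", 3)},
    3: {"s": ("+N+Pl", 4)},
    10: {"o": ("o", 11)},
    11: {"g": ("g", 12)},
    12: {"s": ("+N+Pl", 4)},
}
# Final states carry the output emitted if the word ends there.
FINALS = {3: "+N+Sg", 4: "", 12: "+N+Sg"}

def analyze(word):
    """Follow the arcs, concatenating outputs; None if rejected."""
    state, out = 0, ""
    for ch in word:
        if ch not in ARCS.get(state, {}):
            return None
        emit, state = ARCS[state][ch]
        out += emit
    return out + FINALS[state] if state in FINALS else None

print(analyze("cats"))   # cat+N+Pl
print(analyze("dog"))    # dog+N+Sg
print(analyze("catz"))   # None
```

Inverting each arc's input/output pair turns the same machine into a synthesizer, which is the sense in which analysis and synthesis are one relation on strings.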
Probability distribution over alternative analyses
Can be listed for common forms
Approximated by a weighted transducer for rarer forms
Taggers
contextual disambiguation of the output of a morph analyzer
Several well-understood approaches
HMM
Brill
FST
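Of the tagging approaches listed, the HMM route is the easiest to illustrate. Below is a minimal Viterbi decoder over a two-tag toy model; all tags and probabilities are invented for illustration, not estimated from any corpus:

```python
import math

# Toy HMM tagger: two tags, invented start/transition/emission
# probabilities. Viterbi finds the most probable tag sequence.
TAGS = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"flies": 0.4, "time": 0.6},
        "V": {"flies": 0.7, "time": 0.3}}

def viterbi(words):
    """Return the highest-probability tag sequence for words."""
    # V[i][t] = (best log-prob of any path ending in tag t, that path)
    V = [{t: (math.log(start[t] * emit[t][words[0]]), [t]) for t in TAGS}]
    for w in words[1:]:
        row = {}
        for t in TAGS:
            p, path = max(
                (V[-1][s][0] + math.log(trans[s][t] * emit[t][w]),
                 V[-1][s][1])
                for s in TAGS)
            row[t] = (p, path + [t])
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi(["time", "flies"]))   # ['N', 'V']
```

This is exactly "contextual disambiguation": the emission table alone prefers V for "flies", and the transition table supplies the context that settles each token's tag in sequence.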
Transducers and taggers cost a lot
1-3 person-years of highly-skilled work per language
or more, from a standing start
e.g. without electronic dictionaries and tagged texts
Each language takes a long time
difficult to parallelize the work
Only about 15 languages have been “done”
coverage is uneven for these
natural rate of addition is now slow
only large, wealthy languages are likely candidates
Obvious solution: use machine learning
Several approaches have been tried
Others look interesting
Some results are promising, but…
no overall solutions so far
different languages, training sets, tests if any
no comparisons
of different methods on one problem
of one method on different problems
of quality in different applications
A test bench for morphology learning
For a dozen or so diverse languages
Text corpus (~1-10MW)
Tagged subcorpus (.1-1MW)
for training and/or testing
Broad-coverage analyzer/synthesizer
generating data for (semi-)supervised learning
“oracle” for active learning
Tagger
generating approximately correct tagged data
Unsupervised learning
lots of text as input (100M words or so)
wordform similarity and context similarities
stem classes or full decomposition
Promising performance on small subtasks
Yarowsky (2000)
Overall results (so far) are inadequate
Brent (1999), Goldsmith (2000)
100M-word corpus is unrealistic for most languages where induction is needed
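The stem-class idea behind unsupervised systems in the spirit of Goldsmith (2000) can be sketched in a few lines: try candidate stem/suffix splits, then group stems by the suffix set ("signature") they occur with. The wordlist, the length-3 stem cutoff, and the stems-must-be-words restriction below are toy simplifications:

```python
from collections import defaultdict

# Toy wordlist; a real run would use millions of wordforms.
words = {"walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking", "dog", "dogs"}

# Collect candidate suffixes for each candidate stem. Simplification:
# a stem must itself be an attested word ("" marks the bare form),
# and stems shorter than 3 characters are ignored.
suffixes = defaultdict(set)
for w in words:
    for i in range(3, len(w) + 1):
        stem, suf = w[:i], w[i:]
        if stem in words:
            suffixes[stem].add(suf)

# A "signature" is the exact suffix set a stem occurs with;
# group stems by signature.
signatures = defaultdict(list)
for stem, sufs in suffixes.items():
    signatures[frozenset(sufs)].append(stem)

# Report nontrivial signatures shared by at least two stems.
for sig, stems in signatures.items():
    if len(stems) > 1 and len(sig) > 1:
        print(sorted(sig), sorted(stems))   # ['', 'ed', 'ing', 's'] ['talk', 'walk']
```

On this toy data "walk" and "talk" fall into one stem class via the shared {∅, -s, -ed, -ing} signature, while "dog"/"dogs" does not, since no second stem shares its suffix set, which is the kind of evidence-thinness that makes huge corpora necessary.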
Supervised learning
No attempts (to our knowledge) to apply general inductive methods to the full problem
Some very small experiments with neural nets
Why not?
problem is hard (Gold’s theorem)
fully supervised case seems unrealistic
though small tagged corpus is cheap
production can be parallelized
Semi-supervised learning
Oflazer and Nirenburg (2000)
Elicit/build/test
Elicit paradigms
Build simple FST rules using Brill learner
Test and “see what the problems are”
Initial results seem promising
Quantitative evaluation?
Role of human judgment?
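The "build simple rules using a Brill learner" step can be illustrated by one round of transformation-based learning on invented data: start from a unigram baseline tagging, then search for the contextual retagging rule that most reduces error against a small gold standard. The sentence, tags, and single rule template below are hypothetical:

```python
from itertools import product

# Invented toy data: "the can rusted; the can holds".
sent_words = ["the", "can", "rusted", "the", "can", "holds"]
gold_tags  = ["DET", "N",  "V",      "DET", "N",  "V"]
# Unigram baseline: most-frequent tag per word (invented).
baseline   = {"the": "DET", "can": "V", "rusted": "V", "holds": "V"}

tags = ["DET", "N", "V"]
cur = [baseline[w] for w in sent_words]

def errors(pred):
    return sum(p != g for p, g in zip(pred, gold_tags))

# One rule template: "retag FRM as TO when the previous tag is PREV".
# Score every instantiation by error reduction; keep the best.
best = (0, None)
for frm, to, prev in product(tags, tags, tags):
    new = list(cur)
    for i in range(1, len(new)):
        if cur[i] == frm and cur[i - 1] == prev:
            new[i] = to
    gain = errors(cur) - errors(new)
    if gain > best[0]:
        best = (gain, (frm, to, prev))

print(best)   # (2, ('V', 'N', 'DET')): retag V as N after DET
```

A full Brill learner iterates this greedy step with many templates; the point here is that each learned rule is simple and context-local, which is what makes compiling the rule set into an FST straightforward.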
Set up a test bench to compare alternatives
for overall performance
for application suitability
Try new things
modifications of existing algorithms
new or untested algorithms
Get basic data on the problem
across languages and applications
MAT learning of FSAs
Angluin 1987
Exact learning in polynomial time and state space given “queries and counterexamples”
Becomes PAC learning given membership queries only
Generalized to “multiplicity automata” (Beimel et al. 1998)
Generalized to two-tape automata (Yokomori 2000)
In tests: transducer answers membership queries!
In real applications: humans would give the answers
Queries can often be run in parallel
might run nearly N times faster given N humans…
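The membership-query machinery at the heart of Angluin's MAT setting can be sketched as follows: the learner fills an observation table by querying an oracle, here a hand-coded acceptor for "even number of a's" standing in for the transducer (or human) that would answer in practice. This shows only the table-filling step, not the full L* loop with closedness checks and counterexamples:

```python
def member(s):
    """Membership oracle: does the target language accept s?
    Toy target: strings over {a, b} with an even number of a's."""
    return s.count("a") % 2 == 0

# L* observation table: rows are access strings S (and their one-symbol
# extensions), columns are distinguishing experiments E, and each cell
# is one oracle answer. S and E here are chosen by hand for the toy.
S = ["", "a"]
E = ["", "a"]

def row(s):
    # One row = |E| independent membership queries; since the queries
    # don't depend on each other, they could be asked in parallel.
    return tuple(member(s + e) for e in E)

table = {s: row(s) for s in S + [s + c for s in S for c in "ab"]}

# Distinct rows correspond to distinct states of the learned DFA.
states = set(table.values())
print(len(states))   # 2: the "even a's" and "odd a's" states
```

The independence of the queries inside each row is what makes the "N humans, nearly N times faster" observation above plausible: the learner's questions can be batched and farmed out.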
Prospective machine morphologists
Mark Liberman
Scott Weinstein
Robin Clark
David Embick
Steven Bird
Ian Ross
Paul Kingsbury
Na-Rae Han
[Michael Kearns]