|
|
|
|
|
|
Different applications -- different needs |
|
stemmers |
|
collapse all inflected forms of a word by pairing each with a “stem” |
|
for (CL)IR |
|
for (aspects of) language modeling |
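As an illustration of the stemming idea, here is a toy suffix-stripping stemmer; the suffix list and the minimum-stem-length cutoff are assumptions made up for the example, not any particular published algorithm.

```python
# Toy suffix-stripping stemmer: collapses inflected forms onto one stem.
# The suffix list and the length-3 cutoff are illustrative assumptions.
SUFFIXES = ["ations", "ation", "ings", "ing", "ed", "es", "s"]

def stem(word: str) -> str:
    """Strip the longest known suffix, keeping a stem of length >= 3."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

For IR indexing this is all that is needed: "walking", "walked", and "walks" all index under the single stem "walk", with no claim about the linguistic analysis.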
|
analysis/synthesis |
|
pair each wordform with its lemma plus affixes/features (or more) |
|
perhaps with disambiguation in textual context |
|
for spelling correction |
|
for pronunciation calculation |
|
for language modeling |
|
for information extraction |
|
for translation |
|
All cases can be modeled as relations on strings |
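To make “relations on strings” concrete, the sketch below represents a tiny morphological relation as an explicit set of pairs; an FST is just a finite-state encoding of such a relation, readable in either direction. The entries and tag names are invented for illustration.

```python
# A morphological relation as an explicit set of string pairs.
# An FST is a finite-state encoding of exactly this kind of relation;
# the entries and tag inventory below are invented examples.
RELATION = {
    ("walked", "walk+Verb+Past"),
    ("walks",  "walk+Verb+Pres+3sg"),
    ("geese",  "goose+Noun+Pl"),
}

def analyze(surface):
    """Analysis: read the relation surface-to-analysis."""
    return {a for s, a in RELATION if s == surface}

def generate(analysis):
    """Synthesis: the same relation, read in the other direction."""
    return {s for s, a in RELATION if a == analysis}
```

Analysis and synthesis are the two reading directions of one relation, which is why a single transducer serves both applications.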
|
|
|
|
|
|
|
|
|
|
|
|
Well-understood FST technology |
|
broad coverage, accurate, efficient |
|
1-2 person-years per language |
|
native speaker/linguist/programmer combination |
|
application adaptation is also understood |
|
More informal approaches |
|
often target less demanding applications such as stemmers |
|
sui generis algorithms developed for a single language |
|
typically more human effort |
|
|
|
|
|
|
Probability distribution over alternative analyses |
|
Can be listed for common forms |
|
Approximated by a weighted transducer for rarer forms |
|
Taggers |
|
contextual disambiguation of the output of a morphological analyzer |
|
Several well-understood approaches |
|
HMM |
|
Brill |
|
FST |
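Of these, the HMM approach is the easiest to sketch: a Viterbi decoder picks the most probable tag sequence given transition and emission probabilities, thereby disambiguating the analyzer's alternatives in context. The two-tag model and all numbers below are made up to show the shape of the computation.

```python
# Toy HMM tagger: Viterbi decoding over a two-tag model.
# Tags, probabilities, and the tiny lexicon are invented for illustration.
TAGS = ["N", "V"]
TRANS = {("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
EMIT = {("N", "saw"): 0.2, ("V", "saw"): 0.5,
        ("N", "wood"): 0.6, ("V", "wood"): 0.1}

def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    # prev maps a tag to (best path probability, that path).
    prev = {"<s>": (1.0, [])}
    for w in words:
        cur = {}
        for t in TAGS:
            cur[t] = max(
                (p * TRANS.get((pt, t), 1e-9) * EMIT.get((t, w), 1e-9),
                 path + [t])
                for pt, (p, path) in prev.items()
            )
        prev = cur
    return max(prev.values())[1]
```

In "saw wood", the transition probabilities override the ambiguous emission for "saw", which is exactly the contextual disambiguation a tagger provides.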
|
|
|
|
|
|
Transducers and taggers cost a lot |
|
1-3 person-years of highly skilled work per language |
|
or more, from a standing start |
|
e.g. without electronic dictionaries and tagged texts |
|
Each language takes a long time |
|
difficult to parallelize the work |
|
Only about 15 languages have been “done” |
|
coverage is uneven for these |
|
natural rate of addition is now slow |
|
only large, wealthy languages are likely candidates |
|
|
|
|
|
|
Several approaches have been tried |
|
Others look interesting |
|
Some results are promising, but… |
|
no overall solutions so far |
|
different languages, training sets, and tests (if any) |
|
no comparisons |
|
of different methods on one problem |
|
of one method on different problems |
|
of quality in different applications |
|
|
|
|
|
|
|
|
For a dozen or so diverse languages |
|
Text corpus (~1-10MW) |
|
Tagged subcorpus (0.1-1MW) |
|
for training and/or testing |
|
Broad-coverage analyzer/synthesizer |
|
generating data for (semi-)supervised learning |
|
“oracle” for active learning |
|
Tagger |
|
generating approximately correct tagged data |
|
|
|
|
|
|
|
Unsupervised learning |
|
lots of text as input (100M words or so) |
|
wordform similarity and context similarities |
|
stem classes or full decomposition |
|
Promising performance on small subtasks |
|
Yarowsky (2000) |
|
Overall results (so far) are inadequate |
|
Brent (1999), Goldsmith (2000) |
|
a 100M-word corpus is unrealistic for most languages where induction is needed |
|
|
|
|
|
|
|
|
|
Supervised learning |
|
No attempts (to our knowledge) to apply general inductive methods to the full problem |
|
Some very small experiments with neural nets |
|
Why not? |
|
problem is hard (Gold’s theorem) |
|
fully supervised case seems unrealistic |
|
though small tagged corpus is cheap |
|
production can be parallelized |
|
|
|
|
|
|
|
|
|
Semi-supervised learning |
|
Oflazer and Nirenburg (2000) |
|
Elicit/build/test |
|
Elicit paradigms |
|
Build simple FST rules using Brill learner |
|
Test and “see what the problems are” |
|
Initial results seem promising |
|
Quantitative evaluation? |
|
Role of human judgment? |
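The elicitation step can be pictured as filling a paradigm table with a speaker and then inflecting unseen stems by analogy; a minimal sketch, using a made-up paradigm for a hypothetical agglutinative language (all names and affixes are invented):

```python
# Elicited paradigm: feature bundle -> affix (invented data for a
# hypothetical agglutinative language).
PARADIGM = {"Nom.Sg": "", "Nom.Pl": "lar", "Loc.Sg": "da"}

def inflect(stem):
    """'Build' step: generate the full paradigm for a new stem by analogy."""
    return {feats: stem + suffix for feats, suffix in PARADIGM.items()}

def find_problems(stem, corpus_vocab):
    """'Test' step: predicted forms not attested in the corpus."""
    return {f: w for f, w in inflect(stem).items() if w not in corpus_vocab}
```

The unattested forms returned by the test step are where "see what the problems are" happens, feeding the next round of elicitation.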
|
|
|
|
|
|
|
Set up a test bench to compare alternatives |
|
for overall performance |
|
for application suitability |
|
Try new things |
|
modifications of existing algorithms |
|
new or untested algorithms |
|
Get basic data on the problem |
|
across languages and applications |
|
|
|
|
|
MAT learning of FSAs |
|
Angluin 1987 |
|
Exact learning, in time polynomial in the target automaton's state count, given “queries and counterexamples” (membership and equivalence queries) |
|
Becomes PAC learning given membership queries only |
|
Generalized to “multiplicity automata” (Beimel et al. 1998) |
|
Generalized to two-tape automata (Yokomori 2000) |
|
In tests: transducer answers membership queries! |
|
In real applications: humans would give the answers |
|
Queries can often be run in parallel |
|
might run nearly N times faster given N humans… |
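The query setup can be sketched as an oracle interface: in tests an existing transducer answers membership queries, in deployment humans would, and batches of independent queries can be farmed out in parallel. The function names and the tiny relation below are assumptions for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in oracle: in tests an existing analyzer answers membership
# queries; in real use these calls would go to human informants.
KNOWN = {("walked", "walk+Verb+Past"), ("walks", "walk+Verb+Pres+3sg")}

def transducer_oracle(pair):
    """Membership query: is (surface, analysis) in the relation?"""
    return pair in KNOWN

def membership_queries(pairs, oracle, n_workers=4):
    """Batch independent queries; with N human answerers this
    approaches N-fold parallelism."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(oracle, pairs))
```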
|
|
|
|
|
|
Mark Liberman |
|
Scott Weinstein |
|
Robin Clark |
|
David Embick |
|
Steven Bird |
|
Ian Ross |
|
Paul Kingsbury |
|
Na-Rae Han |
|
[Michael Kearns] |
|
|
|
|
|