Morphology technology
Different applications -- different needs
stemmers
collapse all forms of a word by pairing each with a “stem”
for (CL)IR
for (aspects of) language modeling
analysis/synthesis
pair wordform with lemma+affixes/features (or more)
perhaps with disambiguation in textual context
for spelling correction
for pronunciation calculation
for language modeling
for information extraction
for translation
All cases can be modeled as relations on strings
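A minimal sketch of the “relations on strings” view: a few hypothetical suffix rules pair a wordform with candidate (lemma, features) analyses. The rules and feature names below are illustrative only; real systems encode such relations as FSTs.

```python
# Morphology as a relation on strings: each rule relates a surface
# suffix to a lemma-final string plus morphological features.
# These toy English rules are assumptions for illustration.

RULES = [
    ("ies", "y", {"num": "pl"}),     # ponies -> pony
    ("s",   "",  {"num": "pl"}),     # cats   -> cat
    ("ed",  "",  {"tense": "past"}), # walked -> walk
    ("",    "",  {}),                # identity: the form is its own lemma
]

def analyze(wordform):
    """Return all (lemma, features) pairs licensed by the suffix rules."""
    out = []
    for suffix, repl, feats in RULES:
        if wordform.endswith(suffix):
            lemma = wordform[: len(wordform) - len(suffix)] + repl
            out.append((lemma, feats))
    return out

print(analyze("ponies"))
# includes ('pony', {'num': 'pl'}) among other candidate analyses
```

Note that the relation is one-to-many: “ponies” gets several candidate analyses, which is exactly why contextual disambiguation (tagging) is needed downstream.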

Hand-built engines
Well-understood FST technology
broad coverage, accurate, efficient
1-2 person-years per language
native speaker/linguist/programmer combination
application adaptation is also understood
More informal approaches
often less demanding applications such as stemmers
sui generis algorithms developed for single language
typically more human effort

Related technology
Probability distribution over alternative analyses
Can be listed for common forms
Approximated by a weighted transducer for rarer forms
Taggers
 contextual disambiguation
of the output of a morph analyzer
Several well-understood approaches
HMM
Brill
FST
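The HMM approach to contextual disambiguation can be sketched with a tiny Viterbi decoder; all probabilities below are made-up toy numbers, not estimates from data.

```python
# Minimal Viterbi sketch for HMM tagging: choose the most probable
# tag sequence given transition and emission probabilities.
# The two-tag model and all numbers are assumptions for illustration.

def viterbi(words, tags, start, trans, emit):
    V = [{t: start[t] * emit[t].get(words[0], 1e-6) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p] * trans[p][t])
            col[t] = V[-1][best_prev] * trans[best_prev][t] * emit[t].get(w, 1e-6)
            ptr[t] = best_prev
        V.append(col)
        back.append(ptr)
    last = max(tags, key=lambda t: V[-1][t])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.4, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.5}}
print(viterbi(["dogs", "bark"], tags, start, trans, emit))  # -> ['N', 'V']
```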

So what’s the problem?
Transducers and taggers cost a lot
1-3 person-years of highly-skilled work per language
or more, from a standing start
e.g. without electronic dictionaries and tagged texts
Each language takes a long time
difficult to parallelize the work
Only about 15 languages have been “done”
coverage is uneven for these
natural rate of addition is now slow
only large, wealthy languages are likely candidates

Obvious solution:
 use machine learning
Several approaches have been tried
Others look interesting
Some results are promising, but…
no overall solutions so far
different languages, training sets, tests if any
no comparisons
of different methods on one problem
of one method on different problems
of quality in different applications

A test bench
for morphology learning
For a dozen or so diverse languages
Text corpus (~1-10MW)
Tagged subcorpus (0.1-1MW)
for training and/or testing
Broad-coverage analyzer/synthesizer
generating data for (semi-)supervised learning
“oracle” for active learning
Tagger
generating approximately correct tagged data
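The scoring side of such a test bench might look like the loop below; `oracle` and `learned` are hypothetical stand-ins for the broad-coverage analyzer and a learned analyzer.

```python
# Sketch of a test-bench scoring loop: compare a learned analyzer
# against the broad-coverage analyzer used as an oracle.
# Both analyzers here are toy stand-ins, not real systems.

def accuracy(wordforms, oracle, learned):
    """Fraction of wordforms where the learner matches the oracle."""
    correct = sum(1 for w in wordforms if learned(w) == oracle(w))
    return correct / len(wordforms)

# Toy stand-ins: the oracle strips a final "s"; the learner only
# handles words it has memorized.
oracle = lambda w: w[:-1] if w.endswith("s") else w
memo = {"cats": "cat"}
learned = lambda w: memo.get(w, w)

print(accuracy(["cats", "dogs", "cat"], oracle, learned))  # -> 2/3
```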

Approaches I
Unsupervised learning
lots of text as input (100M words or so)
wordform similarity and context similarities
stem classes or full decomposition
Promising performance on small subtasks
Yarowsky (2000)
Overall results (so far) are inadequate
Brent (1999), Goldsmith (2000)
100M-word corpus is unrealistic for most languages where induction is needed
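The signature idea behind this line of work (in the spirit of Goldsmith's approach) can be sketched in a few lines: split each word at every point and keep stems that recur with multiple suffixes. This is a deliberately naive sketch, not the published algorithm.

```python
# Rough sketch of signature-based morphology induction: a "signature"
# is the set of suffixes a candidate stem appears with; stems that
# take more than one suffix suggest real morphological structure.

from collections import defaultdict

def signatures(words):
    stem_suffixes = defaultdict(set)
    for w in words:
        for i in range(1, len(w) + 1):
            stem_suffixes[w[:i]].add(w[i:])
    # Keep only stems observed with more than one suffix,
    # e.g. walk + {"", "s", "ed"}.
    return {stem: sufs for stem, sufs in stem_suffixes.items()
            if len(sufs) > 1}

words = ["walk", "walks", "walked", "talk", "talks"]
sigs = signatures(words)
print(sigs["walk"])  # -> {'', 's', 'ed'}
```

Real systems add an MDL or probabilistic criterion to choose among the many spurious splits this naive version keeps.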

Approaches II
Supervised learning
No attempts (to our knowledge) to apply general inductive methods to the full problem
Some very small experiments with neural nets
Why not?
problem is hard (Gold’s theorem)
fully supervised case seems unrealistic
though small tagged corpus is cheap
production can be parallelized

Approaches III
Semi-supervised learning
Oflazer and Nirenburg (2000)
Elicit/build/test
Elicit paradigms
Build simple FST rules using Brill learner
Test and “see what the problems are”
Initial results seem promising
Quantitative evaluation?
Role of human judgment?

Our approach
Set up a test bench to compare alternatives
for overall performance
for application suitability
Try new things
modifications of existing algorithms
new or untested algorithms
Get basic data on the problem
across languages and applications

An interesting idea
MAT learning of FSAs
Angluin 1987
Exact learning in polynomial time and state space
given “queries and counterexamples”
Becomes PAC learning given membership queries only
Generalized to “multiplicity automata” (Beimel et al. 1998)
Generalized to two-tape automata (Yokomori 2000)
In tests: transducer answers membership queries!
In real applications: humans would give the answers
Queries can often be run in parallel
might run nearly N times faster given N humans…
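A compact sketch of Angluin-style L* learning from membership queries, with a brute-force equivalence check standing in for the teacher. Here the "oracle" is just a Python predicate, playing the role the slides assign to a transducer (in tests) or a human (in applications); the target language and the counterexample-handling variant (adding counterexample suffixes to the distinguishing set) are assumptions for illustration.

```python
# L*-style learning of a DFA over {a, b} from membership queries.
# Target language (the membership oracle): even number of a's.

from itertools import product

ALPHABET = "ab"

def member(w):
    return w.count("a") % 2 == 0

def row(s, E):
    return tuple(member(s + e) for e in E)

def lstar(max_len=6):
    S, E = [""], [""]                     # prefixes, distinguishing suffixes
    while True:
        # Close the table: every one-letter extension of a prefix in S
        # must have a row already represented in S.
        changed = True
        while changed:
            changed = False
            rows = {row(s, E) for s in S}
            for s in list(S):
                for a in ALPHABET:
                    if row(s + a, E) not in rows:
                        S.append(s + a)
                        rows.add(row(s + a, E))
                        changed = True
        # Hypothesis DFA: states are the distinct rows.
        states = {}
        for s in S:
            states.setdefault(row(s, E), s)
        def accepts(w):
            cur = ""
            for a in w:
                cur = states[row(cur + a, E)]
            return member(cur)
        # Equivalence query by brute force over short strings.
        cex = None
        for n in range(max_len + 1):
            for tup in product(ALPHABET, repeat=n):
                w = "".join(tup)
                if accepts(w) != member(w):
                    cex = w
                    break
            if cex:
                break
        if cex is None:
            return accepts
        # Add all suffixes of the counterexample to E, then re-close.
        for i in range(len(cex) + 1):
            if cex[i:] not in E:
                E.append(cex[i:])

learned = lstar()
print(learned("abab"), learned("ab"))
```

For the morphology setting, `member` would be replaced by queries to a reference transducer or a human, and batches of membership queries could be farmed out in parallel, as the slide notes.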

Prospective
machine morphologists
Mark Liberman
Scott Weinstein
Robin Clark
David Embick
Steven Bird
Ian Ross
Paul Kingsbury
Na-Rae Han
[Michael Kearns]