IRCS/CCN Summer Workshop
June 2003
Speech Recognition
Task: available signals → model of the world around
  signals are mostly accidental, inadequate
  sometimes disguised or falsified
  always mixed-up and ambiguous
Reasoning about the source of signals:
  Integration of context: what do you expect?
  “Sensor fusion”: integration of vision, sound, smell, etc.
  Source (and noise) separation: there’s more than one thing out there
Variable perspective, source variation, etc.
  depends on the type of signal
  depends on the type of object
Much harder than chess or calculus!
Bayesian probability estimation
Thomas Bayes (1702-1761)
  Minister of the Presbyterian Chapel at Tunbridge Wells
  Amateur mathematician
  “Essay towards solving a problem in the doctrine of chances”, published (posthumously) in 1764
Crucial idea:
  background (prior) knowledge about the plausibility of different theories can be combined with knowledge about the relation of theories to evidence
  in a mathematically well-defined way
  even if all knowledge is uncertain
  to reason about the most likely explanation of the available evidence
Bayes’ theorem:
  “the most important equation in the history of mathematics” (?)
  a simple consequence of basic definitions, or
  a still-controversial recipe for the probability of alternative causes for a given event, or
  the implicit foundation of human reasoning, or
  a general framework for solving the problems of perception
Fundamental theorem of speech recognition
P(W|S) ∝ P(S|W) P(W)
  where W is “Word(s)” (i.e. message text)
  and S is “Sound(s)” (i.e. speech signal)
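Spelled out, this is just Bayes’ rule with the denominator dropped: the signal S is fixed once the utterance has been observed, so P(S) is a constant that cannot change which W wins. In standard notation (with Ŵ the decoder’s output):

```latex
P(W \mid S) \;=\; \frac{P(S \mid W)\,P(W)}{P(S)}
\;\propto\; P(S \mid W)\,P(W)
\qquad\Longrightarrow\qquad
\hat{W} \;=\; \arg\max_{W}\; P(S \mid W)\,P(W)
```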
“Noisy channel model” of communications engineering, due to Shannon 1949
New algorithms, especially relevant to speech recognition, due to L.E. Baum et al., ~1965-1970
Applied to speech recognition by Jim Baker (CMU PhD 1975) and Fred Jelinek (IBM speech group, 1975 onward)
Motivations for a Bayesian approach
A consistent framework for integrating previous experience and current evidence
A quantitative model for “abduction” = reasoning about the best explanation
A general method for turning a generative model into an analytic one (“analysis by synthesis”), helpful where |categories| << |signals|
Basic architecture of standard speech recognition technology
1. Bayes’ Rule: P(W|S) ∝ P(S|W) P(W)
2. Approximate P(S|W) P(W) as a Hidden Markov Model:
   a probabilistic function [to get P(S|W)]
   of a Markov chain [to get P(W)]
3. Use the Baum-Welch (= EM) algorithm to “learn” the HMM parameters
4. Use Viterbi decoding to find the most probable W given S, in terms of the estimated HMM
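The following is a minimal sketch of steps 2-4 at toy scale, for concreteness only: a discrete-output HMM whose made-up transition and emission tables stand in for P(W) and P(S|W), one Baum-Welch (EM) re-estimation pass built from the forward-backward probabilities, and Viterbi decoding of the most probable state path. None of the numbers are trained values.

```python
# Toy sketch (illustrative numbers only) of steps 2-4: a discrete-output HMM,
# one Baum-Welch (EM) re-estimation pass, and Viterbi decoding.
import numpy as np

# --- Step 2: HMM = a probabilistic function of a Markov chain ---------------
# States stand in for (sub-)word units W; observations are quantized acoustic
# frames S.  All parameters below are made up for the example.
pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3],                 # transition matrix: the P(W) part
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission matrix: the P(S|W) part
              [0.1, 0.3, 0.6]])
obs = np.array([0, 1, 2, 2, 1])           # a toy observation sequence

def forward(pi, A, B, obs):
    """alpha[t, j] = P(o_1..o_t, q_t = j)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# --- Step 3: one Baum-Welch (EM) re-estimation pass --------------------------
alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
likelihood = alpha[-1].sum()              # P(S | current model)
gamma = alpha * beta / likelihood         # P(q_t = i | S)
xi = (alpha[:-1, :, None] * A[None] *     # P(q_t = i, q_{t+1} = j | S)
      (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # new transitions
pi_new = gamma[0]                                          # new initial probs

# --- Step 4: Viterbi decoding (in log space, to avoid underflow) -------------
def viterbi(pi, A, B, obs):
    """Most probable state sequence given the observations."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]
        psi[t] = scores.argmax(axis=0)               # best predecessor of j
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

print("P(S | model):", likelihood)
print("re-estimated transitions:\n", A_new)
print("best state path:", viterbi(pi, A, B, obs))
```

In a real recognizer the states are triphone sub-states, the emissions are Gaussian mixtures over acoustic feature vectors, and the forward-backward computation is also done in log or scaled arithmetic, as listed in the details on the next slide.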
Other typical details:
Complex elaborations of the basic ideas
HMM states ← triphones ← words
  each triphone → 3-5 states + connection pattern
  phone sequence from a pronouncing dictionary
  clustering for estimation
Acoustic features
  RASTA-PLP etc.
  Vocal tract length normalization, speaker clustering
Output pdf for each state as a mixture of Gaussians
Language model as N-gram model over words (toy bigram sketch below)
  recency/topic effects
Empirical weighting of language vs. acoustic models
etc. etc.
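As a toy illustration of the N-gram language-model component (the P(W) term), here is a bigram model with add-one smoothing estimated from a made-up two-sentence corpus; the data, the smoothing choice, and the sentence markers are illustrative assumptions, not what production systems use (those rely on large corpora and smoothing schemes such as Katz or Kneser-Ney).

```python
# Toy bigram language model with add-one smoothing (illustrative data only).
from collections import Counter

corpus = ["<s> speech recognition is hard </s>",
          "<s> speech recognition is statistical </s>"]

history_counts, bigram_counts = Counter(), Counter()
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    history_counts.update(words[:-1])                  # counts of w as a history
    bigram_counts.update(zip(words[:-1], words[1:]))   # counts of (history, w)

def p_bigram(w, history):
    """P(w | history) with add-one (Laplace) smoothing."""
    return (bigram_counts[(history, w)] + 1) / (history_counts[history] + len(vocab))

def p_words(words):
    """P(w_1..w_n) approximated as a product of bigram probabilities."""
    p = 1.0
    for h, w in zip(words[:-1], words[1:]):
        p *= p_bigram(w, h)
    return p

print(p_words("<s> speech recognition is easy </s>".split()))
```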
Some limitations of the standard architecture
Problems with Markovian assumptions
Modeling trajectory effects
Variable coordination of articulatory dimensions
...