IRCS/CCN Summer Workshop
June 2003
Speech Recognition
Task: available signals → model of the world around  
signals are mostly accidental, inadequate  
sometimes disguised or falsified  
always mixed up and ambiguous  
Reasoning about the source of signals:  
Integration of context: what do you expect?  
“Sensor fusion”: integration of vision, sound, smell etc.  
Source (and noise) separation: there’s more than one thing out there 

Variable perspective, source variation etc.  
depends on the type of signal  
depends on the type of object  
Much harder than chess or calculus! 
Bayesian probability estimation
Thomas Bayes (1702-1761)  
Minister of the Presbyterian Chapel
at Tunbridge Wells 

Amateur mathematician  
An Essay towards solving a Problem in the Doctrine of Chances, published (posthumously) in 1764 

Crucial idea:  
background (prior) knowledge about the plausibility of different theories can be combined with knowledge about the relation of theories to evidence 

in a mathematically well-defined way  
even if all knowledge is uncertain  
to reason about the most likely explanation of the available evidence  
Bayes’ theorem  
“the most important equation in the history of mathematics” (?)  
a simple consequence of basic definitions, or  
a still-controversial recipe for the probability of alternative causes for a given event, or  
the implicit foundation of human reasoning  
a general framework for solving the problems of perception 
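The "simple consequence of basic definitions" above can be written out in two lines, starting from the definition of conditional probability:

```latex
% Definition of conditional probability, applied in both directions:
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
% Both right-hand sides share the numerator P(A \cap B), so
P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
% and dividing by P(B) gives Bayes' theorem:
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```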
Fundamental theorem of speech recognition
P(W|S) ∝ P(S|W)P(W)  
where W is “Word(s)” (i.e. message text)  
S is “Sound(s)” (i.e. speech signal)  
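The rule P(W|S) ∝ P(S|W)P(W) can be exercised on a toy candidate list. This is a minimal sketch: the words, likelihoods, and priors below are invented illustrative numbers, not the output of any real acoustic or language model.

```python
# Sketch of P(W|S) ∝ P(S|W) P(W) over hypothetical word candidates.
# All numbers are made up for illustration.

# P(S|W): how well each candidate word explains the observed sound
likelihood = {"wreck": 0.30, "recognize": 0.25, "wrecker": 0.10}
# P(W): prior plausibility of each word (e.g. from a language model)
prior = {"wreck": 0.05, "recognize": 0.60, "wrecker": 0.01}

# Unnormalized posterior P(S|W) * P(W)
score = {w: likelihood[w] * prior[w] for w in likelihood}
# Normalize so the posteriors sum to 1 (the constant the ∝ hides)
total = sum(score.values())
posterior = {w: s / total for w, s in score.items()}

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Note how the prior overrules the raw acoustic score: "wreck" fits the sound slightly better, but "recognize" is far more plausible a priori and wins the posterior.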
“Noisy channel model” of communications engineering, due to Shannon 1949 

New algorithms, especially relevant to speech recognition  
due to L.E. Baum et al., ~1965-1970  
Applied to speech recognition by Jim Baker (CMU PhD 1975),  
Fred Jelinek (IBM speech group >>1975)  
Motivations for a Bayesian approach
A consistent framework for integrating
previous experience and current evidence 

A quantitative model for
“abduction” = reasoning about the best explanation 

A general method for turning a generative model into an analytic one = “analysis by synthesis”; helpful where categories << signals 
Basic architecture of standard speech recognition technology
1. Bayes’ Rule: P(W|S) ∝ P(S|W)P(W)  
2. Approximate P(S|W)P(W) as a Hidden Markov Model 

a probabilistic function [ to get P(S|W) ]  
of a Markov chain [ to get P(W) ]  
3. Use Baum/Welch (=EM) algorithm
to “learn” HMM parameters 

4. Use Viterbi decoding  
to find the most probable W given S  
in terms of the estimated HMM  
Other typical details:
Complex elaborations of the basic ideas
HMM states ← triphones ← words  
each triphone → 3-5 states + connection pattern  
phone sequence from pronouncing dictionary  
clustering for estimation  
Acoustic features  
RASTA-PLP etc.  
Vocal tract length normalization, speaker clustering  
Output pdf for each state as mixture of gaussians  
Language model as N-gram model over words  
recency/topic effects  
Empirical weighting of language vs. acoustic models  
etc. etc. 
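Two of the details above, the N-gram language model and the empirical weighting of language vs. acoustic scores, can be sketched together. The bigram table, probabilities, and weight below are invented for illustration; real systems tune the weight empirically on held-out data.

```python
import math

# Toy bigram language model plus empirical LM/acoustic weighting.
# All probabilities and scores are hypothetical.

bigram = {                        # P(word | previous word), made up
    ("<s>", "recognize"): 0.2,
    ("recognize", "speech"): 0.5,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.3,
    ("a", "nice"): 0.1,
    ("nice", "beach"): 0.2,
}

def lm_logprob(words):
    # Log P(W) under the bigram model; unseen bigrams get a tiny floor
    lp = 0.0
    for prev, w in zip(["<s>"] + words, words):
        lp += math.log(bigram.get((prev, w), 1e-6))
    return lp

def combined_score(acoustic_logprob, words, lm_weight=10.0):
    # Scale the LM log-probability by an empirically chosen weight
    # before adding it to the acoustic log-likelihood
    return acoustic_logprob + lm_weight * lm_logprob(words)

# The acoustic model slightly prefers the longer hypothesis,
# but the weighted language model overrules it
h1 = combined_score(-120.0, ["recognize", "speech"])
h2 = combined_score(-118.0, ["wreck", "a", "nice", "beach"])
print("h1 wins" if h1 > h2 else "h2 wins")
```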
Some limitations of the standard architecture
Problems with Markovian assumptions  
Modeling trajectory effects  
Variable coordination of articulatory dimensions  
....  