IRCS/CCN Summer Workshop
June 2003
Speech Recognition

Why is perception hard?
Task: available signals → model of the world around us
- signals are mostly accidental, inadequate
- sometimes disguised or falsified
- always mixed up and ambiguous
Reasoning about the source of signals:
- integration of context: what do you expect?
- “sensor fusion”: integration of vision, sound, smell, etc.
- source (and noise) separation: there’s more than one thing out there
- variable perspective, source variation, etc.
  - depends on the type of signal
  - depends on the type of object
Much harder than chess or calculus!

Bayesian probability estimation
Thomas Bayes (1702-1761)
- Minister of the Presbyterian Chapel at Tunbridge Wells
- Amateur mathematician
- “Essay towards solving a problem in the doctrine of chances”, published (posthumously) in 1764
Crucial idea: background (prior) knowledge about the plausibility of different theories
- can be combined with knowledge about the relation of theories to evidence
- in a mathematically well-defined way
- even if all knowledge is uncertain
- to reason about the most likely explanation of the available evidence
Bayes’ theorem: “the most important equation in the history of mathematics” (?)
- a simple consequence of basic definitions, or
- a still-controversial recipe for the probability of alternative causes for a given event, or
- the implicit foundation of human reasoning
- a general framework for solving the problems of perception
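The crucial idea can be sketched numerically: multiply each theory’s prior plausibility by the likelihood of the observed evidence under that theory, then renormalize. A toy Python illustration, with all numbers invented:

```python
# Toy Bayesian update: combine prior knowledge about two theories with the
# likelihood of some observed evidence under each. All numbers are invented.
priors = {"theory_A": 0.7, "theory_B": 0.3}        # P(theory): prior knowledge
likelihoods = {"theory_A": 0.2, "theory_B": 0.9}   # P(evidence | theory)

unnormalized = {t: priors[t] * likelihoods[t] for t in priors}
total = sum(unnormalized.values())                  # = P(evidence)
posterior = {t: p / total for t, p in unnormalized.items()}

# theory_B overtakes theory_A despite its lower prior:
# posterior["theory_B"] = 0.27 / 0.41 ≈ 0.66
```

Note that even though both the priors and the likelihoods are uncertain quantities, the combination is mathematically well defined, which is exactly the point of the slide.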

Fundamental theorem
of speech recognition
P(W|S) ∝ P(S|W)P(W)
where
- W is “Word(s)” (i.e. message text)
- S is “Sound(s)” (i.e. speech signal)
“Noisy channel model” of communications engineering
- due to Shannon 1949
New algorithms, especially relevant to speech recognition
- due to L.E. Baum et al., ~1965-1970
Applied to speech recognition by
- Jim Baker (CMU PhD 1975)
- Fred Jelinek (IBM speech group, >>1975)
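Noisy-channel decoding in miniature: choose the candidate transcription W that maximizes P(S|W)·P(W). The two probability tables below are invented for illustration; in a real recognizer P(S|W) comes from an acoustic model and P(W) from a language model.

```python
import math

# P(W): language model scores for two candidate transcriptions (invented)
language_model = {"recognize speech": 0.6, "wreck a nice beach": 0.4}
# P(S|W): acoustic match of the same sound S to each candidate (invented)
acoustic_model = {"recognize speech": 0.3, "wreck a nice beach": 0.35}

def decode():
    # argmax over W of log P(S|W) + log P(W);
    # the normalizer P(S) is the same for every W, so it can be ignored
    return max(language_model,
               key=lambda w: math.log(acoustic_model[w]) + math.log(language_model[w]))
```

Here the acoustics slightly favor “wreck a nice beach”, but the prior P(W) tips the decision the other way, which is why the proportionality above matters.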

Motivations for a Bayesian approach
- A consistent framework for integrating previous experience and current evidence
- A quantitative model for “abduction” = reasoning about the best explanation
- A general method for turning a generative model into an analytic one
  = “analysis by synthesis”, helpful where |categories| << |signals|

Basic architecture
of standard speech recognition technology
1. Bayes’ Rule: P(W|S) ∝ P(S|W)P(W)
2. Approximate P(S|W)P(W) as a Hidden Markov Model:
   a probabilistic function [to get P(S|W)]
   of a Markov chain [to get P(W)]
3. Use the Baum-Welch (= EM) algorithm to “learn” the HMM parameters
4. Use Viterbi decoding to find the most probable W given S, in terms of the estimated HMM

Other typical details:
Complex elaborations of the basic ideas
- HMM states ← triphones ← words
  - each triphone → 3-5 states + connection pattern
  - phone sequence from a pronouncing dictionary
  - clustering for estimation
- Acoustic features
  - RASTA-PLP etc.
  - vocal tract length normalization, speaker clustering
- Output pdf for each state as a mixture of Gaussians
- Language model as an N-gram model over words
  - recency/topic effects
- Empirical weighting of language vs. acoustic models
- etc., etc.
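One elaboration from the list above, the mixture-of-Gaussians output pdf, can be sketched in one dimension. The weights, means, and variances below are invented; real systems use multivariate mixtures over acoustic feature vectors.

```python
import math

def gaussian_pdf(x, mean, var):
    # density of a 1-D Gaussian with the given mean and variance
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, components):
    # components: list of (weight, mean, variance); weights sum to 1,
    # so the mixture is itself a proper probability density
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

# A two-component output density for one hypothetical HMM state:
state_output = [(0.5, -1.0, 1.0), (0.5, 2.0, 0.5)]
density_at_zero = gmm_pdf(0.0, state_output)
```

The multimodal shape is the point: a single Gaussian cannot model a state whose acoustic realizations cluster in several distinct regions of feature space.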

Some limitations
of the standard architecture
- Problems with Markovian assumptions
- Modeling trajectory effects
- Variable coordination of articulatory dimensions
- ....