IRCS/CCN Summer Workshop
June 2003
Speech Recognition
Task: available signals → model of the world around  
signals are mostly accidental, inadequate  
sometimes disguised or falsified  
always mixed up and ambiguous  
Reasoning about the source of signals:  
Integration of context: what do you expect?  
“Sensor fusion”: integration of vision, sound, smell etc.  
Source (and noise) separation: there’s more than one thing out there 

Variable perspective, source variation etc.  
depends on the type of signal  
depends on the type of object  
Much harder than chess or calculus! 
Bayesian probability estimation
Thomas Bayes (1702-1761)  
Minister of the Presbyterian Chapel
at Tunbridge Wells 

Amateur mathematician  
An Essay towards solving a Problem in the Doctrine of Chances, published (posthumously) in 1764 

Crucial idea:  
background (prior) knowledge about the plausibility of different theories can be combined with knowledge about the relation of theories to evidence 

in a mathematically well-defined way  
even if all knowledge is uncertain  
to reason about the most likely explanation of the available evidence  
Bayes’ theorem  
“the most important equation in the history of mathematics” (?)  
a simple consequence of basic definitions, or  
a still-controversial recipe for the probability of alternative causes for a given event, or  
the implicit foundation of human reasoning  
a general framework for solving the problems of perception 
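The "simple consequence of basic definitions" above can be written out in two lines, starting from the definition of conditional probability:

```latex
% Definition of conditional probability, applied in both directions:
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
% Both right-hand sides share the numerator P(A \cap B), so
P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
% and dividing by P(B) gives Bayes' theorem:
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```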
Fundamental theorem of speech recognition
P(W|S) ∝ P(S|W)P(W)  
where W is “Word(s)” (i.e. message text)  
S is “Sound(s)” (i.e. speech signal)  
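The rule P(W|S) ∝ P(S|W)P(W) can be exercised on a toy candidate list. This is a minimal sketch: the words, likelihoods, and priors below are invented illustrative numbers, not the output of any real acoustic or language model.

```python
# Sketch of P(W|S) ∝ P(S|W) P(W) over hypothetical word candidates.
# All numbers are made up for illustration.

# P(S|W): how well each candidate word explains the observed sound
likelihood = {"wreck": 0.30, "recognize": 0.25, "wrecker": 0.10}
# P(W): prior plausibility of each word (e.g. from a language model)
prior = {"wreck": 0.05, "recognize": 0.60, "wrecker": 0.01}

# Unnormalized posterior P(S|W) * P(W)
score = {w: likelihood[w] * prior[w] for w in likelihood}
# Normalize so the posteriors sum to 1 (the constant the ∝ hides)
total = sum(score.values())
posterior = {w: s / total for w, s in score.items()}

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Note how the prior overrules the raw acoustic score: "wreck" fits the sound slightly better, but "recognize" is far more plausible a priori and wins the posterior.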
“Noisy channel model” of communications engineering, due to Shannon 1949 

New algorithms, especially relevant to speech recognition  
due to L.E. Baum et al., ~1965-1970  
Applied to speech recognition by Jim Baker (CMU PhD 1975),  
Fred Jelinek (IBM speech group >>1975)  
Motivations for a Bayesian approach
A consistent framework for integrating
previous experience and current evidence 

A quantitative model for
“abduction” = reasoning about the best explanation 

A general method for turning a generative model into an analytic one = “analysis by synthesis”; helpful where categories << signals 
Basic architecture of standard speech recognition technology
1. Bayes’ Rule: P(W|S) ∝ P(S|W)P(W)  
2. Approximate P(S|W)P(W) as a Hidden Markov Model 

a probabilistic function [ to get P(S|W) ]  
of a Markov chain [ to get P(W) ]  
3. Use Baum/Welch (=EM) algorithm
to “learn” HMM parameters 

4. Use Viterbi decoding  
to find the most probable W given S  
in terms of the estimated HMM  
Other typical details:
Complex elaborations of the basic ideas
HMM states ← triphones ← words  
each triphone → 3-5 states + connection pattern  
phone sequence from pronouncing dictionary  
clustering for estimation  
Acoustic features  
RASTA-PLP etc.  
Vocal tract length normalization, speaker clustering  
Output pdf for each state as mixture of gaussians  
Language model as N-gram model over words  
recency/topic effects  
Empirical weighting of language vs. acoustic models  
etc. etc. 
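Two of the details above, the N-gram language model and the empirical weighting of language vs. acoustic scores, can be sketched together. The bigram table, probabilities, and weight below are invented for illustration; real systems tune the weight empirically on held-out data.

```python
import math

# Toy bigram language model plus empirical LM/acoustic weighting.
# All probabilities and scores are hypothetical.

bigram = {                        # P(word | previous word), made up
    ("<s>", "recognize"): 0.2,
    ("recognize", "speech"): 0.5,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.3,
    ("a", "nice"): 0.1,
    ("nice", "beach"): 0.2,
}

def lm_logprob(words):
    # Log P(W) under the bigram model; unseen bigrams get a tiny floor
    lp = 0.0
    for prev, w in zip(["<s>"] + words, words):
        lp += math.log(bigram.get((prev, w), 1e-6))
    return lp

def combined_score(acoustic_logprob, words, lm_weight=10.0):
    # Scale the LM log-probability by an empirically chosen weight
    # before adding it to the acoustic log-likelihood
    return acoustic_logprob + lm_weight * lm_logprob(words)

# The acoustic model slightly prefers the longer hypothesis,
# but the weighted language model overrules it
h1 = combined_score(-120.0, ["recognize", "speech"])
h2 = combined_score(-118.0, ["wreck", "a", "nice", "beach"])
print("h1 wins" if h1 > h2 else "h2 wins")
```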
Some limitations of the standard architecture
Problems with Markovian assumptions  
Modeling trajectory effects  
Variable coordination of articulatory dimensions  
....  