IRCS/CCN Summer Workshop
June 2003
Speech Recognition
Task: available signals → model of the world around
  signals are mostly accidental, inadequate
  sometimes disguised or falsified
  always mixed-up and ambiguous
Reasoning about the source of signals:
  Integration of context: what do you expect?
  “Sensor fusion”: integration of vision, sound, smell, etc.
  Source (and noise) separation: there’s more than one thing out there
Variable perspective, source variation, etc.
  depends on the type of signal
  depends on the type of object
Much harder than chess or calculus!
Bayesian probability estimation
Thomas Bayes (1702-1761)
  Minister of the Presbyterian Chapel at Tunbridge Wells
  Amateur mathematician
  “Essay towards solving a problem in the doctrine of chances”, published (posthumously) in 1764
Crucial idea:
  background (prior) knowledge about the plausibility of different theories can be combined with knowledge about the relation of theories to evidence
  in a mathematically well-defined way
  even if all knowledge is uncertain
  to reason about the most likely explanation of the available evidence
Bayes’ theorem:
  “the most important equation in the history of mathematics” (?)
  a simple consequence of basic definitions, or
  a still-controversial recipe for the probability of alternative causes for a given event, or
  the implicit foundation of human reasoning, or
  a general framework for solving the problems of perception
Fundamental theorem of speech recognition
P(W|S) ∝ P(S|W) P(W)
  where W is “Word(s)” (i.e. message text)
  and S is “Sound(s)” (i.e. speech signal)
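Spelled out, this is just Bayes’ rule with the denominator dropped: the signal S is fixed once the utterance has been observed, so P(S) is a constant that cannot change which W wins. In standard notation (with Ŵ the decoder’s output):

```latex
P(W \mid S) \;=\; \frac{P(S \mid W)\,P(W)}{P(S)}
\;\propto\; P(S \mid W)\,P(W)
\qquad\Longrightarrow\qquad
\hat{W} \;=\; \arg\max_{W}\; P(S \mid W)\,P(W)
```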
“Noisy channel model” of communications engineering, due to Shannon 1949
New algorithms, especially relevant to speech recognition, due to L.E. Baum et al., ~1965-1970
Applied to speech recognition by Jim Baker (CMU PhD 1975) and Fred Jelinek (IBM speech group, 1975 onward)
Motivations for a Bayesian approach
A consistent framework for integrating previous experience and current evidence
A quantitative model for “abduction” = reasoning about the best explanation
A general method for turning a generative model into an analytic one (“analysis by synthesis”), helpful where |categories| << |signals|
Basic architecture of standard speech recognition technology
1. Bayes’ Rule: P(W|S) ∝ P(S|W) P(W)
2. Approximate P(S|W) P(W) as a Hidden Markov Model:
   a probabilistic function [to get P(S|W)]
   of a Markov chain [to get P(W)]
3. Use the Baum-Welch (= EM) algorithm to “learn” the HMM parameters
4. Use Viterbi decoding to find the most probable W given S, in terms of the estimated HMM
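The following is a minimal sketch of steps 2-4 at toy scale, for concreteness only: a discrete-output HMM whose made-up transition and emission tables stand in for P(W) and P(S|W), one Baum-Welch (EM) re-estimation pass built from the forward-backward probabilities, and Viterbi decoding of the most probable state path. None of the numbers are trained values.

```python
# Toy sketch (illustrative numbers only) of steps 2-4: a discrete-output HMM,
# one Baum-Welch (EM) re-estimation pass, and Viterbi decoding.
import numpy as np

# --- Step 2: HMM = a probabilistic function of a Markov chain ---------------
# States stand in for (sub-)word units W; observations are quantized acoustic
# frames S.  All parameters below are made up for the example.
pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3],                 # transition matrix: the P(W) part
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission matrix: the P(S|W) part
              [0.1, 0.3, 0.6]])
obs = np.array([0, 1, 2, 2, 1])           # a toy observation sequence

def forward(pi, A, B, obs):
    """alpha[t, j] = P(o_1..o_t, q_t = j)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# --- Step 3: one Baum-Welch (EM) re-estimation pass --------------------------
alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
likelihood = alpha[-1].sum()              # P(S | current model)
gamma = alpha * beta / likelihood         # P(q_t = i | S)
xi = (alpha[:-1, :, None] * A[None] *     # P(q_t = i, q_{t+1} = j | S)
      (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # new transitions
pi_new = gamma[0]                                          # new initial probs

# --- Step 4: Viterbi decoding (in log space, to avoid underflow) -------------
def viterbi(pi, A, B, obs):
    """Most probable state sequence given the observations."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]
        psi[t] = scores.argmax(axis=0)               # best predecessor of j
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

print("P(S | model):", likelihood)
print("re-estimated transitions:\n", A_new)
print("best state path:", viterbi(pi, A, B, obs))
```

In a real recognizer the states are triphone sub-states, the emissions are Gaussian mixtures over acoustic feature vectors, and the forward-backward computation is also done in log or scaled arithmetic, as listed in the details on the next slide.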
Other typical details:
Complex elaborations of the basic ideas
HMM states ← triphones ← words
  each triphone → 3-5 states + connection pattern
  phone sequence from a pronouncing dictionary
  clustering for estimation
Acoustic features
  RASTA-PLP etc.
  Vocal tract length normalization, speaker clustering
Output pdf for each state as a mixture of Gaussians
Language model as N-gram model over words (toy bigram sketch below)
  recency/topic effects
Empirical weighting of language vs. acoustic models
etc. etc.
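As a toy illustration of the N-gram language-model component (the P(W) term), here is a bigram model with add-one smoothing estimated from a made-up two-sentence corpus; the data, the smoothing choice, and the sentence markers are illustrative assumptions, not what production systems use (those rely on large corpora and smoothing schemes such as Katz or Kneser-Ney).

```python
# Toy bigram language model with add-one smoothing (illustrative data only).
from collections import Counter

corpus = ["<s> speech recognition is hard </s>",
          "<s> speech recognition is statistical </s>"]

history_counts, bigram_counts = Counter(), Counter()
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    history_counts.update(words[:-1])                  # counts of w as a history
    bigram_counts.update(zip(words[:-1], words[1:]))   # counts of (history, w)

def p_bigram(w, history):
    """P(w | history) with add-one (Laplace) smoothing."""
    return (bigram_counts[(history, w)] + 1) / (history_counts[history] + len(vocab))

def p_words(words):
    """P(w_1..w_n) approximated as a product of bigram probabilities."""
    p = 1.0
    for h, w in zip(words[:-1], words[1:]):
        p *= p_bigram(w, h)
    return p

print(p_words("<s> speech recognition is easy </s>".split()))
```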
Some limitations of the standard architecture
Problems with Markovian assumptions
Modeling trajectory effects
Variable coordination of articulatory dimensions
...