Regular readers of LL know that I've always been a partisan of automatic speech recognition technology, defending it against unfair attacks on its performance, as in the case of "ASR Elevator" (11/14/2010). But Chin-Hui Lee recently showed me the results of an interesting little experiment that he did with his student I-Fan Chen, which suggests a fair (or at least plausible) critique of the currently-dominant ASR paradigm. His interpretation, as I understand it, is that ASR technology has taken a wrong turn, or more precisely, has failed to explore adequately some important paths that it bypassed on the way to its current success.
In order to understand the experiment, you have to know a little something about how automatic speech recognition works. If you already know this stuff, you can skip the next few paragraphs. And if you want a deeper understanding, you can go off and read (say) Larry Rabiner's HMM tutorial, or some of the material available on the Wikipedia page.
Basically, we've got speech, and we want text. (This version of the problem is sometimes called "speech to text" (STT), to distinguish it from systems that derive meanings or some other representation besides standard text.) The algorithm for turning speech into text is a probabilistic one: we have a speech signal S, and for each hypothesis H about the corresponding text, we want to evaluate the conditional probability of H given S; and all (?) we need to do is to find the H for which P(H|S) is highest.
We solve this problem by applying Bayes' Theorem, which in this case tells us that

P(H|S) = P(S|H) P(H) / P(S)
Since P(S), the probability of the speech signal, is the same for all hypotheses H about the corresponding text, we can ignore the denominator, so that the quantity we want to maximize becomes

P(S|H) P(H)
This expression has two parts: P(S|H), the probability of the speech signal given the hypothesized text; and P(H), the a priori probability of the hypothesized text. In the parlance of the field, the term P(S|H) is called the "acoustic model", and the term P(H) is called the "language model". The standard implementation of the P(S|H) term is a so-called "Hidden Markov Model" (HMM), and the standard implementation of the P(H) term is an "n-gram language model". (We're ignoring many details here, such as how to find the word sequence that actually maximizes this expression — again, see some of the cited references if you want to know more.)
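To make the decision rule concrete, here is a minimal sketch in Python. The two candidate transcriptions and all of the probabilities are invented for illustration, and the `decode` helper is a hypothetical name; a real recognizer searches an enormous hypothesis space rather than scoring a short list.

```python
import math

def decode(acoustic_logprob, lm_logprob, hypotheses):
    """Return the hypothesis H that maximizes log P(S|H) + log P(H)."""
    return max(hypotheses, key=lambda h: acoustic_logprob[h] + lm_logprob[h])

# Invented scores for a single utterance S (not from any real system).
# Acoustic model: the two word strings sound almost alike.
acoustic = {
    "recognize speech": math.log(0.0020),
    "wreck a nice beach": math.log(0.0025),
}
# Language model: one word string is far more plausible a priori.
lm = {
    "recognize speech": math.log(1e-4),
    "wreck a nice beach": math.log(1e-7),
}

best = decode(acoustic, lm, list(acoustic))
acoustic_only = max(acoustic, key=acoustic.get)
print("with language model:", best)
print("acoustic score only:", acoustic_only)
```

With these made-up numbers the acoustic score alone slightly prefers the wrong string, and the prior P(H) is what tips the decision.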
It's well known that large-vocabulary continuous speech recognition is heavily dependent on the "language model" — which is entirely independent of the spoken input, representing simply an estimate of how likely the speaker is to say whatever. This is because simple n-gram language models massively reduce our uncertainty about what word was said next.
We can see this in Lee and Chen's experiment, which looked at the effect of varying the language-model component of a recognizer, while keeping the same acoustic models and the same training and testing materials. (For those skilled in the art, they used the classic WSJ0 SI84 training data, and the Nov92 Hub2-C1 5K test set, described at greater length in David S. Pallett et al., "1993 Benchmark Tests for the ARPA Spoken Language Program", and Francis Kubala et al., "The Hub and Spoke Paradigm for CSR Evaluation", both from the Proceedings of the Spoken Language Technology Workshop: March 6-8, 1994.)
| Model | Cross-entropy (bits) | Perplexity | Word Error Rate |
|---|---|---|---|
| 3-gram Language Model | 5.87 | 58 | 5.1% |
| 2-gram Language Model | 6.78 | 110 | 7.4% |
| 1-gram Language Model | 9.53 | 742 | 32.8% |
| No Language Model | 12.28 | 4987 | 69.2% |
Thus using a 3-gram language model, where the probability of a given word is conditioned on the two preceding words, yielded a 5.1% word error rate; a 2-gram language model, where a word's probability is conditioned on the previous word, yielded a 7.4% WER; a 1-gram language model, where just the various unconditioned probabilities of words were used, yielded a 32.8% error rate; and using no language model at all, so that every item in the 5,000-word vocabulary is equally likely in all positions, yielded a whopping 69.2% WER.
The 3-gram language model allows such a low error rate because it leaves us with relatively little uncertainty about the identity of the next word. In the particular dataset used for this experiment, the resulting 3-gram perplexity was about 58, meaning that (after seeing two words) there was as much left to be learned about the next word as if there were a vocabulary of 58 words all equally likely to occur — despite the fact that the actual vocabulary was about 5,000 words. (The dataset involved a selection of sentences from stories published in the Wall Street Journal, taking only those sentences made up of the commonest 5,000 words.)
The bigram perplexity was about 110, and the unigram perplexity about 742. (If you want to know more about such numbers and how they are calculated, look at the documentation for the SRI language modeling toolkit, which was actually used to generate them.)
If we take the log to the base 2 of these perplexities, we get the corresponding entropy, measured in bits.
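As a quick check of that conversion, the sketch below takes the perplexities reported above and computes log base 2 of each. The dictionary keys and the `entropy_bits` helper are illustrative names of my own, and because the reported perplexities are rounded, the results can differ from the reported entropies in the second decimal place.

```python
import math

# Perplexities as reported for the four conditions of the experiment.
perplexities = {
    "3-gram": 58,
    "2-gram": 110,
    "1-gram": 742,
    "no language model": 4987,
}

def entropy_bits(perplexity):
    """Entropy in bits is the base-2 logarithm of the perplexity."""
    return math.log2(perplexity)

entropies = {name: entropy_bits(p) for name, p in perplexities.items()}

for name, h in entropies.items():
    print(f"{name}: perplexity {perplexities[name]} -> {h:.2f} bits")
```

Note that in the no-language-model condition every word is equally likely, so the perplexity is simply the number of distinct words available (4987 here), and the entropy is log2 of that count.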
And there's an interestingly linear relationship between the entropies of the language models used and the logit of the resulting WER (i.e. log(WER/(1-WER))).
A different acoustic-model component would have somewhat different performance — the best reported results with the same trigram and bigram models on this dataset are somewhat better — but the overall relationship between entropy and error rate will remain the same, and performance on high-entropy speech recognition tasks will be poor, even with careful speech and good acoustic conditions.
This all seems reasonable enough — so why does Chin think that there's a problem? Well, there's good reason to think that human performance on high-entropy speech recognition tasks can sometimes remain pretty good.
Thus George R. Doddington and Barbara M. Hydrick, “High performance speaker‐independent word recognition”, J. Acoust. Soc. Am. 64(S1) 1978:
Speaker‐independent recognition of words spoken in isolation was performed using a very large vocabulary of over 26 000 words taken from the “Brown” data set. (Computational Analysis of Present‐Day American English by Kucera and Francis). After discarding 4% of the data judged to be spoken incorrectly, experimental recognition error rate was 2.3% (1.8% substitution and 0.5% rejection), with negligible difference in performance between male and female speakers. Experimental error rate for vocabulary subsets, ordered by frequency of usage, was 1.0% for the first 50 words, 0.8% for the first 120 words, and 1.2% error for the first 1500 words. An analysis of recognition errors and a discussion of ultimate performance limitations will be presented.
If we project the regression line (in the entropy versus logit(WER) plot from the Lee & Chen experiment) to a vocabulary of 26k words (entropy of 14.67 bits), we would predict a word error rate of 90.5% — which is a lot more than 2.3%.
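For readers who want to reproduce that projection, here is a rough sketch. It assumes an ordinary least-squares line fitted to the four (entropy, logit(WER)) points from the table, with the logit taken using natural logarithms; I don't know exactly how the original line was fitted, so treat the details as my assumptions. The `logit` and `fit_line` helpers are illustrative names.

```python
import math

# The four data points from the Lee & Chen table.
entropy = [5.87, 6.78, 9.53, 12.28]   # bits
wer = [0.051, 0.074, 0.328, 0.692]    # word error rate as a proportion

def logit(p):
    """Log-odds of a proportion p, using the natural logarithm."""
    return math.log(p / (1.0 - p))

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(entropy, [logit(w) for w in wer])

# A uniform choice among 26,000 words has entropy log2(26000) bits.
h_26k = math.log2(26000)

# Invert the logit to turn the projected log-odds back into an error rate.
predicted = 1.0 / (1.0 + math.exp(-(intercept + slope * h_26k)))

print(f"entropy of a 26k-word uniform vocabulary: {h_26k:.2f} bits")
print(f"projected WER: {predicted:.1%}")
```

Under these assumptions the projected error rate comes out at roughly 90%, in line with the figure quoted above. With only four points, the fit is of course a thin basis for extrapolating that far.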
Now, this projection is not at all reliable: isolated word recognition is easier than connected word recognition, especially when the words being connected include short monosyllabic function words that might be hypothesized to occur almost anywhere. But still, Chin's guess is that current ASR performance on the Doddington/Hydrick task would be quite poor — strikingly worse than human performance, and perhaps spectacularly so.
And he thinks that this striking human/machine divergence points to a basic flaw in the current standard approach to ASR. For his diagnosis of the problem, see his keynote address at Interspeech 2012.
I hope that before long, we'll be able to recreate something like the Doddington/Hydrick dataset: a high-entropy recognition task on which human and machine performance can be directly compared. If this comparison works out the way Chin thinks it will, the plausibility of his diagnosis and his prescription for action will be increased.
Update — Although everyone seems to agree that research on less diffuse acoustic models (and their use in ASR) would be a Good Thing, issues with the LM weight (= "fudge factor") and the insertion penalty make this experiment a less-than-ideal way to compare machine and human performance on a high-entropy task. Following up on some of the comments below, I-Fan Chen has done a more extensive series of experiments, whose results he reports here.
So at some point, I think we will go ahead and record a couple of test sets for either (1) a random isolated word task, or (2) a randomly-generated complex nominal task. More on this later…