You can help improve ASR


If you're a native speaker of English, and you have about an hour to spare, and the title of this post (or a small promised gift) convinces you to devote your spare hour to helping researchers improve automatic speech recognition, just pick one of these four links at random and follow the instructions: 1, 2, 3, 4.

[Update — the problem with the tests has been fixed — but more than 1,000 people have participated, and the server is saturated, so unless you've already started the experiment, please hold off for now!]

If you'd like a fuller explanation, read on.

It's often been observed that when computers perform "intelligent" tasks, their error patterns are different from those of humans, even if their performance is as good as or better. An interesting example came up on last night's Jeopardy! show, where Watson walloped its human adversaries, but seemed to miss an elementary geographic fact by answering "Toronto" to a question in the category "U.S. Cities" ("Watson Dominates ‘Jeopardy’ but Stumbles Over Geography", NYT Arts Beat 2/15/2011).

The most straightforward way to improve the performance of an AI system is to give it more training data. But systematic error analysis can also help, by identifying those areas where algorithmic or training changes would make the most difference. And one obvious target is precisely this class of obvious-to-humans mistakes.

When a system isn't very good, it's easy to get data for error analysis, since nearly all differences between human and machine judgment will be system errors. But as system performance improves, it starts to get hard to distinguish mistakes from disagreements, especially for tasks where human judgments are variable. One area where this problem has arisen, believe it or not, is automatic speech recognition (ASR).

There are two facts about speech recognition that are not widely understood.

First, jokes to the contrary, ASR systems have gotten to be pretty good at it.  For speaker-adapted English dictation applications, word error rates in the range of 5% are now common; and for constrained tasks, like digit-string recognition, word error rates are a small fraction of a percent.

Second, humans are often surprisingly bad at it. Specifically, in general transcription tasks, the typical "word error rate" for a comparison of two human transcripts is likely to be in the range of 1-10%, depending on audio quality, difficulty and familiarity of the material, the definition of "error", and the amount of transcriber training. (If the humans are journalists, alas, a 40-60% word error rate is fairly common…)
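In case the metric is unfamiliar: "word error rate" is the word-level edit distance (substitutions, insertions, and deletions) between a hypothesis transcript and a reference transcript, divided by the number of words in the reference. A minimal sketch in Python; real scoring tools such as NIST's sclite add alignment details and normalization options that this toy version omits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance,
    normalized by the length of the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") in a six-word reference: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Note that because insertions count against the hypothesis, WER can exceed 100%.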

This is not to say that ASR is a solved problem. ASR performance degrades rapidly with noisy or distorted input. Even on high-quality input, ASR performance on tasks that can't rely on contextual redundancies (e.g. random word sequences, or nonsense/unknown words, or topic areas far outside the training set) remains significantly worse than human performance. The performance of systems for English and other major languages depends on statistical analysis of billions of words of text and thousands of hours of transcribed speech, resources that are not available for most of the world's languages.

And even when ASR word error rate is similar to human inter-transcriber differences in percentage terms, the patterns of machine and human disagreements are often quite different. But in order to use a comparison of those patterns in error analysis, researchers need a large enough sample of human transcripts — that's where you (may) come in!

The request for people to participate in this task comes from Ioana Vasilescu, a friend at LIMSI ("Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur"), a CNRS laboratory in Orsay, near Paris. The speech group at LIMSI is one of the best in the world — they've been placing near the top in DARPA-sponsored speech technology evaluations for two decades.  But in the southern suburbs of Paris, recruiting native speakers of English is difficult, and so I offered to ask for volunteers on their behalf.

Ioana writes:

I'm looking for native English listeners with normal hearing, in France or abroad.  The test lasts about 45 minutes, but it can be completed in several sessions: participants can connect as many times as needed to complete the test.  The payment consists of a small computer device (a USB key, worth about 20 euros), which participants will receive upon request (by sending an email with their mailing address) at the end of the test.

I'll post a discussion of the paper that results from this research, when it's done.

[A further comment on Watson's geographical blooper…  It can often be hard to assign blame among the many components of a complex system.  The general nature of the flaw seems clear in the case of the "Toronto" flub — Watson doubtless knows a great deal about where Toronto is, but somehow failed to restrict its answer according to the wording of the question category.  However, given that Watson's developers have analyzed its responses — especially the wrong ones — to hundreds of thousands of past Jeopardy! clues, this could not have been something as simple as systematically failing to take the question category into account.  It seems more likely to have been a matter of inappropriate weighting of different constraints on the answer;  maybe we'll learn something more about what happened in this case from Watson's developers.]


  1. Nathaniel said,

    February 16, 2011 @ 10:37 am

    Perhaps Watson was confused by the Toronto in Ohio.

  2. Ben Zimmer said,

    February 16, 2011 @ 11:13 am

    You can read more about Watson's Toronto flub here, on the blog of Stephen Baker, who has written about the IBM challenge for his book Final Jeopardy.

  3. John Cowan said,

    February 16, 2011 @ 11:21 am

    Apparently Watson doesn't trust the categories anyway, as they are often wrong either technically (Catcher in the Rye under "U.S. novelists") or intentionally (the category "Country Clubs" turns out to mean blunt instruments used in various countries).

  4. Dan Lufkin said,

    February 16, 2011 @ 11:43 am

    I'll bet Watson went through a match of airports with war heroes and came up with Billy Bishop (Canadian WW I ace) along with Butch O'Hare (US WW II ace). Bishop gets many more Ghits than O'Hare. That may have outweighed Midway as the battle. I can't identify any other Toronto airport as being associated with a battle.

    Of course, the US part of the topic should have trumped Ghits.

  5. Dan Lufkin said,

    February 16, 2011 @ 11:51 am

    More on the topic of ASR. I wonder if we could have a show of hands from those LL denizens who use version 11 of Dragon. I use it all the time on non-constrained work and experience an error rate of below 1% at a speed of about 1000 words an hour. As a matter of fact, I'm dictating this on Dragon right now. Dragon is really at its best on complex technical vocabulary, particularly organic chemistry. It's also excellent with pharmaceuticals and anatomy. Dragon's vocabulary is truly jaw-dropping. As far as I'm concerned, ASR has already been solved for practical purposes.

  6. Peter said,

    February 16, 2011 @ 12:51 pm

    I went and did 1/4 of that test, and even going fast and choosing not to repeat half the samples, I see no way that it is humanly possible to finish in less than 2-3 hours, much less "45 minutes."

    [(myl) I think there may be a bug here — let me check with the author.

    Update — the problem seems to have been fixed.]

  7. Chris Travers said,

    February 16, 2011 @ 1:27 pm

    The discussion of journalist error rates brings to mind the "30,000 pigs" vs "30 sows and pigs" from a previous post.

    However, I'd like to bring up what I see as a major source of systematic error for ASR programs to date: distinguishing intended input from ambient speech. We humans do a lot of context tracking; if I pull the phone away from my ear and say something to my wife, the person on the other end usually recognizes this and ignores it. ASR systems invariably treat it as input. This could lead to bad things.

    Consider the following conversation of a hypothetical ASR wrapper for a UNIX shell.

    Operator: "C D slash"
    Computer: "executing cd /"
    Someone walks in…..
    Operator: "Hold on."
    Computer: "command 'hold' not found."
    Operator (to other person): "I'm in trouble."
    Computer: "Command Im not found"
    Other person: "What did you do?"
    Computer: "I didn't get that. Please repeat."
    Operator: "Rammed the sheriff, period."
    Computer: "Executing rm -rf ."

    (ooops…… Bye bye files……)

    It seems to me that this sort of context tracking is very hard to do in ASR, because we humans rely on both verbal and non-verbal cues (volume dropoff, tone of voice, etc.) which are missing from the machine's analysis. I don't know how to solve this, but I cannot imagine that it could happen after the speech-to-text recognition is complete.

  8. Peter Taylor said,

    February 16, 2011 @ 1:44 pm

    @Chris Travers, see also UserFriendly:

  9. Sili said,

    February 16, 2011 @ 2:23 pm

    "particularly organic chemistry"

    How about inorganic?

  10. You Can Help Improve Automatic Speech Recognition | Voxy Blog said,

    February 16, 2011 @ 3:58 pm

    […] of them is Mark Liberman's latest Language Log post encouraging native English speakers to help researchers improve automatic speech recognition (ASR). […]

  11. Kathleen said,

    February 16, 2011 @ 4:22 pm

    I have to agree with the poster who said the test was taking way longer than advertised. I'm a much faster typist than most, didn't listen to most samples more than once or twice, and was still in part I when I gave up after an hour and fifteen minutes.

    [(myl) There was a problem with the test that seems to have been fixed. In principle, it involves 4 segments of 55 items each. Since each item takes only a second or so to play and a few seconds to transcribe, the 220 total items should take about 55 minutes at 15 seconds per item.]

  12. Julie said,

    February 16, 2011 @ 4:40 pm

    The test was slow going, but I did my best to make it through the first part. I have a certificate in audio typing dated nineteen-eighty-something, and even then, motivated as I was to advance the cause of science, it was extremely slow and laborious to finish even the first section. It might be a good idea not to let me know quite how slowly I was progressing. Is there a reason it needs to be so many questions?

  13. Rubrick said,

    February 16, 2011 @ 5:09 pm

    a CNRS laboratory in Orsay, near Paris.

    This lab is a world leader in ASR, hence the ubiquitous "To verify your account info, press Orsay one."

    [(myl) Exactly.]

  14. Jerry Friedman said,

    February 17, 2011 @ 12:39 am

    @Rubrick: Thank you.

  15. [links] Link salad talks about politics for a change | said,

    February 17, 2011 @ 8:33 am

    […] You can help improve ASR — This is actually something I deal with in the Day Jobbe. […]

  16. Dan Lufkin said,

    February 17, 2011 @ 9:47 am

    @ Sili — I think Dragon's better performance on organic chemistry is because it works better on longer words in general, and organic chem holds several world records for everyday use of long words. It works well on inorganic chem, but the vocabulary is inherently sparser and the words shorter.

    I think that Dragon's decision tree keeps on going until it reaches an utterance boundary and then looks at Markov chains to resolve ambiguities. If you have a good long word, there isn't much left to disambiguate. Dragon works best if you give it sentence-length utterances; if you pause between words (as you had to do in earlier versions of ASR), it goes all kerflooey.

    You can feed Dragon's AI engine with corpuses of stuff you work on often and it very quickly learns to spot frequent clusters, after which it almost never makes a misrecognition (of that cluster). It's fascinating to watch Dragon at work, but most of the time I'm just along for the ride, kinda like the humans were on the Jeopardy expo.

  17. Read Weaver said,

    February 17, 2011 @ 3:20 pm

    Interesting to pay such close attention to speech, something I don't usually have reason to do. A contraction I hadn't noticed before: the hesitation form of "you know" to "ya'o" (vowels schwa & long-o)—though while hearing it I was aware it wasn't novel.

  18. Read Weaver said,

    February 17, 2011 @ 3:55 pm

    Hmm. I started the test before even noting the gift, so I'm not all that annoyed. But a bit disappointed that I didn't find a way to request the gift after going through the test.

  19. Ioana Vasilescu said,

    February 18, 2011 @ 6:49 am

    Dear all,

    Thank you very much for your interest in the perceptual test. We now have many transcriptions thanks to your participation, and I would like to thank all the readers and writers of Language Log. I will keep everybody posted with the results as soon as we process the transcriptions.

    There are a couple of things I would like to mention:

    1. We have between 10 and 15 complete transcriptions per test so far, and about 20 ongoing transcriptions for each test.
    If you have already started the test and would like to finish it, you can reconnect to your in-progress transcription with your email address. I would like to encourage people who have already started the test to finish it, especially when there are only 10 to 15 stimuli left: the hard work is already done!
    Otherwise, for at least two of the four tests, my colleague Dahbia decided to block new connections to give the current participants the opportunity to finish the test.

    2. We were quite optimistic about the duration of the test: we estimated about one hour to complete it in one session (I extrapolated from the time needed for a sibling test in French).
    I would like to apologize: some people spent up to 2 hours (in several sessions, though!!!).
    Maybe the reason is that in French we had mainly standard broadcast-news French, whereas here there are various accents and sources… There are also 20 more stimuli in English.
    I will compare the rates in the two languages (and with ASR systems) anyway, and I can keep you posted about this.

    3. Finally, I would be happy to send everybody interested an 8 GB USB key as a reward. So please fill in a mailing address in the final comments of the test, or email me (or Dahbia) a mailing address. I would only like to mention that it will take a couple of weeks, as we had more volunteers than hoped and I had to order additional keys…

    Thank you again and feel free to ask me questions about this work.

  20. Ray Dillinger said,

    February 18, 2011 @ 2:38 pm

    I have found it useful to adopt an affected hyper-consistent accent for use with ASR systems. It drops the error rate (with a trained system) to around 0.5% for unconstrained text. The best I could achieve with "normal" pronunciation was in the 2% – 3% range.

    I make a point of pronouncing almost every consonant and vowel in the spelling, even those normally elided in speech. I except letters that signify different pronunciations of other letters, like the 'gh' in right that signals a long rather than short i, the 'e' in judge that signals a soft rather than hard g, and the trailing silent-e in many words that signals a close rather than open vowel in the final syllable. Where the usual pronunciation of a vowel is "schwa," I insert the "schwa" sound before the written vowel but do not then skip the written vowel. In vowel diphthongs, unless one of them has a function-rather-than-pronunciation role, both vowels are pronounced, so this sometimes has me uttering triple vowels into the mike.

    I've trained myself to use this accent at normal, or even slightly quick, speech speeds and with normal rhythms. My fiancée says it sounds like a Finn who learned English at a stodgy university from ESL professors playing a bad joke on him. It is embarrassing when it takes me a minute to drop out of it when talking to human beings after doing ASR dictation.

  21. Dan Lufkin said,

    February 18, 2011 @ 5:30 pm

    @ Ray D — Yes, indeed, consistency is everything. Dragon version 11 doesn't require any initial training at all. It works very well right out of the box.

    My favorite Dragon demo sentence: Amidst the mists and coldest frosts, with barest wrists and stoutest boasts, she thrusts her fists against the posts and still insists she sees the ghosts.

    And one I got years ago from a Bell Labs researcher: Joe took father's shoe-box out. She was waiting at my lawn.

    This latter one is supposed to be difficult for ASR, but Dragon gets it easily.

  22. Janice Byer said,

    February 18, 2011 @ 11:54 pm

    M. Vasilescu, you're very welcome. I enjoyed participating in what proved to be an interesting challenge. I look forward to reading the results of your research.

  23. Ran Ari-Gur said,

    February 19, 2011 @ 11:30 am

    I wonder if this is the best design for this. Given a good interface, I think I could transcribe a longer passage — ten or twenty seconds, say — much more accurately, because there's more context and more chance to get used to the accent. Dr. Vasilescu's comment above says that "in French" they "had mainly standard broadcast news French"; maybe in that context this design makes more sense?

  24. Ran Ari-Gur said,

    February 19, 2011 @ 2:12 pm

    By the way, to add to my earlier comment: I realize that a short snippet can be more useful than a long one, in that it reduces the potential for errors to be introduced during the alignment process, when the system tries to figure out which written word corresponds to which segment of the audio stream. And I realize that many short snippets allow a greater variety of audio qualities and voices and accents and so on — in short, more diverse training data. But this approach certainly introduces errors, and I would have thought that the errors it introduces would be more serious than the errors it alleviates. Maybe it would be better to give the transcriptionists two audio snippets: a longer one for context, and a shorter subset that we're asked to transcribe? Or even a series of shorter subsets?

  25. Ran Ari-Gur said,

    February 19, 2011 @ 2:18 pm

    (Sorry if I've totally missed the point. The post does say that they're looking for patterns of human errors. It now occurs to me that the purpose of using short, context-less snippets, with word-fragments and so on, may actually be to artificially increase those errors?)

    [(myl) I don't know the answers to your questions, though I could make some guesses. When the results of the research are published (and I'll link to an open version of them), we'll have a better idea.]

  26. blahedo said,

    February 19, 2011 @ 9:00 pm

    I think it took me close to two hours elapsed; regarding Mark's earlier comment, many (most?) of the clips were much more than a second long, and many I had to replay more than once to get a sense of what I was hearing.

    At this point I'd almost want to see what the "right answers" were, or at least get longer clips; particularly for some of the shortest clips I was unable to get much. In a couple cases I could type nothing more detailed than "* *" and move on.

    Either way, though, I can't wait to see the results that come out of this!

  27. Ioana Vasilescu said,

    February 21, 2011 @ 5:00 am

    I would like to thank everybody: the tests are almost finished (there are still some ongoing transcriptions, but the number of transcriptions we already have allows us to compute statistics and draw some conclusions).
    I saw some comments about the length of the excerpts: as soon as the work is accepted for publication (hopefully at Interspeech), I can describe the design of the test and the reasons it was conceived this way.
    Thank you again for your help.

  28. Ioana Vasilescu said,

    June 17, 2011 @ 5:29 am

    Dear all,

    I'm back with some results of the perceptual test!

    Some of you completed one of the four (too long!) perceptual tests, which asked you to transcribe spoken English.

    Here is the main idea behind the test:

    The test aims at investigating the role of context length in the recovery of short homophonic items that yield automatic transcription errors in French and English. The hypothesis is that the ambiguity due to homophonic words decreases as context size increases. An important point is that the excerpts come from the (often wrong) automatic transcriptions: the great majority of the sequences you transcribed were missed by the ASR system.
    We therefore focused on target words frequently missed by the ASR system (such as “and”, “in”, “the” for English, and “est”, “et”, “les”… for French), which were presented in 3-grams, 5-grams, 7-grams (usually the maximum span of a language model), and 9-grams.
    Performance was measured in particular in terms of human WER, and compared with the automatic output for the central targets.
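
    [(myl) To picture the design: each n-gram window is a symmetric span of words around the target, with 1, 2, 3, or 4 words of context on each side for the 3-, 5-, 7-, and 9-gram conditions. A small sketch in Python, with an invented sentence and target word, purely for illustration (this is not LIMSI's actual stimulus-preparation code):

```python
def ngram_window(words, idx, n):
    """Return the n-word window centered on words[idx],
    clipped at the sentence boundaries."""
    half = (n - 1) // 2
    lo = max(0, idx - half)
    return words[lo:idx + half + 1]

# Invented example; "in" stands in for a typical short target word.
sentence = "he said that in the end the results were clear".split()
target = sentence.index("in")
for n in (3, 5, 7, 9):
    print(f"{n}-gram:", " ".join(ngram_window(sentence, target, n)))
```

    ]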

    The main observations:
    – for target words that the ASR system missed 100% of the time, human WER averages about 22%. Humans do better than the ASR system, but they too have trouble resolving the local ambiguity: the problem comes from poor, ambiguous acoustic information, so sufficient context should help.
    – (not surprisingly) human WER decreases with increasing context: the gap is particularly significant between 3-grams and 5-grams (about 10%). Even so, 9-grams do not bring human WER down to zero.
    – the patterns are similar in French and English

    Other details and some figures will be presented at the Interspeech conference; maybe I'll meet some of you there (Monday, the 29th of August).

    Thank you again for your help. I hope the USB devices and the LIMSI souvenir arrived safely.


    (LIMSI-CNRS, France)
