Human parity in conversational speech recognition


Today at ISCSLP 2016, Xuedong Huang announced a striking result from Microsoft Research. A paper documenting it is available — W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving Human Parity in Conversational Speech Recognition":

Conversational speech recognition has served as a flagship speech recognition task since the release of the DARPA Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcriptionists is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state-of-the-art, and edges past the human benchmark. This marks the first time that human parity has been reported for conversational speech. The key to our system's performance is the systematic use of convolutional and LSTM neural networks, combined with a novel spatial smoothing method and lattice-free MMI acoustic training.

Twenty years ago, things were quite different — Philippe Jeanrenaud, Ellen Eide, U. Chaudhari, J. McDonough, Kenney Ng, M. Siu, and Herbert Gish, "Reducing word error rate on conversational speech from the Switchboard corpus", IEEE ICASSP 1995:

Speech recognition of conversational speech is a difficult task. The performance levels on the Switchboard corpus had been in the vicinity of 70% word error rate. In this paper, we describe the results of applying a variety of modifications to our speech recognition system and we show their impact on improving the performance on conversational speech. These modifications include the use of more complex models, trigram language models, and cross-word triphone models. We also show the effect of using additional acoustic training on the recognition performance. Finally, we present an approach to dealing with the abundance of short words, and examine how the variable speaking rate found in conversational speech impacts on the performance. Currently, the level of performance is at the vicinity of 50% error, a significant improvement over recent levels.

And even a few months ago, reported error rates on the same task were substantially higher — Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun, "Very deep multilingual convolutional neural networks for LVCSR", IEEE ICASSP 2016:

Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in LVCSR have not kept pace with recent advances in other domains where deeper neural networks provide superior performance. In this paper we propose a number of architectural advances in CNNs for LVCSR. First, we introduce a very deep convolutional network architecture with up to 14 weight layers. There are multiple convolutional layers before each pooling layer, with small 3×3 kernels, inspired by the VGG Imagenet 2014 architecture. Then, we introduce multilingual CNNs with multiple untied layers. Finally, we introduce multi-scale input features aimed at exploiting more context at negligible computational cost. We evaluate the improvements first on a Babel task for low resource speech recognition, obtaining an absolute 5.77% WER improvement over the baseline PLP DNN by training our CNN on the combined data of six different languages. We then evaluate the very deep CNNs on the Hub5'00 benchmark (using the 262 hours of SWB-1 training data) achieving a word error rate of 11.8% after cross-entropy training, a 1.4% WER improvement (10.6% relative) over the best published CNN result so far.

This is not the end of the speech-recognition story. There are harder tasks where there's plenty of room for improvement. And perhaps it's time to go beyond simple evaluation in terms of overall word error rate, since some errors are more consequential than others. But still, this is an important and impressive milestone.
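Since all of the comparisons above rest on word error rate, it's worth being concrete about what that number is: the word-level Levenshtein edit distance between a reference transcript and the system's hypothesis, divided by the length of the reference. The sketch below is a minimal illustration; NIST's actual scoring tool (sclite) adds text normalization and detailed alignment reports on top of the same idea.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    i.e. word-level Levenshtein distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# word_error_rate("the cat sat on the mat", "the cat sat on mat")
# → 1/6 ≈ 0.167 (one deletion against a six-word reference)
```

Note that this metric weights every error equally, which is exactly the limitation raised above: a WER of 5.9% says nothing about whether the errors are harmless function words or gist-destroying substitutions.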

[See also the MS blog entry "Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition".]

  1. Geoffrey K. Pullum said,

    October 18, 2016 @ 11:24 am

    I must confess that I never thought I would see this day. In the 1980s, I judged fully automated recognition of connected speech (listening to connected conversational speech and writing down accurately what was said) to be too difficult for machines, far more difficult than syntactic and semantic processing (taking an error-free written sentence as input, recognizing which sentence it was, analysing it into its structural parts, and using them to figure out its literal meaning). I thought the former would never be accomplished without reliance on the latter. I thought computer understanding of typed input as a component of a usable product would come within a decade, while fully automated recognition of connected speech would probably take forty or fifty years. I was wrong. The speech engineers have accomplished it without even relying on any syntactic analysis: pure engineering, aided by statistical modeling based on gigantic amounts of raw data. I admit it, I am no futurologist. I not only didn't think I would see this come about, I would have confidently bet against it. Were I a betting man (and it's just as well I'm not; I would have lost a bundle on the Brexit vote too, and more on the Republican nomination contest).

  2. Y said,

    October 18, 2016 @ 12:29 pm

    I must confess I never thought I would see the day that Pullum praises something that came out of Microsoft.

  3. Rubrick said,

    October 18, 2016 @ 5:53 pm

    Gold star for Y's comment.

  4. D.O. said,

    October 19, 2016 @ 12:31 am

    A fly in the ointment: trying to maximize performance on a fixed corpus/task runs the risk of optimizing specifically for that corpus. Of course, I am sure they used the usual training-set/test-set approach, but if that approach is repeated multiple times for multiple designs, one of them might be especially good just by accident. Just saying! For all I know, it's a robust achievement.

  5. Xuedong Huang said,

    October 19, 2016 @ 3:52 am

    Thank you for sharing your insightful comments. This is a great milestone not just for Microsoft but also for the many great speech colleagues we have worked with. We benchmarked our systems on the Switchboard task scientifically, but the underlying technology can be used more broadly for other tasks. It is an exciting moment for all of us, as many of us have worked on this task for over 20 years, and I personally didn't believe we could have reached human parity this year!

  6. Jochen L. Leidner said,

    October 19, 2016 @ 4:10 am

    A memorable day, perhaps much more so than Kasparov beaten in chess.

    Don't oversell it, please, we can do without a "speech recognition winter" ;-)

    I raise my glass to all fellow engineers (but without firing any linguists).

  7. Athanassios Protopapas said,

    October 19, 2016 @ 6:58 am

    How times change! In 1996, as I was completing my PhD in cognitive science (on human speech perception and word recognition), I got a job interview at BBN. After my talk I (naively but truthfully) said I wanted to research neural networks for speech recognition, at which point I was almost laughed out of the room, as the chief engineers were not interested in anything but HMM and thought neural nets would never work. It is funny how the unthinkable turns into mainstream within a couple of decades.

  8. Counterbander said,

    October 20, 2016 @ 4:37 pm

    Two simple-minded questions about this feat:

    1. "Human parity" using a word error rate metric does not seem very convincing. Human transcriber errors are typically not nonsense and usually preserve the gist of the text. ASR errors, on the other hand, are often bizarre in appearance and may complicate or destroy intelligibility. Don't we have a better metric of transcription quality?
    2. If on spontaneous speech professional transcribers err at 5.9%, who has created the perfect transcription (and how?) against which this is measured? A super-human?

    [(myl) These are both excellent questions — I'll address them in a separate post within the next couple of days.]

  9. Gregory Kusnick said,

    October 21, 2016 @ 2:19 am

    I'm guessing the answer to Counterbander's second question is to commission several independent transcriptions and combine them into a single text with a much lower error rate by discarding the minority opinion at any point of disagreement.

    20 years ago I had an office down the hall from Xuedong and his group at MSR. Glad to see their work coming to fruition.
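    [(myl) Kusnick's combination scheme — several independent transcripts, with the majority reading winning at each point of disagreement — can be sketched in a few lines, assuming the transcripts have already been aligned word-for-word. In practice the alignment itself is the hard part, and scoring systems such as NIST's ROVER compute it explicitly before voting.]

```python
from collections import Counter

def combine_transcripts(transcripts):
    """Majority-vote combination of several word-aligned transcripts.

    `transcripts` is a list of equal-length word lists, assumed to be
    already aligned position by position. At each position the most
    common word wins; minority readings are discarded."""
    combined = []
    for words in zip(*transcripts):
        word, _count = Counter(words).most_common(1)[0]
        combined.append(word)
    return " ".join(combined)

# Three transcribers, each wrong in one place, yield a clean consensus:
# combine_transcripts([
#     "the cat sat on the mat".split(),
#     "the bat sat on the mat".split(),
#     "the cat sat in the mat".split(),
# ])
# → "the cat sat on the mat"
```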

  10. 11 gripping questions raised by ‘Westworld’ : Break The Limit said,

    October 23, 2016 @ 1:30 pm

    […] Luckily, and for a variety of reasons, AI researchers today believe out-of-control AI is a myth and that we can control intelligent software. Then again, few computer and linguistic scientists thought machines could ever learn to listen and speak as well as people — and now they can on a limited level. […]

  11. Ted Chang said,

    October 25, 2016 @ 10:02 am

    This is truly a breakthrough that will pave the way toward unparalleled transcription quality.

    "Long behold, the day will come forth!"

