Rick Rashid, "Microsoft Research shows a promising new breakthrough in speech translation technology", 118/2012:
A demonstration I gave in Tianjin, China at Microsoft Research Asia’s 21st Century Computing event has started to generate a bit of attention, and so I wanted to share a little background on the history of speech-to-speech technology and the advances we’re seeing today.
In the realm of natural user interfaces, the single most important one – yet also one of the most difficult for computers – is that of human speech.
As Dr. Rashid's post explains in detail, this demo is less of a breakthrough than an evolutionary step, representing a new version of a long-established combination of three gradually-improving technologies: Automatic Speech Recognition (ASR), Machine Translation (MT), and speech synthesis (no appropriate standard acronym, though TTS for "text to speech" is close).
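To make the division of labor concrete, here is a minimal sketch (in Python) of how such a three-stage cascade fits together. The function names recognize_speech, translate_text, and synthesize_speech are hypothetical placeholders for whatever ASR, MT, and synthesis engines a real system would plug in; nothing here is Microsoft's implementation.

    # A minimal sketch of a cascaded speech-to-speech pipeline.
    # The three stage functions are hypothetical placeholders standing in for
    # whatever ASR, MT, and synthesis engines a real system would call.

    def recognize_speech(audio, source_lang):
        """ASR: audio in the source language -> text transcript."""
        raise NotImplementedError  # e.g. a large-vocabulary speech recognizer

    def translate_text(text, source_lang, target_lang):
        """MT: source-language text -> target-language text."""
        raise NotImplementedError  # e.g. a statistical MT system

    def synthesize_speech(text, target_lang, voice=None):
        """Synthesis: target-language text -> audio, optionally in a chosen voice."""
        raise NotImplementedError  # e.g. a concatenative or HMM-based synthesizer

    def speech_to_speech(audio, source_lang="en", target_lang="zh"):
        transcript = recognize_speech(audio, source_lang)
        translation = translate_text(transcript, source_lang, target_lang)
        return synthesize_speech(translation, target_lang)

The point of writing it out is simply that the cascade is a composition of the three technologies, so errors in any one stage propagate to the next.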
At some point in the past 100 years, automatic speech-to-speech translation became a standard plot-facilitating assumption in science fiction. (Does anyone know what the first example of this trope was?) And in 1986, when the money from the privatization of NTT was used to found the Advanced Telecommunication Research (ATR) Institute in Japan, the centerpiece of ATR's prospectus was the Interpreting Telephony Laboratory. As explained in Tsuyoshi Morimoto, "Automatic Interpreting Telephone Research at ATR", Proceedings of a Workshop on Machine Translation, 1990:
An automatic telephone interpretation system will transform a spoken dialogue from the speaker’s language to the listener’s automatically and simultaneously. It will undoubtedly be used to overcome language barriers and facilitate communication among the people of the world.
ATR Interpreting Telephony Research project was started in 1986. The objective is to promote basic research for developing an automatic telephone interpreting system. The project period is seven-years.
As of 1986, all of the constituent technologies had been in development for 25 or 30 years. But none of them were really ready for general use in an unrestricted conversational setting, and so the premise of the ATR Interpreting Telephony Laboratory was basically a public-relations device for framing on-going speech technology research, not a plausible R&D project. And so it's not surprising that the ATR Interpreting Telephony Laboratory completed its seven-year term without producing practical technology — though quite a bit of valuable and interesting speech technology research was accomplished, including important contributions to the type of speech synthesis algorithm used in the Microsoft demo.
But as a public-relations framework, "interpreting telephony" was a very effective choice. Here and there around the world, research groups were inspired to produce demos illustrating similar ideas — my own group at Bell Labs created a demo of a real-time English/Spanish/English conversational system for Seville Expo '92. (We started the project in 1989-90 before I left Bell Labs for Penn, and others finished it.) None of these projects created (or intended to create) practical systems — the idea was more to show people what human language technology was in principle capable of doing.
In the 26 years since 1986, there have been two crucial changes: Moore's Law has made computers bigger and faster but smaller and cheaper; and speech recognition, machine translation, and speech synthesis have all gotten gradually better. In both the domain of devices and the domain of algorithms, the developments have been evolutionary rather than revolutionary — the reaction of a well-informed researcher from the late 1980s, transplanted to 2012, would be satisfaction and admiration at the clever ways that familiar devices and algorithms have been improved, not baffled amazement at completely unexpected inventions.
All of the constituent technologies — ASR, MT, speech synthesis — have improved to the point where we all encounter them in everyday life, and some people use them all the time. I'm not sure whether Interpreting Telephony's time has finally come, but it's clearly close.
In passing, I'll caution strongly against taking demos at face value — demos are generally scripted and rehearsed exercises in which both the user and the system have been jointly optimized to present a good show. This is not a criticism of Rick Rashid or Microsoft Research; it's just a universal fact about public demonstrations of new technology, and one that applies with special force to demonstrations of speech and language technology.
In any case, the folks at Microsoft Research are at or near the leading edge in pushing forward all of the constituent technologies for speech-to-speech translation, and Rashid's speech-to-speech demo is an excellent way to publicize that fact.
Update — in the comments, Victor Mair wonders what it means that the Microsoft algorithms are "patterned after human brain behavior", as Rashid puts it. This is a reference to an innovation promoted by Microsoft researchers, using so-called "deep neural nets", and specifically "a hybrid between a pre-trained, deep neural network (DNN) and a context-dependent (CD) hidden Markov model". See e.g. Dahl et al., “Context-Dependent Pre-trained Deep Neural Networks for LVSR”, IEEE Trans. ASLP 2012, which documents a significant improvement in performance:
Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.
They increased their sentence-correct rate from 63.8% to 69.6%, an absolute gain of 5.8% (and since the baseline error rate was 36.2%, a relative error reduction of 5.8/36.2 ≈ 16%). That's a big improvement by the standards of ASR research (where algorithmic innovations often nudge performance up by a few tenths of a percent, and two percent is reason to break out the champagne), though it may perhaps be less impressive to outsiders who naively expect qualitatively different results from worthwhile inventions…
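For readers who want a concrete picture of what the "hybrid" means, here is a toy sketch of the central idea. It is not Microsoft's code, and the network sizes, weights, and priors below are invented for illustration; the point is only that the DNN's per-frame senone posteriors, divided by the senone priors, stand in for the GMM likelihoods inside an otherwise conventional HMM decoder.

    import numpy as np

    # Toy illustration of the CD-DNN-HMM idea: a feed-forward net scores each
    # acoustic frame against the context-dependent HMM states (senones), and those
    # scores replace the GMM likelihoods inside an otherwise standard HMM decoder.
    # Dimensions, priors, and weights here are made up for illustration.

    rng = np.random.default_rng(0)

    n_frames, n_features, n_hidden, n_senones = 50, 39, 128, 200
    frames = rng.normal(size=(n_frames, n_features))      # acoustic feature vectors
    senone_priors = np.full(n_senones, 1.0 / n_senones)   # state priors p(senone)

    # A small stand-in for a pre-trained deep network (one hidden layer here;
    # the real systems use many layers and millions of weights).
    W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_senones))

    def dnn_senone_posteriors(x):
        """p(senone | frame): forward pass through the toy network."""
        h = np.tanh(x @ W1)
        logits = h @ W2
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # The hybrid trick: convert posteriors to scaled likelihoods p(frame | senone)
    # by dividing out the senone priors, so they can be plugged into HMM decoding.
    posteriors = dnn_senone_posteriors(frames)             # (n_frames, n_senones)
    scaled_log_likelihoods = np.log(posteriors) - np.log(senone_priors)

    print(scaled_log_likelihoods.shape)  # one score per frame per senone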
For those who are interested in what "deep neural net" (sometimes called "deep learning") algorithms are, here's a tutorial from ACL 2012 on their application in NLP:
And in greater depth, Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning 2009. The abstract:
Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g., in vision, language, and other AI-level tasks), one may need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This monograph discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
The key algorithmic innovations are described in Hinton et al., "A Fast Learning Algorithm for Deep Belief Nets", Neural Computation 2006:
We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
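As a very rough illustration of the "one layer at a time" idea, here is a bare-bones sketch of greedy layer-wise pre-training of stacked Restricted Boltzmann Machines with one-step contrastive divergence (CD-1). It omits biases and the contrastive wake-sleep fine-tuning that Hinton et al. describe, and all of the sizes and data are toy values; it is meant only to show the shape of the procedure, not their full algorithm.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hidden, epochs=10, lr=0.05):
        """Train one RBM with 1-step contrastive divergence (biases omitted for brevity)."""
        n_visible = data.shape[1]
        W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        for _ in range(epochs):
            # Positive phase: hidden activations driven by the data.
            h_prob = sigmoid(data @ W)
            h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
            # Negative phase: one reconstruction step back down and up again.
            v_recon = sigmoid(h_sample @ W.T)
            h_recon = sigmoid(v_recon @ W)
            # CD-1 update: difference between data-driven and model-driven statistics.
            W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        return W

    def pretrain_deep_net(data, layer_sizes):
        """Greedy layer-wise pre-training: each RBM's hidden activations feed the next."""
        weights, activations = [], data
        for n_hidden in layer_sizes:
            W = train_rbm(activations, n_hidden)
            weights.append(W)
            activations = sigmoid(activations @ W)
        return weights  # would then be fine-tuned (e.g. discriminatively, by backpropagation)

    # Toy usage: 200 random binary "images", three hidden layers.
    toy_data = (rng.random((200, 64)) > 0.5).astype(float)
    stack = pretrain_deep_net(toy_data, layer_sizes=[32, 32, 16])
    print([W.shape for W in stack])

The design point the sketch tries to convey is the one in the abstract: each layer is trained unsupervised on the representations produced by the layer below it, which gives a sensible initialization for a deep network that would otherwise be hard to train from scratch.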
In this context, Rashid's phrase "patterned after human brain behavior" is maybe not the most accurate way to put it, since I don't believe that anything is known about "human brain behavior" in relation to the specific kinds of learning algorithms behind the "deep learning" boom.