Speech-to-speech translation


Rick Rashid, "Microsoft Research shows a promising new breakthrough in speech translation technology", 11/8/2012:

A demonstration I gave in Tianjin, China at Microsoft Research Asia’s 21st Century Computing event has started to generate a bit of attention, and so I wanted to share a little background on the history of speech-to-speech technology and the advances we’re seeing today.

In the realm of natural user interfaces, the single most important one – yet also one of the most difficult for computers – is that of human speech.

As Dr. Rashid's post explains in detail, this demo is less of a breakthrough than an evolutionary step, representing a new version of a long-established combination of three gradually-improving technologies: Automatic Speech Recognition (ASR), Machine Translation (MT), and speech synthesis (no appropriate standard acronym, though TTS for "text to speech" is close).
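To make the division of labor concrete, here is a minimal sketch of how the three stages chain together. Everything in it is an illustrative assumption: the class name `SpeechToSpeech`, the stage signatures, and the toy stand-in functions are hypothetical and are not taken from the Microsoft system. Real components pass around much richer data than plain strings.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Recognizer = Callable[[bytes], str]    # ASR: source-language audio -> source text
Translator = Callable[[str], str]      # MT: source text -> target text
Synthesizer = Callable[[str], bytes]   # TTS: target text -> target-language audio


@dataclass
class SpeechToSpeech:
    """Chains ASR, MT and speech synthesis, in that order."""
    asr: Recognizer
    mt: Translator
    tts: Synthesizer

    def translate(self, audio: bytes) -> bytes:
        source_text = self.asr(audio)        # errors made here propagate downstream
        target_text = self.mt(source_text)
        return self.tts(target_text)


# Toy stand-ins so the sketch runs; none of them does real signal processing.
def toy_asr(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


TOY_LEXICON = {"hello": "你好", "world": "世界"}


def toy_mt(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_tts(text: str) -> bytes:
    # Pretend that encoding the text is the same as synthesizing it.
    return text.encode("utf-8")


pipeline = SpeechToSpeech(asr=toy_asr, mt=toy_mt, tts=toy_tts)
result = pipeline.translate(b"hello world")
```

The point of the structure is that each stage can be improved or swapped out independently, which is how gradual progress in any one of the three technologies shows up in the combined system.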

At some point in the past 100 years, automatic speech-to-speech translation became a standard plot-facilitating assumption in science fiction. (Does anyone know what the first example of this trope was?) And in 1986, when the money from the privatization of NTT was used to found the Advanced Telecommunication Research (ATR) Institute in Japan, the centerpiece of ATR's prospectus was the Interpreting Telephony Laboratory. As explained in Tsuyoshi Morimoto, "Automatic Interpreting Telephone Research at ATR", Proceedings of a Workshop on Machine Translation, 1990:

An automatic telephone interpretation system will transform a spoken dialogue from the speaker’s language to the listener’s automatically and simultaneously. It will undoubtedly be used to overcome language barriers and facilitate communication among the people of the world.

ATR Interpreting Telephony Research project was started in 1986. The objective is to promote basic research for developing an automatic telephone interpreting system. The project period is seven-years.

As of 1986, all of the constituent technologies had been in development for 25 or 30 years. But none of them were really ready for general use in an unrestricted conversational setting, and so the premise of the ATR Interpreting Telephony Laboratory was basically a public-relations device for framing on-going speech technology research, not a plausible R&D project. And so it's not surprising that the ATR Interpreting Telephony Laboratory completed its seven-year term without producing practical technology — though quite a bit of valuable and interesting speech technology research was accomplished, including important contributions to the type of speech synthesis algorithm used in the Microsoft demo.

But as a public-relations framework, "interpreting telephony" was a very effective choice. Here and there around the world, research groups were inspired to produce demos illustrating similar ideas — my own group at Bell Labs created a demo of a real-time English/Spanish/English conversational system for Seville Expo '92. (We started the project in 1989-90 before I left Bell Labs for Penn, and others finished it.)  None of these projects created (or intended to create) practical systems — the idea was more to show people what human language technology was in principle capable of doing.

In the 26 years since 1986, there have been two crucial changes: Moore's Law has made computers bigger and faster but smaller and cheaper; and speech recognition, machine translation, and speech synthesis have all gotten gradually better.  In both the domain of devices and the domain of algorithms, the developments have been evolutionary rather than revolutionary — the reaction of a well-informed researcher from the late 1980s, transplanted to 2012, would be satisfaction and admiration at the clever ways that familiar devices and algorithms have been improved, not baffled amazement at completely unexpected inventions.

All of the constituent technologies — ASR, MT, speech synthesis — have improved to the point where we all encounter them in everyday life, and some people use them all the time. I'm not sure whether Interpreting Telephony's time has finally come, but it's clearly close.

In passing, I'll caution strongly against taking demos at face value: demos are generally scripted and rehearsed exercises in which both the user and the system have been jointly optimized to present a good show. This is not a criticism of Rick Rashid or Microsoft Research; it's just a universal fact about public demonstrations of new technology, one that has an especially strong effect on demonstrations of speech and language technology.

In any case, the folks at Microsoft Research are at or near the leading edge in pushing forward all of the constituent technologies for speech-to-speech translation, and Rashid's speech-to-speech demo is an excellent way to publicize that fact.

Update — in the comments, Victor Mair wonders what it means that the Microsoft algorithms are "patterned after human brain behavior", as Rashid puts it. This is a reference to an innovation promoted by Microsoft researchers, using so-called "deep neural nets", and specifically "a hybrid between a pre-trained, deep neural network (DNN) and a context-dependent (CD) hidden Markov model". See e.g. Dahl et al.,  “Context-Dependent Pre-trained Deep Neural Networks for LVSR”, IEEE Trans. ASLP 2012, which documents a significant improvement in performance:

Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.

They increased their sentence-correct rate from 63.8% to 69.6%, which is a big improvement by the standards of ASR research (where algorithmic innovations often nudge performance up by a few tenths of a percent, and two percent is reason to break out the champagne), though it may be less impressive to outsiders who naively expect qualitatively different results from worthwhile inventions…
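For readers who want to see how the absolute and relative figures in the abstract fit together, here is a small arithmetic check. The helper name `relative_error_reduction` is my own. The 60.4% baseline for the ML-trained system is not stated in the abstract; it is inferred here by subtracting the quoted 9.2-point gain from 69.6%.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Percentage reduction in sentence error rate, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


new_acc = 69.6                 # CD-DNN-HMM sentence accuracy
mpe_baseline = 63.8            # CD-GMM-HMM trained with MPE
ml_baseline = new_acc - 9.2    # inferred from the quoted 9.2-point absolute gain

mpe_reduction = relative_error_reduction(mpe_baseline, new_acc)
ml_reduction = relative_error_reduction(ml_baseline, new_acc)

print(f"vs. MPE baseline: {mpe_reduction:.1f}% relative error reduction")
print(f"vs. ML baseline:  {ml_reduction:.1f}% relative error reduction")
```

In other words, a 5.8-point gain in accuracy removes about one sixth of the errors that the MPE-trained baseline was making, since that baseline got 36.2% of sentences wrong.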

For those who are interested in what "deep neural net" (sometimes called "deep learning") algorithms are, here's a tutorial from ACL 2012 on their application in NLP:

And in greater depth, Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning 2009. The abstract:

Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g., in vision, language, and other AI-level tasks), one may need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This monograph discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.

The key algorithmic innovations are described in Hinton et al., "A Fast Learning Algorithm for Deep Belief Nets", Neural Computation 2006:

We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

In this context, Rashid's phrase "patterned after human brain behavior" is maybe not the most accurate way to put it, since I don't believe that anything is known about "human brain behavior" in relation to the specific kinds of learning algorithms behind the "deep learning" boom.


  1. Ben Zimmer said,

    November 10, 2012 @ 9:55 am

    The 1945 short story “First Contact” by Murray Leinster (pen name for Will F. Jenkins) is often credited with introducing the "universal translator" to science fiction. More from Robert Silverberg here:

    The first problem, of course, is figuring out how to speak with the aliens. Leinster solves this in his usual efficient way: “ ‘We’ve hooked up some machinery,’ said Tommy, ‘that amounts to a mechanical translator.’ ” After some plausible-sounding engineering talk about frequency modulation and short-wave beams, Tommy goes on to tell his captain, “We agreed on arbitrary symbols for objects, sir, and worked out relationships and verbs and so on with diagrams and pictures. We’ve a couple of thousand words that have mutual meanings. We set up an analyzer to sort out their short-wave groups, which we feed into a decoding machine. And then the coding end of the machine picks out recordings to make the wave groups we want to send back. When you’re ready to talk to the skipper of the other ship, sir, I think we’re ready.”

  2. Victor Mair said,

    November 10, 2012 @ 10:31 am

    I was actually preparing a post on this subject, but Mark beat me to the punch, and I'm so glad that he did, because he has a much better understanding of the history of the development of this technology than I. Consequently, in this comment, I shall merely add a few odds and ends that Mark has not touched upon.

    First of all, what we're dealing with in this particular instance, as demonstrated in Tianjin, is English to Mandarin speech translation technology. The stunning demonstration of spoken English being converted to spoken Mandarin takes place near the end of the video (at around 7:30), when Dr. Rashid speaks several supposedly spontaneous English sentences. Rashid, who presumably knows no Mandarin, is able, by speaking English into the software, to produce passable Mandarin. It's really quite uncanny to hear his voice speaking Mandarin with natural cadences and proper tones. Still, I suspect that this was not entirely unrehearsed. I have observed countless demonstrations of IT products and skills that were extremely impressive, but I was always able to deflate the demonstrators by asking them to say or write something that they had not prepared ahead of time.

    Moreover, as Jamie Condliffe at Gizmodo points out:

    And it's still not perfect. It gets around one word in eight wrong. But let it off: It's still mind-blowing to think that a computer can analyze a person's voice, translate it, and then use knowledge of how the person sounds to recreate what they've said as an audio track in a different language.

    Incidentally, I suspect that, just as it is more difficult to create software that goes from written Mandarin to acceptable written English than it is to create software that renders written English into acceptable Mandarin, it will be harder to produce software that will go from spoken Mandarin to spoken English.

    Finally, many of the press accounts of this impressive demonstration state that the software works in a "brain-like" fashion.



    I'd like to know more about how this technology functions like the human brain.

  3. Sol said,

    November 10, 2012 @ 10:36 am

    But well before that, of course, the Apostles are said to have been granted the ability to speak in different tongues in order to aid their proselytizing.

  4. Victor Mair said,

    November 10, 2012 @ 10:51 am


    Do you want to hear what speaking in tongues sounds like?

    Go to 7:34 here: http://www.youtube.com/watch?v=Fkn1BilNhmc

    It's funny that Lonnie Mackley, the man who is talking in this video, kept referring to speaking in tongues as sounding like Chinese, but the sounds that he produces don't sound at all like Chinese to me. I wonder what they sound like to others.

  5. bks said,

    November 10, 2012 @ 11:10 am

    When do we get to see the software listen to its own Mandarin output and retranslate to English so that we can compare to the original?


  6. Dan Lufkin said,

    November 10, 2012 @ 11:42 am

    I think that a bow in the direction of Dragon NaturallySpeaking dictation software is appropriate here. I've been using the program for about 10 years now and my experience is that, with some attention to optimizing the mechanics and with version 12, you can expect a throughput of about 20 words per minute at an accuracy of about 98% for medical English. I specify that because longer words are generally less ambiguous and give better accuracy. Dragon will sometimes flub "member" for "number," but it never misses "anastomosis."

    I can easily see moving text between Dragon and Google Translate and back to Dragon's very good text-to-speech function. It would be a little manky to do it by hand, but I'm sure a clever macro would do the job.

    For language pairs where there's a lot of bilingual corpus available, GT is surprisingly good. English-Afrikaans is sometimes amazingly accurate.

  7. Sol said,

    November 10, 2012 @ 12:03 pm

    Well, there are several mentions of speaking in tongues in the New Testament, including the Pentecostal (and otherwise) idea of speaking in a language that "doesn't exist." The mention that best approximates the "universal translator" concept, though, would be Acts 2:6-11: "Now when this was noised abroad, the multitude came together, and were confounded, because that every man heard them speak in his own language. And they were all amazed and marvelled, saying one to another, Behold, are not all these which speak Galilaeans? And how hear we every man in our own tongue, wherein we were born? Parthians, and Medes, and Elamites, and the dwellers in Mesopotamia, and in Judaea, and Cappadocia, in Pontus, and Asia, Phrygia, and Pamphylia, in Egypt, and in the parts of Libya about Cyrene, and strangers of Rome, Jews and proselytes, Cretes and Arabians, we do hear them speak in our tongues the wonderful works of God."

    I doubt Acts is the first example of some sort of fantastical translations, but it's an interesting antecedent to the pure sci-fi uses of the trope.

  8. Ben Zimmer said,

    November 10, 2012 @ 12:43 pm

    Christine Mitchell points out via Twitter that well before Leinster's "mechanical translator" was the "Language Rectifier" (an auto-translator through videophone) in Hugo Gernsback's Ralph 124C 41+ (1911).

    [(myl) A link to a relevant passage in the 1925 edition is here.]

  9. LDavidH said,

    November 10, 2012 @ 1:10 pm

    @Sol: In Acts, it's not actually a matter of translation at all, but of people speaking languages they've never learned – and possibly (although Acts isn't clear on that point) not understanding themselves.

  10. John Roth said,

    November 10, 2012 @ 2:27 pm

    Ben Zimmer beat me to it with an earlier example. I was going to mention E. E. "Doc" Smith's novel, First Lensman. However, the lens is, presumably, not a mechanical device and, like the passages quoted in Acts, cannot be used by just anyone.

  11. Steve Treuer said,

    November 10, 2012 @ 2:51 pm

    "Moore's Law has made computers bigger and faster but smaller and cheaper."

    I've never read that paradoxical formulation before. I assume that bigger means bigger memory and smaller means just smaller in size.

    [(myl) I tried a couple of other phrasings, like "logically bigger and faster but physically smaller and cheaper", but in the interests of time I gave up and went with "bigger and faster but smaller and cheaper", and figured that people would figure it out…]

  12. leoboiko said,

    November 10, 2012 @ 3:14 pm

    > “We agreed on arbitrary symbols for objects, sir, and worked out relationships and verbs and so on with diagrams and pictures.

    While this story was being published, blissymbols guy was in Shanghai, kind of trying to develop this.

  13. Philip said,

    November 10, 2012 @ 3:18 pm

    The "universal translator" in science fiction: The Communipaths series, by Suzette Haden Elgin, one of my teachers decades ago in grad school, is all about this issue/assumption.

    If aliens were truly alien and their language truly alien, too, then any kind of translation would be difficult, if not impossible. What "phonemes" would intelligent gas clouds in the atmosphere of Jupiter use to communicate with each other?

  14. Joe Green said,

    November 10, 2012 @ 7:54 pm

    @Dan Lufkin:

    I can easily see moving text between Dragon and Google Translate and back to Dragon's very good text-to-speech function. It would be a little manky to do it by hand

    I apologise for the cross-thread digression, but manky here meaning… difficult? Unpleasant? Time-consuming? This doesn't quite seem to fit with the discussion which we were/are having in the recent Rubbish thread (http://languagelog.ldc.upenn.edu/nll/?p=4295). Another interesting example of your heritage's vocabulary, like culch?

    @Steve Treuer and myl: How about "more capacious but more compact"?

  15. Dan Lufkin said,

    November 10, 2012 @ 8:47 pm

    @Joe Green — Yes, "manky" is part of my repertory of native wood-notes wild. I always thought that it came to Maine via Québec (you can do it on foot) and manque, lacking in some desired but unspecified quality, the adjectival declension of the noun "culch," so to speak.

    Now to a more serious note — At dinner this evening I described to my son (aged 49) the demonstration of the language rectifier we've been discussing. He looked back at me with pity, whipped his Samsung Android phone out of his pocket and proceeded to demonstrate the Google Translate Android application. He spoke to the phone, "What time does the train leave for Amsterdam?," touched the screen, waited about five seconds, touched the screen again and we heard perfectly acceptable Dutch.

    True, it was in a generic voice (the Swedish voice is female), but it was voice-to-voice and worked in all the language pairs that Google Translate offers. Of course, it shares the shortcomings of GT in the language pairs with scanty bilingual corpora (corpuses?), but it was plenty good enough for lots of everyday purposes.

  16. leoboiko said,

    November 11, 2012 @ 6:11 am

    Philip: You probably know about it, but in case you don't: Miéville's Embassytown is an interesting exploration of cultural trouble with an alien race whose "language" is different from ours not just in structure and medium, but in mechanism and function.

  17. Nick Lamb said,

    November 11, 2012 @ 8:08 am

    leoboiko, unlike our hosts I'm no linguist, but my impression was that Embassytown gets its science all wrong.

    I was reminded of Never Let Me Go. I was whirled along with the characters, expecting that the strange inconsistencies would all be explained away, and then the writer revealed their big idea and I thought "Huh? But that makes no sense whatsoever". I was more willing to forgive Ishiguro, who never made any pretence that the science in Never Let Me Go (poor people are cloned and the clones are used as live organ donors) makes sense, than Miéville who seems to think he was making some profound observation in Embassytown.

    The summary I took away from "Never Let Me Go" was "We don't talk about death, but we should". The summary in Embassytown? Maybe "In the future space linguists will suck at their jobs and doom us all".

  18. leoboiko said,

    November 11, 2012 @ 9:38 am

    I didn't get that at all. (Mild spoilers may follow). First, the xenolinguists actually managed to decipher the alien language, why it was so alien, and even to design a method of communication; so they didn't suck at their jobs. I also don't see what's the problem with the science (the special property of their language has a lot to do with real-world animal communication, I think; and certain linguistic utterances can certainly have physiological effects in us, so why not in hypothetical alien beings?) Thirdly, the Bad Thing that happened wasn't the fault of linguists, but rather the unforeseen consequences of a deliberate act of sabotage by politicians who didn't know what they were dealing with. And lastly, everything works out in the end, so no one was doomed. (Er, except everyone who died.) The ending is quite optimistic, in fact.

    My summary of Embassytown would be "language may be the fall from Eden, but it's worth it."

  19. Jerimiah said,

    November 11, 2012 @ 10:30 am

    Following up on Victor Mair's mention of rehearsed sentences: in another YouTube video of this presentation that I watched, one can see more clearly the Chinese text that starts to get translated before the text-to-speech occurs (around 6:20 in this video). There were far more errors and much less natural language than what happens after the text-to-speech starts, leading me to concur with the idea that the latter sentences are rehearsed. It seems slightly better than IBM's attempt at the same thing about five or six years ago. I wish I had known about this so I could have gone to Tianjin to see it.

    As a translator, I have no fear of losing my job to software within the next 20 years.

  20. Steven Grady said,

    November 11, 2012 @ 7:29 pm

    Of course the automatic translation trope is basic to Star Trek, and occasionally features in the plots. Most famously, the episode "Darmok" on ST:TNG involves a meeting with a society whose language consists entirely of cultural metaphors. Lacking any knowledge of their culture, the automatic translator fails, and the crew are forced to rely on good ol' linguistic know-how to learn how to speak with them.

  21. Greg Morrow said,

    November 11, 2012 @ 10:32 pm

    The problem with "Darmok" is that a cultural allusion to a loyal last stand by close friends, as used in the language, translates straightforwardly when taken as a unit as "friendship" or "alliance". The language is pretty ordinary, it just preserves a lot of etymological morphology and syntax in its lexical entries, which is triggering the universal translator to treat it like an agglutinative or polysynthetic language. Tune the universal translator to be a little less diligent in looking for embedded content words, and you should be fine.

    I will drop in a mention of "Omnilingual" by H. Beam Piper (available on Gutenberg) — math and science are the universal key to translating unknown languages.

  22. Bill said,

    November 12, 2012 @ 9:49 am

    I can't comment on the science fiction literature, but in the technical literature there is certainly plenty of prior work on unsupervised cross-lingual adaptation for speech synthesis. HMM-based synthesis has made this much easier, and hooking together ASR->SMT->TTS is straightforward enough (you can do it using inbuilt Android services, if you're missing any of the component technologies), and with cross-lingual speaker-adapted TTS, you have the Microsoft demo. There are plenty of cites here http://www.emime.org/learn/publications . The impressive part of the demo is to get it all running in real-time. But if you're willing to tolerate latency between the various components then putting together a demo like this is ~ fourth year project level work (ask my students :).

  23. Polyspaston said,

    November 12, 2012 @ 10:03 am

    The bigger issue with "Darmok" is the awful pan-pipe incidental music and the appalling mangling of Gilgamesh, IMO.

    On the subject at hand: as I understand it, Mandarin is, like English, an analytic language with SVO word order. A more comprehensive test of computer translation systems would be their handling of synthetic languages, which, if 'Google Translate' is anything to go on, is still lacking.

  24. Mark Mandel said,

    November 12, 2012 @ 6:36 pm

    @Dan Lufkin
    On behalf of myself and all the other ex-Dragons, thanks for the props to Dragon NaturallySpeaking.

    He didn't mention that the work at Carnegie Mellon was done by Jim & Janet Baker, the founders of Dragon Systems.

    Mark A. Mandel

    Senior Linguist, Dragon Systems, Inc. (R.I.P.), 1990-1999 (2001)

  25. Mark Mandel said,

    November 12, 2012 @ 6:54 pm

    @Dan again: Your numbers with NatSpeak don't surprise me at all. Even with ordinary text, proper training of the speaker and the program will typically get up to 95% or higher.

    @Victor, November 10, 2012, 10:51 am: Putting the most charitable interpretation on it, Lonnie Mackley is probably one of those people for whom "It sounds like Chinese" is equivalent to "It's Greek to me".

    @John Roth, November 10, 2012, 2:27 pm: The Lens is explicitly described as being not a mechanical device, but "quasi-living" (or some very similar term). I can't give you a page number after thirty years or more, but that is … Ah, good. Google is our friend, and Wikipedia [First Lensman] has the authority, if not the citation: "a unique badge of authority which is actually a form of 'pseudo-life' … and that can be worn only by the person that it is exclusively attuned to."

  26. Nancy Lebovitz said,

    November 12, 2012 @ 7:23 pm

    My problem with Embassytown was that if the aliens were so linguistically rigid, I don't see how they can come up with new concepts for people to enact for them.

    However, I forgive the book because it's so good about how much effort it can take to assimilate an unfamiliar view of the world.

  27. Daniel said,

    November 20, 2012 @ 8:04 am

    I have really been impressed by the accuracy of the Microsoft speech-to-speech translator. Of course, it is not perfect yet. Some sentences couldn't be understood by the translator even in English and some words in Mandarin were chosen wrongly.

    However, I am asking myself whether one day speech-to-speech translation in settings such as UN meetings and other international meetings will be done by "machines" instead of human interpreters.

  28. MikeA said,

    November 20, 2012 @ 6:31 pm

    So, commenting late, I appear to be the only one to have noticed the abysmal-as-usual subtitling on the bulk of the speech. Did they not use the latest (subject of demo) ASR for that, or is the Chinese as mangled as the English text?

  29. Metzger oder Tierarzt? « gnaddrig ad libitum said,

    February 7, 2013 @ 4:17 pm

    […] often useful tools in the process, but so far they cannot replace a human translator. Despite interesting developments in the fields of automatic speech recognition, machine translation and speech synthesis, it looks […]
