Voice recognition for English and Mandarin typing revisited


In "Voice recognition for English and Mandarin typing" (8/24/16), we took a brief look at a Stanford-University of Washington-Baidu study that showed, according to an NPR article, that voice recognition finally beat humans at typing.  The title of the original study is "Speech Is 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices", and the authors are Sherry Ruan, Jacob O. Wobbrock, Kenny Liou, Andrew Ng, and James Landay.

Abstract (may be found here):

With laptops and desktops, the dominant method of text entry is the full-size keyboard; now with the ubiquity of mobile devices like smartphones, two new widely used methods have emerged: miniature touch screen keyboards and speech-based dictation. It is currently unknown how these two modern methods compare. We therefore evaluated the text entry performance of both methods in English and in Mandarin Chinese on a mobile smartphone. In the speech input case, our speech recognition system gave an initial transcription, and then recognition errors could be corrected using either speech again or the smartphone keyboard. We found that with speech recognition, the English input rate was 3.0x faster, and the Mandarin Chinese input rate 2.8x faster, than a state-of-the-art miniature smartphone keyboard. Further, with speech, the English error rate was 20.4% lower, and Mandarin error rate 63.4% lower, than the keyboard. Our experiment was carried out using Deep Speech 2, a deep learning-based speech recognition system, and the built-in Qwerty or Pinyin (Mandarin) Apple iOS keyboards. These results show that a significant shift from typing to speech might be imminent and impactful. Further research to develop effective speech interfaces is warranted.

Here's the pdf for the original Stanford paper.

The following is a guest post by Silas Brown, which goes into much greater detail about the context of the achievement described in the Stanford-UW-Baidu study.


According to the NPR article:

In English, they found the software's error rate was 20.4 percent lower than humans typing on a keyboard; and in Mandarin Chinese, it was 63.4 percent lower.

… except that it wasn't a keyboard.  The link to http://arxiv.org/abs/1608.07323 says "the built-in Qwerty or Pinyin (Mandarin) Apple iOS keyboards".  That means on-screen keyboard, not physical keyboard.  It's no surprise to me that the iOS keyboard is awful – I'd be far more interested in a study that used an actual physical keyboard (second-hand phones that have them are still available, although the manufacturers don't like to make them anymore so they're trying to trick us into thinking we don't want them anymore – but it's still possible to pair a phone with a Bluetooth keyboard if you can find a good one).

And the news report is very misleading to put this after such high-ranking things as the Kasparov-Deep Blue chess match and the similar Jeopardy and Go matches.  The human opponents in those setups were world champions.  So if you're going to follow on from that, I would expect the human side to have an experienced secretary and a proper office keyboard, not some random volunteer typing into an iPhone.  At least we get the phrase "mobile devices" a couple of paragraphs down (which is from the paper's title, although even this is pushing it because "devices" can include tablets but the study used only phones) – this aspect should have been more prominent.

Furthermore, the paper says nothing about how participants were asked to trade off speed against error rate – presumably they could have achieved a much lower error rate at the expense of speed, but they didn't want to because they were only being paid a fixed amount and probably wanted the test to be over and done with sooner rather than later.  It doesn't really make sense to compare both speed and error rate in the same test because obviously there is a trade-off (on BOTH systems, as we are told the speech system offered post-recognition correction and participants might have chosen to skip this if they were rushing).  It would have made more sense to fix one and measure the other, as in "make sure the text is 100% right and we'll time how long it takes you to make those edits" or "use all of 20 seconds to make as many edits as you can and we'll measure how good it is when the clock stops".
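To make the "fix one and measure the other" idea concrete, here is a minimal Python sketch. It is not the method used in the study: the function names, the snapshot log format, the five-characters-per-word convention, and the sample data are all assumptions made purely for illustration. It scores an entry session either by the time needed to reach a required accuracy or by the accuracy reached within a time budget.

```python
from typing import List, Optional, Tuple

# Hypothetical log format: (seconds elapsed, text on screen at that moment),
# assumed to be sorted by time.
Snapshot = Tuple[float, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(target: str, entered: str) -> float:
    """Character errors as a fraction of the longer of the two strings."""
    longest = max(len(target), len(entered))
    return edit_distance(target, entered) / longest if longest else 0.0


def words_per_minute(entered: str, seconds: float) -> float:
    """Entry rate, counting five characters as one word (an assumed convention)."""
    return (len(entered) / 5.0) / (seconds / 60.0)


def time_to_reach(snapshots: List[Snapshot], target: str,
                  max_error: float = 0.0) -> Optional[float]:
    """Fix accuracy, measure time: first moment the error is at or below max_error."""
    for seconds, text in snapshots:
        if error_rate(target, text) <= max_error:
            return seconds
    return None  # the required accuracy was never reached


def error_at_deadline(snapshots: List[Snapshot], target: str,
                      deadline: float) -> float:
    """Fix time, measure accuracy: error of the last snapshot at or before deadline."""
    text = ""
    for seconds, snapshot in snapshots:
        if seconds > deadline:
            break
        text = snapshot
    return error_rate(target, text)


# Made-up session: a dictated phrase that needed two rounds of correction.
target = "i mean cant as in hypocrisy"
session = [
    (4.0, "i mean kent as in hip hop chrissy"),
    (9.0, "i mean cant as in hip hop chrissy"),
    (15.0, "i mean cant as in hypocrisy"),
]
print(time_to_reach(session, target))            # time needed for a perfect result
print(error_at_deadline(session, target, 10.0))  # error remaining after 10 seconds
```

Either function gives a single number per participant and input method, so a fast but sloppy session and a slow but careful one can be compared on the same scale.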

Oh, and they excluded punctuation marks because they found a 1999 conference paper saying it's not a fair comparison.  But in real life it's sometimes possible to use a punctuation mark to save several words of meaning, so perhaps, instead of having participants transcribe a fixed set of sentences exactly, it might have been better to tell them they can change the wording if they want as long as the result is unambiguously the same meaning according to some reasonable neutral observer (or as long as they agree their rewrites will be published for all to see).  That would make a better fit for applications like sending emails from phones, since how many keyboard users are told they can't have punctuation in real life?  And I expect speech recognition systems have got better at punctuation since 1999, so I wouldn't want to say "let's just forget punctuation because somebody in 1999 did".

So yes I'm afraid that study does look a little over-hyped … but it would be nice if voice recognition could improve to the point where I don't want my real keyboard. Although, for messaging, if you're going to speak, then you might as well send the person a voice message, unless there's some reason why the message must be in text or you need to carefully edit what you say and can't make yourself a script first.

Please note that I was reacting to the NPR headline "Voice Recognition Software Finally Beats Humans At Typing" and its opening line "Computers have already beaten us at chess, Jeopardy and Go" in their report here.

I'm not saying the study isn't good, just that I think NPR over-hyped it by placing the performance of average iPhone users on a par with world-championship contests (I assume that hype was NPR's doing, not the researchers').  Unfortunately, it's not obvious how to get the reporter Aarti Shahani's email address to check if they'd like to reply.



  1. Justin said,

    August 30, 2016 @ 2:54 pm

    It's no surprise to me that the iOS keyboard is awful – I'd be far more interested in a study that used an actual physical keyboard (second-hand phones that have them are still available, although the manufacturers don't like to make them anymore so they're trying to trick us into thinking we don't want them anymore – but it's still possible to pair a phone with a Bluetooth keyboard if you can find a good one).

    I don't think there's any trickery going on. A physical keyboard means either a bigger phone or a smaller screen. Neither seems like a good tradeoff when the keyboard is only used a small percentage of time.

    Unfortunately, it's not obvious how to get the reporter Aarti Shahani's email address to check if they'd like to reply.

  2. Mark Liberman said,

    August 30, 2016 @ 3:36 pm

    The original piece made it quite clear that the "typing" was really texting. It's true that it was a transparently dishonest and sensationalist move on the part of the "contest" promoters and the NPR reporter to use the word "typing", but people who were naive and unobservant enough to be taken in have only themselves to blame.

  3. Jarek Weckwerth said,

    August 30, 2016 @ 3:52 pm

    My experience with iOS keyboards is limited but on Android and Windows the differences between individual keyboards (and within each one, between individual languages) are huge. My favourite swipe keyboard is just amazing in English but only so-so in a more heavily inflected language…

    I haven't done any testing ;) but I'm pretty sure that, for English, it's not three times slower than speech recognition on the same device. (One specific reason being — as is suggested above — that error correction is much more effective. And of course methods for correcting typos also differ substantially between keyboards.)

  4. Stephen said,

    August 31, 2016 @ 6:11 am

    I'm with Jarek Weckwerth on the better error correcting (and what I would call sensible suggestions) in text rather than speech-to-text.

    I use the native text (SMS) app on my Google Android phone. Once a letter (or sometimes two) has been entered it suggests three possible words and if you hit the space key it selects the middle (most likely) one and moves on to the next word.

    So on entering "i" the most likely suggestion is "I", for "ive" it is "I've", and so on. The speech-to-text function that is part of the same application is not as 'smart' about this. It does get these right but not as reliably.

    Punctuation seems to be regarded as an optional extra: "comma" comes out as "," but "stop", "period", "semi colon", "colon", "quote" & "double quote" all come out as words. Some of the time "full stop" comes out as "." and some of the time as "full stop". In the former case the next sentence will start with a capital letter but in the latter, unsurprisingly, not.

    Also I have not found a way of forcing a capital letter in the middle of a sentence. E.g., the text part thinks that "sms" should be "SMS" but the speech-to-text part leaves it alone.

    Playing around with it just now I saw that it put both "can't" and "cant" as "can't". So I tried again with some more emphasis on "cant" and
    "I don't mean can't as in inability I mean cant as in hypocrisy"
    came out as
    "I don't mean can't as in inability I mean Kent as in hip Hop Chrissy".

    NB Possibly there is some learning involved here. I live near Kent and may well have used that in a number of texts, hence

    I spent so long correcting the output of the speech-to-text function that I simply gave up on it.

  5. Catanea said,

    August 31, 2016 @ 3:49 pm

    & no one seems to address the point that "typing" on whatever sort of finger-input is a relatively silent, or at least quiet, activity. If I wanted to speak aloud everything I might send as a message to persons not present, I might as well send a WhatsApp voice recording (which I find somewhat annoying to receive) or just telephone. All this texting/SMS-ing/WhatsApp-ing is to AVOID making more noise. At least in my house.
    Car, village, anywhere I am.

  6. Eidolon said,

    August 31, 2016 @ 7:08 pm

    Voice recognition has many applications. It's not just about replacing texting and/or typing. Consider an automatic universal speech translator – one of the prerequisites would be near-perfect voice recognition, which we are approaching. Note taking in circumstances that require it also becomes a lot easier when you have solid voice recognition: humans read a lot faster than they listen, so recording the speaker is not nearly as effective as transcribing what he/she says. Then there are applications such as automatically generating closed caption text for video, voice command interfaces, and so on, all of which require voice recognition.

    To this end I think the results presented by this paper are actually impressive. But of course, it's an evaluation paper of an existing system, not a new system, so the technology is already in play and it's not new. The paper is more about drawing attention to the state of the art than it is about a new advance.

  7. tangent said,

    September 2, 2016 @ 1:35 am

    It's also important that for on-screen keyboard entry, correction of bad guesses is well integrated into the process of typing: enter a word, pick an alternative suggestion if necessary, repeat. The speech entry I've tried had me flit the cursor around afterwards.

    The key metric, I'd think, would be time to end up with text corrected to a certain level.

  8. R Bremner said,

    September 5, 2016 @ 3:53 pm

    Wouldn't a trained secretary type at the rate of normal speech, though? So in that case a speed-test wouldn't work (though I suppose if you're looking at number of errors a comparison could be made).
    With respect to texting applications, though, I don't think many people would speak in order to text when they might just as well send a voice message. As Catanea mentioned, there are other advantages to typing in these circumstances.
