Voice recognition for English and Mandarin typing

In All Tech Considered (8/24/16), Aarti Shahani has an article titled "Voice Recognition Software Finally Beats Humans At Typing, Study Finds".

Turns out voice recognition software has improved to the point where it is significantly faster and more accurate at producing text on a mobile device than we are at typing on its keyboard. That's according to a new study by Stanford University, the University of Washington and Baidu, the Chinese Internet giant. The study ran tests in English and Mandarin Chinese.

Baidu chief scientist Andrew Ng says this should not feel like defeat. "Humanity was never designed to communicate by using our fingers to poke at a tiny little keyboard on a mobile phone. Speech has always been a much more natural way for humans to communicate with each other," he says.

The study found that speaking short phrases into an iPhone was three times faster than typing them on an iPhone.

The Stanford University-University of Washington-Baidu team didn't test query skills. They zoomed in on voice recognition software's ability to type the spoken words. In English, they found the software's error rate was 20.4 percent lower than humans typing on a keyboard; and in Mandarin Chinese, it was 63.4 percent lower.

Judging from these figures, it would seem that typists of English must typically make a lot fewer errors than typists of Mandarin.  Considering the complexity of the Chinese script and the complicated nature of human-machine interfaces for entering Chinese text, this is not surprising.  Since the software used in this experiment was a Baidu program called Deep Speech 2, there should not have been a bias in favor of English.

What does this presage?  Greater reliance on speech recognition for typing and increased levels of character amnesia for handwriting.


  1. Nick said,

    August 24, 2016 @ 2:19 pm

    I'm afraid I'll be shunned as "Off-topic Nick" henceforth, but did anyone else notice in the PRC Constitution the phrase 光辉灿烂 ? Is it possible that the craftsman intended structural repetition of the 光 in 辉 and the 火 in both 灿 and 烂 ? As a novice in Chinese I wonder is this a poetical ornament…I don't recall seeing any discussion of this in Perry Link's recent book. I also wonder what Perry Link would say to William Hannas–could we arrange a debate, moderated by Herr Doktor Mair?
    Thanks for any responses!

  2. WSM said,

    August 24, 2016 @ 3:00 pm

    Not sure how speech-to-text is going to cause more problems for character amnesia, which is an issue regardless of electronic input method, rather than render pinyin irrelevant as a method for inputting characters electronically. Particularly since the recognition software can apparently cope with regional influences whereas pinyin input assumes standard inflection. The recognition software produces characters, not pinyin.

  3. WSM said,

    August 24, 2016 @ 3:29 pm

    Any effects this technology will actually have on linguistic practices are also likely to be somewhat limited, given the concerns noted by the co-author that "there are many moments — in a meeting, in bed with your partner sleeping — when typing still makes more sense than talking to one's devices", which is the understatement of the century.

  4. Michael Watts said,

    August 24, 2016 @ 4:33 pm

    Particularly since the recognition software can apparently cope with regional influences whereas pinyin input assumes standard inflection.

    The Google pinyin IME definitely has settings to cope with regional accents. In a fairly old installation, I find the following togglable adjustments:

    z = zh / c = ch / s = sh / an = ang / en = eng / in = ing // l = n / f = h / r = l / k = g / ian = iang / uan = uang .

    The options before the // are on by default; options afterward are off by default.

    This actually brings up a number of points of interest:

    – Why are ian/iang and uan/uang different from the other -n/-ng options?

    – Who makes these errors? As far as I know, the lack of distinction between s/sh and the related sounds is characteristic of "south china", including cantonese-speaking regions and also shanghai, which is maybe a northern outpost for south china? (And presumably also including hokkien, which I know basically nothing about.) -n/-ng was described to me as a distinction present in mandarin but absent in the local speech of Jiangsu, which should be a variety of wu. Collapsing n/l is characteristic of cantonese. f/h and r/l are, to me, characteristic of japanese, although I'm not sure how much consideration they merit in the design of a chinese input system. And that's the limit of my knowledge.

    – Except that I had a tutor from Guangzhou who, in addition to not distinguishing s/sh and n/l, also didn't distinguish r/y, which isn't even listed here.
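The fuzzy settings listed in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not how Google's or any other IME actually implements it: the function names, the initial/final split, and the treatment of y and w as initials are all assumptions made for the example. Each enabled pair makes two initials or two finals interchangeable when looking up candidates.

```python
# Fuzzy pairs as listed in the comment above. The first group is described
# there as on by default, the second as off by default.
DEFAULT_ON = [("z", "zh"), ("c", "ch"), ("s", "sh"),
              ("an", "ang"), ("en", "eng"), ("in", "ing")]
DEFAULT_OFF = [("l", "n"), ("f", "h"), ("r", "l"), ("k", "g"),
               ("ian", "iang"), ("uan", "uang")]

# Two-letter initials come first so that "zh" is tried before "z".
# Treating y and w as initials is a simplification for this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def fuzzy_variants(syllable, pairs=DEFAULT_ON):
    """Return every spelling the syllable could stand for under the given pairs.

    Hypothetical helper: it does not check that each result is a real
    Mandarin syllable, which a real input method would have to do.
    """
    initial, final = split_syllable(syllable)
    initials, finals = {initial}, {final}
    for a, b in pairs:
        if initial in (a, b):
            initials |= {a, b}
        if final in (a, b):
            finals |= {a, b}
    return {i + f for i in initials for f in finals}


if __name__ == "__main__":
    print(sorted(fuzzy_variants("san")))
    print(sorted(fuzzy_variants("lian", DEFAULT_ON + DEFAULT_OFF)))
```

Because finals are compared whole, the an/ang pair in this sketch does not touch ian or uan, which is one possible reading of why those appear as separate toggles in the settings.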

  5. Bathrobe said,

    August 24, 2016 @ 4:39 pm

    I'm afraid I'll be shunned as "Off-topic Nick" henceforth

    Then I suggest you refrain from posting off topic. Try other forums, Stack Exchange, for example, or contact Professor Mair directly if you think it is a worthwhile question.

  6. WSM said,

    August 24, 2016 @ 4:51 pm

    @Michael Watts – that observation was echoed by Chinese participants in the study, who apparently were unaware, as was I, that there was even such a thing as "Google IME"; I'd be curious to know whether more mainstream IMEs such as those provided by MS Pinyin or Sogou include such functionalities, and if so whether such functionalities are ever used. The limited observations recorded in the study suggest that they're not used often.

    In any case I'm guessing those settings only address the most basic, and arguably most trivial to cope with, ways in which topolect-influenced inflections complicate the use of pinyin to access characters. I agree that an extended study of just what (if any) complications the influence of various topolects cause for input of MSM-based Pinyin would be extremely interesting.

  7. Eidolon said,

    August 24, 2016 @ 4:52 pm

    This may work in favor of Chinese characters, since difficulty of input has always been a problem with Chinese characters, and it looks as though this could be a solution which won't come at the cost of efficiency, since people rarely can type faster than they speak, and certainly not on a mobile device. Of course, this would only work where and when you can use speech for input, and not in quiet or noisy environments.

    It won't solve the problem of having to memorize thousands of characters for reading comprehension, which is a separate issue that requires different solutions than those being talked about here.

  8. Rubrick said,

    August 24, 2016 @ 4:57 pm

    Anecdotally: I find that while I certainly have a lower error rate when using voice-to-text than when typing, recovering from the inevitable errors that do occur is far more difficult in the former case. I'd be curious about the results of a study that took that into account.

    The day that voice recognition can smoothly handle things like "No, I meant Reed, the college" is the day I'll stop finding it terribly frustrating, despite how high its base accuracy has gotten.

  9. Andrew said,

    August 24, 2016 @ 5:03 pm

    f/h and r/l are, to me, characteristic of japanese, although I'm not sure how much consideration they merit in the design of a chinese input system

    Confusing f/h is a characteristic of Taiwanese Mandarin influenced by Hokkien or Hakka. My father, who is fluent in both Taiwanese Hokkien and Mandarin, exhibits the f/h phenomenon on occasion, producing huā for 發 on occasion in less careful speech.

    Wikipedia says r- becomes l- in Taiwanese Mandarin, but I can't speak to this firsthand.

  10. Michael Watts said,

    August 24, 2016 @ 5:53 pm

    WSM: Microsoft Pinyin provides the same functionality under the same name ("fuzzy pinyin"). The options I see are zh,z / ch,c / sh,s / ang,an / eng,en / ing,in / wang,huang // n,l / l,r / f,h , where, as before, the options preceding // are on by default and the options afterward are off by default.

    My tutor who was canonese had enough difficulty with pinyin input that she just used wubi; that might contribute to low awareness of fuzzy input settings.

  11. Michael Watts said,

    August 24, 2016 @ 5:54 pm

    she was cantonese, that is.

  12. WSM said,

    August 24, 2016 @ 6:42 pm

    @Michael Watts I have the strange feeling we're communicating using error-prone cellphone keyboards. Ironic, no?;)

  13. liuyao said,

    August 24, 2016 @ 11:33 pm

    It's not very clear what kind of typing errors the error rate is accounting for. In typing English, is auto-correct enabled? If there's a mistype, and the typist sees it and goes back to correct it, does that count as error or not? In fact a lot of the slowing down is due to incorrect auto-correct. I suppose it's even more time-consuming to correct errors in speech-to-text, so error rate is a good measure.

    It's conceivable that the Mandarin speech recognition can accommodate accents, be it tone variations or fuzzy pinyin. Likewise English has a great variety of accents by non native speakers.

  14. Michael Watts said,

    August 25, 2016 @ 2:57 am

    WSM: You might be, but I'm not.

  15. Bloix said,

    August 25, 2016 @ 8:09 pm

    John Henry said to the Cap'n
    Well a man ain't nothin' but a man
    But before I let that deep speech beat me down
    I'll die with a keyboard in my han' lord lord
    I'll die with a keyboard in my han'

  16. Alice said,

    August 27, 2016 @ 11:23 pm

    If this is true, why is automatic captioning so dreadful?

  17. Dave Cragin said,

    August 28, 2016 @ 8:02 pm

    I’ve read the study and it’s hard to fully comprehend. That voice recognition is superior to typing for Mandarin is in part driven by the very high “corrected error rate” for the subjects in the study (16 native Mandarin speakers). Whereas the “corrected error rate” for the 16 native English subjects was only 3.49%, it was 19.14% for Chinese subjects.

    (According to the study, a corrected error is one when the person backspaces to fix it.)

    It would be great to hear from Chinese whether they think the data are representative, i.e., is it typical to mistype 1 out of every 5 characters when typing? ("Characters" appears to mean both English letters and Chinese characters, so presumably this includes pinyin before selecting the characters and the Chinese characters themselves.)

    I definitely exceed the 3.49% error rate for native English speakers, even when using a full keyboard. For the text above, it would mean I hit backspace <5 times, whereas I hit backspace 5 times in some single sentences above.

    The study also found that even with speech recognition, there were still significant differences, i.e., a total error rate with English of 2.93% versus 7.51% with Chinese (i.e., the 7.51% reflects the 63% reduction in errors compared with typing).
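As a back-of-the-envelope check on the figures quoted above, the typing error rates implied by the relative reductions can be worked out as below. This assumes that the 20.4 percent and 63.4 percent reductions quoted in the post apply to the same total error rates given in this comment (2.93% and 7.51%); that pairing is an assumption, and the results are not figures taken from the study. The function name is made up for the example.

```python
def implied_baseline(reduced_rate, relative_reduction):
    """Error rate before a relative reduction, given the rate after it.

    If speech is `relative_reduction` lower than typing, then
    speech = typing * (1 - relative_reduction), so
    typing = speech / (1 - relative_reduction).
    """
    return reduced_rate / (1.0 - relative_reduction)


# Speech-recognition error rates (percent) as quoted in the comment above,
# paired with the relative reductions quoted in the post (an assumption).
english_typing = implied_baseline(2.93, 0.204)
mandarin_typing = implied_baseline(7.51, 0.634)

if __name__ == "__main__":
    print(f"Implied English typing error rate: {english_typing:.2f}%")
    print(f"Implied Mandarin typing error rate: {mandarin_typing:.2f}%")
```

Under that assumption the implied typing error rates come out at roughly 3.7% for English and 20.5% for Mandarin, which is at least in the same range as the corrected error rates of 3.49% and 19.14% mentioned earlier in the comment.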

  18. Mark Liberman said,

    August 28, 2016 @ 8:39 pm

    It seems very misleading to call the modality tested "typing", since it's NOT entering text by using a standard computer keyboard, but rather what is more commonly called "texting", namely entering text by using virtual keys on a small handheld device.

    Good typists can maintain 50-100 words per minute more or less indefinitely. The best texters (in English) are more like 15-25 words per minute by similar metrics, as far as I know, or something like four times slower — or even worse in realistic comparisons, as here.

    Meanwhile normal reading rates are in the range of 200 wpm — but the catch would be what the error rates and correction methods are.
