Voice recognition for inputting


When I'm with my sister Heidi, whether it be in Seattle or northeast Ohio or anywhere else in the world, she's often talking to Siri.  She asks Siri to look up information about trees, about food, about traditional medicines, about yoga, about genealogy, and anything else she wants to investigate.  Above all, when we're driving around, she asks Siri for directions to wherever we're going.

To me, who doesn't even own a cell phone, this is all quite miraculous.  A few days ago, at the conclusion of my "Language, Script, and Society in China" class, however, a new (for me) dimension of voice recognition was demonstrated by one of the students.

During the class period, we had been discussing the relative merits of phonetic inputting (e.g., Hanyu Pinyin) vs. shape-based inputting (e.g., writing on a pad or glass with one's fingertip) — cf. "Easy versus exact" (10/14/17).  As the class was disbanding, Ben Roth whipped out his cell phone, and demonstrated how efficient it was at entering Chinese text through speech.  Of course, everything he said was simple and routine, things like "wǒ ài nǐ 我爱你" ("I love you"), "tā shì wǒ de péngyǒu 他是我的朋友" ("he is my friend"), and so forth, but it did immediately and accurately turn Ben's spoken words into Chinese characters.

Krista Ryu, who was standing nearby, joked about writing a whole paper that way.  But would that be practical?  One question that immediately comes to mind is how one makes corrections or revisions to what one has "typed" via voice input.  How do you backspace?  How do you delete?  How do you revise?  How do you move things around in the text (cut and paste)?

Voice inputting of Chinese characters has been around a long time.  I remember visiting the home of a Peking University professor about 25 years ago.  He showed me his new toy, which was some software that enabled him to input text via spoken language.  He had just gotten the software and he said that he was still "training" it to understand his unique accent (he also spoke with a noticeable stutter).  I watched him play around with it for about half an hour, but I don't think that he succeeded in entering even one complete, correct sentence.

I'm told that more and more people in China nowadays opt to enter text via voice, especially for very short messages on WeChat and similar social media applications.  That certainly is the EASY way out for dealing with the challenging writing system, but it may not be the most EXACT.


"Chinese character inputting" (10/17/15; includes references to earlier posts related to inputting)

"Stroke order inputting" (10/30/11)

"Voice recognition for English and Mandarin typing" (8/24/16)

"Voice recognition vs. Shandong accent" (3/1/15)


  1. AntC said,

    October 22, 2017 @ 9:14 pm

    Given the number of homophones you keep telling us about, and the 'character amnesia': I expect voice recognition software would rather often pick the wrong character; and rather often the user wouldn't know any better.

    I have seen voice-input being used by putonghua speakers, chiefly for Google searches rather than whole sentences. But even then, I chiefly noticed their annoyance that it kept choosing the wrong characters.

    Does that explain some of the 'lost in translation' howlers you regale us with? Voice input -> wrong character -> nonsense translation.

  2. Victor Mair said,

    October 22, 2017 @ 9:52 pm


    "Does that explain some of the 'lost in translation' howlers you regale us with? Voice input -> wrong character -> nonsense translation."

    That's a good suggestion, and something we need to take into consideration more often when faced with such "lost in translation" blunders.

  3. BrisbaneTom said,

    October 23, 2017 @ 12:36 am

    Victor, my experience with WeChat is that the voice input is not converted into text, rather it's sent as a sound file, and received by listening – like a voicemail. This is a very convenient way to send a message while walking, or if the weather makes touch screens impractical. I feel that the benefits of this system are weighed more towards the sender – listening to the whole message is not as convenient as reading (at least in my opinion).

  4. Sean Richardson said,

    October 23, 2017 @ 2:06 am

    I wonder if the same would explain what seems to be a noticeable uptick over the last two years in the same sort of howlers in English — "phase" for "faze" and worse — in articles appearing immediately on websites only. If the input method is recognition of streams of whole words, spell-check wouldn't flag any for a proofreader's eye, if proofreading is even happening!

  5. leoboiko said,

    October 23, 2017 @ 2:06 am

    I find my children, and their friends, use voice messages on WhatsApp much more than I'm comfortable with, in Portuguese.

  6. Sid Smith said,

    October 23, 2017 @ 3:23 am

    Here in the UK, voice-recognition s/w is often used for TV subtitles. Many howlers ensue. I submitted this one to our newspaper's Diary section (the gentleman mentioned is a brutally effective soccer forward):

    "Sky News will be following the destructive Harry Kane across Florida today."

  7. Victor Mair said,

    October 23, 2017 @ 5:59 am

    @Sid Smith

    Thanks for the good belly laugh to start the morning.

  8. Mark Liberman said,

    October 23, 2017 @ 6:18 am

    PC-based and server-based dictation systems have been working well for a couple of decades, for those people who use them by choice (rare) or by necessity (for example because of disabilities preventing typing).

    And high-quality speech-to-text has been available in cell phones for about a decade — see e.g. "ASR elevator", 11/14/2010.

    The technology has certainly improved over the past few years, partly due to better algorithms and partly due to larger amounts of training data. But the main thing that's really new here is Victor's exposure to the technology.

    And with the spread of devices and applications like Alexa and Google Home, on top of Siri and Google Assistant and Cortana and so on, the Victors of the world are increasingly engaged.

  9. Wei said,

    October 23, 2017 @ 6:42 am

    I just tried inputting the one sentence that crossed my mind with the built-in voice recognition engine on WeChat, and the result is this:


    It mistook 比 for 笔 rather mystically and messed up 的/地/得 quite understandably. But anyway it's much more useful than I thought! No wonder my senior relatives are using it.

  10. Phil H said,

    October 23, 2017 @ 8:15 am

    English voice input technology has improved dramatically over the last 5 years – this is what's made Siri and Google Assistant possible. I tried English voice input for a fairly long text message in WeChat yesterday, and it was very good, with only one major mistake. Funnily enough, though, I'm still using the keyboard to enter this passage now. I still don't fully trust the voice input. And it is difficult to both spot and correct mistakes. So I'm not quite ready to make the switch to voice as my primary input method yet.

  11. Chris Button said,

    October 23, 2017 @ 11:18 am

    Given that intonation in English must be a real challenge for voice input (e.g. an isolated word like 'inde'pendent usually becomes 'independent ad'vice in compounds, using John Wells' notation), I wonder if intonation superimposed over lexical tone in Chinese makes things any simpler (since the tone contours are still distinct regardless of intonation) or harder (since the lexical tones are nonetheless still being distorted by intonation).

  12. Chris Button said,

    October 23, 2017 @ 11:21 am

    For some reason the underlining for the nucleus did not appear with the italics:


    'independent ad'vice

  13. Chris Button said,

    October 23, 2017 @ 11:22 am

    Ok – so can someone please tell me how to underline? I'll write in caps instead:


    'independent ad'VICE

  14. Terry Hunt said,

    October 23, 2017 @ 11:46 am

    @ Sean Richardson

    Specifically on 'faze': I myself (an ageing Brit) spell it thus, but in various (rare) uses of it in novels and stories that I've seen, the oldest not long after 1900, I've seen various others, such that I doubt that it's ever had a widely known "official" spelling. (In my paper Compact OED, 'faze' has two US slang or colloquial usages cited from 1890.) I've always presumed that it's derived from or related to the Scots and Irish 'fash', which has broadly the same meaning.

  15. David Morris said,

    October 23, 2017 @ 3:44 pm

    Occasionally I have turned on the closed captioning function on my television. I understand that there are two ways of producing these – one is via a steno-keyboard and the other is via voice-to-text, with a trained operator respeaking the words of the program into a machine. Either way, it lags behind the actual speech by about five seconds. I had to turn it off – it just got too distracting, partly because of the delay and partly because of the mistakes. During the Australian Open tennis, it routinely mixed 'service' and 'surface' (an understandable mistake), but referred to the women's champion, *during the presentation ceremony*, as 'Victoria as a drinker'.

  16. Andrew said,

    October 23, 2017 @ 10:20 pm

    I was impressed the other day when Google's voice recognition tracked across an unmarked language shift: I told it "Ok Google, play Los Cubanos Postizos" and got the music I was asking for. :)

  17. Biscia said,

    October 24, 2017 @ 7:22 am

    I know many translators who use dictation software for first drafts, because it's faster than typing (once you've "trained" it, anyway) and helps you avoid carpal tunnel problems. Sooner or later I'll get around to trying it myself, because typing a first-draft translation is so much like taking dictation from yourself anyway, and I often find myself making bizarre phonetic slip-ups ("won" for "one" and so on) that I never would in any other situation. I figure a voice recognition program can't be that much worse.

  18. Dan Lufkin said,

    October 24, 2017 @ 12:05 pm

    @Biscia: I used Dragon NaturallySpeaking starting with version 3.0 in translation and never could understand why it wasn't used universally. With care and experience, I could cruise along at close to 1000 words per hour draft and about 800 wph finished product. The software and a better microphone paid off in less than an hour. Yes, there was the situation with homonyms and accidental garbles, but these were largely predictable and I could always spell out what I wanted. I used to test a new version against Harvard sentences (q. G.); with v. 10 and later, Dragon would usually score at less than 1% error.

  19. Andrew said,

    October 26, 2017 @ 2:19 am

    Isn't that about a tenth the speed of a good typist?

  20. Biscia said,

    October 27, 2017 @ 9:18 am

    That's slow for typing, but very fast for a finished translation. And from what I'm told, if all you want is to throw down a sloppy first draft (mine tend to be almost nonsensical), you can go a little faster than average typing speed. Most importantly, while avoiding damage to your hands (or eyes, if you read off a hard copy and only check everything afterwards). Being able to stand up, stretch, stick your hands in your pockets if they're chilly–nothing to sneeze at, if one has been translating full-time for years and years.

  21. Lane said,

    October 27, 2017 @ 11:18 am

    About 1% word-error rate was what I got from Dragon, too. This was reading a prepared text, not composing by speech.

    To Victor's question, there are various voice commands for correcting a misunderstood input. I can't remember them now, but they amount to saying "stop; back up; I meant 'bare', not 'bear'". They work fairly quickly and can be even faster if you use a mouse.

    I'd be interested in knowing whether Chinese is harder or easier. On one hand Chinese has lots of homophones; on the other, so does English. The software uses a 'language model' to disambiguate homophones. It is trained on real-world text. So if it hears something like "the right to bare arms shall not be infringed" it will know that the right word is "bear" because the phrase "the right to bear arms" will appear lots in its training data.

    But is this kind of homophonic slip more likely in Chinese? Or less? What role do e.g. tones play?
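
Lane's description of a language model choosing between homophones can be illustrated with a toy example. What follows is only a minimal sketch in Python, not how Dragon or any real recognizer works: the four training sentences, the HOMOPHONES table, and the function names score and disambiguate are all invented for illustration, and real systems combine much larger models with acoustic evidence.

```python
from collections import Counter

# Tiny made-up "training text"; a real language model sees vastly more data.
SENTENCES = [
    "the right to bear arms shall not be infringed",
    "citizens have the right to bear arms",
    "she rolled up her sleeves to bare her arms",
    "the bear slept in the cave",
]

# Hypothetical table of words that sound alike.
HOMOPHONES = {"bear": ["bear", "bare"], "bare": ["bear", "bare"]}

# Count adjacent word pairs (bigrams) within each sentence.
bigrams = Counter()
for sentence in SENTENCES:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))


def score(prev, word, nxt):
    """Add-one smoothed count of how well `word` fits between its neighbours."""
    return (bigrams[(prev, word)] + 1) * (bigrams[(word, nxt)] + 1)


def disambiguate(words):
    """Replace each word with its best-fitting homophone, left to right."""
    out = []
    for i, word in enumerate(words):
        candidates = HOMOPHONES.get(word, [word])
        prev = out[-1] if out else "<s>"
        nxt = words[i + 1] if i + 1 < len(words) else "</s>"
        out.append(max(candidates, key=lambda c: score(prev, c, nxt)))
    return out


heard = "the right to bare arms shall not be infringed".split()
print(" ".join(disambiguate(heard)))
```

Because "to bear" and "bear arms" both occur more often in the toy text than "to bare" and "bare arms", the sketch picks "bear" here, while in "sleeves to bear her arms" it would switch to "bare". Whether the same trick works as well for Chinese, where the candidate sets are larger, is exactly the open question.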
