Dramatic reading of ASR voicemail transcription


Following up on the recent post about ASR error rates, here's Mary Robinette Kowal doing a dramatic reading of the Google Voice transcript of three phone calls (voicemail messages?) from John Scalzi:

John Scalzi's reaction:

All of the human experience is in there. All in one minute and eight seconds.

It. Is. Magic.

I've never tried getting Google Voice transcripts of voicemail, because I've been trying to ignore voicemail entirely for many years, ever since my university's hospital circulated my office phone number as the fax number for submitting applications for new radiation safety badges. My voicemail filled up with hundreds of recordings of plaintive fax-machine noises, and thus became even more useless as a communications medium than voicemail normally is. I no longer even have an office phone (why pay for something I don't use?), but the habit of ignoring voicemail has stuck with me.

However, I make extensive use of the ASR "note to self" feature on my Android cell phone, and it generally works pretty well. For example, my email inbox now contains this dictated "note to self":

Language Log post about dramatization of John Scalzi is voicemail messages

which has got one substitution (is for 's) in 11 words, for a "word error rate" of 1/11 = 9%.

[I presume that] Google's ASR system can do this sort of thing so well because Google knows a great deal about me, including my relationship to Language Log, and (probably) the fact that I recently visited John Scalzi's web site. The system is using an adaptive language model, for which the perplexity of what I said is radically lower than it would be in the case of a model of the English language at large. [Update — No, it ISN'T using a personally-adapted language model, according to a comment by Vincent Vanhoucke, who should know. That makes the performance all the more impressive, since the effective perplexity will obviously be much higher than if it were able to make use of what Google knows about me.]
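
For readers who want the perplexity comparison in concrete terms, here is a minimal Python sketch. The per-word probabilities are invented purely for illustration and do not come from any real Google model; the names `general_model` and `adapted_model` are hypothetical. The only point is that a model assigning higher probability to the words actually spoken ends up with lower perplexity.

```python
import math

def perplexity(word_probs):
    """Perplexity: exp of the average negative log probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical probabilities that two models assign to the same four words.
# These numbers are made up for illustration only.
general_model = [0.001, 0.0005, 0.01, 0.02]   # English at large
adapted_model = [0.05, 0.04, 0.1, 0.1]        # model that expects these topics

print("general:", perplexity(general_model))
print("adapted:", perplexity(adapted_model))
```

A model that is uniformly uncertain among k choices at every word has perplexity k, which is why perplexity is often read as an effective branching factor.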

It doesn't always work so well — another dictated note in my email inbox reads

Language Log post about the relationship between perplexity and we're there a

which amusingly substitutes "we're there a" for "word error", yielding a WER more like 30%.

But usually, 10% WER is about what I see, which I think is pretty good for material dictated into a cell phone in a restaurant, on a street corner, or in a moving train — and even the more spectacular errors, like that last one, generally leave the overall message interpretable (at least to me).
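
The WER figures for the two notes quoted above can be checked with a short Python sketch. It computes a word-level edit distance (substitutions, insertions and deletions) and divides by the number of reference words. Two assumptions are mine: the possessive 's is treated as its own token, which is what makes the first note come out to 11 words, and the second note is taken to have ended in "word error", as described above. The function names are illustrative, not from any ASR toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# First note: 's is split off as a separate token (an assumption).
ref1 = "Language Log post about dramatization of John Scalzi 's voicemail messages"
hyp1 = "Language Log post about dramatization of John Scalzi is voicemail messages"

# Second note: intended ending assumed to be "word error".
ref2 = "Language Log post about the relationship between perplexity and word error"
hyp2 = "Language Log post about the relationship between perplexity and we're there a"

print(f"note 1 WER: {wer(ref1, hyp1):.0%}")
print(f"note 2 WER: {wer(ref2, hyp2):.0%}")
```

Under these assumptions the first note has one error in 11 words, and the second has three (two substitutions and one insertion) in 11 words, in line with the rough figures given above. Counting "Scalzi's" as a single word instead would raise the first figure, since the error would then register as a substitution plus an insertion over 10 reference words.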


  1. Ben said,

    January 7, 2013 @ 9:59 am

    Here's one of my favorite examples of this used for comedic effect.

  2. Vincent Vanhoucke said,

    January 7, 2013 @ 10:28 am

    The 'note to self' feature does not (at this time) use any language model personalization. You'll be happy to hear that both the Language Log and John Scalzi are big enough on the Internets to be part of the universal language model.

    [(myl) Wow.]

  3. Q. Pheevr said,

    January 7, 2013 @ 12:37 pm

    Google Voice has that endearing mix of erudition and naïveté that is so characteristic of autodidacts: it knows who John Scalzi is, but it doesn't seem to know that an English utterance is highly unlikely to end with the word a.*
    I assume that's a symptom of having a lot of corpus-based statistical knowledge and not a lot of syntactic sophistication, but the absence of sentence-final articles seems like the sort of thing that wouldn't be too hard to pick up on statistically (if you're looking for that kind of generalization at all).

    *Yes, yes. I know. But that's different.

  4. Andy Averill said,

    January 7, 2013 @ 5:36 pm

    @Pheevr, I would hazard a guess that when they end up with something with a very low probability, it's because the only alternatives they could come up with are even less likely. In which case it seems perfectly reasonable to go with what you've got in the hopes that the user will be able to make sense of it, rather than, say, just leave it out.

  5. Just another Peter said,

    January 7, 2013 @ 6:52 pm

    I just love the irony of the fact that the word errors in that last one are on the term "word error"

  6. David Morris said,

    January 7, 2013 @ 8:40 pm

    Fortunately I don't use ASR voicemail. In fact I rarely use voicemail at all.

    I get seriously distracted by live captioning of sports broadcasts, usually (for me) tennis and cricket. This is partly because the captioning lags behind the voice by about 5 seconds, and partly because some of the errors are either funny or inexplicable. One of the latter came on the night of the women's final of the Australian Open tennis last year. After two weeks of the tournament, the live captioning referred to the winner as "Victoria as a drinker".

  7. Richard Hershberger said,

    January 8, 2013 @ 12:04 am

    I work in a small law office. My boss is quite enamored of voice recognition software, and uses it both for internal notes and for dictating letters. I am less impressed by it. I often go back and edit the internal notes. I often edit legal papers he has drafted (and he edits the ones I draft) but I wish he would let me edit those letters, too. He offered to buy me a license for the software. I declined. I am used to thinking via a keyboard. I won't claim that I don't let my share of typos through, but I will claim the results are better than what the software produces.

  8. Keith M Ellis said,

    January 8, 2013 @ 2:12 am

    I've used Google Voice since its beta rollout. And I never answer my phone, leading to a high proportion of voice-mails (from a low frequency of calls). I often will rely upon the ASR transcript (and the forwarded-to-email option) without ever listening to the voicemail itself.

    My experience is that for some callers (that is, particular people, consistently) it is very accurate and for others — even now, after several years — is almost gibberish. It's not call quality, either. Some people it just can't understand. Even then, however, I get the gist of the voicemail from the transcript, usually.

    [(myl) In the jargon of speech technologists, these people are called "sheep" and "goats" (in reference to Matthew 25:31-46). It's a well-known and persistent fact that there are large individual differences in ASR outcomes.]

  9. Keith M Ellis said,

    January 8, 2013 @ 2:27 am

    Incidentally, in Google Voice's web interface it includes at the bottom of each voicemail transcription two options: a "transcript useful?" yes/no pair of checkboxes, and a "donate this voicemail" checkbox. The last allows it to be manually transcribed by a person (requiring someone listen to it, thus asking for permission) and thus used for the software model.

    I seem to notice these every few months or so and sometimes will be moved to mark one or two messages out of a weird sort of guilt, but mostly I've just ignored the options.

  10. Andrew (not the same one) said,

    January 8, 2013 @ 11:18 am

    I was struck by the fact that she says 'three voicemail'. Is this singular-for-plural construction normal with this word?

  11. Mary Robinette Kowal said,

    January 9, 2013 @ 12:13 am

    No. That was me slipping and realizing that since it was live I couldn't fix it.

    If you want to see the next generation, look at the YouTube closed captioning for the videos.

  12. Andrew (not the same one) said,

    January 9, 2013 @ 8:40 am

    Oh, how sad. I thought I had discovered an interesting linguistic phenomenon.

  13. mgh said,

    January 10, 2013 @ 10:20 am

    my favorite google voice transcription error, from my own voicemails, is "dialing my eyes and crossing my teeth" (can you guess what was spoken?)

  14. Fiona Hanington said,

    January 10, 2013 @ 3:00 pm

    @mgh "Dotting my 'i's and crossing my 't's"


  15. Andrew (not the same one) said,

    January 10, 2013 @ 3:52 pm

    Presumably the machine was annoyed by the reference to handwriting.

  16. quixote said,

    January 10, 2013 @ 6:55 pm

    When companies try to use machines to talk to me, I have them talk to my machine, aka GoogVoice. You'd think, or maybe you wouldn't, being trained linguists, but I'd think that all the machines would understand each other.

    Au contraire. The word salads I see are a couple of orders of magnitude worse than your examples. About one time in three or four, they're bad enough so I have no idea what the message was trying to say. Given that it's simple stuff like, "Your order has been shipped" that seems doubly pathetic.

    Maybe the Goog needs to include machine recordings in its training sessions. I wonder if they haven't bothered because that would cost a few pennies for somebody to run them.

  17. Yuval said,

    January 20, 2013 @ 2:43 am

    Got yourself a little typo there – "every since".
