John Scalzi's reaction:
All of the human experience is in there. All in one minute and eight seconds.
It. Is. Magic.
I've never tried getting Google Voice transcripts of voicemail, because I've been trying to ignore voicemail entirely for many years, ever since my university's hospital circulated my office phone number as the fax number for submitting applications for new radiation safety badges. My voicemail filled up with hundreds of recordings of plaintive fax-machine noises, and thus became even more useless as a communications medium than voicemail normally is. I no longer even have an office phone (why pay for something I don't use?), but the habit of ignoring voicemail has stuck with me.
However, I make extensive use of the ASR "note to self" feature on my Android cell phone, and it generally works pretty well. For example, my email inbox now contains this dictated "note to self":
Language Log post about dramatization of John Scalzi is voicemail messages
which has got one substitution (is for 's) in 11 words, for a "word error rate" of 1/11 = 9%.
[I presume that] Google's ASR system can do this sort of thing so well because Google knows a great deal about me, including my relationship to Language Log, and (probably) the fact that I recently visited John Scalzi's web site. The system is using an adaptive language model, for which the perplexity of what I said is radically lower than it would be in the case of a model of the English language at large. [Update -- No, it ISN'T using a personally-adapted language model, according to a comment by Vincent Vanhoucke, who should know. That makes the performance all the more impressive, since the effective perplexity will obviously be much higher than if it were able to make use of what Google knows about me.]
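For readers who want the perplexity point made concrete, here is a minimal sketch using the standard definition: the inverse of the geometric mean of the probabilities a model assigns to each word. The function name and the per-word probabilities are invented purely for illustration; they are not measurements from Google's system or any real language model.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence, given the probability the model
    assigned to each word in context: 2 ** (-(1/N) * sum(log2 p_i))."""
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))

# Hypothetical probabilities for the same four-word phrase under two models.
# These numbers are made up to show the direction of the effect only.
general_model = [0.001, 0.0005, 0.01, 0.05]   # English at large
adapted_model = [0.05, 0.2, 0.05, 0.1]        # tuned to one speaker's topics

print(perplexity(general_model))
print(perplexity(adapted_model))
```

A model that assigns higher probabilities to the words actually spoken has lower perplexity, which is why an adapted model would be expected to make the recognizer's job easier.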
It doesn't always work so well — another dictated note in my email inbox reads
Language Log post about the relationship between perplexity and we're there a
which amusingly substitutes "we're there a" for "word error", yielding a WER more like 30%.
But usually, 10% WER is about what I see, which I think is pretty good for material dictated into a cell phone in a restaurant, on a street corner, or in a moving train — and even the more spectacular errors, like that last one, generally leave the overall message interpretable (at least to me).
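The word error rates quoted above can be checked with a short sketch. It computes WER as the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference. Two things here are assumptions rather than anything stated in the post: the reference strings are reconstructed from the descriptions of what was said, and the possessive 's is split off as its own token so that the first example has 11 words. The function name is likewise just illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Assumed reference, with 's tokenized separately (11 tokens).
wer_scalzi = word_error_rate(
    "Language Log post about dramatization of John Scalzi 's voicemail messages",
    "Language Log post about dramatization of John Scalzi is voicemail messages",
)

# Assumed reference for the second note (11 words).
wer_perplexity = word_error_rate(
    "Language Log post about the relationship between perplexity and word error",
    "Language Log post about the relationship between perplexity and we're there a",
)

print(f"{wer_scalzi:.0%}")      # 9%
print(f"{wer_perplexity:.0%}")  # 27%
```

Under these assumptions the second note comes out as two substitutions plus one insertion against 11 reference words, or 3/11, which is consistent with "more like 30%".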