Canoe schemata nama gary anaconda


Following up on recent posts suggesting that speech-to-text is not yet a solved problem ("Shelties On Alki Story Forest", "The right boot of the warner of the baron", "AI is brittle"), here's a YouTube link to a lecture given in July 2018 by Michael Picheny, "Speech Recognition: What's Left?" The whole thing is worth watching, but I particularly draw your attention to the section starting around 50:06, where he reviews the state of human and machine performance with respect to "noise, speaking style, accent, domain robustness, and language learning capabilities", aiming to "make the case that we have a long way to go in [automatic] speech recognition".

Google's "auto-generated" transcript is overall impressively good. But I'd like to point out that many of its errors are not due to noise, speaking style, etc. — but rather to a lack of common-sense reasoning about the context. For example:

So I first want to start with the inspirations for this talk.
And the inspirations are my two thesis advisors at MIT:
Nat Durlach, on the left,
uh who died last year,
and Lou Breda, on the right,
from a 1993 picture.

Auto-generated transcript (line breaks as given):

so I first want to
start with the inspirations for this
talk and the inspirations are my two
thesis advisors at MIT
NAT direct on the Left who died last
and Lou Breda on the right from a 1993

That's two substitutions in 41 words (leaving out the filled pause "uh"), for a Word Error Rate of 4.9%, which is excellent. But it's obvious from context that "NAT direct" should be the name of one of the two cited thesis advisors — and "NAT direct" is not a plausible thesis advisor, even at MIT.
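For readers curious where that 4.9% comes from: word error rate is the word-level edit distance (substitutions plus insertions plus deletions) divided by the number of words in the reference transcript. A minimal sketch (the two short strings below are illustrative fragments, not the full 41-word passage):

```python
# Word Error Rate: word-level Levenshtein distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # One row of the dynamic-programming edit-distance table at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution in a five-word reference: 1/5 = 20% WER.
print(wer("nat durlach on the left", "nat direct on the left"))
```

Two substitutions against a 41-word reference give 2/41 ≈ 4.9%, matching the figure above.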

A little before that, Michael gives the title of his talk in two languages:

The auto-generated transcript:

the title of my talk today is
canoe schemata nama gary anaconda but
for those of you who don't understand
turkish this is speech recognition
what's left

We shouldn't expect the system to understand that

canoe schemata nama

is actually

Konuşma Tanıma

(Though some day!)  But even those of us who don't know Turkish will recognize, hearing Michael talk, that there's a sequence of sounds that's in a different language — which he implicitly identifies as Turkish — and mapping those sounds onto "canoe schemata nama" etc. is not the right thing to do.

Today's best quality speech-to-text is sometimes very good, though as Michael explains, things like noise, casual conversation and overlapping speech can multiply error rates by large factors.

But even when the best current systems are working well, many of the remaining errors result from failing to apply common-sense reasoning to the content that's correctly recognized. YouTube's auto-generated transcripts are an unbounded source of examples.



  1. Steve Morrison said,

    December 17, 2019 @ 8:11 pm

    At last, a better password than “correct horse battery staple”!
    [(myl) Some time ago, I wrote a little script to select three random words from a list (I use SUBTLEX or SCOWL or CELEX), precisely for creating passwords. A few random examples:


    This should be adequately secure, since e.g. CELEX has 160595 wordforms, and 160595^3 = 4.141866e+15 possible three-word combinations — about 52 bits of entropy.

    Sometimes systems require you to work in some numbers or symbols, of course. (Very simple) code available on request.]
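[Such a script is only a few lines. The sketch below is a guess at one way to do it, not myl's actual code; the SUBTLEX/SCOWL/CELEX word lists aren't reproduced here, so any large word list will do. It uses Python's `secrets` module, since passwords call for a cryptographically secure random source:

```python
import math
import secrets

def three_word_password(wordlist: list[str]) -> str:
    # secrets.choice draws from a cryptographically secure RNG;
    # random.choice would be predictable and is unsuitable for passwords.
    return " ".join(secrets.choice(wordlist) for _ in range(3))

# Example with a toy list; in practice load ~160k wordforms from a file.
words = ["canoe", "schemata", "anaconda", "sheltie", "baron", "warner"]
print(three_word_password(words))

# Entropy check for a CELEX-sized list of 160595 wordforms:
# log2(160595**3) ≈ 51.9 bits.
print(round(math.log2(160595 ** 3), 1))
```

With three independent uniform draws, the entropy is simply three times the log of the list size, which is where the 160595^3 figure above comes from.]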

  2. Suburbanbanshee said,

    December 18, 2019 @ 6:49 am

    Anyone who has worked at a call center knows that speech recognition still stinks. Anyone who does not have a standard US accent from the East or West Coast routinely has difficulty having spoken numbers recognized, such as credit card numbers.

    It is ridiculously bad. There are huge population centers in the South, in Texas, in Hispanic areas, and so on, and speech recognition routinely gets basic sounds from those speakers wrong. Humans have to take up the slack, but the callers often start out the call frustrated and embarrassed. (And let's not even talk about people with Caribbean, Russian, and other common accents.)

    What has been done is amazing. But hoboy, is it far from perfected.

    [(myl) This is the kind of thing where today's system architectures plus a few million training examples ought to fix the problem, at least for responses to "What is your X number?"

    If that's really still not working, it must be either because no one cares enough to do what it takes to fix it, or because there are other serious problems as well, like variable background noise.]

  3. Richard Hershberger said,

    December 18, 2019 @ 7:56 am

    My boss uses speech-to-text software (in fairness, a few years old now) for both file notes and outgoing correspondence. I can always tell when he has used it, so I encourage him to let me copy-edit anything going out the door. I prefer using the keyboard; at least that way any errors come honestly.

  4. unekdoud said,

    December 19, 2019 @ 5:50 am

    There are lots of "uncommon" English scenarios that YouTube's transcription is weak at:

    Crosstalk, talking over narration, background music and sound effects. Names, nicknames, acronyms, puns, loanwords, words pronounced wrong intentionally or otherwise. Homophones, the word "to", nouns suffixed with "-y", maybe even the word "not". Words ending in s/'s/th/t versus those sounds at the beginning of the next word. Singing, whispering, stuttering, non-word utterances, words spoken while laughing, words spoken quickly! Numbers, long words spelled out, codes. And sometimes even text-to-speech (unfair but ironic). And on top of all that there's no natural punctuation, emphasis or profanity.

    I know that makes it sound horrible, but it's possible to fail at all these on a regular basis and still be acceptable to both humans and the WER metric.
