Following up on recent posts suggesting that speech-to-text is not yet a solved problem ("Shelties On Alki Story Forest", "The right boot of the warner of the baron", "AI is brittle"), here's a YouTube link to a lecture given in July of 2018 by Michael Picheny, "Speech Recognition: What's Left?" The whole thing is worth watching, but I particularly draw your attention to the section starting around 50:06, where he reviews the state of human and machine performance with respect to "noise, speaking style, accent, domain robustness, and language learning capabilities", aiming to "make the case that we have a long way to go in [automatic] speech recognition".

Google's "auto-generated" transcript of the lecture is impressively good overall. But I'd like to point out that many of its errors are due not to noise, speaking style, and so on, but rather to a lack of common-sense reasoning about the context. For example, here is what Michael says near the start:

So I first want to start with the inspirations for this talk.

And the inspirations are my two thesis advisors at MIT:

Nat Durlach, on the left,

uh who died last year,

and Lou Braida, on the right,

from a 1993 picture.

Auto-generated transcript (line breaks as given):

so I first want to

start with the inspirations for this

talk and the inspirations are my two

thesis advisors at MIT

NAT direct on the Left who died last

year

and Lou Breda on the right from a 1993

picture

That's two substitutions in 41 words (leaving out the filled pause "uh"), for a Word Error Rate of 4.9%, which is excellent. But it's obvious from context that "NAT direct" should be the name of one of the two cited thesis advisors — and "NAT direct" is not a plausible thesis advisor, even at MIT.
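For readers who want to check the arithmetic, here is a minimal sketch of the usual word-level edit-distance calculation behind a Word Error Rate. The function names are mine, and the normalization (lowercasing, dropping punctuation, leaving out "uh") is an assumption about how the count was made. I'm also assuming the second advisor's name is spelled "Braida" in the reference, which is what makes the count come to two substitutions.

```python
import re

def tokenize(text):
    """Lowercase and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_errors(reference, hypothesis):
    """Return (minimum word edits, number of reference words)."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # prev[j] = edit distance between the reference words so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1], len(ref)

# Reference: what was said, with the filled pause "uh" left out.
# The spelling "Braida" is an assumption (see above).
REFERENCE = (
    "So I first want to start with the inspirations for this talk. "
    "And the inspirations are my two thesis advisors at MIT: "
    "Nat Durlach, on the left, who died last year, "
    "and Lou Braida, on the right, from a 1993 picture."
)

# Hypothesis: the auto-generated transcript, line breaks removed.
HYPOTHESIS = (
    "so I first want to start with the inspirations for this "
    "talk and the inspirations are my two thesis advisors at MIT "
    "NAT direct on the Left who died last year "
    "and Lou Breda on the right from a 1993 picture"
)

errors, n_words = word_errors(REFERENCE, HYPOTHESIS)
print(f"{errors} errors in {n_words} words: WER = {100 * errors / n_words:.1f}%")
```

Because the comparison is case-insensitive, "NAT" for "Nat" and "Left" for "left" are not counted as errors. A stricter normalization would give a higher number.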

A little before that, Michael gives the title of his talk in two languages:

The auto-generated transcript:

the title of my talk today is

canoe schemata nama gary anaconda but

for those of you who don't understand

turkish this is speech recognition

what's left

We shouldn't expect the system to understand that

canoe schemata nama

is actually

Konuşma Tanıma

(Though some day!) But even those of us who don't know Turkish will recognize, hearing Michael talk, that there's a sequence of sounds that's in a different language — which he implicitly identifies as Turkish — and mapping those sounds onto "canoe schemata nama" etc. is not the right thing to do.

Today's best speech-to-text systems are sometimes very good, though as Michael explains, things like noise, casual conversation, and overlapping speech can multiply error rates by large factors.

But even when the best current systems are working well, many of the remaining errors result from failing to apply common-sense reasoning to the content that's correctly recognized. YouTube's auto-generated transcripts are an unbounded source of examples.

