## "Unparalleled accuracy" == "Freud as a scrub woman"

A couple of years ago, in connection with the JSALT2017 summer workshop, I tried several commercial speech-to-text APIs on some clinical recordings, with very poor results. Recently I thought I'd try again, to see how things have progressed. After all, there have been recent claims of "human parity" in various speech-to-text applications, and (for example) Google's Cloud Speech-to-Text tells us that it will "Apply the most advanced deep-learning neural network algorithms to audio for speech recognition with unparalleled accuracy", and that "Cloud Speech-to-Text accuracy improves over time as Google improves the internal speech recognition technology used by Google products."

So I picked one of the better-quality recordings of neuropsychological test sessions that we analyzed during that 2017 workshop, and tried a few segments. Executive summary: general human parity in automatic speech-to-text is still a ways off, at least for inputs like these.

Here's the intro of the Logical Memory task in that particular recording, as transcribed at LDC (minus the time stamps):

Subject: Yeah.
Interviewer: Okay.
Interviewer: Anna Thompson of South Boston,
Interviewer: employed as a scrubwoman in an office building,
Interviewer: reported at the city hall station
Interviewer: that she had been held up on state street
Interviewer: the night before and robbed of fifteen dollars.
Interviewer: She had four little children,
Interviewer: the rent was due, and they had not eaten for two days.
Interviewer: The officers, touched by the woman's story, made up a purse for her.
Interviewer: So I want you to tell me everything you can remember.

Speaker1: Anna Thompson of South Boston
Speaker2: and Freud as a scrub woman in an office building reported at the city hall station that she had been held up on State Street the night before and Roz of
Speaker1: $15. She had for little
Speaker2: children. The rent was due and they had not eaten for two
Speaker1: days the officers
Speaker2: for the woman Story made
Speaker1: up a purse for her.
Speaker2: Don't you tell me everything you can remember?

The Word Error Rate is not terrible in this sample — though things like "and Freud as a scrub woman" instead of "employed as a scrubwoman" are kind of weird — but the diarization (who spoke when) is really bad.
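
For readers who want to score such samples themselves, word error rate is just word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length. Here's a minimal generic sketch — not the scoring tool used for this post:

```python
# Minimal word error rate (WER) via Levenshtein distance over words.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.lower().split(), hyp.lower().split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One insertion ("and") plus one substitution ("employed" -> "Freud")
# against a five-word reference:
print(wer("employed as a scrub woman", "and Freud as a scrub woman"))  # → 0.4
```

Note that this treats the hypothesis as one undifferentiated word stream; it says nothing about diarization errors, which need time-aligned speaker-attribution scoring.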

FWIW, here are the control parameters that I used, for all the examples in this post:

{
  "audio": {
  },
  "config": {
    "diarizationSpeakerCount": 2,
    "enableAutomaticPunctuation": true,
    "enableSpeakerDiarization": true,
    "encoding": "LINEAR16",
    "languageCode": "en-US",
    "model": "default"
  }
}
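
For reference, a request body with these parameters can be assembled roughly as follows. This is a sketch, not the exact invocation used for the post: the audio here is a placeholder byte string standing in for real base64-encoded LINEAR16 (16-bit PCM) data, and authentication is elided.

```python
import base64
import json

# Placeholder audio: one second of 16 kHz, 16-bit silence (illustrative only;
# a real request would read and base64-encode the actual WAV samples).
audio_bytes = b"\x00\x00" * 16000

payload = {
    "audio": {
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    },
    "config": {
        "diarizationSpeakerCount": 2,
        "enableAutomaticPunctuation": True,
        "enableSpeakerDiarization": True,
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "model": "default",
    },
}

body = json.dumps(payload)
# This body would then be POSTed to the speech:recognize REST endpoint
# with an API key or OAuth token (not shown here).
```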



Google does much worse on the subject's responses — in this case the LDC transcription was

Subject: This poor lady, she didn't have much money.
Subject: She didn't, uh, money to buy food.
Subject: You know, I just can't, I'm not good at this.
Subject: I'm not doing good on this one.

Google's attempt for the same stretch of audio:

Speaker1: Flemington ice
Speaker1: food

Sometimes the system's language model goes overboard. Thus in the first part of the digit-span task, the LDC transcriber has:

Interviewer: Five, eight, two.
Subject: Five, eight, two.
Interviewer: Six, four, three, nine.
Subject: Six, four, three, nine.
Interviewer: Four, two, seven, three, one.
Subject: Four, two, seven, blank, one.
Subject: I missed one.

Google's version:

Speaker1: 427-314-2721
Speaker1: 643-9439
Speaker1: 427-314-2721

Just for fun, here's another passage, from the start of the Fluency task:

Interviewer: For this next test, I'm going to say a letter of the alphabet,
Interviewer: and I'd like you to give me as many words that start with that letter as quickly as you can.
Interviewer: So for example, if I say ~B,
Interviewer: you might give me words like bad, bottle, or bed.
Subject: uh huh
Interviewer: However, I don't want you to give me words that are proper names, such as Boston or Bob.
Subject: mhm
Interviewer: And I also don't want you to give me the same one again with a different ending,
Interviewer: such as bake, baking, or baked.
Interviewer: Do you have any questions?
Subject: oh I see, uh huh okay.
Interviewer: Okay?
Interviewer: Well the first letter is ~F.
Interviewer: Give me as many words that start with ~F as quickly as you can.

Speaker1: British Max Plus I'm going to send a letter of the alphabet and I'd like you to give me as many words that start with that letter as quickly as you can. So for example, if I think he was like Dad said the same one again with a different ending baking or they do you have any questions but start with us as quickly as you can so having sex

The subject's response, per the human transcriber:

Subject: Fuel,
Subject: face,
Subject: fuzz,
Subject: um
Subject: {breath}
Subject: {laugh}
Subject: uh floater,
Subject: uh
Subject: flag,
Subject: uh
Subject: film,
Subject: uh fish,
Subject: uh forest,
Subject: uh

Google's version:

Speaker1: You'll face.
Speaker1: pause
Speaker2: border film

I haven't checked whether other current systems can do better, though I'm not especially hopeful.

Overall, my belief is that language models adapted to the characteristics of the particular subtasks in such test batteries, combined with some other specializations, would enable useful speech-to-text performance levels.  But this remains to be demonstrated.
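
One off-the-shelf step in that direction is the API's phrase-hint mechanism ("speechContexts"), which biases recognition toward expected vocabulary. For a digit-span subtask, the request config above might be extended along these lines (a sketch — the phrase list is illustrative, and I haven't tested whether hints alone would stop the phone-number normalization seen above):

```json
"config": {
  "speechContexts": [
    {
      "phrases": ["one", "two", "three", "four", "five",
                  "six", "seven", "eight", "nine", "blank"]
    }
  ]
}
```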

[I should add that ASR technology has been improving for a long time, is pretty good now, and is obviously good enough for a growing number of applications. But it's important to keep in mind that this doesn't mean that it's as good as it could be, or good enough for any arbitrary application that might occur to you — like transcribing and analyzing clinical interviews or political surveys. (At least as recorded by clinicians and political scientists, who tend to be remarkably poor sound engineers, for some reason.)]

1. ### Victoria said,

April 27, 2019 @ 9:12 am

They're not good enough to transcribe directly yet – but if you really can't type quickly, you can slow down the audio file, throw on some headphones, and dictate what you hear clearly and slowly into a microphone; Google speech-to-text can do a pretty good job there at least.

2. ### Ben Zimmer said,

April 27, 2019 @ 9:41 am

I grew up in Flemington, NJ… "Flemington ice" evokes either the local ice skating rink or the local Rita's Italian Ice outlet.

3. ### TIC said,

April 28, 2019 @ 6:27 am

From the title, I thought this post was going to be about a mispronunciation/mistranslation of 'schadenfreude'…

I grew up not too far away, Ben, and "Flemington ice" evoked for me a description of a slicked over surface on the speedway!…

4. ### Cervantes said,

April 29, 2019 @ 8:01 am

Much of my research consists of analyzing transcripts of clinical encounters. The labor of transcription is a major problem that limits sample sizes, so I've been hoping for the day when machines can do it. I think it's even farther away for medical encounters than it is for the situation you present, for several reasons. One is that the clinicians are at least as disfluent as the patients, there are lots of false starts and non-lexical utterances; there is often crosstalk; there's a lot of arcane vocabulary; and there is a lot of extraneous noise. I have been hopeful that we could at least get a transcript that it would be less work for a human to correct than to produce from scratch, but it doesn't look like we're there yet.

5. ### Topher Cooper said,

April 29, 2019 @ 11:12 am

"Sometimes the system's language model goes overboard"

In the early 70's I worked (my semi-official title being "Head Peon") on the Hearsay speech recognition project (part of the "ARPA Speech Understanding Project"). There were two Hearsay programs, Hearsay I (originally just called "Hearsay") and Hearsay II (later frequently just called Hearsay). Of all the programs/labs in the ARPA project/competition, Hearsay II came closest to meeting the ARPA success goals; it failed only in speed, and that goal had been set on the assumption of faster hardware by the end of the five-year project. Since we were still using the PDP-10 we started with, the speed goals would have been met if the system were running on the latest hardware at the end of the five years.

The Hearsay I project used as its task a speech interface to a computer-vs-human chess program. Hearsay I had three "recognition modules": Phonetics, Syntax, and "Semantics" (scare quotes on the last because it was really mostly pragmatics). The Semantics module was built around a well regarded computer chess program which, in addition to supporting the task (i.e., being the computer chess program that was being played against), would generate on the player's turn a list of all possible moves the player could make, with the rating that the program placed on each one. The rating scores were used as the contribution of the Semantics module, on the theory that players probably preferred to make great moves rather than terrible ones, and so were probably saying highly scored moves rather than lowly scored ones.

Sounds sensible enough, but the Hearsay I algorithms were hard-wired to give equal weight to the "opinions" of each of the three modules (one of the many things corrected in Hearsay II, which was really a completely different program that shared little more than the core approach — independent cooperating "modules", metaphorically called "experts" in HSII). Anyway, a result of this lack of weighting was that HSI would tend to insist that the speaker was saying whatever the chess program thought was even a somewhat better move, over what it considered a somewhat poorer one. It was virtually impossible, for example, to get it to make the move "Pawn to Queen four" if it thought that "Pawn to King four" was a better move. Furthermore, currently illegal moves were difficult to input to the system — which was supposed to be allowed by the task, though the response wasn't to make the move but to print out that the move was illegal and ask for another move.

In fact, on several occasions, I got the system to make a recommendation to me as to how I should move. The technique was to mumble so that the first two modules would generate a large number of possibilities, among which the third module would prefer the better move (at least by its lights — and I was a poor enough player that that was generally a better choice than what I might choose).

6. ### Rick Rubenstein said,

April 29, 2019 @ 8:27 pm

My intuition is that progress in this task is a bit like progress on the Twin Prime conjecture: Current methods and variants thereof can get close, but the final gap will require something entirely different. In this case, though, I think the "something different" is a deep and accurate model of the world and how actors in it behave. And that, I believe, is still a long, long way off.

I think this applies to many AI tasks (regardless of how ridiculous Hofstadter quickly looked when he opined in the late 1970's that a computer would only play human-level chess when it also had human-level humor, etc.).

7. ### Tommi Nieminen said,

April 30, 2019 @ 4:38 am

I've recently used the Google Speech API for dictating tens of thousands of words, and while it complements keyboard input nicely, it's clear that it's not usable for unedited transcription. I have a decent microphone and a quiet environment, and there are still major errors in roughly half of the output sentences. There's a very strong language model effect (although I assume the neural systems don't have explicit language models), with some words almost impossible to input in the correct form even with very careful pronunciation.