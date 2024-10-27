

Whisper is a widely used speech-to-text system from OpenAI, and it turns out that generative AI's hallucination problem afflicts Whisper to a surprisingly serious extent, as documented by Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, and Mona Sloane, "Careless Whisper: Speech-to-Text Hallucination Harms", in The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024:

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate Open AI’s Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations—a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.

Some of the cited hallucinations are just insertions of vaguely associated extra phrases, like turning the actual audio "pick the bread and peanut butter" into the hallucinated extension "Take the bread and add butter. Take 2 or 3 sticks, dip them both in the mixed egg wash and coat."

That may seem innocent enough, if weird. But out-of-place divergent interpolations could be interpreted as a sign of psychosis.

And they found some hallucinated interpolations that were much more serious, like transcribing

And he, the boy was going to, I’m not sure exactly, take the umbrella.

as

And he, the boy was going to, I'm not sure exactly, take the umbrella. He took a big pice of across. A teeny small piece. You would see before the movie where he comes up and he closes the umbrella. I’m sure he didn’t have a terror knife so he killed a number of people who he killed and many more other generations that were y K paï H . And he walked away.

Or turning

Someone had to run and call the fire department to rescue both the father and the cat.

into

Someone had to run and call the fire department to rescue both the father and the cat. All he had was a smelly old ol’ head on top of a socked, blood-soaked stroller.

You can see more examples in their Table 1.

It's a puzzle to me why some aspects of these interpolations are so incoherent. We don't see that kind of garbage very often in the stuff that current "large language models" make up on their own. But I believe the authors, and this seems like a good reason not to use Whisper in any application where humans do not check the output, which is how some clinical apps are starting to use it.

It's good to see this paper covered by the Associated Press — Garance Burke and Hilke Schellmann, "Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said", 10/26/2024:

Tech behemoth OpenAI has touted its artificial intelligence-powered transcription tool Whisper as having near “human level robustness and accuracy.”

But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.

Experts said that such fabrications are problematic because Whisper is being used in a slew of industries worldwide to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos.

But it's a shame that the article doesn't link to the paper where the researchers say it, or even name that paper.

And we should also note that Whisper is notoriously bad at "diarization" — the question of who said what — which makes it especially problematic for transcribing interviews.

