Psychotic Whisper


Whisper is a widely-used speech-to-text system from OpenAI — and it turns out that generative AI's hallucination problem afflicts Whisper to a surprisingly serious extent, as documented by Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, and Mona Sloane, "Careless Whisper: Speech-to-Text Hallucination Harms", in The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024:

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate Open AI’s Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations—a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.

Some of the cited hallucinations are just insertions of vaguely associated extra phrases, like turning the actual audio "pick the bread and peanut butter" into the hallucinated extension "Take the bread and add butter. Take 2 or 3 sticks, dip them both in the mixed egg wash and coat."

That may seem innocent enough, if weird, although out-of-place divergent interpolations could be interpreted as a sign of psychosis or neurodegeneration.

And they found some hallucinated interpolations that were much more serious, like transcribing

And he, the boy was going to, I’m not sure exactly, take the umbrella.

as

And he, the boy was going to, I'm not sure exactly, take the umbrella. He took a big pice of across. A teeny small piece. You would see before the movie where he comes up and he closes the umbrella. I’m sure he didn’t have a terror knife so he killed a number of people who he killed and many more other generations that were yKpaïH. And he walked away.

Or turning

Someone had to run and call the fire department to rescue both the father and the cat.

into

Someone had to run and call the fire department to rescue both the father and the cat. All he had was a smelly old ol’ head on top of a socked, blood-soaked stroller.

You can see more examples in their Table 1.

It's a puzzle to me why some aspects of these interpolations are so incoherent. We don't see that kind of garbage very often in the stuff that current "large language models" make up on their own. But I believe the authors, and this seems like a good reason not to use Whisper in any application where the output is not checked by humans — as some clinical apps are starting to do. We should note that the paper says:

Notably, we found no evidence of hallucinations in competing speech recognition systems such as Google Speech-to-Text (tested in April 2023) or the latest Google Chirp model (tested in December 2023): we identified exactly 0 comparable hallucination concerns (as defined above) from Google’s products out of the 187 identified audio segments. We similarly identified exactly 0 comparable hallucination concerns among the same 187 audio segments from Amazon, Microsoft, AssemblyAI, and RevAI speech-to-text services (tested in January 2024). This could indicate that advancements in generative language models such as PaLM2 (underlying Google Bard) were not being used in a similar manner in competing speech-to-text systems. As such, we believe hallucinations to currently be an OpenAI-specific concern.

And we should also note that the 1% hallucination proportion was calculated on sentence-level "segments", so (for example) a narrative or interview of 100 segments would have only about .99^100 ≈ .37 probability of being hallucination-free. The proportion of hallucinations for speakers with aphasia was .019, which translates to .981^100 ≈ .15, i.e. only about a 15% chance of a hallucination-free 100-segment discourse. Given the paper's suggestion that the hallucinations are "due to Whisper being seeded by noise rather than speech", the risk would presumably also rise for any recording with poorer audio quality.
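
To make that arithmetic concrete, here is a minimal sketch in Python, under the implicit assumption that hallucinations occur independently across segments (an assumption that a commenter below questions):

    # Probability that an n-segment discourse is hallucination-free,
    # assuming each segment hallucinates independently with probability p.
    def clean_probability(p, n=100):
        return (1 - p) ** n

    print(clean_probability(0.01))   # overall rate (~1%):        about 0.37
    print(clean_probability(0.019))  # aphasia-group rate (1.9%): about 0.15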

Such hallucinations are more likely in an end-to-end ASR system, which maps directly from audio samples to transcription letters, without any intervening lexicon or other linguistic constraint. Whisper is definitely E2E — I think that the competitor systems are also E2E, these days, but perhaps they have some hybrid constraints as well.
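For readers who want to experiment, here is a minimal sketch using the open-source openai-whisper package; the model size and the filename "interview.wav" are placeholders, not anything taken from the paper. The point is that the model goes straight from the waveform to text, with no separate lexicon or other linguistic constraint you could inspect or tighten:

    # A minimal sketch, assuming the open-source "openai-whisper" package
    # (pip install openai-whisper) and a local recording "interview.wav".
    import whisper

    model = whisper.load_model("base")         # other sizes: tiny, small, medium, large
    result = model.transcribe("interview.wav")

    # Whisper returns sentence-level "segments" -- roughly the units on which
    # the paper's ~1% hallucination rate was computed -- so any checking has
    # to happen downstream, e.g. by listening to suspicious segments.
    for seg in result["segments"]:
        print(f'[{seg["start"]:7.2f}s - {seg["end"]:7.2f}s] {seg["text"]}')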

It's good to see this paper covered by the Associated Press — Garance Burke and Hilke Schellmann, "Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said", 10/26/2024:

Tech behemoth OpenAI has touted its artificial intelligence-powered transcription tool Whisper as having near “human level robustness and accuracy.”

But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.

Experts said that such fabrications are problematic because Whisper is being used in a slew of industries worldwide to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos. […]

The full extent of the problem is difficult to discern, but researchers and engineers said they frequently have come across Whisper’s hallucinations in their work. A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in eight out of every 10 audio transcriptions he inspected, before he started trying to improve the model.

A machine learning engineer said he initially discovered hallucinations in about half of the over 100 hours of Whisper transcriptions he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.

But it's a shame that the article doesn't cite the Koenecke et al. paper, or even name the other researchers, engineers, and developers whose statistics it quotes.

My understanding is that these hallucinated interpolations are relatively rare, but obviously very serious when they occur. And more local transcription mistakes are of course extremely common. We should also note that Whisper is notoriously bad at "diarization" — the question of who said what — which makes it especially problematic for transcribing interviews.

Putting it all together, the bottom line seems to be that current speech-to-text in general, and Whisper in particular, shouldn't be used for serious applications without human checking. Performance has been improving rapidly, so maybe things will be different in a few years.

Update — there's been some further mass-media uptake, though mostly in small and/or niche publications…



12 Comments »

  1. AntC said,

    October 27, 2024 @ 6:25 am

    It's a puzzle to me why some aspects of these interpolations are so incoherent.

    The first part of the puzzle is that the tool seems to have captured its input accurately. But then incoherently riffed on a few words within it. (Or is that not how to interpret Table 1?)

    So why does it phantasise following some inputs but not all?

  2. Mark Liberman said,

    October 27, 2024 @ 6:46 am

    AntC: "So why does it phantasise following some inputs but not all?"

    The paper says

    Our second hypothesis regards the types of speech being uttered: specifically, longer pauses in spoken speech (thereby, with longer periods of background noise in the audio file) could result in more hallucinations due to Whisper being seeded by noise rather than speech. These sorts of “speech disfluencies” appear disproportionately often for individuals with speech impairments such as aphasia.

    They go into considerable (and inconclusive) detail on this, but basically it does seem that Whisper is inclined to hallucinate to fill periods of background noise and perhaps non-speech vocalizations.

  3. D.O. said,

    October 27, 2024 @ 10:28 am

For some reason (maybe explained in the relevant literature, but not something I've seen discussed in popular accounts) these consumer-facing translators (and, I guess, speech-to-text transcribers) come without any numerical estimate of how reliable their outputs are. Would it kill OpenAI, Google, Microsoft and all the rest to flag the content they're not sure they got right? C'mon, they're getting Nobels now; it couldn't be too hard for them.

  4. Mark Liberman said,

    October 27, 2024 @ 11:09 am

    @D.O.:

The problem is that "Neural networks do not know when they don't know", as explained in this 2020 NeurIPS tutorial from researchers at Google Brain:
    https://www.gatsby.ucl.ac.uk/~balaji/balaji-uncertainty-talk-cifar-dlrl.pdf

  5. Seth said,

    October 27, 2024 @ 11:32 am

These systems aren't my field, but just for discussion, the effects described strike me as purely a bug somewhere in the system. They look very much like what happens when something that should be, roughly, a terminating-zero entry goes wrong. Then, instead of properly ending, a bunch of random characters shows up in the output. If so, that would mean it's not really a true case of "generative AI's hallucination problem". The issue is possibly higher up, more like Tourette syndrome if you want a mental analogy.

  6. MattF said,

    October 27, 2024 @ 12:19 pm

    @Seth
    …and, given that the problem shows up more often with aphasic, i.e., ‘noisy’ input… some sort of garbage collection bug.

  7. Y said,

    October 27, 2024 @ 1:47 pm

    Recently, speech-to-text has been used for court transcripts. Not directly: the court transcriber repeats the text into an (isolated) microphone. I presume the transcriber sees the resulting text in real time and is able to correct any mistakes.

  8. Gregory Kusnick said,

    October 27, 2024 @ 1:49 pm

    Previously undetected bugs in Python's string-handling or garbage-collection facilities strike me as far less likely than the prima facie explanation: that a system trained to turn audio signals into strings of words continues to do so even when the input consists entirely of wordless background noise.

    This is one case where the term of art "hallucination" seems apt (as opposed to LLMs making up bogus facts, which would be better described as confabulation).

    On a side note, I seem to recall that an early (mid-90s) prototype of Windows Speech Recognition was code-named Whisper internally at Microsoft. Presumably this Whisper is not related to that one.

  9. Kenny said,

    October 27, 2024 @ 6:40 pm

    I'm wondering how significant the problem actually is. I don't expect this sort of error in the segments in a single interview to be uncorrelated with each other – my naive guess would actually be that something like 90% are error free or close to it, while the 10% of them that have a particular sort of noise going on have errors on nearly 10% of segments.

    Also, the authors say "38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority." But without more examples, I'm a bit suspicious of the coding of an error as "inaccurate association" as supporting the coding as "explicit harm".

  10. D.O. said,

    October 27, 2024 @ 6:45 pm

Neural networks do not know when they don't know

Someone should train them. Why don't they apply the high power of neural networks to analyzing neural networks? That could even be the next step toward making them "conscious".

  11. Seth said,

    October 27, 2024 @ 11:19 pm

@ Gregory Kusnick – I would hope this paper isn't merely reporting that overmatching can happen. That would be pretty trivial. The "take the umbrella" example seems to me too long to be accounted for by overmatching, but again I could be wrong. The first question I'd ask, in terms of investigation, is whether these examples are reproducible to any extent, i.e. does running the same file again produce the same output, or at least similar problems? There's a big difference between "The system is producing properly-calculated outputs for this input, even if the output has randomness and is not what you wanted", versus "The system has a calculation error where the output is entirely different from the intended calculation". To me, that example just feels like a case of the latter rather than the former.

  12. /df said,

    November 1, 2024 @ 12:22 am

    Not to discount the "can't tell what it doesn't know" explanation, but I wonder whether the speech is being suitably bandwidth-limited in either the training or operational phases. If not, sounds that are effectively inaudible to humans may be affecting system behaviour.
