Quirky speech-to-text, weird diarization


From Daniel Deutsch:

We had a long drive yesterday, so we listened to a “robot” reading the entire indictment. It certainly isn’t flawless, but I was surprised by how good it is, especially when it gets “excited” while enacting dialogue.

Indeed, the text-to-speech quality is quite good — though unfortunately they don't tell us which TTS software they used.

Here's the opening, which is indeed entirely clear and even nearly natural-sounding:

But oddly, the podcast authors ("The Bulwark") provide a "transcript" based on an automatic speech-to-text program applied to the automatic text-to-speech reading. This is weird to start with, since they had the actual text that they'd given the "robot" to read. And it's especially weird because of the mistakes that result — which they warn us about at the top of the "transcript" page:

This transcript was generated automatically and may contain errors and omissions. Ironically, the transcription service has particular problems with the word “bulwark,” so you may see it mangled as “Bullard,” “Boulart,” or even “bull word.” Enjoy!

They also don't tell us which speech-to-text system they used, but many of its errors are harder to excuse than "bull word" for "Bulwark". One example is "Introdu" for "Introduction", at the start of the reading:

The following is a reading of the Trump classified documents indictment. Part of the text have been edited for clarity and ease of listening. United States of America versus Donald j Trump and Walt Nada. Introdu, one,

That's presumably one of the mistakes characteristic of end-to-end systems, which aim to translate recorded speech into letter strings rather than word strings. But in that context, it's odd to use a lower-case letter for Trump's middle initial, as the transcript does (13 times) throughout — surely the letter sequences involved should have occurred many times in the language model's training material.

The rendition of "Nauta" as "Nada" or "NADA" (44 out of 45 times) is excusable, though it's interesting that the system gets it right once and misses it 44 times. The inconsistency in capitalization is also interesting: "Nauta" is rendered 13 times as "Nada" and 31 times as "NADA". A similarly telling set of speech-to-text errors is the inconsistent treatment of the document's "Trump Employee 1", which is rendered as "Trump employee 1" (3 times) and "Trump employee won" (1 time), and "Trump Employee 2", which is rendered as "Trump employee too" (18 times) and "Trump employee two" (9 times). In context, "won" and "too" make no semantic or rhetorical sense. But the inconsistency is a more fundamental problem, indicating that the system has no idea that it's transcribing a coherent discourse.
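For readers who want to run this kind of tally on a transcript of their own, here is a minimal sketch in Python. It is not the method used to get the counts above; the sample text is invented for illustration, and the helper name `count_variants` and the two patterns are assumptions rather than part of any transcription tool.

```python
import re
from collections import Counter


def count_variants(transcript: str, pattern: str) -> Counter:
    """Count each distinct surface form matched by `pattern` (case-sensitive)."""
    return Counter(m.group(0) for m in re.finditer(pattern, transcript))


# Invented sample text, NOT taken from the actual transcript.
sample = (
    "Trump employee too called Nada. Later NADA met Trump employee two. "
    "Trump employee won and Trump employee 1 spoke to NADA."
)

# Every way the numbered-employee label might be spelled.
employee_forms = count_variants(
    sample, r"Trump employee (?:1|2|one|won|two|too)\b"
)

# Every way the surname might be spelled or capitalized.
nauta_forms = count_variants(sample, r"\b(?:Nauta|Nada|NADA)\b")

for form, n in employee_forms.most_common():
    print(f"{n:3d}  {form}")
for form, n in nauta_forms.most_common():
    print(f"{n:3d}  {form}")
```

If a single referent turns up under more than one spelling, that is the kind of inconsistency described above: the system is treating each stretch of audio in isolation.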

But the most striking problem is the automatic diarization ("who spoke when"). The system correctly flags the transition between the initial advertisement ("Speaker 1") and the start of the reading ("Speaker 2"). But then it attributes quasi-random fragments of the transcript to a non-existent "Speaker 3". One example:

Another example:

Audio that zeros in on the hallucinated speaker transitions in that passage:

For the past 6 years, some of us have been trying to encourage the field to pay more attention to the diarization problem — and the effort is bearing fruit, as indicated by the improved performance in DIHARD I, DIHARD II, and DIHARD III. But for some reason, the various available speech-to-text systems (commercial or open-source) are still surprisingly bad at this.
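A spurious speaker like the "Speaker 3" above can sometimes be caught after the fact with a crude sanity check: flag any speaker label whose turns are both very short and a tiny share of the total talk time. The Python sketch below is only that, a post-hoc heuristic and not a description of how any diarization system works. The `Turn` type, the timestamps, and the two thresholds are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def suspicious_speakers(turns, max_mean_turn=3.0, max_share=0.05):
    """Return speakers whose mean turn length is at most `max_mean_turn`
    seconds AND whose share of total talk time is at most `max_share`."""
    total = defaultdict(float)
    count = defaultdict(int)
    for t in turns:
        total[t.speaker] += t.end - t.start
        count[t.speaker] += 1
    overall = sum(total.values())
    if overall == 0:
        return []
    return sorted(
        s for s in total
        if total[s] / count[s] <= max_mean_turn
        and total[s] / overall <= max_share
    )


# Invented timestamps: an ad, a long reading, and two stray fragments.
turns = [
    Turn("Speaker 1", 0.0, 30.0),
    Turn("Speaker 2", 30.0, 120.0),
    Turn("Speaker 3", 120.0, 121.5),
    Turn("Speaker 2", 121.5, 200.0),
    Turn("Speaker 3", 200.0, 202.0),
    Turn("Speaker 2", 202.0, 300.0),
]

flagged = suspicious_speakers(turns)
print(flagged)
```

A check like this would wrongly flag a real participant who says only a word or two, so it is a prompt for a human to listen, not a fix.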

Some relevant LLOG posts:

"My summer", 6/22/2017
"DIHARD", 2/13/2018
"Hearing interactions", 2/18/2018
"DIHARD again", 4/18/2018
"Speaker change detection", 4/26/2020
"The dynamics of talk maps", 9/30/2022



2 Comments

  1. Benjamin Orsatti said,

    June 12, 2023 @ 7:40 am

    About this: "I was surprised by how good it is, especially when it gets 'excited' while enacting dialogue."

    I beg to differ (please! please! let me differ!!!)

    The prosody is all off. It gets "excited" at the wrong times and leaves me wondering why it's "shouting" at me about abridgement, viz: "PÀRTS of the TÈXT have been ÈDITED for CLÁRITY and EÁSE of LÍSTENING."

    I hope I don't live long enough to see carbon-based-human prosody mimicking that of the bots, like a city bird "incorporating" car horns into its song.

  2. Frans said,

    June 14, 2023 @ 3:57 pm

    > They also don't tell us which speech-to-text system they used, but many of the its errors are harder to excuse than "bull word" for "Bulwark" — for instance, "Introdu" for "Introduction", in the start of the reading:

    Is there a corrected audio file up or something? It clearly says introduction as far as I can tell.

    [(myl) That's a genuine ASR error — an all-too-typical one for "end to end" systems.]
