Language Log

What's (still) wrong with text-to-speech?

March 2, 2026 @ 5:00 am · Filed by Mark Liberman under Artificial intelligence, Computational linguistics

Text-To-Speech technology has improved enormously over the decades, but there's still some headroom, as a friend has recently underlined for me. He observes that when The Economist magazine first publishes a piece online, it appears with AI-read audio, and then later with a human-read version:

The rhythm/prosody/pitch (I'm not exactly sure which – all three?) is the same in nearly every sentence and even clause. This high-then-falling pattern is fine in one sentence, but repeated 50 times in a row is awful.

Later, those pieces that make it into the print edition get their own, human-read version. So voilà, you have a perfect before-and-after.

I downloaded a handful of "AI Narrated" stories (as the magazine calls them), and then the human-read versions for the ones that made it into print. Before getting to the complaint about repetitive prosody, I noticed a few (minor) old-fashioned errors, such as this parsing (or interpretation?) problem in the phrase "The Supreme Court tariffs ruling reins in Donald Trump", which makes it sound like Supreme Court tariffs are ruling reins inside of Donald Trump:

Or this focus problem, where the human reader helpfully contrasts dollars with euros,

…which the AI narration failed to do:

As for the stereotyped pitch accents, here's one of the first sentences in the AI version of the example story that my friend sent me (that link will send you to the slightly-revised print version):

As he observed, it sounds fine. The print version has modified the text somewhat, but you should be able to hear that the corresponding phrase deploys a more varied set of pitch accents:

We can zero in on the subject noun phrase to see as well as hear the difference, first in the AI version:

And now the human version:

You can listen to as much as you like of the two versions, and see whether you agree that "this high-then-falling pattern is fine in one sentence, but repeated 50 times in a row is awful":

AI Reader	Human Reader

We can quantify the falling-falling-falling perception by looking at syllable-scale dipole statistics, showing a two-dimensional density plot comparing time differences against pitch differences. (As usual, click on an image to see a larger version.)

AI Reader	Human Reader

Or maybe better, by looking at a density plot of delta F0 against delta amplitude:

AI Reader	Human Reader
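
The two kinds of density plot described above can be approximated in a few lines of Python. This is only a rough sketch of one plausible reading of "dipole statistics", not necessarily the procedure behind the figures: it assumes one (time, F0, amplitude) measurement per syllable, pairs each syllable with every later one inside a time window, and bins the resulting differences. The function names, the 100 Hz semitone reference, the window size, the bin widths, and the toy numbers are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def dipoles(syllables, max_dt=1.0):
    """Pair each syllable with every later syllable at most max_dt seconds away.

    syllables: list of (time_s, f0_hz, amp_db) tuples, sorted by time.
    Returns a list of (dt_s, d_f0_semitones, d_amp_db) tuples.
    """
    out = []
    for i, (t1, f1, a1) in enumerate(syllables):
        for t2, f2, a2 in syllables[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # input is sorted, so later syllables are even further away
            d_f0 = hz_to_semitones(f2) - hz_to_semitones(f1)
            out.append((dt, d_f0, a2 - a1))
    return out

def density(points, x_width, y_width):
    """Bin (x, y) points into a 2D histogram whose cell values sum to 1.

    Keys are (x_bin, y_bin) integer indices; only non-empty cells are stored.
    """
    counts = Counter(
        (math.floor(x / x_width), math.floor(y / y_width)) for x, y in points
    )
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}

# Toy input: three syllables with steadily falling pitch and amplitude.
toy = [(0.0, 200.0, 70.0), (0.2, 180.0, 68.0), (0.4, 160.0, 66.0)]
pairs = dipoles(toy, max_dt=0.25)

# First kind of plot: time difference against pitch difference.
time_vs_pitch = density([(dt, d_f0) for dt, d_f0, _ in pairs], 0.1, 1.0)
# Second kind of plot: pitch difference against amplitude difference.
pitch_vs_amp = density([(d_f0, d_amp) for _, d_f0, d_amp in pairs], 1.0, 1.0)
```

Under these assumptions, a reader whose pitch falls across nearly every syllable pair would concentrate most of the density in the cells with negative F0 difference, while a more varied reader would spread it over both rising and falling cells.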

Human speakers obviously exhibit a wide range of prosodic patterns — see the similar plots in "Tunes, political and geographical" (2/2/2017), or "My poster for the 'Prosody Visualization Challenge'" (6/14/2018).

And a TTS system could easily choose a wider variety of prosodic patterns, but it would be harder to make the choices align with the style and the discourse structure — which is maybe why the app doesn't try to do it in this case.

There's a lot more to say, and many more articles to look at, but that's enough for this morning.


8 Comments

David Morris said,

February 28, 2026 @ 3:37 pm

I recently blogged about an AI-voiceover of a summary of a Korean tv series, which mispronounced almost every Korean word (mostly names, but also familiar words like kim-chai).

Simon K said,

February 28, 2026 @ 4:00 pm

I heard an advert on a podcast last week – apologies, I can't remember what for – where the voiceover pronounced the name of the product they were selling in two different ways, with the stress on different syllables.

Viseguy said,

February 28, 2026 @ 7:36 pm

@Simon K.:

I heard an advert on a podcast last week – apologies, I can't remember what for – where the voiceover pronounced the name of the product they were selling in two different ways, with the stress on different syllables.

I encounter this all the time (well, often enough) listening to audiobooks. A notable example is the audio version of Mark Chiusano's incisive profile of former-U.S.-representative-and-lately-sprung-felon George Santos, The Fabulist. For the first half and more of the book, the (human?) reader pronounces "Nassau" (as in Nassau County, New York) with the last syllable rhyming with "cow". Thereafter he switches to the correct pronunciation, rhyming with "saw", without bothering to fix the earlier errors. At this scale, the repeated lapse is laugh-worthy and, I suppose, speaks to the economics of producing audiobooks. More often, in my experience, these errors crop up sporadically, a minor yet unwelcome distraction from the text.

Chris Button said,

March 2, 2026 @ 7:42 am

The "intonational phrase" seems, as expected, to be defined by punctuation in the machine-read version. The human, on the other hand, breaks up the "intonational phrases" as appropriate, which may or may not involve written punctuation in the text and may vary from human to human.

Mike Maxwell said,

March 2, 2026 @ 8:59 pm

The problem is not limited to text-to-(AI)speech. A few years ago, I listened to a read version of Dante's Inferno (in English!), where each chapter was read by a different reader. Afaik, all the readers were human, but some were so bad they were painful to listen to.

Edith said,

March 3, 2026 @ 3:29 am

@max You may have been browsing around on LibriVox – a free resource of volunteer-read audio books.

It is an astonishingly valuable resource for people who need narrated books. But the quality of the volunteers' efforts is all across the spectrum from excellent through painful to unintelligible.

It's full of examples that show just how hard it is to narrate to a professional standard.

A classic example of the adage: "If those who can won't, those who want to will."

Bob Ladd said,

March 3, 2026 @ 9:49 am

"a TTS system could easily choose a wider variety of prosodic patterns, but it would be harder to make the choices align with the style and the discourse structure — which is maybe why the app doesn't try to do it"

This sounds about right. From a completely different area of TTS technology, another example: it's noteworthy that the Duolingo materials for Greek are entirely machine-generated and almost entirely human-sounding, but it's also striking that they produce questions as if they were statements. In many cases in Greek you really have to know the discourse context to get the intonation of a question right – "which is maybe why the app doesn't try to do it".

Philip Taylor said,

March 4, 2026 @ 3:25 am

Whereas, as far as I can tell (i.e., from the perspective of a non-native speaker), Duolingo's French course appears to be extremely authentic. Perhaps a native French speaker could comment.
