Timing from TTS
Or maybe I should say, "AI prosody"?
In a series of posts over the past year, I've suggested that evaluation of reading performance ought to go beyond the question of whether individual words are correctly decoded and pronounced.
In "Reading Instruction in the mid 19th century" (8/15/2025), I began by quoting a passage from an 1853 McGuffey Reader, which starts like this:
The great object to be accomplished in reading as a rhetorical exercise is, to convey to the hearer, fully and clearly, the ideas and feelings of the writer. In order to do this, it is necessary that the reader should himself thoroughly understand those sentiments and feelings. This is an essential point. It is true, he may pronounce the words as traced upon the page, and, if they are audibly and distinctly uttered, they will be heard, and in some degree understood, and, in this way, a general and feeble idea of the author's meaning may be obtained.
Ideas received in this manner, however, bear the same resemblance to the reality, that the dead body does to the living spirit. There is no soul in them. The author is stripped of all the grace and beauty of life, of all the expression and feeling which constitute the soul of his subject […]
Independent of the effect on hearers, we can echo McGuffey's concern about whether a reader understands the ideas and feelings of the writer.
Modern reading instruction generally tests this by asking a series of questions about the content of a passage that has been read. But in a discussion among participants in the U-GAIN project, Ran Liu of Amira Learning suggested that a computational analysis of prosodic features could be an effective way to evaluate how well grade-school students understand what they're reading.
I followed up with a series of posts suggesting some simple steps towards such evaluation: "A simple way to model prosody in reading" (9/27/2025), "Analysis of prosodic timing in reading" (4/5/2026), and "Inter-word intervals again" (5/21/2026).
That work argues that the pattern of inter-word-onset timing in reading is a generally reliable signal of text understanding. It's far from the only relevant feature, and it's not entirely foolproof, but it's both effective and easy to calculate.
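To make "easy to calculate" concrete, here is a minimal sketch of the basic measurement. It assumes that word-onset times (in seconds) are already available from some alignment step; the onset values and the function name `inter_word_intervals` are hypothetical illustrations, not taken from the cited posts.

```python
def inter_word_intervals(onsets):
    """Return the time from each word's onset to the next word's onset.

    `onsets` is a list of word-onset times in seconds, in reading order.
    A passage of N words yields N-1 intervals.
    """
    return [round(later - earlier, 3) for earlier, later in zip(onsets, onsets[1:])]


# Hypothetical onset times for a five-word phrase (made-up values).
onsets = [0.00, 0.25, 0.50, 1.25, 1.50]
intervals = inter_word_intervals(onsets)
print(intervals)
```

In this made-up example the third interval is three times as long as the others, which is the kind of local lengthening one would expect at a phrase boundary or at a point of difficulty.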
In the cited posts, I used the prosodic timing of fluent human readers as a basis of evaluation. But recruiting human readers to act as models falls short of total automation of the evaluation process, so from the start, the idea was to rely on models derived from today's (pretty good and improving) AI text-to-speech systems. (And the process would eventually back up a bit, using the text analysis behind such TTS systems rather than analyzing their output — but a sample of TTS outputs is a useful place to start.)
So here's a plot of median inter-word intervals, extracted automatically from 8 TTS versions of the (first five sentences of the) Shark Passage, for which we have a large sample of recordings from grade-school readers:
And here are the (first 40) word indices (with the phrase-final words bolded). The rest of the passage is similar…
Comparison to the timing of student readers shows clearly where they slowed or paused, due to problems with decoding, word knowledge, or phrase understanding. There are probably cues to the different sorts of difficulty, and of course there are also prior probabilities to help in classification.
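For readers who want to try something similar, here is a rough sketch of one way such a comparison could be set up. It is not the procedure used for the plot above: the onset times are invented, the function names are hypothetical, and the ratio threshold of 2.0 is an arbitrary assumption chosen only for illustration.

```python
from statistics import median


def inter_word_intervals(onsets):
    """Time from each word's onset to the next word's onset, in seconds."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def median_model(tts_onsets):
    """Median inter-word interval at each word position across TTS versions.

    `tts_onsets` holds one list of word-onset times per TTS version,
    all for the same text (so all lists have the same length).
    """
    per_version = [inter_word_intervals(onsets) for onsets in tts_onsets]
    return [median(position) for position in zip(*per_version)]


def flag_slow_intervals(student_onsets, model, ratio=2.0):
    """Indices where the student's interval exceeds ratio * model interval.

    The ratio threshold is an assumption for this sketch, not a tested value.
    """
    student = inter_word_intervals(student_onsets)
    return [i for i, (s, m) in enumerate(zip(student, model)) if s > ratio * m]


# Hypothetical onset times for a four-word phrase from three TTS versions.
tts_versions = [
    [0.0, 0.25, 0.50, 1.00],
    [0.0, 0.50, 0.75, 1.25],
    [0.0, 0.25, 0.75, 1.50],
]
model = median_model(tts_versions)

# Hypothetical student reading with a long pause before the third word.
student_onsets = [0.0, 0.25, 1.25, 1.75]
flagged = flag_slow_intervals(student_onsets, model)
print(model, flagged)
```

A fixed ratio is the crudest possible criterion. A real application would presumably normalize for overall reading rate and treat phrase-final positions differently, since those are where fluent readers lengthen as well.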

AntC said,
May 27, 2026 @ 3:24 pm
today's (pretty good and improving) AI text-to-speech systems.
Colour me unconvinced. Or perhaps AI readouts on Youtube don't represent the state of the art? There's a swag of sites offering summaries of current news. I guess they don't have good data for pronouncing out-of-the-way places or unusual (non-English) names.
They do these days manage a reasonable level of stress-timing and prosody, rather than plodding syllables.
I still hear plenty of wrong choices for pronouncing homographs. And initialisms rarely sound fluent (try UNESCO vs UNHCR).
Mark Liberman said,
May 27, 2026 @ 3:38 pm
@AntC: "I still hear plenty of wrong choices for pronouncing homographs. And initialisms rarely sound fluent (try UNESCO vs UNHCR)."
True, but not relevant to the problem of creating prosodic timing models for early reading instruction — first, because such words are unlikely to come up in early-grade reading material; and second, because an obvious step in developing the application would be adding to the pronunciation dictionary where needed.
Peter Cyrus said,
May 28, 2026 @ 2:13 am
I'm also skeptical. For our last video, we switched from human voice actors to AI, since we couldn't get the HUMANS to get the prosody right, especially placement of contrasting stress. I was always convinced it was because they didn't really understand what they were saying, but spending more time explaining didn't improve the results.
With both alternatives, there were also mispronunciations, even conflicting pronunciations of the same word in the same paragraph. But the main problem was that the intonation didn't illuminate the progression of points that constitutes the argument.
As a side note, the longer the passage – the less I split passages up – the better the AI version gets. Maybe the state of the art is improving so rapidly that some agents are much better than others.
Condign Harbinger said,
May 28, 2026 @ 8:57 am
There's a whole bunch missing here. Inter-word timing is only part of the game. I did my kids a big disfavour when they were young, by reading stories to them as stories, i.e. drama. There's timing – including overall speed, ac- and de-celeration, voice pitch, volume, timbre… However, the result was that they took it in, and never got the part of Joseph or Mary in the Christmas play, but always the narrator, because they brought the words alive.
"even conflicting pronunciations of the same word in the same paragraph" – talk to Walt Whitman about that.
"The great object to be accomplished in reading as a rhetorical exercise is, to convey to the hearer, fully and clearly, the ideas and feelings of the writer." – sez who? The object of any rhetoric is to get the hearer to buy into what you are pushing, especially whether you think it is good or bad. If you think the writer is good, or supports your own line, you will read it one way. If the converse, the delivery will attempt to use non-semantic signals (e.g. sarcastic tones, gestures, body language) to trash the message.