Analysis of prosodic timing in reading
This post documents one small step in a larger plan for improved evaluation of prosody in reading. It compares word-level timing in a large number of recordings, from the Speech Accent Archive at GMU, of 3038 people reading the 69-word "Please call Stella" passage. 661 of these people are native speakers of English, with accents from all over the anglophone world, while the remaining 2377 readers have native languages from Afrikaans to Zulu. The reading and speaking level of those non-native readers varies a lot, and many of them have problems in decoding or pronunciation that affect their timing.
Automated analysis of such problems should be useful in foreign-language teaching. And similar analyses might help in early reading instruction for students in anglophone classrooms, whatever their native language.
Let's start with a quick comparison of word-level timing in the 661 native English speakers, the 85 native French speakers, the 99 native Korean speakers, and the 82 native Russian speakers.
I calculated word-level time points for those 927 speakers, using a forced-alignment system originally developed many years ago with Jiahong Yuan; a summary of the technology and a few of its applications can be found here (open-access version). Here's the output for speaker english1. Note that the segment ID sp means "silent pause".
The key conclusions:
- Time between word onsets gives a good picture of phrase structure, despite the many other effects on timing;
- Individual non-native readers, aside from being overall a bit slower, usually show lengthened inter-word intervals in unexpected places, due to decoding or pronunciation problems.
A crucial point: "time between word onsets" means that any inter-word silent pauses are added to the duration of the pre-pausal word. Here's the beginning of the reading by speaker english1, with the inter-word-onset duration for "Stella" indicated in red. (As usual, click on the image for a larger version.)
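For concreteness, here is a minimal Python sketch of that measure. The (label, start, end) tuple layout, the function name, and the example times are assumptions made for illustration, not the aligner's actual output format; only the sp label comes from the description above. The last word has no following onset, so the sketch falls back to the word's own duration.

```python
from typing import List, Tuple

# One aligned segment: (label, start_sec, end_sec).
# This layout is an assumption; only the "sp" pause label is from the post.
Segment = Tuple[str, float, float]


def inter_onset_durations(
    segments: List[Segment], pause_label: str = "sp"
) -> List[Tuple[str, float]]:
    """Return (word, duration) pairs, measured from each word onset
    to the onset of the next word."""
    words = [seg for seg in segments if seg[0] != pause_label]
    result = []
    for i, (label, start, end) in enumerate(words):
        if i + 1 < len(words):
            # Any silent pause after this word is absorbed into its duration.
            duration = words[i + 1][1] - start
        else:
            # Last word: no following onset, so use the word's own end time.
            duration = end - start
        result.append((label, round(duration, 3)))
    return result


# Made-up times for illustration, not taken from the archive.
example = [
    ("PLEASE", 0.50, 0.80),
    ("CALL", 0.80, 1.10),
    ("STELLA", 1.10, 1.60),
    ("sp", 1.60, 2.00),
    ("ASK", 2.00, 2.25),
]
durations = inter_onset_durations(example)
for word, seconds in durations:
    print(word, seconds)
```

In this toy input, the 0.4-second pause after STELLA is counted as part of STELLA, which is why its value is much larger than its spoken duration.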
Measuring from onset to onset merges pre-pausal lengthening with silent pauses (if any), and results in word-level duration measures that generally reflect the prosodic phrasing. The plot below shows the sequence of 69 median inter-word-onset times for all 661 native English speakers, with labels on the 10 largest local peaks:
Please call Stella.
Ask her to bring these things with her from the store:
Six spoons of fresh snow peas,
five thick slabs of blue cheese,
and maybe a snack for her brother Bob.
We also need a small plastic snake
and a big toy frog for the kids.
She can scoop these things into three red bags,
and we will go meet her Wednesday at the train station.
The local maximum on Wednesday mainly reflects the fact that it's a long word, though it's also at the edge of a phrase. And the final word station is shorter in this measure because there's no following onset to establish the pause duration:
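A median pattern of this kind, and its largest local peaks, could be computed along the following lines. This is a sketch under stated assumptions: each speaker is represented by a list with one inter-word-onset value per word of the passage, the function names are hypothetical, and the numbers are invented (seven word positions rather than 69).

```python
from statistics import median
from typing import List


def median_pattern(per_speaker: List[List[float]]) -> List[float]:
    """Median duration at each word position across speakers.
    Assumes every speaker list has the same length (one value per word)."""
    return [median(column) for column in zip(*per_speaker)]


def largest_local_peaks(pattern: List[float], k: int) -> List[int]:
    """Indices of the k largest strict local maxima, returned in text order.
    The first and last positions are never counted as peaks."""
    peaks = [
        i
        for i in range(1, len(pattern) - 1)
        if pattern[i] > pattern[i - 1] and pattern[i] > pattern[i + 1]
    ]
    top = sorted(peaks, key=lambda i: pattern[i], reverse=True)[:k]
    return sorted(top)


# Invented values for three speakers and seven word positions.
speakers = [
    [0.30, 0.30, 0.90, 0.25, 0.45, 0.20, 0.60],
    [0.35, 0.28, 1.10, 0.30, 0.50, 0.22, 0.55],
    [0.28, 0.32, 0.70, 0.22, 0.40, 0.25, 0.50],
]
pattern = median_pattern(speakers)
peak_positions = largest_local_peaks(pattern, k=2)
print(pattern)
print(peak_positions)
```

Using the median rather than the mean keeps a few very long hesitations from dominating the reference pattern at any one word position.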
The overall median durations for the French, Korean, and Russian native speakers are similar:
As we expect, the non-native speakers are overall somewhat slower, as shown below in the quantile plot of all inter-word intervals:
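One way to tabulate such a quantile comparison is sketched below with Python's standard statistics module. The group names, the choice of deciles, and the interval values are illustrative assumptions, not data from the archive.

```python
from statistics import quantiles
from typing import Dict, List


def decile_table(groups: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Nine decile cut points of the pooled inter-word intervals per group."""
    return {name: quantiles(values, n=10) for name, values in groups.items()}


# Invented pooled intervals (seconds) for two hypothetical groups.
pooled = {
    "native": [0.18, 0.20, 0.22, 0.25, 0.28, 0.30, 0.35, 0.45, 0.60, 0.90],
    "non_native": [0.22, 0.24, 0.27, 0.30, 0.34, 0.37, 0.43, 0.55, 0.75, 1.10],
}
table = decile_table(pooled)
for name, cuts in table.items():
    print(name, [round(c, 3) for c in cuts])
```

If one group is slower overall, its cut points sit above the other group's at every decile, which is the pattern a quantile plot makes visible.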
However, when we look at the patterns for individual speakers, we see something different. Here's a plot showing the inter-word pattern for speaker russian1, compared to the median pattern for native English speakers:
As you can see, there are some unexpected local maxima, corresponding to places where the speaker's flow is interrupted by reading or pronunciation difficulties. Here's the same thing for speaker russian2, where the unexpected pauses are in different places:
And speaker russian3, who is more fluent than the other two:
It's helpful to compare the first three native English speakers (english1, english2, english3), whose differences are concentrated at the phrase-boundary pauses:

Here are speakers french1, french2, and french3, who also introduce the problem of false starts, repetitions, and filled pauses (about which more later):
And speakers korean1, korean2, korean3:
This is only a small first step, but it suggests fruitful continuations. In particular, the recent flowering of natural-sounding TTS ("text to speech") technology means that we can calculate a plausible reference pattern for an arbitrary input text, without the need to record human speakers.
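Given a reference pattern, whether a native-speaker median or one derived from TTS, flagging a reader's unexpected lengthening might look like the sketch below. The tempo normalization (dividing by the median reader-to-reference ratio) and the threshold of 1.8 are assumptions chosen for illustration, not a method described in this post. The sketch also assumes the reader produced exactly the words of the text, so false starts and repetitions would need to be handled before this step.

```python
from statistics import median
from typing import List


def unexpected_lengthening(
    reader: List[float], reference: List[float], ratio_threshold: float = 1.8
) -> List[int]:
    """Word positions where the reader's inter-word-onset time is much longer
    than the reference, after factoring out the reader's overall tempo."""
    if len(reader) != len(reference):
        raise ValueError("reader and reference must cover the same words")
    ratios = [r / ref for r, ref in zip(reader, reference)]
    # Overall tempo: the typical reader-to-reference ratio for this speaker.
    tempo = median(ratios)
    return [i for i, q in enumerate(ratios) if q / tempo > ratio_threshold]


# Invented values: the reader is about 1.2 times slower throughout,
# with one large hesitation at position 3.
reference = [0.30, 0.30, 0.90, 0.25, 0.45, 0.22, 0.55]
reader = [0.36, 0.36, 1.08, 0.90, 0.54, 0.26, 0.66]
flagged = unexpected_lengthening(reader, reference)
print(flagged)
```

Normalizing by tempo separates the two effects noted in the key conclusions: a reader who is uniformly slower is not flagged anywhere, while a local interruption stands out.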

Jerry Packard said,
April 7, 2026 @ 12:53 pm
Terrific work Mark. It is absolutely fascinating that interword spacing gives cues to phrase structure. One of my UIUC grad students did her dissertation on measuring Mandarin L2 proficiency by elicited imitation, in which a computer measures the output of testers who have repeated/imitated Mandarin sentences. Acoustic measurement of the output yields reliable placement into L2 curricular levels.
Daniel Barkalow said,
April 8, 2026 @ 11:16 am
It's interesting that time since start of previous word is a better metric than time since end of previous word. It would be interesting to look at texts with different length words in places which call for pauses to see if it's really start of word, or if there's some later point in words with a bunch of syllables that matters. If it's really start of word, that would be a good probe into compound words and contractions.