A simple way to model prosody in reading
In "Reading Instruction in the mid 19th century" (8/15/2025), I noted a suggestion, due to Ran Liu of Amira Learning, that a computational analysis of prosodic features could be an effective way to evaluate how well grade-school students understand what they're reading. Beyond that, Maryellen MacDonald has suggested that phrasal prosody can be seen as the phrase-level analog of phonemic blending (i.e. putting the sounds of 'c' 'a' 't' together into "cat"), which might help to explain the benefits of McGuffey-style elocution lessons.
Both ideas raise the question of how to evaluate the prosody of a given student's reading. And there's a simple and obvious way to do this, described and exemplified below.
We might rely on a model that predicts duration, vocal effort, pitch, and pausing from the phonology, syntax, semantics, and pragmatics of a phrase — there's an enormous literature aiming to do this analytically — or we could rely on a modern-style end-to-end deep learning system that simply maps character sequences onto predicted acoustics.
But that's going to be complicated, either way, and there's a simpler way to start.
For decades, we've had technology that does a good job of "forced alignment", i.e. aligning speech signals with various levels of symbol-sequences representing them (see e.g. Talkin and Wightman 1994; Fox 2006; Yuan and Liberman 2008). So from a sample of model readings for a given passage, we can derive a distribution of relevant acoustic measures, and compare the same measures derived from the performance to be evaluated.
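As a rough sketch of the first measurement step: if we assume the aligner's output has already been loaded as a list of word intervals with start and end times, inter-word silent pauses fall out as the gaps between consecutive words. The `WordInterval` type, the `silent_pauses` function, and the 50 ms minimum are all assumptions made for illustration, not part of any particular aligner, and the timings are invented.

```python
from typing import List, NamedTuple, Tuple


class WordInterval(NamedTuple):
    """One aligned word (hypothetical structure, not a specific aligner's format)."""
    word: str
    start_ms: int  # word onset, in milliseconds from the start of the recording
    end_ms: int    # word offset, in milliseconds


def silent_pauses(words: List[WordInterval],
                  min_pause_ms: int = 50) -> List[Tuple[str, str, int]]:
    """Return (previous_word, next_word, gap_ms) for each inter-word gap
    of at least min_pause_ms. The threshold is an assumed default."""
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt.start_ms - prev.end_ms
        if gap >= min_pause_ms:
            pauses.append((prev.word, nxt.word, gap))
    return pauses


# Invented timings for illustration only; not measurements from any recording.
example = [
    WordInterval("we", 0, 180),
    WordInterval("also", 180, 520),
    WordInterval("need", 830, 1100),
    WordInterval("a", 1120, 1180),
]
example_pauses = silent_pauses(example)
print(example_pauses)
```

Working in integer milliseconds keeps the gap arithmetic exact; a real pipeline would probably also need to decide how to treat aligner-inserted silence or noise intervals.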
I'll illustrate this with a simple example from the Speech Accent Archive at George Mason University, in which a large number of speakers read an "elicitation paragraph":
Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
Many of the readers are native speakers of various varieties of English, e.g.
And many others are speakers of other languages — a large fraction of whom are not entirely fluent as readers of English, e.g. this native Russian speaker:
There are many issues with that last reading, but let's start with something simple, which also applies to U.S. learners of whatever language background — the location and duration of silent pauses.
The second phrase of the elicitation paragraph is a good example. The durations of the Russian speaker's inter-word pauses, in milliseconds as measured by forced alignment, are given between curly braces in the transcript below:
Ask her to bring {240} these {520} things with her from the {290} store
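The curly-brace notation above is easy to read back into a data structure. Here is a minimal sketch, in which the function name `parse_pause_transcript` and the choice to key each pause by the index of the preceding word are my own assumptions:

```python
import re
from typing import Dict, List, Tuple


def parse_pause_transcript(text: str) -> Tuple[List[str], Dict[int, int]]:
    """Split a 'word {ms} word' transcript into its words and a dict mapping
    the index of the word before each pause to the pause duration in ms."""
    words: List[str] = []
    pauses: Dict[int, int] = {}
    for token in text.split():
        match = re.fullmatch(r"\{(\d+)\}", token)
        if match:
            if words:  # ignore a pause marker that precedes the first word
                pauses[len(words) - 1] = int(match.group(1))
        else:
            words.append(token)
    return words, pauses


words, pauses = parse_pause_transcript(
    "Ask her to bring {240} these {520} things with her from the {290} store"
)
total_pause_ms = sum(pauses.values())
print(pauses, total_pause_ms)
```

For this phrase the three within-phrase pauses add up to 1050 ms of silence.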
In contrast, none of the three sample native-English readers above have any within-phrase silent pauses. And their speech rate is also obviously faster:
There's plenty more to say about the pronunciation variation involved — and the Speech Accent Archive (at least at the time that I downloaded it) has 659 native-English readings to compare, along with even more non-native readings.
The four examples above were literally chosen at random. But I've made a systematic comparison of timing and pausing in all the native-English readers, and a large sample of non-native readers, and the pattern holds pretty well, except for a subset of highly fluent non-native readers. In a sample of learner-data from Amira Learning (one of Penn's partners in the U-GAIN project), the effects seem even stronger. (The process of getting consent to share (some of) that data is still underway…)
Accent comparison is an issue for U-GAIN as well. But this morning's goal is just to indicate an obvious and easy road towards evaluation of student reading fluency, on which the first step is simply a comparison of silent pause locations and durations in a given reading, against the distribution of those measures in a set of fluent models. (And another easy step is to do the same thing for word duration and local F0 features…)
My current guess is that 5 or 10 model readings of each passage will be plenty for that task, but time will tell.
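To make the comparison step concrete, here is one possible sketch, not a settled method: for each word boundary, compare the pause in the reading being evaluated against the longest pause that any model reader produced at the same boundary, and flag it if it exceeds that by more than a tolerance. The function name `unusual_pauses`, the max-plus-tolerance rule, and the 50 ms tolerance are all assumptions; with more model readings, a percentile or a fitted distribution would be a natural replacement for the maximum.

```python
from typing import Dict, List, Tuple


def unusual_pauses(reading: Dict[int, int],
                   models: List[Dict[int, int]],
                   n_boundaries: int,
                   tolerance_ms: int = 50) -> List[Tuple[int, int, int]]:
    """Flag boundaries where the reading's pause exceeds the longest model pause
    by more than tolerance_ms. Pauses are dicts mapping boundary index to
    duration in ms; a missing boundary means no pause (0 ms).
    Returns (boundary_index, observed_ms, longest_model_ms) tuples."""
    if not models:
        raise ValueError("need at least one model reading")
    flagged = []
    for boundary in range(n_boundaries):
        observed = reading.get(boundary, 0)
        longest_model = max(m.get(boundary, 0) for m in models)
        if observed > longest_model + tolerance_ms:
            flagged.append((boundary, observed, longest_model))
    return flagged


# The Russian speaker's pauses in the second phrase, keyed by preceding-word index,
# compared against three model readings with no within-phrase pauses.
learner = {3: 240, 4: 520, 9: 290}
fluent_models = [{}, {}, {}]
flags = unusual_pauses(learner, fluent_models, n_boundaries=10)
print(flags)
```

With three pause-free models, every one of the learner's pauses is flagged; if a model reader had paused for 200 ms after "bring", the learner's 240 ms pause there would fall within the tolerance and pass.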
Update: I thought it was clear that the (slightly weird) "Please call Stella" passage was composed long ago as a way to elicit variants in English pronunciation, not as a lesson in reading instruction. The reading passages (which I can't show you yet, because permissions are still being negotiated) are designed to be accessible (and therefore useful) to students at different levels, and perhaps even to serve the needs of an individual student. I've used the Speech Accent Archive examples purely for illustrative purposes.
Some commenters have apparently misunderstood this point…
Bob Ladd said,
September 27, 2025 @ 12:04 pm
Nice idea. Keep us posted.
Don Monroe said,
September 27, 2025 @ 2:16 pm
It seems that some of the Russian speaker's pauses indicate that they have encountered an unfamiliar word and are working out how to pronounce it, rather than reflecting their interpretation of the sentence as a whole.
Mark Liberman said,
September 27, 2025 @ 3:16 pm
@Don Monroe: "It seems that some of the Russian speaker's pauses indicate that they have encountered an unfamiliar word and are working out how to pronounce it, rather than reflecting their interpretation of the sentence as a whole."
Indeed. And that's certainly the most common reason for learners to pause in prosodically inappropriate places — though if they can't figure out how to say some of the words, they are also probably not on top of the basic meaning of the text, much less more subtle aspects of its interpretation.
Less often, there are also cases where a reader knows how to pronounce a word, but is not sure how it fits into the sentence.
So an early goal of the research will be to classify the likely causes of prosodic errors, so as to recommend effective interventions.
It should also be possible to predict likely error locations, in general or for a particular learner, so that a lesson can be structured to help avoid the problems.
Michael Watts said,
September 27, 2025 @ 8:51 pm
This is something that might be changing. I've subjectively felt that I seem to place pauses into my sentences more than I "should". And I've noted family members doing the same thing, for some value of "the same thing".
I don't think any of the speech I've noted, from myself or my family, would lead other people to think it came from a nonnative speaker. But I do think that it would be perceived as containing pauses in what a different speaker might produce without pauses.
Well, some of them. But for e.g. the large pause between "these" and "things" I don't think that explanation is available, and similarly for several other pauses before common words.
I can personally attest that, when speaking a foreign language, I may be perfectly familiar with a word that I need to use, and yet have difficulty actually producing it anyway. A preliminary pause as I get my mouth ready to do something weird would be entirely unsurprising.
My gut feeling listening to the Russian guy was that he's not going to pass for native, but he probably did understand the text and would be able to speak with me if necessary.
Rick Rubenstein said,
September 28, 2025 @ 1:15 am
This has little or nothing to do with the topic at hand, but my cultural background made the third reading instantly sound like part of a comedy sketch. I encountered something similar long ago when attending a Jewish wedding where the presiding rabbi sounded very much like John Cleese at his most over-the-top earnest. It was very hard not to laugh.
AntC said,
September 28, 2025 @ 2:21 am
Six spoons of fresh snow peas, …
The whole list sounds completely bizarre. I'd be halting in reciting it just wondering what it had got garbled from: 'send three-and-fourpence, we're going to a dance'. I don't think it would elicit natural prosody from me.
Snow peas in spoons? It means spoonfuls? Scoopfuls? Handfuls? My greengrocer measures snow peas by weight.
What does a 'slab' of cheese mean? A 200g pre-pack?
evaluate how well grade-school students understand what they're reading.
As a counterpoint: in Sabine Hossenfelder's YouTube videos, I often find her prosody so strange I need to go back and replay. I'm sure she understands her own scripts perfectly well. It's not so much the obscure technical terms she mis-stresses, but everyday words and phrases. I particularly notice /kɔˈleːɡə/ rather than /ˈkɒliːɡ/ – which is presumably interference from German.
Michael Vnuk said,
September 28, 2025 @ 3:40 am
'A computational analysis of prosodic features could be an effective way to evaluate how well grade-school students understand what they're reading.'
What is meant by 'understand' here? As AntC rightly points out, some words in the passage, especially 'spoons' and 'slabs', are odd or vague. I still don't 'understand' what they fully mean here, but I expect that I could read the passage with appropriate prosody because I 'understand', from the shape of the sentences, what parts of speech I am dealing with.
It might be interesting to compare the reading of different passages containing (1) common words, (2) uncommon and unusual words, and (3) nonsense words (such as the linguistics examples 'bouba' and 'kiki'). Most, if not all, the function words would be simple and normal.
Does reading aloud really test 'understanding' (however you define it)? I've heard some people read aloud from their own typed speeches and they still got the prosody wrong. Did they not understand their own material or are they just poor at reading aloud?
How often do people even read aloud? Would they improve with practice, regardless of 'understanding'?
Bob Ladd said,
September 28, 2025 @ 4:39 am
I presume that the weirdness of the passage about snow peas and blue cheese is intentional, because the passage was apparently constructed for a "Speech Accent Archive" and as such must have included a bunch of pairs/contrasts/words that will be diagnostic of one accent or another.
Whether the weirdness invalidates the idea of using it to test reading comprehension, as suggested by AntC and Michael Vnuk, is a good question, but one that could easily be tested in a larger study that includes both weird and non-weird test passages.
Jerry Packard said,
September 28, 2025 @ 8:06 am
The prosody measure of reading proficiency seems analogous to the elicited imitation (EI) measure of a speaker’s overall language proficiency. The rationale behind EI as a proficiency measure is that in order to repeat a sentence correctly, one has to understand the meaning of the sentence. Since a sentence is more difficult to imitate when it is not comprehended, when the sentence taxes the speaker’s ability, they are less able to repeat it correctly. The accuracy of the repeated output generally requires human evaluation, making it more expensive than the prosodic measure of reading proficiency.
David Marjanović said,
September 28, 2025 @ 8:14 am
Yes, that's straightforwardly Kollege.
Rodger C said,
September 28, 2025 @ 9:10 am
But for e.g. the large pause between "these" and "things" I don't think that explanation is available
Dealing with the two different pronunciations of th?
John Finkbiner said,
September 29, 2025 @ 10:45 am
My pronunciation has the TRAP vowel for the first syllable of slavish/slavishly. Slav, on the other hand, is PALM, I think. In any case the two feel quite distinct to me.
I have no idea whether I came to this ‘slavish’ pronunciation purely from reading or from having heard it.
JPL said,
September 30, 2025 @ 12:14 am
Above comment:
Wrong post. Go to "'Slav-ishly devoted'". And while you're at it, tell us what is your understanding about the semantic relation between 'slavish' and 'slave'. And if you see them as morphologically related, why do you pronounce them with a different vowel in the root? If you don't see them as related, what sense does the "slav-" part of the word "slavish" contribute to its overall meaning? Etc.