Read vs. spontaneous speech


Across the many disciplines that analyze language, there's surprisingly little focus on the properties of natural, spontaneous speech, as opposed to read (or memorized and performed) speech. But of course that dichotomy is an oversimplification — there are many linguistic registers, many ways to read each of the many styles of text, and even more individual, social, and contextual factors influencing spontaneous speech.

So one place to start is events where the same speaker, addressing the same audience for the same purposes, both reads a passage and answers questions. In such cases, at least the speaker and the context are controlled. In "Fluent 'disfluencies' again", 9/3/2022, I looked at the question-answering part of such an event, a press briefing by the U.S. Department of Defense Press Secretary, Brigadier General Patrick S. Ryder. At least, I looked at one small aspect of some of his answers, namely the distribution of certain kinds of interpolations (the things usually called "disfluencies").

The focus of this morning's Breakfast Experiment™ will be one of Ryder's more recent press briefings, comparing the introduction (where he reads prepared text) to the first of his answers to subsequent press questions. I'll look at (aspects of) the properties of speech segments and silence segments, as well as the statistics of local inter-syllable durations. For both of those features, fully-automatic analysis techniques allow research at scale, though this morning's data sample is small.

I'll also take a short comparative peek at his filled pauses and rapid word-repetitions in the two passages.

Here's Brig. Gen. Ryder reading his opening remarks, along with an image from the video showing him opening the folder that he reads from:

And here's his answer to the first question:

If you listen even to short samples of those clips, you'll notice characteristic differences in phrase-scale timing. So I ran a speech activity detector (SAD), and found that some ways of quantifying its output show much bigger differences than others.

The 106.86 seconds of Ryder's introduction is 87.3% speech and 12.7% silence, while the 87.12 seconds of his first answer is 84.4% speech and 15.6% silence. So not much difference there.
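
For readers who want to try this kind of tally on their own recordings, here is a minimal sketch of how speech and silence proportions can be computed from a detector's output. It assumes the SAD output has already been converted to a list of (start, end, label) tuples in seconds. That format, the label names, and the toy segment list are illustrative assumptions, not the output of the detector used for this post.

```python
def speech_silence_shares(segments):
    """Return (speech_share, silence_share) for labeled time segments.

    `segments` is a list of (start_sec, end_sec, label) tuples, where label
    is "speech" or "nonspeech" (a hypothetical format, assumed here).
    """
    total = sum(end - start for start, end, _ in segments)
    speech = sum(end - start for start, end, label in segments
                 if label == "speech")
    return speech / total, (total - speech) / total


# Toy data, invented for illustration only.
toy_segments = [
    (0.0, 2.0, "speech"),
    (2.0, 2.5, "nonspeech"),
    (2.5, 4.0, "speech"),
]

speech_share, silence_share = speech_silence_shares(toy_segments)
print(f"{speech_share:.1%} speech, {silence_share:.1%} silence")
```

Note that this measure ignores how the speech is chunked: many short segments and a few long ones can give the same proportions.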

(FWIW, there's also little difference in overall speaking rate: my transcript of the Intro has 146 words per minute, while Answer 1 has 153 wpm. I'm counting filled pauses and rapid repetitions as "words", which increases the Answer's word count; without those tokens, the Answer weighs in at 135 wpm. The official transcript has some of both, so rates based on that source would be even closer…)

In contrast, SAD segment rates show a big difference. The Intro has 54 SAD segments, or 30.32 segments/minute, while the Answer has 78 segments, for 53.72 segments/minute. That is 77% more, compared to a difference of less than 5% in wpm.
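
The rate arithmetic can be checked with a few lines of Python. This is only a sketch that plugs in the durations, segment counts, and word or token counts reported in this post; the function names are made up for the illustration.

```python
def per_minute(count, duration_sec):
    """Convert a count over a duration in seconds to a per-minute rate."""
    return count * 60.0 / duration_sec


def percent_more(a, b):
    """How much larger b is than a, as a percentage of a."""
    return (b / a - 1.0) * 100.0


INTRO_SEC, ANSWER_SEC = 106.86, 87.12  # durations reported in the post

intro_seg_rate = per_minute(54, INTRO_SEC)    # SAD segments in the Intro
answer_seg_rate = per_minute(78, ANSWER_SEC)  # SAD segments in the Answer
intro_wpm = per_minute(260, INTRO_SEC)        # words in the Intro transcript
answer_wpm = per_minute(222, ANSWER_SEC)      # tokens in the Answer transcript
# Answer rate without its 18 filled pauses and 8 rapid repetitions.
answer_wpm_clean = per_minute(222 - 18 - 8, ANSWER_SEC)

print(f"segments/min: {intro_seg_rate:.2f} vs {answer_seg_rate:.2f} "
      f"({percent_more(intro_seg_rate, answer_seg_rate):.0f}% more)")
print(f"wpm: {intro_wpm:.0f} vs {answer_wpm:.0f} "
      f"({percent_more(intro_wpm, answer_wpm):.1f}% more)")
print(f"wpm without fillers and repetitions: {answer_wpm_clean:.0f}")
```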

And quantile plots show that the differences are spread across the full range of speech-segment and silence-segment durations:
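
A quantile comparison of this kind can be sketched with Python's standard library. The duration lists below are invented for illustration and are not the measurements from Ryder's briefing, and the choice of quartiles with the "inclusive" method is an assumption rather than the procedure used for the plots.

```python
import statistics


def duration_quantiles(durations, n=10):
    """Cut points dividing the durations into n equal-probability groups."""
    return statistics.quantiles(durations, n=n, method="inclusive")


# Invented segment durations in seconds, for illustration only.
read_speech_segments = [0.9, 1.4, 1.8, 2.1, 2.6, 3.0, 3.3]
spontaneous_speech_segments = [0.3, 0.6, 0.8, 1.1, 1.5, 1.9, 2.4]

read_q = duration_quantiles(read_speech_segments, n=4)
spont_q = duration_quantiles(spontaneous_speech_segments, n=4)

# Pairing the two lists quantile by quantile gives the points of a
# quantile-quantile comparison: if one style is shorter at every
# quantile, the difference is spread across the whole range.
for r, s in zip(read_q, spont_q):
    print(f"read {r:.2f} s   spontaneous {s:.2f} s")
```

The same function can be applied to silence-segment durations, and a larger `n` gives a finer-grained comparison when there are enough segments.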

We can also compare the distribution of inter-syllable durations, using the fully-automated methods described in "Inter-syllable intervals", 9/13/2023:

In closing, here's a brief glimpse into the things generally called "disfluencies".

My transcript of Ryder's Introduction is here (audio again here). I've used line-final "##" to mark silent pauses, and line-final "==" to mark (what I perceive as) phrasal boundaries, as in this sample of the beginning:

So earlier today, ==
Secretary Austin spoke by phone with ##
Turkish Minister of National Defense Yasar Guler ##
to discuss Turkish activity ##
in proximity to U.S. forces in Syria. ##

[I'm assuming that there's a gradient hierarchy of inter-morpheme boundary strengths, so the "==" markings are somewhat arbitrary…]

My transcript of this first answer is here (audio again here). It employs an additional non-standard typographical convention, namely word-final hyphens to mark rapid repetitions or "false starts", which are one of the things that are common in spontaneous speech but basically absent in read speech. Again, here's a bit of the beginning as an illustration:

yeah, so fir- first of all, just a- a little context ##
uh up front. ##
um ##
you know Turkey ##
uh is ##
one of our strongest and most valued ##
uh NATO allies and that- that partnership continues ==
and will continue ##
uh so this is certainly a regrettable ==
incident. ##

The 260 words of the introduction have no filled pauses and no rapid repetitions, as usual for read speech.

The 222 tokens of the first answer have 15 uhs and 3 ums, so that 8.1% of the tokens are filled pauses. There are also 8 rapid repetitions or 3.6%, and (15+3+8)/222 = 11.7%. The next-commonest "words" are a and us, with 8 (3.6%) each.
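
Counts like these can be tallied automatically from a transcript that follows the conventions described above. The sketch below runs on the ten-line sample of the answer quoted earlier, not on the full transcript. It assumes that filled pauses are exactly the tokens "uh" and "um", that any token ending in a hyphen is a rapid repetition or false start, and that the "##" and "==" marks are not tokens; the function names are hypothetical.

```python
FILLED_PAUSES = {"uh", "um"}   # assumption: only these two forms count
BOUNDARY_MARKS = {"##", "=="}  # line-final silence and phrase-boundary marks

SAMPLE = """\
yeah, so fir- first of all, just a- a little context ##
uh up front. ##
um ##
you know Turkey ##
uh is ##
one of our strongest and most valued ##
uh NATO allies and that- that partnership continues ==
and will continue ##
uh so this is certainly a regrettable ==
incident. ##
"""


def tokenize(transcript):
    """Split on whitespace, drop boundary marks, strip edge punctuation."""
    tokens = []
    for raw in transcript.split():
        if raw in BOUNDARY_MARKS:
            continue
        token = raw.strip(".,?!\"'").lower()  # keeps a word-final hyphen
        if token:
            tokens.append(token)
    return tokens


def tally(tokens):
    """Count all tokens, filled pauses, and hyphen-marked repetitions."""
    return {
        "tokens": len(tokens),
        "filled_pauses": sum(1 for t in tokens if t in FILLED_PAUSES),
        "repetitions": sum(1 for t in tokens if t.endswith("-")),
    }


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


sample_tally = tally(tokenize(SAMPLE))
print(sample_tally)

# The shares for the full answer, from the counts reported in the post.
print(f"filled pauses: {percent(15 + 3, 222):.1f}%")
print(f"filled pauses plus repetitions: {percent(15 + 3 + 8, 222):.1f}%")
```

Even in this short sample, 8 of the 46 tokens are filled pauses or hyphen-marked repetitions.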

See also: Neville Ryant and Mark Liberman, "Automatic Analysis of Speech Style Dimensions", InterSpeech 2016.

Summary: Simple acoustic and lexical properties can be very different in spontaneous vs. read speech, even for the same speaker addressing the same audience in the same context. This is obvious, really, but the majority of linguistic and psycholinguistic research continues to focus on read speech, often in the form of decontextualized sentences, without considering that some of the results may not generalize to varieties of (more basic and more natural) spontaneous speech.

There are lots of other (fully-automatic) ways to explore the differences; for another simple example, see "My poster for the 'Prosody Visualization Challenge'", 6/14/2018.



  1. Phillip Helbig said,

    October 16, 2023 @ 9:28 am

    After the big tsunami in Thailand, where proportionally more Swedes were killed than US citizens in 9/11, the king of Sweden gave a consolation speech. Many were sceptical beforehand: what could someone in his position say to comfort ordinary people facing such a loss? But it was one of his best speeches ever.

    One reason is that people noticed that those were his words and that he wasn’t reading something. How? The King is dyslexic, so it’s obvious whether he’s reading or speaking freely.

  2. ktschwarz said,

    October 16, 2023 @ 11:46 am

    My first thought was, didn't Labov design his most famous experiments — "fawth flaw" and Martha's Vineyard — to elicit the targeted words in spontaneous speech, as answers to questions, rather than asking people to read them? Then I typed "Labov" into the search box, and found this technique attributed to Labov:

    if you want to know how someone pronounces certain speech sounds or sequences, in an informal realistic context rather than when asked directly to perform the pronunciation, one excellent technique is to get them to give their ideas about the difference between X and Y, where at least one member of the pair illustrates the sound in question.

  3. Jonathan Wright said,

    October 16, 2023 @ 3:24 pm

    @ktschwarz His most recent book describes his approach very generally:

  4. Jarek Weckwerth said,

    October 17, 2023 @ 12:31 am

    Fascinating, thank you!
