One conventional view of "disfluencies" in speech is that they're the result of confusions and errors: difficulties in deciding what to say or how to say it, mid-stream changes of plan, or slips of the tongue that need to be corrected. Another idea is that such interpolations can serve to "hold the floor" across a phrase boundary, or to warn listeners that a pause is coming.
These views are supported by the fact that fluent reading lacks filled pauses, restarts, repeated words, and non-speech vocalizations. And accordingly, (human) transcripts of interviews, conversations, narratives, and speeches generally edit out all such interpolations, yielding a text that's more like writing, and easier to read than an accurate transcript would be. Automated speech-to-text systems also generally omit (or mistranscribe) such things.
That editorial choice is a good one if the goal is readability, but not if the goal is to analyze the dynamics of speech production, speech perception, and conversational interaction. And in fact, even a brief examination of such interpolations in spontaneous speech is enough to tell us that the conventional views are incomplete at best.
I've noticed recently that automated transcripts from rev.ai do a good job of transcribing ums and uhs in English, though repeated words are still omitted. And in the other direction, I've noticed that the transcripts on the site of the U.S. Department of Defense include (some of the) repeated words, but not the filled pauses. It's interesting to compare those transcripts to the audio (where available) — I offer a sample below.
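For anyone who wants to quantify that comparison, here's a minimal Python sketch of the sort of tally involved: counting filled pauses and immediately repeated words in a transcript. The filled-pause inventory {um, uh, er, ah} and the sample sentence are my own assumptions, and this is just an illustration, not anything that rev.ai or the DoD transcribers actually run.

```python
import re
from collections import Counter

def tally_interpolations(transcript: str):
    """Count filled pauses and immediately repeated words in a transcript."""
    # Lowercase and tokenize on letters plus apostrophes (so "don't" stays whole).
    tokens = re.findall(r"[a-z']+", transcript.lower())

    # Hypothetical filled-pause inventory; a real study would need a fuller set.
    filled_pauses = Counter(t for t in tokens if t in {"um", "uh", "er", "ah"})

    # A word immediately followed by itself counts as one repetition.
    repeats = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)

    return filled_pauses, repeats

if __name__ == "__main__":
    sample = "So, um, we we wanted to, uh, to compare the the transcripts."
    pauses, repeats = tally_interpolations(sample)
    print("filled pauses:", dict(pauses))    # {'um': 1, 'uh': 1}
    print("repeated words:", dict(repeats))  # {'we': 1, 'the': 1}
```

Running the two kinds of transcript through a counter like this would show the asymmetry directly: the rev.ai output scores on the first tally but not the second, and the DoD transcripts the reverse.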