We start with a psycholinguistic controversy. On one side, there's Herbert Clark and Jean Fox Tree, "Using uh and um in spontaneous speaking", Cognition 2002.
The proposal examined here is that speakers use uh and um to announce that they are initiating what they expect to be a minor (uh), or major (um), delay in speaking. Speakers can use these announcements in turn to implicate, for example, that they are searching for a word, are deciding what to say next, want to keep the floor, or want to cede the floor. Evidence for the proposal comes from several large corpora of spontaneous speech. The evidence shows that speakers monitor their speech plans for upcoming delays worthy of comment. When they discover such a delay, they formulate where and how to suspend speaking, which item to produce (uh or um), whether to attach it as a clitic onto the previous word (as in “and-uh”), and whether to prolong it. The argument is that uh and um are conventional English words, and speakers plan for, formulate, and produce them just as they would any word.
And on the other side, there's Daniel C. O'Connell and Sabine Kowal, "Uh and Um Revisited: Are They Interjections for Signaling Delay?", Journal of Psycholinguistic Research 2005:
Clark and Fox Tree (2002) have presented empirical evidence, based primarily on the London–Lund corpus (LL; Svartvik & Quirk, 1980), that the fillers uh and um are conventional English words that signal a speaker’s intention to initiate a minor and a major delay, respectively. We present here empirical analyses of uh and um and of silent pauses (delays) immediately following them in six media interviews of Hillary Clinton. Our evidence indicates that uh and um cannot serve as signals of upcoming delay, let alone signal it differentially: In most cases, both uh and um were not followed by a silent pause, that is, there was no delay at all; the silent pauses that did occur after um were too short to be counted as major delays; finally, the distributions of durations of silent pauses after uh and um were almost entirely overlapping and could therefore not have served as reliable predictors for a listener. The discrepancies between Clark and Fox Tree’s findings and ours are largely a consequence of the fact that their LL analyses reflect the perceptions of professional coders, whereas our data were analyzed by means of acoustic measurements with the PRAAT software (www.praat.org). […] Clark and Fox Tree’s analyses were embedded within a theory of ideal delivery that we find inappropriate for the explication of these phenomena.
I haven't seen any recent defenses of the Clark & Fox Tree position on this issue, which I think is too bad, since the core of their position (that filled pauses are part of the linguistic signaling system, rather than simply symptoms of its malfunction) seems worth preserving. But the debate is apparently still alive, since there are recent publications like Ian Finlayson and Martin Corley, "Disfluency in dialogue: An intentional signal from the speaker?", Psychonomic Bulletin and Review 2012
[P]articipants were no more disfluent in dialogue than in monologue situations, and the distribution of types of disfluency used remained constant. Our evidence rules out at least a straightforward interpretation of the view that disfluencies are an intentional signal in dialogue.
So I thought I'd report, FWIW, on a Breakfast Experiment™ that looks at the duration distribution of um, uh, and adjacent silences in the Switchboard corpus. This exploration is connected to our recent flurry of posts on UM and UH (see here for some links), but it also underlines the curious disconnect between speech science and speech technology, in ways that I'll point out as they emerge.
Here's what Clark and Fox Tree say about their data:
The primary evidence for our proposal comes from the London–Lund corpus (hereafter LL corpus). It consists of 170,000 words from 50 face-to-face conversations (numbered S.1.1 through S.3.6) from the Svartvik and Quirk (1980) corpus of English conversations. […]
Brief pauses “of one light foot” are marked with periods (.), and unit pauses “of one stress unit” with dashes (-). When we need a measure of pause length, we treat the unit pause as 1 unit long, and the brief pause as 0.5 units long, so “. -” is a 1.5 unit pause, and “- – -” is a 3 unit pause. […] Prolonged syllables are marked with colons (:), as in “u:m”. Uh and um were sometimes pronounced in brief or normal form, which we will write “uh” and “um”, and other times in prolonged form, which we will write “u:h” and “u:m”. The surreptitiously recorded speakers produced 3904 fillers (“uh” 898, “u:h” 1213, “um” 530, “u:m” 1263).
They add a few words about some other data sources.
For auxiliary analyses, we draw on an answering machine corpus (AM corpus), the switchboard corpus (SW corpus), and the Pear stories (Pear corpus). The AM corpus consists of 5000 words in 63 calls to telephone answering machines, section S.9.3 in the full computerized version of the LL corpus. It contains only 319 fillers (“uh” 69, “u:h” 166, “um” 6, “u:m” 78). The SW corpus is a 2.7 million word corpus of telephone conversations (Godfrey, Holliman, & McDaniel, 1992). It marks uh, um, and sentence boundaries, but not prolongations or pauses; it contains 79,623 fillers (uh 67,065 and um 12,558).
This is strange. As Clark and Fox Tree note, the Switchboard corpus is about 2,700,000/170,000 ≈ 16 times bigger than the portion of the London–Lund corpus that they used. And while it doesn't have symbolic notations like "unit pause" and "prolongation", even the original 1993 publication had word-level time alignments, which offer a detailed phonetic account of pauses and prolongations. An improved version of these alignments was prepared for the WS'97 workshop at Johns Hopkins, and the alignments were checked and corrected at Mississippi State in 2001-2003. The 1997 version of the alignments was certainly available when Clark and Fox Tree were preparing the Cognition paper.
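To make concrete what word-level time alignments buy you, here's a minimal sketch of deriving filler durations and adjacent-pause durations from aligned transcripts. The line format and the `[silence]` token are illustrative assumptions, loosely modeled on the ISIP word files rather than a faithful rendering of their exact format:

```python
# Sketch: deriving filler and inter-word silence durations from word-level
# time alignments. Assumes a hypothetical whitespace-delimited format:
#   <utterance-id> <start-sec> <end-sec> <token>
# with [silence] marking silent regions (an assumption for illustration).

def parse_alignment(lines):
    """Yield (start, end, token) triples from alignment lines."""
    for line in lines:
        _, start, end, token = line.split()
        yield float(start), float(end), token

def filler_contexts(words, fillers=("uh", "um"), sil="[silence]"):
    """For each filler, report its duration plus the duration of any
    immediately preceding/following silent pause (0.0 if none)."""
    out = []
    for i, (start, end, tok) in enumerate(words):
        if tok not in fillers:
            continue
        before = words[i - 1] if i > 0 else None
        after = words[i + 1] if i + 1 < len(words) else None
        pre = before[1] - before[0] if before and before[2] == sil else 0.0
        post = after[1] - after[0] if after and after[2] == sil else 0.0
        out.append((tok, end - start, pre, post))
    return out

lines = [
    "sw0001A 0.00 0.35 i",
    "sw0001A 0.35 0.52 think",
    "sw0001A 0.52 1.10 [silence]",
    "sw0001A 1.10 1.52 um",
    "sw0001A 1.52 2.00 [silence]",
    "sw0001A 2.00 2.30 well",
]
words = list(parse_alignment(lines))
print(filler_contexts(words))
# → one "um": ~0.42 s long, ~0.58 s of silence before, ~0.48 s after
```

With the real word files, the same few lines of bookkeeping yield the counts and duration distributions reported below.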
So it puzzles me that Clark and Fox Tree didn't use this information. I'll ask them, but perhaps the reason is that the cultural gap between the world of speech technology (from which Switchboard comes) and the world of speech science (including psycholinguistics) was even larger in 2002 than it is now. They should have been aware that word-level time-marking existed for Switchboard, since Godfrey et al. 1992, which they cite, has a section titled "Time-aligned Transcription", which states that
The SWITCHBOARD conversations are also time aligned at the word level. The time alignment is accomplished using supervised phone-based recognition, as described in a companion paper by Wheatley. This process produces phone by phone time marking, which are then reduced to a word by word format for publication with the transcripts.
And it's even more puzzling that O'Connell & Kowal based their 2005 critique only on duration measurements from "six media interviews of Hillary Clinton". It's an interesting collection, but it's fairly small — about 100 times smaller than Switchboard, by the metric of the number of UMs and UHs — and limited to seven speakers, with Clinton providing about 90% of the UMs and UHs. By 2005, the Mississippi State hand-checked alignments had been available for several years, so no painstaking acoustic measurements in Praat were necessary. So why not use Switchboard, in addition to, if not in place of, the interview dataset?
Based on the creation times of the files involved, my work on this material was spread over two periods, one of 34 minutes and another of 7 minutes. This included more than a few minutes of tending to a miserable chest cold. I say this not to boast, but to underline the fact that once you know where the files are (and they can be freely downloaded from the ISIP site), it doesn't take a lot of work to extract the counts and distributions shown below. And given that the issue raised by the cited papers is still a live one — as the Finlayson & Corley 2012 paper suggests — it's surprising that in the dozen years since Clark and Fox Tree 2002, no one (as far as I can tell) took the necessary 41 minutes.
Here are the counts of UM and UH in Switchboard (from the ISIP transcriptions), with a breakdown of how many are preceded or followed by a silent pause:
| Context            | Count | Percent |
|--------------------|------:|--------:|
| SILENCE UM SILENCE |  8251 |     39% |
| SPEECH UM SILENCE  |  7358 |     35% |
| SILENCE UM SPEECH  |  2938 |     14% |
| SPEECH UM SPEECH   |  2521 |     12% |
| SILENCE UH SILENCE |  9231 |     13% |
| SPEECH UH SILENCE  | 25150 |     36% |
| SILENCE UH SPEECH  | 12681 |     18% |
| SPEECH UH SPEECH   | 21696 |     31% |
UM is followed by a silent pause 74% of the time, whereas UH is followed by a silent pause only 49% of the time.
In the other direction, UH is embedded in speech without silence on either side 31% of the time, whereas UM is in that situation only 12% of the time.
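The summary percentages can be recomputed directly from the raw counts in the table:

```python
# Recomputing the summary percentages from the raw Switchboard counts above.
counts = {
    ("SILENCE", "UM", "SILENCE"): 8251,
    ("SPEECH",  "UM", "SILENCE"): 7358,
    ("SILENCE", "UM", "SPEECH"):  2938,
    ("SPEECH",  "UM", "SPEECH"):  2521,
    ("SILENCE", "UH", "SILENCE"): 9231,
    ("SPEECH",  "UH", "SILENCE"): 25150,
    ("SILENCE", "UH", "SPEECH"):  12681,
    ("SPEECH",  "UH", "SPEECH"):  21696,
}

def share(filler, where, context):
    """Percentage of `filler` tokens whose neighbor at position `where`
    (0 = preceding, 2 = following) matches `context`."""
    total = sum(n for k, n in counts.items() if k[1] == filler)
    hit = sum(n for k, n in counts.items()
              if k[1] == filler and k[where] == context)
    return 100 * hit / total

print(round(share("UM", 2, "SILENCE")))  # 74: UM followed by silence
print(round(share("UH", 2, "SILENCE")))  # 50: direct ratio (34381/68758);
                                         # summing the rounded rows gives 49
```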
Here's the distribution of durations for the UM and UH themselves. UM is a bit longer — median of 417 msec. vs. 286 msec. for UH, or 131 msec longer, which is about what's expected for a word-final nasal murmur. But there's quite a bit of overlap:
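The medians come straight out of the per-token duration lists; a minimal sketch, with toy values chosen for illustration (the real medians, as just noted, are 417 msec for UM and 286 msec for UH):

```python
# Minimal sketch: median filler durations from (token, seconds) pairs.
# These are hypothetical toy values, not real Switchboard measurements.
from statistics import median

durations = [("um", 0.40), ("um", 0.43),
             ("uh", 0.25), ("uh", 0.29), ("uh", 0.30)]

def median_ms(token):
    """Median duration for one filler type, in milliseconds."""
    return 1000 * median(d for t, d in durations if t == token)

print(median_ms("um") - median_ms("uh"))  # the UM-UH gap in milliseconds
```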
Here's the distribution of durations for silent pauses following UM or UH (15609 silent pauses after UM, 34381 silent pauses after UH):
Again, pauses after UH are in red, and pauses after UM are in blue — but purple, the region of the histogram where the two distributions coincide, clearly predominates. I was surprised to see that the distributions are so nearly identical, and wasted a few minutes looking for a bug in my code.
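One way to put a number on "nearly identical" is the two-sample Kolmogorov–Smirnov statistic, the maximum vertical gap between the two empirical CDFs. A pure-Python sketch on toy data (the real comparison would feed in the 15,609 post-UM and 34,381 post-UH pause durations):

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum vertical distance
# between two empirical CDFs. 0.0 means the samples are indistinguishable
# by this measure; 1.0 means they don't overlap at all.
import bisect

def ks_statistic(a, b):
    """Max |ECDF_a(t) - ECDF_b(t)| over all observed values t."""
    a, b = sorted(a), sorted(b)
    ecdf = lambda xs, t: bisect.bisect_right(xs, t) / len(xs)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in sorted(set(a) | set(b)))

same = [0.1, 0.2, 0.3, 0.4]
print(ks_statistic(same, same))                   # 0.0: identical samples
print(ks_statistic(same, [x + 1 for x in same]))  # 1.0: disjoint samples
```

For large samples like these, a near-zero statistic is what "about as close to identical as real-world distributions ever are" looks like numerically.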
O'Connell and Kowal argued that in their corpus of Hillary Clinton interviews,
the distributions of durations of silent pauses after uh and um were almost entirely overlapping and could therefore not have served as reliable predictors for a listener.
As we've just seen, in the Switchboard corpus, which is 100 times larger and much more representative, the distribution of durations of silent pauses after uh and um is about as close to identical as real-world distributions ever are. So why in the world would O'Connell and Kowal have failed to make use of this freely-available and easy-to-analyze dataset? It can't quite be true that they didn't know of its existence, since Clark and Fox Tree mention it, though they incorrectly imply that timing information wasn't available for it.
I don't mean to criticize these four psycholinguists in particular. Clark, Fox Tree, O'Connell, and Kowal are all major figures in the field, who have made important contributions. But my point is that the field of psycholinguistics has been culturally estranged from research in speech technology for several decades. So it's not a surprise that this disciplinary blind spot afflicts even these four major researchers.
The distribution of durations of pre-filled-pause silences in Switchboard is also nearly identical for UM and UH. This time, the distribution is clearly bimodal, and there's a small shift from the shorter to the longer mode in UMs compared to UHs. But the durations of the pauses preceding UM and UH still clearly come from the same distribution:
Finally, here's the distribution of summed durations for preceding silence (if any), UM or UH, and following silence (if any). UH is clearly bimodal, with a distribution of short interventions representing UH without silence, and a distribution of longer periods corresponding to UH plus one or two silences. In the case of UM, the distribution is clearly "fattened" by a similar effect, but there's not such a clear multimodal structure:
Overall, these results echo the argument that O'Connell & Kowal made, but add the authority of a 100-times-larger and much more representative dataset. I'm not convinced that their conclusion thereby follows, at least in its strongest form. It seems likely to me that UM and UH have somewhat different (though perhaps overlapping) communicative functions. But I AM convinced that psycholinguists need to learn about the tools and resources available in the technological world.