In 2002 the two of us published a paper in Cognition called “Using uh and um in spontaneous speaking.” We argued that uh and um are conventional English words, but of a special type. Our hypothesis was this (p. 79):
Filler-as-word hypothesis. Uh and um are interjections whose basic meanings are these:
(a) Uh: “Used to announce the initiation, at t(‘uh’), of what is expected to be a minor delay in speaking.”
(b) Um: “Used to announce the initiation, at t(‘um’), of what is expected to be a major delay in speaking.”
We went on to say (p. 79):
Producing uh itself constitutes a brief delay, and um, a longer delay (according to evidence described later). If speakers are accurate in their expectations, the delays should often extend beyond uh and um, and be longer after um than after uh. Uh and um can be used for other functions too. The hypothesis is that most other functions are implicatures that follow from the relevance of announcing minor or major expected delays in the current situation.
In the rest of the paper, we described a great deal of evidence for the various features of this hypothesis—that uh and um are conventional words of English (but not French or Japanese), that they are interjections, that they are used to announce expected delays, that the expected delays contrast for uh and um, and much more. We had a great time writing the paper.
We were delighted when Mark Liberman, in a post on October 5 (“Um, there’s timing information in Switchboard?”), returned to our paper and took up some of the controversy it has stirred up. And we are delighted to comment on the issues he raised.
We have quoted the filler-as-word hypothesis in full for a reason: Too many folks have misread, skipped over, or forgotten one or another of its features. Several of these folks were mentioned in Mark’s post. Here are three features that are often lost sight of.
Delay vs. silence
What uh and um project are delays, not silences. The delays always include uh or um, and that is often all they include. In our data, there were no silences after uh 72% of the time, and none after um 39% of the time. But um itself is longer than uh, since um has two phonemes and uh has only one (in other terminology, um has two morae and uh has only one). The difference is measurable. In the three studies we cited, um was longer than uh by 17%, 53%, and 29%. In Mark’s analysis, um was longer than uh by 45%. That is, whenever speakers select um over uh, they are already initiating a longer delay.
And yet, as we said, “If speakers are accurate in their expectations, the delays should often extend beyond uh and um,” and they did. In the corpus we analyzed (the London-Lund, or LL corpus), speakers produced silences more often after um than after uh: 61% to 28% of the time. In the corpus Mark analyzed (the Switchboard, or SW corpus), the difference was 74% to 49%. In the LL corpus, speakers sometimes produced additional uhs or ums as well, though Mark didn’t look for these.
We thought we had been clear about delays vs. silences, but apparently we weren’t. In a 2010 paper, Emmanuel Schegloff characterized the filler-as-word hypothesis this way:
[T]he psycholinguists Herb Clark and Jean Fox Tree (2002) argued in Cognition that [“uh(m)”] is to be understood as a full-fledged word, one that projects upcoming silence—shorter in the case of “uh,” longer in the case of “uhm,” displaying imminent trouble in speaking (our italics).
Schegloff then went on to criticize us for making claims about “upcoming silence” and about uh and um being used to display “imminent trouble in speaking.” Neither claim, of course, is part of the filler-as-word hypothesis—although speakers often do use uh and um to implicate imminent trouble.
In his post, Mark quoted a 2005 paper by O’Connell and Kowal in which they analyzed the uses of uh and um by then-Senator Hillary Rodham Clinton in a small number of interviews. They argued:
Our evidence indicates that uh and um cannot serve as signals of upcoming delay, let alone signal it differentially: In most cases both uh and um were not followed by a silent pause, that is, there was no delay at all (our italics).
For over half a century, uh and um have been called “filled pauses” because they were considered to be delays on a par with “silent pauses.” It is hard to square this tradition with the idea that when Senator Clinton used uh and um, she was producing “no delay at all.”
As people speak, they monitor not only what they have just said or done, but what they expect to say or do next. That includes delays. As we put it (p. 106): “Speakers monitor for delays that are worthy of comment” and then “formulate a signal for commenting on the anticipated delay.” The problem, of course, is that speakers cannot know how long they will actually delay. When they are stuck for a word mid-sentence, they might find it right away or only after a delay. The best they can do is estimate the delay and classify it as either major or minor. This was part of the filler-as-word hypothesis: Uh is used to announce “what is expected to be a minor delay,” and um, “what is expected to be a major delay.” These are probabilities, not certainties.
What is the evidence that these are probabilities? As we said, “If speakers are accurate in their expectations, the delays should often extend beyond uh and um, and be longer after um than after uh.” As noted earlier, speakers are more likely to produce silences after um than after uh (61% to 28% in LL; 74% to 49% in SW).
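The kind of tally behind these percentages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `silence_rate`, the (filler, followed_by_silence) layout, and the toy tokens are all hypothetical, and none of it reflects the actual coding of the LL or SW corpora.

```python
from collections import Counter

def silence_rate(tokens):
    """Return {filler: share of its tokens that were followed by a silence}.

    `tokens` is a list of (filler, followed_by_silence) pairs, a
    hypothetical layout chosen for this sketch.
    """
    totals = Counter()
    with_silence = Counter()
    for filler, followed in tokens:
        totals[filler] += 1
        if followed:
            with_silence[filler] += 1
    return {f: with_silence[f] / totals[f] for f in totals}

# Made-up toy tokens, not drawn from any corpus.
toy_tokens = [
    ("uh", False), ("uh", False), ("uh", True), ("uh", False),
    ("um", True), ("um", True), ("um", False), ("um", True),
]
rates = silence_rate(toy_tokens)
print(rates)
```

With real data, the claim in the text corresponds to the rate for um exceeding the rate for uh.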
Speakers should also produce longer delays on average after um than after uh. But do they? Mark cited O’Connell and Kowal, who analyzed the small corpus of interviews with a single speaker (Clinton) and argued:
the distributions of durations of silent pauses after uh and um were almost entirely overlapping and could therefore not have served as reliable predictors for a listener
But O’Connell and Kowal did not actually compare “the distributions of durations of silent pauses after uh and um,” so we did the comparison ourselves, using the data they published. We computed how often Clinton used uh vs. um for each length of silence following the filler. Here we have plotted, for each length of silence, the percentage of fillers (uh and um combined) that were uh, that is, N(uh)/(N(uh) + N(um)).
We have also included the best-fitting linear trend. Although the corpus was small, so the data were noisy, the pattern is clear. The percentage of uhs was high for short silences and declined to zero for the longest silences. The average silence was 340 msec after uh and 420 msec after um, a difference that was significantly greater than chance by a t-test, t(213) = 2.74, p < .01.
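For readers who want to try this comparison on their own counts, here is a minimal sketch of the two steps: the share N(uh)/(N(uh) + N(um)) at each silence length, and an ordinary least-squares line through those shares. The helper names `uh_share` and `linear_trend` and the toy counts are assumptions made for illustration. They are not O’Connell and Kowal’s data, and this is not necessarily the exact fitting procedure used for the plot.

```python
def uh_share(counts):
    """Map each silence length to N(uh) / (N(uh) + N(um)).

    `counts` maps a silence length to a (n_uh, n_um) pair; lengths
    with no fillers at all are skipped.
    """
    return {
        length: n_uh / (n_uh + n_um)
        for length, (n_uh, n_um) in counts.items()
        if n_uh + n_um > 0
    }

def linear_trend(points):
    """Ordinary least-squares (slope, intercept) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up counts: silence length in msec -> (N(uh), N(um)).
toy_counts = {200: (3, 1), 400: (2, 2), 600: (1, 3), 800: (0, 4)}
shares = uh_share(toy_counts)
slope, intercept = linear_trend(sorted(shares.items()))
print(shares, slope, intercept)
```

A negative slope means that the share of uh falls as the silence gets longer, which is the pattern described above.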
But even this analysis isn’t quite right. Here’s why.
Delays between vs. within clauses
Not all silences are alike. Speakers do much of their planning before initiating a clause, and there it is okay to add delays. Many of these delays aren’t “worthy of comment.” But once they have started a clause, they try to remain fluent throughout—to avoid introducing further delays. So a silence that is 1 sec long will be heard as highly disruptive and worthy of comment when it is within a clause, but only mildly disruptive (if that) and not worthy of comment when it is between clauses. This is why we spoke of “major” and “minor” delays: A length of silence that is “major” within a clause may be “minor” between clauses. And this is what we found.
Our evidence came from a reverse analysis of silences. We first identified all of the silences in the LL corpus (there were 19,200 of them). Speakers could in principle produce a silence between any two words in the corpus, but they didn’t. They produced twice as many silences between clauses as they did between any two words within clauses (12,800 to 6,400). Plainly, speakers were far more likely to introduce a silence before starting a clause than after doing so.
We then asked: Which of these 19,200 silences did the speakers anticipate and mark with uh or um? We classified each silence (according to the coding in the LL corpus) as 1, 2, or 3 (or more) beats long. (A beat in the LL corpus depended on the pace of the speaker; in one study, it averaged 420 msec long.) If our hypothesis is correct, the longer the silence, the more likely speakers should have anticipated it and announced it with uh or um. There should be a steep increase for silences within clauses, but not necessarily for silences between clauses.
This, too, is just what we found. First consider just the longest silences (3 beats long). These are silences that speakers should have announced much more often when they were within a clause than when they were between clauses. They did, and five times as often. Next consider the silences within clauses. For these, the longer the silence, the more likely speakers should announce them with um than with uh. Indeed, there was a two-fold increase in uh from the shortest to the longest silences (4% to 10%), but a five-fold increase in um (4% to 21%). Both trends were large, and the trend for um was reliably larger. For silences between clauses, in contrast, the analogous increase was small for um and absent for uh. To repeat, not all silences are alike.
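The logic of the reverse analysis can be sketched as follows: start from the silences, group them by position and length, and ask what share of each group was announced with uh or with um. The function name `announcement_rates`, the triple layout, and the toy silences are hypothetical, chosen only to show the bookkeeping. They do not reproduce the LL coding or our counts.

```python
from collections import defaultdict

def announcement_rates(silences):
    """Share of silences announced by each filler, per (position, beats) cell.

    `silences` is a list of (position, beats, filler) triples, where
    position is "within" or "between" (clauses), beats is 1, 2, or 3
    (3 standing for three or more), and filler is "uh", "um", or None.
    This layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    marked = defaultdict(lambda: {"uh": 0, "um": 0})
    for position, beats, filler in silences:
        cell = (position, beats)
        totals[cell] += 1
        if filler is not None:
            marked[cell][filler] += 1
    return {
        cell: {f: marked[cell][f] / n for f in ("uh", "um")}
        for cell, n in totals.items()
    }

# Made-up toy silences, ten per cell, not drawn from any corpus.
toy_silences = (
    [("within", 1, None)] * 8 + [("within", 1, "uh"), ("within", 1, "um")]
    + [("within", 3, None)] * 6 + [("within", 3, "uh")] + [("within", 3, "um")] * 3
    + [("between", 3, None)] * 9 + [("between", 3, "um")]
)
cell_rates = announcement_rates(toy_silences)
print(cell_rates)
```

The comparisons described above correspond to reading across cells: within clauses, from the shortest to the longest silences, and, for the longest silences, within versus between clauses.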
Mark did not do a reverse analysis for silences, nor did he distinguish silences within vs. between clauses. And yet that is what is needed for testing the difference between “major” and “minor” delays.
Are uh and um genuine words?
This question has exercised a great many people. For over half a century, uh and um have been treated as “disfluencies” or “filled pauses,” as noise to be filtered out of linguistic analyses. But if that is all they are, we wondered, why are there two of them? If speakers produce uh on some occasions, and um on others, on what basis are they doing so? This was the question that led to our first investigation—and eventually to the evidence that uh and um are, indeed, genuine words.
Not everyone has taken our evidence to heart. Of the two papers Mark cited, O’Connell and Kowal accepted uh and um as words, but Finlayson and Corley did not. In a related paper in 2008, Corley and Stewart put it this way:
We conclude that, whereas listeners are highly sensitive to hesitation disfluencies in speech, there is little evidence to suggest that they are intentionally produced, or should be considered to be words in the conventional sense.
But the authors drew this conclusion without addressing the evidence we presented. Let us explain.
“To be an English word,” we noted, “is to conform to the phonology, prosody, syntax, semantics, and pragmatics of English words,” and we went on to show that uh and um satisfied these criteria. Like the articles the vs. a, for example, uh and um contrast in pronunciation and in usage, and they can be pronounced with distinctive intonation contours. Like the and a, they are also conventional. Where English uses uh and um, Mandarin uses uh, mm, nage, and zhege, and Japanese uses eto, ano, sono, and kono. The terms nage and zhege (in Mandarin), and ano, sono, and kono (in Japanese) are also used as demonstrative pronouns (like English this and that), so native speakers readily hear these as words. And (pace Corley and Stewart) uh and um cannot be automatic. Evidence we cited shows that speakers have control over when they use them. These criteria alone are enough to conclude that uh and um are words. (We urge readers to examine the published argument in full.)
But are uh and um something speakers plan? Speakers regularly attach uh and um to the ends of other words to form larger “phonological words.” They pronounce but + uh as “bu-tuh,” to + uh as “tu-wuh,” and the + uh as “thee-yuh.” (We have a whole paper on “thee” and “thee-yuh.”) In each of these examples, speakers readjust the syllable boundaries (from “but-uh” to “bu-tuh”) and pronounce the result as a trochee. The point is that speakers have to plan these trochees as units. You cannot start saying “thuh,” attach “uh” to it, and get “thee-yuh.” Nor can you make phonological words out of the + laughter, or the + a cough, or the + a groan. This is a process you can only do with words. Simply put, you cannot produce the trochee “thee-yuh” without planning “uh” as part of it from the beginning. Uh and um are items you have to plan.
Corley and Stewart based their conclusion solely on evidence from comprehension. They cited studies of reaction times and eye gaze, along with their own lovely work on EEGs, most of which show that listeners take account of uh and um. And yet, they argued, listeners could be treating uh and um not as words, but as mere symptoms of troubles. The problem for Corley and Stewart is that this is all one can ever conclude from comprehension alone. For word-hood, the crucial evidence is found in speaking, and this is evidence that Corley and Stewart ignored.
Blogs and slogs
Corpora vary enormously in quality—as Mark knows better than anyone. When we began our research, the LL corpus was ideal for our purposes. The speech was from spontaneous face-to-face conversations. The transcripts included not only words, but intonation, overlapping speech, silences, prolongations, non-words, laughter, and other features we made use of. And it had been checked and double-checked by professional linguists. For us, it was a goldmine.
The main point of Mark’s post was to ask why people hadn’t used the timing information in the SW corpus. And he chided the two of us in particular, saying, “So it puzzles me that Clark and Fox Tree didn’t use this information.”
We would first like to remind Mark about the difference between blogs and slogs. Blogs (an abbreviation of weblogs) are created on a time-scale of days and weeks. Slogs (an abbreviation of science-logs, for scientific reports that appear in peer-reviewed journals) are created on a time-scale of years. Mark’s post was a blog. Our paper was a slog.
With that in mind, let us compare the time-lines of our paper and the SW corpus. We began our research with the LL corpus in 1991 and worked on it (intermittently) for nine years. We produced three closely related papers—one on thuh vs. thee (Fox Tree & Clark), one on repeated words as in “the . the basket” (Clark & Wasow), and one on uh vs. um (Clark & Fox Tree). The papers were submitted for publication in 1995, 1996, and 2000, respectively, but didn’t appear until 1997, 1998, and 2002. Not only did it take time to plan, carry out, and write up the research itself, but it took two more years for the research to appear in print. This was not atypical of slogs in that era.
As for the SW corpus, it was first announced at an IEEE Conference in March 1992, which we weren’t at. (A first report of our findings on uh and um appeared in a paper submitted in April 1992.) When we did hear about SW, colleagues warned us not to use it for timing until the timing had been checked. As Mark noted, the checked version wasn’t available until 2003. By then, we had moved on.
Blogs are hares, and slogs are tortoises. A hare may be speedy, but it is the tortoise that wins the race. Science depends ultimately on slogs no matter how slow they are. On this we’re sure Mark would agree.
Above is a guest post by Herb Clark and Jean Fox Tree.