World disfluencies

Disfluency has been in the news recently, for two reasons: the deployment of filled pauses in an automated conversation by Google Duplex, and a cross-linguistic study of "slowing down" in speech production before nouns vs. verbs.

Lance Ulanoff, "Did Google Duplex just pass the Turing Test?", Medium 5/8/2018:

I think it was the first “Um.” That was the moment when I realized I was hearing something extraordinary: A computer carrying out a completely natural and very human-sounding conversation with a real person. And it wasn’t just a random talk. […]

Duplex made the call and, when someone at the salon picked up, the voice AI started the conversation with: “Hi, I’m calling to book a woman’s hair cut appointment for a client, um, I’m looking for something on May third?”

Frank Seifart et al., "Nouns slow down speech: evidence from structurally and culturally diverse languages", PNAS 2018:

When we speak, we unconsciously pronounce some words more slowly than others and sometimes pause. Such slowdown effects provide key evidence for human cognitive processes, reflecting increased planning load in speech production. Here, we study naturalistic speech from linguistically and culturally diverse populations from around the world. We show a robust tendency for slower speech before nouns as compared with verbs. Even though verbs may be more complex than nouns, nouns thus appear to require more planning, probably due to the new information they usually represent. This finding points to strong universals in how humans process language and manage referential information when communicating linguistically.

For a more authoritative account of the Google Duplex service, see Yaniv Leviathan [yes, really] and Yossi Matias, "Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone", Google AI Blog 5/8/2018. And the University of Zurich press release for Seifart et al. is "Nouns slow down our speech", 5/14/2018:

Speakers hesitate or make brief pauses filled with sounds like "uh" or "uhm" mostly before nouns. Such slow-down effects are far less frequent before verbs, as UZH researchers working together with an international team have now discovered by looking at examples from different languages.

When we speak, we unconsciously pronounce some words more slowly than others, and sometimes we make brief pauses or throw in meaningless sounds like "um." Such slow-down effects provide key evidence on how our brains process language. They point to difficulties when planning the utterance of a specific word.

A small sample of the buzz: "Google’s AI sounds like a human on the phone — should we be worried?"; "Service Workers Forced to Act Like Robots Meet Their Match — Surprise! It’s a robot that pretends to be human, courtesy of Google"; "Hello, Google Duplex? No Artificially Intelligent Calls, Please"; "What If A Robot Wrote This Article?".

I don't have time for much this morning, but I'd like to make a few quick points.

First, the only examples we have so far of Google Duplex conversations are supplied by the authors, who tell us that "This summer, we’ll start testing the Duplex technology within the Google Assistant". The examples we've seen have no doubt been selected to show the system off at its best, and should not be trusted to present typical examples, much less problematic ones. I say this as someone who worked in industry on text-to-speech synthesis: I've been there.

Modern speech synthesis is generally excellent, but conversational interaction remains a serious challenge. Google Duplex will initially be limited to specific sorts of interactions, "to help users make restaurant reservations, schedule hair salon appointments, and get holiday hours over the phone". But even so, I guarantee that once the public has access to the service, we'll hear some less impressive (and therefore perhaps less concerning) examples.

Second, the work on speech rate and silent or filled pauses is less novel than the coverage suggests. The idea that pauses (filled or not) reflect uncertainty about following material is obvious and well documented. A couple of research reports from 20-30 years ago, among hundreds:

Stanley Schachter et al., "Speech Disfluency and the Structure of Knowledge", Journal of Personality and Social Psychology 1991:

It is generally accepted that filled pauses (“uh,” “er,” and “um”) indicate time out while the speaker searches for the next word or phrase. It is hypothesized that the more options, the more likely that a speaker will say “uh.” The academic disciplines differ in the extent to which their subject matter and mode of thought require a speaker to choose among options. The more formal, structured, and factual the discipline, the fewer the options. It follows that lecturers in the humanities should use more filled pauses during lectures than social scientists and that natural scientists should use fewest of all. Observations of lecturers in 10 academic disciplines indicate that this is the case. That this is due to subject matter rather than to self-selection into disciplines is suggested by observations of this same set of lecturers all speaking on a common subject. In this circumstance, the academic disciplines are identical in the number of filled pauses used.

Elizabeth Shriberg and Andreas Stolcke, "Word predictability after hesitations: A corpus-based study", ICSLP 1996:

We ask whether lexical hesitations in spontaneous speech tend to precede words that are difficult to predict. We define predictability in terms of both transition probability and entropy, in the context of an N-gram language model. Results show that transition probability is significantly lower at hesitation transitions, and that this is attributable to both the following word and the word history. In addition, results suggest that fluent transitions in sentences with a hesitation elsewhere are significantly more likely than transitions in fluent sentences to contain out-of-vocabulary words and novel word combinations. Such findings could be used to improve statistical language modeling for spontaneous-speech applications.
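The transition-probability measure in that abstract is easy to illustrate. Here is a minimal sketch with an invented toy corpus (not the paper's actual Switchboard setup): a maximum-likelihood bigram model, a `transition_prob` helper of my own naming, and the corresponding surprisal for a predictable versus a less predictable continuation.

```python
import math
from collections import Counter

# Toy fluent corpus for training (invented data).
corpus = ("i would like a table for two . "
          "i would like a haircut . "
          "i would like a table for four .").split()

unigrams = Counter(corpus[:-1])
bigrams = Counter(zip(corpus, corpus[1:]))

def transition_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 for unseen transitions."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "like" -> "a" occurs in every training sentence: fully predictable.
# "a" -> "haircut" occurs in one sentence of three: less predictable.
print(transition_prob("like", "a"))     # 1.0
print(transition_prob("a", "haircut"))  # ~0.333

# Surprisal in bits; the hypothesis is that hesitations like "uh"
# cluster before the higher-surprisal continuation.
print(-math.log2(transition_prob("a", "haircut")))
```

Shriberg and Stolcke's finding, in these terms, is that `transition_prob` is significantly lower across hesitation points than across fluent transitions.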

So I have a question: Nouns in context are on average less predictable than verbs — is there anything left for a part-of-speech variable to explain once conditional entropy has been factored in? I don't think that anyone has tested this, but it would be fairly easy to check.
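One way such a check might go can be sketched with simulated data (all numbers below are invented for illustration, not corpus measurements). In this toy world, slowdown is generated from surprisal alone; nouns merely have higher surprisal on average. Stratifying on surprisal then makes the apparent part-of-speech effect shrink, which is the signature one would look for in real data.

```python
import random
from collections import defaultdict

random.seed(0)

# Simulated data: slowdown depends ONLY on surprisal; nouns just
# happen to carry higher surprisal on average (no direct POS effect).
words = []
for _ in range(5000):
    pos = random.choice(["NOUN", "VERB"])
    surprisal = random.gauss(8.0 if pos == "NOUN" else 6.0, 1.5)
    slowdown = 0.05 * surprisal + random.gauss(0, 0.02)
    words.append((pos, surprisal, slowdown))

mean = lambda xs: sum(xs) / len(xs)

# Raw comparison: nouns look slower.
raw_diff = (mean([s for p, _, s in words if p == "NOUN"])
            - mean([s for p, _, s in words if p == "VERB"]))

# Stratify by (rounded) surprisal: if predictability does all the work,
# the POS difference should shrink toward zero within each bin.
bins = defaultdict(lambda: defaultdict(list))
for pos, surprisal, slowdown in words:
    bins[round(surprisal)][pos].append(slowdown)
bin_diffs = [mean(c["NOUN"]) - mean(c["VERB"])
             for c in bins.values()
             if len(c["NOUN"]) > 30 and len(c["VERB"]) > 30]

print(f"raw NOUN-VERB slowdown difference: {raw_diff:.3f}")
print(f"mean within-bin difference:        {mean(bin_diffs):.3f}")
```

A real test would of course use corpus surprisal estimates and a proper regression rather than binning, but the logic is the same: does a part-of-speech predictor survive once conditional predictability is held constant?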

As for the deployment of disfluencies in artificial conversational interaction, there's been lots of research implying that they belong there, e.g. Brennan & Williams, "The Feeling of Another's Knowing: Prosody and Filled Pauses as Cues to Listeners about the Metacognitive States of Speakers", Journal of Memory and Language 1995; Swerts et al., "Filled pauses as markers of discourse structure", 1996.

With respect to um and uh in particular, here's a useful quote from Herb Clark and Jean Fox Tree, "Using uh and um in spontaneous speaking", Cognition 2002:

Uh and um have long been called filled pauses in contrast to silent pauses (see Goldman-Eisler, 1968; Maclay & Osgood, 1959). The unstated assumption is that they are pauses (not words) that are filled with sound (not silence). Yet it has long been recognized that uh and um are not on a par with silent pauses. In one view, they are symptoms of certain problems in speaking. In a second view, they are non-linguistic signals for dealing with certain problems in speaking. And in a third view, they are linguistic signals – in particular, words of English. If uh and um are words, as we will argue, it is misleading to call them filled pauses. To be neutral and yet retain a bit of their history, we will call them fillers.

And finally, the following article on the Seifart publication ends with an annoying falsehood — Alan Burdick, "Why Nouns Slow Us Down, and Why Linguistics Might Be in a Bubble", The New Yorker 5/15/2018:

In recent years, scientists have grown concerned that much of the literature on human psychology and behavior is derived from studies carried out in Western, educated, industrialized, rich, democratic countries. These results aren’t necessarily indicative of how humans as a whole actually function. Linguistics may face a similar challenge—the science is in a bubble, talking to itself. “This is what makes people like me realize the unique value of small, often endangered languages and documenting them for as long as they can still be observed,” Seifart said. “In a few generations, they will not be spoken anymore.” In the years to come, as society grows more complex, the number of nouns available to us may grow exponentially. The diversity of its speakers, not so much.

"Linguistics … is in a bubble, talking to itself", limited to "studies carried out in Western, educated, industrialized, rich, democratic countries"?  What ignorant nonsense.

From missionaries like Antonio Ruiz de Montoya in the 17th century, through scholars like Sir William Jones in the 18th century and Wilhelm von Humboldt in the early 19th century, the innumerable philologists of the later 19th century, and anthropological linguists like Franz Boas and Edward Sapir in the early 20th century, to most of my colleagues in the field today, linguistics as a field has always devoted a large fraction of its efforts to documenting and understanding languages, endangered and otherwise, that are not "Western, educated, industrialized, rich".

Frank Seifart certainly knows this. So I'm going to pin this one squarely on Alan Burdick, since I'm all too familiar with the way that journalists behave when they have a (true or false) generalization in their sights. For an example documented by a journalist who was mistreated in this standard way by a fellow practitioner, see "Down with journalists!", 6/27/2005.

  1. Torbjørn said,

    May 16, 2018 @ 7:14 am

    Am I the only one having a very hard time believing that the call to a "real hair salon" actually was an unscripted call to a salon that had no idea they would be talking to a phone app? The woman answering seems to go out of her way not to trip up the AI she's (probably well aware that she is) talking to.

    Nokia's faked Lumia 920 advertisement trying to hype their new camera comes to mind here…

    [(myl) It's true that the blog post is silent on whether the human participants in the sample conversations knew what was happening or were otherwise prepared in advance, so that the presentation would be misleading but not literally false if the encounter were semi-scripted. But I'd be surprised if the samples were not genuine examples of the potential service. On the other hand, I'd also be surprised if the samples were not especially successful examples of the potential service.]

  2. Jerry Friedman said,

    May 16, 2018 @ 9:24 am

    I'm a bit surprised to see so strong an effect of the next word, since I feel that during many of my silent or filled pauses I'm processing something farther down the road. But of course I haven't done a study of my speech, much less anyone else's.

    Hijack: What's going on with "than"? Seifart et al. write, "We show a robust tendency for slower speech before nouns as compared with verbs. Even though verbs may be more complex than nouns, nouns thus appear to require more planning, probably due to the additional information they usually represent." Why not "than" both times? I don't think I ever use anything but "than" with comparatives, so I have no insight into this. Elegant variation? Or might the authors have been uncomfortable with both "than verbs" (syntactically unclear) and "than before verbs" ("than" followed by a preposition)? Has anyone studied this question?

    I feel that the use of alternatives to "than" is a recent phenomenon. For a little support, "than" has been decreasing slightly in relation to "more" for some decades, according to this ngram result. I suppose that -er comparatives or comparatives without the second comparandum could have become commoner, though.

  3. D-AW said,

    May 16, 2018 @ 9:31 am

    The AI asks for their appointment repeatedly at "12 PM." Am I wrong that real people don't say "12 PM" very often? Google Ngrams seems to bear me out but the evidence in COCA isn't clear.

  4. tbell said,

    May 16, 2018 @ 10:19 am

    I wonder if the 'more planning' interpretation encompasses the need to negotiate interference…It's hard to calculate how many alternatives there might be in a conversation, but maybe nouns produce more interference, simply because there are more of them in our memory…
    We know that memory interference grows as a function of the number of alternatives to a given cue (see Fan effect)…could this phenom be related to a memory access issue?

  5. Andreas Johansson said,

    May 16, 2018 @ 2:09 pm


    I've heard enough peeving about "12 pm" (and "12 am") to have assumed it's common …

  6. M said,

    May 16, 2018 @ 2:45 pm

    In normal conversation, I think I would always say "noon," but I could easily imagine saying "12 pm" if I were calling to make an appointment.

  7. J.W. Brewer said,

    May 16, 2018 @ 3:27 pm

    I am puzzled by the last sentence of the bit quoted from Burdick's piece, because after reading it several times it still seemed non-obvious what the antecedent of the "its" in "the diversity of its speakers" was supposed to be. It seems like "society" is the only noun in the prior discourse that fits, but even then it makes no sense. What would "the diversity of society's speakers" even mean? If e.g. half of the world's speakers of Garifuna migrate to the Bronx (example recently mentioned in some other comment thread) and their descendants end up as monolingual Anglophones, linguistic diversity perhaps decreases (assume this makes the old-country population sufficiently smaller that resistance to complete language shift to Spanish becomes more difficult) but other common senses of "diversity" in US society increase.

    I do think upon rereading that that's not the bit of what Burdick wrote that myl thought an "annoying falsehood." I'm not sure it's coherent enough to be false.

    [(myl) I think you're right that "society" is the antecedent, and I suppose that Burdick has something like "linguistic (and maybe cultural) diversity" in mind, not genetic or ethnic-origin diversity; but you're probably also right that in this case his hook is incoherent enough to qualify as "not even wrong" — except for the bit about linguistics being in a disciplinary bubble because of not looking at a wide variety of languages.]

  8. AntC said,

    May 17, 2018 @ 3:02 am

    Am I wrong that real people don't say "12 PM" very often?

    I'd say "12 noon" or just "12": I'm hardly likely to want a hair appointment at midnight.

    With "12 PM" (and in the absence of sufficient context) I'd be nervous whether it's midnight or noon.

    More generally, I'm not seeing the benefit from getting a machine to make bookings like that: if it's a cut-and-dried appointment (sorry) it'll take no longer to ring up myself than instruct the machine. If it turns out more complicated, the machine will have to refer back to me anyway.

    Where it could have merit is dealing with call-centres where you spend long times on muzak waiting to get through to a human. Then the machine on my behalf could wait for the human and keep them engaged in idle gossip until I'm ready.

  9. BZ said,

    May 17, 2018 @ 10:35 am

    I'm pretty sure my state's 511 (traffic information) phone service, which is fully automated, sometimes includes "um"s in its speech, I think when repeating the same options again after not understanding input, as in "I'm sorry, I didn't get that. Please say, um, repeat to hear this information again…" where the original prompt didn't have the "um".

    In fact it's very human sounding until it hits a phrase not in the DOT database (usually when it's a street name or landmark in a different state, such as on the other side of a bridge from my state, or nonstandard terminology from a toll road) when it jarringly switches to a Stephen Hawking style computer voice for just a single word or two.

  10. KevinM said,

    May 17, 2018 @ 1:11 pm

    A humorous take on the opposite phenomenon as affectation–i.e., pausing before ordinary words but not difficult ones:
    'Use of SAT words is as casual as breathing. Example: "Ahhhhh … but surely you don't mean to …. ahhh …. suggest that …. ahhh … his behavior in any way adumbrates malversation …. because, ahhhh … after all … none of us is prepared to … ahh … to ahhh …. agree with you on that." For further instruction, listen to William F. Buckley, Jr.'
    From The Preppy Handbook

  11. Trogluddite said,

    May 19, 2018 @ 4:28 pm

    While I agree with AntC's comment that the Google Duplex appointment booking example would be, for most users, a solution for a non-existent problem, I can think of a couple of groups of users for whom the technology might be extremely attractive.

    On the positive side, people who struggle to communicate with the world due to the inability to produce speech might greatly appreciate speech generation software which can synthesise more fluent prosody. It might also, allied with the development of suitable interfaces, allow greater scope to synthesise the prosodic cues which indicate pragmatic elements of conversation – imagine, for example, how difficult it would be to indicate sarcasm using the model of speech synthesiser familiar from listening to Stephen Hawking. Difficulty using the telephone is also not unusual for people with anxiety, learning and developmental conditions, which can even hinder them from accessing the support services which they may need (as an autistic person, I have experienced this first-hand).

    On the other hand, we might need to brace ourselves for an onslaught of unsolicited telephone calls from spammers and scammers, who we can be sure will find this technology extremely useful for automating their irritating practices and more easily duping the unwary.

  12. Nicki said,

    May 21, 2018 @ 6:53 pm

    I can see this as incredibly useful for non-native speakers, who often struggle with phone conversations despite a fairly high level of fluency.
