The Voder — and "emotion"


There was an interesting story yesterday on NPR's All Things Considered, "How We Hear Our Own Voice Shapes How We See Ourselves And How Others See Us". Shankar Vedantam starts with the case of a woman whose voice was altered because her larynx was accidentally damaged during an operation, leading to a change in her personality. And then it segues into an 80-year-old crowd pleaser, the Voder:

All the way back in 1939, Homer Dudley unveiled an organ-like machine he called the "Voder". It worked using special keys and a foot pedal, and it fascinated people at the World's Fair in New York.

Helen, will you have the Voder say 'She saw me'.

She … saw … me

That sounded awfully flat. How about a little expression? Say the sentence in answer to these questions.

Q: Who saw you?
A: SHE saw me.
Q: Whom did she see?
A: She saw ME.
Q: Well, did she see you or hear you?
A: She SAW me.

So far, so good. Except that Vedantam introduces the Voder with an all-too-common conceptual (and terminological) confusion.

Voices convey so much more than information. They communicate feeling, temperament, personality. For more than two centuries, scientists have been working to incorporate this psychological insight into their work to recreate the human voice.

But the Voder's variation in prosodic focus (what Homer Dudley vaguely calls "a little expression") is surely a matter of conveyed information about communicative context, not "feeling, temperament, personality". And this confusion persists, and even gets worse, with "feeling, temperament, personality" all packed up as "emotional information":

Speech technology has come a long way over the years, but in many ways synthetic voices still sound synthetic. They don't convey all the emotional information that's packed into the human voice.

What makes synthetic voices sound "synthetic" is not the absence of "emotional information" (which is indeed a kind of information, though it's not very clear what it is and how it's different from style, emphasis, rhetoric, or a dozen other vague words). Failures of naturalness can occur at many levels, from things that just don't sound human at all, to sounds that are natural enough in the abstract but don't fit the narrative or conversational context, or phrases that sound like they come from a different speaker, or are inappropriately formal or informal, or seem to have too much or too little vocal effort, and so on and so forth.

And it's not just journalists who use the word "emotion" to describe this complex phonetic space and its diverse interpretive connections. All too many speech technologists, all too often, describe almost everything in speech beyond the simple sequence of words as "emotion". I don't think this is because the word emotion is shifting in meaning; rather, it seems to reflect a lack of conceptual engagement with the phenomena in question.

The ATC piece concludes with an interesting discussion of Rupal Patel's work on assistive communication aids for people who have lost their voices or who are about to lose them. This mostly focuses on the question of how to design a speech synthesis system that sounds like a particular user, or will allow the user to modulate features like breathiness.

Meanwhile, here's video of Helen Harper performing at the keyboard in 1939:

Homer Dudley developed the Voder as a demonstration of the decoding side of his Vocoder. The Vocoder was also the foundation of the SIGSALY secure speech system, which led to Alan Turing's interactions with Claude Shannon in 1943, and indirectly led to Aleksandr Solzhenitsyn's work in the Marfino sharashka, fictionalized in The First Circle.


  1. AntC said,

    August 17, 2019 @ 8:27 am

    all too many speech technologists, all too often, describe almost everything beyond the textual content of speech as "emotion".

    All too many technologists of artificial languages (I'm thinking especially of compiler-writers) describe almost everything beyond the syntax of a language as 'semantics'. That is, beyond the formal syntax in the language spec: there's plenty of infelicities that can be detected syntactically (with context-sensitivity or whole-program analysis), just not by BNF productions.

  2. Rick Rubenstein said,

    August 17, 2019 @ 3:39 pm

    As a bit of vaguely related curiosity, I wonder whether and, if so, after how long, Stephen Hawking's "internal" voice became that of his legendary speech synthesis module.

  3. kimika said,

    August 17, 2019 @ 10:44 pm

    The voice of Dorth Voder.
