Celebrity voices


In current rotation on Doonesbury, Bernie (Mike's boss) is pitching an idea to Sid (Boopsie's agent):

Bernie's idea? To create celebrity GPS voices:

Sid is sure he can recruit speakers:

Though what this means, in the case of actors who don't always play themselves, is less clear:

I was pretty sure that this was a re-run, and a quick web search confirmed that the same sequence ran back in November of 2009. But first I had a moment of doubt, wondering whether it might be a dream memory.

Why might I have false memories about celebrity speech synthesis? Well, in the late 1980s, when I still worked at Bell Labs, I spent a week in Denver recording a (minor) celebrity voice. The speaker was a woman working at a country music radio station, who had previously recorded the messages and prompts for AT&T's AUDIX voicemail system, for which the engineering development was then done at the Western Electric facility in Denver. The AUDIX people wanted to see if they could add general text-to-speech capability using the same voice.

This didn't work out in the end, for various reasons (corporate re-orgs; unclarity about how customers would get text into the system in those pre-cell-phone, pre-internet days; the Uncanny Valley effect, etc.). But it was the occasion for a fair number of semi-jokes about celebrity voices — could we recruit Charlton Heston, we wondered?

And I learned some interesting things about voices from Miss Audix, as she was called in this professional role.

When she originally recorded the AUDIX prompts, using her normal voice-over persona, the results had been vehemently rejected by the customer. After several equally unsatisfactory iterations, she finally hit on what she called her "happy secretary voice", which turned out to be exactly what they wanted.  So she tried to adopt that persona for the material that I recorded (a typical list of phonetically-balanced sentences, somewhat like the current ARCTIC list but an order of magnitude larger).

This was not easy — she broke down laughing more than once, trying to give the happy secretary version of something like "The hogs were fed chopped corn and garbage". (Although we probably didn't actually include the Harvard Sentences in our list, fragments taken out of context in order to optimize the coverage of segmental n-grams tend to be similarly incongruous.) In the end, I think she reverted to a more standard voice-over delivery.
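For readers curious about what "optimizing the coverage of segmental n-grams" can look like in practice, here is a minimal sketch of one common approach, greedy selection. It is not the procedure actually used for the list described above, whose details aren't given here. The function names are hypothetical, and letters stand in for the phone transcriptions that a real system would use.

```python
def ngrams(units, n=2):
    """Return the set of n-grams (as tuples) in a sequence of units."""
    return {tuple(units[i:i + n]) for i in range(len(units) - n + 1)}


def greedy_select(fragments, n=2, budget=None):
    """Greedily pick fragments that add the most not-yet-covered n-grams.

    `fragments` maps each fragment's text to its sequence of segments.
    In this toy version the segments are letters, an assumed stand-in
    for a real phonetic transcription.
    """
    remaining = {text: ngrams(units, n) for text, units in fragments.items()}
    covered, chosen = set(), []
    while remaining and (budget is None or len(chosen) < budget):
        # Choose the fragment that contributes the most new n-grams.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining fragment adds any new coverage
        covered |= gain
        chosen.append(best)
    return chosen, covered


# Toy usage: letters with spaces removed play the role of segments.
candidates = [
    "the hogs were fed",
    "chopped corn and garbage",
    "fed chopped corn",
]
toy = {text: text.replace(" ", "") for text in candidates}
chosen, covered = greedy_select(toy, n=2)
print(chosen, len(covered))
```

Because the selection criterion knows nothing about meaning, only about which segment sequences are still missing, it is easy to see why the chosen fragments can end up sounding incongruous.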

And things broke down further when we got to the prose passages. I wanted to get recordings of some extended passages of coherent prose, for use in modeling her prosodic patterns in material other than sentence lists, and so I'd included a collection of newswire stories. But when we got to that point, Miss Audix objected.

In the first place, she explained, newswire stories are not written to be read out loud. As a professional newsreader, she would need to rewrite the stories and mark them up in various ways before reading them. I'd allowed one 40-minute session for her to read the stack of stories that I'd printed out for her — but it would take much longer than that for her to re-write them so as to be suitable for reading on the air.

And in the second place, she insisted, the clash of personas was just too much. Happy secretaries are not newsreaders, nor vice versa. OK, I said, just read them as you would normally.

But that was not a clear enough instruction, because she had worked at several different kinds of radio stations. And she gave a fascinating demonstration of the acting method behind her different ways of reading the same story on a public radio station, on an all-news AM station, or on a top-40 music station.

On an NPR outlet, she explained, her presentation would embody the idea that "This is really complicated stuff, but I'm intelligent, and you're intelligent, so I'm going to lay the ideas out in a way that intelligently reflects their structure, and since you're paying careful and intelligent attention, you'll understand." And her sample exhibited a correspondingly elaborate modulation of amplitude, pitch, and time.

On an all-news AM station, she explained, the idea is "This is really important and you're really busy so just listen for a minute and you'll get all the essential stuff you need to know". And in her sample, she talked fast and loud and urgently, with great but generally uniform emphasis.

And on a music station her message was "You don't want to hear this, and I don't want to read this either, but the FCC makes us do it, so just ignore me for a minute and we'll get back to the tunes." The corresponding sample was rapid, soothing, unemphatic and easily backgrounded.

I think that we resolved the dilemma with something like "just imagine that you're reading the newspaper out loud to your grandmother whose eyesight is failing", but I'm not sure. Anyhow, this is an aspect of speech synthesis (and speech science) that still needs some work.



12 Comments

  1. hanmeng said,

    March 26, 2011 @ 9:58 am

    The one time I want to hear samples, there aren't any.

  2. Dan Lufkin said,

    March 26, 2011 @ 10:25 am

    That's fascinating. All these years I've had "Joe took father's shoe-box out" reverberating in my skull and never knew of the Harvard sentences. They have a certain oblique poetry if you read them seriatim.

    [I wonder if anyone can bring to mind the Australian soldiers who (during WW II) reformatted an army manual on mosquito control into short lines and won an Australian literary prize for young poets.]

    When the National Weather Service began TTS dissemination of weather forecasts by telephone (WE6-1212) they ran a test at one exchange in Baltimore. They intercepted random calls and asked listeners to rate the service on several criteria. They were worried about the Uncanny Valley effect, but the TTS service won hands down on both "clarity" and "friendliness". I don't recall whether Bell Labs and Miss Audix were involved.

  3. Chris Kern said,

    March 26, 2011 @ 10:41 am

    What I found kind of amusing about this sequence is that Japan has already done this with well-known anime voice actors; you can purchase a number of packages for GPS devices that have voice actors either speaking as some of the famous characters they play, or just in their normal voice.

  4. Rodger C said,

    March 26, 2011 @ 11:19 am

    I wonder if anyone can bring to mind the Australian soldiers who (during WW II) reformatted an army manual on mosquito control into short lines and won an Australian literary prize for young poets.

    You're thinking of James McAuley and Harold Stewart, the creators of Ern Malley:

    http://en.wikipedia.org/wiki/Ern_Malley

  5. Dan Lufkin said,

    March 26, 2011 @ 11:29 am

    @ Rodger C — Thanks very kindly. That's it precisely.

    As soon as I get time, I'm going to run the Harvard sentences through Mark V. Shaney. We may be on the threshold of Something Big.

    [(myl) There's also the TIMIT sentences.]

  6. Pflaumbaum said,

    March 26, 2011 @ 11:39 am

    BBC Radio 1, which is mostly music and aimed at a youth audience, has its newsreaders employ the Happy Secretary in their segment ('Newsbeat'). Whatever the topic – tsunami or tamagotchi – they give us the same relentlessly light, perky tone.

    Here's one of them chirruping away about quite a serious subject – the attempt by the fascist BNP to re-brand itself as moderate.

    http://www.youtube.com/watch?v=PnvmyyPLbb8

  7. a George said,

    March 26, 2011 @ 12:09 pm

    myl, You have given us a fascinating eyewitness account, and it makes one wish that you taped your interviews and discussions with Miss Audix, not only her performances. And it comes at a time when the awareness of the importance of the voice is on the increase again, an interest that was in a valley because of the perceived importance and hence predominance of the visual side.
    When considering the listener's expectations, with some reservation I would recommend a book, Lars Nyre: "Sound Media. From live journalism to music recording", Routledge 2008, with a CD.
    Part 1, until p. 110 is refreshing, though less stringent than Don Ihde. I have severe reservations about the second part, which is a reverse time-line discussion of technical development in the audio field. This is a novelty in itself but not very fertile. However, this is not the place, nor is there time for a dissection of pp. 111-195.
    When considering the development of voice performance (singing) in the context of the general feeling in society, Daniel Leech-Wilkinson has an interesting discussion of Lieder performance in the "recordable" period, in "The Changing Sound of Music: Approaches to Studying Recorded Musical Performances" (2009), available free on:
    http://www.charm.kcl.ac.uk/studies/chapters/intro.html with links.
    Again, a thorough discussion (this time without reservation) falls outside a comment, but one specific observation by D.L.-W. needs to be reported: he perceives an emotional relationship between motherese and portamento and the latter's falling into disuse being caused by the general feeling of unrest in society (reference to: Daniel Leech-Wilkinson: ‘Portamento and musical meaning’, Journal of Musicological Research 25 (2006), 233-61).

  8. Carl Offner said,

    March 26, 2011 @ 7:37 pm

    This is interesting. I'm by nature very non-visual but very auditory, and I've always been sensitive to and aware of the way radio and TV announcers modulate their voices in ways that make it clear what you are supposed to think of what they are saying. I've always been frustrated that most people I talk to are completely unaware of this — at least consciously. I've always been convinced, however, that these techniques do serve a purpose and that people do respond to the message, however unconsciously. So it's revealing to see how aware these announcers are of what they are doing and how deliberate this all is. I feel vindicated!

  9. Corwin said,

    March 27, 2011 @ 2:49 am

    @Dan Lufkin—
    Although the sentences are somewhat less poetically oblique, being selected on the basis of final rhyme as well as, seemingly, length, I feel an effect quite similar to that of reading the Harvard sentences when I read through "The Longest Poem in the World," composed by aggregating public Twitter entries: http://www.longestpoemintheworld.com/

  10. Matt S said,

    March 27, 2011 @ 4:35 am

    A large portion of voice talent work nowadays is in the computer games market – games like Mass Effect have hundreds if not thousands of hours of recorded voice, and nearly a hundred for the main voice talent alone. Interestingly, the most desired ability in the field isn't quality or emotion, it's more a case of "first take's the best take" – that is, there's very little time in the studio and the actors have to get the lines out right first time, with little or no context or direction. It's oddly amusing that there's more emotional direction in a weather bulletin than a feature entertainment.

  11. Andrew Greene said,

    March 27, 2011 @ 4:24 pm

    When I worked at a company that used Audix for voicemail (this was in the mid-1990s), we called her "Auntie Audix."

    I'm delighted to hear the background story.

  12. Antropica said,

    January 10, 2014 @ 3:00 pm

    Nice! My favorite ARCTIC-like sentence from this piece is this:

    "fragments taken out of context in order to optimize the coverage of segmental n-grams tend to be similarly incongruous"

    :)
