Language Log

ASR Elevator

November 14, 2010 @ 8:30 am · Filed by Mark Liberman under Computational linguistics, Humor, Variation

This is funny, though unfair:

It's unfair because the expected word error rate these days for isolated number-words, whatever the accent, is a few tenths of a percent; and also because even the worst-designed voice-response system would have better fall-back strategies than this one does.

The clip is from Burnistoun, "a critically acclaimed sketch show for BBC Scotland by the Scottish comedians Iain Connell & Robert Florence". The most linguistically-interesting thing in this clip is Scott and Peter's attempts to produce "eleven" in a variety of regional accents.

[Update — Just to introduce a small note of reality into the discussion below, I tried playing the audio from the clip into Google Voice, using the "note to self" feature on my Droid, and holding the phone up to the speaker of my laptop. I went through the original rendition of "eleven", and the next eight copies, at which point I got bored and stopped. All nine performances were transcribed by the ASR system as "11".

This despite the fact that the system was obviously not primed to expect that the alternatives were limited to numbers or other words plausibly representing the floors of a building. The tallest building in Scotland appears to have 31 floors — even if this is extended with basements and synonyms for various floors, a competently-designed speaker-independent system limited to 50 or so possibilities would hardly ever make a mistake, and would not be derailed by accents as transparent as those in this sketch.

ASR systems are of course far from perfect, but the public stereotype is very far from the truth.]
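[A rough illustration of the limited-vocabulary point above, not a description of how any real recognizer works. The sketch assumes a hypothetical building with floors 1 to 31 plus "ground" and "basement", and it uses string similarity from Python's `difflib` as a crude stand-in for acoustic scoring. A real system would constrain the search itself with a grammar rather than repair a transcript afterwards. The names `FLOOR_VOCABULARY` and `snap_to_floor` are made up for this example.]

```python
import difflib

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty"}


def number_word(n):
    """Spell out an integer from 0 to 39 as an English number word."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


# Assumed closed vocabulary: 31 numbered floors plus two named ones.
FLOOR_VOCABULARY = {number_word(n): n for n in range(1, 32)}
FLOOR_VOCABULARY.update({"ground": 0, "basement": -1})


def snap_to_floor(hypothesis, vocabulary=FLOOR_VOCABULARY, cutoff=0.6):
    """Map a noisy transcript to the closest allowed floor label.

    Returns the floor number, or None when nothing in the vocabulary is
    similar enough, so that the caller can ask again instead of guessing.
    """
    candidates = difflib.get_close_matches(
        hypothesis.strip().lower(), list(vocabulary), n=1, cutoff=cutoff)
    return vocabulary[candidates[0]] if candidates else None


if __name__ == "__main__":
    for heard in ["eleven", "eleeven", "Basement", "xyzzy"]:
        print(repr(heard), "->", snap_to_floor(heard))
```

[With only 33 labels to choose from, a slightly mangled transcript still lands on the intended floor, and an input that resembles nothing is rejected rather than mapped to an arbitrary floor. That rejection path is the sort of fall-back the elevator in the sketch lacks.]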

[Update #2 — As long as we're in complaining-about-technology mode, I can't resist mentioning an annoying interactional quirk of the elevator in the building I live in. It has no speech technology at all, neither input nor output, so this was a quirk of its button-based communications system.

For several years, pushing "1" had the effect of making it deaf and blind to floors 2 and 3. If those floors had previously been depressed, the little lights in the center of their buttons went out, and no further attempts to press them would be registered, nor would the elevator respond to calls from those floors. The only way to restore the elevator's willingness to admit that floors 2 and 3 existed was to take it all the way up to 4, then go back down to the basement and try again.]

39 Comments

Dave said,

November 14, 2010 @ 9:43 am

I don't think it's unfair at all. I'm American, living in Australia, and I have to attempt similar contortions (including attempting an Aussie accent) to get the damn things to understand me. I'm surprised you say their error rate is so low; I don't think I've ever gotten a phone number or license plate number across in its entirety without having to correct at some point. Mostly that's over the phone, of course, which can't help.
JR said,

November 14, 2010 @ 9:55 am

It's not *that* unfair. Naturally exaggerated for comic effect but, as a Scotsman with an iPhone, I can assure you it frequently misunderstands me when using "voice control"…

[(myl) Sure. But the isolated number "eleven", in a context where the only possible results are floor numbers?]
Ian Tindale said,

November 14, 2010 @ 10:10 am

What did you just say?
lukas said,

November 14, 2010 @ 10:21 am

On voice-response: who ever thought there was something wrong with pressing buttons?

[(myl) In elevators, no one that I know; which is why I've never seen a voice-response elevator system. But other situations are different, which is why (for instance) Google Voice Search seems to be a success.]
A. Marina Fournier said,

November 14, 2010 @ 10:22 am

This reminds me of the new Star Trek movie, where Chekov has to keep repeating the command to the Helm computer because it doesn't understand his accent, which is far more real than Walter Koenig's was.

This was a kick. I'm a Yank, and I can't always get voice-recognition devices to recognize what I say if I deviate off their scripts.
Alicia said,

November 14, 2010 @ 10:28 am

This reminds me of http://www.youtube.com/watch?v=dABo_DCIdpM

another interesting attempt to produce English accents.
Patrick said,

November 14, 2010 @ 10:37 am

As an Australian, living in Australia, it usually takes me at least two tries to get through any of these voice systems. (And I work in IT, so I'm used to dealing with systems that "should work but don't", and having to compensate.)

The error rate may be a few tenths of a percent in a perfect, laboratory-controlled environment (with spherical elephants), and I'm sure the manufacturers' marketing materials all tout such stats. However, as per usual, as soon as you throw in ambient noise, then background chatter, then convert it from analogue to digital, then compress it into a ~3 kHz range, then transmit it over a digitally routed network with attendant packet loss… you get something that fails frequently and frustratingly for all concerned.

[(myl) For common problems like digits, digit-strings, number words, etc., I can assure you that people have been testing such systems in real-world settings since the mid 1980s, when I first saw the results of testing at Bell Labs for telephone voice input of telephone numbers, credit-card numbers and so on. The challenge is mainly people who say things like "um, well, it's eleven I think", not people who say "eleven" in whatever sort of accent.

In an elevator voice-command application (not that I imagine anyone would really want one), there's no reason to downsample to telephone bandwidth — though that really wouldn't make a lot of difference for the relatively trivial task of recognizing isolated floor numbers.]

Thus the video I guess, "It is better to laugh about your problems than to cry about them." (Einstein?)
lukas said,

November 14, 2010 @ 10:55 am

Well, GVS isn't so much voice response as it is voice recognition (hope I'm getting my terms right here). But "say 'thing A' in order to A, 'thing B' to B, or 'thing C' in order to C" has always struck me as very long-winded and unwieldy, especially if you have to do it three times over because the damn thing won't understand you. Besides, most people I know just plain don't like talking to machines, they want to push buttons… might be a psychological/cultural thing.
Spell Me Jeff said,

November 14, 2010 @ 11:21 am

It is funny and even comforting how the latest Star Trek should be aware of the challenges to speech recognition software.

[(myl) Another way to think about this is that the law of preservation of pop-culture error is operating. Back when ASR systems didn't work, the pop-culture stereotype was that they would soon be perfect. Now that they work pretty well, the pop-culture stereotype is that they'll always be painfully bad.]

TOS took for granted that a computer would find speech recognition easy, and also that voice input/output would be universally desirable. Witness Spock's ridiculous manner of playing chess with a computer, and also Scotty's notion (in movie IV) that keyboards and mice were historically quaint. He's an engineer! He should know that you're not gonna program a starship's computer by voice alone.
Spell Me Jeff said,

November 14, 2010 @ 11:41 am

@myl
Agreed. The fact that the franchise is so old makes it an interesting test case. Each new movie has to negotiate the difference between series-canonical "facts" and facts/technology/culture that have arisen during the interim(s).
Mr Fnortner said,

November 14, 2010 @ 11:54 am

I'm puzzled by myl's impassioned defense of speech recognition technology, to the extent that he begins the defense in his opening remarks before allowing us to enjoy a funny skit and raise issues. Does he really believe that comedy is that great a threat to speech recognition research and industry?
Geoffrey K. Pullum said,

November 14, 2010 @ 12:09 pm

Very few people could have been in as good a position as I was to enjoy this clip when I first saw it. I work in a modern building in Scotland where the elevator is a badly-built piece of junk (we are now on our fourth firm of elevator contractors, trying to get it fixed) and the voice synthesis that does the floors-and-doors announcing is an utter disaster. (There are various bugs in the controller program; and I swear that when you reach level 3 you hear a much-too-loud stuck-up home-counties Englishwoman's voice shriek "Revel droon!"; it sounds absolutely nothing like "level three"; when in Edinburgh, just take the elevator to the third floor of the Dugald Stewart Building and listen for yourself.) On top of that, Ken, the wonderful building facilities manager who has to try and persuade the elevator contractor to get on with fixing the faults, is a basilectal Scottish English speaker who says "eleven" just like the guys in the video, and has the job of rescuing people who get stuck in the lift. I sent the link to him straight away, and he loved it. Mark is absolutely right that the sketch is unfair to the speech industry, because remarkable progress really has been made, over the course of his career and mine, in the direction of getting speech recognized in context. But hey, it is funny. It has kept Ken and me laughing instead of weeping over the bugs in the elevator programming, and that's worth a lot.
af said,

November 14, 2010 @ 12:25 pm

Well, the Apple computer phone tree ASR system has trouble with "yes" vs "no" with typical female f0 and formants (at least in this, I think I'm typical!), even when these are the only two response options. I make a lot of support calls for my lab, so I've started tracking this. Generally, "yes" (as in, "yes, I already have a case number") is parsed by the system as "no" or it isn't understood at all, and I'm kicked into some kind of backup tree. Occasionally, if I artificially lower my f0, my response is correctly parsed. (To be fair, if I get to the point of reciting a system serial number, the ASR system has less difficulty with my voice.)
Dan S said,

November 14, 2010 @ 12:37 pm

I am delighted to learn that for voice-recognition systems the "expected word error rate these days for isolated number-words […] is a few tenths of a percent."

As recently as 2005, the LG VX8100 cellphone shipped with a voice-dial feature that was impressively good at matching the unusual names on my list, without training. But it was weirdly inept at recognizing "yes" and "no", specifically when those were the only two choices! (I speak a generic mid-Atlantic American, and I'm not the only one who had that problem. It seemed to recognize my yes/no only when I spoke in an irritated tone. My kids loved watching this.)

That just had to be a stupid implementation, right?
Dierk said,

November 14, 2010 @ 12:37 pm

Simple solution to the elevator problem: ten.
UK lawyer said,

November 14, 2010 @ 1:30 pm

Whilst renting a car a few weeks ago in the US, I called the hire company to extend the hire period. I was quite surprised that it was all automated – in the UK I am used to pressing 4 etc, but not saying words. Needless to say, the voice recognition software didn't understand my standard English accent, and I resorted to imitating the voice that was asking me the questions – it then worked fine. My wife thought I was talking to someone and was quite embarrassed at how rude she thought I was being, speaking in a fake US accent.
Kaushik Janardhanan said,

November 14, 2010 @ 1:55 pm

Up or Down. Up. Or sideways? Or a staircase?

http://www.youtube.com/watch?v=7xbjGFLA6_8
Kacie Landrum said,

November 14, 2010 @ 2:36 pm

Considering that I'm American and I still have the *worst* time getting voice-recognition software to understand me, I can see how Scottish people could have problems. My Mac just beeps at 9 out of 10 commands I give it, and the 1 out of 10 it responds to it misunderstands (i.e. opening Mail when told to 'open Safari').

I spent the most irritating 30 minutes of my life on the line with Social Security a couple of months ago trying to order a new Social Security card. This software couldn't understand 'yes' or 'no', letters of the alphabet, or numbers.

BTW, am I the only one that thought their 'American' eleven sounded Russian?
Bobbie said,

November 14, 2010 @ 4:29 pm

In 1976 I lived in Warsaw, Poland. (I am an American and had learned some basic Polish but not much.) I tried to call a 4-digit business extension at the university. The (live) operator refused to recognize my attempts to say each number individually. I think she wanted me to say something like "four thousand one hundred thirty-five", but I didn't know how to say that in Polish! I ended up weeping into the phone and pleading with the operator to connect me. Ah, the feeling of power she must have had after that!
MattF said,

November 14, 2010 @ 4:38 pm

Well, you see, voice-recognition errors will tend to be correlated, so the average miss rate is probably not equal to the average frustration rate. Most people will be understood, but a few people will have a very unpleasant experience. More specifically, if 'eleven' isn't understood, the chances that 'ten', 'twelve', or 'thirteenty minus two' [in an enraged tone of voice] are understood will be low.
Qov said,

November 14, 2010 @ 5:06 pm

I'm a pilot, so whenever there is difficulty understanding with a machine or transmission device involved, I unconsciously revert to formal international radio speech, telling the disembodied voice "negative" and "niner fife fo-er." This rarely does either of us any good.
Kylopod said,

November 14, 2010 @ 6:19 pm

This skit resonates with me because when I was 11 (yes, 11) in 1988, I visited a tourist attraction in Florida called "Xanadu: House of the Future." Among the futuristic items in this house was a $1400 computer called "Godfrey" that responded to voice commands to operate all the electronics in the house, such as switching on and off lights. After explaining all the incredible stuff "Godfrey" could do, our guide went ahead and demonstrated it. He intoned, "Godfrey." Nothing happened. He repeated: "Godfrey." The machine didn't respond. He altered his cadence slightly and said "Godfrey." Nothing. This went on for several more tries before the machine burst to life, but by that point most of us were stifling laughter.

In the mid to late '90s, I was back in Florida and discovered that the place where this "House of the Future" had been was now boarded up.
Seonachan said,

November 14, 2010 @ 6:20 pm

Talking elevators are a timeless source of linguistic comedy.

Woody Allen had a bit in his standup act, back in the 60s, where he encounters a voice-activated elevator, and he feels self-conscious because he spoke with a New York accent, "and the elevator spoke quite well." As he gets off, he hears the elevator "make a remark."

(Later, in frustration at his household appliances breaking down on him, he smashes his tv. The next day he's on the elevator again, and as it's going up it says, "Are you the guy that hit the television set?")
Chris said,

November 14, 2010 @ 6:59 pm

It's unfair for at least one other reason, suspiciously ignored by Profs Liberman and Pullum: some of the best speech recognition research in the world comes out of…wait for it…Scotland!
Rech said,

November 14, 2010 @ 9:25 pm

I work in a building in Australia that has a ground floor with French windows looking out to a formal Japanese garden, sculptures, huge art photos, piped music, fountains, valets to greet clients at the door and accompany them to their appointment. You get in the lift, and after the door closes this female voice shrieks "Goawing urp" in a sawtooth almost parodic accent sounding like the most rustic inhabitant of outback Northern Queensland.
Euan MacDonald said,

November 14, 2010 @ 9:32 pm

Another sympathetic anecdote: as a Scotsman living a few years ago in New York, I had real problems trying to speak to utilities providers on the telephone. I had real difficulty in getting the automated system even to recognise my "yes" in a straight "yes/no", even resorting to an (appalling) attempt at an American accent.
Chandra said,

November 14, 2010 @ 9:36 pm

I used to work as an OnStar operator. I can assure you that this isn't all that far off the mark.
Angus Grieve-Smith said,

November 14, 2010 @ 10:17 pm

Seonachan, I found a transcription (by hand, presumably) of that Woody Allen routine. Thanks!
groki said,

November 14, 2010 @ 11:17 pm

@myl: The only way to restore the elevator's willingness to admit that floors 2 and 3 existed was to take it all the way up to 4, then go back down to the basement and try again.

hysteresis! (though for you residents, probably not hysterical.)
groki said,

November 15, 2010 @ 12:55 am

my first use of voice recognition (on a mac 660 AV under System 7: I bought the hype over the AV series) was on menu commands, and one would say Computer: File Open or whatever. but far too many times, whatever menu command I actually said was heard by the computer as Computer: Cut. I kept losing work. fail-un-safe!

before I just went back to the keyboard, what I said most often was Computer: Undo, which, thankfully, the system usually did understand.

my never-confirmed conjecture at the time was that the schwa in and the shortness of Cut meant that quite a few words tickled Cut's ranking as well as the actual target, and Cut emerged as some kind of (dangerous) default, a mediocre-but-slightly-above-the-other-mediocrities choice in whatever crude algorithm was used back then.
Kylopod said,

November 15, 2010 @ 5:48 am

Sometime during my teenage years, we got voice recognition software for the first time, and my father attempted to play a solitaire game entirely by voice commands. It was going well for several minutes until the computer decided to interpret one of his commands as "exit," and the game prematurely closed. It's no wonder this topic has so much comic potential.
Dan T. said,

November 15, 2010 @ 8:28 am

So, if some people are in the elevator talking about yesterday's football game, and say "Tennessee was up 10 to 7 in the first quarter, but they tied it up 10 to 10 by halftime," the elevator might be jumping all over the place after interpreting all of those numbers. It might even hear a "ten" in "Tennessee".
Jerry Friedman said,

November 15, 2010 @ 11:21 am

A voice-controlled alternative in elevators would be useful for blind people and even more for people with disabilities of their hands.
Greg Morrow said,

November 15, 2010 @ 2:44 pm

Dierk: "ten" might have a reasonable chance of failing for me — I have pin/pen merger, and pronounce "ten" with something close to the BIT vowel.
Russell said,

November 16, 2010 @ 3:53 am

Is it just me or is there some near-labiolingual articulation around 1:55?
Ellen K. said,

November 16, 2010 @ 9:17 am

This strikes me as humor that exaggerates reality. Not supposed to be realistic, but, rather, taking something people experience and expressing it in an exaggerated form.
Terminologia etc. » » Numeri e accenti inglesi said,

November 19, 2010 @ 8:51 am

[…] found very funny a sketch, seen on Language Log, about an elevator that, instead of the usual buttons, has an automatic speech recognition system and […]
ajay said,

November 19, 2010 @ 10:49 am

The only way to restore the elevator's willingness to admit that floors 2 and 3 existed was to take it all the way up to 4, then go back down to the basement and try again.

"Hello," said the elevator sweetly, "I am to be your elevator for this trip to the floor of your choice. I have been designed by the Sirius Cybernetics Corporation to take you, the visitor to the Hitch Hiker's Guide to the Galaxy, into these their offices. If you enjoy your ride, which will be swift and pleasurable, then you may care to experience some of the other elevators which have recently been installed in the offices of the Galactic tax department, Boobiloo Baby Foods and the Sirian State Mental Hospital, where many ex-Sirius Cybernetics Corporation executives will be delighted to welcome your visits, sympathy, and happy tales of the outside world."

"Yeah," said Zaphod, stepping into it, "what else do you do besides talk?"

"I go up," said the elevator, "or down."

"Good," said Zaphod, "We're going up."

"Or down," the elevator reminded him.

"Yeah, OK, up please."

There was a moment of silence.

"Down's very nice," suggested the elevator hopefully.

"Oh yeah?"

"Super."

"Good," said Zaphod, "Now will you take us up?"

"May I ask you," inquired the elevator in its sweetest, most reasonable voice, "if you've considered all the possibilities that down might offer you?"
MarkD said,

November 21, 2010 @ 6:15 am

http://www.youtube.com/watch?v=YId_ArKyoYs for voice recognition technology – old school
