The "silly AI doing something stupidly funny" trope is a powerful one, partly because people like to see the mighty cast down, and partly because the "silly stupid AI" stereotype is often valid.
But as with stereotypes of human groups, the most viral examples are often fakes. Take the "Voice Recognition Elevator" skit from a few years ago, which showed an ASR system that was baffled by a Scottish accent, epitomizing the plight of Scots trapped in a dehumanized world that doesn't understand them. But back in the real world, I found that when I played the YouTube version of the skit to the 2010 version of Google Voice on my cell phone, it correctly transcribed the whole thing.
And I suspect that the recent viral "tuba-to-text conversion" meme is another artful fraud.
There's no recording of the "dad … playing the tuba" sounds that were allegedly transcribed as "Woo woo woo woo … who wu Google woop woop … who get glue who who … will do", so I can't try to reproduce the results directly. But I tried feeding Google Voice several tuba sequences from YouTube, for example this one:
In each case I started with "OK Google note to self", which sets up transcription of what follows and sends the resulting text to me as an email. And in each case, Google Voice listened, and then displayed on the screen the message "What's the note?". When I touched the screen, the system spoke the message "You just said something — I didn't hear what it was."
So maybe there's a way to play the tuba that will fool Google Voice into transcribing the sounds as a long sequence of "woo woo woo" and "woop woop woop" and so on — though the alleged result in this case is not likely to get a very good language model score, and the state of the art in distinguishing human voices from tubas and the like is really not that easy to fool. Or maybe Apple's systems are a lot less capable in the relevant ways than Google's are. But my guess is that the whole thing was just a funny fake.
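To see why a transcript like that would get a poor language model score, here's a toy illustration — emphatically not Google's actual model, just a made-up corpus and an add-one-smoothed unigram model for demonstration. A string of out-of-vocabulary syllables like "woo woop" scores far worse than ordinary English, and a real system weights that evidence against the acoustic match:

```python
import math
from collections import Counter

# Tiny made-up training corpus standing in for ordinary English text.
corpus = (
    "note to self remember to buy milk and call the plumber "
    "note to self the meeting is at three on friday "
    "remember to pick up the kids after school"
).split()

counts = Counter(corpus)
vocab_size = len(counts)
total = sum(counts.values())

def unigram_logprob(words, alpha=1.0):
    """Average add-one-smoothed unigram log-probability per word."""
    lp = 0.0
    for w in words:
        p = (counts[w] + alpha) / (total + alpha * (vocab_size + 1))
        lp += math.log(p)
    return lp / len(words)

normal = "remember to call the plumber".split()
tuba = "woo woo woo woop woop woop".split()

# The in-vocabulary sentence gets a much better (less negative) score;
# every "woo" and "woop" falls back to the smoothing floor.
print(unigram_logprob(normal))
print(unigram_logprob(tuba))
```

A production recognizer uses vastly larger models and context beyond unigrams, but the principle is the same: the decoder has to trade acoustic evidence against linguistic plausibility, and "woo woo woo woop woop" has very little of the latter.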
If anyone can actually get a speech recognition system to produce this general pattern of behavior — recognizing brass instrument sounds as a sequence of human syllables — please give us the details of the system and the setting, the audio recording, and the resulting transcript.
Of course, one of the things that humans can do with non-speech sounds is to translate them into more-or-less similar speech patterns: whippoorwill, boom, cha-ching, moo, twang, boop-boop-a-doop, etc. — and we might want an ASR system to be able to do the same thing, on demand. But humans don't confuse a mechanical cash register ringing up a sale with someone actually saying cha-ching, and ASR systems don't either.