The "silly AI doing something stupidly funny" trope is a powerful one, partly because people like to see the mighty cast down, and partly because the "silly stupid AI" stereotype is often valid.
But as with stereotypes of human groups, the most viral examples are often fakes. Take the "Voice Recognition Elevator" skit from a few years ago, which showed an ASR system that was baffled by a Scottish accent, epitomizing the plight of Scots trapped in a dehumanized world that doesn't understand them. But back in the real world, I found that when I played the YouTube version of the skit to the 2010 version of Google Voice on my cell phone, it correctly transcribed the whole thing.
And I suspect that the recent viral "tuba-to-text conversion" meme is another artful fraud.
There's no recording of the "dad … playing the tuba" sounds that were allegedly transcribed as "Woo woo woo woo … who wu Google woop woop … who get glue who who … will do", so I can't try to reproduce the results directly. But I tried feeding Google Voice several tuba sequences from YouTube, for example this one:
In each case I started with "OK Google note to self", which sets up transcription of what follows and sends the resulting text to me as an email. And in each case, Google Voice listened, and then displayed on the screen the message "What's the note?". When I touched the screen, the system spoke the message "You just said something — I didn't hear what it was."
So maybe there's a way to play the tuba that will fool Google Voice into transcribing the sounds as a long sequence of "woo woo woo" and "woop woop woop" and so on — though the alleged result in this case is not likely to get a very good language model score, and the state of the art in distinguishing human voices from tubas and the like is really not that easy to fool. Or maybe Apple's systems are a lot less capable in the relevant ways than Google's are. But my guess is that the whole thing was just a funny fake.
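To see why a transcript like that would get a poor language model score, here's a toy illustration — emphatically not Google's actual model, just a made-up corpus and an add-one-smoothed unigram model for demonstration. A string of out-of-vocabulary syllables like "woo woop" scores far worse than ordinary English, and a real system weights that evidence against the acoustic match:

```python
import math
from collections import Counter

# Tiny made-up training corpus standing in for ordinary English text.
corpus = (
    "note to self remember to buy milk and call the plumber "
    "note to self the meeting is at three on friday "
    "remember to pick up the kids after school"
).split()

counts = Counter(corpus)
vocab_size = len(counts)
total = sum(counts.values())

def unigram_logprob(words, alpha=1.0):
    """Average add-one-smoothed unigram log-probability per word."""
    lp = 0.0
    for w in words:
        p = (counts[w] + alpha) / (total + alpha * (vocab_size + 1))
        lp += math.log(p)
    return lp / len(words)

normal = "remember to call the plumber".split()
tuba = "woo woo woo woop woop woop".split()

# The in-vocabulary sentence gets a much better (less negative) score;
# every "woo" and "woop" falls back to the smoothing floor.
print(unigram_logprob(normal))
print(unigram_logprob(tuba))
```

A production recognizer uses vastly larger models and context beyond unigrams, but the principle is the same: the decoder has to trade acoustic evidence against linguistic plausibility, and "woo woo woo woop woop" has very little of the latter.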
If anyone can actually get a speech recognition system to produce this general pattern of behavior — recognizing brass instrument sounds as a sequence of human syllables — please give us the details of the system and the setting, the audio recording, and the resulting transcript.
Of course, one of the things that humans can do with non-speech sounds is to translate them into more-or-less similar speech patterns: whippoorwill, boom, cha-ching, moo, twang, boop-boop-a-doop, etc. — and we might want an ASR system to be able to do the same thing, on demand. But humans don't confuse a mechanical cash register ringing up a sale with someone actually saying cha-ching, and ASR systems don't either.