Knowing when you don't know


It's often observed that current AI systems will generalize confidently to areas far away from anything in their training, where the right answer should be "huh?" This is true even when other available algorithms, often simple ones, could easily diagnose the lack of fit to expectations.

We've seen many amusing examples, which we've filed in the category Elephant Semifics, named for a phrase emerging from one of Google's hallucinatory translations of meaningless repetitions of Japanese or Thai characters, or random strings of ASCII vowels. Obviously a human translator would immediately notice the unexpected properties of the inputs, and in fact it's trivial to create algorithms that could screen for such things. Google and its colleagues don't bother, or at least didn't do so in the past, because why should they? Except that in real-world applications, noticing that inputs are nonsense is a clue that something has gone wrong, and maybe business-as-usual is not the right response.
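To make "trivial" concrete, here is a minimal sketch of one such screen: it flags longer inputs whose character distribution has very low entropy, which is what repeated characters and random vowel strings look like. The function names and both thresholds are assumptions chosen for illustration, not anything Google is known to use.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, over the non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_degenerate(text: str, min_entropy: float = 2.5, min_length: int = 12) -> bool:
    """Flag inputs that are long enough to judge but use very few distinct symbols.

    min_entropy and min_length are illustrative guesses, not tuned values.
    """
    n_chars = sum(1 for c in text if not c.isspace())
    if n_chars < min_length:
        return False  # too short to say anything either way
    return char_entropy(text) < min_entropy


if __name__ == "__main__":
    for sample in ["ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష",
                   "aeiouaeiouaeiou",
                   "Le travail n'est pas venu tout seul briser nos vies."]:
        print(looks_degenerate(sample), sample)
```

A check like this will also flag some legitimate inputs (a short phrase repeated several times, say), so it is better treated as a warning sign than as a veto.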

Most of the repeated or random character jokes in Google Translate have now been fixed; it's not clear whether this is due to better overall algorithms or to a special front-end check. But something similar remains true in today's best AI speech-to-text algorithms. If you give a human being a sound clip that's not in the language they expect, they'll notice, and tell you about it. "That's not English, it's French." Or "I have no idea what language that is, but it's not English." But today's speech-to-text systems just forge ahead, doing the best they can without complaint.

Here's a simple demonstration of what happens when you give a French sentence to some current speech-to-text systems, telling them that it's English:

Le travail n'est pas venu tout seul briser nos vies, comme une catastrophe céleste.

Google: Recovering a pavoni to sell Breezy Novi Community, Catalyst of Celeste.
AWS: Rojava in a parvenu to solve briseno V communicate to staff celeste.
IBM: Recover a new permanent to celebrities interviewed current Kustoff Celeste.

These same systems would do quite well if you told them the input was French. And current language-identification technology is very good. So this is just another example of failing to recognize input outside the boundaries of a system's training.
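A rough sketch of what such a check might look like as a gate in front of a recognizer. Everything here is hypothetical: `identify_language` and `transcribe` stand in for whatever language-identification and speech-to-text components are available, and the 0.8 threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class GateResult:
    transcript: Optional[str]  # None when the input was rejected
    detected_language: str
    note: str


def transcribe_with_language_check(
    audio: bytes,
    expected_language: str,
    identify_language: Callable[[bytes], Tuple[str, float]],  # hypothetical: (language, confidence)
    transcribe: Callable[[bytes, str], str],                   # hypothetical: (audio, language) -> text
    min_confidence: float = 0.8,                               # arbitrary placeholder threshold
) -> GateResult:
    """Run language ID first; refuse to transcribe on a confident mismatch."""
    detected, confidence = identify_language(audio)
    if detected != expected_language and confidence >= min_confidence:
        note = f"That's not {expected_language}, it's {detected}."
        return GateResult(None, detected, note)
    # Either the language matches or the identifier is unsure: proceed as requested.
    return GateResult(transcribe(audio, expected_language), detected, "ok")
```

In this sketch an unsure identifier defers to the caller's stated language; whether to reject, warn, or silently switch languages on a mismatch is a policy choice left open here.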

A recent paper diagnosing one aspect of this problem, and suggesting an algorithm-internal fix, is Augustinus Kristiadi, Matthias Hein, and Philipp Hennig, "Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks", PMLR 2020:

The point estimates of ReLU classification networks—arguably the most widely used neural network architecture—have been shown to yield arbitrarily high confidence far away from the training data. This architecture, in conjunction with a maximum a posteriori estimation scheme, is thus not calibrated nor robust. Approximate Bayesian inference has been empirically demonstrated to improve predictive uncertainty in neural networks, although the theoretical analysis of such Bayesian approximations is limited. We theoretically analyze approximate Gaussian distributions on the weights of ReLU networks and show that they fix the overconfidence problem. Furthermore, we show that even a simplistic, thus cheap, Bayesian approximation, also fixes these issues. This indicates that a sufficient condition for a calibrated uncertainty on a ReLU network is “to be a bit Bayesian”. These theoretical results validate the usage of last-layer Bayesian approximation and motivate a range of a fidelity-cost trade-off. We further validate these findings empirically via various standard experiments using common deep ReLU networks and Laplace approximations.
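A toy illustration of the effect the abstract describes, under strong simplifying assumptions: a binary classifier whose logit is linear in a feature vector, a Gaussian over the last-layer weights with made-up mean and covariance, and the standard probit approximation to the Gaussian-averaged sigmoid. This is a sketch of the general idea, not the paper's experiments. Moving the input away from the data is modelled by scaling the features by a factor delta: the mean logit grows like delta, but so does its standard deviation, so the "slightly Bayesian" confidence levels off while the point-estimate confidence goes to 1.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit_mean(w, phi):
    """Mean logit under the weight posterior: w . phi."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def logit_variance(cov, phi):
    """Variance of the logit under the weight posterior: phi^T cov phi."""
    n = len(phi)
    return sum(phi[i] * cov[i][j] * phi[j] for i in range(n) for j in range(n))


def map_confidence(w, phi):
    """Confidence of the point estimate in its predicted class."""
    return sigmoid(abs(logit_mean(w, phi)))


def bayes_confidence(w, cov, phi):
    """Probit approximation: sigmoid(|m| / sqrt(1 + pi * v / 8))."""
    m = logit_mean(w, phi)
    v = logit_variance(cov, phi)
    return sigmoid(abs(m) / math.sqrt(1.0 + math.pi * v / 8.0))


if __name__ == "__main__":
    w = [1.0, -0.5]                  # assumed weight mean (illustrative)
    cov = [[0.5, 0.0], [0.0, 0.5]]   # assumed weight covariance (illustrative)
    phi0 = [1.0, 1.0]                # features of a point near the data
    for delta in (1, 10, 100, 1000):
        phi = [delta * p for p in phi0]
        print(delta, map_confidence(w, phi), bayes_confidence(w, cov, phi))
```

With the covariance set to zero the two functions agree, which is one way to see that the uncertainty over the weights is doing all the work here.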



  1. Ross Presser said,

    July 8, 2021 @ 8:10 am

    The question that arises then is how do humans recognize it's an out-of-bounds input, and how can we train ML to notice this? You said "in fact it's trivial to create algorithms that could screen for such things", but it seems to be an inherent part of human cognition, not separately programmed. When we learn to recognize French we also recognize what's not French; we don't need a separate training session on all non-French input. How can ML be improved to do the same?

  2. Ross Presser said,

    July 8, 2021 @ 8:11 am

    Somehow while I was typing that I completely skipped over the abstract you ended with, which is a proper answer to my question. Sigh…

    [(myl) Your question remains a very good one. The cited paper is one attempt to make things better, but how far such ideas can take us is not yet clear.]

  3. Holly Unlikely said,

    July 8, 2021 @ 9:18 am

    The lack of easily accessible confidence metrics is, in fact, a rampant problem, but STT engines are actually the _least_ guilty in that regard. Unlike out-of-the-box NLP toolkits, most speech-recognition engines I know report a confidence estimate for every word, making it relatively easy to infer garbage inputs.

    [(myl) Unfortunately those confidence metrics have never been very good, in my experience, although back in the days of statistical ML there was a well-defined (though not very helpful) posterior probability. The corresponding numerical values in modern end-to-end systems, ???]

  4. cameron said,

    July 8, 2021 @ 10:59 am

    A classic example of a human transcriber earnestly trying to make sense of nonsense is provided by the Beatles song "Rain", from 1966. At the end of the song there is a line of lyrics from earlier in the song played backwards. (That might be the first use of tape played backwards in a pop song.) When the song was released in Japan, they included a lyric sheet, and transcribed the backwards line of lyrics as "Stare it down and nourish what comes near you".

    Of course generations of religious fanatics would later freak out over supposed "backwards masked" lyrics. But in those cases they knew that the recording was backwards, and were convinced that Satan was speaking through it. The people working for the Japanese record label in 1966 didn't realize that the recording was backwards, but also didn't recognize it as nonsense.

  5. ktschwarz said,

    July 8, 2021 @ 3:41 pm

    Most of the repeated or random character jokes in Google Translate have now been fixed — it's not clear whether this is due to better overall algorithms or to a special front-end check.

    Better overall algorithms, according to this post from a year ago on Google's AI Blog: Recent Advances in Google Translate. For example, multilingual training:

    A technique that has been especially helpful for low-resource languages has been M4, which uses a single, giant model to translate between all languages and English. This allows for transfer learning at a massive scale. As an example, a lower-resource language like Yiddish has the benefit of co-training with a wide array of other related Germanic languages (e.g., German, Dutch, Danish, etc.), as well as almost a hundred other languages that may not share a known linguistic connection, but may provide useful signal to the model.

    Sounds like they did not specifically target hallucinations, but they do use resistance to hallucinations as a quality measure:

    In addition to general quality improvements, the new models show increased robustness to machine translation hallucination, a phenomenon in which models produce strange “translations” when given nonsense input. This is a common problem for models that have been trained on small amounts of data, and affects many low-resource languages. For example, when given the string of Telugu characters “ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష”, the old model produced the nonsensical output “Shenzhen Shenzhen Shaw International Airport (SSH)”, seemingly trying to make sense of the sounds, whereas the new model correctly learns to transliterate this as “Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh”.

    (Today, that string of Telugu characters isn't even transliterated, it's just spit back unchanged.)

  6. SusanC said,

    July 8, 2021 @ 4:43 pm

    The GPT machine learning algorithm does a mixture of both.

    What GPT does is take an initial string and produce a continuation of it, based on its (large) training set.

    You can use this framework to ask it a question, putting the question in the initial string and hoping that the answer will be in the continuation…

    If the answer lies in its training set, it will often produce the right answer.

    Sometimes (often) it will continue the initial string with something equivalent to “I don’t know”.

    But sometimes, it will just make something up that sounds plausible but has no basis in reality. I have a mild concern that this will turn out to be dangerous, in that we have enough trouble as it is with people believing plausible-sounding-but-false stuff, without having an algorithm to mass produce it.

  7. SusanC said,

    July 8, 2021 @ 4:55 pm

    A concrete example of how GPT might be dangerous. You can give it an initial string saying that it is a report compiled by the FBI’s Robert Mueller, outlining the case for prosecuting Donald Trump, a former president of the United States.

    GPT will happily produce the rest of the report, culling relevant information from many newspaper articles in its training set.

    It will also, sometimes, just make stuff up.

    P.S. Fun game: give it an initial string that says that what follows is not considered constitutionally protected speech in the United States.

  8. D.O. said,

    July 8, 2021 @ 5:07 pm

    I am not sure that making algorithms react to an input as humans would is a good idea. If I tell an algorithm to interpret a string of symbols or sounds as English, presumably that is what I want it to do. And I would like the algorithm to presume exactly what I want. Sometimes I might want it to be looser and "detect language", or do a more nuanced thing like presume that the recording is in English, but allow for the possibility that it is some gobbledygook. Or French. But that should be my decision, not the algorithm's.

  9. Phil H said,

    July 8, 2021 @ 7:59 pm

    Older attempts at AI often tried to codify basic knowledge or common sense (my favourite was Cyc); presumably in the end a “general AI” will include both highly-developed algorithms and a smattering of common sense to constrain them. But for the moment it seems like pushing algorithms alone as far as possible is the more fruitful path of research. It’s still yielding extraordinary advances.

  10. A practicioner said,

    July 9, 2021 @ 1:30 am

    The reason is cost. To properly detect the language, one would have to run the audio through multiple text recognition systems, multiplying the cost of each operation tenfold or more.

    Nowadays it is possible to create a single system that will recognize all languages, but that system, by itself, would be far costlier to run than a single-language system, where the user specifies as input what language model to use.

    ("Cost" here refers mostly to "compute", the amount of server-side processor cycles that should be invested in completing the task. While the resources of the tech giants seem immense, they are not unlimited, and there is no reason to be wasteful.)

    [(myl) At least for well-documented languages, Language Recognition algorithms are very accurate and very (computationally) cheap, so checking whether an input is English (or French or Chinese or whatever) should not be a problem.]

  11. Wanda said,

    July 9, 2021 @ 9:21 am

    Following up on myl's comment to a practitioner: There are situations where something you might consider a trivial cost matters. My spouse works on a voice assistant. Because people actually talk to it and expect it to respond as quickly as a person would, every ms of latency matters. It doesn't understand me when I ask it to play a song and give the title in Chinese, but I understand that if it could do that, it would slow down everyone's queries, the vast majority of which do not switch languages mid-sentence.

    [(myl) There's a difference between dealing with "code switching" or "code mixing" — which is hard unless it's a well-defined trainable community culture, like Hindi/English or Cantonese/English mixtures — and mis-interpreting sounds that are not even close to being in the hypothesized linguistic system. Again, current algorithms are in general not good at knowing when they don't know; but some kinds of out-of-frame inputs are not hard to detect. Presumably the voice assistant in question is pretty good at not responding to dog barks or ambulance sirens…]

  12. David Morris said,

    July 9, 2021 @ 5:57 pm

    A practitioner and Mark partly answer my question, but I would also ask how this relates to written translators' 'detect language' feature. Google and Bing both recognised the passage above as French. On the other hand, I tried a simple greeting in Arrernte (central Australia) (taken from Wikipedia's page on the language). One identified it as Xhosa and gave a coherent but different translation, and the other identified it as English and 'translated' it in the same words.

  13. susanC said,

    July 10, 2021 @ 10:35 am

    “Presumably the voice assistant in question is pretty good at not responding to dog barks or ambulance sirens…”

    Some years ago, I was using Dragon Dictate to voice dictate a letter when the crew of builders working next door threw a load of trash into a dumpster.

    The AI dutifully turned the sound of trash falling into a dumpster into English language text. No, it wasn’t sensible text.

    Hopefully software has got better since then, but the basic problem was very noticeable back then.

  14. Rodger C said,

    July 10, 2021 @ 11:58 am

    susanC, that was probably about the time when John Crowley found words like "womb" and "wound" appearing mysteriously on his screen, and eventually realized it was due to trucks passing on the highway. He blogged about this at the time and later used it in his story "Mount Auburn Street."

  15. Jonathan Badger said,

    July 11, 2021 @ 4:58 am

    Or the (probably apocryphal) story of why a new voice-recognition airline luggage system was sending so many bags incorrectly to Phoenix. The story went that a conveyor belt near an input station where a human was supposed to announce the destination was squeaky, and the squeaks were interpreted as "Phoenix".

  16. David J. Littleboy said,

    July 15, 2021 @ 2:27 pm

    The other day, Jimmy Kimmel figured out that if he said "Alexa, order pizza" loudly and clearly on his show, about a third of his home audience would get a pizza delivery.
