Adversarial attacks on modern speech-to-text


Generating adversarial STT examples.

In a recent post on this blog, Mark Liberman raised the lively area of so-called "adversarial" attacks on modern machine learning systems. These attacks can do amusing and somewhat frightening things, such as forcing an object recognition algorithm to identify all images as toasters with remarkably high confidence. Having seen these applied to image recognition, he hypothesized that they could also be applied to modern speech recognition (STT, or speech-to-text) based on, e.g., deep learning. His hypothesis has indeed recently been confirmed.

The waveforms are modified so that the perceptual, acoustic differences are tiny, yet the modification forces the STT system to produce any desired transcription, one that completely contradicts the words any normal human listener would transcribe. For example, here is an original recording:

("without the dataset the article is useless") and here is the adversarial one:

(transcribed as "okay google browse to evil dot com").

So, what is going on?

Accumulated evidence over the last few years shows that, empirically, methods based on deep learning and massive amounts of labelled data ("thousands of hours of labeled audio"), such as the Mozilla DeepSpeech architecture in the example above, seem to outperform earlier methods when compared on the test set. In fact, the performance on the test set can be quite extraordinary, e.g. word error rates (WERs) as low as 6.5% (which comes close to human performance, around 5.8% on their training dataset). These algorithms often have hundreds of millions of parameters (Mozilla DeepSpeech has 120 million), and model training can take hundreds of hours on massive amounts of hardware (usually GPUs, dedicated arrays of co-processors). So, these algorithms are clearly exquisitely tuned to the STT task for the particular distribution of the given dataset.
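For concreteness, WER is nothing exotic: it is just the word-level edit distance (substitutions, insertions, deletions) between the system's transcript and a reference, divided by the reference length. A minimal sketch, using the transcription from the example above:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of seven reference words.
print(wer("without the dataset the article is useless",
          "without a dataset the article is useless"))  # ~0.143
```

Note that a 6.5% WER means roughly one word in fifteen is still wrong, which matters for some of the applications discussed below.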

The crucial weakness here — what the adversarial attack exploits — is their manifest success, i.e. the very low WER on the given dataset distribution. But because they are so effective at this one task, they have what might best be described as huge "blind spots". Adversarial attacks work by learning how to change the input in tiny steps so as to force the algorithm into any desired classification output. This turns out to be surprisingly easy, and has been demonstrated to work for just about every kind of deep learning classifier.
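To illustrate the mechanism, here is a deliberately tiny sketch: a fixed logistic classifier stands in for the trained network (all the numbers are invented for the toy). Real attacks on STT systems do the same thing, backpropagating through the full model to get the gradient of the desired output with respect to the input waveform:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained model: a fixed logistic classifier
# p(target | x) = sigmoid(w.x + b).
dim = 100
w = rng.normal(size=dim)
x0 = rng.normal(size=dim)   # the "clean" input
b = -5.0 - w @ x0           # chosen so the model firmly rejects x0

def p_target(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Iterative signed-gradient ascent on the target-class probability,
# clipped so no coordinate moves more than eps from the original:
# the "tiny steps" budget that keeps the change imperceptible.
eps = 0.1
x = x0.copy()
for _ in range(100):
    grad = (1.0 - p_target(x)) * w                      # d/dx log p_target(x)
    x = np.clip(x + 0.01 * np.sign(grad), x0 - eps, x0 + eps)

print(round(p_target(x0), 3))  # 0.007: target class firmly rejected
print(p_target(x) > 0.5)       # True: barely-changed input, now accepted
```

In one linear layer this is trivial; the surprising empirical fact is that the same gradient-following procedure works through hundreds of nonlinear layers.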

Current machine learning systems, even sophisticated deep learning methods, are only able to solve the problem they are set up to solve, and that can be a very specific problem. This may seem obvious, but the hyperbole that accompanies any deep learning application (coupled with a clear lack of analytical understanding of how these algorithms actually work) often provokes a lot of what might best be described as "magical thinking" about their extraordinary powers as measured by some single error metric.

So, the basic fact is that if they are set up to map sequences of spectrogram feature vectors to sequences of phoneme labels in such a way as to minimize the WER on that dataset distribution, then that is the only task they can do. It is important not to fall into the magical-thinking trap about modern deep learning-based systems. Clearly, these algorithms have not somehow "learned" language in the way that humans understand it; they have no "intelligence" that we would recognize. Recent work has shown, for example, that convolutional nets (the current best performers in object image recognition) derive most of their performance from the frequency-selective spatial Fourier filtering inherent in the convolution operation, even when the deep net weights are entirely random and fixed during training.
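For readers wondering what "sequences of spectrogram feature vectors" actually are: nothing mysterious, just short-time Fourier analysis of the waveform. A bare-bones sketch (the frame sizes are illustrative, roughly matching a common 16 kHz STT front end; real systems typically add mel filtering and log compression):

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram: slice the waveform into overlapping
    frames, window each frame, and take the FFT magnitude.
    At 16 kHz, 400/160 samples = 25 ms frames every 10 ms."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
feats = spectrogram(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 201): 98 frames, 201 frequency bins each
```

The deep net then sees only this sequence of vectors; everything it "knows" about speech is whatever statistical regularities of these vectors co-occur with the phoneme labels in the training set.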

What does this mean for STT applications in general?

For many commercial STT and associated user-centric applications this is mostly a curiosity. If I can order pizza through Siri and nearly always get it right in one take, I don't really see the problem here, even if the system is obviously highly brittle (and probably massively overparameterized; nobody claimed these algorithms are computationally efficient). At least the headline WER is low, and that makes a real difference in practice; for those with disabilities who rely on ASR (automatic speech recognition), this could be a boon.

Nonetheless, I think this brittleness does have consequences. There will be critical uses for which this technology simply can't work. Specialised dictionaries may exist (e.g. clinical terminology) for which it may be almost impossible to obtain sufficient training data to make it useful. Poorly represented minority accents may cause it to fail. Stroke survivors and those with voice or speech impairments may be unable to use it. And there are attacks such as the one above, in which a device is hacked remotely. Even at human-level WER, because these systems are not intelligent, accountability is a huge problem: there is really no way to check that the system is working in deployment without copious new hand-labelled speech databases. I would not recommend them for scientific or legal annotation applications, for example.

I would also paraphrase a (somewhat provocative) 2006 paper by David Hand: progress in classifier technology may be somewhat illusory in practice. One needs to do proper risk assessments with these very brittle software algorithms, and their limitations and weaknesses need to be thoroughly understood. This will take time.



  1. Ethan said,

    January 30, 2018 @ 5:08 pm

    The parallel to attacks on image recognition is clear, and having read the earlier image attack papers I am unsurprised that an audio analog is possible. But my intuition is that the two attack domains are different in interesting ways.

    The image attack attempts to trigger incorrect retrieval of a member of the original training set. I.e. "toaster" is available as a target because the algorithm already knows about the category "toaster"; you couldn't trick it into false identification of an image as "platypus" if it had never previously been shown platypus exemplars. By contrast, the STT attack attempts to elicit an output that was never in the training set, at least at the level of a complete phrase or sentence. Maybe that difference goes away if each of the attacks given as examples is actually considered as a sequence of attacks on successive subsets along the time axis of the audio, each containing a single word, phoneme, or some other primitive?

    The other thing that strikes me is that two classes of image identification attacks were shown in the work linked to earlier. One class of image attacks was to introduce perturbations in the frequency domain, tweaking most/all the pixels just a little. That seems directly analogous to maliciously tweaking the audio waveform everywhere. But the "everything's a toaster" attack was something else. Here the malicious perturbation of the original consists of replacing a compact region of the original spatial domain. The small ur-toaster image is used as a generic "adversarial patch" that can be applied to any larger image. Can there be an analogous STT attack consisting of, say, total replacement of a fraction of a second of the original audio, leading to misinterpretation of the entire segment?

  2. D.O. said,

    January 30, 2018 @ 10:34 pm

    Can someone help me with numbers here? Let's say the reasonable unit of speech is a phoneme with its environment. Suppose it's around 100 msec. I am not sure how many thousands of hours the training set has; let's take one thousand. That makes it 36 million units of speech. And 120 million parameters. No wonder it's grossly overfitted.
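    For what it's worth, the back-of-the-envelope arithmetic above (with the assumed one-thousand-hour training set and 100 msec units) checks out:

    ```python
    hours = 1000                       # assumed training-set size
    seconds = hours * 3600
    unit = 0.1                         # one phoneme-in-context ~ 100 msec
    units_of_speech = seconds / unit
    print(int(units_of_speech))        # 36000000, i.e. 36 million units

    params = 120_000_000               # DeepSpeech parameter count
    print(params / units_of_speech)    # ~3.3 parameters per unit of speech
    ```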

  3. tangent said,

    January 31, 2018 @ 12:32 am

    A model can be quite good at generalizing to out-of-training examples in general, but still fall to this type of adversarial input crafting. A given quantity of overfit can be more or less exploitable, which is interesting.

    I'm not in the field but this smells to me like a solvable technical problem, solvable in the practical sense that creating a crypto hash is. Let me go see the literature…

  4. tangent said,

    January 31, 2018 @ 12:52 am

    Question for practitioners — whenever we see a model with 120M parameters, I assume somebody tried 60M and it didn't train as well? Is that a correct assumption about general practice? How much effort do people put into that tightening?

    If this model's heavily overfit, that's just a blooper, but things are more interesting if it's fit okay in sum, but has these hugely overfit areas…

  5. Max said,

    January 31, 2018 @ 7:00 am

    It's worth noting that in the strict technical sense there is no overfitting going on here, since the impressive performance is estimated out-of-sample. There are a vast number of parameters, this is true, but the number of parameters is not necessarily the correct measure of complexity here (consider e.g. nonparametric classification methods such as k-nearest neighbours, which effectively use all the training data as parameters).

    The problem as I see it here is a different one: the whole setup is really much more specialized than perhaps was intended or indeed expected by users of the system. If we want a simple way to adapt this setup to being less specialized and therefore more widely applicable, we might have to compromise on the raw performance figures. This seems rather unsatisfactory to many.

    From a less technical standpoint, many users might expect the system to behave in a more human-like (and some would say, more useful) fashion, but that requires the system to have some measure of "intelligence" which is wishful thinking unfortunately.

  6. D.O. said,

    January 31, 2018 @ 11:38 am

    All right. So no overfitting then. What about trying to add noise of the type used to trick these systems and using this modified natural speech as training set as well? That should probably fix the problem of ignoring irrelevant features of speech that never or rarely happen in production anyway.

  7. Max said,

    January 31, 2018 @ 1:30 pm

    This strategy – adding adversarial examples to the training set – has been tried with image recognition, and the outcome is that it does reduce susceptibility to these examples but doesn't "protect" entirely against them. However, there are lots of different ways of generating new, unseen adversarial examples, and protection against one method can be bypassed using another. It would be a game of endless whack-a-mole.
