{"id":36447,"date":"2018-01-30T08:56:53","date_gmt":"2018-01-30T13:56:53","guid":{"rendered":"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=36447"},"modified":"2018-01-30T10:13:10","modified_gmt":"2018-01-30T15:13:10","slug":"adversarial-attacks-on-modern-speech-to-text","status":"publish","type":"post","link":"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=36447","title":{"rendered":"Adversarial attacks on modern speech-to-text"},"content":{"rendered":"<p><img decoding=\"async\" src=\"http:\/\/nicholas.carlini.com\/code\/audio_adversarial_examples\/fig.svg\" alt=\"Generating adversarial STT examples.\" width=\"300\" align=\"right\" \/><\/p>\n<p>In a post on this blog recently Mark Liberman <a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=33608\" target=\"_blank\" rel=\"noopener\">raised the lively area of so-called \"adversarial\" attacks for modern machine learning systems<\/a>. These attacks can do amusing and somewhat frightening things such as force an object recognition algorithm to <a href=\"https:\/\/youtu.be\/i1sp4X57TL4\" target=\"_blank\" rel=\"noopener\">identify all images as toasters with remarkably high confidence<\/a>. Seeing these applied to image recognition, he hypothesized they could also be applied to modern speech recognition (STT, or speech-to-text) based on e.g. deep learning. <a href=\"http:\/\/nicholas.carlini.com\/code\/audio_adversarial_examples\/\" target=\"_blank\" rel=\"noopener\">His hypothesis has indeed been recently confirmed.<\/a><\/p>\n<p><!--more--><\/p>\n<p>The waveforms have been modified such that the perceptual, acoustic differences are tiny, but they can be modified to force the STT to come up with <em>any<\/em> desired transcription which completely contradicts the words that any normal human listener would transcribe. 
For example, here is an original recording:</p>
<p><audio src="http://nicholas.carlini.com/code/audio_adversarial_examples/normal0.wav" controls="controls"></audio></p>
<p>("without the dataset the article is useless"), and here is the adversarial one:</p>
<p><audio controls="controls" src="http://nicholas.carlini.com/code/audio_adversarial_examples/adversarial0.wav"></audio></p>
<p>(transcribed as "okay google browse to evil dot com").</p>
<p><strong>So, what is going on?</strong></p>
<p>Evidence accumulated over the last few years shows that, empirically, methods based on deep learning and massive amounts of labelled data (<a href="https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/" target="_blank" rel="noopener">"thousands of hours of labeled audio"</a>), such as the Mozilla DeepSpeech architecture in the example above, outperform earlier methods when compared on a test set. In fact, test-set performance can be quite extraordinary: word error rates (WERs) as low as 6.5%, close to human performance of around 5.8% <a href="https://arxiv.org/abs/1512.02595" target="_blank" rel="noopener">on their dataset</a>. These algorithms often have hundreds of millions of parameters (Mozilla DeepSpeech has 120 million), and training a model can take hundreds of hours on massive amounts of hardware (usually GPUs, with their dedicated arrays of co-processors). So these algorithms are clearly exquisitely tuned to the STT task for the particular distribution of the given dataset.</p>
<p>The crucial weakness here, the one the adversarial attack exploits, is precisely that manifest success: the very low WER holds only on the given dataset distribution. Because they are so effective at that narrow task, they have what might best be described as huge "blind spots". 
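</p>
<p>(An aside on the headline metric: WER is simply the word-level edit distance between the reference transcript and the system's output, normalized by the reference length. A minimal sketch, with an illustrative pair of transcripts:)</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution
                           dp[i - 1][j] + 1,                               # deletion
                           dp[i][j - 1] + 1)                               # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution plus one insertion against a 7-word reference: WER = 2/7.
print(wer("without the dataset the article is useless",
          "without the data set the article is useless"))
```

<p>Note that a system can score a low average WER on one distribution of speech while failing badly on another; the metric says nothing about inputs it was never tested on. 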
Adversarial attacks work by learning how to change the input in tiny steps so as to force the algorithm into any desired classification output. This turns out to be surprisingly easy, and it has been demonstrated to work on just about every kind of deep learning classifier.</p>
<p>Current machine learning systems, even sophisticated deep learning methods, can only solve the problem they are set up to solve, and that can be a <i>very specific</i> problem. This may seem obvious, but the hyperbole that accompanies any deep learning application (coupled with a clear lack of analytical understanding of how these algorithms actually work) often provokes a lot of what might best be described as "magical thinking" about their extraordinary powers, as measured by some single error metric.</p>
<p>So the basic fact is this: if they are set up to map sequences of spectrogram feature vectors to sequences of phoneme labels so as to minimize the WER on that dataset distribution, then that is the only task they can do. It is important not to fall into the magical-thinking trap about modern deep learning-based systems. Clearly, these algorithms have not somehow "learned" language in the way that humans understand it. <a href="https://arxiv.org/abs/1709.06126" target="_blank" rel="noopener">They have no "intelligence" that we would recognize</a>. 
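</p>
<p>The attack recipe described above (change the input in tiny steps until the desired output appears) can be sketched on a toy model. This is not the Carlini audio attack itself, just the core idea applied to a hypothetical, made-up linear classifier, chosen because its input gradient is available in closed form; real attacks on deep STT networks obtain the gradient by backpropagation instead:</p>

```python
import numpy as np

# Hypothetical toy "classifier": the label is sign(w @ x).
w = np.array([0.5, -1.0, 2.0, 0.25])

def predict(x):
    return 1 if w @ x > 0 else -1

x = np.array([1.0, 0.0, 0.5, 0.0])   # clean input; w @ x = 1.5, so class +1

# The attack: take tiny steps against the gradient of the score
# until the decision flips. For a linear model the gradient with
# respect to the input is just w, so each step is -step * sign(w).
adv = x.copy()
step = 0.1
while predict(adv) == 1:
    adv = adv - step * np.sign(w)

print(predict(adv))              # the decision has flipped to -1
print(np.max(np.abs(adv - x)))   # yet no coordinate moved by more than 0.5
```

<p>The perturbation is small in every coordinate, but because each step is aimed precisely at the decision boundary, the cumulative effect on the output is total. The audio attack does the same thing to a waveform, keeping the perturbation below perceptual salience while steering the transcription. 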
Recent work has shown, for example, that convolutional nets (the current best performers in object image recognition) derive most of their performance from the <a href="http://ai.stanford.edu/~ang/papers/nipsdlufl10-RandomWeights.pdf" target="_blank" rel="noopener">frequency-selective spatial Fourier filtering inherent in the convolution operation</a>, even when the deep net weights are entirely random and held fixed during training.</p>
<p><strong>What does this mean for STT applications in general?</strong></p>
<p>For many commercial STT systems and associated user-centric applications, this is mostly a curiosity. If I can order pizza through Siri and nearly always get it right in one take, I don't really see the problem, even if the system is obviously highly brittle (<a href="https://openreview.net/forum?id=Hy-w-2PSf" target="_blank" rel="noopener">and probably massively overparameterized</a>; nobody claimed these algorithms are computationally efficient). At least the headline WER is low, and that makes a real difference in practice; for those with disabilities who rely on ASR, this could be a boon.</p>
<p>Nonetheless, I think this brittleness does have consequences. There will be critical uses for which this technology simply can't work. Specialised dictionaries may exist (e.g. clinical terminology) for which it may be almost impossible to obtain sufficient training data to make it useful. Poorly represented minority accents may cause it to fail. Stroke survivors and those with voice or speech impairments may be unable to use it. And there are attacks such as the one above, in which a device is hacked remotely. Even at human-level WER, because these systems are not intelligent, accountability is a huge problem: there is really no way to check that the system is working in deployment without copious, newly hand-labelled speech databases. 
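</p>
<p>(Returning briefly to the random-weights observation above: the claim that convolution is inherently a frequency-selective Fourier filter is just the convolution theorem, which holds whether or not the kernel was ever trained. A toy numpy check of that identity, not the cited experiment:)</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.normal(size=N)   # an arbitrary "signal"
k = rng.normal(size=N)   # a random, untrained convolution kernel

# Circular convolution computed directly in the signal domain...
direct = np.array([sum(x[(n - m) % N] * k[m] for m in range(N))
                   for n in range(N)])

# ...equals pointwise multiplication of the spectra: every kernel,
# random or trained, acts by rescaling the signal's frequencies.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

print(np.allclose(direct, via_fft))
```

<p>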
I would not recommend them for scientific or legal annotation applications, for example.</p>
<p>I would also point to a (somewhat provocative) 2006 paper by David Hand arguing that <a href="https://projecteuclid.org/euclid.ss/1149600839" target="_blank" rel="noopener">progress in classifier technology may be somewhat illusory in practice</a>. One needs to do proper risk assessments with these very brittle algorithms, and their limitations and weaknesses need to be thoroughly understood. This will take time.</p>