DolphinAttack


Guoming Zhang et al., "DolphinAttack: Inaudible Voice Commands", arXiv 8/31/2017:

In this work, we design a completely inaudible attack, DolphinAttack, that modulates voice commands on ultrasonic carriers (e.g., f > 20 kHz) to achieve inaudibility. By leveraging the nonlinearity of the microphone circuits, the modulated low-frequency audio commands can be successfully demodulated, recovered, and more importantly interpreted by the speech recognition systems. We validate DolphinAttack on popular speech recognition systems, including Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana and Alexa.
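The mechanism in that abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' code: a single hypothetical 1 kHz tone stands in for a voice command, it is amplitude-modulated onto a hypothetical 25 kHz carrier, and the microphone is modeled as a linear response plus a small square-law term (`A2` is an assumed value). The transmitted signal has energy only at 24, 25 and 26 kHz, all above 20 kHz, but squaring it recreates a component at the original 1 kHz.

```python
import cmath
import math

FS = 192_000        # sample rate in Hz, high enough for a 25 kHz carrier and its 50 kHz harmonic
N = 1_920           # 10 ms of samples; every frequency below is a whole number of 100 Hz DFT bins
F_BASE = 1_000      # hypothetical 1 kHz tone standing in for a voice command
F_CARRIER = 25_000  # hypothetical ultrasonic carrier, above the ~20 kHz limit of human hearing
DEPTH = 0.5         # modulation depth m
A2 = 0.1            # assumed strength of the square-law term in the toy microphone model


def am_modulate(n: int) -> float:
    """Amplitude-modulated sample: (1 + m cos(2 pi f_b t)) cos(2 pi f_c t)."""
    t = n / FS
    return (1 + DEPTH * math.cos(2 * math.pi * F_BASE * t)) * math.cos(2 * math.pi * F_CARRIER * t)


def microphone(x: float) -> float:
    """Toy nonlinear microphone: linear response plus a small square-law term."""
    return x + A2 * x * x


def tone_amplitude(samples: list[float], freq: float) -> float:
    """Amplitude of the cosine component at `freq`, computed from a single DFT bin."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq * n / FS) for n, x in enumerate(samples))
    return 2 * abs(acc) / len(samples)


transmitted = [am_modulate(n) for n in range(N)]
recorded = [microphone(x) for x in transmitted]

# The transmitted signal has no energy at 1 kHz...
print(f"1 kHz in transmitted signal: {tone_amplitude(transmitted, F_BASE):.6f}")
# ...but the square-law term recreates the baseband tone, with amplitude A2 * DEPTH.
print(f"1 kHz after nonlinearity:    {tone_amplitude(recorded, F_BASE):.6f}")
```

Running it should print an amplitude of essentially zero for the transmitted signal and 0.050000 (that is, `A2 * DEPTH`) for the "recorded" one. The sample rate and duration were chosen so that every component lands exactly on a DFT bin, which keeps the single-bin measurement free of spectral leakage.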

This suggests a more insidious version of the "Two tons of creamed corn" ploy.

Rather than the creamed-corn scenario, Zhang et al. suggest (and test) the following possible "sneaky attacks":

(1) Visiting a malicious website. The device can open a malicious website, which can launch a drive-by-download attack or exploit a device with 0-day vulnerabilities.
(2) Spying. An adversary can make the victim device initiate outgoing video/phone calls, therefore getting access to the image/sound of device surroundings.
(3) Injecting fake information. An adversary may instruct the victim device to send fake text messages and emails, to publish fake online posts, to add fake events to a calendar, etc.
(4) Denial of service. An adversary may inject commands to turn on the airplane mode, disconnecting all wireless communications.
(5) Concealing attacks. The screen display and voice feedback may expose the attacks. The adversary may decrease the odds by dimming the screen and lowering the volume.

Across Zhang et al.'s experiments, the maximum distance at which the various attacks were effective ranged from 2 cm to 175 cm, and it's not clear that versions of this technique can be made to work at greater distances under inverse-square attenuation, given the presumably low efficiency of the nonlinear microphone effects they rely on to produce signals in the frequency range appropriate for speech. But still…
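Some back-of-the-envelope arithmetic shows why range is the sticking point. The sketch below assumes a point source in free field (so sound pressure falls as 1/r, a loss of 20·log10 of the distance ratio in dB) and, as a further assumption, a square-law microphone in which the recovered command scales with the square of the incident pressure. It ignores air absorption and beam directivity, and the longer distances are hypothetical.

```python
import math

REFERENCE_CM = 175.0  # longest effective distance reported in the post (2 to 175 cm)


def incident_loss_db(d_ref_cm: float, d_cm: float) -> float:
    """Free-field inverse-square loss: pressure falls as 1/r, i.e. 20*log10 of the distance ratio."""
    return 20 * math.log10(d_cm / d_ref_cm)


def recovered_loss_db(d_ref_cm: float, d_cm: float) -> float:
    """Loss of the demodulated command under an assumed square-law microphone model.

    If the recovered signal scales with the square of the incident pressure,
    its level in dB falls twice as fast as the incident ultrasound does.
    """
    return 2 * incident_loss_db(d_ref_cm, d_cm)


for distance_cm in (350.0, 700.0, 1000.0):  # hypothetical longer attack ranges
    print(
        f"{distance_cm / 100:.1f} m: ultrasound down {incident_loss_db(REFERENCE_CM, distance_cm):.1f} dB, "
        f"recovered command down {recovered_loss_db(REFERENCE_CM, distance_cm):.1f} dB"
    )
```

Under these assumptions, doubling the range from 175 cm to 3.5 m costs about 6 dB of incident ultrasound but about 12 dB of recovered command, and going to 10 m costs about 15 dB and 30 dB respectively.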



11 Comments

  1. MattF said,

    September 11, 2017 @ 3:57 pm

    Also, I'd guess that ultrasonic signals would be very directional, so it may be difficult to point the source precisely enough to hit the desired target, particularly if the source is a significant distance from the target.

  2. Rubrick said,

    September 11, 2017 @ 6:16 pm

    I'm sorry, but this sort of dog-whistle politics has no place on Language Log.

  3. Chris C. said,

    September 11, 2017 @ 9:21 pm

    This post didn't strike me as political in the least. Is there a subtext I somehow missed?

  4. Emily said,

    September 11, 2017 @ 9:26 pm

    @Chris C: It's a reference to this term: https://en.wikipedia.org/wiki/Dog-whistle_politics

    Funnily enough, just yesterday I was reading about a similar concept from an urban legend: https://en.wikipedia.org/wiki/BadBIOS
    Ah, synchronicity…

  5. Idran said,

    September 11, 2017 @ 9:54 pm

    @Chris C: I think that was a dog whistle/high pitched noise joke. :P

    "In all of Zhang et al.'s experiments, 'the speaker is placed 10 cm from the target device'"

    That was only for measuring the impact of sound pressure levels on recognition rates, as far as I can tell. Table 3 indicates that, depending on the specific scenario and device involved in an attack simulation, the maximum distance at which an attack was effective varied anywhere from 2 cm to 175 cm.

    [(myl) Thanks for that — I've updated the effective-distance note in the post.]

  6. Chris C. said,

    September 11, 2017 @ 10:54 pm

    So a joke I missed then. :P

  7. tangent said,

    September 12, 2017 @ 12:42 am

    So is their modulated ultrasound in fact inaudible? I didn't see that they checked.

    They don't refer to previous "sound from ultrasound" work whose purpose was to be audible — Pompei and so on.

    125 dB is plenty enough to be audible depending on the modulation.

  8. 번하드 said,

    September 12, 2017 @ 2:31 am

    Hmmm, those mysterious "sonic weapon attacks" against the US embassy in Cuba come to mind.
    Maybe somebody was trying to hijack an iPhone to Havana.

  9. Idran said,

    September 12, 2017 @ 9:34 am

    @tangent: At one point in the paper they specifically mention an issue with a near-ultrasound attack simulation being that it was barely audible – sounding "like crickets" – so I assume they did, yes.

  10. Idran said,

    September 12, 2017 @ 9:38 am

    To clarify the above: that was a single specific run at close to 20 kHz, and they attribute the audibility to frequency leakage below 20 kHz, not to a problem with the system as a whole. It came up while they were testing success rates at various frequencies from 20 kHz up to 24 kHz (section 6.5).

  11. MikeA said,

    September 18, 2017 @ 6:58 pm

    With the proliferation of "bonk to pay" variants, I suspect it would not be difficult to create a device, ostensibly to allow a user to easily purchase, say, popcorn, but actually to precisely position and orient their phone, so as to have them order two tons of creamed corn.
