
Audio Adversarial Examples: Targeted Attacks on Speech-To-Text - pulisse
https://arxiv.org/abs/1801.01944
======
alkonaut
This will be really interesting when people start attacking the visual
networks of autonomous cars. That giant poster at the side of the road that
looks like an ad for toothpaste, but that all autonomous cars seem to slow down
for...

Maybe the happy path for current autonomous cars isn't that they are tested in
Arizona or California with good traction and sunshine, but that they aren't
being attacked by adversarial input. (I do remember that guy who painted a
line on the ground that trapped cars, though.)

What will it mean if we suddenly realize that convolutional neural network
object recognition is too easily fooled to be a secure part of autonomous
vehicles? Would that push the state of the art backwards a long way, or would
it not matter because there are other alternatives?

~~~
spyder
The autonomous cars could check some official database of road signs, report
if something is off, and even put the strange sign into captchas for humans to
check :)

~~~
pavel_lishin
> _even put the strange sign in captchas for humans to check it_

"Please select all pictures that look like a legitimate street sign. Please
hurry."

------
padwan
Audio Adversarial Examples
[http://nicholas.carlini.com/code/audio_adversarial_examples/](http://nicholas.carlini.com/code/audio_adversarial_examples/)

------
nukeop
I remember there was an advertisement on TV where an actor would say "okay
google, what is X?", and the phones nearby would search for that phrase. With
this, any advertisement could be made to activate nearby phones to search for
any phrase or do anything really, without the ad itself containing any google-
related phrases.

~~~
yorwba
Not yet.

 _The audio adversarial examples we construct in this paper do not remain
adversarial after being played over-the-air, and therefore present a limited
real-world threat; however, just as the initial work on image-based
adversarial examples did not consider the physical channel and only later was
it shown to be possible, we believe further work will be able to produce audio
adversarial examples that are effective over-the-air._

~~~
nukeop
Weird, I'd expect that since they've achieved a 100% success rate, they could
get at least 50% in real-life scenarios. Also, this could be played in stores
to get customers' phones to search for certain phrases.

~~~
daveFNbuck
Their 100% success rate requires perfect fidelity. If they're saying it
doesn't work over the air, they probably have close to 0% success on store
speakers.

~~~
Crespyl
Right, it looks very similar to the early adversarial visual examples. At
first those only worked on direct simulated inputs with perfect clarity, then
on printed images, and now on live camera feeds of 3D-printed objects from
arbitrary angles.

------
DannyB2
Can't the technique itself be used to better train the speech recognition
systems?

Take the first example, where the audio sounds like "without the dataset the
article is useless" but the speech recognizer hears "okay google browse to
evil dot com": you could use that pair to train the recognizer to transcribe
it as what humans actually heard.

Of course, many attacks would need to be used to create lots of training data.
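
The idea can be sketched on a toy model. Below, a least-squares linear
classifier stands in for the speech recognizer, and an input nudged across its
decision boundary stands in for the adversarial audio; the attack is then fed
back into the training set with the label humans would give it. All names and
numbers are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(20, 8))          # 20 toy "utterances", 8 features each
y = (X[:, 0] > 0).astype(int)         # 0 = benign phrase, 1 = "evil" phrase

def fit(X, y):
    # least-squares linear classifier standing in for the recognizer
    w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    return w

def predict(w, x):
    return int(x @ w > 0)

w = fit(X, y)

# Attack: nudge a correctly-recognized benign input just past the boundary.
benign = next(x for x, lbl in zip(X, y) if lbl == 0 and predict(w, x) == 0)
step = (0.1 - benign @ w) / (w @ w)
adv = benign + step * w               # "sounds" benign, recognized as evil

# Defense: retrain with the attack labelled as what humans heard (class 0),
# duplicated a few times so it carries weight in the least-squares fit.
X_aug = np.vstack([X, np.tile(adv, (10, 1))])
y_aug = np.concatenate([y, np.zeros(10, dtype=int)])
w2 = fit(X_aug, y_aug)                # now classifies `adv` as benign again
```

The catch, as noted above, is that each retraining round only covers the
attacks you already found; the attacker can run the optimization again against
the retrained model.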

------
brw12
I'm not ready to call BS on this, but I'm deeply skeptical of the general
language they're using. They might have iteratively gotten a particular
recognizer to do what they say, but I don't think they've fooled speech
recognition in general.

I plugged the first 4 examples at
[http://nicholas.carlini.com/code/audio_adversarial_examples/](http://nicholas.carlini.com/code/audio_adversarial_examples/)
into Google Docs' own Voice typing, and got:

1: At the yard course Eustis

2: Set the Artic course Eustis

3: (nothing, it's an operatic wall of sound)

4: (nothing, it's an operatic wall of sound)

~~~
Ar-Curunir
Given that Nick is one of the world's experts on adversarial examples against
neural networks, and given that he already has another paper attacking voice
recognition systems, I'd hold off before calling this work "BS".

------
dzhiurgis
Does it require perfect knowledge of the object?

~~~
yorwba
Not sure what you mean by "object", but this is a white-box attack against
Mozilla's DeepSpeech model that relies on being able to compute gradients for
the complete pipeline. They didn't test whether the adversarial examples
transfer to other models.

~~~
dzhiurgis
Ooops, my brain was flooded with beer when I typed it. Indeed I meant model...

------
pulisse
From the abstract: _Given any audio waveform, we can produce another that is
over 99.9% similar, but transcribes as any phrase we choose (at a rate of up
to 50 characters per second)._

~~~
yorwba
That sentence is quite easy to misunderstand. The 50 characters per second is
the capacity of the attacked model (it chops the audio into frames of 1/50th
of a second, each of which can emit one character), and what they want to
highlight is that they can cram the maximum amount of text into the
adversarial example.

The actual generation takes an hour on an NVIDIA 1080 Ti, but they can
parallelize the process to compute multiple examples on the same GPU, giving
an amortized cost of a few minutes per example.

