
Practical Attacks against Deep Learning Systems using Adversarial Examples - Houshalter
http://arxiv.org/abs/1602.02697
======
argonaut
Stepping back for a moment, while it's obvious how this might be an issue for
image algorithm APIs, it's unclear to me whether this is an actual impediment
to AI in the real world.

What I mean is: I have a hunch that if you were to gain control of the precise
activation of each retinal photoreceptor in my eyes, you could send me into
epileptic shock or induce all sorts of terrible physiological conditions. You
could create retinal activation maps that, when displayed on a screen,
appear like noise or normal objects, but when applied directly to
photoreceptors would be interpreted by the brain to be something completely
different. So I'm not sure defending against this is necessary for real-world
AI, although improvements in this area will probably carry over to
improvements in general performance / the theory of how deep learning works.

~~~
Jarwain
For some reason, your comment reminded me of Snow Crash

~~~
lmm
If you like that kind of thing I recommend Eclipse Phase (which calls this
idea a "Basilisk attack"). In the same general area, _Blindsight_ by Peter
Watts involves human perceptual flaws as a plot element.

~~~
Houshalter
All of these stories were inspired by a short story named BLIT, which is
remarkably similar to this:
[http://www.infinityplus.co.uk/stories/blit.htm](http://www.infinityplus.co.uk/stories/blit.htm)

~~~
YeGoblynQueenne
Yay, good reference, I remember the story so well I knew the one you meant
before I followed the link (I didn't remember the title).

However, there have been similar ideas in earlier works, as acknowledged by the
author himself. And if you asked the authors of those earlier works I'm sure
they'd say they first thought of it when reading someone else's work - it's
how it goes.

For what it's worth, the idea is similar to the deadliest joke sketch from
Monty Python, which predates most (but not all) of the works cited by Langford
as influences:

_Langford's later short story comp.basilisk FAQ, first published in Nature in
December 1999, mentions William Gibson's Neuromancer (1984), Fred Hoyle's The
Black Cloud (1957), J.B. Priestley's The Shapes of Sleep (1962), and Piers
Anthony's Macroscope (1969) as containing a similar idea. Examples not
mentioned include the short story White Cane 7.25 (1985) by Czech writer
Ondřej Neff, A. E. van Vogt's War Against the Rull (1959), and John Barnes'
Kaleidoscope Century (1996)._

From:
[https://en.wikipedia.org/wiki/BLIT_%28short_story%29](https://en.wikipedia.org/wiki/BLIT_%28short_story%29)

------
taneq
Psychological manipulation (or maybe, being more charitable, 'magic tricks')
for AIs. Interesting.

I wonder if, by using an Amazon Mechanical Turk-style classifier (i.e. human
raters) as the oracle instead of a deep learning system, you could use this
approach to farm optical illusions or similar human 'glitches'?

------
danieltillett
This is not my area, but couldn’t this problem be solved by adding random
noise to the input data such that the adversary can’t control what is passed
into the system? You would trade off accuracy for robustness.
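
A rough sketch of what that might look like (purely illustrative; the
`predict_proba` interface below is a stand-in rather than anything from the
paper): average the model's prediction over several randomly perturbed copies
of each query, so the attacker no longer controls the exact values the network
sees.

    # Sketch only: randomize queries before they reach the classifier.
    # `model.predict_proba(x)` is a hypothetical interface returning class probabilities.
    import numpy as np

    def noisy_predict(model, x, sigma=0.1, n_samples=20):
        # Average predictions over Gaussian-perturbed copies of the flattened input x.
        xs = x[None, :] + sigma * np.random.randn(n_samples, x.size)
        probs = np.stack([model.predict_proba(xi) for xi in xs])
        return probs.mean(axis=0).argmax()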

~~~
russdill
I feel like the network needs random data. Perhaps random recombinations of a
larger training set. It would lower the accuracy of the network, but it would
be more resilient against these sorts of attacks. Sort of like sorting
algorithms that include randomness.

------
dave_sullivan
> Discussing defenses against this attack is outside the scope of this paper.
> Previous work suggested the use of adversarial sample training [15],
> Jacobian-based regularization [16], and distillation [31] as means to make
> DNNs robust to adversarial samples.

It seems like it would be very simple (given you already generated the
examples) to take half the generated adversarial examples and train your
original model further with those examples to check the effect (adversarial
sample training). This has been shown to improve model performance while also
nullifying this "attack" in other (practically similar) settings. At this
point, practitioners should be doing that anyway to maximize accuracy.
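
A minimal sketch of that retraining loop, assuming a differentiable PyTorch
model; the fast-gradient-sign step here is only a stand-in for whatever method
actually produced the adversarial examples:

    # Sketch: fold adversarial examples, with their original correct labels,
    # back into training (adversarial sample training).
    import torch

    def fgsm_examples(net, criterion, x, y, eps=0.1):
        # Stand-in crafting method: one signed-gradient step on the input.
        x_adv = x.clone().detach().requires_grad_(True)
        criterion(net(x_adv), y).backward()
        return (x_adv + eps * x_adv.grad.sign()).detach()

    def adversarial_training_step(net, criterion, optimizer, x, y):
        x_adv = fgsm_examples(net, criterion, x, y)
        optimizer.zero_grad()  # clear gradients accumulated while crafting
        loss = criterion(net(x), y) + criterion(net(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()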

Attack is a strong word people like to latch onto; the paper could also be
titled "Generating new training examples with transfer learning". That said,
there is a very real contribution here as to new ways to generate adversarial
examples (which should then be used as further training examples).

------
nl
It's worth noting that this would usually be trivial to protect against in
practice by checking any attack vector against multiple different neural
networks (ideally trained with different(ly permuted) data, and different
architectures).
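
Very roughly, something like this (the `predict` interface on each model is an
assumption for illustration):

    # Sketch: only trust a prediction when independently trained models agree.
    from collections import Counter

    def ensemble_predict(models, x, min_agreement=0.8):
        votes = Counter(m.predict(x) for m in models)
        label, count = votes.most_common(1)[0]
        if count / len(models) < min_agreement:
            return None  # disagreement: flag the input as suspicious
        return label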

Having said that, the MNIST examples might fool multiple networks. But one
should note that the "3" example would probably fool a lot of humans on a
single casual glance.

Also, this paper gives an excellent way to build a feedback loop for
adversarial training (maybe some kind of semi-supervised autoencoder thing?)

Edit: I noticed that Ian Goodfellow is a co-author of [1] so he's probably
thinking about the adversarial training idea.

[1] [http://gitxiv.com/posts/eFw3ArCyvhaFJ6bzb/adversarial-autoencoders](http://gitxiv.com/posts/eFw3ArCyvhaFJ6bzb/adversarial-autoencoders)

------
nabla9
As they say in the paper: "it is not yet known whether adversarial example
generation requires greater control over the image than can be obtained by
physically altering the sign". I would be more worried about people wearing
T-shirts with traffic sign pictures on them.

Image recognition might not be the best use of this kind of attack. How about
financial data, profiling data etc? It's often possible to alter the less
significant digits at will and there is no extra noise.

------
p4wnc6
I've only read the paper once just now, but if I understand correctly, this
idea is essentially exploiting overfitting by the oracle.

Think of it this way: if a model is not overfitting in the region near a
certain example input, then it means there is good generalization in that
region. Perturbations around that example will remain correctly classified. If
a model is overfitting near an input (think of a classic case of overfitted
high-degree polynomial regression) then a tiny perturbation in that region can
lead to an unstable change in the output.

Another way to think about it is to ask: why wouldn't this sort of thing work
against a simpler algorithm like linear regression? I don't know that I am
correct in my hunch that it wouldn't. But my feeling is that it wouldn't
precisely because of the linear link function. If you perturb the inputs, even
in a high-dimensional input space, you're only going to get at most a linear
effect in the size of the perturbation, so the output on the true example and
the adversarially perturbed example would, by construction, always be pretty
similar. In other words, the adversarial example could either be plausible-
looking, but not produce a great error in the oracle's output, or else it
would have to be ridiculous-looking to produce that error, since the simple
linear model does not present any regions of overfitting to exploit. (Of
course, the linear model may not be good enough ... I'm not saying linear
models are superior ... just trying to think about how overfitting is related
to this.)

The algorithm works by training a companion model to the oracle model (the
model you want to attack), such that a certain gradient structure is likely to
be correlated between the two models. That gradient structure in the companion
model is then used to make perturbations around plausible inputs to the
oracle, and to evaluate the chances that the perturbation will lead to one of
the unstable changes in oracle outputs (e.g. how likely it is that the oracle
was overfitted in that region).
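
In code, the transfer step might look roughly like this (a plain
fast-gradient-sign perturbation on the substitute, which may not be the
paper's exact crafting procedure; `oracle_query` is hypothetical):

    # Sketch: craft a perturbation with the white-box substitute's gradients,
    # then test whether it transfers to the black-box oracle.
    import torch
    import torch.nn.functional as F

    def craft_with_substitute(substitute, x, y, eps=0.1):
        x_adv = x.clone().detach().requires_grad_(True)
        F.cross_entropy(substitute(x_adv), y).backward()
        # Step in the direction that increases the substitute's loss;
        # the clamp assumes pixel values in [0, 1].
        return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

    # The attack succeeds if the oracle now mislabels the crafted input:
    # oracle_query(craft_with_substitute(substitute, x, y)) != y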

The whole paper seems to have a lot of connections to the idea of boosting,
but almost in reverse. In boosting, you want to clamp exemplars that you
demonstrably get right, and then re-train using the examples you get wrong
(with weights shifting the importance more and more onto what you get wrong).

It suggests to me that this technique could also be used as the basis for
'boosting' the original oracle classifier. You can produce the adversarial
examples that are likely to be misclassified, and then go back and augment the
oracle's original training set with these adversarial examples _with their
correct labels_.

In a sense, you would be putting synthetic examples into the training set at
the places where you most anticipate the model would otherwise overfit.

Then you could do rounds of this, iteratively, and probably with various
weighting schemes, and all sorts of other permutations that could pump a few
hundred people through their PhDs.

I wonder if the output of re-training with these "high leverage" synthetic /
adversarial points would then lead to an overall classifier less susceptible
to this type of attack? Maybe if you do enough iterations of this attack-then-
retrain process, there cease to be the types of overfitted regions exploited
by the attack. And how much would it degrade accuracy in these regions?

It also suggests adding an "attackability" metric to the model's overall cost
function, and I wonder how much that is just going to look exactly like a
regularization term...

~~~
Houshalter
The issue is not overfitting. In fact, linear regression is extremely
exploitable with this method. See
[http://karpathy.github.io/2015/03/30/breaking-convnets/](http://karpathy.github.io/2015/03/30/breaking-convnets/)

~~~
p4wnc6
That is a very interesting post. My intuition is that it is precisely
_because_ there is generalization across models that it is an effect of
overfitting. This is a different kind of generalization than the never-before-
seen-data generalization that, when high, means no overfitting. If the
underlying structural correlations in models result in similar fragile
decision boundaries when trained on the same data, that does suggest
overfitting to me.

The example with softmax regression is extremely interesting and I have to
think about it more, but I don't think it quite represents what I was saying
about plain OLS in my comment. With even basic logistic regression, you get
the logistic function sort of "smearing" perturbations from the decision
boundary, and so an effect that only occurs in a single direction may even be
large enough to cause the logistic function to "amplify" it. Reasoning about
that same possibility in a super high-dimensional softmax scenario seems very
hard and it's at least not obvious to me that the softmax functions wouldn't
similarly "amplify" small differences, leading to places where the decision
boundary could be fragile in this overfitting sense.

~~~
Houshalter
The issue is that NNs are continuous by design. The more linear and continuous
they are, the easier they are to train and the better they generalize.

But what that means is, changing each input by a small amount also changes the
output by a small amount. And images have lots of pixels. Changing each of
them a tiny amount in the right direction adds up to a big change to the
output.
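
A toy numeric version of that argument, assuming a purely linear score w . x
over a 100,000-pixel image:

    import numpy as np

    d = 100_000                    # number of pixels
    rng = np.random.default_rng(0)
    w = rng.normal(size=d)         # model weights
    x = rng.normal(size=d)         # a flattened input image
    eps = 0.01                     # imperceptible per-pixel nudge
    x_adv = x + eps * np.sign(w)   # push every pixel "in the right direction"

    print(w @ x_adv - w @ x)       # eps * sum(|w|): roughly 800 here

Each pixel moves by only 0.01, but the score moves by hundreds.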

~~~
p4wnc6
I see what you mean now. This plus the comment from user `argonaut` helped me
realize what I was missing. Not all of generalization error is because of
overfitting, and even if none of it is because of overfitting, the idea you
describe would _still_ allow for the attack.

------
eutectic
If the problem is that the class probabilities change too rapidly wrt the
input data, then maybe a solution would be to add a term penalizing large
gradients to the cost used for backpropagation.
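
That penalty might look something like this sketch (an input-gradient
regularizer in the spirit of "double backpropagation"; the exact form and
weighting are guesses, not anything from the paper):

    # Sketch: penalize the norm of the loss gradient w.r.t. the input.
    import torch

    def penalized_loss(net, criterion, x, y, lam=1.0):
        x = x.clone().detach().requires_grad_(True)
        loss = criterion(net(x), y)
        # create_graph=True so the penalty term itself can be backpropagated
        grad_x, = torch.autograd.grad(loss, x, create_graph=True)
        return loss + lam * grad_x.pow(2).sum()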

------
Houshalter
What? I posted this a day ago.

------
samstave
Can someone please comment on the future of AIs attacking other AIs?

That is going to be an amazing Turing point day...

(I'm also amazed that my auto correct changed my turning to Turing)

~~~
PascalsMugger
Maybe AIs attacking AIs _is_ a Turing point, and your autocorrect was trying
to pass the Turing test in the only way it knew how.

