Stepping back for a moment, while it's obvious how this might be an issue for image algorithm APIs, it's unclear to me whether this is an actual impediment to AI in the real world.
What I mean is: I have a hunch that if you were to gain control of the precise activation of each retinal photoreceptor in my eyes, you could send me into an epileptic seizure or induce all sorts of terrible physiological conditions. You could create retinal activation maps that, when displayed on a screen, appear like noise or normal objects, but when applied directly to photoreceptors would be interpreted by the brain as something completely different. So I'm not sure defending against this is necessary for real-world AI, although improvements in this area will probably carry over to improvements in general performance / the theory of how deep learning works.
Your leap from DNNs to brains and "AI" is unjustified. Our brain does not work like a DNN (AFAWK), and we don't know whether DNNs are even the beginning of the path that would lead us to "true" AI some day. So I wouldn't try projecting anything from this result to our brains (which, BTW, are already known to be susceptible to illusions, but we don't know whether that mechanism is the same) or to some hypothetical future AI which we know nothing about.
I have made no such connection, and I am generally skeptical of any attempts to replicate the human brain: https://news.ycombinator.com/item?id=10920890. I'm merely pointing out the possibility this isn't a huge problem to developing robust AI, because my educated guess is humans are similarly vulnerable.
Adding random noise doesn't help much, because it's random, with mean zero. This technique relies on changing each pixel in just the right direction to affect the output. Random noise pushes half the pixels back toward the original direction and the other half even further in the wrong direction.
It's unclear whether viewing the image at different scales or from different angles would help. Likely it would. However, NNs are designed and trained to be as invariant to deformation as possible.
A more interesting domain is audio. Audio doesn't have any concept of rotation or scale. It's guaranteed that each bit will be input in the correct order. It's also a recurrent system, which is vulnerable to falling into states of chaotic behavior.
So if humans are vulnerable to these things, it will likely be in the form of audio.
What's new in this paper is that they do not have access to the neural net activations. They need only send it images, and get back the class the nnet predicts.
But argonaut wasn't talking about having access to the neural net activations, but about having access to individual pixels in the image (which corresponds to individual photoreceptor cells in the human retina). Crafting these NN-defeating images seems to depend on tweaking individual pixels.
I'm not sure how far it really does depend on that -- whether, e.g., you can use a more sophisticated version of the same approach to make an image that fools the NN reliably even if it's slightly offset, out of focus, stretched, etc. I wouldn't be astonished if you could. If not, that suggests an obvious defence mechanism: feed the NN not the raw image but several slightly perturbed ones, and take a majority vote or something.
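To make that defence concrete, here is a minimal sketch of the perturb-and-vote idea, assuming a hypothetical `model(batch)` that returns class scores as a numpy array (the copy count and noise level are arbitrary):

    import numpy as np

    def vote_predict(model, image, n_copies=11, noise_std=0.02):
        # Classify several slightly perturbed copies of `image` and take a majority vote.
        noisy = image[None, ...] + np.random.normal(0.0, noise_std, size=(n_copies,) + image.shape)
        noisy = np.clip(noisy, 0.0, 1.0)      # keep pixels in the valid range
        preds = model(noisy).argmax(axis=1)   # predicted class per perturbed copy
        return np.bincount(preds).argmax()    # most frequent class wins

A sufficiently robust adversarial image might still survive this, but it raises the bar from fooling one forward pass to fooling most of them.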
It's also worth observing that there are attacks on the human visual system, found by trying things out and observing their effects. We call them optical illusions, and some of them are very convincing.
If you like that kind of thing I recommend Eclipse Phase (which calls this idea a "Basilisk attack"). In the same general area, Blindsight by Peter Watts involves human perceptual flaws as a plot element.
Yay, good reference, I remember the story so well I knew the one you meant before I followed the link (I didn't remember the title).
However there have been similar ideas in earlier works, as acknowledged by the author himself. And if you asked the authors of those earlier works I'm sure they'd say they first thought of it when reading someone else's work - it's how it goes.
For what it's worth, the idea is similar to the deadliest-joke sketch by Monty Python, which predates most (but not all) of the authors cited by Langford as influences:
Langford's later short story comp.basilisk FAQ, [1] first published in Nature in December 1999, mentions William Gibson's Neuromancer (1984), Fred Hoyle's The Black Cloud (1957), J.B. Priestley's The Shapes of Sleep (1962), and Piers Anthony's Macroscope (1969) as containing a similar idea. Examples not mentioned include the short story White Cane 7.25 (1985) by Czech writer Ondřej Neff, A. E. van Vogt's War Against the Rull (1959), and John Barnes' Kaleidoscope Century (1996).
Psychological manipulation (or maybe, being more charitable, 'magic tricks') for AIs. Interesting.
I wonder if, by targeting an Amazon Mechanical Turk-style classifier instead of a deep learning system, you could use this approach to farm optical illusions or similar human 'glitches'?
This is not my area, but couldn't this problem be solved by adding random noise to the input data, such that the adversary can't control exactly what is passed into the system? You would trade off accuracy for robustness.
I feel like the network needs random data. Perhaps random recombinations of a larger training set. It would lower the accuracy of the network, but be more resilient against these sorts of attacks. Sort of like sorting algorithms that incorporate randomness.
> Discussing defenses against this attack is outside the scope of this paper. Previous work suggested the use of adversarial sample training [15], Jacobian-based regularization [16], and distillation [31] as means to make DNNs robust to adversarial samples.
It seems like it would be very simple (given you have already generated the examples) to take half of the generated adversarial examples, train your original model further on them, and check the effect (adversarial sample training). This has been shown to improve model performance while also nullifying this "attack" in other (practically similar) settings. At this point, practitioners should be doing that anyway to maximize accuracy.
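As a rough sketch of what that retraining step could look like, assuming a scikit-learn-style `model` with a `fit` method and that you already hold the generated examples with their correct labels (all names here are made up):

    import numpy as np

    def retrain_with_adversarial(model, train_x, train_y, adv_x, adv_y):
        # Fold half of the adversarial examples back into the training set and refit;
        # keep the other half held out to measure how much the attack still works.
        half = len(adv_x) // 2
        aug_x = np.concatenate([train_x, adv_x[:half]])
        aug_y = np.concatenate([train_y, adv_y[:half]])
        model.fit(aug_x, aug_y)
        return model, (adv_x[half:], adv_y[half:])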
Attack is a strong word people like to latch onto; the paper could also be titled "Generating new training examples with transfer learning". That said, there is a very real contribution here as to new ways to generate adversarial examples (which should then be used as further training examples).
It's worth noting that this would usually be trivial to protect against in practice by checking any attack vector against multiple different neural networks (ideally trained with different(ly permuted) data, and with different architectures).
Having said that, the MNIST examples might fool multiple networks. But one should note that the "3" example would probably fool a lot of humans at a single casual glance.
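A rough sketch of that multiple-network check, assuming `models` is a hypothetical list of classifiers trained on differently permuted data and/or with different architectures, each returning class scores for a batch:

    def consensus_predict(models, image):
        # Accept a prediction only when all independently trained models agree;
        # disagreement is treated as a signal that the input may be adversarial.
        preds = [int(m(image[None, ...]).argmax()) for m in models]
        return preds[0] if len(set(preds)) == 1 else None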
Also, this paper gives an excellent way to build a feedback loop for adversarial training (maybe some kind of semi-supervised autoencoder thing?)
Edit: I noticed that Ian Goodfellow is a co-author of [1] so he's probably thinking about the adversarial training idea.
As they say in the paper: "it is not yet known whether adversarial example generation requires greater control over the image than can be obtained by physically altering the sign". I would be more worried about people wearing T-shirts with traffic sign pictures on them.
Image recognition might not be the best use of this kind of attack. How about financial data, profiling data etc? It's often possible to alter the less significant digits at will and there is no extra noise.
I've only read the paper once just now, but if I understand correctly, this idea is essentially exploiting overfitting by the oracle.
Think of it this way: if a model is not overfitting in the region near a certain example input, then it means there is good generalization in that region. Perturbations around that example will remain correctly classified. If a model is overfitting near an input (think of a classic case of overfitted high-degree polynomial regression) then a tiny perturbation in that region can lead to an unstable change in the output.
Another way to think about it is to ask: why wouldn't this sort of thing work against a simpler algorithm like linear regression? I don't know that I am correct in my hunch that it wouldn't. But my feeling is that it wouldn't precisely because of the linear link function. If you perturb the inputs, even in a high-dimensional input space, you're only going to get at most a linear effect in the size of the perturbation, so the output on the true example and the adversarially perturbed example would, by construction, always be pretty similar. In other words, the adversarial example could either be plausible-looking, but not produce a great error in the oracle's output, or else it would have to be ridiculous-looking to produce that error, since the simple linear model does not present any regions of overfitting to exploit. (Of course, the linear model may not be good enough ... I'm not saying linear models are superior ... just trying to think about how overfitting is related to this.)
The algorithm works by training a companion model to the oracle model (the model you want to attack), such that a certain gradient structure may likely be correlated between the two models. That gradient structure in the companion model is then used to make perturbations around plausible inputs to the oracle, and evaluate the chances that the perturbation will lead to one of the unstable changes in oracle outputs (e.g. how likely is it that the oracle was overfitted in that region).
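Condensed into code, the gradient step on the companion model looks roughly like this (a sketch only, assuming a PyTorch `substitute` model trained to mimic the oracle's labels; the crafted input is then sent to the black-box oracle):

    import torch
    import torch.nn.functional as F

    def craft_adversarial(substitute, x, true_label, eps=0.1):
        # Fast-gradient-sign style perturbation computed on the substitute model.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(substitute(x), true_label)
        loss.backward()
        # Nudge every input dimension slightly in the direction that raises the loss.
        return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()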
The whole paper seems to have a lot of connections to the idea of boosting, but almost in reverse. In boosting, you want to clamp exemplars that you demonstrably get right, and then re-train using examples you get wrong (with weights shifting the importance more and more onto what you get wrong).
It suggests to me that this technique could also be used as the basis for 'boosting' the original oracle classifier. You can produce the adversarial examples that are likely to be misclassified, and then go back and augment the oracle's original training set with these adversarial examples with their correct labels.
In a sense, you would be putting synthetic examples into the training set at the places where you most anticipate the model would otherwise overfit.
Then you could do rounds of this, iteratively, and probably with various weighting schemes, and all sorts of other permutations that could pump a few hundred people through their PhDs.
I wonder if the output of re-training with these "high leverage" synthetic / adversarial points would then lead to an overall classifier less susceptible to this type of attack? Maybe if you do enough iterations of this attack-then-retrain process, there cease to be the types of overfitted regions exploited by the attack. And how much would it degrade accuracy in those regions?
It also suggests adding an "attackability" metric to the model's overall cost function, and I wonder how much that is just going to look exactly like a regularization term...
That is a very interesting post. My intuition is that it is precisely because there is generalization across models that it is an effect of overfitting. This is a different kind of generalization than the never-before-seen-data generalization that, when high, means no overfitting. If the underlying structural correlations in models result in similar fragile decision boundaries when trained on the same data, that does suggest overfitting to me.
The example with softmax regression is extremely interesting and I have to think about it more, but I don't think it quite represents what I was saying about plain OLS in my comment. With even basic logistic regression, you get the logistic function sort of "smearing" perturbations from the decision boundary, and so an effect that only occurs in a single direction may even be large enough to cause the logistic function to "amplify" it. Reasoning about that same possibility in a super high-dimensional softmax scenario seems very hard, and it's at least not obvious to me that the softmax functions wouldn't similarly "amplify" small differences, leading to places where the decision boundary could be fragile in this overfitting sense.
The issue is that NNs are continuous, by design. The more linear and continuous they are, the easier they are to train and generalize well.
But what that means is, changing each input by a small amount also changes the output by a small amount. And images have lots of pixels. Changing each of them a tiny amount in the right direction adds up to a big change to the output.
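A toy illustration of that adding-up effect, for a plain linear score w·x: nudging every pixel by eps in the direction of sign(w) shifts the score by eps times the sum of |w|, which grows with the number of pixels even though each individual change is tiny (the numbers below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=784)        # weights for a 28x28 image, illustrative only
    x = rng.uniform(size=784)       # an arbitrary input
    eps = 0.01                      # imperceptible per-pixel change
    x_adv = x + eps * np.sign(w)

    # The score moves by eps * sum(|w|), several whole units here,
    # even though no individual pixel changed by more than 0.01.
    print(w @ x, w @ x_adv)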
I see what you mean now. This plus the comment from user `argonaut` helped me realize what I was missing. Not all of generalization error is because of overfitting, and even if none of it is because of overfitting, the idea you describe would still allow for the attack.
To be clear, this is not overfitting. This is non-smooth (poor?) generalizability. Overfitting is related to, but not equivalent to, generalization.
I'm very skeptical that data augmentation is the answer to this. The space of adversarial examples is much larger than the space of true population examples. Boosting adversarial examples would only push the classifier to work really hard to properly classify examples that don't even belong to the population distribution. I suspect it would be equally susceptible to adversarial attacks.
Thank you, your comment along with the one further down by `Houshalter` helped me see something I was missing.
I think it is still an interesting point to reflect on whether or not the adversarial examples "belong" in the population distribution or not. Take the example of the distorted Stop sign in the paper. It comes from the space of adversarial examples, but I think most of us would agree, as far as human perception goes, that that image should be considered part of whatever space comprises positive examples (if the system is trying to replicate human performance). Humans see it unequivocally as a Stop sign, so regardless of subtleties that make it non-natural, from the point of view of "I know it when I see it" it should be considered as if it were natural.
So then the next question is whether all adversarial examples have to have that property. I mean, they obviously have to be good enough to fool the oracle, and potentially any humans who check them out to see what's going on.
In the other link that `Houshalter` shared, however, there is the example of random noise being classified as an ostrich. So in that case, we can see that the input is nonsensical for the particular label. There would probably be some spectrum from obvious images to noise/gibberish images, all within the adversarial example space. And finding the line between them probably could not be any easier than the original learning problem in the first place.
But this paper does not (metaphorically) get these things in front of the computer eyeballs (which would fall in the space of naturally observed images). It gets them directly implanted (metaphorically) on the computer's retinal photoreceptors (not in the space of naturally observed images).
Perhaps that distinction is not important, but my guess is it is.
If the problem is that the class probabilities change too rapidly wrt the input data, then maybe a solution would be to add a term penalizing large gradients to the cost used for backpropagation.
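A minimal sketch of that penalty (sometimes called input-gradient or Jacobian regularization), assuming a PyTorch `model` trained with cross-entropy; the weight `lam` is arbitrary and trades accuracy against smoothness:

    import torch
    import torch.nn.functional as F

    def penalized_loss(model, x, y, lam=0.1):
        # Cross-entropy plus a penalty on how fast the loss changes w.r.t. the input pixels.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        (grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
        return loss + lam * grad_x.pow(2).mean()

In effect it asks the network's output to change slowly around the training points, which is exactly the smoothness the attack exploits the absence of; whether that ends up looking much different from an ordinary regularization term is an open question.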