Attacking machine learning with adversarial examples (openai.com)
308 points by dwaxe on Feb 16, 2017 | 82 comments

"attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a 'yield' or other sign"

yeah, this article needs to go to the top of HN and stay there for a while

While these modifications are fascinating for being something that humans can't readily notice, the underlying attack is something you can also trivially perform against a human driver by covering, removing, or altering a stop sign such that it no longer looks like a stop sign, and it will be much more effective than trying to trick self-driving cars.

In the case of a human not seeing a stop sign, some drivers will remember a stop sign used to be at that location, but a self-driving car is less likely to forget and might (in a world with networked cars, which I find slightly scary for different reasons than this) be able to "remember" that other cars had previously seen a stop sign at that location.

I don't think it would be "much more effective" than trying to trick self-driving cars. Humans are (still, much) better at reasoning their way through ambiguous sensory input based on their abstract understanding of the world than AIs are.

While an AI might specifically already know that a given intersection is supposed to be controlled, there are large numbers of driving situations in which memory (or maps) isn't very useful.

When it comes to self-driving cars, probably the way around attacks on signs or other static road features is just to fail safe even in situations in which you're mis-directed by road features (that is, you drive cautiously enough that even if you don't understand the road situation, you can find a way to stop without hurting anyone). But that's not a general AI solution, and may not even be a general self-driving car solution.

Except no amount of specially crafted noise will trick a human. If it looks like an octagon and is red then a human will interpret it as a stop sign. That's because I'd hope as a human you have a better conceptual understanding of what a stop sign is and what its intent is. Whereas none of the current crop of AI can claim any kind of understanding or even yield any kind of explanatory model.

Deep neural nets should really be called "really complicated composition of differentiable functions" and all the training algorithms should really be called "really complicated function root finding" because that's all these things are at the end of the day: a rat's nest of functions with a billion knobs.

> That's because I'd hope as a human you have a better conceptual understanding of what a stop sign is and what its intent is.

That's not the reason at all.

It's because if you add exactly the right type of specially crafted noise to an image of a stop sign such that, to a human, it looks like a panda--then who is going to be the arbiter that decides "No, it's still actually really an image of a stop sign and not a panda". The machine?

The real reason is the subjective nature of the perception of reality. With the advent of AI, all sorts of schools of philosophy that used to seem like useless navel-gazing are going to find some real important practical applications real soon. This is metaphysics and ontology. On a similar note (not in TFA), in the "friendly AI" debate, we're going to be looking real hard at the foundations of the philosophy of ethics (where does it really come from?), not just the teleological vs deontological (etc) debates we use to reason about law, politics, governance and appeasement of our "justice" instincts.

> Except no amount of specially crafted noise will trick a human

Or quite possibly we just don't know how to construct the noise yet :P

Well we do. Conspiracy theories are noise but not in the same category as pixel noise. When neural nets start falling for conspiracy theories then I'll start to worry about the coming singularity.

It might well be that traffic signs were intentionally designed as something humans are good at classifying, without any regard to machine learning algorithms (which did not exist at the time). Yet there are definitely tasks where specially crafted noise can fool humans. Think of optical illusions or this car prototype camouflage: https://arstechnica.com/cars/2017/02/the-new-mclaren-720s-wi...

> covering, removing, or altering a stop sign such that it no longer looks like a stop sign

Isn't that illegal in most jurisdictions?

I would rather expect the same laws that keep you from defacing or removing a stop sign would apply to modifying one with noise to make it invisible to self-driving cars.

Although, you could argue that an attacker who is targeting self-driving cars could modify the sign in a way that is unnoticeable to humans, but greatly affects how the self-driving car interprets the sign.

E.g., look at the panda images at the start of the linked post. The third image looks like a panda to humans (and most humans probably wouldn't be able to see the difference between the first and third images), but it greatly affects the machine learning interpretation.

By modifying the sign in this way, if it is possible, it would be much harder to detect and enforce against than it would be if the attacker was targeting human drivers.

I get what you are getting at here. The human isn't the interesting bit. Humans have pretty good bullshit detectors built in already.

An image that triggers a false positive within the AI won't be noticeable to the average human, however good at detecting bullshit they are. At worst, it probably looks like some annoying abstract pattern.

To the car, however, it could represent anything that made it react adversely upon observing it. Seems like figuring out how to find those images might pose a challenge.

Already exists - "The triangles are known variously as 3-D, virtual or just plain fake speed humps."


The comment there suggests calling them “bump l’œil”, which is genius.

Humans will quickly learn that these things are fake and just drive over them at full speed, though. It might stop people new to the area, but not the majority of people who live there.

I think that's reasonable but perhaps not that important. In a school zone, the neighborhood people presumably send their kids there or have friends' kids there, and are likely to slow down anyway. Even if you don't, it could be a useful reminder, given that it's a school zone, to really take it slow. It's not a perfect solution, but given how much cheaper it is, it's perhaps still viable.

I wonder if such "patterns" could find use in clothing or as bumper stickers. I can envision this sort of thing taking off as a counter-culture, or anti-technology social weapon. It certainly wouldn't be hard to produce and iterate on.

I imagine it would be hard to enforce, let alone legislate against subtle visual cues that trigger machine vision signals.

Interesting times lie ahead...

About 10 years ago, when I learned about the EURion Constellation[1], I made a T-shirt with the pattern on it. It sort-of worked - a photo taken at the right distance would frequently cause a copy machine to refuse to copy it. It was twitchy, though.

At the time, I was thinking of posing for my DMV photo with it on, because I thought it was interesting and kinda funny. Failed to do so and never resumed the experiment.

[1] https://en.wikipedia.org/wiki/EURion_constellation

People have been using a similar idea to counter facial recognition.


More recently, CMU made glasses that can make you show up as someone else. http://qz.com/823820/carnegie-mellon-made-a-special-pair-of-...

The AI can be trained to become robust to Adversarial examples. Not only that, there is work going on right now that aims to recognize all Adversarial examples. Early results are promising.

I fear for the future if the fascists control the neural networks.

Did you read the article?

> We find that both adversarial training and defensive distillation accidentally perform a kind of gradient masking. Neither algorithm was explicitly designed to perform gradient masking, but gradient masking is apparently a defense that machine learning algorithms can invent relatively easily when they are trained to defend themselves and not given specific instructions about how to do so. If we transfer adversarial examples from one model to a second model that was trained with either adversarial training or defensive distillation, the attack often succeeds, even when a direct attack on the second model would fail. This suggests that both training techniques do more to flatten out the model and remove the gradient than to make sure it classifies more points correctly.

Ah, haven't kept up with adversarial examples research lately, sorry.

or water transfer printed "makeup"

Everyone can have custom-tailored hoodies that cause the human recognition systems to identify them as Luke Cage.

Ever seen a video of some kids pulling the "invisible rope prank"?

The AI doesn't need to be perfect, just better than humans.

But the problem is that AI can fail in unexpected ways.

Humans are much worse than AI when it comes to self-driving tasks like image/pattern recognition, sure. But they are inefficient in expected ways where measures can be taken like having larger and brighter signs, stronger rules, etc. But what happens when you don't know for sure when it can fail and when it won't?

People have been seeing the magic of machine learning for a few years now, but something like this was only just discovered. Are you sure that if someone goes in front of a Tesla with a picture like the one the parent comment quoted, it wouldn't crash and cause harm to the people inside?

So, it just needs to be (significantly) better than the average human. I think it's even better because each time an unexpected crash happens all cars can learn. We can't even teach people not to drive drunk.

Is it that scary to not know how and when it can crash? You don't think about being blindsided at each intersection...

Good unit testing includes maliciously constructed inputs.

Attackers can also shine bright laser pointers at aircraft. Thankfully we have cops to stop this kind of vandalism effectively.

Do current self-driving systems use vision as ground truth about the presence of things like stop signs? I would hope that the cameras would be used to augment and perhaps feed back into a street map with this information, and that cars would come to a stop at ambiguous intersections.

Potentially they could also just watch for cars on visible cross-streets; even in otherwise ideal conditions, it would be good if your autonomous car didn't keep going when it has the right of way if it can see a car about to barrel through a stoplight from the side.

This brings up a point I've thought about quite a bit with autonomous vehicles. I studied transportation planning in college and one of the tenants is that human drivers are on a spectrum between aggressive and conservative about finding efficiencies in high traffic situations. The canonical example is that if a lane next to you is going significantly faster, an aggressive driver will perform a (relatively) unsafe maneuver to cut off drivers in the faster lane and change lanes.

IIRC, significant research has been performed to identify whether the resulting lane change is net positive (frees space in the slower lane) or net negative (causes cut off car to slow down).

If autonomous vehicles prioritize safety over efficiency by stopping whenever an intersection is ambiguous, will that have a net negative effect on traffic while all these self-driving cars slow down for each other?

Couldn't this kind of problem be mitigated by some form of coordination between vehicles? I think a fully automated fleet would have an overall net positive effect, due to, for example, improving the reaction time to traffic lights (or eliminating them altogether).

s/tenants/tenets/g :)

You definitely can't expect map data to be reliable enough to be the only source of truth for something as important as stop signs.

Yeah, but these can definitely be used as complementary sources. With proper integration, they'd almost always match, and on any mismatch you should assume that the sign is there: e.g. if a sign is in the map DB but not in vision, then it's likely the sign is obscured by something but should still be followed; and if there is a sign in vision but not in the DB, then it's likely to be part of some temporary road works. The scenarios where the sign truly "isn't" there are possible, but comparatively much rarer.
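
A minimal sketch of that fusion rule, in case it helps make it concrete. The function name, its two boolean inputs, and the reason strings are made up for illustration; a real system would work with confidences and positions rather than two flags.

```python
def should_stop(sign_in_map, sign_in_vision):
    """Treat a stop sign as present if either source reports it.

    Returns (stop, reason). Both inputs are hypothetical booleans:
    sign_in_map    -- the map database lists a stop sign here
    sign_in_vision -- the camera pipeline currently detects one
    """
    if sign_in_map and sign_in_vision:
        return True, "confirmed by both sources"
    if sign_in_map:
        # In the map but not seen: assume it is obscured and obey it anyway.
        return True, "in map only, possibly obscured"
    if sign_in_vision:
        # Seen but not in the map: assume temporary road works.
        return True, "in vision only, possibly temporary"
    return False, "no sign reported"


for in_map in (True, False):
    for in_vision in (True, False):
        print(in_map, in_vision, should_stop(in_map, in_vision))
```

The rule only ever errs toward stopping, so an attacker who hides a sign from the camera gains nothing when the map still lists it.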

The same sort of tactic can be done with humans:


Nothing (and likely no one) is foolproof.

Adversarial examples are just one way to prove that deep learning (deep convolutional nets) fail at generalizable vision. It's not a security problem, it's a fundamental problem.

Instead, ask yourselves why these deep nets fail after being trained on huge datasets -- and why even more data doesn't seem to help.

The short answer is that mapping directly from static pixel images to human labels is the wrong problem to be solving.

Edit: fixed autocorrect typo

"Adversarial examples are just one way to prove that deep learning fail at generalization"

Do you know what proof is? Adversarial examples demonstrate that there is one esoteric failure mode of current deep learning models, one that for all we know is present in human vision (we can't take derivatives with respect to the parameters of our own neurons). It will likely be solved in the next few years. At a minimum you start training on adversarially generated examples.

This response is absolute hyperbole and clearly devoid of any factual knowledge of the nature of deep conv nets and their properties.

Training on adversarial examples doesn't solve the fundamental problem, it merely tries to plug the holes. But in such high dimensional spaces there are many many holes to be plugged. :)

Agreed the failure mode may seem esoteric, but note that OpenAI is making a big deal about them.

A non-esoteric way to demonstrate the lack of generalization is to feed a deep conv network real world images (from outside the dataset). Grab a camera and upload your own photo. Roboticists who try to use deep conv nets as real world vision systems see these failures all the time.

FYI, @OpenAI:

"At OpenAI, we think adversarial examples are a good aspect of security to work on because they represent a concrete problem in AI safety that can be addressed in the short term."


Hardly proof that deep learning is fundamentally flawed.

Regarding real world issues, these issues come up when you don't separate training and test (and real world) sets properly. My worries would be with implementation.

I'm certainly not saying that deep learning is fundamentally flawed. It's a great method, very powerful. (Excellent algorithm.)

I'm saying it's not reasonable to expect good generalization in deep convnets that learn mappings from static images to human labels. (Wrong problem.)

You're both right and wrong.

No credible machine learning researcher will tell you that deep learning has totally solved "generalizable" computer vision. The only people claiming such a broad statement are usually the media or enthusiasts who have never done any actual research.

So it might be technically correct to say adversarial examples prove (by counterexample) that deep learning fails at generalization, but nobody in the field claimed that in the first place.

It is hyperbole to claim that adversarial examples will be solved in the next few years. That is extremely unlikely, since the reason they exist is due to the linear nature of convolutions (and I don't think anyone is suggesting we get rid of convolutions entirely).

How did you reach that conclusion? What is the right problem to be solving?

You just need to chain RNN after CNN

Protip to people downvoting: this wasn't a joke. Chaining an RNN behind a CNN is precisely how you give neural networks context and short-term memory.

I'm actually wondering how much the no-free-lunch theorem for data compression affects adversarial examples. A neural network can be conceptualized as an extremely efficient compression technique with a very high decoding cost[1]; the NFLT implies that such efficiency must have a cost. If we follow this heuristic intuitively we're led to the hypothesis that an ANN needs to expand its storage space significantly in order to prevent adversarial examples from existing.

[1] -- consider the following encoding/decoding scheme: train a NN to recognize someone's face, and decode by generating random images until one of them is recognized as said face. If this works then the Kolmogorov complexity of the network must exceed the sum of the complexities of all "stored" faces.

So what features are those networks actually learning? What are they looking for? They cannot be much like the features used by humans, because the features used by humans are robust against such adversarial noise. I am also somewhat tempted to say that they cannot be too different from the features used by humans either, because otherwise, it seems, they would not generalize well. If they just learned some random accidental details in the training set, they would most likely fail spectacularly in the validation phase, but they don't. And then we would of course have a contradiction with the former statement.

So it seems that there are features quite different from the features used by humans that are still similarly robust unless you specifically target them. And they also correlate well with features used by humans unless you specifically target them. Real world images are very unusual images in the sense that almost all possible images are random noise while real world images are [almost] never random noise. And here I get a bit stuck. I have this diffuse idea in my head that most possible images do not occur in the real world, and that there are way more degrees of freedom in directions that just don't occur in the real world, but this idea is too diffuse for me to pin down and write down at the moment.

> I have this diffuse idea in my head that most possible images do not occur in the real world, and that there are way more degrees of freedom in directions that just don't occur in the real world, but this idea is too diffuse for me to pin down and write down at the moment.

Yes! You're on the right track! The number of degrees of freedom of images of pixels and textures is HUGE. There is not enough data to practically learn directly from those images. So the deep networks are starved for data -- even with the big datasets they are trained on. (It's only thanks to the way they are set up they do well when tested on very similar images, like sharp hi-res photos. But they fail to generalize to other kinds of images.)

So how can you reasonably reduce these degrees of freedom?

It turns out that the continuity of reality itself provides a powerful constraint that can reduce the degrees of freedom. See, when a ball rolls along, this physical event is not just a collection of textures to be memorized. It's an ordered sequence of textures that vary in a consistent and regular way because of many learnable physical constraints (like lighting).

So, it turns out you can reduce the dimensionality by making a particular kind of large recurrent neural net learn to predict the future in video. Our very preliminary testing shows it works shockingly well.

That sounds very interesting. Is there somewhere I can keep up with the progress?

And... you know these are immune to adversarial examples?

"So what features are those networks actually learning? What are they looking for? They cannot be much like the features used by humans, because the features used by humans are robust against such adversarial noise."

There are two issues with this - first, we know that the lower feature detectors of neural nets closely mimic the feature detection of the human visual cortex. Secondly, the features could be the same while there could be technical imperfections with the later stages of current neural nets.

There was a presentation at defcon 2016 about another software package that attacked other deep learning models. See



Are there any examples of these kinds of adversarial patterns that don't look like noise?

While it is pretty easy to add noise to another image, it isn't exactly easy to do it to a real object. The noise wouldn't remain the same as you change perspective with respect to the sign, which would likely change its effectiveness.

I can't see that image without thinking of Snow Crash. This is almost literally Snow Crash for neural nets.

Anyone want to make a mobile app that emits 'noisy' light so when you use your phone in public CCTV facial recognition fails?*

I'd be interested to know if this is a viable concealment strategy. It might only be effective at night or low light situations, so sunlight doesn't wash out the noise. It would be pretty subtle to use as well, how many people do you see walking around with their noses stuck to a screen?

* For research purposes only, of course.

I've also been interested in using adversarial examples in extracting sensitive info from models. Both extracting unique info from the training set (doesn't seem feasible but I can't prove it) or doing a "forced knowledge transfer" when a competitor has a well trained model and you don't.

I wonder if adversarial examples can be deliberately used as a kind of steganography? Kind of like a hidden QR code. On the surface, the product looks like a panda, with a deliberately added signal. Under the hood, it is classified as a gibbon. It could be used to verify the authenticity of a particular product.

As a defensive measure, why can't random noise just be added to the image prior to classification attempt?

The issue is essentially due to the curse of dimensionality. Since your input is very high dimensional, you can find some direction where you don't have to go very far to maximize some other output. You can find a few of these adversarial examples and add them to the training set, but there are going to be an exponential number of other directions where you only have to go a little farther. So you can't just add all these perturbations to your training set because there are too many of them. It's a hard problem. (I've been thinking about this issue a lot for my research...)

I think she means you can add white noise to the front end making the output a little bit non-deterministic. Since it won't give the same confidence number twice, it frustrates gradient descent.

Just adding white noise won't work. If you add white noise to an image the NN will make almost exactly the same prediction. The issue with adversarial images is that you add a very particular perturbation to the image, not just any perturbation at all.
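
To make the difference between "any perturbation" and "a very particular perturbation" concrete, here is a toy sketch. It assumes a purely linear score `w . x` as a stand-in for one logit of a network; the weights, the "image", the dimension, and `eps` are all made up, and real networks are only approximately linear. Both perturbations change every pixel by exactly `eps`, but only one of them is aligned with the weights.

```python
import random


def score(w, x):
    """Linear score w . x, standing in for one logit of a classifier."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    return (v > 0) - (v < 0)


def compare_perturbations(dim=10_000, eps=0.01, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # hypothetical weights
    x = [rng.random() for _ in range(dim)]         # hypothetical "image"
    base = score(w, x)

    # Random noise: each pixel moves +eps or -eps with a coin flip,
    # so the contributions to the score mostly cancel out.
    noisy = [xi + eps * rng.choice((-1.0, 1.0)) for xi in x]

    # Targeted perturbation: each pixel moves eps in the direction sign(w_i),
    # so every contribution pushes the score the same way.
    adv = [xi + eps * sign(wi) for xi, wi in zip(x, w)]

    return abs(score(w, noisy) - base), abs(score(w, adv) - base)


random_shift, adversarial_shift = compare_perturbations()
print("score shift from random noise:       ", random_shift)
print("score shift from aligned perturbation:", adversarial_shift)
```

In this toy setup the aligned shift is `eps * sum(|w_i|)`, which grows linearly with the number of pixels, while the random shift only grows like the square root of that number.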

I think the idea proposed is that if you add white noise on top of the adversarial perturbation, it will destroy that very particular perturbation.

It won't destroy the perturbation. There is a low dimensional manifold along which the cost function will decrease/increase. The adversarial perturbation lies on that manifold, but random white noise (which is a random high dimensional vector) will have close to zero length with high probability when projected onto the manifold and hence won't affect the cost function.
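
A quick numerical sanity check of the projection argument, as a sketch only. It approximates the "low dimensional manifold" by a linear subspace and uses the first few coordinate axes for it, which is an assumption of convenience: isotropic Gaussian noise is rotationally symmetric, so any subspace of the same dimension behaves the same way. The dimensions are made up.

```python
import math
import random


def projected_fraction(dim=10_000, subspace_dim=10, seed=0):
    """Fraction of a white-noise vector's length lying in a small subspace."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Projection onto the first `subspace_dim` axes just keeps those coordinates.
    inside = math.sqrt(sum(v * v for v in noise[:subspace_dim]))
    total = math.sqrt(sum(v * v for v in noise))
    return inside / total


fraction = projected_fraction()
# The fraction is expected to be on the order of sqrt(subspace_dim / dim).
print(fraction, math.sqrt(10 / 10_000))
```

So only a few percent of the noise's length lands in the directions that matter here, which is the sense in which random noise barely moves the cost function.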

Ohhhh, I see. I'm not sure whether that would work. I would have to actually do the experiment. It may not work because this bizarro region of parameter space may be somewhat robust to perturbations, so in a sense you may have to travel out the same way you travelled in. But, then again, maybe not.

Edit: I should add that these perturbations appear to be very robust to different architectures and datasets. So, the same adversarial perturbation will trick different NNs that were trained on different datasets. This suggests that it will probably be fairly robust to noise. But maybe not! I'm not aware of this experiment having been done.

If the attacker knows the distribution of your random noise, they can factor that in to the adversarial search process.

From the article:

"Adversarial examples are hard to defend against because it is difficult to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many ML models, including neural networks. Because we don’t have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense will rule out a set of adversarial examples."

I think you might be able to define the vector space "NoiseSpace" in which the random noise is generated as a subspace of the space of all images "ImageSpace", and generate an adversarial example using the quotient space ImageSpace / NoiseSpace.

How would that help?

Because all the adversarial images that are out there look like noise. If you are training each image with a different amount of noise on top of it, then plausibly you could train the classifier to be insensitive to noise like this overlaid on an image.

It's like 'fake news', but for computers.

You're right. Like fake news, it's sometimes hard to identify an adversarial example because it exploits weaknesses in perception or judgement.

Without knowing much about ML, it seems that using two (or more) very different methods could be a reasonable defense; if the methods are sufficiently different then it will get exponentially harder to find a gradient that fools all the methods; what to do when the outputs strongly disagree is a good question, but switching to a failsafe mode seems better than what we have now.
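
For what it's worth, the "switch to a failsafe mode on disagreement" part is easy to sketch. Everything here is hypothetical: the labels, the `FAILSAFE` marker, and the agreement threshold are placeholders, and the predictions would come from whatever independent methods you have.

```python
from collections import Counter

FAILSAFE = "failsafe"  # hypothetical marker for "don't trust the classifiers"


def ensemble_decision(predictions, min_agreement=1.0):
    """Return the majority label, or FAILSAFE when too few models agree.

    predictions   -- list of labels, one per independent method
    min_agreement -- required fraction of methods voting for the top label
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes / len(predictions) < min_agreement:
        return FAILSAFE
    return label


print(ensemble_decision(["stop", "stop", "stop"]))        # all agree
print(ensemble_decision(["stop", "yield", "stop"]))       # disagreement
print(ensemble_decision(["stop", "yield", "stop"], 0.6))  # looser threshold
```

This only helps to the extent that the methods fail on different inputs, so it does nothing against a perturbation that fools all of them at once.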

These adversarial inputs have been shown to generalize to separately trained models.

I was thinking of using the same training set but different ML techniques. It's been almost 20 years since I took an AI class, but RNNs weren't the only thing in there...

Every example provided in the article is specific to a model and its training data. It only tells you one thing: your data is not telling the truth. So why not get better data?

These adversarial inputs have been shown to generalize to separately trained models.

This is going to be the SQL injection in the AI age

Adding high frequency noise "fools" ML but not the human eye. It feels like this is a general failure in regularization schemes.

Why not try training multiple models on different levels of coarse grained data? Evaluate the image on all of them. Plot the class probability as a function of coarse graining. Ideally it's some smooth function. If it's not, there may be something adversarial (or bad training) going on.
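
A rough sketch of that check, to show the mechanics rather than a real defense. `toy_model` is an invented stand-in that is deliberately over-sensitive to a pixel-level checkerboard, and the same function is reused at every scale instead of separately trained models, purely to keep the example short.

```python
def coarse_grain(image, block):
    """Average non-overlapping block x block patches of a 2-D list of floats."""
    rows, cols = len(image) // block, len(image[0]) // block
    return [
        [
            sum(image[r * block + i][c * block + j]
                for i in range(block) for j in range(block)) / (block * block)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def probability_profile(image, models):
    """models maps block size -> callable returning a class probability."""
    return [models[b](coarse_grain(image, b)) for b in sorted(models)]


def max_jump(profile):
    """Largest change in probability between adjacent coarse-graining levels."""
    return max(abs(a - b) for a, b in zip(profile, profile[1:]))


def toy_model(img):
    """Invented classifier: mean brightness plus a checkerboard response."""
    n = len(img) * len(img[0])
    mean = sum(map(sum, img)) / n
    checker = sum(v if (r + c) % 2 == 0 else -v
                  for r, row in enumerate(img) for c, v in enumerate(row)) / n
    return min(1.0, max(0.0, mean + 5.0 * checker))


size, eps = 8, 0.08
clean = [[0.5] * size for _ in range(size)]
perturbed = [[0.5 + (eps if (r + c) % 2 == 0 else -eps) for c in range(size)]
             for r in range(size)]
models = {1: toy_model, 2: toy_model, 4: toy_model}

clean_profile = probability_profile(clean, models)
perturbed_profile = probability_profile(perturbed, models)
print("clean:    ", clean_profile, max_jump(clean_profile))
print("perturbed:", perturbed_profile, max_jump(perturbed_profile))
```

The clean image gives a flat profile, while the high-frequency pattern only shows up at the finest scale and produces a jump, which is the kind of non-smoothness the check would flag.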

This seems like an obvious thing to try, but technically the convolutional structure already looks at different scales (in a way similar to mipmaps).

I've built a CNN before and that's not my understanding of it. The high frequency noise changes the output of the first layer on the CNN. This is what gets pooled as you go deeper into the net. Coarse graining is like getting rid of the weights you have for the first layer and replacing them with something uniform (average the smallest details together).
