The Neural Net Tank Urban Legend (gwern.net)
161 points by JoshTriplett on Oct 16, 2017 | 61 comments



I first encountered this idea in a sci-fi story (I want to say it was one of Peter Watts' "Rifters" novels, but I can't find it now). The idea was that someone trained a neural network to look at live video feeds of passengers moving through a subway station, and control the station's ventilation system. Unfortunately, the movements of individual people were fairly random, whereas the large-scale traffic patterns were extremely regular and periodic. So instead of basing its output on the actual crowd patterns, the neural net decided it was more accurate to look at the hands of an analog clock that happened to be visible through one of its cameras.

All well and good, until the clock stopped working during rush hour, and people started asphyxiating.



I thought I had a bleak outlook on life, until I started reading Watts' novels and blog.

BTW, did he stop releasing his writing under Creative Commons?


Definitely Rifters, though I'm not sure which. Probably the first.


>I suggest that dataset bias is real but exaggerated by the tank story, giving a misleading indication of risks from deep learning

I don't see how this story gives a "misleading" view of deep learning. From my (admittedly limited) experience with self-driving RC cars, this type of mistake is quite easy for a neural net to make while being quite difficult to detect. In our case, after utilizing a visual back-prop method, we realized our car was using the lights above to direct itself rather than the lanes on the road.

Now, you can refute this and say "well clearly your data wasn't extensive enough" or "your behavioral model is too simple for a complicated task like driving" however as these tools become easier to use, more and more organizations will put them into practice without as much care as the researchers behind most of the current production efforts.


Another more modern and well-documented example of this would seem to occur in a 2015 write-up of the "Right Whale" competition in Kaggle: http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.h...

Contrary to this author's claims, despite using data augmentation and a fancy modern CNN, a neural network trained to identify whales hit a local optimum where it looked at patterns in waves on the water to identify the whale instead of distinctive markings on the whale's body.

I don't buy the "this isn't a problem in real world applications" argument being made in this article.


He says that his first attempt at whale recognition looked at waves instead of whales, but

> This naive approach yielded a validation score of just ~5.8 (logloss, lower the better) which was barely better than a random guess.

which is different from the tank story. For the tanks, the neural network appeared to perform well, but was actually not looking at the tanks. Here, it never performed well, and when debugging why not he found that it was not looking at the whales.


> I don't buy the "this isn't a problem in real world applications" argument being made in this article.

Me neither. Especially considering that this story was already alive before the latest deep learning advances. It is totally believable.

And even with a modern CNN approach, you would expect a model to be able to learn a sunny/cloudy categorization much easier than the nationality of a tank.

This story was repeated by professionals for ages because it is totally believable.


I assume you're referring to some simple lane-keeping CNN where the CNN predicts steering angle from a video recording+human inputs: and yes, your dataset isn't extensive enough, and you'll never have enough data either, not due to some amusing bias in your CNN or taking shortcuts, but because it's a reinforcement learning problem and not a classification problem - your RC CNN could learn a better model of the road which doesn't involve lights at all and it won't make any real difference, it'll still be unable to correct for its errors or adapt to new situations and crash.


I did the human version of this when I was a newbie driver. I learned to predict traffic lights changing to red by watching the pedestrian signals as I approached an intersection. All the lights all over the city followed the same pattern. Then one day I happened upon one where the pattern was different, and I stopped for no reason at a green light, like an idiot.


The whole "Could it happen"? section is a bit strange. On the one hand, it focuses on CNNs when it's clear we're talking about a binary classifier (the article itself points that out). If Fredkin was really the originator of the story, then discussing CNNs is an anachronism (they were 30 years away at the time).

More importantly, it's obvious that "it" could definitely happen and in fact happens a lot- "it" being overfitting to examples. Machine learning classifiers suffer from this a lot, it's the whole bias/variance tradeoff issue. Neural Nets are not only not immune to overfitting, they are even particularly vulnerable to it (especially the ones with millions of parameters). We've probably all read the adversarial examples papers- a clear case of overfitting to irrelevant details ("noise").

The story (apocryphal or not) seems like a cautionary tale against overfitting, or a not-so-innocent attempt to poke fun at machine learning researchers. One way or another, overfitting is no joke and it's definitely no urban legend.


> The whole "Could it happen"? section is a bit strange. On the one hand, it focuses on CNNs when it's clear we're talking about a binary classifier (the article itself points that out). If Fredkin was really the originator of the story, then discussing CNNs is an anachronism (they were 30 years away at the time).

? How is it strange? Presumably people are not, right now, still retelling the story because they are terribly afraid that there are perceptrons out there from the 1960s lurking, waiting to strike, or that anyone is going to go out and try to use 1960s style perceptrons. People are telling it as a cautionary story about current NNs, in the 2000s and 2010s and 2017. Which means... CNNs. So it's worth asking, can it happen with CNNs as trained by any reasonably standard workflow?

> More importantly, it's obvious that "it" could definitely happen and in fact happens a lot- "it" being overfitting to examples.

Overfitting is not dataset bias, as I note several times. For example, dropout or heldout datasets or crossvalidation are highly effective in fighting/detecting overfitting, but do nothing about dataset bias.
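To make the distinction concrete, here is a minimal sketch, not anything from the article. Everything in it is hypothetical: the two features (`tank_signal`, `brightness`), the noise levels, and the tiny logistic regression standing in for a network. It assumes that in the collected photos the weather tracks the label perfectly. The held-out split is drawn from the same biased collection, so it scores well, while data where weather is independent of tanks does not.

```python
import math
import random

def make_data(n, biased, rng):
    """Each example is ([tank_signal, brightness], label)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)                      # 1 = tank present
        tank_signal = y + rng.gauss(0, 2.0)        # real but weak, noisy cue
        # Biased collection: every tank photo was taken on a sunny day.
        weather = y if biased else rng.randint(0, 1)
        brightness = weather + rng.gauss(0, 0.1)
        data.append(([tank_signal, brightness], y))
    return data

def predict(w, b, x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 / (1 + math.exp(-z))

def train_logreg(data, epochs=200, lr=0.5):
    """Plain full-batch gradient descent on the logistic loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            err = predict(w, b, x) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

def accuracy(w, b, data):
    return sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in data) / len(data)

def run_demo(seed=0):
    rng = random.Random(seed)
    collected = make_data(1000, biased=True, rng=rng)
    train, val = collected[:800], collected[800:]   # held-out split of the SAME biased data
    field = make_data(1000, biased=False, rng=rng)  # weather no longer tracks the label
    w, b = train_logreg(train)
    return accuracy(w, b, val), accuracy(w, b, field)

if __name__ == "__main__":
    val_acc, field_acc = run_demo()
    print(f"held-out accuracy: {val_acc:.2f}, field accuracy: {field_acc:.2f}")
```

The held-out score looks fine because the split inherits the same sunny/cloudy artifact; only data collected differently exposes it.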


> One way or another, overfitting is no joke and it's definitely no urban legend.

Can you give an example of where overfitting happened and was successfully corrected for?


That's why test and validation sets exist. Overfitting is prevented by ending the learning process when the error on the validation set starts growing. It is a standard procedure, so it's unlikely there's a recent case of overfitting slipping into production.

If an irrelevant feature is present in all samples of a class, then it is not the fault of the NN to use it as a class feature; it's bad data.
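For anyone who hasn't seen it, the stopping rule described above can be sketched roughly like this. The function name and the `patience` parameter are assumptions for illustration; in practice the losses would come from evaluating the model on the validation set after each epoch rather than from a list.

```python
def early_stopping(val_losses, patience=3):
    """Scan per-epoch validation losses and decide where to stop.

    Stops once `patience` epochs have passed without a new best loss.
    Returns (best_epoch, stopped_epoch); the weights saved at best_epoch
    are the ones you would keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before the patience limit was hit.
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: improves, then starts growing.
losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.7, 0.9]
print(early_stopping(losses, patience=3))
```

Note that this only guards against fitting noise in the training split; it says nothing about whether the validation split shares a flaw with the training data.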


So you would agree overfitting is not a real issue and talking about it is distracting from NN?

My question is: are people using overfitting as an excuse for what is instead a badly made NN?

If you are smart enough to create an NN that can tell if it's sunny or not, then tanks would also be possible. But if your NN just sucks, then blaming overfitting is a convenient out.


>> My question is are people using overfitting as an excuse of a what is instead a badly made NN.

Overfitting is a major issue in machine learning and it's an inherent characteristic of learning from examples and not the result of a mistake, or of poor practice. There are special techniques developed explicitly to reduce overfitting- early stopping (what red75prime describes above), regularisation, bagging (in decision trees) etc. A lot of work also goes into ensuring measures of learning performance don't mistake overfitting for successful learning (e.g. k-fold cross validation).

I'm sorry that I don't have time to track down a good source for a discussion of the bias-variance tradeoff and overfitting. You can start at the wikipedia page [https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff] and follow the links. In short- a model that learns to reproduce its example data with very high fidelity, risks generalising poorly, whereas a model that generalises well may have high training error. Linear classifiers in particular are high-bias, whereas nonlinear learners, like multi-layered neural networks or decision trees, are high-variance.

The problem is real, it's a big bugbear and you won't find any specialist who dismisses it, or who considers it "not a real issue".
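As a rough illustration of the k-fold cross validation mentioned above, here is a minimal sketch of the index bookkeeping only. `k_fold_splits` is a hypothetical helper, not any library's API; the model fitting and scoring per fold are left out.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross validation.

    Every sample lands in the validation fold exactly once, so the averaged
    validation score is never computed on data the model was fitted to.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

for train, val in k_fold_splits(10, 5):
    print(sorted(val), "held out;", len(train), "used for fitting")
```

A large gap between the training score and the averaged validation score across folds is the usual symptom of the high-variance behaviour described above.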


Building a robust process that creates a model fitting signal and not noise is very tough. It becomes philosophical for some problems, such as forecasting.

Over-fitting is dismissed as an amateur mistake when it really is an endemic problem you are constantly battling no matter how good you are.


For a better, actual example of this problem, see the leopard sofa: http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.ht...


That comment section took an immediate and unexpected turn for the worse.


>That comment section took an immediate and unexpected turn for the worse.

What the heck is going on there?


Terry Davis - he occasionally chimes in here with similarly themed posts (but only if you have show dead enabled).

He's schizophrenic, is famous for TempleOS and infamous for the contents of his posts on the internet.


Interestingly, one of the comments mentions the tank story as an example of the same issue:

  Rainer Kordmaa • 6 months ago

  Kinda reminds me a story of how a neural network was
  trained by military to detect camouflaged tanks on
  terrain, except pictures with tanks were taken on a nice
  sunny day and pictures without on a cloudy day and instead
  of a tank detector they ended up with a sunny day detector


I think the author's conclusion- that this scenario is unrealistic and would never happen given today's understanding of machine learning techniques- is extremely optimistic. NNs are demonstrably[1] not robust image classifiers.

In my opinion, it's far more dangerous to downplay the limitations of this technology and embolden snake-oil purveyors than it is to demand an inconvenient degree of rigor and caution in reporting results.

[1] https://arxiv.org/abs/1707.07397#


> I think the author's conclusion- that this scenario is unrealistic and would never happen given today's understanding of machine learning techniques- is extremely optimistic. NNs are demonstrably[1] not robust image classifiers.

I am well-aware of adversarial examples, and they are not the same thing as dataset bias, and I am very troubled by them. If you look at the section on whether we should tell the tank story as a cautionary story, I already say:

> I also fear that telling the tank story tends to promote complacency and underestimation of the state of the art by implying that NNs and AI in general are toy systems which are far from practicality and cannot work in the real world (particularly the story variants which date the tank story recently), or that such systems will fail in easily diagnosed and visible ways, ways which can be diagnosed by a human just comparing the photos or applying some political reasoning to the outputs, when what we actually see with deep learning are failure modes like "adversarial examples" which are quite as inscrutable as the neural nets themselves (or AlphaGo's one misjudged move resulting in its only loss to Lee Sedol).

To expand a little: dataset bias at least has the tendency to expose itself as soon as you try to apply it. You waste your time, but that's generally the worst part. I'm more worried about stuff like adversarial examples, which will work great in the field right up until a hacker comes by with a custom adversarial example (eg the adversarial car sign work showing you can trick simple CNNs into misclassifying speed limits and stop signs using adversarial examples pasted onto walls or signs or streets). This is not dataset bias; you can collect images of every single stop sign in the world and that will not stop adversarial examples.
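A toy sketch of why more data doesn't help here. This is not the method from the road-sign work, just the gradient-sign idea applied to a hypothetical linear classifier; the weights, the input, and `eps` are made up for illustration. Each feature moves by a small amount, but the moves all push the score the same way.

```python
def sign(v):
    return (v > 0) - (v < 0)

def score(w, b, x):
    """Linear classifier: predict class 1 when the score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def adversarial_example(w, b, x, eps):
    """Move each feature by at most eps against the current prediction.

    For a linear model the gradient of the score with respect to x is just w,
    so stepping each feature by eps against sign(w) shifts the score by
    eps * sum(|w|) in total.
    """
    direction = 1 if score(w, b, x) > 0 else -1
    return [xi - eps * direction * sign(wi) for wi, xi in zip(w, x)]

# Hypothetical 100-feature "image" and weights, chosen only for illustration.
w = [0.1 if i % 2 == 0 else -0.1 for i in range(100)]
x = [0.02 if i % 2 == 0 else -0.02 for i in range(100)]
b = 0.0

x_adv = adversarial_example(w, b, x, eps=0.03)
print(score(w, b, x) > 0, score(w, b, x_adv) > 0)
```

The perturbation depends on the model's weights, not on any gap in the training set, which is why collecting more photos does not remove it.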

> embolden snake-oil purveyors than it is to demand an inconvenient degree of rigor and caution in reporting results.

I think it's ironic to say that doing the very simplest level of fact-checking like 'did this story ever actually happen' is an 'inconvenient degree of rigor and caution' and 'emboldens snake-oil purveyors'.


I get the author's feelings about why we shouldn't tell this story, but I still disagree. It's a pithy, funny example of GIGO in machine learning. People could read conclusions about the abilities of neural networks from the story, but they're wrong to do so -- it's a PEBKAC error, not a technology one. "Truthy" cautionary tales are a near-universal feature of human cultures -- why shouldn't machine learning have some?


It's a plausible story. Fine to present as a parable, but we should stop presenting it as true unless we can find a reliable source for it.

Until today, I believed it was true. It was told to me as an undergrad, by a professor who believed it himself.


I first heard it in an AI class as well. When I relate the story I usually give it as an anecdote of apocryphal origin. But I do love the story, because it brings into focus several issues: what do you actually want your classifier to learn? what does your training set actually teach? how do you know you've learned the right thing?

Many times it seems like people go into these things hoping that the machine learning part will figure out things for them and relieve themselves of the problem of thinking hard about the problem. It doesn't. It only moves your problem over a bit and increases the difficulty.

In fact this problem pops up even in pedagogy where the lessons people are taught actually train them to do the wrong thing (for example pilots responding to aircraft attitude upsets).

The parable's lesson is a simplistic one, basically: "stop and think about what you're doing". But like other simple lessons about crying wolf or stitching in-time, it bears repeating.


> It only moves your problem over a bit and increases the difficulty.

Well, that's not quite true. In robot sensing several things have recently moved from the nigh-on-impossible column to the holy-shit-that-actually-works-pretty-well column, thanks to ML.

But I agree with the rest of it.


> I suggest that dataset bias is real but exaggerated by the tank story, giving a misleading indication of risks from deep learning and that it would be better to not repeat it but focus on established risks like AI systems optimizing for wrong utility functions.

He's not arguing against having cautionary tales, he's arguing that we should base them on actual problems instead of imaginary ones.


But ensuring correct datasets is a problem that people new to machine learning have to be made aware of. It's easy for people to see that NNs are good with 'noisy' data and incorrectly assume they can throw any data at it and get good results. And I think the quote is a false dilemma -- it's not like we can't have multiple different stories for different problems. Make up / find a "truthy" story to spread for misoptimized utility functions if one doesn't exist, but there's no need to kill this useful story in the process.


If a failure mode has never been reported in the wild, why is it so important to tell a juicy story about it, at the expense of attention to empirically observed failure modes?


Sure, it wouldn't do any harm to move to relevant and well-sourced examples, or construct such demonstrations specifically for the class. This HN discussion alone has gathered several instances of classifiers overfitting, or fitting to features that do not generalize outside the training set as well as it first appears, e.g. the curious case of the leopard sofa [1] linked by andreasvc [2] demonstrates more or less the same problem as the tank parable (the leopard detector performs quite well but in reality is not detecting the exact thing we'd expect).

However, also consider that the tank parable has circulated in textbooks and undergraduate introductory lectures for several decades. The main lesson I learned from the story is that after training a model, one should validate it to see if it does generalize to detecting tanks during both night and daytime. Is it really a surprise that it might be difficult to find egregiously naive mistakes?

[1] http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.ht...

[2] https://news.ycombinator.com/item?id=15486441


Google's image tech confusing black people and gorillas?


There are three separate lessons to be learned from this parable, and I think people are conflating them:

1. Training on a biased data set leads to biased predictions. This is undoubtedly true.

2. Data sets can be biased in unexpected and unforeseen ways, so predictions can also be biased in unexpected and unforeseen ways. The examples at the end of this article don't quite touch on that point. But examples of this abound in social science. Eg: https://blog.conceptnet.io/2017/07/13/how-to-make-a-racist-a...

3. Deep and convolutional neural networks are susceptible to this phenomenon. This is the point that the article is debating.


This kind of stuff is fairly normal actually. I don't really agree that it couldn't happen. Neural Networks train for the wrong features all of the time. It's part of what happens when you're training unsupervised. You load in a lot of data and then you find the bias. Sure, as the article purports, if done perfectly it wouldn't happen. But that's like saying "If you build a bridge perfectly it won't fall down." before building the first bridge. This was supposedly an old tale so I'm not sure why the author would assume the people working on the original theoretical NN were data experts who knew the correct ways to train NNs.


Do you have any documented examples?


Umm, but then a story linked from the article as "alternative example" (thus presumably "better" than the tank story), and it being one from HN by the way, seems to have a nearly identical gist, at least for me as a layman: https://news.ycombinator.com/item?id=6269114 - only not about neural nets, but genetic/evolutionary algorithms. Or is it somehow drastically different and I just don't understand that?


The difference is that api (Adam Ierymenko, 17k karma, 10 year old account) there says he did it himself - he is not retelling 'a friend of a grad student of a professor told me about some NNs'... When an HN user says something happened to him personally, I am willing to believe that it actually happened.

And there is a big difference between something that happened and something that did not happen.


For those who say we shouldn't pay too much attention to urban legends about neural network failures, here's a real-life example of neural networks translating "inorganic cat litter" as "in organic cat litter" and thereby creating a real-life half-billion-dollar dirty bomb that genuinely exploded.

https://jonathanturley.org/2014/11/21/kitty-litter-dirty-bom...

This is a hasty link; IIRC the error happened when someone read instructions aloud to someone else who was taking notes.

Yes, that badly behaving neural network(s) was human, and therefore far more sophisticated than any we can build yet. Which makes the problem worse and more real, not better or less real.


The big takeaway for me is that even if untrue, similar situations are true. Like this one from the article: "Gender-From-Iris or Gender-From-Mascara?" https://arxiv.org/pdf/1702.01304.pdf


All this speculation is silly. Just generate your own data set (since the story is from the early 90s, if not earlier, the number of training examples would have been quite small compared to today's data sets) and see if today's networks make the same mistake.


Tanks are already in ImageNet: http://image-net.org/synset?wnid=n04389033


I have heard this story multiple times. My impression is that often the person who tells the story treats the fact that the mistrained model was NN-based as a minor detail (or the kind of juicy but ultimately insignificant detail that makes the story more fun to tell; and if the original story was about a NN, nobody is going to change it to an SVM or something else).

From this viewpoint, I found the section where the author argues at length that this could not possibly happen with the current state-of-the-art visual task CNNs (especially because people apply preprocessing steps such as whitening and augmentation to get rid of exactly this kind of bias), let's say, weird. The parable is not about CNNs, it is about the importance of paying attention to what features your model will extract from the training dataset and whether your model is learning the right things.


Scare-quotes dataset bias (in science we say "selection bias") is the bread and butter of any field that doesn't get an opportunity to fine-tune sample design issues.

There's even hierarchical models with an equation giving the probability that an item will be observed at all, conditioned to known features.

Those who don't know their statistical models are bound to reinvent statistical theory.


reinvent it poorly as well?


> a common preprocessing step in computer vision (and NNs in general) is to whiten the image by standardizing or transforming pixels to a normal distribution; this would tend to wipe global brightness levels, promoting invariance to illumination

Is there anybody still doing this?


No, it's not widely used anymore, although I wouldn't call it rare either. The most popular image CNNs don't use it: ResNets, Inception, VGG, fully convolutional nets, etc.

The Google query gwern cites is highly misleading because "normalize" in the context of neural nets for computer vision almost always means "subtract the average and then divide by the standard deviation."


Seems to still be pretty common: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&as_yl... Am I wrong?


"normalizing" is a bad search term, it can mean a lot of things. And whitening images is pretty much dead. What is done is subtracting mean colors from each pixel, but those are means over the whole database, not per-image, so that keeps brightness shifts intact.
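A quick sketch of that difference, using toy 4-pixel "images" and a single scalar dataset mean (a simplification of per-channel mean subtraction). The function names are hypothetical and this is not any framework's input pipeline.

```python
from statistics import mean, pstdev

def standardize_per_image(image):
    """Per-image standardization: zero mean, unit variance within each image."""
    m = mean(image)
    s = pstdev(image) or 1.0   # avoid dividing by zero on a constant image
    return [(p - m) / s for p in image]

def subtract_dataset_mean(images):
    """Subtract one mean computed over the whole dataset from every pixel."""
    dataset_mean = mean([p for img in images for p in img])
    return [[p - dataset_mean for p in img] for img in images]

# Two toy 4-pixel "images": the same scene, one shot 100 units brighter.
dark = [10, 20, 30, 40]
bright = [p + 100 for p in dark]

per_image = [standardize_per_image(dark), standardize_per_image(bright)]
dataset_wide = subtract_dataset_mean([dark, bright])

print(per_image[0] == per_image[1])                    # brightness shift removed
print(mean(dataset_wide[1]) - mean(dataset_wide[0]))   # brightness shift kept
```

With per-image standardization the two inputs become indistinguishable, so a sunny/cloudy cue is gone before the network sees it; with dataset-wide mean subtraction the 100-unit offset survives and is available as a feature.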


In searches you should err on the side of broadness. If you cut it down to just 'whitening', as you can see from the snippets as well, there are plenty of hits. You may not like whitening, but it does still seem to be common.


Using your search, at least for me, none of the snippets on the first page use normalization in the sense you mean in this context. So including that term just got you a lot of noise. And the only reference to whitening on that search page is not using it in an input pipeline; it is using ZCA to detect images that are modified to be adversarial.

If you want some better data than unreliable searches, go download pretrained models for popular architectures and popular frameworks and look at the input pipelines for them. You'll find that whitening is absolutely not common for image classification/detection today (yes, there are still some cases where it is used, but typically on smaller datasets where you can't get that invariance from data, which is the way you prefer it to be - if one class actually is more likely to be present in dark images, you don't want to kill that information).


I always heard the version that went the other way around. After it was shown that single layer perceptrons were unable to deal with data sets that weren't linearly separable, there was an effort to figure out how the single layer tank classifier was working.


That's an interesting variant - none of the versions I've seen so far link it to Minsky's perceptron book. Any chance you recall where you saw that one?


This was from a prof giving an undergrad ML lecture at Cornell about 10-12 years ago. Wikipedia's coverage of _Perceptrons_ suggests my lecturer also had only heard the mistaken version of Minsky's XOR example, so this could have been entirely wrong :)


Oh. How boring. Wonder if I should include that as an example... It's a valid example of how urban legends evolve, after all.


It's funny, I heard this legend a bunch, but stopped hearing it after the 2012 AlexNet paper.


For reference:

AlexNet[1] is the name of a convolutional neural network, originally written with CUDA to run with GPU support, which competed in the ImageNet Large Scale Visual Recognition Challenge in 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points ahead of the runner up. AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever.

AlexNet Paper(PDF)[0]

[0]: http://vision.stanford.edu/teaching/cs231b_spring1415/slides...

[1]: https://en.wikipedia.org/wiki/AlexNet


Makes me think of the story of Soviet soldiers training dogs to carry explosives under tanks during WW2. Only they used their own tanks to train the dogs. So when deployed to the battlefield, well...


I've always liked this story better, it's in a somewhat similar vein: https://www.damninteresting.com/on-the-origin-of-circuits/


How does one validate that a trained NN has learned a correct method?


Why is it called an urban legend? It seems to accurately depict how NNs work.



