Batch Normalization: Accelerating Deep Network Training [pdf] (arxiv.org)
134 points by singularity2001 on Feb 13, 2015 | 38 comments



Since this paper and [1] both beat human-level performance on ImageNet using seemingly unrelated techniques, there seems to be a clear path to the next state-of-the-art result: implement and train a network that uses both Batch Normalization and PReLUs at the same time. Right?

1: http://arxiv.org/pdf/1502.01852v1.pdf
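
For anyone wondering what that combination would even look like at the level of a single layer, here's a rough numpy sketch of a forward pass with batch normalization followed by a PReLU nonlinearity (just an illustration of the two ideas side by side, not code from either paper; gamma, beta, and the PReLU slope a would all be learned):

import numpy as np

def batchnorm_prelu_forward(x, gamma, beta, a, eps=1e-5):
    # Batch Normalization: normalize each feature over the mini-batch,
    # then scale and shift with the learned gamma/beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    # PReLU: like ReLU, but negative values are scaled by a learned slope a
    # instead of being zeroed out.
    return np.where(y > 0, y, a * y)

# Toy usage: a mini-batch of 32 examples with 100 features each.
x = np.random.randn(32, 100)
gamma, beta = np.ones(100), np.zeros(100)
a = np.full(100, 0.25)  # one PReLU slope per feature
out = batchnorm_prelu_forward(x, gamma, beta, a)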


Exactly my thoughts. The gains from both incremental improvements seem similar; it would be interesting to see what the combined effect is. It's pretty clear that they're reaching the limits of the data set, though.


The phrase "exceeding the accuracy of human raters" is a bit puzzling: if a human misclassifies something whose true class was determined by another human, who is in error?


The comparison is to humans that see the same data as the algorithm. The "true" labels can be assigned using additional data.

For example, if I make a dataset for the task of distinguishing cocker spaniels from springer spaniels, then when I collect the pictures I can be quite confident about the labels I assign (e.g. by DNA-testing the actual dogs). But when someone else tries to identify the dogs based only on the pictures I took, they might get some of them wrong.

The 5.1% "human accuracy" number that gets thrown around for ImageNet comes from Andrej Karpathy attempting to manually label part of the ImageNet test set. He wrote a blog post about the experience (http://karpathy.github.io/2014/09/02/what-i-learned-from-com...), which has some interesting details. For example, they did actually find that 1.7% of images were genuinely misannotated: that's because ImageNet annotations come from a consensus of Mechanical Turkers, which is less reliable than DNA testing. :-)


The TLDR is that to work at scale we sacrifice some correctness in the dataset, and what's really being evaluated here is closer to some notion of human agreement.

The way the data was collected was not through a 1000-way classification. Instead, we'd query things like "red fox" from a search engine, and then have turkers do the binary task of YES/NO to clean up the results. This is a messy, noisy process but we do our best. In particular, the turkers saying YES/NO to red fox are completely unaware of what other classes are in the dataset - everything they say YES to becomes the "red fox" class in imagenet, in the final discriminative task among the 1000 categories.

Also note that if an image contains a whole bunch of things (e.g. see the left-most image here: http://karpathy.github.io/assets/ilsvrc2.png), the turker during data collection would be correct in saying that there is in fact, e.g., a "ruler" in that image; but at test time, when you're seeing everything in that image, you only have 5 possible predictions, so you have to guess what you think the "true" label is.
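
(For concreteness, the "5 possible predictions" refers to top-5 scoring, which works roughly like this numpy sketch; illustrative only, not the official ILSVRC evaluation code:)

import numpy as np

def top5_error(scores, labels):
    # scores: (N, 1000) class scores per image; labels: (N,) ground-truth class indices.
    # An image counts as correct if its true label is among the 5 highest-scoring classes.
    top5 = np.argsort(-scores, axis=1)[:, :5]
    correct = (top5 == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# Toy usage with random scores and labels.
scores = np.random.randn(10, 1000)
labels = np.random.randint(0, 1000, size=10)
print(top5_error(scores, labels))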


Probably exceeding the accuracy of a "single human rater".

Note that the definition of "human rater accuracy" is a bit vague. For example, an Amazon Turk rater qualifies as a human rater, yet the accuracy of such a rater can easily be exceeded by that of a dedicated grad student rater, or of a dedicated H1 employee rater.

edit: That's my guess from previous (2.5 years back) work in computer vision / imagenet and a quick skim over the article.


Well, I don't know how they did it, but you could easily imagine a scenario (roughly simulated in the sketch after these bullets):

- 3 human raters, one computer rater

- computer always agrees with 2/3 humans

- humans do not always agree with 2/3 humans
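
A quick numpy simulation of that hypothetical (all numbers are made up, just to show the effect):

import numpy as np

# Three humans, each independently matching the "true" label 90% of the time,
# and a computer that simply copies the 2-of-3 human majority vote.
rng = np.random.default_rng(0)
n, p_human = 100_000, 0.9
truth = rng.integers(0, 2, n)  # binary labels for simplicity
humans = [np.where(rng.random(n) < p_human, truth, 1 - truth) for _ in range(3)]
majority = (humans[0] + humans[1] + humans[2] >= 2).astype(int)

# Scored against the majority label, the computer is perfect by construction,
# while each individual human is sometimes in the minority.
print("human 1 vs majority:", (humans[0] == majority).mean())
print("computer vs majority:", (majority == majority).mean())  # always 1.0
# The majority vote also beats any single human against the true labels (~0.97 vs 0.90).
print("human 1 vs truth:", (humans[0] == truth).mean())
print("computer vs truth:", (majority == truth).mean())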


I find it fascinating that deep learning research is No. 1 on HN. Not even 6 months ago it was nearly unheard of on the top 3-4 pages. Is this an indicator that machine intelligence is the new normal?


Nah. Since neural networks started being called deep learning, they have become very fashionable.

There's some well-funded startup (SkyMind maybe?) that declares on their homepage that deep learning is sooo much better than machine learning (first wtf). Then they explain that neural nets with fewer than 3 hidden layers are machine learning, and with more than 3 layers it's deep learning (second wtf).

I didn't know whether to laugh or cry. (Not that what they actually seem to be doing is bad, but this newfound hype is just wrong.)


Adam (founder of Skymind) here. I'd like to say that we don't encourage treating neural networks as a black box, like a lot of the startups out there do.

I also mainly advocate neural networks for unstructured data, where the results are proven to be a significant improvement over other techniques. Many startups in the space would have you believe that you can have a simple GUI and you're somehow set to go.

In my upcoming O'Reilly book, Deep Learning: A Practitioner's Approach, I go through a practical, applications-oriented view of neural networks that I think will help change things in the coming years.

I'd also add that I've implemented every neural network architecture and let people mix and match. A significant result that many people are familiar with is the image scene description generator done by Karpathy et al. [1].

Either way, unlike most, our code is open source :). I don't claim things are ideal or perfect, but it's out there for you to play with. I focus on integrations, packaging, and providing real value rather than pretending that some algorithm is going to be my edge. Many startups in the space will tell you they have awesome, cutting-edge algorithms, when in reality you should be providing a solution for people.

[1]: https://github.com/karpathy/neuraltalk/tree/master/imagernn


Thanks for commenting here, Adam. I was wondering if you could address the issues that sz4kerto brought up. Do you think that deep learning is better than machine learning? And do you define deep learning as a neural network with >= 3 layers as opposed to machine learning being < 3 layers?

I don't see those claims anywhere on your website and I have no idea where the grandparent comment got those statements from. Frankly, they don't make any sense. But I do share his or her sentiment that the phrase "deep learning" is overused, poorly understood and is rapidly becoming a meaningless buzzword akin to "Big Data" or "Web 2.0".

I don't expect you to explain statements that you don't appear to have made, but I would like to hear an expert's view on exactly what deep learning is and how it compares to other machine learning techniques. I understand if you've already addressed this issue in your book and we can read your thoughts on this when it's published. But just a few sentences here might do a lot to clear up some misconceptions for fellow hn readers.


I think there are trade-offs. I apologize if our marketing copy seems excessive; I'd love to fix that, and I take feedback seriously. Please do email me.

Deep learning, done right and applied to unstructured data, is great either as part of an ensemble of methods or for working with hard-to-engineer features.

As for normal machine learning, where you are typically doing feature engineering, you need to understand what's going on to make recommendations for actionable insights. Everything has its place.

I'd like to echo Andrew Ng here: the hype around deep learning is overblown. While there are great results, it's not magic.

Neural networks still need feature scaling, among other things, to work well. Much of the hype comes from the giant marketing machine that is the PR firms of the research labs, which need new data scientists.
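
(For readers unfamiliar with the term, feature scaling usually just means standardizing each input feature before training; a minimal plain-numpy sketch, with the caveat that real pipelines reuse the training-set statistics on validation/test data:)

import numpy as np

def standardize(X, eps=1e-8):
    # Center each feature and scale it to unit variance, using training-set statistics.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / (sigma + eps), mu, sigma

X_train = np.random.rand(1000, 20) * 100.0  # toy features on wildly different scales
X_scaled, mu, sigma = standardize(X_train)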

While great work is being done in these labs, much of it isn't going to be applicable in the day-to-day work of a data scientist just trying to do some basic A/B testing. Hope that makes sense!


http://www.skymind.io/contact.html

Quote:" On a technical level, deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth; that is, the number of node layers through which data is passed in a multistep process of pattern recognition. More than three layers, including input and output, is deep learning. Anything less is machine learning. The number of layers affects the complexity of the features that can be identified."

Again, it's great what you are doing in general; I just brought this up as an example of the buzzwords and hype around deep learning.


I see. We can modify that ;). Thanks for the shout out. Criticisms are how we learn.


this is fixed jfyi


Well,

May I ask what makes a neural network into "not-a-black-box"?

You do mention "proven results". It seems to me that an experiment where one approach does better than another is compatible with one or both approaches being black boxes, i.e., there not being more of an explanation than "it works".

But if there's something more going on here, I would love to hear more details.


Sure. Neural networks don't have feature introspection; you use random forests for that. Hinton and others, as well as what's reflected in my software, encourage you to debug neural networks visually.

This could mean debugging neural networks with histograms to make sure the magnitude of your gradients isn't too large; rendering the first-layer filters if you're doing vision work, to see if the features are being learned properly; using t-SNE on neural word embeddings to look at groupings of words and check that they make sense; or, as Mikolov et al. do with word2vec, computing an accuracy measure on predicted words based on nearest-neighbor approaches.
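
A simplified illustration of the histogram and t-SNE checks described above, assuming you've already pulled a layer's gradients and an embedding matrix out of your own model (the arrays below are random stand-ins):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

grads = np.random.randn(256, 512) * 0.01  # stand-in: gradients of one layer's weights
embeddings = np.random.randn(500, 100)    # stand-in: learned word vectors

# 1) Gradient histogram: the distribution should be roughly centered and not blowing up;
#    a long tail of huge magnitudes suggests the learning rate is too high.
plt.hist(np.abs(grads).ravel(), bins=100)
plt.title("gradient magnitudes")
plt.show()

# 2) t-SNE of word embeddings: project to 2-D and eyeball whether related words cluster.
coords = TSNE(n_components=2).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title("t-SNE of word embeddings")
plt.show()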

For sound, one thing I was thinking of building was a playback mechanism: with Canova as our vectorization lib converting sound files to arrays, feed those into a neural network and then listen to what it reconstructs.

The takeaway is that while you can't rank features with information gain like you can with a random forest, you can at least go in not completely blind.

Remember, one of the key takeaways with deep learning is that it works well on unstructured data, i.e. things whose (manual) feature engineering is brittle to begin with.

Edit: re proven results

Audio: https://gigaom.com/2014/12/18/baidu-claims-deep-learning-bre...

Vision: http://www.33rdsquare.com/2015/02/microsoft-achieves-substan...

Text: http://nlp.stanford.edu/sentiment/

To further clarify what I mean by black box: I don't like typing:

model = new DeepLearning()

What does that mean?

How do I instantiate a combination architecture? Can you create a generic neural net that mixes convolutional and recursive layers? What about recurrent? How do I use different optimization algos?

Not to pick on MLlib, but another example:

val logistic = new LogisticRegressionwithSomeOptimizationAlgorithm()

Why can't it be: val logistic = new Logistic.Builder().withOptimization(..)

The key here is visualization and configuration.
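
A toy sketch of the builder idea, written in plain Python to stay library-neutral (all class and method names here are made up for illustration, not any real library's API):

class NetConfig:
    def __init__(self):
        self.layers, self.optimizer, self.learning_rate = [], "sgd", 0.01

class NetBuilder:
    # Each with_* call returns the builder itself, so the configuration reads as one
    # explicit chain instead of being hidden inside a monolithic class name.
    def __init__(self):
        self._cfg = NetConfig()

    def with_layers(self, layers):
        self._cfg.layers = list(layers)
        return self

    def with_optimizer(self, name, learning_rate=0.01):
        self._cfg.optimizer, self._cfg.learning_rate = name, learning_rate
        return self

    def build(self):
        return self._cfg

# Usage: mixing layer types and choosing the optimization algo is visible at the call site.
net = (NetBuilder()
       .with_layers(["conv", "conv", "recurrent", "dense"])
       .with_optimizer("adagrad", learning_rate=0.05)
       .build())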

Hope that makes sense!


Hmm,

Perhaps a thing with lots of parameters and some tools/rules-of-thumb for debugging might be called a "gray box", whereas a "real" statistical model with a defined distribution, tests of hypothesis validity, and so forth could be called a "white box".


Right. Being able to troubleshoot models with some level of certainty is better than nothing. Tuning neural nets is still more of an art than a science in some respects; I'm not aiming for anything magical here. My first step with neural nets is to make them usable.


I work / used to work in ML/AI. I think he is saying that a "black box" approach is one that just shoves the methodology onto everything without regard for first structuring and analyzing the problem, which is VERY common in the field. I respect his work greatly.


I'd like to add that one thing that makes Deeplearning4j's neural networks "not a black box" is the fact that they are open-source. Many deep-learning startups don't have their entire code base available on Github.


First of all, not all neural nets are deep, and many people discussing "deep" learning actually know the difference.

Secondly, if you'd like to dispute the series of records broken by deep learning across many benchmark datasets, feel free to take that up with Geoff Hinton and Andrew Ng. Skymind is hardly saying anything new when it points to the advances deep learning has made in unsupervised data.


Meh, people just like simple solutions (or at least those which sound simple).


The research project this is part of was discussed almost 3 years ago:

https://news.ycombinator.com/item?id=4779647

There are lots of other results for 'deep networks' in the search too (sorting by date, they appear quite consistently; many of them don't get many votes, though).


I think it went 'mainstream' around 2007, with Hinton's Tech Talk at Google: http://www.youtube.com/watch?v=AyzOUbkUf3M Even being pretty far from the ML/AI community at the time, I remember playing with Bengio's GPU-based Theano Deep Learning Tutorial around 2008/09. What has happened now is just that it has finally started beating SVMs consistently and is fast enough to be used for practical purposes.


Hinton's Coursera class a couple years ago pulled me in. I wish there were more Coursera classes at that level.


Deep learning has hit the front page regularly for over a year now.


PS: The human being beaten was a guy from Stanford who labeled the validation set for this specific image recognition task a while ago (he said somewhere that he could reduce the error rate to 3% given a second chance ;)

The same week, Microsoft released a similar result, but without such an important new approach for networks.

The important chart is really the one on page 7.


You're referring to my Google+ post from yesterday :) https://plus.google.com/+AndrejKarpathy/posts/dwDNcBuWTWf

The short story is that I tried it on 500 images and got a 5.1% error rate on the test set. Looking at my mistakes, I then tried to differentiate two sources of error: genuine problems with the dataset (e.g. many correct answers in an image, or an incorrect label), and errors I felt could be eliminated by an ensemble of very committed humans who were even better than me at classifying dogs :) And that optimistic error rate is approx 3%.


I get that this is a preprint, but the reference on page 8 has an overfull hbox with a rather high badness.


I don't know if the title is right. Beats humans in ... what, the speed of neural nets? (In image recognition, by the way, as the paper explains.) I also don't exactly understand what the breakthrough is here; there are plenty of papers, hardware, etc. that speed up some process significantly. A 10x speedup won't bring a huge breakthrough in deep learning in general.


A 10x speedup in training is a huge deal, since modern nets take 2-3 weeks to train on 4 GPUs. Note that the speedup is not actually in the raw forward-backward computation speed of the network. The speedup here comes from a more clever form of the forward function, which makes the network converge faster as you train it. In fact, the network in this paper is slower per step, because they added the normalization computation into it.

At test time (once training is finished), they are very efficient though, on the order of milliseconds per image.
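
For reference, the extra per-layer computation being described is the batch normalizing transform from the paper: each activation is normalized using mini-batch statistics (batch size m) and then scaled and shifted by the learned parameters gamma and beta:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta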


(responding here because your comment addresses the actual new content of the paper).

Four things pop out at me from the paper:

1) The whitening (per batch) & rescaling (overall) is a neat new idea. But (as referred to in their p. 5 comment about the bias term being subsumed: since BN subtracts the mini-batch mean, BN(Wu+b) = BN(Wu), so b is redundant) this also points to the idea that the (Wu+b) transformation probably has a better-for-learning 'factorization', since their un-scale/re-scale operation on (Wu+b) is mainly taking out such a factor (while also putting in the minibatch accumulation change).

2) The idea that this could replace Dropout as the go-to trick for speeding up learning is pretty worrying (IMHO), since the gains from Dropout seem to be in a 'meta network' direction, rather than a data-dependency direction. Both approaches seem well worth understanding more thoroughly, even though the 'do what works' ML approach might favour leaving Dropout behind.

3) The publication of this paper, so closely behind the new ReLu+ results from Microsoft, seems too coincidental. One has to wonder what other results each of the companies has in their back-pockets so that they can repeatedly steal the crown from each other.

4) For me, the application to MNIST is attention-grabbing enough. While I appreciate that playing with Inception (etc.) sexes up the paper a lot, it raises the hurdle for others who may not have that quantity of hardware to contribute to the (more interesting) project of improving learning rates across the board (which is quite possible to do on the MNIST dataset, except that MNIST is pretty much 'solved', with the remaining error cases being pretty questionable for humans too).


I'm looking forward to people applying this result to Recurrent Neural Networks for speech recognition tasks. What do you think?


It looks like significantly better training accuracy is achieved in a small fraction of the number of training steps previously required. ~14x speedup in training is a big deal; that speedup would enable training many more models (trying new research ideas) in a given period of time.


People would kill for a 10x speed up in training. You have no idea what you're talking about. How well a neural net performs is partly based on how much data it trains on. If you are able to train on 10x as much data in the same period, you're going to have much better nets. This is very important when you start trying to implement real-time solutions.


It accelerates training. From the conclusion: "We have presented a novel mechanism for dramatically accelerating the training of deep networks"


We changed the title from "Googles breakthrough paper shows 10*faster neural nets[pdf]".



