

Batch Normalization: Accelerating Deep Network Training [pdf] - singularity2001
http://arxiv.org/pdf/1502.03167v1.pdf

======
lars
Since this paper and [1] both beat human-level performance on ImageNet using
seemingly unrelated techniques, it seems like there's a clear path to the
next state-of-the-art result: implement and train a network that uses both
Batch Normalization and PReLUs at the same time. Right?

1:
[http://arxiv.org/pdf/1502.01852v1.pdf](http://arxiv.org/pdf/1502.01852v1.pdf)
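
For reference, the PReLU from [1] is just a rectifier with a learned slope on
the negative side; a minimal numpy sketch (the parameter name `a` is mine, for
illustration only):

    import numpy as np

    def prelu(x, a):
        # PReLU from [1]: identity for positive inputs, learned slope `a`
        # for negative inputs (a is trained along with the weights)
        return np.where(x > 0, x, a * x)

    print(prelu(np.array([-2.0, -0.5, 1.0, 3.0]), a=0.25))  # [-0.5 -0.125 1. 3.]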

~~~
dougabug
Exactly my thoughts. The gains from the two incremental improvements look
similar; it would be interesting to see what the combined effect is. It's
pretty clear that they're reaching the limitations of the data set, though.

------
hyperbovine
The phrase "exceeding the accuracy of human raters" is a bit puzzling--if a
human misclassifies something whose true class was determined by another
human, who is in error?

~~~
davmre
The comparison is to humans that see the same data as the algorithm. The
"true" labels can be assigned using additional data.

For example, if I make a dataset for the task of distinguishing cocker
spaniels from springer spaniels, then when I collect the pictures I can be
quite confident about the labels I assign (e.g. by DNA-testing the actual dogs).
But when someone else tries to identify the dogs based only on the pictures I
took, they might get some of them wrong.

The 5.1% "human accuracy" number that gets thrown around for ImageNet comes
from Andrej Karpathy attempting to manually label part of the ImageNet test
set. He wrote a blog post about the experience
([http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/](http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)),
which has some interesting details. For
example, they did actually find that 1.7% of images were genuinely
misannotated: that's because ImageNet annotations come from a consensus of
Mechanical Turkers, which is less reliable than DNA testing. :-)

------
freddealmeida
I find it fascinating that deep learning research is No. 1 on HN. Not even
6 months ago it was nearly unheard of in the top 3-4 pages. Is this an
indicator that machine intelligence is the new normal?

~~~
sz4kerto
Nah. Since neural networks started being called deep learning, they've become
very fashionable.

There's some well-funded startup (SkyMind maybe?) that declares on its
homepage that deep learning is sooo much better than machine learning (first
wtf). Then they explain that neural nets with fewer than 3 hidden layers are
machine learning, and with more than 3 layers it's deep learning (second wtf).

I didn't know if I should laugh or cry. (Not that what they actually seem to
be doing is bad, but this newfound hype is just wrong.)

~~~
agibsonccc
Adam (founder of Skymind) here. I'd like to say that we don't treat neural
networks as a black box the way a lot of the startups out there do.

I also mainly advocate neural networks on unstructured data, where the results
are proven to be a significant improvement over other techniques. Many
startups in the space would have you believe that you can have a simple GUI
and you're somehow set to go.

In my upcoming O'Reilly book, Deep Learning: A Practitioner's Approach, I go
through a practical, applications-oriented view of neural networks that I
think will help change things in the coming years.

I'd also add that I've implemented every neural network architecture and let
people mix and match. A significant result that many people are familiar with
is the image scene description generator done by Karpathy et al. [1].

Either way, unlike most, our code is open source :). I don't claim things are
ideal or perfect, but it's out there for you to play with. I focus on
integrations, packaging, and providing real value rather than pretending that
some algorithm is going to be my edge. Many startups in the space will tell
you they have awesome, cutting-edge algorithms, when in reality you should be
providing a solution for people.

[1]:
[https://github.com/karpathy/neuraltalk/tree/master/imagernn](https://github.com/karpathy/neuraltalk/tree/master/imagernn)

~~~
joe_the_user
Well,

May I ask what makes a neural network into "not-a-black-box"?

You do mention "proven results". It seems to me that an experiment where one
approach does better than another is compatible with one or both approaches
being black boxes, i.e. there not being more of an explanation than "it works".

But if there's something more going on here, I would love to hear more
details.

~~~
agibsonccc
Sure. Neural networks don't have feature introspection; you'd use a random
forest for that. Hinton and others (and what's reflected in my software)
encourage you to debug neural networks visually.

This could mean plotting histograms of your gradients to make sure their
magnitudes aren't too large; rendering the first-layer weights if you're doing
vision work, to check that the features are being learned properly; using
t-SNE on neural word embeddings to see whether the groupings of words make
sense; or, as Mikolov et al. do with word2vec, computing an accuracy measure
on predicted words via nearest-neighbor queries.
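
For instance, here's a rough sketch of the gradient-histogram idea in plain
numpy/matplotlib (the gradient array itself would come from whatever framework
you're training with):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_gradient_histogram(grads, layer_name):
        # Histogram the per-parameter gradient magnitudes for one layer.
        # A spike at zero hints at dead units; a long tail of very large
        # magnitudes hints that the gradient is blowing up.
        mags = np.abs(np.asarray(grads).ravel())
        plt.hist(mags, bins=50)
        plt.title("gradient magnitudes: " + layer_name)
        plt.xlabel("|dL/dw|")
        plt.ylabel("count")
        plt.show()

    # toy usage with a made-up gradient array
    plot_gradient_histogram(np.random.randn(256, 128) * 1e-3, "layer_1")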

For sound, one thing I was thinking of building is a playback mechanism: with
Canova as our vectorization lib that converts sound files to arrays, feed
those arrays into a neural network, and then listen to what it reconstructs.

The takeaway is that while you can't rank features by information gain the way
you can with a random forest, you at least don't have to go in completely
blind.

Remember, one of the key takeaways with deep learning is that it works well on
unstructured data, i.e. things whose manual feature engineering is brittle to
begin with.

Edit: re proven results

Audio: [https://gigaom.com/2014/12/18/baidu-claims-deep-learning-breakthrough-with-deep-speech/](https://gigaom.com/2014/12/18/baidu-claims-deep-learning-breakthrough-with-deep-speech/)

Vision: [http://www.33rdsquare.com/2015/02/microsoft-achieves-substantial-beyond.html](http://www.33rdsquare.com/2015/02/microsoft-achieves-substantial-beyond.html)

Text: [http://nlp.stanford.edu/sentiment/](http://nlp.stanford.edu/sentiment/)

To further clarify what I mean by black box: I don't like typing:

model = new DeepLearning()

What does that mean?

How do I instantiate a combination architecture? Can you create a generic
neural net that mixes convolutional and recursive layers? What about
recurrent? How do I use different optimization algos?

Not to pick on MLlib, but another example:

val logistic = new LogisticRegressionWithSomeOptimizationAlgorithm()

Why can't it be: val logistic = new Logistic.Builder().withOptimization(..)

The key here is visualization and configuration.
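
As a sketch of the configuration style I mean (the names below are made up
purely for illustration, not any particular library's API):

    class NetworkBuilder:
        # Hypothetical builder: every choice (layers, optimizer, learning
        # rate) is an explicit call instead of being hidden inside a single
        # DeepLearning() constructor.
        def __init__(self):
            self.layers = []
            self.optimizer = "sgd"
            self.learning_rate = 1e-2

        def add_layer(self, kind, **params):
            self.layers.append((kind, params))
            return self  # return self so calls chain

        def with_optimizer(self, name, learning_rate):
            self.optimizer = name
            self.learning_rate = learning_rate
            return self

        def build(self):
            # a real implementation would construct the layers here;
            # this sketch just returns the configuration it collected
            return {"layers": self.layers,
                    "optimizer": self.optimizer,
                    "lr": self.learning_rate}

    # mixing convolutional and recurrent layers, stated explicitly
    net = (NetworkBuilder()
           .add_layer("convolutional", filters=32, kernel=5)
           .add_layer("recurrent", units=128)
           .with_optimizer("adagrad", learning_rate=1e-2)
           .build())
    print(net)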

Hope that makes sense!

~~~
joe_the_user
Hmm,

Perhaps a thing with lots of parameters and some tools/rules-of-thumb for
debugging might be called a "gray box", whereas a "real" statistical model
with a defined distribution, tests of hypothesis validity, and so forth could
be called a "white box".

~~~
agibsonccc
Right. Being able to troubleshoot models with some level of certainty is
better than nothing. Tuning neural nets is still more of an art than a science
in some respects; I'm not aiming for anything magical here. My first step with
neural nets is to make them usable.

------
singularity2001
PS: The human being beaten was a guy from Stanford who did the validation set
for this specific image recognition task a while ago (he said somewhere that
he could reduce the error rate to 3% given a second chance ;)

This same week Microsoft released a similar result, but without as important a
new approach to training networks.

The really important chart is the one on page seven.

~~~
karpathy
You're referring to my Google+ post from yesterday :)
[https://plus.google.com/+AndrejKarpathy/posts/dwDNcBuWTWf](https://plus.google.com/+AndrejKarpathy/posts/dwDNcBuWTWf)

The short story is that I did this for 500 images and got 5.1% error on the
test set. Looking at my mistakes, I then tried to separate two sources of
error: genuine problems with the dataset (e.g. many correct answers in an
image, or an incorrect label), and errors I felt could be eliminated by an
ensemble of very committed humans who were even better than me at classifying
dogs :) That optimistic error rate is approx 3%.

------
pipeep
I get that this is a preprint, but the reference on page 8 has an overfull
hbox with rather high badness.

------
sz4kerto
I don't know if the title is right. Beats humans in... what, the speed of
neural nets? (In image recognition, by the way, as the paper explains.) I also
don't exactly understand what the breakthrough is here -- there are plenty of
papers, hardware, etc. that speed up some process significantly. A 10x speedup
won't bring a huge breakthrough to deep learning in general.

~~~
karpathy
A 10x speedup in training is a huge deal, since modern nets take 2-3 weeks to
train on 4 GPUs. Note that the speedup is not actually in the raw
forward-backward speed of the computation in the network. The speedup here
comes from a more clever form of the forward function, which makes the network
converge faster as you train it. In fact, they have made the network slower in
this paper, because they added the normalization computation into it.

At test time (once training is finished), these networks are very efficient
though, on the order of milliseconds per image.
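
For anyone who hasn't read the paper yet, here's a minimal numpy sketch of the
batch-norm transform they insert (training mode; gamma and beta are the
learned scale and shift):

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch_size, num_features) minibatch of layer inputs
        # gamma, beta: (num_features,) learned scale and shift
        mu = x.mean(axis=0)                    # per-feature minibatch mean
        var = x.var(axis=0)                    # per-feature minibatch variance
        x_hat = (x - mu) / np.sqrt(var + eps)  # ~zero mean, unit variance
        return gamma * x_hat + beta            # learned scale/shift restores capacity

    # toy usage: badly scaled inputs come out normalized per feature
    x = np.random.randn(64, 100) * 5.0 + 3.0
    y = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
    print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # roughly 0 and 1

At test time the paper replaces the minibatch statistics with population
estimates accumulated during training, which is part of why inference stays
that cheap.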

~~~
mdda
(responding here because your comment addresses the actual new content of the
paper).

Four things pop out at me from the paper:

1) The per-batch whitening plus overall rescaling is a neat new idea. But (as
their p5 comment about the bias term being subsumed hints) it also suggests
that the (Wu+b) transformation probably has a factorization that is better for
learning, since their un-scale/re-scale operation on (Wu+b) is mainly taking
out such a factor (while also putting in the minibatch accumulation change).
(A quick numerical check of the bias point is sketched after point 4 below.)

2) The idea that this could replace Dropout as the go-to trick for speeding up
learning is pretty worrying (IMHO), since the gains from Dropout seem to come
from a 'meta network' direction rather than a data-dependency direction. Both
approaches seem well worth understanding more thoroughly, even though the 'do
what works' ML approach might favour leaving Dropout behind.

3) The publication of this paper so closely behind the new PReLU results from
Microsoft seems too coincidental. One has to wonder what other results each of
the companies has in their back pockets, so that they can repeatedly steal the
crown from each other.

4) For me, the application to MNIST is attention-grabbing enough. While I
appreciate that playing with Inception (etc.) sexes up the paper a lot, it
raises the hurdle for others who may not have that quantity of hardware but
could still contribute to the (more interesting) project of improving learning
rates across the board. That is quite possible to do on the MNIST dataset,
except that MNIST is pretty much 'solved', with the remaining error cases
being pretty questionable for humans too.
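
As mentioned under point 1, here is a quick numpy check of the 'bias subsumed'
remark: because the normalization subtracts the per-feature minibatch mean, a
constant bias added before BN cancels out exactly.

    import numpy as np

    def batch_norm(x, eps=1e-5):
        # per-feature normalization over the minibatch (gamma=1, beta=0 for brevity)
        return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

    rng = np.random.default_rng(0)
    u = rng.standard_normal((32, 10))   # minibatch of layer inputs
    W = rng.standard_normal((10, 4))    # weight matrix
    b = rng.standard_normal(4)          # bias term

    # the mean subtraction removes any constant added per feature,
    # so BN(Wu + b) equals BN(Wu) and b can simply be dropped
    print(np.allclose(batch_norm(u @ W + b), batch_norm(u @ W)))  # True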

