
Batch Normalization for deep networks - rvarma
http://rohanvarma.me/Batch-Norm/
======
nafizh
This paper might be interesting regarding this post.

'Batch Normalization for Improved DNN Performance, My Ass'
[http://nyus.joshuawise.com/batchnorm.pdf](http://nyus.joshuawise.com/batchnorm.pdf)

~~~
laythea
That made me chuckle. So true. Thanks :)

------
chillee
A couple of things: 1. How can activations be below 0 in a ReLU network? It
seems like:

> h_out = z * (z > 0)

is the line that you're using to store activations. How can that be below 0
for any values of z?

EDIT: Whups, missed that he was storing the output of activations + batchnorm.
I think it would make more sense to just store the output of the activations.
The goal here is to show that batch norm provides good properties throughout
the entire network. As it stands, you're storing h_out/std(h_out), which
trivially normalizes your layers to have the same variance.
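
To make the distinction concrete, here is a minimal NumPy sketch (names like
h_out follow the quoted line; the shapes and the exact normalization are
assumptions for illustration, not the post's actual code):

    import numpy as np

    # Hypothetical layer output for one mini-batch.
    z = np.random.randn(128, 256)        # pre-activations
    h_out = z * (z > 0)                  # ReLU output: never below 0 by construction

    # Roughly the quantity the post appears to record, h_out / std(h_out):
    # it has unit variance per feature by construction, so its histogram
    # says little about the network.
    h_recorded = h_out / (h_out.std(axis=0) + 1e-8)

    # The quantity chillee suggests recording instead: the activation itself.
    print(h_out.std(axis=0)[:3], h_recorded.std(axis=0)[:3])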

2. Using initialization distributions to motivate batch norm is a little bit
misleading. Prior to batch norm, people had realized that initializing
uniformly was a problem, and had switched to using Xavier initialization
(specifically, a follow-up by Kaiming He found the initialization that works
best for ReLUs).
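
For reference, a minimal sketch of the two schemes mentioned, with made-up
layer sizes (the scale factors are the standard ones from the Glorot and He
papers):

    import numpy as np

    fan_in, fan_out = 784, 256   # illustrative layer sizes

    # Xavier/Glorot: variance 2 / (fan_in + fan_out), which keeps activation
    # variance roughly constant for symmetric activations such as tanh.
    W_xavier = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

    # He/Kaiming: variance 2 / fan_in, the ReLU correction -- ReLU zeroes half
    # the inputs, so the extra factor of 2 compensates for the halved variance.
    W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)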

I do think that the intuitions behind Xavier initialization and batch norm are
fairly similar. Although Xavier initialization fixes the variances initially,
batch norm allows you to maintain them throughout training.

Another cool thing about your post: my intuition was that Xavier
initialization isn't necessary if one is also using batch norm, and it's nice
to see that vindicated.

~~~
singularity2001
also there are 'leaky' ReLUs

f(x) = a*x if x < 0 else b*x

usually 0 < a << b (with b typically just 1)
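
For concreteness, a minimal sketch of that definition (the 0.01 slope is just
a common default, not anything from the post):

    import numpy as np

    def leaky_relu(x, a=0.01):
        # slope a for x < 0 and slope 1 for x >= 0, i.e. 0 < a << 1
        return np.where(x < 0, a * x, x)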

~~~
rvarma
I actually think the idea of using leaky ReLUs is interesting, because it'll
still provide a small gradient when x < 0, which perhaps may slightly
alleviate the vanishing gradients issue
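
A sketch of the corresponding derivative, under the same leaky_relu definition
as above, showing the small but nonzero gradient for x < 0:

    import numpy as np

    def leaky_relu_grad(x, a=0.01):
        # a plain ReLU's gradient is exactly 0 for x < 0; a leaky ReLU keeps a
        # small slope a there, so some gradient signal always flows back
        return np.where(x < 0, a, 1.0)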

