
Why Deep Learning Works II: the Renormalization Group - miket
https://charlesmartin14.wordpress.com/2015/04/01/why-deep-learning-works-ii-the-renormalization-group/
======
albertzeyer
Note: This is about unsupervised learning and mostly about RBMs/DBNs. Most of
the Deep Learning success is about supervised learning. In the past, RBMs
were used for unsupervised pretraining of the model; nowadays, however,
everyone uses supervised pretraining.

And the famous DeepMind work (Atari games etc.) is mostly about reinforcement
learning, which is again different.

~~~
BenderV
Well, if I understood correctly, the DeepMind RL implementation is basically
making an RL algorithm work with a supervised model.

~~~
sieisteinmodel
This has been done since the 90s. The Deepmind paper is about a few more
tricks.

------
milesf
Okay, I confess. I really didn't understand most of that post. It sounds
really smart, but someone will have to vouch that it's legit, because the
picture of Kadanoff cuddling Cookie Monster triggered my baloney detector
[https://charlesmartin14.files.wordpress.com/2015/04/kadanoff...](https://charlesmartin14.files.wordpress.com/2015/04/kadanoff.jpeg)

~~~
TravisDick
I don't mean to be super negative, but because of the general tone early in
the article and some sloppy notation, I never finished reading. I think the
goal of an article like this should be to give a high-level intuitive
explanation for some technical result, rather than sounding smart or
complicated.

First, it is a little weird to me to talk about "old-school ML" as learning
maps from inputs to hidden features. That seems neither old, nor very
representative of the field of Machine Learning as a whole. It's also weird to
say that RBMs and other deep learning algorithms are formulated using
classical statistical mechanics. Moreover, implying that this scary-sounding
formulation is the reason they are interesting seems like an attempt at
sounding smart. Typically there are many ways to motivate and derive different
algorithms, and it is /useful/ to acknowledge the multiple viewpoints because
they often give different insights.

Second, the section about flow maps and fixed points seems to make a mess out
of the notation by either being unclear or disagreeing with standard notation.
What is meant by the notation "f(X) -> X"? Presumably this means something
like f is a function that maps elements of the set X to elements in the set X.
More standard notation for this would be something like "f: X -> X". Perhaps
it means that the image of the set X under the function f is again the set X.
But does that require that f be a surjective function? Confusingly, it also
looks like the function f might be required to be the identity function, but
given the context this is clearly not the intended interpretation.

When defining the fixed point, it seems that it would be more natural to say
that x is a fixed point of f if f(x) = x. That is, x is fixed or unmoved by
the function f. It turns out that for contractions (and some other functions,
too), the sequence f(x), f(f(x)), f(f(f(x))), and so on is guaranteed to
converge to a unique fixed point of f. The notation f^n typically refers to
the function f being applied n times, which is not the usage in the article.
In the article, f^1, f^2, and so on are all identical copies of the function
f. Using the standard notation, the definition of f_infty would be f_infty(x)
= lim_{n -> infty} f^n(x). And, in the case of a contraction, the Banach fixed
point theorem gives that f_infty is well-defined, and there exists a unique
x_fix in X so that f_infty(x) = x_fix for all x in X (i.e., iterating f
repeatedly converges to a unique fixed point x_fix of the function f).
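
To make the iteration concrete, here is a tiny Python sketch (my own toy
example, not from the article) using the contraction f(x) = x/2 + 1, whose
unique fixed point is x = 2:

```python
# Fixed-point iteration for a contraction: by the Banach fixed point
# theorem, f(x), f(f(x)), f(f(f(x))), ... converges to the unique fixed
# point from any starting value (here x_fix = 2, since f(2) = 2).

def f(x):
    return x / 2 + 1

def iterate(f, x, n):
    """Compute f^n(x), i.e. apply f to x a total of n times."""
    for _ in range(n):
        x = f(x)
    return x

for x0 in (-10.0, 0.0, 100.0):
    print(x0, iterate(f, x0, 50))  # each result is (numerically) 2.0
```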

These things do not necessarily mean that the article is uninteresting or
uninformative or even technically incorrect. But if the author didn't take the
time to make the simple things clear, then I'm not sure that I want to read
the rest.

Sorry for the rant.

~~~
sieisteinmodel
Well, there is more.

E.g. abbreviating deep belief nets as DBM, which is the commonly used
acronym for deep Boltzmann machines. These are similar but nonetheless very
different models. Calling an RBM an encoder is not that far-fetched, but there
are many differences between autoencoders and RBMs. He eventually claims an
RBM minimises reconstruction error, which is just plain wrong and shows that
this guy has absolutely no clue what he is writing about.

~~~
charleshmartin
'Technically' this is correct--the RBM CD algo is not minimizing this
function; that's not the point.

It is known that when training an RBM, the reconstruction error decreases but
not monotonically; in fact it fluctuates. In the words of Hinton, 'trust it
but don't use it'.

[http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf](http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf)
(which is cited in the post as well)

So in a global sense, yes, I would say that the RBM does eventually minimize
the reconstruction error even though it fluctuates.
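
For concreteness, here is a minimal NumPy sketch of CD-1 training for a tiny
binary RBM that tracks the reconstruction error each epoch (my own
illustration, not code from the post or from Hinton's guide; the data and
layer sizes are arbitrary). The point is only that the error is monitored as
a diagnostic, not as the quantity CD is actually following, and nothing
forces it to decrease monotonically:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary data and a small RBM (sizes chosen only for illustration).
X = (rng.random((200, 20)) > 0.5).astype(float)
n_visible, n_hidden, lr = 20, 10, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

for epoch in range(20):
    # CD-1: one Gibbs step starting from the data.
    p_h = sigmoid(X @ W + b_h)                    # P(h=1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)                  # reconstruction P(v=1 | h)
    p_h_recon = sigmoid(p_v @ W + b_h)

    # Contrastive divergence update (positive phase minus negative phase).
    W += lr * (X.T @ p_h - p_v.T @ p_h_recon) / len(X)
    b_v += lr * (X - p_v).mean(axis=0)
    b_h += lr * (p_h - p_h_recon).mean(axis=0)

    # Reconstruction error: a sanity check, not the training objective.
    err = np.mean((X - p_v) ** 2)
    print(f"epoch {epoch:2d}  reconstruction error {err:.4f}")
```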

I can even offer a conjecture here on why the error fluctuates: in a discrete
RG flow map, there could be finite-size effects that would give log-periodic
fluctuations. This is a stretch, but it is something that could be tested.

I explain this idea here [http://charlesmartin14.wordpress.com/2015/01/16/the-
bitcoin-...](http://charlesmartin14.wordpress.com/2015/01/16/the-bitcoin-
crash-and-how-nature-works/)

As to stacking the RBMs to form a DBN: yeah, that's the point. "Hinton showed
that RBMs can be stacked and trained in a greedy manner to form so-called Deep
Belief Networks (DBN)"
[http://deeplearning.net/tutorial/DBN.html](http://deeplearning.net/tutorial/DBN.html)
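
In case it helps, here is a rough sketch of what "stacked and trained in a
greedy manner" means; train_rbm and hidden_probs are hypothetical helpers
(e.g. wrappers around the CD-1 sketch above), and the layer sizes are
arbitrary:

```python
# Greedy layer-wise pretraining of a DBN: train an RBM on the data, then
# feed its hidden-unit probabilities in as the "data" for the next RBM.
# train_rbm and hidden_probs are hypothetical helpers, not a real library API.

def pretrain_dbn(X, layer_sizes, epochs=20):
    rbms, layer_input = [], X
    for n_hidden in layer_sizes:
        rbm = train_rbm(layer_input, n_hidden, epochs)  # hypothetical helper
        layer_input = hidden_probs(rbm, layer_input)    # hypothetical helper
        rbms.append(rbm)
    return rbms  # this stack of RBMs defines the layers of the DBN

# e.g. dbn = pretrain_dbn(X, layer_sizes=[500, 250, 30])
```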

------
paulpauper
I think this is similar to the scaling theory of the stock market, which uses
scale-invariant geometric objects to represent stock market energy levels.

[http://greyenlightenment.com/sornette-vs-taleb-
debate/](http://greyenlightenment.com/sornette-vs-taleb-debate/)

Sornette’s 2013 TED video, in which he predicts an imminent stock market crash
due to some ‘power law’, is also wrong because two years later the stock
market has continued to rally.

You write on your blog:

 _These kinds of crashes are not caused by external events or bad players–they
are endemic to all markets and result from the cooperative actions of all
participants._

Easier said than done. I don't think the log-periodic theory is a holy grail
for making money in the market. There are too many instances where it has
failed, but you cherry-picked a single example with bitcoin where it could
have worked.

~~~
charleshmartin
It is easier to apply the Sornette theory to antibubbles.

Bitcoin seemed like a great example.

I gotta go back and see how well the predictions actually worked.

------
fizixer
One way to think of it is that:

There are connections between Deep Learning and Theoretical Physics because
there are (even stronger) connections between Information Theory and
Statistical Mechanics.

------
sgt101
I don't like the assertion at all because so many techniques are held to be
"deep learning" and because even when specific techniques are built on an
analogy of this sort (think Simulated Annealing and Genetic Algorithms) they
do not work "because" they are "like" the physical processes that served as an
inspiration.

Names are useful, but only as an aid to thinking. Does this help us think
about these techniques?

------
reader5000
Is the "group" in renormalization group the same "group" in group theory?

~~~
ximeng
oneloop, your comment looks helpful, but you are hell-banned; you may want to
email HN to be reinstated.

\--

Edit - he's been reinstated.

~~~
chmartin
what does hell-banned mean?

~~~
ximeng
Comments are marked as dead (can only be seen by people with showdead set in
their profile), but appear visible to the user so that they may not realise
why nobody responds to them or upvotes their comments. Normally a punishment
for bad behaviour or poor commenting, but seems inappropriate in this case.

------
jmount
I think a key difference is that the physics renormalization structures use
fairly regular or uniform weights, while deep learning plays a lot with the
weights. So there are going to be pretty big differences in behavior.

------
noobermin
And here I thought the renormalization group had no application outside high
energy physics and condensed matter. Maybe I should have stuck with HEP after
all.

------
octatoan
No MathJax. I am disappoint.

------
curiousjorge
It always depresses me when I read anything with math formulas and esoteric
terms, a constant reminder of my lifelong incompetence with math and
university calculus courses.

~~~
abecedarius
I expect very few people learned from this post; I didn't, and I kinda like
math. This was the sort of math writing that makes sense only given nearly the
same background as the author. (Someone above posted specific complaints about
the unclear notation.)

