
Why does deep and cheap learning work so well? - mauliknshah
https://arxiv.org/abs/1608.08225
======
heyitsguay
A year ago, a group of us at UMD took a couple of weeks to go through this
paper. It has some interesting insights, but, like all theory papers right now,
it leaves gaps. The construction they present for learning Hamiltonians of low
polynomial order doesn't look much like any common production neural network
module, and the justification for why we would only be dealing with that
Hamiltonian family in practice is unconvincing to me. That said, overall it's
worth a close read. Parts A and B of Section 2 are the best summary of the
connection between probability theory and deep learning that I have come
across.

------
tedsanders
Here's a related 2016 talk by Max Tegmark (second author) on connections
between deep learning and physics:

[https://www.youtube.com/watch?v=5MdSE-N0bxs](https://www.youtube.com/watch?v=5MdSE-N0bxs)

The gist of it is that physical data tends to have symmetries, and these
symmetries make the data highly compressible, so it can be described by
relatively small neural circuits. Random data does not have this property and
cannot be learned easily. Super fascinating.
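
As a toy illustration of that compressibility point (my own example, not one
from the talk): a translation-invariant linear map on n inputs is fully
specified by a single length-n filter, whereas a generic dense linear map
needs n^2 weights.

    import numpy as np

    # Toy illustration: translation symmetry lets a linear map be stored as a
    # single filter (circular convolution) instead of a full n x n matrix.
    n = 64
    rng = np.random.default_rng(0)

    filt = rng.normal(size=n)        # n parameters (symmetric/shared case)
    dense = rng.normal(size=(n, n))  # n*n parameters (generic case)

    x = rng.normal(size=n)
    y_symmetric = np.real(np.fft.ifft(np.fft.fft(filt) * np.fft.fft(x)))  # circular convolution
    y_generic = dense @ x

    print("parameters with translation symmetry:", filt.size)    # 64
    print("parameters without the symmetry:     ", dense.size)   # 4096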

~~~
3pt14159
In that case, the fact that our minds run on similar substrates is probably
non-coincidental.

~~~
akvadrako
Indeed, the connections seem profound. It seems to be a general-purpose
optimal algorithm for, well, optimisation. And that would explain why the
universe, our brains and AIs all trend toward it.

It could also be just that intelligence tends to mirror the outside world, but
that seems a bit arbitrary.

~~~
vanderZwan
> It could also be just that intelligence tends to mirror the outside world,
> but that seems a bit arbitrary.

We are part of the universe, so why would it be arbitrary if our brains were
structured in ways that match typical structures found in this universe?

~~~
sitkack
[https://en.wikipedia.org/wiki/Panpsychism](https://en.wikipedia.org/wiki/Panpsychism)

~~~
vanderZwan
I was thinking more along the lines that our minds make sense of outside
information partly by simulating it, so it does not seem so strange if their
structures end up matching the outside structures through some form of
convergent evolution.

From a cursory reading of that article, I do not see it arguing the same
thing.

~~~
sitkack
That the universe is the best simulator of itself? That, say, when simulating
water flowing through a pipe, the system doing the simulation reformulates
itself into something that resembles the pipe and the water?

~~~
vanderZwan
> That the universe is the best simulator of itself?

What is this even supposed to mean? Also, "pipe" and "water" are
_ridiculously_ high-level constructs, categorisations made by humans. Neither
says anything about structures inherent to the universe.

I mean that when working with symmetries, information flow, and fundamental
building blocks, certain structures just tend to pop up naturally. Hence
fractals and geometric shapes in places where you might not expect them. Or
how the laws of thermodynamics seem to be everywhere in biology now that we
have started looking[0][1].

[0] [http://ecologicalsociology.blogspot.de/2010/11/geoffrey-west-on-scaling-rules.html](http://ecologicalsociology.blogspot.de/2010/11/geoffrey-west-on-scaling-rules.html)

[1] [https://www.quantamagazine.org/a-thermodynamic-answer-to-why-birds-migrate-20180507/](https://www.quantamagazine.org/a-thermodynamic-answer-to-why-birds-migrate-20180507/)

------
chatmasta
I really like this kind of cross-disciplinary research and knowledge transfer.
It should happen more often.

It’s interesting that so many equations governing different laws in different
fields actually share quite a few properties, and on deeper analysis, can be
explained by a single mathematical property. It makes me wonder how many
insights we are missing simply because they were discovered in another field,
under a different name, for a different purpose.

~~~
bcherny
I recall a project (out of MIT, I think) that used category theory and
knowledge mapping to discover isomorphisms between the knowledge graphs of
different fields like math, physics, and economics.

~~~
mitchtbaum
I found something similar with topology, graph theory, ...

links??

------
crb002
Learning a transformation from the full transformation semigroup is the most
general case. Consider a mystery unary CPU operation M on a 64-bit register.
How deep a circuit do you need to compute an arbitrary such transformation M?
By a Kolmogorov-style counting argument, just writing out a random
transformation as a lookup table takes (2^64)*64 bits, and you need a depth on
the order of the log of that table size to get a result out efficiently. These
results were already proved in information theory. If you use a depth less
than the log of the lookup table size, you are screwed unless your function is
extremely non-random.
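
For concreteness, a back-of-the-envelope sketch of those numbers (my own
illustration of the counting argument above, not anything from the paper):

    from math import log2

    # A random unary operation on a 64-bit register, stored as a lookup table:
    # 2^64 rows, each holding a 64-bit output.
    BITS = 64
    entries = 2 ** BITS
    table_bits = entries * BITS            # (2^64) * 64 = 2^70 bits

    print("lookup table size: 2^%d bits" % log2(table_bits))   # 2^70 bits
    print("log2(table size) ~ %.0f" % log2(table_bits))        # ~70, the depth scale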

~~~
zerostar07
it's not the same results, their point was to justify it on physical grounds

------
petters
> we show that n variables cannot be multiplied using fewer than 2^n neurons
> in a single hidden layer.

I don't know, but it just feels like this should have been known in CS earlier
than 2016. Circuit complexity has been studied for a long time.

~~~
raverbashing
Depends on whether you want exact multiplication or not.

Especially if you have neurons with a non-linear response.

Remember that ln(a*b) = ln(a) + ln(b).
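
A minimal sketch of that log trick (my own example, assuming positive inputs
and idealized log/exp activations rather than standard ReLU or sigmoid units):
a product of n positive numbers then needs only n hidden units, not 2^n.

    import numpy as np

    def product_via_logs(x):
        """Multiply positive inputs by summing their logs, then exponentiating."""
        hidden = np.log(x)           # one "log unit" per input
        return np.exp(hidden.sum())  # single "exp unit" as the readout

    x = np.array([2.0, 3.0, 5.0])
    print(product_via_logs(x))       # 30.0, matches np.prod(x)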

------
zerostar07
This is a very nice paper that puts some "meaningful" names on what a neural
network computes. I like how eq. 8 neatly describes a neural network as a
readout of an energy Hamiltonian and an output distribution. They show that
real-world data, which is described by low-order polynomial Hamiltonians,
needs only a small number of units, and that "depth" gives the network its
compositional/hierarchical ability. Even though some of the theory goes over
my head, their main arguments seem to "fit" together very nicely.

So basically a deep network can "cheaply" (i.e., at non-fatal expense)
describe anything that occurs in nature, which is wonderful. I wonder,
however, what will happen when we move to higher cognition and meta-cognition,
which require the readout of network states that are not found in nature but
are generated internally. It would be interesting to know whether we need much
more brain or only a little more. In any case, a very interesting read.

------
jdonaldson
Love to see more papers like this. I remember a recent paper that showed
decent model performance when the model was allowed to pick its own activation
functions. It was picking wacky things like sine waves. You'd like to learn
something about the model by the way it configures itself, but right now we
can only understand simple model features.

~~~
ralusek
That's so weird. I can't help but feel that this would only ever result in the
model coming up with something that initially got it a better result, with the
rest of the network then adjusting around it and carrying the weight of that
weirdness.

What IS interesting is that it shows just how resilient the resulting networks
really are, in that they can support such weirdo activations.

But there are other elements of introspection that should absolutely be
handled by neural nets; things like choosing how deep a given network needs to
be should be handled by other neural nets specifically looking for trends
during training.

~~~
jdonaldson
Yeah, I agree about smarter parameter search and training. That was a big
theme at ICLR this year.

Fwiw, here's the paper on RNN architecture generation referenced earlier.

[https://arxiv.org/abs/1712.07316](https://arxiv.org/abs/1712.07316)

------
electricslpnsld
I am not a physicist (IANAP), so maybe I'm misrecalling some definitions, but
isn't the restriction to Hamiltonians a bit, well, restrictive? This limits
the results to path-independent potentials, which covers basically nothing in
the non-spherical-cow world. Are the authors working from a different
definition of Hamiltonian?

Does it matter if your Hamiltonian is smooth or if you are working from a
discrete theory?

~~~
rotskoff
In this context, the Hamiltonian is just the negative log of the conditional
probability of the input data given some parameters (the class, in the paper's
setup). In fact, such functions are much _more_ diverse than the functions
defined by neural networks (which, though there are many variants, are
basically all compositions of linear and nonlinear functions). The question
being asked in the paper is which Hamiltonians can be robustly approximated by
neural networks. The authors then argue that the class of Hamiltonians that
appears in nature is simple enough that neural nets do a decent job.
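
To make that correspondence concrete, here is a toy numerical sketch (the
quadratic energy below is my own stand-in, not the paper's construction):
class probabilities come from exponentiating negative per-class Hamiltonians
and normalizing, i.e. p(y|x) is proportional to exp(-H_y(x)).

    import numpy as np

    def softmax(z):
        z = z - z.max()              # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def H(x, mu):
        """Toy quadratic Hamiltonian: energy grows with distance from a class mean."""
        return 0.5 * np.sum((x - mu) ** 2)

    means = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # one "class" per mean
    x = np.array([0.9, 1.1])
    p = softmax(np.array([-H(x, m) for m in means]))       # p(y|x) proportional to exp(-H_y(x))
    print(p)                                                # most mass on class 1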

------
tomdre
One of my dreams is to do something similar by simplifying neural nets via
(bi)simulation (something along the lines of quotient automata).
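
As a toy illustration of what such a quotient could look like (my own sketch,
not any published method): if two hidden units in a one-hidden-layer ReLU net
have identical incoming weights and bias, they are behaviorally equivalent and
can be merged by summing their outgoing weights without changing the network's
function.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def forward(x, W1, b1, w2):
        return w2 @ relu(W1 @ x + b1)

    rng = np.random.default_rng(1)
    W1 = rng.normal(size=(3, 4)); W1[2] = W1[0]   # hidden unit 2 duplicates unit 0
    b1 = rng.normal(size=3);      b1[2] = b1[0]
    w2 = rng.normal(size=3)

    # Quotient: drop unit 2 and fold its outgoing weight into unit 0.
    W1q, b1q = W1[:2], b1[:2]
    w2q = np.array([w2[0] + w2[2], w2[1]])

    x = rng.normal(size=4)
    print(np.allclose(forward(x, W1, b1, w2), forward(x, W1q, b1q, w2q)))  # True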

~~~
long
I think you will like this paper then:
[https://arxiv.org/abs/1711.09576](https://arxiv.org/abs/1711.09576)

------
sp332
Man, it's good to see some actual information theory applied to deep learning
for a change.

~~~
mathinpens
I'm sorry, but this comment is just at odds with the current literature in
deep learning. A few very prominent researchers have been trying to use
information-theoretic concepts to explain generalization for a while now; see
the somewhat controversial branch Naftali Tishby developed. And many other
papers apply 'actual information theory' as well:

[https://arxiv.org/search/?query=deep+learning+information+th...](https://arxiv.org/search/?query=deep+learning+information+theory&searchtype=all&order=&size=50)

~~~
derefr
I think the GP poster meant to contrast this paper not with other papers, but
with the junk explanations for "how deep learning works" that science
journalism comes up with.

~~~
mathinpens
Trust me, some of the junk explanations for how deep learning works originate
from researchers. We really just don't have a convincing theoretical picture
of why these networks do or don't work, only empirical/heuristic handwaves.

~~~
crb002
Bollocks; see my comment. We know worst-case bounds for Kolmogorov-random
transformations of a finite set, the most complex learning task there is.

~~~
mathinpens
huh? not even wrong......

------
blueprint
Same reason general relativity works. If you try to model results without a
fundamental principle governing how the system's contents operate, then you
are going to have some limitations.

~~~
nonbel
>"Same reason general relativity works."

Dark matter/energy?

~~~
contravariant
Well, technically general relativity has no real problems with either of those
(there shouldn't be, given that general relativity is the main justification
for both). It's the quantum mechanics behind them that's not really well
understood.

~~~
nonbel
>"the quantum mechanics behind them that's not really well understood."

I've never heard of it being understood at all.

------
megaman22
Probably because so little is actually driven by real data and so much by
marketing and hype.

------
bfirsh
Here's an HTML version of the paper if you're on a phone:
[https://www.arxiv-vanity.com/papers/1608.08225/](https://www.arxiv-vanity.com/papers/1608.08225/)

------
KasianFranks
It does not. It's about unsupervised learning and unlabeled data. It's the
next frontier. Automated feature engineering for triangulated datasets is the
real target.

