Why does deep and cheap learning work so well? (arxiv.org)
434 points by mauliknshah 10 months ago | 49 comments

A year ago, a group of us at UMD took a couple of weeks to go through this paper. It has some interesting insights, but like all theory papers right now, gaps remain. The construction they present for learning Hamiltonians of low polynomial order doesn't look much like any common production neural network module, and the justification for why we would only be dealing with that Hamiltonian family in practice is unconvincing to me. That said, overall it's worth a close read. Section 2, parts A and B, is the best summary of the connection between probability theory and deep learning that I have come across.

Here's a related 2016 talk by Max Tegmark (second author) on connections between deep learning and physics:


The gist of it is that physical data tends to have symmetries, and these symmetries make descriptions of the data very compressible into relatively small neural circuits. Random data does not have this property, and cannot be learned easily. Super fascinating.

Thank you!!! As a physics PhD, that was one of the first videos I found on deep learning, and having no idea what a big deal his insights were, I promptly forgot who the speaker was (remembering only that it was a name I knew intimately from my time in physics). I have frequently gone back to try to find it, without success.

In that case, the fact that our minds run on similar substrates is probably non-coincidental.

Indeed, the connections seem profound. It seems to be a general-purpose optimal algorithm for, well, optimisation. And that would explain why the universe, our brains and AIs all trend toward it.

It could also be just that intelligence tends to mirror the outside world, but that seems a bit arbitrary.

> It could also be just that intelligence tends to mirror the outside world, but that seems a bit arbitrary.

We are part of the universe, so why would it be arbitrary if our brains were structured in ways that match typical structures found in this universe?

I was thinking more along the lines that if our minds process outside information in a way that makes sense of that information, partially through simulating it, it does not seem so strange if the structures end up matching the outside structures through some form of convergent evolution.

From a cursory reading of that article I do not see it argue the same thing.

That the universe is the best simulator of itself? That say, simulating water flowing through a pipe, the system doing the simulation re-formulates itself into something that resembles the pipe and the water?

> That the universe is the best simulator of itself?

What is this even supposed to mean? Also, "pipe" and "water" are ridiculously high level constructs, categorisations made by humans. Neither says anything about structures inherent to the universe.

I mean that when working with symmetries, information flow, and fundamental building blocks, certain structures just tend to pop up naturally. Hence fractals and geometric shapes in places where you might not expect them. Or how the laws of thermodynamics seem to be everywhere in biology now that we have started looking[0][1].

[0] http://ecologicalsociology.blogspot.de/2010/11/geoffrey-west...

[1] https://www.quantamagazine.org/a-thermodynamic-answer-to-why...

I really like this kind of cross-disciplinary research and knowledge transfer. It should happen more often.

It’s interesting that so many equations governing different laws in different fields actually share quite a few properties, and on deeper analysis, can be explained by a single mathematical property. It makes me wonder how many insights we are missing simply because they were discovered in another field, under a different name, for a different purpose.

I recall a project (out of MIT, I think) that used category theory and knowledge mapping to discover isomorphisms between the knowledge graphs of different fields like math, physics, and economics.

I found something similar with topology, graph theory, ...


As above, so below.

As the Universe, so the Soul.


GnosticMedia - Alchemy and the Endocrine System - an interview with Jose Barrera

Despite the predictable down-votes, I wanted to say that I greatly enjoyed that interview.

Learning a transformation from the full transformation semigroup is the most general case. Consider a mystery unary CPU operation M on a 64-bit register. How deep a circuit do you need to compute an arbitrary such M? By a Kolmogorov-style counting argument, just writing out a random transformation as a lookup table takes (2^64)*64 bits; take the log of that to get the depth needed to read a result out efficiently. These results were already proved in information theory: if you use a depth less than the log of the lookup table size, you are screwed unless your function is extremely non-random.
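To make that counting argument concrete, here is a back-of-the-envelope sketch (my own, illustrative only, not a tight circuit-complexity bound):

```python
import math

# A unary operation M on a 64-bit register is some function {0,1}^64 -> {0,1}^64.
# A lookup table for an arbitrary (Kolmogorov-random) such M needs one
# 64-bit entry per possible input:
n = 64
table_bits = (2 ** n) * n        # 2^64 entries * 64 bits each = 2^70 bits

# The depth needed to read a result out of a tree over that table scales
# like the log of the table size:
depth_bound = math.log2(table_bits)

print(f"lookup table: 2^{int(depth_bound)} bits, depth ~ {depth_bound:.0f}")
```

So for a random 64-bit unary operation, the table alone is 2^70 bits, and any shallower readout than roughly log of that has to exploit structure in M.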

They're not the same results; the paper's point was to justify them on physical grounds.

> we show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.

I don't know, but it just feels like this should have been known in CS earlier than 2016. Circuit complexity has been studied for a long time.

Depends on whether you want an exact multiplication or not.

Especially if you have neurons with non-linear response

remember ln(a*b) = ln(a)+ln(b)
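For intuition (my own toy sketch, not from the paper): for positive inputs, that identity turns an n-way product into a sum, which is why smooth nonlinearities that can approximate ln and exp make approximate multiplication cheap:

```python
import math

def product_via_logs(xs):
    """Multiply positive numbers by summing their logs: a*b = exp(ln a + ln b)."""
    return math.exp(sum(math.log(x) for x in xs))

print(product_via_logs([2.0, 3.0, 4.0]))  # ~24.0
```

The 2^n lower bound in the quoted result applies to exact multiplication in a single hidden layer; the log trick is the kind of structure that lets depth and smooth activations sidestep it approximately.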

This is a very nice paper that puts some "meaningful" names to neural network function. I like how eq. 8 neatly describes a neural network as a readout of an energy Hamiltonian and an output distribution. They show that real-world data, which is described by low-order polynomial Hamiltonians, needs a small number of units, and that "depth" gives the network its compositional/hierarchical ability. Even though some of the theory goes over my head, their main arguments seem to "fit" together very nicely.

So basically a deep network can "cheaply" (i.e. not fatally expensively) describe anything that occurs in nature, which is wonderful. I wonder, however, what will happen when we move to higher cognition and meta-cognition, which require the readout of network states that are not found in nature but are generated internally. It would be interesting to know whether we need much more brain or just a little more. In any case, a very interesting read.

Love to see more papers like this. I remember a recent paper that showed decent model performance when the model was allowed to pick its own activation functions. It was picking wacky things like sine waves. You'd like to learn something about the model by the way it configures itself, but right now we can only understand simple model features.

That's so weird. I can't help but feel that it would only ever come up with something that initially got a better result, with the rest of the network adjusting around it and carrying the weight of that weirdness.

What IS interesting is that it shows just how resilient the networks that come out of this really are that they can support such weirdo activations.

But there are other elements of introspection that should absolutely be handled by neural nets; things like choosing how deep a given network needs to be should be decided by other neural nets specifically looking for trends during training.

Yeah, I agree about smarter parameter search and training. That was a big theme at ICLR this year.

Fwiw, here's the paper on RNN architecture generation referenced earlier.


I am not a Physicist (IANAP), so maybe I’m misrecalling some definitions, but isn’t the restriction to Hamiltonians a bit, well, restrictive? This limits the results to path independent potentials, which is basically nothing in the non-spherical cow world. Are the authors working from a different definition of Hamiltonian?

Does it matter if your Hamiltonian is smooth or if you are working from a discrete theory?

In this context, the Hamiltonian is just the negative log of the conditional probability of the input data given some parameters. In fact, such functions are much more diverse than the functions defined by neural networks (which, though there are many variants, are basically all compositions of linear and nonlinear functions). The question being asked in the paper is which Hamiltonians can be robustly approximated by neural networks. The authors then argue that the class of Hamiltonians that appear in nature are simple enough that neural nets do a decent job.
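A toy illustration of this (my own example, assuming Gaussian data): the Hamiltonian H(x) = -ln p(x | mu, sigma) of a Gaussian is a second-order polynomial in x, i.e. exactly the "simple" low-order class the authors argue nature tends to produce:

```python
import math

# Toy example: the Hamiltonian of a Gaussian density,
# H(x) = -ln p(x | mu, sigma), is a quadratic (low-order polynomial) in x.
def gaussian_hamiltonian(x, mu=0.0, sigma=1.0):
    return 0.5 * ((x - mu) / sigma) ** 2 + 0.5 * math.log(2 * math.pi * sigma ** 2)

# Exponentiating the negated Hamiltonian recovers the density:
x = 1.5
density = math.exp(-gaussian_hamiltonian(x))
print(density)  # standard normal pdf at x = 1.5
```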

If you are asking the neural network to make a discrete classification or discrete prediction about some set of input data, you are almost by definition asking it a Hamiltonian class of question.

By way of a counter example, you could say “but it depends whether the person walked or took the bus!” In that case either you need to provide the data on whether they walked or took the bus (in which case your question collapses to a simple Hamiltonian form with the additional data) or you don’t provide the data and the question is fundamentally unanswerable in a simple discrete way.

One of my dreams is to do something similar by simplifying neural nets via (bi)simulation (something along the lines of quotient automata).
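A minimal sketch of one special case (hypothetical, my own construction): hidden units with identical incoming weights and biases compute the same activation, so they can be merged by summing their outgoing weights, leaving the network's function unchanged, much like collapsing equivalent states in a quotient automaton:

```python
import numpy as np

def merge_duplicate_units(W_in, b, W_out):
    """Merge hidden units with identical incoming weights/bias.

    W_in: (d_in, h), b: (h,), W_out: (h, d_out). Duplicate units compute
    the same activation, so their outgoing weights can simply be summed.
    """
    seen, cols, biases, rows = {}, [], [], []
    for j in range(W_in.shape[1]):
        key = (tuple(W_in[:, j]), float(b[j]))
        if key in seen:
            rows[seen[key]] = rows[seen[key]] + W_out[j]  # fold outgoing weights
        else:
            seen[key] = len(cols)
            cols.append(W_in[:, j])
            biases.append(b[j])
            rows.append(W_out[j].copy())
    return np.stack(cols, axis=1), np.array(biases), np.stack(rows)

def forward(x, W_in, b, W_out):
    return np.maximum(x @ W_in + b, 0.0) @ W_out  # one ReLU hidden layer

# Units 0 and 1 are exact duplicates; the quotient net has 2 hidden units
# instead of 3 but computes the same function.
W_in = np.array([[1.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
b = np.array([0.1, 0.1, 0.2])
W_out = np.array([[1.0], [2.0], [3.0]])
x = np.array([[0.5, -0.3]])

W2, b2, V2 = merge_duplicate_units(W_in, b, W_out)
print(forward(x, W_in, b, W_out), forward(x, W2, b2, V2))  # identical outputs
```

Real bisimulation-style reduction would of course merge units that are behaviorally equivalent, not just syntactically identical, but the quotient idea is the same.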

I think you will like this paper then: https://arxiv.org/abs/1711.09576

Man, it's good to see some actual information theory applied to deep learning for a change.

I'm sorry, but this comment is just at odds with the current literature in deep learning. A few very prominent researchers have been trying to use information-theoretic concepts to explain generalization for a while now; see the somewhat controversial branch Naftali Tishby developed. Many other papers apply 'actual information theory' as well.


I think the GP poster meant to contrast this paper not with other papers, but with the junk explanations for "how deep learning works" that science journalism comes up with.

Trust me, some of the junk explanations for how deep learning works originate from researchers. We really just don't have a convincing theoretical picture of why these networks work or don't work, as opposed to empirical/heuristic handwaves.

Bollocks; see my comment. We know worst-case bounds for Kolmogorov-random transformations of a finite set, the most complex learning task there is.

Huh? Not even wrong...

Good. I hope to see more of them posted to HN.

If you are interested in this type of analysis, are you familiar with the Tishby papers? You will at the very least find them enjoyable/mind-opening... but some of his key claims are probably contradicted by empirical results.

the original paper: https://arxiv.org/abs/1503.02406

A refutation (with very harsh/unprofessional words from Tishby in the counter-refutation):


I found his talk on YouTube; remarkable:


He's a good speaker/teacher, but I think there are serious problems with his work, mainly that a lot of his results collapse when tested on more general architectures.

There are papers posted to arXiv every day that use information theory to analyze deep learning...

Same reason general relativity works: if you try to model results without a fundamental principle of how the system operates, you are going to run into limitations.

>"Same reason general relativity works."

Dark matter/energy?

Well, technically general relativity has no real problems with either of those (there shouldn't be, given that general relativity is the main justification for both). It's the quantum mechanics behind them that's not really well understood.

>"the quantum mechanics behind them that's not really well understood."

I've never heard of it being understood at all.

They're posited as a fudge factor -because- the quantum mechanics is not understood.

Well that's actually my point, you are the one who got it

Large companies like Google and Facebook will easily be able to adapt to the change. They have an army of lawyers and the workforce for this. Smaller startups? They will face challenges, especially European ones. Most likely, this "innovation" will end up as one more pop-up with an "Accept or Leave" message on every website you visit from Europe for the first time.

Probably because so little is actually driven by real data and so much by marketing and hype.

Here's an HTML version of the paper if you're on a phone: https://www.arxiv-vanity.com/papers/1608.08225/

It does not. It's about unsupervised learning and unlabeled data. It's the next frontier. Automated feature engineering for triangulated datasets is the real target.
