
  > With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. 
You and I have very different definitions of compression

https://news.ycombinator.com/item?id=41377398

  > Someone in the field could probably correct me on that.
^__^



The raw capacity of the network doesn't tell you how complex the weights actually are. The capacity is only an upper bound on the complexity.

It's easy to see this by noting that you can often prune networks quite a bit without any loss in performance. I.e. the effective dimension of the manifold the weights live on can be much, much smaller than the total capacity allows for. In fact, good regularization is exactly that which encourages the model itself to be compressible.
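For a concrete sense of the gap, here's a toy PyTorch sketch (the model and the 80% ratio are purely illustrative assumptions, not from any particular paper): magnitude-prune each layer, then count how much of the raw capacity was actually load-bearing.

    import torch
    import torch.nn.utils.prune as prune

    # Stand-in model; any trained torch.nn.Module works the same way.
    model = torch.nn.Sequential(
        torch.nn.Linear(784, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    )

    # Zero out the 80% of weights with the smallest magnitude per layer.
    linears = [m for m in model.modules() if isinstance(m, torch.nn.Linear)]
    for m in linears:
        prune.l1_unstructured(m, name="weight", amount=0.8)

    # Surviving fraction: a crude upper bound on how much of the raw
    # capacity the learned function actually uses.
    total = sum(m.weight.numel() for m in linears)
    nonzero = sum(int(m.weight.count_nonzero()) for m in linears)
    print(f"{nonzero}/{total} weights remain")

If validation accuracy is unchanged after a pass like this, the extra parameters were capacity, not complexity.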


I think you're confusing capacity with training dynamics.

Capacity is what it sounds like: the amount of information the network can express.

Training dynamics are the way the model learns, the optimization process, etc. So this is where things like regularization come into play.

There's also the architecture, which affects both the training dynamics and the model capacity. And none of that guarantees you get the most information-dense representation.

Fwiw, the authors did also try distillation.


Sorry I wasn't more clear! I'm referring to the Kolmogorov complexity of the network. The OP said:

> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

And they're not wrong! An ideally trained network could, in principle, learn the data-generating program, if that program is within its class of representable functions. I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).

You're right that there's no guarantee that the model finds the most "dense" representation. The goal of regularization is to encourage that, though!

All over the place in ML there are bounds like:

test loss <= train loss + model complexity

Hence minimizing model complexity improves generalization. This is a kind of Occam's razor: the simplest model that fits generalizes best. So the OP is on the right track - we definitely want networks to learn the "underlying" process that explains the data, which here would be a latent representation of the source code. (Well, except that comparison isn't quite fair: the source code gets to rely on drivers, hardware, an operating system, and the rest of the compute stack, while the neural net has no external resources to call and has to carry all of that complexity itself.)
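In code, that bound just shows up as a penalty term in the objective. A minimal sketch (using the squared L2 norm as the complexity proxy is an assumption on my part; any norm-based regularizer plays the same role):

    import torch

    def regularized_loss(model, batch_loss, lam=1e-4):
        # train loss + lam * complexity proxy (here: squared L2 norm).
        # Minimizing this trades a bit of train loss for a simpler
        # model, which is what tightens the test-loss bound above.
        complexity = sum(p.pow(2).sum() for p in model.parameters())
        return batch_loss + lam * complexity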


  > An ideally trained network could, in principle, learn the data-generating program
No disagreement

  > I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).
Also no disagreement.

I suggested that this probably isn't the case here, since they tried distillation and saw no effect. While that isn't proof that this particular model can't be compressed further, it does suggest the compression is non-trivial. This is especially true given the huge difference in size. I mean, we're talking about 700x...
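(For anyone following along: by distillation I mean the standard teacher-student setup, roughly the Hinton-style sketch below; the paper's exact recipe may differ.)

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # The student matches the teacher's softened output distribution;
        # T is the temperature, and T*T rescales the gradient magnitude.
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        log_probs = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_probs, soft_targets,
                        reduction="batchmean") * (T * T)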

Where I think we disagree is that I read the OP as talking about __this__ network. If we're talking about a theoretical network, then nothing I said disagrees with that - I even said in the post I linked that the size gap shows there's still a long way to go, but that this is still cool.

Why did I assume the OP meant __this__ network? Because we're in a thread about this specific paper, and the theoretical claim (which, to be fair, isn't actually supported by any mathematical theory) holds for so many things that it's a bit elementary. So it makes more sense (imo) to read the comment as being about this network, and I wanted to make clear that this network is nowhere near that kind of compression. Can further research eventually produce something smaller than the source code? Who knows - for all the reasons we've both mentioned. We know neural nets are universal approximators (which is not the same as universal mimics, and comes with its own limits), but we have no guarantee of global convergence, let alone proof that such a thing exists for many problems.

And I'm not sure why you're trying to explain the basic concepts to me. I mentioned I'm an ML researcher, and I see you're doing a PhD at Oxford. I'm sure you'd be annoyed if I did the same to you. We can talk at a different level.


Totally fair points all. Sorry if it came across as condescending!

I agree with you that this network probably has not found the source code or something like a minimal description in its weights.

Honestly, I'm writing a paper on model compression/complexity right now, so I may have co-opted the discussion to practice talking about these things...! Just a bit over-eager (,,>﹏<,,)

Have you given much thought to how we can encourage models to be more compressible? I'd love to be able to explicitly penalize the filesize during training, but in some usefully learnable way. Proxies like weight norm penalties have problems in the limit.
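For instance, the kind of thing I have in mind (purely a toy sketch of my own, not an established method): softly quantize the weights to a small codebook and penalize the entropy of the code usage, so the penalty roughly tracks bits-per-weight under entropy coding. The codebook size and temperature below are made-up values.

    import torch

    def entropy_bits_proxy(weights, centers, tau=0.1):
        # weights: flat 1-D tensor of parameters
        # centers: codebook values, e.g. torch.linspace(-1, 1, 16)
        # Soft assignment of each weight to each codebook center.
        dists = (weights.unsqueeze(1) - centers.unsqueeze(0)).pow(2)  # [N, K]
        probs = torch.softmax(-dists / tau, dim=1)                    # [N, K]
        # Average code-usage distribution across all weights.
        usage = probs.mean(dim=0)                                     # [K]
        # Entropy in bits ~ estimated bits per weight after coding.
        return -(usage * (usage + 1e-12).log2()).sum()

It's differentiable, so it can be added to the loss like any regularizer, but it shares the usual failure mode of proxies: the model can game the surrogate without the actual compressed size shrinking.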


Haha, totally fair - it happens to me too, but I'm trying to work on it.

I actually have some stuff I'm working on in that area that is having some success. I do need to extend it to diffusion but I see nothing stopping me.

Personally, I think a major slowdown for our community is its avoidance of math. You don't need tons of math in the papers, but many of the lessons from higher-level math topics translate into usable techniques in ML. I'd also like to see a stronger push on theory, because empirical results can be deceiving (von Neumann's elephant and all).



