
I have no idea what you’re saying. It sounds important and interesting but I’d like some more details.



Roughly, all of the parameters of any NN (and many other models as well) can be thought of as spaces that are flatter or smoother in one region or another, or under some pretty fancy “zooming” (Atiyah-Singer index theorem, give or take).

The way we train them involves finding steepness and chasing it, and it almost always works for a bit, often for quite a while. But the flat places it ends up in are both really flat, and there are zillions of them that are nearly identical.
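
(De-jargoned, “finding steepness and chasing it” is just gradient descent. A toy 1-D sketch, nothing to do with any particular model: the made-up loss below is steep far from its minimum and very flat near it, and the iterate parks itself in the flat part.)

    # Toy 1-D "loss landscape": steep away from w = 3, very flat near it.
    loss = lambda w: (w - 3.0) ** 4
    grad = lambda w: 4.0 * (w - 3.0) ** 3

    w, lr = 10.0, 1e-3
    for _ in range(5000):
        w -= lr * grad(w)   # find the steepness, chase it downhill

    print(w, grad(w))       # w parks near 3, where the gradient is already ~0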

Those sets of nearly identical “places”, and in particular their difference in usefulness via selection bias, are called together or separately a “gauge symmetry”, which basically means some things remain true as you vary other things a lot. The things that remain true are usually “conserved quantities”, and in the case of OpenAI 100% compressing the New York Times, the conserved quantity is the compression ratio, up to some parameter count or lossiness.


I am probably way off base here, but what I think you are saying is that these 'flat regions' come close to lossless compression, and thus copyright infringement is occurring?


Not quite; the abuse of the commons, in trivial violation of the spirit of our system of government, is suggested (I’d contend demonstrated) by necessary properties of the latent manifolds.

The uniformity (gauge symmetry up to a bound) of such regions is a way of thinking about the apparent contradiction between the properties of a billion-dimensional space before and after a scalar loss has pushed a gradient around in it.


Okay, yeah, obviously there is a loss of entropy.


Entropy is a tricky word: legend has it that von Neumann persuaded Shannon to use it for the logarithmic information measure because “no one knows what it means anyways”.

These days we have KL-divergence and information gain and countless other ways to be rigorous, but you still have to be kind of careful with “macro” vs “micro” states; it’s just a slippery concept.
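
(For reference, a minimal NumPy sketch of KL-divergence between two discrete distributions; the distributions here are made up, it’s just the textbook definition, not anything specific to this thread.)

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    # Made-up example: a uniform distribution vs. a skewed one
    print(kl_divergence([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]))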

Whether or not some 7B-parameter NN that was, like, Xavier/He-initialized or whatever the Fortress of Solitude people are doing these days is more or less unique than it is after you push an exabyte of Harry Potter fan fiction through it?

I think that’s an interesting question even if I (we) haven’t yet posed it in a rigorous way.
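
(One crude, entirely made-up way to make the question concrete: compare how well a Glorot/Xavier-initialized weight matrix compresses against a deliberately structured copy of itself. The quantization step below is a stand-in, not a claim about what training actually does to the weights; it just shows what “more vs. less unique” could cash out as in bytes.)

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    fan_in, fan_out = 512, 512

    # Glorot/Xavier-style uniform init: limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    w_init = rng.uniform(-limit, limit, size=(fan_in, fan_out)).astype(np.float32)

    # Crude stand-in for a lower-entropy weight state: the same matrix snapped
    # to 9 levels. This is NOT what training does; it is only a contrast case.
    w_low_entropy = (np.round(w_init / limit * 4.0) * (limit / 4.0)).astype(np.float32)

    def compressed_ratio(a):
        raw = a.tobytes()
        return len(zlib.compress(raw)) / len(raw)

    print("fresh init:", compressed_ratio(w_init))         # stays close to 1: looks like noise
    print("quantized :", compressed_ratio(w_low_entropy))  # much smaller: redundancy shows up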




