The boundary of neural network trainability is fractal (arxiv.org)
200 points by RafelMri 9 months ago | 65 comments



This is much more interesting if you see the animations. https://x.com/jaschasd/status/1756930242965606582


Fractal zoom videos are worth infinite words.


> infinite

I see you


So what exactly are we looking at here? Did the authors only use two hyperparameters for the purpose of this visualization?


It's explained in the post:

> Have you ever done a dense grid search over neural network hyperparameters? Like a really dense grid search? It looks like this (!!). Blueish colors correspond to hyperparameters for which training converges, redish colors to hyperparameters for which training diverges.


I saw this, but am still not clear on what the axes represent. I assume two hyperparameters, or possibly two orthogonal principal components. I guess my point is it’s not clear how/which parameters are mapped onto the image.


Your point is valid, but the paper explains it clearly: they are NOT dimensionally reduced hyperparameters. The hyperparameters are learning rates, that's it. X axis: learning rate for the input layer (the network has one hidden layer). Y axis: learning rate for the output layer.

So what this is saying is that for certain ill-chosen learning rates, model convergence is, for lack of a better word, chaotic and unstable.


Just to add to this, only the two learning rates are changed, everything else including initialization and data is fixed. From the paper:

Training consists of 500 (sometimes 1000) iterations of full batch steepest gradient descent. Training is performed for a 2d grid of η0 and η1 hyperparameter values, with all other hyperparameters held fixed (including network initialization and training data).
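
For anyone who wants to poke at it, here's a minimal sketch (not the authors' code) of that kind of sweep: a tiny one-hidden-layer network trained with full-batch gradient descent, with the two per-layer learning rates swept over a grid and everything else held fixed. The sizes, step count, and divergence threshold here are arbitrary choices.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.standard_normal((16, 8))                      # fixed training data
  Y = rng.standard_normal((16, 1))                      # fixed targets
  W0_init = rng.standard_normal((8, 16)) / np.sqrt(8)   # input -> hidden weights
  W1_init = rng.standard_normal((16, 1)) / np.sqrt(16)  # hidden -> output weights

  def diverged(lr0, lr1, steps=500):
      W0, W1 = W0_init.copy(), W1_init.copy()
      for _ in range(steps):
          H = np.tanh(X @ W0)                           # forward pass
          err = H @ W1 - Y
          loss = np.mean(err ** 2)
          if not np.isfinite(loss) or loss > 1e6:
              return True                               # training blew up
          gW1 = 2 * H.T @ err / len(X)                  # MSE gradients
          gH = 2 * err @ W1.T / len(X)
          gW0 = X.T @ (gH * (1 - H ** 2))
          W0 -= lr0 * gW0                               # per-layer learning rates
          W1 -= lr1 * gW1
      return False

  lrs = np.logspace(-3, 1, 64)                          # log-spaced grid of learning rates
  grid = np.array([[diverged(lr0, lr1) for lr0 in lrs] for lr1 in lrs])
  # 'grid' is the converge/diverge map; the paper zooms into its boundary.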


Reposting comment from last time since I'm still curious:

This is really fun to see. I love toy experiments like this. I see that each plot is always using the same initialization of weights, which presumably makes it possible to have more smoothness between each pixel. I also would guess it's using the same random seed for training (shuffling data). I'd be curious to know what the plots would look like with a different randomness/shuffling of each pixel's dataset. I'd guess for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random initialization of each dataset. This would get at whether the chaotic boundary also exists in more practical use cases.


I think if you used a random seed for weights and training data order, and reran the experiment enough times to average out the noise, the resulting charts would be smooth, with no fractal patterns.


An interesting conjecture, well worth a paper in response.


Do you consider the random seed (or by extension the randomized initial weights) a hyperparameter?


Exactly. If they always use the same initialization seed, then this isn't very surprising.

One would have to do many runs for each point in the grid and average them, or something.

But I didn't read the paper, so maybe they did.
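
For what it's worth, the averaging you describe is a small change to the sweep: re-run each grid point with several seeds (for the initialization and data) and record the fraction of runs that diverge. In this sketch, diverged(lr0, lr1, seed) is a hypothetical helper that trains one network with that seed.

  import numpy as np

  def divergence_rate(lr0, lr1, n_seeds=20):
      # diverged(lr0, lr1, seed) is assumed to train one network with that seed
      # and report whether its loss blew up
      return np.mean([diverged(lr0, lr1, seed=s) for s in range(n_seeds)])

  lrs = np.logspace(-3, 1, 64)
  rate_map = np.array([[divergence_rate(a, b) for a in lrs] for b in lrs])
  # If the fractal structure averages out, rate_map should vary smoothly from
  # 0 to 1 across the boundary; if not, it stays rough at every zoom level.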


One of Wolfram's comments is that there appears to be much more internal structure in language semantics than we had expected, contra Chomsky.

We also know the brain, especially the cortex, is highly recurrent, so it should be primed for creating fractals and chaotic mixing.

So maybe the hidden structure is the set of neural hyperparams needed to put a given cluster of neurons into fractal/chaotic oscillations like this. Seems potentially more useful too: way more information content than a configuration that yields a fast convergence to a fixed point.

Perhaps this is what learning deep NNs is doing: producing conditions where the substrate is at the tipping point, to get to a high-information generation condition, and then shaping this to fit the target system as well as it can with so many free parameters.

That suggests that using iterative generators that are somehow closer to the dynamics of real neurons would be more efficient for AI: it'd be easier to drive them to similar feedback conditions and patterns.

Like matching resonators in any physical system.


Two quibbles: (1) neural nets don’t necessarily operate on language, (2) they only loosely model biological neurons in that they operate in discrete space. All that is to say that any similarities are purely incidental without accounting for these two facts.


1) Agreed. It's exciting seeing the same basic architecture broadly applied.

2) Not sure what you mean by "operate in discrete space".

I'd emphasize the potential similarity to biological recurrence. Deep ANNs don't need to have this explicitly (though e.g. LSTM has explicit recurrence), but it is known that recurrent NNs can be emulated by unrolling, in a process similar to function currying. In this mode, a learned network would learn to recognize certain inputs and carry them across to other parts of the network that can be copies of the originator, thus achieving functional equivalence to self-feedback or neighbor feedback. It takes a lot of layers and nodes in theory, but of course modern nets are getting very big.
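
To make the unrolling point concrete, here's a toy sketch (the shapes and the tanh cell are arbitrary choices): applying copies of the same cell weights T times as a stack of layers computes the same thing as running the recurrent update for T steps.

  import numpy as np

  rng = np.random.default_rng(0)
  W_in = rng.standard_normal((4, 8))
  W_rec = rng.standard_normal((8, 8)) * 0.1

  def recurrent(xs):                     # xs: T inputs, shape (T, 4)
      h = np.zeros(8)
      for x in xs:
          h = np.tanh(x @ W_in + h @ W_rec)
      return h

  def unrolled(xs):                      # the same computation written as T "layers"
      layers = [(W_in.copy(), W_rec.copy()) for _ in xs]   # copies of the cell
      h = np.zeros(8)
      for (Wi, Wr), x in zip(layers, xs):
          h = np.tanh(x @ Wi + h @ Wr)
      return h

  xs = rng.standard_normal((5, 4))
  assert np.allclose(recurrent(xs), unrolled(xs))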


I didn't phrase (2) particularly well. I meant that digital neural networks typically use floating point numbers, which can represent only discrete values. Do aliasing effects show up in the behavior of these networks that wouldn't apply for continuously variable systems, or, perhaps more appropriate for biological NNs, for temporal coding via spike trains?


Well, in some sense they don’t operate on language at all - but on mathematical representation of tokens derived from language.

I’m sure from your comment you are aware of the distinction, but it is an interesting concept for people to keep in mind.


Per some offline discussion, I'll note that while this paper is about the structure of hyperparams, it also starts with the analogy between the Mandelbrot and Julia sets: Julia as the hyperparam space for the Mandelbrot, the param space.

Well, they both also have similar fractal dimensions. The Mandelbrot set's Hausdorff dimension is 2; Julia sets range from 1 to 2.

I won't argue it here, but I'll just suggest that this is an important complexity relationship, that the neural net being fit may also have similar fractal complexity, and that the distinction between param and hyperparam in this sense may be somewhat of a red herring.


> Julia as the hyperparam space for Mandelbrot

It’s kind of the opposite, no? The Mandelbrot set is the set of points c whose (filled) Julia set contains c itself.
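
A tiny sketch of that relationship, in case it helps (the iteration count and escape radius are arbitrary): c is in the Mandelbrot set iff the orbit of 0 under z -> z^2 + c stays bounded, which is the same as c lying in the filled Julia set for that c, because the orbit of 0 passes through c after one step.

  def bounded(z, c, iters=200, radius=2.0):
      for _ in range(iters):
          z = z * z + c
          if abs(z) > radius:
              return False
      return True

  def in_mandelbrot(c):
      return bounded(0, c)           # iterate from z = 0; c is the "hyperparameter"

  def in_filled_julia(z, c):
      return bounded(z, c)           # c fixed; the starting z is the "parameter"

  c = complex(-0.75, 0.1)
  print(in_mandelbrot(c), in_filled_julia(c, c))   # agree, up to the iteration cutoff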


Yeah, I think you're right, I was reading the intro to the paper wrong. Just saw the acko visualization and was reminded of your point:

https://acko.net/blog/how-to-fold-a-julia-fractal/

Cheers!


An LLM does not model language. The name is misleading, and should be changed to Large Text Model.

Text is infinitely complex. Written text is only somewhat less so. The text people choose to write is full of complex entropy.

The language patterns we use to read text, on the other hand, are much simpler. The most complicated part is ambiguity, and we don't resolve that with language. Instead, we resolve ambiguity with context.


> Text is infinitely complex.

No, it isn’t.

> An LLM does not model language. The name is misleading, and should be changed to Large Text Model.

What, because it isn’t processing speech?

Presumably you are aware of models that model recordings of speech, and therefore this isn’t what you mean.

In that case, it seems to me like you are probably gerrymandering some concepts?


The set of all possible written text is infinitely complex. That's not what is being modeled, though: LLMs model text that was written intentionally by humans. That's less complicated, but it's not simple enough for language rules to completely define it. Natural language is "context-sensitive", so the meaning of a written language statement is dependent on more than the language itself. That's why we can't just parse natural language like we can programming languages.

An LLM is a model of written text. It doesn't know anything about language rules. In fact, it doesn't follow any rules whatsoever. It only follows the model, which tells you what text is most likely to come next.


The set of all strings over an alphabet is infinite in cardinality. I don’t think I would say that it is infinitely complex. I don’t know how you are defining complexity, but the shortest program that recognizes the language “all strings (over this alphabet)” is pretty short.

A program that did a good job of distinguishing whether a string was human-written would be substantially longer than one that recognizes the language of all strings over a particular alphabet.

If you want to generate strings instead of recognizing them, a program that enumerates all possible strings over a given alphabet, can also be pretty short.
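
To make that concrete, both programs really are tiny; something like:

  from itertools import count, product

  ALPHABET = "ab"

  def recognizes(s):                        # accepts every string over ALPHABET
      return all(ch in ALPHABET for ch in s)

  def all_strings():                        # yields "", "a", "b", "aa", "ab", ...
      for n in count(0):
          for chars in product(ALPHABET, repeat=n):
              yield "".join(chars)

  gen = all_strings()
  print([next(gen) for _ in range(7)])      # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']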

Not sure what you mean by complexity.

I don’t know what you mean by “it doesn’t follow any rules at all”.


When you write a parser, you build it out of language (syntax) rules. The parser uses those rules to translate text into an abstract syntax tree. This approach is explicit: syntax is known ahead of time, and any correctly written text can be correctly parsed.

When you train an LLM, you don't write any language rules. Instead, you provide examples of written text. This approach is implicit: syntax is not known ahead of time. The core feature is that there is no distinction between what is correct and what is not correct. The LLM is liberated from the syntax rules of language. The benefit is that it can work with ambiguity. The limitation is that it can't decide what interpretation of that ambiguity is correct. It can only guess what interpretation is most likely, based on the text it was trained on.


I wonder if we use LLM because LTM (and LSTM) was taken.


I doubt it. The entire purpose of an LLM is to make a model that does language (or at least something close enough). It's an easy mistake to name something what you hope it will do instead of what it literally is. Unfortunately, that distinction becomes important later.

It's a similar problem with "AI". Artificial Intelligence does not exist, yet we put that name on all kinds of things. This causes real problems, too: by labeling an LLM "AI", we anthropomorphize it. From then on, the entire narrative is misleading.


Intelligence does not imply sapience, or person-like-ness.


It doesn't need to to be a problem.

An LLM is a model, not an actor. As soon as we call it "AI", that distinction gets muddled, and the whole narrative follows.


Could it be that this behavior is just caused by numerical issues and/or randomness in the calculation, rather than a real property of neural networks?


Fractals are not caused by randomness. Fractals arise from scale invariance and self-similarity, which in turn can come from nonlinear systems and iterative processes. It is very easy to generate fractals in practice, and even the most vanilla neural networks trivially fulfill the requirements (at least when you look at training output). In that sense it would be weird not to find fractal structures when you look hard enough.


Not exactly. Newton's fractal has a topological property, specifically the Wada property, on its basin boundaries.

It relates to fractal (non-integer) dimensions, which were first described by Mandelbrot in a paper about self-similarity.

Here is a paper that covers some of that.

https://www.minvydasragulskis.com/sites/default/files/public...

In Newton's fractal, no matter how small a circle you draw, it will either contain points attracted to just one root or points attracted to all of the roots.

The basins that contain one root are open sets that share a boundary set.

Even if you could have perfect information and precision, this property holds. This means any change in initial conditions that crosses a boundary will be indeterminate.

There is another feature called riddled basins, where every point is arbitrarily close to other basins. This is another situation where, even with perfect information and unlimited precision, a perturbation would be indeterminate.

The Lyapunov exponent (a positive value isn't sufficient to prove chaos, though it is always positive in the presence of chaos) may even be 0 or negative in the above situations.

Take the typical predator-prey model, add fear and refuge, and you hit riddled basins.

Stack four reflective balls in a pyramid, shine different-colored lights in from two sides, and you will see the Wada property.

Neither of those problems is addressable under the assumption of deterministic effects with finite precision.


Chaos is just another possible consequence of nonlinear systems. And Newton's fractal is also generated by just another iterative process. That doesn't mean the origin is necessarily random.


"If we're built from spirals, while living in a giant spiral, then everything we put our hands to, is infused with the spiral"

... Sorry I couldn't help myself.


Related: https://news.ycombinator.com/item?id=39349992 (seems to be the same content)


Even the boundary for Newton's method is fractal. This is a feature of non-linear optimization.


In case it isn't obvious, you can tap on any of the figures in the PDF or HTML version to watch the video.


> Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations.

Reading this gave me goosebumps


The blog post would be a better link for this submission. https://sohl-dickstein.github.io/2024/02/12/fractal.html


It was posted on HN last week: https://news.ycombinator.com/item?id=39349992


It feels weird to me to use the hyperparameters as the variables to iterate on, and also wasteful. Surely there must be a family of models that give fractal-like behaviour?


The fractal behaviour is an undesirable property, not the goal :P Ideally every network would be trainable (would converge)! This is the graphed result of a hyperparameter search, a form of optimization in neural networks.

If you envision a given architecture as a class of (higher-order) functions, the inputs would be the parameters, and the constants would be the hyperparameters. Varying the constants moves you to a different function in the class; in other words, varying the hyperparameters gives a different model with the same architecture (even with the same data).
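
A sketch of that framing in code, with made-up names (compute_gradients is a hypothetical helper): the hyperparameters are the constants you close over, and the parameters are what training updates.

  def make_trainer(lr_in, lr_out):                   # hyperparameters: the "constants"
      def train(params, data, steps):                # parameters vary inside training
          for _ in range(steps):
              g_in, g_out = compute_gradients(params, data)   # hypothetical helper
              params["w_in"] -= lr_in * g_in
              params["w_out"] -= lr_out * g_out
          return params
      return train

  # Two nearby points on the paper's 2D grid are two different closures:
  train_a = make_trainer(0.1, 0.5)
  train_b = make_trainer(0.1, 0.50001)               # tiny change, different "model"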


> It feels weird to me to use the hyper parameters as the variables to iterate on

Yes, I also think this is strange. In regular fractals the x and y coordinates have the same units (roughly speaking), but here this is not the case, so I wonder how they determine the relative scale.


Is there really any meaningful sense in which real and imaginary numbers have the same units but two dimensionless real hyperparameters don't?


Multiplying a complex number by i rotates it by 90 degrees while preserving its magnitude. In this sense the real and imaginary axes have the same units.


You can rotate any vector while preserving its magnitude.


If the units are different, you do need to come up with a conversion between them, or you're implicitly saying it's 1-to-1.


What does this mean?



As another poster pointed out, it's much more intuitive with graphics.

The parameters that you tweak to control model learning have a self-similar property where, as you zoom in, you see more and more complexity. It's the very definition of local maxima all over the place.


Nothing


The chaotic nature of these images, full of local minima, makes me intuitively think that genetic-algorithm approaches to optimizing hyperparameter search are better suited than regression-based ones.
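
A rough sketch of what that could look like, assuming final_loss(lr0, lr1) is a hypothetical evaluation that trains a network and returns its final loss: evolve a small population of (log lr0, log lr1) pairs with truncation selection, crossover, and Gaussian mutation, instead of following gradients through a landscape full of local structure. The population size and mutation scale here are arbitrary.

  import numpy as np

  rng = np.random.default_rng(0)

  def evolve(final_loss, pop_size=32, generations=50, sigma=0.3):
      # final_loss(lr0, lr1) is a hypothetical helper: train a net, return its loss
      pop = rng.uniform(-3, 1, size=(pop_size, 2))            # log10 learning rates
      for _ in range(generations):
          scores = np.array([final_loss(10 ** p[0], 10 ** p[1]) for p in pop])
          parents = pop[np.argsort(scores)[: pop_size // 4]]  # keep the best quarter
          a = parents[rng.integers(0, len(parents), pop_size)]
          b = parents[rng.integers(0, len(parents), pop_size)]
          children = (a + b) / 2                              # crossover
          pop = children + rng.normal(0, sigma, size=children.shape)  # mutation
      best = min(pop, key=lambda p: final_loss(10 ** p[0], 10 ** p[1]))
      return 10 ** best                                       # best (lr0, lr1) found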


It's interesting how fluid-like this fractal is compared to other fractal-zoom videos I see on the internet. I have no idea how common it is for fractals to be like that.


I vaguely recall that in vivo neural oscillations also exhibit fractal structure (in some cases at least).


I've produced some KIFS (Kaleidoscopic iterated function system) fractals that look like this.


Note that the paper does not provide a proof of this statement, only some experimental evidence.


Makes me think of something I think I read from Penrose about consciousness.


I am not trying to downplay the contribution of the paper, but isn't it obvious that this is the case?


I'll defend the idea that it was obvious. (Although, it wasn't obvious to me until someone pointed it out, so maybe that's not obvious.)

If you watch this video[0], you'll see in the first frame that there is a clear boundary between learning rates that converge or not. Ignoring this paper for a moment, what if we zoom in really really close to that boundary? There are two possibilities, either (1) the boundary is perfectly sharp no matter how closely we inspect it, or (2) it is a little bit fuzzy. Of those two possibilities, the perfectly sharp boundary would be more surprising.

[0]: https://x.com/jaschasd/status/1756930242965606582


I don't think it's obvious per se, but people who have studied numerical methods at the graduate level have likely seen fractal boundaries like this before - even Newton's method produces them [0]. The phenomenon says more about iterative methods than it says about neural networks.

[0] https://en.wikipedia.org/wiki/Newton_fractal
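
For anyone who hasn't seen it, the Newton fractal takes only a few lines to reproduce. This sketch (grid size and iteration count are arbitrary) colours each starting point by the root of z^3 - 1 that Newton's method converges to; the fractal lives on the basin boundaries.

  import numpy as np

  roots = np.exp(2j * np.pi * np.arange(3) / 3)      # the three cube roots of 1

  def basin(z, iters=50):
      for _ in range(iters):
          if z == 0:
              break                                  # avoid dividing by zero
          z = z - (z ** 3 - 1) / (3 * z ** 2)        # Newton step for f(z) = z^3 - 1
      return int(np.argmin(np.abs(roots - z)))       # index of the nearest root

  xs = np.linspace(-2, 2, 200)
  image = np.array([[basin(complex(x, y)) for x in xs] for y in xs])
  # Zooming into the boundaries between the three regions of 'image' shows
  # self-similar structure, much like the learning-rate plots in the paper.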


Not only is it not obvious; it is not known to be true.


Obvious to whom?


I think the "obvious" comment was a bit snarky, but out of curiosity, I posed the question to the Groq website, which happens to be on the front page right now. (It claims to run Mixtral 8x7B-32k at 500 T/s.)

And indeed, the AI response indicated that the boundary between convergence and divergence is not well defined, has many local maxima and minima, and could be, quote, "fractal or chaotic, with small changes in hyperparameters leading to drastically different outcomes."


I'll add to this.

It's not only the boundary that is fractal.

We'll soon see that learning on one dataset (area of the fractal) with enough data will generalize to other seemingly unrelated datasets.

There is evidence that the structure neural networks are learning to approximate is a generative fractal of sorts.

Finally, we'll need to adapt gradient descent to move between different scales.



