> Have you ever done a dense grid search over neural network hyperparameters? Like a really dense grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges.
I saw this, but am still not clear on what the axes represent. I assume two hyperparameters, or possibly two orthogonal principal components. I guess my point is it’s not clear how/which parameters are mapped onto the image.
your point is valid, but the paper explains it clearly. They are NOT dimensionally reduced hyperparameters. The hyperparameters are just learning rates: the x axis is the learning rate for the input layer (the network has one hidden layer), and the y axis is the learning rate for the output layer.
So what this is saying is that, for certain ill-chosen learning rates, model convergence is, for lack of a better word, chaotic and unstable.
Just to add to this, only the two learning rates are changed, everything else including initialization and data is fixed. From the paper:
Training consists of 500 (sometimes 1000) iterations of full batch steepest gradient descent. Training is performed for a 2d grid of η0 and η1 hyperparameter values, with all other hyperparameters held fixed (including network initialization and training data).
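Just to make that concrete, here is a sketch of what such a scan could look like for a tiny one-hidden-layer network. This is my own toy reconstruction with made-up sizes, not the paper's code:

    import numpy as np

    def diverged(eta0, eta1, steps=500, seed=0):
        # Toy 1-hidden-layer net trained with full-batch gradient descent,
        # using separate learning rates for the input and output layers.
        rng = np.random.default_rng(seed)        # same init and data for every pixel
        X = rng.standard_normal((16, 8))
        y = rng.standard_normal((16, 1))
        W0 = rng.standard_normal((8, 32)) / np.sqrt(8)
        W1 = rng.standard_normal((32, 1)) / np.sqrt(32)
        for _ in range(steps):
            h = np.tanh(X @ W0)
            err = h @ W1 - y                     # pieces of the MSE loss gradient
            gW1 = h.T @ err
            gW0 = X.T @ ((err @ W1.T) * (1 - h ** 2))
            W0 -= eta0 * gW0
            W1 -= eta1 * gW1
            if not np.isfinite(W0).all() or not np.isfinite(W1).all():
                return True                      # training blew up
        return False

    # One boolean per pixel: scan a 2D grid of (eta0, eta1) values.
    etas = np.logspace(-3, 0, 64)
    grid = [[diverged(e0, e1) for e0 in etas] for e1 in etas]

Each pixel in the images is then one such run, colored by convergence vs divergence.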
Reposting comment from last time since I'm still curious:
This is really fun to see. I love toy experiments like this. I see that each plot is always using the same initialization of weights, which presumably makes it possible to have more smoothness between each pixel. I also would guess it's using the same random seed for training (shuffling data).
I'd be curious to know what the plots would look like with a different randomness/shuffling of each pixel's dataset. I'd guess for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random weight initialization for each pixel. This would get at whether the chaotic boundary also exists in more practical use cases.
I think if you used a random seed for weights and training data order, and reran the experiment enough times to average out the noise, then the resulting charts would then be smooth with no fractal patterns.
One of Wolfram's comments is that there appears to be much more internal structure in language semantics than we had expected, contra Chomsky.
We also know the brain, especially the cortex, is highly recurrent, so it should be primed for creating fractals and chaotic mixing.
So maybe the hidden structure is the set of neural hyperparams needed to put a given cluster of neurons into fractal/chaotic oscillations like this. It seems potentially more useful too: way more information content than a configuration that yields a fast convergence to a fixed point.
Perhaps this is what learning deep NNs is doing: producing conditions where the substrate is at the tipping point, to get to a high-information generation condition, and then shaping this to fit the target system as well as it can with so many free parameters.
That suggests that using iterative generators that are somehow closer to the dynamics of real neurons would be more efficient for AI: it'd be easier to drive them to similar feedback conditions and patterns
Two quibbles: (1) neural nets don’t necessarily operate on language, (2) they only loosely model biological neurons in that they operate in discrete space. All that is to say that any similarities are purely incidental without accounting for these two facts.
1) agreed. it's exciting seeing the same basic architecture broadly applied
2) not sure what you mean by "operate in discrete space"
I'd emphasize the potential similarity to biological recurrence. Deep ANNs don't need to have this explicitly (though e.g. LSTMs have explicit recurrence), but it is known that recurrent NNs can be emulated by unrolling, in a process similar to function currying. In this mode, a trained network would learn to recognize certain inputs and carry them across to other parts of the network, which can be copies of the originator, thus achieving functional equivalence to self-feedback or neighbor feedback. It takes a lot of layers and nodes in theory, but of course modern nets are getting very big.
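A rough sketch of that unrolling equivalence, assuming a plain recurrent cell (names and shapes are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W_h = rng.standard_normal((4, 4)) * 0.1    # recurrent (hidden-to-hidden) weights
    W_x = rng.standard_normal((3, 4)) * 0.1    # input-to-hidden weights

    def recurrent(xs, h0):
        # Explicit recurrence: the same cell is applied at every time step.
        h = h0
        for x in xs:
            h = np.tanh(h @ W_h + x @ W_x)
        return h

    def unrolled(xs, h0):
        # The same computation as a fixed-depth feedforward stack:
        # one "layer" per time step, all layers sharing the same weights.
        h = h0
        for layer in [lambda h, x=x: np.tanh(h @ W_h + x @ W_x) for x in xs]:
            h = layer(h)
        return h

    xs = [rng.standard_normal(3) for _ in range(5)]
    h0 = np.zeros(4)
    assert np.allclose(recurrent(xs, h0), unrolled(xs, h0))

A deep enough feedforward net can in principle learn the copies itself instead of sharing them explicitly, which is the "lots of layers and nodes" cost mentioned above.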
I didn’t phrase (2) particularly well. I meant that digital neural networks typically use floating point numbers, which can represent only discrete values. Do aliasing effects show up in the behavior of these networks that wouldn't apply to continuously variable systems, or, perhaps more appropriate for biological NNs, to temporal coding via spike trains?
Per some offline discussion, I'll note that while this paper is about the structure of hyperparams, it also starts with the analogy to the Mandelbrot and Julia sets: the Mandelbrot set as the hyperparam space, Julia sets as the param space.
Well, they both also have similar fractal dimensions. The Mandelbrot set's Hausdorff dimension is 2; Julia sets range from 1 to 2.
I won't argue it here, but just suggest that this is an important complexity relationship, that the neural net being fit may also have similar fractal complexity, and that the distinction between param and hyperparam in this sense may be somewhat of a red herring.
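For anyone who wants to poke at the dimension claim themselves, a rough box-counting sketch over a boolean converged/diverged grid (my own illustration, not the paper's code) could look like:

    import numpy as np

    def box_counting_dimension(converged, sizes=(1, 2, 4, 8, 16)):
        # converged: square 2D boolean grid (True = training converged at that pixel).
        # Count boxes of each size that straddle the convergence boundary, then
        # fit log(count) against log(1/size); the slope estimates the dimension.
        counts = []
        n = converged.shape[0]
        for s in sizes:
            c = 0
            for i in range(0, n, s):
                for j in range(0, n, s):
                    block = converged[i:i + s, j:j + s]
                    if block.any() and not block.all():   # box touches the boundary
                        c += 1
            counts.append(c)
        slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
        return slope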
An LLM does not model language. The name is misleading, and should be changed to Large Text Model.
Text is infinitely complex. Written text is only somewhat less so. The text people choose to write is full of complex entropy.
The language patterns we use to read text, on the other hand, are much simpler. The most complicated part is ambiguity, and we don't resolve that with language. Instead, we resolve ambiguity with context.
The set of all possible written text is infinitely complex. That's not what is being modeled, though: LLMs model text that was written intentionally by humans. That's less complicated, but it's not simple enough for language rules to completely define it. Natural language is "context-sensitive", so the meaning of a written language statement is dependent on more than the language itself. That's why we can't just parse natural language like we can programming languages.
An LLM is a model of written text. It doesn't know anything about language rules. In fact, it doesn't follow any rules whatsoever. It only follows the model, which tells you what text is most likely to come next.
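A deliberately tiny illustration of "no rules, only a model of what comes next": a character bigram counter, nothing like a real LLM, just to make the distinction concrete:

    from collections import Counter, defaultdict

    def train_bigram(text):
        # No grammar anywhere: just counts of which character follows which.
        counts = defaultdict(Counter)
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
        return counts

    def most_likely_next(counts, ch):
        # "What text is most likely to come next" under the counted model.
        return counts[ch].most_common(1)[0][0] if counts[ch] else None

    model = train_bigram("the model follows the text, not the rules of the language")
    print(most_likely_next(model, "t"))   # whichever character most often follows 't'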
The set of all strings over an alphabet is infinite in cardinality. I don’t think I would say that it is infinitely complex. I don’t know how you are defining complexity, but the shortest program that recognizes the language “all strings (over this alphabet)” is pretty short.
A program that does a good job of distinguishing whether a string is human-written would be substantially longer than one that recognizes the language of all strings over a particular alphabet.
If you want to generate strings instead of recognizing them, a program that enumerates all possible strings over a given alphabet can also be pretty short.
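For instance, a generator that enumerates every string over an alphabet, shortest first, is only a few lines (just to make "pretty short" concrete):

    from itertools import count, product

    def all_strings(alphabet):
        # Every string over the alphabet, in order of increasing length.
        for n in count(0):
            for chars in product(alphabet, repeat=n):
                yield "".join(chars)

    gen = all_strings("ab")
    print([next(gen) for _ in range(8)])   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa']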
Not sure what you mean by complexity.
I don’t know what you mean by "it doesn't follow any rules whatsoever".
When you write a parser, you build it out of language (syntax) rules. The parser uses those rules to translate text into an abstract syntax tree. This approach is explicit: syntax is known ahead of time, and any correctly written text can be correctly parsed.
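As a concrete contrast, here is a minimal sketch of the explicit-rules approach: a recursive-descent parser for a made-up toy grammar (expr := term ('+' term)*, term := digit), where every syntax rule is written out by hand:

    def parse_expr(tokens, pos=0):
        # expr := term ('+' term)*   -- the syntax rule is explicit in the code
        node, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] == "+":
            right, pos = parse_term(tokens, pos + 1)
            node = ("+", node, right)            # build the abstract syntax tree
        return node, pos

    def parse_term(tokens, pos):
        # term := digit
        if pos < len(tokens) and tokens[pos].isdigit():
            return ("num", int(tokens[pos])), pos + 1
        raise SyntaxError(f"expected a digit at position {pos}")

    tree, _ = parse_expr(list("1+2+3"))
    print(tree)   # ('+', ('+', ('num', 1), ('num', 2)), ('num', 3))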
When you train an LLM, you don't write any language rules. Instead, you provide examples of written text. This approach is implicit: syntax is not known ahead of time. The core feature is that there is no distinction between what is correct and what is not correct. The LLM is liberated from the syntax rules of language. The benefit is that it can work with ambiguity. The limitation is that it can't decide what interpretation of that ambiguity is correct. It can only guess what interpretation is most likely, based on the text it was trained on.
I doubt it. The entire purpose of an LLM is to make a model that does language (or at least something close enough). It's an easy mistake to name something what you hope it will do instead of what it literally is. Unfortunately, that distinction becomes important later.
It's a similar problem with "AI". Artificial Intelligence does not exist, yet we put that name on all kinds of things. This causes real problems, too: by labeling an LLM "AI", we anthropomorphize it. From then on, the entire narrative is misleading.
Could it be that this behavior is just caused by numerical issues and/or randomness in the calculation, rather than a real property of neural networks?
Fractals are not caused by randomness. Fractals arise from scale invariance and self-similarity, which in turn can come from nonlinear systems and iterative processes. It is very easy to generate fractals in practice, and even the most vanilla neural networks trivially fulfill the requirements (at least when you look at the training output). In that sense it would be weird not to find fractal structures when you look hard enough.
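For example, the boundary between bounded and divergent iteration of z → z² + c (the Mandelbrot set, the very analogy the paper starts from) takes only a few lines to map out:

    import numpy as np

    def escape_time(c, max_iter=100):
        # Iterate z -> z^2 + c and report how quickly it diverges (if it does).
        z = 0j
        for n in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:          # beyond this radius it is guaranteed to diverge
                return n
        return max_iter               # treated as "bounded" at this resolution

    # Map a grid of c values; the boundary between the two regions is fractal.
    xs = np.linspace(-2.0, 0.6, 200)
    ys = np.linspace(-1.2, 1.2, 200)
    image = [[escape_time(complex(x, y)) for x in xs] for y in ys]

The paper's plots are the analogous converge/diverge map, with gradient-descent updates in place of z² + c.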
In Newton's fractal, no matter how small a circle you draw, that circle will either lie entirely within one root's basin or contain points from all of the roots' basins.
The basins that contain one root are open sets that share a boundary set.
Even if you could have perfect information and precision, this property holds. This means the outcome of any change in initial conditions that crosses a boundary will be indeterminate.
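A small sketch of those basins, using Newton's method on z³ − 1 (the classic example; each starting point is labeled by the root it ends up at):

    import numpy as np

    roots = np.exp(2j * np.pi * np.arange(3) / 3)   # the three cube roots of 1

    def basin(z, steps=50):
        # Newton's method for z^3 - 1: index of the root the iteration lands near.
        for _ in range(steps):
            z = z - (z ** 3 - 1) / (3 * z ** 2)
        return int(np.argmin(np.abs(roots - z)))

    # Color each starting point by its basin; zooming into the boundary shows
    # all three basins meeting everywhere (the property described above).
    xs = np.linspace(-1.5, 1.5, 200)
    grid = [[basin(complex(x, y) + 1e-9) for x in xs] for y in xs]   # offset avoids z = 0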
There is another feature called riddled basins, where every point is arbitrarily close to other basins. This is another situation where, even with perfect information and unlimited precision, the outcome of a perturbation would be indeterminate.
The Lyapunov exponent (a positive exponent isn't sufficient to prove chaos, but it is always positive in the presence of chaos) may even be 0 or negative in the above situations.
Take the typical predator-prey model, add fear and refuge, and you get riddled basins.
Stack four reflective balls in a pyramid, shine different colored lights on two sides, and you will see the Wada property.
Neither of those problems is addressable under the assumption of deterministic effects with finite precision.
Chaos is just another possible consequence of nonlinear systems. And Newton's fractal is also generated by just another iterative process. That doesn't mean the origin is necessarily random.
> Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
It feels weird to me to use the hyper parameters as the variables to iterate on, and also wasteful. Surely there must be a family of models that give fractal-like behaviour?
The fractal behaviour is an undesirable property, not the goal :P Ideally every network would be trainable (would converge)! This is the graphed result of a hyperparameter search, a form of optimization in neural networks.
If you envision a given architecture as a class of (higher-order) functions, the inputs would be the parameters and the constants would be the hyperparameters. Varying the constants moves to a different function in the class; or, put differently, varying the hyperparameters gives a different model with the same architecture (even with the same data).
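A tiny sketch of that framing (hypothetical names, just to make the distinction concrete):

    def make_update_rule(eta0, eta1):
        # Fixing the hyperparameters (the "constants") selects one concrete
        # update function out of the whole class the architecture defines.
        def update(params, grads):
            # The parameters are the actual inputs; they are what training moves.
            W0, W1 = params
            g0, g1 = grads
            return (W0 - eta0 * g0, W1 - eta1 * g1)
        return update

    update_a = make_update_rule(0.01, 0.05)   # same architecture, same data,
    update_b = make_update_rule(0.30, 0.05)   # different constants -> different function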
> It feels weird to me to use the hyper parameters as the variables to iterate on
Yes, I also think this is strange. In regular fractals the x and y coordinates have the same units (roughly speaking), but here this is not the case, so I wonder how they determine the relative scale.
As another poster pointed out, it's much more intuitive with graphics.
The parameters that you tweak to control model learning have a self-similar property where, as you zoom in, you see more and more complexity. It's the very definition of local maxima all over the place.
The chaotic nature of these images, full of local minima, makes me intuitively think that genetic algorithm approaches to optimizing hyperparameter search are better suited than regression-based ones.
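A minimal sketch of what a genetic-algorithm-style search over the two learning rates could look like (made-up operators and a placeholder fitness function, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(eta0, eta1):
        # Placeholder: in practice this would be (negative) validation loss after
        # training with these learning rates.
        return -((np.log10(eta0) + 1.5) ** 2 + (np.log10(eta1) + 2.0) ** 2)

    # Population of candidate (eta0, eta1) pairs, sampled log-uniformly.
    pop = 10.0 ** rng.uniform(-4, 0, size=(20, 2))

    for generation in range(30):
        scores = np.array([fitness(e0, e1) for e0, e1 in pop])
        parents = pop[np.argsort(scores)[-10:]]                # keep the fittest half
        children = parents[rng.integers(0, 10, size=10)]       # clone random parents
        children = children * 10.0 ** rng.normal(0, 0.1, size=children.shape)  # mutate
        pop = np.concatenate([parents, children])

    best = pop[np.argmax([fitness(e0, e1) for e0, e1 in pop])]

Whether this actually beats regression-style or Bayesian search on such a chaotic landscape is an open question; the sketch is just the shape of the idea.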
It's interesting how fluid-like this fractal is compared to other fractal-zoom videos I see on the internet. I have no idea how common it is for fractals to be like that.
I'll defend the idea that it was obvious. (Although, it wasn't obvious to me until someone pointed it out, so maybe that's not obvious.)
If you watch this video[0], you'll see in the first frame that there is a clear boundary between learning rates that converge or not. Ignoring this paper for a moment, what if we zoom in really really close to that boundary? There are two possibilities, either (1) the boundary is perfectly sharp no matter how closely we inspect it, or (2) it is a little bit fuzzy. Of those two possibilities, the perfectly sharp boundary would be more surprising.
I don't think it's obvious per se, but people who have studied numerical methods at the graduate level have likely seen fractal boundaries like this before - even Newton's method produces them [0]. The phenomenon says more about iterative methods than it says about neural networks.
I think the "obvious" comment was a bit snarky, but out of curiosity, I posed the question to the Groq website, which happens to be on the front page right now. (It claims to run Mixtral 8x7B-32k at 500 T/s.)
And indeed, the AI response indicated that the boundary between convergence and divergence is not well defined, has many local maxima and minima, and could be, quote, "fractal or chaotic, with small changes in hyperparameters leading to drastically different outcomes."