I have actually attempted this recently. I took the small ~10M-parameter Shakespeare language model used as an example in nanoGPT, swapped out gradient descent, and tested various black-box optimizers from what I could find in the literature.
Gradient descent trains the Shakespeare model in about 3 minutes. The black-box methods I've tested so far would likely take 30+ hours (I haven't run any of them to completion yet). I've hit a wall where progress is very slow. The text generated at that stage has punctuation, and words are separated by spaces, but the words themselves are mostly nonsense. It almost feels like the model learned that English is letters separated by spaces, and that you put exclamation marks or periods at the end, but not much more.
There are some larger-scale CMA-ES variants I still want to test that don't have quality implementations. I've stared at pictures of gradients and weights from half-trained models, trying to come up with ideas for how to get there with black-box optimization. I'm also trying some original ideas where you compute a gradient, but not against a loss function; the gradient would instead be for discovering hidden structure in the weights, which you would then feed to a black-box optimizer as a guide (which I guess makes it not entirely black box. Gray box?)
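For anyone curious what the setup looks like in code, here's a minimal sketch. Everything is a toy stand-in (a small linear model and a (1+1)-ES hill climber with a crude step-size rule, not nanoGPT or the fancier optimizers); the point is the interface: the optimizer sees nothing but a scalar fitness score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the language model: a linear map with a known target,
# so "fitness" is just negative mean squared error.
X = rng.normal(size=(64, 8))
w_true = rng.normal(size=8)
y = X @ w_true

def fitness(theta):
    """The only thing the optimizer ever sees: one scalar, no gradients."""
    return -np.mean((X @ theta - y) ** 2)

# Minimal (1+1)-ES: perturb, keep the candidate if fitness improves,
# and adapt the step size with a crude 1/5-success-rule flavour.
theta = np.zeros(8)
sigma = 0.3
best = fitness(theta)
for step in range(2000):
    cand = theta + sigma * rng.normal(size=theta.shape)
    f = fitness(cand)
    if f > best:
        theta, best = cand, f
        sigma *= 1.1   # success: expand the search radius
    else:
        sigma *= 0.99  # failure: shrink it

print(best)  # best fitness reached; climbs toward 0 from roughly -8
```

Swapping the toy `fitness` for "run a forward pass of the model over a text batch and return negative loss" gives the real experiment; the optimizer loop doesn't change at all, which is the whole appeal.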
Possible? I mean, I guess technically. Practical? No way, unless some major breakthrough happens.
My current goal is just to produce a model, even if training takes laughably long, so I can say I've trained a language model using nothing but a fitness score from a black-box function.
Edit: if you are reading this and are aware of any other serious attempts at training a non-trivially-sized language model without gradient descent, I would like to know, so I can try their methods. I know there's some large-scale work in reinforcement learning, like one Uber paper, but not for LLMs specifically.
Not really surprising-- CMA-ES basically replaces the actual gradient you care about with a rough numerical approximation based on looking at lots of input-output pairs. I think a related idea originated in mining geostatistics, where it's called "kriging".
Basically, you are wasting most of your compute to come up with a rough local approximation to the thing you actually want. But that's sort of pointless in the NN training context, because what you want is basically the gradient (and maybe some higher order terms that tell you about the local curvature too).
CMA-ES makes sense when the gradient is not even well defined. For example, if you have a bunch of parameters for an airplane design, and then want to take that design and do a bunch of huge aerodynamics calculations to compute its lift, or do a big finite element analysis to measure how well it withstands various stresses, and at the end of that big analysis, you get back a number, like "maximum lift" or something. If each run takes hours on a supercomputer, then you clearly don't have anything close to a gradient and it would be very expensive to even try to approximate it numerically. So CMA-ES is useful there in helping you pick better high level parameters in a smart way-- basically it's a big improvement over grid search.
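The expensive-simulation use case can be sketched like this. It's a deliberately stripped-down (mu/lambda) evolution strategy, omitting the covariance adaptation that gives CMA-ES its name, and the "expensive simulation" is a stand-in analytic function; the shape of the loop is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the expensive simulation: design parameters in, one
# scalar score out. In the real setting each call might be hours of
# CFD or finite element analysis on a supercomputer.
def expensive_score(design):
    return -np.sum((design - 1.7) ** 2)

dim, lam, mu = 5, 20, 5        # population of 20, keep the best 5
mean = np.zeros(dim)
sigma = 1.0

for gen in range(60):
    pop = mean + sigma * rng.normal(size=(lam, dim))
    scores = np.array([expensive_score(p) for p in pop])
    elite = pop[np.argsort(scores)[-mu:]]  # rank-based selection
    mean = elite.mean(axis=0)              # recombination: average the elite
    sigma *= 0.95                          # crude step-size decay

print(mean)  # drifts toward the optimum at [1.7, 1.7, 1.7, 1.7, 1.7]
```

Full CMA-ES additionally adapts a covariance matrix from the elite samples, so the search distribution stretches along directions that keep paying off; that matters a lot on ill-conditioned problems but isn't needed to see the basic mechanism.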
I think I saw a paper that argued that CMA-ES makes an approximation to the natural gradient, which is not the same gradient you see in typical NN training. Or at least that's how I understood it. (I have no background in data science or ML; I'm just a bored engineer.)
I haven't estimated the number of trials you would need for the 10M Shakespeare model, but to reach the same level as gradient descent, I think it might be around 10M, i.e. the same ballpark as the number of parameters. That makes some intuitive sense given how little you learn from each black-box function evaluation.
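One crude way to see why the evaluation count lands in the same ballpark as the parameter count: a backward pass hands you all d partial derivatives at once, while a black-box call hands you one scalar, so even recovering a single gradient by finite differences costs about d evaluations.

```python
# Back-of-envelope only: forward differences need one extra evaluation
# per parameter to recover what a single backprop step gives for free.
d = 10_000_000                  # parameters in the Shakespeare model
evals_per_gradient = d + 1      # f(theta), plus f(theta + h*e_i) for each i
print(evals_per_gradient)
```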
There's maybe some hope that there's hidden structure in the problem that doesn't actually need anywhere close to 10M parameters, so a black-box optimizer might still find it. I don't have my hopes up, but I'm poking at it.
I would think that if LLMs turn out not to be totally impossible with black-box optimization, it would be good to find a reason to use it: objective functions that have no hope of a usable gradient, some weird architecture that can't be trained conventionally, maybe fine-tuning gradient-descent-optimized models with those things, etc. It feels like a solution looking for a problem.
I'm doing my project for fun and just seeing if it's possible at all.
I have not. I have read the paper though. I do want to try it and likely will at some point.
Next up after this project, I want to test some meta-learning ideas. I've read papers where the idea is that every weight is actually a tiny neural network, all sharing the same parameters, which you train to learn backpropagation (or whatever learning algorithm it converges to). The paper I read argued it also works forward-only, though my intuition doesn't quite grasp how. I want to follow up on this line of research, check whether there have been any new developments since I read those papers, and try the ideas out in my own code.
It's about information. Gradient-free methods integrate little or no information about the problem; they're a blind watchmaker. This works, but it's slow, and it gets slower the bigger the problem is (curse of dimensionality).
Gradients integrate some limited information about the problem. This lets you find solutions much faster, and neural networks are structured specifically to be easy to optimize with gradients. Local minima don't seem to be a problem.
The future is probably even smarter optimizers that integrate more information about the problem and learn to make good assumptions. This is the goal of learned optimizers, like VeLO (https://arxiv.org/abs/2211.09760).
You could start by reading about CMA-ES, which is something like a particle filter on the model parameters. With 100 "particles", you have 100 resampled copies of the model, which are evaluated to create something like a "synthetic" gradient that is used to update a distribution over the model parameters.
But it doesn't solve the problem of local minima, and it will also need to use minibatches.
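The "synthetic gradient" idea can be made concrete with the evolution-strategies estimator popularized by OpenAI's ES work: perturb the parameters, weight each perturbation by the score difference it produced, and average. A toy sketch (quadratic loss standing in for the model; in the real setting each loss call would itself be computed on a minibatch):

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=30)

def loss(theta):
    return np.sum((theta - target) ** 2)

theta = np.zeros(30)
sigma, lr, n = 0.1, 0.05, 50   # perturbation scale, step size, samples

for step in range(300):
    eps = rng.normal(size=(n, theta.size))
    # Antithetic pairs (+eps and -eps) reduce estimator variance.
    deltas = np.array([loss(theta + sigma * e) - loss(theta - sigma * e)
                       for e in eps])
    grad_est = (eps * deltas[:, None]).mean(axis=0) / (2 * sigma)
    theta -= lr * grad_est     # step along the "synthetic" gradient

print(loss(theta))  # far below the starting loss of ~30
```

Note this estimator only needs loss values, never backprop, which is why it also runs fine on non-differentiable objectives; the price is that its variance grows with the parameter count, which is exactly the scaling wall being discussed upthread.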