I have actually attempted this recently. I took the small ~10M-parameter Shakespeare language model used as an example in nanoGPT, swapped out gradient descent, and tested various black-box optimizers from what I could find in the literature.
Gradient descent trains the Shakespeare model in about 3 minutes. The black-box methods I've tested so far would likely take 30+ hours (I haven't run any of them to completion yet). I've hit a wall where progress is very slow. The text generated at that stage has punctuation, and words are separated by spaces, but the words themselves are mostly nonsense. It almost feels like the model learned that English is letters separated by spaces, and that you put exclamation marks or periods at the end, but not much more.
There are some larger-scale CMA-ES variants I still want to test that don't have quality implementations. I've stared at pictures of gradients and weights from half-trained models, trying to come up with ideas for how to get there with black-box optimization. I'm also trying some original ideas where you compute a gradient, but not against a loss function; the gradient would instead be for discovering hidden structure in the weights, which you would then feed to a black-box optimizer as a guide (which I guess makes it not entirely black box. Gray box?)
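For anyone curious what the setup looks like in code, here's a minimal sketch. Everything is a toy stand-in (a small linear model and a (1+1)-ES hill climber with a crude step-size rule, not nanoGPT or the fancier optimizers); the point is the interface: the optimizer sees nothing but a scalar fitness score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the language model: a linear map with a known target,
# so "fitness" is just negative mean squared error.
X = rng.normal(size=(64, 8))
w_true = rng.normal(size=8)
y = X @ w_true

def fitness(theta):
    """The only thing the optimizer ever sees: one scalar, no gradients."""
    return -np.mean((X @ theta - y) ** 2)

# Minimal (1+1)-ES: perturb, keep the candidate if fitness improves,
# and adapt the step size with a crude 1/5-success-rule flavour.
theta = np.zeros(8)
sigma = 0.3
best = fitness(theta)
for step in range(2000):
    cand = theta + sigma * rng.normal(size=theta.shape)
    f = fitness(cand)
    if f > best:
        theta, best = cand, f
        sigma *= 1.1   # success: expand the search radius
    else:
        sigma *= 0.99  # failure: shrink it

print(best)  # best fitness reached; climbs toward 0 from roughly -8
```

Swapping the toy `fitness` for "run a forward pass of the model over a text batch and return negative loss" gives the real experiment; the optimizer loop doesn't change at all, which is the whole appeal.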
Possible? I mean, I guess technically. Practical? No way, unless some major breakthrough happens.
My current goal is just to produce a model, even if training takes laughably long, so I can say I've trained a language model using nothing but a fitness score from a black-box function.
Edit: if you are reading this and are aware of any other serious attempts at training a non-trivially-sized language model without gradient descent, I would like to know, so I can try their methods. I know there's some large-scale work in reinforcement learning, like one Uber paper, but not for LLMs specifically.
Not really surprising-- CMA-ES basically replaces the actual gradient you care about with a rough numerical approximation based on looking at lots of input-output pairs. I think a related idea originated in mining geostatistics, where it's called "kriging".
Basically, you are wasting most of your compute to come up with a rough local approximation to the thing you actually want. But that's sort of pointless in the NN training context, because what you want is basically the gradient (and maybe some higher order terms that tell you about the local curvature too).
CMA-ES makes sense when the gradient is not even well defined. For example, if you have a bunch of parameters for an airplane design, and then want to take that design and do a bunch of huge aerodynamics calculations to compute its lift, or do a big finite element analysis to measure how well it withstands various stresses, and at the end of that big analysis, you get back a number, like "maximum lift" or something. If each run takes hours on a supercomputer, then you clearly don't have anything close to a gradient and it would be very expensive to even try to approximate it numerically. So CMA-ES is useful there in helping you pick better high level parameters in a smart way-- basically it's a big improvement over grid search.
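The expensive-simulation use case can be sketched like this. It's a deliberately stripped-down (mu/lambda) evolution strategy, omitting the covariance adaptation that gives CMA-ES its name, and the "expensive simulation" is a stand-in analytic function; the shape of the loop is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the expensive simulation: design parameters in, one
# scalar score out. In the real setting each call might be hours of
# CFD or finite element analysis on a supercomputer.
def expensive_score(design):
    return -np.sum((design - 1.7) ** 2)

dim, lam, mu = 5, 20, 5        # population of 20, keep the best 5
mean = np.zeros(dim)
sigma = 1.0

for gen in range(60):
    pop = mean + sigma * rng.normal(size=(lam, dim))
    scores = np.array([expensive_score(p) for p in pop])
    elite = pop[np.argsort(scores)[-mu:]]  # rank-based selection
    mean = elite.mean(axis=0)              # recombination: average the elite
    sigma *= 0.95                          # crude step-size decay

print(mean)  # drifts toward the optimum at [1.7, 1.7, 1.7, 1.7, 1.7]
```

Full CMA-ES additionally adapts a covariance matrix from the elite samples, so the search distribution stretches along directions that keep paying off; that matters a lot on ill-conditioned problems but isn't needed to see the basic mechanism.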
I think I saw a paper that argued that CMA-ES makes an approximation to the natural gradient, which is not the same gradient you see in typical NN training. Or at least that's how I understood it. (I have no background in data science or ML; I'm just a bored engineer.)
I haven't estimated the number of trials you would need for the 10M Shakespeare model, but to reach the same level as gradient descent, I think it might be around 10M, i.e. the same ballpark as the number of parameters. That makes some intuitive sense given how little you learn from each black-box function evaluation.
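One crude way to see why the evaluation count lands in the same ballpark as the parameter count: a backward pass hands you all d partial derivatives at once, while a black-box call hands you one scalar, so even recovering a single gradient by finite differences costs about d evaluations.

```python
# Back-of-envelope only: forward differences need one extra evaluation
# per parameter to recover what a single backprop step gives for free.
d = 10_000_000                  # parameters in the Shakespeare model
evals_per_gradient = d + 1      # f(theta), plus f(theta + h*e_i) for each i
print(evals_per_gradient)
```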
There's maybe some hope that there's hidden structure in the problem that doesn't actually need anywhere close to 10M parameters, so a black-box optimizer might still find it. I don't have my hopes up, but I'm poking at it.
I would think that if LLMs turn out not to be totally impossible with black-box optimization, it would be good to find a reason to use it: objective functions that have no hope of a usable gradient, some weird architecture that can't be trained conventionally, maybe fine-tuning gradient-descent-optimized models with those things, etc. It feels like a solution looking for a problem.
I'm doing my project for fun and just seeing if it's possible at all.
I have not. I have read the paper though. I do want to try it and likely will at some point.
Next up after this project, I want to test some meta-learning ideas. I've read papers where the idea is that every weight is actually a tiny neural network, all sharing the same parameters, which you train to learn backpropagation (or whatever learning algorithm it converges to). The paper I read argued it also works forward-only, though my intuition doesn't quite grasp how. I want to follow up on this line of research, check whether there have been any new developments since I read those papers, and try the ideas out in my own code.
It's about information. Gradient-free methods integrate little or no information about the problem; they're a blind watchmaker. This works, but it's slow, and it gets slower the bigger the problem is (curse of dimensionality).
Gradients integrate some limited information about the problem. This lets you find solutions much faster, and neural networks are structured specifically to be easy to optimize with gradients. Local minima don't seem to be a problem.
The future is probably even smarter optimizers that integrate more information about the problem and learn to make good assumptions. This is the goal of learned optimizers, like VeLO (https://arxiv.org/abs/2211.09760).
You could start by reading about CMA-ES, which is something like a particle filter on the model parameters. With 100 "particles", you have 100 resampled copies of the model, which are evaluated to create something like a "synthetic" gradient that is used to update a distribution over the model parameters.
But it doesn't solve the problem of local minima, and it will also need to use minibatches.
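The "synthetic gradient" idea can be made concrete with the evolution-strategies estimator popularized by OpenAI's ES work: perturb the parameters, weight each perturbation by the score difference it produced, and average. A toy sketch (quadratic loss standing in for the model; in the real setting each loss call would itself be computed on a minibatch):

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=30)

def loss(theta):
    return np.sum((theta - target) ** 2)

theta = np.zeros(30)
sigma, lr, n = 0.1, 0.05, 50   # perturbation scale, step size, samples

for step in range(300):
    eps = rng.normal(size=(n, theta.size))
    # Antithetic pairs (+eps and -eps) reduce estimator variance.
    deltas = np.array([loss(theta + sigma * e) - loss(theta - sigma * e)
                       for e in eps])
    grad_est = (eps * deltas[:, None]).mean(axis=0) / (2 * sigma)
    theta -= lr * grad_est     # step along the "synthetic" gradient

print(loss(theta))  # far below the starting loss of ~30
```

Note this estimator only needs loss values, never backprop, which is why it also runs fine on non-differentiable objectives; the price is that its variance grows with the parameter count, which is exactly the scaling wall being discussed upthread.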