I've spent some time playing with their Jupyter notebooks. The most useful (to me, anyway) is their Example_3_classfication.ipynb ([1]).
It works as advertised with the parameters selected by the authors, but if we modify the network shape in the second half of the tutorial (Classification formulation) from (2, 2) to (2, 2, 2), it fails to generalize: the training loss gets down to 1e-9, while the test loss stays around 3e-1. Going to larger network sizes does not help either.
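For reference, a minimal sketch of the setup I mean, assuming the KAN(width=...) / model.train(...) API and the two-moons data used in the notebooks (sizes and noise here are just placeholders); the only change from the tutorial is the extra hidden layer:

    # Sketch of the modified classification formulation; assumes the pykan API
    # from the notebooks (KAN(width=..., grid=..., k=...), model.train(...)).
    import torch
    from kan import KAN
    from sklearn.datasets import make_moons

    train_x, train_y = make_moons(n_samples=1000, noise=0.1, random_state=0)
    test_x, test_y = make_moons(n_samples=1000, noise=0.1, random_state=1)
    dataset = {
        'train_input': torch.tensor(train_x, dtype=torch.float32),
        'train_label': torch.tensor(train_y, dtype=torch.long),
        'test_input': torch.tensor(test_x, dtype=torch.float32),
        'test_label': torch.tensor(test_y, dtype=torch.long),
    }

    # Tutorial shape is (2, 2); the extra hidden layer makes it (2, 2, 2).
    model = KAN(width=[2, 2, 2], grid=5, k=3)
    model.train(dataset, opt="LBFGS", steps=30,
                loss_fn=torch.nn.CrossEntropyLoss())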
I would really like to see a bigger example, with many more parameters and more data complexity, and whether it could be trained at all. MNIST would be a good start.
Update: I increased the training dataset size 100x, and that helps with the overfitting, but now I can't get the training loss below 1e-2. Still iterating on it; GPU acceleration would really help - right now, my progress is limited by the speed of my CPU.
Update 2: got it to 100% training accuracy and 99% test accuracy with the (2, 2, 2) shape.
Changes:
1. Increased the training set from 1000 to 100k samples. This solved overfitting.
2. In the dataset generation, slightly reduced the noise (0.1 -> 0.07) so that the classes don't overlap. With an overlap, naturally, it's impossible to hit 100%.
3. Most important & specific to KANs: train for 30 steps with grid=5 (5 segments for each activation function), then 30 steps with grid=10 (and initializing from the previous model), and then 30 steps with grid=20. This is idiomatic to KANs and covered in the Example_1_function_fitting.ipynb: https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
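Putting the three changes together, the coarse-to-fine loop looks roughly like this (a sketch assuming the initialize_from_another_model pattern from Example_1; the dataset is built as in the earlier snippet, but with 100k samples and noise=0.07):

    # Sketch of change 3: grid extension 5 -> 10 -> 20, re-initializing each
    # finer model from the previous one (pattern from Example_1).
    # `dataset` is built as in the earlier snippet, with n_samples=100_000
    # and noise=0.07 (changes 1 and 2).
    import torch
    from kan import KAN

    model = None
    for g in (5, 10, 20):
        finer = KAN(width=[2, 2, 2], grid=g, k=3)
        if model is not None:
            # carry the learned splines over to the finer grid
            finer = finer.initialize_from_another_model(model, dataset['train_input'])
        model = finer
        model.train(dataset, opt="LBFGS", steps=30,
                    loss_fn=torch.nn.CrossEntropyLoss())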
Overall, my impressions are:
- it works!
- the reference implementation is very slow. A GPU implementation is dearly needed.
- it feels like it's a bit too non-linear, and training is not as stable as it is with an MLP + ReLU.
- scaling is not guaranteed to work well. I really need to see whether MNIST can be solved with this approach.
I will definitely keep an eye on this development.
This makes me wonder what you could achieve if instead of iteratively growing the grid, or worrying about pruning or regularization, you governed network topology with some sort of evolutionary algorithm.
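As a toy illustration of what I mean (purely hypothetical; evaluate() is a stand-in for training a network of a given shape and returning its validation loss):

    # Toy (mu + lambda) evolution over layer widths; nothing KAN-specific.
    # `evaluate(shape) -> validation loss` is a hypothetical stand-in for
    # training a network with that topology.
    import random

    def evolve_shapes(evaluate, generations=20, population=8, seed=0):
        rng = random.Random(seed)

        def mutate(shape):
            shape = list(shape)
            hidden = list(range(1, len(shape) - 1))  # keep input/output widths fixed
            if not hidden or rng.random() < 0.3:
                shape.insert(rng.randrange(1, len(shape)), 2)      # add a hidden layer
            else:
                i = rng.choice(hidden)
                shape[i] = max(1, shape[i] + rng.choice([-1, 1]))  # widen / narrow
            return tuple(shape)

        pop = [(2, 2)] * population  # start from the tutorial shape
        for _ in range(generations):
            children = [mutate(rng.choice(pop)) for _ in range(population)]
            pop = sorted(set(pop + children), key=evaluate)[:population]
        return pop[0]  # best shape found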
You can do much better by growing an AST with memoization and non-linear regression. So much so that the EVO folks gave a best-paper award to a non-EVO, deterministic algorithm at their conference.
Interesting, the use of grammar production rules reminds me of Grammatical Evolution[0], which has shown some promise in constraining the search space when using EAs for e.g. symbolic regression.
Much of what I did in my work was to reduce or constrain the search space.
1. Don't evolve constants or coefficients; use regression to find them.
2. Leverage associativity and commutativity: simplify with SymPy and sort the operands to add/mul.
So much effort in GP for SR is spent evaluating models that are effectively the same, even though their "DNA" is different: computational effort, and algorithmic effort (to deal with the loss of population diversity, i.e. premature convergence).
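For example, a canonicalization pass along the lines of point 2 (a sketch using SymPy; an illustration only, not the actual PGE code) collapses syntactically different but equivalent candidates before any of them is evaluated:

    # Sketch: canonicalize candidate expressions so equivalent "DNA" is only
    # evaluated once (SymPy illustration, not the actual PGE implementation).
    import sympy as sp

    x, y = sp.symbols("x y")

    def canonical(expr):
        # expand + simplify handle associativity/commutativity; SymPy keeps
        # Add/Mul arguments in a canonical order, so srepr() is a stable key
        return sp.srepr(sp.simplify(sp.expand(expr)))

    candidates = [x*y + x, x + y*x, x*(1 + y), (y + 1)*x]
    unique = {canonical(c): c for c in candidates}
    print(len(unique))  # 1 -- all four candidates are the same model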
I've seen a few papers since pick up on the ideas of local search operators, simplification, and regression while trying to maintain the evolution aspect. Every algo ends up in local optima and keeps working on effectively the same form by adding useless "DNA". I could see the PGE algo doing this too, going down a branch of the search space that does not add meaningful improvement. With the recent (~5 years of) advancements in AI, there are some interesting things to try.
1000s; there is a whole field and set of conferences. You can find more by searching for "Genetic Programming" or "Symbolic Regression".
KAN, with its library of variables and math operators, very much resembles this family of algorithms, problems, and limitations. The lowest-hanging fruit they usually leave on the proverbial tree is that you can use fast regression techniques for the constants and coefficients; there is no need to leave them up to random perturbations or gradient descent. What you really need to figure out is the form or shape of the model, rather than leaving it up to the human (as in KAN).
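Concretely, once the form is fixed, the constants drop out of a single least-squares solve (NumPy sketch; the form is made up for illustration, not taken from KAN or PGE):

    # Sketch: for a fixed form  y ~ a*sin(x) + b*x**2 + c,  the coefficients
    # come from one linear least-squares solve instead of being evolved or
    # found by gradient descent. (Forms nonlinear in their parameters would
    # need e.g. scipy.optimize.curve_fit instead.)
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = 1.5 * np.sin(x) - 0.7 * x**2 + 2.0 + rng.normal(0, 0.05, size=x.shape)

    A = np.column_stack([np.sin(x), x**2, np.ones_like(x)])  # one column per basis term
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coeffs)  # roughly [1.5, -0.7, 2.0]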
> Increased the training set from 1000 to 100k samples. This solved overfitting.
Solved overfitting or created more? Even if your sets are completely disjoint, with something like two moons the more data you have, the lower the variance.
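A quick way to see this (scikit-learn sketch on two moons, with a small MLP standing in for the KAN): the train/test gap shrinks as the training set grows, even with disjoint splits.

    # Sketch: train/test accuracy gap vs. training-set size on two moons.
    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier

    X_test, y_test = make_moons(n_samples=5000, noise=0.1, random_state=1)

    for n in (100, 1000, 10000):
        X, y = make_moons(n_samples=n, noise=0.1, random_state=0)
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                            random_state=0).fit(X, y)
        gap = clf.score(X, y) - clf.score(X_test, y_test)
        print(n, round(gap, 3))  # the gap generally shrinks as n grows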
This. I don't think toy examples are useful for modern ML techniques. If you tested big ideas in ML (transformers, LSTMs, Adam) on a training dataset of 50 numbers trying to fit a y=sin(x) curve, I think you'd wrongly throw those ideas out.
It's possible to run it on CUDA; one of their examples shows how. But I found it's slower than on the CPU. I'm actually not really surprised, since running something on a GPU is no guarantee that it's going to be fast, especially when lots of branching is involved.
Unfortunately, I had to modify KAN.py and KANLayer.py to make it work, as not all relevant tensors are put on the correct device. In some places the formatting even suggests that there was previously a device argument.
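For anyone attempting the same, the changes needed are the usual PyTorch device plumbing (a generic sketch, not the exact patch to KAN.py / KANLayer.py):

    # Generic PyTorch device plumbing; not the exact pykan patch.
    # `model` / `dataset` are the KAN model and data dict from the training
    # snippets earlier in the thread.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = model.to(device)                                  # parameters/buffers
    dataset = {k: v.to(device) for k, v in dataset.items()}   # inputs/labels

    # Inside a layer, new tensors should be created on the caller's device, e.g.
    # grid = torch.linspace(-1, 1, steps=num, device=x.device)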
1. https://github.com/KindXiaoming/pykan/blob/master/tutorials/...