I've spent some time playing with their Jupyter notebooks. The most useful (to me, anyway) is their Example_3_classfication.ipynb ([1]).
It works as advertised with the parameters selected by the authors, but if we modify the network shape in the second half of the tutorial (Classification formulation) from (2, 2) to (2, 2, 2), it fails to generalize: the training loss gets down to 1e-9, while the test loss stays around 3e-1. Going to larger network sizes does not help either.
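For reference, a minimal sketch of the setup I mean, assuming the KAN(width=...) / model.train(...) API and the two-moons data used in the notebooks (sizes and noise here are just placeholders); the only change from the tutorial is the extra hidden layer:

    # Sketch of the modified classification formulation; assumes the pykan API
    # from the notebooks (KAN(width=..., grid=..., k=...), model.train(...)).
    import torch
    from kan import KAN
    from sklearn.datasets import make_moons

    train_x, train_y = make_moons(n_samples=1000, noise=0.1, random_state=0)
    test_x, test_y = make_moons(n_samples=1000, noise=0.1, random_state=1)
    dataset = {
        'train_input': torch.tensor(train_x, dtype=torch.float32),
        'train_label': torch.tensor(train_y, dtype=torch.long),
        'test_input': torch.tensor(test_x, dtype=torch.float32),
        'test_label': torch.tensor(test_y, dtype=torch.long),
    }

    # Tutorial shape is (2, 2); the extra hidden layer makes it (2, 2, 2).
    model = KAN(width=[2, 2, 2], grid=5, k=3)
    model.train(dataset, opt="LBFGS", steps=30,
                loss_fn=torch.nn.CrossEntropyLoss())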
I would really like to see a bigger example, with many more parameters and more data complexity, and whether it could be trained at all. MNIST would be a good start.
Update: I increased the training dataset size 100x, and that helps with the overfitting, but now I can't get the training loss below 1e-2. Still iterating on it; GPU acceleration would really help - right now, my progress is limited by the speed of my CPU.
Update 2: got it to 100% training accuracy and 99% test accuracy with the (2, 2, 2) shape.
Changes:
1. Increased the training set from 1000 to 100k samples. This solved overfitting.
2. In the dataset generation, slightly reduced the noise (0.1 -> 0.07) so that the classes don't overlap. With an overlap, naturally, it's impossible to hit 100%.
3. Most important & specific to KANs: train for 30 steps with grid=5 (5 segments for each activation function), then 30 steps with grid=10 (and initializing from the previous model), and then 30 steps with grid=20. This is idiomatic to KANs and covered in the Example_1_function_fitting.ipynb: https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
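Putting the three changes together, the coarse-to-fine loop looks roughly like this (a sketch assuming the initialize_from_another_model pattern from Example_1; the dataset is built as in the earlier snippet, but with 100k samples and noise=0.07):

    # Sketch of change 3: grid extension 5 -> 10 -> 20, re-initializing each
    # finer model from the previous one (pattern from Example_1).
    # `dataset` is built as in the earlier snippet, with n_samples=100_000
    # and noise=0.07 (changes 1 and 2).
    import torch
    from kan import KAN

    model = None
    for g in (5, 10, 20):
        finer = KAN(width=[2, 2, 2], grid=g, k=3)
        if model is not None:
            # carry the learned splines over to the finer grid
            finer = finer.initialize_from_another_model(model, dataset['train_input'])
        model = finer
        model.train(dataset, opt="LBFGS", steps=30,
                    loss_fn=torch.nn.CrossEntropyLoss())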
Overall, my impressions are:
- it works!
- the reference implementation is very slow. A GPU implementation is dearly needed.
- it feels like it's a bit too non-linear, and training is not as stable as it is with an MLP + ReLU.
- scaling is not guaranteed to work well. I really need to see whether MNIST can be solved with this approach.
I will definitely keep an eye on this development.
This makes me wonder what you could achieve if instead of iteratively growing the grid, or worrying about pruning or regularization, you governed network topology with some sort of evolutionary algorithm.
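As a toy illustration of what I mean (purely hypothetical; evaluate() is a stand-in for training a network of a given shape and returning its validation loss):

    # Toy (mu + lambda) evolution over layer widths; nothing KAN-specific.
    # `evaluate(shape) -> validation loss` is a hypothetical stand-in for
    # training a network with that topology.
    import random

    def evolve_shapes(evaluate, generations=20, population=8, seed=0):
        rng = random.Random(seed)

        def mutate(shape):
            shape = list(shape)
            hidden = list(range(1, len(shape) - 1))  # keep input/output widths fixed
            if not hidden or rng.random() < 0.3:
                shape.insert(rng.randrange(1, len(shape)), 2)      # add a hidden layer
            else:
                i = rng.choice(hidden)
                shape[i] = max(1, shape[i] + rng.choice([-1, 1]))  # widen / narrow
            return tuple(shape)

        pop = [(2, 2)] * population  # start from the tutorial shape
        for _ in range(generations):
            children = [mutate(rng.choice(pop)) for _ in range(population)]
            pop = sorted(set(pop + children), key=evaluate)[:population]
        return pop[0]  # best shape found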
You can do much better by growing an AST with memoization and non-linear regression. So much so that the EVO folks gave a best-paper award to a non-EVO, deterministic algorithm at their conference.
Interesting, the use of grammar production rules reminds me of Grammatical Evolution[0], which has shown some promise in constraining the search space when using EAs for e.g. symbolic regression.
Much of what I did in my work was to reduce or constrain the search space.
1. Don't evolve constants or coefficients; use regression to find them.
2. Leverage associativity and commutativity: simplify with SymPy and sort the operands to add/mul.
So much effort in GP for SR is spent evaluating models that are effectively the same, even though their "DNA" is different: computational effort, and algorithmic effort (to deal with the loss of population diversity, i.e. premature convergence).
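For example, a canonicalization pass along the lines of point 2 (a sketch using SymPy; an illustration only, not the actual PGE code) collapses syntactically different but equivalent candidates before any of them is evaluated:

    # Sketch: canonicalize candidate expressions so equivalent "DNA" is only
    # evaluated once (SymPy illustration, not the actual PGE implementation).
    import sympy as sp

    x, y = sp.symbols("x y")

    def canonical(expr):
        # expand + simplify handle associativity/commutativity; SymPy keeps
        # Add/Mul arguments in a canonical order, so srepr() is a stable key
        return sp.srepr(sp.simplify(sp.expand(expr)))

    candidates = [x*y + x, x + y*x, x*(1 + y), (y + 1)*x]
    unique = {canonical(c): c for c in candidates}
    print(len(unique))  # 1 -- all four candidates are the same model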
I've seen a few papers since pick up on the ideas of local search operators, simplification, and regression while trying to maintain the evolution aspect. Every algo ends up in local optima and keeps working on effectively the same form by adding useless "DNA". I could see the PGE algo doing this too, going down a branch of the search space that does not add meaningful improvement. With the recent (~5 years of) advancements in AI, there are some interesting things to try.
1000s; there is a whole field and set of conferences. You can find more by searching for "Genetic Programming" or "Symbolic Regression".
KAN, with its library of variables and math operators, very much resembles this family of algorithms, problems, and limitations. The lowest-hanging fruit they usually leave on the proverbial tree is that you can use fast regression techniques for the constants and coefficients; there is no need to leave them up to random perturbations or gradient descent. What you really need to figure out is the form or shape of the model, rather than leaving it up to the human (as in KAN).
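Concretely, once the form is fixed, the constants drop out of a single least-squares solve (NumPy sketch; the form is made up for illustration, not taken from KAN or PGE):

    # Sketch: for a fixed form  y ~ a*sin(x) + b*x**2 + c,  the coefficients
    # come from one linear least-squares solve instead of being evolved or
    # found by gradient descent. (Forms nonlinear in their parameters would
    # need e.g. scipy.optimize.curve_fit instead.)
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = 1.5 * np.sin(x) - 0.7 * x**2 + 2.0 + rng.normal(0, 0.05, size=x.shape)

    A = np.column_stack([np.sin(x), x**2, np.ones_like(x)])  # one column per basis term
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coeffs)  # roughly [1.5, -0.7, 2.0]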
> Increased the training set from 1000 to 100k samples. This solved overfitting.
Solved overfitting or created more? Even if your sets are completely disjoint, with something like two moons the more data you have, the lower the variance.
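A quick way to see this (scikit-learn sketch on two moons, with a small MLP standing in for the KAN): the train/test gap shrinks as the training set grows, even with disjoint splits.

    # Sketch: train/test accuracy gap vs. training-set size on two moons.
    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier

    X_test, y_test = make_moons(n_samples=5000, noise=0.1, random_state=1)

    for n in (100, 1000, 10000):
        X, y = make_moons(n_samples=n, noise=0.1, random_state=0)
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                            random_state=0).fit(X, y)
        gap = clf.score(X, y) - clf.score(X_test, y_test)
        print(n, round(gap, 3))  # the gap generally shrinks as n grows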
This. I don't think toy examples are useful for modern ML techniques. If you tested big ideas in ML (transformers, LSTMs, Adam) on a training dataset of 50 numbers trying to fit a y=sin(x) curve, I think you'd wrongly throw those ideas out.
It's possible to run it on CUDA; one of their examples shows how. But I found it's slower than on the CPU. I'm actually not really surprised, since running something on a GPU is no guarantee that it's going to be fast, especially when lots of branching is involved.
Unfortunately, I had to modify KAN.py and KANLayer.py to make it work, as not all relevant tensors are put on the correct device. In some places the formatting even suggests that there was previously a device argument.
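For anyone attempting the same, the changes needed are the usual PyTorch device plumbing (a generic sketch, not the exact patch to KAN.py / KANLayer.py):

    # Generic PyTorch device plumbing; not the exact pykan patch.
    # `model` / `dataset` are the KAN model and data dict from the training
    # snippets earlier in the thread.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = model.to(device)                                  # parameters/buffers
    dataset = {k: v.to(device) for k, v in dataset.items()}   # inputs/labels

    # Inside a layer, new tensors should be created on the caller's device, e.g.
    # grid = torch.linspace(-1, 1, steps=num, device=x.device)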
1. https://github.com/KindXiaoming/pykan/blob/master/tutorials/...