
I quickly skimmed the paper, got inspired to simplify it, and created a PyTorch layer:

https://github.com/GistNoesis/FourierKAN/

The core is really just a few lines.

In the paper they use spline interpolation to represent the 1D functions that they sum. Their code seemed aimed at smaller sizes. Instead I chose a different representation: Fourier coefficients, which are used to interpolate the functions of the individual coordinates.
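Concretely, each 1D function becomes a truncated Fourier series. A minimal sketch of one such function (the names here are illustrative; the repo's actual code is organised differently):

    import torch

    def fourier_1d(x, a, b):
        # x: (batch,) scalar inputs; a, b: (grid_size,) learnable cos/sin coefficients
        k = torch.arange(1, a.shape[0] + 1, device=x.device)   # frequencies 1..grid_size
        xk = x[:, None] * k                                     # (batch, grid_size)
        return (a * torch.cos(xk) + b * torch.sin(xk)).sum(-1)  # (batch,)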

It should give an idea of the representation power of Kolmogorov-Arnold networks. It should probably converge more easily than their spline version, though the spline version needs fewer operations.

Of course, if my code doesn't work, it doesn't mean theirs doesn't.

Feel free to experiment and publish a paper if you want.




When I played around with implementing this last night, I found that using a radial basis function instead of Fourier coefficients (I tried the same approach, nice and parallel and easy to write) was better behaved when training networks of depth greater than 2.
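For reference, by a radial basis function I mean something like Gaussian bumps on a fixed grid replacing the sin/cos basis (rough sketch; the grid range and width here are arbitrary choices):

    import torch

    def rbf_basis(x, grid_size=8, low=-1.0, high=1.0):
        # x: (batch, in_dim) -> (batch, in_dim, grid_size) Gaussian bumps centred on a uniform grid
        centers = torch.linspace(low, high, grid_size, device=x.device)
        width = (high - low) / (grid_size - 1)
        return torch.exp(-((x.unsqueeze(-1) - centers) / width) ** 2)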


Hi Noesis, I just noticed that your implementation, combined with the efficientKAN by Blealtan (https://github.com/Blealtan/efficient-kan), results in a structure very similar to SIREN (an MLP with sine activations). efficientKAN first computes the common basis functions for all the edge activations, and the output can be calculated as a linear combination of that basis. If the basis functions are Fourier, then a KAN layer can be viewed as a linear layer with fixed weights + sine activation + a linear layer with learnable weights, which is a special form of SIREN. I think this may show some connection between KAN and MLP.
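Roughly, a sketch of that reading (shapes and names are mine, not from either repo): the "fixed linear" part multiplies each input coordinate by the frequencies, the sine/cosine provides the basis (cos is just sin with a fixed phase shift), and a learnable linear layer combines it.

    import torch
    import torch.nn as nn

    class FourierKANAsSiren(nn.Module):
        def __init__(self, in_dim, out_dim, grid_size):
            super().__init__()
            # fixed "weights": each input coordinate is multiplied by frequencies 1..grid_size
            self.register_buffer("freqs", torch.arange(1, grid_size + 1).float())
            # learnable linear combination of the sin/cos basis
            self.linear = nn.Linear(2 * in_dim * grid_size, out_dim)

        def forward(self, x):                       # x: (batch, in_dim)
            xk = x.unsqueeze(-1) * self.freqs       # (batch, in_dim, grid_size)
            basis = torch.cat([torch.sin(xk), torch.cos(xk)], dim=-1)
            return self.linear(basis.flatten(1))    # (batch, out_dim)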


How could this help us understand the difference between the learned parameters and their gradients? Can the gradients become one with the parameters a la exponential function?


Does your code work? Did you train it? Any graphs?

>Of course, if my code doesn't work, it doesn't mean theirs doesn't.

But, _does_ it work?


How GPU-friendly is this class of models?


Very unfriendly.

The symbolic library (the set of activation types) requires branching at the very core of the kernel. The GPU will need to serialize these operations warp-wise.

To optimize, you might want to do a scan operation beforehand and dispatch to the activation functions in a warp-specialized way; this, however, makes the global memory reads/writes non-coalesced.

You could then sort the input based on the type of activation and store it in that order; this makes the gmem IO coalesced but requires a gather and scatter as pre- and post-processing.


Wouldn't it be faster to calculate every function type and then just multiply them by 0s or 1s to keep the active ones?
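In PyTorch terms, I mean something like this (a toy sketch with a made-up three-function library, not the paper's actual symbolic set):

    import torch

    def masked_activations(x, mask):
        # x: (n,) inputs; mask: (n, 3) one-hot rows selecting which activation each element uses
        candidates = torch.stack([torch.sin(x), torch.exp(x), x ** 2], dim=-1)  # evaluate every branch
        return (candidates * mask).sum(-1)  # keep only the active one per element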


That's pretty much how branching on GPUs already works.


Couldn't you implement these as a texture lookup, where x is the input and the various functions are stacked in y? That should be quite fast on GPUs.


you really are a pragmatic programmer, Noesis


Thanks. I like simple things.

Sums and products can get you surprisingly far.

Conceptually it's simpler to think about and optimize. But you can also write it using einsum to do the sum-product reductions (I've updated a comment to show how), which uses less memory, but it's more intimidating.
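For illustration, the einsum form would look roughly like this (the coefficient layout is a guess to keep the sketch self-contained, not the repo's exact code):

    import torch

    def fourierkan_einsum(x, coeffs):
        # x: (batch, in_dim); coeffs: (2, out_dim, in_dim, grid_size), cos then sin
        grid_size = coeffs.shape[-1]
        k = torch.arange(1, grid_size + 1, device=x.device)       # frequencies 1..grid_size
        xk = x.unsqueeze(-1) * k                                   # (batch, in_dim, grid_size)
        y = torch.einsum("big,oig->bo", torch.cos(xk), coeffs[0])  # sum-product over in_dim and grid
        y = y + torch.einsum("big,oig->bo", torch.sin(xk), coeffs[1])
        return y                                                   # (batch, out_dim)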

You can probably use the KeOps library to fuse it further (though einsum would get in the way).

But the best is probably a custom kernel. Once you have written it as sums and products, it's just iterating. The core is about 5 lines, but you have to add roughly 500 lines of low-level wrapping code for CUDA parallelisation, C++-to-Python bindings, the various dtypes, and manual derivatives. Then you have to add various checks so that there are no buffer overflows. And then you can optimize for special hardware operations like tensor cores, making sure along the way that no numerical errors were introduced.

So there is a lot more effort involved, and it's usually only worth it if the layer is promising, but hopefully AI will be able to autocomplete these soon.



