Kolmogorov-Arnold Networks (github.com/kindxiaoming)
568 points by sumo43 7 months ago | 142 comments



I quickly skimmed the paper, got inspired to simplify it, and created a PyTorch layer:

https://github.com/GistNoesis/FourierKAN/

The core is really just a few lines.

In the paper they use spline interpolation to represent the 1D functions that they sum. Their code seemed aimed at smaller sizes. Instead I chose a different representation: Fourier coefficients, which are used to interpolate the functions of the individual coordinates.

It should give an idea of the representational power of Kolmogorov-Arnold networks. It should probably converge more easily than their spline version, but the spline version uses fewer operations.
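For concreteness, here is a minimal toy sketch of the idea (a hypothetical illustration, not the code in the repo above; the dimensions, frequency count, and initialization are placeholders): each edge carries a learnable 1D function of one input coordinate, parameterized by Fourier coefficients, and each output is just the sum of those per-coordinate functions.

  import torch
  import torch.nn as nn

  class NaiveFourierKANLayer(nn.Module):
      """Toy Fourier-KAN layer: y_j = sum_i f_ji(x_i), where each f_ji is a
      truncated Fourier series with learnable cos/sin coefficients."""
      def __init__(self, in_dim, out_dim, num_freqs=8):
          super().__init__()
          scale = 1.0 / (in_dim * num_freqs) ** 0.5
          self.coeffs = nn.Parameter(scale * torch.randn(2, out_dim, in_dim, num_freqs))
          self.register_buffer("freqs", torch.arange(1, num_freqs + 1).float())

      def forward(self, x):                           # x: (batch, in_dim)
          kx = x.unsqueeze(-1) * self.freqs           # (batch, in_dim, num_freqs)
          c, s = torch.cos(kx), torch.sin(kx)
          # materialize (batch, out, in, freq) products, then sum over inputs and frequencies
          y = (c.unsqueeze(1) * self.coeffs[0]).sum(dim=(-2, -1)) \
            + (s.unsqueeze(1) * self.coeffs[1]).sum(dim=(-2, -1))
          return y                                    # (batch, out_dim)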

Of course, if my code doesn't work, it doesn't mean theirs doesn't.

Feel free to experiment and publish a paper if you want.


When I played around with implementing this last night, I found that using a radial basis function instead of Fourier coefficients (I tried the same approach: nice and parallel and easy to write) was better behaved when training networks of depth greater than 2.


Hi Noesis, I just noticed that your implementation, combined with the efficientKAN by Blealtan (https://github.com/Blealtan/efficient-kan), results in a structure very similar to Siren(MLP with Sin activations). efficientKAN first computes the common basis functions for all the edge activations and the output can be calculated with a linear combination of the basis. If the basis functions are fourier, then a KAN layer can be viewed as a linear layer with fixed weights + Sin activation + a linear layer with learnable weights, which is a special form of Siren. I think this may show some connection between KAN and MLP.
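A hypothetical sketch of that view (not code from either repo; sizes and frequency count are placeholders): the per-coordinate Fourier basis comes from a frozen linear layer followed by sin (using cos(kx) = sin(kx + pi/2)), and the only learnable part is an ordinary linear layer over those basis values.

  import math
  import torch
  import torch.nn as nn

  class FourierKANAsSiren(nn.Module):
      """Fourier-basis KAN layer written as: fixed linear -> sin -> learnable linear."""
      def __init__(self, in_dim, out_dim, num_freqs=8):
          super().__init__()
          rows, phases = [], []
          for i in range(in_dim):
              for k in range(1, num_freqs + 1):
                  w = torch.zeros(in_dim)
                  w[i] = float(k)
                  rows += [w, w.clone()]
                  phases += [0.0, math.pi / 2]   # sin(k*x_i) and cos(k*x_i) = sin(k*x_i + pi/2)
          self.register_buffer("W", torch.stack(rows))           # fixed first layer (frequencies)
          self.register_buffer("b", torch.tensor(phases))        # fixed phases
          self.out = nn.Linear(2 * in_dim * num_freqs, out_dim)  # learnable Fourier coefficients

      def forward(self, x):                       # x: (batch, in_dim)
          return self.out(torch.sin(x @ self.W.T + self.b))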


How could this help us understand the difference between the learned parameters and their gradients? Can the gradients become one with the parameters a la exponential function?


Does your code work? Did you train it? Any graphs?

>Of course, if my code doesn't work, it doesn't mean theirs doesn't.

But, _does_ it work?


How GPU-friendly is this class of models?


Very unfriendly.

The symbolic library (the set of activation types) requires branching at the very core of the kernel. The GPU will need to serialize these operations warp-wise.

To optimize, you might want to do a scan operation beforehand and dispatch to the activation functions in a warp-specialized way; this, however, makes the global memory reads/writes non-coalesced.

You could then sort the input based on activation type and store it in that order; this makes the gmem IO coalesced, but requires a gather and a scatter as pre- and post-processing.
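As a host-side PyTorch illustration of that sort/gather/scatter scheme (hypothetical sketch, not an actual CUDA kernel; act_ids is assumed to be an integer tensor giving each element's activation type):

  import torch

  def apply_symbolic_activations(x, act_ids, activations):
      """Sort elements by activation type (gather), apply each function to its now
      contiguous slice, then restore the original order (scatter)."""
      order = torch.argsort(act_ids)
      x_sorted = x[order]
      counts = torch.bincount(act_ids, minlength=len(activations))
      bounds = torch.cat([torch.zeros(1, dtype=torch.long), counts.cumsum(0)])
      out_sorted = torch.empty_like(x_sorted)
      for i, fn in enumerate(activations):            # each type is now a contiguous slice
          s, e = bounds[i].item(), bounds[i + 1].item()
          out_sorted[s:e] = fn(x_sorted[s:e])
      out = torch.empty_like(x)
      out[order] = out_sorted                         # scatter back to original positions
      return out

  # e.g. apply_symbolic_activations(x, act_ids, [torch.sin, torch.exp, torch.relu])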


Wouldn't it be faster to calculate every function type and then just multiply them by 0s or 1s to keep the active ones?


That's pretty much how branching on GPUs already works.
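A sketch of that masked, branch-free evaluation (hypothetical, not from any of the linked repos): every activation type is evaluated for every element and a one-hot mask selects the result, trading extra FLOPs for uniform control flow.

  import torch
  import torch.nn.functional as F

  def apply_activations_masked(x, act_ids, activations):
      """Evaluate every activation on every element, then keep the active one with a
      0/1 mask: extra FLOPs, but uniform, branch-free control flow."""
      outs = torch.stack([fn(x) for fn in activations])         # (num_types, N)
      mask = F.one_hot(act_ids, num_classes=len(activations))   # (N, num_types)
      return (outs * mask.T.to(x.dtype)).sum(dim=0)             # (N,)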


couldn't you implement these as a texture lookup, where x is the input and the various functions are stacked in y? That should be quite fast on gpus.


you really are a pragmatic programmer, Noesis


Thanks. I like simple things.

Sums and products can get you surprisingly far.

Conceptually it's simpler to think about and optimize. But you can also write it using einsum to do the sum-product reductions (I've updated a comment to show how) to use less memory, though it's more intimidating.
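For example, the reduction in the toy layer sketched upthread can be written with einsum (again a sketch with the same placeholder names, not the repo's code):

  import torch

  def fourier_kan_forward(x, coeffs, freqs):
      """Same computation as the toy layer above, but the sum-product reduction is
      done with einsum, so the (batch, out, in, freq) intermediate is never stored."""
      kx = x.unsqueeze(-1) * freqs                    # (batch, in, freq)
      return torch.einsum("bif,oif->bo", torch.cos(kx), coeffs[0]) \
           + torch.einsum("bif,oif->bo", torch.sin(kx), coeffs[1])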

You can probably use KeOps library to fuse it further (einsum would get in the way).

But the best is probably a custom kernel. Once you have written it as sums and products, it's just iterating. The core is 5 lines, but you have to add roughly 500 lines of low-level wrapping code for the CUDA parallelisation, the C++-to-Python bindings, the various types, and the manual derivatives. Then you have to add various checks so that there are no buffer overflows. And then you can optimize for special hardware operations like tensor cores, making sure along the way that no numerical errors were introduced.

So there is a lot more effort involved, and it's usually only worth it if the layer is promising, but hopefully AI should be able to autocomplete these soon.


I've spent some time playing with their Jupyter notebooks. The most useful (to me, anyway) is their Example_3_classfication.ipynb ([1]).

It works as advertised with the parameters selected by the authors, but if we modify the network shape in the second half of the tutorial (Classification formulation) from (2, 2) to (2, 2, 2), it fails to generalize. The training loss gets down to 1e-9, while the test loss stays around 3e-1. Going to larger network sizes does not help either.

I would really like to see a bigger example with many more parameters and more data complexity and if it could be trained at all. MNIST would be a good start.

Update: I increased the training dataset size 100x, and that helps with the overfitting, but now I can't get the training loss below 1e-2. Still iterating on it; GPU acceleration would really help - right now, my progress is limited by the speed of my CPU.

1. https://github.com/KindXiaoming/pykan/blob/master/tutorials/...


Update2: got it to 100% training accuracy, 99% test accuracy with (2, 2, 2) shape.

Changes:

1. Increased the training set from 1000 to 100k samples. This solved overfitting.

2. In the dataset generation, slightly reduced noise (0.1 -> 0.07) so that classes don't overlap. With an overlap, naturally, it's impossible to hit 100%.

3. Most important & specific to KANs: train for 30 steps with grid=5 (5 segments for each activation function), then 30 steps with grid=10 (and initializing from the previous model), and then 30 steps with grid=20. This is idiomatic to KANs and covered in the Example_1_function_fitting.ipynb: https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
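For reference, that schedule looks roughly like this, assuming the pykan API as used in the tutorials at the time (KAN(width=..., grid=..., k=...), initialize_from_another_model, model.train); exact names may differ between versions, and `dataset` is the usual pykan dict:

  from kan import KAN

  def train_with_grid_refinement(dataset, grids=(5, 10, 20), steps=30):
      # `dataset` is assumed to hold 'train_input', 'train_label', 'test_input', 'test_label'
      model = None
      for grid in grids:
          new_model = KAN(width=[2, 2, 2], grid=grid, k=3)
          if model is not None:
              # warm-start the finer grid from the coarser model's activations
              new_model = new_model.initialize_from_another_model(model, dataset['train_input'])
          new_model.train(dataset, opt="LBFGS", steps=steps)
          model = new_model
      return model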

Overall, my impressions are:

- it works!

- the reference implementation is very slow. A GPU implementation is dearly needed.

- it feels like it's a bit too non-linear, and training is not as stable as it is with MLP + ReLU.

- Scaling is not guaranteed to work well. Really need to see if MNIST is possible to solve with this approach.

I will definitely keep an eye on this development.


This makes me wonder what you could achieve if instead of iteratively growing the grid, or worrying about pruning or regularization, you governed network topology with some sort of evolutionary algorithm.


You can do much better by growing an AST with memoization and non-linear regression. So much so that the EVO folks gave a best paper award to a non-EVO, deterministic algorithm at their conference:

https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_... (author)


Code for the curious: https://github.com/verdverm/pypge


Interesting, the use of grammar production rules reminds me of Grammatical Evolution[0], which has shown some promise in constraining the search space when using EAs for e.g. symbolic regression.

[0]: https://en.wikipedia.org/wiki/Grammatical_evolution


Much of what I did in my work was to reduce or constrain the search space.

1. Don't evolve constants or coefficients; use regression to find them.

2. Leverage associativity and commutativity, simplify with SymPy, sort operands to add/mul

So much effort in GP for SR is spent evaluating models which are effectively the same, even though their "DNA" is different: computational effort, and algorithmic effort (to deal with loss of population diversity, i.e. premature convergence).

I've seen a few papers since pick up on the ideas of local search operators, the simplification, and the regression, while trying to maintain the evolution aspect. Every algo ends up in local optima and works on effectively the same form while adding useless "DNA". I could see the PGE algo doing this too, going down a branch of the search space that does not add meaningful improvement. With the recent (~5y) advancements in AI, there are some interesting things to try.


Believe there is a Google paper out there that tried that


1000s, there is a whole field and set of conferences. You can find more by searching "Genetic Programming" or "Symbolic Regression"

KAN, with its library of variables and math operators, very much resembles this family of algos, problems, and limitations. The lowest-hanging fruit they usually leave on the proverbial tree is that you can use fast regression techniques for the constants and coefficients; no need to leave those up to random perturbations or gradient descent. What you really need to figure out is the form or shape of the model, rather than leaving it up to the human (as in KAN).


> Increased the training set from 1000 to 100k samples. This solved overfitting.

Solved overfitting or created more? Even if your sets are completely disjoint, with something like two moons the more data you have, the lower the variance.


> I would really like to see a bigger example

This. I don't think toy examples are useful for modern ML techniques. If you tested big ideas in ML (transformers, LSTM's, ADAM) on a training dataset of 50 numbers trying to fit a y=sin(x) curve, I think you'd wrongly throw these ideas out.


It's possible to run it on CUDA. One of their examples shows how. But I found it's slower than on CPU. I'm actually not really surprised since running something on GPU is not a guarantee that it's gonna be fast, especially when lots of branching is involved.

Unfortunately, I had to modify KAN.py and KANLayer.py to make it work, as not all relevant tensors are put on the correct device. In some places the formatting even suggests that there was previously a device argument.


There exists a Kolmogorov-Arnold inspired model in classical statistics called GAMs (https://en.wikipedia.org/wiki/Generalized_additive_model), developed by Hastie and Tibshirani as an extension of GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model).

GLMs in turn generalize logistic-, linear and other popular regression models.

Neural GAMs with learned basis functions have already been proposed, so I'm a bit surprised that the prior art is not mentioned in this new paper. Previous applications focused more on interpretability.


Exactly! This is the first thought that came to my mind. Google search with KAN and GAM bought me here.


The success we're seeing with neural networks is tightly coupled with the ability to scale - the algorithm itself works at scale (more layers), and it also scales well with hardware (neural nets mostly consist of matrix multiplications, and GPUs have specialised matrix multiplication acceleration). One of the most impactful neural network papers, AlexNet, was impactful because it showed that NNs could be put on the GPU, scaled, and accelerated, to great effect.

It's not clear from the paper how well this algorithm will scale, both in terms of the algorithm itself (does it still train well with more layers?) and in terms of its ability to make use of hardware acceleration (e.g. it's not clear to me that the structure, with its per-weight activation functions, can make use of fast matmul acceleration).

It's an interesting idea, that seems to work well and have nice properties on a smaller scale; but whether it's a good architecture for imagenet, LLMs, etc. is not clear at this stage.


> with its per-weight activation functions

Sounds like something which could be approximated by a DCT (discrete cosine transform). JPEG compression does this, and there are hardware accelerations for it.

> can make use of fast matmul acceleration

Maybe not, but matmul acceleration was done in hardware because it's useful for some problems (graphics initially).

So if these per weight activations functions really work, people will be quick to figure out how to run them in hardware.


It's so refreshing to come across new AI research different from the usual "we modified a transformer in this and that way and got slightly better results on this and that benchmark." All those new papers proposing incremental improvements are important, but... everyone is getting a bit tired of them. Also, anecdotal evidence and recent work suggest we're starting to run into fundamental limits inherent to transformers, so we may well need new alternatives.[a]

The best thing about this new work is that it's not an either/or proposition. The proposed "learnable spline interpolations as activation functions" can be used in conventional DNNs, to improve their expressivity. Now we just have to test the stuff to see if it really works better.

Very nice. Thank you for sharing this work here!

---

[a] https://news.ycombinator.com/item?id=40179232


There's a ton actually. They just tend to go through extra rounds of review (or never make it...) and never make it to HN unless there's special circumstances (this one is MIT and CIT). Unfortunately we've let PR become a very powerful force (it's always been a thing, but it seems more influential now). We can fight against this by upvoting things like this and, if you're a reviewee, by not focusing on SOTA (it's clearly been gamed and is clearly leading us in the wrong direction).


Yes, seconding this. If you want a broad view of ML, IMHO the best places to look are conference proceedings. The typical review process is imperfect, so that still doesn't show you all the interesting work out there (as you mention), but it is still a start w.r.t. diversity of research. I follow LLMs closely, but going through proceedings means I come across exciting research like these [1], [2], [3].

References:

[1] A grad.-based way to optimize axis-parallel and oblique decision trees: the Tree Alternating Optimization (TAO) algorithm https://proceedings.neurips.cc/paper_files/paper/2018/file/1.... An extension was the softmax tree https://aclanthology.org/2021.emnlp-main.838/.

[2] XAI explains models, but can you recommend corrective actions? FACE: feasible and Actionable Counterfactual Explanations https://arxiv.org/pdf/1909.09369, Algorithmic Recourse: from Counterfactual Explanations to Interventions https://arxiv.org/pdf/2002.06278

[3] OBOE: Collaborative Filtering for AutoML Model Selection https://arxiv.org/abs/1808.03233


Honestly, these days I just rely on arxiv. The conferences are so noisy that it is hard to really tell what's useful and what's crap. Twitter is a bit better but still a crap shoot. So as far as it seems to me, there's no real good signal to use to differentiate. And what's the point of journals/conferences if not to provide some reasonable signal? If it is a slot machine, it is useless.

And I feel like we're far too dismissive of instances we see where good papers get rejected. We're too dismissive of the collusion rings. What am I putting in all this time to write and all this time to review (and be an emergency reviewer) if we aren't going to take some basic steps forward? Fuck, I've saved a Welling paper from rejection from two reviewers who admitted to not knowing PDEs, and this was a workshop (should have been accepted into the main conference). I think review works for those already successful, who can p̶a̶y̶ "perform more experiments when requested" their way out of review hell, but we're ignoring a lot of good work simply for lack of m̶o̶n̶e̶y̶ compute. It slows down our progress to reach AGI.


I agree with almost all you said except that Twitter is better than top conferences, and I take a contrarian view that reviewers slow down AGI with requests for additional experiments.

Without going into specifics, which you can probably guess based on your background, too many ideas that work well, even optimally, at small scale fail horribly at large scale. Other ideas that work at super specialized settings don't transfer or don't generalize. The saving of two or three dimensions for exact symmetry operations is super important when you deal with a handful of dimensions, and is often irrelevant or slows down training a lot when you already deal with tens of thousands of dimensions. Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely.

It is very likely detrimental to our progress toward AGI that we lack abundant hardware for academics and hobbyists to contribute frontier experiments; however, we don't do anybody a favor by increasing the entropy of the publications in the huge ML conferences. This particular work listed on HN stands out despite the lack of scaling and will probably make it into a top conference (perhaps with some additional background citations), but not everything that is merely novel should simply make it into ICLR or NeurIPS or ICML, otherwise we could have a million papers in each within a few years and nobody would be the wiser.


> too many ideas that work well, even optimally, at small scale fail horribly at large scale.

Not that I disagree, but I don't think that's a reason to not publish. There's another way to rephrase what you've said:

  many ideas that work well at small scales do not trivially work at large scales
But this is true for many works, even transformers. You don't just scale by turning up model parameters and data. You can, but generally more things are going on. So why hold these works back because of that? There may be nuggets in there that are of value, and people may learn how to scale them. Just because they don't scale (now or ever) doesn't mean they aren't of value (and let's be honest, if they don't scale, this is a real killer for the "scale is all you need" people).

> Other ideas that work at super specialized settings don’t transfer or don’t generalize.

It is also hard to tell if these are hyper-parameter settings. Not that I disagree with you, but it is hard to tell.

> Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely.

I'm not sure I understand your argument here. The people I know that work at scale often have the worst understanding of large data: not understanding the differences between density in a normal distribution and a uniform one, thinking that LERPing in a normal distribution yields representative data, or confusing cosine similarity and orthogonality. IME, people that work at scale benefit from being able to throw compute at problems.

> we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences

You and I have very different ideas as to what constitutes information gain. I would say a majority of people studying two models (LLMs and diffusion) results in lower gain, not more.

And as I've said above, I don't care about novelty. It's a meaningless term. (and I wish to god people would read the fucking conference reviewer guidelines as they constantly violate them when discussing novelty)


I think information gain will be easy to measure in principle with an AI in the near future: if the work is correct, how unexpected is it? Anything trivially predictable from the published literature, including exact reproduction disguised as novel, is not worthy of too much attention. Anything that has a chance of changing the model of the world is important. It can seem minor, even trivial, to some nasty reviewer, but if the effect is real and not demonstrated before, then it deserves attention. Until then, we deal with imperfect humans.

Regarding large multimodal data, I don’t know what people you refer to, so I can’t comment further. The current math is useful but very limited when it comes to understanding the densities in such data; vectors are always orthogonal at high dim and densities are always sampled very poorly. The type of understanding of data that would help progress in drug and material design, say, is very different from the type of data that can help a chatbot code. Obviously the future AI should understand it all, but it may take interdisciplinary collaborations that best start at an early age and don’t fit the current academic system very well unfortunately.


> will be easy to measure in principle with an AI in the near future

I'd like to push back on this quite a bit. We don't have AI that shows decent reasoning capabilities. You can hope that this will be resolved, but I'd wager that it will just become more convoluted. A thing that acts like a human, even at an indistinguishable level, need not also be human nor have the same capabilities as a human[0]. This question WILL get harder to answer in the future, I'm certain of that, but we do need to be careful.

Getting to the main point, metrics are fucking hard. The curse of dimensionality isn't just that there are lots of numbers; it is that your nearest neighbor becomes ambiguous. It is that the difference between the furthest point and the closest point (the nearest neighbor) decreases. It is that orthogonality becomes a vaguer concept. It is that the mean may not be representative of a distribution. This stuff is incredibly complex and convolutes the nature of these measurements. For AI to be better than us, it would have to actually reason, because right now we __decide__ not to reason and instead __decide__ to take the easy way out and act as if metrics are the same as they are in 2D (ignoring all advice from the mathematicians...).

It is not necessarily about the type of data when the issue we're facing is at an abstraction of any type of data. Categorically they share a lot of features. The current mindset in ML is "you don't need math" when the current wall we face is highly dependent on understanding these complex mathematics.

I think it is incredibly naive to just rely on AI solving our problems. How do we make AI to solve problems when we __won't__ even address the basic nature of problems themselves?

[0] As an example, think about an animatronic duck. It could be very lifelike and probably even fool a duck. In fact, we've seen pretty low quality ones fool animals, including just ones that are static and don't make sounds. Now imagine one that can fly and quack. But is it a duck? Can we do this without the robot being sentient? Certainly! Will it also fool humans? Almost surely! (No, I'm not suggesting birds aren't real. Just to clarify)


An AI that can help referee papers to advance human knowledge doesn’t need to have lots of human qualities. I think it suffices if a) it has the ability to judge correctness precisely, and b) it expresses a degree of surprise (low log likelihood?) if the correct data does not fit its current worldview.


> it has the ability to judge correctness precisely,

That's not possible from a paper.

> it expresses a degree of surprise (low log likelihood?)

I think you're interpreting statistical terms too literally.

The truth of the matter is that we rely on a lot of trust from both reviewers and authors. This isn't a mechanical process. You can't just take metrics at face value[0]. The difficulty of peer review is the thing that AI systems are __the worst__ at, and we have absolutely no idea how to resolve that. It is about nuance. Anything short of nuance and we get metric hacking. And boy, if you want to see the degradation of academic work, make the referee an automated system. No matter how complex that system is, I guarantee you human ingenuity will win and you'll just have metric hacking. We already see this in human-led systems (like "peer review", as anyone that's ever had a job has experienced).

I for one don't want to see science led by metric hacking.

Processes will always be noisy, and I'm not suggesting we can get a perfect system. But if we're unwilling to recognize the limitations of our systems and the governing dynamics of the tools that we build, then we're doomed to metric hack. It's a tale as old as time (literally). Now, if we create a sentient intelligence, well, that's a completely different ball game, but that's not what you were arguing either.

  You need to stop focusing on "making things work" and making sure they actually work. No measurement is perfectly aligned with ones goals. Anyone in ML that isn't intimately familiar with Goodhart's Law is simply an architect of Goodhart's Hell.
Especially if we are to discuss AGI, because there is no perfect way to measure and there never will be. It is a limitation in physics and mathematics. The story of the Jinni is about precisely this, but we've formalized it.

[0] This is the whole problem with SOTA. Some metrics no longer actually mean anything useful. I'll give an example: look at FID, the main metric for goodness of image generation. Its assumptions are poor (the norms aren't very normal, and it's based on ImageNet-1k training, which is extremely biased; and no, these aren't solved by just switching to CLIP-FID). There have been many papers written on this, and similar for any given metric.


Yes, arxiv is a good first source too. I mentioned conferences as a way to get exposed to diversity, but not necessarily (sadly) merit. It has been my experience as both an author and a reviewer that review quality has, for the most part, plummeted over the years. As a reviewer I had to struggle with the ills of both "commission and omission", i.e., (a) convincing other reviewers to see an idea (from a trendy area such as in-context learning) as not novel (because it has been done before, even in the area of LLMs), and (b) to see an idea as novel, when it wouldn't have seemed so initially because some reviewers weren't aware of the background or impact of anything non-LLM or, god forbid, non-DL. As an author this has personally affected me because I had to work on my PhD remotely, so I didn't have access to a bunch of compute, and I deliberately picked a non-DL area; I had to pay the price for that in terms of multiple rejections, reviewer ghosting, and journals not responding for years (yes, years).


I've stopped considering novelty at all. The only thing I now consider is whether the precise technique has been done before. If not, well, I've seen pretty small things change results dramatically. The pattern I've seen that scares me more is that when authors do find simple but effective changes, they end up convoluting the ideas, because simplicity and clarity are often confused with novelty. And honestly, revisiting ideas is useful as our environments change. So I don't want to discourage this type of work.

Personally, this has affected me as a late-stage PhD student. Late in the literal sense, as I'm not getting my work pushed out (even some SOTA stuff) because of factors like these, and my department insists something is wrong with me but will not read my papers, the reviews, or suggest what I need to do besides "publish more." (Literally told to me: "try publishing 5 papers a year, one should get in.") You'll laugh at this: I pushed a paper into a workshop and a major complaint was that I didn't give enough background on StyleGAN because "not everyone would be familiar with the architecture" (while I can understand the comment, 8 pages is not much room when you have to show pictures on several datasets; my appendix was quite lengthy and included all requested information). We just used a GAN as a proxy because diffusion is much more expensive to train (the most common complaints are "not enough datasets" and "how does it scale"). I think this is the reason so many universities use pretrained networks instead of training things from scratch, which just railroads research.

(I also got a paper desk-rejected twice. First because it was "already published" - it took 2 months for them to realize it was arxiv-only. Then they fixed that and rejected it again because it "didn't cite relevant works", with no mention of what those works were... I've obviously lost all faith in the review process.)


Sorry to hear all this (after writing my other sibling comment). Please don’t lose faith in the review process. It is still useful. Until the AGI can be better reviewers, which is hopefully not too far in the future.


For me to regain faith in the review process I need to actually see some semblance of the review process working.

So far, instead, I've seen:

  - Banning social media posting so that only big tech and collusion posting can happen, to "protect the little guy"
  - Undoing the ban to lots of complaints
  - Instituting a no LLM policy with no teeth and no method to actually verify
  - Instituting a high school track to get those rich kids in sooner
Until I see changes like "we're going to focus on review quality", I'm going to continue thinking it is a scam. They get paid by my tax dollars and by private companies, and I volunteer my time, for what...? Something an LLM could have actually done better? I'm seeing great papers from big (and small) labs get turned down while terrible papers are getting accepted. Collusion rings go unpunished. And methods get more and more convoluted as everyone tries to game the system.

You'd think that we in ML, of all people, would understand reward hacking. But until we admit it, we can't solve it. And if we can't solve it here, how the hell are we going to convince anyone we're going to create safe AGI?


I feel you. Here are some thoughts from the other side of the fence:

Social media banning aims to preserve anonymity when the reviews are blind. It is hard to convincingly keep anonymity for many submissions, but an effort to keep it is still worthwhile and typically helps the less privileged to get a fair shot at a decent review, avoiding the social media popularity contest.

The policies for LLM usage differ between conferences. The only possibly valid concern with the use of AI is the disclosure of non-public info to an outside LLM company that may happen to publish or be retrained on that data (however unlikely this is in practice) before the paper becomes public; for example, someone could withdraw their publication and it would no longer see the light of day on the openreview website. (I personally disagree with this concern.) As far as I know there is no real limitation on using self-hosted AI as long as the reviewer takes full credit for the final product, and no limitation on using non-public AI to improve the review's clarity without dumping the full paper text. A fraction of authors would appreciate better referee reports, so at a minimum, the use of AI can bridge the language gap. I wouldn't mind the conferences instituting automatic AI processing to help reviewers reduce ambiguity and avoid trivialities.

The high school track has been ridiculed, as expected. I think it is a great idea and doesn't only apply to rich kids. There exist excellent specialized schools in NYC and other places in the US that might find ways to get resources for underprivileged, ambitious high schoolers. It is possible that in the future a variant of such a track will incentivize some industry to donate compute resources to high school programs, and it may start early and powerful local communities. I learned a lot in what would be middle school in the US by interacting with self-motivated children at an ad hoc computer club, and I kept the same level of osmotic learning in the computer lab at college. The current state of AI is not super deep in terms of background knowledge, mostly super broad; some specialized high schools already cover calculus and linear algebra, and certainly many high schools nowadays provide sufficient background in programming and elementary data analysis.

My personal reward hacking is that the conferences provide a decent way to focus the review to the top hundred or couple hundred plausible abstracts and even when the eventual choice is wrong I get a much better reward to noise ratio than from social media and the pure attacks on the arxiv (although LLMs help here as well). I always find it refreshing to see the novel ideas when they are in a raw form before they have been polished and before everyone can easily judge their worth. Too many of them get unnecessary negative views, which is why the system integrates multiple reviewers and area chairs that can make corrective decisions. It is important to avoid too much noise even at the risk of missing a couple great ones, and yet it always hurts when people drop greatness because of misunderstandings or poor chair choices. No system is perfect, but scaling these conferences from a couple hundred people a year up to about a dozen years ago to approaching hundred thousand a year has worked reasonably well.


> Social media banning aims to preserve anonymity when the reviews are blind.

Then ban preprints. That's the only reasonable resolution to solve the stated problem. But I think we recognize that in doing so, we'd be taking steps back that aren't worth it.

> avoiding the social media popularity contest.

The unfortunate truth is that this has always been the case. It's just gotten worse because __we__ the researchers fall for this trap more than the public does. Specifically, we discourage dissenting opinions. Specifically, we still rely heavily on authority (but we call it prestige).

> The policies for LLM usage differ between conferences.

This is known, and my comment was a direct reference to the CVPR policy being laughable.

The point I was making is not as literal as your interpretation. It is one step abstracted: the official policies are being carelessly made, in ways that are laughable and demonstrate that only the smallest iota of reasoning was put into them. The implication is that the goal is to signal rather than to address the issues at hand. Because let's be real, resolving the issues is no easy task. So instead of acknowledging the difficulties and addressing them, we try to sweep them under the rug and signal that we are doing something. But that's no different from throwing your hands up and giving up.

> The high school track ... doesn’t only apply to rich kids.

You're right in theory but if you think this will be correct in practice I encourage you to reason a bit more deeply and talk to your peers who come from middle and lower class families. Ones where parents were not in academia. Ones where they may be the only STEM person in their family. The only person pursuing graduate education. Maybe even the only one with an undergraduate degree (or that it is uncommon in their family). Ask them if they had a robotics club. A chess club. IB classes? AP classes? Hell, I'll even tell you that my undergraduate didn't even have research opportunities, and this is essentially a requirement now for grad school. Be wary of the bubbles you live in. If you do not have these people around you, then consider the bias/bubble that led to this situation. And I'll ask you an important question: do you really think the difference between any two random STEM majors in undergrad are large? Sure, there's decent variance, but do you truthfully think that you can't pick a random STEM student from a school ranked 100 and place them in a top 10 school (assume financials are not an issue and forget family issues), that they would not have a similar success rate? Because there's plenty of data on this (there's a reason I mentioned the specific caveats, but let's recognize those aren't about the person's capabilities, which is what my question is after). If you are on my side, then I think you'd recognize that the way we are doing things is giving up a lot of potential talent, and if you want to accelerate the path to AGI then I'd argue that this is far more influential than any r̶i̶c̶h̶ ̶c̶h̶i̶l̶d̶,̶ ̶c̶h̶i̶l̶d̶ ̶o̶f̶ ̶p̶r̶o̶f̶e̶s̶s̶o̶r̶ High School track. But we both know that's not going to happen because we care more about e̶l̶i̶t̶i̶s̶m̶ "prestige" than efficiency. (And think about the consequences of this for when we teach a machine to mimic humans)

Edit: I want to make sure I ask a different question. You seem to recognize that there is a problem. I take it you think it's small. Then why defend it? Why not try to solve it? If you think there is no problem, why? And why do you think it isn't when so many do? (There seems to be a bias of where these attitudes come from. And I want to make clear that I truly believe everyone is working hard. I don't think anyone is trying to undermine hard work. I don't care if you're at a rank 1 or 100 school, if you're doing a PhD you're doing hard work)


Sorry to hear that. My experiences haven't been very different. I really can't tell whether the current review process is the least bad among the alternatives or whether there is something better (and if so, what is it?).


I'm sorry to hear that too. I really wish there was something that could be done. I imagine a lot of graduate students are in complicated situations because of this.

As for alternatives: I don't see why we don't just push to OpenReview and call it a day. We can link our code, it has revisions, and people can comment and review. I don't see how having 1-3 referees who don't want to read my paper, have no interest in it, but have strong incentives to reject it provides any meaningful signal of value. I'll take arxiv over their opinions.


> I've saved a Welling paper from rejection from two reviewers who admitted to not knowing PDEs

Thank you for fighting the good fight.

This is why I love OpenReview: I can spot and ignore nonsensical reviewer criticisms and ratings and look for the insightful comments and rebuttals. Many reviewers do put in a lot of very valuable work reading and critiquing, most of which would go to waste if not made public.


I like OR too, and I wish we would just post there instead. It has everything we need, and I see no value from the venues. No one wants to act in good faith, and they have every incentive not to.

And I gotta say, I'm not going to put up a fight much longer. As soon as I get out of my PhD I intend to just post to OR.


> never make it to HN unless there's special circumstances

Yes, I agree. The two most common patterns I've noticed in research that does show up on HN are: 1) It outright improves, or has the potential to improve, applications currently used in production by many HN readers. In other words, it's not just navel-gazing. 2) The authors and/or their organizations are well-known, as you suggest.


What bothers me the most is that comments will float to the top of a link to an arxiv paper or uni press release where people talk about how something is still in a prototype stage and not in production yet / has a ways to go to production. While this is fine, that is also the context of works like these. But it is the same thing I see in reviews. I've had works of my own killed because reviewers treat the paper as a product rather than... you know... research.


For example, I find Spiking Neural Networks to be cool, but until they reach SOTA, how can they displace conventional neural networks?


Compare how much time has been spent studying the two different architectures. Who knows if SNNs can displace other stuff, but I wouldn't rely on SOTA for being the benchmark. Progress has to be made and it isn't made in leaps and bounds. If you find them cool, study them more. Maybe you'll stumble onto something. Maybe you'll find an edge in a niche domain (and maybe you find that that edge can generalize more than you initially thought).

Stop worrying about displacing conventional networks and start worrying about understanding things. We chip away at this together, as a community. There's a lot we need to learn and a lot that needs to be explored. Why tie anyone's hands behind their backs?


I'm no stranger to having written papers that follow my own curiosity that didn't show any promising results.

However, I wouldn't blame "the community" for not taking my idea and building on it. There needs to be a seed of hope, a taste of future benefits, or else why is it anybody's obligation to care about something subpar?

The introducer of a novel idea needs to beat the incumbent by a large margin. This is just reality, not injustice.


The incumbent approaches usually benefit from a ton of research that might or might not be transferrable to the newcomer.

Even if many optimizations also apply to the new approaches, taking advantage of them takes a lot of work. For example, I have not yet implemented KV caches for my nanoGPTs that I'm fooling around with.


> The introducer of a novel idea needs to beat the incumbent by a large margin. This is just reality, not injustice.

It is an injustice and an impediment to scientific progress.

It is also a very odd thing to see in any kind of technological progress; this is not a normal process, btw. Generally we see S-curves, and the newer technology is initially worse. That should be unsurprising given that it has had far less time and far less attention. You have to look at the potential and see if things are worth pursuing. We should not expect that to be carried by one team. If we do, we'll only have the lucky, the crazy, and the big leading. That's not a great thing for science, especially if we want to claim that it runs on the merit of ideas, not status.


I read a book on NNs by Robert Hecht-Nielsen in 1989, during the NN hype of the time (I believe it was the 2nd hype cycle, the first beginning with Rosenblatt's original hardware perceptron and dying with Minsky and Papert's "Perceptrons" manuscript a decade or two earlier).

Everything described was laughably basic by modern standards, but the motivation given in that book was the Kolmogorov representation theorem: a modest 3-layer network with the right activation functions can represent any continuous m-to-n function.
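For reference, the theorem states that any continuous f : [0,1]^n -> R can be written using only continuous univariate functions and addition (an m-output map is handled component-wise):

  f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)

which is the depth-2, width-(2n+1) structure those 3-layer networks mirrored, and the starting point the KAN paper generalizes to arbitrary widths and depths.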

Most research back then focused on 3-layer networks, possibly for that reason. Sigmoid activation was king, and vanishing gradients were the main issue. It took two decades until AlexNet brought NN research back from the AI winter of the 1990s.


> Everyone is getting tired of those papers.

This is science as is :)

95% of it will produce mediocre-to-nice improvements to what we already have, so that researchers can eventually grow up and do something really exciting.


Nothing wrong with incremental improvements. Giant leaps (almost always) only happen because of a lack of your niche domain expertise. And I mean niche niche


From the preprint: 100 input dimensions is considered "high", and most problems considered have 5 or fewer input dimensions. This is typical of the physics-inspired settings I've seen considered in ML. The next step would be demonstrating them on MNIST, which, at 784 dimensions, is tiny by modern standards.


In actual business processes there are lots of ML problems with fewer than 100 input dimensions. But for most of them decision trees are still competitive with neural networks or even outperform them.


The aid to explainability seems at least somewhat compelling. Understanding what a random forest did isn't always easy. And if what you want isn't the model but the closed form of what the model does, this could be quite useful. When those hundred input dimensions interact nonlinearly in a million ways, that's nice. Or, more likely, I'd use it when I don't want to find a pencil to derive the closed form of what I'm trying to do.


Business processes don't need deep learning in the first place. It is just there because of hype.


Competent companies tend to put a lot of effort into building data analysis tools. There will often be A/B or QRT frameworks in place allowing deployment of two models, for example the new deep learning model and the old rule-based system. By using the results from these experiments in conjunction with typical offline and online evaluation metrics, one can begin to make statements about the impact of model performance on revenue. Naturally, model performance is tracked through many offline and online metrics. So people can and do say things like "if this model is x% more accurate, then that translates to $y million in monthly revenue" with great confidence.

Let's call someone working at such a company Bob.

A restatement of your claim is that Bob decided to launch a model to production because of hype rather than because he could justify his promotion by pointing to the millions of dollars in increased revenue his switch produced. Bob of course did not make his decision based on hype. He made his decision because there were evaluation criteria in place for the launch. He was literally not allowed to launch things that didn't improve the system according to the evaluation criteria. As Bob didn't want to be fired for not doing anything at the company, he was forced to use a tool that improved the evaluation according to the criteria that were specified. So he used the tool that worked. Hype might provide motivation to experiment, but it doesn't justify a launch.

I say this as someone who's literally seen transitions from decision trees to deep learning models on <100-feature models which had multi-million dollar monthly revenue impacts.


Very interesting! Kolmogorov neural networks can represent discontinuous functions [1], but I've wondered how practically applicable they are. This repo seems to show that they have some use after all.

[1]: https://arxiv.org/abs/2311.00049


Not for discontinuous functions. As your paper explains, we know that g exists for discontinuous bounded functions, but we have nothing to find it with.

> A practical construction of g in cases with discontinuous bounded and unbounded functions is not yet known. For such cases Theorem 2.1 gives only a theoretical understanding of the representation problem. This is because for the representation of discontinuous bounded functions we have derived (2.1) from the fact that the range of the operator Z∗ is the whole space of bounded functions B(Id). This fact directly gives us a formula (2.1) but does not tell how the bounded one-variable function g is attained. For the representation of unbounded functions we have used a linear extension of the functional F, existence of which is based on Zorn's lemma (see, e.g., [19, Ch. 3]). Application of Zorn's lemma provides no mechanism for practical construction of such an extension. Zorn's lemma helps to assert only its existence.

If you look at the arxiv link in the OP, you will see they are using splines.

https://arxiv.org/abs/2404.19756

Still interesting and potentially useful, but not useful for discontinuous functions without further discoveries.

If I am wrong please provide a link, it is of great interest to me.


Perhaps a hasty comment but linear combinations of B-splines are yet another (higher-degree) B-spline. Isn't this simply fitting high degree B-splines to functions?


That would be true for a single node / single layer. But once the output of one layer is fed into the input of the next, it is not just a linear combination of splines anymore.
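To make that concrete (standard piecewise-polynomial facts, not something specific to the paper): a linear combination of degree-k B-splines on a fixed knot grid is again a degree-k spline on that grid, but a two-layer composition

  \Phi\left( \sum_p \varphi_p(x_p) \right)

with degree-k splines \Phi and \varphi_p is a piecewise polynomial of degree up to k^2, whose breakpoints fall wherever the inner sum crosses a knot of \Phi, i.e. they depend on the learned inner functions. So the stacked network is not equivalent to fitting a single higher-degree B-spline on a fixed grid.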


1. Interestingly the foundations of this approach and MLP were invented / discovered around the same time about 66 years ago:

1957: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...

1958: https://en.wikipedia.org/wiki/Multilayer_perceptron

2. Another advantage of this approach is that it has only one class of parameters (the coefficients of the local activation functions) as opposed to MLP which has three classes of parameters (weights, biases, and the globally uniform activation function).

3. Everybody is talking about transformers. I want to see diffusion models with this approach.


Biases are just weights on an always on input.

There isn't much difference between weights of a linear sum and coefficients of a spline.


> Biases are just weights on an always on input.

Granted, however this approach does not require that constant-one input either.

> There isn't much difference between weights of a linear sum and coefficients of a function.

Yes, the trained function coefficients of this approach are equivalent to the trained weights of an MLP. Still, this approach does not require the globally uniform activation function of an MLP.


At this point this is a distinction without a difference.

The only question is if splines are more efficient than lines at describing general functions at the billion to trillion parameter count.


To your 3rd point, most diffusion models already use a transformer-based architecture (U-Net with self attention and cross attention, Vision Transformer, Diffusion Transformer, etc.).


Yes, #2 is a difference. But what makes it an advantage?

One might argue this via parsimony (Occam’s razor). Is this your thinking? / Anything else?


I may be wrong, but with modern LLMs biases aren't really used any more.


From what I remember, larger LLMs like PaLM don't use biases for training stability, but smaller ones tend to still use them.


Feels like someone stuffed splines into decision trees.


It’s Monte Carlo all the way down


splines, yes.

I'm not seeing decision trees, though. Am I missing something?

> "KANs’ nodes simply sum incoming signals without applying any non-linearities." (page 2 of the PDF)


I definitely think I'm projecting and maybe seeing things that aren't there. If you replaced splines with linear weights, it kind of looks like a decision tree to me.


If it were me, I would let go of that association, unless you find value in it.


Very cool stuff! Exciting to see so many people sharing their works on KANs. Seeing as the authors claim that KANs are able to reduce the issues of catastrophic forgetting that we see in MLPs, I thought "Wouldn't it be nice if there was an LLM that substituted MLPs with KANs?". I looked around and didn't find one, so I built one!

- PyTorch Module of the KAN GPT

- Deployed to PyPi

- MIT Licence

- Test Cases to ensure forward-backward passes work as expected

- Training script

I am currently working on training it on the WebText dataset to compare it to the original gpt2. Facing a few out-of-memory issues at the moment. Perhaps the vocab size (50257) is too large?

I'm open to contributions and would love to hear your thoughts!

https://github.com/AdityaNG/kan-gpt

https://pypi.org/project/kan-gpt/


This reminds me of Weight Agnostic Neural Networks https://weightagnostic.github.io/



An arXiv article from 2023 researches the more generic case, where splines are considered a particular case of the basis functions.


https://kindxiaoming.github.io/pykan/intro.html

At the end of this example, they recover the symbolic formula that generated their training set: exp(x₂² + sin(3.14x₁)).

It's like a computation graph with a library of "activation functions" that is optimised, and then pruned. You can recover good symbolic formulas from the pruned graph.

Maybe not meaningful for MNIST.


I wonder if Breiman’s ACE (alternating conditional expectation) is useful as a building block here.

It will easily recover this formula, because it is separable under the log transformation (which ACE recovers as well).

But ACE doesn't work well on non-separable problems - not sure how well KAN will.


It’d be really cool to see a transformer with the MLP layers swapped for KANs and then compare its scaling properties with vanilla transformers


After trying this out with the fourier implementation above, swapping MLP/Attention Linear layers for KANs (all, or even a few layers) produces diverging loss. KANs don't require normalization for good forward pass dynamics, but may be trickier to train in a deep net.


Note that KANs use LBFGS, which is a second-order optimization method. My experience with the use of second-order methods suggests that simple gradient descent often leads to divergence.


This was the first thought that came to my mind too.

Given that it's sparse, will this just be a replacement for MoE?


MoE is mostly used to enable load balancing, since it makes it possible to put experts on different GPUs. This isn't so easy to do with a monolithic but sparse layer.


Why was this your first thought? Is a limiting factor to transformers the MLP layer? I thought the bottleneck was in the renormalization part.


At small input sizes, yes, the MLP dominates compute. At large input sizes, attention matters more.


How does back propagation work now? Do these suffer from vanishing or exploding gradients?


Page 6 explains how they did backpropagation (https://arxiv.org/pdf/2404.19756), and page 2 says that previous efforts to leverage the Kolmogorov-Arnold representation failed to use backpropagation, so maybe using backpropagation to train multilayer networks with this architecture is their main contribution?

> Unsurprisingly, the possibility of using Kolmogorov-Arnold representation theorem to build neural networks has been studied [8, 9, 10, 11, 12, 13]. However, most work has stuck with the original depth-2 width-(2n + 1) representation, and did not have the chance to leverage more modern techniques (e.g., back propagation) to train the networks. Our contribution lies in generalizing the original Kolmogorov-Arnold representation to arbitrary widths and depths, revitalizing and contextualizing it in today's deep learning world, as well as using extensive empirical experiments to highlight its potential role as a foundation model for AI + Science due to its accuracy and interpretability.


No, the activations are a combination of the basis function and the spline function. It's still a little unclear to me how the grid works, but it seems like this shouldn't suffer any more than a generic ReLU MLP.



I can't assess this, but I do worry that overnight some algorithmic advance will enhance LLMs by orders of magnitude and the next big model to get trained is suddenly 10,000x better than GPT-4 and nobody's ready for it.


What is there to be worried about? Technical progress will happen, sometimes in sudden jumps. Some company will become a leader; competitors will catch up after a while.


"Technical progress" has been destroying our habitat for centuries, causing lots of other species to go extinct. Pretty much the entire planet surface has been 'technically progressed', spreading plastics, climate change and whatnot over the entirety of it.

Are you assuming that this particular "progress" would be relatively innocent?


On the other hand, the same "technical progress" (if we're putting machine learning, deforestation, and mining in the same bag) gave you medicine, which turns many otherwise deadly diseases into inconveniences and allows you to work less than 12 hrs/7 days per week to not die from hunger in a large portion of the world. A few hundred years ago, unless you were born into the lucky 0.01% of the ruling population, working from dawn to sunset was the norm for a lot more people than now.

I'm not assuming that something 10k x better than GPT-4 will be good or bad; I don't know. I was just curious what exactly to be worried about. I think in the current state, LLMs are already advanced enough for bad uses like article generation for SEO, spam, scams, etc., and I wonder if an order of magnitude better model would allow for something worse.


Where did you learn that history?

What do you mean by "better"?


I had a European peasant in the 1600-1700s in mind when I wrote about the amount of work. During the season, they worked all day; off-season, they had "free time" that went into taking care of the household, inventory, etc., so it's still work. Can't quickly find a reliable source in English I could link, so I can be wrong here.

"Better" was referring to what OP wrote in the top comment. I guess 10x faster, 10x longer context, and 100x less prone to hallucinations would make a good "10k x better" than GPT-4.


Sorry, I can't fit that with what you wrote earlier: "12 hrs/7 days per week to not die from hunger".

Those peasants paid taxes, i.e. some of their work was exploited by an army or a priest rather than hunger, and, as you mention, they did not work "12 hrs/7 days per week".

Do you have a better example?


This entire line of argument is just pointless.


You probably placed this wrong? I'm not driving a line of argument here.


I mean 6mian. He hand-waved non-data (badly) disguised as historical facts to make a point. Then you came around and asked for actual facts. It's clear you won't get them, because he got nothing to begin with.


Many species went extinct during Earth's history. Evolution requires quite aggressive competition.

The way the habitat got destroyed by humans is stupid because it might put us in danger. You can call me "speciesist", but I care more for humans than for any particular other species.

So I think progress should be geared towards the survival of the human species and, if possible, towards preventing other species' extinction. Some of the current developments are a bit too much on the side of "I don't care about anyone's survival" (which is stupid and inefficient).


If other species die, we follow shortly. This anthropocentric view really ignores how much of our food chain exists because of other animals surviving despite human activities.


Evolution is the result of catastrophes and atrocities. You use the word as if it has positive connotations, which I find weird.

How do you come to the conclusion "stupid" rather than evil? Aren't we very aware of the consequences of how we are currently organising human societies, and have been for a long time?


I think this is unlikely. There has never (in the visible fossil record) been a mutation that suddenly made tigers an order of magnitude stronger and faster, or humans an order of magnitude more intelligent. It's been a long time (if ever?) since chip transistor density made a multiple-order-of-magnitude leap. Any complex optimized system has many limiting factors and it's unlikely that all of them would leap forward at once. The current generation of LLMs are not as complex or optimized as tigers or humans, but they're far enough along that changing one thing is unlikely to result in a giant leap.

If and when something radically better comes along, say an alternative to back-propagation that is more like the way our brains learn, it will need a lot of scaling and refinement to catch up with the then-current LLM.


Comparing it to evolution and SNPs isn't really a good analogy. Novel network architectures are much larger changes, maybe comparable to new organelles or metabolic pathways? And those have caused catastrophic changes. Evolution also operates on much longer time-scales due to its blind parallel search.

https://en.wikipedia.org/wiki/Oxygen_catastrophe


>some algorithmic advance will enhance LLMs by orders of magnitude

I would worry if I owned Nvidia shares.


Actually, that would be fantastic for NVIDIA shares:

1. A new architecture would make all/most of these upcoming Transformer accelerators obsolete => back to GPUs.

2. Higher-performance LLMs on GPUs => we can speed up LLMs with 1T+ parameters. So LLMs become more useful, so more GPUs would be purchased.


> 1. A new architecture would make all/most of these upcoming Transformer accelerators obsolete => back to GPUs.

There's no guarantee that that is what would happen. The right (or wrong, depending on your POV) algorithmic breakthrough might make GPUs obsolete for AI by making CPUs (or analog computing units, or DSPs, or "other") the preferred platform to run AI.


Assuming there is a development that makes GPUs obsolete, I think it's safe to assume that whatever replaces them at scale will still take the form of a dedicated AI card/rack:

1. Tight integration is necessary for fundamental compute constraints like memory latency.

2. Economies of scale.

3. Opportunity cost to AI orgs. Meta, OpenAI, etc. want 50k H100s to arrive in a shipping container and plug in so they can focus on their value-add.

Everyone will have to readjust to this paradigm. Even if next-gen AI runs better on CPUs, Intel won't suddenly be signing contracts to sell 1,000,000 Xeons and 1,000,000 motherboards, etc.

Also, Nvidia has $25B cash on hand and an almost $10B yearly R&D spend. They've been an AI-first company for over a decade now; they're more prepared to pivot than anyone else.

Edit: nearly forgot - Nvidia could issue 5% new stock and raise $100B like it's nothing.


If you like this, you may also like this 2019 research paper: "Deep networks and the Kolmogorov–Arnold theorem" https://hadrien-montanelli.github.io/2019-06-25.html https://arxiv.org/pdf/1906.11945


A more elaborate implementation of this was published years ago, and it wasn't the very first one https://www.science.org/doi/10.1126/science.1165893


https://arxiv.org/abs/1210.7273

  In the article "Distilling free-form natural laws from experimental data", Schmidt and Lipson introduced the idea that free-form natural laws can be learned from experimental measurements in a physical system using symbolic (genetic) regression algorithms. An important claim in this work is that the algorithm finds laws in data without having incorporated any prior knowledge of physics. Upon close inspection, however, we show that their method implicitly incorporates Hamilton's equations of motions and Newton's second law, demystifying how they are able to find Hamiltonians and special classes of Lagrangians from data. 
I think this is hilarious.

I can't get a PDF of your article, so instead I will read a commentary on it, which appears to be very interesting.


This seems very similar in concept to the finite element method. Nice to see patterns across fields like that.


A nice implementation I've been playing with, alongside @GistNoesis's: https://github.com/Blealtan/efficient-kan


Interesting!

Would this approach (with non-linear learning) still be able to utilize GPUs to speed up training?


Seconded. I'm guessing you could create an implementation that is able to do that and then write optimised Triton/CUDA kernels to accelerate it, but I'd need to investigate further.


I was under the impression that graph neural nets already trained learnable functions on graph edges rather than nodes, albeit typically on a fully connected graph. Is there any comparison to just a basic GNN here?


Bayesian networks learn probability functions, but it looks like only their tabulated versions:

https://en.wikipedia.org/wiki/Bayesian_network#Graphical_mod...

> Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if m parent nodes represent m Boolean variables, then the probability function could be represented by a table of 2^m entries, one entry for each of the 2^m possible parent combinations.
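
To make the quoted table concrete, here's a toy conditional probability table for a node with two Boolean parents (the textbook burglary/earthquake alarm numbers, purely illustrative):

  # P(Alarm = True | Burglary, Earthquake): 2^2 = 4 entries, one per parent combination.
  cpt_alarm = {
      (True,  True):  0.95,
      (True,  False): 0.94,
      (False, True):  0.29,
      (False, False): 0.001,
  }
  p = cpt_alarm[(True, False)]  # P(Alarm=True | Burglary=True, Earthquake=False)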


So, a new type of neural network that has been proven to work well on regression tasks common in physics? And tested in practice to fit elementary algebra and compositions of complex functions well. But no evidence at all that it works on even the most basic machine learning tasks like MNIST, not to mention language models.

I mean, it's great, but in its current state it seems better suited for tasks where an explicit formula exists (though isn't known) and the goal is to predict it at unknown points (and to interpret the formula as a side effect). Deep learning tasks are of a more statistical nature (think models with a cross-entropy loss - statistically predicting the frequency of different choices of the class/next token); that requires a specialized training procedure, whereas this is designed to fit 100% rather than somewhat close (think linear algebra - it won't be good at it). It would very likely take a radically different idea to apply it to deep learning tasks. The recently updated "Author's note" also mentions this: "KANs are designed for applications where one cares about high accuracy and/or interpretability."

It's great but let's be patient before we see this improve LLM accuracy or be used elsewhere.


Looks super interesting

I wonder how many more new architectures are going to be found in the next few years


Very interesting! Could existing MLP-style neural networks be put into this form?


I am curious to know if this type of network can help with causal inference.


They help with interpretability, which is a step-brother of causal inference. See for example "From Shapley Values to Generalized Additive Models and back" https://arxiv.org/abs/2209.04012


Indeed, thank you!


They definitely do; I'm planning to release a new package to do that in a couple of weeks.


doesn't KA representation require continuous univariate functions? do B-splines actually cover the space of all continuous functions? wouldn't... MLPs be better for the learnable activation functions?


The paper [0] is pretty good for handling questions like those.

> doesn't KA representation require continuous univariate functions?

All multivariate continuous functions (on a bounded domain) can be represented as compositions of addition and univariate continuous functions. Much like an MLP, you can also approximate discontinuous functions well on most of the domain (learning a nearby continuous function instead).
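
For reference, the standard statement written out structurally (Phi and phi here are just placeholder callables, one inner function per (q, p) pair):

  # f(x_1, ..., x_n) = sum_{q=0}^{2n} Phi_q( sum_{p=1}^{n} phi_{q,p}(x_p) )
  def ka_form(x, Phi, phi):
      n = len(x)
      return sum(Phi[q](sum(phi[q][p](x[p]) for p in range(n)))
                 for q in range(2 * n + 1))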

> do B-splines actually cover the space of all continuous functions

Much like an MLP, you can hit your favorite accuracy bound with more control points.
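
A quick toy check of that with scipy (my own sketch, nothing from the paper): cubic spline interpolation of a smooth 1D function, where the max error shrinks as you add control points.

  import numpy as np
  from scipy.interpolate import make_interp_spline

  f = lambda x: np.exp(np.sin(3 * x))             # some smooth 1D target
  xs = np.linspace(0, 2 * np.pi, 1000)

  for n_ctrl in (5, 10, 20, 40):                  # more control points -> smaller error
      grid = np.linspace(0, 2 * np.pi, n_ctrl)
      spl = make_interp_spline(grid, f(grid), k=3)
      print(n_ctrl, np.max(np.abs(spl(xs) - f(xs))))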

> wouldn't... MLPs be better for the learnable activation functions

Perhaps. A B-spline is comparatively very fast to compute. Also, local training examples have global impacts on an MLP's weights. That's good and bad. One property you would expect while training a KAN in limited data regimes is that some control points are never updated, leading to poor generalization due to something like a phase shift as you cross over control points (I think the entropy-based regularizer they have in the paper probably solves that, but YMMV). The positive side of that coin is that you neatly side-step catastrophic forgetting.
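
On the locality point, a tiny demo (a degree-1 stand-in for a B-spline, my own sketch): a gradient step from one sample only touches the coefficients whose basis functions overlap that sample.

  import torch

  # Piecewise-linear "spline" with 10 learnable coefficients on a uniform grid over [0, 1].
  coeffs = torch.zeros(10, requires_grad=True)

  def pw_linear(x):
      idx = torch.clamp((x * 9).long(), 0, 8)
      t = x * 9 - idx
      return (1 - t) * coeffs[idx] + t * coeffs[idx + 1]

  loss = (pw_linear(torch.tensor(0.23)) - 1.0) ** 2
  loss.backward()
  print(coeffs.grad)  # nonzero only at the two coefficients bracketing x = 0.23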

[0] https://arxiv.org/abs/2404.19756


“Increasing control points” hides a lot under the covers here. Your answer and the paper provide virtually no reason to believe one type of continuous function approximation is better than another. The comparisons made are superficial and only serve to address contrived issues like representing sinusoidal function families concisely.

It's weird to just ignore MLPs when approximating a continuous univariate function. But if the paper did use MLPs, they'd have ended up with something that looks a lot more like conventional neural networks, so maybe that's why?


This feels aggressively wrong. I'll bite in case you responded to the wrong person or something:

> “Increasing control points” hides a lot under the covers here.

Maybe. Like what exactly?

> Your answer and the paper provide virtually no reason to believe one type of continuous function approximation is better than another.

Even if the paper offered nothing, my answer is immediately above yours. What about being faster to compute or having gradient updates without global information destruction is either not clear or not ever better than what an MLP provides?

> The comparisons made are superficial and only serve to address contrived issues like representing sinusoidal function families concisely.

I don't care about that at all, and the paper barely cares about it; their same algorithm for reifying splines into known function families would work about as well with MLPs.

> It’s weird to just ignore MLPs when approximating a continuous univariate function.

Maybe. MLPs are particularly well suited to high input+output dimensionality, and while they _can_ approximate arbitrary 1D continuous functions they (1) can't do so efficiently, (2) can't be trained via gradient descent to find some of those, and (3) can't approximate topologically interesting 1D functions without many layers and training complexity. The authors ignored infinitely many other things too; the fact that they ignored MLPs is probably just some combination of their reference material (KANs have been around in some form for awhile) not using MLPs, alongside a hunch that they'd be less efficient (and perhaps harder to train) in an already slow library, and the fact that splines empirically sufficed.

> But if the paper did use MLPs theyd have ended up with something that looks a lot more like conventional neural networks, so maybe thats why?

See above, I don't think that would be the most important reason, even if it were true.

I don't think it's true though. Even in its current state, a KAN network already looks a lot like an MLP. Each layer does an O(d^2) computation to transform one d-dimensional vector into another. Instead of sum(dot(w, v)) the computation is sum(spline_w(v)), but aside from the sparsification (which is (1) optional, (2) available for MLPs, and (3) not important to most of the paper's ideas other than interpretability), the core computational kernel of these KANs is almost identical to an MLP.
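
A toy shape comparison of the two kernels (my own sketch, using a small Fourier series as a stand-in for the per-edge 1D functions instead of the paper's splines):

  import torch

  batch, d_in, d_out, n_freq = 8, 3, 4, 5
  x = torch.randn(batch, d_in)

  # MLP layer: one scalar weight per edge, summed over the input dimension.
  W = torch.randn(d_out, d_in)
  mlp_out = x @ W.T                                    # (batch, d_out)

  # KAN-style layer: one small 1D function per edge, summed over the input dimension.
  coeff = torch.randn(d_out, d_in, n_freq)             # per-edge function parameters
  k = torch.arange(1, n_freq + 1, dtype=x.dtype)
  basis = torch.sin(x[..., None] * k)                  # (batch, d_in, n_freq)
  kan_out = torch.einsum('bif,oif->bo', basis, coeff)  # (batch, d_out)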

What they showed, to the extent that it's true (it's always hard to say when focusing on physics computations because of how easily a carefully placed cos/sin/exp can greatly improve test+training error, and more specialized models taking advantage of that property tend to not do as well in more consumer-focused ML), is that if you use an O(grid) factor of extra weights for the same amount of computation then you can get an MLP with better scaling properties (for the same amount of model volume and model compute you get lower training times and better test errors, by a very healthy margin).

I'd be interested in seeing how an MLP would fit in there, but if the learned splines are usually complicated then you would expect a huge multiplicative slowdown, and regardless of spline complexity you would expect to re-introduce most of the training issues of deep networks. Please let me know if you give MLP sub-units a shot and they actually work better. I'd love to not have to do that experiment myself.


This really reminds me of Petri nets, but an analog version: instead of places and discrete tokens we have activation functions and signals. You can only trigger a transition if an activation function (place) has the right signal (tokens).


Eli5: why aren't these more popular and broadly used?


Because they have just been invented!


60 years ago


Bayesian KANs, KAN Transformers and KAN VAEs in 3, 2, ...


Looks very interesting, but my guess would be that this would run into the problem of exploding/vanishing gradients at larger depths, just like tanh or sigmoid networks do.
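
For what it's worth, the classic effect in a plain tanh stack is easy to reproduce (this is not a KAN, just the reference behaviour the comment alludes to):

  import torch

  torch.manual_seed(0)
  x = torch.randn(16, 32, requires_grad=True)
  h = x
  for _ in range(50):                       # a deep stack of tanh "layers"
      h = torch.tanh(h @ (0.1 * torch.randn(32, 32)))
  h.sum().backward()
  print(x.grad.abs().mean())                # shrinks toward 0 as depth grows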



