Related: There was buzz last year about Kolmogorov-Arnold Networks (KANs), and https://arxiv.org/abs/2409.10594 claimed KANs outperform standard MLPs inside the transformer architecture. Does anyone know of these being explored in the LLM space? KANs seem to have better memory properties, if I'm not mistaken.
I believe the KAN hype died off for practical reasons (e.g. the FLOP cost of the spline-based implementation) and empirical ones: when people reproduced KANs, they found the claims/results in the original paper were misleading.
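To make the FLOP/parameter concern concrete, here's a rough back-of-the-envelope comparison of a dense MLP layer versus a KAN layer. It assumes the B-spline parameterization from the original KAN paper, where each edge carries its own learnable 1-D spline; the `grid_size` and `spline_order` defaults below are illustrative assumptions, not values from any specific implementation.

```python
# Rough parameter-count comparison: dense MLP layer vs. KAN layer.
# Assumption: each KAN edge stores (grid_size + spline_order) spline
# coefficients plus one base-weight term, as in the original paper's
# B-spline parameterization. Illustrative sketch only.

def mlp_layer_params(d_in: int, d_out: int) -> int:
    # One weight per edge plus a bias per output unit.
    return d_in * d_out + d_out

def kan_layer_params(d_in: int, d_out: int,
                     grid_size: int = 5, spline_order: int = 3) -> int:
    # Parameters scale with (grid_size + spline_order + 1) per edge,
    # so a KAN layer is several times larger than the matching MLP layer.
    per_edge = grid_size + spline_order + 1
    return d_in * d_out * per_edge

if __name__ == "__main__":
    d = 512
    mlp = mlp_layer_params(d, d)
    kan = kan_layer_params(d, d)
    print(f"MLP layer: {mlp:,} params")
    print(f"KAN layer: {kan:,} params ({kan / mlp:.1f}x)")
```

With these assumed defaults the KAN layer carries roughly 9x the parameters of the equivalent MLP layer, and the spline evaluations add further compute on top of that, which is the practical overhead people ran into.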
Here's a paper showing KANs are no better than MLPs; if anything, they are typically worse under a fair comparison: https://arxiv.org/pdf/2407.16674