> I doubt it because there's $BILLIONS in this market and everyone wants a piece of the AI cake, so it doesn't make sense to ignore promising methods.
I also doubt this result. The "why haven't $BILLIONS already been invested here" question is interesting in its own right though. Generally, the literature on the theoretical bounds of swarm optimization is pertinent. Those $BILLIONS aren't being invested by a single omniscient entity, so they're subject to interesting constraints.
As one of many explanations, fragmentation is common. If $BILLIONS are split between greedy, mostly non-interacting entities (e.g., competing companies each trying to replace the transformer within a bounded number of hours and dollars while securing their market dominance), you'd expect, probabilistically, each of them to converge on the same strategy or small set of strategies, especially if the "best" alternatives are obvious or globally known for some reason (e.g., some solutions intuitively feel "natural", or your researchers publish early results, or you have employee movement between companies, or whatever). Riskier strategies won't be touched, and you'll have $BILLIONS spent duplicating the same most likely alternatives when $MILLIONS would have sufficed.
The normal counterpoint is that a few big players dominate the spending, and they would have higher internal coordination. Interestingly though, they usually don't, except when that coordination would enforce the same strategies their smaller competitors are already pursuing. How often do you hear about a forward-thinking executive actively devoting funds to a meaningful number of competing solutions, versus stories like the misaligned Google+ integrations where employee bonuses rewarded poor customer experiences? Approximately never. It's career suicide if you fail and depend on other people for your position; you _are_ actually more likely to outdo the competition with your increased resources if you just lean into the "best" alternatives; and for a whole host of reasons very few executives (except for people with real power) will coordinate a more comprehensive strategy, certainly not one orthogonal to the competition's just for the sake of allocating the global $BILLIONS more efficiently.
Separately (going back to the merits of the preprint), I'll probably read the full thing later, but a few points stood out as suspicious on an initial skim. Notably, they seem to conflate transformations that are linear in different variables. E.g., `xa` is linear in both `x` and `a`, and `vx` is linear in both `v` and `x`, but `xax` is _not_ linear in `x`, even if you try to "prove" it is by substituting `v = xa`. Linearity in `v` isn't enough to make the composition linear in `x`. A lot of their results seem to rely on eliminating those "redundant" computations, even though the things they're replacing with linear computations are actually higher-order polynomials in `x`. On an initial skim, the other "novel" ideas also don't seem well grounded.
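To make that concrete, here's a minimal numpy sketch (my own toy example, not code from the paper, assuming nothing about their notation beyond `xax` being a quadratic form): each factor is linear on its own, but the composition fails both additivity and homogeneity in `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

def f(x):
    # f(x) = x @ A @ x: a quadratic form, built from individually linear pieces.
    return x @ A @ x

x, y = rng.standard_normal(4), rng.standard_normal(4)

# Additivity fails, so f is not linear in x...
print(np.allclose(f(x + y), f(x) + f(y)))  # False
# ...and it scales quadratically, not linearly:
print(np.allclose(f(2 * x), 2 * f(x)))     # False
print(np.allclose(f(2 * x), 4 * f(x)))     # True
```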
Their experimental results are decent. That could mean a lot of things (most often that the authors made more implementation errors in their competitors' baselines than in their own method), but it's probably worth looking into for a few hours despite my other complaints.
It's too late to edit. My initial skepticism was unfounded.
Separately, the paper largely consists of two parts: "super attention" and everything else.
The "everything else" part of the paper might matter, but it's basically just operator fusion and the impacts of doing so on training dynamics, except they left out the impact on performance for different model parameters and didn't study how training dynamics impact the result. It's not a new idea, even on the ML arxiv, I'm glad they got good results, and it needs more study before being sold so strongly.
The "super attention" part of the paper is interesting. It basically ups the matrix polynomial rank of an attention layer by 1 and claims good results from the process. That's believable, especially given that the main contribution of attention is good empirical results from upping the previous layer matrix polynomial rank by a bit. You'd want to dive into the code and check that they didn't screw up the masking before taking the results at face value though (information leakage can make even very weak models seem to perform well).