> I doubt it because there's $BILLIONS in this market and everyone wants a piece of the AI cake, so it doesn't make sense to ignore promising methods.
I also doubt this result. The "why haven't $BILLIONS already been invested here" question is interesting in its own right though. Generally, the literature on the theoretical bounds of swarm optimization is pertinent. Those $BILLIONS aren't being invested by a single omniscient entity, so they're subject to interesting constraints.
As one of many explanations, fragmentation is common. If $BILLIONS are split between greedy, mostly non-interacting entities (e.g., competing companies each trying to replace the transformer within a bounded number of hours and dollars while securing their market dominance), you'd expect, probabilistically, each of them to converge on the same strategy or small set of strategies, especially if the "best" alternatives are obvious or globally known for some reason (e.g., some solutions intuitively feel "natural", or your researchers publish early results, or you have employee movement between companies, or whatever). Riskier strategies won't be touched, and you'll have $BILLIONS spent duplicating the same most likely alternatives when $MILLIONS would have sufficed.
The normal counterpoint is that a few big players dominate the spending, and they would have higher internal coordination. Interestingly though, they usually don't, except when that coordination would enforce the same strategies their smaller competitors are already pursuing. How often do you hear about a forward-thinking executive actively devoting funds to a meaningful number of competing solutions, versus stories like the misaligned Google+ integrations where employee bonuses rewarded poor customer experiences? Approximately never. It's career suicide if you fail and depend on other people for your position; you _are_ actually more likely to outdo the competition with your increased resources if you just lean into the "best" alternatives; and for a whole host of reasons very few executives (except for people with real power) will coordinate a more comprehensive strategy, certainly not one orthogonal to the competition's just for the sake of allocating the global $BILLIONS more efficiently.
Separately (going back to the merits of the preprint), I'll probably read the full thing later, but a few points stood out as suspicious on an initial skim. Notably, they seem to conflate transformations that are linear in different variables. E.g., `xa` is linear in both `x` and `a`, and `vx` is linear in both `v` and `x`, but `xax` is _not_ linear in `x`, even if you try to "prove" it is by substituting `v = xa`. Linearity in `v` isn't enough to make the composition linear in `x`. A lot of their results seem to rely on eliminating those "redundant" computations, even though the things they're replacing with linear computations are actually higher-order polynomials in `x`. On an initial skim, the other "novel" ideas also don't seem well grounded.
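To make that concrete, here's a minimal numpy sketch (my own toy example, not code from the paper, assuming nothing about their notation beyond `xax` being a quadratic form): each factor is linear on its own, but the composition fails both additivity and homogeneity in `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

def f(x):
    # f(x) = x @ A @ x: a quadratic form, built from individually linear pieces.
    return x @ A @ x

x, y = rng.standard_normal(4), rng.standard_normal(4)

# Additivity fails, so f is not linear in x...
print(np.allclose(f(x + y), f(x) + f(y)))  # False
# ...and it scales quadratically, not linearly:
print(np.allclose(f(2 * x), 2 * f(x)))     # False
print(np.allclose(f(2 * x), 4 * f(x)))     # True
```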
Their experimental results are decent. That could mean a lot of things (most often that the authors made more implementation errors in their competitors' baselines than in their own method), but it's probably worth looking into for a few hours despite my other complaints.
It's too late to edit. My initial skepticism was unfounded.
Separately, the paper largely consists of two parts: "super attention" and everything else.
The "everything else" part of the paper might matter, but it's basically just operator fusion and the impacts of doing so on training dynamics, except they left out the impact on performance for different model parameters and didn't study how training dynamics impact the result. It's not a new idea, even on the ML arxiv, I'm glad they got good results, and it needs more study before being sold so strongly.
The "super attention" part of the paper is interesting. It basically ups the matrix polynomial rank of an attention layer by 1 and claims good results from the process. That's believable, especially given that the main contribution of attention is good empirical results from upping the previous layer matrix polynomial rank by a bit. You'd want to dive into the code and check that they didn't screw up the masking before taking the results at face value though (information leakage can make even very weak models seem to perform well).