So a mishmash of ideas from other papers (not to downplay the results). These are exciting times of hackery: taking puzzle pieces from different papers and fitting them together.
This kind of work can only move this quickly because more and more people are being brought into the field to try these ideas out. The more people, the more permutations of ideas get tried.
Am I reading the paper right, or is there any reason not to be cynical about the following points?
1) It’s odd that they chose linear attention for their implementation: attention mechanisms aren’t what the paper is about, and they claim no insights relevant to them.
2) Benchmarking all models this way (linear vs. linear) likely inflated their numbers compared with evaluating the matmul removal in a quadratic-vs-quadratic setting (see the sketch after this list).
3) This claim implies a comparison to state-of-the-art language models, where quadratic attention is the standard, and is therefore flawed:
“We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency.”
4) That kind of brain comparison falls apart under scrutiny, isn’t standard in ML research, and doesn’t mean much anyway.
5) Right up front in the abstract they make specific performance claims and imply they come from removing matmul, yet linear attention isn’t mentioned until Section 4, on experiments.
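To make point 2 concrete, here is a minimal toy sketch of the two mechanisms (my own code, not the paper's; numpy and the ReLU+1 feature map are assumptions on my part). Softmax attention materializes an n x n score matrix, so it costs O(n^2 d) per head; kernelized linear attention accumulates a d x d state instead, costing O(n d^2). Throughput measured linear-vs-linear therefore tells you little about the quadratic baselines most readers have in mind.

    # Sketch: quadratic softmax attention vs. kernelized linear attention.
    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard quadratic attention: the n x n score matrix dominates cost.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])          # O(n^2 * d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                               # O(n^2 * d)

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1):
        # Kernelized linear attention (Katharopoulos et al. style).
        # phi is a simple positive feature map (ReLU+1 here, my choice).
        # Keys/values fold into a d x d state, so no n x n matrix is formed.
        Qp, Kp = phi(Q), phi(K)
        S = Kp.T @ V                                     # d x d summary state
        z = Kp.sum(axis=0)                               # d-dim normalizer
        return (Qp @ S) / (Qp @ z)[:, None]              # O(n * d^2)

    n, d = 512, 64
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    print(softmax_attention(Q, K, V).shape)              # (512, 64)
    print(linear_attention(Q, K, V).shape)               # (512, 64)

Even at n = 512 and d = 64 the gap is large (n^2 d ≈ 16.8M vs. n d^2 ≈ 2.1M multiply-adds per head), and it widens linearly with context length, which is exactly why a linear-vs-linear benchmark flatters the throughput numbers.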