Mamba is the new and hot "linear transformer" that could one day replace GPT-based LLMs and scale sequence length up to 1M tokens. It uses a clever math trick to process the input tokens in parallel while keeping a constant state size for generating the output.
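To make the "constant state size" part concrete, here's a toy NumPy sketch of the recurrent view at generation time. The matrices and sizes are made up for illustration and are not Mamba's actual parameterization:

```python
import numpy as np

# Recurrent (SSM-style) generation: the state h has a fixed size no matter
# how many tokens have been consumed, unlike a transformer's KV cache,
# which grows linearly with the sequence length.
d_state, d_model = 16, 8                       # illustrative sizes only
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(d_state, d_state))  # state transition
B = rng.normal(size=(d_state, d_model))        # input projection
C = rng.normal(size=(d_model, d_state))        # output projection

h = np.zeros(d_state)                          # the model's entire running "memory"
for t in range(10_000):                        # could just as well be 1M steps
    x_t = rng.normal(size=d_model)             # stand-in for the current token embedding
    h = A @ h + B @ x_t                        # update the fixed-size state
    y_t = C @ h                                # per-step output
# h is still just d_state numbers, however long the sequence was.
```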
> while keeping a constant state size for generating the output
Isn't Mamba's attention mechanism linear in the input length? The innovation, as I understand it, is that attention in Mamba *isn't* quadratic (as it is in transformers).
The innovation is that they managed to write a hardware-aware kernel that makes it run fast and efficiently on GPUs. The authors of Mamba are the same authors as FlashAttention, which was a performance-optimization kernel written to reduce IO while computing normal O(N^2) attention. As I understand it, SSM models were previously not as easily parallelizable as the Transformer architecture.
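To give a feel for what "reduce IO" means here: the trick is to compute attention in key/value tiles with a running softmax, so the full N×N score matrix is never materialized. A rough NumPy sketch of that numerical idea (not the actual CUDA kernel, and the shapes are arbitrary):

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Attention computed tile by tile with a running softmax, so the full
    (N x N) score matrix is never stored. This is only the numerical idea
    behind FlashAttention, not the hardware-aware CUDA kernel itself."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)                 # running row-wise max of the scores
    l = np.zeros(n)                         # running softmax denominator
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale              # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        rescale = np.exp(m - m_new)         # correct previously accumulated partials
        P = np.exp(S - m_new[:, None])
        l = l * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

# Sanity check against the naive version that materializes the full score matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```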
They were previously parallelizable (via FFT), but performed poorly on language modeling tasks.
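To illustrate: for a time-invariant SSM the recurrence unrolls into a fixed kernel K = (CB, CAB, CA²B, ...), so the whole output can be computed in parallel as one FFT-accelerated convolution. A toy NumPy check, with arbitrary shapes rather than anything from the actual S4/Mamba code:

```python
import numpy as np

# Time-invariant SSM: h[t] = A h[t-1] + B x[t], y[t] = C h[t] with fixed A, B, C.
# Toy shapes for illustration only (not S4's or Mamba's real parameterization).
rng = np.random.default_rng(0)
d_state, T = 4, 64
A = 0.2 * rng.normal(size=(d_state, d_state))
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))
x = rng.normal(size=T)                       # scalar input sequence

# 1) Sequential (recurrent) evaluation, one step at a time.
h, y_seq = np.zeros((d_state, 1)), []
for t in range(T):
    h = A @ h + B * x[t]
    y_seq.append((C @ h).item())

# 2) Parallel evaluation: unroll the recurrence into a kernel K[k] = C A^k B
#    and compute the causal convolution via FFT (zero-padded to avoid wrap-around).
K, Ak = [], np.eye(d_state)
for _ in range(T):
    K.append((C @ Ak @ B).item())
    Ak = A @ Ak
n = 2 * T                                    # pad so circular conv == linear conv
y_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(K, n), n)[:T]

assert np.allclose(y_seq, y_fft)
```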
Mamba adds a dependence on the inputs that makes its language modeling competitive with transformers, but that dependence prevents using the FFT approach. So they switch to a method using a parallel prefix scan.
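A rough sketch of what that scan looks like for a scalar recurrence h[t] = a[t]·h[t-1] + b[t], where a and b stand in for the input-dependent quantities (not Mamba's actual discretized parameters):

```python
import numpy as np

# Input-dependent recurrence: h[t] = a[t] * h[t-1] + b[t].
# Because a[t] changes with the input there is no fixed convolution kernel,
# but composing two steps is still associative:
#   (a1, b1) then (a2, b2)  ==  (a1*a2, a2*b1 + b2)
# which is exactly what a (parallel) prefix scan needs.
def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

rng = np.random.default_rng(0)
T = 8
a = rng.uniform(0.5, 1.0, size=T)     # stand-ins for input-dependent gates
b = rng.normal(size=T)

# Plain sequential evaluation (how you'd generate token by token).
h, y_seq = 0.0, []
for t in range(T):
    h = a[t] * h + b[t]
    y_seq.append(h)

# Prefix scan under `combine`, written here as a left fold for clarity.
# On a GPU the same operator runs in a log-depth tree, which supplies the
# parallelism the FFT used to provide in the time-invariant case.
state, y_scan = (1.0, 0.0), []        # (1, 0) is the identity element
for t in range(T):
    state = combine(state, (a[t], b[t]))
    y_scan.append(state[1])

assert np.allclose(y_seq, y_scan)
```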
Yes and no. There's a dual connection between SSMs and convolutional models if certain constraints are met. Training convolutionally and inferring sequentially seeks a compromise between the two sides. I think we're about to find out the degree to which those constraints impact "easily".
Has anyone tried this? I’ve seen the same people hyping this up all over /r/ml with multiple posts a week, marketing material, etc. Is this a legit new technique or just a cleverly marketed paper?
Do you mean including it in the Mixtral "Mixture of Experts" model? I'm not sure Mistral Mamba makes sense, since it's a completely different architecture.