Vision Mamba: Efficient Visual Representation Learning with Bidirectional SSM (arxiv.org)
74 points by andy99 10 months ago | 16 comments



Mamba is the new and hot "linear transformer"; it could one day replace GPT-based LLMs and scale sequence length up to 1M tokens. It uses a clever math trick to parallelize processing of the input tokens while keeping a constant state size for generating the output.
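
Rough sketch of the "constant state size" part (toy NumPy with made-up names and shapes, not Mamba's actual selective layer): the recurrent state has a fixed size no matter how long the sequence is, so each generated token costs O(1) memory and compute, unlike a transformer's KV cache, which grows with every token.

    import numpy as np

    def ssm_generate(xs, A, B, C):
        """Toy linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
        h = np.zeros(A.shape[0])     # state size is fixed by A, not by sequence length
        ys = []
        for x in xs:                 # one O(1) update per token
            h = A @ h + B * x
            ys.append(C @ h)
        return np.array(ys)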


> while keeping a constant state size for generating the output

Isn't Mamba's attention mechanism linear in the input length? The innovation, as I understand it, is that attention in Mamba *isn't* quadratic (unlike transformers).


State size is constant (a matrix); attention is linear (a single unidirectional scan over all states).

Both statements are correct


The innovation is that they managed to write a hardware-aware kernel that makes it run fast/efficiently on GPUs. The authors of Mamba are the same authors as FlashAttention, which was a performance-optimization kernel written to reduce IO while computing normal O(N^2) attention. As I understand it, SSM models previously were not as easily parallelizable as the Transformer architecture.
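
For context, a rough NumPy sketch of the IO trick that FlashAttention implements as a fused GPU kernel (illustrative only; names and block size are made up): stream over key/value blocks with an "online" softmax so the full N x N score matrix is never materialized.

    import numpy as np

    def blocked_attention(Q, K, V, block=64):
        """Exact O(N^2) softmax attention, computed block-by-block over the keys."""
        N, d = Q.shape
        row_max = np.full(N, -np.inf)          # running max per query row
        row_sum = np.zeros(N)                  # running softmax normalizer
        acc = np.zeros((N, V.shape[1]))        # running weighted sum of values
        for start in range(0, K.shape[0], block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            S = Q @ Kb.T / np.sqrt(d)          # scores for this key block only
            new_max = np.maximum(row_max, S.max(axis=1))
            scale = np.exp(row_max - new_max)  # rescale old stats to the new max
            P = np.exp(S - new_max[:, None])
            acc = acc * scale[:, None] + P @ Vb
            row_sum = row_sum * scale + P.sum(axis=1)
            row_max = new_max
        return acc / row_sum[:, None]

For small inputs this matches the naive softmax(Q K^T / sqrt(d)) @ V; the point is that only one block of K and V needs to be resident at a time, which is what cuts the memory traffic.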


They were previously parallelizable (via FFT), but performed poorly on language modeling tasks.

Mamba adds a dependence on the inputs that makes language modeling competitive with transformers, but that dependence prevents using the FFT approach. So they switched to a method using a parallel prefix scan.
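
A rough sketch of the scan trick (illustrative NumPy, not Mamba's fused CUDA kernel): the input-dependent recurrence h_t = a_t * h_{t-1} + b_t can be computed by composing (a, b) pairs with an associative operator, so the whole sequence reduces to log2(N) fully parallel rounds.

    import numpy as np

    def sequential(a, b):
        # Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0 (the O(N) generation-time view)
        h, out = 0.0, np.empty_like(b)
        for t in range(len(a)):
            h = a[t] * h + b[t]
            out[t] = h
        return out

    def combine(left, right):
        # Composing two affine steps h -> a*h + b is itself an affine step,
        # and the composition is associative, which is all a scan needs.
        a1, b1 = left
        a2, b2 = right
        return a2 * a1, a2 * b1 + b2

    def parallel_scan(a, b):
        # Hillis-Steele inclusive scan: log2(N) rounds, each one fully parallel.
        A, B = a.astype(float), b.astype(float)
        shift = 1
        while shift < len(a):
            newA, newB = A.copy(), B.copy()
            newA[shift:], newB[shift:] = combine((A[:-shift], B[:-shift]),
                                                 (A[shift:], B[shift:]))
            A, B, shift = newA, newB, shift * 2
        return B

    rng = np.random.default_rng(0)
    a, b = rng.uniform(0.5, 1.0, 16), rng.normal(size=16)
    assert np.allclose(parallel_scan(a, b), sequential(a, b))

(In JAX this is roughly what jax.lax.associative_scan does; Mamba implements a hardware-aware fused version of the same idea.)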


Yes and no. There's a dual connection between SSMs and convolutional models if certain constraints are met. Training convolutionally and inferring sequentially seeks a compromise between the two sides. I think we're about to find out the degree to which those constraints impact "easily".
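
Small NumPy illustration of that duality for a time-invariant SSM (the constraint that Mamba's input-dependent parameters break): the same recurrence as the sketch above, computed either step-by-step or as a convolution with kernel K_k = C A^k B, gives identical outputs.

    import numpy as np

    def ssm_recurrent(A, B, C, x):
        # Sequential (inference-time) view: h_t = A h_{t-1} + B x_t, y_t = C h_t
        h, ys = np.zeros(A.shape[0]), []
        for xt in x:
            h = A @ h + B * xt
            ys.append(C @ h)
        return np.array(ys)

    def ssm_convolutional(A, B, C, x):
        # Convolutional (training-time) view: y = x * K with K_k = C A^k B.
        # Only valid while A, B, C do not depend on the input.
        L = len(x)
        K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(L)])
        return np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

    rng = np.random.default_rng(0)
    n, L = 4, 32
    A = 0.3 * rng.normal(size=(n, n))      # scaled so powers of A stay bounded
    B, C, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=L)
    assert np.allclose(ssm_recurrent(A, B, C, x), ssm_convolutional(A, B, C, x))

The FFT parallelization mentioned above lives entirely in the convolutional view; once the parameters depend on the input, the fixed kernel K no longer exists, which is why Mamba falls back to the scan.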


Has anyone tried this? I've seen the same people hyping this up all over /r/ml with multiple posts a week, marketing material, etc. Is this a legit new technique or just a cleverly marketed paper?


It’s legit and has been honed and refined for the past few years by the same Stanford PhD students (now graduated and working at CMU & Princeton).

The latest iteration has promising results for language but no one has yet trained and released a big (7B+) model to see how it scales.


The peer reviews on Mamba's ICLR 2024 submission are pretty interesting, especially given how split the reviewers are: https://openreview.net/forum?id=AL1fq05o7H


Kind of weird we haven’t seen Mistral 7B Mamba. Until that’s out it’s untested vaporware.


This isn’t a fine-tune. It’s not even a transformer like Qwen, Llama, GPT, Gemini, or Mistral; it’s another approach to language models, based on RNNs.


Doesn't the point remain though -- when/where do we demo this?


There is a Mamba 7B pretrained model on Hugging Face / LM Studio right now.


Do you mean including it in the Mixtral "Mixture of Experts" model? I'm not sure a Mistral Mamba makes sense, since it's a completely different architecture.



... they named the software... "vim"

Really?! Come on.

https://github.com/hustvl/Vim



