
Not if they use Flash Attention, which solves the problem in fixed memory by working tile by tile: the whole attention matrix is never materialised at once. The computation time is still quadratic, though.
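Just to make the tile-by-tile point concrete, here's a rough NumPy sketch of the online-softmax trick that Flash Attention builds on. The tile size and the fact that I only tile over the keys/values are simplifications I picked for illustration; the real kernel also tiles the queries so each block fits in GPU SRAM.

  import numpy as np

  def tiled_attention(Q, K, V, tile=128):
      # softmax(Q K^T / sqrt(d)) V without ever holding the full n x n score
      # matrix: peak extra memory is O(n * tile), compute is still O(n^2 * d).
      n, d = Q.shape
      scale = 1.0 / np.sqrt(d)
      out = np.zeros_like(Q)
      row_max = np.full(n, -np.inf)   # running max of each query's scores
      row_sum = np.zeros(n)           # running softmax denominator

      for j in range(0, n, tile):     # one key/value tile at a time
          Kj, Vj = K[j:j + tile], V[j:j + tile]
          S = (Q @ Kj.T) * scale      # scores for this tile only, shape (n, tile)

          new_max = np.maximum(row_max, S.max(axis=1))
          rescale = np.exp(row_max - new_max)   # correct previous accumulators
          P = np.exp(S - new_max[:, None])      # tile-local exp(scores)

          out = out * rescale[:, None] + P @ Vj
          row_sum = row_sum * rescale + P.sum(axis=1)
          row_max = new_max

      return out / row_sum[:, None]

  # sanity check against the naive quadratic-memory version
  n, d = 512, 64
  Q, K, V = (np.random.randn(n, d) for _ in range(3))
  S = (Q @ K.T) / np.sqrt(d)
  P = np.exp(S - S.max(axis=1, keepdims=True))
  assert np.allclose(tiled_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)

The outer loop still touches every query-key pair, which is where the quadratic compute comes from; only the memory footprint changes.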



They present it as an article about transformers in general, not specifically ones using Flash Attention. Anyway, maybe they're presenting the per-token memory requirement rather than the requirement for the entire sequence at once.


They don't mention it explicitly except in one place:

> GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention.

That quote implies they are using Flash Attention.



