
Not if they use Flash Attention, which solves the problem in fixed memory by working tile by tile: the whole attention matrix is never materialised at once. The computation time is still quadratic, though.
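Just to make the tile-by-tile point concrete, here's a rough NumPy sketch of the online-softmax trick that Flash Attention builds on. The tile size and the fact that I only tile over the keys/values are simplifications I picked for illustration; the real kernel also tiles the queries so each block fits in GPU SRAM.

  import numpy as np

  def tiled_attention(Q, K, V, tile=128):
      # softmax(Q K^T / sqrt(d)) V without ever holding the full n x n score
      # matrix: peak extra memory is O(n * tile), compute is still O(n^2 * d).
      n, d = Q.shape
      scale = 1.0 / np.sqrt(d)
      out = np.zeros_like(Q)
      row_max = np.full(n, -np.inf)   # running max of each query's scores
      row_sum = np.zeros(n)           # running softmax denominator

      for j in range(0, n, tile):     # one key/value tile at a time
          Kj, Vj = K[j:j + tile], V[j:j + tile]
          S = (Q @ Kj.T) * scale      # scores for this tile only, shape (n, tile)

          new_max = np.maximum(row_max, S.max(axis=1))
          rescale = np.exp(row_max - new_max)   # correct previous accumulators
          P = np.exp(S - new_max[:, None])      # tile-local exp(scores)

          out = out * rescale[:, None] + P @ Vj
          row_sum = row_sum * rescale + P.sum(axis=1)
          row_max = new_max

      return out / row_sum[:, None]

  # sanity check against the naive quadratic-memory version
  n, d = 512, 64
  Q, K, V = (np.random.randn(n, d) for _ in range(3))
  S = (Q @ K.T) / np.sqrt(d)
  P = np.exp(S - S.max(axis=1, keepdims=True))
  assert np.allclose(tiled_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)

The outer loop still touches every query-key pair, which is where the quadratic compute comes from; only the memory footprint changes.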



They present it as an article about transformers in general, not specifically ones using Flash Attention. Anyway, maybe they're presenting the per-token memory requirement rather than the requirement for the entire sequence at once.


They don't mention it explicitly except in one place:

> GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention.

That quote implies they are using Flash Attention.



