Not if they use FlashAttention, which avoids the quadratic memory cost by working tile by tile: it never materialises the whole attention matrix at once. The computation time is still quadratic, though.
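For concreteness, here's a minimal NumPy sketch of the online-softmax trick FlashAttention is built on. It tiles over keys/values only, for brevity; the real kernel also tiles the queries and keeps each tile's working set in on-chip SRAM. The function name `tiled_attention` and the `tile` size are just illustrative, not anything from the article.

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """Single-head attention computed tile by tile with an online softmax.
    The full L x L score matrix is never built; only an L x tile block
    of scores exists at any time."""
    L, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)           # running (unnormalised) output
    m = np.full(L, -np.inf)          # running row-wise max of scores
    l = np.zeros(L)                  # running row-wise sum of exp(score - m)
    for start in range(0, L, tile):  # walk over key/value tiles
        ks = k[start:start + tile]
        vs = v[start:start + tile]
        s = (q @ ks.T) * scale       # scores for this tile only: L x tile
        m_new = np.maximum(m, s.max(axis=1))
        correction = np.exp(m - m_new)      # rescale old accumulators
        p = np.exp(s - m_new[:, None])      # tile-local softmax numerators
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vs
        m = m_new
    return out / l[:, None]

# Sanity check against the naive quadratic-memory version.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
ref = np.exp(s - s.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref, atol=1e-6)
```

Note the inner loop still does all L x L dot products, which is why the compute stays quadratic even though the extra memory is only linear in sequence length.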
They present it as an article about transformers in general, not specifically ones using FlashAttention. Anyway, maybe they're quoting the per-token memory requirement rather than the requirement for the entire sequence at once.