Ask HN: Clarify VRAM usage during LLM forward pass
1 point by furiousteabag 9 months ago
Hey HN,

I'm working with Llama 2 and have hit a snag with VRAM usage during the forward pass at inference. My understanding is that only the largest activation tensor needs to be kept in memory, since activations from previous layers aren't needed at inference time, yet my VRAM calculations don't match what I'm actually observing.

Context:

7B model, batch size of 8, sequence length of 1024, dtype bfloat16. After a single forward pass, VRAM usage is much higher than I expected.
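For reference, here is roughly how I'm running it (a minimal sketch; the exact checkpoint name and loading path are assumptions on my part, using the standard Hugging Face transformers route):

    import torch
    from transformers import AutoModelForCausalLM

    # Load Llama 2 7B in bfloat16 and move it to the GPU (checkpoint name assumed).
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
    ).to("cuda")
    model.eval()

    print(f"weights: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")

    # Dummy batch: batch size 8, sequence length 1024, Llama 2 vocab size 32000.
    input_ids = torch.randint(0, 32000, (8, 1024), device="cuda")

    with torch.inference_mode():
        out = model(input_ids)

    print(f"after forward: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")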

Observations:

Model weights: 12915 MiB
VRAM after the forward pass: 16679 MiB (expected significantly less)
VRAM used by the forward pass: 3764 MiB
Output (logits) tensor, fp32: 1000 MiB
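The deltas above are just subtraction plus sizing the logits tensor (I believe the HF Llama head returns logits in fp32; the 32000 vocab size is Llama 2's):

    # Logits tensor: batch * seq_len * vocab_size elements, 4 bytes each (fp32).
    logits_mib = 8 * 1024 * 32000 * 4 / 2**20    # = 1000 MiB
    forward_mib = 16679 - 12915                  # = 3764 MiB
    activations_mib = forward_mib - logits_mib   # = 2764 MiB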

That leaves 2764 MiB for activations, whereas my calculations suggest a single layer's activations should take only 1368 MiB. Actual usage should be even lower than that, since the attention block's activations can be discarded by the time the MLP block is computed; the largest single consumer should be the softmax output plus v in the attention block, which I estimate at 576 MiB (arithmetic below).
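For concreteness, here is that estimate, assuming the standard 7B config (32 attention heads, hidden size 4096) and bf16 activations:

    batch, heads, seq, hidden = 8, 32, 1024, 4096
    bytes_bf16 = 2

    # Attention probabilities after softmax: (batch, heads, seq, seq) in bf16.
    softmax_mib = batch * heads * seq * seq * bytes_bf16 / 2**20   # = 512 MiB

    # Value tensor: (batch, seq, hidden) in bf16.
    v_mib = batch * seq * hidden * bytes_bf16 / 2**20              # = 64 MiB

    peak_estimate_mib = softmax_mib + v_mib                        # = 576 MiB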

Why does the forward pass use more VRAM than expected, considering activations from previous layers aren't stored? Any insights or similar experiences would be greatly appreciated. Happy to give more context if needed.
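In case it helps narrow things down, this is how I'd separate what PyTorch has actually allocated from what its caching allocator has merely reserved (the latter is closer to what nvidia-smi reports); it reuses `model` and `input_ids` from the sketch above:

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    with torch.inference_mode():
        out = model(input_ids)

    print(f"allocated now:  {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
    print(f"reserved:       {torch.cuda.memory_reserved() / 2**20:.0f} MiB")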



