
It’s been a minute so my memory might be off, but I think when I ran 70B at fp16 it just barely fit on a 2x A100 80GB cluster, then quickly OOMed as the context/KV cache grew.

So if I had to guess, a 96GB H100 could probably run it at fp8 as long as you didn’t need a big context window. If you’re doing speculative decoding it probably won’t fit, because you also need weights and KV cache for the draft model.
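The arithmetic behind this guess can be sketched out. A minimal back-of-the-envelope calculation, assuming a Llama-2-70B-like architecture (80 layers, grouped-query attention with 8 KV heads, head dim 128 — these architecture numbers are my assumptions, not from the comment):

```python
# Rough serving-memory math for a 70B model. Architecture numbers
# (80 layers, 8 KV heads, head_dim 128) are assumed, Llama-2-70B-like.
GB = 1e9

def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Total bytes to hold the model weights."""
    return n_params * bytes_per_param

def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> float:
    """KV cache bytes per token; factor of 2 for K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

weights_fp16 = weight_bytes(70e9, 2)  # ~140 GB
weights_fp8 = weight_bytes(70e9, 1)   # ~70 GB
kv_per_tok = kv_bytes_per_token(80, 8, 128, 2)  # fp16 KV cache

print(f"fp16 weights: {weights_fp16 / GB:.0f} GB")
print(f"fp8  weights: {weights_fp8 / GB:.0f} GB")

# Headroom left for KV cache in each setup (ignoring activations,
# CUDA context, fragmentation, and any draft model):
for name, total, w in [("2x A100 80GB @ fp16", 160 * GB, weights_fp16),
                       ("1x H100 96GB @ fp8", 96 * GB, weights_fp8)]:
    free = total - w
    print(f"{name}: ~{free / GB:.0f} GB free, "
          f"~{free / kv_per_tok:,.0f} tokens of fp16 KV cache")
```

This matches the comment’s intuition: 140 GB of fp16 weights barely fits in 160 GB, and 70 GB of fp8 weights fits in 96 GB with modest KV-cache headroom that a draft model’s weights and cache would eat into.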



