Now remind us: what HW did we need to run local inference of llama2-70B (July 2023)? And then contrast it with the HW we need to run llama3.1-70B (July 2024). In particular, which optimizations, and in what way, dramatically cut the cost of inference?
I seriously don't get this argument, and I see it being repeated over and over again. Model capabilities are increasing, no doubt about that, but HW costs for inference have remained the same, and they're mostly driven by the amount of (V)RAM you need.
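To put some rough numbers on the (V)RAM point: memory for the weights scales with parameter count times precision, so a 70B model needs roughly the same memory today as a year ago at the same quantization level. A quick sketch (illustrative only; it ignores the KV cache and activation memory, which add on top):

```python
# Back-of-envelope (V)RAM estimate for the weights of a 70B-parameter model
# at different precisions. Illustrative numbers only; real deployments also
# need memory for the KV cache, activations, and runtime overhead.

def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Approximate GB needed just to hold the model weights."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"70B @ {label}: ~{weight_memory_gb(70, bits):.0f} GB")

# 70B @ fp16: ~140 GB
# 70B @ int8: ~70 GB
# 70B @ 4-bit: ~35 GB
```

Quantization is what moves that number, not the model's release date, which is why a 70B model from 2023 and one from 2024 land on essentially the same hardware at the same precision.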