
Re-computing everything every time is the worst-case scenario, which is why I included it in the example (1k tokens). In that case the KV-cache is obviously zero, but it is also obvious that recomputing is a much worse alternative than using the KV-cache, which is pretty much the reason we have the KV-cache in the first place. Therefore the argument about loading the cached tensors doesn't make a difference at all.

> It's going to take me some minutes to find out what's wrong in this napkin math.

I am sure you will. Please don't be so entitled.



> Therefore the argument about loading the cached tensors doesn't make a difference at all.

Sorry, what? Who the fuck in this world runs decode without a k/v cache?! If you run without a k/v cache you are basically doing prefill for every token you generate, and that is not what we call "decode". That is what we call "prefill".

The k/v cache, while named a "cache", is far more important than what people usually perceive as a "cache". It is an essential part of the algorithm: if you lose your k/v cache you must run prefill again, and if you run prefill for every token you generate, the total cost is not O(n^2), it is O(n^3).
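
To make the complexity claim concrete, here is a toy FLOP count (a hedged sketch that counts only the attention matmuls, with rough constant factors; n and d are the token count and an assumed Llama-3-8B-like hidden size):

    # Toy attention FLOP count for generating n tokens, with and without
    # a k/v cache. Order-of-growth illustration only; constants are rough.
    def flops_with_cache(n, d):
        # Step t attends one new query against t cached keys/values:
        # ~2*t*d for Q.K^T plus ~2*t*d for A.V per step.
        return sum(4 * t * d for t in range(1, n + 1))      # O(n^2 * d)

    def flops_without_cache(n, d):
        # No cache: re-run prefill over all t tokens at every step,
        # which costs ~4*t^2*d in attention matmuls alone.
        return sum(4 * t * t * d for t in range(1, n + 1))  # O(n^3 * d)

    n, d = 1000, 4096  # assumed: 1k tokens, Llama-3-8B-like hidden size
    print(flops_with_cache(n, d) / 1e12, "TFLOPs with cache")
    print(flops_without_cache(n, d) / 1e12, "TFLOPs without cache")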

And yeah, you can run prefill 1000 times to generate a 1000-token output, or you can run prefill once and, with the persisted k/v cache, run decode 1000 times. There are tradeoffs to be made here, but it simply makes no sense to drop a k/v cache in the middle of generating a response; as your own numbers show, recomputing is guaranteed to be slower than loading the k/v cache.
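
A quick back-of-the-envelope comparison for a single decode step at a 1000-token context (all shapes and throughputs are assumptions, not measurements):

    # Drop-the-cache vs. load-the-cache for one decode step at context n.
    # Assumed Llama-3-8B-ish shapes: 32 layers, 8 KV heads x 128 dims (GQA),
    # fp16, ~1 TB/s HBM bandwidth, ~100 TFLOP/s sustained compute.
    layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
    n = 1000

    kv_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * n  # K and V
    load_seconds = kv_bytes / 1e12                                 # ~1 TB/s

    # Recomputing K/V means a full prefill pass: ~2 FLOPs/param/token.
    recompute_flops = 2 * 8e9 * n
    compute_seconds = recompute_flops / 100e12                     # ~100 TFLOP/s

    print(f"load k/v cache: ~{load_seconds * 1e6:.0f} us")
    print(f"recompute:      ~{compute_seconds * 1e3:.0f} ms")

Even with assumptions generous to the recompute side, loading the cache wins by roughly three orders of magnitude here.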

> Please don't be so entitled.

When someone comes up with a wrong number, I try to be nice: I run the numbers myself, figure out how they could have ended up with such a number, and point out the specific mistake, instead of dumping a page of my own calculations. It's usually just a missing factor somewhere. Guess I shouldn't be so nice to people who keep insisting that you can be fine without a k/v cache during decoding. Also, in this case I admit I failed to come up with a theory for why your number is so far off, because handing out prefill numbers and claiming they are decode numbers isn't in my book.

Yeah, I know this sounds extremely mean, feel free to downvote, but I hope readers can feel my frustration now.


I am not gonna downvote you, but you will need to find your manners. People around you, and most certainly your colleagues, will be grateful for that. Perhaps also learn to deal with arguments and different opinions without coloring them "plain wrong" in advance. Give the people around you the benefit of the doubt.

> Also, in this case I admit I failed to come up with a theory for why your number is so far off, because handing out prefill numbers and claiming they are decode numbers isn't in my book.

Maybe it's because it is not off? It's not terribly difficult to sum up all the matmul calculations and the number of bytes one needs to load and store per layer in self-attention. My number could be off by a bit, but it is certainly not wildly off.
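
For the record, here is roughly what that sum looks like (a sketch with assumed Llama-3-8B-like shapes; the exact constants may differ from my original calculation):

    # Per-layer self-attention napkin math, assuming Llama-3-8B-like shapes:
    # hidden 4096, 32 query heads, 8 KV heads (GQA), head_dim 128, fp16.
    h, n_q, n_kv, d, dtype_bytes = 4096, 32, 8, 128, 2

    # Projection weight elements loaded per layer:
    w_q = h * (n_q * d)
    w_k = h * (n_kv * d)
    w_v = h * (n_kv * d)
    w_o = (n_q * d) * h
    weight_bytes = (w_q + w_k + w_v + w_o) * dtype_bytes

    # Matmul FLOPs for ONE new token against s cached tokens:
    s = 1000
    proj_flops = 2 * (w_q + w_k + w_v + w_o)  # 2 FLOPs per weight element
    attn_flops = 2 * n_q * d * s * 2          # Q.K^T and A.V
    print(f"{weight_bytes / 1e6:.0f} MB of weights/layer, "
          f"{(proj_flops + attn_flops) / 1e6:.0f} MFLOPs/token/layer")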


Thanks for your advice.

> different opinions

I wouldn't argue with you so hard if these were "opinions". What you described is not an opinion. And facts can be wrong. Plainly wrong.

> Maybe it's because it is not off?

Yeah, as I said earlier, your number might be correct as an estimate for prefilling 1000 tokens on Llama 3 8B. But that's not what everybody here calls "decode". Your number shows that prefill is compute-bound. So what?
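
For readers following along, the compute-bound vs. memory-bound distinction comes down to arithmetic intensity (a sketch with assumed numbers, not measurements):

    # Arithmetic intensity: FLOPs done per byte of weights loaded.
    # Assumed: 8B params, fp16 weights, ~2 FLOPs per param per token.
    params = 8e9
    weight_bytes = params * 2            # fp16
    flops_per_token = 2 * params

    for tokens_per_pass in (1, 1000):    # one decode step vs. 1k-token prefill
        intensity = flops_per_token * tokens_per_pass / weight_bytes
        print(f"{tokens_per_pass:>5} tokens/pass -> {intensity:7.1f} FLOPs/byte")
    # ~1 FLOP/byte (decode: memory-bound) vs. ~1000 (prefill: compute-bound)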



