The implementation absolutely can influence the outputs.

If you have a sloppy implementation which somehow accumulates a lot of error in its floating point math, you will get worse results.

It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade correctness for performance. And it matters a lot more in the low-precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
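A quick numpy sketch of exactly that experiment (illustrative only; pairwise_sum is just a made-up helper name):

    import numpy as np

    ones = np.ones(9999, dtype=np.float16)

    # Naive sequential sum with an fp16 accumulator: once the running total
    # reaches 2048, adding 1.0 no longer changes it (adjacent fp16 values
    # there are 2.0 apart and ties round to even), so the loop stalls.
    acc = np.float16(0.0)
    for x in ones:
        acc = np.float16(acc + x)
    print(float(acc))  # 2048.0

    # Pairwise (tree) summation keeps the partial sums comparable in size
    # and lands on 10000.0, the closest fp16 value to 9999.
    def pairwise_sum(v):
        if len(v) == 1:
            return v[0]
        mid = len(v) // 2
        return np.float16(pairwise_sum(v[:mid]) + pairwise_sum(v[mid:]))
    print(float(pairwise_sum(ones)))  # 10000.0

    # Accumulating in fp32 and rounding once at the end also gives 10000.0.
    print(float(np.float16(ones.sum(dtype=np.float32))))  # 10000.0

Same data, same output dtype; the naive loop ends up off by roughly a factor of five purely because of how the additions are grouped and accumulated.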




I thought all current implementations accumulate into fp32 instead of accumulating in fp16.


We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but only after 200 tokens, hence it is unlikely to be detected in many benchmarks.
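For concreteness, a minimal numpy sketch of the idea (not gemma.cpp's actual code; softmax_f64_accum is a made-up name): the exp terms stay in lower precision, and only the softmax denominator is accumulated in f64.

    import numpy as np

    def softmax_f64_accum(logits_f32: np.ndarray) -> np.ndarray:
        # Hypothetical sketch: exp terms are computed in fp32 as usual, but the
        # softmax denominator is accumulated in float64 before the final divide.
        exps = np.exp(logits_f32 - np.max(logits_f32)).astype(np.float32)
        denom = np.float64(0.0)
        for e in exps:  # explicit loop to make the accumulator dtype obvious
            denom += np.float64(e)
        return (exps / denom).astype(np.float32)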

Does anyone have experience with higher-precision matmul and whether it is worthwhile?


Isn’t 200 tokens basically nothing? Did you mean to say 2000?


That's indeed short for some actual uses such as summarization, but AFAIK many (most?) evals involve generating fewer than 200 tokens.


I haven't looked at all implementations, but the hardware (tensor cores as well as CUDA cores) allows you to accumulate at fp16 precision.
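To make the accumulator question concrete, a toy numpy comparison (purely illustrative, nothing like what a tensor-core kernel actually does) of the same fp16 dot product with an fp16 vs an fp32 accumulator:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(4096).astype(np.float16)
    b = rng.standard_normal(4096).astype(np.float16)

    def dot(x, y, acc_dtype):
        # Products are rounded to fp16 either way; only the accumulator differs.
        acc = acc_dtype(0.0)
        for xi, yi in zip(x, y):
            acc = acc_dtype(acc + np.float16(xi * yi))
        return acc

    ref = np.dot(a.astype(np.float64), b.astype(np.float64))
    print(float(abs(dot(a, b, np.float16) - ref)))  # accumulation error on top of product rounding
    print(float(abs(dot(a, b, np.float32) - ref)))  # typically much smaller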


How well does bf16 work in comparison?


Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I was mentioning, you run into precision limits, not range limits.

I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
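A small way to see the storage-format point (a sketch using simple truncation; real converters usually round to nearest, and both helper names are made up): the top 16 bits of an fp32 pattern are a valid bf16, and padding a bf16 with 16 zero bits gives back a valid fp32.

    import struct

    def f32_to_bf16_bits(x: float) -> int:
        # Keep the top 16 bits of the fp32 pattern: sign, 8 exponent bits,
        # 7 fraction bits. (Truncation here; real converters usually round.)
        (bits,) = struct.unpack("<I", struct.pack("<f", x))
        return bits >> 16

    def bf16_bits_to_f32(bits: int) -> float:
        # Appending 16 zero bits turns any bf16 pattern into a valid fp32.
        (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
        return x

    print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159265)))  # 3.140625: only ~2-3 decimal digits survive
    print(bf16_bits_to_f32(f32_to_bf16_bits(1e30)))        # still finite; fp16 would overflow to inf here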


TIL, thanks.



