The implementation absolutely can influence the outputs.
If you have a sloppy implementation which somehow accumulates a lot of error in its floating point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, so the order of operations affects both correctness and performance, and developers might (unknowingly) trade performance for correctness. It matters a lot more in the low-precision modes we operate in today. Just try different methods of summing a vector of 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't even get close to the best approximation with a naive loop.
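For instance, here is a quick NumPy sketch of that experiment (the exact results assume IEEE fp16 with round-to-nearest-even, which is what NumPy uses):

    import numpy as np

    ones = np.ones(9999, dtype=np.float16)

    # Naive left-to-right loop: once the accumulator reaches 2048, adding 1.0
    # no longer changes it (the fp16 spacing there is 2.0), so the sum stalls.
    acc = np.float16(0)
    for x in ones:
        acc = np.float16(acc + x)
    print(acc)  # 2048.0

    # Pairwise (tree) summation keeps partial sums of similar magnitude and
    # lands on 10000.0, the closest representable fp16 value to 9999.
    def pairwise(v):
        if len(v) == 1:
            return v[0]
        mid = len(v) // 2
        return np.float16(pairwise(v[:mid]) + pairwise(v[mid:]))
    print(pairwise(ones))  # 10000.0

    # Accumulating in a wider type and rounding once at the end also works.
    print(np.float16(ones.sum(dtype=np.float32)))  # 10000.0

Same inputs, same target precision, three different answers depending purely on how the additions are ordered and accumulated.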
We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but after 200 tokens, hence unlikely to be detected in many benchmarks.
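For reference, a NumPy sketch of the idea (not gemma.cpp's actual code): keep the per-term exponentials in fp32 but accumulate the softmax denominator in f64.

    import numpy as np

    def softmax_f64_denominator(logits):
        # Sketch only: fp32 throughout, except the normalizer, which is
        # accumulated in float64 as described above.
        x = np.asarray(logits, dtype=np.float32)
        x = x - x.max()                      # standard max-subtraction for stability
        e = np.exp(x)                        # per-term exponentials stay fp32
        denom = e.astype(np.float64).sum()   # f64 accumulation of the sum
        return (e.astype(np.float64) / denom).astype(np.float32)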
Does anyone have experience with higher-precision matmul and whether it is worthwhile?
Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I mentioned, you run into precision limits, not range limits.
I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
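A rough illustration of that round trip (this uses plain truncation; real converters typically round to nearest even):

    import numpy as np

    def f32_to_bf16_bits(x):
        # Keep only the top 16 bits of the fp32 encoding (truncation).
        return np.float32(x).view(np.uint32) >> np.uint32(16)

    def bf16_bits_to_f32(b):
        # Widening is exact: append 16 zero bits and reinterpret as fp32.
        return (np.uint32(b) << np.uint32(16)).view(np.float32)

    print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159)))  # 3.140625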