Addition Is All You Need for Energy-Efficient Language Models (arxiv.org)
77 points by InvisibleUp 4 hours ago | 17 comments





> can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products

If this were about convolutional nets, then optimizing compute would be a much bigger deal. Transformers are lightweight on compute and heavy on memory: the weakest link in the chain is fetching the model weights into the cores. The 95% and 80% energy reductions cited are for the multiplication operations in isolation, not for the entire inference process.
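A rough back-of-envelope makes the point. The per-operation energies below are approximate 45 nm figures commonly quoted from Horowitz's ISSCC 2014 talk, and the "every weight streamed from DRAM" assumption is the low-batch worst case, so treat this as order-of-magnitude only:

    # Rough per-operation energies in picojoules (approx. 45 nm figures,
    # order-of-magnitude only; Horowitz, ISSCC 2014).
    FP32_MUL_PJ = 3.7           # fp32 multiply
    FP32_ADD_PJ = 0.9           # fp32 add
    DRAM_READ_32BIT_PJ = 640.0  # fetching 32 bits from DRAM

    # One multiply-accumulate on a weight streamed from DRAM:
    compute = FP32_MUL_PJ + FP32_ADD_PJ
    fetch = DRAM_READ_32BIT_PJ
    print(f"compute share of energy: {compute / (compute + fetch):.1%}")  # ~0.7%

Even wiping out the multiply energy entirely barely moves the total in that regime.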


It's worse than that: the energy gains are for computations done in fp32, but for fp8 the multipliers are really tiny and the adders/shifters represent the largest part of the operators (energy-wise and area-wise), so this paper's technique would only yield small gains there.

For fp8, the estimated gate count is 296 for a standard fp8 multiplier vs. 157 with their technique, so the power gain on the multipliers alone would be much lower (1 − 157/296 ≈ 47%, so 50% is a more reasonable estimate), and again, for fp8 the additions in the dot products make up a large share of the operations.

Overall, it's really disingenuous to claim an 80% power gain and a small drop in accuracy, when the power gain is only for fp32 operations and the small drop in accuracy is only for fp8 operators. They don't analyze the accuracy drop in fp32, and they don't present the power saved for fp8 dot products.


I'm also sure that fp8 is small enough that multiplication can be done with a much simpler circuit than for larger fp formats. Even smaller formats like fp4 could just use a lookup table, which makes them more like sort-of-standardized quantization schemes.

I suspect you could do fp8 with log tables and interpolation if you really wanted to (compared to the memory required for the model, it's peanuts); it just turns into a LUT (log-table lookup) and a bit shift (interpolation). So again, memory bandwidth is the limiting factor for transformers, as far as energy is concerned.
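As a toy software illustration of the log-table idea (the table encoding and precision here are arbitrary choices for the sketch; real hardware would bake the tables into a small per-ALU ROM indexed by the fp8 bits):

    import math

    SCALE = 64  # fixed-point scale for the log domain (arbitrary for this sketch)

    def log_lut(x: float) -> int:
        # Stand-in for a tiny ROM: fp8 has only 256 encodings to cover.
        return round(math.log2(x) * SCALE)

    def antilog_lut(l: int) -> float:
        return 2.0 ** (l / SCALE)

    def lut_mul(a: float, b: float) -> float:
        # Multiplication turns into an integer addition in the log domain.
        return antilog_lut(log_lut(a) + log_lut(b))

    print(lut_mul(1.5, 2.25), 1.5 * 2.25)  # ~3.36 vs 3.375

This only handles positive values; sign and zero would need a couple of extra bits of logic.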

In this case, though, the LUT exists in a circuit, which is much more efficient than a typical memory lookup. Such a LUT would have to exist per ALU, though, so it can't be too large.

fp4/fp8 for neural networks don't work the way you think they do. They are merely compression formats: a set of, say, 256 fp32 weights from one neuron is lossily turned into one max value (stored in fp32 precision) and 256 fp4/fp8 numbers. Those compressed numbers are multiplied by the fp32 value at runtime to restore the original weights, and full fp32 multiplications and additions are executed.

You are correct that the accumulation (i.e. the additions in dot products) has to be done in a higher precision; however, the multiplication can still be done via LUT. (Source: I currently work at an ML hardware startup.)

The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation.

So the multiplications and additions are done in fp8/int8/int4/whatever (when the hardware supports those operators, of course) and accumulated in fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
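A minimal numpy sketch of that layout, assuming simple per-tensor scales (names and shapes are just for illustration): all the cheap int8 multiplies are accumulated in a wide integer, and the fp32 rescale happens once per output.

    import numpy as np

    def quantize(x):
        # Per-tensor symmetric int8 quantization (illustrative, not a specific API).
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, np.float32(scale)

    w = np.random.randn(256).astype(np.float32)
    a = np.random.randn(256).astype(np.float32)
    qw, sw = quantize(w)
    qa, sa = quantize(a)

    # int8 multiplies accumulated in int32; a single fp32 multiply at the end.
    acc = np.dot(qw.astype(np.int32), qa.astype(np.int32))
    approx = acc * (sw * sa)

    print(approx, np.dot(w, a))  # equal up to quantization error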


With w8a8 quantization the hardware (Hopper or newer) can do the heavy math in fp8 twice as fast as fp16.

That is true for single-user/light inference only. For training and batch inference you can get compute-bound fast enough.

That really depends on what you're doing. Keeping a tensor core fed is pretty hard; they're really fast.
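A quick roofline-style estimate shows why. The numbers below are approximate published H100 SXM specs (about 3.3 TB/s of HBM bandwidth, on the order of 2,000 dense fp8 TFLOPS), so this is a ballpark only:

    # Ballpark roofline, approximate H100 SXM specs (order-of-magnitude only).
    fp8_flops_per_s = 2000e12   # dense fp8 tensor-core throughput
    hbm_bytes_per_s = 3.3e12    # HBM bandwidth

    # Arithmetic intensity needed to be compute-bound rather than memory-bound:
    required_intensity = fp8_flops_per_s / hbm_bytes_per_s
    print(f"~{required_intensity:.0f} FLOPs per byte moved")  # roughly 600

    # A weight-streaming fp8 GEMM does about 2*B FLOPs per weight byte at batch
    # size B, so B needs to be in the hundreds before the tensor cores saturate.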

Maybe this technique could be used for training then, since that is a lot more compute-intensive?

Haven't read it, but isn't this just logarithmic tables in some form?

I am asking not to dismiss it; I genuinely feel I don't understand logarithms at a fundamental level (of logic gates etc.). If multiplication can be replaced with a table lookup and an addition, then there has to be a circuit that gives you difficult addition and easy multiplication, or some combination of those tradeoffs.
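That trade-off is exactly what happens if you keep numbers in the log domain. A small sketch of the idea (function names are just for illustration):

    import math

    # Store positive values as their base-2 logarithms.
    def log_mul(la, lb):
        return la + lb  # multiplication is now a single addition

    def log_add(la, lb):
        # Addition is the hard one: log2(a + b) = hi + log2(1 + 2**(lo - hi)).
        # The correction term depends only on |la - lb|, so in hardware it
        # becomes a small LUT (the classic "Gaussian logarithm" trick).
        hi, lo = max(la, lb), min(la, lb)
        return hi + math.log2(1.0 + 2.0 ** (lo - hi))

    a, b = 3.0, 5.0
    la, lb = math.log2(a), math.log2(b)
    print(2 ** log_mul(la, lb), a * b)  # 15.0
    print(2 ** log_add(la, lb), a + b)  # 8.0

So the hard and easy operations swap places, which is why log-based number systems mostly pay off when multiplies dominate additions.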


I believe this reduces the compute required, but it still uses 8 bits per value, so it does not reduce the memory needed to run inference and doesn't particularly make the models more accessible for inference. Is this storage method suitable for training? That could potentially be an interesting application.

It puzzles me that there does not seem to be a proper derivation and discussion of the error term in the paper. It's all treated indirectly, by way of inference results.

The return of the CPU?!

All You Need is Considered Harmful.


