Standardizing next-generation narrow precision data formats for AI (opencompute.org)
103 points by opcode84 on Oct 18, 2023 | 52 comments



Earlier this year, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. formed the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. The key technology that enables sub-8-bit formats to work, referred to as microscaling, builds on years of design space exploration and research. MX enhances the robustness and ease of use of existing 8-bit formats such as FP8 and INT8, thus lowering the barrier for broader adoption of single-digit-bit training and inference.

Spec: https://www.opencompute.org/documents/ocp-microscaling-forma...

Whitepaper: https://arxiv.org/abs/2310.10537

Code: https://github.com/microsoft/microxcaling


Thanks - interesting. I wish

> Integer data types use a 2’s complement encoding, but the maximum negative representation (−2^(k−1)) may be left unused to maintain symmetry between the maximum positive and negative representations and avoid introducing a negative bias.

... the maximum negative representation were used for a NaN. IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers??


> IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers??

Because making an integer bit pattern act as a NaN would require specific semantics (e.g. NaN + X = NaN; NaN != NaN) which are difficult to implement efficiently in hardware. These properties would also potentially rule out some arithmetic optimizations which are currently possible.


Yep - agreed that there will be a price to be paid. Do you have any inkling of how costly it would be in h/w? Maybe compared to the equivalent for floats? For floats it's been decided that paying the price was worth it, it seems.


> Do you have any inkling of how costly it would be in h/w?

For something like addition, a ripple-carry adder is ~5 gates per bit. To check for NaN on input, you'd need a wide AND/OR on each input (~N log N gates per bit per input, so ~4 gates per bit for a 64-bit adder), a multiplexer on the output (~3 gates per bit), and a bunch of fan-in/out for the "is this NaN" signal. That'd likely more than double the size of the cell.

Subtraction makes that even more awkward. With 2's complement, an adder can also perform subtraction by inverting the second operand and carrying in a 1. This trick stops working if one of your bit patterns is a special value, so you either have to add even more logic to specify that NaN is inverted, or duplicate the whole mess for subtraction.

You'd also have to add a bunch of completely new hardware to distinguish between e.g. "X is bitwise equal to Y" and "X is numerically equal to Y" in equality tests, because NaN != NaN. It's hard to speculate how expensive that would be, but it certainly wouldn't be trivial.
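
To make those semantics concrete, here's a toy software model (Python, purely hypothetical) of an 8-bit integer type that reserves the most negative pattern (-128) as a NaN; every extra branch below corresponds to extra logic a hardware implementation would need:

    # Toy model only: 8-bit two's-complement ints where the -128 pattern is reserved as "NaN".
    INT8_NAN = -128

    def int8_add(a, b):
        if a == INT8_NAN or b == INT8_NAN:     # NaN propagates, as it does for floats
            return INT8_NAN
        s = a + b
        if s > 127 or s < -127:                # this toy model also maps overflow to NaN
            return INT8_NAN
        return s

    def int8_eq(a, b):
        if a == INT8_NAN or b == INT8_NAN:     # numeric equality: NaN != NaN
            return False
        return a == b

    assert int8_add(100, 27) == 127
    assert int8_add(100, 28) == INT8_NAN
    assert not int8_eq(INT8_NAN, INT8_NAN)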

> For floats it's been decided that paying the price was worth it, it seems.

That's a bit of an oversimplification. I'd say that it's more that:

1) Floating-point arithmetic is already fairly complex; even if you didn't handle the special values it'd still require a lot more logic than integer math.

2) Handling special values like infinities and NaN was simply part of the "spec" for floating-point math. It wouldn't have been considered fit for purpose without those features.


For us less technical folks (in this field), what’s the big take away here / why does this matter / why should we be excited?


Typically you need some tricks to pre-train in lower precision (fine-tuning seems to work at low precision); with FP16, for example, you need loss scaling. With MX, you can train in 6 bits of precision without any tricks and hit the same loss as FP32.
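
For context, the FP16 loss-scaling trick looks roughly like this (schematic PyTorch-style sketch; `model`, `loss_fn`, `opt`, and the fixed `scale` are placeholders, and real setups usually adjust the scale dynamically):

    import torch

    scale = 2.0 ** 14                              # static loss scale, purely illustrative

    def fp16_train_step(model, loss_fn, batch, targets, opt):
        opt.zero_grad()
        loss = loss_fn(model(batch), targets)      # forward pass assumed to run in FP16
        (loss * scale).backward()                  # scale up so tiny gradients don't flush to zero
        for p in model.parameters():
            if p.grad is not None:
                p.grad /= scale                    # unscale before the optimizer step
        opt.step()
        return loss.item()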


On most hardware, handwritten math is required for all the nonstandard formats, e.g. for quantized int-8 https://github.com/karpathy/llama2.c/blob/master/runq.c#L317

Integer quantization doesn't typically just round; it has scaling and other factors applied per block, so it's not just a question of manipulating int8s. And FP16/FP8 aren't supported by most processors, so they need their own custom routines as well. It would be great if you could just write code that operates with intrinsics on the quantized types.
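
A rough sketch of the kind of per-block scaling being described (NumPy; symmetric int8 with one scale per block of 32 - real schemes differ in the details):

    import numpy as np

    BLOCK = 32

    def quantize_blocks(w):
        w = w.reshape(-1, BLOCK)
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one FP scale per block
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_blocks(q, scale):
        return (q.astype(np.float32) * scale).reshape(-1)

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_blocks(w)
    print(np.abs(dequantize_blocks(q, s) - w).max())           # worst-case block error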


Future hardware implementations for these <8bit data types will result in much larger (number of parameters) models fitting in the same memory. Unless they are standardized, each vendor and software framework will have their own slightly different approach.


Actual article as text for those like myself getting completely blocked by the broken captcha wall. I must have hit the "prove you're not a robot" check-box 20 times.

https://web.archive.org/web/20231018183224/https://www.openc...


This reminds me a lot of Nervana Systems' Flexpoint.


Huh, for FP4 just E2M1 with no E3M0? I've seen a paper in the past that went so heavy on exponent it was skipping every other power of two, so I would have thought the demand was there.

Oddly they do have E8M0.
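
For reference, here are the positive values of the spec's FP4 (E2M1) next to one hypothetical E3M0 layout (the E3M0 bias and zero handling below are made up for illustration):

    def fp4_e2m1_values():
        # Spec FP4: 1 sign, 2 exponent bits (bias 1), 1 mantissa bit, no Inf/NaN.
        vals = []
        for e in range(4):
            for m in range(2):
                if e == 0:
                    vals.append(m * 0.5)                        # subnormals: 0, 0.5
                else:
                    vals.append((1 + m * 0.5) * 2.0 ** (e - 1))
        return vals

    def e3m0_values(bias=3):
        # Hypothetical layout: pure powers of two, e=0 reserved for zero.
        return [0.0] + [2.0 ** (e - bias) for e in range(1, 8)]

    print(sorted(fp4_e2m1_values()))   # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    print(e3m0_values())               # [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]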


E3M0 was the format I was most excited to see here, but I guess not. E8M0 makes sense because of the relationship to E8M23 (float32) and E8M7 (bfloat16). Nvidia has their own E8M10 format (TF32) that uses the exponent logic of float32 and the mantissa logic of float16, allowing you to multiply 2x more numbers at a time in E8M10 than in E8M23 without adding more hardware or resorting to a narrower exponent.
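
You can get a feel for that kind of reduced-mantissa format by truncating an FP32 mantissa in software (rough NumPy sketch; real hardware rounds rather than truncates):

    import numpy as np

    def truncate_mantissa(x, keep_bits=10):
        # FP32 has 23 explicit mantissa bits; zero out the low (23 - keep_bits) of them.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
        return (bits & mask).view(np.float32)

    x = np.array([3.14159265, 1e-30, 65504.0], dtype=np.float32)
    print(truncate_mantissa(x))   # full FP32 exponent range, but only ~10 mantissa bits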


Copying my comment here too - point of clarification: there is no direct E8M0 datatype (unless I misunderstand something!). E8M0 is only used for the exponent scaling of the block - there are 8 bits of scale per block.


I think you're right. In general, storage and operating formats seem to be decoupling for AI/ML.

Nvidia's E8M10 is also a format specifically for operators - they expect you to store FP32 when you operate in it. Storage is almost always in power-of-2 sizes.


I would hope so ;)


Point of clarification: there is no direct E8M0 datatype (unless I misunderstand something!). E8M0 is only used for the exponent scaling of the block - there are 8 bits of scale per block.


Interesting to see Nvidia here - I would assume they have the most to lose from an open consortium like this. Or do they think they will come out ahead even if there is an open standard like this?


Back in the day, NVidia's OpenCL implementation beat AMD's OpenCL implementation hands down. Even die-hard AMD fans who were determined to tough it out would eventually sniff out that the grass was greener on the other side and switch teams. It was excellent advertising.


I think that while Nvidia has a near-monopoly on training, the inference side of things is much more multiplatform. That's probably why.


This is for both training and inference, for what it is worth.


They have the most to lose from there being a standard that they don't participate in, though, yes, no standard at all would be better for them than a standard they do participate in.


Apple is notably absent


Floats seem like a pretty bad data format for ML (also in general). Infinity, NaN, and -0 are all useless. NaNs in particular have multiple binary representations, so they waste extra space.

Not sure why more effort isn't being put toward Posits, or thinking up a different format for ML specifically.


Well, if you actually read the document, that's not how NaN is encoded in these data types--these are not IEEE 754-compliant encodings.

As for why not posits, I'm not entirely sure, but the "variably-encoded exponent width" nature of posits likely makes several details of their construction in hardware more difficult than things that have fixed exponent and mantissa widths. Although by the time you're talking about 8-bit datatypes, implementing operations by lookup table starts to look appealing.
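
To illustrate the lookup-table point: with 8-bit operands there are only 65,536 input pairs, so a multiply can literally become a table lookup (toy Python sketch; the decoder is a simplified E4M3-style decode that ignores the NaN encodings):

    import numpy as np

    def decode_e4m3(byte):
        # Simplified E4M3-style decode: 1 sign, 4 exponent bits (bias 7), 3 mantissa bits.
        s = -1.0 if byte & 0x80 else 1.0
        e = (byte >> 3) & 0x0F
        m = byte & 0x07
        if e == 0:                                   # subnormals
            return s * (m / 8.0) * 2.0 ** -6
        return s * (1.0 + m / 8.0) * 2.0 ** (e - 7)

    vals = np.array([decode_e4m3(c) for c in range(256)], dtype=np.float32)
    mul_table = np.outer(vals, vals)                 # mul_table[a, b] == decode(a) * decode(b)

    a, b = 0x38, 0x44
    print(mul_table[a, b], decode_e4m3(a) * decode_e4m3(b))   # 3.0 3.0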


Ahh okay, wasn't aware the encodings were different; after taking a glance at the spec it seems much better than what I pessimistically envisioned.

Two NaNs and -0 still seem like a bad call (-0 could have been the singular NaN), but I guess I understand why they maybe don't want to deviate too much from IEEE 754 floats.


Meanwhile, people look at you like you have two heads if you suggest using anything other than "float" in certain contexts. "How can you do subpixel positioning without float?!?!?! This is unacceptable!" "You can use 1/1000 of a dp to express fractional pix..." "FLOATS. FLOATS GIVE US FRACTIONS. WHY DO YOU HATE OUR CUSTOMERS?".

IEEE 754 floats should never have been a primitive in any programming language. Library type only.


Graphcore have been doing some good research in the area: https://twitter.com/sergiopprz/status/1708827369031516169


Any idea how these formats compare to POSITs for AI computations?

Also: It's not cool IMHO that they have two distinct formats for FP8.


The promises of POSITs don't seem to hold water in terms of their application-level benefits, n-bit posits need equivalent-sized hardware to 2n-bit floats, and the numerical analysis on them is hell. All in all, they were an interesting thought experiment. Compressed/quantized storage of floating point numbers seems to just be better.


> n-bit posits need equivalent-sized hardware to 2n-bit floats

Not in general. Every 8 bit posit with es=1 can be decoded to a 10 bit E5M4 float. Every 8 bit posit with es=2 can be decoded to E6M3, and almost all fit into E5M3.

Though overall I'm not sure if posits are particularly useful here. Posits give you more precision near 1.0, and they give you better range for outliers. The latter almost certainly helps, but I don't know about the former.

Saving 1 bit wouldn't hurt though.
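
For the curious, decoding a small posit is mechanical enough to sketch in a few lines (Python; `es` is a parameter, zero and NaR handled specially):

    def decode_posit8(byte, es=1):
        if byte == 0x00:
            return 0.0
        if byte == 0x80:
            return float("nan")                      # NaR, "not a real"
        sign = -1.0 if byte & 0x80 else 1.0
        if byte & 0x80:
            byte = (-byte) & 0xFF                    # two's complement for negatives
        bits = [(byte >> i) & 1 for i in range(6, -1, -1)]   # the 7 bits after the sign
        r0, run = bits[0], 1
        while run < len(bits) and bits[run] == r0:   # regime: run of identical bits
            run += 1
        k = (run - 1) if r0 == 1 else -run
        rest = bits[run + 1:]                        # skip the regime terminator, if present
        e = 0
        for i in range(es):                          # up to es exponent bits, zero-padded
            e = (e << 1) | (rest[i] if i < len(rest) else 0)
        frac_bits = rest[es:]
        frac = sum(b << (len(frac_bits) - 1 - i) for i, b in enumerate(frac_bits))
        frac_val = frac / (1 << len(frac_bits)) if frac_bits else 0.0
        return sign * (1.0 + frac_val) * 2.0 ** (k * (1 << es) + e)

    print(decode_posit8(0b01000000), decode_posit8(0b01100000), decode_posit8(0x7F))
    # 1.0 4.0 4096.0 (maxpos for es=1)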


If you think of posits as a storage format, it's true that you only need an extra posit->float decoder, but you need bfloat16 to exactly decode all 8-bit posits into a single, common FP format.

In terms of hardware to operate directly on posits, things get ugly. Floats are actually relatively easy to work with due to the normalized storage format and fixed precision across the range of numbers, while posits don't have that.


> If you think of posits as a storage format, it's true that you only need an extra posit->float decoder, but you need bfloat16 to exactly decode all 8-bit posits into a single, common FP format.

But you're not limited to powers of two. If you're decoding to 16 bit floats then you can use posits up to 13 bits wide.

> In terms of hardware to operate directly on posits, things get ugly. Floats are actually relatively easy to work with due to the normalized storage format and fixed precision across the range of numbers, while posits don't have that.

But is that extra shifting bigger than the multiplier unit? It's hard for me to see it growing the circuit that much.


Excellent. For me, six bits of mantissa in fixed point is perfectly adequate


What would be the most practical range for a hypothetical 1-bit floating point type (float1_t)? Zero and one? Zero and infinity? NaN and infinity?


I'd say a type which doesn't have both a significand and an exponent isn't a float at all. How can you have a floating point with no ability to move the point?

To have something IEEE754-like you'd need a sign, significand, and exponent. That gets you +/-(0, 1, Inf, NaN). Add another bit to the exponent if you want non-integer values. That's the minimum for what I'd call a float.

If you really want to strip it down to one bit, you should choose which of those bits to keep. Significand gives you (0,1), which is the same as a uint1_t. Exponent arguably gives you (0, Inf). Sign gives you (-1,1).

Of those, I'd say the sign-bit, giving -1 and 1, is maybe the most practical, but it's not a complete float.


Three-bit floats have been the subject of almost-serious research: http://tom7.org/nand/nand.pdf


As far as successful calculations go, NaN is useless and infinity is next to useless.

And any nonzero finite number might as well be 1 to make it easier to think about.

0 and 1 would work, as long as you have a final scaling factor per node. -1 and 1 should work too.

Also the entire concept of a floating point is weak enough at 3-4 bits and collapses entirely when you have 1 or 2.


It's interesting that the standard "K" (number of elements with a shared scale) is 32. That seems to imply that the neural network will somehow learn to group weights at those 32-element boundaries. Does anybody understand how that works? I mean, what is the mechanism that naturally causes the model to group weight scales into those K-element clusters?


There is no mechanism per se; it's more of a bit-space-vs-quality issue. You could think of MX4 with an 8-bit exponent scale as a 12-bit number if the block size is one: "MX12" with E10M1. You can share the scale with some error per element in a block, with that error going up as you increase the size of the block. As the block size is increased, the effective size per element goes down and the hardware implementation gets smaller/cheaper.
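
A rough way to see that tradeoff numerically (Python sketch; the element grid is the FP4/E2M1 value set and the shared scale is a power of two, loosely in the spirit of the E8M0 block scale - the spec picks the scale slightly differently):

    import numpy as np

    FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # positive E2M1 values
    GRID = np.concatenate([-FP4_POS[::-1], FP4_POS])

    def mx_quantize(x, block):
        x = x.reshape(-1, block)
        amax = np.abs(x).max(axis=1, keepdims=True)
        scale = 2.0 ** np.ceil(np.log2(amax / 6.0 + 1e-30))        # shared power-of-two scale
        idx = np.abs((x / scale)[..., None] - GRID).argmin(axis=-1)
        return (GRID[idx] * scale).reshape(-1)

    x = np.random.randn(4096).astype(np.float32)
    for block in (1, 8, 32, 128):
        print(block, np.abs(mx_quantize(x, block) - x).mean())     # error tends to grow with block size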


Nice, standardization is essential. Without it, things get overly complicated, and it's hard to develop technology that can consume data outside the convention. Although I would assume there will be continual updates to it.


This is all based on the voluntary work of many researchers in academia and open source, and now these big companies are 'standardizing' stuff they did not invent - stuff that was invented despite them, while they went in other market directions.


Not true in this case, it was developed in the industry. https://azure.microsoft.com/en-us/blog/fostering-ai-infrastr...

"Building on years of design space exploration and research at Microsoft, Microscaling technology enables sub 8-bit formats while also enhancing the strength and ease-of-use of existing 8-bit formats such as FP8 and INT8. These advancements also help contribute to broader sustainability goals like reducing the environmental impact of AI technologies as demand continues to grow by improving the energy efficiency of AI in datacenters as well as on many AI endpoints."

Of course, it was not the first block floating point; those have been around since 1963! https://en.wikipedia.org/wiki/Block_floating_point


Finally - the LLM space is littered with closed standards.


Cool. The way current hardware handles very low precision is quite inefficient.


Indeed - per the QLoRA paper, 4-bit quantised data is converted to 16-bit floats (usually BFloat16) for calculations, then converted back again. I suppose this standard allows for smaller data types to be supported directly by hardware instructions.
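
Schematically, the pattern described in the paper looks something like this (NumPy stand-in; the code book here is a made-up placeholder, not the actual NF4 table):

    import numpy as np

    def dequantize(codes, scales, codebook):
        # Map 4-bit codes through a 16-entry table and apply one scale per row.
        return (codebook[codes] * scales[:, None]).astype(np.float16)

    codebook = np.linspace(-1.0, 1.0, 16).astype(np.float32)   # placeholder code book
    codes = np.random.randint(0, 16, size=(64, 128))           # 4-bit storage
    scales = np.random.rand(64).astype(np.float32) + 0.5
    x = np.random.randn(8, 64).astype(np.float32)

    w = dequantize(codes, scales, codebook)                    # expand to 16-bit...
    y = x @ w.astype(np.float32)                               # ...then compute in wider precision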


Exactly!


Stupid question, but will people start pushing for more precision once the cost of memory and compute falls more? Or, by the nature of these data types, will there never really be a need for more bits?


Different types of models seem to have different "sweet spots."

For example, current transformer LLMs seem to like 4-6 bits with smart quantization, with good performance at 3-4 bits under extremely aggressive quantization methods (like good use of sparsity and profiling inference on useful data).

The Stable Diffusion UNet doesn't like 8-bit without some changes; the VAE barely even likes FP16.

So to answer your question: some quantization is "free" and there's no reason not to use it, but sometimes it's very lossy and a serious compromise that wouldn't be made with more compute/RAM.

Also, sometimes there is compute overhead that makes quantized inference/training slower. Sometimes the reduced weight size makes passes faster due to a memory bandwidth bottleneck. It just depends.


If memory and compute prices fall, people will just train bigger models.


Both. You could either go for more precision or go for bigger models.


Still no stochastic rounding :(
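
For context, stochastic rounding rounds up or down with probability proportional to the distance to each neighbour, so the quantization error is zero in expectation (minimal NumPy sketch on a fixed grid):

    import numpy as np

    def stochastic_round(x, step, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        scaled = np.asarray(x, dtype=np.float64) / step
        lower = np.floor(scaled)
        frac = scaled - lower                        # distance to the lower grid point, in [0, 1)
        up = rng.random(scaled.shape) < frac         # round up with probability frac
        return (lower + up) * step

    x = np.full(100_000, 0.3)
    print(stochastic_round(x, step=1.0).mean())      # ~0.3: unbiased, unlike round-to-nearest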



