Hacker News | opcode84's comments

For narrow-precision formats to be practical in large-scale pretraining, they must preserve model accuracy and converge stably. To assess the viability of 4-bit precision in large-scale training, experiments were conducted with FP8 and NVFP4 on a 12-billion-parameter hybrid Mamba-Transformer model, similar to NVIDIA Nemotron Nano 2. This model was trained on a massive dataset of 10 trillion tokens using a phased data-blending approach: the dataset mix switched once at 70% of pretraining (the second phase) and again at 90% (the third phase).
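The phase switching can be pictured as a simple token-budget schedule. This is only an illustration: the dataset names and the function are hypothetical, and just the 10T-token budget and the 70%/90% boundaries come from the text above.

```python
# Hypothetical sketch of a phased data-blending schedule. Only the total
# token budget and the 70% / 90% switch points are taken from the text;
# the blend names are placeholders.

TOTAL_TOKENS = 10_000_000_000_000  # 10T tokens

def blend_for(tokens_seen: int) -> str:
    """Return which dataset blend is active at this point in pretraining."""
    frac = tokens_seen / TOTAL_TOKENS
    if frac < 0.70:
        return "phase1_mix"
    elif frac < 0.90:
        return "phase2_mix"
    else:
        return "phase3_mix"
```

A trainer would consult such a schedule each time it assembles a batch, so the switch happens at a token count rather than a step count.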

A version of the 12B Hybrid Mamba-Transformer model was initially trained with 8-bit precision—FP8, which has been shown in previous studies to closely match 16-bit precision, and hence served as our baseline for comparison. We then successfully trained this same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low precision training.

Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.



Introducing NVFP4 for Efficient and Accurate Low-Precision Inference


The formats supported in the tutorial are the OCP microscaling formats, including mxfp4 and mxfp8, as well as NVIDIA’s nvfp4 format. These matrix multiplications are accelerated by fifth-generation Tensor Core instructions on CUDA devices with compute capability 10.

Blog post: https://developer.nvidia.com/blog/openai-triton-on-nvidia-bl...


From the article:

"NVIDIA Blackwell introduces revolutionary block-scaled floating point formats, including the Open Computing Project’s microscaling formats, which Triton now unlocks for NVIDIA Blackwell-powered hardware acceleration.

These formats provide higher average precision at higher performance than the non-native block-scaling techniques emulated frequently in LLM inference projects today.

For OCP format support, MXFP8 GEMMs on Triton showcase exceptional performance similar to the FP8 GEMMs performance accelerated and shown earlier in this post, while natively allowing for scaling in the Tensor Core.

Similarly, MXFP4 provides a new operating point in the precision-performance trade-off space but while offering double the hardware-accelerated performance of FP8 and MXFP8 GEMMs.

While Triton performance for MXFP8 is close to the NVIDIA Blackwell-accelerated FP8 shown earlier, we continue working with the community to accelerate and enable new use cases around block-scaling support."
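The block scaling described in the quoted passage can be simulated in a few lines of Python. This is an educational sketch, not the hardware path or Triton's implementation: the E2M1 magnitude grid and the idea of a shared power-of-two scale per block follow the OCP MX spec, but the scale-selection rule and rounding here are simplifications.

```python
import math

# Magnitudes representable in E2M1 (FP4); sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_dequantize(block):
    """Simulate MXFP4-style quantization of one block with a shared 2^k scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Smallest power-of-two scale that brings the block max within FP4 range.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for v in block:
        # Snap the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append(math.copysign(mag * scale, v))
    return out
```

Values already on the scaled grid round-trip exactly; everything else snaps to the nearest representable point, which is where the quantization error comes from. Real MX blocks are 32 elements with an E8M0 scale.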



Yes this is it!



Yes! Great catch :)


Earlier this year, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. formed the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. The key technology that enables sub-8-bit formats to work, referred to as microscaling, builds on a foundation of years of design-space exploration and research. MX improves the robustness and ease of use of existing 8-bit formats such as FP8 and INT8, lowering the barrier to broader adoption of single-digit-bit training and inference.

Spec: https://www.opencompute.org/documents/ocp-microscaling-forma...

Whitepaper: https://arxiv.org/abs/2310.10537

Code: https://github.com/microsoft/microxcaling


Thanks - interesting. I wish

> Integer data types use a 2’s complement encoding, but the maximum negative representation (−2^(k−1)) may be left unused to maintain symmetry between the maximum positive and negative representations and avoid introducing a negative bias.

... the maximum negative representation was used for a NaN. IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers?
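The symmetry the spec describes is easy to see in a minimal INT8 quantizer sketch (my own illustration, not from the spec): the clamp keeps codes in [−127, 127], so the code −128 is simply never produced.

```python
def quantize_symmetric_int8(x: float, scale: float) -> int:
    """Symmetric INT8 quantization: codes span [-127, 127], -128 unused."""
    q = round(x / scale)
    # Clamp to the symmetric range; -128 is deliberately left out so that
    # the representable positives and negatives mirror each other.
    return max(-127, min(127, q))
```

That unused −128 pattern is exactly the bit pattern the parent comment wishes were repurposed as a NaN.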


> IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers?

Because making an integer bit pattern act as a NaN would require specific semantics (e.g. NaN + X = NaN; NaN != NaN) which are difficult to implement efficiently in hardware. These properties would also potentially rule out some arithmetic optimizations which are currently possible.
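A software sketch makes those required semantics concrete. Here the most negative int8 code (−128) is reserved as a hypothetical NaN; note how every operation has to test both inputs and propagate, which is the per-operation overhead hardware would have to absorb.

```python
INT8_NAN = -128  # hypothetical: reserve the most negative code as NaN

def add_i8(a: int, b: int) -> int:
    """Saturating int8 add with NaN propagation (NaN + x = NaN)."""
    if a == INT8_NAN or b == INT8_NAN:
        return INT8_NAN
    # Saturate within the remaining symmetric range.
    return max(-127, min(127, a + b))

def eq_i8(a: int, b: int) -> bool:
    """Numeric equality with float-like NaN semantics (NaN != NaN)."""
    if a == INT8_NAN or b == INT8_NAN:
        return False
    return a == b
```

Note that `eq_i8` diverges from bitwise equality, which is the extra comparison hardware the comment below alludes to.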


Yep - agreed that there will be a price to be paid. Do you have any inkling of how costly it would be in h/w? Maybe compared to the equivalent cost for floats? For floats it's been decided that paying the price was worth it, it seems.


> Do you have any inkling of how costly it would be in h/w?

For something like addition, a ripple-carry adder is ~5 gates per bit. To check for NaN on input, you'd need a wide AND/OR on each input (~N log N gates per bit per input, so ~4 gates per bit for a 64-bit adder), a multiplexer on the output (~3 gates per bit), and a bunch of fan-in/out for the "is this NaN" signal. That'd likely more than double the size of the cell.

Subtraction makes that even more awkward. With 2's complement, an adder can also perform subtraction by inverting the second operand and carrying in a 1. This trick stops working if one of your bit patterns is a special value, so you either have to add even more logic to specify that NaN is inverted, or duplicate the whole mess for subtraction.
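The inversion-plus-carry trick is easy to verify in software for 8-bit values: a − b equals a + (~b) + 1 modulo 2^8, so the same adder circuit performs both operations.

```python
MASK = 0xFF  # 8-bit word

def sub_via_add(a: int, b: int) -> int:
    """Compute (a - b) mod 256 using only addition, inversion, and carry-in."""
    return (a + ((~b) & MASK) + 1) & MASK
```

Reserving a bit pattern as NaN breaks this identity, because the inverted NaN pattern would have to map back to NaN rather than to a number.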

You'd also have to add a bunch of completely new hardware to distinguish between e.g. "X is bitwise equal to Y" and "X is numerically equal to Y" in equality tests, because NaN != NaN. It's hard to speculate how expensive that would be, but it certainly wouldn't be trivial.

> For floats it's been decided that paying the price was worth it, it seems.

That's a bit of an oversimplification. I'd say that it's more that:

1) Floating-point arithmetic is already fairly complex; even if you didn't handle the special values it'd still require a lot more logic than integer math.

2) Handling special values like infinities and NaN was simply part of the "spec" for floating-point math. It wouldn't have been considered fit for purpose without those features.


For us less technical folks (in this field), what’s the big take away here / why does this matter / why should we be excited?


Typically you need some tricks to pretrain in lower precision (finetuning seems to work at low precision); with FP16, for example, you need loss scaling. With MX, you can train with 6 bits of precision without any tricks and hit the same loss as FP32.
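For context, the FP16 loss-scaling trick mentioned here looks roughly like this. `grad_fn` is a stand-in for a framework's autograd, and the fixed scale is a simplification: real trainers (e.g. torch.cuda.amp's GradScaler) adjust the scale dynamically and skip steps on overflow.

```python
LOSS_SCALE = 2.0 ** 12  # power of two, so scaling is exact in binary FP

def scaled_backward(loss_value, grad_fn):
    """Scale the loss up before backprop so small FP16 gradients don't
    underflow to zero, then unscale the gradients before the optimizer step."""
    scaled_grads = grad_fn(loss_value * LOSS_SCALE)
    return [g / LOSS_SCALE for g in scaled_grads]
```

Because the scale is a power of two, scaling and unscaling introduce no rounding error of their own; they just shift gradients into FP16's representable range during backprop.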


On most hardware, handwritten math is required for all the nonstandard formats, e.g. for quantized int-8 https://github.com/karpathy/llama2.c/blob/master/runq.c#L317

Integer quantization doesn't typically just round; it applies scaling and other factors per block, so it's not just a question of manipulating int8s. And FP16/FP8 aren't supported by most processors, so they need their own custom routines as well. It would be great if you could just write code that operates with intrinsics on the quantized types.
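The block-wise scaling the comment describes looks roughly like this (in the spirit of the linked runq.c, heavily simplified; the group size and layout here are illustrative). Integers are multiplied within each group, and each group's integer partial sum is rescaled by the product of the two groups' float scales.

```python
GROUP = 4  # elements sharing one scale factor (illustrative; runq.c uses larger groups)

def qdot(xq, xs, wq, ws):
    """Dot product of two block-quantized vectors.
    xq/wq: int8 codes; xs/ws: one float scale per GROUP-sized block."""
    total = 0.0
    for g in range(len(xq) // GROUP):
        # Accumulate the group's products in integer arithmetic...
        ival = sum(xq[g * GROUP + i] * wq[g * GROUP + i] for i in range(GROUP))
        # ...then rescale the integer partial sum once per group.
        total += ival * xs[g] * ws[g]
    return total
```

The per-group rescale is why simple int8 intrinsics aren't enough: the hot loop interleaves integer accumulation with float scaling.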


Future hardware implementations for these <8bit data types will result in much larger (number of parameters) models fitting in the same memory. Unless they are standardized, each vendor and software framework will have their own slightly different approach.


Actual article as text for those like myself getting completely blocked by the broken captcha wall. I must have hit the "prove you're not a robot" check-box 20 times.

https://web.archive.org/web/20231018183224/https://www.openc...


This reminds me a lot of Nervana Systems' Flexpoint.

