Hacker News | opcode84's comments

For narrow-precision formats to be practical in large-scale pretraining, they must preserve model accuracy and converge stably. To assess the viability of 4-bit precision in large-scale training, experiments were conducted with FP8 and NVFP4 on a 12-billion-parameter hybrid Mamba-Transformer model, similar to NVIDIA Nemotron Nano 2. This model was trained on a massive dataset of 10 trillion tokens using a phased data-blending approach: the dataset mix switched once at 70% of pretraining (the second phase) and again at 90% (the third phase).
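The phase switching can be pictured as a simple token-budget schedule. This is only an illustration: the dataset names and the function are hypothetical, and just the 10T-token budget and the 70%/90% boundaries come from the text above.

```python
# Hypothetical sketch of a phased data-blending schedule. Only the total
# token budget and the 70% / 90% switch points are taken from the text;
# the blend names are placeholders.

TOTAL_TOKENS = 10_000_000_000_000  # 10T tokens

def blend_for(tokens_seen: int) -> str:
    """Return which dataset blend is active at this point in pretraining."""
    frac = tokens_seen / TOTAL_TOKENS
    if frac < 0.70:
        return "phase1_mix"
    elif frac < 0.90:
        return "phase2_mix"
    else:
        return "phase3_mix"
```

A trainer would consult such a schedule each time it assembles a batch, so the switch happens at a token count rather than a step count.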

A version of the 12B Hybrid Mamba-Transformer model was initially trained with 8-bit precision—FP8, which has been shown in previous studies to closely match 16-bit precision, and hence served as our baseline for comparison. We then successfully trained this same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low precision training.

Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.



Introducing NVFP4 for Efficient and Accurate Low-Precision Inference


The formats supported in the tutorial are the OCP microscaling formats, including mxfp4 and mxfp8, as well as NVIDIA’s nvfp4 format. These matrix multiplications are accelerated by fifth-generation Tensor Core instructions on CUDA devices with compute capability 10.

Blog post: https://developer.nvidia.com/blog/openai-triton-on-nvidia-bl...


From the article:

"NVIDIA Blackwell introduces revolutionary block-scaled floating point formats, including the Open Computing Project’s microscaling formats, which Triton now unlocks for NVIDIA Blackwell-powered hardware acceleration.

These formats provide higher average precision at higher performance than the non-native block-scaling techniques emulated frequently in LLM inference projects today.

For OCP format support, MXFP8 GEMMs on Triton showcase exceptional performance similar to the FP8 GEMMs performance accelerated and shown earlier in this post, while natively allowing for scaling in the Tensor Core.

Similarly, MXFP4 provides a new operating point in the precision-performance trade-off space but while offering double the hardware-accelerated performance of FP8 and MXFP8 GEMMs.

While Triton performance for MXFP8 is close to the NVIDIA Blackwell-accelerated FP8 shown earlier, we continue working with the community to accelerate and enable new use cases around block-scaling support."
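The block scaling described in the quoted passage can be simulated in a few lines of Python. This is an educational sketch, not the hardware path or Triton's implementation: the E2M1 magnitude grid and the idea of a shared power-of-two scale per block follow the OCP MX spec, but the scale-selection rule and rounding here are simplifications.

```python
import math

# Magnitudes representable in E2M1 (FP4); sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_dequantize(block):
    """Simulate MXFP4-style quantization of one block with a shared 2^k scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Smallest power-of-two scale that brings the block max within FP4 range.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for v in block:
        # Snap the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append(math.copysign(mag * scale, v))
    return out
```

Values already on the scaled grid round-trip exactly; everything else snaps to the nearest representable point, which is where the quantization error comes from. Real MX blocks are 32 elements with an E8M0 scale.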



Yes this is it!



Yes! Great catch :)


Earlier this year, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. formed the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. The key technology that enables sub-8-bit formats to work, referred to as microscaling, builds on a foundation of years of design-space exploration and research. MX improves the robustness and ease of use of existing 8-bit formats such as FP8 and INT8, lowering the barrier to broader adoption of single-digit-bit training and inference.

Spec: https://www.opencompute.org/documents/ocp-microscaling-forma...

Whitepaper: https://arxiv.org/abs/2310.10537

Code: https://github.com/microsoft/microxcaling


Thanks - interesting. I wish

> Integer data types use a 2’s complement encoding, but the maximum negative representation (−2^(k−1)) may be left unused to maintain symmetry between the maximum positive and negative representations and avoid introducing a negative bias.

... the maximum negative representation was used for a NaN. IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers?
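The symmetry the spec describes is easy to see in a minimal INT8 quantizer sketch (my own illustration, not from the spec): the clamp keeps codes in [−127, 127], so the code −128 is simply never produced.

```python
def quantize_symmetric_int8(x: float, scale: float) -> int:
    """Symmetric INT8 quantization: codes span [-127, 127], -128 unused."""
    q = round(x / scale)
    # Clamp to the symmetric range; -128 is deliberately left out so that
    # the representable positives and negatives mirror each other.
    return max(-127, min(127, q))
```

That unused −128 pattern is exactly the bit pattern the parent comment wishes were repurposed as a NaN.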


> IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers?

Because making an integer bit pattern act as a NaN would require specific semantics (e.g. NaN + X = NaN; NaN != NaN) which are difficult to implement efficiently in hardware. These properties would also potentially rule out some arithmetic optimizations which are currently possible.
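A software sketch makes those required semantics concrete. Here the most negative int8 code (−128) is reserved as a hypothetical NaN; note how every operation has to test both inputs and propagate, which is the per-operation overhead hardware would have to absorb.

```python
INT8_NAN = -128  # hypothetical: reserve the most negative code as NaN

def add_i8(a: int, b: int) -> int:
    """Saturating int8 add with NaN propagation (NaN + x = NaN)."""
    if a == INT8_NAN or b == INT8_NAN:
        return INT8_NAN
    # Saturate within the remaining symmetric range.
    return max(-127, min(127, a + b))

def eq_i8(a: int, b: int) -> bool:
    """Numeric equality with float-like NaN semantics (NaN != NaN)."""
    if a == INT8_NAN or b == INT8_NAN:
        return False
    return a == b
```

Note that `eq_i8` diverges from bitwise equality, which is the extra comparison hardware the comment below alludes to.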


Yep - agreed that there will be a price to be paid. Do you have any inkling of how costly it would be in h/w? Maybe compared to the equivalent cost for floats? For floats it's been decided that paying the price was worth it, it seems.


> Do you have any inkling of how costly it would be in h/w?

For something like addition, a ripple-carry adder is ~5 gates per bit. To check for NaN on input, you'd need a wide AND/OR on each input (~N log N gates per bit per input, so ~4 gates per bit for a 64-bit adder), a multiplexer on the output (~3 gates per bit), and a bunch of fan-in/out for the "is this NaN" signal. That'd likely more than double the size of the cell.

Subtraction makes that even more awkward. With 2's complement, an adder can also perform subtraction by inverting the second operand and carrying in a 1. This trick stops working if one of your bit patterns is a special value, so you either have to add even more logic to specify that NaN is inverted, or duplicate the whole mess for subtraction.
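The inversion-plus-carry trick is easy to verify in software for 8-bit values: a − b equals a + (~b) + 1 modulo 2^8, so the same adder circuit performs both operations.

```python
MASK = 0xFF  # 8-bit word

def sub_via_add(a: int, b: int) -> int:
    """Compute (a - b) mod 256 using only addition, inversion, and carry-in."""
    return (a + ((~b) & MASK) + 1) & MASK
```

Reserving a bit pattern as NaN breaks this identity, because the inverted NaN pattern would have to map back to NaN rather than to a number.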

You'd also have to add a bunch of completely new hardware to distinguish between e.g. "X is bitwise equal to Y" and "X is numerically equal to Y" in equality tests, because NaN != NaN. It's hard to speculate how expensive that would be, but it certainly wouldn't be trivial.

> For floats it's been decided that paying the price was worth it, it seems.

That's a bit of an oversimplification. I'd say that it's more that:

1) Floating-point arithmetic is already fairly complex; even if you didn't handle the special values it'd still require a lot more logic than integer math.

2) Handling special values like infinities and NaN was simply part of the "spec" for floating-point math. It wouldn't have been considered fit for purpose without those features.


For us less technical folks (in this field), what’s the big take away here / why does this matter / why should we be excited?


Typically you need some tricks to pretrain in lower precision (finetuning seems to work at low precision); with FP16, for example, you need loss scaling. With MX, you can train with 6 bits of precision without any tricks and hit the same loss as FP32.
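For context, the FP16 loss-scaling trick mentioned here looks roughly like this. `grad_fn` is a stand-in for a framework's autograd, and the fixed scale is a simplification: real trainers (e.g. torch.cuda.amp's GradScaler) adjust the scale dynamically and skip steps on overflow.

```python
LOSS_SCALE = 2.0 ** 12  # power of two, so scaling is exact in binary FP

def scaled_backward(loss_value, grad_fn):
    """Scale the loss up before backprop so small FP16 gradients don't
    underflow to zero, then unscale the gradients before the optimizer step."""
    scaled_grads = grad_fn(loss_value * LOSS_SCALE)
    return [g / LOSS_SCALE for g in scaled_grads]
```

Because the scale is a power of two, scaling and unscaling introduce no rounding error of their own; they just shift gradients into FP16's representable range during backprop.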


On most hardware, handwritten math is required for all the nonstandard formats, e.g. for quantized int-8 https://github.com/karpathy/llama2.c/blob/master/runq.c#L317

Integer quantization doesn't typically just round; it applies scaling and other factors per block, so it's not just a question of manipulating int8s. And FP16/FP8 aren't supported by most processors, so they need their own custom routines as well. It would be great if you could just write code that operates with intrinsics on the quantized types.
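The block-wise scaling the comment describes looks roughly like this (in the spirit of the linked runq.c, heavily simplified; the group size and layout here are illustrative). Integers are multiplied within each group, and each group's integer partial sum is rescaled by the product of the two groups' float scales.

```python
GROUP = 4  # elements sharing one scale factor (illustrative; runq.c uses larger groups)

def qdot(xq, xs, wq, ws):
    """Dot product of two block-quantized vectors.
    xq/wq: int8 codes; xs/ws: one float scale per GROUP-sized block."""
    total = 0.0
    for g in range(len(xq) // GROUP):
        # Accumulate the group's products in integer arithmetic...
        ival = sum(xq[g * GROUP + i] * wq[g * GROUP + i] for i in range(GROUP))
        # ...then rescale the integer partial sum once per group.
        total += ival * xs[g] * ws[g]
    return total
```

The per-group rescale is why simple int8 intrinsics aren't enough: the hot loop interleaves integer accumulation with float scaling.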


Future hardware implementations for these <8bit data types will result in much larger (number of parameters) models fitting in the same memory. Unless they are standardized, each vendor and software framework will have their own slightly different approach.


Actual article as text for those like myself getting completely blocked by the broken captcha wall. I must have hit the "prove you're not a robot" check-box 20 times.

https://web.archive.org/web/20231018183224/https://www.openc...


This reminds me a lot of Nervana Systems' Flexpoint.

