
Typically you need some tricks to pre-train in lower precision (fine-tuning seems to tolerate low precision); with FP16, for example, you need loss scaling. With MX, you can train in 6 bits of precision without any tricks and hit the same loss as FP32.
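For anyone unfamiliar with the "loss scaling" trick: a minimal sketch of what it looks like in practice, using PyTorch's standard AMP utilities (`autocast` and `GradScaler`). The toy model and training loop below are purely illustrative, not from the MX work; the point is that FP16 training needs this extra machinery to keep small gradients from underflowing, which is exactly the kind of trick the parent says MX avoids.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Toy model/optimizer, just to show where the scaler fits in the loop.
model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()  # dynamically adjusts the loss-scale factor

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                   # run forward pass in FP16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # scale loss up so FP16 grads don't underflow
    scaler.step(optimizer)             # unscales grads; skips step on inf/NaN
    scaler.update()                    # grows/shrinks the scale factor over time
```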


