Even for regular calculations, it is unfortunate that the conventional FP16 split reserves only five exponent bits. A single extra bit there would make the format much more useful, and giving up one mantissa bit would be an easy tradeoff.
Bfloat16 is also much better for conversions from f32: it keeps the same 8-bit exponent, so converting is essentially just dropping the low 16 bits of the mantissa.
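Roughly, a sketch of that conversion (plain truncation shown; real implementations usually round to nearest even, and the names here are just illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: convert an f32 to bfloat16 by keeping the top 16 bits
     * (sign, 8-bit exponent, 7-bit mantissa). */
    static uint16_t f32_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* reinterpret without UB */
        return (uint16_t)(bits >> 16);   /* drop the low 16 mantissa bits */
    }

    /* Converting back is just zero-padding the low half. */
    static float bf16_to_f32(uint16_t h) {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }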
Basically, what they are computing is:
f32 acc = Ca0 * Cb0 + Ca1 * Cb1 + Ca2 * Cb2 + ...
with a very large number of coefficients, all of which are bfloat16. The precision of the individual coefficients is not that important, but they can differ substantially in magnitude, so the coefficients can get away with a few mantissa bits while the accumulator needs to be wider.
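A sketch of that accumulation pattern, reusing the hypothetical bf16_to_f32 helper from above: bfloat16 inputs are widened to f32 before the multiply, and the sum is kept in an f32 accumulator.

    #include <stddef.h>
    #include <stdint.h>

    /* Dot product with bfloat16 coefficients and an f32 accumulator.
     * bf16_to_f32 is the helper sketched in the previous snippet. */
    static float bf16_dot(const uint16_t *a, const uint16_t *b, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++) {
            acc += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
        }
        return acc;
    }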
And bfloat16 is new but not that new: TensorFlow had it three years back.
Apparently range is more useful than precision for machine learning, which would be why they went with an 8-bit exponent and 8-bit significand (counting the implicit bit) instead of IEEE FP16's 5 and 11.
Is anyone else using it?