Bfloat16 – Hardware Numerics Definition [pdf] (intel.com)
44 points by anonymfus 4 months ago | 24 comments

As I understand it, the shortened mantissa matters less because these numbers never appear alone. The total number of bits across all the numbers in the set provides enough precision to make fine distinctions between states of the network.

Even for regular calculations, it is unfortunate that the conventional split reserves only five bits for the exponent. A single extra bit there would make the format much more useful, and the loss of one mantissa bit would be an easy tradeoff.
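To put numbers on that tradeoff, here's a back-of-envelope sketch (plain Python, variable names are mine) of the largest finite value each split allows. The max finite value of a binary float is (2 - 2^-m) * 2^(bias), where m is the stored mantissa width:

```python
# IEEE binary16: 5 exponent bits, 10 stored mantissa bits -> max ~6.5e4
fp16_max = (2 - 2**-10) * 2**15

# bfloat16: 8 exponent bits (same as f32), 7 stored mantissa bits -> max ~3.4e38
bf16_max = (2 - 2**-7) * 2**127

print(fp16_max)  # 65504.0
```

Three extra exponent bits buy roughly 34 orders of magnitude of extra range, at the cost of three mantissa bits.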

I'll just leave this here. :-X

Bfloat16 is much better for conversions from f32.


It’s definitely conceptually simpler, but both conversions are a single fully-pipelined operation on any CPU made in the past 5 years, and can be folded into the arithmetic operation on custom HW. In practice the cost of conversion isn’t really an issue; the win with bfloat16 is the added dynamic range.

I'm a computational scientist. Do ML problems not deal with problems that are sensitive to input precision? And if they don't, is it too naive to ask whether one really needs ML for said problem over just plain old fitting and stats?

These are not used for the data but for the computation of the internal coefficients. Said coefficients would be stored as F32s, and the job of updating them involves computing a lot of multiplications, none of which need to be that precise.

Wouldn't conversion between the steps lead to round-off error?

Conversion from Bfloat16 to f32 is just extending the mantissa with zeroes.
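Right, since bfloat16 is literally the top 16 bits of an f32, both directions are just bit shifts. A sketch in plain Python (function names are mine; narrowing here truncates, whereas real hardware typically rounds to nearest even):

```python
import struct

def f32_to_bf16(x: float) -> int:
    """Narrow f32 -> bfloat16 by dropping the low 16 mantissa bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bf16_to_f32(b: int) -> float:
    """Widen bfloat16 -> f32 by appending 16 zero bits; exact, no rounding."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

print(hex(f32_to_bf16(1.0)))         # 0x3f80
print(bf16_to_f32(f32_to_bf16(1.5))) # 1.5
```

The widening direction is exact, so a round trip through f32 loses nothing; only the narrowing direction can discard information.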

Basically, what they are computing is:

f32 acc = C_a0 * C_b0 + C_a1 * C_b1 + C_a2 * C_b2 + ..., with very many coefficients, all of which are Bfloat16. The precision of the coefficients is not that important, but they can be of substantially different magnitudes, so the coefficients can use few bits in the mantissa while the accumulator needs to be wider.
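A minimal simulation of that pattern (my own sketch, not from the whitepaper): quantize each coefficient to bfloat16 precision, but keep the running sum in a wide accumulator.

```python
import struct

def bf16(x: float) -> float:
    """Quantize to bfloat16 precision by truncating the low 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_dot(coeffs_a, coeffs_b):
    """Multiply bfloat16-precision coefficients, accumulate in a wide float.
    (Python floats are f64; a real f32 accumulator would round each step.)"""
    acc = 0.0
    for a, b in zip(coeffs_a, coeffs_b):
        acc += bf16(a) * bf16(b)
    return acc

print(bf16_dot([1.0, 2.0, 0.5], [1.0, 1.0, 2.0]))  # 4.0
```

The narrow inputs keep memory traffic low; the wide accumulator is what stops the many small rounded products from drowning each other out.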

A 7-bit mantissa doesn't sound like a lot; 2^-7 is 0.0078125... shouldn't there be at least 9 bits for the mantissa?

Note that 7 stored bits mean an 8-bit mantissa, thanks to the implicit leading 1.
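You can see both numbers at once by stepping from 1.0 to the next representable bfloat16 (a quick sketch, helper names are mine):

```python
import struct

def bf16_bits(x: float) -> int:
    """Bit pattern of x as bfloat16 (top 16 bits of its f32 encoding)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bits_bf16(b: int) -> float:
    """Decode a 16-bit bfloat16 pattern back to a float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# One ulp above 1.0: the 7 stored bits give spacing 2**-7, but every value
# is of the form 1.xxxxxxx (implicit bit + 7 stored), so the relative
# precision is ~2**-8.
nxt = bits_bf16(bf16_bits(1.0) + 1)
print(nxt)  # 1.0078125, i.e. 1 + 2**-7
```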

And bfloat16 is new but not that new: Tensorflow had it 3 years back[0].

Apparently range is more useful than precision for machine learning, which would be why they went with 8 exponent + 8 significand bits instead of IEEE FP16's 5 + 11.

[0] https://github.com/tensorflow/tensorflow/blob/f41959ccb2d9d4...

My understanding is that the optimization steps of neural networks (typically gradient descent + backpropagation) act a bit like Lloyd's algorithm. Neuron weights will push each other into place. Often, what matters is how they compare to each other, such that the error is minimized; reaching the theoretically-perfect weight is less important.

The format optimizes for dynamic range over precision.

There's an existing standardized 16-bit float with a 10-bit mantissa¹ that graphics people are fond of. This one is for machine learning [drink].

¹ https://en.wikipedia.org/wiki/Half-precision_floating-point_...

Have a look at section 1.1 and Figure 1-1 of the Intel whitepaper, where this topic is discussed.

With the bonus bit, that's 2.4 significant decimal digits of precision. Plenty for ML, especially if you use an F32 as the accumulator.
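The 2.4 figure falls straight out of converting bits to decimal digits (a one-liner, not from the whitepaper):

```python
import math

# 8 significant bits (7 stored + 1 implicit leading bit) expressed as
# equivalent decimal digits: digits = bits * log10(2)
digits = 8 * math.log10(2)
print(round(digits, 2))  # 2.41
```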

Sorry for being off topic, but does Intel calculate bonuses from HN karma (more officially, "impact")? I've seen this bf16 stuff multiple times now, and it looks like the authors are dying for a Christmas bonus.

To me it looks like a clever optimization. Same range as FP32, but half the size and less precise, and it can be converted back and forth just by truncating or appending zeros.

Is anyone else using it?

Google uses it on their TPUs [0]. If you're interested in how it would affect the numerical stability of an algorithm you want to use, there is a Julia package that makes prototyping linear algebra over this datatype pretty straightforward [1].

[0] https://cloud.google.com/tpu/docs/system-architecture

[1] https://github.com/JuliaComputing/BFloat16s.jl

And Facebook is taking this even further. And while all these things are very cool, do not let ASIC designers claim these formats are barriers to entry against GPUs and CPUs. Whatever variants of this precision potpourri catch on are but a generation away from incarnation in general processors, IMO.


Google's TPUs use them, and have for over a year. I don't agree with the "new" or "Intel's" in the title.

And the TPU uses them because TensorFlow uses them; bfloat16 has been present since the first public commit: https://github.com/tensorflow/tensorflow/blob/f41959ccb2d9d4...

I would be extremely surprised if the motivation for putting bfloat16 in tensorflow was not the TPU. That first public commit was ~1.5 years before TPUv2 was announced at I/O, so it was almost certainly already in development.

bfloat16 was first in DistBelief, so it actually predates TensorFlow and TPUs (I worked on both systems). IIRC the motivation was more about minimizing parameter exchange bandwidth for large-scale CPU clusters rather than minimizing memory bandwidth within accelerators, but the idea generalized.

Thank you! I didn't know this. I thought they introduced them shortly after announcing TPU v1 at Google I/O 2016 (or 2017, I can't remember).

Why is it clever to change the mantissa and exponent sizes? I thought the clever one was Nervana's Flexpoint, which seemed at least partially novel. And it's interesting that Intel isn't pushing that format, given Nervana's ASIC had it.

