
Bfloat16 – Hardware Numerics Definition [pdf] - anonymfus
https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf
======
ncmncm
As I understand it, the shortened mantissa matters less because these numbers
never appear alone. The total number of bits across all the numbers in the set
provides enough precision to make fine distinctions between states of the
network.

Even for regular calculations, it is unfortunate that the conventional fp16
split reserved only five bits for the exponent. A single extra bit there would
make the format much more useful, and the loss of one mantissa bit would be an
easy tradeoff.
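
To put rough numbers on that (a back-of-the-envelope sketch, not anything from
the paper): the largest finite value of an IEEE-style binary format is set
almost entirely by the exponent width, so one extra exponent bit buys orders of
magnitude of range for the cost of a single mantissa bit.

    // Sketch: largest finite value of an IEEE-style format with a sign bit,
    // exp_bits exponent bits (all-ones reserved for Inf/NaN) and man_bits
    // stored mantissa bits. Assumed layout, for illustration only.
    #include <cmath>
    #include <cstdio>

    double max_finite(int exp_bits, int man_bits) {
        int bias = (1 << (exp_bits - 1)) - 1;                   // 15 for fp16, 127 for bf16
        int max_exp = ((1 << exp_bits) - 2) - bias;              // largest non-Inf/NaN exponent
        double significand = 2.0 - std::ldexp(1.0, -man_bits);   // 1.11...1 in binary
        return significand * std::ldexp(1.0, max_exp);
    }

    int main() {
        std::printf("fp16 (5 exp, 10 man):      %g\n", max_finite(5, 10)); // ~65504
        std::printf("hypothetical 6 exp, 9 man: %g\n", max_finite(6, 9));  // ~4.3e9
        std::printf("bf16 (8 exp, 7 man):       %g\n", max_finite(8, 7));  // ~3.4e38
    }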

------
ndesaulniers
I'll just leave this here. :-X

Bfloat16 is much better for conversions from f32.

[https://github.com/tensorflow/tensorflow/blob/master/tensorf...](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/bfloat16.h)
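
For anyone who doesn't want to click through, a minimal sketch of the
truncation view of that conversion (the linked TensorFlow header additionally
handles rounding and NaN details; the function names here are illustrative):
bfloat16 is just the top 16 bits of an IEEE-754 float32.

    #include <cstdint>
    #include <cstring>

    // float32 -> bfloat16 by simple truncation: keep the sign, the 8 exponent
    // bits, and the top 7 mantissa bits; drop the low 16 mantissa bits.
    uint16_t float_to_bfloat16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        return static_cast<uint16_t>(bits >> 16);
    }

    // bfloat16 -> float32: pad the dropped mantissa bits with zeros.
    float bfloat16_to_float(uint16_t b) {
        uint32_t bits = static_cast<uint32_t>(b) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }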

~~~
stephencanon
It’s definitely conceptually simpler, but both conversions are a single fully-
pipelined operation on any CPU made in the past 5 years, and can be folded
into the arithmetic operation on custom HW. In practice the cost of conversion
isn’t really an issue; the win with bfloat16 is the added dynamic range.

------
noobermin
I'm a computational scientist. Do ML problems not deal with problems that are
sensitive to input precision? And if they don't, is it too naive to ask whether
one really needs ML for said problem over just plain old fitting and stats?

~~~
Tuna-Fish
These are not used for the data but for computing the internal coefficients.
Said coefficients would be stored as F32s, and the job of modifying them
involves computing a lot of multiplications, none of which need to be that
precise.

~~~
noobermin
Wouldn't conversion between the steps lead to round-off error?

~~~
Tuna-Fish
Conversion from Bfloat16 to f32 is just extending the mantissa with zeroes.

Basically, what they are computing is:

    f32 acc = C_a0 * C_b0 + C_a1 * C_b1 + C_a2 * C_b2 + ...

with very many coefficients, all of which are Bfloat16. The precision of the
coefficients is not that important, but they can be of substantially different
magnitude, so the coefficients can use few bits in the mantissa but the
accumulator needs to be wider.
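
A minimal sketch of that accumulation pattern (bf16_to_f32 and bf16_dot are
illustrative names, not any particular framework's API): the inputs stay in
bfloat16, each product is formed after widening to float32 by zero-extending
the mantissa, and the running sum is kept in float32.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Widen bfloat16 to float32 by padding the mantissa with zeros.
    static float bf16_to_f32(uint16_t b) {
        uint32_t bits = static_cast<uint32_t>(b) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }

    // acc = a[0]*b[0] + a[1]*b[1] + ... with bfloat16 inputs and a wider
    // float32 accumulator, as described above.
    float bf16_dot(const uint16_t* a, const uint16_t* b, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            acc += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
        return acc;
    }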

------
jokoon
A 7-bit mantissa doesn't sound like a lot; 2^-7 is 0.0078125... shouldn't there
be at least 9 bits for the mantissa?

~~~
masklinn
Note that 7 stored bits means an 8-bit mantissa, thanks to the implicit
leading 1.

And bfloat16 is new but not that new: TensorFlow had it 3 years back [0].

Apparently range is more useful than precision for machine learning, which
would be why they went with bfloat16's 8/8 split (significand/exponent bits)
instead of IEEE FP16's 11/5.

[0]
[https://github.com/tensorflow/tensorflow/blob/f41959ccb2d9d4...](https://github.com/tensorflow/tensorflow/blob/f41959ccb2d9d4c722fe8fc3351401d53bcf4900/tensorflow/core/framework/bfloat16.h)
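
To make the precision side of that tradeoff concrete (a small sketch, using the
truncation view of bfloat16, purely for illustration): the gap between 1.0 and
the next representable bfloat16 value is 2^-7, roughly 0.8%, whereas IEEE FP16
with its longer significand steps by 2^-10, roughly 0.1%; that coarseness is
the price of bfloat16's much wider exponent range.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        float one = 1.0f;
        uint32_t bits;
        std::memcpy(&bits, &one, sizeof(bits));
        bits += 1u << 16;              // bump the lowest mantissa bit bfloat16 keeps
        float next;
        std::memcpy(&next, &bits, sizeof(next));
        // Prints 1.0078125 and a step of 0.0078125 (= 2^-7); IEEE FP16's step
        // at 1.0 would be 2^-10 = 0.0009765625.
        std::printf("next bf16 after 1.0 = %.10f, step = %.10f\n", next, next - one);
    }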

~~~
espadrine
My understanding is that the optimization steps of neural networks (typically
gradient descent + backpropagation) act a bit like Lloyd's algorithm. Neuron
weights will push each other into place. Often, what matters is how they
compare to each other, such that the error is minimized; reaching the
theoretically-perfect weight is less important.

------
bleke
Sorry for not in topic, did Intel calculate bonuses on hn karma (more
officially impact)? I see this bf16 multiple times and it like authors dying
for Christmas bonus.

~~~
rbanffy
To me it looks like a clever optimization: same range as FP32, but half the
size and less precise, and it can be converted back and forth by truncating or
appending zeros.

Is anyone else using it?

~~~
staticfloat
Google uses it on their TPUs [0]. If you're interested in how it would affect
the numerical stability of an algorithm you want to use, there is a Julia
package that makes prototyping linear algebra over this datatype pretty
straightforward [1].

[0] [https://cloud.google.com/tpu/docs/system-architecture](https://cloud.google.com/tpu/docs/system-architecture)

[1]
[https://github.com/JuliaComputing/BFloat16s.jl](https://github.com/JuliaComputing/BFloat16s.jl)

~~~
scottlegrand2
And Facebook is taking this even further. While all these things are very
cool, do not let ASIC designers claim these formats are barriers to entry for
GPUs and CPUs. Whatever variants of this precision potpourri catch on are but
a generation away from incarnation in general processors, IMO...

[https://code.fb.com/ai-research/floating-point-math/](https://code.fb.com/ai-research/floating-point-math/)

