
Intel Prepares to Graft Google’s Bfloat16 onto Processors - rbanffy
https://www.nextplatform.com/2019/07/15/intel-prepares-to-graft-googles-bfloat16-onto-processors/
======
kolbusa
ISA:
[https://software.intel.com/sites/default/files/managed/c5/15...](https://software.intel.com/sites/default/files/managed/c5/15/architecture-
instruction-set-extensions-programming-reference.pdf) Look for anything marked
with the AVX512_BF16 CPUID feature flag.

Numerical details:
[https://software.intel.com/sites/default/files/managed/40/8b...](https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-
numerics-definition-white-paper.pdf)

Support for bfloat16 is already present in MKL-DNN
([https://github.com/intel/mkl-dnn](https://github.com/intel/mkl-dnn))
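
For readers who just want the gist of the numerics: bfloat16 is simply the top
16 bits of an IEEE fp32, so a software conversion is little more than a
truncation with round-to-nearest-even. A rough C sketch of the idea (my own
illustration, not MKL-DNN code; a real converter also has to special-case NaN):

    #include <stdint.h>
    #include <string.h>

    /* fp32 -> bfloat16: keep the high 16 bits, rounding ties to even.
       Illustrative only; the rounding add below can corrupt a NaN payload,
       so production code handles NaN separately. */
    uint16_t f32_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
        return (uint16_t)((bits + rounding_bias) >> 16);
    }

    /* bfloat16 -> fp32: the value is just the high half of an fp32. */
    float bf16_to_f32(uint16_t b) {
        uint32_t bits = (uint32_t)b << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }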

Disclaimer: I work for Intel

~~~
burfog
Please take my bug report:

Dropping denormals is a huge mistake. This is easy to see if you draw out a
number line for a very tiny floating-point format, for example with a 2-bit
exponent and a 2-bit fraction. (do this on a sheet of graph paper) Without
denormals, there is a huge gap surrounding zero.

Strangely, the infinities were kept. Treating these as NaN is far less harmful
than dropping denormals. Treating -0.0 as 0.0 and never producing -0.0 would
be less harmful. (the PDF didn't say what happens) Even treating NaN values as
normal numbers is probably less harmful than screwing up the denormals.

IEEE floating point has lots of crazy stuff to annoy hardware vendors. Most of
it isn't all that important, but denormals matter.
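
If you don't have graph paper handy, here is a throwaway C sketch of that toy
format (2 exponent bits with a bias of 1, 2 fraction bits; my own parameters)
that prints the representable non-negative values. Drop the e == 0 branch and
the number line jumps from 0 straight to 1.0, four times the spacing between
adjacent small normals:

    #include <stdio.h>

    /* Toy IEEE-style format: 2 exponent bits (bias 1), 2 fraction bits.
       Illustrates the graph-paper exercise above; not any real ISA format. */
    int main(void) {
        for (int e = 0; e <= 2; e++) {          /* e == 3 reserved for inf/NaN */
            for (int f = 0; f < 4; f++) {
                double v;
                if (e == 0)
                    v = f / 4.0;                            /* zero and denormals */
                else
                    v = (1.0 + f / 4.0) * (1 << (e - 1));   /* normals: implicit leading 1 */
                printf("%4.2f%s\n", v, (e == 0 && f != 0) ? "  (denormal)" : "");
            }
        }
        return 0;
    }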

~~~
Veedrac
bfloat16 is only really intended for ML purposes, and denormals don't really
matter there, especially given that they accurately multiply-accumulate into
32-bit floats.

~~~
NotSammyHagar
Would it hurt ML usage if they added them? Would it make the implementation
much harder? If not, it could really increase adoption.

~~~
Veedrac
I'm not an expert in the nitty gritty, and I've heard conflicting information,
but my rough understanding is that it would be affordable and not too
difficult for Xeon processors, but relatively more expensive for the dedicated
Nervana neural network processors.

~~~
burfog
Sure, I can see that being the case. My point is that the Xeon should support
denormals.

The Bfloat16 format does allow for denormals. Intel's implementation mangles
them, changing them to 0.0 on both input and output.

~~~
Veedrac
But then you need either new denormal-accepting instructions or, worse, a new
global state bit enabling bfloat16 denormals, all to support use cases
probably over two orders of magnitude less common than ML. What's the
compelling reason to bother? Note that you need to support the denormal-
disabled case because you'll want compatibility with Nervana.

~~~
burfog
Intel did add a global state bit. It just isn't useful because you can't
modify it.

Nervana is discontinued, isn't it? Compatibility doesn't matter. It's pretty
compatible anyway, as long as you aren't demanding bit-identical output.

~~~
Veedrac
> Intel did add a global state bit. It just isn't useful because you can't
> modify it.

You mean the CPUID bit? That's free. Toggling denormals isn't.

> Nervana is discontinued, isn't it? Compatibility doesn't matter. It's pretty
> compatible anyway, as long as you aren't demanding bit-identical output.

Nervana isn't discontinued according to their website[1], and bitwise
compatibility does matter, certainly more than denormals do.

[1] [https://www.intel.ai/ai-at-ces/](https://www.intel.ai/ai-at-ces/)

~~~
burfog
I mean the bit to toggle denormals, not the one to identify support for the
opcodes.

Denormals are far more important than bitwise compatibility. To be clear, you
would still be able to load a Nervana-produced number into a processor that
supports denormals, and the other way would work too. You'd just avoid
mangling numbers that are near zero.

If you still think denormals don't matter, seriously do what I suggested: draw
it out on graph paper. They matter.

~~~
Veedrac
> I mean the bit to toggle denormals, not the one to identify support for the
> opcodes.

bfloat16 doesn't handle denormals, so why is there a bit to toggle it? What's
it called (so I can Ctrl-F for it)?

> If you still think denormals don't matter, seriously do what I suggested:
> draw it out on graph paper. They matter.

No, I get how denormals work; I know what you're pointing at. But ML genuinely
doesn't care, and neural nets don't give a damn about mathematical purity[1].
In contrast, compatibility matters because ML doesn't give you any guarantee
that it's not depending on the behaviour at small values, and minor differences
in rounding do cause issues. For example, Leela Chess Zero had difficulties
with reproducibility because different GPUs round floats differently.

[1] Fun but relevant aside: [https://openai.com/blog/nonlinear-computation-in-
linear-netw...](https://openai.com/blog/nonlinear-computation-in-linear-
networks/)

~~~
burfog
It's two bits, DAZ and FTZ. (seems like "denormals are zero" and "flush to
zero")

Bfloat16 obviously can handle denormals. The encoding is possible. There would
be no need to handle the issue if the encoding did not exist.

As hex, these would be denormal: 0x0001 to 0x007F, and 0x8001 to 0x807F. It's
the same layout as plain old 32-bit IEEE with the low 16 bits lopped off.
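
A quick way to sanity-check those ranges (a hypothetical helper I wrote for
illustration, not anything from the ISA docs):

    #include <stdint.h>
    #include <stdio.h>

    /* Classify a raw bfloat16 bit pattern: 1 sign, 8 exponent, 7 mantissa bits. */
    const char *bf16_class(uint16_t bits) {
        uint16_t exp  = (bits >> 7) & 0xFF;
        uint16_t mant = bits & 0x7F;
        if (exp == 0)    return mant ? "denormal" : "zero";
        if (exp == 0xFF) return mant ? "NaN" : "infinity";
        return "normal";
    }

    int main(void) {
        printf("0x0001: %s\n", bf16_class(0x0001));  /* smallest positive denormal */
        printf("0x007F: %s\n", bf16_class(0x007F));  /* largest denormal */
        printf("0x0080: %s\n", bf16_class(0x0080));  /* smallest positive normal */
        return 0;
    }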

ML is all about difficulties with reproducibility. I don't see a reason to get
upset about denormals when a 3D-printed turtle can be confused with a rifle.

~~~
Veedrac
> It's two bits, DAZ and FTZ. (seems like "denormals are zero" and "flush to
> zero")

You mean the standard ones for normal floats? You certainly wouldn't want to
reuse those bits for bfloat16s.

> ML is all about difficulties with reproducibility. I don't see a reason to
> get upset about denormals when a 3D-printed turtle can be confused with a
> rifle.

These are different things, despite the similarity in terminology.

------
choppaface
I met Naveen Rao after Intel bought Nervana. He seemed pretty adamant about
getting stuff shipped fast. In contrast, the Xeon folks own all the politics
and seem to want the transition to be very gradual. Plus, the Phi folks are
getting phased out. They had done a Nervana trial at Facebook but then flaked
on other trials. Clearly Intel is desperately trying to manage its books.

Having Nervana and friends on a Xeon chip could be a huge positive change for
software. Not only could we toss out the issue of GPU memory transfer, but
Nvidia GPUs aren’t so great with concurrency, and here with the linux kernel
we might have a chance to beat Nvidia. Naveen sure would like that... Nervana
once had a Maxwell compiler that was better than Nvidia’s.

~~~
shaklee3
They had an assembler where one person wrote kernels that were faster than
cublas in a lot of cases. Afaik, nobody ever released anything else with that
assembler, and Nvidia caught up to that performance quickly. In talking with
the cublas devs, it seemed more that maxas kernels were highly tuned for
specific sizes, whereas cublas/cudnn had to be more general.

Nowadays it's really a moot point with Nvidia's Cutlass being open source.

~~~
choppaface
True story-- the Nervana Maxwell stuff didn't go very far-- but it was
noteworthy because they had both that small win as well as their own hardware
platform.

One other thought about the Nervana-Xeon convergence is that the support for
more memory (thru DDR, Optane, or even just mmap'ed NVME) will be a big win
for modeling and large-minibatch SGD. For example, the minibatch fetching
could be pushed to the hardware / OS instead of Tensorflow (or the crazy guy
behind Tensorpack) using a threadpool. A lot of training is still I/O bound at
some level, and processors only support so many PCI-e lanes...

~~~
jlebar
A V100 GPU gets 900GB/s of memory bandwidth. I am less of a CPU expert but
afaict you'll be lucky to get much more than 10% of that out of a CPU. This is
going to make a huge difference that Intel can't make up with bigger execution
units.

bfloat helps with this because the data is half as large. But of course if
you're doing Nvidia you're probably already doing (IEEE) fp16.

~~~
tempguy9999
That sounded a bit low. This

[https://en.wikichip.org/wiki/intel/microarchitectures/cooper...](https://en.wikichip.org/wiki/intel/microarchitectures/cooper_lake)

says "Higher bandwidth (174.84 GiB/s, up from 119.209 GiB/s)"

I don't know if memory bandwidth matters for this type of job, though.

~~~
drewg123
Is that the bandwidth published on Intel's ARK spec sheet, or actual usable
bandwidth? There is a difference.

I've found I get about 75-80% of the advertised bandwidth both from my real
app (TLS crypto) and a toy memory copy benchmark using AVX256 instructions.
The toy memory copy benchmark is how I realized that my bottleneck was
actually memory bandwidth and not CPU horsepower on Broadwell-based servers.

~~~
tempguy9999
I haven't a clue. I just got it off the link quoted. It's a good question and
when there's a difference, you know what marketing will say.

To take a stab at it, I suppose it might depend on whether all requests are
coming from a single memory bank or are spread evenly across all memory banks,
assuming fully populated channels (again from the link: "Octa-channel (up from
hexa-channel)")

------
eatbitseveryday
> At this point, Intel doesn’t have bfloat16 implemented in any of its
> processors, so they used current AVX512 vector hardware present in its
> existing processor to emulate the format and the requisite operations.
> According to the researchers, this resulted in “only a very slight
> performance tax.”

Why implement bfloat if you get just slightly less performance emulating it
with AVX512, which already exists? Maybe it’s an “us too” claim?

~~~
jchw
They did not specify what the tax was relative to. Maybe they meant relative
to 32-bit float?

AVX512 is expensive. I believe if you have an AVX512-heavy workload it can
cause the processor to throttle.

~~~
smitty1110
Yes, there’s a BIOS setting to control this. It basically underclocks the
core while the AVX units are under load.

~~~
celrod
Taking my 7980xe as an example: When it runs non-avx512 loads, I currently
have it set to run at 4.1 GHz (all-core). When running avx-512 heavy loads, it
instead runs at 3.6 GHz -- and tends to get much hotter (70-80C instead of
50-60C). 3.6 GHz is a mild overclock; Silicon Lottery reports 100% can achieve
that speed for avx512 loads.[1]

Running programs doing the same thing (eg, Hamiltonian Monte Carlo where the
likelihood function has or has not been vectorized), the avx512 version is far
faster than scalar, and routinely 50%+ faster than avx2.

The avx512 instruction set itself also provides conveniences that make it
easier to explicitly vectorize, even if most compilers don't take advantage of
them on their own. Masked load and store operations in particular make it
much easier to handle branches and loop remainders.
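
To make that concrete, a masked loop tail with AVX-512 intrinsics looks roughly
like this (a minimal sketch of my own, assuming AVX-512F, summing doubles):

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum n doubles; the remainder is handled with a lane mask instead of a
       scalar cleanup loop. */
    double sum_avx512(const double *x, size_t n) {
        __m512d acc = _mm512_setzero_pd();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm512_add_pd(acc, _mm512_loadu_pd(x + i));
        if (i < n) {
            __mmask8 m = (__mmask8)((1u << (n - i)) - 1);               /* enable only remaining lanes */
            acc = _mm512_add_pd(acc, _mm512_maskz_loadu_pd(m, x + i));  /* masked lanes read as 0.0 */
        }
        return _mm512_reduce_add_pd(acc);
    }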

On why avx512 vs a graphics card: I need double precision, and my code
routinely has maximum widths smaller than the 32 or 64 lanes a graphics card
would want to compute in parallel.

[1]
[https://siliconlottery.com/pages/statistics](https://siliconlottery.com/pages/statistics)

~~~
m0zg
Yeah, people tend to completely exaggerate the impact of throttling from
AVX512. It's only an issue when you do short bursts of AVX512 and the rest is
not AVX512. If you do math and your math can be done in AVX512, even with
throttling it's going to be substantially faster. That it runs hotter doesn't
concern me at all. Intel's claimed safe Tjunction is something like 105C. EEs
tend to take the published component specifications seriously (e.g. your 1000 V
diode is guaranteed to withstand at least 1 kV of reverse voltage), so I trust
Intel when they say things are fine up to that temperature. Even beyond that
it won't burn out, it'll just thermal throttle.

~~~
magicalhippo
Maximum Tjunction for an STM32F303 (I just happened to have the datasheet
open) is 150C, as it is for most other ICs I've seen.

So is 105C just a very conservative number, compensating for the probe
location, or are there process specific things which brings it down to 105C?

~~~
userbinator
From what I understand, the newer very-high-density processes are far more
sensitive to voltage and temperature than the older, larger ones.

~~~
magicalhippo
Makes sense. The STM32G series, which still has 150C Tjmax, is ST's first 90nm
MCU[1] so yeah.

[1]: [https://blog.st.com/stm32g0-mainstream-90-nm-
mcu/](https://blog.st.com/stm32g0-mainstream-90-nm-mcu/)

------
sandGorgon
Does anyone know what the story is on the software side? CUDA is basically
the industry standard now.

The TensorFlow OpenCL support bug [1] has been open for FOUR years now (with the
discussion devolving into an Intel PlaidML flame war).

AMD OpenCL is now ROCm?

At the end of the day, I can't run ANY accelerated workloads using Intel
graphics or AMD... because there's simply no software support anywhere.

OTOH, if you have an Nvidia stack... boom, you get accelerated Python:
[https://developer.nvidia.com/how-to-cuda-
python](https://developer.nvidia.com/how-to-cuda-python)

Are you running containerized workloads on Kubernetes? It has BAKED-IN
support for nvidia-docker ([https://kubernetes.io/docs/tasks/manage-
gpus/scheduling-gpus...](https://kubernetes.io/docs/tasks/manage-
gpus/scheduling-gpus/))

Is anything going to change ?

[1]
[https://github.com/tensorflow/tensorflow/issues/22](https://github.com/tensorflow/tensorflow/issues/22)

~~~
CydeWeys
Huh, that's strange. I remember that in the infancy of cryptocurrency mining
(back before specialized ASIC hardware), OpenCL was far superior to CUDA, and
AMD cards were doing 10X the hashrate per dollar of Nvidia's CUDA cards. What
changed in the interim? Is SHA256^2 just a completely different workload than
TensorFlow, or has Nvidia pulled ahead?

~~~
fwip
It's a different workload. Afaik, AMD cards are still better hash-per-dollar
for the coins that are gpu-mineable.

------
mochomocha
This is super exciting! Brings a bit of competition to NVIDIA for ML-related
tasks, while being more "open" (to some extent) than the TPU ASICs (because
you won't have single-cloud lock-in). In any case, good to see Intel finally
waking up.

~~~
nabla9
Is Intel planning TPUs or GPUs?

I don't see how a CPU with AVX512 can compete with TPUs or GPUs, BF16 or not.

------
rjeli
I wonder why they chose it over Facebook’s 8-bit posit:
[https://code.fb.com/ai-research/floating-point-math/](https://code.fb.com/ai-
research/floating-point-math/)

~~~
kristianp
What does the "int8/32" mean in that paper?

~~~
msclrhd
4x8-bit values in a 32-bit value? SIMD does this to perform an operation on 4
8-bit values in a single 32-bit register. There are other configurations,
IIRC.

[https://en.wikipedia.org/wiki/SIMD#Software](https://en.wikipedia.org/wiki/SIMD#Software)

------
fortran77
They're not really "grafting" it, they're implementing it as a first-class
data type.

~~~
_Nat_
Definitely feels click-bait-y.

Ditto for them calling it "Brain Floating Point" [1]. I mean it appears to be
a typical floating-point numeric data type with some of the precision
truncated to reduce its cost.

I guess they might be trying to make it sound super-fancy to preempt people
seeing it as cheap and/or low-tech?

[1]: [https://en.wikipedia.org/wiki/Bfloat16_floating-
point_format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)

~~~
exsf0859
It's called brain floating point because it was developed by Google's "Google
Brain" machine learning project.

~~~
_Nat_
Yeah, I get the marketing perspective; just, we typically describe primitive
data types in terms of what they are from a technical perspective rather than
a marketing perspective.

~~~
vlovich123
fp16 & fp8 refer to IEEE 754 floating point which has notable differences from
bfloat16 & make them have worse performance for machine learning.

~~~
Traster
Yeah, the point is that 'reduced-precision FP16' would make a lot more sense,
since it actually describes what it does. 'Brain' floating point makes you
wonder how Neuralink is planning to use this datatype.

------
m0zg
Why "graft"? It's not something that's foreign to them. This promises to
essentially double the performance of Intel chips on an increasingly important
workload, and also simplify the modeling work, because the models don't
experience any accuracy drop when simply converted to bfloat16, unlike with
quantization, where it's model dependent and finicky AF. I'd much rather do
fp16 or bfloat16 at inference time, without constant pain that is
quantization. I hope ARM, AMD and RISCV pay attention and implement this in
the exact same way, so that models could be portable.

~~~
exsf0859
I read this article as saying, "hey we can emulate bfloat16 pretty well in
software on top of our existing hardware features". That's what "graft" and
"minimal impact" mean to me.

Intel (for better or worse) takes a very experiment-results-driven approach to
choosing which features to implement in hardware. So this result -- that
software emulation of a feature works almost as well as a hardware
implementation would -- probably makes Intel less likely to implement the
feature in hardware.

ARM, AMD, RISC-V, etc. will probably come to similar conclusions.

~~~
kristianp
The article links to another saying that bfloat16 is coming to Xeons. The
title of the current article says it too. There's nothing implying it's less
likely if you read the articles.

[https://www.nextplatform.com/2018/12/16/intel-unfolds-
roadma...](https://www.nextplatform.com/2018/12/16/intel-unfolds-roadmaps-for-
future-cpus-and-gpus/)

~~~
exsf0859
Ouch, you're right! I misunderstood.

------
tempguy9999
Newbie question: what are the typical and extreme values (excluding
+/-infinity; are those used too?) that can occur in training/running NNs?
Also, what level of accuracy is needed?

It may well be a stupid question, but I really don't know, and I always
assumed the values would be in [-1..+1] and that fixed point would suffice.
Clearly not.

~~~
drewm1980
Nope, no standardization at that level.

------
beckerdo
I feel sorry for Intel. Perhaps John Gustafson’s 16-bit posits or unums would
have been a better choice.

~~~
H8crilA
Why? Google has certainly researched their floats before committing an entire
line of silicon chips. It's easy to just enumerate all possible float16
configurations in a simulator to see which one performs best on a wide range
of neural network applications. Then pick the best one. Big data-driven
organizations do this all the time (brute-force through an entire space of
solutions and pick the best results).

~~~
mochomocha
I know nothing about ASIC or CPU simulators, but I suspect that it's not as
easy as you make it sound: for machine-learning-related tasks, performance
doesn't only come from raw compute numbers; you'll also want to model the
actual data-movement costs across the cache hierarchy and registers. A lot of
the time, training is not necessarily compute-bound: the relative cost of data
transfer (vs. compute) can be quite high, or even dominate.

~~~
duskwuff
fp16 and bfloat16 are the same size, so there's no difference in data transfer
rates.

------
krick
That's surprising; I wouldn't have guessed this is something they would do so
easily, especially considering that Intel processors aren't what you generally
use to train NNs.

But I'd be rather glad if they implemented unums already.

~~~
drewm1980
I absolutely train on Intel from time to time at work. If your model fits, the
CPU is a dream. The drivers never break, never crash, and never lock up your
display...

------
kstenerud
It looks somewhat similar to the minimal size of compact float [1] (a format I
developed for data communication), but with 2 more bits because I use them for
sizing.

[1] [https://github.com/kstenerud/compact-
float/blob/master/compa...](https://github.com/kstenerud/compact-
float/blob/master/compact-float-specification.md)

------
jacobolus
7 bits of precision.... or 2.1 decimal digits.

So not quite as good as a 6-inch slide rule.

Edit: voters in this thread seem pretty uptight.

~~~
asdfasgasdgasdg
What would be the wattage of a typical six-inch slide rule performing calculations
as quickly as a top-of-the-line Intel microchip? Or if it would be physically
impossible because of speed of light considerations, what would be the wattage
of n slide rules in parallel performing such computations such that it adds up
to the throughput of a microchip?

Speaking of boiling the ocean . . .

------
im3w1l
Does bfloat16 have any other uses than deep learning?

~~~
londons_explore
It isn't really very good for machine learning - machine learning doesn't need
8 bits of exponent.

For weights during training, 7 bits of mantissa also seems a bit low - it's
common for weights to adjust much less than 1% during a single batch of
training, which this couldn't represent.

I think this is more a "we want something which is faster but is compatible
with existing code written for fp32's".

~~~
jlebar
Bfloat is specifically designed for ML. It is the native type in Google's
TPUs. It is quite good at ML; most models that work with fp32 work with bfloat
with no adjustments; that's in contrast to IEEE fp16.

You're right that the mantissa is small. The trick is that you always
accumulate into fp32 and then truncate down to 16 bits at the end. You'd do
this for any 16-bit floating type.
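
A toy scalar sketch of that accumulate-wide-then-narrow pattern (my own
illustration, not TPU or Xeon code; real hardware rounds to nearest even where
this just truncates):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static float bf16_to_f32(uint16_t b) {
        uint32_t x = (uint32_t)b << 16;     /* bfloat16 is the top half of an fp32 */
        float f;
        memcpy(&f, &x, sizeof f);
        return f;
    }

    static uint16_t f32_to_bf16(float f) {
        uint32_t x;
        memcpy(&x, &f, sizeof x);
        return (uint16_t)(x >> 16);         /* plain truncation for brevity */
    }

    /* Dot product of bfloat16 vectors: multiply the widened inputs,
       accumulate in fp32, and narrow only once at the very end. */
    uint16_t bf16_dot(const uint16_t *a, const uint16_t *b, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
        return f32_to_bf16(acc);
    }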

Source: I work on this at Google.

[https://cloud.google.com/tpu/docs/bfloat16](https://cloud.google.com/tpu/docs/bfloat16)

~~~
edwintorok
Have you evaluated alternatives such as posits, Kulisch accumulation, and zfp?
[https://arxiv.org/pdf/1805.08624.pdf](https://arxiv.org/pdf/1805.08624.pdf)
[https://arxiv.org/pdf/1811.01721.pdf](https://arxiv.org/pdf/1811.01721.pdf)
[https://insidehpc.com/2018/05/universal-coding-reals-
alterna...](https://insidehpc.com/2018/05/universal-coding-reals-alternatives-
ieee-floating-point/)

In particular the latter describes a generic framework that can be used to
generate a lot of different number systems. Could hardware implement this,
allowing us to compose and choose the number system by just setting some
simple flags?

------
blipblap
Is there a principled way to not overfit? Or is the best we can do not
overtraining or reducing the precision of the calculations?

~~~
nabla9
regularization, dropout, early stopping etc.

------
liquidify
Could we get a Bfloat32?

~~~
ndesaulniers
You already have that; it's more formally called IEEE 754 single precision.
It requires double the memory bandwidth (bad) for extra precision that
backpropagation would have corrected for anyway.

~~~
ScottBurson
Seems like BFloat32 would have the same exponent range as Float64 (double
precision), and a correspondingly shorter significand.

------
shubham3435
Ohh intel is awesome

------
novaRom
I desperately want Chinese companies to finally begin designing and producing
general-purpose CPUs, GPUs, and other types of accelerators. The current
situation is terrible: duopolies and monopolies, a slow pace of innovation,
and low reliability. We need more players.

~~~
layoutIfNeeded
Finally Bloomberg will be able to report on real hardware backdoors!

------
Lowkeyloki
Something about Google being able to influence features in consumer grade CPUs
rubs me the wrong way.

~~~
m-p-3
Not surprising though; new features show up where there's money to be made.

------
simonebrunozzi
English is not my first language. I had never heard the term "Graft", even
though I consider myself quite literate in English. So here you go, for
everybody else in my situation:

Graft, as understood in American English, is a form of political corruption,
being the unscrupulous use of a politician's authority for personal gain.

Edit: by the way, I really couldn't square that definition with the article,
and realized I was probably looking at the wrong one. This one might be much
more apt:

a shoot or twig inserted into a slit on the trunk or stem of a living plant,
from which it receives sap.

~~~
MrRadar
That definition definitely does not apply to this usage. You are looking for
"to join (one thing) to another as if by grafting, so as to bring about a
close union." (etymology 1, verb, definition 4 on Wiktionary [1]).

[1]
[https://en.m.wiktionary.org/wiki/graft](https://en.m.wiktionary.org/wiki/graft)

