Bfloat16 support coming to Apple's Metal and PyTorch [video] (developer.apple.com)
93 points by dlewis1788 on July 3, 2023 | 53 comments



It's a shame that large language models are mostly moving to 4 bit weights for inference, and a bunch of papers have shown promising techniques for training in 4 bit too...

Remember that switching from 16 bit to 4 bit lets you have 4x as many weights, 4x as many weights loaded from RAM per second, and ~1/16 of the silicon area for the calculations (a multiplier scales with approximately the number of bits squared). That smaller silicon area will let you do more per $ too...
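
A quick back-of-the-envelope in Python, for concreteness (the 16 GiB figure is just an arbitrary example budget):

    mem_bytes = 16 * 2**30                 # say, a 16 GiB weight budget
    print(mem_bytes // 2)                  # ~8.6e9 weights at 16 bits (2 bytes each)
    print(mem_bytes * 2)                   # ~34e9 weights at 4 bits (half a byte each) -- 4x as many
    print(16**2 / 4**2)                    # 16.0 -- rough multiplier-area ratio if area ~ bits^2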


There is some overhead from the quantization, and right now the operations themselves are sometimes done at higher precision than the weights in RAM.

And widespread hardware 4 bit will take some time. If the HW makers started designing 4 bit silicon in 2022, then we are still years away.


What?! Can you also train with quantization? Incredible! I'd have thought the gradients were way too ugly for any convergence with 4 bits.

Any particularly good papers you can recommend on the topic?


Here's a recent paper on training transformers with 4 bit integer weights.

https://arxiv.org/abs/2306.11987


A group at IBM has been working on minifloat training for a while. Here's a paper from 2020 on FP4 training: https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...


Their best performing 4-bit number format uses 1 sign bit, 3 exponent bits, and no mantissa bits!

I.e. all weights, activations, and gradients become powers of two, which means all multiplications become simple bit shifts. That really changes the math and the silicon design.
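
For concreteness, a toy decoder for such a 1-sign / 3-exponent / 0-mantissa format (the exact bias and any reserved codes in the paper may differ; this is just to show the shape of it):

    def decode_fp4_e3m0(nibble, bias=3):
        # 1 sign bit, 3 exponent bits, no mantissa -- every value is +/- a power of two.
        sign = -1.0 if (nibble >> 3) & 1 else 1.0
        return sign * 2.0 ** ((nibble & 0b111) - bias)

    # Multiplying by any of these weights is an exponent add, i.e. a bit shift.
    print(sorted({abs(decode_fp4_e3m0(n)) for n in range(16)}))
    # [0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]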


Does it really make much of a difference?

You're usually feeding a ton of multiplies into an accumulator. You can handle one or two mantissa bits with the same bit shifting, except that it outputs two or three numbers to accumulate. And accumulators are very easy to scale.

Also in the extreme I've seen powers of 4 get used.


At just 4 bits, there are only 16 possible numbers. It becomes lookup table territory - and there is no need to have the numbers on your number line be linearly or exponentially spaced - you can assign them arbitrarily. For example, you could have a number system consisting of: (+-) 0.5, 1, 2, 3, 5, 10, 1000, 1000000 - getting some nice accuracy in the middle of the number line where you expect most values to lie, plus some extreme values so convergence doesn't take forever if some big activation/gradient needs to be propagated.
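
A tiny NumPy sketch of that idea (the table entries below are the made-up values from the example above):

    import numpy as np

    # 16-entry codebook: 8 made-up positive values plus their negatives.
    LUT = np.array([0.5, 1, 2, 3, 5, 10, 1000, 1000000], dtype=np.float32)
    LUT = np.concatenate([LUT, -LUT])

    def quantize(values):
        # Store each weight as the 4-bit index of its nearest codebook entry.
        return np.abs(values[:, None] - LUT[None, :]).argmin(axis=1).astype(np.uint8)

    def dequantize(codes):
        return LUT[codes]

    x = np.array([0.7, -4.2, 12000.0], dtype=np.float32)
    print(dequantize(quantize(x)))   # [0.5, -5.0, 1000.0]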


The more recent 4 bit quantizations are almost along these lines. Q4_1 in ggml, for example, takes a block of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight from a quantized value 'q' is q * d + m. Taking a relatively small block size makes it more likely that the weights are all within a reasonable quantization range. Notably, d and m can be stored with more accuracy without sacrificing too much space, since the overhead is divided by 32. Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to those, so the final weight is (q * d + m) * d_s + m_s, and the additional factors are stored as 6 bits instead of 4.

In practice this seems to get very good results while being cheap to implement and relatively space efficient; Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
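
A simplified NumPy version of the Q4_1 idea (per-block scale 'd' and minimum 'm' over blocks of 32; the real ggml code packs two 4-bit values per byte and stores d and m more compactly, which is skipped here):

    import numpy as np

    BLOCK = 32

    def q4_1_quantize(w):
        w = w.reshape(-1, BLOCK).astype(np.float32)
        m = w.min(axis=1, keepdims=True)                # per-block minimum = the quantized '0'
        d = (w.max(axis=1, keepdims=True) - m) / 15.0   # per-block scale across the 16 levels
        d = np.where(d == 0, 1.0, d)                    # avoid dividing by zero on constant blocks
        q = np.clip(np.round((w - m) / d), 0, 15).astype(np.uint8)
        return q, d, m

    def q4_1_dequantize(q, d, m):
        return q * d + m                                # reconstruction, exactly as described above

    w = np.random.randn(4 * BLOCK).astype(np.float32)
    q, d, m = q4_1_quantize(w)
    print(np.abs(q4_1_dequantize(q, d, m) - w.reshape(-1, BLOCK)).max())  # small per-weight error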


Very efficient for storage and memory bandwidth, but such a scheme is a headache for high throughput hardware implementations (at least compared to regular 4 bit math, which can be packed really really densely)


Also I would highly recommend Q5_K_M for both 7B and 13B models.

It has the best balance between quality and weight of the model, and is almost indistinguishable from the original f16: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...


This is an excellent explanation, thank you!!


I dimly remember reading that the mathematical compute-per-density optimum is around 3.x bits in a "brain-like structure", though I don't remember any details or the precise context. Does this ring a bell with anyone?


Is it possible we will eventually see 1-bit weights in use?


There are already papers on it, and there is 2-bit quant in llama.cpp.

But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... For now.

There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.


For some time I played with Brevitas and Xilinx's FINN; you could quantize like crazy. I haven't looked at where they are since transformers took over the AI world.


Confirmed: Apple M1 lacks bfloat16 support completely. M1: hw.optional.arm.FEAT_BF16: 0 vs. M2: hw.optional.arm.FEAT_BF16: 1
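
Easy to check on your own machine, e.g. by shelling out to sysctl:

    import subprocess

    def has_hw_bf16():
        # Ask the macOS kernel for the ARM BF16 feature flag (0 on M1, 1 on M2).
        out = subprocess.run(["sysctl", "-n", "hw.optional.arm.FEAT_BF16"],
                             capture_output=True, text=True)
        return out.stdout.strip() == "1"

    print("hardware bfloat16:", has_hw_bf16())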


Luckily BF16 is just a truncated FP32. That means the hardware can do BF16; you just don't get any performance benefit compared to FP32 (and depending on the hardware design, you might also have to space the data 4 bytes apart rather than 2), so you lose the memory bandwidth and RAM usage benefits too.


At that point it’d be better to do everything in fp32. The hardware can’t do bf16 in the way you’re saying; the conversions would consume all your time.


Compute in F32, but then round and pack a pair of BF16 into 4 bytes.


The conversions are just a mask and shift? Super cheap


You still get a perf benefit from half the memory traffic and keeping twice as much data in caches, since you can do the expansion to f32 when loading into registers.


Conversions from IEEE-32 to BF16 don't round?


I don't believe the standard defines it. I believe implementations truncate (ie. round towards zero).

Remember BF16 was invented specifically to be backwards compatible with existing silicon - and pulling 2 bytes out of 4 is a far cheaper operation than any rounding.


Just to elaborate, as I was confused about this and had to look it up: BF16 is indeed designed to just be a truncated F32: you can grab the top 16 bits of a F32 value and it'll still "make sense": the sign bit is in the same place in both (unsurprisingly), and the exponent part of BF16 and F32 are both 8 bits. In the case of the mantissa, you end up grabbing the top 7 bits of the F32's 23-bit mantissa, so it all works out, as this will "round" the value toward zero.
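
Here's roughly what that looks like as bit twiddling in NumPy (truncation only; a production kernel might round to nearest-even instead):

    import numpy as np

    def f32_to_bf16_bits(x):
        # Keep the top 16 bits of the float32: sign (1) + exponent (8) + top 7 mantissa bits.
        return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

    def bf16_bits_to_f32(b):
        # Expanding back is just shifting into the high half of a float32.
        return (b.astype(np.uint32) << 16).view(np.float32)

    x = np.array([3.14159, -0.001, 1e30], dtype=np.float32)
    print(bf16_bits_to_f32(f32_to_bf16_bits(x)))   # same values to ~2-3 significant digits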


There's no standardized definition of BF16.


Somehow missed this from WWDC23, but it looks like Sonoma will add support for bfloat16 with Metal, and there's an active PR to add support with the PyTorch MPS back-end (PR #99272). Since M2 added bfloat16 support at the hardware level, I'm assuming this will only be supported on M2 Macs.

That maxed out Mac Studio M2 w/ 192GB of memory now looks more appealing...
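
If/when that PR lands, I'd expect usage to look something like this (assuming macOS 14 and bf16-capable hardware; the exact op coverage is TBD):

    import torch

    if torch.backends.mps.is_available():
        a = torch.randn(1024, 1024, device="mps", dtype=torch.bfloat16)
        b = torch.randn(1024, 1024, device="mps", dtype=torch.bfloat16)
        c = a @ b                      # matmul staying in bf16 on the GPU
        print(c.dtype, c.device)       # torch.bfloat16 mps:0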


Visible in the unofficial documentation for AMX instructions too - M2 only bf16 functionality - https://github.com/corsix/amx/blob/main/matfp.md

This matfp instruction computes an outer product and is a kernel for matrix multiplication.


I didn't even know about Apple's AMX instructions until I clicked on your link. Very interesting - thanks!


bf16 in Metal on macOS 14 is supported on all Macs. Emulated in software transparently.


Yeah, Metal is pretty great because it runs the same on all Macs. Apple is really really good at this.


I think the trillion dollar question is: can Apple ever make Macs / GPUs that compete with NVIDIA?


Maybe someone can help me understand why people are investing into this.

Inhousing typically means falling behind in technology but having lower operating costs. That makes the company win, not the users.

If you hinge your career on Apple, they might make your technology obsolete on a dime.

It's not the fastest, it's not the best, it's not the cheapest, and it's not some combination either.

> 'compute per watt'

With AI? The local LLM models are near useless already. There will be a time to cut down on power, but from what I've read, there is currently ~no value even with a 4090 with 512 RAM.

I suggest avoiding Windows/M$, I am annoyed with Linux bugs, and Google cannot be trusted. But all of that could be said about Apple as well.

I just don't see a future with Apple hardware; it gives me some serious Nintendo vibes, where they are going to be some quirky niche that is just enough for marketers to sell it. Compute per watt seems like a Wiimote that no one asked for but is suddenly claimed to be ultra important.

Maybe someone can change my view. I don't see who buys this when they are educated on the possible options.


> Maybe someone can help me understand why people are investing into this.

Buying a Mac for running LLMs is kinda like buying a Mac for gaming. It's theoretically interesting, but I don't think that's a serious driver of Mac sales.

But:

- Finetuned local LLMs are good for specific niches, like roleplaying, text games, and helper bots for your own pile of data. And they are getting better at other niches like code completion for specific languages, or summarization.

- Remember that a huge selling point for Macs is iPhone/iPad development. The market for AI App Store apps is not small. This is also a reason to believe there will be some stability with the ML support.


> - Finetuned local LLMs are good for specific niches, like roleplaying, text games, and helper bots for your own pile of data.

I can't see how they don't hallucinate, or how they aren't leagues away from GPT-3.5, let alone GPT-4, in quality of output. Am I mistaken?


They are better than GPT-3.5 (which I am generally not impressed with), but not as good as GPT-4.

Again, the specialized variants perform very well in their niches.


Hallucinations are exactly what you want in a gaming model. That's another way of saying "creativity".


You seem to assume hallucinations are a fatal flaw. You give it a document to summarize and see how often it hallucinates. Very little. Human performance.

Now, how often does a human make random shit up about general knowledge questions?


There are a lot of ML applications outside of LLMs. Why would a developer invest in it? Because there are hundreds of millions of iOS devices out there where computer vision, text recognition, etc would be useful features.


Desktop computers are heat-limited. We could have much faster computers if we found a way to cool them down. Thus, compute per watt is the ultimate metric to optimize for. If your cooling capacity is 500W, then obviously you'll want to fit as much compute in that as possible.

Mobile devices are energy-limited. You'll want to do as much compute as possible on a limited battery.


My question to you is: what are you currently using as an alternative for the CPU/SoC in your personal & work environments?

Intel? AMD Ryzen?

Apple has taken their ARM approach and scaled it to all their platforms.

Amazon is now on, what, Gen 2 or 3 of their Graviton platform in AWS.

And what OS are you using if you don’t trust Microsoft, Linux or Apple?


CPU arch isn't even that critical here, as Apple is talking about Metal.


I'm still confused by the proliferation of bf16. Although it certainly doesn't hurt compared to fp16, in my testing even with A100 GPUs optimized for it, both training speed and inference quality are the same between bf16 and fp16.


Sometimes during training, fp16 will cause networks that would converge in fp32 to explode to Infs or NaNs because of its limited range. bf16, generally speaking, fixes that.

It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
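
A toy illustration of the range difference:

    import torch

    x = torch.tensor(60000.0)              # fits in fp16 (max ~65504)
    print((x * 2).to(torch.float16))       # inf -- overflows fp16's 5-bit exponent
    print((x * 2).to(torch.bfloat16))      # ~120000 -- bf16 keeps fp32's 8-bit exponent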


Yea, I spent a few months comparing the two, and empirically I had a lot more issues with various normalized-entropy problems (explosion, not converging, converging slower) with fp16 than with bf16.

The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.


My understanding is that for certain types of networks BF16 will train better than FP16, given the additional protection that BF16's extended range gives against exploding gradients and losses - at the cost of some precision.


bf16 is generally easier to train neural networks on than fp16, since there's no need for loss scaling. And most model training and inference performs the same with fp32 and bf16.
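
Concretely, with mixed-precision training in PyTorch the fp16 path needs a GradScaler while the bf16 path usually doesn't (a minimal sketch, assumes a CUDA device):

    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(16, 1).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()

    # fp16: small gradients underflow, so scale the loss up and step via the scaler.
    scaler = torch.cuda.amp.GradScaler()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = F.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()

    # bf16: the fp32-sized exponent range makes the scaler unnecessary.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()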


Despite the other answers, I will tell you the grim truth: Your mileage might vary.

It's an empirical question and depends upon the nature of your problem and data. You should try all three (fp32, fp16, and bf16) as part of your model selection / hyperparameter tuning.

For example, in audio generative models (where typical output is 16-bit), I've sometimes found that fp16 and bf16 just don't produce output as good as fp32 weights do.


Fp16 makes it easy to accidentally overflow, especially around summation operations.


(Not an ML guy.) bf16 and fp16 should be comparable if the weights are of the same magnitude, but what happens in a network where the weights are poorly regularized?


Someone commented below that with enough batchnorm/layernorm/etc. and/or gradient clipping you can manage it, but BF16 just makes life easier if you can live without some precision.


I think posits are better. https://posithub.org/


Posits seem interesting, but they are fundamentally very different than floats and ints, and much harder to analyze.




