It's a shame that large language models are mostly moving to 4-bit weights for inference, and a bunch of papers have shown promising techniques for training in 4-bit too...
Remember that switching from 16-bit to 4-bit lets you have 4x as many weights, 4x as many weights loaded from RAM per second, and ~1/16 of the silicon area for the calculations (multiplier area scales with approximately the number of bits squared). That smaller silicon area will let you do more per $ too...
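A quick back-of-the-envelope check of those numbers, restated in Python:

    # Just the arithmetic from the comment above: memory scales with bits,
    # multiplier area scales roughly with bits squared.
    old_bits, new_bits = 16, 4
    print("weights per unit of RAM/bandwidth:", old_bits // new_bits, "x")      # 4 x
    print("relative multiplier area (~ bits^2):", (new_bits / old_bits) ** 2)   # 0.0625 = 1/16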
Their best performing 4-bit number format uses 1 sign bit, 3 exponent bits, and no mantissa bits!
I.e. all weights, activations, and gradients become powers of two, which means all multiplications become simple bit shifts. That really changes the mathematics and the silicon design.
You're usually feeding a ton of multiplies into an accumulator. You can handle one or two mantissa bits with the same bit shifting, except that each multiply outputs two or three numbers to accumulate. And accumulators are very easy to scale.
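As a toy sketch (made-up helper, integer activations assumed for clarity): a power-of-two weight is one shift, and each set mantissa bit just contributes one more shifted copy to the accumulator.

    def shift_mul(x, sign, exp, mantissa_bits=()):
        """Partial products to accumulate for x * (+/- 1.m1m2... * 2**exp)."""
        s = -1 if sign else 1
        shift = lambda v, e: (v << e) if e >= 0 else (v >> -e)
        terms = [s * shift(x, exp)]                      # the implicit leading '1.' bit
        for i, m in enumerate(mantissa_bits, start=1):   # each mantissa bit that is set
            if m:
                terms.append(s * shift(x, exp - i))      # adds one more shifted copy
        return terms

    # weight = +1.5 * 2**3 = 12 (one mantissa bit set), activation = 7
    acc = sum(shift_mul(7, sign=0, exp=3, mantissa_bits=(1,)))
    print(acc)  # 84 == 7 * 12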
Also in the extreme I've seen powers of 4 get used.
At just 4 bits, there are only 16 possible numbers. It becomes lookup-table territory, and there is no need for the numbers on your number line to be linearly or exponentially spaced; you can assign them arbitrarily. For example, you could have a number system consisting of ±0.5, 1, 2, 3, 5, 10, 1000, 1000000, getting some nice accuracy in the middle of the number line where you expect most values to lie, plus some extreme values so convergence doesn't take forever if some big activation/gradient needs to be propagated.
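For illustration, a minimal codebook-quantization sketch along those lines (the table is just the example values above, not any real format):

    import numpy as np

    TABLE = np.array([0.5, 1, 2, 3, 5, 10, 1000, 1000000], dtype=np.float32)
    CODEBOOK = np.concatenate([TABLE, -TABLE])   # 16 arbitrary levels -> 4-bit codes

    def quantize(w):
        """Index of the nearest codebook entry for each weight."""
        return np.abs(w[:, None] - CODEBOOK[None, :]).argmin(axis=1).astype(np.uint8)

    def dequantize(codes):
        return CODEBOOK[codes]

    w = np.array([0.7, -2.2, 4.4, 80000.0], dtype=np.float32)
    print(dequantize(quantize(w)))   # nearest codebook values: 0.5, -2, 5, 1000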
The more recent 4-bit quantizations are almost along these lines. Q4_1 in ggml, for example, takes a block of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight from a quantized value 'q' is q * d + m. Taking a relatively small block size makes it more likely that the weights are all within a reasonable quantization range. Notably, d and m can be stored with more accuracy without sacrificing too much space, since the overhead is divided across 32 weights. Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to those, so the final weight is (q * d + m) * d_s + m_s, and the additional factors are stored with 6 bits instead of 4.
In practice this seems to get very good results while being cheap to implement and relatively space-efficient; Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
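To make the q * d + m idea concrete, here's a rough sketch of a Q4_1-style block quantizer in Python (an illustration of the scheme described above, not the actual ggml code):

    import numpy as np

    BLOCK = 32

    def quantize_block(w):
        """Store a scale d, a minimum m, and 4-bit codes q for one block of weights."""
        m = w.min()
        d = (w.max() - m) / 15 or 1.0        # 16 levels; guard against a constant block
        q = np.clip(np.round((w - m) / d), 0, 15).astype(np.uint8)
        return d, m, q

    def dequantize_block(d, m, q):
        return q.astype(np.float32) * d + m  # the q * d + m reconstruction

    w = np.random.randn(BLOCK).astype(np.float32)
    d, m, q = quantize_block(w)
    print(np.abs(dequantize_block(d, m, q) - w).max())   # worst-case error ~ d / 2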
Very efficient for storage and memory bandwidth, but such a scheme is a headache for high throughput hardware implementations (at least compared to regular 4 bit math, which can be packed really really densely)
I dimly remember reading that the mathematical compute-per-density optimum is around 3.x bits in a "brain-like structure", but I don't remember any details or the precise context. Does this ring a bell with anyone?
There are already papers on it, and there are 2-bit quants in llama.cpp.
But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... For now.
There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.
For some time I played with Brevitas and Xilinx's FINN; you could quantize like crazy. I haven't looked at where they are since transformers took over the AI world.
Luckily BF16 is just a truncated FP32. That means the hardware can do BF16; you just don't get any performance benefit compared to FP32 (and depending on the hardware design, you might also have to space the data 4 bytes apart rather than 2, in which case you lose the memory bandwidth and RAM usage benefits too).
At that point it’d be better to do everything in fp32. The hardware can’t do bf16 in the way you’re saying; the conversions would consume all your time.
You still get a perf benefit from half the memory traffic and keeping twice as much data in caches, since you can do the expansion to f32 when loading into registers.
I don't believe the standard defines it. I believe implementations truncate (i.e. round toward zero).
Remember, BF16 was invented specifically to be backwards compatible with existing silicon, and pulling 2 bytes out of 4 is a far cheaper operation than any rounding.
Just to elaborate, as I was confused about this and had to look it up: BF16 is indeed designed to just be a truncated F32. You can grab the top 16 bits of an F32 value and it'll still "make sense": the sign bit is in the same place in both (unsurprisingly), and the exponents of BF16 and F32 are both 8 bits. In the case of the mantissa, you end up grabbing the top 7 bits of the F32's 23-bit mantissa, so it all works out, as this will "round" the value toward zero.
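A small sketch of that truncation in Python (plain numpy bit fiddling; real conversions may round-to-nearest-even rather than truncate):

    import numpy as np

    def f32_to_bf16_bits(x):
        """BF16 as the upper 16 bits of the IEEE-754 float32 bit pattern."""
        bits = np.array(x, dtype=np.float32).view(np.uint32)
        return int(bits) >> 16

    def bf16_bits_to_f32(b):
        """Re-expand by padding the dropped low 16 mantissa bits with zeros."""
        return float(np.array(b << 16, dtype=np.uint32).view(np.float32))

    b = f32_to_bf16_bits(3.14159265)
    print(hex(b), bf16_bits_to_f32(b))   # 0x4049 3.140625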
Somehow missed this from WWDC23, but it looks like Sonoma will add support for bfloat16 with Metal, and there's an active PR to add support to the PyTorch MPS back-end (PR #99272). Since the M2 added bfloat16 support at the hardware level, I'm assuming this will only be supported on M2 Macs.
That maxed out Mac Studio M2 w/ 192GB of memory now looks more appealing...
Maybe someone can help me understand why people are investing into this.
Inhousing typically means falling behind in technology but having lower operating costs. That makes the company win, not the users.
If you hinge your career on Apple, they might make your technology obsolete on a dime.
It's not the fastest, it's not the best, it's not the cheapest, and it's not some combination of those either.
> 'compute per watt'
With AI? The local LLM models are near useless already. There will be a time to cut down on power, but from what I've read, there is currently ~no value even with a 4090 and 512GB of RAM.
I suggest avoiding Windows/M$, I am annoyed with Linux bugs, and Google cannot be trusted. But all of that could be said about Apple as well.
I just don't see a future with Apple hardware; it gives me some serious Nintendo vibes, where they are going to be some quirky niche that is just enough for marketers to sell. Compute per watt seems like a Wiimote that no one asked for but is suddenly claimed to be ultra important.
Maybe someone can change my view. I don't see who buys this when they are educated on the possible options.
> Maybe someone can help me understand why people are investing into this.
Buying a Mac for running LLMs is kind of like buying a Mac for gaming. It's theoretically interesting, but I don't think that's a serious driver of Mac sales.
But:
- Finetuned local LLMs are good for specific niches, like roleplaying, text games, and helper bots for your own pile of data. And they are getting better at other niches like code completion for specific languages, or summarization.
- Remember that a huge selling point for Macs is iPhone/iPad development. The market for AI App Store apps is not small. This is also a reason to believe there will be some stability in the ML support.
You seem to assume hallucinations are a fatal flaw. Give it a document to summarize and see how often it hallucinates: very little. Human-level performance.
Now, how often does a human make random shit up about general knowledge questions?
There are a lot of ML applications outside of LLMs. Why would a developer invest in it? Because there are hundreds of millions of iOS devices out there where computer vision, text recognition, etc would be useful features.
Desktop computers are heat-limited. We could have much faster computers if we found a way to cool them down. Thus, compute per watt is the ultimate metric to optimize for. If your cooling capacity is 500W, then obviously you'll want to fit as much compute in that as possible.
Mobile devices are energy-limited. You'll want to do as much compute as possible on a limited battery.
I'm still confused by the proliferation of bf16. Although it certainly doesn't hurt compared to fp16, in my testing even with A100 GPUs optimized for it, both training speed and inference quality are the same between bf16 and fp16.
Sometimes during training, fp16 will cause networks that would converge in fp32 to explode to Infs or NaNs, because of fp16's limited range. bf16 generally speaking fixes that.
It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
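A quick toy illustration of the range difference (PyTorch, just showing the dtypes):

    import torch

    # fp16 tops out around 65504, while bf16 keeps the full fp32 exponent range
    # at the cost of a coarser mantissa.
    x = torch.tensor(1e5)
    print(x.to(torch.float16))    # inf  (overflows fp16)
    print(x.to(torch.bfloat16))   # ~1e5 (imprecise, but finite)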
Yeah, I spent a few months comparing the two, and empirically I had a lot more issues with various normalized-entropy problems (exploding, not converging, converging slower) with fp16 than with bf16.
The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.
My understanding is that for certain types of networks BF16 will train better than FP16, given the additional protection against exploding gradients and losses that BF16's extended range provides, at the cost of some precision.
bf16 is generally easier to train neural networks on than fp16, since there's no need for loss scaling. And most model training and inference performs the same with fp32 and bf16.
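For anyone unfamiliar with what "scaling" refers to here, a rough PyTorch sketch of the two paths (assumes a CUDA device; the model, optimizer, and data are throwaway placeholders):

    import torch
    import torch.nn.functional as F

    # Toy setup, just to show the two code paths.
    model = torch.nn.Linear(16, 1).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x, y = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()

    # fp16 path: gradients can underflow, so the loss is scaled up before
    # backward() and the gradients are unscaled before the optimizer step.
    scaler = torch.cuda.amp.GradScaler()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = F.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()

    # bf16 path: same exponent range as fp32, so no scaler is needed.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()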
Despite the other answers, I will tell you the grim truth: Your mileage might vary.
It's an empirical question and depends on the nature of your problem and data. You should try all three (fp32, fp16, and bf16) as part of your model selection / hyperparameter tuning.
For example, in audio generative models (where the typical output is 16-bit), I've sometimes found that fp16 and bf16 just don't produce output as good as fp32 weights do.
(Not an ML guy.) bf16 and fp16 should be comparable if the weights are of the same magnitude, but what happens in a network where the weights are poorly regularized?
Someone commented below that with enough batchnorm/layernorm/etc. and/or gradient clipping you can manage it, but BF16 just makes life easier if you can live without some precision.