That's not a 'bit' ("Binary digIT"). It's closer to a 'trit' ("TeRnary-digIT"). Specifically, ternary digits spanning {-1, 0, 1} (rather than the usual {0, 1, 2} in a base-3 numbering system) are 'balanced ternary'.
A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced-ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:
In an aside, the article hints that e-nary digits (base 2.718…) if somehow made practical/meaningful, might actually be better than ternary (or perhaps even optimal?).
So maybe this paper's observation that ~"1.58 bits" (log2(3) binary-digits) is a sweet-spot could be further refined into some method for representing the state of an e-nary-modeled algorithm in log2(e) binary-digits (~"1.44 bits") per underlying e-it.
Yeah, specifically, the definition of optimal provided - radix economy. There are plenty of other considerations one could make in other contexts. Practically, a transcendental base seems... rather impractical. And base 3 is probably not so much 'more optimal' than base 2 as to warrant the electrical complexity, for example.
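For anyone who wants to check the radix economy numbers, here is a minimal sketch. It assumes the usual asymptotic definition (the cost of base b scales as b / ln b); the function name `relative_cost` is just mine for illustration.

```python
import math

def relative_cost(base: float) -> float:
    """Asymptotic radix economy of `base`, normalized so that base e costs 1.

    Writing N in base b takes about ln(N) / ln(b) digits, each with b
    possible states, so the cost scales as b / ln(b).
    """
    return (base / math.log(base)) / math.e

for b in (2, 3, math.e, 4, 10):
    print(f"base {b:6.3f}: relative cost {relative_cost(b):.4f}")
```

By this measure base 3 comes out roughly 5% cheaper than base 2, and base 4 ties with base 2, which is why the advantage is easy to lose to practical concerns.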
It is obviously pretty common to represent matrices with lots of zeros in a sparse format, like CSR or something. I wonder if they could get away with 1-bit representation using a sparse matrix. Of course, it would be a little different from a typical sparse matrix because there’s no problem normally having a zero-value in a structurally non-zero location.
Note that they're not claiming that their LLM is 1-bit - they're saying that there is a 1-bit era of LLMs. What they do say is that their approach is a 1-bit LLM variant, namely a ternary LLM (they explicitly state that in the abstract).
That page also brings up the whole "but division" problem with balanced ternary. However, I personally suspect that http://degiorgi.math.hr/aaa_sem/Div_Krishna/887-889.pdf ("A Division Algorithm for Signed-Digit Arithmetic" by Chin Tung, from 1968!) might offer an overlooked path to a solution to that problem.
And see also also², this quote from TAOCP:
"Cauchy pointed out that negative digits make it unnecessary for a person to memorize the multiplication table past 5x5."
The—INCREDIBLY ANNOYING TO LOCATE—source for which is "105. Calculs numériques. sur les moyens d'éviter les erreurs dans les calculs numériques." on Pdf page 445/document page 431 here:
( +a vaguely related paper here on quantum mechanics & radix economy, BUT it makes the mistake of using an overly specific formula applicable only to unsigned-digit representations thus drawing the wrong conclusions: https://www.researchgate.net/profile/Vladimir_Garcia-Morales... )
-0 is not indistinguishable from 0 in floating point math. Most ops return +0, and -0 can behave differently. I don't know of any examples where -0 is important for machine learning, though.
Why? Just because it's spelled identical to a human body part?
This kind of shit is one of the most bizarre things about human society (or the prude cultures of it at least), to consider the most natural things so taboo and a "joke" to mention.
Take this with a grain of salt until someone reproduces it. Improvements such as these require extraordinary evidence. Not to mention extreme quantization has been tried before.
The theoretical capacity of a binary network is 69% of the capacity of a full-weight network, so it makes sense that LLMs would converge to 1-bit networks in the long term.
It's nice to finally see practical networks reach the theoretical limits found in the statistical mechanics of Ising models. A good pointer to efficient 1-bit training, from the statistical mechanics point of view, is here:
These models will be compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, open source) with the latest stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.
I'm interested in collaborating. For example, from the comments it occurred to me that a 128-bit SIMD register can contain 64 2-bit values. It seems straightforward that SIMD bitwise logical operations could be used in training such models.
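To make the "64 2-bit values in a 128-bit register" idea concrete, here is a rough sketch that uses a Python int as a stand-in for the register. The 2-bit encoding (00 = 0, 01 = +1, 11 = -1) and the names `pack` and `dot` are my own assumptions, and this only shows a forward dot product, not training.

```python
LANES = 64                      # 64 two-bit lanes = one 128-bit register
LOW = int("01" * LANES, 2)      # mask selecting the low bit of every lane

def pack(trits):
    """Encode each trit in 2 bits: 0 -> 00, +1 -> 01, -1 -> 11."""
    assert len(trits) == LANES and all(t in (-1, 0, 1) for t in trits)
    word = 0
    for i, t in enumerate(trits):
        word |= (t & 0b11) << (2 * i)
    return word

def dot(a: int, b: int) -> int:
    """Dot product of two packed ternary vectors using bitwise ops and popcounts."""
    nonzero = a & b & LOW                   # lanes where both factors are non-zero
    negative = ((a ^ b) >> 1) & nonzero     # of those, lanes whose signs differ
    return bin(nonzero).count("1") - 2 * bin(negative).count("1")
```

On real hardware the AND, XOR and shift would each be one SIMD instruction over the whole register, with a popcount at the end.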
Highly interested in collaborating - got a bunch of proprietary legal data already pre-sorted and labeled for various scenarios. I've already benchmarked legal use-cases (i.e. legal speciality, a few logic-based questions, and specific document creation) with various LLMs, so would love to see what benchmarks this can produce compared to early Mistral or Llama.
It's funny how discoveries in NLP & computer vision complement each other. The replacement of multiplication by additions made me think about the AdderNet paper (https://arxiv.org/abs/1912.13200), which concluded that you suffer almost no performance drop.
Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such a strict quantization, this would open LLMs to the wider ML community much earlier than expected (when consumer hardware allows you to train near SOTA LLMs from scratch on your machine).
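For readers wondering what "replacing multiplication by additions" looks like with ternary weights, here is a minimal sketch. The function name `ternary_matvec` is hypothetical, and a real kernel would be vectorized rather than a Python loop.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector x.

    No multiplications: each weight either adds x[j], subtracts it, or
    skips it.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj
            elif w == -1:
                acc -= xj
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
print(ternary_matvec(W, [3, 5, 2]))  # [1, 4]
```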
Too bad there seem to be no pretrained models to download. This is not a quantization method to apply on existing models, so having the pretrained weights is needed if one wants to test it.
The mathematics of the BNNs are sound. The Shannon entropy of a word is really small (I vaguely remember ~2 bits). Also all neural networks are ridiculously overprovisioned.
I worked on this 7 years ago, trying to efficiently binarize CNNs from existing models. The difficulty was getting training running without the losses going too high. I think that vision models will be much more difficult to binarize, but you might not need to with CLIP if the vision encoder stays in regular math {fp16,int8}.
Just to be clear, it's all theoretically possible. There are already BNN versions of YOLO and other CNNs. No reason why transformers wouldn't work for that or audio. It just might be harder to get them to train well enough.
Speech to text, however, is super interesting. You just gave me an idea! I'm gonna go run some experiments :D
A 1 bit multiplier in silicon is a single logic gate, but a ternary decoder to decode a packed tri-state 'weight' is bigger.
I therefore suspect that this method will be extended to make all weights simple 1 or 0 (i.e. binary). Perhaps that will be done by having half the weights have 1 or 0 values, while the other half are -1 or 0.
It's optimal if your program is naturally ternary, which this one is. Using three signals, rather than ternary gates, is less effective, because you need much more precision to detect two different voltage levels rather than just up and down.
I think it's the right chain of thought. You could either have 0/1 and then have additional nodes with negative activation functions, or -1/1
-1/1 is appealing to me (0 = -1) because bit hackery could be used instead of the multiplication function, presumably on integral or fixed-point representations. The goal would be to eliminate any "if/then" like "if 0 do this if 1 do that" to avoid the need for branch prediction - there are bit-hackery ways to bypass this. That would lend itself well to all existing processors, ASICs, FPGAs, GPUs, etc.
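One example of the kind of bit hackery meant here: a branchless conditional negate, with the weight stored as a single bit (0 = -1, 1 = +1). This is only a sketch on Python ints; the names `apply_sign` and `binary_dot` are made up for illustration.

```python
def apply_sign(x: int, bit: int) -> int:
    """Return x if bit == 1, -x if bit == 0, without branching.

    mask is 0 when bit == 1 and -1 (all ones in two's complement) when
    bit == 0; (x ^ mask) - mask is then x or ~x + 1 == -x.
    """
    mask = bit - 1
    return (x ^ mask) - mask

def binary_dot(bits, xs):
    """Dot product of a {-1, +1} weight vector (stored as bits) with integers."""
    return sum(apply_sign(x, b) for b, x in zip(bits, xs))

print(binary_dot([1, 0, 1], [2, 3, 4]))  # 3
```

The same XOR-and-subtract trick works on fixed-width integers in C or in SIMD lanes, which is what makes it attractive on existing processors.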
Probably because despite the 1200 citations, they didn't have the ability to apply it to modern LLMs. Nobody cares about an image classifier using 50% fewer parameters since most of them were small enough to fit in memory anyway.
Refreshing paper in terms of machine learning papers, simple explanation, easy to replicate, no alchemy-tier interpretations. Can't wait to see this paper replicated or disproved when it comes to real-life production tasks.
The most glaring omission is that they only compared to fp16 models, not to quantized models. And of course the benchmarks might be misleading compared to the real experience.
But if you wanted to make LLM-specific hardware (or x64 instructions tuned for LLMs) this model architecture makes that extremely cheap. Multiplication requires a lot of transistors; this architecture requires only two-bit adders. You could make SIMD instructions that do thousands of these in parallel, for fairly little silicon cost.
"Straight-through estimator. To train our 1-bit model, we employ the straight-through estimator (STE) [BLC13] to approximate the gradient during backpropagation. This method bypasses the non-differentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model."
On its own, each trit doesn't encode much information at all. But it's not about information at the individual level -- it's more about the shape of the network.
I appreciated this comment [0] from earlier in the thread by paul_mk1:
> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.
For myself, I've done a lot of work with image hashing (such as pHash and dHash). In those, you throw away a LOT of information, simply keeping one value per region and tracking whether it's above or below the average (essentially, the sign), and it's astounding how robust those algorithms are. You don't look at the individual pixels of an image, but the hash is very good at capturing the impression of the overall _shape_ of the image.
It's less about each individual datum, and more about the shape of the network.
If you're not familiar with Lottery Ticket Hypothesis, that would be worth reading up on.
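To illustrate the image-hashing analogy, here is a toy "above or below the average" hash. Note this is the simpler average-hash variant, not pHash or dHash proper, and the tiny 3x3 grid and function names are invented for the example.

```python
def average_hash(pixels):
    """Hash a grid of brightness values: 1 bit per cell, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(a, b):
    """Number of positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

image = [[10, 200, 30],
         [220, 15, 210],
         [25, 190, 20]]
brighter = [[p * 2 + 5 for p in row] for row in image]   # change exposure
noisy = [[p + (3 if (r + c) % 2 else -3) for c, p in enumerate(row)]
         for r, row in enumerate(image)]                  # small perturbation

print(hamming(average_hash(image), average_hash(brighter)))  # 0
print(hamming(average_hash(image), average_hash(noisy)))     # 0
```

Only one bit per region survives, yet the hash is unchanged by the exposure shift and the small noise, which is the robustness being described.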
Do you mind pointing out where they make the model larger? The paper seems to suggest they are maintaining the same model sizes.
> Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption
Interesting return to ternary. Effectively, each weight says only whether it's correlated (+1), uncorrelated (0), or anti-correlated (-1) with the input, and the structure of the network is the actual computation over that information.
Is it really so surprising that something like this works given how human brain neurons work? My admittedly basic understanding is that these operate through an all-or-nothing principle for their action potentials (firing): they either fire or they don't, based on whether the input signals reach a certain threshold. So the output is already sort of binary in biological neurons. The inputs are more like continuous values, since they are the sum of many different neurons sending signals into each neuron, but in this paper the activations are 8-bit, not binary/ternary. Can any neuroscientists here comment?
Well I think it's an interesting idea, and to add to that, the "-1" values would correspond to an inhibitory neuron!
What neurons can do though is integrate over time, so your output can be one spike, or 3 spikes very quick, same for your input, and maybe 10 quick spikes in a row is a more powerful signal than a lone spike. We know this intuitively, though, via vision, we don't see in mac-classic style black/white images, we see shades of brightness and color, indicating that at least our optic nerve is sending what amounts to an analog signal (even if encoded as binary spikes - is the spike timing not analog?)
This is not to mention all the biochemical signaling that happens, and the multitude of local neurotransmitters and global physiological/hormonal factors at play. And all that weird stuff like glial cells and astrocytes is there in the mix too.
First of all, they operate independently of a synchronized clock, and they can also accumulate signals instead of executing on an input. Neuromorphic chips are closer to how the brain works, but they're still super early. I believe Intel has the best one with the Loihi 2.
(Not a neuroscientist but my wife is and that's what I understand from our chats)
Assuming this is confirmed, what's the impact on training?
Inference is definitely an issue for LLMs right now. But if training were suddenly possible for lone hackers (or maybe smaller companies), it would open up a lot of new possibilities as well.
In theory it should make training a lot easier too, particularly on CPUs. But I think you'll still need reasonably expensive compute to get a model something close to the current big models, and you really can't ignore data. Data quality and quantity are both huge ingredients in model quality, at least as big as architecture. It's still non-trivial to get a good quality, large dataset, certainly out of the reach of lone hackers and most small companies.
1-bit LLMs remind me of a random forum post I read about SACD and limitations of the 1-bit DSD audio format. https://www.audiosciencereview.com/forum/index.php?threads/d... Accumulating approximate values in one bit leads to being "constantly overloaded", with any error correction overwriting all of your real signal from the next step. I think this trinary system might leave enough room to avoid this problem.
Damn. Well, I guess I better hurry up and write and publish a paper on the Ternary Neural Network research that I've been doing (part-time) for the last several months, before it all gets scooped.
Modify your schedule, sure, but do not rush it (just to beat the other folks). The first paper on any given topic may garner some 15 minutes of fame, but the well-researched, boring paper is the one that gets oft-cited, even if it isn't the first on its topic.
Be thorough and by golly, include some useful visuals! Even bad pictures and low-effort charts and graphs can vastly improve the grokability of a research paper.
Also, request assistance! Are you terrible at making charts and graphs? Ask someone to help you! For the low, low price of adding their name to the paper I'm 100% certain you can borrow an expert's time to add some dapper displays of useful information along with drastic wording and layout improvements.
The number of papers in the wild that are just walls of jargon with completely useless, nearly-impossible-to-read charts and graphs is seemingly limitless.
Refreshing is the paper that a non-expert can read and understand! You don't have to ELI5 but well-written text and explanations are loved by all. The individual using it to gain actual knowledge will grok it from skimming and looking at the data anyway so you might as well take the time to explain some of the more complicated aspects like it's going to be read by a freshman STEM major (no need to go further back in education than that).
If you need help with grammar just paste a portion of your text into some LLM (even the small, locally-run models) and they usually do a pretty good job at finding and fixing such mistakes.
In both cases this is a prime opportunity for anyone to disrupt Nvidia. They are in this market position in large part because both video games and neural networks do a lot of highly parallel floating point math, especially matrix multiplication. This model architecture doesn't do any of that.
Of course it should be fairly simple for Nvidia to add special silicon and instructions for two-bit addition to a future generation of their cards. But it'll take a while because they already have a roadmap and preexisting commitments. And any competitor doesn't have to copy everything Nvidia does to make floating point numbers go fast, they can just focus on making two-bit data handling and addition go fast.
Yes, but with their current market cap, the more likely result is they acquire one of the several competitors poised to take advantage of this and throw massive resources behind them.
An FP16 unit is pretty big in an ASIC: you need at least 9 * 5 gates to calculate the exponent of the result, a 10 bit barrel shifter (10*10 + 10*ceil(log2(10)) gates), and a 10 bit multiplier (approximately 10 * 10 * 9 gates).
Total = 1085 gates. The reality is probably far more, because you're going to want to use carry-look-ahead and pipelining.
Whereas 1-bit multiplies and adds into, say, a 16-bit accumulator use... 16 gates! (and probably half that, since you can probably use scheduling tricks to skip past the zeros, at the expense of variable latency...)
So when 1 bit math uses only 1/100th of the silicon area of 16 bit math, and according to this paper gets the same results, the future is clearly silicon that can do 1 bit math.
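The gate-count estimate above is easy to re-run. This sketch just re-does the arithmetic as given; reading the 5 and the 10 in those formulas as an exponent width and a mantissa width is my assumption, and the per-bit gate costs are the comment's rough figures, not measured numbers.

```python
import math

EXPONENT_BITS = 5     # assumed meaning of the "5" in "9 * 5"
MANTISSA_BITS = 10    # assumed meaning of the "10 bit" shifter and multiplier

exponent_gates = 9 * EXPONENT_BITS
shifter_gates = (MANTISSA_BITS * MANTISSA_BITS
                 + MANTISSA_BITS * math.ceil(math.log2(MANTISSA_BITS)))
multiplier_gates = MANTISSA_BITS * MANTISSA_BITS * 9
float_total = exponent_gates + shifter_gates + multiplier_gates

accumulator_gates = 16   # the comment's figure for a 1-bit multiply-add

print(exponent_gates, shifter_gates, multiplier_gates, float_total)  # 45 140 900 1085
print(round(float_total / accumulator_gates, 1))                     # 67.8
```

So by these figures the ratio is about 68x, or roughly double that if zero-skipping really halves the 1-bit cost, which puts "1/100th" in the right ballpark.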
- we have llama.cpp (could be enough or at least as mentioned in the paper a co-processor to accelerate the calc can be added, less need for large RAM / high end hardware)
- as most work is inference, might not need as many GPUs
- consumer cards (24G) could possibly run the big models
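A quick back-of-the-envelope check on the 24G point. This sketch counts weight storage only (no activations, KV cache, or runtime overhead), and the 70B size is just an example I picked as a "big model".

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return params_billions * bits_per_weight / 8

for bits in (16, 8, 4, 2, 1.58):
    print(f"70B at {bits:>5} bits/weight: {weight_memory_gb(70, bits):7.2f} GB")
```

At 2 bits per weight (the simplest way to store a ternary value) a 70B model needs about 17.5 GB for weights, versus 140 GB at 16 bits, so fitting on a 24 GB card looks plausible on paper.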
This opens the door to very exciting hardware shifts, like to optical computing, where there's already been over a decade of research on ternary optical computing and other parallel research at using optical computing for more efficient neural networks.
If this really holds up, it likely means we'll be moving to new dedicated hardware for AI compute much faster than when it was FP.
As per the answer, the reason float is faster than int is that a) hardware companies provide more float ALUs than integer ALUs and b) float FMA is a thing, while integer FMA isn't. Both are because currently most HPC-like loads use floats instead of integers, not because of intrinsic hardware reasons.
So for the uninitiated (me), does this mean the input is not a float (i.e. is quantized on input), such that all the math can be done with int operations?
That’s still not 1 bit, and that would basically destroy whatever perf advantage you might hope to get if you want to keep the model in memory in that format rather than unpack it on load.
Not fully, 8 bits has 256 values. It's easy to keep a lookup table in the L1 cache of any CPU and the constant cache of any GPU. For ASICs and FPGAs, it's a simple 256-value LUT. It's not ideal, yes, but not a deal breaker, especially considering LLMs are memory bound. GGML dequantizes weights on-the-fly and still gets near linear scaling on GPUs.
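Here is roughly what such a lookup table could look like. I'm assuming the scheme under discussion packs five ternary weights into one byte (3^5 = 243 fits in 256), which gives 1.6 bits per weight; the function names are hypothetical.

```python
def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte as base-3 digits."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)
    return value

def _decode(byte):
    """Split a byte back into its five base-3 digits, least significant first."""
    digits = []
    for _ in range(5):
        byte, d = divmod(byte, 3)
        digits.append(d - 1)
    return tuple(digits)

# 256-entry table; byte values 243..255 are unused by this encoding.
LUT = [_decode(b) if b < 3 ** 5 else None for b in range(256)]

def unpack(data):
    """Decode a bytes object back into a flat list of trits via table lookup."""
    out = []
    for byte in data:
        out.extend(LUT[byte])
    return out

packed = bytes([pack5([1, 0, -1, -1, 1])])
print(unpack(packed))  # [1, 0, -1, -1, 1]
```

Decoding is one table read per byte, which is why keeping the table in L1 or constant cache makes the unpacking cost small next to the memory traffic it saves.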
LLMs have gone from 32-bit floating point numbers down to 16 and 8 bit values. Now 2 bits. It's a hint as to how evolution did it. The basic component is simple and has very wide tolerances. There are just a lot of them. That's something biology can evolve.
Would there be value in distinguishing -0 and +0? If a 0 was quantized from a small negative or a small positive, it seems like retaining the sign is better than forgetting it.
The question remains whether the benefit and the simpler design are worth the loss of density.
Low-bit parameters are always talked about in terms of performance benefits, but I wonder if allowing the LLM to combine parameters to represent values means it can select the resolution of each value, that is, use a kind of internal scientific notation to track the uncertainty of values. More low-bit parameters combined together means more precision and resolution; fewer can mean more uncertainty. This might allow the LLM to better calibrate the uncertainty of its knowledge in a Bayesian way, to prevent hallucinations from the overconfidence you get from overfitting on too many bits.
Maybe a silly question but nonlinearity is important for neural nets. Wouldn't it make more sense for the three values to be e.g. (2, 0, -1) so they are not colinear?
Also, what are the prospects for FPGA implementations of this?
Does quantization need to be all or nothing? With the kind of low-bit models we have seen, my assumption would be that only certain weights would benefit from the extra precision. A mixture of precision with 2-bit, 3-bit, to 8-bit weights might perform well, but I am unsure if any training process could identify the weights that need the extra precision.
Given the weights are just mapping to a virtual network structure anyways, my guess would be that as parameter sizes increase any difference node precision might have will evaporate when trained from the ground up.
So moving to extremely high efficiency native ternary hardware like with optics is going to be a much better result than trying to mix precision in classical hardware.
We'll see, but this is one of those things that I wouldn't have expected to be true but as soon as I see that it is it kind of makes sense. If it holds up (and it probably will) it's going to kick off a hardware revolution in AI.
How does gradient descent work with these discrete ternary parameters? If you compute the partial derivative for a parameter, how do you determine how to nudge the parameter when updating on back propagation? Do you only update if the "nudging amount" meets a threshold?
> While the weights and the activations are quantized to low precision, the gradients and the optimizer states are stored in high precision to ensure training stability and accuracy. Following the previous work [LSL+21], we maintain a latent weight in a high-precision format for the learnable parameters to accumulate the parameter updates. The latent weights are binarized on the fly during the forward pass and never used for the inference process.
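A toy version of what that quote describes, as I understand it: high-precision latent weights accumulate small updates, the forward pass only ever sees their ternary version, and the gradient is passed straight through the quantizer. The fixed 0.5 threshold, the single linear neuron, and the tiny dataset are all my own simplifications, not the paper's setup.

```python
def ternarize(w: float, threshold: float = 0.5) -> int:
    """Quantize a latent weight to {-1, 0, 1} (hypothetical fixed threshold)."""
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0

def train_step(latent, x, target, lr=0.1):
    """One SGD step on squared error for a single linear neuron."""
    q = [ternarize(w) for w in latent]          # forward pass uses ternary weights
    pred = sum(qi * xi for qi, xi in zip(q, x))
    err = pred - target
    # Straight-through estimator: pretend d(q_i)/d(latent_i) == 1, so the
    # gradient of 0.5 * err**2 w.r.t. latent_i is simply err * x_i.
    return [w - lr * err * xi for w, xi in zip(latent, x)]

data = [((1, 0, 0), 1), ((0, 1, 0), -1), ((0, 0, 1), 0)]
latent = [0.0, 0.0, 0.0]
for _ in range(20):
    for x, target in data:
        latent = train_step(latent, x, target)

print([ternarize(w) for w in latent])   # [1, -1, 0]
```

So the "nudges" are never lost: they pile up in the latent weight, and the ternary value only flips once the latent weight crosses the quantization boundary.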
There's an interesting mental model I've been toying with. At what point do LLMs just become circuit-shaped NNs with stochastic gradient descent backing them?
E.G. are we just determining the best program by rearranging 1s and 0s?
What's the benefit of using ternary encoding over just a binary representation? And if we have come so far is there potential for a more efficient algorithm than gradient descent?
The paper talks about LLMs a lot, but would this result hold for all Transformers? Are Ternary Transformers going to make things like Whisper faster/better?
Could there be some value in recognizing areas where the model needs finer grained weights and somehow using a different data type just in certain areas?
It seems tough to do. Besides, I'm not sure what the benefit would be: with that you can't do the optimized matrix multiplication anymore, and if you need more precision presumably you can just add more neurons and/or train for longer and/or with better data.
Yes, actually: That's the entire point of the paper! The concept is that the amount of information contained in a weight like 0.00006103515625 is equivalent to 0, that -0.99951172 is equivalent to -1, that 1.26406236 is equivalent to 1, etc. In other words, there's no practical difference when actually utilizing the model (if trained in ternary from the start).
The paper posits (and provides evidence) that if you train a model using ternary values instead of floating point values you get equivalent (useful/practical) information. You can't take an existing model and round all the values down to `{-1,0,+1}` values but you can (re)train a model using ternary values to get the same end result (equivalent information/output).
Technically a model trained using FP16 values contains vastly more information than a model trained using ternary values. Practically though it seems to make no difference.
My prediction: Floating point models will still be used extensively by scientists and academics in their AI research but nearly all real-world, publicly-distributed AI models will be ternary. It's just too practical and enticing! Even if the ternary representation of a model is only 90% effective it's going to be so much faster and cheaper to use it in reality. We're talking about the difference between requiring a $500 GPU or a $5 microcontroller.
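For the curious, the mapping from float weights to `{-1,0,+1}` can be sketched like this. It follows the "absmean" idea as I recall it from the BitNet b1.58 paper (scale by the mean absolute value, then round and clip); treat the details as an assumption, and remember it is meant to run inside training, not as a one-off rounding of an existing model.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Scale by the mean absolute value, then round and clip to {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma

q, gamma = absmean_ternarize([0.00006103515625, -0.99951172, 1.26406236])
print(q)   # [0, -1, 1]
```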
I don't think you really answered my question. What's been done by the paper is show experimentally that networks don't have enough information to justify their weight precision, and that's really good and a very important result, but what I was asking was if there's a rigorous way to take an arbitrary network and determine its information content (either by itself, or compared to another network). Possibly that can be relative to its outputs.
Ok can someone catch me up to speed on LLM hardware requirements? Last I looked I needed a 20 gb vram card to run a good one. Is that not true anymore?
Oh Jesus so basically it’s very feasible for me to run my own local llm on a NAS or a server or something… well I guess it’s time for me to get on with the times…
It's a fairly straightforward modification of BitNet, so I assume this quote from the BitNet paper applies:
> To train our 1-bit model, we employ the straight-through estimator (STE) [BLC13] to approximate the gradient during backpropagation. This method bypasses the non-differentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model.
It seems to have more details (it's the paper before the linked one) about the actual training, but I'm scanning it and this isn't my field so maybe it's too light also.
Not really, that's for the binary version of the algorithm. The ternary version can propagate a lot more information in the backwards pass using the fact that it outputs one of -1, 0, or 1.
But I imagine they are using the same thing since a bunch of the authors are the same.
It seems like it could be any transformer, which is exciting now that even in imaging gradient transformers are all the rage. But ideally we'd need to see this result in other transformers (but I have a hard time seeing why it wouldn't be the case).
Not even remotely. I suppose you could kind of say that activations are boolean in the sense that neurons emit spikes, but arguably significant information is encoded in spike timing.
This is great, my employer just gave me an M1 laptop with only 16 GB RAM and I had to downgrade my 7B parameter local LLMs to 3-bit quantization. They've been surprisingly okay!
On my personal machine with 64 GB RAM, I usually use 8x7B at Q5 or 70B at Q4.
It's Mistral all the way down! Imagining a Q1.58 that's doing well makes me happy.
I really can't tell but it seems to be a continuation of this work if I read the To-Dos correctly, what do you think? Here it seems to be 1-bit on just the transformer, https://huggingface.co/shi3z/BitNetWikipedia110M
It's not quite spikes but getting closer to the idea. I'm amazed it has taken this long for this type of thing to reach HN, which gives next to no attention to spiking neural networks.
Simon Thorpe, a CNRS researcher, has got some fascinating papers and lectures on YouTube on using binary weights on neuromorphic hardware, which has had practical applications for over 20 years already.
I made an account just to drop his name somewhere on this forum.
It is not. Perhaps I should have clarified that I don't have another account. I've been a lurker until now.
In my time lurking I've noticed that the community here basically focuses solely on the von Neumann architecture. If anyone is interested in delving into the world of spikes he has some interesting ideas and good material available.
Can someone versed in the ways of math explain how this is different from previous quantization methods?
And specifically, seeing how going from FP16 to 8-bit mostly gives the same perplexity while anything further seems to lose quality / dumb down the model, how is this even less precise method able to achieve this?
If I understand it correctly, this seems to be more than just quantizing, the models are apparently trained in this format as well. So it's possible that the many layers adjust themselves in a way that "cancels out" the inaccuracies of the lower bit count
So modern NNs aren't really using the network nodes in the structure they physically are, but essentially build a virtual neural network using combinations of nodes (which is how you can model hundreds of parameters in only a dozen or so nodes).
So as the number of nodes scales up, the individual precision probably matters less and less. Which is what they found here - it reaches parity at 3B and then starts exceeding performance at larger sizes, up to the 2T tested.
Seemingly when trained from scratch the virtual network can find adequate precision from ternary physical nodes where needed. This is different from the information loss when an already trained floating point network has its weights quantized to smaller precision and sees a performance loss.
Not only is this approach more efficient, it seems to perform better too at larger network sizes, which is probably the most interesting part.
A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced-ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:
http://web.archive.org/web/20011205185830/http://americansci...
(As it may be of renewed interest, I've also put this 2001 "American Scientist" base-3 intro as a new HN submission for discussion: https://news.ycombinator.com/item?id=39541756)