The Era of 1-bit LLMs: ternary parameters for cost-effective computing

cs702 · 2024-02-28T13:54:56

There are two findings I find shocking in this work:

* In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

* In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.

On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

If the proposed methods are implemented in hardware, we will see even greater gains in compute and memory efficiency.

Wow.

paul_mk1 · 2024-02-28T22:48:52

Fun to see ternary weights making a comeback. This was hot back in 2016 with BinaryConnect and TrueNorth chip from IBM research (disclosure, I was one of the lead chip architects there).

Authors seemed to have missed the history. They should at least cite Binary Connect or Straight Through Estimators (not my work).

Helpful hint to authors: you can get down to 0.68 bits / weight using a similar technique, good chance this will work for LLMs too.

https://arxiv.org/abs/1606.01981

This was a passion project of mine in my last few months at IBM research :).

I am convinced there is a deep connection to understanding why backprop is unreasonably effective, and the result that you can train low precision DNNs; for those note familiar, the technique is to compute the loss wrt to the low precision parameters (eg project to ternary) but apply the gradient to high precision copy of parameters (known as the straight through estimator). This is a biased estimator and there is no theoretical underpinning for why this should work, but in practice it works well.

My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

cs702 · 2024-02-29T14:19:34

Thank you. Others on this thread have addressed the citation-trail issues you raise. I just want to tell you how helpful I find your comment about why ternary weights ought to work at all without degrading performance:

> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

Your guess sounds and feels right to me, even if currently there's no way to express it formally, with the rigor it deserves.

Thank you again for your comment!

mjcohen · 2024-02-29T04:51:21

IIRC, Hamming's book "Digital Filters" (1989) has a section on FFTs with only the sign of the coefficient being used. It performed surprisingly well.

mitthrowaway2 · 2024-02-29T07:36:20

What is the sign of a complex number? Do you mean the phase?

nine_k · 2024-02-29T14:45:45

AFAICT, both the real and imaginary components are from (-1, 0, +1) only. No single sign, but only 8 directions and the center.

thomasahle · 2024-02-29T06:22:06

You mean Fast Hadamard Transform?

fabmilo · 2024-02-29T03:13:04

They train using Straight Through Estimator but is cited in the previous BitNet paper. What happen to the TrueNorth Chip? I think investing in specialized hardware for AI is a good bet.

paul_mk1 · 2024-02-29T04:31:24

Nice to know there is a trail to relevant citations. I missed the BitNet paper and need to catch up.

Btw TrueNorth project evolved into "NorthPole" chip by the same group, and was recently in the press. From afar NorthPole looks like an interesting design point and leverages on-chip memory (SRAM)--so it's targeting speed and efficiency at the expense of memory density (so perhaps like Groq in some respects). Tbh I haven't followed the field closely after leaving the group.

WhitneyLand · 2024-02-29T00:56:46

That’s really interesting to see the breadcrumb trail goes back that far.

So what are the most important insights in this paper compared to what was previously done?

I assume there’s more context to the story and it’s not just that no one thought to apply the concepts to LLM’s until now?

paul_mk1 · 2024-02-29T04:40:42

I don't think there is anything conceptually new in this work, other than it is applied to LLMs.

But in fairness, getting these techniques to work at scale is no small feat. In my experience quantization aware training at these low bit depths was always finicky and required a very careful hand. I'd be interested to know if it has become easier to do, now that there are so many more parameters in LLMs.

In any case full kudos to the authors and I'm glad to see people continuing this work.

eru · 2024-02-29T01:41:17

You can probably apply the same techniques 'Deep neural networks are robust to weight binarization and other non-linear distortions' used to get to 0.68 bits / weight to get your ternary weights below one bit; so you can claim they are still one-bit networks.

WiSaGaN · 2024-02-29T15:03:21

Could the reason that 3 states in this case be more efficient than 2 states be that 3 is closer to 2.718... (Euler's number) than 2 is?

altruios · 2024-02-29T15:30:29

Why not have some layers/nodes/systems be 2 states and have others be 3... couldn't you get arbitrarily close to Euler's number that way?

nxobject · 2024-02-29T00:21:27

As aside, I'm curious: what was it like to work at IBM research, especially as a legacy industrial research org?

antimatter15 · 2024-02-29T03:10:47

They cite straight through estimators in the previous work with many of the same authors on (actual binary) BitNet

vessenes · 2024-02-28T18:36:22

I'd be VERRY cautious about being excited here.

My priors are like this:

1. Initial training of a neural network moves all weights around a large amount at first.

2. Later training of the network adjusts them a small amount.

3. An undertrained network will therefore look a lot like figuring out "positive, negative, or 0?" for each node during early training.

If all these things are true, then

1. Early training of an fp16 network and a bitnet with 0 added will be roughly similar in results

2. Later training will yield different / worse results, as the network gets into the 'fine tuning' part of the training.

I think the paper's stats back these priors up -- they say "this works on (3B+) large networks, but not small ones." They then imply there's something about the structure of a large network that allows a bitnet to do well. It seems more likely to me it works on large networks because they have not put the compute into 3B+ networks to get past the 'gross tuning' phase.

The networks they have compute to put in to get them 'fully' trained -- those networks don't show the results.

Also, a quick reminder that Perplexity 12 is really terrible. You would not want to use such a network. Hopefully I'm wrong and we can get something for free here! But, I'm cautious - to - skeptical.

vessenes · 2024-02-28T20:41:18

Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.

pclmulqdq · 2024-02-28T23:38:32

I was pretty unhappy and suspicious for the same reason. Not reporting perplexity for a 70B network while reporting its efficiency means that someone did something and the result wasn't good enough to put in the paper.

GaggiX · 2024-02-28T23:45:41

According to the author, the 70B model is not fully trained.

pclmulqdq · 2024-02-29T01:06:35

"Is not fully trained" can also mean "we did not figure out how to reach an acceptable loss" or "training was unstable," both of which are common for ML systems.

GaggiX · 2024-02-29T02:09:59

It probably means that the model is not fully trained, because it is very expensive to train a 70B model, not even Mamba or RWKV have a model that comes close to that size, the leeriness is just kinda silly honestly.

bick_nyers · 2024-02-29T13:40:34

Extraordinary claims require extraordinary evidence.

That's not to say that a 70B model is necessary, but surely something larger than 3B is doable, especially given that the results of the paper directly imply a significant reduction in memory requirements for training such a model.

edflsafoiewq · 2024-02-29T14:20:15

> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high precision latent weights in addition to the binarized weights used in the forward pass?

danielmarkbruce · 2024-02-29T21:55:03

Yes. The optimizer is keeping a higher precision copy. It's likely slower and requires more memory than an equivalent full precision model when it comes to training. I'd also imagine it requires a multiple of epochs to get one epoch equivalent because the forward pass will need several goes to get the right choice between three states, rather than just moving a little bit in the right direction.

pclmulqdq · 2024-02-29T14:38:38

Most research universities have the resources to train a ~10B parameter model, at least.

GaggiX · 2024-02-29T13:53:00

For sure bigger models are needed to compete with transformer LLM, same thing for Mamba, I was just bothered by the distrust about something very reasonable like not being able to fully train a 70B model.

kristjansson · 2024-02-29T16:39:44

One can forgive the lack of quality results for the 70B model, but apparently they trained 7B and 13B versions of their model, and don't report those either.

svantana · 2024-02-28T19:59:09

Wait, are we reading the same paper? What I'm seeing is comparable accuracy to unquantized models for <4B params, and nothing reported for larger models except resource consumption.

vessenes · 2024-02-28T20:40:24

Nope, you're right, I got the table inverted in my head. I'm updating my top comment.

gradascent · 2024-02-28T23:26:32

Then perhaps a method emerges out of this to make training faster (but not inference) - do early training on highly quantized (even ternary) weights, and then swap out the weights for fp16 or something and fine-tune? Might save $$$ in training large models.

cs702 · 2024-02-28T20:35:48

Thank you. Your key point -- that so far all models with the proposed methods may have been only "grossly trained" -- is compelling. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That seems sensible to me, and makes replication easier, but I agree we need more to see extensive testing, after more extensive pretraining, on models of larger sizes.

gliptic · 2024-02-28T20:51:17

They also trained 3B with 2 trillion tokens.

> The number of training tokens is a crucial factor for LLMs. To test the scalability of BitNet b1.58 in terms of tokens, we trained a BitNet b1.58 model with 2T tokens following the data recipe of StableLM-3B [ TBMR], which is the state-of-the-art open-source 3B model.

> [..]

> Our findings shows that BitNet b1.58 achieves a superior performance on all end tasks, indicating that 1.58-bit LLMs also have strong generalization capabilities.

craq · 2024-02-29T19:31:28

And I was hoping to agree on this, but there is no 'SOTA StableLM-3b' with 2T tokens. Which is a big gap in the paper, because StableLM 3B is trained on 1T tokens for 4 epochs. And the benchmarks they report far exceed the benchmarks shown in the paper. You can find them in the official StableLM git and compare to the results in the paper https://github.com/Stability-AI/StableLM?tab=readme-ov-file#...

cs702 · 2024-02-28T21:58:44

You're right. Thank you for pointing that out!

mise_en_place · 2024-02-28T18:47:48

Intuitively I've always been a bit skeptical of quantization. Wouldn't there be a tiny loss in precision by doing this type of quantization? I could imagine the error function increasing by utilizing these types of techniques.

thesz · 2024-02-28T22:53:56

John Carmack pointed out (and I learned it here at HN) that what training really needs is the *sign" of each individual gradient parameter. I.e., you can quantize gradient to -1, 0 and 1 and still have neural network learn much of the dataset.

Solvency · 2024-02-29T00:34:23

Why isn't John Carmack working for OpenAI? Hell, why did he waste years at Meta to work on a VR headset and NOT AI? He even announced he wants to focus on AGI but he missed out on literally all the action.

igleria · 2024-02-29T09:34:49

he has his own AGI startup now https://dallasinnovates.com/john-carmacks-keen-technologies-...

TBH I think they won't get anywhere. Doing good game engine work... why that would translate to AGI?

bkydcmpr2 · 2024-03-10T07:22:10

That game engine was over 3 decades ago! John is one of the sharpest minds I've ever seen, if he's passionate on AGI, he surely has much deeper understanding what he's doing than the AI trendies on social media.

thesz · 2024-02-29T21:42:39

Let me introduce you to the wonderful game that is The Talos Principle: https://en.wikipedia.org/wiki/The_Talos_Principle

It discusses whether it is possible to evolve AGi using... computer game engine! And that is John's bread and butter.

farhanhubble · 2024-02-29T02:21:35

Wow! Is there a link to read up more on this?

thesz · 2024-02-29T07:01:05

  > It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.

https://forums.fast.ai/t/how-to-do-reproducible-models-and-u...

danielmarkbruce · 2024-02-29T21:57:23

They seem to be doing training with higher precision. The optimizer is keeping a copy.

eightysixfour · 2024-02-28T19:54:27

It does increase the “error” (meaning it is less likely to predict the next word when compared against a dataset) but the losses are lower than your intuition would guide you to believe.

int_19h · 2024-02-28T19:58:23

Quantization does reduce quality of the outputs. But the point is that you save enough memory doing so that you can cram a larger model into the same hardware, and this more than compensates for lost precision.

spencerchubb · 2024-02-28T21:48:29

Yes each weight will not be able to "learn" as much if it has less bits of precision. But the idea is that you can use more weights, and the big question is whether these low-precision weights can make the model more accurate, as a whole.

gliptic · 2024-02-28T20:07:48

> Also, a quick reminder that Perplexity 12 is really terrible.

The 3B model had a perplexity of 9.91, less than LLaMa 1 in fp16.

nutanc · 2024-02-28T14:48:26

We have been experimenting with the paper(https://www.researchgate.net/publication/372834606_ON_NON-IT...).

There is a mathematical proof that binary representation is enough to capture the latent space. And in fact we don't even need to do "training" to get that representation.

The practical application we tried out for this algorithm was to create an alternate space for mpnet embeddings of Wikipedia paragraphs. Using Bit embedding we are able to represent 36 million passages of Wikipedia in 2GB.(https://gpt3experiments.substack.com/p/building-a-vector-dat...)

SushiHippie · 2024-02-28T18:29:39

Wow, this works better than I would've thought.

> Who moderates Hacker News?

First result:

> Hacker News

> At the end of March 2014, Graham stepped away from his leadership role at Y Combinator, leaving Hacker News administration in the hands of other staff members. The site is currently moderated by Daniel Gackle who posts under the username "dang".

khimaros · 2024-03-01T17:18:58

how did you test this?

SushiHippie · 2024-03-01T18:25:50

First link in the substack article

https://speech-kws.ozonetel.com/wiki

cs702 · 2024-02-28T16:37:57

You're talking about mapping floating-point vector representations, i.e., embeddings, computed by a pretrained LLM to binary vector representations, right? And you're talking about doing this by first having someone else's pretrained LLM compute the embeddings, right? Sorry, but that seems only minimally, tangentially related to the topic of running LLMs in ternary space. I don't see how your comment is relevant to the discussion here.

nutanc · 2024-02-28T16:57:19

Yeah, sorry, needed a much bigger canvas than a comment to explain. Let me try again. The example I took was to show mapping from one space to another space and it may have just come across as not learning anything. Yes. You are right it was someone else's pretrained LLM. But this new space learnt the latent representations of the original embedding space. Now, instead of the original embedding space it could also have been some image representation or some audio representation. Even neural networks take input in X space and learn a representation in Y space. The paper shows that any layer of a neural network can in fact be replaced with a set of planes and we can represent a space using those planes and that those planes can be created in a non iterative way. Not sure if I am being clear, but have written a small blog post to show for MNIST how an NN creates the planes(https://gpt3experiments.substack.com/p/understanding-neural-...). Will write more on how once these planes are written, how we can use a bit representation instead of floating point values to get similar accuracy in prediction and next how we can draw those planes without the iterative training process.

pests · 2024-02-28T23:28:18

> how we can draw those planes without the iterative training process.

Sounds interesting, but this is the part I would need more explanation on.

Just started reading your linked blog, I see it goes into some details there.

nutanc · 2024-02-29T03:45:37

Will add a lot more details next week. Have been postponing it for a long time.

fabmilo · 2024-02-28T21:27:01

I find this extremely interesting. Do you share the source code of the process? any more references?

nutanc · 2024-02-29T03:49:39

Unfortunately the source code is currently not open sourced. Some more details at (https://www.researchgate.net/publication/370980395_A_NEURAL_...), the source code is built on top of this.

The approach is used to solve other problems and papers have been published under https://www.researchgate.net/profile/K-Eswaran

We are currently trying a build a full fledged LLM using just this approach(no LLM training etc) and also an ASR. We should have something to share in a couple of months.

licnep · 2024-02-29T11:52:04

Am I missing something or is this just a linear transformation?

It says here ( https://www.researchgate.net/publication/370980395_A_NEURAL_... ) that each layer can be represented as a matrix multiplication (equation 3): Ax = s

So concatenating multiple layers could just be reduced to a single matrix multiplication?

If there is no non-linearity I don't see how this could replace neural networks, or am I missing something?

nutanc · 2024-03-01T03:48:02

The attempt is not to replace a particular neural network which has already been trained by using Sigmoid or Rel functions. If one does this then one would necessarily have to use non-linear maps. The whole point is that such a non-linear technique is not necessary for classifications. It is not necessary to confine clusters by hyperplanes for solving a classification problem. Our focus is on individual points.

We believe the brain does not do nonlinear maps!

m3kw9 · 2024-02-28T18:39:00

How is this not lossy compression?

sandyarmstrong · 2024-02-28T19:01:29

LLMs and vector embeddings are always lossy compression, yes?

eru · 2024-02-29T01:54:04

Almost always. Though you can use them in a lossless compression system, too, with a few tricks.

stolsvik · 2024-03-03T09:35:25

.. but you don't want to tell us?

eru · 2024-03-04T05:43:37

Two possible implementation:

(1) Take your data as a stream. Use your machine learning gadget to give you the (predicted) probability for each of the possible next tokens. Then use those probability in arithmetic coding to specify which token actually came next.

(2) Take your data D. Apply lossy compression to it. Store the result L := lossy(D). Also compute the residue R := D - uncompress(L). If your lossy compression is good, R will be mostly zeroes (and only a few actually differences), so it will compress well with a lossless compression algorithm.

Approach (1) is a more sophisticated version of (2). None of this is anything I came up with, those approaches are well known.

See eg https://arxiv.org/abs/2306.04050 and https://en.wikipedia.org/wiki/Audio_Lossless_Coding or https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049... (Probably not the best links, but something I could find quickly.)

baq · 2024-02-28T19:25:18

kind of related: https://medium.com/@heinrichpeters/commentary-gzip-knn-beats...

rf15 · 2024-02-28T18:58:39

It kind of is!

creshal · 2024-02-28T14:36:03

> * In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

Easier said than done, of course, and very impressive that they pulled it off.

> In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value

I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

cs702 · 2024-02-28T14:58:49

> Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

I find it shocking that we don't even need lower floating-point precision. We don't need precision at all. We only need three symbols to represent every value.

> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

I find it shocking. Consider that associative addition over ternary digits, or trits, represented by three symbols (a,b,c) has only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter), and only three possible outputs, a, b, or c. Matrix multiplications could be executed via crazy-cheap tritwise operations in hardware. Maybe ternary hardware[a] will become a thing in AI?

---

[a] https://en.wikipedia.org/wiki/Ternary_computer

jerf · 2024-02-28T15:28:35

An integer is just a concatenation of bits. Floating point appears more complicated but from an information theory perspective it is also just a concatenation of bits. If, for the sake of argument, one replaced a 64-bit int with 64 individual bits, that's really the same amount of information and a structure could hypothetically then either choose to recreate the original 64-bit int, or use the 64-bits more efficiently by choosing from the much larger set of possibilities of ways to use such resources.

Trits are helpful for neural nets, though, since they really love signs and they need a 0.

So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.

This may seem an absolutely bizarre zigzag, but I am reminded of Busy Beavers, because of the way they take very the very small primitives of a Turing Machine, break it down to the smallest pieces, then combine them in ways that almost immediately cease to be humanly comprehensible. Completely different selection mechanism for what appears, but it turns out Turing Machine states can do a lot "more" than you might think simply by looking at human-designed TMs. We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.

cs702 · 2024-02-28T16:51:18

> So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.

Thank you. I find many other things interesting here, including the potential implications for hardware, but otherwise, yes, I agree with you, that is interesting.

SkyBelow · 2024-02-28T20:15:32

This sort of breakdown also reminds me of the explanation of why busy beavers grow faster than anything humans can ever define. Anything a human can define is a finite number of steps that can be represented by some turing machine of size M. A turning machine of size N > M can then use M as a subset of it, growing faster than than the turing machine of size M. Either it is the busy beaver for size N, or it grows slower than the busy beaver for size N. Either way, the busy beaver for size N grows faster than whatever the human defined that was captured by the turning machine of size M. This explanation was what helped me understand why busy beavers is faster growing than any operator that can be formally defined (obviously you can define an operator that references busy beaver itself, but busy beaver can be considered to not be formally defined, and thus any operator defined used it isn't formally defined either).

The bit about floating point numbers just being a collection of bits interpreted in a certain way helps make sense why a bigger model doesn't need floating points at all.

eru · 2024-02-29T01:57:20

> We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.

Yes. Though here the interesting point is not so much that these structures exist, but that 'stupid' back-propagation is smart enough to find them.

You can't find busy beavers like that.

jxy · 2024-02-28T15:35:07

The matrices (weights) are ternary.

The vectors are not.

cs702 · 2024-02-28T15:41:09

The activations are in (-1, 1), so they're also representable by (-1, 0, 1).

rfoo · 2024-02-29T08:09:33

This is wrong. The paper described that their activation is in int8 during inference.

That being said, before-LLM-era deep learning already had low bit quantization down to 1w2f [0] working back in 2016 [1]. So it's certainly possible it would work for LLM too.

[0] 1-bit weights, 2-bit activations; though practically people deployed 2w4f instead. [1] https://arxiv.org/abs/1606.06160

cs702 · 2024-02-29T15:04:35

EDIT: Embarrassingly, on the last paragraph I got the number of possible input pairs wrong:

> only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter)

The correct number, ignoring order, is six pairs, because we have to include (a,a), (b,b), and (c,c).

p1esk · 2024-02-29T01:40:54

If you find three symbols per weight shocking, this paper should completely blow your mind: https://arxiv.org/abs/1803.03764

I admit it did shock me when it came out.

satellite2 · 2024-02-28T15:06:07

Because it's no longer a linear optimization or curve fitting problem. It becomes a voting or combinatorial problem. Which at least in my mind are two completely different areas of research.

HPsquared · 2024-02-28T15:46:53

With enough parameters, it probably starts looking continuous again. Like how in physics everything is quantised at the smallest scale but if you put enough atoms together it all smooths out and behaves "classically".

amelius · 2024-02-28T21:20:01

Yes, but we can simulate classical physics using mathematical shortcuts. Simulating every little atom would take a lot more work.

gemeral · 2024-02-29T06:40:55

> and blowing up parameter count to make up for it

based on (an admittedly rapid and indulgent reading of the paper), it seems like they're not increasing the parameter size. Do you mind pointing out where the blowup is occurring?

magnustesshu · 2024-02-29T23:00:16

They're saying that likely, models of comparable size will perform worse (the paper claims as good)

But since they are (optimized up to 8 or 10x if packing terns beyond 2 bits, in practice it seems 3-5x considering larger other structures needed in memory) more memory efficient, the largest models can be that much larger.

SuchAnonMuchWow · 2024-02-29T12:06:18

> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

No, unless I'm mistaken it's a huge impact: it means the matrix product is separable: basically, it's a O(n²) algorithm, and not O(n3): add together all the c_j = sum(a_i_j), d_i = sum(b_i_j), and the final results are all the combinations of cj+di. And even then, half that is unnecessary because the d_i can all be pre-computed when before inference since they are weights.

But I skimmed over the paper, and didn't found the part where it was explained how they replace the product by additions: from what I understand, they remplace multiplications by bi by selecting +ai, 0, or -ai. So the final matrix multiplication can be implemented by only additions, but only because the weights are 1,0,-1 they avoid multiplications altogether. This is really different from what the GP said (remplacing a0*b0+... by a0+b0+...).

ncruces · 2024-02-28T18:49:05

Well I guess it's the “blowing up parameter count to make up for it” that confuses me, but maybe it's just ignorance.

Like what would be the expected factor of this blow up to make up the difference between ternary and whatever 16 bits encoding they were using?

I mean intuitively I'd expect to need ~10× the symbols to encode the same information? Are they using an order of magnitude more parameters, or is that not how it works?

int_19h · 2024-02-28T20:01:03

With existing common quantization techniques, a 70b model quantized to 3-bit still drastically outperforms an unquantized 35b model.

p1esk · 2024-02-29T01:38:31

Are you sure? I was under impression that 3b quantization still results in a significant degradation. Which quantization method are you talking about?

int_19h · 2024-02-29T03:40:00

It does result in a significant degradation relative to unquantized model of the same size, but even with simple llama.cpp K-quantization, it's still worth it all the way down to 2-bit. The chart in this llama.cpp PR speaks for itself:

https://github.com/ggerganov/llama.cpp/pull/1684#issue-17396...

p1esk · 2024-02-29T04:55:42

Oh wow, you’re right. Though it seems that they are using very small weight group sizes: either 16 or 32 (fp16 scaling factor per group). In this paper it seems there’s no weights grouping, so it’s a bit apples to oranges.

Noe2097 · 2024-02-28T22:17:06

There is another _shocking_ realization in this work: there are 11 types of people: those who know what binary means, those who don't, and those who say they do but actually don't.

"The era of 1-bit LLMs"

Representing { -1, 0, 1 } can't be done with 1-bit, I'm sorry -- and sad, please let's all get back to something vaguely sound and rigorous.

npunt · 2024-02-28T22:20:57

Ternary supporters are always bitter about this

(I'll let myself out)

gpderetta · 2024-02-29T16:05:05

There are 10 types of people, those who don't know binary, those who do and those who know ternary.

hk__2 · 2024-02-29T17:07:16

> please let's all get back to something vaguely sound and rigorous

Something rigorous would be to actually read the paper rather than stop at the first part of its title. The authors are not claiming their LLM is 1-bit.

esrauch · 2024-02-29T01:52:53

One trit but that's not a word anyone knows.

baq · 2024-02-29T06:17:23

That used to be true yesterday…

jandrese · 2024-02-28T16:23:35

It seems like the AI space is slowly coming back around to the old Thinking Machines CM-1 architecture. It's not too often in computing where you see ideas a full 40 years ahead of their time make it into production.

giantrobot · 2024-02-28T22:58:06

IIUC the main issue with the CM-1 architecture was feeding the processor cluster with data. That required a heftier front end system than was practical/affordable at the time. With modern CPUs and memory subsystems the GPUs can be saturated pretty easily. So going back to huge clusters of super narrow cores won't starve them for work.

theendisney · 2024-02-29T03:10:16

Memristors any moment now

TMWNN · 2024-02-29T14:48:36

I'm holding out for Josephson junctions

abeppu · 2024-02-28T16:43:06

> On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

Did they actually show absence of performance degradation?

I think it's conspicuous that Table 1 and Table 2 in the paper, which show perplexity and accuracy results respectively, are only for small model sizes, whereas Figure 2, Figure 3 (latency, memory, energy consumption) and Table 3 (throughput) all show larger model sizes. So it seems like they had every opportunity to show the perplexity/accuracy comparisons at the larger model sizes, but did not include them.

cs702 · 2024-02-28T16:46:03

Others have already made the same point in this thread. See my response here: https://news.ycombinator.com/item?id=39539508

flockonus · 2024-02-28T20:23:25

Considering how much faster additions are processed, and how a particular silicon chip could be optimized for this very specific case; all parts added together perhaps could show >100x speed up vs current systems.

I must concur, "wow".

Nevermark · 2024-02-29T02:43:49

For hardware, 2-argument ternary additions and multiplications should be very close in terms of the tiny circuit required for either.

If you are doing ternary calculations on 32/16-bit hardware, then the additions would be simpler.

p1esk · 2024-02-28T21:51:43

Ternary networks have been used since 2015. There are hundreds of papers. They all require full QAT (training from scratch). Not sure why you’re shocked.

cs702 · 2024-02-29T19:11:41

Because it's not just the use ternary values. It's also that there are no dot-products; there are only additions. And when we apply both changes to existing LLMs, there's no performance degradation (as tested by the authors).

rhaps0dy · 2024-02-28T15:11:31

I think you need more evidence than this paper (which is very short and light on actual numbers) to be this shocked.

For example, most of the plots in the paper are actually of throughput, memory, etc. all performance characteristics that are better on the ternary version. Which, of course.

The only thing that contains perplexities are Table 1 and 2. There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA LLM in various sizes" on the RedPajama data set. The first thing to note is the perplexities are very high: they're all at least ~9.9, which compared for example with quantized Llama on wikitext-2 which is 6.15 (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-llamacp...). Maybe RedPajama is a lot harder than wikitext-2, but that's a big gap.

I think probably their benchmark (their "reproduced FP16 LLaMA LLM") is just not very good. They didn't invest much in training their baseline and so they handily beat it.

cs702 · 2024-02-28T15:19:03

Thank you. I think the paper as it is provides enough evidence to support the claims. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That's sensible. It allows for easier replication of the results. Otherwise, I agree with you that more extensive testing, after more extensive pretraining, is still necessary.

craq · 2024-02-29T19:40:04

And that's true, but why do they limit it to 100B tokens? And why not provide the loss curves in the end to show that both models have converged? What's not proven to me, in this paper, is the ability of the model to scale and generalize to bigger datasets. It's easy to see how a model of sufficient size can overcome the quantization bottleneck, when trained on such a small dataset. Which is perhaps why smaller variations failed.

fzliu · 2024-02-28T22:34:36

This will be big for FPGAs - adders are extremely cheap compared to multipliers and other DSP blocks.

eru · 2024-02-29T01:59:10

Multipliers for eg 8 bit or 4 bit floating point values should also be pretty cheap? (I assume multipliers have a cost that grows quadratically with the number of bits?)

imtringued · 2024-02-29T14:36:04

You use DSPs for that. Effinix has direct bfloat16 support in their FPGAs. The real game changer is using the carry chain with your LUT based adders. Assuming 16 LUTs, you could be getting 11 teraops out of a Ti180 using a few watts. Of course that is just a theoretical number though but I could imagine using four FPGAs for speech recognition and synthesis and vision based LLMs operating in real time.

phkahler · 2024-02-29T14:39:10

>> we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value

Thinking out loud here. If you encode 64 weights in 2 64-bit words you can have the bits in one word indicating +1 if they're 1, and the bits in the other word indicating -1 if they are 1. You should be able to do the "products" with a few boolean operations on these 2 words to get a pair of 64 bit words for the result. Then summing becomes a matter of using a count-of-1's instruction on each word and subtracting the "negative" count from the positive. If AVX instructions can do this too, it seems like equivalent of 10-100 TOPS might be possible on a multi-core CPU.

cs702 · 2024-02-29T15:00:35

Yes. More generally, this will enable implementation via crazy-cheap bit-wise ops in binary hardware, and possibly, maybe, via crazy-cheap trit-wise ops in ternary hardware that manipulates ternary digits, or trits. Note that any binary op over trits has only nine possible (trit, trit) input pairs and only three possible trit outputs. Maybe ternary hardware for AI will become a thing?

phkahler · 2024-02-29T15:21:14

Fleshing out my thought above. If we want to multiply A*B = C and all operands are stored in 2 separate bits Ap and An (Ap = 1 if A = +1 while An = 1 if A = -1). We can do a product with:

Cp = (Ap & Bp) | (An & Bn)

Cn = (An & Bp) | (Ap & Bn)

So 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in six instructions using AVX512. If you can execute 2 instructions at a time on different words, this becomes 1024 "products" in 6 cycles or between 0.5 and 1 TOP per core.

The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that but its still a fast way to "sum" individual bits. I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.

rep_lodsb · 2024-02-29T20:42:02

Another way would be to use one register for "zero" vs. "non-zero", and another for negative (basically 2 bit sign-magnitude representation).

    C_sgn = A_sgn ^ B_sgn
    C_mag = A_mag & B_mag

The result can then be converted into bitmasks for positive and negative:

    C_plus = C_mag & ~C_sgn
    C_minus = C_mag & C_sgn

This solution should be more efficient if there is an "AND NOT" instruction, or when multiplying more than two factors.

rep_lodsb · 2024-02-29T22:48:25

Thinking a bit more about this, you could eliminate the conversion and do

    sum = popcount(mag) - 2*popcount(mag & sgn)

cs702 · 2024-02-29T15:24:49

> I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.

Yes, I agree. This still needs to be more extensively tested.

beagle3 · 2024-02-28T16:23:14

I haven’t been keeping tabs, but this seems very much like RIP / Achilioptas version of the Johnson Lindenstrauss lemma.

Perhaps the rest of the JL lemma promise applies as well - compressing the number of parameters by a few orders of magnitude as well.

lr1970 · 2024-02-28T15:47:00

Authors reported perplexity only for small up to 3B weights models. On the other hand, they reported throughput for 70B model, but not its performance (perplexity, end-to-end tasks). Very unfortunate omission. Overall, the paper is rather poorly written.

cs702 · 2024-02-28T15:49:11

If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That's sensible. It allows for easier replication of the results. Otherwise, I agree with you that more extensive testing, after more extensive pretraining, at larger model sizes, is still necessary.

lr1970 · 2024-02-28T17:06:16

towards the end of the paper they mentioned training on 2T tokens.

cs702 · 2024-02-28T21:06:34

You're right. Thank you for pointing that out.

bjornsing · 2024-02-29T07:16:10

> * In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.

Aren’t you over complicating it a bit here? A dot product between a vector of activations (a₁, a₂, …) and a vector of ternary weights (b₁, b₂, …) can of course be computed as the sum of all activations for which the weight is 1, minus the sum of all activations for which the weight is -1.

It can’t however be computed as (a₁+b₁ + a₂+b₂ ...). You must have gotten that wrong.

PaulHoule · 2024-02-28T21:44:55

I am not startled at all. Dense vector representations are pretty silly, they can’t really be the road to knowledge representation.

jsnelgro · 2024-02-29T18:11:13

It all seems too good to be true but your comment helped me develop a mental model for how this could work.

The most inspiring aspect to me here is just realizing how much potential low-hanging fruit there is in this space! What other seemingly naïve optimizations are there to try out?

acchow · 2024-02-28T21:52:31

Conversely, this also implies our current model sizes can still embed a ton more “understanding”

tbalsam · 2024-02-29T16:31:50

It's not too surprising, honestly! I've poked around with similar in the past and am of a perspective that ternary is a very good thing for a lot of neural networks.

Training CIFAR-10 speedily w/ ternary weights on an fp16 interface (using fp16 buffers, and norm params unchanged): https://gist.github.com/tysam-code/a43c0fab332e50163b74141bc...

fragmede · 2024-02-29T00:10:59

> In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

does that mean we can do integer instead of floating point math for some parts of the training? that seems like a really big win

AaronFriel · 2024-02-28T22:21:55

In undergrad, some of us math majors would joke that there's really only three quantities: 0, 1, infinity.

So, do we need the -1, and/or would a 2.32 bit (5 state, or 6 with +/-0) LLM perform better than a 1.58 bit LLM?

api · 2024-02-29T14:35:02

Question is whether you can train in this domain or whether you need increased precision to properly represent gradients.

If we could train in this domain it would be an even bigger game changer.

sva_ · 2024-02-28T21:38:57

I'm also curious about the potential speed gains in automatic differentiation, as there are way less branches to 'go up'. Or am I wrong here?

lumost · 2024-02-28T21:42:10

They actually use a relu to represent the model weights. But I'm not convinced that this can't be avoided. We do gradient boosted decision tree training without this trick.

chrsw · 2024-02-29T15:44:38

It almost seems too good to be true

verytrivial · 2024-02-29T01:36:29

> If the proposed methods are implemented in hardware

.. And the paper is _true_ of course, indeed, this sort of compounding quantum leap in efficiency due to representational change starts to get towards the Black Mirror / SciFi foundational mythology level of acceleration. Wild (if true!)

eru · 2024-02-29T01:59:55

Slight tangent: in physics a quantum leap is the smallest possible change.

anon373839 · 2024-02-28T11:47:46

> BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.

> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.

> • 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.

> • 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.

> • 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.

This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.

Does it seem at all likely that existing models could be converted?

btbuildem · 2024-02-28T13:12:16

Discussion on HF [1] implies that no, conversion is not helpful. It would take training the model from scratch.

1: https://huggingface.co/papers/2402.17764

anon373839 · 2024-02-28T15:19:47

It’s a pity if realizing these gains absolutely requires full pre-training from scratch. I imagine more than a few people will at least try to find a way to repurpose the knowledge contained in existing models.

cooljoseph · 2024-02-28T21:29:59

You can also have another model "mentor" a new model you are teaching to speed up training. You don't have to start from scratch with zero knowledge. This is done a lot in what are called distillations.

eru · 2024-02-29T02:01:33

You can also re-use a lot of the infrastructure. Eg you can re-use your training data.

fnordpiglet · 2024-02-29T01:06:00

This came out a little bit ago, my open question is if this approach can be used to port weights between architectures like this.

https://arxiv.org/abs/2402.13144

accurrent · 2024-02-28T11:52:53

They seem to be using LLAMA. Might be worth trying out. Their conversion formula seems stupidly simple.

wongarsu · 2024-02-28T12:06:36

However they trained their models from scratch, which is also why they only have meaningful numbers for 700M, 1.3B, 3B and 3.9B models. Apparently they are following BitNet's approach of replacing linear layers with quantized layers during training? If it was trivial to convert existing models without performance loss I would have expected them to include a benchmark of that somewhere in the paper to generate even more impact.

imjonse · 2024-02-28T12:09:54

They present numbers for 7B to 70B models as well.

anon373839 · 2024-02-28T12:19:33

Those numbers are for cost only, not performance. It’s not clear they actually trained a 70B vs. just using randomly initialized parameters.

sp332 · 2024-02-28T12:18:29

They do not have perplexity numbers for the larger models (see Table 2), only speed and memory benchmarks.

imjonse · 2024-02-28T13:11:45

You're both right, I skimmed the paper, saw large model numbers but didn't notice it was for speed. On the HF page they say those models are being trained.

https://huggingface.co/papers/2402.17764

"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."

FrustratedMonky · 2024-02-28T13:06:12

Yes. I wonder then how long before someone that does have a lot of compute power like OpenAI/MS, or others, can rapidly pivot and try this out on some even larger models.

Doesn't this mean that current big players can rapidly expand by huge multiples in size.?

ignoramous · 2024-02-28T13:09:45

I wonder if 1bit quantization is the main reason why pplx.ai is faster than any other RAG or chatbot. For instance, Gemini in comparison is a turtle, though it is better at explanations, while pplx is concise.

vitorgrs · 2024-02-29T05:43:54

Nop. The model on Perplexity is a finetuned GPT 3.5 (the free one). And the paid versons, well, you can choose between GPT4 (not turbo), Gemini pro, Claude, etc.

You can choose their model ("Experimental"), but is not faster than the other models.

All of these, proprietary models are fast on Perplexity. I do guess they are using some insane cache system, better API infrastructure...

refulgentis · 2024-02-29T01:04:43

Absolutely not, 1 bit isn't even real yet. perplexity does a ton of precaching, TL;Dr every novel query is an opportunity to cache: each web page response, the response turned into embeddings, and the LLM response. That's also why I hate it, it's just a rushed version of RAG with roughly the same privacy guarantees any incumbent would have given you in last 15 years (read: none, and gleefully will exploit yours while saying "whoops!")

osigurdson · 2024-02-28T12:53:13

I have often mused that, in some ways, it seems like the transistor is really being wasted in AI applications. We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range? Basically, re-think the role of the transistor and re-design from the ground up - maybe NAND gates are not the ideal fundamental building block here?

sigmoid10 · 2024-02-28T13:03:35

People are working on that [1]. In some sense, it's a step back to analog computing. Add/multiply is possible to do directly in memory with voltages, but it's less versatile (and stable) than digital computing. So you can't do all calculations in a neural network that way, meaning some digital components will always be necessary. But I'm pretty sure analog will make a comeback for AI chips sooner or later.

[1] https://www.nature.com/articles/s41586-023-06337-5

zcw100 · 2024-02-28T14:15:32

Reminds me of my father saying something about how vacuum tubes are great integrators.

monocasa · 2024-02-28T15:48:04

Chips are too. Opamps can add, multiply, subtract, divide, integrate and differentiate depending on how they're plugged in.

klysm · 2024-02-28T18:46:35

Hence the name 'operational' amplifier

trebligdivad · 2024-02-28T17:51:28

Trinary however is an interesting middle; people have built trinary hardware long ago; it feels like you could make natively trinary hardware for something like this; it might even be quite a win.

int_19h · 2024-02-28T20:02:35

People haven't built reliable ternary electronics, though. Soviets tried with Setun, but they eventually had to resort to emulating each trit with two hardware bits (and wasting one state out of the possible four).

eru · 2024-02-29T02:11:53

If you are are using two bits anyway, you might as well represent (-2, -1, 0, 1) instead of ternary?

int_19h · 2024-02-29T03:31:06

Sure, but then you lose the symmetry that makes trits so convenient for many things.

thsksbd · 2024-02-28T18:24:11

Can you make a "CMOS" three voltage level circuit though? One where the only current flow is when the state changes?

Im not in this field but that's a question that's been bugging me for a while. Off you can't do this wouldn't energy consumption balloon?

neomantra · 2024-02-29T12:41:23

My friend was working on this in the mid-90s at Texas Instruments. Not sure what the underlying semiconductors were, but it did involve making ternary logic via voltage levels. Just searched a bit and found this TI datasheet which might be an example of it (high logic, low logic, high impedance), but maybe not: https://www.ti.com/lit/ds/symlink/sn74act534.pdf

irrelative · 2024-02-28T16:34:49

Hadn't thought about it this way before, but given that LLMs are auto regressive (use their own data for next data), they're sensitive to error drift in ways that are rather similar to analog computers.

eru · 2024-02-29T02:09:13

Analog computing for neural networks is always very tempting.

> We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range?

Transistors that are fully closed or fully open use basically no energy: they either have approximately zero current or approximately zero resistance.

Transistors that are partially open dissipate a lot of energy; because they have some current flowing at some resistance. They get hot.

In addition, modern transistors are so small and so fast that the number of electrons (or holes..) flowing through them in clock cycle is perhaps in the range of a few dozen to a hundred. So that gives you at most 7 bits (~log_2(128)) of precision to work with in an analog setting. In practice, quite a bit less because there's a lot of thermal noise. Say perhaps 4 bits.

Going from 1 bit per transistor to 4 bits (of analog precision) is not worth the drastically higher energy consumption nor the deviation from the mainstream of semi-conductor technological advances.

baq · 2024-02-29T08:02:25

As someone who knows almost nothing about electronics I assume you’d want a transistor which can open in two ways: with positive and negative voltage. I’ve seen TNAND built out of normal transistors, not sure if such exotic ones would help even if they were physically possible.

eru · 2024-03-01T08:46:37

That's for building ternary gates. They are still discrete, so it might be possible to do something here.

I was talking about analogue computing.

gryn · 2024-02-28T13:20:12

the reason why digital/numeric processing won is the power loss in the analog world. when design an analog circuit the next processing stage you add at the end has impact on the ones before it.

this then require a higher skill from the engineers/consumers.

if you want to avoid that you need to add op-amps with a gain of 1 at the boundary of each one, this also that care of the power loss at each stage.

the other part is that there's a limit of to the amount of useful information/computation you can do with analog processing too once you take into account voltage noise. when you do a comparison there are stages where analog win but also place where where digital wins.

I'll edit later this with a link to some papers that discuss these topics if I manage to find them in my mess.

dazed_confused · 2024-02-28T17:58:58

Good explanation. When I was working at a semiconductor manufacturer, our thresholds were like 0 - 0.2V to 0.8 - 1.0V. Additionally, if you look at QLC SSDs, their longevity is hugely degraded. Analog computing is non-trivial, to say the least.

im3w1l · 2024-02-28T18:09:16

For the specific case of neural networks they seem to be very resistant to noise. That's why quantization works in the first place.

eru · 2024-02-29T02:11:02

You also have literal power losses, as in waste heat, to deal with.

See https://news.ycombinator.com/item?id=39545817

loudmax · 2024-02-28T13:08:15

The Veritasium Youtube channel did a video about this about a year ago: https://www.youtube.com/watch?v=GVsUOuSjvcg

They visit Texas company Mythic AI to discuss how they use flash memory for machine learning. There's a California company named Syntiant doing something similar.

dwightboyyy · 2024-02-29T02:06:37

I was thinking of this exact video, crazy to think that the principle is gaining momentum

StableAlkyne · 2024-02-28T14:33:35

It would be something of a full circle I feel went back to dedicated circuits for NNs - that's how they began life when Rosenblatt built his Perceptron.

I remember reading a review on the history in grad school (can't remember the paper) where the author stated that one of the initial interests in NNs by the military was their distributed nature. Even back then, people realized you could remove a neuron or break a connection and they would still work (and even today, dropout is a way of regularizing the network). The thinking was that being able to build a computer or automated device that could be damaged (radiation flipping bits, an impact destroying part of the circuit, etc) and still work would be an advantage given the perceived inevitably of nuclear war.

Compared to a normal von Neumann machine which is very fault intolerant - remove the CPU and no processing, no memory=no useful calculation, etc. One reason people may have avoided further attempts at physical neural networks is it's intrinsically more complex than von Neumann, since now your processing and memory is intertwined (the NN is the processor and the program and the memory at the same time).

kurisufag · 2024-02-28T15:09:34

>von Braun machine

von neumann? though it is funny to imagine von braun inventing computer architecture as a side hustle to inventing rocket science.

StableAlkyne · 2024-02-28T15:19:33

Oh fuck, thanks for catching that!

intalentive · 2024-02-29T15:08:38

The US military’s interest in network robustness led to the internet if I’m not mistaken.

Also preceding the perceptron was the McCulloch & Pitts neuron, which is basically a digital gate. NNs and computing indeed have a long history together.

mikewarot · 2024-02-28T21:02:51

>maybe NAND gates are not the ideal fundamental building block here?

It's my long held opinion that LUTs (Look Up Tables) are the basis of computation for the future. I've been pondering this for a long time since George Gilder told us that wasting transistors was the winning strategy. What could be more wasteful than just making a huge grid of LUTs that all interconnect, with NO routing hardware?

As time goes by, the idea seems to have more and more merit. Imagine a grid of 4x4 bit look up tables, each connected to its neighbors, and clocked in 2 phases, to prevent race conditions. You eliminate the high speed long lines across chips that cause so much grief (except the clock signals, and bits to load the tables, which don't happen often).

What you lose in performance (in terms of latency), you make up for with the homogenous architecture that is easy to think about, can route around bad cells, and be compiled to almost instantly, thanks to the lack of special cases. You also don't ever have to worry about latency, it's constant.

phdelightful · 2024-02-28T21:29:23

It’s been a long time since I worked on FPGAs, but it sounds like FPGAs! What do you see as the main differences?

mikewarot · 2024-02-28T21:44:15

No routing, no fast lines that cut across the chip, which cut way down on latency, but make FPGAs harder to build, and especially hard to compile to once you want to use them.

All that routing hardware, and the special function units featured in many FPGAs are something you have to optimize the usage of, and route to. You end up with using solvers, simulated annealing, etc... instead of a straight compile to binary expressions, and mapping to the grid.

Latency minimization is the key to getting a design to run fast in an FPGA. In a BitGrid, you know the clock speed, you know the latency by just counting the steps in the graph. BitGrid performance is determined by how many answers/second you can get from a given chip. If you had a 1 Ghz rack of BitGrid chips that could run GPT-4, with a latency of 1 mSec per token, you'd think that was horrible, but you could run a million such streams in parallel.

wakawaka28 · 2024-02-28T13:07:27

I have heard of people trying to build analog AI devices but that seems like years ago, and no news has come out about it in recent times. Maybe it is harder than it seems. I bet it is expensive to regulate voltage so precisely and it's not a flexible enough scheme to be support training neural networks like we have now, which are highly reconfigurable. I've also heard of people trying to use analog computing for more mundane things. But no devices have hit the market after so many years so I'm assuming it is a super hard problem, maybe even intractible.

osigurdson · 2024-02-28T13:25:21

Perhaps another variation on the idea is to allow a higher error rate. For example, if a 0.01% error rate was acceptable in AI, perhaps the voltage range between states could be lowered (which has a quadratic relationship to power consumption) and clock speed could increase.

the8472 · 2024-02-28T14:56:22

Bits are copyable without data loss. Analog properties of individual transistors are less so.

eru · 2024-02-29T02:14:11

Yes, but the whole point of the link submitted to HN here is that in some applications, like machine learning, precision doesn't matter too much.

(However, analog computing is still a bad fit for machine learning, because it requires a lot more power.)

the8472 · 2024-02-29T13:45:56

Exact copies aren't just about precision but also about reproducibility.

eru · 2024-03-01T11:28:40

You can keep your weights in a discrete format for storage, but do inference and training in analog.

the8472 · 2024-03-02T11:08:46

That only prevents analog copy degradation. It doesn't give you reproducibility. Reproducibility means running the same process twice with the same inputs and getting the same outputs. E.g. to later prove that something came from an LLM and not a human you could store the random seed and the input and then reproduce the output. But that only works if the network is digital.

busfahrer · 2024-02-29T14:11:09

This reminds me of this article[1] recently linked on HN, talking about how Intel had an analog chip for neural nets in the 90s, if I understood correctly

[1] https://thechipletter.substack.com/p/john-c-dvorak-on-intels...

MagicMoonlight · 2024-02-29T12:20:56

It’s going to be funny if it turns out biology was right all along and we end up just copying it.

BlueTemplar · 2024-02-28T13:36:55

I have heard that the first commercial neural network chip (by Intel, in the 90s) was analog ?

kromem · 2024-02-28T23:29:51

It sure looks like this might pair well with ternary optical computing advances:

https://ieeexplore.ieee.org/document/9720446

barrenko · 2024-02-28T13:48:54

Hmm, maybe some (signaling) inspiration from biology other than neural signaling.

drexlspivey · 2024-02-28T18:48:03

Next Up: Quantum AI

seydor · 2024-02-28T18:42:38

let's use cells

Razengan · 2024-02-28T19:48:49

We already do.

adrianN · 2024-02-28T13:01:09

You could call them connection machine and perhaps have an llm trained on Feynman help with the design.

w-m · 2024-02-28T14:36:43

I was reading Exposing Floating Point today (as Airfoil is on the HN front page and I was perusing the archive of the author). It's a blog explaining the inner workings of floating point representations. About zero values it says [0]:

> Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing value too small to be represented in a float. For instance -10e-30f / 10e30f won’t fit in a float, however, it will produce the value of -0.0.

The authors of the LLM paper use the values {-1, 0, -1}. Connecting the two ideas, I'm now wondering whether having a 2-bit {-1, -0, 0, 1} representation might have any benefit over the proposed 1.58 bits. Could the additional -0 carry some pseudo-gradient information, ("the 0 leaning towards the negative side")?

Also, I've seen 2-bit quantizations being proposed in other LLM quantization papers. What values are they using?

[0] https://ciechanow.ski/exposing-floating-point/#zero

creshal · 2024-02-28T14:50:38

> Could the additional -0 carry some pseudo-gradient information, ("the 0 leaning towards the negative side")?

Probably, but is it worth the cost? One of the goals behind BitNet and this paper is to find a way to implement LLMs as efficiently in hardware as possible, and foregoing floating point semantics is a big part of it. I'm not sure if there's a way to encode -0 that doesn't throw out half the performance gains.

SushiHippie · 2024-02-28T18:46:22

But if I understand it correctly, they already need to use 2 bits, one for the sign and another one for the value, so there is already one wasted state, which could be used for -0.

pennomi · 2024-02-28T22:08:27

You can pack two trits into three bits, however. So one byte could hold 5 values instead of 4.

threatripper · 2024-02-28T22:54:25

How exactly would you do that? 3 states need 1.58 bits which is a tad more than 1.5. Two 3-states have 3²=9 states while three bits only give you 2³=8 states.