QuIP#: 2-bit Quantization for LLMs (cornell-relaxml.github.io)
201 points by jasondavies 5 months ago | 59 comments



Just to make sure I'm understanding this correctly.

This paper signals that the authors have found a way to run Llama 2 70B, but with roughly 1/8th the VRAM requirements of the original fp16 model, right?

And the output is on par with the original on some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?


Your last paragraph is the key. Without their improvements, the quality tradeoff would have been hard to stomach.


When considering quality, you should really compare not to the original model but to smaller models that use the same VRAM/compute.


But people use large models for their better metrics; this defeats the purpose.



If this quantization method works with smaller models, it would enable running up to 33B models with only 12GB VRAM.

Especially important for democratizing access to Mistral's new MoE model.
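
Back-of-the-envelope, just parameter-count arithmetic (ignoring KV cache, activations, and per-group metadata, so treat it as a lower bound):

    # rough weight-memory estimate at different bit widths
    params = 33e9
    for bits_per_weight in (16, 4, 2):
        gb = params * bits_per_weight / 8 / 1e9
        print(f"{bits_per_weight:>2}-bit: ~{gb:.2f} GB")
    # 16-bit ~66 GB, 4-bit ~16.5 GB, 2-bit ~8.25 GB -> the last leaves headroom in 12 GB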


IIRC quantizing small models causes a higher relative drop in the metrics.


One of the best papers I've read in a long time. This could be huge.


For quantization, you should always verify directly on your own intended tasks rather than trust that quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.


Since a pixel can have more states than a binary digit, could you get more storage and compute by leveraging RGBA space for data/compute instead of binary?

Maybe a stupid question.


A pixel takes more than 1 bit to store, too


On a video card? I'm not sure that is true w.r.t. the geometry and charge.


Yes. Common bit depths used to represent pixels range up to 64 bits.


But the physical space?


3840 × 2160 gives 8,294,400 pixels in total.

log2(8,294,400) ≈ 22.98 (each additional bit doubles the number of unique addresses that can be represented),

so 23 bits are the minimum required to locate a point in 2-D space at "4K" resolution. Add a color value on top of that (e.g. 24-36 bits) and you're in the 47-59 bits-per-pixel range.

19 bits for 640x480.
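
Quick sanity check of those counts (nothing deep, just ceil(log2) in Python):

    from math import ceil, log2

    for w, h in ((3840, 2160), (640, 480)):
        pixels = w * h
        print(f"{w}x{h}: {pixels} pixels -> {ceil(log2(pixels))} address bits")
    # 3840x2160: 8294400 pixels -> 23 address bits
    # 640x480:   307200 pixels -> 19 address bits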


Is it possible to think of something as a quarter bit with RGBA?


Can you explain what you mean?


Found this https://news.ycombinator.com/item?id=27409371

Kinda related. Especially the comments.


I’m a layperson when it comes to this topic, but does this mean every value in the network is a value from 00 to 11? I.e., 00, 01, 10, and 11?

I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything really.

Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to only 2 bits?


It’s better to think about mapping groups of weights to an entry in a large codebook, and then using clever methods to do weight lookups.

2 bits of precision per weight is perfectly fine as long as you have enough weights. The information encoded by a neural network is measured in total number of bits, so you can compress it by either reducing number of weights or reducing number of bits per weight.
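
A toy sketch of the codebook idea in numpy (generic vector quantization, not QuIP#'s actual lattice codebook): each group of weights is replaced by the index of its nearest codebook entry, and dequantization is just a table lookup.

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.standard_normal((1024, 4))      # 4 weights per group
    codebook = rng.standard_normal((256, 4))      # 256 entries -> 8 bits per group = 2 bits/weight

    # quantize: store only the index of the nearest codebook entry per group
    dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1).astype(np.uint8)

    # dequantize: a table lookup recovers an approximation of the original group
    recovered = codebook[codes]
    print(codes.nbytes, "bytes stored vs", weights.nbytes, "bytes original")

In a real scheme the codebook is fitted to the weight distribution (and QuIP# does extra incoherence processing before quantizing), but the storage math is the same: one small index per group plus one shared codebook.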


Can anyone comment on running the 2-bit quantized Llama 70B on consumer cards like the 4090?


It's been on my list to do a proper shootout of all the various new quant formats floating around (my list here: https://llm-tracker.info/books/llms/page/quantization-overvi...) but a lot of them don't have very good production code yet (eg, a few months ago, when I tried OmniQuant, some of the important bits of code weren't even included and had to be gotten directly from the authors: https://llm-tracker.info/books/llms/page/omniquant).

If you're looking for the best widely deployed quant format atm, it's probably ExLlamaV2's EXL2 - it supports arbitrary bpw with a calibration file, plus an 8-bit KV cache. I haven't tested EXL2 much at lower bpws though.

Note, both llama.cpp and AirLLM allow layer offloading to system memory (or in AirLLM's case, even to disk?!).

r/LocalLlama is probably the best place to search if you're looking for people's experiences with quants. I know some people have been testing, like: https://www.reddit.com/r/LocalLLaMA/comments/17klaa5/tested_...


> https://llm-tracker.info/books/llms/page/quantization-overvi...

This is a very cool resource, thanks!

Gems like this, even in areas I follow pretty closely, are why I keep coming back to HN.


i humbly request you to add mlc-llm to your quant test when/if you get around to doing it


Sure, I think their quant format is pretty basic, something similar to bnb q4 - my plan is to script a framework for testing, so I should do that as well, since the OmniQuant implementation is in mlc-llm anyway.


i was trying to get this to work with mlc-llm. i'd appreciate any pointers


more specifically on a non-cuda gpu - mali on orangepi via opencl


I have quantized Llama-2-chat-70b to Q2_K with llama.cpp, then run it on a 4090 with 50/83 layers offloaded to the GPU. I was able to get 1.3 tokens per second. It answers this question, which smaller models have trouble with:

<s>[INST]Hello. The functions addition, multiplication, and exponentiation can be considered each as a higher order of the previous. For example, 2+2+2 = 2*3 which is significant because there are three twos, which is the multiplication factor. Likewise 2*2*2=2^3, again, the power number represents the count of the numbers being multiplied. What is this concept and what is the next higher order function?[/INST] The concept you are referring to is called the "iterative" or "recursive" nature of arithmetic operations. Each operation can be seen as a repeated application of the previous operation, with the multiplication factor representing the number of times the operation is applied.

For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation, where the multiplicand is 2 and the multiplier is 3 (i.e., 2*3 = 6, which is the result of the addition). Similarly, the multiplication of two numbers, 2*2, can be seen as a repeated application of the exponentiation operation, where the base is 2 and the power is 3 (i.e., 2^3 = 8, which is the result of the multiplication).

The next higher order function would be the "tetration" operation, which is the iterative application of the exponential function. In other words, tetration(n,x) = x^(x^(x^(...(x^n)...))), where n is the number of times the exponential function is applied.

For example, tetration(2,2) = 2^(2^(2^(2))), which can be simplified to 2^16, or 65536.

It's worth noting that this concept can be extended even further to higher levels of abstraction, such as "hyper-operations" and "super -operations", but these are more advanced and less commonly used concepts in mathematics.*


Note that it got all the math wrong. 2+2 is a repetition of the succession operator, not multiplication; 2+2 equals 4, not 6; multiplication of two numbers is repeated addition, not exponentiation; 2*2 equals 4, not 8; and tetration(2,2) is 4, not 65536. This is actually the invariant in that n-ation(2,2) equals 4 for all natural numbers n.
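
For the curious, the hierarchy the prompt was fishing for is the hyperoperation sequence. A direct transcription of the standard recursion in Python also confirms the invariant above:

    def hyper(n, a, b):
        """n=0 succession, 1 addition, 2 multiplication, 3 exponentiation, 4 tetration, ..."""
        if n == 0:
            return b + 1
        if b == 0:
            return a if n == 1 else (0 if n == 2 else 1)
        return hyper(n - 1, a, hyper(n, a, b - 1))

    print([hyper(n, 2, 2) for n in range(1, 6)])  # [4, 4, 4, 4, 4]
    print(hyper(4, 2, 3))                         # tetration(2,3) = 2^(2^2) = 16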


To be clear, this is not using QuIP#, and is thus not particularly relevant to the discussion at hand.



It is Llama-2-70b-chat. I quantized it to Q2_K using `quantize` from llama.cpp.


So your experience isn’t representative of the work presented by this post? Or does llama.cpp use the same technique for quantization?


I don't know, hopefully it will help set general expectations.


The answer seems kind of low quality, but it got further than others. It could probably self correct.


I don’t know too many humans who could answer that half as well.


"For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation,"

I bet most humans ever born did not know what the word "multiplication" meant. Not sure how your metric is useful.


Sometimes, to get some perspective, I also zoom out and calibrate "normal" against the estimated ~120,000,000,000 people who have ever lived.


I think the point went over your head (which just further proves mine).

Hint: Perhaps read the original output more carefully: "For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation,". Or are you counted in the majority of that 120,000,000,000?


Paris

in the

the spring


To be fair, it is mostly a memory/recall test: "Define this."


Better output than the smaller llamas in my limited testing, but it's surprisingly slow:

Output generated in 101.74 seconds (0.98 tokens/s, 100 tokens, context 82, seed 532878022)

Output generated in 515.46 seconds (0.99 tokens/s, 511 tokens, context 27, seed 660997525)

Checking nvidia-smi, it stalls at ~130 W (out of ~470 W max) power usage, ~25% GPU usage and ~10% memory-bandwidth usage. There's quite a lot of traffic on the PCI bus though, and the Python process is stable at 100% usage of one core. GPU possibly limited by something handled in Python? Pausing the GPU-accelerated video decoding of a Twitch stream, it gets a surprisingly large boost:

Output generated in 380.42 seconds (1.34 tokens/s, 511 tokens, context 26, seed 648992918)


How does this 2-bit quantization method compare to HQQ which was posted yesterday?

https://news.ycombinator.com/item?id=38563537


Does LM Studio support it?

By the way, what’s your favorite easy-to-use LLM front end?


LM Studio, sort of. Unfortunately my MacBook is incapable of running all but the smallest models. It would be great if LM Studio could connect to a remote model running on a GPU server that I can rent.


You could set up something similar using gradio quite easily:

https://www.jerpint.io/blog/model-inference/
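
Something along these lines is enough to expose a model on a rented GPU box and get a shareable UI/endpoint (a minimal sketch; run_model is a placeholder for whatever backend you actually serve, e.g. llama.cpp or ExLlama bindings):

    import gradio as gr

    def run_model(prompt: str) -> str:
        # placeholder: call your real backend here
        return f"(echo) {prompt}"

    demo = gr.Interface(fn=run_model, inputs="text", outputs="text")
    # share=True tunnels a temporary public URL from the GPU server
    demo.launch(server_name="0.0.0.0", share=True)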


I wonder whether quantization to 1-bit would be functional?


If we get 1-bit quantization, wouldn't it be basically a bunch of nested if's and else's?


No. The catch is that what's quantized here is not individual weights but groups of them.


So, just like the rest of my code then


So a decision tree? They are good for some tasks I reckon


There was a paper from the Allen Institute from around 2017 successfully using 1-bit quantization, but I can't find it right now. We started using it where I was working at the time, but I'm no longer there, so I don't know how it all turned out.


I predict that viable less-than-1-bit quantization methods will eventually be found. Of course, the bits-per-weight figure should be interpreted as an average: e.g., group-quantizing 32 weights into 24 bits would be 0.75 bits per weight.


Those networks are known as discrete neural networks. There is already research on them, mainly for (homomorphic) encryption purposes, because it's much easier to run NNs of just 0s and 1s under homomorphic encryption than normal NNs.


Functional? Sure, it will work. Useful? Unlikely. It would be surprising to see quantization of pretrained models surpass the existing research around binary NNs (trained from scratch as 1-bit while using full precision for the most critical parts).
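
For intuition, the classic trick from that line of work (XNOR-Net style) keeps one full-precision scale per filter and binarizes the rest; roughly, in numpy:

    import numpy as np

    def binarize(w):
        # W ~ alpha * sign(W): 1 bit per weight plus one float scale per row/filter
        alpha = np.abs(w).mean(axis=1, keepdims=True)
        signs = np.where(w >= 0, 1.0, -1.0)
        return alpha * signs

    w = np.random.default_rng(0).standard_normal((8, 16))
    print(np.abs(w - binarize(w)).mean())  # error of the 1-bit approximation

Training such a network from scratch additionally needs tricks like the straight-through estimator for gradients, which is where most of the difficulty lives.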



Can someone answer CS 101 questions about this please.

I know there are other methods related to matrix factorization, but I’m asking specifically about quantization.

Does quantization literally mean the weight matrix floats are being represented using fewer bits than the 64 bit standard?

Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren't CPU registers still 64-bit? Are these floats converted back to 64-bit for math, or is there some clever packing technique where a 64-bit float actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?


This is for GPUs, not CPUs. GPUs do have lower precision ALUs to do math on fewer bits. Though not 2 bits - I believe there’s support for 1, 4 and 8 bit computation in modern Nvidia cards.

But even without such support there’s a benefit of model size compression so that bigger models can fit in GPU memory, eliminating costly CPU/GPU data transfers.
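
On the "clever packing" part of the question: yes, low-bit weights are stored several to a word and unpacked (or looked up) on the fly at matmul time. A toy 2-bit packing in numpy, with a hypothetical per-group scale (not any particular library's layout):

    import numpy as np

    codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)   # 2-bit indices
    levels = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)  # what each code means
    scale = 0.07                                                 # hypothetical per-group scale

    # pack four 2-bit codes into each byte
    packed = codes[0::4] | (codes[1::4] << 2) | (codes[2::4] << 4) | (codes[3::4] << 6)

    # unpack and dequantize on the fly
    unpacked = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    weights = scale * levels[unpacked]
    assert np.array_equal(unpacked, codes)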


Yes but no. The actual values represented by the quantized bits don't use a representation akin to IEEE floating point, but they are able to act like floating-point values thanks to mathematical transformations applied during propagation. The floating-point values a quantized value corresponds to are chosen using some kind of precomputation, depending on the quantization method.



