Just to make sure I'm understanding this correctly.
This paper signals that the authors have found a way to run Llama 2 70B, but with 1/8th the VRAM requirements compared to the original model, right?
And the output is on par with the original along some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?
For quantization, you should always verify directly on your own intended tasks rather than trusting that the quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.
It’s better to think about mapping groups of weights to an entry in a large codebook, and then using clever methods to do weight lookups.
2 bits of precision per weight is perfectly fine as long as you have enough weights. The information encoded by a neural network is measured in the total number of bits, so you can compress it either by reducing the number of weights or by reducing the number of bits per weight.
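To make the codebook idea concrete, here's a toy NumPy sketch (the sizes and the brute-force nearest-neighbour codebook are made up purely for illustration, not any specific paper's scheme): each group of 8 weights is replaced by a 1-byte index into a 256-entry codebook, i.e. 1 bit per weight on average.

    import numpy as np

    # Toy codebook ("vector") quantization: map each group of 8 weights to the
    # index of its nearest entry in a 256-entry codebook, so storage costs
    # log2(256) / 8 = 1 bit per weight on average (ignoring the codebook itself).
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((256, 256)).astype(np.float32)

    group_size = 8
    codebook = rng.standard_normal((256, group_size)).astype(np.float32)  # would be learned in practice

    groups = weights.reshape(-1, group_size)
    dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1).astype(np.uint8)          # 1 byte per group of 8 weights

    dequantized = codebook[codes].reshape(weights.shape)   # "weight lookup" at inference time
    print(np.log2(len(codebook)) / group_size)             # 1.0 bits per weight

The "clever methods" part is mostly about how the codebook is learned and how the lookups get fused into the matmul kernels, which the brute-force search above glosses over.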
If you're looking for the best widely deployed quant format atm, it's probably ExLlamaV2's EXL2 - it supports arbitrary bpw with a calibration file, and also has 8-bit kv-cache support. I haven't tested EXL2 much at lower bpws though.
Note, both llama.cpp and AirLLM allow layer offloading to system memory (or in AirLLM's case, even to disk?!).
Sure, I think their quant format is pretty basic, something similar to bnb q4 - my plan is to script a framework for testing, so I should do that as well since the omniquant implementation is in mlc-llm anyways.
I have quantized Llama-2-chat-70b to Q2_K with llama.cpp, then run it on a 4090 with 50 of 83 layers offloaded. I was able to get 1.3 tokens per second. It answers this question, which smaller models have trouble with:
<s>[INST]Hello. The functions addition, multiplication, and exponentiation can be considered each as a higher order of the previous. For example, 2+2+2 = 2*3 which is significant because there are three twos, which is the multiplication factor. Likewise 2*2*2=2^3, again, the power number represents the count of the numbers being multiplied. What is this concept and what is the next higher order function?[/INST]
The concept you are referring to is called the "iterative" or "recursive" nature of arithmetic operations. Each operation can be seen as a repeated application of the previous operation, with the multiplication factor representing the number of times the operation is applied.
For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation, where the multiplicand is 2 and the multiplier is 3 (i.e., 2*3 = 6, which is the result of the addition). Similarly, the multiplication of two numbers,
2*2, can be seen as a repeated application of the exponentiation operation, where the base is 2 and the power is 3 (i.e., 2^3 = 8, which is the result of the multiplication).
The next higher order function would be the "tetration" operation, which is the iterative application of the exponential function. In other words, tetration(n,x) = x^(x^(x^(...(x^n)...))), where n is the number of times the exponential function is applied.
For example, tetration(2,2) = 2^(2^(2^(2))), which can be simplified to 2^16, or 65536.
It's worth noting that this concept can be extended even further to higher levels of abstraction, such as "hyper-operations" and "super
-operations", but these are more advanced and less commonly used concepts in mathematics.*
Note that it got all the math wrong. 2+2 is a repetition of the succession operator, not multiplication; 2+2 equals 4, not 6; multiplication of two numbers is repeated addition, not exponentiation; 2*2 equals 4, not 8; and tetration(2,2) is 4, not 65536. This is actually an invariant: n-ation(2,2) equals 4 for all natural numbers n.
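A quick worked check of that invariant, writing tetration with Knuth's up-arrow notation:

    2 + 2             = 4
    2 * 2  = 2 + 2    = 4
    2 ^ 2  = 2 * 2    = 4
    2 ↑↑ 2 = 2 ^ 2    = 4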
I think the point went over your head (which just further proves mine).
Hint: Perhaps read the original output more carefully: "For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation,". Or are you counted in the majority of that 120,000,000,000?
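Going back to the Q2_K / partial-offload setup above: for anyone who wants to reproduce it, here's a minimal sketch using the llama-cpp-python bindings (the model path and layer count are placeholders, and I haven't benchmarked this exact script):

    # pip install llama-cpp-python (built with CUDA support)
    from llama_cpp import Llama

    # Placeholder path to a Q2_K-quantized GGUF of Llama-2-70b-chat.
    llm = Llama(
        model_path="llama-2-70b-chat.Q2_K.gguf",
        n_gpu_layers=50,   # offload 50 of the model's layers to the 4090
        n_ctx=2048,
    )

    out = llm("[INST]What is the next higher-order function after exponentiation?[/INST]",
              max_tokens=256)
    print(out["choices"][0]["text"])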
Checking nvidia-smi, it stalls at ~130 W (out of ~470 W max) power usage, ~25% GPU usage and ~10% memory bandwidth usage. There's quite a lot of traffic on the PCIe bus though, and the Python process sits stable at 100% usage of one core. Is the GPU possibly limited by something handled in Python?
Pausing the GPU-accelerated video decoding of a Twitch stream, it gets a surprisingly large boost:
LM Studio, sort of. Unfortunately my MacBook is incapable of running all but the smallest models. It would be great if LM Studio could connect to a remote model running on a GPU server that I could rent.
There was a paper from the Allen institute from around 2017 successfully using 1 bit quantization but I can’t find it right now. We started using it where I was working at the time but I’m no longer there so I don’t know how it all turned out.
I predict that even less-than-1-bit viable quantization methods will be found eventually. Of course, the bits-per-weight figure should be interpreted as an average: e.g., group-quantizing 32 weights into 24 bits would be 0.75 bits per weight.
Those networks are known as discrete neural networks. There is already research on them, mainly for (homomorphic) encryption purposes (because it's much easier to homomorphically encrypt NNs of just 0s and 1s than normal NNs).
Functional? Sure, it will work. Useful? Unlikely. It would be surprising to see quantization of pretrained models surpass the existing research on binary NNs (trained from scratch at 1 bit while using full precision for the most critical parts).
Can someone answer CS 101 questions about this, please?
I know there are other methods related to matrix factorization, but I’m asking specifically about quantization.
Does quantization literally mean the weight matrix floats are being represented using fewer bits than the 64 bit standard?
Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren't CPU registers still 64 bit? Are these floats converted back to 64 bit for math, or is there some clever packing technique where a 64-bit float actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?
This is for GPUs, not CPUs. GPUs do have lower precision ALUs to do math on fewer bits. Though not 2 bits - I believe there’s support for 1, 4 and 8 bit computation in modern Nvidia cards.
But even without such support there’s a benefit of model size compression so that bigger models can fit in GPU memory, eliminating costly CPU/GPU data transfers.
Yes but no. The actual values represented by the quantized bits don't use a representation akin to IEEE floating point, but they are able to act like floating-point values due to mathematical transformations during propagation. The floating-point values that a quantized value corresponds to are chosen using some kind of precomputation, depending on the quantization method.
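To sketch the packing idea from the question above (a simplified affine scheme for illustration, not what any particular library does exactly): several low-bit integer codes get packed into one machine word, and each code maps back to a float via a precomputed scale and zero-point.

    import numpy as np

    # Simplified 4-bit affine quantization of a block of float32 weights.
    w = np.random.randn(64).astype(np.float32)

    # Precompute scale/zero-point so codes 0..15 span the block's range.
    scale = (w.max() - w.min()) / 15
    zero = w.min()
    codes = np.clip(np.round((w - zero) / scale), 0, 15).astype(np.uint8)

    # Pack two 4-bit codes per byte (16 codes per 64-bit word - the "hacky SIMD" idea).
    packed = (codes[0::2] << 4) | codes[1::2]          # 32 bytes instead of 256

    # Dequantize: unpack and map codes back to approximate floats.
    hi, lo = packed >> 4, packed & 0x0F
    unpacked = np.empty_like(codes)
    unpacked[0::2], unpacked[1::2] = hi, lo
    w_approx = unpacked.astype(np.float32) * scale + zero

    print(np.abs(w - w_approx).max())   # quantization error, roughly scale/2

Real low-bit kernels either unpack to a wider type before the matmul or use the hardware's low-precision ALUs directly; the packing itself is just to keep the weights small in memory.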