Essentially, you lose some accuracy: there might be some weird answers, and it's probably more likely to go off the rails and hallucinate. But the quality loss is lower the more parameters you have, so for very large model sizes the differences might be negligible. Also, this is the cost of inference only. Training is a whole other beast and requires much more power.
Still, we are looking at GPT-3 level performance on one server rack. That says something when, less than a year ago, such AI was literally magic and only ran in a massive datacenter. Bandwidth and memory size are probably, to my ignorant mind, easier to increase than raw compute, so maybe we will soon actually have "smart" devices.
The dropdown at the top selects which comparison: Falcon compares GGML, Vicuna compares bitsandbytes. I have some more comparisons planned; feel free to open an issue if you'd like to see something specific: https://github.com/the-crypt-keeper/can-ai-code
I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."
We suck at evaluating and comparing models, imo. There are metrics and evaluation tasks, but it's still very subjective.
The closer we get to assessing human-like performance, the tougher it is, because it becomes more subjective and less deterministic by the nature of the task. I don't know the answer, but I do know that for the metrics we have, it's not easy to translate them into any idea of how the model will perform on some specific thing you might want to do with it.
Not mathematically, at the very least. Perplexity is the best empirical measure we have of how a model is doing over a test dataset (both pre- and post-quantization). It's enough to serve, usably, as the final word on how different quantization methods perform.
Subjective ratings are different, but for compression things are quite well defined.
> some specific thing you might want to do with the model.
I think this right here is the answer to measuring and comparing model performance.
Instead of trying to compare models holistically, we should be comparing them for specific problem sets and use cases... the same as we compare humans against one another.
Using people as an example, a hiring manager doesn't compare 2 people holistically, they compare 2 people based on how well they're expected to perform a certain task or set of tasks.
We should be measuring and comparing models discriminately rather than holistically.
You could have two models answer 100 questions the same way, and differ on the 101st. They’re unpredictable by nature - if we could accurately predict them we’d just use the predictions instead.
Even at T=0 and run deterministically, the answers still have "randomness" with respect to the exact prompt used. Change wording slightly and you've introduced randomness again even if the meaning doesn't change. It would be the same for a person.
For an llm, a trivial change in wording could produce a big change in answer, same as running it again with a new random seed. "Prompt engineering" is basically overfitting if not approached methodically. For example, it would be interesting to try deliberate permutations of an input that don't change the meaning and see how the answer changes as part of an evaluation.
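A rough sketch of that permutation idea: hold the meaning constant, vary the wording, and see how much the answers move. `generate` here is a hypothetical stand-in for whatever model call you actually use (llama-cpp-python, an HTTP API, etc.), and the paraphrases are made up.

```python
import difflib

def generate(prompt: str) -> str:
    # Stand-in for a real model call; replace with your own backend.
    return "stub answer for: " + prompt

paraphrases = [
    "List three possible uses for a brick.",
    "Name three things you could do with a brick.",
    "What are three ways a brick could be used?",
]

answers = [generate(p) for p in paraphrases]
for i in range(len(answers)):
    for j in range(i + 1, len(answers)):
        sim = difflib.SequenceMatcher(None, answers[i], answers[j]).ratio()
        print(f"prompt {i} vs {j}: similarity {sim:.2f}")
# Big swings on trivial rewordings suggest the "prompt engineering" gains
# were overfitting to one particular phrasing.
```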
But if T=0 and you use the exact same input (not a single word or position changes), do you get the same output? Reading your response, it sounds like the randomness comes from even slight changes.
As a sibling comment mentioned, threading on a GPU is not automatically deterministic, so you could get randomness from there, although I can't think of anything in the forward pass of a normal LLM that would depend on execution order. So yes, you should get the same output; it's basically just matrix multiplication. There may be some implementation details I don't know about that add other variability, though.
Look at this minimal implementation (Karpathy's) of LLaMA: the only randomness is in the "sample" function, which only comes into play at non-zero temperature; otherwise it's easy to see everything is deterministic: https://github.com/karpathy/llama2.c/blob/master/run.c
Otoh, with MoE like GPT-4 has, it can still vary at zero temperature.
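For reference, here is roughly what that sampling step looks like, sketched in Python rather than C (the logits are toy values, not taken from any real model):

```python
import math, random

def sample(logits, temperature, rng=random.Random(0)):
    # temperature == 0: greedy argmax -- same logits always give the same token.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # temperature > 0: softmax over scaled logits, then draw from the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]          # stand-in for the model's output
print(sample(logits, 0.0))        # always token 0
print(sample(logits, 0.8))        # depends on the RNG draw
```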
Some GPU operations give different results depending on the order they are done. This happens because floating point numbers are approximations and lose associativity. Requiring a strict order causes a big slowdown.
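A two-line illustration of that non-associativity (plain CPU doubles here, but the same effect is why reduction order matters on a GPU):

```python
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6
# Same inputs, different grouping, different result. A parallel sum whose
# grouping depends on thread scheduling can therefore change run to run,
# unless a fixed reduction order is enforced, which costs performance.
```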
Basically, the more you quantize with K-quant, the dumber the model gets. 2 bit llama 13B quant, for instance, is about as dumb as 7B F16, but the dropoff is not nearly as severe from 3-6 bits.
FWIW here's why perplexity is useful: it's a measure of uncertainty that can easily be compared between different sources. Perplexity k is like the uncertainty of a roll of a k-sided die. Here I think perplexity is per-token, and is measuring the likelihood of re-generating the strings in the test set.
So for the reduction in size given by (q4 -> q3), you get a 2% increase in the uncertainty. Now, that doesn't tell you which specific capabilities get worsened (or even if that's really considered a huge or tiny change), but it is a succinct description of general performance decreases.
If you want more fine-grained explanations of how generation of certain types of texts get clobbered, you would probably need to prepare datasets comprised of that type of string, and measure the perplexity delta on that subset. i.e.
dperplexity/dquantization(typed_inputs).
I think it might be more difficult to get a comprehensive sense of the qualitative differences in the other direction, e.g.
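A minimal sketch of that "perplexity delta on typed subsets" idea. The `model.token_logprobs(text)` interface is hypothetical, standing in for whatever scoring API you have:

```python
import math

def perplexity(model, texts):
    # exp of the average negative log-probability the model assigns to the
    # actual tokens of the sample texts.
    nll, n_tokens = 0.0, 0
    for text in texts:
        logprobs = model.token_logprobs(text)  # hypothetical scoring call
        nll -= sum(logprobs)
        n_tokens += len(logprobs)
    return math.exp(nll / n_tokens)

def quantization_deltas(full_model, quant_model, texts_by_type):
    # "dperplexity/dquantization(typed_inputs)": how much each category of
    # text degrades when you swap the full-precision model for the quant.
    return {
        kind: perplexity(quant_model, texts) - perplexity(full_model, texts)
        for kind, texts in texts_by_type.items()
    }
```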
The problem is that it's not consistent enough for a good demo. Not even two different models, but even two different fine tunes of the same base model may be wildly differently affected by quantization. It can range from making hardly a difference to complete garbage output.
Just the other day someone published ARC comparison results for different quants as well as the code for the harness that they used to easily run lm-eval against quants to your heart's content: https://www.reddit.com/r/LocalLLaMA/comments/15rh3op/effects...
>Still, we are looking at GPT3 level of performance on one server rack. That says something when less than a year ago, such AI was literally magic and only run on a massive datacenter.
I'm not sure what you mean by this. You've always been able to run GPT3 on a single server (your typical 8xA100).
8xA100 is technically a single server, but I think OP is talking about affordable and plentiful CPU hosts, or even relatively modest single GPU instances.
DGX boxes do not grow on trees, especially these days
Because 175B parameters (350GB for the weights in FP16, let's say a bit over 400GB for actual inference) fit very comfortably on 8xA100 (640GB VRAM total).
And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)
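The napkin math behind those numbers, assuming the 80GB A100 variant usually found in those 8-GPU boxes and a rough 20% headroom for KV cache and activations (both assumptions, not measurements):

```python
params = 175e9
fp16_weights_gb = params * 2 / 1e9       # 2 bytes per parameter -> 350 GB
inference_gb = fp16_weights_gb * 1.2     # rough headroom for KV cache etc. -> ~420 GB
node_vram_gb = 8 * 80                    # 8x A100 80GB -> 640 GB
print(fp16_weights_gb, round(inference_gb), node_vram_gb)   # 350.0 420 640
```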
Wtf does HGX mean? God, enough with the acronyms, people.
Please take an extra ten seconds to speak in proper human language!
You could save on the world's carbon footprint by reducing the number of times humans have to search for "what is NVIDIA HGX?" (or is it "what is AMD HGX"?) and then subsequently visit the websites to see if that's right or not.
Yes, there is a logarithmically-bound (or exponential if you're viewing it from another angle) falloff in the information lost in quantization. This comes from the non-uniform "value" of different weights. We can try to get around them with different methods, but at the end of the day, some parameters just hurt more to squeeze.
What is insane though is how far we've taken it. I remember when INT8 from NVIDIA seemed like a nigh-pipedream!
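To make "squeezing parameters" concrete, here is a naive round-to-nearest 4-bit quantizer for one block of weights (toy random data; real schemes like GPTQ or llama.cpp's k-quants are smarter precisely because some weights hurt more than others):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    # Naive absmax-scaled round-to-nearest quantization of one weight block.
    levels = 2 ** (bits - 1) - 1                         # 7 for signed 4-bit
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels - 1, levels).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)    # toy weight block
q, scale = quantize_rtn(w, bits=4)
err = np.abs(w - q * scale).mean()
print(f"mean abs error at 4 bits: {err:.5f}")
```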
Could this be why people recently say they see more weird results in ChatGPT? Maybe OpenAI is trying out different quantization methods for the GPT4 model(s) to reduce resource usage of ChatGPT.
I'd be more inclined to believe that they're dropping down to gpt-3.5-turbo based on some heuristic, and that's why sometimes it gives you "dumber" responses. If you can serve 5/10 requests with 3.5 by swapping only the "easy" messages out, you've just cut your costs by nearly half (3.5 is like 5% of the cost of 4).
Serving me ChatGPT 3.5 when I explicitly requested ChatGPT 4 sounds like a very bad move? They're not marketing it like "ChatGPT Basic" and "ChatGPT Pro".
"But what we found with these neural networks is, if you use 32 bits, they're just fine. And then you use 16 bits, and they're just fine. And then with eight bits, you need to use a couple of tricks and then it's just fine.
And now we find if you can go to four bits, and for some networks, that's much easier. For some networks, it's much more difficult, but then you need a couple more tricks. And so it seems they're much more robust."
That will be really interesting for FPGAs, because the current ones are basically oceans of 4-bit computers.
Yes, you can gang together a pair of 4LUTs to make a 5LUT, and a pair of 5LUTs to make a 6LUT, but you halve your parallelism each time you do that. OTOH you can't turn a 4LUT into a pair of 3LUTs on any currently-manufactured FPGA. It's simply the "quantum unit" of currently-available hardware -- and it's been that way for at least 15 years (Altera had 3LUTs back in the 2000s). There's no fundamental reason for the number 4 -- but it is a very, very deep local minimum for the current (non-AI) customers of FPGA vendors.
Interesting, how would that work? Are there any well-known examples?
Is it: the weights all happen to be where float is sparse, so quantization ends up increasing fidelity? Or is it more of a “worse is better” dropout-type situation?
I suspect it works as regularisation of the network. It usually happens when you train with quantisation instead of doing post-training quantisation, and I haven't seen that done with LLMs yet.
In my experience, it usually means Small Block Chevy, but in certain communities it means Single Board Computer, an older way of referring to devices like the Raspberry Pi.
I would elaborate and say: anywhere that your computer is resource constrained (RAM, processing power) but you still want to make up articles for your Amazon Affiliate blog.
In this context I'd assume SBC means Single Board Computer, such as a Raspberry Pi or one of the many imitators. The article itself mentions running LLaMa on a Pi 4.
The interesting implication of running an LLM on a single board computer is that it's a proof of concept for an LLM on a smartphone. If you have a model that can produce useful results on a Raspberry Pi, you have something that could potentially run on hundreds of millions of smartphones. I'm not sure what the use case is for running an LLM on your phone instead of in the cloud, but it opens some interesting possibilities. It depends just how useful such a small LLM could be.
- Token generation is serial and bandwidth bound, but prompt ingestion is not and runs in batches of 512+. Short tests are fast on pure CPU llama.cpp, but long prompting (such as with ongoing conversation) is extremely slow compared to other backends.
- Llama.cpp now has very good ~4 bit quantization that doesn't affect perplexity much. Q6_K almost has the same perplexity as FP16, but is still massively smaller.
- Batching (which llama.cpp largely ignores) is a big deal outside of personal deployments.
- The real magic of llama.cpp is model splitting. A small discrete GPU can completely offload prompt ingestion and part of the model inference. And it doesn't have to be an Nvidia GPU! There is no other backend that will do that so efficiently in the generative AI space.
- Hence the GPU backends (OpenCL, Metal, CUDA, soon ROCm and Vulkan) are the de facto way to run llama.cpp these days. Without them, I couldn't even run 70B on my desktop, or 33B on my (16GB RAM) laptop.
ROCm works now! I just set it up tonight on a 6900 XT with 16GB VRAM, running Wayland at the same time. The trick was using the opencl-amd package (somehow the rocm packages don't depend on OpenCL, but llama does, idk).
I'm astonished at the results I can get from the q6_K models.
Can you please share more info on this? I have a 6900xt "gathering dust" in a proxmox server - would like to try to do a passthrough to a vm and use it. Thank you in advance!
* llama.cpp now has GPU support including "CLBlast", which is what we need for this, so compile with LLAMA_CLBLAST=ON
* now you can run any model llama.cpp supports, so grab some ggml models that fit on the card from https://huggingface.co/TheBloke.
* Test it out with: ./main -t 30 -ngl 128 -m huginnv1.2.ggmlv3.q6_K.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
* You should see `BLAS = 1` in the llama.cpp output and you should get maybe 5 tokens per second on a 13b 6bit quantized ggml model.
* You can compile llama-cpp-python with the same arguments and get text-generation-ui to work as well, but there's a bit of dependency fighting to do it (see the Python sketch after this list).
* koboldcpp might be better, I just haven't tried it yet
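If you go the llama-cpp-python route, usage looks roughly like this; treat it as a sketch, since parameter names can drift between versions and the model path below is just the example file from the command above:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="huginnv1.2.ggmlv3.q6_K.bin",  # any ggml model you grabbed above
    n_gpu_layers=128,                         # offload layers, like -ngl 128
    n_ctx=2048,
)
out = llm(
    "### Instruction: Write a story about llamas\n### Response:",
    max_tokens=256,
    temperature=0.7,
    repeat_penalty=1.1,
)
print(out["choices"][0]["text"])
```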
Text gen UI is nice. Some specific niceties include pre-formatted instruct templates for popular models, good prompt caching from llama-cpp-python, and integration of a vector DB.
But it's also finicky, kind of unstable, and the dependencies are tricky.
Koboldcpp has other niceties, like some different generation parameters to tweak and some upstream features pulled in from PRs before the official llama.cpp release has them. The UI is nice, predating llama v1. It's standalone, dead simple to compile, and has integration with AI Horde, which is (IMO) a huge, essential feature.
I’d love to host something local but have been so overwhelmed by the rapid progress and every time I start looking I find a guide that inevitably has a “then plug in your OpenAI api key…” step which is a hard NOPE for me.
I have a few decent gpus but I’ve got no idea where to start…
- Run Koboldcpp with opencl (or rocm) with as many layers as you can manage on the GPU. If you use rocm, you need to install the rocm package from your linux distro (or direct from AMD on Windows).
- Access the UI over http. Switch to instruct mode and copy in the correct prompt formatting from the model download page.
- If you are feeling extra nice, get an AI Horde API key, contribute your idle time to the network, and try out other models from other hosts: https://lite.koboldai.net/#
Perplexity is a measure of how certain the model is of the next token. It's calculated by looking at the probabilities that the model calculates for the next token in a stream. If there are several choices for the next token with similar probabilities, that's telling you that the model is having a hard time telling what the right answer should be: the model is more perplexed, perplexity is higher. If there's a single option with a much higher implied probability than any other, that means the model is more certain, and perplexity is lower.
Note that this has nothing to do with whether the answer is objectively correct. It's just measuring how confident it is.
Reading this could make people believe it is computed from the probability distribution of the model alone.
To be clearer, it is the exponent of the average negative log probability that the model gives to the real tokens of a sample text[0]. Roughly, it relates to how strongly the model can predict the sample text. A perfect model would have perplexity one; a random model has a perplexity equal to the number of possible tokens; the worst model has infinite perplexity.
I think what you describe ("confidence about the next token") is the entropy of the model's output. A model can be very certain about the next token (its output has low entropy) but if it is usually wrong on the text you measure it against, it will have high perplexity. (For example when the model was trained only on children's books and you measure it on Wikipedia.)
Perplexity is measured by testing a language model on a known text. The model's output is a probability for every possible next word/token. The model is highly perplexed if it gave a low probability to the actual next token.
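A tiny numerical example of that distinction, using one next-token position (made-up three-word vocabulary and probabilities):

```python
import math

probs = {"cat": 0.9, "dog": 0.05, "the": 0.05}   # model's predicted distribution

# Entropy: how uncertain the model's own prediction is (low = confident).
entropy = -sum(p * math.log(p) for p in probs.values())

# Perplexity looks at the probability given to what the test text actually said.
actual_next = "the"
per_token_ppl = math.exp(-math.log(probs[actual_next]))   # 1 / 0.05 = 20

print(f"entropy: {entropy:.3f} (confident)")
print(f"per-token perplexity on the real continuation: {per_token_ppl:.0f}")
# Confident but wrong: low entropy, high perplexity.
```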
It's the phase where llama.cpp processes the input, as opposed to the phase where it's generating the response.
Speed there is particularly important. The response can be streamed word by word, so generation doesn't have to be particularly fast, but slow input processing leads to very noticeable latency.
This project's been a blast to work with. While it's written in C++, it provides a C interface to compile against which makes it especially easy to extend with Go, Python and other runtimes.
A few folks and I have been building a tool with it in Go for pulling & running multiple models, and serving them on a REST API: https://github.com/jmorganca/ollama
In a similar light, if you haven't checked it out, llama.cpp also has a pretty extensive "server" tool (in the examples directory in the repo) with a web UI and support for grammars (e.g. forcing the output to be JSON).
> A few folks and I have been building a tool with it in Go for pulling & running multiple models, and serving them on a REST API: https://github.com/jmorganca/ollama
Why was this tool created? What was the reason for creating such a tool? Why was it named this way?
Explanation for wishing to flag/block/bully basic simple questions:
Questions mean what they ask and nothing else. Any imagination of aggression or anything else related to emotions is misplaced.
As someone unfamiliar with the field, I've read the GitHub page and didn't find answers to things I was curious about regarding this project.
So out of curiosity I wrote simple basic questions:
“Why was this tool created? What was the reason for creating such a tool? Why was it named this way?”
Suggesting that if I didn't figure it out by myself, perhaps someone else who also wishes to join the field and is looking for an entry point to try something in practice would also be confused, and thus answering those simple basic questions would really help beginners.
With experience of entering many new fields, I know that usually there is a lot of confusion in the beginning about basic stuff.
I also suggested that those are obvious reasons why people ask questions, and I didn't expect to be 'bullied' for basic simple questions. There was not one word of aggression in the text. Perhaps seeing aggression there requires a certain level of misplaced imagination.
Usually there is a fun story about the name, for example GNU is Not Unix, etc. This is an opportunity to tell it if there is one. Sometimes the name holds some clue about functionality and then it's easier to remember. Sharing it can help beginners too.
> Especially the one about the name, is the name offensive or what? To me it sounds quite benign.
So why the simple question doesn’t sound benign to you?
Usually there is a fun story about the name, for example GNU is Not Unix, etc., and sometimes it holds some clue about functionality and then it's easier to remember.
Usually friendly advice is given after answering the question in a friendly way, which was not demonstrated here.
We are perfectly aware of how to deal with extra-sensitive people, and a kind approach is the way to go. However, on this site I expect a certain level of technical training and intelligence in answering direct questions, without being offended by a search for deeper understanding of some topic and without inventing emotions that were never there in the first place. I was also under the impression that the guidelines of this site directly encourage the kind of answering I expect by declaring "be kind", "don't be rude", "assume good faith", and unfortunately none of those were demonstrated by your answers, even when you learned that you were wrong and your accusations were completely ungrounded. The fact that such behaviour was encouraged by (as you claim) "multiple people", including the moderation of this site, tells a very sad story about this site following its own declared principles.
And this example:
“Which crypto? Where can I pay with it? What if I don't want to have a hardware wallet everywhere with me? Can I cancel transactions and get my money back? Does it use more energy per year than Argentina?” (https://news.ycombinator.com/item?id=36656667)
tells another sad story, about hypocrisy, which means it's hard to take "friendly" advice seriously. The next time you wish to give truly friendly advice, try to be kind in the first place, do not assume bad faith, and do not insist on false, ungrounded accusations encouraged by confirmation bias, especially when a clarification has been presented.
You could have just apologised for your own real aggression, as opposed to the imagined aggression in my questions, and simply given an answer while keeping the "friendly" advice to yourself.
That way people would learn something about the project rather than learning something about cancel "culture", which, under the flag of "being nice", demonstrates much worse behaviour in practice than the one it claims to oppose. You, among others, should actually be very aware of and disgusted by those tactics, having seen the effects of "soviet methods" in Romania.
I've been working through that repo and managed to run the 13B model on a single Pi 4 8GB.
I've also replicated the work with OpenMPI (from a thread on the llama.cpp GitHub repo), and today I managed to get the 65B model operational across three Pi 4 nodes.
I’m not saying this as any achievement of mine, but as a comment on the current reality of reproducible LLM At home on anything you’ve got.
The objective performance I'm getting is flat poor, mostly because of the network I'm using. On the other hand, simply being able to do it at all with one node on wireless until I can pull another drop, and the rest being on 100 Mbit ... I'm really running a bargain basement cluster.
I don't know about SRD, but llama.cpp has MPI configurations built-in. I didn't have to engineer anything or rewrite anything ( I made an optimization patch, but I didn't even make that one up myself ) I just compiled it with flags set.
As far as performance on 65B, I'm still waiting for it to finish to get the timings :)
I eagerly await your numbers! Maybe I'll post some of my own if I can get far enough ahead at work.
I was thinking it's time to upgrade from 1GigE anyway, 10GigE is cheap and at work we're ripping that out in favor of 25 and 50...
I'll look at the code, depending on how well the authors used MPI it could be exciting times! It's not that hard (or expensive) to get a bunch of power hungry used servers off ebay and string em together with a cheap 10GigE switch. It would be loud and power hungry but I wonder if I could have a 65B local model in the privacy of my own home, for fractions of the cost of buying a A100....
Edit: Oh, and SRD is a ... network protocol designed to work hand in hand with EFA which can substantially improve the performance of HPC MPI workloads, running on EC2, during network bound phases.
The other way around is whole number math. I added the 3-node output from the 13B model to github, the timings are below. The 3-node 65B job hasn't finished yet.
llama_print_timings: load time = 17766.29 ms
llama_print_timings: sample time = 264.42 ms / 128 runs ( 2.07 ms per token, 484.07 tokens per second)
llama_print_timings: prompt eval time = 10146.71 ms / 8 tokens ( 1268.34 ms per token, 0.79 tokens per second)
llama_print_timings: eval time = 287157.12 ms / 127 runs ( 2261.08 ms per token, 0.44 tokens per second)
llama_print_timings: total time = 297598.22 ms
In my best estimation, Finbarr makes pretty great content, he and I have had a number of positive interactions on Twitter. I tend to have a pretty grumpy disposition towards a lot of modern ML and such as I feel it's shovelware, but whenever Finbarr puts out work, I tend to set aside some time to give it a good gander, as I feel like it's generally pretty "meaty" (which I honestly find pretty hard to do past a certain pace). Well worth the subscribe if you have not done so already (I'm not affiliated with him, I just really like his work!).
It is useful to mention that running inference on modern CPUs that have AVX2 is not that bad. Sure, it is slower than on the GPU, but you get the benefit of having a single long continuous region of RAM.
But there is one huge problem why this is not that popular on x86_64. Having to run in fp32. As far as I know our most common ml libraries (pytorch, tf, onnx etc) do not have an option to quantize to 4 bits and they don't have an option to run inference at anything other than fp32 on the x86_64 cpus.
It is a huge shame. There is openvino which supports int8, but if you can't easily quantize large models without a gpu, what use is it? (For small models I suppose).
So if anyone figured out a way to quantize a transformer model to 4/8 bit and run it on the x86_64 cpu platform I'm very interested in hearing about it.
Subjective experience: AVX512 helps a lot. I would have liked to read more about this. It seems that AVX512 supports fp16 in hardware and allows 32 fused multiplication-add operations per core. So I imagine on a Ryzen 9 with 12 cores you can have 384 simultaneous fused multiplication-add operations. I am not sure whether my estimation is off. Anyone know more than me?
Wait, I wasn't aware llama.cpp even runs on x86_64. I thought it was ARM hardware only. If what you say is correct, that is indeed very interesting, especially if I can extend it to other models like Falcon.
A good analogy: as you approach, the rainbow moves off. Others see you in it, but you can confirm it's somewhere else. It's an effect, a side effect; it's pretty and we value it. There's no pot of gold in the literal sense; it's ephemeral value in other products.
I enjoyed this article, but it seems to me that the latency numbers should have units of nanoseconds or maybe CPU cycles. I feel like the article was a bit sloppy with units.
Another question that occurs to me is: why do chipmakers even bother putting so many functional units on the chip if almost all workloads are memory bound? Based on the calculations in this article, you could decrease the number of teraflops a modern GPU can perform by a factor of 2 and not notice any appreciable difference in ML performance.
1. I think nanosecond-scale latency numbers on operations taking dozens to hundreds of ms are probably overkill?
2. Inference is only one aspect of what GPUs are used for. Many other workloads are compute-bound. That being said, given the recent rise of these kinds of open-source, pre-trained large language models, I wouldn't be surprised if future Nvidia product launches offered variants with significantly more VRAM. There would probably be a lot of interest in "3080-equivalent compute, but 48GB VRAM" these days — certainly I would take one over a 4090 with 24GB VRAM. (Then again, that's basically an A6000, and those go for nearly $7k...)
Yeah, Nvidia won't do big VRAM consumer cards until AMD forces them to. They're running flat out just trying to keep up with demand for H100s at forty thousand USD each.
Or Intel! They're not making any high-end cards currently, but the A770 16GB is pretty decent if you're on a budget. The software support isn't really amazing yet, but their GPUs have some quite decent matmul acceleration.
Unfortunately support for their GPUs is not upstream in Tensorflow or Pytorch, and I don't think they're well-supported by Llama.cpp either, but Intel does look quite promising. I also believe that Intel oneMKL works on Arc GPUs these days, which in CPU land is amongst the fastest BLAS libraries out there. Also their matrix hardware is accessible using OpenCL extensions, which means that rolling a custom kernel for things related to quantization should be quite possible.
(For Tensorflow and PyTorch, you need to install a custom package called Intel Extension for $FRAMEWORK. The PyTorch one got updated to PyTorch 2.0 recently, which is promising.)
Currently rumors seem to indicate that their second generation GPUs will go from 32 to 64 Xe cores for the top model, and keep the 256 bit bus. If Intel were to double the VRAM to 32 GB as well (at least as an option, just like the A770 comes in both 8GB and 16GB variants), it'd immediately make them the crown of consumer VRAM size, which I'd wager would drive a lot of interest from the ML community.
The primary use case for GPUs in devices, graphics, is not as often memory bound. Even for other general data compute purposes that may not always be the case. It's specifically neural nets that have extremely wide batches of extremely simple operations occurring across extremely large chunks of memory where being memory bound is the default case.
You can find content like this on Twitter if you follow the right people. In fact I read this article before it was even posted here because @karpathy tweeted about it.
I’ve never been a regular Twitter user, and don't really enjoy the platform, but this comment of yours is either abusing the word “shameful” or betraying a major lack of understanding that you can’t expect other people to care deeply about the things you care deeply about.
There’s a lot of shameful things in this world, GP using twitter isn’t one of them. Not even a “little.”
It's betraying the word shameful in that it's an utter understatement.
If people don't care about supporting companies that enable and spread far-right content and groups, it's they who are a problem.
To say nothing of the pure disregard of the human right to privacy (and with AI now IP as well) that is forced on the rest of the world by the dominance of the US market.
Apologies, but I genuinely find the intensity of this comment delusional. (1) Doesn't this perspective apply to any social media platform, for example Facebook Groups? And (2) re: pure disregard of the human right to privacy: doesn't this implicate pretty much all popular digital platforms?
It's both you and right wing people who are the problem.
You're both puppets in the hands of the powerful who wants us divided and weak.
Don't trust authority. Don't trust anyone.
In the past, left-wing people cared about and fought for freedom; today it is the right wing fighting for freedom.
It's all irrelevant anyway, governments keep growing stronger and stronger during right or left governments. And soon there is not going to be anywhere to run to.
Mastodon is a good way to get at this content in unfiltered form. Take a look at the raw feed of the sigmoid.social instance or follow the #LLM hashtag.
If you want it more predigested and summarized I can recommend the AI Explained channel on youtube.
Does anyone know what the next breakthroughs will be, and their rough timelines, regarding locally run models? For instance, will anything like GPT-4 be runnable on an M1 Mac within the next year?
"Breakthroughs" are inherently hard to predict. This field is advancing at a very rapid pace. A lot of the improvements are incremental, but even these are coming very fast, and moving in different directions at once.
I don't think there's any likelihood of replicating GPT-4 on your M1 in the next twelve months, especially not if you're expecting responsive performance. What we could see are a plethora of models dedicated to doing particular tasks. Say, specialists in aspects of programming or accounting or law or medicine. Or general knowledge models with access to a local cache of Wikipedia. Individually, none of these models would have to come close to GPT-4's overall level of power and flexibility. But collectively, they could reach that level of utility.
Given the massive imbalance in the memory bandwidth bottleneck, I wonder why specialized hardware is the way it is. Is there some use case in which processing is the bottleneck, or at least it's more even? Are we expecting some software paradigm shift which will change the balance? Why couldn't they just make a cheaper, more rounded card which isn't heavily underutilized because of a large bottleneck?
But llama.cpp, and llms in general, are very atypically memory bound, even for AI workloads. Other models/backends will typically make more use of compute and cache.
For llms specifically, cloud hosts will batch requests to make better use of GPUs.
IMHO the current bottleneck is driven by the partial overlap between the gaming GPU market and machine learning needs. The expensive things that are needed by both (i.e. parallel calculations) are included in all cards, but expensive things that aren't needed by games (i.e. memory bandwidth) become a bottleneck unless you use expensive ML-specific niche hardware instead of consumer (gamer) GPUs.
What I find more stunning is what this implies going forward. If tech advances as it tends to do then having a 200bn model fit into consumer hardware isn't that far away.
Might not be AGI but I think cliched as it is that would "change everything". If not at 200 then 400 or whatever. Doesn't matter - the direction of travel seems certain.
Really? I can guess at what spicy autocomplete might actually mean, but I doubt an LLM... OK, ChatGPT did a pretty good job of it (I've just asked it), whilst sidestepping the definition of spicy in this context. It is, after all, a very good next-word guesser, given a context, and it's not... me! I am capable of hallucinating, but it was 30-odd years ago since I hunted out certain fungi on Dartmoor, or smoked hemp.
To be fair, we humans do often interrupt each other to second-guess a sentence completion. Done correctly, it is a brief, satisfying collaboration. Done wrong... I've been married for 18 years and know when to bite my tongue, but I still get it wrong from time to time, sometimes deliberately. Despite that, me and the wife can autocomplete each other's sentences with uncanny accuracy and end up with perfect harmony or, cough, a slight disagreement as a result.
We are getting some phenomenal slide rules these days, but the Daleks are not going to be flying up the stairwell just yet, nor will SkyNet be taking over tomorrow.
That said, you just know that some noddy is trying to sell a nuclear "deterrent" LLM AI thingie somewhere. Thankfully, production military equipment takes quite a while to get to deployment. There is a good chance that we will get to grips with all this stuff before SkyNet is let loose for real 8)
No, a human isn't born with a set of knowledge like a freshly trained LLM, keeping the model fixed and responding to input. The analog to the model changes based on the human's experience. Just making bigger and bigger LLMs won't give you this.
I have been thinking about this lately. Abilities that developed recently, like language skills, are the easiest for machines to replicate. But abilities that took a long time to develop, like walking and grasping, are still things that machines struggle with.
Ah, but I believe you forget the implicit biases of genetic programming. Instincts in my experience are the skeleton, and in a sense the default basis functions for the structure of how we live, see, do, and learn.
No, I don't forget that. There's obviously a starting point, behaviors and abilities that newborns already have. The point is that the model is not static.
Whoa, imagine you get a good base LLM and save all conversations with it. Run a batch process every night to fine-tune a LoRA on the conversation dataset. If I ever came across such a chat bot I'd probably freak out as to why it remembers things outside of the context window, without summarisation.
Bit of an unfair comparison when humans also have a bunch of senses that LLMs don't have. They might be trained on orders of magnitude more words, but more data? Doubtful.
That's the key. I'm reminded of the Helen Keller story. She was completely blind and deaf. Her teacher spent a very long time signing into her hand. It took a very long time before she realized that the sign for "water" designated the thing she could feel flowing onto her hand; before that breakthrough the signs were meaningless to her. An LLM only knows the structure of language. It doesn't know that there is an external physical world that the language refers to. It only can predict what words follow which other words, and which output is preferred. Without any senses (and the huge bandwidth of information provided by them) an LLM is very crippled.
Crippled, yes, but I would disagree that it is fundamentally limited, or that an input stream of human language is inadequate to bootstrap "meaning", or in some way philosophically inferior to native biological senses.
It's very interesting you bring up Helen Keller because she's generally regarded as possessing the same level of sentience - and indeed intelligence - as anyone else, despite the extreme narrowness of her sensory input. It took her much longer to get going, but it's not as if she only understood concepts that directly related to touch. The experience with "water" taught her the concept of a symbol, and from there she could bootstrap everything else. LLMs already work with symbols - that is their sense.
In fact we're all a bit like Helen Keller, in the sense that if sensory input is the basis for our entire world model, then it is a very small foundation supporting an incredibly vast and intricate edifice. There is a considerable abstraction gap between concepts like "capitalism" and any direct sensory input. We all of us, all the time, manipulate concepts without thinking through what they "mean" all the way to something we can see and touch.
No they don't. You're "just" doing what everyone else in the past has done with the brain/human intelligence and using the latest technology as a metaphor without realizing it.
We want to think we’re exceptional but all we can do is say “human consciousness is special” without having any way of measuring it or disproving the assertion that we’re just really fancy pattern matchers.
Take any metaphor you want, it’s the same outcome: we may all be philosophical zombies.
I’m conscious, maybe you’re not? Not that I really believe that. I think you probably see colors and hear sounds, even in your dreams! But engineering types tend to be persuaded by a particular view of the world, failing to understand that it is a view and not nature itself.
The irony in your statement is immense. Yes, Kurzweil has been saying this for decades. No, it doesn't mean AGI is close. These LLMs do nothing to advance AGI. There is no theoretical basis for the belief in emergent intelligence from statistical language models, and the answers are amazingly good, highly unreliable, and parrot meaning at best. There is no induction, no introspection, and no understanding of the deep semantic meaning of the language presented. There's no intelligence.
The lack of a concept of "knowledge" is a big one for me; if that's an emergent thing, it hasn't shown even hints of it yet. This seems a pretty hard line right now, as it limits their capability to do things even inexperienced humans can do: namely, decide whether they actually know something, and identify when they don't know something and attempt to fix that, e.g. asking for clarification on vague inputs, or deciding whether something is actually truth or fiction.
That then ties into another limitation right now: after training, the model is pretty much static, so it cannot learn and has no state outside its context buffer. This could just be another point where a few orders of magnitude more computing power can "fix" it, doing whole training steps between each input to actually incorporate new information and corrections into the model instead of relying on a fixed-size, small context.
But I'm not deep enough into things to say if they're fundamental issues, or current techniques will start displaying those effects as emergent characteristics as the complexity and training increases. There's been a few other examples when "known" techniques start to show unexpected characteristics down the line as they are scaled up, so can't really say for sure they'll /never/ be shown, just that the current examples don't seem to show even the beginnings of that sort of thing.
Why do you say they do nothing to advance AGI? Do you know what it takes to advance AGI? It's hard to state that without knowing how AGI would work yourself.
LLMs would have been considered magic just a couple of years ago. Sure, not AGI, but they behave just like one for certain workloads. I find it hard to believe we're not a bit closer now, or maybe even a lot closer.
AGI should have morals, opinions, self-reflection, learn continuously from sensor data, reason, realize when they’re proven wrong and update their model of the world, and be creative. So far LLMs exhibit none of those. But LLMs exhibit a digestible distillation of a very large body of data which may be a component of an AGI.
But you can have an AGI that doesn’t have encyclopedic knowledge but it’s still highly intelligent, so I don’t think LLMs have to be an intrinsic component.
That is not what AGI means. AGI = Artificial General Intelligence.
1. Artificial = we made it
2. General = it can solve problems in any field
3. Intelligence = the ability to solve problems
A chess engine is a very strong Artificial Intelligence. But it’s not very General, it can only evaluate chess positions.
GPT-4 is very General, you can ask it about any question and get a somewhat reasonable answer. But it’s not very intelligent, often the answer is wrong.
You're talking about an Artificial Human. That's a different problem. Intelligence is not species-dependent. Dolphins are intelligent (a bit), and aliens could be intelligent and have zero emotions or conception of self. There are certainly plenty of amoral, intelligent serial killers.
That's news to me. AGI (or strong AI) is typically defined as "human-level intelligence", or "perform any task that a human or animal can." Humans and animals often perform tasks that are critically reliant on being conscious, emoting, reading body language, reasoning, etc.
Not only that but prominent thinkers who have carved out the notion of AGI (or Strong AI) tend to have consciousness, mental states, and emotions at the core of it.
I think what you're talking about is a multi-task AI, not an AGI.
We don't have a good computer model of dolphin intelligence, and the LLMs are not even remotely close to consciousness, or to dolphins, dogs, or parrots on the intelligence front.
How do humans write if not by intuiting what word comes after another?
Intelligence is the ability of that next word decision procedure to determine a next word that is aligned with our human intuition and model of truth.
I believe what you’re getting at is modality, that GPT-4 only provides responses in text. You can’t ask it to drive a car, or paint like Dall-e. And that’s a fair criticism, but it’s mostly just because it would make the models too large and slow, not because we don’t know how to do it. The thing we don’t know how to do is make a model reason as well as a human, and it makes sense to try to solve that in the text domain first rather than making highly multimodal models that reason poorly in all domains.
> the answers are amazingly good, highly unreliable, and parrot meaning at best. There is no induction, no introspection, and no understanding of the deep semantic meaning of the language presented. There's no intelligence.
You have a metric for human brains worth of computing power which hasn't already been exceeded? I can't do infinite precision arithmetic or the RSA algorithm in my head, or index a billion strings into lexical sort order.
But I am human, I am conscious and no visible VLSI work or algorithmic model will lead to AGI or a human equivalent computing power by 2029. Let alone for $1000.
well, you could just be hallucinating your own consciousness. By 2029 it seems not unreasonable to expect that the most sophisticated models will carry out visual and auditory interactions which could fool even the most sophisticated viewer. At that point, what really does consciousness mean? If the robot insisted to me it was conscious, how can I really say no?
By most measures of intelligence you could think of, language models are improving, so I don't see why you think this wouldn't lead to something at least almost human-level if you scaled it up enough.
Of course there could be some wall somewhere but I don’t see why there would be
That's "we need a larger cowbell" thinking. It's not a theory of mind, it's wishful thinking that it will.. emerge. Absent theory I don't think moar will make it emerge, no.
If you want theory there’s this: https://arxiv.org/abs/2001.08361 (I haven’t actually read this but I know roughly what it’s about)
It's saying that so far the abilities of LLMs have scaled up with parameter count and training data size. Of course there's no way to be sure without actually training larger models, but I don't see why the point where it stops would be just after our current best LLMs. Many properties have already emerged from making them bigger, so I don't see why this would be the exception.
While it may be true that new data is coming in at a trickle these days, due to things like Discord, Slack, et al. locking conversation and context up, and the daily volume of chatter being small relative to what is already out there:
The fact is that training data can be used in many different ways, and I bet we see the products of that fairly quickly, as those who see this the same way I do reach a point where they want to show and tell and test.
>The fact is that training data can be used in many different ways, and I bet we see the products of that fairly quickly, as those who see this the same way I do reach a point where they want to show and tell and test.
Sounds like wishful thinking to overcome the limitations of LLMs.
At the same time, we get more and more texts generated by LLMs, so it gets harder to find genuinely human-written text.
A 200b 4-bit quantized model could potentially fit into 128 GB of RAM. The inference would just be really slow.
Ie you could technically run something like that today.
I think more VRAM on GPUs isn't necessarily a technical limitation either. I think GPU manufacturers could add a lot more VRAM to their cards if they wanted to. The question is whether it would be worth the price increase.
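The back-of-the-envelope version of that 200B-in-128GB figure (the 4.5 bits per weight is an assumed effective rate, since 4-bit quants carry some scale/offset overhead):

```python
params = 200e9
effective_bits = 4.5                       # assumed: 4-bit weights + quant overhead
weights_gb = params * effective_bits / 8 / 1e9
print(f"~{weights_gb:.0f} GB")             # ~112 GB, leaving room in 128 GB for KV cache etc.
```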
> Ie you could technically run something like that today.
Yep, on higher end machines it should already be feasible. I can do 2.5-3 tok/sec on a 70B model quantized at 4 bit today with my MacBook Pro M2 MAX w/96GB. It's a little slower than a 30B, but the difference is less than I had guessed it would be. That's not super fast, but it's usable.
And that's on a machine that isn't designed for this workload. Over the next few years things should improve quite a bit. 200B does not seem like a reach.
About the RAM: I doubt they want to do that, since the GPU's basic function is to render a frame in as few milliseconds as possible. Currently VRAM is latency optimized on consumer GPUs and all the memory chips are an inch away from the GPU. Light only travels so far in the gigahertz realm. That's why they started mounting VRAM chips on both sides of the board, because there was no more space left on the first side.
Just checked: light travels 30cm in one nanosecond. So if the GPU is running at 4GHz, it goes only 7.5cm per cycle.
VRAM is not latency optimized. VRAM has worse latency than your CPU RAM. The reason why it's mounted closer is because of signal integrity because of higher frequencies, not because of latency.
Sorry can't provide any resources right now. If you search a bit I'm sure you'll find some latency comparisons between DDR and GDDR.
But basically GPU memory (GDDR5/6/6X/etc) is optimized for bandwidth (because GPUs need to move a lot of data, have few branches, few unknown data dependencies, high spatial locality). CPU memory is more optimized for latency (because of branchy code).
>> Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers.
So how about using an APU - a CPU with GPU built in. The GPU shares the CPU memory, so if you want you can have 128GB RAM and allocate 100GB to the GPU.
Sure, the GPU is not fast, but if memory is what's important...
Most CPU RAM is much slower than GPU RAM. GPUs typically pack RAM 2 generations ahead with a wider bus than anything you'd find on a consumer motherboard.
For reference, DDR4-3200 in quad channel is ~100 GB/s while a 3090's VRAM is 960 GB/s. Of course, most consumers only have dual channel.
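For reference, the peak-bandwidth arithmetic behind those figures (the GDDR6X per-pin rate is rounded to ~20 Gbps; real-world numbers come in lower):

```python
ddr4_quad = 3200e6 * 8 * 4 / 1e9     # 3200 MT/s * 8 bytes/channel * 4 channels
rtx3090   = 20e9 * 384 / 8 / 1e9     # ~20 Gbps per pin * 384-bit bus
print(f"DDR4-3200 quad channel: ~{ddr4_quad:.0f} GB/s")   # ~102
print(f"RTX 3090 GDDR6X:        ~{rtx3090:.0f} GB/s")     # ~960
```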
M1 Pro is 200 GB/s and M1 Max is 400 GB/s. That's slow for GPU memory, but incredible for main memory, although I'm not sure how much of that a single core can actually pull.
You're thinking about the wrong bandwidth here. The article is talking about going from the GPU's RAM <-> GPU cores (i.e. through load/store instructions in a cuda kernel), not from CPU's RAM <-> GPU's RAM. That kind of bandwidth is still important but usually not the bottleneck on most ML workloads.
The author (correctly) made that distinction, like you said, but at the end, when talking about the Raspberry Pi 4, they use a number (~4GB/s of memory bandwidth) from an article [1] which I can only assume is NOT about graphics memory or its bandwidth (do Raspberry Pis even have that?).
And how exactly is the bandwidth counted if I use integrated GPU (like i5 13600K)? Or pure CPU?
CPUs and GPUs both interface with their own memory, and those memories have a certain bandwidth. Generally, CPU memory has relatively little bandwidth, but relatively good latency. (For example, an i9-13900K supports memory up to around 100 GB/s, while even my previous GPU, a midrange Radeon HD 7850 from 2012, has over 150 GB/s of bandwidth).
An integrated GPU shares memory with the CPU, so at best it gets the same amount of bandwidth assuming the CPU is not using any (which is rather unlikely).
A dedicated GPU has its own private high-bandwidth memory (an RTX 4090 has a memory bandwidth of over 1000 GB/s), but to get anything in there it needs to be loaded over the PCIe bus (which has a measly 32 GB/s bandwidth for PCIe 4.0 x16).
That said, CPU memory does have one big advantage: it tends to be much larger. You can pair up to 192 GB with a regular Ryzen 7000 CPU, while consumer GPUs don't go above 24 GB of memory (RTX 4090, RX 7900 XTX). (There are bigger GPUs out there, but those are generally intended for datacenters, and if you go that route, an Epyc or Xeon CPU can also support much more memory than a plain desktop Ryzen, although you can also slot multiple GPUs into a single server.)
I believe that for LLM performance, memory bandwidth is key, because all the neural network layers need to be streamed from memory, and very little work is done with it each time, although I guess batching operations could help if you're working at scale, since each weight would be applied multiple times then.
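A crude way to see why bandwidth dominates: for single-stream generation every weight gets read roughly once per token, so bandwidth divided by model size gives an upper bound on tokens per second. The sizes and bandwidths below are illustrative assumptions, not measurements:

```python
def tok_per_sec_ceiling(model_gb, bandwidth_gb_s):
    # Every token requires streaming ~all weights through the cores once.
    return bandwidth_gb_s / model_gb

model_gb = 3.8   # roughly a 7B model at ~4-bit
for name, bw in [("dual-channel DDR4", 50), ("M1 Max", 400), ("RTX 4090", 1000)]:
    print(f"{name:>17}: <= {tok_per_sec_ceiling(model_gb, bw):.0f} tok/s")
```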
Raspberry Pis do have a usable GPU, but using it for computation is not a particularly well-traveled path. I think that's a shame. The pre-4 models have a different Broadcom graphics core from the 4, and it looks like you can get useful work out of both, but they are different enough that it's a rebuild between the generations.
This is basically the appeal of the Apple chips in this domain. Apple have fuck-you money, so they solder a bunch of high-bandwidth, decent-latency memory right onto the package.
This article would probably be useful for a lot more people if it spent just a couple of sentences introducing the various parameters, rather than just throwing variable names at the reader. Interestingly, a whole paragraph is spent on explaining n_bytes.
80/20 rule. Approximations have smaller, less accurate approximations, which will do in many contexts. If your goal is to reach America, a crude compass and dead reckoning works. If you want to target an ICBM, you need better positional accuracy.
Fantastic article - I would love to see an analysis comparing larger quantized models to smaller unquantized models. e.g. is a 14b quantized model better than a 7b unquantized model?
Are there benchmark figures for CPU-based setups somewhere? E.g. a 192-core / 24-channel 2P Zen 4 box. Or is the CPU side too untuned to be interesting?
If you don't have a GPU, prompt ingestion is fully threaded: the more cores the better.
For generating tokens, more cores helps to a point, but then:
- You start saturating the memory bus, and performance plateaus.
- There is some overhead from the threading implementation, and too many threads hurts performance.
The ideal number of threads varies per CPU. For instance, using hyperthreaded cores or Apple/Intel E-cores typically hurts performance... but not always. You just have to test and see.
The problem is memory bandwidth rather than CPU cores: "Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them much easier to serve"
I'm not an expert in this, but my "does that feel wrong" sense isn't going off.
Pixel 5 is 2-3 years old. CPUs aren’t doubling in speed every 2 years, but let’s very generously say we expect current designs to be 4x faster than 2-3 year old equivalent.
Apple silicon is faster than other ARM chips, so if we imagine that’s another 2x we’re up to 8x
Common wisdom is that “real computers” are faster than phones, but the difference between the A16 and M2 is less than 2x for multi core, and much much less for single core benchmarks. Rounding up, another 2x is 16x
Maybe there are characteristics of the two devices which make this more surprising, I’d be interested to learn more.
https://oobabooga.github.io/blog/posts/perplexities/