How Is LLaMa.cpp Possible? (finbarr.ca)
685 points by birriel on Aug 15, 2023 | 227 comments



In case anyone is wondering, yes, there is a cost when a model is quantized.

https://oobabooga.github.io/blog/posts/perplexities/

Essentially, you lose some accuracy: there might be some weird answers, and the model is probably more likely to go off the rails and hallucinate. But the quality loss is smaller the more parameters you have, so for very large model sizes the differences might be negligible. Also, this is the cost of inference only. Training is a whole other beast and requires much more power.

Still, we are looking at GPT-3 level of performance on one server rack. That says something when less than a year ago, such AI was literally magic and only ran in a massive datacenter. Bandwidth and memory size are probably, to my ignorant mind, easier to increase than raw compute, so maybe we will soon actually have "smart" devices.


I was hoping that link would answer the question that's been bugging me for months: what are the penalties that you pay for using a quantized model?

Sadly it didn't. It talked about "perplexities" and showed some floating point numbers.

I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."


I have several sets of quant comparisons posted on my HF spaces, the caveat is my prompts are all "English to code": https://huggingface.co/spaces/mike-ravkine/can-ai-code-compa...

The dropdown at the top selects which comparison: Falcon compares GGML, Vicuna compares bits and bytes. I have some more comparisons planned, feel free to open an issue if you'd like to see something specific: https://github.com/the-crypt-keeper/can-ai-code


  I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."
We suck at evaluating and comparing models imo. There are metrics and evaluation tasks, but it's still very subjective.

The closer we get to assessing human like performance, the tougher it is, because it becomes more subjective and less deterministic by the nature of the task. I don't know the answer, but I know that for the metrics we have it's not so easy to translate them into any idea about the kind of performance on some specific thing you might want to do with the model.


Not mathematically, at the very least. Perplexity is a translation of the best measure we have for how a model is doing empirically over a test dataset (both pre- and post-quantization). It is usable enough to be, at the very least, the final word on how different quantization methods perform.

Subjective ratings are different, but for compression things are quite well defined.


> some specific thing you might want to do with the model.

I think this right here is the answer to measuring and comparing model performance.

Instead of trying to compare models holistically, we should be comparing them for specific problem sets and use cases... the same as we compare humans against one another.

Using people as an example, a hiring manager doesn't compare 2 people holistically, they compare 2 people based on how well they're expected to perform a certain task or set of tasks.

We should be measuring and comparing models discriminately rather than holistically.


You could have two models answer 100 questions the same way, and differ on the 101st. They’re unpredictable by nature - if we could accurately predict them we’d just use the predictions instead.


(Stupid question) are models still non-deterministic if you set the temperature to zero?

Would setting the temperature to zero degrade the quality of response?


Even at T=0 and run deterministically, the answers still have "randomness" with respect to the exact prompt used. Change wording slightly and you've introduced randomness again even if the meaning doesn't change. It would be the same for a person.

For an llm, a trivial change in wording could produce a big change in answer, same as running it again with a new random seed. "Prompt engineering" is basically overfitting if not approached methodically. For example, it would be interesting to try deliberate permutations of an input that don't change the meaning and see how the answer changes as part of an evaluation.
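A sketch of that kind of perturbation check; `generate` here is just a stand-in for whatever backend you actually call, and the paraphrases are hand-written for illustration:

    # hand-written paraphrases of "the same" question
    prompts = [
        "List three causes of the French Revolution.",
        "Name three causes of the French Revolution.",
        "What were three causes of the French Revolution?",
    ]

    def generate(prompt: str) -> str:
        # stand-in: replace with a real call to your model/backend
        return "The Estates-General, food shortages, and royal debt."

    answers = [generate(p) for p in prompts]
    # crude stability check: how much do the answers overlap word-wise?
    sets = [set(a.lower().split()) for a in answers]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            jaccard = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
            print(f"prompt {i} vs {j}: word overlap {jaccard:.2f}")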


But if T=0 and you use the exact same input (not a single word or position changes) do you get the same output? Reading your response it implies that the randomness is related to even slight changes.


As a sibling comment mentioned, threading on a GPU is not automatically deterministic, so you could get randomness from there, although I can't think of anything in the forward pass of a normal LLM that would depend on execution order. So yes, you should get the same output; it's basically just matrix multiplication. There may be some implementation details I don't know about that would add other variability though.

Look at this minimal implementation (Karpathy's) of LLaMA, the only randomness is in the "sample" function that comes in at non-zero temperature, otherwise its easy to see everything is deterministic: https://github.com/karpathy/llama2.c/blob/master/run.c
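The shape of that sample step, in a few lines of Python, just to show where the randomness enters (the logits are made up):

    import math, random

    def next_token(logits, temperature):
        if temperature == 0.0:
            # greedy: always the argmax, so repeated runs give identical output
            return max(range(len(logits)), key=lambda i: logits[i])
        # otherwise scale, softmax, and sample -- this is the only random step
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(probs)), weights=probs)[0]

    logits = [1.2, 3.4, 0.7, 2.9]      # toy values
    print(next_token(logits, 0.0))     # always 1
    print(next_token(logits, 0.8))     # varies run to run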

Otoh, with MoE like GPT-4 has, it can still vary at zero temperature.


Some GPU operations give different results depending on the order they are done. This happens because floating point numbers are approximations and lose associativity. Requiring a strict order causes a big slowdown.
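You can see the associativity issue without a GPU at all; summing the same numbers in a different order can give a different float (contrived values):

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- the 1.0 is lost when added to -1e16 first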


Well, the same is true for people, and yet hiring managers still evaluate for specific tasks.


It makes the model dumber.

That seems simplistic, but it's really as simple as that. Naive 3-bit quantization will turn llama 7B into blubbering nonsense.

But llama.cpp quantization is good! I recommend checking out the graphs ikawrakow made for their K-quants implementation:

https://github.com/ggerganov/llama.cpp/pull/1684

Basically, the more you quantize with K-quant, the dumber the model gets. 2 bit llama 13B quant, for instance, is about as dumb as 7B F16, but the dropoff is not nearly as severe from 3-6 bits.


FWIW here's why perplexity is useful: it's a measure of uncertainty that can easily be compared between different sources. Perplexity k is like the uncertainty of a roll of a k-sided die. Here I think perplexity is per-token, and is measuring the likelihood of re-generating the strings in the test set.

e.g. take a look at these two rows:

    llama-65b.ggmlv3.q4_K_M.bin 4.90639 llama.cpp
    llama-65b.ggmlv3.q3_K_M.bin 5.01299 llama.cpp
So for the reduction in size given by (q4 -> q3), you get a 2% increase in the uncertainty. Now, that doesn't tell you which specific capabilities get worsened (or even if that's really considered a huge or tiny change), but it is a succinct description of general performance decreases.

If you want more fine-grained explanations of how generation of certain types of texts get clobbered, you would probably need to prepare datasets comprised of that type of string, and measure the perplexity delta on that subset. i.e.

    dperplexity/dquantization(typed_inputs).
I think it might be more difficult to get a comprehensive sense of the qualitative differences in the other direction, e.g.

    dtype/dquantization(all_outputs).
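For the first of those, a sketch of what per-category perplexity deltas could look like, assuming you can pull a per-token log-probability of the same reference text out of both the full-precision and quantized models (the numbers below are invented):

    import math

    # invented per-token log-probs of the same reference text under two models,
    # grouped by the kind of text ("code", "prose", ...)
    samples = {
        "code":  {"fp16": [-1.2, -0.8, -1.5], "q3": [-1.6, -1.1, -1.9]},
        "prose": {"fp16": [-2.0, -1.7, -2.2], "q3": [-2.1, -1.8, -2.3]},
    }

    def perplexity(logprobs):
        return math.exp(-sum(logprobs) / len(logprobs))

    for kind, lp in samples.items():
        full, quant = perplexity(lp["fp16"]), perplexity(lp["q3"])
        print(f"{kind}: {full:.2f} -> {quant:.2f} ({100*(quant/full-1):+.1f}%)")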


The problem is that it's not consistent enough for a good demo. Not even two different models, but even two different fine tunes of the same base model may be wildly differently affected by quantization. It can range from making hardly a difference to complete garbage output.


I have been using nat.dev to compare quantized models and it works great.


Just the other day someone published ARC comparison results for different quants as well as the code for the harness that they used to easily run lm-eval against quants to your heart's content: https://www.reddit.com/r/LocalLLaMA/comments/15rh3op/effects...


It will be different for every use case; the only way to find out is spinning one up.


Would an answer of "there aren't many significant penalties" suffice?


>Still, we are looking at GPT-3 level of performance on one server rack. That says something when less than a year ago, such AI was literally magic and only ran in a massive datacenter.

I'm not sure what you mean by this. You've always been able to run GPT3 on a single server (your typical 8xA100).


8xA100 is technically a single server, but I think OP is talking about affordable and plentiful CPU hosts, or even relatively modest single GPU instances.

DGX boxes do not grow on trees, especially these days


Am I missing something or how do you know this? Also I think the OP was talking about a single card not multiple but that was just my reading.


Because 175B parameters (350GB for the weights in FP16; let's say a bit over 400GB for actual inference) fit very comfortably on 8xA100 (640GB VRAM total).

And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)
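The back-of-the-envelope version (the inference overhead number is a rough allowance, not a measurement):

    params = 175e9                           # GPT-3 scale
    weights_fp16_gb = params * 2 / 1e9       # 2 bytes per parameter -> 350 GB
    inference_gb = weights_fp16_gb + 60      # rough allowance for KV cache/activations
    print(weights_fp16_gb, inference_gb)     # 350.0, ~410 GB
    print(8 * 80)                            # 640 GB of VRAM across 8xA100-80GB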


> And basically all servers will have 8xA100

for those wondering: no this is not the norm. My lab at CMU doesn't own any A100s (we have A6000s).


The servers the commenter is talking about are DGX machines from NVIDIA.

It doesn’t really make sense to BTO. What you gain economically you lose in the science you can do.

But nobody could have anticipated this.


you could also get HGX from any of the vendors.


Wtf does HGX mean? God enough with the acronyms people.

Please take an extra ten seconds to speak in proper human language!

You could save on the world's carbon footprint by reducing the number of times humans have to search for "what is NVIDIA HGX?" or "what is AMD HGX" and then subsequently visit the websites to see if that's right or not.


What does Wtf mean? God enough with the acronyms people. /s


You got me there hahaha

However, there’s a difference between an acronym known to the broader public versus some single shot, context-specific one!


Whose norm? I assure you it's the norm. :)


> And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)

Wishing or guessing something without actual experience doesn't make it true.


The effect is lesser than you think. 5 bit quantization has negligible performance loss compared to 16 bits: https://github.com/ggerganov/llama.cpp/pull/1684


This paper from last month has a method for acceptable 3-bit quantization and a start at 2-bit.

https://arxiv.org/abs/2307.13304


Yes, there is a logarithmically-bound (or exponential if you're viewing it from another angle) falloff in the information lost in quantization. This comes from the non-uniform "value" of different weights. We can try to get around them with different methods, but at the end of the day, some parameters just hurt more to squeeze.

What is insane though is how far we've taken it. I remember when INT8 from NVIDIA seemed like a nigh-pipedream!


Good blog post, shame the site has no RSS feed!


Could this be why people recently say they see more weird results in ChatGPT? Maybe OpenAI is trying out different quantization methods for the GPT4 model(s) to reduce resource usage of ChatGPT.


I'd be more inclined to believe that they're dropping down to gpt-3.5-turbo based on some heuristic, and that's why sometimes it gives you "dumber" responses. If you can serve 5/10 requests with 3.5 by swapping only the "easy" messages out, you've just cut your costs by nearly half (3.5 is like 5% of the cost of 4).
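The arithmetic behind "nearly half", treating 3.5 as ~5% of the per-request cost of 4 and assuming half the traffic can be swapped:

    cost_gpt4, cost_gpt35 = 1.00, 0.05       # relative per-request cost
    blended = 0.5 * cost_gpt4 + 0.5 * cost_gpt35
    print(blended)                            # 0.525, i.e. roughly 47% cheaper than all-GPT-4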


Serving me ChatGPT 3.5 when I explicitly request ChatGPT 4 sounds like a very bad move? They're not marketing it like "ChatGPT Basic" and "ChatGPT Pro".


Thank you! Is there a sweet spot with quantization? How much can you quantize for a given model type and size and still be useful?


Tim Dettmers recently (https://www.manifold1.com/episodes/ai-on-your-phone-tim-dett...):

"But what we found with these neural networks is, if you use 32 bits, they're just fine. And then you use 16 bits, and they're just fine. And then with eight bits, you need to use a couple of tricks and then it's just fine.

And now we find if you can go to four bits, and for some networks, that's much easier. For some networks, it's much more difficult, but then you need a couple more tricks. And so it seems they're much more robust."


> And now we find if you can go to four bits

That will be really interesting for FPGAs, because the current ones are basically oceans of 4-bit computers.

Yes, you can gang together a pair of 4LUTs to make a 5LUT, and a pair of 5LUTs to make a 6LUT, but you halve your parallelism each time you do that. OTOH you can't turn a 4LUT into a pair of 3LUTs on any currently-manufactured FPGA. It's simply the "quantum unit" of currently-available hardware -- and it's been that way for at least 15 years (Altera had 3LUTs back in the 2000s). There's no fundamental reason for the number 4 -- but it is a very, very deep local minimum for the current (non-AI) customers of FPGA vendors.


This is not generally true, sometimes quantisation can improve accuracy. I haven't seen that with LLMs yet though.


Interesting, how would that work? Are there any well-known examples?

Is it: the weights all happen to be where float is sparse, so quantization ends up increasing fidelity? Or is it more of a “worse is better” dropout-type situation?


I suspect it works as regularisation of the network. It usually happens when you train with quantisation instead of post-training quantisation, and I haven't seen that done with LLMs yet.


For image recognition it can sometimes be like that. My gut feeling is that lowering from fp32 to fp16 can get rid of some kind of overfitting or so.


Any use case for using the 7B model over the 13B, quantized?


Inference speed. Sometimes 7B is good enough for the task at hand, and using 13B just makes you wait longer.


SBC


Wtf does SBC mean? God enough with the acronyms people.


In my experience, it usually means Small Block Chevy, but in certain communities it means Single Board Computer, an older way of referring to devices like the Raspberry Pi.

I would elaborate and say, anywhere that your computer is resource constrained ( ram, processing power ) but you still want to make up articles for your Amazon Affiliate blog


Single board computer makes sense. I wish folks would type things out.


In this context I'd assume SBC means Single Board Computer, such as a Raspberry Pi or one of the many imitators. The article itself mentions running LLaMa on a Pi 4.

The interesting implication about running an LLM on a single board computer is that it's a proof of concept for an LLM on a smartphone. If you have a model that can produce useful results on a Raspberry Pi, you have something that could potentially run on hundreds of millions of smartphones. I'm not sure what the use case is for running an LLM on your phone instead of the cloud, but it opens some interesting possibilities. It depends just how useful such a small LLM could be.


This leaves a ton of stuff out.

- Token generation is serial and bandwidth bound, but prompt ingestion is not and runs in batches of 512+. Short tests are fast on pure CPU llama.cpp, but long prompting (such as with ongoing conversation) is extremely slow compared to other backends.

- Llama.cpp now has very good ~4 bit quantization that doesn't affect perplexity much. Q6_K almost has the same perplexity as FP16, but is still massively smaller.

- Batching is a big thing to ignore: outside of personal deployments it matters a lot.

- The real magic of llama.cpp is model splitting. A small discrete GPU can completely offload prompt ingestion and part of the model inference. And it doesn't have to be an Nvidia GPU! There is no other backend that will do that so efficiently in the generative AI space.

- Hence the GPU backends (OpenCL, Metal, CUDA, soon ROCm and Vulkan) are the de facto way to run llama.cpp these days. Without them, I couldn't even run 70B on my desktop, or 33B on my (16GB RAM) laptop.


ROCm works now! I just set it up tonight on a 6900xt with 16gb vram running wayland at the same time. The trick was using the opencl-amd package (somehow rocm packages don't depend on opencl, but llama does, idk).

I'm astonished at the results I can get from the q6_K models.


Can you please share more info on this? I have a 6900xt "gathering dust" in a proxmox server - would like to try to do a passthrough to a vm and use it. Thank you in advance!


Sure thing. There's a bunch of ways to do it, but here's some quick notes on what I did.

* arch linux has tons of `rocm` packages. I installed pretty much all of them: https://archlinux.org/packages/?sort=&q=rocm&maintainer=&fla...

* you also need this one package from AUR: https://aur.archlinux.org/packages/opencl-amd

* llama.cpp now has GPU support including "CLBlast", which is what we need for this, so compile with LLAMA_CLBLAST=ON

* now you can run any model llama.cpp supports, so grab some ggml models that fit on the card from https://huggingface.co/TheBloke.

* Test it out with: ./main -t 30 -ngl 128 -m huginnv1.2.ggmlv3.q6_K.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"

* You should see `BLAS = 1` in the llama.cpp output and you should get maybe 5 tokens per second on a 13b 6bit quantized ggml model.

* You can compile llama-cpp-python with the same arguments and get text-generation-ui to work also, but there's a bit of dependency fighting to do it.

* koboldcpp might be better, I just haven't tried it yet

Hope that helps!

Edit: just tried https://github.com/LostRuins/koboldcpp and it also works great. I should have started here probably.

Compile with `make LLAMA_CLBLAST=1` run with ` python koboldcpp.py --useclblast 0 0 --gpulayers 128 huginnv1.2.ggmlv3.q6_K.bin`


Text gen ui is nice. Some specific niceties include pre-formatted instruct templates for popular models, good prompt caching from llama-cpp-python, and integration of a vector db.

But it's also finicky, kind of unstable, and the dependencies are tricky.

Koboldcpp has other niceties, like some different generation parameters to tweak and some upstream features pulled in from PRs before the official llama.cpp release has them. The UI is nice, predating llama v1. It's standalone, dead simple to compile, and has integration with AI Horde, which is (IMO) a huge essential feature.


Same boat.

I’d love to host something local but have been so overwhelmed by the rapid progress and every time I start looking I find a guide that inevitably has a “then plug in your OpenAI api key…” step which is a hard NOPE for me.

I have a few decent gpus but I’ve got no idea where to start…


Path of least resistance:

- Download koboldcpp: https://github.com/LostRuins/koboldcpp

- Download your 70B ggml model of choice, for instance airoboros 70B Q3_K_L: https://huggingface.co/models?sort=modified&search=70b+ggml

- Run Koboldcpp with opencl (or rocm) with as many layers as you can manage on the GPU. If you use rocm, you need to install the rocm package from your linux distro (or direct from AMD on Windows).

- Access the UI over http. Switch to instruct mode and copy in the correct prompt formatting from the model download page.

- If you are feeling extra nice, get an AI Horde API key and contribute your idle time to the network, and try out other models on from other hosts: https://lite.koboldai.net/#


What’s perplexity?


Perplexity is a measure of how certain the model is of the next token. It's calculated by looking at the probabilities that the model calculates for the next token in a stream. If there are several choices for the next token with similar probabilities, that's telling you that the model is having a hard time telling what the right answer should be: the model is more perplexed, perplexity is higher. If there's a single option with a much higher implied probability than any other, that means the model is more certain, and perplexity is lower.

Note that this has nothing to do with whether the answer is objectively correct. It's just measuring how confident it is.


Reading this could make people believe it is computed from the probability distribution of the model alone.

To be clearer, it is the exponent of the average negative log probability that the model gives to the real tokens of a sample text[0]. Roughly, it relates to how strongly the model can predict the sample text. A perfect model would have perplexity one; a random model has a perplexity equal to the number of possible tokens; the worst model has infinite perplexity.

[0]: https://github.com/pytorch/torcheval/blob/3faf19c060b8a7c074...
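To make those reference points concrete, a tiny sketch using that definition (logprobs are for the true next token at each position; 32000 is LLaMA's vocabulary size):

    import math

    def ppl(logprobs):
        return math.exp(-sum(logprobs) / len(logprobs))

    vocab = 32000
    print(ppl([math.log(1.0)] * 10))        # perfect model: 1.0
    print(ppl([math.log(1 / vocab)] * 10))  # uniform guessing: 32000.0
    print(ppl([math.log(0.2)] * 10))        # gives the true token p=0.2: 5.0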


I think what you describe ("confidence about the next token") is the entropy of the model's output. A model can be very certain about the next token (its output has low entropy) but if it is usually wrong on the text you measure it against, it will have high perplexity. (For example when the model was trained only on children's books and you measure it on Wikipedia.)


Thank you!


Perplexity is measured by testing a language model on a known text. The model's output is a probability for every possible next word/token. The model is highly perplexed if it gave a low probability to the actual next token.


What's prompt ingestion?


It's the phase where llama.cpp processes the input, as opposed to the phase where it's generating the response.

Speed there is particularly important. The response can be streamed word by word, so generation doesn't have to be particularly fast, but slow input processing leads to very noticeable latency.


This project's been a blast to work with. While it's written in C++, it provides a C interface to compile against which makes it especially easy to extend with Go, Python and other runtimes.

A few folks and I have been building a tool with it in Go for pulling & running multiple models, and serving them on a REST API: https://github.com/jmorganca/ollama

In a similar light, if you haven't checked it out, llama.cpp also has a pretty extensive "server" tool (in its examples directory in the repo) with a web UI and support for grammars (e.g. forcing the output to be JSON)


> A few folks and I have been building a tool with it in Go for pulling & running multiple models, and serving them on a REST API: https://github.com/jmorganca/ollama

Why this tool was created? What was the reason for creating such tool? Why it was named this way?

Explanation for wishing to flag/block/bully basic simple questions: Questions mean what they ask and nothing else. Any imagination of aggression or anything else related to emotions is misplaced.

As someone unfamiliar with the field I’ve read GitHub page and didn’t find answers to things I was curious about regarding this project. So out of curiosity I wrote simple basic questions:

“Why this tool was created? What was the reason for creating such tool? Why it was named this way?“

Suggesting that if I didn’t figured it out by myself perhaps some one else who also wish to join the field and looking for entry point to try something in practice would also be confused and thus answering those simple basic questions would really help to the beginners.

With experience of entering into many new fields I know that usually there is a lot of confusion in the beginning about basic stuff .

I also suggested that those are obvious reasons why people ask questions and didn’t expect to be ‘bullied’ for basic simple questions. There were no one word of aggression in text. Perhaps seeing aggression there requires certain level of misplaced imagination.

Usually there is a fun story about the name for example GNU is Not Unix etc … This is opportunity to tell it if there is. Sometimes name holds some clue about functionality and then it’s easier to remember. Sharing it can help too to beginners.


I've also been experimenting with the C# implementation

https://github.com/trrahul/llama2.cs


Why Parallel.For() instead of SIMD types?

I would expect a much better performance.


Thank you for the link to your project! I'll be playing around with it.


Just wanted to say thanks homie for making something so hairy so accessible. Love it!!!


[flagged]


Because.

On a more serious note, your questions seem very... aggressive.

Especially the one about the name, is the name offensive or what? To me it sounds quite benign.


As someone unfamiliar with the field I’ve read github page and didn’t find answers to things I was curious about regarding this project.

So out of curiosity I wrote simple basic questions:

“Why this tool was created? What was the reason for creating such tool? Why it was named this way?“

Suggesting that if I didn’t figured it out by myself perhaps some one else who also wish to join the field and looking for entry point to try something in practice would also be confused and thus answering those simple basic questions would really help to the beginners.

With experience of entering into many new fields I know that usually there is a lot of confusion in the beginning about basic stuff .

I also suggested that those are obvious reasons why people ask questions and didn’t expect to be ‘bullied’ for basic simple questions. There were no one word of aggression in text. Perhaps seeing aggression there requires certain level of misplaced imagination.

> Especially the one about the name, is the name offensive or what? To me it sounds quite benign.

So why the simple question doesn’t sound benign to you? Usually there is a fun story about the name for example GNU is Not Unix etc … and sometimes it holds some clue about functionality and then it’s easier to remember.

> Because

well… it’s still unclear


Your comment has been flagged and my comment has received a bunch of upvotes. Multiple people consider your wording or tone aggressive.

In case you're a person with social issues, asking so many questions in a succession is considered quite rude.

Space them out, in writing add newlines. Or add more context to your questions.

Your comment history suggests that you do this often and you have a ton of downvoted comments.

Friendly advice :-)


Usually friendly advice is given after friendly answering question, which was not demonstrated here.

We are perfectly aware of how to deal with extra-sensitive people and kind approach is the way to go. However on this site I expect certain level of technical training and intelligence in answering direct questions without being offended by search for a deeper understanding of some topic and without inventing emotions that never been there in the first place. I was also under impression that guidelines of this site directly where encouraging answering I expect by declaring : “be kind” “don’t be rude” “suggest a good faith” and unfortunately none of those were demonstrated by your answers even when you have learned the fact that you were wrong and your accusations were completely ungrounded . The fact that such behaviour was encouraged by (as you claim it) “multiple people” including moderation of this site tells a very sad story about this site regarding following own declared principles.

And this example:

“Which crypto? Where can I pay with it? What if I don't want to have a hardware wallet everywhere with me? Can I cancel transactions and get my money back? Does it use more energy per year than Argentina?” (https://news.ycombinator.com/item?id=36656667)

tells another sad story about hypocrisy which means it’s hard to take “friendly“ advice seriously. The next time you wish to give a truly friendly advice try to be kind in the first place, do not suggest a bad faith and do not insist on false ungrounded accusations encouraged by confirmation biases especially when clarification presented.

You could have just apologise for own real aggression contrary to imaginative aggression from my questions and simply give answer while keeping “friendly” advice to yourself.

This way people would learn something about the project rather than learning something about cancelling “culture” which is under the flag of “being nice” demonstrates much worse behaviour in practice than one it claim to oppose. You actually among others should be very aware and disgusted by those tactics after you’ve seen effects of “soviet methods” in Romania.


I’ve been working through that repo and managed the 13B dataset on a single Pi4 8gig

I’ve also replicated the work in OpenMPI ( from a thread on the llama.cpp GitHub repo ) and today I managed to get the 65B dataset operational on three pi4 nodes.

I’m not saying this as any achievement of mine, but as a comment on the current reality of reproducible LLMs at home on anything you’ve got.

It really feels like this technique has arrived.

https://github.com/cameronbunce/ClusterConfig


> I’ve also replicated the work in OpenMPI...

Oh cool! How did it perform?

I wonder if this would be an exciting test for Amazon's SRD protocol which appears to be built for HPC. I'm looking for an excuse to play with it...


The objective performance I'm getting is flat poor, mostly because of the network I'm using. On the other hand, simply being able to do it at all with one node on wireless until I can pull another drop, and the rest being on 100 Mbit ... I'm really running a bargain basement cluster.

I don't know about SRD, but llama.cpp has MPI configurations built-in. I didn't have to engineer anything or rewrite anything ( I made an optimization patch, but I didn't even make that one up myself ) I just compiled it with flags set.

As far as performance on 65B, I'm still waiting for it to finish to get the timings :)


I eagerly await your numbers! Maybe I'll post some of my own if I can get far enough ahead at work.

I was thinking it's time to upgrade from 1GigE anyway, 10GigE is cheap and at work we're ripping that out in favor of 25 and 50...

I'll look at the code, depending on how well the authors used MPI it could be exciting times! It's not that hard (or expensive) to get a bunch of power hungry used servers off ebay and string em together with a cheap 10GigE switch. It would be loud and power hungry but I wonder if I could have a 65B local model in the privacy of my own home, for fractions of the cost of buying a A100....

Edit: Oh, and SRD is a ... network protocol designed to work hand in hand with EFA which can substantially improve the performance of HPC MPI workloads, running on EC2, during network bound phases.


Nice write up man. Thanks for sharing your research


How many tokens a second?


The other way around - seconds per token - is whole-number math. I added the 3-node output from the 13B model to GitHub; the timings are below. The 3-node 65B job hasn't finished yet.

    llama_print_timings:        load time =  17766.29 ms
    llama_print_timings:      sample time =    264.42 ms /  128 runs   (   2.07 ms per token,  484.07 tokens per second)
    llama_print_timings: prompt eval time =  10146.71 ms /    8 tokens (1268.34 ms per token,    0.79 tokens per second)
    llama_print_timings:        eval time = 287157.12 ms /  127 runs   (2261.08 ms per token,    0.44 tokens per second)
    llama_print_timings:       total time = 297598.22 ms


This is very interesting and actually in the usable realm, for some use cases


My networking setup is not optimal, but it was quite surprising how easy it was to get it all to work.


In my best estimation, Finbarr makes pretty great content, he and I have had a number of positive interactions on Twitter. I tend to have a pretty grumpy disposition towards a lot of modern ML and such as I feel it's shovelware, but whenever Finbarr puts out work, I tend to set aside some time to give it a good gander, as I feel like it's generally pretty "meaty" (which I honestly find pretty hard to do past a certain pace). Well worth the subscribe if you have not done so already (I'm not affiliated with him, I just really like his work!).


It is worth mentioning that running inference on modern CPUs that have AVX2 is not that bad. Sure it is slower than on the GPU, but you get the benefit of having a single long continuous region of RAM.

But there is one huge problem why this is not that popular on x86_64. Having to run in fp32. As far as I know our most common ml libraries (pytorch, tf, onnx etc) do not have an option to quantize to 4 bits and they don't have an option to run inference at anything other than fp32 on the x86_64 cpus.

It is a huge shame. There is openvino which supports int8, but if you can't easily quantize large models without a gpu, what use is it? (For small models I suppose).

So if anyone figured out a way to quantize a transformer model to 4/8 bit and run it on the x86_64 cpu platform I'm very interested in hearing about it.


Subjective experience: AVX512 helps a lot. I would have liked to read more about this. It seems that AVX512 supports fp16 in hardware and allows 32 fused multiplication-add operations per core. So I imagine on a Ryzen 9 with 12 cores you can have 384 simultaneous fused multiplication-add operations. I am not sure whether my estimation is off. Anyone know more than me?



Sorry, but the topic of this post, llama.cpp, runs quantized 4/8 bit models just fine on x86_64 with AVX2, or am I missing some requirement you have?
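For what it's worth, a minimal sketch of running a pre-quantized 4-bit ggml file on a plain x86_64 CPU via the llama-cpp-python bindings mentioned elsewhere in the thread (the filename is just an example of the usual naming from TheBloke's uploads):

    # pip install llama-cpp-python   (builds llama.cpp with AVX2 on x86_64 by default)
    from llama_cpp import Llama

    llm = Llama(model_path="llama-2-7b.ggmlv3.q4_K_M.bin")   # any 4/8-bit ggml file
    out = llm("Q: What does quantization trade away? A:", max_tokens=48)
    print(out["choices"][0]["text"])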


Wait, I wasn't aware llama.cpp even runs on x86_64. I thought it was ARM hardware only. If what you say is correct that indeed is very interesting. Especially if I can extend it to other models like Falcon.


It doesn't support Falcon right now, but there's a fork that does (https://github.com/cmp-nct/ggllm.cpp/).


Bah. We still haven't equaled the rude and hateful AI achieved in a microcomputer in 1981. <https://scp-wiki.wikidot.com/scp-079>


We can keep reaching for that rainbow.


A good analogy: as you approach, the rainbow moves off. Others see you in it, but you can confirm it's somewhere else. It's an effect, a side effect; it's pretty and we value it. There's no pot of gold in the literal sense; its value is ephemeral, realized in other products.


I enjoyed this article, but it seems to me that the latency numbers should have units of nanoseconds or maybe CPU cycles. I feel like the article was a bit sloppy with units.

Another question that occurs to me is: why do chipmakers even bother putting so many functional units on the chip if almost all workloads are memory bound? Based on the calculations in this article, you could decrease the number of teraflops a modern GPU can perform by a factor of 2 and not notice any appreciable difference in ML performance.


1. I think nanosecond-scale latency numbers on operations taking dozens to hundreds of ms are probably overkill?

2. Inference is only one aspect of what GPUs are used for. Many other workloads are compute-bound. That being said, given the recent rise of these kinds of open-source, pre-trained large language models, I wouldn't be surprised if future Nvidia product launches offered variants with significantly more VRAM. There would probably be a lot of interest in "3080-equivalent compute, but 48GB VRAM" these days — certainly I would take one over a 4090 with 24GB VRAM. (Then again, that's basically an A6000, and those go for nearly $7k...)


Yeah, Nvidia won't do big VRAM consumer cards until AMD forces them to. They're running flat out just trying to keep up with demand for H100s at forty thousand USD each.


Or Intel! They're not making any high-end cards currently, but the A770 16GB is pretty decent if you're on a budget. The software support isn't really amazing yet, but their GPUs have some quite decent matmul acceleration.

Unfortunately support for their GPUs is not upstream in Tensorflow or Pytorch, and I don't think they're well-supported by Llama.cpp either, but Intel does look quite promising. I also believe that Intel oneMKL works on Arc GPUs these days, which in CPU land is amongst the fastest BLAS libraries out there. Also their matrix hardware is accessible using OpenCL extensions, which means that rolling a custom kernel for things related to quantization should be quite possible.

(For Tensorflow and PyTorch, you need to install a custom package called Intel Extension for $FRAMEWORK. The PyTorch one got updated to PyTorch 2.0 recently, which is promising.)

Currently rumors seem to indicate that their second generation GPUs will go from 32 to 64 Xe cores for the top model, and keep the 256 bit bus. If Intel were to double the VRAM to 32 GB as well (at least as an option, just like the A770 comes in both 8GB and 16GB variants), it'd immediately make them the crown of consumer VRAM size, which I'd wager would drive a lot of interest from the ML community.


Compromise on microseconds


The primary use case for GPUs in devices, graphics, is not as often memory bound. Even for other general data compute purposes that may not always be the case. It's specifically neural nets that have extremely wide batches of extremely simple operations occurring across extremely large chunks of memory where being memory bound is the default case.


Will be interesting to see what people can do with local models, particularly for open source programming tools and PCG models for video games.


They'll probably write fanfic and some Harry Potter ships.


Great article. Don't see content like this anywhere else outside of HN.


You can find content like this on Twitter if you follow the right people. In fact I read this article before it was even posted here because @karpathy tweeted about it.


[flagged]


I’ve never been a regular Twitter user, and don't really enjoy the platform, but this comment of yours is either abusing the word “shameful” or betraying a major lack of understanding that you can’t expect other people to care deeply about the things you care deeply about.

There’s a lot of shameful things in this world, GP using twitter isn’t one of them. Not even a “little.”


It's betraying the word shameful in that it's an utter understatement.

If people don't care about supporting companies that enable and spread far-right content and groups, it's they who are a problem.

To say nothing of the pure disregard of the human right to privacy (and with AI now IP as well) that is forced on the rest of the world by the dominance of the US market.


Apologies, but I genuinely find the intensity of this comment delusional. (1) Doesn't this perspective apply to any social media platform, for example Facebook Groups? And (2) RE: pure disregard of the human right to privacy: doesn't this implicate pretty much all popular digital platforms?



There are a lot of oil producers I consider shameful, but I'm still going to buy gasoline every few weeks.


It's both you and right wing people who are the problem.

You're both puppets in the hands of the powerful who wants us divided and weak.

Don't trust authority. Don't trust anyone. In the past left wing people cared and fought for freedom; today it is the right wing fighting for freedom.

It's all irrelevant anyway, governments keep growing stronger and stronger during right or left governments. And soon there is not going to be anywhere to run to.


Shameful? Just because you dislike it doesn’t mean everyone has to.


Says you. I'd prefer to take advantage of the resources available there. Your silly moralism is easy to ignore


Mastodon is a good way to get at this content in unfiltered form. Take a look at the raw feed of the sigmoid.social instance or follow the #LLM hashtag.

If you want it more predigested and summarized I can recommend the AI Explained channel on youtube.


Does anyone know what the next breakthroughs will be and their rough timelines regarding locally run models? For instance will anything like chat gpt 4 be runnable on an M1 Mac within the next year?


"Breakthroughs" are inherently hard to predict. This field is advancing at a very rapid pace. A lot of the improvements are incremental, but even these are coming very fast, and moving in different directions at once.

I don't think there's any likelihood of replicating GPT-4 on your M1 in the next twelve months, especially not if you're expecting responsive performance. What we could see are a plethora of models dedicated to doing particular tasks. Say, specialists in aspects of programming or accounting or law or medicine. Or general knowledge models with access to a local cache of Wikipedia. Individually, none of these models would have to come close to GPT-4's overall level of power and flexibility. But collectively, they could reach that level of utility.


Memory bound token generation is a limitation of transformer decoders.

In the past, hardware has motivated algorithm innovations.

I’m curious how long it will take until we see more hardware friendly models.


The RWKV family of models qualifies, since it computes like a recurrent network at inference time.


Given the massive imbalance in the memory bandwidth bottleneck, I wonder why specialized hardware is the way it is. Is there some use case in which processing is the bottleneck, or at least it's more even? Are we expecting some software paradigm shift which will change the balance? Why couldn't they just make a cheaper, more rounded card which isn't heavily underutilized because of a large bottleneck?


See: https://en.m.wikipedia.org/wiki/Random-access_memory#Memory_...

But llama.cpp, and llms in general, are very atypically memory bound, even for AI workloads. Other models/backends will typically make more use of compute and cache.

For llms specifically, cloud hosts will batch requests to make better use of GPUs.

But you are not far off. There is a proposition to pipe chips with very fast memory (aka no external memory at all) together: https://www.nextplatform.com/2023/07/12/microsofts-chiplet-c...


IMHO the current bottleneck is driven by the partial overlap between the gaming GPU market and machine learning needs. The expensive things that are needed by both (i.e. parallel calculations) are included in all cards, but expensive things that aren't needed by games (i.e. memory bandwidth) become a bottleneck unless you use expensive ML-specific niche hardware instead of consumer (gamer) GPUs.


I'm shocked - before I saw it working myself, I didn't believe an LLM this large (and smart) could run on a desktop CPU.

- Core i7, 4 cores, 3.3GHz, 64GB DDR4-2400: 70B model, ~0.5 tokens/s; 30B model, ~1 token/s.
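Those numbers are roughly what the memory-bandwidth argument in the article predicts. A rough sanity check (dual-channel DDR4-2400 assumed; the bytes-per-parameter figure is a rough guess for ~4-bit quants):

    bandwidth_gbs = 2 * 8 * 2400e6 / 1e9   # dual channel, 8 bytes wide: ~38.4 GB/s
    gb_70b = 70e9 * 0.55 / 1e9             # ~4-bit quant incl. overhead: ~38 GB
    gb_30b = 30e9 * 0.55 / 1e9             # ~16.5 GB
    # each generated token streams the whole model through the CPU once,
    # so memory bandwidth sets an upper bound on tokens/s
    print(bandwidth_gbs / gb_70b)          # ~1.0 token/s ceiling
    print(bandwidth_gbs / gb_30b)          # ~2.3 tokens/s ceiling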


What I find more stunning is what this implies going forward. If tech advances as it tends to do then having a 200bn model fit into consumer hardware isn't that far away.

Might not be AGI but I think cliched as it is that would "change everything". If not at 200 then 400 or whatever. Doesn't matter - the direction of travel seems certain.


Basically Ray Kurzweil's argument; for decades now he's been saying that $1000 worth of compute will be able to match human performance around 2029.


First, there has to be something capable of matching human performance at a much higher cost. This is still just spicy autocomplete.


Humans just do spicy autocomplete too.


Really? I can guess at what spicy autocomplete might actually mean but I doubt a LLM ... OK ChatGPT did a pretty good job of it (I've just asked it), whilst sidestepping the definition of spicy in this context. It is after all a very good next word guesser, given a context, and its not ... me! I am capable of hallucinating but it was 30 odd years ago since I hunted out certain fungi on Dartmoor, or smoked hemp.

To be fair, we humans do often interrupt each other to second guess a sentence completion. Done correctly it is a brief satisfying collaboration. Done wrong ... I've been married for 18 years and know when to bite my tongue, but I still get it wrong from time to time - sometimes deliberately. Despite that, me and the wiff can autocomplete each other's sentences with uncanny accuracy and end up with perfect harmony or a cough slight disagreement as a result.

We are getting some phenomenal slide rules these days but the Daleks are not going to be flying up the stairwell just yet, nor will SkyNet be taking over tomorrow.

That said, you just know that some noddy is trying to sell a nuclear "deterrent" LLM AI thingie somewhere. Thankfully, production military equipment takes quite a while to get to deployment. There is a good chance that we will get to grips with all this stuff before SkyNet is let loose for real 8)


No, a human isn't born with a set of knowledge like a freshly trained LLM, keeping the model fixed and responding to input. The analog to the model changes based on the human's experience. Just making bigger and bigger LLMs won't give you this.


Humans are born with millions of years of evolutionary training embedded in their dna and brain. We are not born with nothing.


Yeah but chimpanzees and cats and mice have all the same million-year-old stuff that we do.

The million-year-old stuff is not what makes humans interesting.


I have been thinking about this lately. Things that were developed recently, like language skills, are easiest to replicate by machines. But things that took a long time to develop, like walking and grabbing, are still something that machines struggle with.


It's what makes humans possible. "Necessary but not sufficient" is the phrase that pays.


Ah, but I believe you forget the implicit biases of genetic programming. Instincts in my experience are the skeleton, and in a sense the default basis functions for the structure of how we live, see, do, and learn.


No, I don't forget that. There's obviously a starting point, behaviors and abilities that newborns already have. The point is that the model is not static.


So a human is different because it keeps training its neural network?


Whoa, imagine you get a good base LLM model and save all conversations with it. Run a batch process every night to fine-tune a LoRA on the convo dataset. If I ever came across such a chat bot I'd probably freak out as to why it remembers things outside of the context window, without summarisation.
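A rough sketch of that idea with the Hugging Face stack (peft + transformers); the checkpoint name, file name and hyperparameters are just placeholders:

    # nightly LoRA fine-tune on saved conversations (sketch, not a tuned recipe)
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)
    from peft import LoraConfig, get_peft_model, TaskType

    base = "meta-llama/Llama-2-7b-hf"        # any causal LM checkpoint works here
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # wrap the base model with a small LoRA adapter instead of touching its weights
    model = get_peft_model(model, LoraConfig(
        task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05))

    # yesterday's conversations, one exchange per line (made-up log file)
    lines = [l for l in open("convo_log.txt") if l.strip()]
    ds = Dataset.from_dict({"text": lines}).map(
        lambda x: tok(x["text"], truncation=True, max_length=512), batched=True)

    Trainer(
        model=model,
        args=TrainingArguments("nightly-lora", num_train_epochs=1,
                               per_device_train_batch_size=1, learning_rate=1e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()
    model.save_pretrained("nightly-lora-adapter")  # load on top of the base model tomorrow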


That's a pretty neat idea, I would be surprised if no one is already working on that.


I did it years ago on a lark with a seq2seq model in a matrix chat room.


How did it perform? Was it well received by members?


Poorly! It was a small seq2seq and was gibberish to start with. Although it did tell my friend that it loved him which was nice.


one reason why a human is different: just based on word count alone, most LLM's are trained on 3-5 orders of magnitude more input.

could be a difference that makes no difference, or ...


Bit of an unfair comparison when humans also have a bunch of senses that LLMs don't have. They might be trained on orders of magnitude more words, but more data? Doubtful.


That's the key. I'm reminded of the Helen Keller story. She was completely blind and deaf. Her teacher spent a very long time signing into her hand. It took a very long time before she realized that the sign for "water" designated the thing she could feel flowing onto her hand; before that breakthrough the signs were meaningless to her. An LLM only knows the structure of language. It doesn't know that there is an external physical world that the language refers to. It only can predict what words follow which other words, and which output is preferred. Without any senses (and the huge bandwidth of information provided by them) an LLM is very crippled.


Crippled, yes, but I would disagree that it is fundamentally limited, or that an input stream of human language is inadequate to bootstrap "meaning", or in some way philosophically inferior to native biological senses.

It's very interesting you bring up Helen Keller because she's generally regarded as possessing the same level of sentience - and indeed intelligence - as anyone else, despite the extreme narrowness of her sensory input. It took her much longer to get going, but it's not as if she only understood concepts that directly related to touch. The experience with "water" taught her the concept of a symbol, and from there she could bootstrap everything else. LLMs already work with symbols - that is their sense.

In fact we're all a bit like Helen Keller, in the sense that if sensory input is the basis for our entire world model, then it is a very small foundation supporting an incredibly vast and intricate edifice. There is a considerable abstraction gap between concepts like "capitalism" and any direct sensory input. We all of us, all the time, manipulate concepts without thinking through what they "mean" all the way to something we can see and touch.


No they don't. You're "just" doing what everyone else in the past has done with the brain/human intelligence and using the latest technology as a metaphor without realizing it.


We want to think we’re exceptional but all we can do is say “human consciousness is special” without having any way of measuring it or disproving the assertion that we’re just really fancy pattern matchers.

Take any metaphor you want, it’s the same outcome: we may all be philosophical zombies.


We may "just" be neural networks that run on meat instead of silicon, but that does not mean that we're LLMs.


Why doesn’t it?


It's a formal logical error. One does not follow from the other without affirming the consequent.


Because not all neural networks are LLMs.

A GAN is a neural network, does that make it an LLM?


We have inputs other than words, for a start


I’m conscious, maybe you’re not? Not that I really believe that. I think you probably see colors and hear sounds, even in your dreams! But engineering types tend to be persuaded by a particular view of the world, failing to understand that it is a view and not nature itself.


Maybe you do. Luckily, not everyone is quite so simple.


monkeys make monkeys accidentally.

monkeys make meseeks on purpose.

there is a difference, but will it be fun?


I am 100% invested in how much ridiculous fun this era is going to be. Right up until the moment when it becomes a horror.


The irony in your statement is immense. Yes, Kurzweil has been saying this for decades. No, it doesn't mean AGI is close. These LLMs do nothing to advance AGI. There is no theoretical basis for the belief in emergent intelligence from statistical language models, and the answers are amazingly good, highly unreliable, and parrot meaning at best. There is no induction, no introspection, and no understanding of the deep semantic meaning of the language presented. There's no intelligence.


The lack of a concept of "knowledge" is a big one for me - if that's an emergent thing it hasn't even shown hints of it yet. This to me seems a pretty hard line right now, as it limits their capability at things even inexperienced humans can do - namely deciding whether they actually know something, and identifying when they don't know something and attempting to fix that - e.g. asking for clarification on vague inputs, or deciding whether something is actually truth or fiction.

That then ties into another limitation right now - after training, the model is pretty static, so it cannot learn and has no state outside its context buffer. This could just be another point where a few orders of magnitude more computing power can "fix" it, doing whole training steps between each input to actually incorporate new information and corrections into itself instead of relying on a fixed-size small context.

But I'm not deep enough into things to say if they're fundamental issues, or current techniques will start displaying those effects as emergent characteristics as the complexity and training increases. There's been a few other examples when "known" techniques start to show unexpected characteristics down the line as they are scaled up, so can't really say for sure they'll /never/ be shown, just that the current examples don't seem to show even the beginnings of that sort of thing.


Why do you say they do nothing to advance AGI? Do you know what it takes to advance AGI? It's hard to state that without knowing how AGI would work yourself.

LLMs would have been considered magic just a couple years ago. Sure, they're not AGI, but they behave just like one for certain workloads. I find it hard to believe we're not a bit closer now - or maybe even a lot closer.


AGI should have morals, opinions, self-reflection, learn continuously from sensor data, reason, realize when they’re proven wrong and update their model of the world, and be creative. So far LLMs exhibit none of those. But LLMs exhibit a digestible distillation of a very large body of data which may be a component of an AGI.

But you can have an AGI that doesn’t have encyclopedic knowledge but it’s still highly intelligent, so I don’t think LLMs have to be an intrinsic component.


That is not what AGI means. AGI = Artificial General Intelligence.

1. Artificial = we made it

2. General = it can solve problems in any field

3. Intelligence = the ability to solve problems

A chess engine is a very strong Artificial Intelligence. But it’s not very General, it can only evaluate chess positions.

GPT-4 is very General, you can ask it about any question and get a somewhat reasonable answer. But it’s not very intelligent, often the answer is wrong.

You’re talking about an Artificial Human. That’s a different problem. Intelligence is not species dependent. Dolphins are intelligent (a bit), aliens can be intelligent and have zero emotions or conception of self. There’s certainly plenty of amoral intelligent serial killers.


> That’s a different problem.

That's news to me. AGI (or strong AI) is typically defined as "human-level intelligence", or "perform any task that a human or animal can." Humans and animals often perform tasks that are critically reliant on being conscious, emoting, reading body language, reasoning, etc.

Not only that but prominent thinkers who have carved out the notion of AGI (or Strong AI) tend to have consciousness, mental states, and emotions at the core of it.

I think what you're talking about is a multi-task AI, not an AGI.


We don't have a good computer model of dolphin intelligence, and the LLMs are not even remotely close to consciousness, or to dolphins, dogs, or parrots on the intelligence front.


Consciousness ≠ intelligence.

Consciousness: being “awake” and perceiving the world.

Intelligence: solving problems, finding the truth.

Consciousness is perceiving the world, whereas intelligence is understanding it.


GPT-4 solves only 1 problem: what is the most likely stream of tokens to follow what we already have.

It is remarkably good at this. But there's absolutely zero reason to believe it can solve any other problem at all.


How do humans write if not by intuiting what word comes after another?

Intelligence is the ability of that next word decision procedure to determine a next word that is aligned with our human intuition and model of truth.

I believe what you’re getting at is modality, that GPT-4 only provides responses in text. You can’t ask it to drive a car, or paint like Dall-e. And that’s a fair criticism, but it’s mostly just because it would make the models too large and slow, not because we don’t know how to do it. The thing we don’t know how to do is make a model reason as well as a human, and it makes sense to try to solve that in the text domain first rather than making highly multimodal models that reason poorly in all domains.


> How do humans write if not by intuiting what word comes after another?

We don't know. It may turn out that we use mechanisms similar to LLMs, or it might be something entirely different.

As for the rest: nobody knows how to make ChatGPT butter a piece of toast, let alone drive a car.

ChatGPT does not reason about text, either.


>nobody knows how to make ChatGPT butter a piece of toast

There is plenty of research on LLMs successfully piloting robots.

It's by no means a solved problem but "nobody knows how" is a stretch.

https://tidybot.cs.princeton.edu/
https://innermonologue.github.io/

>ChatGPT does not reason about text, either.

It does and there's plenty of output to demonstrate that.



That's an interface, not an implementation.

("The most likely" out of what distribution? The model's distribution. So that just means "what the model thinks the answer to your question is".)


> the answers are amazingly good, highly unreliable and parrot meaning at best. There is no inductance, and no inteospection and no understanding of the deep semantic meaning of the language presented. There's no intelligence.

Can say the same about a half of population, tbh


So what you’re saying is.... we’re going to have human-level AI, and it’s going to be incredibly stupid


BAAGI (Below Average Artificial General Intelligence).


Or more simply not AGI but AGS


His prediction was that one human brain's worth of computing power could be acquired for $1000 by 2029. That still seems reasonable.

That's not the same as AGI or the singularity.


You have a metric for a human brain's worth of computing power which hasn't already been exceeded? I can't do infinite-precision arithmetic or the RSA algorithm in my head, or sort a billion strings into lexical order.

But I am human, I am conscious, and no VLSI work or algorithmic model in sight will lead to AGI or human-equivalent computing power by 2029, let alone for $1000.


Well, you could just be hallucinating your own consciousness. By 2029 it seems not unreasonable to expect that the most sophisticated models will carry out visual and auditory interactions that could fool even the most sophisticated viewer. At that point, what does consciousness really mean? If the robot insisted to me that it was conscious, how could I really say no?


> it seems not unreasonable

This is where we differ.


By most measures of intelligence you could think of, language models are improving, so I don't see why you think this wouldn't lead to something at least almost human-level if you scaled it up enough.

Of course there could be some wall somewhere, but I don't see why there would be.


That's "we need a larger cowbell" thinking. It's not a theory of mind; it's wishful thinking that it will... emerge. Absent a theory, I don't think moar will make it emerge, no.


If you want theory there’s this: https://arxiv.org/abs/2001.08361 (I haven’t actually read this but I know roughly what it’s about)

It says that, so far, the abilities of LLMs have scaled with parameter count and training data size. Of course there's no way to be sure without actually training larger models, but I don't see why the point where scaling stops would be just past our current best LLMs. Many capabilities have already emerged from making models bigger, so I don't see why this would be the exception.
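For what it's worth, the headline result of that paper is that test loss falls off roughly as a power law in parameter count (and similarly in data and compute). Schematically, with constants invented here just to show the shape of the claim, not the paper's fitted values:

  # Schematic power-law scaling of loss with parameter count, in the spirit
  # of Kaplan et al. 2020. The constants are placeholders for illustration.
  def loss(n_params, n_c=1e13, alpha=0.08):
      return (n_c / n_params) ** alpha

  for n in (1e9, 1e10, 1e11, 1e12):
      print(f"{n:.0e} params -> loss ~ {loss(n):.3f}")  # steadily decreasing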


Because you need more training data for better results and they are running out of new training data.


I don't think so.

It may be true that new data is coming in at a trickle these days, with things like Discord, Slack, et al. locking conversations and context away, and the daily volume of chatter being small relative to what is already out there.

The fact is that training data can be used in many different ways, and I bet we'll see the products of that fairly quickly, as people who see this the same way I do reach the point where they want to show, tell, and test.


>The fact is that training data can be used in many different ways, and I bet we'll see the products of that fairly quickly, as people who see this the same way I do reach the point where they want to show, tell, and test.

Sounds like wishful thinking to overcome the limitations of LLMs.

At the same time, more and more of the text out there is generated by LLMs, so it gets harder to find genuinely human-written text.


That’s true for LLMs but not necessarily for reinforcement learning


A 200B 4-bit quantized model could potentially fit into 128 GB of RAM. Inference would just be really slow.

I.e. you could technically run something like that today.

I don't think more VRAM on GPUs is necessarily a technical limitation either. GPU manufacturers could add a lot more VRAM to their cards if they wanted to; the question is whether it would be worth the price increase.
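Rough numbers, treating 4 bits as exactly half a byte per weight and ignoring the small per-block scale overhead real quantization formats add; the overhead figure below is an assumption, not a measurement:

  # Back-of-the-envelope memory estimate for a 4-bit quantized 200B model.
  params = 200e9           # 200B parameters
  bits_per_weight = 4
  overhead_gb = 10         # rough allowance for KV cache, activations, etc. (assumed)

  weights_gb = params * bits_per_weight / 8 / 1e9
  print(f"weights:       {weights_gb:.0f} GB")                # ~100 GB
  print(f"with overhead: {weights_gb + overhead_gb:.0f} GB")  # ~110 GB, fits in 128 GB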


> I.e. you could technically run something like that today.

Yep, on higher-end machines it should already be feasible. I can do 2.5-3 tok/sec on a 70B model quantized at 4 bits today with my MacBook Pro M2 Max w/96GB. It's a little slower than a 30B, but the difference is less than I would have guessed. That's not super fast, but it's usable.

And that's on a machine that isn't designed for this workload. Over the next few years things should improve quite a bit. 200B does not seem like a reach.


About the RAM: I doubt they want to do that, since a GPU's basic function is to render a frame in as few milliseconds as possible. Currently VRAM is latency-optimized on consumer GPUs, and all the memory chips sit within an inch of the GPU; light only travels so far in the gigahertz realm. That's why they started mounting VRAM chips on both sides of the board, because there was no room left on the first side.

Just checked: light travels 30 cm in one nanosecond. So if the GPU is running at 4 GHz, light covers only 7.5 cm per clock cycle.


VRAM is not latency-optimized; it has worse latency than your CPU RAM. The reason it's mounted closer is signal integrity at higher frequencies, not latency.


Interesting. Where can I read more about that?


Sorry, I can't provide any resources right now. If you search a bit I'm sure you'll find some latency comparisons between DDR and GDDR.

But basically, GPU memory (GDDR5/6/6X/etc.) is optimized for bandwidth (GPUs need to move a lot of data and have few branches, few unknown data dependencies, and high spatial locality), while CPU memory is optimized more for latency (because of branchy code).


IMO the direction we're going seems more like having a few small models in a MoE that together are equivalent to a current 200B model.


And then things like neural implants and BCIs -- seems like your dog could have language capabilities sooner than you'd think ;)


Thank you very much for writing this out!


You’re welcome :)


>> Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers.

So how about using an APU, i.e. a CPU with a GPU built in? The GPU shares the CPU's memory, so if you want you can have 128 GB of RAM and allocate 100 GB of it to the GPU.

Sure, the GPU is not fast, but if memory is what matters...


You can, right now, with the OpenCL backend.

And for the moment, it's slower than pure CPU. Optimizing for IGPs is not trivial.

MLC's Vulkan backend is actually quite good on my AMD APU, but unfortunately it won't split the model to a dGPU.


Most CPU RAM is much slower than GPU RAM. GPUs typically pack RAM 2 generations ahead with a wider bus than anything you'd find on a consumer motherboard.


For reference, DDR4-3200 in quad channel is ~100 GB/s while a 3090's VRAM is 960 GB/s. Of course, most consumers only have dual channel.

M1 Pro is 200 and M1 Max is 400. Which is slow for GPU memory, but incredible for main memory -- although I'm not sure how much of that a single core can actually pull.


The AnandTech article profiled this. IIRC the CPU cores only get access to half that bandwidth, which is still quite a lot.


The AMD Ryzen™ 9 7940HS uses DDR5-5600, which I understand works out to about 89.6 GB/s in a dual-channel setup.
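For anyone who wants to sanity-check these figures, the usual rule of thumb is transfer rate (MT/s) times 8 bytes per channel times the number of channels:

  # Peak theoretical DRAM bandwidth: MT/s * 8 bytes per transfer * channels.
  def peak_gb_per_s(mt_per_s, channels):
      return mt_per_s * 8 * channels / 1000

  print(peak_gb_per_s(3200, 4))  # DDR4-3200 quad channel: 102.4 GB/s
  print(peak_gb_per_s(3200, 2))  # DDR4-3200 dual channel: 51.2 GB/s
  print(peak_gb_per_s(5600, 2))  # DDR5-5600 dual channel: 89.6 GB/s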


You're thinking about the wrong bandwidth here. The article is talking about going from the GPU's RAM <-> GPU cores (i.e. through load/store instructions in a cuda kernel), not from CPU's RAM <-> GPU's RAM. That kind of bandwidth is still important but usually not the bottleneck on most ML workloads.


I'm still confused.

The author (correctly) made that distinction, like you said, but at the end, when talking about the Raspberry Pi 4, they use a number (~4 GB/s of memory bandwidth) from an article [1] which I can only assume is NOT about graphics memory or its bandwidth (do Raspberry Pis even have dedicated graphics memory?).

And how exactly is the bandwidth counted if I use an integrated GPU (like the one in an i5-13600K)? Or a pure CPU?

[1] https://forums.raspberrypi.com/viewtopic.php?t=281183


CPUs and GPUs both interface with their own memory, and those memories have a certain bandwidth. Generally, CPU memory has relatively little bandwidth, but relatively good latency. (For example, an i9-13900K supports memory up to around 100 GB/s, while even my previous GPU, a midrange Radeon HD 7850 from 2012, has over 150 GB/s of bandwidth).

An integrated GPU shares memory with the CPU, so at best it gets the same amount of bandwidth assuming the CPU is not using any (which is rather unlikely).

A dedicated GPU has its own private high-bandwidth memory (an RTX 4090 has a memory bandwidth of over 1000 GB/s), but to get anything in there it needs to be loaded over the PCIe bus (which has a measly 32 GB/s bandwidth for PCIe 4.0 x16).

That said, CPU memory does have one big advantage: it tends to be much larger. You can pair up to 192 GB with a regular Ryzen 7000 CPU, while consumer GPUs don't go above 24 GB of memory (RTX 4090, RX 7900 XTX). (There are bigger GPUs out there, but those are generally intended for datacenters, and if you go that route, an Epyc or Xeon CPU can also support much more memory than a plain desktop Ryzen, although you can also slot multiple GPUs into a single server.)

I believe that for LLM performance, memory bandwidth is key, because all the neural network layers need to be streamed from memory, and very little work is done with it each time, although I guess batching operations could help if you're working at scale, since each weight would be applied multiple times then.
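A rough way to see why: for each generated token, essentially every weight has to be read from memory once, so bandwidth divided by model size gives a crude upper bound on single-stream tokens per second. The numbers below are illustrative:

  # Crude upper bound on single-stream decode speed: every weight is read
  # once per token, so tok/s <= memory bandwidth / model size in bytes.
  # Real speeds come in lower (KV cache traffic, imperfect overlap, etc.).
  def max_tok_per_s(bandwidth_gb_per_s, model_gb):
      return bandwidth_gb_per_s / model_gb

  model_gb = 35  # ~70B parameters at 4 bits per weight
  print(max_tok_per_s(100, model_gb))   # ~100 GB/s desktop DDR5: ~2.9 tok/s
  print(max_tok_per_s(400, model_gb))   # M2 Max unified memory: ~11 tok/s
  print(max_tok_per_s(1000, model_gb))  # RTX 4090-class VRAM: ~29 tok/s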


Raspberry Pis do have a usable GPU, but using it for computation is not a particularly well-traveled path, which I think is a shame. The pre-4 models have a different Broadcom graphics core than the 4, and it looks like you can get useful work out of both, but they are different enough that it's a rebuild between generations.


This is basically the appeal of the Apple chips in this domain. Apple have fuck-you money, so they have a bunch of high-bandwidth, decent-latency memory soldered onto the chip.


It's LPDDR memory with wires going between the chips; the difference from other LPDDR applications is just that the wires run inside the same package.


This article would probably be useful for a lot more people if it spent just a couple of sentences introducing the various parameters, rather than just throwing variable names at the reader. Interestingly, a whole paragraph is spent on explaining n_bytes.


80/20 rule. Approximations have smaller, less accurate approximations that will do in many contexts. If your goal is to reach America, a crude compass and reckoning from your speed will work. If you want to target an ICBM, you need much better positional accuracy.


Fantastic article - I would love to see an analysis comparing larger quantized models to smaller unquantized models, e.g. is a 14B quantized model better than a 7B unquantized model?


Basically, even the most aggressive quants of a larger model are always better than the unquantized version of a smaller model: https://github.com/ggerganov/llama.cpp/pull/1684

Tim Dettmers did a bunch of research on this last year as well: https://arxiv.org/abs/2212.09720
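If it helps to see what 4-bit quantization actually does to the weights, here's a toy round-to-nearest, per-block scheme. It's not llama.cpp's exact q4 format (those use specific block sizes and several variants), just the general idea:

  # Toy symmetric 4-bit round-to-nearest quantization, one scale per block.
  import numpy as np

  def quantize_q4(weights, block_size=32):
      w = weights.reshape(-1, block_size)
      scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map each block to roughly [-7, 7]
      q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return (q.astype(np.float32) * scale).reshape(-1)

  w = np.random.randn(1024).astype(np.float32)
  q, s = quantize_q4(w)
  err = np.abs(w - dequantize(q, s)).mean()
  print(f"mean abs error: {err:.4f}")  # small but nonzero: the accuracy cost of quantizing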


Are there benchmark figures for CPU-based setups somewhere? E.g. a 192-core / 24-channel 2P Zen 4 box. Or is the CPU side too untuned to be interesting?



And BTW this analysis holds true for many workloads. Memory bandwidth is often at the core of performance.


Is this single-thread? Or are they putting all available CPUs on the problem?


It's complicated.

If you don't have a GPU, prompt ingestion is fully multithreaded. The more cores the better.

For generating tokens, more cores help up to a point, but then:

- You start saturating the memory bus, and performance plateaus.

- There is some overhead from the threading implementation, and too many threads hurts performance.

The ideal number of threads varies per CPU. For instance, using hyperthreaded cores or Apple/Intel E-cores typically hurts performance... but not always. You just have to test and see.
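If you want to find the sweet spot empirically, a sweep like the one below works. It assumes llama.cpp's ./main binary from around this period (-m, -t, -n, -p flags) and a hypothetical GGML model path; adjust for your build:

  # Sweep llama.cpp thread counts to find the fastest setting on a given CPU.
  import subprocess, time

  MODEL = "llama-2-13b.ggmlv3.q4_0.bin"  # hypothetical path, substitute your own
  for threads in (4, 6, 8, 12, 16):
      start = time.time()
      subprocess.run(
          ["./main", "-m", MODEL, "-t", str(threads), "-n", "64", "-p", "Hello"],
          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True,
      )
      print(f"{threads:>2} threads: {time.time() - start:.1f} s for 64 tokens")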


The problem is memory bandwidth rather than CPU cores: "Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them much easier to serve"


It's multithreaded.


Is there a Llama2 65b quantized version for Mac M2?


I'd look for "llama2 thebloke 70b GGML" on Hugging Face.

Update, this might work: https://huggingface.co/TheBloke/Llama-2-70B-GGML

Example command mentions the 4bit variant:

./main -m llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"


I also found this link ( https://blog.lastmileai.dev/run-llama-2-locally-in-7-lines-a...) but haven't verified whether it helps. Thanks for your reply, I'm taking a look now.


Just checked on an M1 with 32 GB of RAM: 7B and 13B work like a charm. 70B doesn't fit, even the 2-bit version.


Did anyone notice that, apparently, an M2 MacBook Pro is only 16x faster than a Pixel 5? Not sure that makes sense.


I'm not an expert in this, but my "does that feel wrong" sense isn't going off.

The Pixel 5 is 2-3 years old. CPUs aren't doubling in speed every 2 years, but let's very generously say we expect current designs to be 4x faster than a 2-3 year old equivalent.

Apple silicon is faster than other ARM chips, so if we imagine that's another 2x, we're up to 8x.

Common wisdom is that "real computers" are faster than phones, but the difference between the A16 and M2 is less than 2x for multi-core, and much, much less for single-core benchmarks. Rounding up, another 2x brings us to 16x.

Maybe there are characteristics of the two devices which make this more surprising, I’d be interested to learn more.


The phone slows down as it heats up, and then slows down infinitely when the battery dies.

(With some workloads and chargers, it's possible running on a phone would drain the battery faster than it can recharge even if you plugged it in.)


[flagged]


I'm feeling that a sense of very wry irony will approach us soon...



