I think there are a couple of basic questions that need to be answered before we can find a good solution:
1) What are you trying to do?
2) What's your budget?
Generically saying "run inference" is like... you can do that on your current ThinkPad if you want a small enough model. If you want to run 7B or 13B or 34B models for document or sentiment analysis, or whatever, then you can move on to the budget question.
When I was faced with this question, I bought the cheapest 4060 Ti with 16GB I could find. It does "okay". Here's an example run:
Llama.generate: prefix-match hit
llama_print_timings: load time = 627.53 ms
llama_print_timings: sample time = 415.30 ms / 200 runs ( 2.08 ms per token, 481.58 tokens per second)
llama_print_timings: prompt eval time = 162.12 ms / 62 tokens ( 2.61 ms per token, 382.44 tokens per second)
llama_print_timings: eval time = 8587.32 ms / 199 runs ( 43.15 ms per token, 23.17 tokens per second)
llama_print_timings: total time = 9498.89 ms
Output generated in 9.79 seconds (20.43 tokens/s, 200 tokens, context 63, seed 1836128893)
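(That works out to roughly 200 tokens / 9.8 s ≈ 20 tokens per second end to end, while the generation phase alone runs at 199 tokens / 8.59 s ≈ 23 tokens per second; the gap is mostly prompt processing and sampling overhead.)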
I'm using the text-generation-webui to provide the OpenAI API interface. It's pretty easy to hit:
import os

import openai

# text-generation-webui's OpenAI-compatible endpoint
url = "http://localhost:7860/v1"
openai_api_key = os.environ.get("OPENAI_API_KEY")

client = openai.OpenAI(base_url=url, api_key=openai_api_key)

result = client.chat.completions.create(
    model="wizardlm_wizardcoder-python-13b-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful AI agent. You are honest and truthful"},
        {"role": "user", "content": "What is the best approach when writing recursive functions?"},
    ],
)
print(result)
But again, it just depends on what you want to do.
This is by far the most important question. Because frankly, I can run LLaMA on my Raspberry Pi. It's slow as hell and not suited for any real-time task, but there are definitely operations where this would be an appropriate, cost-effective solution (preferably with a smaller distilled model).
There is no one-size-fits-all solution. The general advice is going to be a mid-tier graphics card, but I assume that's information OP already has or could have found just as easily by typing this question into Google or any LLM. So if you (OP) want better advice, we've got to have more information. The more detailed, the better (if this is a commercial application, then the answer is an A100, because GeForce cards are not allowed to be used in commercial environments, but no one's really going to stop you either). Ask a vague question, get vague answers. But we will ask refining questions to help you ask better questions too :)
I've noticed that Llama 2 + llama.cpp doesn't seem to even use the GPU much. I tried a better GPU (more speed, more memory) and my inference speed didn't increase.
2. AMD: They may change the landscape in coming months. And it looks like the US government restrictions on GPUs are going to impact prices in the server market in 2024.
3. The stacks are evolving quickly. What you buy today may be superseded by something tomorrow, meaning you should have spent more or could have spent less.
If you want to play: RAM is what matters most, GPU RAM and system RAM (in that order). Get the best GPU you can (RAM-wise), underclock it, and then add system memory if you can. Once you have a test bed that works for you, renting/cloud is a way to scale and play with bigger toys till you have a better sense of what you want and/or need.
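For the underclocking part, here's a minimal sketch of capping GPU power draw with nvidia-smi called from Python. The 200 W figure is just an illustrative assumption, and strictly speaking this is power-limiting rather than underclocking, but it's the usual way to trade a little speed for a lot less heat during inference:

import subprocess

# Show the current and maximum board power limits.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=power.limit,power.max_limit", "--format=csv"],
    capture_output=True, text=True,
).stdout)

# Cap board power at 200 W (hypothetical value; needs root/admin privileges).
subprocess.run(["nvidia-smi", "-pl", "200"], check=True)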
Have you considered running on a cloud machine instead? You can rent machines on https://vast.ai/ for under $1 an hour that should work for small/medium models (I've mostly been playing with Stable Diffusion, so I don't know what you'd need for an LLM offhand).
Good GPUs and Apple hardware are pricey. Get a bit of automation set up with some cloud storage (e.g. Backblaze B2) and you can have a machine ready to run your personally fine-tuned model rapidly with a CLI command or two.
There will be a break-even point, of course. Though a major advantage of renting is that you can move easily as the tech does. You don't want to sink large amounts of money into a GPU only to find the next hot open model needs more memory than you've got.
A gaming desktop PC with an Nvidia 3060 12GB or better. Upgrade the GPU first if you can afford it, prioritizing VRAM capacity and bandwidth. Nvidia GPU performance will blow any CPU, including the M3, out of the water, and the software ecosystem pretty much assumes you are using Nvidia. Laptop GPUs are not equivalent to the desktop ones with the same number, so don't be fooled. 8x 3090 (purchased used) is a popular configuration for people who have money and want to run the biggest models, but splitting models between GPUs requires extra work (sketched below).
Personally I have 1x 4090 because I like gaming too, but it isn't really a big improvement over 3090 for ML unless you have a specific use for FP8, because VRAM capacity and bandwidth are very similar.
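To give a flavor of that extra work, here's a rough sketch of splitting a model across two cards with llama-cpp-python; the GGUF path is a placeholder and the 50/50 split ratio is an assumption you'd tune to your cards:

from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-2-70b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model assigned to each GPU
)

print(llm("Q: Why do people buy used 3090s? A:", max_tokens=64)["choices"][0]["text"])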
A data point for you: 7B models at 5-bit quantization run quite comfortably under llama.cpp on the AMD Radeon RX 6700 XT, which has 12GB VRAM and was part of a lot of gaming PC builds around 2021-22.
I can’t give this as a recommendation - there are far more tools available for Nvidia GPUs, but larger VRAM is available on AMD GPUs at lower prices from what I can see.
(If you want a Mac,) Apple silicon has the advantage of unified memory, and with llama.cpp it can run those models locally and quickly. I'd say start with the largest model you want to run, and run it through llama.cpp, which will tell you the amount of memory needed. Then buy the Mac with at least that amount of memory that you can afford. If you have more budget, prioritize more memory, because you may want to be able to run larger models later.
If not a Mac, follow the other advice here and get an Nvidia GPU. In terms of the software ecosystem, Nvidia >> Apple >> AMD > Intel. (I think I got the ordering right, but the magnitude of difference might be subjective.)
Of course with those you'll also have to spend some money on a motherboard, RAM, SSD, PSU, CPU, etc.
I think the best bang for the buck is probably a Mac Studio with as much RAM as you can afford.
I bought an RTX A2000 (12GB VRAM), and it's fine for 7B models and some 13B models with 4 bit quantization, but I kind of regret not getting something with more VRAM.
I hate how the market is right now. I understand that NVidia doesn't want to provide a consumer level graphics card with truly impressive RAM specs, even though they could, because they feel it would eat into their datacenter market (and truth be told, it probably would), but it's super frustrating that you need to pay so much for decent performance, even as a single machine for personal use.
In my experience: get an Nvidia card with the most memory you can. That's more important than speed, as models are tending to get bigger, and having to stream model weights really hurts speed.
Somewhat related: how do you run an uncensored model locally? I run llamafile (the llamafile-server-0.1-llava-v1.5-7b-q4 and mistral-7b-instruct-v0.1-Q4_K_M-server ones) on my MacBook M1 and they run fine (fast enough for playing), but they both seem neutered quite a bit. It's hard to get them off the rails, and mistral (the above one) actually barfs really quickly, just repeating the same letter (fffffff usually) where it should've said fuck. Now I'm not looking for something that writes porn or whatnot, but the online models are so PC, it's getting on my nerves.
When you're running inference, it's super important to make sure that you're using the right prompt format (if you're using the Oobabooga text-generation web UI, make sure you have 'chat', 'chat-instruct', or 'instruct' properly selected). The model card on Hugging Face will usually tell you.
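If you're scripting against a model outside the web UI, one low-risk option is to let the tokenizer apply the model's own chat template instead of hand-assembling the format. A sketch with transformers, using Mistral Instruct as the example model:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]

# Produces the exact prompt string the model expects, e.g. the
# [INST] ... [/INST] wrapping for Mistral Instruct models.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)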
Yes, I wonder the same. I'm not so much about porn but pretty much every conversation with a model gets abruptly cut off when it's just getting interesting. Every model I tried - mistral, codellama etc. - is terribly maimed in this respect.
Nvidia GPUs are really your only choice. There is no framework as mature as CUDA, and Nvidia has been making the fastest hardware for decades. They know their stuff when it comes to architecture, so it's unlikely that the hot new thing will actually be able to compete.
Having their cards run in servers is a big part of their business model, and Linux owns that market, so I’m surprised if their support is getting worse.
Things like poor Wayland support—sure. But then, why would somebody use this matrix-multiplication accelerator to draw graphics, right?
On my home system I've been running 2x RTX 2070s on Fedora and have had some serious problems. It's been fairly stable for a while, but the last week or so I keep having the screen go black and not come back. I'm going to try Debian, as it's supposed to have better support for Nvidia cards. I've been using Fedora or Red Hat for a long time and I'd rather not switch, but these driver issues make the system unusable.
The screen goes dark and unrecoverable during normal use, not while using ML tools, so I just assumed it was a problem with Nvidia's drivers being generally disagreeable with Fedora.
You make a good point; I've been having one card do double duty as HDMI output and GPGPU. I'll try the motherboard's built-in HDMI and see how that goes.
I've been really busy with other stuff for a few weeks and haven't really thought about the best way to fix this. Thanks for the suggestion!
CUDA works great on Linux. Full stop. If you're having issues, it's because you've done something bizarre, like installing multiple versions of the driver. I promise you. I've been there and it was wholly my fault. Is it obvious or necessarily easy to fix? Nope. But that is a problem with Linux and not the driver or CUDA.
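A quick sanity check I'd run before blaming the driver, assuming PyTorch is installed (any CUDA-aware framework can report the same things):

import torch

# If the driver and CUDA runtime are consistent, this prints True,
# the toolkit version, and the visible GPU(s).
print(torch.cuda.is_available())
print(torch.version.cuda)
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))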
I'm using a Duo 16 (2023) with a 4090 16GB and a Ryzen 9 7945HX (16c/32t). It can also use another 32GB of shared RAM, which effectively makes it a 48GB 4090. It's quite a bit slower than a full-on 4090, but it can load decent-sized models and works well.
Tested both Linux (some things will need manual patching) and Windows. Works like a charm.
To add to this, I have a laptop with 32G of RAM and am able to run some 7B models on CPU. But I'd like to work on some larger models. Are there any eGPUs that can aid in this?
Yeah, I just meant that it's like the only option, which is crazy because it's a 2020 GPU.
There are lots of questions about what hardware to get for ML, and the generic answer is basically always "get a 3090." It's so frequently recommended that it feels like a meme to me.
I'm running Mistral 7B on an 8GB M1 Mac, just barely. It's an ask-a-question, get-a-coffee type of thing. No idea how this works, as 32-bit floats require 4 bytes each, so with 7B parameters it would need to be swapping to the SSD.
If I had the cash I would go for a 24GB M2/M3 Pro. That would allow me to comfortably load the 7B model into RAM.
Have you looked into quantization? At 8-bit quantization, a 7B model requires ~7GB of RAM (plus a bit of overhead); at 4-bit, it would require around 3.5GB and fit entirely into the RAM you have. Quality of generation does degrade a bit the smaller you quantize, but not as much as you may think.
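The back-of-the-envelope math, as a sketch (weights only; the KV cache and runtime overhead come on top and grow with context length):

def approx_weight_ram_gb(params_billion: float, bits_per_weight: float) -> float:
    # Memory for the weights alone: parameters x bytes per weight.
    return params_billion * (bits_per_weight / 8)

for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{approx_weight_ram_gb(7, bits):.1f} GB")
# 7B at 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB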
How? I have an M2 Pro and I run 7B and 13B models through Ollama and also LM Studio.
Because there's no CUDA, the speed is much slower than ChatGPT. The answers from 7B are also not of the same quality as ChatGPT (lots of mistakes and hallucinations).
I run a 13B Q4 Llama variant on my ten-year-old server with two Xeon E5-2670s, 128GB of RAM, and no GPU.
It runs at under 3 tokens per second. I usually just give it my prompt and go make a coffee or something. The server is in my basement, you can barely hear the fans screaming at all.
I don't want to derail the OP's question, but would the same kind of system to run an LLM on also be suitable for an image generator like Stable Diffusion or does it work through different methods?
If you're willing to wait a few days, remember that Intel Core Ultra processors (Meteor Lake) are supposed to be available on December 14th. The embedded NPU should make a difference.
Somewhat related. I’ve got an M2 Max Mac Studio with 32GB of ram. Is there anything interesting I can do with it in terms of ML? What’s the scene like on moderately powered equipment like this?
M1 MacBooks are still available new in high-memory configs... I picked up a 64GB M1 Max for less than $2500. It's a good setup because of the shared CPU/GPU memory scheme.
I'm getting "Error: llama runner process has terminated" when trying to run the model. According to this ticket[1], it's a memory issue. Not sure why 16GB of RAM is struggling with a 7B model, though.
Oh, interesting. Mistral:7b works, but wizardcoder:7b-python throws the same error as before. What's another good coding model to use besides wizardcoder?
Edit: wizardcoder:7b-python-q4_1 throws the same error
I was interested in Stable Diffusion / images, and also text generation.
I started playing with ComfyUI and Ollama.
An M1 Ultra Mac Studio would generate a 'base' 512x512 image in around 6 seconds, and Ollama responses seemed easily 'quick enough'. Faster than I could read.
On an i7-3930K, purely CPU-only, a similar image would take around 2.5 minutes, and Ollama was painful, as I would be waiting for each next word.
Then I switched to a 3080 Ti, which I hadn't been using for gaming as it got stupidly hot and I regretted having it. Suddenly it was redeemed.
On the 3080 Ti, the same images come out in less than a second, and Ollama generation is even faster. Sure, I'm limited to 7B models for text (the Mac could go much higher) and there will be limits with image size/complexity, but this thing is so much faster than I expected, and it hardly generates any heat/noise at the same time - completely different to gaming. This is all a simple install under Linux (Pop!_OS in this case).
tl;dr - A linux PC with a high-end GPU is the best value by far unless you really need big models, in my experience.
I started to put together a second machine to be good at inference and then decided to just make my daily driver capable enough. I ended up upgrading my laptop to an MBP with an M2 Max and 96GB. It runs even bigger models fairly well.
You then download a model you want, say, Llama 2. Install a Python package to interact with the model (possibly `pip install llama-cpp-python`) and have fun.
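A minimal sketch of that last step; the GGUF filename is a placeholder for whatever quantized file you downloaded, and n_gpu_layers=0 keeps it CPU-only (raise it to offload layers if you have a GPU):

from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical path
    n_ctx=2048,       # context window
    n_gpu_layers=0,   # CPU-only; increase to offload layers to a GPU
)

out = llm.create_chat_completion(messages=[
    {"role": "user", "content": "Explain what quantization does to a model."},
])
print(out["choices"][0]["message"]["content"])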