Ask HN: What's the best hardware to run small/medium models locally?
74 points by triyambakam on Dec 10, 2023 | 96 comments
M2 MacBook? A certain graphics card? Running inference on CPU on my old Thinkpad isn't fun.



I think there are a couple of basic questions that need to be answered before we can find a good solution:

1) What are you trying to do?

2) What's your budget?

Generically saying "run inference" doesn't narrow things down much... you can do that on your current ThinkPad if you pick a small enough model. If you want to run 7B, 13B, or 34B models for document or sentiment analysis, or whatever, then you can move on to the budget question.

When I was faced with this question, I bought the cheapest 4060 Ti with 16GB I could find. It does "okay". Here's an example run:

  Llama.generate: prefix-match hit
  
  llama_print_timings:        load time =     627.53 ms
  llama_print_timings:      sample time =     415.30 ms /   200 runs   (    2.08 ms per token,   481.58 tokens per second)
  llama_print_timings: prompt eval time =     162.12 ms /    62 tokens (    2.61 ms per token,   382.44 tokens per second)
  llama_print_timings:        eval time =    8587.32 ms /   199 runs   (   43.15 ms per token,    23.17 tokens per second)
  llama_print_timings:       total time =    9498.89 ms
  Output generated in 9.79 seconds (20.43 tokens/s, 200 tokens, context 63, seed 1836128893)

I'm using the text-generation-webui to provide the OpenAI API interface. It's pretty easy to hit:

  import os
  import openai
  url = "http://localhost:7860/v1"
  openai_api_key = os.environ.get("OPENAI_API_KEY")
  client = openai.OpenAI(base_url=url, api_key=openai_api_key)
  result = client.chat.completions.create(
      model="wizardlm_wizardcoder-python-13b-v1.0",
      messages=[
          {"role": "system", "content": "You are a helpful AI agent. You are honest and truthful."},
          {"role": "user", "content": "What is the best approach when writing recursive functions?"},
      ],
  )
  print(result)

But again, it just depends on what you want to do.


> What are you trying to do?

This is by far the most important question. Because frankly, I can run LLaMA on my Raspberry Pi. It's slow as hell and not suited for any real-time task, but there are definitely operations where it would be an appropriate, cost-effective solution (preferably with a smaller distilled model).

There is no one-size-fits-all solution. The general advice is going to be a generic mid-tier graphics card, but I assume that's information OP already has or could have found just as easily by typing this question into Google or any LLM. So if you (OP) want better advice, we've got to have more information. The more detailed, the better (if this is a commercial application, then the answer is A100, because GeForce cards aren't licensed for commercial/datacenter environments, though no one's really going to stop you either). Ask a vague question, get vague answers. But we will ask refining questions to help you ask better questions too :)


I've noticed that Llama 2 + llama.cpp doesn't seem to even use the GPU much. I tried a better GPU (more speed, more memory) and my inference speed didn't increase.


Make sure that you're telling it to use the GPU. How are you launching llama_cpp?
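For reference, llama.cpp only offloads layers to the GPU if you ask it to: on the command line that's the -ngl (--n-gpu-layers) flag, and in llama-cpp-python it's the n_gpu_layers argument. A minimal sketch, assuming a CUDA-enabled build (the model path and layer count here are just examples; tune the layer count to your VRAM):

  # Sketch: explicit GPU offload with llama-cpp-python (needs a CUDA-enabled build).
  # n_gpu_layers sets how many transformer layers get pushed onto the GPU.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./models/llama-2-7b.Q4_K_M.gguf",  # example path
      n_gpu_layers=35,  # example value; raise or lower to fit your VRAM
  )
  out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
  print(out["choices"][0]["text"])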


I was using the command line. llama-7b

I did some investigating, and I found it doesn't start using the GPU unless you have a lot of input (such as a long prompt).


And a third question - what kind of machine are you looking for?

If you want a laptop, that's going to be a very different machine to a desktop.


1. The GPU market is a mess! https://www.tweaktown.com/news/94394/amds-top-end-rdna-3-sal... Insiders who watch the prices and talk to VARs all say that the channels seem stuffed and that prices are holding back sales.

2. AMD: they may change the landscape in the coming months. And it looks like the US government restrictions on GPUs are going to impact prices in the server market in 2024.

3. The stacks are evolving quickly. What you buy today may be superseded by something tomorrow, which means you should have spent more or could have spent less.

If you want to play: RAM is what matters most. GPU RAM and system RAM (in that order). Get the best GPU you can (RAM-wise), underclock it, and then add system memory if you can. Once you have a test bed that works for you, renting/cloud is a way to scale and play with bigger toys till you have a better sense of what you want and/or need.


This is what I did. A 3080 eventually became a 3090 off eBay, and 32GB became 128GB... but all on a budget and over time.

Also, as others have pointed out, it depends... I run models on Raspberry Pis as well; one is doing live network detection...


Have you written any blog posts or anything about the one you have doing live network detection? I’d be really curious to hear more.


Have you considered running on a cloud machine instead? You can rent machines on https://vast.ai/ for under $1 an hour that should work for small/medium models (I've mostly been playing with Stable Diffusion, so I don't know what you'd need for an LLM offhand).

Good GPUs and Apple hardware are pricey. Set up a bit of automation with some cloud storage (e.g. Backblaze B2) and you can have a machine ready to run your personally fine-tuned model rapidly with a CLI command or two.

There will be a break-even point, of course. Though a major advantage of renting is you can move easily as the tech does. You don't want to sink a large amount of money into a GPU only to find the next hot open model needs more memory than you've got.
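As one concrete example of the storage piece, here's a rough sketch of pulling a fine-tuned GGUF from B2 onto a freshly rented box via B2's S3-compatible API (the bucket, endpoint, and file names are placeholders, not a real setup):

  # Sketch: fetch model weights from Backblaze B2 over its S3-compatible API
  # onto a freshly rented GPU box. Bucket/endpoint/key names are placeholders.
  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="https://s3.us-west-004.backblazeb2.com",  # your B2 region endpoint
      aws_access_key_id="YOUR_B2_KEY_ID",
      aws_secret_access_key="YOUR_B2_APPLICATION_KEY",
  )
  s3.download_file(
      "my-model-bucket",                        # bucket name (placeholder)
      "mistral-7b-mytune.Q4_K_M.gguf",          # object key (placeholder)
      "/models/mistral-7b-mytune.Q4_K_M.gguf",  # local path on the rented box
  )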


I will link a few that I haven't used yet but seem promising:

- https://octoai.cloud/

- https://www.fal.ai/

- https://vast.ai/ (linked by gchadwick above)

- https://www.runpod.io/

- https://www.cerebrium.ai/


A gaming desktop PC with an Nvidia 3060 12GB or better. Upgrade the GPU first if you can afford it, prioritizing VRAM capacity and bandwidth. Nvidia GPU performance will blow any CPU, including the M3, out of the water, and the software ecosystem pretty much assumes you are using Nvidia. Laptop GPUs are not equivalent to the desktop ones with the same number, so don't be fooled. 8x 3090 (purchased used) is a popular configuration for people who have money and want to run the biggest models, but splitting models between GPUs requires extra work.

Personally I have 1x 4090 because I like gaming too, but it isn't really a big improvement over 3090 for ML unless you have a specific use for FP8, because VRAM capacity and bandwidth are very similar.


[deleted]


Not a dumb question, but maybe dumb to make a whole comment instead of taking two seconds to google it

https://www.techspot.com/review/2625-nvidia-rtx-4090-laptop-...


A data point for you: 7B models at 5-bit quantization run quite comfortably under llama.cpp on the AMD Radeon RX 6700 XT, which has 12GB VRAM and was part of a lot of gaming PC builds around 2021-22.

I can’t give this as a recommendation - there are far more tools available for Nvidia GPUs, but larger VRAM is available on AMD GPUs at lower prices from what I can see.


(If you want a Mac,) Apple silicon has the advantage of unified memory, and with llama.cpp it can run those models locally and quickly. I'd say start with the largest model you want to run, load it in llama.cpp, which will tell you the amount of memory needed, and buy the Mac you can afford with at least that much memory. If you have more budget, prioritize more memory, because you may want to run larger models later.

If not a Mac, follow the other advice here and get an Nvidia GPU. In terms of the software ecosystem, Nvidia >> Apple >> AMD > Intel. (I think I got the ordering right, but the magnitude of the differences might be subjective.)


4060Ti w/ 16GB VRAM or 3090 w/ 24 GB VRAM

Of course, with those you'll also have to spend some money on a motherboard, RAM, SSD, PSU, CPU, etc.

I think the best bang for the buck is probably a Mac Studio with as much RAM as you can afford.

I bought an RTX A2000 (12GB VRAM), and it's fine for 7B models and some 13B models with 4 bit quantization, but I kind of regret not getting something with more VRAM.


Just FYI you can get a hacked 2080 Ti with 22GB for half the cost of a used 3090.

https://news.ycombinator.com/item?id=38573884


I hate how the market is right now. I understand that Nvidia doesn't want to provide a consumer-level graphics card with truly impressive RAM specs, even though they could, because they feel it would eat into their datacenter market (and truth be told, it probably would), but it's super frustrating that you need to pay so much for decent performance, even for a single machine for personal use.


In my experience, get an Nvidia card with the most memory you can; memory is more important than speed, as models are tending to get bigger, and streaming a model that doesn't fit in VRAM really hurts speed.

I don’t have any Mac experience.


Somewhat related: how do you run an uncensored model locally? I run llamafile ones (llamafile-server-0.1-llava-v1.5-7b-q4 and mistral-7b-instruct-v0.1-Q4_K_M-server) on my MacBook M1 and they run fine (fast enough for playing), but they both seem neutered quite a bit. It's hard to get them off the rails, and Mistral (the one above) actually barfs really quickly, just repeating the same letter (fffffff usually) where it should've said fuck. Now I'm not looking for something that writes porn or whatnot, but the online models are so PC, it's getting on my nerves.


When you're running inference, it's super important to make sure that you're using the right prompt format (if you're using the Oobabooga text-generation web UI, make sure you have 'chat', 'chat-instruct', or 'instruct' properly selected). The model card on Hugging Face will usually tell you.
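For example, Llama-2-chat style models generally want their input wrapped like this (just a sketch; templates differ per model family, so go by the model card):

  # Sketch of the Llama-2-chat prompt template. Other families (Mistral
  # instruct, Alpaca, ChatML, ...) use different wrappers; check the model card.
  system = "You are a helpful AI agent. You are honest and truthful."
  user = "What is the best approach when writing recursive functions?"

  prompt = (
      f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
      f"{user} [/INST]"
  )
  print(prompt)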


Yes, I wonder the same. I'm not so much after porn, but pretty much every conversation with a model gets abruptly cut off when it's just getting interesting. Every model I tried - Mistral, CodeLlama, etc. - is terribly maimed in this respect.


Nvidia GPUs are really your only choice. There is no framework as mature as CUDA, and Nvidia has been making the fastest hardware for decades. They know their stuff when it comes to architecture, so it's unlikely that the hot new thing will actually be able to compete.


+1

Also I run Linux full time, there's nothing better.

Apple side is still severely broken, you'll be fighting x86 shenanigans regularly.


Unless you use linux, where the quality of Nvidia support continues to decline.


Have there been issues with CUDA on Linux?

Having their cards run in servers is a big part of their business model, and Linux owns that market, so I’m surprised if their support is getting worse.

Things like poor Wayland support—sure. But then, why would somebody use this matrix-multiplication accelerator to draw graphics, right?


On my home system I've been running 2x RTX 2070s on Fedora and have had some serious problems. It's been fairly stable for a while, but for the last week or so I keep having the screen go black and not come back. I'm going to try Debian, as it's supposed to have better support for Nvidia cards. I've been using Fedora or Red Hat for a long time, and I'd rather not switch, but these driver issues make the system unusable.


So one is for graphics and the other is for CUDA or something?


Yes, the Debian setup is going to be just for CUDA and ML. It's intended for cheap and easy experiments, focusing on ML for small systems.


Hmm, I'm still confused, sorry. Are you devoting both dGPUs to GPGPU stuff and using the iGPU for the desktop? Or is one dGPU doing double duty?

I wonder if it is worth trying the iGPU, if you haven’t already.

It would be a shame to distro-hop for a non-preference-based reason like this.


The screen goes dark and unrecoverable during normal use, not while using ML tools, so I just assumed it was a problem with Nvidia's drivers being generally disagreeable with Fedora.

You make a good point; I've been having one card do double duty as HDMI output and GPGPU. I'll try the motherboard's built-in HDMI and see how that goes.

I've been really busy with other stuff for a few weeks and haven't really thought about the best way to fix this. Thanks for the suggestion!


CUDA works great on Linux. Full stop. If you're having issues, it's because you've done something bizarre, like installing multiple versions of the driver. I promise you. I've been there and it was wholly my fault. Is it obvious or necessarily easy to fix? Nope. But that is a problem with Linux, not with the driver or CUDA.


I am running 3090 on Linux Mint and 2060 on Rocky (RHEL 9) without any issues. Both CUDA and regular desktop use.


Running KDE on both 2060 RTX laptop and 3080 RTX desktop, flawless.


Linux support is great, what are you on about? You running Nouveau drivers or something lol


> where the quality of Nvidia support continues to decline.

what are you talking about?


Which GPUs should I consider?


I'm using a Duo 16 (2023) with a 4090 16GB and a Ryzen 9 7945HX (16c/32t). It can also use another 32GB of shared RAM, which makes it effectively a 48GB 4090. It's quite a bit slower than a full desktop 4090, but it can load decent-sized models and works well.

Tested both Linux (some things will need manual patching) and Windows. Works like a charm.


Any links to setting up a ChatGPT-like experience that is entirely local - ie. no connectivity to the web/cloud?



https://github.com/lmstudio-ai - several models to choose from, also API access if you want


To add to this, I have a laptop with 32GB of RAM and am able to run some 7B models on CPU. But I'd like to work with some larger models. Are there any eGPUs that can help with this?


I have a 12-year-old desktop with 24GB of RAM and no modern GPU and can run the 7B models, it's just no fun :(


I have a similarly old desktop with 24GB RAM and an outdated GPU. What sort of tokens/second were you able to get with the 7B models?


A used RTX 3090 off eBay is the most interesting budget option by far.

If you have twice the cash, go for a new RTX 4090 for roughly twice the performance.

If you need more than 24GB of VRAM, you'll want to get comfortable with sharding across a few 3090s, or spend a lot more on a 48, 80, or 100GB card.

If you feel adventurous, you can go a non-Nvidia route, but expect a lot of friction and elbow grease, at least for now.


Yeah, the 3090 is a meme in local AI communities. Additionally, the support is amazing because it's essentially the same architecture as an A100.

The 3060 is popular too, being a 3090 cut in half.


> Yeah, the 3090 is a meme in local AI communities.

Calling the 3090 a "meme" makes it sound like the 3090 is a joke. Do you mean that the 3090 is "well-known" in local AI communities?


Yeah, I just meant that it's like the only option, which is crazy because it's a 2020 GPU.

There are lots of questions about what hardware to get for ML, and the generic answer is basically always "get a 3090." It's so frequently recommended that it feels like a meme to me.


It hits a good sweet spot between price and power, and there are usually some easily available. Availability is a major factor, IMO.


Yeah, exactly. Gaudi boards might be good for local genai... if you could find any.


I myself wanted a new 4090 (€2,000), but my budget constrained me to a used 3090 from eBay (€800), which has served me very well so far.


Anything from Nvidia.

An RTX 3060 with 12GB VRAM if you're on a budget; dial up from there.

Steer away from Apple unless all you do is work from a laptop.


I'm running Mistral 7B on an 8GB M1 Mac, just barely. It's an ask-a-question-and-go-get-a-coffee type of thing. No idea how this works, as 32-bit floats require 4 bytes each, so a 7B model would need to be swapping to the SSD.

If I had the cash I would go for a 24GB M2/M3 Pro. That would allow me to comfortably load the 7B model into RAM.


Have you looked into quantization? At 8-bit quantization, a 7B model requires ~7GB of RAM (plus a bit of overhead); at 4-bit, it would require around 3.5GB and fit entirely into the RAM you have. Quality of generation does degrade a bit the smaller you quantize, but not as much as you may think.
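The back-of-the-envelope math is just parameters times bits per weight, plus some slack for the KV cache and runtime buffers (the ~20% overhead below is a rough guess, not a precise figure):

  # Rough weight-memory estimate: params * bits/8, plus ~20% slack for the
  # KV cache and runtime buffers (the overhead factor is a rough guess).
  def approx_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
      return params_billion * bits / 8 * overhead

  for bits in (16, 8, 4):
      print(f"7B at {bits}-bit: ~{approx_gb(7, bits):.1f} GB")
  # -> roughly 16.8 GB, 8.4 GB, and 4.2 GB respectively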


This is interesting; I've written up how I set it up here: https://christiaanse.ca/posts/running_llm/


I run Mistral on an M2 Air and it's broadly similar to ChatGPT.


How? I have an M2 Pro and I run 7B and 13B models through Ollama and also LM Studio.

Because there's no CUDA, the speed is much slower than ChatGPT. The answers from 7B are also not of the same quality as ChatGPT's. (Lots of mistakes and hallucinations.)


Is Metal working?


Yes but it’s still no ChatGPT 4.


Can I ask what you're using it for?


No, that's the whole point of running this standalone without anyone spying.


I run a 13B Q4 Llama variant on my ten-year-old server with two Xeon E5-2670s, 128GB of RAM, and no GPU.

It runs at under 3 tokens per second. I usually just give it my prompt and go make a coffee or something. The server is in my basement, so you can barely hear the fans screaming at all.


I don't want to derail the OP's question, but would the same kind of system used to run an LLM also be suitable for an image generator like Stable Diffusion, or does that work through different methods?


I think so. A big Nvidia GPU will run both.


Is a GPU the same as a graphics card?


Yes (Graphics Processing Unit).


Yes.


If you're willing to wait a few days, remember that Intel Core Ultra processors (Meteor Lake) are supposed to be available on December 14th. The embedded NPU should make a difference.


How do you think the onboard NPU will compare to a discrete GPU on AI tasks?


Somewhat related: I've got an M2 Max Mac Studio with 32GB of RAM. Is there anything interesting I can do with it in terms of ML? What's the scene like on moderately powered equipment like this?


Try https://ollama.ai

Super easy to get set up and play around with. Probably start with Llama 2 13B or 7B, and then Mistral 7B.
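Once the daemon is up you can also hit it from code; a minimal sketch against Ollama's local HTTP API (assumes the default port 11434 and that the model tag has already been pulled):

  # Minimal sketch against Ollama's local HTTP API. Assumes the daemon is
  # running on the default port and "ollama pull llama2:13b" has been done.
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "llama2:13b", "prompt": "Why is the sky blue?", "stream": False},
  )
  print(resp.json()["response"])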

If you want a dead-simple chat GUI for interfacing, check out Ollamac.


Wow, just like that. Straight to the races. Thank you!

Kind of regretting not getting 64GB of RAM now. I didn't think I'd need it, but... here I am, wishing I could run some of these models.


M1 MacBooks are still available new in high-memory configs... I picked up a 64GB M1 Max for less than $2,500. It's a good setup because of the shared CPU/GPU memory scheme.


MacBook, thanks to Apple's new MLX framework.


Which model can I use with 16GB of RAM? I tried Ollama with WizardCoder 7B, but it didn't work.


Not sure.. there was a benchmark article on the front page earlier today, but now I can't seem to find it.


What didn’t work? I run Ollama on my M2 Pro and all models work.


I'm getting "Error: llama runner process has terminated" when trying to run the model. According to this ticket[1], it's a memory issue. Not sure why 16GB ram are struggling with a 7b model, though.

[1] https://github.com/jmorganca/ollama/issues/1231


Could you try another model like Mistral 7B and see if it triggers the same error? (And check your available memory.)

I have 16GB too and have no trouble running even 13B models.


Oh, interesting. Mistral:7b works, but wizardcoder:7b-python throws the same error as before. What's another good coding model to use besides wizardcoder?

Edit: wizardcoder:7b-python-q4_1 throws the same error


I've been running Orca 2 13B quite nicely on an M1 Pro with 32GB of RAM, using LM Studio and GPU acceleration.

https://huggingface.co/TheBloke/Orca-2-13B-GGUF


Ollama runs 13B models just fine on my M2 Air with 16GB.


What kind of tokens/sec do you get?


I was interested in Stable Diffusion / images, and also text generation.

I started playing with ComfyUI and Ollama.

An M1 Studio Ultra would generate a 'base' 512x512 image in around 6 seconds, and Ollama responses seemed easily 'quick enough'. Faster than I could read.

On an i7-3930K, purely CPU-only, a similar image would take around 2.5 minutes, and Ollama was painful, as I would be waiting for each next word.

Then I switched to a 3080 Ti, which I hadn't been using for gaming as it got stupidly hot and I regretted having it. Suddenly it was redeemed.

On the 3080 Ti, the same images come out in less than a second, and Ollama generation is even faster. Sure, I'm limited to 7B models for text (the Mac could go much higher) and there will be limits on image size/complexity, but this thing is so much faster than I expected, and it hardly generates any heat or noise at the same time - completely different from gaming. This is all a simple install under Linux (Pop!_OS in this case).

tl;dr - A Linux PC with a high-end GPU is the best value by far unless you really need big models, in my experience.


The synopsis is basically: Nvidia wins on processing power for the money, but the new M* chips give you more memory for the money.


Llama models run fine on the M2/M3 MacBooks thanks to llama.cpp/GGML.


You can try edgeimpulse.com; they support a lot of "small" hardware for running different models.


I started to put together a second machine to be good at inference, then decided to just make my daily driver capable enough. Ended up upgrading my laptop to an MBP with an M2 Max and 96GB. It runs even bigger models fairly well.


Just run them on AWS


Can you expand more on how that's done, or point me in the direction of a guide?


Possibly the easiest way would be to run them in a notebook on Amazon SageMaker.

https://aws.amazon.com/sagemaker/notebooks/


So you just start a notebook and then what?


You then download a model you want, say, Llama 2. Install a Python package to interact with the model (possibly `pip install llama-cpp-python`) and have fun.
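Something like this in a notebook cell, for instance (a sketch that also assumes `pip install huggingface_hub`; the repo and file names are just one example of a quantized Llama 2 upload, and you may prefer a different quantization):

  # Sketch for a SageMaker notebook cell: grab a quantized Llama 2 GGUF and
  # run it with llama-cpp-python. Repo/file names are examples, not a recipe.
  from huggingface_hub import hf_hub_download
  from llama_cpp import Llama

  path = hf_hub_download("TheBloke/Llama-2-7B-Chat-GGUF",
                         "llama-2-7b-chat.Q4_K_M.gguf")
  llm = Llama(model_path=path, n_ctx=2048)
  out = llm("Q: What is a good first test prompt? A:", max_tokens=128)
  print(out["choices"][0]["text"])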


How would AWS run the user's models locally (as per the question)?


I'm happy with my used 3090.


M2? That's some cope.



