
Thanks, that makes sense and helps a lot. I have a 16GB M1 that I got LLaMA 13B running on. It works really well, but I really want to run bigger models, so your examples of RAM -> model size are super helpful.
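For anyone else doing the RAM -> model size math, here's a rough back-of-the-envelope in Python (the ~1.2x overhead factor for context and runtime buffers is just my guess, not an exact figure):

    # Rough weight-memory estimate for quantized LLaMA-style models.
    # overhead=1.2 is a guess to cover KV cache and runtime buffers.
    def approx_ram_gb(params_billion, bits_per_weight, overhead=1.2):
        weight_bytes = params_billion * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 1e9

    for size in (7, 13, 30, 65):
        print(f"{size}B  4-bit: ~{approx_ram_gb(size, 4):.0f} GB   "
              f"q5_1 (~5.5 bits): ~{approx_ram_gb(size, 5.5):.0f} GB")

By that math, a 4-bit 13B lands around 8 GB (which is why it fits on a 16GB M1) and a 4-bit 30B around 18 GB, which lines up with the 24GB-card comments below.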

I'll probably just end up getting a higher-capacity Mac in the next few years. Right now 96GB configurations seem to be around $4k; if that comes down a bit in the future, I'll probably pick something up.

I'm not really looking to train models myself, so training cost isn't an issue for me personally. I just want to be able to run the best of what the open-source community comes up with (or contribute to a pool to train models, if that becomes a thing).



Just a heads up - GPUs are a lot faster than CPUs depending on what models you're running, especially if you're looking at running the image models.

Admittedly I'm not sure how well they work if you stream/batch to the GPU (say 96GB of system RAM + a 24GB GPU).
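If it helps, llama.cpp can do roughly that kind of split: you offload however many layers fit in VRAM and the rest run on the CPU out of system RAM. A minimal sketch with llama-cpp-python (the model path and layer count are just placeholders):

    # Minimal llama-cpp-python sketch: partial GPU offload.
    # model_path and n_gpu_layers are placeholders; raise n_gpu_layers
    # until VRAM is nearly full, the remaining layers stay on the CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-30b.q5_1.bin",  # hypothetical file
        n_gpu_layers=40,
        n_ctx=2048,
    )
    out = llm("Explain quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])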

I've heard used Nvidia workstation cards are reasonably cheap if you want >24GB of VRAM.

A 3090/4090 has 24GB of VRAM and can run up to the 30B models with some optimizations; that's the easiest way to run 30B, which is essentially the largest size any consumer card can handle. If you also play games and have the money, then this is the way to go IMO.

If you do get a GPU, it needs CUDA support (so Nvidia only) unless you want a headache.


As a data point, I'm getting >3 tokens per second for a 30B model (q5_1 quantization) and >1 token per second for a 65B model (also q5_1) on an M1 Max. This is good enough for my use case and it beats an old P40, but I have no idea what the performance on a 3090/4090 would be. Keep in mind that 24GB of VRAM is not enough to hold a quantized 65B, so that case would be split between GPU and CPU.
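If you want to compare machines apples-to-apples, a crude wall-clock tokens-per-second check is enough. A sketch with llama-cpp-python (the model path is a placeholder; the usage counts come from the OpenAI-style completion dict it returns):

    # Crude tokens/sec benchmark; model path is a placeholder.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-30b.q5_1.bin", n_ctx=2048)

    start = time.perf_counter()
    out = llm("Write a short paragraph about quantization.", max_tokens=128)
    elapsed = time.perf_counter() - start

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.2f} tok/s")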


Oh, a 4090 can run a 30B model? That's excellent! I was afraid it wouldn't be able to load bigger models than my MacBook.

I've got a perfectly usable desktop sitting idle with a 1070 in it; I'll probably grab a 4090 to throw in there and give that a try. Getting 4K gaming would be a nice bonus. Thanks for the comment.

Being confined to Nvidia is indeed a bummer though, especially because I like Sway on Linux. But my understanding is that ROCm is nowhere near parity with CUDA.


Looking at the comments, I would double-check the benchmarks, because maybe CPUs are faster than I thought for LLMs?

I know that for Stable Diffusion my 4090 isn't even comparable to my i7 8700K, and AFAIK the AMD/Intel offerings still don't compare for LLMs, but admittedly it's possible they've caught up?

I don't have a ton of time at the moment to keep looking, and I have a very hard time believing the M1 can keep up with a 4090 at all; I just don't want you to drop $1.7k if I'm wrong :P

EDIT: Oh, to clarify - the 4090 can definitely run the 30B models without issue with 4-bit quantization.



