
I’m curious how good local LLM performance is on ‘outdated’ hardware like the author’s 2060. I have a desktop with a 2070 Super that could be fun to turn into an “AI server” if I had the time…





I've been playing with some LLMs like Llama 3 and Gemma on my 2080Ti. If it fits in GPU memory the inference speed is quite decent.

However, I've found the quality of smaller models to be quite lacking. Llama 3.2 3B, for example, is much worse than Gemma 2 9B, which is the one I've found performs best while still fitting comfortably.

The actual sentences are fine, but the smaller model doesn't follow prompts as well and doesn't "understand" the context very well.

Quantization brings down the memory cost, but there seems to be a sharp quality drop below 5 bits. So a larger but heavily quantized model usually performs worse than a smaller one at higher precision, at least with the models I've tried so far.

So with only 6GB of GPU memory I think you either have to accept the hit on inference speed by only partially offloading, or accept fairly low model quality.
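A rough back-of-the-envelope sketch of why (weights only; the KV cache and runtime overhead need room on top, and the bits-per-weight figures are approximate):

    # Rough weight-memory estimate for quantized models (weights only, GB = 1e9 bytes).
    def approx_weight_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8

    print(approx_weight_gb(9, 6.5))  # Gemma 2 9B at ~Q6: ~7.3 GB, won't fully fit in 6 GB
    print(approx_weight_gb(9, 4.5))  # Gemma 2 9B at ~Q4: ~5.1 GB, tight once the KV cache is added
    print(approx_weight_gb(3, 4.5))  # Llama 3.2 3B at ~Q4: ~1.7 GB, fits easily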

Doesn't mean the smaller models can't be useful, but don't expect ChatGPT 4o at home.

That said, if you've got a beefy CPU it can be reasonable to have it handle a few of the layers.

Personally I found Gemma 2 9B quantized to 6 bits (IIRC) to be quite useful. YMMV.


Yes, gemma-2-9b-it-Q6_K_L is the one that works well for me.

I tried gemma-2-27b-it-Q4_K_L but it's not as good, despite being larger.

Using llama.cpp and models from here[1].

[1]: https://huggingface.co/bartowski
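In case it helps anyone, this is roughly how I load those GGUFs. A minimal sketch using the llama-cpp-python bindings rather than the llama.cpp CLI; the model path is just an example, point it at wherever you downloaded the file.

    # Minimal sketch: load a bartowski GGUF with the llama-cpp-python bindings
    # (pip install llama-cpp-python). The path below is an example.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/gemma-2-9b-it-Q6_K_L.gguf",
        n_gpu_layers=-1,  # offload as many layers as fit; lower this if you run out of VRAM
        n_ctx=4096,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain GGUF quantization in two sentences."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])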


If you want to set up an AI server for your own use, it's exceedingly easy to install LM Studio and hit the "serve an API" button.

Testing performance this way, I got about 0.5-1.5 tokens per second with an 8 GB, 4-bit quantized model on an old DL360 rack-mount server with 192 GB of RAM and two E5-2670 CPUs. I got about 20-50 tokens per second on my laptop with a mobile RTX 4080.
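Once the server is running it speaks an OpenAI-style chat completions API. A minimal sketch of a client call, assuming LM Studio's default address (http://localhost:1234) and that a model is already loaded:

    # Minimal sketch: query LM Studio's local OpenAI-compatible server.
    # Assumes the default address and an already-loaded model.
    import requests

    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "local-model",  # placeholder; use whatever identifier LM Studio shows for your loaded model
            "messages": [{"role": "user", "content": "Say hello in five words."}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])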


LM Studio is so nice, I was up and running in 5 minutes. ty

I am using an old laptop with a GTX 1060 (6 GB VRAM) as a home server running Ubuntu and Ollama. Thanks to quantization, Ollama can run 7B/8B models on that 8-year-old laptop GPU.
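On the client side it's just an HTTP call to Ollama's default port. A minimal sketch; the host name and model tag are examples, substitute your server's address and whatever model you've pulled:

    # Minimal sketch: query the Ollama REST API on the home server (default port 11434).
    # Host name and model tag are examples.
    import requests

    resp = requests.post(
        "http://homeserver.local:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": "Give me three uses for an old laptop.",
            "stream": False,  # return a single JSON object instead of a token stream
        },
        timeout=300,
    )
    print(resp.json()["response"])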

Here's how it looks in real time: https://youtu.be/3vhJ6fNW8AI

What’d you use to record that? Looks really great.

Screen Studio

You can get a relative idea here: https://developer.nvidia.com/cuda-gpus

I use a Tesla P4 for ML stuff at home; it's equivalent to a 1080 Ti and has a score of 6.1. A 2070 (they don't list the "super") is a 7.5.

For reference, 4060 Ti, 4070 Ti, 4080 and 4090 are 8.9, which is the highest score for a gaming graphics card.
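If you'd rather check the machine in front of you than the table, PyTorch will report the compute capability directly (a small sketch, assuming a CUDA build of torch is installed):

    # Minimal sketch: read the local GPU's CUDA compute capability via PyTorch.
    import torch

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")
    else:
        print("No CUDA device visible")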


The last time I tried a local LLM was about a year ago, with a 2070S and a 3950X. Performance was quite slow for anything beyond Phi 3.5, and the small models' quality feels worse than what some providers offer for cheap or free, so it doesn't seem worth it with my current hardware.

Edit: I've loaded a Llama 3.1 8B Instruct GGUF and got 12.61 tok/sec, and 80 tok/sec for 3.2 3B.


I'm happy with a Radeon VII, unless the model is bigger than 16 GB...


