
When running on Apple Silicon you want to use MLX, not llama.cpp as this benchmark does. Performance is much better than what's plotted there and seems to be getting better, right?
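
For reference, a minimal sketch of the MLX path, assuming the mlx-lm package and an MLX-converted model from the mlx-community Hugging Face org (the repo name below is just an illustration, not a recommendation of a specific quant):

    # pip install mlx-lm; the model repo below is a placeholder example
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
    text = generate(
        model,
        tokenizer,
        prompt="Explain KV caching in one paragraph.",
        max_tokens=256,
        verbose=True,  # prints prompt/generation tokens-per-second
    )
    print(text)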

Power consumption is almost 10x lower for Apple.

VRAM is more than 10x larger.

Price-wise, for running models of the same size, Apple is cheaper.

The upper limit (larger models, longer context) is far higher for Apple (with Nvidia you can easily fit 2x cards; beyond that it becomes a complex setup no ordinary person can manage).

Am I missing something, or is Apple simply better right now for local LLMs?

Unless something changed recently, I’m not aware of big perf differences between MLX and llama.cpp on Apple hardware.

I'm under the same impression. llama.cpp's README used to start with "Apple Silicon as a First Class Citizen", and IIRC Georgi works on a Mac himself.
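
If anyone wants to check on their own machine, here's a rough, hedged timing harness (the GGUF path and MLX repo are placeholders; quantization and settings have to match for the numbers to mean anything):

    # Rough comparison of decode throughput: llama.cpp (via llama-cpp-python)
    # vs MLX (via mlx-lm). Timing includes prompt processing, so treat the
    # numbers as ballpark only.
    import time

    PROMPT = "Write a short paragraph about the history of the Macintosh."
    MAX_TOKENS = 256

    def time_llama_cpp(gguf_path: str) -> float:
        from llama_cpp import Llama
        llm = Llama(model_path=gguf_path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
        start = time.perf_counter()
        out = llm(PROMPT, max_tokens=MAX_TOKENS)
        elapsed = time.perf_counter() - start
        return out["usage"]["completion_tokens"] / elapsed

    def time_mlx(repo: str) -> float:
        from mlx_lm import load, generate
        model, tokenizer = load(repo)
        start = time.perf_counter()
        text = generate(model, tokenizer, prompt=PROMPT, max_tokens=MAX_TOKENS)
        elapsed = time.perf_counter() - start
        return len(tokenizer.encode(text)) / elapsed

    # Paths/repos below are placeholders; use the same model and quant for both.
    print(f"llama.cpp: {time_llama_cpp('model-q4_k_m.gguf'):.1f} tok/s")
    print(f"mlx:       {time_mlx('mlx-community/some-model-4bit'):.1f} tok/s")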

There is a plateau where you simply need more compute, and the M4 cores are not enough, so even if they have enough RAM for the model, the tokens/s isn't useful.
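
A rough way to see where that plateau comes from: single-stream decode is mostly memory-bandwidth bound, while prompt processing (prefill) is compute bound, and the latter is where a weaker GPU hurts most. A back-of-envelope sketch with placeholder hardware numbers (substitute your own chip's spec-sheet values):

    # Back-of-envelope limits for a dense model. Decode is roughly
    # memory-bandwidth bound (every token streams the weights once); prefill
    # is roughly compute bound (~2 FLOPs per parameter per token).
    MODEL_GB = 40         # e.g. a 70B dense model at ~4-bit
    PARAMS_B = 70         # billions of parameters
    BANDWIDTH_GB_S = 546  # placeholder: your machine's memory bandwidth
    FP16_TFLOPS = 30      # placeholder: usable fp16 throughput of your GPU

    decode_tok_s = BANDWIDTH_GB_S / MODEL_GB
    prefill_tok_s = (FP16_TFLOPS * 1e12) / (2 * PARAMS_B * 1e9)

    print(f"decode  ~ {decode_tok_s:.0f} tok/s (bandwidth-limited)")
    print(f"prefill ~ {prefill_tok_s:.0f} tok/s (compute-limited)")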

For any model that fits in 2x 5090s (2x 32 GB) that's not a problem, so you could say that if you hit this problem, RTX isn't an option either.

On Apple silicon you can always use MoE models, which work beautifully. On RTX, running MoE is honestly kind of a waste; you'd be better off running a dense model whose weights are all active and fill the available memory (with enough room left for the context).
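
A quick sketch of why MoE suits big-unified-memory machines: the whole model has to fit in memory (total parameters), but each token only reads the active experts, so decode speed tracks the much smaller active set. The example figures are rough (a Mixtral-8x22B-class MoE: ~141B total, ~39B active; the 800 GB/s bandwidth is a placeholder):

    # Memory footprint scales with total parameters; single-stream decode
    # speed scales (roughly) with active parameters.
    def footprint_gb(total_params_b: float, bits: int) -> float:
        return total_params_b * 1e9 * bits / 8 / 1e9

    def decode_tok_s(active_params_b: float, bits: int, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / (active_params_b * 1e9 * bits / 8 / 1e9)

    # ~141B total / ~39B active at 4-bit, ~800 GB/s unified memory (placeholder)
    print(f"needs  ~ {footprint_gb(141, 4):.0f} GB of memory")   # fits a 96GB+ Mac, not 2x32 GB of VRAM
    print(f"decode ~ {decode_tok_s(39, 4, 800):.0f} tok/s (rough)")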


I'm trying to find that out as well, as I'm considering a local LLM for some heavy prototyping. I don't mind which hardware I buy, but I'm on a relative budget, and energy efficiency isn't a bad thing either. It seems the Ultra can do 40 tokens/sec on DeepSeek and nothing else comes close at that price point.

The DeepSeek R1 models distilled onto Llama and Qwen bases are also, unfortunately, called "DeepSeek" by some. Are you sure you're looking at the right thing?

The OG DeepSeek models are hundreds of GB even quantized; nobody is using RTX GPUs to run them anyway…
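
For scale (the original DeepSeek-V3/R1 release is a 671B-parameter MoE), the weights alone:

    # Rough size check; KV cache and activations come on top of this.
    total_params = 671e9  # DeepSeek-V3/R1 total parameters
    for bits in (8, 4):
        print(f"{bits}-bit: ~{total_params * bits / 8 / 1e9:.0f} GB of weights")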


You are missing something. This is a single stream of inference. You can load up the Nvidia card with at least 16 inference streams and get much higher aggregate throughput in tokens/sec.

This is just a single-user chat-experience benchmark.
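
For the multi-stream point, a minimal sketch using vLLM's offline API as one common way to batch many requests on an Nvidia card (the model name is a placeholder):

    # vLLM continuously batches the requests, so aggregate tokens/sec is far
    # higher than any single stream, even though per-stream latency is similar.
    from vllm import LLM, SamplingParams

    prompts = [f"Summarize document #{i} in two sentences." for i in range(16)]
    params = SamplingParams(temperature=0.7, max_tokens=128)

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.outputs[0].text[:80])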



