I am legitimately curious about the parameters this person used for running the model locally to get the results they got, because I am currently experimenting with running models locally myself. You can see me asking similar questions elsewhere in this same thread; the timestamps correlate.
Apparently there is a whole science behind running models. I have seen the instructions that unsloth publishes for their quants; depending on the model, they'll tweak things like temperature, top-k, etc.
The quantization size you choose also makes a difference.
The GPU driver also plays an important role.
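To make those knobs concrete, here is a toy sketch (plain Python, my own illustration, not unsloth's code) of what temperature and top-k do to the next-token distribution before sampling:

```python
import math

def next_token_distribution(logits, temperature=0.6, top_k=2):
    """Naive temperature + top-k filtering over raw logits (ignores ties at the cutoff)."""
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    # Keep only the top_k highest logits; mask the rest out entirely.
    threshold = sorted(scaled, reverse=True)[top_k - 1]
    masked = [s if s >= threshold else float("-inf") for s in scaled]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Four hypothetical vocabulary entries; only the top 2 remain sampleable.
probs = next_token_distribution([2.0, 1.0, 0.5, -1.0], temperature=0.6, top_k=2)
```

This is why published settings matter: with top_k=2 the last two tokens get exactly zero probability, and dropping the temperature further concentrates mass on the single best token.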
What was your approach? What software did you use to run the models?
FWIW, while I find it appealing, I also strongly associate it with "vibe coded webapp of dubious quality," so personally I'm not gonna try to replicate it myself.
> [...] _but not necessarily use the right format._
This has also been my experience. But isn't the harness sending the instructions on how to invoke a tool? Maybe those instructions omit the formatting part. What do you think?
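For reference, this is roughly what I mean by the harness sending instructions: an OpenAI-style tool definition goes out with the request, and the model is expected to echo back a call in a matching format (the `get_weather` tool here is purely hypothetical, and the exact schema varies by harness):

```python
import json

# What the harness sends: a declaration of the tool and its parameters.
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# What the model is expected to emit: the arguments as a JSON-encoded string.
call = {"name": "get_weather", "arguments": json.dumps({"city": "Berlin"})}
args = json.loads(call["arguments"])
```

A model can know *that* it should call `get_weather` yet still emit the arguments as plain prose or malformed JSON, which would match the "right tool, wrong format" behavior being described.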
Through my Kagi subscription I get access to quite a few models [1] but I tend to rely on Qwen3 (fast) for quick questions and Qwen3 (reasoning) when I want a more structured approach, for example, when I am researching a topic.
I have tried the same approach with Kimi K2.5 and GLM 5, but I keep going back to Qwen3.
I also have access to Perplexity which is quite decent to be honest, but I prefer to keep everything in Kagi.
> [...] it's much easier to fine-tune a "general" model into performing some very specific custom task (like classifying text, or translation, etc)
Is this fine-tuning process similar to training models from scratch? As in, do you need extensive resources? Or can this realistically be done on a consumer-grade GPU?
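A back-of-envelope sketch of why I suspect it might be feasible, assuming a parameter-efficient method like LoRA (my assumption, not something from the comment above): instead of updating a full weight matrix, you train two small low-rank matrices alongside it, which shrinks the number of trainable parameters dramatically:

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for a d_in x d_out weight: full fine-tune vs. a rank-r LoRA adapter."""
    full = d_in * d_out            # update every weight
    lora = rank * (d_in + d_out)   # train only A (d_in x r) and B (r x d_out)
    return full, lora

# One 4096x4096 attention projection with a typical rank-8 adapter (illustrative numbers).
full, lora = lora_param_counts(4096, 4096, rank=8)
# full = 16_777_216 params, lora = 65_536 params: a 256x reduction for this matrix.
```

If that scaling holds across the model's layers, the optimizer state and gradients for the adapters alone are small enough that consumer-grade VRAM starts to look plausible, which is presumably why these methods are popular for exactly the kind of narrow task adaptation quoted above.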