Hey! One of the maintainers of Ollama here. 8GB of VRAM is a bit tight for coding agents since their prompts are quite large. You could try playing with qwen3 and at least a 16k context length to see how it works.
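For example, here's a rough sketch of bumping the context window through the Python client (qwen3 and 16384 just mirror the suggestion above; num_ctx can also be set in a Modelfile):

```python
import ollama

# Request a 16k context window for this chat; the default window is smaller,
# which is usually what hurts coding agents with long prompts.
resp = ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'Refactor this function: ...'}],
    options={'num_ctx': 16384},
)
print(resp.message.content)
```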
Looking forward to trying it with a few shell scripts (via the llm-ollama extension for the amazing Python ‘llm’) or Raycast (the lack of web search support for Ollama has been one of my biggest reasons for preferring cloud-hosted models).
Since we shipped web search with gpt-oss in the Ollama app, I've personally been using that a lot more, especially for research-heavy tasks that I can shoot off. Plus, with a 5090 or the new Macs it's super fast.
Hey! Author of the blog post here, and I also work on Ollama's tool calling. There has been a big push over the last year to improve tool-call parsing. What issues are you running into with local tool use? What models are you using?
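For reference, the basic local tool-calling round trip through the Python client looks roughly like this (toy function and model name, purely for illustration):

```python
import ollama

def get_weather(city: str) -> str:
    """Toy tool for illustration; a real tool would hit an actual API."""
    return f'It is sunny in {city}.'

# Plain Python functions can be passed as tools; the client turns their
# signatures into tool schemas and parses any tool calls in the reply.
resp = ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'What is the weather in Toronto?'}],
    tools=[get_weather],
)
for call in resp.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```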
Hey! I'm the author of the post. We haven't optimized sampling yet, so it's running linearly on the CPU. A lot of SOTA work either does this while the model is running the forward pass or does the masking on the GPU.
The greedy accept is so that the mask doesn't need to be computed at all when the model's top token already conforms. Planning to make this more efficient from both ends.
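To make that concrete, here's a minimal sketch of the greedy-accept idea (pure Python/NumPy; is_valid_token stands in for the actual grammar/schema check, and this is not the real implementation):

```python
import numpy as np

def constrained_sample(logits: np.ndarray, is_valid_token) -> int:
    # Fast path ("greedy accept"): if the unconstrained argmax token is
    # already legal under the grammar, take it and skip building the mask.
    greedy = int(np.argmax(logits))
    if is_valid_token(greedy):
        return greedy
    # Slow path: mask every invalid token, then pick the best remaining one.
    masked = logits.copy()
    for tok in range(len(logits)):
        if not is_valid_token(tok):
            masked[tok] = -np.inf
    return int(np.argmax(masked))
```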
Thank you! Maybe not "perfect," but near-perfect is something we can expect. Models like Osmosis-Structure, which just structures data, inspired some of that thinking (https://ollama.com/Osmosis/Osmosis-Structure-0.6B). Historically, JSON generation has been a latent capability of a model rather than a trained one, but that seems to be changing. gpt-oss was specifically trained for this type of behavior, so the token probabilities are heavily skewed toward conforming to JSON. Will be interesting to see the next batch of models!
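As a rough illustration of leaning on that behavior, structured output can be requested through the Python client by passing a JSON schema as the format (the Pydantic model here is just a made-up example):

```python
from pydantic import BaseModel
import ollama

class Country(BaseModel):
    name: str
    capital: str

# Passing a JSON schema as `format` constrains decoding to that shape.
resp = ollama.chat(
    model='gpt-oss',
    messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
    format=Country.model_json_schema(),
)
country = Country.model_validate_json(resp.message.content)
print(country)
```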