Hey! One of the maintainers of Ollama here. 8GB of VRAM is a bit tight for coding agents since their prompts are quite large. You could try playing with qwen3 and at least a 16k context length to see how it works.
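For example, here's a rough sketch of bumping the context window through the Python client (qwen3 and 16384 just mirror the suggestion above; num_ctx can also be set in a Modelfile):

```python
import ollama

# Request a 16k context window for this chat; the default window is smaller,
# which is usually what hurts coding agents with long prompts.
resp = ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'Refactor this function: ...'}],
    options={'num_ctx': 16384},
)
print(resp.message.content)
```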
Looking forward to trying it with a few shell scripts (via the llm-ollama extension for the amazing Python ‘llm’) or Raycast (the lack of web search support for Ollama has been one of my biggest reasons for preferring cloud-hosted models).
Since we shipped web search with gpt-oss in the Ollama app, I've personally been using that a lot more, especially for research-heavy tasks that I can shoot off. Plus, with a 5090 or the new Macs it's super fast.
Hey! Author of the blog post here, and I also work on Ollama's tool calling. There has been a big push over the last year to improve tool-call parsing. What issues are you running into with local tool use? What models are you using?
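For reference, the basic local tool-calling round trip through the Python client looks roughly like this (toy function and model name, purely for illustration):

```python
import ollama

def get_weather(city: str) -> str:
    """Toy tool for illustration; a real tool would hit an actual API."""
    return f'It is sunny in {city}.'

# Plain Python functions can be passed as tools; the client turns their
# signatures into tool schemas and parses any tool calls in the reply.
resp = ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'What is the weather in Toronto?'}],
    tools=[get_weather],
)
for call in resp.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```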
Hey! I'm the author of the post. We haven't optimized sampling yet, so it's running linearly on the CPU. A lot of SOTA work either does this while the model is running the forward pass or does the masking on the GPU.
The greedy accept is so that the mask doesn't need to be computed at all when the model's top token already conforms. Planning to make this more efficient from both ends.
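To make that concrete, here's a minimal sketch of the greedy-accept idea (pure Python/NumPy; is_valid_token stands in for the actual grammar/schema check, and this is not the real implementation):

```python
import numpy as np

def constrained_sample(logits: np.ndarray, is_valid_token) -> int:
    # Fast path ("greedy accept"): if the unconstrained argmax token is
    # already legal under the grammar, take it and skip building the mask.
    greedy = int(np.argmax(logits))
    if is_valid_token(greedy):
        return greedy
    # Slow path: mask every invalid token, then pick the best remaining one.
    masked = logits.copy()
    for tok in range(len(logits)):
        if not is_valid_token(tok):
            masked[tok] = -np.inf
    return int(np.argmax(masked))
```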
Thank you! Maybe not "perfect," but near-perfect is something we can expect. Models like Osmosis-Structure, which just structures data, inspired some of that thinking (https://ollama.com/Osmosis/Osmosis-Structure-0.6B). Historically, JSON generation has been a latent capability of a model rather than a trained one, but that seems to be changing. gpt-oss was specifically trained for this type of behavior, so the token probabilities are heavily skewed toward conforming to JSON. Will be interesting to see the next batch of models!
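As a rough illustration of leaning on that behavior, structured output can be requested through the Python client by passing a JSON schema as the format (the Pydantic model here is just a made-up example):

```python
from pydantic import BaseModel
import ollama

class Country(BaseModel):
    name: str
    capital: str

# Passing a JSON schema as `format` constrains decoding to that shape.
resp = ollama.chat(
    model='gpt-oss',
    messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
    format=Country.model_json_schema(),
)
country = Country.model_validate_json(resp.message.content)
print(country)
```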