Sorry about this – we're working on fixing the issue with hitting the context limit. Gemma 2 supports an 8192-token context window, which can be selected by providing the `num_ctx` parameter in the API or via `ollama run` with `/set parameter num_ctx 8192`.
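For example, a minimal sketch of setting it per request via the API (assuming a local Ollama server on the default port and a pulled `gemma2` model; the prompt is just a placeholder):

```python
# Minimal sketch: raise the context window for a single request via the
# "options" field of /api/generate.
import json
import urllib.request

payload = {
    "model": "gemma2",
    "prompt": "Summarize the plot of Hamlet in two sentences.",
    "stream": False,
    "options": {"num_ctx": 8192},  # per-request context window
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```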
Thanks! If you have a moment, can you give me a quick explainer on what happens when you hit the context limit in Ollama? I had assumed that Ollama would just truncate the context to whatever is set in the model, but I guess this isn't the case?
Currently, when the context limit is hit, the context window is halved (a "context shift") so that inference can continue – this is helpful for smaller (e.g. 1–2k) context windows.
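Roughly, the shift looks something like this (a simplified sketch for illustration, not Ollama's actual implementation):

```python
# Simplified sketch of a "context shift": when the token buffer reaches the
# context limit, keep an initial prefix (e.g. the system prompt) and discard
# the oldest half of the remaining tokens so generation can continue.
def context_shift(tokens: list[int], num_ctx: int, n_keep: int) -> list[int]:
    if len(tokens) < num_ctx:
        return tokens  # still room, nothing to shift
    rest = tokens[n_keep:]
    return tokens[:n_keep] + rest[len(rest) // 2:]  # drop the older half
```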
However, not all models (especially newer ones) respond well to this, which makes sense. We're working on changing the behavior of Ollama's API to be more in line with the OpenAI, Anthropic, and similar APIs, so that when the context limit is hit, the API returns a "limit" finish/done reason. Hope this is helpful!
Sorry it's taking so long to review and for the radio silence on the PR.
We have been trying to figure out how to support more structured output formats without some of the side effects of grammars. With JSON mode (which uses grammars under the hood) there were originally quite a few issue reports, mainly around lower performance and cases where the model would generate whitespace indefinitely, causing requests to hang. This is an issue with OpenAI's JSON mode as well, which requires the caller to "instruct the model to produce JSON" [1]. While it's possible to handle edge cases for a single grammar such as JSON (e.g. checking for 'JSON' in the prompt), it's hard to generalize this to any format.
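For reference, the mitigation mentioned above looks roughly like this against Ollama's API (a sketch; the model name is just illustrative):

```python
# Sketch: enable JSON mode and, as recommended, explicitly instruct the model
# to produce JSON in the prompt so it doesn't generate whitespace forever.
payload = {
    "model": "llama3",   # illustrative; any pulled model
    "format": "json",    # JSON mode (grammar-constrained under the hood)
    "stream": False,
    "prompt": "List three primary colors as a JSON array. Respond using JSON.",
}
# POST this to http://localhost:11434/api/generate, as in the earlier sketch.
```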
Supporting more structured output formats is definitely important. Fine-tuning for output formats is promising, and this thread [2] also has some great ideas and links.
I've been using llama.cpp for about a year now, mostly implementing some RAG- and ReAct-related papers to stay up to date. I mostly used llama.cpp on its own, but for the past few months I've been using both Ollama and llama.cpp.
If you added grammars, I wouldn't have to run the two servers. I think you're doing an excellent job of maintaining Ollama – every update is like Christmas. They also don't seem to treat the server as a priority (it's still literally just an example of how you'd use their C API).
So I understand your position, since their server API has been quite unstable and the grammar validation didn't work at all until February. I also still can't get loading multiple models to work reliably right now.
Having said that, GBNF is a godsend for my daily use cases. I'd even rather use Phi-3 with a grammar than deal with the hallucinations of a 70B model without one. Fine-tuning helps a lot, but it can't fully solve the problem (you still need to validate the generation), and it's a lot less agile when implementing ideas. Creating synthetic datasets is also easier if you have support for grammars.
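For illustration, this is the kind of thing I mean – a sketch against llama.cpp's example server with a tiny GBNF grammar (assuming a local server on port 8080; exact fields may vary by version):

```python
# Sketch: constrain llama.cpp's example server to answer only "yes" or "no"
# using a minimal GBNF grammar.
import json
import urllib.request

payload = {
    "prompt": "Is the sky blue on a clear day? Answer yes or no.",
    "n_predict": 4,
    "grammar": 'root ::= "yes" | "no"',  # GBNF: output must match this rule
}

req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```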
I think many others are in the same spot as me. Thank you for being considerate about the stability and support that this would require. But please take a look at the current state of their grammar validation – it's pretty good right now.
Not to put too fine a point on it, but why not merge one of the simpler PRs for this feature, gate it behind an opt-in env var (e.g. OLLAMA_EXPERIMENTAL_GRAMMAR=1), and sprinkle the caveats you've mentioned into the documentation? That should be enough to ward off the casuals who would flood the issue queue. Add more hoops if you'd like.
There seems to be enough interest in this specific feature that you don't need to make it perfect or provide a complicated abstraction. I am very willing to accept/mitigate the side effects for the ability to arbitrarily constrain generation. Not sure about others, but given there are half a dozen different PRs specifically for this feature, I am pretty sure they, too, are willing to accept the side effects.
Since it's trivial to run mainline features on llama.cpp itself, it seems redundant to ask Ollama to implement and independently maintain branches or features that aren't fully working, unless it's something already available in a testing branch.
We're not relying on Ollama for feature development, and there are multiple open projects with implementations already, so no one is deprived of anything by this – or a hundred other potential PRs – not being in Ollama yet.
Pre-release versions are created to test new updates on a bunch of different hardware setups (OS/GPU combinations) before releasing more broadly (and making new versions the default for the Linux/macOS/Windows installers, which pull from the 'latest' release).
There are a good number of folks who test the pre-releases as well (thank you!), especially if there's a bug fix or new feature they're waiting for. Watching the repo on GitHub will send emails/notifications when new pre-release versions are published.
Not at the moment, although it is a highly requested feature (specifically fine-tuning). There are a few tools that you can (or will soon be able to) use to fine-tune a model and then import the resulting adapter layers into Ollama: MLX [1] on macOS, and Unsloth [2] on Windows and Linux.
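Once you have an adapter, importing it looks roughly like this (a sketch – the base model name and adapter path are placeholders for your own):

```python
# Sketch: wrap a fine-tuned adapter in a Modelfile and create an Ollama model
# from it. "llama3" and "./my-lora-adapter" are placeholders.
import pathlib
import subprocess

modelfile = "FROM llama3\nADAPTER ./my-lora-adapter\n"

pathlib.Path("Modelfile").write_text(modelfile)
subprocess.run(["ollama", "create", "my-finetuned-model", "-f", "Modelfile"], check=True)
```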
Yeah, I tried all those things – especially the middle one, because that's how I use it mostly. I added the environment variable to the systemd service, but it still unloads the model after 5 minutes. Very weird.
Yes, we are also looking at integrating MLX [1], which is optimized for Apple Silicon and built by an amazing team, a few of whom were behind the original Torch [2] project. There's also TensorRT-LLM [3] by Nvidia, optimized for their recent hardware.
All of this of course acknowledging that llama.cpp is an incredible project with competitive performance and support for almost any platform.