Is there some convenience wrapper around this that works as a drop-in replacement for the OpenAI API?
I'd like to put this on a modest DO droplet or Fly.io machine, and have a private/secured HTTP endpoint to code against from somewhere else.
I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
I have some very easy classification/extraction tasks at hand, but a huge quantity of them (millions of documents) + privacy restrictions, so using any cloud service isn’t feasible.
Running something like Mistral as a simple microservice, or even via Bumblebee in my Elixir apps natively, would be _huge_!
Works flawlessly in Docker on my Windows machine, which is quite shocking.
Supports Mistral as well as everything else.
Biggest downside is that there's no way to access the tokenizer through the API. I put in a feature request, but they said "you really ought to write your own specialized client-side code for that". Real bummer when the server already supports everything, but oh well.
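If all you need client-side is token counts, the transformers library can load the same tokenizer files the server uses; a minimal sketch (the model id here is just an example, swap in whatever you're serving):

    # pip install transformers
    from transformers import AutoTokenizer

    # example model id; use the repo of the model the server is actually running
    tok = AutoTokenizer.from_pretrained("TheBloke/storytime-13B-GPTQ")

    prompt = "Classify the following document ..."
    ids = tok.encode(prompt)
    print(len(ids), "tokens")  # e.g. to check you still fit in the context window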
It has token streaming, automatically halts inference on connection close, and other niceties.
Quite long startup time, but worth it as it doesn't have to be restarted with the client.
What kind of resources do you need to run this setup, and how well does yours perform? Is it like chatting with any other chat bot (ChatGPT or Claude, for example) or is it significantly slower? Can you train it on your own self-hosted documents, like markdown?
The inference server doesn't do training. The speed is pretty decent for https://huggingface.co/TheBloke/storytime-13B-GPTQ on my 3060; it definitely doesn't feel like you're sitting around waiting for a response.
My exact invocation (on Windows) was:
    $ docker run --gpus all --shm-size 1g -p 8080:80 -v C:\text-generation-webui\models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id /data/storytime-13b-GPTQ --quantize gptq
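Once it's up, you just POST to TGI's /generate endpoint on the port mapped above; a rough sketch in Python:

    import requests

    # TGI's native /generate endpoint, on the 8080 port mapped in the docker run above
    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "Once upon a time",
            "parameters": {"max_new_tokens": 200, "temperature": 0.7},
        },
        timeout=120,
    )
    print(resp.json()["generated_text"])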
Thanks! I don't have a GPU though, so I'm assuming it isn't going to perform very well. I'll have to see if there are any models that can run on CPU only.
> Is there some convenience wrapper around this that works as a drop-in replacement for the OpenAI API?
text-generation-webui has an OpenAI API implementation.
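So you can point the standard openai client at it; a sketch, assuming the API extension is listening on port 5000 (the port and model name are assumptions, adjust to your install):

    # pip install openai  (v1.x client)
    from openai import OpenAI

    # base_url is an assumption; point it at wherever the webui's API actually listens
    client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # most local servers ignore or loosely match this field
        messages=[{"role": "user", "content": "Classify this ticket as billing, bug, or other: 'I was charged twice this month.'"}],
    )
    print(resp.choices[0].message.content)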
> I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
To get the most out of that (particularly the support for grammars), it would probably be better not to use the OpenAI API implementation and to use the native API in text-generation-webui instead (or any other runner for the model that supports grammars or the other features you are looking for).
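As a concrete sketch of the grammar bit: runners that support it usually take a llama.cpp-style GBNF string that constrains sampling to valid JSON. The endpoint and the parameter name below (grammar_string) are assumptions, so check your runner's docs before copying this:

    import requests
    from textwrap import dedent

    # llama.cpp-style GBNF grammar: forces output of the form {"label": "..."}
    GRAMMAR = dedent(r'''
        root   ::= "{" ws "\"label\"" ws ":" ws string ws "}"
        string ::= "\"" [a-zA-Z0-9 _-]* "\""
        ws     ::= [ \t\n]*
    ''')

    # endpoint and "grammar_string" field are assumptions; runners/versions differ
    resp = requests.post(
        "http://localhost:5000/api/v1/generate",
        json={
            "prompt": "Classify the document below as invoice, contract, or other. Respond in JSON.\n\n<document text>\n",
            "max_new_tokens": 64,
            "grammar_string": GRAMMAR,
        },
    )
    print(resp.json())  # response shape depends on the runner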
Ollama is essentially Docker for LLMs, and LiteLLM offers an API passthrough that makes Ollama OpenAI-API-compatible. I haven't tried it yet, but I'll probably give it a go this weekend.
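For reference, Ollama's own REST API is about as simple as it gets once you've done "ollama pull mistral"; LiteLLM then just translates OpenAI-style calls onto it. Rough sketch against the default port:

    import requests

    # Ollama listens on 11434 by default; "stream": False returns one JSON blob
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",
            "prompt": "In one word, is the sentiment of 'great product, fast shipping' positive or negative?",
            "stream": False,
        },
    )
    print(resp.json()["response"])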
Cool! Is it possible to make this self-hosted model reference my own content, for example markdown files? Or does it only know how to respond to things it was trained on?