Is there some convenience wrapper around this that works as a drop-in replacement for the OpenAI API?
I'd like to put this on a modest DO droplet or Fly.io machine, and have a private/secured HTTP endpoint to code against from somewhere else.
I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
I have some very easy classification/extraction tasks at hand, but a huge quantity of them (millions of documents) + privacy restrictions, so using any cloud service isn’t feasible.
Running something like Mistral as a simple microservice, or even via Bumblebee in my Elixir apps natively, would be _huge_!
Works flawlessly in Docker on my Windows machine, which is quite shocking.
Supports Mistral as well as everything else.
Biggest downside is that there's no way to access the tokenizer through the API. I put in a feature request, but they said "you really ought to write your own specialized client-side code for that". Real bummer when the server already supports everything, but oh well.
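If all you need client-side is token counts, the transformers library can load the same tokenizer files the server uses; a minimal sketch (the model id here is just an example, swap in whatever you're serving):

    # pip install transformers
    from transformers import AutoTokenizer

    # example model id; use the repo of the model the server is actually running
    tok = AutoTokenizer.from_pretrained("TheBloke/storytime-13B-GPTQ")

    prompt = "Classify the following document ..."
    ids = tok.encode(prompt)
    print(len(ids), "tokens")  # e.g. to check you still fit in the context window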
It has token streaming, automatically halts inference on connection close, and other niceties.
Quite long startup time, but worth it as it doesn't have to be restarted with the client.
What kind of resources do you need to run this setup, and how well does yours perform? Is it like chatting with any other chat bot (ChatGPT or Claude, for example) or is it significantly slower? Can you train it on your own self-hosted documents, like markdown?
The inference server doesn't do training. The speed is pretty decent for https://huggingface.co/TheBloke/storytime-13B-GPTQ on my 3060; it definitely doesn't feel like you're sitting around waiting for a response.
My exact invocation (on Windows) was:
    $ docker run --gpus all --shm-size 1g -p 8080:80 -v C:\text-generation-webui\models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id /data/storytime-13b-GPTQ --quantize gptq
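Once it's up, you just POST to TGI's /generate endpoint on the port mapped above; a rough sketch in Python:

    import requests

    # TGI's native /generate endpoint, on the 8080 port mapped in the docker run above
    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "Once upon a time",
            "parameters": {"max_new_tokens": 200, "temperature": 0.7},
        },
        timeout=120,
    )
    print(resp.json()["generated_text"])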
Thanks! I don't have a GPU though, so I'm assuming it isn't going to perform very well. I'll have to see if there are any models that can run on CPU only.
> Is there some convenience wrapper around this that works as a drop-in replacement for the OpenAI API?
text-generation-webui has an OpenAI API implementation.
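So you can point the standard openai client at it; a sketch, assuming the API extension is listening on port 5000 (the port and model name are assumptions, adjust to your install):

    # pip install openai  (v1.x client)
    from openai import OpenAI

    # base_url is an assumption; point it at wherever the webui's API actually listens
    client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # most local servers ignore or loosely match this field
        messages=[{"role": "user", "content": "Classify this ticket as billing, bug, or other: 'I was charged twice this month.'"}],
    )
    print(resp.choices[0].message.content)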
> I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
To get the most out of that (particularly the support for grammars), it would probably be better not to use the OpenAI API implementation and to use the native API in text-generation-webui instead (or any other runner for the model that supports grammars or the other features you are looking for).
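As a concrete sketch of the grammar bit: runners that support it usually take a llama.cpp-style GBNF string that constrains sampling to valid JSON. The endpoint and the parameter name below (grammar_string) are assumptions, so check your runner's docs before copying this:

    import requests
    from textwrap import dedent

    # llama.cpp-style GBNF grammar: forces output of the form {"label": "..."}
    GRAMMAR = dedent(r'''
        root   ::= "{" ws "\"label\"" ws ":" ws string ws "}"
        string ::= "\"" [a-zA-Z0-9 _-]* "\""
        ws     ::= [ \t\n]*
    ''')

    # endpoint and "grammar_string" field are assumptions; runners/versions differ
    resp = requests.post(
        "http://localhost:5000/api/v1/generate",
        json={
            "prompt": "Classify the document below as invoice, contract, or other. Respond in JSON.\n\n<document text>\n",
            "max_new_tokens": 64,
            "grammar_string": GRAMMAR,
        },
    )
    print(resp.json())  # response shape depends on the runner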
Ollama is essentially Docker for LLMs, and LiteLLM offers an API passthrough that makes Ollama OpenAI-API-compatible. I haven't tried it yet, but I'll probably give it a go this weekend.
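For reference, Ollama's own REST API is about as simple as it gets once you've done "ollama pull mistral"; LiteLLM then just translates OpenAI-style calls onto it. Rough sketch against the default port:

    import requests

    # Ollama listens on 11434 by default; "stream": False returns one JSON blob
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",
            "prompt": "In one word, is the sentiment of 'great product, fast shipping' positive or negative?",
            "stream": False,
        },
    )
    print(resp.json()["response"])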
Cool! Is it possible to make this self-hosted model reference my own content, for example markdown files? Or does it only know how to respond to things it was trained on?