The thread suggests it doesn't even quantize the model (it runs in FP16, which at ~2 bytes per weight is roughly 14 GB just for a 7B model's weights, hence the heavy RAM usage), and that it's slower than llama.cpp's Metal backend anyway?
And MLC-LLM was faster than llama.cpp, last I checked. It's hard to keep up with developments.
I think llama.cpp is the sweet spot right now, due to its grammar capability (see the sketch below the replies) and many other features (e.g., multimodal). MLC-LLM is nice, but they don't offer uncensored models.
- A: You can convert models to MLC yourself, just like GGUF models, with relative ease.
- B: Yeah, llama.cpp has a killer feature set. And killer integration with other frameworks. MLC is way behind, but is getting more fleshed out every time I take a peek at it.
- C: This is a pet peeve of mine, but I've never run into a local model that was really censored. For some, if you give them a GPT4-style prompt, of course you get a GPT4-style response. But you can just give them an unspeakable system prompt or completion, and they will go right ahead and complete it. I don't really get why people fixate on the "default personality" of models trained on GPT4 data.
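To make the grammar point concrete, here is a minimal sketch using the llama-cpp-python bindings (the model path is just a placeholder and the GBNF grammar is a toy example), constraining output to a yes/no answer:

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the model may only emit "yes" or "no".
grammar = LlamaGrammar.from_string(r'''
root ::= "yes" | "no"
''')

# Placeholder path -- point this at any GGUF file you have locally.
llm = Llama(model_path="./models/model-q4_K_M.gguf", n_ctx=2048)

out = llm(
    "Is the sky blue? Answer yes or no.\nAnswer:",
    grammar=grammar,   # constrained decoding: sampling is restricted to the grammar
    max_tokens=4,
)
print(out["choices"][0]["text"])
```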
Llama.cpp is great, but I have mostly moved to Ollama because it is both good on the command line and ‘ollama serve’ runs a very convenient REST server.
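For reference, a minimal sketch of calling that server from Python, assuming Ollama's default port (11434) and a model you've already pulled (the model name is just an example):

```python
import json
import urllib.request

# Ollama's REST API listens on localhost:11434 by default (started via `ollama serve`).
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama2",            # any model you've pulled with `ollama pull`
        "prompt": "Why is the sky blue?",
        "stream": False,              # return a single JSON object instead of a stream
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```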
In any case, I had fun with MLX today, and I hope it implements 4-bit quantization soon.