The thread suggests it doesn't even quantize the model (it runs in FP16, which at ~2 bytes per weight is roughly 14 GB just for a 7B model's weights, hence the heavy RAM usage), and that it's slower than llama.cpp's Metal backend anyway?
And MLC-LLM was faster than llama.cpp, last I checked. It's hard to keep up with developments.
I think llama.cpp is the sweet spot right now, due to its grammar capability (see the sketch below the replies) and many other features (e.g., multimodal). MLC-LLM is nice, but they don't offer uncensored models.
- A: You can convert models to MLC yourself, just like GGUF models, with relative ease.
- B: Yeah, llama.cpp has a killer feature set. And killer integration with other frameworks. MLC is way behind, but is getting more fleshed out every time I take a peek at it.
- C: This is a pet peeve of mine, but I've never run into a local model that was really censored. For some, if you give them a GPT4-style prompt, of course you get a GPT4-style response. But you can just give them an unspeakable system prompt or completion, and they will go right ahead and complete it. I don't really get why people fixate on the "default personality" of models trained on GPT4 data.
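To make the grammar point concrete, here is a minimal sketch using the llama-cpp-python bindings (the model path is just a placeholder and the GBNF grammar is a toy example), constraining output to a yes/no answer:

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the model may only emit "yes" or "no".
grammar = LlamaGrammar.from_string(r'''
root ::= "yes" | "no"
''')

# Placeholder path -- point this at any GGUF file you have locally.
llm = Llama(model_path="./models/model-q4_K_M.gguf", n_ctx=2048)

out = llm(
    "Is the sky blue? Answer yes or no.\nAnswer:",
    grammar=grammar,   # constrained decoding: sampling is restricted to the grammar
    max_tokens=4,
)
print(out["choices"][0]["text"])
```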
Llama.cpp is great, but I have mostly moved to Ollama because it is both good on the command line and ‘ollama serve’ runs a very convenient REST server.
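For reference, a minimal sketch of calling that server from Python, assuming Ollama's default port (11434) and a model you've already pulled (the model name is just an example):

```python
import json
import urllib.request

# Ollama's REST API listens on localhost:11434 by default (started via `ollama serve`).
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama2",            # any model you've pulled with `ollama pull`
        "prompt": "Why is the sky blue?",
        "stream": False,              # return a single JSON object instead of a stream
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```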
In any case, I had fun with MLX today, and I hope it implements 4-bit quantization soon.