Mistral LLM on Apple Silicon Using Apple's MLX Framework Runs Instantaneously

jasonjmcghee · on Dec 7, 2023

What an odd misleading title. From the referenced reddit thread in another comment (https://www.reddit.com/r/LocalLLaMA/comments/18bwd1y/comment...) sounds like it runs slower than llama cpp? 25 t/s. That’s not instantaneous by any measure

dagmx · on Dec 7, 2023

I think you misread the title. The title implies that it took little to no work to port it to this new framework, not that it runs in an instant.

buffington · on Dec 8, 2023

How does one read "Runs Instantly" and not interpret that to mean anything other than that it runs instantly?

"A fire built around a wad of kerosene soaked rags starts instantly."

Does that sentence suggest that kerosene starts rags to you, or that kerosene soaked rags help start fires instantly?

dagmx · on Dec 8, 2023

That comparison is irrelevant because words have meaning based on the context they’re used in.

English, being a very complex language, can have the same sentence mean multiple things as well depending on how you interpret the context. There may be better ways to phrase something to remove ambiguity, but the title as presented could be read both ways.

smoldesu · on Dec 8, 2023

OP changed the submission title after this first comment was written. It used to say MLX was "much faster" without any point of comparison. It has since been edited to what it is now.

robertritz · on Dec 7, 2023

MLX only works for fp16 right now. If it ever works quantized I will almost certainly move my app over to MLX instead of llama.cpp.

My app also uses a very small (30MB) PyTorch model and shipping it requires an extra 100MB for PyTorch in the app. Very very stupid.

I think its important to remember that last mile inference is still pretty bespoke for most things. If we want to see gen AI stuff take off and now have the big cloud providers in charge this needs to be fixed.

Apple is in a good place to solve at least part of the equation.

behnamoh · on Dec 7, 2023

Apple will most likely introduce their own API for an LLM-powered Siri which devs can use in their apps. This way, Apple keeps their control and can fully optimize the SiriGPT (or whatever you call it). Very unlikely that Apple would just provide the hardware, good amount of RAM, a framework competing with Pytorch, all out of the goodness of their heart. I use Apple products but I know they're way past the point of doing things just for the sake of fcking doing something awesome.

novok · on Dec 7, 2023

Becoming the AI dev and inference platform of choice and getting that nvidia profit margin is probably plenty enough of a reason. Right now it's CUDA on linux PCs.

AceJohnny2 · on Dec 7, 2023

I don't see Apple taking NVidia's crown for training. Apple's hardware acceleration is fine for low-power on-device inference, which they were designed for, but competitive training requires an order or three magnitude more power, which they aren't.

lolive · on Dec 7, 2023

I keep a bookmark of your comment for my 2030 me. Just to check whether that prediction will realize: Apple will ALSO storm the LLM market!

smoldesu · on Dec 7, 2023

Unless you also predict that Apple will release datacenter systems a-la Grace and Instict, I don't think they're even in the runnings. AMD is only competitive in the LLM market because they sell extremely cheap and fast compute hardware at the same scale Nvidia does. As of today, Apple doesn't sell any hardware that can go toe-to-toe with a DGX system. They also have a lot of software problems (VM limitations, poor GPU API support, limited integration with open-source, etc.) that would need to be fixed for parity with Nvidia or AMD.

Apple will definitely push for on-device AI, but even in 2030 I firmly believe that they won't be leading the industry in performance. I'd be surprised if they even supported anything other than their proprietary CoreML by then.

colinng · on Dec 7, 2023

Ditto on this. I want to not buy an A100 for $20k, or even consumer GPUs, but the truth is that for LLM inference, to run large models like LLaMa2 70b with INT4 quantization so it could fit

A100: 1248 TOPS

MI250: 362.1 TOPS

M3 Max: 18 TOPS

Yes, 18. Unless Apple has accelerated INT4 workloads but just forgot to document it.

Honestly, I’m an Apple fan, but when they go on stage and say “AI” they mean it can do speech recognition or tell a dog apart from a cat, or autofocus a camera. It can’t run ChatGPT-like things by a loooong mile.

cyanydeez · on Dec 7, 2023

all they care about is the value of a walled garden.

I just went through all the commercial options for local LLM hosting and they're definitely poised correctly because they have the correct amount of memory in their machines.

they're not going to have a come to Jesus moment.

kristianp · on Dec 7, 2023

If the model could be made to work with llama.cpp, then https://github.com/abetlen/llama-cpp-python might be more compact. llama.cpp only supports a limited list of model types though.

klohto · on Dec 7, 2023

Use llamafile

luke-stanley · on Dec 7, 2023

The title could be made much clearer by saying "X" tokens per second, with "Y" model,on "Z" hardware, and what the model load time was. The quantisation level and byte size helps too.

smoldesu · on Dec 7, 2023

"Much Faster" than what?

behnamoh · on Dec 7, 2023

https://www.reddit.com/r/LocalLLaMA/comments/18bwd1y/comment...

brucethemoose2 · on Dec 7, 2023

Thats... Not very promising.

The thread suggests it doesn't even quantize the model (running it in FP16, so tons of ram usage), and that its slower than the llama.cpp Metal backend anyway?

And MLC-LLM was faster than llama.cpp, last I checked. Its hard to keep up with developments.

behnamoh · on Dec 7, 2023

I think llama.cpp is the sweet spot right now, due to its grammar capability and many other features (e.g., multimodal). MLC-LLM is nice but they don't offer uncensored models.

brucethemoose2 · on Dec 7, 2023

- A: You can convert models to MLC yourself, just like GGUF models, with relative ease.

- B: Yeah, llama.cpp has a killer feature set. And killer integration with other frameworks. MLC is way behind, but is getting more fleshed out every time I take a peek at it.

- C: This is a pet peeve of mine, but I've never run into a local model that was really uncensored. For some, if you give them a GPT4 prompt... Of course you get a GPT4 response. But you can just give them a unspeakable system prompt or completion, and they will go right ahead and complete it. I don't really get why people fixate on the "default personality" of models trained on GPT4 data.

Tiberium · on Dec 7, 2023

C. Have you tried OpenHermes 2.5? It's a Mistral chat finetune, but a very good one at that.

brucethemoose2 · on Dec 7, 2023

Yeah. If you are looking for new models, I would recommend Xaberius and Cybertron, as well as the various OpenHermes DPO finetunes.

Personally I run my own Yi 200K DARE merge because I love the long context.

mark_l_watson · on Dec 7, 2023

Llama.cpp is great but I have moved to mostly using Ollama because it is both good on the command line and ‘ollama server’ runs a very convenient to use REST server.

In any case, I had fun with MLX today, and I hope it implements 4 bit quantization soon.

behnamoh · on Dec 7, 2023

Does Ollama let you set server parameters? (e.g., temperature, max_tokens)

lagniappe · on Dec 7, 2023

yes, you put them in a Modelfile along with whatever system prompt and model you choose. The grammar is similar to a Dockerfile.

bluSCALE4 · on Dec 7, 2023

I don't really get the purpose of this. Can anyone share some insights?