Many options for running Mistral models in your terminal using LLM (simonwillison.net)
215 points by simonw | 99 comments



I run it on a Linux system via the CLI using Ollama - very easy to set up:

https://ollama.ai/library/mistral

curl https://ollama.ai/install.sh | sh

ollama run mistral:text


I want to like Ollama, but I wish it didn't obfuscate the actual directives (full prompt) that it sends to the underlying model. Ollama uses a custom templatizing script in its Modelfiles to translate user input into the format that a specific model expects ([INST], etc), but it can be difficult to tell if it's working as expected because it won't show up in the logs at all.

Other than that it's a great project - very easy to get started and has a solid API implementation. I've got it running on both a Win 10 + WSL2 docker and on a Mac M1.


You can bypass the templating in raw mode, by setting the request parameter `raw` to true.

https://github.com/jmorganca/ollama/blob/main/docs%2Fapi.md
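
For example, something along these lines should show the difference (a quick sketch, assuming Ollama's default port 11434 and that you've already pulled the mistral model - in raw mode you supply the [INST] tags yourself):

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "[INST] Why is the sky blue? [/INST]",
  "raw": true,
  "stream": false
}'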


Yeah, I guess I could compare the output at 0.0 temperature with the same prompt, first using the Modelfile and then using raw mode with my best guess of how the Modelfile is constructing the raw data that gets passed to the model.

I'd push a PR to the repo itself but I have zero experience with Go...


Yeah, I was surprised Ollama was not mentioned, as it's by far the easiest to get started with. If only it had real grammar support, I'd never have to use another library again (it does have a JSON mode that generally works, however).


What is grammar support? I've seen it mentioned several times now. Does it let you restrict the output to a given template, or am I totally wrong there?


Yes. Here's a quick rundown of grammars in llama.cpp (a link to the docs of faraday.dev, which runs llama.cpp under the hood):

https://docs.faraday.dev/character-creation/grammars
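
Concretely, a grammar is a small GBNF file that constrains which tokens the sampler is allowed to pick, so the output is forced to match a structure (JSON, a fixed list of choices, etc.). A rough sketch using llama.cpp's own CLI with the json.gbnf grammar that ships in its repo (the model path here is just a placeholder):

./main -m ./models/mistral-7b-instruct.Q4_K_M.gguf \
  --grammar-file grammars/json.gbnf \
  -p "Describe a cat as a JSON object" -n 256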


Ollama is great. I discovered it today while looking for a way to serve LLMs locally for my terminal command generator tool (cmdh: https://github.com/pgibler/cmdh) and was able to get it up and running and implement support for it very easily.


Huh, I've fought with a few of these things on my laptop, no nvidia GPU, limited ram, etc.

This actually worked as advertised.


What's the performance like (quality and speed wise)?


Yesterday I tried Mixtral 8x7B running on the CPU. With an Intel 11th-gen chip and 64GB of DDR4 at 3200MHz, I got around 2-4 tokens/second with a small context; it gets progressively slower as the context grows.

You would get a much better experience with Apple silicon and lots of RAM.


Can confirm. My M3 Max gets about 22t/s, putting the bottleneck BKAC.


That's a 10x speed increase. What's the secret behind the Apple M3? Faster-clocked RAM? Specific AI hardware?


Unified memory and optimizations in llama.cpp (which Ollama wraps).


Is that using the GPU?


It can be variably configured. There are details in the repo, but llama.cpp makes use of Metal.


Mistral 7b is serviceable for short contexts. If you have a longer conversation, token generation can start to lag a lot.


Mixtral-8x7B-Instruct is also available in a llamafile now: https://github.com/Mozilla-Ocho/llamafile#other-example-llam...


Just added that option - you can talk to it from my LLM CLI tool using its OpenAI-compatible API endpoint: https://simonwillison.net/2023/Dec/18/mistral/#llamafile-ope...
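
If you want to poke at the endpoint directly first, something like this should work once the llamafile server is running (assuming the default port 8080; the model name is largely ignored by the local server):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'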


If I have the baseline Mixtral-8x7B .pth, how do I convert that to a llamafile compatible .gguf?


Just download the GGUF from huggingface. @TheBloke always uploads GGUF versions of the popular models
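
For example, with the huggingface_hub CLI (the exact filename below is a guess - check the repo's file list for the quantization you actually want):

pip install -U huggingface_hub
huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GGUF mixtral-8x7b-v0.1.Q4_K_M.gguf --local-dir .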


Useless as he never updates them post release. You are better served doing it yourself.


Where did you get your information from?

Just 4 days ago he redid a model after the author updated the EOS token: https://huggingface.co/TheBloke/openchat-3.5-1210-GGUF/discu...


Just make the GGUF yourself - it takes 5 minutes and you don't need TheBloke.


That was my question. How do I create the gguf?


Using the convert.py script in the llama.cpp repo.
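
Roughly like this, assuming a recent llama.cpp checkout (paths and the quantization type are placeholders; check the README for the current flags, since Mixtral support in convert.py is fairly new):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
python convert.py /path/to/Mixtral-8x7B --outtype f16 --outfile mixtral-f16.gguf
./quantize mixtral-f16.gguf mixtral-Q4_K_M.gguf Q4_K_M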


Perfect, thanks.

Don't suppose you know if the conversion is easily reversible? Some of these models are big, and it sucks to carry around the original plus the GGUF, but I would hate to be in a situation where the GGUF represents a dead end.


All this needless difficulty when you can just do the right thing and use oobabooga.

https://github.com/oobabooga/text-generation-webui


Looks like that provides an OpenAI-compatible API endpoint, which means you can use my LLM command-line utility against it with the same pattern as for Llamafile: https://simonwillison.net/2023/Dec/18/mistral/#llamafile-ope...


This is a bit off-topic and probably a noob question. But maybe someone with experience could help me. I'm about to buy a M2 or M3 Mac to replace my Intel Mac, and would like to play around with locally run LLMs in the future. How much RAM do I have to buy in order not to run into bottlenecks? I'm aware that you would probably want a beefier GPU for better performance. But as I said, I'm not worried about speed so much.


TheBloke's GGUF model card tells you how much RAM you'll want to run the various versions of the model: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF#provi...

In the article, Simon mentions the Q6_K.gguf model, which is about 40GB. A Mac Studio can handle this, but any of these models are going to be a tight fit or impossible on a Mac laptop without swapping to disk. Maybe NVME is fast enough that swapping isn't too terrible.

In my experience, the Mixtral models work pretty well on llama.cpp on my Linux workstation with a 10GB GPU, and offloading the rest to CPU.

It is impressive how fast the smaller models are improving. Still, a safe rule of thumb is the more RAM the better.

Also, really question whether you need to run these models locally at all. If you just want to play around with them, it's probably far more cost effective to rent something in the cloud.


I tried Mixtral via Ollama on my Apple M1 Max with 32GB of RAM, and it was a total nonstarter. I ended up having to power-cycle my machine. I then just used two L4 GPUs on Google Cloud (so 48GB of GPU RAM, see [1]) and it was very smooth and fast there.

[1] https://github.com/sagemathinc/cocalc-howto/blob/main/ollama...


Wow, as an author of the project I'm so sorry you had to restart your computer. The memory management in Ollama needs a lot of improvement – I'll be working on this a bunch going forward. I also have a 32GB M1 Mac and it's unfortunately just below the amount of memory Mixtral needs to run well (for now!)


Will the Mixtral 8x7 run well on a 64GB M2 Max then?


I'm using an M2 64GB and Mixtral works pretty well.


It’s wild that my laptop can run AI models better than my RTX 4090 desktop. Thanks for the info!


If you want to run decently heavy models, I'd recommend getting 48GB at a minimum. This allows you to run 34B Llama models with ease, 70B models quantized, and Mixtral without problems.

If you want to run most models, get 64GB. This just gives you some more room to work with.

If you want to run anything, get 128GB or more. Unquantized 70b? Check. Goliath 120b? Check.

Note that high-end consumer GPUs top out at 24GB of VRAM. I have one 7900 XTX for running LLMs, and the best it can reliably run is 4-bit quantized 34B models; anything larger is partially in regular RAM.


Thank you for this detailed response. I'm not sure if it was clear, but I was going to use just the Apple Silicon CPU/GPU, not an external one from Nvidia.

Is there anything useful you can do with 24 or 32GB of RAM with LLMs? Regular M2 Mac minis can only be ordered with up to 24GB of RAM; the M2 Pro Mac mini is upgradable to 32GB.


I've been unable to get Mixtral (mixtral-8x7b-instruct-v0.1.Q6_K.gguf) to run well on my M2 MacBook Air (24 GB). It's super slow and eventually freezes after about 12-15 tokens of a response. You should look at M3 options with more RAM -- 64 GB or even the weird-sounding 96 GB might be a good choice.


https://www.reddit.com/r/LocalLLaMA/comments/17kcgjv/how_doe... - this Reddit thread talks about some of the pros and cons of an M3 Max with 128GB costing ~$5-6K.


You can buy multiple 4090s for that money and will get real GPUs including tensor cores. Still relevant it seems: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...


32GB will run a good quantized Mixtral, though I can't confidently explain how much of a quality difference there is from unquantized.


Data point: I am currently having issues getting Mixtral Q4_K_M running in LM Studio on my 32GB M1 Max. I'm trying Q3 to see if it fits.

I can have it run on the CPU, which is very slow, but offloading to the GPU runs out of memory.


You just have to allow more than 75% memory to be allocated to the GPU by running sudo sysctl -w iogpu.wired_limit_mb=30720 (for a 30 GB limit in this case).
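
For reference (as far as I know the setting is not persisted, so it has to be reapplied after a reboot):

sysctl iogpu.wired_limit_mb                 # check the current limit
sudo sysctl -w iogpu.wired_limit_mb=30720   # allow ~30GB to be wired for the GPU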


1. That worked after some tweaking. 2. I had to lower the context window size to get LM Studio to load it up. 3. LM Studio has two distinct checkboxes that both say "Apple Metal GPU". No idea if they do the same thing....

Thanks a ton! I'm running on GPU w/ Mixtral 8x Instruct Q4_K_M now. tok/sec is about 4x what CPU only was. (Now at 26 tok/sec or so).


I was talking about M2 Macs. Just pointing out that the best you can do with a GPU is 24GB, while Macs go far beyond that because of their integrated memory.


Just be aware you don't get to use all of it. I believe you only get access to ~20.8GB of GPU memory on a 32GB Apple Silicon Mac, and perhaps something like ~48GB on a 64GB Mac. I think there are some options to reconfigure the balance but they are technical and do not come without risk.


This is an important consideration. Thanks for mentioning it.


Ok, sorry. I did not understand that you just mentioned that to give more context. Totally makes sense.


https://www.reddit.com/r/LocalLLaMA/comments/17kcgjv/how_doe... - this Reddit thread has some ideas.

What I can draw from reading that thread is that you can buy a desktop rig with 200GB/s of memory bandwidth (comparable to the M3 Pro and Max) and a lot of expansion capability (256GB RAM). You should find out whether that's still good enough for your local use case in terms of tokens per second or training.

Then just use SSH/XTerm (and possibly ngrok) to log in to your rig at good speed from anywhere, using a light M2?


Get as much RAM as possible.

16GB is not enough.

32GB is enough to run quantized Mixtral, which is the current best openly licensed model.

... but who knows what will emerge in the next 12 months?

I have 64GB and I'm regretting not shelling out for more.

Frustratingly you still have WAY more options for running interesting models on a Linux or Windows NVIDIA device, but I like Mac for a bunch of other reasons.


I run 7B/13B models pretty gracefully on a 16GB M1 Pro, but it does leave me wanting a little more headroom for other things like Docker and the browser, which eat multiple gigabytes of RAM themselves.

Maybe keep an eye out for M1 / M2 deals with high ram config? I've seen 64GB MBPs lately for <$2300 (slickdeals.net)


Thank you (and all the adjacent comments). I don't really need so much RAM for my regular work. Running the llms locally would only be for fun and experiments.

I think 32GB might be the best middle ground for my needs and budget constraints.

It's really a pity that you can't extend RAM in most Apple Silicon Macs and have to decide carefully upfront.


I ended up ordering a 36GB M3 for similar reason.

I currently run Mistral and a few Mistral derivatives using Ollama with decent inference speed on a 2019 Intel Mac with 32GB, so I assumed the new one with 32-ish should do a better job.

I've tried the vision model LLaVA as well; a bit more latency, but it works fine.

With Apple's own MLX, things might improve.


Why not both? :)

Bait aside, I'd love to read about how you're using those models. I'm mostly interested in code comprehension and meeting summarisation.


I still mostly use GPT-4 to get actual work done, because until Mixtral came along it felt like the local models were no competition.

I'm going to bump up my usage of Mixtral a bit now to see how it feels for that kind of stuff.


Ah, same here. I've been itching to make some use of those local LLMs for some time, but always end up in ChatGPT as well.

Although for those napkin-like ideas, GPT-4 (including the Turbo variant) gets costly quickly.


Just got 64GB as well, but I'm about to refund it for a bigger one after reading your comment :)


Long story short: for a machine with about 5+ years of shelf life, it isn't a bad value proposition to buy the absolute top end, especially if you expect some ROI out of it. Just be aware that if you don't want the top end for LLMs, the M2 currently provides more value, because the mid-range M2 has more memory bandwidth than the mid-range M3.

Memory bandwidth is the key to model speed, and memory size is what enables you to use larger models (quantization lets you push things further, to a point). One thing to note is that on the M3 Pro/Max only the top-end configurations get the full bandwidth, while the M1/M2 Pro enjoy full bandwidth from a smaller memory size. This may be important if you value speed above model size or vice versa. The M2 Pro and M2 Max get approximately 200 GB/s and 400 GB/s, but things are more complicated for the M3: the M3 Pro gets 150 GB/s, and the M3 Max gets 300 GB/s at 36GB and 400 GB/s at 48GB.

A few more things to note:

It's absolutely fine to go and play around with LLMs, but even with an LLM monster machine there's nothing wrong with starting with smaller models and learning how to squeeze the maximum amount of work out of them. The learning does transfer to larger models. This may or may not be important if at some point you want to monetize or deploy to production what you've learned. While the Mac itself is a good investment for personal use, once you move to servers, cost skyrockets with model size because of supply constraints on 40GB+ memory GPUs. If you are dependent on a 70B-parameter model, you'll have a hard time building a cost-effective solution. If it's strictly for playing around, you can disregard this concern.

Even if you're just playing around, a 70B model is going to run at around 7 tokens/second, which is fine for a local chat, but if you are writing a program and need inference in bulk, it's fairly slow.

Another thing of note: while the field is still undecided on which size and architecture is good enough, the multitude of small fish experimenting with tuning and instruction mixes are largely experimenting on smaller models. Currently my favorite is openhermes-2.5-mistral-7b-16k, but that's not an indication that Mistral models are strictly better than Llama 2 - more an indication that experimenting with 7B is more cost effective for third parties without access to GPUs than experimenting with 13B. So you'll find 13B models kind of stagnating, with many of them trained in a period when people didn't really know the best parameters for fine-tuning, so they're a bit behind the curve. A few tuners are working on 70B models, but those seem to be pivoting to Mixtral and the like, which will cause a similar stagnation on the top end - that is, until Llama 3 or the next Mixtral size drops. Then, who knows.


All that you can afford; I got an M2 with 96GB of RAM.


Rule of thumb: when buying a new Mac, always max RAM.


To max the RAM, you also have to max the CPU. It gets super expensive.


This may be a dumb question but how do you update an LLM with new information after its knowledge cutoff?


The term that you're looking for is Retrieval Augmented Generation (RAG).

https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-fo...


Generally, you train it again entirely from scratch.

It's possible to introduce new information by fine-tuning a new model on top of the existing model, but it's debatable how effective that is for introducing new information - most fine-tuning success stories I've seen focus on teaching a model how to perform new kinds of task as opposed to teaching it new "facts".

If you want a model to have access to updated information, the best way to do that is still via Retrieval Augmented Generation. That's the fancy name for the trick where you give the model the ability to run searches for information relevant to the user's questions and then invisibly paste that content into the prompt - effectively what Bing, Bard and ChatGPT do when they run searches against their attached search engines.


Most LoRAs are less effective for facts, since the changes largely shift attention (Q and K, not V layers) and only touch a tiny percentage of weights at that; however, full fine-tunes are pretty effective for introducing new facts (you could probably use ReLoRA as well).

There are also newer techniques like ROME that can edit individual facts, and you might also be able to get there when updating by doing a DPO tune of the old vs. the new answers.

While I agree that RAG/tool use (with consistency checking) might be the best overall approach for facts, being able to update/tune for model drift is probably still going to be important.

I'd also disagree about the training entirely from scratch - unless you're changing architecture/building a brand new foundational model or have unlimited time/compute budget, that seems like the worst option (and pretty unrealistic) for most people.


I don't think that's a dumb question at all. It depends on your objectives and how much resources you're willing to spend.

These open weights models can be retrained. Start with a foundational model like Llama2 or something and expose it to more recent training data that includes whatever updated information you want it to have access to. This is relatively expensive, but allows for big changes to the model.

If you have some relatively small subset of new information you want to bring in, you could build a LoRA. Then either run your model with the LoRA, or fold the LoRA into your base model. This is relatively cheap, but fairly narrow in terms of your updates.

In the long run, it might be that Retrieval Augmented Generation (RAG) is the way to go. Here, your embeddings go into a vector database, and the model reads from there. Then you just need to update the database for the model to have access to new information.

This LLM stuff is new enough that anything like best practices are still being worked out. The optimal way to bring in new information could be a variant of one of the methods I mentioned above, or some combination of all three, or something else altogether.


- In-context learning (simply put it into the context)

- Fine-tuning, as in restarting from the model's checkpoint and training it on the previous + new data

- Adding extra neurons (e.g. LoRA adapters) at certain places and resuming training

Oh, in classic machine learning there's also the "bagging/boosting classifiers" option, but I don't know whether that can be applied to an ANN.


Have you seen LoRA adapters actually work for this kind of thing?

The leaked Google "We have no moat" memo was very excited about LoRA style techniques, but it's not clear to me that it's been proven as a technique yet.


No, I have tried about 3 times and they all failed. Although it might be that I'm simply doing it wrong.

There are people (I can point at a Discord server) claiming it works for them and that they even sell fine-tuned models to business clients.

EDIT: I found one of the articles I tried to follow: https://www.mlexpert.io/prompt-engineering/chatbot-with-loca...

EDIT2: Ignore above. This seemed much more promising: https://www.youtube.com/watch?v=pnwVz64jNvw . Author provides consulting services and seemed very nice and approachable


The easiest way is to let it do web searches with a Toolformer/plugin framework.

For knowledge you specifically want it to be able to recall (like knowledge base articles or blog posts), vector database embeddings are best.

For knowledge you want it to operationalize, like being able to program in a new language, the last resort is fine-tuning, but this is not easy: it requires massive amounts of high-quality data and is generally not effective for things that don't have a large amount of data to fine-tune on (tens of thousands of pages worth of content).


Not sure if there is any point to it. You cannot trust the output of an LLM for current info. They hallucinate way too often for that.


It's great that there are many comments about being able to run models locally, but I rarely see comments about what folks are using them for. Would love to learn more about use cases and problems being solved.


I use them for coding - you'd be surprised at how much you can get done with DeepSeek Coder 6.7B.

Most GPT plugins for code editors will also work for local models since you can have OpenAI API stubs running locally.
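
As a rough illustration (the exact knob varies by plugin - these are the environment variables the OpenAI client libraries commonly read, and the port is whatever your local server listens on):

export OPENAI_API_BASE=http://localhost:8080/v1   # older OpenAI SDKs
export OPENAI_BASE_URL=http://localhost:8080/v1   # newer OpenAI SDKs
export OPENAI_API_KEY=sk-local-placeholder        # local servers usually ignore the key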

Clearly it is nowhere near GPT-4 capacity, but if you ask simple boilerplate things ("write a class with the following methods", then "write unit tests for it") it will mostly work. Even if it doesn't, you can manually fix it, and it still can save you some time.

Always review code generated by an LLM, even when it comes from GPT-4!


It's a goldrush and no one wants to share their secrets.

Here's a good example from previous HN comments: https://news.ycombinator.com/item?id=38482347


I recently made them work with meeting/presentation transcripts. Mistral 7B (the tiny one) was enough to reliably extract quite a lot of information:

- It could derive speaker names, based purely on what people called themselves in the conversation

- It drew a Mermaid sequence diagram that wasn't perfect, but wasn't complete garbage either. With a few back-and-forth corrections it was on point.

- It created truly usable meeting notes

It was a much better UX than having to hunt down the relevant video section, watch it, and force myself to focus despite a lot of filler communication. This works very well for those 30-minutes-and-up meetings where the real content is in the spoken words and the slides just reiterate it. I was really pleasantly surprised by how much I liked it.

Also, some models can rate each speaker's contribution :o)


Are people actually using this to 10x themselves / create actually profitable services already?

Or is it more like fine-tuning a model to have an edge on the leaderboards for a day or two, then taking VC money, or integrating some "AI magic" into existing user bases?

I've been following the locallama sub on reddit and the few services created seem very niche besides tons of hobby stuff.


i recently started using them as assistants for programming. i usually don't let them write code as it's often (subtly) wrong. examples are:

* how do i use library X to do task Y (excellent for quickly getting up to speed with new libraries).

* actual example from a few days ago: "the most common CI/CD systems and how to identify them" - chatgpt correctly gave me the environment variable names for github actions, gitlab ci, travis, circleci, jenkins and one or two others. theoretically it saved me having to go through the docs of 7 different systems looking for the right information, which i still did to make sure the data was legit. just confirming the info was still a lot less work as i already knew what to look for.

* how do i create a certain style with css framework XYZ

* is there an algorithm for solving the following problem ...?

* alternative phrasings or synonyms if i can't find the right words myself.

* cooking recipe suggestions ("i have ingredients a, b and c, give me a stew recipe")

* pop culture questions ("why did the fremen settle on a hostile planet like arrakis in the first place?") i'm too lazy to research myself or ask on reddit

* sometimes my (non-natively english speaking) coworkers produce engrish i just can't parse. asking gpt to correct or explain the sentence often yields surprisingly good results.

usually i double check the results but that's still less work than doing all the work by myself from the beginning. recently i also let it write and style html forms which works quite well.

so for me, they're a welcome productivity boost.


Let's be honest, they're using them mostly for ahem personal time.


> race to the bottom in terms of pricing from other LLM hosting providers.

>This trend makes me a little nervous, since it actively disincentivizes future open model releases from Mistral and from other providers who are hoping to offer their own hosted versions.

That does indeed seem ominous. I guess they’ll just introduce a significant lag before releasing models openly in the future.


I’ve been wondering: without stuff like llama.cpp, Ollama, etc., how exactly were you supposed to run Mistral when they released it?


The main framework for running LLMs was PyTorch, which AFAIK has a slow CPU implementation, so only GPU/CUDA users would be covered.

Llama.cpp was the project that popularized running LLMs on the CPU due to its very efficient implementation. Ollama is a frontend to it.


Any GUI option for us mere mortals not proficient with the command line? faraday.dev has Mistral, but not that 8x option.


Honestly, Ollama (posted by someone earlier) was surprisingly simple to use. If you have WSL (on Windows) or Linux/OSX it's a one-line install and a one-line run. I was up and running using a lowly 6GB-VRAM GPU in about 3 minutes (the time it took for the initial download of the models).

Installing WSL (on Windows) is similarly straightforward nowadays. In your search bar, look up the Microsoft Store, open the app, search for Ubuntu, install it, run it, and follow the one-liner for installing Ollama.

https://ollama.ai/library/mistral


But will that WSL Ubuntu be able to access my GPU? Do I need to install drivers for that, at untold CLI pain?


Anybody have a Mistral invitation they can spare? I’m super curious how Mistral Medium performs compared to GPT-3.5/4.


Are you on their waitlist? I got through that in less than 48 hours, so it might be worth a shot.


I am. I signed up late though, and got an email about it last night: “Access to our API is currently invitation-only, but we'll let you know when you can subscribe to get access to our best models.”


Maybe I did something wrong, but the first Mistral GGUF I downloaded from Hugging Face and used with llama2 ended every answer with something along the lines of "let us know in the comments", and the answers felt like blog posts. :D


Any option to use a SaaS UI with those open source models? I'm OK with paying for API access (to Anyscale, for example) and using the API from the UI.


I’ve been using baseten (https://baseten.co) and it’s been fun and has reasonable prices. Sometimes you can run some of these models from the hugging face model page, but it’s hit or miss.


You missed the Ollama option.


I was hoping I could run my LLM CLI tool against Ollama via their localhost API, but it looks like they don't offer an OpenAI-compatible endpoint yet.

If they add that it will work out of the box: https://llm.datasette.io/en/stable/other-models.html#openai-...

Otherwise someone would need to write a plugin for it, which would probably be pretty simple - I imagine it would look a bit like the llm-mistral plugin but adapted for the Ollama API design: https://github.com/simonw/llm-mistral/blob/main/llm_mistral....
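
For reference, Ollama's native API (which such a plugin would wrap) looks roughly like this against the default port:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'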


Which honestly is the easiest option of them all if you own an Apple Silicon Mac. You just download Ollama and then run `ollama run mixtral` (or choose a quantization from their models page if you don't have enough RAM to run the default q4 model), and that's it.


I tried an hour ago and got a "can't load model" error. Everything is up to date. Is there any special step?


Tried `ollama pull mixtral` just now and it seems to be working, albeit pretty slowly.


How much RAM do you have? Mixtral is a beast and the non quantized model wants 40GB+ of memory.


Ah, that might be it! I only have 32GB.


The q2 should fit.



