I'm in search of a local LLM that can run completely offline for processing personal documents. Key requirements include privacy (no data leaves my machine) and performance (efficient with large datasets). Any recommendations for open-source / commercial solutions that fit the bill in 2024?
Also, what's the current state of local LLMs? Are they practical and useful, or still facing significant limitations?
> Are they practical and useful, or still facing significant limitations?
They are. I'm working on a product using a fine-tuned Mistral-7B-Instruct-v0.2 model and it's pretty mind-blowing. It works flawlessly on my RTX 3090 and is serviceable on my M1 MBP as well. I'm building in Rust (using the candle crate), but for personal usage Python is probably the better choice since it's easier to get up and running.
Mistral-7B-Instruct-v0.2 - I'm using this exact model too, and it is mind-blowing, but to get the most out of it, make sure you use llama.cpp and turn on self-extend. (I'm not sure if support for self-extend has been merged into main yet; I manually merged a dev branch.)
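For the "easier in Python" route, here's a minimal sketch of loading a quantized Mistral-7B GGUF with llama-cpp-python. The model path and quantization level are assumptions (use whatever GGUF build fits your card), and I'm not sure the Python bindings expose the self-extend knobs yet, so that part may still require the llama.cpp CLI:

    # Minimal sketch: a quantized Mistral-7B-Instruct GGUF via llama-cpp-python.
    # The model path/quant below are assumptions -- pick a GGUF that fits your VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # hypothetical local path
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload all layers to the GPU; use 0 for CPU-only
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize this note: ..."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])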
I asked this somewhere else in this thread, but I'm curious how you plan to run and deploy this model. What's the cheapest, or I guess most cost-efficient, way to do this without burning money? I'm a college student, for reference, so I don't have a lot of money lying around to experiment.
I have infra in my house; not gonna lie, it cost a lot. I have a rack with $30k of equipment in it (including 5 GPUs).
But this would probably run on an AWS P2 instance, which is $0.90 an hour, or there's Lambda Labs, which is also pretty cheap (no affiliation, just a satisfied customer).
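Rough back-of-the-envelope math, using the numbers floating around this thread ($0.90/hr rental vs. roughly $500-800 for a local build); all figures are illustrative, not quotes:

    # Break-even estimate: renting a GPU by the hour vs. buying a cheap local box.
    hourly_rate = 0.90      # USD/hr for a rented GPU instance (example figure)
    local_build = 800.0     # USD, one-time (example figure)

    hours_to_break_even = local_build / hourly_rate
    print(f"~{hours_to_break_even:.0f} rental hours equals the local build")  # ~889 hours

    # If you only tinker a few hours a week, renting stays cheaper for a long time:
    hours_per_week = 5
    weeks = hours_to_break_even / hours_per_week
    print(f"~{weeks:.0f} weeks (~{weeks / 52:.1f} years) at {hours_per_week} hrs/week")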
Shooting from the hip here I’d say $800 would be a good start for all new parts. You might get down to $500 with used parts.
The most expensive piece in either build is a video card. You want to be able to load the LLM file into your video card RAM.
I just got a new 16 GB card for this. I can load up to 34B models on it, but they run poorly. Anything 13B or less runs perfectly. A 12 GB card would be able to run 7B models, and with the right training I think 7B models can be awesome.
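A rough rule of thumb (an approximation, not an exact formula): a quantized model file weighs about parameters × bits-per-weight ÷ 8, plus some headroom for the KV cache and runtime. A quick sketch with illustrative numbers:

    # Rough VRAM estimate for quantized models; overhead figure is an assumption.
    def approx_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
        weights_gb = params_billion * bits_per_weight / 8
        return weights_gb + overhead_gb

    for params in (7, 13, 34):
        print(f"{params}B @ ~Q4 ≈ {approx_vram_gb(params, 4.5):.1f} GB VRAM")
    # Roughly: 7B fits a 12 GB card, 13B fits 16 GB, 34B wants ~24 GB or heavy offloading.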
Mistral-7B will run on a Raspberry Pi at Q4 with a little bit of patience. For good acceleration though, you'll want an Nvidia machine with enough VRAM to comfortably store the model.
RAG happens entirely locally (local embedding model and local vector DB).
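Not the poster's actual pipeline, but here's a minimal sketch of what a fully local retrieval step can look like: embed chunks with a local model, then retrieve by cosine similarity (swap in a real vector DB at scale). Model name and documents are just examples:

    # Fully local retrieval: local embedding model + in-memory cosine similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

    docs = [
        "Invoice 2024-03: paid $120 for hosting.",
        "Meeting notes: discuss Q2 roadmap with Ana.",
        "Recipe: lentil soup with cumin and garlic.",
    ]
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    query = "how much did hosting cost?"
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]

    scores = doc_vecs @ q_vec        # cosine similarity (vectors are normalized)
    best = int(np.argmax(scores))
    print(docs[best])                # the retrieved chunk you'd feed to the local LLM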
The app is secured by the macOS App Sandbox, meaning it only has access to files you select in the system dialog or drag and drop in. If you use a local LLM, everything works offline.
I run local LLMs (Mistral-7B-Instruct-v0.2) using LM Studio (Ollama works well too, I believe) and host a local server on my Mac. I can hit the endpoints the same way you would with OpenAI's chat completions API, and I can trigger it inline across my other applications using MindMac.
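A sketch of what that looks like from Python, using the OpenAI client pointed at the local server. The base_url assumes LM Studio's default port (1234), and the api_key is a dummy value since the local server doesn't check it; the model name is whatever your server reports:

    # Hitting a local LM Studio server with the OpenAI Python client.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="mistral-7b-instruct-v0.2",  # name as exposed by your local server
        messages=[{"role": "user", "content": "Give me a one-line summary of RAG."}],
    )
    print(resp.choices[0].message.content)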
> I'm in search of a local LLM that can run completely offline for processing personal documents. Key requirements include privacy (no data leaves my machine) and performance (efficient with large datasets). Any recommendations for open-source / commercial solutions that fit the bill in 2024? Also, what's the current state of local LLMs? Are they practical and useful, or still facing significant limitations?
We've added support for it in our app if you wanna give it a try: https://curiosity.ai
You need suitable hardware (ideally a 3090, a 4090, or an Apple M-series device with a decent amount of memory).
Then set up software - ollama for easy mode (but less control) or text-generation-webui for more control.
After that you can just try models. The subreddit /r/localllama has whatever is the flavour of the week. The Mixtral model at around Q3 quantization is probably a good starting point.
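If you go the ollama "easy mode" route, here's a quick sketch of calling it from Python over its REST API. It assumes the default port (11434) and that you've already pulled a model (e.g. `ollama pull mistral`); the model name is just an example:

    # Talking to a local Ollama server via its REST API.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",
            "prompt": "Explain quantization in one sentence.",
            "stream": False,   # one JSON response instead of a token stream
        },
        timeout=120,
    )
    print(resp.json()["response"])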
With that said, I don't think you need anything special to run LLMs these days. I can run 7B models on a four-year-old AMD or Intel CPU (no GPU) for programming tasks.
Yes you can. Download LM Studio and then search for a model such as Mistral-7B-Instruct-v0.2. LM Studio will suggest variants of that model that suit your hardware to download.
Here's a demonstration: https://www.youtube.com/watch?v=VXHryjPu52k
You can do this with Llama 2. There are multiple ways to compile it unless you use Python. If you don't have familiarity with C++, I would just stick to Python and save yourself time. Buy a big PC that can handle it.
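For the "just use Python" route, a sketch with Hugging Face transformers. Note the meta-llama checkpoints are gated, so you have to request access and log in first; the exact dtype/device settings below are assumptions you'd adjust for your hardware:

    # Loading a Llama 2 chat model with transformers (gated repo: accept the
    # license on Hugging Face and run `huggingface-cli login` beforehand).
    import torch
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,   # halves memory; drop on CPU-only boxes
        device_map="auto",           # place layers on GPU(s) if available
    )

    print(generator("Summarize why local LLMs matter for privacy:", max_new_tokens=120)[0]["generated_text"])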