Keep in mind that GGUF/llama.cpp, although highly performant and portable, is not the best-performing way to run certain models if you have a GPU (even though llama.cpp does support GPU acceleration).
ExLlamaV2 with exl2 quantization, and maybe TensorRT-LLM, are the contenders for top performance.
Most of you already have Python. Download exllamav2 and exui from GitHub and run a few terminal commands. This lets me run 120B-parameter models, which won't fit in VRAM if I use llama.cpp.
Panchovix/goliath-120b-exl2 (there's a different branch for each size)
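If you want to grab one of those quants programmatically, a minimal sketch with huggingface_hub is below. The revision name is just illustrative; check the repo's branch list for the bits-per-weight sizes that actually exist.

    from huggingface_hub import snapshot_download

    # Each quant size lives on its own branch of the repo, so pass it as `revision`.
    # "4.5bpw" is only an example branch name; list the repo's branches to see
    # which variants are actually published.
    snapshot_download(
        repo_id="Panchovix/goliath-120b-exl2",
        revision="4.5bpw",
        local_dir="models/goliath-120b-exl2-4.5bpw",
    )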
Some of them I've had to do myself, e.g. I wanted a Q2 GGUF of Falcon 180B.
There's a guy on huggingface called "TheBloke" who does GGUF, AWQ and GPTQ for most models. For exl2, you can usually just search for exl2 and find them.
I am currently planning my own small LLM trained on documents we use internally for work. Does anyone have any tips and tricks on how to make this work the best? Could a project like Llamafile help me with this, even if it is just for testing?
How did you choose between training a model from scratch vs using retrieval-augmented generation with an existing off-the-shelf model? From what I've observed, RAG + off-the-shelf model seems to be the more common approach for use cases like "create an LLM that answers questions about my company's internal documentation", particularly because the iteration/improvement cycle is much shorter: it's much easier to iterate on RAG/prompts than to train a whole new model to improve it.
(If the answer is "I just wanted to try training a whole new llm", I won't fault you for that! :) )
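For context, the retrieval side of RAG really is only a few dozen lines. A minimal sketch (the documents, embedding model name, and the final "feed to an LLM" step are placeholders; swap in whatever local model or server you're using):

    # pip install sentence-transformers numpy
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = [
        "VPN access is requested through the IT portal under 'Remote Access'.",
        "Expense reports must be filed within 30 days of the purchase date.",
        "The on-call rotation schedule lives in the #ops channel topic.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(question, k=2):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                 # cosine similarity (vectors are normalized)
        return [docs[i] for i in np.argsort(-scores)[:k]]

    question = "How do I get VPN access?"
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    print(prompt)  # feed this to any off-the-shelf local model (llama.cpp, ollama, etc.)

Iterating on the docs, chunking, and prompt wording in a loop like this is much faster than retraining anything, which is why it's the usual first approach.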
Best in what sense? By the description it seems a whole lot less convenient (and more limited) than oobabooga/text-generation-webui or similar tools if you want to use more than one model (and especially more than just models available as GGUFs). Those tools bundle multiple backends for different LLM architectures, support downloading models from Hugging Face, and present a common web UI for LLM configuration, managing prompts, and actually doing inference (in chat, notebook, and other styles). They can also present an OpenAI-compatible API endpoint backed by a local LLM to support other frontends.
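That last point is what makes those tools composable: anything that speaks the OpenAI API can sit in front of them. Rough sketch, assuming you've started a local backend with its OpenAI-compatible endpoint enabled; the base_url/port and model name are whatever your setup actually uses:

    # pip install openai
    from openai import OpenAI

    # Point the standard OpenAI client at the local server instead of api.openai.com.
    # The URL below is illustrative; use whatever host/port your backend listens on.
    client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="local-model",  # most local servers ignore or loosely match this field
        messages=[{"role": "user", "content": "Summarize what a GGUF file is."}],
    )
    print(resp.choices[0].message.content)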
I've tried running Llamafile on my Lenovo Legion Pro 5 laptop with 8GB of VRAM, but a dashboard that shows GPU and CPU utilisation in real time indicates that almost all the processing is done on the CPU. Is there a way to shift the processing to the GPU, like gpt-fast does?
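With llama.cpp-based backends you generally have to ask for GPU offload explicitly, e.g. by passing a layer count such as -ngl 35 on the command line (check your llamafile's --help for the exact flag on your build). The same idea in Python via llama-cpp-python, as a sketch (the model path is a placeholder):

    # pip install llama-cpp-python  (built with CUDA/cuBLAS support for NVIDIA GPUs)
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # -1 = offload every layer that fits; lower it if 8GB VRAM is tight
        n_ctx=2048,
    )

    out = llm("Explain GPU layer offloading in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])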
Pretty much any modern machine will do you fine for ~5 tokens/s on a 7B (small end) model like Mistral-7B or llama2-chat-7B (or any of their respective fine-tunes). The computer you already have can probably do this.
It can run on CPU cores if you have enough RAM for the model. It feels like you have time-warped back to the early 1990s and are talking to someone on a BBS (AKA the words appear slowly), but it is entirely functional if you have 32+GB of RAM.
For reference: I run 7B 5-bit quantized models on a Ryzen 7 5700G with 64GB at 8 tokens/second, CPU only.
It's not close to what you can get with a high-end graphics card, but for everyday use it is alright and has headroom for bigger models. Upgrading the CPU, RAM and mainboard in a 10+ year old PC cost me just 400€.
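If you want to check numbers like that on your own box, a quick way is to time a CPU-only generation with llama-cpp-python (the model path and thread count here are just examples):

    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-7b-instruct.Q5_K_M.gguf",  # any 7B ~5-bit GGUF
        n_gpu_layers=0,  # force CPU-only inference
        n_threads=8,     # set to your physical core count
    )

    start = time.time()
    out = llm("Write a short paragraph about old BBS systems.", max_tokens=200)
    elapsed = time.time() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")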
I haven't tested, but I suspect you might be able to get good CPU inference on this mini PC: https://www.aliexpress.com/item/1005005825981362.html. It costs ~$800 in the maxed configuration with 64GB RAM clocked at 5600MHz.
It would be nice if input could be taken from a command-line argument or, better yet, stdin, so that it is fully scriptable.
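In the meantime you can fake it with a tiny wrapper that reads stdin and passes the prompt through. This assumes your llamafile accepts the usual llama.cpp-style -p/--prompt and -n flags in CLI mode (check --help, since some builds default to launching the web server), and the binary name is just an example:

    #!/usr/bin/env python3
    # Usage: echo "Summarize this repo's README" | python3 ask.py
    import subprocess
    import sys

    prompt = sys.stdin.read().strip()

    # Binary name and flags are illustrative; adjust for your llamafile/model.
    result = subprocess.run(
        ["./mistral-7b-instruct.llamafile", "-p", prompt, "-n", "256"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)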
ollama has a way to do this and lets you play with a bunch of models without being very smart (just do ollama pull <model-name> and it downloads a model and makes it available to ollama run/ollama serve).
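And once ollama serve is running, it exposes a small HTTP API on localhost:11434, so scripting against it is straightforward. A sketch (the model name assumes you've already done ollama pull mistral):

    # pip install requests
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",  # must already be pulled
            "prompt": "Give me one tip for writing commit messages.",
            "stream": False,     # return one JSON blob instead of a stream
        },
    )
    print(resp.json()["response"])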
It's not the best way. It's a really cool and technically interesting way. But embedding the model in the executable is terrible for anything beyond a demo. The best way would be just running llama.cpp (which is what llamafile uses) and loading external models. I get that compiling is too difficult for some people and llamafiles are great for them. Maybe even the best for them.
But it's far from the best for people who actually want to explore LLMs and play.
Or you can download LM Studio or oobabooga and run any model converted to GGUF format, and also models in a wide variety of formats that aren't GGUF.
From what I've read llamafile can also load other models if you pass a flag with a path to them. There's also LM Studio for anyone who'd like to play with a ChatGPT-like GUI.
If you have 8GB you can play around with heavily quantized 7B models, or up to moderately quantized 13B models with 16GB.
Something like a Mistral-7B Dolphin finetune is actually surprisingly useful, like GPT-3.5 in some respects. I imagine it would render ~10 tok/sec on an M2, at least for short bursts until throttling sets in.
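The back-of-the-envelope math behind those RAM numbers, for anyone curious (the bits-per-weight figures for the common GGUF quant levels are approximate):

    # Rough model-size estimate: parameters * bits-per-weight / 8, plus some
    # headroom for the KV cache and runtime overhead. Quant sizes are approximate.
    QUANT_BITS = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

    def approx_gb(params_billion, quant):
        return params_billion * 1e9 * QUANT_BITS[quant] / 8 / 1e9

    for params, quant in [(7, "Q4_K_M"), (7, "Q5_K_M"), (13, "Q4_K_M"), (13, "Q5_K_M")]:
        print(f"{params}B @ {quant}: ~{approx_gb(params, quant):.1f} GB of RAM/VRAM")
    # 7B @ Q4_K_M:  ~4.2 GB -> fits in 8 GB with room for context
    # 13B @ Q5_K_M: ~9.3 GB -> wants ~16 GB once you add KV cache/overhead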
As I mentioned in another post, try this out with LM Studio. Super simple GUI that even a non-tech person could probably figure out for finding/downloading/loading models, with a ChatGPT-like interface for chatting.
Is there any reason to run this if you already have LM Studio installed?
Which LLM to download for general questions/info that has good accuracy?
Separately, I couldn’t figure out how to use the NVIDIA GPU with LM Studio. It only wants to use the Intel card. I use NVIDIA with Diffusion with no problems, if that makes any difference.
Without knowing what you consider to be of worth, that's difficult to answer.
The good news is that the article describes in detail how to determine for yourself if it's worth it. The author uses an M2 as well, so you can at least know it'll likely work.