Many setup rely on Nvidia GPUs, Intel stuff, Windows or other stuff, that I would rather not use, or are not very clear about how to set things up.
What are some recommendations for running models locally, on decent CPUs and getting good valuable output from them? Is that llama stuff portable across CPUs and hardware vendors? And what do people use it for?
Have you tried a Llamafile? Not sure what platform you are using. From their readme:
> … by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.
Low cost to experiment IMO. I am personally using MacOS with an M1 chip and 64gb memory and it works perfectly, but the idea behind this project is to democratize access to generative AI and so it is at least possible that you will be able to use it.
I should have qualified the meaning of “works perfectly” :) No 70b for me, but I am able to experiment with many quantized models (and I am using a Llama successfully, latency isn’t terrible)
"Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable."
I use it just fine on a Mac M1. The only bottleneck is how much RAM you have.
I use whisper for podcast transcription. I use llama for code complete and general q&a and code assistance. You can use the llava models to ingest images and describe them.
Ollama works on anything: Windows, Linux, Mac and Nvidia or AMD. I don't know if other cards like Arc are supported by anything yet, bit of it supports the open Vulkan API (like AMD) then it should work.
Every inference server out there supports running from CPU, but realize that it's much slower than running on a GPU - that's why this revolution didn't begin until GPUs became powerful and affordable.
As far as being clear to setup, Ollama is trivial: it's a single command line that only asks what model you want and they provide you with a list on their website. They even have a Docker container if you don't want to worry about installing any dependencies. I don't know what could be easier than that.
Most other tools like LM Studio or Jan are just a fancy UI running llama.cpp as their server and using HuggingFace to download the models. They don't even offer anything beyond simple inference, such as RAG or agents.
I've yet to see anything more than a simple RAG that's available to use out of the box for local use. The only full service tools are online services like Microsoft Copilot or ChatGPT. Anyone else who wants to do that more advanced kind of system ends up writing their own code. It's not hard if you know Python - there are lots of libraries available like HuggingFace, LangChain, and Llama-Index, as well as millions of tutorials (every blog has one).
Maybe that's a sign that there's room for an open source platform for this kind of thing, but given that it's a young field and everyone is rushing to become the next big online service or toolkit, there might not be as much interest from developers to build an open source version of a high quality online service.
What are some recommendations for running models locally, on decent CPUs and getting good valuable output from them? Is that llama stuff portable across CPUs and hardware vendors? And what do people use it for?