> adjust your paths as necessary. It has a tendency to talk to itself.
This is always a fun surprise. What I've seen help, especially with chat models, is to use a prompt template. Some tools (e.g. https://ollama.ai/ – mentioned in the article) use a default, model-specific prompt template when you run the model. This is easier for users since they can just type their chat messages and get answers. The hard part is that every model is trained (and behaves) differently.
With llama.cpp you'd need to wrap your prompt text with the right template. For Llama 2, the Facebook developers' generation code wraps the system prompt and user prompts in specific tags (<<SYS>>{system prompt}<</SYS>> and [INST]{user prompt}[/INST], respectively): https://github.com/facebookresearch/llama/blob/main/llama/ge....
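To make that concrete, here's roughly what the wrapping looks like if you build the prompt string yourself (spacing/newlines are approximate - the linked generation.py is the authoritative source):

    # Rough sketch of the Llama 2 chat prompt format; see Meta's generation.py
    # for the exact token handling. Spacing/newlines here are approximate.
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    def build_prompt(system_prompt: str, user_prompt: str) -> str:
        # The system prompt is folded into the first user turn, then the whole
        # turn is wrapped in [INST] ... [/INST].
        return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_prompt} {E_INST}"

    print(build_prompt("You are a terse assistant.", "Why is the sky blue?"))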
Worth noting that customizing prompt templates can be fun – you don't have to use the "prescribed" one. I've had a model generate a conversation between a few characters that talk to each other, for example – it's pretty entertaining!
Note, it is possible to reinject <<SYS>> prompts during the conversation to keep the LLM on target: https://twitter.com/overlordayn/status/1681631554672513025 (but obviously this should be the first thing you filter and track if you are running an LLM that is accessible to end-users - rebuff is a lib w/ some good ideas to start with)
With new fine tunes coming out, it's worth noting again that different datasets/models all use slightly different prompt formats (many are using multiple datasets these days, so it's hard to say just how much it matters now): https://www.reddit.com/r/LocalLLaMA/comments/13lwwux/comment...
Ideally, fine tunes for Llama 2 would regularize all datasets to the official tokens/format, and inference interfaces could standardize too (or there could be some metadata collected for which model uses which format) – but that's some low-hanging fruit left for the future, I guess. One other thing to keep in mind is that the benchmarks/leaderboards don't use instruct formatting at all, and don't really represent a model's capabilities. IMO, ELO-style rankings against other models on specific tasks would probably be more representative.
I have a question: Last week I downloaded llama-7b-chat from meta's github directly (https://github.com/facebookresearch/llama) using the URL they sent via e-mail. As a result, I now have the model as consolidated.00.pth.
Your commands assume the model is a .bin file (so I guess there must be a way to convert the pytorch model .pth to the .bin file). How can I do this and what is the difference between the two models?
The facebook repo provides commands for using the models, but these commands don't work on my Windows machine: "NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to ...."
The facebook repo does not describe which OS you are supposed to use, so I assumed it would work on Windows too. But then, if this can work, why would anyone need the ggerganov llama code? I am new to all of this and easily confused, so any help is appreciated.
To be perfectly honest, I know absolutely nothing about AI or Llama; I'm just a Windows C++ programmer so I wanted to provide cmake instructions for Windows, sorry. The .bin file is what I got from the OP's link
It's ok, I just followed your instructions and with that model it works well. But are you sure that this uses CUDA? My CPU utilization is at 50% while my GPU utilization is at 1% as the output is being generated.
The cmake build prints that it finds CUDA when I run the CMakeLists (it prints the location of the CUDA headers), however I don't see any noticeable difference between CPU-only and CUDA builds. So if it's not working then maybe there's a CLI option that's required, or maybe CUDA support is broken on Windows.
llama.cpp needs the files to be in GGML format; there is a conversion script in the llama.cpp repo you can run to convert the .pth weights to a GGML .bin (plus a separate quantize step to shrink them). Or just download the GGML version.
Some of you may have seen this, but I have a Llama 2 fine-tuning live coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.
In the end, with quantization and parameter-efficient fine-tuning, it only took up 13GB on a single GPU.
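For anyone who just wants the gist of the setup without watching: the core of it looks roughly like the sketch below (illustrative hyperparameters, not exactly what's in the stream; assumes transformers, peft and bitsandbytes are installed and you have access to the gated HF weights).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-2-7b-hf"  # gated; requires accepting Meta's license on HF

    # Load the base model in 4-bit - this is what keeps memory usage low
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Attach small trainable LoRA adapters instead of training the full model
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # a tiny fraction of the full parameter count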
Can you do more videos on preparing the data and scraping data? I noticed you seem to be proficient at it by your terminology and it’s something I want to dive into more.
This covers three things:
Llama.cpp (Mac/Windows/Linux),
Ollama (Mac),
MLC LLM (iOS/Android)
Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I am sure there are other things that could be covered.
As an article aimed towards newbies I think it's pretty good, but I'd agree "comprehensive" is a bit of a reach. I think a better title would have been something like "Get Llama 2 running locally on every platform easily."
Anyone w/ more than a single consumer GPU probably has a good grip on their options (a vLLM vs HF shootout would be neat, for example), but I'd add a few more projects for those taking the next step for local inferencing:
* exllama - while llama.cpp has matched its token generation performance, exllama is still largely my preferred inference engine because it is so memory efficient (shaving gigs off the competition) - this means you can run a 33B model w/ 2K context easily on a single 24GB card. It also scales almost perfectly for inferencing on 2 GPUs. It's been tested to run a llama2-70b w/ 16K context (NTK RoPE scaling) sneaking in at 47GB. Ridiculous.
* AutoGPTQ - while it's fallen a bit behind for inference, if you are using an older (e.g. Pascal) card, it's worth taking a look. It's also the easiest tool for making GPTQ quants.
* ggml - for non-llama models, this covers running quants of almost everything else out there; GPU acceleration only w/ cuBLAS/CLBlast, so not the fastest, but fast enough
* llama-cpp-python and LocalAI - while these are technically llama.cpp bindings, they're pretty useful/worth mentioning since they replicate the OpenAI API, making them easy drop-in replacements for a whole ecosystem of tools/apps (see the sketch after this list)
* A lot of hobbyists like oobabooga, kobold.cpp, silly tavern, etc but I haven't gotten around to poking into those as much. They seem like a lot of work, always behind their mainline dependencies, and while featureful and interesting, they also feel like they're one update away from breaking (eg, automatic1111 vibes).
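To make the drop-in point concrete: llama-cpp-python also ships an OpenAI-compatible server (installed via the `llama-cpp-python[server]` extra), and existing OpenAI client code only needs its base URL changed. A rough sketch (model path and port are placeholders):

    # Server side (separate terminal), llama-cpp-python's OpenAI-compatible server:
    #   python -m llama_cpp.server --model ./llama-2-13b-chat.ggmlv3.q4_K_M.bin
    #
    # Client side: plain OpenAI client code, just re-pointed at the local server.
    import openai

    openai.api_base = "http://localhost:8000/v1"  # local server instead of api.openai.com
    openai.api_key = "not-needed-locally"         # the local server doesn't check this

    resp = openai.ChatCompletion.create(
        model="local-llama-2",  # the model name is mostly cosmetic here
        messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
    )
    print(resp["choices"][0]["message"]["content"])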
Just a note that you have to have at least 12GB VRAM for it to be worth even trying to use your GPU for LLaMA 2.
The 7B model quantized to 4 bits can fit in 8GB VRAM with room for the context, but is pretty useless for getting good results in my experience. 13B is better but still not anything near as good as the 70B, which would require >35GB VRAM to use at 4 bit quantization.
My solution for playing with this was just to upgrade my PC's RAM to 64GB. It's slower than the GPU, but it was way cheaper and I can run the 70B model easily.
Just a quick note, it's worth pointing out that for most people (eg, wanting to chat to a model in realtime), I don't think running locally on a CPU is a very viable option unless you're very patient. On my 16c Ryzen 5950X/64GB DDR4-3800 system, llama-2-70b-chat (q4_K_M) running llama.cpp (eb542d3) and testing doing a 100 token test (life's too short to try max context), I got 1.25 tokens/second (~1 word/second) output.
Compiled with cuBLAS w/ `-ngl 0` (~400MB of VRAM usage, no layers loaded) makes no perf difference. The max layers I can load on a headless 24GB 4090 is 45/83 layers (running `-ngl 45 --low-vram`) which brings speeds up to 2.5 t/s. A little less painful, but still not super pleasant. For reference, people have reported performance of 12-15 t/s with 2x4090s w/ exllama (GPTQ). People are using a 14,20 split and able to load a full (NTK Rope scaled) 16K context into 48GB of VRAM.
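(If you want to poke at the same partial-offload knob from Python instead of the llama.cpp CLI, the llama-cpp-python bindings expose it as `n_gpu_layers`; a rough sketch, with a placeholder model path:)

    from llama_cpp import Llama

    # n_gpu_layers mirrors llama.cpp's -ngl flag: how many transformer layers
    # get offloaded to VRAM; the rest stay on the CPU.
    llm = Llama(
        model_path="./llama-2-70b-chat.ggmlv3.q4_K_M.bin",
        n_ctx=2048,
        n_gpu_layers=45,  # tune to whatever fits your card
    )
    out = llm("Q: Why is the sky blue? A:", max_tokens=100)
    print(out["choices"][0]["text"])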
The benefits are primarily price - 96GB of VRAM would be 4x3090/4090 (~$6K) or 2xA6000 (~$8-14K) cards (also, looks like you can buy an 80GB A100 PCIe for about $15K atm). While Apple is using LPDDR5, it is also running a lot more channels than comparable PC hardware. The M2 has 100GB/s, M2 Pro 200GB/s, M2 Max 400GB/s, and M2 Ultra is 800GB/s (8 channel) of memory bandwidth. The Nvidia cards are about 900GB/s-1TB/s (A100 PCIe gets up to 1.5TB/s).
In practice, on quantizes of the larger open LLMs, an M2 Ultra can currently run inference about 2-4X faster than the best PC CPUs I've seen (mega Epyc systems), but also about 2-4X slower than 2x4090s.
That is useful info, but still does not quite address the question.
The question was how memory type, memory amount and bandwidth factor into actual performance. So let me rephrase: Given a budget of $X, what performance/limitations should you expect with
- 256GB of non-unified DDR5 in a PC, just CPU
- 128GB of DDR5 for an APU
- 96GB of unified DDR5
- Whatever Nvidia will sell you for $X.
An answer of "just compare a single memory bandwidth number" seems a bit short. Sure, more bandwidth helps, but is half as much RAM at double bandwidth better or worse?
No idea, I just said I wanted to try this out and see how it performs.
Doesn’t VRAM amount limit the size of the model you can load? I’m not talking about training, just inference. I also pointed out these are not the greatest GPUs available, just that the advantage they have is being able to address more memory, since on those machines it's a shared block between system and GPU.
Trying to figure out what hardware to convince my boss to spend on... if we were to get one of the A6000/48gb cards, will that see significant performance improvements over just a 4090/24gb? The primary limitation is vram, is it not?
VRAM is what gets you up to the larger model sizes, and 24GB isn't enough to load the full 70B even at 4 bits, you need at least 35 and some extra for the context. So it depends a lot on what you want to do—fine tuning will take even more as I understand it.
The card's speed will affect your performance, but I don't know enough about different graphics cards to tell you specifics.
How would an APU, such as a 5700G with up to 128GB of system RAM, perform when allocating it as VRAM? Is this a cost-effective way of running this on a budget?
Well, 48gb is better than nothing at least. And it has the potential (if we get the build right) to drop a second A6000 card into it with the nvlink module (I think this does allow you to effectively have 96gb) later.
You might consider getting a Mac Studio (with as much RAM as you can afford up to 192GB) instead, since 192GB is more (unified) memory than you're going to easily get to with GPUs.
This. The main system memory on a Mac Studio is GPU memory and there's a lot of it.
It also has the Neural Engine, which is specifically designed for this type of work - most software isn't designed to take advantage of that yet, but presumably it will soon.
While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:
* I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - sure, this might be fixed in the future, but it's been an issue since Metal support was added, and is a significant problem if you are actually trying to use it for inferencing. With 192GB of memory, you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experiences. Note that at that point, the limited memory bandwidth will be a big factor.
* If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year long open bugs in PyTorch[1], and most major LLM libs like deepspeed, bitsandbytes, etc don't have Apple Silicon support[2][3].
You can see similar patterns w/ Stable Diffusion support [4][5] - support lagging by months, lots of problems and poor performance with inference, much less fine tuning. You can apply this to basically any ML application you want (srt, tts, video, etc)
Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)
Just a note to say thank you for this detailed reply! I did not know these things, and am getting a Mac Studio of similar spec for work soon (for reasons unrelated to AI) and it's helpful to know what to expect about its ML capabilities.
(Still, how much would you have to spend to get 192GB of GPU RAM available to you, fully purchased? The 192GB Mac Studio M2 Ultra is around $5800. Is that the difference between sort-of-GPU speeds and falling down to CPU speeds, if you want to run e.g. the best, largest open source models available?
I suppose even "falling down to CPU speeds" isn't really plausible -- I think you'd find it hard to put 192GB DDR5 (at least without falling to speeds below DDR4) in any fast, modern desktop because they all have two channels of DDR5.
If you want to simply run inference or do QLoRA fine-tunes of "the best, largest open source models", e.g. the llama2-70b models, you can do so with 2 x RTX 3090 24GB (~$600 each used), so about $1200 for the GPUs and 48GB of VRAM (set to PL 300W, so 600W while inferencing) - a q4 version of llama2-70b takes about 38-40GB of memory + kvcache.
If you want 192GB of VRAM, your cheapest realistic option is probably going to be 4 x A6000's (~$16,000) - you will need to have a chassis that will provide adequate power and cooling (1200W for the GPUs). I'd personally suggest that anyone looking to buy that kind of hardware have a fairly good idea of what they're going to use it for beforehand.
I'm not sure what exactly you're asking about with regards to memory, but for workstations, the Xeon W-3400's have 8 channels of DDR5-4800 (the W5-3425 has a $1200 list price) and the upcoming Threadripper Pro 7000s will likely have similar memory support (or you can get an EPYC 9124 for ~$1200 now if you want 12 channels of DDR5).
Would it be worthwhile just using "cloud GPUs" (like the providers who rent out GPUs, not the overpriced AWS stuff) until the next generation comes out, then using that?
Yes, it runs totally fine. I ran it in Oobabooga/text-generation-webui. The nice thing about it is that it autodownloads all necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70b demo and got the same answers. I even got better answers with ooba, since the demo cuts text early.
Not really? I just got those cards in separate PCI slots and the Exllama_hf handles spreading the load internally. No NVLink bridge in particular. I use the "20,22" memory split so that the display card has some room for the framebuffer to handle display
Do you mean you don't use NVLink or just use one that works? I am under the impression it is being phased out ("PCIe 5 is fast enough") and some kits don't use it.
Ollama works with Windows and Linux as well, but doesn't (yet) have GPU support for those platforms. You have to compile it yourself (it's a simple `go build .`), but it should work fine (albeit slowly). The benefit is you can still pull the llama2 model really easily (with `ollama pull llama2`) and even use it with other runners.
DISCLAIMER: I'm one of the developers behind Ollama.
> DISCLAIMER: I'm one of the developers behind Ollama.
I got a feature suggestion - would it be possible to have the ollama CLI automatically start up the GUI/daemon if it's not running? There's only so much stuff one can keep in a Macbook Air's auto start.
Good suggestion! This is definitely on the radar, so that running `ollama` will start the server when it's needed (instead of erroring!): https://github.com/jmorganca/ollama/issues/47
I think you'd need to offload the model into CoreML to do so, right? My understanding is that none of the current popular inference frameworks do this (not yet, at least).
Llama.cpp has been fun to experiment around with. I was surprised with how easy it was to set up, much easier than when I tried to set up a local LLM almost a year ago.
Depends on which model. I haven't bothered doing it on my 8GB because the only model that would fit is the 7B model quantized to 4 bits, and that model at that size is pretty bad for most things. I think you could have fun with 13B with 12GB VRAM. The full size model would require >35GB even quantized.
The easiest way I found was to use GPT4All. Just download and install, grab GGML version of Llama 2, copy to the models directory in the installation folder. Fire up GPT4All and run.
For most people who just want to play around and are using MacOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
On Windows it's an unsigned binary, backed by a website that only lists a Twitter/Discord/GitHub as far as explaining their organization goes, and the GitHub doesn't include source for the client itself, only models.
If by lower-end MacBook Air you mean one with 8GB of memory, try the smaller models (such as Orca Mini 3B). You can do this via LM Studio, Oobabooga/text-generation-webui, KoboldCPP, GPT4All, ctransformers, and more.
I'm biased since I work on Ollama, and if you want to try it out:
32GB should be fine. I went a little overboard and got a new MBP with M2 MAX and 96GB, but the hardware is really best suited at this point to a 30B model. I can and do play around with 65B models, but at that point you're making a fairly big tradeoff in generation speed for an incremental increase in quality.
As a datapoint, I have a 30B model [0] loaded right now and it's using 23.44GB of RAM. Getting around 9 tokens/sec, which is very usable. I also have the 65B version of the same model [1] and it's good for around 3.6 tokens/second, but it uses 44GB of RAM. Not unusably slow, but more often than not I opt for the 30B because it's good enough and a lot faster.
Local memory management will definitely get better in the future.
For now:
You should have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.
My personal recommendation is to get as much memory as you can if you want to work with local models [including VRAM if you are planning to execute on the GPU]
By lower-end I meant that the Airs are quite low-end in general (compared to Pro/Studio). I have the maxed-out 24gb, but 16gb may be more common among people who might use an Air for this kind of thing.
I don't remember if grammar support has been merged into llama.cpp yet, but it would be the first step to having Llama + Stable Diffusion run locally to output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images back. At least it could use them, though, to build e.g. a webpage.
Basically, bash executes the script line by line as it is downloading - pausing the download while that line executes. By putting a sleep early in the script, the server can detect the delay before the next line is downloaded.
It's a lot more complicated in practice due to TCP buffers and trying to hide output from the user.
Original article below. It is giving me a certificate error, but it's available through archives or a cache.
In case you might be trying this with Ollama (given the model:tag syntax) – I haven't added this to the readme yet (seeing some issues with the prompt template) but check out llama2-uncensored:
% ollama run llama2-uncensored
>>> Generate a json object describing the most stunning part of the movie: "The Matrix"
{
"title": "The Matrix",
"director": "The Wachowskis",
"release_year": 1999,
"description": "Set in a dystopian future, The Matrix follows the story of Neo (Keanu Reeves), a computer hacker who is drawn into an underground war against machines that have enslaved humanity. Along with his allies Morpheus (Laurence Fishburne) and Trinity (Carrie-Anne Moss), Neo learns to harness the power of the Matrix, a virtual reality system designed to keep humans obedient and ignorant. With the help of a rogue program named Agent Smith (Hugo Weaving), Neo discovers the true nature of the Matrix and sets out on a journey to free humankind from their digital prison.",
"stunning scenes": [
"bulletproof kung-fu fighting",
"bullet time",
"dream-like imagery",
"special effects",
]
}
You can migrate that data to a vector database (e.g. Pinecone or pgvector) and then query it. I didn’t write it, but this guide has a good overview of the concepts and some code. In your case you'd just replace the web crawler with database queries. All the libraries used also exist in Python.
You can, but as a crazy idea you can also ask ChatGPT to write SELECT queries using the functions parameter they added recently - you can also ask it to write JSONPath.
As long as it understands the schema and the general idea of the data, it does a fairly good job. Just be careful not to do too much with one prompt; you can easily cause hallucinations.
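For the curious, the functions-parameter trick looks roughly like this with the OpenAI Python client (the schema and prompt here are made up for illustration):

    import openai

    # Describe the query we want back as a "function" so the model returns
    # structured arguments instead of free-form prose.
    functions = [{
        "name": "run_sql",
        "description": "Run a read-only SQL query against the reporting database",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "A single SELECT statement"},
            },
            "required": ["query"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": "The table orders(id, customer, total, created_at) exists."},
            {"role": "user", "content": "Total revenue per customer for July, highest first."},
        ],
        functions=functions,
        function_call={"name": "run_sql"},  # force the structured response
    )
    print(resp["choices"][0]["message"]["function_call"]["arguments"])  # JSON containing the SELECT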
I might be missing something. The article asks me to run a bash script on windows.
I assume this would still need to be run manually to access GPU resources etc, so can someone illuminate what is actually expected for a windows user to make this run?
I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.
Thanks for pointing this out — we should've pointed out the script needs to be run on WSL. I added a note to the post to clarify (I work at Replicate).
I know (I should have included that in my earlier response, but editing would've felt weird), but I still assume one should run the result natively, so I'm asking if/where there's some jumping around required.
Last time I tried running an LLM I tried WSL and native, both on 2 machines, and just got lovecraftian-tier errors, so I'm waiting to see if I'm missing something obvious before going down that route again.
The bash script is downloading llama.cpp, a project which allows you to run LLaMA-based language models on your CPU. The bash script then downloads the 13 billion parameter GGML version of LLaMA 2. The GGML version is what will work with llama.cpp and uses CPU for inferencing. There are ways to run models using your GPU, but it depends on your setup whether it will be worth it.
I would highly recommend looking into the text-generation-webui project (https://github.com/oobabooga/text-generation-webui). It has a one-click installer and very comprehensive guides for getting models running locally and where to find models. The project also has an "api" command flag to let you use it like you might use a web-based service currently.
Maybe obvious to others, but the 1 line install command with curl is taking a long time. Must be the build step. Probably 40+ minutes now on an M2 max.
I manually cloned from GitHub, built it myself, and downloaded the model separately. I also noticed that many of the CLI flags the author chose are questionable after I read the docs and help text.
How do you decide what model variant to use? There's a bunch of Quant method variations of Llama-2-13B-chat-GGML [0], how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.
This is a great question. My best answer is that there's a speed/intelligence trade-off. The smaller weights (7B) will run faster and require less memory, but you won't get the same quality responses as the 13B / 70B model. I think there may be a Llama 30B variant coming soon too.
TLDR: Lower quantization means higher perplexity (i.e. how 'confused' the model is when predicting text). It's a matter of testing it out and choosing a model that fits your available memory. The higher the quantization number, the better (generally).
If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.
Idiot question: if I have access to sentence-by-sentence professionally-translated text of foreign-language-to-English in gigantic quantities, and I fed the originals as prompts and the translations as completions...
... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
Indeed, it sounds like you have what's called fine-tuning data (given an input, here's the output). There's loads of info about fine-tuning both here on HN and on Hugging Face's YouTube channel.
Note: if you have sufficient data, look into existing models on Hugging Face; you may find a smaller, faster and more open (licensing-wise) model that you can fine-tune to get the results you want. Llama is hot, but not a catch-all for all tasks (as no model should be).
It's more about quality vs sufficiency - you can have a relatively small but accurate and wide ranging dataset, this is better than an inaccurate huge dataset
Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
There is no fully built solution, only bits and pieces. I noticed that llama outputs tend to degrade with the amount of text: the text becomes too repetitive and focused, and you have to raise the temperature to break the model out of loops.
Does what you're saying mean you can only ask questions and get answers in a single step, and that having a long discussion where refinement of output is arrived at through conversation isn't possible?
My understanding is that at a high level you can look at this model as a black box which accepts a string and outputs a string.
If you want it to “remember” things you do that by appending all the previous conversations together and supply it in the input string.
In an ideal world this would work perfectly. It would read through the whole conversation and provide the right output you expect, exactly as if it would “remember” the conversation. In reality there are all kinds of issues which can crop up as the input grows longer and longer. One is that it takes more and more processing power and time for it to “read through” everything previously said. And there are things like what jmiskovic said, that the output quality can also degrade in perhaps unexpected ways.
But that also doesn’t mean that “refinement of output arrived at through conversation isn't possible”. It is not that black and white, just that you can run into trouble as the length of the discussion grows.
I don’t have direct experience with long conversations, so I can’t tell you how long is definitely too long and how long is still safe. There are probably tricks one can do to work around these limits, especially if one unpacks that “black box” understanding of the process. But even without that, you could imagine a “consolidation” process where the AI is instructed to write short notes about a given length of conversation, and those shorter notes are then copied into the next input instead of the full previous conversation. All of this is possible, but you won’t have a turn-key solution for it just yet.
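To make the "append everything" idea concrete, here's a minimal sketch - the generate() call is a stub standing in for whatever runner you use (llama.cpp bindings, Ollama, etc.):

    # Minimal sketch of "memory" by re-sending the whole transcript each turn.
    # Swap generate() for a real model call; here it's a stub so the control
    # flow is runnable on its own.

    def generate(prompt: str) -> str:
        return f"(model reply to a {len(prompt)}-character prompt)"  # placeholder

    history: list[tuple[str, str]] = []  # (user, assistant) turns

    def chat(user_msg: str) -> str:
        # Rebuild the full prompt from every previous turn plus the new message.
        transcript = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history)
        prompt = f"{transcript}User: {user_msg}\nAssistant:"
        reply = generate(prompt)
        history.append((user_msg, reply))
        return reply

    print(chat("Write a function that lists a folder recursively."))
    print(chat("Now make it skip hidden files."))  # the first exchange rides along in the prompt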
The limit here is the "context window" length of the model, measured in tokens, which will quickly become too short to contain all of your previous conversations, which will mean it has to answer questions without access to all of that text. And within a single conversation, it will mean that it starts forgetting the text from the start of the conversation, once the [conversation + new prompt] reaches the context length.
The kind of hacks that work around this are to train the model on the past conversations, and then rely on similarity in tensor space to pull the right (lossy) data back out of the model (or a separate database) later, based on its similarity to your question, and include it (or a summary of it, since summaries are smaller) within the context window for your new conversation, combined with your prompt. This is what people are talking about when they use the term "embeddings".
My benchmark is having a peer programming session spanning days and dozens of queries with ChatGPT, where we co-created a custom static site generator that works really well for my requirements. It was able to hold context for a while and not "forget" what code it provided me dozens of messages earlier; it was able to "remember" corrections and refactors that I gave it, and overall was incredibly useful for working out things like recursion for folder hierarchies and building data trees. This and similar use-cases, where memory is important, are where the model works as a genuine assistant.
Excellent! That sounds like a very useful personal benchmark then. You could test Llama v2 by copying in different lengths of snippets from that conversation and checking how useful you find its outputs.
llama is just an input/output engine. It takes a big string as input, and gives a big string of output.
Save your outputs if you want, you can copy/paste them into any editor. Or make a shell script that mirrors outputs to a file and use that as your main interface. It's up to the user.
This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
I need some hand-holding .. I have a directory of over 80,000 PDF files. How do I train Llama2 on this directory and start asking questions about the material - is this even feasible?
1. Logically split and store the PDFs' content in a vector database.
2. Embed the query (your questions) and search the vector database for the closest results.
3. Use the results in the Llama prompt to format the answer.
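A stripped-down sketch of those three steps (pypdf and sentence-transformers are just illustrative choices; with 80,000 PDFs you'd want a real vector database and a smarter chunking strategy, and the final call to the local Llama is left as a stub):

    # Sketch of steps 1-3 above: chunk PDFs, embed them, retrieve by similarity,
    # then build a prompt for the local model.
    from pathlib import Path

    import numpy as np
    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Split PDF content into chunks and embed them (a vector DB would store these)
    chunks = []
    for pdf in Path("pdfs").glob("*.pdf"):
        text = " ".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
        chunks += [text[i:i + 1000] for i in range(0, len(text), 1000)]
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

    # 2. Embed the question and grab the closest chunks
    question = "What does the 2019 audit say about inventory write-downs?"
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(chunk_vecs @ q_vec)[-3:][::-1]

    # 3. Stuff the retrieved chunks into the prompt for the local Llama
    context = "\n---\n".join(chunks[i] for i in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # ...feed `prompt` to llama.cpp / Ollama / etc.
    print(prompt[:500])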
I assume you are talking about a Windows/Linux PC. I have done that and got a Threadripper with 32 cores and 256GB RAM. It runs any llama model on the CPU, although the 65/70b ones are quite slow. I also added an A6000 (48GB VRAM), which allows you to run the 65/70b quantized with very good performance.
If you are going with the GPU and don't care about loads of RAM, then a 16-core Zen CPU will do just fine (or Intel for that matter).
If you are only interested in llama, then an M1 Studio with 64GB RAM is probably cheaper and will work just as well.
Before steps:
1. (For Nvidia GPU users) Install cuda toolkit https://developer.nvidia.com/cuda-downloads
2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...
In Windows Terminal with Powershell:
`-DLLAMA_CUBLAS` enables CUDA (cuBLAS). `2> $null` redirects the debug messages printed to stderr to a null file so they don't spam your terminal.
Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:
adjust your paths as necessary. It has a tendency to talk to itself.