Guide to running Llama 2 locally (replicate.com)
683 points by bfirsh on July 25, 2023 | 170 comments



For my fellow Windows shills, here's how you actually build it on Windows:

Before steps:

1. (For Nvidia GPU users) Install cuda toolkit https://developer.nvidia.com/cuda-downloads

2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...

In Windows Terminal with Powershell:

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    mkdir build
    cd build
    cmake .. -DLLAMA_CUBLAS=ON
    cmake --build . --config Release
    cd bin/Release
    mkdir models
    mv Folder\Where\You\Downloaded\The\Model .\models
    .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
`-DLLAMA_CUBLAS=ON` enables the CUDA (cuBLAS) build

`2> $null` is to direct the debug messages printed to stderr to a null file so they don't spam your terminal

Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:

    function llama {
        .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p $args 2> $null
    }
adjust your paths as necessary. It has a tendency to talk to itself.


> adjust your paths as necessary. It has a tendency to talk to itself.

This is always a fun surprise. What I've seen help, especially with chat models, is to use a prompt template. Some tools (e.g. https://ollama.ai/ – mentioned in the article) use a default, model-specific prompt template when you run the model. This is easier for users since they can just input their chat messages and get answers. The hard part is that every model is trained (and behaves) differently.

With llama.cpp you'd need to wrap your prompt text with the right template. For Llama 2, the Facebook developers' generation code wraps the system prompt and user prompts in specific tags (<<SYS>>{system prompt}<</SYS>> and [INST]{user prompt}[/INST], respectively): https://github.com/facebookresearch/llama/blob/main/llama/ge....
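
If it helps, here's a rough Python sketch (not the official code – double-check the spacing/BOS details against that generation.py link) of wrapping a single-turn prompt before passing it to llama.cpp's `-p`:

    # Sketch of the llama-2-chat tag structure for a single turn.
    # llama.cpp adds the BOS token itself, so only the tags go in the prompt text.
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    def llama2_chat_prompt(user_msg, system_msg="You are a helpful assistant."):
        return f"{B_INST} {B_SYS}{system_msg}{E_SYS}{user_msg.strip()} {E_INST}"

    print(llama2_chat_prompt("Hello, how are you, llama?"))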

Worth noting that customizing prompt templates can be fun – you don't have to use the "prescribed" one. I've had a model generate a conversation between a few characters that talk to each other, for example – it's pretty entertaining!


As a followup on LLama2's prompting, here's a good thread with some more details: https://www.reddit.com/r/LocalLLaMA/comments/155po2p/get_lla... (see also: https://www.reddit.com/r/LocalLLaMA/comments/1561vn5/here_is... - it's a bit complex and people are still wrapping their heads around it)

Note, it is possible to reinject <<SYS>> prompts during the conversation to keep the LLM on target: https://twitter.com/overlordayn/status/1681631554672513025 (but obviously this should be the first thing you filter and track if you are running an LLM that is accessible to end-users - rebuff is a lib w/ some good ideas to start with)

With new fine tunes coming out, it's worth noting again that different datasets/models all use slightly different prompt formats (many are using multiple datasets these days, so it's hard to say just how much it matters now): https://www.reddit.com/r/LocalLLaMA/comments/13lwwux/comment...

Ideally, fine tunes for Llama 2 would regularize all datasets to the official tokens/format, and inference interfaces could standardize too (or there could be some metadata collected for which model uses what - but that's some of the low-hanging fruit left for the future, I guess). One other thing to keep in mind is that the benchmarks/leaderboards don't use instruct formatting at all, and don't really represent a model's capabilities. IMO, Elo-style rankings against other models on specific tasks would probably be more representative.


I have a question: Last week I downloaded llama-7b-chat from meta's github directly (https://github.com/facebookresearch/llama) using the URL they sent via e-mail. As a result, I now have the model as consolidated.00.pth.

Your commands assume the model is a .bin file (so I guess there must be a way to convert the pytorch model .pth to the .bin file). How can I do this and what is the difference between the two models?

The facebook repo provides commands for using the models, these commands don't work on my windows machine: "NOTE: Redirects are currently not supported in Windows or MacOs. [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to ...."

The facebook repo does not describe which OS you are supposed to use, so I assumed it would work on Windows too. But then if this can work why would anyone need the ggerganov llama code? I am new to all of this and easily confused, so any help is appreciated


To be perfectly honest, I know absolutely nothing about AI or Llama; I'm just a Windows C++ programmer so I wanted to provide cmake instructions for Windows, sorry. The .bin file is what I got from the OP's link


It's ok, I just followed your instructions and with that model it works well. But are you sure that this uses CUDA? My CPU utilization is at 50% and my GPU utilization is at 1% while the output is being generated.


The cmake build prints that it finds CUDA when I run the CMakeLists (it prints the location of the CUDA headers), however I don't see any noticeable difference between CPU-only and CUDA builds. So if it's not working, then maybe there's a CLI option that's required, or maybe CUDA support is broken on Windows.


llama.cpp needs the files to be in ggml format, there is a command string you can run to convert one from the other (as well as perform quantization). Or just download the GGML version

https://www.reddit.com/r/LocalLLaMA/wiki/models#wiki_llama_2...


try *cd llama.cpp && python convert-pth-to-ggml.py models/7B/ 1*


I also wrote a small Python script that you can use if you do not want to use PowerShell to talk to your Llama model: https://downioads.github.io/posts/llama-python/


> adjust your paths as necessary.

What does this mean, exactly, and how does it prevent llama from carrying on a (rather pleasant, actually) conversation with itself?


I just mean where I have a relative path, set your own paths to what they would be on your machine. The subsequent comment is unrelated


Thanks for sharing these instructions, I really appreciate it.


How much storage and memory does this use?


Some of you may have seen this, but I have a Llama 2 fine-tuning live coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.

In the end with quantization and parameter efficient fine-tuning it only took up 13gb on a single GPU.

Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
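
For anyone who wants the gist without watching: the 4-bit + LoRA setup looks roughly like the sketch below (not the stream's exact code; assumes the Hugging Face transformers/peft/bitsandbytes stack and approved access to the gated meta-llama weights).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-2-7b-hf"  # gated repo: accept Meta's license on HF first

    # Load the base model quantized to 4 bits so it fits on a single GPU.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Attach small trainable LoRA adapters instead of updating all of the base weights.
    lora_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none", task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # usually well under 1% of parameters are trainable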


Can you do more videos on preparing the data and scraping data? I noticed you seem to be proficient at it by your terminology and it’s something I want to dive into more.


I have a solid stream about scraping Twitter that I did yesterday; you should check it out: https://www.youtube.com/watch?v=-Hfx9tCeShA&t=5516s


Sick I've put it in my queue


which version of Llama 2?


The 7b parameter one so not one of the larger ones


What's special about Llama2?


This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android)

Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I am sure there are other things that could be covered.


As an article aimed towards newbies I think it's pretty good, but I'd agree "comprehensive" is a bit of a reach. I think a better title would have been something like "Get LLama2 running locally on every platform easily."

Anyone w/ more than a single consumer GPU probably has a good grip on their options (a vLLM vs HF shootout would be neat, for example), but I'd add a few more projects for those taking the next step for local inferencing:

* exllama - while llama.cpp has matched its token generation performance, exllama is still largely my preferred inference engine because it is so memory efficient (shaving gigs off the competition) - this means you can run a 33B model w/ 2K context easily on a single 24GB card. It also scales almost perfectly for inferencing on 2 GPUs. It's been tested to run a llama2-70b w/ 16K context (NTK RoPE scaling) sneaking in at 47GB. Ridiculous.

* AutoGPTQ - while it's fallen a bit behind for inference, if you are using an older (eg Pascal) card, it's worth taking a look. It's also the easiest tool for making GPTQ quants.

* ggml - for non-llama models, this covers running quants of almost everything else out there; GPU acceleration only w/ cuBLAS/CLBlast, so not the fastest, but fast enough

* python-llama-cpp and LocalAI - while these are technically llama.cpp bindings, they're pretty useful/worth mentioning since they replicate the OpenAI API, making them an easy drop-in replacement for a whole ecosystem of tools/apps (see the sketch after this list)

* A lot of hobbyists like oobabooga, kobold.cpp, silly tavern, etc but I haven't gotten around to poking into those as much. They seem like a lot of work, always behind their mainline dependencies, and while featureful and interesting, they also feel like they're one update away from breaking (eg, automatic1111 vibes).
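
To illustrate the OpenAI-compatible point from the python-llama-cpp/LocalAI bullet, here's a hedged sketch (server flags, port, and defaults vary by version – check its --help):

    # Start the server separately, e.g.:
    #   python -m llama_cpp.server --model ./models/llama-2-13b-chat.ggmlv3.q4_0.bin
    # then point the (0.x-era) openai client at it instead of api.openai.com.
    import openai

    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "not-needed-locally"

    resp = openai.ChatCompletion.create(
        model="llama-2-13b-chat",  # local servers generally ignore or echo this name
        messages=[{"role": "user", "content": "Hello, how are you, llama?"}],
    )
    print(resp["choices"][0]["message"]["content"])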


Just a note that you have to have at least 12GB VRAM for it to be worth even trying to use your GPU for LLaMA 2.

The 7B model quantized to 4 bits can fit in 8GB VRAM with room for the context, but is pretty useless for getting good results in my experience. 13B is better but still not anything near as good as the 70B, which would require >35GB VRAM to use at 4 bit quantization.

My solution for playing with this was just to upgrade my PC's RAM to 64GB. It's slower than the GPU, but it was way cheaper and I can run the 70B model easily.
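
For anyone sizing hardware, here's a back-of-the-envelope sketch of where those numbers come from. It counts weights only – KV cache and runtime overhead come on top – and ~4.5 bits/weight is my assumption for the common q4 quants:

    def approx_weight_gb(params_billion, bits_per_weight=4.5):
        # weight-only estimate; context/KV cache and overhead are extra
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for size in (7, 13, 70):
        print(f"{size}B @ ~4.5 bpw: {approx_weight_gb(size):.1f} GB")
    # 7B ~ 3.9 GB, 13B ~ 7.3 GB, 70B ~ 39.4 GB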


Just a quick note, it's worth pointing out that for most people (eg, wanting to chat to a model in realtime), I don't think running locally on a CPU is a very viable option unless you're very patient. On my 16c Ryzen 5950X/64GB DDR4-3800 system, llama-2-70b-chat (q4_K_M) running llama.cpp (eb542d3) and testing doing a 100 token test (life's too short to try max context), I got 1.25 tokens/second (~1 word/second) output.

Compiled with cuBLAS w/ `-ngl 0` (~400MB of VRAM usage, no layers loaded) makes no perf difference. The max layers I can load on a headless 24GB 4090 is 45/83 layers (running `-ngl 45 --low-vram`) which brings speeds up to 2.5 t/s. A little less painful, but still not super pleasant. For reference, people have reported performance of 12-15 t/s with 2x4090s w/ exllama (GPTQ). People are using a 14,20 split and able to load a full (NTK Rope scaled) 16K context into 48GB of VRAM.


Apple Silicon Macs might not have great GPUs, but they do have unified memory. I need to try this on mine - I have 96GB of RAM on my M2 Max.


What does "unified" actually mean and how much would that help? It is still off the shelf LPDDR5‑6400, just with a better interconnnect (like a ps5).

How does this compare to non-unified DDR5, or HBM2e as on Nvidia A100 cards?


The benefits are primarily price - 96GB of VRAM would be 4x3090/4090 (~$6K) or 2xA6000 (~$8-14K) cards (also, looks like you can buy an 80GB A100 PCIe for about $15K atm). While Apple is using LPDDR5, it is also running a lot more channels than comparable PC hardware. The M2 has 100GB/s, M2 Pro 200GB/s, M2 Max 400GB/s, and M2 Ultra is 800GB/s (8 channel) of memory bandwidth. The Nvidia cards are about 900GB/s-1TB/s (A100 PCIe gets up to 1.5TB/s).

In practice, on quantizes of the larger open LLMs, an M2 Ultra can currently inference about 2-4X faster than the best PC CPUs I've seen (mega Epyc systems), but also about 2-4X slower than 2x4090s.


That is useful info, but still does not quite address the question.

The question was how memory type, memory amount and bandwidth factor into actual performance. So let me rephrase: Given a budget of $X, what performance/limitations should you expect with

- 256GB of non-unified DDR5 in a PC, just CPU

- 128GB of DDR5 for an APU

- 96GB of unified DDR5

- Whatever Nvidia will sell you for $X.

An answer of "just compare a single memory bandwidth number" seems a bit short. Sure, more bandwidth helps, but is half as much RAM at double bandwidth better or worse?


No idea, I just said I wanted to try this out and see how it performs.

Doesn't VRAM amount limit the size of the model you can load? I'm not talking about training, just inference. I also pointed out these are not the greatest GPUs available, just that the advantage they have is being able to address more memory, since on those machines it's a shared block between the system and GPU.


It's a term used to justify non-replaceable parts :-).


Trying to figure out what hardware to convince my boss to spend on... if we were to get one of the A6000/48gb cards, will that see significant performance improvements over just a 4090/24gb? The primary limitation is vram, is it not?


VRAM is what gets you up to the larger model sizes, and 24GB isn't enough to load the full 70B even at 4 bits, you need at least 35 and some extra for the context. So it depends a lot on what you want to do—fine tuning will take even more as I understand it.

The card's speed will affect your performance, but I don't know enough about different graphics cards to tell you specifics.


How would an APU, such as a 5700G with up to 128GB of system RAM, perform when allocating it as VRAM? Is this a cost-effective way of running this on a budget?


Well, 48gb is better than nothing at least. And it has the potential (if we get the build right) to drop a second A6000 card into it with the nvlink module (I think this does allow you to effectively have 96gb) later.


You might consider getting a Mac Studio (with as much RAM as you can afford up to 192GB) instead, since 192GB is more (unified) memory than you're going to easily get to with GPUs.


This. The main system memory on a Mac Studio is GPU memory and there's a lot of it.

It also has the Neural Engine, which is specifically designed for this type of work - most software isn't designed to take advantage of that yet, but presumably it will soon.


While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:

* I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - while sure this might be fixed in the future, it's been an issue since Metal support was added, and is a significant problem if you are actually trying to use it for inferencing. With 192GB of memory, you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experiences. Note that at that point, the limited memory bandwidth will be a big factor.

* If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year long open bugs in PyTorch[1], and most major LLM libs like deepspeed, bitsandbytes, etc don't have Apple Silicon support[2][3].

You can see similar patterns w/ Stable Diffusion support [4][5] - support lagging by months, lots of problems and poor performance with inference, much less fine tuning. You can apply this to basically any ML application you want (srt, tts, video, etc)

Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)

[1] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3A...

[2] https://github.com/microsoft/DeepSpeed/issues/1580

[3] https://github.com/TimDettmers/bitsandbytes/issues/485

[4] https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

[5] https://forums.macrumors.com/threads/ai-generated-art-stable...


Just a note to say thank you for this detailed reply! I did not know these things, and am getting a Mac Studio of similar spec for work soon (for reasons unrelated to AI) and it's helpful to know what to expect about its ML capabilities.

(Still, how much would you have to spend to get 192GB of GPU RAM available to you, fully purchased? The 192GB Mac Studio M2 Ultra is around $5800. Is that the difference between sort-of-GPU speeds and falling down to CPU speeds, if you want to run e.g. the best, largest open source models available?

I suppose even "falling down to CPU speeds" isn't really plausible -- I think you'd find it hard to put 192GB DDR5 (at least without falling to speeds below DDR4) in any fast, modern desktop because they all have two channels of DDR5.)


Those are very different questions...

If you want to simply run inference or do QLoRA fine tunes of "the best, largest open source models" eg the llama2-70b models, you can do so with 2 x RTX 3090 24GB (~$600 used), so for about $1200 for the GPUs, 48GB of VRAM (set to PL 300W, so 600W while inferencing) - q4 version of llama2-70b take about 38-40GB of memory + kvcache.

If you want 192GB of VRAM, your cheapest realistic option is probably going to be 4 x A6000's (~$16,000) - you will need to have a chassis that will provide adequate power and cooling (1200W for the GPUs). I'd personally suggest that anyone looking to buy that kind of hardware have a fairly good idea of what they're going to use it for beforehand.

I'm not sure what exactly you're asking about with regards to memory, but for workstations, the Xeon W-3400's have 8 channels of DDR5-4800 (the W5-3425 has a $1200 list price) and the upcoming Threadripper Pro 7000s will likely have similar memory support (or you can get an EPYC 9124 for ~$1200 now if you want 12 channels of DDR5).


Would it be worthwhile just using "cloud GPUs" (like the providers who rent out GPUs, not the overpriced AWS stuff) until the next generation comes out, then using that?


What is necessary to run 70B on CPU without quantization?


Bump. interested as well.


Running just some of the layers on the GPU can still make things much faster, though.


I have 2x 3090 - do you know if it's feasible to use that 48GB total for running this?


Yes, it runs totally fine. I ran it in Oobabooga/text generation web UI. The nice thing about it is that it autodownloads all the necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70B demo and got the same answers. I even got better answers with ooba, since the demo cuts text early.

Oobabooga: https://github.com/oobabooga/text-generation-webui

Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

ExLlama_HF loader gpu split 20,22, context size 2048

on the Chat Settings tab, choose Instruction template tab and pick Llama-v2 from the instruction template dropdown

Demo: https://huggingface.co/blog/llama2#demo


Is there any specific settings to make 2x3090 work together?


Not really? I just got those cards in separate PCI slots and the Exllama_hf handles spreading the load internally. No NVLink bridge in particular. I use the "20,22" memory split so that the display card has some room for the framebuffer to handle display


Do you mean you don't use NVLink or just use one that works? I am under the impression it is being phased out ("PCIe 5 is fast enough") and some kits don't use it.


I don't use NVLink


Interested in this too


I'm very curious what your other components are and how you managed to fit 2 3090s in one PC.


Does that mean you have to run it on the CPU? Or can you use the GPU with system RAM?


Ollama works with Windows and Linux as well, but doesn't (yet) have GPU support for those platforms. You have to compile it yourself (it's a simple `go build .`), but it should work fine (albeit slowly). The benefit is you can still pull the llama2 model really easily (with `ollama pull llama2`) and even use it with other runners.

DISCLAIMER: I'm one of the developers behind Ollama.


> DISCLAIMER: I'm one of the developers behind Ollama.

I got a feature suggestion - would it be possible to have the ollama CLI automatically start up the GUI/daemon if it's not running? There's only so much stuff one can keep in a Macbook Air's auto start.


Good suggestion! This is definitely on the radar, so that running `ollama` will start the server when it's needed (instead of erroring!): https://github.com/jmorganca/ollama/issues/47


I've been wondering, is the M2's neural engine usable for this?


I think you'd need to offload the model into CoreML to do so, right? My understanding is that none of the current popular inference frameworks do this (not yet, at least).



Matt Williams' talk today at that conference was a great intro to Ollama.


Llama.cpp has been fun to experiment around with. I was surprised with how easy it was to set up, much easier than when I tried to set up a local LLM almost a year ago.


> If you have a linux machine with GPUs

How much VRAM one needs to run inference with llama 2 on a GPU approximately?


16GB is the minimum to run the 7B model with float16 weights out of the box, with no further effort.


Depends on which model. I haven't bothered doing it on my 8GB because the only model that would fit is the 7B model quantized to 4 bits, and that model at that size is pretty bad for most things. I think you could have fun with 13B with 12GB VRAM. The full size model would require >35GB even quantized.


70B at f16 needs about 120-160GB (4 A100 40GB or 2 A100 80GB).

Quantization still seems to have some issues for the 70B model.


Self-plug. Here’s a fork of the original llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:

https://github.com/krychu/llama

It runs with the original weights, and gets you to ~4 tokens/sec on MacBook Pro M1 with the 7B model.
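
For the curious, the CPU/MPS selection in forks like this generally boils down to the standard PyTorch pattern (a sketch, not the repo's actual code):

    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")   # Apple Silicon GPU via Metal
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    print(f"running on {device}")
    # model.to(device), and inputs need .to(device) before the forward pass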


The easiest way I found was to use GPT4All. Just download and install, grab GGML version of Llama 2, copy to the models directory in the installation folder. Fire up GPT4All and run.


For most people who just want to play around and are using MacOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.


> I'd just recommend lmstudio.ai.

on Windows it's an unsigned binary, backed by a website that only indicates a twitter/discord/github as far as explaining their organization, and the github doesn't include source for the client itself, only models.

this must throw up some red flags for others, no?


Does this not work with Stable Diffusion models? Not super familiar with all of this yet but I can't find any from HuggingFace that are compatible.


This seems interesting. Does anyone know of an iOS app compatible with OpenAI API that you could use to talk to LM Studio over local network?


Does it make any sense to try this on a lower-end Mac (like a M2 Air)?


Yeah! How much memory do you have?

If by lower-end MacBook Air you mean with 8GB of memory, try the smaller models (such as Orca Mini 3B). You can do this via LM Studio, Oobabooga/text-generation-webui, KoboldCPP, GPT4All, ctransformers, and more.

I'm biased since I work on Ollama, and if you want to try it out:

1. Download https://ollama.ai/download

2. `ollama run orca`

3. Enter your prompt

Note Ollama is open source, and you can compile it too from https://github.com/jmorganca/ollama


I’m deliberating on how much RAM to get on my new MBP. Is 32gb going to stand me in good stead?


32GB should be fine. I went a little overboard and got a new MBP with M2 MAX and 96GB, but the hardware is really best suited at this point to a 30B model. I can and do play around with 65B models, but at that point you're making a fairly big tradeoff in generation speed for an incremental increase in quality.

As a datapoint, I have a 30B model [0] loaded right now and it's using 23.44GB of RAM. Getting around 9 tokens/sec, which is very usable. I also have the 65B version of the same model [1] and it's good for around 3.6 tokens/second, but it uses 44GB of RAM. Not unusably slow, but more often than not I opt for the 30B because it's good enough and a lot faster.

Haven't tried the llama2 70B yet.

[0] https://huggingface.co/TheBloke/upstage-llama-30b-instruct-2... [1] https://huggingface.co/TheBloke/Upstage-Llama1-65B-Instruct-...


What's your use case for local if you don't mind?


Thank you, that's really helpful! The CTO lead times on Macs are huge here, so it's either the Pro with 16 or the Max with 32. Ideally I'd go Pro with 64.


Local memory management will definitely get better in the future.

For now:

You should have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.

My personal recommendation is to get as much memory as you can if you want to work with local models [including VRAM if you are planning to be executing on GPU]


I run the original llama 7B model just fine on 8GB of RAM. It is best to give advice from experience, not only what you read from others.


This is from us manually testing it on macbooks that we have available. It might run, but it's probably using swap.


Thanks - the issue I’m facing is the CTO lead times on Macs here!


By lower-end I meant that the Airs are quite low-end in general (compared to Pro/Studio). I have the maxed-out 24gb, but 16gb may be more common among people who might use an Air for this kind of thing.


Heads up for anyone else: clicking that link automatically starts the download (92MB)


What about an M1 with 16GB RAM?


Was taking a look into this. Is the source code open for lmstudio.ai?


The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui


Yep. Use ooba. And people who like to RP often use ooba as a backend and SillyTavern as a frontend.


Can it run onnx transformer models? I found optimised onnx models are at least twice the speed of vanilla pytorch on the CPU.


Don't remember if the grammar has been merged into llama.cpp yet, but it would be the first step to having Llama + Stable Diffusion locally output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images back. At least it could use them though, to build e.g. a webpage.


It has been merged! https://github.com/ggerganov/llama.cpp/pull/1773

I haven't had a chance to try it yet, but I am :excitedllama:


> curl -L "https://replicate.fyi/install-llama-cpp" | bash

Seriously? Pipe script from someone's website directly to bash?


That's the recommended way to get Rust nightly too: https://rustup.rs/ But don't look there, there is memory safety somewhere!


In rustup's defense, if you're already trusting them enough to run their executables, this isn't that much worse, afaik.


oh, this again.


Either you trust the TLS session to their website to deliver you software you're going to run, or you don't.


You can clone llama.cpp on GitHub and the models from HuggingFace. No need to trust this unrelated website.


But if you do trust it, very convenient.


Yes. If you are worried, you can redirect it to file and then sh it. It doesn’t get much easier to inspect than that…


Pretty common. You can inspect the script before piping it.


Bad actors can detect if it's being piped to bash and send different data. Better to just download the script first if you're concerned.


That's what I meant, but I had no idea about piping detection, so thanks for pointing that out - nifty.


How can you detect what someone pipes the output of curl to?


Basically, bash executes the script line by line as it is downloading - pausing the download while that line executes. By sending a sleep() command early in the script you can detect the delay in the next line being downloaded.

It's a lot more complicated due to TCP buffers and trying to hide output from the user.

Original article below. It is giving me a certificate error, but it's available through archives or a cache.

https://www.idontplaydarts.com/2016/04/detecting-curl-pipe-b...


Neat article.

Cached version → http://archive.today/O46rw


"This Connection is Invalid. SSL certificate expired."


Yeah I mentioned that. You have to go through a cache or an archive.


Amazing, thanks.


who doesn't love surprises


IMO this is equivalently scary to installing an arbitrary rpm.


Seems to be a better guide here (without the risky curl):

https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...


The LLM is impressive (llama2:13b) but appears to have been greatly limited in what you are allowed to do with it.

I tried to get it to generate a JSON object about the movie The Matrix and the model refuses.


In case you might be trying this with Ollama (given the model:tag syntax) – I haven't added this to the readme yet (seeing some issues with the prompt template) but check out llama2-uncensored:

  ollama run llama2-uncensored


  % ollama run llama2-uncensored
  >>> Generate a json object describing the most stunning part of the movie: "The Matrix"
  {   
    "title": "The Matrix",
    "director": "The Wachowskis",
    "release_year": 1999,
    "description": "Set in a dystopian future, The Matrix follows the story of Neo (Keanu Reeves), a computer hacker who is drawn into an underground war against machines that have enslaved humanity. Along with his allies Morpheus (Laurence Fishburne) and Trinity (Carrie-Anne Moss), Neo learns to harness the power of the Matrix, a virtual reality system designed to keep humans obedient and ignorant. With the help of a rogue program named Agent Smith (Hugo Weaving), Neo discovers the true nature of the Matrix and sets out on a journey to free humankind from their digital prison.",
    "stunning scenes": [
      "bulletproof kung-fu fighting",
      "bullet time",
      "dream-like imagery",
      "special effects",
    ]
  }
edit: formatting


Are you going to add a "stop" argument to the API?


Looking into this! Since quite a few of the recommended prompts don't always work https://github.com/jmorganca/ollama/issues/217


This argument is vital even when using openAI’s api. The LLMs want to continue writing… even when they shouldn’t.


Off topic: is there a way to use one of the LLMs and have it ingest data from a SQLite database and ask it questions about it?


You can migrate that data to a vector database (eg Pinecone or pgvector) and then query it. I didn't write it, but this guide has a good overview of concepts and some code. In your case you'd just replace the web crawler with database queries. All the libraries used also exist in Python.

https://www.pinecone.io/learn/javascript-chatbot/

Edit: this might also be of use

https://python.langchain.com/docs/modules/chains/popular/sql...


You can, but as a crazy idea you can also ask ChatGPT to write SELECT queries using the functions parameter they added recently - you can also ask it to write JSONPath.

As long as it understands the schema and general idea of the data, it does a fairly good job. Just be careful not to do too much with one prompt - you can easily cause hallucinations.
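
Roughly what that looks like with the 0.x openai client and function calling (a hedged sketch - the table and column names are made up, it assumes OPENAI_API_KEY is set in the environment, and you should review the generated SQL before running it):

    import json, sqlite3
    import openai

    schema = "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, created_at TEXT);"
    question = "What were the ten largest orders last month?"

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": f"You write read-only SQLite queries. Schema:\n{schema}"},
            {"role": "user", "content": question},
        ],
        functions=[{
            "name": "run_query",
            "description": "Run a read-only SQL query against the SQLite database",
            "parameters": {
                "type": "object",
                "properties": {"sql": {"type": "string"}},
                "required": ["sql"],
            },
        }],
        function_call={"name": "run_query"},
    )

    sql = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])["sql"]
    print(sql)
    print(sqlite3.connect("mydata.db").execute(sql).fetchall())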


Have a look at this too, it's just an integration which langchain can be good at : https://walkingtree.tech/natural-language-to-query-your-sql-...


I've experimented with that a bit.

Currently the absolutely best way to do that is to upload a SQLite database file to ChatGPT Code Interpreter.

I'm hoping that someone will fine-tune an openly licensed model for this at some point that can give results as good as Code Interpreter does.


You can, but you'll end up trading the precise answers you get from querying for a chance of hallucinations.


I might be missing something. The article asks me to run a bash script on windows.

I assume this would still need to be run manually to access GPU resources etc, so can someone illuminate what is actually expected for a windows user to make this run?

I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.


Thanks for pointing this out — we should've pointed out the script needs to be run on WSL. I added a note to the post to clarify (I work at Replicate).

Also, you don't need a GPU to run this script! It builds and runs llama.cpp https://github.com/ggerganov/llama.cpp


Windows has supported linux tools for some time now, using WSL: https://learn.microsoft.com/en-us/windows/wsl/about

No idea if it will work, in this case, but it does with llama.cpp: https://github.com/ggerganov/llama.cpp/issues/103


I know (should have included in my earlier response but editing would've felt weird) but I still assume one should run the result natively, so am asking if/where there's some jumping around required.

Last time I tried running an LLM I tried both WSL and native on two machines and just got Lovecraftian-tier errors, so I'm waiting to see if I'm missing something obvious before going down that route again.


The bash script is downloading llama.cpp, a project which allows you to run LLaMA-based language models on your CPU. The bash script then downloads the 13 billion parameter GGML version of LLaMA 2. The GGML version is what will work with llama.cpp and uses CPU for inferencing. There are ways to run models using your GPU, but it depends on your setup whether it will be worth it.

I would highly recommend looking into the text-generation-webui project (https://github.com/oobabooga/text-generation-webui). It has a one-click installer and very comprehensive guides for getting models running locally and where to find models. The project also has an "api" command flag to let you use it like you might use a web-based service currently.


check my comment elsewhere in the thread, I wrote build instructions for windows + Nvidia GPUs


Maybe obvious to others, but the 1 line install command with curl is taking a long time. Must be the build step. Probably 40+ minutes now on an M2 max.


That's odd, the build step only took me a few minutes on an 5900X on Linux.

EDIT: Timed a clean build at 30 seconds.


Thanks for the info, wonder what the deal is


I did manually clone from GitHub and build myself and download the model separately. I also noticed that many of the CLI flags the author chose are questionable after I read the docs and help text.


Self plug: run llama.cpp as an inference server on a spot instance anywhere: https://cedana.readthedocs.io/en/latest/examples.html#runnin...


Looks cool, joined the waitlist.


How do you decide what model variant to use? There's a bunch of Quant method variations of Llama-2-13B-chat-GGML [0], how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.

[0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML


This is a great question. My best answer is that there's a speed/intelligence trade-off. The smaller weights (7B) will run faster and require less memory, but you won't get the same quality responses as the 13B / 70B model. I think there may be a Llama 30B variant coming soon too.


Hey there, I was confused at this exact question too. This link might help, written by a contributor to llama.cpp: https://github.com/ggerganov/llama.cpp/pull/1684

TLDR: Lower quantization means higher perplexity (i.e. how 'confused' the model is when seeing new information). It's a matter of testing it out and choosing a model that fits your available memory. The higher the quantization number, the better (generally).


The thing I get peeved by is that none of the models say how much RAM/VRAM they need to run. Just list minimum specs please!


If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.

0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
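
For reference, that route looks roughly like this (a sketch rather than the linked notebook verbatim; the meta-llama repos are gated, so you need approved access and an HF token):

    import torch
    from transformers import pipeline

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: accept the license on HF first

    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.float16,  # the 7B model in fp16 fits a 16GB card
        device_map="auto",
    )

    out = pipe("Hello, how are you, llama?", max_new_tokens=64, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])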


Idiot question: if I have access to sentence-by-sentence professionally-translated text of foreign-language-to-English in gigantic quantities, and I fed the originals as prompts and the translations as completions...

... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?


Indeed, it sounds like you have what's called fine-tuning data (given an input, here's the output). There's loads of info about fine-tuning both here on HN and on Hugging Face's YouTube channel.

Note if you have sufficient data, look into existing models on huggingface, you may find a smaller, faster and more open (licencing-wise) model that you can fine tune to get the results you want - Llama is hot, but not a catch-all for all tasks (as no model should be)

Happy inferring!
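
A tiny sketch of what "input/output pairs" usually look like on disk for fine-tuning (field names vary by trainer - check whatever library you end up using):

    import json

    # your professionally translated sentence pairs (source, English)
    pairs = [
        ("Bonjour, comment allez-vous ?", "Hello, how are you?"),
        # ... many more
    ]

    with open("translation_finetune.jsonl", "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            record = {"prompt": f"Translate to English:\n{src}\n", "completion": tgt}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")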


How to know if the data is sufficient?


It's more about quality vs. sufficiency - you can have a relatively small but accurate and wide-ranging dataset; this is better than an inaccurate huge dataset.


If you have that much data you can build your own model that can be much smaller and faster.

A simple version is a beginner tutorial: https://pytorch.org/tutorials/beginner/translation_transform...


I appreciate their honesty when it's in their interest that people use their API rather than run it locally.


This is classic blogging strategy. And the right way to do it. And few do. Which is why I ignore corporate blogs in general.


Is it possible for such a local install to retain conversation history, so if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?


My LLM command line tool can do that - it logs everything to a SQLite database and has an option to continue a conversation: https://llm.datasette.io


There is no fully built solution, only bits and pieces. I noticed that llama outputs tend to degrade with amount of text, the text becomes too repetitive and focused, and you have to raise the temperature to break the model out of loops.


Does what you're saying mean you can only ask questions and get answers in a single step, and that having a long discussion where refinement of output is arrived at through conversation isn't possible?


My understanding is that at a high level you can look at this model as a black box which accepts a string and outputs a string.

If you want it to “remember” things you do that by appending all the previous conversations together and supply it in the input string.

In an ideal world this would work perfectly. It would read through the whole conversation and would provide the right output you expect, exactly as if it would “remember” the conversation. In reality there are all kind of issues which can crop up as the input grows longer and longer. One is that it takes more and more processing power and time for it to “read through” everything previously said. And there are things like what jmiskovic said that the output quality can also degrade in perhaps unexpected ways.

But that also doesn’t mean that “ refinement of output is arrived at through conversation isn't possible”. It is not that black and white, just that you can run into troubles as the length of the discussion grows.

I don't have direct experience with long conversations so I can't tell you how long is definitely too long, and how long is still safe. Plus probably there are some tricks one can do to work around these. Probably there are things one can do if one unpacks that "black box" understanding of the process. But even without that you could imagine a "consolidation" process where the AI is instructed to write short notes about a given length of conversation and then those shorter notes would be copied into the next input instead of the full previous conversation. All of these are possible, but you won't have a turn-key solution for it just yet.
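
A minimal sketch of that append-and-truncate loop using the llama-cpp-python binding (the chat template and the crude turn-count truncation are simplifications - a real app would count tokens):

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", n_ctx=2048)
    history = []  # list of (user, assistant) turns

    def chat(user_msg):
        prompt = ""
        for u, a in history[-6:]:          # keep only recent turns so we fit in n_ctx
            prompt += f"[INST] {u} [/INST] {a} "
        prompt += f"[INST] {user_msg} [/INST]"
        out = llm(prompt, max_tokens=256, stop=["[INST]"])
        answer = out["choices"][0]["text"].strip()
        history.append((user_msg, answer))
        return answer

    print(chat("Summarize what we've discussed so far."))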


The limit here is the "context window" length of the model, measured in tokens, which will quickly become too short to contain all of your previous conversations, which will mean it has to answer questions without access to all of that text. And within a single conversation, it will mean that it starts forgetting the text from the start of the conversation, once the [conversation + new prompt] reaches the context length.

The kind of hacks that work around this are to train the model on the past conversations, and then rely on similarity in tensor space to pull the right (lossy) data back out of the model (or a separate database) later, based on its similarity to your question, and include it (or a summary of it, since summaries are smaller) within the context window for your new conversation, combined with your prompt. This is what people are talking about when they use the term "embeddings".


My benchmark is having a peer programming session spanning days and dozens of queries with ChatGPT where we co-created a custom static site generator that works really well for my requirements. It was able to hold context for a while and not "forget" what code it provided me dozens of messages earlier, it was able to "remember" corrections and refactors that I gave it and overall was incredibly useful for working out things like recurrence for folder hierarchies and building data trees. This kind and similar use-cases where memory is important, when the model is used as a genuine assistant.


Excellent! That sounds like a very useful personal benchmark then. You could test Llama v2 by copying in different lengths of snippets from that conversation and checking how useful you find its outputs.


llama is just an input/output engine. It takes a big string as input, and gives a big string of output.

Save your outputs if you want, you can copy/paste them into any editor. Or make a shell script that mirrors outputs to a file and use that as your main interface. It's up to the user.


This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.


I need some hand-holding .. I have a directory of over 80,000 PDF files. How do I train Llama2 on this directory and start asking questions about the material - is this even feasible?


1. Logically split the PDFs' content and store it in a vector database.

2. Embed the query (your questions) and search the vector database for the closest results.

3. Use the results and a Llama prompt to format the answer.
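
A hedged sketch of those three steps with a local embedding model and brute-force search (for 80,000 PDFs you'd want a real vector DB or a FAISS index instead of numpy; the paths and model names here are just illustrative):

    import glob
    import numpy as np
    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Split the PDFs into chunks and embed them.
    chunks = []
    for path in glob.glob("pdfs/*.pdf"):
        text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
        chunks += [text[i:i + 1000] for i in range(0, len(text), 1000)]
    index = embedder.encode(chunks, normalize_embeddings=True)

    # 2. Embed the question and find the closest chunks.
    question = "What does the 2021 report say about revenue?"
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(index @ q)[-3:][::-1]

    # 3. Stuff the retrieved chunks into the Llama prompt.
    context = "\n\n".join(chunks[i] for i in top)
    prompt = f"[INST] Answer using only this context:\n{context}\n\nQuestion: {question} [/INST]"
    # feed `prompt` to llama.cpp / Ollama / llama-cpp-python as shown elsewhere in the thread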


    curl -L "https://replicate.fyi/windows-install-llama-cpp"
... returns 404 Not Found


Looks like it's fixed now, I had the same 404 for a while


Whoops! Fixed now


You don't need bash or WSL to build on windows; windows runs it perfectly fine with CUDA as well


Is it possible to do hybrid inference if I have a 24GB card with the 70B model? Ie. Offload some of it to my RAM?


As someone with too little spare time I'm curious, what are people using this for, except research?


Did anyone build a PC for running these models, and which one do you recommend?


I assume you are talking about a Windows/Linux PC. I have done that and got a Threadripper with 32 cores and 256GB RAM. It runs any llama on the CPU, although the 65/70B are quite slow. I also added an A6000 (48GB VRAM) that allows you to run the 65/70B quantized with very good performance.

If you are going with the GPU and don't care about loads of RAM, then a 16-core Zen CPU will do just fine (or Intel for that matter).

If you are only interested in llama, then an M1 Studio with 64GB RAM is probably cheaper and will work just as well.


Hi, so what are you using the models for?


I am exploring possible products and also integration of LLMs with our existing mathematical modeling software.


I'm still curious to know the hype behind Llama 2


Llama.cpp can run on Android too.



