
> Why do these takes around open-source AI remain so popular?

I can only speak for myself, but I have a great desire to run these things locally, without a network connection, without anyone being able to shut me out of it, and without any running cost beyond the energy needed for the computation. Putting powerful models behind walls of "political correctness" and money is not something that fits well with my personal beliefs.

The 65B LLaMA I run is actually usable for most of the tasks I would ask ChatGPT for (I have premium there, but that will lapse this month). The best part is that I never see the "As a large language model I can't do shit" reply.




How do you run it locally? llama.cpp + 64GB RAM + 4bit quantized?


I have a 5950X with 64 GB of RAM, and the weights are quantized to 4-bit, yes :)

The weights are stored on a Samsung 980 Pro, so the load time is very fast too. I get about 2 tokens/second with this setup.

edit: forgot to confirm, it is llama.cpp

edit2: I am going to try the FP16 version after Easter, as I ordered 64 GB of additional RAM. But I suspect the speed will be abysmal with the 5950X having to churn through 120 GB of weights. Hopefully some smart person will come up with a way to let the GPU run off system memory via AMD Infinity Fabric or something.
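
edit3: since a few people asked what this looks like in practice, here is a rough sketch using the llama-cpp-python bindings (the model path, thread count and context size are just placeholders for my setup, not recommendations):

  from llama_cpp import Llama

  # Load the 4-bit quantized 65B weights from disk (placeholder path).
  llm = Llama(
      model_path="./models/65B/ggml-model-q4_0.bin",
      n_ctx=2048,    # context window
      n_threads=16,  # the 5950X has 16 physical cores
  )

  # Plain completion; on this CPU it comes out at roughly 2 tokens/second.
  out = llm("Briefly explain 4-bit quantization:", max_tokens=128)
  print(out["choices"][0]["text"])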


I thought it needed 64 GB of VRAM. 64 GB of RAM is easy to obtain.


The 5950X is a CPU model. Integer-quantized models are generally run with CPU inference. For the larger models, the problem then becomes generation time per token.


Quantized models are used aplenty with GPUs as well - 4-bit quantization is the only way you can squeeze llama-30b into 24 GB of VRAM (e.g. an RTX 3090 or 4090).

In fact, I would say that, at this point, most people running LLaMA locally are likely using 4-bit quantization regardless of model size and hardware, just to get the most out of the latter.
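
The napkin math on the weights alone, assuming roughly 30B parameters (the KV cache and activations add a few more GB on top of this):

  # Weight memory for a ~30B-parameter model at different precisions (weights only).
  params = 30e9
  print(f"fp16: {params * 2 / 1e9:.0f} GB")    # ~60 GB -> does not fit a 24 GB card
  print(f"int4: {params * 0.5 / 1e9:.0f} GB")  # ~15 GB -> fits with room for context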


Most people running llama locally are doing CPU inference, period.


If your desktop had 256 GB of RAM, could you train a far larger model? Some motherboards support that.


How have you managed to run the 65B model? Cloud resources, or you have a very kitted-out homelab?


If you're not running on GPU, you can upgrade your system RAM instead of finding a card with lots of VRAM. 64GB of DDR4 is only $120.


All you need is 2 3090s.


All you need is 64GB of RAM and a CPU, actually. Two 3090s is much faster but not strictly necessary.


All you need is a few thousand dollars lying around to spend solely on your inference fun?

I don’t think that many people really qualify as such (though it’s probably true that many of them are on HN).


Not just inference.

AFAIK, you are able to fine-tune the models with custom data [1], which does not seem to require anything more than a GPU with enough VRAM to fit the model in question. I'm looking to get my hands on an RTX 4090 to ingest all of the repair manuals of a certain company and have a chatbot capable of guiding repairs, or at least one that tries to. So far I'm only doing inference as well.

[1] https://github.com/tloen/alpaca-lora
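
For a sense of what the fine-tuning side looks like, here's a minimal sketch built on the transformers + peft stack that alpaca-lora uses (the checkpoint name, rank and target modules are illustrative placeholders, not the repo's exact settings):

  from transformers import LlamaForCausalLM, LlamaTokenizer
  from peft import LoraConfig, get_peft_model

  base = "decapoda-research/llama-7b-hf"  # placeholder checkpoint name
  tokenizer = LlamaTokenizer.from_pretrained(base)

  # Load the base model in 8-bit so it fits on a single 24 GB card.
  model = LlamaForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")

  # LoRA trains small low-rank adapters on top; the base weights stay frozen.
  config = LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],
      bias="none", task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, config)
  model.print_trainable_parameters()  # usually well under 1% of the total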


You might think about doing the training in the cloud; then you're back to only needing standard hardware for the bot.

Also, another thought: generate embeddings for each paragraph of the manual and index those with Faiss. Then generate an embedding of the question, use Faiss to return the most relevant paragraphs, and feed those into the model with a prompt like "given the following: {paragraphs} \n\n {question}".

I'm sure there are better prompts but you get the idea.
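
Roughly this, as a sketch (sentence-transformers and a flat L2 index are just one easy choice; the model name and the manual snippets are placeholders):

  import faiss
  from sentence_transformers import SentenceTransformer

  # Placeholder manual snippets; in practice, one entry per paragraph of the manuals.
  paragraphs = [
      "Disconnect mains power before removing the service panel.",
      "The drain pump is accessed by removing the four front screws.",
  ]

  # Embed every paragraph once and build an exact (flat) L2 index over the vectors.
  embedder = SentenceTransformer("all-MiniLM-L6-v2")
  vectors = embedder.encode(paragraphs).astype("float32")
  index = faiss.IndexFlatL2(vectors.shape[1])
  index.add(vectors)

  def build_prompt(question, k=2):
      # Embed the question, pull the k closest paragraphs, splice them into the prompt.
      q = embedder.encode([question]).astype("float32")
      _, ids = index.search(q, k)
      context = "\n".join(paragraphs[i] for i in ids[0])
      return f"given the following:\n{context}\n\n{question}"

  print(build_prompt("How do I get at the drain pump?"))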


> All you need is a few thousand dollars lying around to spend solely on your inference fun? I don’t think that many people really qualify as such (though it’s probably true that many of them are on HN).

Can confirm. Did a new build just for inference fun. Expensive, and worth it.


I think it’s hard to verify and those articles get clicks.

Similar in vein to the articles promising self-driving cars in 202x.




