Ask HN: Code Llama 70B on a dedicated server
67 points by lavren1974 9 months ago | 42 comments
I want to rent a server with 128 GB of RAM for my web projects, but primarily for running CodeLlama 70B models. Is this possible without video memory?



I run CodeLlama 70B in my home lab using two RTX 4090s and the MLC AI framework. It provides multi-GPU support and is extremely fast on consumer-grade hardware. I'm seeing about 30 tokens/sec.

https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...
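
In case anyone wants to try it, here's a rough sketch of what MLC's (older) mlc_chat Python API looks like. The package/class names and the pre-compiled model artifact name are assumptions that depend on the MLC version you install; multi-GPU sharding is configured when the model is compiled, not at this call site:

  # Hedged sketch of MLC LLM inference via the (older) mlc_chat Python API.
  # The model artifact name is a placeholder: MLC needs a pre-compiled model
  # library, and tensor-parallel sharding is set up at compile time.
  from mlc_chat import ChatModule

  cm = ChatModule(model="CodeLlama-70b-Instruct-hf-q4f16_1")  # placeholder artifact
  print(cm.generate(prompt="Write a Python function that reverses a linked list."))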


I'm kinda stunned at that speed. I wonder if quantized models would affect that, i.e. GGUFs via llama.cpp. My M2 MBP would beat that, but M2s aren't that great; like, my 3090's faster.


How much is the power consumption?


I have a similar system. When idle, nvtop reports about 11-12W per 4090. When doing bs=1 inference, the default power limit for my cards is 450W each, although I've found that I can keep 97% of prefill and 99%+ of inference performance by lowering the power limit to 400W.

With system overhead, basically I'd guess 50W when idle, 900W when actively working.
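
If you want to replicate the power-limit tweak, it's just nvidia-smi under the hood. A small Python wrapper (needs root/admin; the 400 W figure is the one from the paragraph above, adjust for your cards):

  # Sketch: lower the power limit on each GPU via nvidia-smi (requires root).
  # 400 W is the value suggested above; GPU indices assume a two-card box.
  import subprocess

  for gpu_index in (0, 1):
      subprocess.run(
          ["nvidia-smi", "-i", str(gpu_index), "-pl", "400"],
          check=True,
      )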


> 900W when actively working.

I'm trying to think of my typical ChatGPT conversations and the flow of them. I probably send it anywhere between 150 and 1,000 tokens, and then I wait for it to respond with about 150 to 3,000 tokens.

At 30 tokens a second, responses can be "900W" for about 5s to 100s.

Then, as a human, you have to sit there and read what it says (the system doesn't need to generate tokens while you interpret them).

I'm trying to think what that usage graph looks like, and what it roughly translates to in hours of full usage, so you can estimate daily usage in kilowatt-hours and quantify how intensive that peak 900W load really is in real-world use...


~1kW I guess.


1kW under load at 30 tokens/sec, with each request about 100-1000 tokens and each response about 100-1000 tokens (so a response takes about 3-33s).
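
Back-of-the-envelope, that pencils out to surprisingly little energy per day. A rough sketch using the figures above; the requests-per-day and average-response assumptions are made up for illustration:

  # Rough daily-energy estimate for the 2x4090 box, using the numbers above.
  # requests_per_day and avg_response_tokens are assumptions, not measurements.
  IDLE_W, LOAD_W = 50, 900
  TOKENS_PER_S = 30

  requests_per_day = 100
  avg_response_tokens = 500            # assumption: mid-range of 100-1000
  load_seconds = requests_per_day * avg_response_tokens / TOKENS_PER_S
  idle_seconds = 24 * 3600 - load_seconds

  kwh = (LOAD_W * load_seconds + IDLE_W * idle_seconds) / 1000 / 3600
  print(f"~{kwh:.2f} kWh/day")         # ~1.6 kWh/day with these numbers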


I run Llama Chat 70B on a p3.8xlarge (4 Tesla V100s) and it runs at like 1-5 tokens per sec, and I'm running the model with only 4-bit precision. Are you doing anything differently?


I wanted my students to be able to run open source models. The easiest way was using ollama and ollama web UI (now Open WebUI) on Google Cloud, for $0.18 an hour on a spot instance with 4 cores, an NVIDIA GPU, and 16 GB of RAM. I created a tutorial for my students: https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...
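
If students want to hit the ollama instance programmatically instead of through the web UI, something like this works against ollama's REST API (the model name is just an example; use whatever you've pulled):

  # Minimal sketch: query an ollama server over its REST API.
  # Assumes ollama is running locally and the model has already been pulled,
  # e.g. `ollama pull codellama:70b` (swap in whatever model you actually use).
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "codellama:70b",
          "prompt": "Explain what a mutex is in one paragraph.",
          "stream": False,
      },
      timeout=600,
  )
  print(resp.json()["response"])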


Open WebUI is what I settled on as a centralized hub for LLM queries. It connects to both ollama and OpenAI (via API key) and gives me a single drop-down to pick from. I can download ollama models from within Open WebUI, which is nice. Built-in LangChain lets people quickly see the possibilities and limitations of RAG (spoiler: good for lookups, but don't expect to SUMIF a CSV).


Can you share results, like ChatGPT does with a conversation UUID link?


How good is it for everyday tasks?


Quite expensive. Heard groq API is cheaper?


Quite costly. I heard that the grow API is cheaper


Well, thanks to the downvote, I know I misspelled groq. Anyway, here it is:

https://wow.groq.com/


It's technically possible, in the sense that you will eventually get a complete response, but it will be extremely slow. Inference will be something like 1 word per second. Far too slow to use as an assistant for writing code.

If you don't want to rent something with a GPU, you should look into running non-70B models and experiment with varying levels of quantization. Without a GPU you really want to start with 7B models, not 70B models. The Mistral variants in particular are worth a try.
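
For anyone wanting to try that route, here's a minimal llama-cpp-python sketch for CPU-only inference with a quantized 7B model. The GGUF filename is a placeholder; download a Q4/Q5 quant of a Mistral variant (e.g. from TheBloke on HuggingFace) and point at it:

  # Sketch: CPU-only inference with a quantized 7B model via llama-cpp-python.
  # The model path is a placeholder for whichever Q4/Q5 GGUF you download.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
      n_ctx=4096,        # context window
      n_threads=8,       # match your physical core count
  )
  out = llm("### Instruction: Write a Python function to parse a CSV.\n### Response:",
            max_tokens=256)
  print(out["choices"][0]["text"])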


I had a similar idea yesterday; I'd like to run Code Llama 70B on a GPU supercomputer cluster I have access to. But the wall I hit is the licence agreement by Facebook.

I'd rather download the models from 3rd-party unofficial sources/torrents, since that would be legal where I live for personal research usage. Does anyone know of a server that hosts the Code Llama model weights, or a torrent? I have downloaded the torrent with the Llama weights, but I'd rather use the Code variant, if possible.


That doesn’t quite work.

b. Redistribution and Use.

i. If you distribute or make the Llama Materials, or any derivative works thereof, available to a third party, you shall provide a copy of this Agreement to such third party.

It would be equivalent to buying something someone else stole, i.e. still illegal. If you want to use Llama, you have to agree to their terms as far as legality goes.

As people modify a model to be further and further from the original, I'm unsure how the legality and detectability work.


HuggingFace is the place to go for easily downloading model weights. Search for the user TheBloke if you’re particularly interested in GGUF formatted weights.
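
If you'd rather script the download than click through the site, huggingface_hub can fetch just the quant you want. The repo id and filename below follow TheBloke's usual naming scheme but are assumptions; verify them on huggingface.co before running:

  # Sketch: pull a single GGUF quant from HuggingFace instead of a whole repo.
  # Repo id and filename are assumed examples of TheBloke's naming; verify them.
  from huggingface_hub import hf_hub_download

  path = hf_hub_download(
      repo_id="TheBloke/CodeLlama-70B-Instruct-GGUF",
      filename="codellama-70b-instruct.Q4_K_M.gguf",
  )
  print("Downloaded to", path)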


As an alternative, MassedCompute [0] has some interesting rental options (billed by the minute). They recently changed their site, so it's harder to see the options available, but I've used it a few times and it's generally competitive with respect to price and environments/features offered. Looks like you, unfortunately, need to sign up for an account now.

[0] https://massedcompute.com/


The only system without a discrete GPU I've seen pull this off at a usable 8-ish tokens a second is an M3 MacBook Pro with 128GB of unified memory [1], key word here being unified... Something like a $6,000 investment. You're possibly better off renting a GPU somewhere or investing in a couple of RTX cards.

[1] https://www.nonstopdev.com/llm-performance-on-m3-max/


An RTX 4090 is going for $2-3K on eBay, and another commenter is saying you need two to run a 70B model. So that's about the same price, though I imagine the GPUs are faster.

https://www.ebay.com/sch/i.html?_from=R40&_nkw=RTX+4090&_sac...


Just get 7900 XTX cards and use MLC; the performance (about 30 tokens/s) is comparable at $1k/card. And the Linux driver support is better.


Two used 3090s aren't as fast as two 4090s, but they're $800 each.


It might be cheaper for you to call an API to run the inference instead of renting a machine. GPT-4-Turbo goes for $0.01 per 1K input tokens and $0.03 per 1K output tokens on Azure. An x2gd.2xlarge instance on AWS has 8 vCPUs and 128 GB of memory and goes for ~$200 per month.
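
Rough math on that tradeoff; the usage numbers are invented for illustration, the prices are the ones quoted above:

  # Sketch: compare monthly GPT-4-Turbo API cost vs. a ~$200/month instance.
  # Request counts and token sizes are assumptions, not real usage data.
  INPUT_PER_1K, OUTPUT_PER_1K = 0.01, 0.03   # Azure prices quoted above
  SERVER_MONTHLY = 200                       # x2gd.2xlarge ballpark

  requests_per_day = 200
  in_tokens, out_tokens = 500, 500           # per request, assumed

  api_monthly = 30 * requests_per_day * (
      in_tokens / 1000 * INPUT_PER_1K + out_tokens / 1000 * OUTPUT_PER_1K
  )
  print(f"API: ~${api_monthly:.0f}/mo vs server: ${SERVER_MONTHLY}/mo")  # ~$120 vs $200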


I imagine the motivation is privacy above anything else.


I really wouldn't go with AWS for this. You can get a more powerful box for a third of the price at Hetzner.


Matt Williams just posted a quick video you may be interested in [0], where he describes an easy setup with Brev and Tailscale.

[0] https://www.youtube.com/watch?v=QRot1WtivqI


Hetzner just added GPU servers; might be worth checking out too.


Oh wow, I had totally missed that news!

For everyone else: https://news.ycombinator.com/item?id=39440503


"Yes", but it's going to be unbearably slow on CPU without some form of GPU acceleration.

You'll probably get more info/responses from https://www.reddit.com/r/LocalLLaMA/


Using llama.cpp on a 128 GB server running codellama-70b-python.Q4_K_M.gguf, I get 1.3 tokens per second, which is too slow. With Nous-Hermes-2-Mistral-7B-DPO.Q5_K_M.gguf I get 8.3 tokens per second, which is usable.


It should work on a Mac Studio with Ollama.


Is it correct to assume that you want to do this for privacy reasons only?

I currently use GPT-4 for software development work that does not have those concerns. I am assuming that someone in my position would not benefit from my own server for this sort of thing. Even with a couple of 4090s.


Do you guys really get that much help from these local LLMs? ChatGPT has SO much more functionality and it still is quite limited. Why is it worth it for you guys to run these things locally? Like, what are they doing that provides you so much value?


Learning about the low levels, for one.



Possible? Yes. Unusably slow? Also yes.

(You can rent a server with two 3090s for around $1/hour, or buy two used 3090s for around $1700.)


You might look into something like together.ai or RunPod, which offer token-based or per-minute usage.


It works but is painfully slow. Unless you don't need results in real time, I wouldn't.


You can run a Q4 quant of a 70B model in about 40GB of RAM (+context). Your single-user (batch size 1, bs=1) inference speed will be basically memory-bottlenecked, so on a dual-channel dedicated box you'd expect somewhere around 1 token/s. That's inference; prefill/prompt processing will take even longer (as your chat history grows) on CPU. So it falls into the realm of technically possible, but not for real-world use.
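
The ~1 token/s figure falls out of a simple bandwidth calculation. A sketch; the memory-bandwidth number is an assumed dual-channel DDR4/DDR5 ballpark, so adjust for your box:

  # Sketch: why CPU inference on a 70B Q4 model lands around 1 token/s.
  # Each generated token streams roughly the whole quantized model through
  # memory, so tokens/s is bounded by memory bandwidth / model size.
  model_gb = 40          # ~40 GB for a Q4 quant of a 70B model (from above)
  mem_bw_gbs = 50        # assumption: dual-channel DDR4/DDR5 ballpark

  print(f"~{mem_bw_gbs / model_gb:.1f} tokens/s upper bound")  # ~1.2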

If you're looking specifically for CodeLlama 70B, Artificial Analysis https://artificialanalysis.ai/models/codellama-instruct-70b/... lists Perplexity, Together.ai, Deep Infra, and Fireworks as potential hosts, with Together.ai and Deep Infra at about $0.9/1M tokens, about 30 tokens/s, and about 300ms latency (time to first token).

For those looking specifically for local coding models, I keep a list of LLM coding evals here: https://llm-tracker.info/evals/Code-Evaluation

On the EvalPlus Leaderboard, there are about 10 open models that rank higher than CodeLlama 70B, all of them smaller: https://evalplus.github.io/leaderboard.html

A few other evals (worth cross-referencing to counter contamination and overfitting):

* CRUXEval Leaderboard https://crux-eval.github.io/leaderboard.html

* CanAiCode Leaderboard https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

* Big Code Models Leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderb...

From the various leaderboards, deepseek-ai/deepseek-coder-33b-instruct still looks like the best-performing open model (it has a very liberal ethical license), followed by ise-uiuc/Magicoder-S-DS-6.7B (a deepseek-coder-6.7b-base fine-tune). The former can be run as a Q4 quant on a single 24GB GPU (a used 3090 should run you about $700 atm), and the latter, if it works for you, will run about 4X faster and fit on even cheaper/weaker GPUs.
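
To give a sense of what running the 33B on a single 24 GB card looks like, here's a hedged llama-cpp-python sketch with full GPU offload (the GGUF filename follows TheBloke-style naming and is an assumption; you need llama-cpp-python built with CUDA support):

  # Sketch: deepseek-coder-33b-instruct Q4 on one 24 GB GPU via llama-cpp-python.
  # The model path is an assumed filename; verify the quant you download.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./deepseek-coder-33b-instruct.Q4_K_M.gguf",
      n_gpu_layers=-1,   # offload every layer to the GPU
      n_ctx=8192,
  )
  print(llm("Write a Python function that merges two sorted lists.",
            max_tokens=300)["choices"][0]["text"])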

There are always recent developments, but two worth pointing out:

OpenCodeInterpreter - a new system, fine-tuned off of the DeepSeek code models, that uses execution feedback and outperforms ChatGPT-4's Code Interpreter: https://opencodeinterpreter.github.io/

StarCoder2-15B just dropped and also looks competitive. Announcement and relevant links: https://huggingface.co/blog/starcoder2


How many users do you plan to support with this setup?



