I want to rent a server with 128 GB of RAM for my web projects. But primarily for launching CodeLlama 70B models. Is this possible without video memory?
I run CodeLlama 70B in my home lab using two RTX 4090s and the MLC AI framework. It provides multi-GPU support and is extremely fast on consumer-grade hardware. I'm seeing about 30 tokens/sec.
I'm kinda stunned at that speed. I wonder how quantized models would affect that, i.e. GGUFs via llama.cpp. I'd have guessed my M2 MBP would beat that, but M2s aren't that great; even my 3090 is faster.
I have a similar system. When idle, nvtop reports about 11-12 W per 4090. When doing bs=1 inference, the default power limit for my cards is 450 W each, although I've found that I can keep 97% of prefill and 99%+ of inference performance by lowering the power limit to 400 W.
With system overhead, basically I'd guess 50W when idle, 900W when actively working.
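If you want to sanity-check numbers like these on your own box, here's a minimal polling sketch using NVML via the pynvml bindings (an assumption on my part that you have them installed; readings are per-card draw in watts, not wall power):

```python
# Rough sketch: poll GPU power draw with NVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        # NVML reports power in milliwatts
        watts = [pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0 for h in handles]
        print(" | ".join(f"GPU{i}: {w:6.1f} W" for i, w in enumerate(watts)))
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```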
I'm trying to think of my typical ChatGPT conversations and the flow of them. I probably send it anywhere between 150 and 1,000 tokens, and then I wait for it to respond with about 150 to 3,000 tokens.
At 30 tokens a second, responses hold the system at that "900 W" level for about 5 s to 100 s.
Then, as a human, you have to sit there and read what it says (the system doesn't need to generate tokens while you interpret them).
I'm trying to picture what that usage graph looks like, and roughly what it translates to in terms of full-load time per hour, so you can estimate daily usage in kilowatt-hours and quantify how intensive that peak 900 W load really is in real-world use...
1 kW under load at 30 tokens/sec; each request is about 100-1,000 tokens and each response is about 100-1,000 tokens (so a response takes about 3-33 s).
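Here's a rough back-of-envelope calculator for that, under the numbers in this thread (1 kW while generating at 30 tokens/s, ~50 W otherwise); the turn count and reading time are made-up inputs you'd swap for your own:

```python
# Back-of-envelope daily energy estimate, all inputs assumptions.
GEN_POWER_W = 1000      # whole-system draw while generating
IDLE_POWER_W = 50       # whole-system draw while you read the reply
TOKENS_PER_S = 30

def kwh_per_day(turns_per_day, avg_response_tokens, avg_read_seconds):
    gen_s = turns_per_day * avg_response_tokens / TOKENS_PER_S
    idle_s = turns_per_day * avg_read_seconds
    joules = gen_s * GEN_POWER_W + idle_s * IDLE_POWER_W
    return joules / 3.6e6  # 1 kWh = 3.6 MJ

# e.g. 50 turns/day, ~500-token replies, ~2 minutes of reading per reply
print(f"{kwh_per_day(50, 500, 120):.2f} kWh/day")
```

With 50 turns a day and ~500-token replies that works out to roughly a third of a kWh per day, i.e. the 900 W peak looks scary but the duty cycle is low.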
I run Llama 2 70B Chat on a p3.8xlarge (4x Tesla V100) and it runs at like 1-5 tokens per second, and that's with the model at only 4-bit precision. Are you doing anything else to get that speed?
I wanted my students to be able to run open-source models. The easiest way was using Ollama and Ollama Web UI (now Open WebUI) on Google Cloud for $0.18 an hour, on a spot instance with 4 cores, an NVIDIA GPU, and 16 GB of RAM. I created a tutorial for my students:
https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...
Open WebUI is what I settled on as a centralized hub for LLM queries. It connects to both Ollama and OpenAI (via API key) and gives me a single drop-down to pick from. I can download Ollama models from within Open WebUI, which is nice. Built-in LangChain integration lets people quickly see the possibilities and limitations of RAG (spoiler: good for lookups, but don't expect it to SUMIF a CSV).
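For anyone who wants the same "one client, two backends" trick without the UI, a minimal sketch; it assumes a recent Ollama build that exposes the OpenAI-compatible /v1 endpoint, and the model names are just placeholders:

```python
# Route the same prompt to a local Ollama server or to OpenAI with one client API.
from openai import OpenAI

backends = {
    # Ollama's OpenAI-compatible endpoint; the key can be any non-empty string.
    "local": OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    # Reads OPENAI_API_KEY from the environment.
    "openai": OpenAI(),
}
models = {"local": "mistral", "openai": "gpt-4-turbo-preview"}

def ask(backend: str, prompt: str) -> str:
    resp = backends[backend].chat.completions.create(
        model=models[backend],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("local", "Write a SQL query that sums a column."))
```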
It's technically possible, in the sense that you will eventually get a complete response, but it will be extremely slow. Inference will be something like 1 word per second. Far too slow to use as an assistant for writing code.
If you don't want to rent something with a GPU, you should look into running non-70B models and experiment with varying levels of quantization. Without a GPU you really want to start with 7B models, not 70B models. The Mistral variants in particular are worth a try.
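For what that looks like in practice, here's a minimal CPU-only sketch with llama-cpp-python; the GGUF filename and thread count are assumptions you'd adjust for whatever quant you actually download:

```python
# CPU-only inference with a quantized 7B model (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # example filename
    n_ctx=4096,      # context window
    n_threads=8,     # match your physical core count
)

out = llm(
    "### Instruction: Write a Python function that reverses a string.\n### Response:",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```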
I had a similar idea yesterday; I'd like to run Code Llama 70B on a GPU supercomputer cluster I have access to. But the wall I hit is the licence agreement by Facebook.
I'd rather download the models from third-party unofficial sources/torrents, since that would be legal where I live for personal research use. Does anyone know of a server that hosts the Code Llama model weights, or a torrent? I have downloaded the torrent with the Llama weights, but I'd rather use the Code variant, if possible.
i. If you distribute or make the Llama Materials, or any derivative works thereof, available to a third party, you shall provide a copy of this Agreement to such third party.
It would be equivalent to buying something someone else stole, i.e. still illegal. If you want to use Llama, you have to agree to their terms, as far as legality goes.
As people modify a model to be further and further from the original I’m unsure how the legality and detectability work.
HuggingFace is the place to go for easily downloading model weights. Search for the user TheBloke if you’re particularly interested in GGUF formatted weights.
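If it helps, a minimal download sketch with huggingface_hub; the repo ID and filename follow TheBloke's usual naming convention, but check the repo's file list for the exact quant you want:

```python
# Fetch a single GGUF file from Hugging Face (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-70B-Instruct-GGUF",        # assumed repo
    filename="codellama-70b-instruct.Q4_K_M.gguf",         # assumed quant filename
)
print("Downloaded to", path)
```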
As an alternative, MassedCompute [0] has some interesting rental options (billed by the minute). They recently changed their site, so it's harder to see the options available, but I've used it a few times and it's generally competitive with respect to price and the environments/features offered. It looks like you unfortunately need to sign up for an account now.
The only system without a discrete GPU I've seen that pulls this off at a usable ~8 tokens a second is an M3 MacBook Pro with 128 GB of unified memory [1], key word here being unified... Something like a $6,000 investment. You're probably better off renting a GPU somewhere or investing in a couple of RTX cards.
An RTX 4090 is going for $2-3K on eBay, and another commenter is saying you need two to run a 70B model. So that's about the same price, though I imagine the GPUs are faster.
It might be cheaper for you to call an API to run the inference instead of renting a machine. GPT-4-Turbo goes for $0.01 per 1K input tokens and $0.03 per 1K output tokens on Azure. An x2gd.2xlarge instance on AWS has 8 vCPUs and 128 GB of memory and goes for ~$200 per month.
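A quick back-of-envelope comparison using those prices (the request volume and token counts are assumptions):

```python
# Compare API spend vs. renting the CPU box, all usage numbers assumed.
INPUT_PER_1K = 0.01        # GPT-4-Turbo input price per 1K tokens
OUTPUT_PER_1K = 0.03       # GPT-4-Turbo output price per 1K tokens
SERVER_PER_MONTH = 200.0   # x2gd.2xlarge, roughly

def api_cost_per_month(requests_per_day, in_tokens, out_tokens):
    per_request = in_tokens / 1000 * INPUT_PER_1K + out_tokens / 1000 * OUTPUT_PER_1K
    return per_request * requests_per_day * 30

# e.g. 100 requests/day, ~500 tokens in, ~1,000 tokens out
monthly = api_cost_per_month(100, 500, 1000)
print(f"API: ${monthly:.0f}/mo vs server: ${SERVER_PER_MONTH:.0f}/mo")
```

At that volume the API comes out around $105/month, so the dedicated box only starts to pay off at much heavier usage, and the API will be far faster besides.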
Using llama.cpp on a 128 GB server running codellama-70b-python.Q4_K_M.gguf, I get 1.3 tokens per second, which is too slow. With Nous-Hermes-2-Mistral-7B-DPO.Q5_K_M.gguf I get 8.3 tokens per second, which is usable.
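For anyone wanting to reproduce numbers like these, a rough way to measure tokens/sec with llama-cpp-python (model path and settings are just examples; this lumps prompt processing into the timing, which is fine for a ballpark):

```python
# Measure generation throughput for a GGUF model on CPU.
import time
from llama_cpp import Llama

llm = Llama(model_path="./codellama-70b-python.Q4_K_M.gguf",  # example path
            n_ctx=2048, n_threads=16)

start = time.perf_counter()
out = llm("Write a quicksort in Python.", max_tokens=200)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/sec")
```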
Is it correct to assume that you want to do this for privacy reasons only?
I currently use GPT-4 for software development work that doesn't have those concerns. I'm assuming that someone in my position wouldn't benefit from running my own server for this sort of thing, even with a couple of 4090s.
Do you guys really get that much help from these local LLMs? ChatGPT has SO much more functionality and even it is still quite limited. Why is it worth it for you to run these things locally? What are they doing that provides you so much value?
You can run a Q4 quant of a 70B model in about 40 GB of RAM (+ context). Your single-user (batch size 1, bs=1) inference speed will be basically memory-bandwidth bottlenecked, so on a dual-channel dedicated box you'd expect somewhere around 1 token/s. That's just decoding; prefill/prompt processing will take even longer (as your chat history grows) on CPU. So it falls into the realm of technically possible, but not for real-world use.
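To make the bandwidth argument concrete, here's the arithmetic as a tiny sketch (the bandwidth figures are ballpark assumptions):

```python
# At bs=1, every generated token has to stream all the (quantized) weights out of
# memory, so throughput is roughly bandwidth / model size.
MODEL_BYTES = 40e9     # ~40 GB for a Q4 quant of a 70B model
CPU_BW = 50e9          # ~50 GB/s, dual-channel DDR4-3200 (assumed)
GPU_BW_4090 = 1000e9   # ~1 TB/s per RTX 4090 (assumed)

print(f"CPU upper bound:        {CPU_BW / MODEL_BYTES:.1f} tokens/s")       # ~1.3
print(f"2x 4090, model split:   {2 * GPU_BW_4090 / MODEL_BYTES:.0f} tokens/s")  # ~50
# The ~30 tokens/s reported upthread on two 4090s is in that ballpark once
# real-world overheads are included.
```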
If you're looking specifically for CodeLlama 70B, Artificial Analysis https://artificialanalysis.ai/models/codellama-instruct-70b/... lists Perplexity, Together.ai, Deep Infra, and Fireworks as potential hosts, with Together.ai and Deep Infra at about $0.90/1M tokens, about 30 tokens/s, and about 300 ms latency (time to first token).
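Most of these hosts expose an OpenAI-compatible endpoint, so trying one is a few lines. This sketch targets Together.ai; the model ID is my guess at their CodeLlama 70B Instruct listing, so verify it against their catalog:

```python
# Call a hosted CodeLlama 70B through an OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",  # assumed model ID
    messages=[{"role": "user", "content": "Write a binary search in Go."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```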
From the various leaderboards, deepseek-ai/deepseek-coder-33b-instruct still looks like the best-performing open model (it has a very liberal license, aside from an ethical-use clause), followed by ise-uiuc/Magicoder-S-DS-6.7B (a deepseek-coder-6.7b-base fine-tune). The former can be run as a Q4 quant on a single 24 GB GPU (a used 3090 should run you about $700 atm), and the latter, if it works for you, will run 4x faster and fit on even cheaper/weaker GPUs.
There are always recent developments, but two worth pointing out:
OpenCodeInterpreter - a new system fine-tuned from the DeepSeek code models that uses execution feedback and outperforms ChatGPT-4's Code Interpreter: https://opencodeinterpreter.github.io/
https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...