I would like to use stuff like this as a side project. Buy an Nvidia GeForce GPU, stick it into my 24/7 server, and play around with it in my free time to see what can be done.
The issue with all these AI models is that there's no information on which GPU is enough for which task. I'm absolutely clueless whether a single RTX 4000 SFF with its 20GB VRAM and only 70W of max power usage will be a waste of money, or really something great to do experiments on. Like do some ASR with Whisper, images with Stable Diffusion, or load an LLM onto it, or this project here from Facebook.
Renting a GPU in the cloud doesn't seem to be a solution for this use case, where you just want to let something run for a couple of days and see if it's useful for something.
Granted, it's talking about quantized models, which use less memory. But you can see the 30B models taking 36 GB at 8-bit, and at least 20 GB at 4-bit.
The page even lists the recommended cards.
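As a rough sanity check you can also estimate the weight memory yourself: parameters × bits per weight ÷ 8, plus some overhead for the context window and activations. A minimal sketch (the flat overhead figure is just a guess for illustration, not something from that page):

```python
def estimate_vram_gb(n_params_billion: float, bits_per_weight: int, overhead_gb: float = 2.0) -> float:
    """Back-of-envelope VRAM estimate: weight memory plus a flat
    overhead for context/activations (the overhead value is a rough guess)."""
    weight_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb

# A 30B model: roughly 30 GB of weights at 8-bit, 15 GB at 4-bit, before overhead.
print(estimate_vram_gb(30, 8))  # ~32 GB
print(estimate_vram_gb(30, 4))  # ~17 GB
```

The exact numbers in the table above will differ a bit because of group size and context length, but the ballpark is the same.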
But as others have pointed out, you may get more bang for your buck "renting", as in purchasing cloud instance time able to run these workloads. Buying a system costs about as much as buying instance time for one year. Theoretically, if you only run sporadic workloads when you're playing around, it would cost less. If you're training... that's a different story.
The more VRAM the better if you'd like to run larger LLMs. Old Nvidia P40 (Pascal, 24GB) cards are easily available for $200 or less and would be an easy/cheap way to play. Here's a recent writeup on the LLM performance you can expect for inferencing (training speeds I assume would be similar): https://www.reddit.com/r/LocalLLaMA/comments/13n8bqh/my_resu...
This repo lists very specific VRAM usage for various LLaMA models (w/ group size, and accounting for the context window, which is often missing) - these are all 4-bit GPTQ quantized models: https://github.com/turboderp/exllama
Note the latest versions of llama.cpp now have decent GPU support, include a memory tester, and let you load partial models (n layers) onto your GPU. It inferences about 2X slower than exllama from my testing on an RTX 4090, but still about 6X faster than my CPU (Ryzen 5950X).
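For example, with the llama-cpp-python bindings you pick how many layers to offload to the GPU and keep the rest in system RAM; a minimal sketch where the model path and layer count are just placeholder values to tune for your card:

```python
from llama_cpp import Llama

# Offload part of the model to the GPU; the remaining layers stay on the CPU.
# model_path and n_gpu_layers are illustrative - adjust to what fits your VRAM.
llm = Llama(
    model_path="./models/llama-30b.q4_0.bin",
    n_ctx=2048,        # context window
    n_gpu_layers=40,   # number of layers to place on the GPU
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```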
Wait why is renting a GPU in the cloud not a solution? You can even try multiple options and see which ones are capable enough for your use case.
Look into some barebones cloud GPU services, for example Lambda Labs, which is significantly cheaper than AWS/GCP but offers basically nothing besides the machine with a GPU. You could even try something like Vast, in which people rent out their personal GPU machines for cheap. Not something I'd use for, uhhh... basically anything corporate, but for a personal project with no data security or uptime issues it would probably work great.
My annoyance was managing state. I'd have to spend hours installing tools, downloading data, and updating code; then, when I wanted to go to bed, I had to package it up and store as much as I could on S3 before shutting off the $$ server.
I've played a lot with Stable Diffusion using AWS spot instances, mostly because it is the platform with which I'm more familiar. The Terraform script[0] should be easy to adapt to any other project of this kind.
Let me know if you are interested, and maybe we can find time to work on it together :).
You should check out https://brev.dev. You can rent GPUs, pause instances, use your own AWS/GCP accounts to make use of your credits, and the CLI lets you use your GPU as if it’s on your local machine.
It handles storage, setup, etc. for machine learning workloads across several providers - which helps a lot if you need one of the instances that rarely have capacity, like 8x A100 pods.
No, you write a script that runs the sync and shuts down the instance. When the instance is stopped you don't pay for it. Resuming it is a simple API call. You don't even really need to do the sync; it's just to be certain you have a backup if the instance volume is lost.
The shutdown / stop on an instance is like closing the lid on your laptop. When you start it again it resumes where it left off. In the meantime the instance doesn't occupy a VM.
A caveat is you can't really do this with spot instances. You would need to do a sync and rebuild on start. But, again, easily scriptable.
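A minimal sketch of that kind of end-of-day script, assuming the AWS CLI and boto3 are configured; the instance ID, bucket, and path are placeholders:

```python
import subprocess
import boto3

INSTANCE_ID = "i-0123456789abcdef0"        # placeholder
BUCKET = "s3://my-experiments-backup"      # placeholder

# Push working state to S3 (only needed as insurance against losing the volume).
subprocess.run(["aws", "s3", "sync", "/home/ubuntu/work", BUCKET], check=True)

# Stop (not terminate) the instance: compute billing stops, the EBS volume persists.
ec2 = boto3.client("ec2")
ec2.stop_instances(InstanceIds=[INSTANCE_ID])

# Later, resuming is a single call:
# ec2.start_instances(InstanceIds=[INSTANCE_ID])
```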
Speaking as someone who has solved these difficulties hundreds of times, "draw the rest of the owl" doesn't tell you the specific things to google to get detailed examples and tutorials on how millions of others have sidestepped these repeated issues.
You "spend hours messing around" with everything you don't know or understand at first. One could say the same about writing the software itself. At its core Dockerfiles are just shell scripts with worse syntax, so it's not really that much more to learn. Once you get it done once, you don't have to screw around with it anymore, and you have it on any box you want in seconds.
In either case you have to spend hours screwing around with your environment. If those hours result in a Dockerfile, then it's the last time. If they don't, then it's every time you want it on a new host (which, as was correctly pointed out, is a pain in the ass).
Storing data in a database vs in files on disk is like application development 101 and is pretty much a required skill period. It's required that you learn how to do this because almost all applications revolve around storing some kind of state and, as was noted, you can't reasonably expect it to persist on the app server without additional ops headaches.
Many people will host dbs for you without you having to think about it. Schema is only required if you use a structured db (which is advisable) but it doesn't take that long.
I applaud your experience, but honestly I agree with parent: knowledge acquisition for a side project may not be the best use of their time, especially if it significantly impedes actually launching/finishing a first iteration.
It's a similar situation for most apps/services/startup ideas: you don't necessarily need a planet scale solution in the beginning. Containers are great and solve lots of problems, but they are not a panacea and come with their own drawbacks. Anecdotally, I personally wanted to make a small local 3 node Kubernetes cluster at one time on my beefy hypervisor. By the time I learned the ins and outs of Kubernetes networking, I lost momentum. It also didn't end up giving me what I wanted out of it. Educational, sure, but in the end not useful to me.
I'm having trouble imagining what data I would store in a database as opposed to a filesystem if my goal is to experiment with large models like Stable Diffusion.
I would take GP's kind of dogmatic jibber jabber with a grain of salt. There is an unspoken and timeless elegance to the simplicity of running a program from a folder with files as state.
Isn't the terminfo db famous for this filesystem-as-db approach? File vs DB: I say do whatever works for you. There is certainly more overhead in the DB route.
IMO tensors & other large binary blobs are fairly edge-casey. You might as well treat them like video files and video file servers also don't store large videos in databases either, and most devs don't have large binary blob management experience.
'shell scripts with worse syntax' lol I wish shell could emulate Alpine on a non-linux box.
A 'shell script with worse syntax' for configuring a VM may be closer to a QEMU cloud-init file.
Besides some great tooling out there if you wanted to roll your own, you can literally rent Windows/Linux computers with persistent disks. If you have good internet, you can even use it as a gaming PC, as I do.
Is there an easy way to off-board the persistent disk to cheaper machines when you don't need the gpus?
Like, imagine setting up and installing everything with the GPU attached, but when you're not using the GPU or all the CPU cores, you can disconnect them.
If you have docs on how to do this, please let me know.
With AWS (and probably most other cloud VPC services) the disk is remote from the hardware so you can halt the CPU and just pay for the storage until you restart.
AWS also provides accessible datasets of training data:
"but offers basically nothing besides the machine with a GPU"
They must offer distributed storage that can accommodate massive models, though? How else would you have multiple GPUs working on a single training run?
I have not seen many setups that wouldn't pay themselves back (including energy in my case) within a year (sometimes even 6 months) when buying vs renting. For something that pays itself back that fast, and that is without renting it out myself, just training with it, I cannot see why I would want to rent one.
Edit: on Lambda Labs, the only exception seems to be the H100; it would be 1.5 years or so, but even 2 years would still be fast enough. I have an A100 which has paid itself back; thinking of getting another one.
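If you want to sanity-check that kind of payback claim yourself, the arithmetic is simple; the rental and electricity rates below are just placeholders, so plug in whatever your provider and utility actually charge:

```python
def breakeven_hours(purchase_usd: float, rent_usd_per_hour: float,
                    power_watts: float = 0.0, electricity_usd_per_kwh: float = 0.0) -> float:
    """Hours of use after which buying beats renting.
    Electricity cost narrows the gap slightly; all rates here are assumptions."""
    own_cost_per_hour = power_watts / 1000 * electricity_usd_per_kwh
    return purchase_usd / (rent_usd_per_hour - own_cost_per_hour)

# Example: a used 3090 at ~$700 vs a hypothetical $0.30/hr rental,
# 350W card, $0.30/kWh electricity.
hours = breakeven_hours(700, 0.30, power_watts=350, electricity_usd_per_kwh=0.30)
print(f"Break-even after ~{hours:.0f} hours (~{hours / 24:.0f} days of 24/7 use)")
```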
I think the downside to buying hardware is that, compared to other tech, this LLM/ML stuff is moving very quickly; people are great at quantising now (whereas before I only really saw it done for the Coral edge TPUs etc.).
Someone could buy an H100 to run the biggest and bestest stuff right now, but we could find that a model gets shrunk down to run on a consumer card within a year or two with equivalent performance.
I suppose it makes sense if someone wants to be on the bleeding edge all the time.
Cloud has a variable price (up to the whim of whatever they decide the price to be that day), so it's uncertain, but typically it is far more expensive for this type of application.
So when faced with 1) probably far more expensive, or 2) a single price that will be cheaper, is always available, and has far more uses, I think most would choose 2) for self-hosting. Cloud is very rarely a good option.
Having to connect to a gpu over the internet seems extremely cumbersome.
Stuff like this should be as easy as running a local program with an accelerator.
You can finetune Whisper, Stable Diffusion, and LLMs up to about 15B parameters with 24GB VRAM.
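To give a sense of how a ~15B LLM fits into 24GB, the usual trick is to load the frozen base model quantized to 8-bit and only train small LoRA adapters on top. A minimal sketch using transformers/peft; the model name and LoRA settings are just illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "huggyllama/llama-13b"  # placeholder; any causal LM in the ~13-15B range

# Load the frozen base model in 8-bit so the weights fit comfortably in 24GB.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train only small low-rank adapters instead of the full weight matrices.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of the full model
```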
Which leads you to what hardware to get. Best bang for the $ right now is definitely a used 3090 at ~$700. If you want more than 24GB VRAM, just rent the hardware as it will be cheaper.
If you're not willing to drop $700, don't buy anything, just rent. I have had decent luck with vast.ai.
No clue, but if you want to learn/finetune ML, use a Linux box; otherwise you will spend all your time fighting your machine. If you just want to run models, a Mac might work.
I would recommend a 3090. It can handle everything a 4000 series can albeit slightly slower, has enough VRAM to handle most things for fun, and can be bought for around $700.
Just do it. Spend a few hours doing research and you will find out. With that said, buy as much memory as you can. That makes the 4090 king if you have the server that can carry it, plus the budget. For me, I settled for a 3060; it's a nice compromise between cost and RAM. Cheap, 12GB and 170W TDP.
I think you just need to educate yourself a bit about the space.
These models are very small (the large version is only 1B parameters), so they should run on a 4GB gaming GPU.