Yeah, this is a case where money is inevitably going to be lost and everyone is saying "not it". It's not clear that there's any "fair" way to allocate the losses that will result from massive GPU oversupply.
Have you crunched the numbers on how many users you need to justify renting a VM with an A100 or similar?
From my brief research they seem to be available for about $30/hour, and can run 50 iterations in roughly 6 seconds.
One of the missing pieces for me is how many parallel requests they can serve, and whether it's feasible to start/stop the machines based on demand (it sounds like some providers don't let stopped machines retain state long enough for this to be practical).
Would be interested in your thoughts. I've been tinkering with my laptop GPU, but it's dog slow, and I have a vague idea for an online game that I might try out if I can summon enough mental energy one weekend.
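For reference, a quick back-of-the-envelope with those numbers (assuming one image is one 50-iteration run, which may not be exactly how it works, and ignoring idle time and any batching of parallel requests):

    # Rough cost per image from the figures above; illustrative numbers, not measurements.
    cost_per_hour = 30.0        # $/hour for the rented GPU
    seconds_per_image = 6.0     # ~50 iterations per generation

    images_per_hour = 3600 / seconds_per_image        # ~600 images/hour
    cost_per_image = cost_per_hour / images_per_hour  # ~$0.05 per image

    print(f"{images_per_hour:.0f} images/hour, ~${cost_per_image:.2f} per image")

At full utilization that's roughly five cents per image; the real cost driver is how many of those hours sit idle.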
I didn't look for an A100 specifically, but the cheapest GPU spot instance on AWS is around $200/mo, if I remember correctly.
Since this is a hobby project, I probably will never be able to break even on that. I'm thinking of writing a bit of code that uses GPU spot instances to perform the computation and shuts them down when the queue is emptyish.
You don't really need any state on the machines: the worker communicates with a broker, receives a very small bit of JSON containing the parameters, does the work, and uploads the image to R2. Then it sends a message back saying the image is done.
That way you don't even care if you get interrupted mid-generation, since the work is idempotent. I can spin up a machine whenever one is available and tear it down whenever I like.
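A minimal sketch of what that worker loop looks like, assuming a plain Redis list as the queue and boto3 pointed at R2's S3-compatible endpoint (the queue names, bucket, endpoint, and generate() are placeholders; in practice I let the broker handle redelivery of un-acked jobs rather than a bare BLPOP, which would lose a job if the worker died mid-generation):

    import json

    import boto3
    import redis

    # Placeholders: a Redis list as the job queue, and R2 via its S3-compatible API.
    queue = redis.Redis(host="broker.example.com")
    s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")

    def generate(params: dict) -> bytes:
        """Stand-in for the actual image generation."""
        raise NotImplementedError

    while True:
        # Block until the broker hands over a small JSON blob of parameters.
        _, raw = queue.blpop("jobs")
        job = json.loads(raw)

        image = generate(job["params"])

        # Upload the result keyed by job id, then report completion.
        s3.put_object(Bucket="images", Key=f"{job['id']}.png", Body=image)
        queue.rpush("done", json.dumps({"id": job["id"]}))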
It will be nice to try, though I think I'll use my desktop for a while yet due to a lack of users. That one takes around a minute to generate an image, and can't run while I'm using it.
This is essentially what I have in mind: some kind of queue, with a mediator that starts/stops a spot instance using a custom image that has everything installed and configured to start on boot.
With regards to state I just meant the image - I wouldn't want to have to install everything at each start. Using something like AWS or GCP I don't think this would be an issue, but I was window-shopping some providers that specialize in GPU VMs and it wasn't clear whether they offered equivalent services. Probably easiest to just use AWS.
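Something like this is what I'm picturing for the mediator, sketched with boto3 against an SQS queue and a prebaked AMI (the queue URL, AMI ID, tags, and instance type are all placeholders):

    import boto3

    sqs = boto3.client("sqs")
    ec2 = boto3.client("ec2")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/render-jobs"  # placeholder
    AMI_ID = "ami-0123456789abcdef0"  # placeholder: custom image with everything preinstalled
    WORKER_TAG = {"Key": "role", "Value": "gpu-worker"}

    def queue_depth() -> int:
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    def running_workers() -> list[str]:
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag:role", "Values": ["gpu-worker"]},
                {"Name": "instance-state-name", "Values": ["pending", "running"]},
            ]
        )
        return [i["InstanceId"] for r in resp["Reservations"] for i in r["Instances"]]

    def scale():
        depth, workers = queue_depth(), running_workers()
        if depth > 0 and not workers:
            # Request a single spot instance from the prebaked image;
            # the worker process is configured to start on boot.
            ec2.run_instances(
                ImageId=AMI_ID,
                InstanceType="g4dn.xlarge",
                MinCount=1,
                MaxCount=1,
                InstanceMarketOptions={"MarketType": "spot"},
                TagSpecifications=[{"ResourceType": "instance", "Tags": [WORKER_TAG]}],
            )
        elif depth == 0 and workers:
            ec2.terminate_instances(InstanceIds=workers)

Run that on a timer (cron or similar) and it's a rough approximation of spin-up-when-there's-work, tear-down-when-emptyish.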
I think if you couldn't make a ready-made image you'd use a Docker container with everything in it instead. Though then I don't know how you'd use spot instances, because AFAIK you want those for the "whenever it's available" startup and can't really be logging into each one to start them up. If they let you run a container, presumably they let you run other commands to set stuff up as well?
This kind of depends on the provider, but it's mostly an implementation detail. The crux is a broker and a queue; I use Dramatiq (the site is written in Python) and it works great.
With spot instances you just have to be prepared for them to get terminated at random. If you're processing messages off a queue, in most cases you get this for free - the ack deadline will expire and the message will be redelivered to another subscriber.
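For what it's worth, the Dramatiq side can be as small as this (the Redis host, retry/time-limit values, and the two helper functions are placeholders for the real code); if a spot worker dies mid-job, the un-acked message eventually gets redelivered to another worker:

    import dramatiq
    from dramatiq.brokers.redis import RedisBroker

    dramatiq.set_broker(RedisBroker(host="localhost"))  # placeholder broker location

    def run_model(params: dict) -> bytes:
        """Stand-in for the actual generation code."""
        raise NotImplementedError

    def upload_to_r2(job_id: str, image: bytes) -> None:
        """Stand-in for the R2 upload."""
        raise NotImplementedError

    @dramatiq.actor(max_retries=3, time_limit=600_000)  # time_limit is in milliseconds
    def generate_image(job_id: str, params: dict):
        image = run_model(params)
        upload_to_r2(job_id, image)

    # Enqueued from the web app with something like:
    # generate_image.send("abc123", {"prompt": "...", "steps": 50})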
A Docker container might be a good shout, though I've not done anything using a GPU from Docker before.
Yeah, exactly - the queue handles this for free. Docker with the Nvidia extension actually works better for me than running natively; Torch doesn't work on my system but works fine in Docker.
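In case it's useful, passing the GPU through from the Python Docker SDK looks roughly like this (equivalent to docker run --gpus all; the CUDA image tag is just an example, and it assumes the NVIDIA Container Toolkit is installed on the host):

    import docker

    client = docker.from_env()

    # Roughly: docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
    output = client.containers.run(
        "nvidia/cuda:12.2.0-base-ubuntu22.04",  # example image tag
        "nvidia-smi",
        device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
        remove=True,
    )
    print(output.decode())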
Haha, more of a necessity, but yes, it is very nice to have.
Oddly, I did see an issue yesterday with the push notification arriving multiple times (far apart from each other) even though the server only sent it once.