Former ML engineer here who ran IsaacGym and MuJoCo sims in the cloud for 2+ years. The pain is real and very specific:
1. Cold start latency killed iteration loops. Spinning up a GPU VM to test a 10-minute sim run took longer than the sim itself — you'd wait 3-5 min for the instance, run 8 min, tear down. That per-iteration overhead crushes exploration.
2. Idle billing. If you're grid-searching over reward functions, you want to fire 20 parallel runs, collect results, tune, repeat — but most providers bill per-hour so even a 12-minute run costs you a full hour.
3. Physics sim + CUDA dependencies. Custom CUDA kernels (Warp-based sims, etc.) often need specific driver versions. Docker helps, but image build/push overhead adds another 5-10 min to the loop.
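The sweep loop from point 2 can be sketched in a few lines — fire N short runs in parallel, collect scores, keep the best config. This is a hypothetical sketch: `run_sim` is a stand-in for a real simulator call, not any actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sim(reward_scale: float) -> float:
    # Stand-in for a ~12-minute IsaacGym/MuJoCo run; returns a score.
    # (Pretend 0.3 is the sweet spot for this invented reward parameter.)
    return -abs(reward_scale - 0.3)

def sweep(candidates):
    # On per-hour billing, each of these short runs is billed as a full
    # hour; per-second billing is what makes wide, short sweeps economical.
    with ThreadPoolExecutor(max_workers=20) as pool:
        scores = list(pool.map(run_sim, candidates))
    return max(zip(scores, candidates))  # (best_score, best_config)

best_score, best_cfg = sweep([i / 10 for i in range(20)])
```

The point isn't the code — it's that the natural shape of the workflow is many short parallel runs, which is exactly the shape per-hour billing punishes.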
The "CI for sims" framing (push code → run on GPU automatically) directly addresses #1 and #3. Worth building.
On the infrastructure layer: we built GhostNexus (https://ghostnexus.net) to address #1 and #2 — per-second billing, <30s cold starts on RTX 4090 hardware, Python SDK with 3 lines to submit a job. Might be worth using as the GPU backend if you don't want to manage the infra layer yourself. (Disclaimer: I'm the founder.)
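For a sense of what "3 lines to submit a job" typically looks like in this kind of SDK, here's an illustrative sketch. `GhostClient` and its methods are invented stand-ins (stubbed so the snippet runs), not the real GhostNexus API — check their docs for actual names.

```python
class GhostClient:
    # Minimal local stub so the sketch is runnable; a real client would
    # talk to the service over HTTP.
    def submit(self, script: str, gpu: str = "rtx4090") -> dict:
        return {"id": "job-001", "script": script, "gpu": gpu}

    def wait(self, job: dict) -> str:
        return f"{job['id']}: done"

client = GhostClient()
job = client.submit("train_sim.py")
print(client.wait(job))  # → job-001: done
```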