Hey folks! We're Alex and Evan, and we're putting together a 512-H100 compute cluster for startups and researchers to train large generative models on. A few things that set it apart:
- it runs at the lowest possible margins (<$2.00/hr per H100)
- designed for bursty training runs, so you can take, say, 128 H100s for a week
- you don’t need to commit to multiple years of compute or pay for a year upfront
Big labs like OpenAI and DeepMind have big clusters that support this kind of bursty allocation for their researchers, but startups have so far had to buy very small clusters on very long-term contracts, wait through months of lead time, and then try to keep them busy all the time.
Our goal is to make it about 10-20x cheaper to do an AI startup than it is right now. Stable Diffusion only costs about $100k to train -- in theory, every YC company could train at that scale. It's just that no cloud provider in the world will give you $100k of compute for just a couple of weeks, so startups have to raise 20x that much to buy a whole year of compute.
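To make that arithmetic concrete, here's a rough back-of-the-envelope sketch using the numbers above. The $2.00/hr figure is the upper bound we quoted; the run sizes are illustrative, not a price sheet:

```python
# Back-of-the-envelope cluster cost math. Assumes the <$2.00/hr per-H100
# rate quoted above; run sizes are illustrative.
RATE = 2.00  # $/H100-hour (upper bound)

def cost(n_gpus: int, hours: float, rate: float = RATE) -> float:
    """Total dollar cost of renting n_gpus for the given number of hours."""
    return n_gpus * hours * rate

# A bursty run: 128 H100s for one week.
print(f"${cost(128, 24 * 7):,.0f}")           # $43,008

# How long does a Stable-Diffusion-scale $100k budget last on 128 GPUs?
print(f"{100_000 / (128 * RATE):.0f} hours")  # ~391 hours, roughly 16 days

# Versus committing to those same 128 H100s for a full year upfront:
print(f"${cost(128, 24 * 365):,.0f}")         # $2,242,560, ~20x the burst cost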
Once the cluster is online, we'll be pretty much the only option for startups that want to do big training runs like these.
In 2023 you can barely get a single TPU for more than an hour. Back in the TRC days, you could get literally hundreds, with an s.
I believed in TRC. I thought they'd solve it by scaling and building a whole continent of TPUs. But in the end, TPU time was cut short in favor of internal researchers — some researchers being more equal than others. And how could it be any other way? If I made a proposal today to get these H100s to train GPT to play chess, people would laugh. The world is different now.
Your project has a youthful optimism that I hope you won’t lose as you go. And in fact it might be the way to win in the long run. So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them. It’s the only reason I was able to become anybody.