Because you have to do inference distributed across multiple nodes at this point: for prefill, because prefill is quadratic in sequence length, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's ~36 H200s just for the KV cache, and you'd need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, about 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's $1,440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only ~$700 per million tokens. There are architectural modifications that will bring that down significantly, but nonetheless it's still really, really expensive, and also quite hard to get to work.
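A quick sanity check on those numbers (a back-of-the-envelope sketch assuming Llama 3.1 405B's published shape: 126 layers, 8 KV heads under GQA, head dim 128, and 141 GB per H200; real deployments need extra headroom for activations):

```python
# Back-of-the-envelope for the figures above. Assumed model shape:
# Llama 3.1 405B -- 126 layers, 8 KV heads (GQA), head dim 128.
layers, kv_heads, head_dim = 126, 8, 128
bytes_per_elem = 2                       # bf16
ctx = 10_000_000                         # 10M-token context

# K and V per token, across all layers
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
kv_total_tb = kv_per_token * ctx / 1e12
print(f"KV cache: {kv_total_tb:.1f} TB")                  # ~5.2 TB

h200_gb = 141
weights_gb = 405e9 * bytes_per_elem / 1e9                  # ~810 GB of bf16 weights
gpus = (kv_total_tb * 1000 + weights_gb) / h200_gb
print(f"~{gpus:.0f} H200s before activation headroom")     # ~42; ~48 with headroom

tokens_per_hour = 100_000                # ~30 tok/s, rounded
cost_per_mtok = 48 * 3.0 * 1e6 / tokens_per_hour
print(f"${cost_per_mtok:,.0f} per million tokens")         # $1,440
```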
How so? As far as I can tell, Microsoft has a large equity interest in OpenAI, and OpenAI has a lot of cloud credits usable on Microsoft’s cloud. I don’t think those credits are transferable to other providers.
The value in the proposition is OpenAI's IP. Money and data centers are commodities, easily replaced, especially when you hold the IP everyone wants a piece of.
The arrangement is mutually beneficial, but the owner of the IP holds the cards.
But how many of them have hot data centers to offer? Google is a direct competitor, so Oracle and Amazon are about the only other big options to offer them what MS does right now.
If MS drops OpenAI, it's not like they can just seamlessly pivot to running their own data centers with no downtime, even with pretty high investment.
A relationship that’s mutually beneficial needn’t be symmetric. Microsoft’s relationship is fairly commoditized - money and GPUs. OpenAI controls the IP that matters.
I’d note that the supplier of GPUs is Nvidia, who also offers cloud GPU services and doesn’t have a stake in the GCP/Azure/AWS behemoth battle. I’d actually see that as a more natural, less-middleman relationship.
The real value Azure brings is enterprise compliance chops. However, IMO AWS Bedrock seems to be a more successful enterprise integration point. But they’re all commodity products and don’t provide the value OpenAI brings to the relationship.
The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at FLOPs alone.
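A rough way to see it is arithmetic intensity, i.e. FLOPs per byte moved. The shapes below are illustrative, not from any particular model:

```python
# Rough arithmetic-intensity comparison; shapes are illustrative
# (one attention head, sequence length n, head dim d, bf16 = 2 bytes/elem).
n, d, b = 8192, 128, 2

# QK^T matmul: ~2*n*n*d FLOPs over roughly (2*n*d + n*n) elements moved
matmul_flops = 2 * n * n * d
matmul_bytes = (2 * n * d + n * n) * b
print(f"matmul:  ~{matmul_flops / matmul_bytes:.0f} FLOPs/byte")    # ~124

# softmax over the n x n score matrix: a handful of FLOPs per element,
# but every element is read and written once
softmax_flops = 5 * n * n          # max, subtract, exp, sum, divide (rough)
softmax_bytes = 2 * n * n * b      # read + write n^2 elements
print(f"softmax: ~{softmax_flops / softmax_bytes:.1f} FLOPs/byte")  # ~1.2
```

For scale, an H200 needs roughly 200 FLOPs per byte to stay compute-bound (~989 bf16 TFLOPs over ~4.8 TB/s), so the matmul is near the knee while a naive softmax is hopelessly bandwidth-bound.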
Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!
Next question: does the softmax in the SA block cause it to be bandwidth-bound? Won’t it have to materialize all N^2 entries of the attention matrix either way? Does the softmax cause redundant data reads?
Yes, but as far as I understand, avoiding that materialization is only really possible with FlashAttention. (The main idea is that you compute the softmax with the log-sum-exp trick, but since you can't know the max activation up front, you keep a running max and rescale everything accumulated so far whenever it changes.)
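A minimal NumPy sketch of that online-softmax rescaling (one query against a stream of score/value blocks; block size and shapes are arbitrary):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=32):
    """Compute softmax(scores) @ values one block at a time, keeping only a
    running max, a running normalizer, and a running unnormalized output --
    never the full probability vector over all positions at once."""
    m = -np.inf                        # running max of logits seen so far
    denom = 0.0                        # running sum of exp(score - m)
    acc = np.zeros(values.shape[1])    # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # running max went up -> rescale history
        denom = denom * scale + np.exp(s - m_new).sum()
        acc = acc * scale + np.exp(s - m_new) @ v
        m = m_new
    return acc / denom

# sanity check against the naive, fully materialized softmax
rng = np.random.default_rng(0)
scores, values = rng.normal(size=100), rng.normal(size=(100, 4))
p = np.exp(scores - scores.max())
p /= p.sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), p @ values)
```

The actual FlashAttention kernel does this per tile of the score matrix in on-chip SRAM, which is why the N^2 matrix never has to hit HBM.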
t5 is an architecture; t5x is a framework for training models that was created with that architecture in mind, but it can be used to train other architectures, including decoder-only ones (there is one in the examples).
To quote their official response: "If the WSE weren't rectangular, the complexity of power delivery, I/O, mechanical integrity and cooling become much more difficult, to the point of impracticality."