terafo's comments

Because you have to do inference distributed across multiple nodes at that point: for prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you would need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, i.e. 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's $1,440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only about 700 dollars per million tokens. There are architectural modifications that would bring that down significantly, but nonetheless it's still really, really expensive, and also quite hard to get working.
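A rough back-of-the-envelope sketch of those numbers, assuming Llama 3.1 405B's published config (126 layers, 8 KV heads via GQA, head dim 128) and 141 GB of HBM per H200; the $3/GPU-hour rate, 48 GPUs and ~100k tokens/hour are the figures from above:

```python
# Back-of-the-envelope KV-cache size and serving cost.
# Assumes Llama 3.1 405B config: 126 layers, 8 KV heads (GQA), head dim 128.

def kv_cache_bytes(tokens, layers=126, kv_heads=8, head_dim=128, bytes_per_val=2):
    # 2x for keys and values; one entry per layer, KV head and head dim per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens

ctx = 10_000_000                                    # 10M-token context
cache = kv_cache_bytes(ctx)                         # bf16 -> 2 bytes per value
print(f"KV cache: {cache / 1e12:.1f} TB")           # ~5.2 TB
print(f"H200s for cache alone: {cache / 141e9:.0f}")  # ~37 (141 GB HBM each)

# Cost per million generated tokens at 48 GPUs, $3/GPU-hour, ~100k tokens/hour.
gpus, rate, tokens_per_hour = 48, 3.0, 100_000
print(f"$ per 1M tokens: {gpus * rate / tokens_per_hour * 1e6:.0f}")  # ~$1440
```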


Why mention Microsoft twice?


The departments aren’t talking, so they accidentally made two orders.


I’d note Microsoft needs OpenAI a hell of a lot more than OpenAI needs Microsoft. I’d actually flip that: why mention OpenAI twice?


How so? As far as I can tell, Microsoft has a large equity interest in OpenAI, and OpenAI has a lot of cloud credits usable on Microsoft’s cloud. I don’t think those credits are transferable to other providers.


The value in the proposition is OpenAI IP. Money and data centers are commodities easily replaced, especially when you hold the IP everyone wants a piece of.

The arrangement is mutually beneficial, but the owner of the IP holds the cards.


OpenAI doesn’t operate without the enormous amounts of funding MS gives it.


I think a lot of institutions and people would love the chance to give them money.


But how many of them have hot data centers to offer? Google is a direct competitor, so Oracle or Amazon are kinda the only other two big options to offer them what MS is right now.

If MS drops OpenAI, it's not like they can just seamlessly pivot to running their own data centers with no downtime, even with pretty high investment.


A relationship that’s mutually beneficial needn’t be symmetric. What Microsoft brings is fairly commoditized: money and GPUs. OpenAI controls the IP that matters.

I’d note that the supplier of GPUs is Nvidia, who also offers cloud GPU services and doesn’t have a stake in the GCP/Azure/AWS behemoth battle. I’d actually see that as a more natural relationship with less of a middleman.

The real value Azure brings is enterprise compliance chops. However, IMO AWS Bedrock seems to be a more successful enterprise integration point. But they’re all commodity products and don’t provide the value OpenAI brings to the relationship.


There was. Now the second gen of that goes for $15.


It's the third gen.

The Zero was $5, the Zero W was $10 and the Zero 2 W is $15.


The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from just looking at FLOPs.


If CPU softmax were limited by memory bandwidth, then these vectorization optimizations wouldn't improve performance.


Why does it disproportionately use bandwidth?


In transformers the attention matrix is N×N, so there are a lot of values to go over. That typically makes it memory-bandwidth bound, not compute bound.
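A rough illustration of why (the N, head dim and the ~5 flops per softmax element below are illustrative assumptions, not measured numbers): softmax does only a few flops per element of the N×N matrix but has to read and write every element, so its arithmetic intensity is tiny compared to the matmul that produced those scores.

```python
# Rough arithmetic-intensity comparison: softmax over an N x N score matrix
# vs. the QK^T matmul that produced it (head dim d). Flop counts are
# approximate; the point is the order-of-magnitude gap in flops per byte.
N, d = 8192, 128
bytes_per_val = 2                                    # bf16

softmax_flops = 5 * N * N                            # exp, max/sum reductions, divide (~5 ops/elem)
softmax_bytes = 2 * N * N * bytes_per_val            # read scores, write probabilities
matmul_flops  = 2 * N * N * d                        # 2*d flops per output element
matmul_bytes  = (2 * N * d + N * N) * bytes_per_val  # read Q and K, write scores

print(f"softmax: {softmax_flops / softmax_bytes:.2f} flops/byte")  # ~1.25
print(f"matmul:  {matmul_flops / matmul_bytes:.0f} flops/byte")    # ~124
```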


Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!

Next question: does the softmax in the SA block cause it to be bandwidth bound? Won’t it have to materialize all the entries of the N^2 matrix either way? Does the softmax cause redundant data reads?


Wouldn’t the softmax typically be “fused” with the matmul though?


Yes, but as far as I understand this is only really practical with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but when streaming you can't know the max activation up front, so you have to rescale the running sums whenever the max updates.)
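A minimal sketch of that online-softmax rescaling in plain NumPy (the block size and test vector are arbitrary; this is just the streaming trick on its own, not FlashAttention itself, which fuses it with the matmuls and works blockwise on-chip):

```python
import numpy as np

def online_softmax(x, block=4):
    """Numerically stable softmax computed in one streaming pass over blocks.

    Keeps a running max m and running denominator d; whenever a block raises
    the max, the old denominator is rescaled by exp(m_old - m_new). This is
    the rescaling step FlashAttention-style kernels rely on.
    """
    m = -np.inf          # running max
    d = 0.0              # running sum of exp(x - m)
    for i in range(0, len(x), block):
        xb = x[i:i + block]
        m_new = max(m, xb.max())
        d = d * np.exp(m - m_new) + np.exp(xb - m_new).sum()
        m = m_new
    return np.exp(x - m) / d

x = np.random.randn(10).astype(np.float32)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref, atol=1e-6)
```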


T5 is an architecture; t5x is a framework for training models that was created with that architecture in mind, but it can be used to train other architectures, including decoder-only ones (there is one in the examples).


t5x was used to train PaLM 1.


To quote their official response: "If the WSE weren't rectangular, the complexity of power delivery, I/O, mechanical integrity and cooling become much more difficult, to the point of impracticality."


Not right now.


Because SRAM stopped getting smaller with recent nodes.


This thing targets training, which isn't affected by tiny accelerators inside CPUs.


No, it's comparable to the 230 MB of SRAM on the Groq chip, since both of them are SRAM-only chips that can't really use external memory.

