Because you have to do inference distributed across multiple nodes at this point: for prefill, because prefill is quadratic in sequence length, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's ~36 H200s just for the KV cache, and you'd need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, about 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's $1,440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only ~$700 per million tokens. There are architectural modifications that will bring that down significantly, but nonetheless it's still really, really expensive, and also quite hard to get to work.
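A quick sanity check on those numbers (a back-of-the-envelope sketch assuming Llama 3.1 405B's published shape: 126 layers, 8 KV heads under GQA, head dim 128, and 141 GB per H200; real deployments need extra headroom for activations):

```python
# Back-of-the-envelope for the figures above. Assumed model shape:
# Llama 3.1 405B -- 126 layers, 8 KV heads (GQA), head dim 128.
layers, kv_heads, head_dim = 126, 8, 128
bytes_per_elem = 2                       # bf16
ctx = 10_000_000                         # 10M-token context

# K and V per token, across all layers
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
kv_total_tb = kv_per_token * ctx / 1e12
print(f"KV cache: {kv_total_tb:.1f} TB")                  # ~5.2 TB

h200_gb = 141
weights_gb = 405e9 * bytes_per_elem / 1e9                  # ~810 GB of bf16 weights
gpus = (kv_total_tb * 1000 + weights_gb) / h200_gb
print(f"~{gpus:.0f} H200s before activation headroom")     # ~42; ~48 with headroom

tokens_per_hour = 100_000                # ~30 tok/s, rounded
cost_per_mtok = 48 * 3.0 * 1e6 / tokens_per_hour
print(f"${cost_per_mtok:,.0f} per million tokens")         # $1,440
```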
How so? As far as I can tell, Microsoft has a large equity interest in OpenAI, and OpenAI has a lot of cloud credits usable on Microsoft’s cloud. I don’t think those credits are transferable to other providers.
The value in the proposition is OpenAI's IP. Money and data centers are commodities, easily replaced, especially when you hold the IP everyone wants a piece of.
The arrangement is mutually beneficial, but the owner of the IP holds the cards.
But how many of them have hot data centers to offer? Google is a direct competitor, so Oracle and Amazon are about the only other big options to offer them what MS does right now.
If MS drops OpenAI, it's not like they can just seamlessly pivot to running their own data centers with no downtime, even with pretty high investment.
A relationship that’s mutually beneficial needn’t be symmetric. Microsoft’s relationship is fairly commoditized - money and GPUs. OpenAI controls the IP that matters.
I’d note that the supplier of GPUs is Nvidia, who also offers cloud GPU services and doesn’t have a stake in the GCP/Azure/AWS behemoth battle. I’d actually see that as a more natural, less-middleman relationship.
The real value Azure brings is enterprise compliance chops. However, IMO AWS Bedrock seems to be a more successful enterprise integration point. But they’re all commodity products and don’t provide the value OpenAI brings to the relationship.
The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at FLOPs alone.
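A rough way to see it is arithmetic intensity, i.e. FLOPs per byte moved. The shapes below are illustrative, not from any particular model:

```python
# Rough arithmetic-intensity comparison; shapes are illustrative
# (one attention head, sequence length n, head dim d, bf16 = 2 bytes/elem).
n, d, b = 8192, 128, 2

# QK^T matmul: ~2*n*n*d FLOPs over roughly (2*n*d + n*n) elements moved
matmul_flops = 2 * n * n * d
matmul_bytes = (2 * n * d + n * n) * b
print(f"matmul:  ~{matmul_flops / matmul_bytes:.0f} FLOPs/byte")    # ~124

# softmax over the n x n score matrix: a handful of FLOPs per element,
# but every element is read and written once
softmax_flops = 5 * n * n          # max, subtract, exp, sum, divide (rough)
softmax_bytes = 2 * n * n * b      # read + write n^2 elements
print(f"softmax: ~{softmax_flops / softmax_bytes:.1f} FLOPs/byte")  # ~1.2
```

For scale, an H200 needs roughly 200 FLOPs per byte to stay compute-bound (~989 bf16 TFLOPs over ~4.8 TB/s), so the matmul is near the knee while a naive softmax is hopelessly bandwidth-bound.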
Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!
Next question: does the softmax in the SA block cause it to be bandwidth-bound? Won’t it have to materialize all N^2 entries of the attention matrix either way? Does the softmax cause redundant data reads?
Yes, but as far as I understand, avoiding that materialization is only really possible with FlashAttention. (The main idea is that you compute the softmax with the log-sum-exp trick, but since you can't know the max activation up front, you keep a running max and rescale everything accumulated so far whenever it changes.)
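A minimal NumPy sketch of that online-softmax rescaling (one query against a stream of score/value blocks; block size and shapes are arbitrary):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=32):
    """Compute softmax(scores) @ values one block at a time, keeping only a
    running max, a running normalizer, and a running unnormalized output --
    never the full probability vector over all positions at once."""
    m = -np.inf                        # running max of logits seen so far
    denom = 0.0                        # running sum of exp(score - m)
    acc = np.zeros(values.shape[1])    # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # running max went up -> rescale history
        denom = denom * scale + np.exp(s - m_new).sum()
        acc = acc * scale + np.exp(s - m_new) @ v
        m = m_new
    return acc / denom

# sanity check against the naive, fully materialized softmax
rng = np.random.default_rng(0)
scores, values = rng.normal(size=100), rng.normal(size=(100, 4))
p = np.exp(scores - scores.max())
p /= p.sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), p @ values)
```

The actual FlashAttention kernel does this per tile of the score matrix in on-chip SRAM, which is why the N^2 matrix never has to hit HBM.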
t5 is an architecture; t5x is a framework for training models that was created with that architecture in mind, but it can be used to train other architectures, including decoder-only ones (there is one in the examples).
To quote their official response: "If the WSE weren't rectangular, the complexity of power delivery, I/O, mechanical integrity and cooling become much more difficult, to the point of impracticality."