48 GB is at the tail end of what's reasonable for normal GPUs. The IO requires a lot of die space, and Intel's architecture is not very space efficient right now compared to Nvidia's.
And even if you spend a lot of die space on memory controllers, you can only fit so many GDDR chips around the GPU core while maintaining signal integrity. HBM sidesteps that issue but is still too expensive for anything but the highest-end accelerators, and the ordinary LPDDR that Apple uses lacks bandwidth compared to GDDR, so they have to compensate with ginormous amounts of IO silicon. The M4 Ultra is expected to have similar bandwidth to a 4090, but the former will need a 1024-bit bus to get there while the latter gets by with 384 bits.
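Rough napkin math, treating peak bandwidth as bus width times per-pin data rate (the LPDDR5X rate below is an assumption, since the M4 Ultra isn't out yet):

```python
# Peak memory bandwidth ≈ (bus width in bits / 8) * per-pin data rate in GT/s
def peak_bw_gb_s(bus_width_bits: int, data_rate_gt_s: float) -> float:
    return bus_width_bits / 8 * data_rate_gt_s

print(peak_bw_gb_s(384, 21.0))    # ~1008 GB/s -- RTX 4090, 384-bit GDDR6X at 21 GT/s
print(peak_bw_gb_s(1024, 8.533))  # ~1092 GB/s -- 1024-bit LPDDR5X at an assumed 8533 MT/s
```

Similar ballpark bandwidth, but the LPDDR part needs roughly 2.7x the bus width to get there.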
Going off of how the 4090 and 7900 XTX are arranged, I think you could maybe fit one or two more chips around the die beyond their 12, but that's still a far cry from 128 GB. That would probably need a shared bus like normal DDR, as you're not fitting that much capacity with 16 Gbit density.
Look at the 3090, which uses 24 chips (12 on one side of the board and 12 on the other). Pushing it to 32 is doable, and 32 is all you need to reach 128 GB of VRAM with the 32 Gbit GDDR7 chips that should be on the market in the near future.
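The capacity math is just chip count times per-chip density:

```python
# VRAM capacity in GB = chip count * per-chip density in Gbit / 8
def capacity_gb(chips: int, density_gbit: int) -> float:
    return chips * density_gbit / 8

print(capacity_gb(12, 16))  # 24 GB  -- a 4090-style layout, 12 chips of 16 Gbit GDDR6X
print(capacity_gb(24, 8))   # 24 GB  -- the 3090's clamshell layout with 8 Gbit chips
print(capacity_gb(32, 32))  # 128 GB -- 32 chips of the upcoming 32 Gbit GDDR7
```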
Where would you route the connections to the additional 4 groups of chips around the die? The PCIe connection needs to be there too, and the memory traces may not like power delivery running through them either.
Nvidia has done a 512-bit bus in the past. The 3090 has 4 groups of 3 chips on each side of the card; visually, switching to 4 groups of 4 looks doable. That said, I would not want to be the one responsible for the trace routing.
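For the bus-width side, a quick sketch assuming the usual 32-bit interface per GDDR device, halved per device in clamshell mode (which is how the 3090 hangs two chips off each channel):

```python
# A GDDR device normally occupies a 32-bit channel; clamshell mode pairs two
# devices on one channel at 16 bits each, doubling chip count but not bus width.
def max_chips(bus_width_bits: int, clamshell: bool) -> int:
    channels = bus_width_bits // 32
    return channels * 2 if clamshell else channels

print(max_chips(384, clamshell=True))  # 24 -- the 3090's arrangement
print(max_chips(512, clamshell=True))  # 32 -- what a 512-bit clamshell board would carry
```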
What if we did what others suggested was the practical limit, 48 GB, and then just put 2-3 cards in the system, maybe with a little bridge over a separate bus for them to communicate?
I believe that would need some software work from Intel, an area where they're lagging a bit after their delayed start. I'm also not sure how the frameworks themselves split up the inference work to avoid crossing GPUs, since the bandwidth between cards is horrible.
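For what it's worth, the usual trick is to split the model layer-wise across the cards (llama.cpp and similar tools can do this), so per forward pass only a single activation tensor crosses the PCIe link rather than any weights. A minimal PyTorch-style sketch of the idea, assuming two CUDA devices and made-up layer sizes:

```python
import torch
import torch.nn as nn

# Toy stack of transformer blocks split layer-wise across two GPUs.
# Weights stay resident on their assigned card; only the activation at the
# split point travels over PCIe each forward pass.
blocks = [nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
          for _ in range(8)]
first_half = nn.Sequential(*blocks[:4]).to("cuda:0")
second_half = nn.Sequential(*blocks[4:]).to("cuda:1")

def forward(embedded_tokens: torch.Tensor) -> torch.Tensor:
    x = first_half(embedded_tokens.to("cuda:0"))
    x = x.to("cuda:1")  # the only inter-GPU transfer: one (batch, seq, hidden) tensor
    return second_half(x)

out = forward(torch.randn(1, 128, 4096))  # e.g. a 128-token sequence, hidden size 4096
```

That transfer is a few MB per step, which even a chipset-attached slot handles fine; the painful case is tensor-parallel splits that need to sync at every layer.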
If we're being reasonable and say you're not using a modern HEDT CPU that costs a couple thousand dollars, the best a consumer motherboard can do right now is two CPU-attached x8 PCIe Gen 5 slots at ~32 GB/s each plus one x8 PCIe Gen 4 slot off the chipset at ~16 GB/s. I'm not sure a motherboard like that actually exists, but Intel's chipset should allow it; AMD only runs x4 to the chipset, so the third slot would be limited by that.
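Those slot numbers fall out of the per-lane rates (roughly 2 GB/s per lane for Gen 4 and 4 GB/s for Gen 5, ignoring protocol overhead):

```python
# Approximate usable PCIe bandwidth per lane, in GB/s
GB_S_PER_LANE = {4: 2.0, 5: 4.0}

def slot_bw_gb_s(gen: int, lanes: int) -> float:
    return GB_S_PER_LANE[gen] * lanes

print(slot_bw_gb_s(5, 8))  # 32 GB/s -- each CPU-attached x8 Gen 5 slot
print(slot_bw_gb_s(4, 8))  # 16 GB/s -- an x8 Gen 4 slot behind Intel's chipset link
print(slot_bw_gb_s(4, 4))  #  8 GB/s -- AMD's Gen 4 x4 chipset uplink, the real ceiling there
```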