The Future of Memory (semiengineering.com)
55 points by PaulHoule 3 months ago | 28 comments



Not a single word on the actual BOM cost of DRAM. I can only wish we had the technology to make the current $1/GB sustainable and profitable.


Isn’t it a chronic problem (since at least the 1980s) that memory has cycles of gluts and shortages?


Yes. But I am referring to the actual production cost of DRAM, not its selling price, which goes through boom and bust cycles. BOM cost has been pretty much constant for the past 10-15 years, although one could argue that, adjusted for inflation, it has still gotten a bit cheaper.


At the same time that RAM costs have plateaued and now seem to be going up, non-volatile storage prices and speeds have been getting better and better.

We need more software engineering progress on paging & persistent storage systems.


Above 64GB you need registered RAM, which increases latency.

So as you increase bandwidth you reduce program speed.

Higher frequency results in more heat.

We are fast approaching the need for a Wii-like Broadway architecture, where the program runs in "fast" SRAM and the data sits in "slow" DDR.


The 7800X3D already has 96MB of L3. Surely that's enough for a lot of programs.


Yes, but the cost in terms of watts and manufacturing is not scalable.

Also, most programs have cache misses.


Why not have a 'pin-to-cache' functionality then?


What would it do when you pin more things than your cache can hold? Trigger an interrupt? It basically becomes another memory layer you'd need to manage.


The higher levels of the stack, be it hardware or software, need to make sure that never happens.

That's how GPUs do it. Each thread group can use a limited amount of SRAM; programs declare in advance how many bytes they need. Then at runtime the scheduler that dispatches tasks to cores enforces that limit by never dispatching too many thread groups on each core.
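
A minimal CUDA sketch of that model (the kernel, names, and the 256-float tile size are illustrative, not taken from anywhere in this thread): each block declares its shared-memory (SRAM) tile up front, and the hardware keeps only as many blocks resident per SM as that budget allows.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each block reserves its slice of on-chip shared memory up front; the
    // scheduler makes only as many blocks resident per SM as the SRAM budget
    // allows, so the limit is enforced by construction rather than by the OS.
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float tile[256];                    // declared in advance, per block
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;
        tile[tid] = (idx < n) ? in[idx] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction inside the tile
            if (tid < s) tile[tid] += tile[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = tile[0];
    }

    int main() {
        const int n = 1 << 20, threads = 256, blocks = n / threads;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, blocks * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        blockSum<<<blocks, threads>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("computed %d per-block partial sums\n", blocks);
        cudaFree(in); cudaFree(out);
        return 0;
    }

Occupancy APIs such as cudaOccupancyMaxActiveBlocksPerMultiprocessor will report how many blocks fit per SM given that shared-memory request.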


Well duh. How is using SRAM going to be any different when you run out of that?


I was not arguing in favor of explicitly tiered memory here. The implied answer to your original question, "why not have a pin-to-cache functionality?", is that it's effectively the same as having another OS-managed memory layer, which is bad since it complicates the architecture. I'll take some cache misses over having to manage it explicitly.


Not only that, if you had enough cache to fit everything then there wouldn't be cache misses, and if you didn't, cache misses are pretty unavoidable.

It's like the existing APIs for pinning things in memory so they can't get paged out. They have very specific uses and normal programs generally don't use them and shouldn't.
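
For reference, a minimal POSIX sketch of that kind of API (Linux/Unix host code; the 4 MiB buffer is arbitrary): mlock() asks the kernel to keep the pages resident so they can't be paged out.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4 * 1024 * 1024;          /* 4 MiB working set (arbitrary) */
        void *buf = malloc(len);
        if (!buf) return 1;
        memset(buf, 0, len);                   /* touch the pages so they exist */

        /* Pin: keep these pages resident so they are never swapped out.
           Subject to RLIMIT_MEMLOCK / CAP_IPC_LOCK, so ordinary programs can
           hit EPERM or ENOMEM -- one reason normal programs shouldn't bother. */
        if (mlock(buf, len) != 0) {
            perror("mlock");
            free(buf);
            return 1;
        }

        /* ... latency-critical work on buf ... */

        munlock(buf, len);
        free(buf);
        return 0;
    }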


Much of the cache "management" can be done with specialized load/store instructions that skip the cache, rather than being OS-managed like a mapping.
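
For example (x86 with SSE, written as plain host code; the alignment assumptions are noted in the comments), non-temporal stores let a bulk write bypass the cache so it doesn't evict the working set:

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy using non-temporal (streaming) stores: the written lines bypass the
       cache hierarchy and head toward DRAM via write-combining buffers, so a
       large one-shot copy doesn't push out hot data.
       Assumes dst and src are 16-byte aligned and n is a multiple of 4. */
    void stream_copy(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(src + i);   /* normal (cached) load         */
            _mm_stream_ps(dst + i, v);         /* non-temporal store (MOVNTPS) */
        }
        _mm_sfence();                          /* order the streaming stores   */
    }

Prefetch hints such as _mm_prefetch with _MM_HINT_NTA play a similar role on the load side.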


They certainly have this. A lot of embedded boot loaders run entirely from cache until they can bring main memory up and check it.


Sorry, can you explain what registered RAM is and why it increases latency?


Registered memory has a buffer for communication between the DRAM and the memory controller. So the DDR bus is attached to an intermediate buffer chip, rather than directly to the DRAM chips on the DIMM.

This can give the bus better electrical characteristics: the path from the DIMM connector to the buffer chip can have simplified routing and higher-power signaling without putting more load on the DRAM chips, and the buffer chip's design can focus on this interface signaling rather than compromising between that and the actual DRAM cells.

It's a bit more expensive, being an extra chip on each DIMM, and has a latency penalty, as the buffer chip means everything on the DDR bus is effectively 1 clock behind what the DRAM chips themselves provide. But it's often necessary if you have a large number of DIMMs on a single channel or very long traces required for packing lots of DIMMs around a CPU, as that increases the electrical capacitance and noise of each path, which many DRAM chips can struggle to drive, especially at higher speeds.

As DRAM chip density increases you can get higher capacities without the longer bus traces and extra DIMMs per channel that might require registered RAM. There's nothing "fundamental" about 64GB needing registered RAM, and you are already seeing 48GB DDR5 DIMMs that work on consumer platforms, which often have no issues running 4 DIMMs without registered RAM.


DDR5 chips are already semi-buffered. It was part of the major changes from DDR4 to DDR5.


I wonder if it's possible to design the next AI systems along with the hardware at the same time. For example, maybe by focusing on more approaches like mixture of experts or similar, there are ways to keep much of the data close to the cores that operate on it.


There are many possible future architectural changes that could help. Some are more feasible than others but all require fundamental advancement to be useful and cost effective.

Some of them include:

- GPU-CPU shared memory (already possible, just not yet the standard; see the sketch after this list)

- Higher DRAM bandwidth (already possible, just not yet a priority)

- system-on-chip FPGAs (always possible, just very expensive to fit "AI models")

- SoC NVM, ideally even NVM on the same wafer as the GPU and CPU (possible today but requires a lot of work on yield; NVM would take up a lot of real estate that could ruin yield)

- analog circuits

- new semiconductors / photonics

- memristors
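
On the GPU-CPU shared memory item above (the sketch referenced in that bullet), this is roughly what "already possible" looks like with CUDA managed memory; the kernel and sizes are illustrative:

    #include <cuda_runtime.h>
    #include <cstdio>

    // One allocation visible to both CPU and GPU; the driver migrates or maps
    // pages on demand, so there is no explicit copy in either direction.
    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));     // shared between host and device
        for (int i = 0; i < n; ++i) x[i] = 1.0f;      // written by the CPU
        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // updated by the GPU
        cudaDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);                  // read back on the CPU: 2.000000
        cudaFree(x);
        return 0;
    }

This still relies on page migration under the hood; a single physical pool shared by CPU and GPU would remove even that step, which is presumably what "not yet the standard" is getting at.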


That's called CPU cache. It doesn't require "mixture of experts" (whatever that would mean); it just needs transistors for SRAM.


That's one example of the more general category of what I am talking about. But I was trying to get just a little more specific.


Can you give another example and explain how "mixture of experts" gets data closer to a CPU?


I'm talking about GPUs and don't know the details very well. It was a rough idea.


> the more general category of what I am talking about. But I was trying to get just a little more specific

Can you get more specific then? Can you give any details or any overview? There must have been some information that led you to post this originally, can you link it?



Why is the favicon Apple?


Ha



