
DDR4 pages ("rows") are 8 KiB, with reads/writes being 64 B bursts into the "row" buffer. (That's for a full DIMM; the following goes a bit more into detail and doesn't strictly assume a full 64-data-bit (+ optional ECC) wide channel.) Each bank has a single "row" buffer into which a page has to be loaded ("activate" command) before reading/writing within that page, and the buffer has to be written back ("precharge" command) before a new page can be loaded or an internal bank refresh can be triggered.
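A minimal sketch of the legal command ordering described above (the class and method names are mine, not JEDEC's):

```python
# Toy model of a single DDR4 bank's row buffer; naming is illustrative.
class Bank:
    def __init__(self):
        self.open_row = None  # bank starts precharged (idle)

    def activate(self, row):
        # Load a row into the row buffer; only legal when precharged.
        assert self.open_row is None, "precharge before a new activate"
        self.open_row = row

    def read(self, row, col):
        # 64 B bursts can only come out of the currently open row.
        assert self.open_row == row, "row must be activated first"
        return ("data", row, col)

    def precharge(self):
        # Write the row buffer back; also required before a refresh.
        self.open_row = None

b = Bank()
b.activate(5)
b.read(5, 100)   # row-buffer hit: no further activate needed
b.precharge()    # must precede opening a different row
b.activate(9)
```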

"row" buffer locality is relatively important for energy efficiency, and can account for about a 2x factor in the DRAM's power consumption. Each rank has 4 bank groups of 4 banks each, with the frequency of "activate" commands to banks in the same group being restricted more severely than to separate bank groups. A rank is just a set of chips who's command/address lines are in parallel, and who's data lines are combined to get up to the typical 64/72 bits of a channel. A single chip is just 4, 8, or 16bit wide. A channel can have typically between IIRC 1 and 12 ranks, which are selected with separate chip select lines the controller uses to select which rank the contents of the command/address bus are meant for.

Also, "activate" and "precharge" take about as long (most JEDEC standard timings have literally the same number is cycles for these 3 timings (also known as the "primary" timings)) as the delay between selecting the "column" in a "row" buffer, and the corresponding data bits flowing over the data bus (for reads and writes there may iirc be a one-cycle difference due to sequencing and data buffers between the DDR data bus and the serdes that adapts the nominal 8-long bursts at DDR speeds to parallel accesses to the "row" buffer).

With JEDEC timings for e.g. DDR4-3200AA being 22-22-22 CL-tRCD-tRP ("column"-address-to-data delay; "row"-to-"column" address delay; "precharge"-to-"activate" delay), and a burst occupying 4 DDR cycles, this is far from cheap random access.

In fact, within a single bank, neglecting limits on activate frequency (as I'm too lazy to figure out whether you can hit them using just one bank), assuming infinitely many "rows" (and thus zero "row" locality for random accesses) and perfect pipelining, and ignoring the relatively rare "precharge"/"activate" pairs at the end of a 1024-column-wide "row" during streaming:

Streaming accesses would take 4 cycles per read/write of a 64 B cacheline (assuming a 64-bit (72 with normal SECDED ECC) DIMM; with fewer bits, the cacheline would be narrower), achieving 3200 Mbit/s per data pin. Random accesses would take 22+22+22=66 cycles per read/write, achieving 193 ³¹/₃₃ Mbit/s per data pin.

So random accesses are just 6 ²/₃₃ % efficient in this worst case, under simplified, pathological conditions. In practice, you have 16 banks to schedule your request queue across, and often multiple ranks (AFAIK typically one rank per 4/8/16 GiB on client platforms (max 4 ranks per channel on client, with 2 being optimal: 1 doesn't have enough banks, while 3-4 limit clock speeds due to limited transmitter power) and low-capacity servers; one rank per 16-32 GiB on high-capacity servers).
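The arithmetic behind those two figures can be checked with exact fractions (variable names are mine; the cycle counts are the DDR4-3200AA numbers from above):

```python
from fractions import Fraction

# Worst-case per-access costs from the numbers above (DDR4-3200AA).
cycles_stream = 4              # one 8-transfer burst = 4 clock cycles
cycles_random = 22 + 22 + 22   # tRP + tRCD + CL, fully serialized

pin_rate = 3200  # Mbit/s per data pin at full streaming utilization

efficiency  = Fraction(cycles_stream, cycles_random)  # 2/33
random_mbps = pin_rate * efficiency                   # 6400/33

print(float(random_mbps))       # ~193.94, i.e. 193 31/33 Mbit/s per pin
print(float(efficiency) * 100)  # ~6.06, i.e. 6 2/33 %
```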

There is a slight delay penalty for switching between ranks, IIRC most notable when reading from multiple ranks directly in sequence, or possibly when reading from one and subsequently writing to another (both are pretty bad, but I don't recall which one typically has more stall cycles between transmissions; IIRC it's around 3 cycles (= 6 bit times due to DDR; close to one wasted burst)).



> DDR4 pages ("rows") are 8kiB

Wait, is the underlying hardware access size for memory bigger than the default Linux page size (4 KiB)? Wouldn't that introduce needless inefficiencies?

Can false sharing happen between different pages if they happen to be in the same row?


It's an internal implementation detail, invisible to the CPU.

https://faculty-web.msoe.edu/johnsontimoj/EE4980/files4980/m...

Diagrams on pages 4-7.


Not really. Pages are about virtual memory and not about physical memory.

In practice, the only thing DDR4 banks do is make prefetching an important strategy, thus making sequential performance for DDR4 incredible.

A fact already known to high performance programmers. Accessing byte 65 after byte 64 is much more efficient than accessing byte 9001.


What you don't want is things accessed together residing in different rows of the same bank. Things being on the same row is a good thing.
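As a toy illustration of this (geometry and mapping assumed: 8 KiB rows striped across 16 banks, one open row per bank; real controllers swizzle address bits), one can count row-buffer hits for streamed vs. shuffled 64 B accesses:

```python
import random

ROW_BYTES = 8192   # 8 KiB row, as above
N_BANKS   = 16     # 4 bank groups x 4 banks

def row_hits(addrs):
    """Count open-row hits under a naive one-open-row-per-bank model.
    Assumes consecutive 8 KiB chunks are striped across banks."""
    open_row, hits = {}, 0
    for a in addrs:
        chunk = a // ROW_BYTES
        bank, row = chunk % N_BANKS, chunk // N_BANKS
        hits += open_row.get(bank) == row
        open_row[bank] = row
    return hits

random.seed(0)
seq = list(range(0, 1 << 24, 64))      # 16 MiB streamed in 64 B lines
rnd = random.sample(seq, len(seq))     # the same lines, shuffled

# Streaming misses only on the first line of each row (~99% hits);
# the shuffled order almost always lands on a closed row (roughly 1%).
print(row_hits(seq) / len(seq), row_hits(rnd) / len(rnd))
```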


Even for false sharing? That is, the problem where two unrelated atomics are allocated in the same cache line but accessed frequently from different threads, causing that line to thrash between two different cores' L1 caches.



