
You can't get close to 50 GB/s of bandwidth to a single core, even if your memory configuration and controller support that, because per-core bandwidth is limited by the latency per cache line and the maximum number of concurrent requests. There is a maximum of 10 outstanding requests that missed in L1 per core, so if you take a (very good) latency to DRAM of 50 ns, you can reach only 64 bytes/line * 10 lines outstanding / 50 ns = 12.8 GB/s.

In fact, you'll observe pretty much exactly that limit if you disable hardware prefetching. Of course, almost no one does that, and it turns out you can do better than this limit thanks to hardware L2 prefetching (but not software prefetching, which is restricted by the same 10 LFBs as above): the concurrency from L2 to the uncore and beyond is higher than 10 (perhaps 16-20 entries) and the latency from that point is lower, but you are still concurrency limited, to around 30 GB/s on modern Intel client cores.
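
To put numbers on it, here is a back-of-the-envelope sketch (the 40 ns L2-to-DRAM latency and the 16-20 entry superqueue depth are the rough estimates discussed above, not measured values):

    #include <stdio.h>

    /* Little's law for memory requests:
       bandwidth = line size * outstanding requests / latency */
    static double bound_gbps(double line_bytes, double outstanding, double latency_ns) {
        return line_bytes * outstanding / latency_ns;   /* bytes per ns == GB/s */
    }

    int main(void) {
        printf("demand loads, 10 LFBs @ 50 ns:       %.1f GB/s\n", bound_gbps(64, 10, 50));
        printf("L2 prefetch, 16 SQ entries @ 40 ns:  %.1f GB/s\n", bound_gbps(64, 16, 40));
        printf("L2 prefetch, 20 SQ entries @ 40 ns:  %.1f GB/s\n", bound_gbps(64, 20, 40));
        return 0;
    }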

On server cores, the latency is much higher, closer to 100 ns, so the per-core bandwidth is much lower, while the memory subsystem usually has higher total throughput, so it takes many more active server cores to saturate the memory bandwidth than on the client (where a single core just about does it on dual-channel systems).
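
(Plugging the same back-of-the-envelope formula in with server-like numbers, purely as an illustration: 64 B * 10 LFBs / 100 ns ≈ 6.4 GB/s for demand loads, and roughly 64 B * 16-20 entries / ~90 ns ≈ 11-14 GB/s with L2 prefetching - well under half of what a client core manages.)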




But isn't the situation of base64 encoding a "streaming" situation? I agree with you that the 10-outstanding requests + latency metric is useful for a random-access situation. But base64 encoding (and decoding) starts at the beginning of something in memory, and keeps going till you reach the end.

It's about as ideal a streaming case as you can get. No hash tables or anything involved. And WITH hardware prefetchers enabled, DDR4 becomes so much faster.

http://www.sisoftware.eu/wp-content/uploads/2017/08/amd-tr-m...

The measured "streaming" latency numbers seen in that picture are likely due to the hardware prefetcher. "Streaming" data from DDR4 is incredibly efficient! An order of magnitude more efficient than random-walks.

Intel's memory controller also seems to be optimized for DDR4 page hits (a term meaning the DDR4 stick only needs a CAS command to fetch the memory; normally, DDR4 requires RAS+CAS or even PRE+RAS+CAS, which increases latency significantly). So if you can keep your data walk inside a DDR4 page, you can drop the latency from ~80 ns to ~20 ns on Skylake (but not Ryzen).
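
For a rough sense of the DRAM-side numbers, here is a sketch using common DDR4-3200 CL22-22-22 timings (illustrative JEDEC-style values, not measurements of any particular platform; the rest of the load-to-use latency is the core, caches, ring and memory controller, which page hits don't remove):

    #include <stdio.h>

    int main(void) {
        double tCK  = 0.625;       /* DDR4-3200: 1600 MHz I/O clock -> 0.625 ns/cycle */
        double tCL  = 22 * tCK;    /* CAS latency */
        double tRCD = 22 * tCK;    /* activate-to-read (RAS-to-CAS) delay */
        double tRP  = 22 * tCK;    /* row precharge time */

        printf("page hit      (CAS only):     %4.1f ns\n", tCL);              /* ~13.8 ns */
        printf("page miss     (RAS+CAS):      %4.1f ns\n", tRCD + tCL);       /* ~27.5 ns */
        printf("page conflict (PRE+RAS+CAS):  %4.1f ns\n", tRP + tRCD + tCL); /* ~41.2 ns */
        return 0;
    }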

--------

With that said, my previous post seems a bit optimistic: it doesn't take into account CAS commands and the like (which eat into the theoretical bandwidth of DDR4 RAM). DDR4-3200 RAM practically delivers ~31 GB/s on a dual-channel setup.

Still, I stand by the theory. If you "stream" beginning-to-end some set of data from DDR4 (WITH hardware prefetchers, that's the important part! Don't disable those), then you can achieve outstanding DDR4 bandwidth numbers.


Yes, it's a streaming situation - and that's my point: contrary to conventional wisdom, even the streaming situation - for a single core - is generally limited by concurrency and latency, and not by DRAM bandwidth, and there is just no way around it.

Now a streaming load, measured in cache line throughput, ends up about twice as fast as a random access load: the former is mostly limited by concurrency in the "superqueue" between L2 and DRAM, which is larger (about 16-20 entries) and has a lower total latency than the full path from L1, while the latter is limited by the concurrency of the 10 LFBs between L1 and L2.

Prefetch isn't magic: it largely has to play by the same rules as "real" memory requests. In particular, it has to use the same limited queue entries that regular requests do. The only real "trick" is that L2 prefetchers (i.e., the ones that observe accesses to L2 and issue prefetches from there) get to start their journey from the L2, which means they avoid the LFB bottleneck that applies to demand requests (and to software prefetches and L1 hardware prefetches).

You don't have to take my word for it though: it's easy to test. First you can turn off prefetchers and observe that bandwidth numbers are almost exactly as predicted by the 10 LFBs. Then you can turn on prefetchers (one by one, if you want - so you see the L2 streamer is the only one that helps), and you'll see that the bandwidth is almost exactly as predicted based on 16-20 superqueue entries and a latency about 10 ns less than the full latency.
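
If you want to try it, a minimal single-core streaming-read probe looks something like this (a sketch only: the buffer size and repeat count are arbitrary, compile with -O3, and on Linux the Intel prefetchers can typically be toggled via MSR 0x1A4 with msr-tools, though the exact bits are model specific):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    int main(void) {
        size_t bytes = 1ull << 28;                   /* 256 MiB: far bigger than any LLC */
        size_t n = bytes / sizeof(uint64_t);
        uint64_t *buf = malloc(bytes);
        for (size_t i = 0; i < n; i++) buf[i] = i;   /* touch every page up front */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t sum = 0;
        int reps = 10;
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];                       /* sequential "streaming" reads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("sum=%llu  %.1f GB/s\n", (unsigned long long)sum,
               (double)bytes * reps / secs / 1e9);
        free(buf);
        return 0;
    }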

This is most obvious when you compare server and client parts. My laptop with a wimpy i7-6700HQ has a per-core bandwidth of about 30 GB/s, but a $10,000 Haswell/Broadwell/Skylake-X E5 server part, with 4 channels and faster DRAM, will have a per-core bandwidth of about half that, because it has a latency roughly 2x that of the client parts and it is _latency limited_ even for streaming loads! You find people confused about this all the time.

Of course, if you get enough cores streaming (usually about 4) you can still use all the bandwidth on the server parts and you'll become bandwidth limited.

There are other ways to see this in action, such as via NT stores, which get around the LFB limit because they don't occupy a fill buffer for the entire transaction, but instead only long enough to hand off to the superqueue.
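
For example, a sketch of a non-temporal fill using AVX2 intrinsics (the function name and alignment assumptions are mine, just to show the mechanism):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Fill dst with a repeated 64-bit value using non-temporal stores.
       Assumes dst is 32-byte aligned and len is a multiple of 32 bytes. */
    void fill_nt(void *dst, uint64_t value, size_t len) {
        __m256i v = _mm256_set1_epi64x((long long)value);
        char *p = (char *)dst;
        for (size_t i = 0; i < len; i += 32)
            _mm256_stream_si256((__m256i *)(p + i), v);  /* bypasses the cache hierarchy */
        _mm_sfence();  /* NT stores are weakly ordered; fence before anyone reads the data */
    }

Comparing the bandwidth of something like this against a regular store loop is another easy way to see the fill-buffer limit in action.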

This is covered in more detail in this SO answer:

https://stackoverflow.com/a/43574756/149138


This is a fabulous exchange. If you (or dragontamer) are interested in collaborating with Daniel (the blog author) on future academic papers where this level of architectural detail is relevant, I'm sure he'd appreciate hearing from you by email. Or contact me (email in profile) and I'll forward.


I appreciate the elaboration. I wasn't aware of this issue.



