
> ...misconceptions... Yet if you skim through specs of modern NVMe devices you see commodity devices with latencies in the microseconds range and several GB/s of throughput supporting several hundred thousands random IOPS. So where’s the disconnect?

Whoa there... let's not compare devices with 20+ GB/s and nanosecond-range latencies, which translate to half a dozen giga-ops per second (aka RAM), with any kind of flash-based storage just yet.



The advertised bandwidth for RAM is not actually what you get per-core, which is what you care about in practice.

If you want to know the upper bound on your per-core RAM bandwidth:

64 bytes (the size of a cache line) * 10 slots (in a CPU core's LFB, or line fill buffer) / 100 ns (the typical cost of a cache miss) * 1,000,000,000 (to convert per-nanosecond to per-second) = 6,400,000,000 bytes per second ≈ 5.96 GiB per second of RAM bandwidth per core

There's no escaping that upper bound per core.

Nanosecond RAM latencies don't help much when you're capped by the line fill buffer and queueing delay kicks in, spiking your cache miss latencies. You can only fetch 10 lines at a time per core, and when you exceed your 5.96 GiB per second budget your access times increase.
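To see how sensitive that ceiling is to the miss latency, here's a trivial sketch of the same arithmetic (the 64-byte line and 10 LFB slots are the assumed figures above; the three latencies are just illustrative):

    #include <stdio.h>

    /* Per-core DRAM bandwidth ceiling = bytes in flight / miss latency.
       Line size and slot count are the assumed figures from above. */
    int main(void) {
        const double line_bytes = 64.0, lfb_slots = 10.0;
        const double miss_ns[] = {70.0, 100.0, 130.0};   /* illustrative latencies */

        for (int i = 0; i < 3; i++) {
            double bytes_per_sec = line_bytes * lfb_slots / miss_ns[i] * 1e9;
            printf("%3.0f ns miss -> %.2f GiB/s per core\n",
                   miss_ns[i], bytes_per_sec / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }

At 100 ns per miss this prints the 5.96 GiB/s figure; shave the latency or add more outstanding-miss slots and the ceiling rises accordingly.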

If you take NVMe SSD throughput with Direct I/O plus io_uring, around 32 GiB per second, and divide that by 10 to account for the difference in access latencies, then I think the author is about right on target. The point they are making is valid: it's the same order of magnitude.
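For anyone who hasn't seen that API stack, here's a minimal sketch of what Direct I/O plus io_uring looks like via liburing (the file name is a placeholder and this submits a single read just to show the shape; to approach a drive's advertised throughput you'd keep dozens or hundreds of reads in flight). Build with gcc file.c -luring.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1; }

        /* "some_large_file" is a placeholder; O_DIRECT bypasses the page cache */
        int fd = open("some_large_file", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) return 1;  /* O_DIRECT wants aligned buffers */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);       /* one 4 KiB read at offset 0 */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }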


While I was in the hospital ICU earlier this year, I promised myself I would build a zen 3 desktop when it came out despite my 10 year old desktop still working just fine.

I've since bought all the pieces but the CPU; they are all sold out. So I got a 6-core 3600XT in the interim. I bought fairly high-binned RAM and overclocked it to 3600 MHz, and was surprised to cap out at about 36 GB/s throughput. Your 6 GiB/s per core explanation checks out for me!


Cool! I had a similar empirical experience working on a Cauchy Reed-Solomon encoder a while back, which is essentially measuring xor speed, but I just couldn't get it past 6 GiB/s per core either, until I guessed I was hitting memory bandwidth limits. Only a few weeks ago I stumbled on the actual formula to work it out!
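If you want to reproduce the kind of measurement I mean, a naive single-core sketch looks roughly like this (buffer size and pass count are arbitrary, build with -O2, and the result will vary a lot by machine and prefetcher behaviour):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE (64u * 1024 * 1024)   /* 64 MiB, well past L3 so we hit DRAM */
    #define PASSES   16

    int main(void) {
        uint64_t *a = malloc(BUF_SIZE), *b = malloc(BUF_SIZE);
        memset(a, 0xAA, BUF_SIZE); memset(b, 0x55, BUF_SIZE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < PASSES; p++)
            for (size_t i = 0; i < BUF_SIZE / sizeof(uint64_t); i++)
                a[i] ^= b[i];              /* load a, load b, store a */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* roughly 3 bytes of memory traffic per byte xored (2 loads + 1 store) */
        double gib = (double)BUF_SIZE * PASSES * 3 / secs / (1 << 30);
        printf("~%.1f GiB/s of memory traffic on one core\n", gib);
        free(a); free(b);
        return 0;
    }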


> capped by the line fill buffer and queuing delay kicks in spiking your cache miss

Could you point me to a little reading material on this? I know what an LFB is, more or less, but what is queueing delay, and how does that relate to cache misses? Thanks.


Sure, I'm still pretty fuzzy on these things, but queueing delay is Little's law: https://en.wikipedia.org/wiki/Little's_law

It means that if a system can only do X of something per second, and you push it past that, newly arriving work has to wait behind the work already in the queue, so everything takes longer than it would if the queue were empty. You can think of it like a traffic jam, and it applies to most systems.
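To put toy numbers on it: if requests arrive at 100 per second and each one spends half a second in the system on average, Little's law says about 100 × 0.5 = 50 requests are sitting in the queue at any moment. Turned around for the LFB case above: 10 outstanding misses ÷ 100 ns per miss is the most misses a core can retire per unit time, which is exactly where that 5.96 GiB/s ceiling comes from.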

For example, our local radio station here in Cape Town loves to talk about "queuing traffic" when they do the 8am traffic report, and I always think of Little's law.

Bufferbloat is another example of queueing delay, e.g. where you fill your network router's buffer with, say, a large Gmail attachment upload and spike ping times for everyone else sharing the same WiFi.

Here is where I got the per-core bandwidth calculation from: https://www.eidos.ic.i.u-tokyo.ac.jp/~tau/lecture/parallel_d...


Appreciated, thanks


What about prefetching? Tiger Lake gets over 20 GB/s per core. https://www.anandtech.com/show/16084/intel-tiger-lake-review...


From your link

> In the DRAM region we’re actually seeing a large change in behaviour of the new microarchitecture, with vastly improved load bandwidth from a single core, increasing from 14.8GB/S to 21GB/s

Yeah, that's odd. But the article's really about cache, so maybe it's a mistake. Next para says

> More importantly, memory copies between cache lines and memory read-writes within a cache line have respectively improved from 14.8GB/s and 28GB/s to 20GB/s and 34.5GB/s.

so it looks like it's talking about cache not ram but... shrug


Beats me!


The article isn't exactly conflating RAM and flash; if it were, the conclusions would be very different. A synchronous blocking IO API is fine if you're working with nanosecond latencies of RAM, or with storage that's as painfully slow and serial as a mechanical hard drive.

Flash is special in that its latency is still considerably higher than that of DRAM, but its throughput can get reasonably close once you have more than a handful of SSDs in your system (or if you're willing to compare against the DRAM bandwidth of a decade-old PC). Extracting the full throughput from a flash SSD despite the higher-than-DRAM latency is what requires more suitable APIs (if you're doing random IO; sequential IO performance is easy).


Sustained read/write speeds are also different from peak speeds on SSD vs RAM.


True, but that applies more to writes than reads. Most real-world workloads do a lot more reads than writes, and what writes they do perform can usually tolerate a lot of buffering in the IO stack to further reduce the overall impact of low write performance on the underlying storage.


We’ve been using Ethernet cards for storage because the network round trip to RAM over TCP/IP on another machine in the same rack is far cheaper than accessing local storage. Latency compared to that option is likely the most noteworthy performance gain.

My understanding of distributed computing history is that the last time network>local storage happened was in the 80’s, and most of the rest of the history of computing, moving the data physically closer to the point of usage has always been faster.

Just as then, we’ve taken a pronounced software architecture detour. This one has lasted much longer, but it can’t really last forever. With this new generation of storage, we’ll probably see a lot of people trotting out 90’s era system designs as if they are new ideas rather than just regression to the mean.

Same as it ever was.


Depending on the storage technology, the comparison to RAM is not that far off. Intel is trying to market it that way anyway [0]. It's obviously not RAM, but it's not the <500GB 5200RPM SATA 3 Gb/s disk I started programming on.

[0] - https://www.intel.com/content/www/us/en/architecture-and-tec...


Yeah, back in 2014, I worked at HP on storage drivers for linux, and we got 1 million IOPS (4k random reads) on a single controller, with SSDs, but we had to do some fairly hairy stuff. This was back when NVME was new and we were trying to do SCSI over PCIe. We set up multiple ring buffers for command submission and command completion, one each per CPU, and pinned threads to CPUs and were very careful to avoid locking (e.g. spinlocks, etc.). I think we also had to pin some userland processes to particular CPUs to avoid NUMA induced bottlenecks.
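For anyone curious what the thread-pinning part looks like in practice, here's a minimal sketch (not the HP driver code, obviously; the CPU number is arbitrary):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one CPU so its submissions and completions
       stay on that CPU's ring and NUMA node (cpu_id is illustrative). */
    static int pin_to_cpu(int cpu_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void) {
        if (pin_to_cpu(0) != 0) {
            perror("pthread_setaffinity_np");
            return 1;
        }
        printf("pinned to CPU 0\n");
        return 0;
    }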

The thing is, up until this point, for the entire history of computers, storage was so relatively slow compared to memory and the CPU that drivers could be quite simple, chuck requests and completions into queues managed by simple locking, and the fraction of time that requests spent inside the driver would still be negligible compared to the time they spent waiting for the disks. If you could theoretically make your driver infinitely fast, this would only amount to maybe a 1% speedup. So there was no need to spend a lot of time thinking about how to make the driver super efficient. Until suddenly there was.


Oh yeah, iirc, the 1M IOPS driver was a block driver. For the SCSI over PCIe stuff, there was the big problem at the time that the entire SCSI layer in the kernel was a bottleneck, so you could make the driver as fast as you wanted, but your requests were still coming through a single queue managed by locks, so you were screwed. There was a whole ton of work done by Christoph Hellwig, Jens Axboe and others to make the SCSI layer "multiqueue" around that time to fix that.



