> PCIe Gen4 does matter. M.2 NVMe has been read limited for a long time already (NAND bandwidth scales trivially). The I/O section of this article is basically nonsense.
I think the author's point was that storage is already plenty fast enough for many tasks. Personally, I can't feel the difference in storage performance between my NVMe system and older SATA SSD ones, despite NVMe being much faster.
> I can't feel the difference in storage performance between my NVMe
That's because, for the most common tasks, real performance hasn't improved.
Manufacturers always display the write/read speed measured with many cores and a huge queue depth. The typical use case that actually looks like that is copying large files, which runs at around 3000 MB/s.
When starting the OS or loading a program, the work happens at a low queue depth and thread count, which gets you more like 60-70 MB/s.
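Rough sketch if you want to see the gap yourself (Linux-only Python; the path is a placeholder and this is an illustration, not a proper benchmark). It approximates "queue depth" with N threads, each doing one synchronous 4 KiB random read at a time, and uses O_DIRECT so the page cache doesn't hide the drive:

    import mmap, os, random, time
    from concurrent.futures import ThreadPoolExecutor

    PATH = "/path/to/big/testfile"   # placeholder: any multi-GB file on the NVMe drive
    BLOCK = 4096                     # O_DIRECT needs aligned offsets, sizes and buffers
    READS_PER_THREAD = 20_000

    def worker(_):
        # Each thread gets its own fd and its own page-aligned buffer (anonymous
        # mmap is page-aligned, which satisfies O_DIRECT), and keeps exactly one
        # synchronous 4 KiB read in flight at a time.
        fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, BLOCK)
        blocks = os.fstat(fd).st_size // BLOCK
        for _ in range(READS_PER_THREAD):
            os.preadv(fd, [buf], random.randrange(blocks) * BLOCK)
        os.close(fd)

    def run(threads):
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(worker, range(threads)))
        secs = time.perf_counter() - t0
        mb = threads * READS_PER_THREAD * BLOCK / 1e6
        print(f"{threads:3d} threads: {mb / secs:6.0f} MB/s")

    for depth in (1, 4, 32):   # effective queue depth ~= thread count here
        run(depth)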
> Manufacturers always display the write/read speed measured with many cores...
Why does #cores matter? Once the IO is initiated, it's handed off to some DMA/coprocessor stuff so the core can get back to starting other IOs without contention. Being clueless, I can't see why a single core couldn't saturate IO bandwidth just by blasting out IO requests.
> ...and a huge queue depth.
Really stupid question now: I thought disk queues were just requests queued up because a buttload of IO was issued, so why does a queue depth of, say, 20 actually make things any faster than a queue depth of just 2?
I know (correction, believe) with spinny disks a longer queue could be used by the controller to optimise access on the disk surface by proximity, but with SSDs that performance characteristic doesn't exist (access is uniform anywhere, I think) so that optimisation doesn't apply.
If you're benchmarking 4kB IOs, then the system call and interrupt handling overhead means you can't keep a high-end NVMe SSD 100% busy with only a single CPU core issuing requests one at a time. The time it takes to move 4kB across a PCIe link is absolutely trivial compared to post-Spectre/Meltdown context switch times. A program performing random IO to a mmapped file will never stress the SSD as much as a program that submits batches of several IO requests using a low-overhead asynchronous IO API.
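To put rough numbers on that (both figures below are assumptions picked for illustration, not measurements of any particular drive):

    # Back-of-envelope sketch of the claim above; the two latencies are assumed.
    QD1_LATENCY_US = 80    # end-to-end time for one 4 KiB read at queue depth 1:
                           # syscall + device latency + interrupt + wakeup
    CPU_COST_US    = 3     # CPU burned per synchronous IO: syscall entry/exit,
                           # mitigation overhead, completion handling

    one_at_a_time = 1e6 / QD1_LATENCY_US   # one core, one request in flight
    syscall_bound = 1e6 / CPU_COST_US      # ceiling even with an infinitely fast drive

    print(f"one IO at a time:           ~{one_at_a_time:,.0f} IOPS (~{one_at_a_time * 4096 / 1e6:.0f} MB/s)")
    print(f"one-syscall-per-IO ceiling: ~{syscall_bound:,.0f} IOPS")
    # A high-end NVMe SSD is rated for 500k+ 4 KiB random-read IOPS, so a single
    # core issuing one blocking read at a time gets nowhere near it; batching
    # many submissions per syscall is how an async API closes the gap.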
> so why does a queue depth of, say, 20 actually make things any faster than a queue depth of just 2?
Because unlike almost all hard drives, SSDs can actually work on more than one outstanding request at a time. Consumer SSDs use controllers with four or eight independent NAND flash memory channels. Enterprise SSD controllers have 8-32 channels. And there's some amount of parallelism available between NAND flash dies attached to the same channel, and between planes on an individual die. Also, for writes, SSDs commonly buffer commands so several can be combined and issued as fewer actual NAND program operations.
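Here's a toy model of that, with made-up numbers (8 channels, ~80 us per 4 KiB read). It's nothing like a real controller, but it shows why a queue of 20 beats a queue of 2:

    # Toy SSD: N_CHANNELS independent units, each servicing one 4 KiB read
    # every READ_US microseconds. The host keeps `qd` requests outstanding.
    N_CHANNELS = 8
    READ_US = 80

    def toy_iops(qd):
        busy = min(qd, N_CHANNELS)      # can't keep more channels busy than exist
        return busy * 1e6 / READ_US

    for qd in (1, 2, 4, 8, 20):
        print(f"QD {qd:2d}: ~{toy_iops(qd):8,.0f} IOPS")

Real drives keep scaling somewhat past the channel count because of the die and plane parallelism mentioned above, but the shape of the curve is the same: throughput climbs with queue depth until the internal parallelism is saturated.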
One more thing: I'm surprised that 4KB blocks are relevant. I'd have thought that disk requests in benchmarking (edit: cos manufacturers like to cheat), and a lot of reads in the real world, would use much larger request sizes than 4K.
Is it that IOs are broken down into 4K blocks at the disk controller level, or is that done deliberately in benchmarking to stress the IO subsystem?
SSDs are traditionally marketed with sequential IO performance expressed in MB/s or GB/s, and random IO performance expressed in 4kB IOs per second (IOPS). Using larger block sizes for random IO will increase throughput in terms of GB/s but will almost always yield a lower IOPS number. Block sizes smaller than 4kB usually don't give any improvement to IOPS because the SSD's Flash Translation Layer is usually built around managing data with 4kB granularity.
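The two figures are really just unit conversions of each other at a given block size. With made-up rated numbers:

    # Same hypothetical drive, two marketing numbers (figures invented for illustration):
    for block_bytes, iops in ((4096, 600_000), (131_072, 25_000)):
        gbps = block_bytes * iops / 1e9
        print(f"{block_bytes // 1024:4d} KiB x {iops:>7,} IOPS = {gbps:.1f} GB/s")

So the 128 KiB workload posts a lower IOPS number but a higher GB/s figure, which is exactly the trade-off described above.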
> expressed in 4kB IOs per second (IOPS). Using larger block sizes for random IO will increase throughput in terms of GB/s but will almost always yield a lower IOPS number
and higher IOPS numbers give the marketing department the warm and fuzzies. Got it.
It is a good metric because it is a pure measure of random access. Bigger requests are a mixed measure of sequential performance and random access (and you can basically infer the performance for any IO size from the large-request bandwidth and the IOPS at the smallest reasonable IO size).
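Something like this, with invented figures: treat each request as a fixed per-IO cost plus size/bandwidth, pin the two constants using the 4 KiB IOPS and large-request bandwidth, and the whole curve falls out:

    # Crude model: time(request) = per_io + size / BW_SEQ.
    # Inputs are invented rated figures; per_io is derived so the model
    # reproduces the 4 KiB IOPS number exactly.
    IOPS_4K = 600_000                         # 4 KiB random-read IOPS
    BW_SEQ  = 3.3e9                           # large-request bandwidth, bytes/s
    per_io  = 1 / IOPS_4K - 4096 / BW_SEQ     # fixed cost per request, seconds

    def est_gbps(size_bytes):
        return size_bytes / (per_io + size_bytes / BW_SEQ) / 1e9

    for kib in (4, 16, 64, 256, 1024):
        print(f"{kib:5d} KiB requests: ~{est_gbps(kib * 1024):.2f} GB/s")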
We are building a new server cluster, and it looks like PCIe Gen4 will give us about a 40% performance boost for what is basically NVMe over Fabrics using Mellanox 100Gb cards (Azure HCI, aka S2D). Next year we will be building out another cluster and will hopefully use Mellanox 200Gb cards, which is only possible because of Gen4.
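Rough link arithmetic behind the 200Gb point (spec per-lane rates, ignoring protocol overhead, so real-world numbers land a bit lower):

    # Approximate usable PCIe per-lane rates after 128b/130b encoding, in GB/s.
    GEN3_PER_LANE = 0.985
    GEN4_PER_LANE = 1.969
    x16_gen3 = 16 * GEN3_PER_LANE        # ~15.8 GB/s
    x16_gen4 = 16 * GEN4_PER_LANE        # ~31.5 GB/s

    nic_100g = 100 / 8                   # ~12.5 GB/s of Ethernet line rate
    nic_200g = 200 / 8                   # ~25 GB/s

    print(f"x16 Gen3: {x16_gen3:.1f} GB/s   100GbE: {nic_100g:.1f} GB/s   200GbE: {nic_200g:.1f} GB/s")
    print(f"x16 Gen4: {x16_gen4:.1f} GB/s")
    # A 200Gb NIC wants ~25 GB/s, more than an x16 Gen3 slot can deliver,
    # which is why it only makes sense in a Gen4 slot.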