> PCIe Gen4 does matter. M.2 NVMe has been read limited for a long time already (NAND bandwidth scales trivially). The I/O section of this article is basically nonsense.
I think the author's point was that storage is already plenty fast enough for many tasks. Personally, I can't feel the difference in storage performance between my NVMe system and older SATA SSD ones, despite NVMe being much faster.
> I can't feel the difference in storage performance between my NVMe
That's because, for the most common tasks, real performance hasn't improved.
Manufacturers always display the write/read speed measured with many cores and a huge queue depth. The typical use case that actually looks like that is copying large files, which runs at around 3000 MB/s.
When starting the OS or loading a program, the work happens at a low queue depth and thread count, which gets you more like 60-70 MB/s.
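Rough sketch if you want to see the gap yourself (Linux-only Python; the path is a placeholder and this is an illustration, not a proper benchmark). It approximates "queue depth" with N threads, each doing one synchronous 4 KiB random read at a time, and uses O_DIRECT so the page cache doesn't hide the drive:

    import mmap, os, random, time
    from concurrent.futures import ThreadPoolExecutor

    PATH = "/path/to/big/testfile"   # placeholder: any multi-GB file on the NVMe drive
    BLOCK = 4096                     # O_DIRECT needs aligned offsets, sizes and buffers
    READS_PER_THREAD = 20_000

    def worker(_):
        # Each thread gets its own fd and its own page-aligned buffer (anonymous
        # mmap is page-aligned, which satisfies O_DIRECT), and keeps exactly one
        # synchronous 4 KiB read in flight at a time.
        fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, BLOCK)
        blocks = os.fstat(fd).st_size // BLOCK
        for _ in range(READS_PER_THREAD):
            os.preadv(fd, [buf], random.randrange(blocks) * BLOCK)
        os.close(fd)

    def run(threads):
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(worker, range(threads)))
        secs = time.perf_counter() - t0
        mb = threads * READS_PER_THREAD * BLOCK / 1e6
        print(f"{threads:3d} threads: {mb / secs:6.0f} MB/s")

    for depth in (1, 4, 32):   # effective queue depth ~= thread count here
        run(depth)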
> Manufacturers always display the write/read speed measured with many cores...
Why does #cores matter? Once the IO is initiated, it's handed off to some DMA/coprocessor stuff so the core can get back to starting other IOs without contention. Being clueless, I can't see why a single core couldn't saturate IO bandwidth just by blasting out IO requests.
> ...and a huge queue depth.
Really stupid question now: I thought disk queues were just requests queued up because a buttload of IO was issued, so why does a queue depth of, say, 20 actually make things any faster than a queue depth of just 2?
I know (correction, believe) with spinny disks a longer queue could be used by the controller to optimise access on the disk surface by proximity, but with SSDs that performance characteristic doesn't exist (access is uniform anywhere, I think) so that optimisation doesn't apply.
If you're benchmarking 4kB IOs, then the system call and interrupt handling overhead means you can't keep a high-end NVMe SSD 100% busy with only a single CPU core issuing requests one at a time. The time it takes to move 4kB across a PCIe link is absolutely trivial compared to post-Spectre/Meltdown context switch times. A program performing random IO to a mmapped file will never stress the SSD as much as a program that submits batches of several IO requests using a low-overhead asynchronous IO API.
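To put rough numbers on that (both figures below are assumptions picked for illustration, not measurements of any particular drive):

    # Back-of-envelope sketch of the claim above; the two latencies are assumed.
    QD1_LATENCY_US = 80    # end-to-end time for one 4 KiB read at queue depth 1:
                           # syscall + device latency + interrupt + wakeup
    CPU_COST_US    = 3     # CPU burned per synchronous IO: syscall entry/exit,
                           # mitigation overhead, completion handling

    one_at_a_time = 1e6 / QD1_LATENCY_US   # one core, one request in flight
    syscall_bound = 1e6 / CPU_COST_US      # ceiling even with an infinitely fast drive

    print(f"one IO at a time:           ~{one_at_a_time:,.0f} IOPS (~{one_at_a_time * 4096 / 1e6:.0f} MB/s)")
    print(f"one-syscall-per-IO ceiling: ~{syscall_bound:,.0f} IOPS")
    # A high-end NVMe SSD is rated for 500k+ 4 KiB random-read IOPS, so a single
    # core issuing one blocking read at a time gets nowhere near it; batching
    # many submissions per syscall is how an async API closes the gap.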
> so why does a queue depth of, say, 20 actually make things any faster than a queue depth of just 2?
Because unlike almost all hard drives, SSDs can actually work on more than one outstanding request at a time. Consumer SSDs use controllers with four or eight independent NAND flash memory channels. Enterprise SSD controllers have 8-32 channels. And there's some amount of parallelism available between NAND flash dies attached to the same channel, and between planes on an individual die. Also, for writes, SSDs commonly buffer commands so several can be combined and issued as fewer actual NAND program operations.
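Here's a toy model of that, with made-up numbers (8 channels, ~80 us per 4 KiB read). It's nothing like a real controller, but it shows why a queue of 20 beats a queue of 2:

    # Toy SSD: N_CHANNELS independent units, each servicing one 4 KiB read
    # every READ_US microseconds. The host keeps `qd` requests outstanding.
    N_CHANNELS = 8
    READ_US = 80

    def toy_iops(qd):
        busy = min(qd, N_CHANNELS)      # can't keep more channels busy than exist
        return busy * 1e6 / READ_US

    for qd in (1, 2, 4, 8, 20):
        print(f"QD {qd:2d}: ~{toy_iops(qd):8,.0f} IOPS")

Real drives keep scaling somewhat past the channel count because of the die and plane parallelism mentioned above, but the shape of the curve is the same: throughput climbs with queue depth until the internal parallelism is saturated.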
One more thing: I'm surprised that 4KB blocks are relevant. I'd have thought that disk requests in benchmarking (edit: cos manufacturers like to cheat), and a lot of reads in the real world, would use much larger request sizes than 4K.
Is it that IOs are broken down into 4K blocks at the disk controller level, or is that done deliberately in benchmarking to stress the IO subsystem?
SSDs are traditionally marketed with sequential IO performance expressed in MB/s or GB/s, and random IO performance expressed in 4kB IOs per second (IOPS). Using larger block sizes for random IO will increase throughput in terms of GB/s but will almost always yield a lower IOPS number. Block sizes smaller than 4kB usually don't give any improvement to IOPS because the SSD's Flash Translation Layer is usually built around managing data with 4kB granularity.
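The two figures are really just unit conversions of each other at a given block size. With made-up rated numbers:

    # Same hypothetical drive, two marketing numbers (figures invented for illustration):
    for block_bytes, iops in ((4096, 600_000), (131_072, 25_000)):
        gbps = block_bytes * iops / 1e9
        print(f"{block_bytes // 1024:4d} KiB x {iops:>7,} IOPS = {gbps:.1f} GB/s")

So the 128 KiB workload posts a lower IOPS number but a higher GB/s figure, which is exactly the trade-off described above.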
> expressed in 4kB IOs per second (IOPS). Using larger block sizes for random IO will increase throughput in terms of GB/s but will almost always yield a lower IOPS number
and higher IOPS numbers give the marketing department the warm and fuzzies. Got it.
It is a good metric because it is a pure measure of random access. Bigger requests are a mixed measure of sequential performance and random access (and you can basically infer the performance for any IO size from the large-request bandwidth and the IOPS at the smallest reasonable IO size).
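Something like this, with invented figures: treat each request as a fixed per-IO cost plus size/bandwidth, pin the two constants using the 4 KiB IOPS and large-request bandwidth, and the whole curve falls out:

    # Crude model: time(request) = per_io + size / BW_SEQ.
    # Inputs are invented rated figures; per_io is derived so the model
    # reproduces the 4 KiB IOPS number exactly.
    IOPS_4K = 600_000                         # 4 KiB random-read IOPS
    BW_SEQ  = 3.3e9                           # large-request bandwidth, bytes/s
    per_io  = 1 / IOPS_4K - 4096 / BW_SEQ     # fixed cost per request, seconds

    def est_gbps(size_bytes):
        return size_bytes / (per_io + size_bytes / BW_SEQ) / 1e9

    for kib in (4, 16, 64, 256, 1024):
        print(f"{kib:5d} KiB requests: ~{est_gbps(kib * 1024):.2f} GB/s")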
We are building a new server cluster, and it looks like PCIe Gen4 will give us about a 40% performance boost for what is basically NVMe over Fabrics using Mellanox 100Gb cards (Azure HCI, aka S2D). Next year we will be building out another cluster and will hopefully use Mellanox 200Gb cards, which is only possible because of Gen4.
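Rough link arithmetic behind the 200Gb point (spec per-lane rates, ignoring protocol overhead, so real-world numbers land a bit lower):

    # Approximate usable PCIe per-lane rates after 128b/130b encoding, in GB/s.
    GEN3_PER_LANE = 0.985
    GEN4_PER_LANE = 1.969
    x16_gen3 = 16 * GEN3_PER_LANE        # ~15.8 GB/s
    x16_gen4 = 16 * GEN4_PER_LANE        # ~31.5 GB/s

    nic_100g = 100 / 8                   # ~12.5 GB/s of Ethernet line rate
    nic_200g = 200 / 8                   # ~25 GB/s

    print(f"x16 Gen3: {x16_gen3:.1f} GB/s   100GbE: {nic_100g:.1f} GB/s   200GbE: {nic_200g:.1f} GB/s")
    print(f"x16 Gen4: {x16_gen4:.1f} GB/s")
    # A 200Gb NIC wants ~25 GB/s, more than an x16 Gen3 slot can deliver,
    # which is why it only makes sense in a Gen4 slot.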