I can't quite see why these results are surprising. There doesn't seem to be a conclusion in the article saying so. How did this setup benchmark before 4.17?
The general consensus has been that SSDs don't have seek times the way spinning disks do, so there's no need to reorder reads/writes to optimise for them, hence "none" has been recommended.
The surprise is that the other schedulers beat "none" in some situations, when in theory they just add code and write order shouldn't matter on SSDs. Apparently write order does matter on SSDs, at least in some situations.
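In case it helps anyone reproduce this: the active scheduler is visible per device in sysfs. A minimal sketch in Python (the device name "nvme0n1" is only an example; adjust for your machine):

    # Read the active I/O scheduler for a block device from sysfs.
    # "nvme0n1" is only an example device name.
    from pathlib import Path

    def current_scheduler(dev: str = "nvme0n1") -> str:
        text = Path(f"/sys/block/{dev}/queue/scheduler").read_text()
        # The file looks like "[none] mq-deadline kyber bfq"; the bracketed
        # entry is the one currently in effect.
        for word in text.split():
            if word.startswith("[") and word.endswith("]"):
                return word.strip("[]")
        return text.strip()

    print(current_scheduler())  # e.g. "none"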
Reordering I/O operations matters a lot for hard drives because seek times depend on where the head is and how far it has to travel.
The effects in play for SSDs are very different. Reads are more likely to be performance-critical than writes, so it can be beneficial to limit the number of write operations that are in the drive's queue so that reads can be serviced more quickly. Writes take longer than reads, but they can usually be deferred without causing problems (though the Optane SSD in question doesn't have a write buffer). Deferring operations also sometimes gives the drive the opportunity to combine more host commands into a single media access.
The order of commands issued to an SSD matters much less than the number of commands currently pending. If you're issuing just reads or just writes, there's little benefit to reordering them unless they have different priority levels (which NVMe allows through separate queues for each priority level, but I don't think Linux currently uses this).
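If you want to see or limit the "currently pending commands" part concretely, the block layer exposes it per device. A rough sketch, assuming a device named "nvme0n1" and an arbitrary depth of 64 (neither is a recommendation):

    # Inspect, and optionally cap, how many requests the block layer will
    # keep outstanding for a device. Writing requires root.
    from pathlib import Path

    def get_nr_requests(dev: str = "nvme0n1") -> int:
        return int(Path(f"/sys/block/{dev}/queue/nr_requests").read_text())

    def set_nr_requests(dev: str, depth: int) -> None:
        # A smaller value limits how many writes can queue up ahead of
        # latency-sensitive reads.
        Path(f"/sys/block/{dev}/queue/nr_requests").write_text(str(depth))

    print(get_nr_requests())      # current limit
    # set_nr_requests("nvme0n1", 64)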
>Reads are more likely to be performance-critical than writes
Yep. I've worked with multiple environments where people were caught off guard by read latency spikes that were much higher on NVMe drives than on regular SSDs. This hit them particularly hard on database-style workloads.
Switching to the Kyber scheduler basically eliminated the latency spikes: it lets you set target latencies for reads and synchronous writes and throttles the queue depths to keep to those targets.
I recommend it for anyone using a blk-mq device that has a mixed read/write workload where latency is important.
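For anyone who wants to try it, Kyber's latency targets are plain sysfs tunables. A minimal sketch (needs root; "nvme0n1" and the nanosecond values below are just examples, which happen to match Kyber's defaults of 2 ms for reads and 10 ms for writes):

    # Switch a device to the Kyber scheduler and set its latency targets.
    from pathlib import Path

    def enable_kyber(dev: str = "nvme0n1",
                     read_lat_nsec: int = 2_000_000,      # 2 ms read target
                     write_lat_nsec: int = 10_000_000):   # 10 ms sync-write target
        q = Path(f"/sys/block/{dev}/queue")
        (q / "scheduler").write_text("kyber")
        (q / "iosched" / "read_lat_nsec").write_text(str(read_lat_nsec))
        (q / "iosched" / "write_lat_nsec").write_text(str(write_lat_nsec))

    enable_kyber()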
Most likely, given BFQ’s generally low throughput in low-latency mode, the link isn’t saturated by those background processes, which leaves the foreground application that’s starting up with more bandwidth than it would otherwise have.
It’s not that surprising, really. It’s just like a per-process I/O cap.
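For reference, BFQ's heuristic is also just a sysfs toggle. A sketch, assuming a device named "nvme0n1" (needs root):

    # Enable BFQ and its low_latency mode for a device. With low_latency=1,
    # BFQ privileges newly started interactive tasks over background I/O,
    # which is the capping effect described above.
    from pathlib import Path

    def enable_bfq_low_latency(dev: str = "nvme0n1") -> None:
        q = Path(f"/sys/block/{dev}/queue")
        (q / "scheduler").write_text("bfq")
        (q / "iosched" / "low_latency").write_text("1")

    enable_bfq_low_latency()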
I could see some small advantage in bundling up operations that are proximate in the hope that they're in the same flash block and you can save one read-write cycle.
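To make that concrete, here's a toy host-side illustration of the idea: coalesce writes that are contiguous on disk so the drive sees one larger command instead of several small ones (only an analogy for a scheduler's merge step, not how the drive's FTL actually works):

    # Toy sketch: merge contiguous (offset, data) writes before issuing them.
    def coalesce(writes):
        merged = []
        for off, data in sorted(writes):
            if merged and merged[-1][0] + len(merged[-1][1]) == off:
                prev_off, prev_data = merged[-1]
                merged[-1] = (prev_off, prev_data + data)
            else:
                merged.append((off, data))
        return merged

    # Three 4 KiB writes, two of them adjacent, collapse into two commands.
    for off, data in coalesce([(4096, b"b" * 4096), (0, b"a" * 4096), (65536, b"c" * 4096)]):
        print(off, len(data))   # -> 0 8192, then 65536 4096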