Oops, O_DIRECT does not actually make that big of a difference. I had updated my ad-hoc test to use O_DIRECT, but didn't notice that write() was now returning errors because of wrong buffer alignment ;-)
As mentioned in the sibling comment, syncs are still slow. My initial 1-2ms number came from a desktop I bought in 2018, to which I added an NVMe drive connected to an M.2 slot in 2022. On my current test system I'm seeing avg latencies of around 250us, sometimes a lot more (there are fluctuations).
# put the following in a file "fio.job" and run "fio fio.job"
# enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
[Job1]
#direct=1
fsync=1
readwrite=randwrite
# size of each write()
bs=64k
# total size written
size=256m
Add sync=1 to your fio O_DIRECT write tests (not fsync, but sync=1) and you'll see a big difference on consumer SSDs without power loss protection for their controller buffers. It adds the FUA flag (Force Unit Access) to the write requests to ensure persistence of your writes; O_DIRECT alone won't do that.
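For example, a variant of the job file above with both options set could look roughly like this (a sketch, not tested here; if I read the fio docs right, sync=1 opens the file with O_SYNC, which combined with direct=1 is what produces the FUA writes described above on drives with a volatile write cache):
[Job1]
# O_DIRECT: bypass the page cache
direct=1
# O_SYNC: write() only completes once the device reports the data durable
sync=1
readwrite=randwrite
bs=64k
size=256m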
I'm not an expert, but I think an enterprise NVMe drive will have some sort of power loss protection, so it can afford to acknowledge an fsync while the data is still in RAM/caches, since those will be written out on power loss.
Consumer NVMe drives afaik lack this, so fsync has to force the data all the way to flash.
What drive is this, and does it need a trim? Not all NVMe devices are created equal, especially among consumer drives. In a previous role I was responsible for qualifying drives. Any datacenter or enterprise-class drive that had that sort of latency in direct IO write benchmarks after proper preconditioning would have failed our validation.
Unfortunately, this data is harder to find than it should be. For instance, take Kioxia, whose drives I've found to be very performant: their datasheets for the CD-series drives don't mention write latency at all. Blocks and Files[1] mentions that they claim <255us average, so they must have published that somewhere. This is why we would extensively test multiple units ourselves, following proper preconditioning as defined by SNIA. Averaging 250us for direct writes is pretty good.
I assume fsyncing a whole file does more work than just ensuring that specific blocks made it to the WAL, which a database can achieve with direct IO or maybe sync_file_range.
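If you want to play with that in fio, there is a sync_file_range option; a rough sketch (check the fio HOWTO for the exact syntax on your version, and note that sync_file_range(2) only starts/waits for page writeback and does not flush the device write cache, so it is weaker than fsync or FUA writes):
[Job1]
readwrite=randwrite
bs=64k
size=256m
# call sync_file_range(2) with SYNC_FILE_RANGE_WRITE after every write()
sync_file_range=write:1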
Enterprise NVMe can do fsync much faster than consumer hardware. This is because they can cheat and report a successful fsync() before the data has actually been flushed to flash. They have backup capacitors which allow them to flush their caches in case of power loss, so no data is lost.
NVMe is just a protocol. There are drives that are absolute shit and others that cost as much as luxury automobiles. In either case you won't get DRAM latency, because the drive is attached over an expansion bus.
Update: about 800us on a more modern system.