Oops, O_DIRECT does not actually make that big of a difference. I had updated my ad-hoc test to use O_DIRECT, but didn't notice that write() was now returning errors because of wrong buffer alignment ;-)
As mentioned in the sibling comment, syncs are still slow. My initial 1-2ms number came from a desktop I bought in 2018, to which I added an NVMe drive connected to an M.2 slot in 2022. On my current test system I'm seeing avg latencies of around 250us, sometimes a lot more (there are fluctuations).
# put the following in a file "fio.job" and run "fio fio.job"
# enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
[Job1]
#direct=1
fsync=1
readwrite=randwrite
# size of each write()
bs=64k
# total size written
size=256m
Add sync=1 to your fio O_DIRECT write tests (not fsync, but sync=1) and you'll see a big difference on consumer SSDs without power loss protection for their controller buffers. It adds the FUA flag (Force Unit Access) to the write requests to ensure persistence of your writes; O_DIRECT alone won't do that.
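For example, a variant of the job file above with both options set could look roughly like this (a sketch, not tested here; if I read the fio docs right, sync=1 opens the file with O_SYNC, which combined with direct=1 is what produces the FUA writes described above on drives with a volatile write cache):
[Job1]
# O_DIRECT: bypass the page cache
direct=1
# O_SYNC: write() only completes once the device reports the data durable
sync=1
readwrite=randwrite
bs=64k
size=256m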
I'm not an expert, but I think an enterprise NVMe drive will have some sort of power loss protection, so it can afford to acknowledge an fsync while the data is still in RAM/caches, since those will be written out on power loss.
Consumer NVMe drives afaik lack this, so fsync has to force the data all the way to flash.
What drive is this, and does it need a trim? Not all NVMe devices are created equal, especially among consumer drives. In a previous role I was responsible for qualifying drives. Any datacenter or enterprise-class drive that had that sort of latency in direct IO write benchmarks after proper preconditioning would have failed our validation.
Unfortunately, this data is harder to find than it should be. For instance, take Kioxia, whose drives I've found to be very performant: their datasheets for the CD-series drives don't mention write latency at all. Blocks and Files[1] mentions that they claim <255us average, so they must have published that somewhere. This is why we would extensively test multiple units ourselves, following proper preconditioning as defined by SNIA. Averaging 250us for direct writes is pretty good.
I assume fsyncing a whole file does more work than just ensuring that specific blocks made it to the WAL, which a database can achieve with direct IO or maybe sync_file_range.
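If you want to play with that in fio, there is a sync_file_range option; a rough sketch (check the fio HOWTO for the exact syntax on your version, and note that sync_file_range(2) only starts/waits for page writeback and does not flush the device write cache, so it is weaker than fsync or FUA writes):
[Job1]
readwrite=randwrite
bs=64k
size=256m
# call sync_file_range(2) with SYNC_FILE_RANGE_WRITE after every write()
sync_file_range=write:1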
Enterprise NVMe can do fsync much faster than consumer hardware. This is because they can cheat and report a successful fsync() before the data has actually been flushed to flash. They have backup capacitors which allow them to flush their caches in case of power loss, so no data is lost.
NVMe is just a protocol. There are drives that are absolute shit and others that cost as much as luxury automobiles. In either case you won't get DRAM latency, because the drive is attached over an expansion bus.
Update: about 800us on a more modern system.