I still measure 1-2ms of latency with an NVMe disk on my desktop computer, doing fsync() on a file on an ext4 filesystem.

Update: about 800us on a more modern system.



Not so sure that's true. This is single-threaded direct I/O doing a fio randwrite workload on a WD 850X Gen4 SSD:

    write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4412MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=335, avg= 3.42, stdev= 1.65
    clat (nsec): min=932, max=24868k, avg=49188.32, stdev=65291.21
     lat (usec): min=29, max=24880, avg=52.67, stdev=65.73
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   35],
     | 30.00th=[   37], 40.00th=[   38], 50.00th=[   40], 60.00th=[   43],
     | 70.00th=[   53], 80.00th=[   60], 90.00th=[   70], 95.00th=[   84],
     | 99.00th=[  137], 99.50th=[  174], 99.90th=[  404], 99.95th=[  652],
     | 99.99th=[ 2311]
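
For reference, an invocation along these lines should reproduce that kind of run; the exact parameters (4k blocks, queue depth 1, 60s) are a guess from the numbers above, not the poster's actual command:

    # assumed reconstruction; writes to a throwaway test file, not a raw device
    fio --name=randwrite --filename=fio-testfile --size=4G \
        --ioengine=libaio --iodepth=1 --direct=1 \
        --rw=randwrite --bs=4k --runtime=60 --time_based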


I checked again with O_DIRECT and now I stand corrected. I didn't know that O_DIRECT could make such a huge difference. Thanks!


Oops, O_DIRECT does not actually make that big of a difference. I had updated my ad-hoc test to use O_DIRECT, but didn't notice that write() was now returning errors because of wrong alignment ;-)

As mentioned in the sibling comment, syncs are still slow. My initial 1-2ms number came from a desktop I bought in 2018, to which I added an NVMe drive connected to an M.2 slot in 2022. On my current test system I'm seeing average latencies of around 250us, sometimes a lot more (there are fluctuations).

   # put the following in a file "fio.job" and run "fio fio.job"
   # enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
   [Job1]
   #direct=1
   fsync=1
   readwrite=randwrite
   bs=64k  # size of each write()
   size=256m  # total size written


Add sync=1 to your fio O_DIRECT write tests (not fsync, but sync=1) and you'll see a big difference on consumer SSDs without power loss protection for their controller buffers. It adds the FUA (force unit access) flag to the write requests to ensure persistence of your writes; O_DIRECT alone won't do that.
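
In job-file form that would look roughly like this (building on the job file above; whether it really becomes a FUA write also depends on the drive supporting FUA, otherwise the kernel falls back to a cache flush):

    [Job1]
    direct=1       # O_DIRECT: bypass the page cache
    sync=1         # O_SYNC: write() only completes once the data is durable (FUA on NVMe)
    readwrite=randwrite
    bs=4k
    size=256m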


Random writes and fsync aren't the same thing. A single unflushed random write on a consumer SSD is extremely fast because it's not durable.


You're right. Sync writes are several times slower, at 331µs average.

  write: IOPS=3007, BW=11.7MiB/s (12.3MB/s)(118MiB/10001msec); 0 zone resets
    clat (usec): min=196, max=23274, avg=331.13, stdev=220.25
     lat (usec): min=196, max=23275, avg=331.25, stdev=220.27
    clat percentiles (usec):
     |  1.00th=[  210],  5.00th=[  223], 10.00th=[  235], 20.00th=[  262],
     | 30.00th=[  297], 40.00th=[  318], 50.00th=[  330], 60.00th=[  343],
     | 70.00th=[  355], 80.00th=[  371], 90.00th=[  400], 95.00th=[  429],
     | 99.00th=[  523], 99.50th=[  603], 99.90th=[ 1631], 99.95th=[ 2966],
     | 99.99th=[ 8225]


I'm not an expert, but I think an enterprise NVMe drive will have some sort of power loss protection, so it can afford to complete fsync from RAM/caches, since those will still be written out on a power loss. Consumer NVMe drives AFAIK lack this, so fsync has to force the data all the way to flash.
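
One way to check how the kernel treats the drive's cache (device name is just an example):

    # "write back" means the kernel considers the cache volatile and fsync sends flushes;
    # "write through" means no flushes are needed
    cat /sys/block/nvme0n1/queue/write_cache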


I believe that's power saving in action. A single operation at idle is slow because the drive needs time to wake from its low-power state.
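
If you want to rule that out, something like this shows the power-state settings (assuming nvme-cli and that the controller is nvme0):

    # Autonomous Power State Transition settings (feature 0x0c), human-readable
    nvme get-feature /dev/nvme0 -f 0x0c -H
    # For an A/B test, APST can be disabled at boot with the kernel parameter
    #   nvme_core.default_ps_max_latency_us=0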


What drive is this and does it need a trim? Not all NVMe devices are created equal, especially in consumer drives. In a previous role I was responsible for qualifying drives. Any datacenter or enterprise class drive that had that sort of latency in direct IO write benchmarks after proper pre-conditioning would have failed our validation.


My current one reads SAMSUNG MZVL21T0HCLR-00BH1 and is built into a quite new work laptop. I can't get below around 250us avg.

On my older system I had a WD_BLACK SN850X, but it was connected to an M.2 slot which may be limiting. This is where I measured the 1-2ms latency.

Is there any good place to get numbers of what is possible with enterprise hardware today? I've struggled for some time to find a good source.


Unfortunately, this data is harder to find than it should be. For instance, just looking at Kioxia, which I've found to be very performant, their datasheets for the CD series drives don't mention write latency at all. Blocks and Files[1] mentions that they claim <255us average, so they must have published that somewhere. This is why we would extensively test multiple units ourselves, following proper preconditioning as defined by SNIA. Averaging 250us for direct writes is pretty good.

[1] https://blocksandfiles.com/2023/08/07/kioxias-rocketship-dat...
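
A rough sketch of the SNIA-style preconditioning mentioned above (sizes and durations are illustrative, not the spec verbatim):

    # WARNING: destructive; only run against a dedicated test device
    # 1) Workload-independent preconditioning: sequentially fill the device twice
    fio --name=precond-seq --filename=/dev/nvme0n1 --direct=1 \
        --rw=write --bs=128k --loops=2
    # 2) Workload-dependent preconditioning: run the target workload until
    #    throughput and latency stop drifting (steady state), then measure
    fio --name=precond-rand --filename=/dev/nvme0n1 --direct=1 \
        --rw=randwrite --bs=4k --runtime=1800 --time_based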


I assume fsyncing a whole file does more work than just ensuring that specific blocks made it to the WAL, which it could achieve with direct I/O or maybe sync_file_range().
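
fio can approximate both variants; a sketch (option names worth double-checking against the fio docs):

    [wal-style]
    readwrite=randwrite
    bs=4k
    size=256m
    fdatasync=1   # fdatasync() after each write(): data plus only the metadata needed to reach it
    # fio also has sync_file_range=write:1 to issue sync_file_range() per write;
    # note sync_file_range() starts writeback of the dirty range but does not
    # flush the drive cache, so it is not a durability guarantee by itself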


Enterprise NVMe can do fsync much faster than consumer hardware. This is because they can cheat and report a successful fsync() before the data has actually been flushed to flash. They have backup capacitors which allow them to flush their caches on power loss, so there is no data loss.

Here is a PM983 doing `fio --name=fsync_test --ioengine=sync --rw=randwrite --bs=4k --size=1G --numjobs=1 --runtime=10s --time_based --fsync=1`:

  Jobs: 1 (f=1): [w(1)][100.0%][w=183MiB/s][w=46.7k IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=11905: Fri Mar 14 13:34:34 2025
    write: IOPS=39.1k, BW=153MiB/s (160MB/s)(1527MiB/10001msec); 0 zone resets
      clat (nsec): min=1052, max=223288, avg=1606.69, stdev=2345.64
       lat (nsec): min=1082, max=223458, avg=1653.08, stdev=2346.58
      clat percentiles (nsec):
       |  1.00th=[  1128],  5.00th=[  1176], 10.00th=[  1240], 20.00th=[  1320],
       | 30.00th=[  1448], 40.00th=[  1496], 50.00th=[  1528], 60.00th=[  1576],
       | 70.00th=[  1640], 80.00th=[  1720], 90.00th=[  1816], 95.00th=[  1960],
       | 99.00th=[  2576], 99.50th=[  3376], 99.90th=[ 10816], 99.95th=[ 32640],
       | 99.99th=[124416]
     bw (  KiB/s): min=123168, max=190568, per=99.00%, avg=154788.63, stdev=19610.50, samples=19
     iops        : min=30792, max=47642, avg=38697.16, stdev=4902.62, samples=19
    lat (usec)   : 2=95.61%, 4=4.10%, 10=0.19%, 20=0.04%, 50=0.03%
    lat (usec)   : 100=0.02%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=13, max=1238, avg=23.08, stdev= 9.27
      sync percentiles (usec):
       |  1.00th=[   15],  5.00th=[   16], 10.00th=[   16], 20.00th=[   17],
       | 30.00th=[   18], 40.00th=[   25], 50.00th=[   26], 60.00th=[   26],
       | 70.00th=[   26], 80.00th=[   26], 90.00th=[   26], 95.00th=[   27],
       | 99.00th=[   34], 99.50th=[   79], 99.90th=[  101], 99.95th=[  126],
       | 99.99th=[  347]

The same test on the SN850X:

  Jobs: 1 (f=1): [w(1)][100.0%][w=22.9MiB/s][w=5859 IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=23328: Fri Mar 14 13:35:04 2025
    write: IOPS=5742, BW=22.4MiB/s (23.5MB/s)(224MiB/10001msec); 0 zone resets
      clat (nsec): min=400, max=110253, avg=797.80, stdev=1244.19
       lat (nsec): min=430, max=110273, avg=826.49, stdev=1248.86
      clat percentiles (nsec):
       |  1.00th=[  502],  5.00th=[  540], 10.00th=[  572], 20.00th=[  612],
       | 30.00th=[  644], 40.00th=[  668], 50.00th=[  708], 60.00th=[  748],
       | 70.00th=[  804], 80.00th=[  868], 90.00th=[ 1032], 95.00th=[ 1176],
       | 99.00th=[ 1560], 99.50th=[ 2224], 99.90th=[ 8384], 99.95th=[23424],
       | 99.99th=[66048]
     bw (  KiB/s): min=19800, max=24080, per=100.00%, avg=23004.21, stdev=1039.13, samples=19
     iops        : min= 4950, max= 6020, avg=5751.05, stdev=259.78, samples=19
    lat (nsec)   : 500=0.80%, 750=58.72%, 1000=29.04%
    lat (usec)   : 2=10.89%, 4=0.28%, 10=0.18%, 20=0.04%, 50=0.04%
    lat (usec)   : 100=0.01%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=136, max=28040, avg=172.88, stdev=195.00
      sync percentiles (usec):
       |  1.00th=[  145],  5.00th=[  149], 10.00th=[  151], 20.00th=[  151],
       | 30.00th=[  159], 40.00th=[  159], 50.00th=[  159], 60.00th=[  159],
       | 70.00th=[  159], 80.00th=[  161], 90.00th=[  198], 95.00th=[  202],
       | 99.00th=[  396], 99.50th=[  416], 99.90th=[  594], 99.95th=[ 1467],
       | 99.99th=[ 5145]


NVMe is just a protocol. There are drives that are absolute shit and others that cost as much as luxury automobiles. In either case you won't quite get DRAM latency, because the drive is attached over an expansion bus.


RIP Optane DIMMs.



