
Added a small clarification: "The intuition here is that directories are a shared resource for their direct children and must therefore serialize concurrent directory-modifying operations, causing contention. In brief, file creation or deletion cannot occur at the same time within one directory."

Plus accompanying benchmark: https://alexsaveau.dev/blog/projects/performance/files/fuc/f...
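
If you'd rather not dig through the repo, the shape of the experiment is roughly the following. This is my own toy reconstruction, not the linked benchmark; the thread count, file counts, and paths are arbitrary. Every thread creates and deletes files either in one shared directory or in a directory of its own, which is the contrast the benchmark is about:

  /* contention_sketch.c -- toy reconstruction, not the linked benchmark.
     Build: cc -O2 -pthread contention_sketch.c -o contention_sketch
     Run:   ./contention_sketch shared   vs   ./contention_sketch split */
  #include <fcntl.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  #define THREADS 8
  #define FILES_PER_THREAD 10000

  static int split_dirs; /* 0 = all threads share one dir, 1 = one dir per thread */

  static void *worker(void *arg) {
      long id = (long)arg;
      char dir[64], path[128];
      if (split_dirs)
          snprintf(dir, sizeof dir, "bench/t%ld", id);
      else
          snprintf(dir, sizeof dir, "bench");
      for (int i = 0; i < FILES_PER_THREAD; i++) {
          snprintf(path, sizeof path, "%s/f%ld_%d", dir, id, i);
          int fd = open(path, O_CREAT | O_WRONLY, 0644); /* adds a name to the directory */
          if (fd >= 0) close(fd);
          unlink(path);                                  /* removes the name again */
      }
      return NULL;
  }

  int main(int argc, char **argv) {
      split_dirs = argc > 1 && strcmp(argv[1], "split") == 0;
      mkdir("bench", 0755);
      for (long i = 0; i < THREADS; i++) {
          char d[64];
          snprintf(d, sizeof d, "bench/t%ld", i);
          mkdir(d, 0755); /* unused in shared mode, harmless */
      }
      pthread_t tids[THREADS];
      for (long i = 0; i < THREADS; i++)
          pthread_create(&tids[i], NULL, worker, (void *)i);
      for (long i = 0; i < THREADS; i++)
          pthread_join(tids[i], NULL);
      return 0;
  }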

---

> file operations are (almost always) io bound

This is a common misconception. It was presumably true a decade ago, but PCIe bandwidth keeps doubling roughly every three years: https://arstechnica.com/gadgets/2022/06/months-after-finaliz...

The NVMe protocol has extremely deep queues [1] and leaving them empty means leaving performance on the table. I think you'll find it surprisingly difficult to saturate the PCIe bus with just one core: PCIe 7 will support 512GB/s. Assuming a single core can produce 64 bytes (an entire cache line!) per cycle at 5GHz, that's 320GB/s, or only 320/512 = 62.5% saturation. This napkin math is a little BS, but my point is that individual cores are quickly going to be outpaced by bandwidth availability.
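
To make the queue-depth point concrete, here's a rough sketch of keeping many requests in flight at once with io_uring (via liburing). The file path, queue depth, and block size are arbitrary numbers I picked for illustration, not anything from the benchmark:

  /* queue_depth_sketch.c -- keeping DEPTH reads in flight instead of one.
     Assumes liburing is installed; the path, depth, and block size are arbitrary.
     Build: cc -O2 queue_depth_sketch.c -o queue_depth_sketch -luring */
  #include <fcntl.h>
  #include <liburing.h>
  #include <stdio.h>
  #include <unistd.h>

  #define DEPTH 64           /* requests kept in flight at once */
  #define BLOCK (64 * 1024)  /* bytes per read */

  int main(void) {
      int fd = open("testfile", O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct io_uring ring;
      if (io_uring_queue_init(DEPTH, &ring, 0) != 0) {
          fprintf(stderr, "io_uring_queue_init failed\n");
          return 1;
      }

      static char bufs[DEPTH][BLOCK];

      /* Queue up DEPTH reads before waiting on any of them, so the device
         sees a full queue rather than one request at a time. */
      for (int i = 0; i < DEPTH; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, bufs[i], BLOCK, (__u64)i * BLOCK);
      }
      io_uring_submit(&ring);

      /* Reap completions; a real loop would refill the queue as they land. */
      for (int i = 0; i < DEPTH; i++) {
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);
          io_uring_cqe_seen(&ring, cqe);
      }

      io_uring_queue_exit(&ring);
      close(fd);
      return 0;
  }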

> and totally unclear how directories represent an "interference" boundary

To add a bit more color here, it depends on how your file system is implemented. I believe Windows stores every file's metadata in a global database, so segmenting operations by directory yields no benefit. On the other hand, Unix FSs tend to store filename-to-inode mappings per directory, so creating a new mapping in one directory doesn't interfere with another directory.

[1]: https://en.wikipedia.org/wiki/NVM_Express#Comparison_with_AH...



> The NVMe protocol has extremely deep queues [1] and leaving them empty means leaving performance on the table. I think you'll find it surprisingly difficult to saturate the PCIe bus with just one core: PCIe 7 will support 512GB/s

is disk IO bottlenecked by NVMe/PCIe limits, or by disk iops limits?

> Unix FSs tend to store file_name to inode mappings per directory, so creating a new mapping in one directory doesn't interfere with another directory.

again, you're handwaving on what "interfere" means

do "inode mappings" represent resources with no shared resource constraints?

is reading one "inode mapping" as fast as you can with one core independent from reading a different "inode mapping" as fast as you can with a separate core?

afaik it is not, am i wrong?


> is disk IO bottlenecked by NVMe/PCIe limits, or by disk iops limits?

Note that I'm out of my depth here, so this is all speculation. Until we hit hardware limitations (which will be PCIe 6 if I had to guess), I'm pretty sure those are the same thing. One read/write iop = 4KiB. If your PCIe bandwidth is limited to N GB/s, then there are only so many iops you can physically send/receive to/from the SSD, regardless of how many iops the SSD could be capable of processing. So currently we're bottlenecked by PCIe, but I doubt that will continue to be the case.
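
As a back-of-the-envelope check (the 16GB/s figure below is just a round number I'm assuming for something like a PCIe 5.0 x4 link, not a measurement):

  /* iops_ceiling.c -- back-of-the-envelope only. The 16GB/s figure is an
     assumed round number for the link, not a measured value. */
  #include <stdio.h>

  int main(void) {
      double link_bytes_per_sec = 16e9; /* assumed usable link bandwidth */
      double iop_bytes = 4096.0;        /* 4KiB per read/write iop */
      printf("bandwidth-limited ceiling: %.1f million iops/s\n",
             link_bytes_per_sec / iop_bytes / 1e6);
      /* prints ~3.9 million iops/s, the same ballpark as high-end NVMe spec
         sheets, which is why the two limits can look like the same thing */
      return 0;
  }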

> again, you're handwaving on what "interfere" means

It depends on how the file system is implemented, but my guess would be that a lock on the inode or block cache entry is acquired.

> do "inode mappings" represent resources with no shared resource constraints?

Those are the contents of the directory. The problem is not reading them, but changing them.


when you interact with a file or directory, the mapping of that logical thing to disk inodes (which are the physical 4kb whatevers you speak of) is managed for you by the fs cache

that cache coalesces and serializes access to disk, it does all of this "locking" you're referring to, and it's very smart

it seems like you're writing code assuming this intermediating layer does not exist?


> The intuition here is that directories are a shared resource for their direct children and must therefore serialize concurrent directory-modifying operations, causing contention.

why do you think this is true?

i've never heard of anything like it

directories are inodes on a file system, they are in no way "shared resources for their direct children", and there is no concept of a "directory-modifying operation" which contends with operations on any file (or directory) which is a "child" (subdir, sub-file) of that directory

your claim is totally bizarre to me, afaik it's nonsensical, but i guess i could be mistaken


A directory is a file like anything else: it contains a map of names to inodes. If you're trying to add or remove mappings (create or delete files), then clearly some synchronization must occur or the directory's contents will end up as garbage. In theory you could get away with a very small critical section that says "lock bytes N through M" of the file, but then how do you deal with disk block alignment (two byte ranges landing on the same disk block still have to take turns) and with I/O errors (the first range fails to write, the second succeeds, and now you have a hole full of garbage)?
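
If it helps, here's a deliberately over-simplified in-memory analogy (a toy model I made up, not kernel code): treat the directory as a name-to-inode table guarded by a read/write lock. Lookups can share the lock, but adding or removing an entry needs it exclusively, so two creates in the same directory queue up behind each other while creates in two different directories (two different locks) don't:

  /* dir_model.c -- a deliberately simplified analogy, not kernel code:
     a directory modelled as a name -> inode table behind a rwlock.
     Build: cc -O2 -pthread dir_model.c -o dir_model */
  #include <pthread.h>
  #include <stdio.h>
  #include <string.h>

  #define MAX_ENTRIES 1024

  struct dir_model {
      pthread_rwlock_t lock;   /* stand-in for the per-directory lock */
      int count;
      char names[MAX_ENTRIES][64];
      long inodes[MAX_ENTRIES];
  };

  long dir_lookup(struct dir_model *d, const char *name) {
      long ino = -1;
      pthread_rwlock_rdlock(&d->lock); /* shared: any number of readers */
      for (int i = 0; i < d->count; i++)
          if (strcmp(d->names[i], name) == 0) { ino = d->inodes[i]; break; }
      pthread_rwlock_unlock(&d->lock);
      return ino;
  }

  int dir_create(struct dir_model *d, const char *name, long inode) {
      pthread_rwlock_wrlock(&d->lock); /* exclusive: concurrent creates queue here */
      if (d->count == MAX_ENTRIES) { pthread_rwlock_unlock(&d->lock); return -1; }
      snprintf(d->names[d->count], sizeof d->names[d->count], "%s", name);
      d->inodes[d->count++] = inode;
      pthread_rwlock_unlock(&d->lock);
      return 0;
  }

  int main(void) {
      static struct dir_model d; /* zero-initialized */
      pthread_rwlock_init(&d.lock, NULL);
      dir_create(&d, "a.txt", 42);
      printf("a.txt -> inode %ld\n", dir_lookup(&d, "a.txt"));
      return 0;
  }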

Also no need to theorize: run the benchmark I linked for yourself. It clearly shows a massive advantage to having each thread work with its own directory.


> A directory is a file like anything else that contains a map of names to inodes. If you're trying to add or remove mappings (create or delete files), then clearly some synchronization must occur or the contents of the file will contain garbage.

this synchronization is handled for you by the fs, specifically the fs cache

inode alignment and errors are managed by this intermediating layer

your benchmarks are not demonstrating what you think they are demonstrating


Except they are, and your claims are trivial to disprove: simply run the benchmarks under perf. You'll find that most of the time is spent on the rwsem, which is described starting here: https://www.kernel.org/doc/html/latest/filesystems/path-look...


your benchmarks use O_DIRECT, which bypasses the fs cache

the fs cache does most/all of the optimizations you're doing manually

bypassing the fs cache is highly atypical for user-space code
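
for reference, opening with O_DIRECT looks roughly like this (illustrative only; the path and sizes are arbitrary, and this says nothing about what the linked benchmark actually does). note the mandatory alignment, and that filesystems without direct IO support just refuse the open:

  /* o_direct_sketch.c -- what bypassing the page cache looks like in general;
     this says nothing about what the linked benchmark does. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      /* Without O_DIRECT the write lands in the page cache and is flushed later;
         with it, the data goes (more or less) straight to the device. */
      int fd = open("direct_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
      if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

      /* O_DIRECT requires buffer, offset, and length to be block-aligned. */
      void *buf;
      if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
      memset(buf, 'x', 4096);

      if (write(fd, buf, 4096) != 4096) perror("write");
      close(fd);
      free(buf);
      return 0;
  }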


1. your benchmarks exercise a very narrow set of use cases

2. the overhead of modifying the dirent is statistically zero compared to the costs related to manipulating the files on disk


No. Run the benchmark on a tmpfs:

  $ hyperfine --warmup 3 -N "./test /dev/shm 8 zip" "./test /dev/shm 8 chain"
  Benchmark 1: ./test /dev/shm 8 zip
    Time (mean ± σ):     118.5 ms ±  11.6 ms    [User: 92.9 ms, System: 726.6 ms]
    Range (min … max):   103.6 ms … 143.4 ms    23 runs

  Benchmark 2: ./test /dev/shm 8 chain
    Time (mean ± σ):     235.7 ms ±  11.0 ms    [User: 116.4 ms, System: 1537.7 ms]
    Range (min … max):   220.1 ms … 258.3 ms    13 runs

  Summary
    './test /dev/shm 8 zip' ran
      1.99 ± 0.22 times faster than './test /dev/shm 8 chain'


do you know what a tmpfs is? or what these programs are actually exercising?

i mean ignore me if you want, no skin off my back

but you're not benchmarking what you think you're benchmarking



