Hacker News new | past | comments | ask | show | jobs | submit login
The fastest rm command and one of the fastest cp commands (alexsaveau.dev)
86 points by SUPERCILEX on March 25, 2023 | hide | past | favorite | 81 comments



Hmm. I appreciate the cute name for the project but I tested this on my machine (a Mac) and the results were not very impressive. I sacrificed a few of my SSD cycles to test how this is on deleting Xcode (for those unaware, it's a 11 GB mess of several hundred thousand files of varying sizes). Here are the results:

  $ time rm -rf Xcode.app
  
  real 0m39.850s
  user 0m0.429s
  sys 0m29.153s
  
  $ time rmz Xcode.app
  
  real 0m36.476s
  user 0m1.468s
  sys 1m59.916s
It's a little bit faster, but not by much. Despite claims that it runs in parallel, it seems like it really just hits unlink on many more cores and contends on the kernel's filesystem spinlocks rather than doing useful work. So the end result is that rm uses 70% CPU and rmz uses 400% CPU and they basically end up doing the same thing.

(FWIW, I don't use rm when deleting Xcode anyways, because it takes long and I do it too often. When testing unxip I just have it write to a temporary APFS volume and wipe it in between runs, which takes all of 10 seconds.)


The benchmarks page[1] for rmz shows it does pretty well on Linux, with many cases of 1.5x-4x vs "rm -rf". Perhaps something about Linux, or the filesystem type that was used to develop, test, and tune it.

[1] https://github.com/SUPERCILEX/fuc/tree/master/comparisons


There has been a huge amount of work done on Linux over the years to increase parallelism in the VFS and in the common filesystems like ext4, XFS and btrfs.

I would guess that macOS, being targeted at desktops and the typical workloads being run on them, hasn't received the same level of scrutiny in this particular area.


Slight tangent: why do you need to delete Xcode so often? (I don't really develop on Mac, mostly on Linux.)


They develop https://github.com/saagarjha/unxip, “a fast Xcode unarchiver”. Very few people should routinely delete Xcode.


Thanks!


I had to delete xcode because an update through the app store just didn't work. It had been in an "updating" state for over a week, I had given the machine full working days worth of continuous on-time, and eventually I tried to cancel the update which also didn't work. The only recourse was to `sudo rm -rf /Applications/Xcode.app`.

Not the reason OP is deleting Xcode, but you'd be surprised at how buggy a lot of stuff involving Xcode can be.


App Store Xcode updates always take forever for god knows why. It’s almost as if there’s some nonlinear behavior in softwareupdated and co. that’s hit by Xcode with its gazillion little files.

Therefore I update Xcode by downloading the .xip from developer.apple.com/downloads.


I haven't looked into this closely but I think it first tries to validate the bundle and then decompresses to disk very slowly


Not the OP’s reason it seems, but I have to delete Xcode quite more frequently than you’d think because it gets into a weird state during updates. This seems to happen once or twice every year.


Not OP, but disk space maybe? It's pretty big.

$ du -Ash Xcode.app

22G Xcode.app


> Not OP, but disk space maybe? It's pretty big.

> $ du -Ash Xcode.app

> 22G Xcode.app

My installation of Xcode only takes up 3.6G and is still fully operational.

As a developer of both desktop and mobile apps, I discovered some time ago that I could delete most, if not all, of the Xcode simulators that I don’t use without affecting the functionality of the IDE. The software is smart enough to download additional assets when you attempt to launch one of the simulators. I believe that, by default, Xcode comes with simulators for the latest versions of iOS, iPad OS, Watch OS, and Apple TV, sometimes including two or three versions for each platform.

I think this is the reason why your installation weights 22G.


3.6 GB is pretty small. What did you remove from your installation?


Xcode is compressed on disk by default; you should generally use du -sh when measuring its size.


I added a clarification to the benchmarks section:

"The macOS/Windows implementations are currently equivalent to the *_rayon implementations shown in the benchmarks."

Rayon is pretty good, but clearly suboptimal as evidenced by the benchmarks.


Look at user and system times in your test. rmz seems much worse by that measure, particularly syscalls its 4x more time for 4x the number of cpus, maybe a link there. evidently it is multi threaded and as you say causes contention.


If you need a large sub tree gone, mv dir .old-dir && rm -r .old-dir & works pretty well. Faster even than an optimized parallel unlinker.


Very nice!

This reminds me of a similar trick to speed up deallocations by moving them to a new thread: https://abramov.io/rust-dropping-things-in-another-thread


Probably good enough for most use cases, and it’s basically what I do when I’m sure it’ll succeed. And a lot of times I just do the mv and worry about cleanup later, because disk space is cheaper than recreating whatever I might yeet by mistake.


Exactly, I use 'gio trash' on Linux for this reason. And I let a cronjob do removal of all the .Trash with find -ctime +180 -exec rm...


Why would running mv before rm be faster? On what file systems is this true?


I think you missed the `&` at the end of the rm command, which runs it as a background process. It's not faster in total time, but it will allow the user to continue inputting new commands much sooner.


The idea is to rename it so that you don't have to wait for it to be deleted. You can let the deletion run in the background once it has been moved out of the way.


And to be totally, ridiculously pedantic (this is HN after all) mv is a single atomic metadata operation in the filesystem, which will almost certainly go into the filesystem log without any wait on i/o. A big rm is going to generate enough metadata churn that the filesystem log will need to be synced to hardware at least once, and then you've got 2 or 3 orders of magnitude more time to wait (per sync).


Plus, if there's a complex directory tree, rm has to traverse it (to deal with all the metadata) while mv doesn't.


It's not faster, just somewhere else.


Yes, I use this trick at work to remove the generated build directory. IMHO this behaviour should be added to 'rm', but not by default due to the background work triggered.


That’s pretty outside-the-box and fun.


Amusingly it’s basically how most GUI desktops “delete” stuff, they just put a confirmation in place of the &&


I don't think any GUI move items before a confirmation, so I don't see the link at all.


They almost universally do! They move stuff to a designated trash or recycle bin or whatever to stage for final deletion when you commit to it.


That's a very liberal way to define "confirmation", especially since modern OSes will delete trashed items on a rolling 30-day basis, without any further action.

To rephrase:

> mv dir .old-dir && rm -r .old-dir &

it's how most GUI desktops “delete” stuff, they just run the two actions separately

It's kind of a stretch anyway since the core idea of the command was the final `&`


> modern OSes will delete trashed items on a rolling 30-day basis, without any further action.

Hmm, the best I can find, no OS does this by default although all of them are capable of doing it.


Windows, if asking is enabled, will ask before moving into the recycle bin.

Once the file is in the recycle bin, it will probably be months before final deletion happens, and windows will not ask before doing so.


Unless you happen to start to run low on disk space; at that point, Storage Sense will kick in and start complaining about it - if you enable it without changing standard settings it deletes "trash bin" files every 30d.


I once wrote a command called trickle-rm, which was designed to be an i/o constrained rm, this the exact opposite of this article. I needed trickle-rm because after extensive analysis, I'd found that the i/o load of a diagnostic data cleanup job was interfering with the very tight latency requirements of the main app on the server. My first effort was "nice rm -rf $oldLogs". When I'm feeling a bit evil I will ask an interview candidate why that didn't work.

Always something interesting to learn at the margins


Why didn't it work? Was try 2 ionice?


Nice reduces your scheduling priority for access to the CPU, which means it might take your niced rm longer to do the processing to submit a bunch of i/o to the kernel. But once those i/o ops are in the kernel, they are competing directly on an equal playing field with the latency critical i/o ops of the real app, causing the same degradation of latency to the end users. ionice was not available on my platform, thus the invention of trickle-rm.

Now the second interview question: how did trickle-rm work? How can you simply "reflect" the back pressure of the "x rm's per second" constraint back onto the "walk the directory tree to find the rm's to do" so that the tree walk generates a trickle of i/o operations?


Run rm in a bash loop for each file and sync after each iteration?

I'm surprised that rm's io would cause that much of an issue. I thought it only removed entries from the partition table.


Not the parent, but ionice only works with certain elevators. And elevator choice mattered a lot on spinning disks.


On windows I find it's much faster to use robocopy to mirror an empty folder into a full one than it is to delete the full folder than pretty much any other command. I regularly have to delete a 200GB folder with 500k+ files in it, and robocopy outperforms regular rm -r by a factor of two, and GUI shift+right click delete by a factor of 4 or 5.


When deleting a 25k-file node_modules folder on windows, I use the built-in rmdir, as it is much faster than the GUI. I do not know how it compares to robocopy.


That's a neat trick; I was in a similar situation recently and wrote a tool that's faster for me though; for a million files ROBOCOPY was 257 seconds, but https://github.com/shaggie76/FastDelete did it in 34 seconds on a hex-core laptop.


Oh that's super cool! Will need to try it.


I’m going to have to try that. I want to throw my computer out the window when I delete a folder in File Explorer and I get the “Preparing to delete” dialog come up. If I’m truly deleting and not recycling, what is it preparing?


It queries the list of all files and their sizes so it can show you a progress bar. RM and robocopy will just get on with it, but they don't know how much they are about to delete(not that they need to)


I cant wait to see the future where we have uring (io_uring) based tools that all work async, getting hopefully embarassingly parallel. Heck yes copying & deleting 2x as fast.

Actually I'm on btrfs so reflink copy is like instant I think? I should test that better.


async nearly always trades concurrency for latency, so it wouldn't necessarily be faster overall. Lots of profiling and tuning would probably be involved.

That, or your algorithm could be optimal for concurrency already - and you would see an immediate performance improvement.


What? How does async (and io_uring specifically) give up concurrency? They’re orthogonal things.


Historically async has had a penalty because it means doing usrerland code, then sys calling the kernel, then going back to userland, then latter somehow picking up the event, which is if nothing else just more sys calls. Which have overhead, as the processor has to context switch to do so.

The whole point of io_uring is to drastically decrease the number of sys calls, creating channels where more requests can be filed with lower than traditional cost of a readFile syscall for example, and where completion can also be lower overhead delivery of events.

So historically I kind of would have agreed with the parent. Today, we don't really know! Hence my excitement.


> The whole point of io_uring is to drastically decrease the number of sys calls, creating channels where more requests can be filed with lower than traditional cost of a readFile syscall for example, and where completion can also be lower overhead delivery of events.

In theory. I did some work on high performance filesystem I/O on Linux about a year ago, doing intensive random-access to fast SSDs, and found io_uring to be slightly slower than a well-tuned thread pool with an appropriate queue depth.

That was a little surprising as the thread pool has to do system calls for each I/O operation and io_uring does not. Perhaps it is faster with newer kernels or other access patterns.

io_uring is better able to adapt autonatically to different numbers of cores, device queue depth and amount of filesystem data cache residency. That comes from it having access to kernel state which is not made available to userspace on Linux, to guide thread offloading decisions, rather than from the ringbuffer communication.


reply from jamie lokier will make my day any day.

not to fanboy out TOO much, but your posting on HyBi was & is greatly influential to me. see, https://github.com/rektide/pipe-layer#essence

(sorely neglected project to me but also still very near & dear, still a core principle & value in my pantheon of beliefs)]

no particular comment on io_uring. thankfully jens keeps making it better. the numbers he posts for his synthetics keep seeming impossibly good. but i fully am ready to believe the real situation is more complicated.

i do wish we'd see some uptake from the usual suspects. both Deno and Node have delayed/deferred work on these topics. but supposedly slowly happening in node. https://github.com/libuv/libuv/issues/1947 https://github.com/denoland/deno/issues/16232


You interpreted it the wrong way. You gain concurrency, you gain latency.

With blocking I/O and parallelism you have a thread ready to go when the operation is complete. You have N threads for N iops. With concurrency you have to dequeue completed work, and then delegate that work to (usually) fewer than N threads. Dequeuing completed iops takes time (it's an extra syscall), and there may not be a thread ready hand the completed iop. More latency.

Running 1000s of threads isn't realistic because your OS would typically grind to a halt, so concurrency is unavoidable. It does have a cost, though.


I’m not aware of any latency impact from io_uring. If anything, it has lower latency because you can pipeline I/O from a single CPU which you can’t do from typically thread-based parallelism. Additionally, the processing of the ring buffer could happen on a background kernel thread (in theory not sure if it happens today) which then avoids context switching away from your thread and screwing with cache performance.


io_uring avoids the syscall overhead. Consider this intentionally bad scenario: you have one background thread pulling completions off the ringbuffer, and passing those completions to a single worker thread (nodejs would be a real example of this). In this scenario the latency of the first completion would be fantastic, but you'd have to wait for the single to become available for subsequent completions.

To be clear, the added latency here is better than the work never happening at all (which would be the result of running 1000s of threads on modern mainstream operating systems), but there is unavoidable latency if you are handling >N iops with N threads (which is intrinsic to the definition of concurrency).

I am referring to the broad, general case, much like big-O works. You can find numerous exception to big-O, such as preferring arrays over hashes when the set is very small. Let's invent big-L notation, N is the number of threads, M is number of iops. With pure parallelism you have L(N), with pure concurrency you have L(M), and with a hybrid you have L(M-N).



> The key insight is that file operations in separate directories don’t (for the most part) interfere with each other, enabling parallel execution.

i'm clearly missing something here

parallel execution helps when operations are cpu bound

file operations are (almost always) io bound

and totally unclear how directories represent an "interference" boundary

bizarre


Parallel execution absolutely helps when operations are IO bound, if they're more or less independent. Making two network requests in parallel is twice as fast as making them sequentially, if the payload is small enough so that latency dominates and bandwidth is negligible.

The question is, how independent are IO operations in separate directories. And the article is claiming that they're fairly independent and don't block each other.


making two io-bound requests in parallel is twice as fast as making them sequentially, only if they don't contend for the same io resource -- bandwidth, disk iops, etc.

maybe this is what you mean by independent?

but the thing is that in disk io, directory structure is (as far as i know) basically unrelated to relevant contentious resources, when measuring speed

maybe if you're doing a billion small files than overhead begins to matter, but copying 3 big files from 3 different directories is gonna take just as long if you do them in parallel vs. if you do them sequentially

that may not be true if they're on different disks, but that kind of proves my point, the directory isn't the factor, the underlying disk is

> The question is, how independent are IO operations in separate directories. And the article is claiming that they're fairly independent and don't block each other.

yeah and in this sense the article is misleading, because (as far as i know) directories are basically unrelated to independence in the general case


Added a small clarification: "The intuition here is that directories are a shared resource for their direct children and must therefore serialize concurrent directory-modifying operations, causing contention. In brief, file creation or deletion cannot occur at the same time within one directory."

Plus accompanying benchmark: https://alexsaveau.dev/blog/projects/performance/files/fuc/f...

---

> file operations are (almost always) io bound

This is a common misconception. It was presumably true a decade ago, but PCIe is getting exponentially faster every 3 years: https://arstechnica.com/gadgets/2022/06/months-after-finaliz...

The NVMe protocol has extremely deep queues [1] and leaving them empty means leaving performance on the table. I think you'll find it surprisingly difficult to saturate the PCIe bus with just one core: PCIe 7 will support 512GB/s. Assuming a single core can produce 64 bytes (an entire cache line!) per cycle running at 5GHz, you're still only at 320/512=62.5% saturation. This napkin math is a little BS, but my point is that individual cores are quickly going to be outpaced by bandwidth availability.

> and totally unclear how directories represent an "interference" boundary

To add a bit more color here, it depends on how your file system is implemented. I belive Windows stores every file's metadata in a global database, so segmenting operations by directory yields no benefits. On the other hand, Unix FSs tend to store file_name to inode mappings per directory, so creating a new mapping in one directory doesn't interfere with another directory.

[1]: https://en.wikipedia.org/wiki/NVM_Express#Comparison_with_AH...


> The NVMe protocol has extremely deep queues [1] and leaving them empty means leaving performance on the table. I think you'll find it surprisingly difficult to saturate the PCIe bus with just one core: PCIe 7 will support 512GB/s

is disk IO bottlenecked by NVMe/PCIe limits, or by disk iops limits?

> Unix FSs tend to store file_name to inode mappings per directory, so creating a new mapping in one directory doesn't interfere with another directory.

again, you're handwaving on what "interfere" means

do "inode mappings" represent resources with no shared resource constraints?

is reading one "inode mapping" as fast as you can with one core independent from reading a different "inode mapping" as fast as you can with a separate core?

afaik it is not, am i wrong?


> is disk IO bottlenecked by NVMe/PCIe limits, or by disk iops limits?

Note that I'm out of my depth here, so this is all speculation. Until we hit hardware limitations (which will be PCIe 6 if I had to guess), I'm pretty sure those are the same thing. One read/write iop = 4KiB. If your PCIe bandwidth is limited to N GB/s, then there are only so many iops you can physically send/receive to/from the SSD, regardless of how many iops the SSD could be capable of processing. So currently we're bottlenecked by PCIe, but I doubt that will continue to be the case.

> again, you're handwaving on what "interfere" means

It depends on how the file system is implemented, but my guess would be that a lock on the inode or block cache entry is acquired.

> do "inode mappings" represent resources with no shared resource constraints?

Those are the contents of the directory. The problem is not reading them, but changing them.


when you interact with a file or directory, the mapping of that logical thing to disk inodes (which are the physical 4kb whatevers you speak of) is managed for you by the fs cache

that cache coalesces and serializes access to disk, it does all of this "locking" you're referring to, and it's very smart

it seems like you're writing code assuming this intermediating layer does not exist?


> The intuition here is that directories are a shared resource for their direct children and must therefore serialize concurrent directory-modifying operations, causing contention.

why do you think this is true?

i've never heard of anything like it

directories are inodes on a file system, they are in no way "shared resources for their direct children", and there is no concept of a "directory-modifying operation" which contends with operations on any file (or directory) which is a "child" (subdir, sub-file) of that directory

your claim is totally bizarre to me, afaik it's nonsensical, but i guess i could be mistaken


A directory is a file like anything else that contains a map of names to inodes. If you're trying to add or remove mappings (create or delete files), then clearly some synchronization must occur or the contents of the file will contain garbage. In theory you could get away with a very small critical section that says "lock bytes N through M" of the file, but then how do you deal with disk block alignment (i.e. two pairs of n-m bytes are on the same disk block, so they need to take turns anyway) and how do you deal with I/O errors (the first n-m bytes fail, but the second n-m succeed, now you have a hole with garbage).

Also no need to theorize: run the benchmark I linked for yourself. It clearly shows a massive advantage to having each thread work with its own directory.


> A directory is a file like anything else that contains a map of names to inodes. If you're trying to add or remove mappings (create or delete files), then clearly some synchronization must occur or the contents of the file will contain garbage.

this synchronization is handled for you by the fs, specifically the fs cache

inode alignment and errors are managed by this intermediating layer

your benchmarks are not demonstrating what you think they are demonstrating


Except they are and your claims are trivial to disprove: simply run the benchmarks under perf. You'll find that most of the time is spent on the rwsem which is described here onwards: https://www.kernel.org/doc/html/latest/filesystems/path-look...


your benchmarks use O_DIRECT, which bypasses the fs cache

the fs cache does most/all of the optimizations you're doing manually

bypassing the fs cache is highly atypical for user-space code


1. your benchmarks exercise a very narrow set of use cases

2. the overhead of modifying the dirent is statistically zero compared to the costs related to manipulating the files on disk


No. Run the benchmark on a tmpfs:

$ hyperfine --warmup 3 -N "./test /dev/shm 8 zip" "./test /dev/shm 8 chain" Benchmark 1: ./test /dev/shm 8 zip Time (mean ± σ): 118.5 ms ± 11.6 ms [User: 92.9 ms, System: 726.6 ms] Range (min … max): 103.6 ms … 143.4 ms 23 runs

Benchmark 2: ./test /dev/shm 8 chain Time (mean ± σ): 235.7 ms ± 11.0 ms [User: 116.4 ms, System: 1537.7 ms] Range (min … max): 220.1 ms … 258.3 ms 13 runs

Summary './test /dev/shm 8 zip' ran 1.99 ± 0.22 times faster than './test /dev/shm 8 chain'


do you know what a tmpfs is? or what these programs are actually exercising?

i mean ignore me if you want, no skin off my back

but you're not benchmarking what you think you're benchmarking


Synchronous file system IO throughput is inversely correlated with latency. If you have a remote network then using multiple connections at the same time allows linear performance improvements until you have maxed out the throughout of your network link or remote filesystem.


yes, or the remote network link, or any middlebox between you and they

but this is all a bit tangential as the article is about file system stuff


What I would suppose is that it reduces the amount of redundant IO where a directory is edited, flushed to disk, and then later updated again. If all these updates happen in a batch there will be less IO overall.


sure, but the fs cache does all of this kind of stuff for you, right?


Just started working on ‘can’, an ‘rm’ replacement, that moves files to the trash instead instead of deleting them. It only works on macos right now but I was hoping to make it cross-platform. It’s not faster than ‘rm’ but hopefully saves accidental deletions. https://github.com/joshvoigts/can


For Linux there's [trash-cli](https://github.com/andreafrancia/trash-cli/). Doesn't seem to work for MacOS per this issue (https://github.com/andreafrancia/trash-cli/issues/284), but it suggests to use https://hasseg.org/trash/


Ya, it’s mostly a learning excuse.


This could be useful for e.g. bazel, where I’ve regularly seen deleting the bazel caches take on the order of 10 minutes because of absurdly large and repeated directory trees (caused by things like runfile trees containing the python interpreter).


The lazy delete as outlined in https://news.ycombinator.com/item?id=35309328 should also work?





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: