Unraveling rm: what happens when you run it? (safia.rocks)
217 points by ingve 11 months ago | 88 comments

What I found most interesting from working on some FUSE filesystems (and from the git pseudo-filesystem) is that removing a file via unlink is not a file operation at all, but an operation on the parent directory. The only way that the filesystem knows how to find a file (an inode) is by finding it in a directory listing (which is itself a filesystem object).

The name itself is the giveaway -- "unlink" because you're removing a hard link to a file.

Similarly, access permissions are also properties of the embedding in the directory, rather than the bits of the file itself.

This is a place where POSIX and Win32 diverge significantly -- in Win32, permissions and access happen at the file level, which is why Windows is testy about letting you delete a file that is in use, while POSIX doesn't care -- the process accessing the file, once the file is open, maintains a link to the inode, and all the file data is intact, just not findable in the directory where it was initially located.
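A minimal sketch of the POSIX side of this, assuming a local filesystem and Python's thin wrappers over the syscalls. The helper name `read_after_unlink` and the file name `victim` are made up for illustration. It unlinks a file while it is still open and reads the data back through the descriptor:

```python
import os
import tempfile


def read_after_unlink(payload):
    """Unlink a file while it is open, then read it back through the fd."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "victim")  # hypothetical file name
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, payload)
        os.unlink(path)                    # removes the directory entry only
        assert not os.path.exists(path)    # the name is gone...
        links = os.fstat(fd).st_nlink      # ...link count as seen through the fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))   # ...but the data is still readable
        return data, links
    finally:
        os.close(fd)                       # last reference dropped: space is freed
        os.rmdir(workdir)


data, links = read_after_unlink(b"still here")
print(data, links)
```

The link count reported after the unlink is normally 0 on local filesystems, which is exactly the "no name, but still alive" state described above.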

A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.

> A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.

A technique used in the most extreme unix system recovery story I've ever read, by Al Viro:



* system libraries are recovered from init still having them mapped/opened, as you describe

* basic system utilities like 'ln' are recreated from their syscalls and writing the assembly

* ELF binaries are recreated by crafting their headers manually

> access permissions are also properties of the embedding in the directory, rather than the bits of the file itself.

No, in POSIX they are properties of the file. You can use fchmod() and fchown() to change mode and ownership via an fd.
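A quick way to check this, sketched in Python (the function name `mode_follows_inode` and the file names are hypothetical): change the mode through an open descriptor with fchmod() and observe that every hard link to the inode reports the new mode.

```python
import os
import shutil
import stat
import tempfile


def mode_follows_inode():
    """fchmod() via an fd changes the mode seen through every hard link."""
    workdir = tempfile.mkdtemp()
    first = os.path.join(workdir, "a")    # hypothetical names
    second = os.path.join(workdir, "b")
    fd = os.open(first, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.link(first, second)            # second name for the same inode
        os.fchmod(fd, 0o600)              # no path involved at all
        mode_a = stat.S_IMODE(os.stat(first).st_mode)
        mode_b = stat.S_IMODE(os.stat(second).st_mode)
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino
        return mode_a, mode_b, same_inode
    finally:
        os.close(fd)
        shutil.rmtree(workdir)


mode_a, mode_b, same_inode = mode_follows_inode()
print(oct(mode_a), oct(mode_b), same_inode)
```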

> A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.

Yes, you can access them; but I don't believe you can link them. (But I'd love to be proven wrong. A few years ago I actually needed to un-unlink a non-regular file that was still open.)

> Yes, you can access them; but I don't believe you can link them.

I think you can if you use debugfs. I wrote a post here about recovering a running binary after deleting the file on disk: http://lukechampine.com/recoverbin.html

Huh, I didn't know debugfs can operate on a mounted filesystem. Sounds incredibly dangerous...

debugfs(8) manpage says that ln "does not adjust the inode reference counts". Is there a way to increase that number?

I haven't tried it, but based on the manpage, I would expect this to work:

    set_inode_field foo links_count 1

Why does it sound dangerous? (Genuinely curious, it looks like it could be a very useful tool in certain situations)

I've never used it, but it appears to operate in read-only mode by default[0]:

> -w

> Specifies that the file system should be opened in read-write mode. Without this option, the file system is opened in read-only mode.

[0] https://linux.die.net/man/8/debugfs

> Yes, you can access them; but I don't believe you can link them.

Yes you can. I remember YouTube's flash viewer back in the day would put the downloaded flv video in /tmp and then delete it. I used to check the flash pid, go to /proc/{pid}/fd and see the symlink to the deleted file. Then a cp would give me the actual file.
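Roughly what that cp does, as a hedged Python sketch. It assumes Linux with procfs mounted and uses the process's own descriptor via /proc/self/fd rather than another process's; the helper name `recover_deleted` and the file names are invented for the example.

```python
import os
import shutil
import tempfile


def recover_deleted(payload):
    """Delete an open file, then copy it back out of /proc/self/fd."""
    if not os.path.isdir("/proc/self/fd"):
        return None                        # no procfs here (e.g. macOS)
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "video.flv")    # hypothetical names
    rescued = os.path.join(workdir, "rescued.flv")
    fd = os.open(original, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        os.unlink(original)                # the directory entry is gone
        # The /proc entry still resolves to the unlinked inode.
        with open("/proc/self/fd/%d" % fd, "rb") as src:
            with open(rescued, "wb") as dst:
                dst.write(src.read())
        with open(rescued, "rb") as check:
            return check.read()
    finally:
        os.close(fd)
        shutil.rmtree(workdir)


print(recover_deleted(b"frame data"))
```

Note that the result is a new file with a new inode, not the original one brought back.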

I don't think this is the same as linking the file. You are not creating a new link to an existing file, you are creating a copy of it and creating a link to that.

If you modified the old file after the cp, you wouldn't see the changes in the new one.
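The difference is easy to see in a small sketch (Python, with hypothetical names `copy_vs_link`, `original`, `copy`, `hardlink`): a copy is a separate inode and misses later writes, while a hard link is just another name for the same inode.

```python
import os
import shutil
import tempfile


def copy_vs_link():
    """Show that a copy diverges from the original but a hard link does not."""
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "original")
    copy = os.path.join(workdir, "copy")
    hardlink = os.path.join(workdir, "hardlink")
    try:
        with open(original, "wb") as f:
            f.write(b"v1")
        shutil.copyfile(original, copy)    # new inode, data duplicated
        os.link(original, hardlink)        # same inode, second name
        with open(original, "ab") as f:
            f.write(b"v2")                 # modify after copying/linking
        with open(copy, "rb") as f:
            copy_data = f.read()
        with open(hardlink, "rb") as f:
            link_data = f.read()
        return copy_data, link_data
    finally:
        shutil.rmtree(workdir)


copy_data, link_data = copy_vs_link()
print(copy_data, link_data)
```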

That's interesting about the access permissions and ownership; I thought that access permissions in POSIX were path-dependent. Some quick experimentation indicates that ownership and access do in fact apply across hard links.

There's still some truth to the path-dependent notion, in that you may not be able to access a file through a hard link in a directory that you do not have access to, even if you have access to that same hard link through another path. But if you don't have access to the file itself then you're out of luck.

This does make sense from a security perspective, but I thought that the path-dependent checks in the kernel were strong enough to not require inode-associated ACLs.

You're right about restoring the file by re-linking to the hard link, but you can access the contents and cp it out of proc at least.

On Linux, the linkat(2) function can link up open files using an empty origin path and the AT_EMPTY_PATH flag.

That’s not part of POSIX though.

You certainly cannot un-unlink a file by linking the pseudofile in /proc somewhere, as that would involve cross-filesystem hardlinks.

But it is probably possible to write a simple kernel module that would allow you to do that through some non-standard interface.

Yes you can.

    linkat(AT_FDCWD, "/proc/self/fd/N", destdirfd, newname, AT_SYMLINK_FOLLOW);
Will do it. This is the longest-available `flink` syscall method. See https://lwn.net/Articles/562488/

Sadly the AT_EMPTY_PATH change was backed out between 3.11-rc7 and release. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

As pointed out in another comment, this doesn't work when the link count is 0:

  $ uname -rv
  4.15.0-3-amd64 #1 SMP Debian 4.15.17-1 (2018-04-19)

  $ touch foo

  $ exec 3<>foo

  $ rm foo

  $ ls -l /proc/$$/fd/3
  lrwx------ 1 jwilk jwilk 64 Apr 25 10:02 /proc/324/fd/3 -> '/home/jwilk/foo (deleted)'

  $ strace -e linkat ln -L /proc/$$/fd/3 foo
  linkat(AT_FDCWD, "/proc/3447/fd/3", AT_FDCWD, "foo", AT_SYMLINK_FOLLOW) = -1 ENOENT (No such file or directory)
  ln: failed to create hard link 'foo' => '/proc/3447/fd/3': No such file or directory

  $ sudo strace -e linkat ln -L /proc/$$/fd/3 foo
  linkat(AT_FDCWD, "/proc/3447/fd/3", AT_FDCWD, "foo", AT_SYMLINK_FOLLOW) = -1 ENOENT (No such file or directory)
  ln: failed to create hard link 'foo' => '/proc/3447/fd/3': No such file or directory
  +++ exited with 1 +++

Unless you use `O_TMPFILE` without `O_EXCL` to create the file.

    open("/tmp", O_RDWR|O_TMPFILE, 0666)    = 3
    linkat(3, "", AT_FDCWD, "/tmp/bar", AT_EMPTY_PATH) = 0

I don't think the cross-filesystem hardlink is the problem, since the link in proc is a symbolic link.

I can do this experimentally by creating a symbolic link in /tmp (a different filesystem) to a file in /home, and then creating a hard link (with ln -L) from the symbolic link to another file in /home, and the result is a valid hardlink to the same inode as the original file.

This doesn't work through /proc for an unlinked file, but only because the underlying link call requires a path, not an inode. You can create a hard link out of proc if the file has not been deleted, though, without any cross-filesystem problems.

You have to somehow increment inode's reference count and write reference to it into some directory.

symlink() does not increase the reference count of anything, and in fact its target does not have to be a meaningful filename at all (although in the practical non-POSIX sense there is no string that is not a valid filename). One interesting abuse of this is that you can use symlink()/readlink() as an ad-hoc key-value store with atomicity guarantees (that hold true even on NFS). For example, Emacs uses exactly this for its file locking mechanism.
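A minimal sketch of the symlink-as-lock idea in Python. The function names `try_lock` and `lock_owner`, the lock file name, and the owner strings are all made up for illustration; this is not a copy of what Emacs actually writes. The key property is that symlink() either creates the name or fails because it already exists, with nothing in between.

```python
import os
import shutil
import tempfile


def try_lock(lock_path, owner):
    """Atomically claim lock_path; the symlink target stores who holds it."""
    try:
        os.symlink(owner, lock_path)   # target is just a string, need not exist
        return True
    except FileExistsError:
        return False                   # somebody else got there first


def lock_owner(lock_path):
    """Return the stored owner string, or None if the lock is not held."""
    try:
        return os.readlink(lock_path)
    except FileNotFoundError:
        return None


workdir = tempfile.mkdtemp()
lock = os.path.join(workdir, ".#notes.txt")          # hypothetical lock name
first = try_lock(lock, "alice@host.123")
second = try_lock(lock, "bob@host.456")
holder = lock_owner(lock)
os.unlink(lock)                                      # release the lock
released = lock_owner(lock)
shutil.rmtree(workdir)
print(first, second, holder, released)
```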

IIRC the files in /proc/pid/fd are not true symlinks but something that behaves as both file (you can do same IO operations as on the original FD) and symlink (ie. you can readlink() them and get some string) at once.

man 2 open section about O_TMPFILE seems to strongly imply that you can linkat from /proc/<pid>/fd to concrete file. Not sure if there are some special cases for /proc/self/fd vs /proc/<pid>/fd, but that would seem bit odd.


edit: nevermind, seems like O_TMPFILE is the one that has been special-cased here, from man 2 linkat:

> This will generally not work if the file has a link count of zero (files created with O_TMPFILE and without O_EXCL are an exception).



In POSIX, why can't you link a file to which you have read access, but you're not an owner or have group access? It's a pretty annoying restriction.

because you are modifying the file's inode, which you do not have permission to modify.

Ok, but what does it modify besides the refcount?

not sure. but you don't have permission to modify that inode, hence, no permission to link. the model is pretty straightforward.

also, being able to 'create' files 'owned' by another user in other locations (by linking them into place) could create quite a few bizarre and undefined corner cases, some of which might have implications for system stability and/or security.

But... traditionally, Unix systems do allow creating hardlinks of other users' files. And yes, this misfeature is a source of great number of security holes.

An option to disable this behavior (/proc/sys/fs/protected_hardlinks) was added only in Linux 3.6, and then it's still disabled by default.

consider what would happen if the file was counted against someone's quota and they rm'd the file but your link was still outstanding.

I suppose you can read the file if you can intercept an open fd, read it, and write it somewhere. It could possibly be the now vacant previous location.

For me it was (IIRC) a socket, so I couldn't "read it".

I believe you may be able to relink the inode with debugfs, depending on the filesystem.

On macOS, you might be able to accomplish this with fclonefileat()

This trick used to be the way I would download (or play in VLC) videos sent through Flash embeds in webpages. Way back in the day there was an actual temp file. But many companies didn't like that so they started deleting the file immediately to preserve the consensual hallucination that is 'streaming' and keep the lawyers happy. The way around this was to use stat, proc and awk in a bashrc function, ie:

    vlc $(stat -c %N /proc/*/fd/\* 2>&1 | awk -F[\`\'] '/tmp\/Flash/{print$2}')
With newer versions of Flash this too went away (around 2014). But I still keep a browser profile around that uses an old version, just so I can access the downloaded file to play in VLC (much smoother).

Access permissions are part of the inode, IIRC, and not part of the directory entry (basically the same as in Win32?).

On FAT the "permissions" are in fact part of the directory entry (IIRC some OSes in "multitasking DOS" family even have FAT extended by having what essentially is the unix mode, uid, gid tuple in the directory entry).

On NTFS permissions live in MFT, which essentially is same thing as inode.

Yes, this is documented in inode(7)

Another neat side effect of this is that you can remove a file that you don't own, so long as it's in a directory that you do. rm prompts to confirm by default, but it's intuitively surprising that you can delete a root-owned file that you have no read or write permissions to.
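A rough illustration in Python. Creating a root-owned file needs privileges, so this sketch (with the hypothetical helper `remove_unreadable`) only strips all permission bits from the file instead; the point is the same, since unlink() checks the directory's permissions and not the file's.

```python
import os
import tempfile


def remove_unreadable():
    """Unlink a file whose own mode grants no access to anyone."""
    workdir = tempfile.mkdtemp()           # we own this directory (mode 0700)
    path = os.path.join(workdir, "locked")
    with open(path, "w") as f:
        f.write("secret")
    os.chmod(path, 0)                      # no read, write or execute bits
    os.unlink(path)                        # only needs write+search on workdir
    still_there = os.path.exists(path)
    os.rmdir(workdir)
    return still_there


still_there = remove_unreadable()
print(still_there)
```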

This is also a cause of frequent confusion among less knowledgeable users when they try to clear up space on a filesystem that is reporting full, but there is a process keeping the file descriptor open. As far as they can tell the file is gone, but the usage hasn't gone down. There is a guy at my work that I point out lsof +L1 to about every three months or so. He can't seem to wrap his head around the concept.

Using hard links you can have a file that exists in multiple places with none of these being ‘the’ place. If you remove one of these links, nothing happens in the other places. All you see is the reference count going down by one.

Only when the reference count is zero will the file data be removed.
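The counting can be watched directly through st_nlink. A small Python sketch, with hypothetical names (`link_counts`, `a`, `b`):

```python
import os
import shutil
import tempfile


def link_counts():
    """Track st_nlink as a second name is added and the first one removed."""
    workdir = tempfile.mkdtemp()
    a = os.path.join(workdir, "a")
    b = os.path.join(workdir, "b")
    try:
        with open(a, "wb") as f:
            f.write(b"payload")
        counts = [os.stat(a).st_nlink]     # one name so far
        os.link(a, b)
        counts.append(os.stat(a).st_nlink) # two names, one inode
        os.unlink(a)                       # drop the "original" name
        counts.append(os.stat(b).st_nlink) # the other name is unaffected
        with open(b, "rb") as f:
            content = f.read()
        return counts, content
    finally:
        shutil.rmtree(workdir)


counts, content = link_counts()
print(counts, content)
```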

> Only when the reference count is zero will the file data be removed.

Not even necessarily then, right? (That is, there's no guarantee that a file's data is zeroed out just because its reference count drops to 0.) It's more just that it's only when the reference count is 0 that the actual space occupied on disk can be overwritten.

Flash (ab)used this to hide the cache files for streamed videos. It would create a temp file, then open it, then delete it. You can however still grab it out of proc...

O_TMPFILE sort of does that trick now automatically; it creates a file without a directory entry so you don't need to jump through those (race-prone) hoops of unlinking the file.

What a cool way to learn. I've been doing this 20 years and it never occurred to me to just unravel some of the basic tools I use every day and see if it matches my expectations.

I particularly like using strace/dtrace/dtruss to show the communications between the user process and the kernel. You can learn some very interesting system calls like that, and see how experienced programmers use them.

There's another tool of that variety: ltrace shows calls to functions in dynamically-loaded libraries. It... works, mostly, but it doesn't know things like function type signatures, so it has to guess, and sometimes gets it wrong in weird ways. It also doesn't know defines and enums, so it can't turn numbers back into symbolic constants like strace and its close kin can:


For tracing functions, there's also latrace:


Thanks for the link, I've mostly been focussing on perf and bpf related things but TIL about LD_AUDIT http://man7.org/linux/man-pages/man7/rtld-audit.7.html

Fun fact: according to unsubstantiated UNIX lore, "rm" is NOT short-hand for "remove" but rather, it stands for the initials of the developer that wrote the original implementation, Robert Morris.


I have no proof of this, but through oral tradition, such a tale has been relayed to me. Believe whatever you will.

Considering the naming scheme of other basic unix utilities, I'll chalk this one up to "fun coincidence" rather than actual truth.

Alternatively, we could make "Robert"/"Bob" a euphemism for nuking files ;) "Yeah I Bob'd the whole build directory" "Bombed?", "No, Bob'd. Like deleted... never mind"

Obligatory Bobby Tables reference [0].

[0] https://xkcd.com/327/

Some people in computing have definitely gotten their initials into wide distribution:



("PK" is not only the beginning of ZIP files but also of, among other things, ODT, DOCX, and JAR files, which are all in turn implemented as ZIP files.)

I was curious if this was true, and asked. The answer I got was http://minnie.tuhs.org/pipermail/tuhs/2018-April/013510.html and it appears that this isn't true.

Specifically the original man page (http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/man/man1/rm....), dated November, 1971, shows Dennis Ritchie and Ken Thompson as the original authors.

I once had to remove a LOT of files, so was looking how to do it efficiently -- find has a -delete option, so I straced it:

  $ touch a b c d e f g h
  /tmp/x $ strace find . -delete
  execve("/usr/bin/find", ["find", ".", "-delete"], [/* 32 vars */]) = 0
  fcntl(4, F_DUPFD, 3)                    = 5
  fcntl(5, F_GETFD)                       = 0
  fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
  getdents(4, /* 10 entries */, 32768)    = 240
  getdents(4, /* 0 entries */, 32768)     = 0
  close(4)                                = 0
  fcntl(5, F_DUPFD_CLOEXEC, 0)            = 4
  unlinkat(5, "d", 0)                     = 0
  unlinkat(5, "e", 0)                     = 0
  unlinkat(5, "f", 0)                     = 0
  unlinkat(5, "b", 0)                     = 0
  unlinkat(5, "c", 0)                     = 0
  unlinkat(5, "h", 0)                     = 0
  unlinkat(5, "a", 0)                     = 0
  unlinkat(5, "g", 0)                     = 0
  exit_group(0)                           = ?
  +++ exited with 0 +++


  int unlinkat(int dirfd, const char *pathname, int flags);
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by unlink(2) and rmdir(2) for a relative pathname).

I would expect the reason for this is in case someone is moving directories around while the find is happening ... Each time find enters a directory it doesn't actually chdir(), so it opens that directory and uses it to anchor the removal.
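The same pattern can be sketched in Python, whose os.unlink(name, dir_fd=...) maps onto unlinkat(). This is only an approximation of what find does: the helper name `delete_all_in` is hypothetical, and it handles plain files in a single directory, not subdirectories (those would need the AT_REMOVEDIR variant, i.e. os.rmdir with dir_fd).

```python
import os
import tempfile


def delete_all_in(directory):
    """Remove every entry in directory, anchored to an open directory fd."""
    dirfd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        removed = []
        for name in os.listdir(dirfd):     # read entries via the fd, not a path
            os.unlink(name, dir_fd=dirfd)  # unlinkat(dirfd, name, 0)
            removed.append(name)
        return sorted(removed)
    finally:
        os.close(dirfd)


workdir = tempfile.mkdtemp()
for name in "abcdefgh":
    open(os.path.join(workdir, name), "w").close()
removed = delete_all_in(workdir)
leftover = os.listdir(workdir)
os.rmdir(workdir)
print(removed, leftover)
```

Because every removal is relative to the descriptor, renaming or moving the directory mid-run does not redirect the deletes to some other path.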

rm -r does something similar.

> The first couple of lines in the trace seem to be pretty clearly related to setting up the sudo part of the command.

This is not correct. "sudo dtruss" makes sudo run dtruss, so it is not possible for dtruss to track what sudo is doing. You can see what "sudo" does (which is much more complicated) by running "sudo dtruss sudo". You need sudo twice because sudo is a setuid program, and obviously it would be insecure to allow anybody to trace a setuid program (for example, it has to read the shadow file, so if you could trace it then you could dump all the hashes and crack them on your own time).

On a separate note, it's probably a bad idea to go reading Linux man pages and expect those to give you accurate information about Mac system calls.

Indeed, surely Mac OS includes its own man pages? man 2 mprotect, man 2 unlink

It does, but they generally match those in Linux pretty closely due to POSIX.

on the higher level yes, but anything dealing with lower level things like kernel interfaces, devices, etc no

(similar situation for other unices and posix-likes)

Yes, but mprotect and unlink are both high level API calls.

I don't think OP wanted to trace sudo.

But then I'm not sure what sudo was for... Does dtruss require root privileges?

if you run without sudo, you get "dtrace: failed to initialize dtrace: DTrace requires additional privileges"

For the curious, here's what dtruss is:


     The dtruss utility traces system calls and (optionally) userland stack
     traces for the specified programs.

> The getpid command has a pretty self-explanatory name, but I wondered what the parameters that were passed to the function were.

Does anyone know what's going on here? It looks like all syscalls are shown to have 3 arguments, even when they wouldn't need that many... Except close() which is shown to have only one.

This appears to be an artifact of dtruss(1). Syscalls that aren't special-cased are assumed to have 3 arguments: https://github.com/joyent/dtruss-osx/blob/c3b23e279187056e76...

There are special cases for 0..6 arguments (including one for close(2)) so I guess getpid(2) slipped through the cracks.

not sure how much macos retains of this,

but bsd derivatives essentially convert userland libc 'syscalls' into a call to a single lower level 'syscall' function which passes data to/from the kernel using a macro / integer list to determine which actual functionality is desired..


> As it turns out, the getpid function supposedly takes no parameters, so the inclusion of the parameters in the call above perplexed me.

This is due to the fact that dtruss is actually a shell script trying to emulate truss with a d script and not quite succeeding.

See the section starting with the comment "print 0 arg output" in dtruss if you wish to understand the meaning of the 3 "arguments."

getpid() is most likely being called with 0 arguments.

> To allow it to process the rm command, I had to make a copy of the executable into a temporary directory and execute that.

To be clear this is due to SIP.

You should really spell out the full term if the acronym is predominantly used somewhere else. Unless I'm wrong and you're actually talking about the session initiation protocol (voice over ip)... But that makes little sense

It's probably System Integrity Protection, right?

I believe you are correct. System Integrity Protection is likely what they meant.

I wonder why Apple blocked tracing of SIP-protected programs. Wasn't SIP supposed to protect against their modification on disk, like NetBSD's veriexec? How would DTracing rm be dangerous?

Because then people could trace rm in a way that allowed them to run arbitrary code in that process. It's the same issue you'd have if you attached a debugger to the process or loaded a dynamic library in their address space.

That is not true, dtrace cannot modify data and is specifically designed not to do so.

It can however be used to leak information or read info out of other processes.

I was going to say that for this reason even on Linux you can't ptrace processes that are not children of the current process (e.g. you can run something under strace, but not attach to an existing process unless you twiddle a flag or do it as root). Having said that, you CAN modify data with ptrace, unlike dtrace. So that's kind of an aside. In any case the idea is that one process can't hijack another via ptrace, even one belonging to the same user.

As far as I was aware, dtrace lets you perform essentially arbitrary reads/writes of process memory using copyin and copyout. Is this not true?

It appears you are correct. TIL.

There are a few “destructive actions” that can be enabled with a DTrace flag and also require appropriate system permissions.

Nevertheless.

Cool "anthropology" method. This triggered my interest to go look at the C source code for linux[1]. Turns out the command actually relies on a few different files, including a "remove.h"[2]. I was surprised I didn't immediately see the call to unlink in the source code, but obviously there's a lot of useful infrastructure to build on. Further investigation revealed that the unlinkat[3] syscall was actually being used. Minorly fascinating :)

[1] https://github.com/coreutils/coreutils/blob/master/src/rm.c

[2] https://github.com/coreutils/coreutils/blob/master/src/remov...

[3] https://linux.die.net/man/2/unlinkat

> This triggered my interest to go look at the C source code for linux[1].

Coreutils is not Linux, and GNU tools are notoriously heavyweight. For contrast, here is the Toybox implementation of rm:



https://git.busybox.net/busybox/tree/coreutils/rm.c https://git.busybox.net/busybox/tree/libbb/remove_file.c

And finally openbsd:


I was somewhat surprised by the small number of process startup/libc initialization syscalls in the trace in original article.

On Linux (Debian unstable in my case) you will get an order of magnitude more, partly because rm is dynamically linked (although it seems only against libc and nothing else) and partly because of libc startup (did you know that Linux has an amd64-specific syscall, arch_prctl(ARCH_SET_FS), that does exactly what it sounds like?). On top of that, coreutils rm cares about such things as whether stdin is a terminal (probably because -f/-i behavior changes depending on that) and does the actual unlink in a somewhat convoluted way that involves fstatat() (called twice, for some reason) and only then unlinkat(). Somewhat notably, the last thing rm does is try to lseek() stdin, only to get ESPIPE...

It's very educational to dig into stuff that everyone takes for granted or even try implement your own version that does similar things (at a smaller scale and with less features, performance and security of course, just for educational purposes, not to serve as a real replacement).

E.g. writing your own shell sounds hard but you can do it in an hour (of course it'd be very bare compared to even ash). I once did it during a very casual C class while trying to impress the teacher. He even joked it was self-hosting when he saw the end result (as in, I didn't need other shells anymore and could use vim and gcc from my shell, so I could use my shell to work on my shell). I wanted to put that into a tutorial at some point but there already exists one[0] (that blog seems quite tinker-y actually; it also covers implementing your own Linux syscall, which is surprisingly easy to do and which I did for a class at one point too[1]).

Going knee deep into this stuff also completely dispels the magic that language and system runtimes, filesystems, file formats, linkers, shells, standard commands (e.g. ls, I had to reimplement it once as homework) or whatever have, or even better - it's still magic to most people and you're the wizard now! It's also very satisfying to do something so unique in an hour or two (although for me, due to my C and C++ bias, at some point making a pastebin clone in an hour in PHP or Python became a unique experience).

Even JIT (which sounds scary due to V8, LuaJIT, etc. being so tightly made and complex) is easy to get into and understand at toy scale[2].

[0] - https://brennan.io/2015/01/16/write-a-shell-in-c/

[1] - https://brennan.io/2016/11/14/kernel-dev-ep3/

[2] - http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...

There is an old story floating on the net where a sysadmin did the fateful recursive "rm" on the root directory and managed to hit CTRL-C before it got too far.

He then had to jump through a bunch of hoops and use a bunch of strange commands to restore / because of all the missing binaries. I wish I could find it.

I had to do a sort of version of this on a machine of mine once. Its hard disk was failing, and it couldn't I/O half of the system-critical binaries, but it did still have some in disk cache. It doubled as my network's SSH gateway/proxy for when I'm coming from an IPv4-only network, as my home only gets a single IPv4 address.

> I had a handy program around (doesn't everybody?) for converting ASCII hex to binary, and the output of /usr/bin/sum tallied with our original binary.

I copied busybox's nc onto it (needed for the ssh jump); had to copy to /run as / was unwritable due to the disk failure. Now, scp no longer ran, so "copy" was a Python script to turn the binary into a printf command, which is a shell built-in and can write arbitrary binary.
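Something along these lines would do that conversion. This is a reconstruction of the idea, not the original script: the function name `to_printf_commands`, the chunk size, and the destination path are assumptions. It emits every byte as a three-digit octal escape, which POSIX printf understands in its format string, and splits the output into several commands so no single line gets too long.

```python
import shlex


def to_printf_commands(data, dest, chunk_size=256):
    """Turn bytes into shell printf commands that recreate them at dest."""
    commands = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Three-digit octal escapes cover all 256 byte values, including NUL.
        escapes = "".join("\\%03o" % byte for byte in chunk)
        redirect = ">" if offset == 0 else ">>"   # truncate first, then append
        commands.append(
            "printf '%s' %s %s" % (escapes, redirect, shlex.quote(dest))
        )
    return commands


# First four bytes of an ELF binary, written to a hypothetical /run/nc.
for line in to_printf_commands(b"\x7fELF", "/run/nc"):
    print(line)
```

The output lines can then be pasted into the remote shell. An empty input yields no commands at all, so the destination file would not be created in that case.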

(If you ever get into a jam like this, busybox's utilities are very useful.)

> But hang on---how do you set execute permission without /bin/chmod? A few seconds thought (which as usual, lasted a couple of minutes) suggested that we write the binary on top of an already existing binary, owned by me...problem solved.

I think umask'ing correctly prior to a printf should work nowadays, no? (IDK about in the author's time.) Thankfully in my case, chmod was in disk cache still.


Thank you for finding that.

I do recall reading that story too.

It's not the story you want, but the origin story of the ext3grep tool is interesting too: https://web.archive.org/web/20110529114328/https://carlo17.h...

I have a similar funny story of recovery after a root dir incident. Someone told me their "cygwin just broke". It was indeed broken and the bash was unusable.

Long story short: they accidentally ran find with an exec of chmod that takes away the x bit on /.

It sounds ouch-y but it's not that bad because it turned out that find went alphabetically (or so) so first it went into /bin and then quickly made chmod unexecutable, so only the stuff that came before chmod in /bin was affected and all it took was to make it executable again normally via Windows.

It's actually even good bash happens to come before chmod in the sorting and that made cygwin completely "broken" at a glance or else it'd Murphy's law its way into "I don't have executable bit on this exotic rarely used command or other" 5 months down the line with the relevant lines of ~/.bash_history and everyone's short term memory long gone.

There's probably more than one of these posts :)

Here's the one I found in my bookmarks: http://lambdaops.com/rm-rf-remains/

One thing I've been thinking is if there would be value in (obviously non-POSIXy) userland that would mirror more closely the underlying syscalls. I'm not sure how it would exactly look like, and obviously there would be still need for higher level utilities too.

There's some precedent for that: POSIX has "link" and "unlink" utilities, which just call the corresponding functions. Most people use higher-level "ln" and "rm".

> Some investigation revealed that csops is a system call that is unique to the Apple operating system and can be used to check the signature that is written into a memory page by the operating system.

Close. That's one thing that csops can do, but in this case it's being used to extract the entitlements from the binary (CS_OPS_ENTITLEMENTS_BLOB == 7).

> I wasn’t sure what the memory addresses that were referenced in the mprotect call actually corresponded to or what the best way to figure it would be.

You could fire it up in LLDB and break on mprotect…

which is why building from source is handy, if you want to know what something does, just take a look.

If you delete a file in PHP then you 'unlink(file)' rather than 'rm(file)'. Never knew there was a reason, thought it was a quirk.

You are right, in a way, PHP is a quirk.

Why does rm need to call mprotect or ioctl?

It's called by the dynamic loader when it's loading pages into memory.
