
Unraveling rm: what happens when you run it? - ingve
https://blog.safia.rocks/post/173241985600/unraveling-rm-what-happens-when-you-run-it
======
andrewla
What I found most interesting from working on some FUSE filesystems (and from
the git pseudo-filesystem) is that removing a file via unlink is not a file
operation at all, but an operation on the parent directory. The only way that
the filesystem knows how to find a file (an inode) is by finding it in a
directory listing (which is itself a filesystem object).

The name itself is the giveaway -- "unlink" because you're removing a hard
link to a file.
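
A minimal Python sketch of the link-count view, assuming a filesystem that supports hard links (the file names `first` and `second` are made up for illustration). `os.link` adds a second directory entry for the same inode, and `os.unlink` removes one entry without touching the data as long as another link remains:

```python
import os
import tempfile


def link_counts():
    """Return the inode's st_nlink after create, link and unlink."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first")      # hypothetical names
        second = os.path.join(d, "second")
        with open(first, "w") as f:
            f.write("data")
        counts = [os.stat(first).st_nlink]
        os.link(first, second)                # second name, same inode
        counts.append(os.stat(first).st_nlink)
        os.unlink(first)                      # removes a name, not the data
        counts.append(os.stat(second).st_nlink)
        with open(second) as f:
            survived = f.read() == "data"
    return counts, survived
```

Calling `link_counts()` should show the count going up and back down while the contents stay readable through the remaining name.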

Similarly, access permissions are also properties of the embedding in the
directory, rather than the bits of the file itself.

This is a place where POSIX and Win32 diverge significantly -- in Win32,
permissions and access happen at the file level, which is why Windows is testy
about letting you delete a file that is in use, while POSIX doesn't care --
the process accessing the file, once the file is open, maintains a link to the
inode, and all the file data is intact, just not findable in the directory
where it was initially located.

A neat trick here is that you can effectively still access the file (even
restore it) if a process has the file open, through the /proc filesystem.
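
A small Python sketch of that behaviour, assuming a POSIX system (the `/proc` part is Linux-only and is skipped elsewhere; the file name `victim` is made up). The directory entry disappears, but the open descriptor still reaches the inode:

```python
import os
import tempfile


def read_after_unlink():
    """Unlink a file while it is open, then read it through the open fd."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "victim")      # hypothetical name
        with open(path, "w") as f:
            f.write("still here")
        fd = os.open(path, os.O_RDONLY)
        try:
            os.unlink(path)                   # the directory entry is gone
            gone = not os.path.exists(path)
            nlink = os.fstat(fd).st_nlink     # typically 0 on local filesystems
            data = os.read(fd, 100).decode()  # inode and data survive
            # Linux-only: the same open file is reachable through /proc.
            proc_path = "/proc/self/fd/%d" % fd
            via_proc = None
            if os.path.exists(proc_path):
                with open(proc_path) as g:
                    via_proc = g.read()
        finally:
            os.close(fd)
    return gone, nlink, data, via_proc
```

Reading through `/proc/self/fd/N` only copies the data back out; it does not by itself give the inode a name again.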

~~~
jwilk
> access permissions are also properties of the embedding in the directory,
> rather than the bits of the file itself.

No, in POSIX they are properties of the file. You can use fchmod() and
fchown() to change mode and ownership via an fd.
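
One way to see this, sketched in Python under the assumption that hard links are available (the names `a` and `b` are illustrative): change the mode through a descriptor, with no path involved, and the change shows up under every name of the file.

```python
import os
import stat
import tempfile


def mode_is_shared():
    """fchmod() via an fd changes the mode seen through every hard link."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")              # hypothetical names
        b = os.path.join(d, "b")
        fd = os.open(a, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            os.link(a, b)                     # second name for the same inode
            os.fchmod(fd, 0o600)              # operates on the inode, not a path
            mode_a = stat.S_IMODE(os.stat(a).st_mode)
            mode_b = stat.S_IMODE(os.stat(b).st_mode)
        finally:
            os.close(fd)
    return mode_a, mode_b
```

Both values come back as `0o600`, because the mode bits live in the inode rather than in either directory entry.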

> A neat trick here is that you can effectively still access the file (even
> restore it) if a process has the file open, through the /proc filesystem.

Yes, you can access them; but I don't believe you can link them. (But I'd love
to be proven wrong. A few years ago I actually needed to un-unlink a non-regular
file that was still open.)

~~~
dfox
You certainly cannot un-unlink a file by linking the pseudofile in /proc
somewhere, as that would involve cross-filesystem hardlinks.

But it is probably possible to write a simple kernel module that would allow you
to do that through some non-standard interface.

~~~
daurnimator
Yes you can.

    
    
        linkat(AT_FDCWD, "/proc/self/fd/N", destdirfd, newname, AT_SYMLINK_FOLLOW);
    

Will do it. This is the longest-available `flink` syscall method. See
[https://lwn.net/Articles/562488/](https://lwn.net/Articles/562488/)

Sadly the AT_EMPTY_PATH change was backed out between 3.11-rc7 and release.
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v3.11&id=f0cc6ffb8ce8961db587e5072168cac0cbc25f05)

~~~
jwilk
As pointed out in another comment, this doesn't work when the link count is 0:

    
    
      $ uname -rv
      4.15.0-3-amd64 #1 SMP Debian 4.15.17-1 (2018-04-19)
    
      $ touch foo
    
      $ exec 3<>foo
    
      $ rm foo
    
      $ ls -l /proc/$$/fd/3
      lrwx------ 1 jwilk jwilk 64 Apr 25 10:02 /proc/324/fd/3 -> '/home/jwilk/foo (deleted)'
    
      $ strace -e linkat ln -L /proc/$$/fd/3 foo
      linkat(AT_FDCWD, "/proc/3447/fd/3", AT_FDCWD, "foo", AT_SYMLINK_FOLLOW) = -1 ENOENT (No such file or directory)
      ln: failed to create hard link 'foo' => '/proc/3447/fd/3': No such file or directory
    
      $ sudo strace -e linkat ln -L /proc/$$/fd/3 foo
      linkat(AT_FDCWD, "/proc/3447/fd/3", AT_FDCWD, "foo", AT_SYMLINK_FOLLOW) = -1 ENOENT (No such file or directory)
      ln: failed to create hard link 'foo' => '/proc/3447/fd/3': No such file or directory
      +++ exited with 1 +++

~~~
daurnimator
Unless you use `O_TMPFILE` without `O_EXCL` to create the file.

    
    
        open("/tmp", O_RDWR|O_TMPFILE, 0666)    = 3
        linkat(3, "", AT_FDCWD, "/tmp/bar", AT_EMPTY_PATH) = 0
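
A hedged Python sketch of the same idea, using the `/proc/self/fd/N` form from earlier in the thread rather than `AT_EMPTY_PATH` (which Python's `os` module does not expose). It is Linux-only and assumes the directory's filesystem supports `O_TMPFILE` and that `/proc` is mounted. The function name and arguments are made up for illustration:

```python
import os


def publish_tmpfile(directory, name, payload):
    """Create an unnamed file in `directory`, write to it, then name it.

    Raises OSError where O_TMPFILE or /proc is unavailable.
    """
    fd = os.open(directory, os.O_RDWR | os.O_TMPFILE, 0o600)  # no O_EXCL
    try:
        os.write(fd, payload)
        dirfd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # Passing a directory fd routes the call through linkat();
            # follow_symlinks=True corresponds to AT_SYMLINK_FOLLOW.
            os.link("/proc/self/fd/%d" % fd, name,
                    dst_dir_fd=dirfd, follow_symlinks=True)
        finally:
            os.close(dirfd)
    finally:
        os.close(fd)
    return os.path.join(directory, name)
```

Until the `os.link` call succeeds, the file has no name anywhere, so other processes never see it half-written.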

------
bhuga
What a cool way to learn. I've been doing this 20 years and it never occurred
to me to just unravel some of the basic tools I use every day and see if it
matches my expectations.

~~~
msla
I particularly like using strace/dtrace/dtruss to show the communications
between the user process and the kernel. You can learn some very interesting
system calls like that, and see how experienced programmers use them.

There's another tool of that variety: ltrace shows calls to functions in
dynamically-loaded libraries. It... works, mostly, but it doesn't know things
like function type signatures, so it has to guess, and sometimes gets it wrong
in weird ways. It also doesn't know defines and enums, so it can't turn
numbers back into symbolic constants like strace and its close kin can:

[https://en.wikipedia.org/wiki/Ltrace](https://en.wikipedia.org/wiki/Ltrace)

~~~
jwilk
For tracing functions, there's also latrace:

[https://people.redhat.com/jolsa/latrace/index.shtml](https://people.redhat.com/jolsa/latrace/index.shtml)

~~~
lathiat
Thanks for the link, I've mostly been focussing on perf and bpf related things
but TIL about LD_AUDIT
[http://man7.org/linux/man-pages/man7/rtld-audit.7.html](http://man7.org/linux/man-pages/man7/rtld-audit.7.html)

------
terminado
Fun fact: according to unsubstantiated UNIX lore, " _rm_ " is _NOT_ short-hand
for " _remove_ " but rather, it stands for the initials of the developer that
wrote the original implementation, Robert Morris.

[https://en.wikipedia.org/wiki/Robert_Morris_(cryptographer)](https://en.wikipedia.org/wiki/Robert_Morris_\(cryptographer\))

I have no proof of this, but through oral tradition, such a tale has been
relayed to me. Believe whatever you will.

~~~
AceJohnny2
Considering the naming scheme of other basic unix utilities, I'll chalk this
one up to "fun coincidence" rather than actual truth.

Alternatively, we could make "Robert"/"Bob" a euphemism for nuking files ;)
"Yeah I Bob'd the whole build directory" "Bombed?", "No, Bob'd. Like
deleted... never mind"

~~~
pjmorris
Obligatory Bobby Tables reference [0].

[0] [https://xkcd.com/327/](https://xkcd.com/327/)

------
rrauenza
I once had to remove a LOT of files, so was looking how to do it efficiently
-- find has a -delete option, so I straced it:

    
    
      $ touch a b c d e f g h
      /tmp/x $ strace find . -delete
      execve("/usr/bin/find", ["find", ".", "-delete"], [/* 32 vars */]) = 0
      [...]
      openat(AT_FDCWD, ".", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 4
      [...]
      fcntl(4, F_DUPFD, 3)                    = 5
      fcntl(5, F_GETFD)                       = 0
      fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
      getdents(4, /* 10 entries */, 32768)    = 240
      getdents(4, /* 0 entries */, 32768)     = 0
      close(4)                                = 0
      fcntl(5, F_DUPFD_CLOEXEC, 0)            = 4
      unlinkat(5, "d", 0)                     = 0
      unlinkat(5, "e", 0)                     = 0
      unlinkat(5, "f", 0)                     = 0
      unlinkat(5, "b", 0)                     = 0
      unlinkat(5, "c", 0)                     = 0
      unlinkat(5, "h", 0)                     = 0
      unlinkat(5, "a", 0)                     = 0
      unlinkat(5, "g", 0)                     = 0
      [...]
      exit_group(0)                           = ?
      +++ exited with 0 +++
    
    

[https://linux.die.net/man/2/unlinkat](https://linux.die.net/man/2/unlinkat)

    
    
      int unlinkat(int dirfd, const char *pathname, int flags);
    

_If the pathname given in pathname is relative, then it is interpreted
relative to the directory referred to by the file descriptor dirfd (rather
than relative to the current working directory of the calling process, as is
done by unlink(2) and rmdir(2) for a relative pathname)._

I would expect the reason for this is in case someone is moving directories
around while the find is happening ... Each time find enters a directory it
doesn't actually chdir(), so it opens that directory and uses it to anchor the
removal.

rm -r does something similar.
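
The pattern in the trace can be sketched in Python, whose `dir_fd` parameter maps onto the `*at()` family. This is a simplified illustration, not what find itself does: it assumes the directory contains no subdirectories, and the function name is made up.

```python
import os


def delete_entries(directory):
    """Remove every entry of `directory`, anchored on its descriptor.

    Assumes the directory holds only non-directory entries.
    """
    dirfd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    removed = []
    try:
        for name in os.listdir(dirfd):        # reads entries via the fd
            os.unlink(name, dir_fd=dirfd)     # unlinkat(dirfd, name, 0)
            removed.append(name)
    finally:
        os.close(dirfd)
    return sorted(removed)
```

Because every removal is relative to `dirfd`, renaming or moving the directory mid-run does not redirect the deletions to some other path.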

------
Hello71
> The first couple of lines in the trace seem to be pretty clearly related to
> setting up the sudo part of the command.

This is not correct. "sudo dtruss" makes sudo run dtruss, so it is not
possible for dtruss to track what sudo is doing. You can see what "sudo" does
(which is much more complicated) by running "sudo dtruss sudo". You need sudo
twice because sudo is a setuid program, and obviously it would be insecure to
allow anybody to trace a setuid program (for example, it has to read the
shadow file, so if you could trace it then you could dump all the hashes and
crack them on your own time).

On a separate note, it's probably a bad idea to go reading Linux man pages and
expect those to give you accurate information about Mac system calls.

~~~
temprature
Indeed, surely Mac OS includes its own man pages? man 2 mprotect, man 2 unlink

~~~
saagarjha
It does, but they generally match those in Linux pretty closely due to POSIX.

~~~
cat199
on the higher level yes, but anything dealing with lower level things like
kernel interfaces, devices, etc no

(similar situation for other unices and posix-likes)

~~~
saagarjha
Yes, but mprotect and unlink are both high level API calls.

------
msla
For the curious, here's what dtruss is:

[https://www.freebsd.org/cgi/man.cgi?query=dtruss&sektion=1&m...](https://www.freebsd.org/cgi/man.cgi?query=dtruss&sektion=1&manpath=FreeBSD+8.2-RELEASE)

    
    
         The dtruss utility traces system calls and (optionally) userland stack
         traces for the specified programs.

------
jwilk
> The getpid command has a pretty self-explanatory name, but I wondered what
> the parameters that were passed to the function were.

Does anyone know what's going on here? It looks like all syscalls are shown to
have 3 arguments, even when they wouldn't need that many... Except close()
which is shown to have only one.

~~~
zwp
This appears to be an artifact of dtruss(1). Syscalls that aren't special-cased
are assumed to have 3 arguments:
[https://github.com/joyent/dtruss-osx/blob/c3b23e279187056e76...](https://github.com/joyent/dtruss-osx/blob/c3b23e279187056e76c8ad02057d98b67f760290/dtruss#L237)

There are special cases for 0..6 arguments (including one for close(2)) so I
guess getpid(2) slipped through the cracks.

------
mark-wagner
> As it turns out, the getpid function supposedly takes no parameters, so the
> inclusion of the parameters in the call above perplexed me.

This is due to the fact that dtruss is actually a shell script trying to
emulate truss with a d script and not quite succeeding.

See the section starting with the comment "print 0 arg output" in dtruss if
you wish to understand the meaning of the 3 "arguments."

getpid() is most likely being called with 0 arguments.

------
gbrown_
> To allow it to process the rm command, I had to make a copy of the
> executable into a temporary directory and execute that.

To be clear this is due to SIP.

~~~
y4mi
You should really spell out the full term if the acronym is predominantly used
somewhere else. Unless I'm wrong and you're actually talking about the session
initiation protocol (voice over ip)... But that makes little sense

It's probably System Integrity Protection, right?

~~~
jessemillar
I believe you are correct. System Integrity Protection is likely what they
meant.

------
acbart
Cool "anthropology" method. This triggered my interest to go look at the C
source code for linux[1]. Turns out the command actually relies on a few
different files, including a "remove.h"[2]. I was surprised I didn't
immediately see the call to unlink in the source code, but obviously there's a
lot of useful infrastructure to build on. Further investigation showed that the
unlinkat[3] syscall was actually being used. Minorly fascinating :)

[1]
[https://github.com/coreutils/coreutils/blob/master/src/rm.c](https://github.com/coreutils/coreutils/blob/master/src/rm.c)

[2]
[https://github.com/coreutils/coreutils/blob/master/src/remov...](https://github.com/coreutils/coreutils/blob/master/src/remove.c)

[3]
[https://linux.die.net/man/2/unlinkat](https://linux.die.net/man/2/unlinkat)

~~~
zokier
> This triggered my interest to go look at the C source code for linux[1].

Coreutils is not Linux, and GNU tools are notoriously heavyweight. For
contrast, here is Toybox implementation of rm:

[https://github.com/landley/toybox/blob/master/toys/posix/rm....](https://github.com/landley/toybox/blob/master/toys/posix/rm.c)

busybox:

[https://git.busybox.net/busybox/tree/coreutils/rm.c](https://git.busybox.net/busybox/tree/coreutils/rm.c)
[https://git.busybox.net/busybox/tree/libbb/remove_file.c](https://git.busybox.net/busybox/tree/libbb/remove_file.c)

And finally openbsd:

[https://github.com/openbsd/src/blob/master/bin/rm/rm.c](https://github.com/openbsd/src/blob/master/bin/rm/rm.c)

~~~
dfox
I was somewhat surprised by the small number of process startup/libc
initialization syscalls in the trace in the original article.

On Linux (Debian unstable in my case) you will get an order of magnitude more,
because rm is dynamically linked (although it seems that only with libc and
nothing else), and because of libc startup (did you know that Linux has an
amd64-specific syscall arch_prctl(ARCH_SET_FS), that does exactly what it
sounds like?). And then because coreutils rm cares about such things as
whether stdin is a terminal (probably because -f/-i behavior changes depending
on that) and does the actual unlink in a somewhat convoluted way that involves
fstatat() (called twice, for some reason) and only then unlinkat(). Somewhat
notably, the last thing that rm does is try to lseek() stdin, only to get
ESPIPE...
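
A rough Python sketch of why the terminal check matters, based on the POSIX rule that rm prompts for a write-protected file only when stdin is a terminal and -f was not given. This is a simplification and not the coreutils code; the function name and parameters are made up, and real rm lets the last of -f/-i win.

```python
import os
import stat
import sys


def needs_prompt(name, dir_fd, force=False, interactive=False,
                 stdin_is_tty=None):
    """Decide whether an rm-like tool should ask before unlinking `name`."""
    if force:
        return False                          # simplified: -f always wins here
    if interactive:
        return True                           # -i: always ask
    if stdin_is_tty is None:
        stdin_is_tty = sys.stdin.isatty()
    if not stdin_is_tty:
        return False                          # nobody there to answer
    # fstatat(dir_fd, name, AT_SYMLINK_NOFOLLOW)
    st = os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
    if stat.S_ISLNK(st.st_mode):
        return False                          # a symlink's own mode is not checked
    # Ask only when the file is write-protected for the caller.
    return not os.access(name, os.W_OK, dir_fd=dir_fd)
```

After a "no prompt" or a confirmed prompt, the removal itself would be `os.unlink(name, dir_fd=dir_fd)`, matching the stat-then-unlinkat order seen in the trace.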

------
FRex
It's very educational to dig into stuff that everyone takes for granted or
even try to implement your own version that does similar things (at a smaller
scale and with fewer features, less performance and security of course, just for
educational purposes, not to serve as a real replacement).

E.g. writing your own shell sounds hard but you can do it in an hour (of
course it'd be _very_ bare compared to even ash). I once did it during a very
casual C class while trying to impress the teacher. He even joked it was self
hosting when he saw the end result (as in - I didn't need other shells
anymore and could use vim and gcc from my shell so I could use my shell to
work on my shell). I wanted to put that into a tutorial at some point but there
already exists one[0] (that blog seems quite tinker-y actually, it also covers
implementing your own Linux syscall, which is also surprisingly easy to do and
I did it for a class at one point too[1]).
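
To give a feel for how small the core is, here is a bare-bones sketch in Python rather than C. It is not the code from the class or from the linked tutorial, and it assumes a Unix-like system. It has no pipes, redirection, or builtins beyond `exit`:

```python
import os
import shlex


def run_command(line):
    """Run one command line; return its exit status (None for a blank line)."""
    argv = shlex.split(line)
    if not argv:
        return None
    pid = os.fork()
    if pid == 0:                              # child: become the command
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)                     # conventional "command not found"
    _, status = os.waitpid(pid, 0)            # parent: wait for the child
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)          # killed by a signal


def repl():
    """Read lines until EOF or `exit`, running each as a command."""
    while True:
        try:
            line = input("$ ")
        except EOFError:
            break
        if line.strip() == "exit":
            break
        run_command(line)
```

Calling `repl()` gives a prompt that can launch an editor or a compiler, which is roughly all the "self hosting" joke requires.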

Going knee deep into this stuff also completely dispels the magic that
language and system runtimes, filesystems, file formats, linkers, shells,
standard commands (e.g. ls, I had to reimplement it once as a homework) or
whatever have, or even better - it's still magic to most people and you're the
wizard now! It's also very satisfying to do something so unique in an hour
or two (although to me, due to my C and C++ bias, at some point making a
pastebin clone in an hour in PHP or Python became a unique experience).

Even JIT (which sounds scary due to V8, LuaJIT, etc. being so tightly made and
complex) is easy to get into and understand at toy scale[2].

[0] - [https://brennan.io/2015/01/16/write-a-shell-in-c/](https://brennan.io/2015/01/16/write-a-shell-in-c/)

[1] - [https://brennan.io/2016/11/14/kernel-dev-ep3/](https://brennan.io/2016/11/14/kernel-dev-ep3/)

[2] - [http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...](http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-simple-jits.html)

------
bsder
There is an old story floating on the net where a sysadmin did the fateful
recursive "rm" on the root directory and managed to hit CTRL-C before it got
too far.

He then had to jump through a bunch of hoops and use a bunch of strange
commands to restore / because of all the missing binaries. I wish I could find
it.

~~~
nzmsv
[https://www.ee.ryerson.ca/~elf/hack/recovery.html](https://www.ee.ryerson.ca/~elf/hack/recovery.html)

~~~
deathanatos
I had to do a sort of version of this on a machine of mine once. Its hard disk
was failing, and it couldn't read half of the system-critical binaries, but it
did still have some in disk cache. It doubled as my network's SSH
gateway/proxy for when I'm coming from an IPv4-only network, as my home only
gets a single IPv4 address.

> _I had a handy program around (doesn't everybody?) for converting ASCII hex
> to binary, and the output of /usr/bin/sum tallied with our original binary._

I copied busybox's nc onto it (needed for the ssh jump); had to copy to /run
as / was unwritable due to the disk failure. Now, scp no longer ran, so "copy"
was a Python script to turn the binary into a printf command, which is a shell
built-in and can write arbitrary binary.
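
A sketch of what such a script might look like (this is a reconstruction for illustration, not the script actually used; the function name and the chunk size are assumptions). Every byte becomes a three-digit octal escape, so nothing in the payload can be mistaken for a `%` conversion or a quote:

```python
import shlex


def printf_commands(data, dest, chunk=1024):
    """Turn `data` (bytes) into shell printf commands that recreate it at `dest`."""
    commands = []
    for offset in range(0, len(data), chunk):
        piece = data[offset:offset + chunk]
        escaped = "".join("\\%03o" % byte for byte in piece)
        redirect = ">" if offset == 0 else ">>"   # truncate once, then append
        commands.append("printf '%s' %s %s"
                        % (escaped, redirect, shlex.quote(dest)))
    return commands
```

The output can be pasted into the remote shell line by line; chunking keeps each command short enough for a terminal or argument-length limit.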

(If you ever get into a jam like this, busybox's utilities are _very_ useful.)

> _But hang on---how do you set execute permission without /bin/chmod? A few
> seconds thought (which as usual, lasted a couple of minutes) suggested that
> we write the binary on top of an already existing binary, owned by
> me...problem solved._

I think umask'ing correctly prior to a printf should work nowadays, no? (IDK
about in the author's time.) Thankfully in my case, chmod was in disk cache
still.
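
On the umask question: as far as I know, shell output redirection creates files with mode 0666 before the umask is applied, and a umask can only clear bits, so it cannot supply the execute bit. Overwriting an existing executable keeps that file's mode, which is why the quoted trick works. A small Python sketch of both cases (file names are made up):

```python
import os
import stat
import tempfile


def modes_after_write():
    """Compare a freshly created file with an overwritten executable."""
    with tempfile.TemporaryDirectory() as d:
        fresh = os.path.join(d, "fresh")          # hypothetical names
        existing = os.path.join(d, "existing")
        os.close(os.open(existing, os.O_WRONLY | os.O_CREAT, 0o600))
        os.chmod(existing, 0o755)                 # stands in for an existing binary

        old_umask = os.umask(0)                   # most permissive umask possible
        try:
            flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
            # What `> fresh` does: create with mode 0666 (minus the umask).
            os.close(os.open(fresh, flags, 0o666))
            # Overwriting in place: O_TRUNC keeps the inode and its mode.
            fd = os.open(existing, flags, 0o666)
            os.write(fd, b"new contents")
            os.close(fd)
        finally:
            os.umask(old_umask)
        return (stat.S_IMODE(os.stat(fresh).st_mode),
                stat.S_IMODE(os.stat(existing).st_mode))
```

Even with a umask of 0 the new file ends up as `0o666`, while the overwritten one stays `0o755`.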

------
zokier
One thing I've been thinking about is whether there would be value in an
(obviously non-POSIXy) userland that would mirror the underlying syscalls more
closely. I'm not sure what exactly it would look like, and obviously there would
still be a need for higher level utilities too.

~~~
jwilk
There's some precedent for that: POSIX has "link" and "unlink" utilities,
which just call the corresponding functions. Most people use higher-level "ln"
and "rm".

------
saagarjha
> Some investigation revealed that csops is a system call that is unique to
> the Apple operating system and can be used to check the signature that is
> written into a memory page by the operating system.

Close. That's one thing that csops can do, but in this case it's being used to
extract the entitlements from the binary (CS_OPS_ENTITLEMENTS_BLOB == 7).

> I wasn’t sure what the memory addresses that were referenced in the mprotect
> call actually corresponded to or what the best way to figure it would be.

You could fire it up in LLDB and break on mprotect…

------
cat199
which is why building from source is handy, if you want to know what something
does, just take a look.

------
Theodores
If you delete a file in PHP then you 'unlink(file)' rather than 'rm(file)'.
Never knew there was a reason, thought it was a quirk.

~~~
jjuhl
You are right, in a way, PHP _is_ a quirk.

------
iamrohitbanga
Why does rm need to call mprotect or ioctl?

~~~
saagarjha
It's called by the dynamic loader when it's loading pages into memory.

