
What I found most interesting from working on some FUSE filesystems (and from the git pseudo-filesystem) is that removing a file via unlink is not a file operation at all, but an operation on the parent directory. The only way that the filesystem knows how to find a file (an inode) is by finding it in a directory listing (which is itself a filesystem object).

The name itself is the giveaway -- "unlink" because you're removing a hard link to a file.

Similarly, access permissions are also properties of the embedding in the directory, rather than the bits of the file itself.

This is a place where POSIX and Win32 diverge significantly -- in Win32, permissions and access happen at the file level, which is why Windows is testy about letting you delete a file that is in use, while POSIX doesn't care -- the process accessing the file, once the file is open, maintains a link to the inode, and all the file data is intact, just not findable in the directory where it was initially located.
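A minimal Python sketch of the POSIX side of this, assuming a local POSIX filesystem; the file name `victim` is made up for illustration. The name disappears from the directory, but the open descriptor still reaches the data:

```python
import os
import tempfile


def read_after_unlink(payload):
    """Unlink a file while it is open, then read it back through the fd."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "victim")  # hypothetical file name
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, payload)
        os.unlink(path)                      # removes the directory entry only
        name_exists = os.path.lexists(path)  # the name is gone...
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))     # ...but the inode is still readable
    finally:
        os.close(fd)                         # last reference: space can be reclaimed
        os.rmdir(workdir)
    return name_exists, data
```

Calling `read_after_unlink(b"still here")` should report that the name no longer exists while returning the original bytes.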

A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.
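A rough Python sketch of that /proc trick, assuming Linux with procfs mounted and that the file is open in the current process (for another process you would use its pid instead of `self`). The helper name `restore_from_proc` is hypothetical. Note that this copies the contents into a new file; it does not re-link the original inode:

```python
import os
import shutil


def restore_from_proc(fd, dest):
    """Copy the data behind an open (possibly deleted) fd to dest via /proc."""
    proc_path = "/proc/self/fd/%d" % fd
    if not os.path.exists(proc_path):
        return False  # no procfs here, or fd is not open
    # Opening the /proc entry reopens the underlying inode at offset 0.
    with open(proc_path, "rb") as src, open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```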




> A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.

A technique used in the most extreme unix system recovery story I've ever read, by Al Viro:

http://yarchive.net/comp/linux/extreme_system_recovery.html

Wherein:

* system libraries are recovered from init still having them mapped/opened, as you describe

* basic system utilities like 'ln' are recreated from their syscalls and writing the assembly

* ELF binaries are recreated by crafting their headers manually


> access permissions are also properties of the embedding in the directory, rather than the bits of the file itself.

No, in POSIX they are properties of the file. You can use fchmod() and fchown() to change mode and ownership via an fd.
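A small Python sketch of this, assuming a POSIX filesystem that supports hard links; the names `a` and `b` are placeholders. The mode is changed through the descriptor only, yet both names report the new mode, because it is stored in the inode:

```python
import os
import shutil
import stat
import tempfile


def modes_after_fchmod(new_mode):
    """Change the mode via an fd and report it as seen through two hard links."""
    workdir = tempfile.mkdtemp()
    a = os.path.join(workdir, "a")  # placeholder names
    b = os.path.join(workdir, "b")
    fd = os.open(a, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.link(a, b)             # second name for the same inode
        os.fchmod(fd, new_mode)   # no path involved at all
        mode_a = stat.S_IMODE(os.stat(a).st_mode)
        mode_b = stat.S_IMODE(os.stat(b).st_mode)
    finally:
        os.close(fd)
        shutil.rmtree(workdir)
    return mode_a, mode_b
```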

> A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.

Yes, you can access them; but I don't believe you can link them. (But I'd love to be proven wrong. A few years ago I actually needed to un-unlink a non-regular file that was still open.)


> Yes, you can access them; but I don't believe you can link them.

I think you can if you use debugfs. I wrote a post here about recovering a running binary after deleting the file on disk: http://lukechampine.com/recoverbin.html


Huh, I didn't know debugfs can operate on a mounted filesystem. Sounds incredibly dangerous...

debugfs(8) manpage says that ln "does not adjust the inode reference counts". Is there a way to increase that number?


I haven't tried it, but based on the manpage, I would expect this to work:

    set_inode_field foo links_count 1


Why does it sound dangerous? (Genuinely curious, it looks like it could be a very useful tool in certain situations)

I've never used it, but it appears to operate in read-only mode by default[0]:

> -w

> Specifies that the file system should be opened in read-write mode. Without this option, the file system is opened in read-only mode.

[0] https://linux.die.net/man/8/debugfs


> Yes, you can access them; but I don't believe you can link them.

Yes you can. I remember YouTube's flash viewer back in the day would put the downloaded flv video in /tmp and then delete it. I used to check the flash pid, go to /proc/{pid}/fd and see the symlink to the deleted file. Then a cp would give me the actual file.


I don't think this is the same as linking the file. You are not creating a new link to an existing file, you are creating a copy of it and creating a link to that.

If you modified the old file after the cp, you wouldn't see the changes in the new one.
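The difference can be sketched in a few lines of Python, assuming a POSIX filesystem with hard-link support (file names are placeholders). A copy is a separate inode and misses later writes; a hard link is the same inode and sees them:

```python
import os
import shutil
import tempfile


def copy_vs_link():
    """Append to a file after copying and hard-linking it; compare the results."""
    workdir = tempfile.mkdtemp()
    orig = os.path.join(workdir, "orig")
    copied = os.path.join(workdir, "copy")
    linked = os.path.join(workdir, "link")
    try:
        with open(orig, "wb") as f:
            f.write(b"v1")
        shutil.copyfile(orig, copied)  # new inode with the same bytes
        os.link(orig, linked)          # new name for the same inode
        with open(orig, "ab") as f:
            f.write(b" v2")            # modify the original in place
        with open(copied, "rb") as f:
            copy_data = f.read()
        with open(linked, "rb") as f:
            link_data = f.read()
    finally:
        shutil.rmtree(workdir)
    return copy_data, link_data
```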


That's interesting about the access permissions and ownership; I thought that access permissions in POSIX were path-dependent. Some quick experimentation indicates that ownership and access permissions do in fact apply across hard links.

There's still some truth to the path-dependent notion, in that you may not be able to access a file through a hard link in a directory that you do not have access to, even if you have access to that same hard link through another path. But if you don't have access to the file itself then you're out of luck.

This does make sense from a security perspective, but I thought that the path-dependent checks in the kernel were strong enough to not require inode-associated ACLs.

You're right about restoring the file by re-linking to the hard link, but you can access the contents and cp it out of proc at least.


On Linux, the linkat(2) function can link up open files using an empty origin path and the AT_EMPTY_PATH flag.

That’s not part of POSIX though.


You certainly cannot un-unlink a file by linking the pseudofile in /proc somewhere, as that would involve cross-filesystem hardlinks.

But it is probably possible to write a simple kernel module that would allow you to do that through some non-standard interface.


Yes you can.

    linkat(AT_FDCWD, "/proc/self/fd/N", destdirfd, newname, AT_SYMLINK_FOLLOW);

Will do it. This is the longest-available `flink` syscall method. See https://lwn.net/Articles/562488/

Sadly the AT_EMPTY_PATH change was backed out between 3.11-rc7 and release. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


As pointed out in another comment, this doesn't work when the link count is 0:

  $ uname -rv
  4.15.0-3-amd64 #1 SMP Debian 4.15.17-1 (2018-04-19)

  $ touch foo

  $ exec 3<>foo

  $ rm foo

  $ ls -l /proc/$$/fd/3
  lrwx------ 1 jwilk jwilk 64 Apr 25 10:02 /proc/324/fd/3 -> '/home/jwilk/foo (deleted)'

  $ strace -e linkat ln -L /proc/$$/fd/3 foo
  linkat(AT_FDCWD, "/proc/3447/fd/3", AT_FDCWD, "foo", AT_SYMLINK_FOLLOW) = -1 ENOENT (No such file or directory)
  ln: failed to create hard link 'foo' => '/proc/3447/fd/3': No such file or directory

  $ sudo strace -e linkat ln -L /proc/$$/fd/3 foo
  linkat(AT_FDCWD, "/proc/3447/fd/3", AT_FDCWD, "foo", AT_SYMLINK_FOLLOW) = -1 ENOENT (No such file or directory)
  ln: failed to create hard link 'foo' => '/proc/3447/fd/3': No such file or directory
  +++ exited with 1 +++


Unless you use `O_TMPFILE` without `O_EXCL` to create the file.

    open("/tmp", O_RDWR|O_TMPFILE, 0666)    = 3
    linkat(3, "", AT_FDCWD, "/tmp/bar", AT_EMPTY_PATH) = 0
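A hedged Python sketch of the same idea, assuming Linux with procfs and a filesystem that supports `O_TMPFILE`. As far as I know Python's `os.link` does not expose `AT_EMPTY_PATH`, so this uses the `/proc/self/fd/N` plus `AT_SYMLINK_FOLLOW` route instead. The helper name `publish_tmpfile` is hypothetical, and it returns `None` whenever the platform does not cooperate:

```python
import os


def publish_tmpfile(directory, name, payload):
    """Create an unnamed O_TMPFILE file, then give it a name via linkat."""
    o_tmpfile = getattr(os, "O_TMPFILE", None)
    if o_tmpfile is None:
        return None  # not Linux
    try:
        fd = os.open(directory, os.O_RDWR | o_tmpfile, 0o600)
    except OSError:
        return None  # kernel or filesystem without O_TMPFILE
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.write(fd, payload)
        # Passing dst_dir_fd makes CPython call linkat(); follow_symlinks=True
        # maps to AT_SYMLINK_FOLLOW, so the /proc magic link is dereferenced.
        os.link("/proc/self/fd/%d" % fd, name,
                dst_dir_fd=dir_fd, follow_symlinks=True)
    except OSError:
        return None  # e.g. /proc not mounted
    finally:
        os.close(dir_fd)
        os.close(fd)
    return os.path.join(directory, name)
```

If the link succeeds, the file appears in `directory` fully written in one step, which is the usual reason to reach for this pattern.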


I don't think the cross-filesystem hardlink is the problem, since the link in proc is a symbolic link.

I can do this experimentally by creating a symbolic link in /tmp (a different filesystem) to a file in /home, and then creating a hard link (with ln -L) from the symbolic link to another file in /home, and the result is a valid hardlink to the same inode as the original file.

This doesn't work through /proc for an unlinked file, but only because the underlying link call requires a path, not an inode. You can create a hard link out of proc if the file has not been deleted, though, without any cross-filesystem problems.
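Roughly what `ln -L` does can be sketched in Python, assuming a platform with `linkat` (Linux, for instance). For simplicity the symlink here lives in the same directory as its target rather than on another filesystem, and all names are placeholders:

```python
import os
import shutil
import tempfile


def hardlink_through_symlink():
    """Hard-link the target of a symlink, as `ln -L` would."""
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "target")
    sym = os.path.join(workdir, "sym")
    hard = os.path.join(workdir, "hard")
    try:
        with open(target, "wb") as f:
            f.write(b"data")
        os.symlink(target, sym)
        dir_fd = os.open(workdir, os.O_RDONLY)
        try:
            # dst_dir_fd forces linkat(); follow_symlinks=True is AT_SYMLINK_FOLLOW.
            os.link(sym, "hard", dst_dir_fd=dir_fd, follow_symlinks=True)
        finally:
            os.close(dir_fd)
        is_symlink = os.path.islink(hard)
        same_inode = os.stat(hard).st_ino == os.stat(target).st_ino
    finally:
        shutil.rmtree(workdir)
    return is_symlink, same_inode
```

The new name is a regular hard link to the target's inode, not a second link to the symlink itself.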


You have to somehow increment the inode's reference count and write a reference to it into some directory.

symlink() does not increase the reference count of anything, and in fact its target does not have to be a meaningful filename at all (although in the practical non-POSIX sense there does not exist any string that is not a valid filename). One interesting abuse of this is that you can use symlink()/readlink() as an ad-hoc key-value store with atomicity guarantees (that hold true even on NFS). For example, Emacs uses exactly this for its file locking mechanism.
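A minimal sketch of that symlink-as-lock idea in Python, assuming a POSIX filesystem. The function names and the `owner` string format are invented for illustration and are not what Emacs actually writes:

```python
import os


def try_lock(lock_path, owner):
    """Atomically create a symlink whose 'target' is just a value string."""
    try:
        os.symlink(owner, lock_path)  # fails if the name already exists
        return True
    except FileExistsError:
        return False


def lock_owner(lock_path):
    """Read the stored value back, or None if nobody holds the lock."""
    try:
        return os.readlink(lock_path)
    except FileNotFoundError:
        return None


def unlock(lock_path):
    os.unlink(lock_path)
```

The target never has to exist; creation either succeeds or fails as a single step, which is what makes it usable as a lock.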

IIRC the files in /proc/pid/fd are not true symlinks but something that behaves as both a file (you can do the same I/O operations as on the original FD) and a symlink (i.e. you can readlink() them and get some string) at once.


man 2 open section about O_TMPFILE seems to strongly imply that you can linkat from /proc/<pid>/fd to a concrete file. Not sure if there are some special cases for /proc/self/fd vs /proc/<pid>/fd, but that would seem a bit odd.

http://man7.org/linux/man-pages/man2/open.2.html

edit: nevermind, seems like O_TMPFILE is the one that has been special-cased here, from man 2 linkat:

> This will generally not work if the file has a link count of zero (files created with O_TMPFILE and without O_EXCL are an exception).

http://man7.org/linux/man-pages/man2/linkat.2.html

:(


In POSIX, why can't you link a file to which you have read access, but you're not an owner or have group access? It's a pretty annoying restriction.


because you are modifying the file's inode, which you do not have permission to modify.


Ok, but what does it modify besides the refcount?


not sure. but you don't have permission to modify that inode, hence, no permission to link. the model is pretty straightforward.

also, being able to 'create' files 'owned' by another user in other locations (by linking them into place) could create quite a few bizarre and undefined corner cases, some of which might have implications for system stability and/or security.


But... traditionally, Unix systems do allow creating hardlinks of other users' files. And yes, this misfeature is a source of a great number of security holes.

An option to disable this behavior (/proc/sys/fs/protected_hardlinks) was added only in Linux 3.6, and then it's still disabled by default.


consider what would happen if the file was counted against someone's quota and they rm'd the file but your link was still outstanding.


I suppose you can read the file if you can intercept an open fd, read it, and write it somewhere. It could possibly be the now vacant previous location.


For me it was (IIRC) a socket, so I couldn't "read it".


I believe you may be able to relink the inode with debugfs, depending on the filesystem.


On macOS, you might be able to accomplish this with fclonefileat()


This trick used to be the way I would download (or play in VLC) videos sent through Flash embeds in webpages. Way back in the day there was an actual temp file. But many companies didn't like that so they started deleting the file immediately to preserve the consensual hallucination that is 'streaming' and keep the lawyers happy. The way around this was to use stat, proc and awk in a bashrc function, ie:

    vlc $(stat -c %N /proc/*/fd/* 2>&1 | awk -F[\`\'] '/tmp\/Flash/{print$2}')

With newer versions of Flash this too went away (around 2014). But I still keep a browser profile around that uses an old version, just so I can access the downloaded file to play in VLC (much smoother).


Access permissions are part of the inode, IIRC, and not part of the directory entry (basically the same as in Win32?).


On FAT the "permissions" are in fact part of the directory entry (IIRC some OSes in the "multitasking DOS" family even extend FAT with what is essentially the unix mode, uid, gid tuple in the directory entry).

On NTFS, permissions live in the MFT, which is essentially the same thing as an inode.


Yes, this is documented in inode(7)


Another neat side effect of this is that you can remove a file that you don't own, so long as it's in a directory that you do. rm prompts to confirm by default, but it's intuitively surprising that you can delete a root-owned file that you have no read or write permissions to.
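A small Python sketch of this, assuming a POSIX system. Without root there is no way to create a root-owned file, so a file with mode 000 stands in for "a file you have no permissions on"; the name `locked` is a placeholder. Sticky directories such as /tmp are the exception, since there you can only delete entries you own:

```python
import os
import tempfile


def remove_unreadable():
    """Delete a file with no permission bits set, using only directory rights."""
    workdir = tempfile.mkdtemp()          # we own this directory (mode 0700)
    path = os.path.join(workdir, "locked")
    with open(path, "wb") as f:
        f.write(b"secret")
    os.chmod(path, 0o000)                 # no read/write/execute for anyone
    os.unlink(path)                       # only needs write+search on workdir
    gone = not os.path.lexists(path)
    os.rmdir(workdir)
    return gone
```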


This is also a cause of frequent confusion among less knowledgeable users when they try to clear up space on a filesystem that is reporting full, but there is a process keeping the file descriptor open. As far as they can tell the file is gone, but the usage hasn't gone down. There is a guy at my work that I point out lsof +L1 to about every three months or so. He can't seem to wrap his head around the concept.
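For a single process, something in the spirit of `lsof +L1` can be approximated in Python by walking /proc. This is only a sketch assuming Linux with procfs and permission to inspect the target pid; the function name is hypothetical:

```python
import os


def deleted_open_files(pid="self"):
    """List (fd, link text, size) for open files whose link count is zero."""
    fd_dir = "/proc/%s/fd" % pid
    results = []
    for name in os.listdir(fd_dir):
        entry = os.path.join(fd_dir, name)
        try:
            st = os.stat(entry)          # follows the magic link to the inode
            target = os.readlink(entry)  # e.g. '/tmp/foo (deleted)'
        except OSError:
            continue                     # fd closed in the meantime
        if st.st_nlink == 0:
            results.append((int(name), target, st.st_size))
    return results
```

The sizes reported are roughly the space that will only be freed once the owning process closes those descriptors.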


Using hard links you can have a file that exists in multiple places with none of these being ‘the’ place. If you remove one of these links, nothing happens in the other places. All you see is the reference count going down by one.

Only when the reference count is zero will the file data be removed.
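The counting is easy to watch from Python via `st_nlink`. A minimal sketch, assuming a POSIX filesystem with hard-link support and placeholder names:

```python
import os
import shutil
import tempfile


def link_counts():
    """Track st_nlink while adding and removing a second name."""
    workdir = tempfile.mkdtemp()
    a = os.path.join(workdir, "a")
    b = os.path.join(workdir, "b")
    try:
        with open(a, "wb") as f:
            f.write(b"data")
        counts = [os.stat(a).st_nlink]      # one name
        os.link(a, b)
        counts.append(os.stat(a).st_nlink)  # two names, one inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        os.unlink(a)                        # drop the "original" name
        counts.append(os.stat(b).st_nlink)  # the other name is unaffected
    finally:
        shutil.rmtree(workdir)
    return counts, same_inode
```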


> Only when the reference count is zero will the file data be removed.

Not even necessarily then, right? (That is, there's no guarantee that a file's data is zeroed out just because its reference count drops to 0.) It's more just that it's only when the reference count is 0 that the actual space occupied on disk can be overwritten.


Flash (ab)uses this to hide the cache files for streamed videos. It creates a temp file, opens it, then deletes it. You can however still grab it out of /proc...


O_TMPFILE sort of does that trick automatically now: it creates a file without a directory entry, so you don't need to jump through those (race-prone) hoops of unlinking the file.
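In Python you rarely need to do this by hand: as far as I know, CPython's `tempfile.TemporaryFile` tries `O_TMPFILE` on Linux and falls back to create-then-unlink elsewhere. Either way the result on a POSIX system is a file with no directory entry, which this small sketch checks through its link count:

```python
import os
import tempfile


def anonymous_tempfile_links():
    """Return the link count of a TemporaryFile (no name on POSIX)."""
    with tempfile.TemporaryFile() as f:
        f.write(b"scratch data")
        f.flush()
        return os.fstat(f.fileno()).st_nlink
```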



