
File handling in Unix: tips, traps and outright badness - janvdberg
https://rachelbythebay.com/w/2020/08/11/files/
======
deathanatos
> _Put it another way: if your idea of avoiding races in your temp file
> creation is to just have a wider namespace, what keeps an evildoer (me) from
> just grabbing every possibility first? Disk space is cheap, right? You can
> only have so many pids, and the time of day is obviously predictable, and
> your random number space is also similarly limited. I could just set up
> every one of them first and when you hit it, I win!_

A UUID, generated by the system's CSPRNG. Ideally through a syscall if the OS
has one, so I don't need to burn an FD. You can't pre-fill all UUIDs without
running out of space, and the CSPRNG prevents prediction, short of using gdb
to attach to the process mid-name-generation, at which point you're on the
other side of the airtight hatch.

I do wish linkat could do an atomic replace of an O_TMPFILE file descriptor.
It'd avoid the whole random-name business, which is an ugly wart.

------
greggman3
This sounds to me like seriously bad API design. I'm not saying the low-level
functions that allow handling fancy situations aren't important, but shouldn't
the default be "it just works" and not "you must write 1000s of lines of code
exactly correctly or else your program will fail and you'll be pwned"?

This reminds me of Jonathan Blow's rant on similar issues on Xbox and PS2/3,
where the API, instead of doing the right thing by default, required the
developer to spend days/weeks using it correctly and having their app rejected
over and over. In this particular case there was zero good reason for it
not to handle the issues itself rather than pass it all on to the developer.

~~~
asveikau
I think interrupting I/O can be tricky for an application to handle. If
write(2) didn't have this behavior, it's likely that a common outcome would be
frozen processes that you can't quit with ^C.

There either needs to be a notion of partial writes (what the article
describes) or a write that gets rolled back if interrupted (probably really
hellish to implement ... Maybe impossible for something like a socket where
the remote machine may have already seen some packets).

EINTR or SIGPIPE are famously criticized as poor mechanisms for this. However
even without that, interrupting a blocked write would remain a tricky problem.

~~~
nine_k
Writing to a file should be transactional. It either succeeds, or gets rolled
back. Journaling filesystems implement it as a crash protection mechanism, but
an API where you can commit changes without closing, and can roll back
explicitly, would be nice.

Writing to a socket is unlike writing to a local file, and cannot offer such
guarantees.

~~~
tannhaeuser
Journalling protects metadata writes (file names, dates, sizes, permissions,
root block allocations), not file content writes, such that a FS can be brought
up quickly after a crash. If you want transactional/atomic file ops, you need
to implement it yourself on top of existing mechanisms, like transactional
databases have been doing for ages.

~~~
teddyh
> _Journalling protects metadata writes_ […] _not file content writes_

As I understand it, the ext4 file system can have full-data journaling with
the “data=journal” mount option (instead of the default, “data=ordered”, which
is as you describe).

------
ezekiel68
I enjoyed this post and the one before it (linked in the article). +1 for a
resurrection of the old PC loser-ing situation made semi-famous by the seminal
"worse is better" article (also linked therein). Really this all comes down
to, "Where should the complexity live?" People who enjoy languages, tools, and
OSes which cater to systems programming (myself included) eat this stuff up. I
can definitely appreciate the dismay this causes in folks who simply want to
glue things together to produce higher-level results more quickly.

~~~
thayne
I really don't see how this post supports the one before it. Much of the
complexity here can and in many, perhaps most, cases should be abstracted into
a library, whether that is the standard library or a well-established high-
quality third-party library. Given how easy it is to make mistakes, common
operations, like atomically writing a buffer to a file, or creating a new file
and ensuring it didn't already exist, should not have to be re-implemented in
every application.

------
userbinator
_You need a signal handler to eat it and do something about it, or you have to
explicitly say that you don't care and set it to "ignored", but you can't
just pretend it won't happen... because it sure will._

The default action of SIGPIPE is to kill the process, which is exactly what
you need for pipelines to work correctly. Related article:
[https://news.ycombinator.com/item?id=22647539](https://news.ycombinator.com/item?id=22647539)

~~~
sfoley
Not all programs are meant to be used in pipelines.

It’s not even entirely accurate to say that that’s EXACTLY how pipelined
programs should work; maybe they want to do some cleanup before they die.
SIGPIPE was essentially created as a hack for naive programs that didn’t
properly check the return value of write(). EPIPE should have been enough;
it's a perfectly fine solution.

~~~
inetknght
> _Not all programs are meant to be used in pipelines._

I would argue that programs that aren't meant to be used in pipelines aren't
well designed programs.

> _It’s not even entirely accurate to say that that’s EXACTLY how pipelined
> programs should work; maybe they want to do some cleanup before they die._

If you need to do some cleanup before you die, then that's exactly what a
signal handler is there for you to do. Nothing stops you from exiting
immediately after a signal handler.

~~~
jlokier
> If you need to do some cleanup before you die, then that's exactly what a
> signal handler is there for you to do.

I challenge you to clean up a temporary directory, like doing "rm -fr
$TMPDIR/tmp$SECRET/" except not by running another program, in an async-
signal-safe manner inside a SIGPIPE handler which then exits.

Hint: You're not allowed to call system(), readdir(), malloc() or any of
exec*().

------
jwilk
> _Let's say it returns a value of 8192 instead, setting errno to EINTR._

errno is meaningful only after a function has failed. A short write doesn't
count as failure.

~~~
jeffbee
Indeed, errno is not cleared by successful calls, so you can't check it for
success.

After a short write your only legitimate course of action is to advance the
pointer and write again.

~~~
nitrogen
You can set errno to 0 and see if it changes.

~~~
jeffbee
The syscall is allowed to set it even in case of success.

~~~
rwmj
This is sort of right, but the behaviour is a bit subtle. errno should never
be set to _zero_ by any POSIX-compliant syscall or library call. However it
may be set to something (which is non-zero) on a successful call. Ref:
[https://pubs.opengroup.org/onlinepubs/9699919799/functions/e...](https://pubs.opengroup.org/onlinepubs/9699919799/functions/errno.html)

------
cryptonector
Sometimes you want to use a temp file named ${target_name}.new, then rename it
after writing to it -- the point of this is to avoid leaving garbage if you
have unclean exits, but it requires using flock(2). Even better is to create
an unlinked file using O_TMPFILE (Linux-only) and then use linkat(2) and then
rename(2) (renameat2(2) doesn't quite support what's needed, and it's a darned
shame). If you're doing this in a /tmp/ directory, well, you do have to worry
about attacks by other users if you try to do the .new thing.

~~~
asveikau
Wanting a few more details I googled that and found:
[https://lwn.net/Articles/559147/](https://lwn.net/Articles/559147/)

Second time I googled something you mentioned in this thread, eager to know
more. This guy filesystems, folks.

------
praptak
The scenario described in the article suffers from one more problem, not
mentioned in the article. Quote:

 _"Create a file adjacent to the target path using mktemp or similar.

Write your data to it.

rename() it from the temporary name to the final name."_

I believe this is prone to data loss when the system goes down, because the
POSIX guarantees on write reordering are loose. The rename can make it to disk
_before_ the data does.

I think it caused actual data loss with one of the fancier filesystems which
actually pushed the limits of write reordering and exposed this (pretty
common) scenario as buggy.

Edit: I found an article about this problem. "Ext4 and data loss":
[https://lwn.net/Articles/322823/](https://lwn.net/Articles/322823/)

~~~
sgerenser
Of course, between finishing the write and doing the rename you have to call
fsync() on the fd (as that article pointed out).

------
janvdberg
Also reminds me of this talk by Dan Luu:
[https://www.deconstructconf.com/2019/dan-luu-files](https://www.deconstructconf.com/2019/dan-luu-files)

~~~
cryptonector
He doesn't mention ZFS at all.

------
lenkite
So...basically use SQLite for this then as recommended.
[https://www.sqlite.org/aff_short.html](https://www.sqlite.org/aff_short.html)

~~~
GoblinSlayer
It has its own share of legacy and poor tradeoffs. A transaction log is
overkill for most applications, but the only alternative is a full cache in
RAM; there's no middle ground like you have with a file system.

------
clktmr
The first half of the article basically complains that the write() syscall
can fail or return early. Errors can occur and you need to handle them; there
is no way around it. Given that "Everything is a File", the OS shouldn't try
to make any guarantees around that. There are cases where you want to handle
an early return from write().

This is a syscall, after all. Its first priority is to provide a very general
interface and remain compatible, not developer ergonomics.

------
self_awareness
I wonder if EINTR can also happen on macOS? macOS is a certified UNIX, and the
post talks about Unix in general; that's why I wonder.

I've inspected the fopen() family of functions (fread, fwrite), and they don't
handle EINTR. Do they delegate this task to the caller? Or does EINTR not
exist there? I mean, the error code itself is used, and by quickly skimming
the XNU sources I can see it's used in some NFS code, but can it happen when
reading from or writing to local storage (a hard disk)?

Edit: a few minutes later, I can see that e.g. libstdcxx handles EINTR, so
maybe it can happen on macOS as well (see the xwrite() function):

[https://opensource.apple.com/source/libstdcxx/libstdcxx-104....](https://opensource.apple.com/source/libstdcxx/libstdcxx-104.1/src/basic_file.cc.auto.html)

~~~
comex
I just tested it on macOS and Linux with the two main sources of EINTR: signal
handlers and attaching a debugger. I tested by interrupting a read from stdin,
but the behavior for disk reads and writes should be similar.

Signal handlers: POSIX specifies that, after a process runs a signal handler
which was triggered when it was in the middle of a system call, either the
system call will automatically be resumed or it will fail with EINTR,
depending on a per-signal-handler setting. If the signal handler was set with
sigaction(), this setting is the SA_RESTART flag; otherwise it can be set by
siginterrupt(), but the default seems to be implementation-defined(?). Anyway,
it seems that both macOS and Linux enable restarting by default. (But well-
written library code should take into account the possibility that something
else in the process could have decided to disable it.)

Attaching a debugger (using ptrace): This interrupts system calls because the
debugger expects to see a userland state. On Linux it seems to automatically
restart the system call, but on macOS it produces EINTR. This behavior doesn't
seem to be configurable on either OS.

On both OSes, when EINTR does occur in read(), fread() does not cover up for
it by automatically retrying, regardless of whether it successfully read any
bytes beforehand.

~~~
saagarjha
Which libc were you using on Linux, or were you making raw system calls?
Because I think glibc’s behavior is to EINTR unless you define _GNU_SOURCE,
which seems to me to be incompatible with your description…

~~~
comex
glibc, Debian package version 2.31-3, without any special defines. Here is the
program I used:

[https://gist.github.com/comex/58ba394588478bb1d506a998850b94...](https://gist.github.com/comex/58ba394588478bb1d506a998850b9436)

If I run the program and press Ctrl-C, it prints "interrupted" but then keeps
going. However, if I add `siginterrupt(SIGINT, 1)` after the call to signal(),
interrupting the read() produces EINTR as expected.

Interestingly, the glibc manual used to state that the default behavior
depended on the presence of _BSD_SOURCE or _GNU_SOURCE, but in 2014 it was
changed to state that the default is always to use EINTR:

[https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=manual/...](https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=manual/signal.texi;h=51f68b5d3ec59ea8a662c18360a3ba15cebe80ef;hp=f0e57ddbe41af3b0948d64c1dea420c6b1db9928;hb=c941736c92fa3a319221f65f6755659b2a5e0a20;hpb=e8d8d7ec98af7c3777fd664adca8be5630afbc90)

~~~
saagarjha
Oh, huh, that’s really interesting! I didn’t realize that glibc had changed
their behavior; I wonder if this broke any applications…

------
for_xyz
Does this write situation apply to Linux as well?

Since 99% of disk I/O is cached, a write either succeeds or fails completely.
It's beyond the scope of a userspace application to know what data arrives on
the physical disk, and when. The same applies to Windows.

Using direct I/O is a different story, where this is the correct behaviour.

~~~
iforgotpassword
NFS. Also the gdb/strace case still holds. It's really rare, so many programs
get away with not doing it, and then when it happens once in a honeymoon you
shrug it off and restart the program.

~~~
saagarjha
I hope you’re not fixing short writes during your honeymoon…

------
asveikau
Maybe I have skimmed too much, but I didn't see anything about the problems
with using O_EXCL to resolve races over NFS. I believe (correct me if I am
wrong) it doesn't work.

~~~
cryptonector
No, O_EXCL works on NFSv3 and NFSv4 just fine. NFSv4 is stateful and solves a
number of problems that NFSv3 had.

~~~
asveikau
Googling around has this for Linux:
[http://nfs.sourceforge.net/#faq_d10](http://nfs.sourceforge.net/#faq_d10)

Tldr: it used to not work. It was fixed in Linux in 2004.

No idea about other unix-like OSs.

------
icedchai
In my experience, most developers ignore the "partial write" scenario with
files. Then they get experience working with sockets... where this happens all
the time.

~~~
jlokier
> Then they get experience working with sockets...

Ha. I have seen TCP sockets code which breaks in a "partial write" scenario,
and even a "partial read" scenario.

Code worked fine on their own LAN. Broke for someone else, they couldn't see
why...

They were _very surprised_ when I explained that you don't always get one
message per read(), because they had for years. Or maybe they hadn't, but
assumed something else caused their occasional application glitches.

------
perlgeek
If you're doing async IO, I _strongly_ recommend using a library that
abstracts much of the pain away, like libuv. Dealing with just the libc
functions (or system calls) is way too low-level an API for many applications.

------
Hitton
And this is still only scratching the surface. For more file related horror:

Files are hard - [https://danluu.com/file-consistency/](https://danluu.com/file-consistency/)

------
ur-whale
I remember being bitten by write() doing only partial work early in my career.

This definitely felt like an API shortcoming, so I ended up writing a wrapper
for write() that did what I expected.

