

Crash Consistency: Rethinking the Fundamental Abstractions of the File System - BruceM
http://queue.acm.org/detail.cfm?id=2801719

======
Animats
The UNIX file system abstraction is very simple, and doesn't define post-crash
states. I once proposed different guarantees for different types of files:

\- Unit files - the unit of update is the file. Files are created, written,
and closed, and are not visible to other processes until closed. Once closed,
the file is read-only; it cannot be rewritten, only replaced as a unit. For
POSIX-type systems, files created with 'creat' should be created in this mode.
O_TRUNC should be interpreted as "replace the old file with the new version on
close". If the program aborts before a proper close, the new file should be
dropped, leaving the old version intact.

The crash guarantee should be that post-crash, you have a completely written
file. It can be either the old version or the new version, but never a partial
version. This eliminates the gyrations people go through to get this behavior.

\- Log files - the unit of update is the write, which must be at the end.
These are files opened for append. Appending is always at the end of the file,
even from multiple processes. "seek" is disallowed if the file is open for
writing; you can only append.

The crash guarantee should be that post-crash, you have a file which is either
complete to the last write, or truncated precisely after some write. The file
may not be cut in the middle of a record or trail off into junk.
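On POSIX systems today, the closest existing mechanism to this log-file discipline is O_APPEND, which makes the kernel position every write() at the current end of file, even with multiple writers. A minimal sketch (the file name is illustrative):

```python
import os

# Append-only "log file" discipline using POSIX O_APPEND: the kernel
# atomically positions each write() at end-of-file, even when several
# processes hold the same file open for append.
path = "app.log"

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"record 1\n")   # always lands at EOF
os.write(fd, b"record 2\n")
os.fsync(fd)                  # push the records to stable storage
os.close(fd)
```

Note that O_APPEND alone does not deliver the proposed crash guarantee; after a crash the file can still end mid-record, which is exactly the gap the proposal would close.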

\- Temporary files - after a crash, they're gone.

\- Managed files - these are for databases, and support additional functions
related to locking and file synchronization. That's what the article is about.
For the other types of files, you don't need all those features.

In practice, most files are unit files, log files, or temporary files. The
number of programs which use managed files is small; mostly they're database
programs or libraries.

Programs which use managed files and need data soundness after a crash must be
very aware of concurrency and safety semantics. A somewhat different API may
be required. There should be two notifications from a write - "data accepted"
and "data safely committed". Callers should be able to make blocking writes
based on either of those, or make non-blocking writes and get two callbacks.
This puts the concurrency management in the database application, which knows
what data depends on other data. The file system can't know that, and
shouldn't try to.
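The two-notification scheme could be sketched as follows. Everything here is invented for illustration (the class and method names are hypothetical, not an existing API); the point is that write() returns as soon as the data is accepted, and a separate callback fires once the data is safely committed, leaving dependency tracking to the application:

```python
import os
import threading

class ManagedFile:
    """Hypothetical sketch of the proposed two-notification write API.
    write() returns when data is "accepted" (handed to the OS); the
    optional callback fires when it is "safely committed" (fsync'd).
    The application, not the file system, decides which writes depend
    on which."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._lock = threading.Lock()

    def write(self, data, on_committed=None):
        with self._lock:
            os.write(self._fd, data)        # notification 1: data accepted
        if on_committed is not None:
            # Commit in the background so the caller is not blocked.
            threading.Thread(target=self._commit,
                             args=(on_committed,)).start()

    def _commit(self, callback):
        os.fsync(self._fd)                  # notification 2: safely committed
        callback()

    def close(self):
        os.fsync(self._fd)
        os.close(self._fd)
```

A caller that needs durability before proceeding can block on the callback; one that doesn't can fire and forget.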

~~~
netheril96
Currently many people write to a new file and `rename` it over the old one to
achieve the first kind of guarantee.
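That rename dance looks roughly like this sketch (Python, POSIX semantics assumed; the directory fsync that careful applications also perform is noted but omitted):

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace path with data so that a reader (or a post-crash recovery)
    sees either the complete old file or the complete new one, never a
    partial write."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Temp file must be on the same filesystem for rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)          # new contents reach stable storage first
    finally:
        os.close(fd)
    os.replace(tmp, path)     # atomic rename on POSIX filesystems
    # Strictly, one should also fsync the containing directory so the
    # rename itself is durable; omitted here for brevity.
```

This is exactly the kind of gyration the unit-file abstraction would make unnecessary.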

~~~
Animats
Linux has added "renameat()" and "renameat2()", which are supposed to be
atomic if the system does not crash but are not guaranteed to recover cleanly.
Some networked file systems (NFS, at least) can't do an atomic rename; if you
lose the network connection during the switch you may be in an invalid state.

------
PhantomGremlin
The discussion and complaints are mostly about Linux filesystems.

It would be interesting to know how well BSD FFS does and how well ZFS does.
Not a whole lot said about Windows either (my anecdotal experiences with NTFS
have been pretty good).

The article does touch on one very serious problem faced by all filesystems,
and that is the underlying hardware often lies to the OS. That problem will
probably get worse before it gets better. Specifically, flash devices have
very complex firmware that is often buggy, and they also do much more to the
data behind the OS's back (e.g., moving it around for wear leveling).

------
nickpsecurity
"As previously explained, however, in-order updates (i.e., better crash
behavior) are not practical in multitasking environments with multiple
applications."

They say this is the obstacle to the _best_ approach using the same interface.
I say look at our best results in DB, concurrency, and distributed computing
research to see if there's a solution that might work given a certain number
of cores or applications. There's also the possibility of processor
enhancements that put this sort of thing in the I/O architecture.

