
File consistency (2015) - navinsylvester
http://danluu.com/file-consistency/
======
josephg
This entire situation is a barely contained disaster. Doing all the steps
listed at the top of the article is not only exhausting and error-prone, it
also dramatically lowers write performance.

The most frustrating part is that (as I understand it) SSD and NVMe hardware
exposes strict read and write ordering to the OS anyway. That would allow file
modification code to be written in a way that's both fast and correct. After
all, these are the primitives we already know how to use for memory
concurrency, and they enable fast lockless libraries like ConcurrencyKit. But
for files, POSIX only exposes an inconsistently implemented, coarse,
heavyweight fsync(). End users are forced to navigate weird, undocumented,
imprecise hoops that are hard to understand, hard to test, and have poor
performance, just to do the one job the filesystem was supposed to have in the
first place.
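For concreteness, the safe-replace dance the article describes looks roughly
like this in C (a sketch only: the function and file names are mine, and error
handling is abbreviated):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` "atomically": readers see either the old contents
 * or the new, never a mix. Steps: write a temp file, fsync it,
 * rename() over the target, then fsync the containing directory so
 * the rename itself survives a crash. */
static int atomic_replace(const char *dir, const char *path,
                          const char *tmp, const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }  /* flush file data */
    if (close(fd) < 0) return -1;
    if (rename(tmp, path) < 0) return -1;         /* atomic swap */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);  /* make rename durable */
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

And even this isn't portable: on some systems the directory fsync() is
unnecessary, on others it's essential, and on macOS you'd arguably want
fcntl(F_FULLFSYNC) instead of fsync().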

I'm curious how much better you could do by skipping the OS entirely and
making a userspace disk API, similar to DPDK. If your database code stores
data in a single data file anyway, you wouldn't lose much by ditching the
filesystem. (You would need specialised tools to do backups and figure out how
much space you have free, but it might be worth it.)

I've been writing my own tiny implementation of Kafka recently. I was reading
through Kafka's design docs to figure out how they solved this problem. Kafka
basically gives up on trusting the OS to store files safely. Instead, they
figure any fault-tolerant Kafka deployment will be a cluster of machines, so
Kafka stores all messages (plus checksums) across all cluster instances,
hoping that at least one machine in the cluster survives without corruption
when the power goes out.

~~~
zzzcpan
> how much better you could do by skipping the OS entirely

I'd say this is the main reason distributed systems even exist: no amount of
crash-proofing file writes at the filesystem level is going to make disks and
RAIDs correct and reliable, make individual machines and datacenters
reliable, or stop operational mistakes, interpretation mistakes, and so on
from happening.

On the other hand, consistency for random writes to files is not absolutely
necessary. We can get away with just sequential writes, and then there isn't
much to reorder or mess up in the first place.

~~~
josephg
> On the other hand consistency on random writes to files is not something
> absolutely necessary.

This stuff _does_ matter, because most applications aren't distributed
systems. Using filesystem operations in the way they were intended to be used
shouldn't sometimes accidentally corrupt files _by default_. Look again at the
table in the article showing consistency bugs in application code. They even
found bugs in sqlite. If even sqlite doesn't get this right, mortals have no
hope.

And distributed systems themselves depend on the guarantees provided by
filesystems for correctness. The guarantees filesystems provide are poorly
documented, poorly understood and inconsistently implemented across
filesystems and operating systems. Take some time to read about what redis,
postgres and sqlite do to correctly write files - it's an awful morass.

At a minimum we should have a standard atomic write primitive that does all
the stuff listed at the top of that article. I don't really care how slow it
is - atomic writes are what everyone expects write() to actually do. Atomic
writes shouldn't be poorly reimplemented by every text editor and
configuration tool. Video games shouldn't need to ask players not to switch
off their consoles while they're saving. (What an absolute embarrassment.)

If filesystems want to compete on performance, compete on the performance of
atomic append. Right now any atomic append implementation in userland is both
complicated and dog slow. With help from kernel-level filesystem code, we can
do way better on both counts.
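To see why userland atomic append is such a pain, here's a rough sketch (my
own, not taken from any of the systems mentioned) of the usual workaround:
frame each record with a length and checksum so a reader can detect and
discard a torn tail after a crash. The checksum here is a toy FNV-1a hash; a
real log would use something like CRC32C:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Toy FNV-1a checksum -- a real log would use CRC32C or similar. */
static uint32_t toy_sum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Append one framed record to an O_APPEND log: [len][sum][payload].
 * On recovery a reader walks the records and stops at the first one
 * whose checksum fails -- that's the torn tail from the crash. Note
 * the fsync() per record: this is exactly the "dog slow" part. */
static int log_append(int fd, const void *payload, uint32_t len)
{
    uint8_t hdr[8];
    uint32_t sum = toy_sum(payload, len);
    memcpy(hdr, &len, 4);
    memcpy(hdr + 4, &sum, 4);
    if (write(fd, hdr, 8) != 8) return -1;
    if (write(fd, payload, len) != (ssize_t)len) return -1;
    return fsync(fd);
}
```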

Then for high performance applications (databases, etc.) the OS should expose
either fence() operations or some sort of durable write ordering. That's the
API databases will actually use. For extra credit I would also like a write
API that notifies me when my data has been durably synced to disk. A kqueue
event would be perfect.

And once we have those APIs, I have no idea why anyone would use the current
write/fsync implementation. It's a huge foot-gun. It's what you would get if
MongoDB made a filesystem API.

~~~
zzzcpan
These days mere mortals have access to enough information about the brokenness
of operating systems, filesystems, disks, and RAIDs to make better decisions
than sqlite ever did. I'd even suggest going purely log-structured and
completely getting rid of random writes, to save people some time.

Also, these days I disagree about perfect APIs: they are not realistically
achievable. It's a mistake to think there will be no mistakes. For example,
another common mistake people make is in event loops, when they assume kernels
are going to report descriptors as readable or writable correctly all the time
for correct descriptors. These are all mistakes that come from the assumption
that other people don't make mistakes, instead of treating lower levels as
black boxes and not relying on promises. So, while it sounds like a nice idea
to have yet another API to fix broken promises, it's not. A good library can
fix broken promises. A bad library still won't realize it's relying on broken
promises.

~~~
josephg
I recommend log structured too, although probably for different reasons (
[https://twitter.com/josephgentle/status/978616782607998977](https://twitter.com/josephgentle/status/978616782607998977)
)

> It's a mistake to think there will be no mistakes.

It's more programming philosophy at this point, but I disagree with this.
There are lots of assumptions we 100% rely on for modern software. Things like
ADD / MUL working correctly in our CPUs, TCP delivering packets in the order
they were sent, correctness in our programming languages, the correctness of
cryptographic primitives and the safety of operations like constant-time
comparisons. Database transactions. Malloc. GC. Etc, etc.

Actually, I would say that "it works all the time" is the norm in computing.
It's so pervasive and great that when things work all the time we don't even
need to think about them or understand them at all. Not like "it works
_almost_ all the time". Things that only work most of the time we have to
think about constantly. "Does my page still render correctly in all
browsers?", etc.

It's easy to find examples of API promises being violated, because these
problems take up all our time. But this is no way to do engineering. In
aggregate it's almost always far slower than just spending the time to fix
the bugs themselves. And fixing the actual issue will usually take less code,
and run faster, than building a pile of abstractions around the bug.

An example of this: A few years ago I worked at a startup where we had a
search indexer. The indexer wasn't architected well, and it kept having issues
and leaving stale data around. So we kept needing to invest engineering time
into fixing small search issues. Patching around it was a constant low level
time sink for years. A few years later I was asked to implement search at a
different company. I used the lessons I learned at the startup to do it right
this time. And the code I wrote has been almost untouched to this day. Now,
the interesting part is that at both companies management thought the amount
of time we were allocating toward search seemed normal. But the second
approach was objectively better. At the startup it would have taken less time
overall to fix the indexer's bugs (or rewrite it) than we spent patching
around it and dealing with its brokenness.

Which is to say, sometimes better is better.

~~~
zzzcpan
This is some kind of anti-resilience philosophy. If you assume things work
correctly all the time, you are going to have disasters and downtime, long
debugging sessions and hard to find bugs when things break, because the system
is not architected to survive incorrect assumptions.

~~~
josephg
How resilient is your software against, say, malloc returning invalid ranges?
Against compiler bugs?

I’m saying that when there are compiler bugs the right solution isn’t to blame
the user for not having resilient enough code. The right answer is to _fix the
bug in the compiler_. Likewise when basically everybody uses your API
incorrectly, don't blame all the users. Instead _fix the actual problem in the
kernel_.

Or if you want, consider it as simple math - we get better returns on time
spent solving the problem in the kernel than we do solving the problem in user
space of every application.

I've heard some Facebook devs talk about the concept of DX: developer
experience. DX is something we should be optimizing for when writing APIs.
Currently the DX of writing correct file handling code is awful. The functions
don't do what you want, they're implemented inconsistently and they're poorly
documented. If the only place you're allowed to write code is your
application, then you're right: make it as robust as you can. But we're
programmers and Linux is open source. We should just fix the problem.

------
pletnes
Looks to me like yet another reason to use sqlite instead of «flat files».
E.g. postgres seems like a many-orders-of-magnitude increase in complexity.

Possibly related: We’ve been using hdf5 for a lot of data storage at work (raw
image data). I often discover corrupt files, even though we (think we) are
flushing files etc. I’d love to see some work on reliability there, but it’s
hard to know if the article is relevant to those issues.

Also, what happens when you’re on RAID? Even more assumptions out the window
I’d imagine?

~~~
zejn
sqlite also uses fsync, which at least on Linux has corner cases where errors
don't even get reported. And when fsync does report an error, you should
panic: even if a subsequent fsync returns OK, that does not mean the data for
which the error was reported was actually written. In fact, the opposite is
true - the kernel clears the dirty bit on the unwritten data, so neither the
application nor the kernel knows which data was not written.
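In practice that means the only safe wrapper looks something like this (a
sketch; `fsync_or_die` is a made-up name, and the right recovery action
depends on the application):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* After a failed fsync() the kernel may already have dropped the
 * dirty pages, so retrying and getting 0 back proves nothing. The
 * only safe move is to treat the file's on-disk state as unknown
 * and crash (or enter recovery) rather than carry on. */
static void fsync_or_die(int fd)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "fsync: %s -- on-disk state unknown, aborting\n",
                strerror(errno));
        abort();
    }
}
```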

~~~
nordsieck
> sqlite also uses fsync

At least sqlite has done the hard work of figuring out how to be as reliable
as possible.

At some level, writing to disk is a bit like using crypto. It's a lot easier
to get right if you use well tested libraries that only offer high level
primitives.

------
evrydayhustling
This was a cool read. I'd be interested in a perspective on how the increasing
focus on distributed consistency is impacting design and research at this
local consistency level. In particular, given the findings about frequency of
errors, I wonder if there are guidelines for coordinating local filesystem
settings along with distributed system settings to maximize performance at the
distributed scale. Anybody out there doing this?

------
mjw1007
On Linux, renameat2() with RENAME_EXCHANGE on directories ought to be a very
helpful primitive.

Does anyone know what the state of glibc support is? The last thing I saw was
this thread:
[https://sourceware.org/ml/libc-alpha/2015-11/msg00459.html](https://sourceware.org/ml/libc-alpha/2015-11/msg00459.html)

~~~
jwilk
From the renameat2 manpage:

    Glibc does not provide a wrapper for the renameat2() system call; call it using syscall(2).

[http://man7.org/linux/man-pages/man2/renameat2.2.html](http://man7.org/linux/man-pages/man2/renameat2.2.html)
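Calling it that way looks roughly like this (a sketch for Linux;
`rename_exchange` is a made-up helper name, and RENAME_EXCHANGE still
requires kernel >= 3.15 plus filesystem support):

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* SYS_renameat2 */
#include <unistd.h>

#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)
#endif

/* Atomically swap two paths: at every instant both names exist and
 * each refers to one of the two files. */
static int rename_exchange(const char *a, const char *b)
{
    return (int)syscall(SYS_renameat2, AT_FDCWD, a, AT_FDCWD, b,
                        RENAME_EXCHANGE);
}
```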

------
dis-sys
I'm wondering: what is the easy way to run tests for such file consistency
issues, say, how the system reacts to power loss? Unplugging the power cable
is not very automatable for most people.

~~~
nickpsecurity
The most impressive setup I've seen is SQLite's, described in sections 3.2-3.3
of the first link, with more detail on their file operations in the second.

[https://sqlite.org/testing.html](https://sqlite.org/testing.html)

[https://www.sqlite.org/fileio.html](https://www.sqlite.org/fileio.html)

------
notduncansmith
Are there no filesystems which address this issue? What challenges are
involved? Does hardware support safe append-write filesystems? Why/why not?

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=10725859](https://news.ycombinator.com/item?id=10725859)

------
zeveb
It's interesting that IBM's JFS is so broken. Was the Linux version written by
IBM, or is it a reimplementation by a third party?

