
Unix’s file durability problem - robinhouston
https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem
======
Animats
I've suggested an approach to this before. There should be several types of
files.

\- "Unit" files commit when closed properly (this does not include a program
exit without close), and then replace the old version of the file. The file
system should guarantee that, after a crash, you have either the old version
or the new complete version. This should be the default when a file is opened
via "creat()".

\- "Temp" files always go away completely after a crash. This should be the
default for designated temp directories.

\- "Log" files can only be appended. Writers cannot seek. The file system
guarantees that after a crash, the end of the file is at the end of some
write; the file may not tail off into junk. This should be the default for
files opened for append.

\- "Managed" files are for structured databases. They have an additional API.
"writemanaged()" has a callback parameter. It returns when the data has been
queued, but in addition, the writer gets a callback when the write has
committed to disk. The file system must guarantee that a write for which the
callback has been made will survive a crash. This provides fine-grained
information of when data has been committed to disk, which is what a database
needs. It improves performance by not blocking waiting for disk commit to take
place. The database can have several I/O operations going at once in different
parts of the file without blocking.

~~~
jstimpfle
> \- "Unit" files commit when closed properly (this does not include a program
> exit without close), and then replace the old version of the file. The file
> system should guarantee that, after a crash, you have either the old version
> or the new complete version.

Incidentally, this is the way it currently is. From _rename(2)_: "If newpath
already exists, it will be atomically replaced".

Just don't forget that "replacing" means to create a _new_ file, writing it,
syncing it, and then replacing the old version with _rename(2)_ (and fsyncing
the directory).
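In code, the whole dance looks roughly like this. A Python sketch using the os wrappers around the same syscalls; the function name and the `.tmp-` prefix are made up for illustration, and error handling and short-write retries are mostly omitted:

```python
import os

def atomic_replace(path, data):
    """Write `data` to `path` so that after a crash you see either the
    old contents or the complete new contents, never a torn mix."""
    dirname = os.path.dirname(path) or "."
    tmp = os.path.join(dirname, ".tmp-" + os.path.basename(path))
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # a real version would loop on short writes
        os.fsync(fd)         # new file's data is on disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)     # atomically swap in the new version
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)        # make the rename itself durable
    finally:
        os.close(dfd)
```

If you crash before the rename, the old version is untouched; if you crash after, the new version is complete and linked.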

> \- "Temp" files always go away completely after a crash. This should be the
> default for designated temp directories.

You can have that: Just don't link the inode of the file. On Linux, you can
use O_TMPFILE. Alternatively, just create the file, and keep it open but
immediately unlink it (it's a little hack but nothing dramatic).

You can also write to /tmp. Normally that gets cleaned at reboot, which might
be enough.
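The unlink trick in a few lines (a sketch; O_TMPFILE would avoid even the transient name, but is Linux-only):

```python
import os
import tempfile

# Create a temp file and unlink it immediately: the open descriptor
# keeps the inode alive, but no name can survive a crash (or exit).
fd, path = tempfile.mkstemp()
os.unlink(path)                  # the directory entry is gone right away
os.write(fd, b"scratch data")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)           # the data is still reachable via fd
os.close(fd)                     # now the inode itself is reclaimed
```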

> \- "Log" files can only be appended.

In _open(2)_ , look for O_APPEND.
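A minimal sketch of what O_APPEND buys you (os.O_APPEND maps straight to the open(2) flag; the file name is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.log")
# Two independent opens (not dup()ed descriptors): O_APPEND makes each
# write atomically seek to the current end of file before writing.
fd1 = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
fd2 = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(fd1, b"one\n")
os.write(fd2, b"two\n")   # lands after "one\n", not on top of it
os.close(fd1)
os.close(fd2)
```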

> Writers cannot seek.

That's not how you write a kernel API. If you don't want to seek, then don't.

> The file system guarantees that after a crash, the end of the file is at the
> end of some write

How do you intend to implement that? Disks don't support arbitrarily large
atomic commits. Making such a guarantee in software is necessarily
inefficient. The kernel will not choose your tradeoffs for you. You can do it
from userland. Decide yourself what is the least bad way to do it.

> "Managed" files are for structured databases. [..] The writer gets a
> callback when the write has committed to disk. The file system must
> guarantee that a write for which the callback has been made will survive a
> crash.

Fsync does just that, except it isn't asynchronous. Maybe there's something
behind _aio(7)_?

~~~
Animats
Rename isn't quite an atomic replacement. If you crash before the rename, the
new file hangs around. (Hence unwanted .part files.)

O_APPEND isn't airtight on all systems. On some older UNIX systems, multiple
writers created with "open()" (not "dup()") do not share a file position. NTFS
doesn't do append correctly.

How do you guarantee that, after a crash, the end of the file is at the end of
some write? By updating the file size _after_ the write. The file size update
can be deferred during heavy write traffic, but you should always get a file
size that ends at a write boundary.

"fsync" syncs the whole file, not just one I/O, which can take a while.
Databases such as MySQL's InnoDB, which puts multiple tables in one file, can
have independent I/O going on in different parts of a file.

aio(7) has the right mechanism, a callback/signal on completion. But it's not
clear if the file system guarantees the data is safely on disk when the
completion signal comes in.

The original article complains that UNIX/Linux file system semantics aren't
well enough defined for database safety. He's right. They're close, but not
quite there, because behavior after a crash is unspecified.

~~~
kentonv
> How do you guarantee that, after a crash, the end of the file is at the end
> of some write? By updating the file size after the write.

Note that Linux ext4 does _not_ do this. On a power outage, you can get bogus
trailing zeros on a file which you were appending, because the file size was
updated before the data was written. I asked Ted T'so about this and he said
it was working as intended.

[https://plus.google.com/+KentonVarda/posts/JDwHfAiLGNQ](https://plus.google.com/+KentonVarda/posts/JDwHfAiLGNQ)

~~~
ryao
ZFS does do this. Changes are atomic, so they either happen or do not. There
is no in-between.

~~~
jstimpfle
I don't know about ZFS, but the usual _write(2)_ API does not support this: It
might for example return early with a short write because some interrupt
occurred. Can happen for all "slow devices" (see _signal(7)_ ). And I think
that's a good thing and am sure many programs _expect_ this.

~~~
kentonv
The signal(7) man page also states clearly that a local disk is _not_ a "slow"
device, so this seems moot.

    
    
        read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices.  A "slow" device is one where the I/O call  may  block  for  an
        indefinite  time, for example, a terminal, pipe, or socket.  If an I/O call on a slow device has already transferred some data by the time it
        is interrupted by a signal handler, then the call will return a success status (normally, the number of  bytes  transferred).   Note  that  a
        (local) disk is not a slow device according to this definition; I/O operations on disk devices are not interrupted by signals.

~~~
jstimpfle
oops, had that wrong. Thanks for noticing!

------
rogerbinns
A good solution is to use SQLite. It addresses the issues (pretty much by
doing all the fsync etc mentioned including on directories) and has a very
comprehensive test suite. It is also used very widely on desktops, mobile
devices, applications etc.
[https://www.sqlite.org/whentouse.html](https://www.sqlite.org/whentouse.html)

A notable quote: SQLite does not compete with client/server databases. SQLite
competes with fopen().

~~~
striking
I wonder what the implications of making an SQLite filesystem would be.

~~~
jclulow
If you then expose that file system through a POSIX file system API, you have
all of the same issues of underspecified or unclear behaviour that the article
mentions.

~~~
derefr
If I were designing a userland from the ground up, I'd probably give processes
a transactional MVCC _object store_ , and make guarantees about that; and then
implement a "POSIX compatibility layer" file system API in terms of that, but
explicitly say that none of the same guarantees from the object-store layer
apply.

Some days I really do wish we weren't so inured to the particular 50-year-old
systems-programming abstractions.

~~~
dap
How would it be different than the filesystem API?

There was a great lightning talk a few years ago that I can't seem to find
where the author described an API for storing blobs in a hierarchical
namespace. Of course, halfway through, it became clear that it's just the
POSIX API: you can "open" handles to objects, "rename" them, remove them, and
so on. You'd end a transaction with "fsync()". (Okay, that one's a little more
complicated, but I don't think it's as hard as the OP claims, at least for
single files. Multiple files are more complicated, but that problem is
intrinsically more complicated.)

~~~
Sanddancer
The problem is that fsync() runs outside of the control of the program.
There's no way for an application to start a transaction, perform steps x, y,
and z, and then end a transaction, rolling back to before step x if there are
any failures. For example, suppose you're rotating a an audit log file at the
same moment your backup program is running. Your backup program reads the
directory, and at that same moment, your rotate script had renamed the file,
but had created the new file, but data hadn't been written to the file yet.
What does the backup program see? does it see the old file you just renamed?
Does it see the new zero-length file? Your backup now has an indeterminable
state, and potentially lost data, because the backup received a consistent
view of the overall data. Were there a way of creating a transaction, the
second program looking at the same data would either see the old file, or
would see the new file and the rotated file. This is where a transactional
file system would be of great benefit, because it limits the amount of
indeterminate state to a very minimum, even while multiple programs are
operating on the same file.

~~~
frutiger
Realistically, backups need to be based on atomic snapshots, like `zfs`.

~~~
Sanddancer
Yep. I'll admit my example was a bit convoluted, but I was trying to show a
way in which common tasks can race and create undesirable situations. ZFS
snapshots would definitely make things quite a bit more predictable.

------
ChuckMcM
One of the engineers at Google took the time to figure this out, and updated
the page out code in the Linux kernel they were using so that the "correct"
steps were known if you had to know your data was on disk. It was discussed on
LKML as I recall but considered "not generally useful" and I doubt it made it
into the main sources.

One of the interesting things about writes is that they disrupt reads more
significantly than you might expect. Greg Lindahl characterized the impact at
Blekko when we were crawling so that we could optimize writes from the crawl
to not disrupt latency on the search engine side. Later we completely
separated those functions for similar reasons. I believe every disk I've
evaluated over the years is slowest on the random read/write 50% test.

~~~
_yosefk
Even DRAM will be slowest on a random read/write 50% test (it has read-to-
write and write-to-read penalties; admittedly these will be dwarfed by
precharge/activate penalties, but still, mixing reads and writes will make
things worse.)

------
ape4
I think the man page
[http://linux.die.net/man/2/fsync](http://linux.die.net/man/2/fsync) answers
the questions the article asks about fsync().

fsync() = YES metadata, YES data, NO dir entry

fdatasync() = NO metadata, YES data, NO dir entry

fsync() of dir = NO metadata, NO data, YES dir entry

~~~
kr7
For Linux, but it's not portable across all Unixes. fsync is allowed to do
nothing (1) and does not need to work on directories (2). On Mac OSX, fsync
will not flush the disk cache, which can lead to data loss (3).

(1)
[http://pubs.opengroup.org/onlinepubs/9699919799/functions/fs...](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html)

(2)
[http://austingroupbugs.net/view.php?id=672](http://austingroupbugs.net/view.php?id=672)

(3)
[https://developer.apple.com/library/mac/documentation/Darwin...](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html)

~~~
valleyer
fsync(2) on OS X and other Darwins also does not sync metadata. You need to
use fcntl(2) with F_FULLFSYNC for that.

~~~
ryao
It looks like doing `sysctl -w kern.always_do_fullfsync=1` will produce sane
behavior:

[https://opensource.apple.com/source/xnu/xnu-2422.1.72/bsd/hf...](https://opensource.apple.com/source/xnu/xnu-2422.1.72/bsd/hfs/hfs_vnops.c)

Alternatively, you could just use the ZFS driver for Mac OS X. It will do
fsync properly.

------
cowsandmilk
As long as your hardware doesn't lie as well??

[http://brad.livejournal.com/2094221.html](http://brad.livejournal.com/2094221.html)

[http://brad.livejournal.com/2116715.html](http://brad.livejournal.com/2116715.html)

------
marvy
I ran across a nice paper a few years ago: Rethink the Sync.

[https://www.usenix.org/legacy/event/osdi06/tech/nightingale/...](https://www.usenix.org/legacy/event/osdi06/tech/nightingale/nightingale.pdf)

The basic idea is that the file system provides two guarantees, one boring,
and one interesting. I'll illustrate using code instead of words. The boring
guarantee is:

    
    
        FILE* f = fopen("autosave.bak", "w");
        fwrite(buffer, 1, length, f); // save current document
        fclose(f);
    

The system will try to make the write durable within 5 seconds of executing
the second line. The interesting guarantee is this:

    
    
        FILE* f = fopen("autosave.bak", "w");
        fwrite(buffer, 1, length, f); // save current document
        puts("Saving complete!");
        fclose(f);
    

The system guarantees that the user will not see the message until the data is
safely on disk! The way they do this is to implement a fancy dependency
tracking mechanism that makes sure that the computer never generates output
until the writes that the output depends on have completed.

They do a bunch of benchmarks that show that their system is almost as fast as
mounting ext3 _asynchronously_. (In fact, not much worse than a RAM disk
even.) Of course, they also show that in the case of power loss, their system
behaves well, whereas ext3 does not, unless you turn up all the paranoia knobs
to 11, and then the performance is WAY worse than their system.

I'm oversimplifying quite a bit, since this comment is already pretty long.
Read the paper for details. Or ask questions here, but I'll probably forget to
check the comments, because HN doesn't remind me :(

~~~
planckscnst
I misread this comment and it could be dangerous if someone who doesn't know
misreads it the same way. So I want to put a big warning here.

Warning! This comment is _not_ saying that your operating system provides
these guarantees. In fact, it almost certainly does not. This is a novel
suggestion (and implementation?) presented in this particular paper.

~~~
marvy
Yes, definitely! If the edit timer hadn't expired, I would rewrite this. I
wrote this in a horribly unclear way.

------
schmichael
Disks can buffer, disks can have firmware bugs, disks can fail both
catastrophically and subtly. Even the fanciest battery-backed RAID controllers
have firmware bugs and dead batteries. Those are just ways your storage
hardware can fail you even if the kernel and libraries are bug free and you
follow the right mystic incantations to sync data.

While there's no excuse for bad documentation or poor APIs, you can never
consider data written to a single local disk "safe". It never is.

It's a shame making a best effort at safety is nontrivial, but it does force
developers to write more defensive and crash-safe code which is all that can
save you in the end.

~~~
GauntletWizard
Yeah, from a SRE perspective, the last N writes are always purely
probabilistic. The real quest is to have enough redundancy that that curve is
fairly close to 1, and enough failure warning that your system will fix itself
before it droops. That means Battery Backup, SMART detection, ECC memory, etc.

~~~
planckscnst
It's even worse than that. All writes have some probability of being
incorrect, even after having been written to disk properly (the programmer did
the right thing, the OS did the right thing, the filesystem did the right
thing, and the disk did the right thing). After writing, your data can be
modified by actions on other cells or by simply leaving it alone completely
(see read disturbance and charge leakage). Some of this can be fixed by error
correcting codes, but there is still a chance of losing data simply by doing
nothing wrong.

So yes, agreed++. You need to have a level of redundancy appropriate to the
criticality of the data.

------
cm2187
Stupid question: why don't all computers (irrespective of the OS) have a built
in power loss mechanism? It seems to be such a common and obvious problem.

1\. The PSU would have a big enough capacitor to keep the computer running for
a few seconds at its stated output power

2\. The PSU would notify the OS of a power loss

3\. The OS would immediately flush all caches and adopt a "brace position".

4\. Events are spread system wide so that apps can also flush and brace.

It should work even if it is the PSU that fails (as long as the capacitor is
there).

Surely the problem cannot be the cost. Why don't modern desktops have that
feature?

~~~
drewm1980
You have just described a UPS. Or a laptop. A battery just makes more sense
than a capacitor in this application.

~~~
brigade
Why? Wouldn't a large enough capacitor last longer than batteries? UPS
batteries I've seen are only rated for 3 years, and half of my recent laptop
batteries have physically swelled after 3 years of use.

~~~
sdk77
It would, but the size would be impractical and it's too costly. Quick
estimate: let's assume we need 10W for 10 seconds, that's 100 Joules of
energy. The energy stored in a capacitor is 0.5 * C * V^2. Say we use a 10V
capacitor; then C = 2 Farad. They exist, but they are very large (look them up
on Amazon, for instance). You'll probably need more like twice the capacity
though, because it's impossible to extract all the energy from the capacitor,
and it's lossy to convert it to a constant +5V / +12V.
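A quick check of the arithmetic above:

```python
# Back-of-envelope check of the capacitor sizing estimate.
E = 10 * 10        # 10 W for 10 seconds -> 100 Joules
V = 10             # capacitor working voltage
C = 2 * E / V**2   # solve E = 0.5 * C * V^2 for C
print(C)           # -> 2.0 farads
```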

~~~
tzs
Instead of 10 V and 2 F, how about going for 2.5 V and 50 F? A capacitor with
those specs is only 40 mm long and 18 mm diameter [1]. That shouldn't be too
hard to fit in a typical server or desktop. That's under $4 in quantity.

[1]
[http://www.mouser.com/ds/2/257/Maxwell_HCSeries_DS_1013793-9...](http://www.mouser.com/ds/2/257/Maxwell_HCSeries_DS_1013793-9-341195.pdf)

~~~
sdk77
Theoretically it's possible, but in practice the lower input voltage makes it
harder to convert it to +5/+12V. It becomes increasingly lossy and expensive -
to convert 0.5V to 12V at 10W is not trivial to begin with. Voltage across a
capacitor drops continuously while discharging (unlike a battery). So with a
2.5V capacitor, being able to use it between 1.25V and 2.5V is already pushing
it. On average, discharge current at 10W is around 8A. The internal resistance
of the capacitor (ESR) better be _very_ low (it probably isn't) at this low
voltage - even if it's 0.1 Ohm, at 8A we already lost 0.8V from our meager
2.5V, and now the useful energy is just 0.5 * C * (1.7V - 1.25V)^2 =
0.5 * 50 * 0.45^2, about 5J, just enough for 500ms at 10W.

~~~
tzs
Interesting. 500 ms would probably not be enough time to save everything
(although maybe on an SSD based system it would be...), but it would probably
be enough time to save select information that would allow ensuring that the
disk is in a consistent state.

The 10 F capacitor has an ESR of 0.075 ohm, but that's at 1 A. They have a 100
F model that is 0.015 ohm at 10 A, and a 150 F that is 0.015 at 15 A. Based on
your calculations, these look like they would have a good chance of giving
enough time to save everything (especially on an SSD system).

Those are physically bigger but should still fit in a normal desktop or
server.

(There is another manufacturer that has up to 630 F!)

------
zakalwe2000
There are worse problems at a lower level - hardware caches on the disk do
not really guarantee flushes are honored either. There's a discussion of this
here - honoring writes is an enterprise feature...
[http://serverfault.com/questions/460864/safety-of-write-cache-on-sata-drives-with-barriers](http://serverfault.com/questions/460864/safety-of-write-cache-on-sata-drives-with-barriers)

------
rikkus
I use something akin to djb's Maildir delivery procedure when I want to be as
close to sure as I can be that a file 'has been saved'.

1\. Create a temp file on the same mount

2\. Write data to file, checking return of each write(), then a final fsync(),
and close()

3\. link() to filename we actually want
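Those three steps, sketched in Python (the os module wraps the same syscalls; a real Maildir delivery generates a unique temp name per message, the one here is made up):

```python
import os
import tempfile

d = tempfile.mkdtemp()
tmp = os.path.join(d, "tmp.0001")   # unique temp name, same mount as dst
dst = os.path.join(d, "message")

# 1. create the temp file on the same mount
fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
try:
    # 2. write, checking each write()'s return, then fsync and close
    n = os.write(fd, b"mail body\n")
    assert n == 10
    os.fsync(fd)
finally:
    os.close(fd)

# 3. link() to the filename we actually want, then drop the temp name
os.link(tmp, dst)
os.unlink(tmp)
```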

I still haven't figured out how to be this safe on Windows, because AFAICT
there's no atomic way to link() or 'move' a file - and they canned the
transactional API for the filesystem. Any pointers to how to write files
'safely' on Windows would be much appreciated!

~~~
pbsd
`ReplaceFile` is supposed to be atomic, at least according to the
documentation:
[https://msdn.microsoft.com/en-us/library/windows/desktop/hh802690\(v=vs.85\).aspx#applications_updating_a_single_file_with_document-like_data](https://msdn.microsoft.com/en-us/library/windows/desktop/hh802690\(v=vs.85\).aspx#applications_updating_a_single_file_with_document-like_data)

------
colin_mccabe
UNIX has "a standard way to deal with the file durability problem." If you
fsync the file, then fsync the containing directory, that should be durable on
all properly behaved setups.

Of course this comes with a lot of footnotes. There was a bug on some older
Linux kernels where they didn't tell the hard drive to flush the data after an
fsync. This is why a lot of people still incorrectly believe that enabling
hard disk write cache is unsafe. But this was a bug, and the bug was fixed.
There are also reports of hard drives that don't honor flush requests. There's
not much UNIX or any other OS can do about this-- if the hardware lies, you
are in trouble.

You could also imagine a much richer durability API. Empirically, databases
need such a richer API rather than simple fsync. The POSIX async I/O standard
was supposed to standardize all this, but Linux's glibc just implements it as
a thread pool making blocking system calls. If you want real async I/O on
Linux, you need to use an OS-specific API.

------
cmurf
This is kinda old now, 2013, but I found it a useful quick read that's related
to some of these issues, at least on Linux. Many of the comments are also
interesting. Atomic I/O operations in Linux
[https://lwn.net/Articles/552095/](https://lwn.net/Articles/552095/)

------
jstimpfle
I don't see why you would have to fsync the "parent of the directory", too.
That may not be explicitly specified, but it works just this way:

Everything is a file ("object"), whether it's a standard file, or a directory.
If you _create_ a new file, you want two things synced: the contents of the
file, and the linking of the file (that is the pointer from the directory in
which you created the file, to the file object). If you don't sync the link
you may not be able to find the file again, even though it was synced to the
disk just fine. (It's just how git works btw.)

The link to the file is part of the directory object's contents. That's why
the directory needs to be synced.

There is no need to sync the "parent of the directory" because that was never
modified.

------
POSIXprog
Most programmers, and especially the hipsterish HN crowd, simply cannot be
trusted to write correct file manipulation code. In general, anything they
produce will be riddled with race conditions and erroneous assumptions (e.g.,
that rename works cross-device, or that close cannot fail) that break in rare
but possibly catastrophic circumstances.

The solution is copy-on-write file systems such as ZFS and Btrfs, which ensure
neither data nor metadata are ever altered in place, plus the reuse of correct
file manipulation code (written by adults) rather than rolling your own -
either from a library or from something higher-level like SQLite.

------
jakub_g
Related link with a very detailed writeup:
[http://danluu.com/file-consistency/](http://danluu.com/file-consistency/)
(surprised no one posted it yet).

TL;DR: files are hard; filesystems differ a lot; sqlite is _very_ robust, most
other software (git and mercurial including) not that much.

------
zzzcpan
> One issue is that unlike many other Unix API issues, it's impossible to test
> to see if you got it all correct and complete. If your steps are incomplete,
> you don't get any errors; your data is just silently sometimes at risk.

There is no problem with the API. The issue is an underlying assumption that
data sometimes is not at risk. Sadly, or maybe luckily, there is nothing you
can do to guarantee 100% durability, so there's no point trying to do the
impossible. Your data is always silently at risk. Accept it, deal with it,
minimize the risk if you need to - for example, replicate your data
synchronously across continents.

------
faragon
Rule of thumb for OS disk I/O: write as soon as possible without hurting
performance, using in-memory buffers to amortize the cost of the slow
component and to spread operations over longer periods.

~~~
planckscnst
The article wasn't about the decision of when to write to disk, but how to
actually do it. Say you have a point in your program where you have made the
decision that it is necessary to write to disk. How do you actually do that
for sure? It's not write() or even necessarily fsync().

The author is finding this frustrating because there are several things to do
that can seem arbitrary, random, and counterintuitive. Their trust in the OS
was damaged and now they are understandably grumpy about it.

~~~
rdtsc
> How do you actually do that for sure? It's not write() or even necessarily
> fsync().

Right. For decision of "when" you can configure a few kernel parameters,
namely dirty writes thresholds. That assumes you just do write()s from your
process and kernel decides to flush that data out. That can be configured to
be a function of time and or amount of unwritten data.
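Those thresholds are exposed under /proc/sys/vm on Linux; a small sketch to inspect them (the paths are Linux-specific, so this returns an empty dict elsewhere, and the helper name is made up):

```python
import os

def read_vm_knobs():
    """Return the kernel's writeback tuning knobs, or {} off Linux."""
    knobs = ("dirty_background_ratio", "dirty_ratio",
             "dirty_expire_centisecs", "dirty_writeback_centisecs")
    values = {}
    for knob in knobs:
        path = "/proc/sys/vm/" + knob
        if os.path.exists(path):
            with open(path) as f:
                values[knob] = int(f.read().strip())
    return values
```

The same knobs can be set at runtime with `sysctl -w`, which is what tuning them "a function of time and/or amount of unwritten data" amounts to.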

I had to do it a few times. Once it was a realtime-ish system which was
recording data to disk, and I noticed the recording thread would time out.
(Timeouts were in the 10s of seconds and were noticed by a watchdog system.)
I thought 10 seconds should be enough for that process. But it turns out that
because of how priorities were set up and how fast writes went, dirty page
flushing was not keeping up with writes. Periodically it would hit the top
threshold, and at that point any process doing disk writes would be blocked.

I was able to get it under control by essentially doing what the GP suggested:
writing a little bit at a time, but more often. The total throughput probably
went down, but performance was smoothed out quite a bit.

~~~
planckscnst
That's still missing the point. Say you're writing a program that stores a log
of transactions. People are trading resources and you log those transactions
so you can find out who owns which resources. User A just gave 100 units to
user B; you need to store this information and tell both users that the
transaction is complete. How do you do that? Your program will do something
like "write(t_log_fd, t_entry, t_bytes);". Can you now tell the users that the
transaction is complete?
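No - a bare write() only puts the entry in the page cache. A hedged sketch of what the commit path needs before acking the users (the `commit` helper and log format here are made up for illustration; fdatasync suffices because only the data, not the mtime, must be durable):

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "t.log")
t_log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

def commit(t_entry):
    """Append a transaction entry and return only once it is durable."""
    n = os.write(t_log_fd, t_entry)
    assert n == len(t_entry)   # a short write would need a retry loop
    os.fdatasync(t_log_fd)     # block until the entry itself is on disk
    # only now is it safe to tell users A and B the transfer is complete

commit(b"A->B 100\n")
```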

~~~
rdtsc
I guess I misunderstood your question about "The article wasn't about the
decision of when to write to disk".

I thought you were asking how to control, in general, when your system writes
to disk.

Well, if you don't use fsync, then it decides at some point, like I mentioned
above. If you want to talk about transactions, then do a write and fsync -
does that sometimes not work for you? If you are worried about the new file
appearing in the directory, then fsync the directory as well.

If you want more guarantees, you'll have to dig deeper and find out about your
specific device, does it have battery power and does it tell lies about it
writing data and so on.

------
lazyant
Not sure if I understand the problem. In most cases you don't want to flush
file changes or dirty memory to disk immediately, so you can batch those
operations, or write only the final change and not all the intermediary ones.
But you don't want to wait forever, since RAM is volatile. This is why
databases have settings for checkpoint intervals; you can set the % or time
for dirty pages to be written (I thought the default was 30 seconds). Doesn't
the same idea apply to file buffers?

------
keypusher
We needed this recently when doing some custom filesystem work, ended up
patching our kernel with something we found in a discussion from lkml.

------
pgaddict
FWIW this uncertainty in fsync behavior is essentially the cause of the recent
PostgreSQL bug on ext4 (fixed in the last round of minor releases).

Basically, on XFS and all other tested filesystems it seems to work just
fine; on EXT4 we may lose the last rename() effects.

So this makes it difficult to get the durability right, even when the project
is as careful about it as PostgreSQL.

------
Ericson2314
Yet another example of how the filesystem, and its POSIX realization, is in
fact a horrible abstraction.

~~~
jstimpfle
I don't think there is anything wrong with the POSIX file system API. Two
things:

\- I think it's mainly the modern file systems like Btrfs (and partly also
ext4) that introduced a paradigm shift, which broke old applications (or at
least broke their performance, for example dpkg).

\- We're talking about a hierarchical file system, meaning it's easy for
humans to find data, but terrible for machines because they must hunt pointers
to get to files. Similar problem for syncing a bunch of logically connected
files if they are not stored in a single directory. (How do you atomically
commit multiple changes across directories?)

What would be a better abstraction or non-abstraction (that's still a
hierarchical file system)?

~~~
Ericson2314
A friend of mine had the realization that all optimization boils down to
making lower layers understand higher layers' concerns, or higher layers
understand lower layers' concerns. Here are some examples off the top of my
head:

POSIX fails as a high level interface:

\- No (at all powerful) notion of transactions. Mutation without transactions
is extraordinarily primitive.

\- No multiple FS roots to indicate boundary across which data will never be
synchronized. (This is also good for how to spread data across multiple
devices, a low-level concern.)

\- Overall pushes people to maintain their own structure within files rather
than use FS's trees.

POSIX fails as a low level interface:

\- No way to hint or control caching along the memory hierarchy.

\- Block size, locality out of control.

\- Can't statically disallow various non-free actions which are costly (like
the ability to expand a file); one can only hope that not using them does not
incur a penalty.

Perhaps ZFS and Btrfs made other approaches more viable/obvious, but these
weaknesses are inherent to the API itself, and the high-level stuff was
noticeably missing from the get-go.

~~~
jstimpfle
> \- No (at all powerful) notion of transactions. Mutation without
> transactions is extraordinarily primitive.

There is nothing that prevents you from implementing these in userland. This
does not belong in the kernel, since the kernel can't know what your atoms
are that must be atomically committed. Research how databases do it.

> \- No multiple FS roots to indicate boundary across which data will never be
> synchronized. (This is also good for how to spread data across multiple
> devices, a low-level concern.)

You can check what device a file belongs to with _stat(2)_. It's the
_st_dev_ member. You can also check it from the shell with the _stat_
command.

> \- Overall pushes people to maintain their own structure within files rather
> than use FS's trees.

And that's entirely ok. Hierarchies are not for databases. Database-y problems
are not the problems that the POSIX fs solves.

> \- Block size, locality out of control.

I actually heard that Unix filesystems have traditionally been quite good at
preserving locality. That's why I don't know of any defrag tool for e.g. ext3.

Overall, if the FS does not solve your problems, implement your own
abstraction. That's perfectly ok.

------
kazinator
Is it just "Unix"? What if fsync returns when the hardware indicates that it
has completed a write, but it's actually in some drive controller cache for a
few moments more?

~~~
keypusher
If you care about these types of things, you can tell the drives to not cache
writes.

------
leni536
What about O_DIRECT | O_SYNC options in linux's open()?

~~~
rwmj
O_SYNC has abysmal performance. O_DIRECT has very underspecified semantics,
but particularly it demands that you only read and write whole "blocks" (where
"block" depends on the filesystem type, underlying block device and phase of
the moon).

~~~
ymse
See also Linus Torvalds' opinion on O_DIRECT:

 _The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey on
some serious mind-controlling substances._

[https://lkml.org/lkml/2002/5/11/58](https://lkml.org/lkml/2002/5/11/58)

------
jwatte
fdatasync() followed by fsync() "should" do it although on very recently
created/mounted directories you'll want to use sync() as well. Luckily sync()
is synchronous on Linux (sanity!) even though traditional Unix doesn't require
it to be.

So perhaps the article can be shortened to "sync() is not guaranteed to be
synchronous on non-Linux Unix."

------
amelius
But what if, behind the scenes, the hardware is performing similar tricks as
the OS?

------
lamontcg
Even if you solve all these problems, if your RAID array completely loses its
mind, then you can still lose/corrupt data.

The best solution here is to replicate data across multiple machines, and take
good backups that you can restore from, and plan on corruption to happen.

You also need to assess exactly how much you really need those transactions.
If they're financial transactions where each one could be millions of dollars
of stock, then you probably need to care about this a bit. For most web
transactions I think the risk/cost analysis here is that you don't need to
worry about being perfect. You might lose the last one or two transactions in
the case of a hard crash, but your customer service team should be able to
handle that and fix it for the customer out of band. You really should
consider if you have a business justification for worrying about perfection.

On the other hand, I did live through the ext2 era before ext3 was production
ready, and fully async super-duper fast filesystems really are bad. They would
regularly corrupt the disk and require rebuilds. That wasn't a data
availability problem, but one or two of those a week was problematic when it
came to the operational load (out of 400-800 servers that we managed at the
time). We later scaled out ext3 to 30,000 servers with some sets of servers
having 2,500 hosts of basically the same type of webserver, and while kernel
crashes were a daily issue, corruption and rebuilding was relatively low. If
you're not at Google/Amazon/etc and dealing with server counts an order of
magnitude higher than this, you don't really need to worry about it. ext3 or
ext4 and fdatasync should be fine, and then apply proper levels of engineering
principles to ensure that you don't stay offline for too long or lose too much
data.

You are dealing with free commodity hardware and software that isn't ever
going to be perfect. If you really needed to never lose a transaction you'd
probably be buying some kind of awfully expensive mainframe system.

Oh and I do recall one case of filesystem corruption leading to a service
being down for over a week and probably the loss of a multi-million dollar
business deal. But in that case the software ran on a single box. The dev team
that was responsible for it never saw that as a problem even though the
ops/sysadmin teams kinda yelled at them about it. Then one day the RAID array
lost its mind and the server was unrecoverable. When we attempted to rebuild
it, it was discovered that over the years the software devs had tweaked the
versions of libraries that their software linked against and by crashing all
that information had been lost, so it took ages to debug and find the right
incantations to get it all back up again. Huge business risk there, but
nothing that could be mitigated by navel-gazing analysis of filesystems and
fdatasync -- backups, documentation, replication, proper config management
practices, etc were what was needed.

------
rocky1138
Can't someone smart just read the source code and figure out exactly under
which conditions files get written to the disk?

~~~
mwcampbell
Which source code? There's more than one implementation of all of the
following: OS kernel, disk driver, and filesystem.

~~~
lisper
It only takes one combination to start the chain reaction. If someone
identifies _one_ combination of kernel, disk driver, file system, hardware,
and syscalls that results in reliable durability then people who care about
reliable durability will start using it, and that will eventually turn into a
de facto standard which will eventually turn into an actual standard.

~~~
tamana
Until someone who doesn't care changes some code and that pattern is no longer
durable.

------
em3rgent0rdr
I've been told to type sync into terminal whenever I want writes to complete.

~~~
rsync
"I've been told to type sync into terminal whenever I want writes to
complete."

If you'd really like a guarantee, you can always unmount the filesystem, or
alternatively, mount it read-only:

mount -ur /mnt/blah

That will guarantee, at least at the filesystem/OS level, that the writes are
committed.

~~~
rocky1138
My understanding is that, after that point, the hardware drivers could lie and
say they've written it and we'd never know, correct?

And even after that, the hardware could return a "written OK" value but not
actually do the job, right?

So the point is that without complete transparency from end-to-end, there's no
way to tell.

Is that correct?

~~~
tremon
Yes, and there's also the problem that magnetization of material isn't
permanent so your data may disappear after N days even if it were written
correctly (typically, N > 5000). Should your "complete transparency" model
also include expected longevity of the written bits?

------
late2part
Can I solve most of this on Linux, e.g. Ubuntu 14, by doing "# sync"?

------
userbinator
 _I'll admit that one reason I'm unusually grumpy about this is that I feel
rather unhappy not knowing what I need to do to safeguard data that I care
about._

...backups?

 _This issue is not unsolvable at a technical level, but it probably is at a
political level. Someone would have to determine and write up what is good
enough now (on sane setups), and then Unix kernel people would have to say
'enough, we are not accepting changes that break this de facto standard'. You
might even get this into the Single Unix Specification in some form if you
tried hard, because I really do think there's a need here._

Or we could just have everyone perform regular backups like they already
are/should be doing, and decide that if systems are crashing so frequently as
to lose data often enough, trying to "fix" this "problem" by adding what would
likely be another mass of design-by-committee complexity to filesystems is not
addressing the cause but only its symptoms.

Then again, with over two decades of experience using the FAT filesystem and
never a single instance of unrecoverable data loss despite sudden crashes
while hearing countless tales of others corrupting their data frequently even
when using far more complex and "robust" filesystems, it makes me wonder why I
don't seem to suffer quite the same problems...

~~~
johnbender
Backups don't help if writes don't make it to disk in the order and manner
expected by the application programmer. There's an emerging consensus that
there are crash protocol bugs lurking everywhere due to I/O scheduler
reordering. For example this bug in gzip:

[http://bugs.gnu.org/22768](http://bugs.gnu.org/22768)

~~~
EdiX
This bug report is truly surreal. A filesystem could easily write to disk
directly, skipping write buffers; the whole reason they don't is "because
performance".

In fact the "file system mathematically guaranteed to not lose data" is too
slow to be used in practice and will implement fsync/fdatasync in the future
to regain performance (and in the process it will stop being "mathematically
guaranteed not to lose data").

Clearly the solution is to add fsync/fdatasync calls to every single program
so as to negate the performance gains of file system write buffers entirely.

Clearly, the next step after that is for filesystem to start ignoring
fsync/fdatasync entirely, because otherwise they would be too slow.

