
Why Unix needs a standard way to deal with the file durability problem - zdw
https://utcc.utoronto.ca/~cks/space/blog/unix/WhyFileSyncStandardNeeded
======
haberman
IMO the way to tackle a problem like this is:

1\. Figure out the right primitive(s) that applications need to provide both
correctness and performance. Make these as simple as possible.

2\. Make a portable library that implements the simple primitives by doing
whatever is necessary on each platform to provide the API's guarantees. This
will likely require a lot of research/expertise on each platform. The result
will be an ugly mess of #ifdef, but _using_ the library will not be ugly (a
toy sketch follows this list).

3\. If implementing the library is impossible or unnecessarily slow on some
platform, lobby the maintainers of that platform to provide something better.
If they fundamentally disagree with the abstraction, put it on them to
counter-propose a better one.

4\. (Optional) lobby to have the primitive become part of the C standard
library or POSIX to shift the maintenance burden to the platform owners.
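
A toy sketch of what step 2's internals might look like, using a durability
primitive as the example. This is illustrative only: each branch would need
the per-platform research described above, and the non-Linux branches are
just the well-known cases (F_FULLFSYNC on OS X, _commit() in the Microsoft C
runtime).

    #if defined(_WIN32)
    #include <io.h>
    #else
    #include <unistd.h>
    #include <fcntl.h>
    #endif

    /* One simple primitive for callers; the #ifdef mess stays inside. */
    int durable_sync(int fd) {
    #if defined(__APPLE__)
        /* On OS X, fsync() does not flush the drive's write cache;
           F_FULLFSYNC asks for a flush to stable storage. */
        return fcntl(fd, F_FULLFSYNC);
    #elif defined(_WIN32)
        return _commit(fd);  /* FlushFileBuffers on the OS handle */
    #else
        return fsync(fd);
    #endif
    }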

~~~
geocar
Okay, here's a stab at it:

• SIGIOFIN works like SIGIO, except it fires when the IO operation completes
(is on the disk, has left the network card, etc.).

I should be able to use signalfd() to collect these. Let me use something
like TCP_DEFER_ACCEPT on regular file descriptors -- basically, don't wake me
up on a fifo until there are _n_ bytes, and don't wake up my signalfd() until
there are _n_ events.

Here's another idea:

• all() returns a file descriptor that is initially readable.

• on(POLLIN,f,d): while poll(f,POLLIN) would block, d is unreadable

• a POLLSYNC flag added to poll() etc., which is set when all previous writes
to the file descriptor have reached the disk (have left the network card,
etc.). To get multiple checkpoints, use dup(). A usage sketch follows.
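
For example, usage might look like this. Entirely hypothetical: POLLSYNC is
the flag proposed above, not a real one.

    #include <poll.h>
    #include <unistd.h>

    /* HYPOTHETICAL: POLLSYNC does not exist today. */
    void write_and_wait_durable(int fd, const void *buf, size_t len) {
        struct pollfd p = { .fd = fd, .events = POLLSYNC };
        write(fd, buf, len);
        poll(&p, 1, -1);  /* wakes when earlier writes hit stable media */
        if (p.revents & POLLSYNC) {
            /* everything written to fd before the poll() is durable;
               a dup(fd) taken earlier would be an independent checkpoint */
        }
    }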

I'm not sure which one I'd like better. If we didn't have to make it
"primitive", we could simply copy what Windows does[1], since they figured
this out years ago.

What do you think?

[1]: [https://msdn.microsoft.com/en-
gb/library/windows/desktop/aa3...](https://msdn.microsoft.com/en-
gb/library/windows/desktop/aa365198\(v=vs.85\).aspx)

~~~
gue5t
I advise skipping the signal nonsense and making the interface purely fd-
based. The only "advantage" of signals over fd readiness is that signals can
interrupt syscalls that don't expect them, and this can be emulated without
undue performance loss using asynchronous requests and epoll-style
architectures. Signals cause a lot of pain in concurrent (e.g. coroutines and
reentrancy) and multithreaded contexts. Plus, including data about which IO
request finished and whether it succeeded or failed is more complicated with
signals than simply including it in the event structs read from an fd.
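
Linux already has fd-based plumbing in this style: for instance, completion
counts can be delivered over an eventfd and collected with epoll, with no
signal handlers anywhere. A minimal sketch:

    #include <sys/eventfd.h>
    #include <sys/epoll.h>
    #include <stdint.h>
    #include <unistd.h>

    int main(void) {
        int efd = eventfd(0, 0);  /* counter of completed requests */
        int ep  = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        /* An IO worker would report a finished request like this: */
        uint64_t one = 1;
        write(efd, &one, sizeof one);

        /* The event loop wakes up via ordinary fd readiness: */
        struct epoll_event out;
        epoll_wait(ep, &out, 1, -1);
        uint64_t done;
        read(efd, &done, sizeof done);  /* how many requests completed */
        return 0;
    }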

Another problem is that this doesn't really address the synchronous
assumptions of filesystem (rather than file content) operations: applications
expect that they can "mkdir(...); chdir(...); open(...);" and the filesystem
doesn't know if the application would work properly with those operations
reordered (it may or may not depending on the values of their arguments).
Applications need to be more explicit with the transactional nature of what
they're requesting from filesystems. Right now the filesystem tries to guess
at the transaction structure based on fsync/fdatasync hints, and can't do as
good a job scheduling as a result.

~~~
skissane
> Applications need to be more explicit with the transactional nature of what
> they're requesting from filesystems. Right now the filesystem tries to guess
> at the transaction structure based on fsync/fdatasync hints, and can't do as
> good a job scheduling as a result.

How about a transactional filesystem API, where applications can call an API
saying "start transaction on current thread", do some IO operations, and then
call one saying "end transaction on current thread" (or abort/rollback)?

Windows has something like it: [https://msdn.microsoft.com/en-
us/library/windows/desktop/bb9...](https://msdn.microsoft.com/en-
us/library/windows/desktop/bb968806\(v=vs.85\).aspx)

That said, it is rarely used, Microsoft suggests one shouldn't use it, and
warns that a future version of Windows might remove it.

I think part of the reason it is rarely used is that it is a whole new API:
for many of the existing filesystem API calls, there are new versions with
"Transacted" appended to their names and an additional transaction handle
argument. If they'd made the current transaction implicit (i.e. thread-local)
in the standard filesystem API, it might have seen more adoption.
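
For reference, the explicit-handle style looks roughly like this (a sketch of
the TxF API's shape, with error handling omitted; the path is made up, and
note that Microsoft has deprecated TxF):

    #include <windows.h>
    #include <ktmw32.h>  /* link against KtmW32.lib */

    int main(void) {
        /* Every *Transacted call must be handed the transaction
           explicitly -- the friction described above. */
        HANDLE tx = CreateTransaction(NULL, NULL, 0, 0, 0, 0, NULL);
        HANDLE f  = CreateFileTransactedW(L"C:\\data\\state.bin",
                        GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL, NULL, tx, NULL, NULL);
        DWORD n;
        WriteFile(f, "hello", 5, &n, NULL);
        CloseHandle(f);
        CommitTransaction(tx);  /* or RollbackTransaction(tx) */
        CloseHandle(tx);
        return 0;
    }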

------
jstimpfle
I've never dug deep into POSIX, but I'd like this question answered before I
accept the blog post's premise:

What's wrong with the standard procedure:

    
    
        1. open / create file
        2. write to file
        3. sync file
        4. possibly rename file to final location
        5. sync parent directory of file
    

Which compliant file systems don't guarantee that this is durable
(notwithstanding disks that tell lies)?
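
(For concreteness, a minimal C sketch of that sequence, with error handling
elided and the names made up:)

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Steps 1-5 above; every call here can fail and should be checked. */
    void durable_replace(const char *dir, const char *tmp,
                         const char *final, const void *data, size_t len) {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* 1 */
        write(fd, data, len);                                    /* 2 */
        fsync(fd);                                               /* 3 */
        close(fd);
        rename(tmp, final);                                      /* 4 */
        int dfd = open(dir, O_RDONLY);                           /* 5 */
        fsync(dfd);
        close(dfd);
    }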

Maybe POSIX doesn't specify this exact sequence explicitly (I don't know),
but this is the folklore you find everywhere, and it's just what you arrive
at naturally when you consider the object model the API is built on (where an
object is an "open file", i.e. what fds reference). (I mentioned this before:
it's basically the model that git also adopts, except that git objects are
immutable.)

~~~
daurnimator
The issue is that most programs do _not_ do step 3: e.g. fsync is not called
by C's stdio.

Which means that the `rename` might get reordered before the `write`. Which
means data loss. yay!

~~~
jstimpfle
The discussion hasn't been about stdio so far.

Btw. my fflush(3) manpage does mention that fsync(2) is required to ensure
data reaches the disk. Of course it's not nice that we have to drop
abstractions (and portability) to go the safe way. Let's just say stdio is a
bit simplistic.
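
That is, something like this sketch:

    #include <stdio.h>
    #include <unistd.h>

    /* fflush() only moves data from the stdio buffer into the kernel;
       fsync() on the underlying fd is what pushes it toward the disk. */
    int write_durably(FILE *fp, const void *buf, size_t len) {
        if (fwrite(buf, 1, len, fp) != len) return -1;
        if (fflush(fp) != 0) return -1;         /* stdio -> kernel */
        if (fsync(fileno(fp)) != 0) return -1;  /* kernel -> storage */
        return 0;
    }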

Btw. many file systems don't reorder write and rename for exactly this reason.
But at least ext2 and btrfs seem to do it.

------
Animats
I wrote about this in yesterday's topic.[1]

Some of the ideas I presented there are from UCLA Locus, circa 1980.[2] UCLA
Locus was a distributed UNIX, and an improvement on most of its successors.
Multiple machines on a network could not only share files, but you could run a
program on a different machine than its files. The machines didn't even have
to have the same type of CPU.

Overwriting a file resulted in creating a new file which initially shared all
its blocks with the old one. Updating was a copy-on-write type operation. The
new file became visible to other readers when closed, or when the writer
called "commit()". If the writer crashed, or a network connection was lost,
the file reverted. There was also a "revert()" call. Readers thus always saw a
consistent version of the file.

Locus also supported replicated files, where two nodes had copies of the same
file but it was treated as a single file. If changed, all nodes saw the
change.

This was before there were databases for UNIX, and so they gave the file
system database-like semantics. It was a good idea that didn't catch on. The
technology was sold several times (the sale to SCO was the killer) and
eventually open-sourced.[3] It hasn't been updated since about 2010, though.

[1]
[https://news.ycombinator.com/item?id=11511269](https://news.ycombinator.com/item?id=11511269)
[2]
[https://en.wikipedia.org/wiki/LOCUS_%28operating_system%29](https://en.wikipedia.org/wiki/LOCUS_%28operating_system%29)
[3] [http://openssi.org/cgi-
bin/view?page=openssi.html](http://openssi.org/cgi-bin/view?page=openssi.html)

------
wyc
Here's a relevant passage from the essay "Worse Is Better" by Richard P.
Gabriel:

[http://dreamsongs.com/WIB.html](http://dreamsongs.com/WIB.html)

"""

Two famous people, one from MIT and another from Berkeley (but working on
Unix) once met to discuss operating system issues. The person from MIT was
knowledgeable about ITS (the MIT AI Lab operating system) and had been reading
the Unix sources. He was interested in how Unix solved the PC loser-ing
problem. The PC loser-ing problem occurs when a user program invokes a system
routine to perform a lengthy operation that might have significant state, such
as IO buffers. If an interrupt occurs during the operation, the state of the
user program must be saved. Because the invocation of the system routine is
usually a single instruction, the PC of the user program does not adequately
capture the state of the process. The system routine must either back out or
press forward. The right thing is to back out and restore the user program PC
to the instruction that invoked the system routine so that resumption of the
user program after the interrupt, for example, re-enters the system routine.
It is called PC loser-ing because the PC is being coerced into loser mode,
where loser is the affectionate name for user at MIT.

The MIT guy did not see any code that handled this case and asked the New
Jersey guy how the problem was handled. The New Jersey guy said that the Unix
folks were aware of the problem, but the solution was for the system routine
to always finish, but sometimes an error code would be returned that signaled
that the system routine had failed to complete its action. A correct user
program, then, had to check the error code to determine whether to simply try
the system routine again. The MIT guy did not like this solution because it
was not the right thing.

"""

~~~
infinity0
The quote continues, "The New Jersey guy said that the Unix solution was right
because the design philosophy of Unix was simplicity and that the right thing
was too complex. Besides, programmers could easily insert this extra test and
loop. The MIT guy pointed out that the implementation was simple but the
interface to the functionality was complex. The New Jersey guy said that the
right tradeoff has been selected in Unix-namely, implementation simplicity was
more important than interface simplicity. "

As I see it, too many people abuse / misunderstand "worse is better". "Worse"
might be better than "perfect" in some cases, but "perfect" is better than
"crap", and "trying to be worse" usually results in "crap" whereas "trying to
be perfect" usually results in "worse".

Even in the above example, the MIT guy pointed out that the simple
implementation results in a complex interface, and undoubtedly far more time
was wasted in the end.

------
jeffdavis
The UNIX filesystem should not be used directly by applications, in my
opinion. And for the underlying technologies (databases, object stores,
etc.), it's the only option, though not a great one.

Even for something simple like photos, it's not good:

* durability problems

* hierarchy doesn't work out well (browsing by dates vs people vs event types can't all work in the same hierarchy)

* you want storage on and accessibility from more than one machine

* something like content-addressed storage probably makes sense to avoid unnecessary duplication

Then, for underlying tech like databases, it's not great either. Postgres
used to initialize the data directory without syncing it. That makes sense
for production, because who is going to put important data in it and crash
within a few minutes? But it caused problems with testing, so I fixed it.
Doing such a seemingly trivial thing (write out a bunch of files and sync the
whole directory) turned out to be very non-trivial. Syncing the files
individually was very slow (because I have no clue what order is most
convenient for the OS to sync the files in). I had to use the Linux-specific
sync_file_range() to get decent performance, and then it was still slow, so
we needed a way to optionally turn it off (e.g. for tests where it's not
important). Horrible. It would have been much better to just say: do all of
these things and sync them in the most convenient order, then tell me when
you are finished. Kind of like a log-based system.
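
The two-pass pattern that made it tolerable looks roughly like this (a
simplified sketch, not the actual Postgres code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Start writeback on every file first, so the kernel can order the
       IO however it likes; only then wait for durability per file. */
    void sync_many(int fds[], int n) {
        for (int i = 0; i < n; i++)  /* pass 1: initiate, don't wait */
            sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);
        for (int i = 0; i < n; i++)  /* pass 2: wait for each file */
            fsync(fds[i]);
    }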

~~~
cyphar
You can fake multiple hierarchies with hard links. The only problem is you
can't delete every hardlink easily.
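
For example (paths made up):

    #include <unistd.h>

    int main(void) {
        /* One inode, two hierarchies: the photo appears under both a
           by-date tree and a by-person tree without duplication. */
        link("/photos/by-date/2016-04/beach.jpg",
             "/photos/by-person/alice/beach.jpg");
        return 0;
    }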

~~~
mtanski
Your app has to define its own cascading delete strategy.

------
gue5t
I'm so glad this is being discussed. The intricate and weak durability
guarantees, the unaddressed dangers of global mutable state (without
reasonable locking capabilities outside of "only ever write something so small
the OS wouldn't ever decide to break it up"), and the inability to properly
deal with files and the filesystem asynchronously are gaping holes in POSIX
(and Linux) semantics.

Maybe we can actually get a powerful asynchronous IO API that allows the OS to
schedule disk requests effectively and applications to avoid blocking in
situations where they don't have to, while providing more guarantees about
data durability and concurrency. It's undeniably possible, but it will require
a level of coordination between parties (standardization and work from kernel
hackers in Linux and other OSes, userspace libraries to expose new kernel
functionality, applications switching over to use the libs) that I hadn't
expected to see come together. Maybe these discussions will provide some
impetus to get people working on all sides of the issue.

Anyone who hasn't read them should see the comments on previous discussion of
this topic, which demonstrate how hairy this area is right now:
[https://news.ycombinator.com/item?id=11511269](https://news.ycombinator.com/item?id=11511269)

------
marvy
I said this in the previous thread and I'll say it here too: Rethink the Sync
is a ten-year-old attempt to solve this problem.

[https://www.usenix.org/legacy/events/osdi06/tech/nightingale...](https://www.usenix.org/legacy/events/osdi06/tech/nightingale/nightingale.pdf)

Basically (read my comment in the other thread for a bit more info), you get
file durability and good performance without ever having to type the word
"sync", all with the standard POSIX interface.

~~~
jordonwii
That paper was the first thing I thought of when I read this, too. It really
is a fascinating solution to this problem. Any ideas why external synchrony
didn't catch on?

I suppose one could argue that the implementation would add a decent amount of
conceptual overhead to the kernel, for comparatively little reward
(application authors can fsync now, etc.).

~~~
marvy
I wonder if the Linux folks are even aware of this paper. Maybe they're not?

------
zzzcpan
I still don't see why this is such a big deal. Let's say everyone agrees on a
standard and guarantees that fsync() sends data to the disk. Is this
guarantee even worth anything? Will every disk suddenly disable its write
cache by default? Who is going to sacrifice performance and make that the
default? And even if somebody does, will disks suddenly become bug-free, 100%
reliable and indestructible?

~~~
fulafel
fsync works fine with on-disk write caches (if the OS so chooses, i.e. not OS
X). It's all specified in SATA and SCSI. Yes, you can have bugs and lying
devices, but reputable stuff is generally fine.

~~~
mtanski
Many consumer drives tend to omit this feature (or lie about it) because all
anybody looks at is size and iops. Nobody sells more consumer drives by
advertising that the writeback cache on the drive respects flush commands (and
barriers).

~~~
fulafel
Which current or recent SATA/SCSI drive omits or lies about cache flush
commands? Leaving FUA unimplemented is fine. Bottom-of-the-barrel flash
sticks probably do lie, and some IDE-era PC drives did, but major disk
vendors are pretty sensitive about their reputations wrt data integrity now.

~~~
mtanski
A few years ago some of the Intel SSD drives had issues with this:
[http://www.evanjones.ca/intel-ssd-durability.html](http://www.evanjones.ca/intel-ssd-durability.html)
The drive advertised FUA but didn't actually implement it (or implemented it
incorrectly).

Then there were drives that stored data in their cache and reported it as
written, but didn't have the capacitors needed to write the data out on power
failure:
[http://lkcl.net/reports/ssd_analysis.html](http://lkcl.net/reports/ssd_analysis.html)
See also the paper from FAST13.

I think in 2016 the situation looks much better.

------
ape4
No, this isn't the same article that was posted the other day. It's a
response to the comments.

------
ipozgaj
Even if the API gave you strong durability guarantees, it still wouldn't mean
much. Disk caches, big enterprise SAN-attached storage, etc. can also
"cheat", saying they flushed the cache when they actually didn't.

~~~
yason
The API would allow blaming the right cog in the machinery. Right now
everyone gets a Get Out Of Jail Free card, because they can blame the kernel,
filesystem drivers, userspace libraries, application developers, disk
controllers, and whatnot, and thus nothing forces a single direction in that
cyclic graph of blame.

------
Drdrdrq
While OP makes good points, I think there is no way to guarantee data
persistence on a single node, especially with SSDs.

~~~
ubernostrum
OP is not asking for something that will guarantee a complete, safe write on
all systems under all conditions in all logically-possible universes.

OP is asking for something that, if it loses data due to the way the
filesystem or operating system behaves, would be treated _as a bug in the
filesystem or OS_ , rather than a "lol, this idiot didn't even do (set of
syscalls unique to the particular FS/OS combination), not our fault it didn't
actually write after it said it would".

