
PostgreSQL's fsync surprise - craigkerstiens
https://lwn.net/Articles/752063/
======
josephg
I've said it before and I'll say it again: Modern filesystems straight out
expose the wrong API for 99% of applications. App developers almost never
think of data as a stream of bytes. We think of data as a set of records.
Values in the set change through atomic modification events.

The fundamental API primitives should be atomic changes. Atomic write (bytes),
and atomic append. The funny thing about it is that POSIX already supports
basically this API (datagrams) for both IPC and networking. It just doesn't
support this API in the one place it would be most useful - the filesystem.

Ideally I want:

- Write() to be blocking / atomic by default. Don't return until data is
safely committed.

- A transactional API: begin(fd); write(); write(); err = commit(fd). If any
error happens, commit returns the error and _none_ of the data is stored.

- An IOCP-style API for non-blocking applications. This is the API databases
want to use, with the loop being <get network request>, <write data to
filesystem>, <yield>, <get write completion event>, <send confirmation to
client>.

- Deprecate fsync & friends. If you don't want to wait for the data to get
committed, write in non-blocking mode and ignore the completion event.

Solving this problem in end-user applications is really hard - almost no
applications implement atomic write on top of filesystems correctly. And they
shouldn't have to - this should be the job of the filesystem. The filesystem
can do this much more safely, with better guarantees, better performance and
better error handling. Modern filesystems already have journals - buggy
reimplementations of journals in userland don't help anyone. "Do not turn off
console while game is saving" is an embarrassment to everyone.

~~~
colanderman
> - An IOCP-style API for non-blocking applications.

This exists; it's called Linux AIO (distinct from POSIX AIO). The problem is,
when you use it, you have to reimplement _caching_ (and buffering) in
userspace instead of journaling – which is just as hard to get right and can
just as easily – maybe even more easily – lead to corruption. (Postgres, as an
example, relies on the OS to buffer writes.)
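
For reference, the bare submission/completion loop with libaio looks roughly
like this – a minimal sketch, nothing to do with how Postgres would actually
structure it; it assumes an fd opened with O_DIRECT, a suitably aligned
buffer, and linking with -laio:

    #include <libaio.h>
    #include <stddef.h>

    /* Minimal Linux AIO sketch: submit one write, wait for its completion.
       Assumes fd was opened with O_DIRECT and buf/len/offset are suitably
       aligned.  Error handling is deliberately thin. */
    int aio_write_once(int fd, void *buf, size_t len, long long offset)
    {
        io_context_t ctx = 0;
        struct io_event ev;
        struct iocb cb, *cbs[1] = { &cb };

        if (io_setup(1, &ctx) < 0)                    /* create the AIO context */
            return -1;
        io_prep_pwrite(&cb, fd, buf, len, offset);    /* describe the write */
        if (io_submit(ctx, 1, cbs) != 1)              /* queue it */
            goto fail;
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)  /* block for completion */
            goto fail;
        io_destroy(ctx);
        return (long)ev.res == (long)len ? 0 : -1;    /* res: bytes written or -errno */
    fail:
        io_destroy(ctx);
        return -1;
    }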

> - Write() to be blocking / atomic by default. Don't return until data is
> safely committed.

This is the wrong thing for 99% of use cases. You get terrible performance
unless you batch things, which means that casual use ends up with severe
performance issues. (IIRC this was actually a problem on Android devices –
many apps were misusing SQLite by not using transactions, resulting in every
single database update being a separate atomic write to disk. Not only did
this kill performance, but it caused excessive flash wear.)
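
(For illustration only – not the apps in question – the difference with the
SQLite C API is roughly this: one transaction around the batch means one
durable commit at the end, instead of one auto-committed, separately synced
write per row.)

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        char sql[128];

        sqlite3_open("batch.db", &db);
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv(k INTEGER, v INTEGER)",
                     NULL, NULL, NULL);

        /* One transaction for the whole batch: a single journal commit (and
           sync) at COMMIT, instead of 1000 auto-committed writes that each
           sync the journal separately. */
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        for (int i = 0; i < 1000; i++) {
            snprintf(sql, sizeof sql, "INSERT INTO kv VALUES(%d, %d)", i, i * i);
            sqlite3_exec(db, sql, NULL, NULL, NULL);
        }
        sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
    }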

> Modern filesystems already have journals - buggy reimplementations of journals
> in userland don't help anyone.

Databases (among other systems) need features and control over the journal
that a filesystem cannot provide. (Think MVCC, replication, etc.)

Besides – someone has to write the filesystem, and that filesystem uses largely
the same mechanisms in the kernel that are exposed to userspace.

I agree fully that fsync ought to be deprecated in favor of something with much
more clearly-defined semantics. Both OS X and Linux have made attempts at this
(F_FULLFSYNC and sync_file_range, respectively), though clearly at least Linux
still has some work to do.

But – barring such unclear semantics – the general model of using fsync to
guarantee ordering is not a complex one to understand, and matches most use
cases well.
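
(The model being: if B must not be observable before A is durable, fsync
between the two writes. A write-ahead-log-flavored sketch, with the error
handling trimmed down to the essentials:)

    #include <string.h>
    #include <unistd.h>

    /* Ordering via fsync: the commit record must not reach the disk before the
       payload it refers to is durable.  Sketch only; real code checks for short
       writes and handles every error path. */
    int append_durably(int log_fd, const char *payload, const char *commit_rec)
    {
        if (write(log_fd, payload, strlen(payload)) < 0)
            return -1;
        if (fsync(log_fd) < 0)         /* barrier: payload durable before... */
            return -1;
        if (write(log_fd, commit_rec, strlen(commit_rec)) < 0)
            return -1;
        return fsync(log_fd);          /* ...the commit record is made durable */
    }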

~~~
josephg
> > - Write() to be blocking / atomic by default. Don't return until data is
> safely committed.

> This is the wrong thing for 99% of use cases. You get terrible performance
> unless you batch things, which means that casual use ends up with severe
> performance issues.

The right answer would be to batch writes in the OS, but of course this would
deadlock without non-blocking APIs. Maybe you're right - maybe write() should
have the effectively non-blocking API it has now with fsync as a commit. But I
still want my guarantee that if the system / hardware dies halfway through a
write (or batch of writes) then data won't be left in a corrupt half-written
state. Package installations and OS upgrades should be able to do the entirety
of their work in a filesystem-level transaction. I shouldn't need to worry
about my OS needing a reinstall because my laptop ran out of power halfway
through an update. I just can't think of any sensible way to do this using the
APIs we have today.

> Besides – someone has to write the filesystem, and that filesystem uses
> largely the same mechanisms in the kernel that are exposed to userspace.

But importantly not the same mechanisms. The OS has control over write
ordering, and most block devices support read & write barriers. Filesystems
lean heavily on these primitives for correctness - and for good reason;
they're really useful primitives for building correct, performant systems. But
for some reason they aren't exposed to userspace. Userspace applications are
instead left scratching in the dirt with fsync. It would be much easier to
build correct, performant databases in userspace if we had access to this
stuff.

~~~
colanderman
A filesystem-level transaction would be a good idea. Maybe some FSes (e.g.
ZFS) already support such a thing. You can get this (somewhat awkwardly) for a
single file with any FS that supports copy-on-write by making a (lazy) copy of
the file, doing your updates there, and then renaming it to the original file
name (which is guaranteed to be atomic on most FSes). With smaller files this
is standard practice even on FSes without copy-on-write.
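
(Roughly the standard pattern, sketched below with most error handling
omitted; the two details people most often forget are fsyncing the new file
before the rename and fsyncing the directory after it:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* "Write a fresh copy, then rename it over the original."  rename()
       atomically replaces the old directory entry; fsyncing the directory
       afterwards makes the rename itself durable. */
    int replace_file(const char *path, const char *dirpath,
                     const void *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        if (rename(tmp, path) < 0)              /* atomic swap of the entry */
            return -1;

        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        int rc = fsync(dfd);                    /* make the rename durable */
        close(dfd);
        return rc;
    }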

> most block devices support read & write barriers

No, they really don't. Some _hardware_ devices might. I'm not aware of any SAN
that does. (I've specifically asked this question of the developers of three.)

> Filesystems lean heavily on these primitives for correctness

In Linux they haven't since 2010 [1] – for exactly the reason that they're
poorly implemented by devices.

[1] [https://lwn.net/Articles/400541/](https://lwn.net/Articles/400541/)

------
mjw1007
I've been surprised to see an apparent consensus from the filesystem
developers that Postgres should be using direct IO.

I worry that if the Postgres people do make that change, they'll find
themselves hearing from a different set of kernel developers that they should
have known direct IO doesn't work properly and they should be using buffered
IO instead.

Previously I'd thought the latter was the general view from the kernel side.

For example this message from ten years ago, and other strongly-worded views
in that thread:
[https://lkml.org/lkml/2007/1/10/235](https://lkml.org/lkml/2007/1/10/235)

In particular, I'd taken this bit as a suggestion that if people found
problems with buffered IO then the right thing to do is to ask the kernel side
to improve things, rather than switch:

« As a result, our madvise and/or posix_fadvise interfaces may not be all that
strong, because people sadly don't use them that much. It's a sad example of a
totally broken interface (O_DIRECT) resulting in better interfaces not getting
used, and then not getting as much development effort put into them. »
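
(For what it's worth, "using them" looks something like the sketch below – a
rough illustration of the hints a database could give while staying on
buffered IO, not a claim about what any particular database actually does:)

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>

    /* Sketch: hinting the kernel's cache instead of bypassing it with O_DIRECT.
       SEQUENTIAL asks for more aggressive readahead ahead of a scan;
       DONTNEED asks the kernel to drop cached pages we won't revisit. */
    void advise_scan(int fd, off_t start, off_t len)
    {
        posix_fadvise(fd, start, len, POSIX_FADV_SEQUENTIAL);
    }

    void advise_done(int fd, off_t start, off_t len)
    {
        posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
    }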

~~~
anarazel
> I worry that if the Postgres people do make that change, they'll find
> themselves hearing from a different set of kernel developers that they
> should have known direct IO doesn't work properly and they should be using
> buffered IO instead.

That definitely will happen. But the fact remains that at the moment you'll
get considerably higher performance when expertly using O_DIRECT, and there's
nothing on the horizon to change that.

> For example this message from ten years ago, and other strongly-worded views
> in that thread:
> [https://lkml.org/lkml/2007/1/10/235](https://lkml.org/lkml/2007/1/10/235)

> In particular, I'd taken this bit as a suggestion that if people found
> problems with buffered IO then the right thing to do is to ask the kernel
> side to improve things, rather than switch:

I think partially that's just been overtaken by reality. A database is
guaranteed to need its own buffer pool, and a) you're going to have more
information about recency in there, and b) the OS cache adds a good chunk of
additional overhead. With buffered IO we (PostgreSQL) already had to add code
to manage e.g. the amount of dirty data the OS caches. The only reason DIO
isn't always going to be beneficial, after doing the necessary architectural
improvements, is that the OS buffer pool is more adaptive for mixed-use or
less well-tuned databases.
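
(The dirty-data management mentioned above is along these lines – a sketch of
the general technique using sync_file_range, mentioned upthread, rather than
the actual PostgreSQL code: start writeback for what you've written as you go,
so the eventual fsync() isn't a multi-second stall behind gigabytes of
accumulated dirty pages.)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define FLUSH_AFTER (256 * 1024)   /* start writeback every 256 KiB written */

    /* Sketch: bound the amount of dirty data the OS is sitting on by asking it
       to begin writing ranges back (without waiting) as we produce them. */
    void write_with_backpressure(int fd, const char *buf, size_t total)
    {
        size_t done = 0, since_flush = 0;
        while (done < total) {
            size_t chunk = total - done > 8192 ? 8192 : total - done;
            if (pwrite(fd, buf + done, chunk, (off_t)done) < 0)
                return;                          /* sketch: no real error handling */
            done += chunk;
            since_flush += chunk;
            if (since_flush >= FLUSH_AFTER) {
                sync_file_range(fd, (off_t)(done - since_flush), since_flush,
                                SYNC_FILE_RANGE_WRITE);  /* kick writeback, don't wait */
                since_flush = 0;
            }
        }
    }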

------
bvinc
Can we ignore for a second what the proper behavior should be, and instead
focus on the documentation?

In my opinion, even a careful reading of the fsync man page does not cover
what exactly happens if you close an fd, reopen the file in another process,
and then call fsync. Am I supposed to read kernel source code? Ideally, after
reading a man page, I should have no questions about exactly what guarantees
are provided by an API.
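
(Concretely, the sequence in question is something like the one below; at the
time of this article it was genuinely unclear from the man page whether the
second fsync() would ever report a write-back failure that happened in
between:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* The ambiguous sequence: a writer dirties pages and closes without syncing;
       a later opener calls fsync() on the same file.  If write-back failed in
       the meantime, does this fsync() return the error? */
    int main(void)
    {
        int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
        write(fd, "hello", 5);
        close(fd);                         /* no fsync; the pages are still dirty */

        /* ...kernel write-back may run (and fail) at any point in here... */

        fd = open("data.bin", O_WRONLY);   /* could just as well be another process */
        int rc = fsync(fd);                /* is an earlier failure reported here? */
        printf("fsync: %d\n", rc);
        close(fd);
        return 0;
    }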

------
dagenix
I'm always surprised by what a mess seemingly simple file operations are.
Maybe even more surprised that everything seems to generally work pretty well
even with those issues. Even "I want to save this file" requires numerous sync
operations on the file and the directory it's in.

I'm certainly not qualified to criticize anyone for the current situation,
and, as the article points out, even some of the more egregious-sounding
behavior (marking pages as clean after writing fails) has a pretty reasonable
explanation. But, IIRC, as storage capacities continue to rise, error rates
aren't falling nearly as fast. So I'm left kinda wondering if there is some
day in the future where the likelihood of encountering an error finally gets
high enough that things stop working pretty well.

~~~
gerdesj
Although it's not exactly associated with this as such, there is a growing
understanding that SMB/CIFS shares have a nasty habit of reporting "on
storage" before the data really is safe. That is a bit of a problem for many
backup systems, unless you do a verify afterwards and pick up the pieces.
Backups can involve massive files with odd write and read patterns, and
databases generally involve quite large files with odd read and write patterns
compared to, say, document storage.

Perhaps we need database and backup oriented filesystems and not funny looking
files on top of generic filesystems.

~~~
jandrewrogers
Ironically, most sophisticated database engines do implement complete file
systems, treating those "funny looking files" as little more than virtual
block devices. In fact, with very little extra code, you can trivially
retarget some database kernels to run directly on top of raw block devices,
eliminating the redundant file system. It partly depends on the storage
management requirements of the user e.g. if they expect to share block devices
across unrelated applications. In my experience, the raw block device code is
simpler and more reliable; there are many odd edge cases in Linux filesystem
behavior that you must account for if you require robust and reliable storage
behavior on top of one.

There are some additional performance and behavioral advantages to working
with the storage devices directly. Anecdotally, if you run databases on
virtual machines (never recommended but many people do), using raw block
devices instead of a file system often seems to eliminate much of the disk I/O
weirdness that occurs under VMs.

~~~
bbuchalter
Could you expand on what you mean by "weirdness" on VM disk I/O in the context
of database storage?

~~~
jandrewrogers
The storage has anomalously high latency and throughput variance with some
patterns that you don't see with non-virtualized storage and a modest
degradation in average performance. This is expected, but it makes it
difficult to schedule I/O efficiently. This is more noticeable if you are
doing direct I/O because having a VM intercept your storage access defeats the
purpose.

What was surprising is that the direct I/O behavior appears to be conditional
on whether you are accessing the storage through a file system. My database
kernel is block device agnostic, using files and raw devices interchangeably
via direct I/O. Against expectations, when we accessed the same virtualized
storage as raw block devices, the behavior was like bare metal even though we
are running the exact same operations over the same direct I/O interface in a
VM. Basically, the only difference was the file descriptor type.

I'm guessing that file systems are virtualization aware to some extent and
access through them is actively managed; raw device accesses are VM oblivious
and simply passed through by the storage virtualization layer.

------
loeg
> When a buffered I/O write fails due to a hardware-level error, filesystems
> will respond differently, but that behavior usually includes discarding the
> data in the affected pages and marking them as being clean.

That behavior seems problematic.

As always, there's a great Dan Luu blog post on the subject:
[https://danluu.com/filesystem-errors/](https://danluu.com/filesystem-errors/)

> Filesystem error handling seems to have improved. Reporting an error on a
> pwrite if the block device reports an error is perhaps the most basic error
> propagation a robust filesystem should do; few filesystems reported that
> error correctly in 2005. Today, most filesystems will correctly report an
> error _when the simplest possible error condition that doesn’t involve the
> entire drive being dead occurs if there are no complicating factors_.

Emphasis added.

------
zAy0LfpBZLC8mAC
Now, taking the case of a user pulling out a USB thumb drive as an excuse for
not keeping the dirty pages around seems ... disingenuous?

If the storage device has disappeared for good, you can just return EIO for
all further I/O operations, and mark all open file descriptions for which
there were dirty pages such that any further fsync() calls on the
corresponding fds return an error?

I mean, either you think you can still retry, in which case you should keep
the dirty pages around, or you think retrying is futile, in which case feel
free to drop them – but then make sure anyone who tries something that would
make this loss visible gets an error. That should only require keeping flags
on open file descriptions, and possibly on pages/inodes/block devices that
(semi-)persist the error at the desired resolution, which you can broaden if
the bookkeeping uses too much memory.

~~~
loeg
Yeah. The USB case is a cop out. For USB, keeping the pages dirty and the
fsyncs erroring (as seems consistent with Postgres' needs and common sense)
seems fine.

The memory can be reclaimed when 'umount --force', or something like that,
discards filesystem dirty state.

------
kazinator
> _Such a change, though, would take Linux behavior further away from what
> POSIX mandates and would raise some other questions, including: when and how
> would that flag ever be cleared? So this change seems unlikely to happen._
    
    
      fcntl(fd, F_CLR_FKNG_ERR);

------
pkaye
I saw this kind of problem when I was writing SSD firmware 10-15 years back.
The operating systems just don't do much with the errors the hardware reports.
There are some old research papers on "IRON file systems" that are pretty good
reading on how poor the error handling was, and maybe still is.

------
toothpasta
There's no way to recover from a failed write (if the drive is still operating
and could reallocate the sector, it would have already done that). So mark the
pages _damaged_ and deallocate their contents. Keep the metadata for the
damaged pages around until someone tries to sync or close the associated file.

~~~
caf
_There's no way to recover from a failed write (if the drive is still
operating and could reallocate the sector, it would have already done that)._

That's not exactly true. In the thin-provisioned block device case,
administrator action can make it resume accepting writes.

------
Annatar
If you take into consideration that there are alternatives like FreeBSD and
SmartOS which do not suffer from such serious and basic functionality
malfunctions, it is illogical to keep putting up with GNU/Linux on the basis
that it is the only thing one is comfortable with.

Comfort is of little consolation or use if the operating system is this
unreliable, especially since making sure that data is safely and correctly
stored is core, basic functionality.

------
dis-sys
This article is really good, with lots of details in the linked discussions.

Wondering about other critical libraries such as RocksDB/LevelDB – what
actually happens when there is a hardware error, not limited to an unplugged
USB cable?

------
bbuchalter
Does anyone know if a similar conversation around this issue is needed or
being had in the MySQL community?

~~~
takeda
Sorry for being snarky, but from my ops experience MySQL manages to lose data
even without hardware errors[1].

[1] My last experience was due to a bug where a certain pattern of data made
MySQL/MariaDB think the data page was encrypted, after which it proceeded to
discard that page and crash complaining that the data was corrupted, and from
that point on it refused to start until the data was restored.

~~~
wruza
Ah, sort of “this sequence will never appear in user data” assumption, I
guess?

------
cryptonector
Here's a simpler fix: when the underlying device produces an error then mark
the in-core inode (not on disk) as having an error and have all further writes
return EIO. Then fsync() too can notice the error state flag being set and
also return EIO.

~~~
caf
That's similar to Willy Tarreau's suggestion and suffers from the same issue -
it only works up until that inode gets evicted due to memory pressure.

~~~
cryptonector
PG could keep it open. I know it doesn't, but it could and should.

Also, perhaps the error flag should keep the inode in core. Note that the
pages would still get thrown out, so as far as memory pressure this is not the
end of the world.

------
forkandwait
Can anyone more knowledgeable than I comment on how this affects FreeBSD?

~~~
kev009
It doesn't affect FreeBSD.
[https://lwn.net/Articles/752388/](https://lwn.net/Articles/752388/)

------
CaptainZapp
"The job of calling fsync(), however, is handled in a single "checkpointer"
process, which is concerned with keeping on-disk storage in a consistent state
that can recover from failures"

And therein lies the rub. Sybase ASE calls fsync() upon every commit, which
is the reason that database devices are still mostly implemented with raw
devices. Before version 11.9.2 (as far as I recall) you ran the exact same
risk if you used filesystem files as devices. Now it's safe, but performance
can get pretty heinous on write-intensive systems.

~~~
anarazel
> Sybase ASE calls fsync() upon every commit

Those are journal commits, not the commits that the piece you quote is talking
about (actual data files).

------
deepsun
> Andres Freund, like a number of other PostgreSQL developers, has
> acknowledged that DIO is the best long-term solution. But he also noted that
> getting there is "a metric ton of work" that isn't going to happen anytime
> soon.

No, that is a "recipe for disaster", as they say. Not doing something that
everyone acknowledges is important because it's "a lot of work" is what makes
projects a mess. I've seen that many times on various projects.

~~~
2RTZZSro
What is DIO?

~~~
Annatar
Direct input/output; the software performs write(2) system calls directly and
does its own input/output buffering and scheduling, bypassing the operating
system and filesystem driver buffering. For this to be effective and function
correctly, the filesystem driver must support mounting in direct input/output
mode, or raw character devices must be presented to the software in question.
Note that presenting raw block devices bypasses the filesystem driver's
buffering, but does not bypass operating system (kernel) buffering, hence
character devices must be used to bypass both. Software which uses raw
devices usually
has sophisticated input/output scheduling and buffering optimally tuned for
its use case.
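
On Linux specifically, the usual per-file route is the O_DIRECT open flag
rather than a mount option; a minimal sketch, where the alignment
requirements (buffer, length and offset aligned to the device's logical block
size) are the usual stumbling block:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: O_DIRECT bypasses the page cache for this file descriptor.
       The buffer (and typically the length and offset) must be aligned,
       commonly to 512 bytes or 4 KiB depending on the device. */
    int main(void)
    {
        void *buf;
        int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 'x', 4096);

        pwrite(fd, buf, 4096, 0);   /* handed to the device, not the page cache */
        fdatasync(fd);              /* durability still needs a sync (or O_DSYNC) */

        free(buf);
        close(fd);
        return 0;
    }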

