
Linux Fsync Issue for Buffered IO and Its Preliminary Fix for PostgreSQL - avivallssa
https://www.percona.com/blog/2019/02/22/postgresql-fsync-failure-fixed-minor-versions-released-feb-14-2019/#FSYNC-ERRORS-ARE-NOW-DETECTED
======
masklinn
If you want an overview of the issue, here's a presentation from Tomas Vondra
at FOSDEM 2019: [https://youtu.be/1VWIGBQLtxo](https://youtu.be/1VWIGBQLtxo)

Or an early recap of the "fsyncgate" issue in textual form:
[https://lwn.net/Articles/752063/](https://lwn.net/Articles/752063/)

Related (also listed by Tomas Vondra): Linux's IO error reporting
[https://youtu.be/74c19hwY2oE](https://youtu.be/74c19hwY2oE)

A previous hn discussion on the subject:
[https://news.ycombinator.com/item?id=19119991](https://news.ycombinator.com/item?id=19119991)

Also note that this is a broad issue with fsync, it's possible that your own
software is affected:
[https://wiki.postgresql.org/wiki/Fsync_Errors](https://wiki.postgresql.org/wiki/Fsync_Errors)
links to mysql and mongodb fixes for the same assumptions, one of the posts
from the original fsyncgate thread mentions that dpkg made the same incorrect
assumption.

~~~
fake-name
> fsyncgate

Oh cripes, can we not?

------
mjw1007
The Linux project takes the view « We don't attempt to rigorously document our
API; instead we promise that if your program worked yesterday it will continue
to work in the future. »

I think this story shows a weakness in that approach: for rarely-exercised
error handling paths, it's too likely that your program didn't work yesterday
and you had no easy way to know that.

(This is a separate issue from the fact that until recently the kernel
implementation of fsync itself had significant bugs, measured against what its
maintainers thought ought to be guaranteed.)

~~~
pgaddict
Except that this issue (both the behavior and lack of exact docs) applies to
other kernels, not just Linux. See
[https://wiki.postgresql.org/wiki/Fsync_Errors](https://wiki.postgresql.org/wiki/Fsync_Errors)

So no, this is not just about Linux.

~~~
scottlamb
I agree with mjw1007 that lack of rigorous API documentation for error paths
is a huge weakness and with you that it's not a Linux-only problem.

There are a lot of related filesystem robustness questions I'd love to get
authoritative answers on. Neither the Single UNIX Specification nor OS-
specific kernel docs / manpages gives enough information to write a robust,
performant program, and certainly you can't find one place that gives
everything you'd want to know when writing a portable program. For example:

* Does fsync() make guarantees about just the inode, or also the dirent? (iirc on Linux only the inode; for a freshly-created file you also have to fsync() the directory.)

* What does fsync() success guarantee is written to permanent storage? From this whole thing, apparently on recent Linux (even ignoring the bugs) only the writes since a previous fsync() failure or the current open() call, whichever was later. Yuck. That's terrible behavior, and even worse for being undocumented.

* Does it even guarantee that if you don't say "Simon says"? On macOS, I gather you need to do this extra F_FULLFSYNC thing. Are the other platforms like that? I dunno. And there are certainly mentions of older hard drives where nothing can be trusted. Is there any database of hard drive behavior? Stress test program to tell if I have a broken model?

* If you do a write and power is lost before fsync, what guarantees do you have about the current state? I was trying to figure out recently if an N-byte aligned overwrite is guaranteed to reflect the "old" or "new" states (for various Ns: 1, 512, 4096, st_blksize). The best I could find is [https://stackoverflow.com/a/2068608/23584](https://stackoverflow.com/a/2068608/23584), which suggests yes for N=512 "these days". Do I trust that? On all platforms? For hard drives made how recently? Etc.

* If you create a file, write to it, and rename() it into place, and power is lost before fsync, what guarantees do you have about the current state? Is it guaranteed that the dirent points to either a previous inode (if any) or the new one? If it points to the new one, is the file guaranteed to have the right length and contents? The conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again. But three syncs is getting ridiculous. Is it safe to remove one or more?
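For concreteness, that last bullet's conservative sequence can be sketched in C. This is only a sketch under the assumptions stated in the thread (the helper names `fsync_dir` and `durable_replace` are mine, not from any library), and whether all three syncs are truly required depends on the filesystem and mount options:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fsync() a directory so that dirent changes inside it are durable. */
static int fsync_dir(const char *dir) {
    int fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0) return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Atomically replace dir/name with the given contents, using the
 * three-sync sequence: fsync() the file, fsync() the directory,
 * rename(), fsync() the directory again. */
int durable_replace(const char *dir, const char *name, const char *data) {
    char tmp[4096], dst[4096];
    snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name);
    snprintf(dst, sizeof dst, "%s/%s", dir, name);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
        fsync(fd) != 0) {                /* 1: flush the file contents */
        close(fd);
        return -1;
    }
    if (close(fd) != 0) return -1;       /* close() can report deferred errors */
    if (fsync_dir(dir) != 0) return -1;  /* 2: make the new dirent durable */
    if (rename(tmp, dst) != 0) return -1;
    return fsync_dir(dir);               /* 3: make the rename durable */
}
```

Per the discussion, the middle directory fsync may be removable (the rename presumably can't reach the directory before the creation does), but that's exactly the kind of question with no authoritative answer.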

~~~
the8472
> the conservative thing to do would be to create, write, fsync() the file,
> fsync() the directory, rename, fsync() the directory again.

Afaik the _conservative thing_ is the _necessary thing_ if you're on an ext4
mounted with noauto_da_alloc,data=writeback. I think you can skip the last
fsync if you're fine with losing the new version as long as you get the old
version in its place.

~~~
scottlamb
Thanks for mentioning those options. I found a little more about them in the
ext4 manpage.

Thinking about it a little more, I'd expect I could skip the first directory
fsync I'd mentioned. Surely the rename can't make it to the directory without
the creation getting there, too...

Anyway, I feel like I could come up with a list of questions 10x as long as
the one I just gave, but you'd never really get answers, even for a particular
drive, OS, fs combo, without expensive testing or source code digging.

~~~
the8472
Yes, it's an area of research,
[https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf)

But in the end you should code against what posix guarantees, not what
particular filesystems happen to do, because the next filesystem might use
some other leeway the spec provides.

------
lelf
> Starting from kernel 4.13, we can now reliably detect such errors during
> fsync.

No. Not even close.

See
[https://wiki.postgresql.org/wiki/Fsync_Errors](https://wiki.postgresql.org/wiki/Fsync_Errors)

~~~
zaarn
Yes, they do. If fsync returns an error, they crash.

The problem occurs when you retry fsync.
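The unsafe retry pattern being described looks like this in C (a sketch only: the function name is mine, and on a healthy device both calls simply return 0; actually triggering the failure needs fault injection, e.g. a dm-error device):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write one byte, then fsync twice, recording both results.
 * The danger under discussion: if the first fsync() fails with EIO,
 * pre-4.13 kernels could mark the pages clean anyway, so the retry
 * returns 0 without the data ever reaching disk. */
int fsync_twice(const char *path, int *first, int *second) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, "x", 1) != 1) { close(fd); return -1; }
    *first = fsync(fd);   /* suppose this returned -1 with errno EIO... */
    *second = fsync(fd);  /* ...a 0 here would NOT mean the data is     */
                          /* durable: the error was already consumed.   */
    close(fd);
    return 0;
}
```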

~~~
gnulinux
My understanding is that that's not true: if some other process (postgres or
not) fsyncs the same file, the kernel acts as if the fsync has been retried
(because fsync was called twice), so it will not report the error correctly.

~~~
cryptonector
That's the bug in PostgreSQL. The fix is not to do the fsync() in a
different process using an FD opened separately for the same file.

~~~
macdice
I think it's debatable whether that's a bug in PostgreSQL, an underspecified
interface, or a bug somewhere else. A more interesting question is how we get
it fixed.

The new Linux errseq_t design makes sure that every fd that was open before
the error will see the error, and that at least one fd will see the error even
if it happened when no one had it open (but only for as long as the inode
doesn't fall out of the cache). Before errseq_t came along, Linux was undeniably
buggy here, since the AS_EIO flag could apparently be cleared in various ways
and userspace could never be told about it.

The things achieved so far since the PostgreSQL community first crashed into
this problem (thanks to the efforts of Craig Ringer):

* PostgreSQL now panics on failure, in cases where it previously retried (unless you set data_sync_retry = on, which should be safe on e.g. FreeBSD, though I don't think there is much point in it; the setting was included as a matter of principle when rolling out such a drastic change)

* Linux now reports errors to _at least one fd_ in versions new enough to have errseq_t (in addition to reporting it in every fd that was open at the time); that came out of discussions between Linux and PostgreSQL people about all this

* There have also been changes to OpenBSD, though I'm not sure what exactly

* PostgreSQL hackers are working on a plan to make sure that file descriptors are held open until the data is synced, so that there is no reliance on the error state surviving in the inode cache during times when it's not open (this is complicated by the use of processes instead of threads)

* Longer term, this whole thing boosts interest in developing DIO support for PostgreSQL (previously thought to be a performance feature)

~~~
cryptonector
This is definitely debatable. However, because *BSD had the same semantics as
Linux (IIUC), we can infer that the fsync()-syncs-only-writes-through-this-
open-file semantics is actually what's reasonable to implement.

~~~
macdice
FreeBSD has had different semantics here for ~20 years:
[https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357c...](https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266)

I wish I could find the discussion around that commit.

It'll only start throwing away dirty buffers if the device actually goes away:
[https://github.com/freebsd/freebsd/commit/46b5afe7b1ae0ee655...](https://github.com/freebsd/freebsd/commit/46b5afe7b1ae0ee655ec863ebe373a3cf16eef1d)

------
mehrdadn
Does anyone know if FlushFileBuffers() on Windows also forgets to flush data
that previously failed to flush? i.e., if Windows has the same issue or not?

~~~
macdice
I am also curious. I tried asking Microsoft via Twitter and they asked me to
copy the information from that Wiki page into Microsoft One Drive so they
could read it, and then suggested I try asking on Stack Overflow instead!
[https://twitter.com/windowsdev/status/989857799994822663](https://twitter.com/windowsdev/status/989857799994822663)

Amusing social media exchanges aside, I'm quite curious to know as well. We
don't have the answers for any non-open OS, though of course we can speculate:
descendants of BSD other than FreeBSD probably throw away buffers on error,
FreeBSD probably keeps buffers cached and dirty, and SysV systems probably
throw away buffers on error.

EDIT: They also suggested asking on MSDN Forums, which I didn't do because I
don't have an account and am not a Windows developer at all, just a humble
database hacker trying to understand how our stuff works on every platform.
The code we committed assumes the worst by default, so not knowing the answer
here isn't damaging. I dunno if it's possible to reach actual kernel hackers
via MSDN Forums, but maybe someone should follow up with that.

I think someone with the right skillset could possibly design an experiment to
figure it out for Windows (several people have shown how to set up experiments
on Linux and FreeBSD to test this).

~~~
mehrdadn
Those tweets from Microsoft tick me off so much I might actually try to find a
way to test this. I just don't have the time. But here's an idea that _might_
work... anyone want to try it? Create a VM in VirtualBox and a .VMDK file that
leaves some parts of the disk unmapped or read-only. Then try to write to it
from inside the VM. It'll naturally fail (and the VM should give you an error)
but you can continue and try flushing again, and see if it fails. If it does,
then it re-tried the write. If not, then it didn't. To check, also test this
on a Linux guest with fsync() to make sure it doesn't fail the second time.
(Caveat: If the behavior differs depending on the particular error from the
block device then you won't know. But it might be worth a try.)

~~~
macdice
I don't know anything about Windows, but that seems like the right sort of
approach; also is there such a thing as a network block layer you could
temporarily break by disconnecting it? On other OSes there are fault-injecting
drivers you can use to simulate IO errors. Maybe something like that exists?

In fairness to the team handling their Twitter account, I recognise that it is
completely the wrong forum to ask complicated kernel questions, it's
impossible to get through the front-line silly question filter. (Try reporting
a kernel bug to Apple; it seems to be impossible, they're all set up to
receive bug reports about consumer UX stuff etc, there isn't even a drop-down
option for "kernel", and reports complete with reproducers filed under "other"
just linger unanswered. These mega-corps aren't like open source projects.)

~~~
rwmj
> is there such a thing as a network block layer you could temporarily break
> by disconnecting it

Yes, nbdkit (assuming a Linux or BSD host for your VM) can do this kind of
thing. I gave a talk about this topic at FOSDEM earlier this month:
[https://rwmj.wordpress.com/2019/02/04/video-take-your-loop-mounts-to-the-next-level-with-nbdkit/](https://rwmj.wordpress.com/2019/02/04/video-take-your-loop-mounts-to-the-next-level-with-nbdkit/)
The bit about testing is towards the end, but you may find the whole talk
relevant.


------
dooglius
> To understand it better, consider an example of Linux trying to write dirty
> pages from page cache to a USB stick that was removed during an fsync.
> Neither the ext4 file system nor the btrfs nor an xfs tries to retry the
> failed writes. A silently failing fsync may result in data loss, block
> corruption, table or index out of sync, foreign key or other data integrity
> issues… and deleted records may reappear.

As opposed to what? If the drive isn't there anymore, there's not a whole lot
that can be done.

> With the new minor version for all supported PostgreSQL versions, a PANIC is
> triggered upon such error. This performs a database crash and initiates
> recovery from the last CHECKPOINT.

How is a recovery possible if the hard drive is borked? I don't understand the
model that leads to this "fix" making any difference.

~~~
LIV2
A better example might be a SAN briefly becoming unavailable due to a
transient issue with your iSCSI network?

~~~
dooglius
Yeah, this more or less clears it up, I was assuming an error implied disk
failure.

~~~
pgaddict
Right, this is about "ephemeral" failures which are becoming more common
thanks to accessing storage over network, virtualization, thin provisioning
etc.

~~~
mprovost
NFS has been around forever and has always had a bad reputation due to
problems like this. It mostly handles transient failures by waiting
(indefinitely) for the server to return, but it's unclear what a better option
is.

~~~
masklinn
And that likely "hid" this issue for quite a long time, according to Tomas
Vondra ([https://youtu.be/1VWIGBQLtxo](https://youtu.be/1VWIGBQLtxo)): data
loss would just be blamed on NFS being NFS and kinda crappy, and not
necessarily properly investigated in full (why waste time on NFS shitting the
bed yeah?), but it's likely the incorrect checkpointing / fsync assumptions
were the culprit in at least some of the issues.

Though another factor is that people now run a lot more DBs, in a lot more
environments, with a lot less reliability, and concurrently the database
improved, so things which were rare and lost in the noise when run on "big
iron" with expensive drive controllers become visible signal.

~~~
macdice
FWIW here is a standalone test that shows Linux NFS exhibiting behaviour that
would corrupt a PostgreSQL database:

[https://www.postgresql.org/message-id/CAEepm=1FGo=ACPKRmAxvb53mBwyVC=TDwTE0DMzkWjdbAYw7sw@mail.gmail.com](https://www.postgresql.org/message-id/CAEepm=1FGo=ACPKRmAxvb53mBwyVC=TDwTE0DMzkWjdbAYw7sw@mail.gmail.com)

You can also tweak that test so that ENOSPC is discovered at close() time. Now
you have a system that has thrown away data that PostgreSQL has already
evicted from its own buffers, and there is no way to get it back (other than
replaying the WAL, which is what PANIC achieves, as unpleasant a solution as
it is, especially if it just happens again, and again, ...).
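The point about ENOSPC surfacing at close() time suggests that a robust writer has to check close()'s return value as well as fsync()'s. A minimal sketch (the helper name is mine, not from PostgreSQL):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 only if all data was written AND fsync() and close() both
 * reported success; otherwise -1 (or close()'s result) with errno set.
 * On NFS, writes can be buffered client-side, so ENOSPC may surface
 * only at fsync() or even close() time; a writer that ignores close()'s
 * return value can lose data without ever seeing an error. */
int write_all_checked(const char *path, const char *data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    while (len > 0) {
        ssize_t n = write(fd, data, len);
        if (n < 0) { int e = errno; close(fd); errno = e; return -1; }
        data += n;
        len -= (size_t)n;
    }
    if (fsync(fd) != 0) { int e = errno; close(fd); errno = e; return -1; }
    return close(fd);  /* deferred errors (e.g. NFS ENOSPC) can land here */
}
```

Of course, as the thread notes, even this isn't enough once the data has been evicted from the application's own buffers: detecting the error is one thing, recovering the lost writes is another.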

The recent change in 11.2 adds a PANIC on error there. But I'm not sure it's
sufficient in Linux NFS, because even on the tip of the master branch of Linux
(by my inexpert drive-by reading, at least), the errseq_t stuff doesn't seem
to have made it into the NFS client code, so it's still using the old single
AS_EIO flag. That probably exposes at least one race that is discussed in this
thread:

[https://www.postgresql.org/message-id/flat/CA%2BhUKGKa-HtBHJaBUJuZHsKwvVkxW2nE0W8BqRVOhNr5yNgiDA%40mail.gmail.com](https://www.postgresql.org/message-id/flat/CA%2BhUKGKa-HtBHJaBUJuZHsKwvVkxW2nE0W8BqRVOhNr5yNgiDA%40mail.gmail.com)

I think we need to do something to make space allocation eager for NFS clients
(a couple of concrete approaches are discussed) so that ENOSPC is excluded as
a possibility after we have evicted data from PostgreSQL's buffer, and then I
think we need Linux NFS to adopt errseq_t behaviour, and PostgreSQL to adopt
the "fd passing" design discussed on the pgsql-hackers mailing list (to make
sure the checkpointer's file descriptor is old enough to see all relevant IO
errors). Or we need direct IO.

TL;DR We are not out of the woods on NFS.

~~~
wbl
Why would you run a database on NFS?

~~~
macdice
Well, I wouldn't. But people do. It makes more sense to use a SAN IMHO. I'm
told it's not uncommon to use NFS for Oracle. One interesting thing is that
they have their own NFS client implementation instead of trusting the kernel
(they also do direct IO by default, though I'm not actually sure whether their
NFS or DIO support came first).

