Or an early recap of the "fsyncgate" issue in textual form: https://lwn.net/Articles/752063/
Related (also listed by Tomas Vondra): Linux's IO error reporting https://youtu.be/74c19hwY2oE
A previous hn discussion on the subject: https://news.ycombinator.com/item?id=19119991
Also note that this is a broad issue with fsync; it's possible that your own software is affected. https://wiki.postgresql.org/wiki/Fsync_Errors links to MySQL and MongoDB fixes for the same assumptions, and one of the posts from the original fsyncgate thread mentions that dpkg made the same incorrect assumption.
Oh cripes, can we not?
I think this story shows a weakness in that approach: for rarely-exercised error handling paths, it's too likely that your program didn't work yesterday and you had no easy way to know that.
(This is a separate issue from the fact that until recently the kernel implementation of fsync itself had significant bugs, measured against what its maintainers thought ought to be guaranteed.)
So no, this is not just about Linux.
There are a lot of related filesystem robustness questions I'd love to get authoritative answers on. Neither the Single UNIX Specification nor OS-specific kernel docs / manpages gives enough information to write a robust, performant program, and certainly you can't find one place that gives everything you'd want to know when writing a portable program. For example:
* Does fsync() make guarantees about just the inode, or also the dirent? (iirc on Linux only the inode; for a freshly-created file you also have to fsync() the directory.)
* What does fsync() success guarantee is written to permanent storage? From this whole thing, apparently on Linux recently (even ignoring the bugs) only things since a previous fsync() failure or the current open() call, whichever was later. Yuck. That's terrible behavior, and even worse for being undocumented.
* Does it even guarantee that if you don't say "Simon says"? On macOS, I gather you need to do this extra F_FULLFSYNC thing. Are the other platforms like that? I dunno. And there are certainly mentions of older hard drives where nothing can be trusted. Is there any database of hard drive behavior? Stress test program to tell if I have a broken model?
* If you do a write and power is lost before fsync, what guarantees do you have about the current state? I was trying to figure out recently if an N-byte aligned overwrite is guaranteed to reflect the "old" or "new" state (for various Ns: 1, 512, 4096, st_blksize). The best I could find is here: <https://stackoverflow.com/a/2068608/23584> which suggests yes for N=512 "these days". Do I trust that? On all platforms? For hard drives made how recently? Etc.
* If you create a file, write to it, and rename() it into place, and power is lost before fsync, what guarantees do you have about the current state? Is it guaranteed that the dirent points to either a previous inode (if any) or the new one? If it points to the new one, is the file guaranteed to have the right length and contents? The conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again. But three syncs is getting ridiculous. Is it safe to remove one or more?
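The conservative sequence from that last bullet can be sketched like this (a sketch only; the `.tmp` suffix and function name are my own, and whether all three fsyncs are truly necessary is exactly the open question):

```python
import os

def durable_replace(path, data):
    """Conservative create / write / fsync / fsync / rename / fsync sequence.

    A sketch, not an authoritative recipe; using `path + ".tmp"` as the
    temporary name is an assumption for illustration.
    """
    dirname = os.path.dirname(path) or "."
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # 1. the file's contents are durable
    finally:
        os.close(fd)
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)             # 2. the tmp file's dirent is durable
        os.rename(tmp, path)
        os.fsync(dfd)             # 3. the rename itself is durable
    finally:
        os.close(dfd)
```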
Afaik the conservative thing is the necessary thing if you're on an ext4 mounted with noauto_da_alloc,data=writeback. I think you can skip the last fsync if you're fine with losing the new version as long as you get the old version in its place.
Thinking about it a little more, I'd expect I could skip the first directory fsync I'd mentioned. Surely the rename can't make it to the directory without the creation getting there, too...
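As for the "Simon says" point about macOS above, the usual portable pattern I've seen looks roughly like this (a sketch; it assumes fcntl exposes F_FULLFSYNC only where the platform supports it, and says nothing about the other platforms whose behavior is unknown):

```python
import fcntl
import os

def full_fsync(fd):
    """Best-effort 'really reach stable storage' sync (sketch, not authoritative)."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        # macOS: fsync() only pushes data to the drive; F_FULLFSYNC also asks
        # the drive to flush its own write cache.
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        # Elsewhere, plain fsync() is the best the portable API offers.
        os.fsync(fd)

fd = os.open("example.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"hello")
full_fsync(fd)
os.close(fd)
```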
Anyway, I feel like I could come up with a list of questions 10x as long as the one I just gave, but you'd never really get answers, even for a particular drive/OS/filesystem combo, without expensive testing or source-code digging.
But in the end you should code against what posix guarantees, not what particular filesystems happen to do, because the next filesystem might use some other leeway the spec provides.
No. Not even close.
The problem occurs when you retry fsync.
The new Linux errseq_t design makes sure that every fd that was open before the error will see the error, and that at least one fd will see the error even if it happened when no one had it open (but only for as long as the inode doesn't fall out the cache). Before errseq_t came along, Linux was undeniably buggy here, since the AS_EIO flag could apparently be cleared in various ways and userspace could never be told about it.
The things achieved so far since the PostgreSQL community first crashed into this problem (thanks to the efforts of Craig Ringer):
* PostgreSQL now panics on failure, in cases where it previously retried (unless you set data_sync_retry = on, which should be safe on e.g. FreeBSD; I don't think there is much point in it, but the setting was included as a matter of principle when rolling out such a drastic change)
* Linux now reports errors to at least one fd in versions new enough to have errseq_t (in addition to reporting it in every fd that was open at the time); that came out of discussions between Linux and PostgreSQL people about all this
* There have also been changes to OpenBSD, though I'm not sure what exactly
* PostgreSQL hackers are working on a plan to make sure that file descriptors are held open until the data is synced, so that there is no reliance on the error state surviving in the inode cache during times when it's not open (this is complicated by the use of processes instead of threads)
* Longer term, this whole thing boosts interest in developing DIO support for PostgreSQL (previously thought to be a performance feature)
I wish I could find the discussion around that commit.
It'll only start throwing away dirty buffers if the device actually goes away: https://github.com/freebsd/freebsd/commit/46b5afe7b1ae0ee655...
> Linux 4.13 and 4.15: fsync() only reports writeback errors that occurred after you called open()
Amusing social media exchanges aside, I'm quite curious to know as well. We don't have the answers for any non-open OS (though of course we can speculate about descendants of BSD except FreeBSD (probably throw away buffers on error), FreeBSD (probably keep buffers cached and dirty), and SysV systems (probably throw away buffers on error)).
EDIT: They also suggested asking on MSDN Forums, which I didn't do because I don't have an account and am not a Windows developer at all, just a humble database hacker trying to understand how our stuff works on every platform. The code we committed assumes the worst by default, so not knowing the answer here isn't damaging. I dunno if it's possible to reach actual kernel hackers via MSDN Forums, but maybe someone should follow up with that.
I think someone with the right skillset could possibly design an experiment to figure it out for Windows (several people have shown how to set up experiments on Linux and FreeBSD to test this).
In fairness to the team handling their Twitter account, I recognise that it is completely the wrong forum to ask complicated kernel questions; it's impossible to get through the front-line silly-question filter. (Try reporting a kernel bug to Apple; it seems to be impossible. They're all set up to receive bug reports about consumer UX stuff, etc.; there isn't even a drop-down option for "kernel", and reports complete with reproducers filed under "other" just linger unanswered. These mega-corps aren't like open source projects.)
Yes, nbdkit (assuming a Linux or BSD host for your VM) can do this kind of thing. I gave a talk about this topic at FOSDEM earlier this month: https://rwmj.wordpress.com/2019/02/04/video-take-your-loop-m... The bit about testing is towards the end, but you may find the whole talk relevant.
And while there was no hard answer and the documentation is basically on the level of POSIX (aka it says jack about the system's state on failure), the consensus seems to be "assume it's fucked":
> The answer will probably depend on the file system driver, as FlushFileBuffers is just a user-mode wrapper around NtFlushBuffersFile, and it looks to me like that function just assembles a flush IRP (IRP_MJ_FLUSH_BUFFERS) and sends it via IoCallDriver. Of course, one should eye calls to FlushFileBuffers with a great deal of suspicion.
Really enjoyable reading experience...
As opposed to what? If the drive isn't there anymore, there's not a whole lot that can be done.
> With the new minor version for all supported PostgreSQL versions, a PANIC is triggered upon such error. This performs a database crash and initiates recovery from the last CHECKPOINT.
How is a recovery possible if the hard drive is borked? I don't understand the model that leads to this "fix" making any difference.
What was possible before is:
1) transaction committed OK (which involves fsync of WAL, but that's it)
2) during checkpoint (essentially asynchronous writes to data files in the background), something went wrong and OS just discarded the dirty pages on fsync()
3) PostgreSQL assumed it could retry the fsync and the pages would be written again
So in the end, the contents of data files mismatched what's in WAL.
What this change does is "crash" the database after step (2), forcing it into a recovery that re-applies the writes done in (1) from the WAL. Of course, if the I/O error is permanent, this won't change a thing. But PostgreSQL won't lie to you by returning stale data, etc.
Note: This assumes all the layers (notably kernel + filesystem) do the right thing, i.e. report errors reliably. That may or may not be the case, depending on the kernel version etc.
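The change in behaviour at step (3) can be sketched like this (hypothetical helper, not PostgreSQL's actual code):

```python
import os

def checkpoint_sync(fd):
    """Sketch of the new behaviour: never retry a failed fsync().

    On Linux the dirty pages may already have been discarded by the time
    fsync() reports failure, so a later successful fsync() proves nothing
    about the earlier writes.
    """
    try:
        os.fsync(fd)
    except OSError as e:
        # Old (broken) behaviour: retry fsync() later and trust its success.
        # New behaviour: crash, then recover by replaying the WAL.
        raise SystemExit(f"PANIC: could not fsync data file: {e}")
```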
Though another factor is that people now run a lot more DBs, in a lot more environments, with a lot less reliability, and concurrently the database has improved, so things that were rare and lost in the noise when run on "big iron" with expensive drive controllers have become visible signal.
You can also tweak that test so that ENOSPC is discovered at close() time. Now you have a system that has thrown away data that PostgreSQL has already evicted from its own buffers, and there is no way to get it back (other than replaying the WAL, which is what PANIC achieves, as unpleasant a solution as it is, especially if it just happens again, and again, ...).
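A sketch of what checking for that looks like from userspace (function name is mine; on local filesystems close() rarely fails, but with NFS-style deferred allocation ENOSPC can first appear there):

```python
import errno
import os

def write_checking_close(path, data):
    """Sketch: treat write(), fsync() and close() failures all as fatal."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # may only reach the page cache
        os.fsync(fd)         # forces allocation + writeback on most filesystems
    finally:
        try:
            os.close(fd)     # on NFS, ENOSPC can be reported here
        except OSError as e:
            if e.errno == errno.ENOSPC:
                # The kernel may already have dropped the dirty pages; the
                # only safe recovery is replaying the WAL (what PANIC does).
                raise SystemExit("PANIC: close() reported ENOSPC") from e
            raise
```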
The recent change in 11.2 adds a PANIC on error there. But I'm not sure it's sufficient in Linux NFS, because even on the tip of the master branch of Linux (by my inexpert drive-by reading, at least), the errseq_t stuff doesn't seem to have made it into the NFS client code, so it's still using the old single AS_EIO flag. That probably exposes at least one race that is discussed in this thread:
I think we need to do something to make space allocation eager for NFS clients (a couple of concrete approaches are discussed) so that ENOSPC is excluded as a possibility after we have evicted data from PostgreSQL's buffer, and then I think we need Linux NFS to adopt errseq_t behaviour, and PostgreSQL to adopt the "fd passing" design discussed on the pgsql-hackers mailing list (to make sure the checkpointer's file descriptor is old enough to see all relevant IO errors). Or we need direct IO.
TL;DR We are not out of the woods on NFS.
If the hard drive is broken, it is broken, and that is not really PostgreSQL's fault. This is more about losing access to the hard drive, for example via a broken SATA cable, etc.
The issue here is properly handling fsync(). Until recently, PostgreSQL assumed that if an fsync() call failed, and the call was retried and succeeded, then everything was A-OK. So after that, PostgreSQL removed the "successful" transaction from the WAL and continued to process the next transaction.
It turned out that on some systems, such as Linux, a subsequent fsync() succeeding after a failed one doesn't mean that all data from both calls was written successfully. So this change makes PostgreSQL panic on the first fsync() failure and refuse to continue. This means that you will still have the uncommitted transactions in the WAL, and you can use them to recover the data when the disk starts working again.
Most systems, really; keeping the buffers dirty is the exception (FreeBSD, possibly DragonflyBSD, Illumos, possibly Solaris).
OpenBSD recently changed some of their behaviour, but apparently only so that fsync doesn't lose the error: https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 NetBSD and macOS remain on the old BSD behaviour (discard the buffers, and a second fsync will come back clean).