Linux Fsync Issue for Buffered IO and Its Preliminary Fix for PostgreSQL (percona.com)
137 points by avivallssa on Feb 24, 2019 | 37 comments

If you want an overview of the issue, here's a presentation from Tomas Vondra at FOSDEM 2019: https://youtu.be/1VWIGBQLtxo

Or an early recap of the "fsyncgate" issue in textual form: https://lwn.net/Articles/752063/

Related (also listed by Tomas Vondra): Linux's IO error reporting https://youtu.be/74c19hwY2oE

A previous HN discussion on the subject: https://news.ycombinator.com/item?id=19119991

Also note that this is a broad issue with fsync; it's possible that your own software is affected. https://wiki.postgresql.org/wiki/Fsync_Errors links to MySQL and MongoDB fixes for the same assumptions, and one of the posts from the original fsyncgate thread mentions that dpkg made the same incorrect assumption.

> fsyncgate

Oh cripes, can we not?

The Linux project takes the view « We don't attempt to rigorously document our API; instead we promise that if your program worked yesterday it will continue to work in the future. »

I think this story shows a weakness in that approach: for rarely-exercised error handling paths, it's too likely that your program didn't work yesterday and you had no easy way to know that.

(This is a separate issue from the fact that until recently the kernel implementation of fsync itself had significant bugs, measured against what its maintainers thought ought to be guaranteed.)

Except that this issue (both the behavior and lack of exact docs) applies to other kernels, not just Linux. See https://wiki.postgresql.org/wiki/Fsync_Errors

So no, this is not just about Linux.

I agree with mjw1007 that lack of rigorous API documentation for error paths is a huge weakness and with you that it's not a Linux-only problem.

There are a lot of related filesystem robustness questions I'd love to get authoritative answers on. Neither the Single UNIX Specification nor OS-specific kernel docs / manpages gives enough information to write a robust, performant program, and certainly you can't find one place that gives everything you'd want to know when writing a portable program. For example:

* Does fsync() make guarantees about just the inode, or also the dirent? (iirc on Linux only the inode; for a freshly-created file you also have to fsync() the directory.)

* What does fsync() success guarantee is written to permanent storage? From this whole thing, apparently on recent Linux (even ignoring the bugs) it's only the writes since a previous fsync() failure or the current open() call, whichever was later. Yuck. That's terrible behavior, and even worse for being undocumented.

* Does it even guarantee that if you don't say "Simon says"? On macOS, I gather you need to do this extra F_FULLFSYNC thing. Are the other platforms like that? I dunno. And there are certainly mentions of older hard drives where nothing can be trusted. Is there any database of hard drive behavior? Stress test program to tell if I have a broken model?

* If you do a write and power is lost before fsync, what guarantees do you have about the current state? I was trying to figure out recently if an N-byte aligned overwrite is guaranteed to reflect the "old" or "new" states (for various Ns: 1, 512, 4096, st_blksize). The best I could find is here: <https://stackoverflow.com/a/2068608/23584> which suggests yes for N=512 "these days". Do I trust that? on all platforms? for hard drives made how recently? etc.

* If you create a file, write to it, and rename() it into place and power is lost before fsync, what guarantees do you have about the current state? Is it guaranteed that the dirent points to either a previous inode (if any) or the new one? If it points to the new one, is the file guaranteed to have the right length or contents? The conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again. But three syncs is getting ridiculous. Is it safe to remove one or more?

> the conservative thing to do would be to create, write, fsync() the file, fsync() the directory, rename, fsync() the directory again.

Afaik the conservative thing is the necessary thing if you're on an ext4 mounted with noauto_da_alloc,data=writeback. I think you can skip the last fsync if you're fine with losing the new version as long as you get the old version in its place.
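For what it's worth, that conservative sequence can be sketched in Python's `os` module, which maps directly onto the syscalls. A sketch only: `atomic_replace` and the `.tmp` naming are my own, and error handling is minimal.

```python
import os

def atomic_replace(path, data):
    """Durably replace `path` with `data` using the conservative
    create / write / fsync(file) / fsync(dir) / rename / fsync(dir)
    sequence discussed above."""
    dirpath = os.path.dirname(os.path.abspath(path))
    tmp = os.path.join(dirpath, os.path.basename(path) + ".tmp")

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # file contents + inode reach stable storage
    finally:
        os.close(fd)

    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)             # make the temp file's dirent durable
        os.replace(tmp, path)     # atomic rename on POSIX
        os.fsync(dfd)             # make the rename itself durable
    finally:
        os.close(dfd)
```

Whether the first directory fsync can be dropped is exactly the open question above.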

Thanks for mentioning those options. I found a little more about them in the ext4 manpage.

Thinking about it a little more, I'd expect I could skip the first directory fsync I'd mentioned. Surely the rename can't make it to the directory without the creation getting there, too...

Anyway, I feel like I could come up with a list of questions 10x as long as the one I just gave, but you'd never really get answers, even for a particular drive/OS/filesystem combo, without expensive testing or source code digging.

Yes, it's an area of research, https://www.usenix.org/system/files/conference/osdi14/osdi14...

But in the end you should code against what posix guarantees, not what particular filesystems happen to do, because the next filesystem might use some other leeway the spec provides.

> Starting from kernel 4.13, we can now reliably detect such errors during fsync.

No. Not even close.

See https://wiki.postgresql.org/wiki/Fsync_Errors

Yes, they do. If fsync returns an error, they crash.

The problem occurs when you retry fsync.

My understanding is that that's not true: if some other process (postgres or not) fsyncs the same file, the kernel will act as if the fsync has been retried (because fsync was called twice), so it will not report the error correctly.

That's the bug in PostgreSQL. The fix is to not do the fsync() in a different process that used an FD opened separately for the same file.

I think it's debatable whether that's a bug in PostgreSQL, an underspecified interface, or a bug somewhere else. A more interesting question is how we get it fixed.

The new Linux errseq_t design makes sure that every fd that was open before the error will see the error, and that at least one fd will see the error even if it happened when no one had it open (but only for as long as the inode doesn't fall out of the cache). Before errseq_t came along, Linux was undeniably buggy here, since the AS_EIO flag could apparently be cleared in various ways and userspace could never be told about it.
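That design can be illustrated with a toy model. This is my own simplification in Python, not the kernel code: the real errseq_t packs a counter and a SEEN bit into a single integer, but the reporting rules come out the same.

```python
# Toy model of errseq_t semantics: the mapping keeps an error sequence
# counter; each fd samples it, and a later check reports an error iff
# the counter advanced past the fd's sample. The "seen" bit lets one fd
# opened *after* the error still observe it exactly once.
class ErrSeq:
    def __init__(self):
        self.counter = 0
        self.seen = True          # nothing pending to report yet

    def record_error(self):       # a writeback error hits the mapping
        self.counter += 1
        self.seen = False

    def sample(self):             # called when an fd is opened
        # an fd opened while an unseen error is pending samples an older
        # value, so that it still reports the error once
        return self.counter if self.seen else self.counter - 1

    def check(self, sample):      # called by fsync() on that fd
        if sample < self.counter:
            self.seen = True
            return "EIO"
        return None

es = ErrSeq()
fd1 = es.sample()                 # fd open before the error
es.record_error()
fd2 = es.sample()                 # opened after the error, before anyone saw it
assert es.check(fd1) == "EIO"     # old fd sees the error
assert es.check(fd2) == "EIO"     # late fd still sees it
assert es.check(es.sample()) is None  # opened after it was seen: clean
```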

The things achieved so far since the PostgreSQL community first crashed into this problem (thanks to the efforts of Craig Ringer):

* PostgreSQL now panics on failure, in cases where it previously retried (unless you set data_sync_retry = on, which should be safe on e.g. FreeBSD; I don't think there is much point in it, though, so the setting was included just as a matter of principle when rolling out such a drastic change)

* Linux now reports errors to at least one fd in versions new enough to have errseq_t (in addition to reporting it in every fd that was open at the time); that came out of discussions between Linux and PostgreSQL people about all this

* There have also been changes to OpenBSD, though I'm not sure what exactly

* PostgreSQL hackers are working on a plan to make sure that file descriptors are held open until the data is synced, so that there is no reliance on the error state surviving in the inode cache during times when it's not open (this is complicated by the use of processes instead of threads)

* Longer term, this whole thing boosts interest in developing DIO support for PostgreSQL (previously thought to be a performance feature)

This is definitely debatable. However, because *BSD had the same semantics as Linux (IIUC), we can infer that the fsync()-syncs-only-writes-through-this-open-file semantics is actually what's reasonable to implement.

FreeBSD has had different semantics here for ~20 years: https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357c...

I wish I could find the discussion around that commit.

It'll only start throwing away dirty buffers if the device actually goes away: https://github.com/freebsd/freebsd/commit/46b5afe7b1ae0ee655...

Read the wiki page. [Not Postgres’ fault in any way of course.]

> Linux 4.13 and 4.15: fsync() only reports writeback errors that occurred after you called open()

Does anyone know if FlushFileBuffers() on Windows also forgets to flush data that previously failed to flush? i.e., if Windows has the same issue or not?

I am also curious. I tried asking Microsoft via Twitter and they asked me to copy the information from that Wiki page into Microsoft One Drive so they could read it, and then suggested I try asking on Stack Overflow instead! https://twitter.com/windowsdev/status/989857799994822663

Amusing social media exchanges aside, I'm quite curious to know as well. We don't have the answers for any non-open OS (though of course we can speculate about descendants of BSD except FreeBSD (probably throw away buffers on error), FreeBSD (probably keep buffers cached and dirty), and SysV systems (probably throw away buffers on error)).

EDIT: They also suggested asking on MSDN Forums, which I didn't do because I don't have an account and am not a Windows developer at all, just a humble database hacker trying to understand how our stuff works on every platform. The code we committed assumes the worst by default, so not knowing the answer here isn't damaging. I dunno if it's possible to reach actual kernel hackers via MSDN Forums, but maybe someone should follow up with that.

I think someone with the right skillset could possibly design an experiment to figure it out for Windows (several people have shown how to set up experiments on Linux and FreeBSD to test this).

Those tweets from Microsoft tick me off so much I might actually try to find a way to test this. I just don't have the time. But here's an idea that might work... anyone want to try it? Create a VM in VirtualBox and a .VMDK file that leaves some parts of the disk unmapped or read-only. Then try to write to it from inside the VM. It'll naturally fail (and the VM should give you an error) but you can continue and try flushing again, and see if it fails. If it does, then it re-tried the write. If not, then it didn't. To check, also test this on a Linux guest with fsync() to make sure it doesn't fail the second time. (Caveat: If the behavior differs depending on the particular error from the block device then you won't know. But it might be worth a try.)

I don't know anything about Windows, but that seems like the right sort of approach; also is there such a thing as a network block layer you could temporarily break by disconnecting it? On other OSes there are fault-injecting drivers you can use to simulate IO errors. Maybe something like that exists?

In fairness to the team handling their Twitter account, I recognise that it is completely the wrong forum to ask complicated kernel questions, it's impossible to get through the front-line silly question filter. (Try reporting a kernel bug to Apple; it seems to be impossible, they're all set up to receive bug reports about consumer UX stuff etc, there isn't even a drop-down option for "kernel", and reports complete with reproducers filed under "other" just linger unanswered. These mega-corps aren't like open source projects.)

> is there such a thing as a network block layer you could temporarily break by disconnecting it

Yes, nbdkit (assuming a Linux or BSD host for your VM) can do this kind of thing. I gave a talk about this topic at FOSDEM earlier this month: https://rwmj.wordpress.com/2019/02/04/video-take-your-loop-m... The bit about testing is towards the end, but you may find the whole talk relevant.

Somebody asked on SO: https://stackoverflow.com/questions/54660780/is-it-safe-to-c...

And while there was no hard answer and the documentation is basically on the level of POSIX (aka it says jack about the system's state on failure), the consensus seems to be "assume it's fucked":

> The answer will probably depend on the file system driver, as FlushFileBuffers is just a user-mode wrapper around NtFlushBuffersFile, and it looks to me like that function just assembles a flush IRP (IRP_MJ_FLUSH_BUFFERS) and sends it via IoCallDriver. Of course, one should eye calls to FlushFileBuffers with a great deal of suspicion.


Really enjoyable reading experience...

> To understand it better, consider an example of Linux trying to write dirty pages from page cache to a USB stick that was removed during an fsync. Neither the ext4 file system nor the btrfs nor an xfs tries to retry the failed writes. A silently failing fsync may result in data loss, block corruption, table or index out of sync, foreign key or other data integrity issues… and deleted records may reappear.

As opposed to what? If the drive isn't there anymore, there's not a whole lot that can be done.

> With the new minor version for all supported PostgreSQL versions, a PANIC is triggered upon such error. This performs a database crash and initiates recovery from the last CHECKPOINT.

How is a recovery possible if the hard drive is borked? I don't understand the model that leads to this "fix" making any difference.

The difference is that with the fix, PostgreSQL should not silently lose data (confirmed transactions).

What was possible before is:

1) transaction committed OK (which involves fsync of WAL, but that's it)

2) during checkpoint (essentially asynchronous writes to data files in the background), something went wrong and OS just discarded the dirty pages on fsync()

3) PostgreSQL assumed it could retry the fsync and the pages would be written again

So in the end, the contents of data files mismatched what's in WAL.

What this change does is "crash" the database after step (2), forcing the database into a recovery which re-applies the writes done in (1) from WAL. Of course, if the I/O error is permanent, this won't change a thing. But PostgreSQL won't lie to you by returning stale data etc.

Note: This assumes all the layers (notably kernel + filesystem) do the right thing, i.e. report errors reliably. That may or may not be the case, depending on the kernel version etc.
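The failure mode in steps (2)-(3) can be shown with a toy model. This is my own simulation, assuming the pre-errseq_t kernel behavior of marking pages clean after a failed writeback, and is not real kernel or PostgreSQL code.

```python
# Toy model of the old failure mode: the "kernel" marks pages clean even
# when writeback fails, so a retried fsync() falsely reports success.
class ToyKernel:
    def __init__(self):
        self.dirty = {}               # page cache: page -> dirty value
        self.disk = {}                # what actually reached stable storage
        self.fail_next_writeback = False

    def write(self, page, value):     # buffered write: only dirties the cache
        self.dirty[page] = value

    def fsync(self):
        if self.fail_next_writeback:
            self.fail_next_writeback = False
            self.dirty.clear()        # pages marked clean despite the failure:
            raise OSError("EIO")      # the data is simply gone
        self.disk.update(self.dirty)  # successful writeback
        self.dirty.clear()

kernel = ToyKernel()
kernel.write("block 0", "committed data")   # step (1): transaction committed
kernel.fail_next_writeback = True
try:
    kernel.fsync()                    # step (2): checkpoint fsync fails
except OSError:
    pass
kernel.fsync()                        # step (3): the retry "succeeds"...
assert "block 0" not in kernel.disk   # ...but the data never hit disk
```

Panicking after the first failure and replaying the WAL is what closes this gap.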

A better example might be a SAN briefly becoming unavailable due to a transient issue with your iSCSI network?

Yeah, this more or less clears it up, I was assuming an error implied disk failure.

Right, this is about "ephemeral" failures which are becoming more common thanks to accessing storage over network, virtualization, thin provisioning etc.

NFS has been around forever and has always had a bad reputation due to problems like this. It mostly handles transient failures by waiting (indefinitely) for the server to return, but it's unclear what a better option is.

And that likely "hid" this issue for quite a long time, according to Tomas Vondra (https://youtu.be/1VWIGBQLtxo): data loss would just be blamed on NFS being NFS and kinda crappy, and not necessarily properly investigated in full (why waste time on NFS shitting the bed yeah?), but it's likely the incorrect checkpointing / fsync assumptions were the culprit in at least some of the issues.

Though another factor is that people now run a lot more DBs, in a lot more environments, with a lot less reliability, and concurrently the database improved, so things which were rare and lost in the noise when run on "big iron" with expensive drive controllers become visible signal.

FWIW here is a standalone test that shows Linux NFS exhibiting behaviour that would corrupt a PostgreSQL database:


You can also tweak that test so that ENOSPC is discovered at close() time. Now you have a system that has thrown away data that PostgreSQL has already evicted from its own buffers, and there is no way to get it back (other than replaying the WAL, which is what PANIC achieves, as unpleasant a solution as it is, especially if it just happens again, and again, ...).

The recent change in 11.2 adds a PANIC on error there. But I'm not sure it's sufficient in Linux NFS, because even on the tip of the master branch of Linux (by my inexpert drive-by reading, at least), the errseq_t stuff doesn't seem to have made it into the NFS client code, so it's still using the old single AS_EIO flag. That probably exposes at least one race that is discussed in this thread:


I think we need to do something to make space allocation eager for NFS clients (a couple of concrete approaches are discussed) so that ENOSPC is excluded as a possibility after we have evicted data from PostgreSQL's buffer, and then I think we need Linux NFS to adopt errseq_t behaviour, and PostgreSQL to adopt the "fd passing" design discussed on the pgsql-hackers mailing list (to make sure the checkpointer's file descriptor is old enough to see all relevant IO errors). Or we need direct IO.

TL;DR We are not out of the woods on NFS.

Why would you run a database on NFS?

Well, I wouldn't. But people do. It makes more sense to use a SAN IMHO. I'm told it's not uncommon to use NFS for Oracle. One interesting thing is that they have their own NFS client implementation instead of trusting the kernel (they also do direct IO by default, though I'm not actually sure whether their NFS or DIO support came first).

> How is a recovery possible if the hard drive is borked? I don't understand the model that leads to this "fix" making any difference.

If the hard drive is broken, it is broken, and that is not really PostgreSQL's fault. This is more about losing access to the hard drive, for example a broken SATA cable etc.

The issue here is properly handling fsync(). Until recently, PostgreSQL assumed that if an fsync() call fails, and the call is retried and succeeds, then everything is A-OK. So after that PostgreSQL removed the "successful" transaction from WAL and continued to process the next transaction.

It turned out that on some systems such as Linux, if an fsync() fails, a subsequent fsync() succeeding doesn't mean that all data from both calls was written successfully. So this change makes PostgreSQL panic on the first fsync() failure and refuse to continue. This means that you will still have those transactions in the WAL, and you can use that to recover the data when the disk starts to work again.
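In other words, the safe application-side pattern is to treat the first fsync() failure as fatal rather than retrying. A minimal sketch of the idea (my own: `sync_or_panic` is a hypothetical helper, loosely mirroring what data_sync_retry=off makes PostgreSQL do, with SystemExit standing in for a real PANIC/crash-recovery path):

```python
import os
import sys

def sync_or_panic(fd, path):
    """Flush `fd`, treating the first fsync() failure as fatal.
    After a failure the kernel may have marked the pages clean, so a
    retried fsync() can succeed without the data ever reaching disk;
    crashing and replaying the WAL is the only safe response."""
    try:
        os.fsync(fd)
    except OSError as e:
        sys.stderr.write("PANIC: fsync failed on %s: %s\n" % (path, e))
        raise SystemExit(2)   # stand-in for crash + WAL recovery
```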

> Turned out that on some systems such as Linux

Most systems, really; keeping the buffers dirty is the exception (FreeBSD, possibly DragonflyBSD, Illumos, possibly Solaris).

OpenBSD recently changed some of their behaviour, but apparently only so fsync doesn't lose the error: https://marc.info/?l=openbsd-cvs&m=155044187426287&w=2 NetBSD and macOS remain on the old BSD behaviour (discard the buffers, second fsync will come back clean)

This issue is observed during a CHECKPOINT, when the BGWriter writes dirty or modified buffers from shared buffers to disk. So, upon recovery after a crash, the changes since the last checkpoint are applied from the WAL this time. We use the changes that are in the WAL (write-ahead log) because, as you rightly said, the changes are already gone from disk. A good question in fact.

nitpick: the BGWriter is not involved in checkpoints - it's a separate process, doing something like a checkpoint in the background (so an alternative).
