Fsyncgate: Errors on fsync are unrecoverable (2018) (danluu.com)
219 points by Darkstryder 3 months ago | 114 comments



The other half of the thread was on linux-ext4, starting here: https://lists.openwall.net/linux-ext4/2018/04/10/33

The part I found most interesting was here: https://lists.openwall.net/linux-ext4/2018/04/12/8

where the ext4 maintainer writes:

« The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. »

He later says that the netlink channel stuff was never submitted to the upstream kernel.

It all feels like a situation where the people maintaining this code knew deep down that it wasn't really up to scratch, but were sufficiently used to workarounds ("use direct io", "scrape dmesg") that they no longer thought of it as a problem.


> It all feels like a situation where the people maintaining this code knew deep down that it wasn't really up to scratch, but were sufficiently used to workarounds ("use direct io", "scrape dmesg") that they no longer thought of it as a problem.

It's an ops-level solution. The one thing you never really do, as an ops person, is reach into the black-box components that make up your infrastructure to fix them at an architectural level. That's not your job. The system is already in production; it's your job to make it run, without changing it—to shim around the black boxes to make them work despite their architectural flaws.

Sure, you can file a ticket with the dev team about how dumb you think an architectural choice was—but, in most businesses, most of the black-box components you're working with are third-party, so filing a ticket isn't likely to get you much. (Google is an exception, but even Google has to hire their SREs from somewhere, and they'll come in enculturated to the "it's all black boxes and we're here to ziptie them together" mindset.)

And, even in an environment like Google's where you have access to the dev-teams of every component you're running, dev cycles are still longer than ops SLA deadlines. Ops solutions are chosen because they're quick to get into production. (Component can't handle the load because it's coded poorly? Replicate it and throw a load balancer in front! Five minute job.) So even if you do file that ticket, you've still got to solve the problem in the here-and-now. And once you do solve the immediate problem, it's no longer a hair-on-fire problem, so that ticket isn't going to be very high-priority to fix.


> The one thing you never really do, as an ops person, is reach into the black-box components

I'm sorry, but just no. That's far too strong a general statement. I've done quite a bit of sysadmin/devops style work over the years, and also worked with many other people who've sent various kinds of fixes upstream. Sure, third-party software isn't always easy to fix, but it depends on what the alternatives are. You probably need to do an immediate workaround as well.

I would say that's an important part of why open source has had such a strong following among sysadmins: the ability to fix things. One could even say that diving into large pieces of software and exploring issues is the most rewarding part of the job. Just don't tell them I said that.


Usually I've ventured into the code and made a bad enough attempt at a fix that the devs are galvanised into doing it right. Filing a ticket usually isn't enough to get your voice heard if things can just be worked around.


I like this concept. I wonder if there’s a word for it. Any Germans want to help out with a fifteen syllable wonder?


Tikettfileninsufficientstadtdelvencodebasefixen.

Or something? More seriously, relevant English idioms:

- A stitch in time saves nine;

- Something worth doing's worth doing well;

- If you want it done right, do it yourself;

and surely more. I'm certain there's a farming-analogy one about fixing something sooner rather than later, but it's escaped me.


So, we've got:

Cunningham's Law: "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer" [1]

"Broke gets fixed, crappy is forever" [2]

I feel like there should be a third, but that's what I've got.

[1] http://fed.wiki.org/journal.hapgood.net/cunninghams-law/fora...

[2] https://dandreamsofcoding.com/2013/05/06/broke-gets-fixed-cr...


Maybe something about the dummy fix polluting the lead dev's code aesthetics?


Threat-in-a-patch


Duct tape ops?


Agreed.

I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

It takes significant time, but after following this practice for a while, things start working reliably.

I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.


> Agreed.

> I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

> It takes significant time, but after following this practice for a while, things start working reliably.

> I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

> In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.

Do you then submit your fixes back to upstream? This is the critical element I think.


> Do you then submit your fixes back to upstream? This is the critical element I think.

It makes sense to, otherwise you're stuck forward-porting your private patches forever. Or worse, you get stranded on an old version because there's no one around who can do that forward-porting anymore...


I don't think there was anything wrong with Google using this workaround.

But Google was also employing the ext4 maintainer, and fixing those components was presumably part of his job.


It depends quite a bit on the company and the software they are running. I've worked at places that use open source for the majority of services they run on the servers specifically because they can both look at the source as an advanced form of debugging, and provide local patches to software until it's accepted upstream (or indefinitely if it's not accepted or too company specific).

A lot of patches for open source projects come from companies like this. Is there also a lot of working around buggy software? Yes. It's not always that way though, and if it's gotten to the point that you have devs for that software on staff, it really shouldn't be, because if those people who are financially motivated to care don't, who will?


> I've worked at places that use open source for the majority of services they run on the servers specifically because they can both look at the source as an advanced form of debugging, and provide local patches to software until it's accepted upstream (or indefinitely if it's not accepted or too company specific).

Sure, but what you're talking about here is more "the developers of your software, solving problems in the composed release of the software, by vendoring + patching infrastructural dependencies."

Like, for example: my software depends on Postgres. So I have a copy of Postgres running on my workstation, to develop against. It's built from source. If I find a bug in Postgres, I can fix it in the source, submit a PR, and in the meantime, package this fork of Postgres into the application's Docker image in place of the mainline one.

This is basically the same as saying "my project depends on ecosystem-package libfoo. I can vendor a copy of libfoo to fix a problem with libfoo."

Either way, that's still something only the developer of the software can really do (on any kind of a practical time-scale.) Changing the software from talking to an external Postgres, to talking to one packaged inside the Docker image? That's an internal architectural change that requires rewriting the software's configuration handling logic and build scripts at the very least, and probably also adding some sort of init daemon to the resulting Docker image. It's a software change. It's not something an ops person should be spending their time getting up-to-speed enough on a project to be able to do. It's the thing they file a ticket to ask the developer to do.


As a dev, I hate this sentiment of siloed ops.


It's kind of impossible to have it any other way. Ops is a cost center; you'll only have as many ops staff as you need to ensure your hair isn't on fire. And so ops staff only have enough time to solve problems with the most direct shortcut (usually by adding more components into the mix), rather than doing good Engineering to pay down the technical debt responsible for the problem.

If you have an infinite budget, by all means, hire developers whose jobs are to be prophylactic for ops tickets, by paying down tech-debt. But you'll have to guard that department with your life, because it'll be the first thing to be cut the moment anyone's looking to grow margin by shaving costs.


That is, until your customers start demanding higher SLAs and offering money for them... Maybe only a thing for really big companies, but when that happens, improving reliability and ops becomes someone's job. If not multiple people.

That doesn't mean no duct-tape-like solutions get used, but that the ones that cause problems get removed.


I don't think that's siloed.

A good filesystem maintainer is a precious resource; wasting them on running stuff is madness. (I say that as a devop.) I know a fair amount about running/configuring/tinkering with filesystems (hell, I've bumped into at least 3 XFS bugs that were new to Red Hat), but I would never presume to be able to make a meaningful patch.

It's the same way that I'd not expect the ext4 maintainer to know how to instrument, metric and monitor a large company's infrastructure.

Almost certainly they would have flung a ticket over the fence, and it was up to politics to make it work.


It's the fence that I dislike. But I work in a small studio where I can easily work with (and do some) devops.


I have some bad news for you: that's not how ops at Google works. Root-cause solutions are preferred and rabbit-hole explorations are encouraged.


If fixing the thing itself is easier, and also benefits everyone else, why not?


Fixing the thing isn't usually easier. In an ops environment, you're not dealing with Git repos full of source code that you can just patch and run through CI; you're usually dealing with pre-built packages from an apt server, or pre-built Docker images, or pre-built virtual-appliance VMs. And usually these are all created in drastically-varying, heterogeneous ways, many again by third-parties (public Apt packages, public Docker images, etc.), but even if internal, then by different internal teams with different favored approaches toward building and deploying things.

From the position in the pipeline where ops people operate, it's more work to figure out how to fork a build of software X, patch it, build it, and get out the same kind of artifact your deploy pipeline is expecting, hosted in the same place, than it is to just throw more spaghetti at the wall to solve the problem.

Half of the reason behind Google's monorepo, and the focus on Golang, is to empower SREs to patch broken components when possible, by making there be only One True Way to do that, and only one language (without weird per-project DSLs) to know. But that still doesn't completely fix the problem, since you still need to understand the architecture of the software project, and sometimes that's the hardest part.


I dunno, hardware errors really aren't a thing that is well handled by local APIs. I mean, it all feels to me like the PG folks (and you) are expecting this to have all been fine if they just called fsync() a second time to be sure the busted write got to disk, and then everything would have been OK.

No, that's not the way it works. In virtually all circumstances, that second write is going to fail too, because the machine done broke! And even if it works, you have a situation where the machine broke and something somewhere worked around it, and you need to be screaming like crazy for an operator to fix it. That's not an API situation, that's a deployment thing, and it's exactly the regime where Google applied the protocols that you don't seem to like.

That's not to say that the linux fsync behavior is correct here. I think clearing the error having reported it once is kinda broken too. But... fixing that doesn't actually "fix" the real problem.

Really the best suggestion in that thread was up at the top, where it was pointed out that ext3/4 support remounting the filesystem read-only on errors, which sounds like a great policy to me.


It seems pretty clear from the first message in the thread that what the postgres code (and so the "PG folks") was expecting was that fsync would continue to report errors as long as it hadn't succeeded.

That way the checkpoint wouldn't complete, and the relevant write-ahead log files wouldn't get deleted (and "checkpoints aren't happening and WAL is piling up" is the sort of thing people already monitor).


> the sort of thing people already monitor

Exactly! So it's a devops solution either way. The choice is between one based on system-level monitoring and one based on tool-level error reporting. In neither case does a non-broken fsync() actually fix anyone's problem.

My point was that I don't see tools like remount-ro and dmesg monitoring as "workarounds" for an underlying problem. They're correct solutions implemented at the right level of abstraction. They just happen not to be shipped with PostgreSQL, which... I mean, yeah, basically I don't care much either way.


> In neither case does a non-broken fsync() actually fix anyone's problem.

The system's broken either way, but failure modes matter!

With fsync() reporting EIO only once (one time, to one caller), callers that don't expect this behavior (and _no_ callers were written to expect this POSIX-violating, undocumented behavior AFAIK) could corrupt data if writes failed intermittently. E.g. marking a segment of WAL as fully applied when it wasn't, not only throwing away durability for that transaction but corrupting the whole database. And maybe an error is reported now and then, but that's all too easy to miss.

With fsync() consistently reporting EIO, the system never reports a transaction as fully committed then loses its contents, and the existing database isn't corrupted. And someone will notice the problem. mjw1007 worded it perfectly: "'checkpoints aren't happening and WAL is piling up' is the sort of thing people already monitor".

So yes, a non-broken fsync() doesn't magically fix your damaged platter, but it's worlds better.
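
To make the failure modes concrete, here's a minimal C sketch (made-up function names, not PostgreSQL's actual checkpointer code) of the retry-then-trust pattern versus the panic approach:

    /* Sketch only: illustrates the failure mode discussed above, not real
       PostgreSQL code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* UNSAFE on kernels that clear the error state after reporting it once:
       the retry can return 0 even though the dirty pages were dropped. */
    static void checkpoint_segment_unsafe(int fd)
    {
        while (fsync(fd) != 0) {
            perror("fsync failed, retrying");
            sleep(1);   /* a later call may "succeed" spuriously */
        }
        /* caller now recycles the WAL for this segment -> silent data loss */
    }

    /* Safer: treat any fsync error as fatal and rely on WAL/crash recovery,
       which is roughly what PostgreSQL ended up doing (panic on failure). */
    static void checkpoint_segment_panic(int fd)
    {
        if (fsync(fd) != 0) {
            perror("fsync");
            abort();
        }
    }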


That's just a restatement of the PG point in the linked thread. To which the reply was "it's better to do this at the system level", thus the rejoinder "but that's a workaround" and my "no, it's a perfectly valid way of solving a problem at the same level of abstraction", and now we're full circle because you're picking on my defense (which I did not make!) of the linux fsync behavior and ignoring all the context.

You have to do system level reporting of errors, full stop. You can do that with a fixed fsync. You can do it with dmesg. You can do it even with the current fsync behavior. Pick one. Arguing about fsync is pontificating over the wrong problem.

FWIW: I'm all but certain you're wrong about the "POSIX-violating" part. My read of the standard says the author was thinking only about the current request. There's certainly no language in there saying that fsync should return EIO because of a previous failure, nor even what the state of the system (c.f. remount-ro) should be following the first one.


You have to synchronously report the error to the application depending on the fsync, full stop. None of the other proposed mechanisms are a replacement for that; for one, because they are asynchronous. An application can't be written to "do an fsync, assume it's done...unless some system-wide error reporting mechanism says otherwise arbitrarily later".

> FWIW: I'm all but certain you're wrong about the "POSIX-violating" part. My read of the standard says the author was thinking only about the current request.

An fsync call is not a complete request (it doesn't specify the data to be written), so what on earth does "only about the current request" mean?

I can't reconcile any of these arguments that Linux's behavior was correct with the RATIONALE section here: http://pubs.opengroup.org/onlinepubs/9699919799/functions/fs... and particularly the phrase "all data up to the time of the fsync() call is recorded on the disk". There's no mention of "since a previous fsync() call" or any such thing there.


> You have to synchronously report the error to the application depending on the fsync, full stop.

Why? As reported above, Google itself isn't even doing that. We're not talking about a network timeout here. Hardware failures are hardware failures and not in the general case recoverable by software no matter how strongly held the opinions are about how that software should work.

The fsync behavior on Linux was surprising and wrong. But fixing that won't fix any of the stuff you're yelling about. API hygiene can't fix machines, and the machine broke.


> Why? As reported above, Google itself isn't even doing that.

I work at "Google itself". Typically there's a single process responsible for all the disk writes of significance on the machine, possibly bypassing the in-kernel filesystem entirely in favor of direct block device access. That process's caller waits until three machines (each with its own battery backup, and frequently on a separate PDU from the others) have acked the write before proceeding. Also, that process's caller is Spanner, and those three machines are necessary to say a single replica has accepted the write. Typically there are three replicas in different clusters (often in separate cities) and two of them have to accept the write (in Paxos terminology) before Spanner tells the application it's complete.

Please don't take Google's production environment working successfully most of the time from the customer viewpoint as proof of anything about the Linux virtual filesystem layer or ext4. You can paper over a lot of brokenness when you have six independent battery-backed copies of the data in two cities before you depend on it.

> We're not talking about a network timeout here. Hardware failures are hardware failures and not in the general case recoverable by software no matter how strongly held the opinions are about how that software should work.

Software can't fix the broken hardware, and (without significant redundancy and cross-verification as Google does) can't correct for hardware simply lying. But correctly written software can absolutely change the likely failure modes when hardware reports error. And intermittent hardware failures happen a lot more than you might think anyway. Bugs that turn intermittent failure into permanent corruption are nasty.


What about this: you write() some data, and then Linux throws it away because of a writeback error, and later you read() it, and Linux gives you an older version from before your write (because this error condition has somehow been fixed and now it reads that block back into the page cache). I'm pretty sure that violates POSIX, which says:

"The read() function reads data previously written to a file."


I don't think that's the behavior, though. The blocks don't get flushed, they're still in cache and get read back. Probably. Again, the system broke. You've got no guarantees about anything. Fixing an API isn't going to change that.


API guarantees are still important even when a peripheral breaks. It's the difference between knowing how and where performance is degraded and what to work around, versus utter chaos.


Totally agree. Speaking as a database hacker, just tell me it broke. I will deal with the rest.


It did tell you it broke, though. It just didn't repeat the message when you asked again. Really, there is no "correct" behavior from getting an -EIO from a fsync beyond "shut it all down, notify everyone you can and hope for the best". This retry behavior was itself a bad design choice (and again, I'm not defending fsync here).

And again, speaking as a systems hacker: it broke. You can't "deal with" broke. And the system absolutely did tell you it broke. I genuinely don't understand the insistence by folks on this list that OMG THE ERROR MUST BE HANDLED IN THE DATABASE SOFTWARE. It seems kinda ridiculous to me, honestly.


Well, there are/were a whole bunch of related problems and scenarios discussed in that and other threads, so it depends which bit we're discussing. (1) errors consumed by asynchronous kernel activity that never make it to user space; (2) write(), write-back error not reported to user yet, buffer page replaced, read() sees old data; (3) fsync() -> EIO/ENOSPC, then fsync() -> SUCCESS; (4) write() in one process, then fsync() in another process from a different fd; (5) write(), close(), open(), fsync().

It seems that the developers of database software and of kernels/filesystems had different understandings of what should happen to error reporting in those cases (and maybe some more that I'm forgetting). PostgreSQL and FreeBSD happened to agree, and all of these scenarios do the right thing there (IMHO, but you could call my opinion not impartial, since I am involved in both of those projects). Linux has had various different behaviours since Jeff Layton realised how terrible it all was and started improving it, starting with (1), which were straight-up bugs. Some problems still exist. OpenBSD also made recent changes due to the noise generated by this stuff. So I genuinely don't understand how anyone can argue that there was nothing wrong, or imply that it's not complicated: the maintainers of the software in question apparently disagreed.

In my humble opinion someone should go and talk to http://www.opengroup.org/austin/ and try to get some increased clarity here for POSIX 2022. Maybe I will find the energy if someone doesn't beat me to it.

By "I will deal with it", I didn't mean I can fix a hosed server, I mean something like "the database will fail over to another node, or enter recovery" or something like that. And, since we (and other DBs who apparently read our mailing list) introduced panic-on-fsync-failure, that's what we do. There are still a few edge cases to fix, though. We're on it...


When a write-back error happens, Linux marks the block clean so it can be replaced by another block at any time. Additionally, some Linux filesystems (I don't recall which right now) proactively throw it away immediately.


That was also suggested in the pg list:

> An crazy idea would be to have a daemon that checks the [kernel] logs and stops Postgres when it seems something wrong.

> [..] you could have stable log messages or implement some kind of "fsync error log notification" via whatever is the most sane way to get this out of kernel.


I sure wonder how IBM mainframes and other computer systems handle this intractable failure case. Joking aside, here's an excerpt from ''The UNIX-HATERS Handbook'':

Only the Most Perfect Disk Pack Need Apply

One common problem with Unix is perfection: while offering none of its own, the operating system demands perfection from the hardware upon which it runs. That's because Unix programs usually don't check for hardware errors--they just blindly stumble along when things begin to fail, until they trip and panic. (Few people see this behavior nowadays, though, because most SCSI hard disks do know how to detect and map out blocks as the blocks begin to fail.)

...

In recent years, the Unix file system has appeared slightly more tolerant of disk woes simply because modern disk drives contain controllers that present the illusion of a perfect hard disk. (Indeed, when a modern SCSI hard disk controller detects a block going bad, it copies the data to another block elsewhere on the disk and then rewrites a mapping table. Unix never knows what happened.) But, as Seymour Cray used to say, ''You can't fake what you don't have.'' Sooner or later, the disk goes bad, and then the beauty of UFS shows through.


lol hi alex


Related article: PostgreSQL's fsync() surprise https://lwn.net/Articles/752063/ (April 18, 2018)

And the followup coverage from LSFMM summit (linked also in the OP discussion): https://lwn.net/Articles/752613/


Upvoted this because the LWN article seems to give a much better picture of the what and why, including key kernel devs explaining what tgl called "kernel brain damage". The posted danluu.com link seems to be just the pgsql mailing list.


One thing I took away from this thread: when someone tells you something surprising, it's best to ask, rather than deny. It doesn't look good to say things like:

"Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued."

When you have plainly not actually verified that it is the case.

Instead, you could say, "doesn't POSIX say...?" This has the following benefits: you avoid egg on your face, the conversation takes on a less aggressive tone, and problems are resolved quicker.


100% agree with that takeaway in general, except that in this case I think POSIX does actually say this. I pointed out and discussed this a few months ago [1] (and it's already discussed on the page as well), but basically, those who disagree believe that fsync's task is to merely "send" the data to the storage device, not actually make sure it's written persistently... which I see as being blatantly inconsistent with "to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded to the disk". There's nothing on the page that says "data since the last call to fsync", and that sentence says that was definitely not the intention, and yet somehow that's how people read it to support the notion that current implementations are correct (or vice-versa: they use current implementations as evidence that this is the correct reading).

[1] https://news.ycombinator.com/item?id=19128228


Perhaps? The people in the thread seem fairly convinced that's not what it says. When I read it, I see "... all data for the open file descriptor named by fildes is to be transferred to the storage device ..." So, in the event of a failure in the underlying hardware, the data may have been transferred, but subsequently an error has occurred. I don't see how the sentence obliges the implementation to transfer multiple times. I'm not an expert though.

In any case, the problem is with the attitude and how opinions are expressed, not the exact state of which opinion is being expressed. :) Also, it is plainly not "entirely clear" since we two apparently reasonable people disagree on what the sentence means. So even if the person is right about what was ultimately intended when the spec was written, they are still incorrect in their view that it's clear what POSIX says.


That sentence's language could stand to be improved, then, if a sufficiently determined kernel developer (not sure I'd say "reasonable" but whatever) can misread it that way (while ignoring the entire "RATIONALE" section). But what value is a guarantee that the data is transferred to the storage device, but not that it's actually committed to permanent storage? If someone's best argument is that they correctly provide a completely useless guarantee, and that nothing provides the guarantee people actually need for real work, they're on pretty shaky footing.


Presumably people will not buy permanent storage that does not commit data that is transferred to it? The thing is, the fsync guarantee isn't useless. You "just" have to rewrite any data that you had previously written since the last sync, and be tolerant of partially committed data in the interim.
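
For illustration, a rough sketch of what that "rewrite everything since the last sync" approach could look like (made-up helper names, not taken from any real project):

    /* Sketch only: keep a copy of everything written since the last
       successful fsync, so an error is handled by rewriting the data
       rather than by trusting a second bare fsync() call. */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct pending { off_t off; size_t len; char *data; struct pending *next; };
    static struct pending *unsynced;          /* not yet known to be on disk */

    static int tracked_pwrite(int fd, const void *buf, size_t len, off_t off)
    {
        struct pending *p = malloc(sizeof *p);
        p->off = off; p->len = len;
        p->data = malloc(len);
        memcpy(p->data, buf, len);
        p->next = unsynced; unsynced = p;
        return pwrite(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
    }

    static int durable_sync(int fd, int max_retries)
    {
        while (fsync(fd) != 0) {
            if (max_retries-- <= 0)
                return -1;                    /* give up, data still buffered */
            /* rewrite everything we still hold, re-dirtying the pages,
               instead of trusting the next fsync() call on its own */
            for (struct pending *p = unsynced; p; p = p->next)
                if (pwrite(fd, p->data, p->len, p->off) != (ssize_t)p->len)
                    return -1;
        }
        /* success: the unsynced list can now be freed (omitted for brevity) */
        return 0;
    }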


What if the hardware has its own notion of sync? I am not familiar with disk I/O protocols, but I assume they are permitted to store writes in an internal buffer. In such a case, I would expect fsync to issue a sync request to the device and wait until the device reports that the sync is finished. Admittedly, the phrasing of the spec is not 100% clear on this point; but I think the correct reading is that the right behavior is to wait until the device actually commits the data (or at least claims to; at some point broken/lying hardware is just broken).


Your assumption is correct.


> So, in the event of a failure in the underlying hardware, the data may have been transferred, but subsequently an error has occurred. I don't see how the sentence obliges the implementation to transfer multiple times. I'm not an expert though.

That is actually an issue with disks which have a small memory buffer in front of the actual persistent storage: not only will they acknowledge success as soon as the data's in their buffer (losing information in case of an error in the underlying storage write), they may not properly handle a power loss of some sort, ack'ing the write to the buffer and then losing it before it has time to hit storage.


It's kinda depressing that after a quarter of a century of work, something in the region of a hundred thousand dev-years, and billions of dollars of investment, the world's most used operating system still fails at the most basic tasks, like reliably writing files or allocating memory.



Well, at least in this case one can abort (or go into some read-only mode) in case of fsync() returning failure. With most storage media that is the correct thing to do anyway. Having multiple processes and having the fsync() error returned to only one of them is problematic though.

I recently found out that syncfs() doesn't return an error at all in most cases (through data loss :/). It's being worked on ... https://lkml.org/lkml/2018/6/1/640

It's astonishing that such critical issues are still present in such a widely used piece of software.


> Well, at least in this case one can abort (or go into some read-only mode) in case of fsync() returning failure.

The problem here isn't fsync() returning failure; it's that it always returns success, even if it failed.


> it’s that it always returns success, even if it failed.

It doesn't (except on the most broken of old Linux kernels, and even then it mostly lost async write errors). Rather it's that on most systems (basically all of them except FreeBSD and Illumos, possibly OpenBSD after some recent changes) fsync will only report errors once, but that call will clear the flags and subsequent fsync calls will succeed (unless new errors have occurred).

Basically, you can only rely on fsync reporting errors that have happened since the last fsync, which is obviously concerning for all sorts of reasons (not least being concurrent updates).

They also mention a bit about fsync not necessarily reporting errors from before the file was opened.


Imagine a juggler, juggling 5 balls at a time. At a set period of time, he will drop one ball and accept another ball thrown at him. He handles this very well because there is order/cadence.

Now imagine asking him to support being thrown balls randomly, at any interval. He may make it work, but I'd imagine he will stutter a bit.

In my experience, anytime you interrupt the page cache's normal routine, it stutters everything. I've seen the "sync" command freeze my Ubuntu machine (music player, GUI, etc).

I work on embedded devices, and my employer wanted to reduce the window in which data loss could happen for 30 MB+ files (video capture). It wasn't a supported use case: "But why not! It makes our product theoretically better!" I put my foot down. We aren't touching the page cache until there is a clear benefit to the user. It almost got me fired, but good riddance if so.


The annoying thing is that the Postgres people don't even really need "sync this file right now" for the main data files.

AIUI what they'd ideally like is "sync this file reasonably soon, and let me know when it's finished".


io_uring supports issuing async sync_file_range requests, which can be used to do most of the heavy lifting.

You will still need a final fsync to also get the metadata written, but that should be faster at that point.


Correction: I took a closer look at liburing and it also supports straight fsync [0], not just sync_file_range, which makes this even simpler.

[0] http://git.kernel.dk/cgit/liburing/tree/src/io_uring.h?id=90...
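
For anyone curious, a rough sketch of an async fsync through liburing (assumes a kernel and liburing recent enough to support it; link with -luring; error handling trimmed):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        int fd = open("datafile", O_RDWR);   /* hypothetical file */

        io_uring_queue_init(8, &ring, 0);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, 0);     /* 0 = full fsync; IORING_FSYNC_DATASYNC for fdatasync */
        io_uring_submit(&ring);

        /* ...do other work; the sync completes in the background... */

        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0)
            fprintf(stderr, "fsync failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }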


It should be noted that io_uring landed (in 5.1) a fair while after the whole Postgres back-and-forth started in 2018.


It should also be noted that even sync_file_range's manpage warns you against using it, which tells you how reliable that thing is.


It's not that you shouldn't use it because it's dangerous in itself; it just doesn't do what one would naively expect it to do. It's still useful as a "flush dirty pages" hint while preparing for the real fsync.


fsync != syncfs/sync

The former is for a single file, the others are for one/all filesystems


I'm aware, but the analogy still applies, which was about page-cache in particular.



If fsync() fails, isn't Valhalla lost anyhow? If you can't write things down because your pencil is broken, it's probably time to stop what you're doing and get a new pencil.

If the kernel can't flush dirty write buffers, maybe it's time to send up a flag and panic in the kernel itself?


> If the kernel can't flush dirty write buffers, maybe it's time to send up a flag and panic in the kernel itself?

Being unable to write to a disk is a recoverable scenario, especially in some conditions. The most common cause of disk write failures is "someone yanked the flash drive out of the computer," and the recovery is "pop up a dialog telling the user to put it back in."


Another situation: a GCSFuse volume dealing with temporary networking errors.


Or NFS being NFS.


I think there is a case for never throwing away dirty pages.

I.e. either the write eventually succeeds, or all other writes block or fail until it does.


That would be a great improvement over the ‘tell the user he shouldn’t have done that’ we have now!


That can happen when you remove a removable storage device, like a USB drive. You don't want to panic the kernel for that.


FreeBSD lived with that up to circa 2010, AFAIR. A few fast-pulling desktop Linux trolls complained at the time, but nothing serious.

Anyway, given such a choice, as a user/admin I would prefer an instant failure with a panic message over a corrupted 10 GB online client database with no clear recovery path. I'm always scared by this kind of story, where something underneath fails to fail loudly and I never even hear about it.


Isn't your data already corrupted most of the time when writes start to fail? If the drive is gone, a panic won't help or hurt, it will just ruin everything else you have open.

Redoing failed writes should be the method that ends in the least amount of corruption.


Without going into details: at the end of the day one should have a plan to recover their data from a potential failure, by means of version control systems, transaction logs, backups, mirrors, etc. But if a failure makes its way into the recovery storage undetected, the entire plan is doomed.

I don’t care if I lose few last hours of my local work, since this time is buffered in a project or is not precious at all (like game saves). That throws back me a little, but not anyone else. I also don’t really care about losing downloads or experimental files. But when it comes to a business database or a project repository, that value can be orders of magnitude bigger and can cost responsible one their job at the very least.

“Redoing failed writes should be the method that ends in the least amount of corruption.”

But in fact it led to a database corruption that was only fixed by some black magic; as far as I understood, they did a manual rollback of the failed writes. I wouldn't blame anyone for the bug in question even if I'd hit it myself: writing low-level software is hard, the behavior was unintended, and Linux, XFS & PG are all free as in beer and free of any charges. But if it happens and goes undetected for longer than the recovery window allows, the costs are unavoidable. That's why this trade-off is easy from a distance.


> But in fact it led to a database corruption

You're saying you had a system that did redo failed writes to disk, unlike other systems discussed so far, and it managed to corrupt your database anyway?


Reminds me. On CP/M, hit save when the drive door is open: so sorry, all gone.


So kernel panic on disk full errors? Because that's one of the scenarios causing data loss here - fsync throwing away data on a full disk and lying about having committed it.


In that thread, Craig Ringer and Thomas Munro note that most systems & filesystems won't report ENOSPC on fsync(); they'll do that on write():

> ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().


no, the fs doesn’t overcommit storage like the memory system does.


I don't think you understand the problem.


Mmm, maybe this is a little too aggressive; there could be perfectly functional storage on the host.

Maybe something like "make the mount read-only, ensure all open fds on the device return nothing but errors, and then dump the dirty blocks to swap in some kind of recovery log that could potentially be applied to the disk"?


if you've got removable disks on your servers, valhalla was lost long ago!


Huh? Hot swapping disks is extremely common.


Depends on your definition of 'hot' (and also on your definition of 'disk'). If you mean 'FS volume under active write load' then no it isn't extremely common.


the raid device remains stable and fsyncs don't fail in that case.


You don't use networked filesystems?


not on production db servers, no.


No, some errors are transient; a retry is a reasonable thing to be able to do in such a case. This happens with storage on SANs, for example, due to a network blip.


This is sad. I know that one large user of Linux found this problem in 2009 or so and fixed it for the version of Linux they used in their fleet of servers. I'm surprised it didn't make it upstream back then.


I don't know if it's the same person, but someone working on Atlassian's cloud offerings said he reported this issue (along with a working patch) to Postgres and they declined it. Sounds like what they do for cloud services is run a lot of Postgres instances where the DB data is on a big NFS, and this bug was a bigger problem due to how NFS works. But after patching the fsync handling in Postgres, they continued using Postgres-on-NFS with great success.

Unfortunately I can't find the original comment, but I think it was on another HN story about Postgres+fsync.

Edit: I found it: https://news.ycombinator.com/item?id=19126601 and I may or may not have hallucinated the part about Atlassian.


I think it's a bit strange to claim the PostgreSQL community just declined the patch. Just look at the thread on pgsql-hackers list:

https://www.postgresql.org/message-id/flat/46ED7555-9383-45D...

As anarazel already pointed out in the hackernews thread you linked, there's one person who was initially skeptical about the patch, and a couple of people chimed in with feedback (e.g. pointing out some mount options may fix it, which I think is valuable). And then the author of the patch just disappears from the discussion, so the thread kinda gets forgotten for a couple of months.

I'm sure the PostgreSQL community has its flaws, but I doubt this is a particularly clear example ...


In case someone reads this and gets the impression that running PG over NFS is in general safe or a good idea, I'm pretty sure it still isn't, unless like the OP you have a complete understanding of what you're doing.


I have quite successfully run pg atop NFS — even, in limited, point-solution type roles, in production. My experience doing that didn't leave me with the impression that it was a particularly egregious thing to do, though I would definitely take many, many additional steps to ensure redundancy and availability if I were going to use it more generally.

You're right though: you really do want to have some idea what you're doing, if you're going to go there.

Source: my day job is PostgreSQL DBA, and has been for ~15 years now.

EDIT: Phrasing.


I think the problem is mostly ENOSPC from fsync() which jettisons data just like EIO on Linux. If you ran out of space, PostgreSQL would only learn about that while checkpointing, and then retry and carry on. Boom, data loss. Today PostgreSQL would panic on the first ENOSPC from fsync() so the problem is mostly "fixed".


The main thing to keep in mind is that NFS exercises a bunch of VFS code paths that effectively no other filesystem touches. So you are more exposed to strange VFS bugs, as well as general NFS shenanigans, which other PG users may not have seen before.


OT: it depends on how many writes/reads you get. For most people NFS is probably an easier solution for HA on Postgres than anything else.


That’s pretty much computing. It’s usually the case that unless you have a complete understanding, there’s a pothole waiting to trip you.

As an FYI - we no longer use Postgres on NFS as we have migrated the entire architecture underlying the systems in question.


I read the start and the end of the thread but couldn't get an understanding of what the current situation is: did Linux update its fsync behavior? Does PG now panic on Linux on the first fsync failure?


Linux kernel behavior is still basically the same. The error-reporting issues (failure to report I/O errors at all in various cases) have been fixed in recent kernels, AFAIK.

PostgreSQL now PANICs on I/O errors during fsync, forcing a recovery.


Every computer component can fail in arbitrary ways, including drives.

If you’re not robust against that, then when things like fsync fail, then you’ll lose availability and/or data.

Even though Linux’s fsync behavior is clearly broken, it is far from the craziest behavior I’ve seen from the I/O stack.

Anyway, the main lesson here is that untested error handling is worse than no error handling. They should have figured out how to test that this path actually proceeds correctly (on real, intermittently failing hardware) or just panicked the process.


It is broken but there's no behavior that isn't, and the Postgres developers quickly understood why changing Linux's behavior isn't really possible.

From https://lwn.net/Articles/752063/: "Linux is not unique in behaving this way; OpenBSD and NetBSD can also fail to report write errors to user space. [...] If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else [...] a fair amount of attention was paid to the idea that write failures should result in the affected pages being kept in memory, in their dirty state. But the PostgreSQL developers had quickly moved on from that idea and were not asking for it".


> It is broken but there's no behavior that isn't, and the Postgres developers quickly understood why changing Linux's behavior isn't really possible.

I think there's still a fair number of PostgreSQL developers who think the way the Linux kernel behaves makes the fsync() API rather difficult to use for anything but the simplest scenarios.

The reason why the community decided to accept it was the realization that there's about a 0.001% chance of convincing kernel devs to change it, and the fact that we'd still have to deal with existing kernels for the foreseeable future.


> Every computer component can fail in arbitrary ways, including drives.
>
> If you're not robust against that, then when things like fsync fail, you'll lose availability and/or data.

The fact is that often the I/O issues are temporary, and those situations are becoming more and more common (think running out of disk space with thin provisioning, networking issues with NFS, etc.). So it might be quite valuable to handle those issues gracefully, without essentially crashing the database (which is pretty much what PANIC does).

> Even though Linux’s fsync behavior is clearly broken, it is far from the craziest behavior I’ve seen from the I/O stack.

"clearly broken" might be bit too harsh, but it certainly makes it way harder to use.

> Anyway, the main lesson here is that untested error handling is worse than no error handling. They should have figured out how to test that this path actually proceeds correctly (on real, intermittently failing hardware) or just panicked the process.

That's true, of course. It's a sad fact of life that error paths are the least tested part of almost any code base. It's however also true that testing I/O errors is pretty difficult to do (especially before the "error" dm target existed), and a significant part of fsyncgate was that the kernel was not reporting errors reliably. It's also true that the behavior is somewhat filesystem-specific (some keep the data in the page cache but marked as "not dirty", some discard the data, ...). That makes testing pretty hard.


Love that FreeBSD is doing things right - and has been for 20 years.

https://wiki.postgresql.org/wiki/Fsync_Errors


2007, Linus rant: https://lkml.org/lkml/2007/1/10/233

  The right way to do it is to just not use O_DIRECT. 

  The whole notion of "direct IO" is totally braindamaged. Just say no.

    This is your brain: O
    This is your brain on O_DIRECT: .
    Any questions?

  I should have fought back harder. There really is no valid reason for EVER
  using O_DIRECT. You need a buffer whatever IO you do, and it might as well
  be the page cache. There are better ways to control the page cache than
  play games and think that a page cache isn't necessary.

  So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
  instead.
2019, how things are:

    This is your brain: O
    This is your brain on O_DIRECT: .
    And... this is your brain when cached: ?!

  The right way to do it is to just use O_DIRECT.
  The whole notion of "kernel IO" is fsync and games. Just say no.


This is one reason we choose to use Windows Storage Spaces Direct and transactional NTFS. They really are better.

https://docs.microsoft.com/en-us/windows-server/storage/stor...


So if fsync fails, what is one supposed to do? You can't retry it, and you don't know how much of the file has been synced.

Is the only feasible option to create a completely new file and retry writing there? And if that fails, your disk is probably bust or ejected, which should require user interaction about the new file location anyway. Doesn't seem too unreasonable?

This would require you to have the complete file contents elsewhere so you can rewrite it. Or would it still be possible to read the original contents from the unflushed buffer? And in the disk ejected+remounted case, the old contents should still be there intact thanks to ext4 journaling?
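
For what it's worth, here's a sketch of the "write a completely new file and swap it in" pattern (hypothetical paths, assumed to live in the current directory; error handling abbreviated). A failed fsync then just means the temp file gets thrown away and the old contents survive:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *tmp,
                            const void *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);               /* give up; the original is untouched */
            return -1;
        }
        close(fd);

        if (rename(tmp, path) != 0)    /* atomic swap within one filesystem */
            return -1;

        int dirfd = open(".", O_RDONLY);   /* persist the rename itself */
        if (dirfd >= 0) { fsync(dirfd); close(dirfd); }
        return 0;
    }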


I didn't read the entire thread, so maybe this was answered: has anyone actually made a system that's "fully" correct with regard to file system errors? Most people throw them away, but even programs that try to account for them get them wrong on some system (or the system changes behavior out from under them…). Is there a library that does this?


FreeBSD gets the kernel side correct (dirty unwritable blocks continue to report IO errors, including to fsync()) so you can at least build reliable libraries and applications on top of it.


Solaris, from which came OpenSolaris, which was forked into illumos, from which SmartOS is built.

fsync(2) there does exactly what POSIX says: it will only return success if the write safely made it to disk.

This is the #1 reason why all my infrastructure runs on a combination of Solaris and SmartOS and the primary reason why I don't run GNU/Linux on anything that's mine. #2 reason is that Solaris / illumos and therefore SmartOS will not overcommit memory, whereas in GNU/Linux this must be explicitly disabled, since overcommit is enabled by default.


SQLite has put immense effort into this; it's a better fopen!

That said, my understanding is that handling this with cross-platform, POSIX-only code is mostly impossible due to behavior like what's described in the article.



