The part I found most interesting was where the ext4 maintainer writes:
> The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon.
He later says that the netlink channel stuff was never submitted to the upstream kernel.
It all feels like a situation where the people maintaining this code knew deep down that it wasn't really up to scratch, but were sufficiently used to workarounds ("use direct io", "scrape dmesg") that they no longer thought of it as a problem.
It's an ops-level solution. The one thing you never really do, as an ops person, is reach into the black-box components that make up your infrastructure to fix them at an architectural level. That's not your job. The system is already in production; it's your job to make it run, without changing it—to shim around the black boxes to make them work despite their architectural flaws.
Sure, you can file a ticket with the dev team about how dumb you think an architectural choice was—but, in most businesses, most of the black-box components you're working with are third-party, so filing a ticket isn't likely to get you much. (Google is an exception, but even Google has to hire their SREs from somewhere, and they'll come in enculturated to the "it's all black boxes and we're here to ziptie them together" mindset.)
And, even in an environment like Google's where you have access to the dev-teams of every component you're running, dev cycles are still longer than ops SLA deadlines. Ops solutions are chosen because they're quick to get into production. (Component can't handle the load because it's coded poorly? Replicate it and throw a load balancer in front! Five minute job.) So even if you do file that ticket, you've still got to solve the problem in the here-and-now. And once you do solve the immediate problem, it's no longer a hair-on-fire problem, so that ticket isn't going to be very high-priority to fix.
I'm sorry, but just no. That's far too strong a general statement. I've done quite a bit of sysadmin/devops-style work over the years, and have also worked with many other people who've sent various kinds of fixes upstream. Sure, third-party software isn't always easy to fix, but it depends on what the alternatives are. You probably need to do an immediate workaround as well.
I would say that's an important part of why open source has had such a strong following among sysadmins: the ability to fix things. One could even say that diving into large pieces of software and exploring issues is the most rewarding part of the job. Just don't tell them I said that.
Or something? More seriously, relevant English idioms:
- A stitch in time saves nine;
- Something worth doing's worth doing well;
- If you want it done right, do it yourself;
and surely more. I'm certain there's a farming-analogy one about fixing something sooner rather than later, but it's escaped me.
Cunningham's Law: "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer" 
"Broke gets fixed, crappy is forever" 
I feel like there should be a third, but that's what I've got.
I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.
It takes significant time, but after following this practice for a while, things start working reliably.
I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.
In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.
> I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.
> It takes significant time, but after following this practice for a while, things start working reliably.
> I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.
> In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.
Do you then submit your fixes back to upstream? This is the critical element I think.
It makes sense to, otherwise you're stuck forward-porting your private patches forever. Or worse, get stranded on an old version because there's no-one around who can do that forward-porting anymore...
But Google was also employing the ext4 maintainer, and fixing those components was presumably part of his job.
A lot of patches for open source projects come from companies like this. Is there also a lot of working around buggy software? Yes. It doesn't have to be that way, though, and if you've gotten to the point of having devs for that software on staff, it really shouldn't be: if the people who are financially motivated to care don't fix it, who will?
Sure, but what you're talking about here is more "the developers of your software, solving problems in the composed release of the software, by vendoring + patching infrastructural dependencies."
Like, for example: my software depends on Postgres. So I have a copy of Postgres running on my workstation, to develop against. It's built from source. If I find a bug in Postgres, I can fix it in the source, submit a PR, and in the meantime package this fork of Postgres into the application's Docker image in place of the mainline one.
This is basically the same as saying "my project depends on ecosystem-package libfoo. I can vendor a copy of libfoo to fix a problem with libfoo."
Either way, that's still something only the developer of the software can really do (on any kind of a practical time-scale.) Changing the software from talking to an external Postgres, to talking to one packaged inside the Docker image? That's an internal architectural change that requires rewriting the software's configuration handling logic and build scripts at the very least, and probably also adding some sort of init daemon to the resulting Docker image. It's a software change. It's not something an ops person should be spending their time getting up-to-speed enough on a project to be able to do. It's the thing they file a ticket to ask the developer to do.
If you have an infinite budget, by all means, hire developers whose jobs are to be prophylactic for ops tickets, by paying down tech-debt. But you'll have to guard that department with your life, because it'll be the first thing to be cut the moment anyone's looking to grow margin by shaving costs.
That doesn't mean no duct-tape solutions get used, but the ones that cause problems get removed.
A good filesystem maintainer is a precious resource; wasting them on running stuff is madness. (I say that as a devop.) I know a fair amount about running/configuring/tinkering with filesystems (hell, I've bumped into at least 3 XFS bugs that were new to Red Hat). But I would never presume to be able to make a meaningful patch.
It's the same way that I'd not expect the ext4 maintainer to know how to instrument, gather metrics on, and monitor a large company's infrastructure.
Almost certainly they would have flung a ticket over the fence, and it was up to politics to make it work.
From the position in the pipeline where ops people operate, it's more work to figure out how to fork a build of software X, patch it, build it, and get out the same kind of artifact your deploy pipeline is expecting, hosted in the same place, than it is to just throw more spaghetti at the wall to solve the problem.
Half of the reason behind Google's monorepo, and the focus on Golang, is to empower SREs to patch broken components when possible, by ensuring there's only One True Way to do that, and only one language (without weird per-project DSLs) to know. But even that doesn't completely fix the problem, since you still need to understand the architecture of the software project, and sometimes that's the hardest part.
No, that's not the way it works. In virtually all circumstances, that second write is going to fail too, because the machine done broke! And even if it works, you have a situation where the machine broke and something somewhere worked around it, and you need to be screaming like crazy for an operator to fix it. That's not an API situation, that's a deployment thing, and it's exactly the regime where Google applied the protocols that you don't seem to like.
That's not to say that the linux fsync behavior is correct here. I think clearing the error having reported it once is kinda broken too. But... fixing that doesn't actually "fix" the real problem.
Really the best suggestion in that thread was up at the top, where it was pointed out that ext3/4 support remounting the filesystem read-only on errors, which sounds like a great policy to me.
That way the checkpoint wouldn't complete, and the relevant write-ahead log files wouldn't get deleted (and "checkpoints aren't happening and WAL is piling up" is the sort of thing people already monitor).
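For reference, that policy is just a mount option on ext4; here's a minimal sketch of enabling it, with a hypothetical device and mount point:

```
# /etc/fstab: remount the filesystem read-only on any error,
# rather than carrying on against a failing disk
/dev/sda1  /var/lib/postgresql  ext4  defaults,errors=remount-ro  0  2
```

The same behavior can also be made the superblock default with `tune2fs -e remount-ro /dev/sda1`.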
Exactly! So it's a devops solution either way. The choice is between one based on system-level monitoring and one based on tool-level error reporting. In neither case does a non-broken fsync() actually fix anyone's problem.
My point was that I don't see tools like remount-ro and dmesg monitoring as "workarounds" for an underlying problem. They're correct solutions implemented at the right level of abstraction. They just happen not to be shipped with PostgreSQL, which... I mean, yeah, basically I don't care much either way.
The system's broken either way, but failure modes matter!
With fsync() reporting EIO only once (one time, to one caller), callers that don't expect this behavior (and _no_ callers were written to expect this POSIX-violating, undocumented behavior AFAIK) could corrupt data if writes intermittently succeeded. E.g. marking a segment of WAL as fully applied when it wasn't, not only throwing away durability for that transaction but corrupting the whole database. And maybe an error is reported now and then, but that's easy to miss.
With fsync() consistently reporting EIO, the system never reports a transaction as fully committed then loses its contents, and the existing database isn't corrupted. And someone will notice the problem. mjw1007 worded it perfectly: "'checkpoints aren't happening and WAL is piling up' is the sort of thing people already monitor".
So yes, a non-broken fsync() doesn't magically fix your damaged platter, but it's worlds better.
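To make that failure mode concrete, here's a minimal sketch (mine, not actual PostgreSQL code) of a retry loop that looks reasonable but is defeated by the report-once behavior:

```c
#include <errno.h>
#include <unistd.h>

/* Returns 0 if we believe the data is durable on disk. */
int flush_segment(int fd)
{
    if (fsync(fd) == 0)
        return 0;

    if (errno == EIO) {
        /* A natural-looking retry -- but the failing call above
         * already cleared the kernel's error state, so on affected
         * kernels this second fsync() returns success even though
         * the dirty pages may never have reached the disk. */
        if (fsync(fd) == 0)
            return 0;   /* segment wrongly treated as durable */
    }
    return -1;
}
```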
You have to do system-level reporting of errors, full stop. You can do that with a fixed fsync. You can do it with dmesg. You can do it even with the current fsync behavior. Pick one. Arguing about fsync is pontificating over the wrong problem.
FWIW: I'm all but certain you're wrong about the "POSIX-violating" part. My read of the standard says the author was thinking only about the current request. There's certainly no language in there saying that fsync should return EIO because of a previous failure, nor even about what the state of the system (cf. remount-ro) should be following the first one.
> FWIW: I'm all but certain you're wrong about the "POSIX-violating" part. My read of the standard says the author was thinking only about the current request.
An fsync call is not a complete request (it doesn't specify the data to be written), so what on earth does "only about the current request" mean?
I can't reconcile any of these arguments that Linux's behavior was correct with the RATIONALE section here: http://pubs.opengroup.org/onlinepubs/9699919799/functions/fs... and particularly the phrase "all data up to the time of the fsync() call is recorded on the disk". There's no mention of "since a previous fsync() call" or any such thing there.
Why? As reported above, Google itself isn't even doing that. We're not talking about a network timeout here. Hardware failures are hardware failures and not in the general case recoverable by software no matter how strongly held the opinions are about how that software should work.
The fsync behavior on Linux was surprising and wrong. But fixing that won't fix any of the stuff you're yelling about. API hygiene can't fix machines, and the machine broke.
I work at "Google itself". Typically there's a single process responsible for all the disk writes of significance on the machine, possibly bypassing the in-kernel filesystem entirely in favor of direct block device access. That process's caller waits until three machines (each with its own battery backup, and frequently on a separate PDU from the others) have acked the write before proceeding. Also, that process's caller is Spanner, and those three machines are necessary to say a single replica has accepted the write. Typically there are three replicas in different clusters (often in separate cities) and two of them have to accept the write (in Paxos terminology) before Spanner tells the application it's complete.
Please don't take Google's production environment working successfully most of the time from the customer viewpoint as proof of anything about the Linux virtual filesystem layer or ext4. You can paper over a lot of brokenness when you have six independent battery-backed copies of the data in two cities before you depend on it.
> We're not talking about a network timeout here. Hardware failures are hardware failures and not in the general case recoverable by software no matter how strongly held the opinions are about how that software should work.
Software can't fix the broken hardware, and (without significant redundancy and cross-verification as Google does) can't correct for hardware simply lying. But correctly written software can absolutely change the likely failure modes when hardware reports error. And intermittent hardware failures happen a lot more than you might think anyway. Bugs that turn intermittent failure into permanent corruption are nasty.
"The read() function reads data previously written to a file."
And again, speaking as a systems hacker: it broke. You can't "deal with" broke. And the system absolutely did tell you it broke. I genuinely don't understand the insistence by folks on this list that OMG THE ERROR MUST BE HANDLED IN THE DATABASE SOFTWARE. It seems kinda ridiculous to me, honestly.
It seems that the developers of database software and of kernels/filesystems had different understandings of what should happen to error reporting in those cases (and maybe some more that I'm forgetting). PostgreSQL and FreeBSD happened to agree, and all of these scenarios do the right thing (IMHO, but you could call my opinion not impartial, since I am involved in both of those projects). Linux has had various different behaviours since Jeff Layton realised how terrible it all was and started improving it, starting from (1), which were straight-up bugs. Some problems still exist. OpenBSD also made recent changes due to the noise generated by this stuff. So I genuinely don't understand how anyone can argue that there was nothing wrong, or imply that it's not complicated: the maintainers of the software in question apparently disagreed.
In my humble opinion someone should go and talk to http://www.opengroup.org/austin/ and try to get some increased clarity here for POSIX 2022. Maybe I will find the energy if someone doesn't beat me to it.
By "I will deal with it", I didn't mean I can fix a hosed server, I mean something like "the database will fail over to another node, or enter recovery" or something like that. And, since we (and other DBs who apparently read our mailing list) introduced panic-on-fsync-failure, that's what we do. There are still a few edge cases to fix, though. We're on it...
> A crazy idea would be to have a daemon that checks the [kernel] logs and stops Postgres when it sees something wrong.
> [..] you could have stable log messages or implement some kind of "fsync error log notification" via whatever is the most sane way to get this out of kernel.
Only the Most Perfect Disk Pack Need Apply
One common problem with Unix is perfection: while offering none of its own, the operating system demands perfection from the hardware upon which it runs. That's because Unix programs usually don't check for hardware errors--they just blindly stumble along when things begin to fail, until they trip and panic. (Few people see this behavior nowadays, though, because most SCSI hard disks do know how to detect and map out blocks as the blocks begin to fail.)
In recent years, the Unix file system has appeared slightly more tolerant of disk woes simply because modern disk drives contain controllers that present the illusion of a perfect hard disk. (Indeed, when a modern SCSI hard disk controller detects a block going bad, it copies the data to another block elsewhere on the disk and then rewrites a mapping table. Unix never knows what happened.) But, as Seymour Cray used to say, ''You can't fake what you don't have.'' Sooner or later, the disk goes bad, and then the beauty of UFS shows through.
And the followup coverage from LSFMM summit (linked also in the OP discussion): https://lwn.net/Articles/752613/
"Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued."
When you have plainly not actually verified that it is the case.
Instead, you could say, "doesn't POSIX say...?" This has the following benefits: you avoid egg on your face, the conversation takes on a less aggressive tone, and problems get resolved quicker.
In any case, the problem is with the attitude and how opinions are expressed, not the exact state of which opinion is being expressed. :) Also, it is plainly not "entirely clear" since we two apparently reasonable people disagree on what the sentence means. So even if the person is right about what was ultimately intended when the spec was written, they are still incorrect in their view that it's clear what POSIX says.
That is actually an issue with disks which have a small memory buffer in front of persistent storage: not only will they acknowledge success as soon as the data's in their buffer (losing any error from the underlying storage write), but they may also mishandle a power loss, ack'ing the write into the buffer and then losing it before it has time to hit storage.
I recently found out that syncfs() doesn't return an error at all in most cases (through data loss :/). It's being worked on ... https://lkml.org/lkml/2018/6/1/640
It's astonishing that such critical issues are still present in such a widely used piece of software.
The problem here isn't that fsync() returns failure; it's that it always returns success, even if it failed.
It doesn't (except on the most broken of old Linux kernels, and even then it mostly lost async write errors). Rather, it's that on most systems (basically all of them except FreeBSD and Illumos, and possibly OpenBSD after some recent changes) fsync will only report errors once: the failing call clears the error flag, and subsequent fsync calls will succeed (unless new errors have occurred).
Basically, you can only rely on fsync reporting errors that have happened since the last fsync, which is obviously concerning for all sorts of reasons (not least being concurrent updates).
They also mention a bit about fsync not necessarily reporting errors from before the file was opened.
Now imagine asking him to support being thrown balls randomly, at any interval. He may make it work, but I'd imagine he will stutter a bit.
In my experience, anytime you interrupt the page cache's normal routine, it stutters everything. I've seen the "sync" command freeze my Ubuntu machine (music player, GUI, etc).
I work on embedded devices, and my employer wanted to reduce the window during which data loss could happen for 30 MB+ files (video capture). It wasn't a supported use case. "But why not! It makes our product theoretically better!" I put my foot down: we aren't touching the page cache until there is a clear benefit to the user. It almost got me fired, but good riddance if so.
AIUI what they'd ideally like is "sync this file reasonably soon, and let me know when it's finished".
You will still need a final fsync to also get the metadata written, but that should be faster at that point.
The former is for a single file, the others are for one/all filesystems
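On Linux, the "start the flush early, confirm it later" pattern described above maps roughly onto sync_file_range(2). A sketch, with the caveat that sync_file_range by itself says nothing about durability or metadata:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int write_then_flush(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;

    /* Kick off asynchronous writeback now so the final fsync() has
     * less to do (nbytes == 0 means "to the end of the file").
     * This is purely advisory; it reports nothing about durability. */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

    /* ... other work ... */

    /* Still the only call that confirms the data and metadata are
     * actually down on disk. */
    return fsync(fd);
}
```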
This February: https://news.ycombinator.com/item?id=19119991
If the kernel can't flush dirty write buffers, maybe it's time to send up a flag and panic in the kernel itself?
Being unable to write to a disk is a recoverable scenario, especially in some conditions. The most common cause of disk write failures is "someone yanked the flash drive out of the computer," and the recovery is "pop up a dialog telling the user to put it back in."
I.e. either the write eventually succeeds, or all other writes will block or fail until it does.
Anyway, given such a choice, as a user/admin I would prefer an instant failure with a panic message over a corrupted 10 GB online client database with no clear recovery path. I'm always scared by this kind of story, where something underneath fails to fail loudly and I never even hear about it.
Redoing failed writes should be the method that ends in the least amount of corruption.
I don't care if I lose the last few hours of my local work, since that time is buffered in a project or is not precious at all (like game saves). That sets me back a little, but not anyone else. I also don't really care about losing downloads or experimental files. But when it comes to a business database or a project repository, the value can be orders of magnitude bigger, and can cost the responsible person their job at the very least.
“Redoing failed writes should be the method that ends in the least amount of corruption.”
But in fact it led to a database corruption that was only fixed by some black magic. As far as I understood, they did a manual rollback of the failed writes. I wouldn't blame anyone for the bug in question even if I'd hit it myself: writing low-level software is hard, the behavior was unintended, and Linux, XFS and Postgres are all free as in beer and free of any charge. But when it happens and goes undetected for longer than the recovery window allows, the costs are real. That's why this trade-off is easy to make from a distance.
You're saying you had a system that did redo failed writes to disk, unlike other systems discussed so far, and it managed to corrupt your database anyway?
> ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().
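A user-space mitigation (a sketch; note that on filesystems without real preallocation support, glibc's posix_fallocate falls back to writing zeroes, which has its own problems) is to reserve the space up front so writeback can't hit ENOSPC later:

```c
#include <fcntl.h>

/* Reserve len bytes so subsequent buffered writes within that range
 * can't fail with ENOSPC at writeback time.  posix_fallocate returns
 * an error number directly (not -1/errno). */
int reserve_space(int fd, off_t len)
{
    return posix_fallocate(fd, 0, len);
}
```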
Maybe something like "make the mount read-only, ensure all open fds on the device return nothing but errors, and then dump the dirty blocks to swap in some kind of recovery log that could potentially be applied to the disk".
Unfortunately I can't find the original comment, but I think it was on another HN story about Postgres+fsync.
Edit: I found it: https://news.ycombinator.com/item?id=19126601 and I may or may not have hallucinated the part about Atlassian.
As anarazel already pointed out in the hackernews thread you linked, there's one person who was initially skeptical about the patch, and a couple of people chimed in with feedback (e.g. pointing out some mount options may fix it, which I think is valuable). And then the author of the patch just disappears from the discussion, so the thread kinda gets forgotten for a couple of months.
I'm sure PostgreSQL community has its flaws, but I doubt this is a particularly clear example ...
You're right though: you really do want to have some idea what you're doing, if you're going to go there.
Source: my day job is PostgreSQL DBA, and has been for ~15 years now.
As an FYI - we no longer use Postgres on NFS as we have migrated the entire architecture underlying the systems in question.
PostgreSQL now PANICs on I/O errors during fsync, forcing a recovery.
If you're not robust against that, then when things like fsync fail you'll lose availability and/or data.
Even though Linux’s fsync behavior is clearly broken, it is far from the craziest behavior I’ve seen from the I/O stack.
Anyway, the main lesson here is that untested error handling is worse than no error handling. They should have figured out how to test that this path actually proceeds correctly (on real, intermittently failing hardware) or just panicked the process.
From https://lwn.net/Articles/752063/: "Linux is not unique in behaving this way; OpenBSD and NetBSD can also fail to report write errors to user space. [...] If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else [...] a fair amount of attention was paid to the idea that write failures should result in the affected pages being kept in memory, in their dirty state. But the PostgreSQL developers had quickly moved on from that idea and were not asking for it".
I think there's still a fair number of PostgreSQL developers who think the way Linux kernel behaves makes the fsync() API rather difficult to use for anything but the simplest scenarios.
The reason the community decided to accept it was the realization that there's about a 0.001% chance of convincing kernel devs to change it, and the fact that we'd still have to deal with existing kernels for the foreseeable future.
The fact is that often the I/O issues are temporary, and those situations are becoming more and more common (think running out of disk space with thin provisioning, networking issues with NFS, etc.). So it might be quite valuable to handle those issues gracefully, without essentially crashing the database (which is pretty much what PANIC does).
> Even though Linux’s fsync behavior is clearly broken, it is far from the craziest behavior I’ve seen from the I/O stack.
"clearly broken" might be bit too harsh, but it certainly makes it way harder to use.
> Anyway, the main lesson here is that untested error handling is worse than no error handling. They should have figured out how to test that this path actually proceeds correctly (on real, intermittently failing hardware) or just panicked the process.
That's true, of course. It's a sad fact of life that error paths are the least-tested part of almost any code base. It's however also true that testing I/O errors is pretty difficult to do (especially before the "error" device-mapper target existed), and a significant part of the fsyncgate was that the kernel was not reporting errors reliably. It's also true that the behavior is somewhat filesystem-specific (some keep the data in the page cache but marked as not dirty, some discard the data, ...). That makes testing pretty hard.
The right way to do it is to just not use O_DIRECT.
The whole notion of "direct IO" is totally braindamaged. Just say no.
This is your brain: O
This is your brain on O_DIRECT: .
I should have fought back harder. There really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary.
So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
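A sketch of what that advice might look like in practice (my example, not from the thread): write through the page cache, make the data durable, then hint that the cached pages can be evicted:

```c
#include <fcntl.h>
#include <unistd.h>

int write_and_drop_cache(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0)      /* durability first... */
        return -1;
    /* ...then advise the kernel that the whole file's pages
     * (offset 0, len 0 = to EOF) won't be needed again soon. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```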
This is your brain: O
This is your brain on O_DIRECT: .
And... this is your brain when cached: ?!
The right way to do it is to just use O_DIRECT.
The whole notion of "kernel IO" is fsync and games. Just say no.
The only feasible option is to create a completely new file and retry writing there? And if that fails, your disk is probably bust or ejected, which should require user interaction about the new file location. Doesn't seem too unreasonable?
This would require you to have the complete file contents elsewhere so you can rewrite them? Or would it still be possible to read the original contents from the unflushed buffer? And in the disk ejected-and-remounted case, the old contents should still be there intact thanks to ext4 journaling?
fsync(2) there does exactly what POSIX says: it will only return success if the write safely made it to disk.
This is the #1 reason why all my infrastructure runs on a combination of Solaris and SmartOS and the primary reason why I don't run GNU/Linux on anything that's mine. #2 reason is that Solaris / illumos and therefore SmartOS will not overcommit memory, whereas in GNU/Linux this must be explicitly disabled, since overcommit is enabled by default.
That said, my understanding is that handling this with cross-platform, POSIX-only code is mostly impossible due to behavior like what's described in the article.