
Fsyncgate: Errors on fsync are unrecoverable (2018) - Darkstryder
https://danluu.com/fsyncgate/
======
mjw1007
The other half of the thread was on linux-ext4, starting here:
[https://lists.openwall.net/linux-ext4/2018/04/10/33](https://lists.openwall.net/linux-ext4/2018/04/10/33)

The part I found most interesting was here: [https://lists.openwall.net/linux-ext4/2018/04/12/8](https://lists.openwall.net/linux-ext4/2018/04/12/8)

where the ext4 maintainer writes:

« The solution we use at Google is that we watch for I/O errors using a
completely different process that is responsible for monitoring machine
health. It used to scrape dmesg, but we now arrange to have I/O errors get
sent via a netlink channel to the machine health monitoring daemon. »

He later says that the netlink channel stuff was never submitted to the
upstream kernel.

It all feels like a situation where the people maintaining this code knew deep
down that it wasn't really up to scratch, but were sufficiently used to
workarounds ("use direct io", "scrape dmesg") that they no longer thought of
it as a problem.

~~~
derefr
> It all feels like a situation where the people maintaining this code knew
> deep down that it wasn't really up to scratch, but were sufficiently used to
> workarounds ("use direct io", "scrape dmesg") that they no longer thought of
> it as a problem.

It's an ops-level solution. The one thing you never really do, as an ops
person, is reach into the black-box components that make up your
infrastructure to fix them at an architectural level. That's not your job. The
system is already in production; it's your job to _make it run_ , without
changing it—to shim _around_ the black boxes to make them work _despite_ their
architectural flaws.

Sure, you can file a ticket with the dev team about how dumb you think an
architectural choice was—but, in _most_ businesses, _most_ of the black-box
components you're working with are third-party, so filing a ticket isn't
likely to get you much. (Google is an exception, but even Google has to hire
their SREs from somewhere, and they'll come in enculturated to the "it's all
black boxes and we're here to ziptie them together" mindset.)

And, even in an environment like Google's where you have access to the dev-
teams of every component you're running, dev cycles are still longer than ops
SLA deadlines. Ops solutions are chosen because they're quick to get into
production. (Component can't handle the load because it's coded poorly?
Replicate it and throw a load balancer in front! Five minute job.) So even if
you do file that ticket, you've still got to solve the problem in the here-
and-now. And once you _do_ solve the immediate problem, it's no longer a hair-
on-fire problem, so that ticket isn't going to be very high-priority to fix.

~~~
xorcist
> The one thing you never really do, as an ops person, is reach into the
> black-box components

I'm sorry, but just no. That's far too strong a general statement. I've done
quite a bit of sysadmin/devops-style work over the years, and have also worked
with many other people who've sent various kinds of fixes upstream. Sure, third-
party software isn't always easy to fix, but it depends on what the
alternatives are. You probably need to do an immediate workaround as well.

I would say that's an important part of why open source has had such a strong
following among sysadmins: the ability to fix things. One could even say that
diving into large pieces of software and exploring issues is the most
rewarding part of the job. Just don't tell them I said that.

~~~
zwischenzug
Usually I've ventured into the code and made a bad enough attempt at a fix
that the devs are galvanised into doing it right. Filing a ticket usually
isn't enough to get your voice heard once things have been worked around.

~~~
stallmanite
I like this concept. I wonder if there’s a word for it. Any Germans want to
help out with a fifteen syllable wonder?

~~~
OJFord
Tikettfileninsufficientstadtdelvencodebasefixen.

Or something? More seriously, relevant English idioms:

\- A stitch in time saves nine;

\- Something worth doing's worth doing well;

\- If you want it done right, do it yourself;

and surely more. I'm certain there's a farming-analogy one about fixing
something sooner rather than later, but it's escaped me.

~~~
twic
So, we've got:

Cunningham's Law: "the best way to get the right answer on the internet is not
to ask a question; it's to post the wrong answer" [1]

"Broke gets fixed, crappy is forever" [2]

I feel like there should be a third but that's what i've got.

[1] [http://fed.wiki.org/journal.hapgood.net/cunninghams-law/fora...](http://fed.wiki.org/journal.hapgood.net/cunninghams-law/forage.ward.fed.wiki.org/cunninghams-law)

[2] [https://dandreamsofcoding.com/2013/05/06/broke-gets-fixed-cr...](https://dandreamsofcoding.com/2013/05/06/broke-gets-fixed-crappy-is-forever/)

~~~
Izkata
Maybe something about the dummy fix polluting the lead dev's code aesthetics?

------
verisimilitudes
I sure wonder how IBM mainframes and other computer systems handle this
intractable failure case. Joking aside, here's an excerpt from ''The UNIX-
HATERS Handbook'':

 _Only the Most Perfect Disk Pack Need Apply_

One common problem with Unix is perfection: while offering none of its own,
the operating system demands perfection from the hardware upon which it runs.
That's because Unix programs usually don't check for hardware errors--they
just blindly stumble along when things begin to fail, until they trip and
panic. (Few people see this behavior nowadays, though, because most SCSI hard
disks do know how to detect and map out blocks as the blocks begin to fail.)

...

In recent years, the Unix file system has appeared slightly more tolerant of
disk woes simply because modern disk drives contain controllers that present
the illusion of a perfect hard disk. (Indeed, when a modern SCSI hard disk
controller detects a block going bad, it copies the data to another block
elsewhere on the disk and then rewrites a mapping table. Unix never knows what
happened.) But, as Seymour Cray used to say, ''You can't fake what you don't
have.'' Sooner or later, the disk goes bad, and then the beauty of UFS shows
through.

~~~
ds2643
lol hi alex

------
AnssiH
Related article: PostgreSQL's fsync() surprise
[https://lwn.net/Articles/752063/](https://lwn.net/Articles/752063/) (April
18, 2018)

And the followup coverage from LSFMM summit (linked also in the OP
discussion):
[https://lwn.net/Articles/752613/](https://lwn.net/Articles/752613/)

~~~
beering
Upvoted this because the LWN article seems to give a much better picture of
the what and why, including key kernel devs explaining what tgl called "kernel
brain damage". The posted danluu.com link seems to be just the pgsql mailing
list.

------
asdfasgasdgasdg
One thing I took away from this thread: when someone tells you something
surprising, it's best to ask, rather than deny. It doesn't look good to say
things like:

"Moreover, POSIX is entirely clear that successful fsync means all preceding
writes for the file have been completed, full stop, doesn't matter when they
were issued."

When you have plainly not actually verified that this is the case.

Instead, you could say, "doesn't Posix say . . .?" This has the following
benefits: you avoid egg on your face, the conversation takes on a less
aggressive tone, and problems are resolved quicker.

~~~
mehrdadn
100% agree with that takeaway in general, except that in this case I think
POSIX does actually say this. I pointed out and discussed this a few months
ago [1] (and it's already discussed on the page as well), but basically, those
who disagree believe that fsync's task is to merely "send" the data to the
storage device, not actually make sure it's written persistently... which I
see as being blatantly inconsistent with "to assure that after a system crash
or other failure that _all data_ up to the time of the fsync() call is
recorded to the disk". There's nothing on the page that says "data since the
_last call_ to fsync", and that sentence says that was definitely not the
intention, and yet somehow that's how people read it to support the notion
that current implementations are correct (or vice-versa: they use current
implementations as evidence that this is the correct reading).

[1]
[https://news.ycombinator.com/item?id=19128228](https://news.ycombinator.com/item?id=19128228)

~~~
asdfasgasdgasdg
Perhaps? The people in the thread seem fairly convinced that's not what it
says. When I read it, I see "... all data for the open file descriptor named
by fildes is to be _transferred_ to the storage device ..." So, in the event
of a failure in the underlying hardware, the data may have been transferred,
but subsequently an error has occurred. I don't see how the sentence obliges
the implementation to transfer multiple times. I'm not an expert though.

In any case, the problem is with the attitude and how opinions are expressed,
not the exact state of which opinion is being expressed. :) Also, it is
plainly not "entirely clear" since we two apparently reasonable people
disagree on what the sentence means. So even if the person is right about what
was ultimately intended when the spec was written, they are still incorrect in
their view that it's _clear_ what POSIX says.

~~~
scottlamb
Then the sentence's language can be improved, if a sufficiently determined
kernel developer (not sure I'd say "reasonable", but whatever) can misread it
in that way (while ignoring the entire "RATIONALE" section). But what
value is a guarantee that it's _transferred_ to the storage device and not
that it's actually committed to permanent storage? If someone's best argument
is that they correctly provide a completely useless guarantee and that there's
nothing that provides the guarantee that people actually need for real work,
they're on pretty shaky footing.

~~~
asdfasgasdgasdg
Presumably people will not buy permanent storage that does not commit data
that is transferred to it? The thing is, the fsync guarantee isn't useless.
You "just" have to rewrite any data that you had previously written since the
last sync, and be tolerant of partially committed data in the interim.

~~~
gizmo686
What if the hardware has its own notion of sync. I am not familiar with disk
IO protocols, but I assume they are permitted to store writes in an internal
buffer. In such a case, I would expect fsync to issue a sync request to the
device and wait until the device reports that the sync is finished.
Admittedly, the phrasing of the spec is not 100% clear on this point, but I
think the right reading is that fsync should wait until the device actually
commits the data (or at least claims to; at some point broken/lying hardware
is just broken).

~~~
scottlamb
Your assumption is correct.

------
patrec
It's kinda depressing that, after a quarter of a century of work, something in
the region of a hundred thousand dev-years, and billions of dollars of
investment, the world's most used operating system still fails at the most
basic tasks, like reliably writing files or allocating memory.

------
based2
PostgreSQL will now PANIC on fsync() failure

[https://wiki.postgresql.org/wiki/Fsync_Errors](https://wiki.postgresql.org/wiki/Fsync_Errors)

[https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1)

[https://lwn.net/Articles/752063/](https://lwn.net/Articles/752063/)

------
tobias3
Well, at least in this case one can abort (or go into some read-only mode) in
case of fsync() returning failure. With most storage media that is the correct
thing to do anyway. Having multiple processes and having the fsync() error
returned to only one of them is problematic though.

I recently found out that syncfs() doesn't return an error at all in most
cases (through data loss :/). It's being worked on ...
[https://lkml.org/lkml/2018/6/1/640](https://lkml.org/lkml/2018/6/1/640)

It's astonishing that such critical issues are still present in such a widely
used piece of software.

~~~
techslave
> Well, at least in this case one can abort (or go into some read-only mode)
> in case of fsync() returning failure.

the problem here isn’t if fsync() returns failure. it’s that it always returns
success, even if it failed.

~~~
masklinn
> it’s that it always returns success, even if it failed.

It doesn't (except on the most broken of old Linux kernels, and even then it
mostly lost async write errors). Rather it's that on most systems (basically
all of them except FreeBSD and Illumos, possibly OpenBSD after some recent
changes) fsync will only report errors _once_ , but that call will clear the
flags and subsequent fsync calls will succeed (unless new errors have
occurred).

Basically, you can only rely on fsync reporting errors that have happened
since the last fsync, which is obviously concerning for all sorts of reasons
(not least being concurrent updates).

They also mention a bit about fsync not necessarily reporting errors from
before the file was opened.

------
pknopf
Imagine a juggler, juggling 5 balls at a time. At a set period of time, he
will drop one ball and accept another ball thrown at him. He handles this very
well because there is order/cadence.

Now imagine asking him to support being thrown balls randomly, at any
interval. He may make it work, but I'd imagine he will stutter a bit.

In my experience, anytime you interrupt the page cache's normal routine, it
stutters everything. I've seen the "sync" command freeze my Ubuntu machine
(music player, GUI, etc).

I work on embedded devices, and my employer wanted to reduce the window in
which data loss could happen for 30 MB+ files (video capture). It wasn't a
supported use case: "But why not! It makes our product theoretically better!"
I put my foot down. We aren't touching page-cache until there is a clear
benefit to the user. It almost got me fired, but good riddance if so.

~~~
mjw1007
The annoying thing is that the Postgres people don't even really need "sync
this file right now" for the main data files.

AIUI what they'd ideally like is "sync this file reasonably soon, and let me
know when it's finished".

~~~
the8472
io_uring supports issuing async sync_file_range requests, which can be used to
do most of the heavy lifting.

You will still need a final fsync to also get the metadata written, but that
should be faster at that point.

~~~
cyphar
It should be noted that io_uring landed (in 5.1) a fair while after the whole
Postgres back-and-forth started in 2018.

~~~
masklinn
It should also be noted that even sync_file_range's manpage warns you against
using it, which tells you how reliable that thing is.

~~~
the8472
It's not that you shouldn't use it because it's dangerous in itself. It just
doesn't do what one would naively expect it to do. It's still useful as a
"flush dirty pages" hint while preparing for the real fsync.

------
dang
At the time:
[https://news.ycombinator.com/item?id=16974033](https://news.ycombinator.com/item?id=16974033)

This February:
[https://news.ycombinator.com/item?id=19119991](https://news.ycombinator.com/item?id=19119991)

[https://news.ycombinator.com/item?id=19238121](https://news.ycombinator.com/item?id=19238121)

------
a-dub
If fsync() fails isn't valhalla lost anyhow? If you can't write things down
because your pencil is broken, probably time to stop what you're doing and get
a new pencil.

If the kernel can't flush dirty write buffers, maybe it's time to send up a
flag and panic in the kernel itself?

~~~
jcranmer
> If the kernel can't flush dirty write buffers, maybe it's time to send up a
> flag and panic in the kernel itself?

Being unable to write to a disk is a recoverable scenario, especially in some
conditions. The most common cause of disk write failures is "someone yanked
the flash drive out of the computer," and the recovery is "pop up a dialog
telling the user to put it back in."

~~~
carbocation
Another situation: a GCSFuse volume dealing with temporary networking errors.

~~~
masklinn
Or NFS being NFS.

------
ChuckMcM
This is sad, I know that one large user of Linux found this problem in 2009 or
so and fixed it for the version of Linux they used in their fleet of servers.
I am surprised it didn't make it upstream from then.

~~~
beering
I don't know if it's the same person, but someone working on Atlassian's cloud
offerings said he reported this issue (along with a working patch) to Postgres
and they declined it. Sounds like what they do for cloud services is run a lot
of Postgres instances where the DB data is on a big NFS, and this bug was a
bigger problem due to how NFS works. But after patching the fsync handling in
Postgres, they continued using Postgres-on-NFS with great success.

Unfortunately I can't find the original comment, but I think it was on another
HN story about Postgres+fsync.

Edit: I found it:
[https://news.ycombinator.com/item?id=19126601](https://news.ycombinator.com/item?id=19126601)
and I may or may not have hallucinated the part about Atlassian.

~~~
dboreham
In case someone reads this and gets the impression that running PG over NFS is
in general safe or a good idea, I'm pretty sure it still isn't, unless like
the OP you have a complete understanding of what you're doing.

~~~
rosser
I have quite successfully run pg atop NFS — even, in limited, point-solution
type roles, in production. My experience doing that didn't leave me with the
impression that it was a particularly egregious thing to do, though I would
definitely take many, many additional steps to ensure redundancy and
availability if I were going to use it more generally.

You're right though: you really do want to have some idea what you're doing,
if you're going to go there.

Source: my day job is PostgreSQL DBA, and has been for ~15 years now.

EDIT: Phrasing.

~~~
macdice
I think the problem is mostly ENOSPC from fsync() which jettisons data just
like EIO on Linux. If you ran out of space, PostgreSQL would only learn about
that while checkpointing, and then retry and carry on. Boom, data loss. Today
PostgreSQL would panic on the first ENOSPC from fsync() so the problem is
mostly "fixed".

------
bsaul
I read the start and the end of the thread but couldn't get an understanding
of what the current situation is: did Linux update its fsync behavior? Does pg
now panic on Linux on the first fsync error?

~~~
pgaddict
Linux kernel behavior is still the same. The error-reporting issues (failure
to report I/O errors in various cases) have been fixed on recent kernels,
AFAIK.

PostgreSQL now PANICs on I/O errors during fsync, forcing a recovery.

------
hedora
Every computer component can fail in arbitrary ways, including drives.

If you’re not robust against that, then when things like fsync fail you’ll
lose availability and/or data.

Even though Linux’s fsync behavior is clearly broken, it is far from the
craziest behavior I’ve seen from the I/O stack.

Anyway, the main lesson here is that untested error handling is worse than no
error handling. They should have figured out how to test that this path
actually proceeds correctly (on real, intermittently failing hardware) or just
panicked the process.

~~~
bonzini
It is broken but there's no behavior that isn't, and the Postgres developers
quickly understood why changing Linux's behavior isn't really possible.

From [https://lwn.net/Articles/752063/](https://lwn.net/Articles/752063/):
"Linux is not unique in behaving this way; OpenBSD and NetBSD can also fail to
report write errors to user space. [...] If some process was copying a lot of
data to that drive, the result will be an accumulation of dirty pages in
memory, perhaps to the point that the system as a whole runs out of memory for
anything else [...] a fair amount of attention was paid to the idea that write
failures should result in the affected pages being kept in memory, in their
dirty state. But the PostgreSQL developers had quickly moved on from that idea
and were not asking for it".

~~~
pgaddict
> It is broken but there's no behavior that isn't, and the Postgres developers
> quickly understood why changing Linux's behavior isn't really possible.

I think there's still a fair number of PostgreSQL developers who think the way
the Linux kernel behaves makes the fsync() API rather difficult to use for
anything but the simplest scenarios.

The reason the community decided to accept it was the realization that there's
about a 0.001% chance of convincing kernel devs to change it, and the fact
that we'd still have to deal with existing kernels for the foreseeable future.

------
ysleepy
Love that FreeBSD is doing things right - and has been for 20 years.

[https://wiki.postgresql.org/wiki/Fsync_Errors](https://wiki.postgresql.org/wiki/Fsync_Errors)

------
jorangreef
2007, Linus rant:
[https://lkml.org/lkml/2007/1/10/233](https://lkml.org/lkml/2007/1/10/233)

    
    
      The right way to do it is to just not use O_DIRECT. 
    
      The whole notion of "direct IO" is totally braindamaged. Just say no.
    
        This is your brain: O
        This is your brain on O_DIRECT: .
        Any questions?
    
      I should have fought back harder. There really is no valid reason for EVER
      using O_DIRECT. You need a buffer whatever IO you do, and it might as well
      be the page cache. There are better ways to control the page cache than
      play games and think that a page cache isn't necessary.
    
      So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
      instead.
    

2019, how things are:

    
    
        This is your brain: O
        This is your brain on O_DIRECT: .
        And... this is your brain when cached: ?!
    
      The right way to do it is to just use O_DIRECT.
      The whole notion of "kernel IO" is fsync and games. Just say no.

------
fortran77
This is one reason we chose to use Windows Storage Spaces Direct and
transactional NTFS. They really are better.

[https://docs.microsoft.com/en-us/windows-server/storage/stor...](https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-overview)

------
Too
So if fsync fails, what is one supposed to do? You can't retry it, and you
don't know how much of the file has been synced.

The only feasible option is to create a completely new file and retry writing
there? And if that fails, your disk is probably bust or ejected, which should
require user interaction about the new file location anyway. Doesn't seem too
unreasonable?

This would require you to have the complete file contents elsewhere so you can
rewrite them? Or would it still be possible to read from the original file
while it sits in the unflushed buffer? And in the disk ejected-and-remounted
case, the old contents should still be there intact thanks to ext4 journaling?

------
saagarjha
I didn't read the entire thread, so maybe this was answered: has anyone
actually made a system that's "fully" correct with regards to file system
errors? Most people throw them away, but even programs that try to account for
them get them wrong on some system (or the system changes behavior from out
under them…). Is there a library that does this?

~~~
loeg
FreeBSD gets the kernel side correct (dirty unwritable blocks continue to
report IO errors, including to fsync()) so you can at least build reliable
libraries and applications on top of it.

