
PostgreSQL used fsync incorrectly for 20 years - lelf
https://fosdem.org/2019/schedule/event/postgresql_fsync/
======
georgebarnett
Years ago (2010 iirc), I reported to the Postgres list, with a patch, that
(one of) the reasons Postgres “didn’t work on NFS” was that it couldn’t deal
with short writes. I got told the usual “you’re holding it wrong” instead of
an acknowledgement of PG not sticking to the spec.

I patched my own systems (wrap the calls in a loop as per standard practice)
and then proceeded to run literally hundreds of thousands of PostgreSQL
instances on NFS for many more years with no problems.

The patch was eventually integrated, it seems, but I never found out why,
because I lost interest in trying to work with the community after that
experience.
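
The loop-on-short-writes fix mentioned above ("wrap the calls in a loop as per standard practice") can be sketched roughly like this. This is a minimal Python illustration of the technique, not the actual C patch:

```python
import os

def write_all(fd, data):
    """Write all of `data` to fd, looping on short writes.

    POSIX permits write() to transfer fewer bytes than requested
    (common on NFS, pipes, or after a signal), so a single call is
    not enough; keep writing the remainder until nothing is left.
    """
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]
```

The same loop shape applies to read() as well, which can likewise return early.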

~~~
jarym
Actually, as great as Postgres is and as generally approachable as the
community is - my experience was the same a few times, and I read it on the
mailing list happening to others:

Someone comes along with a patch or idea. Bunch of big Postgres people come
knock it and it dies right there.

Happened to me when I suggested a more multi-tenant approach back around 2010
and today we have Citus. I was told that (paraphrased) no users were asking
for that sort of thing.

I see it kind of happening with the Foreign Key Array patch: the author
asked for help rebasing it and no one bothered to reply.

Someone suggested replacing the Postmaster with a threaded approach so it
could scale better (and showed benchmarks of their implementation handling
more connections). Community response was there were already 3rd party
connection pools that do the job. An outsider looking in considers this nuts -
most people would not run additional components if they did not need to!

Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit
that causes frequent issues (more so now that we have partitioning) and also a
problem for multi-lingual table names. It’s been proposed to increase this
limit a few times. Every time the answer has been: abbreviate your table names
or recompile your own version of Postgres.

Could name a few other examples too I’ve noticed over the years and just
sighed at. I don’t expect every idea to be accepted or even welcomed - but
there is that sense of bias against change.

~~~
darkerside
While this can and does suck when it happens to you, this is exactly what it
takes to keep products focused so they don't die death by a thousand cuts (or
feature requests). For every awesome feature embarked upon, there's an
opportunity cost of bug fixes, stability updates, tech debt reductions, and
other inglorious but necessary work. Aggressively de-scoping is the difficult
but necessary work of keeping a product alive in a competitive marketplace.
And yes, it's a marketplace even if the product is open source.

~~~
benologist
I disagree that this is required. You can see from Linus Torvalds's
backtracking on decades of abrasive behavior that it was never an important
part of Linux after all, so an abrasive experience for people trying to help
other open source projects is probably superfluous too. You can still reject
ideas without disregarding them or the person.

~~~
notriddle
"We don't support running Postgres on NFS" isn't the same thing as "fuck you
Intel ACPI team; you're dumber than a chimpanzee". Equating disagreement and
criticism with Linus-isms is why the relationship between users and developers
is such a mess to begin with. Being a maintainer requires you to say "no"
sometimes, but it doesn't require you to be a jerk.

~~~
yourbandsucks
No, it's worse.

Linus was trying to make things work, with profanity. Postgres couldn't be
bothered.

Sure, performative profanity isn't everyone's cup of tea, but milquetoast
passive-aggressive dismissals of people like OP who ARE TOTALLY RIGHT aren't
actually nice.

~~~
anarazel
Please cite which parts of the thread you're referring to. Again, several
people +1'd the idea, and reviewed the code.

------
htfy96
> In short, PostgreSQL assumes that a successful call to fsync() indicates
> that all data written since the last successful call made it safely to
> persistent storage. But that is not what the kernel actually does. When a
> buffered I/O write fails due to a hardware-level error, filesystems will
> respond differently, but that behavior usually includes discarding the data
> in the affected pages and marking them as being clean. So a read of the
> blocks that were just written will likely return something other than the
> data that was written.

> Google has its own mechanism for handling I/O errors. The kernel has been
> instrumented to report I/O errors via a netlink socket; a dedicated process
> gets those notifications and responds accordingly. This mechanism has never
> made it upstream, though. Freund indicated that this kind of mechanism would
> be "perfect" for PostgreSQL, so it may make a public appearance in the near
> future.

[https://lwn.net/Articles/752063/](https://lwn.net/Articles/752063/)

A real-life example can be found at
[https://stackoverflow.com/questions/42434872/writing-program...](https://stackoverflow.com/questions/42434872/writing-programs-to-cope-with-i-o-errors-causing-lost-writes-on-linux)

~~~
solidsnack9000
The linked LWN article (from April 2018) is a great summary of the problem,
its cause, and potential solutions:

 _Ted Ts'o, instead, explained why the affected pages are marked clean after
an I/O error occurs; in short, the most common cause of I/O errors, by far, is
a user pulling out a USB drive at the wrong time. If some process was copying
a lot of data to that drive, the result will be an accumulation of dirty pages
in memory, perhaps to the point that the system as a whole runs out of memory
for anything else. So those pages cannot be kept if the user wants the system
to remain usable after such an event._

~~~
anarazel
That justification is bogus, however. There's already separate logic for the
case where the entire underlying device vanishes.

~~~
Dylan16807
Also USB bus resets are not unheard of. Or moving devices from one port to
another. If the device comes back within a minute or two you probably
shouldn't throw out those writes.

~~~
0x0
If someone quickly pulls a USB drive, plugs it into another system, and then
plugs it back into the original system, then flushing writes could cause
massive data corruption if those writes are relative to an outdated idea of
what's on the block device. Sounds like a misfeature to me.

~~~
zozbot123
> If someone quickly pulls a USB drive, plugs it into another system, and
> then plugs it back into the original system, then flushing writes could
> cause massive data corruption

That's user error, though. The kernel should react to removable media being
pulled by sending a wall message to the appropriate session/console, stating
something similar to "Please IMMEDIATELY place media [USB_LABEL] back into
drive!!", with [Retry] and [Cancel] options. That way, the user knows what to
expect -- OS's used to do this as a matter of course when removable media was
in common use. In fact, you could even generalize this, by asking the user to
introduce some specific media (identified by label) _when some mountpoint is
accessed_ , even if no operations were actually in progress.

~~~
adrianmsmith
RISC OS works like you propose. If you access the path "ADFS::MyDisk.$.Foo"
(that is the Advanced Disk FileSystem, disk called "MyDisk", $ is the root
directory and within that the file "Foo"), the user will get a pop-up asking
them to insert the disk "MyDisk" into any available disk drive, then press OK
to continue the I/O operation successfully. (The user can also click Cancel in
which case the I/O operation will return an error.)

You needn't ever have interacted with the "MyDisk" disk before. Simply
access it, and (by the time the disk I/O system call returns) the disk will be
there (by virtue of asking the user to insert it).

~~~
hyc_symas
MSDOS used to do this too. And Atari GEMDOS.

~~~
adrianmsmith
I don't think MSDOS did the following aspect of zozbot123's post:

> In fact, you could even generalize this, by asking the user to introduce
> some specific media (identified by label) when some mountpoint is accessed,
> even if no operations were actually in progress.

In MSDOS the paths were like "A:FOO.TXT" as far as I remember. That means you
had no facility, as a program, to request the user insert a particular disk
identified by a particular label.

For example, on RISC OS (but not on MSDOS) you could implement a program to
copy files between two disks by simply reading from the disk with the source
label and writing to the disk with the destination label. Even on a machine
with a single disk drive. The OS would request the user insert the source and
destination disks as appropriate.

~~~
hyc_symas
The FAT partition header has always had an ID field, so the OS could request
that you reinsert a removed disk and would know if the wrong one was inserted.

Also, on single-floppy systems there was a virtual B: so you could do disk to
disk copies just by saying "copy A:foo B:" etc.

------
chubot
Doesn't this affect all databases? Or is it a different issue?

[https://www.sqlite.org/atomiccommit.html](https://www.sqlite.org/atomiccommit.html)

 _SQLite does a "flush" or "fsync" operation at key points. SQLite assumes
that the flush or fsync will not return until all pending write operations for
the file that is being flushed have completed. We are told that the flush and
fsync primitives are broken on some versions of Windows and Linux. This is
unfortunate. It opens SQLite up to the possibility of database corruption
following a power loss in the middle of a commit. However, there is nothing
that SQLite can do to test for or remedy the situation. SQLite assumes that
the operating system that it is running on works as advertised. If that is not
quite the case, well then hopefully you will not lose power too often._

Also this seems related:

[https://danluu.com/file-consistency/](https://danluu.com/file-consistency/)

 _That results in consistent behavior and guarantees that our operation
actually modifies the file after it 's completed, as long as we assume that
fsync actually flushes to disk. OS X and some versions of ext3 have an fsync
that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush
to disk, and some versions of ext3 only flush to disk if the inode changed
(which would only happen at most once a second on writes to the same file,
since the inode mtime has one second granularity), as an optimization._
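
The OS X caveat quoted above translates to code roughly like the following sketch. The `flush_to_disk` name is made up here; `F_FULLFSYNC` only exists on Darwin, so the code falls back to plain fsync elsewhere:

```python
import fcntl
import os
import sys

def flush_to_disk(fd):
    """Best-effort durable flush of a file descriptor.

    On macOS, fsync() pushes data to the drive but not through the
    drive's write cache; fcntl(F_FULLFSYNC) is needed for that.
    Elsewhere, plain fsync() is the strongest portable primitive.
    """
    if sys.platform == "darwin" and hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)
```

SQLite exposes the same choice via its `fullfsync` pragma on macOS.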

The linked OSDI '14 paper looks good:

 _We find that applications use complex update protocols to persist state, and
that the correctness of these protocols is highly dependent on subtle
behaviors of the underlying file system, which we term persistence properties.
We develop a tool named BOB that empirically tests persistence properties, and
use it to demonstrate that these properties vary widely among six popular
Linux file systems._

~~~
hyc_symas
> Doesn't this affect all databases?

Didn't affect LMDB. If an fsync fails the entire transaction is
aborted/discarded. Retrying was always inherently OS-dependent and unreliable,
better to just toss it all and start over. Any dev who actually RTFM'd and
followed POSIX specs would have been fine.

LMDB's crash reliability is flawless.
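
The abort-don't-retry approach described above can be sketched like this. The `commit` helper and its page-dict argument are hypothetical, for illustration only, and are not LMDB's actual C code:

```python
import os

def commit(fd, pages):
    """Write a transaction's pages and fsync; on any failure, report
    the transaction as aborted instead of retrying the fsync.

    `pages` is a hypothetical {file_offset: bytes} dict. After a
    failed fsync the kernel may already have dropped the dirty pages
    and marked them clean, so a retried fsync could report success
    without the data ever reaching disk.
    """
    try:
        for offset, page in pages.items():
            os.pwrite(fd, page, offset)
        os.fsync(fd)
    except OSError:
        return False  # caller discards the whole transaction
    return True
```

On `False`, the caller throws everything away and starts the transaction over, rather than trusting any partial state.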

~~~
garyclarke27
LMDB's claimed speed and reliability seem remarkable (from a quick glance). I
would guess it's easier to achieve for a KV store than for a much more
complex relational database. Got me thinking though. Maybe Postgres could
take advantage of LMDB? Maybe by using LMDB as its cache instead of the OS
page cache, or by writing the WAL to LMDB?

~~~
hyc_symas
LMDB itself only uses the OS page cache. The way for LMDB to improve an RDBMS
is for it to replace the existing row and index store, and eliminate any WAL.
This is what SQLightning does with SQLite.

Have looked at replacing InnoDB in MySQL, but that code is a lot harder to
read, so it's been slow going. Postgres doesn't have a modular storage
interface, so it would be even uglier to overhaul.

~~~
garyclarke27
Thanks, makes sense. I think Postgres is planning to have a pluggable storage
interface in the next version (12); would that help? Also, nobody has
mentioned the data checksums added in v9.3; do you know if they help avoid
this kind of fsync-related corruption?

~~~
anarazel
> Also nobody has mentioned the data checksums added in v9.3; do you know if
> they help avoid this kind of fsync-related corruption?

Not really, I think. Page-level checksums don't protect against entire writes
going missing, unfortunately.

------
haberman
> Both Chinner and Ts'o, along with others, said that the proper solution is
> for PostgreSQL to move to direct I/O (DIO) instead.

Wait, is "direct I/O" the same as O_DIRECT?

The same O_DIRECT that Linus skewered in 2007?

> There really is no valid reason for EVER using O_DIRECT. You need a buffer
> whatever IO you do, and it might as well be the page cache. There are better
> ways to control the page cache than play games and think that a page cache
> isn't necessary.

[https://lkml.org/lkml/2007/1/10/233](https://lkml.org/lkml/2007/1/10/233)

> Side note: the only reason O_DIRECT exists is because database people are
> too used to it, because other OS's haven't had enough taste to tell them to
> do it right, so they've historically hacked their OS to get out of the way.

[https://lkml.org/lkml/2007/1/10/235](https://lkml.org/lkml/2007/1/10/235)

More background from 2002-2007:
[https://yarchive.net/comp/linux/o_direct.html](https://yarchive.net/comp/linux/o_direct.html)

~~~
anarazel
Turns out sometimes people other than Linus have more experience with IO than
Linus.

I think there's pretty good reasons to go for DIO for a database. But only
when there's a good sysadmin/DBA and when the system is dedicated to the
database. There's considerable performance gains in going for DIO (at the cost
of significant software complexity), but it's much more sensitive to bad
tuning and isn't at all adaptive to overall system demands.

~~~
pgaddict
Yeah, which is one of the reasons PostgreSQL went with buffered I/O, not to
have to deal with this complexity. And it served us pretty well over time, I
think.

~~~
anarazel
I don't think that's really true. It worked well enough, true, but I think it
allowed us to not fix deficiencies in a number of areas that we should just
have fixed. IOW, I think we survived despite not offering DIO (because other
things are good), rather than because of it.

~~~
pgaddict
I don't think we disagree, actually.

Yes - from a purely technical point of view, DIO is superior in various ways.
It allows tuning to specific I/O patterns, etc.

But it's also quite laborious to get right - not only does it require a fair
amount of new code, but AFAIK there is significant variability between
platforms and storage devices. I'm not sure the project had enough developer
bandwidth back then - or, put differently, it was more efficient to spend the
developer time on other stuff, with a better cost/benefit ratio.

------
bbunix
Historic note (like from the 80's) - any time a machine was rebooted we'd type
sync; sync; sync; reboot - the explanation was that the only guarantee was
that the second sync wouldn't start until the first sync successfully
completed, plus one for good luck...

~~~
Jerry2
I have a flash drive that I sometimes put a video on to watch on a small TV
in the basement, and I've noticed that Linux doesn't copy the file right away.
The 'cp' does finish quickly, but the data is not on the flash drive yet. You
either have to eject and wait, or sync and wait, for it to actually transfer.

Needless to say, this tripped me up a few times and videos weren't fully
transferred.

~~~
pmontra
I also noticed that on Gnome. I never investigated the implementation details
of that progress bar, but my gut feeling is that the file is read from (or
written to) the buffer cache quickly and the progress bar goes near to 100%,
then stays there until the last writes succeed and actually hit the USB
stick. Then the eject button sometimes needs extra time to finish the sync
and tells me to wait a little. I always remove the stick only when it tells
me it's safe to do so.

~~~
loeg
Your intuition is correct. The writes are quickly buffered to RAM, and then
fsync or close writes them out to the slow media. The (naive) progress bar
probably only tracks progress buffering the writes - it's the simplest way to
track progress, if inaccurate.
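
The two phases the progress bar conflates can be separated explicitly. A sketch (the timings only diverge noticeably on slow media like a USB stick):

```python
import os
import time

def copy_with_sync(src, dst):
    """Copy src to dst, returning (buffer_seconds, total_seconds).

    The write() typically returns as soon as the data is in the
    kernel page cache; the fsync() is what actually waits for the
    device, which is where a naive progress bar goes wrong.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        start = time.monotonic()
        fout.write(fin.read())
        fout.flush()                  # push Python's buffer to the kernel
        buffered = time.monotonic() - start
        os.fsync(fout.fileno())      # wait for the media itself
        total = time.monotonic() - start
    return buffered, total
```

On a slow flash drive, `buffered` can be a tiny fraction of `total`.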

------
anarazel
I think it's worthwhile to note that this, even before both the kernel and
postgres changes, really only is an issue if there are very serious storage
problems, which commonly cause more corruption than just losing the data that
was due to be fsynced. The kernel has its own timeout/retry logic, and if
those retries succeed, there's no issue. Additionally, most filesystems
remount read-only if there are any accompanying journaling errors, and in a
lot of cases PG IO will also have a metadata effect.

~~~
waz0wski
I/O errors are an ongoing issue for AWS M5 and C5 instance types with their
cloudy NVMe devices; they have a tendency to randomly disappear or have
extended timeouts

[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788035](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788035)

unfortunately the upstream patches being backported don't provide any real
write guarantee

------
blaisio
I don't really understand why people went with this headline and not
"PostgreSQL developers discover most programs have used fsync incorrectly for
decades, including PostgreSQL".

~~~
giovannibajo1
Because most normal programs do open-write-sync-close, and that mostly works
as expected.

Postgres does open-write-close and then later, in another unrelated process,
open-fsync-close. They discovered this doesn’t always work, because if
somebody somewhere does a fsync on that file for any reason, their process
might miss the error as it doesn’t stick.
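
The difference between the two call patterns can be sketched as follows. The function names and paths are made up for illustration; this is Python over the POSIX calls, not PostgreSQL's actual C code:

```python
import os

def app_style(path, data):
    """open-write-fsync-close: a writeback error surfaces as an
    OSError from this very fsync call, on this very descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

def pg_writer(path, data):
    """A backend writes and closes without syncing; the dirty pages
    stay behind in the kernel page cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, data)
    os.close(fd)

def pg_checkpointer(path):
    """The checkpointer later reopens the file and fsyncs. On older
    kernels, an error that hit during writeback before this open()
    could already have been reported elsewhere and lost."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

The bug lives entirely in the gap between `pg_writer`'s close() and `pg_checkpointer`'s open().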

------
mehrdadn
Am I misunderstanding this or does this mean Linux literally does not provide
any way to ensure writes are flushed to a file in the presence of transient
errors?

Does anyone know if Windows's FlushFileBuffers is susceptible to this as well?
(P.S., interesting bit about FILE_FLAG_WRITE_THROUGH not working as you might
expect:
[https://blogs.msdn.microsoft.com/oldnewthing/20170510-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20170510-00/?p=95505))

------
macdice
For the record, and not mentioned in Tomas's talk: the PostgreSQL release due
out next week will add a PANIC after any fsync failure (in other words, no
longer retry). The same thing has been done by MySQL and MongoDB, and
probably everyone else doing (optional) buffered IO for database-ish stuff
who followed this on LWN etc.
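
The new behavior amounts to treating fsync failure as fatal rather than retryable. A rough sketch of the idea (not the actual PostgreSQL C code):

```python
import os
import sys

def checkpoint_fsync(fd):
    """fsync or die. After a failed fsync the page cache may already
    have discarded the dirty data and marked the pages clean, so a
    retry could falsely succeed; crashing instead forces recovery by
    replaying the WAL from the last known-good point."""
    try:
        os.fsync(fd)
    except OSError as exc:
        sys.stderr.write(f"PANIC: could not fsync: {exc}\n")
        os._exit(1)
```

The counterintuitive part is that crashing is the safe option: the WAL on disk is trustworthy, while the in-memory "already fsynced" bookkeeping no longer is.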

------
lelf
[https://wiki.postgresql.org/wiki/Fsync_Errors](https://wiki.postgresql.org/wiki/Fsync_Errors)

Only FreeBSD & Illumos do the sane thing.

~~~
zaarn
The wiki uses the word "presumably". I take that as "the devs say it works but
nobody tested it".

------
pkaye
There is an old paper
[http://pages.cs.wisc.edu/~remzi/Classes/736/Papers/iron.pdf](http://pages.cs.wisc.edu/~remzi/Classes/736/Papers/iron.pdf)
that analyzes how various filesystems do error handling and not too long ago
it was fairly bad. My own experience was some of the older Windows would not
even check if a write command failed. I used to laugh when developers would
state how robust their databases were when the underlying filesystem does not
even check for many errors. Hopefully things are better now.

------
beefhash
OpenBSD brought forth a patch earlier this month to try to make fsync(2) less
of a mess to use there[1], though it hasn't been committed yet.

[1] [https://marc.info/?l=openbsd-tech&m=154897756917794&w=2](https://marc.info/?l=openbsd-tech&m=154897756917794&w=2)

~~~
anarazel
Note that that patch doesn't really fix the issue. You can
write();close();open();fsync(); and you'll miss the error if the OS failed
during writeback before the open(). That's worse than on new-ish Linux, where
at least it'll only forget the error if there's a lot of memory pressure.

------
pr3dr49
I can see a mention of SQLite and MySQL. Also worth mentioning since this can
affect any system:

[https://www.firebirdsql.org/pdfmanual/html/gfix-sync.html](https://www.firebirdsql.org/pdfmanual/html/gfix-sync.html)

[https://www.firebirdsql.org/file/documentation/reference_man...](https://www.firebirdsql.org/file/documentation/reference_manuals/user_manuals/html/qsg3-safety.html#qsg3-safety-howtocorrupt)

I wonder how Oracle handles this. Raw device/partition and its own FS business
logic?

~~~
masklinn
Direct IO is the default for raw devices, and while apparently not the default
otherwise IIRC it's pretty widely used by Oracle shops.

------
ciudilo
FWIW MongoDB fixed[1] this last year in their WiredTiger storage engine.

[1]
[https://jira.mongodb.org/browse/WT-4045](https://jira.mongodb.org/browse/WT-4045)

------
mindslight
The kernel itself isn't really a transaction manager. If there is an I/O error
on a file, then I'd only expect it to percolate up in that immediate
"session". Until when _should_ that error flag be expected to carry over -
filesystem remount? Or even longer, with the talk of storing a flag in the
superblock?

Specifically it seems like asking for trouble to open() a path a second time,
and expect fsync() calls to apply to writes done on the other file descriptor
object - there's no guarantee they're even the same file [0]! At the very
least, pass the actual file descriptor to the other process. Linux's initial
behavior was in violation of what I've said here, but the patch to propagate
the error to every fd open at the time of the error should resolve this to
satisfaction.
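
Passing the actual descriptor between processes, as suggested, is possible over a Unix-domain socket via SCM_RIGHTS. A sketch using Python 3.9's `send_fds`/`recv_fds` wrappers (file contents and message are made up):

```python
import os
import socket
import tempfile

# One process writes through a descriptor...
fd, path = tempfile.mkstemp()
os.write(fd, b"some dirty data")

# ...and hands that same open file description to a "checkpointer"
# over a socketpair, instead of having it re-open() the path.
writer_sock, checkpointer_sock = socket.socketpair(socket.AF_UNIX,
                                                   socket.SOCK_DGRAM)
socket.send_fds(writer_sock, [b"sync-this"], [fd])

# The receiving side fsyncs the very same open file description, so
# a writeback error cannot slip between a close() and a re-open().
msg, fds, _flags, _addr = socket.recv_fds(checkpointer_sock, 1024, 1)
os.fsync(fds[0])
data_on_disk = os.pread(fds[0], 15, 0)
os.close(fds[0])
os.close(fd)
os.unlink(path)
```

Both descriptors share one file description, so the fsync covers the writer's data and any error state attached to it.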

I would think about the only reasonable assumption you could make about
concurrent cross-fd syncing is that after a successful fsync(fd), any
successful read(fd) should return data that was on disk at the time of that
fsync(fd) or _later_. In other words, dirty pages shouldn't be returned by
read() and then subsequently fail being flushed out.

disclaimer: It's been a while since I've written syscall code, and I've had
the luxury of never having to really lean into the spec to obtain every last
bit of performance. Given how many guarantees the worse-is-better philosophy
forewent, I don't see much point in internalizing what few are there.

[0] Ya sure, you can assert that the user shouldn't be otherwise playing
around with the files, I'm just pointing out that translating path->fd isn't a
pure function.

------
EGreg
Have there been any real-world consequences from this, and how can they be
prevented?

Does MySQL have the same flaw?

~~~
j16sdiz
This can happen when the I/O device has intermittent failures: a bad device
or a faulty cable.

~~~
pgaddict
Interestingly enough, such issues are becoming more common. It's not just
about devices being less reliable, but e.g. thin provisioning being used more
widely etc.

------
ausjke
I read this at lwn.net a while ago, but it seems there is no fix for it. How
is MySQL doing? I believe Oracle etc. don't have this problem as they deal
with the disk directly.

~~~
jeltz
I think that all databases are affected by this to some degree when not using
direct IO, and I think Oracle and MySQL can both run with or without direct
IO.

------
pyrus
Does anyone know if FoundationDB is affected?

~~~
ryanworl
FoundationDB does not use buffered IO, so no.

------
xorgar831
The meta point here is that in OSS folks assume the code has been reviewed
since it's open, but the reality is that unless someone actually does review
it, you really don't know. The popularity of a project doesn't mean there
aren't fatal flaws.

~~~
loeg
No need to bring out that dead horse for a beating. No one in this thread has
yet made the claim that OSS code is flawless; you're picking a fight with a
straw man.

~~~
clhodapp
Humorously, someone in this thread has now actually made the claim that their
open source database product has flawless crash reliability:
[https://news.ycombinator.com/item?id=19127011](https://news.ycombinator.com/item?id=19127011)

~~~
hyc_symas
Multiple independent research teams with their crash analysis/fuzzing tools
confirmed the fact. Along with over 7 years deployment in large enterprises:
Zero crash-induced corruption, zero startup/recovery time. Crash-proof by
design.

------
wuxb
It's a perfect scenario to hear some rants from Linus. Meanwhile, the "code"
(of conduct) has changed and he really has to stay calm for this :)

