What about this: you write() some data, and then Linux throws it away because of...

ajross · on July 22, 2019

I don't think that's the behavior, though. The blocks don't get flushed, they're still in cache and get read back. Probably. Again, the system broke. You've got no guarantees about anything. Fixing an API isn't going to change that.

Dylan16807 · on July 22, 2019

API guarantees are still important even when a peripheral breaks. It's the difference between knowing how and where performance is degraded and what to work around, versus utter chaos.

macdice · on July 22, 2019

Totally agree. Speaking as a database hacker, just tell me it broke. I will deal with the rest.

ajross · on July 22, 2019

It did tell you it broke, though. It just didn't repeat the message when you asked again. Really, there is no "correct" behavior from getting an -EIO from a fsync beyond "shut it all down, notify everyone you can and hope for the best". This retry behavior was itself a bad design choice (and again, I'm not defending fsync here).

And again, speaking as a systems hacker: it broke. You can't "deal with" broke. And the system absolutely did tell you it broke. I genuinely don't understand the insistence by folks on this list that OMG THE ERROR MUST BE HANDLED IN THE DATABASE SOFTWARE. It seems kinda ridiculous to me, honestly.

macdice · on July 22, 2019

Well, there are/were a whole bunch of related problems and scenarios discussed in that and other threads, so it depends which bit we're discussing. (1) errors consumed by asynchronous kernel activity that never make it to user space; (2) write(), write-back error not reported to user yet, buffer page replaced, read() sees old data; (3) fsync() -> EIO/ENOSPC, then fsync() -> SUCCESS; (4) write() in one process, then fsync() in another process from a different fd; (5) write(), close(), open(), fsync().

It seems that the developers of database software and kernels/filesystems had different understanding of what should happen to error reporting in those cases (and maybe some more that I'm forgetting). PostgreSQL and FreeBSD happened to agree and all of these scenarios do the right thing (IMHO, but you could call my opinion not impartial, since I am involved in both of those projects). Linux has had various different behaviours since Jeff Layton realised how terrible it all was and started improving it, starting from (1) which were straight up bugs. Some problems still exist. OpenBSD also made recent changes due to the noise generated by this stuff. So I genuinely don't understand how anyone can argue that there was nothing wrong or imply that it's not complicated: the maintainers of the software in question apparently disagreed.

In my humble opinion someone should go and talk to http://www.opengroup.org/austin/ and try to get some increased clarity here for POSIX 2022. Maybe I will find the energy if someone doesn't beat me to it.

By "I will deal with it", I didn't mean I can fix a hosed server, I mean something like "the database will fail over to another node, or enter recovery" or something like that. And, since we (and other DBs who apparently read our mailing list) introduced panic-on-fsync-failure, that's what we do. There are still a few edge cases to fix, though. We're on it...

macdice · on July 22, 2019

When a write-back error happens, Linux marks the block clean so it can be replaced by another block at any time. Additionally, some Linux filesystems (I don't recall which right now) proactively throw it away immediately.