
Linux's fsync() woes are getting some attention - r4um
http://rhaas.blogspot.com/2014/03/linuxs-fsync-woes-are-getting-some.html
======
bodyfour
I think part of the problem is that fsync() is an insufficient interface. Most
of the time, you want two things:

* write ordering ("write B must hit disk after write A")

* notification ("let me know when write A is on disk")

In particular, you often _don't_ want to actually force I/O to happen
immediately, since for performance reasons it's better for the kernel to
buffer as much as it wants. In other words, what you want should be nearly
free, but instead you have to do a very expensive operation.

For an example of notification: suppose I have a temporary file with data that
is being journaled into a data store. The operation I want to do is:

1. Apply changes to the store

2. Wait until all of those writes hit disk

3. Delete the temporary file

I don't care if step 2 takes 5 minutes, nor do I want the kernel to schedule
my writes in any particular way. If you implement step 2 as an fsync() (or
fdatasync()) you're having a potentially huge impact on I/O throughput. I've
seen these frequent fsync()s cause 50x performance drops!
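
(A minimal sketch of how that pattern has to be written today, with a single
blocking fdatasync() standing in for step 2; the file names and record format
here are made up for illustration:)

    /* Step 2 today: a blocking fdatasync(), even though all we need is to
     * learn *when* the writes are durable, not to force them out right now.
     * "store.db" and "journal.tmp" are hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int store = open("store.db", O_WRONLY | O_APPEND);
        if (store < 0) { perror("open"); return 1; }

        /* 1. Apply changes to the store (the temp file's contents). */
        const char buf[] = "journaled record\n";
        if (write(store, buf, sizeof buf - 1) < 0) { perror("write"); return 1; }

        /* 2. Wait until those writes hit disk -- this is the expensive,
         *    I/O-forcing step being complained about. */
        if (fdatasync(store) < 0) { perror("fdatasync"); return 1; }

        /* 3. Only now is it safe to delete the temporary journal file. */
        if (unlink("journal.tmp") < 0) perror("unlink");

        close(store);
        return 0;
    }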

~~~
dap
> In particular, you often don't want to actually force I/O to happen
> immediately, since for performance reasons it's better for the kernel to
> buffer as much as it wants.

If you don't issue an fsync(), the kernel never has to write anything. If you
had a different function that did what you suggest, returning when the kernel
had decided to write the data (without instructing the kernel to do so
soonish), you could literally be waiting forever. If the function ever
returned, it would only be because something else on the system asked for a
sync (and it was easier to write everything), or because you're operating on a
filesystem that chooses to write data sooner than required. Both of these are
basically working by accident (i.e., may well not work on other POSIX
systems).

I think what you really want is to instruct the kernel that this data should
be written out, and you want to block until that happens. That's exactly what
fsync() is supposed to do. The fact that some kernels hammer the I/O subsystem
while doing so is a bug in those kernels, not the fsync() interface.
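
For reference, the usual way that intent gets expressed is the
write/fsync/rename dance; a minimal sketch, with made-up file names:

    /* Make an update durable: write a new file, fsync() it, rename() it into
     * place, then fsync() the directory so the rename survives a crash.
     * "config.tmp"/"config" are hypothetical names. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int durable_replace(const char *dir, const char *tmp,
                               const char *final, const char *data) {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, strlen(data)) < 0 ||
            fsync(fd) < 0) {               /* block until the data is on disk */
            close(fd);
            return -1;
        }
        close(fd);

        if (rename(tmp, final) < 0) return -1;

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        int rc = fsync(dfd);               /* make the rename itself durable */
        close(dfd);
        return rc;
    }

    int main(void) {
        if (durable_replace(".", "config.tmp", "config", "key=value\n") < 0)
            perror("durable_replace");
        return 0;
    }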

~~~
ryandrake
Coming from a graphics background, this reminds me of a similar problem in
graphics.

In the OpenGL API, you have two functions (hard core graphics pedants, please
forgive my simplification): There's glFinish() which instructs the graphics
system to process all graphics commands previously sent and to block until
everything is done. Then there's glFlush() which returns immediately, and
basically says, make sure all commands previously sent will finish "in a
finite amount of time" (as opposed to "maybe never").

From the discussion here, it seems that fsync() is like glFinish(). Maybe
what's needed would be something similar to glFlush()? Something that says,
"kernel, now would be a good time to start writing out data, but I'm not going
to wait."

~~~
dap
That's potentially interesting, but the question is: if you're not going to do
anything that relies on the data being on stable storage, why bother asking
the system to write it out?

Most of the use cases I think of involve a server of some kind servicing
client requests. Most of the time, the only sane semantics are that if the
request completes successfully, then the change will survive a system crash.
In that case, you _have_ to fsync() (or equivalent). Conversely, if the client
doesn't need that guarantee (e.g., this is a cache that can be reconstructed
from elsewhere, as in the case of a CDN), then there's no reason to make sure
it's on stable storage at all. You're basically using the filesystem as a
large, slow extension of DRAM, and if everything you ever write fits in DRAM,
you wouldn't care if the kernel ever wrote it out.

Is there a use case you had in mind? I can't think of a middle ground.

~~~
jerf
Anything where you want an image to be guaranteed consistent, even if not
complete, could use an ordering guarantee without a particular "this has been
written _now_ " guarantee. A log-structured data store where you don't mind a
bit of data loss if there's a power outage is a particularly clear example of
that, but it's a useful property in general.

In fact filesystems in general attempt to implement this for themselves,
because a filesystem should ideally always be in a consistent state... it may
not be the _right_ state, per se, but it should not be actively inconsistent,
leaving you (or fsck) to basically guess what the correct state is.

The problem appears to be that today there's only "write this all out now and
DO ABSOLUTELY NOTHING ELSE until that happens", and "yeah, whatever, write it
whatever order you like and I sure hope it all works out."

Is that correct? There really isn't anything like a write barrier? All my
reading of the links here seems to indicate that, but I find it hard to
believe that really is an accurate summary of the current state of affairs on
Linux. (Though I concede that I can see how hard it would be to propagate such
a guarantee all the way from the disk hardware, through the drivers, through a
large and varied number of file systems, all the way out to user space,
without bugs, bugs, bugs.)

~~~
dap
> Anything where you want an image to be guaranteed consistent, even if not
> complete, could use an ordering guarantee without a particular "this has
> been written now" guarantee. A log-structured data store where you don't
> mind a bit of data loss if there's a power outage is a particularly clear
> example of that, but it's a useful property in general.

So a little bit of data loss is okay, but a lot isn't? How does a program or
operator determine how much is okay and how much isn't? How does the
application ensure that that limit isn't exceeded? Without answers to these
questions, it feels like "you'll probably be fine, but I can't be sure of
anything", which feels pretty lame. But if a little data loss really is okay,
then forget about both ordering and fsync and truncate the log after the last
consecutive valid record.
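
Concretely, that last option is just a recovery scan: read checksummed records
until one fails to verify, then cut the file there. A minimal sketch, with a
made-up record format (length + checksum header):

    /* Recover a log by truncating after the last consecutive valid record.
     * The on-disk format here is hypothetical. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct rec_hdr { uint32_t len; uint32_t sum; };

    static uint32_t checksum(const unsigned char *p, uint32_t n) {
        uint32_t s = 2166136261u;            /* FNV-1a, stand-in for a real CRC */
        while (n--) s = (s ^ *p++) * 16777619u;
        return s;
    }

    int recover(const char *path) {
        int fd = open(path, O_RDWR);
        if (fd < 0) return -1;

        off_t good_end = 0;                  /* end of last valid record */
        struct rec_hdr h;
        while (pread(fd, &h, sizeof h, good_end) == (ssize_t)sizeof h) {
            unsigned char *buf = malloc(h.len);
            if (!buf) break;
            ssize_t n = pread(fd, buf, h.len, good_end + sizeof h);
            int ok = n == (ssize_t)h.len && checksum(buf, h.len) == h.sum;
            free(buf);
            if (!ok) break;                  /* torn or missing write: stop here */
            good_end += sizeof h + h.len;
        }

        int rc = ftruncate(fd, good_end);    /* discard everything after it */
        close(fd);
        return rc;
    }

    int main(void) { return recover("journal.log") < 0; }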

> The problem appears to be that today there's only "write this all out now
> and DO ABSOLUTELY NOTHING ELSE until that happens", and "yeah, whatever,
> write it whatever order you like and I sure hope it all works out."

> Is that correct?

I don't think so, but it depends on what you mean by "absolutely nothing
else". You can always use other threads, and on most systems, you can do lots
of useful I/O with reasonable performance while an fsync() is going on.

> There really isn't anything like a write barrier?

Other than fsync() and equivalents, I don't know of one. Non-blocking write
barriers would represent a much more complicated abstraction for both
applications and the filesystem, and (as you can tell from my comments on this
thread) I'm not convinced it's worth the complexity for any rigorous program.

~~~
rosser
_You can always use other threads, and on most systems, you can do lots of
useful I/O with reasonable performance while an fsync() is going on._

No, you can't. At least not without knowing what you're doing and careful
planning.

My day job is PostgreSQL DBA, and I've been doing that for most of a decade
now. As the kids on the Reddits would say, "I've seen some shit." I have some
rather large servers, with some rather powerful IO subsystems — my production
environment has SAS SLC SSDs under hardware RAID with a ginormous cache. I
still see the behavior described in TFA far more often than I'd like. Linux
really is pretty dumb here.

For example, because of this fsync() issue, and the fact that fsync() calls
flush all outstanding writes _for the entire filesystem_ that the file(s)
being fsync()'ed reside upon, I've set up my servers such that my
$PGDATA/pg_xlog
directory is a symlink from the volume mounted at (well, above) $PGDATA to a
separate, much smaller filesystem. (That is: transaction logs, which must be
fsync()'ed often to guarantee consistency, and enable crash recovery, reside
on a smaller, dedicated filesystem, separate from the rest of my database's
disk footprint.)

If I didn't do that, at every checkpoint, my performance would measurably
fall. I learned this lesson the hard way, at an old job, where my postgres
clusters lived _on a SAN_ — it wasn't just my db instances that were being
adversely affected by this IO storm. _It was everything else that lived on the
filer, too_.

That's how bad it can be.

~~~
tytso
It's not true that fsync() calls flush all outstanding writes for the entire
file system; that was true for ext3 in data=ordered mode, but it's definitely
not true for ext4 or xfs. If you use fdatasync(), and there were no writes
issued against the file descriptor that required metadata updates (i.e., you
didn't do any block allocations, etc.), then neither ext4 nor xfs needs to
trigger a journal commit, so the only thing that has to get sent to disk is
the file's dirty data blocks, followed by a SYNC CACHE command which forces
the disk drive to guarantee that all writes sent to the disk will survive a
power cut.

If you use fsync() and/or you have allocated blocks or otherwise performed a
write which required updating file system metadata, and thus will require a
journal commit, then you will need to force out all pending metadata updates
to the journal as part of the file system commit, but that's still not the
same as "flush all outstanding writes for the entire file system".

~~~
bodyfour
> or you have allocated blocks or otherwise performed a write which required
> updating file system metadata

What if you're appending to a file and want to checkpoint every so often? I
guess you can be clever with fallocate(FALLOC_FL_KEEP_SIZE) to avoid the block
allocation, but won't st_size still need to be updated?

I also assume that st_mtime doesn't count towards dirtying the metadata.

------
haberman
It always amazes me that after all these years, Linux still hasn't fixed this.

In my experience, any program that overloads I/O will make the system grind to
a halt on Linux. Any notion of graceful degradation is gone and your system
just thrashes for a while.

My theory about this has always been that any I/O related to page faults is
starved, which means that every process spends its time slice just trying to
swap in its program pages (and evicting other programs from the cache,
ensuring that the thrashing will continue).

I've never gotten hard data to prove this, and part of me laments that SSDs
are "fast enough" that this may never actually get fixed.

Can anyone who knows more about this comment? It seems like a good rule inside
Linux would be never to evict pages that are mapped executable if you can help
it.

Has anyone experimented with ionice or iotop?
http://www.electricmonk.nl/log/2012/07/30/setting-io-priorities-on-linux/

~~~
FigBug
Happens on Windows and Mac OS too. Every time I boot, Dropbox thrashes my disk
for 10 minutes while the system is almost completely unresponsive.

~~~
ksk
> Every time I boot, Dropbox thrashes my disk for 10 minutes while the system
> is almost completely unresponsive.

I seriously doubt that any usermode program could overwhelm the OS scheduler
like that. What are your use case parameters?

~~~
FigBug
Just after the OS boots, Dropbox needs to index 120 GB of files. While that's
happening, any other program that needs the disk is uselessly slow; it takes
about 10 minutes for Dropbox to finish and for my mail and IDEs to open.

~~~
ksk
Interesting. How many files do you have? I recorded a trace of Dropbox
executing on my Windows machine (mostly flat folder hierarchy, ~500 MiB, ~1000
files) and the file I/O for querying all my data took 71542.070 μs (0.07 s). I
believe Dropbox also does some extra things (reading the NTFS journal, its own
file cache-journal, updating hashes, etc.) and so the total file I/O cost was
around 2944815.431 μs (2.9 s). Note that the I/O happened sporadically, and
the wall clock time is higher, as expected (it didn't block the scheduler from
scheduling other processes).

I assume that since my data was already synced and didn't need to be indexed
all over again, I got some savings there. Maybe your Dropbox configuration
data is corrupted and that's why it needs to index everything again.

~~~
FigBug
How do you record a trace? That would be interesting to do.

~~~
ksk
Windows Performance Recorder.

------
rwmj
Interesting couple of related articles / rants by Jeff Darcy:

http://pl.atyp.us/2013-08-local-filesystems-suck.html

http://pl.atyp.us/2013-11-fixing-fsync.html

------
zurn
Good to see this summit involving the kernel developers, since the past
situation sounds rather bleak interaction-wise: testing against a kernel
version from 2009, without having tried the improvements in the (2012) 3.2
kernel.

BTW, Linux provides the direct I/O O_DIRECT interface that allows apps to
bypass the kernel caching business altogether. This is also discussed in Mel
Gorman's message that this blog borrows from.
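
A minimal sketch of what O_DIRECT usage looks like (the file name is made up,
and 4096 is just a commonly safe alignment; the real requirement depends on
the filesystem and device):

    /* O_DIRECT bypasses the page cache; buffer, offset, and length must all
     * be suitably aligned. It does not by itself flush the drive's write
     * cache, so durability still needs fsync()/fdatasync() or O_DSYNC. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGN 4096

    int main(void) {
        void *buf;
        if (posix_memalign(&buf, ALIGN, ALIGN) != 0) return 1;
        memset(buf, 'x', ALIGN);

        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        /* Aligned write straight past the kernel's cache: the application
         * now owns its own buffering and scheduling policy. */
        if (pwrite(fd, buf, ALIGN, 0) < 0) perror("pwrite");

        close(fd);
        free(buf);
        return 0;
    }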

------
est
Hmm, maybe that's why they had so many issues with single-instance Redis &
MongoDB? As soon as an fsync() hit, the whole db became unresponsive.

~~~
fiatmoney
MongoDB's storage engine is more or less a mmap'd linked list of documents. It
has a lot of issues once you actually start doing reads or writes, whether
it's because your working set exceeds RAM or you actually want durability.
It's a nice term-paper DB implementation but there's a good reason why most of
the big single-instance RDBMS's use their own I/O algos instead of delegating
to the kernel. Fundamentally the kernel doesn't know your optimal access
pattern or your desired tradeoffs.

Redis moved away from mmap'd storage some time ago; you can snapshot the DB to
disk, but that's done via a fork() and the writes happen in that other
process.

------
mrottenkolber
I think this is a very important area of improvement for Linux. While we call
it "multitasking", there are a lot of situations where one might doubt it
deserves that title.

I have been experimenting with very low-cost computing setups that optimize
for robustness, and that led me to pretty slow disk I/O. While that's not a
typical scenario for desktop computing, it can and should be workable with the
limited but sane resources I ended up with. In practice, however, certain
loads freeze the whole system until a single, usually non-urgent, write
finishes. Basically the whole throughput is used for a big write, and then X
(and others) freeze because they are waiting for the filesystem (probably on
just a stat() or similar).

There are differences between applications. Some "behave" worse than others.
Some even manage to choke themselves (ever seen GIMP take over an hour to
write 4MB to an NFS RAID with 128kb/s throughput?).

I guess this is a hard problem, but I would wish for an OS to never stall
under load. It is better to slow down, even exponentially, than to halt other
tasks. Ideally the system would be smart and deprioritize long-running tasks
so that small, presumably urgent, tasks are impacted as little as possible.

------
zvrba
Re Mel Gorman's details in
[http://article.gmane.org/gmane.linux.kernel/1663694](http://article.gmane.org/gmane.linux.kernel/1663694)

I don't understand why PostgreSQL people don't want to write their own IO
scheduler and buffer management. It's not that hard to implement (even a MT
IO+BM is not really complicated), and there are major advantages:

- you become truly platform-independent instead of relying on particulars of
some kernel [the only thing you need from the OS is some form of O_DIRECT; it
exists also on Win32]

- you have total control over buffer memory allocation and IO scheduling

- whatever scheduling and buffer management policy you're using, you can more
easily adapt it to SSDs and other storage types, which are still in their
infancy (e.g., memristors) [thus not depending on the kernel developers'
goodwill]

I mean, really: these people have implemented an RDBMS with a bunch of
extensions to standard SQL, and an IO+buffer-management layer is suddenly
complicated, or [quote from the link]: "While some database vendors have this
option, the Postgres community do not have the resources to implement
something of this magnitude."

This smells more like politics than a technical issue.
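
To give a sense of the scale involved, the core of a buffer manager really is
small; here is a toy clock-sweep (second-chance) pool over a hypothetical data
file, with no locking, pinning, or error recovery, i.e. nothing like the real
thing a database would ship:

    /* Toy buffer manager: fixed pool of page frames, clock-sweep eviction,
     * write-back of dirty pages. Purely illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE  8192
    #define POOL_PAGES 128

    struct frame {
        long pageno;        /* page held in this frame, -1 if empty */
        int  referenced;    /* clock "second chance" bit */
        int  dirty;
        char data[PAGE_SIZE];
    };

    static struct frame pool[POOL_PAGES];
    static int clock_hand;
    static int datafd;

    static void flush_frame(struct frame *f) {
        if (f->dirty && f->pageno >= 0)
            pwrite(datafd, f->data, PAGE_SIZE, (off_t)f->pageno * PAGE_SIZE);
        f->dirty = 0;
    }

    /* Return a frame holding `pageno`, reading it in and evicting if needed. */
    static struct frame *get_page(long pageno) {
        for (int i = 0; i < POOL_PAGES; i++)          /* already cached? */
            if (pool[i].pageno == pageno) { pool[i].referenced = 1; return &pool[i]; }

        for (;;) {                                    /* clock sweep for a victim */
            struct frame *f = &pool[clock_hand];
            clock_hand = (clock_hand + 1) % POOL_PAGES;
            if (f->referenced) { f->referenced = 0; continue; }
            flush_frame(f);                           /* our own write-back policy */
            pread(datafd, f->data, PAGE_SIZE, (off_t)pageno * PAGE_SIZE);
            f->pageno = pageno;
            f->referenced = 1;
            return f;
        }
    }

    int main(void) {
        datafd = open("tablefile", O_RDWR | O_CREAT, 0644);   /* hypothetical */
        if (datafd < 0) { perror("open"); return 1; }
        for (int i = 0; i < POOL_PAGES; i++) pool[i].pageno = -1;

        struct frame *f = get_page(0);                /* fetch, modify, mark dirty */
        memcpy(f->data, "hello", 5);
        f->dirty = 1;

        for (int i = 0; i < POOL_PAGES; i++) flush_frame(&pool[i]);  /* shutdown */
        close(datafd);
        return 0;
    }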

------
ibotty
Here is a mail from Mel Gorman with many more details.

[http://mid.gmane.org/%3C20140310101537.GC10663%40suse.de%3E](http://mid.gmane.org/%3C20140310101537.GC10663%40suse.de%3E)

------
peapicker
So, what does Oracle Enterprise Linux do on an Exadata box running Linux?

~~~
caf
ora uses O_DIRECT.

------
zvrba
Ditch Linux and port PgSQL to run on top of raw Xen interfaces. You get to
control your own buffering, worker thread scheduling, and talk directly to the
(virtual) disk driver. I believe it'd be a win.

------
cratermoon
O_PONIES ride again?

------
dschiptsov
Oh yes, that annoying problem (especially for MongoDB) that data should
eventually be committed to disk.)

Informix (and PostgreSQL) allows the DBA to choose "checkpoint/vacuum
intervals".

The rule of thumb, unless you are a Mongo fan, is that checkpoints should be
performed often enough to not take too long, which depends only on the actual
insert/update data flow.

But any real DBA could tell you the same: sync quickly, sync often, so the
server will run smoothly (if not "at web scale") and the pain of recovery will
be less severe.)

