

Fsync() on a different thread: apparently a useless trick - antirez
http://antirez.com/post/fsync-different-thread-useless.html

======
pquerna
I guess I'm not grokking how the behavior of write() blocking while fsync is
being called is considered a problem.

write() is documented and well known to be a potentially blocking function,
for potentially a long time. If you are writing a small-number-of-threads
process, that is doing Disk and Network IO in the same threads, you can
definitely starve out the network IO when the disk system starts thrashing,
but generally that is fine, because you can't keep up with new requests at
that point anyways.

For that matter, read() can block too, people are just 'used' to their Kernels
being smart and most important data being in cache.

If you want your event-style threads servicing network clients to be able to
use write() or read() without 'blocking' up a whole thread for apparently
random amounts of time, consider using the aio_* style write/fsync, although
you definitely would want to benchmark their impact on general performance.

If you don't want to use aio_*, which is likely to be... dangerous ground
considering how few people use them, you end up adding threads to provide more
isolation and parallelism for IO.

~~~
antirez
Hello. On non-real-time Linux, write() is not able to guarantee that it will
return in a given amount of time; still, all the system calls have more or
less predictable timing behavior when the disk and the CPU are not busy. What
I mean is that if you remove the fsync() call from the other thread, what you
get is a constant stream of "13 microseconds" delays.

So when fsync() is not in the mix, the kernel will do the right thing: it will
use buffers and will make the write calls very cheap. This is important for
many applications. But when there are stricter durability requirements this is
no longer true, and care must be taken.

Non-blocking I/O (aio_*) is an interesting alternative in some applications,
but in the case of Redis it is important to return "OK" only after we get an
acknowledgment from write(2). Doing this by suspending the client and resuming
it once the write has been performed would turn Redis from a 140k
operations/second database into a 10k operations/second database, so this is
not going to be the solution.

Real-world software is written not by reading manual pages, but by checking
how the underlying OS actually works, IMHO. For instance, Redis persistence
relies on the fork() copy-on-write semantics of modern operating systems.
Also, the fact that write(2) can block per its semantics does not mean you'll
be happy to learn your kernel is blocking a process for seconds, many times,
as a result of a write(2) operation.

~~~
xpaulbettsx
Kernel developer here - honestly, I'm not surprised that fsync() stops the
world, and furthermore I suspect that even if you got past the vfs layer, you
would see different effects on different filesystems (i.e. you _still_
couldn't bet that fsync() would act like you want it to). The semantics of
fsync mean "Please guarantee that _everything_ is written to disk, flush all
caches now".

The kernel doesn't keep a 2nd queue for post-fsync writes that it will then
swap into the "real" one - think about what happens if someone _else_ calls
fsync(); does it spin up a 3rd queue for that one? Does the fsync block? I
think it would quickly descend into Crazyville.

~~~
antirez
Hello xpaulbettsx, thanks for your comment!

Yes, I guess the implementation may get more complex. I'm not sure about the
actual implementation, but if it's just a linked list of operations to flush,
as it appears from looking at a few source code fragments, then it would be
possible to put a "sentinel" in the list that blocks the first fsync() when
the first sentinel is found, and so forth.

I mean, I'm all against complexity myself in the code I write, so I can't
really question this behavior, and the "fsync every second" policy is not a
huge use case indeed, but it's still important to know this. Googling around a
bit, there are tons of people who appear to be pretty confident that moving
fsync() into another thread is the way to go, while things actually work
differently.

------
alecco
Using a complex generic filesystem with too much metadata is a big part of the
problem. If your app is write-oriented, try a logfs-style filesystem. Or
perhaps it's time for a new minimal filesystem.

Direct I/O is another option but too crass, IMHO.

------
russell_h
Is there a reason to call fsync() all the time instead of using O_SYNC?

~~~
antirez
This is what I'm considering for "fsync always", but for "fsync everysec" it
is not ideal, unfortunately. Still, O_SYNC may be able to fix at least one
problem, and that's cool :) Thanks for the comment.

Edit: worth noting that O_DIRECT, instead, can't be used for an append-only
file, because of its alignment requirements.

Edit 2: yep O_SYNC helps a lot. Just updated the post with the results.

~~~
russell_h
Ah, yeah, that's what I meant; I probably should have specified that.

------
geocar
This has nothing to do with fsync().

If you lseek() back to the beginning after each write(), this behavior should
never occur.

Growing a file is complicated, and may require extra unexpected disk accesses
while the directory entry is modified or the free blocks are reassigned.
_These_ are what are blocking the write(), and write() will do them
eventually, on a busy enough system, whether you call fsync() or not.

------
houseabsolute
Seems like a userspace write queue would solve the write-delay problem. On the
other hand, it's very dangerous to report success to the user before the
changes actually hit the disk, because they might make some other change to
the world that depends on the contents of that file being on disk, leading to
an inconsistent state.

------
azim
One thing the author of this post is doing here is calling gettimeofday() in
every iteration through the loops. That's an awful way to benchmark something.
gettimeofday() issues a serializing instruction, which can potentially flush
the entire pipeline. It can also cause cores to sync their clocks. Either way,
the net result of calling gettimeofday() too often is awful multithreaded
performance. Calling it much less often or using a profiler would give very
different results than reported.

------
rbranson
Have you considered timing the average wait time for fsync() and continuously
adjusting a timer to fsync() every N milliseconds based on this data? You
could queue all of your response-to-client messages to wait on this fsync(),
and do tens of these per second, even on a slow drive. This technique can
potentially get client responses down to fractions of a second, even in high
throughput scenarios.

