
PostgreSQL pain points - sciurus
https://lwn.net/Articles/591723/
======
spudlyo
I must be missing something. Why doesn't PostgreSQL use O_DIRECT when doing
non-journaled writes to the data blocks? I know there are alignment
restrictions on the userspace buffers used to do the reads and writes, but
these don't seem insurmountable, and the payoff is huge. Is it because it's
not POSIX?

~~~
jandrewrogers
Using O_DIRECT well requires taking on a surprising amount of additional
resource management responsibility. Also, the implementation is not portable
and, combined with the additional responsibility previously mentioned, adds a
ton of conditional compilation. The alignment requirements are not an issue at
all; anyone using O_DIRECT is also going to be using properly aligned I/O
buffers in any case.
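
To make the alignment point concrete, here is a minimal Python sketch (not from the thread). It assumes Linux and a 4096-byte alignment unit; the real value depends on the device and filesystem. The helper names `aligned_buffer` and `is_direct_io_aligned` are hypothetical.

```python
import mmap

BLOCK = 4096  # assumed alignment unit; real code should query the device


def aligned_buffer(size, block=BLOCK):
    """Return a zero-filled buffer whose length is rounded up to a block multiple.

    Anonymous mappings start on a page boundary, so the buffer is aligned
    as long as block does not exceed the system page size (an assumption here).
    """
    rounded = -(-size // block) * block  # ceiling division, then scale back up
    return mmap.mmap(-1, rounded)


def is_direct_io_aligned(offset, length, block=BLOCK):
    """True if the file offset and transfer length are both block multiples."""
    return length > 0 and offset % block == 0 and length % block == 0
```

On Linux, a buffer like this could then be passed to something like `os.preadv(fd, [buf], offset)` on a descriptor opened with `os.O_DIRECT`; whether that open succeeds depends on the filesystem.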

O_DIRECT can have a lot of benefits. However, while O_DIRECT may be simple to
use by itself, the other design and implementation requirements that are
indirectly dragged along are not trivial by any means. Consequently, you
should not be using O_DIRECT unless you can manage the substantial indirect
overhead in implementation. It turns off a lot of OS features most developers
take for granted.

~~~
spudlyo
You've essentially just waved your hands. Conditional compilation is not an
issue at all; anyone working on a large portable program is going to have
large swaths of conditionally compiled code. Phrases like "resource management
responsibility" and "substantial indirect overhead" are pretty vague, and
aren't really helping me understand why O_DIRECT can't be used.

 _It turns off a lot of OS features most developers take for granted._

Yeah, like OS buffering. What else?

~~~
jandrewrogers
I use O_DIRECT almost exclusively in the software I write; I am familiar with
the overhead. You can use O_DIRECT very easily without a lot of overhead if
you wish but the performance will be much worse than not using it unless you
competently reimplement many other parts of the OS functionality at the same
time.

To use O_DIRECT effectively in Linux, at a minimum you need to write a buffer
cache that is both fast and adaptive to workloads with proper cache
replacement schedulers (no LRU crap), you need to write a complete I/O
scheduler replacement (e.g. using io_submit and related interfaces) which is
not portable at all, and you need to stop using almost all file system APIs
(it interferes with the I/O scheduling bypass).

O_DIRECT is easy to use for very narrow cases. Try to take advantage of it for
anything slightly complicated in terms of file I/O usage and you bite off a
lot more low-level implementation, or the performance will actually be worse.
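
For a sense of what "write a buffer cache" involves, here is a minimal Python sketch of a page cache with CLOCK (second-chance) replacement. The comment above does not name a policy; CLOCK is chosen here only because it is short, and it is not the adaptive kind of scheduler the comment asks for. `ClockCache` and the `load_page` callback are hypothetical names.

```python
class ClockCache:
    """Fixed-size page cache using CLOCK (second-chance) replacement.

    Assumes capacity >= 1. load_page is a hypothetical callback that
    fetches a page from storage given its page number.
    """

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page
        self.slots = []   # each slot is [page_no, data, referenced]
        self.index = {}   # page_no -> position in self.slots
        self.hand = 0     # clock hand: next eviction candidate

    def get(self, page_no):
        pos = self.index.get(page_no)
        if pos is not None:
            self.slots[pos][2] = True  # hit: mark as recently used
            return self.slots[pos][1]

        data = self.load_page(page_no)  # miss: fetch from storage
        if len(self.slots) < self.capacity:
            self.index[page_no] = len(self.slots)
            self.slots.append([page_no, data, True])
            return data

        # Sweep: clear reference bits until an unreferenced slot turns up.
        while self.slots[self.hand][2]:
            self.slots[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity

        victim = self.slots[self.hand]
        del self.index[victim[0]]
        self.slots[self.hand] = [page_no, data, True]
        self.index[page_no] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return data
```

Even this toy version leaves out dirty-page tracking, write-back, pinning, and concurrency, which is roughly the extra responsibility being described.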

------
bananas
I'd love to see an objective real world comparison of Linux kernel vs FreeBSD
kernel and ZFS with postgresql. I've got a (not terribly busy) box to build
for someone and I'm torn between the two.

------
jwatte
Why does the Linux kernel let the disk sit idle before the fsync? Why does
fsync cause such a storm, if the kernel could write as it goes? Also, why does
PostgreSQL do buffered writes instead of unbuffered? Seems like two wrongs
make a wrong here...

~~~
fiatmoney
Write-as-you-go isn't necessarily an optimal strategy for general purpose
workloads with intermittent IO - batching writes lets you do more linear
writes, do a better job of allocating blocks, hit the journal less often...

There is a need for an alternate API, but it's not like fsync is "broken" for
the general case.
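
The batching argument above can be illustrated with a small Python sketch: collect dirty block numbers for a while, then merge them into contiguous runs so that scattered writes become fewer, more linear ones. `coalesce_blocks` is a hypothetical helper, not something the kernel or PostgreSQL exposes.

```python
def coalesce_blocks(dirty_blocks):
    """Merge dirty block numbers into sorted (start, count) runs."""
    runs = []
    for block in sorted(set(dirty_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # Block directly follows the current run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap before this block: start a new run.
            runs.append((block, 1))
    return runs
```

For example, `coalesce_blocks([7, 3, 4, 5, 8, 20])` turns six single-block writes into three runs. Writing each block as soon as it is dirtied gives up the chance to do this merging.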

~~~
dbrower
fsync has been broken since it was introduced in BSD4.x. It's never worked
well, usually flushing the whole machine cache in one huge batch of i/o. I'm
told current versions on Linux will restrict it to the one open file, but it's
still a shotgun instead of a rifle.

------
batbomb
I wonder if systems with NV cache controllers experience the same issues
with slow fsync.

~~~
dbrower
Yes, fsync typically chews a boatload of CPU running down the pages in the
cache to see which pages need to be written, then going through the whole
I/O path to enqueue the writes.

