
 Subtly Bad Things Linux May Be Doing To PostgreSQL  - r4um
http://rhaas.blogspot.in/2014/04/subtly-bad-things-linux-may-be-doing-to.html
======
jandrewrogers
Issues like the ones raised are why all sufficiently advanced database engine
designs tend to evolve toward a kernel bypass architecture. From the
perspective of a database engine, operating systems do a lot of "dumb" things
with resource management and scheduling in ways that are essentially
impossible to avoid that the database engine has enough context to do
intelligently on its own. OS bypass in a database can have substantial
performance benefits and add robustness for many edge cases. The obvious
downside is that you basically end up rewriting the operating system in
userspace, minus the device drivers, which is an enormous amount of work. It
is not an incremental kind of design approach and it makes portability
difficult. PostgreSQL actually does a really good job trying to be fast and
robust without going to bypass internals.

I've always asserted that the reason we've never seen a proper kernel bypass
database engine in open source is that the Minimum Viable Product is too
complex. A bare, stripped-down, low-level database engine that does full
bypass of the operating system is usually _at least_ 100kLoC of low-level
C/C++, and that is before you add all the features a database user will
actually want. That is a big initial investment by some people with fairly
rare software design skills.

~~~
drzaiusapelord
How does this work? I don't see how its possible to fight the scheduler or how
caching works via userspace.

Maybe I don't understand this stuff, but maybe the kernel should have some
bypass API for high performance applications, instead of coders finding
curious ways to fight it.

~~~
usefulcat
You can bypass filesystem caching on Linux using O_DIRECT. The tradeoff
(besides the obvious lack of caching) is that there are specific alignment
restrictions on the length and address of the userspace buffers and file
offsets, and these restrictions may vary by filesystem and by kernel.

------
moe
I love that infinitely nerdy stuff like this can still make it to the HN
homepage (there's still hope!).

Stuff like this really needs to make it into the PG tuning guide
([https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...](https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server)).
The only place where it will ultimately be seen by a worthwhile audience.

~~~
viraptor
> infinitely nerdy stuff

Since when is "in depth" and "technical" equivalent to "infinitely nerdy"?
We're professionals using that kind of information for work, not nerds doing
infinitely nerdy stuff.

~~~
d23
Since when is "anti-social" and "nit-picky to the point of absurdity"
equivalent to "expressing critical thought"?

------
davidpardo
Being a user of both Linux and PostrgreSQL, I'm very interested in this issue,
but I only understand some of the words...

Could everybody wiser than me tell me if I should be concerned and the
possible implications of these decisions? Should I invest in alternative
platforms?

~~~
treenyc
same here. Was a FreeBSD user, but find hard to find VM a few years ago that
support FreeBSD, may switch back, if this issue is NOT addressed.

~~~
dfox
First issue is relevant only on systems that have more than one NUMA node,
which is probably every meaningful physical server and essentially no VM (at
least on Xen, multiprocessor VMs are single NUMA node), as it does not make
much sense to advertise NUMA topology to guest VMs.

Second issue is relevant for postgresql mostly only if you use very large
shared_buffers which anyway is not recommended for general workloads. Writing
page that exists on disk and was not read short time before is not especially
common thing to do.

~~~
rodgerd
NUMA can absolutely ping you in virtual servers, but without access to the
hypervisor you'll never know why it's happening (JVMs straddling NUMA regions
have caused me pain in the past, when the guest was split across memory
regions).

~~~
dfox
The point is that kernel inside VM guest knows nothing about NUMA, so it
cannot do any kind of NUMA optimizations hence such optimizations cannot hurt
performance as they do not happen at all.

~~~
rodgerd
Actually, KVM can allow you to create NUMA domains inside the guest, for
better or worse.

------
forkandwait
Anybody know of a FreeBSD comparison?

~~~
kev009
+1 I'd love to see a FreeBSD kernel hacker chime in here.

ZFS enables some interesting things for pgsql:

* [http://open-zfs.org/wiki/Performance_tuning#PostgreSQL](http://open-zfs.org/wiki/Performance_tuning#PostgreSQL) \- it seems like the primarycache setting prevents the double buffering problem that Linux' page cache has

* [http://citusdata.com/blog/64-zfs-compression](http://citusdata.com/blog/64-zfs-compression)

I run pgsql on FreeBSD/ZFS and have no complaints but am not taxing the
system.

~~~
ksec
I love ZFS, but Oracle is holding it up which is very annoying.

~~~
kev009
How do you mean? I trust OpenZFS more and in fact would consider it negligent
if Oracle wasn't cherry picking OpenZFS bug fixes. Several original Sun
engineers including ZFS co-founder Matt Ahrens are involved with OpenZFS.

~~~
bcantrill
Yes, as I have elaborated upon at length[1], the ZFS engineers have long-since
left Oracle -- and the open community's ZFS (that is, OpenZFS) has become the
ZFS of record.

One clarification: Oracle can't actually cherry-pick back OpenZFS bug fixes
because they are (ironically) violating the CDDL by not making available
source code. This isn't an issue for the code for which they hold copyright --
but that doesn't include any of the bug fixes and features that we've seen in
OpenZFS since 2010. And yes, it is absolutely negligent, but of a different
sort than you intended...

[1]
[http://www.youtube.com/watch?v=-zRN7XLCRhc](http://www.youtube.com/watch?v=-zRN7XLCRhc)

~~~
kev009
Cheers Bryan, that's a great one of your talks and thanks for the interesting
licensing comedy.

------
caf
Point 2 is not actually the case, as long as your write does not partially
fill a pagecache page (pagecache pages are the same size as the architecture's
native page size - 4K on x86, x86-64 and arm).

You can demonstrate this with a program like the following:

    
    
      int main(int argc, char *argv[])
      {
        int i;
        char pattern[512*1024];
        int fd;
    
        for (i = 0; i < sizeof pattern; i++)
            pattern[i] = 'X';
    
        fd = open("testfile", O_RDWR | O_CREAT | O_EXCL, 0666);
    
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
    
        while (1)
        {
            pwrite(fd, pattern, sizeof pattern, 0);
            fdatasync(fd);
            posix_fadvise(fd, 0, sizeof pattern, POSIX_FADV_DONTNEED);
        }
    
        return 0;
      }
    

...then watch vmstat or iostat while it's running. Plenty of writes, no reads.

On the other hand, if you subtract one from the size of 'pattern', you'll see
that you also get reads (as partially writing the last page requires a read-
modify-write cycle).

------
spaznode
I was under impression from reading another article that setting the correct
kernel io scheduler to use helps with a related concept, maybe possibly:
[http://www.cybertec.at/postgresql-linux-kernel-io-
tuning/](http://www.cybertec.at/postgresql-linux-kernel-io-tuning/)

------
raverbashing
For 1: I don't remember exactly which machines use NUMA, but I thought it was
limited to 1st Opterons (and the behaviour makes sense)

2: Not sure, this may be specific to FS, or something that has to do with the
behaviour of MMAPed files however I don't know how do you guarantee that what
you're writing corresponds to a single block in the FS (unless you're writing
directly to /dev/sda and even then)

~~~
kev009
1 affects any multi-socket Intel system after Nehalem, and any multi-socket
Opteron+ AMD system.

So basically most physical servers.

------
MrBuddyCasino
This is slightly OT, but I always wondered: what is the overhead of fetching
cached stuff from the file system vs. having a built-in cache? Is there a way
to circumvent going through the kernel and avoid a context switch?

In other words: could a database have only minimal built-in caching and
instead rely on the OS cache?

~~~
Negitivefrags
Postgres does rely on the OS cache. The "effective_cache_size" parameter is
there for you to tell postgres how big you are expecting your OS cache to be
and it is supposedly used when planning queries, presumably so it can rely on
a plan having cached reads.

------
atmosx
If we had some metrics in a graph form, backing up the post would be great for
us _mere mortals_ to understand which way to go in order to achieve maximal
performance.

------
lallysingh
For (1), you can use numactl to change the default behavior. For #s (2) and
(3), I'm pretty sure this is FS dependent?

------
rivert
related post on lwn:
[https://lwn.net/Articles/591723/](https://lwn.net/Articles/591723/)

------
contingencies
Not to detract from the very intelligent and reasoned posting, but what tiny
percentage of people honestly still use fat-ass RDBMS as their primary
datastore _and_ would be better off performance tuning it at the kernel IO
level than actually analyzing their load and subsequently sharding or
migrating their data structures to less behemoth-like datastores? _Yes_ ,
RDBMS are easy to hire developers and DBAs for, are well supported and full-
featured. _However_ , in this day and age using them just feels a little ...
lazy ... for most workloads.

~~~
kev009
That's a very obtuse point of view. I'm curious sociologically: what field do
you work in and what your exposure to data is?

Consider an inventory system for a big box retailer. I can't think of anything
better than a fat-ass RDBMS as the primary data store. Sharding sounds like a
horrific idea. There are myriad workloads like this.

Personally, I've seen pgsql handle terabytes of data just fine and it wasn't
really noteworthy or a source of problems to even bring up considering
something else. YMMV but it's a good idea to use logic and reason to dictate
architecture instead of following the shiny thing or hubris.

~~~
davidw
Well, yes and no.

Everybody knows that relational databases don't scale because they use JOINs
and write to disk.

Also, relational databases weren't built for web scale. MongoDB handles web
scale. You turn it on and it scales right up.

And before you knock shards, shards are the secret ingredient in the web scale
sauce. They just work.

Furthermore, relational databases have impetus mismatch, and Postgresql is
slow as a dog. MongoDB will run circles around Postgresql because MongoDB is
web scale.

~~~
ro_sharp
Are you being intentionally sarcastic? Because this reads a lot like
[http://www.mongodb-is-web-scale.com/](http://www.mongodb-is-web-scale.com/)

Edit: Whoops, just read your reply :)

~~~
GFischer
His post was a textbook example of Poe's Law

[http://en.wikipedia.org/wiki/Poe's_law](http://en.wikipedia.org/wiki/Poe's_law)

"without a clear indication of the author's intent, it is difficult or
impossible to tell the difference between an expression of sincere extremism
and a parody of extremism"

