How Linux 3.6 Nearly Broke PostgreSQL (lwn.net)
221 points by alrs on Oct 11, 2012 | 51 comments

Unintended adverse side effects from a tiny change to a small component of a complex OS kernel that runs on complex modern processors that are part of mindbogglingly complex computer systems, on which we run the ridiculously vast software ecosystem which makes possible the massively complex global network of applications and services we call "the Web."

Every time I read or hear about unintended-consequence incidents like this one, I'm reminded of Jean-Baptiste Queru's essay, "Dizzying but Invisible Depth" -- highly recommended if you haven't read it.[1]


[1] https://plus.google.com/u/0/112218872649456413744/posts/dfyd...

The problem is that no testing is done. I found 3 bugs in fcntl/dup2/dup3 in the latest kernel release, all easily found when I ran the gnulib test suite, but they had been bugs in the kernel for at least 2 months:




Does the Linux kernel have test suites?

It has autotest.

It has gnulib (not a test suite, but a very comprehensive POSIX API test).

It has the poor saps who have to use it.

However none of these things gate commits to the kernel.

Well, living creatures are like that, too.

You just arched an eyebrow. Simple, isn't it? Well, when you know a bit about biology...

What a tease.

Nice, but I think "I, pencil" (http://en.wikisource.org/wiki/I,_Pencil) is better.

"A potentially simpler alternative is to let the application itself tell the scheduler that one of its processes is special. PostgreSQL could request that its dispatcher be allowed to run at the expense of one of its own workers, even if the normal scheduling algorithm would dictate otherwise."

I don't see how this is so bad; it seems like the best solution to me. If you're writing a specialized, high-performance piece of software, I feel like the application developer should be the one tasked with making sure the kernel knows certain things about the application. It's pretty clear a project like Postgres is doing all sorts of tricks and optimizations already; I don't see how this would be any more or less burdensome.

Overall I feel it's a fair trade-off to have the kernel be told specific things by the application so it can make better scheduling decisions, versus having to guess and potentially make poor decisions at the expense of the most common applications.

There are two problems with this:

First, there's no POSIX-standard way to communicate this scheduling request to the kernel. So you either have to add a new API or add some "secret knock" to an existing API that will trigger the desired behavior. Neither of these encourages portable code, and neither helps existing, deployed applications, which would have to be modified to get the desired result.

Second, it introduces a new axis for regressions. With today's scheduler, maybe setting this "I'm a control process" flag makes your application faster. But maybe the new scheduler implementation in the next major kernel release actually causes your application to run slower with this flag on than off. Some other applications might see the reverse.

> First, there's no POSIX-standard way to communicate this scheduling request to the kernel. So you either have to add a new API or add some "secret knock" to an existing API that will trigger the desired behavior. Neither of these encourages portable code, and neither helps existing, deployed applications, which would have to be modified to get the desired result.

Linux doesn't fully comply with POSIX and has many of its own APIs. Cross-platform applications either abstract this away themselves or have libraries do it for them. I doubt a database avoids using a lot of OS-specific features.

It's OK for a user program to "hint" to the kernel, but the problem with this approach is that it becomes the solution for many problems the kernel should take care of.

After all, the purpose of an operating system is to crack the hard nuts for you, isn't it?

I'm very surprised by the hack that reduces the set of possible CPUs for a process to two. This will cause other problems as 32+ core machines become more common.

I'm even more surprised by "some benchmarks show it's faster, let's merge it".

Maybe they could try something larger than subsets of 2 CPUs?

If I'm understanding correctly, the two-core limit applies only when initially waking the process; after that, normal load balancing can move it elsewhere if necessary. There's a cache benefit to sticking to one of the same two hyper-threaded CPUs, so this does make sense even on 32-core machines.

Yes, but I think you can get a 15-puzzle situation - even if you do this only when waking a process - when you wake a lot of processes at the same time on a machine with many cores.

I really prefer Linus's suggestion, even if it's the hardest.

Yeah, I really don't like blocking off all that extra capacity just because the OS is too lazy to find it. Keep a better data structure for fast searches, or improve the result over time instead of sticking with the first quick answer. Reminds me of Android, where background processes can never use the full resources of the device even if it is otherwise idle. They are in a scheduling class that caps them at a fixed percentage of total resources, not a smart one that merely gives them lower priority. Now say an app comes around that wants to do some heavy media processing in the background that the user explicitly requested. Oops, sorry, you don't get to use all your hardware. Same thing that is happening to the DBMS in this case.

In the Android example, I can see why they limit low-priority processes to a fraction of total resources: to avoid silently draining the battery. Android runs on phones, and using 100% CPU on a phone is a great way to kill battery life. Allowing higher usage would lead to users getting annoyed at their phones (and Android in general), when really blame would lie with poorly-written apps they installed.

But I agree with you about the Linux patch. Considering the performance impact, it seems irresponsible not to implement the best solution. The kernel isn't some web app where everyone gets the latest version automatically. Bad code in the kernel sticks around for a long time.

Making this solution work with something other than a "buddy" CPU is tricky if you assume the count is always a power of two.

Linus Torvalds ripping into the patch committer: http://lwn.net/Articles/518351/

I'd hardly call it that coming from anybody. And for Torvalds, this is more like mild advice.

He seemed to be ripping into the old code more than the new.

The linked post makes me wonder whether being top dog automatically turns a person into a nigh-intolerable Steve Jobs, whoever they were to begin with. Because of some underappreciated but irresistible biological law.

My question is - why does Postgres need its own scheduler? Shouldn't that be the job of the OS? Is it a legacy thing or just something to squeeze out a tiny bit of extra performance?

For reasons similar to why DB engines often implement their own caching and disable OS disk caching: a combination of more information about what needs to be done, and a very competitive marketplace.

Indeed, there are lessons here, especially from people like PHK and projects like Varnish. Understanding the underlying system is critical for these sorts of projects. For 99% of users the OS will do a much better job of handling your resources than you ever will, but there will always remain a segment where additional knowledge about how those resources are going to be used can yield significant performance gains.

PostgreSQL's spinlock implementation is very fast and scalable and not all supported platforms have good implementations of mutexes. The Linux futexes might be fast enough to replace PostgreSQL's spinlock implementation, but even that is not known yet.

Here, have a microbenchmark!

Here are two really simple mutexes: a wrapper around the pthread functions, and a no-thought-required spinlock:

    class Spinlock {
        bool state;
    public:
        Spinlock() { state = false; }
        void lock() {
            // spin until we atomically flip state from false to true
            while (!__sync_bool_compare_and_swap(&state, false, true));
        }
        void unlock() { state = false; }
    };

    class PthreadMutex {
        pthread_mutex_t mutex;
    public:
        PthreadMutex() { pthread_mutex_init(&mutex, NULL); }
        void lock() { pthread_mutex_lock(&mutex); }
        void unlock() { pthread_mutex_unlock(&mutex); }
        ~PthreadMutex() { pthread_mutex_destroy(&mutex); }
    };

I used these to protect a not-optimized-at-all dispatch queue built on top of std::deque (single reader, single writer; the reader busy-waits until work is available; the work is an empty function). I ran this test on Mac OS X 10.6.8 (with Apple's gcc 4.2.1 build).

A run using PthreadMutex:

Using Spinlock:

All times are in microseconds to run 1,000,000 iterations of the test. Only time to enqueue is measured.

I've seen similar results with a very similar test on FreeBSD, though I don't have a box to retest on at the moment.

I can only conclude it's not at all unreasonable for PostgreSQL to use its own spinlocks.

It will have more information about what it is trying to achieve than can be conveyed to the kernel through the available (cross-platform, where possible) APIs. More knowledge and more specialization mean greater opportunity for optimization.

PostgreSQL does not have its own scheduler. Here's a comment at LWN that tries to correct the article a bit: http://lwn.net/Articles/518405/

Everyone does this. Even SQL Server, which runs only on Windows and is made by that very same company, needed Fibers invented in the kernel so it could do its own user-mode scheduling.



Keen observers of database history may remember Sybase. Sybase made a similar decision about doing their own scheduling, rather than relying on the operating system. Oracle at that time let the OS do the scheduling. The former turned out to be a strategic mistake.

Ahh, I do believe I can share some insight here.

That Sybase implemented its own threading was actually quite an astute decision at the time. We're talking early-to-mid nineties, when threading provided by operating systems was in a, charitably put, pretty unstable state. We're talking of a time when a gig of physical memory was not bad at all for a database server.

The reason why Sybase failed (relative to Oracle, at least) had nothing to do with threading, but with the fact that Sybase did not support row-level locking and adamantly refused to implement it.

From a purely theoretical position, Sybase was right. If your database application is well designed, page-level locking has a number of advantages, namely much lower resource usage and fewer internal housekeeping requirements. While a page-level-managed database is not quite maintenance-free, it is much more so than when the data is row-organized.

The problem, however, was that the real world is not a theoretical thing, and the various application suites (SAP R/3, PeopleSoft, etc.), which boomed around that time, absolutely required row-level locking. PeopleSoft actually did run on Sybase, but performance was, again charitably put, difficult.

It didn't help that Sybase, at that time, released Sybase 10, which was a dreadful product, quality-wise. From what I heard (and yes, this is hearsay), engineering implored senior management to give it six more months, which they refused to do.

While I never heard of data corruption in Sybase 10 databases, the quality was quite horrible. Couple that with Sybase's arrogance as a high-flyer at the time: chiding customers for its own quality issues was not a smart move.

But the main issue was row level locking and certainly not the threading architecture.

Two additional points: the new Sybase kernel (15.7) actually uses OS threads by default. You can still use the internal scheduler, but according to Sybase, most installations should benefit from the new kernel.

The other thing is that it's rather ironic that SAP bought Sybase (it's marketed now as SAP / Sybase), which somehow brings the whole story full circle.

I have worked with Sybase products since 4.2, which is the early nineties, and I worked for Sybase Professional Services from '95 to '99. Which makes me believe I'm somewhat qualified to comment on the issue.

Lack of support for row-level locking was a problem. I recall having to work around it by padding records with extra columns to ensure only one row would fit on a page. But I'd say the bigger issue was their overall approach to concurrency. Oracle's optimistic concurrency control mechanism (where readers don't wait for writers) worked better in practice than Sybase's early lock-based concurrency control (where they did).

I recall Philip Greenspun dedicated a substantial portion of his late 90's database-backed website book to that topic.

That, again, very much depends on your application design.

Sybase SQL Server (now ASE) was designed from the ground up as an OLTP database. It required that you keep your write and update transactions really short.

I've seen - and worked on - projects that had amazing throughput. They were, however, designed from the ground up to perform well on the underlying database.

Where Sybase's concurrency turned dreadful was when you ran chained transactions at isolation level 3. All I can say is: don't try this at home, folks.

Also, Oracle's locking mechanism didn't come free. I remember (and my Oracle knowledge is really minimal) the dreadful, overflowing rollback log, whose sizing was a science of its own.

I'm not saying one is better than the other, but it points out quite nicely the impact of design decisions, and how they always come with a price.

To really perform well, applications do need to be written with the underlying database in mind. The popularity of ORM-based data access layers has duped a lot of folks into believing that you don't have to think too much about the database.

I do believe you are much more on top of this than I. Thanks for the fine detailed clarification.

Every database management system that has a half-decent SMP implementation, is portable to multiple platforms, and has enough history to be running on old, busted kernels has already grown a heavily tested, high-performance spinlock mechanism - including Oracle. Xorg is another example: not a database system, but it too has to deal with multiple platforms and busted stuff.

Postgres as a project is very keen to offload tasks to the operating system whenever operating systems at large are not unacceptably slow or broken - and sometimes even when they are, if nobody has the resources to do anything about it. That's why it doesn't schedule its own writes using O_DIRECT, taking a hit in buffer copying from shared memory, unlike virtually every proprietary database.

I don't think Sybase's failures were mostly technical. If you count Sybase's "fork" MS SQL, then the database code did rather well.

This doesn't sound like Linux almost broke Postgres. It sounds like Postgres is doing things (scheduler) that it should not be.

I don't think so. PostgreSQL has a fast implementation of its dispatcher that works well. The kernel change introduces a regression in the performance of PostgreSQL and potentially other programs doing the same style of locking. The kernel people are not suggesting that PostgreSQL use a different implementation; they do not want to break working code, and are treating this code change as a regression that needs to be fixed in the kernel. In the comments, the article author writes:

FWIW, the discussion in kernelland was based on the assumption that this regression was the kernel's problem. Nobody there suggested telling the PostgreSQL developers to come up with a new locking scheme.

The problem is that PostgreSQL is not the only software that does this. I believe at least Oracle and Erlang's ETS tables also have their own spinlock implementations that might be affected by this change. There could be plenty of software broken by this kernel change.

I did not see anyone in the lkml discussion blaming PostgreSQL for having its own spinlock implementation.

AFAIK, just about everybody writing high performance multiprocess/thread code that relies heavily on mutexes makes some use of user-mode spin-locking.

It's essential if you have N processes contending for a single mutex which they will hold for very short periods of time. Asking the kernel to put you to sleep until the mutex is available means progress is limited by the rate at which the OS can wake up processes. If the mutex is only going to be held for a few dozen cycles (say, to increment the heads of a few queues) then the throughput cost could be considerable over simply spinning a few nanoseconds in user mode until the mutex is available.

And yes, the need becomes more acute if you want to be sure you'll get reasonable performance across a broad range of platforms and their corresponding scheduling policies.
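(The usual compromise between the two extremes described above is a spin-then-block lock: spin briefly in user mode in case the holder releases within a few dozen cycles, then fall back to the kernel. A minimal sketch - the spin budget of 1000 iterations is an arbitrary illustration; real implementations tune or adapt it:)

```cpp
// Sketch of a spin-then-block ("hybrid") lock. Spin optimistically in
// user mode; if the mutex stays held, give up and sleep in the kernel
// rather than burning the rest of the timeslice.
#include <pthread.h>

class HybridLock {
    pthread_mutex_t m;
public:
    HybridLock()  { pthread_mutex_init(&m, NULL); }
    ~HybridLock() { pthread_mutex_destroy(&m); }
    void lock() {
        // Arbitrary spin budget for illustration; real code tunes this.
        for (int i = 0; i < 1000; ++i)
            if (pthread_mutex_trylock(&m) == 0)
                return;                    // got it without sleeping
        pthread_mutex_lock(&m);            // contended: sleep in kernel
    }
    void unlock() { pthread_mutex_unlock(&m); }
};
```

Production versions usually also execute a CPU pause instruction inside the spin loop to be polite to the hyperthread sibling, and back off as the spin count grows.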

Xorg also seems to exhibit that kind of behavior. Basically, any software with a "master" process and a bunch of slaves that aims for high performance and cross-platform stability appears to end up doing this kind of thing.

Given that Postgres runs on a whole bunch of different platforms, and each platform does scheduling differently, I imagine that doing it yourself is very practical, since it gives you consistent, optimized behavior regardless of the OS.

Where broke = made it run 20% slower.

High load can take out entire clusters. Imagine if a change were pushed to Netflix that slowed it down 20%: requests would queue up to the point where everything fell over.

That calls for shedding load as triage. If you're beyond capacity, it's better (less bad, anyway) for 20% of requests to fail quickly than to actually attempt them if that will prevent you from completing the other 80%.

Interesting article; it explains some of the trade-offs that OSes make in scheduling.

OS scheduler optimizations are difficult. Often what makes the desktop nice and snappy makes background stuff slower; there are always trade-offs. It also allows vendors to sell expensive versions of Linux with different schedulers (Red Hat MRG... cough). The Completely Fair Scheduler, with its tree of processes, seems to work quite well, though.

It seems like they were trying to optimize for specific hardware (the link to "scheduling domains" was interesting) when choosing which CPU to wake on (two cores on one die vs. two separate CPUs...). Good intentions, but...

Sometimes it's useful to let users explicitly control which CPUs processes can run on (processor affinity). The HP-UX variant we used let us set up groups of CPUs and then map processes onto those CPU sets; you could also select the scheduler for each process at startup. It was a pain to get things running, but in the end it worked great. Manually selecting the wrong scheduler and process priority could make some processes run terribly, however.

Too much effort has been wasted trying to find the one-size-fits-all perfect CPU scheduler for every system. For apps such as Postgres that care enough about CPU scheduling to have written their own, it's not too much to ask their authors to make a few additional system calls telling the kernel what type of scheduling the app prefers, rather than leaving it all to the kernel to divine the perfect schedule for every running app.

For that to happen, the syscalls to tune the scheduler have to exist first - which they mostly don't. You get syscalls to choose the normal scheduler or one of the realtime schedulers, and a few knobs (e.g. the nice value) to poke. This is far from enough.

Why doesn't using the kernel's spinlocks make programs slow?

