
How Linux 3.6 Nearly Broke PostgreSQL - alrs
http://lwn.net/Articles/518329/
======
cs702
Unintended adverse side effects from a tiny change to a small component of a
complex OS kernel that runs on complex modern processors that are part of
mindbogglingly complex computer systems, on which we run the ridiculously vast
software ecosystem which makes possible the massively complex global network
of applications and services we call "the Web."

Every time I read or hear about unintended-consequence incidents like this
one, I'm reminded of Jean-Baptiste Queru's essay, "Dizzying but Invisible
Depth" -- highly recommended if you haven't read it.[1]

--

[1]
[https://plus.google.com/u/0/112218872649456413744/posts/dfyd...](https://plus.google.com/u/0/112218872649456413744/posts/dfydM2Cnepe)

~~~
rwmj
The problem is that no testing is done. I found 3 bugs in fcntl/dup2/dup3 in
the latest kernel release, bugs that were easily caught by running the gnulib
test suite but had been in the kernel for at least 2 months:

<http://www.spinics.net/lists/linux-fsdevel/msg58725.html>

<http://www.spinics.net/lists/linux-fsdevel/msg58752.html>

<http://www.spinics.net/lists/linux-fsdevel/msg58799.html>

~~~
chris_wot
Does the Linux kernel have test suites?

~~~
rwmj
It has autotest.

It has gnulib (not a test suite, but a very comprehensive POSIX API test).

It has the poor saps who have to use it.

However, none of these gate commits to the kernel.

------
efuquen
"A potentially simpler alternative is to let the application itself tell the
scheduler that one of its processes is special. PostgreSQL could request that
its dispatcher be allowed to run at the expense of one of its own workers,
even if the normal scheduling algorithm would dictate otherwise."

I don't see how this is so bad; it seems like the best solution to me. If
you're writing a specialized, high-performance piece of software, I feel like
the application developer should be the one tasked with making sure the kernel
knows certain things about the application. It's pretty clear a project like
Postgres is doing all sorts of tricks and optimizations already; I don't see
how this would be any more or less burdensome.

Overall, I feel it's a fair trade-off to have the kernel be told specific
things by the application so it can make better scheduling decisions, versus
having it guess and potentially make poor decisions at the expense of the most
common applications.
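
For illustration, the closest thing today's real APIs offer is something like
this sketch: a dispatcher lowering a worker's priority with setpriority(2).
Nice values are only a blunt approximation of the wakeup hint the article
describes, and the worker_pid and nice value here are made up for the example:

```c
/* Hedged sketch: approximating "my dispatcher is special" with the
 * real setpriority(2) call, since no dedicated hint API exists.
 * Raising a child's nice value (lowering its priority) needs no
 * special privileges; the value 5 is arbitrary. */
#include <sys/resource.h>
#include <sys/types.h>
#include <stdio.h>

static void deprioritize_worker(pid_t worker_pid)
{
    if (setpriority(PRIO_PROCESS, worker_pid, 5) != 0)
        perror("setpriority");
}
```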

~~~
twoodfin
There are two problems with this:

First, there's no POSIX-standard way to communicate this scheduling request to
the kernel. So you either have to add a new API or add some "secret knock" to
an existing API that will trigger the desired behavior. Neither of these
encourages portable code, and neither helps existing, deployed applications,
which would have to be modified to get the desired result.

Second, it introduces a new axis for regressions. With today's scheduler,
maybe setting this "I'm a control process" flag makes your application faster.
But maybe the new scheduler implementation in the next major kernel release
actually causes your application to run slower with this flag on than off.
Some other applications might see the reverse.

~~~
rat87
> First, there's no POSIX-standard way to communicate this scheduling request
> to the kernel. So you either have to add a new API or add some "secret
> knock" to an existing API that will trigger the desired behavior. Neither of
> these encourages portable code, and neither helps existing, deployed
> applications, which would have to be modified to get the desired result.

Linux doesn't fully comply with POSIX and has many of its own APIs.
Cross-platform applications either abstract this away themselves or let
libraries do it for them. I doubt a database doesn't use a lot of OS-specific
features.

------
shin_lao
I'm very surprised by the hack that restricts a process's candidate CPUs to
just two. This will cause other problems as machines with 32+ cores become
more common.

I'm even more surprised by "some benchmarks show it's faster, let's merge it".

Maybe they could try something larger than subsets of 2 CPUs?

~~~
epistasis
If I'm understanding correctly, the two-CPU limit is only for initially
waking the process, after which normal load balancing can move it elsewhere if
necessary. There's a cache benefit to sticking to one of the same two
hyperthreaded siblings, so this does make sense even on 32-core machines.

~~~
shin_lao
Yes, however I think you can get a 15-puzzle situation - even if you do this
only for waking the process - when you wake a lot of processes at the same
time on a machine with many cores.

I really prefer Linus's suggestion, even if it's the hardest one.

------
mef
Linus Torvalds ripping into the patch committer
<http://lwn.net/Articles/518351/>

~~~
morsch
I'd hardly call it that if it were anybody. And coming from Torvalds, this is
more like mild advice.

------
stevencorona
My question is - why does Postgres need its own scheduler? Shouldn't that be
the job of the OS? Is it a legacy thing or just something to squeeze out a
tiny bit of extra performance?

~~~
barrkel
For similar reasons to why DB engines often implement their own caching and
disable OS disk caching: a combination of more information about what needs to
be done, and a very competitive marketplace.
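
For the disk-caching half of that, the usual Linux mechanism is O_DIRECT; a
minimal sketch follows. The 4096-byte alignment is a common requirement but is
filesystem-specific, and "datafile" is a placeholder name:

```c
/* Minimal sketch of bypassing the OS page cache with O_DIRECT, the
 * mechanism a database managing its own buffer pool typically uses.
 * O_DIRECT I/O must use aligned buffers and sizes; 4096 bytes is a
 * common requirement, but the exact rule is filesystem-specific. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)  /* aligned buffer */
        return 1;

    ssize_t n = read(fd, buf, 4096);  /* goes straight to the device,
                                         skipping the page cache */
    free(buf);
    close(fd);
    return n < 0;
}
```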

~~~
chawco
Indeed, there are lessons here, especially from people like PHK and projects
like Varnish. Understanding the underlying system is critical for these sorts
of projects. For 99% of users the OS will do a much better job of handling
your resources than you ever will, but there will always remain a segment
where additional knowledge about how those resources are going to be used can
yield significant performance gains.

------
wglb
Ouch.

Keen observers of database history may remember Sybase. Sybase made a similar
decision about doing their own scheduling, rather than relying on the
operating system. Oracle at that time let the OS do the scheduling. The former
turned out to be a strategic mistake.

~~~
CaptainZapp
Ahh, I do believe I can share some insight here.

That Sybase implemented its own threading was actually quite an astute
decision at the time. We're talking early-to-mid nineties, when threading
provided by the operating systems was in a, charitably put, pretty unstable
state. We're talking of a time when a gig of physical memory was not bad at
all for a database server.

The reason why Sybase failed (relative to Oracle, at least) had nothing to do
with threading, but with the fact that Sybase did not support row-level
locking and adamantly refused to implement it.

From a purely theoretical position, Sybase was right. If your database
application is well designed, page-level locking has a number of advantages,
namely much lower resource usage and fewer internal housekeeping requirements.
While a page-level-locked database is not quite maintenance-free, it is much
more so than when the data is organized by row.

The problem, however, was that the real world is not a theoretical thing, and
the various application suites (SAP R/3, PeopleSoft, etc.) which boomed around
that time absolutely required row-level locking. PeopleSoft actually did run
on Sybase, but performance was, again charitably put, difficult.

It didn't help that Sybase, at that time, released Sybase 10, which was a
dreadful product quality-wise. From what I heard (and yes, this is hearsay),
engineering implored senior management to give it six more months, which they
refused to do.

While I never heard about data corruption in Sybase 10 databases, the quality
was quite horrible. Couple this with Sybase's arrogance as a high flyer at the
time: chiding their customers for Sybase's own quality issues was not a smart
move.

But the main issue was row level locking and certainly not the threading
architecture.

Two additional points: the new Sybase kernel (15.7) actually uses OS threads
by default. You can still use the internal scheduler, but according to Sybase
most installations should profit from the new kernel.

The other thing is that it's rather ironic that SAP bought Sybase (it's
marketed now as SAP / Sybase), which somehow brings the whole story full
circle.

I have worked with Sybase products since 4.2, which is early nineties, and I
worked for Sybase Professional Services from '95 to '99, which makes me
believe I'm somewhat qualified to comment on the issue.

~~~
poormvcc
Lack of support for row-level locking was a problem. I recall having to work
around it by padding records with extra columns to ensure only one row would
fit on a page. But I'd say the bigger issue was their overall approach to
concurrency: Oracle's optimistic concurrency control mechanism (where readers
don't wait for writers) worked better in practice than Sybase's early
lock-based concurrency control (where they did).

I recall Philip Greenspun dedicated a substantial portion of his late-'90s
database-backed website book to that topic.

~~~
CaptainZapp
That, again, very much depends on your application design.

Sybase SQLServer (now ASE) was designed from the ground up as an OLTP
database. It required that you keep your write and update transactions really
short.

I've seen - and worked on - projects that had amazing throughput. They were,
however, designed from the ground up to perform well on the underlying
database.

Where Sybase's concurrency turned dreadful was when you ran chained
transactions at isolation level 3. All I can say is: don't try this at home,
folks.

Also, Oracle's locking mechanism didn't come free. I remember (and my Oracle
knowledge is really minimal) the dreadful, overflowing rollback segment, whose
sizing was a science of its own.

I'm not saying one is better than the other, but it points out quite nicely
the impact of design decisions. And how they always come with a price.

~~~
ams6110
To really perform well, applications _do_ need to be written with the
underlying database in mind. The popularity of ORM-based data access layers
has duped a lot of folks into believing that you don't have to think too much
about the database.

------
kyrra
This doesn't sound like Linux almost broke Postgres. It sounds like Postgres
is doing things (scheduler) that it should not be.

~~~
jeltz
The problem is that PostgreSQL is not the only software that does this. I
believe at least Oracle and Erlang's ETS tables also have their own
implementations of spinlocks which might be affected by this change. There
could be plenty of software broken by this kernel change.

I did not see anyone in the lkml discussion blaming PostgreSQL for having its
own spinlock implementation.

~~~
twoodfin
AFAIK, just about everybody writing high performance multiprocess/thread code
that relies heavily on mutexes makes some use of user-mode spin-locking.

It's essential if you have N processes contending for a single mutex which
they will hold for very short periods of time. Asking the kernel to put you to
sleep until the mutex is available means progress is limited by the rate at
which the OS can wake up processes. If the mutex is only going to be held for
a few dozen cycles (say, to increment the heads of a few queues) then the
throughput cost could be considerable over simply spinning a few nanoseconds
in user mode until the mutex is available.

And yes, the need becomes more acute if you want to be sure you'll get
reasonable performance across a broad range of platforms and their
corresponding scheduling policies.
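
A minimal sketch of that spin-then-sleep pattern, layered on a plain pthread
mutex rather than the raw test-and-set a database like PostgreSQL actually
uses; SPIN_LIMIT is an arbitrary illustrative value:

```c
/* Hedged sketch of a spin-then-sleep lock: optimistically spin in
 * user mode for a bounded number of attempts, then fall back to a
 * blocking kernel wait. Not PostgreSQL's implementation. */
#include <pthread.h>

#define SPIN_LIMIT 100  /* arbitrary; real code tunes this */

static void spin_then_sleep_lock(pthread_mutex_t *m)
{
    /* If the holder releases within a few dozen cycles, we take
     * the lock without ever entering the kernel. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;
        /* a real implementation would execute a pause/cpu_relax
         * instruction here to ease pressure on the cache line */
    }
    /* Contended for too long: let the kernel put us to sleep until
     * the mutex is released. */
    pthread_mutex_lock(m);
}
```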

------
gmac
Where broke = made it run 20% slower.

~~~
lnanek2
High load can take out entire clusters. Imagine if a change were pushed to
Netflix that slowed it down 20%: the requests would queue up to the point
where everything fell over.

~~~
prodigal_erik
That calls for shedding load as triage. If you're beyond capacity, it's better
(less bad, anyway) for 20% of requests to fail quickly than to actually
attempt them if that will prevent you from completing the other 80%.
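
A minimal sketch of that idea, assuming a hypothetical request handler and an
arbitrary CAPACITY: admit work while an in-flight counter is under capacity,
and fail the rest fast:

```c
/* Hedged sketch of load shedding: fail fast once in-flight work
 * exceeds capacity, instead of queueing everything until the whole
 * system falls over. CAPACITY is illustrative. */
#include <stdatomic.h>
#include <stdbool.h>

#define CAPACITY 1000

static atomic_int in_flight;

static bool try_admit(void)
{
    /* atomic_fetch_add returns the old value */
    if (atomic_fetch_add(&in_flight, 1) >= CAPACITY) {
        atomic_fetch_sub(&in_flight, 1);
        return false;  /* shed this request so the rest can finish */
    }
    return true;
}

static void done(void)
{
    atomic_fetch_sub(&in_flight, 1);
}
```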

------
acomjean
Interesting article; it explains some of the trade-offs that OSs make in
scheduling.

OS scheduler optimizations are difficult. Often what makes the desktop nice
and snappy makes background work slower; there are always trade-offs. It also
allows vendors to sell expensive versions of Linux with different schedulers
(Red Hat MRG... cough...). The Completely Fair Scheduler, with its tree of
processes, seems to work quite well though.

It seems like they were trying to optimize for specific hardware (the link to
"scheduling domains" was interesting) when moving processes between CPUs (2
cores on one package vs. 2 separate CPUs...). Good intentions, but...

Sometimes it's useful to let users explicitly control which CPUs processes can
run on (processor affinity). The HP-UX variant we used let us set up groups of
CPUs and then map processes onto those CPU sets; you could also select the
scheduling of each process at startup. It was a pain to get things running,
but in the end it worked great. Manually selecting the wrong scheduler and
process priority could make some processes run terribly, however.
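
On Linux, the rough equivalent of those HP-UX CPU sets is sched_setaffinity(2);
a minimal sketch, with arbitrarily chosen CPU numbers:

```c
/* Minimal sketch: pin the calling process to CPUs 0 and 1 using
 * Linux's sched_setaffinity(2). The CPU numbers are illustrative. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow CPU 0 */
    CPU_SET(1, &set);   /* allow CPU 1 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}
```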

------
snorkel
There's been too much effort wasted trying to find the one-size-fits-all
perfect CPU scheduler for every kind of system. For apps such as Postgres that
care enough about CPU scheduling to have written their own scheduler, it's not
too much to ask their authors to make a few additional system calls telling
the kernel what type of scheduling the app prefers, rather than leaving it all
to the kernel to determine the perfect schedule for every running app.

~~~
noselasd
For that to happen, the syscalls to tune the scheduler have to exist first -
which they mostly don't. You get syscalls to choose the normal scheduler or
one of the realtime schedulers, and a few knobs (e.g. the nice value) to poke.
This is far from enough.
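
Concretely, those existing knobs look roughly like this; the priority and nice
values are arbitrary, and SCHED_FIFO requires privileges (CAP_SYS_NICE):

```c
/* The knobs that do exist today: picking a realtime scheduling
 * class with sched_setscheduler(2), and adjusting the nice value
 * with setpriority(2). Values are illustrative. */
#include <sched.h>
#include <sys/resource.h>

static int use_existing_knobs(void)
{
    /* Option 1: a realtime class (privileged, easy to misuse). */
    struct sched_param sp = { .sched_priority = 10 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -1;

    /* Option 2: the nice value - a much blunter instrument;
     * negative values also require privileges. */
    return setpriority(PRIO_PROCESS, 0, -5);
}
```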

------
rosser
Repost: <http://news.ycombinator.com/item?id=4618190>

------
vishal0123
Why doesn't using kernel spinlocks make programs slow?

