
The Linux Scheduler: A Decade of Wasted Cores [pdf] - tdurden
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
======
brendangregg
I've worked on scheduling bugs in other kernels before (Linux is not an
outlier here). The key metric we keep an eye on is run queue latency, to
detect when threads are waiting longer than one would expect. There are many
ways to measure it; my most recent is runqlat from the bcc/BPF tools, which
shows it as a histogram, e.g.:

    
    
       # ./runqlat 
       Tracing run queue latency... Hit Ctrl-C to end.
       ^C
            usecs               : count     distribution
                0 -> 1          : 233      |***********                             |
                2 -> 3          : 742      |************************************    |
                4 -> 7          : 203      |**********                              |
                8 -> 15         : 173      |********                                |
               16 -> 31         : 24       |*                                       |
               32 -> 63         : 0        |                                        |
               64 -> 127        : 30       |*                                       |
              128 -> 255        : 6        |                                        |
              256 -> 511        : 3        |                                        |
              512 -> 1023       : 5        |                                        |
             1024 -> 2047       : 27       |*                                       |
             2048 -> 4095       : 30       |*                                       |
             4096 -> 8191       : 20       |                                        |
             8192 -> 16383      : 29       |*                                       |
            16384 -> 32767      : 809      |****************************************|
            32768 -> 65535      : 64       |***                                     |
    

I'll also use metrics that sum it by thread to estimate the potential
speed-up (which helps quantify the issue), and run sanity tests.
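
For the curious, here is a rough sketch of how a runqlat-style histogram can
be built with bcc's Python bindings. It is not the real runqlat (which also
handles re-enqueues after preemption, among other cases), just the core idea:
timestamp a thread at sched_wakeup and emit the delta into a log2 histogram at
sched_switch, all inside the kernel.

    # runqlat_sketch.py -- minimal run queue latency histogram (not the real runqlat)
    # Requires root, a BPF-capable kernel, and the bcc Python bindings.
    import time
    from bcc import BPF
    
    prog = r"""
    BPF_HASH(wake_ts, u32, u64);   // pid -> timestamp of last wakeup
    BPF_HISTOGRAM(dist);           // log2 histogram of run queue latency (us)
    
    TRACEPOINT_PROBE(sched, sched_wakeup) {
        u32 pid = args->pid;
        u64 ts = bpf_ktime_get_ns();
        wake_ts.update(&pid, &ts);
        return 0;
    }
    
    TRACEPOINT_PROBE(sched, sched_switch) {
        u32 pid = args->next_pid;              // task now going on-CPU
        u64 *tsp = wake_ts.lookup(&pid);
        if (tsp) {
            u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
            dist.increment(bpf_log2l(delta_us));
            wake_ts.delete(&pid);
        }
        return 0;
    }
    """
    
    b = BPF(text=prog)
    print("Tracing run queue latency... Hit Ctrl-C to end.")
    try:
        time.sleep(99999999)
    except KeyboardInterrupt:
        pass
    b["dist"].print_log2_hist("usecs")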

Note that this isolates one issue -- wait time in the scheduler -- whereas
NUMA and scheduling also affect memory placement, so application runtime can
become slower due to higher-latency memory I/O when accessing remote memory. I
like to measure and isolate that separately (using PMCs).

So I haven't generally seen such severe scheduling issues on our 1- or 2-node
Linux systems, although they are testing on 8-node machines, which may
exacerbate the issue. Whatever the bugs are, though, I'll be happy to see them
fixed, and that may help encourage people to upgrade to newer Linux kernels
(which come with other benefits, like BPF).

~~~
jsingleton
I assume BPF here is the Berkeley Packet Filter, or more likely eBPF (the
extended Berkeley Packet Filter). Just to save anyone else having to look it
up, the links below point to the tools and to some background:

[https://github.com/iovisor/bcc](https://github.com/iovisor/bcc)

[https://en.wikipedia.org/wiki/Berkeley_Packet_Filter](https://en.wikipedia.org/wiki/Berkeley_Packet_Filter)

~~~
geofft
Yup. It's the same BPF (well, except for the "extended" bit) that tools like
tcpdump and Wireshark use for packet capture: it's a bytecode for handing
simple, guaranteed-termination programs to the kernel and having the kernel
run them instead of waking up userspace all the time. This was originally
created for packet capture, so the kernel could just hand you packets on port
80 (or whatever) instead of dumping all traffic at you and letting you throw
away most of it. But it turned out this is also useful for system tracing: if
you strace a program, the kernel will notify it on every syscall, and `strace
-e` throws away most of that in userspace. So there's now a way to attach BPF
filters to processes, events, etc. so that a userspace tracer is only woken up
when something interesting happens, which reduces overhead significantly.
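
As a toy illustration of the original packet-filter use (the mechanism, not
how tcpdump is actually implemented): a hand-assembled classic BPF program
attached to a raw socket, so the kernel only delivers IPv4 frames to
userspace. This is Linux-only, requires root, and hard-codes SO_ATTACH_FILTER
since Python's socket module doesn't export it.

    import ctypes
    import socket
    import struct
    
    SO_ATTACH_FILTER = 26   # from <asm-generic/socket.h>
    ETH_P_ALL = 0x0003
    
    def ins(code, jt, jf, k):
        """Pack one classic BPF instruction (struct sock_filter)."""
        return struct.pack("HBBI", code, jt, jf, k)
    
    # "Accept only IPv4 frames": load the 16-bit EtherType at offset 12,
    # compare against 0x0800, then return a byte count (accept) or 0 (drop).
    filter_code = b"".join([
        ins(0x28, 0, 0, 12),       # ldh [12]              ; EtherType
        ins(0x15, 0, 1, 0x0800),   # jeq #0x0800, accept, drop
        ins(0x06, 0, 0, 0xFFFF),   # accept: ret #65535
        ins(0x06, 0, 0, 0),        # drop:   ret #0
    ])
    
    buf = ctypes.create_string_buffer(filter_code)
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HL", 4, ctypes.addressof(buf))
    
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    s.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)
    
    # Everything read here has already passed the in-kernel filter.
    frame = s.recv(65535)
    print("got an IPv4 frame of", len(frame), "bytes")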

------
jor-el
The quotes in the paper are interesting: "Nobody actually creates perfect code
the first time around, except me. But there’s only one of me.” Linus Torvalds,
2007

And another, which highlights why there might be many more bugs still lurking
that would probably go unnoticed: "I suspect that making the scheduler use
per-CPU queues together with some inter-CPU load balancing logic is probably
trivial. Patches already exist, and I don’t feel that people can screw up the
few hundred lines too badly"

Looking at the bigger picture, this again shows that getting software right
is not easy. Now and then you still hear about bugs popping up in code that is
very core to an OS. One I can recall is a decade-old TCP bug that Google fixed
last year [1].

[1] [http://bitsup.blogspot.sg/2015/09/thanks-google-tcp-team-for-open-source.html](http://bitsup.blogspot.sg/2015/09/thanks-google-tcp-team-for-open-source.html)

~~~
tn13
The best example is SSL. Over a decade, and we still haven't got even the
"correctness" part right.

~~~
wtbob
Over a decade? It's over twenty years old! SSL/TLS & its XPKI are complete and
utter jokes.

------
Animats
Scheduling has become a hard problem. There's a cache-miss cost in moving a
task from one CPU to another. CPUs can be put into power-saving states, but
take time to come out of them. Sharing the run queue across CPUs has a locking
cost. So it's now a cost-optimization problem, and that's a lot of work to be
doing at the sub-millisecond decision level.

~~~
fpoling
One of the point in the article is that despite scheduler's importance and
complexity there were no tools that allowed to evaluate the performance. This
work contributed such tools and those immediately allowed to identify quite a
few bugs. Fixing those improves performance on NUMA systems rather
substantially across various loads.

~~~
kevan
>there were no tools that allowed its performance to be evaluated.

This reminds me of Bryan Cantrill's remarks in a talk (sorry, can't remember
which one) about Solaris and DTrace. For the first time ever, DTrace gave them
the ability to look into the live performance of low-level OS code, and they
found a lot of pathological worst-case behavior that no one had suspected
beforehand. No matter how good your team is, it's really hard to accurately
predict how a system will behave; measurement is key.

------
j1vms
As indicated in the paper, patches (currently for kernel 4.1) are here:
[https://github.com/jplozi/wastedcores](https://github.com/jplozi/wastedcores)

~~~
rando3826
Have they been accepted upstream? If not, I'd like to see a link to the LKML
thread for these.

~~~
justinmk
I wonder if they are discouraged by the experience of Con Kolivas [1], who
proposed an alternative scheduler back in 2007. (Apparently he is still
maintaining his "-ck" fork of the Linux kernel [2]!)

I only mention this as a historical case that has remained in my memory. Maybe
Linus is willing to revisit the issue; I don't follow LKML.

[1]
[https://en.wikipedia.org/wiki/Con_Kolivas](https://en.wikipedia.org/wiki/Con_Kolivas)

[2] [http://ck-hack.blogspot.com/2015/12/bfs-467-linux-43-ck3.html](http://ck-hack.blogspot.com/2015/12/bfs-467-linux-43-ck3.html)

~~~
LoSboccacc
I remember the whole debacle back then, when LKML was summarized on that
website whose name I can't remember.

It'd be interesting to see if that branch exhibits the same behavior and
issues.

~~~
tremon
Maybe KernelTrap? I was sad when it shut down; it was a very useful resource
for following Linux development from the sidelines.

~~~
LoSboccacc
Yeah that one! The memories.

------
edwintorok
Nice work: paper, (upcoming tools), and _actual patches_.

Wish they'd look at disk I/O next; there are some problems there that are
hard to describe other than anecdotally. E.g. my system runs on an SSD and
periodically copies data to an HDD with rsnapshot. When rsnapshot runs rm on
the HDD, things freeze for a moment (even switching windows in X), although
the only thing using the HDD is that rm ...

~~~
EdiX
Could it be
[https://lkml.org/lkml/2016/3/30/424](https://lkml.org/lkml/2016/3/30/424) ?

~~~
edwintorok
Thanks, I'll have to try that, although in my case the writes and reads are
done on entirely different devices (maybe there is a shared queue somewhere?).

------
haberman
This is absolutely nuts!

If this result is true, our Linux machines have been wasting 13-24% of our
silicon and energy for years (that number is for "typical Linux workloads")
because the scheduler fails to fulfill its most basic duty.

The quotes from Linus in the paper just twist the knife.

~~~
andrewchambers
If the CPU is idle it isn't wasting energy.

~~~
mjg59
The rest of the system is still a (roughly) fixed cost, and a 75% loaded CPU
package still consumes more than 75% of the power of a 100% loaded CPU
package.
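
Back-of-the-envelope illustration with made-up numbers, assuming package power
scales roughly linearly between an idle floor and full load:

    # Hypothetical package power: 40 W idle, 100 W fully loaded.
    p_idle, p_max = 40.0, 100.0
    utilisation = 0.75
    p_at_75 = p_idle + utilisation * (p_max - p_idle)   # 85 W
    print(p_at_75 / p_max)   # ~0.85: 85% of peak power for 75% of the work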

------
mmaunder
I'm just a lowly performance-obsessed dev who uses things like node, php,
python, etc. I run very high-traffic applications and spend a lot of time
buying and building my own servers to try to eke out every ounce of
performance.

So can someone who knows about linux kernel internals explain the impact of
this research? I read the abstract and some of the paper and it sounds very
promising - like we may get significantly more performance out of our cores
for multi-threaded or multi-process applications.

~~~
im_down_w_otp
You will probably get significantly more performance out of your cores for
multi-threaded or multi-process applications if you stop using node, php,
python, etc. and use something that's more performance oriented.

~~~
_yosefk
Well, one might also get more performance per dollar if they programmed FPGAs,
DSPs or GPUs instead of CPUs, and one might also get more performance per
dollar if they designed their own hardware. (I do the latter for a living.)

However, "performance of Node, PHP and Python" is a sensible goal in its own
right, and I disagree with the implication of your comment, and that of sister
comments, that it is not a sensible goal. There's a lot of useful code written
in Node, PHP and Python, and moreover, this might remain true for a long while
because "something more performance oriented" is likely to be less _programmer
productivity oriented_ in that a correct, easy to use program or library will
take more time to get done. Also, Node and Python specifically can be damn
fast (numpy for instance is unbeatable for large linear algebra programs,
because it uses optimized libraries under the hood, etc. etc.)

And some things simply can't be done in a satisfactory fashion in anything but
a dynamic language, any more than you can get Apache to run on a GPU.
"Dynamic" is a feature, not just a source of overhead.

So "a performance-obsessed scripting language developer" is a perfectly fine
way to describe oneself IMO.

~~~
pjc50
Aren't all three of those languages single-threaded? So the only way to
distribute work is to run one copy per core and distribute on a per-request
basis?

~~~
krylon
Python, at least, uses actual OS-level threads. However, it also uses a
global interpreter lock (GIL), so only one thread can execute Python code at a
time.

But when writing Python modules in C, you have control over acquiring and
releasing the GIL, so before starting some long-running operation you can give
up the lock.

Node, AFAIK, uses several OS-level threads under the hood for disk I/O. And
with PHP, a web server will probably run multiple threads to handle requests
concurrently.

So the impact might not be as big as for performance-oriented code in C/C++,
but it is not necessarily nil, either.
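
A quick way to see the GIL effect from CPython itself (a rough sketch, not a
rigorous benchmark): the same CPU-bound work gains little from threads, but
does scale across processes.

    import time
    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
    
    def burn(n):
        # CPU-bound pure-Python work: never releases the GIL for long.
        total = 0
        for i in range(n):
            total += i * i
        return total
    
    def timed(executor_cls, workers=4, n=2_000_000):
        start = time.time()
        with executor_cls(max_workers=workers) as ex:
            list(ex.map(burn, [n] * workers))
        return time.time() - start
    
    if __name__ == "__main__":   # guard needed for ProcessPoolExecutor
        print("threads:   %.2fs" % timed(ThreadPoolExecutor))
        print("processes: %.2fs" % timed(ProcessPoolExecutor))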

~~~
_yosefk
With Python, numpy for instance will use multiple threads under the hood, even
though the calling Python code might be single-threaded, and numpy's execution
is completely unaffected by the GIL. Incidentally, to get Python code to run
really fast, you'll have to offload most of the heavy lifting into libraries
in any case. But it's still a Python program - in that most of the source
lines, especially most of the source lines unique to the program as opposed to
library code, will be still in Python, and in that in some cases you couldn't
have written the program in a "more performant" language getting either the
performance _or_ the flexibility (Go for instance doesn't have operator
overloading nor can it be used to implement linear algebra as efficiently as
C/assembly; so a pure Go program doing linear algebra will be both slower and
uglier than a Python program using numpy. A Go program using a C/assembly
library will be very marginally faster than said numpy program, and just as
ugly as the pure Go program.)
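
A tiny example of the numpy point, assuming a numpy build linked against a
multi-threaded BLAS such as OpenBLAS or MKL:

    import numpy as np
    
    a = np.random.rand(3000, 3000)
    b = np.random.rand(3000, 3000)
    # The multiply is dispatched to the BLAS, which releases the GIL and can
    # spread the work across cores, even though this script is single-threaded.
    c = a @ b        # watch `top`: several cores busy while this line runs
    print(c.shape)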

Also, in my understanding, TFA applies to multiple processes just as much as
to multiple threads.

~~~
jarvist
A recent alternative is to write the entire program in Julia. It is a lot
less ugly than numpy, and so performant that most code (other than steadfast
libraries such as BLAS and LAPACK) is written in Julia itself.
[http://julialang.org/](http://julialang.org/)

------
Someone
Page 8: _" To fix this bug, we alter the code that is executed when a thread
wakes up. We wake up the thread on the local core—i.e., the core where the
thread was scheduled last— if it is idle; otherwise, if there are idle cores
in the system, we wake up the thread on the core that has been idle for the
longest amount of time."_
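
For reference, the rule they describe boils down to something like the
following sketch (hypothetical names and data structures, nothing like the
actual kernel code):

    # Placement rule from the quoted fix: prefer the thread's previous core if
    # it is idle (cache warmth), otherwise the core that has been idle longest.
    def pick_wakeup_core(thread, cores):
        """cores: list of objects with .is_idle and .idle_since (timestamp)."""
        last = cores[thread.last_cpu]
        if last.is_idle:
            return last
        idle = [c for c in cores if c.is_idle]
        if idle:
            # smallest idle_since == has been idle the longest
            return min(idle, key=lambda c: c.idle_since)
        return last   # nothing idle: queue on the previous core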

I don't quite understand the choice of the core that was idle for the longest
time. I think they use that as a predictor of the CPU's future load, and
scheduling based on future load seems a good idea, but I think this could lead
to cases where it prevents one or more CPUs from going into low-power states
when the system doesn't have enough work to keep all its cores fully occupied.

(Edit: how did I overlook the following paragraph, where they discuss this
issue?)

Also, in general, I think they too easily call the changes that remove the
corner cases they find _fixes_. Chances are that they introduce other corner
cases, either on workloads they didn't test or on hardware they didn't test
(caveat: I know very little about the variation in hardware that is out
there).

~~~
mjg59
Picking the core that was idle for the longest period may end up working as a
proxy for picking the coolest core, and as such encouraging more even
distribution of heat and reducing the probability of thermal throttling?

~~~
Spidler
Couldn't it also prevent cores from getting into deeper sleep states?

------
jondubois
Many people in the Node.js community have had first-hand experience with this
issue. Node.js has had to turn off OS-based scheduling for its cluster module
because you always ended up with a couple of CPUs taking all the load (and
accepting new connections) while most of them remained idle.

I thought it was either a bug with the TCP polling mechanism or with the Linux
OS scheduler itself. It's good that this issue is finally getting some
attention.
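
A rough sketch of that "let the OS pick the worker" model: one listening
socket created in the parent and shared by forked workers that all call
accept() on it, so the kernel decides which worker gets each new connection.
This is an approximation of the old cluster behaviour, not Node's actual
implementation; the port and reply format are arbitrary.

    import os
    import socket
    
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8080))
    listener.listen(128)
    
    def worker(worker_id):
        handled = 0
        while True:
            conn, _ = listener.accept()   # the kernel picks which worker wakes up
            handled += 1
            conn.sendall(b"worker %d, connection %d\n" % (worker_id, handled))
            conn.close()
    
    for i in range(os.cpu_count() or 2):
        if os.fork() == 0:
            worker(i)   # never returns
    os.wait()           # parent just waits; Ctrl-C stops everything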

~~~
melchebo
That might be what they call the 'overload on wakeup' bug. Maybe try the
patch? I've read that the patch additionally needs a small fix; the goto label
position went missing.

Probably just before rcu_read_unlock() in that function: [http://lxr.free-electrons.com/source/kernel/sched/fair.c#L5139](http://lxr.free-electrons.com/source/kernel/sched/fair.c#L5139)

------
kazinator
_" Cores may stay idle for seconds while ready threads are waiting in
runqueues."_

I do not believe it, sorry. Troll paper.

Check out the massive indentation change in this patch which obscures the
changes being made:

[https://github.com/jplozi/wastedcores/blob/master/patches/sc...](https://github.com/jplozi/wastedcores/blob/master/patches/scheduling_group_construction_linux_4.1.patch)

Good grief.

~~~
studentrob
Hm, github.com's diff doesn't let you ignore changes in whitespace? Bummer.

Anyway that's not basis enough to invalidate their claims. They're published
in EuroSys [1]. Is that not a reputable journal/conference?

Plus this dude has a sweet CV layout [2]. I'm inclined to believe him on that
basis alone.

[1]
[http://www.i3s.unice.fr/~jplozi/wastedcores/](http://www.i3s.unice.fr/~jplozi/wastedcores/)

[2] [http://www.i3s.unice.fr/~jplozi/](http://www.i3s.unice.fr/~jplozi/)

~~~
kazinator
Fact is, the "dude" doesn't know how to get his text editor to use the same
tabbing as the kernel code he's editing, and doesn't know how to run a patch
through the kernel's "checkpatch.pl" script, without which it will be rejected
from upstreaming.

The idea that the kernel runs idle tasks on cores while actual tasks are
runnable (and that this situation persists for seconds) is ridiculous. It's
such a gaping problem that everyone doing anything with Linux anywhere would
be running into it on a regular basis.

Think about all the people out there who carefully profile the performance of
their multi-threaded apps. Like all of them wouldn't notice missing seconds of
execution?

~~~
studentrob
I've not studied this branch of CS since college, a decade+ ago. As far as I
know, certain elements of the kernel aren't modified very often. Also, the
researcher claims to introduce new tools that didn't exist before. Again, I
think that if the journal, EuroSys, is reputable, then it would vet the
findings properly. Replicating people's work is standard procedure; maybe
people need time to attempt that. Right on if you are proven right!

~~~
kazinator
Certain elements of the kernel aren't modified often, but the scheduler isn't
one of them. It has been subject to a lot of churn. There is no generalization
you can state about the Linux kernel scheduler that is valid for an entire
decade of kernel versions.

~~~
studentrob
> There is no generalization you can state about the Linux kernel scheduler
> that is valid for an entire decade of kernel versions

That sounds like a generalization about the Linux kernel scheduler for an
entire decade of kernel versions =) (just kidding haha!)

------
nickysielicki
> the scheduler, that once used to be a simple isolated part of the kernel
> grew into a complex monster whose tentacles reached into many other parts of
> the system,

That's what I call vivid language!

------
thinkingkong
Some authors appear to work with the old Xen crew (now founders at coho).
Impressive work!

------
hinkley
The bit about creating a cgroup per tty was news to me, and now I'm wondering
whether the way I usually manage Linux servers scales up to heavy traffic
(usually if I'm involved it's small potatoes).

Presumably things started as SysV init scripts, or even Docker containers,
don't have this problem?

------
melchebo
Presentation PDF:
[http://www.i3s.unice.fr/%7Ejplozi/wastedcores/files/extended...](http://www.i3s.unice.fr/%7Ejplozi/wastedcores/files/extended_talk.pdf)

------
akkartik
Is anyone able to explain Figure 1? I don't understand what the levels are,
and whether level 1 is the coarsest or the finest. The caption doesn't seem to
make sense either way. Also, Algorithm 1 is in terms of CPUs, but the
description mentions 'nodes' and 'cores'. Is a CPU a core or a node? Neither?

~~~
wmf
Unfortunately the authors are using AMD machines, which behave differently
from most others because pairs of cores are conjoined into "modules" that
share resources. In AMD processors, a cpu is a core. In almost all other
processors, a cpu is an SMT thread (aka a hyperthread).

In Figure 1 the levels/shades represent distance from node/socket 1, darker
being closer. So node 1 is distance 0 from itself, two other nodes are
distance 1, and one node is distance 2.

~~~
tremon
The only thing shared within a Bulldozer module is the FPU (and perhaps some
cache, not sure). For all other purposes, a Bulldozer module contains two full
CPU cores.

Also, what CPUs besides Intel's and IBM's use SMT?

~~~
Sanddancer
In 'Dozer, the arithmetic and memory units were pretty much the only things
that were separate. Each core pair shared a scheduler, a dispatcher, the
branch predictor, cache, etc. It really was an oddly thought-out design.

~~~
jospoortvliet
It's probably even more complicated than that, as they separated more parts
out per core during the evolution of 'Dozer, like the decoder, which was one
unified unit at first and became two separate ones later.

Yeah, an oddly thought-out design for sure. The idea seems to make sense to
me, but the execution wasn't good enough, I guess.

------
wutf
This is extremely promising. Makes you wonder if we ought not go further and
implement a machine learning-based scheduler that studies and anticipates
workloads and schedules accordingly so as to help jobs complete as quickly as
possible.

~~~
rifung
Why machine learning? The paper says that the need to handle the complexities
of modern hardware made it so that the scheduler failed to do its fundamental
job.

Machine learning sounds like it would add even more complexity, but perhaps I
just fail to see why it is a good idea here. I don't see how you can predict
workloads based on historical data, and if you're going to try to predict
workloads based on binaries, the overhead would likely outweigh any benefits
you would get.

I'll admit I am no expert in machine learning but I have a hard time
understanding why you would look at this problem and think machine learning is
the solution.

~~~
tigershark
The only reason is that machine learning is fashionable at the moment and a
lot of people like to suggest it as a panacea. I can't even imagine the
terrible overhead that machine learning would impose on probably the most
critical part of the OS.

~~~
wutf
Prediction can be extremely fast.

~~~
marcosdumay
But learning is always slow. What is the point of machine learning if you
disable learning?

~~~
wutf
Learning doesn't need to be done in realtime... it can happen async.

------
blinkingled
Interesting work, certainly. The patches look more like a proof of concept
than something that's ready for mainline.

I think they'd be better off posting an RFC patch to LKML, as it exists to
facilitate discussion and testing.

~~~
melchebo
It should probably mainly be seen as a proof of concept for their
scheduler-decision visualization tools, which are not public for the time
being. Those should make checking and fixing bugs easier in the future.

------
unoti
It'd be interesting to run the same tests on the Windows scheduler to see how
it compares. Anyone know?

------
thrownaway2424
I wonder what the impact is on networking loads. Often the most important
thing is for a thread to wake on the core with ready access to the packet that
woke it. Other scheduling concerns are counterproductive.

------
atemerev
The year is 2016.

Software engineers are still obsessed with squeezing every last drop of
performance from a single core, adding multicore or distributed load support
as an afterthought.

Sorry, it doesn't work this way anymore. There will be no more single core
performance increases — laws of physics forbid it. Instead, we will see more
and more cores in common CPUs (64, then 256, then 1024 — then it will be
merged with GPGPUs and FPGAs with their stream processing approach).

Learn distributed programming or perish.

~~~
jeffhuys
I don't think this is a good way to look at it. In my opinion, having a
"perfect" single core is much more valuable than an octo-core CPU. Take a look
at the iPhone 6s. Two cores, and it runs faster than some octo-core Androids.
I may be a bit of a noob in this area, but something tells me that is
significant.

~~~
wodenokoto
I'm also a noob, but according to parent, the reason the iPhone runs faster
is that all optimisation is done for one core, which would then leave most of
the 8 cores doing nothing, despite having much, much more raw power available
than the single iPhone core.

Or, put another way: if we want more raw processing power, we need more cores,
but we don't want more cores because software is optimised for a single core.

~~~
tigershark
Not true; even in multithreaded scenarios the A9 is more performant than the
multicore Android phones, or has comparable performance in the worst case, if
I remember correctly. In single-threaded scenarios it simply destroys all the
other mobile chips and is comparable to some low-power Intel chips.

~~~
wodenokoto
That is the point. If software can't use multiple threads, then strong
single-thread performance will win.

