
The Linux Scheduler: A Decade of Wasted Cores - jwildeboer
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
======
Yuioup
Link to previous discussion on HN:
[https://news.ycombinator.com/item?id=11501493](https://news.ycombinator.com/item?id=11501493)

------
brendangregg
Run "numastat". If you only see one "node" column, then these issues don't
affect you. At Netflix, almost all our systems are single NUMA node.
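
If you'd rather check programmatically than eyeball the numastat output, here is
a rough sketch using libnuma (link with -lnuma) that reports the configured node
count:

      #include <numa.h>
      #include <stdio.h>

      int main(void)
      {
          if (numa_available() < 0) {
              /* no NUMA support: effectively a single node */
              printf("no NUMA support here\n");
              return 0;
          }
          /* more than one configured node means these bugs can matter */
          printf("NUMA nodes: %d\n", numa_num_configured_nodes());
          return 0;
      }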

I tested the patches on 1- and 2-node NUMA systems and saw 0% performance
improvement for some simple workloads.

The paper is also uninformed about scheduler bugs in general. Yes, there have
been lots of scheduler bugs (I've worked on more in other kernels too, not
Linux, where scheduling is even worse!). Yes, we already have tools for
diagnosing them (eg, examining run queue latency as a histogram). Yes, we've
already used visualizations too.

There's information for reproducing the bugs here:
[https://github.com/jplozi/wastedcores/issues/1](https://github.com/jplozi/wastedcores/issues/1)
. If you think for a second that you've hit these bugs, you should read these
steps.

------
m-app
Found it funny that one of their patches reduces a function to:

      static int should_we_balance(struct lb_env *env)
      {
          return 1;
      }

~~~
jhoechtl
Certainly inlined and very likely optimized away altogether. Better to keep it
that way, with a comment. Once the problem is fully understood, there may be
room to enable it again with a more meaningful heuristic.
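
For illustration only (not from the actual patch), keeping the stub with such a
comment might look something like:

      /*
       * Always balance for now. The original heuristic could leave cores
       * idle while runnable threads waited on other runqueues; revisit
       * once the problem is fully understood.
       */
      static int should_we_balance(struct lb_env *env)
      {
          return 1;
      }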

~~~
chris_wot
Ummm... isn't that the job of version control?

~~~
jimm
Yes, in theory. In practice, you'd remove not only this function but all of its
call sites. When somebody down the road realizes they want this function back,
they have to (A) realize it's in the VC history and (B) restore not only the
function but all of the calling points. That is enough of a pain, and
potentially error-prone enough, that leaving this function in for a while with a
comment might be the more practical approach.

~~~
pklausler
On the other hand, it may no longer be called from every place where it should
be. I've found that it's better to document what was wrong with the overall
approach, scrape the dead code from the source base, and move on. Barnacles
like these accumulate over time otherwise.

------
qaq
Unless you are running on a NUMA box, it does not look like this will affect you
much.

~~~
legooolas
Isn't NUMA how multi-core CPUs work still?

At least the AMD Bulldozer architecture diagram on the Wikipedia NUMA page
shows this, both within a socket and between sockets:

[https://en.wikipedia.org/wiki/Non-uniform_memory_access#/med...](https://en.wikipedia.org/wiki/Non-uniform_memory_access#/media/File:Hwloc.png)

Is the same true for Intel chips? (The Wikipedia page mentions the Intel E8870
memory controller, but not whether newer chips use similar set-ups)

~~~
signa11
> Isn't NUMA how multi-core CPUs work still?

if you have > 1 socket, then yes, otherwise no.

Edit: added verb

~~~
wyldfire
These days, yeah, that's definitely the case. A few generations back there
were a couple of multi-socket-single-memory-domain designs.

~~~
voidlogic
> A few generations back there were a couple of multi-socket-single-memory-
> domain designs.

Core2Quad/Duo-based Xeons, P4s, Athlon MP, etc.

~~~
vonmoltke
I thought the Core2 chips were actually multiple single-core processors on a
single slab, which would make that true of all Core2 machines (not just
Xeons). I was pretty deep in Nehalem details when it first came out and was
under the impression that microarchitecture was Intel's first true multi-core
chip (as in, the cores were actually integrated on the die). Could be wrong,
though, since I never delved into the Core2 architecture.

~~~
voidlogic
> I thought the Core2 chips were actually multiple single-core processors on a
> single slab, which would make that true of all Core2 machines (not just Xeons)

Correct.

------
dooglius
LKML discussion:
[https://lkml.org/lkml/2016/4/23/135](https://lkml.org/lkml/2016/4/23/135)

------
jwildeboer
In case you want to get a quick(-ish) overview, here are the slides that try to
explain it:
[http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_t...](http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf﻿)

~~~
jensv
Your url is broken and should be:
[http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_t...](http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf)

Parent directory:
[http://www.i3s.unice.fr/~jplozi/wastedcores/](http://www.i3s.unice.fr/~jplozi/wastedcores/)

------
vonmoltke
I worked on a massively threaded signal processing application a few years
ago. We manually set all of our thread affinities due to the issues raised in
this article. We were also using the real-time kernel and round robin
scheduler, though, so we had other considerations as well.

~~~
kev009
It is standard practice to pin when max throughput and/or low latency are
desired. A scheduler provides convenience and automatic balancing in the face of
shifting workloads. If you have a fixed-function system, it's just getting in
the way.

~~~
vonmoltke
Pretty much. Our pinning varied by task. Some were pinned to sockets, some to
cores, and some to specific hardware threads. That's what I meant by "other
considerations"; we would have still been pinning threads even if the bugs in
the article didn't exist. The bugs just made figuring out the best pinning
scheme more difficult.
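
For anyone who hasn't done this before, a minimal sketch of pinning the calling
thread to one logical CPU on Linux (glibc-specific, error handling left out; the
pin_to_cpu helper is just an illustrative name):

      #define _GNU_SOURCE
      #include <pthread.h>
      #include <sched.h>

      /* Pin the calling thread to a single logical CPU. */
      static int pin_to_cpu(int cpu)
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
      }

Pinning to a socket or a group of hardware threads is the same idea, just with
more CPUs added to the mask.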

------
thread-manager
Hi guys, FYI: in most newer servers (~2005 and later), there is a BIOS setting
to make NUMA systems look like they have a single NUMA zone. From memory, I
think it is the "Interleaved" option with most vendors. I strongly recommend
that most clients set it to NUMA mode (which tends to be the default on most
newer servers).

------
zxcvcxz
From what I understand, this mostly affects servers and not desktop systems,
because most desktops don't have NUMA. I've noticed that Linux servers typically
perform on par with (usually better than) their Windows counterparts.

Could this be an indication that Windows could also be suffering from a similar
bug?

Once this bug is fixed in Linux, it could supposedly improve performance by up
to 30% in some cases. Linux already performs on par with Windows (and this isn't
really debatable), so does this mean this patch will help Linux outperform
Windows by much wider margins?

~~~
pritambaral
> Could this be an indication that Windows could also be suffering from some
> similar bug?

No. Linux and Windows differ greatly. If the only difference between the two
OSes were in the scheduler, maybe then we could say Windows also suffers from
a similar bug. But there are so many other working parts to a modern OS that
Windows could be slower than Linux despite having a more efficient scheduler.

------
SFJulie
Well, multi-core is not one CORE but rather an integrated distributed system of
cores linked with buses, probably built from blocks very similar to network
switches/commutators but without all the capex poured into optimizing network
equipment. So calling a multi-core architecture _one_ computer is a little lie.

Making an efficient multi-core async system is as tough as making an efficient
networked distributed system, except it has less heterogeneous components but
probably also less devoted R&D.

So at some point I wonder why we don't simply build extra cores as additional
SoCs/ASICs, CPUs or GPUs that you can plug onto dedicated buses chosen according
to the required latency for memory sharing and exchange with the central
CPU/memory.

It is almost what current buses are, just a tinge twisted, with different rings,
speeds and latencies, because I don't think one bus can rule them all.

But, well, who am I to have an opinion when Google, Apple, Motorola, Intel, DEC
and IBM know better than I do how to design computers?

Just a no one.

~~~
vonmoltke
> Well, multi-core is not one CORE but rather an integrated distributed system
> of cores linked with buses, probably built from blocks very similar to
> network switches/commutators but without all the capex poured into optimizing
> network equipment.

Not really. Multi-core chips share processor execution resources and L2 cache
between them. The on-die cores are directly tied to each other in a way that
off-die components, including cores in another socket on the same board, are
not. They _are_ a single computer.

Now, you can carry the network abstraction down to that level. Problem is, as
an EE I can carry the network abstraction down below the register level. A
register bit is a network of transistors and resistors, after all. At what
point do we call something a computer? I personally think any clustering of
tightly-coupled, on-die computational blocks is a "computer", which means the
multi-core chips certainly count as one computer with multiple execution
pipelines.

~~~
SFJulie
Well, an address bus is still a bus. They share a bus, not the addresses. From
the programming point of view it is the same; from the architecture PoV the
complexity (number of gates and latency) is not the same.

A network stack is just an external, universal bus that introduces way more
complexity.

But it is just a bus.

