
The Linux Scheduler: A Decade of Wasted Cores (2016) - pmoriarty
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
======
CalChris
> The evolution of scheduling in Linux By and large, by the year 2000,
> operating systems designers considered scheduling to be a solved problem…

If I recall correctly, 2000 is 2.3.x which had a braindead trivial scheduler.
Basically it just looped through the process list and executed _goodness()_
and found the 'best'. This was obscenely slow when there were a lot of dormant
processes. The process table's list links all mapped to the same cache set,
which led to cache evictions (even TLB evictions) during the scheduling loop.
Et cetera.
It was just bad, really really bad. Compared to its BSD, Solaris, ...
contemporaries, it was garbage.
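
For anyone who never saw that code, here is a self-contained toy sketch of
the O(n) design being described; the task fields and the goodness() heuristic
are invented for illustration, the real 2.x code was considerably messier:

```c
/* Toy sketch of the pre-O(1) idea described above (a paraphrase, not the
 * real 2.x kernel code): every scheduling decision walks the whole task
 * list, scores each runnable task, and keeps the best -- O(n) per decision. */
#include <stdio.h>

struct task {
    int runnable;
    int counter;   /* remaining timeslice */
    int nice;      /* static priority */
};

/* Stand-in for the kernel's goodness(): prefer tasks with timeslice left
 * and better (lower) nice values. */
static int goodness(const struct task *t)
{
    if (!t->runnable || t->counter == 0)
        return 0;
    return t->counter + (20 - t->nice);
}

static int pick_next(const struct task *tasks, int n)
{
    int best = -1000, next = -1;
    for (int i = 0; i < n; i++) {        /* the O(n) walk, every time */
        int w = goodness(&tasks[i]);
        if (w > best) {
            best = w;
            next = i;
        }
    }
    return next;
}

int main(void)
{
    struct task tasks[] = {
        { .runnable = 1, .counter = 6, .nice = 0 },
        { .runnable = 0, .counter = 0, .nice = 0 },   /* dormant */
        { .runnable = 1, .counter = 3, .nice = 10 },
    };
    printf("next task index: %d\n", pick_next(tasks, 3));
    return 0;
}
```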

The O(1) scheduler arrived with the 2.5 development series in early 2002
(mainline in 2.6), followed by the Completely Fair Scheduler in 2007, etc. And
the Linux scheduler continues to
get better. But in 2000, it sucked. Reeked.

~~~
ksk
From my experience, Linux CPU and I/O scheduling got good starting with 2.6.x
(also around 2007), around when AIO became robust.

~~~
dullgiulio
Isn't AIO still implemented with threads in libc? In that case, it's just the
CPU scheduler that counts...

~~~
MichaelMoser123
He means io_submit; that one is not implemented with user-space threads.
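
For reference, kernel AIO goes through io_setup/io_submit/io_getevents. A
minimal sketch using the libaio wrappers (the file name, buffer size and
offset are placeholders):

```c
/* Minimal kernel-AIO sketch using the libaio wrappers (io_setup, io_submit,
 * io_getevents). File name, sizes and offset are placeholders.
 * Build with: gcc aio_read.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 4096)) return 1;   /* O_DIRECT alignment */

    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);            /* 4 KiB at offset 0 */

    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* The submitting thread is free to do other work here; the read
     * completes in the kernel, not on a libc helper thread. */
    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
    printf("read returned %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```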

------
Filligree
Somewhat relevant: Threadripper CPUs, which are aimed at the high-end consumer
market, are NUMA with two memory domains.

This makes the overall scheduling problem much harder, to the point that they
were built with a special "disable half the cores" mode and supporting
hardware to give the remaining cores same-speed access to both memory banks.

~~~
hesdeadjim
Honestly, I believe the only reason they did this is so that when review sites
run benchmarks the Threadripper won't look abnormally slow compared to the
other chips out there.

I own one and in both work and play I have had zero issues. If I drop a frame
here and there in a game due to some memory latency? Eh, I couldn't care less.
If
you can afford a Threadripper you can afford a 1080 Ti and a Gsync monitor to
smooth out any issues you might run into.

~~~
Filligree
You also can probably afford enough memory that the kernel can schedule your
game entirely on one half of the CPU, but I don't know if that sort of
scheduling (and defragmentation) is commonly used yet.

~~~
hesdeadjim
It’s going to be on the application to be NUMA aware, regardless of how much
memory you have. Games have never really had to deal with this due to the
absolutely minuscule number of people who played games on server-grade dual
socket Xeons. It’ll be interesting to see if any of the big names (Unity,
Unreal, Crytek/Lumberyard) ever care enough to make a patch for proper NUMA
support.

~~~
Filligree
It's entirely possible to do this at the OS level. It makes the scheduling
problem much harder, yes, but a user can—for example—force their game to run
only in one domain using CPU affinity, then somehow trigger the kernel to
migrate all its memory to that domain. I know how to do the former; I haven't
tried the latter.
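
A rough sketch of both steps, assuming libnuma is installed and the game
should land on node 0 (the node numbers are arbitrary; numactl(8) and
migratepages(8) can do the same from a shell):

```c
/* Rough sketch: pin an already-running process (e.g. a game) to NUMA node 0
 * and ask the kernel to migrate its existing pages there. Node numbers and
 * error handling are simplified; needs libnuma (build with -lnuma) and
 * appropriate privileges. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2 || numa_available() < 0) {
        fprintf(stderr, "usage: %s <pid>  (requires a NUMA system)\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    /* Step 1: CPU affinity -- restrict the process to the CPUs of node 0. */
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(0, cpus);
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int i = 0; i < cpus->size; i++)
        if (numa_bitmask_isbitset(cpus, i))
            CPU_SET(i, &set);
    if (sched_setaffinity(pid, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Step 2: memory -- migrate pages currently on node 1 over to node 0. */
    struct bitmask *from = numa_parse_nodestring("1");
    struct bitmask *to   = numa_parse_nodestring("0");
    if (numa_migrate_pages(pid, from, to) < 0)
        perror("numa_migrate_pages");

    return 0;
}
```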

It would be more difficult to do it automatically, but if NUMA systems become
more common then I see no reason why it shouldn't be tried.

------
brendangregg
Still 0% performance wins for Netflix. Not a "decade of wasted cores" for us.

~~~
diptanu
Can you please explain that more?

~~~
awalton
The tl;dr is that unless you had an HPC workload with a NUMA box, you couldn't
observe this bad behavior.

Which means in reality, you could name approximately everyone that ran into
this issue on a single list: top500.org.

~~~
lobster_johnson
Is NUMA that rare? Back in 2007-2008 or so, my company bought some rack
servers fitted with 48-core AMD Opteron (Magny-Cours), which weren't
particularly expensive, and had a NUMA architecture.

We didn't have HPC workloads, just Postgres, which uses one OS process per
connection, and performance was terrible as a result.

~~~
anarazel
> We didn't have HPC workloads, just Postgres, which uses one OS process per
> connection, and performance was terrible as a result.

I'd bet, but not too much, that that was more due to a) postgres' internal
locking implementation scaling horribly at that time b) zone_reclaim_mode
leading to bad behaviour around IO.

~~~
lobster_johnson
It's possible, but we weren't testing 48 clients at that time, just our normal
workload, which had much less parallelism than that. The person in charge of
setting up the systems explained the performance issues as being due to NUMA.

------
ralphm
Article is from 2016. Previous discussion:
[https://news.ycombinator.com/item?id=11570606](https://news.ycombinator.com/item?id=11570606)

~~~
pdw
Some brief notes from the Linux developers POV:
[https://lwn.net/Articles/734039/](https://lwn.net/Articles/734039/) (section
"Multi-core scheduling")

------
tkyjonathan
DBAs have known about this for many years now. We would simply change the
scheduler on the DB servers.

~~~
pdw
I think you might be confusing CPU schedulers and IO schedulers. Linux never
had switchable CPU schedulers.
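
What is switchable per block device is the I/O scheduler, via sysfs. A
minimal sketch of that kind of tweak; the device name (sda) and the choice of
deadline are assumptions, and pre-blk-mq kernels typically offered noop,
deadline and cfq:

```c
/* Sketch: select the "deadline" I/O scheduler for one disk via sysfs.
 * Device name and scheduler choice are assumptions; the shell equivalent is
 *   echo deadline > /sys/block/sda/queue/scheduler
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");
    if (!f) { perror("fopen"); return 1; }
    if (fputs("deadline", f) == EOF) perror("fputs");
    fclose(f);
    return 0;
}
```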

~~~
noughth
Sure it does:
[http://man7.org/linux/man-pages/man2/sched_setscheduler.2.html](http://man7.org/linux/man-pages/man2/sched_setscheduler.2.html)
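
For concreteness, that call moves a task onto a different scheduling policy
(and thus a different scheduling class inside the kernel). A minimal usage
sketch; the priority value is arbitrary:

```c
/* Sketch: move the calling process onto the SCHED_FIFO real-time policy.
 * The priority value is arbitrary; this needs root or CAP_SYS_NICE. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {   /* pid 0 = self */
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running under SCHED_FIFO\n");
    return 0;
}
```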

~~~
ksk
If you read the page, it simply states that you can tweak the parameters of
the existing scheduler, not replace it entirely.

~~~
noughth
The scheduling policies listed on the man page I linked share some generic
kernel code, but I wouldn't classify them as the same scheduler. If you look
inside the kernel/sched/ directory in the source, you'll find that an instance
of `struct sched_class` is defined for each scheduler class. There are
dl_sched_class, rt_sched_class, fair_sched_class, and idle_sched_class. You
can see in `pick_next_task` in core.c that these class structs are iterated
over, calling into each scheduler's own `pick_next_task`:
[http://elixir.free-electrons.com/linux/v4.13.9/source/kernel/sched/core.c#L3207](http://elixir.free-electrons.com/linux/v4.13.9/source/kernel/sched/core.c#L3207)
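
To make the shape of that design concrete, here is a self-contained toy
version of a priority-ordered class list with per-class pick_next_task hooks;
the structs and picker functions are invented stand-ins, not the kernel's
actual definitions:

```c
/* Toy model of the scheduling-class design described above: an ordered list
 * of classes, each with its own pick_next_task() hook, queried
 * highest-priority class first. Not the real kernel code. */
#include <stddef.h>
#include <stdio.h>

struct task { const char *name; };

struct sched_class {
    const char *name;
    struct task *(*pick_next_task)(void);
};

/* Stand-ins for the per-class pickers (rt, fair, idle). */
static struct task *rt_pick(void)   { return NULL; }   /* no RT task runnable */
static struct task fair_task = { "some CFS task" };
static struct task *fair_pick(void) { return &fair_task; }
static struct task idle_task = { "idle" };
static struct task *idle_pick(void) { return &idle_task; }

static const struct sched_class classes[] = {   /* highest priority first */
    { "rt",   rt_pick   },
    { "fair", fair_pick },
    { "idle", idle_pick },
};

/* Analogue of the loop over class structs in kernel/sched/core.c. */
static struct task *pick_next_task(void)
{
    for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
        struct task *p = classes[i].pick_next_task();
        if (p) {
            printf("picked by %s class\n", classes[i].name);
            return p;
        }
    }
    return NULL;  /* unreachable: the idle class always returns a task */
}

int main(void)
{
    struct task *next = pick_next_task();
    printf("next: %s\n", next->name);
    return 0;
}
```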

