The Linux Scheduler: A Decade of Wasted Cores [pdf] (ubc.ca)
467 points by tdurden on Apr 15, 2016 | 142 comments



I've worked on scheduling bugs in other kernels before (Linux is not an outlier here). The key metric we keep an eye on is run queue latency, to detect when threads are waiting longer than one would expect. And there are many ways to measure it; my most recent is runqlat from the bcc/BPF tools, which shows it as a histogram, e.g.:

   # ./runqlat 
   Tracing run queue latency... Hit Ctrl-C to end.
   ^C
        usecs               : count     distribution
            0 -> 1          : 233      |***********                             |
            2 -> 3          : 742      |************************************    |
            4 -> 7          : 203      |**********                              |
            8 -> 15         : 173      |********                                |
           16 -> 31         : 24       |*                                       |
           32 -> 63         : 0        |                                        |
           64 -> 127        : 30       |*                                       |
          128 -> 255        : 6        |                                        |
          256 -> 511        : 3        |                                        |
          512 -> 1023       : 5        |                                        |
         1024 -> 2047       : 27       |*                                       |
         2048 -> 4095       : 30       |*                                       |
         4096 -> 8191       : 20       |                                        |
         8192 -> 16383      : 29       |*                                       |
        16384 -> 32767      : 809      |****************************************|
        32768 -> 65535      : 64       |***                                     |
I'll also use metrics that sum it by thread to estimate speed up (which helps quantify the issue), and do sanity tests.

Note that this isolates one issue -- wait time in the scheduler -- whereas NUMA and scheduling also affect memory placement, so the runtime of applications can become slower due to longer-latency memory I/O from accessing remote memory. I like to measure and isolate that separately (PMCs).

So I haven't generally seen such severe scheduling issues on our 1 or 2 node Linux systems, although they are testing on 8-node systems, which may exacerbate the issue. Whatever the bugs are, though, I'll be happy to see them fixed, and it may help encourage people to upgrade to newer Linux kernels (which come with other benefits, like BPF).


I assume BPF is Berkeley Packet Filters or maybe eBPF (Extended Berkeley Packet Filters) in this case. Just to save anyone else having to look this up. It looks like this is the link to the tools.

https://github.com/iovisor/bcc

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter


Yup. It's the same BPF (well, except for the "extended" bit) that tools like tcpdump and Wireshark use for packet capture: it's a bytecode for handing simple, guaranteed-termination programs to the kernel and having the kernel run them instead of waking up userspace all the time. This was originally created for packet capture, so the kernel could just hand you packets on port 80 (or whatever) instead of dumping all traffic at you and letting you throw away most of it. But it turned out this is also useful for system tracing: if you strace a program, the kernel will notify it on every syscall, and `strace -e` throws away most of that in userspace. So there's now a way to attach BPF filters to processes, events, etc. so that a userspace tracer is only woken up when something interesting happens, which reduces overhead significantly.
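
For a flavor of how that looks from userspace, here's a minimal sketch using bcc's Python bindings (my own illustration, not one of the shipped bcc tools): it aggregates wakeups per PID inside the kernel and only reads the summary map from userspace.

    # Minimal sketch: count calls to try_to_wake_up() per process, entirely
    # in the kernel, and only read the aggregated map from userspace.
    from bcc import BPF
    import time

    prog = """
    BPF_HASH(counts, u32, u64);

    int count_wakeups(struct pt_regs *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        u64 zero = 0, *val = counts.lookup_or_init(&pid, &zero);
        if (val) (*val)++;
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event="try_to_wake_up", fn_name="count_wakeups")

    time.sleep(5)
    for pid, count in b["counts"].items():
        print(pid.value, count.value)
The point is the same as with packet filters: the filtering and aggregation happen in the kernel, and userspace only sees the summary.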


Anyone tested this? I'm not expecting a lot, since most of our systems are 1 or 2 nodes (run numastat to see how many nodes you have), and neither 8-node nor hierarchical. Anyway, it doesn't apply cleanly to 4.1 (so I'm assuming this is for an -rc).

   arch/x86/kernel/smpboot.c: In function ‘set_cpu_sibling_map’:
   arch/x86/kernel/smpboot.c:452:16: error: ‘sched_max_numa_distance’ undeclared (first use in this function)
                && sched_max_numa_distance == -1)
                ^
   arch/x86/kernel/smpboot.c:452:16: note: each undeclared identifier is reported only once for each function it appears in
   make[2]: *** [arch/x86/kernel/smpboot.o] Error 1
Really wish they'd post this to lkml, where the engineers who wrote the scheduler, and the engineers who regularly performance test Linux, can reply.


I've tested it on some 2 and 1 node systems, some quick results here: https://gist.github.com/brendangregg/588b1d29bcb952141d50ccc... . In summary, no significant difference observed.

This should be posted to lkml, where many others can test it. If there are wins to be had on larger node systems, they'll be identified and this will be fixed.


A lot of their bugs are induced when you have one cgroup with multithreaded/multiprocess tasks and another cgroup with just a few tasks using processor time.

Just testing one benchmark will not show it, unless you have something else running too.


Ok, depends what you mean by something else running. Another PID? That's why I tested make -j32. I also tested multi-threaded applications from a single PID (and more threads than our CPU count), since that best reflects our application workloads.

They ought to be posting it to lkml, where many engineers regularly do performance testing. I've looked enough to think that my company isn't really hurt by this.


Just read the paper, it's explained. Or read the presentation, it uses pictures.

Basically they run R, a single-threaded statistics tool which is set up to hog a core, and in some other cgroup a wildly multithreaded tool. If you have a NUMA system (check with `lstopo`) then it's possible that the scheduler thinks the many tasks in one domain of cores are balanced with just R on one core of another domain. Meaning you can have several (e.g. 7 out of 8) cores idle. It has to do with the way hierarchical rebalancing is coded, and that their 8x 8-core AMD machine has a deep hierarchy.
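
If you want to poke at it yourself, here's a rough reproduction sketch. It's a caricature of their setup, not their benchmark: it assumes cgroup v1 mounted at /sys/fs/cgroup/cpu, root privileges, and substitutes Python busy loops for R and for the multithreaded tool.

    # Caricature of the scenario above: one CPU hog alone in one cgroup,
    # lots of hogs together in another. Assumes cgroup v1 at
    # /sys/fs/cgroup/cpu and root. Kill the processes when done.
    import multiprocessing, os

    def put_self_in_cgroup(name):
        path = "/sys/fs/cgroup/cpu/" + name
        os.makedirs(path, exist_ok=True)
        with open(path + "/tasks", "w") as f:
            f.write(str(os.getpid()))

    def spin_in(name):
        put_self_in_cgroup(name)
        while True:
            pass

    # one "R-like" hog in its own cgroup...
    procs = [multiprocessing.Process(target=spin_in, args=("single",))]
    # ...and a wildly parallel load in another
    procs += [multiprocessing.Process(target=spin_in, args=("many",))
              for _ in range(os.cpu_count() * 2)]
    for p in procs:
        p.start()
Then watch per-core utilization with something like `mpstat -P ALL 1` and see whether cores on one node sit idle while the others are overloaded.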


I manage the NUMA bit (where it counts) with something like this:

    cat ${node_file} | xargs -I {} -P ${NPROCS} -n 1 /usr/bin/numactl -N ${node_num} -l ${script_file} {} $*
I know about how many parallel procs I can run on a single node, and I've got something else that scrapes numactl -H for node count.


It looks like the "Group Imbalance" and "Overload on Wakeup" bugs could be noticeable on a 2-node system under the right conditions. So could the "Missing Scheduling Domains" bug, but only if you are offlining / onlining CPUs.

None of them should affect a 1-node system, and the "Scheduling Group Construction" bug requires a multi-level node hierarchy.


right; almost all our systems are 1 node. And I'm already debugging some NUMA stuff for our 2 node systems.


The quotes in the paper are interesting: "Nobody actually creates perfect code the first time around, except me. But there’s only one of me.” Linus Torvalds, 2007

And another, which highlights why there might be many more undiscovered bugs that would probably go unnoticed: "I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably trivial. Patches already exist, and I don’t feel that people can screw up the few hundred lines too badly"

Looking at the bigger picture, this again shows that getting software right is not easy. Now and then you still hear about bugs popping up in code that is very core to an OS. One I can recall is a decade-old TCP bug which Google fixed last year[1].

[1] http://bitsup.blogspot.sg/2015/09/thanks-google-tcp-team-for...


I don't think the paper contradicts the second quote. It just points out that things have become a lot more complex since then.

"this again shows that getting software right is not easy."

Not easy, yes, but in my experience getting software right isn't too difficult. What seems impossibly hard is keeping software right, maintaining correctness in the presence of new requirements, compatibility constraints and just sheer caring about different things than we used to[1]. At some point I suspect you'd get a better design by rewriting things from scratch. But the way we do development makes rewriting stressful and error-prone, leading to the repeated arguments about whether rewriting is good or bad[2]. The question isn't whether rewriting is good or bad. Software is the most malleable medium known to man, and we should do all we can to embrace that essential malleability that is its primary characteristic and boon. The question is how not to mess up rewriting, how to keep our clay from gradually getting more and more brittle under the weight of added constraints.

(I work on this problem: https://github.com/akkartik/mu/blob/7f5a95cdbe/Readme.md. It's been 31 days since I last got on my soapbox in this forum: https://news.ycombinator.com/item?id=11285529.)

[1] "The future is a disagreement with the past about what is important." http://carlos.bueno.org/2010/10/predicting-the-future.html

[2] Against rewrites: http://www.joelonsoftware.com/articles/fog0000000069.html; https://steveblank.com/2011/01/25/startup-suicide-%E2%80%93-... https://basildoncoder.com/blog/pg-wodehouse-method-of-refact.... Pro-rewrite seems the minority position: http://brettweigl.tumblr.com/post/48385335083/what-can-produ... http://building.wanelo.com/2012/09/14/the-big-switch-how-we-... http://alicebob.cryptoland.net/why-i-rewrote-quivi-from-scra... (with many apologies). And yet we've seen several major rewrites: Netscape (for all Joel Spolsky's criticism above, would Firefox have been possible without the rewrite? Echoing Jeff Bezos's dictum, open source projects can afford to be misunderstood for a long time), Python 3 [edit: no, Python 3 was not a rewrite as I was corrected below], Perl 6, Angular 2. A more balanced viewpoint: https://blog.learningbyshipping.com/2013/04/02/surviving-leg....


Python 3 does not belong on the list of rewritten projects. Python 3 was not a rewrite. It just introduced major breaking changes in the API.


It had the same external effects as a rewrite. If Python didn't have such a large userbase it could have been an extinction-level event.


Ah, thanks for the correction.

But why did Python 3 take so long to release, then?


It's not that Python 3 took a long time to release. It's that it consciously broke compatibility with Python 2, and library authors took a long time to port their libraries to Python 3, so a lot of people writing Python didn't even bother looking at it.


A 10-year transition† around that breaking change was the plan all along, ever since PEP 3000 (dating, unsurprisingly, from 2006), with:

- time for things to settle and eventual mistakes to be fixed (3.0, 3.1, 3.2) while the 2.x line lives

- lockstep 2.x versions that include stdlib and language (via __future__) feature backports from the 3.x line (2.6, 2.7), to get maximum portability and minimum tech debt for new code written on the 2.x line

- finally, impetus to transition with non-backported new language and stdlib features (3.4, maybe 3.3 already)

and of course, leaving time for lib authors to port their existing code.

Never has it been the plan for people to instantly port existing, non-lib code to Python 3.0 (or even 3.1 or 3.2), and never has it been the case that Python 3.x instantly obsoletes the 2.x line; in fact, quite the opposite.

So people that like the new-n-shiny and blindly jump onto the latest version bandwagon were definitely in for the roughest ride, and gave Python 3 the bad rep we know.

I honestly prefer the way this transition has been handled rather than the "hey everything is compatible except it's not and things are subtly breaking all around" attitude from Ruby which I have to deal with now.

† the 10-year figure was somewhere on a mailing list that I can't seem to get hold of, back when I was working with Python and having to plan for this transition.

[PEP 3000]: https://www.python.org/dev/peps/pep-3000/


Ah, I see, so Python 3 only took a couple of years, about the same as any minor release: https://en.wikipedia.org/wiki/History_of_Python#Version_rele...


You can see Guido’s 2006 talk about the project here for some history https://www.youtube.com/watch?v=UIDdgeISLUI


You are thinking of Perl


The best example is SSL. Over a decade and we have not even got the "correctness" part right.


Over a decade? It's over twenty years old! SSL/TLS & its XPKI are complete and utter jokes.


That shows a massive revealed preference, and says nothing about what is possible in software.


Scheduling has become a hard problem. There's cache miss cost in moving a task from one CPU to another. CPUs can be put into power-saving states, but take time to come out of them. Sharing the run queue across CPUs has a locking cost. So it's now a cost optimization problem. That's a lot of work to be doing at the sub-millisecond decision level.


One of the points in the article is that despite the scheduler's importance and complexity, there were no tools that allowed evaluating the performance. This work contributed such tools, and those immediately allowed identifying quite a few bugs. Fixing those improves performance on NUMA systems rather substantially across various loads.


>there were no tools that allowed evaluating the performance.

This reminds me of Bryan Cantrill's remarks in a talk (sorry, can't remember which one) about Solaris and DTrace. For the first time ever, DTrace gave them the ability to look into the live performance of low-level OS code, and they found a lot of pathological worst-case behavior that no one suspected beforehand. No matter how good your team is, it's really hard to accurately predict how a system will behave; measurement is key.


You can't do science without measurement.


To take this work more seriously, I'd want to see a discussion of the potential negative effects of their patches. The way it's presented, it gives the false impression that "it will just work."

However, their premise of "just wake the idle cores as soon as there are threads to run" can actually be harmful in some real-life scenarios, especially regarding the additional power use and the slowdown introduced by switching from warm caches to cold ones.

Intuitively, since they demonstrate the "speedup" on some specific programs using some specific patterns, I'd expect that some specific programs and patterns exist that don't necessarily benefit from every change they present.


Well, two out of the four are straight-out bug fixes.


I agree. For the task of writing an OS scheduler, I think the programmers must list desired properties, mathematically prove that their approach has the desired properties and then, but only then, test and measure their implementation for a few CPU years on dozens of types of hardware.

And yes, that proof probably will be hairy, but that's precisely the reason it has to happen.

Also, in the likely case that the proof is for a limited domain ("at most 1000 CPUs", "level 1 cache shared between the same CPUs as the level 2 cache"), I think the scheduler should test for that at boot time and strongly consider panicking the kernel if it doesn't apply.


It may be better to let a task wait briefly on the run queue rather than bringing a processor out of sleep mode. That requires a cost/benefit calculation. That's a lot to ask of a scheduler.

Unidirectional interprocess communication forces this issue to come up frequently. The pattern "send on pipe A, wait for pipe B" to a process that is waiting on pipe A and will reply on pipe B implies that, for a very short period between the send and the wait, there are two processes ready to run. Starting one of them on another CPU is suboptimal, and waking up a sleeping CPU is even worse. You're really doing an interprocess subroutine call, but the OS does not know this. (As I point out occasionally, QNX, with MsgSend/MsgReceive and a scheduler that understands them, does this much better.)
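
Here's a tiny sketch of that pattern with plain pipes and fork (nothing QNX-specific), just to make the "two runnable processes for a brief moment" point concrete:

    # "Send on pipe A, wait for pipe B": right after the client's write and
    # before its read, both processes are briefly runnable, which is what
    # tempts the scheduler to involve a second (possibly sleeping) core.
    import os

    a_r, a_w = os.pipe()   # pipe A: client -> server
    b_r, b_w = os.pipe()   # pipe B: server -> client

    if os.fork() == 0:
        # server: wait on A, reply on B
        os.close(a_w); os.close(b_r)
        while True:
            req = os.read(a_r, 4096)
            if not req:
                break
            os.write(b_w, b"reply to " + req)
        os._exit(0)

    # client: send on A, then immediately block on B
    os.close(a_r); os.close(b_w)
    for i in range(3):
        os.write(a_w, b"request %d" % i)
        print(os.read(b_r, 4096))
What you'd really want is for the scheduler to treat the pair as a synchronous call and keep it on one warm core, which is exactly what the QNX primitives make explicit.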


As indicated in the paper, patches (currently for kernel 4.1) are here: https://github.com/jplozi/wastedcores


Are they accepted upstream? If not, I'd like to see a link to the lkml thread for these.


I wonder if they are discouraged by the experience of Con Kolivas[1] who proposed an alternative scheduler back in 2007. (Apparently he is still maintaining his "-ck" fork of the linux kernel[2]!)

I only mention this as a historical case that has remained in my memory. Maybe Linus is willing to revisit the issue, I don't follow LKML.

[1] https://en.wikipedia.org/wiki/Con_Kolivas

[2] http://ck-hack.blogspot.com/2015/12/bfs-467-linux-43-ck3.htm...


I do not follow kernel dev enough to have a good picture of what happened, but it seems to me that it was another example of smart people being pushed out. Nowadays I think he is just maintaining his patches from one kernel release to the next. It was smart for Con to stop interacting with them and move on to other kinds of development; there comes a point where you have to keep your sanity.


I remember the whole debacle back then, when the lkml was summarized on that website whose name I can't remember.

It'd be interesting to see if that branch exhibits the same behavior and issues.


Maybe KernelTrap? I was sad when it shut down, it was a very useful resource for following Linux development from the sidelines.


Yeah that one! The memories.


You might be thinking of LWN. They covered the scheduler multiple times over the years: https://lwn.net/Kernel/Index/#Scheduler


I can't find any evidence they were submitted upstream at any juncture.

Moreover, since (several of) the patches seem to rip out a bunch of logic in favor of very simple logic, they would probably be contentious without broad testing and probably a runtime option to configure the behavior.


And uses spaces instead of tabs. Someone will need to go fix them up.


Not sure why you're being downvoted. Tabs instead of spaces is the Linux kernel style.


The conference where this will be presented starts next week, so I think they'll post it after they've caught up on some sleep after that.


Which conference, Eurosys?


Yes, since the filename is eurosys16-final29.pdf.


Nice work: paper, (upcoming tools), and actual patches.

Wish they'd look at disk I/O next; there are some problems there that are hard to describe other than anecdotally: e.g. my system runs on an SSD and periodically copies data to a HDD with rsnapshot. When rsnapshot runs rm on the HDD, things freeze for a moment (even switching windows in X), although the only thing using the HDD is that rm ...


It seems like everything with buffer/process queues needs a good polish to optimize for fairness.

E.g. this is just being readied: http://blog.cerowrt.org/post/fq_codel_on_ath10k/

(also note higher throughput as result)



Thanks, I'll have to try that, although in my case the writes and reads are done on entirely different devices (though maybe there is a shared queue somewhere?)


This is absolutely nuts!

If this result is true, our Linux machines have been wasting 13-24% of our silicon and energy for years (that number is for "typical Linux workloads") because the scheduler fails to fulfill its most basic duty.

The quotes from Linus in the paper just twist the knife.


From the looks of it, this mainly applies if you are running a NUMA system. Which you probably are not. Though it may apply in other places where 'scheduling domains' are used to model hardware (e.g. hyperthreading and maybe big.LITTLE).


> From the looks of it, this mainly applies if you are running a NUMA system. Which you probably are not.

Do you run on a server? Has it been made in the last 3 - 5 years? Then you are running a NUMA system.


Maybe they're single-socket servers?


So desktop is unaffected?


I imagine it depends on your desktop - I buy a lot more servers than desktops, and care about NUMA because of the effects on VM and JVM behaviour.


From my quick testing (mentioned earlier), I'm not seeing a significant difference on the typical systems we run (1 and 2 node).

I'd find it hard to believe we missed a 13-24% win on all Linux systems anyway. Linux undergoes intense performance analysis, including checking for poor scheduling behavior.

This sounds very much like a bug on 8 node NUMA systems. There's been lots of scheduling bugs over the years.


ISTR something from Google about modifications to their Linux systems to eke out maximum resource utilization [1], so I would not be surprised if changes with this level of impact may exist.

[1] - http://csl.stanford.edu/~christos/publications/2015.heracles...


On systems that have a compute bottleneck, yes. Questions that arise though are:

* is the system compute bound or io bound? - in the latter case the waste will be smaller.

* How does this compare to other OSes? If it is the most efficient kernel available for your workload, compared to other options, it is questionable to say it fails.


You can check whether your system is NUMA with `lstopo`.


If the CPU is idle it isn't wasting energy.


The rest of the system is still a (roughly) fixed cost, and a 75% loaded CPU package still consumes more than 75% of the power of a 100% loaded CPU package.


From the linked paper:

"Resulting performance degradations are in the range 13-24% for typical Linux workloads, and reach 138× in some corner cases. Energy waste is proportional."


I'm just a lowly performance obsessed dev who uses things like node, php, python, etc. I run very high traffic applications and spend a lot of time buying and building my own servers to try to eke out every ounce of performance.

So can someone who knows about linux kernel internals explain the impact of this research? I read the abstract and some of the paper and it sounds very promising - like we may get significantly more performance out of our cores for multi-threaded or multi-process applications.


You will probably get significantly more performance out of your cores for multi-threaded or multi-process applications if you stop using node, php, python, etc. and use something that's more performance oriented.


Well, one might also get more performance per dollar if they programmed FPGAs, DSPs or GPUs instead of CPUs, and one might also get more performance per dollar if they designed their own hardware. (I do the latter for a living.)

However, "performance of Node, PHP and Python" is a sensible goal in its own right, and I disagree with the implication of your comment, and that of sister comments, that it is not a sensible goal. There's a lot of useful code written in Node, PHP and Python, and moreover, this might remain true for a long while because "something more performance oriented" is likely to be less programmer productivity oriented in that a correct, easy to use program or library will take more time to get done. Also, Node and Python specifically can be damn fast (numpy for instance is unbeatable for large linear algebra programs, because it uses optimized libraries under the hood, etc. etc.)

And some things simply can't be done in a satisfactory fashion in anything but a dynamic language, any more than you can get Apache to run on a GPU. "Dynamic" is a feature, not just a source of overhead.

So "a performance-obsessed scripting language developer" is a perfectly fine way to describe oneself IMO.


Aren't all three of those languages single-threaded? So the only way to distribute work is to run one copy per core and distribute on a per-request basis?


Python, at least, uses actual OS-level threads. However, it also uses a global interpreter lock (GIL), so only one thread can execute Python code at a time.

But when writing Python modules in C, you have control over acquiring and releasing the GIL, so before starting some long-running operation, you give up the lock.

Node, AFAIK, uses several OS-level threads under the hood for disk I/O. And with PHP, a web server probably will run multiple threads for handling requests concurrently.

So the impact might not be as big as for performance-oriented code in C/C++, but it is not necessarily nil, either.
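
A quick way to see the GIL effect (CPython assumed): CPU-bound pure-Python work gains nothing from a thread pool, but does scale with a process pool, which is part of why one-worker-per-core setups are so common.

    # CPU-bound pure-Python work: threads are serialized by the GIL,
    # processes are not. Expect the thread pool to take about as long as
    # doing the work serially, and the process pool to be several times
    # faster on a multi-core box.
    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        s = 0
        for i in range(n):
            s += i
        return s

    if __name__ == "__main__":
        work = [5 * 10**6] * 4
        for pool in (ThreadPoolExecutor(4), ProcessPoolExecutor(4)):
            t0 = time.time()
            with pool:
                list(pool.map(burn, work))
            print(type(pool).__name__, round(time.time() - t0, 2), "s")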


With Python, numpy for instance will use multiple threads under the hood, even though the calling Python code might be single-threaded, and numpy's execution is completely unaffected by the GIL. Incidentally, to get Python code to run really fast, you'll have to offload most of the heavy lifting into libraries in any case. But it's still a Python program - in that most of the source lines, especially most of the source lines unique to the program as opposed to library code, will be still in Python, and in that in some cases you couldn't have written the program in a "more performant" language getting either the performance or the flexibility (Go for instance doesn't have operator overloading nor can it be used to implement linear algebra as efficiently as C/assembly; so a pure Go program doing linear algebra will be both slower and uglier than a Python program using numpy. A Go program using a C/assembly library will be very marginally faster than said numpy program, and just as ugly as the pure Go program.)

Also, in my understanding TFA applies to multiple processes just as much as multiple threads.


A recent alternative is to write the entire program in Julia. It is a lot less ugly than Numpy, and so performant that most code (other than steadfast libraries such as BLAS and LAPACK) is written in Julia itself. http://julialang.org/


>Aren't all three of those languages single-threaded? So the only way to distribute work is to run one copy per core and distribute on a per-request basis?

To clarify: languages are never, by definition, single-threaded. The reference implementations largely are.

Some background: in Python's case, PyPy supports STM, which removes the global interpreter lock while retaining backwards compatibility with existing code.

The answer to your question is no. All 3 are not single threaded.


Facebook's HHVM runs php interpreters as threads.


I need to hire developers who can work on these applications. For specific tasks, like c10k, node works incredibly well, and where it fails, nginx frequently comes to the rescue with a similar architecture.

But you're always going to win this argument by suggesting a lower-level solution, until we arrive at coding assembly optimized for specific bare metal.


I'd suggest giving Elixir a try if you're trying to build things for extremely high concurrency; it will handle M:N scheduling across multiple CPUs for you automatically (and across multiple machines slightly less automatically).


This research applies to 'NUMA' systems: commonly servers with multiple physical CPUs that each have a connection to their own memory banks. They can access the memory of the other CPU by requesting it, but that takes time. So the process scheduler has to take that into account, usually by keeping processes slightly affixed to the place where they were started.

Off-topic, but high-performance long running processes are mainly programmed in C, C++ & Java. Maybe stuff like Rust and Swift in the future. Fortran if you are doing mathematical computation, but then you'd probably already use it if you need it.

For what I estimate you mean by high traffic on PHP or node systems on multiple servers, you probably want to look at Elixir and its Phoenix web framework. It's more appropriate for responsiveness (as in low latency), with less boilerplate than Java. |> http://www.phoenixframework.org/docs/overview


> I'm just a lowly performance obsessed dev

> who uses things like node, php, python, etc.

pick one


Actually there is a causality there. I think he became performance obsessed AFTER he picked node, php and python, because performance became a problem.


You're more likely to write poor code that contains inefficiencies than having V8 or PyPy become your performance bottleneck.


Page 8: "To fix this bug, we alter the code that is executed when a thread wakes up. We wake up the thread on the local core—i.e., the core where the thread was scheduled last— if it is idle; otherwise, if there are idle cores in the system, we wake up the thread on the core that has been idle for the longest amount of time."

I don't quite understand the choice of picking the core that was idle for the longest time. I think they use that as a predictor of future load of the CPU, and scheduling based on future load seems a good idea, but I think this could lead to cases where it prevents one or more CPUs from going to low-power states when the system doesn't have enough work to keep all its cores fully occupied.

(Edit: how did I overlook the following paragraph, where they discuss this issue?)

Also, in general, I think they too easily call the changes that remove the corner cases they find 'fixes'. Chances are that they introduce other corner cases, either on workloads they didn't test or on hardware they didn't test (caveat: I know very little about the variation in hardware that is out there).


The last paragraph of that segment says they pick it because it's the first in the list of idle cores: a core that is no longer idle removes itself from the list, and adds itself at the end when it becomes idle again, so effectively the list is ordered by diminishing idle time.

Though from my understanding of the patch they walk over all CPUs which is not precisely constant time, and it doesn't seem to be just the list of idle CPUs; for_each_online_cpu(_cpu): https://github.com/jplozi/wastedcores/blob/master/patches/ov...

That said, from what I grasp of the code, it also doesn't bail out when power management is enabled, so maybe the public code and the paper are not at the same stage of development.


Picking the core that was idle for the longest period may end up working as a proxy for picking the coolest core, and as such encouraging more even distribution of heat and reducing the probability of thermal throttling?


Couldn't it also work to prevent getting cores into deeper sleep states?


Shouldn't every idle core be in low power mode? Doesn't the idle instruction (hlt) tell the cpu to power parts down? I thought it was like a switch, you don't need light? Flip it off, need some? Turn it on at nearly no cost.

There might be newer multistage power down but to wear level and use all cores, this is the right algorithm.


Many people in the Node.js community have had first-hand experience with this issue. Node.js has had to turn off OS-based scheduling for its cluster module because you always ended up with a couple of CPUs taking all the load (and accepting new connections) while most of them remained idle.

I thought it was either a bug with the TCP polling mechanism or with the Linux OS scheduler itself. It's good that this issue is finally getting some attention.


That might be what they call the 'overload on wakeup' bug. Maybe try the patch? I've read that the patch additionally needs a small fix, the goto label position went missing.

Probably just before rcu_read_unlock() in that function: http://lxr.free-electrons.com/source/kernel/sched/fair.c#L51...


"Cores may stay idle for seconds while ready threads are waiting in runqueues."

I do not believe it, sorry. Troll paper.

Check out the massive indentation change in this patch which obscures the changes being made:

https://github.com/jplozi/wastedcores/blob/master/patches/sc...

Good grief.


Hm, github.com's diff doesn't let you ignore changes in whitespace? Bummer.

Anyway that's not basis enough to invalidate their claims. They're published in EuroSys [1]. Is that not a reputable journal/conference?

Plus this dude has a sweet CV layout [2]. I'm inclined to believe him on that basis alone.

[1] http://www.i3s.unice.fr/~jplozi/wastedcores/

[2] http://www.i3s.unice.fr/~jplozi/


Fact is, the "dude" doesn't know how to get his text editor to use the same tabbing as the kernel code he's editing, and doesn't know how to run a patch through the kernel's "checkpatch.pl" script, without which it will be rejected from upstreaming.

The idea that the kernel runs idle tasks on cores while actual tasks are runnable (and that this situation persists for seconds) is ridiculous. It's such a gaping problem, that everyone doing anything with Linux anywhere would be running into it on a regular basis.

Think about all the people out there who carefully profile the performance of their multi-threaded apps. Like all of them wouldn't notice missing seconds of execution?


I've not studied this branch of CS since college a decade+ ago. As far as I know, certain elements of the kernel aren't modified very often. Also the researcher claims to introduce new tools that didn't exist before. Again I think if the journal, EuroSys, is reputable, then that journal would vet the findings properly. Replicating people's work is standard procedure. Maybe people need time to attempt that. Right on if you are proven right!


Certain elements of the kernel aren't modified often, but the scheduler isn't one of them. It has been subject to a lot of churn. There is no generalization you can state about the Linux kernel scheduler that is valid for an entire decade of kernel versions.


> There is no generalization you can state about the Linux kernel scheduler that is valid for an entire decade of kernel versions

That sounds like a generalization about the Linux kernel scheduler for an entire decade of kernel versions =) (just kidding haha!)


It does (add ?w=1 to the query parameters), but this is a file containing a diff, not a GitHub diff.


> the scheduler, that once used to be a simple isolated part of the kernel grew into a complex monster whose tentacles reached into many other parts of the system,

That's what I call vivid language!


Some authors appear to work with the old Xen crew (now founders at coho). Impressive work!


The bit about creating a cgroup per tty was news to me, and now I'm wondering if the way I usually manage Linux servers scales up to heavy traffic (usually if I'm involved it's small potatoes).

Presumably things started as sysv scripts or even docker containers don't have this problem?



Is anyone able to explain Figure 1? I don't understand what the levels are, and whether level 1 is the coarsest or the finest. The caption doesn't seem to make sense either way. Also, Algorithm 1 is in terms of cpus, but the description mentions 'nodes' and 'cores'. Is a CPU a core or a node? Neither?


The scheduler, looking at an idle core, decides whether to steal work from an overloaded neighbor. It will only compare over the interconnects in the figure (between domains).

E.g. the two cores in the dark grey box can steal work from each other, but they will only see load averages of the neighbouring domain. In certain cases the current scheduler calculates the load figures somewhat oddly, so the idle core decides that a neighboring overloaded 'scheduling domain' is not overloaded.
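
A toy illustration with made-up load numbers (and, if I remember the paper right, their fix for this case is precisely to compare minimum loads instead of averages):

    # Node A: one very heavy (high-weight) thread plus three idle cores.
    # Node B: evenly busy. Comparing group averages says "balanced", so the
    # idle cores on A never steal; comparing minima exposes the idle capacity.
    node_a = [10, 0, 0, 0]
    node_b = [2, 2, 2, 3]

    print(sum(node_a) / len(node_a), sum(node_b) / len(node_b))  # 2.5 vs 2.25
    print(min(node_a), min(node_b))                              # 0 vs 2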


Unfortunately the authors are using AMD machines, which behave differently than most others because pairs of cores are conjoined into "modules" that share resources. In AMD processors, a cpu is a core. In almost all other processors, a cpu is an SMT thread (aka hyperthread).

In figure 1 the levels/shades represent distance from node/socket 1, darker being closer. So node 1 is distance 0 from itself, two other nodes are distance 1, and one node is distance 2.


The only thing shared in a bulldozer core is the FPU (and perhaps some cache, not sure). For all other purposes, a bulldozer module contains two full CPU cores.

Also, what CPUs besides Intel's and IBM's use SMT?


Sun^H^H^HOracle has the UltraSparc T* CPUs which, IIRC, use SMT heavily.

Itanium also supports some form of SMT, I think, although I am not sure if anyone actually uses those.


On 'Dozer, the arithmetic and memory units were pretty much the only things that were separate. Each core pair shared a scheduler, a dispatcher, branch predictor, cache, etc. It really was an oddly thought out design.


It's probably even more complicated than that, as they separated more parts per core during the evolution of 'Dozer, like the decoder, which was unified at first and became two separate ones later.

Yeah, an oddly thought out design for sure. The idea seems to make sense to me but execution wasn't good enough I guess.


From what I understand from this presentation, the 'scheduling domain' abstraction is reused through different layers of the hierarchy. So for example the two hyperthreads on one physical core are also modeled as a 'scheduling domain'.

https://events.linuxfoundation.org/images/stories/slides/lfc...


ARM doesn't do SMT either.


I agree.


This is extremely promising. Makes you wonder if we ought not go further and implement a machine learning-based scheduler that studies and anticipates workloads and schedules accordingly so as to help jobs complete as quickly as possible.


Why machine learning? The paper says that the need to handle the complexities of modern hardware made it so that the scheduler failed to do its fundamental job.

Machine learning sounds like it would add even more complexity but perhaps I just fail to see why machine learning is a good idea here. I don't see how you can predict workloads based off historical data and if you're going to try to predict workloads based off binaries, the overhead would likely outweigh any benefits you would get.

I'll admit I am no expert in machine learning but I have a hard time understanding why you would look at this problem and think machine learning is the solution.


The only reason is because machine learning is fashionable at the moment and a lot of people like to suggest it as the panacea. I can't even imagine the terrible overhead that machine learning would impose in probably the most critical part of the OS.


Prediction can be extremely fast.


But learning is always slow. What is the point of machine learning if you disable learning?


Learning doesn't need to be done in realtime... it can happen async.


Go for it, historically there have been some simple Linux schedulers. It's not difficult to hack on.

There was a "genetic algorithm" patch years back, it basically identified busier processes and preferred to give them cycles. It kind of sucked as less busy processes would starve. We are at an interesting place now, we have a good scheduler with o(1) semantics and it's fair, but it's leaving cycles on the floor on certain hardware with certain workloads.

I would think a machine learning approach would be more expensive and it could potentially be difficult to explain why it was doing a certain thing.

There is a long held belief that a good Linux scheduler shouldn't have tuning knobs and that we shouldn't need to select the scheduler at build time. I could see that belief coming to an end as we run on everything from smartphones up to supercomputers, with performance and energy being key issues on both ends of the spectrum. Some of the more exotic heterogeneous hardware is becoming very popular too, and that may raise the question even more.


According to this paper, they dropped the O(1) scheduler a few years ago for a CFS scheduler that's O(log n) but exhibits nicer behaviors in other ways.


"I don't see how you can predict workloads off of historical data"

... Think about that part, then:)


Yeah and while we're at it, we can rewrite the whole thing on top of the JVM-- get some killer constant folds goin' on.

Servers usually have super long uptimes, too. So the only valid criticism of JIT, which is warm up, will be a nonfactor!

(big /s if not obvious.

JIT and ML evangelists make me sick.)


I wouldn't be so hard on JIT. JIT algorithms can massively speed up solutions to particular problems. For example, grep uses libpcre, which is one of the fastest regex engines around, and it uses JIT compilation.


This is kind of misleading. grep has an option to enable PCRE matching (the -P flag), but it by default still uses a combination of Boyer-Moore, Commentz-Walter (like Aho-Corasick) and a lazy DFA.

Also, Rust's regex engine (which is based on a lazy DFA) is giving PCRE2's JIT a run for its money. ;-) https://gist.github.com/anonymous/14683c01993e91689f7206a186...


Ah, okay. I didn't realize that. I was trying to point out there are everyday valid use cases for JIT that don't involve the JVM.

I've used Rust's regex engine before, it's very promising. It's really neat how the regex itself is fully compiled.


Yeah, you're absolutely right to point out PCRE2's JIT. It is indeed quite fast---I'm not entirely convinced a DFA approach can beat it in every case!


I love how I got super downvoted for an idea that is quite possibly central to the future of computing. It's as if HackerNews never realized that they themselves are machine-learning-based scheduling algorithms. Guess that one is going to hit them pretty hard.


This will hopefully be a more useful critique of your idea.

This wouldn't be practical unless it could be optimized "to the nines" to be very fast. The previous Linux scheduler ran in O(1); the new Completely Fair Scheduler runs in O(log N). The scheduler has to be able to run every CPU tick, so it has to be fast. A machine learning based scheduler does not sound like it could be made that fast. To put this into perspective, on a regular x86_64 the CPU tick rate in Linux is 1000 Hz.


Training is expensive, prediction is fast.


Interesting work certainly. The patches look more like proof of concept rather than something that's ready for mainline.

I think they're better off if they post a RFC patch to LKML as it exists to facilitate discussion and testing.


It should probably mainly be seen as a proof of concept for their scheduler decision visualization tools, which are not public for the time being. Those should make checking and fixing bugs easier in the future.


It'd be interesting to run the same tests on the Windows scheduler to see how it compares. Anyone know?


I wonder what the impact is on networking loads. Often the most important thing is for a thread to wake on the core with ready access to the packet that woke it. Other scheduling concerns are counterproductive.


The year is 2016.

Software engineers are still obsessed with squeezing every last drop of performance from a single core, adding multicore or distributed load support as an afterthought.

Sorry, it doesn't work this way anymore. There will be no more single core performance increases — laws of physics forbid it. Instead, we will see more and more cores in common CPUs (64, then 256, then 1024 — then it will be merged with GPGPUs and FPGAs with their stream processing approach).

Learn distributed programming or perish.


> The year is 2016.

It is $CURRENT_YEAR, yes.

> Software engineers are still obsessed with squeezing every last drop of performance from a single core, adding multicore or distributed load support as an afterthought.

Normally I'd agree, but if you had bothered reading the abstract the performance losses were not negligible and the kernel scheduler was responsible for the losses. This has nothing to do with application programmers not understanding "distributed programming" as you put it.

From the article:

> As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores.

> As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues.

> In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database.


I don't think this is a good way to look at it. In my opinion, having a "perfect" single core is much more valuable than an octo-core CPU. Take a look at the iPhone 6s. Two cores, and it runs faster than some octo-core Androids. I may be a bit of a noob in this area, but something tells me that is significant.


I'm also a noob, but according to parent, the reason why the iPhone runs faster is that all optimisation is done for 1 core, which would then leave the 8 cores doing nothing, despite having much, much more raw power available than the single iPhone core.

Or put in another way: If we want more raw processing power, we need more cores, but we don't want more cores because software is optimised for a single core.


Not true; even in multithreaded scenarios the A9 is more performant than the multicore Android phones, or has comparable performance in the worst case, if I remember correctly. In single-threaded scenarios it simply destroys all the other mobile chips and is comparable with some low-power Intel chips.


That is the point. If software can't use multiple threads, then strong single-thread performance will win.


Yes — because most applications are still written as if for single core. This is the vicious cycle I am talking about.


Still, your earlier comment completely misses the mark. Distributed algorithms rely more heavily on a well-functioning scheduler than single-thread solutions. I fail to see how your comment about "squeezing every last drop of performance from a single core" matches the article being discussed.


There will be no more single core performance increases — laws of physics forbid it.

This is just plain wrong. Look at any Intel CPU generation and you will find that the new one is faster than the old one clock-to-clock.


Most recent Intel cores actually appear to be slightly slower at integer performance. http://imgur.com/a/2fiLF


I would fucking love to have an FPGA included with the newest Intel procs or something. :D

Kinda looking forward to lowRISC with minion cores, though - if we ever ditch x86 for games.


> I would fucking love to have an FPGA included with the newest Intel procs or something. :D

You're in luck:

http://www.theregister.co.uk/2016/03/14/intel_xeon_fpga/

Intel's purchase of Altera is sure to lead to all kinds of innovation in this area. This is hopefully only the start.


That's awesome! :D

To me it just seemed silly that we'd continually upgrade to get more dedicated circuitry for things like low-power video decoding. I know programming an FPGA still wouldn't be as efficient as it could be, but I think in the future it might be nice to have an FPGA available for adding things like x265 hardware decoding a year after purchase.

What I'd REALLY love to see is an FPGA in a Chromebook for students. School will be so awesome in the next decade.


We are stuck with x86 for good. Intel is doing impossible things both in engineering and business execution, and it's really hard for anyone to compete with them.

Multicore ARM is still inferior to multicore x86.


for which purpose?

assuming which software load?



