
Scheduling has become a hard problem. There's a cache-miss cost to moving a task from one CPU to another. CPUs can be put into power-saving states, but they take time to come out of them. Sharing a run queue across CPUs has a locking cost. So it's now a cost-optimization problem, and that's a lot of work to be doing at the sub-millisecond decision level.
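A toy sketch of that kind of cost/benefit decision, purely illustrative and not the kernel's actual logic; every constant and name here is made up:

    /* Toy cost model for the decision described above: should an idle,
     * sleeping CPU be woken to take a waiting task? Constants are
     * made-up placeholders, not measured values. */
    #include <stdbool.h>

    #define CACHE_MIGRATION_COST_NS  50000  /* refilling a cold cache       */
    #define CSTATE_EXIT_COST_NS      80000  /* leaving a power-saving state */
    #define RUNQUEUE_LOCK_COST_NS      500  /* touching a shared run queue  */

    static bool worth_waking_cpu(long expected_wait_ns)
    {
        long cost = CACHE_MIGRATION_COST_NS + CSTATE_EXIT_COST_NS
                  + RUNQUEUE_LOCK_COST_NS;
        /* Wake the CPU only if the task would otherwise sit on the run
         * queue longer than the overhead of moving it here. */
        return expected_wait_ns > cost;
    }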



One of the points in the article is that despite the scheduler's importance and complexity, there were no tools for evaluating its performance. This work contributed such tools, which immediately made it possible to identify quite a few bugs. Fixing those improves performance on NUMA systems quite substantially across a variety of workloads.


> there were no tools for evaluating its performance.

This reminds me of Bryan Cantrill's remarks in a talk (sorry, can't remember which one) about Solaris and DTrace. For the first time, DTrace gave them the ability to look into the live performance of low-level OS code, and they found a lot of pathological worst-case behavior that no one had suspected beforehand. No matter how good your team is, it's really hard to accurately predict how a system will behave; measurement is key.


You can't do science without measurement.


To take this work more seriously, I'd want to see a discussion of the potential negative effects of their patches. The way it's presented gives the false impression that "it will just work."

However, their premise of "just wake the idle cores as soon as there are threads to run" can actually be harmful in some real-life scenarios, especially because of the additional power use and the slowdown introduced by moving work from a warm cache to a cold one.

Intuitively, since they demonstrate the "speedup" on some specific programs with some specific patterns, I'd expect that other programs and patterns exist that don't necessarily benefit from every change they present.


Well, two out of the four are straight-out bug fixes.


I agree. For the task of writing an OS scheduler, I think the programmers must list the desired properties, mathematically prove that their approach has them, and then, and only then, test and measure their implementation for a few CPU-years on dozens of types of hardware.

And yes, that proof probably will be hairy, but that's precisely the reason it has to happen.

Also, in the likely case that the proof is for a limited domain ("at most 1000 CPUs", "level 1 cache shared between the same CPUs as the level 2 cache"), I think the scheduler should test for that at boot time and strongly consider panicking the kernel if it doesn't apply.
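Something like the following boot-time check could express that. This is a hypothetical sketch against the Linux kernel's init machinery; NR_PROVEN_CPUS and the check itself are invented for illustration:

    /* Hypothetical sketch: refuse to boot outside the domain the
     * correctness argument covers. NR_PROVEN_CPUS is an invented
     * constant, not a real kernel symbol. */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/cpumask.h>

    #define NR_PROVEN_CPUS 1000

    static int __init sched_proof_domain_check(void)
    {
        if (num_possible_cpus() > NR_PROVEN_CPUS)
            panic("scheduler: proof covers at most %d CPUs, machine has %u",
                  NR_PROVEN_CPUS, num_possible_cpus());
        return 0;
    }
    early_initcall(sched_proof_domain_check);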


It may be better to let a task wait briefly on the run queue rather than bringing a processor out of sleep mode. That requires a cost/benefit calculation. That's a lot to ask of a scheduler.

Unidirectional interprocess communication forces this issue to come up frequently. The pattern "send on pipe A, wait for pipe B" to a process that is waiting on pipe A and will reply on pipe B implies that, for a very short period between the send and the wait, there are two processes ready to run. Starting one of them on another CPU is suboptimal, and waking up a sleeping CPU is even worse. You're really doing an interprocess subroutine call, but the OS does not know this. (As I point out occasionally, QNX, with MsgSend/MsgReceive and a scheduler that understands them, does this much better.)
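A minimal sketch of the client side of that pattern in plain POSIX C (the pipe setup, the server, and the struct layout are assumptions for illustration):

    /* Client side of the "send on pipe A, wait for pipe B" call pattern.
     * Between the write() and the read(), both client and server are
     * briefly runnable -- the window described above. */
    #include <unistd.h>

    struct req { int op; int arg; };
    struct rep { int result; };

    /* a_wr: write end of pipe A; b_rd: read end of pipe B */
    int rpc_over_pipes(int a_wr, int b_rd, const struct req *rq, struct rep *rp)
    {
        if (write(a_wr, rq, sizeof *rq) != (ssize_t)sizeof *rq)
            return -1;
        /* An ideal scheduler would hand the CPU straight to the server
         * here instead of waking another core, QNX MsgSend-style. */
        if (read(b_rd, rp, sizeof *rp) != (ssize_t)sizeof *rp)
            return -1;
        return 0;
    }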




