[dupe] The Linux Scheduler: A Decade of Wasted Cores (acolyer.org)
136 points by jwildeboer 595 days ago | 38 comments



Link to previous discussion on HN: https://news.ycombinator.com/item?id=11501493


Run "numastat". If you only see one "node" column, then these issues don't affect you. At Netflix, almost all our systems are single NUMA node.

I tested the patches on 1- and 2-node NUMA systems and saw 0% performance improvement for some simple workloads.
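For anyone who wants a programmatic check, here is a minimal sketch using libnuma (an illustration only, not something from the paper):

  /* Build with: gcc check_numa.c -lnuma */
  #include <stdio.h>
  #include <numa.h>

  int main(void)
  {
      if (numa_available() < 0) {
          printf("NUMA API not available: effectively a single node\n");
          return 0;
      }
      /* One configured node means these scheduler issues largely don't apply. */
      printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
      return 0;
  }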

The paper is also uninformed about scheduler bugs in general. Yes, there have been lots of scheduler bugs (I've worked on plenty in other kernels too, where scheduling is even worse than in Linux!). Yes, we already have tools for diagnosing them (e.g., examining run queue latency as a histogram). Yes, we've already used visualizations too.

There's information for reproducing the bugs here: https://github.com/jplozi/wastedcores/issues/1 . If you think for a second that you've hit these bugs, you should read these steps.


Found it funny that one of their patches reduces a function to:

  static int should_we_balance(struct lb_env *env)
  {
      return 1;
  }


It will certainly be inlined and very likely optimized away altogether. Better to keep it that way, with a comment. Once the problem is fully understood, there may be room to enable it again with a more meaningful heuristic.
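Something along these lines, say (a sketch of the suggestion, not actual kernel code):

  static int should_we_balance(struct lb_env *env)
  {
      /*
       * Always balance for now: the previous heuristic left cores idle
       * (see the wasted-cores paper). Revisit once the problem is fully
       * understood and a better heuristic is available.
       */
      return 1;
  }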


Ummm... isn't that the job of version control?


Not really. It's a bad idea to bury past decisions that deserve some sort of "no trespassing," "thar be dragons," or "do not feed after midnight" monument off in some historic commit in version control. The problem is that VCSs are completely undiscoverable inside the flow of walking through code to see "where's the right place to implement this awesome feature idea I had." What you might not realize as you start implementing that feature is that it's already been implemented 5 times and removed.

A comment at the end of a "go to definition" chain serves as a good dead-end indicator that can save lots of overhead and wasted effort.


Yes, in theory. In practice, you'd not only remove this function but all of its calls. When somebody down the road realizes that they want this function back, they have to (A) realize it's in the VC history and (B) not only get back the function but all the calling points. That is so much of a pain and potentially error-prone that leaving this function in for a while with a comment might be the more practical approach.


On the other hand, it may no longer be called from every place where it should be. I've found that it's better to document what was wrong with the overall approach, scrape the dead code from the source base, and move on. Barnacles like these accumulate over time otherwise.


No. Replacing that with a magic number wouldn't be good. Maybe it could be a constant instead, but this is most likely good practice.


Uh?


Unless you are running on a NUMA box, it does not look like this will affect you much.


Isn't NUMA how multi-core CPUs work still?

At least the AMD Bulldozer architecture diagram on the Wikipedia NUMA page shows this, both within a socket and between sockets:

https://en.wikipedia.org/wiki/Non-uniform_memory_access#/med...

Is the same true for Intel chips? (The Wikipedia page mentions the Intel E8870 memory controller, but not whether newer chips use similar set-ups)


> Isn't NUMA how multi-core CPUs work still?

if you have > 1 socket, then yes, otherwise no.

Edit: added verb


These days, yeah, that's definitely the case. A few generations back there were a couple of multi-socket-single-memory-domain designs.


> A few generations back there were a couple of multi-socket-single-memory-domain designs.

Core 2 Quad/Duo-based Xeons, P4s, Athlon MPs, etc.


I thought the Core2 chips were actually multiple single-core processors on a single slab, which would make that true of all Core2 machines (not just Xeons). I was pretty deep in Nehalem details when it first came out and was under the impression that microarchitecture was Intel's first true multi-core chip (as in, the cores were actually integrated on the die). Could be wrong, though, since I never delved into the Core2 architecture.


>I thought the Core2 chips were actually multiple single-core processors on a single slab, which would make that true of all Core2 machines (not just Xeons)

Correct.


To my understanding, multiple cores on a single package (CPU) are much simpler to schedule. There are performance penalties if you schedule work on a package that has to access memory that is not directly connected to it. A NUMA-aware scheduler should not make this mistake.
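One way an application can help is to keep its allocations on the node its thread is running on; roughly like this with libnuma (a sketch, assuming the thread isn't migrated afterwards):

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <sched.h>   /* sched_getcpu() */
  #include <numa.h>    /* libnuma; link with -lnuma */

  /* Allocate a buffer on the NUMA node the calling thread currently runs on,
     so accesses stay local as long as the thread stays put. */
  static void *alloc_local(size_t size)
  {
      int node = numa_node_of_cpu(sched_getcpu());
      return numa_alloc_onnode(size, node);  /* release with numa_free(p, size) */
  }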


Looks like you are right; Intel's QuickPath Interconnect is similar to what AMD was doing with HT (HyperTransport).


The computing world is quickly heading towards two different styles of target: (1) smartphones and tablets with a single memory domain that infrequently execute CPU-bound or memory-bound tasks, and (2) servers with a NUMA architecture that often execute CPU-bound or memory-bound tasks.


False. Cache misses are expensive. Unnecessary thread sleeps and context switches are expensive.


So, you mean, pretty much every server produced in the last ~8 years?

Not much impact to desktop users, but a huge impact for servers.



In case you want to get a quick(-ish) overview, here are the slides that try to explain it: http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_t...



I worked on a massively threaded signal processing application a few years ago. We manually set all of our thread affinities due to the issues raised in this article. We were also using the real-time kernel and round robin scheduler, though, so we had other considerations as well.


It is standard practice to pin when maximum throughput and/or minimal latency are desired. A scheduler provides convenience and automatic balancing in the face of shifting workloads. If you have a fixed-function system, it's just getting in the way.


Pretty much. Our pinning varied by task. Some were pinned to sockets, some to cores, and some to specific hardware threads. That's what I meant by "other considerations"; we would have still been pinning threads even if the bugs in the article didn't exist. The bugs just made figuring out the best pinning scheme more difficult.


Yes and no. Depending on the complexity level of the solution it may make sense to at least widen the threads' affinity masks. It often makes sense to offer some degree of freedom to the scheduler.
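For reference, pinning (or widening) on Linux usually goes through pthread_setaffinity_np; a minimal sketch, with the CPU numbers as placeholders since the CPU-to-socket/core mapping is machine-specific:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to a single hardware thread. */
  static int pin_to_cpu(int cpu)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }

  /* Or give the scheduler some freedom within a range of CPUs,
     e.g. one socket's worth of cores. */
  static int pin_to_cpu_range(int first, int last)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      for (int cpu = first; cpu <= last; cpu++)
          CPU_SET(cpu, &set);
      return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }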


Can you share more about the application you built? Genuinely curious about what kind of projects need this level of concurrency.


It was the signal processor for an airborne surveillance radar. I can't get into specific sizing, but our processing technique was to divide the field of view into a rectangular grid, with each grid square backed by a processing thread. As the beam positions came in (each beam also has its own processing thread) the data from the beams was mapped to the appropriate thread in the grid. Once the sweep was complete, all the threads in the grid would start running the mode algorithms (moving target indication (MTI) or synthetic aperture radar (SAR) imaging), sharing data with their logical neighbors and building a complete picture of the area.

It was basically grid computing, since we had multiple machines that had to be coordinated at certain points in the algorithm chain.


Hi guys, FYI: in most new servers (~2005 and later), there is a BIOS setting to make NUMA systems look like they have one NUMA zone. From memory, I think it is the "Interleaved" option on most vendors. I strongly recommend that most clients set it to NUMA mode (which tends to be the default on most newer servers).
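One way to confirm what the firmware is actually exposing is to dump the node distance table with libnuma (a sketch; a single node, or all-equal distances, suggests the interleaved/single-zone setting is in effect):

  #include <stdio.h>
  #include <numa.h>    /* libnuma; link with -lnuma */

  int main(void)
  {
      if (numa_available() < 0) {
          printf("NUMA not exposed by firmware/kernel\n");
          return 0;
      }
      int max = numa_max_node();
      printf("nodes: %d\n", max + 1);
      /* ACPI SLIT distances: 10 means local, larger values mean remote. */
      for (int i = 0; i <= max; i++) {
          for (int j = 0; j <= max; j++)
              printf("%4d", numa_distance(i, j));
          printf("\n");
      }
      return 0;
  }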


From what I understand this mostly affects servers and not desktop systems, because most desktops don't have NUMA. I've noticed that Linux servers typically run on par with (usually better than) their Windows counterparts when it comes to performance.

Could this be an indication that Windows could also be suffering from some similar bug?

Once this bug is fixed in Linux, it could supposedly improve performance by up to 30% in some cases. Linux is already performing on par with Windows (and this isn't really debatable), so does this mean that this patch will help Linux outperform Windows by much wider margins?


> Could this be an indication that Windows could also be suffering from some similar bug?

No. Linux and Windows vary greatly. If the only difference between the two OSes were in the scheduler, maybe then we could say Windows also suffers from a similar bug. But there are so many other working parts to a modern OS that Windows could be slower than Linux despite having a more efficient scheduler.


> and this isn't really debatable

Citation?


Well, a multi-core chip is not one CORE but rather an integrated distributed system of cores linked by buses that are probably built with logic very similar to network switches, without all the CAPEX poured into optimizing network equipment. So calling a multi-core arch one computer is a little lie.

Making an efficient multi-core async system is as tough as making an efficient networked distributed system, except it has fewer heterogeneous components but probably also less devoted R&D.

So at some point I wonder why we don't simply build extra cores as additional SoCs/ASICs, CPUs or GPUs that you can plug onto dedicated buses, chosen according to the required latency for sharing memory and exchanging with the central CPU/memory.

It is almost what current buses are, just a tinge twisted, with different rings, speeds, and latencies, because I don't think one bus can rule them all.

But, well, who am I to have an opinion when Google, Apple, Motorola, Intel, DEC, and IBM know better than I do how to design computers?

Just a nobody.


> Well, a multi-core chip is not one CORE but rather an integrated distributed system of cores linked by buses that are probably built with logic very similar to network switches, without all the CAPEX poured into optimizing network equipment.

Not really. Multi-core chips share processor execution resources and L2 cache between them. The on-die cores are directly tied to each other in a way that off-die components, including cores in another socket on the same board, are not. They are a single computer.

Now, you can carry the network abstraction down to that level. Problem is, as an EE I can carry the network abstraction down below the register level. A register bit is a network of transistors and resistors, after all. At what point do we call something a computer? I personally think any clustering of tightly-coupled, on-die computational blocks is a "computer", which means the multi-core chips certainly count as one computer with multiple execution pipelines.


Well, an address bus is still a bus. They share a bus, not the addresses. From the programming point of view it is the same; from the architecture PoV the complexity (number of gates and latency) is not the same.

A network stack is just an external, universal bus that introduces way more complexity.

But it is just a bus.



