I tested the patches on 1- and 2-node NUMA systems, and saw 0% performance improvement for some simple workloads.
The paper is also uninformed about scheduler bugs in general. Yes, there have been lots of scheduler bugs (I've worked on more of them in other kernels, not Linux, where scheduling is even worse!). Yes, we already have tools for diagnosing them (e.g., examining run queue latency as a histogram). Yes, we've already used visualizations too.
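(For anyone curious what "run queue latency as a histogram" means in practice, here is a minimal, self-contained user-space sketch. It is far cruder than the kernel tracing tools the comment refers to: it just times how long a freshly woken thread takes to get back on CPU and buckets the deltas into power-of-two microsecond bins. All names and numbers are illustrative; compile with -lpthread.)

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define SAMPLES 10000
    #define BUCKETS 20                      /* power-of-two microsecond bins */

    static int pipefd[2];
    static uint64_t hist[BUCKETS];

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    static void *waiter(void *arg)
    {
        uint64_t sent;
        (void)arg;
        /* block in read(); the gap between the sender's timestamp and the
           moment we are running again is dominated by wakeup plus run queue
           latency (plus pipe overhead, which this sketch ignores) */
        while (read(pipefd[0], &sent, sizeof(sent)) == sizeof(sent)) {
            uint64_t us = (now_ns() - sent) / 1000;
            int b = 0;
            while (us > 1 && b < BUCKETS - 1) {
                us >>= 1;
                b++;
            }
            hist[b]++;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        if (pipe(pipefd) != 0)
            return 1;
        pthread_create(&t, NULL, waiter, NULL);

        for (int i = 0; i < SAMPLES; i++) {
            uint64_t ts = now_ns();
            if (write(pipefd[1], &ts, sizeof(ts)) != sizeof(ts))
                break;
            usleep(100);                    /* pace the wakeups */
        }
        close(pipefd[1]);                   /* EOF ends the waiter loop */
        pthread_join(t, NULL);

        for (int b = 0; b < BUCKETS; b++)
            if (hist[b])
                printf("<= %lluus: %llu\n",
                       (unsigned long long)(1ULL << (b + 1)),
                       (unsigned long long)hist[b]);
        return 0;
    }

The in-kernel tooling does the same kind of bucketing at the scheduler tracepoints, so it sees the latency directly rather than inferring it from a wakeup round trip.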
There's information for reproducing the bugs here: https://github.com/jplozi/wastedcores/issues/1 . If you think for a second that you've hit these bugs, you should read these steps.
static int should_we_balance(struct lb_env *env)
A comment at the end of a "go to definition" chain serves as a good dead-end indicator that can save lots of overhead and wasted effort.
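(Something like the following, which is purely illustrative and not the actual kernel source, is what such a dead-end comment at the definition could look like: a few lines that tell the reader whether they need to keep descending.)

    /*
     * should_we_balance() - decide whether *this* CPU should run the
     * load balancer for the given sched domain. Only one CPU per domain
     * needs to do the work; everyone else can bail out early. Nothing
     * below moves tasks directly, it only gates the balancing, so if
     * you are chasing task placement you can stop reading here.
     */
    static int should_we_balance(struct lb_env *env)
    {
            /* ... */
    }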
At least the AMD Bulldozer architecture diagram on the Wikipedia NUMA page shows this, both within a socket and between sockets.
Is the same true for Intel chips? (The Wikipedia page mentions the Intel E8870 memory controller, but not whether newer chips use similar set-ups)
If you have > 1 socket, then yes; otherwise no.
Core 2 Quad/Duo-based Xeons, P4s, Athlon MP, etc.
Not much impact for desktop users, but a huge impact for servers.
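(If you want to know which camp your own machine falls in, rather than counting sockets, a minimal check with libnuma, assuming it is installed, asks the kernel how many nodes it actually sees; link with -lnuma.)

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            printf("no NUMA support reported on this system\n");
            return 0;
        }
        /* nodes the kernel has configured; FSB-era multi-socket boxes
           like the ones above will typically report just one */
        int nodes = numa_num_configured_nodes();
        printf("%d NUMA node(s) configured%s\n", nodes,
               nodes > 1 ? " (NUMA effects apply)" : " (effectively UMA)");
        return 0;
    }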
It was basically grid computing, since we had multiple machines that had to be coordinated at certain points in the algorithm chain.
Could this be an indication that Windows is also suffering from a similar bug?
Once this bug is fixed in Linux, it could supposedly improve performance by up to 30% in some cases. Linux already performs on par with Windows (and this isn't really debatable), so does this mean these patches will help Linux out-perform Windows by much wider margins?
No. Linux and Windows differ greatly. If the only difference between the two OSes were the scheduler, then maybe we could say Windows also suffers from a similar bug. But there are so many other moving parts in a modern OS that Windows could be slower than Linux despite having a more efficient scheduler.
Making an efficient multi-core async system is as tough as making an efficient networked distributed system, except that it has fewer heterogeneous components but probably also less dedicated R&D.
So at some point I wonder why we don't simply build extra cores as additional SoCs/ASICs, CPUs, or GPUs that you can plug into dedicated buses, chosen according to the required latency for sharing memory and exchanging data with the central CPU/memory.
That is almost what current buses already are, just twisted a bit with different rings, speeds, and latencies, because I don't think one bus can rule them all.
But, well, who am I to have an opinion when Google, Apple, Motorola, Intel, DEC, and IBM know better than I do how to design computers?
Just a nobody.
Not really. Multi-core chips share processor execution resources and L2 cache between them. The on-die cores are directly tied to each other in a way that off-die components, including cores in another socket on the same board, are not. They are a single computer.
Now, you can carry the network abstraction down to that level. The problem is, as an EE, I can carry the network abstraction down below the register level. A register bit is a network of transistors and resistors, after all. At what point do we call something a computer? I personally think any clustering of tightly-coupled, on-die computational blocks is a "computer", which means the multi-core chips certainly count as one computer with multiple execution pipelines.
A network stack is just an external, universal bus, one that deals with and introduces way more complexity.
But it is just a bus.