
I've worked on scheduling bugs in other kernels before (Linux is not an outlier here). The key metric we keep an eye on is run queue latency, to detect when threads are waiting longer than one would expect. There are many ways to measure it; my most recent is runqlat from the bcc/BPF tools, which shows it as a histogram, e.g.:

   # ./runqlat 
   Tracing run queue latency... Hit Ctrl-C to end.
   ^C
        usecs               : count     distribution
            0 -> 1          : 233      |***********                             |
            2 -> 3          : 742      |************************************    |
            4 -> 7          : 203      |**********                              |
            8 -> 15         : 173      |********                                |
           16 -> 31         : 24       |*                                       |
           32 -> 63         : 0        |                                        |
           64 -> 127        : 30       |*                                       |
          128 -> 255        : 6        |                                        |
          256 -> 511        : 3        |                                        |
          512 -> 1023       : 5        |                                        |
         1024 -> 2047       : 27       |*                                       |
         2048 -> 4095       : 30       |*                                       |
         4096 -> 8191       : 20       |                                        |
         8192 -> 16383      : 29       |*                                       |
        16384 -> 32767      : 809      |****************************************|
        32768 -> 65535      : 64       |***                                     |
I'll also use metrics that sum it by thread to estimate the potential speedup (which helps quantify the issue), and do sanity tests.
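One generic way to get a per-thread view of the same thing (just a sketch, not necessarily the tooling referred to above) is perf sched, which records scheduler events and then summarizes wait time per task:

    # record scheduler events for 10 seconds, then print per-task
    # runtime, context switches, and average/maximum scheduling delay
    perf sched record -- sleep 10
    perf sched latency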

Note that this isolates one issue -- wait time in the scheduler -- whereas NUMA-aware scheduling also affects memory placement, so applications can run slower due to the longer-latency memory I/O of accessing remote memory. I like to measure and isolate that separately (with PMCs).
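For example (a sketch, assuming perf's generic node-* events are available on your CPU; check perf list):

    # count memory loads that hit vs. miss the local NUMA node
    # (roughly: local vs. remote accesses) for a running process
    perf stat -e node-loads,node-load-misses -p <PID> -- sleep 10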

So I haven't generally seen such severe scheduling issues on our 1- or 2-node Linux systems, although they are testing on 8-node systems, which may exacerbate the issue. Whatever the bugs are, though, I'll be happy to see them fixed, and it may help encourage people to upgrade to newer Linux kernels (which come with other benefits, like BPF).




I assume BPF here is the Berkeley Packet Filter, or maybe eBPF (extended Berkeley Packet Filter) in this case -- just to save anyone else having to look it up. It looks like this is the link to the tools.

https://github.com/iovisor/bcc

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter


Yup. It's the same BPF (well, except for the "extended" bit) that tools like tcpdump and Wireshark use for packet capture: it's a bytecode for handing simple, guaranteed-to-terminate programs to the kernel and having the kernel run them instead of waking up userspace all the time. It was originally created for packet capture, so the kernel could just hand you packets on port 80 (or whatever) instead of dumping all traffic at you and letting you throw away most of it. But it turns out it's also useful for system tracing: if you strace a program, the kernel notifies strace on every syscall, and `strace -e` then throws away most of that in userspace. So there's now a way to attach BPF filters to processes, events, etc., so that a userspace tracer is only woken up when something interesting happens, which reduces overhead significantly.
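For the packet-capture case you can see that bytecode directly: tcpdump's -d option dumps the compiled filter program instead of capturing, e.g.:

    # print the BPF instructions generated for a filter expression
    tcpdump -d 'tcp port 80'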


Has anyone tested this? I'm not expecting a lot, since most of our systems are 1 or 2 nodes (run numastat to see how many nodes you have), and neither 8-node nor hierarchical. Anyway, it doesn't apply cleanly to 4.1 (so I'm assuming this is for an -rc).

   arch/x86/kernel/smpboot.c: In function ‘set_cpu_sibling_map’:
   arch/x86/kernel/smpboot.c:452:16: error: ‘sched_max_numa_distance’ undeclared (first use in this function)
                && sched_max_numa_distance == -1)
                ^
   arch/x86/kernel/smpboot.c:452:16: note: each undeclared identifier is reported only once for each function it appears in
   make[2]: *** [arch/x86/kernel/smpboot.o] Error 1
Really wish they'd post this to lkml, where the engineers who wrote the scheduler, and the engineers who regularly performance test Linux, can reply.


I've tested it on some 2- and 1-node systems; some quick results here: https://gist.github.com/brendangregg/588b1d29bcb952141d50ccc... . In summary, no significant difference observed.

This should be posted to lkml, where many others can test it. If there are wins to be had on larger node systems, they'll be identified and this will be fixed.


A lot of their bugs are induced when you have one cgroup with many multithreaded/multi-process tasks and another cgroup with just a few tasks using processor time.

Just testing one benchmark will not show it, unless you have something else running too.


Ok, it depends on what you mean by something else running. Another PID? That's why I tested make -j32. I also tested multi-threaded applications from a single PID (with more threads than our CPU count), since that best reflects our application workloads.

They ought to be posting it to lkml, where many engineers regularly do performance testing. I've looked enough to think that my company isn't really hurt by this.


Just read the paper; it's explained there. Or read the presentation, which uses pictures.

Basically they run R, a single-threaded statistics tool set up to hog a core, and in some other cgroup a wildly multithreaded tool. If you have a NUMA system (check with `lstopo`), then it's possible that the scheduler thinks the many tasks in one domain of cores are balanced with just R on one core of another domain, meaning you can have several (e.g. 7 out of 8) cores idle. It has to do with the way hierarchical rebalancing is coded, and with the fact that their 8x 8-core AMD machine has a deep hierarchy.
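A rough sketch of that kind of setup (assuming cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, and a shell busy loop standing in for R):

    # one cgroup for a single-threaded hog, one for a parallel build
    sudo mkdir -p /sys/fs/cgroup/cpu/hog /sys/fs/cgroup/cpu/parallel

    # single-threaded CPU hog, pinned to core 0 as one way to make it hog a core
    taskset -c 0 sh -c 'while :; do :; done' &
    echo $! | sudo tee /sys/fs/cgroup/cpu/hog/tasks

    # move the current shell into the other group so the build's
    # processes inherit it, then kick off a heavily threaded load
    echo $$ | sudo tee /sys/fs/cgroup/cpu/parallel/tasks
    make -j64 &

    # watch per-core utilization; the symptom is cores sitting idle
    # while the build is starved
    mpstat -P ALL 1

The exact numbers don't matter; the point is a long-running single thread in one group versus many runnable threads in the other.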


I manage the NUMA bit (where it counts) with something like this:

    cat ${node_file} | xargs -I {} -P ${NPROCS} -n 1 /usr/bin/numactl -N ${node_num} -l ${script_file} {} $*
I know roughly how many parallel procs I can run on a single node, and I've got something else that scrapes numactl -H for the node count.
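For example, the node count can be pulled out of numactl -H (whose first line looks like "available: 2 nodes (0-1)") with something like:

    # print just the NUMA node count
    numactl -H | awk '/^available:/ {print $2}'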


It looks like the "Group Imbalance" and "Overload on Wakeup" bugs could be noticeable on a 2-node system under the right conditions. So could the "Missing Scheduling Domains" bug, but only if you are offlining / onlining CPUs.
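For reference, offlining and onlining a CPU is just a sysfs toggle, something like:

    # take core 1 offline and bring it back, which forces the
    # scheduling domains to be rebuilt (where that bug is said to live)
    echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online
    echo 1 | sudo tee /sys/devices/system/cpu/cpu1/online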

None of them should affect a 1-node system, and the "Scheduling Group Construction" bug requires a multi-level node hierarchy.


Right; almost all our systems are 1-node. And I'm already debugging some NUMA stuff for our 2-node systems.



