In recent months there has been a good deal of work toward improving the responsiveness of the Linux desktop, with some significant milestones building up recently and new patches continuing to arrive. This work greatly improves the desktop experience when the system is under heavy CPU load and memory pressure. Fortunately, the exciting improvements are far from over. A new patch, not yet merged but already through a few revisions over the past several weeks, is quite small -- just over 200 lines of code -- yet does wonders for the Linux desktop.
The patch in question automatically creates task groups per TTY in an effort to improve desktop interactivity under system strain. Mike Galbraith wrote the patch, now in its third revision, after Linus Torvalds suggested the idea. In its current form it adds just 224 lines of code to the kernel's scheduler while removing nine, for a total of 233 changed lines.
Tests done by Mike show the maximum latency dropping by more than a factor of ten and the average desktop latency by roughly a factor of sixty. Linus Torvalds has already heavily praised this miracle patch in an email:
Yeah. And I have to say that I'm (very happily) surprised by just how small that patch really ends up being, and how it's not intrusive or ugly either.
I'm also very happy with just what it does to interactive performance. Admittedly, my "testcase" is really trivial (reading email in a web-browser, scrolling around a bit, while doing a "make -j64" on the kernel at the same time), but it's a test-case that is very relevant for me. And it is a _huge_ improvement.
It's an improvement for things like smooth scrolling around, but what I found more interesting was how it seems to really make web pages load a lot faster. Maybe it shouldn't have been surprising, but I always associated that with network performance. But there's clearly enough of a CPU load when loading a new web page that if you have a load average of 50+ at the same time, you _will_ be starved for CPU in the loading process, and probably won't get all the http requests out quickly enough.
So I think this is firmly one of those "real improvement" patches. Good job. Group scheduling goes from "useful for some specific server loads" to "that's a killer feature".
A Phoronix reader first tipped us off to this latest patch this morning: "Please check this out, my desktop will never be the same again, it makes a lot of difference for desktop usage (all things smooth, scrolling etc.)...It feels as good as Con Kolivas's patches."
Not only is this patch producing great results for Linus, Andre Goddard (the Phoronix reader who reported the latest version), and other early testers, but we are finding it to be a miracle too. While in the midst of some major OpenBenchmarking.org "Iveland" development work, I took a few minutes to record two videos that demonstrate the benefits of the "sched: automated per tty task groups" patch alone. The results are very dramatic. UPDATE: There is also now a lot more positive feedback on this patch pouring in within our forums as more users try it out.
This patch has worked out extremely well on all of the test systems I have tried so far, from quad-core AMD Phenom systems to Intel Atom netbooks. The two videos were recorded on a system running Ubuntu 10.10 (x86_64) with an Intel Core i7 970 "Gulftown" processor, which boasts six physical cores plus Hyper-Threading to provide the Linux operating system with twelve total threads.
The Linux kernel was built from source using Linus's 2.6 Git tree as of 15 November, which is nearing a Linux 2.6.37-rc2 state. The only change made to the latest Linux kernel Git code was applying Mike Galbraith's scheduler patch. The automated per-TTY task grouping can be toggled at runtime by writing 0 or 1 to /proc/sys/kernel/sched_autogroup_enabled, or disabled entirely by passing "noautogroup" as a kernel boot parameter. Changing the sched_autogroup_enabled value was the only system difference between the two video recordings.
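For those wanting to try it, the runtime toggle can be exercised from a shell. This is a minimal sketch assuming a kernel carrying the patch (the sysctl file does not exist otherwise):

```shell
# Check for the autogroup sysctl; it only exists on kernels
# carrying the "sched: automated per tty task groups" patch.
knob=/proc/sys/kernel/sched_autogroup_enabled
if [ -f "$knob" ]; then
    echo "autogroup enabled: $(cat "$knob")"   # 1 = on, 0 = off
    # Toggling requires root, for example:
    #   echo 0 | sudo tee "$knob"   # disable
    #   echo 1 | sudo tee "$knob"   # re-enable
else
    echo "autogroup not available on this kernel"
fi
```

Because the knob can be flipped live, it is easy to run an A/B comparison (as in the two videos) without rebooting.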
Both videos show the Core i7 970 system running the GNOME desktop while playing back the Ogg 1080p version of the open Big Buck Bunny movie, running glxgears, with two Mozilla Firefox browser windows open to the Phoronix and Phoronix Test Suite websites, two terminal windows open, the GNOME System Monitor, and the Nautilus file manager. The videos show how these different applications respond under the load of compiling the latest Linux kernel with make -j64, i.e. 64 parallel make jobs fully utilizing the Intel processor.
Chances are it already runs at least two, most probably four. It's not unreasonable to see 4- and 8-thread machines as the norm. Also keep in mind we are only considering x86; SPARC, IIRC, can do up to 64 threads on a single socket. ARM-based servers should follow a similar path.
BTW, a fully-configured Mac Pro does 12. A single-socket i7 machine can do 12. I have never seen a dual-socket i7, but I have no reason to believe it's impossible.
Considering that, -j64 seems quite reasonable.
There are dual-socket and even quad-socket 8-core Hyper-Threaded Xeons (the Xeon L75xx series). A 1U Intel box with 64 threads will set you back about $20k.
AMD has 12-core chips, so you can get 48 cores in 4 sockets there. (But I think they only have one thread per core)
Personally, I would spend a part of the money on 2048x2048 square LCD screens. They look really cool.
The kernel can multi-task processes, but each process still gets exclusive use of the CPU core while it runs. So if it doesn't need an adder, that adder sits idle.
With Hyper-Threading the CPU can run two processes at once, merging them at the instruction level to make maximum use of the execution units on the core.
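Whether SMT/Hyper-Threading is active on a given box can be read from sysfs. A small sketch, assuming a Linux system that exposes CPU topology (the path may vary on older kernels):

```shell
# Threads that share a physical core are listed together in
# thread_siblings_list; a single entry means no SMT/HT.
topo=/sys/devices/system/cpu/cpu0/topology/thread_siblings_list
if [ -f "$topo" ]; then
    echo "cpu0 siblings: $(cat "$topo")"
else
    echo "CPU topology not exposed"
fi
```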
If you think one extra concurrent job is enough to keep the CPU busy while other jobs block on iowait, then you are fine.
So, bottom line, factors to think about:
- your I/O throughput for writing the generated object files;
- the complexity of the code being compiled (template-rich C++ code has a much higher ratio of CPU usage to I/O);
- the number of cores in your system.
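Putting those factors together, a common rule of thumb is one make job per hardware thread plus a small surplus to hide I/O stalls. A sketch using coreutils' nproc (the "+2" surplus is an assumption, not a rule from the article):

```shell
# One job per online hardware thread, plus two extra so the CPU
# stays busy while other jobs are blocked on I/O.
threads=$(nproc)
jobs=$((threads + 2))
echo "suggested: make -j${jobs}"
```

Linus's make -j64 on a 12-thread machine deliberately oversubscribes far beyond this, which is exactly what makes it a good stress test for the scheduler.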
Additionally, HT affects benchmark reproducibility, which is already bad enough on multicore x86 with NUMA, virtual memory, and funky networks. (Compare to Blue Gene, which is also multicore but uses no TLB (virtual addresses are offset-mapped to physical addresses), has almost independent memory bandwidth per core, and a better network.)
Yes. Dunno about HT, never used a box with it.