
It's not down for me... but here's the text:

In recent weeks and months there has been quite a bit of work towards improving the responsiveness of the Linux desktop, with some very significant milestones building up recently and new patches continuing to come. This work is greatly improving the experience of the Linux desktop when the system is under heavy CPU load and memory pressure. Fortunately, the exciting improvements are far from over. There is a new patch that has not yet been merged but has undergone a few revisions over the past several weeks, and it is quite small -- just over 200 lines of code -- yet it does wonders for the Linux desktop.

The patch in question automatically creates task groups per TTY in an effort to improve desktop interactivity under system strain. Mike Galbraith wrote the patch, now in its third revision in recent weeks, after Linus Torvalds inspired the idea. In its third form, the patch adds just 224 lines of code to the kernel's scheduler while stripping away nine lines, so only 233 lines of code are in play.

Tests done by Mike show the maximum latency dropping by more than a factor of ten and the average desktop latency by about a factor of 60. Linus Torvalds has already heavily praised this miracle patch in an email.

Yeah. And I have to say that I'm (very happily) surprised by just how small that patch really ends up being, and how it's not intrusive or ugly either.

I'm also very happy with just what it does to interactive performance. Admittedly, my "testcase" is really trivial (reading email in a web-browser, scrolling around a bit, while doing a "make -j64" on the kernel at the same time), but it's a test-case that is very relevant for me. And it is a _huge_ improvement.

It's an improvement for things like smooth scrolling around, but what I found more interesting was how it seems to really make web pages load a lot faster. Maybe it shouldn't have been surprising, but I always associated that with network performance. But there's clearly enough of a CPU load when loading a new web page that if you have a load average of 50+ at the same time, you _will_ be starved for CPU in the loading process, and probably won't get all the http requests out quickly enough.

So I think this is firmly one of those "real improvement" patches. Good job. Group scheduling goes from "useful for some specific server loads" to "that's a killer feature".

Linus

Initially a Phoronix reader tipped us off to this latest patch this morning. "Please check this out, my desktop will never be the same again, it makes a lot of difference for desktop usage (all things smooth, scrolling etc.)...It feels as good as Con Kolivas's patches."

Not only is this patch producing great results for Linus, Andre Goddard (the Phoronix reader who reported the latest version), and other early testers, but we are finding this patch to be a miracle too. While in the midst of some major OpenBenchmarking.org "Iveland" development work, I took a few minutes to record two videos that demonstrate the benefits of the "sched: automated per tty task groups" patch alone. The results are very dramatic. UPDATE: There is also a lot more positive feedback on this patch pouring into our forums as more users try it out.

This patch has worked out extremely well on all of the test systems I have tried it on so far, from quad-core AMD Phenom systems to Intel Atom netbooks. The two videos were recorded on a system running Ubuntu 10.10 (x86_64) with an Intel Core i7 970 "Gulftown" processor, which boasts six physical cores plus Hyper-Threading to provide the Linux operating system with twelve total threads.

The Linux kernel was built from source using the Linus 2.6 Git tree as of 15 November, which is nearing a Linux 2.6.37-rc2 state. The only change made to the latest Linux kernel Git code was applying Mike Galbraith's scheduler patch. The patch allows the automated per-TTY task grouping to be toggled dynamically at runtime by writing either 0 or 1 to /proc/sys/kernel/sched_autogroup_enabled, or disabled entirely by passing "noautogroup" as a parameter when booting the kernel. Changing the sched_autogroup_enabled value was the only system difference between the two video recordings.
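
For those wanting to try it, the toggle amounts to a couple of shell commands; a minimal sketch, assuming the proc path and boot parameter named above:

    # check whether autogrouping is currently enabled (1 = on, 0 = off)
    cat /proc/sys/kernel/sched_autogroup_enabled

    # disable it at runtime (needs root), e.g. to compare behaviour under load
    echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled

    # re-enable it
    echo 1 | sudo tee /proc/sys/kernel/sched_autogroup_enabled

    # or disable it for good by adding "noautogroup" to the kernel boot command line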

Both videos show the Core i7 970 system running the GNOME desktop while playing back the 1080p Ogg version of the open Big Buck Bunny movie and running glxgears, two Mozilla Firefox browser windows open to the Phoronix and Phoronix Test Suite web-sites, two terminal windows, the GNOME System Monitor, and the Nautilus file manager. The videos simply show how these different applications respond under the load of compiling the latest Linux kernel with make -j64, i.e. 64 parallel make jobs completely utilizing the Intel processor.





Some good stuff in this thread. I found this post by Mike Galbraith (patch author) explaining why it's needed especially interesting:

http://marc.info/?l=linux-kernel&m=128991621119292&w...



OT: is "make -j64" overkill unless you have dozens of cores or am I missing something?


You're right - but that was the point. The patch was trying to fix problems with the process scheduler, and "-j64" is going to make lots of processes that want to do work and need scheduling.


thanks, but then the "that is my typical workload" thingy does not hold, as you rarely have >60 CPU-bound processes running at the same time. Well, flash player in chrome notwithstanding ;)


It probably approximates Linus's typical workload, which I imagine involves constant compiling and testing while compiling. He's probably still CPU bound.


make?


If you're the head of the world's largest computer OS project, the root of the maintainer tree as it were, I would make no assumption about what his typical CPU workload is like. :)


The number of jobs to run for an optimal compile time can be quite confusing. If none of the files you are going to compile are cached, it is fine to run a lot more jobs than usual, as many of them will be waiting on disk I/O. After that, twice as many jobs as you have cores is mostly appropriate.


According to Con Kolivas's benchmarks, with the BFS scheduler you just do make -j [numprocs] for best results. I can't recall if he was accounting for disk cache, though.


How many simultaneous threads will your next computer be able to run?

Chances are it already runs at least two, most probably four. It's not unreasonable to see 4 and 8-threads as the norm. Also keep in mind we are only considering x86s. SPARCs, IIRC, can do up to 64 on a single socket. ARM-based servers should follow a similar path.

BTW, a fully-configured Mac Pro does 12. A single-socket i7 machine can do 12. I never saw a dual-socket i7, but I have no reason to believe it's impossible.

Considering that, -j64 seems quite reasonable.


Dual-socket i7 is called Xeon.

There are dual-socket and even quad-socket 8-core hyperthreaded xeons (the Xeon L75xx series). A 1U Intel with 64 threads will set you back about $20k.

AMD has 12-core chips, so you can get 48 cores in 4 sockets there. (But I think they only have one thread per core)


So, -j64 seems quite reasonable, if you have $20K around... ;-)

Personally, I would spend a part of the money on 2048x2048 square LCD screens. They look really cool.


Gentoo recommends -jN+1 where N is the number of physical and virtual cores.
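
On Linux that rule of thumb works out to roughly the following (a sketch assuming coreutils' nproc, which reports the number of logical cores):

    # nproc prints the number of logical CPUs (physical cores x threads per core)
    make -j$(( $(nproc) + 1 ))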


Say I have a quadcore with hyperthreading, does this mean 4 + 8 + 1? Or is it either the physical or the logical cores (whichever is higher)?


How do you get 4 + 8? But anyway, it's logical cores, not physical ones.

The kernel can multi-task processes, but each process still gets exclusive use of the CPU when it runs. So if it doesn't need an adder, that adder sits idle.

With hyperthreading you can run two processes at once, and the CPU interleaves them at the instruction level, making fuller use of the components on the CPU.


make -j$((2*N + 1)) is roughly where minimal compile times are.


Where N is the number of physical cores? I do not use hyper-threading (it tends to be bad for the floating point and bandwidth limited operations that I do), but usually find minimal compile times at N+1 jobs (but with little penalty for several more).


What the optimal number of concurrent build jobs is depends on many factors, but the bottom line is that you want to maximize CPU utilization and minimize context switching.

If one extra concurrent job is enough to keep the CPU busy while other jobs are blocked on iowait, then you are fine.

So, bottom line, the factors to think about are:

- your I/O throughput for writing the generated object files;

- the complexity of the code being compiled (template-rich C++ has a much higher ratio of CPU usage to I/O);

- the number of cores in your system.


Out of curiosity, what types of applications are you running where HT hurts performance?


Sparse matrix kernels and finite element/volume integration. For bandwidth-limited operations, it is sometimes possible to get better performance by using fewer threads than physical cores because the bus is already saturated (for examples, see the STREAM benchmarks). For dense kernels, I'm usually shooting for around 70 percent of peak flop/s, and any performance shortcomings are from required horizontal vector operations, data dependence, and multiply-add imbalance. These are not things that HT helps with.

Additionally, HT affects benchmark reproducibility which is already bad enough on multicore x86 with NUMA, virtual memory, and funky networks. (Compare to Blue Gene which is also multicore, but uses no TLB (virtual addresses are offset-mapped to physical addresses), has almost independent memory bandwidth per core, and a better network.)


I have 4 cores with hyperthreading enabled (so 8 "threads"), and find that -j10 is the fastest.


> Where N is the number of physical cores?

Yes. Dunno about HT, never used a box with it.



