Hacker News
Forcing the CPU affinity can make a monothreaded process run 2-3x faster (klaig.blogspot.com)
114 points by gbin on Dec 9, 2012 | 50 comments



It's not just the sleeping state that this improves; it also reduces cache misses at Level 2 (Core 2) and Level 3 (i7).

In my experience writing high performance VFX software, the Linux kernel's scheduler has been the best of all major OSs in terms of balancing threads since around 2.6.35.

OS X is the worst: it bounces threads all over the place, and on top of that, thread_policy_set() on OS X is only a hint, so OS X will often ignore affinity settings anyway.
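
For anyone curious, the Mach call in question looks roughly like this. Note that it expresses affinity as a "tag" that groups related threads rather than pinning to a specific core, and the kernel is free to ignore it; this is a minimal sketch, not production code:

    #include <mach/mach.h>
    #include <mach/thread_policy.h>
    #include <pthread.h>

    /* Hint that the calling thread belongs to affinity set `tag`.
       Threads sharing a tag are kept close (same L2/L3 domain);
       OS X may still ignore this entirely. */
    static void set_affinity_tag(int tag)
    {
        thread_affinity_policy_data_t policy = { tag };
        thread_port_t thread = pthread_mach_thread_np(pthread_self());
        thread_policy_set(thread, THREAD_AFFINITY_POLICY,
                          (thread_policy_t)&policy,
                          THREAD_AFFINITY_POLICY_COUNT);
    }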


Just tried Google-stalking a way to reach you but came up short. I'm in VFX R&D and wanted to ask you more about your performance optimization techniques. Would you mind sending your email to hn@ikrima.com?


Is there any way, then, to actually set the affinity on OS X?


If this is the case, then why don't CPU schedulers try to keep single threads running on the same CPU instead of switching them between cores?

I've seen this, for example, when encoding AAC audio. I have an 8-core system and the encoder is single-threaded, yet the Windows scheduler still spreads the process out over all 8 CPUs. Wouldn't it be better to stick to one core so the cache hit rate is higher?


Here is a typical scenario:

  - Your AAC encoder runs on CPU #1.

  - Your AAC encoder stalls on I/O.

  - The scheduler picks some other thread to run on CPU #1.

  - The I/O request completes.

  - A third thread, running on CPU #2, blocks.

  - Your AAC encoder is the first waiting thread.

Should the scheduler:

  - run your AAC encoder on CPU #2?

  - run your AAC encoder on CPU #1,
    and move the program happily running there to #2?

  - wait until CPU #1 becomes available before running your AAC encoder again?

Keep in mind that 'the program happily running there' could very well be my AAC encoder.

Variations include the case where, by the time your I/O completes, CPUs #1 and #3 are available, but #1 is asleep. Should the scheduler wake it so that your thread can stay on its CPU?

Your AAC encoder may be the most important process alive for you, but how is the scheduler to know that?

(Running processes you deem important at a more favorable nice level might help, but I do not know enough about current schedulers to say for sure.)


> Keep in mind that 'the program happily running there' could very well be my AAC encoder.

This is rare on Windows client machines. Does anyone know whether the desktop Windows scheduler uses different rules from the server OS to exploit this?


Well, it also could be your MP3 player, the game you are playing, or whatever.

Server OSes typically use different schedulers or scheduler settings than desktop ones; Windows is no different. This starts with using larger time quanta. http://download.microsoft.com/download/1/4/0/14045A9E-C978-4...:

"On client versions of Windows, threads run by default for 2 clock intervals; on server systems, by default, a thread runs for 12 clock intervals"

The Windows scheduler is also aware of the GUI and raises the priority of threads handling the user interface. Some of the things the kernel and/or user-mode code do:

"Threads that own windows receive an additional boost of 2 when they wake up because of windowing activity such as the arrival of window messages . The windowing system (Win32k .sys) applies this boost when it calls KeSetEvent to set an event used to wake up a GUI thread ."

"Client versions of Windows also include another pseudo-boosting mechanism that occurs during multimedia playback . Unlike the other priority boosts, which are applied directly by kernel code, multimedia playback boosts are actually managed by a user-mode service called the MultiMedia Class Scheduler Service (MMCSS), but they are not really boosts—the service merely sets new base priorities for the threads as needed"

(Much) more info in the above-mentioned PDF.


There are too many workloads to consider, and no scheduling policy balances them all well enough to make everyone happy. In Linux, at least, there have already been quite a few major changes and tons of little tweaks; each tweak can improve one benchmark and cause another, untested benchmark to regress.


Sounds like a good reason to delegate scheduling to user space. If one size doesn't fit all, then let user space programs choose what works best for them.


Not scheduling. Scheduling is the kernel's task.

But scheduling "hinting", sure, like saying "OK, keep this thread on one processor only".

Still, some of these settings could cause a kind of denial of service on the system if misused (like the 'nice' command), so they're usually restricted.


Scheduling my processes could certainly be done by me (my user land). It certainly shouldn't let me schedule your processes, but it could be hierarchical, with first- or second-level scheduling in the kernel and the exact process to run determined by local user space.


Scheduling can absolutely happen outside the kernel -- think outside the box man. :) http://scholar.google.com/scholar?cluster=840196971309363049...


I've seen and used user-level thread solutions (aka green threads, aka coroutines) to do just that. It works, but it's a lot of work to get right even for a very specific use case.

In the past, power management on Linux was done by a user-space program; I'm not sure why that was changed.


User-space programs can already express their (fairly vague) preferences, using the taskset (processor affinity), nice, and ionice commands.
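
For reference, the programmatic equivalent of taskset is sched_setaffinity(2). A minimal sketch of a process pinning itself to a single CPU (CPU 0 is just an arbitrary choice here):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* allow only CPU 0 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... CPU-bound work runs here, staying on CPU 0 ... */
        return 0;
    }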

But the main problem is that workloads are very complex, and often changing. A single userspace program is not even aware of other programs running on the machine, nor how the user thinks they should be prioritised.

If you are compiling something, do you want it to finish ASAP, or a bit slower in the background without desktop sluggishness? What priority should a minimised browser window have at the same time? What if that browser window is also playing music from YouTube?

At the moment, about the only scheduling hint users can give is to mark some processes as low-priority background tasks. Optimal scheduler tuning is too complex, because starting or quitting any application can completely change the optimal resource usage, and users only have a vague idea of what they expect from the scheduler ("everything should be fast").


This is good for some applications, but in the general case it doesn't solve the problem, since user processes can choose policies that interfere with each other and lead to lower overall performance. This is a fascinating topic where computer architecture meets algorithms meets game theory.


Some userspace programs let you do that.


>If this is the case, then why don't CPU schedulers try to keep single threads running on the same CPU instead of switching them between cores?

They do try, at least the one on Linux does. But the OS can't know what your intentions are when you fire up a thread.


> If this is the case, then why don't CPU schedulers try to keep single threads running on the same CPU instead of switching them between cores?

The scheduler in the kernel does do this where possible. To maximize the chance of a process waking up to warm caches, the scheduler always attempts to wake the process on the core that was most recently running it. If that core is not available, it tries another core on the same package before trying other packages (if there are multiple CPU packages).

However, the kernel has other real-world priorities besides keeping a single process's throughput as high as possible. It must also be fair to the other processes while keeping power consumption low and as many CPU cores powered down as possible.

Because the kernel has to work with workloads ranging from a low-end not-so-smart phone on a battery to a supercomputer with its own nuclear power plant, there are compromises to be made. There are also a lot of compile-time and runtime configuration options you can use to tweak the kernel for your particular workload.


I thought scheduling was one place where the kernel had branches for architecture types.


It does, but the parent is talking about workload types, not fixed CPU architectures.


AFAIK the Linux scheduler tries to be completely fair regarding every execution unit.


This is the Completely Fair Scheduler (CFS) that was introduced in the 2.6 Kernel. It basically does round-robin scheduling with variable time quanta based on niceness/priority. Here is a good overview of the RBTree implementation it uses:

http://www.kernel.org/doc/Documentation/scheduler/sched-desi...

I don't know how this deals with distributing tasks across multiple processors however; the basic idea was to improve on the fairly naive runqueue implementation in 2.4 and prior.
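
As a rough illustration of the idea (a toy model, not kernel code): every runnable task accumulates "virtual runtime" in inverse proportion to its nice-derived weight, and CFS always runs the task with the smallest vruntime, which in the real implementation is the leftmost node of that red-black tree:

    #include <stdio.h>

    /* Toy model of the CFS idea: vruntime grows more slowly for
       heavier (more favourably niced) tasks, and the scheduler
       always picks the smallest vruntime. */
    struct toy_task {
        const char        *name;
        unsigned long      weight;    /* 1024 = nice 0 */
        unsigned long long vruntime;
    };

    #define NICE_0_WEIGHT 1024UL

    static void account(struct toy_task *t, unsigned long long ran_ns)
    {
        t->vruntime += ran_ns * NICE_0_WEIGHT / t->weight;
    }

    static struct toy_task *pick_next(struct toy_task *tasks, int n)
    {
        struct toy_task *best = &tasks[0];
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < best->vruntime)
                best = &tasks[i];
        return best;   /* the leftmost rbtree node in the real thing */
    }

    int main(void)
    {
        struct toy_task tasks[] = {
            { "encoder", 1024, 0 },   /* nice 0  */
            { "indexer",  820, 0 },   /* ~nice 1 */
        };
        for (int step = 0; step < 4; step++) {
            struct toy_task *t = pick_next(tasks, 2);
            printf("run %s\n", t->name);
            account(t, 6000000ULL);   /* pretend it ran 6 ms */
        }
        return 0;
    }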


The kernel will try to keep each process running on the same CPU. If one process is using a lot of CPU, it won't (shouldn't) generally switch it around just to keep the CPUs evenly busy.


Why is this downvoted? Is it false?


So in the end, when you read the comments, it didn't really have anything to do with being monothreaded or with CPU affinity. It was just that he was using the power-saving CPU governor, which scaled down the CPU frequency on the cores.


Still... the CPU affinity "plays" the ondemand governor and avoids using the more power-hungry performance governor, so it does have something to do with it. I have edited the post to mention tweaking the ondemand governor, but that can affect other workloads on your system.


If you have multiple (modern) physical processors, it is also good to set up appropriate interrupt affinity for your NIC and to selectively turn off NUMA interleaving where appropriate.


Do you have any more details or links to articles about this that I could read?


I don't know of specific articles, but I suspect you'd find some on Intel's website.

In short, in current CPU generations (x86_64 is what I work with), the PCIe controller is on the CPU die, so if you have multiple CPUs you have multiple PCIe controllers, each controlling different PCIe slots. You need to consult your motherboard documentation to know which slots go with which CPU.

If an interrupt comes from a PCIe card, it fires on the CPU the card is attached to. If the process that needs to handle the data is on the other CPU, the data and cache lines have to be transferred over to that second CPU. It's a small effect, but if you really care about performance and want to squeeze every nanosecond of latency out of your work, you should care about this.

You should start by considering which slot to put each card in, and if you have multiple PCIe cards you really want to balance them across the CPUs.
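
If you want to script the IRQ side of this, it's just a CPU bitmask written to /proc/irq/<N>/smp_affinity. A sketch, where the IRQ number (42) and the mask are placeholders for whatever /proc/interrupts shows for your NIC:

    #include <stdio.h>

    /* Sketch: steer a hypothetical IRQ onto CPUs 0-3 by writing a hex
       bitmask to /proc/irq/<N>/smp_affinity. Needs root, and a running
       irqbalance daemon may overwrite the setting. */
    int main(void)
    {
        const int irq = 42;            /* placeholder: check /proc/interrupts */
        const unsigned mask = 0x0f;    /* CPUs 0-3 */
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        fprintf(f, "%x\n", mask);
        fclose(f);
        return 0;
    }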


Not just latency: with 10-gigabit Ethernet doing UDP/multicast it can have a pretty huge effect on throughput (with TCP on a card that supports offloading, not quite so much).


The throughput would be affected by the latency increase. I believe the QPI link between the processors has enough bandwidth, so the extra latency of transferring the data between the processors and the reduced performance due to cache switch-overs would be the source of the reduced throughput.


The only relevant piece I could find quickly was the Mellanox performance tuning guide: http://www.mellanox.com/related-docs/prod_software/Performan...

It's also useful for any other PCIe card, sans the IB specific parts.


I think going forward, you're going to need to handle scheduling decisions and power management as a coupled problem.


Indeed, it looks like a very hairy problem.


In the game dev industry we set specific affinities for many (sometimes all) threads. Also, console OSes (e.g. the PS3's) typically leave a thread on a single CPU in any case.
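
On Linux the per-thread equivalent is pthread_setaffinity_np(); roughly something like this (the core number is an arbitrary example):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        /* ... audio / render / streaming work ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   /* pin this worker to core 2 (arbitrary) */
        if (pthread_setaffinity_np(tid, sizeof(set), &set) != 0)
            fprintf(stderr, "pthread_setaffinity_np failed\n");

        pthread_join(tid, NULL);
        return 0;
    }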


If a monothreaded eix-update was run immediately after a polythreaded eix, there's a high chance that a lot of the data from /usr/portage was in the disk cache. I'd like to see the results of running the test back-to-back multiple times (eix, cached eix, monothreaded eix, cached monothreaded eix), which might eliminate the caching delta.


Actually no, I repeated it several times, read-only, after a cold boot on the two sysreccd kernels (standard and altkernel). The limiting factor is the CPU. Note: I have quite a fast SSD at around 500 MB/s, so the results may differ on other machines.


Using "ondemand" rather than "performance" can do quite a bit for your interactive performance as well. With "ondemand" your CPU runs at half speed until you've stressed it for a bit. So if say, switching to another desktop takes 0.1 cpu seconds, but the threshold is that the CPU must spend 0.5 seconds working before it kicks into full performance then all those UI actions might only execute at half speed.

So personally, I have my desktop CPU (i7) running with the performance setting. The difference was clearly noticeable for short bursts of activity. It's possible you could do that for just some of the cores (I don't know if that makes sense, since they're in a single package).
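
For anyone wanting to try it, the governor is set per core through sysfs. A sketch assuming the standard cpufreq paths (needs root; "performance" trades power for latency):

    #include <stdio.h>

    /* Sketch: write a governor name into the standard cpufreq sysfs
       file for one core. Loop over cpu0..cpuN-1 to cover them all. */
    static int set_governor(int cpu, const char *governor)
    {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor",
                 cpu);

        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", governor);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return set_governor(0, "performance") ? 1 : 0;   /* core 0 as an example */
    }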


It is extremely unlikely that interactive performance has any human-detectable relation to the CPU switching from 1.5 GHz to 3 GHz or whatever.

The I/O and RAM paging situation dominates by far.


In my case the interactive performance hit from "ondemand" is highly noticeable when gaming or when using some apps like Firefox. It seems to be a known problem [1] with GPU-intensive tasks.

[1] http://jasondclinton.livejournal.com/72910.html


The problem is the frequency switch not occurring, or occurring too late: with the OS shuffling a single-threaded app across execution units, each one's CPU usage looks like, say, 20%, so the governor sees no need to move to a higher clock.


There's an insightful comment by GreenCat in the SetProcessAffinityMask documentation about how doing this in Windows is likely to hurt your performance: http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...

In fact, this seems like a Prisoner's Dilemma problem with the OS having programs cooperate by default. There's a chance you'll speed up your application by locking it to a core (defecting), but only if none of the other applications try the same thing.
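
For reference, the call being discussed looks roughly like this; the mask here is an arbitrary example, and as the linked comment argues, hard-coding something like this in shipped software can easily backfire:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Bit 0 set => only logical processor 0 may run this process.
           Arbitrary example mask; tune per machine, not per product. */
        DWORD_PTR mask = 0x1;
        if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
            fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                    (unsigned long)GetLastError());
            return 1;
        }
        /* ... CPU-bound work ... */
        return 0;
    }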


Right. Setting the process affinity mask is a fine thing to do - if you're making these decisions for a specific piece of hardware with a specific set of programs running on it. It's madness for the OS or the person writing the software to try to do it for you, though.


Verify these results if you're planning to do this on a laptop. For longer-running processes it's possible to end up with worse performance due to throttling/thermal-management techniques.


Sigh.

SGI had this in Ultrix in 1991-1992ish (maybe earlier).

Why? Their market was graphics. If anything non-essential prevented the system from rendering a frame, an attempt was made to run that on another processor.

I remember a four-processor RealityEngine^2 machine: one processor was dedicated to graphics, another to networking. I don't recall what the other two were for.


SGI = IRIX, DEC = Ultrix

What you're thinking of is different from task/processor/thread affinity.

Regarding SGI, I'm guessing you're thinking of the Onyx, which would put it at about 1993 or later. The R4x00s in that machine were not powerful enough to drive the RE2 and IR boards on a shared basis. The schedulers on many of these machines (Ultrix and IRIX both, along with Unicos, etc.) all pretty much sucked at the time, with AIX being a notable exception in some environments, especially running under VM.

You still see dedicated task processors today with the z series, and maybe the i series as well.


Try this with perf stat pinned to CPU 0, pinned to a non-zero CPU, and without taskset at all.

If you have two-threaded code, it's also fun to use taskset to put both threads on one hyperthreaded core.


Interesting find. I wonder if this would apply to Java programs as well. Better test it with MapDB.


Yes, it does. Read this: http://mailinator.blogspot.com.es/2010/02/how-i-sped-up-my-s... which is a deeper take on the subject.


These are standard "tricks" in the high-frequency / low-latency trading space.



