
Forcing the CPU affinity can make a monothreaded process run 2-3x faster - gbin
http://klaig.blogspot.com/2012/12/forcing-cpu-affinity-can-make.html
======
berkut
It's not just the sleeping state this improves; it's also the reduction in
cache misses in the L2 (Core 2) and L3 (i7) caches.

In my experience writing high performance VFX software, the Linux kernel's
scheduler has been the best of all major OSs in terms of balancing threads
since around 2.6.35.

OS X is the worst: it bounces threads all over the place, and on top of that,
thread_policy_set() on OS X is only a hint, so OS X will often ignore affinity
settings anyway.

~~~
ikrima
Just tried Google-stalking a way to reach you but came up short. I'm in VFX
R&D and wanted to ask you more about your performance optimization techniques.
Would you mind sending your email to hn@ikrima.com?

------
Osiris
If this is the case, then why don't CPU schedulers try to keep single threads
running on the same CPU instead of context switching?

I've seen this, for example, when encoding AAC audio. I have an 8-core system
and the encoder is single-threaded, but the Windows scheduler still spreads
the process out over all 8 CPUs. Wouldn't it be better to stick to one core so
the cache hit rate stays higher?
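On Linux a process can do exactly that for itself; a minimal sketch using Python's `os.sched_setaffinity` (the same mechanism `taskset` uses), assuming a Linux kernel:

```python
import os

# Pin the current process (pid 0 = ourselves) to a single core so its
# working set stays warm in that core's caches. Pick the lowest CPU we
# are allowed to run on instead of hard-coding core 0.
allowed = os.sched_getaffinity(0)
target = min(allowed)
os.sched_setaffinity(0, {target})

print(os.sched_getaffinity(0))  # now a single-element set
```

From the shell, `taskset -c 0 <encoder>` achieves the same thing without touching the encoder's source.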

~~~
ybaumes
AFAIK the Linux scheduler tries to be completely fair to every execution
unit.

~~~
flatline
This is the Completely Fair Scheduler (CFS) that was introduced in the 2.6
Kernel. It basically does round-robin scheduling with variable time quanta
based on niceness/priority. Here is a good overview of the RBTree
implementation it uses:

[http://www.kernel.org/doc/Documentation/scheduler/sched-
desi...](http://www.kernel.org/doc/Documentation/scheduler/sched-design-
CFS.txt)

I don't know how this deals with distributing tasks across multiple processors
however; the basic idea was to improve on the fairly naive runqueue
implementation in 2.4 and prior.

------
buster
So in the end it didn't have anything to do with being monothreaded or with
CPU affinity, if you read the comments. It was just that he was using the
powersaving CPU governor, which scaled down the CPU frequency on the cores.

~~~
gbin
Still... the CPU affinity setting "plays" the ondemand governor and avoids the
more power-hungry performance governor, so it does have something to do with
it. I have edited the post to cover tweaking the ondemand governor instead,
but that can affect other workloads on your system.
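For reference, the ondemand tunables live in sysfs; a hedged sketch of reading and lowering `up_threshold` (the exact path and default vary by kernel version, writing requires root, and the `base` parameter exists purely so the example is self-contained):

```python
from pathlib import Path

# Global ondemand tunables on many kernels; some kernels expose them
# per-policy instead (/sys/devices/system/cpu/cpufreq/policy*/ondemand).
ONDEMAND_DIR = Path("/sys/devices/system/cpu/cpufreq/ondemand")

def get_up_threshold(base=ONDEMAND_DIR):
    """Read the load percentage at which ondemand jumps to full speed
    (the kernel default is roughly 80-95 depending on version)."""
    return int((Path(base) / "up_threshold").read_text())

def set_up_threshold(percent, base=ONDEMAND_DIR):
    """Lower this to make the governor ramp the clock up sooner."""
    (Path(base) / "up_threshold").write_text(f"{percent}\n")
```

On a real machine the same tweak is just `echo 60 | sudo tee /sys/devices/system/cpu/cpufreq/ondemand/up_threshold`.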

------
cma
If you have multiple (modern) physical processors, it is also good to set up
appropriate interrupt affinity for your NIC and to selectively turn off NUMA
interleaving where appropriate.

~~~
Negitivefrags
Do you have any more details or links to articles about this that I could
read?

~~~
baruch
I don't know of specific articles, but I suspect you'd find some on Intel's
website.

In short, in current CPU generations (x86_64 is what I work with) the PCIe
controller is on the CPU die, so if you have multiple CPUs you have multiple
PCIe controllers, each controlling different PCIe slots. You need to consult
your motherboard manual to know which slots go with which CPU.

If an interrupt comes from a PCIe card, it fires on the CPU the card is
attached to. If the process that needs to handle the data is on the other CPU,
the data and the cache lines have to be transferred over to that second CPU.
It's a small effect, but if you _really_ care about performance and want to
squeeze every nanosecond of latency out of your work, you should care about
this.

You should start by considering which slot to put each card in, and if you
have multiple PCIe cards you really want to balance them across the CPUs.
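Steering an interrupt to the right CPUs works by writing a bitmask to procfs; a sketch, assuming Linux (the IRQ number comes from /proc/interrupts, root is required, irqbalance must be stopped or it will undo the change, and the `proc_irq` parameter exists purely for illustration):

```python
from pathlib import Path

def pin_irq(irq, cpus, proc_irq=Path("/proc/irq")):
    """Restrict interrupt `irq` to the given set of CPUs by writing a
    hex CPU bitmask to /proc/irq/<irq>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    (Path(proc_irq) / str(irq) / "smp_affinity").write_text(f"{mask:x}\n")
```

For example, `pin_irq(42, {0, 1})` would write the mask `3`, keeping IRQ 42 on the first two CPUs.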

~~~
cma
Not just latency: with 10-gigabit Ethernet doing UDP/multicast it can have a
pretty huge effect on throughput (with TCP and a card that supports
offloading, not quite so much).

~~~
baruch
The throughput would be affected by the latency increase. I believe the QPI
link between the processors has enough bandwidth so the extra latency to
transfer the data between the processors and the reduced performance due to
cache switch-overs would be the source of the reduced throughput.

------
rayiner
I think going forward, you're going to need to handle scheduling decisions and
power management as a coupled problem.

~~~
gbin
Indeed, it looks like a very hairy problem.

------
TwoBit
In the game dev industry we set specific thread affinities on many (sometimes
all) threads. Also, gaming OSs (e.g. the PS3's) typically leave a thread on a
single CPU in any case.
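On Linux, affinity is per-thread, so each thread can pin itself; a minimal sketch assuming Python 3.8+ on a Linux kernel (`sched_setaffinity` accepts a thread id there):

```python
import os
import threading

def worker(results):
    # sched_setaffinity() takes a thread id on Linux, so a thread can
    # pin itself without affecting its sibling threads.
    tid = threading.get_native_id()
    cpu = min(os.sched_getaffinity(0))  # lowest CPU we're allowed to use
    os.sched_setaffinity(tid, {cpu})
    results.append(os.sched_getaffinity(tid))

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()
print(results[0])  # single-element set: that thread is now pinned
```

In C the equivalent call is `pthread_setaffinity_np()`, which is what game engines typically use.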

------
DHowett
If a monothreaded eix-update was run immediately after a polythreaded eix,
there's a high chance that a lot of the data from /usr/portage was in the disk
cache. I'd like to see the results of running the test back-to-back multiple
times (eix, cached eix, monothreaded eix, cached monothreaded eix), which
might eliminate the caching delta.

~~~
gbin
Actually no, I repeated it several times, read-only, after a cold boot on both
sysreccd kernels (standard and altkernel). The workload is CPU-bound. Note: I
have a quite fast SSD at around 500 MB/s, so the results may differ on other
machines.

------
Erwin
Using "performance" rather than "ondemand" can do quite a bit for your
interactive performance as well. With "ondemand" your CPU runs at half speed
until you've stressed it for a bit. So if, say, switching to another desktop
takes 0.1 CPU seconds, but the threshold is that the CPU must spend 0.5
seconds working before it kicks into full speed, then all those UI actions
might only execute at half speed.

So personally, I have my desktop CPU (i7) running at performance setting. That
was clearly noticeable for me for short bursts of activity. It's possible you
could settle on doing that for just some of the cores (I don't know if that
makes sense, since they're in a single package).
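Per-core governor switching is possible through sysfs; a hedged sketch (root is required on a real system, and the `sysfs` parameter exists purely so the example is self-contained):

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")

def set_governor(cpu, governor, sysfs=SYSFS_CPU):
    """Set the cpufreq governor ("performance", "ondemand", ...) for one
    core via /sys/devices/system/cpu/cpu<N>/cpufreq/scaling_governor."""
    path = Path(sysfs) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    path.write_text(governor + "\n")
```

The shell equivalent: `echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`.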

~~~
Evbn
It is extremely unlikely that interactive performance has any human-detectable
relation to the CPU switching from 1.5 GHz to 3 GHz or whatever.

The I/O and RAM paging situation far dominates.

~~~
darkstalker
In my case the drop in interactive performance when using "ondemand" is highly
noticeable when gaming or in some apps like Firefox. It seems to be a known
problem [1] with GPU-intensive tasks.

[1] <http://jasondclinton.livejournal.com/72910.html>

------
Jach
There's an insightful comment by GreenCat in the SetProcessAffinityMask
documentation about how doing this in Windows is likely to hurt your
performance: [http://msdn.microsoft.com/en-
us/library/windows/desktop/ms68...](http://msdn.microsoft.com/en-
us/library/windows/desktop/ms686223%28v=vs.85%29.aspx)

In fact, this seems like a Prisoner's Dilemma problem with the OS having
programs cooperate by default. There's a chance you'll speed up your
application by locking it to a core (defecting), but only if none of the other
applications try the same thing.

~~~
Symmetry
Right. Setting the process affinity mask is a fine thing to do - if you're
making these decisions for a specific piece of hardware with a specific set of
programs running on it. It's madness for the OS or the person writing the
software to try to do it for you, though.

------
dpeck
Verify these results if you're planning to do this on a laptop. For
longer-running processes it's possible to get worse performance due to
throttling/heat-management techniques.

------
drudru11
Sigh.

SGI had this in Ultrix in 1991-1992ish (maybe earlier).

Why? Their market was graphics. If anything non-essential prevented the system
from rendering a frame, an attempt was made to run that on another processor.

I remember a RealityEngine^2 system that had four processors. One processor
was dedicated to graphics, another to networking. I don't recall what the
other two were for.

~~~
reeses
SGI = IRIX, DEC = Ultrix

What you're thinking of is different from task/processor/thread affinity.

Regarding SGI, I'm guessing you're thinking about the Onyx, which would have
put it about 1993+. The R4x00s that were in the machine were not powerful
enough to drive the RE2 and IR boards on a shared basis. The scheduler on many
of these machines (Ultrix and IRIX both, along with Unicos, etc.) all pretty
much sucked at the time, with AIX being a notable exception in some
environments, especially running under VM.

You still see dedicated task processors with the z-series and maybe i as well
today.

------
b0b0b0b
Try this with perf stat on cpu 0, on cpu non-0, and without taskset at all.

If you have 2-threaded code, it's also fun to try taskset onto one core with
hyperthreading.
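A quick way to run that comparison without leaving Python; a sketch, assuming Linux (the actual numbers vary wildly by machine and governor, which is the whole point of the experiment):

```python
import os
import time

def spin(n=2_000_000):
    # simple CPU-bound loop to time
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

free = timed(spin)                     # scheduler free to migrate us

saved = os.sched_getaffinity(0)
os.sched_setaffinity(0, {min(saved)})  # pinned to one core, like taskset -c
pinned = timed(spin)
os.sched_setaffinity(0, saved)         # restore the original mask

print(f"unpinned: {free:.3f}s  pinned: {pinned:.3f}s")
```

The command-line equivalent is `perf stat taskset -c 0 ./prog` versus plain `perf stat ./prog`, which also gives you cache-miss and migration counters.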

------
qwerta
Interesting find. I wonder if this applies to Java programs as well. Better
test it for MapDB.

~~~
samuel
Yes, it does. Read this: [http://mailinator.blogspot.com.es/2010/02/how-i-
sped-up-my-s...](http://mailinator.blogspot.com.es/2010/02/how-i-sped-up-my-
server-by-factor-of-6.html), which is a deeper take on the subject.

------
brownegg
These are standard "tricks" in the high-frequency / low-latency trading space.

