
NUMA (Non-Uniform Memory Access): An Overview - llambda
http://queue.acm.org/detail.cfm?id=2513149
======
MichaelGG
Running a VoIP platform, we've "discovered" that instead of running one big
process (which spawns lots of threads), it's better to run one process per
core or per socket (we're still experimenting with which). We force affinity
so the scheduler can't bounce things around. We've gotten somewhere between
200% and 500% better efficiency from doing so, and we've also minimized
variance (that is, the one-process-per-server machines see huge CPU spikes,
whereas the affinity-locked systems don't).

The way I see it: you've got to figure out your scale-out or load-balancing
method anyway, so it might not hurt to pretend each NUMA node is a separate
server and just treat it like that.
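
A minimal sketch of that kind of pinning on Linux, for the curious - the core
number here is arbitrary, and a real setup would pick a different core (or
set of cores on one socket) for each spawned process:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   /* pin this process to core 2 (arbitrary choice) */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... spawn worker threads here; they inherit the affinity mask ... */
        return 0;
    }

taskset(1) or numactl(8) get you the same effect from the shell without code
changes.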

~~~
seanmcdirmid
That is not the point of NUMA at all. If you are just going to use RPC to
integrate a bunch of computers in a cluster, why would you want to bother with
the Non-Uniform Memory Access abstraction at all?

NUMA is kind of out of fashion though. For everything it claims to do, there
are better alternatives (MPI, map reduce, actors, ...).

~~~
kasey_junk
What in the world do actors & map/reduce have to do with memory locality?

~~~
seanmcdirmid
They involve alternative communication mechanisms that are not shared memory.

~~~
kasey_junk
No, they are abstractions built on top of shared memory. Actors are a good
example of where an understanding of NUMA can have serious performance
implications.

If you architect your actor system so that all of your actors are on the same
socket, all message passing (assuming the messages are small enough) can
happen in socket-local cache. That is much faster than reaching into another
socket's memory or, even worse, going out to main memory.

Conversely, if you aren't paying attention to memory locality, you can see
unexpected performance degradation as you add actors: even though your system
is more parallel, there is more overhead in memory access.
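
As a rough sketch of what that looks like in practice (Linux-specific; the
core list for socket 0 is made up - real code would read the topology from
libnuma or /sys/devices/system/cpu/):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Hypothetical core IDs for socket 0; query the real topology
       instead of hard-coding them. */
    static const int socket0_cores[] = {0, 1, 2, 3};

    /* Pin an actor-scheduler worker thread to one of socket 0's cores,
       so messages between actors on these workers stay in socket-local
       cache instead of crossing the interconnect. */
    static void pin_worker(pthread_t t, int idx) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(socket0_cores[idx % 4], &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }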

------
berkut
Excellent article - writing NUMA-aware code, so that a thread knows where its
memory lives relative to the socket it's running on, is very important for
high-performance code.

You can get serious speed gains just by reducing memory latency.

Similarly, writing cache-aware code (in terms of the cache hierarchy) and
locking down thread affinity can have serious benefits as well, especially on
OSs with rather poor schedulers like OS X and Windows, which tend to bounce
threads around a bit too much.
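
A minimal sketch of the memory-placement half of that with libnuma on Linux
(node 0 and the buffer size are placeholder choices):

    #include <numa.h>     /* link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        /* Keep this thread on node 0 and allocate its working set there,
           so its accesses stay off the cross-socket interconnect. */
        numa_run_on_node(0);
        size_t len = 64UL * 1024 * 1024;
        void *buf = numa_alloc_onnode(len, 0);
        if (buf == NULL)
            return 1;
        /* ... touch and work on buf from this thread ... */
        numa_free(buf, len);
        return 0;
    }

The alternative is to lean on the default first-touch policy: allocate with
plain malloc and fault the pages in from the thread that will actually use
them.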

~~~
fulafel
Multi-socket systems are uncommon these days. Tuning for NUMA can make a
difference for the subset of users who can exploit more parallelism than a
single socket provides but can't use a cluster (which would scale further
than a multi-socket system does).

Even then it only helps if the app spends most of its time waiting for cache
misses: it could reduce the runtime of a 60%-stalled app by 20%, using the
100 ns vs 150 ns latency figures in the article - and that assumes as a
starting point that the OS's heuristic memory placement fails as badly as
possible. Expect less than 20% in more realistic cases.
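
Spelling out that back-of-the-envelope estimate:

    best-case runtime saved = stall fraction * (1 - local/remote latency)
                            = 0.6 * (1 - 100/150)
                            = 0.6 * 1/3
                            = 0.2, i.e. 20%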

For the vast majority of apps, there are more serious perf gains to be had
elsewhere for the same or less effort.

~~~
berkut
What? They're not uncommon for servers or high-performance workstations -
dual- and quad-socket Xeons are very common.

I've seen close to 250% speed increases from NUMA-aware code (and tying down
thread affinity). Writing things so the code knows where the memory is, and
targets thread jobs at cores on the socket that already has that memory, can
significantly reduce memory traffic over the QPI links. If you don't do that
- with AVX instructions trying to do 8 floats per clock on IB - systems tend
to just be memory-starved, as they can't feed data to the processors fast
enough from main memory. Especially with Xeons, as the pre-fetchers are so
aggressive.
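
For the "knows where the memory is" part, a hedged sketch on Linux:
get_mempolicy can report which node backs a page, which a job scheduler could
then use to pick a core on that node (error handling and the topology lookup
are omitted here):

    #define _GNU_SOURCE
    #include <numaif.h>   /* get_mempolicy; link with -lnuma */

    /* Return the NUMA node backing the page that contains addr, or -1
       on failure. A job system could use this to dispatch work onto a
       core attached to that node. */
    static int node_of(void *addr) {
        int node = -1;
        if (get_mempolicy(&node, NULL, 0, addr,
                          MPOL_F_NODE | MPOL_F_ADDR) != 0)
            return -1;
        return node;
    }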

~~~
fulafel
You moved the goalposts - I said they're uncommon in general. Though even
most servers are single-socket these days. Most high-performance parallelized
code runs in game engines, media codecs, and things like that.

I can believe you can see speed increases like the ones you cite with
bandwidth-bound code, but I doubt this kind of optimization is good bang for
the buck for most perf-critical code.

~~~
berkut
What? Not if they're CPU-bound, they're not. Look at the Dell/HP
workstations/servers - the Z600/Z800 and Z620/Z820 ranges: dual-socket, all
of them. The cheaper configs might only come with one CPU, but they've still
got two sockets.

I'm talking about highly-parallelisable code like renderers/raytracers, fluid
simulations, and image processing. All that stuff needs huge amounts of
memory bandwidth, and it's generally parallelisable both across threads and
with SIMD via SSE/AVX, so you're literally up against the cache/memory
throughput limits of the system.

Game engines actually aren't _that_ parallel - even job-queue-based ones,
which are in theory very scalable, have dependencies, which means there's
generally a limit to the number of threads they can run at once. The examples
I gave above can generally scale linearly, given enough memory bandwidth.

~~~
fulafel
Most servers are doing pedestrian things like VMware, Exchange, web serving,
etc., and people buy server hardware for the RAS features. I'm saying the
apps and hardware you reference are uncommon.

Games were an example of perf-critical code that seldom benefits from NUMA
optimization.

------
antsam
There's gotta be a numa numa joke in there somewhere.

------
spiritplumber
[http://www.youtube.com/watch?v=MhuTaD-B4qs](http://www.youtube.com/watch?v=MhuTaD-B4qs)
Sounds good to me.

~~~
rbanffy
A videoclip? Seriously?

~~~
Dylan16807
I understand being confused/annoyed at spiritplumber linking a music video to
make a pun.

I don't understand being confused/annoyed at the fact that he linked a video.

~~~
lvh
In many places, "videoclip" implies music video :)

~~~
rbanffy
In any case, I still don't get the pun.

~~~
Dylan16807
The song is known as 'numa numa' by many people. It's not a very good joke.

