
Google Finds NUMA Up to 20% Slower for Gmail and Websearch - streeter
http://highscalability.com/blog/2013/5/30/google-finds-numa-up-to-20-slower-for-gmail-and-websearch.html
======
scott_s
One quibble: the author at High Scalability refers to the authors of the paper
collectively as "Google," but the lead authors, Lingjia Tang and Jason Mars,
are professors at UC San Diego. Of course, they must have collaborated with
Google and they may have done the work while doing summer internships in 2011
(CVs are at <http://www.lingjia.org/> and <http://jasonmars.org/>).

~~~
toddh
Thanks for pointing out the error, Scott. It should be fixed now.

------
mtdewcmu
I'm having a little trouble making sense of this:

"For example, bigtable benefits from cache sharing and would prefer 100%
remote accesses to 50% remote. Search-frontend prefers spreading the threads
to multiple caches to reduce cache contention and thus also prefers 100%
remote accesses to 50% remote."

Let me see if I've got this straight:

* bigtable benefits from scheduling related threads on the same cpu so they can share a cache, I'm guessing because multiple threads work on the same data simultaneously

* search benefits from having its threads spread over many cpus, probably because the threads are unrelated to each other and not sharing data, so they like to have their own caches

I'm not sure I understand how this relates to NUMA, or why remote accesses are
ever a good thing. Maybe it requires a more sophisticated understanding of
computer architecture than what I have.

~~~
scott_s
It's not that remote accesses are _good_, it's that trying to induce them can
harm cache usage elsewhere. If the author at High Scalability will allow me
another quibble, I'd say that actually, memory locality _is_ still King. It's
just that we have to be very careful about trying to improve it; if you try to
improve locality in one place (say, induce local accesses from a socket to
main memory), you may end up harming it somewhere else (a larger total number
of accesses to main memory because the cache is now thrashing).

The NUMA bit comes in when you said "scheduling related threads on the same
cpu" and "threads spread over many cpus". If you schedule related threads on
the same socket (cpu), you're more likely to get local accesses. If your
threads share data, then that's two good things: local memory accesses, and
good cache usage. But if your threads use different data, then the fact that
you have local memory accesses may not matter because you may have many more
cache misses because the threads are interfering with each other.

A simpler way to think about it: shorter access to main memory does not help
you if you end up doing many more total accesses.
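
To put rough numbers on that, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the `stall_time` helper, the 100 ns and 150 ns latencies, and the miss counts are made up and do not come from the paper. It only models "total stall time = number of misses times average miss latency".

```python
# Toy model: total memory stall time = cache misses x average miss latency.
# All latencies and miss counts are hypothetical values for illustration.

def stall_time(misses, remote_fraction, local_ns=100.0, remote_ns=150.0):
    """Total time spent waiting on main memory, in nanoseconds."""
    avg_latency = (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns
    return misses * avg_latency

# Threads packed on one socket: every access is local, but the threads
# fight over one cache, so we assume twice as many misses.
clustered = stall_time(misses=2_000_000, remote_fraction=0.0)

# Threads spread over two sockets: half the accesses are remote,
# but each thread has more cache, so we assume fewer misses.
spread = stall_time(misses=1_000_000, remote_fraction=0.5)

print(f"clustered: {clustered / 1e6:.0f} ms, spread: {spread / 1e6:.0f} ms")
# prints "clustered: 200 ms, spread: 125 ms"
```

With these made-up inputs the all-local placement loses, because the cheaper accesses are outnumbered by the extra misses. Flip the miss counts and the conclusion flips too, which is the whole point.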

~~~
b0b0b0b
Do the bigtable performance characteristics look kind of like cache line
ping-ponging? My intuition for scenario 3 outperforming scenario 2 (100% remote vs
50% local + 50% remote) is that there are more mutations of data and therefore
more interconnect traffic is required to maintain coherency across sockets.

------
rpearl
Given Google's ability to obtain processors before they are available to the
public, and given that this paper refers to AMD's Barcelona processors, the
results published here are probably approximately seven years out-of-date, and
it's not clear whether they're still relevant now.

------
mckilljoy
I like reading these analyses, although I'm afraid headlines like this
oversimplify things and give the wrong impression. There isn't anything
inherently wrong with NUMA; it just isn't useful in this situation.

No technology is a 'silver bullet'. Every workload has a different set of
considerations that require a different set of technologies to optimize.

~~~
mtdewcmu
The way it's portrayed is extremely misleading. The headline misses the point
-- actually, the article didn't really have a point. It sounds like they
didn't get the results they wanted from the project, but tried to make the
best of it by highlighting what they did get, which is a jumble of facts that
are incoherent and self-contradictory. It's sort of interesting to read,
because they did honest research, asked good questions and followed the data,
and there is plenty of value in negative results.

The way I read the outcome, NUMA seems to do what it's supposed to. The
premise was that remote memory accesses are a performance killer, and forcing
threads onto fewer cpus should be a big win. But NUMA came out looking pretty
good. Leaving it alone looks like an excellent policy. Consider that Google
brought in a team of experts for the sole purpose of figuring out how to beat
the default behavior of NUMA.

------
chad_walters
The title is not just misleading -- it is just plain wrong.

NUMA was 15% better for Gmail and 20% better for the Web search frontends, as
indicated by the reductions (improvements) in CPI for these workloads.

There were some workloads where NUMA did degrade performance, such as BigTable
accesses (12% regression).
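
For anyone unsure about the sign convention: CPI is cycles per instruction, so a lower number is better. A minimal sketch of the arithmetic, where the helper name `cpi_change_percent` and the CPI values are hypothetical and chosen only to show the direction of the change:

```python
def cpi_change_percent(baseline_cpi, new_cpi):
    """Percentage change in cycles per instruction; negative means faster."""
    return (new_cpi - baseline_cpi) / baseline_cpi * 100.0

# Hypothetical CPI values, not measurements from the paper.
improvement = cpi_change_percent(2.0, 1.6)  # CPI dropped, so this is negative
regression = cpi_change_percent(2.0, 2.5)   # CPI rose, so this is positive

print(f"{improvement:.1f}% {regression:.1f}%")
```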

------
lallysingh
Specifically: "in multicore multisocket machines, there is often a tradeoff
between optimizing NUMA performance by clustering threads close to the memory
nodes to increase the amount of local accesses and optimizing for cache
performance by spreading threads to reduce the cache contention"

I.e., the performance benefit from socket-local memory accesses may not be
worth having all the threads using that memory on that socket's CPUs, because
they'll each get too small a share of the cache.
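
The two placement policies in that tradeoff can be sketched in a few lines of Python. This is only an illustration under assumptions: the 2-socket, 8-core `TOPOLOGY` and the `place` and `threads_per_socket` helpers are hypothetical, and nothing here reflects how the paper's authors actually assigned threads.

```python
from collections import Counter

# Hypothetical 2-socket machine: socket id -> core ids. Not from the paper.
TOPOLOGY = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def place(n_threads, topology, policy):
    """Return a list of (socket, core) assignments, one per thread."""
    sockets = sorted(topology)
    if policy == "cluster":
        # Fill one socket completely before using the next one.
        slots = [(s, c) for s in sockets for c in topology[s]]
    elif policy == "spread":
        # Alternate between sockets so threads are split across caches.
        width = max(len(cores) for cores in topology.values())
        slots = [(s, topology[s][i]) for i in range(width)
                 for s in sockets if i < len(topology[s])]
    else:
        raise ValueError(f"unknown policy: {policy}")
    if n_threads > len(slots):
        raise ValueError("more threads than cores")
    return slots[:n_threads]

def threads_per_socket(placement):
    """Count how many threads landed on each socket."""
    return dict(Counter(socket for socket, _ in placement))

print(threads_per_socket(place(4, TOPOLOGY, "cluster")))  # {0: 4}
print(threads_per_socket(place(4, TOPOLOGY, "spread")))   # {0: 2, 1: 2}
```

With "cluster", four threads share one socket's cache and its local memory; with "spread", each socket's cache is shared by only two threads, at the price of some remote accesses. On Linux, a core chosen this way could then be applied with `os.sched_setaffinity(0, {core})`, which restricts the calling thread to that CPU set.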

------
hollerith
Up to 20% slower than what?

(Than SMP systems, I guess, but the OP does not say.)

~~~
rys
Than keeping memory accesses chip local (I guess via thread pinning). The
comparison was done on the same hardware platform.

~~~
CJefferson
I am not surprised; I have seen slowdowns of 40%. NUMA leads to an annoying
bimodal timing, where some runs are fast and others slow.

