One quibble: the author at High Scalability refers to the authors of the paper collectively as "Google," but the lead authors, Lingjia Tang and Jason Mars, are professors at UC San Diego. Of course, they must have collaborated with Google and they may have done the work while doing summer internships in 2011 (CVs are at http://www.lingjia.org/ and http://jasonmars.org/).
A bit of a frustrating read. What exactly is a "performance swing"? Does that mean performance increased or decreased? I find the writing does not match the precision required to cover a highly technical topic.
Given Google's ability to obtain processors before they are available to the public, and given that this paper refers to AMD's Barcelona processors, the results published here are probably about seven years out of date, and it's not clear whether they're still relevant.
I like reading these analyses, but I'm afraid headlines like this oversimplify things and give the wrong impression. There isn't anything inherently wrong with NUMA; it just isn't useful in this situation.
No technology is a 'silver bullet'. Every workload has a different set of considerations that call for a different set of technologies to optimize.
The way it's portrayed is extremely misleading. The headline misses the point -- actually, the article didn't really have a point. It sounds like they didn't get the results they wanted from the project, but tried to make the best of it by highlighting what they did get, which is a jumble of facts that are incoherent and self-contradictory. It's sort of interesting to read, because they did honest research, asked good questions and followed the data, and there is plenty of value in negative results.
The way I read the outcome, NUMA seems to do what it's supposed to. The premise was that remote memory accesses are a performance killer, and forcing threads onto fewer CPUs should be a big win. But NUMA came out looking pretty good. Leaving it alone looks like an excellent policy. Consider that Google brought in a team of experts for the sole purpose of figuring out how to beat the default behavior of NUMA.
"For example, bigtable benefits from cache sharing and would prefer 100% remote accesses to 50% remote. Search-frontend prefers spreading the threads to multiple caches to reduce cache contention and thus also prefers 100% remote accesses to 50% remote."
Let me see if I've got this straight:
* bigtable benefits from scheduling related threads on the same CPU so they can share a cache, I'm guessing because multiple threads work on the same data simultaneously
* search benefits from having its threads spread over many CPUs, probably because the threads are unrelated to each other and don't share data, so they like to have their own caches
I'm not sure I understand how this relates to NUMA, or why remote accesses are ever a good thing. Maybe it requires a more sophisticated understanding of computer architecture than what I have.
It's not that remote accesses are good; it's that trying to avoid them can harm cache usage elsewhere. If the author at High Scalability will allow me another quibble, I'd say that memory locality is actually still king. We just have to be very careful about how we try to improve it: if you improve locality in one place (say, by inducing local accesses from a socket to main memory), you may end up harming it somewhere else (more total accesses to main memory, because the cache is now thrashing).
The NUMA bit comes in where you said "scheduling related threads on the same cpu" and "threads spread over many cpus". If you schedule related threads on the same socket (CPU), you're more likely to get local accesses. If your threads share data, that's two good things: local memory accesses and good cache usage. But if your threads use different data, the local memory accesses may not matter, because the threads interfere with each other in the cache and you may take many more cache misses.
A simpler way to think about it: shorter access to main memory does not help you if you end up doing many more total accesses.
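To put rough numbers on that, here is a back-of-the-envelope sketch. It is a toy model, not anything from the paper: the function name, the latencies (10 ns cache hit, 100 ns local DRAM, 150 ns remote DRAM) and the miss rates are all made-up assumptions, chosen only to show how the arithmetic can go.

```python
def total_memory_time(accesses, miss_rate, remote_fraction,
                      cache_ns=10.0, local_ns=100.0, remote_ns=150.0):
    """Total time (ns) spent on memory accesses in a toy two-level model.

    All latencies are hypothetical defaults, not measured values.
    """
    hits = accesses * (1.0 - miss_rate)
    misses = accesses * miss_rate
    # Average cost of one trip to main memory, given the local/remote mix.
    dram_ns = (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns
    return hits * cache_ns + misses * dram_ns


# Hypothetical workload: 1M accesses by threads that do not share data.
# Clustered on one socket: every miss is local, but the threads fight
# over one cache, so assume a 20% miss rate.
clustered = total_memory_time(1_000_000, miss_rate=0.20, remote_fraction=0.0)

# Spread across sockets: every miss is remote, but each thread has more
# cache to itself, so assume a 5% miss rate.
spread = total_memory_time(1_000_000, miss_rate=0.05, remote_fraction=1.0)

print(f"clustered: {clustered / 1e6:.1f} ms, spread: {spread / 1e6:.1f} ms")
```

With these invented numbers the all-remote placement wins, because it makes a quarter as many trips to main memory. Pick different miss rates and the conclusion flips, which is really the point: it depends on the workload.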
Do the bigtable performance characteristics look kind of like cache line ping-ponging? My intuition for scenario 3 outperforming scenario 2 (100% remote vs 50% local + 50% remote) is that scenario 2 has more mutations of data shared across sockets and therefore needs more interconnect traffic to maintain coherency.
I'm not familiar with this research, but it's possible that sequential accesses to memory lead to prefetching, in which case going half-local, half-remote could actually be slower than all-remote. Another hypothesis is that the memory ends up having to be migrated from one CPU's cache to the other and then back. It's better if it's always in the remote cache than if it keeps getting flipped between the two.
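A toy way to see the "flipped between the two" effect, purely as an illustration of the hypothesis and not a model of any real coherence protocol: count how many times a single cache line has to change owner as different sockets write to it. The function name and the write sequences are invented for the example.

```python
def count_line_transfers(writer_sockets):
    """Count how often one cache line must move between sockets.

    writer_sockets is the ordered sequence of socket ids writing the line.
    Toy rule (an assumption): a write from a socket that does not currently
    own the line forces one transfer; the first write only sets the owner.
    """
    transfers = 0
    owner = None
    for socket in writer_sockets:
        if owner is not None and socket != owner:
            transfers += 1
        owner = socket
    return transfers


# All eight writes come from threads on one (remote) socket.
same_socket = count_line_transfers([1] * 8)

# The same eight writes, alternating between two sockets.
alternating = count_line_transfers([0, 1] * 4)

print(same_socket, alternating)
```

In this sketch the line never moves when all writers sit on one socket, and moves on nearly every write when they are split. Whether that is what actually happened in the bigtable measurements is a guess.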
I'm pretty sure it goes without saying that 100% local is always better, assuming you're not trading anything else away (like accessible CPU on other nodes).
Ah. So a remote access may be coming directly from the cache of a different CPU? That's something I didn't consider, and definitely adds another wrinkle.
I sense that the article is saying things in confusing ways, perhaps because that's the way computer architects speak (it always struck me as counterintuitive and confusing to measure a cache by its miss rate rather than its hit rate) or maybe it's this article.
Specifically: "in multicore multisocket machines, there is often a tradeoff between optimizing NUMA performance by clustering threads close to the memory nodes to increase the amount of local accesses and optimizing for cache performance by spreading threads to reduce the cache contention"
I.e. the performance benefit from socket-local memory accesses may not be worth putting all the threads that use that memory on that socket's CPUs, because they'll each get too small a share of the cache.
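The two placements in that tradeoff can be sketched like this. Everything here is hypothetical: the function names, the two-socket machine, the 2 MB shared cache per socket, and the assumption that the data lives on socket 0. It also assumes the threads split a socket's cache evenly, which real workloads won't do exactly.

```python
def place_threads(num_threads, sockets, policy):
    """Return {socket_id: thread_count} for a 'cluster' or 'spread' policy."""
    counts = {s: 0 for s in sockets}
    if policy == "cluster":
        # Put every thread on the first socket (assumed to hold the data).
        counts[sockets[0]] = num_threads
    elif policy == "spread":
        # Round-robin threads over all sockets.
        for i in range(num_threads):
            counts[sockets[i % len(sockets)]] += 1
    else:
        raise ValueError(f"unknown policy: {policy}")
    return counts


def cache_share_mb(counts, llc_mb_per_socket):
    """Per-thread share of the shared cache on each socket that has threads."""
    return {s: llc_mb_per_socket / n for s, n in counts.items() if n > 0}


sockets = [0, 1]  # hypothetical two-socket machine, data on socket 0
clustered = place_threads(4, sockets, "cluster")
spread = place_threads(4, sockets, "spread")

print("cluster:", clustered, cache_share_mb(clustered, 2.0))
print("spread: ", spread, cache_share_mb(spread, 2.0))
```

Clustering keeps all four threads local to the data but leaves each with a quarter of one cache; spreading doubles each thread's share at the cost of making the threads on socket 1 go remote. On Linux a placement like this could be applied by pinning threads with something like `os.sched_setaffinity`, though which CPU ids belong to which socket is machine-specific.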