
Challenges of Memory Management on Modern NUMA System - signa11
https://queue.acm.org/detail.cfm?id=2852078
======
Moral_
My master's thesis is building a super-low-latency, lockless IPC mechanism
for passing cache lines between cores. NUMA systems are crazy for numerous
reasons. For instance, requesting remote cache lines across sockets has
asymmetric latencies. On the r820 server we use for testing, requesting a
cache line from socket 1 to socket 0 (socket 1 receives the cache line from
a core on socket 0) takes 528 cycles. But if you reverse the experiment and
socket 0 receives a cache line from socket 1, the latency is 384 cycles. So
when building an application, if you're into reducing cycles, things like
this creep up on you and stay very confusing until you run experiments like
the one above.
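
For the curious, here's a minimal sketch of this kind of cache-line
ping-pong in C (x86-64 Linux; the core numbers are placeholders, map them
to sockets on your own box with `lscpu` or `numactl --hardware`). It times
the whole round trip on one core, so it reports the average of the two
directions; teasing the directions apart takes per-leg timestamps, which is
exactly where measurement gets subtle.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define PING_CPU 0            /* assumed: a core on socket 0 */
#define PONG_CPU 8            /* assumed: a core on socket 1 */
#define ITERS    1000000ULL

/* The one cache line that gets bounced between the sockets. */
static _Alignas(64) _Atomic uint64_t ball;

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg) {
    (void)arg;
    pin_to_cpu(PONG_CPU);
    for (uint64_t i = 0; i < ITERS; i++) {
        /* Wait for the ping, then answer; each step forces the line
         * across the interconnect. */
        while (atomic_load_explicit(&ball, memory_order_acquire) != 2 * i + 1)
            ;
        atomic_store_explicit(&ball, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pin_to_cpu(PING_CPU);
    pthread_create(&t, NULL, ponger, NULL);

    unsigned aux;
    /* Both timestamps come from PING_CPU, so per-socket TSC offsets
     * can't skew the result. */
    uint64_t start = __rdtscp(&aux);
    for (uint64_t i = 0; i < ITERS; i++) {
        atomic_store_explicit(&ball, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&ball, memory_order_acquire) != 2 * i + 2)
            ;
    }
    uint64_t cycles = __rdtscp(&aux) - start;
    pthread_join(t, NULL);

    /* Each iteration is two transfers (ping + pong). */
    printf("~%llu cycles per one-way transfer (averaged)\n",
           (unsigned long long)(cycles / (2 * ITERS)));
    return 0;
}
```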

~~~
vardump
Are you sure you measured correctly, without systematic errors caused by
out-of-order execution? Did you serialize execution by using RDTSCP or
CPUID? Any chance you accidentally compared TSC counts between two separate
sockets? Did you keep in mind that each CPU socket has a different,
unsynchronized TSC?

Not accounting for those could give you an illusion of asymmetry.

~~~
Moral_
Yeah, when we got results we didn't expect, we made sure that the
experiment was set up correctly and actually made sense. We use the timing
instructions illustrated in this Intel white paper, which do serialize
execution:
[http://www.intel.com/content/www/us/en/embedded/training/ia-...](http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html).
Moreover, we tried a few different programs (CCbench, a custom home-brewed
one, and BenchIT), all giving the same results. There's a paper about the
asymmetry:

Thread and Memory Placement on NUMA Systems: Asymmetry Matters

[https://www.usenix.org/system/files/conference/atc15/atc15-p...](https://www.usenix.org/system/files/conference/atc15/atc15-paper-lepers.pdf)
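
The pattern from that white paper boils down to roughly this (a sketch; the
helper names are mine): CPUID serializes the pipeline before the opening
RDTSC, and RDTSCP plus a trailing CPUID fences the closing read.

```c
#include <stdint.h>

/* Opening read: CPUID drains in-flight instructions, then RDTSC. */
static inline uint64_t tsc_begin(void) {
    uint32_t lo, hi;
    __asm__ volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a"(lo), "=d"(hi)
                     : "a"(0)
                     : "rbx", "rcx");
    return ((uint64_t)hi << 32) | lo;
}

/* Closing read: RDTSCP waits for preceding instructions to finish;
 * the trailing CPUID keeps later code from creeping into the timed
 * window. */
static inline uint64_t tsc_end(void) {
    uint32_t lo, hi, eax;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi) : : "rcx");
    __asm__ volatile("cpuid"
                     : "=a"(eax)
                     : "a"(0)
                     : "rbx", "rcx", "rdx");
    (void)eax;
    return ((uint64_t)hi << 32) | lo;
}
```

Wrap the code under test with tsc_begin()/tsc_end() and keep the thread
pinned to one core, or the difference between the two reads is meaningless.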

------
vardump
NUMA systems are annoying for other reasons, too. RDTSC is typically
CPU-socket dependent: different CPU sockets have different counts. So even
with a constant TSC (time stamp counter), you can't compare TSC values
between different sockets. As a workaround, you also need to somehow record
the CPU socket information together with the timestamp, which is hard
unless you're in kernel context. Well, there's RDTSCP, but it also requires
the kernel to load appropriate values into the IA32_TSC_AUX MSR.
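
On Linux you can exploit exactly that: the kernel programs IA32_TSC_AUX on
each core with an encoding of the CPU and node numbers, so RDTSCP returns
the timestamp and where it was read in one instruction. A sketch (the bit
layout is the Linux encoding; other OSes may load the MSR differently):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    unsigned aux;
    uint64_t tsc = __rdtscp(&aux); /* timestamp + IA32_TSC_AUX together */
    unsigned cpu  = aux & 0xfff;   /* Linux: CPU number in bits 11:0    */
    unsigned node = aux >> 12;     /* Linux: NUMA node in bits 21:12    */
    printf("tsc=%llu read on cpu %u, node %u\n",
           (unsigned long long)tsc, cpu, node);
    return 0;
}
```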

When requesting high-precision timestamps, some operating systems fall back
to the APIC timer or HPET. The issue is that these are 2-3 orders of
magnitude slower to query than the TSC: the TSC takes a few nanoseconds to
read, while APIC/HPET can take a few microseconds, which can sometimes be
too much.
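
This is easy to check on your own machine: time a burst of clock_gettime()
calls and compare against the current clocksource in sysfs. A sketch
(Linux, x86-64):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

#define N 1000000

int main(void) {
    struct timespec ts;
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < N; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    uint64_t cycles = __rdtscp(&aux) - start;
    /* With a TSC-backed clocksource this is a fast vDSO call; after a
     * fallback to HPET/ACPI-PM it's orders of magnitude slower. See
     * /sys/devices/system/clocksource/clocksource0/current_clocksource. */
    printf("~%llu cycles per clock_gettime()\n",
           (unsigned long long)(cycles / N));
    return 0;
}
```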

Thus, precision timing and timing-sensitive code are somewhat tricky on
NUMA systems.

Another annoyance is how easy it is to saturate a NUMA interconnect with
synchronization traffic. Atomic ops (and by extension, mutexes, semaphores,
etc.) can easily bottleneck a NUMA system.
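
The usual way around that last point is to shard hot synchronization state
per core or per node and pay the cross-socket traffic only on aggregation.
A sketch of a sharded counter in C11 (the shard count and padding are
assumptions, not tuned numbers):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NSHARDS 64              /* assumed: at least one per core */

/* One counter per cache line, so shards never share a line. */
struct shard {
    _Atomic uint64_t n;
    char pad[64 - sizeof(_Atomic uint64_t)];
};

static struct shard counters[NSHARDS];

/* Hot path: each thread bumps only its own shard, so the line stays in
 * its local cache and never crosses the interconnect. */
void count_event(unsigned shard_id) {
    atomic_fetch_add_explicit(&counters[shard_id % NSHARDS].n, 1,
                              memory_order_relaxed);
}

/* Cold path: readers pay the cross-socket misses once, at read time. */
uint64_t count_total(void) {
    uint64_t sum = 0;
    for (int i = 0; i < NSHARDS; i++)
        sum += atomic_load_explicit(&counters[i].n, memory_order_relaxed);
    return sum;
}
```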

