
C++11 threads, affinity and hyperthreading - nkurz
http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/
======
saurik
> So if you've wondered why hyper-threading is sometimes considered a trick
> played by CPU vendors, now you know. Since the two threads on a core share
> so much, they are not fully independent CPUs in the general sense.

> Caches isn't the only thing threads within a core share, however. They also
> share many of the core's execution resources, like the execution engine,
> system bus interface, instruction fetch and decode units, branch predictors
> and so on [3].

I would go even further than this... rather than starting from the idea "they
sound separate, but apparently share a lot of stuff", you are probably better
off working from the original goal of a hyperthread: sometimes the CPU blocks
on something slow, like memory or a dependent register, and has nothing it
can easily reorder to do in the meantime; the idea of hyperthreading was to
give the CPU something else to do when it got stuck. From that concept we
would never expect it to have anything at all that it doesn't share: you are
just filling in gaps in the execution of a single core by having the CPU
itself implement something akin to a coroutine task scheduler. The issue I
take with the mental approach in the article is that it still implies there
is at least some parallel computation possible that is merely being limited
by how much is shared, but the reality is that hyperthreading has more in
common with "green threading" than with real concurrency: there is really
only one thing there, pretending to be two things, entirely for the purpose
of tricking legacy software into giving it something else to execute.

~~~
Symmetry
It's a bit better than green threading, since out-of-order engines are
imperfect at extracting parallelism and a second thread can often fill in
spare functional units that the first thread isn't using. There's also the
danger that the extra cache pressure from two threads on a core will cause
enough thrashing that using SMT decreases performance on net. So it's more
complicated, but of course you're right that green threading is a much better
first approximation than multiple cores. I suspect you already knew that but
just didn't want to go into it.

One technique I've heard about is to schedule producer and consumer threads on
the same core so that there's zero data transfer overhead and no extra d-cache
pressure.

------
hellofunk
>This is fairly low-level thread control, but in this article I won't detour
into higher-level constructs like task-based parallelism, leaving this to some
future post.

That's fine, but just as an FYI to others: C++11 also provides some great
tools for creating futures and promises that significantly simplify the work
in multithreaded programming. In fact it closely resembles the conveniences I
came to enjoy working in Clojure, which is considered to have a great approach
to threading.

~~~
superfunc
Also, people should look at Intel's TBB
([https://www.threadingbuildingblocks.org/intel-tbb-tutorial](https://www.threadingbuildingblocks.org/intel-tbb-tutorial))
library; it has been really excellent in my experience using it at a
respectably large scale.

edit: formatting

~~~
bonzini
TBB is a little underwhelming in my opinion. Stuff like mutexes, condition
variables and even atomics are pretty much standard nowadays (and even if you
need portability, you can usually rely on glib or Qt or boost); thread-safe
collections are rarely the right solution, because too many fine-grained
locks incur excessive overhead. The work-stealing queue is nice, but if you
really want things to scale you need highly-parallel fast paths with message
passing on the slow paths, and this is where TBB falls a bit short of
expectations.

------
shin_lao
We've experimented a lot with thread affinity and our conclusion is that more
often than not playing with thread affinity brings no performance advantage
and is very problematic to work with on different platforms.

For example, on Linux the mask type is cpu_set_t while on BSD it is
cpuset_t; on OS X you have to use the function thread_policy_set and on
Windows SetThreadAffinityMask, each with its own semantics.

The other problem is once you play with affinity you have to take care of
_all_ the threads of your application because if you leave one thread roaming
on all cores your affinity approach is ruined.

Making sure that different steps of the same operation are processed in the
same logical thread makes a bigger performance difference than playing with
thread affinity.

Last but not least, the code in this article is incorrect: you must first
call sched_getaffinity to learn which cores your program is allowed to run
on, and only then call pthread_setaffinity_np.

~~~
pjmlp
It really made a difference back in Windows NT 3.51 and 4.0.

We used it to keep Apache threads pinned one per core on our servers back in
those days.

The scheduler still wasn't optimized for the .com-era loads, and the Apache
threads kept jumping between cores.

Apache also wasn't that well optimized for Windows at the time.

------
atomic77
It was worth reading this article just to discover the lstopo utility.

~~~
coherentpony
Also worth noting: lstopo -.txt (which draws the topology as ASCII art on the
console instead of opening a graphical window).

